As someone who works 90% of the time on web-related projects, data science and machine learning were not my core competencies or areas of interest for most of my career.
In the past 2 years, machine learning has become a vital component of my toolbox as a developer.
Initially, the landscape was wild! We dealt with Keras, PyTorch, TensorFlow, and many other tools. Luckily, since the advent of ChatGPT and Sentence Transformers, it is now easier than ever to build AI-related products.
If you only need to sprinkle a little AI magic into your project, you can probably get away with simple REST calls to the OpenAI endpoints. However, if you are building a more complex product and have little to no experience with deeper machine learning concepts such as RAG, LangChain, and so on, then this article will be an ideal primer for you.
🤦 We actually abuse the term "AI": machine learning is just one subset of AI. Nonetheless, to keep things simple, and since everyone is on the "AI" bandwagon these days, I will use the terms interchangeably. Sorry!
Models: Give up control
Machine learning is a different paradigm from conventional programming: there are few, if any, "IF", "THEN", "ELSE", or "DO WHILE" controls.
You will often use natural language and configuration-style programming to nudge models in the direction you want them to go. Ultimately though, you never have 100% control over what the model will do. In most cases you can predict the outcome with great certainty, but it's not the same as programming, where we write rules and control flows to precisely guide the user along the path we want to take them.
A model is largely a "black box" of neural networks. You can think of a neural network as a chain of tiny nodes (neurons). Together all the nodes in a neural network build up some pattern recognition ability that the model uses to analyze users' input and generate the appropriate response.
A neural network usually contains 3 types of layers:
The input layer: takes in the data as a whole. Neurons in this layer then represent various features of the input data. For example, given an image of a cat, some neurons will represent the fur, others the tail, others the whiskers, and so forth.
Hidden layers: features extracted by the input layer are analyzed further by these layers to determine weights and biases. Essentially, mathematical algorithms are applied to the data to find patterns, and these develop the model's reasoning and prediction abilities.
Output layer: this uses the "learnings" from the previous layers to build an appropriate response for the user. In the cat image example, we probably want to classify the type of animal, so this layer will return "Cat".
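To make the three layer types concrete, here's a minimal PyTorch sketch of a tiny classifier. The layer sizes (784, 128, 64, 10) are arbitrary assumptions purely for illustration, not a real cat classifier:

import torch
import torch.nn as nn

# A toy network with the three layer types described above.
model = nn.Sequential(
    nn.Linear(784, 128),  # input layer: 784 input features (e.g. flattened pixels)
    nn.ReLU(),
    nn.Linear(128, 64),   # hidden layer: learns intermediate patterns
    nn.ReLU(),
    nn.Linear(64, 10),    # output layer: scores for 10 classes (e.g. animal types)
)

# A fake "image" flattened into 784 numbers:
x = torch.rand(1, 784)
print(model(x).shape)  # torch.Size([1, 10])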
When you train a model, you pass the same data through the model over several iterations. One full cycle of the data through all layers is known as an "epoch". On each epoch, the model adjusts and optimizes its pattern recognition ability (too many epochs can be bad; read up on overfitting and loss functions).
Training over multiple epochs is vital to improving the model's accuracy. For example: if the first image is of a "sphynx" cat, the network is not yet aware of other cat species and will therefore make various assumptions about cats based on this one species.
This causes weird predictions, since a sphynx looks significantly different from a regular house cat. By the second epoch, the network has seen the 99 other cat species and thus has a holistic view of the different species and their features. Each subsequent epoch then re-analyzes the data and looks for ways to optimize the model's accuracy.
Usually when training a model, you also provide a small held-out sample dataset with accurate examples, which is used to check and fine-tune the model's accuracy during the training phase.
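Here's a minimal sketch tying these ideas together: a PyTorch training loop with epochs and a small held-out validation set. The data is random toy data standing in for a real dataset (in practice you would also batch the data):

import torch
import torch.nn as nn

# Hypothetical toy data: 100 training samples, 4 features, binary labels.
X_train, y_train = torch.rand(100, 4), torch.randint(0, 2, (100,))
X_val, y_val = torch.rand(20, 4), torch.randint(0, 2, (20,))

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(5):  # 5 epochs: 5 full passes over the data
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()   # adjust weights and biases
    optimizer.step()

    with torch.no_grad():  # measure accuracy on the held-out sample set
        val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean()
    print(f"epoch {epoch + 1}: loss={loss.item():.3f} val_acc={val_acc:.2f}")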
Tooling and frameworks to build models
In the Python world, PyTorch is the undisputed leader when it comes to building machine learning models. It's open-source and freely available, so just about anyone can implement their own models from scratch, or extend someone else's model.
Before you even build a model, you are going to need some kind of dataset. Usually a CSV or JSON file. You can build your own dataset from scratch using your own data, scrape data from somewhere, or use Kaggle.
💡 Kaggle is a community of "machine learners" where you can find various kinds of freely available datasets and even models to use for both commercial and non-commercial purposes.
Dealing with large datasets in regular Python lists or dictionaries can become inefficient. Thus, in addition to PyTorch, you will need to learn two libraries that assist with parsing and manipulating data (plus LangChain for LLM-specific tasks):
Pandas: allows you to load and parse your CSV/JSON training data and efficiently reshape it into whatever format the PyTorch model needs (see the sketch after this list). Furthermore, Pandas has a consistent API, so you can easily switch between data sources without major refactoring in your scripts.
NumPy: models usually work with vector embeddings of your text (numerical representations), which are arrays of floating point numbers. NumPy is used throughout the process for various numerical computations, and libraries such as sentence-transformers typically return results as NumPy arrays, so you will need to process or convert those results accordingly.
LangChain: a powerful LLM utility library that makes working with LLM APIs and models much easier. You can certainly use your LLM's REST API or Python library directly and get by fine without LangChain, but for larger applications, LangChain simplifies common tasks like building system prompts, agents, RAG retrievers, and so on.
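As a quick (hypothetical) illustration of how Pandas, NumPy, and PyTorch fit together, here's a sketch of loading a CSV and converting it into a tensor. The file name and column names are made up for illustration:

import pandas as pd
import numpy as np
import torch

# Load the training data with Pandas:
df = pd.read_csv("training_data.csv")

# Select feature columns and convert to a NumPy array of floats:
features = df[["height", "weight"]].to_numpy(dtype=np.float32)
labels = df["species_id"].to_numpy()

# Convert the NumPy arrays into PyTorch tensors for training:
X = torch.from_numpy(features)
y = torch.from_numpy(labels)
print(X.shape, y.shape)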
A wide variety of pre-trained models
You may be aware of Llama 3 and the OpenAI models, but in the machine learning world there are hundreds, if not thousands, of models available for all kinds of machine learning tasks. Even OpenAI has several different models: GPT-3.5 (gpt-3.5-turbo, which powers the free version of ChatGPT), GPT-4 (which powers ChatGPT Plus), Whisper (audio to text), DALL·E (image generation), and so forth.
Some of the most popular model types are:
Text embedding: converts text into vector embeddings. There is a leaderboard that ranks models in this class: https://huggingface.co/spaces/mteb/leaderboard
Image generation: DALL·E 3, Midjourney, Stable Diffusion, and so forth.
Text classification: models that can categorize various pieces of data: spam detection, product categorization, tagging, etc. Some example models: FastText, BERT, facebook/bart-large-mnli (see the sketch after this list).
Image classification: Given an image, these models can determine an appropriate label, tag, or caption. Examples: microsoft/resnet-50, OpenAI CLIP.
Object detection: Given an image, detect various entities in the image. Examples: facebook/detr-resnet-50, keremberke/yolov5m-garbage
Text to Speech (and vice-versa): microsoft/speecht5_tts, OpenAI Whisper, OpenAI TTS.
Large language models: general-purpose models that can do some or all of the tasks above. Examples: GPT-3.5, Mixtral, Llama 3, Gemini.
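To give a taste of how accessible these models are, here's a small zero-shot text classification sketch using facebook/bart-large-mnli via the Hugging Face transformers library. The input text and candidate labels are arbitrary examples:

from transformers import pipeline

# Zero-shot classification: no training needed, just supply candidate labels.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "This vacuum cleaner stopped working after two days.",
    candidate_labels=["complaint", "praise", "question"],
)
print(result["labels"][0])  # highest-scoring label, likely "complaint"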
Similar to "Kaggle" I mentioned earlier, when it comes to models, one of the best places to find pre-trained models is HuggingFace.
In addition to hosting models, they also maintain an open-source library called "sentence-transformers" that lets you easily use these models in your code in a consistent way.
Here's a simple example of how you can generate vector embeddings using the Huggingface "sentence-transformers" library:
from sentence_transformers import SentenceTransformer
sentences = ["Some sentence or phrase here"]
# Notice we pass in the model name:
# 'sentence-transformers/all-mpnet-base-v2'.
# You can easily swap this out for any other
# compatible model, e.g. 'sentence-transformers/all-MiniLM-L6-v2'.
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
# Will convert the sentences into vector embeddings.
embeddings = model.encode(sentences)
# Will print a Numpy array
print(embeddings)
What are vector embeddings?
I have mentioned this concept a few times, so let me clarify for those who are unfamiliar.
Under the hood, machine learning is essentially statistical algorithms at play: a ton of mathematical formulas are used to generate predictions and perform other machine learning tasks.
Now, as you can imagine, text and letters are not mathematical in nature; you can't perform calculations on them. Therefore, for the math algorithms to work efficiently, we need to convert words and letters into numbers.
Each word or phrase is converted into an array of floating point numbers (a vector), and a larger unit of text (e.g. a sentence or paragraph) can likewise be encoded as a single vector. These vectors are known as vector embeddings.
Vector embeddings capture contextual information based on how close words are to each other, how frequently they appear in similar sentences, and so on. This stores enough information for algorithms like cosine similarity, ANN, and KNN to calculate the semantic meaning of the text.
As you can imagine, one of the most useful applications of vector embeddings is semantic search. With semantic search, the algorithm can find relevant sentences and words even if the search term does not appear verbatim in the results (this differs from Solr or Elasticsearch, which essentially look for synonyms or fuzzy-matched keywords).
You can generate vector embeddings as follows:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(["Machine learning"])
print(embeddings)
The result will be a NumPy array (the actual vector embedding is much larger; I have shortened it below just to give you an example):
[[-2.14947201e-02 6.24050908e-02 -6.70606121e-02 1.17204664e-02
-2.23315824e-02 3.22391130e-02 -1.10149011e-02 2.08304841e-02
-9.21188202e-03 3.96048091e-02 1.33278236e-01 4.91726398e-02
-2.97797378e-02 6.26323521e-02 2.64899582e-02 -8.43470395e-02
8.12786724e-03 7.97691289e-03 -5.54570891e-02 6.59016613e-03
1.62357709e-03 7.52743008e-03 2.48725782e-03 4.35999036e-03]]
Depending on the model you use, the dimensions (the number of floats) will differ. In this case, the model generates 768 floating point numbers. It's not a hard rule, but for the most part, the larger the dimension, the better the accuracy.
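Building on this, here's a small sketch of the semantic search idea mentioned earlier: embed a few documents and a query, then rank the documents by cosine similarity. The documents and query are made-up examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

docs = ["How to reset my router", "Best pizza recipes", "Fixing slow Wi-Fi"]
doc_embeddings = model.encode(docs)

# Note: the query shares no keywords with the best match.
query_embedding = model.encode("my internet connection is very slow")
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

best = int(scores.argmax())
print(docs[best], float(scores[best]))  # likely "Fixing slow Wi-Fi"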
What is Retrieval-Augmented Generation (RAG)?
This will be one of the most common machine learning systems you will need to build for web applications.
When you ask ChatGPT a question, it refers back to the vast dataset it was trained on and generates a response based on that context. Sometimes it can hallucinate and give you incorrect information.
Other times, it may give you the correct information, but in your unique context that information may be irrelevant or too broad to be useful. For example, ask "What is the price of a MacBook Pro?" and the model responds with "From $1,299....".
This may be correct, but it's not very useful because it doesn't give me the price in my local currency or provide more detail about each model.
RAG systems aim to fix this problem. Essentially, you give the model access to a custom dataset, the pieces relevant to each query are retrieved and passed to the model, and the model scopes its responses to your data, allowing for better accuracy and better local context for your domain/business.
This is different from fine-tuning. With fine-tuning, you take a pre-trained model and extend its abilities by training it further on a custom dataset for your use case. The model generates a new trained state, so its knowledge is frozen at that state rather than being real-time.
With a RAG system, you are not re-training the model; you are providing context data in real-time, so you do not need to retrain the model every time the data changes.
The LLM still uses its original training data as the basis for its reasoning, but any generation or prediction tasks it performs through RAG will be scoped to your custom data, and your data will take precedence over its original dataset.
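To make the idea concrete, here's a minimal sketch of the two halves of RAG — retrieval via embeddings, then augmenting the prompt — without LangChain or a vector database. The documents and prices are invented for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Hypothetical custom dataset, e.g. your product catalogue:
docs = [
    "MacBook Pro 14-inch: R34,999 including VAT.",
    "MacBook Air 13-inch: R21,999 including VAT.",
    "iPad Pro 11-inch: R18,499 including VAT.",
]
doc_embeddings = model.encode(docs)

# Retrieval step: find the document most relevant to the question.
question = "What is the price of a MacBook Pro?"
scores = util.cos_sim(model.encode(question), doc_embeddings)[0]
context = docs[int(scores.argmax())]

# Augmented generation step: inject the retrieved context into the prompt.
prompt = (
    f"Answer using ONLY this context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)  # send this to your LLM of choice (OpenAI, Llama 3, etc.)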
If you want to learn how to build a RAG system, have a look at an earlier article I wrote here.
How to host your own LLM?
When you first get into machine learning, I would advise you not to tinker with open-source models like Llama 3 straight away; rather, focus on using the OpenAI models first and get familiar with their APIs.
OpenAI credits are fairly cheap and can get you good mileage; however, when you are building a large-scale application, you may want to run your own in-house model.
You have a few open-source options:
Llama 3 - Developed by Meta (Facebook).
Mixtral - Developed by Mistral AI, an independent company that's making waves in the ML world.
Phi-3 - Developed by Microsoft; very compact and efficient. This model can even run on mobile devices.
I wouldn't touch Google's Gemma model; I tried it early on and the results were poor. To be fair, though, that was within the first week of its launch.
Depending on your GPU resources, you may want to try Phi-3 first, then Mixtral, and finally Llama 3. Llama 3 should be the best-performing model, but it is a tad more resource-intensive, and you may not always need that kind of power.
To run your models, you can use Ollama. I also did a tutorial on setting this up here.
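Once Ollama is running, you can talk to it over its local REST API. Here's a minimal Python sketch, assuming the default port and that you have already pulled a model (e.g. via `ollama pull phi3`):

import requests

# Send a prompt to the locally running Ollama server.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3",
        "prompt": "Explain RAG in one sentence.",
        "stream": False,  # return the full response in one JSON payload
    },
)
print(resp.json()["response"])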
Conclusion
Hopefully, I have given you enough insight to get started on expanding your skillset into machine learning. Of course, machine learning has gone through many changes in recent years, and libraries/models are still constantly evolving.
I would advise you not to get caught up in all the hype; rather, pick up Pandas and NumPy and play around with Hugging Face models first.
Once you have a solid grasp of those, move on to LangChain and integrate with OpenAI to build a simple RAG system, followed by a simple agent.
Happy building!