How to host your own ChatGPT-like model?

Want to run a model similar to ChatGPT on your own infrastructure?

Thanks to the huge push toward open-source models, Mixtral is now one of the best models available for free. It is also efficient enough to run on relatively modest hardware.

Although Mixtral is not as powerful as GPT-3.5 or GPT-4, it is still capable enough for most generation tasks. I use Mixtral to classify products, label them, and generate descriptions.

Setting up

While you can probably get away with a decent-sized VPS or dedicated server, I suggest getting a GPU box. These can be expensive; however, companies like Hetzner offer GPU boxes for under $150 per month.

First things first, you will want to set up the NVIDIA graphics drivers and CUDA. These instructions are for Ubuntu 22.04 and may or may not work with other Ubuntu versions.

# Install the NVIDIA graphics driver
sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update -y
sudo apt-get install -y linux-headers-$(uname -r)
sudo ubuntu-drivers install --gpgpu

# Install the CUDA 12.3 toolkit from NVIDIA's Ubuntu 22.04 repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update -y
sudo apt-get -y install cuda-toolkit-12-3
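
Once the installation finishes (a reboot may be needed before the new driver loads), it is worth a quick sanity check that the GPU and CUDA toolkit are visible. Assuming the toolkit landed under "/usr/local/cuda-12.3", something like this should list your GPU and report CUDA 12.3:

nvidia-smi
/usr/local/cuda-12.3/bin/nvcc --version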

Install Ollama

Ollama is a powerful tool, written in Go, that can run large language models efficiently. I have tested various ways of running models, including "llama.cpp", the Hugging Face inference API, and various other tools. Ollama tends to perform best on lower-spec hardware with a GPU.

If you are stuck on a CPU, "llama.cpp" may work better, but I still managed to get Ollama working on a CPU just fine. I haven't done enough testing to conclude which is better for CPU-only machines; however, on a GPU box Ollama wins by a large margin.

To install Ollama:

curl https://ollama.ai/install.sh | sh

Now that you have Ollama installed, pull down Mixtral:

ollama pull mixtral:instruct

The above command will pull down the Mixtral model and configure it for you so that Ollama can run this model locally.
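
To confirm the download, you can list the models Ollama has pulled locally:

ollama list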

Running Mixtral

Now that you have successfully configured Mixtral with Ollama, running this model is as simple as:

ollama run mixtral:instruct

The command above will open a prompt shell, where you can chat with the model much like you would with ChatGPT.
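
An interactive session looks roughly like this (the model's answer is omitted here; type "/bye" to exit the shell):

>>> Tag this book as programming, cooking or fishing: Designing Data-Intensive Applications
...model output appears here...
>>> /bye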

This is great for local testing but not very useful for integrating with web applications or other external apps. Next, we will look at running Ollama as an API server to solve this very problem.

Running Ollama as an API Server

To run Ollama as an API server you can use "systemd". "systemd" is the Linux service manager, which lets you run and manage background services.

Here is an example systemd unit file, "mlapi.service":

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollamauser
Group=ollamauser
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:8000"

[Install]
WantedBy=default.target

In the above config, we run the process as "ollamauser", which is just an isolated system user I created for security purposes.

You can run Ollama as any available user on your server; however, I would avoid running the process as "root". Instead, create a new system user and keep the process as isolated as possible.
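
If you want to follow the same approach, a dedicated system user can be created along these lines ("ollamauser" matches the name used in the unit file above; the home directory matters because, by default, Ollama stores pulled models under the user's home):

sudo useradd --system --create-home --shell /usr/sbin/nologin ollamauser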

You will then need to place the config file in: “/etc/systemd/system/”

I called the file "mlapi.service". You can name it whatever you like; just be aware that when using the systemctl CLI, you need to reference the service by its file name.

To enable your service, first reload systemd so it picks up the new unit file, then enable it:

sudo systemctl daemon-reload
sudo systemctl enable mlapi.service

Now start your service as follows:

sudo systemctl start mlapi.service

To check that the service is up and running, you can use:

systemctl status mlapi.service
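
You can also hit the port directly with curl to make sure the API is listening; Ollama should reply with a short "Ollama is running" message on its root path:

curl http://127.0.0.1:8000/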

Now that you are all set up, you can make an API call to the service as follows:

import requests
import json

url = "http://127.0.0.1:8000/api/generate"

payload = json.dumps({
  "model": "mixtral:instruct",
  "stream": False,
  "prompt": "Designing Data-Intensive Applications By Martin Kleppmann",
  "system": "Tag this book as one of the following: programming,cooking,fishing,young adult. Return only the tag exactly as per the tag list with no extra spaces or characters."
})
headers = {
  'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)

Sure enough, the model returns "programming" as the tag. Ollama supports various options; you can get more detailed information about the available options here.

To get you started, here is a breakdown of the most common parameters:

  1. model (required) — Ollama can run multiple models from the same API, so we need to tell it which model to use.

  2. stream (optional) — Set this to "false" to return the model's whole response in a single JSON object. The default is to stream the response, in which case you receive a series of newline-delimited JSON objects, each containing a chunk of the generated text (see the streaming sketch after this list).

  3. prompt (required) — The actual chat prompt.

  4. system (optional) — Any context information you want to give the model before it processes your prompt.
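
To illustrate the default streaming behaviour mentioned above, here is a minimal sketch that consumes the streamed response and stitches the chunks back together (it assumes the same local endpoint used earlier):

import json
import requests

url = "http://127.0.0.1:8000/api/generate"
payload = {
  "model": "mixtral:instruct",
  "prompt": "Summarise Designing Data-Intensive Applications in one sentence."
}

# "stream" defaults to true, so the response arrives as newline-delimited JSON objects
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a piece of the generated text in "response";
        # the final chunk has "done" set to true.
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()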

Load balancing requests

While "systemd" should restart the service if Ollama crashes for some reason, it's advisable not to make direct API requests to the Ollama service.

Instead, I suggest putting a load balancer in front of it and balancing requests across multiple instances. You can easily achieve this using Nginx. Here is a load-balancer example:

upstream backend {
    # Each server below is a separate machine running Ollama on port 8000
    server 192.168.1.1:8000 weight=1;
    server 192.168.1.2:8000 weight=1;
    server 192.168.1.3:8000 weight=2;
    keepalive 200;
}

server {
   server_name ollamaapi.example.com;
   listen 443 ssl http2;
   ... ssl and other configs here

   location / {
        proxy_set_header Host ollamaapi.example.com;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Required for the upstream keepalive connections to be reused
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://backend;
    }
}

This is a basic example; you may need to adjust it to suit your environment, but it should give you a good starting point.
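
With the Nginx config in place, the same kind of request from earlier can be sent through the load balancer instead of straight to a single Ollama instance ("ollamaapi.example.com" is just the placeholder domain from the config above):

curl https://ollamaapi.example.com/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"model": "mixtral:instruct", "stream": false, "prompt": "Hello"}'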

Since I covered this topic in detail in one of my earlier articles, I won't go into too much detail on this config; you can learn more about how Nginx load balancing works here.

The essential parts here are the "upstream backend" block and the "proxy_pass" directive. With "upstream backend" we are simply creating a list of servers that requests can be routed to.

The “proxy_pass” directive just forwards the request to our “upstream backend” which then routes the request accordingly.

An alternative approach is to put a small FastAPI service in front of Ollama; it's fairly efficient and lightweight, and therefore won't add much overhead to requests.

You can use the "asyncio" library to create a lock so that only one request can access the backend model at any given point in time.

Here is a sketch; the "Prompt" model and the "prompt_llm" helper (which simply forwards the request to the local Ollama endpoint using httpx) are illustrative:

from asyncio import Lock
import httpx
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()
llm_lock = Lock()  # shared lock: only one request talks to the model at a time

class Prompt(BaseModel):
    prompt: str
    system: str = ""

async def prompt_llm(data: Prompt) -> dict:
    # Forward the request to the local Ollama instance (OLLAMA_HOST=0.0.0.0:8000)
    payload = {"model": "mixtral:instruct", "stream": False, "prompt": data.prompt, "system": data.system}
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post("http://127.0.0.1:8000/api/generate", json=payload)
    return response.json()

@app.post("/api/llm")
async def query_llm(data: Prompt, lock: Lock = Depends(lambda: llm_lock)):
    async with lock:
        return await prompt_llm(data)