AI 🤖 LLM Crash & Burns 🔥

Why are LLMs not going to take your job anytime soon? I've been building AI-powered applications for some time now and have extensive experience with practical production use cases for most LLMs.

Here are some real-world challenges that LLMs will not overcome anytime soon.

Complex multi-turn conversations

LLMs cannot read for meaning and they miss basic instructions. Even the most advanced models, like GPT-4o and Claude Sonnet, sometimes struggle.

Here's a basic example:

Given the following context data, you must guide the customer through an order process. Ensure you follow all rules and the sequence in which questions are asked.

<context>
</context>

Rules:

1) Ask the customer if they prefer delivery or pickup.
2) If they say delivery, ask for their address.
3) Ask the customer for their vehicle model and make.
4) Match the relevant make and model to the <make> and <model> fields in your context data.

... More steps...

LLMs struggle with step-by-step instructions when you have a large context of, say, 5000+ tokens.

You often have to repeat or put special emphasis on instructions, and even after doing so, the model is not always consistent.

How to solve this issue:

  1. Fine-tuning. You can train models like GPT-3.5 Turbo with question-and-answer pairs to help the model better handle the dialog.

  2. Break down prompts over several steps (prompt chaining).

  3. Build a state machine. You can supplement the LLM with a state machine of some sort that keeps track of previous steps. Ask the model to return special markers like XML tags or JSON at key checkpoints, so that your code knows which step the conversation is on and can send a prompt tailored to that step.

  4. Keep prompts as concise as possible.
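To make the state machine and prompt chaining ideas concrete, here is a minimal sketch built around the order example above. Everything in it is an assumption for illustration: the `Step` names, the JSON keys, and the `advance` helper are hypothetical, and the actual LLM call is left out, so `model_reply` stands in for whatever text your model returns.

```python
import json
from enum import Enum, auto


class Step(Enum):
    ASK_FULFILMENT = auto()
    ASK_ADDRESS = auto()
    ASK_VEHICLE = auto()
    DONE = auto()


# One short prompt per step, instead of one long prompt holding every rule.
# The JSON shapes are hypothetical checkpoints we ask the model to return.
STEP_PROMPTS = {
    Step.ASK_FULFILMENT: 'Ask: delivery or pickup? Return {"fulfilment": "..."}',
    Step.ASK_ADDRESS: 'Ask for the delivery address. Return {"address": "..."}',
    Step.ASK_VEHICLE: 'Ask for vehicle make and model. Return {"make": "...", "model": "..."}',
}


def advance(step: Step, order: dict, model_reply: str) -> Step:
    """Parse the JSON checkpoint in the reply and decide the next step.

    If the checkpoint is missing or malformed, stay on the same step so the
    caller can re-send that step's prompt.
    """
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError:
        return step
    if not isinstance(data, dict):
        return step

    if step is Step.ASK_FULFILMENT:
        choice = data.get("fulfilment")
        if choice not in ("delivery", "pickup"):
            return step
        order["fulfilment"] = choice
        # Rule 2: only delivery customers are asked for an address.
        return Step.ASK_ADDRESS if choice == "delivery" else Step.ASK_VEHICLE

    if step is Step.ASK_ADDRESS:
        if not data.get("address"):
            return step
        order["address"] = data["address"]
        return Step.ASK_VEHICLE

    if step is Step.ASK_VEHICLE:
        if not (data.get("make") and data.get("model")):
            return step
        order["make"], order["model"] = data["make"], data["model"]
        return Step.DONE

    return step
```

The point of the design is that the sequence of questions lives in ordinary code, so the model only ever has to follow one small instruction at a time, and a bad reply simply repeats the current step.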

Cost of tokens

When you start building more complex applications, you may need to use fine-tuning or few-shot examples together with your RAG data and message history.

This will escalate costs quite quickly, especially if you are based outside the US. In my case, the exchange rate multiplies every dollar cost by 18. Nonetheless, it's still cheaper than running your own GPU servers.
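To see why the bill grows so fast, here is a rough sketch of the arithmetic. It assumes the fixed part of the prompt (instructions, few-shot examples, RAG data) and the growing message history are both re-sent on every turn. The function names are hypothetical, and the price in the usage note is a placeholder, not a real rate.

```python
def conversation_input_tokens(fixed_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens sent over a conversation when history is re-sent each turn.

    fixed_tokens: instructions + few-shot examples + RAG data sent on every turn.
    tokens_per_turn: tokens each completed turn adds to the message history.
    """
    total = 0
    history = 0
    for _ in range(turns):
        total += fixed_tokens + history  # this turn's input
        history += tokens_per_turn       # history grows for the next turn
    return total


def cost_in_local_currency(tokens: int, usd_per_million: float, exchange_rate: float) -> float:
    """Convert a token count to local currency at a given USD price per million tokens."""
    return tokens / 1_000_000 * usd_per_million * exchange_rate
```

For example, `conversation_input_tokens(3000, 200, 10)` gives 39,000 input tokens for a single ten-turn conversation. Feed that into `cost_in_local_currency` with your provider's real price and your own exchange rate to estimate what one conversation costs you.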

How to mitigate this issue: Split your prompts across different models. Use the cheaper models for basic tasks and the larger ones for more complex tasks.

You can use intent routing to determine which model to route a task to. With intent routing, you ask a model to categorize the user prompt. Based on that category, you route the request to the relevant model and data, and you may even be able to reduce the amount of RAG data used.
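A minimal sketch of intent routing could look like the following. The intent labels, the model names (`small-model`, `large-model`), and the `classify` callable are all hypothetical placeholders. In practice `classify` would be a call to a cheap model that returns one label.

```python
from typing import Callable

# Hypothetical routing table: which model handles each intent, and whether
# that intent needs RAG data at all.
ROUTES = {
    "greeting": {"model": "small-model", "use_rag": False},
    "order_status": {"model": "small-model", "use_rag": True},
    "complaint": {"model": "large-model", "use_rag": True},
}

# Unknown or unexpected labels fall back to the most capable setup.
FALLBACK = {"model": "large-model", "use_rag": True}


def route(user_prompt: str, classify: Callable[[str], str]) -> dict:
    """Classify the prompt, then look up which model and data to use."""
    label = classify(user_prompt).strip().lower()
    return ROUTES.get(label, FALLBACK)
```

Falling back to the larger model on an unrecognized label trades a little cost for safety: a misclassified request is answered expensively rather than badly.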

RAG data is not enough

A human can read a shoddy document in 5 minutes and be able to adapt to just about any scenario that arises. Ask them a question about data they encountered six weeks ago, and they most likely would have gained sufficient experience to handle your question now without having any data in front of them.

When using RAG, you must constantly pass the message history and the RAG data to the model at every turn (unless, of course, you fine-tune).

RAG works by analyzing the current prompt and then finding similar data in your vector store. But what happens on question 6, when you no longer have the RAG context data from question 1?

The model may then produce undesired results.

How to get around this issue:

  • Keep a RAG buffer, so that you merge the results from previous messages and clean up duplicates (libraries like LangChain sort of do this for you).

  • Feed your message history to a smarter model, and get it to rewrite and summarize the entire message history into an optimized prompt for the RAG similarity search.

  • Fine-tune or provide few-shot examples to the model. This way you may need less RAG data, and if your training data is varied enough, it should give the model sufficient knowledge to handle most scenarios that occur.
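The RAG buffer from the first point can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular library does it: the `RagBuffer` class is hypothetical, and it assumes each search result arrives as a `(chunk_id, text)` pair with a stable ID.

```python
from collections import OrderedDict
from typing import Iterable, Tuple


class RagBuffer:
    """Keeps retrieved chunks across turns, without duplicates, up to a cap."""

    def __init__(self, max_chunks: int = 8):
        self.max_chunks = max_chunks
        self._chunks: "OrderedDict[str, str]" = OrderedDict()

    def add(self, results: Iterable[Tuple[str, str]]) -> None:
        """Merge this turn's similarity-search results into the buffer."""
        for chunk_id, text in results:
            if chunk_id in self._chunks:
                # Already buffered: mark it as recently used instead of duplicating.
                self._chunks.move_to_end(chunk_id)
            else:
                self._chunks[chunk_id] = text
        # Evict the least recently retrieved chunks once over the cap.
        while len(self._chunks) > self.max_chunks:
            self._chunks.popitem(last=False)

    def context(self) -> str:
        """Return the buffered chunks as one block to place in the prompt."""
        return "\n\n".join(self._chunks.values())
```

Call `add` with each turn's search results and build the prompt from `context()`. The data retrieved for question 1 is then still available on question 6, and the cap keeps the token count from growing without limit.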
