At this point, most people who have been following generative AI discussions have probably heard of RAG (retrieval-augmented generation). For those who aren’t familiar: briefly, it means you pull relevant data from somewhere outside the model and feed it to the LLM as context to get your final result.
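To make that concrete, here’s a minimal sketch of the pattern, assuming the openai>=1.0 Python SDK. The word-overlap retriever and the sample documents are toy stand-ins for a real vector store, and the model name is illustrative:

```python
# Minimal RAG sketch: retrieve relevant text, then feed it to the LLM
# alongside the question. The toy word-overlap retriever stands in for
# a real vector store; the API call uses the openai>=1.0 SDK.
from openai import OpenAI

client = OpenAI()

DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

def search_docs(query: str, k: int = 2) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def rag_answer(question: str) -> str:
    context = "\n".join(search_docs(question))
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative; any chat model works
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(rag_answer("What is the refund window?"))
```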
Recently there’s been some doubt about the future of RAG. Context windows are bigger now (gpt-4-turbo supports 128k tokens, Claude 100k). OpenAI is releasing its Assistants API, which manages RAG for you. Is there still value in building your own RAG pipeline for your AI app?
While larger context windows or managed RAG might work for low-hanging fruit like summarizing a long PDF or a single-threaded chat app, I still see a lot of value in custom RAG development.
AI apps have four distinguishing elements: proprietary data, exactness of response, cost, and speed. Most startups aren’t going to have proprietary data, which limits their advantages to exactness of response, cost, and speed. On practically all of these, larger context windows and managed RAG lose out.
Exactness of Response
Exactness of response is a moving target. Depending on who your user is, what time it is, and what they are doing, the definition of an exact answer may differ. That means there’s no one-size-fits-all generation technique. Unless the answer you’re looking for is a general summary of a very long document, a bigger context window is not going to give you a more exact response. Studies show that lazily packing irrelevant content into a large context window and relying on the LLM to sort it out performs worse than RAG.
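For illustration, here’s a sketch of the alternative to context stuffing: score chunks by embedding similarity and keep only those that clear a relevance bar, so irrelevant content never reaches the prompt. The embedding model name and the 0.8 threshold are assumptions to tune for your own data:

```python
# Relevance filtering sketch: rank chunks by cosine similarity to the
# query and keep only those above a threshold, so the prompt stays
# small and on-topic. Model name and threshold are assumptions.
import math
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def relevant_chunks(query: str, chunks: list[str], threshold: float = 0.8) -> list[str]:
    query_vec = embed([query])[0]
    chunk_vecs = embed(chunks)
    scored = [(cosine(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    # Keep only chunks that clear the relevance bar, best first.
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]
```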
Cost
It goes without saying that when you’re paying per token, sending bigger contexts is going to cost more than sending smaller ones.
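A back-of-envelope calculation makes the gap obvious. The price and request volume below are illustrative assumptions (the per-token price is gpt-4-turbo’s launch input pricing; check current rates):

```python
# Back-of-envelope monthly input-token cost at a hypothetical scale.
PRICE_PER_1K_INPUT = 0.01       # USD, gpt-4-turbo launch pricing; check current rates
REQUESTS_PER_MONTH = 1_000_000  # hypothetical request volume

def monthly_input_cost(tokens_per_request: int) -> float:
    return tokens_per_request / 1000 * PRICE_PER_1K_INPUT * REQUESTS_PER_MONTH

print(monthly_input_cost(100_000))  # stuffing ~100k tokens: $1,000,000/month
print(monthly_input_cost(2_000))    # retrieving ~2k relevant tokens: $20,000/month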
Speed
Sending more tokens also means it takes longer to get a response. Many studies show that, at scale, even tens of milliseconds of latency can influence retention and time spent. For startups that want to differentiate by serving faster responses, using fewer tokens is going to be one of the easiest ways to do it.
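This is easy to measure yourself. Here’s a quick sketch that times the first streamed token for a short prompt versus a padded long one; the exact numbers will vary by model and load, so treat it as a measurement harness rather than a benchmark:

```python
# Time-to-first-token comparison: stream the response and stop the
# clock when the first chunk arrives, i.e. once the prompt is processed.
import time
from openai import OpenAI

client = OpenAI()

def time_to_first_token(prompt: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _ in stream:  # first chunk ends the wait
        return time.perf_counter() - start

short_prompt = "Summarize: The cat sat on the mat."
long_prompt = "Summarize: " + "The cat sat on the mat. " * 4000  # ~30k tokens of padding
print(time_to_first_token(short_prompt), time_to_first_token(long_prompt))
```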
The Recursive Nature of Work
One of my main workflows with ChatGPT is breaking a large project into tasks, and then into smaller tasks, until the tasks are small enough to convert to code. What’s interesting is the recursive nature of this workflow. It seems to be not only how individuals operate but also how larger organizations operate. Vice presidents set high-level, abstract goals. Directors and senior managers turn these into more concrete objectives. Then managers and team leads break those into projects for teams of 3-5.
To me, this looks something like LLMs with large context windows figuring out the general strategy, LLMs with a more focused dataset putting together sub-strategies, and then progressively more focused LLMs breaking these into projects, tasks, and subtasks, until the tasks are finally converted into the final product.
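As a sketch of what that hierarchy could look like in code, here’s a hypothetical recursive planner that asks the model to split a goal into subtasks until they are atomic. The prompt, depth cap, and model name are all assumptions; in practice each level could use a smaller model with a more focused retrieval corpus:

```python
# Recursive decomposition sketch: ask the model to split a task into
# subtasks, recursing until it reports the task is atomic.
from openai import OpenAI

client = OpenAI()

def decompose(task: str) -> list[str]:
    # Hypothetical prompt: subtasks one per line, or ATOMIC to stop.
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Break this task into 2-5 subtasks, one per line, "
                f"or reply ATOMIC if it is already actionable:\n{task}"
            ),
        }],
    )
    text = resp.choices[0].message.content.strip()
    return [] if text == "ATOMIC" else [l.strip() for l in text.splitlines() if l.strip()]

def plan(task: str, depth: int = 0, max_depth: int = 3) -> None:
    print("  " * depth + task)
    if depth < max_depth:  # cap recursion; LLM output is non-deterministic
        for subtask in decompose(task):
            plan(subtask, depth + 1, max_depth)

plan("Launch a customer-facing RAG chatbot")
```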
While fully automated AI agents are appealing in a technical and magical sense, I highly doubt we can build such sophisticated machinery out of non-deterministic LLMs. For the time being, I still see the best AI products being those that keep a human in the loop to course-correct and guide the AI.