Why use RAG with AI?

As we discussed briefly in this post, RAG, short for Retrieval-Augmented Generation, solves some problems inherent to foundational models.

The pain point 🌓

Training a Large Language Model is very costly and demands huge amounts of data and time. The problem is twofold: while the model is being trained, new information might be generated, making the model's knowledge obsolete; and the knowledge the model has after training cannot be referenced back to a source. For example, we currently know that Saturn has 274 moons. So we went ahead and trained our model to know that information.

model training took several weeks

After that, a Brazilian astronomer discovers one more moon of Saturn, raising the number to 275. But our model doesn't have that information, so every time we query it, it will tell us that Saturn has 274 moons, because it doesn't know the new information.

query: How many moons 🌕 does Saturn have?
model: Saturn has 274 moons
query: Where did you get that information from?
model: I was trained on it

What can we do now?

That's where RAG comes to help. We basically add a content store that can be easily updated, and we fetch information from it based on the user's query. That fresh information is appended to the same user query before it goes to the model. The benefit is having the option to update the store instead of retraining the AI. Now our prompt is enriched, and the model can give accurate references for where it took that information from.
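To make that concrete, here is a minimal sketch of the idea, assuming a plain Python dictionary as the content store and a naive keyword lookup as the retriever. The names (content_store, retrieve, build_prompt) and the stored text are illustrative, not a specific library's API.

```python
# Minimal sketch: an updatable content store plus prompt enrichment.
content_store = {
    "saturn_moons": "Saturn has 274 confirmed moons. (source: doc-42)",
}

def retrieve(query: str) -> str:
    """Naive keyword lookup: return every stored snippet that shares a word with the query."""
    words = query.lower().split()
    return "\n".join(
        text for text in content_store.values()
        if any(w in text.lower() for w in words)
    )

def build_prompt(query: str) -> str:
    """Enrich the user query with the retrieved context before sending it to the model."""
    context = retrieve(query)
    return f"Answer using only the context below and cite it.\n\nContext:\n{context}\n\nQuestion: {query}"

# When a new moon is discovered, we update the store instead of retraining the model:
content_store["saturn_moons"] = "Saturn has 275 confirmed moons. (source: doc-43)"
print(build_prompt("How many moons does Saturn have?"))
```

The key point is the last step: refreshing the store is a cheap write, while refreshing the model's weights would mean another training run.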

So what is RAG? 🤔

RAG is a framework for utilizing these external sources to optimize the output. Foundational models usually face problems when we are working on a domain-specific task. So we use a retriever to get more data from the knowledge base by using the encoded prompt. We then combine that data with the prompt and send it to the model, which will in turn generate a response.

The process looks like this: [diagram: RAG]

What is the encoder and how does it work? 🤾‍♀️

The encoder is the piece that embeds the prompt, transforming it into vectors. It consists of a neural network that splits text into tokens (tokenization) and converts each one into a dense vector called an embedding; those vectors are then averaged into a single vector that captures the overall meaning. The same encoder is used for both the documents that go into the store and the query we use for the retrieval part. With a store full of multidimensional vectors that came from embedded documents, and a query encoded into a vector, we measure how close the question is to each entry in the knowledge base using vector calculations like the dot product.
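Here is a toy sketch of that encode-and-compare step. The random token vectors stand in for a real trained encoder (which would be a neural network such as a sentence transformer), so the scores are meaningless, but the mechanics (tokenize, embed, average, then compare with a dot product) are the ones described above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64                     # embedding size, arbitrary for this toy example
token_vectors = {}           # toy stand-in for a learned token embedding table

def embed(text: str) -> np.ndarray:
    """Tokenize naively by whitespace, map tokens to dense vectors, average
    them into one vector, and normalize it."""
    tokens = text.lower().split()
    vecs = [token_vectors.setdefault(t, rng.normal(size=DIM)) for t in tokens]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)   # unit length, so dot product behaves like cosine similarity

documents = [
    "Saturn has 275 confirmed moons after the latest discovery",
    "The Eiffel Tower is located in Paris",
]
doc_vectors = np.stack([embed(d) for d in documents])   # the "store" of embedded documents

query_vector = embed("How many moons does Saturn have?")
scores = doc_vectors @ query_vector                     # dot product against every document

for doc, score in zip(documents, scores):
    print(f"{score:.3f}  {doc}")
```

The document that shares meaning (here, tokens) with the query gets the higher score and is the one the retriever would hand to the model.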

Usually the conversion rate from tokens to words is 100 tokens ≈ 75 words.

Why not rely on input only? 🧐

  • Dependency → The user needs to have the information to pass to the model → this is replaced by the data store
  • Capacity → The token capacity is limited → With RAG we retrieve only relevant information
  • Redundancy → If the model receives too much data, it can fall into a needle-in-a-haystack situation → With chunking we keep short, relevant data (see the sketch after this list)
  • Time → Longer prompts take more time to process → With less bloated information, processing takes less time
  • Cost → Longer prompts increase computational cost → Shorter, meaningful data means lower computational requirements and cost
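
As a rough illustration of the chunking mentioned above, here is a minimal sketch assuming fixed-size word windows with a small overlap; real systems often chunk by sentences, paragraphs, or token counts instead, and the parameters here are arbitrary.

```python
def chunk(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split a long document into short, overlapping pieces so the retriever
    can return only the relevant part instead of the whole document."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Each chunk is embedded and stored separately, so a query pulls back only the few chunks it actually needs, keeping the prompt short and cheap.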

Shortcomings

  • The embedding model used for the documents needs to be the same one used for the prompt
  • The model needs guardrails so it says when it doesn't know something
  • If the retriever is not good enough, we might miss prompts that the knowledge base could actually answer

Recap 🤓

RAG is a very useful technique for adding relevant information to foundational models. By using an external, updatable source called a document store, we can retrieve and add relevant information every time the user makes a query. By using encoders we shrink the computational costs and add only relevant, meaningful data to the user prompt.