MLOps Community

๐—ช๐—ต๐—ฎ๐˜ ๐—ผ๐—ป ๐—ฒ๐—ฎ๐—ฟ๐˜๐—ต ๐—ถ๐˜€ ๐—ฅ๐—”๐—š?

๐—ช๐—ต๐—ฎ๐˜ ๐—ผ๐—ป ๐—ฒ๐—ฎ๐—ฟ๐˜๐—ต ๐—ถ๐˜€ ๐—ฅ๐—”๐—š?
# RAG
# LLMs
# AI

Demystifying RAG: Enhancing LLMs with Real-Time Knowledge Retrieval

January 8, 2025
Steven Jieli Wu

In a recent technical office hour, I gave a talk on Retrieval Augmented Generation (RAG). The highlight? A peer challenged me to explain RAG as if I were talking to a 10-year-old! Over the past year, I had the privilege of building multiple RAG-based applications and taking them from proof of concept to production rollout. This blog post covers everything I have learned along the way.

Before we dive in, let us establish some baseline understanding.


Baseline Understanding

What is a Vector Embedding?

An embedding (typically vector-based) is a way to represent objects like text, images, or audio as points in a continuous vector space. Each object is converted into a numerical format, a vector, which machine learning models can process and compare.
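As a toy illustration (not a real embedding model), here is a bag-of-words sketch in Python: each text becomes a count vector over a small vocabulary, and cosine similarity measures how closely two vectors point in the same direction. The vocabulary and sentences are made up for illustration.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: represent text as a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means no overlap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["cat", "dog", "pet", "stock", "market"]
v1 = embed("the cat is a pet", vocab)
v2 = embed("the dog is a pet", vocab)
v3 = embed("the stock market fell", vocab)

print(cosine(v1, v2) > cosine(v1, v3))  # similar topics score higher
```

Real embedding models learn dense vectors from data rather than counting words, but the geometric intuition, similar meaning ends up nearby in the vector space, is the same.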




How Does a Large Language Model (LLM) Work?

Large language models work by predicting the next token, or word.
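A drastically simplified sketch of next-token prediction, using bigram counts instead of a neural network (the tiny corpus is made up for illustration):

```python
from collections import Counter, defaultdict

# Train a tiny bigram "model": count which token follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the token most frequently observed after `token`."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often in this corpus
```

An LLM does the same thing conceptually, predict the next token given everything so far, but with billions of learned parameters instead of a lookup table.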

This video from Spotify CTO Gustav Söderström is PURE GOLD, and it helped me tremendously when I first started on this journey a year ago (kudos to my director Eric Immermann).


EXPLAIN RAG LIKE I AM 10

Okay! With that out of the way, let's imagine we have a Smart Robot friend in the library.

He is here to help answer any questions you have. He has read all the books in the library. Superb!!!

But Our Friend Has Some Limitations

  1. Our robot friend's knowledge is limited to the books in this library, and those books are all slightly out of date.
  2. We spent millions of dollars and months of time to calibrate the robot friend. We don't have an easy way to recalibrate him.
  3. Our robot friend is optimized to give you AN answer. Making you happy matters more to him than answering your question truthfully, so sometimes he may make stuff up.
  4. Our robot friend has "no filter" in his mouth, so he may say something inappropriate to you…
  5. When our robot friend makes something up, you won't be able to tell whether he is telling the truth or not.

Good news! We can Help Guide Our Smart Robot Friend (the LLM)

1. To help our friend, we can give the robot some new books you'd like him to use.
2. We can also give the robot specific instructions before he answers, like "make sure the answer is appropriate for my age, and make sure to only answer using the books I provided. If you cannot answer, simply say 'I do not have the information.'"

Now, Our Smart Robot Friend is even smarter and more effective in helping us. Hooray!
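The two guardrails above, new books plus instructions, map directly onto retrieved context and a system prompt. A minimal sketch (the prompt wording and helper function are illustrative, not any specific API):

```python
# Hypothetical guardrail instructions prepended to every request (a sketch).
SYSTEM_PROMPT = (
    "Answer ONLY using the provided books. "
    "Keep the answer appropriate for a 10-year-old. "
    "If the books do not contain the answer, reply: 'I do not have the information.'"
)

def build_prompt(books, question):
    """Combine guardrails, retrieved books, and the user's question into one prompt."""
    context = "\n".join(f"- {b}" for b in books)
    return f"{SYSTEM_PROMPT}\n\nBooks:\n{context}\n\nQuestion: {question}"

print(build_prompt(["Whales are mammals."], "Are whales fish?"))
```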


RAG IS

To summarize, Retrieval Augmented Generation, or RAG, is

1/ a pattern (Lewis et al., 2020) that can improve the efficacy of large language model (LLM) applications by leveraging custom data.

2/ done by retrieving data/documents relevant to a question or task and providing them as context to augment the prompts to an LLM, improving generation.
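Putting the two steps together, a minimal RAG loop might look like this sketch, where retrieval is naive word overlap and `call_llm` is a hypothetical stand-in for any chat-completion API:

```python
# Minimal RAG sketch: retrieve relevant documents, then augment the prompt.
documents = [
    "RAG retrieves documents and adds them to the prompt as context.",
    "Fine-tuning updates model weights on new training data.",
    "Vector databases store embeddings for fast similarity search.",
]

def score(doc, query):
    """Naive relevance: count words shared between document and query."""
    return len(set(doc.lower().split()) & set(query.lower().split()))

def retrieve(query, k=2):
    """Return the k highest-scoring documents for the query."""
    return sorted(documents, key=lambda d: score(d, query), reverse=True)[:k]

def rag_answer(query, call_llm):
    """Retrieve, augment the prompt, then generate via the provided LLM callable."""
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

# Usage with a stub LLM that just echoes the start of the prompt:
print(rag_answer("How does RAG use documents?", call_llm=lambda p: p[:60]))
```

In production, `score`/`retrieve` would be a vector or hybrid search platform and `call_llm` a real model endpoint, but the retrieve-then-augment shape is the same.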


Sample Reference Architecture


Benefits of RAG

Up-to-date Knowledge and Tailored Responses

LLM responses are not based solely on static training data

The model uses up-to-date external data to provide responses

We are extending the learning in-context and providing domain-specific knowledge content. The answers are more specific and valuable.


Reducing Inaccurate or Fabricated Responses

RAG attempts to mitigate the risk of producing hallucinations or incorrect, fabricated information. Outputs can include citations of the original sources, allowing human verification.


Efficiency and Cost-Effectiveness

Offers an alternative to fine-tuning LLMs by enabling in-context learning, without the up-front overhead of retraining

It's cost-effective for use cases where the system needs to be frequently updated with new data

My Adventure with RAG Implementations

#1: Chatbot

Key Learnings

• With a good retrieval system like Coveo or Lucidworks, you don't need a frontier model (GPT-4o).

• GPT-3.5 Turbo was cost-effective and highly performant. Nowadays, 4o-mini is more appropriate.

• Streaming the response makes the experience more fluid, shaping the user's perception of how fast the system responds.
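A sketch of what streaming looks like on the application side, assuming the backend yields tokens incrementally (the generator below just simulates that; real APIs stream over SSE or websockets):

```python
import time

def stream_tokens(answer, delay=0.0):
    """Yield the answer token by token, like a streaming LLM response."""
    for token in answer.split():
        time.sleep(delay)  # stand-in for network / generation latency
        yield token + " "

# The user sees the first words almost immediately instead of waiting
# for the whole answer, which lowers perceived latency.
chunks = list(stream_tokens("Streaming makes responses feel faster"))
print("".join(chunks))
```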

#2: Client POC to Production Implementation

Business Use Case

Help Call Center Service Representatives and Field Service Representatives Become More Effective at Their Jobs When Answering Customer Calls

"One of the field service representatives had a case open for two months. A very large truck was not operational after multiple repair attempts. The field agent got access to the RAG solution, and within 15 minutes of searching got an answer that directed them to an old case with a similar issue that had a resolution. They tried the same fix from that old case and it fixed the truck!"
- Client Executive


Key Learnings

• Contextual relevance is king
• Mobile support is important for field service agents
• A search platform like Coveo's Question Answering offering has amazing time-to-value

#3: Client POC Implementation

Business Use Case

Assist Call Center Service Representatives in the Fund Servicing Department to Become More Effective at Their Daily Operations

Key Learnings

• Highly regulated industries like insurance and financial services have REALLY high expectations for answer accuracy and latency, because wrong answers can have regulatory consequences (>90% answer accuracy and <200 ms time to first response)

• In certain situations, traditional ML and rule-based systems are more appropriate for the use case than modern LLM-based RAG or agentic systems.

• Compound AI systems powered by platforms like Databricks are needed for a RAG implementation to meet the client's expectations of 90% accuracy and the <200 ms latency requirement

• A platform like the Databricks Data Intelligence Platform still lacks support for conditional filtering and boosting in its native offering, and therefore suffers in relevance calculation during retrieval; typically you only get vector-based search (dense retrieval). A search platform, on the other hand, excels in both sparse and dense retrieval along with contextual relevance tuning. Marrying the two platforms together delivers a production-grade RAG implementation that is future-proof and flexible for any use case. (For instance, Coveo offers a Passage Retrieval API, so you can leverage what Coveo does best, retrieval and relevance calculation, and incorporate the retrieved passages into the remainder of the compound AI system stack.)
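A toy sketch of blending sparse and dense retrieval scores. Both scorers below are stand-ins: keyword overlap in place of BM25, and character-bigram overlap in place of embedding cosine similarity; real systems would call a search engine and a vector index.

```python
def sparse_score(doc, query):
    """Keyword overlap, a stand-in for a BM25-style sparse score."""
    return len(set(doc.lower().split()) & set(query.lower().split()))

def dense_score(doc, query):
    """Character-bigram Jaccard overlap, a stand-in for embedding similarity."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(doc.lower()), bigrams(query.lower())
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_rank(docs, query, alpha=0.5):
    """Rank documents by a weighted blend of sparse and dense scores."""
    blended = lambda d: alpha * sparse_score(d, query) + (1 - alpha) * dense_score(d, query)
    return sorted(docs, key=blended, reverse=True)

docs = ["reset your password in settings", "truck engine troubleshooting guide"]
print(hybrid_rank(docs, "how to reset a password")[0])
```

The `alpha` weight is where contextual relevance tuning would live; a production system would also apply conditional filters and boosts before blending.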

• Evaluation is important. The best approach is to combine LLM-as-a-judge automated testing with business SME testing.

• Business SME testing is expensive and time-consuming, but it is unavoidable.
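One way to combine the two, sketched below: an automated judge grades every case, and only the failures go to SME review. The `judge` function here is a stub; in production it would be a model call that returns a pass/fail grade with a rationale.

```python
def judge(question, answer, reference):
    """Stub LLM judge: pass if the reference answer appears in the answer.
    A real judge would be an LLM call grading semantic correctness."""
    return reference.lower() in answer.lower()

def answer_accuracy(cases):
    """Fraction of cases the judge marks correct, plus the failures for SME review."""
    graded = [(c, judge(*c)) for c in cases]
    failures = [c for c, ok in graded if not ok]
    return sum(ok for _, ok in graded) / len(graded), failures

cases = [
    ("Capital of France?", "The capital is Paris.", "Paris"),
    ("Capital of Spain?", "I do not know.", "Madrid"),
]
accuracy, for_sme_review = answer_accuracy(cases)
print(accuracy, len(for_sme_review))  # 0.5 1
```

Routing only the judge's failures (and a sample of passes) to SMEs keeps the expensive human review focused where it matters.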

• The key for practitioners in this rapidly changing space is to keep bringing AI to the table and getting their hands dirty, so they can develop an intuition for working with this new generation of tooling.

Wishing everyone a joyful, healthy, and prosperous New Year!

Contact [email protected] for RAG strategy and implementation advisory. Follow Steven (Jieli) W on LinkedIn for more complex AI concepts explained in layman's terms.


Originally posted at:

