What on earth is RAG?
Demystifying RAG: Enhancing LLMs with Real-Time Knowledge Retrieval
January 8, 2025

In a recent technical office hour, I gave a talk on Retrieval Augmented Generation (RAG). The highlight? A peer challenged me to explain RAG as if I were talking to a 10-year-old! Over the past year, I have had the privilege of building multiple RAG-based applications and taking them from proof of concept to production rollout. This blog post covers everything I learned along the way.
Before we dive in, let us establish some baseline understanding.
Baseline Understanding
What is a Vector Embedding?
An embedding (typically vector-based) is a way to represent objects like text, images, or audio as points in a continuous vector space. Each object is converted into a numerical format, or a vector, which allows machine learning models to process and compare them.
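To make this concrete, here is a minimal sketch of the idea. The three-dimensional vectors below are made-up illustrative values; real embedding models produce hundreds or thousands of dimensions learned from data. The point is that semantically related objects end up close together in the space:

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values for illustration;
# real models learn much higher-dimensional vectors from data).
embeddings = {
    "cat":   [0.90, 0.80, 0.10],
    "dog":   [0.85, 0.75, 0.15],
    "truck": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Angle-based similarity: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Related concepts sit closer together in the vector space.
assert cosine_similarity(embeddings["cat"], embeddings["dog"]) > \
       cosine_similarity(embeddings["cat"], embeddings["truck"])
```

This nearest-neighbor property is exactly what retrieval systems exploit: embed the query, then find the stored vectors closest to it.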
How does Large Language Model (LLM) Work?
Large language models work by repeatedly predicting the next token (roughly, the next word or word fragment) given everything that came before.
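The loop below sketches that idea. The probability table is a toy stand-in for illustration only; a real LLM computes these probabilities with a neural network over a vocabulary of tens of thousands of tokens:

```python
# Minimal sketch of next-token prediction. The lookup table is a
# hypothetical stand-in for a neural network's probability output.
def next_token(context):
    table = {
        ("the", "cat"): {"sat": 0.6, "ran": 0.3, "piano": 0.1},
        ("cat", "sat"): {"on": 0.8, "quickly": 0.2},
    }
    probs = table.get(tuple(context[-2:]), {})
    # Greedy decoding: always pick the most probable next token.
    return max(probs, key=probs.get) if probs else None

tokens = ["the", "cat"]
while (tok := next_token(tokens)) is not None:
    tokens.append(tok)

print(" ".join(tokens))  # the cat sat on
```

Generation is just this prediction step repeated: each chosen token is appended to the context and the model predicts again.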
This video from Spotify CTO Gustav Söderström is PURE GOLD, and it helped me tremendously when I first started on this journey a year ago. (Kudos to my director Eric Immermann.)
EXPLAIN RAG LIKE I AM 10
Okay! With that out of the way, let's imagine we have a Smart Robot friend in the library.
He is here to help answer any questions you have. He has read all the books in the library. Superb!!!
But Our Friend has Some Limitations
1. Our robot friend's knowledge is limited to the books in this library, and those books are all slightly out of date.
2. We spent millions of dollars and months of time calibrating our robot friend, and we don't have an easy way to recalibrate him.
3. Our robot friend is optimized to give you AN answer. Making you happy matters more to him than answering your question truthfully, so sometimes he makes things up.
4. Our robot friend has "no filter," so he may say something inappropriate to you…
5. When our robot friend makes something up, you won't be able to tell whether he is telling the truth or not.
Good news! We can Help Guide Our Smart Robot Friend (the LLM)
1. To help our friend, we can give the robot some new books we'd like him to use.
2. We can also give the robot specific instructions before he answers, like "make sure the answer is appropriate for my age, and make sure to only answer using the books I provided. If you cannot answer, simply say 'I do not have the information.'"
Now, Our Smart Robot Friend is even smarter and more effective in helping us. Hooray!
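Those two guardrails, provided material plus explicit instructions, translate directly into the prompt we send to an LLM. Here is a minimal sketch; the wording and the helper name are illustrative, not any specific vendor's API:

```python
# Sketch of the "guidance" above as a prompt template: retrieved passages
# plus explicit grounding instructions. All names here are illustrative.
def build_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the passages below. "
        "If they do not contain the answer, say "
        "'I do not have the information.'\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "When was the library founded?",
    ["The town library opened its doors in 1912."],
)
print(prompt)
```

The instruction to refuse when the passages are silent is what pushes the model away from making things up.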
RAG IS
To summarize, Retrieval Augmented Generation, or RAG, is
1/ a pattern (Lewis et al. 2020) that can improve the efficacy of large language model (LLM) applications by leveraging custom data.
2/ done by retrieving data/documents relevant to a question or task and providing them as context in the LLM's prompt to improve generation
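The two steps above can be sketched end to end. The keyword-overlap scoring and the `generate` stub are placeholders; a production system would use embedding-based vector search and a real LLM call:

```python
# End-to-end shape of the RAG pattern: retrieve relevant documents,
# then augment the prompt before generation. Scoring and generation
# here are toy stand-ins for vector search and an LLM.
def retrieve(query, docs, k=2):
    # Naive keyword-overlap score as a stand-in for embedding search.
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(prompt):
    # Placeholder for the LLM call.
    return f"[LLM answer grounded in prompt of {len(prompt)} chars]"

docs = [
    "RAG augments prompts with retrieved context.",
    "Bananas are rich in potassium.",
    "Retrieval quality drives RAG answer quality.",
]
context = retrieve("how does RAG use retrieved context", docs)
answer = generate("Context:\n" + "\n".join(context)
                  + "\nQ: how does RAG work?")
```

Note how the irrelevant banana document never reaches the prompt: retrieval quality is what keeps generation grounded.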
Sample Reference Architecture
Benefits of RAG
Up-to-date Knowledge and Tailored Responses
LLM responses are not based solely on static training data
The model uses up-to-date external data to provide responses
We are extending the model's learning in-context and providing domain-specific knowledge, so the answers are more specific and valuable
Reducing Inaccurate or Fabricated Responses
RAG attempts to mitigate the risk of producing hallucinations or fabricated information. Outputs can include citations of original sources, allowing human verification
Efficiency and Cost-Effectiveness
Offers an alternative to fine-tuning LLMs by enabling in-context learning, avoiding fine-tuning's up-front overhead
Itโs cost effective for use cases where the system need to be frequently updated with new data
My Adventure with RAG Implementations
#1: Chatbot
Key Learnings
โข With good retrieval system like Coveo | Lucidworks, you donโt need a frontier model (GPT 4o).
โข GPT 3.5 turbo was cost-effective and highly-performant. Nowadays, 4o-mini is more appropriate.
โข Streaming the response is better and makes the experience more fluid. Tricking the userโs perception on how fast the system responds.
#2: Client POC to Production Implementation
Business Use Case
Help Call Center Service Representatives and Field Service Representatives Become More Effective at Their Job When Answering Customer Calls
"One of the field service representatives had a case open for tow month. A very large truck was not operational after multiple attempts. The field agent got access to the RAG solution and in 15 minutes of searching got an answer that directed them to an old case with a similar issue that had a resolution. They tried the same fix from that old case and it fixed the truck!โ
- Client Executive
Key Learnings
• Contextual relevance is king
• Mobile support is important for field service agents
• Search platforms like Coveo's Question and Answering offering deliver amazing time-to-value
#3: Client POC Implementation
Business Use Case
Assist Call Center Service Representatives in the Fund Servicing Department to Become more Effective at Their Daily Operations
Key Learnings
โข Highly-regulated industries like insurance and financial services have REALLY high expectations in terms of answer accuracy and latency because wrong answers will have regulatory consequences (> 90% answer accuracy and <200 MS time-to-first response)
โข In certain situations, traditional ML and rule-based systems are more appropriate for the use cases than the modern LLM-based RAG | agentic systems.
โข Compound AI systems powered by platforms like Databbricks is needed for a RAG implementation to meet clientโs expectation of 90% accuracy and <200 MS latency requirement
โข Platform like Databricks Data Data Intelligence platform still lacks support for conditional filtering & boosting in its native offering and therefore suffer in relevance calculation during retrieval. Typically you only use vector-based search (dense retrieval). A search platform, on the other hand, excels in both sparse and dense retrieval along with contextual relevance tuning. Marrying the two platform together would deliver a production-grade RAG implementation that is future-proof and flexible for any use cases. (For instance, Coveo offers a Passage Retrieval API so you can leverage what Coveo does best (retrieval and relevance calculation and incorporate the retrieved passage into the remainder of the Compound AI system stack).
โข Evaluation is important. Best thing to do is to combine LLM as a judge automated testing with business SME testing.
โข Business SME testing are expensive and time-consuming but it is inevitable.
โข The key here for practitioners in this rapidly changing space is to continue to bring AI to the table and get their hands dirty so they can develop an intuition working with this new generation of tooling.
Wishing everyone a joyful, healthy, and prosperous New Year!
Contact [email protected] for RAG strategy and implementation advisory. Follow Steven (Jieli) W on LinkedIn for more complex AI concepts in layman's terms.
Originally posted at: