MLOps Community

To RAG or Not to RAG?

Posted Aug 08, 2024 | Views 72
# RAG
# LLMs
# GenAI
# Vectara
Amr Awadallah
CEO @ Vectara

With over 25 years of experience in scalable systems, big data, and AI, I am passionate about changing our collective future for the better, one startup at a time. As the founder and CEO of Vectara, the trusted GenAI platform, I lead a team of super talented AI engineers building a trusted GenAI platform for business data, with the benefits of mitigating hallucinations, bias, and copyright infringement while protecting data privacy.

Before founding Vectara, I was the VP of Developer Relations for Google Cloud, where I helped developers and businesses leverage the power of cloud computing. I was also the founder and global CTO of Cloudera, a leading platform for big data analytics and machine learning, where we pioneered the use of open source technologies such as Hadoop and Spark. I have a PhD in electrical engineering from Stanford University. I enjoy bridging the gap between technology, product, and business, and I am always eager to learn and share my knowledge and insights with others.

SUMMARY

Retrieval-Augmented Generation (RAG) is a powerful technique to reduce hallucinations from Large Language Models (LLMs) in GenAI applications. However, large context windows (e.g. 1M tokens for Gemini 1.5 Pro) can be a potential alternative to the RAG approach. This talk contrasts both approaches and highlights when a large context window is a better option than RAG, and vice versa.

TRANSCRIPT

Slides: https://docs.google.com/presentation/d/1GIqvlPuImldj08Z8lnLYYjCf_gvZuPVnfLL_5cf2zyA/edit?usp=drive_link

Amr Awadallah [00:00:10]: Okay, quick intro about myself. For those that don't know me, Vectara is my current company. I'm the co-founder and CEO of Vectara. Before that I was founder and CTO of a company called Cloudera that some of you might know about. And I see some folks in the audience that worked for Cloudera. So a big round of applause to everybody from Cloudera being in the audience here. And before that, I had another company that I had sold to Yahoo back in 2000. I also have a PhD from Stanford University.

Amr Awadallah [00:00:45]: So that's roughly my profile. Today I am going to talk to you about RAG versus large context windows. But before I do that, I want to calibrate the level of knowledge in the audience so I can explain concepts at the right level. So I'm going to ask a bunch of questions and I would like to see a show of hands, and that will help me calibrate correctly. So first, can you please raise your hand if you used ChatGPT at least once in the last month? Okay, 100% of the room, I think, raised their hand. Can you please raise your hand if you know what LLM stands for? Excellent. About 80% of the room raised their hand. I'm going to assume the remaining 20% is just looking at their phones.

Amr Awadallah [00:01:31]: Hopefully everybody knows what an LLM is by now. Show of hands if you've heard about hallucination. About 70% of the room raised their hands. And I want to clarify, I meant hallucination in the context of LLMs, not the smells you smell outside here in San Francisco. And then, last question, how many of you are developers? You wrote some code in the last ten years, please raise your hand. So about 70%, 80%. So for the remaining 20% that are not very technical, I'm going to try to simplify the concepts. But some of the concepts are a little bit technical towards the end.

Amr Awadallah [00:02:11]: Now, I want to briefly share a message from our sponsor, which is my company, Vectara. So very briefly, what Vectara is: it's RAG in a box. So if you want retrieval-augmented generation end to end, where you just plug in data on one end and you issue prompts on the other end, and everything just works, including the quality signals that measure hallucination, copyright, bias, et cetera, et cetera, that's the solution that Vectara delivers. And why do we believe this solution is important? Not just Vectara, all of us. Why we believe RAG is a very, very essential key building block that we all need to have is this slide right here. We believe in five years, every single application that we use, whether that be an enterprise app or a consumer app, every single device that we use, will need a RAG system under it in the same way that we need databases under them today. They're still going to need databases. That's not going to go away, but they also are going to need RAG systems to imbue them with intelligence.

Amr Awadallah [00:03:19]: So let me give you a couple of quick examples. Imagine your car today. In our cars, sometimes we get this red icon that shows up in the dashboard of our cars, right? Meaning there's a problem. We have to open up the manual. The manual usually is 500 pages. I have no idea why they make the car manuals so big. And then you find the icon and you find out what's wrong with your car. In the future, and not far future, like a year from now, you'll be able to just ask your car.

Amr Awadallah [00:03:48]: Literally, you'll just ask your car what's wrong with you. And your car will read its own manual via RAG, will look at its own diagnostics, and then will tell you, hey, lazy human, please go change my oil. Or hey, lazy human, my back left tire is flat. You need to add more air to it, et cetera, et cetera. That's one example. Another example, imagine you are out with your family or your friends on a very beautiful beach, and there is an amazing sunset behind you. And you take a picture, which is the most amazing picture, and you go back to your hotel room and you're looking at the pictures. And that picture, which you thought was the most amazing picture, when you looked at it, you found out that one of your kids or one of your friends had their finger up their nose just to piss you off.

Amr Awadallah [00:04:38]: My son loves doing that to me. To go today and use photo editing software to fix that is very hard. Right? You have to look up YouTube tutorials: how to get the hand down, how to move the finger out, do you adjust the shadows, the lighting? Again, just a year from now, you'll be able to just tell the software, please remove so-and-so's finger from the nose, and the software will do that task for you. One last example from the enterprise. How many of you here use software like Workday? HR software like Workday. Raise your hand if you use Workday, please. About 30% of the room raises their hand.

Amr Awadallah [00:05:14]: So Workday is an amazing piece of software. It's really built for the HR professionals. They love it, but the employees, we suffer with it. It's a bit hard to use. Just to take a vacation in Workday you have to fill in multiple forms, and you have to remember what to put in the right place so you don't get in trouble. So today you have to read the documentation so you can do it correctly. In the future, again, not far away, you'll be able to just tell Workday, I want to go on vacation, and Workday will ask you back: when are you leaving, when are you coming back? Essentially it will walk you through all the questions to fill in that form for you, so you don't have to remember how to fill it.

Amr Awadallah [00:05:50]: And it will do it in the right way. This is why we are all excited about RAG. RAG is going to be everywhere. Every single application and device out there will need a system like this under it to help us use it in this more advanced way. In the same way, at the very beginning of computers, we had punch cards, which were very hard to use. And then we had keyboards that made it a bit easier to use. And then we had the mouse, and now I could navigate the menus instead of remembering the commands, which made it a bit easier. And then we had the touchscreens, and now we're at the final stage, where I can just tell the software what I want and it will do it for me.

Amr Awadallah [00:06:27]: That's why we are super excited. Hopefully this sank in very well with all of you. And if you're looking to add that kind of intelligence into your application, then please check out Vectara. So sorry for the brief message from the sponsor here; now to the real topic of this conversation today. One of the biggest problems that we have in leveraging large language models in businesses to take actions and give answers is the problem of hallucination. It's a very significant problem. And that's when the large language models make up responses that look and sound completely correct because of how good they are at English and other languages. But the facts behind them are completely wrong, and that can be very dangerous.

Amr Awadallah [00:07:16]: Imagine you're doing a legal contract or doing a medical diagnosis or an accounting statement, and that hallucination comes in, in the middle. You're going to lose your job, you're going to get fired. In fact, there are many stories of that, of lawyers that got disbarred from relying on these systems without paying attention to that issue. And it's on us, the technology builders, to solve that problem. It's not on the users. We keep pushing it back on the users. We keep telling them, oh, be careful, don't use our answers, they might not be right. Please read them first. No, it's time we fix that ourselves.

Amr Awadallah [00:07:48]: So why does hallucination happen? Why is it an intrinsic problem to large language models? This slide here explains that. In a nutshell, the reason is severe compression of data, severe compression of data beyond what we refer to as the lossless compression limit. So if you look at this slide, on the left side you have the training data coming in. I apologize. There's no laser pointer on this thing. I apologize. I was thinking that was the laser pointer. But anyway, you can see over there on the left side, you have the input data coming in to do the training.

Amr Awadallah [00:08:28]: The large language model in tends to be on the order of trillions of words. Okay, so you have trillions of words. We call them tokens, as you know, coming in for the training. And then we, we compress them into the weights and parameters of the model, that the model size tends to be on the order of billions of parameters. And there is research out there that shows a parameter can store roughly about two bits or three bits of information. So this means that we're compressing down the data to 0.1% of its original size for the average model. For the bigger models, like GPD four, it's being compressed down to 1% of its original size. And that's the same for the same reason that this is very severe compression is the reason why we are very impressed by how these models can still answer so many questions despite that severe compression.

Amr Awadallah [00:09:23]: There is a very famous theorem in computer science called Shannon's information theorem. You can go ask ChatGPT about it. And that theorem proved, using entropy, without dispute, that roughly 12.5% is the maximum you can compress the English language before you cross into lossy compression. So if we're going down all the way to 1%, then clearly we're way past the lossless compression threshold, and we will have loss. And that loss means that when the large language model is trying to decompress the information stored in its parameters, it will have to, every now and then, fill in the blanks, make up something to go there. And that filling in of the blanks, probabilistically, statistically, is what leads to hallucination. So, in a nutshell, hallucination is intrinsic to large language models. The only way to fix it is to create mega, mega, mega large language models that are very close to the size of the original data, at least today, until our researchers find a better way to do that.
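As a rough back-of-the-envelope check on those compression numbers, here is a minimal sketch using the orders of magnitude quoted in the talk (trillions of training tokens, billions of parameters, about two bits of information per parameter); the exact figures are illustrative assumptions, not any particular model's specs.

```python
# Back-of-the-envelope compression arithmetic using the rough orders of
# magnitude from the talk; none of these figures are exact model specs.

training_tokens = 5e12        # "trillions of words/tokens" of training data (assumption)
bytes_per_token = 4           # assume ~4 bytes of raw text per token on average
training_bytes = training_tokens * bytes_per_token

parameters = 70e9             # "billions of parameters" for an average large model (assumption)
bits_per_parameter = 2        # research estimate cited in the talk: ~2 bits per parameter
model_bytes = parameters * bits_per_parameter / 8

ratio = model_bytes / training_bytes
print(f"Model retains roughly {ratio:.2%} of the raw training data size")
# Prints ~0.09%, i.e. about the 0.1% figure from the talk -- far below the
# ~12.5% lossless-compression bound for English text, so information is lost
# and the model has to "fill in the blanks" at generation time.
```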

Amr Awadallah [00:10:30]: And that's where RAG comes in. So RAG, or retrieval-augmented generation, says, let's split this problem into two parts. Let's not have our model, which is on the right of the slide, be focused on both the understanding, the comprehension, and the memorization of the facts. Let's separate these. Let's have a model on the right focused on the comprehension, and a model on the left focused on the memorization. That's exactly what RAG is about. It's similar to when you take an open-book exam in high school.

Amr Awadallah [00:11:10]: That's exactly the same analogy here. In an open-book exam, we are telling you: you don't have to remember all the facts, because I don't trust your memory. Focus on understanding the concepts, and I'm going to give you the facts. I'm going to give you the algebra book, the biology book, the chemistry book, so you can look up that formula. Focus more on how you can translate that into an answer for the question. So that's a layman's description of what RAG represents. So, if you look at this picture very roughly, your data comes in on the upper left. Your data goes to an embedding model. In the case of Vectara, we have one of the top models in the world, called Boomerang, that converts your data from language space, English, French, German, Chinese, Japanese, Korean, et cetera, et cetera, into a meaning space instead.

Amr Awadallah [00:11:59]: That meaning space is represented by very long vectors, arrays of numbers. That's why our name is Vectara, by the way. It comes from that notion of a vector, and that's what encodes the meaning of stuff. Now, where do we store these vectors? As many of you heard the evolution over the last couple of years, we store them in vector databases that can retrieve these facts very quickly as a function of a question or a prompt from the user. So if you look at the lower left, when the question or prompt comes in, it also gets converted into its vector equivalent. And then the vector database very quickly does this vector matching at a very coarse level to find the 100 or 200 needles in the haystack, in all of your data, that are relevant to this task, question, prompt, or exercise that you're trying to solve. Now, that is very coarse-grained, because vector databases are optimized to run at very high speed. And I'll get back to that later.
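Here is a minimal sketch of that coarse retrieval step, assuming the open-source sentence-transformers library and an in-memory list as stand-ins for a production embedding model (such as Vectara's Boomerang) and a real vector database; the model name and the example chunks are illustrative assumptions only.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Small open embedding model used purely as a stand-in (assumption).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Index step: every chunk of your data is embedded into "meaning space" and stored.
# A real deployment would put these vectors in a vector database, not a Python list.
chunks = [
    "Change the engine oil every 10,000 km or 12 months.",
    "If the tire pressure light is on, inflate the tires to 35 psi.",
    "The infotainment system supports wireless phone mirroring.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

def coarse_retrieve(question: str, k: int = 2):
    # Query step: embed the question the same way and take the top-k nearest
    # chunks by cosine similarity -- the coarse "needles in the haystack".
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q          # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), chunks[i]) for i in top]

print(coarse_retrieve("Why is my dashboard warning light on?"))
```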

Amr Awadallah [00:12:57]: So you need to make it fine-grained as well. The way we make it fine-grained is we take these 100 or 200 results that come back, and then there is a very high-complexity, expensive-to-run re-ranking model, that is, order n-squared complexity to run. But the n is very small. In this case, the n now is only 200 records or so, so it can run very fast in real time, and it reads each one of these facts in detail and correlates them even more, with stronger affinity, back with the question or prompt of the end user. And then it produces the list of ranked results. As you can see in the middle lower part of the slide here, fact number one has this relevance score, fact number two has that relevance score, et cetera, et cetera. And then you package all of this information with the original question, and you give it to the large language model, and you say, dear large language model, please perform this task.

Amr Awadallah [00:13:54]: Please answer this question. Please write this legal document as a function of these facts. Don't make up facts; use these facts I'm giving to you, and then produce a response. All of us collectively in the industry thought that this would solve hallucinations, that if you use a system like this, then that response will be 100% accurate, because now I'm telling the large language model: here are the facts. But unfortunately, we also all discovered that even when you do that, even when you do open book and you give the model the facts, every now and then the model can still make up stuff from its own data and introduce it in the response.
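A minimal sketch of the re-ranking and grounded-prompt assembly just described, assuming sentence-transformers' CrossEncoder as a stand-in for the re-ranking model; `call_llm()` and the candidate list are hypothetical placeholders, not part of any specific product.

```python
from sentence_transformers import CrossEncoder

# Open re-ranking model used purely as a stand-in (assumption).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, candidates: list[str], top_n: int = 5):
    # The re-ranker is expensive per pair, but n is only the ~100-200
    # candidates returned by the vector database, so it stays fast.
    scores = reranker.predict([(question, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    return ranked[:top_n]

def build_grounded_prompt(question: str, ranked_facts) -> str:
    # Package the ranked facts with the original question and instruct the
    # generative model to answer only from those facts.
    facts = "\n".join(f"[{i + 1}] {fact}" for i, (_, fact) in enumerate(ranked_facts))
    return (
        "Answer the question using ONLY the facts below. "
        "Do not make up facts that are not listed.\n\n"
        f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical usage:
# ranked = rerank("Why is my tire pressure light on?", candidates_from_vector_db)
# answer = call_llm(build_grounded_prompt("Why is my tire pressure light on?", ranked))
```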

Amr Awadallah [00:14:35]: This is an article that was published by Cade Metz in the New York Times last November, in which we highlighted this issue to the market. The only reason I'm showing you this slide is I like to brag that my ear appeared in the print edition of the New York Times, which was a bucket list item for me. I wish they got my face, but they got my ear instead. Now, it came out of this table that we published on Hugging Face. We now continue to maintain a leaderboard on Hugging Face that ranks all of the models, all of them, in terms of how much they hallucinate within the RAG context. Within the RAG context, meaning when I tell the generative model, here are the facts, please do this based on these facts. So, as you can see here, GPT-4, which is the best in the world because of how big it is, still hallucinates 2.5% of the time on average. Sometimes it could be more, sometimes it could be less. On average, it will make up stuff 2.5% of the time.

Amr Awadallah [00:15:34]: Imagine having a doctor that on average makes up stuff 2.5% of the time. Or a legal consultant, or a CEO. Actually, CEOs make up stuff all the time anyway. But you get what I mean. That can be very, very, very dangerous. Now I want to highlight another interesting model here, which is the Intel model. First, you can see the Snowflake Arctic model is actually very good as well because of how big it is. But another very key model here is the Intel model, which actually is a small model.

Amr Awadallah [00:16:02]: It's only 7 billion parameters, but it has a hallucination rate of 2.8%. That model is cheating. I want to highlight that to you. The reason why it's cheating is because if you look at the answer rate column, the model is abstaining from answering about 10.5% of the questions, which is probably the area where the hallucination would have taken place. So be watchful. You have to always look at the hallucination rate and the answer rate together. So if we know that RAG is still going to make up stuff, then we cannot depend on these systems yet without having a human in the loop reviewing every answer. So how do we solve that problem? It's very simple.

Amr Awadallah [00:16:41]: The solution is you need to have real-time quality measurements for every response as it is being produced, and you need to do it with low enough latency that it can be done in real time. So you need to have a hallucination evaluation model, as you can see in the lower right of the slide, that takes the response that comes back from the generative model and correlates that response back with the facts that were retrieved from the fact store, and then issues you back a factual consistency score, as you can see in the lower right. That tells you: yes, this is a very good response, you can email this back to your customers, you can send this contract back out, you can send this email back out without reviewing it first; or no, this response is a bit off, you should have a human inspect this first, otherwise you can get in trouble. And that means everything when it comes to being able to do business with RAG systems.
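A minimal sketch of that gating logic, assuming a placeholder `consistency_score()` as a stand-in for a dedicated hallucination evaluation model (Vectara publishes an open-source one on Hugging Face, as mentioned later in the Q&A); the word-overlap heuristic and the 0.8 threshold are toy assumptions for illustration only.

```python
def consistency_score(response: str, retrieved_facts: list[str]) -> float:
    """Toy stand-in for a trained factual-consistency / hallucination
    evaluation model: the fraction of response words that also appear in the
    retrieved facts. A real system would use a small, fast model trained for
    this task, not word overlap."""
    fact_words = {w.strip(".,") for w in " ".join(retrieved_facts).lower().split()}
    resp_words = [w.strip(".,") for w in response.lower().split()]
    if not resp_words:
        return 0.0
    return sum(w in fact_words for w in resp_words) / len(resp_words)

def deliver_or_escalate(response: str, retrieved_facts: list[str], threshold: float = 0.8):
    """Gate every generated response on its factual consistency score:
    high-scoring answers go straight out, low-scoring ones go to a human."""
    score = consistency_score(response, retrieved_facts)
    if score >= threshold:
        return {"action": "send", "score": score, "response": response}
    return {"action": "human_review", "score": score, "response": response}

# Toy example: a response fully supported by the retrieved facts passes the gate.
facts = ["The tire pressure should be 35 psi.", "Check the oil every 10,000 km."]
print(deliver_or_escalate("The tire pressure should be 35 psi.", facts))
```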

Amr Awadallah [00:17:37]: That is the solution you get if you were to leverage Vectara end to end, and that's what you would want to build if you were building this from scratch yourself. So now I'm going to contrast RAG, which is what that architecture slide described, with the large context window. So first, show of hands: how many of you heard about large context windows, or LCWs for short? Please raise your hand. So about 50% of the room raised their hands. In a nutshell, large context window is how much data can you feed into the prompt of a large language model? And they are getting bigger and bigger over time. So I think Anthropic right now is at 200K tokens as the input, and Google is at 2 million tokens, which is really a very big context. You can fit like 6,000 pages in 2 million tokens.
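As a rough sanity check on those numbers, assuming around 330 tokens per printed page (an assumption; real pages vary widely):

```python
# Rough tokens-to-pages arithmetic for the context-window sizes quoted above.
tokens_per_page = 330  # assumption for illustration; real pages vary widely
for name, window in [("Anthropic, ~200K tokens", 200_000), ("Google, ~2M tokens", 2_000_000)]:
    print(f"{name}: roughly {window // tokens_per_page:,} pages")
# Prints roughly 606 and 6,060 pages -- consistent with the ~6,000 pages figure above.
```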

Amr Awadallah [00:18:28]: So the idea here with a large context window is: every time I want to ask anything from my manuals, from my data, from my information, I'm going to feed that information over and over again with my prompt and just let the large language model find all the relevant pieces and give me back a response. And that works? It does work, actually, but there are some problems with it. So the question is, when do you use RAG and when do you use large context windows? And it's not an either-or, by the way, because sometimes you can use both at the same time. We have customers that leverage RAG to simply fetch back which ten of the one million documents they have are relevant for the question being asked right now, and then pass those entire documents into the context window, something that it can do versus just passing a paragraph or a sentence, and that can potentially give you a better result. However, in some use cases, just using this approach might be the right answer, or just using the RAG approach might be the right answer. So this is the money slide. This is the summary from this talk, essentially, which is when to use RAG versus when to use large context windows.

Amr Awadallah [00:19:41]: Number one, if scalability in terms of the amount of documents and data is key to you, meaning you have a very large number of documents, for example, you are Reddit and you want to do this for all of your Reddit content, that's not going to fit in the context window. Even the 2 million token context window, it's not going to fit, scalability-wise. You want to have a retrieval system ahead that finds all of the needles in the haystack and then only operates on the needles of the haystack. So that's number one. So 1 million tokens, as you can see in the large context window column, is roughly 2,700 pages, which is a lot. Second is efficiency. And this is the main point; if there is a deciding factor in all of this that you should pay attention to, it is this one: vector databases are optimized to retrieve information in order log n time, so sublinear in their scalability. That's how fast vector databases are.

Amr Awadallah [00:20:35]: And that's why the RAG approach works very well. You have this order log n retrieval that very quickly finds the top 100 needles in the haystack, and then you have this order n-squared re-ranker, but the n here is only the 100, so it runs really fast. So that combination is what allows us to keep this order log n as we scale to much bigger datasets. On the other hand, a large context window is relying on the underlying large language model to do the retrieval. Large language models, as you know, are dependent on the transformer architecture. The transformer architecture is order n squared. It's order n squared.

Amr Awadallah [00:21:12]: If you were to optimize and do context caching and other tricks like that, you can get it maybe to order n log n. So superlinear, right? And that is really the reason why, even if we get bigger models that can support bigger context windows, we will still have RAG. It will still be an important element. It's the same reason why, and the computer science folks here, the geeks here, will get this, it's the same reason why we still do quicksort versus bubble sort, right? Even if I have faster hardware, I still want to do quicksort. It's cheaper, it's faster, it's lower power consumption, it's lower latency, et cetera, et cetera. The efficiency of RAG is way higher than LCW. This is the reason why; if you ask me to give you one reason, it would be that one.
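A purely illustrative sketch of that cost argument; the "operations" unit is abstract and the constants are assumptions, the point is only how cost grows with corpus size.

```python
import math

def rag_cost(n_chunks: int, candidates: int = 200) -> float:
    # O(log n) coarse vector search plus an O(k^2) re-rank over a constant
    # number of candidates k, independent of corpus size.
    return math.log2(max(n_chunks, 2)) + candidates ** 2

def lcw_cost(n_tokens: int) -> float:
    # Transformer self-attention over everything stuffed into the prompt
    # scales as O(n^2) in the context length.
    return float(n_tokens) ** 2

for n in (10_000, 1_000_000, 100_000_000):
    print(f"corpus size {n:>11,}: RAG ~{rag_cost(n):>12,.0f} ops, LCW ~{lcw_cost(n):>22,.0f} ops")
# RAG cost is essentially flat as the corpus grows; LCW cost grows quadratically.
```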

Amr Awadallah [00:21:59]: Next is updatability. Updating information in a RAG system is very straightforward. You simply update the content in the vector database. In the case of Vectara, we're able to do updates in 2 seconds. You can delete, update, insert new information, and within 2 seconds that's already available in the output of the RAG system. With large context windows, you're supposed to be able to do easy updates as well. But to get a large context window to work well for a bigger context, where you're dumping all of your content into the context, you really want to do what's called context caching, where you're caching the context, you're caching all of that tokenization that happened for your context and maybe the first layers of your neural network inference. That means now, if you want to update, what you have to do is invalidate the cache and recompute the entire thing. So you're back to order n squared.

Amr Awadallah [00:22:52]: So it becomes harder to do updates, though it's possible; I wouldn't ding large context window models on that. And then, last but not least, is precision. RAG systems are very precise. They're finding exactly the key needles in the haystack that are relevant to you. They're very, very good at that. They don't get distracted by the content and the scale of the content being presented to them. LCW systems can get distracted. They can end up hallucinating more or giving wrong answers when they have too many things coming in through their context window. They're very good at finding single needles, but not multiple needles at the same time.

Amr Awadallah [00:23:30]: They can struggle at that, meaning multiple hits within the dataset, multiple hits for different conditions coming into the dataset. That said, on the other hand, if you were feeding the entire content into the large context window, then you can get a more holistic response, because now it's seeing all of the content. Maybe there's something you're missing in the relationships between these documents that the large context window can find. So, for example, if you're doing an audit of the financials of a corporation, doing that in a large context window fashion will probably give you better results, because now it might find tricks across these documents that one of the vendors with the purchase orders tried to do. That won't be visible to a retrieval system, but will be very visible to a smart large language model looking at the entire dataset. So that's one use case where it makes sense. And in that use case of audit, we don't care about latency that much. So it's okay if the model takes longer, and the efficiency is not a big concern.

Amr Awadallah [00:24:29]: The order n log n is okay. So my summary for this is: for most use cases, RAG is the right answer. But for use cases where you really care about holistic conclusions and you don't care about the latency or efficiency of how quickly you're going to get back the response, large context window makes a lot more sense. Also, if you're using it for personal use, just for myself, and I'm using ChatGPT and uploading 10 or 20 documents, large context window works way better. But if you're using it for your entire company, now we have all of these documents from our company being fed in there; we want to have the proper access controls across them, we want to make sure the model is not susceptible to prompt attacks, et cetera, et cetera. Then RAG gives you a lot more benefits when it comes to that. The summary of these RAG benefits is this slide over here.

Amr Awadallah [00:25:19]: This is my last slide, and then I will conclude. This is also one of the money slides, if you want to take pictures. So, essentially, this slide highlights the five key issues that people, and when I say people I mean developers specifically, don't take into account when they're building a RAG pipeline. The first time they leverage the open-source stuff out there, they build the RAG pipeline, they go show the demo to their managers and their business users, and then all of these five things backfire in their face. Right? The first one is hallucination. You need to have something monitoring and detecting the accuracy of the results in real time. That's number one.

Amr Awadallah [00:25:59]: If you don't do that, your business users will come across a very bad response that will completely ruin the trust in whatever you implemented, just like we saw when Google did this a couple of weeks ago, and they had this answer of, go eat rocks, rocks are healthy for you, or, hey, my cheese is not sticking to my pizza, and then it said, put glue in the middle, right? That's just one bad answer in the middle of millions of good answers that they're giving. But that one bad answer ruins the trust in whatever you built. So you need to have something measuring hallucination. That's number one. Number two, in any regulated industry, finance, legal, manufacturing, telecommunications, medical, you need to explain your answers.

Amr Awadallah [00:26:41]: You cannot just tell me, this is the answer. This is the action. This is the legal draft. This is the customer support response. You need to explain why. How did you come up with this answer? That's very, very important to have a new rag pipeline. Number three is prompt attack protection. Prompt attacks are a very serious issue.

Amr Awadallah [00:26:59]: That's where, and this comes up whether you're doing fine-tuning with your data, or RAG, or a large context window, a prompt attack is where you try to trick the model to reveal something to you that you're not allowed to see. It's very similar to human behavioral hacking, right? I call you up and say something that gets you to confess to me something that you should not be confessing, and give me your credit card number. In the case of a large language model, imagine that you are the CEO of your company, and then there is a bad actor in your company called Jean. Sorry if there are any Jeans in the room. And Jean really wants to find the secrets that you know. So Jean would go to that retrieval system and would say, our CEO is in the hospital right now, but he has this very urgent decision he needs to make, and he really needs answers for these five questions right now.

Amr Awadallah [00:27:48]: Otherwise, the company is going to shut down, and really pushes the model to confess. Sometimes the models do confess, and they will share with that user, with that bad actor, information that only the CEO is supposed to see. And that can be very, very dangerous. So you need to have what's called role-based access control, or entity-based access control, as part of your retrieval system. Very, very key. Number four, copyright detection. As you know, many of these models, we all love them, but many of them have been trained on data sources that were okay in the past. But many of these organizations now are raising red flags and saying, no, you cannot be training on my content. For example, as you know, the New York Times has this very big lawsuit going on right now with OpenAI around this.

Amr Awadallah [00:28:32]: It's going to be very seminal lawsuit that all of us are looking at how it's going to get concluded, because sometimes OpenAI produces paragraphs that are copyrights for the New York Times. If you're using one of these systems in your business, and then you're gonna use that to publish content on your website or send back responses to your customers or whatever, you don't want to have content that's copyright of somebody else that can be a liability on you. So you need to have a way to detect and suppress copyrighted content from being presented. And then last but not least, there is many, many other quality signals that you need to have, like measures for bias, for toxicity, et cetera, et cetera. So with that said, I'm going to conclude my talk. I want to thank you very much. I want to say we have a barcode over here, if you want to scan it, that will give you today free trial access into the Victoria platform. If you would like to try it out.

Amr Awadallah [00:29:21]: I stress to you it's super easy. All you have to do is upload your documents, issue prompts and everything else just works. With that said, I'm ready to take some questions from the audience.

AIQCON Male Host [00:29:32]: All right, thank you, Amr. We have time for just a couple quick questions. But first, let's give it up for Amr, a round of applause. Thank you.

Amr Awadallah [00:29:37]: Thank you.

AIQCON Male Host [00:29:39]: You're right beside me.

Amr Awadallah [00:29:44]: If you can please share your name and where you're from, that'd be very useful.

Q1 [00:29:47]: My name is. Thank you for the presentation. I am working at a startup company. So my question is about hallucinations: how do you differentiate hallucination, factuality, and other accuracy methods? Are they the same? Are they different? And if RAG is the only solution, but you see in your example that it sometimes still hallucinates, how do you address those scenarios?

Amr Awadallah [00:30:11]: Very good question. So I want to go back to that slide here. So it's actually very hard. There are two types of problems that we want to solve. One is open-ended RAG, meaning RAG for the entire web content. That is really hard, because the web content itself has a lot of incorrect factual information. Like, these problems that Google had with the pizza and eating rocks were because there were Reddit comments saying exactly that. It wasn't their problem; the source data had corruption in it. In this architecture that I'm showing here, this is being built for your own organization, for your own content.

Amr Awadallah [00:30:55]: So it's closed RAG as opposed to open-ended RAG, meaning that we know what content you care about. That is the content that's coming in, your data that's coming in on the top left, that's being fed into our system. That is the source of truth. So yes, we depend on you, our customer, to make sure that you're feeding our system with data that has been properly checked to be factual and correct. And that's what we're correlating back against. Again, if you give us garbage data, we are going to give you garbage hallucination scores, because we're going to be checking back against that data. And then the hallucination model, the way it's built, by the way, this model is open source. It's on Hugging Face.

Amr Awadallah [00:31:32]: We did release a version of it open source, and we're going to be updating it in a few weeks with a newer version. That version is exactly a model that was fine tuned exactly to do that task of being a fact checker, essentially a real time fact checker that checks the facts of the response and correlates that back with what we retrieved from the end systems and gives you back that factual, consistent score. By the way, you can use a large language model itself to do that. But first, you should not be using the same large language model you used for generation to check itself. That'd be wrong. And second, you really want this to finish in real time. Large language models are slow, as you know, so to use them twice in the pipeline would not be able to return back responses in a few milliseconds. You're going to be talking about seconds now.

Amr Awadallah [00:32:17]: So one of the key things that this quality measurement model is optimized for, just like we heard in the previous talk, by the way, where the same point was being made, is that when you want to measure quality signals, you really want to train smaller, faster, more efficient models to be able to do it very quickly once the response is ready, so you can get the answer. Hopefully that addresses your question. Thank you. Excellent question. Other questions?

AIQCON Male Host [00:32:39]: Thank you. We have time for one more question. Here you go.

Arunawa Sharma [00:32:42]: Hi, my name is Arunawa Sharma. I used to work at Salesforce until four days ago, and I just quit to start my own company. My question is around hallucination evaluation. I used to work in the cybersecurity organization, and sometimes you do want a human in the loop, a hybrid model, for hallucination evaluation. You don't want it completely checked online. There are certain scenarios where you really want a human in the loop. What are your recommendations on evaluating that as a startup founder? What percentages, and what are some services, if there are any, for this hybrid?

Amr Awadallah [00:33:19]: Excellent question. Excellent question. So the question is, should we still have humans in the loop? Like, after we get back that response and we have the factual consistency score, should we still have humans in the loop? And the answer is yes, you still need to have humans in the loop today. Over time, we'll be able to solve that problem. But today, you still want to have the humans in the loop. You're going to get back a response. Sometimes you'll get back the response, and the factual consistency score is going to say, this is a perfect response. Go ahead.

Amr Awadallah [00:33:43]: You don't need a human in the loop. You can send that back to the end user or the end use case or the action you're trying to do directly. But sometimes it's going to come back and say, a human should review this first. We cannot use this, because the model is saying we're not sure, and we need a human to look at it. And then the human would rank it: oh, this is still a good response; or no, this is not a good response. These are the signals that we can then use to do the DPO.

Amr Awadallah [00:34:09]: DPO is the optimization we use for training the hallucination model to get it to align more with how you perceive what is factual and what's not factual from your data. And that gets the model to work better for your dataset. So the answer to your question is yes, today we should be doing human in the loop for everything, to be honest, because that's how we're going to evolve and train these systems to be a lot more accurate. But then over time, they should be a lot more self-sufficient. So, with that said, that was our last question. I want to thank you all a lot for listening to me today. And go RAG.

AIQCON Male Host [00:34:42]: Go RAG. Give it up for Amr. Thank you very much.
