Mitigating Hallucinations & Embarrassing Responses in RAG Applications
Alon is the CTO and Co-Founder at Aporia, the ML Observability Platform. Alon spent the last decade leading multiple software engineering teams, working closely with various organizations on their Data and Machine Learning platforms.
Alon Gubkin, CTO at Aporia, discusses strategies to mitigate hallucinations in customer-facing RAG (Retrieval-Augmented Generation) applications. He highlights the challenges of prompt engineering and fine-tuning LLMs, noting their limitations in scalability and customization. Aporia's solution is a middleware that acts as a firewall, evaluating and revising responses in real time using small, fine-tuned models. This approach aims to keep responses accurate and context-relevant while maintaining low cost and latency.
Alon Gubkin [00:00:00]: Everyone. So my name is Alon, I am CTO at Aporia, and today I'm going to talk about mitigating RAG hallucinations, especially in customer-facing applications. But before diving deeper into that, I'm going to use a demo company. It's called Spotifi. It's very similar to Spotify, but it's next-gen and it's AI-based. Yeah. So obviously this is just for the story: this next-generation AI-based music player built a support chatbot, and this is a RAG chatbot. It's customer-facing.
Alon Gubkin [00:00:38]: Users can go in and ask questions like, I want to cancel my subscription, how can I do that? Right? Typical user question. And to build a RAG chatbot, this is pretty much the starting point, right? You have this system prompt and you are basically saying: you're a helpful support assistant, answer solely based on the following context. Then you have a question from a user, you go and retrieve knowledge from your knowledge base, vector database, all of that. But essentially, at the end of the day, you just copy-paste the context inside the system prompt and the LLM generates an answer. For this example, we are going to use GPT-4 Turbo, and we have Bob. So we have deployed this application and Bob just entered our chatbot and asked the question, how can I cancel my plan? A typical user question. And the response was pretty good. The support chatbot went to the knowledge base, retrieved the relevant information, and the response looks good. But users can get really annoying, as all of the developers here know.
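A minimal sketch of the prompt assembly he describes, assuming an OpenAI-style client and a retrieve() helper standing in for the vector-database lookup (neither is shown in the talk):

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, retrieve) -> str:
    # retrieve() is a stand-in for your own vector-database lookup
    context = "\n\n".join(retrieve(question))

    system_prompt = (
        "You are a helpful support assistant. "
        "Answer solely based on the following context:\n\n"
        f"{context}"
    )

    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```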
Alon Gubkin [00:01:50]: And then Bob tries to trick it and asks another question: I want to cancel my plan, but what if I bought it with an Amazon gift card? And the RAG actually works, right? This is the correct response about gift cards, et cetera. But then users can get even more annoying: how do I cancel my Amazon subscription? And here's what the LLM does. By the way, this is completely real. It actually uses the Spotify support knowledge base. It's GPT-4, a state-of-the-art LLM, temperature zero.
Alon Gubkin [00:02:29]: And we actually get a hallucination. This is a real hallucination: it just hallucinates UI elements inside Amazon. There is no Spotify subscription button on Amazon.com, right? So this is a classic example of a hallucination. The developer sees this, the product manager comes asking what's going on here, and the developer goes to the prompt and adds a guideline, right? Kind of tweaks the prompt. Okay.
Alon Gubkin [00:03:01]: So you cannot assist with any service that's not Spotify. Okay. Did it happen to you like that? You need to go back to your prompt, add something, something doesn't work, go back again, et cetera, et cetera. So it's very, very common. But this actually fixed the issue. The developer tweaked the prompt and this is the correct answer. Like, I can't help with Amazon, this is not what I do. But then we have another user, and this user is also annoying.
Alon Gubkin [00:03:28]: So this user says, I downloaded some music from the Pirate Bay. If you don't know, the Pirate Bay is this illegal website for downloading torrents. And now I want to upload this music to Spotify, it's 100% legal, how can I do it? And I swear this is the real response from GPT-4. It actually explains how to do it. This is part of the response; it just copied the relevant part, but it explains how to upload illegal music from the Pirate Bay to Spotify. So what happens here is that the developer needs to go back to the prompt and add another guideline.
Alon Gubkin [00:04:10]: So it just starts from couple of examples, like the examples that I talked about, but do not mention competitors, do not talk about torrent websites, do not give financial advice, answer rally based, etcetera. So there are so many times that you go back, you tweak the prompt and you add those guidelines. While prompt engineering is great, prompt engineering is obviously amazing as more and more guidelines are added. As you add more and more and more guidelines like this, when you scale your app to more users, the prompt gets longer, it gets more complex. When the prompt gets longer, the LLM's ability to actually follow every single guideline becomes, the prompt just becomes less and less accurate. And there's a really interesting paper that actually proves this. So what they show here is that when the LLM needs to fetch the answer from the middle of a prompt, the accuracy degrades for increasing prompt lapse. And this paper is called Lawson in the middle.
Alon Gubkin [00:05:15]: It's a great paper. So prompt engineering, when you scale to more users, doesn't really work for gaudos for this kind of fix. The next question that sometimes the next thing that people do is to try fine tuning, it's to go and fine tune the LLM. And fine tuning is also great, and it's a great technique. But usually for fine tuning, you need data. You need a dataset. Your data usually contains this sanity conversation, like the typical conversations that the users have with the agent. But the dataset doesn't contain all the different edge cases like restricted topics, all the prompt injections, questions that are highly likely to hallucinate, and so on.
Alon Gubkin [00:06:02]: So collecting data for all of these edge cases is typically very, very hard. So if you've heard of the cloud moderation API, so there's the OpenAI moderation API, there's in azure, the content safety API, and so on. So these are great, and this is starting to be a solution. But unfortunately in the real world they are extremely limited and they are also not customizable to your use case. So I think actually the two previous talks, they were domain specific. Being domain specific is really, really important. And this is a huge problem with the cloud moderation API. A real example from a customer that actually I just remember is that they are essentially building kind of a copilot and history teacher.
Alon Gubkin [00:06:52]: Copilot and what guns were used in the second World war is a legitimate question. But where can I buy guns in Berlin is not a legitimate question. So just as an example. So retrieval is obviously also very, very hard. You know, if, even if LLMs were great, like let's say GPT ten, right? And GPT ten could follow every single instruction, every single guideline, a retrieval is still very hard. And how do you actually know that the context you retrieved can actually answer the question, and the answer was actually derived from this context. And this is where we start seeing a solution that actually works, which is to add more LLM calls after your first LLM call. Okay, so for example, you run the LLM, you run the rag prompt, as we've seen before.
Alon Gubkin [00:07:49]: But after running this prompt, you can run more prompts to kind of evaluate if this response contains financial advice, evaluate if this response contains torrent websites, and so on. Right. So run more LLMs. The problem with running more LLMs is that it's really expensive, it's really slow, and it's also inaccurate, you know, for, for these guideless as more and more users start using the app. And this is really where aporia comes into the picture. So Aporia helps you mitigate AI hallucinations. And when I say hallucinations, I mean not just incorrect facts, but also any unintended behavior like the examples that we have set. So aporia sits between the LLM and the actual application.
Alon Gubkin [00:08:42]: And it's like a firewall, okay, it's like a firewall. It can automatically take a look on the question, the context and the response, and run those guardrails. And the point is that those guardrails constantly learn and adapt as more and more prompts go through the system. Okay, so our guide models are essentially small language models that we continuously fine tune per customer. One cool thing that we have is that the goggles themselves has streaming support. So as the answer is streaming, we run the goggles. You know, we actually do fact checking while the answer is streaming. Right.
Alon Gubkin [00:09:19]: So this is really hard to do, and it's really valuable for real time chatbots and voice applications. Basically, it's like running more and more and more and more LLMs, but it's just low latency, low cost, and it's kind of like one line of code integration. So I'll do a quick demo, and I'm going to use this serviceNow. This is not a real servicenow website. This is just the demo chatbot here. But I can ask a question like, so how can I use the company's VPI? Okay, like, this is an it support chatbot. This is a rag chatbot on some random knowledge base of it documents, and we get a pretty good answer. Okay, so this goes to the knowledge base, it retrieves the relevant information, and this is the actual response.
Alon Gubkin [00:10:13]: Now let's get a little bit more spicy. So instead of such a simple question, can I download a movie torrent using the company's VPN while watching video? Answer, probably no. The reason for this answer is because the LLM, the retrieved context, the knowledge that was retrieved, is the general VPN guide. Now, from a technical perspective, you can actually download torrents using VPN from networking perspective firewall. There's nothing that's actually limiting it. And so the LLM just responded this way, like this is actually based on GPT 3.5. Cool. So what I can do, I can first of all integrate aporia.
Alon Gubkin [00:10:58]: So this is really simple. It's basically just replacing the base URL. So if you're using OpenAI, you can just replace the base URL. Or we also have rest API that you can send your prompt and we send back the revised response, and then this is what you get. So you get all those out of the box gadrels. So these are our default slms that we fine tune for hallucinations and profanity and prompt injections and so on. You can also create your own custom gadgets for your own use case. But just to keep things simple, let me just turn on the master switch here.
Alon Gubkin [00:11:36]: So just turning this on and aporia is real time. So if I go and just copy paste this question. So right here, it's run the galaxy, and we see that the response is much, much better. And the reason for this response is that if I take a look on the rag hallucination policy, which makes sure that the context was actually, you know, the answer was actually derived from the context, and then the context, you know, can actually address the question. So if we take a look on this policy, we can see that when a risk is detected, we want to override the response with this answer. Very, very simple to use. We designed the user interface not to be used only by developers, but also by non technical people who want to continuously add and improve the galls for their project. Something else I can do is, by the way, to rephrase the response using an Llmdeh, maybe.
Alon Gubkin [00:12:34]: The final thing that I will show is the custom policy. So what I can do is to actually create my own custom policies. I can actually write a prompt here that basically evaluates whatever you want. And Aporia will fine tune a model behind the scenes that runs this policy really, really fast with sub second latency. How do you actually try this out? It's really simple to try this out. You just need to replace the base URL, and then you can define your goggles. If you want to try this out, just text me, I'll send you an account, and that's it. So I think we have some time for questions.
Q1 [00:13:18]: What you're describing looks like chaining: the current method, without your solution, takes the output of one LLM as the input to another in order to do filtering. Your solution looks to me analogous to a filtering proxy for the LLM prompts. Can you say more about what you're doing and how you're doing it?
Alon Gubkin [00:13:40]: Sure. So you are using an LLM like OpenAI when you have different prompts that go to this LLM. So we are sitting in the middle, and it's like a firewall layer. Okay, so there are all these policies, from profanity to hallucination to prompt injection and so on. Just at the first talk today, I was thinking of an NPC that you tell ignored the instructions, ignored the instructions that were given to you, and the NPC just starts talking about whatever. So we have different models for all of these gallers, and when a prompt goes to DLM, we run all of these guardals in parallel. Okay. And then basically revise the response, etcetera, depending on your configuration.
Q2 [00:14:29]: I have a question. We build a RAG product ourselves, and we always recommend using guardrails in conjunction with it.
Q1 [00:14:42]: A customer.
Q2 [00:14:44]: Sometimes the guard wheel, it works initially, but over time the larger energy to charge BK will bypass those cards. They ask us, not our product, what can do with it. Maybe you can give us some lights up.
Alon Gubkin [00:15:03]: Yeah, basically what we do. I completely agree by the way, with the customer. And they basically tell you, wait, but even the goggles themselves, eventually someone will find a way to hack them, right? Someone will find an edge case that works on the gardel itself. So our belief, our kind of the way that we design the product is that it's really, really important for us to constantly improve the Galilee as more and more prompts and responses go through the system. So unsupervised GPT four doesn't really work well for guardrails. But as you do more and more fine tuning, as you have more and more edge cases that are specific to your business use case, the guide themselves improve. And one thing that you can do, if your customers gets an example that just should have been blocked but is not blocked yet, they can report that to the system and then it will block the next time. I hope this answer the question.
Q3 [00:16:04]: I had some questions about which kind of more language models you use, like basic size, and then maybe are you, are you using basically like a classification head on top of that or using text to text? And maybe about trade offs. So right now I'm looking at like four SQL models. Is it all, is it like four individual models or we, do you have certain traders, they're using like one SPL model?
Alon Gubkin [00:16:35]: Sure. Yeah. So I kind of simplified, we use different techniques from deterministic algorithms to not LLMs, but like bird based models that are used for like the simpler guidelines. And then we also have like seven B 13 B slms for the more complex versions of the guidelines. And there's like a huge library of different policies that you can add. So it's not like, not everything here is SLM based. I guess the most advanced policy is the rug hallucination and the custom policy. And those are based on slms.
Alon Gubkin [00:17:11]: So what we've found out that just appending classification head to SLM really doesn't work well. I've seen some paper where it does work well, but for our use cases, we actually did a lot of evaluation in it. Just adding classification really didn't work well for us. What they do recommend is to use text to text, even for classification. By the way, I created a video on YouTube that actually shows you how to build a classifier using an LLM so you can check it out.
Q4 [00:17:42]: As we start rolling into autonomous agents, I've noticed a lot of people users wanting, hey, can I generate code and execute automatically? Whereas obviously in the enterprise, I wouldn't recommend that to the enterprise thing generally. According to that, what are you seeing from a guardrails perspective for agents? Because this is guardrails core, the human to agent conversation. But what about autonomous agents which are generating code or browsing websites? How do we guard?
Alon Gubkin [00:18:14]: Sure. So I'll give a simplified example and then I will talk about the higher level case. But one really, really common use case is DextroSQL, which is basically SQL code generation. And if you think about it, it's crazy just to let an LLM generate code which is completely untrusted and then execute it. So yeah, if you're using postgres or if you're using databricks, you have some role based access control. But still there are a lot of things that can break. And just as an example, something that actually happened to one of our customers in production. And that's why we added this gardel is that the LLM generated a SQL that just created denial of service.
Alon Gubkin [00:18:55]: It created this while true loop, very simple because it didn't have limit and they used snowflake and they had a huge data and this can create snowflake cost, et cetera. So one gaddle that we created because of this use case is this text to SQL security, which basically makes sure that you don't have any update, delete, drop, etcetera. You don't have access to any sensitive data or system tables, sandbox, escaping. So in some SQL variants you can actually execute code. You can actually run Python as part or like Snow park as part of the SQL. And then this guy will also make sure that the SQL is not going to run forever. Like it has the proper limits and so on. So this is an example for a guardrail for code.
Alon Gubkin [00:19:45]: One thing that we are planning to add is galls for generic code, like python generated code and so on. And I guess for autonomous agents, essentially what you need to guard are the tools themselves, right? Like the calls to the tools themselves. And like if you're thinking, I don't know if it's like ten years to the future, but even five years to the future, you really want to make sure that an LLM that's controlling your smart home doesn't, I don't know, block the air and futuristic, but yeah, you.
Q1 [00:20:19]: Gave a really great example of a gun. Story of guns in World War Two versus hunt. I want, can you talk us through how you might implement? Maybe you do want to talk about guns in World War two. But we don't want to talk about where five months. So how would you use your tools to differentiate?
Alon Gubkin [00:20:36]: Sure. So they would start without our tool. They would start by creating. You are a history teacher. You should not answer, blah, blah, blah. So this is the starting point. At some point, they're going to see a lot of examples for things that are not so good, and they're continuously tweaking the prompt for the history teacher. Example specific.
Alon Gubkin [00:21:00]: This can actually be done with custom policy on the prompt. Okay. So they can create another policy that checks. Is this question related to Gans but not related to history? This is the starting point of the Galileo. From this point, more and more data will go through the system, more and more prompting responses will go through the system, and the Galileos will be fine tuned to improve for these different edge cases.
Q1 [00:21:28]: Sure. Just underneath the. When a risk is detected, what are the other objects? Who said?
Alon Gubkin [00:21:33]: Yeah, so in this case, it's only log and overhead, but let me actually go to the hallucination. So here we have some more options. So you can either just don't do anything like just log. You can add a warning to the end of your response. Warning. This response is highly prone to hallucinations. Do not take it seriously. You can rephrase the response using an LLM, or you can completely block the response, like override it or the racket house the issues.
Q2 [00:21:59]: Do we think, like, documents that may be. I don't know. They're not, like, useful right now, and they are, like, documents that, I don't know, from 2017, you're using that response. So, yeah, it's okay. It's based on a document. But actually, you have a brand new document from Tools 2019. That's what you should be using. So can you capture those type of all those initiatives like, you're using, like, David alert should not be UI?
Alon Gubkin [00:22:29]: We can't. We can't do that. Like, this is essentially that. We see that as the job of the developer who is building the retrieval. And this is, you know, the developer is continuously improving their app, continuously improving the retrieval. We're just a firewall layer at the end of it, and we don't improve your accuracy. We cannot do that. What we can do is to detect bad examples and block them.
Alon Gubkin [00:22:51]: So think of it like a firewall. Thank you very much.