Combining LLMs with Knowledge Bases to Prevent Hallucinations
Large Language Models (LLMs) have shown remarkable capabilities in domains such as question-answering and information recall, but every so often, they just make stuff up. In this talk, we'll take a look at “LLM Hallucinations” and explore strategies to keep LLMs grounded and reliable in real-world applications.
We’ll start by walking through an example implementation of an "LLM-powered Support Center" to illustrate the problems caused by hallucinations. Next, I'll demonstrate how leveraging a searchable knowledge base can ensure that the assistant delivers trustworthy responses. We’ll wrap up by exploring the scalability of this approach and its potential impact on the future of AI-driven applications.
It's been so awesome. So without further ado, Scott, nice to see you. Hello, hello! How's it going? I imagine people are getting tired from the day, but hopefully still have a little bit more energy. I'm doing really well. I think people are just revved up from all the learning. Well, that's good to know.

That's great to hear. And where are you joining us from? I'm actually based out of Vancouver, Canada, out on the West Coast, so it's early afternoon for me. I'm on the East Coast of the US; you guys sent us your smoky skies. I think that's a little more Alberta, but yeah, you're definitely right.

It's been a bit of a crazy year. It's been incredibly hot, which for Vancouver is pretty nice, because it's summer now; it's nice and warm, and we go hiking and see the mountains. But yes, not so great for anyone downwind of this area. Yeah. But good, I hope you're getting some good hiking in.

That's awesome. Cool. Well, we will give the floor to you and I'll pull up your slides. Thank you so much for joining us. Of course, thank you so much. All right, well, let's jump into it. So, LLMs in Production, Part II. You'll see the red theming here; this is for the company I work at.
Hello, I'm Scott Mackey. I'm a founding engineer at Mem, and today we're gonna talk about a very exciting topic, something I'm very passionate about, which is combining LLMs with knowledge bases to prevent hallucinations. A little wordy, but I think the topic itself is really, really exciting.

So, what we are building, and when I say we, I'm referring to the company I work at, which is Mem, is a personalized AI knowledge assistant. One way to think about this is like ChatGPT, but it has access to your email and your calendar, and it understands what you are working on.

I like to tell people to think of it as a personalized executive assistant, like my EA: I can go and ask them, hey, can you summarize those meeting notes, that kind of thing. And the reason I'm so excited about preventing hallucinations and working with knowledge bases is that the thing that matters the most to me, at least when working with an assistant, is that they're not gonna lie to you.

They're gonna tell you the truth. They're gonna prevent anything weird from happening. They're gonna make sure that every piece of information they're sharing with you is accurate and factually true. So what are we gonna cover today? There are three things I'm gonna walk through.

First, a quick intro into what hallucinations are. Second, why do we wanna prevent them? Why do we even care about all this? And third, we're actually gonna walk through a real-world scenario, which is building a Q&A chatbot for Bevy, which is a game-making framework. The reason I chose Bevy is that it's actually something personal to me.

I've been trying to build a Rust game on the side, and something that's been really challenging has been trying to learn it when it's not a language I've worked with before, not a framework I've worked with before, and I've never done game development. I think it's a really great example of a scenario where a Q&A chatbot can be extremely useful, but something like ChatGPT isn't good enough.

We'll do a deeper dive into that later, but that's gonna be the bulk of the presentation: focusing on the real-world scenario. Maybe one thing to add to this section is that I'd love for people to take away from today that this is something you can go and build yourself. It's not extremely challenging, and you can actually use this slide deck and this presentation as a reference or guidebook for taking all of these different learnings and applying them to your own work.
So what are hallucinations? When an LLM hallucinates, it's producing imagined output. A really obvious version of this is you go and ask it for some source and it gives you some link, and the link is just totally false. There are two main types: one is fabrication of facts, and the second is faulty reasoning.

So the first, fabrication of facts. I have a little example here; it might be a little challenging for everyone to read, but essentially what I've done is I've gone to ChatGPT (this is GPT-3.5 Turbo, the May 24 version) and I ask it: hey, if you could recommend one behavioral economics paper to me, which one would it be?

And can you give me its name and ID? It responds with a name, "Nudging individuals towards better financial decisions," and it sounds like a real paper, right? It sounds like Richard Thaler's famous work on Nudge. And it gives us this link. But when you actually go and take a look at that ID, you'll realize that the citation is made up.

It's not the correct paper. I can't even find one with the exact matching name online. Maybe it's out there somewhere, but it's very challenging to actually track down if it does exist. And you can just tell that the LLM has lied to me about this being a proper source.

Another example of hallucination, and I really like this one, is also from the GPT-3.5 model, but this is an example of faulty reasoning. I ask it: hey, suppose I've got two pounds of feathers and one pound of bricks; which one weighs more? Two pounds or one pound? Two pounds should weigh more.

It should say yes, the feathers weigh more, but it describes its rationale and says no, the feathers are gonna be lighter, they don't weigh more than bricks, which is just faulty reasoning. I think what's really interesting about this example is that GPT-4 is able to get this one correct, and in general larger models are able to do a much better job of reasoning, and I think it hints at this class of hallucination becoming less problematic over time.

Another thing that's really interesting about this kind of faulty-reasoning hallucination is that there are lots of different strategies you can take to try and improve it. You might have seen people sharing prompts online where you ask the model to self-critique: hey, does this make sense to you?

And the model might say, oh, I apologize, that actually doesn't make sense, and it gets it right the second time. There are lots of different strategies you can use, like Tree of Thought and so on, that can help solve faulty reasoning. But what I really wanna focus on today is the first kind of hallucination, which is fabrication of facts.
So why does any of this matter? You might have heard of or seen some of these articles; it was publicized recently. I'm sure there are lots of examples of this, but this is one that kind of blew up a bit: a lawyer cited some sources from ChatGPT in a court case, and they were made up; they were hallucinated.

A lot of people look at this like, ah, why would someone not double-check and make sure they're real? But what I found really interesting is that it's clear this is a use case, right? Being able to use these tools for research purposes is something that's valuable to people.

There's some utility there, and I think it just highlights how important it is to start developing products where you can build for this use case, things like legal research, but actually ground it in reality and make sure that anything you are citing is real. And so that's what we're gonna spend some time diving into.
Two things I wanna call out. The first is what's gonna change in the future with LLMs. There are strategies that work today, like some of these prompt optimization tools, that you would expect to go a bit out of style as the reasoning gets better, and other things that are gonna stick around.

One thing that I think we're gonna see improve, and ironically it has recently, is instruction following. Some of you might have seen that, I think it was in the last day or two, OpenAI announced a few new language models that had better instruction-following capabilities. What's really useful about this is that when you ask the model to perform certain tasks, like only respond with two words, or respond in this JSON schema format, that kind of thing, they're becoming a lot better at doing that. And so it's much easier to build products on top of tools that are gonna follow the instructions provided in their prompts.

The other things I think we're gonna see are improved reasoning, which I talked a little bit about, and then larger context windows. We've already seen this with some models like Claude's 100K-token model, and some that might even be larger now. This one's really interesting because it means that some things, like compressing documents or having really good information retrieval systems where you're getting the exact document that has the right data, are gonna matter a little bit less, because we'll be able to stuff more into the context. We can even build products that have longer chat histories, that kind of thing, without having to do a bunch of extra engineering on top. So I'm really excited for that to be more common.

And then something I'm not gonna spend too much time talking about today, but I just wanna call out, is that fine-tuning is going to become a lot simpler. There are lots of companies building in this space, trying to help people take maybe thousands or tens of thousands of user feedback examples and fine-tune models for specific use cases like classification. I think that fine-tuning is really interesting for some things, but for generative text it struggles in some areas, specifically when you're trying to have something that's factual, where you need it to be a hundred percent true; it's not great at that.

One side note: throughout the presentation I've linked a couple of further-reading notes, so I'd encourage people afterwards, if you're interested in any part of it, to go and read some of these papers. I think the AI space is very unique in that, within the last two years, there's been so much new information that reading papers can be really valuable to understand what's at the bleeding edge, what direction things are heading, and so on. So that's one thing I'd encourage people to do.

So those are the improvements; what's not gonna improve? I think the main thing that is not going to change is the fact that IR systems, or information retrieval systems, are still gonna be required.
LLMs won't have access to real-time data. A really simple example of this: I might go and ask, hey, what's the score of the basketball game right now? That's a piece of data that I'd probably want to be at most 30 seconds or maybe a minute delayed; any more than that, and I'm gonna be kind of unhappy.

With LLMs, there aren't really good capabilities right now for getting that data into the underlying model weights. You actually need to get it into the context somehow, and that's likely gonna be with some kind of information retrieval system, integrating with APIs, that kind of thing. The other thing that these improvements won't solve is a hundred percent trustworthiness.

We're gonna get a lot closer to this; I think Yann LeCun has a really interesting presentation on autoregressive models and some of their limits. But it's clear that at the end of the day, when a fact is produced, you wanna be able to point to one or more sources for that fact. And something that LLMs today struggle with is being able to say, oh, well, I ingested all this training data and I'm not exactly sure where this came from. Whereas when you're building systems that are a little bit more agentic, where you've gone and gotten some documents and put them in the context, it's much easier to say, oh, this fact came from here; here are some sources. An obvious example is something like Bing Chat, which will cite which websites it pulled data from, that kind of thing.

And, to maybe reiterate, that's a big part of why this talk is valuable: this strategy and some of the things we talk about are gonna continue to be useful in the future, and when you're building these systems, you're gonna need to be thinking about this.
So, the real-world scenario; we're actually gonna dive into it now. We're building a chatbot to teach us about a topic, a product, whatever it is. In this case, I'm gonna use Bevy, which is a game-making framework written in Rust. I'm not really familiar with Rust; I've worked in a number of languages, but it's not one I've really been exposed to that much.

And again, this strategy could be used for many different systems where you're trying to do the same kind of thing. You can imagine a product that has a help or support center with a bunch of different documents, some knowledge base; that's gonna be very similar to this. Or if you're trying to build on some other tools where there's just a bunch of facts that you wanna query over, things like ChatPDF, all of these strategies are gonna be very similar.

The reason I'm excited about Bevy, again, is because I've been trying to build a game and learn how to use Bevy, and one of the first things I did was lean on ChatGPT and say, hey, I can use this great tool to help me learn what APIs I should be using: how do I render sprites, how do I make it performant, and so on. The problem is that all of the data is out of date for the current versions of Bevy. I think ChatGPT has data from 2021, and the Bevy framework has been updated like a hundred times since then, and lots of APIs have been broken.

And so every time I go and generate something, I get some error, and it can't help me with the errors because it doesn't know about the newer APIs. It's been frustrating, because it feels like there's this great tool but I can't really leverage it for my purposes. So the main thing that I wanna make sure is happening is: how do we avoid hallucinations and make sure it's telling me truthful things that I can use?
When you're trying to build this kind of system, the very first thing that you need to do is provide the chatbot access to some knowledge base. A knowledge base is kind of a vague term, but generally it can refer to a bunch of information in multiple formats.

So it could be facts in a list, it could be documentation, it could be files, it could be websites or a support center, that kind of thing. Essentially what we're trying to do is collect it into one spot so that the chatbot can reference it. In this example, I'm using data from three sources. There are the web tutorials, which is the Bevy website; there are the docs and code examples (when you actually go and take a look at Bevy, they've got lots of examples, building a breakout game, that kind of thing); and then the GitHub FAQ. The GitHub FAQ is really interesting because this is actually live data: you could download a snapshot of it, but ideally we'd actually be able to use the GitHub interface.

I've kind of split them out into two strategies. One is downloading the web tutorials and the docs and code examples, so a snapshot in time. In this example, as we're working through the chatbot, this is actually some code that I've written; I plan to share it later this week, just open-source it, if anyone wants to take a look at it. I recently went and took a snapshot of the web tutorials and the code examples, and then the GitHub FAQ gets queried live.
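A minimal sketch of what that snapshot ingestion could look like, assuming a local directory of downloaded pages and the OpenAI embeddings API; the file layout, names, and chunking rule here are assumptions, not the speaker's actual code:

```typescript
import { promises as fs } from "fs";
import path from "path";
import OpenAI from "openai";

const openai = new OpenAI();

interface KnowledgeChunk {
  source: string; // e.g. "web-tutorial", "docs", "code-example"
  file: string;
  text: string;
  embedding: number[];
}

// Walk a snapshot directory, split each file into rough paragraph-sized chunks,
// and embed each chunk so it can be searched semantically at query time.
async function ingestSnapshot(dir: string, source: string): Promise<KnowledgeChunk[]> {
  const chunks: KnowledgeChunk[] = [];
  for (const file of await fs.readdir(dir)) {
    const text = await fs.readFile(path.join(dir, file), "utf8");
    const pieces = text.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
    for (const piece of pieces) {
      const { data } = await openai.embeddings.create({
        model: "text-embedding-ada-002",
        input: piece,
      });
      chunks.push({ source, file, text: piece, embedding: data[0].embedding });
    }
  }
  return chunks;
}
```

The GitHub FAQ is the one source that stays live rather than snapshotted, so it would be queried through the GitHub API at question time instead of being ingested this way.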
Right, so that's the first part. Once we've got all of the information for the chatbot to reference, the second part is building guardrails, so the chatbot is only responding with answers that are grounded in reality. What I mean by that is we wanna make sure that whenever you're asking questions, it's using data from the knowledge base.

It's not pulling data that is encoded in its training weights; we actually want to be able to point to some source. An example that I like to think about: you can imagine someone who can't speak, and all they have access to is the knowledge base files. All they can do to answer your questions is point at things. I think that's a really good way of mentally modeling this: you wanna build the system so that it's always pointing at something, and then it's grounded in reality.

One example I wanna highlight on these slides is the math example here. What is two plus two? Because it feels like, why would you not let the bot do this? It's gonna know; ChatGPT can do basic math, right? What is two plus two? If the user wants to know that, why would we deny that functionality to them? Why would we say, oh no, you can't do this?

The reason is all about setting the right expectations for the user. If you're thinking about the user experience: it answers some math problems, say two plus two to the power of nine, and then you ask it something like, is 12,346 a prime number? It's not going to be able to solve all of the math problems, because that's not part of the knowledge base, and it's been known to hallucinate in these kinds of scenarios. Math is one example, but there are other things like weather and geography, and in this case the email one here: fix this code and send me an email. That's taking an action; it's using some capability that we don't give it access to.

And so a big part of building these systems is how do you help educate the user on what the capabilities are, and add some guardrails so they stay on the right paths. So what are the types of questions that it should answer? The ones that we're gonna be focusing on are questions like this:

Can I build a 3D game with Bevy? That's a really simple one. One that I'm interested in is: how do I add keyboard controls to my game? I wanna be able to move some character around on the screen; what code do I need to write to do that? Or: I'm experiencing some bug. Those are the kinds of questions that should be our happy-path-style questions.
Those are the ones we wanna encourage users to ask. Now, how do I actually know if the system is working right? Well, when you're building software, you normally write tests. I've heard the word evals thrown around a lot; I think that's a more advanced version of this, essentially. But our goal is to be able to say: given some inputs, we're able to validate the outputs.

What's hard about working with LLMs is that there can be a million different inputs (people can ask any question), but then also, for certain inputs, there can be many different valid outputs. If I ask it how to handle keyboard input for my game, there could be a ton of different valid answers.

And so something that's important when you're trying to evaluate the system is that, if you're writing tests, you wanna try and isolate the important parts of the system that you wanna validate. In this case, there are two main things: when I'm asking this question, I wanna understand how to handle key presses (someone pressed a key), and how to connect all of this to the game state so it's all wired up.

So this here, the first screen, is showing the example chatbot. This is just a CLI tool; again, I'm gonna share some code if anyone's curious. Obviously it's not very pretty, but it's meant to simplify the system as much as possible. What you can see here is that I ask that same question, how do I handle keyboard input for my game, and it gives me a response. What the eval should be looking at is certain pieces of it. So here I'm saying: this part, add_system(keyboard_input_system), that's the handler, it's registering the handler; and then the key press handling, KeyCode::Left, that's saying, this is how I set up key codes and key commands that are gonna work with Bevy.

So if you're actually writing a test for this kind of thing, you wanna try and match on that. It's gonna be a bit of a fuzzy match, it's not gonna be perfect, and your test might be a tiny bit flaky, but I think that's okay, because the main goal of the eval system is to evaluate improvements over time.

If you're adding new capabilities, if you're changing parts of the system, if you're changing prompts: is it becoming more or less accurate? As you have more and more evals, any one specific eval during one test run matters less; it's more about helping you build in a specific direction.
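As a concrete illustration of that kind of fuzzy-match eval, here is a minimal sketch assuming a Jest-style test runner and a hypothetical chatbot.processMessage entry point; it is not the speaker's actual test code:

```typescript
import { chatbot } from "./chatbot"; // hypothetical module exposing the agent

test("explains how to handle keyboard input", async () => {
  const response = await chatbot.processMessage(
    "How do I handle keyboard input for my game?"
  );
  // The two things a valid answer must cover: registering the input system,
  // and reading key presses via KeyCode. Anything else in the answer is fine.
  expect(response).toMatch(/add_system\(\s*keyboard_input_system/);
  expect(response).toContain("KeyCode::");
});
```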
Another example is the unhappy paths. So: what is the capital of Canada? This is a question that ChatGPT would be able to answer, but we explicitly wanna make sure our bot is not answering it, because again, we wanna reinforce that we're not a geography bot; we're not gonna answer random questions about geography.

So we're not gonna answer any questions about geography; we're gonna be focused on the task and the workflows that we've defined. What we'd like to have happen is that it responds with something like: hey, sorry, I'm not able to answer that question, but if you have any questions about Bevy, let me know.

That's the kind of thing that we would like it to do. Maybe one thing to highlight here is that you don't wanna end up in a situation similar to Alexa or some of these voice-activated assistants, where you ask it to do something and it says, sorry, I can't do that, and then you're just stuck.

You don't necessarily understand the system's capabilities. You want the system to help the user along and point them in the right direction. So it's always a balance: you add some guardrails, but you nudge the user towards the right direction. Sorry, I know we're coming up on time, but I'm gonna continue plugging through; let me know if I need to pause at any moment.
So, evals. The last point I'll add is that they should cover five main cases. The first case is hero use cases; again, this is what I defined at the beginning: what are the kinds of questions that we wanna be able to answer? The second case is system capabilities. As you're building this, you can imagine I had one source that was GitHub, another source that was these explicit files that get embedded, and so on, and you wanna make sure that you're testing all the different capabilities of the system; this might be referred to as white-box testing, where you wanna hit every code path.

The third case is edge cases: missing or conflicting data. The fourth is user-driven use cases: if you're building a tool or a product that you want lots of users to use, you're immediately gonna see lots of users try specific use cases, either good or bad, and you wanna make sure you're turning those into evals, so that as you're building the future you're preventing regressions, but you're also helping codify things, so that if you're working with teammates, they're gonna understand what the capabilities of the system should be. And then the last case is the non-workflow scenarios, which were things like the Canada geography example.
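One way to make those five categories concrete is to tag each eval case and report a pass rate per category. A rough sketch follows; the case contents are illustrative, not the speaker's actual eval set:

```typescript
interface EvalCase {
  category: "hero" | "capability" | "edge-case" | "user-driven" | "non-workflow";
  input: string;
  check: (response: string) => boolean;
}

const evalCases: EvalCase[] = [
  {
    category: "hero",
    input: "How do I add keyboard controls to my game?",
    check: (r) => r.includes("KeyCode::"),
  },
  {
    category: "non-workflow",
    input: "What is the capital of Canada?",
    check: (r) => /only answer questions (about|related to) Bevy/i.test(r),
  },
];

// Run every case against the live system and print a per-category pass rate.
async function runEvals(processMessage: (q: string) => Promise<string>): Promise<void> {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const c of evalCases) {
    const passed = c.check(await processMessage(c.input));
    const bucket = totals.get(c.category) ?? { passed: 0, total: 0 };
    bucket.total += 1;
    if (passed) bucket.passed += 1;
    totals.set(c.category, bucket);
  }
  for (const [category, { passed, total }] of totals) {
    console.log(`${category}: ${passed}/${total} passed`);
  }
}
```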
So finally, the chatbot itself. A basic chatbot: if you've looked at any tutorial online, this is the very simple get-started-with-the-API example. A user makes some query, and we have something that I'm referring to as an agent here, and that's purposeful, because an agent is something that can take multiple actions depending on different inputs, and we'll expand on that. The agent turns around to something like ChatGPT or some other LLM, pipes the data along, and it responds.

And you'll see here, in my little example, I've only written 20 evals for this little chatbot. Ideally, if you have a larger system, you're probably looking at a couple hundred, maybe more than that, but this is just to keep it simple. What you'll see here is that I ask a question: what is the key code for the space bar button? "The key code for the space bar button is 32." This is just input and output, and this is not the answer that I want. The answer that I actually want here is something like KeyCode::Space, but it's not giving me the correct response because it doesn't know that we're talking about Bevy.

Now, the second thing I do (and you'll see some version notes at the top here) is instruct the LLM. We're adding some things to the prompt: your goal is to provide support and answer questions related to Bevy and the Rust programming language. After you do that, you'll see the evaluation success rate increases pretty dramatically. You'll see here: what's the key code for the space bar button? "In Bevy, you can use KeyCode::Space." That's a successful eval; that's exactly what I wanted to see.

Now: what's the latest version as of August 9th? It says 0.5.0, from 2021. That's not what I wanted to see; I want to be able to see, today, what the latest version of Bevy is. And so this is the problem that we wanna try and solve: how do we ground it in real information, not just training data?
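At this stage of the walkthrough, the agent is roughly this simple. The sketch below assumes the OpenAI Node SDK, and the prompt wording is paraphrased from the slide rather than copied from the speaker's code:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Version two of the agent: forward the user's query to the LLM with a system
// prompt that scopes it to Bevy and Rust. No retrieval or guardrails yet.
async function answer(question: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "Your goal is to provide support and answer questions related to " +
          "Bevy and the Rust programming language.",
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```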
Before we do that, there's one more step that I want to add, which is putting up some of those guardrails: how do you help make sure that we stay on the happy path and it's not responding to things like "what's two plus two?" One strategy for this is using something like ReAct; you might also have seen Toolformer, or function calling.

Function calling is something that OpenAI recently highlighted in one of their blog posts. There are lots of different strategies for this, but OpenAI's function calling is really simple, so I'd encourage folks to take a look and try it out if they haven't already. This is how you let the model select different actions to take.

In my case, I wanna say: let's only generate valid answers about Bevy, and if it's not about Bevy, if it's something off topic like the capital of Canada, then we're gonna respond with something else. We can still generate some response; maybe it's like, hey, I don't wanna talk about that topic, but if you have questions about Bevy, let me know, something like that. You're trying to separate the situations we expect to handle from the situations we don't.

After I do that, the eval success rate increases a little bit more, and you'll see that when I say something like "I am hungry," it says: sorry, I can only answer questions related to Bevy. It's not gonna help me plan out my dinner, and that's good; that's helping set expectations for the user.
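A sketch of that routing step using OpenAI function calling. The tool names and schemas are assumptions, and this uses the newer tools parameter of the Node SDK rather than whatever interface the speaker's code used at the time:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Let the model pick one of two actions: answer a Bevy question, or decline
// because the message is off topic. The chosen function name becomes the route.
async function routeQuery(question: string): Promise<"bevy" | "off_topic"> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content: "Handle the user's message by calling exactly one function.",
      },
      { role: "user", content: question },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "answer_bevy_question",
          description: "The message is a question about Bevy or Rust game development.",
          parameters: { type: "object", properties: {} },
        },
      },
      {
        type: "function",
        function: {
          name: "decline_off_topic",
          description: "The message is not about Bevy (math, geography, small talk, etc.).",
          parameters: { type: "object", properties: {} },
        },
      },
    ],
  });
  const call = completion.choices[0].message.tool_calls?.[0];
  const name = call && "function" in call ? call.function.name : undefined;
  return name === "answer_bevy_question" ? "bevy" : "off_topic";
}
```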
Finally, we're at knowledge retrieval. This is where things start to get a little more exciting. You wanna take your system and wire it up so it's grounded in something.

There have probably been a bunch of other presentations on the other track, and on this track today, talking about this concept: you have some data, and you put it somewhere for recall. One of the most popular examples of this is using embeddings, storing them in some vector store, and running your queries using semantic search. That's one example; you can also just use something like Elasticsearch for this. It's out of scope for this talk. What I'm using in my little chatbot example is an embedding-based strategy, where you're using semantic search over the documents.

But the thing I wanna highlight is that it doesn't necessarily matter what kind of system it is or how it's built. What does matter is that you put in some semantic query, or any query really, and you get back knowledge that's relevant to the user's query; that's what's required to build and ground these systems. Once you get the data, you take it and put it in the response-generation context, and you say: here are some documents, some knowledge that we've gathered; use this knowledge when generating your response. And you wanna highlight it and say: only use the knowledge that I'm giving you, don't use the knowledge from your training, use the knowledge that's in this set of data, and if it's not there, then say you don't know.
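Put together, the retrieve-then-generate step might look roughly like this. The sketch assumes an in-memory store of pre-embedded chunks and the OpenAI APIs; the prompt wording and the top-3 cutoff are assumptions:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

interface KnowledgeChunk {
  source: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed the query, pull the most relevant chunks, stuff them into the context,
// and instruct the model to answer only from that knowledge.
async function answerGrounded(question: string, store: KnowledgeChunk[]): Promise<string> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const queryEmbedding = data[0].embedding;
  const topChunks = [...store]
    .sort(
      (a, b) =>
        cosineSimilarity(b.embedding, queryEmbedding) -
        cosineSimilarity(a.embedding, queryEmbedding)
    )
    .slice(0, 3);

  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "Answer questions about Bevy using ONLY the knowledge below. " +
          "If the answer is not in the knowledge, say you don't know.\n\n" +
          topChunks.map((c, i) => `[${i + 1}] (${c.source}) ${c.text}`).join("\n\n"),
      },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```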
And you see here our eval rate has gone way up. We're able to return: Bevy's version is 0.11.0. This is fantastic; now we're responding with the right data.

There are a couple more improvements we can make to this kind of system. The first is tool-based knowledge retrieval. This is taking that similar approach we just used, ReAct, where you plan and you say: given this question, where do I want to go and look for the data? Do I wanna go to the docs that are stored in this document store, or do I wanna go and look at the GitHub FAQ? That just happens to be another source for our knowledge base, another source that we can query.

There are some great papers on this; I think Gorilla is a really interesting one that does a really good job. And you'll see that about 10 percent, two of our evals, were actually relying on data that was only retrievable via the GitHub FAQ, and so our success rate increased here.
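The tool-based retrieval itself can reuse the earlier routing trick: the planner picks a retrieval tool by name and the agent dispatches to it. A small sketch with hypothetical tool functions; the real system's sources and names may differ:

```typescript
// Hypothetical retrieval tools the planner can choose between.
async function searchDocsSnapshot(query: string): Promise<string[]> {
  // ...embedding search over the snapshotted tutorials, docs, and code examples
  return [];
}

async function searchGithubFaq(query: string): Promise<string[]> {
  // ...live lookup against the GitHub FAQ
  return [];
}

// Dispatch on whichever tool the model selected via function calling.
async function retrieveKnowledge(toolName: string, query: string): Promise<string[]> {
  switch (toolName) {
    case "search_docs":
      return searchDocsSnapshot(query);
    case "search_github_faq":
      return searchGithubFaq(query);
    default:
      return [];
  }
}
```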
The last thing that I think is really cool is knowledge evaluation: once you get data back from the knowledge retrieval step, you have something that looks at all those results and decides, can I answer this question or not? If no (it might be a little hard to see here), it can respond with "unable to answer": I don't know, I checked my facts, but I'm not really sure how to answer that. Maybe it provides some recommendations: go check GitHub, this is what I would recommend if you actually wanna find the answer.

But the other thing it can do is actually highlight: these three pieces of knowledge are useful for generating the result, and you should use them and cite them. What this looks like as an implementation is that now, when it finds that one document that mentions Bevy's version 0.11.0, it grabs it and returns it.

So if you take a look here, the knowledge evaluation step will turn around to the response generation step and say: hey, there's this specific file; use it in your generated prompt and put it in your citations. And then you see the system here (again, the UI isn't beautiful), and it says: source one came from this file.
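A rough sketch of that knowledge evaluation step, as a single LLM call that decides whether the retrieved chunks can answer the question and which ones to cite. The JSON shape and prompt wording are assumptions, not the speaker's implementation:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

interface KnowledgeVerdict {
  answerable: boolean;
  citedChunks: number[]; // indexes into the retrieved chunks
}

async function evaluateKnowledge(question: string, chunks: string[]): Promise<KnowledgeVerdict> {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "Decide whether the retrieved knowledge can answer the question. " +
          'Reply with JSON only: {"answerable": boolean, "citedChunks": number[]}, ' +
          "where citedChunks lists the indexes of the chunks needed for the answer.",
      },
      {
        role: "user",
        content:
          `Question: ${question}\n\n` +
          chunks.map((c, i) => `[${i}] ${c}`).join("\n\n"),
      },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? "{}") as KnowledgeVerdict;
}

// If verdict.answerable is false, respond "unable to answer" and point the user
// elsewhere; otherwise pass only the cited chunks into response generation and
// surface them back to the user as sources.
```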
All right, our eval rates are up again. Another case here: this is where it says, hey, I didn't find any information; you can try looking elsewhere. That was the red path where it said, I don't know; it's unable to answer the question. Now, even with all those improvements we've put in place, it's still not entirely comprehensive.

You can't solve every situation. In this case, it doesn't really understand temporal queries, like "what was a notable feature from 2020?", and that's why the eval rate still isn't a hundred percent. The percentages here, again, don't really matter that much; they're really just meant to be an indicator of whether you're moving in the right direction.

Over time I'll be adding more evals to this and the eval percentage is gonna go down; I'll add more capabilities and it'll go up. It's really just meant to be a marker for: are you improving, are you getting better at the use cases that you want the user to have? So, two last things. First, future improvements.
There are a bunch of potential ways you could take this next. I think the ones that are most exciting are introducing some kind of planning or task loops: if you have multi-hop queries where you need to go get data here, and here, and here, and combine it, that's where a task loop comes in handy.

If you wanna enhance the quality of retrieval, there are some really interesting things where you can pre-process the documents that you're storing in the system, for example those code files we're storing. Temporal querying, which was the example we just looked at. And then one that's really challenging: resolving conflicting data. If you have different things from multiple sources, how do you handle that? That one's really tricky.

And then improving observability and interpretability: when you're looking at the bot that I was showing, you can't really understand what it's thinking or what decisions it's making. I think things like the ChatGPT plugin system are trying to do that and highlight a few things, but it's not perfect; it feels like there's better UX out there, and I think there's more exploration to do.

A quick plug for Mem: if any of these are interesting to you, every single one of these, everything we worked through, those are the kinds of challenges that I'm thinking about every day and working on with the team. So if you're interested in joining, I'd recommend checking out our careers page.

So, the last thing: the two levers, the two things that are really important when you're building these kinds of systems. The first is providing access to the knowledge base. There are a bunch of ways to do that, but that's the most important thing, and the bot should be using that data in its responses, and it should only be using that data in its responses. The second thing is the guardrails, and those are just to help bring the user to the spots where you have knowledge, so they're not asking all these questions that you're just not gonna be able to answer.

And that was a bunch of stuff; we're a little bit over time. But I have a link here, the Bevy bot chat link, which is where I'm gonna share some of this stuff afterwards on GitHub; it's just an easy link. I don't know if anyone has questions; I'd be happy to answer, but I know we're a little bit over.
Awesome, thank you so much, Scott. You said there was a link you were gonna share; did you wanna...? Yeah, sorry. Here you can see https://bevybot.chat; that's where it's gonna be shared. I haven't uploaded the code there yet; I've gotta clean it up just a tiny bit before it's ready for the public.

But over the next couple of days we'll share it there, and I encourage people to go and check it out. Okay, awesome. I think one question I'm seeing in the chat is from Apurva, who is one of our fearless MLOps community leaders in Toronto. She's asking: how do we add in evals? Is it mostly through prompt engineering?
That's a really good question. I think there are two ways to think about it; let me jump back here for a quick second. The simplest way to create evals is to actually write a test suite that integrates with some live version of your system. That might be a staging or production equivalent, and you're actually running it live; if you've run any kind of smoke tests on your system before, this is the kind of thing it would do. This chatbot module's processMessage call would actually make the API call to your agent, process it, and get the response, and then you could check, like in this example (this is some TypeScript code), did the response text include X, Y, Z?

I think in the future there are lots of really interesting folks working on how we build LLMs, which might be fine-tuned or just specialized, for evaluating the outputs of other LLMs. So you could actually build your evals using LLMs, and then you wouldn't have to do this yourself; you would just say, here's the response, and ask the system: is this a good response? It would be able to grade it, maybe give you some score between zero and ten, and then you could use that to reinforce or improve the system. But for now, I would definitely recommend starting by keeping it simple and doing some simple string matching.
I think what's really good about the IR-style systems, information retrieval, is that for a lot of queries there is a right answer; you have grounding in some specific document. What version is it? That should always produce the same answer. And so for these kinds of systems, it's pretty easy to build evals. If you have a system that's more like a poetry generator, that would be much harder to generate effective evals for.

Great, thank you so much; hopefully that answered your question, Apurva. And another question that we got: thoughts on the StructGPT paper? I'm not sure I recall a StructGPT paper. I don't know if the person who shared that would be able to give a little bit more information about it, but I'd be happy to go and read it and share some more information afterwards. Yeah, Drew, if you could send us a little bit more information for that question. Another question that came through,
from Ben: which of these pipeline steps do you see having the most room for improvement or growth in the short to medium term, the router/planning step, or the knowledge retrieval and query planning? That's a really good question. The one that I always come back to, whenever there are parts of the system where we're not retrieving the right data, is this knowledge retrieval step.

I think what makes it challenging is that there are so many different ways to do retrieval. One way to think about retrieval, and I think there are some good papers on this, something like SQL-LLM, is where the model actually writes a SQL query that gets executed against your database: you provide the schema, and it can go and fetch the data live. I think that kind of thing is moving in the right direction for how we ground the data: we're actually able to interact live with certain parts of the system, and it feels like there are so many interesting opportunities there in introducing new kinds of retrieval, but also synthesis.

I talked a little bit about how hybrid systems are really common (where's that slide?), which is when you have Elasticsearch and something like Pinecone and you're combining the results after you make the query, with some kind of re-ranking. I think what's really hard is when you might have, say, 15 different sources that you wanna query: you have all these different APIs that the system can interface with, so how does it know which one to look in to get the data? One way you can solve it, depending on your use case, is that some of these agents will go and try source one: I didn't find it there; try source two: I didn't find it there; try source three. That's hard, because the user ends up waiting a long time. But I think there are some interesting things you could do with the UX, where you say, hey, I'll get back to you in 10 minutes, let me go check my sources. So I think this is where there's a lot of really interesting work to be done.

But once you actually get the knowledge, grounding it with citations is really important, and that's a step I wouldn't want to lose; it's hard to discard any one of these pieces and keep the user's trust really high in the system. I think you need everything, but retrieval is the part where there's the biggest amount of room for improvement.
That's great. And, let's see. Okay, Drew (I know I'm probably pronouncing your name wrong, Drew, I apologize) followed up on his StructGPT question by saying: knowledge graphs being integrated into LLMs through triplet mining. I don't know if that gives you more information.

Mm, got it, interesting. I've read a couple of similar papers; GraphHop is one, and there might be one or two others. Some of the concepts I've seen are where you have the LLM and you give it essentially some schema of what attributes and objects are in the system. In knowledge graphs you'll have the subject, the predicate, and... I forget the last part of it, but it's a triplet. And you can feed it a list of the different nodes and edges, essentially, and say: what kind of query would you like to run?

I think it's similar to the SQL-LLM-style systems: you're empowering the knowledge retrieval to specifically say, I want to go and get all of the people who live in this city, and then count them up, or something like that. I think it's really fascinating. I think the really hard part about using knowledge graphs is getting the data into that format if you don't already have it there. So in this case, with this knowledge base, it's hard to extract all of that into a knowledge graph, because in that process you're either doing it manually, which is a ton of time and you have to keep it up to date, or you're using an LLM or something else to generate the triplets you're gonna insert into the system, and at that point it might hallucinate, or it might miss data or include extra data. And that's where you definitely don't wanna end up: with bad data in the system that you can't trust. So I think that's where it's really tricky.

There might be a future where that just becomes good enough, like human-level quality, and a lot of systems end up being built by running all of your data through some pre-processing step. And I think here, what I was trying to get at was this second point, pre-processing, where you're extracting all the entities: these are all the people that are mentioned in the knowledge base, these are all the APIs that are mentioned in the knowledge base. And then when you query it, instead of performing a search on Pinecone to get, say, these 10 entities, it's actually able to go to Postgres and a specific table that has all the people, and query for anyone that has a specific name, or something like that. So I think there are lots of interesting things in the pre-processing stage, and I think the hardest part of it is still just getting the data into a format that's really high quality.
Yeah, that's great. One other question: what about hallucination from data within the knowledge base? So I guess that question is essentially asking: what if the knowledge base doesn't have the right data? Like, if there's some piece of information, say "Bevy is version 0.7," and it's incorrect.

I think that's extremely challenging, and I think that's part of this fourth point here on resolving conflicting data. You might have two documents: imagine you're running an e-commerce store, or a physical store, and you have some store hours, like 9:00 AM to 5:00 PM, and then you have another document that says, on holidays we're 9:00 AM to 3:00 PM. Helping the system understand the difference between the two is really challenging.

The approach that I've found best is, you can imagine something like this part of the system, the knowledge evaluation, where it doesn't just have "unable to answer" but has another outcome where it says "conflicting sources," and it actually responds to the user and says: hey, I found two different sources; here's source one, here's source two, take a look at them yourself. It says, I don't really know, but I'm gonna try and let you decide. That's what I'd encourage.

I think the WikiChat paper is really interesting. They don't use this specific approach; what they do is, if they find conflicting data, or data that they can't, what's the word, that they can't prove, essentially (the system has made some statement that it doesn't have any proof or citation for), they'll just remove it from the output altogether, because they want to make sure the output is only ever referencing things it can say are true.

So, not really a great answer, but I think it's just about helping the user understand why it's saying certain things, so that it's really easy for the user to do the fact-check. Yeah, that makes sense. Petros, I hope that answered your question a bit, and feel free to follow up if you want more information.
Cool. Well, thank you so much, Scott. I think you're closing out the day, and this was awesome. It looks like there might be a few more questions, but I'm gonna send you over to the chat to answer those, if you don't mind. That would be great. I actually do apologize, I do have to hop off; I have a work thing I need to run to. But of course, if anyone has any other questions, please send me an email.

You can reach me either at my work email or my personal one; my mem.ai address should be in my profile. I'd be happy to answer any questions that get emailed, or you can reach me on LinkedIn or something like that. So thank you. Thank you so much. Thanks for having me as well.

Thanks so much.