MLOps Community

Automated Evaluations for Your RAG Chatbot or Other Generative Tool

Posted Mar 11, 2024 | Views 550
# Evaluations
# RAG Chatbot
# Capital Technology Group
SPEAKERS
Abigail Haddad
Lead Data Scientist @ Capital Technology Group

Abigail Haddad is a data scientist working on automating LLM evaluations, such as whether models can assist with tasks like building bioweapons. Previously, she worked on research and data science for the Department of Defense, including at the RAND Corporation and as a Department of the Army civilian. Her hobbies include analyzing federal job listings and co-organizing Data Science DC. She blogs at Present of Coding.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

You can write test cases to assess the performance of narrow-use LLM tools like RAG chatbots. As with testing in general, you write tests based on the behavior you want your tool to exhibit, run those tests when you update your code, and add to them when you find something broken and fix it. In this talk, I'll walk through several testing methods and how and where you might use them. I'll cover tests that determine whether your model is responding to the prompts it should be responding to, as well as methods for evaluating text output, including LLM-led evaluations.

TRANSCRIPT

Automated Evaluations for Your RAG Chatbot or Other Generative Tool

AI in Production

Slides: https://drive.google.com/file/d/12VMgXb-XaVs2PBy0jDOqcXVuvkR3FlGT/view?usp=drive_link

Demetrios [00:00:05]: Next up is Abigail. Where are you at? Abigail? There you are. Hello.

Abigail Haddad [00:00:12]: Hello.

Demetrios [00:00:13]: How you doing?

Abigail Haddad [00:00:15]: Great. I am going to try to figure out how to share my screen. Okay.

Demetrios [00:00:18]: Oh, yeah, that's a fun one.

Abigail Haddad [00:00:19]: There we go.

Demetrios [00:00:20]: Trouble earlier, I think you were here, you may have seen that I didn't.

Abigail Haddad [00:00:25]: Is that working? Can you see that?

Demetrios [00:00:27]: Yes. There it is. Okay, wait. Just before you jump into your talk, I want to mention to everyone, sorry, I'm going to steal your thunder real fast, but I want to mention to everyone that you're about to talk about evaluation and how fun evaluation is. We've got a whole survey on evaluation, and there's some really clear data that has already come through in the evaluation metrics we've gotten from people. So I just want to share this real fast before you jump in, Abigail. Don't worry, you still got ten minutes.

Demetrios [00:01:08]: I just want to show you this so you can maybe enjoy it while you're talking in your piece. So if anybody wants to fill out the evaluation survey, it's there. There's the QR code. We can drop it in the chat, too. But here, what you'll see is what people have already filled out. Here is a little bit of their roles and their seniority. So you can see that among the people who have already filled it out, we've got some senior people: data scientists, CTOs, data engineers, ML engineers, et cetera, all that fun stuff. I've been teasing this data real fast, and Abigail, I thought it was perfect for you because you're giving this talk.

Demetrios [00:01:49]: You might want to know about it. What we've seen is that the majority of people are using OpenAI. And that's not really a surprise, I wouldn't think. But then a lot of other people are using these smaller models, which I think is really cool to see. So it's like one or the other. And that hasn't changed so much since we did the first evaluation survey back in August, September of last year. So, what aspects of LLM performance do you consider important when evaluating? Look at that: accuracy, which is what we just saw in the chat. I think it was Michael mentioning this in the chat.

Demetrios [00:02:32]: Accuracy is very important. Muy importante. And then you've got truthfulness and hallucinations also pretty high up there, and of course, cost. But we're talking about how to evaluate a system. Abigail, I don't want to jump into your talk too much. Sorry. And thank you for letting me steal your thunder a little bit. I know you've got an incredible talk all about RAGs and evaluating RAGs.

Demetrios [00:02:56]: I'm going to let you get to that. You've been super awesome with your patients and sitting through some of our shenanigans, so I'm going to hand it over to you and let you tell us all about the evaluations for the rags.

Abigail Haddad [00:03:13]: Awesome. Thank you. So, my name is Abigail Haddad. I'm a data scientist, and last year, like a lot of people, I spent part of my weekend making a little demo of a RAG chatbot. Well, my chatbot, it was terrible, guys. It was really bad. It was supposed to answer questions using documentation for this Python package, which was super new, so it wasn't in GPT yet, so I needed a workaround. And my RAG thing, it was just making stuff up.

Abigail Haddad [00:03:40]: It was getting things wrong. It was just generally not usable. Okay, so for my second RAG, I made another one to answer questions about the California traffic code. And this one is better. It's somewhat better. I'm like, okay, this is a thing. But when we were trying to make decisions about what open source model we should use for the underlying LLM, I didn't have an assessment strategy for figuring out which of these actually works the best, or what we needed. And that was also bad. But now I've been working on LLM model evals for a while, and I've spent a lot of time evaluating LLM output in different ways.

Abigail Haddad [00:04:13]: And so I have at least the beginning of a strategy for how to evaluate your RAG chatbot or other generative AI tool, like something you're building for customers, or maybe internally, to do one specific thing. And that is what I'm going to talk to you about today. So first, why are we automating testing? Why do we care about this? So, first of all, why do you ever automate testing? You automate testing because you're going to break things, and you want to find out that you broke something when you push your code to a branch that's not your main branch, before you merge that code and your product goes boom. So we automate testing because we're human and we're fallible. But also, specifically in the context of your LLM tool, we're automating testing because there are choices you're going to make about your tool, and you want quick feedback about how they worked or could work. For instance, like I mentioned, if you're trying to decide what underlying LLM to use, or what broad system prompt to use if that's relevant for you, when you make or consider these changes, you want to know how they affect performance. And the more your tool is doing, again, like with any kind of software, the less feasible it becomes to test things manually.

Abigail Haddad [00:05:14]: So what I was doing with my California traffic code chatbot was I had a series of questions, and I automatically ran them through multiple models. And then I looked at the responses myself. It's not the worst, but it's not the best either. So let's talk about how to automate testing broadly. We test to make sure that our tools are doing what we want them to do. So what do you want your tool to do? What are some questions you want it to be able to answer? What does a good answer look like? What does a bad answer look like? And we want to test that it's doing that. So easy, right? Just test that it's doing what you want it to be doing. We do that all the time with machine learning problems generally and with NLP specifically. But actually, okay, in this case, it's not that easy. It's actually kind of hard.

Abigail Haddad [00:05:58]: So why is it hard? It's hard because text is high-dimensional data. It's complicated, it has a lot of features. And with generative AI, like with a chatbot, we're not talking about a classification model where the result is a pass or a fail, like when we're categorizing something as spam or not spam. Now, as a digression, you can also use large language models to build classification models or do entity extraction, and they're really, really good at that. And that's my favorite use case for LLMs, in part because you can evaluate them really easily, the same way we've always evaluated these kinds of tools: by comparing the model output with our ground truth label data. If you have the opportunity to do that instead, do that. Like, oh my gosh, evaluate it, you have a confusion matrix, call it a day. But everyone wants a chatbot. Okay, so we're talking about that.
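
A minimal sketch of that classic ground-truth evaluation, assuming you have a small labeled set and a hypothetical classify_with_llm() helper wrapping whatever prompt and model you use (the helper and example data are illustrative, not from the talk):

```python
# Sketch: evaluating an LLM-backed classifier against ground truth labels.
from sklearn.metrics import confusion_matrix, classification_report

def classify_with_llm(text: str) -> str:
    # Hypothetical stand-in: call your LLM with a classification prompt
    # and return one of the labels, e.g. "spam" or "not_spam".
    return "not_spam"  # placeholder so the sketch runs

labeled_examples = [
    ("Win a free cruise, click now!", "spam"),
    ("Can we move our 3 pm meeting to Thursday?", "not_spam"),
]

y_true = [label for _, label in labeled_examples]
y_pred = [classify_with_llm(text) for text, _ in labeled_examples]

print(confusion_matrix(y_true, y_pred, labels=["spam", "not_spam"]))
print(classification_report(y_true, y_pred, labels=["spam", "not_spam"], zero_division=0))
```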

Abigail Haddad [00:06:42]: In the case of chatbots, we're asking a question and we're getting back maybe a whole sentence, or a whole paragraph, or even more. So how do we assess that? We have a few options, which I'll go through. But first, I want to note that the purpose of this kind of testing is not to comprehensively test everything someone might ask your tool about. If you were going to generate every possible question someone could ask your tool, and then criteria for evaluating all the answers specific to each question, you wouldn't need a generative tool, you would need a frequently asked questions page and some search functionality. So the purpose of this is you pick some of the types of questions you want your tool to be able to answer, and then you test those. So, first option: string matching. Okay, we've got some choices here. We can look for an exact match, if you want your answer to be one exact sentence. Or you can want it to contain a particular substring, like if we ask it for the capital of France, is "Paris" somewhere in the response? That's string matching on a substring.

Abigail Haddad [00:07:34]: We can use regular expressions if there's a pattern we want, like if we want a particular substring, but only if it's a standalone word, not part of a bigger word. We can measure edit distance, or how syntactically close two pieces of text are, like how many characters we have to flip to get from one string to another. Or we can do a variation of exact matching where we want to find a list of keywords rather than just one. So here's an example. We have a little unit test. We have a prompt we're sending to our tool's API. We pass a question, we get back a response, and we test to see if there's something formatted like an email address in it. Super easy. So, ship it? Does this look good? Is this a good way of evaluating high-dimensional text data to see if it's got the answer we want? Isn't that great? Right? There's a lot we just can't do in terms of text evaluation with string matching.
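
A minimal sketch of that kind of unit test, assuming a hypothetical ask_chatbot() wrapper around your tool's API (the function name, prompt, and placeholder response are illustrative):

```python
import re

def ask_chatbot(prompt: str) -> str:
    # Hypothetical wrapper around your tool's API; replace with a real call.
    return "You can reach support at help@example.com."

def test_response_contains_email_address():
    response = ask_chatbot("How do I contact customer support?")
    # Pass if something formatted like an email address appears in the response.
    assert re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", response)

def test_response_contains_keywords():
    response = ask_chatbot("How do I contact customer support?")
    # Variation: require a list of keywords rather than a single substring.
    assert all(word in response.lower() for word in ["support", "help"])
```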

Abigail Haddad [00:08:22]: Maybe there are some test cases you can write like this if you want very short factual answers; maybe do this. But in general, I would say don't ship this. Next we have semantic similarity. With semantic similarity, we can test how close one string is to another in a way that takes into account both synonyms and context, so what comes before and after. There are a lot of small models you can use for this. You can basically project your text into, like, 768-dimensional space, which is actually a lot fewer dimensions than it starts out with. It's a major simplification. And then we take the distance between those two strings.

Abigail Haddad [00:08:55]: So here's an example. This isn't quite real, by the way, but basically you download a model to, sort of, tokenize it. Again, you're hitting your tool's API with a prompt, you're getting a response, you're projecting that response into the semantic space, and you're projecting the target text, basically the thing you wanted your tool to say, into that space as well. And then you compare the two, and you use a test which has a threshold for similarity. So if it's 0.7 or above, whatever, then it passes. The idea is it passes if the two texts were sufficiently close according to this model. So, ship it? Again, I don't want to say never, but there's a lot of nuance you're not necessarily going to capture with similarity, especially as your text responses get longer. Something can be important and you can miss it, one way or the other.
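
Her slide example isn't quite real; one way this could look is with the sentence-transformers library (my choice, not named in the talk), reusing the hypothetical ask_chatbot() wrapper from the earlier sketch and an arbitrary 0.7 threshold:

```python
from sentence_transformers import SentenceTransformer, util

# all-mpnet-base-v2 maps text into a 768-dimensional embedding space.
model = SentenceTransformer("all-mpnet-base-v2")

def test_answer_is_semantically_close():
    # ask_chatbot() is the same hypothetical wrapper as in the previous sketch.
    response = ask_chatbot("Who has the right of way at a four-way stop?")
    target = "The vehicle that arrives first at a four-way stop goes first."
    embeddings = model.encode([response, target], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    # Passes if the two texts are sufficiently close; the threshold is a judgment call.
    assert similarity >= 0.7
```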

Abigail Haddad [00:09:39]: Okay, so finally, this is what I'm excited about: LLM-led evals. This is where you write a specific test for whatever it is you're looking for and you let an LLM do your evaluation for you. And this doesn't need to be the LLM you're actually using for your tool. You might use an open source model or a smaller LLM for your actual tool, right, because you've got a lot of users and you don't want the cost. But you might still want to use GPT-4 for your test cases, because it's not going to cost very much to actually run your test cases, and it's going to be a very good model. So what does this look like? What does writing an LLM-led eval look like? It looks like basically whatever you want it to look like. So this is one I use for text closeness. This is: how close is the text of the tool output to the text I wanted it to output?

Abigail Haddad [00:10:22]: And this gives an answer on a scale of one to ten. Again, set a threshold wherever you want. Here's another one. You can write an actual grading rubric for each of your tests. So this is a grading rubric for a set of instructions, where it passes if it contains all seven of these pretty specific steps and fails otherwise. And I'm using a package called Marvin here, which I really recommend, and it makes getting precise, structured outputs from OpenAI models really as easy as writing this rubric. So you really do just write the rubric, it's one line of code to pass it, and you can get a pass or fail, and that's it.
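
She does this through Marvin; as a rough sketch of the same rubric-as-prompt idea without Marvin, here is one way it could look using the OpenAI client directly (this is not Marvin's API, and the rubric steps, grade_with_llm() helper, and ask_chatbot() wrapper are all illustrative assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

RUBRIC = """You are grading a chatbot's answer against a rubric.
The answer passes only if it includes ALL of the required steps:
1. <step one>  2. <step two>  ...  7. <step seven>
Reply with exactly one word: PASS or FAIL."""

def grade_with_llm(answer: str) -> str:
    # Use a strong model for grading even if the tool itself runs a smaller one.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": answer},
        ],
    )
    return completion.choices[0].message.content.strip().upper()

def test_instructions_include_all_required_steps():
    response = ask_chatbot("How do I file a claim?")  # hypothetical wrapper, as above
    assert grade_with_llm(response) == "PASS"
```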

Abigail Haddad [00:10:55]: Or you can write rubrics which return scores. So, like, ten points for each of the steps if the answer includes it, and it needs a 60 or a 70 to pass, however you want to do it. And again, this is a level of detail which you're just not going to get using string matching or semantic similarity. Like, trying to write this as a pass/fail using those would not be two lines of code. Okay, here are a couple of other ideas. These are both things that are actually being done.

Abigail Haddad [00:11:23]: There's a startup called Athena AI, which is doing some really interesting things with LLM-led evals. And the first is: is this answering the question that was asked? So this isn't about accuracy. It's basically about completeness. You ask a question, and did the answer have the types of things that it should have? And then they have another one that's specifically for assessing RAG tools, which is: the answer that the LLM gave, is the content it contains something that was actually passed to it in the context window? So you've got your vector database, you've got your chunk of text, you pass some text in. That's the information.

Abigail Haddad [00:11:56]: Did it make something up, or was it actually just using that? And those are also really cool. So, ship it? Yeah, totally ship it. You write some tests, you treat your LLM tool like real software, because it is real software, and it works. Right. Okay, so I'm at the end of my time, but thank you again for coming. I wanted to mention again real quick the two products that I'm in no way affiliated with, and I can be enthusiastic about them because they're not paying me to say this, I'm just saying it. So, again, Marvin AI is a Python library.

Abigail Haddad [00:12:24]: You can do very quick evaluation rubrics, classification, or scoring, and it'll manage your API interactions and transform your rubrics into the full prompts that it sends to the OpenAI models. And then Athena AI is also doing some really cool things with model evals, also specifically for RAG. So this is me on LinkedIn. You can check me out there if you're interested. Here's my Substack, and there'll be a slightly longer piece on this topic next week, where I will also share some code for all of these methods. So that's all I got.

Demetrios [00:12:51]: "Ship it" is great. Wow. You made my job very easy. You did it in perfect timing. I love it. That is so cool. I think there are so many great pieces coming through in the chat about this, so if you're in there, you might want to answer a few questions and ask Abigail some of those questions.

Demetrios [00:13:25]: Abigail, thank you so much. I really appreciate you humoring me with all of the different stuff that we've been up to since you jumped on the stream in the background, and also humoring me with this evaluation metric stuff that I went over beforehand.

Abigail Haddad [00:13:43]: Thanks for having me.

