Evaluating Large Language Models for Production
Zairah is a Staff Data Scientist at You.com, an AI chatbot for search, where she leads data and analytics. Previously, Zairah was a Data Scientist at IBM Watson and obtained her M.S. in Computer Science from the University of Pennsylvania. Zairah is a digital nomad and enjoys adventure sports and poetry.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In the rapidly evolving field of natural language processing, the evaluation of Large Language Models (LLMs) has become a critical area of focus. We will explore the importance of a robust evaluation strategy for LLMs and the challenges associated with traditional metrics such as ROUGE and BLEU. We will conclude the talk with some nontraditional metrics, such as correctness, faithfulness, and freshness, that are becoming increasingly important in the evaluation of LLMs.
AI in Production
Demetrios [00:00:05]: Oh, hello. You are now on the stage. Zairah, how are you doing? Can you see me? Can you hear me?
Zairah Mustahsan [00:00:16]: I think I'm on mute now.
Adam Becker [00:00:18]: Yes, you are unmuted. I saw an instrument. Did I not see an instrument?
Zairah Mustahsan [00:00:23]: I got inspired by your musical skills. I was like, oh, maybe I can set one like mine up. But it's not tuned, so I can't do anything right now.
Adam Becker [00:00:34]: Yeah. Okay, so what are you playing?
Zairah Mustahsan [00:00:37]: I can show it off.
Adam Becker [00:00:41]: Please. We would love that. This is the fun part about improvising.
Zairah Mustahsan [00:00:46]: I don't know if you can see it with the virtual background, but this is my instrument. It's called the Rabab. It's like 2,500 years old, and it originated in Central Asia, in Afghanistan.
Adam Becker [00:01:01]: Wow.
Zairah Mustahsan [00:01:02]: And it's like a string instrument. Yeah, it's not tuned right now, so I won't be able to play.
Adam Becker [00:01:10]: But it sounds pretty. Now I totally understand why you can't just tune it up in a flash. Yeah, that looks like it takes some dedication to tune.
Zairah Mustahsan [00:01:21]: Yeah, it takes me like 10 to 15 minutes to get that done.
Adam Becker [00:01:26]: All right, well, are you open to jumping on? We have Annie in the background. But Annie, I'll let you continue to figure out if you can share your screen. And Zairah, can you share your screen now? The moment of truth, does it work for you?
Zairah Mustahsan [00:01:45]: Let me see if I can.
Adam Becker [00:01:49]: And while you're doing that, I'm going to remind everyone, if you want more of this music that we've got, here are the prompt, template, lyrics, and video, so grab that while you can. All right.
Zairah Mustahsan [00:02:04]: Okay. Can you see my screen?
Adam Becker [00:02:06]: I do. Can you see it?
Zairah Mustahsan [00:02:10]: Yeah, I don't know what happened. If I do a slideshow. Okay.
Adam Becker [00:02:15]: All right. And for those, since this is an evaluation one, I'm going to recommend anyone that's into evaluation. We've got a survey that is out and we would love it if you go ahead and fill it out. So, Zairah, I lost your screen.
Zairah Mustahsan [00:02:35]: Oh, really?
Adam Becker [00:02:37]: Yeah, this is the best part. Somebody in the chat just said we need a lightning talk on how to share screens. Yeah. Hopefully we can get some AI agents to help us out soon enough. But you're going to have to share your screen again, Zairah, and I am going to share it again.
Zairah Mustahsan [00:03:07]: How about now?
Adam Becker [00:03:09]: Here we go. I'll be off the screen and I'll let you get ten minutes. See you in a sec.
Zairah Mustahsan [00:03:15]: All right. Hello, everyone. I'm very excited to be at the MLOps Community conference. I've been watching the videos from the background for a very long time, so it's an inspiring community, and I'm very thrilled to be here as a speaker today. Quick introduction: I'm Zairah. I'm a Staff Data Scientist at You.com.
Zairah Mustahsan [00:03:35]: It's a chatbot search engine, which you might know about, but today we are mainly going to be focusing on general practices for evaluating large language models. I know that evaluation is a big thing with LLMs, and it's a very nascent space, so I'll try to summarize what I have seen people do in the industry, the best practices, and the challenges. So first of all, why benchmark? One reason is that we want to understand the strengths and weaknesses of different models. There are new models coming out pretty much every day these days, so how do we know which ones to use? Especially when you're thinking about putting AI into production, should you have some kind of strategy to swap in new models? Or, if you know one model has certain strengths in your application, do you create a pipeline where questions in that area get answered by that model while another model takes the lead on other things, and so on and so forth? So benchmarking helps you understand your systems much more deeply and make the right decisions.
Zairah Mustahsan [00:05:00]: The other is that it helps you with quality assurance in general. Especially when you're providing an application or a service to users who are actually using it. Say, for example, you have an app which is helping people write emails for sales campaigns or something of that nature. That's a very specific task, and you can maybe fine-tune your models and then benchmark to see if the output is high quality for that task. Or maybe you want to keep it more general and be able to answer all types of questions, and you do benchmarking for that, and so on and so forth. Ultimately, if you look at this kind of list, it's about improving the user experience. What is it that you, as an orchestrator of LLMs, are able to do to take things off the users' plate and provide them answers that are high quality? So just to zoom back a little bit: before we came into this world of LLMs, we traditionally used metrics that relied mainly on the arrangement and order of words.
Zairah Mustahsan [00:06:18]: So some of those metrics were, and this is obviously not an exhaustive list, just a high-level summary: bilingual evaluation understudy (BLEU), exact match, and other kinds of metrics that relied more on the arrangement and order of words. What I mean by that is, for example, BLEU measures how many n-grams in the generated text appear in the reference text. So if you look at this example here, let's say you have some kind of translation task. You have a query or a sentence in one particular language, and your model is translating it into another language. How do you assess the quality of this translation? Typically, with those kinds of traditional mechanisms, you would have some kind of ground truth that you would compare your model's translated outputs to. And in this case, you can see that they're not exactly the same.
Zairah Mustahsan [00:07:24]: But the way the metric measures it, let's say it was doing a unigram precision, is that every word in the translated text, which is "the mat is under the cat," is present in the ground truth. So in this case, the metric would be 1.0, perhaps indicating that your translation model was working very well. This is just one example where it fails, but there are other cases as well that are well documented in the literature. So these are the kinds of things that were not well captured by those traditional metrics. And so then we came to this world where we have more embedding-based metrics, which took away this problem of not having a semantic understanding of words, text, and sentences. Some of those semantic-based metrics were BERTScore, MoverScore, or XMoverScore, which is more multilingual. While all of these scores did and still do perform in a different dimension than the ones whose job is only to look at word combinations, these metrics still fail to capture the quality of the output of large language models.
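To make that failure mode concrete, here is a minimal sketch of unigram precision, the simplest ingredient of BLEU. The reference sentence is an assumption, since the slide itself isn't reproduced in the transcript.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference (with clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

# Assumed reference sentence; the candidate is the translation quoted in the talk.
reference = "the cat is under the mat"
candidate = "the mat is under the cat"
print(unigram_precision(candidate, reference))  # 1.0, even though the meaning is reversed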
Zairah Mustahsan [00:08:54]: Some of the reasons are the following. For example, these embedding-based metrics still struggle to capture the richness, or maybe the thematic complexity, that LLMs are able to produce these days. Usually, if you input a query or use an API for an LLM, you'll see that it produces quite long-form text, and some of those embedding metrics fail to capture the full context of things. The other place where those metrics fail: one of the use cases where I've seen LLMs being used a lot is that people will have a certain text in one tone, and they want to convert it into another tone or style; they want to personalize written text in their own style or in someone else's style. Those kinds of things are also not very well captured by embedding-based metrics.
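For the embedding-based family, a sketch along these lines could compute BERTScore with the open-source bert-score package; the package call and the example strings are my assumptions, not something shown in the talk.

```python
# pip install bert-score
from bert_score import score

candidates = ["The mat is under the cat."]
references = ["The cat is under the mat."]

# BERTScore compares contextual embeddings rather than raw word overlap,
# and returns precision, recall, and F1 tensors per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```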
Zairah Mustahsan [00:10:01]: And then the other is that the embedding-based metrics also rely heavily on perhaps outdated, pretrained models whose quality diminishes over time, or there is a distribution shift, and so on and so forth. So knowing that there are certain shortcomings in traditional metrics and in some of these embedding-based metrics, how do we then go about measuring the quality of LLMs? Some of the things that I have seen come up a lot in the last year are using LLMs themselves to evaluate other LLMs. This could be either GPTScore or G-Eval, both of them similar at a high level. For example, with GPTScore you can ask the LLM to look at different aspects of the text and rate that text on the different things that you're measuring, for example tone or clarity or informativeness, whatever matters for your application. Maybe you give it a few examples, a few shots, to help it go in the right direction.
Zairah Mustahsan [00:11:13]: The other is G-Eval, where you instruct the LLM to assign a score for the text on a range, so on a range of 1 to 5: what is the informativeness score of this answer, given this question? Something like that. While these are going in the right direction, circumventing or addressing some of the challenges in the previously mentioned metrics, there are still challenges that the community is actively facing. For example, LLMs are stochastic in nature. Every time you ask the LLM a certain question, even if it's to rate or score some kind of output, let's say you use an LLM as a judge, it's maybe going to come back with a lot of variance in those scores. So that's a challenge. The second is that there is some research out there that has shown that GPT-4 has some kind of preference for its own manner of responses.
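As an illustration of the G-Eval-style scoring described here, a judge could be wired up roughly like the sketch below, using the OpenAI Python client; the judge model, prompt wording, and helper name are illustrative assumptions.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_informativeness(question: str, answer: str) -> str:
    """Ask an LLM judge to rate one answer's informativeness on a 1-to-5 scale."""
    prompt = (
        "You are evaluating the informativeness of an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "On a scale of 1 to 5, how informative is the answer? Reply with the number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variance
    )
    return response.choices[0].message.content.strip()
```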
Zairah Mustahsan [00:12:19]: So, because of that self-preference, maybe it will give some kind of biased answer. There is also research people have done showing that there is some kind of positional bias in LLMs when assigning a preference score. And then finally, something I've seen most recently from industry practitioners, and this is something that I've also observed in my own work: if you ask the LLM to rank a certain answer on a scale of one to five, for example, usually you'll see those rankings are not very meaningful. So it's better to instruct the LLM to predict whether answer A was better than answer B, instead of ranking both of them on a scale of one to five; prefer the pairwise comparison over the absolute ranking. At least that's what I have seen come up in the recent evaluation space in the industry. So, yeah, I think we continue to see a lot of innovation and challenges in the world of being able to evaluate LLMs. And it's kind of fascinating to see that within a span of just a few years, maybe a decade, there's so much that we have done and there is so much to look forward to.
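The preference for pairwise judgments, plus the positional-bias caveat, could be sketched like this, reusing the OpenAI client from the previous example; the prompts and helper names are hypothetical.

```python
def judge_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which of two answers is better; expects a reply of 'A' or 'B'."""
    prompt = (
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

def judge_with_swap(question: str, answer_a: str, answer_b: str) -> str:
    """Run the comparison in both orders to guard against positional bias."""
    first = judge_pairwise(question, answer_a, answer_b)
    second = judge_pairwise(question, answer_b, answer_a)  # positions swapped
    if first == "A" and second == "B":
        return "A"
    if first == "B" and second == "A":
        return "B"
    return "tie"  # disagreement suggests positional bias or judge noise
```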
Zairah Mustahsan [00:13:39]: To summarize, I'd say that traditional metrics usually evaluate on a large open-source dataset, and they're evaluating a very specific kind of task, like summarization or question answering. Then there is the other kind of evaluation that you can also do, which is very important in my opinion, which is human feedback, where you either ask the users to give you some kind of thumbs up, thumbs down voting, or you use third-party labelers dedicated to labeling certain data for you. This is actually a very crucial step in my opinion, and it's costly, but I think it pays long-term returns. And then of course there are the LLM-based evaluations that I mentioned, where you use an LLM as a judge to measure certain things in your answers, like correctness, clarity, informativeness, and things like that. Usually I'll say that wherever we are in the LLM evaluation space, coming up with custom metrics for your application is a good practice and is probably what you need. There are some open-source libraries as well that we can use; I think Ragas is one, and there are a few more.
Zairah Mustahsan [00:15:05]: But having said that, with the state we're in with LLMs and so much innovation happening, I do think it's a better investment to come up with and implement your own custom metrics for your application, whether that's a combination of traditional and LLM-based. I think that's the way to go. So, yeah, I think that is my talk. Just a quick rundown on evaluation of LLMs.
Adam Becker [00:15:35]: Excellent. Okay, very cool. So I have one question, because evaluation is so top of mind for me and a lot of people in the community. Have you seen any of these methods work better depending on your use case? I think one thing is pretty clear, right? Like if you're evaluating code, it's a little bit easier because it works or it doesn't work. And then there's different use cases that potentially can warrant different evaluation techniques. And so have you seen any of that being clear?
Zairah Mustahsan [00:16:09]: It depends on what your goal is. For example, one of the features that we have on You.com, or that others may have on their platforms as well, is to do something like what we call deep research on topics. For that deep research, we have generally seen that it's a lot better to measure something like comprehensiveness than to measure something else. So, yeah, if you ask an LLM to be a judge and ask, oh, was this answer comprehensive enough, I think that is more tailored to what you're trying to achieve versus asking it to do something random and more generic.
Adam Becker [00:16:51]: Yeah. Awesome. And the other thing is, have you seen people doing many different types of evaluation, like stringing a few of these together to try and figure out what the best is? And then how do you weigh each evaluation technique or metric?
Zairah Mustahsan [00:17:10]: Yeah, usually you have to do some kind of spot checking and hand labeling. So let's say you have four or five different use cases in your application. Each one of those typically will have its own suite of metrics. And then you can either take a basic average, or, typically in the industry, what we do is take a harmonic mean of those. So if you're measuring your application on a suite of five metrics, you take a harmonic mean so that it gives an equal weight to each one of them. But then if you have something like a deep research mode where you care about comprehensiveness a lot more, you probably want to give that a lot more weight than some other metric, which is why I say that custom metrics are the way to go here. I don't think there is a one-size-fits-all.
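A small sketch of that aggregation, assuming made-up metric names and scores: an unweighted harmonic mean treats every metric equally, while passing weights lets a mode like deep research upweight comprehensiveness.

```python
def harmonic_mean(scores: list[float], weights: list[float] | None = None) -> float:
    """Weighted harmonic mean; with no weights, every metric counts equally."""
    if weights is None:
        weights = [1.0] * len(scores)
    if any(s <= 0 for s in scores):
        raise ValueError("harmonic mean requires strictly positive scores")
    return sum(weights) / sum(w / s for w, s in zip(weights, scores))

# Hypothetical per-metric scores for one application mode.
scores = {"correctness": 0.9, "clarity": 0.8, "comprehensiveness": 0.6}

print(harmonic_mean(list(scores.values())))                   # equal weighting
print(harmonic_mean(list(scores.values()), [1.0, 1.0, 3.0]))  # upweight comprehensiveness
```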
Adam Becker [00:18:02]: Excellent. There are some really cool questions that are coming through the chat. Sadly, we don't have time to answer them. I think, Zairah, you're on Gradual or you're on the platform, so I'm going to ask you to jump into the chat and answer some of those. I'm also going to be giving a talk about evaluation and the survey that we did. As a reminder, everyone, this survey is live and you can scan the link right now and tell us how you're doing evaluation, and if there are any tricks that you've seen. Zairah told me she's going to fill out the survey.
Adam Becker [00:18:35]: Even so, you're going to be in good company. And thank you so much. It's brilliant to hear about this, because I think it is so top of mind for so many people. Since we haven't really figured it out, it's almost like we're all comparing notes, right? So that's the fun part.
Zairah Mustahsan [00:18:52]: Yeah. Well, thank you for having me here. It was a pleasure being here. I can't find the chat. Where can I kind of interact with the questions that I'm getting?
Adam Becker [00:19:01]: I'll drop in the chat that we have. I'll drop you the link to the chat where everybody else is at.
Zairah Mustahsan [00:19:05]: Okay, perfect. Thank you.