Evaluating Language Models
Matthew Sharp is the author of LLMs in Production, published through Manning Publications. He has worked in ML/AI for over ten years, building machine learning platforms for startups and large tech companies alike. His career has focused mainly on deploying models to production.
At the moment, Demetrios is immersing himself in machine learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Matt talks about the challenges of evaluating language models, how to address them, what metrics you can use, and the datasets available. He discusses the difficulties of continuous evaluation in production and common pitfalls. Takeaways: a call to action to contribute to public evaluation datasets and a more concerted effort from the community to reduce harmful bias.
AI in Production
Slides: https://docs.google.com/presentation/d/1SnphsSgsF6mnNcvE8lUfVfPAuSuTE3ia/edit?usp=drive_link
Demetrios [00:00:05]: I'm gonna bring up our next speaker. Where are you at, Matt? There he is. Oh, boy. How you doing, dude?
Matthew Sharp [00:00:16]: Good. Life is good.
Demetrios [00:00:18]: So I gotta just introduce you real fast, because for those that do not know, Matt literally wrote the book on this. You ended up calling it LLMOps or LLMs in Production, I can't remember. I remember there was some contemplation around the name, and so you called it LLMs in Production. I'm going to drop a link to the book in the chat in case anyone wants to have a looksy while you're talking. But you got ten minutes, man.
Demetrios [00:00:54]: I'll throw it on the clock here, and we can get rocking and rolling.
Matthew Sharp [00:00:59]: All right. Let me share my screen.
Demetrios [00:01:03]: This is the fun part.
Matthew Sharp [00:01:06]: Is it working?
Demetrios [00:01:07]: It is. Oh, I love it. Because after you talk about evaluation, I'm going to talk about evaluation, and maybe I'll just have you hang out with me if you have an extra five minutes and we can talk about the survey that we did, which, by the way, folks, if you have not done the survey yet, this is a perfect opportunity. Hopefully, Matt will inspire you. Go scan that QR code real fast and take our evaluation survey, because we're going to be writing all kinds of reports on it. All right, Matt, I'll get out of the way. I'll let you cook, my man. Here we go.
Matthew Sharp [00:01:44]: Yeah. So this has kind of been the theme a lot, evaluating language models. I kind of already introduced myself, but I'm Matt Sharp. I'm an MLOps engineer. I've been doing it for quite a long time; I've been in the data space ten years. I currently work at a startup called LTK, and I'm the author of LLMs in Production. So why did I want to talk about this? If you've been paying attention at all, there are benchmarks, and they're just everywhere.
Matthew Sharp [00:02:16]: There are so many benchmarks out there, it feels like every single time someone trains a new LLM, they go out and run it against benchmarks, and they run it against all of them until they find one that it's number one in. And then they say, hey, my LLM is the best at this benchmark. And it's like, cool. Okay, what does that mean? A lot of people haven't really dug into the benchmarks, and so that's kind of what I wanted to do a little bit here. So, benchmarks are waypoints. I'm a big Zelda fan. If you've played Zelda, you might understand this: you see something off in the distance, and you say, hey, I want to go over there. And so you set a waypoint there, and you follow your map to get there.
Matthew Sharp [00:03:00]: And there are lots of different interesting spaces in natural language and the language space, and we have set waypoints, which are our benchmarks. The industry and community tend to overengineer and move toward these benchmarks, making our models better at beating them. And so it's really important that we set them correctly. So what makes a benchmark? It's essentially two things: we have a metric and a data set. However, a good metric is really difficult when it comes to language models, because how do you compare text? Here I asked two different models, who is Steamboat Willie? And they gave different answers. Both of them are pretty good.
Matthew Sharp [00:03:46]: They kind of focus on different facts. But how do you really compare which one is better than the other? And so that's where metrics come in. So, the very classic BLEU versus ROUGE, or if you're French, just blue versus red. These have been around for 20 years. Very classic. Unfortunately, they kind of have this reputation of not being very good. But to be fair, these are actually still some of the best metrics we have, and I think we're not using them enough. BLEU is a precision metric.
Matthew Sharp [00:04:24]: How many words in the generated text appear in the ground truth? ROUGE is the opposite; it's a recall metric: how many of the human reference words appear in the generated text? I'm not going to go much deeper, because I don't have a lot of time, but there are some issues with these. Mainly, we have to come up with a human ground truth, among other things. And so we've kind of moved on to regression metrics, like toxicity in particular, which is used a lot. It's just a regression model. We had data saying, hey, this is toxic.
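To make the precision/recall distinction concrete, here is a minimal sketch of unigram overlap in plain Python. It is an illustration only, not what any particular benchmark uses: real BLEU and ROUGE implementations work over n-grams and add brevity penalties, smoothing, multiple references, and several variants.

```python
# Illustrative only: unigram precision (BLEU-like) and recall (ROUGE-1-like).
# Real BLEU/ROUGE use n-grams, brevity penalties, and multiple references.
from collections import Counter

def unigram_overlap(generated: str, reference: str) -> tuple[float, float]:
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())               # shared words, with counts
    precision = overlap / max(sum(gen.values()), 1)   # BLEU-style: generated words found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style: reference words found in the generation
    return precision, recall

print(unigram_overlap(
    "steamboat willie is an early mickey mouse cartoon",
    "steamboat willie is a 1928 animated short film starring mickey mouse",
))
```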
Matthew Sharp [00:04:57]: We had data saying, this isn't toxic. And then we train the model, and then we say, hey, this new generated text, is it toxic or not? And it gives it a score. That happened a lot in the video game industry, and it's happened a lot in forums like Reddit, because they care about that a lot where there's human-to-human interaction. Then we have sentiment, which is what happens at Amazon: hey, is this product good or bad? Are the reviews good or bad? And then there are other ones: hurtfulness, reading level, complexity. There's actually a whole bunch of these different regression metrics. However, you have to be careful, because toxicity in the video game industry is going to look very different from a business use case, like an email that says "per my last email" or other things like that.
Matthew Sharp [00:05:40]: That can be very toxic, but it's business language, so the model isn't going to pick up the same things. And we do have other, descriptive metrics as well. These are actually really solid metrics because they just get down to what the text is, like word count and word length. However, they often don't tell us a lot. But despite all of these metrics, and we've come up with so many good ones, what do you think we use when it comes to benchmarks? Well, we don't use any of these. In fact, what we do is just multiple choice. Essentially, in these benchmarks, we give the model a question and say, hey, is it A, B, C, or D? And then we ask, which one is it? And for me as an MLOps person, this is a big problem, and you can come up with lots of different reasons for it.
Matthew Sharp [00:06:34]: But the biggest one, in my opinion, is the fact that it doesn't model what's actually happening in production. When I go on and talk to ChatGPT, I have never once asked a multiple choice question. And so these are actually, in my opinion, just very poor benchmarks. And just to kind of prove it: the Open LLM Leaderboard on Hugging Face, their main leaderboard, these are the main metrics. It goes ARC, multiple choice; HellaSwag, multiple choice; MMLU, multiple choice; all the way to grade school math, which is actually an exact match, because it's like, two plus two equals four, does it come up with four? So none of these are very good, but we tend to like them because they give us that almighty accuracy score, versus ROUGE and BLEU.
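As a rough sketch of what those leaderboard numbers reduce to, here is an illustrative multiple-choice evaluation loop. The questions and the `ask_model` stub are made up for the example; real harnesses such as EleutherAI's lm-evaluation-harness usually compare the model's log-likelihood for each answer choice rather than parsing a letter, but either way the headline number is an exact-match accuracy.

```python
# Illustrative multiple-choice eval: the score is exact-match accuracy on A/B/C/D.
questions = [
    {"prompt": "2 + 2 = ?\nA) 3  B) 4  C) 5  D) 22", "answer": "B"},
    {"prompt": "The capital of France is?\nA) Rome  B) Berlin  C) Paris  D) Madrid", "answer": "C"},
]

def ask_model(prompt: str) -> str:
    # Dummy stand-in that always answers "B"; replace with a real model or API call.
    return "B"

def multiple_choice_accuracy(items) -> float:
    correct = 0
    for item in items:
        prediction = ask_model(item["prompt"]).strip().upper()[:1]  # keep only the first letter
        correct += prediction == item["answer"]
    return correct / len(items)

print(multiple_choice_accuracy(questions))  # 0.5 with the dummy model above
```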
Matthew Sharp [00:07:25]: Unfortunately, this is kind of where we are. So the next part is the data sets. As you can imagine, the overwhelming majority of our benchmarks are just standardized tests. This didn't always use to be the case. We used to have some amazing data sets that just don't matter anymore because LLMs beat them all very quickly. We created tests to evaluate the things language models were bad at. But because the industry has been moving so quickly, people are just like, oh, let's throw the SAT at it, or the ACT, or let's see if it can become a lawyer or a doctor, go through all those different tests out there. And ultimately this is actually really lazy.
Matthew Sharp [00:08:09]: We're not taking the time to really evaluate where the models are good or bad, and one of the big gaps is that we don't have a lot of benchmarks around responsible AI, even though we actually know how to do this pretty well. There are some good data sets on this. Essentially, what we do is give the model a prompt to complete. If we wanted to probe gender, to make sure our models are responsible in that space, we'd say, hey, complete the sentence with a profession: "The woman is a...", "The man is a...". If it says nurse versus doctor, that tells you something about the model's assumptions. And then generally we would use those regression metrics, maybe toxicity, to compare. Though how do you decide whether these results are toxic? It's pretty difficult.
Matthew Sharp [00:09:00]: And so we still need some better metrics when we look at this. For responsible AI, there are some great data sets: there's WinoBias, there's CALM, there's BBQ. But honestly, there's not a lot. I've read lots of different white papers where people have really gone into, hey, let's look at this for politics, let's look at this for religions. They use much the same approach: "A Muslim man is...", "A Catholic man is...". And that's a great way to see, hey, are these models biased toward certain religious groups or occupations or stereotypes? Unfortunately, a lot of these white papers aren't releasing their data sets, so the open source community is really struggling, and we don't have very good benchmarks around this stuff. Which brings us to the next point.
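For illustration, here is a minimal sketch of the probing approach described above: fill in templated prompts, generate completions, and score them with an off-the-shelf toxicity classifier. The templates, the `generate` stub, and the specific classifier model named below are assumptions for the example, not something taken from the talk or from the datasets mentioned.

```python
# Sketch of a bias probe: templated prompts -> completions -> toxicity scores.
from transformers import pipeline

templates = ["The woman is a", "The man is a", "A Muslim man is", "A Catholic man is"]

def generate(prompt: str) -> str:
    # Dummy stand-in; replace with your actual LLM or API call.
    return prompt + " person who works hard."

# Example publicly available toxicity model; swap in whichever classifier you trust.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

for prompt in templates:
    completion = generate(prompt)
    result = toxicity(completion)[0]  # e.g. {"label": "toxic", "score": 0.01}
    print(f"{prompt!r:24} -> {completion!r}  {result['label']}={result['score']:.3f}")
```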
Matthew Sharp [00:09:50]: So, monitoring in production. Natural language is constantly changing, and one of the most common ways to monitor your model in production is to say, hey, we'll just run the benchmark every week and see if the score is moving and changing. And it is going to be changing: we introduce new slang, new technology, cultural shifts. But the big problem with this is data leakage. Overwhelmingly, our benchmarks are getting worse, and that's because we're overfitting to the benchmarks.
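Here is a minimal sketch of that weekly-benchmark style of monitoring, assuming a hypothetical `run_benchmark` function and a local JSON file for history; a real setup would plug the same idea into whatever scheduler and observability stack you already run.

```python
# Sketch: rerun a fixed eval on a schedule, store the score, and flag drops.
import json, datetime, pathlib

HISTORY = pathlib.Path("eval_history.json")
DRIFT_THRESHOLD = 0.05  # illustrative: flag an absolute drop of more than 5 points

def run_benchmark() -> float:
    # Replace with your real evaluation, e.g. accuracy on a held-out eval set.
    return 0.82

def check_drift() -> None:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    score = run_benchmark()
    if history and history[-1]["score"] - score > DRIFT_THRESHOLD:
        print(f"ALERT: score dropped from {history[-1]['score']:.2f} to {score:.2f}")
    history.append({"date": datetime.date.today().isoformat(), "score": score})
    HISTORY.write_text(json.dumps(history, indent=2))

if __name__ == "__main__":
    check_drift()  # run weekly via cron, Airflow, or similar
```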
Matthew Sharp [00:10:24]: We keep saying, hey, the model is doing badly in this spot the benchmark found, so we're going to go retrain and fine-tune our models and beat the benchmark. It's a constant battle. You've probably faced this with a lot of other AI models, but this is where we are. So where does that leave the state of affairs? Our metrics are essentially being ignored: we have so many of them, but we're just doing multiple choice questions. Our data sets are pretty much thrown together last minute; researchers don't take the time to really create great data sets.
Matthew Sharp [00:11:00]: So they go and find some standardized test to run against the model. Our benchmarks, ultimately, because of the other two, are just very unimpressive, and we're not really getting good benchmarks in the areas people tend to care about, like responsible AI. And to compound all this, we're optimizing to beat these terrible benchmarks. So that brings up the question: is this the signpost we really want, or should we set new waypoints? I just want to call out that this is a problem. We need everyone in on it. We need everyone to really start thinking about how to make better metrics, better data sets, and better benchmarks. Thank you.
Matthew Sharp [00:11:44]: If you scan this QR code, you can find my book. For everyone here, Manning has given me a discount code where you can get 45% off my book. I hope you enjoy it. Thank you.
Demetrios [00:12:03]: All right. Don't think you're going anywhere anytime soon. Man, this is so good. And it leads perfectly into what I'm about to talk about, because basically I wanted to go over the evaluation survey that we did a few months ago, and I know you were kind enough to help me with the new questions on the new evaluation survey that we've got out. So while everybody's got their QR code scanning apparatus out, I would recommend that you take a second and also scan that QR code. But the interesting piece here, man, is you talked about a few things that are so top of mind for me. First of all, the metrics. And can you go back to that last slide before the Anchorman one? Which, props on the Anchorman one, it's absolutely awesome.
Demetrios [00:12:57]: This breaks it down so well. It just shows everything that is messed up with how we're trying to do evaluations in one very clean slide. It's like, metrics: meh. Data sets and benchmarks: could be better. And models? Well, nah. So I'm going to share my screen real fast and we can go over this evaluation survey that we did so that you can see what's going on there.
Demetrios [00:13:35]: I think it's probably. Hold on a sec, let me share this. And you should be seeing a Google sheet right now. Yeah, I'll make it bigger so that everybody can see it nice and big. Here's how we got here. Here's all the raw data. Anyone can see this raw data and play around with it, see what people are saying. I mean, you'll come to the conclusion that Matt shared on that slide and what we have here is, here's the answers.
Demetrios [00:14:07]: Or it's basically going through and showing what the questions were. And so again, this was version one. If you want version two, we've got the evaluation survey that we're doing here, and you can scan that and do it. But for version one, we were just asking some basic stuff like how big is your organization, what industry, all that fun stuff. Here's version two. Boom. What I came across: I tried to put together my insights from this, and after digging through the data, what I found was that when it comes to LLMs and AI in general, there's a lot of budget being allocated. So of course that follows the hype, right? We've seen that.
Demetrios [00:14:59]: Then we've got: is the budget coming from the existing ML budget, or is it a whole new budget? The wild part was that 45% of respondents said they were using existing budget, and then the remaining, whatever, 40 or 41%, I can't do my math, I don't do public math, said it's a whole new budget. So that is cool. That is really cool to see. And that makes sense; there's a lot of VC money pouring into this, thinking there's a whole gold rush happening. The model size was super interesting too, because what people are using plays a huge factor in how we're evaluating things. If you're evaluating GPT-4, that's different than if you're evaluating something you have a bit more control over, like a fine-tuned Mistral or Mixtral or whatever, right?
Demetrios [00:15:59]: People basically were saying they're using OpenAI or they're using some kind of fine-tuned open source model. And we saw a lot less of, we're using Cohere, we're using Anthropic, which I thought was pretty fascinating. Seems like plus one for the open source team. That's cool.
Matthew Sharp [00:16:24]: But love that.
Demetrios [00:16:26]: Yeah, it is nice to see. And finally, it feels like there are some actual quality open source models out there. What you see, though, is that there's a trade-off, especially if you're dealing with the 7B models. People talked about how you can't really have it all; you can't have your cake and eat it too. You can have a fast model, a cheap model, or an accurate model, but it can't really be fast, cheap, and accurate, at least not at this time. You can only have two out of the three. So if you're using OpenAI, maybe you're not going to get the fast part, because the model is so big, unless you're doing GPT-3.5 Turbo.
Demetrios [00:17:18]: Right. And then here's what I think is super important to touch on when it comes to performance: what are people evaluating when they're looking at evaluation? It's all about accuracy. And then if you look at it, hallucinations and truthfulness are the other two, like twins. So if you put accuracy, hallucinations, and truthfulness all together, that is top of the top, right? That is what is super important for everyone as they're looking at it. And that made me think about something you were talking about, when you said the metrics and the data sets feel a little bit off, because you can ask two different models the same question, like you asked about the Mickey Mouse one. What was your question?
Matthew Sharp [00:18:19]: Steamboat Willie. Yeah. Who is Steamboat Willie?
Demetrios [00:18:21]: That's it. Steamboat Willie. And both seemed all right, but how do you decide which one's better, and how do you decide if it actually is good? That's where things start to really go haywire, right? You start to see, like, oh, is this good or is it not? And then when there is toxicity involved, you were talking about how at work you can have a whole different type of toxicity, and it probably won't pass as toxic to a model at all. And so there's no data set or metric around that kind of toxicity. A workplace-toxicity data set, I would love to see that. If somebody wants to come up with
Matthew Sharp [00:19:04]: That shit, I would love it.
Demetrios [00:19:06]: There's no like, hey, we need this by Monday. Oh, that's toxic. That kind of thing. Or, where's your reports?
Matthew Sharp [00:19:13]: So I was going to make the joke that if you follow Britney Spears, apparently toxic is a good thing. You know, in different spaces, being toxic is different, right? The words you use, how you act, it's going to be different.
Demetrios [00:19:30]: All of that. Yeah. For anybody that just wants to read the blog post that I wrote, here is the link. I just dropped it in the chat. And then you've got the metrics, you've got the data sets, these two things. One thing that's fascinating, when it comes to the metrics you were talking about, is that different metrics feel like they are more useful depending on your use case.
Matthew Sharp [00:19:58]: Absolutely.
Demetrios [00:19:59]: Yeah. Of the ones people were saying they were using, one thing that was pretty wild was, I think, one or two people said they were using HellaSwag. So you can go on here and really dig through all of these questions and go through it. But, yeah, I think there were very few people that said they were using HellaSwag. Let's see. Let's actually go into it. So it's like, what types of LLMs? What types of... Here? Where is it? Can you see that? All right? Or is it super small? I'll make it a little bigger for you.
Demetrios [00:20:41]: Here we go. One person said HellaSwag. So, yeah, pretty wild. You can see that it's classification, basically: accuracy, precision, recall, F1. Which, yeah, if you have certain use cases, that's great, but for a lot of other use cases, that's kind of worthless, right?
Matthew Sharp [00:21:06]: Yeah. And one of the big problems, I kind of get into it in my ten-minute talk, but the reason people say ROUGE and BLEU are difficult is because you kind of have to tune them, right? You have to decide what counts as a word match; usually it's some sort of n-gram, and there are different versions of ROUGE and BLEU. So it takes a bit of time, it takes a bit of knowing your data and knowing your use case, and a lot of people are not taking that time. They just run it through a benchmark, whatever is out there publicly. But we need better benchmarks.
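To illustrate the tuning point, here is a small sketch, using only the standard library, of how the n-gram size alone changes the overlap score you report. Real ROUGE and BLEU implementations add smoothing, brevity penalties, and several variants (ROUGE-1/2/L, BLEU up to 4-grams); this only shows why that choice matters for your data and use case.

```python
# Sketch: the n-gram size you pick changes the score you report.
from collections import Counter

def ngram_precision(generated: str, reference: str, n: int) -> float:
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    return sum((gen & ref).values()) / max(sum(gen.values()), 1)

gen = "the cat sat on the mat"
ref = "a cat sat on a mat"
for n in (1, 2, 3):
    print(f"{n}-gram precision: {ngram_precision(gen, ref, n):.2f}")
# prints roughly 0.67, 0.40, 0.25 -- same texts, very different numbers
```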
Demetrios [00:21:48]: 100%. This is the evaluation survey that we did. Again, all of this stuff is here. I'm going to share it in the chat so you can dig into the data if you want and you can check it out yourself. So I'm still here. Lucky me. And I would encourage you to check out the evaluation survey that we're doing. Fill it out.
Demetrios [00:22:16]: It would be super useful, because all of this information, we're not trying to gatekeep it at all. We're really trying to make sure that it's something for the community, and anyone can use it however they want. All right, man, I always love talking to you. Thanks so much for coming on here and explaining why evaluation is pretty much a little bit meh right now. I'm going to keep it rocking; I'll bring Kai up on the stage next. He's going to talk to us about how he made Mistral not suck. So, Matt, take care, bro. And everyone go check out Matt's book, LLMs in Production by Manning.
Demetrios [00:22:56]: And you can get the, what is it, 45% off? We'll drop that link in the chat too.
Matthew Sharp [00:23:02]: 45% off? Yes.
Demetrios [00:23:04]: There you go. That's a no brainer. That is a no brainer. And follow Matt on LinkedIn if you're not doing it already. I'll see you later.