MLOps Community

Evaluating LLM-based Applications

Posted Jun 20, 2023 | Views 1.8K
# LLM in Production
# LLM-based Applications
# Redis.io
# Gantry.io
# Predibase.com
# Humanloop.com
# Anyscale.com
# Zilliz.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
Share
SPEAKER
Josh Tobin
Founder @ Gantry

Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.

SUMMARY

Evaluating LLM-based applications can feel like more of an art than a science. In this workshop, we'll give a hands-on introduction to evaluating language models. You'll come away with knowledge and tools you can use to evaluate your own applications, and answers to questions like:

Where do I get evaluation data from, anyway? Is it possible to evaluate generative models in an automated way? What metrics can I use? What's the role of human evaluation?

TRANSCRIPT

Introduction

And oftentimes the main challenge is not actually building the application itself. It's like convincing your teammates that this is actually good enough to go in front of your end users. So evaluation can be an antidote to that. And then finally, if you use evaluation the right way, I think it can be a great roadmap for making improvements to model performance.

So if you have robust, granular evaluation, that can point you to the current opportunities to make your model or your system better, and it can help you decide your development path. The corollary is that we're using evaluation for three things: validating your model's performance, working with your team to decide go or no-go, and finding places to improve. In light of that, let's talk about some desired properties of a good evaluation.

So what makes a good evaluation? The first and probably most important quality is that it should be a number, or set of numbers, that is highly correlated with the outcomes you actually care about as a business. If it's not correlated with the outcomes you care about, it's not going to be a very useful metric for deciding whether to ship the model.

One other thing that helps, and this cuts both ways, is that in a perfect world you would have a single metric, a single number you could make go up or down, and that would be your evaluation metric. The reason that's valuable is that machine learning as a field has evolved over decades to be really, really good at finding ways to make a number go up or down.

As soon as you start getting into things that are qualitative, or into having many metrics, that can fall apart. Now, in the real world, doing evaluation in a granular way is also really important. So it's helpful to have a single number you're looking at, but it's also helpful to have many different ways of slicing and dicing that number so you can understand performance in a more granular fashion.

And then the third desired property of a good evaluation is that it should be fast and automatic. Not every evaluation is going to be instantaneous, and not every evaluation is going to be completely automatic, because, as we'll talk about in more detail, humans still play a role here. But the faster and more automated the evaluation is, the bigger the role it can play in your development process.

If you think about a perfect evaluation suite, it's something that you as a developer can run very quickly as you make iterations and changes, and that will reliably tell you whether those changes actually improve performance or not. That's the ideal evaluation.

Okay, I'll pause here and take questions before continuing. There's still someone who's talking and unmuted; I've tried to mute everyone myself, but if you don't mind muting yourself as well, that'd be helpful. What's happening here with the mute? You can continue, Josh. No worries. Oh yeah, no worries.

I was just going to pause here and see if there are questions. Okay. All right.

Cool, I'll keep going. So we've talked about why evaluation is important and what makes a good evaluation metric, or evaluation suite. The next thing we're going to talk about is why this is hard. I'm going to ground us a little bit in old-school machine learning.

So in machine learning as of circa twelve months ago or older, the way this used to be done is that we had this notion of a training distribution. This is the data you're going to train your model on, and you would sample two datasets from this training distribution.

You sample your training set and your evaluation set. The training set is the data you actually train the model on, and the evaluation set is the data you hold out to measure generalization performance. Then you can look at the difference between the performance of your model on these two sets, and that tells you how much your model is overfitting, that is, how much the model is just too specific to those data points.

Then you deploy your model and look at it in production, on a potentially different distribution of data. Ideally you'd sample a test set from that production distribution as well and measure the performance on that. The difference between the evaluation set and the test set is a measure of your domain shift.

That is, how much has your production distribution shifted away from the data you're actually able to train your model on? And then finally, on an ongoing basis, you'd measure performance on your production data. The difference between your test-set performance and your production performance is a measure of what people call drift: how much your production data is changing over time, and how that is affecting the performance of your model.
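As a rough illustration of those three gaps, here is a minimal sketch (not from the talk) of how you might compute them for a traditional classifier; the trained `model` and the four labeled datasets are placeholders you would supply.

```python
# Minimal sketch: quantify overfitting, domain shift, and drift for a classic
# ML classifier. `model` and the (features, labels) pairs are placeholders.
from sklearn.metrics import accuracy_score

def performance_gaps(model, train, val, test, prod):
    """Each argument after `model` is a (features, labels) pair."""
    splits = {"train": train, "val": val, "test": test, "prod": prod}
    acc = {name: accuracy_score(y, model.predict(X)) for name, (X, y) in splits.items()}
    return {
        "overfitting": acc["train"] - acc["val"],   # train set vs. held-out eval set
        "domain_shift": acc["val"] - acc["test"],   # eval set vs. production-sampled test set
        "drift": acc["test"] - acc["prod"],         # test set vs. ongoing production data
        "accuracy": acc,
    }
```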

So why doesn't this make sense in the LLM world? The reason is that, let's face it, you probably didn't train your model. And that means you probably don't have access to the training distribution. Or, at the very least, you're using an open-source model, and maybe you could get access to that data if you wanted to, but it's pretty impractical because it's a lot of data and it has nothing to do with the task you're trying to solve.

Then when you get into production, since we're using these massive pre-trained models trained on the whole internet, the idea of distribution shift, or data drift, is endemic to the problem. The production distribution you're evaluating your model on is always different from training, no matter what.

Unless you're evaluating on just random data from the internet, there's no such thing as an LLM problem, using off-the-shelf LLMs, that doesn't suffer from distribution shift. The second reason traditional evaluation doesn't work very well for LLMs is metrics.

In traditional ML, let's say you're building a classifier to predict whether an image is a picture of a cat or a dog. The thing that makes this a lot easier is that we can compare the predictions the model makes to the actual labels of whether it is in fact a cat or a dog, and compute metrics like accuracy that tell you what fraction of the time you get the answer correct.

In generative applications, which is a lot of what folks are doing with LLMs, this is much harder. The output of your model might not be a label; it might be a sentence like "this is an image of a tabby cat." And if you have a label that says "this image is a photograph of a cat," how do you actually tell whether that prediction is correct or not?

It's not super clear what metric you can use to distinguish between these two sentences, which might both be accurate descriptions of the image they're describing. So it's hard to define quantitative metrics for LLMs. And finally, even if you have a quantitative metric, another big challenge is that we're often using these LLMs, since they're so general purpose, to solve tasks across lots of different domains.

So suppose you have a model that is, say, 90% accurate overall on questions about different topics: it's 95% accurate when you ask about startups, only 85% accurate when you ask about dogs or food, and if you ask questions about physics the accuracy drops all the way down to 17%.

Is this actually a good model or not? Well, it depends. It's really hard to summarize whether this performance on a diverse set of tasks is good or not, because it depends a lot on the problem you're trying to solve. If this is a chatbot for nine-year-olds who have lots of questions about dogs and food, it's maybe a pretty good chatbot.

But if you need it to answer questions about physics, it's not a very good chatbot.

So, to summarize, why does traditional machine learning evaluation break down for LLMs? First, these models are trained on the whole internet, so there's always drift, and measuring data drift doesn't really matter so much in this context. Second, the outcomes we're aiming for are often qualitative, so it's hard to come up with a number to measure success.

And finally, we're often aiming for a diversity of behaviors, so the goal we talked about before of pointing at a single metric as the measure of quality for the model is more difficult. So how should we actually think about evaluating language models?

The recipe for a language model evaluation really has two components. First, we need some data to evaluate the model on: what data are we going to feed into the model to see how it responds? Then you need some metrics: functions you can use to measure the output of these models, maybe compared to a reference answer, maybe not, to tell you quantitatively how performance is doing.

What this looks like is you take this dataset, you run one or more models on it, you compute one or more metrics on the outputs of each of those models, and you summarize this in some sort of report. One of the key things to understand is that if our recipe for building language model evaluations is to pick data to evaluate on and then pick metrics to run on that data to compute the evaluation number, then better evaluations correspond to having better data and better metrics.
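To make that recipe concrete, here is a minimal sketch (illustrative, not a specific tool from the talk) of an evaluation harness: a dataset, one or more models, one or more metrics, summarized into a report. The callables are placeholders for whatever you actually use.

```python
# Minimal eval harness: run each model over the eval set, score each output
# with each metric, and report per-model mean scores.
def run_eval(dataset, models, metrics):
    """dataset: list of {'input': ..., 'reference': ...} dicts
    models:  dict of name -> fn(input_text) -> output_text
    metrics: dict of name -> fn(output_text, example) -> float
    """
    report = {}
    for model_name, model_fn in models.items():
        rows = []
        for example in dataset:
            output = model_fn(example["input"])
            scores = {m: fn(output, example) for m, fn in metrics.items()}
            rows.append({"input": example["input"], "output": output, **scores})
        # Summarize each metric as a mean over the eval set
        report[model_name] = {m: sum(r[m] for r in rows) / len(rows) for m in metrics}
    return report
```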

What does it mean for data to be better? Data is better if it is more representative of the data you're actually going to feed the model in the real world, for the task you're trying to solve. So if your data is more like your production data, that's a better evaluation dataset. If it has nothing to do with your production data, it's not that the metrics you compute on it are useless, but they're much less relevant to the problem you're trying to solve.

What does it mean for an evaluation metric to be better? An evaluation metric is better if it is more predictive of the actual outcomes you're aiming for with the product you're building. So if the metric you're computing has almost nothing to do with the thing your users are trying to do, it doesn't matter if you run it on the perfect dataset; it still won't give you a very helpful sense of how well the model is doing.

But if the evaluation metric is perfectly predictive of how humans would rate the outputs of this model, that's a really good metric, and you need both of these things to be true in order to have a great evaluation. So this is another good stopping point; I'll jump back over to the chat and see if folks have questions.

No questions so far, I think. All right. Cool. On the previous slide I put public benchmarks all the way in the bottom left, so I want to justify that a little bit and talk about some of the challenges with using publicly available benchmarks to make decisions about language models.

Let's talk through some of the different categories of publicly available benchmarks. The first, and the most useful, category is benchmarks for functional correctness. What these benchmarks do is take the output of a model and actually run it, or use that output to try to solve a task.

Most of the publicly available functional-correctness benchmarks operate on code generation tasks. The great thing about code generation is that code is a thing computers can run. So if you generate some code, then run it, and it solves the task you wanted it to solve, that's a very good indication that the model is doing the right thing. Even just the code compiling, or having some general properties of correct code, is a pretty decent indicator that the model is doing something good. So these publicly available benchmarks are actually very helpful if you're doing a task that corresponds to them.

So if you can do this, you should. The next category of publicly available benchmarks is live human evaluation benchmarks. The one that's probably most popular right now is called the Chatbot Arena.

I think I have this on the next slide, actually. Yeah. The way it works is they host two models side by side, you type in a question you have for the models, you see both responses, and you pick the one you prefer. So they're crowdsourcing feedback on the responses the different models are giving.

This allows you to stack-rank the different models that are available based on feedback from actual people. I think there's a common perception that human evaluation is the best way to evaluate whether language models are doing the right thing or not.

We'll talk a little more in the next section about general challenges with human evaluation, but I personally find the Chatbot Arena to be probably the most helpful thing, or the thing most correlated with my perception of model performance on most tasks. The third set of publicly available benchmarks is characterized by models evaluating other models.

This sounds like kind of a crazy idea, but the idea is that rather than having a human say which of two outputs is better, you can just ask GPT-4 which of the two outputs is better. It turns out that actually works surprisingly well, and GPT-4 is a pretty good evaluator of other language models' outputs.

These types of evaluations, I think, are growing rapidly in popularity. They're powerful because they're very general: anything you can prompt a model to do, you can get a model to evaluate. So I think these are going to play an increasing role. Again, in the next section we'll talk a little about the limitations of this technique in general.

The next category of publicly available benchmarks is task-specific performance. The most popular, or most famous, versions of this are HELM and BIG-bench. These are great because they aim to be as holistic as possible in terms of the different tasks you might want to evaluate a language model on.

The way they compute the evaluations is by formulating the tasks so that there's a clear right answer. These are decent ways to get a rough comparison of different models, but in most cases they don't include the tasks you care about as a model developer.

So, generally, these benchmarks are not super useful for picking a model. And finally, the lowest-quality form of publicly available benchmark is automated evaluation metrics that compare a gold-standard output to the model's output in a way that doesn't use language models to do the evaluation.

These metrics have been around for a long time in NLP. They're falling out of favor recently because recent papers have shown they're actually not very correlated with the way humans would do the same evaluation. And again, when I'm trying to get a sense of how good a new model is, the Chatbot Arena is usually the first place I'll start, in addition to just playing with the models on my own.

Okay, so what's wrong with publicly available benchmarks? The key issue is that they don't measure performance on your use case. Just because GPT-3.5 is slightly better or worse than some other model on a general benchmark does not mean that will be true on the task you're trying to solve.

Also, the methodology a lot of these benchmarks use is not evolving as quickly as the rest of the field. A lot of them aren't using modern chain-of-thought prompting techniques, they're often not using in-context learning, and they're definitely not using the prompting, in-context learning, or fine-tuning techniques that you are using in your application.

In addition to this, measuring model performance is really hard in general, which we'll talk more about in the context of building your own evaluations. The publicly available benchmarks, although they're very carefully thought out, still have all of the measurement issues we're going to talk about in a second.

All right. Another good pausing point.

Okay, there's a question here. Go for it. Yeah, so I think he's asking about the data aspect of things. In one of your slides you mentioned better data for evaluation, if I'm not mistaken. He was talking about an experiment he ran, designing training and evaluation datasets in accordance with factorial design principles.

That was with 1,200 data points, and it achieved 93 percent accuracy. So generally, what do you think about factorial design principles for this sort of thing?

Let me re-read the question, because I'm trying to assimilate it as well. Yeah, maybe, if you don't mind, whoever asked that, help us understand what you're trying to answer here, or have us answer here. But I'll keep going. So, we talked about the problem of just googling a benchmark and using that benchmark to make decisions about models.

So what do you do instead? What you should do, as you continue to invest in a project you're working on, is build your own evaluation set and choose your own evaluation metrics that are specific to the tasks you're trying to solve. Let's talk about how to do that. The key points for building your evaluation set are: start incrementally, use your language model to help,

and then, as you roll your project out to more and more users, add more data from production that corresponds to the data your model is actually faced with in the real world. So, starting incrementally: usually when I'm starting a new LLM project, I'll start by just evaluating the model in an ad hoc way.

If I was writing a prompt to get a model to write a short story about different subjects, I might try out a few different subjects in that prompt. I might have it write a short story about dogs or LinkedIn or hats. Then, as I try out these examples ad hoc, I'll find some that are interesting in some way.

Interesting might mean the model does a bad job on this input, or it might mean this is just another way a user might use this prompt that I didn't think of before. As I find these interesting examples, as I try out inputs to the model that are interesting, I organize those examples into a small dataset.

Once I have that small dataset, which might just be 2, 3, 5, or 10 examples, then as I make a change to the prompt, rather than going ad hoc, input by input, I'll run my model on every example in the dataset. So yeah, interesting examples are ones that are hard or ones that are different.
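As a rough sketch (not from the talk) of what that small-but-growing eval set can look like in code, the examples live alongside the prompt and every prompt change is re-run against all of them; `call_llm` is a placeholder for whatever model call you use.

```python
# A handful of "interesting" inputs, re-run on every prompt change.
EVAL_SET = [
    {"subject": "dogs"},
    {"subject": "LinkedIn"},
    {"subject": "hats"},   # add more as you find hard or different cases
]

PROMPT_TEMPLATE = "Write a short story about {subject}."

def run_prompt_version(call_llm, template=PROMPT_TEMPLATE):
    """Run the current prompt on every example, so a change is checked
    against the whole set rather than one ad hoc input."""
    return [
        {"input": example, "output": call_llm(template.format(**example))}
        for example in EVAL_SET
    ]
```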

The second thing you can do to make this process less manual is use your language model to help generate test cases for your language model. The way this works is that you write a prompt focused on getting the model to generate inputs to your prompts.

Then you run that model, generate some inputs, and add those to your evaluation dataset. There's an open-source library that helps with this in the context of question answering in particular, which you can poke around to get some inspiration about how to do these data-generation prompts.
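As an illustration (this is a hypothetical sketch, not the library referred to above), a data-generation step might look something like this, with `call_llm` standing in for your model call and the prompt wording just an example.

```python
# Hypothetical LLM-driven test-case generator for a question-answering app.
GENERATION_PROMPT = (
    "You are helping build an evaluation set for a question-answering app "
    "about {topic}.\n"
    "Write {n} realistic questions a user might ask. "
    "Return one question per line, with no numbering."
)

def generate_eval_inputs(call_llm, topic, n=10):
    raw = call_llm(GENERATION_PROMPT.format(topic=topic, n=n))
    # One candidate input per line; drop blank lines
    return [line.strip() for line in raw.splitlines() if line.strip()]
```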

I'll also show you how this looks in Gantry, since this is one of the pieces of functionality we have. Lastly, the one caution here is that we've found it's difficult to get models to generate super diverse inputs.

They'll generate inputs that are valid and interesting, but they won't really cover all of the possibilities that you as a human might think of, so it's good to be aware of that limitation as you use models for this. And finally, probably the key point here: you shouldn't think of your evaluation dataset as static. Evaluation data is something you build over the course of your project, as you encounter more and more use cases and failure modes.

So as you roll this out and it interacts with more users (your team, your friends, and eventually your end users), you want to capture examples from production. What do your users dislike? If you have annotators, what do they dislike? What does another model dislike? You might want to look for outliers relative to your current evaluation set.

You might want to look at underrepresented topics in your current dataset. All these different heuristics build up to taking production data that is interesting, so different or hard, and feeding it back into your evaluation set, so that you're progressively evaluating on more and more challenging and interesting data over time.

So that's the quick version of building your own evaluation set. The next thing I want to talk about is how to think about picking metrics to run on that evaluation set. Here's a flow chart you can think through as you make this decision. The first key question is: is there a correct answer to the problem you're asking the language model to solve?

If there is, the problem gets a lot easier, because you can use evaluation metrics like in classical ML. So if you're trying to get your model to classify sentences about cats or dogs, you can still use accuracy as a metric. If there's no correct answer, then it's helpful to have a reference answer you can point to.

If you have one, you can use that reference answer as a way to judge whether your answer is correct, even though you're not expecting your answer to be exactly the same as the reference. If you don't have a reference answer, looking for other kinds of examples to guide you is still really helpful.

So if you have a previous example, or an example from a different version of the model, that you might expect to be reasonable, you can use that. If you have human feedback, you can also use that as a way to evaluate models. And at the end of the day, if you don't have any of that, there are still metrics you can run on the output alone that, depending on your task, can get you a useful signal.
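One way to read that flow chart is as a simple decision function; the sketch below is just an illustration of the branching described above, with each branch naming the family of metrics you would reach for.

```python
# Illustrative decision flow for picking an evaluation metric family.
def choose_metric(has_correct_answer, has_reference, has_previous_output, has_human_feedback):
    if has_correct_answer:
        return "classic ML metrics (accuracy, precision/recall, ...)"
    if has_reference:
        return "reference matching (semantic similarity, or an LLM check for factual consistency)"
    if has_previous_output:
        return "pairwise comparison ('which of A or B is better?'), judged by an LLM or a human"
    if has_human_feedback:
        return "LLM-judged check: does the new output incorporate the feedback?"
    return "static checks on the output alone (structure validation, LLM grading on a 1-5 scale)"
```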

To expand on this a little: you've got your normal machine learning evaluation metrics, and then you've got metrics for matching a reference answer. The context here is that you have an example correct answer from a human, and you have an example generated by your model, and you want to compare these two things and ask: is the model doing the same thing as the reference answer?

You can do this with somewhat deterministic approaches like semantic similarity: you can embed the two answers and see how close together they are. Or you can ask another LLM: is the answer that I know is correct and the answer that the model generated factually consistent?
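A minimal sketch of the embedding-similarity check, assuming the sentence-transformers library; any embedding model plus cosine similarity would work the same way.

```python
# Embed the reference and the model's answer, then compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(reference_answer: str, model_answer: str) -> float:
    """Returns cosine similarity in [-1, 1]; higher means closer in meaning."""
    ref_emb, ans_emb = _model.encode([reference_answer, model_answer])
    return float(util.cos_sim(ref_emb, ans_emb))
```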

So you can write a prompt to have an LLM run that evaluation. If you have two different answers, one model that says this is the answer and another model that says this is the answer, you can again ask another language model which of the two answers is better according to some criteria.

For example, you can write a prompt that says: your job is to assess the factual accuracy of these two answers to the question; here's the question, here's answer A, and here's answer B; which one is more factually accurate?
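A sketch of that A/B judge prompt, with `call_llm` as a placeholder for your model call and the wording purely illustrative:

```python
# LLM-as-judge: pairwise factual-accuracy comparison.
JUDGE_PROMPT = """Your job is to assess the factual accuracy of two answers to a question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Which answer is more factually accurate? Reply with exactly "A", "B", or "TIE"."""

def judge_pair(call_llm, question, answer_a, answer_b):
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "UNPARSEABLE"
```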

You can also write metrics that assess whether a new output from the language model incorporates feedback that was given on an old output. Say you run your model and one of your friends says, hey, I asked it a question about Taylor Swift and it gave me back an answer about Taylor Swift's old albums, and I wanted an answer about the new albums. You can write down that feedback and then take the question the user asked.

You run your new language model on that question, look at the output, and then ask an evaluation model whether the new output incorporates your friend's feedback about which albums. So the language model is assessing whether the new answer incorporates the feedback on the old answer.

That can be a pretty effective way to evaluate the model. And then finally, if you don't have access to any of that, you can compute static metrics on the output of a single model. The easiest and most deterministic approach is to verify that the output has the right structure.

More interesting, but harder to get right, is asking a model to grade the answer, giving it a score on a scale of one to five for, say, how factually correct the answer is.
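Here is a sketch of those two single-output checks; `call_llm` is a placeholder for your model call, and the expected "summary" field is just an example schema, not anything from the talk.

```python
# Static checks on a single model output: structure validation and LLM grading.
import json

def has_valid_structure(output):
    """Example structural check: require a JSON object with a 'summary' field
    (adapt to whatever schema your application expects)."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and "summary" in parsed

GRADING_PROMPT = (
    "Rate how factually correct the following answer is on a scale of 1 "
    "(completely wrong) to 5 (completely correct).\n"
    "Answer: {output}\n"
    "Reply with a single digit."
)

def llm_grade(call_llm, output):
    reply = call_llm(GRADING_PROMPT.format(output=output)).strip()
    return int(reply) if reply in {"1", "2", "3", "4", "5"} else None
```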

Okay, so a lot of these metrics incorporate this key idea of using models to evaluate other models. What you might be thinking is that this sounds like kind of a crazy idea: how do we verify that the model doing the evaluation is itself actually correct? And that's a fair concern.

But empirically this is still very useful, and a lot of papers are moving in this direction for doing evaluation. The reason is that automated evaluation using language models can unlock more parallel experimentation. If you need to rely on humans to do your evaluation, that's just going to really slow down your experiments.

Before you actually roll out into production a model that was only evaluated using other models, you probably still want to do some manual checks, and that will give you more confidence that it's actually doing the right thing.

More concretely, the research literature has been identifying some limitations of using language models to evaluate other LLMs, and there are really two categories of criticism. The first is that a number of papers have been discovering biases in LLM evaluation of other LLMs. For example, if you ask models to output a score on a scale of one to five, LLMs tend to prefer one number over other numbers.

Models also tend to prefer their own outputs. So if you're asking GPT-4 to evaluate Claude and GPT-4, it's going to be a little bit biased toward its own answers, maybe even over human answers. If you're asking a model to compare two different answers, some research has found that the order of the answers, which one comes first in the evaluation, actually matters.

Language models also tend to prefer characteristics of the text that might not have anything to do with whether it's correct, like which answer is longer. So there are a bunch of biases in language model evaluation. I think all of these have paths to being mitigated, but it's something to be aware of if you're going to use this today.
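One simple mitigation for the ordering bias (my illustration, not a specific method from the talk) is to run the pairwise judge twice with the answers swapped and only accept a winner when both orderings agree; this sketch reuses the `judge_pair` helper from the earlier sketch.

```python
# Order-swapped judging to reduce position bias in LLM pairwise evaluation.
def debiased_judge(call_llm, question, answer_a, answer_b):
    first = judge_pair(call_llm, question, answer_a, answer_b)
    second = judge_pair(call_llm, question, answer_b, answer_a)
    # Map the swapped verdict back to the original labels
    swapped = {"A": "B", "B": "A"}.get(second, second)
    return first if first == swapped else "TIE"  # disagreement -> treat as a tie
```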

The second category of objection is: okay, if the reason language models need to be evaluated is that they're unreliable, why are we trusting language models to do the evaluation? I think the way forward here is not either human evaluation or language model evaluation, but a smart combination of both, where you use the best of each to get great evaluations.

As a developer, you primarily interact with an auto-evaluator. The reason you do that is that it lets you move fast: every change you make, you can auto-evaluate and determine whether it's actually worth spending the money on a human eval. But that auto-evaluation should be verified on your task.

And the way you verify it is through high-quality human evaluation. So the last thing I want to talk about, before showing you this in a more practical setting, is the role of human evaluation. There are different ways people commonly collect feedback from humans on language model outputs.

Probably the most common one you'll see is just asking people to rate the answer on a score from one to five. This is maybe the most common way of doing it, but it's also probably the worst. The issue is that people are inconsistent: what one person says is a three might be a four for someone else, or a two for someone else, and that makes it really difficult to make the metrics you compute on top of these assessments reliable. To combat that,

a lot of the field has moved from asking people to rate individual responses to asking them to compare two responses. People are more reliable at saying which of A or B is better than they are at giving A or B a score. This is part of why A-versus-B evaluations are what's typically used in reinforcement learning from human feedback,

and it's the direction a lot of evaluation in the field is moving. This is also not without its challenges. When people are doing A-versus-B evaluations, one big issue is that human evaluators tend to look at surface-level attributes of the outputs rather than their actual factual accuracy.

There's a really interesting paper called "The False Promise of Imitating Proprietary LLMs" where they did an assessment of this. There's been a lot of buzz about open-source, fine-tuned LLMs being close to GPT-3.5 quality, and it turns out they're close to GPT-3.5 in terms of human preference, but they're not very close in terms of actual factual accuracy.

So when humans do their evaluations, they're picking up on how things are worded and formatted to a much larger degree than the underlying accuracy of the statements. One research direction the field is moving in is away from just asking people to read a whole statement and give it a score or a comparison, and toward asking people to give more fine-grained evaluations.

Rather than looking at the entire output and saying "is this good or is this bad?", these approaches ask humans to select a particular passage within the output that is irrelevant or untruthful. Some of the early research indicates this might be a more reliable way of getting evaluations from humans.

But at the end of the day, just like LLM evaluation, human evaluation is also highly limited. The main limitations are quality, cost, and speed. The reality today is that GPT-4 writes better evaluations than most people you hire on Mechanical Turk, which is kind of a surprising and crazy fact. But I was reading somewhere, I think in a paper that came out in the past week or so, that a very large percentage of people on MTurk are just using GPT-3.5 or GPT-4 to write their evaluations anyway,

so maybe it's not that surprising. The second big challenge for quality is that, unless you design your experiment really carefully, human evaluators might not measure the thing you really care about. Again, if you're just asking people for their preference, a lot of the time their preference is going to be based on surface-level attributes of the text,

not underlying factuality. Human evaluation is also really costly and really slow, so it's not the silver-bullet answer for eval. Coming back to it, I think the way forward here is human-verified, automated eval with language models. All right, the final thing I want to do is show you what this looks like a little more concretely.

We're going to talk through a process you can use to evaluate language models progressively, on data that looks like your data from production, using metrics that correspond to the things you actually care about. The way this works is analogous to test-driven development in traditional software development.

You start by getting a simple version of your model out there in production, running in the real world. Then you gather feedback on how that model is doing. You use that feedback to add to your evaluation dataset, and also to iterate on your prompt or your model to try to improve performance.

When you think the model is good enough, you run a systematic evaluation, using auto-eval, human eval, or whatever technique makes the most sense for the stage of the project you're in. Then, if you're in a bigger company, you approve those changes, you deploy the model, and you start the loop over again.

So I'll flip over here. Let's walk through a simple example of this. What we're going to do is build a simple application to do some grammar correction. Let's write a naive prompt, like "correct the grammar of the user's input," and then add the user's input.

Okay, so this is a naive prompt I might use for grammar correction. Let me just do a quick sense check on whether it does something reasonable at all. Let's try something like "this sentence has bad grammars," and, I don't know, another sentence with bad grammar.

Let's just make sure the model does something reasonable. Okay, high-level sense check: this seems fine, right? But the question now is, just because it worked on these two examples I made up on the spot, how do we know it's actually working more generally than that?

That's where evaluation comes in. So we'll take this prompt we just wrote and run an evaluation on it. Again, the two key challenges for evaluation are: where do you get the data, and what metrics or criteria do you use to evaluate performance?

In Gantry, our approach to both is using LLMs to help you. We don't have any data to evaluate this on yet, so let's generate some. I'll describe the kind of sentences I want, and in this case we want sentences with bad grammar. This is going to hit the OpenAI API behind the scenes,

and it's going to give us back some example sentences with that characteristic. There are some interesting ones here, like "if you don't like spicy food." Eyeballing this, these seem like pretty reasonable examples that we might want our model to perform well on. So let's create a new dataset from this called "evals."

The second question is evaluation criteria. What we're going to do is write a simple prompt describing to the model what characteristics we want the output to have. In this case, we want the outputs to be grammatically correct.
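Outside of Gantry, a rough stand-alone equivalent of these two prompts might look like the sketch below; `call_llm` is a placeholder for your model call and the wording is illustrative, not the exact prompts from the demo.

```python
# Naive grammar-correction prompt plus an LLM-based evaluation criterion.
CORRECTION_PROMPT = "Correct the grammar of the user's input.\n\nInput: {user_input}"

CRITERIA_PROMPT = """You are evaluating the output of a grammar-correction system.

Output: {output}

On a scale of 1 to 5, how grammatically correct is the output?
Think step by step, then give your score on the last line as 'Score: N'."""

def evaluate_correction(call_llm, user_input):
    corrected = call_llm(CORRECTION_PROMPT.format(user_input=user_input))
    judgment = call_llm(CRITERIA_PROMPT.format(output=corrected))
    return corrected, judgment  # judgment contains the chain of thought and the score
```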

So now we have this dataset and this evaluation criterion; let's run it. This will take a few seconds because we're subject to OpenAI latency, as we all are in the field these days. What's going on behind the scenes is that we're generating the outputs of the prompt template I wrote for each of the inputs,

and then we're evaluating them using the prompt we wrote as the criterion. What we see here is maybe not too surprising: the model is doing really well on these examples, because these are simple examples and simple prompts, so it's getting five out of five on all of them.

If we want to understand why the model thinks a given output is a good grammar correction, we can look at its chain of thought as well, so we can see the reasoning the model applied to arrive at the score. In the real world we'd probably do some more validation of this model,

but for the sake of this demo I'm just going to deploy it, YOLO it into prod. And let's make this a little more interactive: I'm going to drop a link here. Let me just verify that this is working.

Yep, so I'll drop a link here, and you can imagine this as being the... say again?

Yeah, there are a couple of questions I thought you might be interested in touching on in the next eight minutes as well. Yeah, I'd love to make this more concrete with the rest of the loop, and then hopefully I'll get to the questions in the last five minutes. So I dropped a link.

Feel free to drop in, play around with this, interact with it as a user, and give feedback on whether it's doing the right thing or not. While folks do that, just for the sake of time, I'll move into the next phase of this.

We talked about how evaluation datasets are not static: they're something you should continue to build using production data as it comes in. So I have preloaded some production data in here, or some fake production data, and what we see is some high-level statistics on what's going on.

What we're looking for are examples that are interesting to add to our evaluation dataset and to use to potentially make our model better. We can look at all the inputs people are sending in; let's see if people are actually sending inputs.

Well, here are the ones that I sent in; I guess people are still working on it. One thing we notice here is that there are inputs in French. If we zoom in on these inputs and see what the model is doing, for some of them it's translating the French to English, which is probably not what we want, not a characteristic we want our model to have.

So, again, we're going to progressively build our evaluation set by adding these to the evaluation dataset. If we jump over to that dataset, we now have the ten examples we generated as our starting point, as well as the four from production where our model wasn't doing the right thing.

Now we can make an improvement to our prompt to try to fix these issues. I'll just copy and paste the one I pre-wrote for this purpose.

Well, I guess this is loading, so let me just do a little live prompt engineering. Let's add something like "maintain the language of the input."

What you would do is make the changes to your prompt that you think would fix the problem you found in production. Then you'd go back, evaluate the new version compared to the last version, maybe add some additional evaluation criteria, like "outputs should be in the same language as the inputs," and rerun your evaluation.

What happens behind the scenes is that you take your new prompt and your old prompt and run both against all of the evaluation data you've collected, both the data you started with and the data you added from prod. Then you ask a model to compare the performance of those two versions on those inputs. That's the iterative process of building an evaluation set and evaluation criteria as you develop your model, and it's the process I think makes the most sense for building these things going forward.

So this is a good pausing point for me to answer questions.

Right, we're almost out of time, so I'll try to take the questions in order. The first question was from Athan, who asked: would psychometric data preparation methods be helpful for evaluating LLMs? I don't know what psychometric data preparation methods are; would you mind expanding on that? Then I'm happy to give it a shot.

Yeah, so please expand on it in the chat. The next question is from Christopher, who asked: what are the best ways to evaluate one's fine-tuned models, and are there open-source tools and frameworks to do this? Yeah, I mean, it's following the same process I advocated for here:

define your task, come up with some data you think represents your task, come up with some criteria or metrics, run the open-source model against those criteria and metrics, get a sense of performance, and then build your evaluation set over time. And Gantry is a great tool for doing that.

Great, nice, thanks for sharing that. The next question is from Nathan: if you are using LLMs to make the test content, how different can the model generating the test cases be from the model being tested? I think you touched on this earlier with LLMs evaluating LLMs; I don't know, do you want to take it?

Yeah. I think that for generating the test cases it's a little bit different, because it's a different type of issue if the test cases are biased than if the evaluation is biased. If the evaluation is biased, then even if you have the right data, you might get the wrong idea about whether the model is doing well on that data.

But if you just have the wrong data, if the data generation process is biased, you're still able to continue this iteration process. The mindset should be that you're never going to have a perfect dataset to evaluate your model on; it's something you should always be improving.

So I'm a lot less worried about folks using biased models to generate their initial evaluation set than I am about using biased models to create the evaluation itself. Yeah. And still on the evaluation side of things, a question from Jose: are general-purpose models like GPT-3.5 better at generating or at evaluating results? They're better at evaluating.

The mental model you should have, and this isn't strictly true but it's a decent first-order approximation, is that models think one token at a time. So if you ask a model what the answer to a specific question is, you're relying on the next token being the right answer.

In general in prompt engineering, one way to get more reliable answers from models is to let them think before they answer: tell them to think step by step, write out their reasoning, and then answer. Evaluation takes that a step further, in that you give the model access to both the question and the full answer, and then allow it to think about whether that answer is correct or not.

So evaluation tends to be more reliable than generation for LLMs. Right, and I think you already touched on James' question about whether it makes sense to fine-tune smaller LLMs to evaluate models. I think there's research on that; could you share it? So the question is, does it make sense to fine-tune smaller LLMs for evaluation?

Yes, smaller. I do think that makes sense. Did you have examples of research that's been done there? I don't have any examples of research being done there, no, but I think people are working on it. Okay, all right. I think we can take two more questions before the workshop ends.

The first question, as far as I can hear, is: how about giving a minimal pair, where one version is grammatical and the other is ungrammatical, and asking the system to pick the ungrammatical one? Would that be a better evaluation criterion? Well, maybe, but I think it's tricky to set up, because where do you get the grammatical and ungrammatical answers from?

If you wanted to follow evaluation best practices for this, what you'd do is the last step we didn't quite get to: having a model compare two answers and ask which is more grammatical. So if you've made a change to your prompt, you could ask a model, okay, did this change improve how grammatical the output was or not?

Right, that makes sense. Just one more question before we hop off, from Mohan: based on your experience, apart from a human evaluator, what other suitable metrics and approaches can be considered when doing annotation with LLMs? Yeah, I think anything you can have humans annotate, you can also have LLMs annotate.

Generally speaking, unless you know what the answer is, that is, unless you're comparing something where it has to be exactly the same, in which case normal metrics work well, if the right answer is ambiguous, then having LLMs do the evaluation is probably going to be better than any other metric you're going to come up with.

Right, okay. I just got a signal that we can take one more question, and I think that should be the last one. The question is: does it make sense to ask the LLMs to evaluate their own outputs in the same prompt where they generate them? What would the feedback look like?

Yeah, totally. You can use a lot of these evaluation techniques in your chain to make your model outputs better. There's a paper called something like "Self-Critique," which I think was maybe the first to cover that, or at least the first one I'm aware of.

So you can apply the evaluation in the loop with the LLM as part of having the model improve its own outputs, and that can work reasonably well; you can see lifts in performance. Obviously what you're trading off is cost and latency at inference time, so it doesn't make sense for all use cases.
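As a rough sketch of that evaluation-in-the-loop idea (illustrative, not the method from the paper), one generate-critique-revise pass might look like this; `call_llm` is a placeholder and the prompts are assumptions.

```python
# One pass of generate -> critique -> revise. Each extra call adds cost and latency.
def generate_with_critique(call_llm, task_prompt):
    draft = call_llm(task_prompt)
    critique = call_llm(
        f"Here is a draft response to the request below.\n\n"
        f"Request: {task_prompt}\n\nDraft: {draft}\n\n"
        f"List any factual errors, omissions, or unclear passages."
    )
    revised = call_llm(
        f"Request: {task_prompt}\n\nDraft: {draft}\n\nCritique: {critique}\n\n"
        f"Rewrite the draft, addressing the critique."
    )
    return revised
```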

All right. And do you think any special prompt engineering needs to happen there for that to be really effective? I think so, yeah. Hopefully not too much, but we've found, as we've built out some of these automated evaluations, that how good an output you get is somewhat sensitive to how you write the evaluation prompts.

So I think we need best practices to emerge, and that's also a place where companies like Gantry can help the rest of the field answer those questions for everyone, so that hopefully not everyone needs to figure it out from scratch.

Great, thank you so much, Josh. I think people are super excited about the talk; it's been a really comprehensive one. Thank you so much. Great, thank you. And I'll just drop a link in the chat here: if folks want to play around with the Gantry UX I just showed, we're running an alpha right now, so I'll drop a link that gives everyone access to play around with it, totally free.

Just let us know what you think. Yeah, awesome, that was super helpful. Thanks a lot, everyone. I think we can all join the other talks now. Thank you. All right, great. Thanks.
