Evaluating Generative AI Systems
Jineet Doshi is an award-winning Data Scientist, Machine Learning Engineer, and Leader with over 7 years of experience in AI. He has a proven track record of leading successful AI projects and building machine learning models from design to production across various domains such as security, risk, customer churn, and NLP. These have significantly improved business metrics, leading to millions of dollars of impact. He has architected scalable and reusable machine-learning systems used by thousands. He has chaired workshops at some of the largest AI conferences like ACM KDD and holds multiple patents. He has also delivered guest lectures at Stanford University on LLMs. He holds a master's degree focusing on AI from Carnegie Mellon University.
Jineet Doshi, an AI lead at Intuit, offered valuable insights into evaluating generative AI systems. Drawing from his experience architecting Intuit's generative AI platform, he discussed various evaluation approaches, including traditional NLP techniques, human evaluators, and using LLMs as judges. Jineet highlighted the importance of establishing trust in these systems and of evaluating for safety and security. His overview provided practical considerations for navigating the complexities of evaluating generative AI systems.
Jineet Doshi [00:00:00]: Good evening, everyone. Thank you so much for coming. And special thanks to Audrey Rahol and the MLOps community for having me here. It's an honor. So I'll be specifically talking about evaluating generative AI systems, and this topic is specifically interesting to me because even today it is an open problem. Like, everyone's still figuring this out; there is no right or wrong. What I'll do in this talk is cover the different approaches to evaluating generative AI systems and talk about the pros and cons of each approach. The purpose of this talk is mainly to engage with all the bright minds that are out here on this important topic so we can all collectively think and discuss how to move this forward.
Jineet Doshi [00:00:54]: Because, as I said, it is very much still an open problem. So, a little bit about me: I went to grad school at Carnegie Mellon. That's where I studied data science and machine learning. Got a lot of my technical skills from there. For the last seven years or so, I've been an AI lead. Currently, I'm an AI lead at Intuit. I've worked on production models across a variety of different domains. More recently, I've been spending a lot of my time specifically on generative AI.
Jineet Doshi [00:01:32]: I am one of the architects of our generative AI platform that just got featured on Wall Street. We call it Intuit Assist, and it basically runs across all of our products, reaching more than 100 million customers across the world. So a lot of the stories that I'm going to share are from my experiences while designing and building that system. I'm also a pretty active member of the MLOps community, so special shout out to the MLOps community. A couple of months ago, I worked with Demetrios, the founder of the community, and together we conducted this evaluation survey for LLMs. It was taken by, I think, more than 120 people across the globe, proper, real AI practitioners. And I'll be sharing some of the insights from that survey as well in this talk.
Jineet Doshi [00:02:28]: And then here's my LinkedIn in case anyone wants to connect and continue these discussions. Also, one disclaimer: the views that I express in this talk are my own and not connected with my employer. Just have to get that out of the way. So let's begin. The very first question is, why evaluate, right? Why is there a need to evaluate anything? Obviously, the first thing is that we want to establish trust in these systems, right, before we deploy them to production, before millions of our customers start playing around with these. We want to make sure that we can trust them and that they behave in ways that we expect them to behave. If that does not happen, we want to avoid headlines like this, right? In December, this was a very popular headline. I don't know how many of you read this, but it was hilarious: a car dealership in Watsonville, California deployed a chatbot, and people were able to do some very quick, clever prompt engineering to essentially get it to offer cars for one dollar.
Jineet Doshi [00:03:32]: And it was legally binding as well. They made it say that, oh, this is a legally binding agreement. So clearly they didn't think about evaluations before deploying this system. So I guess that's another reason why we absolutely need evaluations. And the other reason is, if you want to improve something in production, the first thing is we want to measure it, right? That's how we can baseline where it's at today, and that's how we can constantly iterate over it and improve it. So why is evaluating LLMs so challenging? Let's take a step back to, let's call it, the traditional way of doing machine learning. Back in the day, though that day was not too long ago, like around a year and a half ago, which honestly in the AI space feels like light years ago, we had more focus on traditional models like XGBoost and linear regression and what have you. So in the traditional ML world, we used to build these task-specific models.
Jineet Doshi [00:04:44]: You wanted to design a model to predict, is it going to rain today, for example, right? And what you would essentially do is train the model, have a test set with proper labels, and then evaluate the model on those specific labels. In this case, the model output was constrained. You knew that the model is going to output the probability of whether it's going to rain or not. And also, the metrics were very clear, very well defined, whether it's a classification problem or whether it's a regression problem. These were very well studied, very well established metrics. But now, with the advent of ChatGPT and generative AI, the script has been completely flipped, because these models are able to do a lot of these open ended tasks. For example, if you ask it to write a poem about PyTorch, it'll happily do that. And in these types of scenarios, the question is, what is the ground truth? What is my label? Because there are millions of ways of generating a poem.
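As a rough illustration of the traditional evaluation loop described here, the following sketch assumes scikit-learn and uses made-up labels and predictions for a rain classifier; it is not from the talk itself.

```python
# Traditional ML evaluation: a constrained, task-specific model (will it rain?)
# scored against a labeled test set with well-established classification metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth labels: did it rain?
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # model's predicted probability of rain
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]     # threshold into hard predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))
```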
Jineet Doshi [00:05:53]: And also now the flipping of the script is that one model is able to do so many different tasks, right? One model can generate code, summarize PDFs, do knowledge-based Q&A, and even write you a Valentine's Day poem in case you need ideas for tomorrow. So that begs the question, how do you measure general intelligence? And this is something, if you think about it, even amongst humans, we haven't really figured out a perfect system for measuring human intelligence. So now our models are also kind of reaching that stage where they have so many broad capabilities, and it's like, how do we measure their intelligence? The other factor with LLMs we've seen is that they hallucinate, they sometimes give out harmful answers, and we absolutely need to check for these things before we deploy to production. So again, these are more factors we need to keep in mind. There have been studies specifically pointing to the variability in LLM outputs, which again adds to the challenge of why evaluating them is so difficult. In one study, for example, what they observed was that the LLM's performance on a specific task would vary significantly, all the way from 3% to 80%, and the only difference was modifying the format of the prompts.
Jineet Doshi [00:07:31]: So all you would do is, in one prompt, add some extra spaces or just add a colon, and that would significantly change the model's performance. I'm sure a lot of us have seen this in practice as well. Until OpenAI's recent Dev Day, we would not even get deterministic responses from ChatGPT. Every time you give it a prompt today, and you give it the same prompt next week, it would give back different answers. Only at the recent Dev Day did they add the seed parameter and the system fingerprint, through which we can now at least get deterministic responses. But still, the fact remains that minor adaptations to prompts in some cases produce drastic changes in output, again just adding to our concerns while evaluating. And also, there are tons of hyperparameters as well.
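A small sketch of the seed and system-fingerprint mechanism mentioned here, assuming the OpenAI Python client (v1.x); the model name and prompt are placeholders, and reproducibility is best-effort, not guaranteed.

```python
# Requesting (best-effort) reproducible outputs via the Chat Completions API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about PyTorch."}],
    temperature=0,        # reduce sampling randomness
    seed=42,              # ask the backend to reuse the same sampling path
)

print(response.choices[0].message.content)
# If system_fingerprint differs between calls, the backend configuration changed,
# so responses may differ even with the same seed.
print(response.system_fingerprint)
```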
Jineet Doshi [00:08:26]: With these models, I'm sure a lot of you would have played with them. Things like temperature especially cause a big difference in the model's output. So again, it's like you are basically trying to throw a dart at a constantly moving target out here. So what are the different approaches to evaluating LLMs? The first idea: NLP as a field has existed for multiple decades, so can we borrow ideas from NLP to help us with evaluating LLMs? Because at the end of the day, a lot of their tasks are language based. So the first idea is, rather than posing the question to the LLM as an open ended problem, can we essentially twist it to make it a multiple choice kind of question, where we ask it to choose an option? That way we are constraining the model's output.
Jineet Doshi [00:09:25]: And with this, we can essentially use traditional classification metrics, like precision, recall, and AUC-ROC, all of the stuff that we've known for ages. An extension of that is, let's say you have a more open ended kind of format where the LLM has produced a long-form answer. How can we extend that idea? Even though the right answer can be expressed in multiple ways, there might be specific words that you feel are necessary for the answer to be counted as correct in certain cases. So you can have a list of words that you want to check for in the LLM output. And with the same idea, you can extend it not just to words but also to different entities, things like numbers and dates. Depending upon your use case, you can check for these things in the LLM output. And then again, when you do this, you can apply the traditional classification metrics of accuracy, precision, and recall.
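A minimal sketch of the keyword and entity checks described here; the keywords, regex, and example outputs are illustrative, not from the talk.

```python
# For each LLM answer, verify that required words and a simple entity (a year)
# appear, then score the pass/fail results against human labels.
import re
from sklearn.metrics import accuracy_score

required_keywords = ["refund", "30 days"]
year_pattern = re.compile(r"\b(19|20)\d{2}\b")   # crude check for a 4-digit year

def passes_checks(answer: str) -> int:
    has_keywords = all(k.lower() in answer.lower() for k in required_keywords)
    has_year = bool(year_pattern.search(answer))
    return int(has_keywords and has_year)

llm_answers = [
    "You can request a refund within 30 days of purchase, per our 2023 policy.",
    "Returns are generally accepted if the item is unused.",
]
labels = [1, 0]   # human judgment of whether each answer should count as correct

predictions = [passes_checks(a) for a in llm_answers]
print("accuracy:", accuracy_score(labels, predictions))
```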
Jineet Doshi [00:10:32]: So the pros of this approach are that it is good for certain tasks, and it's pretty easy to compute, pretty easy to run. But it's mainly limited to knowledge and reasoning based tasks where you have reference answers, and of course the con is that it does not work for many of the tasks that these LLMs do, especially for open ended outputs. The next topic is around text similarity. Let's say in this case, there is a question, there's the LLM's answer, and you also have a reference answer that you can compare against. Of course the LLM's answer is not going to match your reference answer word by word, but you want to make sure that it is at least semantically correct, that it's semantically similar to the reference answer.
Jineet Doshi [00:11:21]: So that's where we go to the embedding space, right? That's the idea of, okay, can we convert both the reference and the LLM answer into the embedding space? And in the embedding space, can we do things like cosine similarity and vector similarity? Because the idea is that in the embedding space, if the answers are semantically similar, the vectors would be quite near to each other. Some of the metrics that can be used here are things like BERTScore, BLEU score, edit distance, and cosine similarity. Again, these are very well studied metrics in NLP. The pros of this approach are that it's very easy to compute and automate, and it is language independent, which is also quite good. At the end of the day, your LLM's answer could be in Spanish as well, but in the embedding space, if it is similar to your reference answer, the embeddings would still match. The con, though, is that there is a dependence on the quality of the embedding that you use. So if you have a very domain specific use case, let's say in the legal space, you want to make sure that your embedding captures semantic meaning in the legal space. These metrics have also been known to have a bias towards short or long text, and some studies have shown that they're not always correlated with human judgment.
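A small sketch of reference-based similarity in the embedding space, assuming the sentence-transformers library; the model name is just one commonly used general-purpose encoder, not a recommendation from the talk.

```python
# Compare an LLM answer to a reference answer via cosine similarity of embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder choice of encoder

reference = "The refund will be processed within five business days."
llm_answer = "You should see the money back in your account within about a week."

embeddings = model.encode([reference, llm_answer], normalize_embeddings=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

# The pass/fail threshold is use-case specific; treat this as a signal, not ground truth.
print(f"cosine similarity: {similarity:.3f}")
```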
Jineet Doshi [00:12:50]: The next idea within the traditional NLP space was using benchmarks. Again, a very well established, very well studied technique. Think of modern NLP benchmarks as essentially comprising multiple tasks, and they come with built-in metrics. Some of the very popular benchmarks are MMLU and SuperGLUE, and there's a very long list of constantly evolving benchmarks. So today there are a lot of benchmarks available out there, each testing for a specific aspect of the LLM. You have a benchmark for testing logical reasoning, you have a benchmark for testing toxicity, you have a benchmark for testing math concepts, and things like that. Again, I highly recommend checking them out. So the advantage of using benchmarks is that you can cover a wide variety of tasks, especially if you combine multiple benchmarks together. That way you can evaluate the LLM across a variety of different parameters.
Jineet Doshi [00:13:56]: It is pretty easy to set up and automate, and a lot of these benchmarks are publicly available along with their metrics. The con is that it is still limited to multiple choice kinds of questions, because a lot of these metrics are, again, your traditional classification metrics. And more importantly, because of the way these LLMs are trained on the entire Internet's data, they could have been trained on the benchmark data itself. So it's important to keep that in mind and make sure that the LLM you are evaluating was not trained on these benchmarks. Otherwise it's like cheating, in a way. One really interesting open source tool to check out is the LM Evaluation Harness by EleutherAI. It supports more than 400 benchmarks, and it also supports running OpenAI models and Hugging Face models. You could evaluate all these open source LLMs out there with just one line of code after installing. So, pretty convenient.
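To make the benchmark idea concrete, here is a rough sketch of scoring a model on one multiple-choice benchmark by hand, assuming the Hugging Face datasets library and the "cais/mmlu" dataset layout (question, choices, answer index); ask_llm is a hypothetical placeholder for whatever model is being evaluated, and in practice a tool like the LM Evaluation Harness wraps all of this.

```python
# Multiple-choice benchmark accuracy on one MMLU subject (layout assumed as described above).
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    """Placeholder: return the model's chosen letter, e.g. 'B'."""
    return "A"

dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")
letters = ["A", "B", "C", "D"]

correct = 0
for example in dataset:
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(example["choices"]))
    prompt = f"{example['question']}\n{options}\nAnswer with a single letter."
    prediction = ask_llm(prompt).strip().upper()[:1]
    correct += int(prediction == letters[example["answer"]])

print("accuracy:", correct / len(dataset))
```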
Jineet Doshi [00:15:00]: The idea of benchmarks has also been extended into LLM leaderboards. I'm sure a lot of you have been keeping track of things like the Hugging Face leaderboard, where they essentially take multiple benchmarks, run these models against those benchmarks, and then come up with an aggregate score of how well the models are doing. Hugging Face is not the only one. There are tons of leaderboards out there: Chatbot Arena has one, CRFM at Stanford has one as well, and many more. One caveat to keep in mind with these leaderboards is that the same model across different leaderboards could have different rankings, because the benchmarks that they've used are different, or the formatting of the prompt that they've used in the evaluation is different. And as we already saw, even minor changes in formatting sometimes cause drastic changes in the model's output. So again, things to just keep in mind while using leaderboards. The second bigger idea was, apart from traditional NLP techniques, how can we leverage human knowledge? How can we leverage human evaluators? Essentially, to evaluate these LLMs, we can have manual labelers, and we can ask them to score the LLMs' outputs across a variety of different criteria, with factual correctness, relevancy, fluency, and hallucinations being just some of them.
Jineet Doshi [00:16:36]: And then for each criterion, the labelers give scores, which can be aggregated later on. Though, in practice, what we have seen is that it is extremely important to provide guidelines and training to the human evaluators who are doing this. Otherwise it can get quite messy, right? What is a three out of five answer for you could be a five out of five answer for me. So it's really important to provide guidelines to ensure consistency of evaluations when humans are involved. And also, what we've seen is that for domain specific tasks, let's say in the medical domain or the legal domain, if you have humans evaluating the LLMs, they need to have that domain knowledge. So that's one more limitation to keep in mind. What people realized, again, as I just spoke about, is that with human evaluations things can get pretty subjective, and sometimes it's really difficult for someone to say, hey, is this a three out of five? Is this a two out of five? So an easier alternative to that is preference-based answers, where you just have a model face-off and you say, hey, out of these two answers, which one is better? That's usually much simpler and much quicker for humans to do.
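One standard way to check the consistency of human evaluators that the talk emphasizes, though not named here, is an inter-annotator agreement statistic such as Cohen's kappa; the scores below are made up for illustration.

```python
# Quantify how consistently two human labelers score the same LLM answers.
from sklearn.metrics import cohen_kappa_score

# 1-5 quality scores from two labelers on the same ten LLM answers
labeler_a = [5, 4, 3, 5, 2, 4, 3, 5, 1, 4]
labeler_b = [5, 4, 4, 5, 2, 3, 3, 4, 1, 4]

kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")  # low agreement suggests the guidelines need tightening
```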
Jineet Doshi [00:17:56]: And then again, these preferences can be aggregated to come up with an aggregate ranking. That's the idea behind some of these model arenas which are out there. An extreme version of this is red teaming, which is a concept taken from cybersecurity, where essentially you hire a bunch of specialists to try and break your own LLM before someone else does. We've seen foundation model companies like OpenAI and Anthropic already have these teams, and it's interesting to think whether others will follow suit and whether red teaming of LLMs becomes a new job category. So the pros of human evaluation are that it gives us a wider array, a broader range of tasks that we can evaluate, and for many tasks, human evaluation is still considered the gold standard. The con is that it is expensive, it's absolutely not cheap, and you cannot scale it easily; if you have millions of data points you want to evaluate on, that's not going to scale.
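Arena-style leaderboards commonly turn the pairwise preferences just described into an aggregate ranking with an Elo-style rating; the following is a minimal sketch with made-up face-off results, not the exact scheme any particular arena uses.

```python
# Aggregate pairwise human preferences into a ranking with simple Elo updates.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    win_prob = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - win_prob)
    ratings[loser] -= k * (1 - win_prob)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
face_offs = [("model_a", "model_b"), ("model_a", "model_c"),
             ("model_b", "model_c"), ("model_a", "model_b")]  # (winner, loser) pairs

for winner, loser in face_offs:
    update_elo(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```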
Jineet Doshi [00:19:12]: And of course, as we discussed, it could have variability if guidance is not provided, and domain specific tasks do require domain specific knowledge. From the survey that we did a couple of months ago with the MLOps community, an interesting insight we gathered was that 80% of the respondents are using human evaluators today to evaluate their generative AI systems. So it's good to know that in this world of AI, humans still hold value. Another key idea I would like to introduce now is using LLMs to evaluate other LLMs. And if there's one takeaway I would like you to take from this talk, it is essentially this. Some bright people had this idea that, hey, if we are saying models like GPT-4 are reaching close to human level performance on specific tasks, then can we ask them to do the task of evaluation itself? So that's what has been tried here. The idea was essentially to use more capable models like GPT-4 to evaluate the less capable models or the other generative AI systems out there. The hypothesis here is that you can get the benefits of human evaluation but without all the cons, because GPT-4 can scale infinitely, it's less expensive, and it's more consistent than humans.
Jineet Doshi [00:20:40]: There have been a lot of studies that actually support this hypothesis, where they showed that ChatGPT outperforms a lot of the Mechanical Turk workers on many tasks out there; ChatGPT does a better job than a lot of those workers on specific tasks. And that's why this idea has really caught on of late. It's become extremely popular now to use LLMs as judges, and Hugging Face has a lot of models out there which are specifically trained to be judges and detect specific things like toxicity, hallucinations, and what have you. But of course, issues still remain. We haven't solved the problem yet. When you use an LLM as the judge, the evaluations are very dependent on the choice of evaluator.
Jineet Doshi [00:21:32]: Whether you pick GPT-4 versus Claude or GPT-3.5, your evaluation results would be very different, because your judge is basically very different. And then, as we discussed, the instructions and formatting of the prompt that you give to the evaluator also affect the evaluations, which ultimately affect your metrics. We've also observed that LLMs are not very good at some complex tasks which involve multi-step reasoning. And some of these LLMs are known to have biases, like a verbosity bias, where the LLMs tend to prefer longer answers over shorter ones, even though the shorter ones are of good quality. There's also sometimes a positional bias with these LLMs, where they tend to prefer answers at specific positions. I'll just skip through some of these pros and cons, which we already discussed: the pros are that it allows us to have a broader array of evaluations, it's cheaper than humans, and it allows us to scale. But the con is that we do need to be aware that there are biases in there, and it's very sensitive to the choice of evaluator.
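A minimal LLM-as-a-judge sketch, again assuming the OpenAI Python client; the judge model, rubric, and 1-to-5 scale are illustrative choices, and, as noted above, results will shift with the choice of judge and the prompt formatting.

```python
# Ask a stronger model to score another model's answer on a simple rubric.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str, model: str = "gpt-4-turbo") -> str:
    prompt = (
        "You are an impartial evaluator. Rate the answer to the question below "
        "for factual correctness and relevance on a scale of 1 to 5. "
        "Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(judge("What is the capital of France?", "Paris is the capital of France."))
```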
Jineet Doshi [00:22:46]: So, kind of putting it all together, our friends over at Hugging Face created this really nice chart which puts all these different evaluation approaches together. And a good observation here is that if we want to evaluate across more domains, the cost increases, right? That's the natural thing. But if you see, there's one outlier, which is the model-based approaches, that is, using LLMs as the evaluator. And that's exactly why that is becoming very popular these days. Additionally, as we discussed initially, it's really important for us to evaluate for safety and security as well. How do we evaluate for toxicity within the LLM's outputs? How do we evaluate for bias and hallucinations as well? Again, very important things we want to look into before we put this into production. And even on the security side, this is a constantly evolving space.
Jineet Doshi [00:23:46]: A lot of different attacks have been invented by nefarious actors, who could do things like prompt injection, where they could run malware into your systems via the prompt, or data leakage, where they could steal some of the data on which the LLM was trained. Again, we want to make sure that our LLMs and our systems are not susceptible to this before deploying to production. So some of the strategies which are commonly used are running special benchmarks (there are security-focused benchmarks out there that we can run against before deployment), using LLM judges (again, there are specialized judges for toxicity, hallucinations, things like that), and, if you do have the budget, you can always do red teaming. And moving beyond just LLMs, when we look at systems like RAG, it's an entire system with all these different components involved, right? So in this case it becomes very important to evaluate all the different components of the system. For RAG, for example, we want to evaluate: is the retrieval system working correctly? Is the generation system working correctly? And then finally, as a whole, is the system working as expected? The open source community has been pretty active in this space.
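Before moving on to tooling, here is a small sketch of the component-level RAG check just described, scoring only the retrieval step; the document IDs, retrieved lists, and hit-rate metric are placeholders for illustration.

```python
# Evaluate the retrieval component of a RAG system on its own:
# did the retriever surface the gold document within the top k results?
def hit_rate_at_k(retrieved: list[list[str]], gold: list[str], k: int = 5) -> float:
    hits = sum(1 for docs, g in zip(retrieved, gold) if g in docs[:k])
    return hits / len(gold)

retrieved_docs = [
    ["doc_12", "doc_7", "doc_3"],   # results for query 1
    ["doc_44", "doc_9", "doc_12"],  # results for query 2
]
gold_docs = ["doc_7", "doc_1"]      # the document each query should have found

print("hit rate@3:", hit_rate_at_k(retrieved_docs, gold_docs, k=3))
# The generation step and the end-to-end behavior then need their own checks,
# for example with reference answers or an LLM judge.
```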
Jineet Doshi [00:25:10]: There are tons of tools available out there in the open source world; this is just some of them. I highly recommend checking some of these out. They're pretty cool. There's also a lot of work being done in the universities. HELM is an interesting initiative at Stanford, led by Professor Percy Liang, and essentially the idea is to think of evaluation holistically.
Jineet Doshi [00:25:37]: That's what the H in HELM stands for: Holistic Evaluation of Language Models. So, a quick recap of what I just covered. First, we discussed the challenges of evaluating LLMs. Then we went into the different approaches to evaluating LLMs: traditional NLP techniques, human evaluators, and finally using LLMs themselves as evaluators, and we spoke about the pros and cons of each approach. We also spoke about evaluating for safety and security, evaluating the bigger generative AI systems, and the open source landscape as well. Yeah. Again, thank you very much for having me.
Jineet Doshi [00:26:20]: Quick questions? I guess we probably don't have time, but afterwards I'd be happy to discuss these.