MLOps Community

Measuring the Minds of Machines: Evaluating Generative AI Systems

Posted Mar 15, 2024 | Views 270
# Evaluation
# GenAI
# Intuit
SPEAKERS
Jineet Doshi
Staff Data Scientist @ Intuit

Jineet Doshi is an award-winning Data Scientist, Machine Learning Engineer, and Leader with over 7 years of experience in AI. He has a proven track record of leading successful AI projects and building machine learning models from design to production across various domains such as security, risk, customer churn, and NLP. These have significantly improved business metrics, leading to millions of dollars of impact. He has architected scalable and reusable machine-learning systems used by thousands. He has chaired workshops at some of the largest AI conferences like ACM KDD and holds multiple patents. He has also delivered guest lectures at Stanford University on LLMs. He holds a master's degree focusing on AI from Carnegie Mellon University.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

Evaluating LLMs is essential in establishing trust before deploying them to production. However, evaluating LLMs remains an open problem. Unlike traditional machine learning models, LLMs can perform a wide variety of tasks such as writing poems, Q&A, and summarization. This leads to the question: how do you evaluate a system with such broad intelligence capabilities? This talk covers the various approaches for evaluating LLMs along with the pros and cons of each. It also covers evaluating LLMs for safety and security and the need for a holistic approach to evaluating these very capable models.

TRANSCRIPT

AI in Production

Slides: https://docs.google.com/presentation/d/1hAtTaPhhZZ3JwakimRdX6e827VzOij-K/edit?usp=drive_link&ouid=112799246631496397138&rtpof=true&sd=true

Adam Becker 00:00:10: Hi, Jineet, how are you? Can you hear us?

Jineet Doshi 00:00:16: Hey, yeah, I can hear you. Can you hear me?

Adam Becker 00:00:19: Nice, nice. Yes. Welcome. Thanks for coming on the stage today. You're going to talk about something I have found fascinating for a few years now, and it was first sparked in me after I read this book called the Measure of Mind. And I believe that you'll be talking about something very similar: how do we evaluate intelligence now, given that it can be applied across so many different tasks, and it's relatively difficult to wrap our heads around what it actually means to be intelligent across so many different domains, and certainly how to do that qualitatively and quantitatively.

Adam Becker 00:00:59: Jineet, the floor is yours. Do you need to share your screen?

Jineet Doshi 00:01:03: Yes, I do.

Adam Becker 00:01:05: Okay, so go for that. And folks, if you have questions in the meanwhile, make sure to put them in the chat, and I'm going to pose them to Jineet. I'll be back here on the stage with you in about 25 minutes.

Jineet Doshi 00:01:20: Sounds good. Can you see my screen?

Adam Becker 00:01:21: We can see your screen.

Jineet Doshi 00:01:22: Can you see my slides?

Adam Becker 00:01:23: Yes. Take it away.

Jineet Doshi 00:01:25: Okay, awesome. Yeah. Thank you so much for having me. I'm Jineet. I'm currently a staff data scientist at Intuit, leading generative AI projects, and I've been working in the AI space for the past seven years, building and deploying machine learning models in production across a wide variety of domains. I do want to say before I begin my talk that the views expressed in this talk are my own and should not be associated with anyone else. So, evaluating generative AI systems: this is a really interesting topic to me personally because it remains an open problem. No one's figured it out yet.

Jineet Doshi 00:02:11: So there is no right, there's no wrong. What I would like to do as part of this talk is just cover different approaches to evaluating generative AI systems, along with the pros and cons of each of those approaches. And ultimately, the purpose of my talk is to engage with all the bright minds that are listening here on this important topic and see how we can together move this field forward. With that being said, the first question is, why is evaluating LLMs challenging? To answer that, let's take a step back into the traditional machine learning world, around a year and a half ago, before the generative AI storm really kicked off. One and a half years in our space, I know, feels like light years ago. Back then we had more of the traditional machine learning models, like your classification and regression models. These were very task-specific models: you would predict whether it's going to rain today, or you would predict the sentiment of a review someone left. For these models, the output was very limited. The model would basically output either the probability of whether it's going to rain today or, again, a probability for each sentiment. And in these cases, we had task-specific metrics, which are very well defined and very easy to measure.

Jineet Doshi 00:04:04: So in the case of classification tasks, we had metrics like accuracy, precision, recall, and AUC, and with regression, things like RMSE, which were very well defined. But now, suddenly, in the generative AI world, things become very different, because these models are capable of a lot of open-ended tasks. In these open-ended tasks, many times there is no ground truth, because there are millions of ways of doing the same task. And in a way, these models have also flipped the script: now one model can do multiple tasks. Earlier we used to have task-specific models, but in this new realm, suddenly one model can generate code, it can do Q&A, it can even tell jokes. So it really begs the question, how do we measure general intelligence? And I feel that even with human intelligence, we really haven't figured out a perfect system to measure it. So now we are in that same realm with our AI systems. And of course, we know that these LLMs sometimes do hallucinate. They produce harmful answers, and we absolutely need to evaluate for these factors as well before we put these into production.
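For contrast with the open-ended case, the well-defined task-specific metrics named above can be computed in a few lines. This is a minimal sketch using only the standard library; the rain labels and regression values are made-up toy data, and the function names are illustrative rather than taken from the talk.

```python
import math


def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy


def rmse(y_true, y_pred):
    """Root mean squared error for a regression model."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy "will it rain today?" ground truth and predictions (made up).
rain_true = [1, 0, 1, 1, 0, 0]
rain_pred = [1, 0, 0, 1, 1, 0]
precision, recall, accuracy = precision_recall_accuracy(rain_true, rain_pred)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
print(f"rmse={rmse([2.0, 4.0], [1.0, 6.0]):.3f}")
```

Every number here depends on having a single ground-truth label per example, which is exactly what open-ended generation tasks lack.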

Jineet Doshi 00:05:33: Another thing that makes evaluating LLMs really challenging is just the variability in the output. There have been a lot of studies which show that sometimes even minor changes to the prompt, like adding some extra spaces or removing colons, would drastically affect the LLM output, to the point that the task accuracy would vary all the way from 3% to 80% just based on these small formatting changes in the prompt. Clearly, this makes the task of evaluating quite challenging. Even until the most recent OpenAI DevDay, the outputs of a lot of these LLMs were not even deterministic. In ChatGPT, the same prompt would produce a different answer across different days. But just recently, during DevDay, I think OpenAI added the seed parameter and the system fingerprint parameter, which at least introduce some sense of determinism to these models' outputs. Nevertheless, variability still remains a challenge. And there are many other hyperparameters for the model, especially things like temperature, where changing them would drastically change the LLM output and again would affect evaluations. So with that being said, I would like to cover all the different approaches to evaluating LLMs.
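One way to make the prompt-sensitivity issue visible is to run the same evaluation set through several prompt formats and report the accuracy under each. The sketch below assumes the model is any callable that maps a prompt string to an answer string; `toy_model` is a deliberately brittle stand-in, not a real LLM, and the templates and facts are made up for illustration.

```python
def accuracy_by_format(model, templates, examples):
    """Score `model` separately under each prompt template.

    model:     callable taking a prompt string and returning an answer string
               (a hypothetical stand-in for a real LLM call)
    templates: dict mapping a template name to a format string with {question}
    examples:  list of (question, expected_answer) pairs
    """
    results = {}
    for name, template in templates.items():
        correct = 0
        for question, expected in examples:
            answer = model(template.format(question=question))
            correct += int(answer.strip().lower() == expected.strip().lower())
        results[name] = correct / len(examples)
    return results


# Made-up knowledge base for the stub model.
FACTS = {"2+2": "4", "capital of France": "Paris"}


def toy_model(prompt):
    """Brittle stub: only answers when the prompt starts with 'Q: '."""
    if not prompt.startswith("Q: "):
        return "unknown"
    question = prompt[len("Q: "):].split("\n")[0]
    return FACTS.get(question, "unknown")


templates = {
    "colon_space": "Q: {question}\nA:",
    "no_colon": "Q {question}\nA",
}
examples = [("2+2", "4"), ("capital of France", "Paris")]

results = accuracy_by_format(toy_model, templates, examples)
spread = max(results.values()) - min(results.values())
print(results, "spread:", spread)
```

Reporting the spread alongside the headline accuracy gives a rough sense of how much of a score is due to the prompt format rather than the model.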

Jineet Doshi 00:07:10: The first line of thinking that people had when facing this problem was, hey, NLP as a field has existed for multiple decades, it's not new. So can we reuse some of the techniques from the NLP world to help us with this? When it comes to things like factual accuracy, the first idea is to reformat the question itself. Instead of having the LLM output an open-ended answer, you ask it to choose among specific options that you give it, essentially converting it into multiple choice, and that way you're artificially restricting the model's output. If you do that, you can now apply all the traditional classification metrics, things like precision, recall, AUC-ROC and what have you. To extend that to cases where you do have long-form answers, let's say you have a list of important words that the answer should contain in order to count as accurate. In these situations, you can check for that list of words within the final LLM output. The same idea can also be extended to different entities: in certain situations, you would have a list of specific numbers that you would want the LLM answer to have in order for it to count as right, as well as things like dates and other entities. And again, in this scenario as well, we can apply traditional classification metrics, things like precision, recall, and AUC-ROC. So the pros of this approach are that it's definitely good for certain tasks like knowledge and reasoning, things related to factual accuracy.
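The keyword and entity check described above can be sketched as a simple coverage score: what fraction of the required terms show up in the model's answer. This is an illustrative sketch, not a method from the talk's slides; the function name, the case-insensitive whole-word matching rule, and the example answer are all assumptions.

```python
import re


def keyword_recall(answer, required_terms):
    """Return (fraction of required terms found, list of found terms).

    Matching is case-insensitive and on whole words, so "1969" does not
    match inside "19690". Terms are assumed to start and end with a
    letter or digit.
    """
    text = answer.lower()
    found = [
        term
        for term in required_terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    ]
    return len(found) / len(required_terms), found


# Made-up LLM answer and a hand-written list of required entities.
answer = "Apollo 11 landed on the Moon in July 1969 with Neil Armstrong aboard."
required = ["Apollo 11", "1969", "Armstrong", "Aldrin"]

score, found = keyword_recall(answer, required)
print(score, found)
```

A threshold on this score (for example, requiring every term) turns each answer into a right/wrong label, after which the usual classification metrics apply.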

Jineet Doshi 00:09:16: It is quite easy to compute, but the natural con is that it doesn't really work for many tasks, especially the open-ended ones. Extending that idea further, let's say you do have a reference answer, and you want to compare the LLM answer to the reference answer. But given the long-form text outputs of some of these LLMs, there could be millions of possible ways of generating the right answer. So how do you approach this problem? One of the ideas used in the field is to convert the raw text into embeddings or some other form, and do the comparison in the embedding space. The idea here is that the embeddings, if they're good enough, can capture semantic meaning. The distance between the reference answer and the LLM answer in the embedding space would be low if they are close to each other semantically. And there's a whole bunch of metrics to capture similarity, things like BERTScore, BLEU score, edit distance, as well as cosine similarity.
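Two of the metrics just mentioned are small enough to write out. The sketch below shows cosine similarity over embedding vectors and Levenshtein edit distance over raw strings. Note that edit distance works on the surface text rather than in embedding space. The three-dimensional vectors are made-up placeholders; in practice they would come from whichever embedding model you use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Placeholder embeddings (made up) for a reference answer and an LLM answer.
reference_vec = [0.2, 0.7, 0.1]
llm_answer_vec = [0.25, 0.65, 0.05]
print(cosine_similarity(reference_vec, llm_answer_vec))
print(edit_distance("kitten", "sitting"))
```

How meaningful the cosine score is depends entirely on the embedding model, which is the quality dependence raised in the next part of the talk.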

Jineet Doshi 00:10:38: Some of the pros of this are that it's pretty easy to compute and automate. Some of these metrics are pretty well studied and established. It is also language independent. Even if you have your reference answer in Spanish and your LLM answer in English, you could still use this, because the assumption is that if the embeddings really capture the semantic meaning, they would be language independent. That also brings us to the key thing to be aware of, which is the dependence on the quality of embeddings. In domain-specific areas, like the medical field or the legal domain, you would require embeddings that are able to capture the semantics of that domain. Sometimes there is also bias towards short or long text, and some of these metrics have been shown not to correlate with human judgment, just something to be aware of. Another idea from the traditional NLP world, which has been extended now into the LLM world, is the idea of benchmarks, which are essentially multiple tasks, and they come with metrics along with the tasks themselves.

Jineet Doshi 00:11:54: Some of the very common benchmarks you would have come across are things like MMLU and SuperGLUE, and this is an ever-growing list. Every day there are new benchmarks being published, and these benchmarks cover a whole variety of different factors, things like evaluating for toxicity, evaluating for logical reasoning, or evaluating physics concepts as well. The pro of benchmarks is that they can cover a wide variety of different tasks. You can even combine multiple benchmarks to increase the coverage of evaluations even further. And again, they are very easy to set up and automate. A lot of these are published and openly available on the Internet, though one of the limitations is that open-ended tasks are still difficult to evaluate via benchmarks. And an even more important thing is that, because these LLMs are trained on the entire Internet's data, it could happen that the LLM was trained on the benchmark itself, in which case the evaluations would not really be fair.
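The talk does not say how to detect that a model was trained on a benchmark. One simple heuristic, swapped in here purely for illustration, is to measure word n-gram overlap between a benchmark item and a sample of training text, when such a sample is available. The function names, the whitespace tokenization, and the choice of n=5 are assumptions of this sketch.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text`, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_text, n=5):
    """Fraction of the benchmark item's n-grams that also occur in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Made-up benchmark question and training-corpus snippet.
item = "the quick brown fox jumps over the lazy dog"
corpus = "he said the quick brown fox jumps over a fence"
print(overlap_fraction(item, corpus))
```

A high overlap only flags an item for a closer look; it cannot prove contamination, and a low overlap cannot rule it out when the training data is not fully accessible.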

Jineet Doshi 00:13:12: So just something to keep in mind is to make sure the model that you're evaluating is not trained on some of these benchmarks. One really interesting tool for benchmarks is the LM Evaluation Harness by EleutherAI. In just one line of code, you can evaluate different LLMs across 400 different benchmarks. And this is open source, it's available out there. The idea of multiple benchmarks is extended into the LLM leaderboards. The Hugging Face leaderboard is a very common and popular one, where they run multiple benchmarks against the model and generate model rankings based off of that. Along with Hugging Face, there are other leaderboards, like Chatbot Arena and CRFM, and again, this is an ever-increasing list. One thing to be mindful of with leaderboards is that sometimes you'll see that the ranking of the same model varies across different leaderboards.

Jineet Doshi 00:14:19: And that's essentially because the different leaderboards could be using different benchmarks, and they could also be formatting the prompts themselves differently. As we've seen, even minor changes to the prompt format sometimes do affect the LLM output. So again, something to be mindful of. The next big area of evaluations is human evaluators. The idea here is essentially, can we hire a bunch of people and ask them to score the LLM output across a variety of different criteria, things like factual correctness, relevancy, hallucinations, and all the different criteria that you care about. In this case, the scores could be averaged across multiple labelers and then ranked. Though in practice, we've seen that it is extremely important to provide guidelines and training to the human evaluators to ensure consistency, because what could be a three out of five score for one evaluator could very well be a five out of five for someone else. Given some of the challenges around the subjectiveness of scoring by humans, another idea is to have model face-offs within an arena.
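Averaging scores across labelers and ranking the results takes only a few lines, and reporting the spread next to each mean helps surface the consistency problem described above. This is a minimal sketch; the answer IDs and the 1-to-5 ratings are made up.

```python
from statistics import mean, pstdev


def aggregate_ratings(ratings):
    """Rank answers by mean human score.

    ratings: dict mapping an answer ID to the list of 1-5 scores it
             received from different labelers.
    Returns a list of (answer_id, mean_score, score_stdev), best first.
    A large stdev means the labelers disagreed on that answer.
    """
    rows = [(aid, mean(scores), pstdev(scores)) for aid, scores in ratings.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up scores from three labelers per answer.
ratings = {
    "answer_a": [3, 5, 4],
    "answer_b": [2, 2, 3],
    "answer_c": [5, 5, 5],
}

ranked = aggregate_ratings(ratings)
for answer_id, avg, spread in ranked:
    print(f"{answer_id}: mean={avg:.2f} stdev={spread:.2f}")
```

Answers with a high standard deviation are good candidates for reviewing the labeling guidelines, since they show where a three for one evaluator was a five for another.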

Jineet Doshi 00:15:55: There are a lot of these model arenas out there, where the idea is that it's easier to provide preferences. When you're given two LLM answers, it's much easier to state which answer you prefer than to explicitly score them as a three out of five or a five out of five. And then these preferences can be aggregated into rankings. A more extreme version of this is red teaming. This is a concept borrowed from cybersecurity, where you would hire your own specialists who would be actively evaluating and exploiting your own LLMs. A lot of foundation model companies like OpenAI and Anthropic already have large red teams, and maybe other companies could also follow suit. The pros of human evaluations are that they allow us to cover a vast variety of different tasks, and in some cases they are still considered the gold standard for evaluation. The con is that it is expensive, human evaluators don't come cheap, and of course it cannot scale easily. If you want to evaluate across millions of data points, it would get challenging.
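Pairwise preferences can be turned into a ranking in several ways. The sketch below uses a basic Elo update, which is one common choice and is swapped in here as an illustration rather than something specified in the talk. The model names, the starting rating of 1000, and the K-factor of 32 are assumptions.

```python
def elo_ratings(matches, k=32, base=1000.0):
    """Compute Elo ratings from a sequence of (winner, loser) preferences.

    Each match moves k * (1 - expected win probability) points from the
    loser to the winner, so upsets move ratings more than expected wins.
    """
    ratings = {}
    for winner, loser in matches:
        r_win = ratings.setdefault(winner, base)
        r_lose = ratings.setdefault(loser, base)
        expected_win = 1 / (1 + 10 ** ((r_lose - r_win) / 400))
        delta = k * (1 - expected_win)
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return ratings


# Made-up human preferences: each tuple is (preferred model, other model).
matches = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings = elo_ratings(matches)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Sequential Elo depends on the order in which matches are processed, so with few comparisons the ranking can be noisy; shuffling the match order and averaging is one way to reduce that.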

Jineet Doshi 00:17:17: Without guidance, evaluations can also vary between evaluators. And we've also seen that for domain-specific tasks, let's say you are evaluating something in the legal domain or in the medical domain, the human evaluators are required to have that domain knowledge in order to evaluate correctly. Along with the MLOps Community, we conducted a survey of close to 100 different LLM practitioners, and we asked them, hey, how are you evaluating your LLMs currently? One of the interesting insights that we got from the survey was that close to 80% of the respondents are using human evaluators in their processes right now. The third big category of evaluating LLMs is to use LLMs themselves as evaluators. And if there is one takeaway I want you to have from this talk, it is probably this: this idea has really picked up quite a bit and has become very popular. The essential idea is that if we are saying models like GPT-4 are intelligent and capable, and on certain tasks are as good as humans, then can we ask them to be evaluators themselves? So the hypothesis is to have more capable models like GPT-4 act as evaluators for other LLMs. If it works, you can get the benefits of human evaluators, which is that you can evaluate across various different factors, but without the cons. GPT-4 would be less expensive and inherently more scalable and more consistent as well.

Jineet Doshi 00:19:05: And there have been studies that do support this hypothesis. There have been studies published showing that ChatGPT sometimes does outperform crowd workers from Mechanical Turk and some of these other platforms. And now Hugging Face even has a lot of LLMs which are specifically trained to act as judges and to identify things like toxicity, hallucinations, and what have you. But the problem is not solved yet. Issues still remain even with this. What has been noticed is that evaluations by LLMs are very sensitive to the choice of the evaluator itself. Whether you use GPT-4 as the judge, or Claude, or JudgeLM, or whichever model you use, your evaluations could vary based on that choice. And as we discussed before, the instructions and formatting of the prompts that you give the evaluator would also significantly impact the evaluation output from these models. Again, something which is very important to keep in mind.

Jineet Doshi 00:20:15: Also, we've seen in some studies that these LLMs do tend to have biases sometimes. Sometimes the LLMs are shown to have verbosity bias, where they prefer longer answers over shorter ones, even though the shorter ones could be of similar quality, as well as positional bias, where sometimes these LLMs tend to prefer the answers at a specific position, for whatever reason. Just some things to keep in mind. So the pros of using LLMs as a judge are that we can evaluate on a vast variety of different criteria, and it is cheaper than hiring human evaluators and also easy to scale and automate. But some things to keep in mind are that it is very sensitive to the choice of the model that you use as the evaluator, as well as the instructions that you give to it. And these models could have biases sometimes in the evaluation. Putting it all together, our friends at Hugging Face did a great job putting all of these different approaches to evaluation into one single chart.
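One commonly used mitigation for positional bias, not described in the talk itself, is to ask the judge twice with the answer order swapped and only accept a verdict when both passes agree. The sketch below assumes a hypothetical `judge(question, first, second)` callable that returns the string "first" or "second"; the two stub judges stand in for real LLM calls.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Query the judge in both orders; return "A", "B", or "tie".

    judge: hypothetical callable (question, first, second) -> "first" | "second"
    A verdict counts only if the same answer wins regardless of position.
    """
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as unreliable


def position_biased_judge(question, first, second):
    """Stub judge that always prefers whatever is shown first."""
    return "first"


def content_judge(question, first, second):
    """Stub judge that prefers the answer mentioning 'Paris'."""
    return "first" if "Paris" in first else "second"


question = "What is the capital of France?"
print(debiased_preference(position_biased_judge, question, "Paris", "Lyon"))
print(debiased_preference(content_judge, question, "Paris", "Lyon"))
```

This doubles the number of judge calls, and it only addresses positional bias; verbosity bias needs a separate check, such as comparing verdicts on answers of matched length.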

Jineet Doshi 00:21:32: And if there's one thing we can observe, it is that the more extensive we want our evaluations to be, the more expensive it generally gets. But one outlier to this is the model-based approaches, where you can see we can cover quite a few different evaluation factors, but at a much lower cost. Those model-based approaches are essentially using LLMs as the judge, which is what we were just discussing. And because of this factor, this idea has really picked up within the community. Additionally, it's also very important to evaluate for safety and security. We want to evaluate our LLM outputs for things like toxicity, bias, and hallucinations as well, because we know sometimes these models do display some of these issues. And in the security world, there have been all sorts of different threats, which are constantly evolving as well.

Jineet Doshi 00:22:35: Via prompt injection attacks, nefarious actors are able to steal data, they are able to inject malware into the systems, and we definitely don't want this to be happening in production. Some of the evaluation strategies for safety and security are that there are special benchmarks available which are specifically catered to safety and security. There are also special LLM evaluators available. And if you do have the budget, you could always hire human evaluators or have a red team look into this as well. Then, moving beyond just LLMs, when we start thinking about systems like RAG, where the LLM is just one component of the system, evaluations become even more challenging because now we have to evaluate all the different components in the system. We need to evaluate the retriever first: is the retriever getting the relevant documents? Then we need to evaluate the LLM itself: is it generating the response well? And then we evaluate the entire system together. The open source landscape is again constantly evolving.
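For the retriever step in a RAG system, "is it getting the relevant documents?" is often quantified with recall@k and precision@k against a hand-labeled set of relevant documents. The talk does not name specific retriever metrics, so treat this as one reasonable sketch; the document IDs and relevance labels are made up.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / k


# Made-up retriever output (best match first) and ground-truth labels.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=3))
print(precision_at_k(retrieved, relevant, k=3))
```

Scoring the retriever separately makes it easier to tell whether a bad final answer came from missing context or from the generation step.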

Jineet Doshi 00:23:53: There are tons of new open source tools out there to help evaluate our LLMs across a variety of different factors. And there are a lot of initiatives currently going on at universities like Stanford as well, where the idea is, how do we think about evaluating these models holistically, which is the need of the hour. So a quick recap. We spoke about the challenges of evaluating LLMs. We covered the different approaches to evaluating LLMs, starting with some of the traditional NLP techniques, then using human labelers, and then finally the new idea, which is really picking up, of using LLMs as evaluators. And then we also spoke about evaluating for safety and security and evaluating generative AI systems as well. So yeah, thank you. Any questions, comments? I'll also hang around in the chat after this to answer.

Adam Becker 00:24:58: Wonderful. Nice, Jineet, thank you so much. I believe this is absolutely necessary for everybody who's building anything with LLMs, but also just with traditional machine learning as well. So there's one comment in the chat, I'm going to rephrase it a little bit, but I think that there is some kind of deeper insight that it's trying to get at, which is when you think about bringing on board a bunch of human evaluators, then you have to ask yourself, well, first there's maybe a scientific methodology question, which is, who are these people? And to what extent are they adding or introducing any type of bias? And how are you going to control for that? And then there's also been part of the discourse about the ethical implications of bringing on people where that's all they do. They just become clickers. All they do is click and help you train.

Adam Becker 00:25:54: Do you have any thoughts about that? Anything that you're able to share?

Jineet Doshi 00:25:59: Yeah. So I did cover a bit about that in my talk where, yes, if you are bringing in human labelers, it is extremely important to provide guidance and training. Through that, the idea is we could reduce some of these biases, because inherently humans do have their own biases, and without proper guidelines, those can very easily creep into the evaluation. So if we are hiring human labelers, it is extremely important to figure out the guidelines and the training piece.

Adam Becker 00:26:35: Yeah. The ethical question, did anybody ever bring this up in terms of the economic sort of ethics of it or not? I remember at some point there were like companies that were trying to do human evaluators, but that are ethically sourced and paid for and that sort of thing. But also at that point, it starts to bump up against kind of like the already high costs of doing these sorts of solutions. Right. So I imagine that as arguments of that sort continue to bubble up, you're going to continue to get pushback and LLMs will just be doing that work instead.

Jineet Doshi 00:27:11: Exactly. And that is exactly why the idea of using LLMs themselves as evaluators has really picked up, because of all these different factors. Cost is one of the main factors. And the other thing is also things like biases. Again, I'm not saying that LLMs themselves are free of biases. They have their own biases, as different studies have pointed out. But, yeah, that's why this is an open problem still.

Jineet Doshi 00:27:40: And if anyone can solve this, they would be overnight millionaires.

Adam Becker 00:27:46: Jineet, I have one last question for you. This is a bit of a curveball question. The moment that I started thinking more deeply about how to do evaluation more rigorously, I ended up sort of where you started, which is you begin to reflect on how we evaluate human intelligence, and you realize, well, this is a very difficult thing to do in the first place. And forget even about broad general intelligence. Even just, did this student actually learn the right kind of content in the class? Even that seemingly trivial question turns out to be a very complex question to ask. And there are a lot of methodologies, even within test design, for just trying to figure out how humans learn. Do you feel like you've picked up any new intuition or newfound respect for how humans are to be evaluated?

Jineet Doshi 00:28:46: Well, yeah, I mean, we haven't figured that out yet either. It still remains an open challenge. We have things like IQ tests to test someone's intelligence. But IQ tests are also very limited. They just focus on specific cognitive skills. They don't really measure intelligence of other forms, for example. So, yeah, I think that's why I highlighted it.

Jineet Doshi 00:29:12: I feel we are running into that same challenge now with our AIs, because a lot of these AIs are supremely capable as well. But in many situations, what we have seen is that when evaluating, people are normally concerned about their own task. They have a very specific task that they care about, and they want to see, for example, how well this LLM would handle their customer support issues. That's a very specific task. And in those situations, the good thing is we do have things like benchmarks, and you can create your own golden test sets, along with metrics, to evaluate these models against.

Adam Becker 00:29:58: Yeah. Okay. A couple of other comments from the audience. So one of them is from Edwin. Okay, this is not a question, just a thought, but I want to share this with you anyway. For LLM evaluators, one question I often think about is who or what evaluates or judges the LLM evaluator, and then who or what evaluates the evaluator. You get the idea, right?

Jineet Doshi 00:30:21: It's a very meta question. That's a great question, and definitely something to be very mindful of as well. As I covered in my talk, I do feel like when you use an LLM as a judge, the choice of which LLM you use is very important, as well as the prompt that you give it. Right. So one idea to evaluate the evaluator is, in some cases, if you do have some human evaluations or some manual labels or ground truth, what you can do is have the judge LLM evaluate, and then you compare against those evaluations that you consider your golden test set, essentially the golden evaluations, and then see how much correlation there is between the two.
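
The comparison described here, judge scores against human "golden" scores, could be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions: the scores are made-up example data on a 1 to 5 scale, and `pearson` and `agreement_rate` are hypothetical helper names, not part of any evaluation library.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def agreement_rate(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Fraction of examples where the judge gives exactly the human score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Made-up example data: scores for the same five model responses.
human_scores = [5, 4, 2, 1, 3]   # golden evaluations from human raters
judge_scores = [4, 4, 2, 1, 3]   # scores from the judge LLM

if __name__ == "__main__":
    print(f"pearson:   {pearson(human_scores, judge_scores):.3f}")
    print(f"agreement: {agreement_rate(human_scores, judge_scores):.2f}")
```

A low correlation would suggest revisiting the choice of judge LLM or its prompt before trusting its scores on unlabeled data. For ordinal ratings, a rank correlation may be a better fit than Pearson.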

Adam Becker 00:31:18: Yeah, it sounds like this is an unsolved problem. I think it's like, doesn't Karl Marx have a quote? Or, like, who will teach the teachers? Who will police the police, it sounds like.

Jineet Doshi 00:31:28: Exactly.

Adam Becker 00:31:29: We're back at square one here. One more question for you. Manassi is asking: you mentioned some examples, but can you suggest one tool or platform for evaluating a RAG application end to end, effectively and efficiently?

Jineet Doshi 00:31:45: There are a lot of tools, honestly, out there in the open source world. So one package for RAG specifically that I know about is called Ragas. That is quite popular, I think, in the space for evaluating RAG applications.

Adam Becker 00:32:01: Cool. Sorry. Now, one last thing, because somebody seems to really think this would be useful. Wondering how you think about comparing different evaluations over time when the test sets change, perhaps during development. Also, can you please bring up the evaluation and guardrail tools slide once more? Sure. Yeah.

Jineet Doshi 00:32:20: Happy to. And by the way, again, I'll be sharing these slides afterwards as well, so attendees would definitely have these. Let me quickly.

Adam Becker 00:32:29: Maybe somebody just wants a quick screenshot of it.

Jineet Doshi 00:32:36: Sure. Just a moment. Okay. I guess this is the one they were looking for. Is that right? I think this has all the different.

Adam Becker 00:32:56: I'm not seeing it, but I am getting a correction from the chat. Zeb Abraham says. Okay. It's a Latin phrase from Juvenal. Sorry, Juvenal. I think I might have attributed it to.

Jineet Doshi 00:33:11: Okay, how about. Okay, I wasn't sharing before, but, yeah, again, just something to be mindful of is this is still an incomplete list. This is constantly evolving every day as we speak.

Adam Becker 00:33:30: Yeah. Okay.

Jineet Doshi 00:33:33: It is quite a fascinating space to be in.

Adam Becker 00:33:38: Yeah. I mean, it's changing so quickly. So who will watch the watchmen? Zev says. That's right. Okay, very cool. I hope this is useful. Tom says, thank you very much, Jineet. It's been a pleasure having you.

Adam Becker 00:33:53: Please keep up the good work.
