Holistic Evaluation of Generative AI Systems
Jineet Doshi is an award-winning Data Scientist, Machine Learning Engineer, and Leader with over 7 years of experience in AI. He has a proven track record of leading successful AI projects and building machine learning models from design to production across various domains such as security, risk, customer churn, and NLP. These have significantly improved business metrics, leading to millions of dollars of impact. He has architected scalable and reusable machine-learning systems used by thousands. He has chaired workshops at some of the largest AI conferences like ACM KDD and holds multiple patents. He has also delivered guest lectures at Stanford University on LLMs. He holds a master's degree focusing on AI from Carnegie Mellon University.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Evaluating LLMs is essential in establishing trust before deploying them to production. Even post-deployment, evaluation is essential to ensure LLM outputs meet expectations, making it a foundational part of LLMOps. However, evaluating LLMs remains an open problem. Unlike traditional machine learning models, LLMs can perform a wide variety of tasks such as writing poems, Q&A, summarization, etc. This leads to the question: how do you evaluate a system with such broad intelligence capabilities? This talk covers the various approaches for evaluating LLMs, such as classic NLP techniques, red teaming, and newer ones like using LLMs as a judge, along with the pros and cons of each. The talk includes evaluation of complex GenAI systems like RAG and agents. It also covers evaluating LLMs for safety and security and the need for a holistic approach to evaluating these very capable models.
Jineet Doshi [00:00:00]: My name is Jineet Doshi. I'm a staff AI scientist at Intuit, and I'm caffeine free, so I don't drink tea or coffee. Try to keep it natural.
Demetrios [00:00:20]: Welcome back to the MLOps community podcast. I am your host, Demetrios. And today, this is the holistic view of evaluation. We break down three different ways that you can evaluate your gen AI systems and talk about LLMs as a judge. Some best practices, some challenges, different ways that you can do them. Oh, I felt like I've had a lot of conversations on evaluation in the past two years. This really took all of those conversations that I've had. You know, maybe there was a sliver over here, a nice little chunk over there, and it wrapped them up, put a little bow tie on top of it, and served it as a present.
Demetrios [00:01:11]: So Christmas came early for y'all. Hope that some of the stuff we're saying in this makes you have better AI systems. Let's jump into the conversation and as always, if you enjoyed it, share it with just one friend. This conversation came about because I was giving a talk two years ago now, not this last Data and AI Summit, but the one before that, around LLMs and AI in production. And you were at that talk, and afterwards, or actually even before I went on, we got to chatting and you were saying, you know, one thing that is very important that I'm not seeing enough people talking about is evaluating, but evaluation on a systems level. And you had ample experience because of your work at Intuit and building out. What is it? Intuit Assist that's powering over a hundred million customers, which is insane, just in general insane scale. And so that planted the seed in my head.
Demetrios [00:02:24]: And here we are two years later, finally getting around to having the evaluation conversation. So it's great to have you here, man.
Jineet Doshi [00:02:32]: Yeah, thank you, Demetrios, for having me. Yeah, absolutely. This is, like, a very important topic because it kind of goes, like, right to the core of, like, building trustworthy ML systems, which is, like, extremely important. So, yeah, happy to talk about this. And yeah, I also do want everyone to know, like, the views that I express are my own. And, yeah, it should not be associated with any organization or entity.
Demetrios [00:03:06]: There we go. Just as a disclaimer, and we'll put that one front and center so nobody gets twisted. But there is a lot of cool stuff that's happening at Intuit. You guys are powering a lot of models. What was it? Hundreds of thousands or tens of thousands?
Jineet Doshi [00:03:26]: Right. So when I joined Intuit, like six and a half years ago, at that time, basically we had like six, a single-digit number of models, in production, to now, like fast forward to now, where we've really like invested a lot in AI and we are at the point where we have like thousands of models in production making 58 billion predictions a day. Like that's the scale at which we're at now.
Demetrios [00:03:56]: 58 billion. That is such a large amount. I, my mind can't even fathom how many zeros that is. And I can assume that they're not all LLMs. You have a mix of traditional and gen AI in there. But you've been noticing, right, that when it comes to the gen AI type of products that you're putting out, you need to really think wisely about evaluation and how you are going about recognizing if the system in general is performing. And so I think when we kicked off this conversation way back when, you were saying, yeah, you have to look at the model output, but you have to look at the model output on many different vectors and you also have to look at the system. So retrieval evaluation can be one thing.
Demetrios [00:04:55]: So how are you chunking things and looking at that? There's, and even looking at the, how are you embedding and the embedding model and that is becoming more and more mainstream. But I'm wondering if you want to give us a lay of the land on why you feel evals are challenging and why it still feels like it's an open problem even two years later.
Jineet Doshi [00:05:21]: Yeah, yeah, absolutely. Yeah. So I mean evaluating generative AI systems is, it is quite challenging and I think there's like no right or wrong. It's all like dependent on the use case at the end of the day. So like when we look back and see like, okay, why is fundamentally evaluating LLMs challenging? Right? So if we go back to like the classical ML space essentially like it feels like a long time ago, but it wasn't that long ago. So with the classical ML models, essentially what we would do is we would train like your usual tree based models or SVMs on a very specific task. Like those were all very task specific models. The problem was very well defined and the output was like very limited.
Jineet Doshi [00:06:16]: Right. Like the model would essentially output the probability of classification or like some number if it was a regression problem. And the metrics as well were pretty well defined. So essentially it was like these models were trained on very niche tasks and the outputs were very limited, well defined and the metrics were also very well defined. So if it's classification, you look at accuracy, precision, recall, AUC, like what have you. Same with regression. It had its own set of metrics. But then when these LLMs kind of came about, it completely flipped the script.
Jineet Doshi [00:06:57]: Right. Like everything I felt like went upside down in a way because now suddenly these LLMs are producing open ended outputs. Like you can ask it to like write a poem about PyTorch and it'll do that. And there's like no one right way of doing that. There's like millions of possible right answers. And the other piece is the breadth as well. Right. Like earlier we had very task specific models, but now like one single model is able to do so many different tasks.
Jineet Doshi [00:07:33]: Right. It can write a poem, it can summarize, it can do Q and A. There's like so many different things that these LLMs can do, which again begs the question, like how do you measure intelligence like on a broader scale? And I feel this is something we haven't really solved even for humans. Like we have things like IQ tests, but again those are far from perfect. So that's why I feel like it's a very essential and challenging topic.
Demetrios [00:08:09]: You know, one thing that I have thought about so much when it comes to evals is if there will be a product to solve the evaluation problems and if that product is an eval product or if it's part of a greater platform that you're doing AI development on. And I know that you've, you've had some experience already. Do you feel like it's going to be a product?
Jineet Doshi [00:08:46]: I'm not sure about the product piece, but it is absolutely essential like to have evals across, I would say, the entire ML life cycle. So like right when you choose your data, to like when you're building or like training or fine-tuning the model offline, to then, okay, even once it's in production we still need evals to constantly monitor and make sure like there's no drift and everything's like working as expected. So I feel like evals at the end of the day are essential across the entire like ML life cycle. Again, whether that's classical ML or GenAI, I think that doesn't matter.
Demetrios [00:09:27]: Yeah. What are some different approaches that you have in mind when you are doing evals? Have you experimented with ways that have deemed to be more successful or less successful?
Jineet Doshi [00:09:43]: Sure. Yeah. So I, I'm guessing you're referring specifically to LLMs and GenAI systems. Yeah, so yeah. So in that space again, there's, I feel there's no right or wrong. There's like different techniques that are out there and each one comes with its own trade off at the end of the day. So we have to like be aware of the pros and cons of each of them and make those decisions based on the use case. But in general I feel like we can think of the different evaluation techniques for gen AI systems into like three higher level categories.
Jineet Doshi [00:10:23]: So the first category is what I would say like some more traditional NLP techniques. So again when LLMs came about and people realized, okay, like evaluating is a big challenge, like how do we approach this? The first thought process was, NLP as a field has existed for multiple decades, right? Like that's not new. So can we like borrow all of the great work that people have done already in that domain and see like how, how we can reuse some of that? So if we like dive a little deeper into that. So one approach is essentially that LLMs again usually have open ended outputs, but essentially can we like tweak our evaluation test set to essentially have multiple choice questions and like almost artificially force the LLM to pick options and force like a closed ended output versus an open ended output. And then by doing that like all of the traditional evaluation metrics suddenly become applicable. Of course, like it's easy to compute and like everything is well defined. But of course the con here is we cannot really evaluate on a lot of the open ended tasks. It's more helpful when it's like knowledge based or reasoning based tasks that we can evaluate on.
Jineet Doshi [00:11:59]: An extension of that was also like in the text similarity space, where again if you do have like long form output from the LLM and let's say you have some reference answer, the idea is that you can convert both of them into like an embedding space or like into an n-gram space, and then like do the comparison in that space. So again there's like pretty well studied metrics like BERTScore, BARTScore, cosine similarity, edit distance that exist in that space, which again have been used like across the board by different teams. So one of the advantages of doing that is basically in the embedding space your answers can be like language agnostic in a way, right? Like if your LLM output is in English, let's say, but your reference answer is in Spanish. The idea is that in the embedding space at least, as long as the embeddings are good quality, they capture the semantics and the similarity scores would be like higher there. But then of course that also leads to a con of that approach, like something to keep in mind, is it is heavily dependent on the choice of the embedding model. So if you're looking at like domain specific evaluations where it's like you have the legal language or the medical domain, where the language tends to have like quirks and nuances of its own, you need to pick an embedding model that kind of captures that, because if it doesn't, then it can bite you in the ass. Things can fail there.
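For readers who want to try the embedding-space comparison described here, a minimal sketch is below, assuming the sentence-transformers package; the model name and the example strings are illustrative placeholders, not a recommendation, and for domain-specific text you would swap in an embedding model tuned to that domain, as Jineet cautions.

```python
# Minimal sketch of embedding-based similarity between an LLM output and a
# reference answer. Assumes the sentence-transformers package is installed;
# the model name and example strings are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in a domain-specific model as needed

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_answer(llm_output: str, reference: str) -> float:
    emb_out, emb_ref = model.encode([llm_output, reference])
    return cosine_similarity(emb_out, emb_ref)

print(score_answer("The cat sat on the mat.", "A cat was sitting on a mat."))
```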
Jineet Doshi [00:13:52]: And also some of these metrics, there have been studies that have shown that they do not always like align with human judgment. So again, it's like something to keep.
Demetrios [00:14:02]: In mind. And to underpin, this is the traditional NLP type of evaluation. So that's what you'd call, like, the first bucket that you see as one way to do it.
Jineet Doshi [00:14:13]: Yes, absolutely. There's some more techniques there as well. Like benchmarks is another pretty, I would say popular, widely used technique. Again, there's thousands of benchmarks out there, increasing like every day. Things like, you know, MMLU, SuperGLUE, SQuAD, like these are all like very well known benchmarks. So the idea behind...
Demetrios [00:14:36]: I'm a fan. Just a quick shout out to the Pro LLM benchmarks that our friends at Prosus set up. And the reason is because they don't give any of that data out publicly and they're constantly creating new data sets. So it feels like it's high quality benchmarks, you know. And I know there's a lot of discussion about how benchmarks are essentially bullshit since you can overfit to them if the data is public. And so the Pro LLM benchmarks feel like, hey, this is a good way to do it. And it's got 10 different ways to evaluate the models. So it's not just one type of benchmark. You get different benchmarks.
Demetrios [00:15:23]: Like there's a Stack Overflow benchmark, there's function calling benchmarks for agents, et cetera, et cetera. But I don't mean to derail it. Keep going. What are the other buckets?
Jineet Doshi [00:15:36]: Yeah, sure. So yeah, since we're on benchmarks, like as you mentioned, there's a lot of open source benchmarks as well that are out there. And like these benchmarks are really like across the board, right? Like benchmarks for testing physics or knowledge or reasoning or even for things like hallucinations or toxicity. Like all of these different vectors we kind of want to test before putting into production. Right. So that's why they're pretty popular. Like you can combine, of course, multiple benchmarks together and see across the board, across the different factors, like how your model is performing, which is helpful. But yes, at the same time.
Jineet Doshi [00:16:19]: Again, benchmarks have their own limitations as well. Like a lot of them are again fairly multiple choice type of questions. So again like we cannot really evaluate like the long form, more open ended type of tasks.
Demetrios [00:16:34]: Are you grabbing benchmarks and adding them to your test sets just to get like a sanity check on your models? And if there is something that feels like it's not really working, like toxicity score for some reason is really high on this model, then you can go and you can curate your own data set and try to dig into it a little bit more. Is that something that you've experimented with?
Jineet Doshi [00:17:03]: Yeah. So benchmarks, I would say it's very use case specific at the end of the day. Like yes, there's a lot of like open source benchmarks out there. But at the end of the day you have to like look at what exact use case your model is trying to solve and if whatever benchmark you pick is applicable for that use case. Well, in some cases, yes, off the shelf benchmarks are good enough. But again depending on like how niche your use case is, there have been times where again you almost have to create your own benchmark as well. Yeah. And then like use that, because again you don't have like an open benchmark for your specific needs.
Jineet Doshi [00:17:52]: One more point on the benchmark piece though while we're at it. Like, you raised an important point about essentially like training data leakage at the end of the day, because again these LLMs are trained on like Internet scale data. If the benchmarks become part of their training data then of course that leads to like inflated numbers in terms of performance. So something to keep in mind, there have been a lot of benchmarks that have had to be updated because of this problem. I think GLUE was one of them. MMLU also I think had to be updated to like V2 because of this problem. So yeah, again one of the challenges I would say in this space.
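To make the closed-ended benchmark idea concrete, here is a small sketch of scoring a model on multiple-choice questions and reporting plain accuracy; `call_model` is a hypothetical stand-in for whatever LLM client you use, and the example question is invented for illustration.

```python
# Sketch of closed-ended benchmark scoring: force the model to answer with a
# single letter, then compute plain accuracy. `call_model` is a hypothetical
# stand-in for your own LLM client; the question below is invented.
from typing import Callable

QUESTIONS = [
    {"question": "What does AUC stand for in classification metrics?",
     "choices": {"A": "Area Under the Curve", "B": "Average Utility Cost",
                 "C": "Automatic Unit Check", "D": "Aggregate User Count"},
     "answer": "A"},
]

PROMPT = ("Answer the following multiple-choice question with a single letter "
          "(A, B, C, or D) and nothing else.\n\n{question}\n{options}")

def evaluate(call_model: Callable[[str], str]) -> float:
    correct = 0
    for q in QUESTIONS:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        reply = call_model(PROMPT.format(question=q["question"], options=options))
        # Take the first A-D letter in the reply as the model's choice.
        choice = next((c for c in reply.strip().upper() if c in "ABCD"), None)
        correct += int(choice == q["answer"])
    return correct / len(QUESTIONS)
```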
Demetrios [00:18:39]: And what are the other buckets that you see?
Jineet Doshi [00:18:43]: Sure. So the other two, I would say, higher level buckets. One was like the traditional NLP kind of approaches. The second bucket I would probably put under is like human based approaches. So where you have essentially manual labelers. So expensive. Yeah, like that is one of the disadvantages of that approach for sure. But yeah, it's still at the end of the day it's still like considered to be golden in terms of a lot of different use cases.
Jineet Doshi [00:19:26]: So whenever manual labeling is involved, from experience, what I've seen is it's really important to have well defined criteria for the labelers, because at the end of the day, again, when subjective things get involved, it's like, okay, if this is a 3 out of 5 answer for you, it could be like a 5 out of 5 for me, like based on like subjective factors. So it's really important to like have a good definition of like all these different criteria to ensure consistency across like human labelers. And even within, I think, human labeling there's some interesting approaches. So again it doesn't always have to be absolute scoring. There is this idea of arenas as well, where essentially you have like different models kind of facing off against each other. So the idea there is sometimes it's easier to just give preferences rather than absolute scores. Like you can say, okay, this answer is better than this, versus okay, this is like a three out of five or this is like a four out of five. So there's like Chatbot Arena is like quite popular.
Jineet Doshi [00:20:41]: There's like a bunch of such arenas out there as well. And then ultimately you can aggregate these preference scores and create a leaderboard from that. One, I would say, extreme version of this is like just hiring your own red team. There's a lot of companies that are doing that now, where again like the idea has been around for quite some time in the cybersecurity space. So now that's kind of being applied here, which is pretty interesting to me, where you have these specialists who are like trained on essentially like trying to break these models and trying to evaluate them before like you expose them to like millions of your customers.
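One common way to turn those pairwise preferences into a leaderboard is an Elo-style rating, in the spirit of Chatbot Arena; the sketch below is a simplified illustration of the idea, not the exact algorithm any particular arena uses, and the starting rating and K factor are arbitrary choices.

```python
# Simplified Elo-style aggregation of pairwise preference judgments into a
# leaderboard. A sketch of the general idea only; the starting rating of 1000
# and K factor of 32 are arbitrary, and real arenas use more careful methods.
from collections import defaultdict

K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def build_leaderboard(preferences):
    """preferences: iterable of (winner_model, loser_model) pairs."""
    ratings = defaultdict(lambda: 1000.0)
    for winner, loser in preferences:
        e_w = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - e_w)
        ratings[loser] -= K * (1 - e_w)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

print(build_leaderboard([("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]))
```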
Demetrios [00:21:28]: Yeah, I've even seen red teams. So red teams as a service. But also I've seen companies that are creating agents or a SaaS platform that is basically agents as red teams. And so it's not humans that have been trained to break models, it is models that have been trained to break models.
Jineet Doshi [00:21:51]: Yes, that is a fairly common approach and it's increasing in popularity now. So that was actually going to be my third bucket, which is model based evaluations essentially. But before we go there, I think while we're at the human evaluations piece, like just a quick, I think, pros and cons. The advantages, of course, are that like with manual labeling you can evaluate across like a wide spectrum. Like you can really do holistic evaluations, whether it's not just for quality but also for like toxicity, bias, like things like that. It is again still considered golden for a lot of different use cases. But of course the disadvantage is that it can't scale that easily. And then for domain specific evaluations, like again, if you're in the legal space or the medical space or the finance space, you have to ensure that your evaluators are experts in that domain or they have domain knowledge essentially, which again can be difficult many times.
Demetrios [00:23:09]: And expensive.
Jineet Doshi [00:23:10]: And expensive. So yeah, and that's why, which brings me to the third bucket, which has really picked up, I think, of late, which is model based evaluation. So essentially using like LLM as a judge kind of approach. So the high level hypothesis there is like, if you use a more capable model essentially as a judge, can it provide the same type of benefits as that of a human evaluator, but without some of the cons? And again there have been studies that have shown promising results along that front. But still it's not, I would say it's still not a solved problem, because like, even with LLM as a judge based approaches, there are challenges to keep in mind. Like again there's been a lot of studies that have shown that essentially the performance of LLM as a judge at the end of the day boils down to your choice of judge. And like things like prompt formatting. Like how do you format the prompt of the judge? And there is like a.
Demetrios [00:24:17]: By choice of judge you mean the model?
Jineet Doshi [00:24:21]: Yes. Which model do you pick as a judge? How do you format the prompts? There's like. And also there have been a lot of studies that have shown that, uh, it's important to keep in mind that sometimes again these LLMs do have biases.
Demetrios [00:24:36]: Yeah.
Jineet Doshi [00:24:37]: So like verbosity bias is one of them. Where essentially LLMs sometimes tend to prefer answers that are longer versus shorter ones, even though the shorter ones might be of like similar quality. There's also positional bias to keep in mind, which is, somehow, LLMs tend to prefer answers at the first position versus the others. So again, things to keep in mind.
Demetrios [00:25:06]: There is something fascinating when it comes to LLMs as a judge because so much research is going on in this area. I love reading about how folks are doing it. And the one that struck me as a really cool idea is LLMs as a jury. So instead of saying, because as you were mentioning, if you just have one LLM as a judge, then you really gotta know the prompt here is the best prompt that I can get and the model is the best model that I should be using. But I've seen others who say, well, let's just grab a bunch of LLMs and different prompts, different models. And then we can take the average of all of those and see if it is higher or lower than what we are ready to deal with.
Jineet Doshi [00:26:04]: Yes, absolutely. And that's another very interesting like idea that has come about. I have seen like a lot of studies around that as well. And with regards to like a bias mitigation strategy, that is definitely something which does provide promising results. But then again there's trade offs there, which is like when you have a jury, now again your evals become more expensive and like time consuming. So again there's like some trade off decisions to be made. And then again when you have a jury it's like, okay, how do you consider like all the different outputs from different judges? Like, you know, how do you weight them separately? Do you weight them equally? Like then you go into like those kind of details.
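A bare-bones version of the jury idea might look like the sketch below: several judge configurations score the same answer and the binary verdicts are averaged with equal weights, which is exactly the weighting question left open above. The `judges` callables are hypothetical wrappers around whatever model-plus-prompt combinations you choose.

```python
# Sketch of an "LLM jury": several judge configurations each return a binary
# verdict for the same answer, and the verdicts are averaged with equal
# weights. Each judge is a hypothetical callable wrapping one model + prompt.
from typing import Callable, List

def jury_verdict(question: str, answer: str,
                 judges: List[Callable[[str, str], int]],
                 pass_threshold: float = 0.5) -> bool:
    """Each judge returns 1 (acceptable) or 0 (not acceptable)."""
    verdicts = [judge(question, answer) for judge in judges]
    return sum(verdicts) / len(verdicts) >= pass_threshold
```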
Demetrios [00:26:55]: There's that detail and there's another way that I've seen it done, which is you have one LLM as a judge, but then you have another LLM judging the judge. And so it's saying, hey, let's see if this matches up: the LLM as a judge said this, and you give both the answer and what the LLM as a judge said to another LLM and say, does this make sense? And so then it will judge the judge. And so what are some LLM as a judge best practices that you've been seeing?
Jineet Doshi [00:27:30]: Yeah, so this technique has definitely picked up quite a bit of late, because essentially what we've seen is it provides the flexibility that human evaluators provide. But of course it's much more scalable if done right. So there has been a lot of work that's gone in, I think, over the last couple of years in the space, and there is now a fairly well defined set of like best practices while using LLM as a judge. So the first thing is like, how do you validate the validator? Right, we spoke about that a bit. So essentially one idea that has worked for a lot of people is like to have a golden data set first, which is like created by domain experts by manual labeling, which you have complete confidence in, and then use that to calibrate the judge. So you would pick your judge model, you would finalize the prompt and all of that stuff based on that, to ensure that there is high correlation with that data set. So at least you have some confidence. Again, like any ML model, drifts do happen.
Jineet Doshi [00:28:48]: So the judge is expected to drift over time. So again it's very important to like monitor the judge and recalibrate as needed over time. The judge again can be used to do either like absolute scoring or pairwise comparisons. For more subjective type of use cases, pairwise comparisons are usually better than absolute scores. But again if you do use absolute scoring, another really good practice, and again a lot of studies have been published around this, is to have the judge use as low of a precision scale as possible. So like for example, what I mean by that is essentially having the judge preferably output a 0 or a 1, versus it outputting like a score between 0 to 100. Again, just like humans, there can be a lot of confusion for the LLM as a judge as well, which is like, you know, how do I differentiate between a 60 out of 100 and an 80 out of 100 answer? So like essentially it's really good to keep it simple. It's also helpful to add few-shot examples to the prompts.
Jineet Doshi [00:30:09]: Again you're kind of giving that reference to the judge. So that has shown to be like pretty helpful. Asking the judge to output reasons with every evaluation, that's another very important point that has helped, especially with debugging. So when you ask the judge to really reason, like why it thinks something is not good or something is good, you can check its reasoning and use that to iterate. And then there's some really interesting work that has gone on recently, where there's this idea of grading notes that was introduced like, I think, just a couple of months ago, which has also shown to really improve the performance of LLM as a judge, especially on domain specific evaluations. So essentially the idea of these grading notes is, for your evaluation set, for each row you would kind of add like a more detailed version of what to look for. So if it's, let's say, like a code execution task, you would say, okay, like please check if basically from the input file, like are all the columns right? Like is the formatting of all of the inputs right? Like check for essentially like any null type errors or like those type of exceptions. So essentially like adding these kind of granular notes to the judge really helps, especially in like the domain specific evaluations where the judge might not have all the knowledge.
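Pulling those practices together, a judge prompt might look something like the sketch below: a low-precision 0/1 score, a required reason, and a per-row grading note (few-shot examples would be prepended in the same way). The wording and the `call_judge` client are assumptions for illustration, not a prescribed template.

```python
# Sketch of an LLM-as-a-judge call following the practices above: binary (0/1)
# scoring, a required reason, and an optional per-row grading note telling the
# judge what to check. `call_judge` is a hypothetical client for whichever
# judge model you pick; it is assumed to return the JSON string requested.
import json

JUDGE_PROMPT = """You are grading an answer to a user question.
Grading note (what to check for this item): {grading_note}

Question: {question}
Candidate answer: {answer}

Respond with JSON only: {{"score": 0 or 1, "reason": "<one short sentence>"}}.
Score 1 only if the answer is correct, relevant, and satisfies the grading note."""

def judge_answer(call_judge, question, answer, grading_note="No special instructions."):
    raw = call_judge(JUDGE_PROMPT.format(
        grading_note=grading_note, question=question, answer=answer))
    verdict = json.loads(raw)  # assumes the judge complied with the JSON format
    return verdict["score"], verdict["reason"]
```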
Demetrios [00:31:56]: I, after reading a few research papers, I'm like, man, I swear some of these researchers gotta be so high and they're eating pizza in their dorm room or they're just enjoying life and they're thinking, you know, it would be awesome if we just had more LLM calls. That's what we need to remedy the problem here. We'll just have more LLMs.
Jineet Doshi [00:32:24]: Yeah, I think you hit on a very important challenge with this, which is, at the end of the day, like who validates the validator, or like how do you validate the validator, given the validator itself is another AI in this case. Right. So one of the like best practices like I've seen like different people use in the industry is essentially you kind of have like a golden test set which is again very trustworthy. So let's say again you have like domain experts who would essentially create that golden data set, like do those evals for you. You have that golden data set and then you use that to kind of calibrate the judge in the first place, and you first like establish some form of correlation with that golden data set, where then you can say, okay, the judge is like, whatever, 80%, 90% correlated with my golden data set. So then I have a fair amount of confidence that, okay, this judge is going to do like fairly well. But again we are dealing with challenges of overfitting there as well. The classic bias versus variance trade off in ML systems.
Jineet Doshi [00:33:45]: And then another thing to keep in mind with that is of course that, like every ML system, there is going to be drift with time. So it's like you constantly have to monitor the judge. Again like, you know, is there drift in the judge's judgments over time, like how frequently do we need to retrain, and things like that.
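For the "calibrate the judge against a golden set" step, one simple check, sketched below under the assumption that both the judge's labels and the golden labels are binary, is raw agreement plus Cohen's kappa; rerunning it on fresh golden samples over time is one way to watch for the judge drift mentioned here.

```python
# Sketch of calibrating an LLM judge against a human-labeled golden set:
# raw agreement plus Cohen's kappa between judge labels and golden labels,
# both assumed to be binary (0/1). Rerun periodically to spot judge drift.
def agreement_and_kappa(judge_labels, golden_labels):
    assert judge_labels and len(judge_labels) == len(golden_labels)
    n = len(golden_labels)
    agree = sum(j == g for j, g in zip(judge_labels, golden_labels)) / n
    # Chance agreement for two binary raters.
    p_judge = sum(judge_labels) / n
    p_gold = sum(golden_labels) / n
    chance = p_judge * p_gold + (1 - p_judge) * (1 - p_gold)
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return agree, kappa

print(agreement_and_kappa([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))
```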
Demetrios [00:34:06]: And you're bringing up a great point that I just literally finished writing a blog post on, which is the costs associated with this type of stuff. And it came about, my blog post came about because I, I saw yet another person on the Internet posting about how the cost per token is going down. And I wrote, yeah, cost per token is going down, but cost per answer is going up. And I didn't really explain myself well enough, I guess, with that comment. The person replied like, no, any given model, anything, the best models, the tokens, costs are all just plummeting. And I'm like, yeah, I didn't say cost per response is going up. I said cost per answer. And what I meant by that is if you look at the whole system and you look at everything that has to do with going into getting you a quality answer, you as that end user getting that quality answer. All of this that we're talking about, LLM calls like up the wazoo, because we've got a judge and a jury and the judge that's judging the jury, and also all of the human resources that are going into building out the system.
Demetrios [00:35:24]: If you're using agents, like don't even talk about how many more LLM calls you need because of all of the agents. And that's not even factoring in this golden data set that you need to pay humans to curate. And now there's a ton of other factors that I go into in depth in the blog post, but that's like the gist of it. The cost per answer, no one can say that that is going down. That is 100% going up.
Jineet Doshi [00:35:59]: Yeah, I think that kind of boils down to, I think, the systems that we are now designing, like we are taking on more complex tasks. And as you rightly mentioned, when you look into like RAG or agentic workflows, like some of these systems are extremely complex, involving like multiple LLM calls; sometimes there is a lot going on behind the scenes. So yeah, as like these models get better, I feel we are also trying to tackle bigger tasks, bigger challenges, which like creates very complex systems, which has an impact on the cost. But there have been again some interesting, there has been some interesting work in that space as well, where I have seen a lot of folks again try to use like small language models rather than large language models.
Demetrios [00:36:56]: That's so funny you say that because there's a whole part of the blog that I wrote where I said, all right, this is where folks are going to come in and say, but what about open source, small domain specific models? And I said, okay, that's kind of arguing my point though, because let's think about that. Who is gathering the data to fine tune those models? If they are fine tuned, what are the GPU costs to fine tune the models? The people that actually know how to fine tune the models, how much do they cost? Are you hosting those models, and are you renting GPUs to host those models on your own? How much is that? And you better have some good alerting set up. So there's all this extra cost that comes into it that I do see is not part of the LLM call itself. But if you again are looking at it in this systematic way, in a holistic way, you recognize that the cost per answer is going up.
Jineet Doshi [00:37:58]: Yeah, again that's a valid point that even with small language models we have to factor in like all of these different costs. At the end of the day. I think it boils down to the use case like again, is your use case like specific enough that a small language model could work versus having like a more capable like large language model. So I think a lot of these design and trade off decisions have to be made kind of like specific to the use case.
Demetrios [00:38:29]: Hmm, you, you said it perfectly. With the complexity, we, we expect so much more from AI these days and its capabilities that inevitably that just translates to more complexity in the systems that we design. And so since we expect more, we're not just going and creating a system that hits an API once and then comes back with an answer. No, we've gotta set up a whole RAG system, and, or we've got agents that we want to actually go out and do things, or plan and reason around these things and then go and execute. So it's interesting to think about that when you're looking at these different systems. And I know that you have thought about evaluating RAG and evaluating agents in different ways. And so let's like jump into that. What are some things that you keep top of mind when you are evaluating a RAG system?
Jineet Doshi [00:39:31]: Yeah, so that's a very good question. So as we like develop more complex systems along the way, again, evaluation remains like a challenge. It is very important to keep in mind, like with these complex systems, that we need to evaluate the parts and kind of the sum of the parts ultimately. So in like the more traditional software sense, it's like you need to do unit testing and integration tests at the end of the day for the whole system. End to end. Right. So when we talk about RAG, it's really important to like evaluate different parts of the RAG system. Like you want to know how well is the retrieval doing? Like, is the retrieval actually fetching relevant like chunks from the vector database for you? So there again there has been a lot of work again done in the space because that's essentially a search problem.
Jineet Doshi [00:40:29]: So there are more traditional metrics like your MRR (mean reciprocal rank), NDCG, recall at K, precision at K. So these metrics have existed for quite some time again because the search problem has existed for quite some time, like multiple decades. So I do see like reuse of some of that work within like retrieval systems as well. But then of course there's scope for newer techniques as well, like LLM as a judge. Again, I've seen like lot of open source packages and tools out there using LLM as a judge just to evaluate the retrieval system. And that's just the retrieval system piece. Right. Then we go on to the answer generation system.
Jineet Doshi [00:41:17]: Like okay, once you retrieve relevant chunks, you want to make sure that the LLM that's answering eventually is generating good answers, and then you want to test the whole system end to end. Right. Like, right from where you have a user query all the way into like generating the final answer. Like, how good is that system? How consistent is that system? Like, is there any like bias in there? Is it producing like toxic outputs? Like, yeah, there's, I think there's absolutely a need to like think of it holistically.
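For the retrieval piece, the classic search metrics mentioned above are straightforward to compute once you have, for each query, the ranked list of retrieved chunk IDs and a set of relevant IDs (from human labels or a judge); here is a minimal sketch of recall at K, precision at K, and mean reciprocal rank.

```python
# Minimal sketch of classic retrieval metrics for evaluating a RAG retriever.
# `retrieved` is the ranked list of chunk IDs returned for a query;
# `relevant` is the set of chunk IDs labeled relevant for that query.
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(results):
    """results: list of (retrieved_list, relevant_set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)
```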
Demetrios [00:41:56]: Yeah. So when you are doing unit tests, you're thinking about the different pieces where you can add that evaluation in. And you mentioned one, which is how's it retrieving? Are you also doing it on the, on the embedding model? Or, like, where are all the places that you're putting in unit tests, I guess, as you set up your system and your RAG?
Jineet Doshi [00:42:26]: Right. So again, that's a very good question. With a system like RAG, again, there's so many bells and whistles that you can pull, like each of those things kind of requires its own set of evals. Right. Like okay, what kind of chunking strategy do you want to use? What kind of embedding model do you want to use? Again, to make a lot of those decisions, you need evals even there. So yeah, if you want to ensure like success of the complete big RAG system, I feel like you need all these evals, like at the individual level and then at a system level as well.
Demetrios [00:43:12]: Yeah, and then the integration tests, what do those look like?
Jineet Doshi [00:43:16]: Sure. So the integration test in this case would be you're just testing the whole system, like end to end. So right from when you get the user query in the RAG system to your final output. Right. And again, so a lot of the foundational techniques that we discussed earlier come into play here. So like you can have your own benchmark, you can create your own benchmark, or you can reuse an existing benchmark from the open source domain. We can have, again, human evaluators, we can have a red team as well try to evaluate the system. Or again, you can always have like LLM as a judge based approaches there.
Demetrios [00:44:00]: And so when I traditionally think about tests in software development, it is a very instantaneous thing where you know if my tests failed or passed. Right. But with this, as soon as you start bringing humans into the mix, or if you start bringing LLM calls into the mix, are you still getting that instantaneous feedback to know if your tests fail or pass? And are you doing these types of extended tests, I guess? What does, what does it look like? Is it a monthly thing where you're bringing a human in and asking them to look, and looking at the drift and really having a deep examination there? Or is it more of, we do it once and then it's more of like a yearly maintenance. How do you see that?
Jineet Doshi [00:44:59]: Right. So yeah, evaluations are again very important, I feel, in the entire life cycle of the project. So like I think what we discussed until now was kind of offline evals. So like all the evals you would do before you launch the model. And then there's this again, completely different world out there of like online evals or production evals as well. Right. So once you launch the system, you constantly want to monitor and make sure it's producing like the right quality of outputs. So even for that, yes, there are different ways of doing that. So like establishing feedback loops with your customers.
Jineet Doshi [00:45:39]: Like again, the classic thumbs up, thumbs down. And again, so not everyone likes to provide those kind of inputs. So sometimes you also have to look at like more implicit kind of signals, like are they trying to search for help somewhere else, or, like, you know, are they abandoning your system like more frequently? So there's like a lot of these implicit signals that can be used. And then again, I've seen approaches where again when you use LLM as a judge, like essentially deploying the same thing online as well, and you're having the LLM judgments kind of like on a daily basis or on an hourly basis as well in some use cases. But on the cadence piece, I do feel like it is very important to have evals like as frequent as possible so you can monitor for like drift and quality and things like that.
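One very simple way to act on those explicit and implicit signals is to aggregate them into a daily rate and flag days that fall well below a baseline; the sketch below does that for thumbs-up feedback, with made-up field names and an arbitrary tolerance.

```python
# Sketch of a simple online monitoring check: aggregate the daily thumbs-up
# rate (the same pattern works for implicit signals such as abandonment) and
# flag days well below a baseline. Field names and tolerance are illustrative.
from collections import defaultdict

def daily_positive_rate(events):
    """events: iterable of dicts like {"day": "2024-05-01", "feedback": "up" or "down"}."""
    counts = defaultdict(lambda: [0, 0])  # day -> [positives, total]
    for e in events:
        counts[e["day"]][0] += e["feedback"] == "up"
        counts[e["day"]][1] += 1
    return {day: pos / total for day, (pos, total) in counts.items()}

def flag_drift(rates, baseline, tolerance=0.10):
    return [day for day, rate in sorted(rates.items()) if rate < baseline - tolerance]
```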
Demetrios [00:46:40]: Why do you feel like maintaining production ML systems is such a challenge and underappreciated?
Jineet Doshi [00:46:51]: Yeah, it is. I think it is one of those things which I feel like people don't think of that as like, oh, this is like something cool to do. But it is extremely essential to ensure like a good experience for customers eventually in production. Right. So with software, again, traditional software systems, when we speak about maintenance, it was more around like ensuring that we take care of technical debt. We ensure that all the dependencies, all the different libraries, they're up to date. But with ML systems there is like an added layer of complexity to that whole thing. Right.
Jineet Doshi [00:47:32]: So you have all of the software based maintenance that you still need to do, and on top of that now there is all this data level stuff that you need to look into as well. Right. Like is your data drifting in production versus what you trained the model on, are there any issues with your features, or is there any drift in the model itself? Like is the model's prediction kind of drifting over time, and trying to debug, okay, why that is. And then again constantly having to retrain the systems, the models, to make sure that they are up to date, like with what's going on in your product and with customers' expectations. And again, at the rate at which I feel like the whole AI space is moving, again, just essentially keeping up with all of that progress. Right. Ensuring that we have like again good quality models.
Jineet Doshi [00:48:30]: If there are better models out there, if there are better techniques out there, again, how do we again incorporate that into the system? Again there's newer and better tools that are coming out every day. So again it's like, okay, now how do we incorporate that into our system? So as you can see, basically it ends up being like a very huge task, which I feel is sometimes overlooked because people want to go and build things from scratch without like realizing, like even the current systems in production, maintaining them is, I would say, like a big resource requirement for sure.
Demetrios [00:49:22]: Yeah, it's a beast. Especially as you start to add more models into production and you start to try to recognize which models are creating value. How can we get more of those types of models? Right. And so speaking of along that line, how do you look at evaluating and identifying the right use cases for different AI and ML applications?
Jineet Doshi [00:49:55]: At the end of the day I think like you just have to be very like surgical about like where you really want to deploy like ML and AI models. And there's like a lot of like background work I feel that goes into that, where you would be talking essentially with like all the product people, with design, with like engineers, like basically the whole cross functional team, and really trying to like identify like what is our hypothesis at the end of the day. Like if we add an AI model or an ML model here in this space, like what's our hypothesis? Like how can it move the needle versus like, let's say, a rules based approach or like something like that. So and then essentially once you have that hypothesis, then it's all about like, okay, how can we go and validate that hypothesis as quickly as possible. So can we launch essentially an MVP, like a minimum viable product, in there, do some quick testing, basically validate that hypothesis, and then okay, if we see good promising results, then okay, now let's scale this to like a full blown system. So I think that's usually the approach I've seen being taken.
Demetrios [00:51:24]: What are some things that you haven't been able to figure out recently?
Jineet Doshi [00:51:30]: A lot of, lot of things I think remain like big challenges. I think evaluation is definitely like. Evaluation of generative AI systems particularly is definitely like very important and still remains like a challenge. Again, I think we spoke about RAG, but now again with the advancements in like agentic workflows, I think again we are essentially entering a space where, with the systems, the evaluations are also getting more complex. So with agents, again, depending on like how many steps are there in your agentic flow, again I think the same concept applies, where you need to evaluate every step's output and then you again need to evaluate the whole end to end output. Also, depending on how many tools your agent is calling or has access to, the evaluation complexity increases quite a bit, because with tool calling again we are essentially looking at two things. Right. At the end of the day, A, is the LLM picking the right tool for the task, and then B, are the right parameters being passed to the tool that's picked. So there's two factors that you're evaluating for each tool call, and then as again the agent has access to more and more tools, that adds again significant complexity.
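The two tool-calling checks described here (A: right tool, B: right parameters) can be expressed as a per-call comparison against an expected call, as in the sketch below; the dictionary structure is an assumed logging format, not any specific agent framework's schema.

```python
# Sketch of evaluating a single agent tool call on the two axes mentioned:
# (A) did the agent pick the expected tool, and (B) did it pass the expected
# parameters. The dict layout is an assumed logging format for illustration.
def evaluate_tool_call(actual: dict, expected: dict) -> dict:
    right_tool = actual.get("tool") == expected["tool"]
    # Only check parameters if the right tool was chosen.
    right_params = right_tool and all(
        actual.get("arguments", {}).get(k) == v
        for k, v in expected["arguments"].items()
    )
    return {"right_tool": right_tool, "right_parameters": right_params}

print(evaluate_tool_call(
    actual={"tool": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
    expected={"tool": "get_weather", "arguments": {"city": "Paris"}},
))
```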
Jineet Doshi [00:53:00]: Yeah. At the same time I do see a lot of interesting ideas, lot of interesting approaches coming up. So I know one of the ideas I was basically thinking about is like with Claude launching essentially computer use, right? Like it's like, what if you could simulate users now with Claude, where it's like literally you have Claude play like different personas and then go and kind of do the testing of these agentic systems or like RAG workflows. I feel it's pretty fascinating. Right now it's like, yeah, we have more complex systems and we're trying to come up with solutions for them. And now we do have more capable tools and more capable models. Again I think when we think about multimodal LLMs, that's a completely different beast altogether. I feel like we were barely figuring out the text output and now we are also in domains where we have to look at, okay, image outputs and speech and audio and like what have you. So yeah, I personally find it pretty fascinating.
Demetrios [00:54:18]: I like this idea of Claude computer use going and stress testing or validating your AI workflow that you've built, or your RAG pipeline, and you are able to see, if you have different ways of synthetic data generation, in a way, trying to use your product and use it with different persona prompts.
Jineet Doshi [00:54:51]: Yeah, absolutely. It's a fascinating idea. Some of it is already used, I think, in the self driving space, where you have companies like Waymo, Cruise, Tesla, where they simulate a lot of these cases in the virtual world and they use RL essentially to create this huge model from all that feedback. So I feel like, again, with open ended output where again there's infinite different possibilities out there, I do see that idea being pretty interesting. So I think it's fairly new. Again I would probably love to see more studies and results from that. I know there's a bunch of startups as well working in that space trying that idea out, so I think it's going to be pretty interesting.