LLM Evaluation with Arize AI's Aparna Dhinakaran // MLOps Podcast #210
Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a pioneer and early leader in machine learning (ML) observability. A frequent speaker at top conferences and thought leader in the space, Dhinakaran was recently named to the Forbes 30 Under 30. Before Arize, Dhinakaran was an ML engineer and leader at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built several core ML Infrastructure platforms, including Michelangelo. She has a bachelor’s from UC Berkeley's Electrical Engineering and Computer Science program, where she published research with Berkeley's AI Research group. She is on a leave of absence from the Computer Vision Ph.D. program at Cornell University.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Dive into the complexities of Large Language Model (LLM) evaluation, the role of the Phoenix evaluations library, and the importance of highly customized evaluations in software applications. The discourse delves into the nuances of fine-tuning in AI, the debate between using open-source versus private models, and the urgency of getting models into production for early identification of bottlenecks. Then examine the relevance of retrieved information, output legitimacy, and the operational advantages of Phoenix in supporting LLM evaluations.
Demetrios [00:00:00]: Hold up. Before we get into this next episode, I want to tell you about our virtual conference that's coming up on February 15 and February 22. We did it two Thursdays in a row this year because we wanted to make sure that the maximum amount of people could come for each day, since the lineup is just looking absolutely incredible, as you know we do. Let me name a few of the guests that we've got coming because it is worth talking about. We've got Jason Liu. We've got Shreya Shankar. We've got Dhruv, who is product applied AI at Uber. We've got Cameron Wolfe, who's got an incredible podcast, and he's director of AI at ReBuy Engine.
Demetrios [00:00:46]: We've got Lauren Lochridge, who is working at Google, also doing some product stuff. Oh, why are there so many product people here? Funny you should ask that, because we've got a whole AI product owners track along with an engineering track. And then, as we like to, we've got some hands-on workshops too. Let me just tell you some of these other names, just so you know, because we've got them coming and it is really cool. I haven't named any of the keynotes yet either, by the way. Go and check them out on your own if you want. Just go to home.mlops.community and you'll see. But we've got Tunji, who's the lead researcher on the DeepSpeed project at Microsoft.
Demetrios [00:01:29]: We've got Holden, who is the open source engineer at Netflix. We've got Kai, who's leading the AI platform at Uber, you may have heard of. It's called Michelangelo. Oh my gosh. We've got Faizaan, who's product manager at LinkedIn. Jerry Liu, who created good old Llama Index. He's coming. We've got Matt Sharp, friend of the pod, Shreya Rajpal, the creator and CEO of Guardrails.
Demetrios [00:01:58]: Oh my gosh, the list goes on. There's 70 plus people that will be with us at this conference. So I hope to see you there. And now, let's get into this podcast.
Aparna Dhinakaran [00:01:57]: Hey, everyone. My name is Aparna. I'm one of the co-founders of Arize AI, and I recently stopped drinking coffee, so I started on matcha lattes instead.
Demetrios [00:02:25]: Hello and welcome back to the MLOps Community podcast. As always, I am your host, Demetrios, and we're coming at you with another fire episode. This one was with my good old friend Aparna, and she has been doing some really cool stuff in the evaluation space, most specifically the LLM evaluation space. We talked all about how they are looking at evaluating the whole LLM system, and of course, she comes from the observability space. For those that don't know, she's co-founder of Arize, and Arize is doing lots of great stuff in the observability space. They've been doing it since the traditional MLOps days, and now they've got this open-source package, Phoenix, that is for the new LLM days, and you can just tell that she has been diving in headfirst. She's chief product officer, and she has really been thinking deeply about how to create a product that will help people along their journey when it comes to using LLMs and really making sure that your LLM is useful and not outputting just absolute garbage. So we talked at length about evaluating RAGs, not only the output part but also the retrieval piece. She also mentioned, and was very bullish on, something that a lot of you have probably heard about, which is LLMs as a judge.
Demetrios [00:04:03]: So I really appreciated her take on how you can use LLMs to evaluate your systems and evaluate the output. But then at the very end we got into her hot takes, and so definitely stick around for that, because she thinks very much along the same lines as I do. I don't want to give it away, but she came up with some really good stuff when it comes to fine-tuning and traditional ML, and how traditional ML engineers might jump to fine-tuning. But that is all. No spoilers here. We're going to get right into the conversation, and I'll let you hear it straight from Aparna. Before we do, though, huge shout out to the Arize team for being a sponsor of the MLOps community since 2020. They've been huge supporters and I've got to thank them for it.
Demetrios [00:04:52]: Aparna was one of the first people we had on a virtual meetup back when everything was closed in the COVID era, and she came into the community Slack and was super useful in those early days when we were all trying to figure out how to even think about observability when it relates to ML. I've got to say huge thanks. Huge shout out to the Arize team. Check out all the links below if you want to see any of the stuff that we talked about concerning all of the LLM observability or just ML observability tools. And before we get into the conversation, I would love it if you shared this piece with just one person so that we can keep the good old MLOps vibes rolling. All right, let's get into it. Okay, so you wanted the story about how I ended up in Germany.
Demetrios [00:05:51]: Here it is. Here's the TLDR version. I was living in Spain, so I moved to Spain in 2010, and I moved there because I met a girl in India, and she was in Bilbao, Spain, doing her master's. She wasn't from India or Spain. She was from Portugal. But I was like, oh, I want to be closer to her. And I also want to live in Spain because I enjoyed it. I had lived in Spain.
Demetrios [00:06:13]: I spoke a little bit of Spanish, and then I was like, all right, cool. Let's go over to Bilbao. I've heard good things about the city and the food and the people. So I moved there. As soon as I got there, this girl was like, I want nothing to do with you. And so I was sitting there, like, heartbroken on the coastline of the Basque Country. And it took me probably, like, a month to realize, well, there's much worse places I could be stuck. And so I enjoyed it.
Demetrios [00:06:45]: And I had the time of my life that year in Bilbao. And then I met my wife at the end of that year at this big music festival. So we were living in Spain. We ended up getting married, like, five years later, had our first daughter, like, eight years later. And then we were living there until 2020, when Covid hit. And when Covid hit, the lockdown was really hard. And we were in this small apartment in Bilbao, and we were like, let's get out of here. Let's go to the countryside.
Demetrios [00:07:15]: And we had been coming to the German countryside because there's, like, this meditation retreat center that we go to quite a bit. And so we thought, you know what? Let's go there. Let's see if there's any places available. And we can hang out on the countryside, not see anybody. The lockdowns weren't as strict. I mean, there were lockdowns and stuff, but when you're on the countryside, nobody's really enforcing it. So we did that, and we ended up in the middle of nowhere, Germany, with 100 cows and maybe, like, 50 people in the village that we're in. So that's the short story of it.
Aparna Dhinakaran [00:07:52]: Wow. Well, that's an interesting intro.
Demetrios [00:07:56]: There you go. We were just talking, and I will mention this to the listeners, because we were talking about how you moved from California to New York, and you are freezing right now because it is currently winter there, and Germany isn't known for its incredible weather. But it's definitely not like New York. That is for sure.
Aparna Dhinakaran [00:08:18]: It's East Coast winter out.
Demetrios [00:08:20]: So I wanted to jump into the evaluation space, because I know you've been knee deep in that for the last year. You've been working with all kinds of people, and maybe you can just set the scene for us. For those who do not know you, I probably said it in the intro already, but I will say it again: you're the head of product, or chief product officer, I think is the official title, at Arize, and you have been working in the observability space for ages. Before you started Arize, you were at Uber working on good old Michelangelo with that crew that got very famous from the paper. And then you've been talking a ton to people about how they're doing observability in the quote unquote traditional ML space. But then when LLMs came out, you also started talking to people about, okay, well, how do we do observability? What's important with observability in the LLM space? And so I'd love to hear you set the scene for us. What does it look like these days? I know it's hard out there when it comes to evaluating LLMs.
Demetrios [00:09:32]: Give us the breakdown.
Aparna Dhinakaran [00:09:34]: Yeah, no, let's jump in. So first off, we're seeing a ton of people trying to deploy LLM applications. Like, the last year, Demetrios, has just been super exciting. And I'm not just saying the fast-moving startups; there are older companies, companies where you're like, wow, they're deploying LLMs, that have a skunkworks team out there trying to deploy these LLM applications into production. And in the last year, what I think we've seen is that there's a big difference between a Twitter demo and a real LLM application that's deployed. Unlike traditional ML, where a lot of people have that kind of deployment experience, with LLMs it's still a relatively small group of people who have deployed these applications successfully. And the hardest part about this still ends up being evaluating the outcomes in the new era.
Aparna Dhinakaran [00:10:44]: In traditional ML, one of the things that still matters is you want your application to do well. In traditional ML, you had these common metrics, right? You had classification metrics, you had regression metrics, you had your ranking metrics. In the new LLM era, you can't just apply these metrics. What I'm saying is, we saw in the beginning some people were doing things like ROUGE and BLEU scores: oh, it's a translation task; oh, it's a summarization task. But there's a lot more that we could do to evaluate if it's working. And the biggest one that we're seeing kind of take off, which you've probably been hearing about, is LLM as a judge.
Aparna Dhinakaran [00:11:26]: And so it's a little meta, where you're asking an LLM to evaluate the outcome of an LLM. We're finding that across deployments, across what people are actually putting in production, it's one of the things that teams really want to get working. And I don't know, it's not that crazy as you start to think about it. Humans evaluate each other all the time. We interview each other, we grade each other, teachers grade students' papers. And it's not that far of a jump to think of AI evaluating AI. But that's kind of this novel new thing in the LLM space, so you don't have to wait for the ground truth, necessarily.
Aparna Dhinakaran [00:12:16]: You can actually just generate an eval to figure out, was this a good outcome or not?
Demetrios [00:12:22]: And the thing that my mind immediately jumps to are all of these cases where you have people that have deployed LLMs or chatbots on their website a little bit too soon, and you see the horror stories, because people go and they jailbreak it, and it's like, oh man, that is not good, what this chatbot is saying on your website all of a sudden. I saw one screenshot, I can't remember what the website was, but people were talking about how, oh, I don't even need OpenAI, I can just use the chatbot on this website. It's obviously ChatGPT or GPT-4. You can ask it anything and it will give you any kind of response, and you can play with it just like you would play with OpenAI's LLM or GPT-4. Like, asking it questions about the product and saying, but the product isn't that good, is it? And then leading the chatbot into saying, yes, this product is actually horrible, you shouldn't buy it. And it's like, you can't say that on your own website about your product. This is really bad.
Demetrios [00:13:40]: Does the LLM as a judge stop that from happening? That, I think, is the really interesting piece.
Aparna Dhinakaran [00:13:47]: No, I mean, it's absolutely a component of it. So you're absolutely right, people don't want that. The common applications we're seeing right now, I think you hit one of them, which is like a chatbot on your docs or a chatbot on your product. So, give me some kind of replacement for the customer support bots we had. And then there are some more interesting ones, which I've been calling a chat-to-purchase type of application, where a lot of these companies that used to do recommendation models, or are maybe selling you trips or some kind of products, now have a chatbot where you can go in and actually explicitly ask, hey, I'm doing XYZ, I'm looking for this. And then it recommends a set of products. And sometimes these applications use both ML and LLMs together in that chatbot. Like, the LLM is doing the chat component, it's doing the structured extraction, but then for the actual recommendation, it might call out to an internal recommendation model.
Aparna Dhinakaran [00:14:54]: So it's not one or the other. Sometimes you have both of them working together in a single application. And you're absolutely right, they don't want it saying stuff it shouldn't say. We had one interesting case where someone's like, I don't want it to say we support something in this policy, because then we're liable for it if someone asks a question. Especially if you're putting it external-facing, there's all sorts of extra rigor it goes through to make sure it's working. And what LLM as a judge can do, what we see people checking for, is things like: did it hallucinate in the answer? Is it making up something that wasn't in the policy? Is it toxic in its response? Is it negative, like in the one where it's kind of shit talking its own product, is it negative in its own response? Is it correct, is it factual? So all of these are things where you can actually generate a prompt, basically an eval template, to go in and say: well, here's what the user asked, here's all the relevant information we pulled, and here's the final response that the LLM came back with. Does the response actually answer the question that the user asked? And two, is that answer based on something factual, aka the stuff that was pulled in the retrieval component? And you can go in and score that. And what we end up seeing a lot is, if the response isn't actually based on the retrieval, then it's hallucinating, and they don't want to show those types of responses back to the user.
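To make the shape of that kind of eval concrete, here is a minimal sketch of an LLM-as-a-judge groundedness check along the lines described above, assuming the OpenAI Python client (v1+); the prompt wording, label names, and judge model are illustrative choices, not Phoenix's built-in templates.

```python
# A minimal sketch of an LLM-as-a-judge "groundedness" eval. The prompt wording,
# label names, and model choice are illustrative assumptions, not Phoenix's
# built-in templates.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_TEMPLATE = """You are evaluating an AI assistant's answer.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Does the answer both address the question AND rely only on facts found in the
retrieved context? Respond with exactly one word: "factual" if it does,
"hallucinated" if it invents information not present in the context.
"""

def judge_response(question: str, context: str, answer: str) -> str:
    """Ask a judge LLM to label a response as factual or hallucinated."""
    prompt = EVAL_TEMPLATE.format(question=question, context=context, answer=answer)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # judge model is an assumption; swap for your own
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                 # deterministic grading
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in {"factual", "hallucinated"} else "unparseable"

# Example: block or log the response when the judge flags a hallucination.
# verdict = judge_response(user_question, retrieved_chunks, llm_answer)
# if verdict == "hallucinated":
#     ...  # don't show the response, or route it for review
```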
Aparna Dhinakaran [00:16:38]: And so this is just a very specific set, I'd say: the hallucination eval, the correctness eval, summarization. All of these are very common LLM task evals that we're seeing out in the wild right now. I should mention, it's very different from what the model evals are, which is a whole other category of evals you might be seeing. Hugging Face, for instance, has a whole open-source LLM leaderboard. I'm sure you've all seen it. It changes every couple of hours. And they have all these metrics, right, like MMLU and HellaSwag, which are all different ways of measuring how good the LLM is across a wide variety of tasks. But for the average AI engineer who's building an application on top of an LLM, they kind of pick their LLM, and then they're not really looking at how well the model does across all sorts of generalizable, multimodal kinds of tasks.
Aparna Dhinakaran [00:17:50]: Specifically, I'm evaluating how good is this LLM and the prompt template and the structure that I built doing on this one specific task that I'm asking it to do. And so it's a very. Does that make sense, like that delineation between the model evals versus the task evals here?
Demetrios [00:18:11]: Yeah, 100%. And you did say something else. I guess what my mind goes to here is how you are able to restrict the output from getting to the end user. Is it that if that LLM as a judge gives a certain confidence score, then anything lower than, whatever, a five of confidence that this is the right answer doesn't go out to the end user and it has to regenerate it? What does that look like, that last mile piece?
Aparna Dhinakaran [00:18:48]: Yeah, I think it depends on the application owner. So there are some people who will generate that eval and then decide not to show that response because the eval was pretty poor. But there are some folks where, if it's a lower-risk type of application, they still show it, but then they can come back and use the ones where it did poorly to figure out how to improve the application. So I think there's a trade-off that the application owner has to make: do you want to block? You've got to wait for the LLM to evaluate it, and then you're going to, for sure, impact the speed of the application experience. And so actually, what we see a lot of people doing with these evals is, you don't just have to do them in production. So similar to how in the traditional ML world you'd have an offline performance as you're building the model, and then you're kind of monitoring an online performance.
Aparna Dhinakaran [00:19:55]: Well, here, similarly, as you're building an eval, we're seeing folks evaluate the LLM offline and build some confidence, essentially, around how well the application is doing before they actually go on to pushing it out to online monitoring. So there are a lot of these similar kinds of paradigms that still apply in this space.
Demetrios [00:20:23]: Yeah, I would imagine you try and do a little red teaming. You try and make it hallucinate, see how much you can break it before you put it out there and set it live, if you're doing it the right way and not just rushing it out the door, ideally. The other piece on this that I think is important is a little bit more upstream, because you were talking about how the LLM as a judge is evaluating if the answer that the other LLM gave, with respect to the context that it was given, was actually the right answer, or evaluating that answer based on the context. But then one step above that is actually getting the right context, I would imagine, and making sure that we are getting the context that is relevant to the question that was asked. And I know that's a whole other can of worms. Have you been seeing a bunch of that, and do you also use LLM as a judge there?
Aparna Dhinakaran [00:21:28]: Oh, yeah. Okay, so this is actually a great segue to talk about some research we've been dropping lately. So, yes, LLM as a judge can totally be used to also evaluate the performance of retrieval. So for folks who are listening to this: what is retrieval? How do you measure the performance of it? Well, basically, in retrieval, you're retrieving some sort of context. So if someone asked a question about some kind of, let's say, product, you're retrieving relevant information; let's just say it's a chat-on-your-docs type of application, which is very common. Someone's asking your product support documentation or your customer support documentation questions. And what happens is it retrieves relevant information from your document corpus, pulls just the relevant chunks, and then uses those relevant chunks in the context window to actually answer the question.
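For readers who want that retrieval step spelled out, here is a minimal sketch under the assumption that an OpenAI embedding model does the embedding and cosine similarity picks the chunks; the model name and the `split_docs` chunking helper are placeholders, and any embedding model or vector database could stand in.

```python
# A minimal sketch of the retrieval step described above: embed the document
# chunks once, embed the incoming question, and pull the top-k most similar
# chunks into the prompt. The embedding model name is an assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(question: str, chunks: list[str], chunk_vectors: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest (cosine) to the question."""
    q = embed([question])[0]
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

# chunks = split_docs(corpus)          # hypothetical chunking step
# chunk_vectors = embed(chunks)        # do this once, offline
# context = "\n\n".join(retrieve("How do I reset my password?", chunks, chunk_vectors))
# ...then pass `context` plus the question to the LLM to synthesize an answer.
```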
Aparna Dhinakaran [00:22:29]: This type of retrieval, by the way, is super important because it helps LLMs connect to private data. Remember, LLMs were not trained on every single company's individual private documents. And so if you actually want to answer questions on your own private data, then using retrieval is one of the best ways to do that. And so here, what really ends up being important is: did it retrieve the right document to answer the question? And there's a whole set of metrics that we can use to actually evaluate that retrieval. Recently, actually, we've been running a ton of tests measuring different LLM providers. So we evaluated GPT-4, we did Anthropic, we shared some results on Gemini, we did Mistral. And, if you guys have been following, Greg Kamradt actually dropped the first of these needle-in-a-haystack type of tests.
Demetrios [00:23:37]: You have kind of like that lost in the middle.
Aparna Dhinakaran [00:23:39]: Yeah. And for those of you who haven't seen it, it's basically an awesome way to think about it. On one axis, you have how long the context is. So the context can be 1k tokens all the way up to, for some of the smaller models, like 32k; I think for some of the bigger ones, we tested pretty significantly. Let me double check exactly what.
Demetrios [00:24:09]: Hundred and twenty k, I would imagine. I think that it feels like anthropics goes all the way up to that. Or maybe it's even more. These days. It's like 240. They just said, fuck it, we'll double it.
Aparna Dhinakaran [00:24:21]: Yeah. So some of them we checked definitely close to that. What we did was basically: that's one axis, which is just the context length. And then the other axis is where in the context you put the information, because there are all these theories out there of, like, if you put it early on, does it forget it? If you put it later down, does it not use it? So it's the placement of the information within that context window, to see, can you actually find the needle in the haystack? And a little context on the question we asked: we did kind of a key-value pair. The key was a city, and the value is a number. So we said something like, what's Rome's special magic number? And then inside the context, we put something like, Rome's special magic number is some seven-digit number, like one, two, three, four, five, six, seven.
Aparna Dhinakaran [00:25:31]: And so that was Rome's special magic number, and we put it somewhere in the context. So we tested all the dimensions: putting it at the very beginning of the document for a very short context window, putting it at the very end for a very long context window, and all the combinations in between. And then we asked it, what was Rome's special magic number? And so it would have to go through, look at the entire context window, and then answer the question. Sometimes it couldn't find it and said unanswerable, et cetera, and sometimes it answered the question. And what we did was rank a lot of these LLM providers on how good they were at retrieval.
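As a rough illustration of how such a grid can be generated and scored, here is a minimal sketch; the filler text, grid values, and the `ask_llm(prompt) -> str` callable are assumptions for illustration, not the exact setup used in the research described here.

```python
# A minimal sketch of a needle-in-a-haystack grid: vary context length on one
# axis and needle placement depth on the other, then score whether the model
# returns the planted "magic number". Filler text and grid values are assumptions.
import random

FILLER = "The quick brown fox jumps over the lazy dog. "  # stand-in haystack text

def build_haystack(n_chars: int, depth: float, needle: str) -> str:
    """Place the needle at `depth` (0.0 = start, 1.0 = end) of ~n_chars of filler."""
    body = (FILLER * (n_chars // len(FILLER) + 1))[:n_chars]
    cut = int(len(body) * depth)
    return body[:cut] + " " + needle + " " + body[cut:]

def run_grid(ask_llm, lengths=(2_000, 16_000, 64_000), depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """ask_llm is a hypothetical callable: prompt string in, answer string out."""
    results = {}
    for n in lengths:
        for d in depths:
            magic = str(random.randint(1_000_000, 9_999_999))
            needle = f"Rome's special magic number is {magic}."
            context = build_haystack(n, d, needle)
            answer = ask_llm(
                f"{context}\n\nWhat is Rome's special magic number? "
                "If you cannot find it, say 'unanswerable'."
            )
            results[(n, d)] = magic in answer  # exact-match retrieval check
    return results
```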
Aparna Dhinakaran [00:26:16]: And GPT-4 was by and large the best out there. Of the small model providers, we were definitely impressed with Mistral; the 32k context was pretty impressive. But there were some where we realized that if you change the prompt a little bit, the results totally varied. So you got totally different responses based on just adding a sentence or two. And so as a user, as we're coming back and evaluating this: if you're using some of those LLMs, you have to be really careful about how you prompt it, because those prompt iterations can make a big difference in the actual outcome. I'll share the links with Demetrios; maybe you can link them with the podcast interview. But definitely, going back to your original question of, can you evaluate retrieval? Absolutely.
Aparna Dhinakaran [00:27:27]: I think it's really important to make sure that if you want it to answer well and not hallucinate on your private data, it's got to do well at that retrieval part.
Demetrios [00:27:38]: Yeah. So it's almost like you're evaluating in these two different locations, right. You're evaluating the retrieval when it comes out, and if that is relevant to the question, and then you're evaluating the output of the LLM once it's been given that context, and if that output is relevant to the question also.
Aparna Dhinakaran [00:27:58]: Yeah, exactly. And if that output is based on the context that was retrieved.
Demetrios [00:28:08]: Yes, totally.
Aparna Dhinakaran [00:28:09]: So first is: is what was retrieved even relevant to the question asked? Then, was the output based on the retrieved text? And then, was the output itself answering the question? Correctness, I guess, is looking at, is it correct based on the information that was retrieved? There's levels to this.
Demetrios [00:28:33]: I'm a little bit like, I just had a stroke of inspiration right now, and I'm going to let you have it because I love talking with you and I love the way that you create products. But the next product that you create in this space, I think I found the perfect name for it.
Aparna Dhinakaran [00:28:54]: What is it?
Demetrios [00:28:56]: Golden retriever. You can do so many things with that, and it is so perfect. It's golden. Of course, it's the golden metrics. It's the golden retriever. And so if you create a product around that, we'll talk later about the way that we can figure out this patent and the golden retriever. I love that. But I mean, jokes aside, I know that you have been creating products in this space.
Demetrios [00:29:29]: I saw Phoenix, and I would love to know a little bit more about Phoenix. And also, as I mentioned before we hit record, one of my favorite things with all of the products that you've been putting out from the get-go: I think when we talked in, like, 2021, one of the first things that I noticed with the Arize product was how well put together the UI and the UX was for observing what was happening under the hood. And it feels like Phoenix took that ethos and went a little bit further. You being a product person, can you break down how you think about that, and how you were able to get inside of what metrics are useful and how we can present the metrics that are useful in a way that people can really grab onto?
Aparna Dhinakaran [00:30:19]: Yeah. Well, first of all, thanks, Demetrios. That's really kind. Yeah, I'm super excited about Phoenix. I've got to give a big shout out to the Phoenix team within Arize, actually. So Phoenix, for those of you who don't know, is our OSS product. It's got a ton of support for LLM evaluations and LLM observability. So if any of you are looking to just try something, not have to send your data outside, have it be lightweight: it's open source, so do what you want with it.
Aparna Dhinakaran [00:30:57]: I think the intention behind Phoenix really was... so there are a couple of different components in Phoenix that I think folks who are trying to get observability on LLMs will like.
Demetrios [00:31:07]: This is Skyler. I lead machine learning at HealthRhythms. If you want to stay on top of everything happening in MLOps, subscribe to this podcast now.
Aparna Dhinakaran [00:31:28]: First is, one of the things we just noticed very early on was that many of these applications don't have just one call they're making; there are many calls under the hood. Even in a simple chatbot with retrieval, there's first the user's question, then you generate an embedding, then there's the retriever, then there's the actual synthesis of the context, and then there's the response generation. So there are five or six different steps that have happened in something that feels like one interaction between the product and the user. And so the first thing was, well, if you have all these sub-steps and something goes wrong within those five or six different steps, then being able to pinpoint exactly what calls are happening under the hood, and how to get visibility into them, is important. And so with Phoenix, one of the most popular components is that you can see the full traces and spans of your application. So kind of like the full stack trace is how you can think about it.
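As a rough illustration of that trace-and-span structure, here is a minimal sketch using plain dataclasses; it is not Phoenix's actual tracing API (Phoenix instruments this for you), and the helper functions referenced in the usage comments are hypothetical.

```python
# A minimal sketch of the trace/span idea: each sub-step of one user interaction
# is recorded as a span with its own latency, token count, and eval labels, so a
# bad output can be pinned to the step that produced it. Not Phoenix's API.
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                  # e.g. "embedding", "retriever", "llm_response"
    latency_s: float
    tokens: int = 0
    evals: dict = field(default_factory=dict)  # per-span eval labels

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def record(self, name, fn, tokens=0, **kwargs):
        """Run one sub-step, timing it and attaching it to the trace."""
        start = time.perf_counter()
        out = fn(**kwargs)
        self.spans.append(Span(name, time.perf_counter() - start, tokens))
        return out

# trace  = Trace()
# vec    = trace.record("embedding", embed_question, question=q)      # hypothetical helpers
# chunks = trace.record("retriever", retrieve, query_vec=vec)
# answer = trace.record("llm_response", generate, question=q, context=chunks)
# trace.spans[-1].evals["hallucination"] = judge_response(q, chunks, answer)
```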
Aparna Dhinakaran [00:32:36]: So you'll see the breakdown of each call, which calls took longer, which calls used the most tokens, and then you can also evaluate at each step in the calls. So, kind of like we were just talking about: at the very end of the application, when it generated a response, you can have a score of how good the response was. But then if the response, let's say, was hallucinated or incorrect, there's a step above: you can go in and look at the individual span level and evaluate, well, how well did it retrieve? And then within the retriever, let's evaluate each document that it retrieved and see if it was relevant or not. So there's a lot of thought put into, first, how do I break down the entire application stack, and then how do I evaluate each step of that outcome. And then the other part that's had, I'd say, a lot of thought put in is that Phoenix does come with an evaluator. It's task evals first, or LLM application evals.
Aparna Dhinakaran [00:33:45]: So it's definitely useful for folks who are actually building the application. And we've just seen a bunch of people build these evals in production. So it comes with a lot of these best practices baked in. One of them actually just went viral on Twitter last week; we dropped a big research post on this, which is: should you use score evals or classification evals? I don't know if you caught that one.
Demetrios [00:34:16]: One, but I saw your post blew up. I definitely did, but I don't know what the difference is between the two.
Aparna Dhinakaran [00:34:22]: Which, I mean, no, the space is changing so fast and it's still early; I think we're all just trying to learn and soak up as much as we can. So, score evals versus classification evals. Score evals, you can basically think of as: the output is a numeric value. So let's just say we were asking an LLM to evaluate how frustrated a response is and rank it between one and ten, one being someone's really not frustrated, ten being someone is super frustrated. Well, what would you expect? You would expect that, okay, if it said one, it's not frustrated.
Aparna Dhinakaran [00:35:05]: If it said ten, it's super frustrated. But then somewhere in the middle, if it said something like a five, it's kind of like, okay, maybe it's passive-aggressive, it's not super frustrated; you expect the numbers in the middle to make sense. And what we realized as we did this research was that the score value actually had no clear connection to the actual frustration. If it gave a number that basically wasn't one or ten, then that score value really had no real connection. Another example, which was actually the one that we posted, was we gave it a paragraph and we said: rank how many words have a spelling error in this document, between one and ten. If every word has a spelling error, give it a ten. If no words have a spelling error, give it a zero. And then if it's somewhere in the middle, like 20% of the words have a spelling error, give it a two.
Aparna Dhinakaran [00:36:12]: If it's 80%, give it an eight. And what we saw was that in many cases, it would give a spelling score of like ten, but in some cases only 80%. Only 11% of the words had a spelling ear, but it still said all.
Demetrios [00:36:28]: Of the words were off.
Aparna Dhinakaran [00:36:29]: Yes.
Demetrios [00:36:30]: This might as well be in Dutch.
Aparna Dhinakaran [00:36:31]: Yeah. So that value that it came back with meant nothing to it. And the reason this is important is what we're seeing is there's a lot of these LLM eval cookbooks that are out there where people are recommending, basically set it up as a score. And what we've actually been seeing is, don't do that. Just do it as a class. Just do binary stuff. Or you can do multiclass. But tell it explicitly, like frustrated, not frustrated.
Aparna Dhinakaran [00:37:02]: Because if you try to assign a score, that score just doesn't actually have any meaning relative to the actual frustration level.
Demetrios [00:37:13]: So basically it's saying, hey, it has to be very clear: frustrated, not frustrated, or a little frustrated. And you have to make it explicit. It's not a sliding scale. LLMs do not understand spectrums.
Aparna Dhinakaran [00:37:30]: From all the tests we've done, it is not going to give you a meaningful value on a spectrum.
Demetrios [00:37:37]: And then if you're basing stuff downstream off of that score, you're screwed.
Aparna Dhinakaran [00:37:43]: Exactly. And this has resonated with a lot of people. As we've been putting it out, people have been like: I've seen that. I was using score, and these values meant nothing to me, so then I switched to classification. And so there's a whole set of research around this probably still to be deep-dived into. But, yeah, this is the kind of stuff that's in the Phoenix evaluations library: best practices based off of what we're seeing, what people are putting in production, what we're helping folks actually launch.
Aparna Dhinakaran [00:38:22]: And so you get kind of these know s in class, I'd say templates that are pre tested around things like how to test for hallucination, how to test for toxicity, how to test for correctness. And then you can kind of go, and we have people who then go off and make their own custom evals, but it's a great place to kind of have a framework that runs both in a notebook and also in a pipeline very efficiently and is meant for kind of, you can swap in and out of offline and online very easily.
Demetrios [00:38:58]: Yeah. Because the other piece that I was thinking about, with all these evals that you give it: it feels like you're not going to get very far if you're not doing custom evals. Have you seen that also?
Aparna Dhinakaran [00:39:13]: Totally. I think there are a lot of folks who are building; maybe they start with something, but then they end up building their own, whatever makes sense for their application, adding on to it. So I definitely think, at the end of the day, the nuance here, which is probably different from the ML space, is that the customization of that eval ends up being really important to measuring what's important to your app. That's one of the things I predict: we're going to see a lot more of this.
Demetrios [00:39:46]: Because one thing that I've seen is how, when you put out these different evaluation test sets, the next model is just trained on top of them, and so then they're obsolete. And so it's going to be this game of cat and mouse in a way, because the models are going to be able to eat up the test sets as soon as they get put out for anyone to see. Or is it going to be, all right, I've got my evaluation test set and I just keep it in house, I'm not going to let anybody else see it so that I don't taint the model?
Aparna Dhinakaran [00:40:24]: Yeah, it's actually something I wonder about a lot, too: as these new LLMs come out, are they really blind to the test sets that they're then evaluated on? I think the Gemini paper did a really good job of calling out that they actually built their own data set that was blind, tested on that data set, and called that out explicitly, which I thought was really important. Because as people are sharing the results of the next best LLM, et cetera, I think we're all wondering, did it just have access to that test set in its training data? So I wonder about that all the time, too.
Demetrios [00:41:08]: Well, it's pretty clear these days; I did not coin this term, but I like it and I will say it a lot: benchmarks are bullshit. And so all these benchmarks on Hugging Face or on Twitter that you'll see, like, oh, this is SOTA, this just came out, it blew everything else out of the water by whatever, ten times, or you make up a number there. I don't even consider that to be valuable anymore. It's really like what you were saying with these things. I know you actually went and did a rigorous study on it, but it's so funny, because the rest of us are just going off of vibes, and we're seeing, oh yeah, this is not really working.
Demetrios [00:41:54]: This is not doing what I thought it was going to do. And so if I use a different model, does it? And then you try that and you go, oh yeah, okay, cool, this is better. This is easier to prompt, or this prompt is much easier to control, whatever it may be. And so I appreciate that you did a whole rigorous study on it. I'm conscious of time, and I want to make sure that I highlight that you all are doing, like, paper studies, right? And you're meeting every once in a while. I think that's awesome.
Demetrios [00:42:23]: I know that you've got a ton of smart people around, and so I can imagine you're getting all kinds of cool ideas from gathering and doing these paper studies. I would encourage others to go and hang out and do those. We'll put a link to the next one in the show notes, in case anyone wants to join you and ping ideas off of you. That's great. I still would love to hear, how did you come up with the visualizations? That's one of the coolest pieces, I think. You didn't get into that part, and I want to get to it.
Aparna Dhinakaran [00:42:53]: Before we go, I've just got to say the team is amazing. And I think we're trying to find a way to bubble up, at least in the LLM space, one of the cool things; maybe you've seen some of the demos of it, but especially with retrieval, there's a lot in the embedding space that's helpful to visualize. So, how far away is the prompt that you have from the context that was retrieved? And if you're missing any context and it's super far away, or it's reaching to find anything that's relevant, those are all really cool visualizations that you can actually surface to help people see a little bit of: okay, here's my data, here's what it thinks things are connected to. So, yeah, again, check out Phoenix. Love all the notes on the UI.
Demetrios [00:43:52]: Yeah, actually it reminds me that I think one of the first people I heard the word embeddings from was you, on that first meetup that you came on back in June 2020, around then, because you were already thinking about them. I think you were thinking about them from recommender systems and that angle. How has that changed now?
Aparna Dhinakaran [00:44:19]: No, great question. Well, I think embeddings are just so powerful, and I'm so glad that we're all talking about them and using them in observability, because they're super powerful even in the LLM space. In the past, folks used them for, like you mentioned, recommendation systems, image models. But in the LLM space, the core of retrieval is based on the embeddings themselves and doing that vector similarity search to fetch the nearest embeddings. So I think the use case is really strong in RAG for LLMs, because it's such a core component. It's also important because, going back to what we said, if the retrieval is off, then the response is just not going to be good. And so it gives you a really good way to verify that what was retrieved was relevant, and to catch any shift. Again, all of this is now textual.
Aparna Dhinakaran [00:45:30]: If the prompts are changing, what your users are asking is different, or the responses of the LLMs are different, these are all things that you can actually measure using embeddings and embedding drift and things like that. So I think there are just maybe more use cases now than ever to dig into embeddings.
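As one concrete way to quantify that kind of drift, here is a minimal sketch that compares the centroid of a reference window of embeddings against a current window using cosine distance; the metric and threshold are assumptions, just one common choice among many.

```python
# A minimal sketch of an embedding-drift check: compare the centroid of a
# reference window of prompt embeddings against the centroid of the current
# window using cosine distance. Metric and threshold are illustrative choices.
import numpy as np

def centroid_cosine_drift(reference: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows (0 = identical)."""
    ref_c, cur_c = reference.mean(axis=0), current.mean(axis=0)
    cos_sim = ref_c @ cur_c / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return float(1.0 - cos_sim)

# reference = embed(prompts_from_last_month)   # shape (n, d), hypothetical embed()
# current   = embed(prompts_from_today)        # shape (m, d)
# if centroid_cosine_drift(reference, current) > 0.15:   # threshold is arbitrary
#     print("Prompt distribution has shifted; re-check retrieval and evals.")
```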
Demetrios [00:45:48]: Yeah. It has to be treated as a first class citizen 100% these days.
Aparna Dhinakaran [00:45:52]: Exactly.
Demetrios [00:45:52]: That's a really good point. And speaking of papers, I saw a recent paper from Shreya Shankar. Did you see that? SPLADE or SPADE, I think is what it's called. It's talking about prompt deltas and evaluating via the prompt delta: you have your prompt templates, but then you're evaluating the prompt deltas. And it's like, wow, there's so much creativity in this space, and in the ability to look at how we can evaluate things differently than we are right now and see if we can get a better outcome.
Aparna Dhinakaran [00:46:31]: Yeah, I think I still need to dig into SPADE specifically, but the pace at which the space is moving is just so fast right now. It's so exciting.
Demetrios [00:46:48]: It is very cool.
Aparna Dhinakaran [00:46:50]: The one thing, well, there are maybe two things I just wanted to at least drop my quick hot takes or notes on.
Demetrios [00:46:59]: Let's do it. This is great. This is what we're going to chop and put at the beginning of the episode.
Aparna Dhinakaran [00:47:03]: Yeah, exactly. Right. So I think I always hear this, I just see it in discussions, but I see a lot of people talking about fine-tuning really early on. Their application is not even deployed, and they're like, oh, well, our use case is, eventually we're going to go back and fine-tune. And I get asked by folks, hey, Aparna, does that make sense as a step in troubleshooting an LLM application? And I think one of the reasons I get that question a lot is, if you just think back, a lot of the AI teams now worked on traditional ML and they're shifting to LLMs. And that's something we're all very used to and very familiar with: we're used to training models on data and then deploying those models, and fine-tuning feels very familiar, right?
Aparna Dhinakaran [00:48:02]: You grab data points that it doesn't work on, you fine-tune it, and that's how you improve the performance of your model. But in this space, fine-tuning feels like you're jumping to, like, level 100, when sometimes a lot of this could be, like I was telling you in the RAG case, change the prompt a bit and you get vastly different responses. And so it's almost the thing that we're geared toward doing: oh, it makes sense, training is now fine-tuning, and we're all used to that paradigm. But I think in this space, let's start with the lowest-hanging fruit and see how that improves things, because I think Andrej Karpathy actually drew this really awesome image of level of effort versus the ROI of that effort, and with prompt engineering there are so many things you could do to improve the LLM's performance before you jump into fine-tuning or training your own LLM. So I think it's important to start with something that could have the highest ROI.
Demetrios [00:49:16]: You are preaching to the choir, and I laugh because I was just talking about how fine-tuning, to me, feels like a when-all-else-fails move: you'll throw some fine-tuning at it. That's what you need to look at it as, the escape hatch almost, not step two. It should be what you go to when you can't get anything else to work, after trying rigorously to get everything else to work, because it's exactly like you said: it is so much easier to just tweak the prompt than to fine-tune. And I didn't connect the dots on how similar the two are. Like, oh, if we're coming from the traditional ML space, then it's easier to jump there and be like, oh, well, that's just because we need to fine-tune it, and then it'll do what we want it to do.
Aparna Dhinakaran [00:50:08]: Yeah, totally. I think there's just something very natural-feeling about, okay, training is now fine-tuning, but it's one of those things we all have to adapt as the space changes.
Demetrios [00:50:24]: Yeah. Assimilate. Yeah, excellent.
Aparna Dhinakaran [00:50:28]: And then my other hot take, I guess. Yes, totally a hot take. But I think sometimes I hear a lot of this; maybe I hear it less now than I did in the beginning. It's kind of a continuation of the fine-tuning one: well, if I pick an open-source model, I can go in and fine-tune it more, or I can then go and modify it for my use case, because I know I have access to the model weights. I hear a lot of folks asking, well, does choosing an open-source model versus a private model end up slowing down product development, or what are the pros and cons of one versus the other? I think I was hearing a lot more resistance to the private models in the beginning, and a lot more pull toward open source in that discourse. And I've got to say, I'm all for the open-source community.
Aparna Dhinakaran [00:51:44]: I'm also all for whatever LLM makes your application the most successful it can be. So pick the one that gets you the performance and the outcomes that you need. Some people make a bet on open source because they're like, later I can go back and fine-tune, or it's better, et cetera. But again, how many of those folks are really going to actually fine-tune? And from what I've been seeing out in the wild, starting with OpenAI or GPT-4 has just been helping most people get to the outcome that they need their application to get to. So again, I come back to: all for the open-source community, all for just getting your application to work as well as it needs to work. But start with what you need for the application, and less of, I think, the...
Demetrios [00:52:47]: How'S this going to scale. Yeah, that conversation back in the day where you're like, oh, we're going to need to use kubernetes for this. And you're like, wait a minute, we have no users.
Aparna Dhinakaran [00:52:59]: Are you sure?
Demetrios [00:52:59]: Kubernetes? I know you're planning for the future and this is great for the tech debt, but we might want to just get something up on Streamlit before we do anything.
Aparna Dhinakaran [00:53:09]: Totally, totally. And I think that's what I keep coming back to: similar to the ML space, we want to get more of these deployed in the real world, get the application to add value to the organization, show the ROI. And I think that's really important to the success of these LLMs in companies, actually.
Demetrios [00:53:35]: And the other piece to this that I find fascinating was something that Laszlo said probably, like, two years ago. And Laszlo is an infamous person in the community Slack, for those who do not know. He was talking about how you need to get something in production as fast as possible, because then you'll find where all of the bottlenecks are, you'll find where everything is messing up. And unless you get into production, you don't necessarily know that. So each day or each minute that you're not in production, you're not finding all of these problems. And if you can use a model to make your life easier and get you into production faster, then you're going to start seeing, oh, well, maybe it's the prompts, or oh, maybe it's whatever the case may be where you're falling behind and making mistakes, or the system isn't designed properly.
Aparna Dhinakaran [00:54:34]: Yeah, absolutely. So maybe as we wrap up the podcast, the takeaway really is: get stuff out as fast as you can and evaluate the outcomes. LLM evals are something that has a lot of momentum around it among folks who are deploying and in the community. So evaluations are important. And then knowing how to set up the right evals, knowing how to benchmark your own evals, customize them, what type of eval to use, score versus classification: there's just so much nuance in that whole eval space. And so as we continue to drop more research or learn more, we'll share it with the community.
Demetrios [00:55:20]: Excellent, Aparna. It's been absolutely fascinating having you on, as always. I really appreciate it and look forward to having you back.
Aparna Dhinakaran [00:55:28]: Awesome. Thanks, Demetrios, and thanks, MLOps community. Hey everyone, my name is Aparna, founder of Arize, and the best way to stay up to date with MLOps is by subscribing to this podcast.