MLOps Community

Evaluating LLMs with Lessons From Teaching // Stephanie Kirmer // Agents in Production 2025

Posted Aug 06, 2025 | Views 45
# Agents in Production
# Evaluating LLMs
# DataGrail

SPEAKER
Stephanie Kirmer
Senior Machine Learning Engineer @ DataGrail

Stephanie Kirmer is a staff machine learning engineer at DataGrail, a company committed to helping businesses protect customer data and minimize risk. She has almost a decade of experience building machine learning solutions in industry, and before going into data science she was an adjunct professor of sociology and a higher education administrator at DePaul University. She brings a unique mix of social science perspective and deep technical and business experience to writing and speaking accessibly about today's challenges around AI and machine learning. Learn more at www.stephaniekirmer.com.


SUMMARY

Evaluating LLM performance is vital to successfully deploying AI to production settings. Unlike regular machine learning, where you can measure accuracy or error rates, with text generation you're dealing with something much more subjective, and you need to find ways to quantify quality. As we combine LLMs together and add other tools in the agentic context, this becomes even more challenging, requiring robust evaluation techniques. In this talk I propose an approach to this evaluation that borrows from academic evaluation - namely, creating clear rubrics that spell out what success looks like in as close to an objective fashion as possible. Armed with these, we can deploy additional tested LLMs to conduct evaluation. The result is highly efficient and solves much of the evaluation dilemma, although there are still gaps that I will also discuss. (This is an adaptation of an article I wrote: https://towardsdatascience.com/evaluating-llms-for-inference-or-lessons-from-teaching-for-machine-learning)


TRANSCRIPT

Stephanie Kirmer [00:00:11]: Excellent, thanks so much. Yeah, I'm really excited to be here, and glad everybody's joining us today. I wanted to talk a little bit about something that's near to my heart. I'm a staff machine learning engineer at a company called DataGrail, where we produce data privacy software solutions, but I actually used to be a teacher. I was a sociologist and I taught college classes for quite a few years. And so I've been able to bring a little bit of my background into today's talk, which I'm really excited about.

Stephanie Kirmer [00:00:45]: So let me just dive right in, since we're in a lightning talk and I've got a limited amount of time to cover this. The first thing I want to talk about is: what is the problem that I'm trying to help you solve here? You've got a task, you need something done, and you think maybe an LLM can help you with that. But you can't just put an LLM into production based on hope and good vibes, because we know LLMs can make mistakes, sometimes even catastrophic mistakes. We don't want to risk that. You need to have confidence that that's not going to happen before you deploy anything. We need to evaluate the LLM. Right. So when it comes to free-text responses from an LLM, we have two options.

Stephanie Kirmer [00:01:27]: We've got a human in the loop, using a human judge, which can be slow and expensive, or we can use a computer, but then we're using another LLM, and it's like, can we trust that LLM? It's LLMs all the way down. You have some built-in evaluation metrics in some evaluation frameworks, maybe correctness or conciseness. But personally I find these kind of vague, and they're not very transparent about what it is they're actually measuring and what they're going to do in your production setting. So my proposal is that we actually use an LLM to do the evaluation, but we adapt that process to minimize subjectivity and ambiguity by using a rigorous rubric. And this is the thing that I'm bringing to us from the teaching world. So why a rubric? What is this for? What is the point of this? Well, it's a tool that will tell you and everyone involved clearly what is expected and what success looks like. When I'm teaching, I will create a rubric at the same time I write the assignment, whether it's a paper or a multiple-choice exam or whatever. Mostly this is for free-text responses.

Stephanie Kirmer [00:02:32]: What is the student supposed to learn by doing this assignment? What does it look like when they have achieved the learning that was the goal here? So the rubric is meant to measure whether the student learned what they were supposed to learn and whether they're exhibiting the skills they were supposed to be able to demonstrate after doing this. This keeps me accountable as a professor. Right. But it also makes clear to the student what they're being asked to do before they ever pick up a pen or get into anything like that. In the machine learning context, it's pretty much the same. Right. We're using a rubric to define whether the LLM successfully did the task that we asked.
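As a rough sketch of the idea, a rubric can be spelled out verbatim in the prompt given to an evaluator LLM; the `call_llm` helper and the exact prompt wording below are illustrative placeholders rather than anything prescribed in the talk or any specific framework.

```python
# Sketch: spelling a rubric out verbatim in an evaluator prompt.
# The rubric text and call_llm() are illustrative placeholders.

def build_evaluator_prompt(rubric: str, question: str, response: str) -> str:
    """Assemble a grading prompt that makes the expectations explicit."""
    return (
        "You are grading an AI assistant's response using the rubric below.\n"
        "Use only the point values listed for each component, and return\n"
        "one line per component in the form 'component: score'.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"QUESTION:\n{question}\n\n"
        f"RESPONSE TO GRADE:\n{response}\n"
    )

# Usage, with call_llm standing in for whatever client you actually use:
# prompt = build_evaluator_prompt(my_rubric_text, user_question, task_llm_output)
# raw_scores = call_llm(prompt)
```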

Stephanie Kirmer [00:03:10]: Okay. So a few notes to remember, though: your LLM can't do everything. It's not magic. We know that there are certain areas where LLMs are maybe not the best evaluator tools. Truthfulness and accuracy can be a little wishy-washy, arithmetic obviously, things like that. Keep in mind when you're designing your rubric whether or not this is an LLM-appropriate task. And then also keep in mind this is machine learning. We're in the machine learning space, and probability is never going to be 100% right.

Stephanie Kirmer [00:03:41]: We need to accept a risk of error, and we need to sit with that and be okay with that too. So how do we create a rubric? What are we doing here? How do we define an acceptable response? That's the real question at the core of creating a rubric. Do we understand the problem that our LLM is trying to solve, and do we know what a good answer would look like? You need to really know your task at a pretty deep level to do that. You also need to have a sense of what matters and what doesn't. If I'm grading an essay and I really need to know whether the student understands how to read a research paper, grammar is important for being understood, but it may not be the big thing here. There are certain things that you might decide not to prioritize in your rubric because they're not really the most important part of understanding whether your task LLM is getting the job done. Also, I recommend thinking from the end user's perspective.

Stephanie Kirmer [00:04:43]: Don't be afraid to get consumer feedback or customer feedback. Do user research. Because if your responses are going to be seen by people, not just passed along as part of an agent chain, you need to know what those people are looking for in a successful answer too. That can be a really useful way to go about it. So we'll figure out what a good answer looks like, and we're going to spend some time on that. Then you break down that good answer into measurable components. What are the traits of that good answer that we care about? Then you'll describe different levels of performance.

Stephanie Kirmer [00:05:15]: So what does it look like when it really, really stinks, and what does it look like when it's really, really good? And then you add point values to each level. I'll look at an example in a second so we understand what that means. Then each component's numeric score can be combined, averaged, or summed, or whatever you want to do, to get an overall score. Here's a sample. Let's talk about it. First, we're defining the goal. We have a project. We know what the goal is.

Stephanie Kirmer [00:05:40]: The LLM should be providing friendly, professional responses to users asking for product details. Okay, so there are four themes here that I've assigned for this rubric. It has to be friendly, it needs to be professional, it needs to be complete, and it needs to be on topic. I'm just breaking down number three because I think this one's kind of an interesting one: complete. Does the response thoroughly address the question that was asked? Zero means no: no answer to the question, nothing addressing the question at all.

Stephanie Kirmer [00:06:08]: It might as well have generated an error. So that's zero points on our scorecard. If it gets some of the answer, it answered a little bit but there are big holes, that's one point. A response that answered most of the question with a few gaps, but it's not the end of the world, it's a decent answer, we'll give two points. And then three points is perfect: it completely addressed the entire question.

Stephanie Kirmer [00:06:30]: Great job. I would say two is maybe a passing grade here, and three is like an A. Now, think about this: can you, as the human person in this context, clearly and convincingly decide where you would grade a response? If you look at the rubric that you've designed and you're really not sure whether an answer is a one or a two, and that will happen, you need to think about what the breakdown is and maybe add more description, more clarity, to the different values so that you can really decide the difference between a one and a two, or a two and a three.

Stephanie Kirmer [00:07:04]: And that's important because we need to be human-validating whatever we're asking the LLM to do, and I'm going to talk about that in a second. Also consider that the points don't have to be ordinal. You don't have to go 0, 1, 2, 3. You could go 3, 10, 20, whatever point scores you want, adjusting them to reflect the priority you place on that quality. The highest point values should be on the most important things. That's my recommendation. So then we've got a rubric.
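A minimal sketch of how the sample rubric might be encoded and combined into an overall score, with uneven weights reflecting priorities; the specific max points and weights here are illustrative choices, not the talk's exact numbers.

```python
# Sketch of the sample rubric as a data structure. Component names follow
# the example in the talk; max points and weights are illustrative choices.

RUBRIC = {
    "friendly":     {"max_points": 3, "weight": 1.0},
    "professional": {"max_points": 3, "weight": 1.0},
    "complete":     {"max_points": 3, "weight": 2.0},  # weighted up: most important
    "on_topic":     {"max_points": 3, "weight": 2.0},
}

def overall_score(component_scores: dict) -> float:
    """Combine per-component scores into a single quality score in [0, 1]."""
    earned = sum(
        RUBRIC[name]["weight"] * score
        for name, score in component_scores.items()
    )
    possible = sum(c["weight"] * c["max_points"] for c in RUBRIC.values())
    return earned / possible

# Example: complete and on topic, but a little stiff on friendliness.
print(overall_score({"friendly": 2, "professional": 3, "complete": 3, "on_topic": 3}))
```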

Stephanie Kirmer [00:07:35]: Awesome, fabulous. But we're not done yet. We need to validate and calibrate to make sure that this rubric is going to do the job and that our evaluator LLM can actually work with it. So this is the process in a nutshell. First you need to create validation sample texts, like whatever your task LLM is supposed to be spitting out. You want to get some examples that are similar to that, of varying quality. And this is key: you need some really bad ones, you need some stinkers in there.

Stephanie Kirmer [00:08:01]: You need some that are bad on different criteria, right? Like one's rude, one's incomplete but very polite, that kind of thing. You can use the task LLM to generate samples if you want to, but you don't have to. You can write them yourself, whatever is going to get you things that are close enough to resembling production data, because this is your validation data. Then you go grade them yourself first. You get your pen and paper and your rubric, and you give them their point scores.

Stephanie Kirmer [00:08:27]: And it'll take a little bit of time, but it's worth it, because you're trying to calibrate your LLM's judgment to your judgment so that you can use it like a TA: you can have it do the grading for you so you don't have to. Think about that rubric, and think about whether you hit fuzzy spots where you're asking, is it a one or a two or a three? Because you need to be able to make sure that that's very clear for the LLM. Then go ask the evaluator LLM. An out-of-the-box LLM is fine; it doesn't need to be pre-trained or tuned by you or anything like that. But ask it to grade all the same cases with the same rubric. I love DeepEval's interface for this myself.

Stephanie Kirmer [00:09:04]: I like that framework for the actual code infrastructure, but there are also some other ones, and I've got some links at the end. So then you've got your grades and you've got the LLM's grades. Analyze the differences. Where did it give a three when you gave a one, or something like that? What you want to do with that is the cycle I've shown on the slide here: adjust the prompt to the evaluator LLM to get it closer to your scores. Don't change your scores, because your scores are the ground truth now. But you want to get your prompt engineering organized so that your rubric gets your LLM to the same place.
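To make the analyze-the-differences step concrete, here is a small sketch that compares your hand grades to the evaluator LLM's grades on the validation samples; the data layout is an assumption, and how you obtain `llm_grades` depends on whichever evaluation framework you use.

```python
# Sketch: checking how closely the evaluator LLM's grades track your own.
# human_grades and llm_grades both map sample_id -> {component: score};
# producing llm_grades is left to whatever evaluator call you use.

from collections import defaultdict

def agreement_report(human_grades: dict, llm_grades: dict) -> dict:
    """Per-component exact-match rate and mean absolute score difference."""
    stats = defaultdict(lambda: {"n": 0, "exact": 0, "abs_diff": 0.0})
    for sample_id, human in human_grades.items():
        llm = llm_grades[sample_id]
        for component, human_score in human.items():
            s = stats[component]
            s["n"] += 1
            s["exact"] += int(human_score == llm[component])
            s["abs_diff"] += abs(human_score - llm[component])
    return {
        component: {
            "exact_match_rate": s["exact"] / s["n"],
            "mean_abs_diff": s["abs_diff"] / s["n"],
        }
        for component, s in stats.items()
    }

# Components with low exact-match rates point at where the rubric wording or
# the evaluator prompt needs another iteration; your scores stay the ground truth.
```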

Stephanie Kirmer [00:09:40]: So you can work on your prompt engineering for the evaluator LLM, you can work on your rubric, and eventually you'll get to the point where you've got an LLM whose evaluations you feel confident you can trust. So now we can put it in production; we've got it aligned with our manual grading. This LLM can basically perform the task that you would otherwise have to perform yourself, and you have pretty confident trust in it. I recommend one more calibration step here. It's not mandatory, but I recommend it: take your task LLM, the one doing the job we started out with, and get it to generate some more samples, some fresh samples. You and the evaluator LLM both review them and check your calibration. Make sure that you're all working in lockstep.

Stephanie Kirmer [00:10:25]: This is what you would do if you had two TAs for the same class. You would ask them both to grade the same papers and then make sure that they're grading the same way, because you want everyone to get a relatively objective grading experience as students in the class. Now you can keep monitoring your task LLM throughout, because you've got an evaluator LLM that can continually score the output and give you numeric, quantifiable scores. You can monitor the performance of your task LLM in production. You can check and see if it's drifting. If there are some weird outliers, it'll give you a little red flag that you can keep an eye on. You can use spot checking or just sampling if cost is a concern, because running LLMs isn't free. But that's what I personally recommend for the next step: keep using this regularly, recalibrate it a little bit yourself over time, and check that evaluator LLM.

Stephanie Kirmer [00:11:20]: Make sure it's still doing what you think it should be doing. And then there are a few bonus options. One, you can use those evaluator scores to go back to your task LLM's prompt engineering. You can basically use the evaluator to help work on the prompt engineering for the task LLM, because now you've got an easy way to get evaluations of the task LLM's outputs, almost like test cases. You can also use those scores as a quality gate to prevent low-quality responses from getting to the end user, or getting further through your agentic pipeline, or whatever you're working on. And you can even use reinforcement learning from AI feedback with your evaluator LLM and your task LLM; that's an option you have as well. That's way past the scope of today's talk, but it's something this sets you up with the potential to do next. I had limited time today, so that's basically what I had to present.
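A rough sketch of the quality-gate and spot-check ideas mentioned above; the threshold, sampling rate, and the `evaluate` and `log_alert` callables are assumptions standing in for your own scoring call and alerting, not values prescribed in the talk.

```python
# Sketch: evaluator scores as a quality gate plus spot-check monitoring.
# The threshold, sampling rate, and the evaluate()/log_alert() callables
# are illustrative assumptions.

import random
from typing import Callable, Optional

QUALITY_THRESHOLD = 0.7   # overall rubric score below this is blocked
SPOT_CHECK_RATE = 0.05    # evaluate ~5% of production traffic to limit cost

def gate_response(question: str, response: str,
                  evaluate: Callable[[str, str], float]) -> Optional[str]:
    """Return the response only if the evaluator scores it above the threshold."""
    score = evaluate(question, response)   # overall rubric score in [0, 1]
    if score < QUALITY_THRESHOLD:
        return None   # fall back, retry, or escalate to a human instead
    return response

def maybe_spot_check(question: str, response: str,
                     evaluate: Callable[[str, str], float],
                     log_alert: Callable[[dict], None]) -> None:
    """Score a random sample of production traffic and flag low outliers."""
    if random.random() < SPOT_CHECK_RATE:
        score = evaluate(question, response)
        if score < QUALITY_THRESHOLD:
            log_alert({"question": question, "score": score})
```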

Stephanie Kirmer [00:12:15]: But I wanted to also throw in a few extra links, just in case you're interested. A few frameworks you can try: DeepEval or LangSmith. Either way, you can use custom evaluators, which is the thing you're going to need in order to build your own evaluation rubric into the pipeline. There's a really nice collected set of papers on custom LLM evaluators on Hugging Face that you can go visit. And DataCamp has a nice little overview of reinforcement learning from AI feedback. So that's everything I wanted to talk about today. I'm really excited to have been able to present it to you, and I hope this has been helpful. Thank you.

Skylar Payne [00:12:56]: Yeah, that was an action-packed, dense 10 minutes. So awesome. I think we have time for maybe one question. I don't see any in the chat, but I definitely had one myself. I don't know if you're familiar with Shreya Shankar's "Who Validates the Validators" paper, but one of the things she has been saying a lot is that when trying to define a rubric, you have to look at your data first, rather than trying to come up with one beforehand. So I'm just curious what your thoughts are on that.

Stephanie Kirmer [00:13:31]: Yeah, I think the rubric needs to be as aligned to the task goal as possible, because it is possible that you end up developing a rubric that can never be met by the LLM or the underlying data that you're working with. You might get to the point where the level of quality you define as success can't be achieved by the methodology you're using, and that can be a problem, of course. But if possible, ideally, I think the best way to go about it is to find a way for your evaluation to be calibrated to the task rather than the underlying data, because we always get to that question: are we measuring it because we can measure it, or measuring it because it matters? In this case I think it's important to go with measuring what matters and what is actually going to make a big difference for the project, because you might just find out that an LLM isn't the right tool for this problem, and that's totally something you might discover.

Skylar Payne [00:14:37]: Totally awesome. Well, thank you so much. We're going to roll directly into the next talk. Love this. Thank you so much. We'll say bye-bye to Stephanie.
