MLOps Community

EDD: The Science of Improving AI Agents // Shahul Elavakkattil Shereef // Agents in Production 2025

Posted Aug 06, 2025 | Views 33
# Agents in Production
# EDD
# AI Agents
# Ragas

SPEAKER

Shahul Elavakkattil Shereef
Co-founder & CTO @ Ragas

I’m Shahul, one of the founders of Ragas. Before building Ragas, I worked in applied ML; I loved competing on Kaggle (Kaggle Grandmaster, best rank #18) and contributing to and leading OSS initiatives like OpenAssistant. I’m also a science enthusiast and love reading physics and philosophy in my free time.


SUMMARY

If you’re an engineer building AI agents, you probably know how hard it is to improve them consistently. But I think it’s not that hard if you have the right mental framework for the problem. That framework is Eval-Driven Development, a fancy name for applying the scientific method to building ML systems. Fundamentally, it’s about iterating on ML systems using science (EDD) rather than art (vibe checks). In this session, we’ll explore how to use experimentation and evaluation to improve any AI agent consistently. We’ll also learn how to use LLMs as effective proxies for human judgment (evals), build a data flywheel for improving their alignment, choose the right metrics, and set up feedback loops from production to identify and improve long-tail scenarios.


TRANSCRIPT

Shahul Elavakkattil Shereef [00:00:11]: Hey guys. In this session I'm going to talk about eval-driven development for AI agents. I'm Shahul, one of the co-founders of Ragas, which is an open source tool for evaluating AI applications. In this talk I want to walk you through what EDD is, what quantitative evals and qualitative evals are, experimentation for improving AI agents, a little bit of detail about the upcoming release of Ragas, and then an ask-me-anything session. So let's get started. First: eval-driven development. It's a very common term, talked about a lot but practiced much less when it comes to building AI agents and AI applications. So I just wanted to start by sharing why it's actually better than vibe-based iteration.

Shahul Elavakkattil Shereef [00:01:19]: Vibe-based iteration is when you just try stuff, change things around, eyeball a few examples, and get a feeling about whether it's better or not, and you iterate based on that feeling. Eval-driven development, on the other hand, follows the proper scientific method: you hypothesize, test, validate, and then act on that. When it comes to building scalable AI agents, when you have multiple team members and multiple stakeholders, vibe-based iteration often fails because different people's vibes can be very different. With eval-driven development, since you have a definite north-star goal and metrics, and a definite agreement on what is good behavior and what is bad behavior, you can formulate a hypothesis, test it by running experiments, validate it, and then act based on it.

Shahul Elavakkattil Shereef [00:02:29]: Coming to agent evals, there are two parts: one is quantitative and the other is qualitative. The quantitative aspects of evals are much talked about, like metrics, cost, and latency, and we'll cover those, including the factors that are very specific to agent evals. But the qualitative aspects of agent evaluation are not given enough importance, even though many aspects of AI system development are actually qualitative. We'll also delve into techniques like error analysis and attribution analysis that can be widely used when developing and improving AI agents. So let's start with the quantitative evals.

Shahul Elavakkattil Shereef [00:03:28]: Quantitative evals for AI agents mostly mean four factors. First, metrics to quantify the performance of your agent. Second, the consistency of the agent. Consistency can vary from agent to agent; if an agent is less consistent, only behaving consistently across a few iterations, it's hard to put that agent into production. Third is cost, and the fourth factor is latency. Cost and latency are common to any system, whether AI or non-AI.

Shahul Elavakkattil Shereef [00:04:12]: And these factors also affect agent development and iteration. So let's get into the metrics part of a agent development. So in a when developing a agents you the metrics are used to quantify the performance of AA agent. So the first factor when it comes to starting or starting to think about what metrics to choose for developing a agent is to start I would recommend to start with end to end evaluation metrics. Why? Because end to end evaluation metrics quantify the performance of the system as a whole and this is what ultimately you are trying to optimize against. So if you are developing something like an SWE agent or test to SQL agent or a deep research agent, what your users ultimately care about is the end to end performance of the system rather than the performance of your individual components. So that is also one good reason to start with end to end evaluation. It is the closest proxy to the satisfaction of the user that is consuming your system.

Shahul Elavakkattil Shereef [00:05:22]: That's why I always recommend starting with end-to-end evaluation. The second part is picking one or two strong proxies for success over multiple weak proxies. This is a common error pattern I have observed while working with multiple teams on evaluation. Teams usually pick multiple weak proxies to quantify the performance of their agents. By weak proxies I mean metrics like helpfulness, tone, coherence, relevance, etc., which do not give direct insight into the user's satisfaction while interacting with your agent. These are all meta or inferred signals that carry less value compared to something like an agent's goal accuracy.

Shahul Elavakkattil Shereef [00:06:26]: So when a user interacts with an agentic system, the user cares a lot about is the agent actually using the correct or actually does the actually did the agent actually achieve the right goal that was actually asked by the user to do or did not. So the these are the strong proxies that we can use to measure the performance of the system rather than weak proxies like helpfulness, coherence, relevance, etc. So in my experience what we have seen is that using few strong proxies or strong matrixes that directly correlates with your users experience of the system is much more practical and important than using or having multiple weak proxies like helpfulness, coherence, relevance, etc. Another important point to mention here is always lane when you're using LLMs in many times when you are when you are evaluating agentic systems like deep research agents or agents that or conversational agents, we tend to use LL messages just to evaluate the performance or correctness of the system. Even with ground truth, LLM messages should be aligned with human annotation in a human reviewers in a way that LLMs has a higher correlation with human judgment. So that would normally involve some kind of prompt tuning and additional few short optimization tuning so that the LL messages understands what is considered as right and what what is considered as wrong. So the whole idea of aligning LLMs just when used is is to teach the LLM the the decision boundary that a human annotator has in his mind. So this can be a process to run.

Shahul Elavakkattil Shereef [00:08:30]: You can iterate on the prompts, you can even use another LLM to iterate on the prompt, and you can also feed in a few examples, such as bootstrapped or few-shot retrieved examples, so that your LLM judges are much more effective than vanilla LLM judges, which can go wrong in many situations. Those are the important things when it comes to using metrics for evaluating AI agents. Here are some examples of strong proxy metrics you can use in different use cases. For a text-to-SQL agent, execution accuracy can be treated as a strong proxy, because whatever SQL or code the system produces at the end needs to execute in a way that gives the same output as the ground-truth SQL written by the domain expert. This is a strong proxy compared to a weak proxy like SQL semantic or syntactic correctness, which is only the baseline for what is expected: SQL produced by an agent can be syntactically correct but still give completely invalid results. A strong proxy is much more indicative of user satisfaction. Likewise, for a deep research agent, an LLM judge with proper grading notes and rubrics can be used to score the final reports generated by the research agent.
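
As a rough illustration of the execution-accuracy metric described above, here is a minimal sketch assuming a SQLite connection; comparing result sets as unordered multisets is an illustrative choice, not a requirement.

```python
# A minimal sketch of execution accuracy for a text-to-SQL agent: the generated
# SQL passes only if it returns the same rows as the ground-truth SQL written
# by a domain expert. Assumes a SQLite connection; row order is ignored here.
import sqlite3

def execution_accuracy(conn: sqlite3.Connection,
                       generated_sql: str, ground_truth_sql: str) -> bool:
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a failure
    expected = conn.execute(ground_truth_sql).fetchall()
    # Compare as multisets so differing row order does not cause a false failure.
    return sorted(map(tuple, got)) == sorted(map(tuple, expected))
```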

Shahul Elavakkattil Shereef [00:10:26]: For a conversational AI agent, goal success rate is a strong proxy. Whenever a user interacts with a conversational agent, it is assumed the user has a goal to satisfy. With goal success rate we are checking whether that goal was satisfied while interacting with the AI agent. If the goal was not satisfied, that is by definition a failure case of the AI agent, which makes it a strong proxy. So now we have covered some important aspects of quantitative evaluation: metrics, how to choose the right metrics, and some examples of good metrics. The second thing is consistency. Consistency can be computed easily by running the same agent on the same task multiple times and measuring how consistent the results are across independent runs. If you have natural language answers, consistency can be measured either with an LLM judge or with something like semantic similarity.
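
Here is a minimal sketch of the consistency check described above, using average pairwise cosine similarity over repeated runs; `run_agent` and `embed` are hypothetical stand-ins for your agent call and any sentence-embedding model.

```python
# Minimal consistency sketch: run the agent on the same task several times and
# score how similar the answers are to each other.
from itertools import combinations
from typing import Callable, List
import math

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def consistency(run_agent: Callable[[str], str],
                embed: Callable[[str], List[float]],
                task: str, n_runs: int = 5) -> float:
    answers = [run_agent(task) for _ in range(n_runs)]
    vectors = [embed(a) for a in answers]
    pairs = list(combinations(vectors, 2))
    # Average pairwise similarity: close to 1.0 means the agent is consistent.
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```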

Shahul Elavakkattil Shereef [00:11:47]: When it comes to cost and latency, these are well defined and well understood metrics, so I'm not diving into them in more detail here. Qualitative evaluation is a much overlooked part of AI agent evaluation, and in many cases that is what prevents teams from actually iterating. Why is qualitative evaluation even required? Metrics can only tell you what the system's performance is; they do not explain why it failed. Consider a case where you have an SWE agent, a text-to-SQL agent, or a deep research agent, and your end-to-end evaluation metric shows that the end-to-end performance of your system is 80 percent. That means out of 100 samples, you have about 20 samples that fail your end-to-end evaluation criteria.

Shahul Elavakkattil Shereef [00:12:53]: So a quantitative metrics or end to end evaluation will only tell you that there are 20% samples that fail. It does not tell you why it failed or how you can actually improve your system that it doesn't fail the next time you run. Right? So the idea in qualitative evaluation is to convert these 20 failed samples into actionable insights. So let's. There are two techniques I'll cover in this presentation for you to do in qualitative evaluation and get those actionable out of evaluation process. So the first process is error analysis. In error analysis you identify what went wrong in the system's output. So if you have a deep research agent that produces a report of maybe 20 pages, it may have run for like half an hour.

Shahul Elavakkattil Shereef [00:13:51]: You can look at the, look at the, look at the final response and note down what actually went wrong in the final system output. So it could be something like the output does not have enough depth. The research, the deep research seems to be vague in terms of knowledge. The second step is attribution analysis. An agentic system runs, usually runs in loops and is autonomous. That means that it, it will take multiple steps before it actually gets to the final output. So the challenge here is when debugging a agent is that even though you know what went wrong, you wouldn't really know what part of the system contributed to that particular error. So let's say in a deep research agent I could look at the final output and write down that the output is weak or does not satisfy my criteria.

Shahul Elavakkattil Shereef [00:14:54]: But I would be very hard for me to actually attribute those errors to a step or a series of steps or a reasoning change that caused it. Right? So that is what attribution analysis means. So if you do attribution analysis, you'll understand that. We'll understand which part of the system is actually, which part of the system is actually causing the error in, in your agent. So let's get some more details into it. So in the error analysis step, the two steps is that you label the failures. So you look at the 20 samples or n samples that have failed your evaluation run and you can do a, do a manual data inspection to see and note down hypothesis like incorrect answers, reasoning mistake, tool misuses, hallucination, etc. And also you can also identify which step the failure occurred.

Shahul Elavakkattil Shereef [00:15:58]: The failure might have occurred in a tool calling step or the field might have occurred in the reasoning step or anything. So that's what you do in the error analysis part. The second thing is attribution analysis. In this, not only you analyze which component contributed to the false step rather than why it actually happened. In this case, let's say you have an agent that gave incorrect results or hallucinated. Now your job is to understand which step in the agent loop iteration actually caused this particular, particular hallucination. Did it come out of some faulty memory mechanism or did it come out of some faulty tool usage or is it because of some kind of back retrieval? So this is attribution analysis. You are trying to attribute a particular error in the system to a set of a step or a set of steps in the agent iteration.

Shahul Elavakkattil Shereef [00:17:05]: I also have a simple example prepared for this. This is a simple agent trace viewer that I use for developing and viewing agent traces. I have a simple math agent loaded here, and a user query asking it to calculate how much it costs to ship a 50 lb package from New York to California. You can see that I have traced each individual step of the agent. At the end, using my end-to-end evaluation, I was able to see that the final answer was incorrect: the expected answer was somewhere around 5,000, but the final answer was around 6,250. Using an end-to-end evaluation metric I was able to see that this particular sample failed my evaluation. But as I said before, that does not tell you where it failed or why it failed.

Shahul Elavakkattil Shereef [00:18:06]: Now I want to understand both these. Then I look at. I use the error analysis step first to actually understand the cause of failure which is here. For example you could see that it first went down and used the math tool calculate and it passed the wrong argument in terms of distance. So I can note down that it I. I have here nodded that the agent used Mac tool instead of shipping lookup tool. And now I am going one step ahead and trying to find out why what in the system or in the agent caused it to do this particular particular error. Right? So if I look further or do some more analysis to understand where does this come from, I could understand that the word calculate actually triggered the math calculate function in this age.

Shahul Elavakkattil Shereef [00:19:04]: In. In this particular LLM age. So when the user asked to calculate, it thought that it should use calculate function instead of actually using shipping tool. So I can also note down the particular error associated attribution associated with this error. So next time I trade I could understand the higher level patterns of errors that are. That are caused by different reasons in building agents. So I'll get back to my slide again. So these are steps that you can actually do to evaluate and improve your agents.

Shahul Elavakkattil Shereef [00:19:53]: Just to recap: the first thing is to have quantitative evals, with a set of one or two metrics that are strong proxies for the end-to-end performance of your system. The second thing is to have a qualitative evaluation pipeline with two steps, error analysis and attribution analysis, to understand exactly where your agent is going wrong. If you do this process correctly, with your custom viewer or whatever metrics you have, and you also have a good experimentation pipeline, it's possible and much faster to iterate on an agent using eval-driven development than to rely on purely vibe-check-based iteration. Here I also wanted to give a small update on the next release of Ragas, which is centered around the ideas of datasets, experiments, and metrics. The idea is that different teams use different ways to organize and run their experiments.

Shahul Elavakkattil Shereef [00:21:15]: Some teams uses Google Drive, some teams uses both. Some teams uses local file system to organize and run their experiments. So Ragas now supports running and orchestration of these experiments on different backends. You can configure data sets in anywhere and have an evaluation script associated with any of your project that can be run anytime you make changes or iterate on your system. So whenever you iterate on a prom or a component of your A application you can run evals locally or if you can hook the script into some cloud, you can trigger it from the cloud and it will run the evals associated with that particular project by fetching the data set, running the experiment, running the whole evaluation group and and storing the results of the experiment as you have configured in any of these file systems like local Google Drive, box or et cetera, so that the other team members involved in developing the AI agents can also view the results of each successive iterations. It also is based on the idea of custom metrics. So now as a systems evolve we have seen that more successful metrics are built custom by teams and there are heavyweight items associated with metrics like aligning and message of just having it having sortable and retrievable and reusable across projects and everything. So those parts are the ones that drag us actually handles.

Shahul Elavakkattil Shereef [00:22:54]: And yeah, you can check out the new docs in experimental and I will open up the session for any questions that are there.

Skylar Payne [00:23:05]: Awesome, thanks. We're a little tight on time right now in the schedule, but we do have time for one question that I'll pull from the chat. Ricardo asked have you ever faced any scenarios that couldn't be solved through this method or ones that it was really hard to apply it to? What did you end up doing?

Shahul Elavakkattil Shereef [00:23:23]: Yes, even though theoretically rinals qualitative analysis like error analysis and attribution analysis, even in the examples that I showed is easier. But with sophisticated agents like browser agents, you are looking at traces that are of hundreds of steps. So even though you may know where exactly is the error, it's very very hard to attribute that error to a particular step step in agents, recent chain or any particular step that has happened before that. So it can be very cucumbersome and I have sometimes even taken up to close to 30 minutes to just find or find or do attribution analysis for browser usage etc which usually runs for multiple, you know, close to 10 minutes and stuff and produces like hundreds of loads sometimes. So so it can be very boring process sometimes because sometimes it's very hard to look through and debug and attribute errors to particular steps in agents reasoning chain whenever there are longer running reasoning chains. So those process are something that we are also trying to make easier.

Skylar Payne [00:24:44]: Awesome. We have one more question that came in and it was actually a question I was thinking of during the chat. What are the you. You had a bullet point to say always align your LLM as a judge. And somebody asked what are the best practices in aligning an LLM as a judge?

Shahul Elavakkattil Shereef [00:25:01]: Absolutely. So the the what has worked for me multiple times is that for either you can review the results of the so you don't really trust the LL messengers first time you prompt it and run it and you just don't trust it like that. So what have what? The process that I follow is that whenever I configure an SNL message I make sure that I write the prompt for the LLMs usage. Then I run it over my test data, I review the results and I make sure that after looking through the results I make some corrections if required. And I have built a basic pipeline where the corrections that I make gets feedback to the LLM messages so that it aligns with what I what I think is, you know, right or what I think is wrong. So two things. First thing is optimize the prompt either manually using or using some optimization method. Second thing is have examples where LR message judges initially goes wrong in the few short few short in the prompt with few short examples.

Shahul Elavakkattil Shereef [00:26:07]: Or you can even go and try out something like a little more fancy things like you know, retrieval dynamic retrieval for lms which in turn improves its alignment with human judges. So at least one one time review the results of LLMs judge, have a simple mechanism to feed that resource of review to your LLMs judge and further align it.

Skylar Payne [00:26:29]: Thank you so much for your time. This is very information dense. Really appreciate you sharing your knowledge and with that we'll say goodbye.
