MLOps Community

Driving Evaluation-Driven Development with MLflow 3.0 // Yuki Watanabe // Agents in Production 2025

Posted Jul 23, 2025 | Views 85
# Agents in Production
# Databricks
# MLFLOW

Yuki Watanabe
Sr. Software Engineer @ Databricks

Yuki Watanabe is an experienced backend developer specializing in machine learning systems and scalable web services. As part of the Databricks team in Tokyo, Yuki designs and implements robust data pipelines and ML infrastructure that power intelligent analytics platforms.

With over three years of hands-on experience, Yuki has led full-stack backend projects that integrate complex ML models, optimize data workflows, and ensure high-performance delivery in production environments. His work bridges the gap between data science and engineering, enabling seamless deployment of AI-driven features at scale.


SUMMARY

Quality is the top barrier preventing Agentic applications from reaching production. This talk introduces Evaluation-Driven Development, a methodology that uses evaluation as the cornerstone for building high-quality, reliable Agentic systems. We will demonstrate how to drive it with MLflow 3.0, a new generation of the popular MLOps platform redesigned for the LLM era, including one-line observability, automatic evaluation, human-in-the-loop feedback loops, and monitoring.

Huge shout out to our sponsors @Databricks


TRANSCRIPT

Demetrios [00:00:00]: [Music]

Yuki Watanabe [00:00:08]: Hey guys, I'm Yuki Watanabe. I'm an engineer at Databricks, the data company, and I'm one of the full-time core maintainers of MLflow, which is an MLOps tool. Today I'm going to introduce evaluation-driven development as a technique for building your agent with confidence and delivering it confidently into production, and as a tool I want to introduce MLflow 3.0. So starting from there, my question: what is the number one problem with agents? We're handling a lot of problems in this domain; we're still dealing with a lot of challenges. But what is the number one thing? The short answer is quality. We did a big survey last year across ML practitioners, data scientists, and agent builders, and the number one blocker for organizations launching agents in production was actually quality. It's quite challenging to make sure your LLM or agent actually behaves properly, because LLMs are non-deterministic and very unpredictable.

Yuki Watanabe [00:01:25]: Also, they accept free-form user input for most applications. That has so much freedom compared to the numeric tensors we were dealing with before in deep learning or traditional machine learning. Quality also has many aspects. It's not a single metric like accuracy; there are many different aspects you need to assess before launching your agents into production. And there are a lot of moving pieces in a single agent application, such as the LLM and the search tools, so whenever you want to update something, it can break your agent. One great solution is emerging in the industry, which is called evaluation-driven development. Conceptually it's very similar to test-driven development, which is a software engineering technique that anchors your development with tests.

Yuki Watanabe [00:02:20]: Similarly, evaluation-driven development uses evaluation as the cornerstone of your agent building. There are actually five pillars in this process, starting from data collection, because good evaluation always requires good data. Then there is building: you need to write some code, or maybe use some tools, to develop a new agent and apply some new method, and that often requires a lot of trial and error. Then, once you have a working agent, you get to evaluate it manually. Having a human check is still a very important step to ensure the quality bar, but it's time consuming.

Yuki Watanabe [00:02:59]: The next step is automatic evaluation: you want to automate it so that you can scale it to many versions or many agents. And finally, once you get confident, you deploy it to production and the monitoring starts. A machine learning project should never end at the first launch, so we need to continuously monitor application performance, detect any regression or concept drift, and improve the system. With that paradigm in mind, we just released MLflow 3, which is a redesigned version of the entire MLOps platform so that it can work with both classical ML and LLM or GenAI systems. And the fundamental tool we use for this evaluation-driven development paradigm is tracing. A lot of today's sessions actually talk about tracing.

Yuki Watanabe [00:04:03]: So I think it's kind of obvious that it's important, but let me just briefly introduce it again; hopefully it's not too difficult. Tracing is an observability solution for your LLMs, models, and agents. Fundamentally, it captures various information about your agent execution, including intermediate steps, so you can think of it as a type of structured log. When we talk about traces in the LLM context, they are mostly tailored for GenAI debugging, evaluation, and monitoring. For instance, this picture is a trace generated for a single LangGraph agent execution.

Yuki Watanabe [00:04:52]: You can see the trace visualizes the execution as a tree in the left-side view, and the tree includes many steps such as LLM calls, tool calls, or other internal LangChain processes. On the right panel you can see the input and output captured for the selected step. In this case it's an LLM call, so you can also see metadata like the model name or the available tools. By looking at those intermediate steps you can easily understand and debug your agent execution, as well as use that data for evaluation and monitoring. Since tracing is such a foundation of evaluation-driven development, we try to make it very easy to add tracing to your agent or stack. We provide one-line automatic tracing integrations for over 20 different LLM providers and frameworks, such as OpenAI, LangChain, DSPy, Pydantic AI, or Bedrock. There are so many great libraries out there, and for those libraries you just need to add mlflow.<library>.autolog; for example, if it's LangGraph, it is mlflow.langchain.autolog. MLflow automatically inspects your code execution and generates traces like the previous example.
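
As a concrete illustration of the one-line autolog integration described above, here is a minimal sketch; the experiment name and the commented-out agent invocation are placeholders, not part of the talk:

```python
import mlflow

# LangGraph agents are traced through the LangChain integration.
mlflow.langchain.autolog()

# Optional: choose where the traces are logged.
mlflow.set_experiment("agent-dev")

# ... build your LangGraph agent as usual, then invoke it, e.g.:
# result = graph.invoke({"messages": [("user", "What's the weather in Tokyo?")]})
# Each invocation now produces a trace with the LLM calls, tool calls, and
# intermediate steps, visible in the MLflow UI or inline in a notebook cell.
```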

Yuki Watanabe [00:06:06]: The generated traces conform to the standard called OpenTelemetry. I think the previous talk also talked a lot about OpenTelemetry, but OpenTelemetry is the number one industry-standard specification for observability. Since we conform to this standard, you can use MLflow traces not only with the MLflow backend but also with other observability platforms like Datadog, Grafana, New Relic, even Langfuse, and more. This significantly reduces vendor lock-in and it's very, very open. All right, so this is the intro to tracing. Let's get into how we use it for evaluation-driven development. I started with the building step for the sake of easy slide organization.
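
Because the traces follow the OpenTelemetry spec, routing them to another backend is mostly configuration. A hedged sketch, assuming your MLflow version honors the standard OTLP environment variables; the collector endpoint is a placeholder:

```python
import os
import mlflow

# Point the OTLP exporter at your own OpenTelemetry collector, which can then
# forward the spans to Datadog, Grafana, New Relic, and so on.
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://localhost:4317"

# Traces generated by autolog (or @mlflow.trace) are emitted as OTel spans.
mlflow.openai.autolog()
```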

Yuki Watanabe [00:06:53]: But the actual order is not super important, because it's a cycle. During the building phase, the primary work is to write code to define your agent and iterate on it until you get some working result. While there are many great frameworks that let you define your agent with a very simple and easy syntax, doing anything more than a tutorial always requires trial and error. Tracing really helps with debugging errors or issues during this process. When you enable auto tracing in a notebook, or manually instrument your code, MLflow displays a trace within the notebook, right in the cell output. So you can see what happens inside a framework call in the cell and what data is passed between steps, and you can easily trace back any exceptions from the agent, or catch issues in the answer, and quickly iterate. After you build a good agent, the next step is evaluating it. It can be manual evaluation by yourself or maybe asking for feedback from others.
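
For code that no autolog integration covers, manual instrumentation is a one-decorator affair. A small sketch; the functions and their logic are invented for illustration:

```python
import mlflow

# Each decorated function becomes a span in the trace tree, with its inputs,
# outputs, and any exception captured automatically.
@mlflow.trace
def lookup_docs(query: str) -> list[str]:
    # Placeholder retrieval logic.
    return [f"doc about {query}"]

@mlflow.trace(span_type="AGENT")
def answer(question: str) -> str:
    docs = lookup_docs(question)
    return f"Based on {len(docs)} document(s): ..."

answer("How do I enable MLflow tracing?")
# In a notebook, the resulting trace renders inline in the cell output.
```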

Yuki Watanabe [00:08:04]: However, managing that feedback is tricky. You might need to manage a bunch of spreadsheets with feedback from different people, for different criteria, and for different versions of your agents. To make this simple, MLflow supports annotating feedback on traces directly. On the right part of the trace UI you can see the Assessments section. Here we can put various feedback about the agent: for example, whether the output is correct, whether it matches the expected context, whether it calls the correct tools, and things like that. This feedback is then persisted on the traces with metadata such as who added the feedback, what the rationale behind it is, and what the history of that feedback is.
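
Feedback can also be attached to a trace programmatically rather than through the UI; a sketch under the assumption that the MLflow 3 feedback API looks roughly like this (the trace ID, reviewer, and rationale are placeholders):

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

# Attach a human assessment to an existing trace; values are illustrative.
mlflow.log_feedback(
    trace_id="tr-1234567890abcdef",
    name="correctness",
    value=False,
    rationale="The agent cited the wrong document for the refund policy.",
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id="reviewer@example.com",
    ),
)
```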

Yuki Watanabe [00:08:54]: We can also override feedback to correct improper feedback. Since feedback often comes from multiple people, and even a single person's sense of what counts as good changes over time, this metadata helps a lot to keep track of how and why the feedback was generated. Also, sometimes you're not an expert in the domain, so you need to engage people who are familiar enough with the domain to get feedback. But the thing is, those people are not necessarily MLflow users. They don't have any business with an MLOps tool like MLflow; they just know the domain, and they might not be technical people.

Yuki Watanabe [00:09:45]: So many times you don't want to share the MLflow UI directly with them, and you also need to control access to all the other artifacts like models or run metrics. To resolve this problem, MLflow also introduced a feature called Labeling Sessions. A labeling session is a shareable, UI-based annotation workflow that is designed to be easy to use for non-technical users. Just like in this video, you can create a session and add traces to it. Then you can share the session URL directly with stakeholders or those domain experts; they can open the link, see the labeling session UI, and start inputting their feedback, and that feedback will then be synced back to the traces so you can use it for evaluation later. By using labeling sessions you can effectively collaborate with non-technical folks to collect feedback without asking them to understand MLflow. Okay, so now we have some feedback. The next step is automation. One common technique for automating the feedback process is using LLMs or agents.

Yuki Watanabe [00:11:02]: This is the so-called LLM-as-a-judge. Since LLMs are very intelligent and capable, they can give somewhat trustworthy feedback about your agent if given the proper instructions and examples, and this is a pretty useful technique. However, we found that people roughly have two problems with evaluation. One is that building the LLM judge itself is tricky; it's another agent development effort. It takes non-trivial effort to write good instructions so that those judges output consistent and accurate feedback.

Yuki Watanabe [00:11:39]: Another problem is tracking the results. Evaluation is tedious. Running a single eval is fine, but when you start to have multiple versions of an agent, it quickly becomes challenging to keep the evaluation results linked with each version, its parameters, its environment settings, and all that important information. It's easy to lose track: "oh, this is a good result, but what prompt did I use, and what parameters did we use?" To tackle this problem, MLflow offers an evaluation suite that natively integrates with tracing and feedback tracking. With the new API, mlflow.genai.evaluate, you can directly evaluate traces or a dataset with various built-in LLM judge metrics. This allows you to take advantage of automated judges built by our research team without spending a lot of time on tuning. You can also create your own custom judges with natural language, which is pretty easy to onboard.
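
A hedged sketch of what a call to the new evaluation API might look like, assuming the built-in judge scorers named here exist under mlflow.genai.scorers in your MLflow 3 version; the dataset and the predict function are placeholders:

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# Tiny placeholder dataset: inputs for the app plus expectations for the judges.
eval_data = [
    {
        "inputs": {"question": "What is MLflow Tracing?"},
        "expectations": {
            "expected_response": "An observability feature that records each step of an LLM/agent run."
        },
    },
]

def predict_fn(question: str) -> str:
    # Placeholder for your real agent call.
    return "MLflow Tracing records each step of an agent run as a trace."

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness(), RelevanceToQuery()],
)
print(results.metrics)  # aggregated judge scores; per-row results land in the MLflow UI
```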

Yuki Watanabe [00:12:47]: The evaluation results are logged to the MLflow tracking server with comprehensive lineage between prompts, parameters, models, and runs, so you can easily keep agent versions and evaluations linked together. Moreover, while those built-in judges are great and convenient, we also see a lot of limitations in real-world scenarios. In many projects the general judges are not enough; you need to define your own custom metrics or judges when you go deeper. So MLflow allows you to define such custom evaluation criteria as well. Basically you just need to define a function and decorate it with the scorer decorator. Inside this function you have access to the model inputs, outputs, the trace, and also the ground-truth labels annotated on it, which gives you almost full flexibility to define evaluation criteria.
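
A rough sketch of such a custom criterion, assuming the scorer decorator and the inputs/outputs/expectations argument names match your MLflow 3 version; the keyword-coverage heuristic and the "required_keywords" expectation field are invented for this example:

```python
from mlflow.genai.scorers import scorer

@scorer
def keyword_coverage(inputs, outputs, expectations) -> float:
    # Fraction of annotator-required keywords that appear in the answer.
    required = expectations.get("required_keywords", [])
    if not required:
        return 1.0
    answer = str(outputs).lower()
    hits = sum(1 for kw in required if kw.lower() in answer)
    return hits / len(required)

# Used alongside the built-in judges:
# mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=[keyword_coverage])
```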

Yuki Watanabe [00:13:47]: For example, you can extract the documents retrieved by the agent from the trace and pass them to precision and recall metrics. This also gives you the ability to plug in other evaluation libraries such as RAGAS or Deepchecks in just a few lines, like the example below. The evaluation result looks like this: it shows aggregated metrics and also individual scores on the traces. To dig deeper into a particular response you can click on each row and open the corresponding trace to find the root cause; for example, if it's a hallucination, you can check whether the context returned from the retriever is sufficient or not, and do that kind of root cause analysis. You can also open another tab in this view to see the input model, the agent version, all the parameters, the link to the notebook, and that kind of important lineage information. So far we've talked about manual and automatic evaluation, but offline evaluation is typically limited. Especially for LLM or agentic systems, user input is very dynamic, so it is challenging to create a static dataset that represents all the real traffic. Production monitoring is a must-have to make sure your application works as expected in a real environment.
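
To make the "pull the retrieved documents out of the trace" idea concrete, a sketch assuming the trace and span helpers shown here behave this way in your MLflow version; the "relevant_doc_ids" expectation field and the "id" key on retrieved documents are hypothetical:

```python
from mlflow.entities import SpanType
from mlflow.genai.scorers import scorer

@scorer
def retrieval_recall(expectations, trace) -> float:
    # Compare the documents retrieved during the run (read from retriever spans
    # on the trace) against the annotated relevant document IDs.
    relevant = set(expectations.get("relevant_doc_ids", []))
    if not relevant:
        return 1.0
    retrieved = set()
    for span in trace.search_spans(span_type=SpanType.RETRIEVER):
        for doc in span.outputs or []:
            # Document shape is an assumption; adapt to your retriever's output.
            retrieved.add(doc.get("id") if isinstance(doc, dict) else str(doc))
    return len(relevant & retrieved) / len(relevant)
```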

Yuki Watanabe [00:15:19]: In MLflow 3, with the Databricks platform, you can set up a production monitoring system with tracing in just a few clicks. The only thing you need to do is enable MLflow tracing, configure the storage to Databricks, and set up the monitoring configs, such as which metrics you want to compute. You can even use a custom scorer you defined locally for offline evaluation, because MLflow supports saving and loading custom scorers. This is pretty powerful, because you get a unified way of working across offline evaluation and production monitoring. Once the monitor is set up and production traces come in, you can see charts of various metrics like this, including operational ones like latency, errors, and request counts, and also the quality metrics you defined. The computation is all done by background jobs, so you don't need to worry about that. And if you spot any errors or issues on the dashboard, you can click on each bar to see the list of traces in that bar, so you can easily do root cause analysis and reproduce the problem as well, because the trace captures inputs and outputs. The last piece is dataset collection.
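
The application-side setup for sending production traces to a Databricks-backed experiment is small; a sketch with placeholder names (the monitoring metrics themselves are configured on the platform side, as described in the talk, not in this code):

```python
import mlflow

# Use the Databricks workspace as the tracking backend; the experiment path
# below is a placeholder.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/support-agent-prod")

# Enable tracing for whichever framework the agent uses.
mlflow.openai.autolog()

# ... serve the agent as usual; every request now emits a trace that the
# background monitoring jobs can score against the configured metrics ...
```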

Yuki Watanabe [00:16:31]: Data is the fuel of evaluation-driven development; it is crucial. However, we hear that a lot of organizations actually have challenges in building a good dataset to drive this process. You often start with a small manual dataset or a publicly available one, but as your project grows, the dataset has to grow too. For example, when a user complains, "hey, this answer is wrong," you want to add that case to your evaluation dataset so you can validate that the fix actually works, and also make sure you don't regress. To track these datasets, MLflow natively supports dataset management, so you can create a dataset directly by writing the inputs, outputs, and so on.

Yuki Watanabe [00:17:19]: But you can also create a dataset from traces by picking the actual traces from the list. For example, you can create a dataset from your production logs, from the traces that were given bad user feedback, and integrate it with your deployment pipeline through automatic evaluation. Whenever you update a model, you can ensure it won't make the same mistake again. So in summary, we covered these five pillars. MLflow tracing helps you in each step of evaluation-driven development and allows you to iterate on your agent fast and with confidence, and MLflow 3 streamlines it with features like feedback tracking, evaluation, and monitoring, and we support the entire end-to-end lineage in this process so you don't lose track of your very important development artifacts. Okay, so how can we get started with MLflow 3? MLflow 3 was released last month and it's already available on PyPI, so you can just install it with pip or uv or whatever package manager you prefer. Also visit the new website to see the new features I introduced, like tracing or feedback tracking; we created documentation with a full set of tutorials for the new features, so you can easily get onboarded with them. And the features I covered in this talk are just a small subset of the MLflow 3 release. We have way more updates we couldn't cover here, like full-text search over traces, end-to-end lineage, prompt optimization, or the TypeScript SDK.
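
To illustrate the "turn badly-rated production traces into a regression set" loop described above, a hedged sketch; the experiment ID is a placeholder, and reading an "assessments" column off the search result is an assumption about the MLflow 3 DataFrame layout rather than a documented contract:

```python
import mlflow

# Pull recent production traces (placeholder experiment ID).
traces_df = mlflow.search_traces(experiment_ids=["1234567890"], max_results=500)

def has_negative_correctness(assessments) -> bool:
    # Assumption: each row carries the assessments attached to the trace.
    return any(
        getattr(a, "name", None) == "correctness" and getattr(a, "value", None) is False
        for a in (assessments or [])
    )

regression_df = traces_df[traces_df["assessments"].apply(has_negative_correctness)]

# Re-evaluate the updated agent on exactly the cases users flagged:
# mlflow.genai.evaluate(
#     data=regression_df,
#     predict_fn=updated_agent,   # hypothetical new agent version
#     scorers=[...],
# )
```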

Yuki Watanabe [00:19:15]: So a lot of things are coming or already released, so please check it out. We're also waiting for feedback on all the new features we launched. Okay, this is my closing slide, but last but not least: we have a kind of free managed service on Databricks. Open source MLflow requires setting up some environment and spinning up the server, which is sometimes a bit tedious.

Yuki Watanabe [00:19:47]: So we are trying to make it simple. We have a managed version on Databricks where you can just sign up with your email and you will get an MLflow tracking server for free. You can find the link from this QR code. We highly recommend checking this out as well. All right, I think that is all for my session, and I think I'm doing good on time, so I think we can go to the Q&A session.

Demetrios [00:20:19]: You're doing great on time. You're making my life easy, which I enjoy, actually. Keep the QR code up while we chat. There are a few questions that have already come through the Q&A, but I want to make sure that we give people time to type in their questions. Sometimes there's a flood and a flurry of questions. Are monitoring capabilities only available on Databricks currently?

Yuki Watanabe [00:20:46]: Yes, currently monitoring is only available in Databricks, but we are currently trying to open source things from Databricks. There's a future effort to put everything we released for MLflow 3 into open source, and monitoring is one of them. So hopefully it will be available in, I think, just a few months.

Demetrios [00:21:09]: I have to say, I loved that you incorporated the idea of experiment tracking into agents, and how you can now have this experiment tracking with agentic runs, which is so good, because so much of our agent building is just exploration.

Yuki Watanabe [00:21:37]: Yeah, exactly. I think experiment tracking has been there for classical deep learning, but the framing was very fixed: we have training, we have validation, and then we deploy. But we realized that with agents specifically, the path is very random; we do a lot of random search. So we want to make sure that people have some anchor on the trajectory, which is evaluation. Evaluation is the tool we found to make it tractable.

Yuki Watanabe [00:22:12]: So.

Demetrios [00:22:14]: But you aren't helping with agent memory, are you? So if it is able to complete tasks, MLflow is not helping the agent remember those task completions or the paths that it took.

Yuki Watanabe [00:22:35]: Yeah, MLflow is trying not to be opinionated about any agent architecture or framework, so currently that's not in our scope. We're more about tracking the version of the agent: if you build a good agent with a good memory mechanism using some framework, we make sure you track that and don't lose it. But we are not trying to basically reinvent the wheel that's already done by a lot of great frameworks.

Demetrios [00:23:09]: Excellent. There's another question coming through from Manish. Do we have custom LLM operations in MLflow coming up as well, or is it already there? I mean features around pre-training, SFT, and RLHF?

Yuki Watanabe [00:23:30]: Yeah, absolutely. That is an interesting area, because fine-tuning is kind of a bridge between traditional machine learning and deep learning and this agent era, so we definitely want to cover it. One of the good things is that we've been doing that for, I think, five years, making deep learning and traditional ML a great experience, so MLflow is a great intersection between those two areas. Especially if you're using Transformers, you can track the Transformer model in MLflow with hyperparameters and metrics. Another interesting area is that some people use DSPy as a tool at that intersection.

Yuki Watanabe [00:24:14]: You're doing a kind of tuning of your prompt, just like you did for PyTorch, but you're building an agent, and we're also doubling down in that direction. So that is also an interesting thing I wanted to share here.

Demetrios [00:24:29]: Yeah, that's awesome. So I think that is it. I'm looking for more questions, but I don't think we have any at the moment. Now, on the MLflow team, you work with Ben Wilson quite a bit, I think, right? He's come on here and given plenty of talks, and I absolutely enjoy it a hundred percent when he does. I'm now putting you into the category of folks that I enjoy listening to talk about MLflow.

Demetrios [00:25:07]: I mean, MLflow is something that I think every data scientist is familiar with, and we ML engineers came into our own with MLflow, so it's nice to see that it's extending into the agentic era.

Yuki Watanabe [00:25:26]: Yeah, absolutely. We are doubling down there. And also, yeah, I'm very much looking forward to events from your agent community.

Demetrios [00:25:38]: Yeah, all right, last one from my side for you is what was one of the biggest surprises or gotchas as you were building out these new capabilities?

Yuki Watanabe [00:25:52]: Yeah, the big surprise is not really a surprise, but it's how dynamic and how intelligent people are in building those agents. That sometimes causes us headaches. For example, we want to save a LangGraph agent, but sometimes people build very complex agents and ask, "can you serialize that?" and we're like, "oh, but..." The same happens with evaluation as well: people sometimes build a very complex evaluation process and ask, "can MLflow do that?" We are trying to catch up with those new techniques and demands, but sometimes it surprises me how intelligent people are and how smart the things people actually build are.

Demetrios [00:26:44]: Yeah. One thing that became very clear to me after all of the incredible talks today was how research is being done by everyone out in the wild with this agent building. You know, it is literally like Devin from DOSU said. Yeah, we're doing research in this area. And so it's not just research that's happening in the labs or at the universities and academia. It is companies that have specific use cases and problems that they're encountering as they're building their agents who are doing the research.

Yuki Watanabe [00:27:27]: Yeah, absolutely. I think the boundary between research and engineering is also becoming more and more blurred. We kind of need to have some sort of research in mind always, and also handle the engineering, so it's a very exciting era, but it's also challenging sometimes.

Demetrios [00:27:45]: Yeah. Yuki, this has been awesome, man. I really appreciate you coming on here. We've got some closing notes and you're gonna start your day, I'm gonna end mine. I'll see you tomorrow maybe. Otherwise, thank you so much for coming on here.

Yuki Watanabe [00:28:02]: Yeah, thank you so much for organizing. Thank you.

Demetrios [00:28:05]: Cheers.
