MLOps Community

MLflow Leading Open Source

Posted Jan 16, 2026 | Views 25
# Agents in Production
# Open Source
# MLflow
# Databricks

Speakers

Corey Zumar
Software Engineer @ Databricks

Corey Zumar is a software engineer at Databricks, where he’s spent the last four years working on machine learning infrastructure and APIs for the machine learning lifecycle, including model management and production deployment. Corey is an active developer of MLflow. He holds a master’s degree in computer science from UC Berkeley.

Danny Chiao
Engineering Leader @ Databricks

Danny is an engineering lead at Databricks, leading efforts around data observability (quality, data classification). Previously, Danny led efforts at Tecton (+ Feast, an open source feature store) and Google to build ML infrastructure and large-scale ML-powered features. Danny holds a Bachelor’s Degree in Computer Science from MIT.

Jules Damji
Lead Developer Advocate @ Databricks

Jules is a developer advocate at Databricks Inc., an MLflow and Apache Spark™ contributor, and Learning Spark, 2nd Edition coauthor. He is a hands-on developer with over 25 years of experience. He has worked at leading companies, such as Sun Microsystems, Netscape, @Home, Opsware/LoudCloud, VeriSign, ProQuest, Hortonworks, Anyscale, and Databricks, building large-scale distributed systems. He holds a B.Sc. and M.Sc. in computer science (from Oregon State University and Cal State, Chico, respectively) and an MA in political advocacy and communication (from Johns Hopkins University).

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

MLflow isn’t just for data scientists anymore—and pretending it is is holding teams back.

Corey Zumar, Jules Damji, and Danny Chiao break down how MLflow is being rebuilt for GenAI, agents, and real production systems where evals are messy, memory is risky, and governance actually matters. The takeaway: if your AI stack treats agents like fancy chatbots or splits ML and software tooling, you’re already behind.


TRANSCRIPT

Demetrios Brinkmann [00:00:00]: All right, folks, I'm super excited for this conversation that I just had with Jules, Danny, and Corey. They are on the MLflow team. They are veterans in this space. Jules literally wrote the book on Spark. Now he is leading the developer relations side of MLflow. And Danny and Corey are both lead engineers crafting what we know and love about this classic tool. And they talked about how MLflow is leading with open source. They really want to go all in on open source MLflow and bring it to this new agent paradigm that we're in.

Demetrios Brinkmann [00:00:40]: Enough of me chatting. Let's get into the conversation. It's not every day that I get to hang out with so many incredible people; I am graced by not one, not two, but three of the MLflow team. How y'all doing today? I'm super stoked to chat about agents in production. You guys have been doing a lot with it, and I think each one of you individually has been on this podcast before, just in different phases of your lives. It's cool to see you all working on MLflow and doing things in this new agent era.

Corey Zumar [00:01:24]: Absolutely. I think all of us have come a long way, and at the same time, some of us are continuing to focus on developer experience for AI in a radically different domain: models to agents, feature stores to agents. It's kind of exciting. Thanks for having us back.

Jules Damji [00:01:45]: Yes, we all have come a long way, from Spark to MLflow and back to Spark and back to MLflow. So I'm delighted to be working again with Danny and with Corey and obviously talking to you. We have had many interactions. I just loved you and me serenading on the stage with a ukulele. That was the highlight. Absolutely. That was the highlight.

Corey Zumar [00:02:11]: Yeah.

Danny Chiao [00:02:12]: Well, I feel like a lot of this is reflective of more than just what the three of us have been doing. It's also that the MLOps community has been shifting as well. We kind of all were working in various versions of the classic ML world before and then evolved, shifted over to this GenAI agent ops space, right?

Demetrios Brinkmann [00:02:32]: Yeah, it evolved real quick. And so speaking of evolving, MLflow has evolved a ton. You guys have been seeing folks that are putting their agents into production. I would love to start out with just what have you been seeing out there? Each one of you comes at it from a different perspective. So maybe we can get your takes on the conversations you're having each day.

Jules Damji [00:03:00]: Yeah.

Corey Zumar [00:03:01]: 100%. I think as we talk about agents in production, the sort of first natural question to ask is, what are people trying to put in production, what are they building, and what are they deploying? And after talking to a lot of our customers and open source users, I can pretty confidently declare that the age of the chatbot is still here. It ain't going anywhere. The bulk of our customers, the bulk of our open source users, they're building chatbots. Yes, some of them that started as text-only bots a couple years ago have become multimodal. We do get the occasional voice application.

Corey Zumar [00:03:41]: Folks are doing speech to text, and then they get the text back and they go text to speech. Some are processing images, but at the end of the day that chat interface remains pretty ubiquitous, like well over 50 or 60% of use cases, which is massive. The other thing that I'd say about chatbots, and there are a bunch of other use cases that I'll let Danny and others talk about, is that they're pretty much all tool-calling chatbots now. We had the age of RAG up until about a year ago; everybody was really excited about document retrieval specifically. And now, yeah, folks are continuing to retrieve documents, they're gathering information from knowledge bases, they're indexing them, but they're doing so much more around that as well. They're making real-time API calls to other services to fetch information. They're even having their chatbots take various actions.

Corey Zumar [00:04:36]: I'm going to go ahead and save some information about this user to a knowledge base. So it's not just reads; there's a little bit of writes coming in as well. So this chatbot interface has remained somewhat consistent. Maybe it's become multimodal. But then under the hood the architecture has evolved quite a bit, thanks to the powerful models that seem to be being released every week or two now. I think that's what you can probably attribute a lot of the upgrades behind the scenes to. So that's my pitch on chatbot use cases. Danny, Jules, what else are we seeing?

Danny Chiao [00:05:10]: Yeah, well, just to elaborate on that, what I've seen is that chatbots are getting more and more sophisticated. So maybe initially it was just text-based chatbots. Now we're seeing a lot more rich return types, like maybe rich HTML that's being returned, or more human-in-the-loop steps, or within this agent you're now adding multi-agent systems, but it's still a chat interface to the user. Or there's basically the infrastructure underneath: people are building more complex guardrails within the agent as they get closer and closer to production or work with higher-stakes use cases. I think people are now just beginning to get some of those "oh, you can't just willy-nilly launch something to production" kinds of use cases into production. So that's been cool to see, and yeah, I've seen...

Demetrios Brinkmann [00:06:02]: Let me jump in there, Danny, because you talk about how the chatbot is getting a little bit more advanced. I don't know if you guys saw this website called AI UX Design Patterns. It is brilliant. It gives you all these different ways that folks are leveling up chat, and right now a lot of them are just kind of concepts, but you have things like being able to highlight certain parts of the response in a chat and then expand on only that part, which is so nice as a user experience, because a lot of times you want to say, oh yeah, I like all of this, but I don't like that part. Or I've also seen designs where they'll come back with a carousel of items for you that you can scroll through, like we're natively used to, as opposed to only chat in, chat out, that type of thing. But yeah, you were going to say something else, Danny, so hit me with what else you were mentioning.

Danny Chiao [00:07:09]: Yeah, well, that kind of reminds me of a different thing that I'll touch on as well. But I was going to say, even though all the chatbots are getting more and more complex, it's surprising to me how nascent the observability and eval space has been for these chatbots. You'll notice that the current discourse is a lot about multi-turn evaluation and simulation, and how basically every vendor or solution is still really just beginning to get their feet wet. But conversations in chatbots are naturally multi-turn. So why hasn't there been more robust tooling in that space? Really, a lot of customers had to just work around that and build their own tooling that's super bespoke. So yeah, that's been interesting.

Jules Damji [00:07:52]: Yeah, I mean, to add to both what Corey and Danny say, what we actually begin to see, according to my conversations with a lot of developers out there, is that the chatbot seems to be the favorite one. And as Danny pointed out, more of these evaluation judges are being written. For example, DeepEval and Ragas do provide very specific built-in as well as custom judges, and you can actually do multi-turn evaluations. So we begin to see more of that in the area where chatbots are being written. People begin to feel confident that they can actually write these judges to evaluate the multi-turn sessions, to evaluate the entire conversation. And that somehow seems to be the trend, at least from what I've seen from developers asking questions like, how are you actually evaluating your chatbot in production? I'm using either MLflow or some framework that has a host of built-in judges that allow me to gauge a particular metric, or I can write my custom judges along with agent-as-a-judge.

Jules Damji [00:08:57]: So we begin to see quite a bit of that trend.

Danny Chiao [00:08:59]: Yeah, I think people are beginning to talk about multi-turn judges now, but not because they didn't need it before. As soon as we launched multi-turn judges, or we started talking about it, people were like, yeah, I need that. That's the first thing I want to start with. It's like, I want to know: are people trying to escalate to get human support? Is the agent repeating itself over and over again? All these common agent problems you really start seeing in the context of a conversation rather than in a single turn. And really, as soon as we launched it, people were like, cool, yeah, we're going to use that immediately. We needed that yesterday.

Corey Zumar [00:09:40]: An example of this that I like to use a fair bit is determining whether an agent is being repetitive. I can't figure that out just by looking at one input-output pair, no clue. But if I can see a 20-turn conversation, like, oh, this agent keeps saying the same thing, it's giving me the same answer, I ask for more information and it continues to respond with basically the same stuff rephrased, it becomes a lot more obvious. And so that's where I can't figure that out if I have this little function, this thing called a judge, that is only looking turn by turn; I have to have the bigger picture. The same thing comes with human feedback. A lot of times folks, rightfully so, are using their end users or internal testers and whatnot to determine whether a chatbot meets their needs, and they need that conversational feedback.

Corey Zumar [00:10:33]: Indicating that this particular message didn't make a ton of sense is kind of helpful, but it's much more useful if I can say this entire interaction should have been half as long, or this is what I was trying to achieve with this interaction, this is where I got, and this is what was missing. So I think there's an element of human feedback, and there are these questions about how I automate the process of finding issues in these multi-turn interactions and bugs in my agent. And it's funny that it's taken us this long to get here and that everything started with sort of single-turn input-output, because this chatbot interface, especially this multi-turn thing, was the first thing launched, you know, November of three years ago now. ChatGPT was a multi-turn chat interface. And so it's kind of funny that we've finally started to catch up in terms of the evaluation, on the quality measurement side.

Jules Damji [00:11:29]: Yeah. You touched on two very salient points. One is the repetition, right? How do I know, if I had a 20-turn conversation, that something was not being repeated? And the other was the logical consistency: is there a logical flow between each and every step, am I being consistent, or have I been contradictory? I think those are the important elements that the judges can actually help detect in the multi-turn conversation. One, coherence: is there a logical flow? Two, context retention: if I referred to something 20 turns before, does my agent or my conversational bot have the ability to remember that in the context? So I think those are the important things, and collectively those make the chatbot more robust and more cohesive.

Jules Damji [00:12:18]: Yeah, so you touched on that important point.
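
Here is a minimal sketch of the kind of session-level check Corey and Jules describe: a judge that looks at a whole multi-turn conversation rather than a single input/output pair and flags repetitive assistant turns. The message format, similarity heuristic, and threshold are assumptions for illustration; in practice this logic would sit behind an LLM-as-a-judge or an MLflow custom judge rather than a string-similarity check.

```python
from difflib import SequenceMatcher

def repetition_score(conversation: list[dict]) -> float:
    """Return the highest pairwise similarity between assistant turns.

    `conversation` is assumed to be a list of {"role": ..., "content": ...}
    dicts, i.e. the usual chat-message format. A score near 1.0 means the
    agent is saying essentially the same thing over and over.
    """
    assistant_turns = [m["content"] for m in conversation if m["role"] == "assistant"]
    worst = 0.0
    for i in range(len(assistant_turns)):
        for j in range(i + 1, len(assistant_turns)):
            sim = SequenceMatcher(None, assistant_turns[i], assistant_turns[j]).ratio()
            worst = max(worst, sim)
    return worst

def is_repetitive(conversation: list[dict], threshold: float = 0.85) -> bool:
    # The threshold is an illustrative assumption; tune it against labeled sessions.
    return repetition_score(conversation) >= threshold

# Example: a session where the agent repeats itself almost verbatim.
session = [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security and click Reset Password."},
    {"role": "user", "content": "That option isn't there. Anything else I can try?"},
    {"role": "assistant", "content": "Go to Settings > Security and click Reset Password."},
]
print(is_repetitive(session))  # True
```

The important design point, echoing Corey, is that the judge receives the entire session, not one turn, which is the only way repetition or lost context becomes visible.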

Demetrios Brinkmann [00:12:19]: Yeah. Because those of us in this field kind of take it for granted that we understand, oh, a lot of times to get the best result we should just start a new chat and clear the context window. But I would guess that your average user off the street doesn't know that. And so how can you make sure that what you're talking about, Jules, stays true? Like if somebody wants to reference back and you're not doing it properly, you know that, and you have a way to either be the janitor of that context or encourage them to start a new chat or do something, because otherwise that just leads to a poor user experience.

Corey Zumar [00:13:05]: Yeah, 100%. And I think it dovetails nicely with this whole topic of memory management for agents. Everybody seems very rightfully excited about the idea that a very long message history gives an agent a ton of rich context. If I've been talking with a chatbot for 15 minutes and then I ask it to do something that follows naturally from that 15 minutes of conversation, I'm much more likely, I would guess, to get the result that I'm looking for. At the same time, there is that trapdoor pitfall that I think you rightly point out, where if I'm asking for something totally different, that 15 minutes of context can hurt more than it helps. And so to me, I view it as a problem of memory and state management in an agent. Across all the conversations that I've had with this particular chatbot, when I go and I ask for something, whether it's in the middle of an active conversation or a brand new one, what information about me or about my intents or previous interactions is this agent drawing from? And that's something that we're taking a pretty close look at at Databricks, from a research perspective as well. And I think it's an important topic for the industry at large: agent memory.

Demetrios Brinkmann [00:14:25]: Yeah, yeah. And how do you weight memories?

Corey Zumar [00:14:28]: Exactly.

Jules Damji [00:14:29]: So Corey, you're referring to short-term memory and long-term memory, where if the 15-minute conversation has elapsed and nothing has been alluded to in the further conversation, that 15-minute stretch can actually be put in long-term memory and retrieved on demand, right? And the short-term memory is just like, okay, I'm going to flush this out and start a new session, but I'll keep that in short-term memory. If there is a reference to a certain thing that I talked about previously that doesn't exist in my short-term memory, I'm going to go and refer to the long-term memory to see if the context is there. And that way, as you say, it maintains a good flow, and you as a user have a better experience, like, okay, this thing remembers everything from the last half hour that I actually had. It's like a human really paying attention to you and referring to something that you had asked about 15, 20 minutes, half an hour ago. That is like a moment of, oh wow, this is brilliant.

Jules Damji [00:15:25]: I can't believe that half an hour ago I asked it to do something, and when I refer back to it, it actually comes back and gives me a full answer. That is definitely an aha moment.
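
A toy sketch of the short-term/long-term split Jules describes: a bounded short-term window, with older turns spilled into a long-term store that is only searched when the active window has no match. The keyword-overlap retrieval is a deliberate simplification (a real system would use embeddings and relevance weighting, per Demetrios's question about weighting memories); all names here are illustrative.

```python
from collections import deque

class ConversationMemory:
    def __init__(self, short_term_turns: int = 10):
        # Recent turns stay in a bounded short-term window.
        self.short_term = deque(maxlen=short_term_turns)
        # Evicted turns land in a long-term store, retrieved only on demand.
        self.long_term: list[str] = []

    def add_turn(self, text: str) -> None:
        if len(self.short_term) == self.short_term.maxlen:
            self.long_term.append(self.short_term[0])  # about to be evicted
        self.short_term.append(text)

    def recall(self, query: str) -> str | None:
        """Check short-term memory first; fall back to long-term if needed."""
        for turn in reversed(self.short_term):
            if self._matches(query, turn):
                return turn
        # Nothing in the active window: search the long-term store.
        for turn in reversed(self.long_term):
            if self._matches(query, turn):
                return turn
        return None

    @staticmethod
    def _matches(query: str, turn: str) -> bool:
        # Naive keyword overlap stands in for real embedding search.
        q = set(query.lower().split())
        return len(q & set(turn.lower().split())) >= 2

memory = ConversationMemory(short_term_turns=3)
for t in ["I want to book a flight to Tokyo in May",
          "My budget is around 900 dollars",
          "Actually make that Osaka",
          "What's the weather like there?"]:
    memory.add_turn(t)

# The Tokyo turn has been evicted from the window but is still recoverable.
print(memory.recall("flight to Tokyo"))
```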

Demetrios Brinkmann [00:15:36]: Magical. I'll tell you what though, when I encounter chatbots in the wild and they don't work that well, I am very adamant, caps lock and everything, saying how bad this damn bot is. And I'm just trying to flag it, so whoever is doing evals on the other side, they see that and go, oh, maybe we should try and make it...

Corey Zumar [00:16:02]: ...better in this area.

Danny Chiao [00:16:03]: Yeah, it's funny, because one customer that we worked with launched to production without really doing too much in the way of eval, and at some point they were looking at all their responses, and I think most of their chats were just people swearing at the chatbot. So it was really trying to serve a different purpose, but it ended up just being a therapy chatbot for all of their users.

Jules Damji [00:16:28]: But it's interesting, Danny, when you become abusive to the chatbot, if you have a judge that actually detects the frustration. For example, in MLflow we have the built-in evaluation called user frustration, right? It detects your language and tonality, and based on that it would actually respond. Sometimes it surprises you: you were frustrated about this particular topic, and it comes back with a nice polite answer to mitigate the frustration. I think that's an interesting way chatbots can somehow appear like human interaction, because of the fact that we actually have both built-in as well as custom judges that you can write, and you can specify and tailor them to your chatbot, right? If the chatbot is talking about insects, then everything dealing with insects would be sort of more tailored to that.

Corey Zumar [00:17:17]: The other piece of this, beyond yelling at the chatbot, is the degree to which folks are collecting thumbs-up/thumbs-down, categorical, and free-form feedback from their chatbots. I noticed that a lot of folks, the first time they deploy to production, haven't really thought about collecting those signals as much as you might expect. As a result, Demetrios, you resort to caps-locking and yelling at an agent in the chat window. But it's often high signal for developers if they can get a more structured set of feedback: if they can get thumbs up, thumbs down, or a 1-to-5 rating, and they can carve out a separate text window or field for an explanation about why things are going well or where they can improve. I think a lot of folks treat that as an afterthought, and it's a little bit surprising, at least to me, because that's pretty fundamental to, A, figuring out if you deployed something that's working reasonably well, and then, B, automating that process of identifying and fixing quality issues in the future. It's hard to automate the detection of a real quality problem without having some early signal from your users about what the quality problems are. So the upshot here would be that feedback collection is massively important and to some extent underinvested in in the chatbot ecosystem. And the interesting thing is, all the stuff that folks are doing around stitching together tool-calling chatbots with knowledge databases and external APIs, putting that in an awesome web front end, storing data securely: all that stuff is hard.

Corey Zumar [00:19:02]: Collecting a little bit of end-user feedback is pretty straightforward, and I'll shamelessly plug the MLflow feedback collection APIs as being a particularly low-friction and easy product surface to use. But the point is: spend that extra hour; that's really all it should take to wire up some end-user feedback and understand the quality of your agents.
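
Here is a hedged sketch of what that "extra hour" of wiring could look like with the feedback APIs Corey plugs. The call shape below is an assumption based on MLflow 3's GenAI feedback surface (`mlflow.log_feedback` plus the assessment-source entities); check the current MLflow docs for the exact names and arguments before relying on them.

```python
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

def record_user_feedback(trace_id: str, thumbs_up: bool, comment: str | None = None):
    # Assumption: `trace_id` was captured when the agent handled the request
    # (e.g. from the active trace) and round-tripped to the chat front end.
    mlflow.log_feedback(
        trace_id=trace_id,
        name="user_thumbs",
        value=thumbs_up,
        rationale=comment,  # free-form "why" text from the feedback widget
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN,
            source_id="end-user",
        ),
    )

# Wired to the thumbs-up / thumbs-down buttons in the chat UI, for example:
# record_user_feedback(trace_id, thumbs_up=False, comment="Answer ignored my question")
```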

Jules Damji [00:19:25]: Corey, you touched on a very important point. How do we actually take the human feedback and operationalize it? From your experience talking to customers, even developers, how do you think they're actually doing it? So when they have MLflow deployed and they're going through the UI to add additional feedback, how does it get injected into the traces so they can improve that?

Corey Zumar [00:19:51]: Yeah, it's a great question. So what you have with most deployed agents where you've integrated some amount of tracing is effectively this mass of log statements. Whether you collected them with MLflow or a variety of other solutions, you have thousands, hundreds of thousands, even millions of these timestamped logs. You can open them up and you can see the statement-by-statement stack trace. What feedback helps you do is filter those. I want to find the needles in this haystack. Most of my hundreds of thousands of interactions aren't necessarily going to be labeled; the quality might be reasonably good if folks didn't take the time to express their frustration or provide some feedback. And so what I would like to quickly do is ignore all the stuff that I know is working well and that I don't have to look at closely.

Corey Zumar [00:20:42]: I want to click in this UI and apply a filter for certain types of negative feedback. For example, user frustration is high: okay, I'm going to apply that filter. Or the completeness of whatever task the agent was being asked to perform is low: it's only providing partial information or not fully meeting the user's objective. That condenses this log of thousands and hundreds of thousands of stack traces into a few things that I can then analyze more deeply and start to root cause. It's great to understand that the user was frustrated, but why? Well, now I have to actually take a look at those particular log statements. Then in the future, once I've uncovered that root cause and I've started to implement a fix, I don't want to have to go back to my end user and ask, is this better now? I should be able to take their feedback and turn it into some sort of unit test. And that's where these evals, these automated techniques, really help.

Corey Zumar [00:21:45]: So that's sort of the pipeline that I see. It starts with the feedback collection, it proceeds to filtering. Then there is some human debugging, and then you're fixing with these automated unit tests that you can build.
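
A sketch of that filter-then-fix loop: pull traces, keep only the ones a user flagged, and promote them into a small regression set. `mlflow.search_traces` does return a pandas DataFrame, but the column layout and the shape of the assessment objects used below are assumptions that vary by MLflow version, so treat this as a pattern rather than a copy-paste recipe.

```python
import mlflow

# Step 1: fetch recent traces for the deployed agent's experiment.
traces = mlflow.search_traces(
    experiment_ids=["1234"],  # placeholder experiment ID
    max_results=1000,
)

# Step 2: keep only traces carrying negative end-user feedback.
# Assumption: feedback logged as "user_thumbs" shows up in an `assessments`
# column; adjust the accessors to whatever schema your MLflow version emits.
def has_negative_feedback(assessments) -> bool:
    return any(
        getattr(a, "name", None) == "user_thumbs" and getattr(a, "value", None) is False
        for a in (assessments or [])
    )

flagged = traces[traces["assessments"].apply(has_negative_feedback)]

# Step 3: turn the flagged interactions into a reusable eval/regression set,
# so a prompt or retrieval fix can be verified without going back to the end user.
regression_set = flagged[["request", "response"]].to_dict("records")
print(f"{len(regression_set)} flagged interactions promoted to the eval set")
```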

Jules Damji [00:21:56]: And those unit tests somehow help us improve the prompt as well, right? Not to fall into the same rut again, not to get into the same frustration loop and all that.

Corey Zumar [00:22:07]: Yeah, I think there's a variety of solutions out there. Prompt optimization is certainly one. Prompts still remain, I think, a major determining factor of agent quality. Two or three years ago, everybody was talking about hiring professional prompt engineers. A year after that it was, well, this prompt engineering thing is brittle; we should look at fine-tuning; there are other techniques that may serve us better and get us further. It seems like prompt engineering, though, has sort of remained front and center for a lot of agent developers.

Corey Zumar [00:22:40]: So you're absolutely right. In the process of fixing an issue, I may want to do a fair bit of prompt engineering to get the agents to make better decisions, retrieve information from a knowledge source more proactively, issue queries against a web service somewhat differently, or take on a different persona. These are all valid use cases for prompt engineering. And so something that we've been investing in quite a bit in MLflow is making that quick and effective. I've personally spent days prompt engineering a voice assistant that I built at home, and it can be agonizing. Wouldn't it be great if there was some automated technique that I could use to make that process a little more efficient? And that's where the research folks at Databricks, but also just more broadly in academia, have been pushing forward this concept of prompt optimization. Given the collection of negative feedback, given some information about what isn't working in an agent, and given a set of prompts, how can I come up with better versions of those prompts that mitigate and fix those issues? And can I treat this basically like an optimization job and an automated solution, rather than this human-in-the-loop English instruction modification thing where I don't know if I'm making deterministic progress and I might be spending hours or days coming up with better versions of my prompt?

Jules Damji [00:24:14]: That's what you're talking about; that's where the GEPA optimization comes into play, where you can actually optimize your prompt engineering using the GEPA optimizer as part of the MLflow library. So one of the things, as you mentioned, when you actually want to automate the prompt engineering part, would be: okay, here are the set of prompts, run them through the GEPA optimization to get the best prompt, and see how you can actually do that.
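
A hedged sketch of the automated loop Jules and Corey describe, using the GEPA optimizer exposed by recent DSPy releases (which MLflow's prompt-optimization tooling builds on). The program, metric, dataset, and constructor arguments below are illustrative assumptions; GEPA's exact options (including whether a separate reflection model is required) differ across DSPy versions, so check the DSPy docs for your install.

```python
import dspy

# Assumption: an OpenAI-compatible model string; swap in whatever you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# The program whose instructions we want the optimizer to rewrite.
support_bot = dspy.ChainOfThought("question -> answer")

# A metric distilled from collected feedback. Here a trivial keyword check stands
# in for an LLM judge or a scorer derived from flagged production traces.
def answer_mentions_refund(example, prediction, *args, **kwargs):
    return float("refund" in prediction.answer.lower())

trainset = [
    dspy.Example(question="My order arrived broken, what can I do?",
                 answer="You can request a refund or a replacement.").with_inputs("question"),
    # ...more examples promoted from real conversations...
]

# GEPA proposes and evaluates candidate instructions against the whole training
# set, which is what prevents the per-example overfitting Corey warns about.
# `auto` and `reflection_lm` are assumed parameter names from the GEPA docs.
optimizer = dspy.GEPA(
    metric=answer_mentions_refund,
    auto="light",
    reflection_lm=dspy.LM("openai/gpt-4o"),
)
optimized_bot = optimizer.compile(support_bot, trainset=trainset)
```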

Demetrios Brinkmann [00:24:38]: One thing that I find interesting when it comes to trying to optimize your prompts: when you're out there and you're trying to add all these edge cases, you can quickly get into a place where you've got this massive prompt, because you've added all these edge cases in, and then you end up destroying the usability of the agent since you tried to cover for all of this. And I know I've seen folks that will focus on dynamically adding in or injecting the context. That's why we have all heard about the rise of the context engineer, because you gotta throw the right context in at the right time, and you really want to have not more than you need. If anything, it's just enough, but not too much.

Danny Chiao [00:25:31]: Certainly.

Corey Zumar [00:25:32]: And if you're not careful, like you said, you end up rewriting a rule-based system, except this time you're not writing it in Python or TypeScript or Java or whatever, you're writing it in English. And that effectively becomes lower quality and almost more brittle than just writing it in code. So you're 100% right. And that's where some of these optimization techniques, optimizers like GEPA and frameworks like DSPy that we try to integrate closely with in MLflow, can help you avoid that, because they force you to generalize across a whole bunch of samples at once. You can't necessarily overfit sample by sample by sample; ultimately that optimizer is not going to be able to produce good quality that way across thousands of test cases. So that's certainly one way to arrive at a prompt that, A, fixes your problem, but, B, avoids that overfitting and the sort of rule-based prompting that can quickly get out of hand. That really resonates.

Corey Zumar [00:26:33]: Absolutely.

Danny Chiao [00:26:35]: Yeah. I don't know if we've seen our own customers hit the case where they're adding so much stuff into the prompt that it's really polluting the conversation. I think they're still more in the state of, we've thrown something in there and we're still trying to figure out how good the agent is in general. But also, I feel like people have stopped short of that because they are cost conscious as well. They don't want to just blow their LLM token budget, and that is something they're tracking very closely. So you don't want to construct this massive prompt that just continuously sucks up your budget every single turn.

Demetrios Brinkmann [00:27:11]: And you were talking about something before, Corey, that I think is fascinating when it comes to feedback collection and how you normally need to bring in the subject matter experts for that type of thing. When you get that feedback from users, it's great if they raise their hand and they're kind enough to take the time and tell you what's wrong with it. But a lot of times when you're looking at these multi-turn conversations, you're just going to bring in a subject matter expert and ask, is it doing what we want it to be doing? And I've heard a huge complaint from folks about how subject matter experts don't want to be looking at JSON; in fact, they can't really be doing that. So you have to figure out, how can I give these folks the data in a way that they can properly annotate it and add value to it?

Corey Zumar [00:28:06]: That isn't just a logs dump. 100%, absolutely. And that's something that we've spent quite a bit of time at Databricks and in the MLflow open source community investing in: attempting to address that challenge of, as a domain expert, how do I quickly recreate the same view that the end user had, overlay some extra information that helps the domain expert get to the bottom of what went wrong, and then very quickly provide their expertise? And so it's a matter of, if I have a chatbot out there and a user said, hey, this really isn't going well (you know, I'm Demetrios, I'm yelling at the thing, I'm frustrated, whatever), I then call in my domain expert, Danny or Jules, and they're reviewing this interaction. They're trying to figure out, for this agent that's designed to answer questions about, say, telecom customers, what is it that is making Demetrios so frustrated? They should be able to replay that exact chat. They should have access to the turn-by-turn chat first and foremost.

Corey Zumar [00:29:15]: But then oftentimes that isn't 100% sufficient. They should also be able to see, okay, in this particular turn of the chat the agent went and accessed information. Where did it access it? Okay, it accessed information about the user's account rather than about the line of products. Demetrios was trying to get a new phone; he's not asking about his account or billing-related questions. So if I can give Danny and Jules that information in a UI and they can look through it and very quickly go, ah, okay, this thing is fetching the wrong info, then it's very easy for them to come back and say, all right, I found the source of Demetrios's frustration: it was this sort of knowledge retrieval problem.

Corey Zumar [00:30:00]: And so at Databricks, we've invested quite a bit of time building this review application where I can collect a bunch of these chat sessions and I can send them over to domain experts in my organization. They get back that full chat view with the overlay of what information was retrieved. That way they can dive deeper, and they get this awesome feedback form where they can provide root cause analysis, which then helps the developer go back and fix their agent. They can then send that review app to any other member of their organization. And we've been really invested and actively working on bringing this into the open source community as well. The vision is that anybody running open source MLflow should be able to spin up exactly the same UI, send it to any other member of their organization, and start collecting these labels. And so that's a problem that hits home and is deeply personal, I think, to all of us on this call at this point.

Demetrios Brinkmann [00:30:55]: Well, maybe it's interesting to talk for a second about how, with MLflow, you traditionally were building for data scientists, right? And now, do you feel like the persona has changed? And if so, who are you building for? There are more stakeholders involved. What does it look like now?

Corey Zumar [00:31:15]: 100%. We are focused on continuing to serve that community of data scientists who want to build awesome, high-quality models, and to be clear, that's not going anywhere and is super important to us. At the same time, what we found a couple of years ago is there's so much synergy between agent development and model development. You're going through this process of building an initial version of an AI application, you're measuring its quality, you're iterating on it, you're managing the deployment of that thing, you're collecting signals from your end users once you've deployed it. That's true whether you're building an agent or a model. And so we felt like, wow, we can take on this problem, we can start addressing the needs of agent developers. But as you dive in, the next order of detail around, okay, how do I measure quality, how do I deploy this thing and understand whether it's operating well in production, as we've talked about, looks a fair bit different.

Corey Zumar [00:32:16]: And so that's where you have to serve your agent developer who's building that initial version of an agent by giving them logs and observability and stack traces so they can debug, the same way that you serve the model developer by giving them a way to track train, test, and validation loss during the training process. You're meeting similar needs, but you're doing it with different tooling. And then at the point of, I'm trying to evaluate and improve quality: okay, well, for the model developer, again, they're going to take a look at that test loss and they're going to go and do feature engineering and things like that. But for the agent developer to evaluate quality, they need to go talk to a domain expert within their organization, or they need to have these awesome out-of-the-box LLM judges that can look at a natural language response and determine whether it meets the needs of the user who sent the query or not. So again, similar objectives, very different sets of tools underneath. We're still serving AI developers, model developers and agent developers, but it's required us to tap these brand new spaces of agent evaluations and things like prompt optimization, and build higher-fidelity feedback collection into the platform. So what you'll see now when you go to mlflow.org, install MLflow, and open it up, is that you get this sort of structured experimentation and deployment platform for ML and GenAI.

Corey Zumar [00:33:50]: But you pick the objective: am I building an AI agent or am I building an ML model? We tailor the experience based on that selection.
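
A compact sketch of the "same objectives, different tooling" point: the model developer logs parameters and train/validation metrics to an experiment run, while the agent developer gets traces from the same installation by decorating their call path. `mlflow.start_run`, `mlflow.log_param`, `mlflow.log_metric`, and the `@mlflow.trace` decorator are standard MLflow APIs; the model and agent code around them is illustrative.

```python
import mlflow

mlflow.set_experiment("demo")

# --- Model developer workflow: runs, params, and loss curves ---
with mlflow.start_run(run_name="baseline-classifier"):
    mlflow.log_param("learning_rate", 0.01)
    for epoch, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.6, 0.7), (0.4, 0.55)]):
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)

# --- Agent developer workflow: traces instead of loss curves ---
@mlflow.trace
def retrieve_docs(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # placeholder for a real retriever

@mlflow.trace
def answer(query: str) -> str:
    docs = retrieve_docs(query)
    return f"Answer based on {len(docs)} documents"  # placeholder for an LLM call

answer("How do I rotate my API key?")  # produces a nested trace in the same UI
```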

Jules Damji [00:34:01]: I think Corey is absolutely right in that this is sort of a natural progression, and we're actually serving two different audiences, right? As you pointed out, the data scientists deal with models which are sort of deterministic, and the evaluation criteria are more quantitative: you have a certain precision that you're referring to, you have a certain recall. Whereas on the GenAI side, they go through the same experimentation process, but the evaluation process is quite different. You have a more non-deterministic model, where you now have this set of built-in judges and a set of custom judges that evaluate based on their criteria. And I see a similarity between prompt optimization and fine-tuning a traditional model; they are sort of similar tools and similar techniques you use, but it's a natural progression.

Jules Damji [00:34:47]: And MLflow currently serves both needs. So whether you're a traditional ML engineer or an ML data scientist, you can use MLflow for that particular need. And for the more recent AI engineers and GenAI developers building agents, as Corey pointed out, we have a set of techniques and tools and judges that allow you to measure that in a way that meets your demands and meets your needs, right?

Danny Chiao [00:35:12]: But also, very often it's the same person doing both, right? I think a while back we were like, who is actually building agents? And I'm actually curious, Demetrios, whether you've seen this as well. We were like, oh, maybe it's this whole new class of engineers who are building agents and we're not talking to them at all. And then we went out and just talked to a bunch of people, including a lot of non-MLflow users. And it seems like actually a lot of people just shifted from, okay, I used to be a data scientist or data engineer collaborating on a classic model, and now I've just rebranded into a GenAI engineer or AI engineer. So probably still like half the market is these former data scientists or data engineers or MLEs converted into these GenAI people. And that's where they often then will come to us and be like, hey, we've already used MLflow in the past.

Danny Chiao [00:36:00]: We would love to just continue using MLflow. But certainly there are these new engineers who haven't really gone through this process before, and they're like, what is an eval? And so then there's kind of an education component there of, hey, we've already built all this expertise over decades of how you iterate quickly to get high-quality outputs, right? Let us teach you some of the best practices there.

Corey Zumar [00:36:26]: So.

Demetrios Brinkmann [00:36:27]: Well, that answers the question that was going through my head as you guys were talking, which is: why not make a different product? Why not separate the two? But it feels like if it is for almost this very same persona, then it makes sense, because they're already familiar with MLflow; they're just going to extend the capabilities.

Corey Zumar [00:36:51]: Absolutely the case. I think that there are, from what I've seen, two temptations that product developers and competitors out there have had to grapple with. The first, like you say, is that temptation to say: I'm going to build a brand new product and throw everything out. It's a brand new world of agent development, we don't need all that ML stuff; people can install a totally separate package, host a separate service, manage the work entirely separately. Our agent developers, who as Danny points out are often the same folks, need a different platform; they're going to install something else that we build, and we're going to have to maintain two different ecosystems. That's sort of the first temptation. The second temptation is: there's nothing different here.

Corey Zumar [00:37:37]: It's all just kind of training jobs and quality measurement for agents and for models; it all just comes down to metrics anyway, so what's the difference? And that has its own problems. The agent developer needs traces. The model developer? Yeah, they might want them way in the future after they deploy something, but it's not even something they're thinking about as they start to build a new use case. The agent developer needs LLM-as-a-judge; I don't know why the heck I would use that if I was building a traditional machine learning classifier. So you definitely have to understand that the workflows are different and the tools are different, but the objectives and the persona, the person that's trying to meet those objectives, remain the same.

Corey Zumar [00:38:21]: They don't want to have to learn and install a separate platform. The business stakeholders, or folks that are making purchasing decisions within an organization, don't want to pay for multiple platforms and set them up and figure out how to make them available. So we believe we've found a pretty good happy middle here: one platform with the right tooling for those agent devs and model developers, who are often just data scientists, plus a fair bit of software engineers as well.

Demetrios Brinkmann [00:38:54]: Yeah, the personas really had to meet in the middle in a way. Data scientists had to learn how to become better software engineers, and then software engineers had to learn how to become better data scientists. And so you have this nice Venn diagram of both of those skill sets that you need and want to take advantage of. And I do see agent building as very much requiring a lot of the software skills that we're talking about, but being able to fall back on fundamental data science type stuff is crucial.

Corey Zumar [00:39:33]: It really is, especially as the actual software development part of it is increasingly automated. The coding assistants, basically agents building agents, are kind of becoming the de facto pattern for a new use case. We talk to customers all the time where, despite the fact that Claude Code and Codex and other tools really only rose to prominence within the last year or so, everybody has subscriptions; throughout a lot of our customers, the entire dev team is using these things to build their agents. And so it's easier than ever for them to integrate new tools into their agent, or connect a new data source, or even ask something like Claude or Codex to prompt engineer. And then, because the software development lifecycle is so accelerated, they have to fall back even more quickly in their workflow to, okay, what does the quality look like? How well is this going? Those data science skills, to your point, become important increasingly early. I'm not thinking about it three or four weeks from now after I've hand-coded an entire agent; I'm thinking about it three or four hours from now, because the code is already written, because I asked a coding assistant to do it. And so where this is all leading, I believe, is, A, ensuring that folks can do the data science properly themselves.

Corey Zumar [00:40:56]: And B, why can't we enable coding assistants to do some of this too? So there is, I think, an appetite out there for: if I'm asking Claude or Codex to build an awesome agent, can I ask it to evaluate it as well? Can I give it some example inputs, or tell it more about what I'm trying to build and let it come up with those inputs? Can I then ask it to assess quality and iterate? And so as a result, we've tried to make all of MLflow's judges, evaluation capabilities, and feedback functionality available to coding assistants through MCP. I can basically drop MLflow into my coding assistant and have it act like the agent developer, which includes the eval and the more data-sciencey work. And that's something that folks seem to appreciate as well.

Demetrios Brinkmann [00:41:44]: Jules, you're off mute. I get the feeling you got something to say.

Jules Damji [00:41:48]: No, I mean, Corey just articulated very well that the personas now are sort of merging. It's hard to say I'm a data scientist or I'm a data engineer or I'm an architect, given the ability and extensibility of a coding assistant to just go ahead and create a traditional model for you. A developer might say, I'm not a data scientist, but I heard you can actually create a classical, traditional model, and it will do it for me, so I don't have to actually have the skills. The skill comes in knowing how you actually measure the metric, and which metrics you actually want to use for a classical model or for an agent. I think that's where the skills come in; that's where the background knowledge comes in. But the coding assistants can do one thing or the other. As Corey pointed out, you drop in an MCP server with your code and you can say, go ahead and write me an agent to evaluate this particular chatbot that I've written, and it'll spit one out. So I think we're entering this new frontier where the personas are a very gray area, but you need the background to be able to say how I'm going to assess, and what are the things that I need to know and understand to get the best out of this particular piece of software that the agent has actually created.

Danny Chiao [00:43:02]: Yeah, one kind of cool thing that at least I'm also seeing is that, yes, these personas are merging, but a lot of the innovation seems to be coming from the more data science persona, or people who are well versed in that space. A couple of examples of this: I think we've seen on our side very good success with switching between models. Say you're using a Gemini model and you want to switch to GPT, or vice versa. How do you make that switch? Your prompts are tied to your models. The data science world was like, well, this just looks like model distillation: how do I teach a model to act like a different model? And we've seen that if you try to do that and use DSPy in the mix, you get really, really good results. In fact, we've even seen results where you can use a smaller, cheaper model and it can really accurately reflect the outputs of a much more expensive model. So it's almost a cost optimization. That's one innovation, I think, that's coming out of more traditional research or data science spaces.

Jules Damji [00:44:12]: I just want to reiterate something Corey said, which is that I think the MLflow team has done a fabulous job: even though there are two distinct experiences, one for the agent developers and GenAI developers and the other one for the traditional machine learning scientists, the concepts are not very dissimilar. Both need experimentation; both need a way to fine-tune, as I said earlier, where one does the prompt optimization and the other does the fine-tuning; both need a set of datasets that you can actually provision, your training dataset, your evaluation dataset, your test set. We do the same thing in GenAI, where you actually have the dataset. So I think the concepts are very similar, and the underlying tools and techniques used are pervasive across both of these particular platforms. So somebody who is familiar with ML and is coming into GenAI can actually do that, or somebody who's familiar with GenAI and is going into ML can easily transport those particular skills. So conceptually and visually and through the workflow, they're not very dissimilar. I think the MLflow team has done a very conscientious job in making sure that the transitions are seamless from one to the other.

Jules Damji [00:45:23]: It's not like you're in a completely different world. There are some concepts that carry over. As Demetrios said, there's kind of a Venn diagram of tools and concepts that merge together.

Corey Zumar [00:45:35]: Absolutely the case. I also wanted to dive deeper into that model distillation piece that Danny was bringing up, because I think it's a great example of a capability that folks resonate with, whether they're coming from a data science background doing model development or they're building agents. Everybody wants as high-quality an AI-powered application as possible, and they want it for $0. And obviously you trade off cost and quality a fair bit. But the real upshot, I think, of a lot of this work in model distillation is that I can build a great agent using frontier models, and then over time, once I understand the quality of that agent and the factors that determine quality, I can ramp that choice of model down to a less intelligent, less expensive model without sacrificing quality. What I need is a structured way to measure it, and what I need is a structured way to adjust my agent's prompts, in particular if I'm going to change models. So the task of moving from, say, a GPT 5.2 down to a GPT 5 mini, or a Gemini Pro to Flash, requires me to understand what the examples of great agent quality are, what the examples of negative quality I'm trying to avoid are, and how I tailor my agent's instructions so that the mini model can understand those instructions just as well as GPT 5.2, et cetera.

Corey Zumar [00:47:13]: So if I have both of those pieces which MLflow provides, I can get to a point of cutting my costs by a factor of 10, even 20x just by virtue of switching the model. And so that's pretty powerful as well.
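
Here is a sketch of the structured measurement Corey says you need before ramping down: run the same eval set through both model backends, score with the same judge, and compare quality against cost. The predict functions, judge, and per-request prices are placeholders; in practice the judge would be an LLM judge and the eval set would come from labeled production traces.

```python
# Placeholder backends: in reality these would call a frontier model and a
# smaller, cheaper model through your serving layer or an AI gateway.
def frontier_predict(question: str) -> str:
    return f"Detailed answer to: {question}"

def small_predict(question: str) -> str:
    return f"Short answer to: {question}"

# Placeholder judge: stands in for an LLM judge scoring answer quality 0..1.
def judge(question: str, answer: str) -> float:
    return 1.0 if question.split()[-1].rstrip("?") in answer else 0.5

eval_set = [
    "How do I export my billing history?",
    "Which plan includes SSO?",
]

# Illustrative per-request costs; real numbers come from token usage in traces.
COST = {"frontier": 0.020, "small": 0.002}

def evaluate(name: str, predict_fn) -> None:
    scores = [judge(q, predict_fn(q)) for q in eval_set]
    avg = sum(scores) / len(scores)
    cost = COST[name] * len(eval_set)
    print(f"{name:8s} avg_quality={avg:.2f} cost=${cost:.3f}")

evaluate("frontier", frontier_predict)
evaluate("small", small_predict)
# Ramp down to the small model only if its average quality stays within an
# acceptable delta of the frontier baseline for the cost savings you want.
```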

Demetrios Brinkmann [00:47:27]: One other area I wanted to hit on, Danny, that we talked about a little bit last time we chatted, was the idea of governance and how folks are really trying to figure that out right now, especially with agents being able to read and write, and you're like...

Corey Zumar [00:47:48]: So.

Demetrios Brinkmann [00:47:48]: ...who are we gonna blame when they write the wrong thing? How are we going to keep tabs on this? Especially if we're at an organization with 100,000 people and they are all now leveraging agents.

Danny Chiao [00:48:02]: Yeah, I think governance is a super hot topic for us right now at Databricks, and Corey's actually been in a lot of these conversations more recently than I have, obviously. I guess Databricks is known for governance; we think about governance a lot. I think AI governance kind of breaks down into a couple of components here. Probably the first thing that people talk to us about is, well, I have PII in all my different agents and traces. What do I do with that? How do I ensure only the right people have access to that? And then it funnels out into, well, what data does the agent have access to as well, and who can use what agents. And it also really gets into that AI gateway realm as well: falling back to different models, controlling costs, and whatnot. So I think we're thinking about all that stuff right now.

Danny Chiao [00:48:55]: On our side, we do have an AI gateway as a managed component. Corey, is it controversial to say that we are kind of exploring open source?

Corey Zumar [00:49:08]: I would go beyond that and say, you know, that MLflow has, and for a while has had, an open source...

Danny Chiao [00:49:12]: Gateway, but I think the plan is to invest substantially more into kind of an open source governance and gateway solution here, right?

Corey Zumar [00:49:21]: A hundred percent, absolutely. And so going a level deeper on that axis, I think Danny did a fantastic job of discussing the aspects of model access to data and then developer access to the data that an agent produces. Going back to that model access to data problem, we are invested pretty heavily in providing cost controls for model access as part of open source MLflow, and also providing data access controls. So I've built a bunch of tools, I have this awesome tool catalog with MCP, and I'm going to hook it up to my agents. How do I make sure that the right tools are exposed to the right agents for different use cases, and that you don't have an agent accessing data that it shouldn't be allowed to, for example about a completely different customer?

Corey Zumar [00:50:12]: These knowledge bases are pretty sensitive, so locking down tool access is really important. But then, as I mentioned, there are also cost controls. I have two teams building agents. One is building a low-QPS chatbot; the other one is processing billions of documents every day. They require different budgets, and blowing the budget on one project and exceeding costs can mean that the other developers don't get to do anything. And so folks definitely need what we call this AI gateway that allows organizations to say, all right, for your use case, here's your budget. Here are the models you have access to.

Corey Zumar [00:50:49]: Here are the tools that you have access to. If you need something else, come and talk to us, talk to your admins, and we'll grant that access in a structured manner. Otherwise it's API keys flying around and unregulated access to data; it becomes chaos. So there are real risks there from a model-access-to-data and cost governance perspective, and I think we could probably talk about that all day. As Danny mentioned, in terms of developer access to agent data, that second consideration, there are some big challenges around that as well: PII, ensuring that only privileged eyes can look at certain slices of those traces. That's all super important.
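
A toy sketch of the per-team policy an AI gateway enforces in Corey's description: before a request is forwarded to a model, check the team's allowed models and remaining budget. This is a plain-Python illustration of the policy, not MLflow's or Databricks' gateway implementation; the teams, model names, and costs are made up.

```python
class BudgetedGateway:
    def __init__(self):
        # Per-team policy: allowed models and a spending cap, set by admins.
        self.policies = {
            "support-chatbot": {"models": {"gpt-4o-mini"}, "budget_usd": 50.0},
            "doc-pipeline": {"models": {"gpt-4o-mini", "gpt-4o"}, "budget_usd": 5000.0},
        }
        self.spend = {team: 0.0 for team in self.policies}

    def complete(self, team: str, model: str, prompt: str, est_cost_usd: float) -> str:
        policy = self.policies[team]
        if model not in policy["models"]:
            raise PermissionError(f"{team} is not allowed to call {model}")
        if self.spend[team] + est_cost_usd > policy["budget_usd"]:
            raise RuntimeError(f"{team} would exceed its ${policy['budget_usd']} budget")
        self.spend[team] += est_cost_usd
        # Placeholder for the actual provider call made behind the gateway.
        return f"[{model}] response to: {prompt[:40]}"

gateway = BudgetedGateway()
print(gateway.complete("support-chatbot", "gpt-4o-mini", "Reset my password", 0.001))
# gateway.complete("support-chatbot", "gpt-4o", ...) would raise PermissionError.
```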

Corey Zumar [00:51:34]: And that's where Databricks Unity Catalog has been crafted and tuned for years to provide governed access to model data, agent data, BI data. Basically, this is no different than any arbitrary set of sensitive data you're ingesting into our platform; it's just that it's coming out of an agent. You need to govern it and manage it and share it just as securely. And so that's where we've been building Databricks and MLflow on top of Unity Catalog, this amazing, secure, enterprise-grade, governed data lake on Databricks.

Danny Chiao [00:52:14]: Yeah. I have a separate life at Databricks where I also think a lot about PII detection and how it relates to Unity Catalog. And one thing is similar to what we're seeing here, which is that the easiest thing people will start with is to just redact all PII and say nobody has access to this, or totally lock it down. But rarely is that the actual goal. So you need a catalog, and there's been a lot of investment in how you do ABAC, attribute-based access control. What people are trying to do now is: okay, if there's a scalable way to manage governance, then I would actually open up data as much as possible to the people who should have access to it. But in the absence of that technology, they'll say, let me just fully lock it down; nobody can see this information.

Danny Chiao [00:52:59]: And now you're losing a lot of that business value that you could be getting access to, right? I think that's also going to play out within the agent space, where if you want to really debug why an agent is going wrong, or you want to do downstream BI insights on this stuff, sometimes you are going to need to capture some sort of PII; it's just that not everybody should have access to it. And so where we've at least seen some early success here is having all these traces get ingested directly into a Delta table. Then you have access to all the governance controls you have with Unity Catalog, and now you can go back to having open access and sharing and doing all your downstream insights, right?

Danny Chiao [00:53:44]: Whereas for a lot of the other competitors or vendor solutions out there, I think their default reaction is: let's give you a tool to just redact all your PII and have it never land to begin with. To me, that's a great first step, but it means that you're locking away a lot of that super valuable data that you could be really analyzing and parsing.

Demetrios Brinkmann [00:54:03]: Well, I've talked to people who can't use memory tools, because the memory tools, just by default, the open source ones that you get off the shelf now, expose all of this conversation that's happening with their users. And there's PII all over; it's just littered throughout. Even if you redact data that's obviously sensitive, like a Social Security number, for example, inside of the enterprise, if you're asking a chatbot to help you craft an email to fire someone, that's also sensitive. And that's much harder to catch when you're thinking of it in the traditional sense of PII. It's not a credit card number, it's not a Social Security number, but it is very sensitive.

Jules Damji [00:55:00]: So Demetrios, you're talking about what Danny was talking about, the PII that can be redacted, and that comes up quite a bit in meetup talks about how you're actually going to redact the traces. But Demetrios, I think you bring up an important point: I'm crafting an email which is going to fire someone, or I'm crafting an email that's going to promote someone. That's considered kind of, you know, private information.

Danny Chiao [00:55:24]: Yeah.

Jules Damji [00:55:25]: How do you actually redact that? You know, what's the context? The context is important. How do I actually take a piece of a trace, or a piece of output that has come up, and then evaluate and categorize it as sensitive or non-sensitive? That's again an evaluation criterion.

Danny Chiao [00:55:40]: Well, I was going to say, the process of redacting PII is actually a hard process. A lot of people will start by thinking, let me just write some random regexes. People who are a little bit more sophisticated might use Microsoft Presidio. But everybody I've talked to, including ourselves, who've tried it says you just get so many false positives, and you miss things. So what we ended up having to build for our own PII detection offering is essentially an agent, a whole complex sequence of LLMs and heuristics and whatnot, to actually figure out, at relatively low cost, what is PII and what is not. And we've seen that it is much, much better than the off-the-shelf solutions out there.

Corey Zumar [00:56:23]: And what makes this a lot easier is actually having access to the raw data. If I can treat this thing as a data frame or as a table and use all the brilliant SQL and Python ecosystem tools for scrubbing that data, analyzing it, tweaking it, feeding only subsets that I know are safe into a memory database, then when Demetrios's tool reads from it, I know that it's reading information that is safe and secure. That's really important. Unfortunately, a lot of solutions out there don't actually give you access to the raw data. Typically these traces get dumped into some shared, big, multi-tenant Postgres database, and if you ask, hey, how can I get a table or a data frame or the raw records so that I can do some more complex analysis or find PII and redact it, the answer is good luck, you can't actually get that data back. And so that's where, again, this Databricks Unity Catalog Delta table approach, which stores all of your data in beautiful tabular formats and data frames, makes this kind of analysis a lot easier. You can plug in any tool from the ecosystem, you can use SQL, you can express any sort of transformation or PII redaction task that you'd like.

Corey Zumar [00:57:48]: And so I think that's really powerful as well.
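
Here is a sketch of the "traces as a DataFrame" workflow Corey describes: pull trace text into pandas, detect and redact PII with Microsoft Presidio (the off-the-shelf option Danny mentions), and only then let it flow into a memory store or broad-access table. Presidio's analyzer and anonymizer are its real public API, but as Danny notes, expect false positives and misses compared to a more agentic detection pipeline; the DataFrame shape is illustrative.

```python
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str) -> str:
    # Detect PII entities, then replace them with placeholder tokens.
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# Stand-in for traces exported from a Delta table or a trace search API.
traces = pd.DataFrame({
    "trace_id": ["t1", "t2"],
    "request": [
        "My name is Jane Doe and my phone number is 212-555-0199",
        "Can you check order 4521 for me?",
    ],
})

traces["request_redacted"] = traces["request"].apply(redact)
print(traces[["trace_id", "request_redacted"]])
# Only the redacted column should flow into memory stores or broad-access tables;
# the raw column stays behind catalog-level access controls.
```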

Danny Chiao [00:57:51]: Yeah, I'm excited for what future stuff we start building with Lakebase as well, because Lakebase is kind of tailored to this use case that, Demetrios, you're talking about, right? You want a database that has fast access and retrieval and whatnot, but is well governed. And that is, I think, kind of the promise of Lakebase as a database as well.

