GenAI in production with MLflow // Ben Wilson // DE4AI
Tech Lead of the AI OSS Ecosystem team at Databricks - the maintainers of MLflow. Author of Machine Learning Engineering in Action https://www.amazon.com/Machine-Learning-Engineering-Action-Wilson/dp/1617298719/ . Co-host of the Adventures in Machine Learning podcast.
We'll be covering recent advancements in MLflow's supported integrations for GenAI application lifecycle management, from GenAI application tracking and evaluation to deployment and monitoring. Part of this talk will focus on the future of GenAI support in MLflow and our vision for supporting advanced agentic solutions.
Link to Presentation Deck: https://docs.google.com/presentation/d/1nOD1D3Q19EK2UlirqkxPEHqKhWuAOMVR0EIHhO6QhK0/edit?usp=drive_link
Adam Becker [00:00:06]: And we have Ben coming up. Ben, are you with us?
Ben Wilson [00:00:10]: Ben is with us.
Adam Becker [00:00:12]: Ben is with us. How are you, Ben?
Ben Wilson [00:00:14]: I'm doing great. How are you?
Adam Becker [00:00:16]: I'm stoked to hear about what you're up to. We're going to be discussing MLflow in the context of GenAI. Is that true?
Ben Wilson [00:00:23]: Yeah, some exciting features the team's been working on.
Adam Becker [00:00:26]: Okay, so let's see them. Do you have your slides ready?
Ben Wilson [00:00:31]: I do.
Adam Becker [00:00:31]: Here they are. Awesome. Okay, Ben, so good luck. I'll be back.
Ben Wilson [00:00:37]: Sounds good. Hey, everybody, I'm Ben. I'm one of the MLflow maintainers. I work at Databricks in the software engineering department for ML. I've been at Databricks about six and a half years now, and I work on the team that supports a lot of open source packages related to ML. Today I'm going to be talking about something a lot of people have been talking about recently, which is agentic workflows, and how you can actually use your own data sets and your large volumes of data to build intelligent agents using GenAI. I'll be focusing particularly on the production aspects: how to go from the demo state to something you can deploy and actually get real use out of your data, without having to expend a lot of time and energy analyzing it. All right, so a quick introduction about why we're talking about this and why agents are so powerful for leveraging your data.
Ben Wilson [00:01:39]: The amount of data being collected around the world is increasing at an alarming rate. And with this increased amount of data that companies are pulling in, being able to make use of it in a timely manner is complex. Over the years, people have touted that we can build analytics use cases that make it simpler to analyze your data and gather insights from it, but that restricts access to that data to highly specialized people who know how to write code, write queries, and build visualizations that tell a story from it. It doesn't help the end users who want to get an answer based on company data really quickly. Being able to parse through that data and get those answers is very complex and requires a lot of specialized training. With agentic frameworks, we can leverage GenAI capabilities with large language models and agentic tools: functions that can execute against your data or query it, as well as some rather clever things that have been put out in open source packages recently, where multiple agents take turns with each other, each playing a different role, in order to provide the most accurate answer based on the data available to them, answers that are contextually accurate to the question being asked. Now, getting started is a little bit complex because there are so many different providers out there offering LLM access and tools that allow you to build these agents. The field is very crowded, and if I were to list out all the companies whose products I've either tested, worked on integrations with, or just heard about, I would have an entire slide deck of probably 50 slides with just logos.
Ben Wilson [00:03:36]: So it's a very crowded space. A lot of players have been around for a while, with varying levels of sophistication in what their offerings can do. Because of all this complexity, it's really hard to know how to get started on building these things, how to leverage your data with these new technologies, and how to solve an actual business use case that you need. Now, at its core, the kind of agent we're talking about, slightly different from the LLMs most people are familiar with, is basically software: a system designed to answer questions by generating content based on the context it retrieves. These agents can implement very complex back-and-forth decision-making processes, access system data through techniques like retrieval to fetch your data, and employ techniques such as few-shot learning to optimize the responses they generate for further processing. What this means, in a nutshell, is that instead of just interfacing with a generally trained model like you would with, say, OpenAI's ChatGPT, you can get contextually aware responses to what you're asking about, because the agent can leverage the data you've put in an indexable system and supply it to a general language model. The TL;DR is that you can use your data with a very fancy model that somebody else has trained, get answers quickly, and actually ask very complex questions because of the multi-turn architecture that is employed.
Ben Wilson [00:05:25]: It can evaluate and adapt to staged responses that are happening based on actual data. Now, for getting started with building one of these things, there are a lot of examples out there, countless blog posts; even the MLflow group has written a bunch of them, and you can check them out on mlflow.org. But there's a massive leap between building something that is a really fun demo that might take an afternoon and moving over to a production-grade agentic system that can be deployed, where you could actually expose these endpoints to your entire business or to customers so they get better responses to questions they might be asking, or insights they want about the company or about what it is that you're doing. Some of the things that make this very complicated are that when you build one of these agents just interfacing with those APIs, it's effectively a black box. You may be defining what you're actually doing in that agent, but you can't really see what's going on for an individual request that comes in. So they're incredibly hard to debug.
Ben Wilson [00:06:37]: It's basically this non-deterministic system where you have no idea why it's making the decisions it's making or why it's answering the way it is. There's also the complexity of retrieval relevance. There's a whole bunch of things that need to be thought through when you're taking your data, indexing it, and putting it into a vector retrieval system: making sure you're chunking your documents in a way that makes information retrieval as relevant as possible. You don't want too big a context coming in, because that costs a lot of money and takes a while to search through, and also not too small, so that you're not losing context on a broader, more complicated question, as well as how many documents to retrieve and what to do with them. The prompt engineering that goes into what is actually being sent to the general language models is complicated and takes a lot of time to optimize. And then there's the index retrieval stage itself: how does that system actually return the documents? It's really hard to evaluate this, test it out, and go through a process of seeing which configurations work best for the particular scenarios you're trying to build for. The process of building these things is also much faster than in traditional ML.
Ben Wilson [00:07:57]: You can get responses really quickly, you can change just a few things, and it's pretty easy over a short period of time to get lost in the weeds: okay, I've tested out a thousand things, and something from 40 or 50 iterations ago actually worked better than the rabbit hole I'm going down right now. How do you get back to that, or see what its state was while you were developing it, 40 or 50 iterations ago, a couple of hours before? So keeping track of everything is pretty challenging. And then there's which library to integrate with. There are lots of them out there that allow you to build agents. All of them take slightly different approaches to what they're trying to solve and what works best for different scenarios and use cases. So it's really hard to know where to go and what the capabilities of all these different things are without spending weeks or months evaluating all of them for a production use case.
Ben Wilson [00:08:54]: What our team has been doing and focusing on over the last year or so is building a bunch of tooling into MLflow to support the GenAI work stream as best as possible. These include tools that allow you to leverage what MLflow has historically been good for, which is tracking your experiments and making sure you know what you've done in the past, so that you can go back and either resume from that point or deploy what you've actually recorded to a serving endpoint. But on the GenAI side, we've looked at what these problems are. We've gone through and dogfooded this. We've built a bunch of stuff where we've tried to walk in a user's shoes and see what the pain points actually are and why it's so difficult to build these things for a real-world use case. So we built MLflow Tracing, and that allows you to see inside that black box. It's no longer a black box. You can see exactly what it's doing, what each input going into each stage of an agent is, and leverage your analysis of that to make modifications to your agent to hopefully make it better.
Ben Wilson [00:10:12]: Then there's being able to evaluate the retrieval relevance: the documents, those chunks that have been encoded, seeing which ones are returned and what their contents are. That's something that is baked into MLflow Evaluate as well as visible within MLflow Tracing; you can see exactly what is being fetched and whether those documents are actually relevant, and whether you need to do something with your data, or with how you're indexing it, to make it more relevant. Evaluate helps out with prompt engineering and the vector index configurations for retrieval too. So for every iteration you're doing, every test, you have a hypothesis of how to make something better. You can evaluate it and test it on a static data set where you have a question and a gold-standard answer, measure how close it gets to that, and score it. And then for the whole process of this fast iterative development, when you're trying these things out and trying to make improvements or build new functionality, you can basically snapshot and log the state of that in a system you're hopefully pretty familiar with, MLflow Tracking. You can get the exact state of what was logged to the tracking server, see it in the UI, look at all the metadata associated with it, what configurations you used, and also see the evaluation results. And then you can look at the traces and see what's really going on and which candidate is the best for deployment.
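For reference, a minimal sketch of what that "static data set plus gold-standard answer" loop can look like with mlflow.evaluate; the model URI, question, and reference answer below are made-up placeholders, and the LLM-judged similarity metric is one of the built-in options rather than anything specific to this talk.

```python
import mlflow
import pandas as pd

# A tiny static evaluation set: a question paired with a gold-standard answer.
eval_data = pd.DataFrame(
    {
        "inputs": ["What does MLflow Tracing let you inspect?"],
        "ground_truth": [
            "The inputs and outputs of each step an agent takes for a request."
        ],
    }
)

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/my_rag_agent/1",   # placeholder URI for a previously logged agent
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # enables built-in QA metrics
        # LLM-judged score of how close each answer is to the gold standard
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
    )
    print(results.metrics)
```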
Ben Wilson [00:11:46]: And then baked into support for this is a number of GenAI libraries we're making official integrations with, which simplifies the process of actually logging these things, using them, and deploying them. And here on screen is a quick demo of tracing in MLflow. This is a pretty cookie-cutter one where it's just very simple arithmetic being done, but it shows the general UI approach: here's my entry point into my application, some functions that I'm calling, and I'm getting a report on what actually happened at each of these steps. When I call an addition step and then a subtraction step, I'm seeing what the inputs and outputs are for each of those individual stages. Then, at the top level, the parent span shows the overall input and output: what the user actually provided and what the system returned. But you're able to really dig into that.
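A rough sketch of how a trace like that arithmetic demo is produced: the @mlflow.trace decorator turns each function call into a span, with the outermost call becoming the parent span holding the user-facing input and output. The function names and values here are purely illustrative.

```python
import mlflow

@mlflow.trace
def add(a: float, b: float) -> float:
    return a + b

@mlflow.trace
def subtract(a: float, b: float) -> float:
    return a - b

@mlflow.trace
def compute(x: float, y: float, z: float) -> float:
    # Each nested call appears as a child span with its own inputs/outputs;
    # this call becomes the parent span with the top-level input and output.
    return subtract(add(x, y), z)

compute(4, 2, 1)
```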
Ben Wilson [00:12:48]: Here's another example of tracing for AutoGen, which is an agentic framework that allows for multi-turn passing between agents defined with specific roles. You can integrate tool calling and such into it as well, but we can see this particular trace. This is mapping out everything the agent has been doing. So the initiate chat step shows us the input that came into the call to the agent, and then the final output that the agent responds to the user with. But we can see these different assistants that have been defined within here and the individual steps happening within each assistant. So the assistant is making a call to a language model and getting a response from it. We're logging all of those things, including the metadata parameters, which are logged as attributes, things we really want to know when we're evaluating the results of an agent. What model did it use, and what were the configs passed with it? Do we need to change some of those in order to improve the responses, so you can iterate and do another test with slight modifications? You have this state, the source of truth, the record of what has been tested over time.
Ben Wilson [00:14:03]: For some of the libraries, we offer automatic tracing through MLflow's autolog feature, where you don't have to configure anything. You don't have to set any particular calls within your code; you just call autolog and we'll trace for you. We'll apply all of the instrumentation needed to wrap all of these calls for these agentic frameworks. So AutoGen, LlamaIndex, released recently, LangGraph, which is part of LangChain, and we're currently working on DSPy support. All of these official flavors will be at the top-level namespace of MLflow to make them very simple to use. Just as an example, this is the type of thing you can do with these agent frameworks. I'm loading a model from MLflow that I logged, and this is a fairly complex agent that was built. Once I load that up, I can just call predict on it, and that's what a user's interaction would be. In the backend for this, there's a whole bunch of things going on.
Ben Wilson [00:15:10]: The vector index being used for the agent to retrieve information covers the entire corpus of Wikipedia. So it's got all of this additional data; it's not just a generally trained model answering the question. And it's doing a bunch of steps within here to say, well, I need information about Blu-ray discs: how much data can they store, and what are the dimensions of a Blu-ray disc? There's a standard associated with that. So it's going and pulling all of that data and then making a tool function call to do a bunch of basic math, basically to determine how many Blu-ray discs would be needed. So this abstract, fairly ridiculous question has a bunch of steps the agent needs to do in order to answer it. And there's another endpoint API that is called after this agent is done, in order to generate an image. Agents are capable of building things like this.
Ben Wilson [00:16:11]: With an integration with MLflow, we make it very simple to build these things, and you can even do things like log images like this directly, to record what the state of your agent is over time. And that's about it for my talk. Thank you, everybody.
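For the image-logging piece, a minimal sketch with mlflow.log_image; how the image is obtained and the file names are assumptions, and any PIL image or numpy array would work the same way.

```python
import mlflow
from PIL import Image

# Pretend this is the image the agent's final tool call produced.
generated = Image.open("agent_output.png")  # placeholder artifact

with mlflow.start_run():
    # Attach the image to the run so it can be reviewed alongside the
    # agent's traces, params, and evaluation results over time.
    mlflow.log_image(generated, "agent_generated_image.png")
```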
Adam Becker [00:16:29]: Awesome. Ben, perfect timing. I have a bunch of questions for you, but I am only going to ask you the ones that a couple of people put in the chat before folks shuffle out to the roundtable. Okay, so Ignacio here is asking: do we have similar capabilities in MLflow for diffusion models, or is this only for LLMs?
Ben Wilson [00:16:52]: We're actually doing design on that right now. We want to support image generation models a little bit more natively. There's a blog that I just reviewed this morning, actually, about using AutoGen for automatic generation of improved prompting for a DALL·E 3 integration, and it shows how to actually log the images that are generated and a prompt-to-image index mapping. So check that out on our blog; you'll see exactly how to do this today in MLflow, but we are working on making it a little bit simpler.
Adam Becker [00:17:28]: Ignacio, stay tuned. Kate is asking: I'm working on RAG evaluation right now, and I've looked into TruLens, Giskard, and Ragas. MLflow is in my backlog to try. Can you please briefly compare these tools and say how MLflow stands out?
Ben Wilson [00:17:42]: Funny enough, Giskard actually is a plugin for MLflow, so you can use both at the same time. We have a partnership with them for our Evaluate functionality. Like most tools, most of the evaluations we provide use LLMs as the evaluation tool, which is fairly popular. But we're actually working over the next quarter to improve MLflow's Evaluate capabilities. One of the things we're going to be doing soon is making the prompts that are generated as an evaluation function definition something that's callable, because we want it to be usable in a broader ecosystem. So if other tools have evaluation functions defined, you should be able to use them in MLflow. Right now it doesn't quite work like that, but our process is making it more open so that we're not drawing a line in the sand of, you have to use MLflow and everything's baked in. We want to make it more open.
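For a sense of what LLM-as-judge evaluation looks like in MLflow Evaluate today, here is a minimal sketch of defining a custom judged metric with make_genai_metric; the metric name, grading prompt, and judge model URI are all illustrative assumptions rather than anything prescribed in the talk.

```python
from mlflow.metrics.genai import make_genai_metric

# A custom metric where an LLM grades each answer against a rubric.
conciseness = make_genai_metric(
    name="conciseness",                         # illustrative metric name
    definition="How concise the answer is while remaining complete.",
    grading_prompt=(
        "Score 1-5 for conciseness: 5 means the answer is brief and complete, "
        "1 means it is rambling or padded."
    ),
    model="openai:/gpt-4o",                     # assumed judge model URI
    parameters={"temperature": 0.0},
    aggregations=["mean"],
    greater_is_better=True,
)
# The resulting metric can be passed to mlflow.evaluate via extra_metrics.
```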
Ben Wilson [00:18:44]: So hopefully that gives you some guidance.
Adam Becker [00:18:48]: One more comparison question. Eric is asking: how does MLflow Tracing compare to LangSmith's traces?
Ben Wilson [00:18:55]: They're fairly similar in functionality, and it's not really a differentiator. We think of it as table stakes, and I think most other GenAI frameworks out there are building it because it's such an immediate need for building these things for a production use case. So I think across most frameworks the functionality is going to converge, and the features are going to be pretty ubiquitous across all the different platforms. The thing that stands out for us is that we have Tracking, so we have the entire ecosystem as well as tracing. That's what we're going for: making a single unified experience that makes it simpler to do these things.
Adam Becker [00:19:36]: Ben, thank you very much for your time. If folks have more questions, perhaps you can stick around for a little bit longer in the chat. And good to have you.
Ben Wilson [00:19:46]: Yeah, thanks so much.