LLMOps: The Emerging Toolkit for Reliable, High-quality LLM Applications
Matei Zaharia is a Co-founder and Chief Technologist at Databricks as well as an Assistant Professor of Computer Science at Stanford. He started the Apache Spark project during his Ph.D. at UC Berkeley in 2009 and has worked broadly on other widely used data and AI software, including MLflow, Delta Lake, Dolly, and ColBERT. He works on a wide variety of projects in data management and machine learning at Databricks and Stanford. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Large language models are fluent text generators, but they often make errors, which makes them difficult to deploy in high-stakes applications. Using them in more complicated pipelines, such as retrieval pipelines or agents, exacerbates the problem. In this talk, Matei will cover emerging techniques in the field of “LLMOps” — how to build, tune and maintain LLM-based applications with high quality. The simplest tools are ones to test and visualize LLM results, some of which are now being incorporated into MLOps frameworks like MLflow. However, there are also rich techniques emerging to “program” LLM pipelines and control LLMs’ outputs to achieve desired goals.
Matei discusses Demonstrate-Search-Predict (DSP) from his group as an example programming framework that can automatically improve an LLM-based application based on feedback, and other open-source tools for controlling outputs and generating better training and evaluation data for LLMs. This talk is based on his experience deploying LLMs in many applications at Databricks, including the QA bot on its public website, internal QA bots, code assistants, and others, all of which are making their way into Databricks' MLOps products and MLflow.
Link to the slides
Introduction
Get the real deal on stage and talk with somebody that's actually doing this for real. And where is he? I'm gonna bring Matei to the stage. Paging Matei. Where? Where? Hey, there he is. How's it going, dude? It is great to have you here, man. I am so excited.
It's always a pleasure talking to you. And as I mentioned when you came on the podcast for, I think, ML World and MLOps World, you're like the DJ Khaled of our industry, cuz you just keep dropping hits and hits. But before we start, let's just give an intro, in case, for like the two people that do not know you.
Speaker's Introduction and Accomplishments
You created Spark — you were a co-creator of Spark. You also had your hand in MLflow, which everyone knows and loves and is starting to play around with. MLflow now supports LLMs, which maybe you're gonna talk about, maybe not, but we should talk about that later. And then, since we talked last, man, you dropped Dolly, so, I mean, wow.
Then you've got your name all over all kinds of cool papers. FrugalGPT — if anyone has not read that paper, it is incredible. I really love that you're still keeping your roots in academia and putting out papers. And so, man, last thing I want to announce to everyone before we get started and I let you take over and drop some wisdom on us: in two weeks, on June 26th in San Francisco, you and I are gonna meet face to face.
We're gonna be in person at the LLM Avalanche meetup that we are both doing, and that is like the sound check — it's gearing up for the Data + AI Summit that you all are putting on — so I'm super excited for that. If anyone is in San Francisco on June 26th, let us know. I'm gonna drop a link to that in the chat, and I'm gonna hand it over to you, Matei. Man, it's great to have you here.
Alright, thanks. Excited to be here. Let me share. Ooh, now the fun part starts. This requires AGI to figure out — this is not your regular Zoom call, man. Yeah. All right. Okay. Do you see my slide? I do. I see it. Yep. Okay. Perfect. Alright.
Introduction to the Topic of LLMOps
So, yeah, I'm going to talk about this emerging area of LLMOps and some of the things that I'm seeing in industry at Databricks —
both building our internal LLM applications, of which we have quite a few, and helping our customers with theirs. And I'll also talk about some research that I've been doing at Stanford, even kind of before ChatGPT became really popular, on using LLMs more reliably.
That is also, I think, relevant to this field. So it'll be a tour of a bunch of things — I'm also gonna list a bunch of stuff that I'm not working on at all but that I think is cool. And hopefully it gives you a little bit of a perspective on how to approach this problem.
Okay, so large language models are super amazing. They can generate text that's very fluent and often very grounded in reality. You can do a lot of cool applications with them, all the way from the traditional few-shot classification type of stuff to new things like chatbots where you can ask questions.
They know about all kinds of things, like research and stuff like that. They can even do programming, so very cool.
Challenges and Problems with LLMs
But if you look at these answers a little bit more closely, there are actually a bunch of problems. So for example, in this first one we asked ChatGPT: when was Stanford University founded?
And it says 1891, but that's actually wrong — that's just when the first class came in. The university was founded earlier. ColBERT — this is a model that my research group developed, so we were super excited that ChatGPT, like, knew something about it. But it says it was developed by Salesforce Research, which it wasn't.
It was developed by us at Stanford. And finally, this code here — you can't see it all, it's not showing all the code — but it's saying that to parallelize this code, you should use a ThreadPoolExecutor. The problem with the code was actually that it's some kind of serial algorithm, so you need to change the algorithm — you can't just throw in a ThreadPoolExecutor.
So there are definitely challenges with the quality of responses from LLMs, especially if you want a production-grade app. And on top of that, there are also other operational challenges with cost, performance, privacy, and just the need to update your model and your application. So here are some examples of things that can go wrong.
Of course, these models are expensive to run. Early on, they used to go down some of the time because of the load. Another interesting issue that people have been talking about is model drift, which is that the models are changing over time as they're being retrained, and they might get worse at some tasks they could do before.
And then there are also issues with timeliness and with privacy that affect their reliability. So even today, if you ask all these OpenAI models about events past 2021, they don't know. And obviously it's expensive to update them, otherwise they would have updated them by now.
And if you have an application with privacy requirements — with the emerging rules around privacy — you might also run into issues there when a user wants to remove their data. So, for example, ChatGPT seems to know something about where Bill Gates lives, but if Bill Gates were a European or a Californian, he could ask them to remove that data about him, and then it would have to forget this information.
So the question is, how can we make LLM apps reliable? This is the area that people are calling LLMOps, and I'm gonna talk about three things in this area. The first one is how we extend what we currently do in MLOps to work well for generative AI — and I'll talk about what we're doing in MLflow specifically, based on our experience.
The second is new programming models for these kinds of applications, which is one of the things I've been doing at Stanford. And then I'll talk a little bit about other emerging tools too. Okay, so let's start with the basics. I really think that for anything you're doing with language models, even though it's a really cool new kind of application, if you want it to work very reliably and you want a way to improve it, you need to start with the basics.
And these are the same as in MLOps the way it normally was. One of the most basic things you need is a way to swap and compare different models, or even pipelines of models — because a lot of this is done using chaining now — for an application, and just see how they're doing at different stages and how they compare on different metrics.
So you should set up your development infrastructure to support this simple kind of comparison task. A second thing you need is to track and evaluate outputs, and have this historical database of what happened so that you can search it. You need a way to deploy pipelines reliably and see what's happening.
And you need a way to monitor these deployed applications and analyze the data once they're actually out, to see what's happening with them over time. So we're basically implementing all of these in the MLflow open source project, and the part that ties with data
we're also integrating into Databricks as a platform, if you want to have kind of a consistent platform across data engineering and ML engineering. We're designing it so it's very easy to do both of these things. So let's start with the first one: swapping and comparing language models.
Making LLM Apps Reliable
So, in MLflow, if you're not super familiar with it, one of the key concepts is this model abstraction that can wrap a model, so the rest of your code doesn't need to know how it's implemented. You don't need to know what framework it's using. Is it a local model, or something
I'm calling over the network, like OpenAI? Is it a single thing, or is it actually a pipeline of multiple steps? You just basically have this function — for example, a function from string to string could be a call to OpenAI, or a local model, or some kind of pipeline. And we've created integrations with a lot of the popular LLM tools to let you manage them as models in there.
And the same thing with pipelines — specifically, we've been focusing on LangChain for that. So you can easily wrap all these things into these model objects, and then in your application you can swap them, in your evaluation code you can compare different ones — you could imagine routing some percent of traffic to one versus the other, and so on.
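To make that concrete, here is a minimal sketch of the kind of wrapper being described, using MLflow's generic pyfunc model interface. The `QABot` class and the `backend` callable are hypothetical names for illustration; the point is that the rest of the application only sees a string-in, string-out `predict`.

```python
import mlflow.pyfunc

class QABot(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper: callers only see predict(), not the backend."""

    def __init__(self, backend):
        # `backend` could be an OpenAI call, a local model, or a whole chain.
        self.backend = backend

    def predict(self, context, model_input):
        # model_input is a list (or DataFrame) of questions; return one answer per question.
        return [self.backend(question) for question in model_input]

# Swapping implementations is then just a matter of logging a different backend:
# mlflow.pyfunc.log_model("qa_bot", python_model=QABot(openai_backend))
# mlflow.pyfunc.log_model("qa_bot", python_model=QABot(local_llm_backend))
```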
So then you can just call these general functions on each one. That's very basic, but important, and we're seeing this come up a lot as people want to try different options in this space. A second thing that we're doing, for the experiment tracking, is extending what we call autologging in MLflow.
In MLflow, the idea is you develop your machine learning code and we have some lightweight calls you can make that cause MLflow to record what's happening, wrap it into these kinds of generic formats like the model I talked about, and make it easy to evaluate, compare, and track results over time.
One of these is, you can just do mlflow.start_run() and then you can train models and log them and so on. We've extended this autologging for OpenAI, so you can basically remember your prompt and the other parameters you used with OpenAI, wrap that up, and call it later with model.predict.
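Roughly, that looks like the sketch below, assuming the MLflow OpenAI flavor and the older `openai` SDK interface; the exact arguments vary by version, and the system prompt here is made up.

```python
import mlflow
import openai

with mlflow.start_run():
    model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="qa_model",
        # The prompt and parameters get captured alongside the model.
        messages=[{"role": "system", "content": "Answer the user's question in one sentence."}],
    )

# Later: load it back as a generic pyfunc model and call predict on new inputs.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["When was Stanford University founded?"]))
```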
And we've also extended it with LangChain, so you can set up a chain with local models or remote models and save it, and again load it later and call it. Once you've run these, the other thing we've done is in the UI: we've created a new view that works well for evaluating long-form generative outputs, like these texts from these models.
So basically you can run all these models on some evaluation data sets — some questions or whatever, some input texts — and then you can see, for each one side by side, what they're each producing. And we're also gonna extend this so you can actually add ratings right there, or import ratings from a third-party labeling service.
So you can compare all of them. And of course, once you've saved these models, you can also evaluate them programmatically. We have this function, mlflow.evaluate, with a lot of the common metrics people want, and we've put in a lot of the metrics you want for LLM applications, like the ROUGE metrics for summarization and various other ones — some toxicity metrics, various things like that.
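As a rough sketch of what that evaluation call looks like — the model URI, data frame, and column names are placeholders, and the exact set of built-in metrics depends on the MLflow version:

```python
import mlflow
import pandas as pd

# Hypothetical evaluation set: questions plus reference answers.
eval_data = pd.DataFrame({
    "inputs": ["When was Stanford University founded?", "Who developed ColBERT?"],
    "ground_truth": ["Stanford was founded in 1885.", "A research group at Stanford."],
})

results = mlflow.evaluate(
    model="models:/qa_bot/1",           # any logged pyfunc model URI
    data=eval_data,
    targets="ground_truth",
    model_type="question-answering",    # selects default text metrics (exact match, toxicity, ...)
)
print(results.metrics)
```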
You can also easily extend this to have your own. So that's basically working with the models themselves. Then of course you also want to deploy the models and check what's happening after. The way MLflow works, you can deploy a model to a variety of different serving layers.
We want to make it portable across them, so you can pick your favorite one here, and once it's deployed you can get these predictions out of it. And this is one of the areas where, on Databricks specifically, we've integrated this very closely with what you do for data engineering: basically, as your model runs, you just get a table that's automatically created.
And you can run alerts on the table. You can basically have a SQL statement and say, alert me if this query ever returns something, that kind of thing. So it's easy to do that, but you can use whatever deployment tool you want with this stuff.
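As an illustration of that kind of check — the table and column names here are invented, and on Databricks you would more likely point a scheduled SQL alert at the automatically created inference table rather than poll it from a notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical query over an inference log table written by the serving layer.
flagged = spark.sql("""
    SELECT count(*) AS n_bad
    FROM prod.qa_bot_inference_log
    WHERE request_time > current_timestamp() - INTERVAL 1 HOUR
      AND (latency_ms > 5000 OR response = '')
""").first()["n_bad"]

if flagged > 0:
    print(f"ALERT: {flagged} slow or empty responses in the last hour")
```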
Just a little bit more on some of the things we're doing with this at Databricks: we've also released some reference applications — for example, how to do a customer support bot that includes a vector database, a language model, and the ability to say it doesn't know about certain topics, and so on — some kind of filtering. And we've also released a free course on edX on LLM applications through production
that covers what we've been seeing at different stages of this workflow — not just LLMs themselves, but all the peripheral things like monitoring, searching, vector databases, chains, and stuff like that. So that's kind of the basics of MLOps, and they will definitely help you get from a demo, which is usually easy to make with these LLM tools, to something that works reliably — 80, 90, hopefully eventually 99% of the time or more.
Future of Programming with LLMs
But it's still a lot of work to make these things happen. So I also wanted to talk a little bit about what the future of programming with these models could look like. This is one of the projects that my grad students have been doing at Stanford. It's called DSP, or Demonstrate-Search-Predict.
So today, if you look at not just LLMs but foundation models in general — these pretrained models trained on large data sets — there are so many different tools and techniques you can use. Just for language models, there are different ways to use them: there's instruction prompting, chain of thought, agents, fine-tuning, and so on.
And in addition to language models, there are different retrieval models — your favorite vector index and database — task-specific models, and maybe models for other things as well, like vision and speech, that you want to combine. And most people want to build these into pipelines. We're seeing that — you've seen it if you've used LangChain, probably — it's just a very natural way to work with these.
So the idea is you break down your problem and delegate smaller subtasks to the models. I'm showing a couple of pipelines from the research literature on the right that kind of did this by hand. So in principle it sounds great that you can just make a pipeline, but in practice it's actually pretty hard to connect all these pieces and then to optimize the whole pipeline.
You know, you want to get from whatever default accuracy it has as you hook everything up to higher accuracy, and, say, eliminate some type of bad output it's producing, or something like that. You've just got all these models sitting there, maybe with prompts or with some Python code to glue them together, and it's quite hard to fix that.
And every change you make to the pipeline affects everything in it — all the other models. For example, if you were training one of the models, or fine-tuning it to deal with whatever input format you're gonna give it from above, you'll have to retrain it if you change what you're doing above.
So we want developers to be able to step away from this and just think about their system design and components, not about gluing and optimizing all these stages together. That's what we're trying to do in this programming model, DSP, or Demonstrate-Search-Predict. So DSP is a declarative programming model for these pipelines.
Basically, think of it a little bit like PyTorch: in PyTorch you can hook together layers that are different operations and pass data in the form of tensors between them. In DSP, the layers are actually different foundation models — language models or
retrievers, or potentially other things, you know, a calculator, maybe some other string-to-string function. And the stuff you're passing between them is text objects, not tensors. So there are three types of functions in DSP, or primitives, you can use in your program.
There's demonstrate, which is how you tell it what to do — this can be basically providing some training data or other ways to specify constraints. There's search, which is breaking down problems and looking up useful information — this is where something like a vector database would come in. And then there's predict, which is
different strategies for using the information you pulled out, checking the quality — asking, is this really the right answer, and so on. And you set up your program, but once you've set up the different steps in it, you delegate to the DSP runtime to figure out how to implement each component.
It sees the whole pipeline and it tries all these techniques I had on the previous slide — few-shot prompting, chain of thought, different ways of selecting data for each step — to automatically tune the whole pipeline, so you don't have to do that yourself. So let me just show a little bit of an example here.
This is from a research project we did called Baleen, which is a system that can answer complicated questions using multiple searches over text documents — so for example, over Wikipedia. One example of the kind of question you could ask is: when was the creator of Hadoop given an award?
So Hadoop is a piece of software that was designed by Doug Cutting, and then Doug Cutting got an award in 2015. But to answer this question, if I gave you this question as a person, you'd have to do a bunch of research — a bunch of searches over Wikipedia or whatever — to figure this all out.
So this is how you write that in DSP, and hopefully you see it's very simple. Basically you have this Example — the Example is what holds our state. You can have different fields on it, and you can have a maximum number of searches, or hops, it's doing. And what you can do is say: hey, first generate a question from this example — some query that I want to run
on my search engine, for example, "Who was the creator of Hadoop?" — then search for that and take the top three passages and stick them in this array called passages. And next, generate a summary of everything I've read so far in light of the question and append that to the context.
So that's the strategy: just keep doing these searches, and finally, once you've done this many hops, generate an answer. By the way, these things — the question, the answer, and so on — think of them a little bit like prompts. So it seems a little bit crazy that this would just work out of the box, but if you set up the prompts, it can work sort of okay —
not all the time, but a bunch of the time. Just as an aside on how to set this up, we actually make the prompts somewhat declarative. So basically you just say: for each step, I'm gonna have these inputs, like a context and a question here, there's a text description, and then I'm looking for an answer, which is a short sentence. And the DSP framework can figure out how to actually word the prompts to give you this.
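Here is a rough sketch of that kind of program, in the spirit of the DSP snippets being described. The names — `dsp.Example`, `dsp.Template`, `dsp.Type`, `dsp.generate`, `dsp.retrieve`, and the two templates — follow the paper's examples but are approximations, not the exact current API.

```python
import dsp  # the Demonstrate-Search-Predict package (approximate API)

# Declarative "prompt signatures": describe the fields, and let DSP word the actual prompt.
search_template = dsp.Template(
    instructions="Write a search query that helps answer the question.",
    context=dsp.Type(desc="passages gathered so far"),
    question=dsp.Type(desc="the question"),
    query=dsp.Type(desc="a search query"),
)
qa_template = dsp.Template(
    instructions="Answer the question.",
    context=dsp.Type(desc="passages gathered so far"),
    question=dsp.Type(desc="the question"),
    answer=dsp.Type(desc="a short sentence"),
)

def multihop_qa(question: str, hops: int = 2):
    example = dsp.Example(question=question, context=[])    # the Example holds the state
    for _ in range(hops):
        # Ask the LM for a search query for this hop (e.g. "Who was the creator of Hadoop?").
        example, completions = dsp.generate(search_template)(example, stage="search")
        passages = dsp.retrieve(completions.query, k=3)      # top-3 passages for that query
        example.context = example.context + passages         # append them to the running context
    # Finally, ask the LM for a short answer given everything gathered so far.
    example, completions = dsp.generate(qa_template)(example, stage="qa")
    return completions.answer
```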
And this is one of the cool things that lets you evolve your program and optimize it over time. So if you just ran this program end to end, DSP would generate a prompt that looks like this as it's going along. Let's imagine it did the searches correctly and it found some stuff.
It would have these passages here — what is Hadoop, who was it created by, whatever. And it uses these items: it says, hey, I'm looking for a short sentence, so it says "answer with a short sentence," right? And it even automatically does chain of thought — it says, hey, fill in this stuff.
So this is kind of the prompt it builds, and the part in green is what the model fills in. In this case you get the right answer, but again, this is basically zero-shot — I prompted everything — so it probably won't work all the time. Let's say it only works like 30% of the time or something.
How do I make this program better? This is where you can feed DSP additional data and it'll automatically tune the program to make it work. So one of the things you can give it — and this is this function called dsp.compile — is a bunch of labeled examples of a question like this and the final answer. You don't even have to show it the searches it should do.
DSP is going to try running each of them multiple times and figure out if it can get a path that actually does some searches and gives you the right answer, because it knows the right answer for each one. And for the ones where it does, it will use them as few-shot examples in your bot.
So, for example, to teach your model — let's say the final call here is to something like OpenAI — how to use this kind of context and question and rationale: we gave it this one, but maybe we have an example from the training data where we found that we could answer something.
You know, there was this question about which award the first book of Gary Zukav received, and we did these searches and we found some stuff. So we automatically found an example that we can put in there — or maybe multiple examples — and we've just gone from instructions alone to few-shot prompting, which works a lot better.
And the other thing it can do here, if you've got enough examples, is actually fine-tune a supervised model for each part. So instead of trying to do it with chain of thought and so on, you can train a model that's good at answering these kinds of trivia questions, or at generating the searches for them, which works in a similar way.
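Sketched from the talk's description — again with approximate names, since the real call signature may differ — the compile step takes question/answer pairs and bootstraps the demonstrations for you:

```python
# A few labeled question/answer pairs -- no intermediate searches needed.
train = [
    dsp.Example(question="Which award did the first book of Gary Zukav receive?",
                answer="U.S. National Book Award"),
    dsp.Example(question="When was the creator of Hadoop given an award?",
                answer="2015"),
]

# dsp.compile runs the program on the training questions, keeps the traces that
# reach the right answer, and reuses them as few-shot demonstrations (or, with
# enough data, as fine-tuning examples for a smaller model).
compiled_qa = dsp.compile(multihop_qa, examples=train)
print(compiled_qa("When was ColBERT introduced?"))
```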
So that's the idea here: we want to separate your specification of the pipeline from the optimization, and work on them separately. What this gives you is a great way to do iterative development. It's super easy to get started, and it does something out of the box with instructions, which is probably what you would have done yourself.
It's modular, because it unties the system design from prompting and fine-tuning — as we update the DSP software, those parts get better and you don't have to rewrite your whole program. And it also enables optimization, because as we make new releases of, for example, the DSP compiler, it benefits all the programs.
And another really cool thing is that it actually works really well in practice. We've been evaluating this on a lot of research tasks, especially these knowledge-intensive tasks that require looking up information, and compared to other few-shot methods — where you don't tune on, like, millions of examples —
this is basically competitive with all of them, and it's this very simple interface. So this plot is showing a couple of different data sets, and all of these things here are different implementations, and you can see the scores on the different data sets. The score with DSP, with the same number of examples — I think it's like 10 or 20 examples — is better than all of them in this case, because it automatically searches for how to optimize this.
So this is DSP itself. I'll skip this one, but another thing we can do is even feed in unlabeled examples and basically use those to train a cheap model that does the same thing our expensive model does — and again, basically even better performance after you deploy. So, yeah, these are a couple of things, but there's quite a bit more in the LLMOps space that you should look at.
One cool thing is constrained generation. There's this project LMQL, and there are also Guardrails and Jsonformer, which force the model to output stuff in a specific syntax — that really helps in some applications. Demetrios mentioned FrugalGPT, which is something we've been developing that can combine
multiple models and route across them, or combine the results from several models, to get better cost-performance — like, better performance than GPT-4 at even lower cost while achieving that performance. And then there are all kinds of ideas for reasoning, post-processing, and filtering that further improve quality.
And definitely, if you want to learn more about these, I do encourage you to check out the Data + AI Summit, which is free to view online, and to check out the LLM Avalanche meetup that we're organizing as well. Cool.
Conclusion
So I can't hear you for some reason, but let's see if I can fix — oh, yeah, that was on my side. Okay, there we go. Anyone who came to the first one of these knows that I spent the first like 10 minutes talking on mute to myself. Oh, no. Yeah. And yeah, having a blast. So I'm waiting for some questions to come through in the chat, because there is a bit of a lag between when we talk right now —
yeah, makes sense — and when people ask questions. But one has already come through, and it's about latency. It's asking: do these searches take a lot of time? That's from Joel Alexander. Yeah, great question. So it depends a lot on what you use for the search. In DSP we actually used our own search index, which is called ColBERT, over Wikipedia.
And so the search is very fast — basically maybe tens of milliseconds for each search. It's comparable to the cost of calling the model. So it doesn't have to be slow — it depends on your setting. If you called out to, like, Bing or Google search, it would be slower.
Mm-hmm. Yeah. That makes sense. That makes sense. And uh, while we're waiting for more questions to come through the chat, I just wanna bring up that there is a whole meetup going on in Amsterdam right now. And we've got a live feed of them. Wow. So I'm just gonna let somebody jump on screen. Look at that. So they don't know they're on live yet.
Now they're gonna get it about 20 seconds later, so we're gonna take 'em off before they even realize that they're on. Yeah, well, we're live. That's cool. I'm also gonna grab — oh man, this is great. So we've got Chip coming up next. But before I bring her on, there is a question that's coming through.
Is DSP better than GPT-4 — is this a correct assumption? So it depends on the task, but for the tests we were doing, which are these complicated knowledge tests like this Wikipedia one, it's definitely better. Yeah. Like, out of the box, GPT-4 won't do that if you just ask the question.
Of course, you can use GPT-4 as the language model in DSP, so you can kind of benefit from everything it's doing — if you use a better language model, the whole program will do better. But combining search plus some strategy — I didn't show it, but it can also do things like take a majority vote across answers and stuff like that —
does do better than just the language model alone. Oh, that's awesome. So the stream just hit the chat, because the chat blew up now. There are so many questions, man. I gotta actually point to them and figure out which the best one is. So: are there any downsides to using DSP versus alternatives?
Yeah, I mean, it definitely depends on the alternative, but definitely this idea of breaking your application down into many steps will add cost and latency as you run each step. So again, in DSP, the hope is that you can optimize it — or, you know, the most important thing is to get the right quality.
But yeah, it'll add some cost. Although, when we started DSP, no one was using chaining at all, so everyone was worried like, wait, it's some cost. But now, if you use something like LangChain or agents, it's a very similar cost. So you can think of DSP as a way to take that programming model but try to figure out how to auto-optimize stuff in the background.
All right, I got one last one for you, and then we're gonna bring Chip on and we'll all have a little chat before she gives her presentation. Question for Matei: how does DSP evaluate its optimizations? Does it do them before deployment or even during execution? Yeah, great question. The stuff I showed — the dsp.compile stuff — is all before deployment, so there's nothing, like, online.
And you can give it some examples where you know the right answer. We're also working on other ways of doing it, where you can give it other metrics or things — you know, you tell it to avoid this situation. So it's all offline, and then you deploy, and you can collect more data and pass that back in.
Oh, killer man. So I knew it was gonna be great. I did not realize it was gonna be so good.