AIOps, MLOps, DevOps, Ops: Enduring Principles and Practices
Charles builds and helps people build applications of neural networks. He completed a PhD in neural network optimization at UC Berkeley in 2020 before working at MLOps startup Weights & Biases and on the popular online courses Full Stack Deep Learning and Full Stack LLM Bootcamp. He currently works as an AI Engineer at Modal Labs, a serverless computing platform.
It may be hard to believe, but AI apps powered by big Transformers are not actually the first complex system that engineers have dared to try to tame. In this talk, I will review one thread in the history of these attempts, the "ops" movements, beginning in mid-20th-century Japanese factories and passing, through Lean startups and the leftward shift of deployment, to the recent past of MLOps and the present and future of LLMOps/AIOps. I will map these principles, from genchi genbutsu and poka-yoke to observability and monitoring, onto emerging practices in the operationalization of and quality management for AI systems.
Slide deck: https://docs.google.com/presentation/d/10DnzXnWnbsunBErhr-G6CTaUoAsYXhoZ/edit?usp=drive_link
AIQCON Female Host [00:00:09]: And I'm here to introduce our next speaker. His name is Charles Frye from Modal Labs. I'm sure you have heard of DevOps. Now we have MLOps, LLMOps, AIOps, and so many others. So Charles is going to introduce those for us. Over to you, Charles.
Charles Frye [00:00:28]: All right. Is that. That's audible? Seems like that's audible. Hey, everybody. Yeah, so my name is Charles. I did a PhD on optimizing neural networks at Berkeley, then worked on MLOps at Weights & Biases, and now I work at Modal, a serverless infrastructure company. And in the run-up, what I think is exciting about this AI QCon is that the MLOps community has been sort of, like, in the trenches, making complex systems work and turning them into products. And in the last year or two, there's been this huge inrush of a new collection of software engineers who want to do the same thing with a new, slightly different type of system based off of pre-trained language models.
Charles Frye [00:01:17]: And so they have gotten excited about the idea of AIOps, LLMOps, agent ops. And people are encountering a bunch of problems and running into, "Oh, I can set up a demo, and that works really well on Twitter, but then I can't turn that into an actual product." So the interest in this AIOps discipline is that it promises to actually consistently improve production outcomes from these systems. And because some of these people haven't been in the field as long as lots of us in the MLOps community have been, they're sort of repeating a lot of the mistakes from the MLOps period. It actually turns out that LLMs are not the first piece of opaque, hard-to-control software that people have tried to get to work. Nor is it the first time that people have tried to get a complex system that they don't have full control over to deliver value. So there's actually about a half century of operations research and ops disciplines that we can learn from. I wanted to walk through that really quickly in this talk.
Charles Frye [00:02:30]: So I'm going to talk about everything from AIOps back to running factories, and try to draw lessons for how to make your chatbot not suck. Right now with AIOps, the thing people are most excited about, the thing that people are raising money over, is evals. There's this virtuous cycle that Hamel Husain posted in his "Your Chatbot Needs Evals" blog post. He was talking about how you have this cycle of improvement: you make a model, you put it out there, you evaluate it, you curate good evaluations for that model, you put it into production, the model gets better, that generates additional data, and that allows you to push this loop of continual improvement forward. The center of that is evaluation and curation. And a lot of people talk about evaluation and curation because of its centrality, but really, the thing it is the center of is the most important part: this iterative loop of improvement. And it's this iterative loop of improvement that we're going to see repeated throughout these different disciplines, each time with a slightly different emphasis.
Charles Frye [00:03:41]: So there have been a number of key insights, even in the short period that people have been trying to operationalize foundation models, that we've seen already. Number one: simplify tasks for evaluation. If you have a summarization task, it's very hard to evaluate whether a summary is good or not, but you can evaluate whether it's 100 characters long or longer, and you can evaluate whether it contains a particular proper noun or not. So if you simplify it down to a binary classification, you'll have an easier time. Another key insight, which I think people who've been working in the ML field will find unsurprising, is that there's a lot of janky software out there to support the use of LLMs. There's fresh, hot open source software that might fail more frequently than you would like. People working on AI engineering and LLMOps have found that they need to trust no one. Look very carefully.
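A minimal sketch, not from the talk, of that first insight: simplifying a fuzzy question like "is this summary good?" into cheap binary checks. The function names and specific checks here are hypothetical illustrations.

```python
# Hypothetical illustration: turning "is this summary good?" into
# cheap binary pass/fail checks that can run over every model output.

def is_short_enough(summary: str, max_chars: int = 100) -> bool:
    """Binary check: the summary fits within the length budget."""
    return len(summary) <= max_chars

def mentions_required_entity(summary: str, entity: str) -> bool:
    """Binary check: the summary contains a proper noun we know must appear."""
    return entity.lower() in summary.lower()

def run_checks(summary: str, entity: str) -> dict[str, bool]:
    """Each check is pass/fail, so results are easy to aggregate and track over time."""
    return {
        "short_enough": is_short_enough(summary),
        "mentions_entity": mentions_required_entity(summary, entity),
    }

if __name__ == "__main__":
    print(run_checks("Toyota pioneered lean manufacturing at scale.", entity="Toyota"))
```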
Charles Frye [00:04:39]: Look directly at the tokens coming in and out of your model. Look directly at the tensors coming in and out of your model. And then lastly, the maybe most. That's my calendar, and that's whisper AI, by the way. It's actually great speech to text whisper, but that's not what I'm trying to show right now. Last thing like maybe the most important or surprising novel insight with aiops relative to mo Ops is that synthetic data is extremely powerful within certain limits. So you can take these pre trained, large scale models and use them to generate new data that you can use to improve that model, improve your application, or to train an entirely new model. I think one of the reasons this is most powerful is because fine tuning on synthetic data is a very well posed ML problem.
Charles Frye [00:05:28]: You have a data generating process in the form of a model, and you just want to copy that data generating process into a new form. And that's a very well posed ML problem compared to the other things that people try and do with ML, where your data generation processes and then control of users, or it's some sensors out in the natural world. So, synthetic data is really useful for sort of kickstarting that iterative loop of improvement, because now you don't need data in order to get started. You can use, like, you don't need to go out and collect data in some expensive annotation project. You can just start with synthetic data. And usually it's not fully synthetic. It's some semi synthetic thing where you get some easy to acquire data and then you enrich it with additional synthetic data. So it doesn't solve every problem, but it solves many of them.
Charles Frye [00:06:15]: So those are some insights and a figure from the field of AIOps and LLMOps. But this stuff is not entirely new. I saw a lot of people nodding along, because some of these same ideas show up in MLOps. Looking back at Shreya Shankar's paper on operationalizing machine learning, that also centered on one of these loops: figure one was a loop that looked very similar. There's more focus on data collection here; data collection is at the beginning, as opposed to partway through with foundation models, but it's a very similar loop. This is also the core loop from when I was teaching Full Stack Deep Learning with Josh Tobin and Sergey Karayev: the core thing we tried to teach people about going from research to production with neural networks was that you needed to find a way to set up a loop like this, where you are able to collect additional data from the behavior of your users.
Charles Frye [00:07:17]: So, there are a couple key insights from mlops that I think are kind of easy to lose sight of when a new, exciting model or domain comes around. One of the most important ones is that pipelines are what matter, not models. If you have a set of weights, that's going to be useful for a while, but at some point, weights that produce inferences are very fungible. You can swap them out with a different program that has the same input and output types, and so the weights don't really store value. What stores value is the system that allows you to produce a new model when new data comes in, or that allows you to determine that a new model that just came out is actually better. So this was an important part of what made mlops successful. At the same time, there's kind of a counterpoint, which is people tried to get continual learning loops working with ML, where this entire sort of circle would spin itself automatically. Andrej Karpathy's operation vacation and given that everybody here is not on vacation, this didn't actually really work out.
Charles Frye [00:08:24]: So full automation of this loop is really, really hard. So people reaching to be like, oh, I'll just use LLMs as a judge and they'll evaluate my LM prompts and they'll write a new prompt and then that will run and then it'll judge, this is a suggestion, we won't be able to pull that off. Then finally, I think the most important thing that needed to be communicated from Mlops that was novel relative to other ops disciplines, was the importance of looking closely at data. So everybody from OpenAI researchers to folks like Shreya Shankar were saying the most important thing is to spend time actually looking at the inputs and outputs of models and doing things like making cleaning data and looking at data part of your interview process, making it part of your stand up or on call for your engineering team, that everybody has to sit and look at data for 1 hour a week. These are the kinds of practices that really good MLOps teams adopted, and they'll transfer directly to building on foundation models. Well, a lot of the ideas in Mo ops, like if you were around building software in the early 2010s, struck people as familiar because they were kind of just the ideas of DevOps reapplied. So DevOps before Mlops was about closing the loop between the release of software in the world and the development of software internally, operationalization of software and IT teams and development teams and it had its own loop of continual improvement. They also ran into the problem that fully automating software systems was really, really hard.
Charles Frye [00:09:58]: You don't really want zero click deploys. You always want a person deciding to do a merge or a person deciding that now is time to deploy or now is time to include this. So again, full automation, very difficult. Another one of the bugbears of DevOps was the difference between your testing environment and your development environment and your prod environment. And this is also shown up in mlops and llmops as differences between the data that you have in your testing and evaluation and in production. So we get to have the fun of both having the same software 1.0 test prod Sku of versions being slightly different and like, oh yeah, we changed the tokenizer behavior and now it tokenizes differently and data being different between those two places. And then lastly, maybe the controversial insight of DevOps was that the way to solve that problem is to test in production. And the best teams who did testing in production were very clear that this worked only if you could, from the information you retrieved in production, from the traces you got in production, you could turn those into bug fixes directly without needing to go and spin up a separate dev environment and do your debugging there.
Charles Frye [00:11:11]: We're still short of that in ML. We're still short of that in building on foundation models. We have monitoring without observability or controllability. That's an open problem. People are working on that. With mlops tooling, then DevOps is like as a discipline was not entirely new. It was directly inspired by the lean movement. So DevOps, the people building it originally were very clear that they were applying the ideas of lean startup and lean production to development.
Charles Frye [00:11:46]: So you can look, you can find the exact same loop in something like lean startup. There's a bunch of insights from that that we can carry over. The minimum viable product is maybe the most popular one. It's like the idea is try and find something simpler than the whole system you want to build to at least get this wheel turning in its first iteration. Like have a human do the task instead of an LLM, or have software 1.0 do the task poorly instead of an LLM. Maybe a deeper insight from lean startup was that metrics are good when they are accessible, auditable and actionable. Accessible means like everybody who interacts with the system is able to see those metrics. Auditable means that they can see how those metrics were calculated.
Charles Frye [00:12:31]: These two things. I think some of the early lmops tooling that we're seeing, like Langsmith from LangChain and WanB weave, have targeted these things. We've focusing on audibility and Lang Smith focusing on accessibility, but we're still missing actionability, the ability to turn these insights into direct model improvements. So maybe it's mechanistic interpretability and control vectors, maybe it's better continual learning algorithms and prompt optimization. But something needs to come in here and improve the actionability of these insights to make them like, actually have them be good metrics useful for turning this loop. And lastly, the lean startup movement, despite what people in northern California will tell you, did not originate in Silicon Valley. It came from Toyota. So the people working on DevOps were clear that they were inspired by manufacturing, physical manufacturing, not like computer science and mathematics.
Charles Frye [00:13:31]: And Eric Reese in the lean startup is very clear that what made everything click for him on entrepreneurship and made it easy to understand, like what insights he had were generalizable, was studying the Toyota production system and lean manufacturing. And like this manufacturing system was like very critical to it. Like it's like at the scale of Ford's assembly line, but like less well known, I think. And at the core of that system is also a loop of continual improvement. A lot of the insights from the Toyota production system, they aren't necessarily unique. Like Genshi Genbutsu. The idea of looking at real things in real places on factory floors is like, that's not a, like the translation of that is look at the data, look at the actual data in production. So we've already seen that, but it's nice to see it repeated in this like very different context.
Charles Frye [00:14:23]: I guess maybe the last thing I would say on it is that one of the key components of the Toyota production system was the idea of single minute exchange of dye, like swapping out some of the heavy key components that determine whether a machine makes widget a or widget b. That could take hours in old production systems oriented to large batches. And Toyota's insight was that if they changed the way that they oriented their system, they could switch these dies over in minutes instead of hours. And that allowed them to be directly responsible to consumer demand. So somebody says, I want a Toyota Corolla with this special piece added, and then that gets propagated back to the factory and turned into a machine, makes that part in response to that demand. And despite the much greater complexity, the much greater opacity, and the much lower controllability of the system they were working with, they were able to achieve that level of responsiveness to actual consumer demand. That's the guiding star that I have in thinking about how to build LLM applications is able to achieve this sort of like single minute exchange of prompt, this immediate responsibility of my application to consumer demand. So that's it.
Charles Frye [00:15:38]: That's my TED talk. I'll say 2 seconds at the end. I work at an infrastructure company, Modal. We build infrastructure that helps support these kinds of applications and these kinds of pipelines. It's like Aws lambda, but if it was good and had GPU's. So yeah, modal.com comma check us out. Thank you.
AIQCON Female Host [00:15:58]: Thank you. We have time for one to two questions. If anybody has.
Charles Frye [00:16:05]: Oh, nice.
Q1 [00:16:08]: I am also a huge fan of Toyota. So one thing about the Toyota system is that they allow anyone on the production line, regardless of hierarchy, to stop the assembly line. How does that work out in a software system where revenue, and continual revenue, is key?
Charles Frye [00:16:24]: Yeah, I would say that the equivalent of the andon cord in DevOps is a red X on a PR: it means you don't deploy. I think the challenge with that is that nobody wants their deploy blocked on a flaky test, and right now the tests that we have for LLMs are generally fairly flaky. If you set a threshold of 0.75 on some metric and then you block a release because it's 0.74, nobody's happy when that happens. So I think that's one reason why it's harder to do this. The sort of thing I would page on or block a deploy on is things like safety; maybe Google should have pulled the andon cord when they told people to eat pizza with glue. That's what I think is the equivalent, whereas the test equivalents that are checking bulk performance metrics aren't worth shutting down the production line over.
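A toy sketch, not from the talk, of the threshold problem described here: a hard cutoff at 0.75 blocks a release at 0.74 even when the difference is within noise. One possible hedge, assuming you can run the eval suite several times, is to gate on a margin rather than a knife edge; the numbers and function names are hypothetical.

```python
# Hypothetical deploy gate: instead of failing the moment a noisy eval metric
# dips a hair below the threshold, block only when the score is clearly below
# the bar by more than an estimated noise margin across repeated eval runs.
import statistics

def should_block_deploy(scores: list[float], threshold: float = 0.75) -> bool:
    mean = statistics.mean(scores)
    # crude noise estimate: standard error of the mean across repeated runs
    sem = statistics.stdev(scores) / len(scores) ** 0.5 if len(scores) > 1 else 0.0
    return mean + 2 * sem < threshold   # block only if clearly below the bar

if __name__ == "__main__":
    runs = [0.74, 0.76, 0.73, 0.75]      # repeated runs of the same eval suite
    print("block deploy?", should_block_deploy(runs))
```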
AIQCON Female Host [00:17:24]: One more question before we take a short break.
Q2 [00:17:28]: I really love the thread about AIOps being part of MLOps and DevOps, and then lean and Toyota. In the new age of Gen AI, with the data flywheel, you go back to model retraining and MLOps. But we already use large language models, and nobody is actually retraining the entire thing; we either do RAG or fine-tuning. How does the landscape change when things don't work as expected in production? What are some of the current challenges and open problems, and how can we optimize this space?
Charles Frye [00:18:02]: Yeah, in some ways it feels like a harder problem, because you don't control the production of this initial artifact, the pre-trained weights that you're using. But in some ways that actually makes the problem easier, because a huge component of your ML problem was the kind of stuff that could be learned from web-scale text. If you were producing that artifact from scratch, you'd be tempted to always go all the way back to the beginning of training and start over every time you had a problem in production. So I think it actually is a good thing that we're essentially forced to cut off at RAG systems, at fine-tuning, at LoRAs, or at control vectors, as opposed to going all the way back, because I feel like the majority of your problems can be solved by those changes. For the problems that can't be solved by those changes, the ones that actually need a boost in the fundamental model capabilities, you kind of have to wait for new models to come out from the core model providers. And then the thing that makes you able to do that quickly is an evaluation chassis that makes you confident that the new model is actually better. Claude 3.5 has great Twitter PR and good vibes; does that actually mean that you should use it to replace GPT-4o in production? If you have a solid evaluation chassis that gives you confidence for making every other change that you do, to prompts or fine-tuning or what have you, that very same chassis should, if constructed well, also allow you to swap models in and out, and that gets you back that outermost loop of improvement.
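A minimal sketch, not from the talk, of what such an evaluation chassis might look like: the same eval suite you trust for prompt changes is reused to decide whether a newly released model should replace the incumbent. The call signatures and checks are placeholders, not any specific provider's API.

```python
# Hypothetical evaluation chassis: run the same eval suite against two models
# and only swap if the candidate wins on the aggregate pass rate.
from typing import Callable

EvalCase = dict   # e.g. {"input": ..., "check": callable taking the output -> bool}

def pass_rate(call_model: Callable[[str], str], cases: list[EvalCase]) -> float:
    passed = sum(case["check"](call_model(case["input"])) for case in cases)
    return passed / len(cases)

def should_swap(incumbent, candidate, cases: list[EvalCase], margin: float = 0.02) -> bool:
    """Swap only if the candidate beats the incumbent by a clear margin."""
    return pass_rate(candidate, cases) > pass_rate(incumbent, cases) + margin

if __name__ == "__main__":
    cases = [{"input": "2+2?", "check": lambda out: "4" in out}]
    incumbent = lambda prompt: "The answer is 4."     # placeholder for the current model
    candidate = lambda prompt: "4"                    # placeholder for the new model
    print("swap models?", should_swap(incumbent, candidate, cases))
```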