Building Robust AI Systems with Battle-tested Frameworks // Mini Summit #10
SPEAKERS

Vaibhav is a software engineer with over 9 years of experience productizing research. At Microsoft, he worked on real-time 3D reconstruction for HoloLens. At Google, he led performance optimization on ARCore and Face ID. Now he's bringing that same experience to improving the quality and speed of Generative AI technology.

Charles builds and helps people build applications of neural networks. He completed a PhD in neural network optimization at UC Berkeley in 2020 before working at MLOps startup Weights & Biases and on the popular online courses Full Stack Deep Learning and Full Stack LLM Bootcamp. He currently works as an AI Engineer at Modal Labs, a serverless computing platform.

SUMMARY
Breaking the Demo Barrier and Getting Agents Shipped
Deploying Large Language Models (LLMs) in production brings a host of challenges well beyond prompt engineering. Once they're live, even the smallest oversight—like a malformed API call or unexpected user input—can cause failures you never saw coming. In this talk, Vaibhav Gupta will share proven strategies and practical tooling to keep LLMs robust in real-world environments. You'll learn about structured prompting, dynamic routing with fallback handlers, and data-driven guardrails—all aimed at catching errors before they break your application. You'll also hear why the naïve use of JSON can reduce a model's accuracy, and discover when it's wise to push back on standard serialization in favor of more flexible output formats. Whether you're processing 100+ page bank statements, analyzing user queries, or summarizing critical healthcare data, you'll not only understand how to prevent LLMs from failing but also how to design AI-driven solutions that scale gracefully alongside evolving user needs.
Modal: ML Infra That Does Not Suck
Building an application on the cloud doesn't have to suck. Even if it uses GPUs and foundation models! In this talk, I'll present Modal, the serverless Python infrastructure you didn't know you always wanted.
TRANSCRIPT
Ben Epstein [00:01:49]: Okay, we are live.
Vaibhav Gupta [00:01:54]: Cool.
Ben Epstein [00:01:54]: Very, very excited about this. Hoping that we get one more person on the stream. But we have today three of my favorite companies of all time. So I'm super stoked. I've been trying to get all three of them onto this stream for a really long time. We have Charles with us from Modal, which is, he'll tell you, serverless compute and GPUs. We have Vaibhav from Boundary, who built BAML, which I've talked about a bunch, but it's essentially putting type safety into LLMs.
Ben Epstein [00:02:20]: And we have Brendan, hopefully, from Prefect, who's going to teach us about how to build really scalable workflows, data workflows, system workflows, things like that. All three of these products work unbelievably well together, but they also work really, really well on their own. So today we'll see them on their own, and then hopefully we'll share some stuff after about how they can be used together. Thank you guys both for coming. I'm really stoked about this.
Charles Frye [00:02:45]: Yeah, thanks for having us, man.
Ben Epstein [00:02:47]: Awesome. Cool. So, Vaibhav, why don't you kick us off? You can share your screen, and in a sec you can tell us what you're going to show on BAML, and then we'll jump into Modal.
Vaibhav Gupta [00:02:59]: Let's do it. I think one thing that I'm always a fan of sharing is this idea of how agents work and how we all think about agents. Because I think a lot of times a lot of us are able to build demos, but it almost feels like we're stuck in demos for our product. We spend tens of thousands of dollars in our companies trying to get to an AI stage that works, but we all get stuck because it's 80% good. And 80% is great to show the CEO that we can go do this, or the business leader that you care about, but it's just not good enough to put in the hands of customers. I think that's the thing that I want to chat about a little bit today. And I'll share my screen, but I want to make this fun. Ben, we always have great conversations while we're chatting.
Vaibhav Gupta [00:03:46]: And Charles, I've seen some of the work that Modal has done and it's super impressive. As you guys have thoughts, just chime in and feel free, and we'll keep it fun and light and really interactive.
Ben Epstein [00:03:55]: Yeah, maybe what we'll do this time, to be a little bit different from normal, is that I'll keep everybody on during the stream, and as you're talking, we can maybe interrupt you with questions or thoughts and things like that.
Vaibhav Gupta [00:04:06]: Yeah. So the first thing that I always love to start with is just the premise of how software as a whole works. Is my screen share on?
Ben Epstein [00:04:18]: Not yet.
Charles Frye [00:04:19]: I don't think I see it.
Ben Epstein [00:04:20]: Here you go.
Charles Frye [00:04:21]: Perfect.
Vaibhav Gupta [00:04:22]: Okay. And I think when we think about software, the first thing that I ask is: normal software isn't allowed to fail 5% of the time. Like, your company shuts down; if it continues to do this, you go bankrupt. You can't build software or a reliable business at that failure rate. But somehow all of our perspectives allow this to be acceptable with LLMs. And I view this as very much like the early days of how people were building websites. In the beginning, it didn't matter if Amazon was up every single day; it just had to be up most days, and that was good enough, and it allowed it to survive. But for Amazon to really thrive, it turned out that that was not okay.
Vaibhav Gupta [00:05:00]: And in fact, the reason that they're so successful is because they're up a whole lot more than that. In fact, their reliability is why Netflix was able to be built. Netflix can't be built if Amazon doesn't exist, because Amazon solved the reliability problem of networking. And now, when we switch from networking and APIs over to LLMs, we need something that can help us go do that. How do we get an LLM to have, like, three nines of accuracy, or three nines of reliability, as we build software, so things aren't just breaking along the way? That's a conversation I always like to start with. Because really, what's different here is, in a world where only a few engineers at AWS had to think about three nines of accuracy, now all of us app developers and application layer programmers have to think about this as well. And that's just a different paradigm of software engineering that a lot of us haven't thought about in a while. I don't build error checking when I build applications.
Vaibhav Gupta [00:05:57]: I just assume that the GitHub API will work. And if it doesn't, then so what? My website goes down, I just tweet GitHub's down and people deal with it. But now, at any point I call OpenAI, Anthropic, any of these places, it can fail. And that means we all have to fundamentally rethink the way we approach these systems, the way we write code around these systems. Our old way of exception handling just isn't good enough. And really, I think what it boils down to when I think about this is: how does exception handling work? Well, let's just think about websites. You could write a website like this, and if you forget a slash over here, your website will break, but you won't know it until you actually deploy the website. That's a deploy-time failure.
Vaibhav Gupta [00:06:42]: That means your customers will see it and your engineers will not. You write a website in React, you miss a slash here, you get a compile-time failure. So your engineers see it and your customers do not. That's what we want to be able to do with LLMs. And we've all probably seen prompts that look like this. There are some if conditions in here. And one thing that I find really odd is, if I forgot a \n somewhere here, my prompt could be completely different.
Vaibhav Gupta [00:07:12]: This could actually look like a minus sign to the LLM because I forgot to render this differently. And how do you catch these kinds of errors statically and ahead of time, in a way that allows you to build a really reliable system where it doesn't feel like a demo and it will just work? So I'll show everyone a couple demos that I think give you the premise of this, and I'll talk a little bit about BAML and how it makes it possible right after that. So let's take, for example, I don't know, I have a picture of a cute little dog here, and this is a live demo, so it might break. I'm not very good at front end generating, but I think it'll work. What I have here is an LLM that's reading that image and actually defining BAML code to go and describe that image. When I run this, you can see that not only is it producing the schema, but it's actually producing the schema while streaming.
Vaibhav Gupta [00:08:00]: And you'll see it in a second again. See how I already know what the schema is ahead of time and it's just filling it out. That way of being able to go from code to structured data in a really reliable way, without relying on any specific LLM, and possibly even doing this with models like Llama 8B, allows you to build really practical applications. Invoice processing now can just be solved. We don't have to use OCR models now. What I'm doing is reading the invoice, pulling out a schema on it, and it's just pulling out the data. And once it's done, I can do interesting things with this, and I can build validations into my system. For example, we all know LLMs hallucinate, but how could we leverage that to our advantage rather than view it as a handicap? Well, one, what if we don't tell an LLM, I'll tip you $50 to go solve this problem. And instead, because this is a schema, all I do is I just say quantity times rate should equal amount. And if it doesn't, I know the LLM hallucinated.
Vaibhav Gupta [00:09:05]: I don't even need to ask another LLM. I don't have to go build any validations on it. I'm doing programmatic validations for the math that I can know ahead of time, because I know quantity and rate multiplied together will always be amount. I can add another dimension of validation where I can say all the amounts together should equal the subtotal, and then the subtotal times the tax rate should equal the total. And in reality, if all of these multiplications add up together to be correct, it's so unlikely an LLM hallucinated, because after you get past, like, seven or eight numbers, it would literally have to hallucinate all of those digits correctly and coherently from a base data set, which it's just not going to go do. So what that ends up leading you to is you can go build systems that can do bank analysis. I'll just take a live photo, for example, and you can see right over here, it's going to take this photo and it'll build out a data model that represents this exact schema.
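For reference, the kind of programmatic check being described might look roughly like this in plain Python. The class and field names (LineItem, Invoice, validate_invoice) are illustrative, not BAML's actual generated types:

```python
# A minimal sketch of the programmatic validations described above.
# Class and field names are illustrative, not BAML's generated types.
from dataclasses import dataclass


@dataclass
class LineItem:
    quantity: float
    rate: float
    amount: float


@dataclass
class Invoice:
    items: list[LineItem]
    subtotal: float
    tax_rate: float
    total: float


def validate_invoice(inv: Invoice, tol: float = 0.01) -> list[str]:
    """Return a list of discrepancies; an empty list means the math checks out."""
    errors = []
    for i, item in enumerate(inv.items):
        # quantity * rate should equal amount
        if abs(item.quantity * item.rate - item.amount) > tol:
            errors.append(f"line {i}: quantity * rate != amount")
    # all line amounts together should equal the subtotal
    if abs(sum(it.amount for it in inv.items) - inv.subtotal) > tol:
        errors.append("sum of line amounts != subtotal")
    # subtotal plus tax should equal the total
    if abs(inv.subtotal * (1 + inv.tax_rate) - inv.total) > tol:
        errors.append("subtotal * (1 + tax_rate) != total")
    return errors
```

If every one of these relationships holds, a hallucinated extraction is very unlikely, which is the point being made here.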
Vaibhav Gupta [00:10:11]: And yes, my shirt is purple because purple is an awesome color, and it can pull out the time, the date, and everything else that's in there as it's telling me. It's funny, I do have a notification about that, and you can actually go see what's going on. So when I can build these kinds of reliable systems, all of a sudden the way I think about an LLM fundamentally changes, because it goes from being this black box that sometimes fails to being a component that I'm going to build bigger systems around. And how exactly BAML plays into that is, let me open up Cursor, this is what it does for you. So in BAML, instead of viewing an LLM as this giant chat loop that I have to maintain, where I have to use like 50 different frameworks and manage my chat context, we treat an LLM like a calculator. A calculator takes two numbers in and produces a number out. An LLM takes any data you want in and turns it into any other data format you want out.
Vaibhav Gupta [00:11:15]: So in this case I have a function called ExtractResume. It takes in a string as an input, or an image, and it produces a Resume data model. This is inside of a BAML file. My Resume data model is described to have a field called name, which consists of another class with first and last; it has an email, which is described to be a non-Gmail email. And here a non-Gmail email is a string, but it's a string that's not allowed to have gmail.com inside of it. And experiences is a string array, a list of strings. And then we define what model we want to go use. So in this case I have an OpenAI fallback, and I'll show what that is in a second, and then we have the actual prompt. But where BAML really starts to be different is not only can you write if conditions, for loops, or anything else you want in here, but we actually give you a live prompt preview for your test cases. So you write a test case, and as you edit it, you can now see your prompt.
Vaibhav Gupta [00:12:13]: And why does seeing your prompt matter? Well, once you're looking at the tokens, how do you know that this number of dashes is actually good? Is this better? Well, clearly not, because this is two tokens for GPT-4o mini, using their tokenizer, rather than one. So I should just use the one-token indicator, and I can find which ones make more sense for different schemas and different alignments. But it's really just a way for me to go learn how I want to go build this out. Now the other thing that you're able to do is you're actually able to see the raw web request that's being built. Because any AI framework you use at some point is calling an API. They don't show you the API request that they're making. That's like building a React app and not being able to see the underlying HTML. Yes, you can do it, but when things go wrong, you are hosed.
Vaibhav Gupta [00:13:01]: There's no debuggability. And now you're debugging someone else's code in design choices you had no choice in making. That's why debugability was one of the first things that we thought of is if you're going to go and abstract this away. In this case we're calling OpenAI fallback because OpenAI fallback actually calls GPT4O mini and tries that twice and if it fails, it goes to Sonnet. And you can see here the Sonnet way of composing the web request is totally different because Sonnet Anthropic has a totally different way of composing the API endpoint. You can just click around and see exactly what the differences are as you're changing here. But I think the biggest difference and the biggest unlock that I see people make is really this button over here called Run test. So we all have test cases and normally to run tests you have to find the file you wrote the test in, you have to find the CLI command that you write the test in and actually go run it.
Vaibhav Gupta [00:13:50]: And that just leads to people not running tests as often. In React, when I change a website, I just change my TSX file and it just hot reloads, and I can see exactly what changed. And that means in seconds I can try tens of designs very, very quickly across two screens. We try to bring that same iteration loop, in seconds, to prompt engineering. Because now I can just press Run Test and the model gets called. Here the prompt is actually doing something interesting. The prompt is actually doing chain of thought and reasoning, because I told it to: before answering, list three incredible achievements of the person. Then, if you notice, the model spit out something that looks like JSON but isn't actually JSON, because none of these values actually have quotation marks around them.
Vaibhav Gupta [00:14:37]: And we're actually able to go pull that out. Actually, I have a unit test at the bottom where I said the name is Vaibhav Gupta, but name is actually a class. Let's change that to name.first, it's Vaibhav Gupta, and let's run it again. And you'll notice that it parsed while it was streaming. And we did a couple of things. We took this whole output from the model and we turned it into your Resume data model, which consists of name, email and experiences. And that's why email is null here, even though the LLM spit out vaibhav@boundaryml.com, and you can actually go see that right over here.
Vaibhav Gupta [00:15:18]: And when you're actually able to go through this, what this does is you no longer have to think about how to prompt the model perfectly. The model will just go do this. And what we do as BAML is error correction. So no matter what the model spits out, we give you the perfect answer all the time. And then eventually in Python, what you're able to go do is, sorry, I super apologize for this, give me like one second.
Ben Epstein [00:16:00]: But while he's moving around, I'll say something really quick about this, which is that I was actually just showing this similar demo at an event a couple days ago. One of the things that people brought up was, it's very cool to have the LLM generate the schema, but what happens if that doesn't make any sense? What if your system doesn't require the LLM to generate the schema? Something that's I think cool to point out here, when Vaibhav was showing the demo of generating that schema, is that that can very much be like a one-time thing. You can have an LLM, or multiple LLMs, generate four or five different variants of a BAML schema, which a human can then go in and validate, decide that that is the schema you're ready for, and then you can ship that out to thousands of inferences, thousands of different documents, even use a cheaper model, because it's way easier to fill in a schema than to generate the schema. So I think that's just worth calling out.
Vaibhav Gupta [00:16:51]: For example, in this case, like in a resume, if I'm hiring for software engineers, the way I structure my resume extraction is very different than if I'm hiring for, like, PhD students. And the things I look for and care about if I'm looking for a senior engineer are just different versus if I'm hiring a grad student right out of college. Because for a senior engineer, their experience and their education kind of matter, but what I really care about is the impact that they drove at the business that they worked in, and I want to tailor the extraction to really focus on the impact. But in the case of, like, a student, I really want to tailor it to the research they did and, like, what their h-index is, which wouldn't make sense for most people. And even adding the h-index, or the impact of their research, to the schema to represent that to an LLM when a person doesn't do research is just going to make the LLM hallucinate. So the ability to control the schema super dynamically, we've seen to be really, really critical to a lot of people.
Vaibhav Gupta [00:17:49]: Succeeding and being able to build really, really reliable pipelines. And eventually at some point you probably want to get out of BAML and write your actual application, because you have a web app or some other application that runs it, like Python, Typescript, Java, whatever you want. And the way that ends up working is you just write the baml_client import, which we generate for you. And you notice everything is autocompleting because Cursor understands BAML pretty fast. And resume.experiences becomes a list of strings. If I change experiences to be, like, an Experience type, which is described down here, then resume.experiences becomes a list of Experiences right away, and you don't have to think about it. I don't know why Python is being weird. The Python language server is slow.
Vaibhav Gupta [00:18:37]: There you go. You can see it's actually a list of Experiences. The whole point is you get this developer friendliness, and you get a DX that allows you to iterate on prompts really, really fast. It allows you to not have to think about the exact words you put into the prompt. It allows you to define everything in code, as opposed to having to stitch together a lot of strings. So we don't end up in a world where we're really writing prompts like this. We want to write prompts that look more like the equivalent of React. And the premise of that is you do a lot less prompting and your team does a lot more engineering.
Vaibhav Gupta [00:19:12]: And that leads to two main benefits: your pipelines naturally become more reliable, but really you can iterate a lot faster across your whole system, which means they become more maintainable. Not only can your engineers update your pipelines faster, but AI agents can also do code reviews and everything else much faster, because systems are just plugged together as interfaces, as opposed to having to go do everything through, like, 17 different layers of abstraction. I'll pause there.
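For reference, calling a generated BAML function from application code looks roughly like the sketch below. It assumes the ExtractResume function and Resume type from the demo above, and the exact generated module layout may differ across BAML versions:

```python
# Sketch of using BAML's generated Python client, assuming the
# ExtractResume function and Resume/Name classes described in the demo.
from baml_client import b                # generated client (layout may vary by version)
from baml_client.types import Resume     # generated, typed data models

resume_text = """
Vaibhav Gupta
vaibhav@example.com
- Led performance optimization on ARCore
- Built real-time 3D reconstruction for HoloLens
"""

resume: Resume = b.ExtractResume(resume_text)   # typed return value, no hand-rolled JSON parsing
print(resume.name.first, resume.name.last, resume.email)
for exp in resume.experiences:
    print("-", exp)
```

Because the return type is generated from the BAML schema, changing experiences to a richer Experience class in the .baml file changes the Python type as well, which is the autocomplete behavior shown in the demo.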
Charles Frye [00:19:46]: Yeah, super cool demo. One question I would ask. Yeah, definitely like the vision. I fuck with the vision for sure. Taking unstructured data, producing structured outputs from it, and then being able to use that schema, use that typing information when developing to speed up developer loops. Yeah, 100%.
Charles Frye [00:20:08]: I'm in the choir singing as you're preaching. The question I have, just concretely looking at some of the demos: you had that form where you were extracting data out, and there was a subtotal and a total, there was a quantity and a rate, and those together become an amount. And you said, oh yeah, well, the quantity times the rate should equal the total, and that's a thing you want to enforce on the extraction. But I don't know if you've ever exchanged PDFs with the type of people who do their business via PDF. They tend to have an error rate that is not zero. And so they might actually put the wrong numbers there. And I'm curious, how do you distinguish between:
Charles Frye [00:20:52]: The language model is failing to extract the information that is present in the document, like it's failing to extract a non-Gmail email, versus, oh no, the document itself is improperly formatted, the language model is doing things correctly, and therefore we need to go back to the provider of the document and tell them to, you know, get their shit together.
Vaibhav Gupta [00:21:14]: I think it's really a matter of building systems. So take the fact that we have a model doing this completely out of it. The model just transforms the data into something. At some point your system builds conditional validations across your system. So you're like, I have to know the total should equal these amounts. Whether the total is wrong because the model is wrong or because the person wrote the document incorrectly doesn't matter. The point is, you know the guarantee that you're expecting doesn't hold. So the first thing you have to do is: can you detect it? You build a system that can detect misalignments of some kind.
Vaibhav Gupta [00:21:51]: Once you build a system that can detect misalignments, you can do a lot of things. Once I detect that one total is incorrect, I can write software that does two separate things. I can write software that sends an email to some human that says, hey, I noticed this error, can you please tell me what looks wrong about this? Or I can take that error and send another prompt back to an LLM that says, hey, this is an error, can you tell me what type of error this is? Is this an extraction error where the rows are wrong, or does the document have an error? And the model returns a category of what type of error that is and what it recommends as a suggested fix. And then you take that and send it to a human to review, or you take that and just apply the fix. If it looks like an extraction error, you apply the fix. If it looks like a document error, you send a workflow off to the human and say, hey, can you please double check this? This looks like a discrepancy.
Vaibhav Gupta [00:22:44]: And it's more about building these systems together with if conditions and for loops, the pieces of software that we're all familiar with, and not trying to do everything in that one prompt.
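A rough sketch of that detect-then-route logic in Python. The helpers (classify_error, apply_fix, notify_human) and the error categories are placeholders for whatever your system actually uses; the LLM judge or human review would live behind classify_error:

```python
# Sketch of the detect-then-route pattern described above.
# classify_error, apply_fix, and notify_human are hypothetical helpers.
from enum import Enum


class ErrorKind(Enum):
    EXTRACTION_ERROR = "extraction_error"   # the model misread the document
    DOCUMENT_ERROR = "document_error"       # the document itself is wrong


def handle_invoice(inv, validate_invoice, classify_error, apply_fix, notify_human):
    discrepancies = validate_invoice(inv)   # programmatic checks, as in the earlier sketch
    if not discrepancies:
        return inv                           # guarantees hold, nothing to do

    # Ask a second prompt (or a human) what kind of error this looks like.
    kind, suggested_fix = classify_error(inv, discrepancies)
    if kind is ErrorKind.EXTRACTION_ERROR:
        return apply_fix(inv, suggested_fix)         # patch or re-extract the data
    notify_human(inv, discrepancies, suggested_fix)  # escalate the discrepancy
    return None
```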
Charles Frye [00:22:56]: Got it?
Vaibhav Gupta [00:22:56]: Does that answer your question, Charles?
Charles Frye [00:22:58]: Yeah, yeah, I think so. I mean, going to an LLM judge is always an option that you have, and having validation to know when to kick it off to a judge is definitely useful.
Vaibhav Gupta [00:23:14]: Exactly.
Charles Frye [00:23:15]: But the question is then like have you made the problem any easier by passing it off to an OM judge? Like for example, the hallucination. The like rate of hallucination might be very high. Like frequently if you tell a language model this looks like an error, even if it's correct the model because they're obsequious and trained to like, you know, follow prompts a bit too much, they might be like, oh yeah, absolutely, that's a mistake. Like, I'm so sorry.
Vaibhav Gupta [00:23:38]: Well, that's just a matter of prompting. So, for example, if you have a categorization problem, one really big piece of advice that I tell people is: give the LLM an out. Lawyers, I think, are some of the best prompters that I've seen. Because, you know, the rule of no leading questions, we all joke about that, but it's really true. If you lead the LLM, it's going to give you the thing it thinks is designed to make you happy. But what you should do is really think. And I think that's where the playground really helps, which is, you just write five test cases and you see if it's working in those scenarios and giving the output that you expect.
Vaibhav Gupta [00:24:17]: Because if I test it on seven different test cases and it's working, I let it run in production; it breaks, I build more test cases, and I make my pipelines better over time. And that iteration loop is what you do. And notice one of the key things that I was saying was not that the LLM will be right, it's that you have to build a way to detect that the system is wrong and kick it off to a totally separate pipeline. And that pipeline doesn't have to be an LLM; the LLM as a judge is one option. But if your business is super, super, super accuracy sensitive, send it to a human. It's basically a function, whether it's a human or an LLM, that takes in the error and the configuration and tells you what the suggested fix is. It's a function.
Ben Epstein [00:25:01]: I would even push it one step further. Charles, I definitely agree with that. I think it's a very cool way to rethink how LLMs work. But I even have maybe a simpler response, which is: how do you know if the LLM made an extraction mistake? Well, if you've reframed LLMs into just functions that return JSON, really what you've done is you've turned LLMs into millions of small APIs with clearly defined specs. And so you can just write pytests and give it a bunch of PDFs where the sums and the values are wrong. And you can actually just test that with a whole bunch of different PDFs that you care about and make sure those values are right. And you can actually just see how robust your model is to those problems that might arise.
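A hedged sketch of what that looks like in practice: treat the extraction function as an ordinary API and write pytest cases against documents whose correct values you already know. The module and fixture names here are hypothetical:

```python
# Sketch of regression tests for an LLM extraction function.
# my_pipeline, extract_invoice, validate_invoice, and the fixture paths are hypothetical.
import pytest

from my_pipeline import extract_invoice, validate_invoice

KNOWN_INVOICES = [
    # (path to fixture PDF, expected total; None means the document's own math is wrong)
    ("test_invoices/clean.pdf", 1042.50),
    ("test_invoices/bad_subtotal.pdf", None),
]


@pytest.mark.parametrize("path,expected_total", KNOWN_INVOICES)
def test_invoice_extraction(path, expected_total):
    inv = extract_invoice(path)   # LLM call behind a typed, JSON-returning function
    if expected_total is None:
        # Documents with bad math should be flagged by the programmatic checks.
        assert validate_invoice(inv), "expected a discrepancy to be detected"
    else:
        assert abs(inv.total - expected_total) < 0.01
```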
Charles Frye [00:25:44]: Right? Yeah, I guess you do have the problem that, you know, it was right at a certain rate on your test PDFs. If you think dev environment versus production environment skew is bad in software, where it's maybe some environment variable differences, the difference between the test data and the prod data can be really extreme. And I think Vaibhav's totally right to bring up the sort of data flywheel approach of, yeah, collect examples from production and then add them as sort of regression tests. But I guess you do end up with errors you have to handle, you know, sort of manually.
Vaibhav Gupta [00:26:21]: Yeah, you're building a probabilistic pipeline. So, for example, we built Face ID at Google; it's the same thing. You just have to build a probabilistic pipeline, and you're never going to be right perfectly, but you just put engineering hours into it. It's like security. When you first start off building an application, security, like, your application needs to be this secure. How do you make an application more secure? It's literally just work that you put on top of it, with more software, more processes that make your system more secure.
Vaibhav Gupta [00:26:50]: AI pipelines are the same. You start off with a pipeline that's pretty accurate, like most of us are with a pipeline that's not accurate. Use bamboo. You start off with a pipeline that's more accurate because you're testing, because you actually have tests that you're running and you're seeing the prompts. It's not like, I mean, Bamble does some stuff, but it's really iteration loop allows you to build a more accurate pipeline. And then you do build that data flywheel. And one thing I didn't show is how we help you build that data flywheel. But imagine building a data flywheel where you can take production data and turn it into test.
Vaibhav Gupta [00:27:21]: And you just do that. And you do that over time. As you get more users, you understand your edge cases, and then you slowly stack it up on top of each other and you build accuracy over time, instead of expecting to have accuracy on day zero. If you need accuracy on day zero, then you pay the upfront tax and collect all the data up front. It's a processes problem in AI. I think that's why engineering teams are really struggling to get past the demo hurdle. Because engineering teams aren't set up to have these processes in their workflow. It just doesn't exist.
Vaibhav Gupta [00:27:51]: And machine learning teams aren't set up to have the processes of having shippable code in their workflow. And that's the bridge that I think is really novel for a lot of people to go approach and go after.
Ben Epstein [00:28:08]: It's very interesting when you reframe even just thinking about prompts from strings to functions, and from strings to structured outputs. I mean, I have found that the accuracy of my systems, at least, gets a lot better, because you're not constraining in, like, the technical term of constraining, you're not actually constraining the tokens that the model can generate, but you are so aggressively constraining what you're asking from the model. I never thought about it from, like, the no leading questions angle, but it sort of forces you to start writing prompts that way. I do very little prompt engineering relative to what I used to do, now that I'm using BAML for everything in my stack. And I've found that whenever I do prompt engineering, it's only for small models. I never have to prompt engineer with Gemini, with Claude, with OpenAI. But when I'm using a small model and I'm prompt engineering for these extraction-type tasks, the only prompt engineering I end up doing is actually like, okay, I get it.
Ben Epstein [00:29:04]: The field name that I provided is actually maybe a little bit nebulous, so let me just add a description to make that field name a little bit more precise. That's always what it has been, unless you get to those small models that are just nonsense anyway. With the medium and the big models, when you force it into a schema, you're definitionally being so specific as to what you want, because you're asking for this schema. It's very cool. Okay, Vaibhav is going to leave us for now, and Charles is going to take over and show us a little bit about Modal, which I'm also really excited about.
Ben Epstein [00:29:42]: And maybe people will see from this how they work really well together.
Vaibhav Gupta [00:29:47]: We've got quite a few people using both and it's really fun.
Ben Epstein [00:29:50]: Yeah, it's an awesome combo word.
Charles Frye [00:29:52]: Cool.
Ben Epstein [00:29:53]: All right.
Charles Frye [00:29:53]: Yeah, thanks.
Ben Epstein [00:29:55]: See you later. All right, Charles.
Charles Frye [00:30:02]: There we go. All right, so I'm gonna present Modal, which, my favorite way to talk about it for folks from the sort of MLOps world is that Modal is the ML infra that you didn't ever realize that you really wanted. So the basic pitch here, actually, I kind of like old computer science ideas. I have a PhD, spent too much time in school, so I like to frame it in terms of really old ideas. So fire emojis in the chat if you ever used Enterprise JavaBeans. But the basic idea with Modal is the same. This old idea goes back to the mainframe days, or Java Beans. It's this idea of remote procedure calling.
Charles Frye [00:30:53]: Like, you've got some code on your machine and you've been playing around with it for a while and you're like, okay, I've been fiddling with one test example and now I want to run a thousand test examples. Or I've been playing around with a model that I'm training on, what, a batch size or data set size of one, like, I'm overfitting to one single data point, and I've been figuring out all the Hugging Face trainer flags in that simple case. Now, oh, well, you know, that runs fine on my MacBook, but now I need to run it on a big GPU so that I can finish my training run over lunchtime instead of over, like, 10 days. So the idea is you want to take code that you got running, that's working on your machine, and you want to run it somewhere else where there are resources that you don't have on your machine. This is another way of helping to close that development-production gap that Vaibhav was talking about closing. He mentioned that, like, you know, ML teams
Charles Frye [00:31:57]: Don't have this processes or infrastructure in place to be able to write shippable code often because they have to like, shell into or open a jupyter notebook in some like, super expensive GPU instance in order to be able to like, test their code properly. So they don't, you know, they just, they don't, they don't really connect that, like production infrastructure and development infrastructure, the way people do in a lot of other domains. So that's the problem we solve with modal at a, like, at a very high level. We've got some, like, you, you import our little Python library, you start doing stuff on your machine, and you're like, you know what, actually this function needs to run on eight H1 hundreds at once. And actually probably I want 1000, you know, I got 1000 inputs, so I'll probably need a bunch of copies of that to run. You, like, put some little decorators in your Python code to say that, and then you issue a command, you Python run your script or whatever, use our SDK, and that causes your code to get shipped onto and run on our cloud. So the, like, this idea is old. The thing that makes it good is one, like, you know, the changes in what computing looks like in the last five and 10 years.
Charles Frye [00:33:12]: So in the last 10, maybe even 15 years, we've switched over to cloud computing. This approach, RPC, was popular back in mainframe days. We're back to mainframes, baby. They're just called the cloud. And so that switch has made this approach good again. And then the other part of it is that in order to make calling a remote procedure feel like calling a local function, we had to do a whole bunch of stuff. So we rewrote the stack from the ground up to make this actually good for running user workloads.
Charles Frye [00:33:50]: We don't use kubernetes. We use kubernetes for what it's good for, like managing a lot of stateless services, a couple things like that. But it's not great for this sort of like scheduling workflows stuff. And it's not great for data infrastructure. It doesn't handle heterogeneous compute well. Docker similar. There's lots of great things about Docker and especially the container interface that they've defined, that open protocol. We wrote something that's like, you know, compatible with that protocol, but it has a bunch of features to make developing stuff directly against the cloud.
Charles Frye [00:34:22]: Like, yeah, not suck, you know, and we've talked a lot about all these, all the pieces that go into that. Like, you know, lazy, being lazy loading with file systems in some ways, being eager with file system loading in other ways. So that when you go to run something, it doesn't take 30 seconds to spin up, it takes like three. And I think just to jump back a slide because I forgot about this point, like, yeah, this basic idea of, oh, yeah, let me run some code. Like, let me run a function on the cloud. Like, yeah, there's things like AWS Lambda, these serverless platforms on the cloud providers that are pretty good at this. And also there's this old idea of rpc. Another thing that makes modal different from other serverless platforms and from other ways of approaching remote procedure calling is all the other features that we can add on top to.
Charles Frye [00:35:15]: It's not just about, oh, I want more hardware, I want to run more copies of this function on 1000 CPUs at once. It's like web serving. Oh yeah, let's get you some TLS certificates. You've got a domain that you can talk over. Oh yeah, let's set up one of those async event loops around your code so that you have to write async functions maybe, but you don't have to think about UVicorn or asyncio or any of these other things that will give you an event loop. Like, we got that for you, don't worry about it, Bestie. And sort of all the other kinds of superpowers that you can get like, by like, just from the like pure plumbing and infrastructure of your code. The goal being, like, we want you to focus on like the actual business logic, on the actual like, problems that your code is solving, not the like, hardware infrastructure and nuts and bolts that it needs in order to, in order to do that.
Charles Frye [00:36:14]: So I think I'll actually skip this slide. We'll maybe come back to it if we've got time. And I'm instead going to jump into a demo so you can kind of see what this looks like. This is our little minimal Modal demo here, and let me even cut some stuff out. So when I say our goal is to make it as easy as possible to do this, this is what the easiest Modal example looks like. It's just a few steps. You have a Python function, like this guy here, and you want to run this on the cloud. I don't know who's squaring numbers on the cloud, but you get what I'm going for here: some Python function.
Charles Frye [00:36:57]: So what do you have to do to go from just having this to having something that runs on the cloud with autoscaling and other sorts of delightful magic? So one, you've got to have a little decorator that says, hey, this should be a Modal function. You'll notice that decorator has this little prefix here: it's got an app. Okay, what's that app? That app is an object in our library. It says, oh, this is a bunch of functions and storage and all these things that kind of all live together, they get deployed together, they get developed on together. That's a Modal app. So yeah, building a simple one is one line.
Charles Frye [00:37:37]: You don't notice what's not here. There's no YAML, there's no Docker file where you have to say, I want this particular version of, of Debian, whatever, or Alpine and this version of Python. Like you can, you can set that, but we are making you think about it. We give you sensible defaults. Of course, all this is coming from our like software library. This, this modal library imported at the top. Let me just show you like, you know what that looks like? Modal run. Get started.
Charles Frye [00:38:04]: So modal, that's our software. That's like our client software. So this says what just happened While I was trying to explain what our software library was, is created, took that code, threw it up on Modal's cloud, allocated some resources for it, ran it and then printed the results both locally and remotely. Actually, I took out the remote print. So let me add that back in because that's kind of important. This code is running remotely on Modal. Boom. Now let's run that again.
Charles Frye [00:38:40]: New code. In some platforms, this would have been a whole Docker image rebuild for some reason. But none of that; you don't have to worry about that with Modal. As quickly as you could change it, the thing changed. Let's see, what else maybe would I want to show here? Let's stick with maybe some development stuff. So the big thing that people really like about Modal is we've got those good Nvidia GPUs. And maybe to show you that, let me go ahead and just do what every machine learning engineer likes to do, which is run nvidia-smi. We'll call that from Python.
Charles Frye [00:39:25]: So Modal's like all about running Python code. But it turns out like Python is actually just like kind of the front end to computers these days. You know, it's sort of like, it's like bash, but, but good. And so a lot of people end up using, using Modal and Python in this way where it's like, oh yeah, I want to like run this process. Like, it's not that I necessarily want to like build my whole application in Python or even like use Python bindings to fast libraries like Torch does. But like I just want to like launch something like NVS am I. And you can turn this into like a long running server and all that kind of stuff if you want. So like that's how I do like quality assurance on our documentation.
Charles Frye [00:40:04]: I just serve the docs off of modal and then I can take a look and that's like, that's like a full stack JavaScript app. And you can just run that by just, you know, switching that Nvidia SMI to node and making sure it's installed. But yeah, so there you go. You can see we printed out the like remote there is printing out the, that we got that Nvidia SMI output. And for my last trick at least in this file. So the nice thing about building stuff for scaling on the cloud is that once you've solved the problems of running it once, you could run it 100 times. So let's now run it 42 times. Oh yeah, sorry, I forgot a little bit of syntax there.
Charles Frye [00:40:48]: I want to grab all of the results. Yeah, generators. So, yeah, all right, so now like now we're running on 3, 4, 5h 100 containers. So all those spun up code landed on them and we squared 40, we squared all the numbers from 0 to 41 and you know, we could. Let's, you know that's, that's pretty cool. We have, it's Even cooler than 42, 420, am I right? So now let's run like even more. And you can see, sorry, like focus on that little time thing on the bottom that's like ripping through all these inputs, giving an estimate of when it'll be done. And right above it is like the number of GPUs that are running.
Charles Frye [00:41:35]: So like you can see that that number goes up over time and it's going up over time because our system is detecting, oh wait, there's some pending inputs from like this person asked for 420 like inputs to run and now those are pending. They're sitting in a queue. Let's, you know, let's spin up some more. So by the time it was done like we had, we had like 10 or 20 GPUs going. And I can maybe pull that up here and show you like what that would look like. You'd see this is our like interface for looking at everything. Like, oh yeah, we got, yeah, we have 420 succeeded function calls. We have these containers that started up.
Charles Frye [00:42:12]: If they had logs there Wasn't anything really logged by them? If they had logs is where you'd be able to see them. Yeah. So that's all very cool. So let's now deploy that and for the like. So the way I was doing it, just there, this modal run that says like, hey, that's like taking a script and turning it into something that runs on the cloud, which is pretty cool. Like, you know, a lot of ML engineers spend time on data pipelines, on training jobs, on other kinds of like one off things. Also it's a great way to run tests. You just like run the script and it's got some like tests in it.
Charles Frye [00:42:53]: But like modal is also about deployment. So like deploying this thing as a service. So now we've got GPU backed squaring as a remote procedure call in Python in modal. So let me show you how to do that. So that Python, that function up there, like, is defined in this in that Python file. But if I'm using modal, I can just like get that bad boy whenever I want. The. There we go.
Vaibhav Gupta [00:43:20]: Yeah.
Charles Frye [00:43:20]: F. Yeah. What is this? Oh, that's a modal function. I should be able to just call it on something. And you know this, you don't get that nice printed output. So it takes a second for it to show up. But there we go. F remote 2.
Charles Frye [00:43:36]: And I think at the first call was a little slow, like fast for cloud execution. Right. That was a couple seconds to get some infrastructure going. I don't know if you've ever waited 40 minutes for EKs to start up, but I have and so that's pretty fast. But now the remote calls are human perception. They're certainly as fast as something you run locally. So they're in that couple hundred milliseconds or even less probably range. So let me do one last thing as a cool little demo of just how far you can take this.
Charles Frye [00:44:20]: I'm going to start calling this function and right now it's returning four. Right. So what I'm going to do is I'm going to change the code of that function and just like, you know, I'm going to change its behavior. Like this is like, oh yeah, I wanted to like change my prompt, deploy new model, whatever it is you want to like change it and then all of your downstream code needs to like consume it. So and then we'll do, let's see, import. This is like a little lazy, But I mean, time.sleep one. There we go. All right, so now it's calling this function and it's getting four, so that means it's still the same old function that I was just running.
Charles Frye [00:44:56]: So let's change the code and let's make it actually cube the number instead of square it. That'll change the output and then let's deploy it. So now I'm deploying and creating, like a new version of this function. And so what we'll see is in a couple of seconds that this thing will stop returning. Not yet. Because now that while condition is no longer true, the new version of the function has been deployed. And when I call F remote 2, I will get 8, that's complete redeployment while the thing is running, you'll see there's no errors. It's like, oh, function is not defined.
Charles Frye [00:45:36]: Please come back later. We do a smart little blue green deployment thing and this gives you this totally flexible infrastructure for, you know, in Python with a couple of fun little decorators like creating, running, developing and deploying Python functions and like, hell web endpoints. I didn't even show you how you can turn, like, turn these into, like, fast API endpoints that you can hit with curl and all kinds of other exciting stuff.
Vaibhav Gupta [00:46:05]: I use that.
Charles Frye [00:46:07]: Yeah. Hell yeah. Yeah, that's. That was like, the thing that, like, blew my mind. I came for the GPUs when I was still, like, teaching people how to deploy models. All I cared about was like, running PyTorch on GPUs. And then the. Yeah, then I was like, oh, wait, this fast API library, like, I didn't use it that much because I thought it was like, kind of hard to get going with it.
Charles Frye [00:46:29]: And then Modal showed, like, showed me how to, like, get started with it and deploy applications in like a few seconds. And then I, like, started down, down a long journey that led to me, ended up ending up joining the company and yeah, in between which I deployed a bunch of fun little demos on Modal. So, yeah, so now, just to close out that demo, by the way, there's. Now we're cubing two instead of squaring it, and that deployment's already, Already finished. Yeah, so that's the. That's what, you know, developing and building infrastructure on Modal looks like. I think, you know, focused a lot on the sort of, like developer experience experience and making this, like, you know, how you could make your, you know, be more effective as an ML engineer, make your teams more effective, you know, at least make them make them happier because they can write joyful Python instead of painful YAML. The other point I want to make since, you know, lots of ML ops people in here, lots of ops people probably think about infrastructure and cost a lot.
Charles Frye [00:47:33]: One reason to care about like a serverless platform like modal when you're thinking about GPUs is that GPUs, like, you know, they can be hard to get a hold of. They're getting like the, you know, as new generations come out, the previous generations get easier and easier, but still people tend to do like fixed provisions of GPUs. So they say like, okay, let me sign a contract with cloud provider X for 140 GPUs. And then, you know, throughout the day, maybe some in the morning, they don't need very many. And then in the middle, you know, the afternoon in their time zone, like that's when they get a bunch of people. And then the like, demand goes back down. That whole time you have to pay, you're paying for Those, you know, 140 GPUs that you had to allocate. And then like you make, you know, trend on hacker news or, or whatever.
Charles Frye [00:48:23]: And then your, your demand goes way up. And then during that time people have to queue, wait, they get sad, they complain about you on hacker news because you're slow. So it wasn't even enough to provision that many. And then things quiet down and you're back to overpaying. So this is what you get if you just manually, you just create a cluster and then you just worry about scheduling onto that cluster. And a lot of people do this. If you look at certain industry surveys, like people say, like, oh yeah, I'm using like at peak, I use 60% of the GPUs I'm paying for. That's crazy.
Charles Frye [00:49:00]: It's like extremely expensive. And so some people do a little better and they like try to allocate GPUs manually. Maybe they use like kubernetes for this. They've got some like smart, you know, terraform and Pulumi scripts to manage their allocations. The problem you'll run into is like a lot of this stuff is just slow. It takes minutes, tens of minutes to spin up new replicas. Not the seconds that you just saw me spinning up new replicas and copies with modal. And so because of that, what tends to happen is you get this kind of delayed effect, right? I don't know if anybody still remembers convolutions, but your provision GPUs end up being this time delayed convolution of your actual demand.
Charles Frye [00:49:45]: And that can lead to you like still getting bad quality of service when there's spikes and overpaying, which is a huge bummer. And so the nice thing about having some kind of like, fast automatic allocation, whether that's modal or something you, like, build in house, is that you can get both utilization to be high, so you're paying only for the GPUs you need, and the quality of service is high, so users don't have to wait in line because you don't have all the GPUs you could make use of. So I wrote a long blog post about this that's available at that QR code on The Modal blog, modal.com blog that sort of like walks through this and also talks about all the other ways people worry about, like, you know, GPUs are expensive. You want to make them, like, you want to make really good use of them, you know, so it's like the T shirt in the bottom corner there and like. Yeah, so just talking about all those things. So some technical nuggets about how to interpret utilization from Nvidia SMI and how to think about maximizing CUDA kernel performance. Because all these together are what allow you to deliver an application that has high quality and controlled cost. And like, isn't that what we're supposed to be doing as engineers? That's what we're supposed to be thinking about.
Charles Frye [00:51:02]: Um, yeah. So, yeah, that's, you know, all I have, by the way. Yeah, check out Modal. It's free. Like, running those demos that I just did was probably like, I don't know, like 10 cents, 30 cents, something like that. And modal gives you 30 bucks a month of free compute. So you can sign up today, try it out, check out some of our examples which include like, running deep seek and fine tuning flux on pictures of your dog and like, I don't know, like analyzing a bunch of parquet files. All kinds of, like, different things you might want to run with this platform.
Charles Frye [00:51:39]: And you can try, like, you can try all of them without having to spend a dollar. And hopefully you will, you know, see those and get excited about building your actual applications. Building your. Whether that's like an internal dashboard or something. Like, you know, SUNO runs their generative, you know, music generation on our platform and, you know, so could you. So, yeah, come and check us out.
Ben Epstein [00:52:09]: That was awesome. Yeah, that, that. So I have a couple, I have a couple of example, like real world examples of how this changes, how you can think about, like building systems, one of which is you were talking about Load. But something you didn't mention that I leverage is when you're running an expensive server, like fast. I have a fast API server that requires GPUs and it is amazing that modal can spin up and spin down. But like for me for example, I know that users are coming at around 9, 9:30 in the morning. I have my. I programmatically have my GPUs like have that server spin up to one replica at 850 just in case, just so it's ready.
Ben Epstein [00:52:49]: I'm spending 10 minutes worth of GPU time which is build at the second, totally worth it. And then it dynamically handles it and then it, you know that that just starts at 8:50 in the morning, which is amazing. And then it spins down all night when I don't have any users and I'm not spending any money, which is crazy. Like if you look at people complain a lot about LLM capacity, LM quotas, which is real, like very real that OpenAI and Google, et cetera. With Google at least you can pay for provision throughput. It's so much more expensive than their per token payment. And at some level I get that. But another level it's like, well that's ridiculous.
Ben Epstein [00:53:27]: I don't need provision throughput from 9am to 9pm or 9am to 9am 24 7. And so not that all open source models are as good but like the idea that I can deploy one as a backup if I get quoted by Google and it can spin up and scale as I need is a pretty amazing thing to have.
Charles Frye [00:53:45]: Yeah, yeah, definitely. I think, yeah. As the models for tasks like what Vybob was showing in the BAML demo where it's like extraction. Yeah creating a schema like you frequently have to go to whatever smart model but then like filling in schemas and noticing errors and then kicking them back up to the big model. Very straightforward for all these like 8 billion parameter like 34 billion parameter that range. And that happens to be a scale that works pretty well with our platform. It's, you know, you totally can run. You know like I've run deep seek mostly for fun.
Charles Frye [00:54:20]: I've run llama 405B to yeah I used that model to power a fake social media network of celebrities from 1995 and like it wasn't super economical to run them but like it did run them. So if you had some other operational reason. I think the biggest reason to do it is to sort of live in the future which is to say like clearly open source Models are getting better over time. For the proprietary models, one models are too. But there's things that like just work right now and those will get better. Like the models that are able to handle those tasks will get smaller and faster to run and the GPUs that run them will get cheaper. And so like it's clearly going to be the case even if even for things that are not economical right now that you know that those that people will be able to self serve these things economically just like they do other services that support their applications. So I think that's whether it's like having some extra capacity to burst onto or you know, living in the future.
Charles Frye [00:55:27]: I think there are already great reasons to run models yourself.
Ben Epstein [00:55:33]: Yeah. In the whole methodology of being economical with the systems you build, I really try to play with any model that I can run with vLLM on an L40S. A100s are great, but they're actually kind of more expensive than using Gemini, though cheaper than using Claude or OpenAI. So if you're coming from that world it's great, but if you're coming from Gemini, which I tend to use pretty heavily, it only really makes sense for me to use an L40S. That lets me run essentially any model that's 8 billion parameters or less, or any 14-billion-parameter model with quantization, which vLLM now sort of supports. So that's great, right? I was running the new Llama vision model, I think the 11-billion one, and I was doing extractions like I was saying. I have that repo and maybe I'll share it. It could not generate the schemas, but it could almost always extract against the schemas. And that's great. It's not as technical, but it's the same offloading a lot of people have been using, where you use a small model when you can and offload to a big model.
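For reference, here's a rough sketch of the kind of small-model extraction setup Ben describes: vLLM on a single L40S with a quantized mid-size model. The model name and prompt are illustrative, not his exact configuration.

```python
from vllm import LLM, SamplingParams

# A ~8B model, or a ~14B model with AWQ quantization, fits an L40S (48 GB VRAM).
llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # placeholder; swap in your model
    quantization="awq",
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=512)

# Extraction-style prompt: fill a schema the big model already designed.
outputs = llm.generate(
    ["Fill in the following schema from this bank statement: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```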
Ben Epstein [00:56:44]: Just from a system design perspective, running things on the smaller models when you can is very cool. But I also leverage Modal for non-GPU tasks. The GPU stuff is magic, but the CPU stuff alone is pretty sweet. I have a web server where even the first request is typically under a second, and then it goes down to under 200 milliseconds. The fact that I can have it completely off, and the first request still takes about a second most of the time, is pretty crazy. That's a crazy concept.
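A minimal sketch of that kind of scale-to-zero CPU web service on Modal, assuming the FastAPI endpoint decorator from recent Modal SDKs (`modal.fastapi_endpoint`; older versions call it `web_endpoint`). Deploy with `modal deploy` and you get a public URL with no server left running while idle.

```python
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("tiny-cpu-service", image=image)


@app.function()
@modal.fastapi_endpoint(method="GET")
def ping() -> dict:
    # With a slim CPU image, a cold start typically resolves in well under a
    # second; warm requests are much faster, and idle time costs nothing.
    return {"status": "ok"}
```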
Charles Frye [00:57:14]: Yeah, definitely. I think that's one of our major goals. The founders' vision was not initially "let's build a GPU startup for people's generative models." It was "let's build fast, auto-scaling, serverless infrastructure," because people want to do data- and compute-intensive things, and there's all kinds of stuff you can do with it. And if a one-second delay is too much for you and you need something faster, think about the cost of keeping a CPU container live all the time. Like you mentioned, being able to just be live even if there are no requests: you could totally set up the schedule thing you described, and that's five extra lines of Modal code.
Charles Frye [00:58:04]: But if you don't want to do that and it's a CPU container, the cost to keep one live recently went down from around 8 bucks a month to 4 bucks a month, I want to say, which is crazy. That's backed by us being able to use CPU and memory resources more efficiently, because we're constantly getting better at both scheduling cheap resources from the clouds and packing people together. That's a whole fun engineering problem; if you're a systems engineer looking for fun, hard problems, hit us up. Those advances allowed us to drop prices, so now you can keep a container warm and get the latencies you really hope for, like 250 milliseconds to respond, even when a request comes out of the blue.
Ben Epstein [00:58:58]: Yeah. When I'm helping other startups and we talk about infrastructure, I'll always recommend Modal when it's an option, and they'll say, "oh, but what if it's just a server? You could just have it running on an EC2 instance." And you could totally do that, but the price difference relative to the engineering difference is, especially now, not even worth it. I certainly do things on Modal that I could be doing on an EC2 instance; I just don't have any good reason to. Even the CI/CD, where my CD is just modal deploy and my app ships. I almost have this split brain where, at one level, I love telling people about Modal because I love the product, and at another level I'm like, oh, if I tell too many people, they'll run out of the GPUs and CPUs I want access to. But then I remember these guys are the best engineers.
Ben Epstein [00:59:48]: Like they'll figure out how to get more. Like they'll make it work.
Charles Frye [00:59:51]: Yeah, there's some stuff in that GPU utilization article about this. Basically, as people come in and ask for GPUs and CPUs, we scale up how many we're running, and we maintain a little buffer so there's always some capacity you'll get scheduled onto. As you stop using it, somebody else starts using it; we're sort of round-robining. There have been times in the past, like when a new GPU type comes out and it's hard to get on-demand or spot capacity, or right now if you want to pin to a specific cloud, and a specific region within that cloud provider, to avoid egress fees, where you have to queue for a couple of minutes to get a GPU. But those are also things we expect to resolve with scale. The reason you have to queue is that the buffer there is numerically smaller, and there isn't as much concentration of measurements, so we can't engineer those buffers as tightly to maintain the latencies we want for people running their code. So we've seen that, as the scale of the platform increases and usage increases, instead of it being "oh no, now we're running out of space,"
Charles Frye [01:01:19]: there's actually a virtuous cycle of improved developer experience. As we get more people and larger and larger customers on the platform, that creates aggregate demand that we can engineer around more effectively.
Ben Epstein [01:01:34]: Yeah. There's a product I'm waiting for someone else to build. I don't want to do it, but I'm waiting for someone to build a company on top of Modal that's the true serverless lakehouse architecture. Bauplan did this with Lambda, and that's fun, but it's not quite there because Lambda is too slow. You submit requests, those requests get analyzed, and they spin up a Modal instance or instances that run DuckDB or Polars, which read and write from either your S3 or, once Modal gets better with their volumes, natively from Modal volumes. Whatever the storage is.
Ben Epstein [01:02:14]: You just have DuckDB or Poller spin up. Pollers already can manage GPUs. Like it can already leverage GPUs and read and write to and from Iceberg and Delta. So you. That's what I want. Like that's my dream architecture is. Is this pure serverless like compute over here, storage over here system. I'm very excited for.
Ben Epstein [01:02:33]: It'll happen. I just don't know who's going to do it.
Charles Frye [01:02:35]: Yeah. We're hoping some people will want to self-serve that on top of Modal and build their own things. We have a couple of demos using Polars or DuckDB, and I think we have a blog post or two, some sassy thing about building your own lakehouse in one file of Python. So you can definitely at least play with it and hopefully use it for small jobs.
Charles Frye [01:03:02]: As we grow we would love to also build that sort of thing on top of our platform, you know and like go up like right now we're just making sure we got that infrastructure in place before we build experiment management or workflows or. Or lake houses. But it's totally. It's on the. It's on the world domination plan for sure.
Ben Epstein [01:03:23]: I love that. This is awesome. Unfortunately, I think we're at time, and we didn't get to do a Prefect demo; maybe we'll get one on the next stream. But I want to call out that Prefect has built a pretty sick integration with Modal. Modal already has crons, and you can do scheduling and get all your logs, which is amazing. But if you're running something more complex and looking for really first-class orchestration, Prefect has a system.
Ben Epstein [01:03:49]: Actually I wrote the guide for connecting the two, which was really a fun and not very hard project. You can have Prefect run that orchestrates all your jobs. And every time a job or a flow or a task kicks off, it goes to Modal and creates a modal sandbox environment which will install your dependencies or pull in your Docker image, whatever you want, spin it up. Because you're using modality, the sandbox starts and it's. I mean for me right now I'm using modal with CPUs. Sandbox starts in milliseconds and I'm using UV with my lock file, so that installs in milliseconds. And so the latency between when a job starts on Prefect and the code is actually running in Modal can be under two or three seconds. And for batch jobs, I mean that's like very real and very reasonable and so you get both.
Ben Epstein [01:04:32]: And that's how I actually run all my production jobs. Prefect is scheduling and then pushing onto Modal for execution.
Charles Frye [01:04:39]: Yeah, definitely. A couple of other orchestration and workflow folks have started building stuff on Modal too. ZenML has some stuff about managing infrastructure with Modal, though that's more MLOps-specific, and then there's Dagster, Airflow, all of those. I think Prefect might have the most mature integration out of them, though it's been a while since I checked.
Charles Frye [01:05:08]: I'm spending all my time trying to make VLM go really fast these days. So I haven't been doing, I've been running Polaris and, and doing workflows.
Ben Epstein [01:05:17]: But yeah, my slowest thing with my VLM instance is not the modal side, it's the VLM side. Like when I get my server up, Modal tends to start within like 400 milliseconds. And then VLM takes like three or four seconds to start, which I assumed was not something modal could, could, could do. It could work on.
Charles Frye [01:05:37]: Yeah, yeah, there's some things you can do there, I think. Like. Yeah, there's two. Like I was, when I say trying to make VLM go fast, I mean like taking advantage of all the flags that they have for like quantize the model this way. Oh yeah, handle this in chunk pre fill, activate this secret flag that.
Ben Epstein [01:05:54]: Oh, I'd like to when you figure that out, please let me know. I would love to optimize my VLM servers.
Charles Frye [01:06:00]: Yeah, so there's that part that's like the execution latency. Then there is the startup latency where I think a lot of people run VLM on very long running servers, so they're willing to do a lot of work upfront to improve later latencies. So right now Modal is limited in what we can do to avoid to amortize that work more effectively on a serverless platform. The big one is GPU snapshotting. We just like take the state of the memory on the gpu like at any point legally, you know, a process running on a typical operating system is just the like state in memory. Right. So like in principle it's a very easy thing to whiteboard. Like right before you're about to take a request, what's the state of memory on the cpu, what's the state of memory on the gpu? Flash freeze that to disk, bring it back up later and put the exact same thing in place.
Charles Frye [01:06:54]: Yeah, so like we have some, we have some prototypey stuff for that and like you can probably even turn on the GPU snapshotting if you look carefully at our SDK. But it's still, it's like Nvidia has only just put out this like checkpointing tool and there's lots of rough edges and so it's still, the GPU part is hard and then there's only limited stuff for a framework like Vllm or Pytorch that you can like, that you can capture with just the CPU memory state. But those are some tricks that could shave off sometime.
Ben Epstein [01:07:27]: As a builder who is good but nowhere near the quality of the engineers at Modal, part of what I'm happy to pay Modal for is to not use those rough edges. I'm going to wait, I'm going to wait until it's in the official documentation. Modal and I will take advantage and I'm happy to take the two second loss right now to not have, I'm paying to not have to do all of those ops, which is such a sick like thing that I can be up and running in less than a day.
Charles Frye [01:07:54]: Word.
Ben Epstein [01:07:56]: That's awesome. Charles. Thanks so much for coming on. I know we're a little over. I appreciate it. It's a really, a really fun.
Charles Frye [01:08:01]: Yeah, thanks for having me, Ben.
Ben Epstein [01:08:02]: Yeah, awesome. All right, thanks everyone for joining. Talk to you later.