
The AI Developer Experience Sucks so Let's Fix it // Erik Bernhardsson // AI in Production

Posted Mar 13, 2025 | Views 275
# AI Developer Experience
# Cloud
# GPU
# Modal Labs

Erik Bernhardsson
Founder @ Modal Labs

Erik has been working on some crazy data stuff since early 2021. Previously he spent 6 years as the CTO of Better.com, growing the tech team from 1 to 300. Before Better, Erik spent 6 years at Spotify, building the music recommendation system and managing a team focused on machine learning.


SUMMARY

Erik talks about how he went down a deep infrastructure rabbit hole to fix the developer experience of working with the cloud and GPUs. This involved building a custom file system, a custom scheduler, and much more.


TRANSCRIPT

Demetrios [00:00:03]: Our final keynote of the day. Congratulations for hanging with us and sitting through all these cringe videos that I set up for you. This has been a blast. And without further ado, it is my great pleasure to introduce our final keynote. I feel like I should grab the guitar for this one and bring him onto the stage. Erik, the CEO and founder of Modal. How you doing, dude? Oh, why are you so small? Hold on. Hey, that was.

Demetrios [00:00:40]: That was not meant to be a power move right there. How's it going, man? It's been a while.

Erik Bernhardsson [00:00:48]: Good, how are you? Can you hear me?

Demetrios [00:00:51]: Yeah, I hear you nice and clear. I like that you got a nice mic for this. I appreciate you taking this seriously.

Erik Bernhardsson [00:00:57]: I did take it very seriously.

Demetrios [00:01:00]: So man, I'm excited to hear from you. As you know, I hear from folks in the community that are using Modal and absolutely love it. I'm gonna let you rock and roll and then maybe I'll bring on a few friends at the end for the Q and A and we will finish out the day strong.

Erik Bernhardsson [00:01:20]: Nice. Cool. I'll share my screen. How do I... Yeah, there it is. There we go. Okay, I'm gonna switch to this. How does it look?

Demetrios [00:01:31]: Looks great.

Erik Bernhardsson [00:01:32]: Nice. Okay. Sweet. Okay, so I am Erik. I'm the CEO of a company called Modal, based in New York. I've been doing a lot of random stuff in my career. In particular, I was at Spotify for almost seven years, where I did a lot of AI stuff and built a music recommendation system. I open sourced the vector database Annoy and the workflow tool Luigi, which no one uses today.

Erik Bernhardsson [00:01:59]: I was the CTO of Better.com for many years. Started hacking on Modal during the pandemic as a sort of fun side project. I had quit my job and wanted to do something else, and in particular I was very curious about the data/AI/machine learning stack. I realized the best place to start was at the bottom layer, and started hacking on all kinds of very complex things around container management, file systems, schedulers. But the core thing I wanted: I've built data/AI/machine learning applications for most of my career, in particular at Spotify as I mentioned, and I just felt like the developer experience wasn't there. I wanted something that lets me iterate, something that's kind of general purpose so I can build a lot of stuff. And then it actually turned out GenAI was a really good use case for us. That's where we found a lot of product market fit and a lot of traction.

Erik Bernhardsson [00:03:01]: What is Modal? I'll give you a demo in a second. I'll actually do some live coding. But real quick: Modal lets you write code, and we only support Python today, and we manage all the infrastructure around running things in the cloud, especially running things on different GPUs, building containers, doing all the load balancing and routing. You can think of us as kind of like Kubernetes or AWS Lambda: like Lambda in the sense that we run the code for you, and like Kubernetes in the sense that we're the infrastructure layer, but we're a hosted provider. It's a little bit different than Kubernetes in the sense that we manage all this stuff for you. And in particular, the thing I designed Modal for was I wanted to make cloud development feel like local development. Because if there's anything I've noticed, and I'm a massive lover of the cloud, I've used it since 2008, which is when I first started using AWS. And it's obviously brought us a lot of power, but I kind of feel like it's actually, in a way, a step backwards in terms of developer experience if you're working on these data/AI/machine learning use cases.

Erik Bernhardsson [00:04:14]: So that took us into a deep rabbit hole where we realized we had to build our own scheduler, container runtime, and all kinds of complex stuff. But before we jump into that, basically, what is Modal today? We run a huge pool of compute resources in the cloud, of GPUs in particular: H100s, A100s, whatever, like thousands of those. And then we expose a Python SDK that makes it very easy to interact with us and to run arbitrary code. So all you need to do is basically write a few lines of Python. You can think of it as: you take a function and put a decorator on it, and now it's a cloud function, and that makes it possible to run it in the cloud. And before we jump into the demo, just talking a little bit about the use cases: GenAI inference has been the main use case. So particularly Stable Diffusion, or we have a customer called Suno that does AI generated music.

Erik Bernhardsson [00:05:10]: So you put in a prompt like, I want to hear hip hop music about data engineering or something, and it makes songs, and it runs on Modal. A lot of batch jobs. We've also seen, more recently, a lot of success with computational biotech, which is kind of exciting. A lot of LLM stuff, of course. Yeah, I already mentioned Suno; Substack uses us. There's many other customers. I will switch screens and look at my terminal. We're going to keep this very old school. Here we go. Okay, so I'm going to try to do some live coding here and show you what Modal looks like when you're actually working with it. So here's an incredibly simple example.

Erik Bernhardsson [00:05:55]: So sorry, I lost my place here. Very simple example of a little program that uses Modal. And this is not too complex. Basically we have a little function that returns the square of a number and prints stuff to standard out. We have some little glue code to just call that. And you can run Modal in different ways, but one of the ways we let you do it is we have a CLI. And the idea is basically, when I run this thing, it takes the code that I have on my computer, it sticks it in a container in the cloud, and it executes that code. And then as it's printing stuff or whatever, it streams that back.
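
For reference, a minimal sketch of what that demo script plausibly looks like. The function and app names here are hypothetical reconstructions; the API usage follows Modal's public docs at the time of writing:

```python
import modal

app = modal.App("square-demo")  # hypothetical app name

@app.function()
def square(x: int) -> int:
    print(f"computing the square of {x}")  # stdout streams back to your terminal
    return x * x

@app.local_entrypoint()
def main():
    # Executes in a cloud container, but feels like a local call.
    print(square.remote(4))
```

Invoked from the CLI with something like: modal run square_demo.py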

Erik Bernhardsson [00:06:36]: The goal here is to make it feel like I'm almost running things locally. That includes things like, if I can't spell today, but it's fine: if I edit the code and rerun it, it actually just runs the latest code. I don't have to build a Docker container, push that container to the cloud, download logs, et cetera. I have this very fast feedback loop, almost as if I'm running things locally. This obviously wasn't super exciting code. We're just printing some stuff. So let me start to show a little bit more.

Erik Bernhardsson [00:07:11]: So let's say we want to run this on an H100, Nvidia's flagship GPU, or at least it used to be; now they have B200s and all kinds of stuff. Yeah, with just one code change, now we're running it in a container that has access to a GPU. We're not using the GPU, but I can show you how to do that in a second. So Modal also lets you define basically arbitrary container images as environments, in code. You can also give us a Dockerfile, actually, but you can also define it in code. I actually find it a lot easier to define it in code.
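
The single code change he describes is plausibly just a parameter on the decorator (a sketch; the gpu= argument is part of Modal's documented API):

```python
@app.function(gpu="H100")  # request an H100 for this function's containers
def square(x: int) -> int:
    print(f"computing the square of {x}")
    return x * x
```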

Erik Bernhardsson [00:07:46]: So we can install something like Torch in a container, and then inside the container we can do import torch and print something like torch.cuda.is_available(). And if I run this thing now, it most likely already has this container image cached, because I already used it in the past. Whoops, what did I do? Oh, I need to set the image. I should know: on the function, I need to set image equals image. Now we're going to run this with an arbitrary container image. This container image is already pre-built. And then we're importing Torch.

Erik Bernhardsson [00:08:30]: We're able to use Torch, but I can also install other stuff. Let's say I want to install whatever; actually, you can run arbitrary commands, so echo whatever, and then pip install seaborn or something like that. We then build containers in the cloud for you, on demand. And we can do that extremely fast because we built our own container builder and our own Dockerfile parser and a whole lot of other stuff. So we can typically build custom containers in a few seconds, and then just a few seconds later we can already run code with those container images. You can see we built this container in about 17 seconds, and now we're basically spinning up a container on an H100 in the cloud and executing this. If we run it again, we already have this container cached, so it's going to be much faster.
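
A sketch of the image definition being demoed, assuming Modal's Image builder API (debian_slim, pip_install, and run_commands exist in the public SDK, though the exact chain here is a reconstruction of what was typed):

```python
import modal

app = modal.App("image-demo")  # hypothetical app name

# Container image defined in code; built remotely and cached.
image = (
    modal.Image.debian_slim()
    .pip_install("torch")
    .run_commands("echo whatever")  # arbitrary build-time shell commands
    .pip_install("seaborn")
)

@app.function(gpu="H100", image=image)
def check_cuda():
    import torch  # available because the image installed it
    print(torch.cuda.is_available())
```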

Erik Bernhardsson [00:09:26]: So far we're not doing too much stuff, but I'll show you in a second how you can also scale things out. So, just showing what happens when you run it again. Let's delete some of this stuff; we can just make it simpler: print x. You can map over these. You can map any of these functions in Modal.

Erik Bernhardsson [00:09:52]: We make it very easy to do what I think is called embarrassingly parallel, like fan-out parallelism. So you could take a function like this and just map over 10,000 elements. We'll unpack that by just doing list(). And if I run this thing, we're basically going to try to spin up as many containers as we can. And remember, this is actually running on H100s, right? So now we're going to try to basically parallelize this as much as we can by spinning up lots of H100s. I'm not sure why it's taking a little longer than usual, but you can see we're already running three H100s. If you run a big map over lots of elements, you can easily get hundreds or even thousands of GPUs over time.
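
The fan-out he shows is plausibly a one-liner with .map(), which fans calls out across as many containers as Modal can spin up (a sketch reusing the square function from the earlier example):

```python
@app.local_entrypoint()
def main():
    # .map() returns a generator; list() unpacks it, blocking until all
    # 10,000 calls have completed across however many containers spun up.
    results = list(square.map(range(10_000)))
    print(results[:10])
```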

Erik Bernhardsson [00:10:39]: It sort of scales up. If we go to this link, in fact, you can actually... let me see if I can share this. You can see this. Sorry, you can see this function, and you can look at container metrics. So we have a pretty rich UI that lets you look at all kinds of metrics. Zooming in.

Erik Bernhardsson [00:11:08]: Oops. And there's a whole lot of other stuff. I'm going to jump back to the... there's some noise in the background; I think it's someone else. That might have been me. Yeah, it's okay. So there's a lot of other stuff you can do in Modal. You can define file systems, you can deploy functions and call them in an independent context, you can deploy web servers, you can set up cron jobs, all kinds of other random stuff.
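
To make a couple of the primitives just mentioned concrete, a hedged sketch assuming Modal's documented Volume and Cron APIs (the app, volume, and function names are illustrative):

```python
import modal

app = modal.App("deployed-utils")  # hypothetical app name

# A persistent file system shared across containers.
volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_result(name: str, payload: bytes):
    with open(f"/data/{name}", "wb") as f:
        f.write(payload)
    volume.commit()  # persist the write so other containers can see it

# A cron job: runs every day at 09:00 UTC once the app is deployed.
@app.function(schedule=modal.Cron("0 9 * * *"))
def daily_report():
    print("generating report...")
```

Once deployed (modal deploy), such functions can be looked up and called from any other context, which is what "call them in an independent context" refers to.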

Erik Bernhardsson [00:11:37]: So let's switch back to the presentation: behind the scenes. I'll talk a bit about how we built Modal, because there's a lot of complexity around managing containers and stuff like that. So we built our own infrastructure in order to do this. Like I mentioned, we built our own file system and many other things. That's the secret to how we can spin up containers so fast. What are containers? So containers are actually not that complex. And by the way, the reason we went for containers was that in a lot of data/AI/machine learning use cases, the only way people really package code is just fat Docker containers. There's no equivalent of small self-contained binaries.

Erik Bernhardsson [00:12:31]: You just have to live with a world where things have a lot of dynamic libraries and a lot of Python modules, et cetera. So the only way to really distribute and package code is through big fat Docker containers. But they're not that complex, actually. They're really just a file system and a bunch of stuff around process isolation. It turns out a lot of Docker images have a lot of junk. Just looking at a Dockerfile, there's a lot of stuff, and a lot of it is never actually read by the container. And then another inefficiency is that a lot of the stuff in these containers is not particularly unique. It's the same set of files that you see in a lot of them.

Erik Bernhardsson [00:13:07]: And this is just three random containers I looked at. So we use a thing called content-addressed storage. Basically, when we snapshot a container's file system, we look at all the files and compute a checksum, and then we see if we already have that blob. If not, we just add that blob. And that means container images themselves are actually just metadata, just pointers to blobs. That means we can also cache container images extremely fast, and we can lazy-load blobs when we read data from containers. Then on top of that, we also spent a bunch of time on CPU memory snapshotting. I didn't actually show you this, but it's a way you can improve cold starts even more if you have, for instance, model initialization happening, which is very common in GenAI models.

Erik Bernhardsson [00:14:03]: You have these very large models, and with Modal you can snapshot the memory so that you can restore these containers very fast. We're working on GPU memory snapshotting as well. And the end result is we can snapshot... sorry, we can start containers very fast. And that is, I think, the key thing for having this local-like experience even though you're actually running stuff in the cloud. Why does this all matter? I talked about the developer experience, but it also matters because a lot of the use cases with Modal are online inference. And online inference, unlike training or something like that, is very unpredictable in terms of usage. Right?
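
Stepping back to the content-addressed storage scheme described a moment ago, here is a toy sketch in plain Python (not Modal's actual implementation): each unique file is stored once under its content hash, and an "image" reduces to metadata pointing at blobs.

```python
import hashlib
from pathlib import Path

BLOB_STORE = Path("/tmp/blob-store")
BLOB_STORE.mkdir(exist_ok=True)

def store_blob(path: Path) -> str:
    """Store a file's contents under its content hash; identical files dedupe."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    blob = BLOB_STORE / digest
    if not blob.exists():  # skip blobs we've already seen
        blob.write_bytes(data)
    return digest

def snapshot_image(root: Path) -> dict[str, str]:
    """An 'image' is just metadata: a mapping of relative path -> blob hash."""
    return {
        str(p.relative_to(root)): store_blob(p)
        for p in root.rglob("*")
        if p.is_file()
    }
```

Reading a file back then only requires fetching the blob for its hash, which is what makes lazy loading and fast caching cheap.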

Erik Bernhardsson [00:14:46]: So this is just synthetic data in this chart. But I always think of resource pooling as actually kind of a free lunch, in the sense that if you take a bunch of uncorrelated workloads and you run them in a multi-tenant way, you can actually achieve much higher resource utilization. We do all these other tricks too, like bin packing in how we allocate resources in the cloud, et cetera. But the key to doing that is you have to have the ability to scale up and scale down very fast, on demand. So in order to do that, you have to start containers fast. Taking a step back, I talked a little bit about the infrastructure in the last few minutes. But taking a step back, why do people like Modal? People like Modal because it's high code: you can run whatever code you want.
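
Looking back at the resource-pooling point, a tiny simulation makes the "free lunch" visible: with dedicated capacity each tenant must provision for its own peak, while a shared pool only needs to cover the peak of the summed demand (synthetic numbers, mirroring the chart):

```python
import random

random.seed(0)
T, N = 1_000, 20  # time steps, tenants

# Bursty, uncorrelated per-tenant demand (synthetic).
demand = [[max(0, round(random.gauss(2, 3))) for _ in range(T)] for _ in range(N)]

# Dedicated: every tenant provisions for its own peak.
dedicated = sum(max(d) for d in demand)

# Pooled: provision once, for the peak of total demand.
pooled = max(sum(d[t] for d in demand) for t in range(T))

print(f"dedicated={dedicated} GPUs, pooled={pooled} GPUs")
# Uncorrelated bursts rarely coincide, so the pooled peak comes out
# well below the sum of individual peaks.
```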

Erik Bernhardsson [00:15:33]: We're not an AI API, so to speak. People come to us because they have their own custom models, they have their own custom workflows, and they need a provider that can handle that, that's generic in the same way EC2 or Lambda or Kubernetes is generic. You can iterate super fast: you saw already, as I was running stuff in the cloud, how fast the iteration cycles are. We're also fully usage based. So in Modal, the only way we charge you is for code that's running. And so when we scale up, we charge for that.

Erik Bernhardsson [00:16:05]: When we scale down, we don't charge for it. When we scale down to zero, there's no cost. And there's zero capacity management, because we manage all the capacity; you can get access to thousands of GPUs. How do you get started? pip install modal. All users get 30 bucks a month for free, and if you're a startup, you can actually get 10 to 50k in Modal credits as well. It takes about 5 minutes to get started. All you need to do is pip install modal, and then there's a thing to set up a token through GitHub or Google, and then you can start to write code and run it in the cloud and deploy things and use GPUs and all kinds of other stuff. So yeah, thank you so much.
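
For the record, the getting-started steps he describes map to roughly the following (the token command follows Modal's docs at the time of writing; the file name is illustrative):

```python
# In a terminal:
#   pip install modal
#   modal token new     # authenticate via GitHub or Google in the browser
#
# Then, in hello_modal.py:
import modal

app = modal.App("hello-modal")

@app.function()
def hello():
    print("hello from a cloud container")

# Run once with:   modal run hello_modal.py::hello
# Deploy with:     modal deploy hello_modal.py
```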
