MLOps Community

Preemption Chaos and Optimizing Server Startup

Posted Jul 21, 2023 | Views 534
# LLM in Production
# Optimizing Server Startup
# Repl.it
speakers
Bradley Heilbrun
Engineer @ Replit

Replit engineer focused on reliable and scalable LLM infrastructure. Formerly YouTube's first SRE, longtime Googler, and early PayPal Linux guy.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

GPU-enabled hosts are a significant driver of cloud costs for teams serving LLMs in production. Preemptible instances can provide significant savings but generally aren’t fit for highly available services. This lightning talk tells the story of how Replit switched to preemptible GKE nodes, tamed the ensuing chaos, and saved buckets of cash while improving uptime.

TRANSCRIPT

So we're gonna keep it moving, and next up, I've got the best fun fact ever about our next guest. Bradley, where are you at? Bradley is currently working for Replit. But I've read in your bio that you are also known as YouTube's engineer number seven. Uh, yeah, I think employee seven, SRE one. Oh my God, man, that's so wild.

That is something that's, like, history. That is so cool. So yeah, you see all these gray hairs? Those are outages. You remember every one of 'em, huh? Exactly. There's a lot of lost cat videos. That is awesome. Well, dude, I've been chatting so much. I'm gonna let you chat now and I'll be back in 10 minutes.

No, thank you. Thank you for keeping us honest. All right, I'm gonna get started here. So welcome to my lightning talk. I'll try and keep it to 10 minutes. "Preemption Chaos and Server Startup" is kind of a nuts-and-bolts systems talk for those serving their own LLMs. So what is Replit?

I should go back. Bradley Heilbrun. I work at Replit. What is Replit? It's an online, web-based development environment, among other things. This is a shot of our IDE experience. We use LLMs for a number of different things: for code completion, to transform code, to explain code. There's a debugging experience where we try and debug with the LLM. And we've self-trained as well as hosted our own models for a long time now, or what feels like a long time. There's a slide of a couple of the models that we've built.

It's a tweet from Amjad, our founder. And one of these is open source. You can find it on Hugging Face. I'll pitch Replit: we're always looking for people, so if you're interested in joining, let us know. So the abstract of this talk: using preemptible GPUs in the cloud to cut your cost by two thirds while maintaining your uptime for your users.

So, large language models are large, and serving them at low latency requires the best available GPUs, just as a general statement. For the best user experience, we wanna drive down those latencies, and as we push models, as they get bigger, we need better GPUs.

But those are expensive. Here's a shot from Google's pricing page. You can see that an A100, the best available GPU right now (although we're testing H100s), is about $3,000 per month, no discounts. By contrast, the spot price is a thousand dollars. So you can cut your cost by two thirds.
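As a quick check on the "two thirds" figure, here is a minimal sketch of the arithmetic using the two round numbers quoted in the talk. The function name `spot_savings` is hypothetical, and real prices vary by region and over time.

```python
def spot_savings(on_demand_monthly: float, spot_monthly: float) -> float:
    """Return the fraction of monthly cost saved by using spot instead of on-demand."""
    if on_demand_monthly <= 0:
        raise ValueError("on-demand price must be positive")
    return (on_demand_monthly - spot_monthly) / on_demand_monthly


# Round figures quoted in the talk: about $3,000/month on-demand, $1,000/month spot.
savings = spot_savings(3000.0, 1000.0)
print(f"Savings: {savings:.0%}")
```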

That's if you're able to use the spot prices. What is spot? Well, spot is a charitable term for preemptible and best effort, unfortunately. Practical matters: preemptions are real, unfortunately. If you have any workloads on preemptible node pools, you can go to your logs and just see a steady stream of node preemptions. And stockouts are real.

What's a stockout? Well, it's when an entire zone has no resources available. And these are real events that happen every day. In fact, if you go to Google's documentation page for spot nodes, they say don't run highly available services on this. But we have, and we will.

So another thing that's a little harder to figure out is that their guarantees are overall just worse. There's only 15 seconds of notification when a node is gonna go down. And, for example, pod disruption budgets, which are a Kubernetes feature, don't work.
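To make the 15-second window concrete, here is a minimal sketch of a server that stops taking new requests on SIGTERM and drains in-flight work within a fixed budget. It assumes the termination notice reaches the process as SIGTERM, which is how Kubernetes normally stops containers. The `Drainer` class and its method names are hypothetical, and this is not a description of Replit's serving code.

```python
import signal
import threading
import time

GRACE_SECONDS = 15.0  # notice window quoted in the talk


class Drainer:
    """Tracks in-flight requests and refuses new ones once shutdown starts."""

    def __init__(self) -> None:
        self.stopping = threading.Event()
        self._lock = threading.Lock()
        self._in_flight = 0

    def on_sigterm(self, signum, frame) -> None:
        # Signal handlers should do very little: just flip the flag.
        self.stopping.set()

    def try_begin(self) -> bool:
        """Register a new request; returns False if we are shutting down."""
        with self._lock:
            if self.stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished."""
        with self._lock:
            self._in_flight -= 1

    def drain(self, budget: float = GRACE_SECONDS, poll: float = 0.01) -> bool:
        """Wait up to `budget` seconds for in-flight work; True if fully drained."""
        deadline = time.monotonic() + budget
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll)
        with self._lock:
            return self._in_flight == 0


if __name__ == "__main__":
    drainer = Drainer()
    signal.signal(signal.SIGTERM, drainer.on_sigterm)
    drainer.stopping.wait()  # block until the termination notice arrives
    print("drained cleanly" if drainer.drain() else "ran out of time")
```

A budget this short only covers finishing requests already in progress; it is the replacement node's startup time that decides how long capacity is missing, which is where the rest of the talk goes.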

So in general, we kind of tackle this in three ways. We spread across as many availability zones as possible to spread the risk. There are techniques for falling back to expensive nodes, or also using commitments to drive down costs, maybe for your baseline capacity. This talk is about number three.

We have only 10 minutes, so I'll focus on number three here, and that's speeding up the server startup. You need to be really dynamic and reactive if the nodes are popping in and out of existence. So, just a quick story about how we sped up our server startup.

When nodes can disappear in 15 seconds, you don't want boot times to take 15 minutes, which is what happened to us. So here's an example I pulled from pre-optimization. It's about two minutes to get the node online. This is like installing the drivers. Then there was about 11 minutes of our application container starting, and then about five minutes for the model to load the weights and become healthy for serving: in total about 18 minutes, if I did my math right.
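The breakdown above adds up as stated. A small sketch of the startup budget, using only the three approximate figures from the talk (the phase labels are paraphrases):

```python
# Approximate pre-optimization startup phases quoted in the talk, in minutes.
phases = {
    "node online (driver install)": 2,
    "application container start": 11,
    "model weights load until healthy": 5,
}

total_minutes = sum(phases.values())
slowest_phase = max(phases, key=phases.get)

for name, minutes in phases.items():
    print(f"{name}: {minutes} min ({minutes / total_minutes:.0%} of startup)")
print(f"total: {total_minutes} min, slowest phase: {slowest_phase}")
```

The container start dominates, which is why the first two optimizations in the talk target the image.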

So the first thing we did was just, let's try and make the container smaller, and we were able to shave about 10 gigabytes from the compressed size overall. And here's some examples. It turns out that pip by default has a cache. We don't need that in production. We're not gonna be reinstalling packages.

We also had some dependencies that were dev and test only, like PyTorch. Turns out installing PyTorch also installed CUDA. The CUDA libraries are huge: two gigabytes there. Switching to a slim base image, pretty standard stuff. We use Triton Inference Server from NVIDIA to serve our models, the self-hosted models.

And by default it includes support for multiple frameworks: TensorFlow, PyTorch, ONNX. If you're serving one model, you probably only need support for one framework, so we were able to shave off a lot of dependencies there. And then finally, it turns out our build process was leaving a bunch of artifacts in the container.
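One way to spot this kind of cruft (pip caches, unused framework libraries, leftover build artifacts) is to rank directories by size inside the container's filesystem. The following is a minimal sketch of that idea, not the tooling used in the talk; the function names `dir_sizes` and `largest_dirs` are hypothetical.

```python
import os


def dir_sizes(root: str) -> dict[str, int]:
    """Map every directory under `root` to the total bytes of files beneath it."""
    sizes: dict[str, int] = {}
    # Walk bottom-up so each child directory is totalled before its parent.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            try:
                # lstat so that symlinks count as links, not as their targets.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
        for name in dirnames:
            total += sizes.get(os.path.join(dirpath, name), 0)
        sizes[dirpath] = total
    return sizes


def largest_dirs(root: str, top_n: int = 10) -> list[tuple[str, int]]:
    """Return the `top_n` largest directories under `root`, biggest first."""
    ranked = sorted(dir_sizes(root).items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    for path, size in largest_dirs("/", top_n=20):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Run inside the container, the top entries tend to point at the candidates worth questioning first.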

So, just general housekeeping. This shaved a little bit of time off the 18 minutes, but not enough. It was like a minute or something like that. Enter GKE image streaming. Google has this feature called image streaming. Something equivalent exists in Amazon.

To use a quote from their blog, it "reduces image pull time from minutes to seconds," and this is what happened for us. Honestly, the container portion of the startup went from minutes to seconds. And it does this by streaming the actual file contents in the background as you read them.

So this works great if you don't need every file in a container, which was the case for us. That might not be the case for everyone's container, but it was the case for us, so it was a great help. What's not mentioned as obviously is that enabling this feature is actually on the entire node.

It's node-wide, so the system containers, the Kubernetes containers, those actually started to boot faster as well. This shaved many minutes off of the overall start time. Next we came to loading the actual models. As I said, large models are large. So, you know, a 3 billion parameter model might be like 12 gigabytes on disk.
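The 12-gigabyte figure is consistent with storing each parameter as a 4-byte (32-bit) float. That precision is an assumption here, since the talk does not state it. A minimal sketch of the estimate, with a hypothetical helper name:

```python
def weights_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Estimate on-disk weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: 4 bytes per parameter (fp32). The talk only gives the 12 GB total.
print(weights_size_gb(3_000_000_000, 4))  # 3B parameters at fp32
print(weights_size_gb(3_000_000_000, 2))  # the same model at 16-bit precision
```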

At the time, we were fetching our models from GCS, Google Cloud Storage, and we realized that we were fetching onto a remotely attached spinning disk. It's like the slowest thing you could imagine. And so we're like, aha, let's use the fastest thing available, which is a locally attached NVMe SSD.

We're like, oh, this is gonna be much better. It didn't help. It turns out there was no improvement, which was frustrating and surprising. So we started to dig in. We use the tool gsutil from Google, which is basically rsync for GCS. And we were only getting 50 megabytes a second of transfer.

Which was upsetting, because there's at least a gigabit NIC. We should be able to buffer into RAM. The disk should be faster than that. So, you know, what's going on? It took a long time to figure out, but switching to a different container image quintupled the performance. So the transfer speed went up over 5x, just by switching the container image.
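To see what 50 megabytes a second means for startup, here is a rough transfer-time estimate for a 12-gigabyte model at the rates mentioned. It uses decimal units (1 GB = 1000 MB) and ignores protocol overhead, both of which are simplifying assumptions; the function name is hypothetical.

```python
def transfer_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move `size_gb` gigabytes at `rate_mb_per_s` megabytes/second."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000.0 / rate_mb_per_s


MODEL_GB = 12.0          # size quoted for a 3 billion parameter model
OBSERVED_MB_S = 50.0     # throughput observed in the talk
GIGABIT_MB_S = 1000 / 8  # a 1 Gbit/s link moves at most 125 MB/s

print(transfer_seconds(MODEL_GB, OBSERVED_MB_S))      # at the observed rate
print(transfer_seconds(MODEL_GB, GIGABIT_MB_S))       # at 1 Gbit/s line rate
print(transfer_seconds(MODEL_GB, 5 * OBSERVED_MB_S))  # at exactly 5x the observed rate
```

At the observed rate the download alone takes about four minutes, and the observed rate is well under even a single gigabit link, which is why the number looked suspicious.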

It's still a Google Cloud SDK image, but we moved from Alpine to slim and it got wildly better. It turns out multiprocessing is hard. So I dug into the code for gsutil, and there was this comment that said, basically, gsutil would hang on Alpine when multiprocessing was enabled, and we couldn't figure it out.

So we disabled multiprocessing. And of course nowhere is this documented. It's only in the repository. So that was overall quite upsetting. But this allowed the model loading portion to change from, let's say, four minutes for that model down to less than 30 seconds, which is a huge win and allows the node to come online much faster.
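To illustrate why losing parallelism hurts a large transfer, here is a minimal sketch of the general technique: split a file into byte ranges and copy the ranges concurrently. It works on local files with threads and is not gsutil's implementation; all function names are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(size: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, size) into up to `parts` contiguous (start, length) ranges."""
    if size <= 0:
        return []
    parts = max(1, min(parts, size))
    base, extra = divmod(size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, length))
        start += length
    return ranges


def copy_range(src: str, dst: str, start: int, length: int, block: int = 1 << 20) -> None:
    """Copy `length` bytes starting at offset `start` from src into dst."""
    with open(src, "rb") as fin, open(dst, "r+b") as fout:
        fin.seek(start)
        fout.seek(start)
        remaining = length
        while remaining > 0:
            chunk = fin.read(min(block, remaining))
            if not chunk:
                break
            fout.write(chunk)
            remaining -= len(chunk)


def parallel_copy(src: str, dst: str, workers: int = 4) -> None:
    """Copy src to dst using `workers` concurrent range copies."""
    size = os.path.getsize(src)
    with open(dst, "wb") as f:
        f.truncate(size)  # pre-size the destination so each worker can seek
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(copy_range, src, dst, start, length)
            for start, length in byte_ranges(size, workers)
        ]
        for future in futures:
            future.result()  # re-raise any worker error
```

With `workers=1` this degrades to a single sequential stream, which is roughly the situation the talk describes when multiprocessing was switched off.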

So in total we got our pod startup, our server startup time, from 18 minutes down to two minutes. And I've seen it be even well under that. The way we did that was slimming the containers (lots of cruft), enabling GKE image streaming, and moving to ephemeral local SSDs, which was useful once we fixed the tooling.

And then finally that tooling fix, where we were able to find that rather nasty bug, or feature, I'm not sure how to classify that. And so overall we got the pod to start up much quicker. And this was part of an overall strategy of moving onto preemptible nodes, which cut our cost by two thirds.

And we were able to maintain our uptime. So, in total, thank you for joining this talk. I think we were able to keep it under 10 minutes, so here we go. You're making my job easy, man. You're making it too easy. Wow, that was so awesome. Thank you. The pleasure was mine. I look forward to more guitar later.

Oh, wait, are you in San Francisco? I got an inkling that you may be. Yeah, we're based in San Francisco. I'm east of Berkeley right now. All right, so in two weeks I'm gonna be there, and I'm gonna be hosting the LLM Avalanche meetup. We're expecting around a thousand people, and it would be my honor if you came. Maybe you've got other things that you have to do in real life, and it's not quite as easy as just hopping on a Zoom call, but.

I'm putting it out there. I would be honored. I gotta commit. I gotta commit. So you'll be able to see some music playing in person. And I feel like you play music too. I've been known to riff a little bit. Yeah, I think, yeah, just a little. That sounds like you're being very, very, uh, what is it called?

What is it? My brain stops working at this time of night, 'cause I'm in Europe. False modesty is, I think, what it was. Modest is the word I was looking for: without the false, just the modest part. So I'm gonna try and just get a bunch of instruments that we bring to this meetup.

And we can have, like, a little jam room. I can't promise anything, but hopefully we can jam in person and make some songs about LLM infrastructure and spot instances and all that good stuff. Meet up, dude. I laughed so hard when you were like, yeah, so we saw this comment, and multithreading is hard, so we just got rid of multithreading.

Who knows how that goes? These are the bugs that plague us. Yeah, man, it's so good. It is so good. So thank you, Bradley. I really appreciate this, and I'll probably hit you up to bring you on the podcast if you're open for it. Thanks for having me. Pleasure's mine.

