MLOps Community

Less Is Not More: How to Serve More Models Efficiently

Posted Aug 15, 2024 | Views 78
# AI tools
# Generative AI
# Storia
Julia Turc
Co-CEO @ Storia AI

Julia is the co-CEO and co-founder of Storia AI, building an AI copilot for image editing. Previously, she spent 8 years at Google, where she worked on machine-learning projects including federated learning and BERT, one of the language models that kicked off the LLM revolution. She holds a Computer Science degree from the University of Cambridge and a Master's degree in Machine Translation from the University of Oxford.

SUMMARY

While building content generation platforms for filmmakers and marketers, we learnt that professional creatives need personalized, on-brand AI tools. However, running generative AI models at scale is incredibly expensive, and most models suffer from throughput and latency constraints that have negative downstream effects on product experiences. Right now we are building infrastructure to help organizations developing generative AI assets train and serve models more cheaply and efficiently than was possible before, starting with visual systems.

TRANSCRIPT

Julia Turc [00:00:10]: All right, I'm Julia. I'm one of the founders of Storia, where we are building infrastructure for multimodal generative AI. Both my co-founder and I have a background in machine learning. I was at Google Research for seven years, where I worked on natural language processing, and he was part of Amazon Alexa, where he set up the very first LLMs. So today, I'm not going to describe our product, but I'm going to tell you about how we made one of our customers very happy. In this particular story, our customer is a marketing agency. And this marketing agency works with a lot of very high-end brands, hundreds of them. Now, this is an AI-forward marketing agency that acknowledges that even creative tasks, like writing marketing copy and creating visual assets such as social media and ad creatives, will probably be automated.

Julia Turc [00:01:10]: So they want to get ahead of the curve. You might be aware that there's a lot of very cool technology where you give it a text prompt and it gives you back a beautiful image in return. There are services like Midjourney, Stability, Runway, and so on. And marketing agencies do find a lot of value in these products, either for ideation or as a replacement for stock photography. But they can't really use them for the end assets that they give to brands, because the outputs don't really reflect the personality of the brand or the voice of the brand. It looks a little bit cookie-cutter. You can't go back to Tesla or Google or whatever company you're working with with a Midjourney-generated image. So what they want to do is fine-tune a Stable Diffusion model for each of the brands.

Julia Turc [00:02:03]: If you're not familiar with the text-to-image space, I know that a lot of people are into LLMs, but text-to-image is a little bit less known. Stable Diffusion is the go-to open-source model. Everybody can download the weights and fine-tune it if they want to. Now, this is a very hard task to do, customizing an image model for every single customer. And there are many reasons why it's difficult, but there's one main one, a very simple one. Do you recognize what this plot is? Exactly. GPUs are expensive, and you can't afford to waste compute. That's why they can't really afford to have one Stable Diffusion model per customer.

Julia Turc [00:02:50]: Because a Stable Diffusion model is around 7 GB, and it takes 20 GB of RAM to run. So it's absolutely prohibitively expensive to have one for each of the 100 customers. So the way fine-tuning is done today, both in language and in image models, is through these adapter modules. These are additional weights that you put on top of your base model. And because their size is much smaller than your base model, this leads to faster training. It requires less data; you don't need 5 million data points to train an adapter module. And also during inference, the increase in latency is pretty minimal.
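
For readers who haven't worked with adapters, here is a minimal sketch of what attaching one to Stable Diffusion can look like. The talk doesn't name a specific technique or library, so this assumes the common LoRA flavor via the diffusers and peft libraries; only the small adapter weights become trainable, not the multi-gigabyte base model.

```python
# A minimal sketch (not Storia's actual training code): attach a small
# LoRA adapter to a Stable Diffusion UNet so that only the adapter
# weights are trained, not the base model.
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Target the attention projections inside the UNet; rank 8 keeps the
# adapter to a few megabytes instead of the multi-GB base checkpoint.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet = get_peft_model(pipe.unet, lora_config)
unet.print_trainable_parameters()  # a tiny fraction of the base model
```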

Julia Turc [00:03:37]: But most importantly, during serving, these adapter modules are very convenient, because if you have 100 customers, you can actually serve them with a single GPU. You host Stable Diffusion on one GPU, and then, depending on the incoming query, you just swap the adapter module in and out. So instead of needing 100 GPUs, you just need one GPU. A second problem is that, even though there's a lot of progress in LLMs, inference for images is still very slow. Batching is much harder, and it can take a few minutes if you want to generate, say, eight to 16 images, which marketers usually want to do because they want a lot of variety. So the second problem that we had to deal with was how to reduce latency so that it's humanly acceptable to wait for the output. And if you're not familiar with diffusion models, the reason they take so long is that they do this iterative process: they start with complete noise, and then they keep removing noise step by step until you reach this beautiful image.
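
A rough sketch of the serving pattern described above, assuming diffusers-style LoRA adapters: the base pipeline stays resident on one GPU, and a per-brand adapter is activated for each incoming request. The adapter paths and brand names below are hypothetical.

```python
# A hedged sketch of per-request adapter swapping on one GPU: one base
# Stable Diffusion pipeline stays loaded, and a small LoRA adapter is
# selected per incoming brand. Paths and adapter names are made up.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Register one lightweight adapter per brand (each is only a few MB).
for brand in ["brand_a", "brand_b"]:
    pipe.load_lora_weights(f"./adapters/{brand}", adapter_name=brand)

def generate(prompt: str, brand: str):
    # Activate only the requested brand's adapter for this query.
    pipe.set_adapters(brand)
    return pipe(prompt).images[0]

image = generate("a billboard ad for a sports car", "brand_a")
```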

Julia Turc [00:04:50]: Normally with Stable Diffusion, you need somewhere between 30 to 50 steps in order to get a good image. And this is where the solution comes in. The solution is called distillation. It's a fairly old technique that started maybe 10-15 years ago, but in the context of text-to-image models, what it does is distill the path from noise to image into a much shorter path. So instead of having to do 50 denoising steps, you can maybe do eight denoising steps. There are many, many papers trying to do this, but the one that worked best for us is called Hyper-SD, and it comes from ByteDance. And a very interesting result was that when we went back to the marketing agency to do a bit of A/B testing between Stable Diffusion and this distilled model, we asked their in-house illustrators to compare things side by side. Surprisingly, 80% of the time, they chose the distilled model.
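
For illustration, this is roughly what swapping in the distilled model can look like with diffusers, assuming the publicly released Hyper-SD LoRA from ByteDance; the repository and file name are taken from that public release rather than from the talk, and scheduler details are omitted.

```python
# A rough sketch of using a distilled model: load the public Hyper-SD
# LoRA on top of the base pipeline and cut denoising from ~50 steps to 8.
# Repo/file names are assumptions based on ByteDance's public release.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights(
    "ByteDance/Hyper-SD", weight_name="Hyper-SD15-8steps-lora.safetensors"
)
pipe.fuse_lora()

# Eight denoising steps instead of the usual 30-50.
image = pipe(
    "product shot of a perfume bottle", num_inference_steps=8
).images[0]
```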

Julia Turc [00:05:51]: So this is kind of a free lunch that happens so, so rarely in machine learning, where you get a smaller model that actually performs better. And we managed to reduce the latency from 15 seconds to around 4 seconds. Now, once we train hundreds of models, the next challenge is how do we serve them? How do we make them readily available, so that whenever someone wants to generate an image in the style of Tesla, they can do so in a few seconds? By the way, Tesla is not a customer; I'm not giving away any of the secret sauce. So how do we do that? Well, we started with the most obvious solution, which is a serverless solution. You go to Replicate, you put your model there, and they give you back an API, and whenever you hit that API, you get an image back. This worked great for prototyping.
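
As a sketch, the serverless flow with Replicate's Python client looks roughly like this; the model identifier and inputs below are placeholders, not an actual deployment.

```python
# A minimal sketch of the serverless flow described above, using the
# Replicate Python client (requires REPLICATE_API_TOKEN in the env).
# The model identifier here is a placeholder, not a real deployment.
import replicate

output = replicate.run(
    "your-org/brand-finetuned-sd:version-id",  # hypothetical model + version
    input={"prompt": "social media creative in the brand's visual style"},
)
print(output)  # typically a list of URLs to the generated images
```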

Julia Turc [00:06:44]: It's great to get something running, but you very quickly run into problems when you're dealing with such a big scale of models. The biggest one is the cold start problem. As I described before, in order to bring in an adapter module, there's some time required to take it from disk and bring it into GPU memory. Sometimes with Replicate this could take up to two minutes, so it was just prohibitively expensive. So we had to move on to a more complex solution. The second solution that we tried was the go-to services from the big three cloud providers. For AWS, this is SageMaker; for Google, this is Vertex AI.

Julia Turc [00:07:29]: The contract that you have with these services is that you give them a Docker image. The only requirement is that the Docker image needs to expose a server with two APIs: one of them is to load the model, the other one is to generate an image. And the beauty of it is that you give them the Docker image, and then what they will do is make sure it's up and running and also autoscales. So if you bombard your server with many, many queries, they promise that they will keep bringing up accelerators to match your demand. I'm going to show you in a second why this was not ideal for us. There are situations in which these layers of abstraction are great to get you going, but then you end up debugging the layer of abstraction, especially if it's not open source, which is the case for SageMaker and Vertex. What you end up doing is relying on documentation that might or might not be clear, and might or might not truly describe what's happening behind the scenes.
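
A hedged sketch of the kind of server such a Docker image has to expose, with one route for health/model loading and one for generation. The exact route names and payload shapes vary by provider (SageMaker, for example, expects GET /ping and POST /invocations), so treat these as illustrative placeholders.

```python
# A minimal sketch of the container contract: a health/load route and a
# generation route. Route names and payloads are assumptions, not the
# exact SageMaker/Vertex spec.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = None  # the diffusion pipeline, loaded once per container

class GenerateRequest(BaseModel):
    prompt: str
    brand: str

@app.on_event("startup")
def load_model():
    # Placeholder for loading the Stable Diffusion pipeline into GPU memory.
    global model
    model = "stable-diffusion-pipeline"

@app.get("/ping")
def ping():
    # The platform polls a health route like this before sending traffic.
    return {"status": "ok" if model is not None else "loading"}

@app.post("/invocations")
def generate(req: GenerateRequest):
    # In the real container this would run the brand-specific pipeline;
    # here we only echo the request to keep the sketch self-contained.
    return {"prompt": req.prompt, "brand": req.brand}
```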

Julia Turc [00:08:32]: And when the promise in the documentation doesn't match what you're seeing, you are left frustrated, because there's no debugging to do from there. Maybe you have some logs, maybe you don't, but you can't SSH into a machine and see why it's not scaling. So we ended up actually implementing our own Kubernetes solution. We were in denial for a long time; we really, really, really tried not to get there. But we hired our Kubernetes expert, and once everything was set up, it was basically bliss. Because now we have control over everything, we have visibility over everything. And when something fails or the system doesn't autoscale, I can go into the logs, understand why, and go and fix it. And I'm going to show you a side-by-side test.
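
As a small illustration of that visibility: with your own cluster you can query pod state and logs directly through the Kubernetes Python client. The namespace and label selector below are made up for the example.

```python
# A sketch of the kind of direct inspection described above: list the
# inference pods and pull logs from any that are not running.
# Namespace and label selector are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod("inference", label_selector="app=sd-server")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    if pod.status.phase != "Running":
        # Read the unhealthy pod's logs and debug it yourself.
        print(core.read_namespaced_pod_log(pod.metadata.name, "inference"))
```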

Julia Turc [00:09:23]: So what you're seeing here is a 20-minute load test where we have ten users bombarding the servers with requests for various brands. The red line shows the number of failures per second and the green line shows the number of successful requests per second. Remember that one image takes more than 4 seconds to generate; that's why sometimes you'll see 0.3 requests per second. Okay, so the top side is Kubernetes, the bottom side is SageMaker. The first thing you can see is that the red line for Kubernetes is pretty flat. That means we don't see any failures. And here, what we mean by failure is actually a timeout.
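
A load test of this shape can be sketched with Locust: ten simulated users hitting the generation endpoint with requests for different brands, and anything slower than three minutes counted as a failure. The endpoint path and payload are assumptions, not Storia's actual harness.

```python
# A hedged sketch of the load test described: run with something like
#   locust -f loadtest.py --headless -u 10 -r 10 -t 20m
# Endpoint path and request body are placeholders.
import random
from locust import HttpUser, task, between

BRANDS = ["brand_a", "brand_b", "brand_c"]

class BrandUser(HttpUser):
    wait_time = between(1, 3)  # small pause between requests per user

    @task
    def generate_image(self):
        self.client.post(
            "/generate",
            json={"prompt": "ad creative", "brand": random.choice(BRANDS)},
            timeout=180,  # anything over three minutes counts as a failure
        )
```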

Julia Turc [00:10:09]: So if a query doesn't get resolved in three minutes, we count it as a timeout and we discard it. However, if you look at SageMaker, which is again AWS's solution for this, you definitely see a zigzag of failures over here. And what's most annoying is that the green line actually disappears on SageMaker. What happens here is that SageMaker is seeing a lot of requests and is trying to autoscale. And while it's autoscaling and trying to bring up more GPUs, it becomes completely unresponsive for five minutes, which is completely against the value proposition of SageMaker and Vertex in general. They say: don't you worry about it, your system will be up and running and we'll autoscale it. Now, we ran this test three times and it had the same behavior. But I don't want to claim that this is what SageMaker always does.

Julia Turc [00:11:04]: Maybe it was just a software update that they shipped on a Friday afternoon, and I was very unlucky to run my load test then. But it still makes the point that I don't want to deal with this. I don't want to have to think about what the people at AWS are doing. I want to be able to go into my logs, see that machines are dead, and know how to press the reset button. So yeah, that's the story of how we helped the marketing agency produce 100 text-to-image models and serve them to their customer brands. And if you are interested in taking control of your own AI stack, come and talk to us. We're very happy to have a conversation. Thank you.
