MLOps Community

Scalable Evaluation and Serving of Open Source LLMs

Posted Jun 20, 2023 | Views 510
# LLM in Production
# Scalable Evaluation
# Anyscale.com
# Redis.io
# Gantry.io
# Predibase.com
# Humanloop.com
# Zilliz.com
# Arize.com
# Nvidia.com
# TrueFoundry.com
# Premai.io
# Continual.ai
# Argilla.io
# Genesiscloud.com
# Rungalileo.io
SPEAKERS
Waleed Kadous
Head of Engineering @ Anyscale

Dr. Waleed Kadous leads engineering at Anyscale, the company behind the open source project Ray, the popular scalable AI platform. Prior to Anyscale, Waleed worked at Uber, where he led overall system architecture, evangelized machine learning, and led the Location and Maps teams. He previously worked at Google, where he founded the Android Location and Sensing team, responsible for the "blue dot" as well as ML algorithms underlying products like Google Fit.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

While we've seen great progress on Open Source LLMs, we haven't seen the same level of progress on systems to serve those LLMs in production contexts. In this presentation, I work through some of the challenges of taking open-source models and serving them in production.

TRANSCRIPT

Introduction

I'm not in ASMR mode anymore. You there? What's up, Waleed? Hey, how's it going? You doing all right? I'm doing great, man, I'm blissing out after that one. There we go. Stretch it out if you need to. Awesome. So everyone is very clearheaded now, I believe, for your talk, and they're ready.

But we need to school them on the fact that there is an Anyscale Ray workshop happening in person once we finish this online conference today. It's happening in San Francisco, and we actually filled up the full workshop. Sorry, the full workshop, not the full conference, though the conference is pretty full too.

I mean, there are like 2,000 people online right now, so that's pretty full, I would say. But we filled up the workshop, and there are like 50 people on the waitlist. I'm not gonna be the guy that says it, but I bet if you just show up and slip the doorman a hundred, you'll get in. All right, that's all I gotta say, Waleed.

Take it over, man. It's all yours. Hey, thank you very much, Demetrios. It's really great to be presenting at the MLOps Community's LLMs in Production event. I have an update. Obviously things have moved very fast in the industry, but the last time I talked at an MLOps event was a couple of weeks ago, and we were just giving people a preview of Aviary.

What I hope to share today is the painful lessons from our first two weeks of building it, deploying it, and getting it running. So what I'll be talking about today is really the lessons we learned, and this is one of my favorite quotes: experience is a good school, but the fees are very high.

My goal today is to save you from paying some of the fees that we did. If you learn some of the things about what it really takes to deploy LLMs in production, or at least hear our experiences with it, then I will have done my job. So we're going to start off with something very simple. There are lots of different ways to deploy LLMs in production, but the ones we're talking about today are self-hosted models.

In other words, you as an organization run and maintain the model, usually based on an open source foundation model that you've fine-tuned, or maybe you're just using it as is. But first, let's dig into the question of why you would want to self-host anyway.

Second, we'll talk a little bit about exploration and what we built, something called Aviary Explorer, which has been out now for two weeks. Then we'll share some lessons that we learned the hard way. The first one is that model selection is tricky. Serving LLMs turns out to be pretty hard in many ways.

Be prepared to deploy more than just a single LLM. Understand the importance of GPU RAM. And finally, how do you keep costs down? We'll be giving you a bit of a behind-the-scenes look at how it works: showing you our dashboards and the tooling of our live production service, just to give you an idea of what we learned during this process.

So the first question I wanted to discuss — oh, and one other thing: we'll be doing lots of demos, so hopefully the demos don't die. When you try to deploy an LLM in production, you really have three different choices, or four, but I think the fourth is very rare. You can go to a commercial company like OpenAI or Anthropic, or you can go to a hosted open source solution like Hugging Face, MosaicML, or OctoAI.

And again, these slides are super fresh because OctoAI just launched yesterday. We'll talk a bit about Aviary going forward and how you can do that. Anyscale is the company behind an open source project called Ray, as well as a managed hosted service, also called Anyscale.

And what we'll talk about today can work both ways. Everything I'll be showing you, except for one small thing, is completely open source, and you can have a play with it. The third option is really self-hosted. Usually what you'll do is take an open source language model — the language model du jour is of course Falcon, with its 40-billion-parameter model and the smaller 7-billion-parameter model.

Then you'll either deploy it as is, or do your own fine-tuning on top of it. There's a small fraction of companies that are building their own LLMs from scratch, but I don't think that represents the majority. Those companies often end up using Ray — both OpenAI and Cohere, who are building those commercial models, use Ray,

the open source toolkit we've made available for distributed computing, to train their models, but that's not going to be the focus today. So why would you want to use self-hosted LLMs? The way we think about it is, we are big fans of self-hosted LLMs at Anyscale, but we're not blind fans.

We use GPT-4 for some things. You just want to deploy the right tool for the job. But in particular, I think the advantages of OSS LLMs are that you have control over your data. You might have regulatory restrictions, or you might just decide that LLMs are a competitive advantage that you don't really want to share with a potential competitor — who knows what OpenAI is going to do next.

The second reason is cost. GPT-4 at the high end can be very expensive, at 10 cents per thousand tokens. We work with a company called Kapa, and they help us with document search for the Ray documentation. It works really great. The only problem is that they charge us 25 cents per query, which is just ridiculously expensive.

But they don't have a choice, because the backend is GPT-4. So you want to be careful about costs. You never want to be in a situation where your vendor has locked you in, because that removes your negotiating position when it comes to cost. And finally, customizability. You want to do things like fine-tune your models, but if you go to the commercial companies like OpenAI, they will charge you six times as much for serving a fine-tuned model as a non-fine-tuned model.

The one area where I think open, self-hosted LLMs are not quite as good is quality. And this is something that's changing. There was a really interesting leaked Google memo that said "we have no moat," and it actually plotted a graph of how quickly open source quality is improving. Testing is starting to show that the gap is narrowing — that's from places like LMSYS and so on.

And finally, we've been running some experiments. We're not ready to share the results yet because they're a little preliminary, but what we're seeing is that open source is perfectly good enough for things like summarization and retrieval-augmented generation. So it's not that they're drop-in replacements for the full power of GPT-4, but if you think carefully about your use case, you may be able to find a good fit.

So, just trying to summarize these options here: this is a chart that I hope you find useful as you're trying to choose when to use commercial APIs, when to self-host, and what the pros and cons are. Clearly, for example, the hosted approaches are easier to use, and self-hosting has some risks.

Obviously, what we're trying to do with Aviary is turn this sad face into at least a neutral face, if not a smiley face — to make it easier for people to build self-hosted models. So now let's dive into the lessons. The first lesson is: wait a second, it's great that you have a self-hosted model,

but how do I choose the right OSS model? What size do I need? How much does each option cost? What are the quality characteristics? Some models come in up to four sizes — which of the four do you use? So the first thing we did was build some tools, and I'll walk you through those tools right now.

The first is Aviary Explorer, which you can try out at aviary.anyscale.com. I'm just going to give you a quick overview of Aviary, the tool that we built, and then I'll also show you some of the command line tools we built. So this is Aviary. As you can see here, it allows you to choose different LLMs.

You can choose different characteristics. We have certain options — for example, you can choose the fast option, which picks the fastest LLMs we have. Then you can quickly type in whatever you want your prompt to be, click, and very quickly compare these three LLMs and see which ones you personally find more acceptable.

Which ones work better? The latencies are going up — I can see probably all of you are hitting the website; soon we'll see the spikes. But you can see the outputs of each of them, and importantly, we can measure things like the latency, the tokens per second, and how much each of these models costs.

So that gives you a quick overview if you just want to see things on the surface. Of course, it's also possible for you to load your own models, so you could have fine-tuned versions of your own models that you put into this list, which we'll talk about later. The other thing we're able to do very easily is that we've allowed people here to vote.

So I might come here and say, well, this one doesn't have five books in it, so maybe we'll just choose this other answer instead. Now imagine what happens when we have people around the world filling that in. What we can build is a leaderboard of the best LLMs for a particular use case.

Not only can we build a leaderboard — and there are other websites that have leaderboards — but I think one of the unique characteristics for us is that we've tried to characterize things like the cost, the latency, and so on, so that you can build a model. Now, there are some simplifying assumptions here, particularly

that it doesn't fully include batching, and batching can have a 10x impact, so think of these as an upper bound. But what I can do, for example, is sort by the fastest, which also means the cheapest in terms of generating outputs. You can see that Amazon's LightGPT is pretty fast, but unfortunately it doesn't rank very highly on performance.

Whereas I think the superstars, and some of the models we're seeing people want to use, are things like MPT — the chat model from MosaicML — or Open Assistant. So let's have a look at what those look like in terms of cost. Not surprisingly, those are some of the most expensive ones.

Mosaic's chat model is not too bad. But you can also see that part of the reason something like StoryWriter or Open Assistant is so expensive is not that the cost per token is high — it's that Open Assistant likes to generate long answers, and the same goes for StoryWriter, which is what it was fine-tuned for.

So we hope you'll find a tool like this useful in your selection of the appropriate model. Let's go to a live demo. That's just the UI version, but we also have some command line tools. For example, we've created this aviary command, and the Aviary backend can list all of the different models it has available to it.

This is the same backend that you folks are hitting really hard right now. You can get a list of models, but what you can also do is, say I have a bunch of prompts — I'm going to look at something like examples/qa-prompts. What I might be doing is working on a trivia bot, and I want to work out whether the answers are good or not for these different questions.

So what I can do now is run a query on this and compare two outputs. We said that Mosaic was a good one, and we've also added a capability to compare with OpenAI. It takes that input and it's now just going to sit there and process it, sending the Mosaic prompts to our backend and at the same time sending those same queries to OpenAI.

That will take a few seconds, but very soon we'll be able to get the results. We've asked it a few trivia questions, and now it's going to hit OpenAI to show us what the possible results are. You can see this tool is making it much easier for us to evaluate what the open source options are.
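To make that workflow concrete, here is a hedged sketch of driving the same comparison from Python by shelling out to the aviary CLI. The subcommands (models, query) are named in the talk, but the exact flags, model identifiers, and file paths below are illustrative assumptions, not the actual CLI surface.

```python
import subprocess

# List the models the Aviary backend currently serves.
subprocess.run(["aviary", "models"], check=True)

# Run the same file of trivia prompts against an open-source model and an
# OpenAI model so the outputs can be compared side by side.
# Flags, model identifiers, and paths here are hypothetical.
subprocess.run(
    [
        "aviary", "query",
        "--model", "mosaicml/mpt-7b-chat",
        "--model", "openai/gpt-4",
        "--prompt-file", "examples/qa-prompts.txt",
        "--output-file", "qa-results.json",
    ],
    check=True,
)
```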

So that gives you an idea of how we approach the problem of helping people understand which LLM is the right one for them to use. Cool. I'm going to continue now; hopefully, if we have time, we'll see the results there. But then a funny thing happened as we were trying to build this application: we found that deploying LLMs is harder than it looks.

It's one thing to say, oh, that's a really great app, and that's very fun. But really think about what it took. What we found the hard way was that managing this aviary, so to speak, of 12 models was giving us headaches. The initial assumption might be that you can just download the model from Hugging Face and run it.

And what we found is that it's much harder than that. You have to think about different definitions of roles, different markers, different stop tokens, and different accelerators that work for different models — some need more GPUs than others — as well as support for things like batching and streaming.

That's the challenge we faced, and as we thought about it, we figured maybe there's a way we can define a configuration that allows us to specify what we want and to prepare each model for production. We were really excited about this, because once we added this kind of model config abstraction — Demetrios, is everything okay?

I am the bearer of bad news. You've been sharing Aviary this whole time, but you didn't share the CLI. So I'm realizing now that there were a few things people missed out on. Okay, so there we go. There it is. Let me rewind. Sorry for that, folks. Yeah, that's all good, man.

So, as I said, where we were was that we ran this comparison — we tested OpenAI versus MPT and it generated the output for us. Let's put that output in a file called qa.

It has all of the statistics for each of the models and the differences between them. We've also added a command called aviary evaluate. What we can do — and again, this is the time to use GPT-4 — is send those inputs to GPT-4 and ask it which of the answers was better.

Very soon that should produce a result for us and print a table that shows us what it thought the best results were. So hopefully that has given you an idea of some of the capabilities, and I apologize for the issue we saw earlier. I will remember to hit the share-the-tab button in the future.

Cool. In a moment it will print a table, and you can see that it has basically scored each of these inputs and compared them. In this particular case, this evaluation piece is still something we're fine-tuning, but you can see, looking at these results, that the open source results are actually not bad at all.
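The talk describes aviary evaluate sending both sets of answers to GPT-4 and asking which is better. As a rough illustration of that LLM-as-judge idea — not Aviary's actual implementation — a minimal sketch with the OpenAI Python client could look like this; the prompt wording, file name, and JSON layout are assumptions.

```python
import json
from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 which of two candidate answers is better; returns 'A' or 'B'."""
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# Hypothetical results file from the earlier query step:
# [{"question": ..., "open_source_answer": ..., "openai_answer": ...}, ...]
with open("qa-results.json") as f:
    for row in json.load(f):
        verdict = judge(row["question"], row["open_source_answer"], row["openai_answer"])
        print(row["question"][:40], "-> better answer:", verdict)
```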

They're actually very good for this particular type of application. All righty, let's go back to the slides — and hit the share-tab button instead. So, as I was saying, as a result of thinking about things in terms of model configuration, we were able to add support for a new LLM that came out in about five minutes.

So let's have a look at those models and what one of those model configs looks like. What we can see here is that we have a little config file for each of the different models. You'll see that in some ways they're very similar, but you'll notice things like this: they use different strings to denote the start of a string and the end of a string.

If I compare it to the one from MosaicML, there's all this tuning you have to do; there are all these settings you have to make, because each model is used slightly differently. So you can see the differences in what we're trying to do, and the tweaks we have to make, to get particular models to produce reasonable values.

But effectively, there's a little model config for each of the different models that we want to run. Ideally — and this is what happens a lot of the time — you just modify the model config a bit, and it's able to run your new models. Or say you have a fine-tuned model; you can do something like that too.
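To give a feel for what such a per-model config carries, here is an illustrative sketch written as a Python dict. The field names and values are assumptions chosen to mirror the settings described in the talk (prompt and role markers, stop tokens, accelerator and replica requirements); they are not Aviary's actual schema.

```python
# Illustrative per-model config mirroring the kinds of settings described in
# the talk. Field names and values are assumptions, not Aviary's real schema.
mpt_7b_chat_config = {
    "model_id": "mosaicml/mpt-7b-chat",
    "prompt_format": {
        # Role markers differ between model families; MPT-Chat uses ChatML-style tags.
        "system": "<|im_start|>system\n{instruction}<|im_end|>\n",
        "user": "<|im_start|>user\n{instruction}<|im_end|>\n",
        "assistant_prefix": "<|im_start|>assistant\n",
    },
    "stopping_sequences": ["<|im_end|>", "<|endoftext|>"],
    "generation": {"max_new_tokens": 512, "temperature": 0.7},
    "deployment": {
        "accelerator": "A10G",      # which GPU type this model should run on
        "num_gpus_per_replica": 1,  # a 40B model like Falcon would need more
        "min_replicas": 1,          # keep one warm replica to avoid cold starts
        "max_replicas": 8,
    },
}
```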

It's just a change of where the config points. Cool. So the third lesson we learned was that you are probably going to deploy many more models than you think. We thought at the beginning that we might have, like, three models — a small, a medium, and a large — but very quickly we found that we wanted to add more.

Maybe there are different sizes, maybe there are different fine-tunes, maybe you want to do A/B testing. The other thing we're seeing on the horizon is a router design pattern, where you don't have a single LLM; you might have three LLMs fine-tuned for different things, and then a router LLM that decides which of the three to call. Just that simple configuration

already means we're talking about four LLMs. So very quickly we started to realize that managing multiple models was very difficult — each has different requirements. Could we make it so that deploying a new LLM was no harder than any other microservice?

That was really the genesis of the Aviary backend. As you folks know, there's a very important DevOps principle, which is that you treat the things you're working with like cattle, not like pets. With cattle, you still care for them, but you care for them as a group.

You don't ask a cow to come sit on your couch the way you would ask a pet to sit on your couch, right? So this principle of treating LLMs like cattle, not pets, really simplifies the management, and the declarative approach we took with the model configs also simplifies it. So a model dies?

It's okay — you can just load it from the model config again. Ray Serve and Ray, which are what the system is built on, have really good fault tolerance, so if part of the service dies, it'll just bring up another one. Because of the features built into Ray and Ray Serve, it's super easy.

What I'd like to do now is show you a little bit about the real backend that's actually serving aviary.anyscale.com. I'm going to click on "share this tab instead." This is showing you Anyscale, and right now the backend you're seeing is actually this guy. You can see that there's one deployment for each of the different models.

And then there's one kind of glue deployment, called the router deployment, that holds it all together and redirects traffic. This particular setup, as we said, is running in production; we're hosting 12 LLMs together, and we can click on any of them and get their statistics. But the thing I wanted to show was the Serve dashboards.
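A minimal sketch of that layout — one Ray Serve deployment per model plus a router deployment that forwards requests — might look like the following. Model loading is stubbed out, the model names are examples, and the handle-call style assumes a recent Ray Serve release where handle.method.remote() returns an awaitable response.

```python
from ray import serve


@serve.deployment  # one deployment per model, each with its own resources and scaling
class LLMDeployment:
    def __init__(self, model_id: str):
        self.model_id = model_id
        # A real deployment would load the model weights onto the GPU here.

    async def generate(self, prompt: str) -> str:
        return f"[{self.model_id}] stub completion for: {prompt}"


@serve.deployment
class RouterDeployment:
    """Glue deployment that forwards each request to the right model."""

    def __init__(self, mpt_handle, lightgpt_handle):
        self.handles = {
            "mosaicml/mpt-7b-chat": mpt_handle,
            "amazon/LightGPT": lightgpt_handle,
        }

    async def __call__(self, request):
        body = await request.json()  # e.g. {"model": "...", "prompt": "..."}
        handle = self.handles[body["model"]]
        # With recent Ray Serve releases, .remote() returns an awaitable response.
        return await handle.generate.remote(body["prompt"])


app = RouterDeployment.bind(
    LLMDeployment.options(name="mpt-7b-chat").bind("mosaicml/mpt-7b-chat"),
    LLMDeployment.options(name="lightgpt").bind("amazon/LightGPT"),
)
serve.run(app, route_prefix="/query")
```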

Very quickly you can come over here and look at the statistics for this service and what's happening to our traffic. It looks like you folks have been busy in the last five minutes or so — you've started to hit it, and we've gone from about 15 replicas to about 20 replicas very quickly.

So that gives you an idea of how our backend works. We also monitor it in terms of understanding how many users are using it. I'm going to share this tab now — let me just do a quick refresh. This is our live traffic, showing us how many queries people have put in and what the token distribution looks like.

We realized that people were using us a lot for summarization. You can also see here just how spiky the traffic is, and that spiky traffic does not play well with LLMs, unfortunately. So this gives you an idea of how it works, how we monitor it, and how easy it is to switch — I'm switching tabs now.

You can very easily deploy new models with this format. Let me try this, and we'll run locally. This will probably take a while, but I can deploy it: specify the model config file under models, and let's say Amazon LightGPT.

And very quickly this model will load up. So again, it's this cattle-not-pets kind of move, making it very easy to deploy these models and handle all of those particular issues. Cool, let's go back to the slides. Lesson four: it's all about GPU RAM, not about compute. We have a page you can check out called llm-numbers.ray.io — numbers every LLM developer should know.

Every L LM developer should know. Um, and what you realize is that GPUs, um, how we use them when we're building LLMs is we put, we use about half of it for model parameters and then half of it for working memory for batching. And batching is very important, uh, for performance reasons, you know, um, processing.

If you didn't do batching, for example, you might see —

Oh, nobody else sees me.

Don't let me distract you, Waleed. That was for the people waiting for the next talk. Okay, cool, you keep cruising, man. You're doing great. All right, great. So it's all about the GPU RAM and really managing it. You need this memory for both purposes, and that means you have to think very carefully about how you lay things out in memory.

The way we dealt with that is we used bigger instances to allow for more efficient batching. So basically now we rarely use 16-gig GPUs. The A10Gs tend to be our go-to GPU, and the A100s are for the really big models like Falcon and Open Assistant that are 30 or 40 billion parameters. Often we'll need two of them, because there are so many parameters they don't fit on a single GPU,

and we'll run two GPUs in parallel. But the thing we're optimizing for is making sure there's enough GPU memory that we can do batching. Batching is where you might take five requests and send them at once to the GPU, which gets you much, much greater throughput at a small cost in latency.

So it doesn't really matter which GPU is faster; the key thing to focus on is which GPU has more memory. Generally, the A100s can be a bit pricey, but even going from a normal 16-gig GPU, which might be slightly faster, to a 24-gig A10G will probably give you a better performance improvement.
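The arithmetic behind that advice is simple: fp16 weights take roughly 2 bytes per parameter, and the talk's rule of thumb is that weights and batching working memory each take about half of GPU RAM. A back-of-the-envelope sketch (parameter counts approximate):

```python
def fp16_weight_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Rough fp16 weight footprint: ~2 bytes per parameter => ~2 GB per billion params."""
    return num_params_billion * bytes_per_param


# Budgeting ~2x the weight footprint leaves roughly half of GPU RAM free for
# the KV cache / working memory that batching needs.
for name, params_b in [("LightGPT (~6B)", 6), ("Falcon-7B", 7), ("Falcon-40B", 40)]:
    weights = fp16_weight_gb(params_b)
    print(f"{name:15s} ~{weights:5.0f} GB weights, budget ~{2 * weights:5.0f} GB GPU RAM")
# Falcon-40B's ~80 GB of fp16 weights is why it cannot fit on a single 24 GB
# A10G and gets sharded across multiple A100s instead.
```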

And the final lesson: cost management. We like to keep one replica around. We can actually scale to zero, but that introduces a minute or two of delay every time someone uses a new LLM, so we keep at least one replica of each LLM running so that people get good results.

And if you have 12 models, that's essentially 12 GPUs we have to keep running. I know that might sound high, but that's actually after optimization. You just have to be very careful when you provision, so you really need to be able to auto-scale up and down. But that's not really the key problem, because if your autoscaling time is 15 minutes and you have a 20-minute peak, that's not going to be fun.

So what we've done is a lot of work to optimize fast loading and startup time. We locally cache the Hugging Face models, and we use NVMe drives — very fast local drives — to help accelerate that. You don't realize it, but Hugging Face, like any service, has variations in how long it takes.

So if you can run a local S3 bucket and manage a local cache, you can absolutely accelerate how long it takes for a node to come up when you're autoscaling. The second thing — and this is the one part of this that's not open source — is that on our proprietary Anyscale service, we've optimized the crap out of starting up nodes, and we can start up nodes in under a minute.
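One way to implement that local caching — sketched here as an assumption rather than Aviary's actual mechanism — is to pre-download weights onto a fast local volume with huggingface_hub before the serving process starts; the model name and mount point are illustrative.

```python
from huggingface_hub import snapshot_download

# Pre-fetch model weights onto a fast local NVMe volume (hypothetical mount
# point) so a freshly autoscaled node doesn't wait on a cold download from the Hub.
local_path = snapshot_download(
    repo_id="mosaicml/mpt-7b-chat",
    cache_dir="/mnt/nvme/hf-cache",
)
print("weights cached at", local_path)
```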

So within your own system you can optimize your startup time, or you can use Anyscale to do that.

Cool.

The other thing is that we very carefully fine-tuned the autoscaling parameters, and we took advantage of Ray Serve's capabilities — it's a very flexible system. So you'll see, if we have a look at one of those files again: as we saw earlier, I just started a service with nothing more than an aviary run.

And now if I do aviary models,

it will tell me the list of models, but now I'm going to tell it that I want to use the local Aviary host.

And sure enough, we should see our local model, and the local model running on this machine only has LightGPT on it. So it makes it very easy to load things up. But just looking at the dashboard over here and at the peakiness of the traffic, you can see there's a lot of variation.

So if I look at the last seven days instead,

You can see that there's a lot of spikiness there, and we have to tailor for it. The way we've done that is we've made it so that you can specify the minimum number of replicas you want, and how often you check — the look-back period, that is, how frequently you look at those statistics.

We also bias towards downscaling slightly slower, which costs a bit more, but it really prevents the kinds of peaks that you can't handle.
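In Ray Serve terms, those knobs map onto the deployment's autoscaling_config. The sketch below shows the relevant fields — minimum replicas, the look-back period, and a longer downscale delay to bias toward scaling down slowly; the specific numbers are illustrative, not Aviary's production settings.

```python
from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,   # keep one warm replica so no one hits a cold start
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 2,  # load level that triggers scaling
        "look_back_period_s": 30,   # how far back the autoscaler looks at request stats
        "upscale_delay_s": 30,      # react quickly to spikes
        "downscale_delay_s": 600,   # scale down slowly; costs a bit more, absorbs peaks
    },
)
class AutoscaledLLM:
    async def __call__(self, prompt: str) -> str:
        return "stub completion"
```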

So let's go to conclusions. What we've seen is that self-hosted LLM solutions have some advantages. If those advantages align with your particular use case, then it's a really good idea to use self-hosted options. But there are some things you need to pay attention to. You need to choose the right OSS model.

You need to really think about what it takes to serve a model. You should expect that you're going to be serving more than just a few models, and really think about GPU RAM. And keeping costs down means you have to have things like autoscaling, and you need to fine-tune it. Now, as I mentioned, everything that's part of Aviary except for that one little thing — fast node startup time — is open source.

The UI is open source, the CLI is open source, the backend is open source. And so you can help us build, for the MLOps community, a tool that lets them more easily run self-hosted models. So what's next? We're adding features like streaming, continuous batching, and more LLMs. The key takeaway here is that you can try it out yourself.

The whole thing is open source, and if you want managed Aviary, just drop us an email. And if you need some numbers to help guide your process for putting things in production, have a look at llm-numbers.ray.io. Thanks, dude — that numbers page is so cool. I love that you all thought of that, and I would expect nothing less from you,

Waleed. That is just quality right there, man. And one question I had: that dashboard that you showed — did you have to set that up? Is everyone going to have to set up their own, or does that come right out of the box with Aviary? So we have a public one that anybody can use at any time.

If you want to run your own and choose your own models, you can do that too. We expect people to run their own Aviary. We've made it very easy — we can just give you a quick account and you can set it up on AWS very quickly. It just requires a Ray cluster to run. Dude, so there are a few questions coming through here, and

we are horribly late on time as usual — the panel that's coming up is being very patient with us — but there are just too many good questions in here for me not to ask them. That was a great presentation, man, and an awesome demo. Okay: does it make sense to predict peak times and scale the cluster in advance?

Is that possible? Yeah, if you can do it. What we've found is that the peaks are very, very random. You don't know what time zone someone who's interested in your system is going to be in. We had an article come out yesterday and all of a sudden the traffic to the website quadrupled. So the dynamic approach is probably the safer approach.

But yes, if you want to, you can definitely say, between four o'clock and six o'clock I want you to have two replicas instead of one. Yeah, exactly — if you can predict the future, get on it. There you go, that's basically the understanding I got from your answer.

Yeah, and if you can advise me on stock trades, that would be even more amazing. Awesome. So, last one for you, Waleed, and then I'm going to ask you to jump into the chat. Waleed is also on Slack, so if anyone wants to ask in the community conferences channel, just tag Waleed — he's in there.

But Gilad was asking: what would be the metric that autoscaling is configured on — traffic? How would you differentiate high memory consumption due to lots of small requests from a few large ones? Great question. I think that's something we're still fine-tuning. One of the tricks with LLMs that's very different is that a very short, two-word prompt might have a very long answer.

And that's where things like continuous batching become very useful, because if you have continuous batching and you have streaming, it evens that out, right? It's just: what rate of tokens am I getting in, and what rate of tokens am I getting out? If you can model that, you can more accurately predict things.

Good. Well, everyone, before I kick you off, Waleed, I want to say that Anyscale is an incredible sponsor of our LLM in Production report that we just put out. So go get your hands on that, and thank you, Anyscale. And Waleed, I gave you a whole lot of shit when you came on the podcast for not sponsoring, and you pulled through — you made it happen.

You threw some budget around. So in case Aviary goes down, it's probably because they spent too much money sponsoring this damn community report. So we need to help them — we need to help make it as reliable as possible. Waleed, dude, thank you so much, man. It's been a pleasure. This has been awesome.

And I'm really eager to answer your questions. Just drop me an email, or I'll be online on Slack to answer your questions too. There we go. All right.

