MLOps Community

LLMs vs LMs in Prod

Posted Jul 21, 2023 | Views 1.2K
# LLM in Production
# Production Models
# Voiceflow
# voiceflow.com
SPEAKER
Denys Linkov
ML Lead @ Voiceflow

Denys is the ML lead at Voiceflow, focused on building the ML platform and data science offerings. His focus is on real-time NLP systems that help Voiceflow's 60+ enterprise customers build better conversational assistants and help its 100,000 hobbyists explore their creativity. Prior to Voiceflow, he was a Senior Cloud Architect at the Royal Bank of Canada.

He's an active member of the ML community, participating in Discord and Slack servers, mentorship events, and meetups.

SUMMARY

What are some of the key differences in using 100M vs 100B parameter models in production? In this talk, Denys from Voiceflow covers how their MLOps processes have differed between smaller transformer models and LLMs. He walks through how Voiceflow's four main production models differ, and the processes and product planning behind each one. The talk covers prompt testing, automated training, real-time inference, and more!

TRANSCRIPT

Our final speaker of the day, which I'm sad to say, but I'm excited to have him on, will be Denys, and he's going to talk to us about LLMs versus LMs in production. I'll let him clarify that title. Hello, how's it going? It's good, Lily. Thanks for having me. Good, good. All right, well, I will let you take it away.

I'm very curious to hear all about this talk. Great, let's get started. So, hi everybody. I'm Denys, I'm the machine learning lead at Voiceflow, and I'll be talking a little bit about what we've learned working with large language models in production versus just regular old language models in production.

So the agenda's pretty straightforward. We're going to talk about how we use large language models right now in production. We'll then talk a little bit about the difference between large language models and language models, trying not to get tongue-tied, then go through our existing use cases and some production deployment differences.

So, a little bit about Voiceflow, because I think it's really important to establish what we're building. You have to know what you're building and what problem you're solving before you just go deploy models left, right, and center. We're a company that's focused on helping you build chat and voice assistants.

We're a platform, which means we're self-serve, meaning as a company we need to make sure that our customers have all the tools they need to build these assistants themselves. For our use cases, we've had this platform around for almost five years now, and around six months ago we added some generative AI features.

So, large language models. We added use cases at creation time, when you're actually building your chatbots and assistants, but also at runtime, when you want to prompt chain or use large language models to create some kind of additional response for the user. On the left hand side here, we're creating some data to train our bot.

On the right hand side here, we're adding a generate step onto the canvas. So, very different use cases, because creation time is not end-user facing, it's for our platform users, and runtime is for the end user. Two different use cases, but both pretty interesting for large language models.

We established this by launching our AI playground, where you can basically experiment with different types of language models for building your conversational assistants. So before, we had this cool platform that we'd built; now we're layering on the ability to prompt chain, add all these different AI steps, choose different models, but also mix it up with more traditional chatbot and voice assistant building.

Images, responses, we can mix and match. We also launched a knowledge base feature, which basically lets you upload data and return a response that's summarized by a large language model. So you don't only have to manually create your assistants, you can also just do a quick FAQ creation from your data.

Those are some of the use cases that we've built out. I'm describing them not to promote the product, but so you understand what challenges we ran into and how we solved them. So now, taking a step back, I want to talk about what a large language model is, and I have a fairly controversial definition.

The first question I typically ask people is: is BERT a large language model? BERT came out in 2018. Was it a large language model? People are usually hesitant. Maybe in the chat you can say whether you think BERT is a large language model. Then we get to another model, Flan-T5, right?

Is Flan-T5 a large language model? People start to say, okay, well, it's an encoder-decoder model, it does some really cool stuff. Then you get to Flan-T5 XXL, which is an 11 billion parameter model, and it's like, okay, is that a large language model? So the way I've defined a large language model is that it's a general-purpose language model that you can throw a bunch of tasks at.

What I think it is, is something that's better than the original GPT-3, the 2020 release, where you can do a bunch of cool tasks, right? Summarization, generation, et cetera. So that's how I've defined it, because within our platform you need to be able to support a bunch of different use cases and do this across a bunch of different domains.

Because we support automotive, retail, banking, all these different verticals. So, taking a look at all the benchmarks, we get to one benchmark, MMLU, and when we look at the data there are a couple of open-source large language models that are better than GPT-3: Flan-T5 Large, LLaMA.

The Flan family, Falcon came out, and there are a bunch of LLaMA derivatives, but basically that's what I qualify as a large language model, just for the purpose of generation and understanding and how people are typically using them. So with that definition, this is why we use large language models: for this generation ability across different domains and a number of different tasks.

So I'll be using that definition and talking a little bit about model sizes. What's interesting is that for large language models, you typically need multiple GPUs to deploy them if you're using full precision. Falcon 7B can actually be deployed on one GPU fairly easily, but for most of the other models, if you're trying to deploy Falcon 40B at full precision, you need multiple GPUs.

Even the larger LLaMA models, right? We've had a lot of interesting progress being made with quantizing these models, and a couple of the other talks probably covered that: if you're doing three-bit or four-bit precision, you can do it on a single GPU, an A100 for example. Now, integrating large language models with your existing infrastructure is challenging.

We haven't done this integration ourselves. We've been using the OpenAI API for this reason, because we don't want to be running a fleet of A100s, and the space is changing so quickly that you have four-bit papers being released where before it was eight-bit. Progress is moving so quickly that we haven't invested in that LLM infrastructure, because it's not core to our business.

What's core to our business is giving these supportive features to our customers, rather than the research and self-hosting side of things. So we've chosen to use an API, and that's how we've integrated it. We have a service that we use both for large language models and for language models.

We built it out, it's called ML Gateway. It basically connects with our services, and for each of our models we have an endpoint in the service. For large language models, we do a little bit more. So the same way that you would connect to a Voiceflow model, we connect for our generative functions.

We just pass the request through to OpenAI rather than to our own infrastructure on the backend, and then we do some prompt validation, rate limiting, and usage tracking. We use the same service for integrating it because we'd already built it out; it was convenient to just plop it in and replace the API.
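
To make that concrete, here is a minimal sketch of what a gateway layer like this could look like, assuming a simple in-memory rate limiter and the pre-1.0 `openai` Python client; the names, limits, and structure are illustrative, not Voiceflow's actual ML Gateway.

```python
import time
from collections import defaultdict

import openai  # assumes the pre-1.0 openai client; newer versions use a different interface

# Naive per-tenant rate limiting and usage tracking, purely illustrative.
_request_log = defaultdict(list)
_usage_tokens = defaultdict(int)

MAX_REQUESTS_PER_MINUTE = 60
MAX_PROMPT_CHARS = 8000


def validate_prompt(prompt: str) -> None:
    """Basic prompt validation before anything is sent upstream."""
    if not prompt.strip():
        raise ValueError("empty prompt")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too long")


def check_rate_limit(tenant_id: str) -> None:
    """Reject the call if this tenant exceeded its per-minute budget."""
    now = time.time()
    recent = [t for t in _request_log[tenant_id] if now - t < 60]
    if len(recent) >= MAX_REQUESTS_PER_MINUTE:
        raise RuntimeError("rate limit exceeded")
    recent.append(now)
    _request_log[tenant_id] = recent


def generate(tenant_id: str, prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Gateway entry point: validate, rate limit, call the LLM, track usage."""
    validate_prompt(prompt)
    check_rate_limit(tenant_id)
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    _usage_tokens[tenant_id] += response["usage"]["total_tokens"]
    return response["choices"][0]["message"]["content"]
```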

You might have seen already, we also have Claude in the platform, so it's the same thing, connecting to Anthropic in the same way. Now, with large language models you typically don't get the same errors that you get with regular ML models, because our ML models are encoder-based, so there's no generation involved. The model just gives you a class, like "this is what you're going to do," or gives us an embedding.

So it's a very different problem space, and for us it was a little bit new figuring out, okay, what prompt errors are we going to run into? We use large language models to generate JSON, and it takes some time to figure out exactly what the format is, because these models don't produce the cleanest output, right?

So we had to implement some prompt engineering to fix this. Even then it wasn't great, so we needed just some handwritten rules to format it. I know this has changed (it was just yesterday that function calling came out), but for us, when we were doing this integration work late last year, it's something that we ran into and had to test to make sure it was okay.

So here are a couple of different examples of JSON issues where we can't actually parse the response back into our platform. Sometimes it's a little frustrating, right? Because you'd assume that these large language models are trained on correct data and you have all your prompts established, but you still have some issues there.
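
As a rough illustration of the kind of handwritten cleanup rules mentioned above, here is a small JSON-repair helper; the specific fixes are assumptions about common failure modes, not Voiceflow's actual rules.

```python
import json
import re


def parse_llm_json(raw: str) -> dict:
    """Try to recover a JSON object from a model response that isn't quite valid JSON."""
    text = raw.strip()

    # Strip markdown code fences the model sometimes wraps around the payload.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)

    # Keep only the outermost {...} block, dropping any chatty preamble or epilogue.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start : end + 1]

    # Remove trailing commas before closing braces/brackets, a common failure mode.
    text = re.sub(r",\s*([}\]])", r"\1", text)

    return json.loads(text)  # still raises if the response is beyond repair


# Example: a response with a code fence and a trailing comma still parses.
print(parse_llm_json('```json\n{"intent": "order_pizza", "size": "medium",}\n```'))
```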

The way we went through this is that we'd record all the errors coming in through parsing. We track metrics for all the different tasks that we have, and then we'd occasionally run those erroring responses and prompts through other prompts that we created, and we built this test suite that we would back-test against.

If a prompt worked well, we'd push it to our prompt store. This is an example of running one of these tests: you can see here where things would break, and then we'd go ahead and modify those prompts based on those errors.
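
A sketch of that back-testing loop, with hypothetical helper names: replay previously failing inputs against a candidate prompt and only promote it to the prompt store when the regression set passes.

```python
# Hypothetical back-test: replay previously failing inputs against a candidate
# prompt and only promote it to the prompt store if enough cases now parse.

FAILED_CASES = [
    "generate three utterances for ordering pizza",
    "summarize this document as JSON",
]


def run_backtest(candidate_prompt: str, llm_call, parser) -> float:
    """Return the fraction of previously failing cases the candidate prompt fixes."""
    passed = 0
    for case in FAILED_CASES:
        raw = llm_call(candidate_prompt.format(input=case))
        try:
            parser(raw)       # e.g. the parse_llm_json helper sketched above
            passed += 1
        except Exception:
            pass              # still failing; keep it in the regression set
    return passed / len(FAILED_CASES)


def maybe_promote(candidate_prompt: str, llm_call, parser, store, threshold=1.0):
    """Push the prompt to the prompt store only if the back-test clears the bar."""
    if run_backtest(candidate_prompt, llm_call, parser) >= threshold:
        store.save(candidate_prompt)
```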

Now, we also tried out fine-tuning for some of these tasks, for the formatting in our specific use case, but we noticed it didn't actually improve things too much. We found that because you can only fine-tune the smaller models on OpenAI, we didn't get the same performance that we got with GPT-3.5 or GPT-4. So while the formatting was much better, we lost some of that answering ability.

This example here is a Q&A prompt that we had, and we saw decreased performance, so we didn't actually use the fine-tuning in production. In another example here, we saw that fine-tuning gave us an answer, but it was completely hallucinated, while the full-size Davinci model worked all right.

For fine-tuning, we actually used generated data to fine-tune the model in our experiment. We'd pass in some documents, generate questions and answers from them, and then myself and other people on my team took that and made sure the answers were correct, and then we fed that training data into the fine-tuning process.
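
For reference, the legacy OpenAI fine-tuning endpoint used for Davinci-era models expected prompt/completion pairs in JSONL. Here is a sketch of turning reviewed Q&A pairs into that format; the field names on the review records are assumptions.

```python
import json

# Hypothetical reviewed Q&A pairs generated from documents and then human-checked.
reviewed_pairs = [
    {"question": "What plans does Voiceflow offer?",
     "answer": "There are free and paid tiers; see the pricing page.",
     "approved": True},
    {"question": "Can I export my assistant?",
     "answer": "Yes, projects can be exported from the dashboard.",
     "approved": False},  # rejected during human review, so it is skipped below
]

# The legacy fine-tuning format: one JSON record per line with "prompt" and "completion".
with open("finetune_data.jsonl", "w") as f:
    for pair in reviewed_pairs:
        if not pair["approved"]:
            continue
        record = {
            "prompt": f"Question: {pair['question']}\nAnswer:",
            "completion": " " + pair["answer"],
        }
        f.write(json.dumps(record) + "\n")
```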

Now, when we originally did this, the ChatGPT API was not available; that came out in March, if I remember correctly. So we went through and redid the tests on ChatGPT. It was cheaper, but the engineering work to redo some of these prompts and integrations wasn't necessarily worth it. We migrated some models onto ChatGPT, especially all the new stuff, but not the old ones, because from an engineering perspective, as a company, we decided it was working well enough and refactoring wasn't necessarily worth it.

Then GPT-4 came out and we were like, okay, this model's really cool, should we try it out? We did, and it was better, but it was just way too slow for our use cases, so we abandoned it. We still let you use GPT-4 on our platform, if you saw that in the dropdown, but we try to give users some guidelines.

We've published articles and have materials to help users choose which model they want to use. And what we figured out internally is that it's actually fairly challenging to test large language models in production from a conversational perspective, right? We talk about ChatGPT giving responses and whatnot.

So we started building out this large language model testing framework that we're continuing to iterate on and develop in-house, just to make sure that we ourselves can write our own test cases. We want both technical users and some of the less technical folks, who are more hands-on with customers and understand their use cases, to be able to write these test cases and run them.

We eventually want to productize this as well, so that's something we're looking at too.
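
As an illustration, here is the shape such a test case might take so that less technical teammates can write checks without touching the harness; the structure and check types here are hypothetical, not Voiceflow's actual framework.

```python
# Hypothetical declarative test case for a conversational LLM feature.
# A harness renders the prompt, calls the model, and applies each check.

TEST_CASES = [
    {
        "name": "knowledge_base_refund_question",
        "prompt": "Answer using only the provided document: {document}\n\nQ: {question}",
        "inputs": {"document": "Refunds are available within 30 days.",
                   "question": "How long do I have to request a refund?"},
        "checks": [
            {"type": "contains", "value": "30 days"},
            {"type": "max_length_chars", "value": 400},
        ],
    },
]


def run_case(case: dict, llm_call) -> bool:
    """Render the prompt, call the model, and evaluate every declarative check."""
    output = llm_call(case["prompt"].format(**case["inputs"]))
    for check in case["checks"]:
        if check["type"] == "contains" and check["value"] not in output:
            return False
        if check["type"] == "max_length_chars" and len(output) > check["value"]:
            return False
    return True
```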

Now, with large language models, it's interesting because there's this discussion of fine-tuning versus few-shot prompting, and there are a number of proponents of using few-shot instead of fine-tuning. Few-shot is great; the challenge is that when you give a lot of examples, your costs go up quite a bit. So what we learned is that when running in production, you have to be careful with those prompt tokens, depending on whether you're using ChatGPT or GPT-4, to run a 2K-token prompt, which you get to pretty quickly with few-shot learning.

Depending on your task, that can be quite expensive: for GPT-4, you're looking at about 6 cents per interaction. ChatGPT is still pretty reasonable, but you have to be very careful with that cost trade-off, especially in production with higher usage.
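
To make that trade-off concrete, here is a quick back-of-the-envelope calculation using the mid-2023 list prices; pricing has changed since, so treat the numbers as illustrative.

```python
# Rough per-interaction cost for a few-shot prompt, using mid-2023 list prices
# (USD per 1K tokens); check current pricing before relying on these numbers.
PRICES = {
    "gpt-4":         {"prompt": 0.03,   "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
}


def cost_per_interaction(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]


# A ~2K-token few-shot prompt with a short answer:
print(cost_per_interaction("gpt-4", 2000, 100))          # ~0.066 -> roughly 6-7 cents
print(cost_per_interaction("gpt-3.5-turbo", 2000, 100))  # ~0.0032 -> a fraction of a cent
```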

The other thing that we ran into (this is a screenshot from Helicone, but we had similar issues) is that there's a lot of fluctuation in latency. With our internal models we saw consistent latency; with both the ChatGPT and GPT-4 APIs we saw quite a bit of fluctuation and inconsistency. P99s were pretty crazy at some points. So we decided to benchmark it and think through: okay, are we offering Azure ChatGPT or just regular ChatGPT?

We found that Azure was almost three times faster for doing these kinds of tasks, and the standard deviation was also lower, so we got a much more consistent experience. But the trade-off is obviously cost: there's an upfront cost for running that.
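
Here is a sketch of the kind of latency benchmark behind a comparison like that, kept provider-agnostic: pass in one callable per provider and compare the resulting summaries. The prompt and request count are arbitrary.

```python
import math
import statistics
import time


def benchmark(call_llm, n_requests: int = 50):
    """Time n identical requests and report mean, standard deviation, and P99 latency."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call_llm("Reply with the single word: pong")
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]  # nearest-rank percentile
    return {
        "mean_s": statistics.mean(latencies),
        "stdev_s": statistics.stdev(latencies),
        "p99_s": p99,
    }


# Usage sketch: pass one callable per provider (OpenAI, Azure OpenAI, Anthropic, ...)
# and compare the summaries side by side.
```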

We didn't necessarily expect these things to be at play, or at least we hoped they wouldn't be, but because OpenAI and other LLM companies are growing quickly and their models are evolving, there is some instability there. So it's kind of difficult, as a platform, to tell your customer that you have an issue with latency or something else when it's happening downstream.

If anybody experienced the AWS outage earlier this week, it's a little bit awkward, right? Because you can't fix your platform, it's just down, unless you've invested in a much more mature multi-region, multi-model setup. So we've also experimented with different models.

We have ChatGPT, Azure ChatGPT, and we tried out Claude and Claude Instant and found some differences there, and it actually depends on how many tokens you use. I don't have too much time to go into that, but something important from our end is just making sure that we can test these models as they come out and make sure they fit the use case.

We can't benchmark all the models, because that's its own research task, but when we find that a use case matches, we'll try to benchmark a model and add it to the platform. Now we get to the question: should we deploy our own models? We as a company have been deploying our own models into production for around a year and a half now, and it's a trade-off, right?

When we were using an API, whether for fine-tuning or inference on OpenAI, we don't control our own infrastructure. Here you can see it took a little while to sit in a queue and actually fine-tune the model, versus when we run it ourselves, we get a lot more logs and it's a lot faster because we handle the platform.

But it takes a lot more effort to build that out, and it certainly has trade-offs. So now, shifting from our large language model use cases to our language model use cases, we have four main ones: utterance recommendation, conflict resolution, clarity scoring, and our NLU model.

Basically, these are designed around the process of building a conversational assistant, where you want to see: is your data good, how do I get more data, and how do I have an actually good model that powers this chatbot or voice assistant? What was interesting is that the first model we pushed to production was our utterance recommendation model that we built in-house.

But when we integrated with a large language model, we actually deprecated it. Especially with multilingual customers and more and more domains, it just made sense to use a third-party API, something we weren't necessarily considering before, but it just made sense, right?

And as a practitioner, it's like, you made this model, it's your baby, it's running in production, but you've just got to deprecate it because it doesn't make business sense; you're not getting that value. Now, this certainly has trade-offs. I'll talk a little bit about how we deploy our language models.

You have this option of just using an API, right? I talked about this first in January; I called it large language models as a service, and then LLaMA came out, and now the two blend together a little bit when you look at the acronym. But when you're using large language models as a service, it's very easy,

Or at least a lot easier: you send your data, or you just call the API, and it works. But it's certainly a matrix, right? Are you hosting your own model? Are you training your own model? Are you using your own data? There are all these different considerations to go into, and there's not as much time to talk about it, but you definitely need to make that decision about where you take this trade-off.

For us, as you can see on the top right, we built our own ML platform that does our own fine-tuning, we host it ourselves, and we run inference. We actually let our customers train models in real time, so that adds an extra level of complexity. But we also use OpenAI, which is like the bottom left, so you don't actually have to choose the same hosting solution for all your models.

You just need to make sure it makes business sense, right? We don't do things in the middle here; we've taken an opinionated approach at either end. But as technology changes and value propositions change, you always have to know what makes sense, right?

So, for example, if we're going to deploy a large language model, it might make sense to use a more managed solution rather than trying to wrangle A100s, or H100s, or whatever the GPU requirements are, ourselves. Now, it's interesting, because you get into this conversation of: can LLMs do everything?

One of our primary models is an NLU model, which in industry is a little bit of a misnomer. Typically NLU means natural language understanding, but in the chatbot space it's focused on the tasks of intent and entity detection. Intent detection: a user says something and you try to match it to a class.

So "I want to order a pizza" matches to the order-pizza intent. Then entity extraction: you might say a medium pizza or a cheese pizza, and it's kind of like a keyword, a piece of information that you want to get out of that sentence. These models have been around for a while, there are several commercialized versions out there, and we decided to build one ourselves because it's part of our core business.
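
In other words, an NLU model maps a single utterance to a structured prediction along these lines; the exact schema and confidence value here are illustrative, not Voiceflow's actual output format.

```python
# Illustrative NLU output for the pizza example: one intent class plus extracted entities.
utterance = "I want to order a medium cheese pizza"

nlu_prediction = {
    "intent": {"name": "order_pizza", "confidence": 0.97},
    "entities": [
        # spans are character offsets into the utterance, end-exclusive
        {"type": "size",    "value": "medium", "span": [18, 24]},
        {"type": "topping", "value": "cheese", "span": [25, 31]},
    ],
}
```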

The problem is that sometimes when you build your own models, you have these outages. You can see here a spike in latency that we ran into, and that can be kind of challenging, right? Because prod goes down and everybody asks, why is prod down? Why don't we just use a managed service? That goes into the consideration as well.

We made the decision to host the models ourselves because they're quite custom to what we've built. But what ended up happening is, with the original architecture we built, if we go back to that list of four models, we started off with utterance recommendation, and we had certain requirements.

So we built it on a pub/sub architecture, where some of these requests might be a little longer running. We had an SLA of 150 milliseconds in each direction for P50s. That was okay, and using the technology we selected, Google Pub/Sub, it worked out great. We got to do lots of cool things like schema validation, and it handled traffic and multi-region.

That was quite good. But what ended up happening is that when we deployed the Voiceflow NLU, this model that needed to do inference, we realized that with this pub/sub architecture our language model P99s were getting too high. It took too long for certain requests to come back, and part of it was down to the pub/sub architecture.

And it feels bad, because you designed this platform, you built it a year before we wanted to deploy this model, and it worked great until it didn't. So then we had to do a refactor and a re-architecture. What's interesting is that the actual language model had very consistent latencies, right?

It was between 16 and 18 milliseconds; the language model had great response times. But with Pub/Sub itself, you can see here, the P99s got quite high in each direction of actually sending the message back and forth. The P50s were great, you can see here, like 20 milliseconds, 30 milliseconds in each direction.

But you had those massive spikes, because the tech wasn't built for this, and we didn't actually know we were going to build this model when we were building the platform. So we had to re-architect. We used Redis as our queue and deployed it closer to our application layer. That had some challenges too.
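
A minimal sketch of that Redis-backed queue pattern using the standard `redis` Python client; the key names, payload format, and `model.predict` interface are assumptions for illustration.

```python
import json
import uuid

import redis  # standard redis-py client

r = redis.Redis(host="localhost", port=6379)

QUEUE_KEY = "nlu:inference:requests"  # hypothetical key name


def enqueue_request(utterance: str) -> str:
    """Producer side: push an inference request onto the queue and return its id."""
    request_id = str(uuid.uuid4())
    r.lpush(QUEUE_KEY, json.dumps({"id": request_id, "utterance": utterance}))
    return request_id


def worker_loop(model):
    """Consumer side: block on the queue, run inference, store the result briefly."""
    while True:
        _, payload = r.brpop(QUEUE_KEY)  # blocks until a request arrives
        request = json.loads(payload)
        result = model.predict(request["utterance"])  # assumed JSON-serializable
        # The result is written to a per-request key the caller polls or blocks on.
        r.set(f"nlu:result:{request['id']}", json.dumps(result), ex=60)
```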

But we ended up doing it, and it hit our P50 and P99 targets compared to our earlier deployment, and we outperform a lot of the industry standards. Based on the testing we've been doing on various datasets against a couple of competitors, we're doing quite well. So we invested in that language modeling capability because it's core to our business, right?

It was something that we wanted to bring in-house. And the question is: well, large language models are great, right? They're amazing, they're really cool, they can do all these tasks. So how do they compare to this custom model that you spent a lot of time building? Well, the first thing is that the latency is quite high.

That's one of the challenges. The other thing is that the accuracy and the cost just don't make sense. Our NLU model still outperforms GPT-4 in this one test that we did, both on cost and on accuracy. GPT-4 does a great job as a model and it's very easy to use, but it costs a thousand times more than our model to actually run inference on.

And this is with only 3,000 inferences being run, so if you're going into production as a large company, it's quite expensive, right? So that's that: large language models versus language models. I've talked through a whole bunch of topics, but I'm happy to answer questions, and if you have any more questions, you can always reach out to me by email or on LinkedIn.

Awesome, thank you so much. And yeah, everybody who's watching, check out Denys's email if you have questions, and we'll give it a moment in the chat. But I think maybe folks can reach out a little later, because we're wrapping up. That was awesome. How'd you feel about the talk? Yeah, I mean, it's always interesting with these talks.

But it feels good. I mean, there's so much to talk about on the subject, I could probably do like an eight-hour talk, but you've got to wrap it up at some point. No, sometimes with the 10-minute talks I can tell people are just crunching all the information down, and it's so hard. Yeah. Well, cool.

All right, well, thank you so much for being our final talk of the day. Definitely send Denys your questions, and maybe, Denys, if you want to jump in the chat, you can see if there are questions there. And yeah, we're wrapping up two days, I think 84 talks. It's been crazy. Yeah, thanks for having me.

I'll jump in the chat and answer any questions there. Okay, awesome. Thank you so much. Bye, Denys.
