MLOps Community

The Confidence Checklist for LLMs in Production

Posted Jul 14, 2023 | Views 574
# LLMs in Production
# LLM Deployment
# Portkey.ai
SPEAKER
Rohit Agarwal
CEO @ Portkey.ai

Rohit is the co-founder and CEO of Portkey.ai, an FMOps stack for monitoring, model management, compliance, and more. Previously, he headed Product & AI at Pepper Content, which has served ~900M generations on LLMs in production.

Having seen large LLM deployments in production, he's always happy to help companies build their infra stacks on FM APIs or open-source models.


SUMMARY

For two whole years of working with a large LLM deployment, I always felt uncomfortable. How is my system performing? Do my users like the outputs? Who needs help? Probabilistic systems can make this really hard to understand. In this talk, we'll discuss practical, implementable items to secure your LLM system and gain confidence while deploying to production.

TRANSCRIPT

And I'm so pumped to introduce our next speaker, Rohit. He's gonna be talking about the confidence checklist for LLMs in production, which I think is incredibly important and very timely. So let's bring him on stage without further ado. He's gonna do a full presentation, actually our first one of the day.

So Rohit, you'll have 30 minutes. If people have questions, please drop them in the chat; I think we'll have a couple minutes at the end to answer them. So let's bring Rohit on. Hello, how are you doing? Very good. How are you doing, Lily? I'm good, I'm good. Were you playing the trivia game? Yeah, I was. That was fun.

I mean, I didn't get a lot of it right, but I think Demetrios makes it fun. Yeah, definitely. It was the speaker one, right? Yeah, it's about the speakers. Okay, cool. Yeah, I feel like some of the fun facts about speakers have just been blowing our minds. So it's a good trivia game.

Absolutely. I mean, there are people who've DJ'ed in clubs in SF, which is super awesome. Yeah, that's crazy. Not my past experience, but other people's. Alright, well here are your slides. Can you see them? Yeah, all good. Okay, go for it. Thanks. Hey, hope all of you are having a great time.

Glad to be talking about the confidence checklist. Please keep asking questions. We'll also take questions at the end, but if there's anything coming up, I'll keep looking at the chat as well. Awesome, so let's get started. A little bit about me: you can find me on Twitter if you just wanna reach out or see what I'm thinking.

Right now I'm building Portkey.ai. Previously I was building Peppertype, and I was at Freshworks and Framebench. Interesting facts: I've written production code and PRDs using AI, so I'm a product and AI person. And if anybody tells you that you can't write production code with it, talk to me.

I think there are definitely ways to do it, and it's really good at it. In a product I've built, we've generated over 900 million tokens in production, so we've seen it over a period of time, and I think the technology impact and what is possible using this is just massive. I still haven't trained my own private AI, so I'm still naive in the GPU space, but I have seen a lot of stuff in production.

Cool. So what I'm trying to do in the next 20-25 minutes is talk about some things that have been production learnings for me, things that have really impacted production. So if you're thinking about going to production, think of this as a checklist: six things that you should really do.

We'll try to keep it as actionable as possible so that you can go and implement these learnings in real time. Perfect. So first, think about output validations. When I was launching my products, and I've seen many other people do this as well, we just rushed code out without worrying about output validations.

What will often happen is that while the code works in the test or development environment, because your prompts are limited to the things you're testing, you might be getting great outputs. But when it goes to production, you'll have 20% of your customers who are not getting great outputs, and that 20% is actually a really large churn number.

So you'll quickly see a lot of people starting to churn because your product works eight out of ten times, and that'll cause a lot of challenges. A very easy way to solve this is output validations, so let's talk about how you can implement them.

This will be the format of the entire deck: we'll go through why, what you should do, and then some solutions in a basic / advanced / expert format. Stuff that's worked out for me, but if there's more stuff that's worked out for you, let me know. The very basic stuff is to just check for empty strings and character lengths.

So if you know that your output needs to be at least a hundred characters or 20 words, just add that check so that if it doesn't pass, you can retry. You can also check the format of the output. So if you're generating code or you have to generate JSON, just check what you generated.

Guardrails is a very good library to do this. I use it personally, and it's a very easy way to make sure that your output matches the user's expectations. So that's the basic stuff. Definitely get this done. It probably takes 10 minutes to set up, so there's no reason not to do it.
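
A minimal sketch of those basic checks (length and JSON validity, with a retry). `call_llm` is a hypothetical wrapper around whatever provider you use, and the 100-character minimum is just an assumed threshold; the Guardrails library does richer, schema-based validation than this:

```python
import json

MIN_CHARS = 100   # assumed minimum length for this use case
MAX_RETRIES = 2

def validate_output(text: str, expect_json: bool = False) -> bool:
    # Reject empty or suspiciously short generations
    if not text or len(text.strip()) < MIN_CHARS:
        return False
    # If we asked for JSON, make sure it actually parses
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
    return True

def generate_with_validation(prompt: str, call_llm, expect_json: bool = False) -> str:
    # call_llm is whatever function hits your model or provider
    for _ in range(MAX_RETRIES + 1):
        output = call_llm(prompt)
        if validate_output(output, expect_json=expect_json):
            return output
    raise ValueError("Output failed validation after retries")
```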

When you go a little more advanced, focus on whether you can check for relevance. If you've done even basic NLP, checking for relevance against the user's input query is very easy, so just check for relevance. Also try ranking results. If you're showing or generating multiple results, you can rank them. Cohere offers an API for this, and there are very simple re-ranking algorithms available.

This also works very well if you're doing a RAG type of use case. If you're fetching multiple vectors from your vector store, you can rank those based on the input query, which makes it much easier to pick which one makes the most sense to answer the user's question.
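
A rough sketch of re-ranking retrieved chunks against the input query by cosine similarity; `embed` is a hypothetical embedding function, and Cohere's rerank API or a cross-encoder would typically do this more accurately:

```python
import numpy as np

def rerank(query: str, chunks: list[str], embed) -> list[str]:
    # embed is a hypothetical function that returns a vector for a string
    q = np.array(embed(query))
    scores = []
    for chunk in chunks:
        v = np.array(embed(chunk))
        # cosine similarity between the query and this chunk
        scores.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    # highest-similarity chunks first
    order = np.argsort(scores)[::-1]
    return [chunks[i] for i in order]
```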

The third piece, which experts have said time and again, is to try to answer questions over a closed domain. If you're worried about getting the facts right and being truthful, the only way to do it, the only way to have LLMs not hallucinate, is to work over a closed domain.

So give it all of the context, tell the LLM to only use this context to answer the question, and then ask your question. That improves truthfulness by another level, so definitely do that.
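
A small sketch of what a closed-domain prompt could look like; the exact wording here is an assumption, not a prescribed template:

```python
def closed_domain_prompt(context_chunks: list[str], question: str) -> str:
    # Instruct the model to answer only from the supplied context
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```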

The expert level would be model-based checks. This is a very interesting concept that OpenAI Evals launched and now almost everybody's doing. But we're not talking about batch evals; I'm really saying, can you evaluate in real time the output that is being sent out? Obviously this introduces some latency, and there are ways to work around it, but these are model-based checks. Basically, you get an answer from an LLM, and you can reply to it if it's a chat model, or just send another prompt saying, are you sure?

Just the fact that you ask the question lets the model think about the answer and come back with whether it was right, or correct its answer. Extremely helpful. "Are you sure?" is such a powerful prompt. You should definitely check it out, if not in production then at least in testing. But yeah, that's how you do output validations.
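
A sketch of that "are you sure?" style check, again assuming a hypothetical `call_llm` chat wrapper rather than any particular provider's API:

```python
def self_check(question: str, answer: str, call_llm) -> str:
    # Send the model its own answer back and ask it to verify or correct it
    review_prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Are you sure? If the answer is wrong or unsupported, reply with a corrected "
        "answer; otherwise reply with the original answer unchanged."
    )
    return call_llm(review_prompt)
```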

Moving on. Secondly, and I'm sure nobody's really thinking about this, but prepare for a DDoS, prepare for a lot of users starting to use you. I speak from experience: we were building an app, we were having fun, and nobody really expected that we'd get DDoSed. Now, in most cases you might think the problem with getting DDoSed is, hey, we'll go down.

Yeah, so the fail whale comes up, people know you're popular, and that's awesome. But what's actually gonna happen is that if you stay up, you're gonna make hundreds of thousands of requests to another server, maybe to OpenAI, and that is gonna blow up your bill like crazy. You won't realize it, but in a matter of days the token costs add up very quickly.

So you really need to figure out how to keep a lot of bad traffic off your website. Basic stuff: just add a captcha. If you're getting DDoSed, if you immediately start to see a lot of bad traffic coming to you, just add a captcha. That solves a lot of the problems, so at least simple attacks don't get in.

But right after that, I think you should really invest in rate limiting users and organizations. How do you make sure that all of your users across your application have a great experience? By rate limiting the folks who might be abusing your system. So it's always good to have these checks in place.

Right now there are tools to do this very easily, so there's no reason not to do rate limiting across users or even across organizations. It's also good to build this into pricing plans. So if you're building a product and you're gonna price it, then it's also good to rate limit users. Or if you're doing it internally, individual users can have quotas, so you're not exceeding them and everybody gets the best performance.
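
One way this could look: a simple in-memory fixed-window limiter per user and plan. The limits are made-up numbers, and in production you would likely back this with Redis or an API gateway rather than a process-local dict:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60
LIMITS = {"free": 10, "premium": 60}  # assumed requests per minute per plan

_request_log: dict[str, list[float]] = defaultdict(list)

def allow_request(user_id: str, plan: str = "free") -> bool:
    now = time.time()
    window_start = now - WINDOW_SECONDS
    # Drop requests that fell outside the current window
    _request_log[user_id] = [t for t in _request_log[user_id] if t > window_start]
    if len(_request_log[user_id]) >= LIMITS[plan]:
        return False  # over quota: reject or queue this request
    _request_log[user_id].append(now)
    return True
```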

At the expert level, we actually ended up implementing IP-based monitoring and fingerprinting, which is a much deeper concept, but there are companies that help you do this much more simply as well. To talk about IP-based monitoring a little more: how it helps is that instead of identifying a user by their email address or user ID, whatever you have stored in your database, you can use the IP address to figure out if it's the same user creating malicious accounts on your system.

A captcha would obviously help, but sometimes a user just loves your product and is using it extensively. You don't wanna be caught in a situation where you end up spending so much money that it doesn't make up for what you're charging the user. So definitely think about doing it.

Another interesting variant of getting DDoSed is that there are some languages where tokens are very expensive, so a short word can actually cost you a lot of tokens. So keep a lookout for multilingual usage of your product. It's a good way to keep track of which users, in which language, are driving which costs, so that you can manage those costs better.
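
A quick way to see this effect, using OpenAI's tiktoken tokenizer; the sample strings are only illustrative, and counts vary by model and tokenizer:

```python
import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

samples = {
    "en": "How do I reset my password?",
    "hi": "मैं अपना पासवर्ड कैसे रीसेट करूं?",
    "ja": "パスワードをリセットするにはどうすればよいですか？",
}

for lang, text in samples.items():
    # The same short question can cost noticeably more tokens in some languages
    print(lang, len(enc.encode(text)), "tokens for", len(text), "characters")
```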

It's better to be safe than sorry in these cases, because OpenAI costs can just blow up. I remember at one point our OpenAI bill was six times the AWS bill, and I'm sure a lot of people will agree they've seen this in many cases as well, so just look out for it.

Adjacent to this is building user limits. It's important that you're not winging it, and user limits are again really easy to build. How do you make sure that all of your users have a really good experience and are all able to use your product with the highest level of reliability and accuracy?

I keep stressing this point, all your users, because for sure some of your users, your power users, will have an amazing experience, but you actually want to capture all of your users. The basic stuff: just start with client-side rate limiting, just bound it. It sounds odd even as I say it, but there were people who would just keep clicking the generate button multiple times.

Now what's happening is that the request is eventually gonna take much longer. You could have prevented it with a simple debounce. ChatGPT already does this today: if you're using ChatGPT, until one response is complete, you cannot open up another chat.

You can't send another message. So that's debouncing on a very different level. You can have debouncing on button clicks, you can have a debounce on whether a request is already in flight, and you can even have rate limiting on how many requests a user can make in a given time period. Again, ChatGPT implements it with GPT-4: they say you can only do 25 requests across three hours. But yeah, simple stuff to implement. Don't overthink it; build it out as you go.
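
A sketch of a server-side debounce plus an in-flight guard; the two-second gap is an assumed value, and client-side debouncing on the button itself would complement this:

```python
import time
from collections import defaultdict

DEBOUNCE_SECONDS = 2  # assumed minimum gap between generate clicks

_in_flight: set[str] = set()
_last_request: dict[str, float] = defaultdict(float)

def start_generation(user_id: str) -> bool:
    now = time.time()
    # Refuse if this user already has a request running
    if user_id in _in_flight:
        return False
    # Refuse rapid repeat clicks
    if now - _last_request[user_id] < DEBOUNCE_SECONDS:
        return False
    _in_flight.add(user_id)
    _last_request[user_id] = now
    return True

def finish_generation(user_id: str) -> None:
    _in_flight.discard(user_id)
```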

They say you can only do 25 requests across three hours. Um, but yeah, again, Simple stuff implemented. Don't, don't just worry. Build it out as you go. A little more advanced. When you have multiple organizations coming in, you might want to go rate limits. You want, you might wanna build rate limits, especially segment based.

So you could say, for organizations on my premium plan or for this set of users, these are the rate limits, versus for some other users I've got higher or lower limits. It's also helpful to create sort of an allow list, where people are now trusted and you can grant them higher rate limits. So everybody coming in can probably start with a lower rate limit.

And then as they build trust with you, you can start increasing their rate limits. How you implement that is you just put people into different segments. Any new user stays in an untrusted or quarantine bucket for the first seven days. When you're ready to take them out, based on some amount of usage or concurrent usage, you can move them to the trusted bucket, which then has higher rate limits, or maybe paid customers get higher rate limits.
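
A tiny sketch of that bucketing logic; the `user` object with `created_at` and `is_paid` fields, and the per-bucket numbers, are assumptions:

```python
from datetime import datetime, timedelta

BUCKET_LIMITS = {"quarantine": 5, "trusted": 30, "paid": 120}  # assumed requests/minute

def rate_limit_for(user) -> int:
    # New accounts stay in a low-limit quarantine bucket for their first week
    if datetime.utcnow() - user.created_at < timedelta(days=7):
        return BUCKET_LIMITS["quarantine"]
    if user.is_paid:
        return BUCKET_LIMITS["paid"]
    return BUCKET_LIMITS["trusted"]
```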

So that's an easy way of implementing rate limiting on the user side. Obviously the expert level is dynamic rate limiting, which I was just talking about. How can you really identify abusers? Fingerprinting is one way to do that, but can you identify people who are abusing the system and decrease limits for them?

I hope that's not too complex; it's the same concept as bucketing, but now you can dynamically set separate limits for people. Companies already do this. There's a concept of credits or karma that you can build into your system, where everybody gets a certain level of rate limits, and that is increased or decreased based on the kind of usage you're seeing from them.

A general practice is to start with a lower limit and keep increasing it as trust in the user increases. We've talked about fingerprinting; it's extremely useful when you're launching to production with a lot of users.

Perfect. Number four: caring about latency. I sort of love this meme, but I think latency is the biggest factor that's different between older user experiences and LLM user experiences. Do you remember times when you would get frustrated if an app took even two seconds to respond to you?

You would get finicky and you wouldn't want that; you actually wanna spend time on apps that are really fast. Google said, hey, we're searching through millions of documents within milliseconds. But today that entire thing has changed, and OpenAI calls or inference endpoints can take a really long time to come up with an answer.

So it's natural for us to stop caring about latency, because you can say, hey, my request is gonna take 15 seconds, users just have to cope with it. But what's happening is that users are having a bad time, and they're not used to systems like this. It's interesting that ChatGPT has probably normalized it, but they use a variety of things to improve the perceived latency of their product.

So just keep that in mind. Maybe you can't reduce latency because inference takes the time it takes, but you can improve the perceived latency of your product, and I think that's what is really important. So how do you do it? Basic stuff:

Just implement streaming. It's the easiest way to get started, and it's one parameter: in most API calls you can just say stream=true, and that improves the perceived latency. So instead of the user waiting 15 seconds for that circle to spin around and then getting an output, you can actually show them the data coming in, and it reduces the user's anxiety.

So perceived latency goes down a lot, because the time to first token, and this should probably become a metric, TTFT, time to first token, is maybe a second, which is the kind of experience people are really used to, and that's also what ChatGPT and other products have gotten them used to.
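
For illustration, a streaming sketch that also measures time to first token; this assumes the current OpenAI Python SDK (v1+), but most provider SDKs expose a similar stream flag:

```python
import time
from openai import OpenAI  # assumes the OpenAI Python SDK, v1+

client = OpenAI()

def stream_completion(prompt: str) -> str:
    start = time.time()
    first_token_at = None
    pieces = []
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # the one parameter that changes perceived latency
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.time()  # time to first token (TTFT)
            pieces.append(delta)
            print(delta, end="", flush=True)
    if first_token_at is not None:
        print(f"\nTTFT: {first_token_at - start:.2f}s, total: {time.time() - start:.2f}s")
    return "".join(pieces)
```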

So just implement streaming, get it done. The advanced version is to implement streaming properly. Just saying stream=true is the simple way of doing things, but streaming is complex enough that you'll have to start handling a lot of edge cases. For some users, streams can break, or streams can have multiple data objects arriving together.

There are multiple implementations of this, but it's always good to test out streaming really well. While it improves latency, it can also break your app in multiple ways. I think just yesterday Vercel released their AI library, their AI SDK, and I think that has streaming done really well.

But if you're implementing streaming, I would say look at some of these edge cases, where you can have incomplete chunks or multiple chunks coming in as one; it just makes for a bad user experience if the stream breaks in between. The second piece is handling rate limits with backoff. What I mean is, and this again goes back to user rate limits:

it's just an important concept that if you have some users who are taking up all of your usage, then the other users are gonna suffer. What's happening is you're hitting the rate limits of your endpoint provider, and that's gonna start hurting your users as well.

So it's always good to handle these with exponential backoff. Just use something like tenacity in Python, or again there are solutions out there to do this, but handle your rate limits gracefully. Do that with exponential backoff, do that with random jitter, concepts that have existed in engineering for ages but that we're somehow not applying to LLMs.
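
A sketch of that pattern with the tenacity library (exponential backoff with random jitter); `call_llm` here is whatever function actually hits your provider:

```python
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Retry with exponential backoff plus random jitter, capped at 6 attempts
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def call_llm_with_backoff(call_llm, prompt: str):
    # When call_llm raises (e.g. a rate-limit error), tenacity waits and retries
    return call_llm(prompt)
```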

The third is caching. Again, a super simple concept, but you can semantically cache your queries, and, this is a plug, we're just launching a product around semantic caching, but semantic caching can actually improve your response times dramatically. So imagine that for a percentage of your responses, you don't have to go to

GPT or your inference endpoint, run the inference, and show the answer. You can serve a cached response instead, because these are either the same queries or similar queries coming in. This is especially useful if you're building a RAG system or you're chatting with documents or data; you'll realize that probably 20 to 30% of the time, users across an organization are asking the same queries.

So you could just cache it, and whenever you see a similar query coming in within a timeframe, you return from the cache. That's a very big performance boost. People don't wait those 15 seconds; the frequently asked questions get answered really, really fast. So you should check it out.
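
A toy sketch of a semantic cache: embed the query, compare against cached query embeddings, and skip inference on a close-enough match. `embed`, `call_llm`, and the 0.92 threshold are all assumptions:

```python
import numpy as np

CACHE_THRESHOLD = 0.92  # assumed similarity above which queries count as "the same"
_cache: list[tuple[np.ndarray, str]] = []  # (normalized query embedding, response)

def cached_answer(query: str, embed, call_llm) -> str:
    # embed returns a vector; call_llm hits the model -- both are hypothetical
    q = np.array(embed(query))
    q = q / np.linalg.norm(q)
    for vec, response in _cache:
        if float(q @ vec) >= CACHE_THRESHOLD:
            return response  # semantic cache hit: skip the inference call entirely
    response = call_llm(query)
    _cache.append((q, response))
    return response
```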

At the expert level, you can think about building fallbacks. If one provider fails or one provider is taking too long, can you fall back to another provider? The most common setup I've seen in production is where you start with OpenAI and put Anthropic as a fallback, or you have GPT-4 as your main model but fall back to GPT-3.5. It's graceful degradation, but users still have a good latency experience.
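
A minimal sketch of a fallback chain across providers or models; the provider callables are hypothetical placeholders for whatever clients you actually use:

```python
def generate_with_fallback(prompt: str, providers: list) -> str:
    # providers is an ordered list of callables, e.g. [call_gpt4, call_claude, call_gpt35]
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:  # timeouts, rate limits, 5xx from the provider
            last_error = err
            continue  # graceful degradation: try the next provider or model
    raise RuntimeError("All providers failed") from last_error
```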

Second, you can implement queues. So instead of all the requests coming in and you fanning them out to your API provider, you can queue them, which obviously helps with rate limiting at your end, and users get a more consistent experience.

So you can always say that for every user, only one query gets executed at a time. Yeah, I'm just reading the chat: the traditional role of a software engineer. I really feel that a lot of the time, what we're doing with LLMs is more engineering than machine learning, because for the large part the machine learning is already done.

Somebody has built and pre-trained these models, and now you're just using them. So it almost feels like I'm reminding everybody of concepts you've already implemented with other APIs; there are just minor nuances in how you implement them in your LLM strategy as well. Perfect. So, moving on to logs and monitoring.

This is again a hard lesson that we've learned. When you start by saying, let's log all our requests, let's monitor latency, cost, and tokens, the first thought is: I already have a monitoring and logging system in place, let me just push everything to it. What happens is that a lot of these systems were not built for unstructured data.

With LLMs you have these large prompts and these large outputs, and the logging and monitoring systems were built for structured data. They expect smaller strings, they expect to index everything, and this quickly becomes either very expensive or very slow. And they're not built for probabilistic models, right?

To give you an example: until recently, APIs returned a success or an error. Now APIs return a success, but you don't know if the output is accurate or not. So there's a whole range: accurate, partially accurate, completely inaccurate, but the API returned success. Datadog or any other monitoring software cannot really monitor that.

So you'll need a specialized system to start monitoring whether the LLM APIs are working well for you or not. The basic option is to either live with some amount of poor visibility as you're getting started, which is fine, because as you're getting early users and they're trying it out, you may not want that level of complexity,

or you just pay for Datadog or CloudWatch or the likes and spend some time setting it up. Here I don't have an in-between option; I don't think there's an easy or "advanced" way to do this. The expert way: I have seen companies build out their own monitoring layers, with something like Elastic and Kibana, for probabilistic text models.

So you've done some amount of evals on the models, then you're pushing this data into a columnar data store, and then you put a Grafana dashboard on top of it. If you're building a really large app and this is core to you, I would really say you should think about what the DevOps layer looks like for monitoring, logging, and analytics.
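
Even before a full monitoring stack, one structured log line per LLM call goes a long way. A sketch, where `call_llm` is a hypothetical wrapper that also returns token counts, and the price constant is an assumed value you'd replace with your provider's actual pricing:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_telemetry")

def log_llm_call(call_llm, prompt: str, model: str, cost_per_1k: float = 0.002):
    start = time.time()
    # Hypothetical wrapper returning (output_text, prompt_tokens, completion_tokens)
    output, prompt_tokens, completion_tokens = call_llm(prompt, model)
    record = {
        "id": str(uuid.uuid4()),
        "model": model,
        "latency_s": round(time.time() - start, 3),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "est_cost_usd": (prompt_tokens + completion_tokens) / 1000 * cost_per_1k,
        "prompt": prompt,
        "output": output,
    }
    # One structured log line per request; ship these to whatever store you query later
    logger.info(json.dumps(record))
    return output
```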

I think that becomes important as you're building out the app. If somebody's using Datadog, it's great to get started; as you hit walls, you can start to upgrade, migrate, and try out different solutions. But it's just something to watch out for, because it becomes important as you start to scale your app.

Lastly, the sixth thing is implementing data privacy. As you get started, most companies aren't really worried about it, but think about the fact that a lot of private data is going across the wire to multiple applications and then to your API provider of choice. So how do you implement data privacy early on?

There are very easy ways to do it, and it's important that you do, because when it hits you that you're not compliant with GDPR or CCPA, the fines can be drastic. It's also something people value a lot, so building it into your system early on makes users really, really happy.

So this is one of the quick things that ensures a great user experience early on. Basic stuff: I think there's no basic stuff. You can just close your eyes and say it's fine, privacy is the way it is, I'm not gonna worry about it to begin with. A lot of people do that. I don't think it's okay, but anything to get your product off the ground, right?

So basic is closing your eyes. Advanced mode is: let's amend our GDPR, cookie, and privacy policies first. Let's tell people, hey, your data is going here; we either have a data processing agreement with them or we don't, but at least tell your users that this is what you're doing. And then secondly, you can actually implement PII masking.

If you're using LangChain or LlamaIndex, or even if you're using the base APIs, you can pick a library to implement PII masking within your code. So whenever users are giving you any information, you make sure you're anonymizing all of it before it gets sent to other systems.

Fairly easy libraries exist for this, so just do it. When you're going for a more advanced mode, or especially if your target user segment really cares about privacy, say the medical or financial sector, I would recommend using something like Microsoft's Presidio library (Azure also offers it as a service), which can identify PII to a much higher degree of accuracy.
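
A short sketch using Microsoft's Presidio analyzer and anonymizer (assuming the presidio-analyzer and presidio-anonymizer packages are installed):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def mask_pii(text: str) -> str:
    # Detect entities like names, emails, and phone numbers, then replace them
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text

# e.g. mask_pii("My name is John and my email is john@example.com")
# would typically return something like "My name is <PERSON> and my email is <EMAIL_ADDRESS>"
```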

So when you wanna make sure that no PII should ever leave your system, or even within your system you need to store PII carefully, then just implement this. Perfect. So if I were to give a quick rundown of everything we discussed: one, make it reliable.

What do people really care about? Make it reliable. This is true for almost every piece of software that exists, and we're reimplementing almost all of it as we talk about MLOps for LLMs, or LLMOps, or whatever you wanna call it. But one: stable APIs, no downtime, faster responses. Second is making it accurate: how do you reduce hallucinations on LLMs?

Provide relevant, moderated, consistent generations. You can solve this: start by using the biggest, baddest model that's out there, then go down the list and improve prompt engineering, and if that doesn't work, keep going and maybe fine-tune your own models. Evals are a great way to continuously monitor whether your responses are accurate, consistent,

moderated, and compliant. So just do that, and if that doesn't work, it's usually a good idea to consider building your own models; accuracy is important. The third piece is making it cheap. I've seen this in a lot of applications, with friends and in my own application: you can get started with a POC,

you're seeing a lot of traffic, you're excited, things are going well, but then inference costs can be really expensive if you're trying to build a SaaS product on top of it. SaaS products generally enjoy very high margins, but inference costs can very quickly eat up those margins. So how do you make sure that the economies of scale are working in your favor,

or that after a point in time you're transitioning to smaller, cheaper models without impacting your quality as much? That's a balance to maintain. Reliability, accuracy, and cost are probably the three things your users worry about the most, and you just have to implement these concepts in your LLM app.

So what should we do? Implement telemetry; you cannot fix what you don't know. Telemetry for LLMs is a little different, a little more nuanced, so check that out. Build systems to improve resilience: fallbacks, retries, streaming, the UX pieces you can do to improve perceived latency and perceived reliability. And collect human feedback early, so that you're fine-tuning your data and making sure things are really going well for you.

So while you can launch your POC at a hackathon and everybody's excited, getting production ready is a fairly long journey. It's a marathon, not a sprint, so make sure you're spending time on getting through it properly. Lastly, and I hate saying the buzzword, but: don't just be a wrapper.

What people are trying to say is that a foundation model is only the compute layer. Think of it as your server, an EC2 instance; that's the bare metal that exists, and you need to build a product on top. So while every piece of software was probably an "AWS wrapper" before this, here you just have to find what your value add is and then be production ready with it. You can hack together a solution for a POC or an MVP, but going to production is a completely different beast.

So yeah, I wish all of you the very best in getting to production. Any questions? I'm available on Twitter, but happy to take questions now as well. Awesome, thank you so much, Rohit. Definitely head over to the chat if people have questions, and find Rohit on Twitter. That was an awesome talk, but we have to kick you out to prepare for our next panel.

Absolutely. It was fantastic doing this. Thanks so much, Lily. Thank you. So long.
