MLOps Community

It Worked When I Prompted It

Posted Jun 28, 2023 | Views 350
# LLM PoCs
# LLM in Production
# Sleek.com
Soham Chatterjee
Machine Learning Lead @ Sleek

Soham leads the machine learning team at Sleek, where he builds tools for automated accounting and back-office management. As an electrical engineer, Soham has a passion for the intersection of machine learning and electronics, specifically TinyML/Edge Computing. He has several courses on MLOps and TinyMLOps available on Udacity and LinkedIn, with more courses in the works.

SUMMARY

The journey from LLM PoCs to production deployment is fraught with unique challenges, from maintaining model reliability to effectively managing costs. In this talk, we delve deep into these complexities, outlining design patterns for successful LLM production, the role of vector databases, strategies to enhance reliability, and cost-effective methodologies.

TRANSCRIPT

Link to slides

Hey, Soham. How's it going? Hi, so nice to meet you. It's going well; how are you doing? Let me take this giant QR code off the screen. Thank you so much for joining us. I think I have your slides up, so we'll get started in just a moment.

Thank you, everybody, for bearing with us through the technical difficulties; it would not be a true virtual conference without them. And thank you, Soham. All right, let me try your slides. I think these are from your screen now. Yeah? Awesome. Take it away.

Thank you so much. I'm really happy to be here. It's not easy giving a talk after Matei and Chip speak, but I hope I can still say something that you find useful.

So my talk is going to be about the challenges I faced while building an LLM product. Over the last six months or so of building it, I've learned a lot about how to solve some of the challenges that come up when deploying LLM products.

I'm hoping to tell you a bit about that, and hopefully give you a preface to some of the things other people will be talking about later in the conference. A bit about me: I'm currently working as the lead ML engineer at a fintech startup in Singapore called Sleek.

Apart from that, I'm also an instructor at LinkedIn and Udacity, where I've created a few courses. And over the last few months I've been building a tool called Speaker Scribe that you can use to create talk proposals, workshop proposals, and things like that.

In the process I'm hoping to learn a lot about deploying LLMs, and about what good standards and best practices for deploying LLMs look like. I'm also writing about what I'm learning along the way at tinyml.com. So I want to quickly cover three things.

One, the challenges we're facing with deploying LLM applications; two, the solutions that have worked for us so far; and three, I'll end by talking a bit about the kinds of useful products you can build with LLMs.

I want to start with our biggest challenge, which is using LLM APIs. One of the biggest problems is the lack of SLAs or commitments on endpoint uptimes and latencies from API providers. For instance, when we were building our application, there were inconsistencies in when we would get a result back from the API. The problem with delays is that a lot of the tools people are building with LLMs are for creative applications, where if you don't respond quickly you disrupt the flow state the user may be in. Having to wait even a few extra seconds can be painful.
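
Without an SLA from the provider, the practical fallback is client-side: a request timeout plus retries with exponential backoff. A minimal sketch; `call_llm` below is a hypothetical stand-in for whatever API client you use:

```python
import random
import time

def call_with_retries(prompt, call_llm, max_attempts=4, timeout=15):
    """Retry a flaky LLM API call with exponential backoff and jitter.

    `call_llm` is a hypothetical client function that accepts a prompt
    and a timeout in seconds, and raises on failure or timeout.
    """
    for attempt in range(max_attempts):
        try:
            return call_llm(prompt, timeout=timeout)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Backoff of 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
```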

On top of that, the deprecation of APIs adds fuel to the fire. When we started building our application on the text-davinci-002 endpoint, it was working really well: we had created a lot of prompts that worked, and we had a really good application. But then OpenAI suddenly deprecated it, and when we moved to the new, updated endpoint, it took a lot of time to adjust the application to make sure it was working at the same level.
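
One way to soften that migration pain, sketched here with hypothetical names: keep the model identifier and its settings in a single config object, so a deprecation becomes a one-line change followed by replaying your saved prompts to check for regressions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Single source of truth for which endpoint the app calls."""
    model: str = "text-davinci-002"  # the endpoint the talk says was deprecated
    temperature: float = 0.7
    max_tokens: int = 512

# When the provider deprecates an endpoint, change it in one place,
# then replay your saved prompts against the new model to compare quality.
CONFIG = ModelConfig(model="text-davinci-003")  # illustrative successor
```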

And finally, there's the issue of API costs. Even with the cost reductions OpenAI announced yesterday, as your application grows in complexity and scope your prompts grow too, and your API costs tend to skyrocket. So you can't really rely on these external APIs, because of the costs; instead, you have to move to alternatives like fine-tuning, or maybe even training your own custom model.
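
Since pricing is per token, it helps to measure prompt growth directly rather than discover it on the invoice. A sketch using the `tiktoken` package; the per-token rate is a placeholder, so check current pricing:

```python
import tiktoken

def estimate_prompt_cost(prompt: str, model: str = "gpt-3.5-turbo",
                         usd_per_1k_tokens: float = 0.0015) -> float:
    """Rough prompt-side cost estimate; completion tokens cost extra."""
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1000 * usd_per_1k_tokens

print(estimate_prompt_cost("Write a three-sentence talk proposal about MLOps."))
```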

The second challenge we had has to do with how we structure prompts, and how we parse and serve the model's output to our clients. Prompt engineering is not really an exact science, and there are a lot of issues with creating prompts. For instance, something as small as an apostrophe, a comma, or other punctuation has regularly broken prompts for us.

So it's really hard to make a prompt that works and then keep maintaining it. Hallucinations are also an issue, and the problem with hallucinations, especially now, is that they're really hard to spot. To guard against them you need evaluation metrics for the outputs, and there aren't really a lot of those; it's also hard to evaluate outputs at any scale. And because it's hard to evaluate your outputs, you can't really trust what the model is giving you. If you can't trust it, it's very hard to serve it to a client, because you don't know if it contains misinformation.
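
Even without real evaluation metrics, cheap sanity checks before serving catch the obviously unusable outputs. A minimal sketch of that kind of guardrail; the specific checks and thresholds are illustrative, and this is a filter, not a hallucination detector:

```python
def passes_basic_checks(output: str, context: str, min_len: int = 40) -> bool:
    """Cheap pre-serving guardrails for a generated proposal."""
    if len(output) < min_len:            # truncated or empty generation
        return False
    if "as an ai" in output.lower():     # refusal boilerplate leaked through
        return False
    # Weak grounding check: require some word overlap with the context
    # the model was given; zero overlap is a red flag.
    context_words = set(context.lower().split())
    output_words = set(output.lower().split())
    overlap = len(context_words & output_words) / max(len(output_words), 1)
    return overlap > 0.1
```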

Another issue, especially with all the updates OpenAI and other LLM providers keep making to their models, is that, at least in what we've seen, the outputs are sometimes not creative enough. When that happens, you don't want to serve near-identical outputs to different people who ask you to create similar proposals. So you have to engineer your prompts so that the outputs stay creative, and play around with some of the sampling parameters OpenAI provides. And finally, you have to deal with bias as well as incorrect data.
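
On the sampling-parameter point: with the pre-1.0 `openai` Python package that was current around the time of this talk, creativity is mostly steered through `temperature` and `top_p`. The values below are illustrative, not a recommendation:

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "Draft a talk proposal on LLM reliability."}],
    temperature=1.1,        # higher values push toward more varied outputs
    top_p=0.95,             # nucleus sampling cutoff
    presence_penalty=0.6,   # discourage repeating the same themes
)
print(response["choices"][0]["message"]["content"])
```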

So what has worked for us? With prompts, what we've seen work really well is providing context to the LLM, for instance with few-shot prompting; that helps a lot. This is a demo of Speaker Scribe: when you click "regenerate proposal", ideally you should also provide the previous output as context. That helps the LLM be more creative and produce better outputs.
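
A sketch of what passing the previous output back as context can look like; the template, function, and variable names are hypothetical:

```python
def build_regenerate_prompt(talk_abstract: str, previous_proposal: str,
                            examples: list[str]) -> str:
    """Few-shot prompt that shows prior examples plus the last output,
    and explicitly asks for a different take."""
    shots = "\n\n".join(f"Example proposal:\n{ex}" for ex in examples)
    return (
        f"{shots}\n\n"
        f"Talk abstract:\n{talk_abstract}\n\n"
        f"Previous proposal (do NOT repeat its structure or phrasing):\n"
        f"{previous_proposal}\n\n"
        f"Write a new proposal that takes a noticeably different angle."
    )
```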

There are more complex prompt-engineering techniques, but what we've seen is that anything beyond chain-of-thought makes the prompts, and the outputs, really large, and that increases your API costs as well. And regarding prompts: prompts are now your IP. If you have a good prompt, you should save it and protect it, because it's your IP. You should also version your prompts, so you can see how changes affect the output; maybe that leads to something like "prompt ops" later on as well.
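
Versioning can start as simply as never editing a prompt in place; a sketch:

```python
# A registry of immutable, versioned prompt templates. Changing a prompt
# means adding a new version, so every output can be traced back to the
# exact template that produced it.
PROMPTS = {
    ("proposal", "v1"): "Write a talk proposal about {topic}.",
    ("proposal", "v2"): ("Write a concise, punchy talk proposal about {topic}. "
                         "Avoid buzzwords."),
}

def render(name: str, version: str, **kwargs) -> str:
    return PROMPTS[(name, version)].format(**kwargs)

print(render("proposal", "v2", topic="LLM cost control"))
```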

One way to provide context to LLMs, and to reduce costs, is to use vector databases. By saving previous outputs, caching them, you improve latency, and by not having to call the API you save on API costs too. Vector databases also help by finding context you can provide to the LLM, both to improve its output and to reduce hallucination issues.
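
A minimal sketch of the caching idea: before calling the API, look for a previously answered prompt that is close enough in embedding space. `embed` and `call_llm` are hypothetical stand-ins, and the similarity threshold is illustrative:

```python
import numpy as np

cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def cached_llm_call(prompt: str, embed, call_llm, threshold: float = 0.92) -> str:
    """Semantic cache: reuse a stored response when a past prompt is
    close enough in embedding space, otherwise call the API and store."""
    query = embed(prompt)  # hypothetical: text -> unit-norm vector
    for vec, response in cache:
        if float(np.dot(query, vec)) > threshold:  # cosine sim on unit vectors
            return response  # cache hit: no API call, no API latency
    response = call_llm(prompt)
    cache.append((query, response))
    return response
```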

For instance, for the Speaker Scribe product, we created a database of past talks and workshops that people submitted. Whenever someone asks us to create a new proposal, we use the vector database to pull in context from the talks in our database to improve performance.
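
The retrieval side is the same machinery pointed at a corpus instead of a cache. A sketch, again with a hypothetical `embed` function:

```python
import numpy as np

def top_k_context(query: str, talks: list[str], embed, k: int = 3) -> str:
    """Return the k past talks closest to the query in embedding space,
    joined so they can be dropped into the prompt as context."""
    query_vec = embed(query)
    ranked = sorted(talks, key=lambda t: -float(np.dot(embed(t), query_vec)))
    return "\n\n".join(ranked[:k])

# prompt = f"Relevant past talks:\n{top_k_context(q, db, embed)}\n\nTask: ..."
```

In a real setup the corpus embeddings would be precomputed and stored in the vector database; re-embedding every talk per query, as above, is only for brevity.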

Finally, as we start to build more and more complex products with LLMs: if you look at the comic here, I think right now we're at about stage two or stage three of complexity in building products with LLMs. The problem with using chains is that chains tend to fail, especially when they're really long. What worked for us is keeping chains short. Long chains also increase cost, latency, and complexity, which is not great.
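
One reason long chains fail is that per-link failure rates compound: three links that each succeed 90% of the time give roughly a 73% end-to-end success rate. A sketch of the keep-it-short pattern, with a validation gate between links; `call_llm` and `validate` are hypothetical stand-ins:

```python
def run_chain(task: str, call_llm, validate) -> str:
    """Two-link chain that stops early when an intermediate output fails."""
    outline = call_llm(f"Outline a proposal for: {task}")
    if not validate(outline):
        raise ValueError("link 1 failed validation; stopping before link 2")
    return call_llm(f"Expand this outline into a full proposal:\n{outline}")
```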

And finally, it helps to not use agents. Agents right now don't really work that well; they're not reliable or reproducible. In our experience, agents just aren't there yet. So, bringing all of that back together: how do you build a good, useful LLM product?

What your LLM product needs to do is make sure people use it instead of ChatGPT. How do you do that? First of all, make it easier to access; that's something Grammarly has done with Grammarly Go, and it's similar to how we built Speaker Scribe as a Chrome extension: just reduce the effort it takes to get to the tool. You can also make your application better than ChatGPT by including domain knowledge.

That's what we did with Speaker Scribe, and it's what we're doing at Sleek: we have our own customer data, so we can use it as context to improve the output of our models. And finally, you should provide additional services on top of the LLM output that ChatGPT gives you. One interesting thing I've seen is people using LLMs to create slide decks; for decks you need images, which ChatGPT doesn't do right now. I think that's pretty much it. Thank you so much. You can check out Speaker Scribe, and also follow our...

thank you so much. Awesome. Thank you so much.
