MLOps Community

Emerging Patterns for LLMs in Production

Posted Apr 27, 2023 | Views 1.9K
# LLM
# LLM in Production
# In-Stealth
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
SPEAKERS
Willem Pienaar
Co-Founder & CTO @ Cleric

Willem Pienaar, CTO of Cleric, is a builder with a focus on LLM agents, MLOps, and open-source tooling. He is the creator of Feast, an open-source feature store, and contributed to the creation of both the feature store and MLOps categories.

Before starting Cleric, Willem led the open-source engineering team at Tecton and established the ML platform team at Gojek, where he built high-scale ML systems for the Southeast Asian Decacorn.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

As the landscape of large language models (LLMs) advances at an unprecedented rate, novel techniques are constantly emerging to make LLMs faster, safer, and more reliable in production. This talk explores some of the latest patterns that builders have adopted when integrating LLMs into their products.

TRANSCRIPT

Link to slides

Right, I'm going to be talking a bit about some of the emerging patterns we've seen with large language models in production.

But first, about me. My background is in production ML: I built a bunch of production ML systems at a company called Gojek, where we also doubled down on ML tools, frameworks, and platforms. One of those was Feast, an open-source feature store project that we built, open sourced, and saw adopted by a bunch of companies like Shopify, Twitter, Robinhood, and others. I've been working with teams building ML tools and platforms and helping them do it in a reliable way, and that's why the generative AI space is interesting to me: because of the challenges we're seeing today.

So what are some of the unique challenges we see today with generative AI? The key ones we're seeing are around reliability, cost, latency, and safety, and the numbers on the screen are from an LLM-in-production survey: they're basically the share of teams that responded saying this is an egregious or critical problem for them. Reliability is obviously one, because you're dealing with unstructured, textual output, and coercing that into something your application can use is hard. Cost is a big one: if you're an AI builder today, you're either asking users to provide an API token or you're absorbing a lot of the cost yourself, and that's a challenge a lot of product builders face. We've also deployed a bunch of probes globally (I personally did that) and have been monitoring the providers' API endpoints, and they're really, really slow: on the order of 400, 500, 600 milliseconds.
Sometimes they even spike to something like 1,800 or 2,000 milliseconds for a completion. So how do you build a product around that? Especially if you have to do many round trips, it's very challenging. And finally, safety is hard: folks are inputting private data, and you're vulnerable to prompt injection attacks and all kinds of vulnerabilities that are new and unique. So it's challenging building on LLMs today, and that's an opportunity for us to break things down a little bit.

So how do you use LLMs effectively? The same rules apply as with structured ML, really: start simple. Start with basic prompting, then include some examples to do few-shot prompting (a small sketch of this follows below). Start introducing external or exogenous data sources, using LangChain or LlamaIndex, and composing workflows, so you incrementally increase your accuracy over time. Yes, you will increase your cost and latency a little bit, but often that's okay if your accuracy and reliability improve. Soon you get to a stage of refinement where you do things like chain-of-thought prompting, tool selection, and calling third-party APIs like Wolfram Alpha to give you more reliable responses. I think most folks stop at this stage, but if you want to take things further, you can also start fine-tuning hosted models, or using open-source models and training them from scratch. That's not something we'll discuss today, though.
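As a concrete illustration of that "start simple, then add a few examples" step, here is a minimal few-shot prompting sketch. It assumes the OpenAI Python client and an invented review-labelling task; the model name, system prompt, and examples are illustrative, not something from the talk.

```python
# Hypothetical few-shot classifier: a handful of labelled examples are placed
# in the prompt before the real input; nothing is fine-tuned.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = [
    ("The driver arrived early and the food was still hot.", "positive"),
    ("App crashed twice before I could pay.", "negative"),
]

def classify(review: str) -> str:
    messages = [{
        "role": "system",
        "content": "Classify each review as 'positive' or 'negative'. Reply with one word.",
    }]
    for text, label in FEW_SHOT_EXAMPLES:        # the few-shot examples
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": review})
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0
    )
    return resp.choices[0].message.content.strip().lower()

print(classify("Courier was rude and the order was wrong."))
```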
In terms of the techniques we're seeing out there in the wild, I think one of the key ones that some builders have been introducing and kind of spearheading is adding structure to your responses. You can ask a model to provide a typed response, for example in a TypeScript schema format, and it will do so, and you can encourage it to be more reliable by giving examples, asking it to take on a persona, and reminding it: hey, always return JSON. You can even, unfortunately, threaten the model sometimes and it will often be more accurate in its response. And if it fails, you just re-ask, and you can keep re-asking until, I guess, it starts to affect your UX. You can increase the temperature, or you can start with a more cost-effective model and then ramp up from, say, GPT-3.5 to GPT-4. This lets you validate the output in a structured way instead of dealing with plain text as the output.
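A rough sketch of what that structure-plus-re-ask loop can look like, assuming the OpenAI Python client and Pydantic for validation. The `Invoice` schema, the retry count, the temperature bump, and the ramp from GPT-3.5 to GPT-4 are illustrative assumptions, not a prescribed recipe.

```python
# Hypothetical sketch: ask for JSON, validate it against a schema, and re-ask on
# failure, escalating to a stronger model on the final attempt.
import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

client = OpenAI()

class Invoice(BaseModel):      # the structure we want back
    vendor: str
    total: float
    currency: str

PROMPT = (
    "Extract the invoice fields from the text below. "
    "Always return JSON with exactly the keys vendor, total, currency.\n\n{text}"
)

def extract(text: str, max_attempts: int = 3) -> Invoice:
    models = ["gpt-3.5-turbo", "gpt-3.5-turbo", "gpt-4"]   # ramp up on the last try
    last_error = ""
    for attempt in range(max_attempts):
        content = PROMPT.format(text=text)
        if last_error:                                     # re-ask, explaining what was wrong
            content += f"\n\nYour previous answer was invalid ({last_error}). Return only valid JSON."
        resp = client.chat.completions.create(
            model=models[min(attempt, len(models) - 1)],
            messages=[{"role": "user", "content": content}],
            temperature=min(1.0, 0.3 * attempt),           # optionally raise temperature on retries
        )
        raw = resp.choices[0].message.content
        try:
            return Invoice(**json.loads(raw))              # structured validation, not free text
        except (json.JSONDecodeError, ValidationError) as err:
            last_error = str(err)[:200]
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")
```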
Another technique we're seeing applied more, even in production settings, is self-refinement. Normally you give the model a prompt and you get a completion, but you can also say in the prompt: hey, review what you've just given me, score yourself, and refine the response. You can even ask the model to do this multiple times, so it's literally scoring itself and improving its own output. This has been shown to be surprisingly effective, especially for models like GPT-4. So you can say: you've written a tweet for me; now rate the tweet, make it more engaging, and improve it. That's been a surprisingly effective technique, and it has outperformed baselines in a lot of use cases.
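A minimal sketch of that self-refinement loop, again assuming the OpenAI Python client; the critique prompt and the fixed two rounds of refinement are arbitrary illustrative choices.

```python
# Hypothetical self-refinement loop: the model drafts an answer, then is asked
# to critique and rewrite its own output for a fixed number of rounds.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def self_refine(task: str, rounds: int = 2) -> str:
    draft = complete(task)
    for _ in range(rounds):
        critique = complete(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\n"
            "Rate this answer from 1 to 10 and list its specific weaknesses."
        )
        draft = complete(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the answer so it addresses every weakness in the critique."
        )
    return draft

print(self_refine("Write a tweet announcing our new feature store release. Make it engaging."))
```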
Another technique we're seeing out there in the wild is contextual compression. In a lot of cases you're calling external data sources, so you're using LangChain or LlamaIndex and enriching your context window with fresh data that's relevant to answering a question or performing some task. But the way people do that today is often very unstructured, and one of the ways you can improve the density of information is compression. Let's say your question is: who scored the first goal in the FIFA World Cup? You can call a bunch of external sources, compress that information based on the question you're asking, even drop some of the retrieved passages, and only put the relevant ones into the context window. This often gives you two or three times the amount of information you can fit into the context window, and that really improves the quality of the output completion.

So now you've got more accuracy and a little more structure, but what about latency? One of the techniques we're now seeing adopted is semantic caching (a rough sketch follows below). The idea is that with a normal cache, if you're just caching prompts and completions, you often don't get a very high cache hit rate, because one character of difference and suddenly the cache misses. Instead, you can use a vector DB for cache lookups, so you cache based on a similarity distance. The problem is that in a lot of cases you don't really have a true hit: if the prompts are slightly off, maybe the hit isn't a real hit, so you need some kind of evaluation function. What folks are doing is using an LLM again (an LLM for everything) to evaluate the cached response and judge whether it was really a cache hit. That works really well when the completion you're returning is expensive. In many cases you're calling an LLM three, four, five, six times, especially if you're using chain-of-thought reasoning or any kind of decomposition, and it's expensive to compute those completions, so even if you have to call an LLM to validate a cache hit, it's still cheaper than recomputing everything.

This technique also works really well for tool selection from a set of options. If you're choosing a tool, like an API you're going to call, or Wolfram Alpha, or something similar, and you have high confidence in a single tool and no confidence in any other, you don't even need to call an LLM; you can just use a distance metric to evaluate the cache hit. So this is a very good technique for reducing latency, above and beyond a normal cache.
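The semantic-caching sketch referenced above: prompts are embedded and compared by cosine similarity, and a cheap LLM call acts as the judge for candidate hits. The similarity threshold, embedding model, and in-memory list are assumptions for illustration; a production setup would typically use a vector database such as Redis instead.

```python
# Toy semantic cache: completions are keyed by prompt embeddings and reused when
# a new prompt is close enough; an LLM "judge" confirms that the hit is genuine.
import numpy as np
from openai import OpenAI

client = OpenAI()
SIM_THRESHOLD = 0.92                             # illustrative cut-off, tuned per workload
_cache: list[tuple[np.ndarray, str, str]] = []   # (prompt embedding, prompt, completion)

def _embed(text: str) -> np.ndarray:
    emb = client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding
    vec = np.array(emb)
    return vec / np.linalg.norm(vec)             # normalise so a dot product is cosine similarity

def _judge_hit(new_prompt: str, cached_prompt: str, cached_completion: str) -> bool:
    # Cheap LLM call deciding whether the cached completion truly answers the new prompt.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": (
            f"Prompt A: {cached_prompt}\nPrompt B: {new_prompt}\n"
            f"Answer to A: {cached_completion}\n"
            "Does this answer also fully answer prompt B? Reply yes or no."
        )}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def cached_completion(prompt: str) -> str:
    emb = _embed(prompt)
    for cached_emb, cached_prompt, completion in _cache:
        if float(emb @ cached_emb) >= SIM_THRESHOLD and _judge_hit(prompt, cached_prompt, completion):
            return completion                    # hit: skip the expensive completion (or chain of them)
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    completion = resp.choices[0].message.content
    _cache.append((emb, prompt, completion))
    return completion
```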
And then finally, one of the interesting things I realized from the Sparks of AGI paper is that large language models like GPT-4 are extremely good at PII and prompt injection detection; 3.5 is also good. These large language models are effective at identifying leakage of information, outperforming even baseline purpose-built tools at this task, and at catching prompt injections. In fact, a little later today we're going to be playing one of those games where you can actually try this out.

So yeah, if you're building with LLMs today and you're trying to make your system more reliable and faster and improve the UX, reach out.
