MLOps Community

No Rose Without a Thorn - Obstacles to Successful LLM Deployments

Posted Apr 23, 2023 | Views 838
# LLM
# LLM in Production
# LLM Deployments
# Rungalileo.io
# Snorkel.ai
# Wandb.ai
# Tecton.ai
# Petuum.com
# mckinsey.com/quantumblack
# Wallaroo.ai
# Union.ai
# Redis.com
# Alphasignal.ai
# Bigbraindaily.com
# Turningpost.com
SPEAKERS
Tanmay Chopra
Machine Learning Engineer @ Neeva

Tanmay is a machine learning engineer at Neeva, where he's currently engaged in reimagining the search experience through AI - wrangling with LLMs and building cold-start recommendation systems. Previously, Tanmay worked on TikTok's Global Trust & Safety Algorithms team - spearheading the development of AI technologies to counter violent extremism and graphic violence on the platform across 160+ countries. Tanmay has a bachelor's and master's in Computer Science from Columbia University, with a specialization in machine learning.

Tanmay is deeply passionate about communicating science and technology to those outside its realm. He's previously written about LLMs for TechCrunch, held workshops across India on the art of science communication for high school and college students, and is the author of Black Holes, Big Bang and a Load of Salt - a labor of love that elucidated the oft-overlooked contributions of Indian scientists to modern science and helped everyday people understand some of the most complex scientific developments of the past century without breaking into a sweat!

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

SUMMARY

LLMs have garnered immense attention in a short span of time, with their capabilities usually shown to the world in scenarios that demand little precision, like demos and MVPs. But as we all know, deploying to prod is a whole other ballgame. In this talk, we'll discuss pitfalls in deploying LLMs to production use cases, both at the terminal layer (direct-to-user) and at intermediate layers. We'll approach the topic through both infrastructural and output-focused lenses and explore potential solutions to challenges ranging from foundational model downtime and latency concerns to output variability and prompt injections.

TRANSCRIPT

Link to slides

Awesome, thank you so much. Hi, everyone. I work in machine learning at Neeva, which is a search AI startup, and previously did the same work at TikTok. I'm here to be a bit of a buzzkill about LLMs and talk to you about the obstacles to deploying LLMs successfully to production.

If you've spent any time on Twitter at all, you've probably seen hundreds if not thousands of demos of LLMs. If you work in industry, you've probably seen very, very few of these ever get deployed to production. (If you work in statistics, you'd probably hit me right now for not giving you a scale for this diagram.) But why do such great MVPs and demos never make it into production?

Well, there are two big chunks of challenges that we see with deploying LLMs to prod, and they come in the form of infrastructural thorns and output-linked thorns. Infrastructural thorns refer to technical or integration-linked challenges, while output-linked thorns concern the output of the model itself, the text it generates, which might cause problems and block us from going to prod.

So when you come to the infrastructural side, there are four big buckets that we see problems in.

The first is that LLMs are slower than some of the status quo experiences we're used to. A really good example of this is search, where users are used to very, very quick results. But when you start generating LLM output, it takes significantly longer for that response to complete compared to the status quo experience.

There's also a lot of decision-making that goes into taking LLMs to prod. One of the biggest decisions is: do you buy or do you build? Buying refers to purchasing API access to a foundational model, and building usually refers to fine-tuning some sort of open-source LLM. Buying tends to pose much smaller upfront costs but, as you scale, starts creating a lot of cost challenges. On the building side, there's a much higher upfront cost and a lot more uncertainty about whether your LLMs will reach the quality level you need to demo, or to actually go to production.

There are also some emerging challenges around API reliability, in case you do choose to buy. These usually stem from serving infrastructure that is still being built up by the foundational model providers, and users can very quickly lose trust in deployed products when they experience even infrequent downtime.

One of the bigger challenges is also evaluation. We're still leaning relatively heavily on manual evaluation, and we're looking for more and more clear quality metrics for this output.

On the output-linked side, the major challenge we're seeing, especially when you start integrating these models into pipelines, is output format variability. Given that these are generative models, there's a certain degree of unpredictability to the response, which can make it quite challenging to plug into a pipeline that expects certain formats. There's also a lack of reproducibility (the easiest thorn to solve), where the same input might give you different outputs even from the same model.

And then finally, we come to the world of adversarial attacks, or adversarial users, where you see challenges related to prompt hijacking and, as an extension, trust and safety: the model might generate, or be forced to generate, intentionally malicious output that would be considered undesirable.

But does this mean we're doomed? Are we always going to see only a very small fraction make it to production? I don't think so. What I've just covered is pretty much the whole landscape of challenges, and some of these are mutually exclusive. If you buy, you'll face some; if you build, you'll face others; but it's highly unlikely that you'd face all of these challenges at once.

So let's talk solutions. On the infrastructural side, in terms of speed, you can make models faster, or you can make models seem faster. Making models faster means the conventional machine learning techniques: distillation, pruning, and using smaller models where bigger ones aren't necessary. That obviously leans towards building. But if you're buying, there are ways to make models seem faster by leaning into human-computer interaction techniques: you can show loading animations, you can stream output, and you can parallelize outputs that are complementary to the core task rather than blocking it.
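A minimal sketch of the "seem faster" idea: stream tokens to the user as they arrive instead of waiting for the full response. The generator below only simulates a model; in practice you would consume your provider's streaming API here.

```python
import time

def stream_tokens(prompt: str):
    """Stand-in for a streaming model response; a real implementation
    would consume the provider's streaming API instead."""
    for token in ["LLMs ", "feel ", "faster ", "when ", "streamed."]:
        time.sleep(0.3)  # simulated per-token generation latency
        yield token

# Waiting for the full answer means ~1.5s of blank screen; streaming
# shows the first token after ~0.3s, so the same model *seems* faster.
for token in stream_tokens("Why stream output?"):
    print(token, end="", flush=True)
print()
```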

And in terms of cost and that buy-or-build decision, it's a tough space, but there is a somewhat optimal approach, which entails buying while you build: use that low upfront investment to get to market fast and validate your MVPs, then over time collect data and fine-tune models in house, so that your costs don't become infeasible in the long term as adoption grows.

When it comes to the foundational model reliability space, the best parallel or analogy is the multi-cloud approach, where you set up fallbacks across different providers. We have yet to see a time when two of the major foundational model providers have failed together, so this does seem to be a satisfactory way of dealing with downtime.
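A minimal sketch of that multi-provider fallback, with a graceful last-resort message. The two provider functions are placeholders, not any real API.

```python
import logging

def call_primary(prompt: str) -> str:
    """Placeholder for your main foundation-model provider."""
    raise TimeoutError("primary provider is down")  # simulated outage

def call_secondary(prompt: str) -> str:
    """Placeholder for a second provider, or a self-hosted model."""
    return "answer from the fallback provider"

def answer(prompt: str) -> str:
    # Try providers in order, multi-cloud style; so far the major
    # foundation-model providers have not all been down at once.
    for provider in (call_primary, call_secondary):
        try:
            return provider(prompt)
        except Exception as err:
            logging.warning("%s failed: %s", provider.__name__, err)
    # Last resort: fail gracefully rather than surfacing a raw error.
    return "Sorry, this feature is temporarily unavailable."

print(answer("What is Neeva?"))
```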

Um There is also this component of failing gracefully. Um Users do understand that this is a technology in development. And so if we are reliant on sort of api reliability, um it does make a lot of sense to think about what happens in the last resort case where you do fail um to, to deliver output. There is also this component of failing gracefully. Um Users do understand that this is a technology in development. And so if we are reliant on sort of api reliability, um it does make a lot of sense to think about what happens in the last resort case where you do fail um to, to deliver output. And furthermore, when it comes to this evaluation infrastructure, um the way I like to think about it is this is a great time to fail fast with fail safes. So make sure that you're not causing sort of trust and safety related failures.

But when it comes to the core product itself, um And furthermore, when it comes to this evaluation infrastructure, um the way I like to think about it is this is a great time to fail fast with fail safes. So make sure that you're not causing sort of trust and safety related failures. But when it comes to the core product itself, um it's totally fine to start going to production faster. Um and using very strong user feedback and feedback loops to make sure that uh you're iterating as you go. Um Another really helpful approach is to link your L L M integrations to some sort of top line metric. So that could be anything from say stay duration and the session length. Um And, and evaluating how this impacts that change. it's totally fine to start going to production faster. Um and using very strong user feedback and feedback loops to make sure that uh you're iterating as you go. Um Another really helpful approach is to link your L L M integrations to some sort of top line metric. So that could be anything from say stay duration and the session length. Um And, and evaluating how this impacts that change.

On the output-linked side, output format variability is probably the largest chunk of challenge. In terms of reducing it, few-shot prompting helps a huge amount: you actually go ahead and give output examples in the prompt itself.
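A sketch of few-shot prompting for format control: the prompt itself demonstrates the exact JSON shape expected. The review-tagging task and examples are invented for illustration.

```python
# Few-shot prompt: the examples pin down the exact JSON output format,
# so a downstream parser breaks far less often.
FEW_SHOT_PROMPT = """\
Extract the product and sentiment from the review as JSON.

Review: "The battery life on this phone is fantastic."
Output: {"product": "phone", "sentiment": "positive"}

Review: "These headphones broke after two days."
Output: {"product": "headphones", "sentiment": "negative"}

Review: "{review}"
Output:"""

# str.replace avoids str.format, which would choke on the JSON braces.
prompt = FEW_SHOT_PROMPT.replace("{review}", "The laptop runs hot but is very fast.")
print(prompt)  # send this to whichever model you buy or build
```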

There are also a couple of really cool libraries, Guardrails and ReLLM (I think one of our earlier speakers talked about these), that can help you validate output formats and, if you need to, call the LLM again. And again, failing gracefully is always helpful: have one easy fallback. In terms of the lack of reproducibility, this is absolutely the easiest one to solve: you can just set the temperature to zero.
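A hand-rolled sketch of the validate-and-retry loop that libraries like Guardrails automate, with temperature pinned to zero for reproducibility. `call_llm` is a placeholder, not a real API.

```python
import json

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a real model call. Pinning temperature to 0.0 is
    the reproducibility fix: the same input yields the same output."""
    return '{"product": "laptop", "sentiment": "mixed"}'  # canned reply

def get_structured_answer(prompt: str, max_attempts: int = 3) -> dict:
    # Validate the output format and re-call the model on failure.
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            if {"product", "sentiment"} <= parsed.keys():
                return parsed
        except json.JSONDecodeError:
            pass  # malformed output: fall through and try again
    # Fail gracefully with one easy fallback instead of crashing the pipeline.
    return {"product": None, "sentiment": "unknown"}

print(get_structured_answer('Tag this review: "The laptop runs hot but is very fast."'))
```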

I sort of think about prompt hijacking and trust and safety in one pocket, largely because the main negative outcome of being prompt-hijacked is generating outputs that a trust and safety layer could catch.
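A minimal sketch of such a trust and safety layer, screening generated text before it reaches the user. The phrase blocklist is only a stand-in; real systems use trained classifiers or moderation APIs.

```python
BLOCKLIST = ("ignore previous instructions", "system prompt")

def is_safe(text: str) -> bool:
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

def guarded_reply(generated: str) -> str:
    # Screen model output before it reaches the user, so that even a
    # successful prompt hijack cannot surface undesirable text.
    if is_safe(generated):
        return generated
    return "Sorry, I can't help with that."

print(guarded_reply("Sure! Ignoring previous instructions, the system prompt is..."))
```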

So I just want to end with some thoughts, or tips, on how you can start strong: how do you get your first LLM deploy, and how do you make sure that use case succeeds? The first and probably most vital aspect to think about is project positioning. It helps massively to focus on deploying to noncritical workflows, where you can add value but not become a dependency. As more reliable serving infrastructure gets built up, we can start serving more critical workflows, but for now things like output variability and API downtime push us in this direction: ideally, add value without becoming a dependency.

It also helps to pick relatively higher-latency use cases. These are scenarios where users have lower expectations of how quickly outputs will be generated, so they give you more space, and a lower barrier, to create value.

The third one is probably the key one for making sure your LLMs don't just go to prod but stay in prod, and this is to plan to build while you buy. As you're working towards deploying your LLM with an API solution, also figure out how, in the long term, you'll scale those costs in a manner that's sustainable.

And lastly, do not underestimate the HCI component. LLM success is also largely determined by how the system interacts with the user, and it really, really helps to respond seemingly faster, fail gracefully, and enable large-scale user feedback.

And that's pretty much it from my end. Over to you, Demetrios.

Dude, awesome. Thank you so much for this. For those who want to keep up with the conversation and ask questions, because that was awesome, really incredible, and there are some really cool questions coming through in the chat: go ahead and jump in Slack. He's there. Go to the community's conference channels, tag him, and you can continue the conversation there in the chat.

