
Beyond the Hype: Monitoring LLMs in Production

Posted Jun 20, 2023 | Views 618
# LLM in Production
# Monitoring
SPEAKER
Claire Longo
Head of ML Solutions Engineering @ Arize AI

Claire Longo is a Machine Learning Engineer and Data Scientist who has focused her career on building and scaling central ML engineering teams. She is currently Head of ML Solutions Engineering at Arize, leading post-sales machine learning solutions engineering efforts and assisting Arize customers in integrating ML Observability into their MLOps infrastructure and workflows. Prior to Arize, she led the MLOps team at Opendoor and established a new machine learning team to deliver data-driven products at Twilio.

SUMMARY

Here’s the truth: troubleshooting models based on unstructured data is notoriously difficult. The measures typically used for drift in tabular data do not extend to unstructured data. The general challenge with measuring unstructured data drift is that you need to understand the change in relationships inside the unstructured data itself. In short, you need to understand the data in a deeper way before you can understand drift and performance degradation.

In this presentation, Claire Longo shares findings from research on ways to measure vector/embedding drift for image and language models. Drawing on lessons learned from testing different approaches (including Euclidean and cosine distance) across billions of streams and use cases, she dives into how to detect whether two unstructured language datasets are different and, if so, how to understand that difference using techniques such as UMAP.

TRANSCRIPT

Introduction

Our next speaker I'm so pumped about. Claire, welcome to the stage. Thank you for your patience and for joining us. I know you're a rockstar at Arize and also in the Denver community. You've been really active there, so I'm very excited to hear your talk today. Yeah, great to meet everyone.

How's my audio? Everything coming through okay? Yeah. Sounds good. All right, and here are your slides. Perfect. All right. Hello everyone, I'm Claire Longo. I'm currently Head of ML Solutions Engineering for Customer Success at Arize. This is an absolutely amazing job; it's one of my favorites that I've had.

And it's because I have this opportunity to collaborate with all these different customers. We're really working with some of the best machine learning teams in the world. Through this role, I have a window into how our customers are thinking about monitoring machine learning models, and I also see how people are starting to think about monitoring LLMs.

I consult with customers to really help them integrate Arize as an ML observability solution. And my background is in data science. I used to be a data scientist. I trained models hands-on. I got frustrated that I couldn't get them into production, so I went and learned to be an engineer, and I became passionate about MLOps.

So that's really my focus. I've been an MLOps person for a while, and now I'm hearing this new term, LLMOps. Here, let me get to my slides. Yeah, so this is my favorite meme right now. It really resonated with me because I've heard a lot of different terms here, being an MLOps person who is really passionate about applying DevOps principles to scaling machine learning models in production.

I've heard a lot of different terms. I've heard AIOps, I've heard DLOps, which is deep learning ops, and now I'm hearing LLMOps, and honestly, I'm just having a lot of trouble saying this word. It is not easy to pronounce. I've heard some other people articulate it very well, but to me this is a tongue twister.

So I'd love it if we could just stick with MLOps. But as I really dive into LLMs and think about how to put these models into production, I do think there are some nuances in productionalizing generative AI that differentiate it enough from this old-school kind of tabular ML.

It's funny to think of machine learning as a little bit old school here, but there are nuances in productionalizing LLMs that I think are worth digging into. I don't know if it's really worth a whole new term that's this hard to articulate, but it's definitely worth diving into the differences in the concepts here and really looking at monitoring specifically through the lens of LLMs and what that might look like.

So what I'm gonna talk through today is taking my learnings from monitoring machine learning models in production and really thinking through how we might apply them to monitoring these LLMs in production systems.

I think the audience here has probably heard this already a few times, but I'll talk through how to choose the right tech stack for deploying an LLM, and the reason I'm gonna talk through that is that there are different monitoring opportunities, so I think it's worth highlighting.

So when deciding to implement an LLM in production, there are two things to consider. These are very simple things, but I do think it's worth calling them out here because they really will prepare you with the right data and information to choose your best tech stack. Number one is proprietary data.

Does your system require proprietary data? This is gonna help you decide if you should really be building things in house or if you're comfortable sending that data through the OpenAI API. And then number two is just: what is the value of applying generative AI here? I kind of laugh at this one because I think it was about 10 years ago, there was the big data craze, and I was asking myself the same question in terms of what is the value of applying data science here, or more traditional machine learning models.

And I would always try to compare it to a baseline: machine learning is exciting, it's the next big thing, we could totally use it to solve this problem, but what would a simple rules-based heuristic do, and what is the value of layering in more complexity? I think it's so important to answer the same questions with generative AI.

And then also thinking about the value: I like to think of it in terms of the cost of building the system as well as the ROI to the business. What business problem are we solving? How much do we think we're gonna impact the business in terms of ROI? That'll help us choose the right tech stack.

I think this is very similar to the previous talk, but I'll talk through it quickly just as a recap. How do we choose the right tech stack for deploying our LLM? Starting with the first row, the first option is kind of the MVP option. This is the out-of-the-box, single-endpoint prototype.

This might be something like using the OpenAI API. There's nothing else here: no prompt engineering or anything else. So although this is really easy to get off the ground and just test out an idea, and it's great for an MVP or proof of concept, it's not going to be personalized to your business use case in any way.
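To make that concrete, here is roughly what a single-endpoint prototype can look like in code. This is a minimal sketch, assuming the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and an example model name; it is not the exact setup discussed in the talk.

```python
# Minimal single-endpoint prototype: one raw call to a hosted LLM API.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer(user_query: str) -> str:
    # No prompt template, no context retrieval, no fine-tuning: the query goes straight through.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model name
        messages=[{"role": "user", "content": user_query}],
    )
    return response.choices[0].message.content


print(answer("How do I reset my password?"))
```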

So that's where we go to the next row. Here we're looking at prompt engineering. I think there's been a lot of analysis to show that there's a ton of value in figuring out the right prompt. So when we start using a prompt template, we will need to provide a light framework to really integrate that template into the system. There's a little bit more coding, and you're building out a real kind of software system here, but you are also automating and starting to tune things a little bit more to your business use case. The third line here is gonna be the custom vector database.

The way I like to think about this is through a bit of an example. Let's say you are building a chatbot that answers questions about a specific software tool. If you send those questions just to ChatGPT or some LLM API, it might not give you anything specific, but if you provide, along with that query, specific context about the software, such as the user documentation, you can start to get some really good responses that are very tuned to the use case you're trying to build the system for.

So that's an example of context data. Context data can be user documentation; it can really be any kind of set of text. It gets decomposed into vectors and stored in a database, and then these get integrated with your prompt. So it's a way to greatly enrich your prompt with the proper context data. Now we're tuning this even more to the business problem we're solving, and we have a lot of control over what goes into that context data.
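As a rough sketch of that retrieval step, here is one way documentation could be embedded and matched to a query. It assumes the sentence-transformers package; the model name, documents, and helper names are illustrative, not any specific product's API.

```python
# Sketch of the "custom vector database" idea: embed documentation chunks,
# then retrieve the most relevant chunks for a user query to use as context.
# Assumes the sentence-transformers package; model name and documents are examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "To reset your password, go to Settings > Security and click 'Reset password'.",
    "Invoices can be downloaded from the Account > Billing page.",
]
doc_vectors = encoder.encode(docs, normalize_embeddings=True)  # stands in for the vector store


def retrieve_context(query: str, k: int = 1) -> list[str]:
    query_vector = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]


print(retrieve_context("How do I change my password?"))
```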

So there's lots to optimize here, lots of opportunity. The fourth line here is fine-tuning. This one is expensive. I would go through these in order; we're increasing complexity as we go, but we're also probably increasing value, so I think that's worth considering.

With fine-tuning, you take your data and you're actually fine-tuning these public LLMs to your use case. The reason I say it's expensive is that these are large models. There are a lot of weights here, so this is obviously going to have some cost to it. The last one is gonna be your own foundation model, served on your own infrastructure. This is the most secure option. Your data doesn't have to leave your own infrastructure. You have full control over creating the LLM as well as the data that comes into it. Obviously, I think the benefits here are clear: you just have control over your system.

You don't have to send any data out of your infrastructure, but it is going to be the most complex to build because you're building from the ground up. So with that in mind, I want to talk through what a general LLM system might look like. In this example, we have a user query, and once we receive the query, we're pulling data out of a vector store.

That's your context data, maybe user documentation or something like that. Then we're using a prompt template to construct the best prompt we can with that user query and that context data, and we send that into the LLM. The LLM is gonna give us a response, and then if we're really fancy, we can start collecting user feedback.
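Putting those pieces together, a hypothetical version of that flow might look like the sketch below. The retrieve_context helper is the one sketched earlier, call_llm is a stand-in for whichever LLM API you use, and the template and field names are made up for illustration.

```python
# Hypothetical end-to-end flow: retrieve context, fill a prompt template,
# call the LLM, and keep a record that can later collect user feedback.
PROMPT_TEMPLATE = """You are a support assistant for our product.
Use the context below to answer the user's question.

Context:
{context}

Question:
{query}
"""


def handle_query(query: str, call_llm) -> dict:
    context_chunks = retrieve_context(query, k=2)  # pull context from the vector store
    prompt = PROMPT_TEMPLATE.format(context="\n".join(context_chunks), query=query)
    response = call_llm(prompt)  # stand-in for whatever LLM endpoint you use
    # Keep everything needed for monitoring; user_feedback gets filled in later
    # (for example, a thumbs up/down from the user).
    return {
        "query": query,
        "context": context_chunks,
        "response": response,
        "user_feedback": None,
    }
```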

I think there's a lot of value in this user feedback because of the feedback loop it can create as you improve things, so I definitely recommend layering that in if you can. But let's look at the system in the context of monitoring. What I'm showing here is, on the left of the screen, these two little areas where the embeddings are generated.

The reason I highlight these is because this is the data that you're gonna want to start to monitor to see how the whole system is doing. Really quickly, what is an embedding? I think this group is pretty technical, so I'm not gonna go deep into it. I think we're all kind of working with embeddings at this point, which is really fantastic.

I used to have to explain what an embedding is all the time, but essentially, a user query is a document of text; it's just text data. We need to translate that data into numbers to be able to work with it in a meaningful way in mathematical models. So the embedding is really just a mathematical representation of the text data.

It preserves all these interesting patterns in the text data, and then we can work with it in mathematical models, which is what an LLM is. It really just takes text and turns it into a meaningful vector of numbers, and it's really just a vector of numbers. So now we're generating all kinds of vectors of numbers.

We have embeddings from the user query and from our context store, our vector store; the context data is embeddings already. And then there can also be embeddings generated out of the response. So when we look at this through the lens of monitoring, there's all this data flowing through the system that's gonna be very valuable to collect and analyze.

And then there are also these points where these systems can start to break. There are situations where, even with a really good prompt, if you don't have good context data involved in that prompt, or the LLM hasn't been fine-tuned to really answer the question that's coming in through the query, well, we all know, we've probably seen it: these models just hallucinate. They just make stuff up, and they will respond very confidently. So it's a little bit hard to spot these hallucinations, but that is a pain point to look for, and it's usually because there's a gap in the context data or the model hasn't been properly fine-tuned.

The other area here is just evaluating the accuracy of this kind of system. I think it's hard to really say how accurate these models are, because this is not tabular machine learning anymore, where an answer is either right or wrong. It's more like relevant versus not relevant, but even that's not quite the right way to think about it.

There's just a challenge here in evaluating them. There are a ton of metrics we can look at, and there's also looking at how the system is impacting your business. So let's say I deploy a chatbot that does something meaningful for my business. If I collect that user feedback, I can start to get a proxy for how accurate these systems are.

And I think that's a good way to go here. So, really high level, what I'm trying to show on this slide is what an LLM system looks like, where the opportunities are, and the data that should be collected for monitoring. I've monitored quite a few tabular machine learning models, a lot of recommender system models, and those are really easy to think about in terms of monitoring.

I wanna talk through what that looks like, and then we'll translate that to LLMs, because there are new complexities here. For an ML inference, where you're just making a prediction out of a machine learning model, let's say a recommender system that uses tabular data, the things you wanna log so that you can do proper monitoring of that system and make the entire system auditable are gonna be your features and your predictions.

And then you're also gonna have a feedback loop collecting those truth labels. An example with a recommender system is that the prediction would be the probability that a customer might purchase an item, and the features might be their past purchase history. And then the truth label is very straightforward.

They either purchased it or they didn't. This is all pretty easy. When we look at an LLM system and think about what an inference might really be in this situation, the data looks more complex. The data we wanna log is no longer features and predictions; we're looking at queries and responses.

We also have that context data, whatever context was retrieved for a specific query. And then we also wanna collect human feedback data, but it may or may not be as easily generated as in some of these tabular machine learning use cases. So the data just became a little bit more complex, but I think the core concepts and best practices around monitoring really remain the same here.
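One illustrative way to structure that kind of inference record is sketched below; the field names are examples I'm assuming for the sake of the sketch, not a required schema from any particular tool.

```python
# Illustrative record of a single LLM "inference" for monitoring; field names are
# examples, not a required schema from any particular tool.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LLMInferenceRecord:
    query: str                            # the user's raw query
    query_embedding: list[float]          # embedding of the query
    retrieved_context: list[str]          # context chunks pulled from the vector store
    response: str                         # the LLM's generated answer
    response_embedding: list[float]       # embedding of the response
    user_feedback: Optional[int] = None   # e.g. +1 / -1 thumbs, filled in later if collected
    metadata: dict = field(default_factory=dict)  # model version, prompt template id, latency, ...
```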

So what we wanna do is monitor those embeddings. What does that even mean, and how do you even get an embedding out of the system? As I mentioned, the embeddings are just very long vectors of numbers. They're super meaningful, and I'll show you how we can get meaning out of them, but the first step is to generate them.

So if we have a system that looks like this, we have to generate and save these embeddings out of the production system. There are two options to consider here. One is to extract them directly from your neural network; neural networks naturally create embeddings as they are trained.

So if you went with the option where you're creating your own model, you can just pull embeddings out of that. The second option would be to generate them through an API; OpenAI and Hugging Face have really good options for this. So there are a lot of ways to get the embeddings out, but if you have that raw text data, let's say that query, you can translate it into an embedding using these tools.

Or if that text is flowing through a model already, like your own LLM, you can pull it out of that.
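Here is a minimal sketch of the first option, pulling embeddings out of a model you host yourself. It assumes the Hugging Face transformers library and an example model name; mean-pooling the last hidden state is just one common choice, not the only one.

```python
# Sketch of option one: pull embeddings out of a model you host yourself.
# Assumes the Hugging Face transformers library; the model name is just an example.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")


def embed(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the last hidden state into one embedding vector per input text.
    return outputs.last_hidden_state.mean(dim=1)


print(embed(["How do I reset my password?"]).shape)  # torch.Size([1, 768]) for this model
```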

Just a quick one-minute warning, Claire. Okay, thank you. I can wrap this up very quickly; that's actually perfect timing. So once you have the embeddings, you start monitoring them, and it's a different way than monitoring tabular data, where it's very easy to see an immediate change.

You're really just looking at one dimension there. But with embeddings, we're talking about n dimensions, latent space. So what we wanna do is visualize those embeddings in latent space and calculate distances between them. Take the situation I talked about, where there might be a gap between your context data and your queries.

Maybe your users start to query your system for data that is not well represented in your context data vector store, and you'd love to add some data to provide that context so your system starts performing better. You can visualize this: you can look at the embeddings for your queries, you can look at the embeddings for your context store, and you can measure different distances between them.

If there's a gap, it'll actually pop out really quickly. So visualizing it is one thing, but monitoring it is another. How do we actually proactively monitor that? You can use distance metrics to calculate those distances between clusters and then monitor on that. In this example, we're looking at the Euclidean distance between two different clusters.
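A very simple version of that kind of check is sketched below: the Euclidean distance between the centroid of the query embeddings and the centroid of the context embeddings, compared against a threshold. This is a sketch under my own assumptions, not a particular product's monitor; the threshold is a placeholder that would need to be tuned against a baseline window. For the visualization side, a 2D projection with something like umap-learn can make the gap between the two clusters visible.

```python
# A simple drift-style check: Euclidean distance between the centroid of the query
# embeddings and the centroid of the context embeddings, compared to a threshold.
import numpy as np


def centroid_distance(query_embeddings: np.ndarray, context_embeddings: np.ndarray) -> float:
    return float(np.linalg.norm(query_embeddings.mean(axis=0) - context_embeddings.mean(axis=0)))


DISTANCE_THRESHOLD = 0.5  # placeholder value; tune against a baseline window of traffic


def check_embedding_gap(query_embeddings, context_embeddings) -> float:
    distance = centroid_distance(np.asarray(query_embeddings), np.asarray(context_embeddings))
    if distance > DISTANCE_THRESHOLD:
        print(f"Alert: queries and context embeddings have drifted apart (distance={distance:.3f})")
    return distance
```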

So if you had an automated monitor on this, it would tell you that your queries and your context have a significant difference. And that is all I had. Here's a very beautiful graph of what that might look like; this is an example from Arize Phoenix, so if you're interested, check it out. And if you wanna deep dive more into monitoring machine learning systems, LLMs, anything like that, feel free to connect with me.

I'm happy to meet more people in the community. Thanks, everyone. Awesome, thanks Claire. Beautiful last slide. I love it.
