MLOps Community

Debugging AI Applications

Posted Aug 08, 2024 | Views 134
# RAG
# AI
# Voltage Data
Chip Huyen
CEO @ Claypot AI

Chip Huyen is a co-founder of Claypot AI, a platform for real-time machine learning. Previously, she was with Snorkel AI and NVIDIA. She teaches CS 329S: Machine Learning Systems Design at Stanford. She’s the author of the book Designing Machine Learning Systems, an Amazon bestseller in AI.

SUMMARY

Some questions I often hear are "What model should I use?", "Should I finetune or RAG?", "How many examples should I use?". What to do next depends on the issues your application has. Chip Huyen will walk through common failure modes and how to address each in this talk.

TRANSCRIPT

Chip Huyen [00:00:10]: Hello, everyone, I'm Chip. I'm actually going to give a new talk today. I came here today and was like, let's try. When I talked to people, I felt like this topic might be a better fit for you all, so I hope that I'm right. So we're talking about debugging AI applications, but first, the obligatory company intro. I work with Voltron Data, and we build GPU-native query engines. Basically, we focus on workloads that are too big and too expensive for Spark to handle. The idea is that data processing is naturally parallelizable.

Chip Huyen [00:00:44]: So if you want to process a billion rows, you can just farm them out to thousands of GPU cores. We're also very involved with open source: we're core maintainers of two very popular open source projects, Ibis and Arrow. I'm also the author of Designing Machine Learning Systems, and I'm working on a new book on AI engineering. As part of doing research for my book, I go through almost every single AI tool that comes my way. And it's very painful because there are a lot of those. So to make my life easier, I only look at those with at least 500 stars. You can look at all of them here on my website and browse them; I manually categorize them by what they do. I also find it helpful to track how many stars they have and how fast they are growing, because I see a pattern which I call the hype pattern: a repo comes out, it gets a lot of stars, and nobody ever uses it again.

Chip Huyen [00:01:41]: So I think it's useful to see how things grow. Okay, so I also shared an analysis that you can check out. I often hear a lot of questions like: which model should I use, should I fine-tune or should I do RAG, how many examples should I include? And I feel like the answer always depends on what kind of issues you are trying to correct in the application. What you want to do next depends on what the failure modes of the application are. I categorize the failure modes into about six kinds. They include information-based, behavior-based, and latency, and because we only have about 20 minutes today, I'm going to go over these first three types of failure modes and how to address them. So the first one is information-based failures. I think the slides have changed and this does not.

Chip Huyen [00:02:38]: Okay, so this is an older slide, but it's okay, it works. I would say that the vast majority of the pain in systems today is caused by information-based failures. So what does it look like? One way is when the outputs are outdated. For example, you ask the model something and the model says something like, as of my cutoff date, this has not happened yet. So you don't get that information. The other way it happens, which is way, way more common, is hallucinations. There are many reasons why a model hallucinates, but hallucination is just part of the model.

Chip Huyen [00:03:20]: Wait, can you hear me better? Because I heard there's some echoing issue with the room. Good. Okay, so hallucination is just part of the model's probabilistic nature. However, models are also more likely to hallucinate when they don't have access to the right information. AI does have this bias toward action: even when it doesn't know, it still tries to respond and do something. So you can see that when you ask the model questions that involve some rare or niche information that is unlikely to appear in the training data, the model is more likely to hallucinate. So this is a very, very common issue, information-based failures.

Chip Huyen [00:04:08]: So I would say that usually the first course of action is to try to address this kind of failure. You can do this by enhancing the context. There are two ways you can go about it. First, you try to retrieve information using a retrieval system, like in a RAG system. You can also give the model tools so that it can gather information itself, for example using web search or other sources. So, RAG. First, retrieval. People usually think of retrieval as unstructured retrieval.

Chip Huyen [00:04:45]: You retrieve data like images or documents. And retrieval is a very old science. It has been around for many, many years, and retrieval is the backbone of search, recommender systems, and log analysis. So many of the systems built for search and recommendation can be used for RAG. In general, you can think of term-based retrieval like keyword search or BM25, or fancier techniques like vector search and also hybrid search. But I would say that if you want to start, start with something simple: BM25 is a lot simpler than vector search, and it's a very, very strong baseline to beat. You can also do retrieval from tabular data like a SQL table.
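To make the "start with BM25" advice concrete, here is a minimal sketch of a BM25 ranker in plain Python. It is an illustration and not code from the talk: the class name, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions, and the IDF uses the common always-positive variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption); punctuation stays attached.
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 ranker over a list of document strings."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: in how many documents each term appears.
        df = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, length = self.tf[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            # Term frequency saturates with k1; b controls length normalization.
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

Calling `BM25(docs).top_k(query)` returns document indices ordered by score, which gives a baseline to compare any vector or hybrid search against.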

Chip Huyen [00:05:33]: However, doing retrieval with tabular data is a lot different from retrieving documents, because you first need to convert the natural language query into a SQL query. So you need text-to-SQL, and then you need access to the SQL executor. It's a bit trickier because now you need the model to follow the SQL format, and then you need to put in guardrails, because you don't want the system to accidentally execute some dangerous SQL queries. Another way to augment context is to use tools. One of the first tools that people were very excited about with ChatGPT is web search, browsing the Internet. So now you want to give it things like web search, a news API, a weather API. But there are other tools you can give the model access to as well, for example a calculator or a bash terminal.
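The guardrail idea for text-to-SQL can be sketched as a check that runs before any model-generated query reaches the executor. This is a deliberately conservative illustration using the standard `sqlite3` module; the function names and the keyword list are assumptions, and a keyword check like this can reject harmless queries. In practice a read-only database user would be a stronger safeguard.

```python
import re
import sqlite3

# Keywords that should never appear in a read-only query (conservative, assumed list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|replace|attach|pragma)\b",
    re.IGNORECASE,
)


def is_safe_select(sql):
    """Return True only for a single SELECT statement with no write keywords."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement
        return False
    if not statement.lower().startswith("select"):
        return False
    return FORBIDDEN.search(statement) is None


def run_readonly(conn, sql):
    """Execute model-generated SQL only if it passes the guardrail."""
    if not is_safe_select(sql):
        raise ValueError("refusing to run non-SELECT or multi-statement SQL")
    return conn.execute(sql).fetchall()
```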

Chip Huyen [00:06:32]: And I think this is the so-called agentic workflow. People think of an agent as very fancy, but I think of an agent as just, okay, you give it access to tools, API calls, or functions it can call. So I put both action APIs and retrieval under the category of context construction, and it would look something like this. The goal is just to give the model access to the relevant information so that it can produce better answers. So the second kind of... oh, sorry. Actually, I really like this research paper. It compares RAG versus fine-tuning and the impact of those, here on the MMLU benchmark, and you can see that RAG gives a really strong performance boost. It usually performs much better than fine-tuning, and sometimes it can even perform better than RAG plus fine-tuning. Retrieval is especially useful for use cases that involve new information. For example, here's a benchmark focused specifically on current news.

Chip Huyen [00:07:46]: And you can see RAG just outperforms fine-tuning, and even fine-tuning plus RAG. That is interesting, because why would a base model with RAG outperform a fine-tuned model with RAG? I do think that when you fine-tune a model, there's a chance that the model does better on the fine-tuned task, but its other capabilities get impacted and it's not going to perform just as well. I don't know. Also, when you do retrieval, one very common thing you should do is query rewriting, which some people call query normalization or query expansion. Because sometimes in conversations you will have queries that just don't quite make sense if you use them verbatim for retrieval. Here's an example. The user first asks, when was the last time John Doe bought something from us? And then the follow-up question is, how about Emily Doe? If you try to retrieve documents using "how about Emily Doe?", you get something very irrelevant.

Chip Huyen [00:08:45]: So you would try to rewrite the query into something like this: when was the last time Emily Doe bought something from us? There are different ways to do query rewriting, but usually AI is pretty good at that. You can ask another model: hey, given all this context, rewrite this question so that it's clearer what the user is actually asking. However, query rewriting can be quite tricky. Let's say the user asks a question like, how about his wife? Now we need information like who his wife is, right? So you need to be able to do identity resolution. And if the system does not have access to that information, it has to respond with something like, sorry, we don't know who John Doe's wife is, instead of making up some name, hallucinating something. Okay, so the next type of failure is behavior failures.

Chip Huyen [00:09:33]: One way it looks is when the outputs are factually correct but are not quite what you want. Here's an example, right? The user asks, hey, why is retrieval important for AI applications? And it responds, retrieval is an important technique for search and recommender systems. This is correct, but it's not what the user wants. Another way this can fail is that the responses don't follow the requested format. A very common format that is often requested from models is JSON. You want valid JSON with certain keys. Here's an example.

Chip Huyen [00:10:12]: You can see that the output first has a preamble, "The answer is", which makes it impossible for a JSON parser to parse, or the JSON is just invalid, missing a closing bracket, or it doesn't contain the keys that you want. These two are very common behavior-based failures. So the next question is, how do we address those? I used to teach a course, so it's like when everyone waits for students to respond and people don't respond. So one way is to use prompting, which is just adding more examples and hoping that the model will follow the examples. You can use validators, like JSON validators, you can use AI to validate, and you can also use post-processing.

Chip Huyen [00:11:00]: LinkedIn recently had a pretty interesting case study. They found out that the model by itself can only output JSON correctly 90% of the time. But they also realized that when it fails, a lot of the errors follow similar patterns. So they wrote a script to correct those errors, and they were able to boost the correctness to about 99%, which is enough for their application. Another thing that is not on here, because it's an older slide, is to use a better model. It has been shown that as models become more powerful, they also get better at following instructions and can output the format more correctly. And another thing you probably want to look into is constrained sampling, which I think is pretty interesting, but also a bit harder to do, because with constrained sampling you have to define a grammar and force the model to follow it. So it's very interesting.
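The post-processing approach can be sketched as a small repair function that handles the failure patterns named earlier: a preamble before the JSON, a missing closing bracket, and missing keys. This is an illustrative sketch, not LinkedIn's actual script; the function name and the limit of three appended braces are assumptions.

```python
import json


def repair_json(raw, required_keys=()):
    """Try to recover a JSON object from model output; return None if it cannot be fixed."""
    start = raw.find("{")
    if start == -1:
        return None  # no JSON object at all
    candidate = raw[start:].strip()  # drop any preamble such as "The answer is:"
    # Try the text as-is, then with up to three closing braces appended.
    for extra in range(4):
        try:
            parsed = json.loads(candidate + "}" * extra)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            return parsed
        return None  # parsed, but not the object we asked for
    return None
```

Returning `None` rather than a guess lets the caller decide whether to retry the model call, which keeps the repair step from hiding real failures.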

Chip Huyen [00:11:56]: I know that math PhDs love that problem. But sometimes the effort spent on defining a grammar might as well be spent on fine-tuning, on training a better model. So the last solution here is fine-tuning. Here's an example of what a very simple constrained sampling would look like, but in the interest of time, we won't go into it. Basically, you put it before the sampling layer so that you sample only from the valid tokens instead of from all the tokens. And so one thing I want to fight, I was just fine tuning and buddhism last year. So fine-tuning is never the first answer to anything.
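The slide itself is not reproduced in the transcript, so the following is only a guess at what "a very simple constrained sampling" could look like: filter the logits down to the tokens a grammar allows, then sample from what remains. The function name and the dict-based logits are assumptions made for illustration; a real implementation would operate on the model's logit tensor.

```python
import math
import random


def constrained_sample(logits, allowed, rng=random):
    """Sample one token, considering only tokens in the `allowed` set.

    logits: dict mapping token -> unnormalized score (assumed format).
    allowed: set of tokens the grammar permits at this position.
    """
    valid = {tok: score for tok, score in logits.items() if tok in allowed}
    if not valid:
        raise ValueError("no valid token available under the constraint")
    top = max(valid.values())
    # Softmax weights over the valid tokens only (shifted for numerical stability).
    weights = {tok: math.exp(score - top) for tok, score in valid.items()}
    threshold = rng.random() * sum(weights.values())
    cumulative = 0.0
    for tok, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return tok
    return tok  # guard against floating-point rounding on the last token
```

For JSON output, the first position would allow only tokens such as `{` or `[`, so a high-scoring preamble token like `The` can never be chosen.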

Chip Huyen [00:12:36]: Fine-tuning is the last line of defense. The reason is that fine-tuning not only requires an upfront cost, because you need data and people who know how to do it, it also requires continual maintenance. Once you have the fine-tuned model, what do you do? How often do you want to fine-tune it? And what if another base model comes out? How fast can you iterate compared to how OpenAI, Gemini, or Anthropic can iterate? So it is a very big commitment to get into fine-tuning. The last failure mode I want to get into in this talk is latency. There are many different metrics for latency, so how you should solve latency depends on what metrics you care about. Here are some of them. The first is time to first token: after users send a query, how long until they see the first token? People also care about time between tokens, which is a newer latency metric.

Chip Huyen [00:13:36]: People have found out that sometimes users get more frustrated when it's like someone typing very slowly. There's also tokens per second. And also the total latency: how long does the response take in total? This is especially important when you use some kind of chain of thought or planning, when the response might require multiple steps until it completes. So it can take minutes, and it can be quite scary. I talked to a company and they said, let's ask the model to do chain of thought, and the quality is good, but then it takes like three, or sometimes even longer, to complete a response, and it's just unacceptable. So how do we deal with latency? The good thing is that latency is not a new concept. Ever since people have had the Internet, we've been thinking about how to make things faster and faster. So we can use a lot of those latency solutions for this, but there are also newer solutions. First is streaming mode. When you use OpenAI or Anthropic, instead of waiting for the whole response to complete, you can just show users one token at a time.
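The latency metrics listed above can all be computed from two things: the time the request was sent and the arrival time of each streamed token. A minimal sketch, assuming timestamps in seconds and a hypothetical function name:

```python
def latency_metrics(request_time, token_times):
    """Compute common LLM latency metrics from per-token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens were produced")
    total = token_times[-1] - request_time
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "time_to_first_token": token_times[0] - request_time,
        "mean_time_between_tokens": sum(gaps) / len(gaps) if gaps else 0.0,
        "tokens_per_second": len(token_times) / total if total > 0 else float("inf"),
        "total_latency": total,
    }
```

Tracking the metrics separately matters because the fixes differ: streaming improves time to first token but does nothing for total latency on a multi-step chain-of-thought response.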

Chip Huyen [00:14:46]: However, it comes with a trade-off, which I think of as security versus latency. The reason is that while you can get users to see the response faster, it also makes it harder to validate the response, because now you show it to the user before you can validate whether the response is safe or good. Another approach is using a cache. There's the normal exact cache. This kind of cache system is not new; I feel like everyone who deploys software knows what a cache is. You can use AI to predict whether the query is likely to be repeated. For example, if the query is very unique, like how is the weather right now, there's no point in caching it because the answer will change in the next minute.

Chip Huyen [00:15:29]: However, if the query is more likely to appear again, then you might want to cache it. Another cache I think is very exciting is prompt cache. Who here is familiar with prompt cache? Prompt cache is a pretty new technique. I think it was incorporated into Gemini just last week. The idea is that when you have queries that share text segments, you can cache those text segments so you don't have to process them again. It's very common when you have system prompts and examples: a lot of queries have the same system prompt and examples, so you don't have to process them again between queries. So far for Gemini, I think once they announced the pricing, cached tokens are about a quarter of the price, but you have to pay extra for cache storage, which makes it a lot cheaper.
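The exact cache with a "is this worth caching?" decision can be sketched as follows. In the talk that decision is made by a model; here it is a plain callable, `should_cache`, which is a hypothetical stand-in, as are the class name and the time-to-live parameter.

```python
import time


class ExactCache:
    """Exact-match response cache with a time-to-live and a caching policy."""

    def __init__(self, ttl_seconds, should_cache, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.should_cache = should_cache  # callable: query -> bool (stand-in for a model)
        self.clock = clock
        self.store = {}

    def get_or_compute(self, query, compute):
        # Normalize case and whitespace so trivially different queries share a key.
        key = " ".join(query.lower().split())
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        answer = compute(query)
        if self.should_cache(query):
            self.store[key] = (now, answer)
        return answer
```

A query such as "How is the weather right now?" would be rejected by the policy and recomputed every time, while a stable question is served from the cache until the entry expires.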

Chip Huyen [00:16:21]: I also mention semantic cache here because it comes up a lot. But honestly, if you are thinking about semantic cache, I would say think again, because it's very easy to get it wrong. Other cache systems usually just make things faster and don't change the quality. But semantic cache can cause some model outputs to be wrong if you retrieve the wrong item. So that's something to think about if you decide to go with semantic cache. Another way is to try to decompose a task into smaller tasks. The idea is that you can try to do as many subtasks in parallel as possible. For example, you have a task that says, hey, rewrite this essay at three difficulty levels.

Chip Huyen [00:17:01]: One for a first grader, one for an 8th grader, one for a college student. Then you can do all three of these tasks at the same time. You can also try to use a simpler model for smaller tasks. If the task is simpler, maybe you can use a simple classifier or a 7B model instead of GPT-4. That would help speed up your task. You can also fine-tune a smaller model to imitate the behavior of the large model, which gives you similar model performance with reduced latency. But again, fine-tuning is tricky. And the last one is a bit interesting.
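The essay example can be sketched with a thread pool, since model API calls are I/O-bound and the three rewrites do not depend on each other. The `rewrite` argument is a hypothetical function standing in for whatever model call you use; it is not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

LEVELS = ["first grader", "8th grader", "college student"]


def rewrite_all(essay, rewrite, levels=LEVELS):
    """Run one rewrite per difficulty level in parallel and collect the results.

    rewrite: callable (essay, level) -> str, e.g. a wrapper around a model call.
    """
    with ThreadPoolExecutor(max_workers=len(levels)) as pool:
        futures = {level: pool.submit(rewrite, essay, level) for level in levels}
        # result() blocks until each rewrite is done and re-raises any exception.
        return {level: future.result() for level, future in futures.items()}
```

With this shape, total latency is roughly that of the slowest single rewrite rather than the sum of all three.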

Chip Huyen [00:17:39]: I've seen some companies do this when they have outputs with high variance in latency, especially with chain of thought. One query might be very fast, but the next query might be very slow because it takes more steps. So they try to generate multiple outputs at the same time, and whichever completes first, they send back to users. Here you have a latency versus cost trade-off, because it can reduce latency, but you have to make more API calls, which costs you more money. Okay, so I think that is pretty much all of my time. This slide should summarize it. It's actually inspired by a slide from OpenAI. I should have put the link there.
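The "generate several outputs and return whichever finishes first" technique described above can be sketched with the standard library. `generate` is a hypothetical callable wrapping a model call, and `cancel_futures` requires Python 3.9 or later. Calls that are already running are not interrupted, so they still cost money, which is the latency versus cost trade-off.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def first_to_finish(generate, prompt, n=3):
    """Start n generations in parallel and return the first one to complete.

    generate: callable (prompt, attempt_index) -> str (assumed signature).
    """
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(generate, prompt, i) for i in range(n)]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower attempts; drop any that have not started yet.
    pool.shutdown(wait=False, cancel_futures=True)
    # If the fastest attempt raised, result() re-raises that exception here.
    return next(iter(done)).result()
```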

Chip Huyen [00:18:18]: So basically, when you start, I think you should just try to get the most mileage out of prompting and adding examples, and then you try a simple RAG, like BM25 or keyword search. Fine-tuning comes only much, much later, because it's more complex. Yeah, so that's me. I'm on Twitter, LinkedIn, and Discord. And here's my blog, where I write a lot about AI applications. Thank you very much.

AIQCON Male Host [00:18:54]: All right, thank you so much, Chip. Before Chip leaves the stage, do we have any questions for Chip? We have just a couple of minutes. All right, you've got some questions after all. Raise your hands one more time so I can see. All right.

Q1 [00:19:10]: What are you most excited about next?

Chip Huyen [00:19:14]: I'm actually really excited about AI-assisted storytelling. I come from a writing background, and I think that if you know how to tell a good story, you can get people's full attention. You can help people understand technical concepts better. You just make things easier. It's also very common in gaming. I'm talking to a company right now which is doing story generation for kids. What they do is try to detect which words the kids have trouble with, and then generate stories with those words to encourage kids to learn the words more.

Chip Huyen [00:19:53]: And it's pretty interesting, because when you generate stories for kids, you now have a constrained generation problem: you can only use the words that the kids are familiar with, and that really, really limits the kind of stories you can create.

Q2 [00:20:07]: Hello. Thank you for your presentation. I have a question. Does prompt cache help with the context window as well?

Chip Huyen [00:20:17]: It does not, unfortunately. I don't think so.

Q2 [00:20:20]: Okay.

AIQCON Male Host [00:20:25]: All right.

Q3 [00:20:28]: Hey, I actually had a question about the performance part that you were talking about. You were talking about time to first token. I was just curious if you had any thoughts on how OpenAI is doing it with their 4o model. Because some people have thought that it's possible that they're actually using a sort of human tweaking effect instead of speeding up the model, by having the model always say a leading word, like "okay", and then start giving you the answer, or something like that. It usually has something in there that pauses long enough that even if their time to first token was drastically slowed down, the person using it would never notice. Is that a method that's being employed, or is this purely hypothetical?

Chip Huyen [00:21:09]: I feel like if I knew what OpenAI is doing, there'd be a way I could arbitrage that information. I do hear online that OpenAI is much better at hiding latency, and I think there are certain techniques you can use to hide it. One idea came up recently. I didn't come up with it, but somebody told me that they use a preamble to hide latency. For example, sometimes models start with, oh, the answer to this is this. Not because they need to say that, but because it buys them a little bit more time until the model can generate the next token. So I certainly think that there are a lot of techniques we can use, but it's more of a product question than an AI question.

AIQCON Male Host [00:21:52]: Good question. One more question.

Q1 [00:21:56]: What do you think about using LLMs for evaluating other LLM-based applications? And do you think it's just a band-aid solution until we get really grounded evaluation metrics that we can use to evaluate against ground truth?

Chip Huyen [00:22:07]: I'm actually very excited about that. I do think it's the only solution right now for a lot of applications in production, when there's no ground truth and we don't have the humans. However, using it requires a lot of care. I was looking at a bunch of companies where they have metrics with the same name, but the metric is very highly dependent on the LLM used as a judge and also on the prompt. And I was looking at that, and everyone has different prompts for different metrics. I was talking to a PM at a pretty big company, a PM for a very popular product, and I was asking, hey, what metrics do you use? And he said, oh, coherence, accuracy, relevance. And I asked, what do the prompts look like? And he said, oh, I actually don't know what the prompts look like. So we have the situation where one team manages the metrics and comes up with the prompts, and another team uses these metrics but has no idea how they are being defined or computed.

Chip Huyen [00:23:11]: And you have situations where, because it's very common for people to have typos, they might update the metric prompts, and the downstream application developers keep using those metrics as if they were stationary. So I do think it's very exciting, but we need a lot more rigor in using those metrics.
