
Do Re MI for Training Metrics: Start at the Beginning

Posted Aug 15, 2024 | Views 42
# Training Metrics
# AI
# OpenAI
Todd Underwood
Research Platform Reliability Lead @ OpenAI

Todd Underwood leads reliability for the Research Platform at OpenAI, working to improve the reliability and usability of the software and systems that train some of the best models in the world.

Prior to that, he was a Senior Engineering Director at Google leading ML capacity engineering in the office of the CFO at Alphabet. Before that, he founded and led ML Site Reliability Engineering, a set of teams that build and scale internal and external AI/ML services and are critical to almost every Product Area at Google. He was previously the Site Lead for Google’s Pittsburgh office. He recently published Reliable Machine Learning: Applying SRE Principles to ML in Production (O’Reilly Press, 2022).

Todd has primary expertise in distributed systems, especially for Machine Learning / AI pipelines, although he also has a background in systems engineering and networking. In addition to Reliable Machine Learning, Todd has presented work on ML at various conferences and forums including OpML 20 and TWIMLcon 21 and 22. He has presented work on the future of systems and software reliability engineering at LISA13, LISA16, SREcon EU 15, and SREcon EU 22. He is a co-author of a chapter in the O'Reilly SRE Book. He has published three articles in USENIX's ;login: magazine. He has presented work related to Internet routing dynamics and relationships at NANOG, RIPE, and various Internet interconnection meetings. He served on the Program Committee for OpML, was Chair of the NANOG Program Committee, and helped found the RIPE Programme Committee.

He is interested in how to make computers and people work much, much better together.

SUMMARY

Model quality/performance is the only true end-to-end metric of model training performance and correctness. But it is usually far too slow to be useful from a production point of view. It tells us what happened with training hours or days ago, when we have already suffered some kind of a problem. To improve the reliability of training we will want to look for short-term proxies for the kinds of problems we experience in large model training. This talk will identify some of the common failures that happen during model training and some faster/cheaper metrics that can serve as reasonable proxies of those failures.

TRANSCRIPT

Todd Underwood [00:00:10]: So. Oh my God, look at that. That's a wacky picture. Okay, I'm gonna give a talk. It's not gonna be about quality. You're gonna be very disappointed. But as previously discussed, you will be asleep. So it's gonna be okay.

Todd Underwood [00:00:23]: There's a little bit of a Sound of Music, Julie Andrews metaphor that's woven throughout here. A couple things I wanna say. You don't have to have seen The Sound of Music, but if you have, it still won't help. It still won't help. So don't worry about that part. The other thing is that robot Julie Andrews is not real Julie Andrews. That should be obvious.

Todd Underwood [00:00:41]: In fact, all of the pictures are fake. Just letting you know. All of them, just like everything that comes from LLMs. No, not everything that comes from LLMs is fake. Okay. Furthermore, disappointing you, this is not a talk about quality, like you're here at a quality conference. It's not about quality. But I'm gonna do that magic thing where you take the thing you're supposed to be talking about, you claim it's actually related to a different thing, and then you ignore it and talk about the different thing.

Todd Underwood [00:01:06]: I'm going to do that. So follow me on this and we're going to get there. It's going to be great. So the claim I'm going to be making is that quality is a reliability problem. And then I'm not going to talk about quality. I'm only going to talk about reliability. So once I convince you of that, you'll have forgotten that the whole conference was supposed to be about quality or quality measurements or evaluations. It's going to be good.

Todd Underwood [00:01:26]: So, model quality. It is true that model quality is the only good end-to-end metric of system availability or system reliability. Sometimes I say that and people are like, what do you mean? I feel like it should be obvious, but if it's not obvious, we're going to go into it, and it's going to be clear why that's the case. It is also the case that training problems, systems infrastructure problems, do result in model quality problems very frequently. In fact, it's almost more frequent than the bad model architectures and bad data that we have. So it's both. But anyway, let's start at the very beginning.

Todd Underwood [00:02:02]: So first, context and disclaimers. Who am I? I'm Todd. I have been working on large scale ML systems for a depressingly long amount of time. I do give you permission to call me old. It doesn't bother me, not even a little bit. There's no youth fetishism over here. I started working on ML systems at Google in 2009. I've been at OpenAI for the last little bit, working on the research side, on training big models.

Todd Underwood [00:02:26]: I don't do model architecture. I don't do ML algorithms. I barely. I can spell ML and I routinely spell it correctly, but that's very close to the most sophistication I have, which might make you wonder why I'm here and none of us knows. But that's okay. I was here for the company, which is good. You all look asleep, but it's okay. But I was also here for the coffee, which has been taken away.

Todd Underwood [00:02:50]: I might give up on this talk in the middle. Before you do, we're going to see. We're going to see how it goes. Okay. Other context and disclaimers: who you are. So, first of all, I told GPT-4o to make that image. That's really creepy and weird. Like, that's.

Todd Underwood [00:03:06]: Anyway, some of these images are profoundly disturbing, and when you ask it to make them friendlier, it just makes the robot shorter and fatter, which I don't think is necessarily wrong. Like, I understand why it does that. But anyway, sometimes I can't tell if it's telling us something interesting about itself or about us or both. That's the trouble with ML systems. I am imagining that all of you are at least vaguely interested in how models get trained. Maybe some of you train models, maybe you don't. I've spent all my life training big models and working on infrastructure for training models.

Todd Underwood [00:03:43]: So that's kind of what I'm thinking. If you don't do that, like, there are other talks, and they're probably better than this one. And then finally, about this talk: this is a talk about training, not inference. It's a talk about systems engineering, not machine learning, not model quality. Other disclaimers: there are no secrets. There are no secrets. In fact, anybody who tells you secrets from their employer in a talk isn't employed for very long. So none of these talks have any secrets.

Todd Underwood [00:04:11]: If that's disappointing to you, you should have known. It might not be relevant for most people, either. Also, the entire talk was generated by GPT-4o. It was not. No, but the pictures were. So, okay, I'm making this claim that systems reliability and quality are related to each other. So, first of all, here's the claim: quality is a reliability metric. I mentioned earlier that I was going to say this. What do I mean by that? Say your system is working, like, you have some measurement for your training system: it's reading data.

Todd Underwood [00:04:50]: It's updating a map of keys somewhere, parameters somewhere. It's chewing through the data, it's making progress, it's up, it's available. That could be true. If all of that is true and the model doesn't work, you haven't done anything. Like nothing has happened. Like you used a lot of computer cycles, you burned some electricity, and you haven't achieved anything. Cause the only reason we're here is to make models that do something, that have learned something, that are capable of something. So systems working with bad ML results, that's a failure.

Todd Underwood [00:05:26]: If there were a situation where you thought your system wasn't working very well, it was unavailable, it was crashing all the time, restores from checkpoints weren't working right, systems were throwing errors, but at the end of the day, or several days, or in my world several months, you had a model that did something cool, that did something new, well, that would be a mystery, but it certainly wouldn't be a failure, right? And so again, like, the reason we're here, like I say, GPUs and computers are not just for fun. Like, we don't get thousands or tens of thousands or hundreds of thousands of these and put them in a place just for fun. To be clear, it is fun. Like, I'm not denying that it's fun to play with thousands of computers, but that's not why we're here. We're here to produce models that do something. And so we have to try to keep that in mind.

Todd Underwood [00:06:18]: Secondly, and this one is less obvious to some people, systems problems can create quality problems, model availability problems, and model performance problems. So concretely, there are data and metadata problems. You train on bad data, you train on unrepresentative data. You train on great data that you misinterpret because you thought it was different data, so you do the wrong thing with it. Almost everyone here who's ever trained a model has done all of that, right? And that's one of the things I find funny: a lot of times in these sorts of settings, people talk about the really tricky and subtle errors, but those aren't the errors we all make. The errors we all make are like, I read the wrong file. It sounds dumb, but maybe I'm the only person who does that, and yet I talk to people and I'm like, oh yeah, so yesterday didn't go well because I spent 8 hours reading the wrong directory full of the wrong files that didn't have the stuff I thought in it, but it looked close enough to the stuff that my model just chewed it up and did stuff with it. And that was a waste of 8 hours. And so when I say that, people are like, I would never do that.

Todd Underwood [00:07:27]: I'm like, you've done that. Come on, you've done that. Let's see. Like, you've done that. And so, like, I think it's reasonable to acknowledge the simple failures that we have. And that's one of them. Silent data corruption. Like the number of cases where we ask for bleeding-edge hardware and now we're covered in some blood, right? Like, we've got a lot of hardware that works really fast a lot of the time.

Todd Underwood [00:07:52]: Not all the time, for sure, sometimes most of the time, but it depends on how many you have. But we asked for really fast hardware, clocked at really high rates, produced right at the edge of what manufacturing capabilities we have in fabs. And so we take a lot of errors. Configuration errors: you ran the wrong thing. You ran yesterday's config on today's model architecture. That happens all the time. And there are just plain software bugs, like, you know, you did an all-reduce across all the nodes, but you didn't line things up right and you ended up with garbage at the end. All of these are systems errors.
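Many of these failures are cheap to catch before a run ever starts. As a rough illustration of the kind of pre-flight check that catches the "wrong files, stale config" class of mistakes Todd describes, here is a minimal sketch; the paths, manifest layout, and helper names are hypothetical, not anything specific to the systems in the talk.

```python
# Minimal sketch of pre-flight sanity checks for a training run.
# All paths, manifest fields, and names are hypothetical; the point is to
# fail fast on "wrong files / wrong config" before burning accelerator hours.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file so we can confirm we are reading the data we think we are."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def preflight(data_dir: Path, config_path: Path, manifest_path: Path) -> None:
    manifest = json.loads(manifest_path.read_text())  # {"files": {name: sha256}, "config_version": ...}

    # 1. Right directory? Every expected shard exists and hashes to the expected value.
    for name, expected in manifest["files"].items():
        shard = data_dir / name
        assert shard.exists(), f"missing shard: {shard}"
        assert sha256_of(shard) == expected, f"wrong or changed file: {shard}"

    # 2. Yesterday's config on today's architecture? Compare versions explicitly.
    config = json.loads(config_path.read_text())
    assert config.get("version") == manifest["config_version"], (
        "config version does not match the data manifest for this run"
    )

# Example (hypothetical paths):
# preflight(Path("/data/run_042"), Path("configs/run_042.json"), Path("/data/run_042/manifest.json"))
```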

Todd Underwood [00:08:31]: They're not really ML errors, but you produce a model that's no good. So hopefully I've convinced you that there's this bidirectional truth: model quality is a systems reliability metric, but systems also impact the quality of the model you produce. Okay, so if that's true, and let's just assume it is, because one of my favorite things in life is to assume I'm right, so if you would just go with me on this, just assume I'm right. No, I might be right. If this is true, then let's talk about, at a systems level, three things that I think you should measure. They're really straightforward; it's not that complicated. And I'm going to explain each one.

Todd Underwood [00:09:10]: I'll talk about which parts of it might be interesting or hard to measure. They're not. It's mostly not that hard. And then why those are the right metrics. Okay, so the first one is uptime. So what is uptime? I mean, honestly, as someone who has had SRE in the title of my job for a long time, reliability's in there, it seems weird to still be asking, so what is uptime? It feels like we're going back to our pure epistemological questions.

Todd Underwood [00:09:40]: Like, all of you studied epistemology in college, I'm positive of this. Right? Because historically, engineers get very, very good education in history and philosophy. That's what. Wait, that's. Sorry, wrong talk. Wrong talk. Different, different place. So uptime is time making progress divided by all the time.

Todd Underwood [00:09:59]: That's it. It's not that complicated. Except, wait, what is making progress? That's the thing we talked about, that we cared about. What does it mean to be making progress? Well, simplistically, we're reading data, we're reading our input data, and we're learning. Where learning means what? Well, we're updating some map, we're updating some data structure that stores something. But really what we mean when we say learning is that we're able to do something we couldn't do before, or something better than we could do it before. I know some of this sounds very, very basic, but I think it's useful to think concretely about what we mean by uptime. So a model is up if it is spending time reading training data and producing a new map that is capable of doing something it couldn't do before. That's it.

Todd Underwood [00:10:57]: Now, sometimes you can only figure that out retrospectively. So sometimes you can say, like, you're training along, everything's fine. So I don't know why we march, but we march when we're training along. I don't know. We're training along, we're doing it, we're moving forward in time. And yet, when we check at the end of the day or the next day, we're like, this model's a bit crap. It's just really not very good. In fact, it's less good than it was two days ago.

Todd Underwood [00:11:24]: Well, in my mind, those are not two days of uptime. In particular, one common thing: I restored from the wrong checkpoint. So I had a crash or something. I wanted to stop it and reconfigure something. I wanted to add some new training data. I stopped training for a minute, and now I'm going to restart it. I restarted it from 6 hours ago instead of from 30 minutes ago.

Todd Underwood [00:11:48]: Well, that's okay, but you just lost five and a half hours of uptime. You had had it, but now you're walking through your five and a half hours of uptime again. And that's too bad for you, because, and this is one of the most disappointing aspects of my life, time does march forward at one second per second. This is one of the most disappointing features of life. Get used to that. Okay, so uptime. Why do we measure uptime? This is probably painfully obvious, but zero is a very, very, very low number. Like, zero is just a terribly low number. It's a really disappointingly low number in terms of future progress.

Todd Underwood [00:12:24]: And so if your model isn't up, you're really just not doing anything, which is what we're doing here. Right here. No, actually my model's up right now. I don't know, like my model's training while we talk. I just checked, it's training. So that's the first thing is like this is, this is not enough. This is one of those like necessary but not sufficient. If your model isn't up, you're not going to solve any of the other problems.

Todd Underwood [00:12:46]: If it is up but you're doing dumb stuff poorly, you could still not solve those problems, but at least you have a fighting chance. So the first thing I want you to do is determine whether your training system is up at all. Large distributed systems are very challenging. They're challenging to measure, they're challenging to configure. People pretend they're easy, I think, ever since some of the early days of the Internet, when people like Google were like, I will send an RPC to 10,000 machines and use that to answer your search query, and we're like, oh, that's easy. It's not easy. It's never been easy.
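As a concrete, toy illustration of that definition, here is a minimal sketch of uptime as time making progress divided by all the time, including the way a restore from a stale checkpoint claws back hours you thought you had. The interval structure and numbers are made up for illustration, not taken from the talk.

```python
# Minimal sketch: uptime = time making progress / total wall-clock time.
# "Progress" here means time that contributed to the final model; the
# Interval type and the example numbers are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float         # seconds since run start
    end: float
    made_progress: bool   # did this time contribute to the final model?

def uptime(intervals: list[Interval]) -> float:
    total = sum(i.end - i.start for i in intervals)
    productive = sum(i.end - i.start for i in intervals if i.made_progress)
    return productive / total if total else 0.0

run = [
    Interval(0, 3600 * 6, True),              # six good hours of training
    Interval(3600 * 6, 3600 * 7, False),      # crash, restart, restore
    Interval(3600 * 7, 3600 * 12.5, False),   # restored from a 6h-old checkpoint,
                                              # so ~5.5h just replays lost work
    Interval(3600 * 12.5, 3600 * 24, True),   # back to making new progress
]
print(f"uptime: {uptime(run):.0%}")  # ~73% in this made-up run
```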

Todd Underwood [00:13:19]: It's easier than it's ever been, and it's still pretty hard. So, next one: throughput and efficiency. Okay, so what are throughput and efficiency? There's a bunch of different ways people measure this, but the key idea is cycles of your accelerator, processor, GPU, spent training the model divided by total cycles. So it can be model cycles divided by total cycles that were available. One of the most common measurements here is model FLOPs utilization, MFU. This is sort of observed tokens per second divided by the theoretical maximum tokens per second of the system. I will warn you that MFUs come in very low numbers.

Todd Underwood [00:14:05]: Like many of us, when we think about efficiency, we're like, what about 60%, 70%? There are no 70% MFUs. Nobody is getting 70% of the cycles out of anything, which is really sad. But as long as you recalibrate, you can just think of this as tremendous opportunity for improvement. And in particular, if you stand up a system and you're getting 1% MFU, God forbid, then all you can think is, I don't need to buy GPUs. I just need to buy some smarts. I just need to buy a clue. Like, I need to be smarter about how I use these expensive GPUs. So pay attention every time you see a low training efficiency, whether it's MFU or chip time spent training versus chip time spent doing other things, which is another measure.

Todd Underwood [00:14:49]: Like, remember that these are complex systems. So when you think about the training, you're doing a bunch of things. You're fetching data from, presumably, a networked blob store data thingy off on the side. Okay, that's great. But when you're fetching the data, you're probably not also training on the data, unless you've been very careful about interleaving these kinds of activities. You are computing new values for things, but then you're distributing your version of those values to other friends of yours on the network who have computed their own values, and you're trying to aggregate these into a shared view of the world. Unless your system is very carefully designed to interleave these activities, you'll stall, you'll be slow.

Todd Underwood [00:15:30]: It'll be disappointing. So, throughput and efficiency. Why throughput? I think I covered it: low throughput is very common, and there are a lot of ways to build these things that work but don't work well. You can have bad networking. You can have bottlenecks in your system. You can have poorly architected interleaving of communication with computation. You can have inefficient or slow checkpoints and restores. You can checkpoint too often. You can checkpoint not often enough. When you think about it, why do we checkpoint? We checkpoint because things crash.

Todd Underwood [00:16:02]: How often do we checkpoint? Exactly as often as you need to, to make as much forward progress as you can, given how often things crash. That was the best non-answer ever. There's no answer. Like, this is just a computation you do. If you had GPUs that crashed every third step, you would checkpoint every step. If you have GPUs that crash, on average, every 500th step, you probably wouldn't checkpoint every step, because checkpointing slows you down, but it's worth it if it prevents you from having to go back too far once things do crash. So a lot of this is about that sort of trade-off.

Todd Underwood [00:16:41]: Okay, the final one I want to talk about is less obvious, which is machine or network health. So everything else was at least in the domain of model training, right? And this one, I love that picture. I'm like, that is what DALL-E 2, I think it is in this case, thinks is a healthy machine in a data center. I don't know. Healthy. It's healthy. My other favorite thing, and I think this is a well documented fact, is most of these image generation models can't do text.

Todd Underwood [00:17:15]: They just make up the weirdest words. It's fascinating to me, because you would think text is pretty easy. You could do some reinforcement learning and be like, no, that's still not a word. That's not a word. But anyway, they have some very odd words in here. Okay, so what is machine health? It is the number of reliable or trustworthy machines at any given point in time divided by total machines. What's a reliable machine? Well, I mean, that is indeed a philosophical question, but a reliable machine is a machine that knows about itself, that boots, that is aware of its hardware, that is capable of accessing its hardware, where hardware includes a CPU and any GPUs or accelerators it might have, that is aware of its network port and can make network connections, accept network connections, send and receive data. Depending on your background, that either sounds stupidly easy or like an incredibly tall order, and it's actually somewhere in the middle of those two things.

Todd Underwood [00:18:14]: So, okay, so that's what's reliable. What's trustworthy? Again, one of the most disappointingly common features of very large scale computation of any kind is that when you get beyond a point, the number of failures just becomes large. So double-bit ECC errors, uncorrectable ECC errors, transparent corruption of data while in transit, all this stuff that sometimes cannot be caught other than as, now my model is not correct. And that's really the reason to focus on machine and network health: your ability to figure out whether you have corruptors and other problems in your training system. Total should be obvious, but the total in the divisor here is whatever you're paying for. And not all of your providers will deliver all of the machines and deliver them booted, but they will all charge you for all of them. They're very good at charging you.
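To make the metric concrete, here is a minimal sketch of that reliable-and-trustworthy-over-total calculation, dividing by what you pay for rather than what gets delivered booted. The specific checks and fields are hypothetical placeholders for whatever your fleet tooling actually reports.

```python
# Minimal sketch of machine health: machines passing basic reliability and
# trust checks divided by everything you are paying for. Fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Machine:
    boots: bool
    sees_all_gpus: bool
    network_ok: bool                    # can open/accept connections, send and receive
    uncorrectable_ecc_errors: int
    failed_data_integrity_check: bool   # e.g., a known silent-corruption signature

def is_healthy(m: Machine) -> bool:
    reliable = m.boots and m.sees_all_gpus and m.network_ok
    trustworthy = m.uncorrectable_ecc_errors == 0 and not m.failed_data_integrity_check
    return reliable and trustworthy

def machine_health(delivered: list[Machine], total_paid_for: int) -> float:
    # Divide by what you are billed for, not just what the provider delivered booted.
    return sum(is_healthy(m) for m in delivered) / total_paid_for

fleet = [Machine(True, True, True, 0, False)] * 950 + [Machine(True, False, True, 0, False)] * 20
print(f"machine health: {machine_health(fleet, total_paid_for=1024):.1%}")
```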

Todd Underwood [00:19:15]: It's one of the best things about them. Why machine health? I think the math is against us once we get beyond a certain scale. So the math of a distributed single-model-scale problem is against us. The math of advanced, cutting-edge accelerators. I talked about this; I don't think this is Nvidia's fault. Like, Nvidia didn't say, we want to make stuff that just barely works. Nvidia said, oh, you all want the fastest thing ever? And we all said, yes. And they said, okay, it works okay.

Todd Underwood [00:19:50]: And we were like, ok, I guess I'll take it, because I got to have that. I got to have that. And so this is a deal all of us made with each other about, like, we all wanted fast. All of us wanted fast. And we were willing to tolerate some of the challenges of having really finicky networking and really touchy hardware because we just wanted fast. And so that's the world that we're living in right now, and I don't see that changing for a little while. So that applies to the machines and it applies to the network interconnects. Like, the only network interconnects that actually work are too slow to use.

Todd Underwood [00:20:28]: And so the networks that you want to use are the ones where, if you walk across the room and kind of look at them sideways and nod, one of the ports flaps, and you're like, what did you do? I'm like, well, it was the vibrations in the floor panel. Somebody unplugged a cable on the other side of the data center. This stuff really happens. And it's not exactly like that. All of that's an exaggeration. This entire talk is an exaggeration. Don't believe anything I say. Okay, so if we've sung the do re mi of, like, the basic things, what's the advanced stuff? The advanced stuff is really straightforward.

Todd Underwood [00:21:00]: It's model quality. Okay, we're back. We're back to the model quality thing. Well, why don't we just measure this? Because I told you it was important. Why don't we just measure this? Because it's hard to measure, imprecise, too slow, and doesn't give us causal attribution. So, hard to measure: nobody agrees on what good quality measures are.

Todd Underwood [00:21:17]: It depends on what you're doing. Obviously. There's been, like, lots of good talks about what that is. Too slow. Most of these evaluations don't happen in milliseconds or seconds. They happen later. They happen sometimes in minutes or hours. No causal attribution.

Todd Underwood [00:21:33]: Like, if I have machines running slow, or a training system running slowly, or a data corruptor, I can't find which one if, later in the day, the model is not that good. All I know is, well, something went wrong in here. There was some stuff over here and some of it went wrong. That's not great. It's not really great. So, okay. Hopefully I convinced you of almost nothing.

Todd Underwood [00:21:56]: No, hopefully I convinced you. ML training systems availability is a quality problem. Quality can be, and often is, a systems problem. And that measurement can be pretty simple. And I hope you have no tomatoes. Are there any questions?

AIQCON Male Host [00:22:12]: Thank you very much, Todd. That was fantastic. If there is one thing I remember from this conference, it is that quality means reliability and vice versa. So we have time for one or two questions. Make it really, really good.

Todd Underwood [00:22:23]: No, don't. Don't put pressure on them. They don't have good questions. No. Come on. Anybody?

AIQCON Male Host [00:22:29]: Anybody?

Todd Underwood [00:22:30]: I have stunned you in a state of submission. This was my goal.

AIQCON Male Host [00:22:36]: Somebody ask a really good question.

Todd Underwood [00:22:39]: You can't put pressure on people like that. No.

AIQCON Male Host [00:22:45]: All right.

Todd Underwood [00:22:45]: Yes.

Q1 [00:22:48]: So we have an AIOps SRE team that's focused on kind of building out, just starting on, how to even design site reliability for an LLM product. Do you have any recommendations on where to get started?

Todd Underwood [00:23:02]: Do you mean using AI to manage production? When you say AIOps, is that what you mean? Yeah, yeah.

Q1 [00:23:07]: I think just, like, building reliability around an LLM-based product.

Todd Underwood [00:23:12]: Oh, like, so incorporating LLMs into some other product and wanting to measure the reliability. Yeah, this is super tough, because the models are not idempotent. Like, you give the same query and you get different answers back at different times. And if you're going to be trusting those, I think that, as some of the stuff people have talked about here suggests, start with measuring the basics. Like, were you able to connect to it, and did you get a response? It sounds dumb, but that's the first thing. I would sample some of the request-response pairs if your privacy policy allows for it, and I would let people flag bad responses somewhere and accumulate those. And I think, unfortunately, you just have to go to a sampling approach plus human eval.
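As a rough sketch of that "basics first, then sampling plus human eval" suggestion, here is a minimal illustration; the function names, sample rate, and record fields are hypothetical, not a description of any particular product's setup.

```python
# Minimal sketch: log basic availability, sample request/response pairs
# (privacy-gated), and accumulate human-flagged bad responses for eval.
# All names and the sample rate are hypothetical.
import random
import time

SAMPLE_RATE = 0.01                  # keep ~1% of traffic for human review
availability_log: list[dict] = []
sampled_pairs: list[dict] = []
flagged_responses: list[dict] = []

def record_interaction(prompt: str, response: str | None, latency_s: float,
                       privacy_ok: bool) -> None:
    # 1. The basics: did we connect and get any response at all?
    available = response is not None
    availability_log.append({"ts": time.time(), "available": available, "latency_s": latency_s})

    # 2. Sample request/response pairs only if policy allows it.
    if available and privacy_ok and random.random() < SAMPLE_RATE:
        sampled_pairs.append({"ts": time.time(), "prompt": prompt, "response": response})

def flag_bad_response(prompt: str, response: str, reason: str) -> None:
    # 3. Let users or reviewers flag bad answers and accumulate them for human eval.
    flagged_responses.append({"prompt": prompt, "response": response, "reason": reason})
```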

