MLOps Community

The Art and Science of Training LLMs

Posted Mar 22, 2024 | Views 600
# LLMs
# MosaicML
# Databricks
Bandish Shah
Engineering Manager @ Databricks

Bandish Shah is an Engineering Manager at MosaicML/Databricks, where he focuses on making generative AI training and inference efficient, fast, and accessible by bridging the gap between deep learning, large-scale distributed systems, and performance computing. Bandish has over a decade of experience building systems for machine learning and enterprise applications. Prior to MosaicML, Bandish held engineering and development roles at SambaNova Systems where he helped develop and ship the first RDU systems from the ground up, and Oracle where he worked as an ASIC engineer for SPARC-based enterprise servers.

Davis Blalock
Research Scientist @ MosaicML

Davis Blalock is a research scientist and the first employee at MosaicML. He previously worked at PocketSonics (acquired 2013) and completed his PhD at MIT, where he was advised by John Guttag. He received his M.S. from MIT and his B.S. from the University of Virginia. He is a Qualcomm Innovation Fellow, NSF Graduate Research Fellow, and Barry M. Goldwater Scholar. He is also the author of Davis Summarizes Papers, one of the most widely-read machine learning newsletters.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


What's hard about language models at scale? Turns out...everything. MosaicML's Davis and Bandish share war stories and lessons learned from pushing the limits of LLM training and helping dozens of customers get LLMs into production. They cover what can go wrong at every level of the stack, how to make sure you're building the right solution, and some contrarian takes on the future of efficient models.


Davis Summarizes Papers newsletter signup link:

Bandish Shah 00:00:00: Hey everyone. My name is Bandish. I work on deep learning at Databricks and I take my coffee as a shot of espresso in the morning.

Davis Blalock 00:00:09: Hey, I'm Davis. I'm a research scientist at Databricks and my caffeine of choice is green tea.

Demetrios 00:00:17: Welcome, welcome. We are back with another MLOps Community podcast. I am your host, and today we get two for the price of one. Davis and Bandish are both doing some really cool stuff when it comes to training large language models at Mosaic. And if you are into the research side of things, or you are at least a little bit curious like myself, this is the episode for you. They break down all the different places that you can fail. We don't really like to talk about what is glittery and gold and all that. We're not going to tell you in these episodes that AGI is coming. Get ready.

Demetrios 00:01:04: We better get universal basic income. We're talking about how these training runs can fail, and what a conversation this was. There are so many different ways that you can look at training LLMs and the pains that these deep learning researchers go through on a day to day basis, because as Davis put it, you can fail in a million different ways. And he broke down just a handful of some of these ways. And a lot of times you don't even realize that your models are failing. You don't even know that the model training is not going as planned. You just see all of a sudden things start acting funny, and then you have to go debug. And maybe you have to debug on the hardware side, maybe you have to debug because there's some breaking changes in PyTorch that you didn't even realize, or there's just another type of framework that you're using and there's been a breaking change.

Demetrios 00:02:10: He was mentioning how much of a headache all these breaking changes are and how you have to keep up with that. And the act of trying to stay stable and use one version of PyTorch, or whatever it is that you're leaning heavily on, is a bit of a myth, because you want the newest and baddest features that come out in the latest PyTorch release. But guess what? Breaking changes. So that's another area it can fail. It can fail just because you don't really understand what business problem you're going after when you're training this model. And so these guys get down and dirty with us on all the different things that they've seen in their years of experience while training models. I also got into data quality with them and how we can ensure data quality, especially when you are training these gigantic models and you have a lot of data.

Demetrios 00:03:13: It's basically the opposite of that. Big data is dead trend. This is the big data is alive and well trend. And one quote that I particularly like from the conversation is every data set is ugly in its own way. And bandish mentioned how you got to get in there and you got to just do a little bit of manual evaluation, look at your data set. Sometimes there's ideas that, oh, yeah, a markdown file is referencing diagram one or table one. But because of the way that you did your data preprocessing, table one does not exist in that markdown file anymore. So you'll figure it out soon enough.

Demetrios 00:04:00: Just taking a few random samples of these gigantic data sets, and there's always something wrong. That is what I'm coming away with. It's just a matter of if you can catch the majority of things. And these guys at Mosaic, they have been doing it long enough to where they've got some experience and they've seen a few things. They've been around the block, we could say. So let's get into this conversation. But before we do, I want to mention that I did some analysis on the YouTube page that we have and specifically the analytics side of the YouTube page. And I have some key findings for you.

Demetrios 00:04:38: So if you are watching this on YouTube, there are about 30,000 views that we had in the last 28 days. 21,000 of those views were from non subscribers. What is this? What is going on? It takes nothing. Hit that subscribe button. We would love to see you more often because the funny part of my little analysis is that 20,000 of those 30,000 views were from returning viewers. This means lots of people are coming back for seconds, but you're not hitting the subscribe button. So come on, hit that subscribe button. And I also want to give a huge shout out to everyone that reached out to me after a few podcasts ago when I was lost on a paper.

Demetrios 00:05:31: I could not figure out what the paper was that I was trying to find, and I knew it had something to do with data leaks. And also using OpenAI and all the ways that when you call the OpenAI API, you get data leaks. And some Prague researchers had written this paper. I couldn't figure it out. So many of you reached out to me and gave me that paper. I just have to say you restored my faith in humanity. It is so nice to see that? And it's a little bit dangerous because now I might start leaning on you a little more than expected. And I'm going to do my best to first Google things before I ask the masses.

Demetrios 00:06:15: But I may ask you all a few more questions while we're at it. Last but not least, it is official: June 25, we're having the first in-person MLOps Community conference. It's going to be all about AI quality. It's going to be called the AI Quality Conference, June 25 in San Francisco. I really hope to see you there. We will leave a link to where you can get the early bird tickets right now.

Demetrios 00:06:46: Come join us. You know, we're not going to just do it like a normal conference. You can expect lots of shenanigans, that is for sure. Hopefully you did not use your learning budget up for this year yet and convince your boss that you got to go. We're going to make the tickets nice and cheap, so it's going to be an easy win. We just want to get you there. We want to have a bunch of fun friends hanging out in a room learning some stuff and maybe doing some crazy musical improvisations or some hardcore techno dance classes. You name it.

Demetrios 00:07:23: I don't quite know what the craziness is going to be yet because it's still very early, but I suggest you get in on it. June 25. We've got it all laid out. And before we go, I want to give a huge shout out to Databricks for sponsoring this episode. They've been doing some incredible stuff, as you will hear. And let's listen to our friends at Databricks and Mosaic in this conversation right now. Davis, I gotta hit you with the question, man.

Demetrios 00:08:01: I'm a fan of your newsletter. I really appreciate when you break down all these papers that I don't necessarily understand 100% of the time, but I feel like they're important and I feel smarter because I get your newsletter in my inbox. Only thing is, my inbox has been a little bit lonely since the beginning of the year, and I'm guessing that's because you either got access to a ton of new GPUs and so you've been training a lot of models, or you're taking a hiatus to come back bigger and better in the second half of 2024.

Davis Blalock 00:08:39: Yeah, so there are two reasons for that. There's an exciting one and a depressing one. The exciting one is that we've been cooking at Mosaic. I can't say exactly what yet. But basically, when the cluster costs more cash per day than you do per year, it's a little hard to justify working on other stuff. So that's kind of been bumped to the top of the priority list, to the exclusion of everything else. The more depressing reason is, I think having to crank something out every week kind of sucked the joy out of reading papers for me for a long time. Think about a task you don't want to do.

Davis Blalock 00:09:24: Like the last time you had something you really dreaded, you put it off whenever you could. Now imagine that even if you do it, it just comes back again next week. It gets a little depressing. And now imagine that every day you put it off, it gets worse, because every day more and more papers are coming in. So it used to be the case that reading papers was the one thing I could always do. No matter how sleep deprived I am, no matter what else is going on, it's always easy to sit down, focus, and read a paper. But having this big extrinsic motivation, and feeling like there are all these people almost breathing down my neck waiting for content, it just sucked all the joy out of it. And so I kind of reached the point where I needed to take a little bit of a break and rediscover the joy of learning and reading all this stuff, getting back to why I started doing it in the first place.

Davis Blalock 00:10:22: So that's what's happening. I think I would be doing it now if we weren't cooking so much, but I needed at least a few weeks to kind of reset there.

Demetrios 00:10:31: For me, it's very much like when I write the newsletter for the MLOps Community, and especially the ones where it's more like I need to add my thoughts, I need to bring in new types of ideas. I want originality in it. It is literally like, I know that feeling of procrastination where each day it just builds, and then you would think, like, more subscribers is better, and you would be excited for that. But then you see the number of subscribers growing, and then the procrastination is growing, and so that all just adds to the weight of it. And I would end up going nuts and then spending so much time on the newsletters, and specifically these MLOps roundups. And I was doing those monthly; we have our weekly newsletter that goes out, but then I would do a monthly one to kind of go over things and give my commentary on the whole scene and the ecosystem and all that fun stuff. And it's hard. So I feel you.

Demetrios 00:11:35: I can totally understand. A little break. Hopefully you'll get back at it because my IQ thanks you. And for anyone that does not know and wants to sign up for when Davis is back in the newsletter writing phase and these clusters aren't costing more than half of a team worth per day, I guess I would highly recommend subscribing to his newsletter. It's going to be in the description. It's like Davis reads papers, I think is the very unique name.

Davis Blalock 00:12:10: Davis Summarizes Papers there.

Demetrios 00:12:13: It is very clear what you're going to get on that newsletter.

Bandish Shah 00:12:17: Davis reads papers so you don't have to.

Davis Blalock 00:12:20: It's great.

Bandish Shah 00:12:21: It's fantastic.

Demetrios 00:12:22: And that is the value prop there. It's like you read a whole ton of papers and then you call out what the interesting pieces of some really good ones are. And so we're going to be talking a lot about lots of the stuff that you've been learning in those papers. But first, maybe, Davis, can you break down a little bit of how you got into tech in general, so we can see your arc before Mosaic?

Bandish Shah 00:12:55: Yeah.

Davis Blalock 00:12:56: So I feel like me getting into technology was kind of inevitable in some sense. So my granddad was an electrical engineer and grew up on a farm in the middle of nowhere in Tennessee, but just always had that inclination. Ended up being the first person in his family to go to college at the University of Tennessee. Later went on to become a professor there. He had three sons. All of them got electrical engineering degrees from the University of Tennessee. One's a professor there. Now my dad's a professor at UVA.

Davis Blalock 00:13:34: Other one is a practicing electrical engineer. Just every male Blaylock on my side of the family is an engineer of some sort. I'm actually a little bit of an outlier. Getting a degree in computer science instead of electrical engineering, solely electrical engineering for an undergrad.

Demetrios 00:13:53: And then. So you got the degree and you decided to, hey, you know what? I think this ML stuff is pretty fun. And when did you get the bug for starting to train and basically big GPUs, making GPUs go burr?

Davis Blalock 00:14:14: Yeah, it was not what I intended to do. So back in 2013, I was an undergrad at University of Virginia. I worked in a research group doing wearable devices. This was back in the days when Fitbit was going to change the world. We were going to revolutionize healthcare, all that stuff. And we had these little wearable devices.

Davis Blalock 00:14:42: Our lab had made them. We would record data from all sorts of different people. We'd do studies: identify dementia, Parkinson's, help caregivers, whatever. And it turned out that once you had all this time series data, all this longitudinal data, it was really hard to analyze it, very hard to make sense of accelerometer, gyroscope, magnetometer signals. So I ended up just kind of backing into machine learning and data mining as a result of that. I was trying to work on healthcare stuff, but it just converged to the inevitable challenges of making sense of the data. So that was what I spent a fair amount of my time on in undergrad, and then coming to grad school in 2014, it just seemed like machine learning was the field that was popping off, that had a lot of potential, and that was very deeply technical. I also considered doing a lot of human computer interaction stuff. There were groups that wanted me to come do HCI, but there was a lot of value, it seemed, in doing something deeply technical that was emerging, that had a lot of potential. So that was how I ended up signing up to do machine learning when I came to MIT.

Demetrios 00:15:59: So you wrote the paper "Multiplying Matrices Without Multiplying," which I think gained you a bit of notoriety. Can you explain that?

Davis Blalock 00:16:08: Yeah, that was a fun one. So the problem we're tackling there is approximate matrix multiplication, which is exactly what it sounds like. You have two matrices and you want to multiply them, because matrix multiplication is, of course, one of the most fundamental operations in machine learning and many other fields. But you want to do it really quickly and you're willing to tolerate some error in the output. There are a bunch of approaches to this under slightly different assumptions. We found a method that ends up letting you do this approximation without any multiplication. Sounds a little bit counterintuitive. Effectively, you have to know one of the matrices ahead of time.

Davis Blalock 00:16:52: If you know one matrix ahead of time, it turns out you can process the other ones in a way that doesn't involve any multiplication. I think it'll be a little hard to get into the details, but the keyword here to look for is vector quantization. That's kind of the core of the approach. And what's cool about it is that it is extremely power efficient, because you, one, do vastly fewer operations; you end up doing small table lookups instead of multiplies, replacing, say, four or maybe eight multiplies with one table lookup. That is both less work and is something you can implement in way fewer transistors. Just think about how many transistors it takes to build a multiplexer versus a multiplier. So my contrarian take is that binary neural networks make basically no sense.

Davis Blalock 00:17:50: They're strictly worse than vector quantization, they're strictly less expressive. They don't buy you any more hardware efficiency. They are the more obvious approach because they're just extrapolating, taking the parameters to fewer and fewer bits, but there is effectively no advantage to them. It's all going to be vector quantization in the future.

Demetrios 00:18:10: Wait, and this has been getting kind of popular recently because of the whole, I think there was a paper that recently came out that was talking about how, oh, we can do large language models in one bit. Right, and so you're saying that's going to be thrown out the window, it's just a matter of time?

Davis Blalock 00:18:29: Pretty much, yeah. There's been work on binarization of neural networks since at least 2015 ish. There's a company called Xnor AI that got acquired in 2016. So this is not a new thing. I think it's been riding the hype cycle up and down. Occasionally there's a paper that gets people excited. But yeah, I think it's an interesting proof of concept. But if you just look at the fundamentals, if you just look at the amount of information per stored bit you can get out of binary versus vector quantization, you just have strictly more expressivity with vector quantization.

Davis Blalock 00:19:12: Binarization is just a special case that kind of hamstrings you because you can't exploit any mutual information across different parameters and every parameter has to be exactly one bit. You're just really, really constrained compared to a more flexible representation that looks at the distribution of multiple values together. That's really the core idea of vector quantization. People say quantization in neural networks usually means scalar quantization, which is every input floating point weight becomes one lower precision weight. Vector quantization is you look at groups of weights and you effectively do k means and find structure in the distribution of your weights across different dimensions. So you can exploit that structure to get vastly less quantization error and you end up with usually higher entropy in your encodings because you're saying, which centroid is it? And yeah, it is just strictly more expressive while retaining the extreme low power hardware advantages. And it works today on CPUs. Like, you can already get a huge speed up on CPUs.

Demetrios 00:20:28: So bandish, don't want you to feel like we're forgetting about you over here. Of course, I know that you were doing stuff with matrices also. I mean, you had some time at Cisco just to catch everyone up on your journey until you got to mosaic. You also were working at Sambanova, so you're no stranger to this space either. And I would love to get just the quick rundown on how you found yourself working on matrices and then jumped on to mosaic and now you're leading teams there.

Bandish Shah 00:21:04: Yeah, for sure. I mean, let's see. So it's actually similar somewhat to Davis said. I started out never actually thinking I'd get into computer engineering, computer science. I wanted to work on stuff that ends up in space and ended up just. Yeah, I mean, I was just super, like, as a kid, I was the kid that took everything apart. I drove my parents nuts and constantly wouldn't shut up. My mom likes to joke that no one would babysit me because I would just keep asking questions.

Bandish Shah 00:21:33: And it was like one of those things where I think that eventually just turned into getting into engineering school. There was a million things in between. There was astronaut, there was business, there was whatever flavor of the week kind of thing. But at the end of the day, I think engineer at heart. And, yes, I ended up getting to college, declared an electrical engineering major because that's what a bunch of my uncles had done and ended up loving it and actually ended up jokingly never actually took a CS course in all of college, except for, like, maybe the closest thing was a computer architecture and digital design course, which I took my senior year. And I got involved actually working on electrical systems for a nanosatellite project. So the air Force research labs at the time had a nanosatellite program. And so I ended up kind of building all the different interfaces, and it was like this modular satellite that you could build very quickly.

Bandish Shah 00:22:30: These are like nanosatellites. They're pretty common now. Like, think about SpaceX or think about Starlink and stuff. A little smaller than that, but same kind of basic principle. And, yeah, it was like, okay, I'm going to do aerospace. I'm going to go to grad school for that, and then I'm going to work for NASA. It's been great. And that all changed, I think, my senior year when I decided to show up to a career fair super late in the last 15 minutes to got, you see what was out there.

Bandish Shah 00:23:01: And I worked it in college as my kind of school gig and ended up almost running over a former kind of coworker that I worked with. It was like a year older, and he worked at Sun Microsystems at the time, and he was carrying this giant server motherboard that they had brought as, like, a prop. And I was like, that's cool. How do I make that? He was like, give me your resume. And then about a few weeks later, interviewed at Sun Microsystems and then joined them, and then about a year and a half later got bought by Oracle. So I was building silicon. I worked on like, PCI Express, worked on memory asics, so worked on all kind of like the north Bridge, south bridge type stuff, if you speak the intel parlance, and then eventually started thinking about the main process there. And so I did that for almost the better part of a decade.

Bandish Shah 00:23:48: And actually right around the time I think Naveen Rao was starting nirvana, and it was like, oh, man. Before, if you wanted to build chips and you wanted to work in a startup that was unheard of. Everything had consolidated. You got to go to one of the big companies if you want to make a career in chip. You know, I think a lot of part, like Naveen kind of made it cool again. There was new money going into chip designs. AI was picking up, right? There were a lot of breakthroughs in deep learning happening, and it was like, okay, let's try to figure out how to build an FPGA at the time to just prototype it, right? Let's do like a quick iteration kind of hardware design on an FPGA. And so I like to joke that hammers kind of only see the world as nails.

Bandish Shah 00:24:35: And so as a hardware guy, I was like, oh, let's clock it faster. Let's take this thing that's happening in software. It's a bunch of vector instructions on an x 86 processor or something, and let's actually figure out how to build specialized hardware. And it was kind of along the same theme. So we started working on kind of prototyping these different things, and then actually ended up a team from oracle at the time, went off and founded Saminova. So I got involved in Saminova very early, and that took a bunch of research, like really just super interesting research around building coarse grain reconfigurable type architectures and creating basically a deep learning accelerator from that. So then from there, kind of went into performance and really looking at how to optimize these workloads on these specialized chips. Probably picked up a little bit of more and more software, started actually moving up the stack a bit more.

Bandish Shah 00:25:28: And now I joke that I kind of manage a python team. So I've come almost not full circle, but I've definitely come from like, have gone to the dark side, if you will, if you're a hardware guy.

Demetrios 00:25:41: Well, here's what I imagine the both of you do when you go to work. And I think we're all kind of the same age. So you can remember those memes that were popular on Facebook probably like ten years ago, where it's like what my brother thinks I do and what my mom thinks I do, what I actually do. I think you guys sit there and you check in on your training runs that have been going for the last 30 days, and you're like, everything looking good here? Let's see. Well, how can I debug this? Or we need to stop training. So that's my very naive vision of what you are doing. Give me the real breakdown on what's a day to day life in Davis and Bendisha's world.

Davis Blalock 00:26:24: Yeah, I could do that. So if you want to remember what it's like to train llms, you just have to remember there's a song written about it. It's 10% luck, 20% skills, 15% concentrated power will, 5% pleasure, and 50% pain. That's pretty much the breakdown.

Demetrios 00:26:48: Where's the pain coming from?

Davis Blalock 00:26:50: Yeah, that's a great question. So there is pain from all over the place. The lowest level source of pain is that you get a ton of hardware failures. Supposedly this doesn't happen as much with TPUs, but at least with GPUs you're going to be seeing hardware failures all the time. And when you get a hardware failure, it doesn't raise hardware failure exception that you can catch and handle in some nice way. You just get arbitrary bad behavior. Maybe your job just completely crashes. Maybe some of your nodes just slow down slightly.

Davis Blalock 00:27:30: You could even just start getting subtly incorrect outputs. All sorts of terrible things happen. So you have say, thousands of GPUs, maybe just hundreds of GPUs crunching numbers, and then suddenly something just kind of goes wrong. Usually what you'll see is a nickel timeout. Nickel is Nvidia's distributed communication library. All that means is that some node didn't respond when they were supposed to be syncing gradients or something. It's very vague. So it just means that something has gone wrong.

Davis Blalock 00:28:02: And now it's up to you to figure out what that is. You might manually run diagnostics on every single node, try to figure out what's not responding or what's working too slowly. Maybe it's just transient. Occasionally something just randomly happens, or maybe you don't encounter any of these issues. And actually there's some other problem that has occurred. For example, if you have enough nodes trying to write a checkpoint to an object store at once, you can basically ddos your object store and you will just see that your checkpoint saving or loading silently fails you can get similar issues if you have, say, a network file system. You're just putting so much load on every single piece that you're touching that any piece can fail at any time. That's the lowest level, that's the hardware, and then you have the software on top of it.

Davis Blalock 00:29:05: So because we're still in the phase of the field where we're figuring out how to do all this, and there is increasing demand for training at such large scale, the software stacks are not mature. There are breaking changes in all the Nvidia libraries. You'll get breaking changes in Pysorch, you'll get breaking changes in hugging face libraries, whatever abstraction layers you're using on top of those to try to handle large scale training stuff that will experience breaking changes. Just loading your data at scale is hard. You're going to have to have some sort of library to do that. Everything is unstable and you spend a lot of time maintaining some kind of layer to abstract it away. You're checking for different torch version numbers in your python code and monkey patching different functions. And like Pytorch, FSDP.

Davis Blalock 00:29:59: And every single one of those is something you learned the hard way. You didn't look at the code, write some clean unit tests and say, oh yes, well, of course, for torch two, two, one, we need the monkey patch to function in the following way. It's more like something horrible happened. He threw an engineer a month at it to try to get to the root cause of the issue. And eventually you found, oh my gosh. It's this incredibly subtle thing with how they're initializing the buffers or whatever, and then you'll keep at it, you figure it out and hopefully it gets better over time. So every piece of it is difficult.

Bandish Shah 00:30:31: And that's just whether the job works.

Davis Blalock 00:30:33: We haven't gotten to whether the job yields anywhere near the accuracy you want. So my view is that the worst debugging in all of computer science is debugging deep learning accuracy. It's worse than distributed systems, worse than embedded systems, because it is so hard to get signal about what's happening and so expensive. If the numbers are too low, there's no exception. There's no concrete behavior you diagnose that says, oh, well, that's null. Well, of course it's not working. No, you just get numbers that are too low with almost no indication as to what's happening. And it is extremely expensive to check what is happening.

Davis Blalock 00:31:19: So you have to be incredibly disciplined, incredibly systematic, and have in some cases just a little bit of black magic to guess what is going wrong and identify it. So that's an overview.

Bandish Shah 00:31:36: That's a huge paradigm shift. Right. And I think people struggle with this all the time in computing systems. You want a right answer, right? There's always an expected answer. And here there isn't. We've found issues where we're not even computing the right thing and the model still has good accuracy. It's like, where would that fly? Right? You can't approximately transfer bank funds. That doesn't fly.

Bandish Shah 00:32:06: That's not going to work. But here it's like, hey, oh yeah, we've been doing that wrong the whole time. You catch a bug a year later in some code that you wrote, and it's like, oh yeah, something underlying changed that broke how that behaves, and we never caught it because we didn't have a test or something. And it's like, well, when we changed it, the model performed worse. And you're like, wait, what's happening? Right? Things are constantly moving.

Bandish Shah 00:32:34: There's all these moving parts. Things are changing, and there's no right answer. Right. It's very much like when you were asking about what we do, what's our day to day? I kind of tell my family, my friends who aren't in this, it honestly sometimes feels like educating someone. Right? It's like, what do we do? We have no formula for how everyone should learn.

Bandish Shah 00:32:57: That's very clear. Everyone learns differently. Everyone excels at different topics. And it's like, what do we do? We write a bunch of tests, and we say, take your LSATs or take your MCATs or take your whatever. And if you do well on them, then you get a score. And that's what this is, right? There's no perfect science to it in any way, shape, or form. There's no algorithm that's always going to get you the right answer.

Bandish Shah 00:33:23: It's like you're kind of just trying different things and seeing how the thing responds, essentially.

Demetrios 00:33:30: Wow. So I feel for you guys. Just listening to that, I can feel the amount of experience and pain that you've had to go through. And as you mentioned, these are not mistakes that come without a price tag. There was something that I still quote Matt, one of your colleagues at Databricks, about when he came on here. He was saying how you can spend 25, 30, 40 days training a model, and it's not until the 40th day that you realize this model actually isn't that good and we can't use it. And so you've put all of these resources into it, whether that is people's time or the GPU costs or just trying to learn and make it better, and you've got effectively nothing to show for it, because it's not like, oh well, we can use this model anyway. In some cases you can't half-ass it; it's kind of all or nothing.

Demetrios 00:34:45: Is this model hitting the needs that we have or is it not? And if it doesn't, then 40 days later you're like, I guess we got to try this one again, and let's see if we can not mess it up next time. Yeah, right. So I imagine you've felt that a few times.

Bandish Shah 00:35:04: This is one of the things that we always talk to customers about, right? Even the problem of knowing that the model is meeting your need, or addressing your problem, is pretty hard. Having good evals for your specific use case is, honestly, step zero. The first question we ask is, okay, you want to train a model with us, or develop a new model with us: how are you going to know that the model is doing what you want? Are you going to use those leaderboard metrics? Is that reflective of your use cases? If you're doing code generation, are you going to use HumanEval? Is that good enough for you? Or if you have a super specific business, right, if you want to automate some call center interactions or something and you have very specific product needs, how is that going to be captured in the general domain stuff that you're getting from scraping the Internet? In those cases, how are you going to evaluate this model? So I think that's huge. We kind of say, don't even start until you've answered that question. And the problem with that is sometimes you can't answer it until you have a model to try it out with. And so it's a bit of a chicken-and-egg type scenario.

Bandish Shah 00:36:20: Davis, I know there's quite a bit of work that we also do to try to understand where we'll land. And I think it's impossible to know what capabilities you're going to get until you have the final thing. But in terms of looking at where we're going to land, one of the things that we actually look at quite a bit is scaling laws. And so we do a lot of work to try to figure out, okay, this is how we're trending. We're running evals constantly once we figure out, okay, these are the things that are important for the models that we're training. So again, it's not a 100% guarantee, but it's important to track your progress as you go. Where are you going? And that's why, again, having those evals is super important. You want a mix: you want all the general leaderboard evals, the MMLUs, the GSM-8Ks, all these other things, and then on top of that, you want your own evals.

Bandish Shah 00:37:15: We do vibe checks as well. Right. For example, if the model can't answer who it was trained by, that's something we want to make sure the model understands. So there's a lot of things that you do along the way. And I think we've gotten pretty good at trying to figure out where we're going to land. Usually it's a surprise. Sometimes you're like, wow, I can't believe it's actually this good. And so I think that's been a really positive trend.

Bandish Shah 00:37:42: I think we have to rinse and repeat that and see if we can consistently land there. But so far, we've been pretty good at kind of trying to figure out, okay, we can kind of land here and the model is going to be fairly capable.

Demetrios 00:37:55: And does the lack of consistency come from just that? Every time you're training models for customers, it's a different scenario? Or is it because it's just like you find, oh, there was a breaking change in Pytorch and we didn't recognize it and so now we got to start that again.

Bandish Shah 00:38:15: Probably a little bit of both. I mean, I think it depends, also because we're breaking it up: whether we're pre-training, or doing some more domain adaptation, fine-tuning, other things on top of that. But part of it, and it's a really hard problem, I can't emphasize that enough, is when a customer says, I have this use case, it's very easy for people to be like, oh, let me just throw AI in there. It'll automatically figure out everything.

Bandish Shah 00:38:49: And it's the holy grail.

Demetrios 00:38:52: Yeah, I've been reading some Twitter threads that say that AI could fix this, right?

Bandish Shah 00:38:56: Yeah, exactly. And a lot of folks also, I will admittedly say, I think a lot of people throw in deep learning or generative AI where you probably can just get away with some statistical ML model or something much simpler. I think, first of all, even just defining your problem is super important, and then understanding how you measure the effect of solving that problem. If you don't nail that down, then how do you know what data you're really going to need to gather and collect? How are you going to assess the quality of your data if you don't know that it's representative of the problem? Technically, as we say, we want a reasonable representation, a good distribution of the problems that we're trying to solve in our data set. Right? How do you even know that if you don't know how you're going to eval? So I keep harping on that, but that's so important. And then, yeah, things are constantly changing.

Bandish Shah 00:39:57: I think Davis was basically describing every single day for the last, I don't know, several months. For a long time we had these aspirations with PyTorch, we're like, oh, we're only going to pin PyTorch versions. We're only going to do things on stable releases. We're enterprise, right? We want good release cycles. We want to do all this. That was such a pipe dream. We had to get all the cutting-edge features, and then we found this out from other people too. Like, oh hey, we always run PyTorch nightly internally. And it's like, well, does it break all the time? Constantly.

Bandish Shah 00:40:32: We constantly have to fix it. And there's monkey patching, and so there's just all this enormous complexity that gets added on, because it's like building an airplane while it's in the air and on fire, and somehow it can't crash. And if it does, you've got to put the pieces back together and take flight again, constantly. And then root causing, that is huge, right? Getting in there and actually figuring out, okay, it was this one-line change, or it was this monkey patch that we had, or it was this version conflict. Like, we're using older CUDA kernels because we have a mismatch between the base image and what's running in the container. All these things we overarchingly call the undifferentiated stuff. A lot of people, when they want to train at scale, are like, oh, let me just try to get the highest number of GPUs at the cheapest price that I can from AWS, and I'll figure out the rest.

Bandish Shah 00:41:29: And it's like, you've only solved maybe five to ten percent of the problem, right? Getting the GPU compute has been hard because of supply constraints, but longer term, I don't think that's the thing that's really going to drive the value creation on top of this.

Demetrios 00:41:47: Yeah, even now, man, it's so funny you mention that, because I think in the last week and a half, three or four people have either come to me or I've seen posts about how, hey, we have excess or spare GPU capacity in case anybody wants some; we're looking to fill up our GPUs. And yeah, it very much reminds me of that classic paper by D. Sculley on how the model is only one part of the problem, and then you have all the infrastructure around it. GPUs are just another part of that problem. It's not like you get the GPUs and then everything is good, you're golden. You've got to really be thinking about each specific piece and how to efficiently take on each of these boxes.

Bandish Shah 00:42:34: Right? Again, I'm a hardware guy. I come from building chips. You can't just buy an x86 processor and be like, cool, I have a computer, right? There's everything else that it has to plug into. Don't get me wrong, building a chip is super hard, and it takes a lot of people and time and money and resources to do it. But again, you're only part of the way there, right? You've got to put everything together, you've got to get the software just right. And then not only are we doing that for one box, we're doing that for hundreds or thousands at this point. This is anecdotal, I mean, don't quote me on this, but we'd roughly assume about one failure every thousand to couple of thousand GPU-hours. So when you're training on 3,000 or 4,000-plus GPUs, you're going to see a failure every hour or so.

Bandish Shah 00:43:29: That's the scale. And then think about it: you're moving terabytes of data around, if not more. Resuming state takes forever, right? And so if you're failing every hour or couple of hours, and it takes you 30 minutes to resume, you're easily wasting a day of someone's salary if you don't make that stuff super fast, right? And so we call it the undifferentiated stuff, but to be honest, what I tell everyone is: do you want to spend your time there? Because it is very hard to get all that stuff right. It's like, why would you build your own PC when you can just go to Dell and buy one? There are very few instances where you want to build all of that on your own.
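
The arithmetic behind that failure cadence is simple; the numbers below use Bandish's rough anecdotal rate, not measured figures:

```python
# Rough mean-time-between-failures math for a large training cluster,
# using the anecdotal rate of ~1 hardware failure per 2000 GPU-hours.
gpu_hours_per_failure = 2000
num_gpus = 4000

# With 4000 GPUs running concurrently, the cluster accumulates 4000
# GPU-hours every wall-clock hour.
hours_between_failures = gpu_hours_per_failure / num_gpus
print(hours_between_failures)  # 0.5 -> a failure roughly every 30 minutes

# If each failure costs a 30-minute resume, the fraction of wall-clock
# time lost to restarts alone:
resume_hours = 0.5
wasted_fraction = resume_hours / (hours_between_failures + resume_hours)
print(wasted_fraction)  # 0.5 -> half the cluster time lost without fast resume
```

This is why cutting resume time from 30 minutes to a few minutes pays for itself immediately at this scale.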

Demetrios 00:44:17: Hence why Davis stopped writing his newsletter focused on this.

Bandish Shah 00:44:23: Yeah, well, I mean, part of it is that's it, right? That's the value that we create. And what I think is really different about what we're doing is we're not just doing this once for a model and then wrapping it behind an API. We're actually doing this once for ourselves. That's part of my job; I always say my team is two sides of the same coin. One side works very closely with Davis and the research team. The other side is working with the product team and our customers.

Bandish Shah 00:44:54: And I think our job is to basically figure out how we take this one thing that we've done, where we've pushed the envelope and developed these new capabilities, and then rinse and repeat that another hundred or a thousand times on the other side. Right. I've got to make customers just as successful, if not more so, than we have been training models. And so that's what we're doing. So all that pain and suffering that Davis was just talking about, we have to package that and productize it so other people don't face it. Right. Other people don't have to deal with it, and it's a really nice experience for them.

Demetrios 00:45:23: And getting that consistency, because like you were talking about, if you're out there and you're adding the value of, hey, when we do create a model, we're going to ensure that it is of quality, and that if there are these problems or these failures happening, we can at least make sure that we're the ones looking into them and you don't have to do that. I think that's a strong statement that you're making, and I do see the value in that 100%.

Davis Blalock 00:45:58: Yeah. Another surprising value proposition is that it's extremely valuable to have a complete configuration that just works. So, to your question of what happens when you spend all this time training a model and it's just not good enough at the end: that's a really big risk for people investing, often, millions of dollars in training a model. So one of the surprising value propositions we've had is the ability to just hand people a configuration, from images to hyperparameters to everything, and be able to say, okay, just train this. Just point it at your data, maybe configure it a little bit if you have, say, really strong priors about some change that needs to be made, and it'll pretty reliably work, as opposed to having to discover all of that.

Demetrios 00:46:54: It's like buying insurance for those training runs.

Davis Blalock 00:46:58: Exactly. The reliability is so huge, especially for startups. Like, you're a startup, you raise $10 million in cash and you spend 6 million training your model, that kind of has to work, you don't really get a second chance necessarily. And that's true both in terms of the ML side, like hyperparameters, and also to a lesser extent in terms of software pain. Like, I can hand you five different code bases that all work where all the tests pass and so on, but if you have to plug them together, the probability of your whole system actually working and not hitting a bunch of subtle failures and version issues is very low. So having a happy path in terms of both hyperparameters and everything else that's necessary to make it work has proven to be surprisingly valuable for a lot of our customers.

Demetrios 00:47:56: So there's another direction that I want to go into, since I've got you here and you both spend your days training models and thinking about sizes of models and architectures of models, and then inference also, once the model is out there. And I see you as understanding the whole system as opposed to just one piece. One thing that is very popular these days is a lot of people talking about how, okay, we've got the GPT-4s out there, and those are great for when I want to get an MVP. But once I have something and I see that there's value, then I want to bring it in house, and I want to make it faster and more specific for that use case. I don't need the gigantic model that can do like 50 different things. I just need a much smaller model that is going to be able to do my one or two use cases. And so I think about how people wrestle with different questions, right? And it's not just, I know there's a lot of interesting stuff, whether it's like, oh, should we do RAG, or should we fine-tune? And I'm not really even thinking about that. One question that I think about these days is, when do you say, okay, I want to train a model? Let's say that someone says, I need to train my own model, but I just need like a 1 billion parameter model.

Demetrios 00:49:33: Is that even worth it? Are you going to see a 1 billion parameter model excelling in certain use cases, if it is that very small model? Or maybe you advise: look, you know what, don't train a 1 billion parameter model from scratch. Let's take this model and distill it, so we can make it 1 billion parameters and it's going to be much faster. There are all different ways that you can get to this end goal of how do we make it the most accurate and the fastest and the smallest possible, while still hitting our needs. And so maybe you can talk about all the different ways that you think through those problems these days.

Bandish Shah 00:50:25: Yeah, maybe I'll start high level, and then Davis, you should fill in how to tune all of this. When we talk to customers, what we see mostly is that there are probably four dimensions that people really care about, and the first is privacy. Privacy is number one for a lot of enterprise customers, the most important thing. And privacy is like, yeah, I want to own the model. I'd maybe lump IP in there too, not quite, but privacy and IP: I have unique data, I am very good at solving this business problem.

Bandish Shah 00:50:58: Now, I fully subscribe to the idea that, hey, if I'm not in this game, I'm going to lose out, like missing out on the Internet in the late '90s or 2000s. And so I think people get that, and they want to figure out how to deploy it. But privacy is like: I'm in healthcare, I don't want to send my HIPAA-protected data to these REST APIs, if you will, the proprietary models, because I don't know what they're going to do with it. And sure, they can offer guarantees and all that, but there's nothing like keeping it in your own house, right? That's the main thing that's really important. The other is obviously quality, which I think is what you were getting at. And quality, obviously, especially in open source, everyone cares about: how do I get the highest quality at the smallest model size? I want to run it on my laptop. I don't want an expensive GPU or something to run it on.

Bandish Shah 00:51:45: And then it's cost and performance. And quality, cost, and performance, maybe all four of these, are not completely independent variables, right? They're all related, and you can tune this entire space. But one of the other things that we really look at is scaling laws, like Chinchilla. Chinchilla is really focused on training: for a given quality, or some metric of quality, what's the minimum amount of cost or compute that I have to spend to get there? But what those things tend to leave out, and actually someone on the Mosaic team tried to bridge these two things together, is the cost of inference, right? If you look at the difference between a 20 billion parameter model and a 30 billion parameter model, it's huge. It's actually over two times the cost in compute, right? Because it's a quadratic relationship. And so just think about a 1 billion versus 10 billion parameter model. So at the end of the day, you could over-train a smaller model and make it much cheaper to actually deploy.

Bandish Shah 00:53:02: And in order to do that, you have to understand all of it, right? You have to understand, okay, what is the cost of deployment? If I train a 500 billion parameter model, it's going to be super expensive to serve. If I need a lot of throughput, if I need a lot of users: do I need the generalizability that a 500 billion parameter model, if trained well, can offer, or can I get away with something that's a 30 billion parameter model? And then the architecture of the model helps here too, right? This is why people start getting into things like mixture of experts, where you trade off against total parameter count, but the inference cost goes down quite a bit. There are a lot of different knobs that you can tune during that, and then also looking at the amount of data that you have to collect. But yeah, Davis, feel free to dive in. I just wanted to say these are the dimensions that I think we generally see.
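
One way to see the training-versus-inference tradeoff Bandish describes is with the standard scaling-law rule of thumb that training compute is roughly 6 x parameters x tokens; the approximation and the 20-tokens-per-parameter Chinchilla heuristic come from the scaling-laws literature, not from the conversation, and the 13B example is invented for illustration:

```python
# Rule-of-thumb training compute: C ~ 6 * N * D FLOPs, where N is parameter
# count and D is training tokens. Under Chinchilla-optimal training, D scales
# with N (roughly 20 tokens per parameter), so compute grows ~quadratically in N.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params, tokens_per_param=20):
    return tokens_per_param * n_params

c20 = train_flops(20e9, chinchilla_tokens(20e9))
c30 = train_flops(30e9, chinchilla_tokens(30e9))
print(round(c30 / c20, 2))  # 2.25 -> a 30B model costs over 2x a 20B model to train

# Over-training a smaller model: spend the 30B training budget on a 13B model
# instead, buying far more tokens and a much cheaper model to serve.
extra_tokens = c30 / (6 * 13e9)
print(round(extra_tokens / 1e12, 2))  # 1.38 trillion tokens, vs ~260B Chinchilla-optimal for 13B
```

The 13B model trained this way sees over five times its Chinchilla-optimal token count, which is exactly the "over-train and deploy cheap" strategy described above.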

Davis Blalock 00:54:02: Yeah, definitely. I think when making these choices, the basic framework is the same as for rolling out any sort of software or any sort of product feature, which is: you should start with the simplest thing possible, and only if that's not good enough do you proceed to something more sophisticated. So if you can just query a third-party API and that's good enough for your use case, then you should probably just do that. Then if you have, say, privacy constraints or cost constraints or maybe latency constraints, then maybe you deploy your own pre-trained model. Just take a generic Mistral or Llama or whatever it is, and put that behind an API. And then if that's not quite good enough, okay, maybe curate some in-context examples to give it better prompts, try to squeeze more quality out of it. And if that's not good enough, maybe hook it up to a retrieval system, depending on the latency you have and how much complexity it adds to get that set up. And if that's not good enough, maybe fine-tune it on a little bit of data.

Davis Blalock 00:55:16: And if that's not good enough, swap in a bigger model, maybe pretrain your own model, do a much larger scale fine-tuning, or even continued pretraining. The size here is a little bit of an open question, but there definitely seems to be alpha in doing some sort of domain adaptation with a regular next-token prediction pretraining objective, followed by fine-tuning or something like RLHF. So you just keep adding layers as needed, rather than necessarily saying, here a priori is the exact set of features or the exact pipeline we're going to need to hit this target, unless maybe somehow you can forecast that.
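
Davis's escalation ladder can be summarized as a simple loop: try each approach in order of increasing cost and complexity, and stop at the first one that passes your evals. The quality check below is a placeholder for your own use-case-specific evals:

```python
# Try each approach in order of increasing cost/complexity, stopping at the
# first one that meets the quality bar. good_enough() is a stand-in for your
# own use-case-specific evals.
LADDER = [
    "query a third-party API",
    "deploy a generic pretrained model",
    "curate in-context examples / better prompts",
    "hook up a retrieval system (RAG)",
    "fine-tune on a little data",
    "larger fine-tune, continued pretraining, or pretrain your own model",
]

def choose_approach(good_enough):
    for approach in LADDER:
        if good_enough(approach):
            return approach
    return LADDER[-1]

# Example: suppose evals show that better prompting alone meets the bar.
picked = choose_approach(lambda a: "prompts" in a)
print(picked)  # curate in-context examples / better prompts
```

The ordering encodes the key design choice: each rung adds cost and operational complexity, so you only climb when the previous rung demonstrably fails your evals.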

Demetrios 00:56:03: So this is such a good point, and it reflects something that Sam, who we had on the podcast probably two weeks ago, was talking about. And I mentioned it to you guys before we hit record, but it was along the lines of: I haven't met a lot of people who said, damn, I wish I would have architected a more complicated system. And this comes back to that exact point, where it's just like, if you can get away with just calling an API, do it. Don't even worry about setting up any LangChain or LlamaIndex or any of that. If it's just a straight call to the API and you can get value out of that, do that, and try to go as simple as possible. And then I really like this idea of breaking down the levels of complexity that you've seen. And so you're like, all right, well, yeah, there's APIs. All right, then you can potentially bring it in house.

Demetrios 00:57:01: You can set up RAG if you want. You can set up some fine-tuning. You can try a bigger model if that's what you need, just to try and hit a little bit better accuracy. And we can't forget this idea that you're talking about, Bandish: all of that's great if you understand what the problem you're trying to solve is, because otherwise you're just shooting for something, and you don't know if it's actually that good or not, right? You're just saying, well, we could probably do better. But better against what? What is it that you're trying to solve for? Because you could always do better.

Davis Blalock 00:57:48: Right?

Demetrios 00:57:49: Yeah.

Bandish Shah 00:57:49: And some of this, by the way, translates to, I think, almost organizational complexity and sophistication. A lot of times there's just so much information out there right now about this, and it takes time to build and foster this type of expertise. We spend a lot of time thinking about the audience. So internally, when I talk to my team, it's like, Davis and the research team is customer zero. But what our internal research team does is vast: it's like what the labs are doing, the state of the art, all of that. When we expose these things as products to customers, they have to be significantly simplified and streamlined for people. We were talking about defaults earlier. I cannot overstate the value of a good default; it's massive.

Bandish Shah 00:58:50: I would guess if I looked at all of the configs that we see, and I just diffed them all and looked at how many changes there were, there'd be a few. You could have hundreds of lines in a config and probably see maybe a dozen lines change, because people don't change that much stuff. Right. I frequently hear, when I chat with customers, hey, we did all these sweeps, it was super expensive, but then we just used your defaults, and they gave the best convergence that we were able to get. And it's like, yeah, because we did that for you. We did all these sweeps, and we've tried to find good defaults.

Bandish Shah 00:59:27: Right. But that's another way of breaking down that complexity. But I would say exactly what you're saying, right: if I'm just prototyping, if I'm trying to understand my problem, don't train a model. Just start with something out there, and only do it if you think that solving the problem necessitates that customization. And I think what we see with the most successful customers is that they start small. We've written about Replit; we've done several different customer stories out there. They start small.

Bandish Shah 01:00:02: They start at like 1 billion, 3 billion, 7 billion parameter models. They work their way up. Internally, we do that too. That's how we manage complexity. We do a bunch of small gating runs to get things to a place where, okay, we can now predictably understand: hey, if we were to double the parameter count, where would we land? Right. And I think as you continue to build that expertise as a team, you can take on more of these challenges. Right. And then I hope, also from the complexity standpoint, the tooling gets better.

Bandish Shah 01:00:31: We're building different techniques that people can start using and deploying at scale to actually improve the quality in several of those dimensions that we were kind of talking about. But I think, yeah, this all comes down to managing complexity. You don't have to go train a hundred billion parameter model on day one. Don't be that person who goes to your CTO or CIO or whatever and is like, yeah, we're going to do this because if you don't understand it, you're just going to go waste like millions of dollars, basically.

Demetrios 01:01:02: Yeah, you're not going to be in that job for too much longer, that's for sure. The one thing that I also wanted to dig into is the complexity on the data side, and just how hard it can be with the data pipelines. Because we all know, and we've heard it a million times: oh, data is your moat, right? And so you still have a lot to do on the data side just in general, I think, and a lot of times that can be overlooked. So maybe, Davis, have you seen good ways of data pipelines being efficiently thought out? Or where are there bottlenecks, areas that fail? All of that fun stuff, too.

Davis Blalock 01:01:52: Yeah, there are so many pitfalls and potential failures here. A lot of it is just true of all of machine learning and data science: data engineering is hard, getting good data is hard, cleaning data is hard. And a lot of this holds independent of whether you're training LLMs or doing any sort of large scale machine learning. In the context of LLMs, you do hit a few extra problems that you may not have seen if you've been doing more traditional data science. One that sticks out is tokenization. So we don't train models on the raw UTF-8 or whatever encoding you have in your documents.

Davis Blalock 01:02:42: We split the documents up into what we call tokens. So, like, the suffix "tin" might be a token, or "the" is probably a token. It's an algorithm for splitting text into unique identifiers. And that process has many subtle pitfalls. Many tokenizers, for example, allow you to generate completely invalid bytes, so you might have outputs that are just not UTF-8 at all. You will also get weird coalescing of stuff like whitespace with other characters. So maybe colon isn't a token, but colon-space is a token. We've had issues, for example, when trying to get the Falcon models to do math, where we just could not get them to produce anything useful.

Davis Blalock 01:03:38: It was just garbage. And then we realized, oh wait, we need a space at the end of the problem. And then everything started working. And that's not even that uncommon. That's not some weird quirk of the Falcon models. We've had evals that work this way. We've had bug fixes in our evaluation programs that gained ten or 20% accuracy once we fixed them. So there are all sorts of pitfalls there.
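
A toy greedy tokenizer makes the whitespace-coalescing pitfall concrete. The vocabulary below is invented for illustration; real BPE tokenizers are more involved, but the failure mode, where a trailing space changes which tokens the model sees, is the same:

```python
# Toy greedy longest-match tokenizer illustrating the whitespace-coalescing
# pitfall: "= " (with trailing space) is a vocabulary entry, so a prompt that
# ends without the space produces a different final token than one that does.
VOCAB = ["= ", ": ", "2", "+", "=", ":", " "]

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Greedily take the longest vocab entry matching at position i;
        # fall back to the raw character if nothing matches.
        match = max((t for t in VOCAB if text.startswith(t, i)),
                    key=len, default=text[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("2+2= "))  # ['2', '+', '2', '= '] -> '= ' is one token
print(tokenize("2+2="))   # ['2', '+', '2', '=']  -> bare '=' is a different token
```

A model trained on text where answers always follow the "= " token may produce garbage when prompted with a bare "=", which matches the Falcon behavior Davis describes.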

Davis Blalock 01:04:10: There are also a bunch of challenges associated with even loading your data at scale. So if you've only operated at a relatively small scale, you might be used to just loading your CSV file or whatever; everything just runs on your laptop, you just point your Spark program at a Parquet file in an S3 bucket, whatever. When you have a sufficient number of nodes loading a sufficient amount of data, you hit a whole bunch of weird requirements around deduplication. You can't have different nodes request the same data, or you're going to be requesting the same data 2,000 times, and then your egress fees might actually matter if you're in a remote object store, or it's just incredibly slow. You also need good shuffling, or it turns out you get subtle convergence issues.

Davis Blalock 01:05:03: Like, you ever see your loss curves going up and down? Probably you have bad shuffling. We call those runs wavy boys. There are also issues with resumption. If you want to load data at scale, and your job crashes and you want to restart it, well, okay, are you going to redownload your data set from scratch? What happens if you want to resume on a different number of nodes? Does that completely screw up your shuffling? There are a whole bunch of issues along those lines that make it such that you kind of need to use some library for this. We have one; I know other people have attempted to write them. But you'll hit a whole bunch of subtle pitfalls here that you may not be used to thinking about.
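
A minimal sketch of the ideas behind such a library: a seeded shuffle so every node agrees on the global order, strided shard assignment so no node downloads another's data, and a resume offset so a restart skips already-consumed shards. This illustrates the concepts only, not any particular library's API:

```python
import random

# Simplified distributed data loading: deterministic, deduplicated, resumable.
def node_shards(all_shards, node_rank, num_nodes, epoch_seed, start_index=0):
    # Same seed on every node -> everyone agrees on the global shuffle,
    # with no coordination traffic needed at runtime.
    order = list(all_shards)
    random.Random(epoch_seed).shuffle(order)
    # Strided assignment: node k takes shards k, k+num_nodes, ... so the
    # nodes' shard sets are disjoint (no duplicate downloads, no extra egress).
    mine = order[node_rank::num_nodes]
    # Resume support: skip shards already consumed before the crash instead
    # of redownloading and replaying everything from scratch.
    return mine[start_index:]

shards = [f"shard_{i:03d}" for i in range(8)]
a = node_shards(shards, node_rank=0, num_nodes=2, epoch_seed=42)
b = node_shards(shards, node_rank=1, num_nodes=2, epoch_seed=42)
print(sorted(a + b) == sorted(shards), set(a) & set(b))  # True set() -> full coverage, no duplicates
```

Note that changing `num_nodes` changes the stride, which is exactly why resuming on a different number of nodes can "screw up your shuffling" unless the library accounts for it.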

Demetrios 01:05:47: And how much of these translate over to. All right, cool. I want to fine tune a model.

Davis Blalock 01:05:55: A lot of it is a matter of scale rather than pretraining versus fine-tuning, because fundamentally, pretraining versus fine-tuning is kind of just a different loss function. Unless maybe you're doing RLHF; then you're using PPO, and things get a little more strange because there's a reward model. But I think very few people are actually doing RLHF at this point, at least on their own. So it's more a matter of scale. If you're training something on two nodes, and you have redundant downloads, and your dataset isn't that big, you can just redownload it every time the job crashes and dodge a lot of this complexity. You don't have to worry about a 30- or 60-minute ramp-up time when your cluster doesn't cost thousands of dollars per hour. But, yeah, you'll hit a bunch of barriers as you scale up.
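To make "just a different loss function" concrete: pretraining computes next-token loss on every position, while supervised fine-tuning typically masks the prompt so only response tokens contribute. A sketch of the label construction, using the common convention of an ignore index for masked positions; the token ids are made up.

```python
IGNORE_INDEX = -100  # common convention: positions with this label are excluded from the loss

def build_labels(prompt_ids, response_ids, fine_tune=True):
    """Concatenate prompt and response into one sequence. For fine-tuning,
    mask the prompt positions; for pretraining, every token is a target."""
    input_ids = list(prompt_ids) + list(response_ids)
    if fine_tune:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels
```

Everything downstream (the forward pass, the optimizer, the distributed training stack) is the same in both regimes, which is why scale, not the objective, is where the difficulty lives.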

Bandish Shah 01:06:51: I'll put this another way. It was interesting for me. I was blissfully naive about data in general until, first of all, we scaled, and I don't just mean scaled in terms of computing infrastructure; we had to scale our token counts to pretrain useful models. But also, then, this little company called Databricks acquired Mosaic, and that brought a new type of scale. Managing data at scale and doing AI at scale, there are similarities, there are differences, but the thing that drove it home for me was that we had these data pipelines doing things like deduplication and filtering, and some of them would take hours, maybe days, depending on what we were doing. When we brought them over, we basically went to the horde of field engineering and Spark experts at Databricks and said, help us, how do we make these faster? And we got pipelines that took days down to a couple of hours. You're like, wow, and we didn't start out as data engineers. It was one of those things where, okay, if you use the right tool for the job, if you use the system that's built to process data at scale, that's incredibly important.

Bandish Shah 01:08:15: And driving it back to enterprises: a lot of customers have a lot of data. So there's that whole ETL processing pipeline you have to build, this whole problem of getting that data and then whittling it down to the super high quality data, in the right format, that you actually need for different types of training. Right? Pretraining will just take all the text you can feed it, right? But then you have things like instruction fine-tuning, where now you need to craft prompt-response pairs. You want to add instructions on top of things. You might want to augment your data. You might want to create synthetic data to get your token counts up. So there's this whole sophisticated pipeline emerging that also has to happen at the scales we're talking about to actually train a really high quality model, right? That's why this was such a match made in heaven. There's this gravity to data, and because of the scale of data it's interesting to ask: is the data going to migrate to the training compute, or is the training compute going to migrate to the data? It kind of turns out we just have to build a bridge between the two.
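A sketch of the "crafting prompt-response pairs" step mentioned above: raw records get rendered into a fixed instruction template before tokenization. The Alpaca-style template here is just one common choice for illustration, not a format the speakers describe.

```python
# Illustrative instruction template; real pipelines vary this per task.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def render(record):
    """Turn one raw {'instruction': ..., 'response': ...} record into
    the text the model is actually trained on."""
    return TEMPLATE.format(**record)

example = render({
    "instruction": "Summarize the quarterly report.",
    "response": "Revenue grew 12% quarter over quarter.",
})
```

The hard part at scale is not this formatting step itself but running it, plus deduplication, filtering, and synthetic-data generation, over billions of records, which is where the Spark-style data infrastructure comes in.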

Bandish Shah 01:09:37: The gravity of both is massive, and that bridge is what Davis was alluding to. We have this data loader that we're now working on really super-optimizing to move data between these data processing clusters and our training clusters.

Davis Blalock 01:09:53: Right.

Bandish Shah 01:09:54: So I think getting all of that right is, again, one of those things. We say maybe it's the undifferentiated stuff, but it's the super hard stuff at the end of the day. And when it works, it's like magic.

Demetrios 01:10:04: And you guys are lucky. I think you have probably some of the best spark engineers in the business that you can lean on. I imagine a lot of people wish that they had that type of person that they can just go and ping for questions on the data engineering side. Yeah. I'm also thinking about, though, when it comes to data quality, we hear that term being thrown around so much. And when you're at this scale that you're talking about, how do you even deal with data quality? How can you ensure that the data quality is, like, the highest, purest form?

Davis Blalock 01:10:48: It's super hard. I think the core of the answer is, look at your data. Sadly, there's really no substitute for that. You'll find all sorts of horrific things happening. Like, you might find that all of the tables in all of your markdown files got stripped away by whatever preprocessing you have. So the text will reference table one, and there just is no table one. You'll see all sorts of horrific things like that.
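One cheap, automatable version of the "look at your data" check in the example above: flag documents whose text references a table that preprocessing stripped away. The heuristic below (pipe characters or a `<table` tag as evidence of a surviving table) is illustrative and would need tuning per dataset.

```python
import re

def references_missing_table(text):
    """True if the text cites 'Table N' but contains nothing that
    looks like an actual table -- a sign preprocessing dropped it."""
    refs = re.findall(r"\btable\s+\d+\b", text, flags=re.IGNORECASE)
    has_table = "|" in text or "<table" in text.lower()
    return bool(refs) and not has_table
```

Checks like this only catch failure modes you already know about, which is why random manual sampling still matters.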

Demetrios 01:11:18: And it's also hard manually looking at your data.

Davis Blalock 01:11:21: To some extent. If you're randomly sampling, there is, at least initially, probably enough broken stuff that you will find errors if you sit and look at your data. There is also some amount of stuff you can do at scale, but automating it is awfully difficult, because every dataset is ugly in its own way. Every dataset needs to be cleaned in its own way, and what good data even means is going to depend on your exact problem. So there's some stuff you can do: you put in some filtering heuristic, that gives you a subset of your data, then you train a simple model on it, and maybe if the number is higher than it was with a different subset of your data, you conclude that your filtering heuristic was helpful. But even that can be very difficult, because one of the things we've seen, and other groups have seen, is that you don't get that much signal on data quality until you're operating at a pretty large scale.

Davis Blalock 01:12:22: You need pretty good models in many cases for data quality and/or quantity to become the bottleneck. So it's not something where you run it on your laptop in ten minutes and say, oh, the SVM got 93 percent accuracy instead of 91, and we trust that as a reliable signal. It's more like, well, okay, that was a GPU-week of time on a 7-billion-parameter model, and it kind of seems like it did better. And if you throw enough compute and enough experiments at it, you kind of get a feel for what's working. But the only stuff you can really be confident in is usually when there's just an outright problem that you can programmatically fix, like all the tables getting dropped.
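A sketch of the filtering-heuristic half of the loop Davis describes; the thresholds are made up, and the expensive step (training a proxy model on each subset and comparing) is deliberately left out.

```python
def heuristic_filter(docs, min_words=50, max_symbol_frac=0.30):
    """Keep documents that are long enough and not dominated by
    non-alphanumeric characters (likely markup or extraction debris).
    Illustrative thresholds only; real values are dataset-specific."""
    kept = []
    for doc in docs:
        if len(doc.split()) < min_words:
            continue  # too short to carry much signal
        symbols = sum(1 for c in doc if not c.isalnum() and not c.isspace())
        if symbols / max(len(doc), 1) > max_symbol_frac:
            continue  # probably markup/code debris
        kept.append(doc)
    return kept
```

In practice you would train a small model on `heuristic_filter(corpus)` versus an unfiltered or differently filtered subset and compare downstream evals, which, per the discussion above, only gives reliable signal at fairly large scale.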

Bandish Shah 01:13:06: Yeah. Davis talked about the horrific stuff that happens when you're doing processing, but there are also just horrific things in the datasets themselves. I can't tell you. Please pay your data engineers more, because they have to go through this stuff. Sometimes you just have to imagine getting into the depths of the Internet. Right? There's just some stuff in there that.

Demetrios 01:13:30: Just the depths of Reddit.

Bandish Shah 01:13:31: Yeah. These poor folks have to go through and figure out what to do with it.

Demetrios 01:13:36: So funny. We need to create a Data Engineer Appreciation Day, and we probably need two of those a year. It's not enough to just have one. They really have to have two. Fellas, this has been so awesome. I really appreciate the both of you coming on here and schooling me on this and taking the time to break this down. It was fascinating from my side. I love what you all are doing at Mosaic. I know that you've got all kinds of fun stuff lined up. Denny, our mutual friend who also works at Databricks Mosaic, keeps dropping breadcrumbs about how you're coming out with some really fun stuff really soon.

Demetrios 01:14:23: But he doesn't share anything else, which is killing me. So hopefully we will be able to show off what you all create in the near future.

Bandish Shah 01:14:32: Yeah, we can't wait to share it with you, and thanks for having us. This was so much fun.

Davis Blalock 01:14:36: Yeah, thanks. A ton of fun.

Demetrios 01:14:40: Yep. And, Davis, I'll be patiently waiting for the GPU training jobs to wind down so you can get back onto that newsletter.

Davis Blalock 01:14:49: Yeah, definitely.

Bandish Shah 01:14:50: That's how you know something big is coming.

Demetrios 01:14:52: Yeah.

Bandish Shah 01:14:53: When Davis puts it out, you're like, oh, man, Mosaic, Databricks is going to drop something here.

Demetrios 01:14:59: This. He got a rest.

Bandish Shah 01:15:03: Classic.

Demetrios 01:15:04: Well, thanks, fellas.
