Introducing DBRX: The Future of Language Models // [Exclusive] Databricks Roundtable
Davis Blalock is a research scientist and the first employee at MosaicML. He previously worked at PocketSonics (acquired 2013) and completed his PhD at MIT, where he was advised by John Guttag. He received his M.S. from MIT and his B.S. from the University of Virginia. He is a Qualcomm Innovation Fellow, NSF Graduate Research Fellow, and Barry M. Goldwater Scholar. He is also the author of Davis Summarizes Papers, one of the most widely-read machine learning newsletters.
Bandish Shah is an Engineering Manager at MosaicML/Databricks, where he focuses on making generative AI training and inference efficient, fast, and accessible by bridging the gap between deep learning, large-scale distributed systems, and performance computing. Bandish has over a decade of experience building systems for machine learning and enterprise applications. Prior to MosaicML, Bandish held engineering and development roles at SambaNova Systems where he helped develop and ship the first RDU systems from the ground up, and Oracle where he worked as an ASIC engineer for SPARC-based enterprise servers.
Abhi is an NLP architect at Databricks, helping organizations build their own LLMs. He joined as part of the MosaicML team and previously worked as a researcher at Cerebras Systems.
Ajay is an engineering manager at Databricks leading the GenAI training platform team. He was one of the early engineers at MosaicML (acquired by Databricks) where he first helped build and launch Composer (an open source deep learning training framework) and afterwards led the development of the MosaicML training platform which enabled customers to train models (such as LLMs) from scratch on their own datasets at scale. Prior to MosaicML, Ajay was co-founder and CEO of Overfit, an online personal training startup (YC S20). Before that, Ajay worked on ML solutions for ransomware detection and data governance at Rubrik. Ajay has both a B.S. and MEng in computer science with a concentration in AI from MIT.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
DBRX is designed to be especially capable across a wide range of tasks and outperforms other open LLMs on standard benchmarks. It also promises to excel at code and math problems, areas where others have struggled. Our panel of experts will get into the technical nuances, potential applications, and implications of DBRX for businesses, developers, and the broader tech community. This session is a great opportunity to hear from insiders about how DBRX's capabilities can benefit you.
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/
Demetrios [00:00:06]: What is up, everyone? Welcome, welcome. What a day to be alive. We are here with the Databricks team. They've got some huge news that just came out, as you probably saw last week. And after we saw it, we said, you know what? Let's have a roundtable session about this, because there is too much news to just let it slide by. I would like to bring out the Databricks team, but I'm going to do it in true MLOps fashion, and I am going to sing a little song about them to make us start off on the right foot. Hopefully you all are ready for this.
Demetrios [00:00:51]: You may not know what you're getting yourselves into, but let's get it rocking. If anyone has any key that they want me to sing this intro song for the Databricks team in, speak now or forever hold your peace. And if you want to prompt me, throw it in the chat. I am listening. Of course, C sharp is going to be the go-to one. Here we go. C sharp minor. Can you all hear that? Ooh.
Demetrios [00:01:23]: Databricks is releasing models for the open source community. We got em here to talk with me. I wanna know, what do you call this model? And to answer that question, we've got my good friend Denny Lee. Oh, where are you at, Denny? Hey.
Denny Lee [00:02:06]: Hello, hello, hello. Great C#. Thanks very much, Demetrios. Always glad to be hanging out with you, like usual. So if you don't know who I am — hey, you did the intro, so I'll do it real quick. Denny Lee, apparently a developer advocate at Databricks. Demetrios and I have been guilty of doing many fun things together, and we're back at it again. That's probably all you really need to know.
Demetrios [00:02:32]: What do we call this model, dude? You've got a big LLM out in open source land, and I think I pronounced it earlier and you said, no, that's not what it's called.
Denny Lee [00:02:44]: That's right. It's okay. Now, of course, the name is DBRX, but I'm trying to make it a thing where we all say DaBaRex.
Demetrios [00:02:52]: Okay?
Denny Lee [00:02:52]: So that's the idea. It's gonna stick. I want DaBaRex to stick.
Demetrios [00:03:00]: Luckily for us, we've got some of the people who actually worked on DaBaRex, and over the next hour or so we're gonna talk with the actual talent behind it. Look at this. DaBaRex just popped on the screen. Oh, my God. There we go. First up, I'm going to bring onto the stage AJ. Where you at, AJ? There he is.
Demetrios [00:03:25]: What's going on, dude?
Denny Lee [00:03:27]: Hello.
Demetrios [00:03:28]: Hello. What have you been up to?
Ajay Saini [00:03:33]: So I guess first, a quick introduction. I'm AJ. I'm the engineering manager for the generative AI training platform team at Databricks. And, yeah, honestly, for the past few months it's been a lot of cranking away, making this model train — largely running all of the backend infrastructure for training it, making sure it works well, making sure the model is actually going.
Demetrios [00:03:56]: You've had your work cut out for you. So I'm going to keep bringing the team onto the stage. Bandish, I know you're in there somewhere. There he is. What's up, dude?
Bandish Shah [00:04:05]: How's it going, Demetrios? Glad to see you again.
Demetrios [00:04:07]: You too, man. So give us the rundown. What have you been up to?
Bandish Shah [00:04:11]: Yeah, I mean, I think since what we talked about two, three weeks ago...
Demetrios [00:04:15]: Caught up on sleep a little bit?
Bandish Shah [00:04:16]: A little more gray hair after this. But yes, my name is Bandish. I'm another engineering manager at Databricks, and my team basically builds all the training tooling — Composer, LLM Foundry, Streaming datasets, for those folks following our open source tools — that we used to train DBRX, or DaBaRex.
Ajay Saini [00:04:36]: I guess that's what we're calling it.
Bandish Shah [00:04:38]: We're trying to figure out what to call it. We're trying a few different things on for size.
Demetrios [00:04:41]: Yes. Well, I'm excited to talk about it. We got Abhi. Where you at, dude? Where you at? There he is.
Abhi Venigalla [00:04:48]: Hey, great to be here. My name is Abhi. I'm an architect on the NLP team. I've been helping lead a lot of the modeling choices and development around DBRX, and spent a crazy couple months on everything from pre-training data prep to all the post-training as well. It's been a huge team effort. Yeah, super excited to be here.
Davis Blalock [00:05:08]: Excellent.
Demetrios [00:05:09]: Well, we've also got Davis coming on in just a second. We got to kick it off, man. Denny, can you lead us through the sequence of events that has happened over the last month when it comes to DaBaRex?
Denny Lee [00:05:22]: Well, actually, no. In all seriousness, that's a great question, but I would rather have Bandish, AJ, or Abhi actually take it, because they were integral to its creation. I'm just the guy who talks a lot. That's all I am in this case.
Demetrios [00:05:38]: Okay.
Denny Lee [00:05:38]: So I want these guys to go ahead — I'll ask them some good questions and provide some color commentary where I can help. But when it comes to the actual model itself, let's have these guys talk about it. So choose among yourselves who wants to talk about it first.
Bandish Shah [00:05:54]: Oh, boy. I vote Abhi. I think Abhi's going to have a great rundown.
Denny Lee [00:05:58]: Oh. All right. Let's go, Abhi, then.
Abhi Venigalla [00:06:00]: Yeah, sure. Sure thing. So. So what would you like to know?
Denny Lee [00:06:03]: Oh, the context is more like: how did we go about actually creating the model in the first place? The story goes back to the latter part of last year. What we published in the blogs is that it was two or three months, depending on which blog it was, for its creation. So, yeah, a little bit about that. Maybe we can even start with the MegaBlocks paper, or any particular research you want to get into. I think that'll naturally lead into Bandish talking about Foundry, Composer, things of that nature.
Denny Lee [00:06:38]: So what was that?
Abhi Venigalla [00:06:40]: Yeah, so I think when we look back at last year — originally, a lot of us came from the Mosaic team, and our goal was to really help people build and train their own models and build the infrastructure around that. We got to this point where we showed that you can actually train up to GPT-3 quality models for a relatively low price. And as we looked ahead, especially after we got acquired by Databricks, we really wanted to see: can we push the quality to the next level? Can we get to 3.5 and beyond? And what were the different things that needed to go into that? A lot of it came down to obviously needing to scale up to bigger models, but to do so in an efficient way, we really dug into this new mixture-of-experts architecture — which I'm sure we can talk a lot about — which really helps us train higher quality models without increasing the cost too much. That involved a lot of both research and engineering: figuring out how to make these models converge, figuring out how to actually train them at high performance on big clusters. We spent a lot of time digging into the research, and then I'd say right around Christmas time is when we decided, hey, we actually want to go for it. We want to try and build a new model that's way more powerful than our previous MPT series. And that really kicked off this effort. So the past three months or so have been just round-the-clock research and engineering.
Abhi Venigalla [00:07:54]: Trying to get DaBaRex out. So.
Demetrios [00:07:57]: Good. So we have a new person that joined us. Davis, can you give us a quick intro before we dive deeper into this? I know it's like we're taking one step back and two steps forward, but we're going to get there.
Davis Blalock [00:08:11]: One step back in, two steps forward. Sounds exactly like training a huge model. So it's pretty fitting.
Demetrios [00:08:17]: So, yeah, like we're on point.
Davis Blalock [00:08:19]: Yeah.
Abhi Venigalla [00:08:19]: So.
Davis Blalock [00:08:19]: Hi, I'm Davis. Good to be back here. I am a researcher on the Databricks Mosaic AI team, formerly MosaicML.
Demetrios [00:08:32]: So I'm always interested in hearing — and Denny, feel free to jump in if you have any thoughts on this, too — but when it comes to training one of these models and thinking about it, you did a lot of testing beforehand, right? And Abhi, you were just talking about how, at some point, you were like, okay, we've tested enough, now we think we can go for it. You have to make that gigantic investment into training the model. You have to be pretty confident that what's going to come out on the other side is going to be useful and going to make an impact. What are some things that gave you all the security to know that you can do this? If you're going to invest this time and effort and resources into it, it's going to be good.
Abhi Venigalla [00:09:23]: Yeah. So I think there are both the research and the engineering risks. I can talk about the research, and maybe Bandish or AJ, you guys can talk about the engineering stuff too. The basic thing we try to do is really nail down the recipe for modeling: the architecture, the choices of how much data to use versus the number of parameters. We do a lot of these tests at smaller scale, and we try to basically benchmark ourselves against other existing models out there and our internal benchmarks, our previous MPT models, and we try to see that we're hitting the same quality as those other benchmarks but with much less compute and money than before. We do a lot of those experiments at smaller scales, and eventually we start scaling it up and making sure that we're actually hitting the quality we expect. And that's what gives us the confidence that, hey, now we're going to really crank it up and go ten x farther than we ever have before.
Abhi Venigalla [00:10:18]: It's just a lot of those little scaling experiments. That's the research side. But on the infra, yeah, maybe someone else can talk about that.
Bandish Shah [00:10:26]: Yeah, I'll jump in from the training side. Research was really pushing the frontier as far as what we do when we work with our research team, and it's a really great collaboration — we innovate because customer zero is always research for us. So sometime last year Abhi comes up and says, hey, I've got this MoE architecture. We had just put out MPT-30B and were riding that wave, and we were kind of busy with the acquisition. That's what I frame as our gen-one stack — that was our dense model training stack. We were really good at doing this at hundreds of GPUs. We'd been working with customers and we'd kind of proven it out over and over again.
Bandish Shah [00:11:10]: And that's a major thing that we do, right. We don't do this once. We have to do it once with our research team that's creating these new capabilities and really pushing the boundaries of.
Ajay Saini [00:11:20]: What we can do.
Bandish Shah [00:11:21]: And then we have to repeat that for Databricks customers. It's that, and it's scaling in a lot of different ways. When we build these new stacks, that's really the crux of what's happening. We have this new MoE architecture, and everything breaks, right? It's like, oh, we're ready to start training this. And it's like — no. What do you mean? We're nowhere near ready.
Denny Lee [00:11:42]: Right.
Bandish Shah [00:11:43]: And so a lot of this, when we're doing it, is almost like we're already flying. We're building an airplane literally while it's —
Demetrios [00:11:52]: Flying in the air and on fire, right?
Bandish Shah [00:11:54]: But we do that so that you don't have to, right? And so you don't have to do it later. So a lot of this involved, okay, now we have this MoE thing — how do we split this workload across hundreds or thousands of GPUs?
Denny Lee [00:12:06]: Right.
Bandish Shah [00:12:06]: We trained DBRX on over 3,000 GPUs. What are the things that are going to fail when you go from 512 to over 1,000? Things that didn't fail at those smaller scales are definitely failing at larger scales, and you're onion peeling — you're doing a lot of these things. So a lot of the engineering effort starts going into, hey, how are we going to solve all these problems? You start by patching, and then you build up from there and start understanding: hey, checkpoint loading is breaking for some different reason than it was before. How do we make this go fast, and how do we eventually do it so that our uptime is very long? And I think we did pretty well.
Bandish Shah [00:12:50]: Being able to train this model in three months or so is pretty amazing. And as we commercialize the stack and start training with customers, it's going to continue to improve and get even better.
Denny Lee [00:13:01]: All right, this is perfect. So actually, I've got two sets of questions just based on what you just said, Bandish. But I'm going to direct the first one to Davis — I want you to talk a little bit about MoE. I'm going to do that first, just as a lead-in for the next one. Then, AJ, I want to talk to you about the hardware failures and the training, because that part was super distinct that Bandish was referring to. But first things first.
Denny Lee [00:13:22]: You know, Davis, why don't you talk a little bit about the MoE architecture? Because we've gone around it, talking about this idea of mixture of experts, but this is not necessarily the most obvious thing. Typically, when you're looking at Llama 2 and even the older MPT models, we're talking about dense models, and now we're talking about sparse models. Can you talk a little bit about that, maybe within the context of the MegaBlocks paper, for the sake of argument?
Davis Blalock [00:13:46]: Yeah, absolutely. So here's one way to think about a mixture of experts. A simple approach you could take is to just train different models for different subsets of your data. You have a model that's good at processing math, a model that's good at processing Python code, whatever. You could just have different specialized models and run that. That's a thing you can do, but you'll hit a fair amount of redundancy. They're all going to have to have big embedding tables for their input tokens. They're all going to have to figure out how to do reference resolution for pronouns.
Davis Blalock [00:14:28]: There's just a lot of the model that translates across domains and only some of it that is specific to a particular domain. So in a mixture-of-experts model, what you do is, instead of having completely separate models, you have one model, but all of the feedforward networks — pairs of linear layers in a transformer — are duplicated, say, eight or, in our case, 16 times. You have 16 different feedforward networks in every transformer block, and every token goes to a subset of these feedforward networks. In our case, we send each token to four of them. What this means is that we have many different, supposedly specialized feedforward networks in every transformer block, and tokens go to the relevant ones. There are a bunch of things that are nice about this. One is that you have 16 times more parameters. That's a lot more model capacity.
Davis Blalock [00:15:35]: One thing that's not very nice is that you end up with 16 times more parameters in each of your layers, and that causes a number of technical challenges. As far as which experts a given token goes to, you learn so-called routing functions that hopefully do something reasonable. There are a lot of open questions about how to do this really well in the research literature, but that's the high level idea. You're adding way more parameters, but then at runtime, you use them sparsely, so that each token only sees a few of them.
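To make the idea concrete, here is a minimal, illustrative sketch of a top-k mixture-of-experts feedforward layer in PyTorch. This is not the DBRX or MegaBlocks implementation (which relies on block-sparse kernels and load-balancing tricks); the 16 experts and four active experts per token match the numbers discussed above, but the class, its names, and the simple dense routing loop are all simplifications for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Minimal top-k mixture-of-experts FFN sketch (illustrative, not production code)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 16, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # learned routing function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is routed to its top-k experts.
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # per-token expert choices
        weights = F.softmax(weights, dim=-1)             # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Quick shape check: 10 tokens, model width 8.
layer = MoEFeedForward(d_model=8, d_hidden=32)
print(layer(torch.randn(10, 8)).shape)  # torch.Size([10, 8])
```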
Denny Lee [00:16:12]: So you just mentioned that you're using four experts out of the 16 that are available, right? Can you explain what the benefit is? Because you did note that with 16 experts, you have that many more parameters to work with. So what's the benefit of doing four experts out of 16? And do note that, for example, with Mixtral and with Grok, it's two experts out of eight. So I'm just curious, in that context.
Davis Blalock [00:16:38]: So "why four experts out of 16?" is really two questions. One is, why only four experts? And the second is, why four and not just one? So why choose only four — why choose a subset? Well, the amount of work you have to do computationally, the number of flops for each token you feed into your model, is going to be proportional to the number of experts that are used. So if I take a token and feed it through 16 different linear layers, that's of course 16 times as much work as only feeding it through one linear layer. So in order to reduce the amount of work we do per token, we only route each token to a subset of the experts. Now, if that's the case, why not just do one? Why not maximize the ratio of parameters to the amount of work you do? The answer there is that there's a trade-off, and you kind of have to tune it. The trade-off is that your model still does benefit from doing more work. One way to think about this, intuitively, is that the amount of expressiveness you get here depends on the n choose k
Davis Blalock [00:18:01]: of how many experts you use. If you use only one expert, there are only 16 possible ways a token can get processed. If you use two out of 16 experts, you have 16 times 15 ways it can get processed, and so on. So if you use four, you end up with this really expressive combinatorial space where, for each token, you're still only using a small fraction of the parameters, but the number of functions you can express is really, really huge. So you end up with a big model capacity win that still gets you a lot of the benefit of just using all of the experts, or an equivalent dense model with the same parameter count.
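A quick worked check of the combinatorics Davis is gesturing at — counting how many distinct expert subsets a token can be routed to with 16 experts, here as unordered subsets (n choose k):

```python
import math

n_experts = 16
for k in (1, 2, 4):
    print(f"top-{k}: {math.comb(n_experts, k)} possible expert subsets per token")
# top-1: 16, top-2: 120, top-4: 1820 unordered subsets
```

The exact count depends on whether you treat the choices as ordered or unordered, but either way the space grows rapidly with k, which is the "expressive combinatorial space" point above.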
Denny Lee [00:18:43]: Okay. And one piece of context I would add before I switch over is that there's also the benefit of speed, right? Basically, by reducing the number of active experts, you get the best of both worlds: you have the higher model quality due to the 16 experts, but you get better inference speed because you're only using four. Did I get that correct? Just verifying here.
Davis Blalock [00:19:12]: Yeah, that's a really good summary. Kind of getting the best of both worlds.
Denny Lee [00:19:17]: Cool. I probably asked DaBaRex that question, honestly, so that is probably actually why. I don't know.
Abhi Venigalla [00:19:23]: All right.
Denny Lee [00:19:24]: Ah, there's the truth. Oh, no, no, that's always been the truth. I'm not pretending I know anything. I literally ask my DaBaRex friend, it tells me, and then I just get it to give me some citations so I can act like I know what I'm talking about. But the one thing I definitely want to talk about —
Denny Lee [00:19:39]: AJ, like, okay, dude. Parallelizing something like this has got to be complicated. You've got to tell me a little bit about the hardware challenges. How do you run something like this at scale? That's 3,072 GPUs that you're running on. This can't be a walk in the park, right?
Abhi Venigalla [00:19:57]: Yeah.
Ajay Saini [00:19:58]: So this is a fun one. Well, fun and not fun, depending on how you see it. But basically, one piece of large model training that people don't really talk about as much is how much of a systems and operations problem it really is. Throughout our time running the Mosaic training platform, we had run across thousands of GPUs before, but we had never done a single training run on that many GPUs. It had been a whole bunch of smaller, separate ones. And when you hit that scale, as Bandish mentioned earlier, all sorts of things break. Statistically, people say you'll observe roughly one GPU failure every 1,000 GPU-days.
Ajay Saini [00:20:36]: And on a 3,000-GPU cluster, that means you're seeing three per day. And I can tell you empirically, we actually did see three to four per day throughout the entire training run. In order to deal with this, one, you have to have really good monitoring. We were monitoring the GPUs, we were monitoring things at the Kubernetes level, which is the foundation of how our platform was built. But you also have to monitor the entire network, because a really critical part of training — parallelizing this model across so many GPUs — is that you need really high throughput communication between all of the GPUs. The network layer is super important. What you end up having to do is basically check every single pair of machines, or what we call nodes, in the cluster to make sure that you're actually getting that really high throughput. Many times random network links will go down, and we'll have to go figure out which ones they are.
Ajay Saini [00:21:21]: There was this whole operational thing of building all of these monitoring layers, all this health checking. And then on top of that, no matter how hard you try to catch all these things, you will never catch everything. So the next piece of it is: okay, in the cases where you don't catch the failures, do you have the right operations in place to be able to deal with it in a very timely manner? We actually had a dedicated on-call rotation specifically for this training run, where if anything ever went wrong, we had a whole bunch of monitors and alerts — a 24/7 on-call page that would just wake people up in the middle of the night: go debug some hardware, go back to sleep, wake up again, do it again. And it kind of just went on for weeks.
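The failure math AJ quotes is easy to sanity-check. A back-of-envelope sketch, using the roughly one-failure-per-1,000-GPU-days rate he mentions and the 3,072-GPU cluster size from earlier in the conversation:

```python
# Rough expected-failure arithmetic (rates are the approximate figures from the talk).
failures_per_gpu_day = 1 / 1000      # ~1 GPU failure per 1,000 GPU-days
cluster_gpus = 3072                  # DBRX training cluster size mentioned above

expected_per_day = cluster_gpus * failures_per_gpu_day
print(f"~{expected_per_day:.1f} expected GPU failures per day")  # ~3.1, matching the 3-4/day observed
```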
Denny Lee [00:21:58]: That's amazing. In fact, you're literally reminding me — and I'm basically aging myself horribly here, I just want to call that out. Did you guys ever deal with, for example, HP Superdomes, the old school ones? Because that's basically exactly the same thing. The idea was that InfiniBand connections were still really, really new and really, really fragile at the time. So it was exactly the same concept. I was part of the SQL Server team, where we were actually running SQL Server on Superdomes. And that's exactly it.
Denny Lee [00:22:32]: Yeah, basically calls at all hours just because a network link wasn't working and all hell broke loose. So I'm just curious, from your perspective: did this reduce over time as you were training DaBaRex, or was it pretty consistent, just because of the nature of the InfiniBand connections or the nature of the GPUs themselves?
Ajay Saini [00:22:54]: So we had pretty close relationships with the cloud providers, and we were able to report all of our failures after we detected and mitigated them. So we did actually see improvements in health over time. But also over time, we got a lot better at monitoring and checking for these things automatically, because every time we had something go wrong, we were able to develop some kind of script to diagnose it for future times. And now we have this gigantic logbook with all of the things that went wrong, and we're actively working on automating all of these things so that when we have customers training models like this, they won't run into the same problems.
Demetrios [00:23:24]: That's nice of you. Thinking about the customers.
Denny Lee [00:23:28]: Yep.
Davis Blalock [00:23:29]: Customer first, right?
Demetrios [00:23:30]: Yeah, I'm sure a lot of people really enjoyed that. So there's a question coming through in the chat that I wanted to ask, which is: have you ever experimented with training strategies other than FSDP, like using frameworks such as Megatron or DeepSpeed?
Bandish Shah [00:23:49]: Yeah — Abhi, you want to take it, and then I'll maybe add in afterwards?
Denny Lee [00:23:54]: Sure.
Bandish Shah [00:23:54]: Yeah.
Abhi Venigalla [00:23:55]: So we primarily use a PyTorch-based training stack, and we have this library on top called Composer, which is open source, so anyone can use it. What we've tried to do with that framework is make it as flexible as possible for different models. We use Composer to train not only our LLMs, like DBRX — we also use it to train diffusion models, and we have other future models coming up. One of the reasons that's possible, in our opinion, is because we use PyTorch FSDP, which really doesn't enforce any requirements on the model architecture. It makes it easy to distribute the model across big clusters without using 3D parallelism or pipeline parallelism or anything like that. That flexibility is what we're going for most of all. We've found over time that we've been able to optimize FSDP to achieve very, very close performance to, I think, even these 3D parallel strategies.
Abhi Venigalla [00:24:48]: So that's kind of the trade-off we've made. DeepSpeed, in comparison, is basically implementing the same algorithm, so whether you use DeepSpeed ZeRO-3 or PyTorch FSDP, I think you'll see a very similar situation. We just tend to use the PyTorch one because it's a little more natively integrated with things like PyTorch autocast and stuff like that. But yeah, I'll hand over to Bandish as well, who's also built a lot of these different frameworks.
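For readers who haven't used it, here is a minimal sketch of wrapping a model with PyTorch FSDP, in the spirit of what Abhi describes. This is generic PyTorch usage, not the Composer/LLM Foundry code path; `my_model`, `build_model`, and `TransformerBlock` are hypothetical stand-ins for whatever model definition you have.

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from my_model import TransformerBlock, build_model  # hypothetical user code

dist.init_process_group(backend="nccl")              # one process per GPU, e.g. launched with torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={TransformerBlock}
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,                     # shard at the transformer-block level
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The training loop proceeds as usual; FSDP shards parameters, gradients, and optimizer state,
# which is what makes it largely agnostic to the model architecture.
```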
Bandish Shah [00:25:12]: Yeah, I think Abhi definitely covered the frameworks, and we love the simplicity. FSDP is very straightforward to think about, we're able to keep optimizing it and keep pushing the bounds, and we work very closely with PyTorch on this — that's actually been a really great partnership. But the other side is, as AJ was talking about, having super high speed interconnects really helps. And again, not everyone's going to have that. So we're not saying things like pipeline parallelism or other parallelism strategies aren't needed, but when we actually build these clusters and train on these things, having those high performance networks really goes a long way.
Bandish Shah [00:25:55]: I think as we keep pushing the bounds, we'll evaluate how to keep pushing utilization — we're really maximizing the amount of performance we can get per dollar. But so far, yeah, we've basically gotten away with FSDP and variations on that, and just really being very intelligent about how we optimize things.
Demetrios [00:26:15]: All right, I'm going to lob another one out at you all because the chat has been awesome. Keep the questions coming in the chat. You all are incredible. So what tools do you use to benchmark your LLM inference regarding latency, GPU utilization, et cetera, et cetera.
Davis Blalock [00:26:35]: So I know the inference people have some very specific tooling that we didn't necessarily use for pre-training, but it's mostly off-the-shelf stuff. We get a lot of mileage out of the PyTorch profiler as well as the PyTorch memory profiler. I think those two were about 80% of it. Occasionally we would also use a little bit of our own tooling to, say, log all the distributed calls, but it was mostly off-the-shelf stuff. And I think for inference, a lot of the challenges there are maybe less around which tool you're using, because you can often just call into very high performance kernels in TRT-LLM or other libraries, where you can get away with not looking at the hardware perf counters and the L2 cache bandwidth and so on a pretty good fraction of the time. But it's often hard to get a good representation of your workloads, especially when your workloads are very bursty. Because, okay, your inference latency is going to depend on your batch size — how much work are you doing right now, how many other queries are you having to answer — and how do you get a realistic setting so you can measure that in a situation that's going to be relevant for what your customers are seeing? It's not quite an answer, but I know we had to build a fair amount of scaffolding around that to really get accurate answers to these sorts of questions.
Ajay Saini [00:28:07]: Just to quickly add on to what Davis is saying: there's the inference latency piece, and for inference specifically there's also throughput, which matters a lot. One thing we did for serving DBRX is we actually built our own optimized inference web server that has a whole bunch of customizations in it to make the model super performant when it's being served. The way we test this is actually very simple: deploy the model using the server, then absolutely bombard it with requests and see what throughput we're able to get. There's nothing like trying to simulate live, actual workloads, measuring the performance from there, and then optimizing.
Bandish Shah [00:28:40]: Yeah, the only thing I would add is that inference is definitely a full end-to-end systems problem in that sense. AJ is talking about the web server that we have to optimize, all the way from the REST API call to the actual model call itself. I think we've written a great blog post about this that highlights our inference benchmarking. We look at things like time per output token, time to first token, time for the generate calls to actually happen, and what is compute bound versus memory bound. We also look at memory bandwidth utilization. We kind of highlight that framework.
Bandish Shah [00:29:20]: I'd definitely point folks to that blog. Then, in terms of actual tooling for the model, a lot of it is: okay, how do we get those metrics? So a lot of it looks at — hey, if we know how much data we're moving around, and we know the feeds and speeds of the system, what should we actually expect to see in terms of model performance? And then we also use a lot of the vendor tools. We've looked at things like Nsight from NVIDIA to actually model how long certain kernels are taking, and a lot of that also happens in partnership with those vendors — we work very closely with the TRT-LLM team to really optimize that. And that's kind of what led to those PRs to the TRT-LLM repo and the vLLM repo, for both of those solutions.
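As a rough illustration of the time-to-first-token and time-per-output-token metrics Bandish mentions, here is a small client-side measurement sketch against a streaming completion endpoint. The URL, payload shape, and the assumption that each streamed line is roughly one token are all placeholders; a real harness (like the one described in the blog) is considerably more careful about tokenization and concurrency.

```python
import time
import requests  # assumes a streaming HTTP completions endpoint; URL and payload are placeholders

ENDPOINT = "http://localhost:8000/v1/completions"
payload = {"prompt": "Explain mixture-of-experts in one paragraph.",
           "max_tokens": 256, "stream": True}

start = time.perf_counter()
first_chunk_at = None
n_chunks = 0
with requests.post(ENDPOINT, json=payload, stream=True, timeout=120) as resp:
    for line in resp.iter_lines():
        if not line:
            continue                      # skip keep-alive blank lines
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

assert first_chunk_at is not None, "no streamed chunks received"
ttft = first_chunk_at - start                           # time to first token (approx.)
tpot = (end - first_chunk_at) / max(n_chunks - 1, 1)    # approx. time per output token
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/chunk over {n_chunks} chunks")
```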
Demetrios [00:30:08]: Beyond the Chinchilla scaling guidelines, how did you choose the number of tokens to pre-train on? And maybe I'll call out Abhi on this one.
Denny Lee [00:30:22]: Oh, and just as a quick call-out before you do: most people here probably don't even know what the Chinchilla guidelines are, right? They probably haven't read the research paper around this. Again, I only know because I've asked DaBaRex to tell me. So can you provide the context first? Just want to call that out.
Abhi Venigalla [00:30:38]: Yeah, absolutely. So when it comes to sizing these models, a lot of times you're trying to produce the highest quality for a given compute budget, and compute is basically a product of how many parameters and how many tokens you use. If you look back at the history of LLM scaling, in the early days people were training very, very large models on a small number of tokens. GPT-3 was kind of like this — I think 175 billion parameters, only 300 billion tokens — and they tried to develop scaling laws to predict the most optimal way to assign those two components, the parameters and the tokens. A few years later, there was a paper that trained a model called Chinchilla, and they remeasured the scaling laws more precisely. And they found that you actually want to have a lot more tokens than parameters.
Abhi Venigalla [00:31:28]: So they came to a recommendation of approximately a one-to-20 ratio. If you want to train a 100 billion parameter model, you should actually use 2 trillion tokens. That dramatically changed how people built models: people started building smaller models trained on much more data. Beyond that, people found that, hey, we can actually extend it even farther, although it's not exactly compute optimal. You can train on even more tokens — you can increase that ratio from 20x to 100x, even 200x. And then you get models like the Llama models.
Abhi Venigalla [00:32:03]: For example, Llama-7B is only a 7 billion parameter model but was trained for 2 trillion tokens — a humongous ratio, way beyond Chinchilla — and so on. So we looked at all these different strategies, and we know that for our customers, being able to serve efficiently — so the smallest parameter count possible — is really important. We basically wanted to ask ourselves: what is the highest token-to-parameter ratio we can achieve without giving too much quality away? So not quite Chinchilla, not quite extremely large amounts of tokens. And we ran a lot of experiments at smaller scale, basically at the same compute budgets, where we would try to shrink the model and increase the token count as much as possible to see where that trade-off was. So when you look at DaBaRex today, it has basically 132 billion total parameters and 12 trillion tokens, which is pretty close to 100x.
Abhi Venigalla [00:32:56]: I think that's what we came to. If you're interested in these scaling laws, there are new papers out there on mixture-of-experts scaling laws — I think there's one even just from a few weeks ago. But that's how
Davis Blalock [00:33:07]: We came to it.
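The token-to-parameter ratios Abhi walks through are easy to reproduce from the rounded figures he cites:

```python
# Token-to-parameter ratios from the rounded public numbers mentioned above.
models = {
    "GPT-3":                    (175e9, 300e9),
    "Chinchilla rule of thumb": (1.0, 20.0),     # ~20 tokens per parameter
    "Llama-7B":                 (7e9, 2e12),
    "DBRX":                     (132e9, 12e12),  # 132B total params, ~12T tokens
}
for name, (params, tokens) in models.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
# GPT-3 ~2x, Chinchilla ~20x, Llama-7B ~286x, DBRX ~91x
```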
Denny Lee [00:33:09]: Let's talk a little bit about data quality, in terms of one of the recurring themes from the blog post, which was the higher quality data. We can break it down in terms of, for example, the GPT-4 tokenizer from tiktoken, or from the standpoint that we have higher quality data, which showed up when comparing against the MPT-7B era models. I'd like to chime in a little bit about that, just to provide the context that we ended up utilizing the very data tools that Databricks has. I'm not trying to upsell that, by the way — I'm just calling out that we're from Databricks, so, cool, we end up using our own tools. We end up dogfooding ourselves, basically, to get high quality data so that we can actually produce models like this. And that's open for anybody here right now.
Abhi Venigalla [00:33:58]: At this point, I think you should upsell it. When we moved from Mosaic, our data processing literally got 100x faster. Not even exaggerating. It became much, much easier to deal with these huge text datasets.
Davis Blalock [00:34:13]: I think it was pretty remarkable for our data team, because obviously when you're dealing with many terabytes of data, that's not something you want to do on your laptop. So getting everyone on board with Spark and having the ability to write something kind of pandas-like — something fairly familiar, but that just works at scale — was really helpful. We had a very interesting time exploring the data, trying to run all sorts of different transformations, keeping track of everything, because you try lots of different experiments, lots of different pre-processing; you have these whole data science pipelines to go from raw text to this subset of really clean data that you actually want to use. So yeah, it ended up working really well. Getting acquired by Databricks — honestly, when we first found out it was a little bit like, oh, Databricks, okay. Not the first company you think of for deep learning. But in retrospect, seeing how much of the work is the data pipeline and data management, and how much alpha there is in doing that really well, it's like, oh, this makes total sense. I really see the integration here. So that's been a very positive experience for everyone who's had to crunch
Davis Blalock [00:35:39]: large datasets here.
Demetrios [00:35:41]: Bandish, I know you got something for us.
Bandish Shah [00:35:43]: Oh, yeah, sure. So one of the things that was actually interesting — we're kind of talking about why this was like a match made in heaven, right? I think we talked about this a few weeks ago too. Our data processing sped up so much after the Databricks acquisition, because we have these experts and this amazing system and platform that we built to do this. And then once you have that data, you have to pump it into 3,000-plus GPUs and actually train these jobs. That's actually a problem we had started working on at Mosaic. Especially when you have these cloud-built platforms, the underlying storage is object storage, and you want to distribute this data at scale. We're running one training job, but we're carving up the work and giving it to these different GPUs. And in order to do that, you also have to take your data, split it up, and send it to these different GPUs.
Bandish Shah [00:36:43]: Doing that at 1,000-GPU scale turns out to be really difficult. And then streaming it over the Internet, or over a high bandwidth network, also comes with its own set of problems — for anyone who's been on the other side of a DDoS attack, you hit these issues, especially when you're pinging these service providers. Splitting up that data and really streaming it: chunking it up, downloading it, making sure you're downloading the stuff you're going to need next in the background, being able to shuffle it at scale, and then basically reorganizing it across all these different GPUs — that's what our Streaming datasets library lets us do. And it's gotten even better and more capable in training a model at the scale of DBRX.
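A minimal sketch of the kind of usage Bandish is describing, based on the public open-source Streaming library (mosaicml-streaming): each rank pulls only its partition of pre-sharded data from object storage, with background prefetch and a deterministic shuffle. The bucket paths here are placeholders, and exact argument names may vary by version.

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/pretrain-shards",  # pre-sharded dataset in object storage (placeholder path)
    local="/tmp/streaming-cache",             # per-node on-disk cache for downloaded shards
    shuffle=True,                             # deterministic, distributed-aware shuffle
    batch_size=8,                             # per-device batch size, used for partitioning
)
loader = DataLoader(dataset, batch_size=8, num_workers=8)

for batch in loader:
    ...  # each rank sees only its own partition; upcoming shards are prefetched in the background
```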
Abhi Venigalla [00:37:35]: I was just going to say — out of all the different pieces of infrastructure, all the different tools we had, I think Streaming is the one piece that didn't break at all as we scaled. It was pretty remarkable.
Bandish Shah [00:37:49]: It was pretty amazing, by the way. That's like almost a one person team.
Abhi Venigalla [00:37:53]: In a lot of senses.
Bandish Shah [00:37:54]: It's impressive what that guy does. But yeah, I mean, it's —
Denny Lee [00:38:01]: It's a —
Bandish Shah [00:38:01]: It's an incredible piece of engineering that we kind of built there.
Denny Lee [00:38:05]: Yeah, I think you want to chime in, too, so go for it.
Bandish Shah [00:38:09]: Another.
Ajay Saini [00:38:10]: Another fun story about scaling up our data: our streaming data loader held up and was very scalable. Object stores, on the other hand, not as much. Fun story — we actually had to switch object stores at one point, and also implement some hacks to be able to download the data without getting throttled or having other issues with the object stores in the cloud providers. And not just with data, but also with checkpoint downloads. When we're launching a whole bunch of training runs and you start to really scale up, you start to make the object store unhappy.
Abhi Venigalla [00:38:39]: Yeah.
Denny Lee [00:38:39]: Especially when you're trying to list them, right? The listing alone — you're basically doing a DDoS attack directly on the object store.
Bandish Shah [00:38:46]: Yeah, that's always pretty much, yeah.
Denny Lee [00:38:48]: Yeah.
Bandish Shah [00:38:48]: You have thousands of these workers just blasting away.
Abhi Venigalla [00:38:51]: So.
Denny Lee [00:38:52]: Sorry.
Bandish Shah [00:38:52]: Go ahead, Abhi.
Abhi Venigalla [00:38:53]: Yeah, I think what was cool about that is that some of them were things in our libraries where we had at first done what seemed like the natural thing to do. Hey, every GPU should go download the metadata file, right? That can't be so bad, right? And that's totally fine when you only have 100 GPUs, but on 3,000, to the object store it looks like you're spamming them and you get shut down. So then we had to put in all these fixes, which totally make sense — just have one worker download it and then share the data, or something like that. But now that we've actually fixed these, our customers won't see these problems at all, which is cool.
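A sketch of the "one rank downloads, everyone shares" pattern Abhi describes, using torch.distributed. The `download_metadata` helper and the path are hypothetical stand-ins for whatever actually fetches the index file from object storage; the point is that only rank 0 touches the object store.

```python
import torch.distributed as dist

def download_metadata(path: str) -> dict:
    # Placeholder: in reality this would fetch an index/metadata file from object storage.
    return {"path": path, "shards": []}

def fetch_metadata_once(path: str = "s3://bucket/index.json") -> dict:
    # One rank hits the object store; the result is broadcast to all other ranks,
    # instead of thousands of workers requesting the same small file at once.
    obj = [None]
    if dist.get_rank() == 0:
        obj[0] = download_metadata(path)
    dist.broadcast_object_list(obj, src=0)
    return obj[0]
```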
Denny Lee [00:39:24]: Cool, cool. Before Demetrios takes over, I just wanted to do a quick call-out, because Davis did bring it up. He said you don't want to run this on your laptop. Now, I did want to clarify: that's running on your laptop for training, which we completely agree with. If you go get the quantized version — I think, Abhi, correct me, it was a guy from Apple who actually quantized DaBaRex down — and you've got, I think you're telling me, a 96 gigabyte MacBook Pro, you should be able to go ahead and run inference directly on the laptop itself. Yes?
Denny Lee [00:39:54]: Yeah.
Abhi Venigalla [00:39:55]: So there are at least a couple of high-end laptops that can run it. I think the llama.cpp folks are also working on it, so hopefully that'll open it up, too.
Denny Lee [00:40:02]: Yeah, yeah. I've been playing with that myself, and my poor Mac Studio is not quite up to it, not surprisingly.
Demetrios [00:40:10]: You should be able to run it.
Bandish Shah [00:40:12]: I think once llama.cpp supports it, you should be able to run it on your gaming rig at home, right after it's quantized.
Denny Lee [00:40:19]: Yeah, I know, but see, I only have 32 gigs of this one, so it's just an excuse for me to buy more memory now.
Bandish Shah [00:40:26]: Yeah, yeah.
Denny Lee [00:40:26]: That's all. So you gotta help me out here. Come on. Come on, dude.
Demetrios [00:40:28]: Exactly.
Bandish Shah [00:40:29]: Really? Yeah, that's personal. I don't know. Just 32 gigs? I mean, come on.
Denny Lee [00:40:35]: Yeah, yeah, I know, I know. I feel really horrible right now. No, back on track. Demetrios, just go for it, buddy.
Demetrios [00:40:43]: We digress. But there is this data quality issue, and there was a question coming through in the chat that I think is great: thinking about future models, is moving towards synthetic data modalities on the roadmap?
Abhi Venigalla [00:40:57]: Yeah, I think that's definitely a research area we're very interested in, particularly because as you try to ramp to even larger models, we're starting to run out of data on the Internet. One of the core beliefs we had with training custom models is that there's actually more useful data held privately by enterprises, and synthetic data is potentially a way for people to tap into it. We found, even with DBRX — and we're just starting to experiment with this — that being able to take a small dataset that you have, expand it, and then fine-tune on that helps a lot. We're trying to figure out ways to do that with our customers as well. So we're generally very interested in synthetic data, and hopefully in the coming months there will be a lot more to share.
Denny Lee [00:41:46]: Oh, sorry.
Ajay Saini [00:41:46]: Just a quick add-on to that. On the topic of a lot of our research eventually funneling into the product and things we offer to customers: we actually have a fine-tuning API, and the ability to fine-tune DBRX is coming soon as well. One thing that we really learned through a lot of our customer conversations is that customers struggle to have very large labeled datasets for supervised fine-tuning, and synthetic data is one really great way to address that. That's something we're actually looking at — first, obviously, doing all the research and deeply understanding it ourselves, but then eventually building it into the offering as well. So very much on the roadmap.
Denny Lee [00:42:22]: Perfect. So I just realized, from what you're talking about in terms of the latest research, one thing we may or may not have discussed was curriculum learning. I was wondering if one of you can chime in a little bit about that, because that actually had a lot to do with the improvement in training as well. We talked about the tokenizer, we talked about data, but I forgot about the curriculum learning part. Anybody want to chime in on that one?
Abhi Venigalla [00:42:45]: Yeah, I can chime in a little bit. What we found — and we found this through small-scale experiments — is that it helps a lot to change the distribution of your data towards the end of pre-training. We did these experiments where, in the last 10% or 20% of training, we would focus the pre-training data less on just web-scale text and more on the high quality data or code data that we actually wanted the model to be really good at. And that turned out to help performance a lot on the sort of benchmarks that we care about. For DBRX broadly, we wanted a general-purpose LLM that was also particularly good at coding and reasoning, so we basically tailored the pre-training data towards that. And again, this is really easy with our tools, because we use these kinds of streams and you can change the ratio of the streams at runtime. That helped a lot in final performance.
Abhi Venigalla [00:43:35]: And actually, when we talked with some of our reporters as well — we tried out different data mixes, and this was a pretty exciting moment during the making of DBRX. We found that the curriculum learning one worked the best.
Bandish Shah [00:43:51]: By the way, that is super exciting to me, right? Because you hear all these stories about people who spend a lot of money trying to take these base models and fine-tune them or change things. If we can get really good at continuing to teach these models more and more effectively, and get away from "hey, I added a bunch of my data, but the model actually got worse" and a bunch of other stuff — if we can crack that, that's amazing. That's what's super exciting, I think, about being able to change the data mix during training: look at where it's at, and if you have good evals in place, tweak it as you go along. You're almost figuring out what you can continue to teach the model.
Bandish Shah [00:44:35]: So I'm really looking forward to what we keep developing on that front.
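In the spirit of the curriculum idea Abhi and Bandish describe — reweighting data streams for the final stretch of pre-training — here is an illustrative sketch using the open-source Streaming library's weighted streams. The remotes and proportions are made up for illustration; they are not the DBRX data mix.

```python
from streaming import Stream, StreamingDataset

# Early phase of pre-training: mostly web-scale text, some code (illustrative mix).
early_streams = [
    Stream(remote="s3://bucket/web-text", proportion=0.8),
    Stream(remote="s3://bucket/code",     proportion=0.2),
]

# Final ~10-20% of training: upweight code and curated high-quality data.
late_streams = [
    Stream(remote="s3://bucket/web-text",     proportion=0.40),
    Stream(remote="s3://bucket/code",         proportion=0.35),
    Stream(remote="s3://bucket/high-quality", proportion=0.25),
]

# Swap the stream mix when resuming for the final phase of training.
dataset = StreamingDataset(streams=late_streams, local="/tmp/cache", shuffle=True)
```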
Demetrios [00:44:40]: So there's an awesome question that came through the chat again from Max and I think this one is directed towards you, AJ.
Denny Lee [00:44:47]: For large-scale training, do people generally switch to Slurm instead of K8s? Does K8s itself affect your training pipeline? And what factors could make you switch from K8s to Slurm?
Ajay Saini [00:45:01]: That's an interesting one. The reason we chose Kubernetes is actually because we were building this training platform to be multi-cloud and also to be something customer-facing. With Kubernetes it's significantly easier to build multi-cloud — you can run on AWS, Azure, Oracle, GCP, wherever, much more easily than you can with Slurm. That was actually the big reason. At the end of the day, when we built our product, our research team was treated just like a customer, to an extent. And for our research team plus all of our customers running multi-cloud, Kubernetes is just significantly easier. Now, in terms of scalability — yes, there's a point there.
Ajay Saini [00:45:35]: With Kubernetes, once you really scale up the number of machines and the amount of hardware it's managing, you have some challenges in scaling. But those are things we've actually built workarounds for within our own platform. At the end of the day, you can think of Kubernetes as simply the foundational layer that manages the hardware, and we've built a whole stack of systems on top of it to really enable this large-scale training.
Demetrios [00:45:57]: The whole chat wants to know this question: are smaller models on the roadmap?
Bandish Shah [00:46:04]: I mean, why? We've just got to go bigger. No — I mean, yeah. Again, what we're doing is trying to make this available to a lot of different market segments, right? So small models, big models, everything in between are fair game. You could train a smaller model with Databricks if that's really what you need. But longer term, we want to support as many use cases as we can for our customers, and that's kind of how we'll tackle it.
Demetrios [00:46:35]: Excellent. All right, this is the last one, I think, that we can hit before the end. It's going to be a rapid fire round so that we can all make our next meeting on time. Speaking on the topic of fine-tuning, if we go with QLoRA or some sort of LoRA for — no, what's it? What are you calling it again? DBX, DiBeks, DaBaRX?
Denny Lee [00:46:57]: Man, you gotta work with me, buddy.
Demetrios [00:47:00]: It's gonna stick one of these times. It'll stick in my head anyway. Lightning round: QLoRA or some sort of LoRA for DaBaRex — which framework would you suggest? How many GPUs would make sense?
Abhi Venigalla [00:47:15]: I think for QLoRA, we are working on modifying the Transformers integration to make it easier to use. I don't know exactly which frameworks are best right now; we are working on LoRA fine-tuning support in LLM Foundry, which is our public training repo. So check that out and post an issue there depending on what you're looking for. But yeah, you can also take a look at the Transformers repo — we're basically working on our PR right now.
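For orientation, here is a hedged sketch of what LoRA fine-tuning of DBRX might look like with Hugging Face PEFT once the Transformers integration Abhi mentions is in place. The target module names are illustrative guesses, not confirmed DBRX layer names, and a 132B-parameter model needs multiple large GPUs even for LoRA; LLM Foundry's own recipes are the place to check for the supported path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "databricks/dbrx-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # may be needed on older transformers
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shards the model across available GPUs
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["Wqkv", "out_proj"],   # placeholder layer names; inspect the model to confirm
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```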
Demetrios [00:47:43]: Yeah, excellent. So I think that is all for today. The last thing — Jonathan got back to us in the chat with one for you, AJ, about the flavor of Kubernetes: was it OpenShift, Rancher? What did you use to manage the Kubernetes cluster, if anything?
Ajay Saini [00:48:03]: So very early on we used to use Rancher for a lot of our stuff, but we actually replaced large components of Rancher with our own stack over time. For some of the more established cloud providers like AWS and GCP, we'll just use their managed offering and build on top of that. On OCI, we kind of roll our own, like an open source distribution of Kubernetes.
Demetrios [00:48:24]: Guys, I feel like we could have continued this for another hour. The chat is going wild. We didn't get to half of the questions, and I appreciate everyone that came and was able to talk with me. I also appreciate everyone in the chat being super active, and we're going to be doing more of these. So stick around, and we will probably be emailing you when we have the next one. I really appreciate this. If anyone wants to dive in deeper, Davis and Bandish and I had a podcast come out about three weeks ago, and it talked a lot about the different pains that you all felt when you were going through this. So we dove deep into it.
Demetrios [00:49:07]: I want to thank you guys so much. Even though I can only see your frozen faces, I'm visualizing you smiling and being very happy right now.
Denny Lee [00:49:19]: I'll vouch for you, buddy — we're all happy right now, so we're good. And thank you very much to everybody who's watching us. We really appreciate you guys diving in. And like always, Demetrios, thank you very much, man. Always glad to join. This was awesome.
Bandish Shah [00:49:35]: Thank you so much. So much fun. Thank you for hosting.
Demetrios [00:49:38]: This was great. Thank you all. We will see you all later. Have a good one.