# DeepSpeed: Enabling Efficient Trillion Parameter Scale Training for Deep Learning Models

Olatunji (Tunji) Ruwase is a co-founder and lead of the DeepSpeed project at Microsoft. His broad industry and research background spans compilers, operating systems, and hardware accelerators. He is currently interested in building systems and convergence optimizations, and frameworks for distributed training and inference of deep learning models. His research results on DL training, inference, and hyperparameter search are used in multiple Microsoft systems and products, such as Bing, Ads, HyperDrive, and Catapult. Tunji earned a PhD in Computer Science from Carnegie Mellon University.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

Deep Learning (DL) is driving unprecedented progress in a wide range of Artificial Intelligence domains, including natural language processing, vision, speech, and multimodal. However, sustaining this AI revolution requires practical solutions to the extreme demands of model scaling on the compute, memory, communication and storage components of modern computing hardware. To address this challenge, we created a deep learning optimization library called DeepSpeed to make distributed model training and inference efficient, effective, and easy on commodity hardware. This talk will focus on DeepSpeed training optimizations for improving the memory, compute, and data efficiency of extreme model scaling.

### AI in Production

Demetrios [00:00:05]: Now I'm going to bring up our next guest. Tunji, where you at, man? There he is.

Tunji Ruwase [00:00:11]: Hi, Demetrios. Thanks so much for the invitation. It's really exciting to be here. Hope you can see my slides coming through over there.

Demetrios [00:00:19]: And I'm going to throw them up right now, so everybody should see them. I'm going to get rid of this shirt and the QR code.

Tunji Ruwase [00:00:27]: Excellent.

Demetrios [00:00:28]: People don't have to look at that because you got better stuff to talk about, dude.

Tunji Ruwase [00:00:32]: I don't know. I love it. Super excited for this talk.

Demetrios [00:00:34]: I'll let you get rocking.

Tunji Ruwase [00:00:35]: All right, thanks. Thanks so much. Yeah. Hi, everyone. Really excited to be here. Really glad to be giving this talk on behalf of the DeepSpeed team. The talk today is going to be about efficient training and parameter scaling for deep learning models.

Tunji Ruwase [00:00:51]: We got into the large language model wave a while back because we noticed, maybe four or five years ago, that models were getting bigger. And they were getting bigger because the quality was improving, the accuracy was getting better. And so people kept pushing, building bigger and bigger models. And you can see here, between 2018 and 2022, the models grew from less than 100 million parameters to half a trillion parameters with the Megatron-Turing NLG. And the results then showed that we hadn't hit the accuracy limit yet. Model scaling was sort of the future. We are systems people. And so we quickly identified that there were a number of systems challenges that come up with model scaling.

Tunji Ruwase [00:01:46]: And of course, with challenges, you also have opportunities for optimizations. And so today I'm going to focus on three challenges: memory, compute, and data. So you scale your model, you need more memory. To train those models, you need more compute, and you need more data. And in DeepSpeed, we've created solutions for all three. Our memory solution is called ZeRO. Our compute solution is DeepSpeed-MoE. And then we have a data efficiency framework for the data challenge.

Tunji Ruwase [00:02:13]: And so I'm going to start first with ZeRO. And this section, you can think about it as how we broke the GPU memory wall for deep learning. The training landscape sort of looks like this. And on this plot, what I'm showing is the growth of models over time, transformer models in particular, which are what are used for language models, and I'm also showing you the growth in the GPU memory size. And what you'll see here is that while the models were growing at about 100 times per year, GPU memory was basically only growing about 2x every two years. And so five years ago, we had 24 GB HBM devices. Today, it's about 80 GB.

Tunji Ruwase [00:03:00]: I think there are a few in the hundred-gigabyte range coming along. But the point is that the models are growing faster than the hardware can keep up. And so what do we do there? But first, let's look at why models consume so much memory, right? And to think about that: language models are typically trained with something called the Adam optimizer. And so each parameter you have in your model consumes about 20 bytes. And those 20 bytes have to do with basically the parameters, the gradients, and the optimizer state. And so that means if you have a billion parameter model, you need about 20 GB per GPU to fit that model. And this memory estimation does not include your input or your activations that are generated while you're training.
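
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that takes the roughly 20 bytes per parameter quoted in the talk at face value; the function name `model_state_gb` is hypothetical, and 1 GB is taken as 10^9 bytes.

```python
# Back-of-the-envelope model-state memory, using the ~20 bytes per parameter
# figure quoted in the talk. Inputs and activations are NOT included.

BYTES_PER_PARAM = 20  # parameters + gradients + optimizer state (from the talk)


def model_state_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the model-state footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show how the footprint scales.
    for billions in (1, 10, 100):
        gb = model_state_gb(billions * 10**9)
        print(f"{billions}B parameters -> about {gb:,.0f} GB of model state")
```

Under these assumptions a 10 billion parameter model already needs more model-state memory than a single 80 GB device offers, before any activations are counted.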

Tunji Ruwase [00:03:50]: So we created ZeRO to overcome this memory limitation. And ZeRO itself, you can think of it as a family of composable optimizations that help to reduce the GPU memory costs of training. And we have two dimensions of optimizations. The first is partitioning. That is, we have a way of dividing the model state across the GPU devices rather than replicating it. So with data parallel training, what you would have is that the model state is replicated across all the devices, but in ZeRO, we partition it, and that helps to reduce the amount of memory consumed per GPU. And the other dimension is offloading, which is basically where we utilize not just the GPU memory, but also CPU memory and NVMe memory, because your servers come with this memory tier hierarchy available.

Tunji Ruwase [00:04:48]: And so we leverage all of them. So let's just illustrate that a little bit. So let's think about the baseline system where we are doing data parallel training. The color coding here: blue is for the parameters, orange is for gradients, and green is for optimizer state. And so we're looking at training over N GPUs. And so, like I said, in baseline data parallel training, the model state on each GPU is essentially the same, so it's all replicated. And so we implement the partitioning in ZeRO in stages. So the first stage of optimization is essentially where we take the optimizer state, which you can see is the biggest chunk of memory, and we partition that across all the GPUs.

Tunji Ruwase [00:05:32]: And so that means each GPU essentially just holds a slice, a much smaller portion of the optimizer state. And that helps to reduce the memory consumption per GPU. The next stage is we partition the gradients as well. So that helps to further reduce the memory consumption per GPU. And the final one, as you can see where I'm going here, is that we partition the parameters as well. And so now, rather than have each GPU hold the entire model, it holds just a slice, and that helps to fit it in. And if we think in terms of the savings, another way to look at that is, like I said earlier, that each parameter consumes about 20 bytes of GPU memory in the baseline case. With ZeRO stage one, we reduce that memory footprint down to less than five. And then with stage two, less than three, and with stage three, less than one.
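
The three stages can be sketched as per-GPU arithmetic. The talk gives only the 20-byte total, so the split used here (2 bytes for parameters, 2 for gradients, 16 for optimizer state) is an assumption chosen to be consistent with the quoted figures, and `zero_bytes_per_param` is a hypothetical helper, not a DeepSpeed API.

```python
def zero_bytes_per_param(stage: int, num_gpus: int,
                         param_bytes: float = 2.0,
                         grad_bytes: float = 2.0,
                         optim_bytes: float = 16.0) -> float:
    """Per-GPU bytes of model state per parameter under ZeRO partitioning.

    The 2 / 2 / 16 byte split is an ASSUMPTION; the talk only states that
    the three pieces add up to about 20 bytes per parameter.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0 (plain data parallel), 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if stage >= 1:  # stage 1: partition optimizer state across GPUs
        optim_bytes /= num_gpus
    if stage >= 2:  # stage 2: also partition gradients
        grad_bytes /= num_gpus
    if stage >= 3:  # stage 3: also partition parameters
        param_bytes /= num_gpus
    return param_bytes + grad_bytes + optim_bytes


if __name__ == "__main__":
    for stage in range(4):
        print(f"stage {stage}: {zero_bytes_per_param(stage, num_gpus=64):.2f} bytes/param")
```

With 64 GPUs and this assumed split, the footprint falls from 20 bytes per parameter to under five, under three, and under one for stages one, two, and three.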

Tunji Ruwase [00:06:28]: And so all of this helps you to fit models that are much larger than your GPU memory, helps you to fit them for training. The other dimension that I mentioned was offloading. And so the idea there, like I said, is not just to keep the model state in GPU memory. And so the first work we did there is something we call ZeRO-Offload. And here you can see we've moved the optimizer state out of GPU memory, and we're hosting it now in CPU memory. And then we can go to the extreme of actually moving the entire model state, parameters, gradients, and optimizer state, into NVMe, and we call that system ZeRO-Infinity. So what's the impact of all of these optimizations? Well, here's a plot that sort of shows that.
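
In practice, offloading is selected through the DeepSpeed JSON configuration. The fragments below are a hedged sketch of what such settings can look like, written as Python dictionaries: the `zero_optimization`, `offload_optimizer`, and `offload_param` keys come from the DeepSpeed configuration documentation, while the NVMe path is a placeholder and a real configuration also needs batch size, precision, and optimizer settings. Check the exact options against the documentation for your DeepSpeed version.

```python
import json

# ZeRO-Offload style fragment: partition optimizer state and gradients
# (stage 2) and keep the optimizer state in CPU memory.
zero_offload_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
}

# Placeholder mount point for a local NVMe drive (hypothetical path).
NVME_PATH = "/local_nvme"

# ZeRO-Infinity style fragment: partition everything (stage 3) and
# place optimizer state and parameters on NVMe.
zero_infinity_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH},
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH},
    }
}

if __name__ == "__main__":
    # Print the fragments as JSON, the format DeepSpeed config files use.
    print(json.dumps(zero_offload_config, indent=2))
    print(json.dumps(zero_infinity_config, indent=2))
```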

Tunji Ruwase [00:07:18]: It's the plot I showed earlier that was showing that transformers were growing at about 200 times every two years, while GPU memory was growing about two times every two years. Well, with DeepSpeed technologies, all of this technology that I just talked about, we have a system capability that allows you to support models that are growing at about 400 times every two years. So, with the systems optimizations from DeepSpeed, we're running ahead of hardware. We are enabling folks to train models well ahead of when the hardware is available to support that. And that's been pretty exciting and neat to see. And these techniques have actually been used by real models, which I'll come back to a little bit towards the end of the talk. But, yeah, in a space of two years, we were able to create the capability to scale your models up to 400 times larger than was previously possible.

Tunji Ruwase [00:08:19]: All right, so that covers the memory wall. I'll now move on to the compute challenge. So, as your models get bigger, they need more compute. And this line of work, we call it DeepSpeed Mixture of Experts, or MoE. And so here, the idea is that the traditional models, we call them dense, which means, as you make them bigger, you increase the parameters, and that means every time you have to process an input, all of the parameters are involved, and that increases your computation cost. But sparse models, or sparse MoE, are a new design where when you increase your model parameter size, you don't actually use all the parameters for every single input. Instead, you use a subset of your parameters for each input, and we call these subsets experts. And so, compared to dense models, it means that as you increase your model size, your compute cost doesn't actually grow, because essentially you're only using a subset of your parameters to process each input.
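
The difference between total and active parameters can be made concrete with a small sketch. The class and all the numbers below are hypothetical and chosen only to illustrate the idea; they do not correspond to the models in the talk. The sketch assumes each token is processed by the shared (non-expert) parameters plus a fixed number of experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    """Toy parameter accounting for a sparse mixture-of-experts model."""
    shared_params: int          # parameters every token uses (hypothetical)
    params_per_expert: int      # parameters in one expert (hypothetical)
    num_experts: int
    experts_per_token: int = 1  # how many experts each token is routed to

    @property
    def total_params(self) -> int:
        """Parameters that must be stored."""
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params_per_token(self) -> int:
        """Parameters actually involved in processing one token."""
        return self.shared_params + self.experts_per_token * self.params_per_expert


if __name__ == "__main__":
    for n in (1, 8, 128):
        m = MoEModel(shared_params=500_000_000,
                     params_per_expert=100_000_000,
                     num_experts=n)
        print(f"{n} experts: total={m.total_params:,} "
              f"active per token={m.active_params_per_token:,}")
```

Adding experts grows `total_params` while `active_params_per_token` stays fixed, which is why the compute cost per input does not grow with model size in this design.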

Tunji Ruwase [00:09:24]: And so that allows you to train much larger models without increasing the computation requirements. What this means is now we can train larger models cheaper in terms of compute. And so I'll illustrate that here with this example. So here I'm showing validation loss for different models, right? So the red curve is a 1.3 billion parameter dense model, and the sort of yellow curve is a 6.7 billion parameter dense model, which is five times larger. And so, just focusing on those two curves, what we see here is that you made the model five times larger, and you can see the quality impact, right? It achieves much lower validation loss. It's a better model, essentially, but it's coming at five times the cost. The blue curve, on the other hand, is basically using a 1.3 billion parameter model size, but using 128 experts. And so from a computation cost perspective, this blue curve has the same computation cost as the red curve.

Tunji Ruwase [00:10:28]: But as you can see, in terms of the model quality, it's actually matching the 6.7 billion, the five times larger model. And this is not just for the validation loss. We can also look at sort of downstream evaluation. And there we see that with the mixture of experts, with five times less compute, we are essentially getting the same model quality. Right. So this helps us to tackle the compute requirements of model scaling with MoE. For the same model quality, we are training with much lower compute cost. But that's not all we did there.

Tunji Ruwase [00:11:09]: There's another dimension to look at here, which is that I talked about 128 experts. But then research has shown that typically you can't scale much more than that. There are diminishing returns as you scale the number of experts. And so eventually you still have to scale the base model. And so to do that efficiently, we designed a system called DeepSpeed-TED, which essentially uses a three-dimensional form of parallelism. We have the tensor parallelism. We have something called expert parallelism, and then the data parallelism. And by combining all three, we're able to scale the base model up significantly, in addition to having more experts.
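
One way to picture three-dimensional parallelism is as a mapping from each GPU rank to a coordinate along the tensor, expert, and data axes. The layout below is a generic illustration and an assumption, not DeepSpeed-TED's actual rank assignment; `rank_to_coord` and `Coord` are hypothetical names. The tensor axis varies fastest so that tensor-parallel peers sit on neighbouring ranks.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int    # index along the data-parallel axis
    expert: int  # index along the expert-parallel axis
    tensor: int  # index along the tensor-parallel axis


def rank_to_coord(rank: int, tensor_size: int, expert_size: int,
                  data_size: int) -> Coord:
    """Map a flat rank to (data, expert, tensor) coordinates.

    Illustrative layout only: the tensor index changes fastest, then the
    expert index, then the data index.
    """
    world_size = tensor_size * expert_size * data_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tensor = rank % tensor_size
    expert = (rank // tensor_size) % expert_size
    data = rank // (tensor_size * expert_size)
    return Coord(data=data, expert=expert, tensor=tensor)


if __name__ == "__main__":
    # Hypothetical 2 x 2 x 2 grid over 8 GPUs.
    for rank in range(8):
        print(rank, rank_to_coord(rank, tensor_size=2, expert_size=2, data_size=2))
```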

Tunji Ruwase [00:11:53]: And so with the 3D parallelism here, I'm comparing DeepSpeed-TED with just the baseline DeepSpeed-MoE. And on the x axis there, you're seeing the number of GPUs. The y axis is showing the parameter size in billions, and we're limiting ourselves to just 128 experts and also tensor parallelism within a node. But the main takeaway here is that with DeepSpeed-TED, with tensor parallelism, we're able to scale the base model size up to close to five times compared to just using regular DeepSpeed-MoE. With this, though, we did observe some interesting challenges, where a big chunk of the time was spent in communication. So here we're showing the performance, the iteration time, breaking it down in terms of the compute, which we labeled "other", and then various communication, like all-gather, all-to-all, and all-reduce. And we see that about half of the time is spent in communication. And so with that, we applied different optimizations.

Tunji Ruwase [00:13:01]: The first one is something called duplicate token dropping. There we observed that there was duplication in how tokens were dropped, and by optimizing that, we're able to reduce the all-to-all time by about 64%. Another optimization we did was communication-aware activation checkpointing, which brought about a 33% reduction in all-reduce time. And so, in combination, these two optimizations helped to improve DeepSpeed-TED's performance by about 21% overall. And then the final result I have here is just, we did some strong scaling. This was on the Summit supercomputer. And the main takeaway here is that as we increase the number of GPUs, we see a strong scaling result where these optimizations were bringing about 20% to 29% speedups compared to the DeepSpeed-MoE baseline.

Tunji Ruwase [00:13:59]: And that's that for MoE. So then I'll move on to the data efficiency portion of it. And so, why do we care about this? Well, with large model training, as the models scale, it's been observed that we also need more data to train them to good quality. And so here I'm kind of showing on the x axis different models over time, and then the scale of data that was used to train them. And some of those models are things you recognize. The blue curve is showing the model scale, and the orange curve, as you can see, is showing the amount of data that was used to train it. And you can see that that itself is also scaling even faster than the model size. If we think about the overall cost for training the model, it's really a function of both the model size and the data size.

Tunji Ruwase [00:14:53]: And so we do need to control the data cost as well. And the way we've done that is by designing a framework called the data efficiency framework. And this framework is based on the observation that by using efficient data sampling and data routing algorithms, we can actually achieve a lot of efficiencies with training. And so, for example, we could achieve the same model quality using less data, so that would be one dimension. And so, on the x axis, we're showing the percentage of data that's used for training. And on the y axis, we're showing the model quality. And there we're showing that, yeah, we can reduce the amount of data to get to the same quality as a baseline, or we can achieve even higher quality if we fix our data costs. And so that's what this is showing here.

Tunji Ruwase [00:15:49]: And the framework itself, we've designed it to be something that is modular and allows for flexible composition of various sampling and routing algorithms. DeepSpeed is open source, and so we always welcome contributions. And so a lot of the solutions that we build, we build them to be modular and extendable. And so that's something we did here with the data efficiency framework. And so the things I would just want to call out here in terms of the framework: here we have your data analyzer that's looking at your training data as it's coming in, and we are providing something called a data sampler. Traditionally, the baseline sampler is typically random data sampling, but we created something called curriculum learning. So it's a different kind of sampling algorithm that helps to improve data efficiency. And then, once the data has been fed into the model, we have a data router module, which can sometimes bypass certain layers.

Tunji Ruwase [00:16:50]: And with this, just to be concrete, we've created two algorithms. One is curriculum learning. And the key idea there is that rather than processing your data in a random order, you should sample them from the easiest samples to the hardest samples. And if you look at the paper, you get more of those details. The other algorithm, which has to do with routing, is something called random layer-wise token dropping. And that's just based on the observation that the middle layers of your model can actually skip some of your tokens with minimal impact on the model quality. And so, in the time left, I'll just focus on the curriculum learning and show some of the results we had for GPT-2 evaluation. So the table that I'm showing here is showing the training time for a 1.5 billion parameter model.
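
The easy-to-hard idea can be sketched as a sampler whose candidate pool grows over training. This is a minimal illustration and not DeepSpeed's implementation: the function name, the linear pacing schedule, and the use of sequence length as the difficulty score are all assumptions made for the example.

```python
import math
import random
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_batches(samples: Sequence[T],
                       difficulty: Callable[[T], float],
                       batch_size: int,
                       num_steps: int,
                       start_fraction: float = 0.2,
                       seed: int = 0) -> Iterator[List[T]]:
    """Yield batches drawn from a pool that grows from easy to hard samples.

    At step 0 only the easiest `start_fraction` of the data is eligible;
    the eligible fraction grows linearly (an assumed schedule) until the
    whole dataset is available at the final step.
    """
    ordered = sorted(samples, key=difficulty)  # easiest first
    rng = random.Random(seed)
    for step in range(num_steps):
        progress = step / max(1, num_steps - 1)
        fraction = start_fraction + (1.0 - start_fraction) * progress
        pool_size = min(len(ordered),
                        max(batch_size, math.ceil(len(ordered) * fraction)))
        pool = ordered[:pool_size]
        yield rng.sample(pool, min(batch_size, pool_size))


if __name__ == "__main__":
    # Toy data: strings of length 1..10, with length as the difficulty score.
    data = ["x" * n for n in range(1, 11)]
    for step, batch in enumerate(curriculum_batches(data, len, batch_size=2, num_steps=5)):
        print(step, [len(s) for s in batch])
```

Within the eligible pool the sampling is still random, so the baseline random sampler corresponds to setting `start_fraction` to 1.0.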

Tunji Ruwase [00:17:45]: We see there that the first row of results is showing the baseline with a batch size of 512. The training time was over 300 hours, training over 157 billion tokens. And these are some downstream results. And then the next two rows are showing curriculum learning in two modes in which you could use it. So the first of those rows is showing if you use it in a way to reduce your data cost, and so you end up training in less time, less than half of the time, and you still end up achieving equivalent model quality to the baseline, as you can see there in terms of the WikiText and LAMBADA results. Or the last row shows the other mode you could use, in which case you actually still train over all of the data, but you end up getting much better downstream results. Perplexity, we can see there, is improved, and accuracy is improved as well. And so with curriculum learning, which is just one of the sampling algorithms that we provide in the data efficiency framework, you can either reduce your training costs or get better quality for the same cost.

Tunji Ruwase [00:19:04]: With all of those, as I wrap up, I just wanted to step back and look at the impact that DeepSpeed has had in the model training ecosystem. Like I said, it's an open source library, and so we do collaborate with a lot of external folks. And so DeepSpeed has been used to train pretty much most of the open source large models that you've heard of. And here's just a sampling of some of them. And so this is ranging from a 5 billion parameter model to a half-trillion parameter model, which was a collaboration with NVIDIA. You'll find DeepSpeed integrated into all of your favorite frameworks. It's a PyTorch-based library, and so you might have come across it even in your everyday use. And most recently, one direction we've been expanding on is our accelerator support.

Tunji Ruwase [00:19:54]: We started out as basically being just a CUDA library, but now we've expanded to other accelerators like AMD, the Habana Gaudi, and Intel. And there's even Apple support in there, if that's your preferred hardware. But all in all, we are trying to make sure DeepSpeed is available everywhere and to all users. And with that, I'm just going to wrap up by saying, yes, we are open source. We are proudly open source, and we invite collaborators. So please open your first pull request today, and you can follow us on X. Thank you so much.

Demetrios [00:20:33]: That's it, dude.

Tunji Ruwase [00:20:34]: Awesome.

Demetrios [00:20:34]: Pull requests. Welcome. That is what we're saying here. I love it. So there's some questions that came through in the chat that I would love to ask you before you jump off. First one we've got Ahmed asking, are mixture of experts self organizing mixture?

Tunji Ruwase [00:20:56]: No. What do you mean by. I mean, let me just talk about how the algorithm works, and then maybe that clarifies. So we have something called a router, which essentially, when it gets a token, decides which experts to route that token to. Now, that router itself is trained as part of the process, and so it's not. I mean, I guess maybe, if that's what you mean by self organizing. Yeah, it's self trained as part of the training process. It's not some prearranged or static routing thing.

Tunji Ruwase [00:21:27]: It's dynamic.
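
The router described here can be sketched as a small scoring function over experts. This is a minimal top-1 illustration under stated assumptions: the router is a single linear layer followed by a softmax, the weights below are fixed hypothetical values standing in for trained ones, and `route_top1` is not a DeepSpeed API.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token: Sequence[float],
               router_weights: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Pick the expert with the highest router probability for one token.

    `router_weights` has one row per expert. In a real model these weights
    are learned during training; here they are supplied by the caller.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


if __name__ == "__main__":
    # Hypothetical 2-expert router over 2-dimensional token embeddings.
    weights = [[1.0, 0.0], [0.0, 1.0]]
    for token in ([2.0, 0.0], [0.0, 3.0]):
        expert, prob = route_top1(token, weights)
        print(f"token {token} -> expert {expert} (p={prob:.2f})")
```

Because the choice depends on the token and on learned weights, the routing is dynamic rather than a fixed assignment.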

Demetrios [00:21:30]: Okay, yeah, that makes sense. And you did mention that you cut off the amount of experts at 128.

Tunji Ruwase [00:21:39]: No, we didn't. So, research studies, I mean, we're not the only folks working in this area. Mixture of experts is actually a very active area. So the most recent large models that you've heard of, like GPT-4 or Gemini, they're all mixture of experts, just for the same reason. Like, compute just is not scaling. Studies have been done that show that beyond 128 experts, you have diminishing returns. But that was just based on the available data, available knowledge. It doesn't mean that a year from now, no one's going to show up and say 1000 experts is actually the right way to go.

Tunji Ruwase [00:22:14]: But it's a very fast-moving field. That's...

Demetrios [00:22:18]: It turns out from 128 to 1000, it doesn't really matter. But then all of a sudden, 1000, throw 1000 at them and you might see some.

Tunji Ruwase [00:22:28]: Well, yeah, I mean, like, five years ago, when BERT came out, it was just 100 million parameters, and we were all super excited. Now we're at over a trillion parameters. So just five years.

Demetrios [00:22:42]: That's true, man. You never know what's coming next. That is awesome. So there is another question from Kim asking about how is this technique different from flash attention?

Tunji Ruwase [00:22:54]: That's a great question. So, flash attention is a specific optimization that focuses on the compute part of the transformer. So, in the transformer, which is the key computing block here, there's a portion of it called attention. All flash attention does is make it run more efficiently on GPUs by doing tiling, essentially a general tiling technique. So it is composable with what we do. It's orthogonal, solving a different problem.

Demetrios [00:23:28]: Okay, and how about, can you explain how X-LoRA is different than mixture of experts, LoRA?

Tunji Ruwase [00:23:35]: Okay, so LoRA is basically a parameter efficiency technique. So at the core of everything, all we're doing is matrix multiplies. We just have a bunch of matrices. And so with LoRA, the idea is that rather than represent the matrices as a whole, you can actually project them into a lower dimension without losing quality. And so that helps to reduce the memory size. So LoRA is another memory efficiency technique, and it's composable with us as well. We actually do support it.
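
The saving from the low-rank idea is easy to quantify. In the LoRA formulation, the update to a d by k weight matrix is expressed as the product of a d by r matrix and an r by k matrix with a small rank r, so only r times (d + k) values are trained. The sketch below counts parameters only; the matrix size and rank are hypothetical examples.

```python
def full_matrix_params(d: int, k: int) -> int:
    """Number of entries in a dense d x k weight matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable entries in a rank-r factorization: (d x r) plus (r x k)."""
    if r < 1 or r > min(d, k):
        raise ValueError("rank r must satisfy 1 <= r <= min(d, k)")
    return r * (d + k)


if __name__ == "__main__":
    d = k = 4096  # hypothetical layer size
    r = 8         # hypothetical rank
    full = full_matrix_params(d, k)
    low = lora_params(d, k, r)
    print(f"full: {full:,}  low-rank: {low:,}  ratio: {full // low}x")
```

For this example the low-rank factors hold 256 times fewer trainable values than the full matrix, which shrinks the gradient and optimizer memory for that layer accordingly; the original weights are still kept, frozen.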

Demetrios [00:24:11]: So it's like you can take advantage of...

Tunji Ruwase [00:24:15]: Actually, if you go online and search for something called parameter-efficient fine-tuning, or LoRA, you'll see DeepSpeed there. I think the initial studies from Hugging Face on using LoRA used DeepSpeed. And LoRA is also from Microsoft. So we do know the folks who designed it.

Demetrios [00:24:35]: Yeah, they're your amigos. You can walk down the hall and.

Tunji Ruwase [00:24:41]: Exactly.

Demetrios [00:24:42]: Excellent. This is a great question for the GPU poor, and you don't fall into that category.

Tunji Ruwase [00:24:49]: No. So that's a great question, and I kind of didn't cover that. But if you go back and look at our ZeRO-Offload and ZeRO-Infinity, that's essentially what that is for. If you don't have a lot of GPUs, but you can afford to buy a lot of SSDs, then you're good to go, because you can just host your model in the SSDs. Just because of the 20 minutes I had, I took out something called ZeRO-Inference. It's our inference solution as well, where basically you can run inference on something like GPT-3 on a single GPU by hosting it in NVMe. So just check that out. That's wild.

Tunji Ruwase [00:25:37]: Yeah.

Demetrios [00:25:37]: Wait a minute, did I hear that correctly? You can host GPT-3 on a single GPU?

Tunji Ruwase [00:25:42]: Yes, we put it on SSD. Right. As long as you have big enough SSD, you're good to go.
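
A rough capacity check shows why the SSD tier matters here. GPT-3 has 175 billion parameters, so at an assumed 2 bytes per weight the weights alone take about 350 GB. In the sketch below, the 80 GB GPU figure comes from the talk, while the CPU and NVMe capacities and both function names are hypothetical; the check looks only at where the weights could be stored and ignores the working memory needed to run each layer.

```python
from typing import Optional, Sequence, Tuple


def weights_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Storage for the weights alone, in GB (assumes 2-byte weights)."""
    return num_params * bytes_per_param / 1e9


def first_tier_that_fits(size_gb: float,
                         tiers: Sequence[Tuple[str, float]]) -> Optional[str]:
    """Return the first (fastest) memory tier with enough capacity, if any."""
    for name, capacity_gb in tiers:
        if size_gb <= capacity_gb:
            return name
    return None


# Tiers ordered fastest to slowest. Only the 80 GB GPU figure is from the
# talk; the CPU and NVMe capacities are hypothetical.
TIERS = [("GPU HBM", 80.0), ("CPU DRAM", 256.0), ("NVMe SSD", 2000.0)]

if __name__ == "__main__":
    size = weights_gb(175 * 10**9)
    print(f"about {size:.0f} GB of weights -> {first_tier_that_fits(size, TIERS)}")
```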

Demetrios [00:25:51]: All right, I like the way you're talking now. So when it comes to the DeepSpeed configuration, and continuing with this GPU-poor theory or theme, does applying a DeepSpeed configuration to model training improve speed on a single GPU, or is it only on inference with ZeRO?

Tunji Ruwase [00:26:14]: All right, great question. So single GPU will not, I mean, our ZeRO line of work does not help single-GPU speed, because ZeRO is an optimization that assumes you are doing data parallel training, and we're optimizing on the data parallel training side. What ZeRO would provide you is the capability to train a much larger model than you could actually fit on a single GPU. So it would be slower, but it will train. Like, you have that capability. That makes sense. Yeah.

Demetrios [00:26:45]: Cool. So this is, again, for those people that are asking in the chat. It's DeepSpeed ZeRO. Check it out. SSD is all you need, apparently. This has been awesome, dude.

Tunji Ruwase [00:26:55]: Yeah. Thank you so much.

Demetrios [00:26:56]: I really appreciate you coming on here.

Tunji Ruwase [00:26:57]: Thank you so much.

Demetrios [00:26:58]: And it's so cool. I feel honored that you come and give talks like this to us. It is one of these highlights, man.

Tunji Ruwase [00:27:08]: Of course, we love the community. Like, DeepSpeed is open source, like I said, and if you check out our contributor list, it's a lot of external folks, and we love it. We absolutely love it.

Demetrios [00:27:21]: PRs welcome.

Tunji Ruwase [00:27:22]: PRs welcome. All right. Thank you so much, Demetrios. Really appreciate the opportunity. All right, take care.

Demetrios [00:27:28]: Yeah, take care, dude. It has a.