Quantized LLM Training at Scale with ZeRO++ // Guanhua Wang // AI in Production 2025
SPEAKER

Guanhua Wang is a Senior Researcher on the DeepSpeed team at Microsoft. His research focuses on large-scale LLM training and serving. Previously, he led the ZeRO++ project at Microsoft, which helped cut model training time by more than half both inside Microsoft and at LinkedIn. He was also a lead and major contributor to Microsoft's Phi-3 model training. He holds a CS PhD from UC Berkeley, advised by Prof. Ion Stoica.
SUMMARY
Communication is the major bottleneck in large-scale LLM training. In ZeRO++, we quantize both weights and gradients during training in order to reduce communication volume by 4x, which reduces end-to-end training time by over 50%.
TRANSCRIPT
Adam Becker [00:00:00]: Let's bring on our final speaker for this stage. Alex. Alex, are you around? Let's see.
Guanhua Wang [00:00:11]: Yeah, I can hear you right now.
Adam Becker [00:00:13]: Alex, very good to see you.
Guanhua Wang [00:00:14]: How are you?
Adam Becker [00:00:16]: I'm excellent. I'm stoked to hear your talk. Do you have slides?
Guanhua Wang [00:00:20]: Let me do it now.
Adam Becker [00:00:22]: All right. I will be back in 20 minutes. Take it away.
Guanhua Wang [00:00:29]: Okay. Should I get started now or. Okay.
Adam Becker [00:00:34]: Yep, you're.
Guanhua Wang [00:00:36]: Hi everyone. I'm Guanhua Wang. I'm a senior researcher on the DeepSpeed team at Microsoft, and today I'm going to talk about how we train large language models at scale using quantization. The project I want to talk about is ZeRO++, which provides extremely efficient collective communication for large language model training. This is joint work with my teammates on the DeepSpeed team. The motivation here is that model sizes grow exponentially: originally BERT was 340 million parameters, and in recent years Megatron-Turing NLG is 530 billion parameters.
Guanhua Wang [00:01:19]: So the model size grows exponentially, which means we need to use more GPUs and more parallelism to do the training. However, for a specific model, the maximum global batch size is fixed during training. So if we include more GPUs, we have a smaller micro batch size per GPU during training. This is the first problem we want to solve here: the communication overhead becomes huge in this small-batch training setting. As you can see in the middle figure, from right to left, as we decrease the micro batch size per GPU from 24 to 8, the corresponding communication time increases from 26% up to 44% in the 8 micro-batch case. So the communication latency can be huge in small-batch-size training. The second problem we want to tackle is that the communication overhead can also be huge when we have limited internode bandwidth, as you can see in the right-hand-side figure, from left to right.
Guanhua Wang [00:02:38]: If we decrease the interconnect between the compute nodes from 8 IB links (here IB means InfiniBand) down to 1 IB link, you can see the TFLOPS per GPU drop by almost half in the 1 IB setting. So these are the two problems we want to tackle. First, communication is a huge bottleneck in small-batch training; second, communication can be really significant when we have limited internode bandwidth. So we want to reduce communication overhead in these two cases: small-batch training and limited internode bandwidth. Before we dive into the ZeRO++ work, I want to first give you a recap on the DeepSpeed ZeRO optimizer.
Guanhua Wang [00:03:25]: The ZeRO optimizer is an easy-to-use, memory-efficient data-parallel training paradigm from the Microsoft DeepSpeed team. For training billion- or trillion-parameter large language models, we usually use ZeRO Stage 3. As you can see in the bottom figure, this is a very simple example of the ZeRO Stage 3 training workflow. From the left, before we do any computation, we need to do a forward all-gather on the weights so that every GPU collects the full weights of a specific layer of the model. After every GPU gets the full weights of a layer, we do a forward compute. Then during the backward stage, before we do any backward computation, we do this all-gather on the weights again for each GPU to collect all the weights of a layer. After each GPU gets the full weights of a specific layer, we do a backward compute. Then, after we generate the gradients, we do a backward reduce-scatter on the gradient values, and finally we do a parameter update.
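To make that Stage 3 workflow concrete, here is a minimal, single-process Python sketch of one training iteration; the shard layout and the `all_gather`/`reduce_scatter` helpers are illustrative stand-ins, not the DeepSpeed API.

```python
# Schematic simulation of one ZeRO Stage 3 iteration on 4 "GPUs".
# Shards are plain Python lists indexed by rank; collectives are toy functions.

def all_gather(shards):
    """Every rank reconstructs the full layer weights from all shards (volume ~ M)."""
    return [x for shard in shards for x in shard]

def reduce_scatter(grads_per_rank, rank, world_size):
    """Each rank keeps only the summed slice of the gradient it owns (volume ~ M)."""
    n = len(grads_per_rank[0])
    chunk = n // world_size
    lo, hi = rank * chunk, (rank + 1) * chunk
    return [sum(g[i] for g in grads_per_rank) for i in range(lo, hi)]

world_size = 4
weight_shards = [[float(r)] * 2 for r in range(world_size)]  # each rank owns 1/4 of 8 params

full_weights = all_gather(weight_shards)   # forward all-gather on weights, then forward compute
full_weights = all_gather(weight_shards)   # backward all-gather on weights, then backward compute

# Backward produces a full gradient on every rank; reduce-scatter sums and re-shards it.
full_grads = [[1.0] * 8 for _ in range(world_size)]
my_grad_shard = reduce_scatter(full_grads, rank=0, world_size=world_size)
print(my_grad_shard)   # each rank then updates only its own weight shard
```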
Guanhua Wang [00:04:30]: This is the training workflow of one training iteration using ZeRO Stage 3. So let's first do a communication volume analysis on it. Basically, we have three collective communications involved in every training iteration, denoted as those orange boxes: a forward all-gather, a backward all-gather, and a backward reduce-scatter. Let's assume the model size is M. For the forward all-gather on weights, the communication data volume will be M. Similarly, for the backward all-gather on weights, the communication volume is the same as forward, so it's still M. For the third part, the backward reduce-scatter on gradients, the size is still M, because every weight has a counterpart gradient value.
Guanhua Wang [00:05:16]: So the number of values in the weights is the same as the number of values in the gradients. In total, for every training iteration, we need to communicate 3M of data, given a model size of M. So how can we reduce the communication volume? We propose three different components to reduce those three collective communication volumes one by one. The first one is called qwZ; it reduces the forward all-gather communication volume on weights. Here we adopt a quantization method and communicate 8-bit values instead of the traditional 16-bit values, so we can reduce the communication volume by half. But naively quantizing the values can cause model divergence, as you can see in the right-hand-side figure.
Guanhua Wang [00:06:07]: Basically, if we do a very simple quantize and dequantize, the data precision of the final values can have a lot of error compared with the values at the very top; those error values are denoted as the red numbers. So instead of doing naive global quantization, we do something called block quantization. We first chunk the original tensor into smaller blocks, and within each block we do the quantization. By doing this block quantization, we achieve a 3.3x data precision improvement over the baseline, and our highly optimized CUDA kernel achieves 2.5x faster performance compared with PyTorch native kernels. By doing this quantization on the weights during the forward all-gather, we can reduce the forward all-gather communication volume from M down to 0.5M. The second component is called hpZ, which is short for hierarchical partitioning in ZeRO. It tries to reduce the backward all-gather communication volume.
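Before moving on to hpZ, here is a rough sketch of the block-wise quantization just described, using a toy block size of 256 and symmetric int8. This is only an illustration of the idea, not the fused ZeRO++ CUDA kernel, and the error ratio it prints will not match the 3.3x figure from the optimized setup.

```python
import torch

def blockwise_int8_quant(x: torch.Tensor, block_size: int = 256):
    """Symmetric int8 quantization with one scale per block (illustrative only)."""
    x = x.flatten()
    pad = (-x.numel()) % block_size
    x = torch.nn.functional.pad(x, (0, pad))
    blocks = x.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def blockwise_dequant(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).flatten()

w = torch.randn(10_000)
q, s = blockwise_int8_quant(w)
w_block = blockwise_dequant(q, s)[: w.numel()]

# Naive global quantization: a single scale for the whole tensor.
g_scale = w.abs().max() / 127.0
w_naive = torch.clamp((w / g_scale).round(), -127, 127) * g_scale

print("block-wise mean abs error:", (w - w_block).abs().mean().item())
print("global     mean abs error:", (w - w_naive).abs().mean().item())
```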
Guanhua Wang [00:07:19]: In vanilla ZeRO Stage 3, we split the full model across all the GPUs in use, so when we do the all-gather, it triggers all the GPUs to communicate. Instead, in this hpZ module, we hold a secondary model replica within each machine, so that when we want to do this backward all-gather, the all-gather only happens within a machine, not across machines. This is a trade-off: we use more GPU memory in exchange for less communication volume. By holding this secondary model-shard replica within each machine, we can reduce the backward all-gather cross-node volume from M to zero, because we never do cross-node communication. The third component is called qgZ, which tries to reduce the backward reduce-scatter communication on gradients. The first thing we tried was to directly apply quantization to the gradient communication. That is not doable, because it introduces significant data precision loss due to the reduction operation. Reduction means we do a sum over an array or tensor.
Guanhua Wang [00:08:37]: So instead we propose a novel hierarchical all-to-all method, which is a replacement for the traditional reduce-scatter collective communication call: we communicate either four or eight bits, but we do the reduction, the sum operation, in full data precision. By adopting this novel hierarchical all-to-all module, called qgZ, we can reduce the backward reduce-scatter data volume from M to just 1/4 of M. So the strategy summary is: we have three different collectives during one training iteration. First, for the forward all-gather on weights, block-wise quantization reduces the size from M to 0.5M. For the backward all-gather on weights, holding a secondary replica of the model within each node removes this communication cost of M down to zero. And for the third part, the backward reduce-scatter of size M on gradients, our novel quantized collective communication call reduces it from M to 0.25M. So in total we reduce communication from 3M to 0.75M. Because of the time constraints, I will only dive into the system design for the third part, the gradient communication part, which is the qgZ module.
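As a quick sanity check on those totals, in units of the model size M (the 0.5x, zero, and 0.25x factors are the ones stated in the talk for qwZ, hpZ, and qgZ):

```python
# Per-iteration cross-node communication volume, in units of the model size M.
M = 1.0

zero3 = {
    "forward all-gather (fp16 weights)":   1.0 * M,
    "backward all-gather (fp16 weights)":  1.0 * M,
    "backward reduce-scatter (gradients)": 1.0 * M,
}
zeropp = {
    "forward all-gather (qwZ, int8 weights)":     0.5 * M,   # 16-bit -> 8-bit
    "backward all-gather (hpZ, intra-node only)": 0.0,       # no cross-node traffic
    "backward reduce-scatter (qgZ, int4 grads)":  0.25 * M,
}

print("ZeRO-3 :", sum(zero3.values()), "x M")    # 3.0 x M
print("ZeRO++ :", sum(zeropp.values()), "x M")   # 0.75 x M, a 4x reduction
```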
Guanhua Wang [00:10:11]: The initial challenges for quantized gradients are the following three bullet points. First, there is no existing collective for quantized gradient communication; libraries such as NCCL or RCCL do not support quantized gradient collectives. Second, the 1-bit Adam optimizer cannot be used in ZeRO Stage 3, because 1-bit Adam assumes every GPU has the global optimizer states, but in ZeRO Stage 3 we split them across all the GPUs. Third, directly applying quantization to reduce-scatter can lead to longer end-to-end latency and lower data precision. As you can see in the bottom figure, those blue boxes: we have four GPUs, G0, G1 up to G3, and they form a ring topology and want to do a reduce-scatter. It may work like this: after G0 receives data from G3, it first does a dequantization locally and then sums its local gradients with the received gradients to get a sum over G3's and G0's gradients.
Guanhua Wang [00:11:27]: After that, we do a quantization on the sum of these gradients and pass this partial sum to GPU 1. GPU 1 does the same thing: first a dequantization, then a sum, then a quantization, and passes it to GPU 2, and so on and so forth. So in order to finish one round of reduce-scatter, the number of sequential quantization-dequantization kernels involved equals the number of GPUs. Here we have four GPUs, so the number of sequential Q+D is four. If we have 1,000 GPUs, the number of sequential quantizations could be 1,000, which is really huge. So the first problem is that we may have too many sequential quantizations. How can we solve this? Our solution is to directly replace this ring-based reduce-scatter with our one-hop all-to-all communication protocol. As you can see in the right-hand-side figure, after each GPU generates its local gradients (those gray boxes), we first do a one-shot quantization to generate those blue boxes, then do a one-shot all-to-all across all the GPUs in use, and then do a dequantization and reduction to get the final correct answer.
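A minimal single-process simulation of that contrast is below; the "quantizer" is just rounding, so the only point is that the ring version re-quantizes the partial sum at every GPU it passes through, while the one-shot all-to-all version quantizes once per GPU.

```python
import numpy as np

N = 4                                                 # GPUs in the ring
rng = np.random.default_rng(0)
grads = [rng.standard_normal(N) for _ in range(N)]    # grads[g][c] = GPU g's gradient chunk c

def quantize(x):
    """Toy stand-in for a low-precision quantize/dequantize round trip."""
    return np.round(x, 2)

# (a) Ring reduce-scatter with quantized hops: the partial sum for chunk c is
#     dequantized, accumulated, and re-quantized at every GPU it passes through.
reduced_ring = np.zeros(N)
sequential_qd = 0
for c in range(N):                                    # chunk c ends up on GPU c
    partial = quantize(grads[(c + 1) % N][c])
    hops = 1
    for step in range(2, N + 1):                      # walk the ring back to GPU c
        partial = quantize(partial + grads[(c + step) % N][c])
        hops += 1
    reduced_ring[c] = partial
    sequential_qd = max(sequential_qd, hops)

# (b) qgZ-style one-shot: quantize local gradients once, all-to-all, then a single
#     dequantize-and-reduce (here just the sum, since quantize() stays in float).
q_local = [quantize(g) for g in grads]
reduced_one_shot = np.array([sum(q_local[g][c] for g in range(N)) for c in range(N)])

print("sequential Q+D per chunk, ring    :", sequential_qd)   # grows with the GPU count
print("sequential Q+D per chunk, one-shot:", 1)
print("results agree:", np.allclose(reduced_ring, reduced_one_shot, atol=0.1))
```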
Guanhua Wang [00:12:45]: By replacing the ring-based reduce-scatter with our one-hop all-to-all, we reduce the number of sequential quantization-dequantization steps from the number of GPUs down to just one. However, there is a second challenge: with our one-hop all-to-all protocol, we may have a communication volume blow-up. As you can see in the left-hand-side figure, for the baseline ring-based reduce-scatter, the cross-node communication volume is always M, because all the data chunks follow the ring topology to send and receive, so the total amount of data communicated is M. But in the middle figure, assuming we have N GPUs per node and a model size of M, after each GPU does a local quantization it reduces its local gradient size from M to M/4. After that we need to do the one-shot all-to-all, which means every GPU needs to send out M/4 of data across nodes. So in total the cross-node communication volume can be N times M/4, because we have N GPUs per node and each GPU needs to send M/4 of data.
Guanhua Wang [00:14:03]: So this is much bigger than the original size of M. How can we solve this? We change our plan: instead of doing a one-hop all-to-all, we do a two-hop all-to-all. In step one, as you can see in the right-hand-side figure, we do an intra-node all-to-all and reduction in order to reduce the communication volume per GPU. Here we reduce the volume per GPU from M/4 down to M/(4N). After that, we do the inter-node all-to-all, where each GPU only needs to send out M/(4N) of data, so in total each node only needs to send out N times M/(4N), which is M/4 of data. By doing this two-step all-to-all instead of one hop, we remove the communication volume blow-up issue.
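Restating that counting argument in code, with an assumed N of 8 GPUs per node and 4-bit gradients (so a quantized local gradient is M/4):

```python
# Cross-node traffic for the gradient collective, per node, in units of M.
M = 1.0     # model size
N = 8       # GPUs per node (illustrative)

ring_reduce_scatter = 1.0 * M              # baseline: chunks traverse the ring, ~M total
one_hop_all_to_all  = N * (M / 4)          # every GPU ships its M/4 quantized gradient
two_hop_all_to_all  = N * (M / (4 * N))    # intra-node reduce first, then M/(4N) per GPU

print("ring reduce-scatter:", ring_reduce_scatter)   # 1.0  x M
print("one-hop all-to-all :", one_hop_all_to_all)    # 2.0  x M -> blow-up
print("two-hop all-to-all :", two_hop_all_to_all)    # 0.25 x M
```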
Guanhua Wang [00:14:59]: The third challenge with our hierarchical all-to-all protocol is a data misplacement issue. In this setting we have two machines, each with two GPUs. After each GPU generates its local gradients (the gray boxes at the bottom), the correct final data placement after our communication should be those green boxes at the top, which means GPU 0 has gradient chunk 1, GPU 1 has gradient chunk 2, and so on. However, if we directly apply our hierarchical all-to-all, after the first step, the intra-node all-to-all, you can see that GPU 1 cannot even see gradient chunk 2, which should be its correct final data placement. Similarly, GPU 2 cannot see gradient chunk 3, which should be its correct final data placement. Because of this, after the second round, the inter-node all-to-all, we have a data misplacement issue between GPU 1 and GPU 2.
Guanhua Wang [00:16:09]: So how did we solve this? We do something called tensor slice reordering before we do any quantization or any communication. As you can see in the left-hand-side figure, after each GPU generates its local gradients, we first swap the gradient chunks within each local GPU. Here we swap the order of gradient chunks 2 and 3, shown as those orange arrow lines. After that, we do the intra-node all-to-all and reduction, and then the inter-node all-to-all and reduction. Now everything works perfectly and correctly. So the overall qgZ module workflow looks like this: before we do any quantization, we first do a tensor slice reordering.
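The misplacement and the swap fix can be checked with a tiny single-process simulation (2 nodes x 2 GPUs, 4 gradient chunks per GPU, with the hierarchical all-to-all modeled as index bookkeeping); this is only an illustration of the reordering idea, not DeepSpeed code.

```python
import numpy as np

NODES, GPUS_PER_NODE = 2, 2
WORLD = NODES * GPUS_PER_NODE               # 4 GPUs, 4 gradient chunks per GPU
rng = np.random.default_rng(0)
grads = rng.standard_normal((WORLD, WORLD)) # grads[g][c] = GPU g's chunk c (scalar chunks)
target = grads.sum(axis=0)                  # GPU k should end up with reduced chunk k

def hierarchical_all_to_all(chunks):
    """Two-hop all-to-all + reduction, modeled as plain index shuffling."""
    # Step 1: intra-node all-to-all + reduce. Local GPU j of a node keeps the j-th
    # slice of the chunk list, summed over the GPUs of that node.
    per_gpu = WORLD // GPUS_PER_NODE
    intra = np.zeros((WORLD, per_gpu))
    for node in range(NODES):
        ranks = [node * GPUS_PER_NODE + j for j in range(GPUS_PER_NODE)]
        for j, dst in enumerate(ranks):
            intra[dst] = sum(chunks[src][j * per_gpu:(j + 1) * per_gpu] for src in ranks)
    # Step 2: inter-node all-to-all + reduce between same-index GPUs of each node.
    out = np.zeros(WORLD)
    for j in range(GPUS_PER_NODE):
        ranks = [node * GPUS_PER_NODE + j for node in range(NODES)]
        for k, dst in enumerate(ranks):
            out[dst] = sum(intra[src][k] for src in ranks)
    return out

naive = hierarchical_all_to_all(grads)

# qgZ fix: swap chunks 2 and 3 (indices 1 and 2) on every GPU before communicating.
reordered = grads[:, [0, 2, 1, 3]]
fixed = hierarchical_all_to_all(reordered)

print("without reordering correct:", np.allclose(naive, target))  # False: chunks 2/3 swapped
print("with    reordering correct:", np.allclose(fixed, target))  # True
```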
Guanhua Wang [00:17:03]: After the reordering, we do intra-node quantization, then intra-node all-to-all communication as step one, and then intra-node dequantization and reduction to get the intra-node reduced gradient values. After that, we do inter-node quantization, inter-node all-to-all communication, and inter-node dequantization and reduction in order to get the correct final answer on every GPU. We further optimize this workflow in two ways. The first is kernel fusion: we fuse subsequent kernels together to reduce the number of I/O accesses to GPU global memory, which improves the efficiency of the kernels. We do two kernel fusions. One fuses the tensor slice reordering and the intra-node quantization.
Guanhua Wang [00:17:59]: The second fuses the intra-node dequantization, intra-node reduction, and inter-node quantization. The second optimization we did is a kind of communication collective overlapping. Here we overlap the intra-node communication and the inter-node communication on different data chunks that have no data dependencies between them, so we can do the overlapping. By doing this overlap we can reduce the inter-node latency.
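A rough sketch of that chunk-level overlap with non-blocking collectives is below; it assumes a torchrun launch on a GPU cluster with NCCL, and the subgroup construction, chunk sizes, and variable names are illustrative rather than DeepSpeed's actual implementation.

```python
import os
import torch
import torch.distributed as dist

# Overlap intra-node and inter-node all-to-all on *different* gradient chunks.
dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
gpus_per_node = int(os.environ["LOCAL_WORLD_SIZE"])
num_nodes = world // gpus_per_node
torch.cuda.set_device(local_rank)

# Every rank must create every subgroup; keep only the ones this rank belongs to.
intra_group = inter_group = None
for n in range(num_nodes):
    g = dist.new_group([n * gpus_per_node + i for i in range(gpus_per_node)])
    if n == rank // gpus_per_node:
        intra_group = g
for i in range(gpus_per_node):
    g = dist.new_group([n * gpus_per_node + i for n in range(num_nodes)])
    if i == local_rank:
        inter_group = g

# Chunk sizes are illustrative; they must be divisible by the respective group size.
chunk_a = torch.randn(1 << 20, device="cuda")   # chunk currently in its intra-node phase
chunk_b = torch.randn(1 << 20, device="cuda")   # a later chunk in its inter-node phase
out_a, out_b = torch.empty_like(chunk_a), torch.empty_like(chunk_b)

# Launch both collectives as async ops; since they touch different chunks there is
# no data dependency, so intra-node and inter-node traffic proceed concurrently.
work_a = dist.all_to_all_single(out_a, chunk_a, group=intra_group, async_op=True)
work_b = dist.all_to_all_single(out_b, chunk_b, group=inter_group, async_op=True)
work_a.wait()
work_b.wait()
dist.destroy_process_group()
```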
Guanhua Wang [00:18:40]: Now we come to the evaluation part. As you can see in the left-hand-side table, we focus on small batch sizes; the tokens per GPU are just 1K or 2K. We mainly focus on the second column, the low cross-node bandwidth column, which is one IB connection between nodes, meaning just 100 gigabits per second across nodes. You can see in that middle column that ZeRO++ achieves up to 2.16x speedup over the ZeRO baseline for large language model training. Even for the 8 IB connections, the red column in the table, we achieve roughly a 16 to 30% speedup over the ZeRO baseline. In the right-hand-side figure we also show the scalability of ZeRO++ from 64 GPUs up to over 300 GPUs. As you can see, both with low cross-node bandwidth (the left two figures) and with high-bandwidth communication channels (the right-hand-side figures), ZeRO++ achieves higher TFLOPS per GPU than ZeRO; especially for the IB equals 1 cases, the left two figures, we consistently get almost 2x speedup over ZeRO. We also ran validation loss tests to check that the model converges correctly. We tested two models: the left is a smaller GPT model with 350 million parameters, and the right is GPT-13B. In both cases the ZeRO++ loss curve matches the baseline ZeRO Stage 3, and later we also introduced a kind of six-bit quantization, between four and eight bits, for some use cases. We also conducted some evaluations on RLHF training with Llama-2 70B and OPT-30B.
Guanhua Wang [00:20:46]: As you can see in the left-hand-side figure, the orange bars are ZeRO++ throughput and the blue bars are ZeRO Stage 3. For Llama-2 70B we achieve a 3.3x speedup, and for OPT-30B we achieve a 3x speedup. One public announcement from LinkedIn is that ZeRO++ helped them reduce over 50% of the training time in their AI production workload training at LinkedIn; this is the LinkedIn post here, and we had an offline gathering and a talk at the LinkedIn headquarters. The summary here is that ZeRO++ incorporates three techniques, qwZ, hpZ, and qgZ, to optimize end-to-end communication during ZeRO training, reducing communication volume by 4x compared with ZeRO. ZeRO++ achieves up to 2.16x speedup when training on over 300 GPUs and a 3.3x speedup on RLHF training, and ZeRO++ is already integrated into our open-source DeepSpeed repo as the next-generation ZeRO training engine. We are looking for all kinds of collaborators and users; if you are interested, feel free to ping me at the email at the bottom.
Adam Becker [00:22:15]: Thank you. Nice, Alex. This is, I mean, this is incredible work that you guys are doing. What are the next steps? So first of all, you've invented some of these methods; now you're already using them internally, and you see the results with LinkedIn. Are they broadly available? Is everybody starting to adopt them at Microsoft, or is it still in a little bit more of an experimental phase?
Guanhua Wang [00:22:44]: In Microsoft? We have some production workloads running using ZeRO++. One thing I can say is that, for example, the popular small language models at Microsoft, the Phi-3 series, use ZeRO++ to improve training throughput. We also try to apply our method to some internal non-public workloads, but I'm not allowed to talk about those.
Adam Becker [00:23:15]: We won't press you on that. A question from the audience: is there a trade-off between reducing training time and model performance?
Guanhua Wang [00:23:23]: Yeah, that's a great question. Usually people believe that when we do quantization, model convergence can have a big issue. But what we saw is that if we quantize to 8 bits, it is okay and the fully trained model is usable. Maybe after the model is fully trained, compared with training in full precision, some benchmarks go a little bit lower than with full precision, but the model is usable. And in most cases, for smaller models, from several billion up to maybe less than 200 billion parameters, we never see any correctness issue or accuracy loss. Only very big models, maybe over 500 billion parameters, have some accuracy loss.
Adam Becker [00:24:22]: Alex, thank you very much. If folks have questions, maybe they can find you afterwards or join the chat.
