Domino: Communication-Free LLM Training Engine
Guanhua Wang is a Senior Researcher on the DeepSpeed team at Microsoft. His research focuses on large-scale LLM training and serving. Previously, he led the ZeRO++ project at Microsoft, which helped cut model training time by more than half both inside Microsoft and at LinkedIn. He also led and was a major contributor to Microsoft's Phi-3 model training. He holds a CS PhD from UC Berkeley, advised by Prof. Ion Stoica.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Given the popularity of generative AI, Large Language Models (LLMs) often consume hundreds or thousands of GPUs to parallelize and accelerate the training process. Communication overhead becomes more pronounced when training LLMs at scale. To eliminate communication overhead in distributed LLM training, we propose Domino, which provides a generic scheme to hide communication behind computation. By breaking the data dependency of a single batch training into smaller independent pieces, Domino pipelines these independent pieces of training and provides a generic strategy of fine-grained communication and computation overlapping. Extensive results show that compared with Megatron-LM, Domino achieves up to 1.3x speedup for LLM training on Nvidia DGX-H100 GPUs.
Guanhua Wang [00:00:00]: Yes. So, I'm Guanhua from the Microsoft DeepSpeed team. I'm a senior researcher here, and I drink pour-over coffee, mainly using Blue Bottle coffee beans. I really like it, and I drink it maybe around 8 or 9 in the morning and another cup at noon.
Demetrios [00:00:17]: Optimal throughput is what we're talking about today with Guanhua from the DeepSpeed team. And it was good. I was holding onto my seat the whole time, as always. I'm Demetrios, your host for the MLOps Community podcast. Listen to this guy's credentials and tell me he is not killing it in the LLM AI world. The dude created Phi-3. The dude worked on ZeRO++, which makes model training time very, very short, or shorter per se. But really where he hit the home run is where we spent the second half of this hour talking.
Demetrios [00:01:03]: And that is Domino and what Domino does to bring down the training time. It is incredible to hear how he did this and what the DeepSpeed team is focused on right now. I will give a quick plug: if you are looking to get a job, the DeepSpeed team is hiring aggressively. They're looking for senior developers and researchers who want to come and do stuff with Domino. So let's get into this podcast, and as always, if you liked it, share it with just one friend. Let's start with this question, man. This is always on my mind.
Demetrios [00:01:52]: How do you pronounce it? Is it "fee"? Is it "fye"? What do you call it?
Guanhua Wang [00:01:57]: Oh, internally we call it "fee" 3, even though it's spelled Phi. We call it "fee," not "fye." I don't know why, it's just kind of the term.
Demetrios [00:02:06]: It confuses me so bad. It's like, what is this? What's the pronunciation on this? But that kind of adds a little mystique to it. And can you break down what it is? What is Phi?
Guanhua Wang [00:02:20]: Oh, Phi. Previously there was something called Phi-2, which was the previous version of the small language model before Phi-3. And what they mean by Phi is that they want to mimic the physical environment, such that the LLM, or the small language model, can reflect some physical environment and take some action. That was the end goal originally. But we are at the very beginning, and later on, after we finished Phi-3, the team did some reorg, and we may not continue to Phi-4 or Phi-5.
Demetrios [00:03:05]: Oh, no, no. Oh, so there's not going to be any of this?
Guanhua Wang [00:03:09]: No, because the project lead just jumped to OpenAI; he's not at Microsoft anymore.
Demetrios [00:03:17]: Oh, interesting. And the whole thing, I know that there was a lot of hype around it when it first came out. Especially like Phi-2, right? Because it was so small and it had a lot of potential.
Guanhua Wang [00:03:31]: Yeah. And that's the reason right now you can see Llama 3.2 or something like that also doing this kind of small language model, not the big one, not like several tens of billions or hundreds of billions of parameters.
Demetrios [00:03:44]: And what were you doing to get such solid results or performance with such a small model? What like when you were training it, what were things that you made sure to do differently?
Guanhua Wang [00:03:58]: Yeah, so that's a really interesting question. We did a lot of fancy tricks to make the small language model work. The most important thing, I think, because I was leading the training side of it, is that we need to have very high-quality data, which means our data has minimal noise, and we need to do data pre-processing and data cleaning all before we do the training. That takes a lot of time and a lot of money. We also needed to purchase some high-quality data from top-notch publications, like the New York Times, Forbes, those kinds of sources. We needed to purchase the text from them because it is really high quality and usually they don't have any wrong information inside their data. That's the most important thing.
Demetrios [00:04:56]: And was it all human created data and all? You couldn't use any synthetic data, could you?
Guanhua Wang [00:05:04]: We have synthetic data, but that portion is not that big. The synthetic data is relatively small, because that synthetic data just tries to mimic some part of the original data. And usually, to me personally, I think it's just a waste of training time, because the model just sees roughly the same pattern compared with the original human-written data.
Demetrios [00:05:33]: And so now you're saying that Phi-4 and Phi-5 are not going to happen because, what, the performance wasn't good enough to make you think it was worth continuing with? Or was it more like, let's just hand that off to a different team?
Guanhua Wang [00:05:54]: I think because AI inside Microsoft is very highly competitive, we have several teams doing similar things. It's just a company-wide strategy or policy; they don't want us to continue working on that. Right now we only have two lines of models we want to keep training. One is from the OpenAI line, the other is from Microsoft AI, Mustafa's team.
Demetrios [00:06:24]: Yeah, yeah. From the Inflection team. Right, yeah. And so this is fascinating to me, because there is such a large amount of interest that goes into small language models. And small language models are supposed to be this solution for architectures to do things faster and cheaper. That's what we've been promised, in a way. Now, whether or not that's a reality today is debatable, I think. And you really have to do...
Demetrios [00:06:58]: A lot. You laugh at that, but you have to do a lot of work to make the small language model be as good as it can be. Right? It's not like you can just grab it off the shelf, plug it into your system, and it's going to be working.
Guanhua Wang [00:07:14]: Yeah. That's the reason right now, at least inside Microsoft, for small language models post-training is even more important than pre-training, because we need to learn the customized data.
Demetrios [00:07:27]: And now that kind of rolls into the next topic that I wanted to talk about, which is DeepSpeed and all the work that you've been doing on that. Because it feels like DeepSpeed helps out with pre- and post-training, right?
Guanhua Wang [00:07:41]: Yeah, yeah, yeah, yeah. Right now we also do quantization. Yeah.
Demetrios [00:07:46]: So there you go. It's like the trifecta. And can you explain to people what DeepSpeed is?
Guanhua Wang [00:07:55]: I think we can call DeepSpeed a third-party library built on top of PyTorch. What we did is provide some new features that PyTorch didn't have previously. The most important one is called the ZeRO optimizer. Basically, it's a data-parallel training paradigm, but it is very memory efficient. Different from traditional data-parallel training, where each GPU needs to maintain the whole model's parameters, in the ZeRO strategy each GPU only maintains a portion of the model's parameters. When one specific layer needs to do some computation, every GPU does an all-gather, a re-collection of the weights from the other GPUs, to finish this layer's compute, and then just erases or dumps this layer's weights. Once we need to compute on this layer again, we do another all-gather of the weights across the GPUs.
Guanhua Wang [00:08:55]: So it's memory efficient. That's a huge performance gain compared with vanilla PyTorch, because previously PyTorch didn't support such a memory-efficient data-parallel training paradigm. And right now they're doing something similar called FSDP, which is similar to ZeRO, but it's a PyTorch-native library. That's one big piece. The other thing we did is data offloading, to further release memory pressure on the GPU side. If a chunk of data doesn't need to be computed on at the moment, we move it from GPU memory back to CPU memory, so that we have more GPU memory to use for other computation. And when we need to compute on some tensors sitting on the CPU side, we just move them back to GPU memory and do the compute.
Guanhua Wang [00:09:49]: Once we get the results, we move it back to CPU memory to save the memory.
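To make the ZeRO idea above concrete, here is a minimal sketch of the gather-compute-release pattern (an illustration only, not DeepSpeed's actual implementation; it assumes torch.distributed is already initialized, and the function name is made up for this example):

```python
# Toy sketch of ZeRO-style parameter sharding: each rank keeps only a
# 1/world_size shard of a layer's flattened weight, and the full weight lives
# on the GPU only for the moment that layer actually computes.
import torch
import torch.distributed as dist

def forward_with_sharded_weight(weight_shard: torch.Tensor, x: torch.Tensor,
                                out_features: int, in_features: int) -> torch.Tensor:
    """weight_shard: this rank's equal-sized slice of the flattened weight."""
    world_size = dist.get_world_size()
    # 1. All-gather the shards so every rank briefly holds the full weight.
    gathered = [torch.empty_like(weight_shard) for _ in range(world_size)]
    dist.all_gather(gathered, weight_shard)
    full_weight = torch.cat(gathered).view(out_features, in_features)
    # 2. Run this layer's computation with the reconstructed weight.
    y = x @ full_weight.t()
    # 3. Drop the gathered copy right away; only the shard stays resident.
    del gathered, full_weight
    return y
```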
Demetrios [00:09:56]: So I like that how it's constantly.
Guanhua Wang [00:09:58]: Going back and forth, back and forth.
Demetrios [00:09:59]: Yeah, yeah, yeah. Just to make that more memory efficient. So that's one aspect of it. And then the other aspect was, like you said... is it wrong to think of it as, you know, in databases you have sharding? Is it similar to that, where you're just sharding across the GPUs for the training? Yeah.
Guanhua Wang [00:10:21]: Yes, ZeRO is like database sharding.
Demetrios [00:10:24]: Yes, yeah, yeah. So you're taking a GPU and you're saying, you don't need to have everything, all this information of the model, on this GPU; you just need to have the parts that you're training right now. And then when you realize there are certain things that you don't need, even from that, you offload those onto the CPU.
Guanhua Wang [00:10:48]: Right.
Demetrios [00:10:48]: And.
Guanhua Wang [00:10:49]: Yeah, yeah, yeah. Very similar. Yes. Since we are a systems research group, we often do these kinds of systems tricks.
Demetrios [00:10:58]: Yeah. What are some crazy ideas that you haven't implemented but you thought about or you tried to implement and it just didn't make it past.
Guanhua Wang [00:11:05]: Yeah, one crazy idea we actually implemented, but not many people use it because of a hardware requirement. We implemented something called ZeRO-Infinity. Basically, instead of offloading data to CPU memory, we offload data to disk, because disk has a much bigger volume compared with any memory, CPU or GPU memory. And by enabling that feature, we can train a large language model on a single GPU. That's the craziest thing we have done. But it depends on specific disk hardware, an NVMe device. A lot of people don't have this device, so they cannot use the feature.
Demetrios [00:11:50]: So basically it's offloading onto the NVMe, and you're able to... because the NVMe is much larger, or it's much faster?
Guanhua Wang [00:12:02]: It's much faster. Much faster than a traditional PCIe-based disk.
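For readers who want to try that offload path, a DeepSpeed config along these lines is roughly what enables it (a sketch to check against the DeepSpeed documentation rather than a copy-paste recipe; the NVMe path is a placeholder):

```python
# Sketch of a ZeRO-Infinity style DeepSpeed config: parameters and optimizer
# state are paged out to a local NVMe drive instead of staying in GPU memory.
# Exact key names and tuning knobs should be verified against the DeepSpeed docs.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer state
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
# The dict would then be passed to deepspeed.initialize(...) as the training config.
```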
Demetrios [00:12:06]: Sneaky. Yeah, yeah, I think so. We had a talk on DeepSpeed probably almost a year ago now at... yeah, at the conference. And one of the things that was said was, yeah, we trained a large language model on a single GPU, and I was just like, what? No, that's impossible.
Guanhua Wang [00:12:28]: Yeah. Another shortcoming of that is it takes infinitely long, because we need to move data back and forth, and disk is usually slow.
Demetrios [00:12:37]: That's why you called it Infinity.
Guanhua Wang [00:12:42]: Like 24 hour.
Demetrios [00:12:44]: So it is possible to train on a single GPU. It's just that it might not be in your lifetime.
Guanhua Wang [00:12:50]: Right, right. Maybe it takes tens of thousands of years for big model training. Yeah.
Demetrios [00:12:56]: All right. So then there's the memory constraint. And so I'm seeing the work you're doing is really cool, because you've got this small language model. You're saying the pre-training is great, but then we need the post-training to be as important, if not more important. And so you're using DeepSpeed on that post-training?
Guanhua Wang [00:13:19]: Yes, we use DeepSpeed for both pre-training and post-training. And actually we have another stage called mid-training. Mid-training, that's a weird term, but what it means is we try to support different languages; we call it mid-training. It's just in between pre-training and post-training. Post-training is for customized data; mid-training is just for more language support.
Guanhua Wang [00:13:46]: And pre-training is just to pre-train the model using the original data we purchased.
Demetrios [00:13:53]: And so you have that pre-training, then you stuff in all of the languages to kind of amplify it, make it a polyglot, make it speak more languages. And that happens in the mid-training.
Guanhua Wang [00:14:06]: Yeah.
Demetrios [00:14:07]: And then when we grab it off the shelf, like if I download Phi from Hugging Face, does it come with the mid-training already?
Guanhua Wang [00:14:16]: Oh, usually not. Usually they only support English, or at best they support German, I remember. Yeah. Not many languages.
Demetrios [00:14:28]: Yeah, yeah, yeah. German. That's a random one. I mean I live in Germany so that's great for me. But you would think like Spanish or Portuguese, those are more.
Guanhua Wang [00:14:39]: Yeah, we have some mid-training to support that. It also supports some Indian languages, or Chinese.
Demetrios [00:14:45]: Yeah. And then what does the post-training look like? And can you explain how you guys do it with DeepSpeed, and what techniques you're using?
Guanhua Wang [00:14:56]: Yeah. Post-training mainly involves two different aspects. The first is we have customized tokens or customized text from users, both internal users and some external users. They use it to further fine-tune the model. And in that case, usually we don't use very long sequences, to ensure that our model can train on a relatively small cluster, because the longer the sequence, the more GPUs we need to use, basically.
Demetrios [00:15:30]: Even with DeepSpeed?
Guanhua Wang [00:15:32]: Yeah. Because the sequence itself will generate some big activations in the forward pass and some gradients in the backward pass. Those activation and gradient sizes are proportional to the sequence length.
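As a rough back-of-the-envelope illustration of that proportionality (illustrative numbers, not figures from the conversation; attention-score matrices and exact constants are ignored):

```python
# The dominant activation tensors in a transformer layer have shape
# (batch, seq_len, hidden), so their memory grows linearly with sequence length.
def rough_activation_bytes(batch: int, seq_len: int, hidden: int,
                           layers: int, bytes_per_elem: int = 2) -> int:
    return batch * seq_len * hidden * bytes_per_elem * layers

# Doubling the sequence length doubles the estimate:
print(rough_activation_bytes(1, 4096, 4096, 32) / 2**30, "GiB")  # ~1.0 GiB
print(rough_activation_bytes(1, 8192, 4096, 32) / 2**30, "GiB")  # ~2.0 GiB
```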
Demetrios [00:15:48]: Okay. All right. And does that have anything to do with, later on, the context window that you're able to support?
Guanhua Wang [00:15:55]: Yeah. So usually, for example, if we train with a 4K sequence, we can do a context window of 1K or 2K; that should be fine. The other line of post-training is we also add quantization on top of the base model and try to train on the customized input data at the same time. We do some quantization, like 4-bit or even 3-bit quantization.
Demetrios [00:16:22]: Yeah. Not one bit yet though, huh?
Guanhua Wang [00:16:26]: I don't think so. Yeah, I think 3-bit is almost touching the ceiling.
Demetrios [00:16:31]: Yeah. Why is that?
Guanhua Wang [00:16:33]: Because the fewer bits you use, the less data precision you will have. And 3- or 4-bit data precision is already not that good, even if we do a lot of fine-tuning on top of that. Usually we only expect the model accuracy, or the model quality, to be slightly below or roughly the same as the base model; that is our expectation. But when we try to use something like two bits or even one bit, the model accuracy decreases significantly, to only half or even less of the original base model's accuracy.
Demetrios [00:17:14]: So again this goes back to: yeah, you can, but do you want to? You can train a large language model on a single GPU, but do you want to? You can quantize it with one bit, but do you want to? Like, what's the point?
Guanhua Wang [00:17:33]: I think no one would use it.
Demetrios [00:17:35]: Yeah, yeah, exactly. No one's going to use it. But the quantization, I wasn't super clear on that. Does it happen after the mid-training and after the post-training, or is it part of the post-training?
Guanhua Wang [00:17:48]: It's part of the post-training. Yeah, sorry, I may not have been very clear. So for every weight tensor we have a quantized weight tensor, which is the same as the original one but with fewer bits. And during training, when we pass a gradient to the weight tensor, we pass it through the quantized weight, not the original full-precision weight.
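A generic sketch of that pattern is quantization-aware training with a straight-through estimator (an illustration of the general technique, not Microsoft's internal recipe; the helper name is made up):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Forward pass sees a low-bit view of w; the backward pass sends gradients
    straight through to the full-precision master weight."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()   # straight-through estimator

w = torch.randn(256, 256, requires_grad=True)   # full-precision master weight
x = torch.randn(8, 256)
y = x @ fake_quantize(w, bits=4).t()            # compute with the 4-bit view
y.sum().backward()                              # gradient lands on the fp32 weight
```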
Demetrios [00:18:12]: So that's the quantization that goes through. Is it just like out of the box vanilla quantization or are you also.
Guanhua Wang [00:18:21]: Yeah, we have some. One is the open-source version, GPTQ. The other way, right now we're also doing some research on determining how to round the quantization values up or down. It's still ongoing; I think we may release it sometime next year. That one can achieve higher accuracy compared with the open-source one. Yeah.
Demetrios [00:18:46]: And then afterwards, you pop out with this very specialized model, and it is hyper-focused or domain-specific, right? Do you distill it later after that to try and make it smaller? Or do you say it's small enough, that's the whole point of doing it?
Guanhua Wang [00:19:03]: Okay. Yes. Another issue we found is that, previously, for large language models, distillation always decreased accuracy. That's the main issue. So right now we just abandon distillation and use quantization with the original model shape.
Demetrios [00:19:21]: Wait, even on the bigger, higher-parameter models?
Guanhua Wang [00:19:25]: Yeah, even a several-tens-of-billions-parameter model has decreased accuracy.
Demetrios [00:19:33]: Interesting.
Guanhua Wang [00:19:34]: Because, yeah, in our production environment we really want to have high accuracy rather than saving some weights. Yeah.
Demetrios [00:19:44]: And distillation is not the way to go then. It's good that you tell me that now, because one thing that we want to do in the MLOps Community is create a hack-with-me session, and we're going to take models and try to make them smaller and faster and ideally keep the accuracy. And so we had thought about distilling, but now that you tell me that, I'm not even going to go down that route, and I'm going to say it's all quantization.
Guanhua Wang [00:20:19]: Yeah. I think distillation may work if the downstream task is really simple, such as just a chatbot for a specific person, those kinds of very simple tasks. I think distillation should work there. But for more general tasks, like a chatbot for a million users, distillation may not work.
Demetrios [00:20:40]: That's not the way to go. Okay. And that's it. And you've tried, I imagine a lot of different distillation techniques. It doesn't matter.
Guanhua Wang [00:20:48]: There was a very big team in Azure that was doing distillation for almost two years. It came to a bad end. Yeah.
Demetrios [00:20:57]: Really?
Guanhua Wang [00:20:57]: Yeah.
Demetrios [00:20:58]: Okay, well that's a huge signal. So basically anybody who's thinking that distillation is going to save you, you gotta be okay with letting the accuracy slip.
Guanhua Wang [00:21:11]: Yeah, we saw that. For example, if you're using something like a copilot for your personal laptop, distillation may work. But if it needs to be more general and provide services for more people, it doesn't work really well. Basically, after you do the distillation, the memory inside the model is too small to memorize a lot of people's attributes.
Demetrios [00:21:37]: Okay. So that's why it works with a copilot if it's only one person's attributes.
Guanhua Wang [00:21:41]: For copilot, we can do distillation, quantization, everything and make the model extremely small.
Demetrios [00:21:47]: Yeah. So that it can fit on that CPU also, right?
Guanhua Wang [00:21:51]: Yes. We also had to use CPU to do the serving. Not using gpu.
Demetrios [00:21:55]: Yeah. And what, what's that process?
Guanhua Wang [00:21:59]: Oh, that process, we can say, basically doesn't have post-training; it just does the post-training on the fly. When you use a computer, you browse a website, you watch videos, those kinds of things. We just try to collect those signals and customize a personal assistant for you. And in that case we also don't upload any data to our servers or anything, to keep the data private.
Demetrios [00:22:29]: Yeah. So you're doing federated learning then?
Guanhua Wang [00:22:32]: Roughly. We don't even need to communicate with the server because like a personal pattern or personal characteristic is really easy to learn in your local laptop. We don't need to upload anything to our server.
Demetrios [00:22:48]: So the post training is done on the cpu.
Guanhua Wang [00:22:51]: Yeah, on your laptop.
Demetrios [00:22:53]: Wow.
Guanhua Wang [00:22:53]: Yeah.
Demetrios [00:22:54]: Okay. And it's like that happens at night when you go to sleep, and then the laptop kind of gets the updates.
Guanhua Wang [00:23:02]: Happens on the fly when you use it. We do some fine tuning.
Demetrios [00:23:05]: It's in real time.
Guanhua Wang [00:23:06]: Yeah, real time. Maybe one or two seconds.
Demetrios [00:23:09]: Yeah.
Guanhua Wang [00:23:09]: Delay. Yeah.
Demetrios [00:23:11]: How is that possible? That's nuts.
Guanhua Wang [00:23:14]: Because we just try to collect very simple signals. Like, which kind of website you click on, in which time frame you watch movies, in which time frame you may play video games. Those are very simple signals, very simple data, which we put into the model to do some fine-tuning. For human activity every day, maybe it's just around 100 or at most a thousand activities per day. So it's like 100 or 1,000 simple data points we need to fine-tune on.
Demetrios [00:23:50]: Okay. And that's why it can kind of happen on the fly. But is there a specific like package that is constantly retraining it or training it or is it just that you're throwing that into the context window as a database?
Guanhua Wang [00:24:05]: Yeah, we throw it into the context window. Once we batch up to like a 2K context, we do a training round. Yeah.
Demetrios [00:24:11]: Ah, I see. Wow, that's fascinating to think about. And the other piece that I wanted to know about from you is: are you using LoRAs, and if so, what do those look like and what is that whole process?
Guanhua Wang [00:24:29]: Oh, for LoRA, I think it is mainly used for post-training, because LoRA already has a smaller weight module rather than a big chunk of weights. It is used for post-training for Microsoft internal models as well as some OpenAI models; we both use LoRA, and I think that's public. And LoRA itself has a big issue: if the customer data is very different from the pre-training data, then the LoRA kind of strategy doesn't perform really well in terms of model accuracy. If the downstream, post-training data has similar attributes or characteristics compared with the pre-training data, then LoRA can perform really well. That's what we found.
Demetrios [00:25:24]: Oh, so basically you have to have a LoRA that fits into the worldview of that pre-trained model. If the LoRA is giving new ideas to the model, then the model almost rejects it.
Guanhua Wang [00:25:41]: Yeah, maybe. Yeah. That's the reason we're now switching from LoRA to some pure quantization.
Demetrios [00:25:49]: Okay, the way that I understand it is there's like a LoRA that's an orchestrator that will tell the model which LoRA to choose, or to load up real quick and add. But you decided that it's actually better just to quantize instead of trying to do that? Or is it LoRA?
Guanhua Wang [00:26:10]: Because previously LoRA was designed for large language models. In that case we have a lot of redundancy in the model weights, and by adopting LoRA's smaller weights rather than the big weights, we get a huge performance gain without losing accuracy. But right now our model is already much smaller than previous versions. So for LoRA, if you use a smaller tensor to represent the original small language model's tensor, it will have some trouble.
Demetrios [00:26:44]: Oh, fascinating.
Guanhua Wang [00:26:46]: Okay, for LoRA, if you do it on Llama 405B or Llama 70B, it works perfectly. No matter whether the post-training and pre-training data are different or not, it works perfectly. But for small language models it's not the same case.
Demetrios [00:27:03]: So there's almost like a cut off.
Guanhua Wang [00:27:06]: Right.
Demetrios [00:27:07]: 3B.
Guanhua Wang [00:27:08]: Eh, you're right. Below a several-billion-parameter model, it doesn't work that well.
Demetrios [00:27:14]: Yeah, yeah. Once you get over that 5B hump, then you can start saying, hey, yeah, we are seeing better effects. But then again, it's the law of diminishing returns. Once you go to 30B, you don't really need LoRAs. So it's really in that sweet spot of like seven to whatever, 7 to 40. Yeah, that's fascinating. I didn't realize that either.
Guanhua Wang [00:27:44]: You can see recently there's a lot of work on pure quantization, not on LoRA. Maybe one or two years ago it was very popular, but right now, for those small models, maybe not.
Demetrios [00:27:57]: Yeah, so when we're talking about small models specifically, quantization is the clear path.
Guanhua Wang [00:28:06]: Right. And we even have different quantization strategies; right now people are working on it. I also heard from Nvidia, AMD, those hardware companies, that they also do quantization. For example, for Nvidia's GB200, they already support the FP4 data type for training. That's awesome.
Demetrios [00:28:28]: Yeah. And what are some other quantization techniques that you have been messing with or have enjoyed?
Guanhua Wang [00:28:37]: One thing we are trying to integrate into DeepSpeed is the Transformer Engine from Nvidia, that is, doing FP8 GEMM computation. Previously, when we did GEMMs without quantization, we always tried to do FP16 or BF16 GEMMs to preserve the precision, or accuracy. But right now, with the Transformer Engine, before the GEMM happens, they downcast the 16-bit values into 8 bits and do the GEMM really fast. And after that they upcast the FP8 data type back to 16 bits.
Demetrios [00:29:18]: What is... can you explain the Transformer Engine for me?
Guanhua Wang [00:29:22]: Yeah, so the Transformer Engine, for me personally, it just does a data downcast from 16 bits to 8 bits and then does the computation, like a matrix multiply or something. And after we get the FP8 output, it upcasts it back to 16 bits. So in the middle they only compute in 8 bits, which makes it faster than the previous 16-bit computation.
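To make that downcast, low-precision GEMM, upcast flow concrete, here is a plain-PyTorch emulation (an illustration only; the real Transformer Engine runs the inner product on FP8 tensor cores with its own scaling logic, which is merely mimicked here with 8-bit integer rounding, and the function names are made up):

```python
import torch

def quantize_8bit(t: torch.Tensor):
    """Symmetric per-tensor quantization to 8-bit integer values plus a scale."""
    t = t.float()
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    return torch.clamp(torch.round(t / scale), -127, 127), scale

def gemm_8bit_emulated(a_bf16: torch.Tensor, w_bf16: torch.Tensor) -> torch.Tensor:
    a_q, a_s = quantize_8bit(a_bf16)   # downcast 16-bit activations to 8-bit values
    w_q, w_s = quantize_8bit(w_bf16)   # downcast 16-bit weights to 8-bit values
    # On H100-class GPUs this product runs on FP8 tensor cores; here the small
    # integer values are multiplied in float just to keep the sketch portable.
    acc = a_q @ w_q.t()
    return (acc * a_s * w_s).to(torch.bfloat16)   # upcast/rescale back to 16 bits

a = torch.randn(64, 512, dtype=torch.bfloat16)
w = torch.randn(256, 512, dtype=torch.bfloat16)
out = gemm_8bit_emulated(a, w)   # same shape as a @ w.t(), with small rounding error
```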
Demetrios [00:29:56]: And there's no trade off there, at.
Guanhua Wang [00:30:00]: Least for right now. For the model wave train with didn't see much difference. We may lose some accuracy, but that is tolerable. Another is what I have implemented called zero, that is to do quantization on like gradients and activations and weights during training. Because at that moment we still train large language model and the communication is a big overhead. So the computation is big overhead at that moment. So what we can do is we do quantization before we do any data transfer. Basically like in zero optimizer during the first forward we do some weight algather on the width.
Guanhua Wang [00:30:47]: So in ZeRO++, instead of doing the weight all-gather directly, we first do a quantization on the weights to bring the 16 bits down to eight or four bits, and then we communicate those bits. That can save a lot of time, because we communicate a smaller volume of data. And in the backward pass we do the same thing for the gradients: before we transfer the gradient, we do a quantization and then we transfer it. On the receiver side, we do dequantization to upcast the data and do the training.
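A toy sketch of that quantize-before-communicate idea (an illustration, not ZeRO++'s real fused CUDA kernels; it assumes torch.distributed is already initialized and the function name is made up):

```python
import torch
import torch.distributed as dist

def quantized_all_gather(shard: torch.Tensor) -> torch.Tensor:
    """Quantize this rank's shard to int8 before the all-gather so fewer bytes
    cross the wire, then dequantize the gathered pieces on the receiving side."""
    world_size = dist.get_world_size()
    # 1. Symmetric int8 quantization with one scale per shard.
    scale = (shard.abs().max() / 127.0).clamp(min=1e-8).reshape(1)
    q = torch.clamp(torch.round(shard / scale), -127, 127).to(torch.int8)
    # 2. All-gather the int8 payloads plus the tiny scales: half the traffic of
    #    an fp16 all-gather, a quarter of fp32.
    q_out = [torch.empty_like(q) for _ in range(world_size)]
    s_out = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(q_out, q)
    dist.all_gather(s_out, scale)
    # 3. Dequantize locally to rebuild an approximate full tensor.
    return torch.cat([qi.float() * si for qi, si in zip(q_out, s_out)])
```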
Demetrios [00:31:20]: Yeah, you remind me of the Xzibit meme, you know, where he's like, yo dawg, I heard you like quantization, so I put some quantization on your quantization.
Guanhua Wang [00:31:33]: Yeah, yeah. So we put a quantization before and after communication.
Demetrios [00:31:39]: Exactly. It's like, hey, could we quantize this part of the workflow? Yeah, let's try it. Could we quantize this data? Yeah, let's do it. And every single spot where you can come in and quantize, it seems like you've tried to. And so far you've had some success with it; you're not seeing that it's causing...
Guanhua Wang [00:32:01]: Significant issues? Some trouble. For example, for weights, usually we only quantize to 8 bits, not 4 bits. For 4-bit weight quantization we saw performance degradation. And for gradients, because they don't need very high data precision, we can quantize them to four bits.
Demetrios [00:32:22]: Wow.
Guanhua Wang [00:32:23]: Yeah.
Demetrios [00:32:23]: And is this on pre- and post-training, or is this mainly post?
Guanhua Wang [00:32:28]: Okay. It was mainly on pre-training at that moment, because we had some internal workloads which have a very low-bandwidth interconnect.
Demetrios [00:32:38]: Well, speaking of bandwidth, let's talk about Domino and what you've been doing there, because that's like your newest brainchild, right? Your creation. Can you explain to us what it is first?
Guanhua Wang [00:32:53]: Yes. In one sentence: with Domino, we try to eliminate communication during training, either pre-training or post-training.
Demetrios [00:33:02]: All communication between GPUs.
Guanhua Wang [00:33:05]: Yeah, or most of the communication. Right now we can hide around 70%, up to 100%, of the communication.
Demetrios [00:33:12]: And it works with DeepSpeed, the idea?
Guanhua Wang [00:33:14]: Yeah, right now it works with DeepSpeed.
Demetrios [00:33:17]: So let me see if I understand the full picture. It's like I'm training a model, but now I don't have to worry about loading up the model on every single GPU, because I've got DeepSpeed and it is sharding, quote unquote, the model weights. And so it's training these different layers in their own way. And then with Domino, you're saying, you know what, you don't have to be talking back and forth to these large clusters, these gigantic beefy machines that are training. They don't need to constantly be in contact with the other 24,000 machines, because you've got Domino. And it just means that you have less communication and networking overhead.
Guanhua Wang [00:34:10]: No, no, that's not the case. What we did is, we still have communication, but the user cannot see it. We hide the communication behind the computation, such that from an end-to-end training perspective, you always see the GPU working on computation. The communication is under the hood. We still have communication, and we have a different model sharding strategy than DeepSpeed ZeRO. We borrow from Megatron's tensor-parallel strategy, because with Megatron tensor parallelism we see that the throughput is higher than ZeRO or any other strategy when you have a high-bandwidth interconnect between GPUs.
Demetrios [00:34:54]: And why does hiding the communication from me, the user help with throughput?
Guanhua Wang [00:35:00]: Yeah. So previously, what we did for transformer training is we compute on one layer and then we need to sync, or communicate, that layer's output. That's what we did before, and the communication overhead just stands out in the end-to-end training time. For example, you compute for 5 seconds, and after that you need to wait 3 seconds for the communication to finish. What Domino achieves is that we hide this 3-second communication inside the 5-second computation, such that every iteration right now is just 5 seconds; previously it was like 8 seconds per iteration. Yeah, that's the difference.
Demetrios [00:35:53]: So it's basically borrowing the same time. It's still communicating; it's just communicating within those five seconds.
Guanhua Wang [00:36:00]: Yeah. The key attribute of Domino is: previously, when we compute a layer, we need to wait for the whole layer's output to be generated before we do the communication. But right now, after we compute a few outputs, we start triggering the communication. Yeah, it's like doing some pipelining inside the computation.
Demetrios [00:36:26]: So the whole communication message doesn't need to be there; you're starting it as it's finishing, almost like just-in-time.
Guanhua Wang [00:36:35]: Yeah, because we do some tensor sharding to break down the data dependency. For data without a data dependency, we can send it over without waiting for the others.
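A simplified sketch of that pipelining pattern (an illustration of the general overlap idea, not the actual Domino implementation; it assumes torch.distributed is initialized, the function name is made up, and an all-reduce stands in for whichever collective the tensor-parallel layer needs):

```python
import torch
import torch.distributed as dist

def forward_overlapped(x: torch.Tensor, weight: torch.Tensor, num_chunks: int = 4):
    """Split x into chunks with no data dependency between them, and launch each
    chunk's collective as soon as its GEMM finishes, so the next chunk's compute
    overlaps with the previous chunk's communication."""
    outputs, handles = [], []
    for chunk in x.chunk(num_chunks, dim=0):
        y = chunk @ weight.t()                              # compute this piece
        handles.append(dist.all_reduce(y, async_op=True))   # start its comms now
        outputs.append(y)
    for h in handles:                                       # only block at the end
        h.wait()
    return torch.cat(outputs, dim=0)
```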
Demetrios [00:36:47]: All right, bro, hear me out. But what if we quantize those messages?
Guanhua Wang [00:36:54]: Yeah. So Domino works with quantization. Of course, with quantization we can make the training iteration time even smaller, for both compute and communication. In Domino's first version we didn't enable quantization, but right now we're trying to incorporate the Transformer Engine and also my previous communication quantization strategy.
Demetrios [00:37:20]: So there's inevitably been some wild stuff you've gotten done because of all of the lift. You're not just taking a little bit from here and there; you're making some significant increases.
Guanhua Wang [00:37:36]: Yeah, that's the reason Domino is the next-generation training platform for DeepSpeed. It's a big project, and right now we're also trying to collaborate with a lot of industry partners like AMD, Intel, and a startup out of Stanford doing short-video generation. Yeah, those are our initial customers right now.
Demetrios [00:38:01]: Oh, that's so cool. So the training that you've done, like is there, are there any statistics that you can share with us on how much faster it is?
Guanhua Wang [00:38:11]: Oh, compared with maybe the best throughput. Because right now, if you have a very high-bandwidth interconnect between nodes, the best solution is to use Megatron from Nvidia. We can achieve about a 1.3x speedup compared with Megatron, which is almost the optimal throughput you can get.
Demetrios [00:38:34]: And this has nothing to do with the different, what is it, the different wires that you're using to connect it? Is it Ethernet, or what is it called... InfiniBand, that's it. Yeah, you don't even have to worry about that. You're not dealing at all with that type of throughput, because of the way that you're structuring the software.
Guanhua Wang [00:38:59]: Yes, yes, we structure the software, we don't touch the hardware. Yeah, that's another big attribute for Domino. We also didn't touch the kernel, we didn't modify the kernel. So any new kernel we can directly integrate with Domino.
Demetrios [00:39:13]: Oh, and is that how you're able to see this on any type of chip?
Guanhua Wang [00:39:20]: Yeah, because right now we are helping AMD to get speedups. We also tried running Domino on the MI300 directly, and we got some speedup without modifying a single line of code.
Demetrios [00:39:33]: So it plays at an abstraction above, basically, all that hardware and all the chips.
Guanhua Wang [00:39:42]: Basically, we're just a smarter scheduler that launches the compute kernels and the communication kernels in a wiser way, a smarter way.
Demetrios [00:39:54]: Oh, that is fascinating. So now, what are some things that you've been able to see? Obviously you take eight seconds down to five seconds and that's huge. Are there any other numbers that you've seen that you're just blown away by?
Guanhua Wang [00:40:12]: Yeah, because previously, Megatron from Nvidia didn't support multi-node training; everything happened within a single node with an NVLink connection. And right now we've also extended it to multi-node training with pure tensor parallelism. That's another big jump from the original tensor-parallel strategy.
Demetrios [00:40:35]: Oh wait, I'm not sure if I understood that. So you created multi-node parallelism with this, and before, Megatron was only supporting single-node?
Guanhua Wang [00:40:48]: Yes, because Megatron doesn't hide the communication. When it scales up to multi-node, the communication overhead becomes larger. For example, in the single-node case, we have 5 seconds of compute and 3 seconds of communication, but for multi-node, the communication could be 20 seconds or even more. Yeah. So Megatron cannot hide it perfectly.
Demetrios [00:41:14]: Okay, so the more nodes that you get, the more you see the benefit.
Guanhua Wang [00:41:22]: Yes, roughly. And we are continuing to optimize for it. For example, if the communication is too big, we may do some quantization before we do the communication to bring down the time.
Demetrios [00:41:33]: There it is. Oh man, I knew it was coming in somewhere. Wait, where are you quantizing? What?
Guanhua Wang [00:41:42]: Yeah. Yes.
Demetrios [00:41:44]: So you're quantizing the...
Guanhua Wang [00:41:46]: The weights.
Demetrios [00:41:47]: Oh, the weights. Okay.
Guanhua Wang [00:41:48]: Yeah, weights, activations, gradients, whatever takes a long time. We quantize it and see how it works.
Demetrios [00:41:56]: And then the communication comes off the back of it, so that the communication is able to go just as fast as the weights when they're quantized.
Guanhua Wang [00:42:07]: Yeah. So our perfect, ideal case is that computation takes the same time as communication, so we can do a perfect overlap. Whichever one is longer will set the limit for the training.
Demetrios [00:42:21]: Yeah. Okay, so it's almost like you just start the communication a millisecond later and it is the same length, the same amount of time. Because it couldn't work where it's the exact same amount of time, right? It's slightly longer. Yeah, slightly longer. Right, yeah, that makes sense.
Demetrios [00:42:44]: I really like this idea of sharing time instead of doing it. So you're doing things synchronously instead of async.
Guanhua Wang [00:42:54]: Yeah. The reason Megatron doesn't do this kind of time-sharing strategy is because, at least from my perspective, they may not know how to break down the data dependency. That's the key issue. We just found a way to do the tensor sharding, and after we shard into small pieces, those small pieces do not have any data dependency between them. So we can do the communication on the fly.
Demetrios [00:43:19]: Oh, so you're breaking up the data.
Guanhua Wang [00:43:24]: Into even smaller pieces.
Demetrios [00:43:26]: Yeah, I see. And how does that even work? So you're full-on sharding the tensors so that there are no data dependencies.
Guanhua Wang [00:43:35]: Yes, yes.
Demetrios [00:43:36]: And because of that you are able to have the communication be right away.
Guanhua Wang [00:43:42]: In the middle of computation, yeah. And our sharding is very general; it can work for any transformer model. It's not that we have a specific sharding strategy just for a specific model.
Demetrios [00:43:58]: And you think there's potential for gain there. If you have different sharding strategies, you can optimize your sharding. Or is that not the place that you're looking to get a little more lift?
Guanhua Wang [00:44:11]: I think right now... so before we launched this Domino project, we had already tried tons of sharding strategies, and what we propose is what we think is near optimal. Yeah.
Demetrios [00:44:25]: And so looking ahead, where are you seeing this? I mean, you mentioned the dream is to get the two times to be basically the same, with the communication just a little bit longer, so that you can have it go. But what needs to happen to get there? Where are you thinking you need to shave off those milliseconds?
Guanhua Wang [00:44:50]: Yeah. So the reason we proposed the Domino project is we see the trend that the inter-node connection bandwidth is catching up with the intra-node NVLink bandwidth. For example, for the DGX H100 boxes, the intra-node NVLink bandwidth is around 900 gigabytes per second, and the cross-node bandwidth of InfiniBand is 400. It's almost at the same level, even if it's just half the bandwidth of intra-node. And in the next generation, the B100 and B200 GPUs, we saw that their network card for InfiniBand is almost reaching 800 gigabytes per second. That's almost the same as NVLink's 900; it's just a 100 difference. And because we see this trend... we saw that TP usually only works within a single node, but for multi-node it doesn't work well.
Guanhua Wang [00:45:48]: But in our case, we just try to hide the computation and communication together to make it work. And this trend will continue. Once there is no bandwidth gap between intra-node and inter-node, that's the reason Domino will be more popular later on, for the later GPUs.
Demetrios [00:46:11]: Okay, so once, basically, the inter-node bandwidth catches up, then you're going to see Domino shine.
Guanhua Wang [00:46:20]: Yeah. Like, computation and communication take roughly the same time. Yes.
Demetrios [00:46:27]: So it's almost not on you to create the optimization here. It's more on the hardware provider.
Guanhua Wang [00:46:34]: Yeah. Because this year we saw the GB200; we already see this trend becoming real. Even if maybe later on they shut down this kind of bandwidth matching between intra-node and inter-node, if they shut down that kind of project, we can still do quantization on the communication to shave time off the communication. Both work.
Demetrios [00:46:55]: Yeah, that's where I instantly thought, like, oh, the quantization of the communication is where you could add a little lift.
Guanhua Wang [00:47:01]: Yeah. And I have the experience; we just have to integrate the ZeRO++ quantization kernels. That's not very hard.
Demetrios [00:47:11]: Okay, so with ZeRO++, you were already kind of testing that out.
Guanhua Wang [00:47:16]: Yeah, yeah, yeah. And we already use ZeRO++ both internally at Microsoft and at LinkedIn. We helped them reduce the training time by over half, which is big.
Demetrios [00:47:25]: Wow.
Guanhua Wang [00:47:25]: Yeah.
Demetrios [00:47:26]: Wow, man, you're saving so much money.
Guanhua Wang [00:47:29]: Yeah, yeah, yeah. That's really... my work is maybe more important than some other people's.
Demetrios [00:47:36]: Yeah. It's incredible to think about how much just shaving 8 seconds down to 5 seconds saves if you're training a large language model. It's huge.
Guanhua Wang [00:47:47]: Yes, yes, it's huge. And right now, for our big customers like OpenAI and Microsoft AI, we always use high-performance computing, like the DGX boxes with high-bandwidth interconnect. That's what Domino focuses on. But later, if we want to extend it to low-bandwidth interconnects, we can also do quantization. So both ways work.
Demetrios [00:48:15]: Oh, so it's not even worth it for you to try to think about how to specialize for different hardware.
Guanhua Wang [00:48:23]: We don't need to.
Demetrios [00:48:25]: Yeah.
Guanhua Wang [00:48:25]: That's like a beauty of this project. We always do it at the tensor level or at the algorithm level. We don't touch it down to the hardware because if we touch it down to hardware, our scope will be limited.
Demetrios [00:48:41]: Yeah. And you're at the whims of other people.
Guanhua Wang [00:48:43]: Yeah, because like previously, like, or two months ago when we tried to adopt Domino on AMD GPUs, I was worried, I would think maybe it doesn't work or it performed really bad. But we just directly run our code, we have performance gains. That's the beauty of this project.
Demetrios [00:49:02]: And it's fully open sourced, right? Yeah, it's fully open sourced along with deep speed. So PRs are open.
Guanhua Wang [00:49:09]: It's already merged to the master branch.
Demetrios [00:49:12]: Wow.
Guanhua Wang [00:49:13]: Yeah.
Demetrios [00:49:13]: All right, so if anybody wants to go and poke around in the code, it's there, it's on GitHub and you will maintain it.
Guanhua Wang [00:49:23]: If people have some issues, they can raise the issues. We can solve it.
Demetrios [00:49:27]: Perfect. Dude. It's been so fun talking to you because I've just learned a ton about all this.