Do More with Less: Large Model Training and Inference with DeepSpeed
Samyam Rajbhandari is a co-founder and the system architect of DeepSpeed at Microsoft. He works on developing high-performance infrastructures for accelerating large-scale deep learning training and inference on parallel and distributed systems. He designed systems such as ZeRO and 3D parallelism that have been adopted by many DL frameworks, have become the staple engine for training large language models, and have made it possible to train models like Turing-NLG 17.2B, Megatron-Turing 530B, and BLOOM 176B. On the inference front, he designs fast systems and leads optimization efforts for various transformer and MoE-based LLM architectures as well as more esoteric multi-modal architectures like DALL·E. His work on inference optimizations has been released as part of DeepSpeed-Inference, DeepSpeed-MII, and DeepSpeed-Chat, while also being used in multiple Microsoft systems and products such as Bing, Ads, and AzureML to reduce latency and cost and improve capacity. Samyam received his Ph.D. in Computer Science from The Ohio State University.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In the last few years, DeepSpeed has released numerous technologies for training and inference of large models, transforming the large model training landscape from a system perspective. Technologies like ZeRO and 3D parallelism have become the building blocks for training large models at scale, powering LLMs like BLOOM 176B, Megatron-Turing 530B, and many others. Heterogeneous memory training systems like ZeRO-Offload and ZeRO-Infinity have democratized LLMs by making them accessible with limited resources. DeepSpeed-Inference and DeepSpeed-MII have made it easy to apply powerful inference optimizations to accelerate LLMs for deployment. As a result, DeepSpeed has been integrated directly into platforms like Hugging Face, PyTorch Lightning, and MosaicML. Similarly, the ZeRO family of technologies and 3D parallelism are offered as part of PyTorch, Colossal-AI, Megatron-LM, and others.
In this talk, Samyam shares the journey of DeepSpeed as they navigated through the large model training landscape and built systems to extend it beyond what was possible. Samyam shares their motivations, insights, aha moments, and stories behind the technologies that are now part of DeepSpeed and have become the fundamental building blocks for training and inferencing large language models at scale.
Link to slides
Introduction 
Where is he at? Where are you at? There he is, what's up? Sorry. Hello. How's it going? Hey, it is going good. Thank you guys for having me. I am a little bit sick, you might be able to hear it in my voice, but I am super pumped for— That is dedication. That is dedication. He could not let us hang. You couldn't leave us.
Huh? You could have called it in. You could have said— Ah, no, he's dedicated. No, no, he's dedicated. Not today. That's right. That's right. Not today. That's awesome. Well, dude, David and I are gonna jump off. We'll be back with questions in 20, 25 minutes. So talk to you soon. Sounds good. Will you be putting my slides up? Okay.
There, there it's... no, it should be up. Yeah, there it is. Cool. All right. Hello, hello everyone. My name is Samyam. I am a co-founder and the architect of the DeepSpeed team at Microsoft. Many of you might already know what DeepSpeed is. Many of you might have trained your models using DeepSpeed.
Overview of DeepSpeed
Today I'll give a brief overview of DeepSpeed and walk you through some of our insights along the way as we developed DeepSpeed.
Just some fun stories along the way, and I'll also tell you about the key features of DeepSpeed. So at a high level, DeepSpeed is a... hi, you're back. That wasn't... so we're improvising like a jazz band, basically. Yeah. I just, do you mind resizing the window that you're presenting with for the slides? Because it's a little bit vertical right now.
And there we go. Boom. All right. All right, cool. So DeepSpeed is a library for training, compression, and inference of deep learning models. We started by primarily focusing on model scale and speed, so let me tell you a little bit about what that means.
Enabling Large Models
We all know large language models have been blowing up in the last two years.
The size of the transformer models that were trained increased from a few hundred million parameters to hundreds of billions of parameters, about a 240x increase in model size. During the same time, if you look at the AI hardware that these models are trained on, the memory capacity of that hardware only increased by about two and a half times.
So how did we enable this massive increase in model size of these large language models? That boils down to system capabilities. We built systems that allowed us to scale the model from a single device to hundreds and thousands of devices. And if you look at the trajectory of DeepSpeed in enabling large models, we enabled a 4,000x increase in model size in the same two-year span in which large language model size increased by 240x.
Because of that, DeepSpeed has been used to train a lot of the large language models that you see, including the Microsoft Turing-NLG 17 billion, the BigScience BLOOM model, the Megatron-Turing 530B model, and so many more, using some of the large model training technologies that I will share with you in just a moment.
Speed and Efficiency
Now, in addition to model scale, DeepSpeed is super fast at training these large models. If we go back to the days when BERT was considered a large model, DeepSpeed set the world's fastest BERT training time, where we were able to train BERT in 44 minutes on a cluster of 1,024 NVIDIA V100 GPUs.
Fast forward, and now we're able to train trillion parameter models on large GPU clusters. We can achieve near-perfect scalability. What that means is, as we increase the number of GPUs, your throughput, or your training speed, increases proportionally; there is very little efficiency loss. And we are able to do that on both older generations of hardware, like the NVIDIA V100 GPUs, and newer generations of hardware, like the NVIDIA A100 GPUs, all the way to thousands of GPUs.
Democratization of Large Models
One of the challenges of large language model training is that it's generally inaccessible because of just the size of the model and the amount of resources required, or at least that used to be the case in the early days. If you look at the largest model you can train on a single device, something that a lot of data scientists have access to,
you are limited to a couple of billion parameters. You cannot fine-tune something like LLaMA 65 billion. So, in order to make it more accessible, we developed technologies like ZeRO-Infinity, which allows you to leverage heterogeneous memory, things like CPU memory and NVMe, to increase the model size that you can train.
And by using this, you're able to fine-tune models like LLaMA 65B, or even a trillion parameter model, with a single GPU if you have enough NVMe storage. And that makes large language models a lot more accessible than if you needed a massive cluster to be able to do the same thing. We also developed technologies that reduce communication between devices when training large models.
This is important because if the communication is slow, most of the time you're just doing communication rather than any useful work, and that means you can only train these large language models on very powerful, very expensive supercomputing clusters. By reducing the communication volume, you make it more accessible, because you can train with much less bandwidth.
So model scale, speed, and democratization are kind of the three main emphasis points for DeepSpeed.
Development of DeepSpeed Features
So let me take you guys a little bit deeper into how some of these features were developed, some of the moments of insight where we were like, oh, this kind of works, and we built around that.
So let me just take you back a little bit and share some of that. So, early 2019, a couple of months after BERT had come out, we were trying to train BERT at Microsoft. At that time, we already knew about distributed data parallel training, and NVIDIA Apex provided the infrastructure to do this.
We didn't have DeepSpeed back then. We had access to 64 V100 GPUs, pretty decent GPUs at the time, but the problem was that this was a cloud cluster. The network on these machines was really, really slow: four gigabits per second Ethernet. To give you a reference, on today's supercomputing clusters, where a lot of these large language models are trained, that bandwidth is about 1,600 gigabits per second, so this was roughly 400 times slower.
And the memory wasn't too big either, so you could do a batch size of maybe four per GPU. And when we were training this model, it was incredibly slow. Training it on 64 GPUs was slower than training it on a single GPU, and we were just scratching our heads: what is going on? Turns out we were spending all of our time doing the communication that I referred to earlier.
Here's a simple loop that captures essentially what Apex was doing. We were using what is called gradient accumulation to increase the batch size, where you run a forward and backward pass with a small batch size and keep accumulating the gradients. And then you do this gradient averaging across all the GPUs that you have, all 64 GPUs, every step, after every forward and backward.
And that was just killing time. We were spending all our time just averaging gradients, averaging gradients; we were not making any progress. But once we saw this, we realized we could actually just move the averaging part outside of the gradient accumulation: you let the forward and backward run for a couple of steps,
just keep accumulating the local gradients on the same GPU, and once you're ready to update your model, that's when you do the averaging across all the GPUs. It reduced the communication time by 16x. We were starting to see some scalability; now the 64-GPU performance was no longer slower than the single-GPU performance.
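A minimal sketch of that fix, assuming a standard PyTorch data parallel setup with an already-initialized process group (the step count, names, and loop structure here are illustrative, not the actual Apex or DeepScale code):

```python
import torch
import torch.distributed as dist

# Illustrative sketch: accumulate gradients locally over several micro-batches,
# and average across GPUs only once, right before the optimizer step.
ACCUM_STEPS = 16  # illustrative; with 16 steps this is roughly the 16x saving mentioned above

def train_step(model, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad()
    for inputs, targets in micro_batches:            # ACCUM_STEPS micro-batches
        loss = loss_fn(model(inputs), targets)
        (loss / ACCUM_STEPS).backward()              # gradients accumulate locally, no communication
    # One gradient-averaging all-reduce per optimizer step, instead of one per micro-batch.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()
```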
And with that, we were able to train BERT in eight days. No longer a super speedy BERT, but we managed to train it in a reasonable timeframe, and that's where DeepScale was born. We used this to train a few early models inside Microsoft, and we were quite excited to get into this distributed training landscape.
Then, a couple of months later, new models started coming out. For example, OpenAI released the 1.5 billion parameter GPT-2 model, and the Megatron team released the 8.3 billion parameter Megatron-LM model. And we were wondering, what can we do next? Can we train a larger, more powerful model? And what would it take to do something like that?
The issue was that at that time there were two forms of parallelism technologies that were used to train models. There was data parallelism, where you replicate your whole model across all of your GPUs, so you're limited by the single-GPU memory; you can, again, scale to maybe a couple of billion parameters. And then there was model parallelism, which is how the Megatron models were trained, where
you partition the model and you communicate all your activations throughout the training, and it's a super high communication volume. The only way you can get around it is by having very fast communication, which was present inside a single node, what you call a DGX-2 node from NVIDIA. But as soon as you go outside the node, the bandwidth wasn't there and you can't really scale it further, so you were limited by the capacity of a single node, which was roughly about 20 billion parameters,
if you didn't care too much about running it efficiently. But we soon realized that in the way data parallelism worked, we were replicating everything, and we could actually not replicate everything. We could just have each GPU hold a part of the model states, and we could communicate them as they are needed.
And if you did that, you significantly reduced the amount of memory required, but you didn't really incur a much larger communication overhead, nothing compared to the model parallelism, the tensor slicing, that was done within the node. So that's where the whole ZeRO line of ideas was born:
ZeRO, the Zero Redundancy Optimizer. We don't redundantly store any optimizer or model states across GPUs. With that, it opened up the possibility of training models with hundreds of billions or even trillions of parameters without losing compute efficiency, and it was really easy to use. And so, using ZeRO and model parallelism, we at that time trained Turing-NLG 17 billion.
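For reference, here is a minimal sketch of how that idea is exposed to users today through a DeepSpeed config; the stage controls how much gets partitioned, and the batch size and other values below are illustrative:

```python
# Hedged sketch of a DeepSpeed config enabling ZeRO. Stage 1 partitions optimizer
# states, stage 2 also partitions gradients, and stage 3 also partitions the
# parameters themselves, so no single GPU holds a full copy of the model states.
ds_config = {
    "train_batch_size": 256,          # illustrative
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}
```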
Turing-NLG 17B was the largest language model at the time, and we were really excited about this ZeRO technology. We wanted to bring it out to the community and see what the community did with it, so we wanted to open source DeepScale. But, interesting story, DeepScale was already taken as a name. So after brainstorming for quite a bit, we decided to change the name from DeepScale to DeepSpeed,
and we open sourced ZeRO and DeepSpeed in 2020. Since then, ZeRO has been used as part of the DeepSpeed library to train several large language models, and it has been adopted by several frameworks, including PyTorch itself. We were, at the time, a little bit upset that we had to change the name, but a couple of weeks down the line, one of my colleagues' daughters noticed that DeepSpeed reads the same forwards and backwards.
And that was kind of cool. We all started to enjoy the name a little bit more over time, and right now I feel like it captures the essence of what we do a lot better than DeepScale.
Training a Trillion Parameter Model
So we actually really like the name now. All right, so then we were starting to think about what is next, around June 2020.
OpenAI released the 175 billion parameter GPT-3 model, and we started to wonder: what would it take to train a trillion parameter model? We knew there were two possible paths. One is through ZeRO, and ZeRO has several different stages; I won't go into the details here, but we didn't have the full implementation of ZeRO at that point.
But the other line of work that we had started to look into is 3D parallelism. Excuse me. 3D parallelism combines three different forms of parallelism: pipeline, model, and ZeRO. And the key here is the pipeline parallelism. Pipeline parallelism incurs very, very little communication overhead, so by using pipeline parallelism to scale across nodes, you can scale to really large models.
And it is really, really efficient as well. The only kicker is that it is very complicated to both develop and use, but since we were planning to train a trillion parameter model at a very large scale, it made it worth it to develop a system where we were trying to get the best efficiency out of the hardware.
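To give a rough sense of what pipeline parallelism looks like to a user today, here is a hedged sketch using DeepSpeed's pipeline module; the layer list, sizes, and stage count are illustrative, and this is not the code used for the runs described in the talk:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# Hedged sketch: express the model as a flat list of layers and let DeepSpeed
# split it into pipeline stages across GPUs (distributed init happens first).
deepspeed.init_distributed()
layers = [nn.Linear(1024, 1024) for _ in range(48)]   # stand-in for transformer blocks
model = PipelineModule(layers=layers,
                       num_stages=4,                  # split across 4 pipeline stages
                       loss_fn=nn.CrossEntropyLoss())
# The module is then wrapped with deepspeed.initialize(), and training uses
# engine.train_batch(), which schedules micro-batches through the pipeline.
```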
So, after a couple of months of work, we had our 3D parallelism ready, and we wanted to try it out. And luckily, I remember this one weekend: by this time we had already started getting supercomputing clusters inside Microsoft, and one of these large GPU clusters was going through an upgrade.
And we had an agreement that after the upgrade, they would lend us the entire GPU cluster, with over a thousand GPUs, for about two days to test scalability. So if we were going to test a trillion parameter model, this was our window, and we had to make it work. Long story short, during this weekend we managed to schedule a job with over a trillion parameters.
We submitted it, we waited for 30 minutes, and we got the first iteration time, meaning it was running. And during that weekend, we were able to produce this graph, which shows how the 3D parallelism technology can scale to a trillion parameter model with near-perfect scalability on 800 V100 GPUs. So with this technology, we collaborated with NVIDIA and used it to train the Megatron-Turing 530 billion parameter model.
This was done in about two months using 2,000 A100 GPUs. And then we used the same technology to collaborate with the community to train the BLOOM 176 billion parameter model. Okay, so it wasn't quite a trillion parameters, but we knew we could train a trillion, and we were training models that were half a trillion parameters.
So that was quite exciting. But then the question is, what next? Can we continue scaling the size of the model even further? Well, the problem is that the Megatron-Turing 530B took two months to train on 2,000 GPUs, and we only trained it on under 300 billion tokens. By today's standards,
that's heavily undertrained. Today, LLaMA 65 billion, a model that's almost ten times smaller, is trained with over 1.2 trillion tokens, and even MPT 7 billion is trained with about a trillion tokens. If you use the same kind of scaling laws and try to train a 500 billion or a trillion parameter model
using a trillion tokens, it would take six months to a year on 2,000 GPUs. And if we did the scaling properly and trained it with 10 trillion tokens, which is probably about the number of tokens that would actually be needed to train this model to its full potential, it would take ten years. It is no longer feasible to train these massive dense models.
So what do we do then? We can actually scale with sparsity, using mixture-of-experts models. What does that mean? An analogy that helps me think through this is: when you're going to a hospital or a clinic for a diagnosis, you don't go see every single doctor and combine what they had to say to get your diagnosis.
You go to an expert based on your symptoms. Mixture of experts is kind of similar to that: you might have a really large model, but you only use a subset of the parameters in the model, based on the input token that is going through the model. And so here's a plot that shows the loss curve during training of a 1.3 billion parameter dense model compared to a 1.3 billion parameter model with 128 experts,
and then we compare that with a 6.7 billion parameter dense model.
So here we're seeing that the 1.3 billion with 128 experts has a similar accuracy to the 6.7 billion parameter dense model. Now, if you compare the throughput, the 6.7 billion dense model is roughly five times slower than the expert model. And if you look at the total number of parameters in the expert model, it's actually 52 billion parameters, because you have all those experts.
So now you can scale to a really large model, but also run it fairly cheaply. In order to enable this, we released DeepSpeed-MoE for training, which allows you to train multi-trillion parameter MoE models with excellent training efficiency, at the cost of training a much smaller dense model.
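To ground the doctor analogy in code, here is a toy sketch of top-1 expert routing; it is purely illustrative and is not DeepSpeed-MoE's implementation, which also handles load balancing and expert parallelism across GPUs:

```python
import torch
import torch.nn as nn

# Toy sketch of top-1 expert routing: a gating function picks one expert per
# token, so only a small subset of the parameters is used for any given token.
class TinyMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=128):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)       # the "which doctor?" function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):                            # tokens: [num_tokens, d_model]
        chosen = self.gate(tokens).argmax(dim=-1)         # one expert per token
        out = torch.zeros_like(tokens)
        for idx, expert in enumerate(self.experts):
            mask = chosen == idx
            if mask.any():
                out[mask] = expert(tokens[mask])          # only this expert's weights run
        return out

# Usage: route 10 tokens through a layer with 128 experts.
layer = TinyMoELayer()
print(layer(torch.randn(10, 512)).shape)                  # torch.Size([10, 512])
```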
So far we've been talking about pushing the size of the models, right? But as we start increasing the size of the models, you need larger and larger GPU clusters to fit the model. So for example, if we had to fine-tune a 175 billion parameter GPT-3 model, that's going to take 256 V100 GPUs.
That is a large GPU cluster that very few people have access to. But if we actually look at modern hardware, the limitation is there because we're trying to fit the entire model in GPU memory. If you look at the CPU and NVMe storage, there's actually a lot more: the GPU memory is less than a fiftieth of the overall memory and storage in a system. So if we could leverage the GPU, CPU, and NVMe memory, you could easily fit a trillion parameter model for fine-tuning on a single DGX-2 node.
And that would make large models a lot more accessible. The catch here is that those forms of memory are much slower. NVMe, for example, is 60 times slower than the GPU memory, and even worse, in order to use these forms of memory, you need to bring the data from these slower memories to GPU memory through the PCIe link, and that becomes your bottleneck.
And that link is over a hundred times slower than GPU memory. So what do we do? I actually remember when we had the aha moment on how we could fix this. This was back in 2020. We had rented a cabin with a bunch of friends to celebrate my wife's birthday.
And I remember I couldn't sleep because I couldn't stop this thought of data moving back and forth between NVMe and GPU memory. And there was this moment where I realized, well, the NVMe and PCIe links are really slow by themselves, but if you could partition each parameter across all your GPUs and bring the data in from the slow memory on each of them, you actually have lots of these links running in parallel, so you can get the aggregate bandwidth across all of them.
And so we can now increase the bandwidth from something like 16 or 30 gigabytes per second to hundreds of gigabytes per second. This was the fundamental idea behind ZeRO-Infinity. We built upon it and released ZeRO-Infinity as part of DeepSpeed, and with it we can fine-tune models with hundreds of billions of parameters on a single GPU, eliminating the barrier to entry for a lot of data scientists for fine-tuning these large models.
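For a sense of how this is exposed, here is a hedged sketch of a DeepSpeed config that turns on ZeRO stage 3 with NVMe offload in the ZeRO-Infinity style; the NVMe path and batch size are illustrative:

```python
# Hedged sketch: ZeRO stage 3 with parameters and optimizer states offloaded to
# NVMe, which is what lets a single GPU fine-tune far larger models. Values and
# the nvme_path are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
```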
All right, so we mostly talked about training, but there are a few other things that we can do with DeepSpeed. For example, we can do compressed training. It allows you to train with really long sequence lengths using sparse attention. Or we have mechanisms like progressive layer dropping, where instead of training through every single transformer layer of the model, you can, kind of like mixture of experts, use your token and a gating function to figure out which layers you want to go through and which you don't.
And that can help you speed up your training. And of course inference. Once you've trained these large models, you want to be able to run inference on them effectively, at a low cost, and productionize them. The challenge here is that the large model landscape is quite diverse. You have models that range from a few hundred million parameters to hundreds of billions of parameters.
You have dense models, you have sparse models, and different scenarios have different requirements for inference. You might be doing a throughput-based scenario where you just care about cost, you don't care about latency, it's all offline, and you want to maximize your throughput. Or you might be doing a customer-facing scenario where latency is super critical and you want to minimize latency.
So at DeepSpeed we developed a systematic composition of different optimizations such that you can get the best latency and throughput, depending on the scenario, across this entire model spectrum. However, the challenge is that there are lots of optimizations that go into speeding up large language models,
and from a data scientist's perspective, it's not always clear what you want to do, even though some of the tools to do it might already be available, for example DeepSpeed-Inference. So in order to address that, we developed a new library called DeepSpeed-MII: DeepSpeed Model Implementations for Inference,
where we've taken popular open source models and applied DeepSpeed-Inference optimizations to them, so that you can simply, with the click of a button, run these models with the best latency and throughput. So here is a small code snippet on how you can deploy a model using DeepSpeed-MII.
You pick the Hugging Face model that you want to deploy, you just say mii.deploy, specify a bunch of things, and that's it. On the backend, DeepSpeed will apply all its optimizations, and you get a handle on which you can submit your queries and get really fast latencies and throughput.
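The slide snippet isn't reproduced here, but a hedged sketch in the spirit of what's described might look like the following; the model name and arguments are illustrative, and the MII API has changed across versions:

```python
import mii

# Hedged sketch of the deployment flow described above; exact arguments vary by
# DeepSpeed-MII version, and the model below is just an example.
mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",          # any supported Hugging Face model
           deployment_name="bloom_deployment")

# Get a handle to the deployment and submit queries against it.
generator = mii.mii_query_handle("bloom_deployment")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
print(result)
```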
DeepSpeed-MII supports over 24,000 different open source models. At the time we released it, it reduced the cost of inferencing the BLOOM 176 billion model by up to 40x, and it can also accelerate models like Stable Diffusion. Hello? Am I running late? No, you're good. Sorry, my internet went out and I wanted to make sure it was still working.
Go on, man. My bad. All right. Okay, no worries. Something else that is tied to DeepSpeed-Inference that we are really excited about, and that we released pretty recently, is DeepSpeed-Chat.
I'm sorry, guys, my throat is acting up a little bit. So DeepSpeed-Chat is kind of an interesting project, because training a chat model requires RLHF training, and the RLHF training pipeline has both training components and inference components in it, and it is equally important to
accelerate both training and inference. This was the first time we used all the techniques we've developed on the training side, combined with all the techniques we've developed on the inference side, to accelerate RLHF training. It allows you to train really large models for chat, really fast if that's what you want, or really cheaply if you don't care about the latency and all you care about is cost.
Okay. So we've covered several features inside DeepSpeed, several DeepSpeed offerings like DeepSpeed-MII and DeepSpeed-Chat. However, all of these features do not mean much if they are very hard to use. And so at DeepSpeed we care greatly about usability. The ZeRO line of technologies makes it easy to scale your model with
virtually no code change. You can go from 1.4 billion parameters to trillions of parameters by using DeepSpeed without any change to your code. And because of how easy it is to use, DeepSpeed has also been integrated into Hugging Face and PyTorch Lightning, and you can enable the DeepSpeed backend just by changing your configs or using a different launcher, without having to change anything in your code.
DeepSpeed is also infrastructure agnostic. It supports Azure ML, Azure VMs, or if you want to bring your own hardware and run it there, it works there as well. Let me share how you would apply DeepSpeed to your models. Normally in PyTorch you would create a model; all you have to do is wrap your model inside the DeepSpeed engine and you're ready to go.
You specify a DeepSpeed configuration file, where you specify all the different kinds of optimizations that you want DeepSpeed to enable, and that is it.
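A minimal sketch of that wrapping step, assuming a hypothetical ds_config.json that holds the optimizations you want enabled (the toy model and loop are illustrative):

```python
import torch
import torch.nn as nn
import deepspeed

# Hedged sketch: wrap an ordinary PyTorch model in the DeepSpeed engine; the
# optimizations are selected in the (hypothetical) ds_config.json file.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for step in range(10):                               # stand-in for a real data loader
    x = torch.randn(8, 1024, device=model_engine.device)
    loss = model_engine(x).sum()                     # toy "loss"
    model_engine.backward(loss)                      # engine handles precision and communication
    model_engine.step()
```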
And we are very grateful and very excited that the community has been growing, and we've seen massive adoption of DeepSpeed. If you're using a large language model, there's a good chance that it was either trained with DeepSpeed or trained using one of the technologies that was developed by DeepSpeed. We've had over four million installs since we released it, we have tons of unique contributors, and a lot of third parties depend on DeepSpeed for training their workloads and for inference.
So it's been a great journey, and I am super grateful to be part of DeepSpeed and the DeepSpeed open source community. And that is all I have for you guys. We welcome contributions, and if you enjoyed this talk, please do like our DeepSpeed repo. Thank you.
Wow, that was a great presentation. Thank you so much, Samyam. You really had a great pace, so I appreciate you coming here even though you were sick. You did a great job. There are some questions, but before we get into the questions that were there, I had one question as I was thinking about a lot of this stuff. DeepSpeed has had a lot of impact in the industry, so much so that you see ZeRO or ZeRO-like optimizations being implemented in other optimization libraries.
Can you just tell us a little bit about how ZeRO differs from fully sharded data parallelism in PyTorch, or maybe the implementation in Colossal-AI? Just anything that you can say about that. Yeah. First, to me it's really exciting to see something that was developed at DeepSpeed now being adopted at different places.
At the core, the fundamental technology is the same: you partition your parameters, optimizer states, and gradients, and then you do the collectives there. There are differences related to usability. I think the last time I checked, on the PyTorch side, the fully sharded data parallelism is not fully automatic.
You have to apply it somewhat recursively. Understood. Okay. I don't know if that has changed over time. We've also seen ZeRO being adopted at Colossal-AI. Initially, since DeepSpeed is open source, most of the implementation came from DeepSpeed, but over time I feel like they've added their own enhancements to it.
One of the things that I recall, not for ZeRO but for ZeRO-Offload and ZeRO-Infinity, is that they have this thing called Gemini, which is a smart memory manager that allows for partial offloading, which helps in some corner cases as well. But I think fundamentally all of these are the same ZeRO technology.
Conclusion
I might be biased, but so far I haven't seen an implementation which outperforms the one that we have in DeepSpeed. But it's still exciting to see the adoption, and a little competition is great. Yeah. Yeah, I think it's healthy to push the boundaries of the space. And one thing I also just wanna say before we get into questions is I really loved how you thought about some of these things and came up with insights, and they progressively built on top of each other.
I think that's really awesome, and it's encouraging for other people. Some of these innovations don't come out of the blue, right? They come through progressive iterations on some of these things. So I really appreciate that. There are some questions I'd like to get to. People are very happy with this talk.
Okay, let's see. I'm so sorry, I'm very bad at this. I see a few different ones. I see someone asking a question about dealing with failures. So, are there any capabilities like fault tolerance baked into the DeepSpeed training mechanism, or is that something you have to do on your own? We've talked about elasticity and fault tolerance,
and they've been features on our wishlist, but it's not something that is fully baked into DeepSpeed at the moment. Got it. However, it is definitely a pain point, especially when you are scaling to tens, hundreds, or thousands of GPUs. It is something that we incur regularly, especially with node failures and
errors. It's hard to tell which piece is broken and so on. Yeah. So one suggestion I have, though, and that we use a lot, is just a simple all-reduce test. Before you start the training, you do an all-reduce test to ensure that you're getting the bandwidth that you expect. And if not, something is wrong with your cluster, and then you start doing a binary search on which part of the cluster it is. It's something that can definitely be automated.
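For anyone who wants to try that check, here is a hedged sketch of such a pre-flight all-reduce test (buffer size, iteration count, and launch assumptions, e.g. torchrun, are illustrative):

```python
import time
import torch
import torch.distributed as dist

# Hedged sketch of a pre-flight all-reduce check: time a few all-reduces over a
# fixed-size buffer and compare against the bandwidth you expect from the cluster.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
buf = torch.ones(64 * 1024 * 1024, device="cuda")     # 64M fp32 elements (~256 MB)

dist.all_reduce(buf)                                  # warm-up
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    dist.all_reduce(buf)
torch.cuda.synchronize()
per_iter = (time.time() - start) / 10
if dist.get_rank() == 0:
    print(f"all-reduce of ~256 MB took {per_iter * 1e3:.1f} ms per iteration")
```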
Okay, that's good to know. Yeah, but I remember when we were talking before, you were telling me about what it was like being on call as you were developing some of these really large models, and how painful it was.
Right. To have to wake up in the middle of the night and be like, oh my God, there's this failure. And, I guess, the pressure behind it because of how expensive these resources are. One thing I was thinking about is how expensive all of this research is, because it's very hard to do some of these experiments,
which also makes it very hard. You want to make sure what you're doing works the way you expect it to work, because if not, you're just wasting time and money. So, yeah, that was really great. Sorry, I want to keep it moving and make sure I get to other people's questions. Someone asked if you could shed some light on speeding up the inference of an LLM, let's say the 7 billion or 40 billion Falcon, without considering quantization.
So maybe that's a DeepSpeed-MII question. Sure. So how you accelerate it really depends on whether you are trying to push for latency or throughput. If you're trying to push for latency, then you are usually running at a very small batch size, and at that point your performance is limited by how fast you can read the parameters from GPU memory.
So all the optimization centers around getting the best possible memory bandwidth, saturating all of your SMs so that you're pulling from HBM as quickly as possible. That's, at a high level, what we would do. If you're trying to maximize throughput, then you would try to use a really large batch size and use the compute cores as efficiently as possible.
Keep it compute bound. Love that. Thank you so much. Demetrios, go ahead. No, man, I'm just coming on here because sadly we've gotta keep it moving. I told you all, my name is Ringo today; I'm keeping time like a clock, just keeping us cruising throughout the day. So this has been awesome. I really appreciate you coming on here.
I know it's no small feat to get a talk pushed through the PR department at these gigantic companies, so I think you probably worked harder on that than you did on the actual talk itself, which we all appreciate. There's a ton of people that said how excited they were for your talk, and that's awesome.
You did not disappoint, man. You did not at all. You came sick and everything, and we all appreciate that so much. Yeah. Thank you so much. Thanks for having me.