Handling Multi-Terabyte LLM Checkpoints
Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
The talk provides a gentle introduction to the topic of LLM checkpointing: why is it hard, and how big are the checkpoints? It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.
Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/
Simon Karasik [00:00:00]: My name is Simon, Simon Karasik. I am a machine learning engineer at Nebius AI. I am working on large language models at the moment. So Nebius AI is a cloud company. We make an AI-specific cloud. But there is an R&D team that is working on LLMs, and I am coming from this team. And yes, that's what I do. Now about coffee.
Simon Karasik [00:00:26]: You know, I tried to quit coffee some time ago. I didn't drink coffee for a month. At first it was challenging, but then became fun. I realized that tea is great. You can have tea with milk, tea without milk, green tea, black tea.
Demetrios [00:00:44]: What is happening, MLOps Community? We are back for another podcast edition. As per usual, I am your host, Demetrios, and today we're talking with Simon. But before we get into all of that, I've got a little bone to pick with the Australians out there. For the life of me, I cannot understand why you call lettuce salad. Salad is so much more elaborate than just lettuce. So stop doing it an injustice, and make sure that if you're gonna call something salad, you at least have some dressing, or at the very least oil and vinegar, on it, please. Alright, that's it with my bone to pick. Now, we got into a deep, deep conversation with Simon all about checkpoints, and checkpoints when it comes to training large language models.
Demetrios [00:01:32]: As you will hear, he is currently training a 300 billion parameter model, which sounds absurdly large. I'm very excited to see what comes out of that. But he talks about how checkpoints become so big when you are training at that level. The checkpoint management in itself is something that you have to think about, and scaling laws for what you're going to do with these four-terabyte checkpoints become an issue. And how do you look at the management of it? Oh, do we need one per week? Do we need one per hour? Do we need checkpoints that we had a month ago? How do you manage all of that? He breaks it down eloquently. I really appreciated it. And he's got like the coolest job in the world, because he's working at Nebius AI and they are specifically a cloud provider that is dedicated to AI workloads. So what does that mean? Well, that means that they have tools for training and inference.
Demetrios [00:02:39]: They also have a whole lot of GPU availability, which everybody needs if you're going to be doing any kind of AI these days, it seems like. And there's also like a UX that's built for the engineer and developers, which is top notch. So you don't have this mess of services. There's not that paradox of choice of which service should I use. They really think through that. And what Simon told us is he's been dogfooding the whole LLM generation process. And as you will see, he got into some of the nitty gritty when it comes to networking, checkpointing. I feel like an expert after talking to him about checkpointing.
Demetrios [00:03:20]: And as a little icing on the cake, we could call it, he talked about the differences that he has seen over his years between training scikit-learn models and training a 1 billion parameter large language model, quote unquote large. I guess we could call that a small language model these days. And then training something like this 300 billion parameter large language model. So I hope you all enjoy. And a huge shout out to Nebius for sponsoring this episode. If you like it and you think it is valuable, we would love to see you leave some feedback on Spotify, drop a star, leave a review. Comments on YouTube are always fun. Subscribe. You know what to do.
Demetrios [00:04:07]: I'll see you all on the other side, or maybe in San Francisco on June 25 for the in-person AI Quality Conference we're organizing. Whenever I think about checkpoints, I think about Checkpoint Charlie, because I love Berlin and I go to Berlin quite a bit. So I have that association. I say the word checkpoint and the next thing that comes up is Charlie. We are going to be talking all about checkpoints today. But before we get into that, I want to talk to you a little bit about how you came to be where you're at. I know you've done some work with traditional ML and you are now training large language models. So I want to hear about the juxtaposition there and how things differ in that regard. But maybe you can talk to us about what you were doing with ads before you started training large language models.
Simon Karasik [00:05:04]: Yeah, sure. Now, first of all, I was also in Berlin recently and also had an association with Checkpoint Charlie. And it was one of my ideas to have it, like, as a logo for some presentation. I don't know, I couldn't find a good image. Yes, and I'm working with large language models, but before that I worked quite a lot in advertisement, in machine learning for advertisement. I worked at Yandex. It's kind of the Russian Google, some people call it that. And I was focused on ML infrastructure there, and I was making some different ranking systems, like click prediction.
Simon Karasik [00:05:41]: Like, if you see some banner with an advertisement, will the user click on this banner or not? And I was also working on infrastructure, because it turned out that infrastructure is really big, really complex. It's not just like, let's deploy this model. It takes like maybe 20 different steps, different teams, different systems for the model to get trained, to collect the data, to push to production, and it was like, wow, ouch, ouch.
Demetrios [00:06:12]: We'll break that down a little bit more. You can't just say that and then keep me hanging. Okay, why was it so big? What did that look like? Paint the picture a little bit more.
Simon Karasik [00:06:20]: Because if you want to create a great advertisement system that is accurate, that is personalized, you have to collect lots of data and process it. I was making some deep learning model, like some recommender model, and it doesn't just go to production. This model is a feature to another model, which is a feature to another model. It's like a big combination. And, okay, I need to train this deep learning model in PyTorch. But it's not regular PyTorch. It's like PyTorch especially for recommender systems.
Simon Karasik [00:06:51]: And it has some special add-ons developed by my colleagues. Then it comes there. We need to validate. Then we have some data that's coming every hour, but it's not actually really every hour, because it takes time for the data to actually come from your click online to the data store and then get trained on and pushed to production. And there were lots and lots of pieces like that, developed over years. And we even had this service that was serving like ten or 20 deep learning models for different teams. And that was also a big, big infrastructure for inference. And I remember, like, I do something, you know, like, I reduce some load, save some percentage of CPUs.
Simon Karasik [00:07:45]: Like, let's remove this model and save like 1% of CPUs. But it ends up saving us like thousands of CPUs. It was like, wow. And I have never seen the CPUs. I just see this dashboard, like, that's how many CPUs we consume, but what is it, like a whole data center? Like 20 computers? I don't know.
Demetrios [00:08:10]: But did you get a medal or anything? Did you get a raise after you did that?
Simon Karasik [00:08:15]: Yeah.
Demetrios [00:08:16]: Did anyone notice?
Simon Karasik [00:08:17]: I guess, sure. Everybody noticed. Because if you release some CPUs, that means now we can use them for something more useful. But you can't just remove some model from production. It takes time to check that nobody needs this model, that it doesn't make any profit. Then you can remove this model and reuse this space for something else.
Demetrios [00:08:41]: Damn, that's crazy what you just said, how there were basically dead models out there in the wild still, but there wasn't proper cleaning that had happened. So they were still just hanging out. And you recognized that and you were able to save CPU costs and just open up CPUs for newer things and potentially better things to come out, or at least new experiments that are gonna be not dead in the wild. And what did the process look like for that garbage collection?
Simon Karasik [00:09:19]: Actually, it was kind of another project. It was another project that tried to connect all the pieces together, because we had this runtime, like inference, then we had some data part, we had training pipelines, and they were not really interconnected. It happens that maybe you had some model in runtime, you removed this model, but the process that was training this model was still active, it was still consuming some CPUs every day. It was another project to actually connect all the pieces together, so that if you delete a model from runtime, you get a warning: hey, man, you still train this model. Maybe you should stop it.
Demetrios [00:09:57]: Wow, that is so wild to think about that. You had so many models out there and being trained and just being in production that it's almost like people lost count of them or there wasn't like clear owners of it. Is that safe to say?
Simon Karasik [00:10:14]: It's some kind of historical reasons, you know. Like, you have a system that has been active for like ten years. Somebody created a model like ten years ago and deployed it. It does its job, it works well. Eventually the system gets refactored, changes, changes. And then you have some line of config and you have no idea how it got here. Who made this model? The person who made this model has already been, I don't know, in another company for five years, and nobody knows. You don't want to crash production, so you're really doing it step by step.
Demetrios [00:10:51]: Yeah, I can understand that. You're being very cautious about how you go about collecting those models or getting them out of production, because it is a non zero chance that it could actually be very valuable and there could be a lot of money being made. And then you went and you said, all right, enough of the deep learning recommender systems. I'm going to go to the source. I like large language models now. Or what happened. Give me the breakdown on how you started working with large language models and training large language models.
Simon Karasik [00:11:27]: Yeah, it was just an opportunity. I saw this job. I decided it's really interesting, I see how large language models are changing the world. I even remember when I was working back in advertisement, my team lead was so impressed. He said, like, no matter what we do, LLMs will replace everything.
Demetrios [00:11:50]: Oh no, he was one of those guys.
Simon Karasik [00:11:52]: Yeah, he was kind of a fatalist, but still, it was impressive, you know, like when you think, I'm doing this super tough job, nothing can replace me. Then you go to ChatGPT and say, hey, could you write this code for me? And it does the job. You're writing an article, you spend like 1 hour, then you ask ChatGPT and it makes it better than it was. And now it's mind-blowing.
Demetrios [00:12:24]: Yeah, yeah. So that made you realize there's a lot of potential here. And what happened?
Simon Karasik [00:12:33]: Yes, I joined this team and we started making these language models. I was one of the first to join, maybe one of the first people to join. And what was impressive for me, because you know, back in the days when I worked in ads, it was a really large team, like there were people for everything. Like you can talk to this guy, to this guy, and I didn't have to.
Demetrios [00:13:02]: How there was a million models in production.
Simon Karasik [00:13:04]: Yeah. And I haven't really had an idea of how it all works at such a big scale.
Demetrios [00:13:10]: You joined a new team within Yandex that was working on large language models?
Simon Karasik [00:13:14]: No, I joined Nebius. I joined Nebius. And now Nebius is working on this LLM project, okay.
Demetrios [00:13:21]: And it's a startup so you don't have someone to get you coffee in the morning and someone to write your reports for you. You gotta do everything now on your own.
Simon Karasik [00:13:30]: Yeah, sounds like that. So Nebius is a cloud company, it's mostly cloud development, cloud engineering, but we still have this team for large language models. We are like a small island of machine learning, actually, machine learning guys surrounded by big infrastructure guys. And I have to come up with ideas, my team has to come up with ideas: how can we build this machine learning, big machine learning training, out of bare cloud services? For sure, you have Kubernetes, you have storage, you have some CI, but these are building blocks, and you need to build a real big pipeline. Oh I see. It's an interesting journey.
Demetrios [00:14:18]: So it's almost like you're dogfooding the LLM journey that your customers are going on and you're doing it yourself. So that when customers say, oh, we want to change a large language model, we want to train a large language model, then you understand what they're about to go through.
Simon Karasik [00:14:38]: Yeah, exactly. So dogfooding is one of the terms we're now using. And actually, you know, we are one step ahead of our customers. So whatever happens, we hit it first: we might find some problem, we fix it. And we know that now our cloud is 100% production ready, because we train such a big model and we know what can go wrong and what can go well.
Demetrios [00:15:08]: And talk to me about the models that you've trained so far.
Simon Karasik [00:15:12]: Yeah. So we started with kind of testing. So our first goal was to verify that we can train a good model. So we tried to reproduce LLaMA 7 billion parameters in a way which reproduces the data collection pipelines and training pipelines. We had the same kind of budget limit as LLaMA and made sure that we can do a good job. And then we decided to go bigger. So now we are training a 300 billion parameter model. It's like huge.
Simon Karasik [00:15:47]: It's still in progress. So I maybe can't share lots of details, but I hope you will hear about it soon. But it's not just 300 billion parameters. It's like more than 1000 GPUs to train. Just insane. And I have never seen these GPUs. I have never been to a data center, but I can imagine a big, big room full of computers just doing some matrix multiplications.
Demetrios [00:16:15]: Yep, yep. That's where you can go in the winter if it gets too cold. Yeah, that big room with all those GPUs. Wow. So that is impressive. Now I want to talk about checkpoints and what that experience has been like, what the learnings you have seen while training have been. I think, for the uninitiated, can I give you my understanding of checkpoints? And then you can. Yeah, sure, correct me and tell me where it's wrong.
Demetrios [00:16:47]: I look at checkpoints like when you're playing video games and you get to a certain part in the game that the next time you die, you go back to that part. So you don't go all the way back to the beginning, you just go back to whatever the level four, level five, wherever you get to the beginning of level four and then you face the final boss of level four. But is that a good way of putting it or tell me where I'm off.
Simon Karasik [00:17:16]: Yeah, you know, it's really a good comparison. It's really how it happens. But you know what can happen? Like, you wake up one morning and see in your loss plots that your loss has exploded, you know, like, oh wow, we need to roll back. I hope we have a checkpoint, because if you don't have a checkpoint, you're in trouble. You need to redo like days of work. But if you have a checkpoint, you still have maybe some hours to retrain.
Demetrios [00:17:46]: How often are you checkpointing?
Simon Karasik [00:17:49]: Maybe like once every hour, maybe 2 hours.
Demetrios [00:17:53]: Nice.
Simon Karasik [00:17:53]: So just to make sure that if something happens, we don't have to kind of redo many steps. Because on the other hand, a checkpoint is also not free. If I want to save a checkpoint, I still have to freeze the training for ten, for five minutes, maybe for some minutes, to save the checkpoint. So if I do checkpoints too frequently, I make the training slower. But if I don't do checkpoints, I'm at risk.
Demetrios [00:18:21]: Yeah, yeah, you're taking a risk. So you found the sweet spot is once an hour, once every 2 hours, yeah.
Simon Karasik [00:18:28]: It just sounds like some reasonable, reasonable amount of time. And it's okay: if something happens, maybe we are okay to redo 1 hour of training.
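The trade-off described here (saving pauses training, while not saving risks redoing work) can be put into rough numbers. The following is a minimal sketch, not the team's actual tooling. It assumes that each save pauses training for a fixed time and that a failure is equally likely at any moment between two checkpoints; the function names and the example intervals are illustrative.

```python
def checkpoint_overhead(interval_min: float, save_min: float) -> float:
    """Fraction of wall-clock time spent paused while saving."""
    return save_min / (interval_min + save_min)


def expected_rework_min(interval_min: float) -> float:
    """Average training time lost per failure, assuming the failure
    is equally likely at any moment between two checkpoints."""
    return interval_min / 2


# Compare a few candidate intervals, assuming a 5-minute pause per save.
for interval in (15, 60, 120):
    overhead = checkpoint_overhead(interval, save_min=5)
    rework = expected_rework_min(interval)
    print(f"every {interval:>3} min: {overhead:.1%} of time paused, "
          f"~{rework:.0f} min redone per failure")
```

Shorter intervals push the paused fraction up, longer ones push the redone work up, which is why an interval of one to two hours can be a reasonable middle ground under these assumptions.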
Demetrios [00:18:38]: Yeah. And you're checkpointing everything, like, or how do you calculate the checkpoint size? I guess.
Simon Karasik [00:18:45]: Yeah, sure. So, you know, it's quite impressive, because if you have a large language model on Hugging Face, often it's like an inference model, in a way. It is already quantized and you're just ready to push it onto a GPU, and often it's like two bytes per parameter. So you have LLaMA 7 billion parameters and it's like 14, 13 GB. But if you train a model like this, you need much, much more space, because you have the optimizer, you have the parameters, it's like three numbers: for every billion parameters, you need to have like three billion numbers to save. And each number is a floating point number, it's like four bytes. So, okay, you have LLaMA 7 billion on Hugging Face, it's just like thirteen gigabytes. But to train LLaMA you need to have 70 GB of space. Like 70 or something like that.
Simon Karasik [00:19:54]: And that's not all. Okay, we save 70 GB, but in memory you have much more: you have gradients, you have all this kind of stuff that is allocated when you train. You don't save it, but it is allocated on GPUs. So if I wanted to train something, it's like, okay, no problem, it's just thirteen gigabytes. No, it's not thirteen, it's like hundreds of gigabytes that need to be allocated in GPU memory.
Demetrios [00:20:22]: And that's presumably why things get very expensive very quickly.
Simon Karasik [00:20:29]: Yeah, sure. Because you need to have lots of memory just to allocate a model.
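The size arithmetic from this exchange can be written down directly. This is a rough sketch under the numbers quoted in the conversation: two bytes per parameter for a published inference model, and three four-byte numbers per parameter for a training checkpoint. Reading the "three numbers" as the weight plus two optimizer values is an assumption, and the function names are illustrative.

```python
def inference_size_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights only, as typically published for inference."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def training_checkpoint_gb(params_billion: float,
                           numbers_per_param: int = 3,
                           bytes_per_number: int = 4) -> float:
    """Saved training state: several 4-byte numbers per parameter.
    Gradients and activations live in GPU memory but are not saved,
    so they are deliberately left out here."""
    return params_billion * numbers_per_param * bytes_per_number


for name, size in (("7B", 7), ("300B", 300)):
    print(f"{name}: inference ~{inference_size_gb(size):,.0f} GB, "
          f"training checkpoint ~{training_checkpoint_gb(size):,.0f} GB")
```

Under these assumptions a 7B model comes out at 84 GB of saved state, in the same range as the "70 or something" quoted above, and a 300B model at about 3.6 TB.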
Demetrios [00:20:35]: Yeah. And you need to have the confidence that you're doing the right thing because I can imagine a lot of fear that people have when they're training models is they're training something and it's going to come out in 20 days or 14 days and it's going to be absolute shit and not very useful. So how are you making sure to cover your back on that?
Simon Karasik [00:21:00]: Yeah, sure. So first of all, we start with small experiments. So we start with like a 1 billion model, a 3 billion model, something that can be trained in one day, in three days, not on hundreds of GPUs, but maybe on eight or 16 GPUs, just to make sure that we are doing the right job. Because we know there are papers about scaling laws, about what the result will be if you make your model ten times bigger. So knowing that, we can say, okay, if it's okay for a small model, a big model also should be okay. So this is kind of the first way we make sure that everything is alright. Also, when we do training, we have lots of plots. We have Weights & Biases. It's like we have 100 dashboards there, because we have loss, we have gradients, we have gradients on this layer, on that layer.
Simon Karasik [00:21:56]: And it really helps because if something goes wrong, we can always go to this particular plot and see what happens there.
Demetrios [00:22:06]: Okay. But it feels like there's different types of checkpoints that you can do, right? Like there's basic checkpoints and maybe there's more in depth checkpoints. Am I off on that one?
Simon Karasik [00:22:21]: Actually, I guess there is only one type of checkpoint we can do. So we just save the model with the optimizer state. And basically it is everything we need to have to restore our training. So if something crashes, we can still restore the training from this checkpoint and it will still have the same loss, it will continue right where it stopped.
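Why the optimizer state has to be part of the checkpoint can be shown with a toy example. The sketch below is framework-free and purely illustrative: a single weight trained with SGD plus momentum on a made-up loss, with the state saved as JSON. In a real setup the same idea applies to the framework's own model and optimizer state; all names here are hypothetical.

```python
import json
import os
import tempfile


def train_step(state: dict) -> None:
    """One step of SGD with momentum on the toy loss (w - 3)^2."""
    grad = 2.0 * (state["weight"] - 3.0)
    state["momentum"] = 0.9 * state["momentum"] + grad
    state["weight"] -= 0.05 * state["momentum"]
    state["step"] += 1


def save_checkpoint(state: dict, path: str) -> None:
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a half-written checkpoint under the final name.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def run(state: dict, steps: int) -> dict:
    for _ in range(steps):
        train_step(state)
    return state


def fresh_state() -> dict:
    return {"weight": 0.0, "momentum": 0.0, "step": 0}


# Reference: 20 uninterrupted steps.
reference = run(fresh_state(), 20)

# Interrupted run: 10 steps, save, "crash", restore, 10 more steps.
with tempfile.TemporaryDirectory() as ckpt_dir:
    path = os.path.join(ckpt_dir, "step_10.json")
    save_checkpoint(run(fresh_state(), 10), path)
    resumed = run(load_checkpoint(path), 10)

print("resumed run matches uninterrupted run:", resumed == reference)
```

If only `weight` were saved and `momentum` were reset to zero on restore, the resumed run would follow a different trajectory, which is the reason the optimizer state is stored alongside the model.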
Demetrios [00:22:45]: So we, when it is a smaller model, like you were talking about, are checkpoints any different?
Simon Karasik [00:22:55]: Yeah. So there is a difference. And the difference is in size. So if you train a model with like 300 billion parameters, its checkpoint is like three and a half terabytes. Just insane. Like, you see, three and a half terabytes. You know, my MacBook has only like 400 GB.
Demetrios [00:23:15]: Oh my God.
Simon Karasik [00:23:16]: It's more than the hard drive of my laptop, and it needs to be saved like every hour, with many replicas. And it doesn't fit into one computer. Even if you have this super powerful eight-GPU virtual machine, it has 1 [TB of] memory and the checkpoint is three terabytes. So you need to have several virtual machines to save this checkpoint and to load it. And it's really a lot. So if you have a small model, like relatively small, like 1 billion parameters, some years ago it was the highest you could get. I remember when BERT was released, BERT large was 300 million and it was a lot.
Simon Karasik [00:24:08]: Nobody could run it. It was really heavy. And nowadays we don't even count this one.
Demetrios [00:24:15]: We don't even bat an eye. That's so true.
Simon Karasik [00:24:17]: Yeah. So yeah, if you have a small checkpoint, it's not a problem to save it. Maybe it's 50 GB, maybe 70. It's still not so much; the machines on which we run training are much bigger. And you can save 70 GB to disk. It will take some time, but it's okay. It's not a big issue. But now when you have terabytes you need to save, you need to do it in parallel.
Simon Karasik [00:24:45]: You need to distribute the checkpoint among hosts, among virtual machines. Because if you try to save three terabytes from one virtual machine, it will take really long.
Demetrios [00:24:59]: Yep.
Simon Karasik [00:25:01]: So we do some hacks to make it faster.
Demetrios [00:25:05]: Oh yeah, tell me about those.
Simon Karasik [00:25:07]: Yeah, sure. So first of all, we do it in parallel on many machines. We split our checkpoint into parts. So if we are training on eight machines, each machine will save a part of this checkpoint. So, yeah, okay, in total it's like three terabytes. But for every machine it's just like 300 GB. It's still quite a lot, but it's not so much.
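The idea of every machine writing only its own part can be sketched as follows. This is a simplified simulation, not the team's implementation: the "checkpoint" is a single byte string and the hosts are simulated in a loop, whereas in real training each host already holds only its own slice and the writes happen in parallel. The file naming scheme is hypothetical.

```python
import os
import tempfile


def shard_bounds(total_items: int, num_hosts: int, rank: int) -> tuple:
    """Half-open [start, end) slice of the checkpoint owned by `rank`."""
    base, extra = divmod(total_items, num_hosts)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def save_shard(payload: bytes, num_hosts: int, rank: int, ckpt_dir: str) -> str:
    """Write only this host's slice of the checkpoint to its own file."""
    start, end = shard_bounds(len(payload), num_hosts, rank)
    path = os.path.join(ckpt_dir, f"shard_{rank:05d}_of_{num_hosts:05d}.bin")
    with open(path, "wb") as f:
        f.write(payload[start:end])
    return path


def load_all_shards(ckpt_dir: str) -> bytes:
    """Reassemble the checkpoint by concatenating shards in rank order."""
    parts = []
    for name in sorted(os.listdir(ckpt_dir)):
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            parts.append(f.read())
    return b"".join(parts)


checkpoint = bytes(range(256)) * 10  # stand-in for terabytes of state
num_hosts = 8

with tempfile.TemporaryDirectory() as ckpt_dir:
    shard_sizes = []
    for rank in range(num_hosts):  # in reality: one call per host, in parallel
        path = save_shard(checkpoint, num_hosts, rank, ckpt_dir)
        shard_sizes.append(os.path.getsize(path))
    restored = load_all_shards(ckpt_dir)

print("shard sizes:", shard_sizes)
print("round trip ok:", restored == checkpoint)
```

Each host writes roughly one eighth of the total, so the time to save is bounded by the slowest single shard rather than by the whole checkpoint.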
Demetrios [00:25:35]: Yeah, it reminds me of torrents, like when you torrent a music file or something and you get it back from many different pieces of that music file.
Simon Karasik [00:25:48]: Yeah, kind of something like that. Yeah. And we also realized, okay, we have these machines and we have a network to the storage, like, you know, a regular Internet network. And our machines are connected by a super fast InfiniBand network. It's like a GPU-to-GPU network, because if you train a model on many machines, you need to communicate a lot. That's why there is the super fast InfiniBand network. And okay, we have a network between hosts that is much faster than the network to the storage.
Simon Karasik [00:26:24]: So when we save our model, our checkpoint, it is saved as small bits from every host. Then we load small bits. So when we load the checkpoint, every host has only a small part of the model. It's not enough, but in total the hosts have everything. And then we use our super fast network to exchange parts.
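The benefit of reading a small part from storage and exchanging the rest over the fast interconnect can be estimated with a back-of-the-envelope model. This sketch assumes every host ends up needing the whole checkpoint, that each host has its own fixed storage bandwidth, and that the exchange is limited by the per-host interconnect bandwidth. All bandwidth figures below are made-up placeholders, not measurements from the talk.

```python
def naive_load_seconds(ckpt_gb: float, storage_gb_per_s: float) -> float:
    """Every host reads the full checkpoint over its storage link."""
    return ckpt_gb / storage_gb_per_s


def sharded_load_seconds(ckpt_gb: float, num_hosts: int,
                         storage_gb_per_s: float,
                         interconnect_gb_per_s: float) -> float:
    """Each host reads 1/N from storage, then receives the other
    (N-1)/N from its peers over the host-to-host network."""
    read = (ckpt_gb / num_hosts) / storage_gb_per_s
    exchange = ckpt_gb * (num_hosts - 1) / num_hosts / interconnect_gb_per_s
    return read + exchange


# Hypothetical numbers: 3.5 TB checkpoint, 8 hosts,
# 1 GB/s to storage, 50 GB/s between hosts.
naive = naive_load_seconds(3500, storage_gb_per_s=1)
sharded = sharded_load_seconds(3500, num_hosts=8,
                               storage_gb_per_s=1,
                               interconnect_gb_per_s=50)
print(f"naive:   {naive / 60:.1f} min")
print(f"sharded: {sharded / 60:.1f} min")
```

As long as the interconnect is much faster than the link to storage, the sharded load is dominated by the small storage read, which is the effect described above.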
Demetrios [00:26:51]: How often are you needing to use the checkpoints? Is it like once a week, once a day?
Simon Karasik [00:26:59]: Yeah, it depends, you know. If everything goes well, we don't ever want to roll back.
Demetrios [00:27:07]: Yeah, ideal world.
Simon Karasik [00:27:08]: We have a training, we just start it, we wait one month. So it depends. You know, actually we use checkpoints always, because on top of what we train we have another background process that takes the checkpoint and runs some evaluations, some validation, because we don't want to stop our main training but we still want to collect all these benchmarks like MMLU and others. So we have a background process that reads every new checkpoint and tries to validate it. But it's also tricky, because if our checkpoint is that big, like three terabytes, this other process also has lots of work to do.
Demetrios [00:27:51]: Just thinking about these gigantic checkpoints that you have and how you can even work with them. So basically the ideal scenario is you set it and forget it. But the realistic scenario is that you're going in and trying to figure out how things look every once in a while.
Simon Karasik [00:28:14]: Yeah, got it. So first of all, as we realized, it's important to save checkpoints properly. We save one every hour, but we don't need to store checkpoints from every hour. But we need to store a checkpoint from yesterday and from last week. You know, kind of this exponential schedule, because if you store one every hour, and every one is like three terabytes, you can quickly reach one petabyte of disk. And actually we use Kubernetes to orchestrate our training. It was also quite a funny and tough journey to get it working,
Simon Karasik [00:29:00]: to orchestrate these 100 machines working all together.
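One possible reading of the "exponential schedule" for keeping checkpoints is to group them into age buckets that double in width and keep one checkpoint per bucket. The sketch below is an illustration of that idea only; the exact policy used by the team is not described in the conversation, and the function name is hypothetical.

```python
def checkpoints_to_keep(ages_hours: list) -> list:
    """Keep the oldest checkpoint in each age bucket.

    Buckets double in width: age 0, age 1, ages 2-3, ages 4-7,
    ages 8-15, and so on. Recent history stays dense, old history
    gets sparse.
    """
    oldest_in_bucket = {}
    for age in ages_hours:
        bucket = age.bit_length()  # 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ...
        if age > oldest_in_bucket.get(bucket, -1):
            oldest_in_bucket[bucket] = age
    return sorted(oldest_in_bucket.values())


# One checkpoint per hour for 30 days, identified by its age in hours.
hourly_for_a_month = list(range(30 * 24))
kept = checkpoints_to_keep(hourly_for_a_month)

checkpoint_tb = 3.5
print("ages kept (hours):", kept)
print(f"keep everything: {len(hourly_for_a_month) * checkpoint_tb:,.1f} TB")
print(f"keep by schedule: {len(kept) * checkpoint_tb:,.1f} TB")
```

Keeping every hourly checkpoint for a month at 3.5 TB each would need about 2.5 PB, while this schedule keeps 11 checkpoints, so the number of stored checkpoints grows with the logarithm of the training time rather than linearly.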
Demetrios [00:29:04]: Well, because I've heard a lot of people like to use Slurm. Right? Was there a reason that you chose Kubernetes instead?
Simon Karasik [00:29:11]: So actually, no. I heard about Slurm only recently, when I was talking to my colleagues who work with clients, and they said we need Slurm because clients use Slurm. And I'm like, what is Slurm? You know, it happens sometimes: you have some technology that has been around for years, but you have never touched it. And then you realize that you just needed to use the right Google query, search the right things. Yeah.
Simon Karasik [00:29:42]: So we use Kubernetes because, I don't know, it feels natural, because I had experience with Kubernetes, my colleagues have experience with Kubernetes. Yeah, it's the natural way to go.
Demetrios [00:29:56]: Yeah, yeah, I know the Databricks team who trained DBRX, they came on last week and they were talking about how they made that choice too, and they went with Kubernetes. But I have heard a lot of the researchers enjoy Slurm much more than Kubernetes.
Simon Karasik [00:30:16]: Yeah, I agree. So I read a bit about Slurm. It seems to arise from MPI, from this kind of more scientific computation. And maybe the guys who are doing research, they come from this background and they're used to MPI, but those guys who are coming from more of an engineering, infrastructure background, they're more familiar with Kubernetes.
Demetrios [00:30:40]: Yeah, yeah. Are there any lessons that you learned when it comes to storage choices?
Simon Karasik [00:30:47]: It was very impressive for me when I was trying to understand how storage works in the cloud. Because, you know, if I have my laptop, it's all clear. I have my laptop, I have an SSD or HDD, just save it. But in the cloud you have a data center, you have virtual machines, you have storage. Where does the storage live? If you have, you know, S3 and you say, just S3, let's save it, what actually happens? It turns out to be a really big job. Because first of all, what was mind-blowing for me: you know, you have a virtual machine and you have a disk on this virtual machine, but it's not there. If you have a virtual machine, the disk is not really connected to this machine. There are other machines that have these disks, and that's why they are called network disks. Yeah, cloud uses network disks. Because, you know, imagine you have a virtual machine and you want to go to another virtual machine with more CPUs, with more memory, but you still have the disk.
Simon Karasik [00:32:00]: You can't just take the disk and put it into another machine. You can't move all your data from one machine to another. That's why cloud uses these network disks. And okay, it was the first insight for me into how hard it is. And regarding the choice of storage: we had several alternatives that we considered and tried. So we had file storage, like NFS, we had S3-like storage. And we also considered this: to do data processing, we are using a product called TractoAI.
Simon Karasik [00:32:39]: It's a product built on top of YTsaurus. It's a kind of big data system that was previously built at Yandex, and now it's like a collaboration between Nebius and Yandex. It's an open source product, so we are contributing to it. And we have this product that is built on top of YTsaurus. And so we also had this option.
Demetrios [00:33:02]: Because when you're talking about storage, you're not just talking about this gigantic terabyte storage of checkpoints. You're also talking about all the data. Break down everything that you're thinking about when it comes to storage.
Simon Karasik [00:33:15]: I care about security and scalability, because it could be that you have some storage and it works just well when you attach it to one virtual machine, but then you have like ten virtual machines and they all want to do something, but the storage is not scalable. And it says, oh, man, you had this 1 GB of bandwidth, but now you all share this 1 GB, like, not 1 GB each, but 1 GB for all. Also, what's important when I'm choosing a storage is what features it supports. Because, you know, we all worked with S3. It looks just great. It's just easy. Let's push something there.
Simon Karasik [00:34:01]: But then when I worked with it, I faced this old issue that S3 is not a file system. We tried to store checkpoints in S3, and we used this kind of tool, a kind of S3 file system, that makes you feel that you have a bucket in S3 but it looks like a disk, it looks like a mount on a virtual machine. It worked just well. But then our checkpointing library was doing some move. Like, let's move the checkpoint from here to there. And with S3, you can't do it, because a move in S3 is like, let's copy the data and delete the old data. That's how you do a move. And we just observed that
Simon Karasik [00:34:52]: our training became very slow.
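The cost of a "move" on an object store can be illustrated with a toy in-memory store. This is not the S3 API and not the file system layer mentioned above; it is a hypothetical stand-in that only models the one property that matters here: an object cannot be renamed in place, so a move is a full copy followed by a delete. On a local file system, by contrast, a rename within one volume only updates metadata.

```python
class ToyObjectStore:
    """In-memory stand-in for an S3-like store: flat keys, no rename."""

    def __init__(self) -> None:
        self.objects = {}
        self.bytes_copied = 0  # how much data "moves" had to rewrite

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def copy(self, src: str, dst: str) -> None:
        data = self.objects[src]
        self.objects[dst] = bytes(data)  # the whole object is rewritten
        self.bytes_copied += len(data)

    def delete(self, key: str) -> None:
        del self.objects[key]

    def move(self, src: str, dst: str) -> None:
        # There is no rename primitive: move = copy everything, then delete.
        self.copy(src, dst)
        self.delete(src)


store = ToyObjectStore()
shard = b"\x00" * 1_000_000  # stand-in for a multi-gigabyte shard

# A common checkpointing pattern: write under a temporary name,
# then "rename" to the final name once the write has finished.
store.put("checkpoints/tmp/shard_0.bin", shard)
store.move("checkpoints/tmp/shard_0.bin", "checkpoints/step_100/shard_0.bin")

print("keys:", sorted(store.objects))
print("bytes rewritten by the move:", store.bytes_copied)
```

With terabyte-sized checkpoints, a write-then-rename pattern effectively transfers the data twice on such a store, which is one plausible explanation for the slowdown described here.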
Demetrios [00:34:55]: So it feels like you were hitting, it's not necessarily the scale scaling laws that we were talking about from, like, the papers and the training, but you were hitting different types of scaling laws where you didn't necessarily need to think about these things, like, when you have smaller amount of data or you have smaller checkpoints, but all of a sudden s three wasn't possible for you. So you went with, did you end up choosing NFS?
Simon Karasik [00:35:23]: So at the moment, we use TractoAI, okay? Because it was an experiment and it went really well. We liked what we had. And another motivation to use TractoAI was the fact that we already use it for data processing. We already store the data there, we process the data there, we know it is a reliable system, and most of the time we are using this system anyway. So it's kind of natural to use it in one more place and have everything there. It's easier for us to manage it like this.
Demetrios [00:36:01]: And besides storage and hitting scaling laws with storage, what are some other pieces that have become like paramount in your setup?
Simon Karasik [00:36:15]: Yeah, first of all, it's Kubernetes. We use it a lot. Like, we have some problem: okay, let's deploy a DaemonSet. We have some problem: let's do this, let's deploy that. It became a tool we use a lot. It's an important tool to master, and we even use some tools on top of Kubernetes to orchestrate our training. So we use Argo.
Simon Karasik [00:36:41]: Argo is a kind of orchestration framework, and we found it really useful because of what happens with model training. It's not like you just have one pod and another pod: they're all connected. If you train a model on 100 machines, they all have to be coordinated. And if one machine fails, everything should fail. And so we used Argo to implement this kind of coordination.
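The "if one machine fails, everything should fail" behaviour can be sketched in plain Python. This is an illustration of the coordination idea only, not Argo's API and not the team's setup: threads stand in for machines, a threading.Barrier stands in for the per-step synchronisation of distributed training, and run_gang, healthy, and flaky are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_gang(step_fns, num_steps):
    """Run one worker per step function; abort all workers when any one fails."""
    # Every worker must reach the barrier at the end of each step,
    # a stand-in for the collective synchronisation in distributed training.
    barrier = threading.Barrier(len(step_fns))

    def run_one(rank, step_fn):
        for step in range(num_steps):
            try:
                step_fn(rank, step)
            except Exception:
                barrier.abort()  # break the barrier so waiting workers stop too
                return "failed"
            try:
                barrier.wait()
            except threading.BrokenBarrierError:
                return "aborted"  # some other worker failed
        return "ok"

    with ThreadPoolExecutor(max_workers=len(step_fns)) as pool:
        futures = [pool.submit(run_one, rank, fn) for rank, fn in enumerate(step_fns)]
        return [f.result() for f in futures]


def healthy(rank, step):
    pass  # pretend to do one training step


def flaky(rank, step):
    if step == 1:
        raise RuntimeError(f"worker {rank}: simulated hardware fault")


results = run_gang([healthy, healthy, flaky, healthy], num_steps=3)
print(results)
```

Only the worker that hit the fault reports "failed"; the others report "aborted", which is the distinction that matters later when looking for the machine that actually broke.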
Demetrios [00:37:12]: Okay, it was Argo Workflows then?
Simon Karasik [00:37:15]: Yeah, it was Argo Workflows.
Demetrios [00:37:16]: Oh, it was. The interesting thing about Argo Workflows: I was just at KubeCon a few weeks ago and it feels like everybody was wearing an Argo shirt, and it's gotten a lot of traction these days. It is what Kubeflow is built on top of. So I think a lot of people have gone and said, okay, we want to get a little bit further down and have more knobs. I think Argo is more mature, nicer than Kubeflow, and so people are going towards Argo more than Kubeflow these days. But that's interesting that you chose it too and you saw the value in it. What other add-ons were you using with Kubernetes? Because I know, especially since you talked about how networking was so important, were you doing special stuff with networking? Were you using a specific flavor of Kubernetes? Was it your own stood-up cluster? Was it the cloud version? What does that look like?
Simon Karasik [00:38:16]: Yeah, so we use standard Kubernetes as it's available in our cloud, but for sure we have some additional stuff that we deploy to Kubernetes. First of all, it is the GPU operator, because GPUs are really complex: you need to monitor them well, you need to check them. So we deploy the GPU operator. It makes sure that everything works smoothly. For example, if the GPU operator checks your GPU and sees that the GPU is broken for some reason, it's too hot, for example, it will stop it. It will just tell Kubernetes, hey Kubernetes, this node is not good. Let's not use it.
Demetrios [00:38:58]: Relax for a little bit. Take a little break.
Simon Karasik [00:39:01]: Yeah. Another thing that we found useful is that we deployed our custom piece of software that runs network testing. It checks that all the nodes that we're using to train are well connected and have the right speed, just in case. You know, it can happen: when you train a model, it trains all together, and if one node is slow for some reason, it looks like the whole training is slow. So you really need to have some parallel monitoring to make sure that, okay, this node is slow or this node is okay. So we do this testing to make sure that everything is well connected and works smoothly.
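One simple way to flag a slow node from such network tests is to compare each node against the typical node. The sketch below is an assumption about how this could be done, not a description of the custom software mentioned above; find_slow_nodes, the 0.8 tolerance, and the bandwidth figures are all made up for illustration.

```python
from statistics import median


def find_slow_nodes(bandwidth_gbps, tolerance=0.8):
    """Return nodes whose measured bandwidth is below tolerance * median."""
    typical = median(bandwidth_gbps.values())
    return sorted(
        node for node, bw in bandwidth_gbps.items() if bw < tolerance * typical
    )


# Hypothetical per-node results of a bandwidth test, in Gbit/s.
measured = {"node-0": 98.0, "node-1": 101.5, "node-2": 42.0, "node-3": 99.2}
print(find_slow_nodes(measured))
```

Using the median rather than the mean keeps a single very slow node from dragging down the baseline it is compared against.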
Demetrios [00:39:48]: So what are you doing in those cases? Because it's basically like the army, you're only as fast as your weakest point or you're only as fast as your slowest point and you're monitoring the Kubernetes nodes. And so let's say that one of the nodes for some reason is going very, very slow, what do you do? Do you stop training and then try and debug? Or do you go into it and try and just update on the fly?
Simon Karasik [00:40:15]: So, you know, one great feature of the cloud is that you don't rent actual machines, you don't rent specific GPUs, you rent a kind of resource. It's a concept, it's an idea. So if we see that something doesn't work, for example this GPU is not working, we can always disable it and we will be given another. So we automatically stop the training. We have a process that does monitoring: it stops the training, it removes the broken GPU, we are given another node automatically, and we continue there. And then we can talk to our cloud team. They will debug it, they will find out what went wrong and fix it, and we will not have this kind of issue anymore.
Demetrios [00:41:06]: So most of the time it's not even on your side. It's just that the GPU itself for some reason isn't working that well.
Simon Karasik [00:41:14]: Yeah, that could be the reason, because it's a super complex piece of hardware. It's like, you have a machine, it has eight GPUs, they're connected inside, they're connected to each other, they're connected to other machines. It's a super complex system and there is always something that can go wrong. But because it's the cloud, it's easier to just, okay, replace it and deal with it later.
Demetrios [00:41:42]: Yeah. And so you're monitoring the system and the Kubernetes clusters I imagine. Are you using something like Prometheus on that? And then you're also monitoring the training and are there other things that you're monitoring?
Simon Karasik [00:42:01]: Yeah, sure. So first of all, we use Weights & Biases to collect our training logs, I mean loss, etcetera. And Weights & Biases already does a pretty good job. It also collects some statistics about the machine at runtime: it shows GPU temperature, I don't know, the amount of memory consumed. So we have quite a lot of information coming from Weights & Biases. Also, there is a lot of cloud-based monitoring.
Simon Karasik [00:42:35]: Because when you create a virtual machine in the cloud, it already has some monitoring in it. We just go to another tab in the UI and we see the temperature, you know, we see CPU load, etcetera. We also collect logs of training, and that also became quite a challenging task. It's interesting how machine learning is getting closer to microservices, because I used to do microservices before, and I used some logging systems with microservices, and it is all distributed but connected. It's the same with training. You have a training run, it's running on 100 machines, but it's all interconnected. And you need to collect logs from every machine to eventually have a picture of what went wrong. Because if training fails, every node, every machine has a bunch of logs like, hey, something went wrong and failed and failed and failed and failed. And you need to have a way to find the one specific machine that had the real error, because the others failed because that one failed.
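A minimal sketch of that search: once logs from every machine are collected in one place, a first guess at the root cause is the earliest error across all nodes. The LogRecord type, the first_failure helper, and the sample messages are hypothetical, and the approach assumes node clocks are reasonably well synchronised, which real clusters do not always guarantee.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogRecord:
    node: str
    timestamp: float  # seconds since the start of the run (assumed comparable across nodes)
    level: str
    message: str


def first_failure(records: List[LogRecord]) -> Optional[LogRecord]:
    """Return the earliest ERROR record across all nodes, or None if there are none."""
    errors = [r for r in records if r.level == "ERROR"]
    return min(errors, key=lambda r: r.timestamp) if errors else None


# Hypothetical aggregated logs: node-7 breaks first, the others fail as a consequence.
records = [
    LogRecord("node-3", 120.4, "ERROR", "collective operation timed out"),
    LogRecord("node-7", 118.9, "ERROR", "GPU fell off the bus"),
    LogRecord("node-1", 120.5, "ERROR", "collective operation timed out"),
    LogRecord("node-7", 100.0, "INFO", "step 500 done"),
]

root = first_failure(records)
print(root.node, "-", root.message)
```

Sorting by time is only a heuristic; in practice it helps to also separate "my own fault" errors from "a peer went away" errors, since the latter are almost always consequences.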
Demetrios [00:43:57]: Yeah, because it's all linked. So if one goes down, everyone goes down.
Simon Karasik [00:44:02]: Yeah.
Demetrios [00:44:03]: So it's funny you mention that because I do remember there was this blog post that came out, I want to say it was like 2020 or early 2021, from OpenAI. And they talked about how they scaled their Kubernetes cluster to over 6000 nodes. And I remember when that came out, and this was pre OpenAI being what it is today. And I was so fascinated by that, by how big the training jobs they needed were and the type of problems that they were looking at. And so this was as they were probably training something like GPT-2, and they were scaling it up to 6000 nodes and getting to that point. So I understand things can get very complex very quickly. And so you're looking at the training job, so you're monitoring the training job, you're also monitoring the GPUs, making sure they're working, and then making sure that you have a way to debug if things go offline. And then you're also monitoring just the system in general and the Kubernetes cluster, making sure that that is working and every node is firing how it should.
Simon Karasik [00:45:19]: Yeah exactly. So we tried to monitor everything.
Demetrios [00:45:22]: So there's two last questions that I want to get to for you. First one is around pre training versus fine tuning and how much of like this conversation that we're having with checkpoints changes as we start looking at fine tuning.
Simon Karasik [00:45:41]: I guess the difference is, first of all, if you do fine-tuning you don't need so many GPUs. If you do pre-training you need to have, like, a thousand GPUs. But if you then want to fine-tune this model, you have just, I don't know, 1,000 rows of data, 5,000, I don't know, not many, not billions of rows, and you don't need so many GPUs. You're OK to go with just two machines. But the problem is the model is still as big as it was. It was like three terabytes, it is still three terabytes, but now you need to feed these three terabytes not into 100 machines, but into two machines or four machines. You need to load the three terabytes from two machines, and it creates a different kind of workload for the cloud, a different kind of requirements. Because, you know, if you want to load a checkpoint from 100 machines, you want your storage to be scalable in terms of consumers: you want your storage to be able to work with 100 clients at the same time. But now you don't have so many consumers, you just want to load the three terabytes from two machines. So now you need this network to be super fast, this particular network from the storage to this one machine.
Simon Karasik [00:47:06]: Otherwise you will wait, I don't know, like an hour for your data to get into your machine.
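The arithmetic behind this can be sketched in a few lines. The checkpoint size (three terabytes) and host counts (100 versus two) come from the conversation; the per-host bandwidth of 1 GB/s is purely an assumption for illustration, and the sketch also assumes the storage itself is not the bottleneck.

```python
TB = 10**12
GB = 10**9


def load_time_seconds(checkpoint_bytes, num_hosts, per_host_bytes_per_s):
    """Time for each host to read its equal share of a sharded checkpoint."""
    per_host_bytes = checkpoint_bytes / num_hosts
    return per_host_bytes / per_host_bytes_per_s


checkpoint = 3 * TB   # checkpoint size mentioned in the talk
bandwidth = 1 * GB    # ASSUMED per-host read bandwidth, bytes per second

for hosts in (100, 2):
    share_gb = checkpoint / hosts / GB
    minutes = load_time_seconds(checkpoint, hosts, bandwidth) / 60
    print(f"{hosts} hosts: {share_gb:.0f} GB per host, about {minutes:.1f} min")
```

With the same per-host link, going from 100 hosts to two multiplies the data each host must pull by 50, which is why the single storage-to-host path becomes the thing to optimise for fine-tuning.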
Demetrios [00:47:13]: But weren't you saying that the networking that you need for pre training is also it needs to be super fast or there you can have it be a little less.
Simon Karasik [00:47:25]: So when we do pre-training, we have many machines and we kind of split our checkpoints, in a way that, as I mentioned, every machine is loading and saving only a small part of the checkpoint. It's saving like 30 GB out of three terabytes. So for every particular machine, you don't need to save and load so much data. You just need to make sure that it works, and if you have some good, okay speed of the network, you will finish it. But now you have the same amount of data that you need to read from one host, and that's like 100 times more data that needs to be transferred over the same network, but over one network channel.
Simon Karasik [00:48:16]: And it's kind of a different design from the storage perspective, from the networking perspective. And it's interesting how complex the network is, because I remember when I was doing some experiments, the guys from the network team came to me and said, hey, what are you doing? We see our monitoring just exploded. And I said, oh, well, I'm testing this. And they were like, okay, let us tune our network so that it will be able to scale better to do this. And they asked me different questions, like, how does your traffic go? Does it go from here to there, or from here to there? Because these are different layers of the network, and they require different parts of the network to work. I'm like, I don't know. Let's discuss.
Demetrios [00:49:11]: That is fascinating. Networking is an art, 100%. You got to a point where you were like, all right, well, we can get some optimization here if I meet with the networking team and I see how our training... And was this in the training, or was this in the fine-tuning phase?
Simon Karasik [00:49:32]: It was in training when we were trying to checkpoint from 100 machines. It's a big load right at the same time, the same moment. So it was that moment.
Demetrios [00:49:44]: Good stress test. Yeah, yeah, yeah. I'm sure the first person to catch that was just like, what is going on here? Who is doing what? What are people doing here? We gotta go talk to Simon.
Simon Karasik [00:49:57]: Yeah, it was exactly that. And I was lucky because one of my colleagues who does networking sits in my office, right near me. So she came to me and said, like, hey, what are you doing?
Demetrios [00:50:11]: Yeah, we got to have a conversation. Come to my, come to my office. That's great. So those are the differences, big differences that we need to keep in mind when it comes to fine-tuning versus pre-training. What about when it comes to, like, traditional ML training with a scikit-learn model versus now when you're training LLMs? We've gone over how wildly different it is when it comes to checkpoints and sizes. But are there things that you want to highlight that you keep in mind now, or that you learned from making that jump from traditional ML and working with scikit-learn into more deep learning and LLMs?
Simon Karasik [00:50:56]: Maybe the first lesson I have is: unless you're good with scikit-learn, you don't need PyTorch. Unless you're good with PyTorch, you don't need super fancy deployment of models on eight GPUs, because it's not like, okay, let's just use a GPU. I feel it's a much bigger shift in complexity that you get. So unless you're good with something simpler, I believe you should use something simpler, and only then shift to something bigger. But it's interesting how it happened with large language models, because I talked to my friend and he told me, you know, previously we had some problem, but we didn't have data and we couldn't do anything. But with large language models, we can have zero data and still solve the problem, in a way that now many problems that he used to solve with scikit-learn could be solved just with prompting of GPT, maybe. Yeah, but it requires some extra complexity from the GPT side.
Demetrios [00:52:05]: So what you're saying is like there were use cases that we wanted to tackle but we didn't have the data for, and now we don't need the data because it's already in the model and we can just go and prompt it.
Simon Karasik [00:52:21]: Yeah, it's like super impressive how it changed, because technology that wasn't there like a year ago, two years ago, it came and it changed everything.
Demetrios [00:52:32]: But I do like this notion of keep it simple. It is so funny that we need to keep repeating it, because it's so easy to overcomplicate. It's just like we get to wear these medals of: all right, yeah, I used a thousand GPUs to train this model, or whatever it is, you know, the bigger-is-better type thing. But if you don't need to, then why do you do it? Right?
Simon Karasik [00:53:04]: Yeah, but you know, like, I guess just some people like to do cool things, they like to do hard things, and this is what motivates them to work. This is their driver. Okay. They can do it simple, but they don't like it, but they can do it complex and they enjoy it.
Demetrios [00:53:26]: Well, dude, this has been super fun to talk with you. I appreciate this. You gave me a sobering look at what it takes to actually train these models. And you are encountering some of the challenges, like day in and day out. And I appreciate you sharing these challenges with us because now, hopefully, anybody out there that wants to do this knows what it takes.
Simon Karasik [00:53:54]: Yeah, really. Thank you.
Demetrios [00:53:59]: Hold up. Wait a minute. We gotta talk real fast because I am so excited about the MLOps Community conference that is happening on June 25 in San Francisco. It is our first in-person conference ever. Honestly, I'm shaking in my boots because it's something that I've wanted to do for ages. We've been doing the online version of this, and hopefully I've gained enough of your trust for you to be able to say, I know when this guy has a conference, it's going to be quality. Funny enough, the whole theme is about AI quality. I teamed up with my buddy Mo at Kolena, who knows a thing or two about AI quality. And we are going to have some of the most impressive speakers that you could think of.
Demetrios [00:54:45]: I'm not going to list them all here because it would probably take the next two to five minutes, but just know we've got the CTO of Cruise coming to give a little keynote. We've got the CEO of You.com coming. We've got Chip, we've got Linus. We've got the whole crew that you would expect. And I am going to be doing all kinds of extracurricular activities that will be fun and maybe a little bit cringe. You may hear or see me playing the guitar. Just come. It's going to be an awesome time.
Demetrios [00:55:19]: Would love to have you there. And that is again, June 25th in San Francisco. See you all then.