MLOps Community

Fixing GPU Starvation in Large-Scale Distributed Training

Posted Apr 03, 2026
Tags: GPU Starvation, Uber ML, ML Infrastructure

Speakers

Kashish Mittal
Staff Software Engineer @ Uber

Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products, including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.

The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.


TRANSCRIPT

Kashish: [00:00:00] These chipsets now have disk. They have SSDs. What if we just cache it? We don't have to make remote calls. So the first time, in the first epoch, when you read the data, just store it locally on your GPU/CPU host, so that next time when you need that data, you have a cache. Just read it locally.

Kashish: You don't have to make any remote calls.

Demetrios: There's no shortage. Like, you're never gonna have a problem as a software engineer. The way that software engineering is going to look in a few years is gonna be completely different than it looks right now, or than it looked a year ago. Yeah. But yeah, I'm not buying into the whole "software engineers are cooked" narrative.

Kashish: Yeah, my director just flagged me. I'm the expensive engineer using Claude Code, [00:01:00] so my billing numbers are right at the top. It's fine. It's just a few dollars.

Demetrios: Yeah.

Kashish: Yes.

Demetrios: Yeah. Okay. Oh man. When folks start spending more on that than they spend on engineering resources, like humans.

Kashish: Oh, uh, that's too far off. I don't think we are near that.

Demetrios: That's when there are gonna be some real questions raised, like, should we be doing this with an agent, or should we just hire a human?

Kashish: Yeah, I think at this point we are in a discount phase. It's a honeymoon period. Let's see.

Kashish: If the actual pricing kicks in, I don't know how expensive it would be.

Demetrios: Yeah, totally. Well, dude, you're working at Uber right now on the ML infrastructure. You get to go pretty deep on the infrastructure. You get to play around with GPUs. Before that, you were at Google, you played around with YouTube ads, which, uh, are awesome.

Demetrios: Right? And then also the core search ranking, which I [00:02:00] think is like the golden goose of Google. Tell me a little bit about what you were doing back in those days and your journey so far.

Kashish: No, no. That's a good question. Thank you so much for having me here. So, yeah, for all of those who are listening, I'm Kashish. I work at Uber, leading the marketplace matching team, and most of my work is in the ML infrastructure domain, related to scalability and efficiency.

Kashish: Before that I was in YouTube ads. So I feel like the scale and the problem constraints are very different, working at Uber versus working at Google. Of course, TPU versus GPU, Nvidia, that also makes a difference here. But in the problem space I have been dealing with, whenever we want to scale the model, it's never the model which is the culprit.

Kashish: Like, many MLEs come and say: maybe the architecture is not efficient, maybe we should [00:03:00] change the hidden layers or something, or maybe we can have a different model architecture. It's never the problem. It's almost always in the infra,

Demetrios: Distilling it or quantizing it, that doesn't matter.

Kashish: It doesn't make a huge difference.

Kashish: When you look at the bigger picture, it's always a constraint on the infra, and most of the time what I've seen is the data I/O. I have so many stories from my time at Google where the data I/O is not just a software engineering problem. It is a problem at different stages. To give you two quick examples from those stories: one is a very standard problem.

Kashish: Anyone in the industry would say: hey, let's train a model which is a smaller model, not a global model, segmented on a small subset of the data. Now of course everyone uses Parquet. Everything is columnar. We know that. But it's not efficient to read a slice of data from those Parquet tables when you only care about a few rows.

Kashish: So it's [00:04:00] more like a scan where you read everything first and filter it out. It's too expensive. You're wasting a lot of time on the CPU. You're really, really wasting those CPUs. I feel these GPUs have become so precious, such a scarce resource, that if you are not able to feed them and you're starving the GPUs, you're not doing a good job.
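To make the cost concrete, here is a toy sketch (plain Python, invented numbers, not Uber's pipeline) of the difference between the "read everything, then filter" scan Kashish describes and pruning row groups with min/max statistics, the way columnar formats like Parquet allow:

```python
# Fake "file": 10 row groups of 1000 rows each, sorted by a key,
# with min/max stats per group, as Parquet footers store them.
row_groups = [
    {"min": g * 10, "max": g * 10 + 9, "rows": [g * 10 + (i % 10) for i in range(1000)]}
    for g in range(10)
]

def read_all_then_filter(target):
    """Naive scan: decode every row group, filter afterwards."""
    scanned, hits = 0, []
    for rg in row_groups:
        scanned += len(rg["rows"])            # every row is decoded
        hits += [r for r in rg["rows"] if r == target]
    return scanned, hits

def read_with_stats(target):
    """Stat-based pruning: decode only groups whose [min, max] can contain the key."""
    scanned, hits = 0, []
    for rg in row_groups:
        if not (rg["min"] <= target <= rg["max"]):
            continue                          # skip the group without decoding it
        scanned += len(rg["rows"])
        hits += [r for r in rg["rows"] if r == target]
    return scanned, hits

naive_scanned, naive_hits = read_all_then_filter(42)
pruned_scanned, pruned_hits = read_with_stats(42)
assert naive_hits == pruned_hits       # same answer
assert pruned_scanned == 1000          # one group decoded instead of ten
assert naive_scanned == 10000
```

The same answer comes back either way; the difference is the ten-fold CPU work spent decoding rows that are thrown away.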

Kashish: And it's not just about resources. I feel like velocity is another aspect, at least in these industries. You don't have an infinite amount of GPUs, so you cannot just throw more GPUs at it. Whatever your iteration speed is, you're gonna slow it down. You're gonna take a hit on your productivity, you're gonna take a hit on your landings.

Kashish: So it's not just resources, it is productivity as well.

Demetrios: And what I'm hearing you say there is really folks being sloppy about using the GPUs, and overcompensating with: yeah, hey, we can just use this compute. We're gonna read [00:05:00] the whole file, but eh, it's not that big of a deal 'cause we don't need to worry that much about it.

Demetrios: But when you're at an organization where everyone is fighting for every slice of the GPU

Kashish: Yes,

Demetrios: then you have to be very on your game about how you are using those, or how you're letting others use them. Are you like the gatekeeper to the GPUs?

Kashish: You can say that. Like, we want to ensure our utilization is at least above 80%, otherwise we are wasting the resources, because below that it looks super bad. And if everyone is reading everything and just throwing away the data, then it's more of an infrastructure constraint or problem.

Kashish: It's like you cannot expects to figure that out. They are, they have their hands in, uh, model architecture, the business objectives, how to optimize it, whether the labels are correct, like that's more of a machine learning problem, how the [00:06:00] attention is working. But do they have to worry about how the data is coming to the GPU?

Kashish: They should only have to specify what they need, so it's actually an infra problem more than a machine learning problem in some sense. So in this case, you can actually restructure the data the way it's written on the disk so that you can just batch-read. Let's say you know what the slices are based on the models; all you do is rewrite the entire data on the disk such that you can batch-read the slices.

Kashish: Otherwise you have to do point reads, and point reads are super costly. So that's what I was saying: it's a data I/O problem. But whenever MLEs say, let's have a bigger model, and they want to read everything in the world, they won't be able to scale. They'll blame their own model, but that's not the case.

Kashish: It's that we are not able to feed data fast enough to them.
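A minimal illustration of that rewrite-for-batch-reads idea (hypothetical slice keys and plain Python lists standing in for the storage layer, not the real system): group rows of a slice contiguously and keep an offset index, so reading a slice becomes one sequential read instead of many point lookups.

```python
from collections import defaultdict

# Rows tagged with the slice they belong to (e.g. a per-city segmented model).
rows = [("sf", 1), ("nyc", 2), ("sf", 3), ("la", 4), ("nyc", 5), ("sf", 6)]

# Rewrite pass: group rows by slice key so each slice is contiguous on "disk".
by_slice = defaultdict(list)
for key, payload in rows:
    by_slice[key].append((key, payload))

rewritten, index = [], {}
for key, slice_rows in by_slice.items():
    index[key] = (len(rewritten), len(slice_rows))   # (offset, length)
    rewritten.extend(slice_rows)

def read_slice(key):
    """One contiguous read per slice, driven by the offset index."""
    offset, length = index[key]
    return rewritten[offset:offset + length]

assert read_slice("sf") == [("sf", 1), ("sf", 3), ("sf", 6)]
assert read_slice("la") == [("la", 4)]
```

The write path pays a one-time re-layout cost so that every training run afterwards reads its slice in a single batch.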

Demetrios: Okay. So the model's there, and that's why you're saying you're starving the GPU. It's there, [00:07:00] but you can't feed it the data that the model needs.

Kashish: Yeah. And another case: the example I gave is more of a software engineering problem, okay?

Kashish: The way the data is written, the data structure, the schema. But let's take an example which is very common: recommendation models. All you care about is the query-item pairs, right? You have a query and you have an item. So when you look at the chipset, where the CPU host and the GPU host are connected, the CPU host has to move the data to the GPU host

Kashish: as quickly as possible, such that you can run the forward prop and get the predictions. Many a time you have thousands of items per query. The way you send it, and again, this is a function of the constraints you are working in, in this case you have duplicate query features for each item, and you're throwing them all in.

Kashish: This is a fat blob of data you're passing, and it's expensive. Sure, they are connected with, like, 700 terabytes of network or whatever, but this is still [00:08:00] slow. The easy way out is: don't duplicate. Pack, and unpack on the GPU. You can do that. Just have one query and all the items, and pass it just before passing it to the GPU for the forward prop.

Kashish: You can unflatten it and have the entire tensor to pass. This way you are saving a lot of time passing the data between the CPU and GPU host. So this is more of an optimization the team did on the hardware side. It applies to everything, to all ads, because everything is a recommendation.
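The pack/unpack trick above can be sketched with NumPy (shapes and sizes invented; a local array copy stands in for the CPU-to-GPU transfer, and the broadcast stands in for the device-side unflatten):

```python
import numpy as np

# Hypothetical shapes: one query with 8 features, 1000 candidate items with 16 each.
query = np.random.rand(8).astype(np.float32)          # sent once, not duplicated
items = np.random.rand(1000, 16).astype(np.float32)

# Naive payload: query features duplicated per item before the transfer.
naive = np.hstack([np.tile(query, (1000, 1)), items])            # 1000 x 24

# Packed payload: one query + the item matrix; "unpack" after the transfer,
# standing in for the unflatten done just before the forward pass.
packed_bytes = query.nbytes + items.nbytes
unpacked = np.hstack([np.broadcast_to(query, (1000, 8)), items])  # 1000 x 24

assert unpacked.shape == naive.shape
assert np.allclose(unpacked, naive)        # the GPU sees the same tensor
assert packed_bytes < naive.nbytes         # but far fewer bytes crossed the wire
```

With these numbers the packed transfer is roughly 1.5x smaller (8 + 1000×16 floats vs 1000×24); the more items per query, the closer the saving approaches the fraction of the row taken up by query features.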

Demetrios: Well, and are you messing around much with TensorRT?

Kashish: So in that case we had TPUs, so we were messing around with XLA.

Demetrios: Okay.

Kashish: Yeah.

Demetrios: It's a di Is that like a different tensor rt basically?

Kashish: Uh, yes. Yes. Yeah. It's like,

Demetrios: that's the TPU version of TensorRT.

Kashish: Yes.

Demetrios: But does that even matter, I guess, is the question, because you're saying the model isn't the [00:09:00] problem.

Kashish: Yeah. I feel like these kinds of bottlenecks exist everywhere in one form or another, and they are solvable. It's not a function of the model, right? If you're not passing the data but your utilization is low, it's not because your forward prop is expensive. GPUs are very good with math.

Kashish: Like matrix multiplication: they're doing what they're supposed to do. If we are not passing the data, they can't do their job. So that's what I was saying: these kinds of problems are independent of the chips you are using, but these data bottlenecks exist everywhere.

Demetrios: Yeah. And you're just not optimizing the data to be able to get consumed by the GPUs or the TPUs.

Demetrios: Like, you are not giving it the data in the right way. The whole part before it does the multiplication, before it does its GPU thing, is the hard part. You have to make sure the data is in the perfect form.

Kashish: Yeah. [00:10:00] You just need to get the tensor to the GPU one way or the other. The rest of the things are super fast.

Kashish: I have rarely seen otherwise. Sure, you can do quantization to optimize a lot of things, but the major jump here is getting the data there first. You should see good utilization if you are able to remove all your data bottlenecks.

Demetrios: Okay, so that was when you were at Google, Google and YouTube and all that fun stuff. You have since gone over to Uber, and there are new stories and new fun bottlenecks.

Kashish: Yes. Yeah, we have a blog coming on that as well. I think we're gonna publish it in a week or two, so Uber is gonna publish it. So yeah, let me give you the crux of the problem. Our models were struggling on A100 chips at 15-20% utilization. That really sounds bad.

Kashish: At 15-20% utilization you are using barely one fifth of the GPU, and

Demetrios: yeah, it

Kashish: sounds like a lot of

Demetrios: money.

Kashish: Oh yes. A lot of money, and a lot of time wasted just to train the model. [00:11:00] Yeah. So we were like, okay, maybe we are wrong. Again, I know I said the model is not the problem, but as a software engineer, you always try to remove the variables.

Kashish: So we said, okay, let's remove the variable of whether it's the model or the infra. The easiest way was: remove all the data I/O bottlenecks by just loading everything in RAM. Let's get a slice of data, put it in the RAM, so there's no read or write from disk. We are just reading from the RAM directly, then training the model. The verdict: 85% utilization. So we understood the model is correct.

Kashish: The model is not doing anything funky which is causing a hit on the GPU utilization. We are simply not feeding the data, the same problem we have seen before. So at that stage you're like, maybe there's an easy fix. Let's try to tweak some levers. Let's increase the number of threads on the CPU host.

Kashish: Maybe let's increase the parallelism, maybe add more [00:12:00] queue storage or whatever. We did try that. Nothing changed, of course. Maybe we get lucky in life and everything works and we just move forward, but that didn't happen. So the next step was: okay, how do you even profile this?

Kashish: Now we need to understand where the data bottleneck is. So we added tracing at different stages in Petastorm, like,

Demetrios: and that's the workload. You're trying to profile the workload to see where it's getting clogged up?

Kashish: Yes. So you can consider a very high-level view of Petastorm.

Kashish: There's a producer part and a consumer part. The producer is responsible for reading the data from some remote file system where the data is stored and putting it in a queue, in memory. The consumer is the one responsible for doing the slicing of the batches, converting them into tensors, and passing them to the GPU.

Kashish: So they are kind of working with each [00:13:00] other such that we always have the queue full, so we can keep feeding batches. The GPU is the fastest machine here, so the CPU needs to catch up and make sure we always have data to pass to the GPU.
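The producer/consumer split described here can be sketched in a few lines (illustrative names, not Petastorm's actual API): a producer thread keeps a bounded prefetch queue full while the consumer drains it, standing in for the tensor conversion and GPU copy.

```python
import queue
import threading

prefetch = queue.Queue(maxsize=4)   # bounded: producer stays ahead, but not unbounded
SENTINEL = None

def producer(n_batches):
    # Stands in for reading row groups from a remote file system.
    for i in range(n_batches):
        prefetch.put([i] * 8)       # a "batch" of rows
    prefetch.put(SENTINEL)

def consumer(out):
    # Stands in for slicing, tensor conversion, and the GPU copy.
    while True:
        batch = prefetch.get()
        if batch is SENTINEL:
            break
        out.append(sum(batch))      # pretend this is the forward-pass input

results = []
t = threading.Thread(target=producer, args=(10,))
t.start()
consumer(results)
t.join()
assert results == [i * 8 for i in range(10)]
```

The bounded `maxsize` is the lever Kashish mentions: if the producer cannot keep this queue non-empty, the consumer (the GPU) starves, which is exactly what their tracing later revealed.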

Demetrios: Uhhuh. So

Kashish: That's how I would describe it at a high level. That's how we were using Petastorm.

Demetrios: Petastorm is something you created, or it was already there and you guys were just maximizing or utilizing it?

Kashish: So Petastorm is Uber's open source library for distributed model training. It existed before, so,

Demetrios: okay.

Kashish: I think as the GPUs are getting more and more powerful, we started seeing this issue more recently.

Kashish: Because basically in this space, one machine, which is the GPU, is getting faster and faster. Then you see: okay, we are not catching up to it.

Demetrios: Yeah.

Kashish: If you think about the previous GPUs, maybe RTX, even before that, it might not be apparent. There's no bottleneck, because both things are operating at a similar speed, so you are not wasting anything.

Kashish: If [00:14:00] you go to B200, you are gonna waste a lot. They are very fast; it's like an eight-chip chipset. They're hungry. I feel like they're so hungry that they'll eat up whatever data you're passing. So that's what the problem was: the remote reading.

Demetrios: So it's not a question of, the problem isn't necessarily that you don't have the large data sets.

Kashish: No,

Demetrios: it's that, with the data sets you have, you are not able to just get them into the GPU to do their GPU thing.

Kashish: Yes, that's correct. You have the data, you have everything; you're just not able to feed it. That's the bigger problem.

Demetrios: Yeah.

Demetrios: Okay. So what happened next?

Kashish: Yeah, no, it's, it's interesting.

Kashish: So we added tracing on the producer side. We were like: sure, we are not able to read the data fast enough; maybe that's the problem. The network calls would be expensive, the ones we are making to some remote file [00:15:00] system where the data is hosted. Maybe we're not reading fast enough, and the queue, which is supposed to be full,

Kashish: is not. It's empty. And that's what we found after tracing and logging everywhere. Okay: we are not reading the data fast enough from the Parquet and loading it into the queue, and that's why the consumer can't do anything. It doesn't have data in the queue; how can it pass it to the GPU? So the utilization was basically very wobbly in some sense.

Kashish: As soon as we fetch the data, it just passes straight through to the GPU. The GPU consumes it, and the reader threads are still reading the data slowly and steadily. So, a very standard software engineering problem: why can't we read faster?

Demetrios: Yeah.

Kashish: What is the issue there? And again, the problem kind of escalates: data can be in different data centers, you can have a copy and all that. But the idea we thought about was: these chipsets now have disk. They have SSDs. [00:16:00] What if we just cache it? We don't have to make remote calls. So the first time, in the first epoch, when you read the data, just store it locally

Kashish: on your GPU/CPU host, so that next time when you need that data, you have a cache. Just read it locally. You don't have to make any remote calls.

Demetrios: and you don't have a problem with the amount of storage.

Kashish: No, that's a good question. So in our case, at least, the data is not gigantic.

Kashish: It's not petabytes. So this problem was solvable within these constraints. But even if you have more data, whatever amount, let's say 50 terabytes or something, you can at least cache part of it. That will still help the utilization, because you don't have to make calls for each and every row group.

Kashish: Like it solves the problem for us because our data was slow, but for anyone who has more data than that, it still solves a lot of problem with ~like ~getting [00:17:00] at least some data back to the GP. Yeah, so we implemented it. It was simple, it was nice. We have clo so it's not like a big deal, but this was like a despair moment for ~like ~a software engineer.

Kashish: We were very excited to launch it, and nothing changed. I was really, really disappointed. Wait, what? Nothing changed? The queue is full, but utilization didn't change. That was the moment you're looking at the software engineering playbook and just going, what the hell?

Kashish: What did we do wrong? But

Demetrios: like this is, yeah, the software engineering interview question was total bullshit.

Kashish: Yeah. Like, what did we do wrong? That's the basics, right? It's the magic trick you do, or the magic words you say, and it used to work. Why is it not working?

Kashish: Is something wrong somewhere? But remember, a few minutes back I said the first thing we did was removing the variables. So we had [00:18:00] the headroom of 85%. That was the one sparkly thing you could see: there is headroom up to 85%. So we revisited our steps: why did loading the data in memory work?

Kashish: If that works, then our solution should also work, because we also have the data sitting in the queue. It doesn't make sense. Why is one thing working and not the other?

Demetrios: Mm-hmm.

Kashish: And what we found was, it's a very funny story. Whenever we read the data from Parquet, it's all in the PyArrow format, because PyArrow is very, very optimized for Parquet.

Kashish: It just reads it very quickly.

Demetrios: What's PyArrow? It's Arrow,

Kashish: the Apache thing? Yeah.

Demetrios: Yeah,

Kashish: So it basically reads the columns very quickly, and it's like a database: you can do slicing and all those things. So it's very fast on Parquet files, and that's what someone must have implemented, which is the right thing to do.

Kashish: [00:19:00] But this is a language that GPU doesn't understand. GPU doesn't understand shit. What? Py Arrow said it. Carrie over mpa. It goes tenses. So this translation, so that

Demetrios: Y.

Kashish: Yes.

Demetrios: Oh,

Kashish: This PyArrow-to-NumPy translation is done on the fly when we read the data from the queue. And when we were doing everything in memory, everything was already transformed.

Kashish: So this small thing, the translation from PyArrow to NumPy, was eating up the entire headroom. It was more of a, how should I say, double bottleneck. First the data read was the bottleneck, but this is not a data read; this is a transformation between different data types. I mean, if GPUs understood Arrow, sure, that would have been the best thing, but they don't. They only need dense tensors to do anything.

Kashish: So the idea was, again, simple. As soon as you understand the problem, you're like: gotcha, I know what to do. We can just cache the transformed output. We really [00:20:00] don't need to do the transformation every time, right? Just cache it. The cache can always hold NumPy, so your queue is now a full-fledged queue which can directly feed into GPUs.

Kashish: So that was the idea. And it worked. We were able to come to 85%, and our training time reduced from a day to an hour or two. That's it, with the same set of resources.
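The second fix can be sketched as follows (a hedged toy: `to_numpy` is a stand-in for the real PyArrow-to-NumPy step, with a counter to show it runs once per row group rather than once per epoch):

```python
import numpy as np

transform_calls = 0

def to_numpy(columnar_batch):
    """Stand-in for the expensive PyArrow -> NumPy translation."""
    global transform_calls
    transform_calls += 1
    return np.asarray(columnar_batch, dtype=np.float32)

transformed_cache = {}

def get_batch(rg_id, columnar_batch):
    """Cache the *transformed* output, not the raw columnar bytes."""
    if rg_id not in transformed_cache:
        transformed_cache[rg_id] = to_numpy(columnar_batch)
    return transformed_cache[rg_id]   # ready to hand straight to the GPU copy

raw = [[1, 2], [3, 4]]
for _epoch in range(3):               # three epochs over the same row group
    batch = get_batch("rg-7", raw)

assert transform_calls == 1           # converted once, reused afterwards
assert batch.dtype == np.float32
```

The key design choice is what the cache holds: caching raw Arrow bytes (the first attempt) still pays the per-batch conversion on the hot path, while caching the converted NumPy arrays moves that cost off the consumer entirely.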

Demetrios: Dude, you're like Sherlock Holmes over here, trying to figure out what the problem is.

Kashish: Oh, I wish. We did it last year, in Q2.

Kashish: I wish we had more agents; things have changed a lot over the year. That time we didn't use them, or I would say agents were not that common a year back. But I was thinking,

Demetrios: do you think they would've found the problem faster?

Kashish: That's what I was thinking. Yeah. I mean, probably, I would say, if it's able to run some simulations. We don't have a simulator, but they can create a simulator, of course. [00:21:00]

Demetrios: Yeah.

Kashish: And run it. So I believe things would have been faster and different.

Demetrios: Yeah, I mean you got all that idle GPU time, you could just reallocate it to the agent simulators.

Kashish: Oh, yes. Yeah. And we were able to roll it out publicly now, so anyone who is using Petastorm can use it.

Kashish: So yes. So that was good.

Demetrios: Dude, that is so incredible. I gotta say, I was very invested in the story. I thought you were gonna give me another twist and turn, like, no, even that didn't work, and then we found another

Kashish: problem? I would've given up. I'm pretty sure at that point I'm like, no, this shit is not gonna work.

Kashish: Everything is broken. But no, it worked out. I was happy. Yes.

Demetrios: Oh wow. So then, what are some other war stories? Dude, you got me. I'm fully hooked on what you've [00:22:00] been going through here, just because I feel like anyone who has played around with GPUs and had to do anything with them, whether it's training a model or just trying to keep them running,

Demetrios: has war stories, and they have scar tissue from it. And so I know folks who have PTSD just when they see the, uh, errors.

Kashish: Oh, yes, yes.

Demetrios: "Oh no, not again." I swear.

Kashish: No, no, that's very fair. I mean, I have some experiences and stories, which are ongoing, from serving as well. Training and serving stories are very different.

Kashish: Training is more offline. You don't have to worry about a lot of things, and you have more breathing room in some sense. In serving, you are in a war between the two main variables for anyone. One is latency: how quickly do you want the results of a model's GPU inference? And the second [00:23:00] is efficiency, or utilization. Consider a case where you're very, very latency sensitive.

Kashish: You want to run the forward prop on the GPU as quickly as possible, but what if you don't have enough data to fill a full batch? What would you do? Either you run a padded batch, with fake data, and just run it to get the results. But that's not helping. Sure, you get the results,

Kashish: you are good on latency, but you're wasting the GPUs a lot. Or maybe you can wait long enough so that you have enough data to create a batch on the serving side, such that you can run a forward prop. But when you wait, you're adding latency. So I think it's a function of the use case.

Kashish: For example, in ads we are very latency sensitive. I mean, a millisecond matters. We really need to be as fast as possible. In that case, we are wasting resources, sure. [00:24:00] I mean, there are ways to improve it, but that's always a trade-off game.
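The latency/efficiency knob described here is essentially dynamic batching with a deadline: collect requests until the batch is full or a time budget expires, then pad the remainder. A hedged sketch (the batch size, timeout, and names are invented, not Uber's serving stack):

```python
import queue
import time

BATCH_SIZE = 4
MAX_WAIT_S = 0.05   # the latency budget: raise it and efficiency improves, latency worsens

def form_batch(requests: "queue.Queue"):
    """Return (batch, n_real): real requests plus zero-padding if the deadline hit."""
    deadline = time.monotonic() + MAX_WAIT_S
    batch = []
    while len(batch) < BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                      # latency budget spent: stop waiting
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    n_real = len(batch)
    batch += [0] * (BATCH_SIZE - n_real)   # padded slots are wasted GPU work
    return batch, n_real

q = queue.Queue()
for r in (10, 11):                     # only two requests arrive in time
    q.put(r)
batch, n_real = form_batch(q)
assert len(batch) == BATCH_SIZE
assert n_real == 2                     # half the batch is padding this round
```

Tuning `MAX_WAIT_S` per use case is exactly the trade-off in the conversation: ads-style latency-sensitive serving keeps it near zero and accepts padded (wasteful) batches; throughput-oriented serving lets it grow.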

Kashish: Like, how much can you improve and what can we do, blah, blah, blah, this. But are there

Demetrios: are on ads, on those use cases, are there not use cases where you don't even need GPUs or is it you always are gonna need those GPUs just because of the scale that you're at?

Kashish: Yes. So basically, on CPU, your cost would actually be more than GPU if you ran those models on CPU.

Kashish: Consider a case on CPUs: let's say you have eight threads, and on each thread you are running a forward prop on each training row sequentially. You can't run bigger batches on CPU. So the cost adds up with the number of instances you need for that QPS. With GPU, just because it has the capacity,

Kashish: you can have bigger batches and a smaller number of GPUs. The models were not [00:25:00] that complex that you couldn't use CPU for them, but the cost adds up. CPUs end up more costly than GPUs.

Demetrios: Yeah.

Kashish: That's why we were fine with the wasted batches while still being on GPU, rather than going to CPU.

Kashish: That's part of it.
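A back-of-envelope sketch of that CPU-vs-GPU serving arithmetic (every number here is invented for illustration; the structure, not the figures, is the point): CPUs handle one forward pass per thread, GPUs amortize one pass over a large batch.

```python
TARGET_QPS = 100_000

# Hypothetical CPU instance: 8 threads, 200 sequential predictions/sec/thread.
cpu_qps_per_instance = 8 * 200
cpu_cost_per_hour = 0.40                                  # made-up price
cpu_instances = -(-TARGET_QPS // cpu_qps_per_instance)    # ceiling division

# Hypothetical GPU instance: batches of 1024, 200 batches/sec.
gpu_qps_per_instance = 1024 * 200
gpu_cost_per_hour = 3.00                                  # made-up price
gpu_instances = -(-TARGET_QPS // gpu_qps_per_instance)

cpu_bill = cpu_instances * cpu_cost_per_hour   # many instances stack up
gpu_bill = gpu_instances * gpu_cost_per_hour   # one batched instance suffices
assert gpu_bill < cpu_bill
```

Under these assumed numbers, the fleet of CPU instances needed to match the QPS costs several times more per hour than a single batched GPU instance, which is why some padding waste on the GPU can still be the cheaper choice.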

Demetrios: Wow. Well, that's an interesting trade-off right there. But sorry, I distracted you. Get back to it.

Kashish: No, no, that's good actually. And there's also another aspect which is killing us now, or has killed us before as well, which is the I/O cost. Again, I told you, data I/O is everywhere.

Kashish: Now you need to get the features. If you have a bigger batch, let's say a batch of 10,000, you need 10,000 rows of, whatever, 200 features per training row, however many features the model has. How will you get that much data? If a client is responsible for that, you're gonna pay a huge serialization cost.

Kashish: If the client is not responsible, then you need to fetch from some Cassandra, some Redis, or [00:26:00] somewhere, and you're gonna bombard that just to create a batch. This is a problem: you'll have to pay additional cost just to get the data for that bigger batch. With offline, you have caching and everything assured.

Kashish: You can have a cache, but it's not like you can scale it for everything. You cannot. The online caching will be more costly.

Demetrios: Wait, tell. So tell me more about this.

Kashish: So, okay, let's consider a case where you have a model where each training row has a dimension of, like, 200 features. You have 200 features, okay?

Kashish: Now you are hosting the model on some GPU and you want a batch size of a thousand. So you would have a matrix of a thousand by 200. Who will pass that data? Either a client can do it, but then it's a very big request over the wire which you're gonna send to the GPU; you have to pay that cost. Or what you can do is send a minimal request to the GPU endpoint and let the [00:27:00] endpoint get the other data from some different services and gather it there.

Kashish: But again, that is gonna be costly. Someone has to pay that cost: either the client pays it or the GPU endpoint pays it. This will again block you from using the GPU, because unless you have the full matrix, you cannot run a forward prop. It's a similar problem, just in a different domain, as you can see.

Demetrios: Different flavor of this.

Kashish: Yeah, it's a different flavor.

Demetrios: Yeah.

Kashish: And it's a slightly more difficult problem, because these can be real-time features. Maybe you can't even cache them. These things can happen, based on the problem you're solving. Offline, it's all offline.

Kashish: On a fixed set of data, you can do more stuff. But online you always want the freshest and the recent data. Maybe it's user embedding, so you want the latest embedding, which is generated. So it the cost add,

Demetrios: It's when that [00:28:00] data is in flight, huh? Somebody's gotta pay.

Kashish: Yes.

Demetrios: So

Kashish: Yeah. People are doing whatever they're good at, but we need to give them the data somehow.

Demetrios: Yeah, yeah. Oh man. What is this idea of the reproducibility trade-off you were talking about?

Kashish: yeah, no, that's a good point. So basically, one other thing which we saw when we were working on this peton thing, was we improved the parallelism, we improved the utilization by a lot. But we started seeing degradation in the models and the degradation in the model was not because of, I say caching, but it was because we were improving the parallelism, increase in parallelism so much.

Kashish: there was no determinism in the order of the data which is fed. Basically, consider a case when you have a batch: the skewness in the labels affects the model quality. Because [00:29:00] you are not guaranteeing any order of data in the batches, you are hurting the model quality. One model run on the same data can give you some AUC, and run again on the same data it can have a better AUC or a worse AUC.

Kashish: And you're like, okay, which one is true? You don't know. And sometimes the differences are so huge you're like, did you actually improve the model magically without doing anything? It's not possible. So we were focusing more on the optimization and efficiency side, and the Michelangelo team is the Uber-wide ML infra team.

Kashish: Mm-hmm. So they started focusing on reproducibility. What they did was rewire the queues in Petastorm. Before, you had one master queue to get all the work done, and the workers were going crazy: hey, let's read the data, let's just put it in the queue. There's no order. The way they twisted it without affecting efficiency was having per-worker queues and dividing the work in a manner that's always deterministic.

Kashish: You're not starving any [00:30:00] worker, but you're ensuring each worker gets its tasks the same way every time on the same seed. And the workers, as their queues get ready, fill the master queue.
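The per-worker queue idea can be sketched roughly like this. This is a hypothetical simplification, not Petastorm's or Michelangelo's actual code: the work (say, row-group indices) is shuffled once with a fixed seed and dealt round-robin into per-worker queues, so the same seed always reproduces the same assignment and no worker starves.

```python
import random
from collections import deque

def build_worker_queues(num_rowgroups, num_workers, seed):
    """Deterministically shard work items across per-worker queues.

    The same (num_rowgroups, num_workers, seed) always produces the
    same assignment, so data order is reproducible run to run even
    with aggressive parallelism.
    """
    order = list(range(num_rowgroups))
    random.Random(seed).shuffle(order)  # seeded shuffle: reproducible
    queues = [deque() for _ in range(num_workers)]
    for i, item in enumerate(order):
        queues[i % num_workers].append(item)  # round-robin: no worker starves
    return queues
```

Each worker then drains its own queue and feeds completed work into the master queue, so throughput is preserved while the per-worker order stays fixed.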

Demetrios: And are you creating some kind of audit trail so that the workers know?

Kashish: We basically pre-allocate, or pre-decide, everything before we even start training, such that we know, okay, this is gonna happen for this run, and it's deterministic.

Kashish: So this way, even if you're going crazy with the parallelism, at least we're not hurting the model quality.

Demetrios: And when you're talking parallelism, how much parallelism are you talking, and is there a place where that starts to degrade? Like did you find, hey, after a certain point you start to see losses?

Kashish: I would say after a certain point it was not losses, just no gains. I think what the kernel does internally is, [00:31:00] even if you specify, let's say, 200 threads, it cannot allocate them. It'll just ignore it, or maybe it'll just round-robin within the limited capacity it has. So whatever you specify, if you go outside the constraint of what the CPU cores can give you, it's not gonna increase further.

Kashish: It's like you're telling the CPU cores and they're ignoring you. They'll do whatever they can at the max. You can reduce it, sure, it'll entertain you, but if you go above, it's like: you do whatever you want, I'm gonna do my best. So that's what we've seen. There's a good upper bound, but after that it doesn't matter.
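In other words, there's a hard ceiling: requesting more threads than the machine has cores buys nothing. A trivial illustration of clamping to that bound (a hypothetical helper, not the actual loader config):

```python
import os

def effective_workers(requested):
    """Clamp a requested worker/thread count to the host's core count.

    Asking for, say, 200 threads on a 16-core box adds no throughput;
    the kernel just time-slices them, so useful parallelism is bounded.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, available))
```

Reducing the count below the core count is honored; anything above it is effectively ignored.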

Demetrios: Wow. Okay, that's fascinating. So creating just a little bit more determinism and orchestrating it almost preemptively, saying, look, here's how it's gonna happen, and this is the recipe, so that every single time we do this it uses this recipe. That gives you the reproducibility.

Kashish: [00:32:00] Yeah. And this way, at least the models are getting the correct, I would say, quality every time. And if something should change, it'll change; otherwise it shouldn't.

Demetrios: Yeah.

Kashish: Yeah. Now,

Demetrios: You know what's so funny? A friend of mine was just posting in a WhatsApp group about, he bought, I think, a DGX Spark, and he wanted to run

Demetrios: a model locally. Can I read you this WhatsApp message? Because he's running into something that I think you are uniquely suited to figure out the answer to. And I was saying to him like, oh, you might wanna try llama.cpp, or you might wanna look at vLLM, but the thing is, after listening to you, I realize it's not the model that's his problem. So

Kashish: I'm [00:33:00] pretty sure. Yeah, go for it.

Demetrios: Yeah. Let's see, let's see. I'm gonna read this in real time, and then I'll give you a minute to figure it out. He says: I've just purchased the DGX Spark for my office. We ran Ollama on it, and it pulled Qwen 3, 59B and 27B, DeepSeek Coder, and also GPT-OSS. I don't recall how many params, but I think it's the 120B version.

Kashish: Mm-hmm. Mm-hmm.

Demetrios: Now it's exposed to the local network, so anyone in the office can use Ollama or OpenAI API endpoints with those models. Okay. I created a Docker with Claude Code and I tried to run it. I use env variables when running the Docker to point to the model that I want. I tried most of the models.

Demetrios: It's very, very slow. I mean, just the init took like 10 [00:34:00] minutes, and I had one shell script in my repo. I ran the ollama run command and tried the model. The response was fast, so my only conclusion is either the network is too slow (it's one gigabit), or Claude Code is really slow. Anyone have advice?

Kashish: I think that's a really interesting one.

Kashish: In my opinion it would be the network, I would say. Because all these models also need to fetch more tokens or more data, right? From somewhere. For example, if they're using some MCP for searching some docs, or some of the MCP servers, how are they gonna reach it? They have to go over the network.

Kashish: If your network is not fast, they're gonna just get stuck on it. It would be super slow. For example, even if I'm running Claude Code locally on the machine, it still communicates with other MCP servers. At the end you need to [00:35:00] get the data to the model so that it can make a decision.

Kashish: And if you're slow with that, it could be this. Although I'm surprised about the init part; I think he mentioned even the init is slow, which is a little surprising. It shouldn't call anyone, in theory. It's just a start prompt.

Demetrios: Yeah.

Kashish: Right.

Demetrios: Yeah. It says init took like 10 minutes, and I had one shell script in my repo.

Kashish: So there is a concept of warm-starting these models when you run them for inference, as opposed to training. I'm not sure if that is something that was missed in this case. The warm-start concept is, I'm also not an expert in that, but my understanding is: whenever you want to run inference on these GPUs, you need to send some QPS beforehand such that the memory is warm-started.

Kashish: And if you don't do that, they would be super, super slow. So [00:36:00] a very good test here: just bombard it with queries for a few minutes and see if it gets faster as the GPU hosting the models warm-starts, because it'll load a lot of things into memory, and the next time you do it, it'll be fast.

Kashish: So even for the serving GPUs we have, we maintain a QPS, a fake or test QPS, on those GPUs, such that they're always warm, with fake data. And whenever you have the actual production request, you're not starting from a cold start. So that could actually be it, because the init is also failing.

Kashish: So maybe the GPU is just not warm-started.

Demetrios: So you fake it. The warm start is basically you faking it with a bunch of synthetic data, just throwing it at it so that it never fully shuts down.

Kashish: Yeah, that's correct. They are running the model, but you're not using the predictions. It's just so you are loading everything.

Kashish: It's more like you're [00:37:00] warming the cache or memory in some sense.
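The warm-up itself can be as simple as a loop firing throwaway requests. This is a hypothetical sketch (the predict_fn, the fake batch, and the request count are all made up): predictions are discarded, and only the side effect of loading weights and warming caches and memory on the serving GPU matters.

```python
def warm_start(predict_fn, fake_batch, num_requests=50):
    """Fire synthetic requests at a model endpoint before real traffic.

    The predictions are thrown away; the point is the side effect of
    loading weights and warming caches/memory on the serving hardware.
    """
    for _ in range(num_requests):
        predict_fn(fake_batch)  # result intentionally ignored

# Example with a stand-in predict function that just records calls:
calls = []
warm_start(lambda batch: calls.append(len(batch)), fake_batch=[0.0] * 8, num_requests=5)
```

In production this would be a background job that keeps a steady trickle of test QPS flowing, so the first real request never lands on a cold model.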

Demetrios: Uh-huh. And once again, the key here is the answer is not "use a smaller model."

Kashish: Oh, no, I don't think so. Actually, the smaller model will waste the GPU capacity. I feel like you can use a bigger model; if the GPUs have capacity, then you should use a bigger model.

Demetrios: Exactly, because I was sitting there and I was like, wow, you might wanna try a smaller model, maybe, I don't know.

Kashish: I mean, that could be a very good test in this case, but I don't think it'll work. We can ask them to try it out, and that would give a definite answer.

Kashish: Again, it's a detective-hat kind of thing. You put on a detective hat and you're like, okay, let's try out different variations. Maybe everything boils down to the same thing.

Demetrios: Yeah. You're just testing things. Yes. Testing hypotheses.

Kashish: Yes.

Demetrios: You're that scientist in the lab and you've got your [00:38:00] experiment.

Demetrios: Yes. And it's like, is this gonna work? Maybe,

Kashish: yeah,

Demetrios: let's try it.

Kashish: Yes. And at the end when you find it, it's like, voila, it's done. Nice. It's a very good feeling, I would say. Yes.

Demetrios: Yeah, that's that dopamine release right there. When you actually find it, you fix it, and then it actually works. Because you thought you found it many times, but it was that false summit trying to get to the top of the mountain.

Kashish: Yeah, yeah. No, I agree. The final catch, the "got it," is the thing, when it's actually fixed. Until then you're in an illusion: no, maybe it doesn't work.

Demetrios: Yes. Well, dude, what else are you thinking about these days?

Kashish: For me, okay, we're in a world where everyone is building agents, so I'm trying to play around with skills these days.

Kashish: A lot, to see what skills are out there. Actually, I was following the [00:39:00] recent post you had about the playground skill. I was thinking to download it and try it out. I feel like at some point in the future you'd have a skill for a junior software engineer, or maybe a senior software engineer, which you can download and it has everything.

Kashish: Then, yeah, maybe we might be out of work. But these days I'm trying these out, although I haven't tried OpenClaw, which I was thinking about. The thing is, I have a work laptop; I can't try it on that.

Demetrios: Yeah.

Kashish: Have you tried it?

Demetrios: I just am not confident enough in my security chops to try it.

Kashish: Yeah.

Kashish: It's a good idea, although I don't even know what I'd want to ask that OpenClaw agent to do. I don't have things like that. Like, "Hey, book me a flight." But no, I want to book it myself, because I want to have options. I don't know.

Demetrios: Yeah, yeah,

Kashish: exactly. That is good. ~All the security something. Yeah, ~

Demetrios: I saw a meme about OpenClaw, and it was like, oh my gosh.

Demetrios: "This is the future. I had my Clawdbot," back in those days when it was called that, "go and book me a dinner restaurant, or," I can't put these words together, "book me a reservation at a restaurant for dinner." And it came back, and then they looked at their API bill, and it was like, it only cost me $80.

Demetrios: So something where you could have just picked up the phone and called the restaurant to reserve the table, and it would've taken you five minutes max. Now you told OpenClaw to do that, and it cost you 80 bucks. Wow, good job. I don't know if you're winning here, but I guess that's cool in the name of progress.

Kashish: Yeah, I think slowly and steadily it might get cheaper, hopefully. But I feel like it could be the other way around. Let's see, given everyone is consuming those, almost not even consuming, I'm missing a word here, it's like all the GPUs are getting erased from the market.


Demetrios: What about you? I mean, going back to skills, [00:40:00] it's a

Kashish: yeah.

Demetrios: really interesting one. Leo just posted in our coding agent conference, or coding agent channel, in Slack a super cool skill. Actually, no, sorry, I misspoke. David posted about this skill; it was created by a few psychologists.

Kashish: Oh,

Demetrios: and it's a way for you to have the option, after you have a coding session, for the agent to come and create a learning session with you to update you on everything it's done. And then you have this learning session where it's like, hey, here's what I changed, here's how I changed it.

Demetrios: And it does it in a way that helps you learn about the updates at the level of the code base, but also if it introduced new things that [00:41:00] you may not know about. Like, I don't know about React. Okay, cool. Well, I implemented React, and here's how I did it, and here's why this matters. So it's kind of bringing you along on the journey as you are coding.

Kashish: Oh, that's so nice. So basically, if I had used that skill when we were deploying this, then I could ask that agent to debug further, or maybe improve efficiency, or something like that. It can be a debugging skill as well, right? That could also be translated to that, because it already had the session.

Demetrios: Yeah, and I think the key here is that it's helping you learn everything it did. It's giving you a minute to step back. Instead of just doing work in the code base, [00:42:00] changing code or submitting a PR, it says: I'm going to create a learning journey for you, so that I can teach you everything I just did.

Kashish: So translating vibe coding into learning? Yeah.

Demetrios: In a way. Or at least, the whole thing that I thought was cool about this is that it allows you to stay up to date on everything that's changing. You know how quickly code bases can get away from you, and then later on you're trying to debug something and you're like, oh, I don't even know what the hell is going on here.

Demetrios: Actually, I gotta really drop in and understand

Kashish: Yes.

Demetrios: what this code is saying. It's to combat that, so that you're not so far out of the loop. Now, whether or not you want to do this on every single coding session or every single project, [00:43:00] that's up to you.

Kashish: Yeah, I loved it. It's good, actually, especially when you're exploring a new code base. I think that would be even more beneficial.

Kashish: It teaches the journey: okay, what did we do there? I don't even know this code, but I know what I want to do. So how do we get there? That totally makes sense.

Demetrios: Exactly. So I've seen a few skills like that, where it's almost like, reflect back and help teach me.

Demetrios: Those are cool as a bucket of skills. The other skills I've been seeing are what I was mentioning to you about. It's not necessarily a skill, it's more just, rules is a strong word for it, but ways to help set up workflows. There are very common things. And we had Rob on here, as I was [00:44:00] mentioning, and Rob was talking about how, with certain things he does a lot,

Kashish: Mm-hmm.

Demetrios: he just set up one-click buttons so that he has a little command center. He doesn't even have to tell the agent to invoke a skill. He just clicks a button, or a hotkey, and boom, the agent loads up that skill and knows exactly what to do.

Kashish: Oh man, this world is changing so quickly. I feel like every other week it's something interesting.

Demetrios: Yeah,

Kashish: Yeah, this is actually smart, because many times you repeatedly do the same thing. For example, building, testing, these are simple, but let's say some kind of workflow setup, or maybe look at X, Y, Z things.

Kashish: Yeah. Which you can actually make part of the workflow as a skill.

Demetrios: Exactly.

Kashish: Which is, yeah. Nice.

Demetrios: Yeah. And the other one I've been using a lot, and I guess this is very much [00:45:00] on the user-experience side of working with these coding agents, is that 99% of the time I'll use a dictation tool instead of actually typing. I hardly type anymore.

Kashish: Oh. Really?

Demetrios: Yeah, dude. Especially when I know what I wanna say. It's a little harder when you don't have your thoughts clear and you're trying to dictate on the fly, like for an email or something. But when it's, I know what I wanna say, I just, boom, use it.

Demetrios: I have a hotkey, and it automatically starts up my microphone and I'm talking.

Kashish: Right. I never used it. I never tried it. Nice.

Demetrios: Man, that is a game changer, because normally you have in your head what you want done before you start to [00:46:00] ask Claude Code to do it for you. So it's a lot easier to go back and forth with it by dictating. And someone at the coding agents conference, during our hot-take session, said that

Demetrios: One of their biggest hacks is they always use voice mode to just brain dump at the beginning of a session, and they'll brain dump everything in voice mode because it is a way to just get all of that context out of you.

Kashish: I see. No, actually that makes sense, because when you write, you rewrite.

Kashish: You kind of try to clarify answers as you are writing the prompt and

Demetrios: it's slow.

Kashish: Yes. Tiring.

Demetrios: Yeah.

Kashish: If it's a big prompt, you need to write the entire design you have in mind into a few sentences, but still it's tiring and you miss things.

Demetrios: but now,

Kashish: yeah,

Demetrios: you don't need to, if you're just basically word-[00:47:00]vomiting at the beginning of the session and making sure it's clear: hey, this is everything

Demetrios: for context. And you can say it five different ways; it doesn't really matter, because no amount of talking is going to make a dent in the context window unless you're

Kashish: yeah.

Demetrios: giving a sermon or something, but I don't know how many people are doing that at the beginning of their coding sessions.

Kashish: I don't think so. Yeah, that would make some difference, but I don't know how much, maybe it ignores it, but let's see.

Demetrios: But

Kashish: Yeah, that's fair. I'll try that. Thanks for the tip. That just makes sense.

Demetrios: Yeah. These are fun ones, and

Kashish: yeah,

Demetrios: it sadly doesn't make me any better at

Demetrios: coding. I still am totally encountering the major flaws. But [00:48:00] one thing we started doing, I may have told you this, is on Fridays

Kashish: Yeah.

Demetrios: we're having these coding-agent lunch-and-learn sessions. And those have been super helpful, because you're just getting in a virtual room with folks who are also

Demetrios: bumping up against walls or finding little tricks that work, like this one: oh hey, at the beginning of the session you should always brain dump with voice mode. It's like, oh, that's a really easy way to do it, and I don't need to change much about my workflow to make that happen.

Demetrios: Yeah,

Kashish: Nice. But I'm curious, you said it's not helping you much in the coding.

Kashish: You're still hitting walls. What do you mean by that?

Demetrios: Because I'll still get to a point where I'm like, why is this not working? And then I go try to click through the code base and I'm like, I don't actually know what the hell is going on here. And now I [00:49:00] really gotta debug.

Kashish: Oh yeah. Sometimes when it goes crazy, it's horrible.

Kashish: I sometimes feel like, why did you even do that?

Demetrios: Yeah.

Kashish: It's just such a simple thing, and I have to undo everything to start from scratch.

Demetrios: Well, I think the difference between you and me is that you have an opinion about how it should be done.

Kashish: Yes.

Demetrios: And you know that like, why did you even do that?

Demetrios: And you're like, because you should have done it this way. But when I'm asking, why did you even do that, I don't have any idea what the other, better way should have been. So it's kind of like, why did you do that? What were you trying to do there?

Kashish: okay. I see your point. Yeah. I mean, it happens to me.

Kashish: So if I'm touching something like, I'm not familiar with the Go language. If it's changing Go, I really don't have any clue. Just make the change, make it work. That's the end goal I have. I'll figure it out later. First let's make it work. And sometimes I have to give a scolding [00:50:00] to this Claude guy.

Kashish: Like, what are you doing? This is not working, right?

Demetrios: I don't know how much the scolding does.

Kashish: Yeah.

Demetrios: I've tried so many times to tell it to read the docs. It doesn't listen.

Kashish: I think sometimes I've seen they kind of cut the context for the docs. They don't read the entire thing.

Kashish: They only read a few things, then just go crazy.

Demetrios: Yeah, they think they know best. And that's another trick Rob was telling me yesterday: on all of his files in a code base, he makes sure that one of the things the agents do is always comment a TL;DR of what's happening in that file.

Kashish: Oh.

Demetrios: 'Cause by doing that, the agents are much more likely to be able to search and understand what is where, what's working, and what each file does.
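The file-level TL;DR convention might look like a one-line comment at the top of each file that an agent (or a plain script) can grep for. A hypothetical sketch of both sides, the convention and a reader for it:

```python
TLDR_PREFIX = "# TL;DR:"

def extract_tldr(source_text):
    """Return the first TL;DR comment in a file's text, or '' if absent."""
    for line in source_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(TLDR_PREFIX):
            return stripped[len(TLDR_PREFIX):].strip()
    return ""

# A made-up example file following the convention:
example_file = """\
# TL;DR: Converts raw ride events into hourly demand features.
def build_features(events):
    ...
"""
summary = extract_tldr(example_file)
```

An agent scanning a repo can then build a map of the code base from these one-liners instead of reading every file in full.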

Kashish: [00:51:00] That's actually a good idea. A summary everywhere, something they're not going to produce themselves; they're not gonna summarize the doc on their own.

Kashish: But if you're already providing it, it's proper context for the doc. That's a good idea. I should ask people to do it here. We've been using it so often now that it should help.

Demetrios: Yeah. And he also said every folder has a README.

Kashish: Oh.

Demetrios: For the same reason.

Kashish: Okay. Makes sense.

Demetrios: Yeah. You wanna give it that context, man?

Kashish: Yeah, sometimes I have to tell it: think five times before you reply. I've seen in the prompt it's thinking one, two, three, rethinking everything. Don't give me a reply before you've thought five times. Sometimes when I'm irritated: you have to think five times.

Kashish: Like I'm scolding a child or something.

Demetrios: Yeah, that's exactly what I was thinking. It's like

Kashish: Go to your room, count to five before you answer. So it does that. And maybe, [00:52:00] I've seen it giving better answers. I don't know if it's a hack or something, but it works.

Demetrios: Yeah. Well, it's like that ultrathink, or the planning mode.

Demetrios: Um, I've heard folks say that they stopped using planning mode.

Kashish: Oh.

Demetrios: And this, this was, I wanna start

Kashish: that from that. Yeah.

Demetrios: Yeah. Basically what you're telling it is: reason harder. Make sure you're fully reasoning before you come back to me with anything.

Kashish: Yeah. Don't waste my time.

Kashish: The resources aren't cheap.

Demetrios: Yeah.
