Building Out GPU Clouds
SPEAKERS

Mohan is a seasoned and innovative product leader currently serving as the Chief Product Officer at Rafay Systems. He has led multi-site teams and driven product strategy at companies like Okta, Neustar, and McAfee.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Demetrios and Mohan Atreya break down the GPU madness behind AI — from supply headaches and sky-high prices to the rise of nimble GPU clouds trying to outsmart the giants. They cover power-hungry hardware, failed experiments, and how new cloud models are shaking things up with smarter provisioning, tokenized access, and a whole lotta hustle. It's a wild ride through the guts of AI infrastructure — fun, fast, and full of sparks!
TRANSCRIPT
Demetrios [00:00:01]: We're firing on all cylinders. GPUs are hard. You've been dealing with them a lot. Why are they so hard?
Mohan Atreya [00:00:11]: The first is they're hard to get hold of. Number one, the ones you want may not be available. And usually, if you look back, you historically only had two options. Your company probably would have purchased a bunch of GPUs, put them in a data center, racked the hardware, and set things up. And if the company is large, the enterprise probably has a good number of GPUs that it has set up and made available to you. Or, in the last 15 years or so, you had folks like AWS, Azure, Google and the like providing GPUs for rent. Essentially, these were the only two games in town.
Mohan Atreya [00:00:59]: And the problem that I think people run into is, for many tasks you need the exact kind of GPU to do the thing optimally. And you may not get hold of that, or it might be too expensive. Then you're kind of stuck between a rock and a hard place.
Demetrios [00:01:16]: Especially if your company has okayed a lot of spend on AWS, and AWS either doesn't have the capacity you're looking for, or...
Mohan Atreya [00:01:25]: I see this a lot when I talk to some customers. There are actually two things they'll end up saying. One is, the cloud doesn't understand what I exactly need, because it operates in a standards-based world. So if what you need is outside the known bubble for them, it's a lot more work for you to convince them and say, hey, I need that, even on a cloud like AWS. The second problem they run into is they really don't know if that thing is going to work for them or not. A lot of stuff in AI/ML is experimentation, because you learn about it and then you have to try it. And it's too hard to get access to something before you can try it, because you may conclude it's not the thing you need. And then some of these clouds also make it so hard for you to get access, like requiring a one-year commit.
Demetrios [00:02:16]: So true. If you want to do something without reserved pricing, the pricing is astronomical.
Mohan Atreya [00:02:21]: Exactly, exactly. And that gets in the way of doing business, right? None of that friction exists with general-purpose compute, but we struggle with it in AI infrastructure today.
Demetrios [00:02:32]: Yeah, well, you did mention there are kind of these two paths: you either buy the hardware and set up your own cloud, your own GPU cloud we could say, or you go to the cloud providers. In the recent, I would say two or three years, we've had the explosion of the Modals and the Lambda Labs and the Basetens, all of those folks.
Mohan Atreya [00:02:52]: Correct.
Demetrios [00:02:52]: Have come out of nowhere or seemingly nowhere. And there's a lot of demand for that type of thing too.
Mohan Atreya [00:02:59]: Yeah, and that, we believe, will be a game changer for the market. And like you said correctly, in the last two years these things have cropped up, and there's a large number of providers who are looking to set this up. In fact, those are the kind of entities we work with now. It's hard for everyone. And for them, the problem is, they do the hard work of finding the real estate and the data center power, because these GPUs are power hungry. Then you have to buy the GPUs and rack them, and they do that. Then they start looking at, okay, I got a bunch of GPUs, how do I convert this into a GPU cloud? And a lot of them have hardware chops and data center chops, but not software chops.
Mohan Atreya [00:03:48]: And running a service with an AWS kind of experience is not easy. So that gap is what we fill, because they need to go to market fast. They are probably paying interest on the GPUs, and if the GPUs are idling and they have no customers, they're burning money the whole time.
Demetrios [00:04:10]: Yeah, if you're a CoreWeave, they've done some very interesting financial engineering to make those GPUs worth the while. And they want to be saturating those GPUs 100% of the time.
Mohan Atreya [00:04:21]: Correct, correct. Now, they are also a little lucky, in my opinion, because if you look at their 10-K, I mean, the S-1 that they published before they went public, about 60, 70% of the revenue comes from, I think, one customer, Microsoft. So effectively they got lucky, because they had one big customer who could fund the whole thing. But if you are a new GPU cloud, you may not be that lucky, because there are only so many companies doing foundation models that need that kind of scale. Microsoft, through OpenAI, I guess, was funding CoreWeave's investment in that. But what about the rest? The kind of GPU clouds we work with seem to sell to enterprises or universities. Let me just give an example. This was a university that said, hey, we want to launch an AI/ML lab.
Mohan Atreya [00:05:17]: We want to train the next generation of data scientists. The thing is, to do that well, you need GPUs. The university doesn't have anything, and it's too hard to set up, but they want to launch the course now. So they are working with these new GPU clouds, the neoclouds, and these guys are giving them experiences like, hey, I'll give you a notebook as a service, I'll give you Ray as a service, I'll give you Kubeflow as a service. Of course, underneath the covers they all have GPUs. And I think this is how the next generation of data scientists will be trained, because when they graduate, they need to have practical experience with the real things.
Demetrios [00:05:59]: Yeah. And a Colab where you get a free 300 bucks a month, or rather 300 bucks until you burn through that 300 bucks. That's not enough, correct?
Mohan Atreya [00:06:07]: Exactly. And then this all also has to be structured, right? Like in a university setting: how do I give you a degree? How do I give you a lab where you can practice stuff? So these are the kinds of examples we see. And the neoclouds, the GPU clouds, are enabling, in my opinion, the next wave of people who will hit the market fully trained on how to do AI/ML at scale.
Demetrios [00:06:32]: So you were talking about how basically you've got everything from the very low end of the infrastructure stack to the frameworks, the PyTorches of the world, and everything in between. You've been playing in that dimension. What are some things, I imagine over the last two years, that you've seen as far as gotchas?
Mohan Atreya [00:06:57]: Yeah, let's talk about a few things. Of course it's a complex stack, a lot of stuff can go wrong. But let's talk about things that are interesting. So when you buy a computer, you kind of expect the CPU to last a long time. Now, we're not talking about a single GPU here, we're talking about multiple GPUs connected together, working well together. And when you have that many moving parts in a complex system, failures are possible. In fact, Meta published an interesting stat in their Llama white paper.
Mohan Atreya [00:07:37]: I think when they launched Llama 3, I believe it's about a year old at this time, I think they saw about a 30% failure rate. You don't expect that on a computer; you never see anything like that. This kind of speaks to the complexity of the environment. Now the question is, if you are in the middle of a training run and you have a GPU failure, what do you do? Do you abandon the training run that has been running for two weeks? That would be bad. So how do you replace capacity, literally a swap underneath the covers, where you can still deliver an SLA?
Mohan Atreya [00:08:16]: That's amazing. So I think as a GPU cloud, as any provider, whether you're an enterprise or a wannabe GPU cloud, you gotta take care of these things because your customers expect that. Because they just want to do a training run.
Demetrios [00:08:30]: Yeah. And trying to tell somebody, expect problems 30% of the time, shit's gonna hit the fan.
Mohan Atreya [00:08:37]: Exactly.
Demetrios [00:08:37]: That's a really hard sell.
Mohan Atreya [00:08:38]: Yeah. And interestingly, one of our advisors worked on Llama, so he gave us some inside stories on how difficult it was as an end user within Meta. Remember, Meta is big, they have enormous internal infrastructure, but all the teams don't get access to it at the same time. You get two weeks, and in those two weeks you do your stuff. You better.
Demetrios [00:09:05]: Do everything you can in the next two weeks.
Mohan Atreya [00:09:08]: And a lot of times these teams just have a hypothesis, right? They have a hypothesis that, I want to run this, and if it doesn't work, then they have to wait in line again. This is pretty frustrating as a user, because you don't know; you have a general sense of, yeah, I think I should try this, this may work. And I think that's a problem in general with AI/ML: you start with a hypothesis, it's not deterministic. And depending on what you find, you tune your hypothesis a little more. This is difficult because the data is informing you what is possible. It's not like a formula where you can say, if I do this, this magical stuff will happen.
Mohan Atreya [00:09:49]: So this is the day in the life of an end user. And if you bolt a 30% failure rate onto that and say, too bad, your two weeks are up, I mean, that's terrible, right?
Demetrios [00:10:00]: And these failures are coming from just GPUs being finicky.
Mohan Atreya [00:10:04]: That, and the connectivity issues. Because I think the thing that maybe a lot of us forget is it's actually a connected GPU system. You may have, like, 100 GPUs that have to work together, interconnected. And this connectivity is where we play. If I have eight GPUs in the same node, how can they talk to each other fast? If I have two servers, two nodes, how does internode connectivity work? What if a switch dies in between? What happens to that connection? How do you have resiliency built in there? How do you detect and reroute without causing an issue upstream to the training job that's going on? And then the thermals are a crazy problem.
Mohan Atreya [00:10:58]: Right. Like heating.
Demetrios [00:11:00]: I remember reading one of the blogs from Meta again on how they would optimize their GPU clusters. And one thing they said was, what about updating? We have these 24,000 GPUs. Basically at any given moment we have to be updating the GPUs constantly. And so making sure that you're updating in the most efficient way possible.
Mohan Atreya [00:11:25]: Correct.
Demetrios [00:11:26]: And the people that are using those GPUs, if they had a two-week-long job running and now it's like, oh, we need to update this, you have to figure out how you're going to kick them off of those GPUs, or just swap it out like you said, under the covers, so they don't know that anything went differently.
Mohan Atreya [00:11:43]: Correct.
Demetrios [00:11:44]: It's just that you, as the engineer working on the clusters, understand, okay, we swapped this out and now we're updating these GPUs.
Mohan Atreya [00:11:53]: Correct. Or it could be something as simple as swapping out an SSD, swapping out memory. So not just the GPUs, but the stuff around the GPUs. Stuff happens. Yeah.
Demetrios [00:12:04]: Getting back to the gotchas: what are the things that, in your eyes, as you've been playing with it and seeing it, are the very common ones that will get people into trouble?
Mohan Atreya [00:12:17]: Yeah. So there are steady-state issues, but let's talk about what happens before that. I'll talk about what I heard from some neoclouds about how they service certain big customers; when some of them described how they onboard them, I literally fell off my chair. It is good to know, because this is the state of the art in many neoclouds, and you can actually see this. You go to some GPU cloud that exists today and you say, hey, why can't I simply sign up, log in, put in my card, and use it? Why are you asking me to submit a form so you can call me in a day or two? Is it because you don't have enough capacity? Remember, the data center provider here has a bunch of GPUs sitting in one or two data centers. What we found is they don't even have a good sense of inventory. They don't even know what's out there, really.
Mohan Atreya [00:13:21]: So these are the curveballs we had to encounter and say, hey, we can help you solve these problems. When I say inventory, I'm not talking about just GPUs: which server is it on? What kind of memory does it have?
Demetrios [00:13:33]: How is it connected? Is it connected with InfiniBand, or is it some random one?
Mohan Atreya [00:13:37]: What's its MAC address? Yeah. And we were surprised by things like firewall policies, because there are multiple teams and all of that. Many of them manage these in Google Sheets or Excel spreadsheets. That's the state of the art at many of these companies. So no wonder when you submit a form and someone negotiates the GPUs you're buying, it takes two weeks to do all of.
Demetrios [00:14:03]: These things, actually provision you those GPUs.
Mohan Atreya [00:14:05]: Correct. And 10 teams have to do their stuff and update the Excel spreadsheet. I mean, this is not good, right? So in our platform we said, hey, why not solve this problem? What if there's a single source of truth where you maintain this inventory? But we don't want to be simply a replacement for an Excel spreadsheet. The inventory is important because the system can then dynamically carve out capacity from the infrastructure and allocate it to a tenant. So that's one piece. Now, the next thing is, it's not just the GPU allocation. You also have to create what the industry calls a TAN, a tenant access network. So what happens is, let me take an analogy.
Mohan Atreya [00:15:02]: When you go to a hotel, there's a room. Imagine the room can auto-expand and shrink based on what you need. How do you automatically put up the walls? How do you put the right things in the room that you need? All of that is done manually today, and what we do is automate it. If I say I need 100 GPUs, I set up the east-west network; east-west is the InfiniBand. Then the north-south, because I need a storage network, a high-speed storage network where I store my models, my data, et cetera. And my east-west network cannot be interrupted, because if it is, latency kills you. Then, how do I let people access that network? And how do I manage multiple tenants on my one single pool of infrastructure, fully knowing that tomorrow this university is going to come and say, hey, I need 100 more GPUs? So it has to be elastic, able to expand and contract over time. This is table stakes for a GPU cloud, right?
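To make the inventory-as-source-of-truth idea concrete, here is a deliberately toy Python model of carving whole nodes out of a free pool for a tenant and returning them later. The class and field names are invented for illustration; this is not Rafay's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GPUNode:
    name: str
    gpu_type: str              # e.g. "H100"
    gpus_total: int
    fabric: str                # e.g. "infiniband" or "ethernet"
    tenant: str | None = None  # None means the node is in the free pool

@dataclass
class Inventory:
    """Single source of truth the orchestrator consults instead of a spreadsheet."""
    nodes: list[GPUNode] = field(default_factory=list)

    def allocate(self, tenant: str, gpu_type: str, count: int) -> list[GPUNode]:
        # Carve whole nodes out of the free pool for this tenant's TAN.
        picked, got = [], 0
        for n in self.nodes:
            if got >= count:
                break
            if n.tenant is None and n.gpu_type == gpu_type:
                n.tenant = tenant
                picked.append(n)
                got += n.gpus_total
        if got < count:  # not enough capacity: roll back and fail loudly
            for n in picked:
                n.tenant = None
            raise RuntimeError(f"only {got}/{count} {gpu_type} GPUs free")
        return picked

    def release(self, tenant: str) -> None:
        # The elastic "shrink" half: return a tenant's nodes to the pool.
        for n in self.nodes:
            if n.tenant == tenant:
                n.tenant = None
```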
Demetrios [00:16:02]: But really hard issues. When you say that, I think, oh my gosh, that is brutal, to be able to be that elastic but also have all of these things in your mind with that inventory, if it was being done in a Google Sheet. And by the way, always a great business idea: if you see somebody doing something in a Google Sheet, you can probably think of, like, 10 companies that have come about because they saw someone doing something in a Google Sheet or Excel and thought, you know what, we could probably make a product around that.
Mohan Atreya [00:16:32]: Totally. I mean, like, even products like Airtable.
Demetrios [00:16:35]: Yeah, I was thinking of Airtable too. Airtable is a classic one, where it is just such a better experience, but at the end of the day, it's an Excel sheet.
Mohan Atreya [00:16:43]: A glorified Excel sheet. Yeah, exactly. So you look at the bottom of the stack; this is kind of what we solved. And then what happens? Your inventory is also changing. Servers are dying, servers are getting replaced. You have new switches. Your data center is not static.
Mohan Atreya [00:16:57]: You're adding capacity, removing capacity, retiring GPUs, adding new GPUs. How do you keep up with that? So the stuff around you is changing constantly, like quicksand, and the stuff on top is changing. So we have a middle layer that helps you make sense of what's happening on top and what's happening below.
Demetrios [00:17:16]: I like this. You're mentioning that on the bottom, on that layer, it's constantly changing for whatever reason. And then on the top, it's changing because of the customers wanting more capacity or wanting less capacity.
Mohan Atreya [00:17:30]: Correct, correct.
Demetrios [00:17:31]: And so being able to be, yeah, in a way, the broker of the capacity, knowing what capacity there is and what's actually online. Because the last thing you want to do is give a customer something that's not on. You say it's online, but they're trying to load up and it's like, why is this not loading? Something's broken here.
Mohan Atreya [00:17:51]: And that would be such a bad experience. Think about it: we all expect that when you go to AWS, press a button and say, I want this EC2 instance, it comes up like clockwork. You don't have to make a support call. That's the experience people expect.
Demetrios [00:18:09]: Exactly.
Mohan Atreya [00:18:09]: Can you be that tall, to have a shot in the market? That's a pretty hard mountain to climb.
Demetrios [00:18:17]: The things that you get into as you're trying to provision GPUs, all these hard pieces, the ways you've seen GPUs fail when working with them... I know that when you gave the presentation at the conference the other day, someone asked about the noisy neighbor situation.
Mohan Atreya [00:18:39]: Yes.
Demetrios [00:18:39]: And I wanted to get into that with you, because I think that's one thing that people who have tried this run into, if they're at a company that has the luxury of buying hardware, or just being the one who's provisioning it. I remember three, four years ago when we had the Run:ai team on here, they talked about how difficult it was too. How do you go about solving these noisy neighbor situations? Maybe break it down for me, even give me a reminder of what that even means.
Mohan Atreya [00:19:11]: Sure, absolutely. This is all kind of what forced us to get into the inventory management space. If you go talk to NVIDIA or AMD, et cetera, the technologies for this exist. You have ways by which you can isolate, using EVPN technologies, VXLAN; all this stuff exists. I mean, this is how data centers work today. It's orchestrating those dynamically, making sure you can do it programmatically based on some higher-level ask, like, hey, I want 100 GPUs; yesterday I was using 50. Where am I going to allocate them? How am I going to connect them to the same VPC? Essentially the same technologies you're familiar with: a VPC, a VXLAN, an EVPN, how do you do route leaking to my storage network. These are all known devils. It's orchestrating them automatically and programmatically, and then making sure that I didn't have any misses
Mohan Atreya [00:20:20]: in the process.
Demetrios [00:20:20]: But is this just, like, Terraform that we're talking about? It's not that, right?
Mohan Atreya [00:20:25]: Terraform is a way to achieve that, but Terraform has no understanding of the inventory, right? It has to talk to something. Terraform is one of the interfaces we support, because people want to automate the way they want, but Terraform itself has no understanding of inventory. It doesn't know the current state, like how many switches do I have? And then, of the technologies that exist, some will support certain kinds of automation, others will support other kinds. So you have to effectively present a unified interface and say, I'm going to shield you from the vagaries of the 20 different languages you'd otherwise have to speak.
Mohan Atreya [00:21:10]: It becomes a universal translator of sorts, a unified orchestrator and a translator to talk to all these various technologies that exist. And why do they exist? If I'm an enterprise, I can go in and say, I'm going to go with a gold standard because my universe is small; I'm going to pick vendor A, B, C. Or if I want to reduce that even further, I'll say, you know what, I'll pay the extra money, I'll buy an NVIDIA DGX, which is a closed system. I don't want to deal with all that.
Mohan Atreya [00:21:39]: I want to avoid all these problems. I get a box, plug it in, but it's limited to eight GPUs. And if I have more money, then I'll go buy a SuperPOD, but I'm paying through the nose at that point, because nothing is modular. A GPU cloud cannot do that. They need modularity, because their demands are different and, of course, the margins are how they make money. So they'll have Arista, Juniper, or Cisco for the Ethernet network switches, they'll have Mellanox for the InfiniBand, they'll have 20 different storage vendors. They're trying to save dollars here and there, because they can't pass this on to the user.
Mohan Atreya [00:22:20]: They have to remain competitive in the market and they have maybe 12 months to actually make money, after which the next generation GPU comes out and then it's game over.
Demetrios [00:22:31]: Yeah, then they have to make another investment to bring on the new GPU that folks are going to be asking for.
Mohan Atreya [00:22:37]: Exactly. So that variety is what causes them grief. The variety gives them the margin, but because there is no consistency across all of these things, you need something that glues it all together. It's like how we all use an iPhone or a Mac; it's impossible to change anything in those. It's a black-box walled garden. You can buy a walled-garden system, but that's not affordable for the GPU clouds. So anyway, I guess we digressed a little more than we thought.
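The "universal translator" idea maps naturally onto an adapter pattern: one high-level ask, one driver per vendor dialect. A hedged sketch with invented driver and method names, where the print statements stand in for real eAPI/NETCONF calls:

```python
from abc import ABC, abstractmethod

class FabricDriver(ABC):
    """The unified interface: one high-level ask per tenant segment."""
    @abstractmethod
    def create_tenant_segment(self, vni: int, ports: list[str]) -> None: ...

class AristaDriver(FabricDriver):
    def create_tenant_segment(self, vni: int, ports: list[str]) -> None:
        # In reality: vendor API calls programming a VXLAN VNI + EVPN routes
        print(f"[arista] vxlan vni {vni} -> ports {ports}")

class JuniperDriver(FabricDriver):
    def create_tenant_segment(self, vni: int, ports: list[str]) -> None:
        # In reality: NETCONF calls achieving the same outcome
        print(f"[juniper] vni {vni} -> interfaces {ports}")

def build_tan(drivers: list[FabricDriver], vni: int, ports: list[str]) -> None:
    # One tenant-access-network request fans out to every vendor in the path
    for d in drivers:
        d.create_tenant_segment(vni, ports)

build_tan([AristaDriver(), JuniperDriver()], vni=10042, ports=["eth1", "eth2"])
```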
Mohan Atreya [00:23:10]: But these are some of the challenges they face, which may not exist for an enterprise per se, because their scale is different. Let's talk about scale. The typical GPU cloud we work with has 5,000 GPUs, 10,000 GPUs, multiple generations of GPUs, sometimes spanning four or five data centers. And sometimes it's not just data centers; it's a colo here, a colo there, stitched together with some weird network. And they have data center talent, but no software talent.
Mohan Atreya [00:23:42]: An enterprise is very different. They have some homogeneity. They have 64 GPUs, 100 GPUs.
Demetrios [00:23:52]: Yeah, unless you're Meta, your 100 GPUs is great scale.
Mohan Atreya [00:23:56]: Yeah, exactly.
Demetrios [00:23:57]: And a lot more software chops than hardware chops, I would imagine too.
Mohan Atreya [00:24:01]: Exactly. Now, even within that, let's talk enterprise for a second. The GPU clouds are a different scale, but the same problem also exists for the enterprise: if I have, let's say, only 100 GPUs, how do I slice that up and give it to a data scientist for X hours, then reclaim it and give it to someone else when they need it?
Demetrios [00:24:24]: Two weeks the meta guy was talking about.
Mohan Atreya [00:24:26]: Yeah, how do I do that efficiently? In the old days you would get on a queue. If you're familiar with Slurm, a lot of universities still do it. This is 20-year-old HPC tech where you say, hey, I'm going to submit a job, and I'm going to describe what I want in my Slurm job. It'll wait in line, and Slurm will say, when that capacity is available, your job will run. Now, how do you achieve that in the modern containerized world? That's the challenge. What's the replacement for Slurm?
Demetrios [00:24:59]: Yeah, a lot of people still just use Slurm because of that exact point.
Mohan Atreya [00:25:02]: Exactly.
Demetrios [00:25:03]: And then you have others who say, all right, well we're going to go about it with Kubernetes.
Mohan Atreya [00:25:06]: Exactly.
Demetrios [00:25:07]: And try and figure it out that way.
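For readers who know Slurm's sbatch but not Kubernetes, here is roughly what "submitting a GPU job" looks like with the Kubernetes Python client: the Job sits in the scheduler's queue until a node with enough free GPUs can run it. This assumes a cluster with the NVIDIA device plugin installed; the image and command are placeholders:

```python
from kubernetes import client, config

def submit_gpu_job(image: str, command: list[str], gpus: int = 1) -> None:
    """Rough Kubernetes analogue of an sbatch submission."""
    config.load_kube_config()  # or load_incluster_config() inside a pod
    container = client.V1Container(
        name="trainer",
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            # Scheduling on GPUs requires the NVIDIA device plugin
            limits={"nvidia.com/gpu": str(gpus)},
        ),
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(generate_name="train-"),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(restart_policy="Never",
                                      containers=[container]),
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

# e.g. submit_gpu_job("pytorch/pytorch:latest", ["python", "train.py"], gpus=4)
```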
Mohan Atreya [00:25:10]: And Slurm has its own problems. All the nodes have to look exactly the same, all the GPUs have to look exactly the same. And if I have to update an OS... I mean, I have a customer who's doing that. They took it offline for two weeks because they had to do an OS update across their Slurm cluster.
Demetrios [00:25:26]: That's what I was talking about with the Meta paper, when they were talking about these rolling updates that they have to do. That can cripple you, and you don't even realize it until you're in that situation where you go, oh, we'll just update real fast, and then your whole cluster's offline for two weeks.
Mohan Atreya [00:25:40]: Exactly. And then you talk to the data scientists there, and they say, what am I going to do? Forced vacation.
Demetrios [00:25:47]: Yeah, they get to hang out. It's not the worst thing in the world. But I imagine they want to be making an impact, they want to contribute, and then... yeah, forced vacation.
Mohan Atreya [00:25:56]: Yeah. And it's frustrating for some of them, because again, I'm friends with some of these people, these customers, and they come and say, look, I have a month to finish my analysis and submit a paper; only then do I get to go present at some big bioinformatics event or something like that. They have their deadlines. What do they do? Now this thing got taken offline because some security guy came and said he's got a patch.
Demetrios [00:26:23]: Yeah, some update. That is a great one.
Mohan Atreya [00:26:27]: Yeah.
Demetrios [00:26:27]: So then we were talking about scale, and we were saying enterprise is one type of scale, but these neoclouds are a whole different level of scale. What are some of the problems that you'll run into at that type of scale?
Mohan Atreya [00:26:42]: Yeah, we talked about two things. We talked about how do I onboard users, and how do I handle them with some elasticity, because they demand and require elasticity in an environment where the stuff under me is in flux. And how do you keep that going, whether you're an enterprise or a neocloud? Similar problems, just at a different scale. Then the next problem people run into is, you can't give the end user, a data scientist, a bare-metal server and say, okay, bye-bye. What's that person going to do? So the problem becomes, what is the end-user experience that an MLOps engineer or a data scientist expects? What do they need? Right now, if I go to a cloud, what do I get? SageMaker.
Mohan Atreya [00:27:36]: And you click a button, you get a Jupyter notebook; you click a button, somehow magically GPUs show up; you train
Demetrios [00:27:43]: Stuff, run some pipelines, you're good. Yeah, it is very managed in that regard if you want it to be.
Mohan Atreya [00:27:49]: Exactly. Now, how can I bring that experience into my data center if I'm an enterprise? Because I may be regulated; I may not be able to move all the data to the cloud. There are so many constraints that people have. And at some point we should talk about data gravity, because that is one massive, intractable challenge that I see organizations struggle with. It's kind of outside our scope, but it's fun to talk about.
Demetrios [00:28:16]: Well, it is funny that you mention that, because we've been going around and I've been asking folks, hey, we're putting together this GPU Providers guide, what are some things that you would want to see in it? And during that process I've been interviewing folks in the community and saying, hey, you bought a bunch of GPUs... really, I think bought is the wrong word for it... when you rented these GPUs, what do you wish you had known? What do you wish you had asked about? And one that came up was, you know, I didn't realize how much I would have to pay in egress fees to get my data to these GPUs.
Mohan Atreya [00:28:51]: Exactly. That is definitely a challenge. Let me double-click on that a little bit, because that's a fun topic, and I'm a technologist, I love these crazy problems. The problem there is you have this concept of cold data and hot data. What I mean by that is, let's say you have 1 million photographs that you've taken over 20 years. You don't touch them all at the same time. In fact, in your viewer, where you're viewing stuff, you're probably looking at a few of them. And let's say it's indexed and all of that, and stored in some system.
Mohan Atreya [00:29:30]: So what would the provider do? If I'm Apple or Google, I'm not going to put everything in hot storage; you don't need that. So I'm going to do some tricks there: I'm going to move some to cold storage, cheaper storage, and only keep certain things in memory. That's how most of these systems work. Now, overlay AI/ML on top. Let me take an example: let's say I'm a clinical diagnostics company, and I'm a data scientist looking at a new kind of treatment, but I want to compare against patient data that is 20 years old. Now, for me, cold data has no meaning, because I have to compare against everything that exists and get my answer quickly. If I'm going to wait for the data to come back from tape onto hot storage, everything is slow.
Mohan Atreya [00:30:23]: What do I do now? I'm talking about petabytes of data that I need instant access to. And if you overlay that with the example I gave, like that Slurm system going offline: what if I have to move that system to the cloud, like a burst to the cloud or a shift to the cloud? How am I going to move my petabytes of data there?
Demetrios [00:30:42]: Yeah, what do you do?
Mohan Atreya [00:30:43]: Those seem to be open questions. Yeah, it's a question that is hard to solve. And you have constraints like the egress charges, et cetera, that people have to deal with.
Demetrios [00:30:54]: Yeah, I guess at the end of the day you can always figure out a way to make it work. But does it work within your budget?
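A quick back-of-envelope makes the data-gravity point vivid. The link speed and egress rate below are illustrative placeholders, not any provider's actual pricing:

```python
# What does moving 1 PB actually cost in time and egress fees?
PETABYTE_GB = 1_000_000          # 1 PB in GB (decimal)
link_gbps = 10                   # assumed dedicated 10 Gbit/s link
egress_usd_per_gb = 0.09         # hypothetical cloud egress rate

transfer_seconds = PETABYTE_GB * 8 / link_gbps   # GB -> gigabits / line rate
transfer_days = transfer_seconds / 86_400
egress_cost = PETABYTE_GB * egress_usd_per_gb

print(f"~{transfer_days:.0f} days on the wire, ~${egress_cost:,.0f} in egress")
# roughly 9 days and $90,000 -- before you've run a single training job
```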
Mohan Atreya [00:31:01]: Yeah, exactly. See, when does this happen? When you have things like, I've got to do patching or something else. And this is where Kubernetes comes in handy, because it's more flexible than something like Slurm; it doesn't have to deal with 20-year-old tech. So people end up spinning up a second parallel system close enough to the data. The investment you have to make here is more hardware, and then you can deal with your upgrades and updates on the system that needs to be touched.
Mohan Atreya [00:31:37]: And then you roll over slowly; you don't have to do big-bang updates. I've seen companies do that: instead of operating at 100% capacity, they shrink it down to 80%. At least people are able to do their jobs, and slowly, gradually, they chip away at the rest.
Demetrios [00:31:57]: So at any given time you can expect 20% to be offline. So really you buy 100 million worth of GPUs and you only use 80 million.
Mohan Atreya [00:32:08]: It's a practical approach; there's no other way out, because you're updating something all the time, and you want to minimize the time it's offline, bring it down to hours if possible. And some of this you can do if you do some intelligent things, like storing all the updates locally, so you're not pulling down terabytes from the Internet to update.
Mohan Atreya [00:32:38]: So, local cache systems. Like, if it's a software update, why do you want to download Ubuntu updates from the Internet for every system?
Demetrios [00:32:45]: Yeah.
Mohan Atreya [00:32:46]: Right. So people do these intelligent things, and we help with that: first of all, can I reduce that from 20% to 10%? And can I keep that window really, really quick, so that I can sweep through my infrastructure as quickly as I can for patching, et cetera?
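The sweep Mohan describes is essentially a batched rolling update: drain a slice of the fleet, patch it from the local cache, return it to the pool. A toy sketch, with the drain/patch/uncordon functions as stand-ins for real cluster operations:

```python
def drain(node):      # stand-in: stop scheduling new jobs, let running ones finish
    print(f"draining {node}")

def patch(node):      # stand-in: OS/driver/firmware update from the local cache
    print(f"patching {node}")

def uncordon(node):   # stand-in: return the node to the schedulable pool
    print(f"{node} back in the pool")

def rolling_sweep(nodes, batch_frac=0.10):
    """Sweep the fleet a slice at a time: with batch_frac=0.10,
    roughly 90% of capacity stays available throughout."""
    batch = max(1, int(len(nodes) * batch_frac))
    for i in range(0, len(nodes), batch):
        group = nodes[i:i + batch]
        for n in group:
            drain(n)
        for n in group:
            patch(n)
            uncordon(n)

rolling_sweep([f"gpu-node-{i:03d}" for i in range(50)])
```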
Demetrios [00:33:06]: So we were also mentioning, before I distracted you with the data gravity thing, how you don't want to just have bare metal. I guess for some neoclouds that is one of their value props; they say, look, we're as close to the metal as you can get, and some folks want that. But for the most part you want to give people something, at least the minimal resources, so that they can do their jobs on there if they want.
Mohan Atreya [00:33:38]: Yeah, that's actually a great point. So let me again use an analogy here, because I think it'll help explain the user's needs. Let's say I'm hosting a party tonight and I'm going to give them pizzas. I have two options, right? I can buy a ready-made pizza from the freezer at the local store, and all I have to do is reheat it and serve it, or I can buy all the ingredients and make it myself. And different users may fall on the two ends of that spectrum. Nothing right or wrong, it's just preference.
Demetrios [00:34:13]: Yeah, preference.
Mohan Atreya [00:34:14]: Similarly here: what is the catalog, the menu of items, that the enterprise or the cloud provider has to offer? Some will say, hey, just give me bare metal, I'll do everything myself on top. This is akin to someone saying, just give me the tomatoes and flour, I'll make my own pizza base, et cetera. And they may have a genuine reason to do that. Some may say, hey, I don't need all that, just give me a notebook. If you only sell the raw material, you're not an interesting storefront, right? If you sell only the ready-made stuff, again, you're not an interesting storefront from a user perspective. So as an enterprise or a GPU cloud, you kind of want to have both.
Demetrios [00:34:56]: Yeah. And actually this brings up another point; I guess the metaphor may fall down here, but a lot of these neoclouds are now offering tokens. They're saying, don't even mess with the SageMakers, we've got the Bedrocks of this world; just tell us what model you want and we'll give it to you in tokens. That's how they're trying to sell it.
Mohan Atreya [00:35:18]: Exactly. It's actually fantastic you brought that up, because the future with GenAI might be a token cloud, where modern GenAI workloads consume tokens, and they consume a lot of tokens. People don't want to run a dedicated inference endpoint; it costs a lot of money to run Llama yourself. So they'll end up saying, look, I just want to write my app, write my Python code. Give me an API endpoint and a token, and let me consume tokens; I'll pay you by the tokens, maybe a million tokens a day. Now, why is it interesting for a GPU cloud? A lot of them, like it or not, have spare capacity lying around. How do they monetize it?
Demetrios [00:36:11]: It's higher profit margins, right?
Mohan Atreya [00:36:12]: Exactly. Instead of it idling, I can be running serverless inference and somehow monetizing that GPU, but I'm selling something different to the user. The user here is expecting tokens, so I'm selling them tokens, not GPUs, and I'm not idling my GPUs.
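In practice, "selling tokens" usually means exposing an OpenAI-compatible HTTP endpoint and metering usage per request. A sketch against a hypothetical neocloud; the URL, model name, and key are all placeholders:

```python
import requests

API_URL = "https://api.example-neocloud.com/v1/chat/completions"  # hypothetical
API_TOKEN = "sk-..."  # issued by the provider

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "model": "llama-3-70b-instruct",
        "messages": [{"role": "user", "content": "Summarize our Q3 filing."}],
        "max_tokens": 256,
    },
    timeout=60,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
# Providers typically report usage so spend can be metered per request:
print(data["usage"])  # e.g. {"prompt_tokens": ..., "completion_tokens": ...}
```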
Demetrios [00:36:31]: And can you also use the second or third generation, these older GPUs that you can now get a bit more juice from?
Mohan Atreya [00:36:39]: Exactly. That's a beautiful point you made. Like we talked about, there's a 12-to-18-month lifespan, and then people say, hey, I don't want to buy that. Today everyone wants an H100; in a few months it'll probably be something else, and the interest in today's generation of systems is going to go down. What do you do with them? You don't want to idle them, you don't want to throw them away.
Mohan Atreya [00:37:04]: It's a wasted investment. So now you can extend the life of these systems by doing exactly what you said: if I can run a token cloud on that and deliver tokens, I'm protecting my revenue stream and running a higher-margin service. And whether you're an enterprise or a GPU cloud, it's the same problem. Except in one case it's the CFO of the GPU cloud worrying about margins; in the enterprise, it's the CFO of that enterprise talking to IT and saying, hey, what do you mean you just spent $5 million on these H100s? What do you mean they're not useful anymore? Same problems. But yeah, token clouds are where I believe the market will go.
Mohan Atreya [00:37:49]: A lot of the GenAI use cases... because for a lot of us who've been in AI/ML for a while, when we think about MLOps, we think you're building your own model from scratch, you're
Demetrios [00:38:02]: Dealing with the data. All those data pipelines made me think that it's a very different Persona. Or you're almost attacking different folks who work on ML in different ways.
Mohan Atreya [00:38:15]: Correct.
Demetrios [00:38:15]: Because if you're just hitting the tokens... and I've heard a lot of people say, outsource everything you can when it comes to this. If you can just get tokens, then start there, because it makes your life way easier.
Mohan Atreya [00:38:27]: Correct.
Demetrios [00:38:28]: If you don't have to deal with any of these GPUs and then set up the services on top of them, then great. If you do, that's one type of persona; maybe you're setting up the platform, you're the platform engineer. That's one persona, worrying about that reliability. But then if you're just building the app, hitting that API and creating the product that uses that service, that's a whole different persona.
Mohan Atreya [00:38:57]: Exactly.
Demetrios [00:38:57]: And there are various personas that could be that consumer.
Mohan Atreya [00:39:01]: Yes. And it's not just a developer, Demetrios. I think what we see is an interesting pattern, where this token cloud will be an enabler for what I would call a citizen scientist. We talk about the data scientist, but we're seeing a pattern where maybe someone from the CFO's team who doesn't know how to write code is saying, hey, how do I unlock value from my data? I have all these spreadsheets, and I don't know what to do with them. They can either sleep there in storage, or maybe I can unlock some value out of them, if only I can train an LLM, whichever one is suitable, that can somehow understand this data and become a domain expert. As an example, literally earlier today I was talking to one company that said, we have 20 years of investor relations content, and there's one guy in the company who's been with them for six, seven years who knows everything. If that person is on vacation or gets hit by a bus...
Mohan Atreya [00:40:05]: Exactly.
Demetrios [00:40:06]: Big problems.
Mohan Atreya [00:40:07]: No one else knows, because these documents are massive, and they're so proprietary to the company. So they're basically saying, man, how can I build a machine with this knowledge, so we can ask questions of it?
Demetrios [00:40:20]: And I think the default that most people go to is they say, oh well, AWS has, like, the private cloud, and so we'll just do it on AWS in a Bedrock type of way. From what I've heard, and I'm not going to sit here and just talk a bunch of shit about AWS, but some people allegedly run into problems with not getting good service and paying a lot for it, correct?
Mohan Atreya [00:40:44]: Yeah, they offer a good platform, but it's not for everybody. There are many organizations and geographies where it's not going to work. Many of the partners and customers we work with serve, say, a defense entity in country A or country B; they're not going to go to a public cloud, at least not in this geopolitical climate. They're very sensitive now. If I'm a defense agency, or if I'm doing something sensitive in that country, I'd prefer to use a sovereign cloud that I have control over, where I have a guarantee my data will never leak. Some of them feel that way.
Mohan Atreya [00:41:26]: Again, right or wrong, AWS is not for everybody, nor is any other public cloud. And, for example, at GTC I heard something interesting, and since you're from Germany, I'll mention it. This person came along and said, hey, we have some factories that have been shut down, and what the government wants to do is repurpose them into AI data centers, because they have power, they have real estate, and apparently they're set up in a way where you can put in racks and convert them.
Demetrios [00:42:05]: They've got cash to burn.
Mohan Atreya [00:42:07]: Exactly. I think people are thinking of very innovative ways to get started. So I'm pretty excited about the situation, because if access to AI infrastructure is available to everybody, think about all the stuff it's going to unlock. I think we are at that inflection point. Is it one year out, two years out? I don't know. But we are at a very interesting point in the market.
Demetrios [00:42:31]: Yeah. And it is fascinating, when we go back to that CFO: in the basic iteration it's, now I have my chatbot that I can talk to, and I don't have to worry about the data leaking to whoever. And we've heard that for a while, right? That's the whole reason the open source folks are banging the drum. But then if you take it a few steps further, you can say, well, now teams can build services that will be useful for our CFO, products that are going to help make that persona's life easier. And I think a lot of different teams and product folks are doing that work right now inside of the organization.
Demetrios [00:43:14]: They're going around and saying, just show me your job, show me what you're doing. Let's see if we can plug in some kind of AI service here.
Mohan Atreya [00:43:23]: Correct, to make that person more productive. Like the example I was talking about: the CFO's team gathering all the data, and they don't have a developer or, for that matter, anyone who understands what temperature is, or a token, or how to quantize a model. That's all Greek and Latin to them; they're finance people. So how do I make this person more productive without having a massive development team that has to do requirements gathering?
Demetrios [00:44:00]: It's funny you mention that, because I recently saw a team that has an embedded data scientist, an ML persona, on the finance side.
Mohan Atreya [00:44:11]: Yes.
Demetrios [00:44:11]: So you've got that finance team of 40 people, and they have one embedded engineer in there who is very deep in the AI world and is trying to help them streamline their processes and make them more productive.
Mohan Atreya [00:44:23]: Exactly. That's the approach where I see the market shifting, and this is something we're very interested in, because we're getting dragged into it from this token cloud thing. It's what I would call zero-code fine-tuning, because in the end it's just workflows: give me your data, select a model, you can choose A, B, C, whatever you want.
Mohan Atreya [00:44:50]: Then, okay, somehow click a button called fine-tune. They don't need to know what it
Demetrios [00:44:54]: Does behind the scenes, just optimize.
Mohan Atreya [00:44:56]: Right? And then it fine-tunes it, and then it says, okay, here's your endpoint, and here's a chatbot pointing at that endpoint. Now test it.
Demetrios [00:45:06]: Yeah.
Mohan Atreya [00:45:07]: See if it's actually doing its job, and they'll just iterate very quickly. And these models are getting so intelligent that you can fine-tune them quickly. All the complexity that we deal with can be abstracted away.
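Under the button, a zero-code fine-tuning flow is typically three API calls: upload the data, start the job, poll for a servable endpoint to test against. A sketch against a hypothetical service; every URL, field, and model name here is an assumption, not a real provider's API:

```python
import time
import requests

BASE = "https://api.example-neocloud.com/v1"   # hypothetical fine-tuning service
HEADERS = {"Authorization": "Bearer sk-..."}   # placeholder credential

# 1. Upload the data -- the "give me your data" step
with open("investor_relations.jsonl", "rb") as f:
    file_id = requests.post(f"{BASE}/files", headers=HEADERS,
                            files={"file": f}, timeout=120).json()["id"]

# 2. Kick off the job against a chosen base model -- the "fine-tune" button
job = requests.post(f"{BASE}/fine_tunes", headers=HEADERS, json={
    "base_model": "llama-3-8b-instruct",
    "training_file": file_id,
}, timeout=30).json()

# 3. Poll until the service hands back an endpoint to point a chatbot at
while True:
    status = requests.get(f"{BASE}/fine_tunes/{job['id']}",
                          headers=HEADERS, timeout=30).json()
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(30)

print("test your chatbot against:", status.get("fine_tuned_model"))
```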
Demetrios [00:45:25]: Yeah, I like that, zero-code fine-tuning. I've heard of it; my friends at Prem are doing what they call autonomous fine-tuning, but it's that same general idea. And recently we had Tanmay on here, and he was talking about the benefits of fine-tuning. I've personally seen a lot of people dunking on fine-tuning, and I've probably been guilty of this myself, where it's like, fine-tuning is a lot of work, and it can potentially be for nothing if you're not doing it right. You may get a worse result after you've fine-tuned the base model.
Demetrios [00:46:00]: And so you have to go into it with a certain understanding of what you're getting yourself into.
Mohan Atreya [00:46:05]: I think we are at a point now where we can make this a zero-code experience, to a certain extent. We can even go and say, hey, for the kind of dataset you have, maybe these models are the right kind to pick; for the kind of budget you have, maybe you want to pick this one, because you only have access to X GPUs, or this many tokens. Think about lawyers: if they have to do an analysis of 50 years of cases, oh man, that's a lot of work. What if something can just autonomously help you there and give you an assist?
Demetrios [00:46:43]: Yeah, it's fascinating to think about those trade-offs.
Mohan Atreya [00:46:45]: Fine-tuning is a great tool for certain scenarios where the domain is reasonably static, where it doesn't change all the time. I don't know what kind of shelf life it has. There are people who say fine-tuning will take over the world; there are some people who say no, it won't. I think the jury's out on that; we'll see. But when you're doing fine-tuning, you're consuming AI infrastructure.
Mohan Atreya [00:47:09]: To go back to that store analogy I mentioned: you may be buying a ready-made pizza, you may be buying the ingredients, or you may just buy tokens. Or you may say, you know what, I don't care what tokens are, I just want a workflow where I upload my data and fine-tune; in the process I may be using tokens and GPUs, but I don't care as a user. That's the spectrum of things I think the market needs. And if I'm a cloud provider or an enterprise, these are the kinds of services I need to offer my users, the whole spectrum. Otherwise I'm an uninteresting storefront for the user.