MLOps Community

Bridging the Gap between Model Development and AI Infrastructure // Mohan Atreya // AI in Production 2025

Posted Mar 13, 2025 | Views 229
# LLM
# GPU
# Rafay

SPEAKER

Mohan Atreya
Chief Product Officer @ Rafay

Mohan is a seasoned and innovative product leader currently serving as the Chief Product Officer at Rafay Systems. He has led multi-site teams and driven product strategy at companies like Okta, Neustar, and McAfee.


SUMMARY

Large Language Models (LLMs) are transforming industries, but their success hinges not just on cutting-edge models, but on the ability to efficiently train, deploy, and manage them at scale. Training state-of-the-art models like GPT or Llama requires thousands of GPUs running for weeks, posing significant challenges for MLOps teams. This session explores how LLMs are built, the scale of training required, and the growing infrastructure demands placed on MLOps teams. It will cover the key operational challenges of AI workloads, including distributed training, cost optimization, and multi-tenancy. Finally, the talk will highlight how Rafay’s GPU PaaS enables seamless AI infrastructure management, helping organizations scale AI efficiently without bottlenecks. Attendees will gain a deeper understanding of LLM training, GPU workload management, and practical strategies to optimize AI infrastructure.


TRANSCRIPT

Demetrios [00:00:00]: Where are you at? I'm bringing you on the stage right now. Yeah, there he is. How you doing, dude?

Mohan Atreya [00:00:14]: I'm well. Can you hear me okay?

Demetrios [00:00:16]: Yeah, hear you loud and clear. I'm going to let you share your screen and we are going to get to the talk of yours right now. I'm very excited. We're going to be bridging the gap between model development and AI infrastructure. Right on. I see your screen here, so let's rock and roll, man.

Mohan Atreya [00:00:39]: Perfect. Perfect. Hey, thank you all. Fantastic opportunity, and good to see you again, Demetrios. It's fantastic to be back. We thought we would talk about a topic that is close not just to our heart, but to a large number of providers and enterprises in the market. This will be an interesting conversation, and I've kept it somewhat at a higher level.

Mohan Atreya [00:01:12]: Given the audience here, I think this may be very intriguing for some of them. So let's talk about that. First, a little bit about me, in case you want to stay in touch: I love chatting with people on LinkedIn, Twitter, email, wherever you are. We also blog extensively. I'm based in the San Francisco Bay Area, on Pacific time.

Mohan Atreya [00:01:39]: If you're at local meetups and other things, or if you're going to be at NVIDIA's GTC next week, look me up; I'll be hanging around the Rafay booth. Okay, so let's start with a little bit of a view of the landscape, and hopefully this will help you reset your thinking a little bit. If you look at the landscape, you can broadly put the users in two buckets. There are trainers, as in the people who create the foundational models and all that. And then there are tuners and inferencers; these are the people who use the foundation models. Right.

Mohan Atreya [00:02:25]: As an example, if you look at the number of companies as the dimension, there are very few trainers, but there are thousands, tens of thousands, maybe even millions of organizations who are going to be tuners and inferencers. It's a completely different, logarithmic scale for tuners and inferencers. But if you flip the logic and think about the scale of AI infrastructure, and what do I mean by that? Let's just look at GPU count as an example, the number of GPUs. The entire story is flipped on its head, right? The trainers today have a crazy number of GPUs, and the tuners and inferencers are what the industry today terms GPU poor. Well, I'm GPU poor too, for example. I'm not a trainer. I struggle to find GPUs at the right time, and the tooling at the right time, to do my tuning and inferencing.

Mohan Atreya [00:03:37]: This is the state of the market today, right? Now you may say, hey Mohan, this all sounds good, but let's see some numbers, right? So when Meta trained and launched Llama 3.1, which is, for all practical purposes, history now, they claim they used about 16,000 H100 GPUs. Well, that's a big number. And if you've been on Twitter, even watching Elon Musk's posts about xAI's Grok 3, apparently they're using 100,000 H100s. And it's rumored that OpenAI had 25,000 A100s when they launched GPT-4. GPT-4 also is old; it's history. And look at how the numbers are projected to change: by the end of 2025, Meta expects to have 600,000 H100s.

Mohan Atreya [00:04:31]: And here's where we are: most of us are looking for, like, can I get one H100 to do my job? So, you know, the trainers. The point I'm trying to make is these guys are operating at a different scale, but you can probably count them on two hands. There are only that many of them, right? These guys are going to have the lion's share of AI infrastructure. So you may now be thinking, hey, how is this going to help me? I'm not at any of these companies. I'm not a trainer, in the sense that I'm not creating a foundation model. Perhaps: what am I doing at this presentation? Well, the conversation is not about these guys; the conversation is about the rest of us, right? I just wanted to make sure we talked about how these guys operate at a different scale, so we have some relative sense of the numbers we're talking about. Now let's pivot to folks like you and me, and hundreds of other people, who are trying to either develop a fine-tuned model, develop our own statistical model, or do some inferencing. What are my challenges? Everyone here in this meeting has probably encountered these problems, right? I need a GPU for my GenAI chatbot app in my enterprise, and by the time I put in a ticket to enterprise IT and get access to it, oh my God, I'm frustrated. So I just go to the nearest GPU cloud, put in my credit card, and get access to what I need.

Mohan Atreya [00:06:14]: Enterprise IT is not happy. And you're not happy that the IT team isn't able to give you what you need. This is the story of a lot of our lives today, right? And if you go talk to enterprise IT, they'll say, oh my God, if I give everyone a GPU, it's going to cost us a lot of money. And what about idling GPUs? Because we do make mistakes. Think about how many times you spin something up and forget to turn it off. If it has a GPU, man, your numbers are going to add up like crazy, right? So these are the challenges people are facing: the folks on top, folks like you and me, who need access to GPUs. And not just GPUs, but sometimes also things like compute, storage, high-end memory, access to data. We just don't get it when we need it, right? And there's always this tension between folks like us and IT in many cases. So how do we solve this? In summary, there are five core issues that organizations and users struggle with.

Mohan Atreya [00:07:24]: I don't get access to what I need quickly. I want to experiment. Again, how many of you were desperate to try out DeepSeek when it launched a few weeks back? You probably tried it on your own dime, outside the company's infrastructure, because you were just curious. You had to experiment, and you probably didn't get access to it quickly enough. So these are the five big challenges that we see again and again and again. Everyone is struggling with them, and there's got to be a solution to this problem, right? It cannot go on like this.

Mohan Atreya [00:08:00]: The growth and adoption of AI will be cramped unless these problems are solved somehow. Right? So how do we bridge these gaps? You have the users, the AI infrastructure, and IT in between, who today is not able to provide that bridge. They want to build that bridge, desperately. So what we believe is needed is a platform as a service, a critical middleware that acts as that bridge between the users and the IT-owned AI infrastructure. The IT-owned AI infrastructure could be GPUs, could be VMs, could be Kubernetes clusters, could be data. Because as a user on the left, I just want to fine-tune, I just want to build my model, I want to use my Jupyter notebook. Why do I have to go learn Kubernetes? Why do I have to go learn about containers? Why do I have to learn about cloud? I mean, is it really needed? Let me do my job, right? It's a fast-moving world.

Mohan Atreya [00:09:01]: So what do we do now? Just to summarize, the gap here is: you as users want to use one of these tools on top, and I think most of you are probably extremely familiar with them. Some of you might be using Ray, some of you might be using Kubeflow as an MLOps platform, some of you might be using MLflow. And then there are hundreds of these applications on top. Some of you might be building your own apps on top of this, right? Not using these platforms, but building your own apps. And right at the bottom you have the accelerated computing infrastructure, and the company says, well, we've got some GPUs, maybe 50 GPUs. What you need in the middle is this PaaS.

Mohan Atreya [00:09:47]: What does that PaaS do, really? It helps with two problems: orchestration and governance, and consumption and monetization. And you might be thinking, what the hell does that mean, really? What this really means is this: as a user, you want to be able to literally click a button and get the right kind of infrastructure instantly. Most of you probably don't have that today; it'd be nice if you did. Some of you may say, hey, I need a VM with a GPU in it. Another person on the call here might say, hey, can I get a cluster with some GPUs on it? Another person is going to say, hey, I kind of need bare-metal servers for my job. Another one is going to say, you know what, hey, we've been doing this thing for 15, 20 years. We come from the HPC land.

Mohan Atreya [00:10:48]: We kind of need Slurm with GPUs on it. How is IT going to manage all these things? Think about those poor souls struggling to manage all the various asks that everyone has. They have a limited pool of resources below, so they need to make sure they operate it efficiently. So what would that look like? The answer is you've got to centralize; there's no choice. We have been working with a lot of large enterprises on this, and this is the answer they have settled on. What does this mean? Let's take an example: a large hedge fund we've been working with. They said, hey, look, we have these hundred GPUs and we have 50, 60 data scientists, and it's a hedge fund. So they live and die by their ML models, right? The ML models are making the decisions on how to trade and what to trade, right?

Mohan Atreya [00:11:53]: And in a volatile market like the one right now, these models and their accuracy can mean the difference between profit and loss, right? In this landscape, the quality of their ML models is their backbone. Now, if the data scientists don't have access to GPUs when they need them, they're not going to be producing models. The business is toast, right? That's the situation they found themselves in. And the answer for them was: hey, let me centralize the GPUs, and if there are idling GPUs, I can reclaim them, put them back in the central pool, and give them to someone else who needs them. When someone needs them, I'll give them to them instantly, in a form factor that they care about, right? So that's the ideal experience that I think organizations are settling on. Now let's see what this could look like. Some of you might be thinking, hey, this all kind of sounds nice, but what would it look like as a user? So we're going to look at two things.

Mohan Atreya [00:13:00]: We're going to look at: if I am a data scientist or an ML researcher, what is the ideal experience? What could it be like? What should it be like? Right, so we'll see that. The second part we'll see is: well, somehow that experience was created. How was it done? Demetrios had the nice Harry Potter video just before this. Is IT going to wave a wand and deliver this experience, or how are they going to design it? How are they going to deliver it? Well, that's an important aspect as well. It's a balance between these two that delivers the whole solution for the vast majority of us. Now, if you are one of those lucky companies that has 100,000 GPUs, like one of the foundation model trainers, as we called them, you probably have a dedicated team whose only job is to deliver this massive colossus of a cluster for your job. The rest of us don't have anything like that. Let's see what this will look like.

Mohan Atreya [00:14:14]: Now what I'm going to do is first start with the stuff on the left, the end-user experience. I'm going to pivot to a quick demo here. I'm sharing my screen, so you should be able to see it. And Demetrios, if you could just let me know if this is not visible. So as a user, I log in and I get my own portal. Let's say Demetrios and I are data scientists at this company. I log in and I'm given a shopping cart.

Mohan Atreya [00:14:53]: And in the shopping cart I can ask for all kinds of compute, I can ask for notebooks, I can ask for a lot of these things that somehow, magically, have been created. Now if we look a little...

Demetrios [00:15:05]: Sorry, Mohan, sorry.

Mohan Atreya [00:15:07]: Yes.

Demetrios [00:15:07]: Can you make it a little bigger?

Mohan Atreya [00:15:09]: Yeah. I'm on extreme magnification, so hopefully it's not going to mangle the experience, but let's try it. I'm a data scientist. Is it looking okay now?

Demetrios [00:15:20]: Yeah, if you can go bigger then awesome. But if that's the best we got, I'm happy with it.

Mohan Atreya [00:15:26]: Okay, let's try it. It's mangled the look and feel, but let's try it. Apologies, I'm on an external monitor here. As a user, I'm a data scientist; I just want stuff, and I want it quickly and instantly. Just like when you go into your local grocery store, you know where to look for apples, oranges, sugar, coffee. You just know. You go to the aisle, you pick what you want, you pay for it, and you get to use it. Why can't that kind of an experience be possible for data scientists and ML researchers in a company? Why does it have to be so complicated? So here's what you're seeing.

Mohan Atreya [00:16:13]: What you're seeing here is me as a user, given a menu of options, right? These are standardized options that the company has provided me. Someone in IT has said, hey, this is what is in our shopping cart; you go do whatever you want, you can use them. So if I want this medium instance with 8 CPUs, 16 GB of memory, and 2 GPUs, all I have to do is say, yeah, I want that, give it a name, and say go. In a few seconds you'll have an instance like this that has been launched. And the same thing with the notebook.
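To make that "give it a name and say go" flow concrete, here is a minimal Python sketch of the idea. Everything in it (the `InstanceProfile` shape, the `CATALOG`, the field names) is a hypothetical illustration, not Rafay's actual API:

```python
# Minimal sketch of a curated self-service catalog (hypothetical names and fields).
from dataclasses import dataclass

@dataclass
class InstanceProfile:
    """A standardized compute option published by IT."""
    cpus: int
    memory_gb: int
    gpus: int

# The "shopping cart": profiles IT has curated for end users.
CATALOG = {
    "medium": InstanceProfile(cpus=8, memory_gb=16, gpus=2),
}

def request_instance(profile_name: str, instance_name: str) -> dict:
    """'Give it a name and say go': resolve a curated profile into a
    launch request the platform would fulfill on shared infrastructure."""
    profile = CATALOG[profile_name]
    return {
        "name": instance_name,
        "cpus": profile.cpus,
        "memory_gb": profile.memory_gb,
        "gpus": profile.gpus,
    }

print(request_instance("medium", "demetrios-demo"))
```

The point of the sketch is that the user picks from a menu and supplies only a name; the resource sizing and placement are IT's curated defaults.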

Mohan Atreya [00:16:56]: I want a Jupyter notebook? No problem; something has been curated for you. You give it a name, you deploy it on a particular instance that you have, and then you choose among the various profiles that exist. All of this has been curated. This has actually been a lifesaver for a bunch of users, because we see data scientists struggle with: it works for my colleague, but it's not working for me. Why? Well, I had the wrong version of Python, or the wrong version of PyTorch; it works for him and not for me.
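As a sketch of why curated profiles fix this, here is a small, self-contained check one could run in a notebook. The pinned profile contents below are illustrative assumptions, not a real Rafay profile:

```python
# Hypothetical curated notebook profile: exact runtime versions are pinned
# by IT so "works for my colleague, not for me" drift cannot happen.
import importlib.metadata
import sys

NOTEBOOK_PROFILE = {
    "python": "3.10",                                 # illustrative pins
    "packages": {"torch": "2.1.2", "numpy": "1.26.4"},
}

def check_runtime(profile: dict) -> list:
    """Report mismatches between the running environment and the profile."""
    problems = []
    running = f"{sys.version_info.major}.{sys.version_info.minor}"
    if running != profile["python"]:
        problems.append(f"python {running} != pinned {profile['python']}")
    for pkg, pinned in profile["packages"].items():
        try:
            installed = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            problems.append(f"{pkg} not installed (pinned {pinned})")
            continue
        if installed != pinned:
            problems.append(f"{pkg} {installed} != pinned {pinned}")
    return problems

print(check_runtime(NOTEBOOK_PROFILE) or "environment matches profile")
```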

Mohan Atreya [00:17:29]: And we've seen people waste a week trying to figure that out. Similarly here, what about other things? Compute and notebooks are easy-peasy, right? What's the big deal? What is interesting is what I'm going to show you next: end users should be able to get any AI/ML platform or application one click away. For example, in the last three weeks everyone has been hot and bothered about DeepSeek. A lot of people mostly want to be able to try it and see: is it real, is it not? Can I quickly try it? Will it change my existing platform dramatically? Should we be using it? How do I form an opinion? All of us experiment. What many of our customers have been doing is asking for cards like these, where they say, hey, if I give this to my users, anyone should be able to click a button and get access to it. If I want to try DeepSeek, I just have to click something, give it a name, and say go. That's it. The whole thing will come up, and IT is able to give me a curated experience for everything I need.

Mohan Atreya [00:18:51]: And they were able to get this going in about 15 to 20 minutes for the users. Same thing with other things. I want to try NVIDIA NIMs? Well, if I do it the traditional way, I've got to be a Kubernetes expert, a Helm expert; I've got to understand how to debug it and troubleshoot it. No data scientist or researcher cares about all that; they don't have time for it. So what if it's one click away and I can just try it? Same thing with SageMaker, same thing with Slurm.

Mohan Atreya [00:19:20]: You can package any app this way. This is the experience that I think is ideal for end users who want to focus on their job: they don't have to go learn infrastructure stuff, they just get to use it. Now let's pivot to the second part, which is actually the more interesting part: how did this experience get designed? Somewhere, something was done to make this possible; let's go look at that. I'm going to make this bigger too. So what we want to offer is a studio for IT. What IT is going to do is publish these profiles. They're going to say, hey, medium in my case means this.

Mohan Atreya [00:20:07]: And they will design this, too. Just like in a store, somebody is deciding, hey, I want to have these kinds of apples, this kind of sugar, this kind of stuff in the store. Someone is designing these, deciding what you can and cannot do, with all the complexities abstracted away from the end users to provide that curated experience. Same thing with services, like if you want to have DeepSeek; the example you saw was this, right? We did another one where a company wanted to do classification of ships as an AI/ML job. It should be a button click away. So this is kind of what I think most organizations should be looking at: deliver a stellar experience for a user, where they are not bogged down by the complexities and nuances of various kinds of AI infrastructure.
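To illustrate what "templatizing" an app in such a studio might amount to, here is a hypothetical sketch. The field names, package URL, and model name are all assumptions for illustration, not Rafay's actual studio schema:

```python
# Hypothetical sketch of an IT-published app "card" in a PaaS studio.
# All names, URLs, and fields are illustrative assumptions.
APP_TEMPLATE = {
    "name": "deepseek-inference",
    "display_name": "DeepSeek (one-click trial)",
    # Packaging details (charts, containers) stay hidden from end users.
    "package": "oci://registry.example.com/charts/llm-inference",
    "defaults": {"gpus": 4, "model": "deepseek-example-model"},
    # All a user supplies is a deployment name; everything else is curated.
    "user_inputs": ["deployment_name"],
    "allowed_profiles": ["medium", "large"],
}

def publish(template: dict, store: dict) -> None:
    """Make the card show up in every user's self-service menu."""
    store[template["name"]] = template

service_catalog: dict = {}
publish(APP_TEMPLATE, service_catalog)
print(list(service_catalog))  # ['deepseek-inference']
```

This matches the 15-to-20-minute onboarding described later in the Q&A: the work is deciding defaults, inputs, and which profiles may run the app, then publishing the card.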

Mohan Atreya [00:21:09]: It should be like your laptop: you open it, press a button, log in, use it, right? That's what has to be given to folks like us so that we can do our jobs faster. We are now at an interesting point in terms of the AI/ML flywheel, at least in my view. It's accelerating, to the point where the difference between success and failure is going to be how fast you can iterate. Right? And that is where you don't need unwanted baggage; we need to find a way to shed it, and that's kind of what we're focused on. So what you saw here was essentially something like this from an architecture perspective. There was a shared Kubernetes cluster.

Mohan Atreya [00:21:58]: The company had set up one massive cluster with, I don't know, 50, 60 GPUs, right? And you had four or five people hanging off the same cluster with their own tenants. So this data scientist is doing something with Jupyter on a compute instance, and they asked for one GPU. This ML researcher said, hey, I want to use Jupyter, but I want to do some training on Ray, and I'm operating with two GPUs that were allocated out of this pool. The GenAI developer said, hey, I want to use DeepSeek, I want to do some Ollama, and I want to run some massive inferencing thing. Well, you need a lot of GPUs for that, so something was given out of this pool to this guy: four GPUs in this case.

Mohan Atreya [00:22:43]: And when these guys finish the job, the GPUs go back to the pool, and the next person can get them. This is what you saw in the demo: extremely efficient for IT and operations. Nobody here is going to be GPU poor, right? Everybody gets a fair share, everybody gets access to stuff, and they do their jobs efficiently. They don't know anything about Kubernetes, they don't need to know anything about containers; they focus on their models, they live in their Jupyter notebooks, they live in Python code. That is where I think the market has to be, so that folks like data scientists, researchers, and GenAI developers can actually get the flywheel moving faster. That, in a nutshell, is what we think the market needs to bridge the gap between low-level AI infrastructure and what people need on top. What they really, really need is self-service consumption and usage of GPUs.
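A tiny Python sketch of that allocate-and-reclaim loop over a shared pool; this is a toy model of the idea described above, not Rafay's actual scheduler:

```python
# Toy model of a central GPU pool: allocate on demand, reclaim when a
# job finishes or idles, and return capacity to the shared pool.
class GPUPool:
    def __init__(self, total: int):
        self.free = total
        self.allocations: dict[str, int] = {}

    def allocate(self, tenant: str, count: int) -> bool:
        """Grant GPUs instantly if the shared pool can cover the request."""
        if count > self.free:
            return False  # a real system would queue or right-size the request
        self.free -= count
        self.allocations[tenant] = self.allocations.get(tenant, 0) + count
        return True

    def reclaim(self, tenant: str) -> None:
        """Finished or idle workloads give their GPUs back to the pool."""
        self.free += self.allocations.pop(tenant, 0)

pool = GPUPool(total=8)
pool.allocate("data-scientist", 1)   # Jupyter on one GPU
pool.allocate("ml-researcher", 2)    # Ray training on two GPUs
pool.allocate("genai-developer", 4)  # large inference job
pool.reclaim("genai-developer")      # job done: GPUs return to the pool
print(pool.free)                     # 5 GPUs free again
```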

Mohan Atreya [00:23:50]: No more waiting for allocations that take a week or two weeks, et cetera. Whatever AI app you want to try has to be templatized and one click away, so I can try things quickly and move on. Everything has to be on shared infrastructure; no company can afford dedicated stuff, so multi-tenancy is a big deal, like the example we saw where multiple people lived on the same unified, shared AI infrastructure. So those are the three simple things that I would encourage you to factor in, and make sure the end-user experience is stellar. The end users really, really matter here. Now, if you're interested in learning a little bit more about us: we blog extensively, at least two or three blogs a week, on topics related to GPUs, AI/ML, data scientists, et cetera.

Mohan Atreya [00:24:54]: Mostly around AI infrastructure orchestration. We blog here; please go have a look. We have extensive documentation and getting-started guides, and of course you're welcome to sign up for a free org if you want to experience what you just saw; give it a spin. So that's all I had, Demetrios. I hope I stayed on time. If there's anything else, I'm happy to take any questions, and I'd love to interact with people on LinkedIn, Twitter, etc.

Demetrios [00:25:28]: Thank you, 100%. Actually, could you leave that last slide up? Oh sorry, the one with all the links, because the blog is highly recommended, and I want to make sure that in case people enjoy infrastructure, they get to see that and check it out. Let me see if I can bring that back on stage. All right, cool. It might need to be a little bigger, but we'll drop it in the chat too. There are a few questions coming through, and some from me.

Demetrios [00:26:00]: I'm wondering, how do you get new tools implemented into Rafay? Is it on your side or is it on my side? Do I need to... let's say that I want to use a specific tool and plug it in with you guys. How would that work?

Mohan Atreya [00:26:17]: Yeah, great question. So this is a process that happens in every company. There are usually one or two data scientists who go to a conference and learn about a new tool. Let's take Ray as an example. Three years ago, probably not many people knew about Ray, right? It's a newer framework, it's a cool framework, it's hot, it's amazing. What people would do is say, hey, we tried it, we really liked it; can you encapsulate it? And that work is typically done by IT, using the PaaS studio that I showed.

Mohan Atreya [00:26:53]: It takes about 15 to 20 minutes to templatize some standard configuration, test it, and then open it up to everybody. Right. Every app that I was showing there is about a 15-, 20-minute job, essentially. Think of the store analogy I used: the work here is essentially saying, hey, how am I going to make it available in that store? Which aisle am I going to put it in? What price point am I going to put on it? How am I going to bill it? Some kind of QR code or barcode or something. That's pretty much all that is needed. The rest of it is extremely straightforward: 15 to 30 minutes of work, tops, to onboard a new tool. Now the good thing is you can go from one person who learned about it to everybody in the company being able to use it in a matter of days.

Mohan Atreya [00:27:49]: I think that's the value.

Demetrios [00:27:51]: Nice. Okay, so the next question I had for you was coming through in the chat: in shared infra, we found noisy-neighbor problems resulting in QoS issues. Have you taken care of that problem in your GPU PaaS? "PaaS," I guess, is how you pronounce it.

Mohan Atreya [00:28:11]: That's right. That's exactly right. The noisy-neighbor problem can be a huge challenge, and we have a bunch of controls in place to, I would not say eliminate it, but minimize it. There's always going to be some dimension that can cause challenges, but we minimize it to the point where it's a non-issue for 99% of the use cases.

Demetrios [00:28:36]: Okay, excellent. And on the part about using different tools inside of it: does it matter if a tool is open source or closed source? Is it the same kind of process for each?

Mohan Atreya [00:28:59]: Yeah, great question. Let's take an example. If you look at these two, DataRobot and Dataiku, there's a reason why I took this particular set of examples. If you look at the examples on top, many of these are open source, but DataRobot and Dataiku are not; they're commercial products, right? So for some of these, we have worked with them, or continue to work with them behind the scenes, with the commercial products, and bring them in. We make this turnkey with those commercial providers because enough people care about it. If you see anything missing here, we have a catalog of these integrations with third parties; with the commercial vendors, we've got to make it whole for them, right? With open source it's much easier, of course: we can curate it, and our customers can curate it themselves.

Demetrios [00:29:53]: So I've heard this a few times: the best part about a managed service is that it's a managed service, and the worst part about a managed service is that it's a managed service, right? So you want to be able to give knobs and whistles to folks if they want to go deeper. How do you look at that, especially you being the product guy? I'm sure it's not the first time you've heard that.

Mohan Atreya [00:30:23]: Yeah, yeah, it's actually an amazing question. And whoever asked this question, I guess, has lived this before. So let's spend 30 seconds on it. The question is more around how abstraction can sometimes be a royal pain, and a negative, if done badly. Which is why, if you notice here, there are two layers. If the abstraction with the AI/ML tool on top is too tight and too closed, the user may come back and say, well, this is not of any use to me.

Mohan Atreya [00:30:57]: You have to do everything all over again. It is entirely possible for IT teams to swing the pendulum completely to the other side, because abstraction can be too tight, too closed, and all of that. This is why we have, like, gradient controls. So the other option is the end users sometimes decide, hey, look, whatever IT has given me is too tight, too abstracted away; I'll go consume the lower-level stuff. They may end up saying, hey, I'll use a Kubernetes cluster or a VM that the GPU PaaS gave me, and I'll install the tool myself. In fact, that's how people experiment and provide feedback to IT, because how does IT know whether they drew the line at the right place? Sometimes they make mistakes. So the end users have an opportunity to give feedback and say, hey, look, can you please loosen up the abstraction a little more, give us a few more knobs? It's an iterative process to get there.

Demetrios [00:31:58]: Yeah, yeah, yeah, yeah. That's always the fun part. So, Mohan, this has been awesome, man. I really appreciate you coming on here and sharing more and we will be in touch. I guess you're on LinkedIn, so if anyone wants to jump in there and continue the conversation, they will.

