MLOps Community

Accelerating Growth Through Optimizing GPU Usage // Sahil Khanna // AI in Production 2025

Posted Mar 21, 2025 | Views 92
# GPU
# adobe
SPEAKER

Sahil Khanna
Senior machine learning engineer @ Adobe

As a software engineer specializing in machine learning, I have led the development of advanced training and inference platforms that have enhanced AI capabilities and streamlined processes. My expertise in MLOps has enabled me to build scalable solutions that drive innovation and deliver measurable results.


SUMMARY

The presentation explores the critical importance of optimizing GPU usage for generative AI models. It delves into the journey of Adobe's Compute Platform, highlighting the challenges faced and the innovative solutions implemented to enhance GPU utilization, resource management, and reliability. The presentation also provides an overview of the AI Compute Platform architecture and acknowledges the contributions of the dedicated team members who made these advancements possible.


TRANSCRIPT

Click to view the Presentation Slides

Ben Epstein [00:00:02]: We have our last talk on this track coming up with Sahil. So, how's it going?

Sahil Khanna [00:00:08]: Good, how are you?

Ben Epstein [00:00:10]: I'm doing well. I'm really excited. This is our last, but certainly not least, talk. I feel like every talk today has been really, really awesome. I love having to pay attention and listen to all of them. It's been a very informative set of conversations. We're going to dig in today for our last talk around optimizing GPU usage, which is a very cool topic from this speaker. Sahil is a senior machine learning engineer at Adobe.

Ben Epstein [00:00:35]: My initial reaction, when I think about Adobe, is that I wouldn't immediately think about GPU optimization. I might think about Modal or Nvidia. But Adobe puts out such fantastic software and such cool engineering, and obviously they must be doing a lot with generative AI today, so it makes a lot of sense that they would focus on this. Why don't you give a little background on where you're coming from and what you work on, and then you can jump into your talk.

Sahil Khanna [00:00:57]: Yeah, for sure. First of all, I'm thrilled to be here. A big thanks especially to all the organizers and speakers. So far all the talks have been fantastic; I've learned so much in a few hours, so thank you for that. As for me, I'm currently working as a machine learning engineer at Adobe, specifically within the Firefly organization, on a team which focuses on training infrastructure and frameworks. Before that I spent a few years in industry with Etsy and Instacart, building machine learning systems for inference and training.

Sahil Khanna [00:01:36]: So that's about me. When we are ready, we can start talking about the talk.

Ben Epstein [00:01:42]: Yeah, fantastic. If you go ahead and share your screen, I'll bring it in and we can kick it off.

Sahil Khanna [00:01:47]: Okay, let me just figure that out. Sorry about that.

Ben Epstein [00:02:03]: No, no worries. Okay.

Sahil Khanna [00:02:11]: Yeah, I think I shared the correct screen.

Ben Epstein [00:02:15]: Yeah, I see your slides. Okay, I'm going to jump off. Take it away when you're ready.

Sahil Khanna [00:02:19]: Okay, let's do that. Yeah. So today I'm going to talk about some of the challenges we faced optimizing GPU usage when we tried to scale generative AI model training. In the following slides we're going to discuss some of those challenges, as well as the solutions and our journey to solving them. So let's get into it. Before we discuss the technical aspects, for those who are not familiar with Firefly:

Sahil Khanna [00:02:55]: I definitely encourage everyone to go to this link and play with it. Firefly is currently a driving force behind a lot of AI capabilities within Adobe applications. For some of the features it supports, I've also added video snippets: you can currently generate videos from text, you can generate images from text, you can give an image and ask Firefly to generate a video, and there is a lot more functionality. In the interest of time, I'm going to skip to the technical part: how we actually power the Firefly website in the back end, and what kind of platform we have developed to let generative AI models train and scale at Adobe's level. So let's just get into that.

Sahil Khanna [00:04:05]: Okay, so first of all, to support the capabilities of the Firefly application, my team has developed a compute platform with the goal of improving developer productivity by simplifying access to GPU instances. However, it's not an easy task. We will touch on the technical and management challenges of making GPUs available for all these jobs, and why that makes it hard to optimize utilization. Before going there, I just wanted to provide a little bit of reference on the scale we are currently working at: we run hundreds of distributed jobs on the platform and we manage thousands of GPU nodes. So let's get into some of the challenges we faced. I've categorized them into four parts here, but this is by no means a comprehensive list; I've only mentioned four for this discussion.

Sahil Khanna [00:05:15]: I'm sure there are other challenges we could also discuss if we had more time. But first of all, I wanted to explain why it's important for us to worry about this problem. The first challenge we have is high demand. We train a lot of generative AI models which help us generate videos, audio, and images, and it is currently very hard to make any progress, or do experimentation or development, if you don't have access to these big GPU machines, like H100s and H200s. So there's very high demand from all the researchers and engineers in the organization.

Sahil Khanna [00:06:08]: Unfortunately, alongside this high demand we have a very limited supply, so we have to reserve capacity in advance, for many years, so that we can fulfill all the demand we have in the company. But with reservations, as you can imagine, comes more of a management challenge: optimizing the cost, making sure we can fairly allocate all the available capacity among different projects, and then efficiently using all the capacity allocated to those projects. And on top of that, the world is not ideal: we also see a very high failure rate in these machines when we run at this scale, so we frequently have to swap bad hardware for new hardware. That also introduces a lot of performance challenges which we have to tackle within the platform.

Sahil Khanna [00:07:09]: So let's first talk about the first two categories, high demand and resource management, and see how they lead to some of the technical requirements on the system. First of all, in order to support this scale, we have to support scheduling jobs on multiple clusters, because we don't want to be limited to a single cluster. In addition to that, we have to ensure that we can provide dedicated capacity to all the critical projects and also reuse any available capacity for non-critical projects when it's free. In order to support these requirements, we ended up developing our own in-house scheduler; we couldn't find any open source solution that supports these three features. One is to schedule jobs across multiple clusters and keep a state of the world where you can share nodes between jobs across clusters. Second is to support quota management, so that you can guarantee capacity for critical workloads.

Sahil Khanna [00:08:26]: And in addition to that, we wanted to support preemption, so that whenever resources are available and not being used, we can use them for non-critical jobs, but we always have the ability to take them back and use them for critical jobs when those need them. All of these features work great for production training jobs, but they created more challenges when we started supporting development jobs. So we're going to talk a little bit about what kind of challenges we encountered when we started supporting the development workflow. In order to support collaboration on the platform, we started supporting interactive sessions. The whole process is that people can create an interactive session in a Jupyter notebook, get access to these GPU nodes, and do their development and experimentation on those nodes. The main challenge with this kind of workflow is that these sessions are mostly idle when people are not working in them or not running anything.
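
To make the quota-plus-preemption idea concrete, here is a minimal Python sketch of a multi-cluster scheduler that reserves guaranteed capacity for critical jobs and evicts best-effort work to make room. The class names, fields, and numbers are invented for illustration; this is not Adobe's scheduler.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    gpus: int
    critical: bool                   # critical jobs count against guaranteed quota
    running_on: Optional[str] = None

@dataclass
class Cluster:
    name: str
    free_gpus: int
    preemptible: List[Job] = field(default_factory=list)  # non-critical jobs we may reclaim

class GlobalScheduler:
    """Toy multi-cluster scheduler: guaranteed quota for critical work, preemption for the rest."""

    def __init__(self, clusters: List[Cluster], critical_quota: int):
        self.clusters = clusters
        self.critical_quota = critical_quota   # GPUs reserved for critical projects (assumed number)
        self.critical_in_use = 0

    def schedule(self, job: Job) -> bool:
        if job.critical and self.critical_in_use + job.gpus > self.critical_quota:
            return False                       # over the guaranteed quota for critical work
        for cluster in self.clusters:
            # Preempt non-critical jobs only to make room for a critical one.
            while job.critical and cluster.free_gpus < job.gpus and cluster.preemptible:
                victim = cluster.preemptible.pop()
                cluster.free_gpus += victim.gpus   # victim would be snapshotted and requeued
            if cluster.free_gpus >= job.gpus:
                cluster.free_gpus -= job.gpus
                job.running_on = cluster.name
                if job.critical:
                    self.critical_in_use += job.gpus
                else:
                    cluster.preemptible.append(job)
                return True
        return False                           # no cluster can fit the job right now

# Example: a critical job evicts best-effort work when the cluster is full.
sched = GlobalScheduler(
    [Cluster("us-east-a", free_gpus=0,
             preemptible=[Job("dev-notebook", gpus=8, critical=False)])],
    critical_quota=64)
print(sched.schedule(Job("video-model-train", gpus=8, critical=True)))  # True
```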

Sahil Khanna [00:09:38]: They also contain a lot of custom state, because when someone is working interactively on a machine, they set up a lot of environment on that machine, and simply preempting it is not ideal because it's very disruptive to the user's workflow; they can lose the whole configuration they spent hours setting up. So we had to solve this problem, and to do that we worked on two main features. One feature we had to build was snapshotting. The whole idea is that we take a snapshot of the environment state of your job before we actually stop the job. The advantage is that the next time someone wants to start a new job, they can resume within a few seconds from the last environment state and don't have to spend hours setting up the environment again.
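
A rough sketch of that snapshot-and-resume idea: archive the session's working state before preemption so a later job can restore it in seconds. The snapshot location, file layout, and helper names below are assumptions for illustration, not the platform's actual format.

```python
import json
import tarfile
import time
from pathlib import Path

SNAPSHOT_ROOT = Path("/snapshots")   # hypothetical snapshot store (could be object-storage backed)

def snapshot_session(session_id: str, workdir: Path) -> Path:
    """Capture the user's working directory and an environment manifest before preemption."""
    SNAPSHOT_ROOT.mkdir(parents=True, exist_ok=True)
    archive = SNAPSHOT_ROOT / f"{session_id}-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workdir, arcname="workdir")   # notebooks, checkpoints, installed packages, etc.
    manifest = {"session": session_id, "created": time.time(), "archive": str(archive)}
    (SNAPSHOT_ROOT / f"{session_id}.json").write_text(json.dumps(manifest))
    return archive

def resume_session(session_id: str, target: Path) -> None:
    """Restore the last snapshot onto a freshly allocated node so the user resumes in seconds."""
    manifest = json.loads((SNAPSHOT_ROOT / f"{session_id}.json").read_text())
    with tarfile.open(manifest["archive"], "r:gz") as tar:
        tar.extractall(target)
```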

Sahil Khanna [00:10:40]: This actually allowed us to support preemption effectively, without giving a bad user experience across the cluster, and to optimize the usage of those nodes for all the jobs we have. In addition to that, we also started implementing a lot of policies in our system. One policy we implemented was a reclaim policy, where we reclaim resources when jobs are idle and use those resources for other applications in need. This helped us cut down idle GPU time and get better performance overall out of all the resources we have. All of the features I've shown so far in this journey have helped us support both the production and development workflows. And everything would be great if the world were ideal. Unfortunately, that is not the case.
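
A small illustrative sketch of what such a reclaim policy could look like: sample GPU utilization, mark a session idle after a grace period, and hand its GPUs back to the scheduler. The thresholds and the metric source are assumptions.

```python
import time
from dataclasses import dataclass

IDLE_UTIL_THRESHOLD = 0.05        # assumed: below 5% GPU utilization counts as idle
IDLE_GRACE_SECONDS = 30 * 60      # assumed: reclaim after 30 idle minutes

@dataclass
class Session:
    session_id: str
    last_active: float            # last time utilization exceeded the threshold

def observe(session: Session, gpu_utilization: float, now: float | None = None) -> None:
    """Update the session's activity timestamp from a utilization sample (e.g. from a node agent)."""
    now = now or time.time()
    if gpu_utilization > IDLE_UTIL_THRESHOLD:
        session.last_active = now

def should_reclaim(session: Session, now: float | None = None) -> bool:
    """Reclaim the GPUs once the session has been idle longer than the grace period."""
    now = now or time.time()
    return (now - session.last_active) > IDLE_GRACE_SECONDS

def reclaim_idle(sessions: list[Session]) -> list[str]:
    """Return the sessions to snapshot, stop, and hand back to the scheduler."""
    return [s.session_id for s in sessions if should_reclaim(s)]
```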

Sahil Khanna [00:11:47]: There are a lot of challenges around reliability when it comes to using shared infrastructure on the cloud, and that brings us to the last set of challenges we encountered, which affected both the user experience and the performance of our training jobs. The machines we use on the cloud often encounter hardware failures; they have configuration issues where you can't schedule jobs on them, they overheat, and there are connectivity and many other issues. All of these lead to longer training times, sometimes disruptions in training, and low GPU utilization, because we end up spending most of the time recovering and sitting idle. These challenges made us build our in-house auto-recovery system. The whole idea behind this system is that there's a centralized brain which monitors the progress of a job.

Sahil Khanna [00:13:04]: It also gathers data from the nodes about their health, and data from other parts of the system, and it processes the failures that are happening in order to decide how to recover from different failures inside the job and auto-resume training. This has really helped us ensure that we successfully complete our training, because generative model trainings are very, very long; they take days or weeks to complete. This system helped us make sure we complete them successfully and also complete them faster, optimize all the resources we have, and recover faster from the issues caused by bad hardware or connectivity.
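
A hedged sketch of such a centralized recovery brain: poll health and training-progress signals, classify the failure, and either replace the node or restart from the last checkpoint. The signal names, timeouts, and actions are illustrative only, not the actual system.

```python
import time
from enum import Enum, auto

class Action(Enum):
    WAIT = auto()
    RESTART_PROCESS = auto()   # transient error: relaunch training from the last checkpoint
    REPLACE_NODE = auto()      # bad hardware: cordon the node, request a replacement, resume

def diagnose(node_health: dict, last_step_time: float, now: float,
             stall_timeout: float = 900) -> Action:
    """Very rough failure classification from node-health and training-progress signals."""
    if node_health.get("xid_errors") or node_health.get("gpu_missing"):
        return Action.REPLACE_NODE          # GPU fell off the bus / uncorrectable errors
    if now - last_step_time > stall_timeout:
        return Action.RESTART_PROCESS       # job stopped making progress (hang, connectivity, ...)
    return Action.WAIT

def recovery_loop(get_health, get_last_step_time, replace_node, resume_training, poll_s: int = 60):
    """Centralized watcher: keeps long-running training jobs moving without human intervention."""
    while True:
        action = diagnose(get_health(), get_last_step_time(), time.time())
        if action is Action.REPLACE_NODE:
            replace_node()
            resume_training()               # resume from the latest checkpoint on fresh hardware
        elif action is Action.RESTART_PROCESS:
            resume_training()
        time.sleep(poll_s)
```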

Sahil Khanna [00:14:01]: So all of these challenges led to this architecture. At the top, on the extreme left, we have all the user-facing interfaces: we support a UI, a Python SDK, and a CLI, which let users interact with our APIs. Then we have a layer where we store all the data gathered from the user's requirements and generate events. All of these events trigger actions by interacting with different components in the system. Take the example of a job and how it gets scheduled. The first component which takes an action is the global scheduler. It reads the requirements of the job, and it has a state of the world: it knows how many jobs are running, which nodes are available, how much quota we can provide to this project, and so on.

Sahil Khanna [00:15:06]: Based on that, it decides whether to schedule this job and where. Then that action goes to a cluster manager. We have one manager per Kubernetes cluster, and each cluster manager's responsibility is to talk to the Kubernetes control plane and schedule the actual pods that run the job. The cluster manager takes the action from the scheduler, it knows the spec, and then it schedules the job with all the required pods and infrastructure around it. We also run a few more agents on each node and for each pod, so that we can keep collecting data about the node as well as the pod, and these agents also help us run the different actions I mentioned previously.
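
As a sketch of that cluster-manager step, the snippet below uses the Kubernetes Python client to turn a scheduled job spec into GPU worker pods. The namespace, labels, image, and environment variables are placeholders, not Adobe's actual pod spec.

```python
from kubernetes import client, config

def launch_worker_pods(job_name: str, num_workers: int, gpus_per_worker: int,
                       image: str, namespace: str = "training") -> None:
    """Create one worker pod per rank for a scheduled training job."""
    config.load_kube_config()              # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()
    for rank in range(num_workers):
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(
                name=f"{job_name}-worker-{rank}",
                labels={"job": job_name, "rank": str(rank)},
            ),
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image=image,
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": str(gpus_per_worker)},
                    ),
                    env=[client.V1EnvVar(name="RANK", value=str(rank)),
                         client.V1EnvVar(name="WORLD_SIZE", value=str(num_workers))],
                )],
            ),
        )
        core.create_namespaced_pod(namespace=namespace, body=pod)
```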

Sahil Khanna [00:16:06]: For instance, if we want to take a snapshot, we talk to the node manager, and the node manager takes a snapshot of the local data on that node and puts it in S3. Similarly, we talk to the pod agent if we want to capture metrics, run tracing, and things like that. This is how all these components interact with each other, at a very high level. I didn't have enough time to include more details, so I only put the high-level view; hopefully I can answer questions about this architecture if we have more time.
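
For the per-node and per-pod agents, here is a minimal example of the kind of metric collection they might perform, reading GPU utilization and memory through NVML. Where the metrics are shipped is left out, and the reporting interval is an assumption.

```python
import time
import pynvml   # pip install nvidia-ml-py

def collect_gpu_metrics() -> list[dict]:
    """Sample per-GPU utilization and memory on this node via NVML."""
    pynvml.nvmlInit()
    metrics = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        metrics.append({"gpu": i, "util_pct": util.gpu, "mem_used_gb": mem.used / 2**30})
    pynvml.nvmlShutdown()
    return metrics

if __name__ == "__main__":
    while True:
        print(collect_gpu_metrics())   # a real agent would ship these to the control plane instead
        time.sleep(30)
```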

Sahil Khanna [00:17:00]: Before ending this presentation, I wanted to say thanks. Obviously there are so many people who have contributed to this platform; I've only mentioned a few of them who actively develop it. There are many more who made invaluable contributions, and because of the space constraint I couldn't list all of them. So thank you, everyone, for listening to me, and hopefully you have some questions I can answer.

Ben Epstein [00:17:28]: Wowza. Thank you so much. That was such an awesome way to end, at least, our track of the conference. Before asking any questions or pulling from the audience, I'm curious: like you shared for other people, do you have any personal links, LinkedIn or GitHub, or any place where people can follow you and learn more?

Sahil Khanna [00:17:47]: Yeah, I did add it here, but let's see. I just want to make sure. Yeah, so you can see my LinkedIn profile. So you can definitely reach out to me on LinkedIn.

Ben Epstein [00:18:03]: Sweet. Very cool. Okay, we brought that back on screen. That's awesome. I'm curious, and I asked the same question in another talk, but what are some of the most interesting applications that you personally have been leveraging as a user, like dogfooding the platform?

Sahil Khanna [00:18:23]: That's a very good question. Because our problems are very complex at scale, most of Adobe's systems are in-house. That said, there are a few systems which we have used and experimented with at Adobe, as well as in my previous jobs. One common system is Ray, for running distributed jobs. We heavily use Torch for training; Torch adoption has increased significantly recently, and Adobe is heavily invested in it. We even contribute toward fixing bugs.

Ben Epstein [00:19:13]: Oh, that's super cool. Anything on the other side, like the most interesting models, diffusion models or vision models, that have come out through this platform that you've gotten to watch happen?

Sahil Khanna [00:19:25]: Yeah. Application-wise, even this thank-you slide, I used Firefly to create the image. I've used it to create a lot of symbols for my projects, like mascots and things like that. I don't have the innovative edge in me, so I use Firefly very heavily to get something interesting out there for my creative pursuits.

Ben Epstein [00:19:55]: That's very cool. So is Firefly the platform behind that? I think I saw Adobe release a feature where you could take SVGs and orient them using AI, like you could reposition them somehow in three dimensions. Was that through Firefly?

Sahil Khanna [00:20:18]: Yeah. Firefly currently has the basic things, where you can generate new videos and audio, but it also integrates with Adobe's other solutions, like Photoshop, where with prompting you can make edits and do a lot of these transformations automatically. All of these models are powered and hosted by Firefly, but they're integrated with all the Adobe enterprise solutions.

Ben Epstein [00:20:47]: That makes sense. That's very cool. We got a question from Raul, and it's actually a little bit similar to another question I had. He said: when you mentioned cross-cluster training, how hard is backpropagation? How difficult does that become, and how did you solve it?

Sahil Khanna [00:21:03]: Right now we try to bin-pack each training job within a cluster. Because we have so many nodes, a single cluster doesn't allow us to run that many, so we had to create multiple clusters in order to support this scale. But per job we don't have that much scale right now, so we try to fit a single job within a cluster, and especially within a zone, so that we can use the fastest connectivity possible and get the maximum throughput.
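
A toy illustration of that placement policy: prefer the tightest zone that can hold the whole job, so the workers stay on the fastest interconnect. The cluster and zone data below are made up.

```python
from typing import Optional

def place_job(required_nodes: int,
              clusters: dict[str, dict[str, int]]) -> Optional[tuple[str, str]]:
    """clusters maps cluster -> {zone: free_nodes}; returns (cluster, zone) or None."""
    candidates = []
    for cluster, zones in clusters.items():
        for zone, free in zones.items():
            if free >= required_nodes:
                candidates.append((free, cluster, zone))
    if not candidates:
        return None                        # job must wait (or be split, with a bandwidth penalty)
    free, cluster, zone = min(candidates)  # tightest fit: a classic bin-packing heuristic
    return cluster, zone

print(place_job(16, {"us-east": {"a": 20, "b": 64}, "us-west": {"a": 12}}))  # ('us-east', 'a')
```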

Ben Epstein [00:21:44]: Yeah, that makes sense. And it seems like he was asking about network bandwidth, but it sounds like bin packing is the way you're getting around that.

Sahil Khanna [00:21:50]: Yeah, bin packing is how we get around the issue.

Ben Epstein [00:21:56]: I had a question: you had a whole slide around being fault tolerant, obviously. I think a lot about GPUs from Modal, because they have a lot of very cool serverless functionality, and I'm sure you guys have built a lot of that in-house. How frequently are your GPUs failing, such that you had to build that resiliency? In a typical training job, how many are you using, and what percent are failing? What are some of those numbers, if you can share them?

Sahil Khanna [00:22:25]: I don't have stats, but normally in every training run there are at least a few incidents every day where either a GPU will overheat and won't be available, or the node doesn't have the GPU anymore, or there are connectivity issues between nodes, so we have to recycle them automatically so that training can keep moving. We are currently on AWS. But yeah, right now I don't have stats around it; we just deployed this auto-recovery system, and we're collecting more stats.

Ben Epstein [00:23:03]: That's very cool. And you guys aren't using any open source platforms for this training? You rolled it yourself?

Sahil Khanna [00:23:12]: We use Torch and Torch Elastic, but the platform which provisions the infrastructure is in-house. We did recently encounter some issues with Torch Elastic and how it interacts with the platform, so we are also trying to see how we can either integrate it more closely within the platform or create something similar for our use case.
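
Not Adobe's setup, but for context: elastic restarts (for example via torchrun's max-restarts option) rely on the training job checkpointing and resuming, roughly like the generic sketch below. The checkpoint path is an assumed shared location.

```python
import os
import torch

CKPT = "/shared/checkpoints/latest.pt"   # assumed shared storage visible to all workers

def save_checkpoint(model, optimizer, step):
    """Persist enough state that a restarted worker can pick up where training left off."""
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT)

def load_checkpoint(model, optimizer):
    """Called at startup; every elastic restart resumes from the last saved step."""
    if not os.path.exists(CKPT):
        return 0                          # fresh run
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]
```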

Ben Epstein [00:23:36]: That's awesome. Very cool. Well, thank you so much for coming on. It was a really great last talk. We appreciate it. Thank you very much.
