Efficient GPU infrastructure at LinkedIn
SPEAKERS

Executive Director, AI and ML Platform at LinkedIn | Ex IBM Senior Director and Distinguished Engineer, Watson AI and Data | Founder at Kubeflow | Ex LFAI Trusted AI NA Chair
Animesh is the Executive Director leading the next-generation AI and ML Platform at LinkedIn, enabling the creation of the AI Foundation Models Platform serving the needs of 930+ million LinkedIn members. He builds distributed training platforms, machine learning pipelines, feature pipelines, metadata engines, and more, and leads the creation of the LinkedIn GAI platform for fine-tuning, experimentation, and inference needs. Animesh has more than 20 patents and 50+ publications.
Past IBM Watson AI and Data Open Tech CTO, Senior Director, and Distinguished Engineer, with 20+ years of experience in the software industry and 15+ years in AI, data, and cloud platforms. Led globally dispersed teams, managed globally distributed projects, and served as a trusted adviser to Fortune 500 firms. Played a leadership role in creating, designing, and implementing data and AI engines for AI and ML platforms, led Trusted AI efforts, and drove the strategy and execution for Kubeflow and OpenDataHub, and execution in products like Watson OpenScale and Watson Machine Learning.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Animesh discusses LLMs at scale, GPU infrastructure, and optimization strategies. He highlights LinkedIn's use of LLMs for features like profile summarization and hiring assistants, the rising cost of GPUs, and the trade-offs in model deployment. Animesh also touches on real-time training, inference efficiency, and balancing infrastructure costs with AI advancements. The conversation explores the evolving AI landscape, compliance challenges, and simplifying architecture to enhance scalability and talent acquisition.
TRANSCRIPT
Animesh Singh [00:00:00]: Hi, this is Animesh Singh. I'm a director at LinkedIn, leading a GPU fleet and infrastructure for training and inferencing. In addition, I lead the distributed training software stack across all of LinkedIn, additionally contributing to and leading a lot of initiatives around ML efficiency and performance optimizations. Prior to that, I was a distinguished engineer and senior director at IBM, leading a lot of core Watson initiatives. Woo.
Demetrios [00:00:30]: I'm bubbling from this conversation. So many gems when it comes to working with LLMs at scale, GPU infrastructure, what you want to be optimizing for, and how you can think about optimizations. And I had an incredible question at the end: how does the platform and the GPU infrastructure that you're dealing with differ when it comes to working on LLMs versus traditional ML? He did not disappoint. Let's get into the conversation. Well, let's start with this, man. And I'm so happy that we got to do this because we've been juggling around our calendars to make this work for probably about six months. I think we were going to do this when I was last in San Francisco in June, but one thing led to another and here we are in 2025 having the conversation. Finally, persistence paid off.
Animesh Singh [00:01:43]: I think this is great. And I've been following your work throughout as well. You're doing excellent work in terms of, you know, bringing communities together and disseminating that knowledge. Right. Like what's all happening in the AI space, right. And what use cases are springing up, what are the industries they are targeting? There is some excellent work you are driving in that and I'm so glad it's happening in 2025. I feel, you know, we have now some better experience of, you know, what is working, what is not working, what may be a little bit of the hype, right? What is realistic, what is going to be the trends in 2025. So I think the timing is working out.
Animesh Singh [00:02:27]: Yeah.
Demetrios [00:02:28]: What is working?
Animesh Singh [00:02:31]: Definitely, I think one thing which has definitely proven that it's here to stay is LLMs. Right. I feel throughout 2022, 2023, there was a lot of discussion about how effective LLMs are going to be in the industry, in the space. There are a bunch of modeling architectures, like recommendation ranking, LPMs, graph neural networks, GNNs. Right. But the efficacy of LLMs and the use cases being powered by LLMs was quite a bit of a question mark, right? Like, yes, there was a promise. Then we saw that magic moment with ChatGPT coming in that literally woke everyone up, right? Hey, it does seem seamless.
Animesh Singh [00:03:30]: It does seem that it's not yet another chatbot you are talking to. And that sprung up the industry, right? When I joined LinkedIn, at that time the ChatGPT moment hadn't happened. And as soon as I joined, a month later, ChatGPT came in, and a lot of what I came in here to do changed within a period of a month. And I think through the course of that period, multiple companies, multiple industries have identified different use cases which are working well with this, right? And people are being productive, be it generating code, be it being able to do certain automation leveraging this. The interface does seem very human-like. And a lot of the generative AI use cases which we launched even on LinkedIn, for example, profile summarization, right? So based on what you have, create a headline for me, create a summary for me. Use cases like an assistant for LinkedIn Learning courses, use cases like résumé-targeted recruiter emails for candidates. Because a lot of the things which we had seen: if you're getting cold-call emails from recruiters, they are not hyper-personalized, right? At times they sound like a template. Generative AI and LLMs helped us immensely in that, where they take into account the candidate's profile, the company in which the candidate is working, the company from which the recruiter is, and create these very personalized emails. And we are seeing success: candidates are actually responding much more, opening those emails. So I think a bunch of use cases we have seen working really well with LLMs, and we are obviously doubling down.
Animesh Singh [00:05:27]: If you see the talk of 2025, and even before that last year, there was quite a bit of discussion on agents, what they can and cannot do. And we did our own experiments, right? We invested a lot internally in terms of building agent infrastructure. First of all, what does it take to create agents? How is it different from the traditional generative AI applications or use cases we were building? What are the nuances? What makes it different? And then finally we launched, closer to the last quarter of last year, LinkedIn Hiring Assistant, which is essentially an agent for recruiters which, based on certain criteria they define, will actually go work behind the scenes, find relevant candidates for them, summarize their experience and profile to the recruiters, and then based on that they can say, okay, reach out to these particular candidates and let's start having a discussion. And there is much more. We are doubling down on that whole LinkedIn Hiring Assistant and we have seen some great, great enthusiasm from our partners and customers. Right. So it's seeing some very good results. So there is much more we are doing now on other areas of LinkedIn which will be powered by agents. So they are here to stay.
Animesh Singh [00:06:54]: I think it's how in which use cases really benefit from it that will be a nuanced discussion. You cannot throw it at every single thing. But you know, there are so many things which can be just powered by them, which then frees you up to do things right, which are probably the more creative aspects of the work you are doing. Right? And we are seeing quite a bit of that happening.
Demetrios [00:07:21]: So what's not working?
Animesh Singh [00:07:23]: I wouldn't say what's not working, but a thing which probably needs a lot of improvement moving forward is the cost and the ROI of launching these LLM-based use cases, either agentic use cases or traditional, prescriptive RAG-based generative AI apps, or even leveraging LLMs for use cases beyond generative AI. The cost is a big hindrance. I think, one, if you see, training itself used to be the biggest barrier to entry, and that got solved to some extent because, first of all, even within LinkedIn we invested quite heavily in building out our scale-out training infrastructure which can power LLMs. I think when I came in, we were working off a V100 fleet, and since then we have scaled our LinkedIn fleet by 7x. We have A100s, we have H100s, we have H200s, and the fleet is as modern as it could be. And we have scaled our training tremendously, right? It's a 150x increase in the size of the models we are training. Our data processing has increased manyfold. We actually completed 1 million training runs on the platform and we're training big foundation models now. Investment in the infrastructure went in, and I think it was well understood for companies at the scale of LinkedIn, et cetera, where a lot of content-based data is being consumed and produced, you will have tons of data, right? And when you have these tons of data, you need to make sure that you have the infrastructure to train models on that data. Right? And then the other thing which actually helped a lot of the training landscape is the emergence of open source models.
Animesh Singh [00:09:26]: I think Meta led the way, followed by many others in the industry, where for smaller companies, or for companies where there is not a need to train a model on world data, you have your own specific data which is probably not as big as the world data. Getting these open source models and starting on top of them. Because these models already know what exists in the world. They can answer your questions. They've crawled through Wikipedia, they've crawled through public libraries, all the articles, and they've been trained on that. Then you can bring in and do more fine-tuning instead of training on huge amounts of data. So then the infrastructure cost can go down further. So fine-tuning became a big mechanism.
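As a rough illustration of the fine-tuning-on-top-of-an-open-source-model approach described here, the sketch below uses Hugging Face Transformers with a LoRA adapter via PEFT. The base model name, dataset path, field names, and hyperparameters are placeholder assumptions, not LinkedIn's actual setup.

```python
# Minimal sketch: fine-tune an open-source base model on your own data with LoRA,
# so only small low-rank adapters are trained instead of all base parameters.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-2-7b-hf"          # hypothetical base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adapters keep the fine-tuning footprint far below full pre-training cost.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

data = load_dataset("json", data_files="domain_corpus.jsonl")["train"]  # your own data (placeholder)
data = data.map(lambda ex: tok(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```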
Animesh Singh [00:10:16]: Plus the emergence in the industry of a lot of techniques around supervised fine-tuning, zero-shot and few-shot learning, prompt optimization techniques, which essentially brought the cost down heavily on the training side. What is now much more in the picture is the cost of inferencing, right? And I think it's humongous, it's big, and there are a lot of efforts which we are doing, which the industry is also doing overall, to bring down the cost of inferencing these models. Now, if you take a look at generative AI use cases specifically as well, there is some thinking which is built in. When you are interacting with a model where you are asking certain queries, asking it to analyze, you are prepared mentally that it would take some time, it will think through it, right? Even with the emergence of the latest OpenAI models, where there is a lot of reasoning going on, a lot of it is actually analyzing its own output, then refining its own output, then the second output is further analyzed, there is a lot of back-and-forth reasoning. So there are multiple inferencing calls happening. And I think as a consumer we are prepared, if we are going into a scenario in this particular context, that the model may have some thinking time. Specifically if you are interacting with the latest OpenAI models, et cetera, then you know that you're asking complex queries which need that analysis.
Animesh Singh [00:11:48]: So there is a latency you are willing to tolerate. Now, even to get to that latency, there is tons of infrastructure investment which has been made, right? The general realization is that even with all these investments, we are not able to get our GPUs to perform at maximum utilization, right? Inferencing is becoming very costly because you are optimizing a lot for latency and throughput. And you have a lot of failover mechanisms which you need to build, which almost any company needs to account for: hey, I have two or three data centers, right? If one data center goes down, there is a lot of redundancy you need to build for applications which are user facing. That means the cost of GPUs is ballooning, right? And that essentially is an infrastructure problem which needs to be solved. Specifically now, when we take LLMs into use cases where the consumer appetite for latency might not be there at all. There are efforts happening across the industry, like, hey, the traditional RecSys, right? Recommendation ranking. So, for example, you go to social media sites, you get recommendations, you get feed, you get people you want to connect with. All these things, as soon as you go to the site, should be there, right? As you're scrolling through the feed, the feed should just be updated and customized for you in real time. There is no appetite for latency in those scenarios.
Animesh Singh [00:13:32]: Now if you need to see if LLMs can be effective in that world, that means you really, really need to optimize for latency. And if you're really doing that, you are potentially throwing more money at the problem. And so that problem, how do we take this large language model architecture, the transformer architecture, and make it really, really optimized for inferencing, is becoming a big thing, right? Which needs to be solved for scale, right?
Demetrios [00:14:05]: Do you feel like it's a bit of trying to fit a round peg in a square hole? Because when you throw LLMs at, like, a RecSys problem, just because we've been doing RecSys for a while and we figured out how to make it real time, why do we need to add an LLM on top of it? Almost, in my eyes, it's overcomplicating things just to try and use a shiny new tool. But maybe you've seen there's better performance, there's better personalization there, or something that I haven't.
Animesh Singh [00:14:40]: I think, rather than speaking just on my behalf, in general there are research papers, emerging companies trying that. Now, why would you try something like that, right? I mean, and that's a fair question, right? RecSys is already well established, right? And as an architectural pattern, the traditional recommendation ranking models, retrieval models, including something like graph neural networks, GNNs, et cetera, they do a very, very solid job at this. And you have seen companies like.
Demetrios [00:15:17]: And it's fast, it's fast.
Animesh Singh [00:15:18]: And like you take a look at TikTok; oftentimes the recommendation algorithm is talked about. So you're fairly right, like, hey, why is that? I think there are a couple of things here, heuristically. So the way we have solved recommendation ranking problems in the industry is obviously you have created models. A lot of the companies have smaller models, right? These are not traditional foundation models which have been trained on world data, et cetera. Right. They have potentially not seen a lot of user interactions and patterns. So then you add things like real-time training. There is a lot of data being ingested in real time.
Animesh Singh [00:16:05]: Online training is happening. There is a lot of feature ingestion which is happening in real time. What is the user interacting with? So there is this new paradigm, okay, which is: these models, the LLM models, these are foundation models. They have potentially seen maybe 95% of the patterns. So maybe what you need to do in real time to update these models is probably a lesser investment. Right. The models have seen the majority of the patterns, and if you feed in what the user has done, right.
Animesh Singh [00:16:39]: Like, they would be able to predict much more comprehensively. You don't need to do a lot of online training, et cetera, in real time. That's one line of thinking. The other thing is the simplification of the architecture. For something like GNNs, when you are doing the training of GNNs, all the data is in the graph format, graph structure format. You need to traverse the nodes and the edges in real time because you don't know beforehand how much data you will be processing, which will be the right node, which edge you need to traverse. So there is no data pre-processing happening ahead of time.
Animesh Singh [00:17:17]: And there is a different architecture. And it's inherently hard to scale GNNs beyond a certain limit because there is live data processing happening while the training is going on or while inferencing is going on. Right. So it's a different architecture, and GNN is one example. There are different recommendation ranking architectures. And then companies would have a proliferation of these recommendation ranking models, right? Every team would create, for each of their use cases, they would start from scratch and create a model which is potentially a small model. It does a very targeted job, it does a really good job at it, right. And then build all these things.
Animesh Singh [00:17:59]: So you have bespoke models with different architectures. There are huge numbers of them. If you take the LLM route, the trend which is emerging in the industry is, hey, I do create a giant foundation model. And, as is obviously very present in what's happening, distillation as a technique is becoming very prominent in the industry. I will create smaller models for inferencing, but I will distill them from this giant foundation model. So what you have done is, you know, you have sort of centralized the creation of models in potentially a central place. So you can think of a scenario in the future, right? There is one central team, instead of having every different use case and every different vertical within your org creating their own models for their own use cases which are very targeted. One central team which is, you know, the holder of your organization's data, which is curating that whole data and creating, you know, maybe one or two or three.
Animesh Singh [00:19:01]: Okay, very simplistic scenario. You're creating two giant foundation models, one for generative AI use cases, one for non generative AI use cases. And then, you know, the models for all particular use cases are being distilled from these. So you have simplified the overall architecture. You are essentially worrying mostly about compliance. Compliance is a big thing as well, as you are seeing, right? There was the DMA, there is the AI Act, there will be other acts which will be coming. If you curtail the surface area, how many models you have, what data they were trained on, then you're not worrying about the many hundred other models which your organization has created and making sure every one is compliant: what data it was trained on, was that data compliant. You have sort of centralized and simplified that problem. Hiring becomes easy, right? Right now you do tend to hire, when I'm looking at certain use cases like GNNs, people who can run GNNs at scale, right? You need a very targeted skill set, people who understand how to target graph data.
Animesh Singh [00:20:12]: And then you need to go into the depths of the GPU architecture, because there is a lot of data transfer happening during that training. So, okay, is NVLink good enough? How much HBM memory do I have, how much on-disk memory do I have on the GPU nodes, then what is the network bandwidth? There is a lot of in-depth GPU knowledge you will require just to go there and start solving. Plus the graph traversal: what algorithms do I need to introduce? So your sourcing of skills and talent also becomes simplified. So there is that overall value proposition which could be achieved, provided LLMs do prove themselves. I think many are looking at this problem space and figuring that out. So it's not yet solved. And as I said, unlike generative AI as well, this is a problem of scale.
Animesh Singh [00:21:13]: Generative AI use cases, you have to explicitly, as a user, go and invoke. I'm giving you a LinkedIn example: you will go and say, hey, summarize my profile. Or a recruiter will say, launch this particular agent. Or a LinkedIn Learning user will go and say, summarize this course for me, or explain to me this nuance. So these are discrete transactions. Users going to the feed, logging into the feed, browsing through the content, this is happening all the time. It is happening at scale, right? So that's a much bigger scale problem which needs to be solved with LLMs overall, in a cost-effective manner.
Animesh Singh [00:21:57]: And in a very, very latency-sensitive manner. So that's the thing, right, which needs to happen, and we'll see whether they do prove efficacy. Based on all the results you've seen, they're able to pass a lot of bar exams, PhD exams, math exams. Right. So you'll see, hey, they are intelligent, right? Can they be very intelligent for this specific set of targeted problems? Yeah.
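As a toy sketch of the distillation idea Animesh describes a few turns above (one large teacher "foundation model", smaller task-specific students distilled from it), here is a minimal PyTorch training loop. The model sizes, temperature, and loss weighting are illustrative assumptions only.

```python
# Toy knowledge distillation: a small student learns to match a frozen teacher's
# softened output distribution, plus ordinary cross-entropy on hard labels.
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(torch.nn.Linear(128, 1024), torch.nn.ReLU(),
                              torch.nn.Linear(1024, 10)).eval()   # stand-in "foundation model"
student = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 10))            # small model that will serve traffic
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
T = 2.0  # softening temperature

for step in range(100):
    x = torch.randn(32, 128)                 # a batch of features / embeddings (random here)
    y = torch.randint(0, 10, (32,))          # hard labels, if available
    with torch.no_grad():
        t_logits = teacher(x)                # teacher is frozen
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    loss = 0.5 * kd + 0.5 * F.cross_entropy(s_logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
```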
Demetrios [00:22:22]: I hadn't thought about the simplicity in the architecture and also the simplicity in being able to attract talent that understands this architecture, because it is more simplified and you don't have these very specialized, deep, deep tech type of roles. So you can almost get a much broader choice of talent. Now, going into GPUs themselves: you were talking about the cost of having GPUs at scale, and when you have this many GPUs and you're trying to utilize them, you don't want to have any percentage going idle. I imagine you think about that a lot. You're thinking, wow, we're burning money just letting this GPU sit around and we're not utilizing it to its maximum capacity. Is that what you're trying to do, or is that what Liger kernels are trying to help out with? Can you explain that?
Animesh Singh [00:23:32]: Yes, and I think for any infra and platform team which is in ML infra at this point, if you talk to them, this is a burning thing, right?
Demetrios [00:23:44]: Keeps you up at night.
Animesh Singh [00:23:47]: On one hand, you cannot run these LLMs, for example, or these modern recommendation ranking models, without GPUs. Right? Training, definitely. Right. And even with inferencing, now with LLMs you have to go on GPUs. The general trend in the industry, if you rewind two years ago, was, hey, we will train on GPUs but serve on CPUs, right? That's how most of the companies had architected it, and CPUs were not that expensive. The modern architecture doesn't lend itself to that. Right? So the whole investment in GPU efficiency becomes very, very paramount.
Animesh Singh [00:24:30]: And it starts at every single layer. So, in our case, okay, there is the general thing which also happens in companies: you need to allocate a maintenance budget. Now, you can be much more generous with CPUs in terms of maintenance budgets. How much spare capacity do you allocate so that if 20% of your fleet is being maintained on a regular basis, that's the delta you need? If you need high availability, you need to spread it across three data centers. Okay, let's go ahead and do it, right? Those decisions were much easier. They're becoming much harder when you have GPUs in house. Because, yes, there are certain companies which have potentially invested a lot, right? And they have deep coffers, but a lot of the companies want to be cost conscious about it, right? So every decision, even to the point of looking at certain use cases: what would be the right maintenance budget? If you have, let's say, in one maintenance zone, a thousand GPUs, can I allocate 100 just for maintenance? Then you do the cost and it's like, hey, there's a lot of wastage which is going on. You start sweating, you start thinking.
Demetrios [00:25:43]: About how much money that is.
Animesh Singh [00:25:45]: And yeah, so how do you, first of all, start looking at the workloads so that they become more resilient to maintenance? Okay, I cannot take the previous approach, which used to be: once I know 20% of my fleet is going to be maintained, I will not schedule anything, right? We don't need to worry too much for the next 24 hours, even if they're empty. Now you're worrying that I cannot leave 200 GPUs lying empty for 24 hours just because the maintenance is going to happen. So then you look at your workload: okay, how can I make the workloads more resilient? Now, if you look at distributed training workloads, the majority in the industry are gang scheduled, right? So you do gang scheduling; you know beforehand this distributed training workload will take X amount of GPUs, right? And there is a lot of synchronization happening as the distributed training is going on. And if a few of the nodes fail, you will have to put it back in the queue. You obviously need to have checkpoint restore, put it back in the queue, do gang scheduling again so that it can resume. So then you look at this workload: okay, maybe in this world we have to rethink the whole notion of gang scheduling. Do we need to build more elasticity? If a workload has been launched on 100 GPUs and 30 GPUs are going to be taken away, can it shrink? Can it expand on its own? Right. And continue going.
Animesh Singh [00:27:16]: Same for inferencing. I think the need for scale-down, scale-up is becoming big. Serverless architecture has to become a prominent part of the scheme of things. You cannot optimize for max traffic, and a lot of us are doing that. A lot of the industry is doing that, where you take what is the peak traffic you get and that's your capacity. But if, out of 24 hours, 2 hours is your peak traffic, for the remaining 22 hours these GPUs are sitting idle because they have been provisioned to handle that peak traffic. Again, a lot of wastage.
Animesh Singh [00:27:53]: So many of these decisions are now becoming very prominent. So I feel the emergence of elastic architecture, serverless architecture to handle workloads, which should be able to scale down and scale up as the capacity is shrinking and expanding, is a big investment area, right? Which we are doing. And I'm assuming a lot of people are solving this problem through different means and mechanisms, right? So yeah, that's essentially the key characteristic there.
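A back-of-the-envelope sketch of the "don't provision for peak" point above: a toy autoscaler that sizes GPU inference replicas from observed traffic instead of pinning capacity at peak. The throughput-per-replica figure, bounds, and traffic curve are made-up assumptions for illustration.

```python
# Toy replica sizing: compare static peak provisioning with elastic scaling over a day.
import math
from dataclasses import dataclass

@dataclass
class ScalerConfig:
    qps_per_gpu_replica: float = 40.0   # assumed sustainable throughput per replica
    min_replicas: int = 2               # floor for availability / failover
    max_replicas: int = 100             # budget ceiling
    headroom: float = 1.3               # 30% buffer over observed demand

def desired_replicas(observed_qps: float, cfg: ScalerConfig) -> int:
    need = math.ceil(observed_qps * cfg.headroom / cfg.qps_per_gpu_replica)
    return max(cfg.min_replicas, min(cfg.max_replicas, need))

if __name__ == "__main__":
    cfg = ScalerConfig()
    hourly_qps = [800, 400, 150, 120, 300, 900, 2400, 2600] + [500] * 16  # one fake day of traffic
    peak = max(desired_replicas(q, cfg) for q in hourly_qps)
    elastic = sum(desired_replicas(q, cfg) for q in hourly_qps)
    print(f"static-for-peak GPU-hours: {peak * len(hourly_qps)}")
    print(f"elastic GPU-hours:         {elastic}")
```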
Demetrios [00:28:32]: Well, and you haven't even mentioned the reliability of the GPUs because they're notoriously unreliable.
Animesh Singh [00:28:38]: They are. I mean, all these Nvidia GPUs you get, for example, right? And one of the decisions we took early on, I think for the simplicity part of it, was, hey, let's start, at least on the training side, with Nvidia, and so far we have standardized on Nvidia throughout. But then you have a lot of things which can go wrong. Just taking a distributed training example: when you are running at scale, you are touching, in our case, Kubernetes as the compute infra layer, right? So you have Kubernetes, you have the network, you have the hardware, you have the storage. Training is reading and pumping a lot of data throughout. Any of these points going down is a weak link, in addition to the GPUs themselves, and add to that the planned maintenance. And sometimes for foundation models, your training can run for one week, two weeks, three weeks at a stretch, right? When you are actually running training for these large foundation models, you're guaranteed to have some part of the infrastructure having some problem. Sometimes it will be the messaging queue, sometimes it will be the storage system, sometimes it's something in the Kubernetes layer, or sometimes it could be just the GPU node itself failing health checks, ECC errors, NCCL is a problem.
Animesh Singh [00:30:06]: The InfiniBand of the GPUs, right, which is connecting all this. So this poses a huge challenge in terms of ensuring, because these are such long-running jobs, how do you make sure that you are being effective? And that's combined with the problem of gang scheduling, because they need to go all together at once or otherwise they will need to wait. If a training job needs 500 GPUs, it will be in the queue until those 500 GPUs are free. Right? You cannot launch it before that; I mean, that's how the architectures around gang scheduling are built. So this is a big thing which we have invested a lot in, in terms of building very fast checkpointing and restore, because you assume failure is happening, maintenance is happening. So a lot of investment has gone in from our side in terms of ensuring that there is automated checkpoint restore which could be done, which is hierarchical. So we first checkpoint in memory, then distribute those checkpoints to some async block storage.
Animesh Singh [00:31:11]: Then you have to read, right? And then there are also discussions around whether these checkpoints can reside on the GPU SSDs themselves. Because there is a cost of reading. Once you have checkpointed in memory, the training can continue, but once you have sent it to some remote storage, a block storage or a file-based storage, then when you're reading, if something has gone wrong, that takes time; you have to bring it back through the network. And bigger models will have many gigabytes of checkpoints coming through the network. So then, okay, can we leverage the GPU SSDs under the covers themselves? Right. GPU SSDs obviously have some limited capacity. How much can you store? So a bunch of those architectural elements need to be crafted. So we are investing quite a lot and quite heavily into that. Liger, which you talked about, right?
Animesh Singh [00:32:01]: That was the other part of the efficiency problem, which is essentially: as we went on this journey, we saw, beyond the reliability and the scalability, that for a lot of the use cases the models started getting more complex, bigger, and we had an X amount of GPU fleet. We started seeing a lot of pull from our customers: hey, I need GPUs, I'm not getting GPUs, or my training is running for so long and I really want to get it done faster. Multiple of these use cases, and literally, because GPUs were a scarce entity at least a year ago, we had quite a bit of that as we were scaling the fleet and we were ordering. Nvidia has its own supply chain and timeline.
Animesh Singh [00:32:53]: The use cases were just springing up. So we looked at the problem. Okay, you follow the traditional methodology of data parallelism, model parallelism, let's introduce tensor parallelism. We invested in technologies like ZeRO because our A100 fleet had very constrained network bandwidth. So how can you scale up training even with that constrained bandwidth? ZeRO helped in that. So after we had exhausted every single option, then, okay, what can we do? So as we started going deeper, one of the things which we used to do infrequently for our customers was rewriting CUDA kernels, for a lot of the use cases which are very sensitive to that, where the training should complete within X number of hours, et cetera.
Animesh Singh [00:33:43]: When we looked at the model training code which our modelers used to produce, there were a lot of improvements, or at times we would go and rewrite the CUDA kernels for certain operations. Right? Now you bring this to the LLM. The thought which we had was, hey, we can do this on a model-by-model basis, but this is not scalable, right? There are multiple users having different models. So the thought was, okay, how can we make what exists actually solve this problem in a way which can be a little bit more scalable? So that's where, fortunately, at that point OpenAI had launched Triton, which was essentially a Pythonic programming interface where you can do CUDA kernel programming at a much more abstracted layer. Most of our users would be familiar with Python. So using that. Now, the second thing was, GPU memory is hierarchical, right? You have the DRAM, then you have the HBM memory, then you have SRAM. The streaming multiprocessors, the SMs in the GPUs, they interact mostly with SRAM, which is very little.
Animesh Singh [00:34:54]: And so with all these multiple kernels and multiple operators you have in your training code, there's a lot of data transfer happening from CPU to HBM to SRAM. That is your biggest bottleneck, right? Even though the GPUs can massively parallelize, the amount of data I/O which is happening between different hierarchies of memory is becoming a big bottleneck for the training time. So that's when we thought, okay, let's combine what Triton has brought to the table, let's take this problem and create, first of all, certain custom kernels for our distributed training workloads. And we started seeing huge gains, right? In one of the cases, just a 300% increase in the overall efficiency; in the majority of the cases, more than a 50% decrease in the memory which was being used. What we did was we started fusing kernels: certain operations can just be combined, fused together. Operators can be fused together. You don't need to have five kernels to do these different operators, right? And there is some human judgment involved in making that decision.
Animesh Singh [00:36:02]: And we had also seen flash attention; the industry sort of saw that quite a bit, right? It became very popular and it was built on the principles of kernel fusion. So we took that and combined that, and it solved and helped immensely for a lot of our internal developers. Huge decrease in training time, because memory efficiency got much bigger. And once we realized, hey, this looks good, let's just open source it. There was not a lot of planning, there was not a lot of thought. It was also like, hey, a lot of open source models are coming in, and maybe if the community likes it, they will potentially create kernels for those open source models. And if we end up consuming them, it eventually also benefits us, right? Yeah, it just took off, right? I think multiple companies were in the same position. It's when you go and put a problem out there: they were all getting this big load of generative AI use cases, a lot of LLMs. The GPU crunch was omnipresent, and almost everybody needed to solve this problem for distributed training at scale.
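For a flavor of the kernel-fusion idea described here, below is a minimal Triton kernel, not the Liger code itself, that fuses a bias add with a GELU activation so the intermediate result stays in registers rather than round-tripping through HBM between two separate kernels. Shapes, block size, and the tanh-via-sigmoid approximation are arbitrary assumptions.

```python
# Minimal fused bias-add + GELU in Triton: two elementwise ops, one kernel launch,
# no intermediate tensor written back to HBM between them.
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_gelu_kernel(x_ptr, b_ptr, out_ptr, n_cols, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    b = tl.load(b_ptr + (offs % n_cols), mask=mask, other=0.0)   # broadcast bias over rows
    y = x + b                                                    # op 1: bias add
    # op 2: tanh-approximation GELU, with tanh(z) = 2*sigmoid(2z) - 1, computed in-register
    z = 0.7978845608 * (y + 0.044715 * y * y * y)
    y = 0.5 * y * (2.0 * tl.sigmoid(2.0 * z))
    tl.store(out_ptr + offs, y, mask=mask)

def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_bias_gelu_kernel[grid](x, bias, out, x.shape[-1], n, BLOCK=1024)
    return out

x = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")
print(fused_bias_gelu(x, b).shape)
```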
Animesh Singh [00:37:15]: And we got a solid reception in the community, from Andre, from Hugging Face. One thing which we did was spend some time on making it very easy to use, so we integrated with the Hugging Face ecosystem right off the bat, et cetera. And just last week it completed more than 1 million downloads, right?
Demetrios [00:37:40]: Incredible.
Animesh Singh [00:37:41]: This is amazing. We never thought this would go all the way there, right? And we have gotten lots of good feedback from the users who are benefiting from it.
Demetrios [00:37:52]: Yeah, it's funny that you mentioned the memory aspect because we had a guest on here a few months ago talking about how his whole thesis was that we are memory constrained, we're not GPU constrained. GPU is almost like not the way to look at it. It should be really looking at memory. And that is the bottleneck right now.
Animesh Singh [00:38:15]: And I think very rightly so. See, ultimately ML is a data processing problem, right? So that does mean that there's a lot of data moving in between these GPUs. Now, the fastest you can process data is if the data is in memory. Then you go through the hierarchical thing: okay, GPU HBM, SRAM. The second thing is the system memory of the CPU on the GPU nodes; then you go, okay, I will go outside the GPU nodes, maybe to the GPU SSDs themselves. But the more you can keep in memory, the faster you can process. And for LLM-oriented workloads, generative AI workloads, if you look at it, people want to have bigger context lengths.
Animesh Singh [00:39:10]: Even when you are training the model, if you increase the context length, you potentially can train bigger models. At inferencing time, the amount of information users are sending for the model to process, the amount of information the models are generating. See, RecSys output used to be very straightforward: RecSys recommends whether this is good content to show or not. In the case of generative AI, the output is also huge, right? So a lot of that data processing is going back and forth, and the more you can leverage the memory effectively, the bigger the gain you can get. So it's rightly so. Even during inferencing, KV cache is becoming a big technique, where you can reuse the KV cache as the inferencing calls are coming, for sequences of tokens, et cetera, which we have seen before.
Demetrios [00:40:06]: You calculated that can save a lot of cost too.
Animesh Singh [00:40:09]: Yeah. You've already calculated attention scores for all these different tokens, for the sequences you are seeing, right? So why do this again? So let's keep it in memory. But how much can you keep in memory? The catch is, if you start leveraging GPU memory for KV caching, then the model itself needs GPU memory for its own processing. So what is the right boundary? And with most of the H100s, which is what the industry probably has most in bulk right now, H100s, or to some extent H200s, 80 gigs of memory per GPU is turning out to be low based on the amount of data you are processing. And I think part of it is also, as the use cases emerge, that starts becoming clearer. Because maybe whatever went into the decision making was, hey, this memory is for GPU processing, whatever the GPU is computing. Then people start leveraging this memory as a cache, like KV cache; in the case of GNNs, also, we are keeping some part of the graph structure in memory.
Animesh Singh [00:41:18]: When you start using it as a data storage mechanism in addition to using it for the computing needs of the model itself, then it's less; then you need more. So I'm assuming that realization is already there. If you see the Grace Hopper architecture and the Grace Blackwell architecture, where they're combining ARM CPUs with the GPU nodes and creating a very high-bandwidth data transfer link between the CPU and the GPU, so that you can then leverage the larger memory on the CPU side of the nodes because there is a very high-bandwidth data transfer, that is a reaction to this particular use case which is emerging from generative AI. Right.
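A single-head, unbatched sketch of the KV-cache idea discussed above: keys and values for tokens already processed are kept in memory, so each new token only attends against the cache instead of recomputing everything. The dimensions and random weights are simplifications for illustration; the trade-off is that the cache grows with sequence length and competes for the same GPU memory the model weights need.

```python
# Toy single-head decoder step with a growing KV cache.
import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache = torch.empty(0, d)   # grows by one row per generated token
v_cache = torch.empty(0, d)

def attend_next(x_t: torch.Tensor) -> torch.Tensor:
    """x_t: (d,) hidden state of the newest token."""
    global k_cache, v_cache
    q = x_t @ Wq
    # Only the new token's K/V are computed; previous ones are reused from the cache.
    k_cache = torch.cat([k_cache, (x_t @ Wk).unsqueeze(0)])
    v_cache = torch.cat([v_cache, (x_t @ Wv).unsqueeze(0)])
    scores = torch.softmax(k_cache @ q / d ** 0.5, dim=0)   # (seq_len,)
    return scores @ v_cache                                  # (d,)

for _ in range(5):                       # pretend we decode five tokens
    out = attend_next(torch.randn(d))
print(out.shape, k_cache.shape)          # torch.Size([64]) torch.Size([5, 64])
```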
Demetrios [00:42:05]: Yeah, I want to give a shout out to the person who said that to me because I was trying to remember their name. It was Bernie Wu who said that we are memory constrained. And you just went very deep into that, and everything that you're saying makes a lot of sense. You did mention one thing that you worked on quite heavily too, which I've also heard is an absolute pain, around checkpoints and speeding up the checkpoints. Because, A, it's not clear when things are going to fail, so you don't know if you need to checkpoint every second or every five seconds or every five minutes or every day. And so if you over-optimize for checkpointing all the time, then you're potentially transferring around a ton of data, and you don't need to be, because as you mentioned, these checkpoints are huge. And so especially when you're training LLMs that are very, very big, the checkpoints can be in the terabytes.
Demetrios [00:43:00]: And that's just like if you're doing that every five seconds, that's a whole lot of data that's going around. So how did you specifically speed up the checkpoint and make that process better?
Animesh Singh [00:43:15]: So initially, the very naive implementation of checkpointing was: the majority of our data is on HDFS, right? And the mechanism was, you checkpoint and the checkpoint goes to HDFS, and when you are reading, you read from there. So the realization, specifically with LLMs, was: hey, first of all, these checkpoints are big, and the training pauses while you are checkpointing, right? So there is a pause in the training workload while this gets done. There's a transaction which is happening, right? So the first thing which we did, and we changed that architecture as LLMs became more prominent, is, hey, let's make it a two-phase transaction. In essence, we will checkpoint in memory, and from there onwards any copy of that checkpoint is async. So two points: the checkpointing in memory is very fast, and the second thing is you're not waiting for that checkpoint to be transferred to remote storage. So it's a hierarchical checkpointing strategy which we developed, and now we are streaming that checkpoint. We also changed our backend for checkpoint storage from HDFS to a block-based storage. Now we are investing in figuring out how to go and optimize it even further.
Animesh Singh [00:44:47]: So, GPUs: when you are ordering GPUs, Nvidia GPUs, you can request SSD storage within the GPU nodes, right? And there are technologies, which I'm assuming some of the companies may have in house, but even in open source, where you can then start looking at creating a cache, a distributed cache, on the GPU SSDs. So maybe one GPU node's SSD storage is not enough for your needs; then you can combine the GPU SSD storage across all your GPUs and create a distributed cache to store this. Right. So that's the second phase we are on in terms of ensuring that. And the advantage with that will be in the restore.
Animesh Singh [00:45:34]: I think checkpointing will still be fast because it's in memory. It then will go to the GPU SSDs. But when you are reading back, if you don't have to take it outside of the GPU network and bring it back into the GPU network, you save tons while restoring and reading. The other thing which we did was, to your point, one of the questions was how often do you checkpoint, right? I mean, that's a big question. And we have mostly left it to the modelers to take that common-sense decision: how often should you checkpoint? For example, for jobs which take more than a week to train, they can decide, hey, how much am I willing to lose?
Animesh Singh [00:46:16]: I'm willing to lose five hours' worth of data. Or, if something goes wrong and I have to restart, can I restart from the previous day's checkpoint? So it's a decision-making heuristic. In certain very deterministic use cases, when you're doing incremental training where you need to train every few hours, then it is very clear: it's driven by checkpoints, so you have to do a checkpoint every two hours, et cetera. So it depends on the use cases. The other thing which we have done is disruption readiness. So in cases of planned maintenance, where we know that the nodes are going to go down, we trigger a checkpoint. Right? So it's a triggered checkpoint. And if the modelers have implemented checkpoint restore, that trigger will go invoke a checkpoint. This is not planned by the modeler; it's happening before we take that workload off those nodes, right.
Animesh Singh [00:47:15]: So essentially this is a signal from the underlying infra that, hey, these nodes are going away, I'm triggering a notification. And this is all automated: the notification will go to the running model, that model will be checkpointed, moved, and then, you know, we will put it back. And then we have invested in things like priority queuing. So things which are being disrupted should be moved to the front of the queue, like a Disney front-of-the-line pass, and rescheduled as fast as possible. Right. So you need to invest in those mechanisms to make sure that goes on smoothly.
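A rough sketch of the two mechanisms described above, under simplifying assumptions: the two-phase checkpoint (a fast in-memory snapshot so training can resume, then an async copy toward remote or block storage) and a maintenance signal that triggers an extra checkpoint before nodes are drained. The model, the SIGTERM choice, and the /tmp path are placeholders, not LinkedIn's implementation.

```python
# Toy training loop with hierarchical (in-memory then async) checkpointing and a
# disruption-triggered checkpoint on a drain signal.
import copy, signal, threading
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
maintenance = threading.Event()
signal.signal(signal.SIGTERM, lambda *_: maintenance.set())   # stand-in "node draining" signal

def snapshot():
    # Phase 1: fast in-memory copy; training resumes as soon as this returns.
    return {"model": copy.deepcopy(model.state_dict()),
            "opt": copy.deepcopy(opt.state_dict())}

def upload_async(snap, step):
    # Phase 2: persist off the critical path (stand-in for block storage / GPU SSD cache).
    threading.Thread(target=torch.save, args=(snap, f"/tmp/ckpt_{step}.pt"), daemon=True).start()

for step in range(1, 1001):
    loss = model(torch.randn(32, 512)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 200 == 0 or maintenance.is_set():     # periodic OR disruption-triggered
        upload_async(snapshot(), step)
        if maintenance.is_set():
            break      # hand the nodes back; the job would go to the front of the requeue
```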
Demetrios [00:47:52]: Yeah, I like the idea of, first of all, just getting the checkpoint offloaded as quickly as possible so you can get back to training, and then do what you want with the checkpoint, but don't have everything stopped because you're trying to offload the checkpoint. So make that as quick as possible; I think I've heard that before as a design pattern. And then also make sure that there are no surprises for the model that is training, where it's like, hey, what happened? I thought I had all these resources and they all went offline. These things feel kind of standard, but unless you think through it, or you have a bad experience, I imagine you probably had an experience where you realized, wow, we should probably do something about that, because there was a whole training job that just kind of went to nothing, and we could probably provision or at least put a bit more anticipation in place for that. One thing that I was thinking about is how we've been centering this conversation a lot on the new paradigm of LLMs and agents.
Demetrios [00:49:05]: But LinkedIn still has a ton of traditional ML use cases. Right. So how do you think about bridging the gap, or creating a platform that can service both the LLM and agent use cases, and then also the traditional ML use cases?
Animesh Singh [00:49:28]: I think it's happening. So certain things are common, right? Obviously all the investments we are doing in the GPU infrastructure, GPU monitoring, observability, resiliency, and building things like distributed checkpointing and restore, automated disruption readiness. Then you start going into the layers above. So one of the things which I did once I came here was, we took a major decision to overhaul our machine learning training pipelines. The pipelines, or the orchestration engine, was built in a very solid, prescriptive way. There were multiple components which we had in our training pipeline. It was built on TensorFlow and, sort of, it's a similar paradigm to what TFX, TensorFlow Extended from Google, follows, where every single step is sending metadata to the subsystem.
Animesh Singh [00:50:35]: So, very rigid. I think that realization came very early on. And part of this was also that there is a lot of traditional feature engineering which happens with the traditional RecSys models, as opposed to LLMs, which don't have the notion of feature engineering; you are feeding blobs of text. So feature engineering as a discipline is sort of disappearing in this new world. And there are a lot of new RAG-based architectures which are coming up, where you are doing more in real time. So we rewrote the machine learning pipelines engine and redesigned it on an open source orchestration engine called Flyte. Because the notion was that experimentation will be very heavy in this new world.
Animesh Singh [00:51:30]: And so there was a lot of quick changing of code. I want to do very short fine-tuning jobs or sometimes very long-running jobs, but the user experience has to be very nifty. We launched something called interactive dev, where you can have VS Code connect to a remote running job and put debug pointers and trace through what is going wrong. Because otherwise the previous experience used to be that you don't know what has gone wrong, so once it has gone wrong you need to rebuild, recompile, re-upload. So this whole new modern pipeline training architecture is very much focused on very quick changes, very much on experiments: you don't need to re-upload, you go directly, debug there, see what values are being passed in real time through your debug pointers, go one by one. And then, once we standardized LLMs on it, we started migrating a lot of our traditional RecSys models onto this new platform as well.
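A minimal Flyte-style sketch of the kind of pipeline described here, where each run's config, data version, and model artifact stay tied together as lineage. Task names, bodies, and defaults are placeholders, not LinkedIn's pipeline.

```python
# Sketch of an experiment-heavy fine-tuning workflow on Flyte (flytekit).
from flytekit import task, workflow

@task
def prepare_data(dataset_uri: str) -> str:
    # Tokenize / shard the dataset; return a versioned artifact location (placeholder).
    return f"{dataset_uri}/processed-v1"

@task
def finetune(processed_uri: str, base_model: str, lr: float, epochs: int) -> str:
    # Launch the (distributed) fine-tuning job; return the model artifact URI (placeholder).
    return f"s3://models/{base_model}-ft-lr{lr}-e{epochs}"

@task
def evaluate(model_uri: str, processed_uri: str) -> float:
    return 0.42   # placeholder metric

@workflow
def finetune_pipeline(dataset_uri: str, base_model: str = "llama-7b",
                      lr: float = 2e-4, epochs: int = 1) -> float:
    processed = prepare_data(dataset_uri=dataset_uri)
    model_uri = finetune(processed_uri=processed, base_model=base_model, lr=lr, epochs=epochs)
    return evaluate(model_uri=model_uri, processed_uri=processed)

# Each execution is registered with its inputs, so the config, data version, and
# resulting model artifact stay linked to that run's lineage.
```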
Animesh Singh [00:52:30]: And the feedback from users is really great. They're all liking this less rigid, very experimentation-oriented architecture, with the back-and-forth things which they need to change to change their model and train their model. And there is a very inbuilt, robust versioning mechanism. One thing which was missing in our previous versions of machine learning pipelines was: you have a config, you have a model, you have a dataset, all these are different versioned things, and you do an experiment, then you have to go and look at every different entity to see what was correlated with this particular experiment or training run. In this new machine learning pipeline there is very, very robust versioning in terms of all the entities: your configuration parameters, your pipeline, your model, as well as what data version; these are all bundled together, right? So every experiment is associated with the right lineage: okay, these were all the things which were used to create this particular model. So versioning is very robust, because it's sort of built with that assumption: you will experiment a lot. So that's one change. One thing which will potentially change, and again, it's early, early days of what will change in RecSys vis-à-vis the non-LLM architecture, is the model itself; one hypothesis is, hey, the LLM architecture itself will take its place in RecSys.
Animesh Singh [00:54:00]: So then, okay, whatever you have done at the upper layers, or you're doing for LLMs, like vector DBs, embeddings, they become the prominent mechanism; overall, RAG becomes a prominent mechanism. So your traditional feature engineering pipelines, the way you were constructing them, will change; the real-time feature pipelines, not the offline ones, will change to accommodate this new paradigm. Tools like LangChain and LangGraph, which are very dominant in the LLM space, may become prominent in that space as well. So far we haven't gone that far. I don't think we have looked at, hey, can the way LangChain or LangGraph is being used for LLMs, whether prescriptive orchestration graphs when the models are inferencing in real time, or non-prescriptive agentic graphs where the agent decides what path to traverse, be used in the traditional space where there is a lot of feature ingestion happening, where there are feature engineering pipelines which you have built? But I think the wait is on to see. The thing which will simplify this is: can LLMs solve that part of the problem? Then you can take the LLM infrastructure as is, even at the upper layers, and start replicating it for RecSys.
Animesh Singh [00:55:17]: Until then, in fact, the majority of our workloads are still RecSys workloads not powered by LLMs, right? And rightly so. They are very targeted, very effective in what they do, and so the first hypothesis has to be proven at the model layer itself. But at the lower layers we are definitely ensuring that there is standardization across. So it's now the same machine learning pipelines engine which is being used all across LinkedIn, whether it's for RecSys or for LLM workloads. Right. So we have standardized all that layer. The topmost layer, when you move towards inferencing and all, some of those changes we'll figure out as we go.
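A small sketch of the embeddings-as-common-currency point made above: both RAG-style retrieval and recommendation candidate retrieval reduce to nearest-neighbour search over embeddings. The encoder here is a random stand-in; in practice it would be a trained model, and the index would be a vector DB or ANN index rather than a NumPy array.

```python
# Toy embedding retrieval: cosine-similarity top-k over a small in-memory "index".
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):                       # placeholder encoder (random vectors)
    return rng.normal(size=(len(texts), 384)).astype(np.float32)

corpus = ["job post: ML infra engineer", "course: intro to transformers",
          "profile: data scientist, 5 yrs", "post: GPU utilization tips"]
index = embed(corpus)
index /= np.linalg.norm(index, axis=1, keepdims=True)     # unit-normalize once

def top_k(query: str, k: int = 2):
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = index @ q                                     # cosine similarity
    best = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in best]

print(top_k("recommend learning content about LLM serving"))
```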
Demetrios [00:55:57]: Yeah, where exactly does it diverge? Because it is fascinating to think about how at the lower levels you've recognized you can share resources, or you can have the same type of stuff going on, but at some point it diverges right now. So where is that point, if you know? Or it might be a case-by-case basis, I think.
Animesh Singh [00:56:21]: See, the first thing is, you can look at every single layer. So, for example, when you are going to inference: people have either written their own inferencing engines in different languages, or you have standardized on something like TF Serving from open source, TensorRT from Nvidia, which were very heavily optimized for models in the RecSys space. When you look at the LLM space, there is vLLM, there is TGI, there is SGLang, right? The inference engine itself: there are custom inference engines being written for LLMs. Then just look at the hardware architecture, which we didn't talk about. There are a lot of custom chips which are coming; you see many startups, Groq is another example, which are just developing custom hardware.
Demetrios [00:57:12]: Cerebrum. Yeah, yeah, for sure.
Animesh Singh [00:57:15]: So for people who are investing and taking a bet, hey, it's much simpler for them because they can just standardize on one. That's one of the reasons where you may feel, hey, Nvidia GPUs could be better for LLMs, versus people who are investing in custom ASICs, custom chips, which are potentially better for LLMs in terms of inferencing, et cetera. It's a bet on the architecture: if the transformer architecture is going to replace every single thing, then you can have custom chips. So there is some divergence. Some companies are just going with very custom chips for LLMs. So the engine is different, the chips are different, right.
Animesh Singh [00:57:58]: Whereas if you build something general purpose: right now we are all standardized at the hardware layer; there is no LLM-specific thing apart from the kernels work which we have been doing, which has been pretty specific. Then you also look at the feature processing infra. So, a lot of real-time feature processing, feature ingestion in one case, versus the kind of tooling which has emerged in the LLM world: you are talking LangChain, LangGraph, vector DBs are prominent, right? It's not the traditional feature processing, and embeddings have become the dominant landscape. And I think we have been on a journey where we are using embeddings all across as well.
Animesh Singh [00:58:45]: Started heavily both for access as well as LLMs. But some of the tooling chain there is different it right. What you will use for something like beam, for example, for feature processing in the past. Now a lot of that is happening in the LLM world through the lang chains and the lang graphs like when you are actually orchestrating the data. So those are the divergences which are happening and are there for the right reasons. I think over a period of time it will reconcile. But yes, that layer is different when you are handling traditional rexes versus this. The inferencing engines are different when you are handling that.
Animesh Singh [00:59:22]: In some cases people are even taking different chips, hardware chips, for these layers. So I think it's bottom-up: we have to push, we have to make sure we continue standardizing on the bottom layers and go up the stack and see, as we become more mature, what else can be standardized and can be the same. If LLMs start solving RecSys, then potentially, maybe, that itself solves a lot of this problem.