MLOps Community

Efficient GPU infrastructure at LinkedIn

Posted Mar 28, 2025 | Views 23
# GPU
# LLM
# LinkedIn

SPEAKERS

Animesh Singh
Executive Director, AI Platform and Infrastructure @ LinkedIn

Executive Director, AI and ML Platform at LinkedIn | Ex IBM Senior Director and Distinguished Engineer, Watson AI and Data | Founder at Kubeflow | Ex LFAI Trusted AI NA Chair

Animesh is the Executive Director leading the next-generation AI and ML Platform at LinkedIn, enabling the creation of the AI Foundation Models Platform, serving the needs of 930+ Million members of LinkedIn. Building Distributed Training Platforms, Machine Learning Pipelines, Feature Pipelines, Metadata engines, etc. Leading the creation of the LinkedIn GAI platform for fine-tuning, experimentation and inference needs. Animesh has more than 20 patents and 50+ publications.

Past IBM Watson AI and Data Open Tech CTO, Senior Director, and Distinguished Engineer, with 20+ years experience in the Software industry, and 15+ years in AI, Data, and Cloud Platform. Led globally dispersed teams, managed globally distributed projects, and served as a trusted adviser to Fortune 500 firms. Played a leadership role in creating, designing, and implementing Data and AI engines for AI and ML platforms, led Trusted AI efforts, and drove the strategy and execution for Kubeflow, OpenDataHub, and execution in products like Watson OpenScale and Watson Machine Learning.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Animesh discusses LLMs at scale, GPU infrastructure, and optimization strategies. He highlights LinkedIn's use of LLMs for features like profile summarization and hiring assistants, the rising cost of GPUs, and the trade-offs in model deployment. Animesh also touches on real-time training, inference efficiency, and balancing infrastructure costs with AI advancements. The conversation explores the evolving AI landscape, compliance challenges, and simplifying architecture to enhance scalability and talent acquisition.


TRANSCRIPT

Animesh: [00:00:00] Animesh Singh, currently at LinkedIn, leading our GPU infrastructure and training platform, optimizing some inferencing engines. Coffee order: a tall mocha, extra hot.

Demetrios: Woo! I'm bubbling from this conversation. So many gems when it comes to working with LLMs at scale, GPU infrastructure, what you want to be optimizing for, how you can think about optimizations, and I had an incredible question at the end.

Demetrios: Like, how does the platform and the GPU infrastructure that you're dealing with differ when it comes to working on LLMs versus traditional ML? The answer did not disappoint, but let's get into the conversation. [00:01:00]

Demetrios: Well, let's start with this, man, and I'm so happy that we got to do this, because we've been juggling our calendars to make this work for probably about six months. I think we were gonna do this when I was last in San Francisco in June, but one thing led to another and here we are in 2025, having the conversation finally.

Demetrios: Persistence paid off.

Animesh: I think this is great and I've been following your work throughout as well. You're doing excellent work in terms of, you know, bringing communities together and disseminating that knowledge, right? Like, uh, what's all happening in the AI space, right? And what use cases are springing up?

Animesh: Uh, what are the industries they are targeting? There is some excellent work you are driving in that. And I'm so glad it's happening in 2025. I feel, you know, we now have some better experience of [00:02:00] what is working, what is not working, what may be a little bit of the hype, right? What is realistic?

Animesh: What are going to be the trends in 2025? So I think the timing is working out. Yeah.

Demetrios: What is working?

Animesh: Definitely. I think one thing which has definitely proven that it's here to stay is LLMs, right? I feel throughout, you know, 2022, 2023, there was a lot of discussion about how effective

Animesh: LLMs are going to be in the industry, in the space. There are a bunch of, you know, modeling architectures, like recommendation and ranking models, graph neural networks, GNNs, right? But the efficacy of LLMs and the use cases being powered by LLMs was [00:03:00] quite a bit of a question mark, right?

Animesh: Like, yes, there was a promise. What we saw, that magic moment with ChatGPT coming in, that literally, you know, woke everyone up, right? Hey, it does seem seamless. It does seem that it's not yet another chatbot you are talking to, and that spurred the industry, right? Um, when I joined LinkedIn,

Animesh: at that time, you know, the ChatGPT moment hadn't happened. And as soon as I joined, a month later, ChatGPT came in. And what I came in here to do, a lot of that changed within a period of a month. Uh, and I think through the course of that period, multiple companies, multiple industries have identified different use cases which are working well with this, right?

Animesh: And people are being productive. Be it, you know, generating code, being able to do certain automation leveraging this. The interface [00:04:00] does seem very human-like, and a lot of the generative AI use cases which we launched on LinkedIn, for example, you know, profile summarization, right?

Animesh: So based on what you have: create a headline for me, create a summary for me. Use cases like, you know, assistance for LinkedIn Learning courses, use cases like, you know, targeted recruiter emails for candidates. Because a lot of the things which we had seen is that, you know, if you're getting cold-call emails from recruiters, they are not hyper-personalized, right?

Animesh: Like, at times they sound like a template. LLMs helped us immensely in that, right, where they take into account the candidate's profile, the company in which the candidate is working, the company from which the recruiter is, and create these very personalized emails, which, you know, we are seeing success with: candidates actually responding much more, opening those [00:05:00] emails.

Animesh: So I think, you know, a bunch of use cases we have seen working really well with LLMs, and we are obviously doubling down, right? Like, if you see the talk of 2025, and even before that, last year, there was quite a bit of discussion on agents, what they can do, what they cannot do. And we did our own experiments, right?

Animesh: Like, we invested a lot internally in terms of building agent infrastructure, right? First of all, what does it take to create agents? How is it different than the traditional generative AI applications or use cases we were building? What are the nuances, right? What makes it different?

Animesh: And then finally, we launched, closer to the last quarter of last year, LinkedIn Hiring Assistant, right? Which is essentially an agent for recruiters which, based on certain criteria they define, [00:06:00] will actually go work behind the scenes, find relevant candidates for them, and summarize their experience and profile for the recruiters.

Animesh: And then based on that, they can say, okay, go, you know, reach out to these particular candidates and, you know, let's start having a discussion. And there is much more we are doubling down on with that whole LinkedIn Hiring Assistant, and we have seen some great enthusiasm from our partners and customers, right?

Animesh: So it's seeing some very good results. So there is much more we are doing now in other areas of LinkedIn which will be powered by agents. So they are here to stay. I think it's how, and in which use cases, they really provide benefit that will be a nuanced discussion. You cannot throw it at every single thing.

Animesh: Um, you know, uh, there are so many things which can be just powered by them, which then frees you up to do things right, which are probably the more creative aspects of the work you are doing. Right. And, and we are seeing quite a bit of that [00:07:00] happening.

Demetrios: So what's not working?

Animesh: I wouldn't say what's not working, but a thing which probably needs a lot of improvement moving forward is the cost and the ROI of launching these LLM-based, either agentic use cases or traditional, you know, prescriptive, RAG-based generative AI apps, or even, you know, leveraging LLMs for use cases beyond generative AI.

Animesh: The cost is a big hindrance. I think, uh, one, if you see, like, training itself used to be the biggest barrier to entry. Um, and that got solved to some extent, because first of all, you know, even within LinkedIn, we invested quite heavily in building out our scale-out training infrastructure, which can power LLMs.

Animesh: Uh, I think when I came in, you know, we [00:08:00] were working off a V100 fleet. Since then we have scaled our LinkedIn fleet by 7x. We have A100s. We have H100s. We have H200s. And the fleet is as modern as it could be. And we have scaled our training tremendously, right? Like, it's a 150x increase in the size of the models we are training, right?

Animesh: Our data processing has increased manyfold. We actually, you know, completed 1 million training runs on the platform, and we're training big foundation models. Now, investment in the infrastructure went in, and I think it was well understood for companies at the scale of LinkedIn, et cetera, right?

Animesh: Where a lot of content based data is being consumed, being produced, you will have tons of data, right? And when you have these tons of data, you need to make sure that you have the infrastructure to train models on these data, right? Uh. So, so, and then the other thing, which actually helped a lot of the, [00:09:00] the, uh, training landscape is the emergence of open source models.

Animesh: I think, uh, you know, Meta led the way, followed by many others in the industry, right? Where for smaller companies, or for companies where there is not a need to train a model on world data, right, like you have your own specific data, which is probably, you know, not as big as the world data, getting these open source models and starting on top of them makes sense, because these models already know what exists in the world.

Animesh: They can answer your questions; you know, they've crawled through Wikipedia, they have crawled through, you know, public libraries, all the articles, and they've been trained on them. Then you can bring them in and do more fine-tuning instead of, you know, training on huge amounts of data. So then the infrastructure cost can go down further, right?

Animesh: Uh, so fine-tuning became a big mechanism. Plus, you know, the emergence in the industry of a lot of techniques around supervised fine-tuning, zero-shot and few-shot [00:10:00] learning, and prompt optimization techniques, which essentially brought the cost down heavily on the training side.
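To make the fine-tuning point concrete, here is a minimal sketch of supervised fine-tuning on top of an open-source causal language model, assuming a Hugging Face-style API; the model name and the tiny dataset are placeholders, not anything LinkedIn uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical base model name; any open-source causal LM would do.
base = "open-llm/base-7b"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy supervised pairs (prompt, target); a real run would use a curated dataset.
supervised_pairs = [
    ("Summarize this profile: ML engineer, 8 years, GPU infra.",
     " Experienced ML infrastructure engineer focused on GPU platforms."),
]

model.train()
for prompt, target in supervised_pairs:
    batch = tok(prompt + target, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # causal-LM loss over the concatenated text
    out.loss.backward()
    opt.step()
    opt.zero_grad()
```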

Animesh: What is now much more in the picture is the cost of inferencing, right? And I think it's humongous, it's big, and there are a lot of efforts which we are doing, which the industry is also doing, right, to bring down the cost of inferencing these models overall. Now, if you take a look at generative AI use cases specifically, there is some thinking time which is built in when you are interacting with a model, right?

Animesh: Where you are asking certain queries, asking it to analyze, you're prepared mentally that it will take some time, it will think through it, right? Even with the emergence of, you know, the latest OpenAI models, there is a lot of reasoning going on. So a lot of it is actually analyzing its own output. [00:11:00]

Animesh: Then, you know, refining its own output. Then the second output is further analyzed. There is a lot of back-and-forth reasoning. So there are multiple inferencing calls happening. And I think as a consumer, we are prepared, if we are going into a scenario like that, to accept that in this particular context the model may have some thinking time, specifically if you are interacting with the latest OpenAI models, etc.

Animesh: Then you know that, you know, you're asking complex queries which need that analysis. So that latency you are willing to tolerate. Now, even to get to that latency, right, there is tons of infrastructure investment which has been made. The general realization is that, even with all these investments, we are not able to get our GPUs to perform at maximum utilization, right?

Animesh: Like, inferencing is becoming very costly because you are optimizing a lot for latency and throughput. And, you know, there are a [00:12:00] lot of failover mechanisms which you need to build; almost any company needs to account for, hey, I have two or three data centers, right? If one data center goes down... so there is a lot of redundancy you need to build for applications which are user-facing.

Animesh: That means, you know, the cost of GPUs is ballooning, right? And that essentially is an infrastructure problem which needs to be solved. Specifically now, when we do take LLMs into use cases where the consumer appetite for latency might not be there at all, right? There are efforts happening across the industry, right?

Animesh: Like, hey, take the traditional RecSys, right? Recommendation and ranking. So for example, you go to social media sites, you get recommendations, you get your feed, you get people you want to connect with, all these things. As soon as you go to the site, this should be there, right? As you're scrolling through the feed, [00:13:00] the feed should just be, you know, updated and customized for you in real time.

Animesh: There is no appetite for latency in those scenarios. Now, if you need to see whether LLMs can be effective in that world, that means, you know, you really need to optimize a lot for latency. And if you're really doing that, you are potentially throwing more money at the problem. And so that problem, I think, you know, how do we take this large language model architecture, the transformer architecture, and make it really, really optimized for inferencing, is becoming a big thing, right, which needs to be solved for scale.

Demetrios: Do you feel like it's a bit of trying to fit a round peg in a square hole? Because when you throw LLMs at, like, a RecSys problem, just because we've been doing RecSys for a while, and we figured out how to make it real time, why do we need to [00:14:00] add an LLM on top of it? It almost, in my eyes, is overcomplicating things just to try and use a shiny new tool, but maybe you've seen there's better performance.

Demetrios: There's better personalization there or something that I haven't.

Animesh: I think, more than speaking on my own behalf, right, like in general, there are, you know, research papers emerging, companies are trying that. Now, why would you try something like that, right? I mean, and that's a fair question, right?

Animesh: Like, RecSys is already well established, right? As an architectural pattern, right? Like, the traditional recommendation and ranking models, retrieval models, including, you know, something like graph neural networks, GNNs, et cetera, right? They do a very, very solid job at this. And you have seen companies like, uh, yeah.

Demetrios: And it's fast.

Animesh: It's fast. And, like, you take a look at TikTok; [00:15:00] oftentimes, you know, the recommendation algorithm is talked about. So you're fairly right to ask why. I think there are a couple of things in the heuristic, right? So the way we have solved recommendation and ranking problems in the industry is, obviously, you have created models.

Animesh: A lot of the companies have smaller models, right? These are not traditional foundation models which have been trained on world data, etc., right? They have potentially not seen a lot of user interactions and patterns. So then, you know, you add things like real-time training. There's a lot of data being ingested in real time, online training is happening, there is a lot of feature ingestion which is happening in real time: what is the user interacting with? So there is this new paradigm, okay, which is these models, the LLM models. These are foundation models. They have potentially seen maybe 95 percent of the patterns. [00:16:00]

Animesh: So maybe what you need to do in real time to update these models is probably a lesser investment, right?

Animesh: The models have seen the majority of the patterns. And if you feed in what the user has done, right, they would be able to predict much more comprehensively. You don't need to do a lot of online training, etc., in real time. That's one line of thinking. The other thing is the simplification of the architecture.

Animesh: Um, for something like GNNs, right? When you are doing the training of GNNs, all the data is in graph format, graph structure format; you need to traverse the nodes and the edges in real time, right? Because you don't know beforehand, right, how much data you will be processing, which will be the right node, which edge you need to traverse, right?

Animesh: So there is no data pre-processing happening. It's a different architecture, and it's inherently hard to scale GNNs beyond a certain limit because there is live data processing happening while the [00:17:00] training is going on or while inferencing is going on, right? So that's one architecture, and GNN is just one example.

Animesh: There are different, you know, recommendation and ranking architectures. And then companies would have a proliferation of these recommendation and ranking models, right? Like every team, for each of their use cases, would start from scratch and create a model, which is potentially a small model. It does a very targeted job. It does a really good job at it, right? And then they build all these things. So you have bespoke models with different architectures, and there are a huge number of them.

Animesh: It does a very targeted job. It does a really good job at it, right? And then build all these things. So you have bespoke models with different architectures. There are huge number of them. If you take the LLM route, right, like the the trend, which is emerging in the industry, right? Hey, if I do create a giant foundation model, right?

Animesh: And and if you you're Obviously very present in what's happening. Distillation as a technique is becoming very prominent in industry, right? Like I will create smaller models for inferencing, but I will distill it from this giant foundation model. So what you [00:18:00] have done is, you know, you have sort of centralized the, the, the creation of models.

Animesh: And in potentially a central place. So you can think of a scenario in the future, right? Like there is one central team instead of having every different use case in every different vertical within your or creating your own models for their own use cases, which are very targeted. One central team, which is, you know, the holder off your organization's data.

Animesh: which is curating that whole data and creating, you know, maybe one or two or three. Okay, very simplistic scenario. You're creating two giant foundation models, one for generative AI use cases, one for non generative AI use cases. And then, you know, the models for all particular use cases are being distilled from this.

Animesh: So you have simplified the overall architecture. You are essentially worrying about compliance, majorly. Compliance is a big thing as well, as you are seeing, right? There was the DMA Act, there [00:19:00] is the AI Act, there will be other acts which will be coming. If you curtail the surface area, how many models you have, right,

Animesh: what data they were trained on, right, then you're not worrying about, you know, the many hundred other models which your organization has created and making sure every one of them is compliant. What data was it trained on? Was that data compliant? You have sort of centralized and simplified that problem. Hiring becomes easier, right?

Animesh: Right now, you do tend to hire, like, when I'm looking at certain use cases, like, you know, GNNs, people who can run GNNs at scale, right? You need a very targeted skill set. People who understand, you know, how to handle graph data. And then you need to go into the depths of the GPU architecture, because a lot of data transfer is happening during that training.

Animesh: So, okay, is NVLink good enough? How much HBM memory do I have? How much on-disk memory do I have on the GPU nodes? Then, you know, what is the network [00:20:00] bandwidth? There is a lot of in-depth GPU knowledge you will require, right, just to go there and start solving. Plus, you know, the graph traversal: what algorithms do I need to introduce?

Animesh: So your sourcing of skills and talent also becomes simplified. So there is that overall value proposition which could be achieved, provided LLMs do prove themselves. I think many are looking at this problem space and figuring that out. So it's not yet solved. And as I said, you know, unlike generative AI as well, this is a problem of scale.

Animesh: Like, generative AI use cases, you have to explicitly, as a user, go and invoke them, right? I am giving you a LinkedIn example, right? You will go and say, hey, summarize my profile, or a recruiter will say, launch this particular agent, or, you know, a LinkedIn Learning user will go and say, [00:21:00] summarize this course for me, or explain this nuance to me.

Animesh: So these are, you know, discrete transactions. Users going to the feed, logging into the feed, browsing through the content, that is happening all the time. It is happening at scale, right? So that's a bigger, much bigger scale problem, which needs to be solved with LLMs overall, right? In a cost-effective manner.

Animesh: And in a very, very, you know, latency-sensitive manner. So that's the thing, right, which needs to happen, and we'll see whether they do prove efficacy. Like, you know, based on all the results you've seen, they're able to pass a lot of bar exams, PhD exams, math exams, right? So you'll see, hey, they are intelligent, right?

Animesh: Like, can they be very intelligent for this specific set of targeted problems?

Demetrios: Yeah, I hadn't thought

Demetrios: about the simplicity in the architecture and also the simplicity in being able to attract talent that understands [00:22:00] this architecture, because it is more simplified and you don't have these very specialized, deep, deep tech type of roles, so you can almost get a much broader choice of talent. Now, going into GPUs themselves, you were talking about the cost of having GPUs at scale, and when you have this many GPUs and you're trying to utilize them, you don't want to have any percentage going idle.

Demetrios: I imagine you think about that a lot. You're thinking, wow, we're burning money just letting this GPU sit around and we're not utilizing it to its maximum capacity. Is that what you're trying to do, or is that what Liger kernels are trying to help out with? Can you explain that?

Animesh: Yes, and I think for any infra and platform team doing ML infra at this point, if you talk to them, this is a burning thing, right?

Demetrios: Keeps you up at night.

Animesh: You have, on one hand, you know, you cannot run these LLMs, for example, or these modern recommendation and ranking models, without GPUs, [00:23:00] right? Training, definitely, right? And even with inferencing you have to go on GPUs, right? The general trend in the industry, if you rewind two years ago, was, hey, we will serve on CPUs, right?

Animesh: That's how most of the companies had architected, and CPUs were not that expensive. The modern architecture doesn't lend itself to that, right? So the whole investment in GPU efficiency becomes very, very paramount. And it starts at every single layer. Like, in our case, okay, there is the general thing, right,

Animesh: which also happens in companies, right? Here, you need to allocate a maintenance budget. You can be much more generous with CPUs, right, in terms of maintenance budgets, how much spare capacity you allocate. If you need high availability, you need to spread it across [00:24:00] three data centers.

Animesh: Okay, let's go ahead and do it, right? Those decisions were much easier. They're becoming much harder when you have GPUs in house because, yes, there are certain companies which have potentially invested a lot, right, and they have deep coffers, but a lot of the companies want to be cost-conscious about it, right?

Animesh: So every decision, even to the point of looking at certain use cases and asking what would be the right maintenance budget, right? Like, if you have, let's say, 1000 GPUs, can I allocate 100 just for maintenance, right? Then you do the cost and it's like, hey, there's a lot of wastage which is going on,

Demetrios: you start sweating,

Animesh: You start, yeah,

Demetrios: thinking about how much money that is.

Demetrios: And yeah.

Animesh: And so, how do you, first of all, start looking at the workloads so that they become more resilient to maintenance? The previous approach used to be, [00:25:00] once I know 20 percent of my fleet is going to be maintained, I will not schedule anything on it, right?

Animesh: We don't need to worry too much about, you know, the next 24 hours, even if they're empty. Now you're worrying about that: I cannot leave 200 GPUs lying empty for 24 hours, right, just because, you know, maintenance is going to happen. So then you look at your workloads: okay, how can I make the workloads more resilient?

Animesh: Now, if you look at distributed training workloads, right, the majority in the industry are gang scheduled, right? So you do gang scheduling; you know beforehand this distributed training workload will take X amount of GPUs, right? And there is a lot of, you know, synchronization happening as the distributed training is going on.

Animesh: And if, you know, a few of the nodes fail, you will have to put it back in the queue. Like, you obviously need to have checkpoint-restore, put it back in the queue, do gang scheduling again so that, you know, it can resume. So then you look at this workload: okay, maybe in [00:26:00] this world, we have to rethink the whole notion of gang scheduling, right?

Animesh: Do we need to build more elasticity, right? Like, if a workload has been launched on 100 GPUs and some of them are going to be taken away, can it shrink? Can it expand on its own, right, and continue going? Same for inferencing: I think the need for scale-down, scale-up is becoming big. Serverless architecture has to become a prominent part of the scheme of things, because you cannot optimize for max traffic.

Animesh: And a lot of us are doing that. A lot of the industry is doing that, where you take what is the peak traffic you get and that's your capacity. But if, out of 24 hours, 2 hours is your peak traffic, for the rest of the 22 hours these GPUs are sitting idle, because they have been provisioned to handle that peak traffic.

Animesh: Again, a lot of wastage. So many of these decisions are now becoming very prominent. So I feel the emergence [00:27:00] of elastic architecture, serverless architecture, to handle workloads which should be able to scale down and scale up as the capacity is shrinking and expanding, is a big investment area and something which we are doing. And I'm assuming, you know, a lot of people are solving this problem through different means and mechanisms.

Animesh: Right. So yeah, that's essentially the key characteristic there.

Demetrios: Well, and you haven't even mentioned the reliability of the GPUs, because they're notoriously unreliable.

Animesh: They are like, there are, um, I mean, all these Nvidia GPUs you get, for example, right? And, and one of the decisions we took, uh, early on, and I think, um, for the simplicity part of it, right?

Animesh: Like, hey, let's start, at least on the training side, right, we'll go with NVIDIA. And so far we have [00:28:00] standardized on NVIDIA throughout. But then, you know, you have a lot of these checks, things which can go wrong, because, just taking a distributed training example, when you are running at scale,

Animesh: you are touching, in our case, you know, we have Kubernetes as the compute infra layer, right? So you have Kubernetes, you have the network, you have the hardware, you have the storage. Training is, you know, reading and pumping a lot of data throughout. Any of these points going down is a weak link, in addition to the GPUs themselves.

Animesh: Right? And add to that the planned maintenance. And sometimes, you know, for foundation models, your training can run for one week, two weeks, three weeks at a stretch, right? When you are actually running training for these large foundation models, you're guaranteed to have some part of the infrastructure having some problem.

Animesh: Sometimes it will be the messaging queue. Sometimes it will be the storage system. Sometimes it's something in the Kubernetes layer, or sometimes it could be just the GPU [00:29:00] node itself failing health checks, ECC errors, NCCL being a problem, the InfiniBand which is connecting all these GPUs, right?

Animesh: So this poses a huge challenge, right, in terms of ensuring, because these are such long-running jobs, how do you make sure that you are being effective? And that's combined with the problem of gang scheduling, because they need to go all together at once, or otherwise they will need to wait.

Animesh: If a training job needs 500 GPUs, it will be in the queue until those 500 GPUs are free, right? You cannot launch it before that, because, I mean, that's how the architectures around gang scheduling are built. So this is a big thing which we have invested a lot in, in terms of building, you know, very fast checkpointing and restore, because you assume failure is happening, maintenance is happening.

Animesh: So, a lot of investment has gone in from our side in terms of ensuring that, you know, there is, uh, [00:30:00] Automated checkpointing restore, which could be done, uh, which is, you know, hierarchical. So we first checkpoint in memory, then distribute those checkpoints to some async block storage, then you have to read, right?

Animesh: And then, you know, there are also discussions around whether these checkpoints can reside on the GPU SSDs themselves, because there is a cost of reading. The training can continue checkpointing in memory, but once you have sent a checkpoint to some remote storage, a block storage or a file-based storage, then when you're reading it back, if something has gone wrong, that takes time; you have to bring in the network, and bigger models will have, you know, many gigabytes of checkpoints, right, and you are bringing them through the network.

Animesh: So then, okay, can we leverage the GPU SSDs under the covers itself, right, GPU SSDs obviously have some limited capacity. How much can you store? Uh, so bunch of those, uh, you know, architectural elements, uh, need to be crafted. So we are investing quite a lot and quite heavily into that. Um, Liger you talked about, right?

Animesh: Um, [00:31:00] Liger was the other part of the efficiency problem, right? Which is essentially: as we went on this journey, we saw, beyond the reliability and the scalability, that a lot of the use cases, the models, started getting more complex, bigger. And we had an X amount of GPU fleet. We started seeing a lot of pulls from our customers: hey, I need GPUs, I'm not getting GPUs, or my training is running for, like, you know, so long, right,

Animesh: and I really want to get it done faster. Multiple of these use cases, and literally, because, you know, GPUs were our scarce entity. At least, you know, a year ago, we had quite a bit of that, right? We were scaling the fleet, and we were ordering, but NVIDIA has its own supply chain and timeline, right?

Animesh: The use cases were just springing up. So we looked at the problem. Okay, you follow the traditional methodology of data parallelism, [00:32:00] model parallelism. Let's introduce ZeRO++, because our A100 fleet had very constrained network bandwidth. So how can you scale up training even with that constrained bandwidth? ZeRO++ helped in that.
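For reference, ZeRO++ is exposed through DeepSpeed configuration flags; a hedged, illustrative config (placeholder values, not LinkedIn's settings) might look like this:

```python
# Illustrative DeepSpeed-style config: ZeRO stage 3 plus the ZeRO++ communication
# reductions that help on bandwidth-constrained fleets (quantized collectives,
# hierarchical partitioning within a node).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # quantized weight all-gather
        "zero_hpz_partition_size": 8,       # hierarchical (secondary) partition size per node
        "zero_quantized_gradients": True,   # quantized gradient communication
    },
}
```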

Animesh: So after we have exhausted every single option, uh, then, okay, what can we do? So as we started going deeper, one of the things which we used to do infrequently for our customers was, you know, rewriting CUDA kernels. Right. Like, uh, a lot of the use cases which are very, uh, sensitive to that the training should complete within X number of hours, et cetera.

Animesh: Uh, when we looked at the model training code which our modelers used to produce, there was a lot of room for improvement. So at times we would go and rewrite the CUDA kernels for certain operations, right? Now you bring this to the LLM world. The thought which we had was, hey, we can do this on a model-by-model basis, but [00:33:00] this is not scalable, right?

Animesh: Like, there are multiple users having different models. So the thought was, okay, how can we use what exists to actually solve this problem in a way that can be a little bit more scalable? That's where, you know, fortunately, at that point OpenAI had launched Triton, which was essentially a Pythonic programming interface, right,

Animesh: where you can do CUDA kernel programming at a much more abstracted layer. Most of our users would be, you know, familiar with Python. So we used that. Now, the second thing was, you know, GPU memory is hierarchical, right? You have the DRAM, then you have the HBM memory, then you have SRAM. The streaming multiprocessors, the SMs, in the GPUs, they interact mostly with SRAM, right? Which is very little.

Animesh: So, with all these multiple kernels and multiple operators you have in your training code, [00:34:00] there's a lot of data transfer happening, right, from CPU memory down to SRAM; that is your biggest bottleneck, right? Even though the GPUs can massively parallelize, the amount of data IO which is happening between different hierarchies of memory is becoming a big bottleneck for the training time.

Animesh: So that's when we thought, okay, let's combine what Triton has brought to the table, take this problem, and first of all create certain custom kernels for our distributed training workloads. And we started seeing huge gains, right? Like, in one of the cases, a 300 percent increase in the overall efficiency; in the majority of the cases, more than a 50 percent decrease in the memory which was being used.

Animesh: What we did was we started fusing kernels, right? Certain operations can just be combined, fused together. Operators can be fused together; you don't need five kernels to do five different operators, right? And there is some human judgment involved in making that decision. [00:35:00] And we have also seen Flash Attention in the industry sort of solve that quite a bit, right?

Animesh: It became very popular, and it was built on the principles of kernel fusion. So we took that, combined it, and it helped a lot of our internal developers immensely: a huge decrease in training time, because memory efficiency got much better. And once we realized, hey, this looks good, we thought, let's just open source it.

Animesh: There was not a lot of planning. There was not a lot of thought. It was also like, a lot of open source models are coming in, and maybe, you know, if the community likes it, they will potentially create kernels for those open source models. And if we end up consuming those, it eventually also benefits us, right?

Animesh: Yeah, it just took off, right? I think multiple companies were in the same position, right? It's like when you go and put a problem out there. They were all, you know, getting this big load of generative AI use cases, a lot of LLMs. The GPU crunch was present for almost everybody, right? Everybody needed to solve this problem for distributed training at scale. [00:36:00]

Animesh: And we got a solid reception in the community, right from Andrej to Hugging Face. One thing which we did spend some time on was making it very easy to use, right? So we integrated with the Hugging Face ecosystem right off the bat, etc. And just last week it completed more than 1 million downloads, right?

Demetrios: Incredible.

Animesh: This is amazing. Like we never thought, you know, this will go all the way there, right? So, uh, uh, and we have gotten lots of good feedback from, from the users who are benefiting from it.
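To make the kernel-fusion idea concrete, here is a minimal Triton sketch in the spirit of what is described above: a toy fused add + ReLU (not an actual Liger kernel), where two operators run in one kernel so the intermediate result never makes an extra round trip through HBM.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Add and ReLU fused into a single pass: no intermediate tensor written to HBM.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```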

Demetrios: Yeah. It's funny that you mentioned the memory aspect, because we had a guest on here a few months ago talking about how [00:37:00] his whole thesis was that we are memory constrained.

Demetrios: We're not GPU constrained. GPU is almost like not the way to look at it. It should be really looking at memory. And that is the bottleneck right now.

Animesh: And I think very rightly so, right? Like, see, ultimately ML is a data processing problem, right? So that does mean that, you know, there's a lot of data moving in between these GPUs, right?

Animesh: Now, the fastest you can process data is if the data is in memory, right? Now you go through the hierarchy: okay, GPU HBM, SRAM. The second thing is, okay, the system memory of the CPUs on the GPU nodes. Then you go, okay, I will go outside the GPU nodes, right? Maybe the GPU SSDs, right?

Animesh: But the more you can keep in memory, the faster you can process, more so for [00:38:00] LLM-oriented workloads and generative AI workloads. Because if you look at it, people want to have bigger context lengths, right? Even when you are training the model, right, if you increase the context length, you potentially can train bigger models.

Animesh: At inferencing time, right, there's the amount of information users are sending for the model to process, and the amount of information the models are generating. Like, see, RecSys output used to be very straightforward, right? RecSys recommends whether this is a good piece of content to show or not. In the case of generative AI, the output is also huge, right?

Animesh: So a lot of that data processing is going back and forth, and the more you can leverage the memory effectively, the bigger the gain you can get. So, rightly so, there is a lot there. Even during inferencing time, the KV cache is becoming a big technique, right? Where you can reuse the KV cache as the [00:39:00] inferencing calls are coming in, right?

Animesh: For sequences of tokens, etc., which we have seen before, you can calculate it once. That can save

Demetrios: a lot of costs too. Yeah,

Animesh: you've already calculated attention scores for all these different tokens, for the sequences you are seeing, right? So why do this again? So let's keep it in memory.

Animesh: But how much can you keep in memory? The other part is, you know, if you start leveraging GPU memory for KV caching, then the model itself needs GPU memory for its own processing. So what is the right boundary? And most of the H100s, right, which is probably what the industry has most in bulk right now, H100s, or to some extent H200s, 80 gigs of memory per GPU is turning out to be low,

Animesh: right, based on the amount of data you're processing. And I think part of it is also that, as the use cases emerge, that starts becoming clearer. Because maybe [00:40:00] whatever went into the decision-making was: hey, this memory is for GPU processing, whatever the GPU is computing. But then people start leveraging this memory as a cache, like the KV cache; in the case of GNNs also, we are keeping some part of the graph structure in memory.

Animesh: When you start using it as a data storage mechanism, in addition to it being used for the computing needs of your model itself, then it's less; then you need more. So, I'm assuming that realization is already there. Like, if you see the Grace Hopper architecture and the Grace Blackwell architecture, where they're combining ARM CPUs

Animesh: with the GPU nodes, they're creating a very high-bandwidth data transfer link between the CPU and the GPU, so that you can then leverage the larger memory on the CPU side, because there is a very high-bandwidth data transfer. That is a response to this particular need [00:41:00] which is emerging from generative AI, right?

Animesh: It's, uh, yeah.
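A stripped-down sketch of the KV-cache idea from the discussion above: during incremental decoding, each step appends its keys and values to a cache instead of recomputing attention state for the whole prefix (single head, no batching, purely illustrative).

```python
import torch

def decode_step(q, new_k, new_v, cache_k=None, cache_v=None):
    """q, new_k, new_v: [1, d] tensors for the current token; cache_*: [t, d] or None."""
    cache_k = new_k if cache_k is None else torch.cat([cache_k, new_k], dim=0)
    cache_v = new_v if cache_v is None else torch.cat([cache_v, new_v], dim=0)
    scores = (q @ cache_k.T) / cache_k.shape[-1] ** 0.5   # attend over all cached keys
    out = torch.softmax(scores, dim=-1) @ cache_v          # weighted sum of cached values
    return out, cache_k, cache_v
```

The trade-off Animesh points at is visible here: the cache grows linearly with sequence length, so the GPU memory it occupies is no longer available for the model's own activations.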

Demetrios: Uh, I want to give a shout-out to the person who said that to me, 'cause I was trying to remember their name. It was Bernie Wu who said that we are memory constrained, and you just went very deep into that, and everything that you're saying makes a lot of sense. You did mention one thing that you worked on quite heavily too, which I've also heard is an absolute pain: checkpoints, and speeding up the checkpoints.

Demetrios: Because, A, it's not clear when things are going to fail, so you don't know if you need to checkpoint every second, or every five seconds, or every five minutes, or every day, and so, you, if you over optimize for checkpointing all the time, then you're potentially transferring around a ton of data. And you don't need to be, because as you mentioned, these checkpoints are huge.

Demetrios: And so, especially when you're training LLMs that are very, very big, the checkpoints can be in the terabytes. And that's [00:42:00] just like, if you're doing that every five seconds, that's a whole lot of data that's going around. So how did you specifically speed up the checkpoint and make that process better?

Animesh: I think, so initially the very naive implementation of checkpointing was, you know, the majority of our data is on HDFS,

Animesh: right? And the mechanism was: you checkpoint, and the checkpoint goes to HDFS. And when you are reading, right, you read from there. So the realization, specifically with LLMs, was that, first of all, these checkpoints are big, and the training pauses while you are checkpointing, right?

Animesh: So there's a pause in the training workload while this gets done. There's a transaction which is happening, right? So the first thing which we did, and, you know, we changed that [00:43:00] architecture as LLMs became more prominent, is: hey, let's make it a two-phase transaction, right? In essence, we will checkpoint in memory, right?

Animesh: And from there onwards, any copy of that checkpoint is async. So, two points, right? The checkpointing in memory is very fast, and the second thing is you're not waiting for that checkpoint to be transferred to remote storage, right? So it's a hierarchical checkpointing strategy which we developed.
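A minimal sketch of that two-phase pattern, assuming a hypothetical upload_fn that writes to remote block storage: the training loop only pays for the in-memory snapshot, and the transfer happens asynchronously.

```python
import io
import threading
import torch

def checkpoint_async(model, step, upload_fn):
    # Phase 1: fast snapshot of the weights into host memory while training briefly pauses.
    snapshot = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    # Phase 2: serialize and ship to remote/block storage off the training path.
    def _upload():
        buf = io.BytesIO()
        torch.save(snapshot, buf)
        upload_fn(f"checkpoints/step_{step}.pt", buf.getvalue())  # hypothetical uploader

    threading.Thread(target=_upload, daemon=True).start()
```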

Animesh: And now we are, you know, streaming that checkpoint. We also changed our backend for checkpointing storage from HDFS to a block-based storage. Now we are investing in figuring out how to optimize it even further. So, GPUs: when you are ordering NVIDIA GPUs, you can request SSD storage within the GPU nodes, right?

Animesh: And there are technologies, [00:44:00] which I'm assuming some of the companies may have in house, but even in open source, right, where you can then start looking at creating a cache, a distributed cache, on the GPU SSDs. Because maybe one GPU's SSD storage is not enough, right, for your needs.

Animesh: Then you can combine the GPU SSD storage across all your GPUs and create a distributed cache to store this, right? So that's the second phase we are on. And the advantage with that will be in the restore. I think checkpointing will still be fast because it's in memory, and it then goes to the GPU SSDs. But when you are reading back,

Animesh: if you don't have to take it outside of the GPU network and bring it back into the GPU network, you save tons while restoring and reading. The other thing which we did, like to your point, one of the questions was how often do you checkpoint, right?

Animesh: I mean, that's a big question. And we have mostly sort of, you know, left it [00:45:00] to the modelers to take that common-sense decision. How often should you checkpoint, right? For example, for jobs which take more than a week to train, I think they can ask: hey, how much am I willing to lose?

Am I willing to lose five hours' worth of work, or, if something goes wrong and I have to restart, can I restart from the previous day's checkpoint? So it's a decision-making heuristic. In certain use cases, like very deterministic use cases where you're doing incremental training and you need to train every few hours, then it is very clear: hey, it's driven by checkpoints, right? So you have to do a checkpoint every two hours, et cetera.

Animesh: So you have to do a checkpoint, you know, every two hours, uh, et cetera. So. Uh, depending on the use cases, the other thing which we have done is like disruption readiness, so. If, uh, in cases of planned maintenance, right? Uh, where we know that the nodes are going to go down, right? We trigger a checkpoint, right?

Animesh: So it's a triggered checkpoint. And if [00:46:00] the modelers have implemented, uh, checkpointing restore, right? That trigger will go invoke a checkpointing restore. So this is not a plan. This is not it's happening before we take that workload of those nodes. Right? So essentially this is, uh, signal from the underlying infra that hey, These nodes are going away, I'm triggering a notification, this notification, and this is all automated right now, we'll go to the model, running model, that will be checkpointed, moved, and then you know, we will put it back, and then we are invested in things like priority queuing, so if things which are being disrupted, they should be moved to the front of the queue, like a Disney front of the line pass.

Animesh: like as fast as possible, right? So you need to invest in those mechanisms to make sure that, you know, uh, uh, that goes on smoothly. Yeah.

Demetrios: I like the idea of first of all, just getting the checkpoint offloaded as quickly as possible so you can get back to training and [00:47:00] then do what you want with the checkpoint, but don't have everything stopped.

Demetrios: Because you're trying to offload the checkpoint. So, like, make that as quick as possible. I think I've heard that before as a design pattern. And then also make sure that there are no surprises for the model that is training, so it's not like, hey, what happened? I thought I had all these resources and it all went offline.

Demetrios: So you make sure of these things. They feel kind of standard, but unless you think through it, or you have a bad experience, you don't do them. I imagine you probably had an experience where you realized, wow, we should probably do something about that, because there was a whole training job that just kind of went to nothing, and we could probably provision, or at least put a bit more anticipation in place, for that. One thing that I was thinking about is how we've been centering this conversation a lot on [00:48:00] the new paradigm of LLMs and agents, but LinkedIn still has a ton of traditional ML use cases, right?

Demetrios: So how do you think about bridging the gap, or creating a platform that can service both the LLM and agent use cases, and then also the traditional ML use cases?

Animesh: I think it's, it's happening. Um, so certain things are common, right? Like, obviously, all the investments we are doing in the GPU infrastructure, GPU monitoring, observability, resiliency, um, And, you know, building things like, uh, distributed checkpointing, restore, automated disruption readiness, then, you know, it, it goes.

Animesh: Then you start going into, like, one of the things which I did once I came here was, you know, [00:49:00] we took a major decision to overhaul our machine learning training pipelines. The pipelines, or the orchestration engine, were built in a very prescriptive way, right? There were multiple components which we had in our training pipeline.

Animesh: It was built on TensorFlow, and it sort of followed a similar paradigm to TFX, TensorFlow Extended, which you see from Google, where, you know, every single step is sending metadata to the subsystem. So, very rigid. I think that realization came very early on. And part of this was also, you know, that there is a lot of traditional feature engineering which happens with the traditional RecSys models, right?

Animesh: As opposed to LLMs, which don't have the notion of feature engineering; you are feeding, you know, huge amounts of text. So feature engineering as a discipline is [00:50:00] sort of disappearing in this new world. And there are a lot of new, you know, RAG-based architectures which are coming up, right? You are doing more in real time.

Animesh: So the machine learning pipelines engine we rewrote, and we redesigned it on an open source orchestration engine called Flyte, because the notion was that experimentation will be very heavy in this new world. And so, you know, there is a lot of quick changing of code; I want to do very short fine-tuning jobs or sometimes very long-running jobs, but the user experience has to be

Animesh: very nifty, you know. We launched something called interactive dev, where you can have VS Code connect to a remote running job, right, and put debug pointers and trace through what is going wrong. Because otherwise the previous experience used to be, you know, you don't know what has gone wrong.

Animesh: So once it has gone wrong, you need [00:51:00] to rebuild, recompile, reupload. So this whole new modern training pipeline architecture is very much focused on very quick changes, very much on experiments: you don't need to reupload, you go directly, debug there, see what values are being passed in real time through your debug pointers, and go one by one.
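Interactive remote debugging of this kind generally relies on a debug server inside the running job; here is a minimal sketch using the open-source debugpy package, which is an assumption about tooling rather than a statement of what LinkedIn uses.

```python
import debugpy

# Open a debug port inside the remote training job so an IDE such as VS Code can attach.
debugpy.listen(("0.0.0.0", 5678))
print("Waiting for a debugger to attach on port 5678 ...")
debugpy.wait_for_client()   # the job resumes once the IDE connects
debugpy.breakpoint()        # pause here and inspect live values, no rebuild/reupload needed
```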

Animesh: And then, you know, once we standardized LLMs on it, we started migrating a lot of our traditional RecSys models onto this new platform as well. And the feedback from users is really great. They're all liking this a little bit less rigid, very, you know, experimentation-oriented architecture, where the back and forth of the things which they need to change, right, to change their model and train their model,

Animesh: is much easier. And there is a very robust, inbuilt versioning mechanism. One thing which was missing in our previous versions of machine learning pipelines was: you have a config, you have a model, you have a dataset. All these [00:52:00] are differently versioned things, right? And when you do an experiment, you have to go and look at every different entity:

Animesh: what was correlated with this particular experiment or training run? So in this new machine learning pipeline, there is versioning, right, in terms of all the entities: your configuration parameters, your pipeline, your model, as well as, you know, which dataset version was used. These are all bundled together, right?

Animesh: So every experiment is associated with the right lineage: okay, these were all the things which were used to create this particular model, right? So versioning is very robust, because it's sort of built with the assumption that you will experiment a lot, right? So that's one change.
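Illustratively (the field names are hypothetical, not the actual LinkedIn schema), bundling that lineage per experiment could be as simple as:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentRecord:
    """One versioned bundle per training run, so lineage never has to be reconstructed by hand."""
    pipeline_version: str
    config_version: str
    dataset_version: str
    model_version: str

run = ExperimentRecord(
    pipeline_version="pipelines/llm-train@v42",
    config_version="configs/sft@v7",
    dataset_version="datasets/member-content@2025-01-15",
    model_version="models/foundation@v3",
)
```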

Animesh: One thing which will potentially change, and again, it's early, early days, is what will change in RecSys vis-à-vis the non-LLM architecture. One hypothesis is, hey, the LLM architecture itself will take its place in RecSys, right? So then, okay, whatever you have done at the upper layers for [00:53:00] LLMs, like vector DBs, embeddings, right,

Animesh: they become the prominent mechanism. Overall, RAG becomes a prominent mechanism, right? So your traditional feature engineering pipelines, the way you are constructing them, will change; you know, the real-time feature pipelines, not the offline ones, right? They will change to accommodate this new paradigm.

Animesh: Tools like LangChain and LangGraph, which are very much dominant in the LLM space, may become prominent in that space as well. So far, we haven't gone that far. I don't think we have looked at, hey, how LangChain or LangGraph is being used for LLMs, whether, you know, prescriptive orchestration graphs, when the models are inferencing in real time, or non-prescriptive agentic graphs, which are formed where the agent decides what path to traverse,

Animesh: can be used in the traditional space, where, you know, there is a lot of feature ingestion happening, there are feature engineering pipelines which you have built. [00:54:00] I think the wait is on to see. The thing which will simplify this is if LLMs can solve that part of the problem.

Animesh: Then you can take the LLM infrastructure as-is, even at the upper layers, and start replicating it for RecSys. Until then, in fact, the majority of our workloads are still RecSys workloads not powered by LLMs, right? And rightly so: they are very targeted, very effective in what they do. So the first hypothesis has to be proven at the model layer itself. But at the lower layers, we are definitely ensuring that there is standardization across, so it's now the same machine learning pipelines engine which is being used all across LinkedIn, whether it's for RecSys or for LLM workloads, right?

Animesh: So we have standardized all of that layer. The topmost layer, when you move towards inferencing and all, some of those changes, you know, we'll figure out as we go. Yeah.

Demetrios: Yeah. Where exactly does it diverge? Because it is fascinating to think [00:55:00] about how at the lower levels you've recognized that you can share resources, or you can have the same type of stuff going on, but at some point it's diverging right now.

Demetrios: So where is that point, if you know? Or it might be a case-by-case basis,

Animesh: I think, see, the first thing is, you can look at every single layer, right? So for example, when you are going to inference, right, people have either written their own inferencing engines in different languages, or you have standardized on something like TF

Animesh: Serving from open source, or TensorRT from NVIDIA, which were very heavily optimized for models in the RecSys space. When you look at the LLM space, there is vLLM, there is TGI, there is SGLang, right? The inference engine itself: there are custom inference engines being written for LLMs. You can even look at the hardware [00:56:00] architecture, which we didn't talk about, right? There are a lot of custom chips which are coming; you see many startups, Groq is one example, which are just developing

Animesh: custom hardware. Cerebras, yeah, yeah, for sure. So, for people who are investing and taking that bet, hey, it can be much simpler for them because they can just standardize on one. One of the reasons you may feel, hey, NVIDIA GPUs could be better for LLMs, versus, you know, people who are investing in custom ASICs, custom chips, which are potentially better for LLMs in terms of inferencing, etc.,

Animesh: is whether the bet is that the transformer architecture is going to replace every single thing; then you can have custom chips. So, right, there is some divergence. Some companies are just going with very custom chips for LLMs. So the engine is different, the chips are different, right? Whereas if you build something general purpose, right now we [00:57:00] are all, you know, standardized.

Animesh: At the hardware layer, there is no specific divergence apart from the kernels work, right, which we have been doing, which has been pretty specific. Then you also look at the feature processing infra, right? A lot of real-time feature processing, feature ingestion. The kind of tooling which has emerged in the LLM world, you are talking, you know, LangChain, LangGraph; vector DBs are prominent, right?

Animesh: It's not the traditional feature processing, right? And embeddings have become the dominant landscape. And I think we have been on a journey where, you know, we are using embeddings all across as well, quite heavily, both for RecSys as well as LLMs, but some of the tooling chain there is different, right?

Animesh: What you would use something like Beam for, for example, you know, for feature processing in the past, now, you know, a lot of that is happening [00:58:00] in the LLM world through the LangChains and the LangGraphs, like, you know, when you are actually orchestrating the data. So those are the divergences which are happening, and they are there for the right reasons.

Animesh: I think, um, you know, over a period of time it will reconcile, but yes, that layer is different when you are handling traditional RecSys versus this; the inferencing engines are different when you are handling that. In some cases, people are even taking different chips, hardware chips, for these layers.

Animesh: Um, so I think it's bottoms-up. We have to push, we have to make sure, you know, we continue standardizing on the bottom layers, right, and go up the stack and see, as we become more mature, what else can be standardized and can be the same. If LLMs start solving RecSys, then potentially, maybe, you know, that itself solves a lot of this [00:59:00] problem.
