Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing
Bernie is VP of Strategic Partnerships & Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies, including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint.
We will highlight some recent work we’ve done in integrating this novel class of memory into LLM/RAG/vector database frameworks and workflows.
Disaggregated shared memory is envisioned to offer high-performance, low-latency caches for model/pipeline checkpoints of LLMs, KV caches during distributed inferencing, LoRA adapters, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months.
Bernie Wu [00:00:00]: My name is Bernie Wu. I'm the VP of strategic partnerships and business development for MemVerge, and I'm a fairly simple guy. I take my coffee black, but I take a lot of it.
Demetrios [00:00:11]: Welcome back to the MLOps Community podcast. I am your host, Demetrios. And, you know, normally I like to take my vitamins, and I generally take quite a few vitamins. I do, you know, an array of vitamins. Vitamin D, vitamin B, vitamin C, vitamin A, all that good stuff. I also have been known to take some mushrooms. Not the hallucinogenic kind, the kind that help your brain function. And oh, my, I wish I would have remembered to take them today, because I did not.
Demetrios [00:00:48]: And in this conversation I was hanging on by a thread. Bernie brought some serious learnings to me. He helped me see things in a different way, specifically around memory, memory allocation, elasticity of memory. And it just so happens, funny enough, ironically, that I should have taken these pills to help with my own memory. Not the hardware memory, my own. If anything Bernie says piques your interest, feel free to reach out to him. He is very cool and very open to chatting with new people.
Demetrios [00:01:29]: Here we go. One thing that most of the time we'll snark at when people say it is this expression of, oh, you just have to think about it from first principles, you have to apply first principles thinking to it. And I know it is great, but it's almost been overused since it came into fashion and Elon Musk started using it. And now when you hear people saying it, I don't really trust that they actually, A, even know what that means, and, B, actually do it. But in the pre-conversation that we were just having, the only thing that I could think about is your vision, which I want to get into, of not looking at the GPU shortage in the market as the problem, but looking at the memory shortage in the market as the problem. That is what I would consider taking things to first principles.
Bernie Wu [00:02:36]: Yeah, I know, I agree. It's funny you mentioned first principles, because a long time ago, I went to UC Berkeley, and the emphasis, even at the undergraduate level, was getting everything down to first principles. If you don't understand the first principles, you're not going to graduate from this place. But I do think that in this industry, and I'll start out with this little factoid, the bulk of the industry is going to be delivery on production. That's where you get an ROI on all this AI investment that's going on in the industry. And the transformer model is taking over the world, the transformer model and foundation models and things like that. I believe the number is around one FLOP per ten bytes.
Bernie Wu [00:03:28]: Or in other words, there's ten bytes of memory movement, or loading into registers, for every floating point operation on average. So it's a very data intensive area. And on the flip side, in the news there are all the stories about GPU shortages and things like that. But if you actually look at how people size their purchases of GPUs, it's based on, okay, my model is this big and therefore I need this much memory. And okay, each GPU has this much HBM, therefore I buy X number of GPUs. That's how it was done.
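To make that sizing logic concrete, here is a minimal back-of-envelope sketch in Python. The numbers (70B parameters, FP16 weights, a 3x memory overhead factor, 80 GB of HBM per GPU) are illustrative assumptions, not figures Bernie quotes; the point is only that the GPU count falls out of the memory footprint rather than the compute requirement.

```python
# Back-of-envelope GPU sizing by memory footprint, as described above.
# All numbers are illustrative assumptions, not vendor specs.
import math

def gpus_needed(params_billion: float, bytes_per_param: float,
                overhead_factor: float, hbm_per_gpu_gb: float) -> int:
    """Estimate GPU count purely from the model's memory footprint."""
    weights_gb = params_billion * bytes_per_param    # raw weights
    total_gb = weights_gb * overhead_factor          # optimizer state, activations, KV cache, ...
    return math.ceil(total_gb / hbm_per_gpu_gb)

# Example: a 70B-parameter model in FP16 (2 bytes/param), 3x memory overhead,
# 80 GB of HBM per GPU -> sized by memory alone, regardless of FLOPs utilization.
print(gpus_needed(70, 2, 3.0, 80))   # 6
```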
Bernie Wu [00:04:07]: And you find out that a lot of times the GPUs are purchased that way, and often they're underutilized because they're memory bound. For some of these models that were trained, at least at scale, their utilization was down at the 30% level. So if you could actually get more utilization, obviously you wouldn't need so many GPUs, but they are memory bound. So we are working to solve that. I mean, it's a complex problem. It's going to take some new memory technologies, new software, which is what my company's working on, and it's going to percolate all the way up through the stack. I've started seeing signs of that already.
Bernie Wu [00:04:54]: New kinds of schedulers are going to have to be developed and things like that to handle a more composable memory architecture. Now.
Demetrios [00:05:02]: Well, yeah, talk to me about how you see memory right now. It almost feels like we are memory scarce. How would a memory abundant world look?
Bernie Wu [00:05:18]: Yeah, a memory abundant world. Well, I would say that it's very common in things like Kubernetes and AI environments that you run out of memory and then you start spilling to disk. And when you start spilling to disk, your performance grinds down; it runs 400 or 500% slower, that kind of thing. It starts swapping and things like that. Or in some cases, because you don't even have the option to swap to disk, these things just get killed. Either they're killed or whatever. So as an example, in that kind of use case, we're working on a situation where we have this elastic memory, we have this surge capacity of memory in a separate pool.
Bernie Wu [00:06:04]: And then when we detect that something is running out of memory, a particular compute instance, we just give it a shot of memory until the problem goes away. Then we reclaim it and assign it somewhere else. So I think part of the overall solution, kind of like what people are doing with GPUs, is to pool things together, so you can get higher efficiency and also address fluctuations in demand as workloads go through different compute instances, adding either more GPUs or more memory. In other words, the infrastructure needs to get more elastic and composable.
Demetrios [00:06:45]: Well, it's interesting that you talk about how a lot of times you're seeing a 30% utilization rate, right? But then there are those spikes that will kill the job, and so you're trying to help with that spillover when you get a spike, so that you don't lose everything, right?
Bernie Wu [00:07:06]: Right. Yeah. So that's another interesting area. Because the AI/ML area is a relatively new industry, and because you need to preserve these models, they take a long time to run, there's been a lot of work done at the framework level, the software, the PyTorch level, etcetera, to take checkpoints of things and save those states. So that between training epochs, if something crashes, you can at least restore, or if you overshoot the model training, you can roll back, etcetera. Those are interesting technologies, but I don't think they can address all the different use cases, for example when a system just overheats or memory failures occur, especially at scale.
Bernie Wu [00:07:55]: I was at a conference and they said that for some of these larger models, they're planning to build clusters with a million GPUs. And at that kind of scale, you're going to have 30, 40, 50 GPU failures per hour. So a lot more resilience has to be built into the architectures. And if you know anything about memory, as soon as the power or something goes off, you lose it. So we're also working on what we call checkpointing technology at the memory level: the ability to save machine state and memory state at that level, as well as the framework level. I think the combination of those two will get much higher resilience, to be able to run all these long-running training jobs or long-running inferencing jobs.
Bernie Wu [00:08:40]: Now they're actually going to become inferencing services. So they need to be much more resilient than they are right now.
Demetrios [00:08:47]: What do you mean by inferencing services?
Bernie Wu [00:08:49]: Well, I think, yeah, things go into production. We go from what is fundamentally, from a scheduling and operations standpoint, more of a batch operation, batch to train this, batch to train that, to a service where we're constantly dealing with maybe hundreds of thousands of consumers. And this thing has to run like any other web service: its resilience, its ability to be migrated to new infrastructure as old infrastructure needs to be upgraded. All those kinds of tools that I'm sure the MLOps community is very, very familiar with, they're going to be needed. And so we're interested in collaborating with your community to help bring those tools, the right tools, to handle this new onslaught of demand for memory.
Demetrios [00:09:38]: Well, it's funny, you mentioned how it's fairly common for GPUs to go offline. I think the best way I've heard it put was when we had our AI Quality Conference. Todd Underwood, who was head of the research platform at OpenAI, said you can just look at a GPU the wrong way and it'll go offline. He was obviously taking it to the extreme and making jokes about how sensitive GPUs are, how finicky they are, and how difficult it is for people to actually get that right. And then in that same talk, he talked about how the other part that is complicated to really get right is knowing when to checkpoint. Because, yeah, maybe you say, okay, I'm going to checkpoint every second, but what happens if it doesn't go offline that often? If, on average, GPUs go offline every hundred seconds, then you should be okay to do it every 50 seconds or every 25 seconds. Right? But what happens if it goes offline every 3 seconds? Then you really have to be figuring those things out, doing the calculations and recognizing it. And another person that we had come on here from nebulous cloud was talking about how these checkpoints are just so big when he's training these models.
Demetrios [00:11:05]: So you can't be that sparing with how often you're checkpointing, because it adds up quickly. If I remember correctly, it was like three-terabyte checkpoints every time.
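The trade-off Demetrios is describing here, checkpoint cost versus failure rate, has a classic rule of thumb from the HPC world, the Young/Daly approximation: the optimal interval is roughly the square root of twice the checkpoint cost times the mean time between failures. A minimal sketch, with made-up numbers for illustration only (a 3 TB checkpoint written at an assumed 10 GB/s, and an assumed one failure per hour):

```python
import math

# Young/Daly rule of thumb: T_opt ~ sqrt(2 * checkpoint_cost * MTBF).
# Numbers below are illustrative assumptions, not values from the episode.

def optimal_checkpoint_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

checkpoint_cost_s = 3_000 / 10      # 3 TB at ~10 GB/s -> ~300 s per checkpoint
mtbf_s = 3_600                      # assume one failure per hour across the cluster
print(optimal_checkpoint_interval_s(checkpoint_cost_s, mtbf_s))   # ~1470 s between checkpoints
```

The faster the checkpoint can be written (for example, into a memory tier rather than a file system), the smaller the cost term, and the more often you can afford to checkpoint.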
Bernie Wu [00:11:18]: Yep, yep. Some of the training models are getting up to that size, and writing that all to a file system, even a high-performance file system, is becoming a bottleneck, and then so is restoring it. So one area where some of this memory technology that I'm working on is relevant is to create a memory pool that can act as a cache, to basically dump all that stuff very quickly into memory, which is much faster. Obviously it's about memory speeds, the latency, and all that kind of stuff, but it could be up to a 10x difference: dump it into memory and then asynchronously bleed it off into the file system. That way you minimize the interruption period of the checkpoint, because that is a problem. It's not surprising in a lot of training runs for the checkpoint overhead to be 25 to 30% of the total training lifecycle, just doing the checkpoints.
Demetrios [00:12:16]: Yeah, checkpoint management. So basically you're saying, hey, what if we just had a Redis type of thing where you dump it into a memory cache, and then slowly it gets offloaded from that memory, so you don't have to be sitting there taking the time to offload the checkpoint and then come back online.
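A minimal sketch of that two-stage idea, write the checkpoint to a fast memory tier and drain it to the file system off the critical path. The `memory_tier` and `filesystem` objects are hypothetical stand-ins (for a pooled/CXL memory cache and a parallel file system), not any specific product API:

```python
import threading

# "Dump to memory fast, bleed to storage asynchronously" -- illustrative only.

def checkpoint(state_bytes: bytes, step: int, memory_tier, filesystem):
    key = f"ckpt-{step}"
    memory_tier.put(key, state_bytes)        # fast path: memory-speed copy, short training pause

    def drain():
        filesystem.write(key, memory_tier.get(key))   # slow path, off the critical path
        memory_tier.evict(key)                        # free the cache slot for the next checkpoint

    threading.Thread(target=drain, daemon=True).start()
    # training resumes as soon as the in-memory copy completes
```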
Bernie Wu [00:12:36]: Yeah, some of that's been applied in other industries, like the high performance computing industry. They've had these giant clusters for decades, and they're running a model that runs for a whole month, a simulation of a nuclear explosion or whatever they're doing, and they use this kind of checkpointing approach. So I think some of those techniques will have to come into play here. And then there are other needs for memory caches. For example, on the inferencing side, what's happening is that, in order to scale it, the models are getting so big, like the latest Llama 405-billion-parameter model, that they don't fit in most GPU nodes and now need to be distributed. People have already figured out kind of how to distribute training with 3D parallelism and all this kind of stuff, but now they're trying to figure out how to distribute inferencing.
Bernie Wu [00:13:33]: And one of the first things people realized is, hey, this is a two-stage process. There's a prefill phase, the prompt phase, which is compute intensive, and then there's this decode phase, which is really purely memory bound. And so they're starting to split the architecture. But the prompts are also getting so long that the KV values in these models go up quadratically with the length of the prompt, and so those things don't fit in memory. So now they have to start distributing the memory. So we have a distributed GPU and a distributed memory problem. And anytime you deal with memory, the first thing that happens is it gets fragmented. So how do you manage all that stuff? I don't think there'll be any shortage of work for companies like ours and the rest of the memory industry in trying to solve some of these problems, and they're really cool problems, because like I said, if I compare this revolution to the big data revolution, there's a lot the infrastructure area can do to innovate and improve this overall stack.
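To give a feel for the sizes involved, here is a rough KV-cache sizing sketch. The model shape (80 layers, 8 KV heads, head dimension 128, FP16 values) and the batch sizes are assumptions chosen for illustration; note the cache itself grows linearly with sequence length, while attention compute grows quadratically, and either way long prompts quickly outgrow a single GPU's HBM:

```python
# Rough KV-cache sizing with an assumed model shape (not a published spec).

def kv_cache_gb(seq_len: int, batch: int, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for the separate K and V tensors, per layer, per token, per sequence
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch / 1e9

print(kv_cache_gb(seq_len=8_000, batch=32))     # ~84 GB
print(kv_cache_gb(seq_len=128_000, batch=32))   # ~1,340 GB -- far beyond one GPU's HBM
```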
Demetrios [00:14:46]: Yeah, it's really attacking that hardware side of things which. Yeah, the big data revolution. I think you mentioned before how it was a revolution, but not as hard on the whole stack.
Bernie Wu [00:15:01]: Yeah, same processors actually, just using lower-grade servers. You didn't need to have a high-end server. The whole trick was in scaling out, not up. Here, I would contrast that, because these workloads are different. Right now, most of this memory problem is what I would characterize as capacity bound, or secondarily bandwidth bound. But latency bound is the other concept. When you talk about what it means to be memory bound, it's either capacity, latency, or bandwidth, or all of them. I think the latency-bound issues are going to become more and more critical, because the pipelines are getting more and more complicated.
Bernie Wu [00:15:43]: And so the end-to-end time is becoming a factor. If you're sitting at your computer, you type something in and you get an answer back in 200 milliseconds, life is good. Now these computers are going to go off and start reasoning, and do some sort of multi-agentic RAG LLM pipeline. And there's a big difference between waiting 10 seconds for the answer and maybe waiting ten minutes. Right. So getting faster devices, which is memory basically, that's where memory really beats conventional flash storage. Using more memory, more aggressively, to help compress those pipelines and increase the utilization and throughput is going to be key.
Demetrios [00:16:28]: Yeah, well, talk to me about this distributed memory problem. And when you go into that type of a problem, can you break down how you think about solving it?
Bernie Wu [00:16:40]: Well, I think at the lowest level, things like vLLM were first to pioneer this: they built a virtual-memory-style system inside the HBM memory space, so they could page and have uniform-size blocks and manage memory that way. Because every prompt that comes in creates a different-size set of KV values, right? So you're constantly fluctuating, and you had to have some order to the allocation of that memory. Now these things don't even fit, so you have the same problem of allocating memory, but it spans multiple GPUs and even multiple nodes. So it's going to take some kind of scheduling scheme, which is something else we're starting to work on at our company: how do we schedule these resources? And it's not just scheduling GPU resources. I think we're also going to need to figure out how to schedule memory resources and leverage a composable memory, or a virtualized memory pool, that makes it easier to move memory around between compute nodes as well as within a GPU node itself. And then we have to figure out how to migrate things.
Bernie Wu [00:18:02]: Right. With classic memory, at some point you have to start garbage collecting, defragmenting, compacting, bin packing things. Same exercise here. It's interesting how the industry kind of goes in a spiral: all these concepts have existed for decades, but now we've got to figure out how to apply them to this new set of circumstances. Some of those basic principles are still going to need to be applied in this area.
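A toy sketch of the paged-allocation idea Bernie is gesturing at: carve GPU memory into fixed-size blocks so variable-length KV caches can grow without fragmenting it. This is an illustration of the concept, not vLLM's actual implementation:

```python
# Toy vLLM-style block allocator: fixed-size KV blocks per request, a free list,
# and whole-block reclamation so the pool never fragments. Illustrative only.

class BlockAllocator:
    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))      # free list of physical block ids
        self.tables = {}                         # request id -> list of block ids

    def append_token(self, request_id: str, tokens_so_far: int):
        table = self.tables.setdefault(request_id, [])
        if tokens_so_far % self.block_tokens == 0:   # current block is full, map a new one
            if not self.free:
                raise MemoryError("no free KV blocks: preempt, swap, or spill to a second tier")
            table.append(self.free.pop())

    def release(self, request_id: str):
        self.free.extend(self.tables.pop(request_id, []))   # blocks return to the pool intact
```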
Demetrios [00:18:27]: I'm not sure I fully understood when you said the composable memory there and the virtual memory, sharing it around. Yeah, I think I get it, but not fully.
Bernie Wu [00:18:42]: Yeah, yeah. Actually, I kind of jumped ahead. I didn't explain that concept. So up until now, DRAM has been, I call it, chained to the processor. It's either chained to the GPU as HBM memory, or it's chained to a CPU as the system memory on the DDR bus. What's going on in this memory revolution is not quite as visible as the accelerator revolution, but there's definitely a memory revolution underway. There's a big effort to put memory on its own fabric. One of the first steps toward that was the CXL consortium, which was formed in 2019, and that was the realization that we can't use these classic parallel bus architectures and scale memory much further.
Bernie Wu [00:19:28]: We're out of gas there. Matter of fact, if you look at it at a high level, the ratio of memory to compute cores has been deteriorating for years and years, and now we've got to totally reverse that. So the only way they could do that was to actually start using more of a serial bus architecture, and the first one to come along is PCIe. So the CXL consortium now allows you to take memory modules and plug them into a PCIe bus and extend the memory over the PCIe bus. And this is standard DDR DRAM memory, so it's reasonably high-speed memory, and it allows you to expand within a compute node and then across compute nodes. You may know already that a lot of people have built PCIe switches, and now there are what they call CXL-enabled PCIe switches coming out.
Bernie Wu [00:20:19]: CXL, by the way, stands for Compute Express Link. It's the acronym for this whole idea of creating additional pools of memory that are external to the system buses but allow you to pool and share memory. Sharing memory is really kind of the ultimate, because I think there are use cases in AI where people want to save the entire chat history and then be able to put that, or pieces of it, back into a problem quickly, and get more accurate or relevant answers and things like that. Well, you can imagine at some point in the not-too-distant future, we'll have the ability to take an entire group, let's say this group is all interested in travel to Hawaii, and put all their chat history in one pool and share the responses and questions among the whole group across multiple nodes. Collaboration is now going to be enabled. People are already trying to collaborate to a certain degree, but I think we'll be able to take it to a new level and to a new depth of history. So that's also exciting.
Bernie Wu [00:21:37]: This ability to use a memory pool. The memory pool basically becomes a first-class citizen over time in a data center, and becomes a repository of all this chat history.
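On Linux, CXL expander memory typically shows up as a CPU-less NUMA node, so a simple way to picture the tiering Bernie describes is a placement policy: hot, latency-sensitive objects stay in local DRAM, while large or colder objects (checkpoints, chat history, spilled KV blocks) go to the far tier. A minimal sketch; the node numbers, thresholds, and sizes are assumptions about one particular box, not anything from the episode:

```python
# Crude two-tier placement policy sketch. Assumes local DRAM is NUMA node 0 and a
# CXL expander enumerated as node 2; thresholds are arbitrary and workload-specific.

LOCAL_DRAM_NODE = 0
CXL_NODE = 2

def pick_tier(size_bytes: int, accesses_per_sec: float) -> int:
    hot = accesses_per_sec > 1_000                 # crude hotness threshold
    small = size_bytes < 64 * 1024 * 1024          # small objects stay local
    return LOCAL_DRAM_NODE if (hot or small) else CXL_NODE

# Once a tier is chosen, the binding itself can be done with numactl --membind=<node>
# or a libnuma wrapper; that part is left out of this sketch.
```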
Demetrios [00:21:49]: If I'm understanding that correctly, it's like bringing Google Docs to your chat history. And now I can comment on some of your chat history, or I can just jump in and say, hey, now explain more of this, whatever you said in your answer, but now I get to take the wheel and drive.
Bernie Wu [00:22:09]: Yeah, it's kind of like car navigation applications, right? It's a crowdsourcing kind of thing. You look at the traffic; you have all these data points reporting on the traffic everywhere in real time. So here you can crowdsource the responses and, in one sense, get higher-quality responses, or responses closer to real time based on changing conditions or whatever is going on. And so we started prototyping this. I put a link to a blog that we have on how we took LlamaIndex and a very simple RAG LLM pipeline and started creating that shared repository. But we're still working on it. That's more of a toy demo, to be honest with you, at this point. But it does show that we can make some of those pipelines what I would call better, cheaper, and faster: better in the sense that we're going to get a higher-quality response from aggregating all this chat history; cheaper because, with the additional bandwidth from adding this CXL memory to these compute nodes, we get more bandwidth and higher queries per second per node.
Bernie Wu [00:23:19]: So that drives down the cost per query and then faster because we're recalling everything from memory. So when you look at some of these workloads, which require a rapid lookup of all these different memory objects, it's hard to beat, you know, a random direct seek off of a memory device, as opposed to going into a storage device and loading a whole block at a time and trying to parse through that.
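A minimal sketch of the "shared repository" idea: before running the full RAG pipeline, check a memory-resident cache of previous responses that many nodes could share over a pooled-memory fabric. The `shared_cache` here is just a local dict standing in for that pooled store, and the exact-match keying is a simplification (a real version would likely match on embeddings); this is an illustration of the pattern, not MemVerge's demo code:

```python
import hashlib

# Hypothetical shared response/chat-history cache in front of a RAG pipeline.

shared_cache = {}   # stand-in for a pooled, memory-resident store shared across nodes

def answer(query: str, rag_pipeline):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in shared_cache:              # memory-speed lookup: no retrieval, no decode
        return shared_cache[key]
    response = rag_pipeline(query)       # fall back to the full RAG/LLM pipeline
    shared_cache[key] = response         # every node sharing the pool now benefits
    return response
```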
Demetrios [00:23:46]: So really it comes back to what you were saying earlier on, this elasticity and being able to have the elasticity of memory along with the elasticity of everything else. Memory in both ways on the training side of things, but also on this inference side of things, the whole pipeline.
Bernie Wu [00:24:07]: Even in the pre-training, the ingest area, you have the emergence now, I just got back from a conference where some of this came up, of these multimodal, heterogeneous workflows. Multimodal meaning, okay, it's coming in as voice, this guy is speaking Korean, and now we've got to turn it into vectors and embed it, then run it through a translator and have it come back within milliseconds, speaking English, like a universal translator. That kind of modality switch makes things a lot harder, especially when you get to video. Video images are much bigger than just text prompts. And so I think for those, we'll need to increase the token generation rate by at least 5x, the equivalent of 5x, just to handle those kinds of things.
Demetrios [00:25:00]: Is it much different, though, when you're looking at the training side of this elasticity of memory versus the inference side, in how you set it up and how you think about it?
Bernie Wu [00:25:12]: Yeah, no, there are definitely differences. But the other interesting thing about our industry is that originally they were very distinct. Matter of fact, you could just train on GPUs and you could even use CPUs to do the inferencing, and you only needed one or two of those to do it. But now I think there's convergence. There's inferencing going on in the training phase; that's how some of these reasoning models got developed. And then there's training going on in the inferencing phase, the other way around: this fine-tuning and this need to have a LoRA adapter per person.
Bernie Wu [00:25:49]: Right. So you'll have your own LoRA adapter, I'll have my own LoRA adapter, and as soon as this general-purpose model finds out, oh, it's you, it'll drop in your LoRA adapter and give those customized responses for your preferences or whatever.
Demetrios [00:26:02]: Right.
Bernie Wu [00:26:02]: So I think it's going to get mushed together rather than more distinct. And so we need a really elastic, flexible, and scalable memory architecture. And I think this memory architecture needs to go up and out: it needs to scale up and also scale out. Scale up because, going back to first principles in physics, a nanosecond is a nanosecond, right?
Bernie Wu [00:26:33]: So if you want to really compress the latency of these more complex workflows, there's going to have to be some level of scaling up. On the other hand, if you want to handle the entire planet's AI needs, you're going to need to scale out. So it's going to be interesting. And there are not only new memory technologies that are disaggregated, like CXL, coming out, but there are also new fabrics coming out that will support this scale-up and scale-out. Your audience should read up on these: there's an emerging standard from the Ultra Ethernet Consortium which will address some of the scale-out issues, and then there's another consortium, the Ultra Accelerator Link, UALink, which will address memory scale-up. Those are the emerging standards, along with the CXL standard I just mentioned, that I think are all going to make significant differences in this memory bandwidth, capacity, and latency problem over the next couple of years.
Demetrios [00:27:33]: Well, I remember I did a breakdown on a few of those Facebook engineering blogs, or Meta engineering blogs, and they were talking about their GPUs and how they trained Llama and all that fun stuff. One thing that they mentioned was, I think it was 25,000 H100s, or maybe 24,000, I can't remember exactly. They split it up, so I'm pretty sure they had 48,000 in total. But they said, we don't know what's going to be better, we want to test which one is better. And one of the tests that they did was with that Ultra Ethernet to see if it was faster versus the... now I'll have to check again what the other option was that they had.
Bernie Wu [00:28:22]: Yeah, well, people are using InfiniBand as one choice.
Demetrios [00:28:26]: That was it. The classic InfiniBand.
Bernie Wu [00:28:28]: Yeah, well, part of it is there are no standards. I believe what Meta has done so far is sort of a proprietary way of doing what they call packet spraying across Ethernet, which is one of the things that the Ultra Ethernet consortium wants to standardize. I'm kind of drifting off of memory into networks, but I'll go a little further. A lot of those networks, I think, were designed primarily for north-south traffic rather than the east-west traffic that you see in these clusters, which need to exchange all the weights and values at high speed. So that's really the ultimate problem to try to solve, and it has to be reliable. You can't drop packets and retransmit; it takes too long.
Demetrios [00:29:12]: So one thing that I did want to touch on was how this all affects operations and the scheduling workflows and the pipelines around that: the compute instances, the composability of the GPUs, and scheduling that now needs to take GPU status into consideration. You had mentioned how it's not just scheduling for GPU status, but then memory scheduling and all of that. And so you're really dissecting the scheduling workflows in ways that it feels like we're not necessarily thinking about today, but it probably would help us to think about.
Bernie Wu [00:29:50]: There are multiple dimensions to the scheduling problem. One, quite honestly, is more of an organizational thing: how a company, an enterprise, or a cloud service is organized. It's basically an allocation of a scarce resource; at the end of the day, it's an economic problem. Right? So there's that dimension. Okay, who has priority? How do we assign priorities to jobs? Some jobs are batch jobs. Kind of at a trivial level, you can say, well, this is a batch job.
Bernie Wu [00:30:21]: It can be bumped off. This one's an interactive web session; we can't bump it, it has to meet a certain kind of SLA. So you're going to have different infrastructure, and it most likely, I would imagine, gets partitioned: this area may be just for more batchy kinds of things, and this area is more for services, ongoing streaming kinds of things. So that's one level of cut at this problem.
Bernie Wu [00:30:42]: And the other level of cut is, okay, how do I balance among multiple different criteria for how resources should be allocated? For example, do I care more about utilization of memory and utilization of GPU? Is that the primary dimension of scheduling, or is the primary dimension the SLA? What is the end-to-end throughput of a single session? What is the overall total throughput, and what's the long-tail latency of any individual? Maybe the most important criterion is actually making sure that the P99 of a session is no more than X; otherwise I'm going to have customers churn and drop out. So all those things have to be figured out and then mapped to what I think will need to be a highly elastic infrastructure. A lot of people are building their infrastructure on Kubernetes, which is great, but it doesn't go nearly far enough in trying to address these new problems. It's designed to do horizontal scaling and vertical autoscaling; it has a lot of infrastructure. But it was designed for these CPU architectures, and now the accelerator is becoming more important and the memory is becoming more important. So it's funny that the components of the node, not necessarily the node itself, these individual components, are becoming more important to schedule directly. That's what's happening now.
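One way to picture that multi-criteria balancing is a toy placement-scoring function that trades off bin-packing (utilization) against headroom on the tail-latency SLO. Everything here is an illustrative assumption: the attribute names on `node` and `job`, the weights, and the crude contention model are mine, not a real scheduler's API:

```python
# Toy multi-criteria placement score: pack tightly for utilization, but never place
# where memory would overflow, and penalize placements that threaten the P99 target.

def placement_score(node, job, w_util: float = 0.4, w_sla: float = 0.6) -> float:
    gpu_util = node.gpu_used / node.gpu_total
    mem_util = (node.mem_used_gb + job.mem_gb) / node.mem_total_gb
    if mem_util > 1.0:
        return float("-inf")                                  # would spill or OOM: never place here
    predicted_p99_ms = node.base_p99_ms * (1 + 2 * gpu_util)  # crude contention model (assumption)
    sla_headroom = max(0.0, (job.p99_target_ms - predicted_p99_ms) / job.p99_target_ms)
    packing = (gpu_util + mem_util) / 2                       # prefer fuller nodes for utilization
    return w_util * packing + w_sla * sla_headroom            # pick the node with the highest score
```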
Demetrios [00:32:17]: And do you see an open source project that is built on top of Kubernetes coming through and helping with that? Or is it that Kubernetes itself, the way it was thought through and the primitives there, is not going to be able to do the trick?
Bernie Wu [00:32:36]: No, actually, there are, gosh, there must be at least half a dozen scheduling projects going on that are Kubernetes related, and some of them are trying to address various aspects of this. Right now it's still very much a fluid thing. The reason things are not clear is because things are changing so rapidly, even at the top level of the stack, just basically the way we put together and deliver models. I would say last year was the year of model proving, right? Which model is the right model? Is it going to be an open source model, or are we going to use APIs in the cloud because those are the only good models? And I think people have now pretty much decided that most of these open source models are good enough, at least for large language models. They may not be good enough for other kinds of multimodal models yet, but definitely that area is out of the way.
Bernie Wu [00:33:39]: And so this year, I think, was more about RAG, right? How do we start delivering these, putting these AI things into products and services? And toward the tail end of this year, it's been more about multi-agentic RAG. Now the training area is going through another iteration; people want to start having these things reason more. There's so much flux is what I'm trying to say. There's so much flux at that level that whatever scheduler you build is going to have to be very composable itself, and also probably able to teach itself over time, have some sort of AIOps element to it, because it's complicated.
Demetrios [00:34:25]: So that's fascinating to think about. The flexibility needs to almost be the number one design principle.
Bernie Wu [00:34:32]: Yeah, I think architecturally it needs to be flexible, almost to an extreme. And then, like I said, there are these new trends. Right now, a lot of schedulers don't even consider memory. It's just like they didn't consider GPUs when they originally built the original schedulers for Kubernetes. Now we've got to start spending more time there; more of the action is there, even though these devices are still tethered to a CPU.
Bernie Wu [00:34:59]: You talk to a CPU as a Kubernetes node, but you're really driving the GPUs that node contains. And those GPUs have their own network too, by the way, their own NVLink mesh and things like that. So this is really a time of a lot of experimentation, I would say, and it's still evolving very quickly. There are probably going to be multiple right answers, is how I see it. Like I said, I think there are at least half a dozen scheduling projects. There's Kueue, there's YuniKorn.
Bernie Wu [00:35:37]: And I just heard Alibaba, I think, is working on another one. So there are several, and these are open source scheduling projects. Yeah.
Demetrios [00:35:48]: I guess my biggest question is how much of that is a square peg trying to fit into a round hole.
Bernie Wu [00:35:57]: Yeah, I think some of these projects have been driven by particular hyperscalers and their large-scale deployments and their kind of unique set of circumstances. That's why there are so many of them. It's going to be driven, like I said, at least by organizational considerations and economic considerations, and also by the technology stack they chose for that generation of deployment. It's challenging, but my expectation is that some of this dust will start settling in the next couple of years, and we'll see more distinct patterns emerge. And my hope is that by that time, you'll see memory get more respect in the industry, more awareness and use as part of solving some of these AI problems, which, again, people have this high-level perception are GPU bound. Really, when I listen to a lot of these talks, at the end of the day, they're memory bound problems. Yeah.
Demetrios [00:37:14]: And that's why I feel like it is you going back to first principles because you're saying, yeah, the GPU is there, you are bound by the GPU. But what do you want from that GPU? Where's the real bottleneck? If it's at the memory layer, then we might want to start there.
Bernie Wu [00:37:28]: No, exactly. I like to say that in Western cultures, when you look at a problem, people say here's point A and here's point B, and you draw a line between A and B; that's the direction you go. Maybe in Asian cultures it's more of a spiral; everything is yin and yang, a circle. I think in reality this industry is a combination of both. It's both a vector and a spiral, so you have this spiraling around.
Bernie Wu [00:38:00]: Yeah. So you have this rotation. Right now we're in a rotation where the perception, at least, has been that we have a huge GPU shortage or whatever. But at some points the industry has been network bound or storage bound, and I think memory bound is the next thing coming, or it's already here and people just don't acknowledge it. And they're trying, like I said, to do as much legerdemain as they can, I call it legerdemain, with algorithms to spread out the memory problem, distribute it as much as possible, or quantize it: okay, we're not going to use such accurate FP16, we'll go down to INT4, whatever we can do to shrink the size of things. But nonetheless, a lot of the fundamental equations say, hey, if you want these bigger problems and models or more sophisticated things, your KVs, your weights, are going to go up quadratically, right? Yeah.
Bernie Wu [00:38:58]: So if it's quadratic with four bits, it's still quadratic, versus quadratic with 16 bits; it's just mushrooming. So like I said, I think it'll be memory's turn to be in the docket and in the limelight in the next 12 to 18 months, to solve some of these problems.
Demetrios [00:39:15]: I see it also coming to the consumer hardware that we've got and this amplification of memory on my laptop or on my cell phone.
Bernie Wu [00:39:26]: Well, right now it is kind of a cram-down situation. With a handheld mobile device you have extra considerations, like the power budget and things like that. Actually, I remember listening to a talk the other day from Qualcomm. It was amazing how much they're able to start putting some of these AI kinds of tools even on a phone, but they're going to need to carve out eight gigs, ten, 16 gigs, 32 gigs. And so there's some limit to how much you can quantize things. To that point, they're squeezing every last drop of blood they can out of algorithms and quantization. At some point we're just out of algorithmic tricks.
Bernie Wu [00:40:13]: We've got to do something. And I think on the edge you can get away with that. You've got a smaller screen and all that kind of stuff; you don't have to worry about the resolution and certain other factors like that. But for a lot of other use cases, we just have to refactor the memory stack as well as the GPU stack and everything else.
Demetrios [00:40:38]: You're telling me there's not enough fancy math in the world to squeeze a little bit more out of that algorithm? You gotta go somewhere else.
Bernie Wu [00:40:47]: I mean, that's what happened in general with AI. AI has been around for years, decades, and the problem was they didn't have enough compute power. Finally, the economics and the ability to connect these things with some reasonable network like InfiniBand made it viable to run these kinds of basically grinding problems. So that was a limiter. And like I said, to take AI to the next level, which I think is going to be very exciting, this multimodal, real-time stuff, the ability to implement robotics or whatever, that's going to require more than just the algorithms to solve. We're going to need the algorithms.
Bernie Wu [00:41:31]: I don't want to belittle what those people are doing. They're doing some incredible work, but everything falls back to, I think, where the rubber hits the road: the operations. It's nice to do all that R&D, but we've got to put it in production, we've got to make it economical, and we've got to make it secure and meet SLAs, all that kind of stuff. So I'm kind of interested in that, applying all this innovation and making this stuff really tangible for everybody to use.
Demetrios [00:42:03]: And speaking of making it more economical, with this excess of memory, do you feel like there will be new strategies that will help make it more economical? Like we were talking about with the checkpoints, and how you can kind of slowly roll the checkpoints from one stage to another via the memory path. Because when I think of gigantic pools of memory, I don't think cheap, right?
Bernie Wu [00:42:34]: Bingo.
Demetrios [00:42:37]: Cheap is not what necessarily goes in my head.
Bernie Wu [00:42:37]: Yeah. No, actually, years ago, when Memverch got started, we. The inspiration back then was that was going to be the availability of cheap memory. So several years ago, intel introduced something called optane, or persistent memory. And its cost per bit, its read access time, was the same as DRAM. Matter of fact, you plugged in the same DDR slots as regular DRAM, but its cost per bit was like one fourth, one third, one four of DRAM. So you could actually build a much larger memory pool and drive down the cost per bit. But so that's one aspect, unfortunately, they discontinued that, was a proprietary architecture that discontinued that.
Bernie Wu [00:43:24]: But within the memory hardware industry, there are new kinds of memory technologies emerging, and there are also kind of converged memory architectures emerging. SSD is much cheaper on a cost-per-bit basis than DRAM, so people are trying to build these hybridized memory-slash-SSD devices; matter of fact, they're referred to as memory-semantic SSDs. The problem with the SSD is that the cost per bit is lower, but the access time, the lookup time, is a lot longer. So maybe instead of 100 nanoseconds, now you're at ten microseconds; it's a couple orders of magnitude higher.
Bernie Wu [00:44:13]: So part of the solution is to try to use a combination of those in a very clever caching way. Some algorithms, not all of them, are more predictable from an application standpoint, so you have a better chance of using that kind of caching algorithm there. Others are going to be totally random, because if you're serving inferencing out there, who knows who's going to come up to the chat and ask some off-the-wall question next? You have no idea what kind of access you might need. Then there are fundamentally new memory technologies. I know some of the largest VC firms are looking at funding some of these new-generation memory technologies that will bridge that substantial gap; there's a couple orders of magnitude difference in performance, but also in cost per bit, between DRAM and SSD.
Bernie Wu [00:45:08]: So yeah, I think that'll also have to happen as well over time.
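A minimal sketch of the DRAM-plus-SSD caching idea Bernie outlines: keep the hottest objects in a small DRAM tier and demote least-recently-used victims to a bigger, slower tier. The `ssd` object is a stand-in for any byte store (a memory-semantic SSD, a file, a key-value store); the capacities and the LRU policy are my assumptions for illustration:

```python
from collections import OrderedDict

# Two-tier cache sketch: ~100 ns class DRAM in front of ~10 us class SSD.

class TieredCache:
    def __init__(self, dram_capacity: int, ssd: dict):
        self.dram = OrderedDict()      # key -> value, kept in LRU order
        self.capacity = dram_capacity
        self.ssd = ssd                 # slower, cheaper tier

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)     # hit in the fast tier
            return self.dram[key]
        value = self.ssd[key]              # miss: fetch from the slow tier
        self.put(key, value)               # promote on access
        return value

    def put(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        if len(self.dram) > self.capacity:
            victim, v = self.dram.popitem(last=False)   # demote the coldest entry
            self.ssd[victim] = v
```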
Demetrios [00:45:14]: Now, the other piece that I wanted to dive into was what you had said before about these pools of memory that you can use in an elastic way, so that you can avoid what I am sure everyone that is listening has had happen to them at one point or another in their life: those OOM errors.
Bernie Wu [00:45:40]: Yeah, those are really annoying. That was the most annoying thing I learned about Kubernetes: the good thing is it's a very elastic infrastructure for microservices, and the bad news is that if you're semi-stateful, with memory you're dragging around, you get killed. It's not a pleasant situation. It's interesting how Kubernetes evolved, because it was supposed to be totally stateless microservices. In other words, all these database and storage problems, not my backyard, put them outside of Kubernetes and connect to them. But that's not so easy to do, in my opinion, with some of these AI/ML workflows.
Bernie Wu [00:46:29]: So over the years, obviously, the Kubernetes community has done a great job of accommodating stateful applications, replica sets, and all this other kind of stuff that you see in more general-purpose enterprise environments. With this elastic pool, we're hoping to show this kind of demonstration probably in the not-too-distant future, where as soon as we detect memory pressure, things start to happen. There is a project going on with Kubernetes where you can spill to disk or do swapping, which is heretical to the original Kubernetes philosophy, but they're allowing it now. So we can either detect those kinds of swaps, or detect just before a pod gets preemptively killed because the node is about to run out of memory, and we can intervene and inject memory, or we can cycle the memory. There are certain applications where the memory consumption is actually sinusoidal or whatever, and we can do what we call wave riding of the amount of memory based on where the application is in its run. So those are things we're hoping to add to Kubernetes very shortly.
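A minimal sketch of that pressure-triggered intervention, assuming a Linux host: watch the kernel's pressure stall information (PSI) for memory and ask the pool for more capacity before the OOM killer fires, then hand it back when pressure subsides (the "wave riding" idea). The PSI file format is standard Linux; the `expand_memory`/`shrink_memory` hooks, step size, and thresholds are hypothetical stand-ins for whatever the pool middleware exposes:

```python
import time

# Read the "some" line of /proc/pressure/memory, e.g.
#   some avg10=1.23 avg60=0.45 avg300=0.10 total=123456
def memory_pressure_avg10() -> float:
    with open("/proc/pressure/memory") as f:
        some_line = f.readline()
    return float(some_line.split()[1].split("=")[1])   # the avg10 value, in percent

def wave_ride(expand_memory, shrink_memory, step_gb: int = 16,
              high: float = 10.0, low: float = 1.0, poll_s: int = 5):
    extra_gb = 0
    while True:
        pressure = memory_pressure_avg10()
        if pressure > high:                 # stalls building: inject before the OOM killer acts
            expand_memory(step_gb)
            extra_gb += step_gb
        elif pressure < low and extra_gb:   # pressure gone: return surge capacity to the pool
            shrink_memory(step_gb)
            extra_gb -= step_gb
        time.sleep(poll_s)
```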
Demetrios [00:47:57]: How do you even go about that? That's what I'm fascinated by. You saw the problem, you said, all right, nobody likes these OOM errors, and we think that we can fix it these different ways, by injecting memory or trying to swap it out. What are you thinking about when you're attacking that problem?
Bernie Wu [00:48:20]: Well, from where I sit, we're kind of a piece of middleware. So the first thing is we've got to get our hardware partners ready; we are working with several players that are building memory pools, and we've got to get that piece of infrastructure ready. The second piece is that we need to integrate with the Kubernetes layer. I think there are some relatively easy ways to integrate with the Kubernetes layer, create a hook or an operator or whatever. But longer run, we, and I say we, the industry and the community, are going to need to create some probably new standards. For example, in Kubernetes there is something called CSI, a standard for the storage layer, right? And there has to be an equivalent standard for how we bring memory in as a separate first-class citizen in a Kubernetes environment. Right now we can kind of hack our way in and show something with the existing schedulers and things like that.
Bernie Wu [00:49:27]: But I think over time we'll need to work on other areas. For example, there's another initiative called DRA, dynamic resource allocation, because right now everything is statically done. Once that node is set up by the kubelet or whatever, it's static. But what if things change? What if the memory increases and decreases, or the number of GPUs available increases or decreases? That kind of dynamism, I think, is just a general thing the Kubernetes industry is trying to face up to: how do we make this even more elastic than it currently is? Right now it's elastic at the node level, but even below that, there are some of these other elements that have their own composability to them and need to be more dynamic.
Demetrios [00:50:15]: And I also like that you're looking at the node and saying inside of the node there needs to be a way to be more flexible and more elastic, and then the whole cluster, and, like you mentioned, up and down the stack, but also left and right in the stack. We need to rethink and look at all these different areas. If we're coming into a world where AI plays such a big part of that, I wonder, is this only going to be the evolution for the use cases where we are trying to create AI products, or do you see it being better for the industry in general if we amplify on every vector that we can?
Bernie Wu [00:51:11]: Yeah, well, with respect to things like memory, there are just tons of applications today that are getting killed and evicted or whatever, so there's definitely that kind of benefit. And I think it's the same on the HPC side. There's a lot of convergence between what goes on in the HPC world and AI, and that area can also benefit from all the memory advances. General enterprise applications, databases, too. I think it'll be a broad-based thing, but right now the AI movement is, quite honestly, sucking all the oxygen out of every other initiative, and it's probably the correct thing to do anyhow. From a society standpoint, this is the hottest area, but once it settles down, I think some of these other innovations will obviously benefit the rest of the infrastructure and the rest of the applications that are still out there.
Bernie Wu [00:52:12]: They're not going to go away, databases or anything like that as an example. Matter of fact, they'll be part of this overall infrastructure. They're going to be our source of truth, our ability to avoid hallucinations, whatever, and they'll benefit from all these memory advances as well.
Demetrios [00:52:32]: Bernie, incredible talking to you, and I want to just mention that I looked up the Meta blog post and what I was looking at. They had the 24,000, or, no, it was a 16,000 GPU cluster, and it had remote direct memory access over converged Ethernet, RoCE. That probably sounds familiar to you. I remember when I read that, I said, this could be in Greek, this could be in Chinese. I have no idea what that means.
Bernie Wu [00:53:03]: Yeah, it's called RoCE, RDMA over converged Ethernet. That's been around for quite a while, and it's kind of the bread and butter of the current Nvidia architecture. But like I said, I think people are ready to figure out how to take this to the next level and standardize things with Ultra Ethernet.
Demetrios [00:53:28]: Exactly. I thought they had used the Ultra Ethernet, but it looks like they didn't, and they were using InfiniBand for the other 16,000 cluster. And, yeah, let me see, hold on a sec: the RoCE was used for training the largest cluster. Despite underlying network differences between these clusters, they both provided equivalent performance.
Demetrios [00:53:56]: So what I remember from this is they said that the RoCE cluster, as you mentioned, they were able to optimize for quick build time, and the InfiniBand cluster had full bisection bandwidth.
Bernie Wu [00:54:13]: Yeah, those two technologies are really based on the same physics. It used to be there was quite a technology gap between InfiniBand and Ethernet, but now that gap has pretty much closed. From a physical layer standpoint they're basically equivalent; it's really the protocol layer where the differences are. InfiniBand was designed to be a lossless architecture, and Ethernet was always designed to be lossy, and there were trade-offs for that. And now they're all getting kind of mushed together, or Ethernet is going to try to absorb that type of protocol and compete with InfiniBand, quite honestly.