MLOps Community

Performance Optimization and Software/Hardware Co-design across PyTorch, CUDA, and NVIDIA GPUs

Posted Feb 24, 2026
# NVIDIA GPUs
# CUDA framework
# GitHub repo

Speakers

Chris Fregly
Founder, AI Performance Engineer, and Investor @ Stealth AI

Chris Fregly is an AI performance engineer and startup founder with experience at AWS, Databricks, and Netflix. He's the author of three O'Reilly books: Data Science on AWS (2021), Generative AI on AWS (2023), and AI Systems Performance Engineering (2025). He also runs the global AI Performance Engineering meetup and speaks at many AI-related conferences, including NVIDIA GTC, ODSC, Big Data London, and more.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

In today’s era of massive generative models, it's important to understand the full scope of AI systems' performance engineering. This talk discusses the new O'Reilly book, AI Systems Performance Engineering, and the accompanying GitHub repo (https://github.com/cfregly/ai-performance-engineering).

This talk provides engineers, researchers, and developers with a set of actionable optimization strategies. You'll learn techniques to co-design and co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems for both training and inference.


TRANSCRIPT

Chris Fregly [00:00:00]: And there are entire products, like SageMaker HyperPod, built specifically, um, to have a, a standby, like a warm standby of GPUs that have been pre-warmed and can be swapped in. Um, that's the whole premise of the product.

Demetrios Brinkmann [00:00:28]: Um, first of all, congrats on the book. I think that's a huge accomplishment, and I wanna dive into everything about creating it, what you've learned along the way, and what you've been up to since. There is a GitHub repo that comes along with the book, which is super cool, so folks can get hands-on. We're also gonna go fairly low level. You know, this is something that I'm excited about because of the depth that you bring, and then being able to kind of traverse the whole stack up and down, from the kernel to the applications. And you had asked me right before we hit record, like, what do you think about the whole narrative of not wanting software engineers? So maybe we start with that, because I have some interesting takes. More along the lines of: the applications right now, a lot of them are becoming very throwaway, and you can think of it as akin to fast food apps where you just build it, you take some time, maybe it's 20 minutes, maybe it's 2 hours. It's for a very specific use case, and you never have to productionize it, so you don't have to think about the million things that come with production, because it's just built for you.

Chris Fregly [00:01:53]: Yeah, I guess, you know, coming from a large-scale application background, uh, places like Netflix and, you know, Databricks and of course most recently AWS, um, and various startups in between. Um, I think about statements like that, and the first thing that I think of is: this person has no idea what it takes to maintain and run an application. Now, that's fine if you're limiting it to just yourself or maybe 5 or 6 people, right? But actually I'm curious, you know, continuing your thoughts there, where would you deploy it? Just out of curiosity, like a Vercel or like a, yeah.

Demetrios Brinkmann [00:02:39]: Vercel, Railway, but even a lot of the stuff that I'm running, I'm just running locally. Yeah. Yeah. So there is no deploy in that sense.

Chris Fregly [00:02:47]: Got it. Right. And then you're still using a database though, right? I assume like a— sometimes— or something.

Demetrios Brinkmann [00:02:53]: Okay.

Chris Fregly [00:02:53]: Yeah.

Demetrios Brinkmann [00:02:53]: Like I'm as hacky as just using Notion for my database, which is— it's sounds horrible. You know, it definitely defies all laws of everything that you would think. But it's like, oh, well, if I just want, like, for example, I'll give you the exact toy that I was playing with and building today. I kept getting sick of creating deep research reports and then wanting more information on just a piece of the report. And so I created a little bot that I can bring into the— uh, Google Docs, and then I can tag @AI. And say, ask it questions or tell it to like note this down and put it in my Notion database, whatever. And it does that. And that's just for me.

Demetrios Brinkmann [00:03:43]: Like, I don't want anybody else to have access to that.

Chris Fregly [00:03:48]: And as your workflows change and as your frustrations mount, you will chip away at that. Exactly. And so, being a little bit old school in, you know, traditional software, scaling, distributed systems, I guess that doesn't really matter. Right? Like, yeah, some of those things don't matter. I mean, there's like backups, and, um, I was watching, uh, you know, a podcaster, Matt Berman, who I'm sure you're familiar with. He was describing his whole workflow, and, you know, one of the things I noticed is he was backing everything up to GitHub, and then he had these cron jobs on top of cron jobs that were doing backups on top of backups. And it's like, well, what if the one backup cron job fails and it doesn't run till the next day? Now you're— so there's a lot of hackiness going on, of course.

Chris Fregly [00:04:45]: Exactly. If it works for you and you can move quickly too. And, and I guess you're not waiting on someone else to release the feature. Or Google Docs to implement it, or, you know, that's the beauty.

Demetrios Brinkmann [00:04:56]: That is really the beauty of it. And so going back to the original idea of like, oh, are we going to need software engineers? I still am 100% convinced that we're going to need software engineers because me doing that is nothing like me taking something and bringing it to production. And I understand that gap. What I think is just different is now how custom we are able to quickly spin things up. And also what you were talking about too, with like, ah, maintaining software is kind of a pain. 100% agree. And I don't want to now have 50 or 100 or 200 different personalized apps that I have to maintain. I get the feeling that it's probably faster for me to just start from scratch again next time than have to maintain it.

Chris Fregly [00:05:50]: Yeah, yeah, 100%. I was working on a project, um, over the holidays. I, you know, picked it up for like a buddy of mine, and he's a CEO and runs a, you know, pretty tight ship. He's got like really good revenues and he keeps his expenses low. And, um, and he was mortified when during one of our standups I said, yeah, I just threw everything away that I'd been working on for the last 2 weeks and I started over. And it actually, um, we had to talk about it offline. Like, I didn't realize he wanted me to also kind of share information with his team on my workflows. And I had already gotten to the point that you just mentioned, which is I could start over and it's actually faster and cleaner, and it doesn't have to maintain all the old tests. It's just gonna write new tests, you know? And so I think we're seeing this shift. Um, yeah, curious, do you read all of the code that's being—

Demetrios Brinkmann [00:06:54]: that's being generated? No, I don't read any of it, to be honest. No. Well, I mean, occasionally I'll jump in and try to read and figure out what's going on. But honestly, if I have a problem or if I'm debugging stuff, there's, first of all, there's some awesome skills that are just open repos on GitHub. And one of 'em is like, a debugger skill, Agents, um, by Oda. I just, I was checking it out a few days ago and I got it. And since then I haven't had to use my other trick, which is using the playground skill to create a visual diagram of what's going on with the code and then help me debug it. Interesting.

Chris Fregly [00:07:42]: Okay.

Demetrios Brinkmann [00:07:42]: Yeah, that was, that was awesome. And that was my go-to because then I could isolate like, what's the problem? What is happening? And kind of explain it in a diagram. And then the prompt would just be there in the playground. So my thing was, I would say, create a playground skill to help me debug the data flow of this app or something like that. Yeah. Or all the possible scenarios of why this isn't working. And then it would have the, the debug prompt that I could just copy paste back into when I would isolate the issue. It's like, oh, this part of the app is not rendering this part.

Demetrios Brinkmann [00:08:23]: Uh, this button is not clicking through or whatever. Then boom, it just helps me transfer that.

Chris Fregly [00:08:29]: I actually didn't even know about that. I, I was having it generate mermaid diagrams and. Uh, generate markdown, but all right, I have to go look into these. And these are skills for like Codex or for Claude or— yep.

Demetrios Brinkmann [00:08:43]: Yep. It, and it's so nice. I feel like the playground skill is a skill that keeps on giving. Anything I throw at it, I can try and, uh, get it to do stuff. Like the other day I was creating a horsey ranch game with my daughter to try and teach her a little bit of geometry. And we wanted to change the appearance of some of the game characters. And I would say like the playground skill is really something for when you want a little more fine-grained ability than words can offer you. So that means you don't wanna have to explain things with more intensity in words.

Demetrios Brinkmann [00:09:27]: You just wanna have a slider. That you can jack up the intensity or, or bring it down with, you know? And so with the appearance of these different game characters, we were able to change the colors and change the size of their clothes or the size of their eyes. Yeah.

Chris Fregly [00:09:44]: That type of stuff. And it works with any— it, it just figures it out.

Demetrios Brinkmann [00:09:50]: It just works, huh?

Chris Fregly [00:09:51]: Yeah.

Demetrios Brinkmann [00:09:51]: It, it's amazing how well it figures it out too. Like throwing some random stuff at it and It is.

Chris Fregly [00:09:59]: Wow. Yeah. Impressive. Yeah. Any tips for educational things for like, I have two nephews and all they do is play like Roblox all day and they, they haven't made the, you know, tra— the, uh, transition. My sister doesn't let me force them to learn, you know, computer science and, and things. So I'm trying to do subtle ways, but like, I think I've heard one of the tips is like you can You know, have ChatGPT or Claude like ask questions back. So when, when they ask a question about something to research coming back, have, have it ask questions.

Chris Fregly [00:10:37]: Any other thing like that?

Demetrios Brinkmann [00:10:39]: What I've started doing with my daughter on the way to school sometimes is we will put it on voice mode, and then, uh, I'll say, you know, hey, we're trying to learn about geometry. Right. And I'm with my 7-year-old daughter and she really likes horses. Can you tell us a story that, like, incorporates the diameter and the radius and all that? Yeah. So that's fun. Uh, another one is trying to, yeah, create this, uh, horsey ranch game, which she loves, and we plugged in a voice agent too, so she can talk back and forth to it. And that is really cool 'cause it's just so easy. It's ridiculously easy to build these things, but then, uh, sometimes I'm so outta my depth and it's like, oh, we hit a bug and this isn't rendering.

Demetrios Brinkmann [00:11:36]: And I don't know what's the problem. And so then I have to sit there for 30 minutes trying to debug and my daughter gets bored and she's long gone, you know?

Chris Fregly [00:11:45]: Ah, she's gone.

Demetrios Brinkmann [00:11:48]: Yeah.

Chris Fregly [00:11:49]: Hilarious.

Demetrios Brinkmann [00:11:49]: But dude, talk to me about this book because there's that Venn diagram or that triangle of skills that you now need and maybe give an overview of the book that you wrote. Um, and what kind of inspired you to write it maybe?

Chris Fregly [00:12:06]: Yeah. I mean, this is the third book I've written with the O'Reilly folks. Um, and each of them, the common theme was I was trying to, you know, learn as much as I could about a particular domain. So I joined, yeah, I joined Amazon, what was it, 2019. And I came through an acquisition. I had a startup that was doing inference, you know, very much like a Baseten today or like a Together AI or a Fireworks. Before your time. Little before you came, huh? It was, you know, before like LLMs.

Chris Fregly [00:12:43]: So like you can imagine it was TensorFlow models. PyTorch wasn't even big, you know, back in 2016. Like we had one PyTorch customer and they were the annoying one that we didn't know anything about. Like we would help them, but they had to do everything on their own. We just gave them our, you know, product and they can— Wow. —fit it in. But, um, yeah, the startup was interesting. So I had originally started at, uh, Netflix when I moved to the Bay Area.

Chris Fregly [00:13:07]: I'm from Chicago, but moved to the Bay Area to get a lot of scale. My buddy recruited me there, and I very quickly realized that my little financial apps and toy apps that I was building for companies in Chicago were very different from, you know, 300 million customers, uh, you know, hundreds of thousands of streaming servers. So I joined the, the first streaming team. I was on the streaming team from the beginning. Um, that's when they split off DVD. And so I never had anything to do with the DVD side. I was always streaming. And then did that.

Chris Fregly [00:13:43]: And then I joined Databricks and I got to see data at scale, you know, that's where I met Holden Karau, who I think, you know, and yeah. Yeah. So she's, she, she kind of turned me onto the, the book bug. Writing. Yeah. Yeah. And I had always wanted to write a book. I, I submitted about 7 or 8 proposals to the O'Reilly folks.

Chris Fregly [00:14:13]: I always got rejected. Yeah. Um, even with, you know, Holden's help and with other people's help. So they're very selective. What's interesting about the book proposal process is it's very much like when you're pitching your startup to VCs. And so it was no coincidence that I had to pitch to VCs with my startup and then, you know, eventually sold to Amazon. And then when I went to write my— the most successful book proposal, I basically followed the 12, 13 slide Sequoia, you know, pitch deck, which I had used for all my pitches. Because what they wanna know is, you know, why now? Right? Like, why is this book relevant now? Why, um, you know, yeah, why me? Like, why am I the right, uh, like author for it? And, uh, they wanna know competitors, they wanna know, um, the TAM.

Chris Fregly [00:15:13]: Uh, yeah. Like what's the addressable market for this book? Right. And so Yeah, and if you ever do want to write a book, you know, for you or for your like, audience with O'Reilly, they, you know, give you the proposal and you have to kind of fill it out and it's super annoying, just like filling out a pit— you know, the pitch deck is, and then you have to pitch it to them, right? So by the third book, they were like, "Okay, yeah, you know, just, just like give us everything," 'cause I had done it so many times, but oh, back to this. So there's always this thing of there's so much material and I'm getting it from 100 different sources. Like the most recent book, the AI Systems Performance Engineering book, I was pulling tips off of Twitter, you know? Wow. Um, yeah, X. And I was getting tips out of, uh, um, you know, Discord channels, which like Discord scares me. I don't like Discord.

Chris Fregly [00:16:13]: Like it doesn't make any sense to me. Right. So, yeah. And I can't copy and paste or something out of it. There's, there's something weird. So I have to do screenshots, and, you know, it's a very strange, strange land over there. But, uh, so once you realize this. And NVIDIA is not really known for their documentation, right? No. And it is really, really bad.

Chris Fregly [00:16:37]: And when you finally land on something that explains the hardware, it's in the context of some super obscure, you know, fluid dynamics algorithm that kind of has some similarities to what an LLM does, but really has nothing to do with it. So, so this was my attempt to really bring together PyTorch, um, you know, the framework, uh, CUDA, which is always kind of an elite type of technology and framework to learn. And then the hardware, and, you know, the algorithms. So really it's the hardware, it's the software, and then it's the algorithms, and that's called co-design. So those three together, yeah, that's what this book focuses on.

Demetrios Brinkmann [00:17:23]: Yep. It feels like you're doing God's work because there's like 100 people in the world that understand those three things that you just mapped out.

Chris Fregly [00:17:33]: Yeah. And my goal was to make that, you know, 100,000, one million people. Right? And do it in a domain that is relevant today. Because, like, I start reading these CUDA books, and by the end of it I can do some crazy, you know, numerical optimization thing, but I can't make that connection into the PyTorch world and stuff.

Demetrios Brinkmann [00:18:04]: And so, uh, I see where you're going. Yeah. Because it's just so far in left field that you're like, how does this relate to the whole world of LLMs and machine learning? Yeah.

Chris Fregly [00:18:20]: And while writing the book, I recognized quickly that it's a cultural thing within NVIDIA, and, hot take: no one from, uh, NVIDIA was allowed to be a formal reviewer on my book. Like, right, let alone a co-author. But, uh, that was already ruled out quickly when I had approached a couple friends of mine. Um, they were not willing to give quotes. They were not willing to be associated. Now that said, they were super curious and they helped me quite a bit, and they were able to, you know, unlock a few concepts that to me weren't jiving, that I couldn't get from the forums. And, unfortunately, I can't thank those people publicly, so I've had to thank them privately, and that's, yeah, that's a bummer.

Demetrios Brinkmann [00:19:13]: Yeah, that's a real— well, it's, yeah, it's funny you mentioned that because my buddy and past co-host of this here podcast, David Aponte, he's been working on a lot of real small— he's trying to figure out all of the different chips, not just the NVIDIA ones, but. One thing he said to me in passing that I— it has stuck with me since is he was like, yeah, so, you know, uh, NVIDIA chips, they don't come with like a user manual on how they are built. Right, right.

Chris Fregly [00:19:45]: Nobody knows what's going on behind the veil. Yeah. And that was the DeepSeek moment, right? That was one of the— and my book starts off that way. Like, chapter 1, I kind of explain, you know, like, why now, basically? And there's a lot of DeepSeek references throughout, right? Like, knowing that there's gonna be the next version of DeepSeek and all that. So, like, I'm not trying to be on the cutting edge, right? This book took a year, and there were like 9 models that surpassed DeepSeek in that time. But what was interesting about the DeepSeek moment is it made people realize that even with hardware constraints, you know, Chinese export, uh, like, restrictions, DeepSeek was somehow able to basically reverse engineer the— some lesser, uh, documented, um, yeah, right. Like, the NVIDIA folks are very, you know, careful to say, no, this was a documented feature. It's just, we didn't make it easy for you to find the docs, basically. Yeah.

Chris Fregly [00:20:52]: Right. And so, yeah, one of the many innovations out of DeepSeek was that they found instructions that were not well documented that could, you know, bypass the L2 cache under certain circumstances. And of course the NVIDIA folks that are optimizing for other use cases, they would never think to do this. And so that's why it's not like a first-class CUDA API, but if you know what you're doing, you could drop down basically to the assembly, and, you know, it's called PTX. Uh, PTX is the acronym, but they call it P-Tex. Yeah. The NVIDIA folks call it P-Tex. So you have to be able to like translate that. So yeah, we have GTC coming up.

Chris Fregly [00:21:32]: Are you coming to that, that conference? No, but— Oh man.

Demetrios Brinkmann [00:21:36]: That's my favorite one of the year. Yeah. That is a wild one. I've heard very good things about it. Yeah. We have, um, some, half-off, I think, tickets for the community. And one thing that is pretty much impossible to get is free tickets to that.

Chris Fregly [00:21:57]: Oh yeah. No, not unless you're speaking. Right.

Demetrios Brinkmann [00:22:00]: Yeah.

Chris Fregly [00:22:01]: Yeah. Yeah.

Demetrios Brinkmann [00:22:01]: I spoke a bunch of years ago. Mm-hmm.

Chris Fregly [00:22:04]: Oh, nice.

Demetrios Brinkmann [00:22:04]: I was gonna ask you more about just like diving into the book and other sections of it. What were some pieces that for you, you feel like, okay, this was, as you said, challenging to put two and two together to figure out how they coexist. But now that you have it, it's like shining a light on something that was very in the dark.

Chris Fregly [00:22:33]: Yeah. And the book takes very much a bottom-up approach, right? So there's a sort of intro chapter to explain kind of the context of the book and, you know, some of the key terms. Like, one term that is related to the hardware-software-algorithms trifecta, Venn diagram, if you will, is, uh, the term mechanical sympathy. And this is a term that actually I think came from the Java world, and it references the fact that if you're gonna be a really good developer, you have to understand the environment. In the Java context, that's really understanding the JVM, right? In a, you know, C++ CUDA context, that's understanding the hardware, the, you know, GPU hardware. 'Cause every generation is more than just an incremental improvement or even like a doubling of power. You know, one thing I realized when I was diving deep is that the NVIDIA folks are really good about working with the researchers, uh, to figure out what they can abstract away into the hardware. And of course they're thinking about it at a much, much deeper level, you know, than when you abstract things at a software level, right? You know, in software, we're trying to abstract the database, we're trying to abstract, like, distributed systems, you know. Um, companies like Temporal, for example, they're trying to abstract away retries and, you know, all the distributed systems pain. Um, but at the hardware level, they're implementing operations that are common for the transformer, right? Like softmax. And so when they work with the researchers, hopefully they're able to get down to the level of, okay, you're trying to do this within PyTorch.

Chris Fregly [00:24:41]: What can we do to the hardware to make it better for you for the next generation? That was a big unlock for me, because I always, in my head, I still picture NVIDIA as like the gaming company that just got lucky, right? Yeah. But you know, Jensen is a smart guy, and he works very closely with these, you know, researchers. If you remember when like OpenAI first launched and Jensen showed up with that, you know, big massive DGX box. Um, and yeah, Elon was there and everyone, right? Um, yeah, back when things were good and stable for OpenAI. So, um, that was a real eye-opener. Um, other things, you start to learn some of what's called Jensen math, right? So yeah, Jensen's the CEO of NVIDIA, and they do these tricks. It's sort of related to marketing math, you know, it's like, okay, well, we're gonna talk about the number of flops for this chip, and talk about the maximum flops.

Chris Fregly [00:25:52]: Even though the transformer doesn't actually use that type of, like, algorithm, right? Like sparsity. Yeah. Let's say 2:4 structured sparsity or something, right? It's not as common in LLM land as it is maybe in other domains, but we're gonna talk about it as though the chip can do, you know, yeah, whatever, 180, uh, like teraflops or something.
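
As a back-of-the-envelope illustration of this "Jensen math," here is a minimal sketch in Python. The spec numbers are hypothetical, chosen only to make the arithmetic visible; they are not from any real datasheet:

```python
# Hypothetical spec numbers for illustration -- not any real chip's datasheet.
dense_tflops = 90.0  # peak dense tensor-core throughput, TFLOPS

# 2:4 structured sparsity keeps 2 of every 4 weights; the tensor cores can
# skip the zeros, which at best doubles effective throughput.
sparse_speedup = 2.0
marketed_tflops = dense_tflops * sparse_speedup  # the headline number

# A dense transformer workload never sees the sparse figure.
print(f"headline: {marketed_tflops:.0f} TFLOPS, dense reality: {dense_tflops:.0f} TFLOPS")
```

The headline figure is exactly the dense figure times the best-case sparsity speedup, which is why a dense LLM workload comes in at half the marketed number before any other losses.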

Demetrios Brinkmann [00:26:13]: So it's like, we just blew that number out of the water. Can you believe how good this is now? And it's—, then the 2% or 0.02% of people that actually know what's going on, they're like, so what?

Chris Fregly [00:26:31]: And what's bad about that is you go out and buy this chip and you give them the 70% margin and, you know, whatever. And you get this thing installed and you're like, where's my teraflops? You know, what's going on here? And then they're like, oh yeah, that's only under certain conditions. The other little sneaky part that NVIDIA has done a couple times in their history, including Blackwell, is they actually end up putting two chips into one. Yeah. And, you know, I think it's called MCM, multi-chip module, something like that. And so really Blackwell is two smaller Blackwells, and Grace Blackwell is two smaller Blackwells with a Grace CPU. So it's got, like, two eyes and a little nose. And they have, you know, super fast interconnects, and the memory has to stay coherent. Um, Grace Blackwell, the reason why people either love it or hate it is, uh, because the CPU has so much memory. Now you basically have a gigantic pool relative to the smaller GPU memory.

Chris Fregly [00:27:41]: You can access the CPU memory. Yeah, it's slower, but I could load more, you know, um, like parameters. I, I can load larger things. I can do things. And so when you're— in fact, I, I just got done working with a company, um, doing performance benchmarks on a Grace Blackwell system, uh, uh, GB, uh, 200. Um, and the calculus, the, the way that the kernels are written has to change. And so it's, it's not enough to just take your same PyTorch code and run it on a GB200. Yeah.
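
To make the "gigantic pool" point concrete, here is a rough sizing sketch in Python. The capacity numbers are assumptions for illustration (check the actual GB200 spec sheet), and `model_memory_gb` is a hypothetical helper, not an API from any library:

```python
def model_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough memory footprint of a model's weights (FP16/BF16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9

# Illustrative capacities, assumed round numbers -- check the real spec sheet.
gpu_hbm_gb = 180.0    # per-GPU HBM on a Blackwell-class part (assumed)
cpu_lpddr_gb = 480.0  # Grace CPU memory pool (assumed)

weights = model_memory_gb(405e9)  # a 405B-parameter model in BF16
print(f"{weights:.0f} GB of weights")
print("fits in one GPU's HBM" if weights <= gpu_hbm_gb
      else "needs the bigger (slower) CPU pool or more GPUs")
```

The weights alone blow past a single GPU's HBM, which is exactly when the big coherent CPU pool becomes attractive despite its lower bandwidth, and why the kernels have to be rewritten with that tradeoff in mind.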

Chris Fregly [00:28:17]: You have to think about other things, and there's one other concept, and it's the arithmetic intensity. And that's a property of an algorithm. So that's a property of, like, the attention algorithm within, you know, transformers, which is really the most critical part, where tokens come in, you know, words come in, and they're then sort of related to each other, like, how much attention are they paying to each other? It's a very expensive computation. It's, you know, O(n²) roughly. And so that term, the arithmetic intensity, is how much compute you get per byte of data movement. And, you know, data movement is actually the slowest part, right? Like, uh, moving it from memory into the chip. Once the data moves into the chip and into the registers, then it can get the full TFLOPS, right? The full compute. And, like, compute keeps going up like crazy. It's just memory bandwidth.

Demetrios Brinkmann [00:29:35]: Is really the limiting factor. So there's a lot more. Yeah. That leads nicely into something that you were mentioning before we hit record, which is all of these things that we now have to think about when building platforms for AI, which we didn't necessarily have to when we were just thinking about the CPUs.
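
The arithmetic-intensity idea is usually formalized as the roofline model: a kernel's attainable throughput is capped either by peak compute or by bandwidth times intensity, whichever is lower. A minimal sketch in Python, with made-up machine numbers (the real peak FLOP/s and bandwidth depend on the chip):

```python
# Hypothetical machine: numbers are illustrative, not a real datasheet.
PEAK_FLOPS = 1000e12   # peak compute, FLOP/s (1,000 TFLOP/s)
MEM_BANDWIDTH = 8e12   # memory bandwidth, bytes/s (8 TB/s)

# Machine balance: the arithmetic intensity (FLOPs per byte moved) at which
# compute and memory bandwidth are equally saturated.
MACHINE_BALANCE = PEAK_FLOPS / MEM_BANDWIDTH  # 125 FLOPs/byte

def attained_flops(ai: float) -> float:
    """Roofline model: attainable FLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(PEAK_FLOPS, ai * MEM_BANDWIDTH)

for ai in (10.0, 125.0, 500.0):
    bound = "memory-bound" if ai < MACHINE_BALANCE else "compute-bound"
    print(f"AI={ai:>5} FLOPs/byte -> {attained_flops(ai) / 1e12:.0f} TFLOP/s ({bound})")
```

Below the balance point you are memory-bound and adding compute buys nothing; above it you finally see the full TFLOPS. That is the formal version of "memory bandwidth is really the limiting factor."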

Chris Fregly [00:29:58]: These, like these GPUs are just little sidecars, you know, to the CPU. Right. That, I mean, that's why they're called accelerators. And that gets to another point where like the more of these accelerators you have, now you've got like bigger problems. And, and you know, there's the, the NVL72 Grace Blackwell, which is what a lot of the neo clouds are starting to deploy now, the Lambda Labs, the, the, you know, Virta out of Finland and um, CoreWeave and stuff. They— and that's got, you know, 9 trays of, um, you know, 8, um, of these like GPUs. So that's the 72. Um, and these are all interconnected in a single rack.

Chris Fregly [00:30:45]: So that's the beauty of it is it's a single rack. It uses a crazy amount of power, weighs like, you know, 3,000 pounds, which I don't know how many kilograms that is. 1,500. Yeah. Yeah. So yeah, pretty heavy, um, requires massive cooling, you know, so, and I guess they're relatively efficient in, you know, compared to if you have the same kind of, of stack or, um, set of, you know, servers up and running. Um, but they're, they're beasts that, that have to be very carefully interconnected. So now we have something that's called the NVLink, right? The NVIDIA link.

Chris Fregly [00:31:24]: Yeah. And one little secret, by the way, that NVIDIA doesn't talk about much, probably because of regulatory, um, concerns, is they acquired a company in 2019. I think it closed in 2020, um, called Mellanox. Yeah. And that's what converted, um, right, like, NVIDIA into not just a chip company, right? So NVIDIA started off just building chips. Now they're an AI systems company. So they've got the switches, the, you know, networking switches, they've got the really fast interconnects. Um, and they just brought in a lot of expertise.

Chris Fregly [00:32:06]: That company that they acquired were also the creators, or I, yeah, I think the most popular, um, like, creators of InfiniBand, right, which is like sort of next-generation Ethernet. And so while there's still a lot of data centers out there that are using Ethernet, and, you know, you pretty much want to stay away from those, by the way, you want to stick with InfiniBand, you know. And then if you want, we can kind of get in at some point to how some of the cloud providers have their own networking. And so they don't use InfiniBand. Yes. AWS being a very big one. And it always caused problems with us, with customers, because there was an extra layer where we had to adapt, you know, like the NVIDIA libraries had to then work with this custom layer that was called EFA, which required Amazon to then have, uh, backward compat— or, you know, forks of PyTorch. And it was just mess after mess after mess. And that's still with the NVIDIA GPUs.

Demetrios Brinkmann [00:33:18]: That's not even getting into Trainium.

Chris Fregly [00:33:21]: And that's just changing one piece of the puzzle. Yeah. Yeah. And when you don't buy into the NVIDIA, like sort of reference architecture, they call it, which includes all of that, you know, networking. There's a lot of things that you have to do on your own that like NVIDIA is just like, Sorry, like we can't help you. So yeah. Did they void the warranty? Like, right? That's— and that's something else too, actually. Good, good segue.

Chris Fregly [00:33:53]: These chips, I don't know how familiar you are with the failure rates of these GPUs.

Demetrios Brinkmann [00:33:58]: I mean, yeah, yeah, these things run. I wanted to ask you about the reliability in that regard, because I've heard from all different areas how unreliable they are. Just the chips themselves, but then also everything that comes around the data center, like not having power to one fourth of the whole data center and it going out. And so you have to have all of these extra failsafe modes, because you basically have to expect that things are gonna get switched off. Mm-hmm. And so the chips themselves are very unreliable. And then you're talking about, like, chips just blowing up, or chips basically being rendered not usable.

Chris Fregly [00:34:45]: Yeah. Every single cluster that I get on, like a new cluster, the first thing I do is put it through sort of a, I just, you know, put the pedal to the metal. Yeah. As they say. And, uh, figure out which ones are on the verge of burning out. And I'll burn out 2% just within a couple, you know, minutes, because they're purposely designed to run hot, which is why they need a lot of cooling. Even, yeah, I don't have one here, but those little small DGX Sparks that they came out with, if you've ever run something on there, I mean, that thing can burn a child, right? It's not safe to have in your home. Like, I don't think it's gonna burn down a house or anything, but it's gonna burn your skin. And that's one, you know, Blackwell.

Chris Fregly [00:35:37]: Um, it's a, it's a Grace Blackwell. Um, but, and that's also why that particular machine is, is a little bit, uh, constrained. They don't have certain instructions for some of the tensor cores and things because they, they physically can't keep it cool enough in a consumer device. Right. So it's not really the best device to learn like real CUDA programming on. Um, but yeah, they'll, they'll, they'll burn out there. They throttle super quickly. So when you're doing any kind of benchmarking, what you're supposed to do is lock the, the clock speed, you know, the like GPU speed.

Chris Fregly [00:36:12]: Um, so it doesn't get up above a certain temperature because yeah, otherwise at any given point in time, they're, they're being throttled so that they don't burn out. And you're talking even the new ones, right? Yeah. Oh yeah. Especially the new ones.
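The clock-locking step Chris describes maps to `nvidia-smi`'s `--lock-gpu-clocks` flag. Here is a minimal sketch; the 1350 MHz value is just a placeholder (query your card's supported range first), and the function degrades gracefully on a machine without an NVIDIA GPU:

```python
import shutil
import subprocess

def lock_gpu_clocks(mhz: int = 1350) -> str:
    """Pin the graphics clock before benchmarking so thermal throttling
    doesn't skew timings. The MHz value is a placeholder; check
    `nvidia-smi -q -d SUPPORTED_CLOCKS` for your card's real range."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found; skipping (no NVIDIA GPU on this host)"
    # --lock-gpu-clocks takes min,max graphics clocks in MHz;
    # run `nvidia-smi --reset-gpu-clocks` afterward to restore defaults.
    result = subprocess.run(
        ["nvidia-smi", f"--lock-gpu-clocks={mhz},{mhz}"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or result.stderr.strip()

if __name__ == "__main__":
    print(lock_gpu_clocks())
```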

Demetrios Brinkmann [00:36:26]: Yeah. Oh yeah. Because I think I heard somewhere where there's like a warmup period that you have to go through in order to know, like, is this gonna go or not?

Chris Fregly [00:36:37]: For sure. Yeah. Yeah. That's what I was referencing, to see. Um, because, you know, that's another thing you realize when you start thinking about hardware and reading about the failure rates and the fabs and how they're designed: at the end of the day, there are still manufacturing defects, right? That are happening. And of course they're building millions of these chips per year, and they have to be manufactured at a specific temperature, you know, which is why the Arizona fabs have always been a little bit controversial, because it's, um, so hot. Yeah, it's hot.

Chris Fregly [00:37:19]: There's no water. I mean, you're in the desert, right? And even if you do like get through all that, even, even the ambient air being different than what like Taiwan Semiconductor has in, you know, like Taiwan, it's just a different climate, just a different thing. And if any of those are different than what they expected, they have to then go through and, and, you know, add extra filters or do something with, with the air.

Demetrios Brinkmann [00:37:48]: So yeah, there's a lot going on. Why these things— did you uncover why they're so finicky? Like, why can't they make them so it's less precious and just like battle-hardened?

Chris Fregly [00:38:04]: Yeah, well, part of it is, right, like, do you remember when you maybe built your first computer, or your friend built his first computer, and you found out you can overclock the, you know, GPU or the CPU or something. Yeah. And you're like, why the hell would I not overclock this? Obviously I'm going to do this. Right. And then you do it, and then you realize, okay, it burned out. Now I gotta go back to Fry's or Micro Center or Best Buy or whatever and get a new, you know, chip. Um, so part of it is they're optimizing for performance, right? But even separate from that, at scale there's just going to be, like, you know, molecular issues, right? And so the best thing you can do, 'cause I'm given new clusters all the time to say, hey, you know, help us tune this thing, and so that's the warmup period, is really figuring that out. And in a cloud type of environment, I hammer those things, because I'm like, you guys should have already warmed this up.

Chris Fregly [00:39:12]: Oh, now it's my turn, from the sort of application level, to make sure that my application, you know, that my workload is solid. And I would say probably 60% of the time, 60 to 70% of the time, they have to spin up a new cluster because something went wrong. Yeah. And there's entire products like SageMaker HyperPod that was built specifically, um, to have a standby, like a warm standby of GPUs that have been pre-warmed up. Oh. That can be swapped in.

Demetrios Brinkmann [00:39:49]: Um, that's the whole premise of the product. Yeah. Well, you were also talking about an AppDynamics type of product, but for GPUs, that from the—

Chris Fregly [00:40:00]: yeah. Tell me more about that. Yeah. So for listeners that might not be familiar, AppDynamics is a company. Um, they're, you know, relatively old right now. Um, they were new when I joined Netflix, and that was, I think, 2011. Good for them. Um, and AppDynamics was, you know, think of it like an agent, not the current definition of an agent, but like a daemon thread, or a process that runs on all of your servers.

Chris Fregly [00:40:30]: So back then, you know, Netflix, it was all Java-based streaming servers. And each of those processes that were running were very, very, like, lightweight, but they could, you know, track all of the, like, network traffic. So we knew which of our, 'cause we had hundreds of these microservices; it was, you know, very much inspired by, like, Amazon and Bezos and the microservice idea. And so just trying to even get an understanding of how our services were all connecting to each other, and the health, that was a big win. But it also tracked the memory, it tracked, um, any sort of oddities going on. You know, this was all CPU-based. And so there's this company that I met, actually because of the book, that's called Zymtrace, Z-Y-M-trace.

Chris Fregly [00:41:23]: They're actually based outta London. Um, nice. And they, they are ex, uh, like AppDynamics people, right? Which is why I was very interested in them. Um, and they remember when, you know, Netflix was their first customer. And so add in all of the similar things that AppDynamics did. Um, but then also now add on all this complexity about the, like, GPU power, the GPU temperature. Um, any, like they can detect when a GPU is being underutilized and they can make suggestions at, at the, the full system level, you know, and say, hey, maybe, and they're very much like, right, like LLM aware. Yeah.

Chris Fregly [00:42:06]: This product. And so it can actually look at your training workload or your inference workload and say, hey, it looks like you need to add a few more, like experts in your, you know, mixture of experts model over here.

Demetrios Brinkmann [00:42:18]: And, oh, this server is underutilized, or the networking's not optimized for, yeah, for this particular setup. Yep. Going back to when you're warming up these GPU clusters, this is primarily with you running inference loads, right?

Chris Fregly [00:42:35]: It's not for training. Yeah, I do both. I mean, it depends, and, you know, keep in mind RL, right? This is kind of the dirty little secret, is that it's both, right? It's doing inference and it's doing, um, you know, backprop and updating weights and things. But interesting that you say that, because HyperPod was actually built for training. Uh-huh. Um, and the ability to, if a GPU or, like, an instance dies, we have this warm standby that can come in, and the job can then continue without this major, you know, uh, job restart where you might lose the last couple hours, 'cause your last checkpoint was a couple hours ago and things. Yeah. So, um, and then very quickly customers were like, hey, we actually want to do, like, right, like inference on this thing.
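The checkpoint arithmetic behind that warm-standby recovery is worth making concrete. A toy sketch of resume-from-last-checkpoint (all names hypothetical; JSON stands in for real tensor checkpoints):

```python
import json
import os
import tempfile

# Hypothetical checkpoint path; a real trainer would write to durable storage.
CKPT = os.path.join(tempfile.gettempdir(), "toy_ckpt.json")

def save_checkpoint(step: int, state: dict) -> None:
    # Write to a temp file then rename, so a crash mid-write can't corrupt it.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)

def load_checkpoint() -> tuple:
    if not os.path.exists(CKPT):
        return 0, {"loss_sum": 0.0}   # fresh start
    with open(CKPT) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps: int, ckpt_every: int = 10) -> int:
    step, state = load_checkpoint()   # resume from wherever the last run died
    while step < total_steps:
        state["loss_sum"] += 1.0 / (step + 1)   # stand-in for a real train step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step
```

If a node dies at step 37 with `ckpt_every=10`, the replacement resumes from step 30: you lose 7 steps of work, not 37. Shrinking `ckpt_every` trades checkpoint I/O for less recompute on failure.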

Chris Fregly [00:43:27]: Either we're generating synthetic data for the next training run, or we're doing RL, um, where we have to do a bunch of inference and then, you know, compare it to the— or, like, verify it, right, in an RLVR, right, where you're doing reinforcement learning with verifiable rewards, where you're, you know, either executing code or you're doing math equations or you're running some kind of experiment.

Demetrios Brinkmann [00:43:57]: Yep. And then verifying it. So you undoubtedly have looked at, or have a good pulse of, what's on the market, whether that is the Lambda Labs of the world, where you can get down into the bare metal and kind of control everything, from what I understand, or it's the Basetens, where it is straight like, hey, just throw whatever you want and we'll give you this API. Uh, are there any that, or do you have any commentary on what's out there right now?

Chris Fregly [00:44:34]: Yeah, sure. Yeah. And obviously, as we've seen over the years, all of these companies end up looking like each other, you know; over the years they all start to build each other's features. So I would argue— Yeah, well, you start off with one set of skills, get a little success, and then customers start to ask for more and more and more. The perfect, uh, example here would be Lambda Labs, actually. Lambda was primarily training. Um, they started to dabble a bit into inference. Um, they even had their own chat, like LambdaChat or something, at some point. Um, which I didn't really use that much, but it's very obvious when you see their roadmap which kind of customers they're now starting to get. The first initial tranche, you know, might've been the big labs, right? I don't know for sure. And then to contrast, folks like Baseten and Together and Fireworks, you know, those folks, they were just trying to get the open source models out, like, behind an endpoint, maybe supporting fine-tuning so that you can customize the open source models.

Chris Fregly [00:45:56]: Very different model, right? The labs, or sorry, you know, places like Lambda Labs and, you know, CoreWeave, they're there to customize and pre-train and do the full, full training lifecycle. But they're realizing that because inference is like a major, you know, the, these apps are running 24/7, right?

Demetrios Brinkmann [00:46:16]: People are, are like asking questions 24/7 that, that they have to get their, their like inference act together. Yep. Yeah, it's, it's fascinating to me to think about how that idea of everyone converging. And I do think that we'll see that it's cool to see some of the, uh, the different ones stay in their lane or just say like, hey, we're, we're just going to do this. And that's all that matters. And then others are trying to go and be the whole platform and I, I do appreciate that you have so much variety right now in the way that, like, the value prop of Modal is you can get these GPUs going really fast. You've got that cold start, uh, or it gets going and you're good. One thing that is wild to me though, is that they're all running on top of other people's hardware.

Demetrios Brinkmann [00:47:17]: Yeah. Yeah.

Chris Fregly [00:47:18]: They're just renting out GPUs from the other people that are potentially competitors. Right. And they're, they're able to aggregate that demand. Um, you know, cuz yeah, I don't know when the last time you've tried to go in and, and like rent a B200 from, from AWS, but you, you can't do it. Impossible. Yeah. It's impossible. And CoreWeave laughed at me.

Chris Fregly [00:47:41]: I literally, during the development of the book, I said, hey guys, I'm, you know, writing a book. I'm a third-time author. I would love to mention you guys in the book, which I still do. I'm not like a dick or whatever. But, um, they're like, we literally have no capacity, and the minimum spend is blah, blah, blah. And, you know, um, so, like, they didn't laugh at me, they were laughing with me, you know, 'cause I was like, yeah, I, you know, came from Amazon, I get it. But I just, you know, thought I would try. And they're booked out. Like, yeah, when I spoke to them over the summer, they were booked out for like 2 years, and they just can't even get enough.

Chris Fregly [00:48:23]: And, and then there's the issue of even if they could get enough, where are they gonna get the power and how do they, you know, all these turbines and the, I guess the turbine blades are the most precious.

Demetrios Brinkmann [00:48:35]: Like you can build a turbine, but you can't— I heard that recently on the Stripe podcast. Oh, interesting. It was exactly that, talking about how the blades are not being made fast enough. Yeah. Which is interesting. They're coming from— I'm in Germany right now, and I think I remember hearing that it was coming from Germany, and it's like, hey, somebody needs to light a fire under the Germans' ass. Fuel on the fire. Yeah.

Demetrios Brinkmann [00:49:07]: Yeah. What's going on? Yeah. It is wild. And, uh, how much did you research or decide not to research on the alternatives? Because like AWS has Trainium, right? And Inferentia, and then you've got TPUs and you've got AMD.

Chris Fregly [00:49:25]: There's lots out there, but, like, a very small subset of the users actually use that. Yep. Um, and that's the answer to the question right there, right? You know, this book was supposed to be 300 pages. And when I got as deep as I did, I said, it will be a disservice to the folks that read this if I don't cover, like, all of CUDA, right? That makes sense for this domain. There's still so much more to CUDA, and, you know, all the texture renderers and things that are actually still used as different types of caches, and stuff that has wonky names that don't really translate to the LLM world. But there's so many patterns that have been built up. And if you take something like FlashAttention, you know, um, that was built by the chief research scientist over at Together AI, right? He's a famous guy. Tri— yeah, Tri Dao. Tri Dao, yeah.

Chris Fregly [00:50:37]: Yeah, yeah, he's pretty badass. Um, uh, he really brought light to this, like, mechanical sympathy type of thing. And the big revelation was: let's use the fastest memory for the things that need to move, you know, the memory that's closest to the chip. The things that are constantly moving in and out, let's keep them there, so they don't have to travel across all the other slower, um, you know, lower-bandwidth interconnects. And just that one small change made huge performance gains. And that's why, with every generation, FlashAttention has to get rebuilt and reoptimized, because otherwise it's not fully taking advantage of the hardware. And, you know, there's a FlashAttention 4, or, uh, maybe it's up to 5.
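The memory-hierarchy trick behind FlashAttention works because softmax can be computed in streaming tiles, so the full attention-score row never has to sit in slow memory at once. A pure-Python sketch of that online-softmax rescaling for a single query row (toy scalar values, no GPU; this mirrors the tiling idea, not the actual kernel):

```python
import math

def attention_row(scores, values):
    """Naive reference: softmax over all scores at once (needs the whole row)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / z

def attention_row_online(scores, values, tile=2):
    """Streaming: process `tile` scores at a time, rescaling the running
    max/denominator/accumulator so only O(tile) state is ever live --
    the same rescaling FlashAttention applies per SRAM tile."""
    m = float("-inf")   # running max (numerical stability)
    z = 0.0             # running softmax denominator
    acc = 0.0           # running weighted sum of values
    for i in range(0, len(scores), tile):
        s_tile = scores[i:i + tile]
        v_tile = values[i:i + tile]
        m_new = max(m, max(s_tile))
        scale = math.exp(m - m_new)  # rescale old state to the new max
        z *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - m_new)
            z += w
            acc += w * v
        m = m_new
    return acc / z
```

The two functions agree to floating-point precision; the streaming version just never needs the whole row at once.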

Chris Fregly [00:51:26]: I think 4 is the, uh, Blackwell version, basically. So the short answer to your question is, there are so many patterns that fit the goal of my book, which was this co-design just within, like, the NVIDIA ecosystem. And it really, really covered everything that I wanted to cover. So not only is it the most widely used chip, but it's also the most widely supported. There's a lot of community support. There's people that have figured out the doc— like, you can infer around the crappy documentation from all of the examples that are out there. And that's not the case with AMD. That's not the case with Trainium.

Chris Fregly [00:52:13]: It's really hard to understand these chips without having someone from AMD or someone from Amazon right next to your terminal. Yeah. Helping you code it. So, and it just would've exploded the book. I mean, the book went from 300 pages to 1,000. And then, during reviews, people kept asking, you know, why aren't you talking about Trainium? And yeah, my funny answer was, uh, yeah, yeah. And TPUs, actually, I probably would've talked about TPUs, um, even over Trainium, just because it's a different type of, you know, architecture, and systolic arrays and, and, but, you know, we didn't have a lot of customers using Trainium back then.

Chris Fregly [00:52:56]: Yeah, obviously Anthropic was basically forced to use it, right? Because of the partnership. And against their will. Yeah. Yeah. And, well, no, I mean, they've been pretty successful. It just took a lot of work. And price performance, you know, was good, and they're, you know, trying to keep costs low and stuff. And just having other alternatives.

Chris Fregly [00:53:20]: I mean, I saw this back at, at like Netflix too with, with the, the cloud vendor, you know, Netflix famously chose Amazon, right? Like AWS as the cloud vendor. But we always teased them to, to say that we were gonna go to Azure or to GCP when at least back then there was no way in hell we were switching off. It was, it was gonna take too much effort, you know, and like we had features to build, right? Like, yeah, these days Netflix is in kind of a maintenance mode is what I've heard. I, I haven't worked there in 15 years, but, um, you know, so, so now they could take on projects like that. But yeah, back then, you know, we were building streaming while we were, you know, scaling onto, yeah, onto AWS. So just, just having that sort of threat of being able to switch, which is what Sam Altman's doing right now, right?

Demetrios Brinkmann [00:54:13]: He's got Cerebras and he's got, yeah.

Chris Fregly [00:54:18]: How much have you tried to play with coding agents, like, at this level? Yeah, 100%. I'm actually tuning a kernel right here, right behind your, um, your screen.

Demetrios Brinkmann [00:54:33]: Um, it's very different than generating a Next.js app. There's that.

Chris Fregly [00:54:36]: Or my local app that I was telling you about in the beginning. Right. Yeah. Single-use, uh, bespoke throwaway app. Yeah. You know, halfway through writing my book, the other reason why I insisted on it being as big as it is, is I selfishly wanted to use it with my coding assistant. So about July last year, I was asked to cut a lot of this book, 'cause it just got massive.

Chris Fregly [00:55:08]: We were either gonna split it into 3 books. It's really 3 books in one. It's, um, kind of a systems book, like, you know, OS, Kubernetes, um, like, optimization for AI, and storage. And then it's CUDA and all the, you know, crazy things there to support the LLMs. And then it's inference and training, and it's kind of systems-wide. And so we were gonna do a 3-part thing, and then, you know, I just couldn't take it anymore.

Demetrios Brinkmann [00:55:41]: I just had to get these, these words out. And that's what, you know, I had other shit to do. I'm just saying, my old producer, when I was making music, would tell me, he was like, you don't finish an album, you abandon it. No, that's right. Yeah. Yeah. Yeah. So that was, that was what you hit. But sorry, I cut you off.

Chris Fregly [00:56:01]: You were saying that there was that moment. There was a moment when I was like, this book is going to be good for the readers, but this book is gonna be even more valuable to the agents, right? To the, you know, coding assistants. And it has saved me so many times, because these models, you know, lack in certain domains, right? There's just not enough training data, uh, for example, on how to tune kernels and, you know, how to optimize CUDA. Um, it's in a lot of engineers' heads that work in, you know, Santa Clara at the NVIDIA office, but it's not out there, because the code's not open source, right? There's a lot of code within their ecosystem that's hidden. Um, and it's all C++, and it's all, you know, shipped as binaries and assembly, and, you know, good luck trying to reverse engineer some of that stuff. Right. So my goal was, if I was going to continue down this route and let, you know, others explore this route, I really wanted to make the book more of a flywheel. And that's why the GitHub repo is so critical.

Chris Fregly [00:57:21]: Oh yeah. And I've built a bunch of MCP tools that are of course very friendly. I now have OpenClaw. Um, you know, yeah, I had to jump on OpenClaw. Nice. So I've got, uh, like, a GPU claw, um, or, yeah, I keep renaming it, right? Just like OpenClaw. Yeah. I've renamed it about 3 or 4 times, but really it's a sort of AI systems optimizer skill, and it can go out there and run all the cluster checks and, you know, try out the different things and make sure all the tensor cores are working as they should be and stuff.

Chris Fregly [00:57:59]: Wow. Right. Because that's the key. In, in CPU land, we've got instruction pipelines, right? And some of them are instructions to load memory, instructions to do— but they're like fairly general and basic. On a GPU, there's pipelines to do crazy, crazy shit. Um, you know, there's, there's even special memory that's called tensor memory that works with the tensor cores. So tensor cores are like segmented off separate from the CUDA cores, the sort of generic, you know, CUDA cores. And they're typically designed for specific precisions like floating point 16 or FP4, you know? And so when you're, when you're writing a kernel, you, you have to know that there are separate pipelines for these things so that you can write them in, you know, parallel.

Chris Fregly [00:58:55]: And say, okay, well, while you're computing the results, um, in FP8, I want you to go do the multiplication, um, for the first instruction, 'cause you're constantly overlapping, right? Yep. That way you can hide the relatively long time that it takes to move the data in from memory. And so now it's a trick of trying to take advantage of those parallel pipelines. And, um, yeah, Blackwell did something sneaky, by the way, when they moved from Hopper to Blackwell, that not a lot of people know about, where they got rid of one pipeline and put it towards another. Um, you know, they basically said, okay, it seems like these days no one's really using, you know, int8 or FP64 or something, so yeah, we're going to actually take transistors away, you know, resources from that particular use case, and put them towards— and if you don't know that, and you've got code that's relying on that, you're gonna run on Blackwell and see a huge performance hit and be like, you know, yeah, what the eff?
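That overlap pattern (prefetch the next tile while the current one is being computed) can be sketched on a CPU with a background thread standing in for the GPU's async copy engine. All function names here are hypothetical stand-ins:

```python
import concurrent.futures
import time

def slow_load(tile_id):
    """Stand-in for a global-memory -> shared-memory copy."""
    time.sleep(0.05)
    return [tile_id] * 4

def compute(tile):
    """Stand-in for tensor-core math on one tile."""
    time.sleep(0.05)
    return sum(tile)

def pipelined(n_tiles):
    """Double buffering: kick off the load of tile i+1, then compute
    tile i while that load is in flight, hiding the copy latency."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(slow_load, 0)
        for i in range(n_tiles):
            tile = pending.result()                        # wait for tile i
            if i + 1 < n_tiles:
                pending = copier.submit(slow_load, i + 1)  # prefetch tile i+1
            results.append(compute(tile))                  # overlaps the copy
    return results
```

A fully sequential version costs roughly n x (load + compute); the pipelined one costs roughly one load plus n x compute, which is the latency-hiding the transcript is describing.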

Demetrios Brinkmann [01:00:05]: So, yeah, I thought this was supposed to be better. Uh, yeah. Yeah. Oh man. And so have you just created, like you said, skills?

Chris Fregly [01:00:15]: For them that you've thrown on the GitHub repo, or is it that you're using, uh, some kind of a RAG on the book? Right. I, of course, back when RAG was big, that was in my head. Um, these days I just rely on Cursor's, like, grep and ripgrep. And so, yeah, I've converted the book into Markdown, and it's by chapter, and I've got all of the examples, also by chapter. And so I can usually hint, you know, to Cursor, like, put in the chapter that I know is relevant to this particular task that I'm working on, just to kind of minimize it, so it's a little bit faster. But, um, and this might be a good segue to the controversial Codex versus Claude.

Demetrios Brinkmann [01:01:08]: So, yeah, curious what your, uh, go-to is these days. Is it Opus 4.6? So I use Opus 4.6. I have heard from friends— I just haven't given Codex enough time. And I will also use Gemini, and then Antigravity, which sometimes completely bricks out my computer. Uh, so, like, it's so heavy. It's just like, what is going on here? But that's a whole nother story. And I haven't given Codex enough time.

Demetrios Brinkmann [01:01:39]: I have friends that have said they prefer it. It's the best because it doesn't take as many shortcuts, all of that stuff. But I'm like, dude, I, I'm already paying Anthropic so much.

Chris Fregly [01:01:52]: I know. Every month. And I'm cool just sticking with it. It's good enough. It's good enough. Yeah. Yeah. Yeah. I was once like you.

Chris Fregly [01:02:04]: Yeah, for sure. You've, you've changed. Uh, yes. I used Claude Code initially, because of the hype, and because, you know, Codex is relatively new compared to Claude Code. Maybe it came out a couple months later. So obviously the— but what I noticed, especially with larger code bases, you know, there's almost a thousand samples in my, um, you know, GitHub repo for the book. And I've noticed, yeah, I think shortcuts is probably a good way to put it, where it's not reading all the files. Yeah.

Chris Fregly [01:02:41]: Claude, it, um— you know, Codex is way slower. Yeah, obviously way slower, which is why I was excited about the Cerebras version, uh, the, you know, Codex Spark. Um, I have thoughts on that. I still stick with the slower one. Oh yeah. Yeah. Um, but I can get up and walk away, go to the gym or go to the store, come back, and I know Codex will have completed the task. Yeah.

Chris Fregly [01:03:15]: With Claude, I feel like it's designed for a different type of user who is sitting there waiting and wants to see and wants to be part of the, of the solution.

Demetrios Brinkmann [01:03:24]: Yeah. Do you feel that? It's a little more back and forth. Yeah. I will notice that. Yeah. It hardly ever gets it the first time. And so you have to go and you have to manually validate and then be like, oh, I'm getting this error, uh, something, this isn't working. Check that.

Demetrios Brinkmann [01:03:43]: And, uh, I could see that. All right, well, shoot.

Chris Fregly [01:03:46]: Like I needed another fricking subscription in my life. I guess I'm gonna— I have to go now and do it. I mean, I think that's where the Ralph Wiggum stuff comes from, right? Like putting it in a loop. Like, I just thought that that was so ridiculous. You know, I'm like, yes, I love Ralph. He's, he's obviously, uh, my favorite character on that show. Um, but I don't need him in my development cycle. I just want the model to do it.

Chris Fregly [01:04:13]: Now, the other thing too is that it appears that the way Codex has been post-trained, it is doing more frequent, um, like, compactions, and it's doing it in the model, where it seems like Anthropic's post-training has it compact kind of only when it gets full. Whereas, you know, Codex kind of does it when it determines that it's a good time, that it can compact things and not lose too much, you know, versus a huge, huge context.

Demetrios Brinkmann [01:04:48]: Okay, now I'm at 70% or 65%, now I'm gonna compact. Um, yeah, actually, a lot of times the best performance gains you can get are just by clearing that context window. Mm-hmm.
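The threshold-based compaction being described can be sketched roughly like this. The 4-characters-per-token estimate and the keep-head/keep-tail policy are assumptions for illustration, not how any particular agent actually implements it:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text (assumption)
    return max(1, len(text) // 4)

def maybe_compact(messages, window_tokens, threshold=0.65):
    """If the transcript exceeds `threshold` of the context window, keep the
    first message (instructions) and the recent tail, and collapse the
    middle into a one-line summary placeholder."""
    used = sum(estimate_tokens(m) for m in messages)
    if used <= threshold * window_tokens or len(messages) <= 3:
        return messages   # still comfortably inside the window
    # In a real agent the summary would come from the model itself;
    # here it's just a placeholder line.
    summary = f"[compacted {len(messages) - 3} earlier messages]"
    return [messages[0], summary] + messages[-2:]
```

Compacting at 65% rather than at 100% leaves headroom for the next few turns, which is the behavior difference being discussed.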

Chris Fregly [01:05:02]: And then it's like, wow, the problem that I was having, it just miraculously disappeared. Yeah. It's counterintuitive because you want it to like remember all of the mistakes that it made or whatnot.

Demetrios Brinkmann [01:05:12]: But yeah, if you give it a— it figures it out. For the skills piece, you know what other one that I've been loving? Uh, it's called, I think it's called like Claudeception. Have you heard of that one where it will, it creates skills when it goes through and it does something, it will then reflect back and be like, could I have made a skill out of this?

Chris Fregly [01:05:33]: And then it will create— that's exactly what it—

Demetrios Brinkmann [01:05:36]: skills I've been looking for. Oh, that's crazy. Okay. Yeah. Cause I do that manually. Yeah. I figured when you were saying, oh, well, yeah, I, I will go and create some skills if, if it's there and I see that it's needed, but yeah, try Claudeception out, see if it works. And there's, there's other ones that have been very useful too.

Demetrios Brinkmann [01:05:58]: My friend Raul was telling me he just creates one skill and he spends a lot of time teaching that one skill, how to then create all the other skills.

Chris Fregly [01:06:08]: So if he ever says, okay, go make a skill out of this. It knows what to do. Cool. Okay. Yeah. Yeah. That's a cool pattern.

Demetrios Brinkmann [01:06:19]: Yeah. So I love these discussions, 'cause everyone has their own workflow, and, um, that's what we started doing, because there's just, basically, 100,000, 200,000, 2 million, I don't know how many people now, who are their own little R&D department. And we've all been empowered with, hey, just go try stuff. And so we get together on Fridays now and have little lunch and learns and be like, what are you doing with it? Well, oh really? Oh, that's interesting. And we've had some folks come with ideas of, well, in human psychology, we can't really understand 7 layers deep. Oh, I love it. We can't go past 7 layers.

Demetrios Brinkmann [01:07:01]: So anytime I create a file structure, I never want to go past 7 layers for the LLM. And it's like, dude, I have no idea if that works, but just the fact that you're trying it is really cool to me. Yeah. When do you run these? Is this a formal thing on Fridays? Yeah. Yeah.

Chris Fregly [01:07:22]: I'll, I'll invite you to 'em on Fridays. Oh, please. At lunchtime in California. So around 11:00 AM. Yeah. Nice. Yeah, man. Yeah, it's fun.

Chris Fregly [01:07:32]: Yeah. In fact, do you know Swyx? Yeah, of course. Last, um, I was chatting with him, um, at the, uh, like, ClawCon conference. Yeah. Do you remember the ClawCon that happened? The ClawCon. Yeah. Dude, that was a shit show. It was an absolute shit show.

Chris Fregly [01:07:51]: Yeah. Oh man. In like a good way, you know, like a San Francisco's back, like kind of way. Right. People had like lobster, they were serving. Like, um, crab legs. Yeah. And just chaos.

Chris Fregly [01:08:08]: Um, but I was talking with Swyx, and we were chatting a little bit about making that like a first-class track in his upcoming AI Engineer summits. Just having people, yeah, even if it's informal, maybe like a lunch where everyone comes, gets together, and shares.

Demetrios Brinkmann [01:08:26]: But I think that's super powerful. Yeah. Yeah. When did you start that? Just recently, because we did a reading group in January, our first reading group of the year, it was on vibe coding. And then we read the paper, which was cool for the first 15 minutes. And then it just devolved into everybody sharing. I do those. How they've been getting a little bit more gain.

Demetrios Brinkmann [01:08:47]: And that has, that was like, wow, well, we should do this again. Let's make it a thing. And so we started doing it on Fridays and it, it's been so useful to me. And especially because maybe I'll know about this, but then somebody else knows about that and I'll say, oh yeah, but did you try Claudeception? And they're like, well, no, actually I like, you know, this one better. And, and it's the community Slack is great for that too.

Chris Fregly [01:09:17]: But when you are actually prodding and probing a little bit more, then you get different ideas out of folks. Totally. Yeah.

Demetrios Brinkmann [01:09:24]: Great idea. I'm, that's gonna be, I'm gonna clear my, my calendar for that one. There you go. Well, yeah. And, and actually now that you say that, we're doing the Coding Agent Conference on March 3rd, right? And one idea that I had for that was exactly what Swyx is thinking on. We need some kind of unconferency way to let people just give their best tips and I landed upon, we're gonna do a session where we'll just have a Slido up and people go to the Slido and then they put in Slido what their best trick is. And then whoever up— like what gets upvoted the most, that person gets to come up on stage and then demo that and showcase it. So that's one of the things that we're doing.

Demetrios Brinkmann [01:10:13]: Yeah. And then I'm also gonna do that kind of thing with hot takes. Like, what's your hottest take?

Chris Fregly [01:10:19]: And then whichever ones get upvoted the most, they come on stage and have to defend their hot take. You know, I think the biggest hot take was from Peter Steinberger, uh, the OpenClaw guy, right? The whole journey. The journey. Yeah. It was very clear from his Lex Fridman interview. Um, you know, he was at the ClawCon, of course. Uh, yeah, I mean, he was talking trash about everything. Yeah, he said something, and, you know, this is all public, you can go watch the video, but he was joking that so many folks reached out to him. Uh, he got an email from NVIDIA, like not, you know, Jensen directly, but some, whatever, um, like lieutenant of Jensen's, and they sent him a Microsoft Teams invite, and he just rejected it because of that and said, I'm not going to work with you guys because you use Teams.

Chris Fregly [01:11:19]: No way. Yeah. So, I mean, he's a, you know, he's a really interesting character in that, like, he just doesn't give an eff, you know? Yeah. He's got enough money to not care. Right. So, yeah.

Demetrios Brinkmann [01:11:32]: Which I didn't really, like, I knew nothing about this guy until— Me neither. OpenClaw. Yeah. Yeah. But yeah, I think a lot of folks. I had no idea who he was or what he was doing. I just love the fact that he shipped like 40 products before OpenClaw took off. Totally.

Demetrios Brinkmann [01:11:46]: Oh, right, right. It's like, that's, that's awesome to see.

Chris Fregly [01:11:54]: I also just found out that he is absolutely jacked and like might be on steroids. Yeah. I mean, yeah, yeah.

Demetrios Brinkmann [01:12:01]: He's a big dude, I guess, in person. Uh, I mean, he's, he's like Austrian, so. Yeah, he's got those.

Chris Fregly [01:12:09]: So is it steroids or is it Austrian? I don't know. Yeah.

Demetrios Brinkmann [01:12:15]: Yeah, he's got the Schwarzenegger. Austrian. Oh, that's hilarious.

Chris Fregly [01:12:18]: Um, yeah. Did you listen to the Fridman podcast? No. Is it worth it? It's like 9 hours or 4 hours or something, but Fridman's good now about doing clips, and so you could see, um. But yeah, Fridman kept pushing him on which direction are you going. Oh yeah. And then this is what I was gonna say. I was surprised. He's pretty anti-Anthropic.

Chris Fregly [01:12:43]: Everything from, I mean, besides the little mini cease and desist on the name. Yeah. He was not impressed. It, it didn't make the top 2. The top 2 that he told Lex about were either Meta, because Zuckerberg actually pinged him on, on WhatsApp directly. Wow. And was asking him questions about the API and how it was designed and said he was actually building something and he wanted to talk to him. And so, um, and he, he liked Sam Altman.

Chris Fregly [01:13:11]: He made a comment and said, you know, a lot of people vilify those guys, but, you know, honestly, they're hackers like us, right? Just, um, just in, you know, different leadership types of positions. But yeah, not Anthropic. Basically he's a Codex user, so, uh, and he, um, he was describing very similar things to what you and I are mentioning, where it's like, there's too much back and forth. There's too many decisions that I need to make that I would expect the model to make.

Demetrios Brinkmann [01:13:43]: And I just want to fire off 15 of these things and know that it's going to get it done. Yeah. Oh man. Now I got to go try it.

Chris Fregly [01:13:53]: No, I gotta go have some fun with code. 200 bucks. Yeah, I'm gonna go and do that right now. Shoot. Yeah.

Demetrios Brinkmann [01:14:01]: What's your, uh, biggest like monthly bill for cloud or for any of these? Uh, probably, so I am relatively, again, I'm just building for myself, right?

Chris Fregly [01:14:15]: So relatively calm, but I probably spent like 300, 340, which can be a lot or it can be nothing depending on who I'm talking to. That's nothing. Yeah. Yeah. That's just me getting my, you know, warming up those clusters. Totally. No, they go by quickly. Um, I think from what I hear at, at meetups, I think 1,500 to 2,000 per month seems to be pretty standard.

Chris Fregly [01:14:42]: Yeah. Some people hit 25, 30. But yeah, I can burn through like 2,000 without even thinking about it.

Demetrios Brinkmann [01:14:47]: And then I have to, you know, set up alerts and be like, oh crap, you have to make sure I can pay my, my rent. So, yeah. Oh, that's classic. And so talk to me how you are using these for the kernel exactly.

Chris Fregly [01:15:03]: I mean, you're doing the skills like you mentioned, but what other stuff are you doing spending all this money? One of the issues, if I get a cluster with a new chip, so let's take the AMD or the newest versions of, you know, Trainium, which I didn't have any exposure to before I left Amazon. So they're, you know, pretty new to me. You have to really dig deep. You have to look at, like, take a really fast library. In, you know, the NVIDIA world, that would be cuBLAS. Right. Which is highly tuned by the CUDA ninjas, and it's like this badge of honor if you can get close to it. And so, um, doing something like that for these other chips and looking at the assembly. And these models are not good with assembly, but they're good at reading text.

Chris Fregly [01:16:04]: I mean, if it's like genomic text or strings, you know, and can build patterns. And so I've developed some ways to, like, reverse engineer. And this is all brand new stuff that I've been working on since I finished the book, that, you know, I finally had time to explore. Um, but really, outside of having one of these CUDA engineers next to you or these, you know, hardware folks next to you, that's really the only other way to do it: take these, uh, non-public, um, you know, frameworks, uh, have them compile the code, and then just try to decompile and, yeah, just basically reverse engineer and then try to apply it to other use cases. Wow. Yeah. Um, and that then leads to different types of, like, innovations with FlashAttention and, you know, the DeepSeek moment, where they found something that was in the assembly and then they had to reverse it. And fortunately, you know, the instructions kind of give you a hint as to what they are.

Chris Fregly [01:17:16]: Like, if an instruction has something like a .L2 suffix, it's doing something with, or bypassing, the L2 cache. And so you can play with that: you can pull it out, you can put it in, you could double it, you could, you know, do it twice, and see what's actually going on. So right now it's a game of just, let's try to figure these things out, because even, you know, the NVIDIA folks aren't always optimizing for your use case. Yeah. I mean, fortunately they have a lot of resources now on, um, right, like LLMs and things.
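To make the kind of cache hint being described concrete, here is a small sketch (my illustration, not from the book or its repo): CUDA exposes load intrinsics such as `__ldcg()` that compile to PTX loads carrying cache-operator suffixes like `.cg` (cache in L2, bypass L1), which is exactly the sort of marker you can spot and experiment with in the generated assembly.

```cuda
#include <cuda_runtime.h>

// Sketch: __ldcg() lowers to a PTX "ld.global.cg" load, which caches
// only in L2 and bypasses L1. Comparing the PTX/SASS of this kernel
// against a plain "in[i]" load shows the cache-operator suffix change.
__global__ void copy_l2_only(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = __ldcg(&in[i]);  // L2-cached, L1-bypassing load
    }
}
```

Compiling with `nvcc -ptx` and grepping for `ld.global` is one way to watch the suffix appear and disappear as the intrinsic is swapped in or out.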

Chris Fregly [01:17:49]: But even like 1 or 2 years ago, yeah, it wasn't the case. There's not enough training data. And I think when these labs go to do their next round of RL or their next round of pre-training, let's start with pre-training, you know, they're trying to get in these long reasoning chains, which I'm generating right now. It's trying different things. It's, like, exploring. So, um, actually this leads to a kind of more general pain point that I see when I'm using these models, which is, yes, I can read snippets of the reasoning chains, the thinking. Yeah. But you can't see all of it. Yeah. And I feel like, heck, we're paying for it. I mean, that's 90% of my, you know, like $2,000 bill, the stuff that I can't see.

Chris Fregly [01:18:43]: Right. And so I actually wanna get a way to, to, you know, pull that in.

Demetrios Brinkmann [01:18:49]: So, um. That's not true. I think with the API, you can get more. Yeah. More detail. You know, I also had this guy Rohit on here, probably 7, 8 months ago now, and he said something beautiful that I still can't do, but you can kind of hack it together, which is, he's like, why can't we have different situations where, if I want to go through and figure something out during that reasoning process, I can nudge it one way or the other? And so now I think you can kind of do that with just interrupting it and then saying, go here, go there.

Chris Fregly [01:19:39]: But the problem is, if you don't see the reasoning happening as much as you would like, then you can't steer it. Yeah. Yeah. Or it's mid-steer already. 'Cause that actually happened to me last night. I was trying to steer it, and by the time I see it, it's already too late. Right.

Chris Fregly [01:20:04]: And so, yeah. Yeah. A little confusion around that kind of stuff. But, you know, I mean, I'm thinking of this in the sense that I'm burning tokens to get to an answer, and I want to make sure I get that answer a lot faster next time. Right. And it's like, how do I do that? I can do the thing, you know, where it's going to create a skill, uh, like you were saying, kind of dynamically, or I can have it update the AGENTS.md or the CLAUDE.md to remember something about that. But I still think there's a case when, oh, and the point that I was gonna make was I also met this other company, um, actually several companies, but one company in particular, out of New York, that is building their own sort of reasoning chains for CUDA kernel optimizations.

Chris Fregly [01:21:01]: Oh, nice. And that got the attention, of course, of the big labs, right? They very much have an interest in making their models better at doing this, because it's going to make their jobs much easier. Also because OpenAI, for example, hasn't quite made the shift to Blackwell, or last I heard they were still on the way, and you can tell because they created this language called Triton. And yeah, I cover Triton in the book, I think chapters 12 and 13 or something. Triton abstracts you from having to write CUDA, and it's in Python, its own DSL, you know, a Python-based DSL. And it gives you access to CUDA-like constructs, but at, you know, kind of a higher level of abstraction. And those libraries are not optimized for Blackwell. And it's because, like, OpenAI hasn't gotten there yet. So they're not spending any resources.
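To make that abstraction gap concrete, here is a sketch (mine, not from the book): the plain CUDA version of an element-wise add, where the programmer handles thread indexing, the tail-bounds check, and launch geometry by hand, which is the boilerplate a few lines of Triton's Python DSL would otherwise generate for you.

```cuda
#include <cuda_runtime.h>

// Plain CUDA element-wise add: explicit global thread index plus a
// bounds check to guard the final, partially filled block.
__global__ void vec_add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
void launch_vec_add(const float* a, const float* b, float* out, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;  // ceiling division
    vec_add<<<grid, block>>>(a, b, out, n);
}
```

In Triton, the same kernel is a few lines of block-level `tl.load`/`tl.store` operations in Python, and the compiler takes care of the per-thread indexing underneath.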

Chris Fregly [01:21:59]: So you can kind of back into where they're at in their hardware stack by how much they're supporting the software that they have open sourced. I mean, they might have an internal fork that they're not sharing publicly, but yeah. Um, yeah. And, oh, these companies that are tuning all day and have huge clusters, trying out different experiments to get the best speed under sort of LLM, you know, like token-generation use cases. They're sharing this data with the labs, which, when I talk to them, I say, don't do that. Like, why are you doing that? Right. Because that's, well, that's like— Sell the data, don't share it.

Chris Fregly [01:22:50]: Yeah. And, you know, quite honestly, maybe they are selling it and they're not telling me or something, but it sounded to me like they're trying to get in with the labs, so they're just going to give it to them for free.

Demetrios Brinkmann [01:23:06]: And I'm like, dude, like, that's your—

Chris Fregly [01:23:07]: that's, that's your, your moat, you know? Yeah, exactly. That's basically you're shooting yourself in the foot there. Yeah. Yeah, potentially. So not unless there's some other arrangement. Yeah.

Demetrios Brinkmann [01:23:17]: They're also doing it with like AMD. They're also doing it with, um, these sort of newer chip companies. Yeah. Have you thought about open sourcing your different conversations or reasoning paths for stuff that you've been doing?

Chris Fregly [01:23:35]: Because it could potentially be training data there too, right? Yeah. But, like, that's what I'm saying: I have to get this data, and I don't know how to get it from OpenAI. All right. Yeah. If I was using the API, maybe, but I use my subscription. Yeah. Through, you know, there's like two different ways to authenticate. You could use the API key, or you could use, but yeah, that's exactly what I'm saying. Like, I don't know if anyone else has thought of this, but we should have access to these tokens.

Demetrios Brinkmann [01:24:15]: Especially if you're creating such high-value data. Yeah, right. Well, OpenAI very likely has access to these. So now, I wonder what the chances are of them actually knowing that this is being created and that this is valuable. It reminds me of back in the day, I heard a story that if you Googled something about some really fancy algorithm, then you would get a call and it would be like, hey, come work for us. Because there was only like— Yeah, from Google. Yeah, from Google. It was something around, I think it was a search algorithm or something. Yeah, yeah.

Demetrios Brinkmann [01:24:55]: You were like, oh, if you know about this, you've gotten this far, then you are legit and you should come and work for us. Hilarious. So hilarious. It's, it's like that.

Chris Fregly [01:25:07]: Do they know that the data chain that you have going on is that valuable?

Demetrios Brinkmann [01:25:17]: I wonder. I mean, they certainly can, they can find it. Yeah. Yeah, of course. Yeah. We're telling 'em right now. We're making it very clear. If anybody at the labs is listening, go and look at If anybody at OpenAI specifically—

Chris Fregly [01:25:36]: yeah, it's like right now, 9:38 in the morning. Yeah, yeah, PT. You're tuning kernels. That's some sweet stuff.
