MLOps Community

Speed and Scale: How Today's AI Datacenters Are Operating Through Hypergrowth

Posted Feb 03, 2026 | Views 6
Tags: AI Agents, AI Engineer, AI agents in production, AI Agents use case, System Design

Speakers

Kris Beevers
CEO @ NetBox Labs

Kris Beevers is the Co-founder and CEO of NetBox Labs. NetBox is used by nearly every neocloud and AI datacenter to manage their networks and infrastructure. Kris is an engineer at heart and by background, and loves the leverage infrastructure innovation creates to accelerate technology and empower engineers to do their best work. A serial entrepreneur, Kris has founded and helped lead multiple other successful businesses in internet and network infrastructure. Most recently, he co-founded and led NS1, which was acquired by IBM in 2023. He holds a Ph.D. in Computer Science from Rensselaer Polytechnic Institute and is based in New Jersey.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Hundreds of neocloud operators and "AI factory" builders have emerged to serve the insatiable demand for AI infrastructure. These teams are compressing the design-build-deploy-operate-scale cycle of their infrastructure down to months, while managing massive footprints with lean teams. How? By applying modern intent-driven infrastructure automation principles to greenfield deployments. We'll explore how these teams carry design intent through to production, and how operating and automating around consistent infrastructure data is compressing "time to first train".


TRANSCRIPT

Kris Beevers [00:00:00]: We couple our observability, the physical and the logical, with our context about how this stuff is supposed to be working. Marry all of that up, and what you figure out is the delta between our observations of what's really happening and what our intent, or the design, says. And then that delta is the problem that we have to go fix.
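The reconciliation loop Kris describes, computing the delta between design intent and observed state, can be sketched in a few lines. This is an illustrative toy, not NetBox's implementation; the keys and state values are hypothetical:

```python
# Minimal sketch of intent-vs-observed reconciliation: the "delta" between
# what the design says and what telemetry reports is the work queue.

def compute_delta(intended: dict, observed: dict) -> dict:
    """Return per-key drift between intended and observed state."""
    drift = {}
    for key, want in intended.items():
        have = observed.get(key)
        if have != want:
            drift[key] = {"intended": want, "observed": have}
    # Things observed in the field that the design knows nothing about.
    for key in observed.keys() - intended.keys():
        drift[key] = {"intended": None, "observed": observed[key]}
    return drift

intent = {"sw1:eth1": "up", "sw1:eth2": "up", "rack-a1:psu": "dual-feed"}
telemetry = {"sw1:eth1": "up", "sw1:eth2": "down", "rack-a1:psu": "single-feed"}

delta = compute_delta(intent, telemetry)
# Each entry in `delta` is a discrepancy to investigate or remediate.
```

Everything in `delta` is, in Kris's phrasing, "the problem that we have to go fix".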

Demetrios Brinkmann [00:00:26]: Probably a really good place to kick it off: how has this new world shaken out, and what does it look like now?

Kris Beevers [00:00:35]: Yeah. First, now that we're up and running, just to contextualize for the audience a little bit: I'm Kris, co-founder and CEO of NetBox Labs. I'm an engineer and I work on infrastructure. That's me. What we're talking about today, I think, is mostly all of this infrastructure investment that is happening in support of AI and machine learning, these things that, as we have talked about, are taking over the world right now. And the infrastructure investment here is enormous, right? AI infrastructure is some meaningful percentage of GDP growth in America this year. So what is going on there is, I think, our big theme today.

Kris Beevers [00:01:30]: And then just to contextualize myself and NetBox, how we fit into this and what we see: NetBox is the system of record for infrastructure. That's what it is, if you distill it way down. At its core, NetBox is a model that the teams building this kind of infrastructure are using to keep track of everything in their infrastructure: everything from space and power and cooling, these foundational fundamentals that we hear about all the time, all the way up to racks and GPU servers and switching and cabling, and then into logical stuff like IP addresses and switch configs and server configs and all of the automation that happens around this. So really, from as early as "I've decided I want to design a new big AI data center" to "this thing is running and scaling and serving inference or training workloads at megascale," we have this incredible lens into that journey. And the other simple point of context I'll share is that I don't think there's any AI data center in the world that is built without NetBox. So we get this incredible cross view into all the variations, all the things people are trying, and what is happening. And the word we were using a few minutes ago is chaos.

Kris Beevers [00:02:58]: Right? The chaos in this space right now. And it is insanely chaotic, just because of the pace, the amount of demand, the level of investment, how many teams and companies are being formed all at the same time to tackle this, and even the geopolitics of it. There are so many factors. So that's a little bit of the context of what we're going to talk about. And the headline here is: everybody's investing right now. There's no secret answer to how to build AI infrastructure. It's a lot of hard work, a lot of logistics, and a lot of supply chain type problems.

Kris Beevers [00:03:42]: And where am I going to get the power, and how am I going to get it in 45 days, when I've made a commitment to get 300 megawatts online? And where am I going to get the servers and the GPUs and the cabling and the racks, and the humans to do all this stuff, or, increasingly, the robots? All of that is what is happening now. So it's a fascinating time in this space.

Demetrios Brinkmann [00:04:07]: And so it feels like every single one of these steps is a potential bottleneck. Are there some that you see being more of a bottleneck than others? I know that back in the day it was the GPU constraint. Has that been alleviated, and is it now more about the actual wiring, or the power?

Kris Beevers [00:04:31]: Yeah, it's such a great question. Building these kinds of infrastructures at the speed and scale that the world demands right now is a constraint satisfaction problem. At any given time there is a constraint, and it is the primary bottleneck. There have been several of them over the last couple of years, and there will continue to be several as long as demand is what it is, which is almost infinitely more demand for AI infrastructure than there is supply. And that's the root cause, I think, of the ferocity with which investment is happening in the space, and the pace at which teams are trying to build. So let's just break down some of the different kinds of constraints that we all hear about. We all hear about power as a constraint.

Kris Beevers [00:05:27]: For example: where am I going to get 3 gigawatts of power? The week before last, I did a call with a public company building AI data center infrastructure that I hadn't really heard much about in the past; they grew up in kind of bitcoin land. I got on a call with the CTO and he said, you know, we have 3 gigawatts of power capacity. And my jaw dropped, because that's problem number one you usually hear. The kinds of anecdotes we're hearing give everyone a sense of the chaos and the scrappiness with which this power problem is being approached. We have a saying we use on our team at NetBox Labs for how we need to operate to serve this kind of customer, and that saying is "turbines in the parking lot." That's a real anecdote. A few months ago we were talking with one of these real hyperscale AI data center builders, who's got tens of billions of dollars in commitments from one of the research labs to bring infrastructure online incredibly fast. And what they told us was, hey, Kris.

Kris Beevers [00:06:51]: Hey, NetBox Labs team. This infrastructure is coming online real fast. We're buying turbines and we're putting them in the parking lot. That's how we're getting our power fast enough, because the grids can't operate this quickly, can't scale this quickly, and we can't build nuclear reactors this quickly. So power is a big constraint that everybody's investing in, obviously.

Demetrios Brinkmann [00:07:12]: Wait, so how did this other company have the three gigawatts of spare capacity?

Kris Beevers [00:07:18]: Luck and history. One of the interesting trends we've seen is that companies that invested maybe seven, eight, nine years ago, for bitcoin, in tapping cheap power, or in building infrastructure in environments suitable for large scale data centers, with good cooling characteristics or access to hydro, both for power and for cooling, are actually really well positioned in this AI infrastructure world. We've all seen that with CoreWeave, who's sort of the flagship builder of infrastructure in this space. I think a lot of their historical contemporaries in the bitcoin mining space have recognized, hey, it's way more lucrative to shift over to this AI infrastructure stuff and use the fact that they've already invested in solving these big constraints: the power, the cooling, the space, and even just the logistics. And I think that's the next big constraint we see, one that maybe isn't talked about as much as it should be. Think about everything that has to line up to bring a big, say 300 megawatt, AI data center full of Nvidia superpods online. There are a lot of moving parts, and they all have to converge.

Kris Beevers [00:08:54]: And they start as early as: I've purchased a parcel of land and I have some power commitment. Great, what am I going to do with that? I need to design a physical space. I need to interlock that with my ability to procure racks and GPU servers and switching and fiber optic cabling and turbines and power infrastructure and liquid cooling infrastructure. All of this has to converge in a design, and that design has to drive a purchasing workflow. All of those components have to arrive before you can bring anything online. So there's the problem of even just: what am I going to order to make sure it's all going to work together at massive scale? How am I going to make sure it all arrives in a way I can interlock with my design, so I've got everything I need to bring this AI infrastructure up and running? And then how am I going to deal with the fact that I've got thousands of pallets of stuff arriving at a loading dock, and it needs to get racked and stacked and plugged in and cabled and burned in and tested and configured and deployed and observed, and then ultimately delivered to an end customer for training, or sliced and diced cloud-style for retail, inference-type use cases?
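The "design drives a purchasing workflow" step boils down to aggregating the design's component list into a per-vendor bill of materials. A minimal sketch, with hypothetical SKUs and vendors:

```python
from collections import Counter

# Sketch: collapse a design's component list into a bill of materials
# (vendor, SKU, quantity), the first step toward interlocking design
# data with procurement. All SKUs here are made up.

design = [
    {"sku": "GB300-NODE", "vendor": "nvidia"},
    {"sku": "GB300-NODE", "vendor": "nvidia"},
    {"sku": "7800R3",     "vendor": "arista"},
    {"sku": "OM4-LC-3M",  "vendor": "corning"},
    {"sku": "OM4-LC-3M",  "vendor": "corning"},
    {"sku": "OM4-LC-3M",  "vendor": "corning"},
]

bom = Counter((item["vendor"], item["sku"]) for item in design)

for (vendor, sku), qty in sorted(bom.items()):
    print(f"{vendor:8} {sku:12} x{qty}")
```

At real scale the design has tens of thousands of line items and each vendor expects a different catalog format, which is exactly the data heterogeneity problem Kris describes next.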

Kris Beevers [00:10:26]: Incredible logistics problems. Let's just make it a little worse. Every one of those components is sold by a different vendor. And one of the really interesting challenges we see is that every one of these vendors operates differently. They SKU differently, they expose their catalogs differently, they expose lifecycle data about their catalogs differently. If I want to buy a GPU server from Dell over here, and I need some fiber optic cabling from Corning over there, and I need a switch from Arista over here, I've got three different procurement problems to solve and three different kinds of data I need to figure out how to work with. And then let's compound it one more time, and then I'll stop. The other big problem is that there's so much demand in this space, and the challenges are so hard, that the componentry vendors are iterating on their own offerings incredibly fast to meet the needs of this space.

Kris Beevers [00:11:24]: And so the ground is shifting under the feet of the folks seeking to build these data centers: the kinds of switches they can buy, or the kinds of servers. Every few months Nvidia has a new architecture, right? And that nullifies all the other stuff that attached to it, and you have to rethink all of this. So it's a really hard problem space at the moment.

Demetrios Brinkmann [00:11:46]: And I imagine, when I want to learn something, I usually go to YouTube and search how to do X, or I'll read some articles and blog posts about how to do it and kind of immerse myself in it that way. I get the feeling you don't have these engineers going around making YouTube videos on how to put together the GPUs and the data centers.

Kris Beevers [00:12:15]: This is not a space like, say, software development, where there's a vast array of humans and some people are inclined to tinker and share and that kind of thing. In fact, I would say there are roughly a few hundred people in the world who know how to build this kind of infrastructure at this sort of speed and scale. And they all know each other, of course; they mostly live in a five block radius in SoMa in California or something like that. They probably all have drinks with each other every Thursday, but they're too busy to be sitting down and recording YouTube videos. These are the kinds of folks who, honestly, on Christmas Day are operating infrastructure and procuring stuff. That's the pace at which this space is moving right now, and again, the ferocity with which these folks are operating. So they're figuring this out on the fly.

Kris Beevers [00:13:12]: They're sharing with each other within their community. I just don't think there's been enough duration for the learnings to propagate. And because of the rate of change in this industry, the learnings are going stale really fast too. So this is not an easy space to dive into.

Demetrios Brinkmann [00:13:36]: So how then, when something is updated and you have better throughput with one provider, does that knowledge get propagated? Or not the knowledge, necessarily, but how do you keep tabs on it when you're building out your data center and you recognize: whoa, there's actually a better way of doing it, so all of the inventory I was going to order, let's upgrade to this, but that means these five things downstream are going to be affected too? How do you know the second and third order repercussions, I guess?

Kris Beevers [00:14:16]: It's such a good question. Let's break this problem up a little bit, because I think there are a few different answers. First, the teams that have started to figure out the answers to this question are the teams that have been at this a while, and by a while I mean more than a year. The teams building real generative AI infrastructure at 100-plus megawatt scale, you can count mostly on a couple of hands. These are teams that are practiced at lifecycle management of their infrastructure. And because of the rate of change and the rate of turnover in these environments, what this roughly looks like is that it's not "let's rip and replace all that stuff we built 18 months ago."

Kris Beevers [00:15:09]: It's "we're building 40 more data centers this year, so the new data centers are going to get the new architecture." And that new architecture is turning over every four to six months, roughly, with a lot of alignment and similarity; there's a lot of convergence on some of the basics. So I think that's roughly what it looks like. And then the other segment here, and this is the honest reality, is that probably about a hundred neoclouds have popped up in the last 12, 18, 24 months.

Kris Beevers [00:15:45]: And these folks have not tackled this problem yet. They're still building greenfield, their first or second generation of data center infrastructure. In the totally different world of enterprise, where infrastructure has lived for decades, we think and talk all the time about lifecycle management, end of life, end of support. When is it time for us to take out that old switch and replace it with the next generation? How do we think about a network refresh? Teams in AI infrastructure, and this is not an absolute, but as a rule of thumb, haven't gotten to lifecycle management yet. So this is a problem that is coming, and not a problem these teams generally have a great set of answers for today.

Demetrios Brinkmann [00:16:38]: Well, if you have a new data center with all the newest and best components, but you also have the second and third generation data centers, what does the demand you've been seeing look like for those older generations and the GPU usage they can provide? Is everyone just skipping to the newest one, or is there just as much demand as far down, as many generations back, as it can go?

Kris Beevers [00:17:17]: One thing I won't pretend to be is an actual operator of AI data centers. Only they know their real business models and what the demand looks like across their different kinds of footprints. But my perception is that the demand here is deep. What you get with a current generation Nvidia Superpod, with, call it, GB300 architecture or something like that, is efficiency relative to prior generations in a bunch of dimensions: power efficiency, cooling efficiency, and so on. So really that's what we're trading off.

Kris Beevers [00:18:03]: There's a ton of demand still being served by older architectures, because they're available and they're online and they work. But as we transition toward newer generation architectures, we're serving more demand per watt, for example. I think watt is really the right measure.

Demetrios Brinkmann [00:18:23]: I do know that some friends of mine did some calculations on their usage, and it was cheaper for them to use the newer versions even though they're more expensive per hour. They could get more compute and get things done faster, so they didn't have to use as many hours.

Kris Beevers [00:18:43]: Yeah, I think that's roughly a simple and good way to think about it. Now put yourself, though, in the shoes of a large neocloud operator that has made billions of dollars of investment in last year's generation of infrastructure. I think this is a general question in the market today that we don't have great answers to: how long is the useful life of that infrastructure? We haven't reached it yet. This whole push has really only been going on for 24 months or thereabouts. So there are a lot of learnings we're going to have over the next two, three, four years about what happens to the three- or four-generation-old infrastructure. Do the economics work to rip and replace it? Do we keep it online, serving less critical workloads, or workloads that are more amenable to the lack of efficiency there? What do we do? We don't know yet, I don't think.

Demetrios Brinkmann [00:19:57]: Yeah. So now can you take me through this stack again, from the tangible stuff you can touch all the way to the more intangible software? How do you keep track of all of it, know what needs to be done where, know how much of this we have and whether it's enough, and just be able to keep the trains running on time?

Kris Beevers [00:20:28]: So much about this, spanning, as you put it really well, the tangible all the way up to the actual outcomes, which are roughly token generation, is about the data through-thread: what are all of the things that make up our ability to produce this outcome, how do they interact with each other, and how do they interlock? If we rewind a little bit, we talked about the earliest phases of these AI infrastructure builds being: we've procured some acreage, roughly, and we've got a power commitment from somebody. From that moment forth, the teams that are doing this best are managing from intent. They're designing, and they're carrying the data of their design all the way through to token generation. What I mean by that is: okay, we've got 200 acres of space and 300 megawatts of capacity. What's the shape of the building we're going to build? How many racks can fit in it? What do the thermal calculations say about that? What kind of power and cooling density can we achieve? What sort of inference or training infrastructure can align with those parameters? What Nvidia gear are we going to buy, or what switching footprint? All of that ultimately needs to end up in a data model, and this is what NetBox is, that captures the inventory.

Kris Beevers [00:22:14]: What are all the components, and the interrelationships between that inventory? On this interface on this switch, we're plugging in this cable with this length, and it's going to route in this cable run to this server, in this rack, in this geometry in the data center, and all of that has to flow all the way back to power and cooling and so on. All of it has to flow from design intent: this design is going to meet our parameters for thermals and cooling characteristics and power consumption and token generation. What happens if you don't do that is that at some point you have so much logistical complexity that you can't keep track of all this stuff. We see this when teams start with spreadsheets. The canonical way would be: okay, let's do some power calculations, and all right, we need 18,372 cables and they all have these different lengths. At some point that spreadsheet gets messed up and you order the wrong thing. A fun anecdote for me, from a couple of months ago, spending time with one of the largest producers of fiber optic cabling in the world: their number one business problem? Returns. Everybody orders the wrong cable lengths. And why does that happen? Bad management of design data.

Kris Beevers [00:23:43]: Incorrect cable calculations that are maybe not taking into account things like cable bend radius, or obstructions in the physical facility. So data really has to be the through-thread, from as early as "I've got some space and some power" all the way through to "this is running." And then one other dimension of this that I think is important to pull apart: we're talking about AI infrastructure as a sort of monolithic concept, but there are different goals for different kinds of infrastructure. We see a range from "I'm building a gigantic 500 megawatt facility to do at-scale training for a research lab," single tenant infrastructure, all the way to things like what I call retail neocloud infrastructure: we're going to let people come and swipe a credit card, rent some GPU capacity for a few hours, and then return it to the pool, so a lot of ephemerality. And then over to what I usually think of as AI factory infrastructure: maybe pre-canned builds, even things in shipping-container-sized prefab pods that might get shipped off to some enterprise, a bank or whatever, who wants to buy some AI and doesn't have the capacity themselves. So we also have to consider: what's the end use case? Who's going to consume this, and how? That has to feed into the design as well.

Demetrios Brinkmann [00:25:23]: Oh man. I'm just trying to grok this idea of thousands of different cables and laying them out across a gigantic data center. And I know that when I chatted with Andy from Vast, he was saying that a lot of times folks will break the data centers up into four or eight different grids, so that if for some reason you lose some power, hopefully you don't lose power in the whole thing and you're able to salvage what you can.

Kris Beevers [00:26:00]: What you've just touched on there is yet another kind of outcome constraint: a resiliency or redundancy constraint on power. And we see the same thing on connectivity, and on cooling, and so on. So really, what we've just described is a gigantic math problem, where there are all these constraints on one end and all these inputs on the other end. Right now a mix of expert humans and software, like NetBox or other CAD tools, comes together, and out the other end pops a design that, to a best guess, meets those constraints.

Demetrios Brinkmann [00:26:48]: And you said something about the bend constraint that some cables have. I imagine that's just because you can't put a 90 degree angle in it; like a hose, the water doesn't come out.

Kris Beevers [00:27:00]: Yeah, exactly like that. Light is traveling through these fiber cables, and if you bend one 90 degrees, the light doesn't make that bend. So there's a bend radius you have to take into account. All kinds of fun math in these data centers.
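The bend radius check is a simple rule in code. A common rule of thumb for fiber is a minimum bend radius of roughly 10x the cable's outer diameter when unloaded, but real limits vary by cable type, so treat the multiplier here as an assumption and consult the vendor datasheet:

```python
# Rule-of-thumb bend radius validation for a fiber route. The 10x
# multiplier is an assumed default, not a universal spec.

def min_bend_radius_mm(cable_diameter_mm: float, multiplier: float = 10.0) -> float:
    """Minimum allowed bend radius for a cable of the given diameter."""
    return cable_diameter_mm * multiplier

def route_ok(bend_radii_mm: list[float], cable_diameter_mm: float) -> bool:
    """A route is valid only if every bend is gentler than the minimum radius."""
    limit = min_bend_radius_mm(cable_diameter_mm)
    return all(r >= limit for r in bend_radii_mm)

# A 3 mm patch cable: minimum bend radius of ~30 mm under this rule.
print(route_ok([50.0, 40.0, 35.0], 3.0))  # every bend gentler than 30 mm
print(route_ok([50.0, 25.0], 3.0))        # one 25 mm bend is too tight
```

A design tool doing cable-length calculations would run a check like this on every planned cable run, alongside slack and obstruction allowances.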

Demetrios Brinkmann [00:27:17]: I think I heard something about Nvidia giving a lot of suggestions on how to set up the data centers, or that they have standards?

Kris Beevers [00:27:28]: Well, this is such a great topic for us to broach, because I think one of the drivers of speed in these hyper-complex, many-dimensional constraint satisfaction and logistics problems is going to be data standardization. What you're referring to here are blueprints, or reference architectures, that Nvidia provides: here is our blueprint for an H100 superpod or something like that, and you need power in this configuration, cabling that looks like this, this kind of switching footprint, you need to plug things in this way, with these sorts of cooling characteristics. Think of it as an instruction set for how to build a working pod to their blueprint, with all of the physical and logical elements pre-considered for you, so that if you are not an expert at designing these things from scratch, you can say: okay, I want to implement that blueprint; here's everything I need to buy. It's almost like the LEGO brick instructions for how to get a working pod, in Nvidia's definition. And now extend this. We see this from fiber optic cabling providers.

Kris Beevers [00:28:54]: They have blueprints, or reference architectures. We see it from data center network fabric vendors: high density InfiniBand fabric blueprints and so on. The way these blueprints manifest today is mostly PDFs. So if you're going to build a big, huge 300 megawatt data center, what are you doing? First you're reading a lot of PDFs that are very technical, and then you're needing to interlock them with each other. You're needing to take this fabric design and this compute design and this cabling design, et cetera, and figure out: how do I marry these things up? We're not there yet in our ecosystem, or generally, today, but what we are working on is how to start to create programmatic representations of not just these reference designs, but also the componentry that goes into them.

Kris Beevers [00:29:54]: This is one of the things NetBox has really brought to the space: the notion of a de facto industry standard way of modeling all of this information, which the vendors, the data center operators, the tooling companies building software used by these teams, really everybody, can count on and align to. Here is how we're going to share data about cable lengths and cable bends and cable geometry and cable types. Here's how we're going to share information about switches and servers and routers and firewalls and storage devices and so on: physical dimensions, logical dimensions, configuration details, interfaces. If we can come up with a programmatic language for that information to be shared, then we can start to automate much faster. These teams can start to automate the design process by consuming vendor data in canonical forms. This is one of the areas we're working on.
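What a canonical, programmatic device definition buys you can be sketched in a few lines. The field names below are modeled loosely on the kind of device-type data NetBox's community device-type library shares (manufacturer, model, rack-unit height, interface list); the specific model and port type are illustrative:

```python
# Sketch of "programmatic vendor data": a device-type record in a
# canonical schema that design tooling can consume directly, instead
# of a human reading a PDF datasheet. Values are illustrative.

device_type = {
    "manufacturer": "Arista",
    "model": "7800R3",
    "u_height": 4,                  # rack units consumed by the chassis
    "is_full_depth": True,
    "interfaces": [
        {"name": f"Ethernet{i}", "type": "400gbase-x-osfp"}
        for i in range(1, 37)       # 36 front-panel ports
    ],
}

def port_count(dt: dict, port_type: str) -> int:
    """How many ports of a given type a design can count on from this device."""
    return sum(1 for intf in dt["interfaces"] if intf["type"] == port_type)

print(port_count(device_type, "400gbase-x-osfp"))
```

Once every vendor exposes data in a shared shape like this, the design tool can compute fabric port budgets, rack space, and cabling counts without anyone re-keying datasheets.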

Kris Beevers [00:30:57]: And this all goes back to what we talked about earlier: the logistics bottleneck, and how we solve for it. The procurement bottleneck of "I've got stuff I have to get from 100 different vendors, and it's all going to arrive, and it all has to work together." Right now that's a very human problem, and humans are the bottleneck in that scenario. If we can programmatize and automate those logistics, and standardize data as an industry, we're going to move a lot faster.

Demetrios Brinkmann [00:31:32]: Well, it does feel like that's a perfect use case for automation: knowing that you have all these constraints and you have your inventory, how can we make sure that what we have matches our constraints? I was also thinking, as you were talking, about those 200 people in the world who understand how to do this and have been doing it for a while, trying to figure out the best way to do it. They've got some tried and true, tested tricks, or they're pushing the boundaries. I remember reading a blog post back in the day from the Meta engineering team about how they had two different clusters of 24,000 GPUs: one was using traditional InfiniBand, but in the other they were doing something different to try to get a little bit more speed out of it. Ultimately it was kind of the same, I think, but there's going to be that R&D happening too. In a way, what I was reminded of is that it kind of feels like you could have some flight simulator program, or something akin to that, where you get to go in and say: what if we did this, and tweaked these things? Maybe you would get a simulation, and then before going and buying all that inventory, or before spending all those hours, you can at least sanity check it.

Kris Beevers [00:33:14]: There's a term you'll often hear in infrastructure management, and in all other kinds of industries, and the term is digital twin. That's roughly what you're describing: we want to create a digital twin of the eventual mega-scale infrastructure we're going to build, so we can pressure test it. And we already talked about a few of the really simple ways people are doing that kind of thing today. You don't really want to go build a 300 megawatt data center if you haven't, through your digital twinning and design process, proven to yourself that your design meets power redundancy constraints, for example; you wouldn't be able to meet your SLAs to your end customers without considering those characteristics in your design. Where we start to run into constraints is the depth, the efficacy, the fidelity, I guess, with which we can digitally twin these environments, partly because there are not standard ways of sharing data about the characteristics of the components and how they're going to work in practice. So this is a key part of the design process that we ultimately want to make higher fidelity, by having more current, more aligned, and more granular data about everything that goes into these environments, so that we can pressure test in many different ways and be confident that when we do go buy a billion dollars' worth of stuff to deploy this data center, we're going to meet the design constraints we've laid out for ourselves.
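The power redundancy pressure test Kris mentions is one of the simplest digital twin checks: simulate losing each power feed and verify the remaining capacity still covers the load. A minimal sketch, with illustrative capacities:

```python
# Sketch of an N-1 power redundancy pressure test over a design's
# digital twin: fail each feed in turn and check the survivors can
# still carry the load. All capacities below are made up.

def survives_single_feed_loss(feed_capacities_kw: list[float],
                              load_kw: float) -> bool:
    """True if the load is served under the loss of any one feed."""
    for lost in range(len(feed_capacities_kw)):
        remaining = sum(c for i, c in enumerate(feed_capacities_kw) if i != lost)
        if remaining < load_kw:
            return False
    return True

feeds = [150_000.0, 150_000.0, 100_000.0]   # three independent feeds, 400 MW total

print(survives_single_feed_loss(feeds, 240_000.0))  # worst case leaves 250 MW: True
print(survives_single_feed_loss(feeds, 260_000.0))  # worst case leaves 250 MW: False
```

Higher-fidelity twins extend the same idea across cooling loops, fabric links, and so on, which is why the granular component data Kris describes matters: the simulation is only as good as the data feeding it.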

Demetrios Brinkmann [00:35:05]: And that fidelity isn't there because it's still so new?

Kris Beevers [00:35:12]: That fidelity can be there today, with a lot of hard manual labor. And again, it goes back to reading PDFs, the design specs for individual servers or switches or cables or whatever. That's where that information resides today. Where it doesn't reside, generally, without a lot of hard work by these teams, is in a common data model like NetBox, where you can say: I'm going to design this data center, it's going to have this layout, these physical characteristics, this cabling architecture, this network configuration, this switching model, these server models, et cetera. The data about all of those components that feeds the, call it, evaluation of that design in a programmatic sense is really hard to get at today. It's sitting in PDFs, right? And look, we know where that's going. Data in PDFs is not as mysterious as it used to be. Which is circular to our conversation, right? Ironic. But what we need to arrive at is a way for vendors to expose that data programmatically, so that the people designing these massive infrastructures, as new componentry is being designed every few weeks for their new needs and new constraints, can slurp up that data, pressure test new designs, drive effective procurement workflows, and compress this time. Nvidia has a term that they use, time to first train, and that's really the North Star metric that we're talking about, often.

Kris Beevers [00:37:03]: You know, we have design constraints we want to meet. We want to compress the time from "I've got space and power" to "I'm training my model in my now-online infrastructure." One of the biggest constraints to that is all of these data logistics, and all of the pressure testing of the design through digital twinning and effective, aligned data in a data model like NetBox. So that's what we're working on.
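
The NetBox community's devicetype-library already takes a step in this direction, publishing device-type definitions as structured YAML rather than PDFs. As a rough illustration of why that matters, here is a sketch of how machine-readable component data can feed a programmatic design check; the records, field names, and numbers below are hypothetical, not real vendor specs or the devicetype-library schema:

```python
# Hypothetical, simplified device-type records: the kind of data that
# today often lives only in vendor PDF spec sheets.
DEVICE_TYPES = {
    "vendor-gpu-server-8x": {"u_height": 4, "power_draw_kw": 10.2, "ports_400g": 8},
    "vendor-leaf-switch-64p": {"u_height": 2, "power_draw_kw": 1.8, "ports_400g": 64},
}

def rack_fits(models: list, rack_units: int = 42, power_budget_kw: float = 40.0) -> bool:
    """Programmatic pressure test: do these devices fit within the
    rack's space and power budget?"""
    u = sum(DEVICE_TYPES[m]["u_height"] for m in models)
    kw = sum(DEVICE_TYPES[m]["power_draw_kw"] for m in models)
    return u <= rack_units and kw <= power_budget_kw

bom = ["vendor-gpu-server-8x"] * 3 + ["vendor-leaf-switch-64p"]
print(rack_fits(bom))  # True: 14 U and 32.4 kW fit the budget
```

Once vendors expose this data programmatically, the same bill of materials can drive design evaluation, procurement, and deployment instead of being re-keyed by hand from spec sheets.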

Demetrios Brinkmann [00:37:32]: Well, I can just imagine the amount of time it takes just to plug everything in, and that's just one piece of it. You're probably in the home stretch when you start plugging things in.

Kris Beevers [00:37:46]: Well, now let's talk about that, because the act of racking, stacking, cabling, plugging things in, that act is an emerging constraint. And that constraint is going to get solved in the way we all think it is, pretty soon, which is robots, roughly. And AI is going to help us with that as well. So what do you need for a robot to walk into a gigantic data center and know what to rack where, how to plug things in, how to run cables, and all of that kind of stuff? You need accurate data for that robot to act on. And so that's another thing we find ourselves thinking about, or having conversations with all these operators about, constantly: how am I going to give that robot accurate, what I call field operations instructions? Right now, today, even a human, a data center technician, someone we're asking to walk into the data center, go down this cold aisle, walk around to the front of this rack, unplug this thing from this interface, and plug it in over there, that person isn't an architect. They often don't have all the information and all the knowledge to deduce exactly what to do without really clear instructions.

Kris Beevers [00:39:09]: Often we'll call these instructions a cut sheet, or a standard operating procedure, or something like that. Now think about how precise that cut sheet or those operating procedures need to be if the thing we're instructing is no longer a human technician but an Optimus robot, or something like it, walking through the data hall and taking actions. That's happening today, for what it's worth, in certain large hyperscale facilities. So that's another constraint that we're going to start to compress: the racking and stacking, the day-zero field operations, and then also the ongoing field operations, because things break all the time or need to be re-racked or reconfigured.
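
The cut sheet idea is easy to sketch: if cabling lives in a structured source of truth, the step-by-step field instructions can be generated rather than written by hand. The records and field names below are illustrative, not a real NetBox cable schema:

```python
# Hypothetical cable records, the kind of data a source of truth holds.
cables = [
    {"a_device": "gpu-01", "a_port": "eth1",
     "b_device": "leaf-01", "b_port": "Ethernet1/1", "color": "red"},
    {"a_device": "gpu-01", "a_port": "eth2",
     "b_device": "leaf-02", "b_port": "Ethernet1/1", "color": "blue"},
]

def cut_sheet(cables):
    """Render unambiguous, ordered instructions that a technician, or
    eventually a robot, can follow in the data hall."""
    steps = []
    for i, c in enumerate(cables, 1):
        steps.append(
            f"Step {i}: run a {c['color']} cable from {c['a_device']} port "
            f"{c['a_port']} to {c['b_device']} port {c['b_port']}."
        )
    return "\n".join(steps)

print(cut_sheet(cables))
```

The point is less the formatting than the direction of flow: the same structured data serves humans today and, with tighter precision, machines tomorrow.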

Demetrios Brinkmann [00:40:00]: I recorded a few musical albums back in the day, at a studio owned by a guy who absolutely adored complexity, in the sense that there were wires everywhere. He was really into analog gear, so things were going through here and there, and I didn't understand half of it. But I spent enough time there, I recorded like three albums at that studio, that he eventually just gave me the keys and said, record when you want. But nine times out of ten, when I would go to record on my own, I was troubleshooting for the whole session because I couldn't figure out what the hell was going on with the complexity he had set up. And this was in a studio the size of the room I'm in right now. I can't imagine a data center the length of five football fields.

Kris Beevers [00:41:06]: And that's such a great microcosm example, because what you needed was something written down somewhere that you could refer to: this red cable here, plug it in over here if you want this outcome. If we distill it down, that's actually all NetBox really is. It is that documentation in a programmatic form that gets used to effect an outcome. Whether it's a physical outcome, plug this cable in here, or deploy this rack with this much space around it for thermal reasons, or a logical outcome, assign this IP address to this server so that it can be exposed to this end customer in this way. That's roughly all it is.

Kris Beevers [00:41:58]: And I think that's the simplest and biggest takeaway here. A huge amount of complexity, a huge amount of chaos, a huge amount of demand, a huge amount of speed, and what that requires is good data. No surprise, right? When you can operate with good data, whether it's at studio scale or at mega scale, you eliminate those inefficiencies. You would have been recording in three minutes instead of using all your time figuring out analog cabling.

Demetrios Brinkmann [00:42:27]: So many long nights trying to figure out why the hell I can't hear anything. Everything's working but no sound is coming out. What is going on?

Kris Beevers [00:42:37]: Oh man.

Demetrios Brinkmann [00:42:38]: So, yeah. It's almost like the SOPs of the data centers need to be so clear and crisp and precise, because you have so many more variables involved. And you're trying to eliminate constraints in every single move, because speed is king here. Royalty, really. And you're saying, could we throw robots at this, or could we make this programmatic? Could we try the next generation of InfiniBand? There's always something you're trying to iterate on. I can see how something like NetBox is going to be immensely valuable, but also how people are probably pulling their hair out left and right. It's amazing that any data centers are actually online right now, now that I think about it.

Kris Beevers [00:43:40]: It's such a fascinating space. And, many minutes into our conversation, one thing I'll admit: I'm not a data center engineer. I have no idea how any of this stuff really works. What we see is these teams that are the best in the world at building these infrastructures, and how they are doing it, in this really interesting cross-cutting view, because they all are using NetBox to solve this data problem. And I think you're right. There's a reason why there are only a few hundred people in the world who are really pros at this. It's a really hard problem to solve. It's really dynamic. And I think there are some really interesting analogies, by the way, to the early days of the internet.

Kris Beevers [00:44:18]: You know, 20 years ago you could get in a room with roughly the 200 people who ran the internet. They all knew each other, they knew how it worked, they talked to each other. It was evolving at a really fast clip. That's what's happening in AI infrastructure right now. This knowledge will propagate, and how you do this stuff will start to normalize over time. What we don't know yet is whether the diversity we're seeing in the ecosystem will persist. Will there continue to be hundreds and hundreds of operators of AI infrastructure? Will we see more consolidation? There are real economies of scale in this space, toward the ultra hyperscalers.

Kris Beevers [00:44:59]: I think that's also going to shake out over the next couple of years.

Demetrios Brinkmann [00:45:03]: It is absolutely intriguing how, until a few years ago, there were three clouds and that was really it. And now there's been this Cambrian explosion, so much so that we actually just released a GPU guide on the whole ecosystem, because I just kept seeing all of these funding announcements, and the majority of the ones raising a whole boatload of money were very low-level infrastructure plays. And it was like, well, what are the value props that each one of these folks has? I don't understand the difference between a Modal and a Baseten or a Fireworks or a Together. And then, are they even going and building their own data centers? Not really. They're kind of just renting from other people's data centers that are already out there. Then you can go a little lower, to the Lambdas, and they are building their own data centers. So I was intrigued by all that, which led me down the rabbit hole of, okay, what are the value props here? When would I want to use one versus the other?

Kris Beevers [00:46:18]: And I think you used a great term there, Cambrian explosion. We don't know the answers yet about the actual value props that are going to stick in the long term, and that's why this Cambrian explosion is happening. There's an exploration happening in the market right now. It could be that all these different business models make sense and all end up persisting, and we retain this diversity of kinds of infrastructure providers to meet these ultimate value propositions. Whether it's a really lightweight, ephemeral ability to pop up new infrastructure quickly on top of existing colocation footprints, or mega single-tenant, purpose-built facilities with specialized cooling and power characteristics, or data centers in space, who knows? We're seeing an exploration happen right now.

Kris Beevers [00:47:19]: There are some value props that I do think are going to continue to sustain a diversity of infrastructure types and providers, at least for some time. One is sovereignty, and this is an interesting one. If you think about how important AI has become to the world in such a compressed time, what we're seeing is countries, nations, saying: this is strategically important, we need sovereign AI infrastructure, we can't be dependent on AI infrastructure sitting in the United States, for example. And you're not going to solve sovereignty with big monolithic hyperscale infrastructures. We're seeing sovereign AI infrastructures pop up all over the place.

Kris Beevers [00:48:09]: Now there are companies that specialize in building sovereign AI infrastructure, so that may be a specialization that persists. Another kind of specialization we already touched on a little is the spectrum from single-tenant, mega-scale training footprints for research labs all the way to slice-and-dice retail-type infrastructures. These are very different end use cases and value propositions that may not all be best served by the same kind of company or architecture or operationalization. And then one more that we see often, akin to the sovereignty one: certain kinds of enterprises really want to own their own AI infrastructure. We see this a lot in financial services. We see it a lot in healthcare, and in particular in pharma, where high-performance compute is really valuable to have for drug discovery and things like that. We see it a lot in government, obviously. And one of the most interesting places we see it is in energy.

Kris Beevers [00:49:26]: If you're generating gigawatts of power, it's a pretty natural extension of your business to put some GPUs next to that power. But what all of those kinds of organizations are not is hyperscale data center builders and operators. So when I think of the term AI factory, which we hear a lot, it's mostly about that sort of value proposition: how do I give a financial services firm, a pharma company, a government, or a utility an AI data center without them having to develop all of this expertise themselves? So there's a bunch of different business models being explored right now, and they aren't all obviously served by the same converged hyperscale company. For a while, I think we're going to continue to have this diversity.

Demetrios Brinkmann [00:50:24]: Well, what you said earlier about a shipping container, basically a data center in a box.

Kris Beevers [00:50:31]: Yep.

Demetrios Brinkmann [00:50:32]: Is new to me. I had never heard about that.

Kris Beevers [00:50:35]: It's actually something that has existed for some time, even back when we all talked about cloud as the next big thing, and that ship has sailed a little bit; we've all consumed cloud, and now there's some repatriation and so on. But deploying private clouds in shipping containers, think of them as prefab cloud infrastructures or edge infrastructures, is a pretty well-established concept. So maybe it's no surprise that the same methodology is playing out in AI and GPU infrastructure as well, for organizations that need to deploy not hyperscale, but owned, single-tenant footprints.

Demetrios Brinkmann [00:51:18]: And it's like the classic quote from Amazon, right? Stick to what makes your beer taste good. If you're a financial services company, building data centers is probably not your bread and butter. So how can you reap the benefits of all of this without having to become a data center expert? Is it that you just hire one of these 200 people, or are companies now popping up to service that demand?

Kris Beevers [00:51:57]: That's exactly it. There are companies we're seeing, I call them AI factory companies, roughly, whose value prop is exactly that: I'm going to deliver an AI data center to meet the needs of this bank, or whatever it is. And there are spaces that have existed forever in what we call systems integration. The big systems integrators, who have forever solved the problems of building IT infrastructure or OT infrastructure for enterprises, all now have AI factory or high-performance computing practices. These consolidate some of that expertise and, repeatedly for their customers, are able to procure all the equipment, solve all these problems we've just talked about, and pump out AI data centers to spec. So that's a business model that I think will persist.

Demetrios Brinkmann [00:52:59]: Now, when you're looking up the stack, how high do you go, in terms of the maintenance piece, or if we need to do updates to the GPUs? I know that can be a headache in itself, because every second a GPU is offline, you're burning cash. Are you also looking at that, or is there a level where you say, all right, cool, we made it to here and we kind of stop?

Kris Beevers [00:53:31]: We have spent a lot of time today talking about the build process, time to first train, to use Nvidia's term. I would say at least as important are the operational processes once this infrastructure is up and running and living. Then that blade in that rack died, what do we do? Or we need to do software upgrades on these 18 racks, something like that. Here again, the data is critical. All of these systems, in real time, are interacting according to some configuration.

Kris Beevers [00:54:13]: This server has this IP address on this interface. It's connected to this switch, it's running this software, it's allocated to this tenant or this end customer. All of that data context is equally important once you're up and running, for things like: I want to drain this infrastructure so I can do a bunch of field operations on it, or reconfigure it. And I would also say the pace in an operational state is another order of magnitude faster. Here we're talking about operations, or, sorry, automation; things can happen that are not driven from design when you're operating infrastructure.

Kris Beevers [00:55:01]: A backhoe can cut a piece of fiber, and that just happened; now everything that has to happen to flow around that must be automated. There's no way to deal with that situation in a satisfying way without automation. And so automation demands good data.

Demetrios Brinkmann [00:55:19]: And I just keep coming back to: there are so many ways this can fail. From human error to power outages, or you blow a fuse and now you have to troubleshoot.

Kris Beevers [00:55:39]: You now have to troubleshoot. And you're right, there are so many ways this can fail. This is part of why, as early as the design phase, we need to define our reliability or resiliency constraints and design for failure. Everything always fails in infrastructure; that's just a truism. At some point a backhoe will cut that piece of fiber, or that generator will blow up, or something will happen. So we design for those failures with appropriate redundancy and so on, but then we have to resolve those failures. And this is what you're getting at: we need the data and the visibility to understand what that thing is, what it's connected to, what is impacted by this, and what our options are for doing something about it.

Kris Beevers [00:56:30]: A model like NetBox encodes all of the logical and physical information. To understand that stuff, you need to couple it with tools that observe the behavior of these systems in real time. We call these observability tools: tools for understanding the flow of network traffic, or thermal characteristics in the data center, actual heat sensors on all the racks, or power load characteristics, or anything else you can imagine we want visibility into. We couple that data with the context of what we've built and how it is supposed to be working, so that we can ask: is the design intent being met, based on our observations, based on what we can sense in this operational environment? And if not, what are we going to do about it? That's when we get humans to work: these switches are behaving differently than we expect per our design; humans, can you look at all the data and decide what we should do about this? And then increasingly, of course, we hope that AI operations closes that loop as well. AI agent, this seems to be behaving differently than we expect; what do you diagnose as the issue, and what would you do about it?

Demetrios Brinkmann [00:57:45]: Well, this is fascinating, because when you're talking about observability, you're referencing the physical, the actual things you can touch. Whenever I think about observability, my mind instantly goes to a Datadog type of tool and solution.

Kris Beevers [00:58:07]: And that matters as well, right? Because again, think about the value proposition in these infrastructures. Ultimately, tokens are being generated, roughly. And so you need to observe what I think of as the logical: the flow of network traffic, or whether the server is up or down. That's a part of the process of generating those tokens. But the logical parts of that process are underwritten by the physical parts. There are servers, and there is power, and there are fiber optic cables, and so on, that make up the ability to run the software that's generating those tokens. So it's not enough just to observe things at the logical level, especially if you want to diagnose why that IP address suddenly went offline. It could be because somebody cut the cable, and we need to be able to diagnose down to that level to remediate the issue and get that IP address back online.

Kris Beevers [00:59:07]: Right. So we couple our observability, the physical and the logical, with our context about how this stuff is supposed to be working, and marry all of that up. What you figure out is the delta between our observations, what's really happening, and what our intent or design says. And that delta is the problem we have to go fix.
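
That "delta" computation, often called drift detection, is simple to sketch once intent and observation are both structured data. The records below are illustrative placeholders, not a real NetBox or monitoring schema:

```python
# Intent (from the design / source of truth) vs. observation (from
# real-time monitoring), keyed by device:interface.
intended = {
    "gpu-01:eth1": "leaf-01:Ethernet1/1",
    "gpu-02:eth1": "leaf-01:Ethernet1/2",
}
observed = {
    "gpu-01:eth1": "leaf-01:Ethernet1/1",
    "gpu-02:eth1": None,  # link is down, perhaps a cut cable
}

def drift(intended, observed):
    """The delta between design intent and observed state: everything
    returned here is work for a human, or an AI agent, to act on."""
    return {
        key: {"intended": want, "observed": observed.get(key)}
        for key, want in intended.items()
        if observed.get(key) != want
    }

print(drift(intended, observed))
# {'gpu-02:eth1': {'intended': 'leaf-01:Ethernet1/2', 'observed': None}}
```

In practice the intent side would come from a source of truth like NetBox and the observed side from telemetry, but the shape of the problem is the same: diff the two, and the diff is the work queue.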

Demetrios Brinkmann [00:59:30]: Man, this is so cool to learn about. There is so much involved, and it is a very high-value problem because there's so much on the line. And as we know, GPU time is measured in seconds, so you need to keep them online at all costs.

Kris Beevers [00:59:58]: And this is the operational problem we're talking about. I 100% agree: you need to keep them online at all costs. This is why spaces like observability are such high-value spaces, because the outcome when something fails is deeply negative. The other value equation, going back to the conversation we had earlier, is the speed to get these infrastructures operational and delivering value. The demand is so high, and the expectations to meet that demand are so high, that delays in speed and delays in logistics, when those constraints become bottlenecks and create drag in bringing infrastructure operational, have a really negative impact on value for the world, and certainly for the companies building this infrastructure.

Kris Beevers [01:01:01]: So, constraints both pre-operations and post-operations, on speed, resiliency, scale, effectiveness, that kind of thing.

Demetrios Brinkmann [01:01:08]: What are you seeing as far as the average time it takes to stand up a data center?

Kris Beevers [01:01:16]: It's compressing really, really fast now. If you had asked me that same question five years ago, I would have said a typical data center is at least a two-year build. Today, I would say: I had a week in November with calls with seven or eight of our customers building AI infrastructure, and every single one of them said, listen, we're going to be 10x-ing our infrastructure over the next six months, and we need your help in these ways, or we need to make sure we're ready in way X, Y, or Z. Just think about that math. These are organizations that have large-scale data centers online today, and they're saying it's 10x-ing in the next six months. They're building a data center a week, roughly. And let's be clear, they're not starting from scratch and getting it online in a week.

Kris Beevers [01:02:20]: These builds are happening in parallel. It is a complex project management and logistics management problem to line all these things up, to pump out that many data centers that quickly. But that's roughly the pace we're seeing now.

Demetrios Brinkmann [01:02:38]: Where do you feel there are pieces missing right now, whether it's in the build or in the operation parts of the data center?

Kris Beevers [01:02:51]: Yeah, where can we improve? I think the integration of all these things we've talked about is a really tough challenge. It's way tougher than it should be today. We delved into some of this when talking about the standardization of, say, component data from networking, server, GPU, and cabling vendors. The more we can standardize how vendors share that information, the more we can compress the time to operations, the time to first train. So that's one example of the integration problem.

Kris Beevers [01:03:30]: Another example we talked about: once these infrastructures are online, there are all these systems that these teams use to observe how they're behaving, and to integrate that with how they're supposed to be behaving, and to integrate that with driving change, scaling, supporting new kinds of workloads, and so on. Right now these teams are having to cobble too much of that together, glue too much of it together on their own. So I think we will see tools like NetBox, observability capabilities, drift detection capabilities, and things like that glued together more tightly, forming a cohesive approach for these teams, from as early as design intent all the way through to "this thing is running and I'm changing it," without them having to integrate many different components or data sets themselves.

Demetrios Brinkmann [01:04:43]: Is it going to be that the different providers start to integrate more tightly, or is it going to be something where you see NetBox as the one single pane of glass that you can look through?

Kris Beevers [01:04:56]: My general point of view is that it's going to be both, roughly. One of the things we think about is that there will be platform consolidation in this space. It does make sense to have a tightly integrated set of tooling and capabilities to serve the end-to-end needs, given how tightly these operators need to operate, from as early as design all the way through to managing change in operational infrastructure. And we're building for that. However, the reality is every infrastructure is different. There's no such thing as a magic blueprint, as hard as Nvidia and others are trying, that everyone is going to follow. The Cambrian explosion we talked about sort of dictates that. There are different value propositions, there are different ideas people are trying. So there probably will not be one magic tooling stack to rule them all. We'll build for that.

Kris Beevers [01:06:00]: But the other guiding principle that I think is really important in a space like this, and this applies outside of AI data centers too, to enterprise infrastructure, OT infrastructure, all kinds of infrastructure, is openness and composability. One of the things we haven't talked about very much is the fact that NetBox is open source. Almost everything we build has open source elements, and that's really purposeful and important. We need it to be easy for these teams to decide: I'm going to use all the NetBox stuff to run this, except for this one piece, because I really want to custom build something over here, for observing or driving automation in my infrastructure, or for some design element that's unique to my business, but that needs to be able to integrate with the rest of the stack. This is where openness, APIs, and composability of the toolchain these operators use to manage their infrastructure are really, really important. That's fundamental to how we think about the space and how I expect it to continue to evolve.
