MLOps Community

Building Data Centers for GPU Clouds

Posted Oct 03, 2025
# GPU Cloud
# Data Centers
# Buzz HPC
# Prosus Group

SPEAKERS

Craig Tavares
President and COO @ BUZZ HPC

Craig Tavares, President of Buzz AI, brings 20+ years of expertise in digital infrastructure and energy, leading the company’s GPU Cloud and HPC data center business. An innovator in telecom, data centers, cloud, and power generation, he has a proven track record in high-performance technologies.

His career began with Toronto Hydro Telecom, developing a metro fiber network. At Cogeco, he managed 20 data centers, global networks, and HPC hosting. At Aptum, he led cloud strategy, partnering with Microsoft Azure and AWS, and contributed to Apple’s hybrid cloud approach.

As co-founder of Kingston Co-gen, he spearheaded M&A and corporate development, acquiring 300+ MW of natural gas generation and launching one of North America’s first HPC data centers linked to a power plant.

Additionally, he has advised and contributed to several scale-up companies in energy, HPC, big data, and AI. At Buzz AI, his leadership is instrumental in expanding the company’s HPC capabilities and delivering high-performance cloud solutions for demanding workloads.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.


SUMMARY

Craig Tavares, COO of Buzz High Performance Compute, shares lessons from building GPU cloud infrastructure worldwide. He stresses the role of sovereign mandates, renewable power, and modular cooling in scaling data centers, while highlighting NVIDIA partnerships and orchestration as key to sustaining AI workloads.


TRANSCRIPT

Craig Tavares [00:00:00]: I got into building clouds inside those data centers. So I probably built more clouds than I care to admit to you. Some were great, some were awful.

Demetrios Brinkmann [00:00:12]: Yeah, deep conversations are coming. Beware. So dude, you've been building data centers. Huge investments, big-time production. I want to know everything about it, okay?

Craig Tavares [00:00:27]: Absolutely. So just for your audience, Craig Tavares. I'm President and COO of Buzz High Performance Computing. We're a Canadian cloud service provider. We own data center facilities in both Sweden and Canada. We operate GPU clouds at scale. We certainly cater to a lot of the sovereign mandates that you see in a lot of those nations, especially Canada, being Canadian owned and operated. You know, we have a huge and obvious interest within the sovereign economy of supporting sovereign mandates for AI. What that means, again, you have to look at the full integrated stack.

Craig Tavares [00:01:06]: So yeah, my background, I've been building telecom infrastructure right out of school, putting fiber in the ground for communications networks, building telecom companies from the ground up, both nationally and then internationally after that. And when you're in that infrastructure game and you're building all these massive high-bandwidth networks, you quickly got into data centers, right? So again, everybody knew that the network and the data center went hand in hand. You can't have a data center without a network, and networks led to a data center somewhere, because again, that's where the automation sits. And that got exacerbated. It went super, super hyperscale when all the AWSs and Azures and Googles came into the market. So I do have to roll back the clock for you a little bit, on when we went from being a data center provider and then I got into building clouds inside those data centers. So I probably built more clouds than I care to admit to you.

Craig Tavares [00:02:08]: Some were great, some were awful, and then some were right at that cusp when the hyperscalers came in and took over the world. So the relevant public clouds became what we know as regional clouds. And the regional clouds, the non-AWSs, the non-Googles, still had relevance because, one, there was a big customer base that just wanted a simple cloud to access. But more importantly, there was a big customer base that wanted to know where the data was.

Demetrios Brinkmann [00:02:38]: Right?

Craig Tavares [00:02:39]: So again, if I'm going to put my data in a cloud, I kind of want it within a domestic region, if only to satisfy certain data residency policies that enterprise or government might have. So again, fast forward. I did some time at some large Fortune 500 companies, went back to venture. I actually did venture where we purchased power plants. So we bought power plants, we operationalized them. These were what we call combined-cycle natural gas power plants. And we did that in Canada.

Craig Tavares [00:03:12]: And then I built an HPC facility actually directly connected to one of these power plants. So what we know as completely off-grid: a data center powered by its own power source, behind the fence. So the power that I was metering was going directly to my data center, and it was my power generation that was supplying electricity to that data center. And that's going back a few years. But now that's what you hear a lot of the big data center companies talking about. They're like, hey, I don't have access to energy where I want it, when I want it, at the price I want it at. So how do I scale my data center? By creating my own power generation and allowing for infinite scale based on me controlling all the elements myself.

Demetrios Brinkmann [00:03:59]: Right.

Craig Tavares [00:03:59]: Not dependent on the local grid, not dependent on the region I'm in, whether or not it has, say, renewable energy sources. So again, those are the things that keep coming up all the time. Like you were saying earlier when we were talking: what is that basic building block that you have to have first? Well, yeah, it is power and land, right? That's where it starts. Then I could look at the data center appetites in that market and then start looking at the data center strategy.

Demetrios Brinkmann [00:04:29]: So how do you look at that?

Craig Tavares [00:04:31]: Well, now that with AI like the.

Demetrios Brinkmann [00:04:34]: There'S infinite appetite, you don't have to look at much.

Craig Tavares [00:04:37]: It is like. Yeah, because training, when you're looking at model training, it really changed the design philosophy, number one, and it changed localization, or geographic relationship. What I mean by that is, look at how we built data centers around the world. We built a data center so that the data center is closest to the user.

Demetrios Brinkmann [00:04:59]: Right.

Craig Tavares [00:04:59]: So again, I'm building in key markets, and they're usually large population centers. So as I chose a location for my new data center building, and this is again regional clouds or hyperscale clouds, I want it to be in key markets. And then if you look at the US, we knew these as NFL cities. So in the US, if there was a state with an NFL team, you could guarantee there was a cloud provider in that city very early on.

Demetrios Brinkmann [00:05:23]: Right.

Demetrios Brinkmann [00:05:24]: And for Canada it's if it has a hockey team.

Craig Tavares [00:05:27]: Yes, actually, no, it is true. Yeah, but we don't actually have a lot of hockey teams, believe it or not. But yeah, there are a few key markets in Canada, and you could count them on one hand. Toronto being the big epicenter, Montreal being another massive market, Vancouver another market, Alberta, and Manitoba is coming up as a very relevant market as well too. So yeah, we do see the growth in Canada. Quebec grew really fast for another reason: because we had very cheap hydroelectric energy. So one...

Demetrios Brinkmann [00:06:04]: So it enabled the build out.

Craig Tavares [00:06:07]: It did, yeah. Because one, there was a lot of it. So when Quebec started investing in its power infrastructure years ago, it was to enable industrialization. So they wanted the forestry guys to come in, they wanted the mineral mining guys to come in. So they built lots of hydroelectric energy, because those are industries that use a lot of electricity. It didn't really happen, though. And again, you were talking about 20 years ago, when these investments and the infrastructure development started happening.

Craig Tavares [00:06:37]: But one thing that did happen is data centers saw the opportunity, and data centers came in really quickly. So we saw a lot of the consumption and all the power demand really get eaten up fast. And then the other thing that happened too was a lot of battery storage companies came into market, from a battery manufacturing perspective. So again, this is more on the manufacturing side, and to build batteries there is a huge energy demand as well too. So electricity is a key element in developing batteries. And if you're developing batteries, you're doing it under a green flag, a green mandate. Therefore you don't want to be creating or developing batteries with fossil fuels.

Craig Tavares [00:07:22]: So again, because Quebec was hydroelectric, all the battery manufacturers really enjoyed going to Quebec to use again an abundance of cheap energy to do what they could with renewable energy sources.

Demetrios Brinkmann [00:07:35]: So I like that you are basically going full stack on the building out of the data center saying we're going to produce our own energy, we're going to make sure that we have the capabilities that we need. How are you producing that energy?

Craig Tavares [00:07:49]: Besides the hydroelectric, today at Buzz we don't actually build and produce any energy ourselves. What we have done, and this comes from our parent company, Hive Digital, is we found locations around the world that have that situation I'm explaining to you, where there's an abundance of hydroelectric energy. One, we wanted to really stay and remain a sustainable company, where we're using renewable energy sources, in this case hydroelectric or geothermal. So Sweden was a big anchor site for us. It was a primary location for us and where we built some of our first facilities, so that we're using 100% green energy there. Quebec, again, obviously a home base for us. And a lot of our GPU clusters today that we offer to the market are based in Quebec.

Craig Tavares [00:08:42]: And my parent company, Hive Digital, just made a large acquisition in Paraguay, where we purchased a very large piece of land and facility connected to one of the largest dams in the Western Hemisphere. So 100% hydroelectric again.

Demetrios Brinkmann [00:08:57]: It feels like if you build it, they will come, right?

Craig Tavares [00:09:01]: Yeah, I think you're planning a roadmap like you're creating that energy roadmap so that again, as you build out that data center infrastructure, again you have the energy to back it up and energy to scale with you. So again, for what we need to do and cater to the market that we cater to, we're building fully integrated data centers. Which means that right now the locations and the access to energy and the hydro energy that I have, it actually caters to demand and for the next couple years it'll cater to whatever I need to do from a GPU cloud perspective. So data centers, you know, come in different shapes and sizes.

Demetrios Brinkmann [00:09:39]: Right.

Craig Tavares [00:09:41]: Again, the traditional data center that I was talking about earlier, we were building really sub-10-kilowatt racks. So I could put in any single rack a bunch of servers, and I could feed them up to 10 kilowatts of energy in that rack. And then I think, if you sit in any infrastructure talks here at the conference, you'll hear about that shift, where quickly people had to go from that 10-kilowatt rack, or sub-10-kilowatt rack, to a 40-kilowatt rack when the H-series GPUs were released.

Demetrios Brinkmann [00:10:10]: Right.

Craig Tavares [00:10:10]: So you just went anywhere from like 5 to 6 to even 8x, depending on what your power consumption in that rack was, overnight. And then you quickly went to an 80-kilowatt rack, really in like a three-to-six-month period. And now what we're planning for is 130, 135-kilowatt racks with the GB200s.

Demetrios Brinkmann [00:10:31]: With no view of slowing down at all.

Craig Tavares [00:10:35]: Yeah, I think data center design and data center construction will naturally limit the absorption in the market. You know, if you talk to Jensen, he's already preempting: hey, guys, start thinking about 400-kilowatt racks. Yeah. And you get into other issues with that one. Yes, power density is a huge problem. You probably have to get out of this rack mindset.

Craig Tavares [00:11:00]: So a data center will not be the same data center that you remember from, like, a few years ago. And then the heat dissipation is a real issue. That's a real thing. And that's why we had to shift from air-cooled to liquid- or water-cooled, because you can't get the same heat dissipation or the same thermal transfer through air as you would with liquid. So you're trying to create that thermal efficiency and remove heat off those chips as fast as humanly possible, and at scale. So yeah, it's not an easy engineering feat to design data centers, but also to design data centers so they're modular and future-proof.

Demetrios Brinkmann [00:11:42]: Right.

Craig Tavares [00:11:42]: Because again, I'm trying to plan for what that next rack density might be, and they're just getting more and more dense and more power hungry.
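
The density jumps Craig describes — roughly 10 kW racks, then 40, then 80, now ~135 kW — can be made concrete with a little arithmetic: for a fixed IT power budget, each generation collapses the number of racks (and the floor space) a cluster needs. A minimal sketch; the 1 MW budget and the per-generation figures below are illustrative round numbers from the conversation, not Buzz's actual deployment data:

```python
import math

# Illustrative rack densities (kW per rack) for the generations discussed:
# traditional air-cooled, H-series era, the 80 kW step, and GB200-class racks.
RACK_KW = {"traditional": 10, "H-series": 40, "next-gen": 80, "GB200-class": 135}

def racks_needed(it_load_kw: float, rack_kw: float) -> int:
    """How many racks a given IT load requires at a given per-rack density."""
    return math.ceil(it_load_kw / rack_kw)

budget_kw = 1_000  # a hypothetical 1 MW IT load
for gen, kw in RACK_KW.items():
    print(f"{gen:>12}: {racks_needed(budget_kw, kw):>4} racks at {kw} kW each")
```

The same 1 MW that once filled 100 traditional racks fits in a handful of GB200-class ones, which is why the plumbing and electrical distribution, not the compute, become the design constraint.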

Demetrios Brinkmann [00:11:49]: Do you see a world where there's 400 kilowatt racks?

Craig Tavares [00:11:52]: Yeah.

Demetrios Brinkmann [00:11:53]: And how does that look? How do you make that transition? Especially when you're trying to plan ahead and future-proof, and you're like, dude, I just got to 130. Give me a little bit of a rest.

Craig Tavares [00:12:02]: So, our ongoing joke with our engineers is: don't screw anything down to the ground.

Demetrios Brinkmann [00:12:08]: So put it all on wheels so you can just wheel them in and out.

Craig Tavares [00:12:12]: That's right. And if you think about this for a second, the size of pipes we're putting into data centers for the plumbing, just to move liquid fast enough at volume through these servers, I mean, it's already massive. So yeah, if you have to now prepare for the future and say, okay, I need the right plumbing infrastructure in place, I need the right electrical infrastructure in place. These things are just going to get bigger and bigger. So yeah, I don't think anyone has actually solved for all those answers. Except, like I said, don't get used to one standard. That's why I'm saying don't screw anything down to the ground.

Craig Tavares [00:12:46]: Like you could change it next year.

Demetrios Brinkmann [00:12:47]: I remember I made a joke: oh yeah, you're in Canada and Sweden because it's cold there, right? And you're like, well, that's kind of true, actually.

Craig Tavares [00:12:57]: It helps a lot. Yeah. So the metric that we use to determine your data center efficiency is something called PUE, your power usage effectiveness metric. And in most data centers, especially cool climates, we try to leverage what's known as free air cooling.

Demetrios Brinkmann [00:13:14]: Right.

Craig Tavares [00:13:15]: So I want to use as much of the ambient air as possible. When it's cool, bring it into the data center and use it, instead of firing up my chillers to cool that water loop if it's liquid, or the air that I'm pushing through the data center.

Demetrios Brinkmann [00:13:26]: Right.

Craig Tavares [00:13:26]: Take the cold air from outside and use it. And yeah, in climates like we have in northern Canada and northern Sweden, you want to leverage that cold air. For 80% of the year you have cold air, why not use it? So that way I drive down my PUE, and on average throughout the entire year, I have what we know as like a 1.3 PUE. So for every 1 kilowatt of usable energy or capacity to power my servers, I'm using about 30% more to cool the environment and manage environmentals within the data center.
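
PUE as Craig describes it is just total facility power divided by IT equipment power, so a 1.3 PUE means roughly 30% of extra draw goes to cooling and environmentals. A quick sketch of the arithmetic; the 1,000 kW / 1,300 kW site below is a made-up example, not one of Buzz's facilities:

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_load_kw

def overhead_fraction(pue_value: float) -> float:
    """Fraction of extra power spent on cooling/environmentals per kW of IT load."""
    return pue_value - 1.0

# Hypothetical site: 1,000 kW of servers drawing 1,300 kW from the grid overall.
it_kw = 1_000
facility_kw = 1_300
p = pue(facility_kw, it_kw)
print(f"PUE = {p:.2f}, overhead = {overhead_fraction(p):.0%}")  # PUE = 1.30, overhead = 30%
```

Free air cooling lowers the numerator: the fewer hours the chillers run, the closer the facility gets to the theoretical floor of 1.0.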

Demetrios Brinkmann [00:14:07]: I hear a lot of people talking about building data centers in the desert, in the Middle East.

Demetrios Brinkmann [00:14:11]: Yeah.

Demetrios Brinkmann [00:14:12]: Does that just not make sense?

Craig Tavares [00:14:13]: Well, you can do it. You just wouldn't use as much free air for your cooling. So there are other cooling techniques that you would incorporate, such as evap cooling. Yeah, definitely. You might have to fire up the chillers more often, but they're innovating. I mean, even in climates like that, they're always looking at innovation techniques to keep the PUE really low. And one thing that we see work really well around what we know as district energy plants is, one, if I do generate heat, I want to reuse it somewhere else. So that's the other thing that we do and we look at when we design data centers, especially if they're in industrial parks.

Craig Tavares [00:14:49]: Yeah, Can I take that heat waste and then, hey, give it to my neighbor if they need heat for a sauna?

Demetrios Brinkmann [00:14:54]: Yeah.

Craig Tavares [00:14:55]: Or anything like that. Yeah, absolutely. So, whatever you can use that heat waste for. Yeah, go create a district energy system with neighboring businesses to reuse that heat. Or another idea, and we've done this historically: if you're, say, near a large body of water, it could be a lake, it could be the ocean, and you have the ability to actually leverage deep lake cooling or deep water cooling. What you'll do is take cold water in from that water source and then run it through your data center. So again, this is another district energy type technique.

Craig Tavares [00:15:36]: We do it in Toronto quite a bit. Most of downtown Toronto is on this district energy system, where we take the lake water, pump it through pipes in all the commercial buildings, and that's how we create the cooling for the buildings.

Demetrios Brinkmann [00:15:50]: Oh, cool.

Craig Tavares [00:15:51]: Same concept for a data center. Just again, more dense, more compact way to do it.

Demetrios Brinkmann [00:15:55]: So there's a lot of different chip companies here at the RAISE Summit. I talked to you about this and you're like, yeah, they're kind of a dime a dozen. They have their different perks. But you are standardizing on Nvidia, right? You're not doing any other kinds of chips; you're not really interested.

Craig Tavares [00:16:14]: It's true. Yeah, we have found a strong partner in Nvidia, and we also get a lot of support from Nvidia too. So they're really big on supporting their Nvidia cloud partners. We're one of them. So we're very, very fond of that partnership. Certainly, when you look at the Nvidia team, they're pioneering and carving through new technology frontiers, right? They have this innovation engine that's always going. And the thing I love about Nvidia is they're not afraid to experiment and get people to come and collaborate and experiment with them in the market.

Demetrios Brinkmann [00:16:50]: Are they the ones coming and helping you with the cooling and the design?

Craig Tavares [00:16:54]: They do. They have a great team that collaborates to say, hey, listen, we want to build a standard. And the reason why they want to build a standard is they want to create a consistent experience for all the customers. So if you're going to spend a quarter million dollars on a server, you as a consumer, or me as a service provider trying to monetize that, definitely you want to deliver the best experience to your customer possible, which means performance and reliability. And that reliability is really around uptime, because again, I'm spending lots of dollars to build my AI project. Well, listen, the hardware better work. So yeah, they do have a strong influence. They bring in, as you know, reference architectures that the NCPs use.

Craig Tavares [00:17:38]: And listen, standardization is not a bad thing, because that standardization can also be updated when you find new innovation techniques to make something better. So it's not to say that you standardize to lock into a rigid framework. It's to say that, hey, once we find a good recipe, let's all follow the process. But when we see process improvements for the next iteration, we'll incorporate those.

Demetrios Brinkmann [00:18:02]: Do they also let you know what's coming down the pipe?

Craig Tavares [00:18:04]: Absolutely.

Demetrios Brinkmann [00:18:05]: And that's where you can recognize that in the next three, six, ten months, we should also be starting to prepare for X, Y, Z. Yeah, that, that.

Craig Tavares [00:18:15]: Yeah, and from two angles. One is the hardware perspective, right? How do I have to plan my data center design around that new technology coming in the future? And then the second part is software.

Demetrios Brinkmann [00:18:30]: Right.

Craig Tavares [00:18:31]: So I want to say that Nvidia invests as much time in their software stack as in hardware. So, yeah, we definitely get early previews of all these things. And then the goal is, how do I create the best AI platform and democratize that for easy access to everybody that's innovating? Because we want to accelerate the speed of innovation in the market right now. The one thing that we don't want to do is slow down anybody that's building an AI project. They themselves have to get to market really quick, because they've got to see an ROI and they've got to monetize it themselves.

Demetrios Brinkmann [00:19:05]: Yeah, this is probably a good point to talk about the hardware software bottlenecks that you were just talking about on the panel. What are some key points that you've been seeing out there?

Craig Tavares [00:19:15]: On the hardware side, this was more about keeping up with long lead times and supply chains. So in some areas and some businesses, you've got guys that are GPU rich. That means they have open access to GPUs, huge budgets, a massive amount of capital. So you see a lot of wastage, you see a lot of unutilized GPUs. But then on the other end of the spectrum, you do have a big part of the market, and you see this with researchers, you see this with startups that don't have big budgets. So they're a little bit GPU poor: I want more and more GPUs, and I'd love to innovate faster, but the GPU is expensive, it's not cheap.

Craig Tavares [00:19:55]: So what we do on the hardware side is we have a planning continuum where, one, from a data center infrastructure perspective, that's probably the longest-lead item, because it takes a long time to build a data center. We're talking 12 to 18 months in some cases.

Demetrios Brinkmann [00:20:09]: Right.

Craig Tavares [00:20:09]: So I have to have that data center pipeline capacity ready to even deploy or install a GPU. And then obviously the GPU itself: you have an alignment with your OEMs, all the server vendors that are packaging those Nvidia GPUs, and then you install them. And we've gotten really good at this. It used to take a long time to turn up these clusters. They're highly complex. It's like wiring a brain and making it work with a thousand network connections. And, you know, all the software.

Demetrios Brinkmann [00:20:37]: Just one plug is not in the right place.

Craig Tavares [00:20:40]: Yeah, no, you got all these errors and problems. Yeah, it sucks.

Demetrios Brinkmann [00:20:43]: Sounds like a nightmare.

Craig Tavares [00:20:44]: No, it sucks. Right? So yeah, but we have a model down where we can deploy really quick. But when I say deploy quick, we're still talking about weeks sometimes.

Demetrios Brinkmann [00:20:53]: Right?

Craig Tavares [00:20:53]: So again, install the racks, connect them from a network perspective, then commission the servers and then pressure test them.

Demetrios Brinkmann [00:21:01]: Right.

Craig Tavares [00:21:01]: So that's making sure I get all the right firmwares in there, all the right drivers are in there. And then I've got to get the OS in and patched properly, and then I go back to test it. So yeah, that takes time. It definitely takes time. And then when I have a clean system, then I can put it in the market and start to monetize it.

Demetrios Brinkmann [00:21:18]: And then you throw the software on top of that.

Craig Tavares [00:21:21]: Yeah, and that's the other part of the bottleneck you're talking about. So on the software side now, as a service provider, what I want to do is offer some level of orchestration to my customer too. To say, hey, number one, let me make sure I put you on the right platform for the right reason. And whether you give a customer bare metal, or I'm giving them Kubernetes to consume, or maybe it's Slurm, it is around me optimizing that platform for the customer's use and developers. Historically, developers love bare metal. The problem with that is, again, you have a lot.

Demetrios Brinkmann [00:22:00]: Yeah, it is.

Craig Tavares [00:22:01]: Yeah. Like. And you see start, you see wastage start to come out too.

Demetrios Brinkmann [00:22:04]: Right?

Craig Tavares [00:22:05]: Because, you know, people start to treat servers like pets instead of cattle. And, and then, you know, the whole.

Demetrios Brinkmann [00:22:13]: DevOps movement, it is.

Craig Tavares [00:22:14]: Yeah, no, it came back. It came back again.

Demetrios Brinkmann [00:22:17]: Right?

Craig Tavares [00:22:18]: People are hoarding GPUs.

Demetrios Brinkmann [00:22:20]: Right.

Craig Tavares [00:22:20]: People are actually holding on to GPUs if they can. Because again, you see this cycle, even with the hyperscalers, where it's actually hard to get a GPU sometimes.

Demetrios Brinkmann [00:22:30]: Right.

Craig Tavares [00:22:31]: And then, you know, unless you're willing to lock into a long-term contract. And people lock into long-term contracts. Why? Because when they need the GPU, they want to make sure they have it. So anyway, it's a behavior that we see in the market all the time when there's a resource that's highly desirable, and when you need it, you need it. Right? So the software stack on this side is, one, yeah, we do see optimizations, at least from a model perspective. So you look at compression, right? Really simple things that everybody from a model-building perspective always tries to improve. And then you go down the stack to orchestration. That's where, as a service provider, I want to schedule workloads and put them in the most efficient places, to make sure I'm creating maximum usage out of my cluster.

Craig Tavares [00:23:20]: Because why? If I have wastage in my cluster, well, I'm not monetizing and making the money I need to pay back the investment.
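
The scheduling problem Craig is pointing at — placing workloads so as few GPUs as possible sit stranded — is essentially bin packing. A toy first-fit-decreasing sketch of the idea; the 8-GPU node size and the job mix are hypothetical, and real orchestrators like Slurm or Kubernetes weigh far more constraints (topology, memory, tenancy) than this:

```python
def first_fit_decreasing(job_gpu_counts, gpus_per_node=8):
    """Pack jobs onto nodes, largest first, to reduce stranded GPUs.

    Assumes each job fits on one node (need <= gpus_per_node)."""
    nodes = []        # free GPUs remaining on each allocated node
    placement = {}    # job index -> node index
    for job_id, need in sorted(enumerate(job_gpu_counts), key=lambda x: -x[1]):
        for i, free in enumerate(nodes):
            if free >= need:           # first node with room wins
                nodes[i] -= need
                placement[job_id] = i
                break
        else:                          # no node fits: allocate a fresh one
            nodes.append(gpus_per_node - need)
            placement[job_id] = len(nodes) - 1
    return nodes, placement

# Jobs wanting 1-8 GPUs each; leftover free GPUs per node show the "wastage".
free_per_node, where = first_fit_decreasing([4, 8, 2, 2, 1, 5])
print("idle GPUs per node:", free_per_node)
```

Sorting largest-first is the whole trick: big jobs claim whole nodes early, and small jobs backfill the gaps, which is the same intuition behind wanting orchestration instead of handing tenants raw bare metal.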

Demetrios Brinkmann [00:23:25]: I was talking to Paul yesterday and he was mentioning how for him, he's okay paying a premium for a more managed service on this because it's so hard to get talent that really understands how the hell these things work.

Craig Tavares [00:23:40]: No, you hit the nail on the head. And you know, when you go to enterprise, you can't go and sit down with a company and say, oh, how many GPUs do you need? You can't have that conversation. Nobody is going to know how the hell to answer that, right?

Demetrios Brinkmann [00:23:55]: Yeah.

Craig Tavares [00:23:55]: And no one's going to know what to do with it either, afterwards. So yeah, that's where I think the token economy is starting to become a more and more real thing. Especially as we start to shift away from guys that are just doing foundation models, and really still going back to tuned models, because you get into a lot of organizations where they're just taking proprietary data and they want closed environments, so they want on-prem environments. So yeah, you will still get a lot of tuning happening, but then you quickly shift to that inference side of the equation, and now you're abstracting the hardware, which is good. That is actually the right thing to do, because, like I said, back to the bare metal conversation that we were having: trying to sell GPU hours or trying to sell bare metal, you do see a lot of wastage there, right? You can't maximize efficiency very easily, because you're leaving it to an individual or someone else and their behaviors to maximize that hardware themselves. And in multi-tenant environments, yes, as a service provider, it may sound counterintuitive that I want people to save money, because if people are spending money, well, that means I make more money. It's not true. I want people to have efficient usage, because it extends the life of the project.

Craig Tavares [00:25:06]: I want to add value as a service provider so my service becomes more sticky as well. So anyways, it's actually great again for innovation, it's great for the industry. And then me, it's great for my brand because if I could send you a recommendation and say, hey, listen, look at my tooling, here's where you have wastage, here's where you could optimize, or here's where you could be on a cheaper GPU for the workload that you're doing. I want to provide information to my customer.

Demetrios Brinkmann [00:25:33]: Are you also renting out GPUs to some of these neoclouds that are just doing inference, like the Modals and the Basetens, or do they get enough in other places?

Craig Tavares [00:25:45]: No, we do have a pretty broad customer base, and yeah, a lot of them are inference-as-a-service providers. So yeah, we do that. And those guys are great too, because their downstream customers are more interested in, okay, how many tokens am I using, rather than how many GPU hours do I need? Right. So you do see a huge shift in the market there. But you still need a metric. That's why the token is important. However, like I said, the GPU hour is just really, really inefficient.

Craig Tavares [00:26:16]: And it was a little bit harder to pin down in terms of quantifying how many GPUs I need for a specific model or a specific project that I needed to deliver.
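
The shift Craig describes, from billing GPU hours to billing tokens, can be made concrete with some back-of-the-envelope arithmetic. Every number below (hourly rate, throughput, utilization) is a hypothetical assumption, purely to illustrate why an idle-heavy GPU hour is an inefficient unit compared to a per-token price:

```python
# Hypothetical figures for illustration only -- not real Buzz pricing.
GPU_HOUR_RATE = 2.50     # $/GPU-hour (assumed)
TOKENS_PER_SEC = 2_000   # sustained inference throughput per GPU (assumed)
UTILIZATION = 0.30       # fraction of the rented hour actually doing work

# Tokens actually produced in one rented GPU hour at 30% utilization
tokens_per_hour = TOKENS_PER_SEC * 3600 * UTILIZATION

# Effective cost per million tokens under GPU-hour billing
cost_per_mtok = GPU_HOUR_RATE / tokens_per_hour * 1_000_000
print(f"effective $/M tokens at {UTILIZATION:.0%} utilization: {cost_per_mtok:.2f}")

# At full utilization the same hardware is far cheaper per token --
# that gap is the wastage a per-token price makes visible.
full_cost_per_mtok = GPU_HOUR_RATE / (TOKENS_PER_SEC * 3600) * 1_000_000
print(f"effective $/M tokens at 100% utilization: {full_cost_per_mtok:.2f}")
```

Under these assumed numbers, the same GPU hour is roughly three times more expensive per token at 30% utilization than at full utilization, which is why a token metric surfaces inefficiency that GPU-hour billing hides.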

Demetrios Brinkmann [00:26:28]: Well, I know some friends that bought some GPUs, and one of their main decision criteria was the level of support they were going to get, because they knew it was going to be so hard to just go in there and figure out how to make this work with their project. And there was going to be a lot of learning for both parties, not just the team that was buying the GPUs, but also the folks that were renting out the GPUs. There were a lot of times when they were encountering things for the first time, and it's like, all right, well, let's troubleshoot this together, I guess.

Craig Tavares [00:27:02]: We still see that, even as a service provider and operator. A new firmware from an OEM comes out and it changes everything, and then we've got to roll back firmwares because I'm not getting the bandwidth I was supposed to get. Or you've got to install patches, or install packages in the right order, just to get the throughput you need. Because this is around moving data too. If you think about it, within a cluster, the network has to be operating at full bandwidth efficiency. Your memory has to be at full bandwidth efficiency.

Demetrios Brinkmann [00:27:33]: Right?

Craig Tavares [00:27:34]: And this is where NVLink came in, because you want the cluster to act almost as a single server, where that GPU parallel processing means they could all talk to each other at maximum efficiency. So you go back to what we know as primitives. Primitives are your CPU, your memory, your disk, and your network.

Demetrios Brinkmann [00:27:54]: Right?

Craig Tavares [00:27:54]: And then you need that high throughput across all those elements. There's one vendor we use on the storage side, VAST Data, and VAST has been extremely powerful for us, because you don't see bottlenecks to storage, and it's network storage. And then orchestration: we partnered with folks like Rafay, who create massive orchestration automation for us, even at the Kubernetes level, or creating dev pods, for example. And when you talk about customer experience, this is where we want to make our infrastructure easy for our customers to consume. They hit that easy button so that, like you said, they're not thinking about it too much.
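
The point about memory and network needing full bandwidth can be framed as a simple roofline-style check: compare an operation's arithmetic intensity (FLOPs per byte moved) against the hardware's compute-to-bandwidth ratio. The peak figures below are rough assumptions for an H100-class GPU, not vendor specs:

```python
# Rough, assumed peak figures for an H100-class GPU -- illustration only.
PEAK_FLOPS = 1.0e15   # ~1 PFLOP/s dense FP16 (assumed)
MEM_BW = 3.3e12       # ~3.3 TB/s HBM bandwidth (assumed)

def bound_by(flops: float, bytes_moved: float) -> str:
    """Classify an operation as compute- or memory-bandwidth-bound."""
    intensity = flops / bytes_moved   # FLOPs per byte
    ridge = PEAK_FLOPS / MEM_BW       # hardware balance point (~300 FLOPs/byte here)
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Example: batch-1 token generation on a 70B-parameter FP16 model reads
# every weight (~140 GB) to do ~140 GFLOPs of work: about 1 FLOP/byte,
# far below the ridge point, so it is memory-bandwidth-bound.
print(bound_by(flops=140e9, bytes_moved=140e9))
```

This is why a firmware regression that cuts memory or interconnect bandwidth hurts inference throughput directly, even when the GPUs' raw compute is untouched.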

Demetrios Brinkmann [00:28:42]: Yeah, I remember we had Mohan on from Rafay, and he was talking about how they're this middle layer where it's constantly like the ground is shifting underneath them, because there's not a clear picture of how many GPUs are actually being utilized, or how much of each GPU is being utilized. If there are rolling updates happening, which ones are we taking offline and putting online? So underneath them nothing is stable. And then above them, the demand isn't stable either. It's like, I want 50 GPUs right now; I want 50,000 GPUs right now. So they're constantly playing this game of cat and mouse: all right, well, we can connect these here. It reminds me of my daughter's little playbook where you draw the line from the word mouse to the picture of the mouse, and then you draw the line from cat to the cat.

Demetrios Brinkmann [00:29:33]: But you're just doing it at a ginormous scale with a lot of money on the line.

Craig Tavares [00:29:37]: I think the best story I can give you is around historical technology evolutions. I had the opportunity to listen to Geoffrey Hinton in Toronto not too long ago, and he was asked the same question: what was the efficiency gain, the thing that allowed us to get to where we are right now? And I was surprised and not so surprised by his answer: he said it was the progression and evolution in GPUs, the advancing performance of the hardware. Because when you look at efficiencies and evolution, you see it across the entire spectrum, hardware and software. Based on what you're saying, even with Rafay, you do need the software optimizations at different levels, whether it's in the model itself, in the orchestration layers, or just behavior: how do you create more education for people around their usage behaviors? But the example I wanted to draw on is autonomous driving as a technology. It's the solution, it's an outcome.

Craig Tavares [00:30:46]: But the technologies behind it were twofold. One is you have the AI making decisions, which is almost real-time inference, by the way. But here's the big but: you probably wouldn't have got it, and Geoffrey Hinton's right, without the GPU. Another example of that is optics, the optical development in the camera. If you think about a self-driving car, well, guess what, if that camera didn't evolve to where it is today, you wouldn't have self-driving cars either. So it is that convergence of all these technologies evolving almost at the same pace, lowering the barrier of entry for that innovation to start. And economics has a lot to do.

Demetrios Brinkmann [00:31:28]: With it too, right?

Craig Tavares [00:31:29]: So, is the GPU at a price now that I could offer to the market, that the market could finance or manage or fund, to go do the things that we do today?

Demetrios Brinkmann [00:31:42]: The GPU played a big role in that because of the training. The camera played a big role in that. You have this intersection and convergence of all these different pieces. So when you extrapolate it out to companies and businesses, and you're talking about the economics, it's clear that companies need to be able to say, this makes sense economically for us to spend a lot of money on whatever it is we're doing. That conversation is being had more and more: are we spending money on GPUs just to do AI, or do we actually see the ROI from it? I think more and more people are quantifying the lift that AI is giving them. We've moved out of that POC phase, and now we're not willing to burn a lot of money on GPUs just because we want to be AI literate.

Demetrios Brinkmann [00:32:43]: Yeah.

Demetrios Brinkmann [00:32:44]: And so it's very cool to see.

Craig Tavares [00:32:47]: Yeah, you struck a chord with me on that one too, and more so because Buzz, being a GPU- and AI-centric cloud provider, means that we have purpose-built tooling and purpose-built software in our cloud for the development of AI. But what we've also done is offer a product in the market that's a fraction of the cost of a hyperscale-type offering. If you compare us to the AWSes and Googles and Microsofts of the world, we are much, much cheaper for getting that AI project done. We're quicker and simpler to use, and it allows you to fail fast.

Demetrios Brinkmann [00:33:31]: Right.

Craig Tavares [00:33:31]: So when you're on a platform like ours, you could throw pizza dough at the ceiling and see if it sticks. You're not running up too high a budget, and because of the concept of cloud, guess what, I could hit the reset button and start over. Whereas if you were to go straight to buying hardware for your own data center, try to figure out how to use a GPU, figure out if you've got the right optimization techniques in there, figure out if you bought the right scale, that's a dangerous venture. That's a really dangerous investment. So it's really good to at least start in cloud, get that POC done, see if you could build an ROI really quick, and like I said, you could fail fast. If the pizza dough doesn't stick, well, hey, the pizza dough doesn't stick.

Demetrios Brinkmann [00:34:09]: Right.

Craig Tavares [00:34:09]: And then once you have that, you may stay in cloud, or, and we've seen this a lot, you may create hybrid environments where now I do edge into my own on-prem data center. I want my own dedicated stack. The economics may work better in that case, because I could drive high utilization on hardware that I own, and then I use cloud for bursting and for different projects. Maybe it's just my inference platform that I want to run in cloud and I'll keep training on-prem, or vice versa. So I think hybrid is a great future for everybody.

Demetrios Brinkmann [00:34:39]: Right.

Craig Tavares [00:34:39]: Being able to figure out where the dollars should go based on how much hardware I own versus how much hardware I rent.
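
The own-versus-rent tradeoff Craig sketches comes down to utilization: owned hardware amortizes a fixed cost, so it only wins if you keep it busy enough. Here is a minimal breakeven sketch, with every figure (purchase price, lifetime, opex, cloud rate) assumed purely for illustration:

```python
# All figures are assumed for illustration, not real quotes.
PURCHASE_COST = 250_000        # $ per 8-GPU server (assumed)
LIFETIME_HOURS = 4 * 365 * 24  # 4-year useful life (assumed)
OPEX_PER_HOUR = 6.0            # power, cooling, ops for the server, $/h (assumed)
CLOUD_RATE = 2.50              # $/GPU-hour rented (assumed)
GPUS = 8

def owned_cost_per_gpu_hour(utilization: float) -> float:
    """Effective $/GPU-hour of owned hardware at a given utilization."""
    total = PURCHASE_COST + OPEX_PER_HOUR * LIFETIME_HOURS
    busy_gpu_hours = GPUS * LIFETIME_HOURS * utilization
    return total / busy_gpu_hours

# Step up utilization until owning beats the rented rate
u = 0.01
while owned_cost_per_gpu_hour(u) > CLOUD_RATE and u < 1.0:
    u += 0.01
print(f"owning beats ${CLOUD_RATE}/GPU-h above ~{u:.0%} utilization")
```

Under these assumed numbers, owning only pays off above roughly two-thirds utilization, which is exactly the hybrid logic in the conversation: keep the steady, high-utilization workload on owned hardware and burst the rest to cloud.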

Demetrios Brinkmann [00:34:46]: The hybrid narrative for the cloud was always something people liked to have, but I don't know if it was ever a really strong need, except for a few companies that definitely needed it for one reason or another. At the end of the day, folks liked to talk about it more than they liked to actually do it. But here I think it flips that on its head, because of what you're saying: you need it, whether you're trying to get more access to GPUs and you can't, or you're seeing that the unit economics work better when you do certain things on your own, or with a cloud that provides the type of service that's better for certain use cases. You have this almost new paradigm emerging where hybrid is a very common design.

Craig Tavares [00:35:41]: Yeah. But I want to say one thing. Hybrid's not easy to pull off.

Demetrios Brinkmann [00:35:45]: Yeah.

Craig Tavares [00:35:45]: So what, what we've done at Box.

Demetrios Brinkmann [00:35:47]: Which is why I think traditionally people weren't doing it. They were talking about it, like, yeah, we're going to go hybrid, or we like hybrid. And I remember somebody telling me that the CEO of Databricks loved it when people would talk about hybrid, because everybody loves that idea, and then the execution fails and they end up using Databricks. It's great in theory, but not in practice.

Craig Tavares [00:36:12]: So. So one thing I could tell you is that again, when you look at a lot of the AI projects today, we see a lot of data movement and that data movement happens because, you know, I have my. All my applications in hyperscale cloud right now. All the data is sitting there. You know, it can be an S3, but I need a specialized platform to do AI. You know, I don't want to pay the hyperscaler price to do it in their cloud. So now you see again, a lot of data movement happening from hyperscaler. To a platform like Buzz.

Craig Tavares [00:36:47]: And then, hey, I may train something, I may tune something again, I may, I may run an inference endpoint, but what I don't want to do is have to move the data back in. So I don't want to create a movement of data from one place to another, then another place to another. So again, I think that's one thing that Buzz has tried to do, is we've tried to create a holistic platform. It's an enterprise grade platform, which means that I could take something from again, its infancy, from its genesis. And that's again, maybe again even building a training model to how that turns into an app and hosts the app as well too. So again, we provide VMs, we provide the storage services, we provide the security services, we provide the resiliency. And really, again, it at least allows a user consumer to not have to move the data back and forth so you could host everything and run that app in one place at least. Again, it goes back to what we were talking earlier.

Craig Tavares [00:37:42]: Like, I want to create the proper customer experiences. I want to create the proper environment for my base to be able to use my infrastructure without making it too complicated.

Demetrios Brinkmann [00:37:52]: Right.

Craig Tavares [00:37:52]: And yeah, moving data is not easy and it's not cheap.

Demetrios Brinkmann [00:37:55]: That's what I've heard: when folks go shopping around for GPUs, a lot of times they forget about that, and then they start their project and go, oh yeah, egress fees are kind of expensive; we didn't realize it was going to cost this much. And if you talk about the hybrid model, there are ways you're going to have to get really creative with that data architecture so that it doesn't end up not making financial sense. Otherwise you did all this gymnastics to make it hybrid so you could save a buck here and there, and then it doesn't work.
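
The egress surprise is easy to estimate up front: hyperscaler egress is typically billed per GB moved out, so a dataset that shuttles between clouds multiplies the charge with every trip. The per-GB rate below is an assumed round number, not any provider's actual price:

```python
# Assumed round number for illustration -- check your provider's real rates.
EGRESS_PER_GB = 0.09   # $/GB out of a hyperscaler (assumed)

def egress_cost(dataset_tb: float, moves: int) -> float:
    """Cost of moving a dataset out of a cloud `moves` times."""
    gb = dataset_tb * 1000
    return gb * moves * EGRESS_PER_GB

# A 50 TB training set moved out once, versus shuttled out three times
print(f"one move:    ${egress_cost(50, 1):,.0f}")
print(f"three moves: ${egress_cost(50, 3):,.0f}")
```

At these assumed numbers, a single 50 TB move already costs thousands of dollars, which is why keeping the data in one place, as Craig argues next, often dominates the hybrid design.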

Craig Tavares [00:38:34]: Yeah, no, it's, it's again, a good point. I, I think when we look at hybrid and what we saw traditionally, especially in cloud, was there were a lot of enterprises and companies that prefer to keep certain applications on prem and they, if they understood cloud and they understood that the best thing to do with cloud is go cloud native, not lift and shifts, then they were able to harness the value of cloud a little bit better and then it became economically more feasible too. But yeah, everyone that just did a lift and shift to say, hey, I'm going to throw an application cloud for no reason at all, that that wasn't a great idea and that became very, very expensive. So again, hybrid was more a force methodology previously where again, I think some customers that needed to keep apps on prem did that. And that could be compliance reasons too. Again it goes back to what I was saying earlier on sovereignty and data sovereignty specifically, where if, if you couldn't show that again your data was in one place and it was secured and ring fenced the way you need it to be, well, you weren't meeting certain compliance and controls that you had within an organization. So again that, that, that really kept again the colocation business alive. It kept again a lot of data centers that, where customers again just had their own racks, they had their own space.

Craig Tavares [00:39:56]: You know, traditional enterprise IT data centers that existed, again, those are still there for that reason.

Demetrios Brinkmann [00:40:02]: Right?

Craig Tavares [00:40:02]: It's around a lot of it is around the data sovereignty and the data residency requirements that they need to meet. And then yeah, you go to hyperscale. And again they were great front end apps. They're great web apps that love to be in hyperscale cloud and they were suited for April again, especially if you're using cloud native techniques. So again, come back to AI now. And again AI loves on prem massive amount of data that you got to train. But again, when you go to inference, again it's not again the same data set. However, one thing I'll share with you is why would you create multiple data sets for different types of phases within that AI journey? If I have my data in one place, I want to keep my data one place and I want to iterate on it.

Craig Tavares [00:40:48]: Therefore again, if I have to go back and tune it too, I'd rather not create a second copy of that data. So yeah, I think you're absolutely right. Like the concept of hybrid sounds attractive. It's not easy to execute, it's not easy to manage. And typically if you get everything one place, you want to do that, what.

Demetrios Brinkmann [00:41:09]: Are you most excited for? That's coming down the pipe.

Craig Tavares [00:41:15]: I see. So again, if you look at one sovereign around the world and what a sovereign mandate is, again, countries and nations, they're saying, hey listen, I get that AI is super important. I hate to use the arms race term because it kind of refers to trying to acquire as many GPUs as possible. But what the truth is, I think what a lot of countries have recognized is that they do have to have an AI strategy. And for certain countries that AI strategy means, well, one is keeping a talent pool or growing a talent pool within the country to help businesses to grow your gdp. And it's super important because one, you don't want brain drain too where again, you have your talent coming out. You're universities and schools just jumping to another country because of a higher paycheck as well. Too well.

Demetrios Brinkmann [00:42:13]: And they can't get access.

Craig Tavares [00:42:15]: Yeah.

Demetrios Brinkmann [00:42:15]: To any cool stuff to work on.

Craig Tavares [00:42:18]: Yeah, no, it's, it's, it's super. It's super. It's a big problem. And then for governments, government's the other big, you know, thing behind sovereign.

Demetrios Brinkmann [00:42:27]: Right.

Craig Tavares [00:42:28]: Government itself is going to be a massive consumer of AI so again, yeah, they, they do see it as a strategic position. So you have to adopt AI as government. There's so many government departments and factions too inside of that. But, but again what sovereign is really, it's, you know, it's control over the infrastructure.

Demetrios Brinkmann [00:42:43]: Right.

Craig Tavares [00:42:43]: Do I have domestic control of that infrastructure? Do I have domestic control of the operations? And that's all digital operations too. And the last one is data like that, that sovereign piece is really a data story. Like is my data itself sovereign as well? And now you can talk about sovereign AI. And again the way the government supporting this and Canada's been great for this by the way. Like, you know, the first thing we did was we elected like a minister of AI that, that brilliant. Yeah. We need a guy that, that's hyper focused on this, like you know, fueling innovation, fueling the economy around AI itself. And then we create a grant program.

Craig Tavares [00:43:17]: So you know that. And you hear, you hear about the headlines all the time too, right? Oh, I'm, you know, you just allocated $100 billion to again data center infrastructure and AI Saudi, they stood up humane and you know, they got another, you know, X billion of dollars there. And then you got Stargate where again they, they, they earmarked $500 billion to go build. Then if you look, you know, under the covers, it's no, we're going to deploy, you know, the first $30 billion. And you know, this is this progression.

Demetrios Brinkmann [00:43:47]: Yeah, yeah.

Craig Tavares [00:43:49]: So, but you know, again it, it is, it is a lot of, a lot of those headlines where you're going to see again investment and, but it's a stake in the ground too. And, and countries saying yeah, no, I get this is hyper important. It's one of the biggest revolutions that we've seen in our time. So yeah, I want to make sure one, that the funding is there so that you know, we protect our sovereignty. What. Number two, that we help businesses innovate and move forward. So again, yeah, data center GPU access isn't the bottleneck or the threshold to go innovate and build.

Demetrios Brinkmann [00:44:20]: Is there any other topics that you want to touch on that we didn't touch on as buzz?

Craig Tavares [00:44:24]: We've been building enterprise platforms, but we've also started to incorporate a sovereign philosophy and a sovereign framework in that. So when I say enterprise platform, that that's going above again your basic CL tooling that you would see to develop an AI project. It's again incorporating very rigid compliance, very rigid security techniques and technologies and systems. It's incorporating high availability, reliability, tight SLAs for our customers.

Demetrios Brinkmann [00:44:55]: Right.

Craig Tavares [00:44:55]: And then bring in the sovereign. It, you know, it's around again adhering to certain national or local policies and keeping governance and control over in the environment. So. So sovereign actually has become the new standard. And if you say, hey, I'm a startup and I don't even care about standard sovereign, you might be right. You may not need sovereign, but because sovereign is setting a high bar, you enjoy the reliability of that cloud. You enjoy the reliability and performance of those of those you don't have to.

Demetrios Brinkmann [00:45:23]: Sacrifice just to be sovereign.

Craig Tavares [00:45:25]: Correct. Exactly. Perfect.

Demetrios Brinkmann [00:45:28]: Dude, we got it.
