MLOps Community

Open Source and Fast Decision Making: Rob Hirschfeld on the Future of Software Development

Posted Jul 04, 2023 | Views 591
# DevOps Movement
# API Provision
# RackN.com
SPEAKERS
Rob Hirschfeld
CEO and Co-founder @ RackN

Rob Hirschfeld, the CEO and co-founder of RackN, discusses his extensive experience in the DevOps movement. He shares his notable achievement of coining the term "the cloud" and obtaining patents for infrastructure management and API provision. Rob highlights the stagnant progress in operations and the persistent challenges in security and access controls within the industry. The absence of standardization in areas such as Kubernetes and single sign-on complicates the development of robust solutions. To address these issues, Rob underscores the significance of open-source practices, automation, and version control in achieving operational independence and resilience in infrastructure management.

Demetrios Brinkmann
Chief Happiness Engineer @ MLOps Community

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.

SUMMARY

Rob Hirschfeld, the CEO and co-founder of RackN, discusses his extensive experience in the DevOps movement. He shares his notable achievement of coining the term "the cloud" and obtaining patents for infrastructure management and API provision. Rob highlights the stagnant progress in operations and the persistent challenges in security and access controls within the industry. The absence of standardization in areas such as Kubernetes and single sign-on complicates the development of robust solutions. To address these issues, Rob underscores the significance of open-source practices, automation, and version control in achieving operational independence and resilience in infrastructure management.

TRANSCRIPT

I am Rob Hirschfeld, CEO and co-founder of RackN, and I like my coffee black and weak, much to my wife's chagrin. She thinks I am weak-natured because of it. I like weak black coffee.

Hello, everyone. Welcome back to the MLOps Community Podcast. Today I am flying solo, talking to my man Rob Hirschfeld.

He is the CEO and co-founder of RackN. And boy oh boy, was this conversation awesome. Why is that, you say? Well, Rob has the DevOps mentality baked into his blood. He's been doing this for so many years, I don't even think he realizes how much of a fundamental piece he is and has been for the DevOps movement.

I mean, the guy invented the cloud. Can we just get that out of the way real fast? You'll hear him talk about it. I asked him all about that rainy day when he created the cloud, what he thinks of it now, and whether he feels it's been done justice by Amazon. I mean, he freely admitted that he did not know how to monetize it.

He did not realize what he had on his hands when it was created, and Amazon stepped in, and they did know what they had, and they were able to make a pretty penny off of it. But that is not to say that Rob has not been killing it when it comes to products after inventing the cloud. Right now he is the CEO of RackN, as I mentioned, and they are making DevOps so much easier.

The user experience for the DevOps engineer is incredible, and he breaks that down in this conversation. He also said something that, for me, is the key takeaway, if I leave you with anything here: he harped on the idea of exercising your code, and that stuck with me. I'm not gonna spoil anything or explain what it means, because I think he does a much better job and I probably won't be able to do it any justice.

But listen out for when he mentions "exercise your code." And without further ado, let's just jump into this conversation with Rob. All right, hope you all enjoy. And if you like the MLOps Community Podcast, why not subscribe or leave us a review, give us some feedback? But you know what can be the highest-leverage action that you can take?

That would be if you send this to one friend that you think would also enjoy it. So, bringing the DevOps fundamentals into the MLOps mentality: Rob Hirschfeld.

Do you know where I want to begin? I want to talk first about this stat that I saw, which is that you have copyrighted the term "the cloud," or you came up with that term. What is this about? Dave McCrory and I patented the cloud. That is nuts. So you were the first ones to call it the cloud?

We were, yeah. Tell me that story. When was that, and how did it happen? Oh boy, that is a great story. In 1999, Dave McCrory and I... and if you don't know Dave McCrory, he is famous for work he did around data gravity, actually when he and I were reunited at Dell.

He's the data gravity guru. Ah, wow. And so he and I have been inventing together since 1999. We formed a company to do application service providers, or ASPs. In the early days of the web, I had been doing consulting where I was writing applications and delivering them into people's offices and things like that.

And it was hugely painful. And so I was asking, how do I make it so that I can webify an application and make it more available? That was something people were hardly doing at that point. And Dave had come in doing Citrix work, which was remote desktop, application sharing, and things like that.

And we got together and built this startup where we would help companies take software that they had running on desktops and turn it into a cloud offering, although it wasn't called cloud at the time; it was just an internet offering. Right. And we got funding on that model. 1999 was crazy, up until there was a crash.

And so, mm-hmm, we had an interesting ride through this story. But in order to do the work, we had to create a complete application stack for each customer. So we would have a customer come in, they didn't know if they could sell anything or not, but we would put five servers together so that they could do their dev process to see if they could actually build a Citrix-enabled application.

We needed a SQL server, we needed all this stuff, right? Yeah. Yeah. And so this was the pizza box servers. 1U servers had just come out, and we were getting stacks of those servers, and we were buying like $30,000 worth of gear for each customer.

It was insane. Wow. And so we could not grow the business, because we had this need for so much hardware. Yeah. Dave took VMware, the desktop version, mm-hmm, and ran a prototype where he built that whole five-server stack running on a desktop and thought, oh, this is really cool.

We're gonna do that. And we tried to install it on servers, and that didn't work. We called VMware, and VMware said, yeah, we have this beta of this thing called ESX. Why don't you try and get it working? We were the first company to get ESX working anywhere outside of VMware's labs.

And so we virtualized all of those functions onto a single server. We were super excited, and things were pretty good, right? That was the beginning of, okay, this is saving us costs. But we had enough servers that we were managing ESX on five or six servers, right?

There was no shared storage. SANs were ridiculously expensive. This was the very early days of NetApp. There wasn't even good shared block storage. And just to set the scene, because it's important, right? This is 1999, inventing the cloud.

Yeah. We went to VMware's headquarters and we said, this is really cool. We love this software, but it's sort of hard for us to manage. Is it okay for us to write some management software? Mm-hmm. And we talked literally to Diane Greene, who was CEO for a long time, and she said, yeah, we're just worried about making this server better.

We're not so worried about managing it yet. And we said, oh, no problem. We sense a business opportunity. And we came back from that meeting and started writing patents about how to manage and provide an API, which is really what that patent's about, for infrastructure managed by another company.

Hmm. So it was all about multi-machine virtual management in that system. And then we wrote, I think, seven or eight patents around those concepts: managing VMs, globally distributing them, all sorts of cool things. That one got granted. It's been almost 20 years; the patent's about to expire.

Dave and I are gonna have a party when that patent expires. It's owned by Dell. We tried to get them to donate it to the OpenStack Foundation back in the day. Nobody enforces it, right? The way these patents work, a ton of people have referred to it. But yeah, my wife likes to tease me.

She's like, yeah, he invented the cloud. And I don't know if it's a story of valor or shame, because Amazon showed up, yeah, seven or eight years later with the right business model for it. We turned it into a dev/test system and never figured out how to monetize it, or figured out too late how to monetize it.

So, one thing that is fascinating to me is that you've seen these 25 years of cloud. What has changed and what's improved? What is still around, for better or for worse? Less has changed than people want to believe. They feel like a lot's changed, but there are things that are very persistent in this.

One is the confusion of the business model and the technology, right? So the thing that Dave and I were doing at the time was pretty novel, which was that you would rent infrastructure, infrastructure as a service. That has gotten so commingled with cloud, which I usually think of as API-driven infrastructure, or more elastic, dynamic infrastructure.

And I think one of the things that was less confusing at the beginning, and is actually even more confusing now, is commingling this idea of the business model and the technology. So that definitely has been there. The behind-the-scenes stuff actually hasn't improved that much. We actually like to talk about how little we've really improved operations and the work of operations in this.

It's still the same. Like we were joking, in '99 IBM was doing all this work on data center autonomics, right? Self-healing, all that. And we haven't really made a lot of progress with that stuff. We keep adding new services and microservices, and with all of those little services, the complexity of that environment has really exploded in ways that people complain about, but they don't know how to chase.

And the same is true. We've made these very, very narrow businesses. Back in the day, you assumed you were gonna have a bigger footprint of a business to make this stuff work. And now with some of the SaaS offerings, you can start in a very narrow, targeted way for a business, which is amazing.

But we end up with these very, very small use cases that become standalone businesses. Mm-hmm. And that is part of this cloud generation that's very different than what we saw. And is that causing more headaches? Because you mentioned the microservices adding more complexity.

Can you give us some examples of that, and what the pros and cons are, as I imagine it has both? It's sad to me, because RackN is a software provider. We sell software to people. They run that software. The level of control in that, and the disciplines necessary to distribute updates and things like that, are essential to our business, which is about automation.

But if you look at the services that you're consuming as a business, right, it feels like every week I add a new service into the mix. Yes. Yeah. And that service is generally not integrated with anything else. And so I now have my day spread across a ton of apps. I have sensitive information spread across a ton of apps.

Right. I've been playing with GPT, ChatGPT, a lot, right? I'm cutting and pasting that everywhere. Which is okay; I don't want it scanning my systems. But, right, the next generation of this, they're gonna be crawling through my Slack. They're gonna be running through my Git repos.

Right? Yeah. The number of times you turn around and say, oh, I want a security scanner for Go. I'm gonna go to the company that writes security scanners for Go and then give them access without restrictions to all of my Git repos, and maybe my G Drive, right? Mm-hmm. And so now I have a company that has access, for a very narrow reason, to all of my information.

And that trend line should be terrifying people, and we're very used to it, frog-boiling-in-water used to it. We're potentially vulnerable to some startup's smallest security violation, or to it going out of business. It's really frustrating. Like, if I wanted to pull Slack back in and run it myself, I'd have to switch to Mattermost, but even Mattermost is really now focused on hosting. We've just given up on people running their own stuff.

And man, that is fascinating. I a hundred percent agree with you on that. A lot of the time you give access to way more than is needed. It's like biting off more than you can chew, in a way, because you give access for one thing, like this Go example you were talking about. Yeah. But you give access to the whole thing; you let someone into the mansion when really they only needed to be in the courtyard.
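
The mansion-versus-courtyard point can be sketched as a deny-by-default grant. This is only an illustration under assumed names: the `Grant` class, the tool name, and the repository names are hypothetical and do not correspond to any real scanner or hosting API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """Access handed to a third-party tool (hypothetical model)."""
    tool: str
    # Explicit allowlist of repositories; empty means no access at all.
    repos: frozenset = field(default_factory=frozenset)


def can_read(grant: Grant, repo: str) -> bool:
    # Deny by default: only repositories named in the grant are readable.
    return repo in grant.repos


# The scanner only needs the one Go service, so that is all it gets.
scanner = Grant(tool="go-security-scanner", repos=frozenset({"payments-service"}))

print(can_read(scanner, "payments-service"))  # True
print(can_read(scanner, "hr-documents"))      # False
```

The design choice is that widening access requires an explicit edit to the allowlist, rather than being the default.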

Yes. So I feel like you probably have a very strong opinion about the better way of doing this: what we should be doing instead, how to provision, how to make sure that things are secure, and how to make sure that the data and the information is only given when it needs to be given, and to whom it needs to be given.

Oh boy, I wish I actually had answers to that problem. Oh boy. We keep sort of bypassing the question of security and access and controls because it's so hard. The UX is so hard for people to get right. And it's definitely something that the industry hasn't moved towards standards on.

Mm-hmm. We're starting to see a little bit, like in the Kubernetes space, around some ingress controllers and some spaces like that. But even that's really not helping us create standards around this. Mm-hmm. Even single sign-on is not particularly robustly supported by all apps, or if it is, it's using single sign-on providers that are mining my information for habits and behaviors.

It's a little bit of a dystopia compared to what we were hoping. When we did the startup, it was called ProTier, I don't think I ever named it. Back in the day, part of the idea was that you could put a couple of applications together but still have shared storage and things like that.

And every app, and this is actually a big problem in operations tooling: we keep designing stuff that's designed to have its own data store, its own source of truth, and we're not designing things that are good at sharing information. And that, to me, even more than the security piece, which I think is important, is really the starting point of the problem: you have to have things that expect not to be the sole source of truth in an environment.

We keep doing that to ourselves. Isn't it kind of like, we could draw a parallel here with the ads that you see on Facebook or any social media platform? It's because they've had the cookies that have been following you around the internet. And so now they have this picture of who you are, and they have all of this data about you, and so then they can better serve you ads.

I look at it like, oh, well, if we want the apps that we're using to perform better in our ecosystem on all of our data, then they have to have access to all of the data. A lot of times, for the people who are benefiting from that data, that data is their monetization strategy. And we've gotten used to not doing this, so there's a barrier here, but the idea is that you're gonna pay for the service that you're using, or pay to run the software, to increase your control and privacy.

The inconvenience of that is high, and the consequence of not doing it is pretty low, right? Yes. So, you were making a comment about my camera stream for my background, right? I had to hunt around for a device that had the option to let me retrieve the stream locally, inside of my own LAN.

Mm-hmm. Most of the cameras that you get, and they're much cheaper, which is stunning, actually will only send the video out to their service, and you can only retrieve the video from their internet site. So you can't connect to the camera directly at all. That has become a premium feature. And so we have to figure out as an industry how to make it easier.

Because I don't think this is just the consumers. I think this is where the integration comes in: how to make it easier to keep our users in control, our operators in control, keep our data local. But that might mean slowing down. It might mean agreeing to protocols. It might mean saying, oh, I'm not the source of truth.

But if you were gonna be like, oh, that sounds great, Rob, I wanna integrate into a controlled protocol: which one of the six are you gonna pick? Yeah. And I can give you an example out of the ops space, if it's helpful. Yeah, yeah, yeah. Please do.

So, a couple of years ago, HP did the yeoman's work, and I have to give them some credit for it, of establishing this protocol called Redfish. Mm-hmm. I don't know if you're familiar with it. If you're familiar with bare metal, it is a protocol that's used for out-of-band management on servers. Before that there was something called IPMI, which was very old and incredibly fragmented.

And then each vendor had their own. And so HP said, we're gonna consolidate, we're gonna open source this protocol, we're gonna build a consortium of vendors, and we're gonna standardize on Redfish. Mm-hmm. Trumpets sounded, angels came down and blessed the servers.

Yeah. The classic story, but the reality is now you have another standard. Well, the problem is not only do you have one more standard; every vendor, to make it work, has their own variation on that standard. And so you went from N standards to 2N standards. It helped a little, because there's definitely a commonality, but it's not enough to make it so you can plug in a Dell server instead of an HP server and have your management tooling just work.

And so that's a real deficit in how all those pieces fit together. And then the other thing that people forget, that's very real in this, is that time is actually a problem, because the protocols improve over time. And what ends up happening is you don't just get to say, am I using Redfish?

This is true of every protocol. It's not just, am I using Redfish? You have to be like, oh, am I using Redfish before they fixed the bug in how the security protocols interact with the server, or after? And of my fleet of servers, which ones are patched or not? So when you're dealing with this, this is why it's more work, and we sort of just throw up our hands and say, I'm gonna let the SaaS provider do that.
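
The bookkeeping described here (which protocol, which revision, patched or not, per server) can be sketched as a small inventory. This is a minimal illustration, not how any real tool does it: the server names, vendors, and revision labels such as "rev-1" are made up and are not real Redfish or IPMI release identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    vendor: str
    protocol: str   # e.g. "redfish" or "ipmi"
    revision: str   # illustrative label, not a real protocol release
    patched: bool   # whether the (hypothetical) security fix is applied


def group_fleet(servers):
    """Group server names by (protocol, revision, patched)."""
    groups = defaultdict(list)
    for server in servers:
        key = (server.protocol, server.revision, server.patched)
        groups[key].append(server.name)
    return dict(groups)


def needs_patch(servers):
    """Return the names of unpatched servers, sorted."""
    return sorted(s.name for s in servers if not s.patched)


fleet = [
    Server("rack1-a", "vendor-x", "redfish", "rev-1", False),
    Server("rack1-b", "vendor-x", "redfish", "rev-2", True),
    Server("rack2-a", "vendor-y", "ipmi", "rev-1", False),
]

print(needs_patch(fleet))  # ['rack1-a', 'rack2-a']
```

Each distinct key in `group_fleet(fleet)` is a combination the management tooling has to handle, which is where the extra work comes from.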

Mm-hmm. You have to deal with which protocol you're using and which version of that protocol. You have to deal with systems that are patched or not patched. There's a lot of work. It's valuable work, but it's work. Well, that is something that I wanted to get into with you: the impact of using open source.

Yeah. And having that capability to go and know everything about what you're doing and roll your own, in a way, versus using a SaaS provider and being able to go fast, as you were mentioning, and not necessarily having to think about all of these bells and whistles and knobs that you have to turn, or patches that you have to be updated on.

And so you inevitably have thought a lot about these trade-offs over the years. Yeah. How do you look at that these days? There's vendored open source, which I actually differentiate, because open source itself becomes a layered piece, and then SaaS. And even SaaS has the open-backed and non-open-backed.

Although at the end of the day, I think if you're using a SaaS, you're not really using open source technology. Mm-hmm. People might get upset when I say that. But the operational expertise to run software is its own distinct thing.

Mm-hmm. And no SaaS provider I've seen opens that up. I'm sure there's an example, but by and large, they don't tell you what they're doing. Certainly not the hyperscalers; that is incredibly secret sauce, how that works. So what I see in that is the effort to maintain your own stuff: track the versions, track the dependencies, patch the operating systems, right?

All of that, you have to do. We have not done a good job as an industry helping people do that in a repeatable way. Hmm. This is actually the travesty that started RackN, if you wanna indulge me with the story about this. Yeah, please do. So before we started RackN, the team was at Dell in the early OpenStack and Hadoop days, and we built the first OpenStack installer, something called Crowbar.

And it was beautiful. You could basically plug in a server and then build an entire cluster automatically. It literally just did all the work for you. And we would take that to a customer site. We would ship a rack of servers, they would plug it in, and they would build OpenStack or Hadoop on that stack of servers.

It took about an hour. And it was lovely. But what would happen is, every time we went, we would learn things from that experience. We'd come back to the office, we'd fix things, we'd add capabilities, the versions would change, we'd update all that stuff, and we'd go to the next site and improve the experience.

And by the time we got through like 10 of those, we were really good and fast at building that whole experience and making it go. I will tell you, it never worked out of the box, because each site is different. We show up and people are like, I thought I had this, but I really have that.

Can you reset? But that is normal. That wasn't a travesty. We got really good at showing up and getting it done. And then we actually had a benchmark: the benchmark of success was when you were flying home and the customer had done the reset, tune, and reset themselves while you were on the way home.

Nice. So it was amazing. But when we got to that 11th install, we said, all right, you know what? We really have learned a lot. Let's improve installs one to 10. We had no way to fix anything that we'd done before. Every site, even though we'd used a completely automated process, ended up unique.

Mm-hmm. Which is not software. So when we look at this problem, if you're a software provider, if you're writing a product, you must have all of your customers able to migrate together so you can support them. Otherwise, you're just making a faster consulting practice, which is not solving this problem.

Right now, each person is still an island. What we set out to fix was, how do you make it so automation can be reused, so that if I give you a new version of my OpenStack install, you can go back to the site you already installed, put that in, and rerun that patch, right? So from an engineering perspective, I can actually build something and have people keep reusing it.
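
The idea of going back to an already-installed site and applying a newer version of the same automation can be sketched as versioned releases applied in order. This is a toy model under stated assumptions: the `Site` class, the `RELEASES` table, and the step names are hypothetical and are not how Crowbar or RackN's software actually works.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    automation_version: int = 0
    applied: list = field(default_factory=list)  # steps run by upgrade()


# Each release lists the steps it adds on top of the previous one
# (hypothetical step names).
RELEASES = {
    1: ["install-os", "install-openstack"],
    2: ["fix-network-config"],
    3: ["patch-openstack"],
}
LATEST = max(RELEASES)


def upgrade(site: Site, target: int = LATEST) -> Site:
    """Apply every release newer than the site's current version, in order."""
    for version in range(site.automation_version + 1, target + 1):
        site.applied.extend(RELEASES[version])
        site.automation_version = version
    return site


# Two sites installed at different times converge on the same version.
sites = [Site("site-1", automation_version=1), Site("site-2", automation_version=2)]
for site in sites:
    upgrade(site)

print([s.automation_version for s in sites])  # [3, 3]
```

Running `upgrade` again on an up-to-date site does nothing, which is what lets every site take the same fix instead of each install drifting into its own island.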

I can have a community of people actually collaborating and helping each other. And this, to your question, is what's missing, right? If we go back to my camera, it would be really cool if I could download, hey, I wanna store my videos locally, here's a standard way to do it. Run that software and make it work, and be confident that I could buy this gear and buy this camera and install this software and have it go, and then be able to hit a button and upgrade and patch all that stuff.

There are not a lot of incentives in the market at the moment that make it easy to do that, or standards that make it easy. There are a lot of barriers to it. The benefit of being able to do that yourself and have that type of autonomy is really high, because then you're not paying somebody yet another subscription fee to service whatever camera thing.

Or not paying them a service fee and assuming that they've gotten the rights to watch all your videos. Right. Neither scenario is particularly attractive to me. Well, in this story that you lay out, you talk about the automation, and installs one through 10 not being... automatable, that's the word that I'm looking for.

It's not that it's not automatable. There's a thing I'd like to talk about: the way to combat complexity is to exercise the systems that you have. The challenge here is that you had 10 people, 11 people, all using OpenStack successfully, and they're all pretty happy, but none of them actually can participate as a community in improvements in the future.

And the complexity piece, what I mean when I say the antidote for complexity is exercise, is: I've got these 10 people, they're all slightly different, but they're using 99% the same automation. If I find a bug on one site, the chances are they all have the bug. Yeah.

So I have reduced the complexity of the whole system by exercise, by having 10 people exercising the code instead of one by one by one. But what I'm missing, and this is actually what we set out to fix at RackN, is I wanna be able to say, well, if one gets the benefit of exercising, then I wanna be able to repeat that across all of the other customers.

They need to be able to take that fix. And then what happens is the community gets bigger and bigger. Everybody's experience, everybody finding those edge cases and testing things, creates a mutual benefit. Yeah, yeah, yeah. But that doesn't happen by accident. It's actually something you have to invest in.

But why is that not happening? I didn't understand that. It doesn't happen because, especially with all these SaaS offerings, right now the cloud disincentivizes that from happening. They don't want you to do it. They want you to let them take the expertise and do that work. Mm-hmm. It doesn't happen because it requires more effort for the individuals to stay on a system, or participate in a system, that they can share.

Right. Right now, when people build systems, they have a tendency to build the automation they need. This happens all the time. So if you download a Chef or Ansible script or something like that, you're like, this is really cool, but it has all this extra logic that I don't need for my environment.

And you start ripping it out. Yeah. Because it's opaque complexity. And so, to make it easier for you to maintain that playbook, you take out all of the stuff that people have added for other environments that aren't yours, or defense mechanisms that you're not gonna hit. You don't keep it in, because there's really no incentive for you to. You're like, well, I'm only in Amazon.

I'm gonna take out the Google and Azure stuff because, yeah, I can't support it and I am not planning to take the upgrades. Now, if you were planning to take the upgrades, those other cloud profiles are actually valuable, because they mean that you're now exercising that playbook on three clouds and not just one.

But as operators, we don't typically look at those added pieces as beneficial to us if we're not actively using them. Right. Does that help? And when we build an infrastructure pipeline, the pipeline does a whole bunch of work that most people don't care about.

They're trying to solve a very narrow part of that pipeline. They get the benefit of the whole pipeline working. But most of the time, this is the behavior that we have to overcome: if you hit code and you're like, well, I don't understand why that code is doing that, I'm just gonna rip that out.

I'm just gonna rip that out. Yeah. I used to have this fight internally in the team too, by the way. Right. I I can I have a story about this might illustrate it. Boy, we had built these, these beautiful pipelines for bare metal hardware. We do a ton of bare metal hardware, uhhuh. And so my co-founder, our CTO, had this pipeline that literally you plug in servers and it discovers inventories, qualifies, updates, patches, installs the os, installs, the applica.

I mean, it just is beautiful. Completely automated pipelines. And I took it, because I'm guilty of this too, and I said, you know, that's really cool, Greg, but I just need to do the cloud. And in the cloud systems, I get the server and all that stuff is already done. I just wanna skip to the end of the pipeline.

And so I took his pipeline, I copied it, cut out all the bits that were for bare metal, and was running it, and I'm like, look, this is great. Look how fast it is. And I mean, literally, I could hear his forehead hit the desk, and he is like, Rob, you missed it. Leave all that stuff in.

I'm like, yeah, but some of it doesn't work. The BIOS stuff doesn't work when you're on the cloud. And he's like, then fix the fact that it doesn't work. Because what happens is, and this is the way it is today, when you run the whole pipeline in every environment, it's exercise. That means that every time I run the cloud pieces, I'm actually exercising that automation and am able to move it into any environment.

So now if I run a whole bunch of stuff in the cloud and somebody calls up and says, yeah, this is really cool, can you also run it on bare metal? I'm like, well, of course you can. That pipeline is exercised in the cloud, in virtual machines. Oh, nice. And on bare metal. And so now I've gotten the benefit of the operational practice and the exercise across the whole spectrum, because I fixed the place where it was failing to apply BIOS settings to a VM. Silly.

But instead of removing it, I fixed that behavior so I could keep the exercise in the system.

So instead of looking at it as operational bloat and thinking, this is something that we're not gonna need ever, so I'm gonna get rid of it because it doesn't have anything to do with me, it's not my monkeys, not my circus, I'm gonna just get that outta my life so I don't have to deal with the headache of it not working and getting errors every time I want to do something, you're saying just change the frame of mind, because who knows what the future holds and you want to be able to change on the drop of a dime.

Correct. As you exercise that code, you participate in the community, even though you're not actively thinking you're participating in the community. The fact that you're using something that is versioned and updateable, right, those things all support this reuse and accelerated effect that is much more powerful.
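The fix Rob describes, keeping the BIOS stage in the pipeline and making it safe on a VM rather than cutting it out, can be sketched roughly as follows. This is a minimal illustration only: the `Machine` class, the stage names, and the `kind` values are hypothetical and are not Digital Rebar's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Machine:
    """Hypothetical machine record; `kind` is 'bare_metal', 'vm', or 'cloud'."""
    name: str
    kind: str
    log: List[str] = field(default_factory=list)


def discover(machine: Machine) -> None:
    machine.log.append("discover")


def configure_bios(machine: Machine) -> None:
    # Fix the stage instead of deleting it: where there is no BIOS to
    # configure, record a skip and let the pipeline continue.
    if machine.kind != "bare_metal":
        machine.log.append("bios:skipped")
        return
    machine.log.append("bios:applied")


def install_os(machine: Machine) -> None:
    machine.log.append("install_os")


# One pipeline for every environment, so every run exercises every stage.
PIPELINE: List[Callable[[Machine], None]] = [discover, configure_bios, install_os]


def run_pipeline(machine: Machine) -> List[str]:
    for stage in PIPELINE:
        stage(machine)
    return machine.log
```

Running the same `PIPELINE` against a cloud instance and a bare metal server differs only in what the BIOS stage records, which is the point: the cloud runs keep exercising the path that bare metal will need later.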

And this is the thing, I think, for a lot of us, and I've watched this, because some people think the antidote to complexity is simplicity, is removing items, which is what we see all the time. And I hear this. It's like, well, we standardize our data centers on Dell because we're just gonna be a single vendor and it's simpler for us to build our tooling.

And I'm like, that works until Dell has a new version that breaks the old stuff. 'Cause that happens. Or you acquire somebody and they switch. Or, yeah, the pandemic hits and I can't acquire servers anymore from that vendor and I have to switch. Right. It's just bad business.

Or it happens for no technical reason except that your procurement people say, I got a good deal. Right? Or your CEO played golf with the Cisco vendor. And so all of those things are real. What we've gotten used to is just tearing out stuff that we don't think needs to be there.

And it's across the industry; we have the same problem. Right. There's a joke about legacy code. People are like, what's another name for legacy code? It's like: proven, tested, resilient. A lot of stuff in code is there because somebody learned a lesson and had to put in protections. And we just need to reframe it from that perspective. Yeah. It doesn't mean I'm not in favor of refactoring or streamlining when you can, but there's a lot of cases where we don't look at the system effect when we do that.

Well, there is something that I think you mentioned beforehand about automating infrastructure inefficiently, and it feels like this.

What we're talking about kind of plays into that, and I also would love to look at it through the lens of the machine learning engineer and who's building with machine learning use cases and potentially building on top of the data foundations that the company has and how you see that when it comes to these automations, and especially like automating code, automating infrastructure, and doing it inefficiently as opposed to things that you've seen work.

Oh my goodness. Especially for ML workloads, I think, because they tend to be big, they tend to be ones that have a fair bit of complexity in the infrastructure, and there's a high desire to have very rigorous repeatability across that, right? Yeah. That fleet.

But what that does is it actually undermines some of the counterintuitive pieces, and I'll explain that. So one of the big barriers I hit was when we started talking about immutability. My chief solutions architect and I used to have this argument all the time.

He's like, I want to re-image a whole server. I wanna burn a server down, and I wanna build it back up from immutable artifacts. And I'm like, there's no way. That's slow. It's hard. I don't wanna risk breaking the whole server because we're gonna re-image it. And I'm like, well, why don't we just take the server and patch it?

Just apply the little patches. It's so fast, we can just surgically tweak it. And he was entirely right.

What we've seen as we implemented those practices is that when you take this sort of artifact approach, this immutability of saying, this is what I want my server to look like, and then you reset it from zero back into that pristine state, that immutable state, what it does is give you a huge amount of certainty in what you have deployed and how. And the thing I was reacting to, and this I think is one of the bad habits that we get into, is you have battle scars that say, yeah, that system didn't reset correctly. Or, I have a high failure rate on the resets. And I see this a lot.

A lack of reliability is the crisis in automation that people have to think about. If you're nervous about running an automation script or pushing a button to reset a server and burn it down to zero and reset it, then address your reliability problem first, right? That process needs to be reliable.

And what we've seen is, one, a full system reset is actually faster than patching in a lot of cases, because that surgical patch that you think is very surgical ends up taking a lot of time. If you start with an already patched, everything's-right artifact with the applications installed, actually laying that image in and booting it is very fast.

If you go through an apt-get update and apt-get this and apt-get that, and you're pulling things in from all sorts of repos, that is the most fragile process in the industry. Anytime I do apt-get update or yum install whatever, it goes to the internet, it goes to repos, it's gonna tell you you're out of date, it's gonna pull in all sorts of random stuff that you weren't thinking you had to do.

That process is incredibly fragile. It's very slow in comparison. And so if you build that image, do all that work once, and then clone it to your entire fleet, and then solve the reliability issue so that you can confidently roll machines over like that, the difference in performance of your system is remarkable.

And then, right, we have customers who rebuild an ESXi cluster. They just push a button, they reset the whole cluster, BIOS, everything. They push a button, they go get coffee, they come back and they resume their work. Right, but it's mind-changing, 'cause now they're like, well, it's out of date. Let me just reset. I know what the correct state looks like.

Let me go do that. They stopped wondering: is my system all the way patched or partially patched? Did patching this break something? There's all sorts of mess that people create with incomplete updates or partial updates or not knowing the state of their system. And then there's no standalone system.

Everything's interconnected to other things. And if you're not sure the version it's at or the patch date it's at, or if somebody hacked it... Right, right. It's beautiful. But it's a completely different way of thinking about it.

The other thing I would say in this is turn rates are really valuable. I keep talking about exercise. Another part of exercise is just being very comfortable with turning things over. We had the days when you would have a system and be like, it's been up for a year,

and you'd pop champagne corks and be excited. Nowadays, if you have systems that haven't been reset in 30 days, right, you should actually be like, oh, wait, my servers are too old. This is cloud or on-prem, I don't care where they are. If you're not able to roll all of your infrastructure on a 30-day basis or faster, then you're not keeping up with patches.

You're not refreshing. You don't have the resilience in your operations to actually deal with an emergency patch or change or something like that, right? If every 30 days you're cycling through stuff and somebody comes out with the next Heartbleed, you're like, oh, okay, that's gonna impact all my systems.

Like when we had the Java things, and people lost months of work finding, discovering, fixing, patching. Mm-hmm. You wanna be able to say, oh, I know how to fix this. I got the fix queued up. Start a rollout process. And you want to be able to say, instead of taking 30 days, I'm gonna do my rollout over the next week.
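As a rough illustration of the 30-day rule and the compressed emergency rollout described here, the following sketch flags machines whose last reset is older than 30 days and spreads a list of machines across a chosen number of daily batches. The function names and the shape of the `last_reset` mapping are assumptions made for this example, not part of any real tool.

```python
from datetime import datetime, timedelta
from typing import Dict, List

MAX_AGE = timedelta(days=30)  # the turnover target mentioned in the conversation


def stale_machines(last_reset: Dict[str, datetime],
                   now: datetime,
                   max_age: timedelta = MAX_AGE) -> List[str]:
    """Return the names of machines not reset within `max_age`, sorted."""
    return sorted(name for name, ts in last_reset.items() if now - ts > max_age)


def rollout_batches(names: List[str], days: int) -> List[List[str]]:
    """Spread `names` round-robin over `days` daily batches."""
    if days < 1:
        raise ValueError("days must be at least 1")
    batches: List[List[str]] = [[] for _ in range(days)]
    for index, name in enumerate(names):
        batches[index % days].append(name)
    return batches
```

The same `rollout_batches` call covers both cases in the discussion: a routine cycle with `days=30`, or an emergency fix pushed through with `days=7`.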

If you don't have those processes in place, you're vulnerable. And as you were mentioning earlier about the legacy code and that being resilient and tested and basically battle hardened. Yeah. How do you look at these two things and how do they fit together? You mean patching versus the thinking about immutability?

Yeah. I'm thinking more along the lines of having this legacy code and then also being agile enough to burn everything down every 30 days. I don't see them as incompatible. If you have legacy code, the thing that makes it scary to people is they don't know how to recreate the build.

They don't know how to patch it. They don't know how to upgrade it. They don't know what the environment is. And so the challenge with the legacy piece is not the language that it was written in or the age of the code. The thing that gets people in trouble when it gets to legacy is they don't know how to maintain it.

They don't know how to patch it, they don't know how to upgrade it. They've basically created a static artifact. And so it is worth noting, right? When I talk about having these immutable artifacts and version controlled everything, it's only useful if you have an automated process to build that artifact, right?

Mm-hmm. And people forget, immutability doesn't just mean that I have a whole bunch of stuff sitting in a locked vault. What it actually means is that I have a way to reproduce those artifacts in a predictable way and then apply those into my environment in a predictable way. Yeah. The legacy thing that gets people nervous is when they have a FORTRAN program that nobody knows how to build, debug, test.

Right. That's installed on a mainframe. Four generations ago, the person who set that up and made the decisions went away. That should make people nervous. But you can have modern programs, right? Oh, here's the magic Docker container that you run in Kubernetes.

And how did I build that? I don't know how I built it. That person's gone. Don't mess with the container, just run it. That should set off the alarm bells just the same. That's what we're talking about here. The legacy is the lack of repeatability in the system. Mm-hmm. And how do you look at that when it comes to

throwing or sprinkling on data? Because it feels like a lot of this stuff that you're talking about is very much in the software world and very DevOps focused. But then when you start to incorporate data in there, and data isn't as, I would say, agile as you are making these things out to be, how do you see the difference there?

Oh my goodness. It can't be as agile, without a doubt. I do think that collection of metadata, access to that data, how it's partitioned and shared, the way I see it is you should have automated, repeatable processes for doing those things. So, mm-hmm.

This actually goes back to what we were talking about with improving access to things. Some of it is that the tools actually have pretty good access controls, but we have a tendency as humans to interact, right? Turn the knob, set it up and say we're done, and spend less time actually figuring out how to automate what those processes and controls are.

Mm-hmm. And so what I would suggest, and I realize my world is DevOps, so I'm putting a very DevOpsy spin on this, but you know, you should be able to get hands-off in providing the access to that data, replicating the pieces of the data. Right. We're back to: is there an immutable process or a versioned process?

Is there a process that other people can see and inspect? Understanding how the parts of your data are replicated and put together. We do need to understand how all that stuff fits together. Mm-hmm. I'm glomming a whole bunch of data under "that stuff," but you do have to have a way to manage all of those pieces, protect all those pieces, standardize how this stuff works.

So, Rob, I wanna talk about the idea of RackN and, sure, what you're doing over there, and get into what exactly it is and how it helps people and how potentially it can help us make sure that we have these automated processes and we're doing things effectively and not ineffectively, and basically leveraging and standing on the shoulders of giants who have been in the cloud business for the last 25 years.

I am happy to try and shed some light on this. What we do with RackN, a lot of it addresses the problems we've been laying out. Right? Obviously that's where we see things that are going on. RackN writes a software product called Digital Rebar. Digital Rebar is a workflow engine for automation and infrastructure.

So there's some orchestrators out there that people are familiar with, or job schedulers in the MLOps space. Digital Rebar is specifically an infrastructure tool. So the semantics, the nomenclature, the objects in that system are infrastructure focused, and what it does is it builds an infrastructure pipeline, which is very similar to a CI/CD pipeline or a data pipeline in that you can sequence a whole bunch of tools and operations together in a repeatable way, so you can start a process and then flow data, and in our case data is the state of the infrastructure,

through a series of transforms. For us, that means Terraform and Ansible or shell scripts or talking to other APIs, but you're talking to all of the components of the infrastructure in a version-managed sequence to get to a desired state. And then one of the things we did that's really been fun is that we also have a way to do jobs.

I'll put this in ML terms. But you know, we have transforms, which is the infrastructure pipeline going from one state to another or resetting a rack of servers, things like that. But we also found that that automation can be applied in work orders, little jobs. And so people wanna be able to say, you know what?

I just need you to run a security scan. I don't want you to change anything. And so that type of small operation or work order is also part of the system. So you have transforms, and then you can schedule a whole bunch of work. And then the other thing we've done with this is we've created a shared state system, in that it's very easy to send state from Digital Rebar to other systems, 'cause you have to participate in the infrastructure, or provide access from other systems into Digital Rebar.

So we find people wanting to do things like putting a dev portal in front of Digital Rebar and then letting the infrastructure do its thing. CI/CD systems, Terraform in front of it. You know, what we find from an ops perspective is that people want a consolidated and consistent operational experience.

They wanna abstract out: is it cloud, is it VM, is it bare metal? Have that same pipeline run everywhere, like we were talking about. Mm-hmm. But then not force people into having an operations experience when they just want a server, when they want a dev experience. So for us, a dev or, I think, a data scientist, right, would say, I just need you to build me a cluster that can do ML analysis.

I don't care about the process that you use. I don't care about the operating system, I don't care about the cloud or the infrastructure. Just, just set everything up. I just wanna click a button and get a cluster back. There's a ton of operational work that has to happen to make that go. That's what Digital Rebar does.

Simple API, whole bunch of operational work that's consistent, portable and reusable, but back into a very simple API for a consumer to use. And then we do a ton of work as a company, because it's software, to allow that to be a platform that somebody can use without giving up control of their systems.

Right. Especially if we're talking about data here. Right. None of our customers want us to tunnel in to their data center or their data set. Like this whole thing we were talking about. And we don't want it. I don't want access to anybody's data. That's the last thing I want.

The other thing that nobody wants is for us to make changes to their automation platforms that are in our benefit, because we want everybody on the same version or upgraded or patched. Mm-hmm. So the fact that we're software is really important for automation, especially since we do distributed multi-site control planes with customers.

So they have global fleets of infrastructure. But this works even for teams, multiple teams collaborating around the automation. You don't wanna have to say, well, I patched my automation to include this new feature and force it on everybody else. Uh-huh. So because of the way we do versioned automation, you can actually control when people pick up that automation. They can pick it up, but they don't have to pick it up.

That control is really important for people. And all that just comes out as immutable artifacts that you download into Digital Rebar and then make it run. But then each customer has their own version and they maintain it themselves, and we work really hard to make that simple.
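The "pick it up when you choose to" model for versioned automation can be illustrated with a small sketch. Everything here is hypothetical (the `Catalog` and `Site` classes and their methods are invented for the example and do not describe how Digital Rebar packages content): published versions are immutable, and each site only changes behavior when it explicitly moves its own pin.

```python
from typing import Dict, Tuple


class Catalog:
    """Hypothetical store of immutable, versioned automation artifacts."""

    def __init__(self) -> None:
        self._artifacts: Dict[Tuple[str, str], str] = {}

    def publish(self, name: str, version: str, content: str) -> None:
        key = (name, version)
        if key in self._artifacts:
            # Immutability: a published version is never overwritten.
            raise ValueError(f"{name} {version} is already published")
        self._artifacts[key] = content

    def get(self, name: str, version: str) -> str:
        return self._artifacts[(name, version)]


class Site:
    """Hypothetical consumer that pins the versions it has chosen to run."""

    def __init__(self, catalog: Catalog) -> None:
        self._catalog = catalog
        self._pins: Dict[str, str] = {}

    def pick_up(self, name: str, version: str) -> None:
        self._catalog.get(name, version)  # raises KeyError if never published
        self._pins[name] = version

    def automation(self, name: str) -> str:
        return self._catalog.get(name, self._pins[name])
```

Publishing a new version changes nothing for a site until that site calls `pick_up`, so one team adopting a new feature never forces it on another.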

And so is there the capability, then? You mentioned you can just ask for certain things and you don't care about other things and you just say, gimme what I want. I don't care about the operating system, I don't care about the cloud, any of that. I imagine there is the capability to say, I do want it on this cloud.

I do want it with this operating system. Mm-hmm. And you can get that granular? Yes. Yeah. But the system uses declarative APIs, so the goal is, if you start a pipeline, the system can fill in, or more importantly, the operators who set it up can actually provide ways to fill in, all those details.

So you can say, I need to use this cloud specifically. Or you can say, I just need you to fulfill this request, and not care how. That abstraction is really important, right? Because what we've built are these pipelines that run anywhere. And we do this internally all the time.
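One way to picture a declarative request in which the consumer states only what they care about and operator-supplied defaults fill in the rest is the sketch below. The field names and default values are placeholders chosen for illustration; they are not taken from Digital Rebar.

```python
from typing import Dict

# Hypothetical defaults that an operator would configure once for all consumers.
OPERATOR_DEFAULTS: Dict[str, object] = {
    "cloud": "operator-preferred-cloud",
    "os": "operator-preferred-os",
    "nodes": 3,
}


def resolve(request: Dict[str, object],
            defaults: Dict[str, object] = OPERATOR_DEFAULTS) -> Dict[str, object]:
    """Merge a partial request over operator defaults; reject unknown fields."""
    unknown = set(request) - set(defaults)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    # Anything the consumer specifies wins; everything else is filled in.
    return {**defaults, **request}
```

A data scientist who just wants a cluster would send an empty request, while someone who needs a specific cloud sends `{"cloud": "..."}` and still gets the operator's choices for everything else.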

I will test new builds and new development in cloud infrastructure 'cause it's dynamic and inexpensive. But because of the way we build that pipeline, that pipeline then will work against a bare metal system because the abstractions are fine for it. Interesting. Because, right, what we've got is this long pipeline of work that is consistently applied, and if I add something into the pipeline, then I can make it go. Now,

this is where the community effect gets really interesting. I might only build for Ubuntu if I'm only testing for Ubuntu or Amazon Linux or something like that. But once I have that code, then we can come back and say, yeah, you know what, I'll also make it work for CentOS or Windows. Right?

We, we have ways to keep extending individual tasks in the pipelines to broaden their scope or add different APIs or detect where they are and make adjustments to what's going on. Mm-hmm. That, that's, it's a really powerful feature. When people look at it the first time, they're like, oh, wait a second.

We're doing a ton of stuff. And that can be a little overwhelming. It's important for people to be like, oh, wait a second. They've baked in these protections and standards and things like that. And even if I don't understand why it's doing this task, I shouldn't turn it off. We have people who do turn it off, and then they come back a couple of months later, like, I needed it turned on. And we're like, all right, did you turn it off? They're like, oh, I did. I copied all that stuff and took out the things I didn't need.

And were they breaking, like we were talking about earlier? Yeah, yeah. No, they pull it out even if they were working. They just thought they were unnecessary complexity. They just hadn't reached that point in their journey yet. Yeah. And so that's part of how all these things work.

Most operators now understand. They're like, okay, wait a second. Maybe I only buy Lenovo gear, but the fact that you can do the other types of gear is good protection for me. But it's a different mentality in looking at how these systems go. Just like switching to immutability.

People are like, I don't wanna do that. I've gotta build images. Now that's painful. And you're like, well, yes. But once you've built that image, everything you do is gonna be more resilient and faster. And operators have to have the time to do that. It feels like that is something that you learn with experience after.

As you mentioned, you could rattle off a few different scenarios just off the top of your head when shit hits the fan. Yeah. And I imagine those first couple years before you run into the first whatever, maybe the pandemic hits and you can't get servers anymore or whatever. The five other scenarios that you were able to just talk about, obviously from battle scars that you have gone through Yep.

Over the years, it seems very clear why you would want to keep that as opposed to ripping things out and saying, well, we're never gonna need this. This is, to me, the reframing of perceived complexity as, mm-hmm, right, as exercise or something useful. Yeah. And I do a lot of thinking about this, because everybody runs around pulling on their hair, being like, oh, systems are so complex.

I'm scared of operations now. Which, I understand the feeling. And a couple of years ago, I was like, all right, how do we deal with this? And I went down a whole bunch of rabbit holes until I realized that this is actually well understood. There's a whole bunch of academic research on complexity and dealing with complex systems and what failures in complex systems look like.

And I can't summarize all that research that quickly, but one of the things that made me reevaluate was that complex systems are much more resilient to failure than simple systems. Mm-hmm. And so when you look at a system and see complexity, your question shouldn't be, why is that system so complex?

I can't understand it. You might not understand everything about it, but you need to be like, how do I exercise this system? As it operates, how observable and transparent is it? Those are things that, at RackN, we work really hard on. Boy, I've done whole podcasts on nothing but this fact: when things break, instead of doing retries, we stop and tell people to fix the root cause of the problem rather than just banging through.
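The "stop and surface the root cause instead of retrying" behavior can be sketched as a runner that halts at the first failing stage and reports exactly where it stopped. This is an illustrative assumption about how such a runner might look; `StageFailed` and `run_until_failure` are invented names, not RackN's implementation.

```python
from typing import Callable, List, Tuple


class StageFailed(Exception):
    """Raised on the first failure; carries what ran and what broke."""

    def __init__(self, stage: str, completed: List[str], cause: Exception) -> None:
        super().__init__(f"stage '{stage}' failed: {cause}")
        self.stage = stage
        self.completed = completed
        self.cause = cause


def run_until_failure(stages: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run stages in order with no retries; stop at the first error."""
    completed: List[str] = []
    for name, action in stages:
        try:
            action()
        except Exception as exc:
            # No retry loop: later stages never run on top of a broken state.
            raise StageFailed(name, list(completed), exc) from exc
        completed.append(name)
    return completed
```

Because nothing after the failing stage executes, the operator sees one clear error at one known point, which avoids the unexpected behaviors that blind retries can introduce.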

Because when you're fighting complexity, you don't want unexpected behaviors, right? So there's all sorts of design elements that go into making these systems really resilient. But we're very used to needing comprehensible systems. Maybe ChatGPT is gonna teach us something indirectly: that not understanding what's going on behind the covers is not

as bad as we thought it was. We're sort of very willing to accept that we don't know how these models got trained. But that's part of these systems: we can lean into making a complex system usable and secure and reliable and resilient.

But there's a little bit more cognitive lift the first time you look at it. Mm-hmm. So you have to have played around with ChatGPT and its abilities to use RackN, or potentially play with the API or do these operational loads under the covers. Have you discovered anything, and has it been anything good?

I'm actually doing some talks about what I'm calling generative DevOps that I would suggest people look at. We're way down the rabbit hole at this point. But yeah, if you search for generative DevOps, you'll see some of the early thinking that I have with this. And the teaser on it is that it's making expertise much more accessible.

And so I'm using this phrase, the 10x operator. We've been talking about 10x developers for a long time, but, mm-hmm, we might be entering an era where there is the possibility of a 10x operator. And that completely changes assumptions on all sorts of things. How you run gear, can you run gear?

How many people you need, how effective they are. Like edge: the 10x operator concept with generative DevOps could completely remake the landscape we had. I love it. I'm gonna Google that right now. Well, Rob, thank you so much for coming on here and talking to me about all this, the past, present, and future really, and bringing this DevOps feel to it.

There's so much that the MLOps world borrows from DevOps, and so it's great to hear your spin on things and also see what you are working on and what you're dedicating your time and energy to these days. And it's fascinating stories you've got. Thank you. I really appreciate this. And I think that's it, man.

We'll end it here. I appreciate it. Thank you for the conversation. It has been a pleasure.
