MLOps Community

Build Reliable Systems with Chaos Engineering

Posted May 31, 2024 | Views 1.8K
# Chaos Engineering
# MLOps
# Steadybit
Benjamin Wilms
CEO & Co-Founder @ Steadybit

Benjamin has over 20 years of experience as a developer and software architect. He fell in love with chaos engineering 7 years ago and shares his knowledge as a speaker and author. In October 2019, he founded the startup Steadybit with two friends, focusing on developers and teams embracing chaos engineering. He relaxes by mountain biking when he's not knee-deep in complex and distributed code.

SUMMARY

How to build reliable systems under unpredictable conditions with Chaos Engineering.

TRANSCRIPT

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

Benjamin Wilms [00:00:00]: My name is Benjamin Wilms. I'm the co-founder and CEO of Steadybit, a chaos engineering platform. And you can call me the paranoid guy, because I don't trust any system. I need to see how the system reacts under any conditions. Coffee. Let's talk about coffee. So, there are two types of coffee I like the most in the early morning. There is one with milk and sugar.

Benjamin Wilms [00:00:23]: Don't judge me. And of course, there's espresso. Just pure espresso.

Demetrios [00:00:28]: What is happening, MLOps community? This is another one of them good old podcasts. And I am your host, Demetrios, today talking to Benjamin. And I wanted to have him on here because I feel like what he is doing is so closely related and needed in the MLOps world. It's just not something that we talk about that much. Yet. Yet, maybe. And I think that he has some really unique ideas around running experiments to try and purposefully inflict chaos on your systems to make sure you know and understand where things can fail. And when you do that, you then get to go and you upgrade your systems and you continue to run the experiment until you feel like you've got it down.

Demetrios [00:01:19]: And then you just throw that experiment and all of those learnings into a test that is part of your CI/CD pipeline. Benjamin explained it so well. It was a great conversation. Hope you enjoy it. Let's get into the episode. And one thing that he did say that I thought was so cool is I usually ask people before we start recording, I say, what would be the ideal outcome of this episode? And he said, for him, one of the goals that he wanted to inspire in people is to just have somebody that has never heard of chaos engineering go down the rabbit hole and look deeper at it and maybe start incorporating it into their repertoire of tools that they can use to keep their systems more reliable. Let's talk with Benjamin. And if you enjoyed this episode, you know it means the world to me.

Demetrios [00:02:13]: If you leave a star or if you give us a review or my favorite, share it with one close friend. I'll see y'all later. Alright. Benjamin, you are not too far away from me. You're almost like my neighbor. I'm in Frankfurt. I think you're just a little bit further north.

Benjamin Wilms [00:02:40]: Yeah, it's just a 45-minute ride with the famous German Deutsche Bahn. Yes.

Demetrios [00:02:47]: That is never late or never has any problems. We all know that.

Benjamin Wilms [00:02:50]: Yeah. We all know that. Yes.

Demetrios [00:02:53]: So I wanted to get you on here because I wanted to talk about chaos engineering, and what better person to talk about chaos engineering than the person who built Chaos Monkey for Spring Boot. And for those uninitiated people who don't know what Spring Boot is, it's a framework for Java. What is chaos engineering? Can you give us the TL;DR of it?

Benjamin Wilms [00:03:13]: Yes, for sure. So the real need behind chaos engineering is that people are building very complex and distributed systems. And a system is not only something from the technology point of view, it's really something where a lot of interaction is happening inside of that system. So it's very hard to understand how the system works and how you should react under specific conditions. It's really hard to understand what's going on inside of that system. Now, with chaos engineering, you can train yourself, you can train your organization, you can train your system, your technology, to handle specific conditions. And most of the time we are talking about bad conditions, bad conditions from production, like a latency spike, some delay in the traffic, maybe an area of your network is no longer responding, or maybe something went wrong during your deployment. So really those moments where you are under high stress and you would like not to get into the same moment again and again.

Benjamin Wilms [00:04:20]: And now with chaos engineering, you can inject those bad moments, but in a safe environment, you can do it in a pre production stage. You can be more proactive and you can learn from it. You can train your system and yourself.

Demetrios [00:04:34]: And so I've been kind of on this kick about change management and how many processes there are. In software development, when you make changes to code, you want to make sure that it happens in the most reliable way possible.

Benjamin Wilms [00:04:49]: Correct.

Demetrios [00:04:49]: And so this kind of falls under that category. You get to not have to test in prod. Right. You can do things beforehand.

Benjamin Wilms [00:04:58]: Hopefully you should, because otherwise it's too late.

Demetrios [00:05:02]: Exactly. And it's funny because these high stress situations that you were talking about, I think they tend to happen at like 03:00 a.m. when you have one person on call and they get woken up in the middle of the night. So is that the idea, like less 03:00 a.m. calls?

Benjamin Wilms [00:05:22]: Correct. Correct. That's like the outcome. But the more valuable outcome is for yourself and, of course, for your customers. Your customers are still able to use your product, they are able to use the service you're offering. And to be honest, your customers don't care about any technical issues. They don't care about AWS or Azure or Google issues. They would like to buy something on your marketplace.

Benjamin Wilms [00:05:47]: They would like to interact with your app on their mobile phone. And today's market conditions are quite tough, because if your customer is not able to do it, they will go to your competitors, they will find a way to get their product. And so you need to be able to handle those outages, you need to be able to make your customer happy. This leads to revenue and happy customers.

Demetrios [00:06:12]: And you mentioned complex systems and how valuable this is for complex systems. Is it also useful for like the smaller startups that are just getting rocking and don't have that complex of systems in place?

Benjamin Wilms [00:06:25]: Yes, you are correct. Why? Let's imagine you're a young startup. You as a young startup would like to get your idea out on the market, and you're highly focused on specific use cases, something you would like to implement. And maybe as a young startup you're not able, because of missing resources, to invest a lot of energy in reliability engineering to make it more perfect. So if something is there that you can just ask: okay, how safe am I, how big is the risk I'm taking to go into production with this new feature? Is it still so that this system with the new feature can handle a zone outage of my cloud environment, can handle some hiccups in Kubernetes? That's something you as a young startup need to be able to see and to handle the risk.

Demetrios [00:07:21]: Now, I was talking to a CTO of a big technology company and I was talking to them about chaos engineering and what they thought about it. And one thing that they said was they don't want to know how brittle their system is. What do you say to that? Because it's like, yeah, it's almost like they know that things are hanging on by a shoestring. But let's not mess things up, because then we have to go and we have to stop and backtrack and make it more robust.

Benjamin Wilms [00:07:52]: Okay, I need to be careful now, because this story is something I also know, maybe from a different company. But what's the story behind it? There was a company reaching out to us. It was an inbound request, it was an inbound lead. And they told us, oh man, our system is so under pressure. And we as engineers and SREs, we realized we created a monster, and this monster is our system, and we are not able to handle it anymore. So we need to do something with chaos engineering. We would like to get rid of all those incidents. We are spending hours each day on fixing something. So we would like to turn this around. So they did a POC with us, they did the POC in their internal system, and they identified a lot of risks in the system which they were not able to handle.

Benjamin Wilms [00:08:42]: So next step, the POC was done. Let's get a report out of the POC. Let's present this to our management. And the management was very impressed by the findings, by this huge list of risks in the system, but there was no commitment coming from the management. They told the team, no, let's focus on firefighting tools, you need to be more reactive. We should spend more money on people that are taking care of all those fires in your system, but don't fix it. No, no, no. Let's continue the old way.

Benjamin Wilms [00:09:20]: So it was, and still is, really hard for me to understand. If you are aware of such a big risk in your system, which can lead to many bad scenarios for you as a company, why are you not choosing the right way? From my perspective.

Demetrios [00:09:38]: Yeah. It does feel like one thing that you were mentioning is that it's just as much organizational as it is technical.

Benjamin Wilms [00:09:47]: It is part of the culture, it is part of the motivation. So you need to ask yourself: what is there, let's say, a compensation plan for some levels in your company? What's in such a plan? Is there any motivation to improve something at specific levels? And where is the pain located? And how much momentum can this group of people, where the pain is located, create to improve something? And this reflects into the culture. And most of the companies we spoke to and are still engaged with are building such a failure culture, where you are improving because you failed, where you can learn from failures, and where no one is doing finger pointing or saying, that's bad, it's your job, it's not my fault. So it's really deep in the culture of a company.

Demetrios [00:10:43]: Yeah, those post-mortems, the blameless post-mortem. Exactly. One thing that I also think about is how you play a lot in the DevOps sphere, and what you're talking about has a lot to do with infrastructure as code, or just the infrastructure of a company, or the technological piece of it.

Benjamin Wilms [00:11:09]: Yes.

Demetrios [00:11:10]: Do you also see a lot of people wanting to do this with data? And basically, the way that I envision it... Can I just tell you how I envision it?

Benjamin Wilms [00:11:20]: Yeah, please.

Demetrios [00:11:20]: Chaos engineering. And I think Chaos Monkey is my favorite way of envisioning it, because it's like there's a monkey that got loose in your system and it's just turning off and on a bunch of different pieces of your system and you get to see how resilient that is. And so I wonder, do you also see people wanting to come and say, can you just turn off and on my databases, or can you make data flows go the wrong way, or stuff like that to simulate that trouble? I mean, to be honest, a lot of people probably say, I don't even need a chaos monkey. There's a few people at my company that do that on their own. I don't need some special software for that. But I want to hear your story on the data side of things too.

Benjamin Wilms [00:12:03]: Let me start with the data side, because I would also like to talk about the Chaos Monkey side. And from the data perspective, chaos engineering as we are doing it in our vision at Steadybit is not limited to only infrastructure or hardware components. It is really a multilevel approach. What I'm talking about is: you are getting started on bare metal, on a virtual machine. A lot of stuff can go wrong on that machine on many different levels, like network, resource consumption, storage. But then also you can go up the levels where you would like to do something on a platform level. And platform could be Kubernetes, or could be a distributed database system, or even go higher until you are on the application level; you are going through the cloud provider level.

Benjamin Wilms [00:12:55]: So on many different levels something can go wrong. And also on many different levels you need to inject that behavior, which also means that people are doing it on the database side. They are injecting latency on only the read requests to the database. Or they are trying, how to say, a database shutdown, but also bringing up a new instance: how fast is the data distributed to the new instance? Is this a hot standby? Whatever. So they are running those systems and tests, and even on the network layer. So our customers are doing something on Kafka, something on Redis. They are handling the messages, they are holding back some messages so that newer messages can bypass those messages over the network, so that maybe, let's say, an update event for a customer is executed before the create event, as a simple example. So the short answer is: you can do it on every level.
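To make the read-latency idea above concrete, here is a minimal sketch of how such an injection could look on the application side, assuming a hypothetical Python database client; it illustrates the technique Benjamin describes, not how Steadybit implements it.

```python
import random
import time
from functools import wraps

# Hypothetical fault-injection wrapper: it delays only read calls, the way the
# speaker describes injecting latency on read requests. `delay_ms`, `probability`,
# and the commented-out `db_client` usage are illustrative assumptions, not a
# Steadybit or database-library API.
def inject_read_latency(read_fn, delay_ms=500, probability=1.0):
    """Wrap a read function so it stalls before actually querying the database."""
    @wraps(read_fn)
    def wrapped(*args, **kwargs):
        if random.random() < probability:
            time.sleep(delay_ms / 1000.0)  # simulated slow read path; writes stay untouched
        return read_fn(*args, **kwargs)
    return wrapped

# Example (hypothetical client):
# db_client.get_order = inject_read_latency(db_client.get_order, delay_ms=800)
```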

Benjamin Wilms [00:13:59]: Getting to the Chaos Monkey side, I was, and I'm still, a big fan of Chaos Monkey. Why? The core concept, the core idea behind the Chaos Monkey was that all the engineers, all the product engineering teams, are forced to do something to be more reliable. Why? Because they know the Chaos Monkey is running, and: I've got no idea what this freaking tool is doing and at what point in time, and I don't want to be called on a Sunday morning at 03:00 a.m. or whatever. So, ha, it's really like this: you need to be prepared for those moments. And that was a great job done by the Chaos Monkey. Now, after a couple of years, the chaos engineering approach is shifting more into science, or running experiments where you are always asking yourself: is the system still healthy or not? And if not, the system should not be part of an experiment.

Benjamin Wilms [00:15:02]: So you need to find a chaos engineering solution that is able to understand and interact with your system, to answer the question whether my system is still healthy. If not, the experiment needs to stop immediately and should do a rollback, which the Chaos Monkey doesn't care about. It's always hitting your system. And that's the big difference between the old approach of randomly injecting bad behavior and the nowadays approach of chaos engineering platforms.

Demetrios [00:15:29]: Yeah, so I can see the value of that, especially when you're talking about that first part. If you're doing stuff with Kafka and you're recognizing data streams aren't being piped in at the right time, or you're holding back data streams. And then for a machine learning model, that could be very useful, especially if it's real time and it needs those data streams. So how do you make sure that the real-time machine learning model downstream is still effective if all this upstream chaos is happening?

Benjamin Wilms [00:16:02]: Correct.

Demetrios [00:16:03]: And then on the flip side, the Chaos Monkey versus the science and experimentation, it does seem like a very clear distinction, and it seems like a more mature way of doing it. Like the Chaos Monkey was the infant way of doing it. And now we've kind of gotten into, we've graduated and we're in high school now.

Benjamin Wilms [00:16:22]: Exactly.

Demetrios [00:16:24]: And so we can be a bit more focused on how we create those different experiments. And we can also know, if we have a hunch that, I think this side of the system is a little bit unreliable, or I've seen or I've heard about something happening with my friend's system in this way, you can run an experiment and make it happen, and it's more directed.

Benjamin Wilms [00:16:48]: Exactly. And that's a very good point. What people are doing is: there is an experiment, and it has been run the first time in one area of their system, and it was successful. Now they are sharing it with other teams inside, because nowadays they are all running on one central platform. Maybe it's a Kubernetes-based platform, I don't know. But, okay, this team was quite successful in handling all incidents with, let's say, some storage issues. Now let's take this collection of experiments they have executed with success and hand that collection over to another team, so they can do it on the applications they are running.

Benjamin Wilms [00:17:33]: And that's really this sharing-is-caring moment, where people are able to help each other, and they are helping each other because there was a bad moment, an outage, and they can learn so much from those outages, which is driving strong value.

Demetrios [00:17:49]: Yeah, learn from my mishaps. And so, hopefully, you don't have to go through them yourself.

Benjamin Wilms [00:17:55]: Exactly.

Demetrios [00:17:57]: All right, real fast, I want to tell you about the sponsor of today's episode, AWS Trainium and Inferentia. Are you stuck in the performance-cost trade-off when training and deploying your generative AI applications? Well, the good news is you're not alone. Look no further than AWS's Trainium and Inferentia, the optimal infrastructure for generative AI. AWS Trainium and Inferentia provide high-performance compute infrastructure for large-scale training and inference with LLMs and diffusion models. And here's the kicker. You can save up to 50% on training costs and up to 40% on inference costs. That is 50 with a five and a zero. Whoa, that's kind of a big number.

Demetrios [00:18:46]: Get started today by using your existing code and frameworks, such as PyTorch and TensorFlow. That is AWS Trainium and Inferentia. Check it out. And now let's get back into the show.

Demetrios [00:18:59]: And one thing I also think about is, how have you seen people doing this ad hoc, if at all? Like, if I didn't have a tool to do this, what would I do?

Benjamin Wilms [00:19:11]: Turning off a machine: a shell command, shutdown, whatever. That's the easiest way to do it. Or if you are able to unplug the power, yeah, do it, cut some wires. Exactly. But then you are getting back to the Chaos Monkey approach. And first of all, before you do anything with chaos engineering, to start with: the core issue, or rather the challenge, is not to turn something off. The challenge is to define your expectations.

Benjamin Wilms [00:19:43]: Start from your point of view. What is your expectation, what does your system need to do, and under which conditions? So, for example, my expectation is that my customer needs to be able to purchase my products in my e-commerce online store, even if my cloud provider is losing 60% of all zones. Okay, how to make that happen? How to survive that? That's something you need to build into your system, and then you need to test whether your expectation is fulfilled. And that's something you can do, of course, without any chaos engineering platform; it just gets easier to do with one.

Benjamin Wilms [00:20:19]: Yes, but that's like the normal starting point with your expectation.

Demetrios [00:20:25]: So it feels like you've probably seen a million different ways that systems fail. Do you notice that customers maybe are not as creative in these experimentations and you almost have to help give them templates? Well, what about if you have this scenario or that scenario? Or is there anything like that where you can suggest different ways that they can test?

Benjamin Wilms [00:20:48]: The hardest answer. Sorry, no, the hardest question inside of chaos engineering is: where should I start, and how should I start? So where should I start in my system? What is a good area inside of my system where a lot of stuff can go wrong? And the second one is, okay, now I found that spot, but how? What's next? Is it something on the network layer? Is it something on the cloud level? Is it Kubernetes, Kafka, Redis, whatever, or the database as my starting point? That is the biggest challenge. So there is a way where you can use some tools that can highlight some, let's say, risky areas in your system. And those tools should also be able to tell you why.

Benjamin Wilms [00:21:40]: Why is it risky? Maybe those tools are able to do a pre-check of best practices. So there are open source tools out there for Kubernetes and other technologies where you can run a check of your system. So maybe let's talk about Kubernetes. The CPU resource consumption is not configured, or maybe you are not running enough replicas, not enough instances of your service. Maybe your instances are not distributed across all the zones of your cloud environment. Those are questions you need to take care of, that need to be answered, and hopefully with an easy approach.
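As a rough sketch of such a pre-check, the snippet below uses the Kubernetes Python client to flag deployments with no CPU limits, too few replicas, or no topology spread constraints. The thresholds are assumptions, and dedicated open source checkers do this far more thoroughly; this only illustrates the kind of questions Benjamin lists.

```python
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig
apps = client.AppsV1Api()

for dep in apps.list_deployment_for_all_namespaces().items:
    name = f"{dep.metadata.namespace}/{dep.metadata.name}"
    pod_spec = dep.spec.template.spec
    if (dep.spec.replicas or 1) < 2:
        print(f"{name}: fewer than 2 replicas - a single instance is a single point of failure")
    for c in pod_spec.containers:
        limits = c.resources.limits if c.resources and c.resources.limits else {}
        if "cpu" not in limits:
            print(f"{name}/{c.name}: no CPU limit configured")
    if not pod_spec.topology_spread_constraints:
        print(f"{name}: no topology spread constraints - pods may all land in one zone")
```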

Demetrios [00:22:20]: Yeah, and I could imagine that machine learning engineers can resonate with this. And then you just swap out CPUs for GPUs and it all makes sense. Like, oh, GPU access, my data scientists need more of that, or whatever it may be.

Benjamin Wilms [00:22:34]: May be, or network on the network. Maybe there are some routes on the network about how the traffic is being routed, or maybe there are some firewalls involved, or maybe some packages are corrupted or lost, whatever. So a lot of stuff can go wrong.

Demetrios [00:22:53]: And also I think there's the other hand in this, which could be data access. And people in Germany can't get the data that is in Canada, because that data can't go outside of Canada if there's special PII, or it has a certain use case. And so trying to almost do white-hat hacking inside of the system, to see if it is possible for that data to be leaked in some way or another, could be quite useful.

Benjamin Wilms [00:23:24]: Is the data distributed across all your areas around the world fast enough? Or is there one area where, let's say in Canada again, everything is working perfectly, but the rest of the world is not getting current data? Or maybe in the synchronization something is going wrong. Maybe there's a funny attack you can do, it's called a time machine, where you can change the time of the system. This leads to very bad scenarios: all your certificates are outdated, so your system is no longer working because there is no secure communication possible. Or, let's say, the data in another region is now created with a newer timestamp than the rest of the world, which is quite hard when the system is distributing all the data around the world. So you can do it on many different levels.

Demetrios [00:24:27]: So it does feel like chaos engineering goes hand in hand with observability. And by observability, I think that we should probably like define what that means too. And in my understanding, you can tell me where I'm wrong. My understanding is that observability is just there to clock and read what is happening and maybe give you some alerts when things start to go wrong. And this is much more proactive, but it still can be plugged into an observability tool. Is that a fair assessment?

Benjamin Wilms [00:25:04]: Yes. You need to have an observability tool in place, and why? For multiple reasons. First of all, to understand what's going on and how my system is reacting under the specific conditions which were injected. And the important part is, if you are doing chaos engineering, let's imagine you are doing a CPU or memory attack, where you use all the memory of one dedicated machine or one region of a cloud environment. You need to be able to understand what was injected by your chaos tool and what is something like a cascading failure, a reaction of your system. Why is that important? This is the job of your observability tool. That's why both tools need to work very closely together.

Benjamin Wilms [00:25:57]: The chaos engineering tool needs to tell the observability tool: okay, that was injected by my chaos tool, this was the group of virtual machines, I injected a lot of CPU load. If something is going crazy outside of that group, we found something. That's a bad scenario: cascading errors are in your system. So it really needs to be bidirectional communication. You need to consume data in your chaos engineering tool to see if everything is safe, or should I stop immediately? And you need to send out data so that all the people on call are able to understand.

Benjamin Wilms [00:26:35]: Okay, that's done by a chaos tool, and that's something bad in my system.
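A hedged sketch of that bidirectional link could look like the following: annotate the monitoring system with what was injected and where, then ask it whether error rates rose outside the blast radius. The `MONITORING_URL` endpoints, payloads, and the 1% threshold are placeholder assumptions, not any specific vendor's API.

```python
import datetime
import requests

MONITORING_URL = "https://monitoring.example.com/api"  # placeholder observability backend

def annotate_injection(target_group: str, description: str) -> None:
    """Mark the experiment window so on-call engineers can tell injected chaos from real failures."""
    requests.post(f"{MONITORING_URL}/annotations", json={
        "time": datetime.datetime.utcnow().isoformat(),
        "tags": ["chaos-experiment", target_group],
        "text": description,
    }, timeout=5)

def errors_outside_blast_radius(target_group: str) -> bool:
    """Ask whether error rates rose anywhere except the attacked group (a cascading failure)."""
    resp = requests.get(f"{MONITORING_URL}/error-rate",
                        params={"exclude": target_group}, timeout=5)
    return resp.json().get("error_rate", 0.0) > 0.01  # assumed 1% abort threshold

# annotate_injection("vm-group-a", "CPU load injected on group A")
# if errors_outside_blast_radius("vm-group-a"): stop the experiment and roll back
```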

Demetrios [00:26:41]: And it does feel like, as you mentioned before, one of the hardest questions is, where do I start? Now, when you do this and you run an experiment, you see what the chaos tool has done versus what the cascading effects were. Do you also get recommendations on how to fix that? Or is that just the engineer should know or should spend some time doing that root cause analysis?

Benjamin Wilms [00:27:07]: That's the hardest part. Sometimes you can check for something and you are able to give a kind of advice. But the chaos tool, or the people that have created those chaos tools, are not the experts for your system. You are the only expert for your system. It's very hard to understand how you should fix something. And that's where you need to have the experience, the knowledge: how can I fix it? There are some starting points here, but no tool on the market, maybe in the future, will be able to tell you exactly: okay, in a complex system we found something, and that's how you can fix it.

Benjamin Wilms [00:27:56]: It's just advice. It's just something where you get a little bit of guidance, but don't trust a tool which is telling you, hey, yeah, just merge this pull request and you are safe to go. That's hard, because the complexity is so big. That's hard.

Demetrios [00:28:13]: Does that mean that you run various experiments in the same area to see: okay, we think we updated it, let's try again?

Benjamin Wilms [00:28:22]: Exactly. Experiments need to be repeatable. You need to be able to just press a button and tell the system, please do it again, because I just changed this little attribute. Now we are talking again about a CPU limit, or let's say about the distribution on the network layer. Next one: are we now safe enough? No. Ah, we need to tweak it again.

Benjamin Wilms [00:28:44]: And that's like this iteration you're in. It's a process.

Demetrios [00:28:49]: Yeah. I was hoping that you would say you have some magical AI solution that can give you the root cause analysis.

Benjamin Wilms [00:28:56]: Not today. But there is a way where you can use AI for doing those iterations in a more efficient way than you are able to do as a person.

Demetrios [00:29:13]: That's incredible. And I do see the value in this. And you mentioned how you have the systems teams, and I imagine you're speaking a lot to SREs. So it's SREs that are trying to make sure that their systems are as reliable as possible. Do you find that you ever talk to data platform teams or machine learning platform teams, or is it mainly at that SRE level?

Benjamin Wilms [00:29:42]: So right now it is at three levels or tiers. First of all, of course, the biggest group right now is the SRE group. Why? That's where the pain is. They are hunted by their own systems. They are always under pressure, and if you are under pressure you are not able to fix something. So that's the first group. The second group is coming from QA and performance teams. They would like to see if the system is still able to deliver as needed, and with high quality, even under those conditions.

Benjamin Wilms [00:30:15]: So they are really running their load and performance tests in combination with chaos engineering. And the third group right now is the platform engineering team. So people that are building a big platform that is used internally by other product engineering teams, developer teams, to build something. So maybe on the company level there was a decision made: we need to run on a multi-cloud provider setup. And your engineers, they don't care about the underlying Kubernetes cluster, whether it is managed by Azure or Google or AWS. I don't care. So you need to have a platform, a central platform. So those teams need to make sure that the newly created platform is bulletproof.

Benjamin Wilms [00:31:01]: That's why they are doing chaos engineering, and they are also integrating chaos engineering into their CI/CD system, in the pipeline process. In the past, some of the conversations we had got started with people that are from the machine learning area, and it is always very closely connected to the data: how the data is coming into those systems, and how the interaction is with the rest of the system. That is their biggest pain point. It's not about running the system, it's really: is it even working if some streams are not responding anymore, or not producing the data correctly? That's the core issue, but it's not a big group right now.

Demetrios [00:31:48]: Yeah, and I would imagine that people, like I said earlier, people get that in their day to day and they don't need any special tool to have these data streams be shut off.

Benjamin Wilms [00:31:59]: But if this is only happening in production, that's maybe not a good place. If you are able to reproduce this, getting it up and then not running, in a pre-production system, and you can train on it, that's a good approach.

Demetrios [00:32:12]: I was going to ask a little bit more to double click on this idea of incorporating the chaos engineering into your CI CD pipeline. What does that look like?

Benjamin Wilms [00:32:25]: There are some steps you have to do first before you can go the CI/CD way. First of all, you need to answer the question of where you should start, and you need to create experiments. If you were successful in the execution of those experiments, they are a very strong candidate to automate. So you are really creating a list of experiments that are executed via your pipeline system after the deployment to a pre-production system. So what our customers are doing is: there's a set of experiments, and this experiment set needs to be executed after really every deployment to, let's say, the QA stage, the performance or pre-production system. Why? Because inside of those experiments there are old incidents from the past. They have turned an incident into a kind of regression experiment, a regression test, to really make sure that they are not getting back into the same incident situation again. Another type of experiment they are automating is non-functional requirements.

Benjamin Wilms [00:33:38]: That's something we need to see in the system. Let's make sure that the system is able to do that and able to work as needed, and then they are doing it in the CI/CD systems.
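As a sketch of what that pipeline step could look like, the snippet below reruns a saved set of experiments against a pre-production environment after deployment and fails the build if any expectation breaks. The `chaosctl` CLI name and the experiment file names are assumptions standing in for whatever chaos tool and experiment format you actually use.

```python
import subprocess
import sys

# Experiment set run after every deployment to pre-production (illustrative file names).
EXPERIMENTS = [
    "experiments/zone-outage-regression.json",  # a past incident turned into a regression test
    "experiments/db-read-latency.json",
    "experiments/kafka-message-delay.json",
]

failed = []
for exp in EXPERIMENTS:
    # 'chaosctl' is a placeholder CLI; substitute your chaos tool's runner here.
    result = subprocess.run(["chaosctl", "run", exp, "--env", "preprod"])
    if result.returncode != 0:
        failed.append(exp)

if failed:
    print(f"Chaos experiments failed: {failed}")
    sys.exit(1)  # block promotion to production
```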

Demetrios [00:33:53]: So I like that. It's like first you're experimenting, once you know that you've gotten it down pat, then you can add it into the CI/CD system just to make it one more test. So that in case you're changing something later on down the line, you never forget the pain that you could have caused.

Benjamin Wilms [00:34:12]: Yeah, that's funny, but also a little bit off. So I'm a developer by heart, but we need to be honest: developers don't care about security and they don't care about reliability. Why? Because they are measured on features and how fast they can get features into production. So everything else, like security or reliability: I don't care, as long as no one is telling me to do it. Yeah, but it is still very important. And if you are not taking care, this will lead to very bad scenarios and a lot of pressure on your shoulders again, because then someone is telling you, in very hard words: hey, you have to fix it.

Benjamin Wilms [00:34:59]: You messed up. That's a situation we cannot get into anymore.

Demetrios [00:35:04]: And is it a fair assessment to say that this is only for the reliability aspect of what your tools are capable of doing when they're put under pressure? But I'm trying to find the words to figure out how to explain this. It doesn't have anything to do with security threats. Since you brought up this idea of how developers don't really care about reliability or security when it comes to chaos engineering, it's not like you're doing any security vulnerability scanning or security threats. It's only on that reliability side, like if things turn off or things don't act the way that they should.

Benjamin Wilms [00:35:43]: Correct. It's really a focus on reliability, on availability of your system. The difference is: security, most of the time, starts from the outside of the system. Someone is trying to get into your system from the outside. Reliability and chaos engineering is something where you are taking care, from the inside, that your system can handle a lot of load, a lot of outages in areas of your system. It's really about improving your system from the inside, not about making it more secure.

Demetrios [00:36:26]: So, yeah, I like that. That's a really good way of visualizing it. Someone from the outside. I also think that, I guess with security you have a nefarious actor that is trying to cause you harm. With chaos engineering, it's like you are trying to see if you are causing yourself harm.

Benjamin Wilms [00:36:47]: Yes, exactly.

Demetrios [00:36:49]: Excellent. So what are some ways that you've seen people do this wrong?

Benjamin Wilms [00:36:53]: You can get started with chaos engineering and you will find a very big list of tools to get started with. So let's put all those tools in one toolbox. And inside of the toolbox there's a screwdriver, there's a hammer, there is whatever else inside. And now what are people doing? Ah, I will take the biggest hammer I can find in this toolbox. Okay, now what's next? I need to use this hammer. So they are very focused on the chaos engineering tool and not on the value they are able to get by using this tool. The outage, the attack, is not the value.

Benjamin Wilms [00:37:36]: It is more a means to create value. It's just a tiny tool, but people are focusing such a damn freaking big amount of resources on: I need to do a regional outage on AWS. Oh, no, we are not able to handle a region outage here. Big surprise. I could show you that up front if you're running in just one region. So the point is: don't focus too much on the tooling.

Benjamin Wilms [00:38:05]: You need to really start with your expectation again: what does your system need to do for you? That's where the value is, and then you can use some tools to verify whether the system is still providing the value as needed.

Demetrios [00:38:26]: Well, it kind of goes back to what you were saying earlier, how you decide on the experiment, how you decide on the where and what, choosing the right tool. You are also deciding where you're going to be swinging this hammer and on what, and whether the value that you can gain, basically the insights that you can extract from swinging that hammer, are something that is not blatantly obvious.

Benjamin Wilms [00:38:58]: Correct. And that's, again, why it is so important that you have a strong integration into existing observability tools, that you can get the data, and also that you are able to rerun the same experiment under the same conditions every time. If you're running an experiment under changing conditions, that's something not repeatable, that's something which is not helping you to improve. You can imagine, like in software testing, in unit and integration testing, you are starting a new test every time under the same conditions, to really make sure that your test case is the same. But if the starting conditions are different every time, you're getting crazy. That, again, is a big need for a specific tool that can help you to understand: okay, this experiment, are we safe to execute this experiment because the same conditions are in the system, yes or no? If no, okay.

Benjamin Wilms [00:40:01]: There's a new finding. That's something, yeah, where you have to take care.

Demetrios [00:40:06]: I do like this viewpoint of saying our end goal or our maturity level number two is being able to incorporate this into the CI/CD pipeline and running it as one more test.

Benjamin Wilms [00:40:20]: Yes, correct.

Demetrios [00:40:22]: Yeah. That feels like a very lofty goal, but it is quite useful.

Benjamin Wilms [00:40:28]: But it's doable. And make sure it's before you go into production.

Demetrios [00:40:32]: Yeah. That's always useful information.

Benjamin Wilms [00:40:37]: Yeah.

Demetrios [00:40:37]: Not a lot of us follow.

Benjamin Wilms [00:40:40]: Yeah.

Demetrios [00:40:41]: Yeah. Then one thing that I did want to call out, because I think people are understanding the beauty and the value of this, is that I was at KubeCon, whatever, a month ago, and you all were there. I saw the Steadybit booth, and I was on one side of the venue, and there's a million different booths at KubeCon, for anybody that has never been. It is an absolute zoo.

Benjamin Wilms [00:41:06]: It is.

Demetrios [00:41:07]: I think they said there were, like, 13,000 people there.

Benjamin Wilms [00:41:09]: Yeah.

Demetrios [00:41:10]: And from the other side of the venue, I had two people walking in front of me, and they said they saw your booth, and it had a big thing that said chaos engineering on it. And these two developers were like, oh, chaos engineering. Let's go there. Let's go check that. So I do think that this idea is something that people really understand and respect and want to incorporate more into their products.

Benjamin Wilms [00:41:39]: Yeah. And that's also something we can see in our daily doing. The need is growing. People are more aware of the topic. It is an old topic, it's not something really new. But like every time there's like this hype cycle, a lot of people are getting up, oh, new topic. Chaos engineering. Sounds awesome.

Benjamin Wilms [00:42:00]: And then again it settles down at a specific level. And now we can really see it being mentioned by many different companies around the world. It's being done by a lot of industries. So in our customer base, it's not just one type of industry. It's a wide mix of retail, e-commerce, insurance companies, financial companies, even a heating manufacturer is on it. So it's a wide mix of industries. Why? They have one thing in common: a lot of pain, because something is not working as expected, and that's something they would like to get rid of. Why? They would like to implement new features, they would like to make their customers happy, and win new ones, maybe.

Benjamin Wilms [00:42:50]: And that's something that they need to make sure.

Demetrios [00:42:54]: So the last question that I've got for you, and it may be the kicker on this: what do you say to the person who says, you know what, if our downtime is a few hours every year or every quarter, and I have to look at the ROI of implementing a tool like chaos engineering versus just going offline for a few hours, I think I'm just going to stick with "what I don't know is not going to kill me". I would rather let it go offline for a few hours, let the people get pinged at 03:00 a.m., and not implement that tool.

Benjamin Wilms [00:43:43]: The first answer, before I ask some questions, is: hope is not a strategy. That's not something you should do. It's really like, maybe it will go wrong, maybe not, I don't care. But no, you should care. Why? It's your company you are in. It's there for your customers. And the first question I would like to ask those people is: do you know how much revenue you are making in just one hour? Are you connected enough that you are able to understand what the impact is if there is a ten-minute outage, or one hour?

Benjamin Wilms [00:44:24]: So you can really do it on your own. Just take a look at the numbers, if you're able to get them, do the math, and you can see how big the impact could be. And the next one is, there is something very important. It is very close to your customers, and it's the connection between your customer and your company. It's about trust. So if a system fails one time, two times, three times, you are losing trust. You are losing the trust of the most important group for you: your customers.

Benjamin Wilms [00:44:55]: And that's something you are not getting back overnight. That's something you have to work on hard. So, yeah, please make yourself aware of those scenarios and conditions.

Demetrios [00:45:07]: Yeah. You do have to have a lot of brand loyalty for someone to suffer through a few outages.

Benjamin Wilms [00:45:13]: Yeah. And if you take a look at your smartphone, if there's one app not working, you go to the app store, you go to the next one. There is a lot of, yeah, pressure in today's markets, and we are not the only one with a solution.

Demetrios [00:45:30]: Well, Benjamin, this has been great. I really appreciate you schooling me on everything chaos engineering, especially when it comes to how it is not only a technological solution, but very much this idea of people and organizational mentality that you need to bring into the reliability sector of your company. I also like how you speak about this. You've been dealing a lot with SREs and platform teams, but it has so many repercussions in the machine learning field, especially for those that are having to deal with machine learning models that need to be reliable. And like you said, do you know how much money you lose if a machine learning model does what we've heard various times on this show, and it just recommends the same item over and over and over again to every single person that comes to your e-commerce shop? Do you know how much that loses for the company? Well, it is probably very useful to figure that out and then become proactive on how you can make your systems more reliable and robust.

Benjamin Wilms [00:46:42]: Well said. And nothing more to add.

