Data Infrastructure Cost: Tips to Keep Our CFOs Happy // Jose Navarro // DE4AI
MLOps Engineer at Cleo.
“Hey Platform team, any idea why the cloud bill is up by X% this term compared to our previous one?” As platform engineers working at organizations developing or using AI products, that X% can get quite high very quickly. In this talk, I will show some strategies that you can use to reduce the cost of your data infrastructure, share the responsibility across product teams, and control it over time.
Demetrios [00:00:05]: Señor Jose, my amigo, my friend, I am excited to have him on because he is talking about how to save money. And FinOps, we gotta love it, especially these days if you're getting pressure on, you know, being more efficient with less spend. Let's listen to what Jose has to say about how they did it. Where you at, señor? Hey, there he is.
Jose Navarro [00:00:33]: Hi there. How are you? How is everyone?
Demetrios [00:00:35]: I'm great, man. I think just one thing, we might want to change your mic from your Airpods to your computer because we've been having trouble with the Airpod mics recently. And so before you get into the talk, I want to make sure that everyone hears this very clearly.
Jose Navarro [00:00:53]: Audio default, MacBook. How is that, okay?
Demetrios [00:01:01]: They're all so much better. So much better.
Jose Navarro [00:01:07]: I can't hear you now. I can hear you. All good? Awesome.
Demetrios [00:01:12]: Yeah. All right, I'm going to bring on your presentation, and I look forward to saving some money.
Jose Navarro [00:01:19]: Awesome. Thank you very much. Well, thank you, everyone, for coming, and I hope you have enjoyed the conference so far. My name is Jose Navarro, and I work as an MLOps engineer at a company called Cleo. At Cleo, we help our users manage their finances better, so keeping track of our data infrastructure cost is kind of embedded in the culture of the company. What I want to talk to you about in the next ten minutes is how you can track your cost in an efficient manner, how you can empower your product teams to share this responsibility, and some tricks to save some money as well that have helped me in the past.
Jose Navarro [00:02:04]: So, I don't know about you, but I've received this message quite a few times in my career, asking why cost is increasing. Even in companies where I've been told that we shouldn't be looking after cloud compute cost, that it's fine, we just focus on growing, growing, growing. But at some point, these messages always come around. And I believe it doesn't take much to just keep an organized cost visualization so that you can actually answer these questions straight away, and hopefully delay these types of messages for quite a while as well. To do that, we're going to go through a few steps. First is visibility, then how we can improve that visibility with something called attribution, then a little bit of light process that we can add to continuously monitor cost, and finally, the tricks that everyone is waiting for. So, in terms of visibility, as engineers, we are quite used to this.
Jose Navarro [00:03:04]: This phrase of, you can improve what you don't measure right so if you try to improve the latency of an endpoint, you quickly understand that you need some monitor to understand how your improvements work in and iterate over time. So with cost is exactly the same. Luckily for us, cloud vendors usually have some dashboards where you easily can track the cost, but I personally find them that they not enough in most cases. When you try to understand what's going on in this example, we can see that the cost has increased by 30% from August to September. And this specific cloud vendor allows me to filter by cloud services. So we can see that the cloud service that has improved its service b. However, it could happen that that service b is used across the whole organization. So it's very difficult still to understand where that cost coming from.
Jose Navarro [00:04:00]: So visibility is not just enough, we need something called attribution. And attribution is just leaving some breadcrumbs path along the way so that you can always follow it back and understand where the cost this coming from. And it's as easy as this. So when you create resources in your cloud, you just add some tags, and with those tags what they allow you is to filter down exactly where is that cost coming from. So in my case I added some environment project and squats. But maybe in your business case or in your organization, it makes sense to have more tax or different type of tax. But for this example, we're going to go with this. So if we come back to the previous example, we saw that the service b was the responsible for the increase.
Jose Navarro [00:04:48]: And now because we have environment and project, we can drill down into more information. So we can see that per environment, it looks like the cost has increased uniformly across all the environments. So it doesn't really give us any information. But if we look into per project, we quickly realize that is project circumental, the one that has increased the cost, and then we can look into what happened in September in that project, we can talk to the team that is working on that project and understand very quickly what's going on, so that we can mitigate it very quickly in terms of continuous monitoring the cost. I personally find that alerts and anomaly detection is hard to tune and it leads to alert fatigue. So basically people doesn't track the cost over time because they get in too many alerts. So what we do at Clio is just have a very light process in which one of the tasks that we do before jumping into spring planning is just checking the cost and how it changed compared to last week. And if it's kind of stable, then we move on.
Jose Navarro [00:05:55]: And if there is a substantial change, then we can have a look into these like tags and understand which part of the infrastructure is coming from and then potentially add some ticket to the sprint so we can mitigate it or investigate it further, etcetera. Also, thanks to attribution, what we can do is empower product teams as well to keep track of the cost as well, so that not just the platform team is responsible for this, but we can share this culture of like always keeping an eye on cost and it doesn't take much time. As you see, it's just probably two or three minutes every week to keep an eye on it and then mitigate it. If it happens so quickly, we're going to go through some tricks. I'm going to go very quickly through them. If anyone wants to talk more details about them, feel free to reach me out in the slack channel in mlops community. I'm happy to talk further about any of this. So first of all is make use of storage data tiers.
Jose Navarro [00:07:00]: So it's very common for companies working on AI to ingest lots of data and keep it forever. So think about it. Do you need this data forever or do you want to keep it in the default tier forever? Or maybe you can use lifecycle policies to move all data or data that is not accessed regularly into tiers that are a lot cheaper and save money that way. Another one that I found useful from time to time is having a look into volumes that are hanging around. I found several volumes of several terabytes hanging around that haven't been used in a year. So potentially snapshot those and delete them. Or asking people, are you still using? Or do you want to keep the data here? Like can we move it somewhere else where it's cheaper, that type of thing. One of my favorites ones is like the different pricing models between some of the cloud services, between on demand and provision.
Jose Navarro [00:07:57]: Some services have these two pricing models where on demand is the default one, where you pay per request of the service and then the cloud provider is able to auto scale infinitely. So some sort of like a service lake serverless type of service, whereas provision, you have to understand how you're going to use the service beforehand because you set the capacity that you want to provide to that service and just pay for that capacity. So usually what happens is when projects start up, you use on demand because you don't know how you're going to use it. But then over time everything keeps on demand, which is quite a lot more expensive. And I saved quite, quite a lot of money by looking into the usage and then moving it into provision, setting up some default capacity and potentially having some schedule auto scaling if you need like some bigger user at some point in your day or in your week. Difference between spot and on demand instances or spot instances are as cheaper as 90% of the regular on demand prices. Spot instances are just the leftover compute that the cloud provider have around that is not using and it sells to you at a discount price. But the trick is that they can request it back at any point with some amount of seconds of heads up.
Jose Navarro [00:09:29]: So you just have to be ready to lose any of your instances at any point. But on the other hand you can save a lot of money using a lot of responses. Spot instances, compute and stored reserved instances. You can get 40% of the total cost of your compute and store reserved instances if you understand exactly what you need. So have a look into what you use annually of compute and your database, your type of instance that you use for your database. And if you are kind of pretty sure that you gonna have kind of like a steady workload, you can reserve the instances for a year and then pay 40% less. Then monitor utilization of your services, make sure that the resources that you pay for are well utilized and if not then move to a cheaper instance. And finally, I've got an extreme trick that it doesn't work for all the use cases, but I end up using it quite recently and I found quite interesting which is budget with automated deny access policy.
Jose Navarro [00:10:39]: So we had this use case of a team that wanted to use a service that is very very expensive. And then I've got it made sense for the team to use this service. But on the other hand I wanted to make sure that suddenly someone doesn't forget to switch off this service and we've got a 50 grand cloud build the next month because of it. So I created this budget with an agree budget for it. And you can add automated deny access policy if you get to certain point on your budget. So we agree on a budget and we say if you reach 100% of this budget then this policy is going to kick in and it's going to stop you using this service. Obviously it doesn't work for anything, production workloads or stuff like that. But for things like experimentation and things like that, it helps making sure that your cost doesn't explode.
Jose Navarro [00:11:32]: And that was the final trick. Thank you very much. Hopefully I'm on time.
Demetrios [00:11:37]: Oh, you are making my job easy. I like it. Okay, so the TL;DR is basically: alerts don't work that well. Make sure to hard-code it in so these costs can't run out from under you.
Jose Navarro [00:11:55]: Yes, a little bit of process. It works pretty well and it doesn't take that much, to be honest. But if you've got a way of alerting very efficiently, then do it that way. But what I found is usually we end up having lots of alerts and then people is like, oh, the usual, and don't look at the alerts.
Demetrios [00:12:13]: Classic. Well, excellent, Jose. I appreciate it. This was a lightning talk. This was in and out fast, fast finops is always important to be thinking about, and I am very happy that you came on here to talk to us about it.