MLOps Community

How to Stop AI Agents from Bleeding Your Cloud Budget // Advait Patel // Agents in Production 2025

Posted Jul 25, 2025 | Views 47
# Agents in Production
# Cloud Budget
# Broadcom

SPEAKER
Advait Patel
Senior Site Reliability Engineer @ Broadcom

Advait Patel is a Senior Site Reliability Engineer at Broadcom, where he leads the secure design and deployment of large-scale, cloud-native platforms. With over 8 years of experience in cloud infrastructure, DevSecOps, and AI-driven automation, he focuses on building systems where AI agents operate safely, efficiently, and cost-effectively across production environments.

He is the creator of DockSec, an open-source AI-powered Docker security analyzer used by security and DevOps teams to detect vulnerabilities and enforce best practices with the help of autonomous LLM agents. Advait is also a founding contributor to AIVSS (AI Verified Secure Systems), a global initiative focused on safe, auditable agentic AI, and serves as Vice Chair for IEEE Region 4 and Conference Chair for IEEE Chicago Section.

His work has been featured at leading conferences, including ISACA, Blue Team Con, DataWeek, and the Open Cloud Security Conference. In 2025, he will speak at IEEE Cloud Summit, CornCon, and BSides Orlando. He is also the author of two forthcoming Springer Nature books on Google Cloud IAM and Security. Advait is passionate about operationalizing AI agents with real-world guardrails, especially when it comes to performance, security, and cost.


SUMMARY

As AI agents become active participants in production environments, handling infrastructure tasks, chaining tools, generating outputs, and executing plans across cloud services, the financial implications are often underestimated. These agents may appear intelligent, but they have zero awareness of cost boundaries. A single agent loop with poorly bounded retries, excessive API calls, or unrestricted tool usage can quietly rack up hundreds or even thousands of dollars in compute, token, or storage costs. In this session, I’ll walk through how seemingly harmless design decisions, like overly verbose prompts, excessive tool chaining, or unrestricted LLM usage, can result in runaway spending. I’ll share lessons from deploying agentic systems in cloud-native pipelines and infrastructure security tools, including my work on DockSec, an open-source AI-powered container security analyzer. We’ll explore how agents misbehave in cloud billing terms and what attendees can do to stop it. Attendees will learn practical strategies to monitor, contain, and optimize agent costs: from integrating cost observability into your agent stack, to programmatically setting retry, token, and API call budgets, to leveraging agent memory, caching, and behavior throttling to reduce waste. Whether they’re scaling agents in production or just starting to build them, this talk will give them the tools to design agent systems that are not only intelligent but also financially sustainable.


TRANSCRIPT

Advait Patel [00:00:00]: Hey everyone, good morning, good afternoon, good evening from whichever time zone you are in. Thank you so much for joining me today in this session. My name is Advait Patel and I work as a Senior Site Reliability Engineer at Broadcom. Today I'm excited to be a part of this amazing conference, Agents in Production 2025. And today I'm going to walk you through a topic that is becoming more and more important and relevant in the era of LLMs. The topic is how to stop AI agents from quietly draining your cloud budget. So let's dive in. Okay, a little bit about me, quick intro.

Advait Patel [00:00:53]: I specialize in AIOps, cost optimization, and cloud infrastructure. I'm also involved with IEEE. I chair conferences, I review papers and journal articles, and I try to stay active in the community through these activities. I have also written a couple of books on IAM and GCP security, and I speak at conferences like ISACA, Cloud Summit, Blue Team Con, and more. You can find all those details on my LinkedIn profile. Today's session pulls from real-life lessons that I have learned working with cloud-native AI systems and LLM agents. So to get started: we have all seen this shift, right? AI agents are becoming more popular. They're literally everywhere these days.

Advait Patel [00:01:47]: They are showing up in your data pipelines, your security monitoring, your automation bots, wherever it is, you name it. They are meant to automate your workloads and your day-to-day operations. They are meant to make decisions on behalf of humans, and to save us time and money as well. But here is the catch: if they are not deployed correctly, they can do the exact opposite. And that's where the trouble actually starts. So what is the hidden cost of this autonomy? What is the catch? The reality is that just because something is autonomous, that doesn't mean it is optimized.

Advait Patel [00:02:34]: And that's what we have been seeing in recent trends in the world of LLMs. I have seen AI agents fire off thousands of API calls. As Demetrios just mentioned, API calls. They spin up GPU instances that we don't even need, or they loop infinitely just because someone missed a condition check. One recent survey that I would like to bring up said that around 30% of gen AI workloads blow past budget, and it's usually not because the model was too powerful or too optimized, but because someone forgot to put the brakes on. That was the reason. I know it sounds silly, but this is what it is when we talk about cost in this LLM-generated era. So what is the root cause? If we dive deeper, we can see a few common mistakes that are probably the reason why companies start bleeding their cloud costs.

Advait Patel [00:03:55]: And here are some of those things. Let's break them down and see why this happens. First of all, agents don't always know when to stop. They get stuck in infinite retry loops, which is common these days. Also, the LLMs often reload the full context every single time, which is killing the bandwidth, killing the cost, killing the observability, and so on, when they reload the full context on every call.

Advait Patel [00:04:30]: Another important aspect is logging. Some agents store every single thing forever. They store debug logs as well. Debug logs are not that useful, and if you store them forever, it is completely bleeding your cloud cost. So logging is one of the important root causes as well. Other than that, auto scaling gets misused when policies are not in place. Agents can auto scale up or auto scale down, right? But if those policies are not in place, then auto scaling can be abused. Also, many agents have too many permissions, which they don't even need to do their operations, so they can spin up expensive services using those over-permissive policies.

Advait Patel [00:05:27]: And the worst part, you know what the worst part is? Teams often have zero visibility into what agents are actually doing. That is where observability comes into the picture. So this is a recipe for cloud cost disaster, and these are the root causes that we have seen in case studies and in our day-to-day operations. Okay, so let me take a step back and walk you folks through a case study where an agent actually went rogue. Here is a true story. We had this bot called SupportGPT. SupportGPT was meant to help generate help desk tickets and to help our customers and our support engineers.

Advait Patel [00:06:22]: And this sounds harmless, right? But no, it had a prompt loop issue and no rate limit. So it went off and created approximately 10,000 tickets in one day. This operation spiked costs 4x, and at the end of the day, it was all avoidable. How did we fix it? What was the solution? It was guardrails and throttling, right? That was the ultimate solution that could have fixed this issue. A simple control would have saved thousands of dollars for the company and the team, and it could have been easily avoided. So what is the cloud-native strategy to rein in cost?

Advait Patel [00:07:08]: What can we do about it? Here is the playbook, something we can try to implement, and it is a very lightweight and simple architecture. Let's start with the first and very simple thing, which is to put limits in place, whether it is CPU, whether it is memory, whether it is concurrency, whatever it is. Put limits in place for each and every resource that the agent or your automation is using. The second thing is to use cost alerts. Whatever you are implementing, always have observability and visibility into those metrics. For example, AWS and GCP both have default metrics built in, so it is definitely advisable to keep your eyes on whatever the agent is doing or automating. Also, before you start using any agent, put your agents into a sandbox environment, give them kill switches, use circuit breakers, and use policies so that your agents don't go rogue. And before you put your agent into production, run cost simulations before launch. Run multiple varieties of cost simulations using existing tools.
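
The "put limits in place" and "kill switch" ideas above can be sketched in a few lines of Python. Everything here is illustrative, not from any real agent framework: the class name, the thresholds, and the per-step cost are all made up to show the shape of a per-run budget guard.

```python
# A minimal sketch of a per-run budget guard for an agent loop.
# All names and thresholds are hypothetical, not from any real framework.

class BudgetExceeded(Exception):
    pass

class AgentBudget:
    def __init__(self, max_api_calls=100, max_cost_usd=5.0):
        self.max_api_calls = max_api_calls
        self.max_cost_usd = max_cost_usd
        self.api_calls = 0
        self.cost_usd = 0.0

    def charge(self, calls=1, cost_usd=0.0):
        """Record usage; raise before the agent can overspend."""
        self.api_calls += calls
        self.cost_usd += cost_usd
        if self.api_calls > self.max_api_calls:
            raise BudgetExceeded(f"API call budget exhausted ({self.api_calls})")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"Cost budget exhausted (${self.cost_usd:.2f})")

budget = AgentBudget(max_api_calls=10, max_cost_usd=1.0)
stopped = False
for step in range(1000):                          # a runaway loop...
    try:
        budget.charge(calls=1, cost_usd=0.15)     # ...each step costs money
    except BudgetExceeded:
        stopped = True                            # the kill switch fires
        break
print(stopped, budget.api_calls)
```

The point is that the loop is stopped by the guard, not by the agent's own judgment: the 1000-iteration loop dies after a handful of steps, as soon as the cost cap is crossed.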

Advait Patel [00:08:40]: If you are in a cloud environment, use their cost explorer before you run your agents in production or in any environment. Those simulators won't give you the exact amount of resources or the exact cost it will use, but they will give you some range, some idea of what your cost picture looks like. Also, tag and label your agents so you know who owns what. When the time comes and you need to troubleshoot, this is the fundamental thing that will help you out. If you have properly tagged and labeled your agents, you can see which team is using what, which workload is using what, which inference is using what; this is how you will deep dive into your agent workloads. And automate cost reviews. This is the foremost and most important thing. Even once a month can make a huge difference. On my team we are doing cost reviews bi-weekly, which is where our sprint timelines are.

Advait Patel [00:09:44]: And for the longer, lengthier workloads we are also doing cost reviews nightly, using the default tools that come with GCP and AWS. So you definitely don't have to just deploy your agents and let them do whatever they want to do. If you start reviewing those cost budgets and cost statements, you will get to know more about how they are bleeding, what they are bleeding, and at exactly what times they are using more or less; that way you can build or change your schedules accordingly. And this isn't about fancy tools. If you think about it properly, it is about discipline. You don't have to spend money on third-party tools. If you implement your guardrails, and if you build your schedule by looking at multiple reviews, simulations, and alerts, you will see that this is more about being in a disciplined environment.

Advait Patel [00:10:54]: So let's talk about an AI-aware cost optimization framework, right? I like to use a simple framework I call Track, Throttle, Train, and Terminate. Four Ts; it's easier to remember that way. Okay, let's talk about each in detail. Track: track what your agents are doing. You don't have to just trust them blindly. Using correct observability, correct logs, and correct alerting, you need to track what they are doing in real time. Also, throttle their execution. Don't give them unlimited power; don't give them whatever they don't need.

Advait Patel [00:11:39]: Let's start with zero trust architecture: only give them what they absolutely need, and then you can adjust the permissions. It's not like you cannot change anything later. But start with the least amount of resources, and then give them power, give them access, based on what they need to execute their tasks and their workflow. Also, train them to be efficient. That is where your reviews and your schedule come into place. You learn, okay, this execution takes this amount of resources and this amount of time. Based on those learnings, you can make your agents more efficient, so that they use those resources in a particular time frame or in a particular way, which can also help your overall cost.

Advait Patel [00:12:31]: The other thing is terminate. You don't need all the resources up and running all the time. Terminate anything that is idle, that is orphaned, or that is misbehaving, right? Because we usually overlook those resources that look dead. We think, okay, they are dead, they are not consuming any resources, they are not contributing to cost, right? But that is a misconception. So terminate anything that is not in use, that is misbehaving, or that is misconfigured. Just those four steps can prevent about 85 to 90% of cost leaks if you perform them consistently. That's how I remember those four Ts every single time we talk about cost-optimizing our agents or workflows.
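
The Terminate step can be sketched as a simple inventory sweep. The inventory format, owner field, and seven-day idle cutoff below are hypothetical; in practice this data would come from your cloud provider's APIs.

```python
# Illustrative sketch of the "Terminate" step: sweep a resource inventory
# and flag anything idle or orphaned for shutdown.

from datetime import datetime, timedelta

NOW = datetime(2025, 7, 1)
IDLE_CUTOFF = timedelta(days=7)

inventory = [
    {"id": "gpu-worker-1", "owner": "ml-team", "last_used": datetime(2025, 6, 30)},
    {"id": "gpu-worker-2", "owner": "ml-team", "last_used": datetime(2025, 5, 1)},
    {"id": "scratch-vm-9", "owner": None,      "last_used": datetime(2025, 6, 29)},
]

def to_terminate(resources):
    doomed = []
    for r in resources:
        orphaned = r["owner"] is None               # nobody owns it
        idle = NOW - r["last_used"] > IDLE_CUTOFF   # unused past the cutoff
        if orphaned or idle:
            doomed.append(r["id"])
    return doomed

print(to_terminate(inventory))
```

Note that this also shows why tagging matters: without the `owner` tag from the earlier playbook step, the orphan check is impossible.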

Advait Patel [00:13:23]: To give you an idea of how you can implement this, and with what tools and technologies, here are a few that I recommend. We use Terraform as our infrastructure-as-code tool, and Terraform has some really nice cost estimation plugins that you can inject into your pipeline or your code, which give you a nice overview of the estimated cost this workflow or agent will incur. There are also FinOps tools like CAST AI or Kubecost, which help you track usage and optimize. Kubecost in particular will give you the overall picture of your Kubernetes environment, your Kubernetes usage, your container resource utilization, and so on. So you can use these tools to understand your usage and optimize it. For observability, which is also one of the important aspects of tracking everything, we use, and I recommend, Prometheus, Grafana, and also Wavefront. They are all great. Depending on your workload, your requirements, and your use case, you can use any of these tools to keep an eye on your workflow or agent.

Advait Patel [00:14:57]: These tools are great if you have access, and if not, you can explore other options as well. And don't sleep on the built-in tools. If you are in the cloud, GCP's Recommender and AWS's Compute Optimizer are actually useful if you set them up right. So it's not like you have to buy third-party vendor licenses; if you are on any cloud, they have really nice recommender tools which, in the back end, use machine learning and AI algorithms to give you projections and estimates about the workload. This is the bare minimum that you can get started with.

Advait Patel [00:15:42]: And if you need more, if you are on the more complex side, you can start exploring other third-party or vendor tools. But this is the bare minimum you can at least get started with, and it comes with your cloud provider. So let's talk about agent guardrails with policy as code. This one is a game changer: policy as code. We use Open Policy Agent, or OPA, or GCP's policy tools to enforce these rules. To give an example, we block agents from spinning up GPUs unless they are explicitly allowed. By default we are saying: okay, no more GPUs unless someone comes up with a specific permission or a specific need.

Advait Patel [00:16:36]: Only then will those agents create GPUs. Otherwise there is no need for the agent to spin up any GPUs in any case or for any workload, because GPUs are way, way more expensive than anything else. We can also implement a policy to limit what roles they can assume or what actions they can take. So basically, don't trust the agent to do the right thing; enforce it with code. That way you are coding the agent and giving it instructions about what it can do and what it cannot do, and it cannot overcome or misbehave. If policies are in place, they cannot bypass

Advait Patel [00:17:20]: those policies that restrict their permissions or access. Here is an example, a sample policy in code. Similarly, you can implement as many policies as you want; basically, just make sure that those agents are behaving correctly and not doing anything they should not. So how do you build budget-smart agents, right? If you are building agents from scratch, or maybe fine-tuning them, here is what I have seen help. First of all, build cost awareness right into the prompts or the agent planner. Also, weigh token cost versus accuracy, especially with large language models.
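
The slide's policy code isn't reproduced here, but the "no GPUs unless explicitly allowed" rule can be sketched language-agnostically. Real OPA policies are written in Rego; this Python version just shows the same deny-by-default logic, with a hypothetical request shape and allowlist.

```python
# Sketch of a deny-by-default GPU policy, in the spirit of policy as code.
# The request dict shape and the allowlist contents are made up.

GPU_ALLOWLIST = {"training-agent"}   # agents explicitly permitted to use GPUs

def allow(request):
    """Deny GPU instance requests unless the agent is on the allowlist."""
    if request.get("instance_type", "").startswith("gpu"):
        return request.get("agent") in GPU_ALLOWLIST
    return True                      # non-GPU requests pass through

print(allow({"agent": "support-bot", "instance_type": "gpu-a100"}))     # False
print(allow({"agent": "training-agent", "instance_type": "gpu-a100"}))  # True
print(allow({"agent": "support-bot", "instance_type": "cpu-small"}))    # True
```

The key design choice is the default: GPU access is denied unless an agent was explicitly granted it, rather than allowed unless someone remembered to block it.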

Advait Patel [00:18:15]: Also, distribute work to cheaper regions. Any cloud provider has regions in the double digits, right? And these agents are not customer-facing, so you don't have to deploy them only where your customers are. Let's say your customers are in US East, in Virginia. You don't have to deploy your agents into the Northern Virginia region.

Advait Patel [00:18:47]: You can deploy them anywhere, wherever the regions are cheaper, or wherever you are getting cheaper contracts with the cloud provider. So distribute your work to cheaper regions. Also, use APIs to check regional pricing before assigning tasks. These are a few things your agents should have when you want to build budget-smart agents. It's just like managing a team: you want the right agent doing the right job in the most cost-effective location. These are the important things when you are planning to optimize cost while building agents.
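
Checking regional pricing before assigning tasks can be sketched as below. The price table is entirely made up; real numbers would come from your provider's pricing APIs (for example, the AWS Price List API or GCP's pricing endpoints).

```python
# Sketch of region-aware task placement: pick the cheapest region for a
# non-customer-facing agent workload. Prices here are hypothetical.

HOURLY_PRICE_USD = {
    "us-east-1": 0.34,
    "us-west-2": 0.31,
    "eu-central-1": 0.38,
    "ap-south-1": 0.22,
}

def cheapest_region(prices, exclude=()):
    """Return the lowest-priced region, optionally excluding some regions."""
    candidates = {r: p for r, p in prices.items() if r not in exclude}
    return min(candidates, key=candidates.get)

print(cheapest_region(HOURLY_PRICE_USD))  # ap-south-1
```

The `exclude` parameter is there because some workloads do have constraints (data residency, latency to a dependency), and the placement logic should express them explicitly rather than default everything to the region the team happens to know.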

Advait Patel [00:19:34]: Here in this example is a basic flow diagram of how it works based on the requirements for the agent. That is how you can build your agent and make it smart. So what is the checklist before deploying the agents? Because you need to make sure that you are ready to actually push it into production, or into any environment. Even if you push it into a sandbox, you will still bleed cost. It's not like a sandbox or dev environment is cost-free.

Advait Patel [00:20:10]: That's not how it is. So before you push anything to production, go through this checklist. Let's start with CPU and memory limits: are they in place? Okay, done. Then the next item: are rate limits and retries configured? Is logging retention set, or is it dumping everything? You need to make sure that retention periods are set and log rotation policies are in place, so that logs are not stored forever and are rotated from time to time. Also, are budget alerts turned on?

Advait Patel [00:20:44]: And is there a kill switch? If you can't answer yes to all five, you are taking a big risk. These are the five checklist items that I make sure everyone is following, and then based on your requirements and use case you can add more items that can help you and your team save more cost. But these are the fundamental, primary checks that I usually look at. So what is the future outlook? Where are we headed, and what is next? I think we will see more FinOps agents that optimize themselves. I also hope to see AI, machine learning, and cloud ops teams start working much closer together. It's not one team's responsibility or another team's responsibility; it's a shared responsibility model where teams work together, share the tasks, and rely on each other. Agents will collaborate to make budget-based decisions for sure.
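
The five-item pre-deploy checklist above can be expressed as a simple gate in code, so the checks run in CI instead of in someone's head. The item names and dict shape are hypothetical; the point is only that a single "no" blocks the deploy.

```python
# Sketch of the pre-deploy checklist as a deployment gate.
# Item names are illustrative; any False answer blocks the rollout.

CHECKLIST = {
    "cpu_memory_limits_set": True,
    "rate_limits_and_retries_configured": True,
    "log_retention_set": True,
    "budget_alerts_on": True,
    "kill_switch_exists": False,   # the one people forget
}

def ready_to_deploy(checklist):
    """Return (ok, missing): ok only if every item is checked off."""
    missing = [item for item, ok in checklist.items() if not ok]
    return (len(missing) == 0, missing)

ok, missing = ready_to_deploy(CHECKLIST)
print(ok, missing)
```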

Advait Patel [00:21:59]: And we will start training agents using reinforcement learning, not just for performance but also for cost control. So basically I think that in the future, AI will be managing AI itself, but we need to get the foundations right first, and that is why those checklists and those root causes are very important to get started with. So what are the final thoughts before we log off today? To wrap up: AI agents are powerful, there is no doubt about it, but without boundaries they are very dangerous. You need to implement guardrails around them so that they don't go rogue and start doing crazy things. Most budget issues aren't tech issues; they are visibility and governance issues. With the right visibility and the right permissions, you can fix them. And the final thing: don't wait until an agent runs wild.

Advait Patel [00:23:09]: Always build guardrails now, not later, because smarter agents need smarter boundaries as well. So that's it from my side. Thank you so much for joining me in this session. I hope this gave you a few practical takeaways, and if you would like the slides or policy templates, or want to connect, I'm on LinkedIn. My LinkedIn handle is Advait Patel93. I appreciate your time and I'm happy to take any questions if there are any.

Demetrios [00:23:43]: Oh, there are questions. There are already questions in the chat, which is a good sign. I've got a ton of questions because this is one of the topics I am very passionate about, and I imagine people are going to keep asking questions now that we've finished the talk. First things first, the questions from the chat, and later I will be a bit selfish and ask my own question. A simple one for you: what is the source of that first graph that you shared, the hidden cost of autonomy?

Advait Patel [00:24:19]: It was the LLM. I just give the data to LLM and it was generated by them.

Demetrios [00:24:26]: Oh okay. And but where did you get the data?

Advait Patel [00:24:29]: I will share the source now.

Demetrios [00:24:32]: Share that in the chat. Next up, we've got a question from Tanmay. How do you think about using the two layer strategy? Number one, make the LLM not overthink via a prompt and number two an API layer which checks and notifies the LLM that if it's reaching the limit, that urgency and calmness also makes it think well plus it hard locks so the API is not triggered.

Advait Patel [00:25:11]: Probably. I, I for. I already forgot the first question because the second question was longer.

Demetrios [00:25:18]: Basically it's saying, hey, here's, here's a potential like design pattern, right? If I'm understanding the question correctly and it's saying that you've got these two layers. The first layer is just making sure that the LLM is not overthinking by stating it explicitly in the prompt. And then the second layer is you're guarding with an API and then you're notifying the LLM if it's reaching its limit of like thinking or doing too much, I think, and that's how I understand it. But for sure, if 10 may you want to correct me on anything, feel free to jump in the chat and say it.

Advait Patel [00:26:01]: Yeah, so that is why, that is where the context is important, right? You don't have, you don't, you don't want to remove the context from the LLM, but also you don't want to take the LLM use all those contexts even if they need. Right? So that is where you can use those threat throttlings and that is where you can use those rate limits to give the proper context to the LLM so that it doesn't think over. So that it doesn't think too much or doesn't overthink it. At the same time, it's not creating context every single time. So that's where those throttlings policies come in place. And coming back to the API question, the APIs are always tricky because basically as a human we can make decisions based on the situations, based on the, based on the different cases that, that are being thrown to us. Right? But LLMs are only doing stuff that they are asked to do. So basically if you let LLMs take all the decisions by themselves, they will, they will go rogue for sure.

Advait Patel [00:27:06]: And if you don't give them, if you don't make them smart, they won't give you proper, they won't give you proper answers or they won't do proper job for you as well. Right? So in this scenario, giving the, giving the right set of policies is very important for the, to handling the APIs for the LLMs.

Demetrios [00:27:29]: Excellent. Yeah, the Tanmay chimed in in the chat. I, I like that answer on like, yeah, API is, humans are smart, APIs maybe are, but it's a lot of work. And Tanmay was saying it's Actually an API that the LLM gets the limit of tokens. Tokens from. So you're kind of like blockading the tokens via API. So it gets focused and also urgency and calmness and makes it the most focused. One strategy.

Demetrios [00:28:06]: I think that's fascinating. The thing that I wanted to ask you or go into was if you have seen like. So one place that I've seen folks bleeding money is that when they go and have an agent scrape a website, they come back and 90 of the stuff that they, that agent has gotten off of the website is absolute, that you do not need to put in the context window. You've probably seen it. If anyone's dealt with agents and they have a web scraper or some kind of scraper, the majority of the stuff is from the DOM or from like the actual. The. The code is not useful. And so you have all this HTTP that you're looking at and you're like, well, how can I filter that out? Right? Because that those input tokens, those don't mean anything.

Demetrios [00:29:09]: And so what have you seen work to almost be like, it will catch the garbage before even going to the LLM maybe. Or have you seen anything like that? I wonder about that.

Advait Patel [00:29:25]: Yeah, definitely. And it happens all the time. It's not like just the web scrappers, but again, if you look at any single thing, because LLMs, they tend to grab whatever we have, whatever we have designed them to design them in the back end, right? So they don't know this is garbage and this is not the garbage. Basically they are like, okay, I need to scrap everything and I will need to inject everything, right? But here in this case, you can set up the guardrail, you can set up the policy where you can ask, where you can define things like this. If you see X, Y and Z, you can mark it as garbage and you can avoid it and not to move forward with the next step. So that's, that's what you can try to control by setting up the right policy. Otherwise, if you don't set up the policy, it will take. It will take up everything, everything and it will go to the next step and it will inject it into the LLM.

Demetrios [00:30:18]: Dude, this was great. As Keith in the chat said, you're. Or sorry, Keith said something. But also we've got somebody in here in the chat, who was it? Clarence was saying this was his favorite talk of the day. And you just saved a bunch of people a bunch of money. So good on you.

Advait Patel [00:30:37]: Thank you so much. It's like learning, learning things, gaining experience, but at some cost.

Demetrios [00:30:45]: Yeah, hopefully not too much. That's why it's great to learn from you. Because if we don't have to also assume that cost, then we'll just let you take it. So come back when you have more of your learnings and we can continue this.
