Cost Optimization for Multi-Agent Systems // Mohamed Rashad // Agents in Production 2025
SPEAKER

Since the beginning of my career in 2014, I've been passionate about building products. Over the past 11 years I've built and contributed to AI-driven and data-centric products at all scales, from Fortune 500 companies to one-man startups, solving a range of problems in Telecom, Finance, Agriculture, Manufacturing, Aerospace, HR, and Robotics, among many others, working side by side with founders, decision-makers, and product teams to bring ideas to life, with a focus on developing high tech and delivering real value to business.
This led me to create Hyperion AI, an AI + Data studio focused on helping companies of all sizes experiment with AI by providing Rapid Prototyping Services, Consultations, Fractional CTO Expertise, and Professional Training.
You can view me as a T-shaped person, interested in interdisciplinary and multidisciplinary work; I look for the links between different disciplines and integrate them into a solution, trying to make this world better by applying engineering and spreading knowledge, from tech strategy to low-level coding, passing through system design, team management, and more.
SUMMARY
In the fast-advancing world of AI systems, cost optimization isn't a luxury; uncontrolled costs can break your company. As these systems scale and their complexity grows, keeping expenses in check across infrastructure, model serving, latency management, and resource allocation becomes increasingly challenging. In this talk, we'll go through the critical challenges surrounding cost management in multi-agent environments. We'll dive into real-world solutions, highlighting hybrid deployment strategies (cloud and on-prem), precise cost-tracking tools, and resource optimization tactics that have delivered significant savings without compromising performance. We will also cover real use cases from our experience over the last decade and actionable insights drawn from practical experience.
TRANSCRIPT
Mohamed Rashad [00:00:08]: So, to use our time in the best way, let's just move forward. When you're building a multi-agent system, it's a little bit different from a normal single-agent system, especially in the way you keep track of everything happening. Which is, by the way, very similar to when people moved from monolithic software to microservices and everything changed: costs changed, infrastructure changed, and so on. So after figuring out how to architect your agent, or agents in this case, and how to build the whole pipeline around it, the main thing we're trying to figure out is how to calculate the cost correctly. Especially when you have different agents communicating with each other and with the user, with variable input and variable output, it gets harder to track, because costing normal LLM usage is already hard. So we actually have five main cost drivers that we need to figure out. The first one is our LLM API calls.
Mohamed Rashad [00:01:28]: This is usually the most straightforward, the most deterministic part. You have, let's say, 10 or 20 calls in one run, and from that you can figure out how much the model is costing you. This also varies based on the model tier you're using: using o3 is different from using GPT-4o on OpenAI, or using Claude models, or using a Llama model hosted somewhere else. The number of tokens you're sending also makes a difference, as does the call frequency, how many calls you're making to this LLM during your run; all of these shape your cost. Next to it is the inter-agent communication. If your agents are communicating with programmatic messages, so there is no LLM involved, it's easy to calculate the cost.
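A minimal sketch of the per-call cost math described above. The model names and per-million-token prices are illustrative assumptions, not actual provider pricing:

```python
# Rough per-call LLM cost estimate from token counts.
# Prices and model names below are hypothetical placeholders; real
# per-1M-token prices vary by provider, model tier, and over time.

PRICE_PER_1M_TOKENS = {
    "premium-reasoning-model": (10.00, 40.00),  # (input USD, output USD), assumed
    "general-purpose-model": (2.50, 10.00),
    "small-cheap-model": (0.15, 0.60),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single LLM API call."""
    in_price, out_price = PRICE_PER_1M_TOKENS[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: one run with 20 calls of ~1,500 input / 400 output tokens each.
run_cost = sum(call_cost("general-purpose-model", 1_500, 400) for _ in range(20))
print(f"Estimated run cost: ${run_cost:.4f}")
```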
Mohamed Rashad [00:02:18]: But if your agents are communicating with natural language, then you have another hidden set of LLM calls to account for in your calculation. On the other side, there is tool usage in general. Some tools can be free if you're building your own function or your own database call or the like. But if you're using external tools, let's say SerpAPI to do a web search or something similar, that adds another layer of cost, and you need to figure out how many times your agents are calling these tools for every run. Then, not for everyone, but if you're hosting your own LLM, there is another cost involved; if you're running it on your own server, there is another cost involved. If you're using a platform for your agents, like a managed online solution, it's probably a fixed cost, but if not, then you have your own hosting cost to figure out.
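A sketch of counting tool invocations per run so external-tool charges can be added on top of the LLM cost; the tool names and per-request prices are hypothetical:

```python
# Track how many times each tool is called in a run and price the external ones.
# Tool names and per-request prices are assumptions for illustration.
from collections import Counter

TOOL_PRICE_USD = {
    "web_search": 0.005,        # a paid search API charged per request (assumed)
    "internal_db_lookup": 0.0,  # your own function / database call, effectively free
}

class ToolCostTracker:
    def __init__(self) -> None:
        self.calls: Counter[str] = Counter()

    def record(self, tool_name: str) -> None:
        self.calls[tool_name] += 1

    def total_cost(self) -> float:
        return sum(TOOL_PRICE_USD.get(tool, 0.0) * n for tool, n in self.calls.items())

tracker = ToolCostTracker()
for _ in range(7):
    tracker.record("web_search")
tracker.record("internal_db_lookup")
print(f"Tool cost this run: ${tracker.total_cost():.3f}")
```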
Mohamed Rashad [00:03:20]: And finally, not usually for small systems or internal tools, but if you have a system that deals with millions of messages per day, it's time to also think about your data ingress and egress, and how to calculate data storage and data networking. All of this adds to the cost, especially when you have multiple agents doing this in parallel. The most complex part is when things compound: they compound because of the number of cascading agents doing all of this, and it can make costs grow exponentially compared to other, more conventional systems. For this we'll very briefly cover a few ways of doing cost monitoring and some strategies to fix your cost. We start with granular API logging, for example using LangSmith, Langfuse, or any similar tool, where for every call that happens you know where it came from, when it happened, which agent and which task it happened in, the model used, the input tokens, and the call duration, how many seconds of latency, because latency also has its cost as we move forward. The second is having your cost dashboard, which you can build purely custom using Grafana or Datadog, or you can get it from a specific LLMOps tool like LangSmith or Langfuse, which are basically the most commonly used.
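A sketch of what a granular per-call log record could look like if you roll it yourself rather than, or alongside, a tool like LangSmith or Langfuse; the field names and the file-based sink are assumptions:

```python
# One JSON-lines record per LLM call: who made it, in which run and task, with
# which model, how many tokens, and how long it took. Swap the file sink for
# your metrics backend (Grafana, Datadog, ...) as needed.
import json
import time
import uuid
from dataclasses import asdict, dataclass

@dataclass
class LLMCallRecord:
    run_id: str        # which user interaction / run this call belongs to
    agent: str         # which agent made the call
    task: str          # which task it happened in
    model: str         # model used
    input_tokens: int
    output_tokens: int
    latency_s: float   # call duration; latency has its own cost downstream
    timestamp: float

def log_call(record: LLMCallRecord, path: str = "llm_calls.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_call(LLMCallRecord(str(uuid.uuid4()), "researcher", "web_summary",
                       "small-cheap-model", 1200, 300, 1.8, time.time()))
```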
Mohamed Rashad [00:04:56]: The main things you need to figure out are the total cost over time, the cost per agent, and the cost per model, because you can have inexpensive agents but expensive models; also the cost per task or per user interaction, like one run, and the token consumption trend, to see if there are trends developing here and there. On the other side, for cost optimization, I will very quickly cover two approaches. The first is model tiering. Not all tasks need your o3-mini model, not all tasks need that. Based on the task being done, which can be a reasoning task, a summarization task, a simple intent classification, and so on, you can choose the model and also make it dynamic. So you can have an orchestrator layer which chooses the model based on the nature of the task, instead of just having a single string with your model all over the place. You can also have a cascade of models, where you start the first part of the task with a cheaper model, let's say the data-cleaning part, and then push only the high-value information into your expensive model.
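A minimal sketch of an orchestrator-level model router along these lines; the task categories and model names are placeholders:

```python
# Choose the model tier from the nature of the task instead of hard-coding a
# single model string everywhere. Task types and tiers are illustrative.

MODEL_BY_TASK = {
    "intent_classification": "small-cheap-model",
    "summarization": "general-purpose-model",
    "code_generation": "premium-reasoning-model",
    "reasoning": "premium-reasoning-model",
}

def route_model(task_type: str, default: str = "general-purpose-model") -> str:
    """Pick the cheapest model tier considered adequate for this task type."""
    return MODEL_BY_TASK.get(task_type, default)

print(route_model("intent_classification"))  # -> small-cheap-model
print(route_model("code_generation"))        # -> premium-reasoning-model
```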
Mohamed Rashad [00:06:07]: So the number of tokens going into the expensive model gets smaller. For example, you can use Claude Sonnet for step one and Claude Opus for step two, and you decrease the number of tokens coming into Claude Opus. On the other side, you can do prompt engineering for token efficiency, which is actually not always effective. But if you can cut the redundant instructions or the redundant few-shot examples and so on, because many models don't need them now, and constrain the output format, with JSON or something similar, to cut redundant tokens, that becomes very important for decreasing cost as you move forward. Especially if you have, say, 20 prompts over five agents, each being called a few times over the lifecycle of a task.
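A sketch of the two-stage cascade idea, where a cheaper model condenses the input before the expensive model reasons over it. `call_llm` is a hypothetical stand-in, not a real SDK function, and the model names are placeholders:

```python
# Two-stage cascade: the cheap model extracts the high-value information so the
# expensive model sees far fewer tokens. Replace `call_llm` with your actual
# provider client wrapper.

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with your provider SDK call")

def cascade(raw_input: str, question: str) -> str:
    # Step 1: cheap model cleans / condenses the raw input.
    condensed = call_llm(
        "small-cheap-model",
        "Extract only the facts relevant to this question, as terse bullets.\n"
        f"Question: {question}\nInput:\n{raw_input}",
    )
    # Step 2: expensive model reasons over the much shorter condensed context.
    return call_llm(
        "premium-reasoning-model",
        f"Answer the question using only the notes below.\nQuestion: {question}\nNotes:\n{condensed}",
    )
```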
Mohamed Rashad [00:07:08]: And finally, what is usually the hardest part: caching. Implement a cache for the responses or for the inputs so you don't resend the whole context for every call, and don't resend the whole result to the next layer, and so on. Caching is not easy in general. You can start with a very basic cache managed by yourself and later move to specialized ones. But generally, this is one of the most effective approaches if you can do it; and if not, you can still rely on prompt engineering and model selection. There are two more extra things which may be interesting for more advanced cases. The first is optimizing the inter-agent communication.
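A concrete version of the basic, self-managed cache mentioned above, keyed on the model and prompt. It only helps when identical requests repeat; prompt-prefix or semantic caching needs more machinery:

```python
# Very basic exact-match response cache managed by you. An identical
# (model, prompt) pair is served from memory and bills no new tokens.
import hashlib
from typing import Callable

_cache: dict[str, str] = {}

def cached_call(model: str, prompt: str, call_fn: Callable[[str, str], str]) -> str:
    key = hashlib.sha256(f"{model}::{prompt}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # cache hit: skip the API call entirely
    response = call_fn(model, prompt)
    _cache[key] = response
    return response
```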
Mohamed Rashad [00:07:56]: Try as much as you can to do the communication with JSON, and as much as you can without natural language, for example using regex or something similar. Also, if you're passing a huge context between models, try to summarize it: put it in a vector database, put it in some kind of memory, so you don't pass all the tokens along with the next request from the neighboring agent. And finally, something I left until the end because it differs from case to case: usually your task doesn't need this number of calls. The idea of task pruning, or task-level optimization, decreasing the number of calls or the number of steps your agent needs, can have a big impact on cost. If you really get your prompting and the workflow structure right, you can decrease the cost by decreasing the number of requests you make overall, because the same task can be implemented in fewer steps.
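A sketch of a structured inter-agent message: small JSON fields plus a reference into shared memory instead of free-form prose; the field names and memory key are made up for illustration:

```python
# Structured agent-to-agent message: compact JSON fields instead of natural
# language, and a pointer into memory / a vector store instead of the raw
# context tokens. Field names and the memory key are illustrative.
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    action: str                 # e.g. "summarize", "lookup", "report"
    payload: dict               # small structured fields, not prose
    context_ref: Optional[str]  # key into a vector store / memory, not raw text

msg = AgentMessage(
    sender="researcher",
    recipient="writer",
    action="report",
    payload={"topic": "Q3 infrastructure costs", "findings": 3},
    context_ref="memory/run-42/summary",  # hypothetical memory key
)
wire = json.dumps(asdict(msg))  # cheap to parse; no extra LLM call needed to interpret it
```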
Mohamed Rashad [00:08:59]: So this is more of a pure optimization of your tasks. So yeah, this is basically what I wanted to share, very quickly, about these topics. I would be happy to share this presentation and also to have questions and discussions about it. And yeah, that's it from my side.
Adam Becker [00:09:19]: Nice. Thank you very much, Rashad. I believe we might not have much time for all that many questions now; I'll give it another couple of minutes to see if anybody drops anything in the chat. Until then, can you put up the slides again where you talk about the model tiering? Yeah. So you spoke about perhaps having an orchestrator or a coordinator that can tell you: okay, now you can break down a particularly complex prompt into multiple prompts, or, instead of pinging an existing model that might be a little bit overkill, go to smaller models that are less expensive but can nevertheless get the job done.
Mohamed Rashad [00:10:12]: Right?
Adam Becker [00:10:12]: Am I understanding you right?
Mohamed Rashad [00:10:14]: Yes, yes, you can actually. It's usually the second. So let's say your agent system does summarization, some code generation, and something else like intent prediction. Some tasks, like intent prediction, can be done accurately with a very small model. But code generation needs a bigger model. If you have a layer...
Adam Becker [00:10:36]: But I want to zoom into that "accurately" piece, because my impression, and if anybody wants to comment in the chat, please do, is this: let's say I do multiple operations. To what extent do I feel comfortable dropping from a larger model to a smaller model that might be a little more specific? Sure, it might save me a little bit of money, but am I going to pay for those savings with performance? And do you believe that, in order to get people to feel comfortable doing it, we have to invest in proper evaluation layers?
Mohamed Rashad [00:11:16]: Not only for this; I'd say for any broader production-level agentic or multi-agent system you need to invest in proper evals. It's the only way you can move between models, and the only way you can even update your prompts, your tool calling, and so on. And this is a place where many teams are lacking, because it's also new. Evaluations for LLMs are very, very different from evaluations for older ML systems, if you did something before 2022. So it's not that people aren't doing it; it's that doing correct, robust evals is actually very hard.
Adam Becker [00:11:58]: Yeah.
Mohamed Rashad [00:11:59]: So I was about to tell you, this is a very tricky task, choosing models and moving between them. And as you said, you need robust evals in order to do it well. Yeah.
Adam Becker [00:12:10]: Fortunately for us, we're going to have a lot of folks today talking to us about exactly that: how they do evaluation and some ideas for it. And it's just so interesting how it isn't just about improving your product now, it's also about saving you money. Rashad, thank you very much for joining us.
