MLOps Community

From Guesswork to Greatness: Systematic AI Agent Optimization in Production // Nimrod Busany // Agents in Production 2025

Posted Jul 25, 2025 | Views 62
# Agents in Production
# AI Optimization
# Traigent

SPEAKER
Nimrod Busany
Founder & Chief Scientist @ Traigent

Nimrod is the Founder and Chief Scientist at Traigent, where he’s building tools that help ML teams debug and improve AI agents using data-driven methods instead of trial-and-error. Before this, he led AI engineering efforts at IBM and Accenture, deploying production ML systems for Fortune 500 companies. His background spans both sides of the AI/software divide—using AI to automate development and applying software engineering principles to make AI systems more reliable.

He holds a PhD in Computer Science and has published and spoken widely on building scalable, dependable AI, including invited talks at leading software engineering conferences and prestigious institutions such as MIT and CMU.


SUMMARY

Every engineer building AI agents has experienced it: you tweak a prompt, swap out a model, or adjust a RAG setting—only to find it either worsens the agent or improves one aspect while breaking another. Why does this happen? Because teams typically test just one configuration out of countless possible combinations, hoping for the best. Current evaluation tools are built for single-point assessments, not the extensive multi-dimensional comparisons that real-world scenarios demand. Sure, you might be able to A/B test two prompts or select from a few models, but exploring hundreds of configurations across dimensions like cost, latency, and accuracy simultaneously is nearly impossible. In this talk, we'll demonstrate how adopting a structured approach to testing alternatives can significantly change outcomes. Leveraging concepts from multi-objective optimization, we’ll illustrate how Traigent's SDK and UI empower engineers to allocate their testing budgets effectively. Traigent intelligently identifies and explores promising configurations, highlighting optimal tradeoffs. You'll learn how this methodology can yield quality improvements of 4–7x and reduce costs by up to 90%, all without resorting to guesswork or manual trial-and-error.


TRANSCRIPT

Nimrod Busany [00:00:00]: Thank you so much for having me here. In the talk today I'm going to talk about how we, as AI practitioners and engineers, many of us coming from the software engineering realm, can turn much of the guesswork we do when we're building our agents into something more systematic, more like how we're used to building things and optimizing them for production. A bit about myself. I've been between academia and industry for the last 15 years. I did my PhD in computer science and continued to be an active member of the software engineering and AI community, publishing papers, patents and so on. So I do have a lot of passion for that. But at the same time I also enjoy building things. So for the last 12 years I've been working at IBM Research and then Accenture Labs.

Nimrod Busany [00:01:18]: In Accenture Labs I led a few AI teams, and as part of this I realized that we have a problem. That problem is what led me to Traigent, which is my attempt to solve it. So, have you ever found yourself trying to build, let's call it an agent: writing some script with a few commands, or using the OpenAI SDK directly, and you have your prompt, and that prompt includes instructions on how the LLM should answer or help the user or extract information from a context or a file that you gave it, and so on. You have this templated code, basically. And then you ask yourself: which model should I use, or what temperature should I use? Depending on the vendor, there are quite a few other options for you to control. And when we think about the prompt, the prompt is not that trivial.

Nimrod Busany [00:02:36]: It has multiple parts. There is the retrieval part, there's the prompting strategy, a few other selections like style, and maybe you also want to include a role. So when we think about it, it feels like there's quite a lot of guesswork here, and wouldn't it be amazing if, instead of guessing, we had an easy way to get, at the click of a button, a report that actually compares the different alternatives for us on different objectives? Because we hardly ever have a single measure: there's accuracy, there's runtime, there's cost, and accuracy itself is oftentimes not a single measure but maybe multiple ones. So we're after something like this: taking existing code and helping engineers make educated decisions instead of guesses. Now, what does the current process of developing an AI agent look like? This is maybe a typical scenario that I saw, but I'm not saying it's the only scenario. In this typical scenario, we have an engineer who gets a request to try and improve an agent. Improving the agent could mean reducing the cost, improving accuracy, improving response time and so on. The way things usually happen is that the engineer gets that instruction, then tries to make tweaks and hopes that those tweaks will work. The tweaks are trying to improve multiple measures most of the time, so maybe improve performance while maintaining the cost.

Nimrod Busany [00:04:38]: And oftentimes, especially when we're talking about fixing and improving the agent, we're approaching it with a bias. We saw a few examples that didn't work in production, so we say, okay, let's improve the prompt, and then we try to evaluate. What I see many people do is check the fix against that example and maybe a few others that are very similar. The implication is that we oftentimes fix a single area and break other areas, or we just hurt the overall task that we had. So different, conflicting measures start to fluctuate. And in general, what we'll often be doing is testing a single combination at a time.

Nimrod Busany [00:05:30]: To do more than that, we actually need a methodology, an educated way of exploring the different options, the different design choices that we have. Now, this might initially sound like a trivial problem: let's just write for loops and try out different combinations. That's not going to work. Why? Exponential complexity. If you just have two models, three prompts and two temperatures, that's 12 combinations, and if you run an evaluation set on each, then maybe that's manageable. If you increase the amounts a little bit, you get 60.

Nimrod Busany [00:06:19]: If you add top K, and if you're using few-shot examples with a few values, you're already at 1,200 options, and it can explode dramatically. What we're seeing is that the number of design choices is just increasing by the day. So this approach is not going to be very practical if we approach it naively. And there's a bonus question: do we know how to tackle this problem? The answer is yes, to some extent. In software engineering we have had techniques like combinatorial test design, where we try to design a test suite that exposes bugs without having to go over every possible test. But it wasn't really designed for the type of problem we're seeing here. So we could maybe borrow ideas, but it's not exactly what we need.
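
To make the explosion concrete, here is a small sketch (with hypothetical dimensions and values, not the actual space from the talk) that enumerates a modest configuration grid:

```python
# Illustrative only: hypothetical configuration dimensions, not Traigent's actual schema.
from itertools import product

config_space = {
    "model": ["gpt-4", "gpt-3.5-turbo", "claude-3-haiku"],
    "prompt_style": ["concise", "instructive", "code-like"],
    "temperature": [0.0, 0.3, 0.7],
    "top_k": [1, 3, 5],
    "num_few_shot": [0, 2, 5],
    "error_correction": [False, True],
}

all_configs = list(product(*config_space.values()))
print(f"{len(all_configs)} combinations to evaluate")  # 3*3*3*3*3*2 = 486
```

Every extra dimension multiplies the count, and each combination means a full evaluation run, not a single call.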

Nimrod Busany [00:07:17]: I want to make it even more practical. At Accenture, I was leading a team that worked for a few months on a text-to-SQL agent, basically taking a question and turning it into SQL. This was a year and a half ago, and models weren't as effective as they are today. There was a competition called the Spider competition. We did quite a lot of literature review and came up with an agent that we thought was really good. Here in this diagram you can see a few of the different dimensions that we had. Style could be code-like, could be instructive, could be concise; that's how we're describing the task and the schemas. Then there are the selection strategies.

Nimrod Busany [00:08:11]: When we're basically trying to ask the LLM to come up with an answer to a question, we can include examples, so we can choose examples from a ready-made dataset with quite a few different techniques, and there's the number of examples that we put in; again, we need to make some choice. Error correction: we thought adding a second round of fixing an issue, if the query was incorrect, would be an effective way to get queries to work. And then all the other parameters that you typically see in an LLM: which model, which temperature, the top K and so on. Okay, so what happened to us is that after a while we reached a plateau.

Nimrod Busany [00:09:09]: We got to a pretty nice place in that competition, but we weren't able to improve it. And I just put in here the number of combinations so that you'd see why it was quite complex to try and improve. We tried to tweak, and we tried a few other combinations, but none of them really helped us improve the system significantly. Now, when we're talking about the space, it's not only that it's big, but it also has a lot of dependencies. Some dependencies are obvious when it comes to your objectives: if I want to improve performance, I would take a smarter, more expensive model, but that would cost me more. So there are dependencies when it comes to my objectives, but there are also a lot of dependencies when it comes to the configurations themselves.

Nimrod Busany [00:10:02]: For example, if you choose zero-shot, then the number of examples K is not something you're going to choose. So there are also quite a few combinations that just don't make sense. These are things we're going to have to account for if we want to explore the space we have in order to find something more effective. Again, when we think about this exploration problem, but this time from a machine learning perspective, optimizing hyperparameters is something we know, right? So maybe we can just do hyperparameter tuning on the problem. Well, there are a few problems with that approach, and we'll get to them in a few slides.
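
One way to picture those configuration dependencies is as validity constraints that prune the raw grid before any evaluation budget is spent; the sketch below uses made-up dimensions to illustrate the idea:

```python
# Sketch: encoding dependencies between configuration choices so invalid
# combinations (e.g. zero-shot prompting combined with an example-selection
# strategy) are pruned before any budget is spent on them.
from itertools import product

config_space = {
    "num_few_shot": [0, 2, 5],
    "example_selection": [None, "random", "similarity"],
    "model": ["gpt-4", "gpt-3.5-turbo"],
}

def is_valid(cfg: dict) -> bool:
    # Zero-shot means there are no examples to select, so any selection
    # strategy other than None is meaningless.
    if cfg["num_few_shot"] == 0 and cfg["example_selection"] is not None:
        return False
    return True

candidates = [dict(zip(config_space, vals)) for vals in product(*config_space.values())]
valid = [c for c in candidates if is_valid(c)]
print(len(candidates), "raw combinations,", len(valid), "valid")  # 18 raw, 14 valid
```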

Nimrod Busany [00:10:54]: But let's just conclude everything we talked about so far. We want to optimize our system, we want to find the right configuration, and we have quite a big configuration space that is not trivial to navigate. Now the last remaining question is: does it really matter? Who cares? Maybe it's only about the model. In a paper that we published, we actually show that this is far from true. In fact, if you think that paying more is going to give you more accuracy, that's rarely the case.

Nimrod Busany [00:11:33]: What we showed in the paper is that if we think just about accuracy versus cost and we place random configurations of those choices from the previous slides, we see that many of the configurations are inefficient. That means there is a better configuration for the money, one that gives you more accuracy. How much better? It turns out that because of the pricing models of LLM vendors, you can quite easily end up with a completely inferior configuration, where you basically pay a lot and get accuracy that is not very good. In our case, for example, after months of work we realized that we could have replaced GPT-4 with GPT-3.5 for the same accuracy, or actually improved performance by about 5%. That means going from number 16 in an international competition to number five, which is quite significant. So I hope I managed to convince you that this is a hard problem and there is some motivation. But the question is, do we need new tools, or can we do with what we have? A/B testing.
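
The notion of an "inefficient" configuration here is the usual Pareto one: some other configuration is at least as accurate and cheaper. Below is a minimal sketch, with invented numbers, of filtering a set of evaluated configurations down to the cost/accuracy-efficient ones:

```python
# Sketch: finding the cost/accuracy Pareto front among evaluated configurations.
# The configurations and numbers below are made up for illustration.
def pareto_front(results):
    """Keep configs not dominated by another that is cheaper and at least as accurate."""
    front = []
    for cfg, cost, acc in results:
        dominated = any(c2 <= cost and a2 >= acc and (c2 < cost or a2 > acc)
                        for _, c2, a2 in results)
        if not dominated:
            front.append((cfg, cost, acc))
    return front

results = [
    ("gpt-4 / 5-shot",    42.0, 0.81),
    ("gpt-3.5 / 5-shot",   4.5, 0.80),   # far cheaper, nearly the same accuracy
    ("gpt-4 / 0-shot",    30.0, 0.74),   # dominated: pricier and less accurate
    ("gpt-3.5 / 0-shot",   3.0, 0.69),
]
for cfg, cost, acc in pareto_front(results):
    print(f"{cfg}: ${cost:.1f} eval cost, {acc:.0%} accuracy")
```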

Nimrod Busany [00:13:03]: A/B testing is an approach to compare two different versions of a piece of software. It's not going to be effective here. Why? Because it compares two options at a time, or maybe a little more, but it's not going to give us the tools to really choose what we're going to try out. Let's move on to hyperparameter tuning. First of all, it's mainly built for machine learning experts. It requires us to define an objective function, and it requires a lot of work to set up the optimization: defining the constraints and so on. And it's not very well suited if your parameters are things like prompts, or other parameters that affect the agent, like temperature.

Nimrod Busany [00:14:05]: I mean, yeah, temperature is continuous, but do I really want to search over a continuous variable, or do I just want to try out a few options? Why? Because a naive approach is going to be very pricey. If we have to run an evaluation set over every configuration and we're not too careful, we can easily reach bankruptcy: thousands of configurations, each one costing you $20, $30, $40 to evaluate. It's not going to scale. And in addition to that, you're not really accounting for rate limits, quotas and so on. So these are the things that should help us understand why hyperparameter tuning is not the approach here.
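
A quick back-of-the-envelope calculation shows why the naive sweep gets expensive; all numbers below are assumptions chosen to roughly match the figures in the talk:

```python
# Back-of-the-envelope cost of a naive sweep: every configuration is run
# over the full evaluation set. All numbers are assumptions for illustration.
n_configs     = 1_200    # size of the configuration space
eval_set_size = 200      # questions per evaluation run
cost_per_call = 0.15     # average $ per LLM call (prompt + completion)

per_config  = eval_set_size * cost_per_call        # ~$30 per configuration
naive_total = n_configs * per_config               # ~$36,000 for the full sweep
print(f"Per configuration: ${per_config:.0f}")
print(f"Naive full sweep:  ${naive_total:,.0f}")
```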

Nimrod Busany [00:14:58]: We're going to have to pay a lot of attention to the configurations we're choosing and how much budget we're going to give them. So with Traigent, we're actually trying to do exactly that. We're trying to build a tool that is just right for this task: not a generic hyperparameter tuner, but one built for tuning LLM agents. And what we want to do is give software engineers or AI engineers who don't want to spend much time the ability to define as little as possible: basically an objective, which could be predefined, coming from Ragas or some other evaluation platform, and an evaluation set. If we can evaluate it, we can optimize it. That's the biggest thing I hope you take from this.

Nimrod Busany [00:15:47]: That's what got us to come up with Traigent. Let's stop just evaluating and start improving things, and let's try to do it with as little effort as possible. So we want to give an experience to engineers who just want to get the most out of what they can with as little effort. But we also want to give power users the ability to define things in a more refined manner. So the decorator that we have can be customized: you can define your own evaluators, you can define different strategies for sampling and so on. What is it that this single decorator helps you get? You define that decorator over a function.

Nimrod Busany [00:16:29]: That function includes your LLM code, you call the optimize function, the optimization runs an evaluation, but it does so effectively, in a smart way, and then you can either get the best configuration and the best score, or you can get detailed statistics. What have we seen so far as benefits of using Traigent? We saw improvements in agents: 20% accuracy gains, cost reductions of 40x, which might sound like a lot but just means moving from GPT-4 to GPT-3.5 or a lower model, and faster response times. But I think the most important thing for practitioners is that it just saves you a lot of time. At Accenture, I saw the amount of time we had to spend making our own scripts to try and do something like this. It's obviously not something that every one of us needs to solve individually, and hence the solution. Traigent's approach is the following: we are not exploring the entire space of combinations.
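
To make the "single decorator plus an optimize call" workflow concrete, here is a minimal, self-contained sketch of the idea. This is not the Traigent SDK (which had not been released at the time of the talk); it simply brute-forces a tiny search space to show the shape of the workflow:

```python
# A toy sketch of the "one decorator over your LLM function" workflow described
# above. NOT the Traigent SDK: it brute-forces the search space for illustration.
from itertools import product

def optimize(search_space, eval_set, metric):
    def decorator(fn):
        def run_optimization():
            results = []
            for values in product(*search_space.values()):
                cfg = dict(zip(search_space, values))
                score = sum(metric(fn(x, **cfg), y) for x, y in eval_set) / len(eval_set)
                results.append((cfg, score))
            best_cfg, best_score = max(results, key=lambda r: r[1])
            return {"best_config": best_cfg, "best_score": best_score, "stats": results}
        fn.run_optimization = run_optimization
        return fn
    return decorator

eval_set = [("2+2", "4"), ("capital of France", "Paris")]  # toy evaluation set

@optimize(
    search_space={"model": ["small", "large"], "temperature": [0.0, 0.7]},
    eval_set=eval_set,
    metric=lambda pred, gold: float(pred == gold),  # exact-match accuracy
)
def answer(question, model, temperature):
    # Placeholder for the real LLM call with a templated prompt.
    return {"2+2": "4"}.get(question, "unknown")

report = answer.run_optimization()
print(report["best_config"], report["best_score"])
```

A real tool would replace the exhaustive loop with a budget-aware search strategy, which is exactly the point made next.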

Nimrod Busany [00:17:36]: We're actually using different machine learning techniques to identify the most promising strategies as fast as we can, trying to waste as few examples, samples and LLM invocations as possible in the process. We try to prioritize the valid areas and do very efficient sampling across the different configurations and across the entire budget that we have for experimentation. What you as a user get is the ability to take the same budget you were planning to allocate to compare different models or whatnot, and get a much more effective exploration in less time. So basically, what you could get with a naive approach for $1,000, we try to achieve with our algorithms for much less, because of efficient sampling and so on. This obviously varies case by case, but what we've seen so far is that if you have multiple objectives, you can oftentimes reach significant improvements on those measures. We're just about to release our SDK, and the SDK is going to be something that gives value on its own. The idea is that we basically want to give the community a tool they can use to optimize their LLMs, and from there we will extend the service and incorporate a kind of SaaS offering in which we do the experimentation and exploration in our own environments and save a lot of effort for engineering teams. But still, we hope that with this SDK people will start experimenting with their agents and trying to optimize them.
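
For intuition, one generic way to spend an evaluation budget more carefully than a full sweep is successive halving: score every candidate on a small sample, keep the better half, and give the survivors a larger sample. This is a standard technique shown only for illustration, not a description of Traigent's actual algorithm:

```python
# Generic successive-halving sketch for spending an evaluation budget efficiently.
# `evaluate(config, sample)` is assumed to return a score for that config on the sample.
import random

def successive_halving(configs, evaluate, eval_set, start_n=8):
    n = start_n
    survivors = list(configs)
    while len(survivors) > 1 and n <= len(eval_set):
        sample = random.sample(eval_set, n)
        scored = sorted(survivors, key=lambda c: evaluate(c, sample), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]  # keep the better half
        n *= 2                                          # spend more on fewer configs
    return survivors[0]
```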

Nimrod Busany [00:20:01]: And if you want to join, we are just about to release a closed version to get initial feedback and understand what people really want to see. You can follow the link in this QR code and fill in a short questionnaire that will help us understand what people care about when it comes to optimizing, and how they want to see a tool like this inside their ecosystem of development. In addition, if you want to join the first early-bird teams who get to work with Traigent for free, just reach out to us and we would be more than happy to give you early access and hear what you think. Yep. So with that, I think.

Demetrios [00:21:03]: You did it.

Nimrod Busany [00:21:04]: Yeah.

Demetrios [00:21:09]: Yeah, it's perfect. And I like this a lot. There are some really good questions coming through in the chat, so many that I'm gonna just start. We've got like five minutes for questions, and I don't think we're gonna get through them all. A few people were asking: what type of optimizations are you using in Traigent, like gradient-based or evolutionary optimizations? Is it an improvement on the text of the prompt? Do you optimize chains of agents?

Nimrod Busany [00:21:49]: Amazing question. So actually some of them are on the roadmap, some of them are in process, and some of them are done. In the SDK we're going to release, we're going to support basically all of the popular optimization techniques, like grid search and Bayesian optimization. And you can customize it: if you have a ready-made algorithm you don't have to, but if you want to, you can extend the library. We're also developing algorithms specifically for this problem, and they will have guarantees. Basically, what we want to do is prove that for the money you spent, you got the most promising experiments that could be done. So we're actually working on these algorithms, so it's all custom. And regarding the question of optimizing the prompts: absolutely.
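
As an illustration of what Bayesian optimization over a discrete agent-configuration space can look like with an off-the-shelf library (here Optuna, purely as an example of the technique; `evaluate_config` is a placeholder for a real evaluation run):

```python
# Sketch: Bayesian-style optimization over a discrete agent configuration space
# using Optuna. Shown only as an example of the generic technique mentioned above.
import optuna

def evaluate_config(model: str, temperature: float, num_few_shot: int) -> float:
    # Placeholder: run the agent over an evaluation set and return accuracy.
    return 0.0

def objective(trial: optuna.Trial) -> float:
    model = trial.suggest_categorical("model", ["gpt-4o-mini", "gpt-3.5-turbo"])
    temperature = trial.suggest_categorical("temperature", [0.0, 0.3, 0.7])
    num_few_shot = trial.suggest_categorical("num_few_shot", [0, 2, 5])
    return evaluate_config(model, temperature, num_few_shot)

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=30)             # each trial = one configuration evaluated
print(study.best_params, study.best_value)
```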

Nimrod Busany [00:22:57]: The question is, how do we do that? The thing we're currently exploring, or considering, is that before we start the optimization, we take the prompt and try to optimize it with different techniques, so we have prompt options to choose from. It could be that we'll have a more dynamic approach where we tune the prompt as we go, but that is a little more complex to consider. So basically we can use off-the-shelf optimization techniques to produce prompt options; going beyond that is something we're going to have to think about a little bit more.

Demetrios [00:23:38]: Dude, I'm not going to lie, this is some very advanced stuff, and I think that's why I was so excited for this talk. There are two people chiming in that the QR code's not working for them. So if you have a link, throw that in our chat and then I'll drop it into the live stream chat. And in the meantime, while you get that, I've got an incredible question for you, but I'm gonna need all your brain power because Will brought the heat when he asked his question. All right, so I've got the QR code link and I'm gonna drop it in the chat, but I'm gonna ask this question first. The exponential complexity, hidden dependencies and trade-off balancing you mapped out are exactly what we've been wrestling with. We've been developing a symbolic control layer using prime harmonic modulation (PHM) to deterministically constrain the agent configuration space, especially for things like temperature, top K and tool usage.

Demetrios [00:24:51]: Instead of tuning millions of combos, we structure entropy from the start. Do you see a future where agent optimization shifts from empirical tuning to mathematically grounded initializations?

Nimrod Busany [00:25:11]: I think that's an amazing question, and I think initialization is an important question in itself. How do we initialize? Because currently we just guess. Then, how do we optimize, and could we achieve some guarantees? I think we can, because there is a lot of research about minimizing regret. But I'll tell you more than that. We published this paper about the problem, not the solution, with the University of Ottawa. I'm also a researcher at Tel Aviv University. We're actually working on a workshop that we want to introduce into ICSE, the most prestigious software engineering conference. So if you want to talk to us, we'd be happy to.

Nimrod Busany [00:26:02]: We think that this problem is going to require many different approaches, and if you have something interesting, let's talk. We would be happy to start and grow a community that tries to solve this really hard problem.

Demetrios [00:26:17]: Let's talk. Yes, get in touch. And if you're wondering how to get in touch, we've got the email right there on that last slide. This has been awesome. There are more questions in the chat; I feel bad that I can't continue with them, but I've got one job today and that is to keep us on time. So I'm going to ask everyone who's got all these questions coming through, and they're such great questions.

Demetrios [00:26:46]: Just reach out, send them an email, hit them up on LinkedIn and thank you so much for doing this, man. This was.

Nimrod Busany [00:26:53]: Thank you for this opportunity. I honestly truly appreciate it.

Demetrios [00:26:58]: Right on.
