From Tool Calling to CodeAct: Practical Lessons for Deploying Executable Agent Workflows // Gal Peretz // Agents in Production 2025
SPEAKER

Gal Peretz is the Head of Artificial Intelligence at Carbyne, where he leads the development of advanced AI solutions for public safety. Previously, he served as Head of AI & Data at Torq and held engineering and research leadership roles at Microsoft, IBM, and several startups. Gal holds both a Bachelor’s and Master’s degree in Computer Science from the Technion, specializing in Natural Language Processing, and has authored multiple papers accepted at top conferences. He is the founder of Apriori.ai, a GenAI and NLP consulting firm, and co-hosts LangTalks, a leading podcast on AI engineering. Gal is passionate about building agentic applications for real-world impact and leads a global community of over 6,000 AI developers.
SUMMARY
As autonomous agents transition from research labs to real-world production, the need for flexible, scalable, and robust tool use is more critical than ever. This talk introduces CodeAct, a new paradigm for agent architectures that enables Large Language Models to generate and execute Python code for tool interaction, moving beyond traditional JSON or function-calling approaches. Drawing from hands-on experience with LangGraph and open-source libraries, I will share practical lessons, deployment strategies, and common challenges when integrating CodeAct into production agent systems. Attendees will see live examples, learn how CodeAct can streamline complex workflows, and leave with actionable insights for building more transparent, auditable, and powerful AI agents in production environments.
TRANSCRIPT
Gal Peretz [00:00:00]: So hello everyone, I'm Gal Peretz. I'm Head of AI at Carbyne, where we develop AI voice agents for emergency response: think 911, medical emergencies, and things like that. I'm also co-host and co-founder of LangTalks, which is a podcast and a big community of 6,000 AI engineers. Today we're going to talk about some interesting stuff, because I guess most of you are familiar with the traditional JSON-based tool-calling paradigm, and today I'm going to show you a different one. I think it's super interesting, because this one is going to shake up the JSON paradigm that was initiated by OpenAI and others. So let's dive in. Okay, so this is the agenda. We're going to show how traditional tool calling works, the whole agents-and-loop flow. Then we'll ask why reinvent the wheel at all and try a new approach.
Gal Peretz [00:01:24]: Then we'll introduce the new approach, which is called CodeAct, and I'll walk you through the technical aspects of it. Finally, as with any data science workflow, we want to understand what's in it for us, so we'll show the results, and we'll also talk about the key takeaways from this presentation. So how does traditional reasoning and acting actually work? Usually we have an LLM. It can be OpenAI, it can be open source, it can be Anthropic, and so on; most of them already implement this kind of JSON tool calling. So you just give it the user task, right?
Gal Peretz [00:02:17]: With the, all the, you know, OpenAI schema. Also if it's like MCP, it doesn't really matter. You still need to add this OpenAI schema to the LLM call and then this LLM just output a tool call intent, right? It's not the execution itself, it's just the intent of tool calling in a format of a JSON and then you hand it to the execution or we are actually going to execute it. We are responsible to call the specific function that AI want us to call with those specific properties. Then the output goes to the LLM and the loop goes on and on. This is an example. Let's say that we want to search a specific property price for a specific phone, but we want to search it in in several countries and to know what are the cost efficient country that we want to buy from. So the first is think, right? The LLM thinks what is going to be the intent of the user then can Understand? Okay, I need to call a tool.
Gal Peretz [00:03:24]: So it outputs JSON with the name of the tool. It may look a little different from vendor to vendor, but it's essentially the name of the tool and the parameters. Then we are going to change the state of the environment: we execute the function, change the state, hand the result back to the agent, and so on and so forth. It can call one tool, but it can also output a few tool-call intents at once, and we execute them in parallel. Okay, so what's wrong? It looks like it's going to work, and this actually powers a lot of production-ready solutions right now as we speak.
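A minimal sketch of the loop just described, using OpenAI's function-calling API. The phone-price tool, its stub values, and the model choice are illustrative additions, not from the talk.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def get_phone_price(model: str, country: str) -> float:
    """Illustrative stub; a real tool would query a pricing API."""
    return {"US": 999.0, "DE": 1049.0, "IL": 1150.0}.get(country, float("nan"))

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_phone_price",
        "description": "Price of a phone model in a given country (USD).",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {"type": "string"},
                "country": {"type": "string", "description": "e.g. US, DE, IL"},
            },
            "required": ["model", "country"],
        },
    },
}]

messages = [{"role": "user",
             "content": "Which is cheapest for an iPhone 16: US, DE or IL?"}]

while True:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:                # no more intents: final answer
        print(msg.content)
        break
    messages.append(msg)                  # keep the intent in the history
    for call in msg.tool_calls:           # the LLM only emits intents;
        args = json.loads(call.function.arguments)  # we do the executing
        result = get_phone_price(**args)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
```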
Gal Peretz [00:04:10]: So what's wrong with that? Let's talk a little bit about what doesn't work. The first one is when we want to express branches and loops: if branches, loop branches. Let's take, for example, this task: get all users, and for each active user, deactivate their account. So what's going to happen? Let's say we have a tool called get_all_users. The LLM is going to ask us to execute this tool, we get the list of users, and we hand it back into the LLM context window. Then the LLM is going to say, okay, now execute, let's say we have a deactivate function, execute the deactivate function a few times with different parameters. If we have five users, that's going to work just fine.
Gal Peretz [00:05:04]: But what if we have 10,000 users, or 100,000 users? I can assure you it will fail, because of a lot of problems, like the lost-in-the-middle effect: the LLM attention mechanism cannot attend to that amount of users or information, and it will call the deactivate function for a few users but will surely miss a lot of them. The next point is the fact that those OpenAI models, and those of any other vendor, were fine-tuned to output, and also to consume, a specific JSON schema. What's wrong with that? If we have a fine-tuned model, we need a fine-tuning pipeline. We need to get the data, construct a loss, and fine-tune it, and then we get new versions from time to time.
Gal Peretz [00:06:08]: We need to refine to unit with other action that we will make sure that it will output the correct output according to the format. So basically it's a pipeline, right? But the fact that it's specific pipeline for a specific task compare for example to model the generate codes. So you can be certain that for example the pace of improving those code generation models is Way faster than improving those fine tuned LLM that should output those jsons. Another aspect is that it's very hard to represent nesting objects. Types. Nesting types, any sort of that. It's very hard. So let's take an example.
Gal Peretz [00:07:07]: We have these Pydantic types: we have a Company, we have departments, which is a list of Department, each department also has employees, and each employee also has an address. How are we going to represent that in JSON? With the OpenAI schema, it looks something like this. It's a pretty nasty representation, and it's not just nasty for us; it's going to confuse the LLM too. And think about it: this is just the parameters of the functions. We also have the context, we also have the intent, and maybe we also have the history of the conversation. So it can create a lot of noise.
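The nested-types example from the slide, reconstructed as a runnable sketch. The class names follow the talk; the individual fields are assumptions. Printing the schema shows how noisy the JSON form gets compared to the plain class definitions.

```python
import json
from pydantic import BaseModel

class Address(BaseModel):
    street: str
    city: str

class Employee(BaseModel):
    name: str
    address: Address

class Department(BaseModel):
    name: str
    employees: list[Employee]

class Company(BaseModel):
    name: str
    departments: list[Department]

# Dozens of lines of "$defs", "properties", "items" and "required",
# all of which land in the LLM context when this is a tool parameter.
print(json.dumps(Company.model_json_schema(), indent=2))
```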
Gal Peretz [00:07:58]: So it can create a lot of noise. Another aspect is how we can represent composability. What does it mean when we have a function and we have the output of this function, we want to take this output as an input to another function. This is very hard in the current mechanism that we have because when we execute, for example, they get all users for once we don't have the output. Those OpenAI schema only represent the input. They represent the parameters that you need to input. But we don't have any idea what's going to be the output. Or at least the LLM doesn't have any idea.
Gal Peretz [00:08:47]: Basically what we need to do when we have the output from get all user we need to hand it back to the LLM to decide what you want to extract from it, to put it as an input in the compute happiness function and so on and so forth, right? It's very hard to represent composability like that. And the last one is basically the action space, right? What does it mean, the action space? So in the traditional aspect we have function that we define, maybe we have mcps and so on and so forth. But anything that we haven't think of from the right perspective. For example, let's take this task, tell me how many active users we have in our system. Let's define that maybe you model the get all active users because you thought that this agent going to serve user request and it's going to need to understand who are the active users. But then this specific task also require you to aggregate those users, right? So what's going to happen is basically that we again we get a list of active users and then we will hand it back to the LLM and count or trust on him to actually use his reasoning capacity, right, to aggregate those and out back the output of how many people that are there, right? So and I think that I would prefer that OpenAI will show me or any other vendor that they can count very well. They are in Strawberry first and then I will count on them to count stuff in production. So I hope this gives you an idea why we need a new paradigm.
Gal Peretz [00:10:47]: And this new paradigm is called CodeAct. The idea, the notion, is very simple: instead of outputting JSON, let's write code. LLMs are great at writing code, so let's use that. The flow, at first, is very similar. We plan, we think: we need to understand the intent. And then, instead of outputting JSON, we write Python. We just emit the Python, and then we need a sandboxed execution environment. After that the LLM observes, and here, in this step, it can observe and understand whether there were any errors and things like that.
Gal Peretz [00:11:30]: And then you can revise in the next step you can revise the code if error record or maybe you can see from the prints from the output that we are not on the right track. So we can revise the code and we will execute the code. Again, very similar notion, but different approach. Let's see in the technical aspect of things, let's see how we going to represent the LLM functions in the aspect like in the terms of before when we use the JSON and after when we use Kodak, right? So here we have for example this function and to represent it to LLM we need to append it in the metadata in the request this kind of nested object, Right? Right. For the code we can just define as we are usually define functions in code, right? Why do we need to define different type of, you know, notions, types, different things of stuff to represent LLM what he want, what we wanted it to do, right? Is already fine tuned on how to write code, how to understand code. Let's zoom in and understand the flow. So basically this one we already walked through. So how it works, instead of just act on a specific action, we are going to write code.
Gal Peretz [00:13:06]: Now let's look at the technical side of things: how we represent functions to the LLM before, when we use JSON, and after, when we use CodeAct. Here we have, for example, this function, and to represent it to the LLM we need to append this kind of nested object to the request metadata. For code, we can just define it the way we usually define functions. Why do we need to invent a different kind of notation to tell the LLM what we want it to do? It's already fine-tuned on how to write code and how to understand code. Let's zoom in and understand the flow; this part we already walked through. So how does it work? Instead of acting on one specific action at a time, we write code. We use loops and ifs and all that magic that's already available in code. We can also use, for example, mean, a tool that we never defined in the first place; we can use it because we are writing code. We can also use, for example, Pandas and aggregate on the fly. We don't really need to define that.
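For instance, here is what a model might emit under CodeAct for the earlier "deactivate every active user" task. The host-provided tools are stubbed so the sketch runs standalone; the Pandas aggregation at the end is the kind of on-the-fly step that was never declared as a tool.

```python
import pandas as pd

_USERS = [{"id": i, "active": i % 2 == 0} for i in range(10_000)]

def get_all_users() -> list[dict]:       # hypothetical host-provided tool
    return _USERS

def deactivate(user_id: int) -> None:    # hypothetical host-provided tool
    _USERS[user_id]["active"] = False

# --- the part the model would actually write ---
for user in get_all_users():
    if user["active"]:
        deactivate(user["id"])           # 10,000 users, zero LLM round trips

print(pd.DataFrame(get_all_users())["active"].sum(), "users still active")
```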
Gal Peretz [00:13:44]: So this is basically the system prompt. We just give it restrictions: you have access to the Internet or you don't; you do have access to the file system, to write files and read files. We're limiting the search space, because just giving the LLM the ability to write code can take it the other way. If we give the system too much freedom, we'll have free variables there and it won't converge. And this is what the user prompt looks like. Basically, you want to omit the function bodies, because the implementations of those functions may use functions that we don't want the LLM to touch.
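One way to build the tool listing described here: expose each function's signature and docstring while hiding the body. Python's `inspect` module does the work; `deactivate_user` is a hypothetical tool for illustration.

```python
import inspect

def deactivate_user(user_id: int) -> bool:
    """Deactivate the account of the given user."""
    raise NotImplementedError  # real body stays hidden from the LLM

def render_tool_stub(fn) -> str:
    """Render a function as a body-less stub for the prompt."""
    sig = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    return f'def {fn.__name__}{sig}:\n    """{doc}"""\n    ...'

print(render_tool_stub(deactivate_user))
# def deactivate_user(user_id: int) -> bool:
#     """Deactivate the account of the given user."""
#     ...
```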
Gal Peretz [00:14:39]: Let's jump to the results, because we're a little bit out of time. The interesting thing to see here is that you can use a very small model with CodeAct and it's competitive with a much bigger model using a JSON schema, specifically on datasets that are workflow-based, with tool calling and so on. And since we are a little out of time, just the key takeaways. When to use CodeAct: when you have workflows, when you need branching and dynamic behavior, and when you want to put in the time and effort to customize your environment. When to use JSON: when you want a plug-and-play solution, when you don't really have complex tasks, and when you want low latency, because you just call an API and that's it. Thank you very much. Please connect with me on LinkedIn if you have any questions.
Gal Peretz [00:15:21]: Awesome.
Skylar Payne [00:15:21]: This is incredible. We did have a question in the chat a little bit earlier from Solara.AI, who asked: do you recommend fine-tuning, or would you say that libraries like trustcall are enough?
Gal Peretz [00:15:35]: So, fine-tuning meaning fine-tuning for the CodeAct paradigm, or fine-tuning the model that uses CodeAct?
Skylar Payne [00:15:41]: That's a great question. I'm not quite sure what he meant. I think he was asking this before you had introduced CodeAct, when you were talking more about JSON tool calling and some of the barriers to it. trustcall is something that helps make that a little bit easier. So maybe the question, taken further, is really: how does trustcall compare to something like CodeAct?
Gal Peretz [00:16:08]: Yeah, basically it depends on what you use. If you use a very capable model, like OpenAI's GPT models and so on, then I guess trustcall can work. But if you use, for example, open-source models, you'll want to fine-tune them to really restrict the output to bind to those OpenAI-schema JSONs.
Skylar Payne [00:16:38]: Okay, cool. That totally makes sense. Maybe just one more question before we switch off. I'm curious whether you've experienced any cases where the openness, the unconstrained nature of just generating code, becomes an issue. It can generate any kind of program; what do you do to control for that?
Gal Peretz [00:17:04]: Yeah, for sure. I had a slide for that, unfortunately. You can use tools like abstract syntax trees (ASTs) and linters to understand, before you even execute the code, whether there are typing errors and things like that. With an AST you can, for example, analyze the code, spot any forbidden functions that you don't want the execution to call, and then reiterate on them. Usually we actually want to restrict the LLM through the prompt and limit the search space, because it's very similar to any data science problem: with too much freedom, it's not going to converge. But if we're using the JSON-based paradigm, then instead of too much freedom, the model doesn't have the power to express these workflows at all. I think the play here is to understand how to push the boundary when you need to, and to have the tools to apply restrictions when you need to.
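A hedged sketch of the pre-execution check described here: parse the generated code with Python's standard `ast` module and reject forbidden calls before anything runs. The blocklist and the sample snippet are illustrative.

```python
import ast

FORBIDDEN = {"eval", "exec", "open", "__import__"}

def find_forbidden_calls(code: str) -> list[str]:
    """Return names of blocklisted functions called anywhere in the code."""
    tree = ast.parse(code)  # a SyntaxError here already catches broken code
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN:
                hits.append(node.func.id)
    return hits

generated = "data = open('/etc/passwd').read()\nprint(len(data))"
violations = find_forbidden_calls(generated)
if violations:
    # feed this back to the LLM so it can revise, instead of executing
    print(f"Rejected: forbidden calls {violations}")
```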
Skylar Payne [00:18:23]: Totally awesome. Thank you so much for joining. We're gonna mosey on over to our next speaker. Feel free to grab that QR code and connect with Gal on LinkedIn; you can find him at Gal-Peretz. But yeah, thank you for coming, really enjoyed it.
Gal Peretz [00:18:42]: Thank you for having me.

