MLOps Community

MLOps Reading Group - December: A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents

Posted Dec 27, 2024 | Views 276
# AI Agents
# Observability
# AI Systems
SPEAKERS
Nehil Jain
MLE Consultant @ TBA

Hey! I’m Nehil Jain, an Applied AI Consultant in the SF area. I specialize in enhancing business performance with AI/ML applications. With a solid background in AI engineering and experience at QuantumBlack, McKinsey, and Super.com, I transform complex business challenges into practical, scalable AI solutions. I focus on GenAI, MLOps, and modern data platforms. I lead projects that not only scale operations but also reduce costs and improve decision-making. I stay updated with the latest in machine learning and data engineering to develop effective, business-aligned tech solutions. Whether it’s improving customer experiences, streamlining operations, or driving AI innovation, my goal is to deliver tangible, impactful value. Interested in leveraging your data as a key asset? Let’s chat.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used Artificial Intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

Valdimar Eggertsson
AI Developer @ Snjallgögn (Smart Data inc.)
Binoy Pirera
Community Operations @ MLOps Community
SUMMARY

In the December Reading Group session, we explored A Taxonomy of AgentOps for Enabling Observability of Foundation Model-Based Agents. Participants discussed the challenges of building agentic AI systems, focusing on four key capabilities: perception, planning, action, and adaptation. The paper highlights issues like lack of controllability, complex inputs/outputs, and the difficulty of monitoring AI systems. Early-stage insights drew on DevOps and MLOps practices and pointed to the need for improved tools and evaluation strategies for agent observability. The session fostered a collaborative exchange of ideas and practical solutions.

TRANSCRIPT

Binoy Pirera [00:00:00]: Welcome to the final reading group session for the year. If you've been joining us regularly every month, thank you so much. The main reason we wanted to, you know, start this initiative is that people just struggle so much to find time to invest in education as they get lost in the day to day, and we wanted to carve out some time so you can just join the reading session and know what's up. And so if you've been contributing in terms of selecting a paper and being active in the Slack channel, thank you so much. And I know there are a lot of people joining us today for the first time. If this is your first time, we keep things very open ended, so, you know, let us know what's on your mind and just unmute at any point and share your thoughts, or if you have any specific questions, please feel free to, you know, use the chat and I'll read it out for you.

Binoy Pirera [00:00:48]: So today's paper is A Taxonomy of AgentOps for Enabling Observability of Foundation Model-Based Agents. And to help us dissect the paper we have three amazing speakers. Nehil Jain is back, and Nehil is a co-founder of a stealth AI startup. We also have Valdimar Eggertsson as usual; he's the AI team lead of Smart Tutoring and he's based in Germany. And we also have Adam Becker, the COO of MLOps Community; he's based in New York. As usual, my name is Binoy, and you can find me in Slack. All right, so let's get started.

Nehil Jain [00:01:28]: Yeah, I think agents have been all the rage, and Adam chose this paper for us to understand really how to build a mental model of what's going on when you're doing ops for agents. And I mean, as an MLOps group, we have been doing ops for ML models and such for a long time. So it's a very fitting paper, I think, just to start the journey on learning about agents, and very exciting times. There's so much happening; everyone is like, oh, build agentic AI systems, whether you know what it means or not. But that's what all the rage is. And so, yeah, what are agents? Let's start there. I think Andrew Ng and some other people have kind of come together and consolidated a little bit around what we are calling agentic AI in today's world. I think it's still kind of fuzzy and confusing, but a system is agentic if it does four things. It can perceive input from different kinds of modalities or different input systems; that is piece one, which is perception. Then there is planning and reasoning, which is more about, given a set of inputs and a goal, can it decide what to do next, or plan and break it into multiple subtasks, which is kind of intelligent in itself.

Nehil Jain [00:02:45]: You can't hard code that stuff. And then there are actions: okay, I've decided these are the subtasks or these are the different things I need to do; how do I actually make that happen? Primarily today we do it through tool calling, which is like you're telling the LLM, hey, this is how you can interact or take action with an external tool. And then the last one is kind of looking at what we call few-shot prompting or even fine-tuning: can you actually take examples and improve based on the feedback that you're getting from your input and what you've been trained on? So that's the adaptation, being able to change based on different inputs. So this is where the paper starts. The paper talks about the main challenges when we talk about ops around agents, or in general building agentic AI systems. And the first one is, you know, you have a lack of control over how it makes decisions.
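
As a rough illustration of the four capabilities described here (perception, planning, action through tool calling, and adaptation through examples), the sketch below wires them into a single loop. The `call_llm` stub and the two tools are invented placeholders, not the paper's or any framework's API.

```python
# Hypothetical sketch: perceive -> plan/act via tool calling -> adapt via few-shot examples.

def call_llm(prompt: str) -> str:
    # Stand-in for any chat-completion call; returns a canned decision so the sketch runs.
    return "get_calendar: next Thursday"

TOOLS = {  # invented tools the model may "call"
    "get_calendar": lambda arg: f"[calendar events for {arg}]",
    "send_invite": lambda arg: f"[invite sent for {arg}]",
}

def run_agent(goal: str, observation: str, examples: list) -> str:
    # Perception: gather whatever inputs the agent can see (text here; could be images, audio, ...).
    context = f"Goal: {goal}\nObservation: {observation}"
    # Adaptation: few-shot examples steer behaviour without retraining the model.
    prompt = "\n".join(examples) + "\n" + context + "\nReply as '<tool>: <argument>'."
    # Planning/reasoning: the model decides which subtask/tool comes next.
    decision = call_llm(prompt)
    tool_name, _, argument = decision.partition(":")
    # Action: execute the chosen tool and feed the result back for a final answer.
    result = TOOLS[tool_name.strip()](argument.strip())
    return call_llm(f"{context}\nTool result: {result}\nSummarize the outcome.")

print(run_agent("Book a meeting", "Two emails mention Thursday", ["Example: ..."]))
```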

Nehil Jain [00:03:41]: One of the common things, speaking from experience and practice here, is that if you have too many decisions to make, in most cases LLMs won't be even nearly as accurate as what a human would do in the given situation. Which is where you're seeing that the cost of building production-ready agentic systems is so high, because either you have to do a lot of iterations to get it right, or it just doesn't make the right decision. Whatever you try to do, it's very hard to control what it does in given scenarios. And so that's the first piece, lack of controllability in making decisions. And we know LLMs are mostly a black box still. So that's the first piece. The second one is kind of related. In most cases when you're doing planning and taking action and then looking at the feedback, you're doing multiple iterations to get it right.

Nehil Jain [00:04:33]: And you're also breaking it down into multiple steps. So you do something, then you do something else, then you do something else, which is kind of called multi-hopping. You're doing many different subtasks in sequence to achieve a goal. And for that the input and output is very complex. In previous cases, when you were building traditional ML models, you would assume there's a specific set of inputs and a very specific structure of the output, but that's not true anymore. And so that makes it very hard to observe, monitor, and track all these different pieces. I think in the later sections we will talk about how you break it down into the different components of what you need to track and how to think about that. And that's the whole point of this paper, the taxonomy.

Nehil Jain [00:05:17]: And that is a challenge. And I think we're still learning how best to model this stuff. We're evolving based on our previous DevOps and MLOps days of how to do traces and observability and bringing that to AgentOps, which is what the paper talks about. But I still think it's very early days. And the last one is observability. They were talking about the EU AI Act, which says you have to track and store everything, or at least have the ability to do that. Which means, again, you need to know what you're tracking to be able to say that yes, we are tracking and storing everything. So those are the main challenges that the paper talks about.

Nehil Jain [00:05:56]: I think there are other challenges as well, but from an observability perspective, this is where we are at for data sources. They chose very high-level data sources, even something that this group, if we were not doing pure research, would choose. They went to GitHub and looked at all the different agentic or AgentOps systems where you can build observability and monitoring for agents. And so that's what they're talking about here. And then the second thing that they looked at was market maps: you know, people writing about these are the different pieces of the stack and these are the different companies trying to automate it. My takeaway here really is that it is still very early, that people are not able to consolidate or have a go-to gold standard or source of truth to read from. That's why we are assimilating information both from the practitioners and the research and kind of combining it together. And so that's why the paper chose these tools as a way to understand how to think about observability for agents.

Nehil Jain [00:07:08]: As I'm saying here, it's very early days. So the way to look at the whole lay of the land is by looking at people developing and building things in real life instead of just theoretical research. And we will solidify as time goes, I think, but this is where we are at. I've used a lot of these agent observability tools myself already. So, exciting to see that these are the popular ones. Has anyone tried some of these tools, or is there something that's missing here? I do think that LangGraph is missing on the agent dev tool side. But I think observability is very well covered.

Adam Becker [00:07:51]: Have you tried Datadog for, like, agent type of...?

Nehil Jain [00:07:55]: I have seen people do traces. Yeah. Because you can do OpenTelemetry with it, you can basically use Datadog as a consumer. Like, you can produce events and they'll be tracked there. Yeah, same with some APM companies; they are also doing it now.

Adam Becker [00:08:13]: Yeah, and have they kind of built up the tooling to accommodate, like, a foundation model and an LLM-driven kind of use case? Because it feels like it's a little bit different from what they would have been used to doing, you know, back in the day. So, can you have, like, a human that provides feedback on different things, and costs of, you know, how many tokens they spent, and things like that, or not a human feedback bot?

Nehil Jain [00:08:44]: I haven't seen it myself personally, but they have done the cost and the traces and the lifespan of a trace and all of those things. And they're again using OpenTelemetry, which is like a protocol to do that. And so Laminar, another company, all of these are actually kind of coming together on saying, hey, if we support OTel, which is OpenTelemetry, then the consumer will become standardized and you can have many people just consuming from the same producer. So that's what Datadog started with. And yeah, I haven't looked at how they do human-in-the-loop. I use Langsmith a lot. I have tried Langfuse. I like both of those for most of these things.
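
Since Nehil mentions standardizing on OpenTelemetry, here is a minimal sketch using the OpenTelemetry Python SDK. The span name, attribute keys and numbers are illustrative only; in practice you would replace the console exporter with an OTLP exporter pointed at whichever backend (Datadog, Langfuse, Laminar, etc.) you use.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer that prints spans to the console; swap ConsoleSpanExporter for an
# OTLP exporter to send the same spans to Datadog, Langfuse, Laminar, and so on.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

# One span per LLM call, with cost/usage recorded as attributes (attribute names are illustrative).
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "gpt-4o")
    span.set_attribute("llm.prompt_tokens", 812)
    span.set_attribute("llm.completion_tokens", 256)
    span.set_attribute("llm.estimated_cost_usd", 0.004)
    # ... the actual model call would happen here ...
```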

Nehil Jain [00:09:25]: But. Yeah, yeah.

Adam Becker [00:09:28]: And sorry, one more thing. I'm very curious, just because you've used them. Would you say that you like one more than the other because of the number of features that they give you and kind of the scope of what they do, or did you actually find a different type of user experience using one versus the other?

Nehil Jain [00:09:49]: For Langsmith and Langfuse, I found both of them very similar, and it goes back to kind of what you and Valdimar are going to talk about: eventually you're just looking at traces and spans and you have to look at the nitty-gritty data and kind of understand what's going on. And the tooling is also very similar looking, at least right now; there are some UI changes in how the human interacts with it, but that's kind of it. And hence, I think it's still early days. I think we will standardize around one way of looking at these things and then eventually, yeah, that'll be how everyone will do it. So the paper describes some key features looking at these different tools that we just looked at, and they break it down, and I think that sets it up perfectly for all the taxonomy discussion. So one thing is of course the creation of the agent, which means how do you give it the ability to perceive, like do inputs, multimodal inputs, and so on, so forth. The second thing is how do you give it the ability to take actions, which is toolkits.

Nehil Jain [00:10:54]: And of course you're extending the ability of what the foundational model itself can do with external systems by giving it toolkits. The other one is reference or grounding material through a VectorDB. We have talked a lot, in the papers that we have discussed, about context and RAG, et cetera. So that's that. And then you have the ability to fine-tune it, which is like retraining the model to become specialized in something. And now you have specialized agents doing specific tasks if they are better suited for it. So that's the creation piece, and then you have prompt management, which is I think fairly well understood at this point, where you version different pieces of the prompt and then you test with different inputs what works well and then you eventually choose that, which is very similar to how we used to do model selection in the older days, where you train a bunch of different models, you look at the metrics and then you're like, okay, this is the champion model, and you promote that. And then, yeah, there's playground and some detection stuff as well.

Nehil Jain [00:11:55]: Prompt detection, I think, is not solved; a very hard problem as well. But okay, so the other set of features is evaluation and testing. A pretty hard but very important topic, and I think this group should be fairly excited about how you do evals for agents, or even just LLMs in general. One thing I liked, which I highlighted for myself here, was how do you look at evals at different granularities, especially in an agentic context? So one is you give it an input: hey, book a meeting in my calendar after looking at my emails, and then the output is: hey, do I have the right calendar invites? But you can also check it at each step: okay, did it understand the input correctly? Did it extract the right information? Then did it call the right tool, which is like get my calendar? Did it then figure out all the conflicts correctly? Did it prioritize correctly, and so on and so forth. And you can look at each step and see the input and output, input and output, and then eventually evaluate.

Nehil Jain [00:13:03]: Okay, this step is what I need to improve, or a bunch of steps I need to improve, et cetera. And then trajectory is basically doing evals on planning: was it able to choose the right step at a given point? Which is mostly a classification problem, where, given an input, did it choose the right option out of the set of outputs, so to speak? Human feedback is kind of what Adam was talking about. We have of course the prompt, which has some context on the human, but during the agent's run you can also get feedback from the human, and they're breaking it down into implicit and explicit, where implicit is like, can it call some tool and understand metrics around the user's behavior to make it richer? And then monitoring and tracing, which I think we'll go much deeper into in the later sections as well: how do you actually track all this stuff? Yeah, I think that's it. That sets it up for all the different features and the agent systems that we look at. And then let's look at the taxonomy to understand how we'll actually break it down in our heads. I'll pause for questions or discussion if we have any.
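
To make the three granularities concrete, here is a hedged sketch: one check on the final response, one on a single step, and one on the whole trajectory. The record format and the expected tool names are invented for the calendar example above, not the paper's schema.

```python
# Invented trace record for the "book a meeting from my emails" example.
trace_record = {
    "input": "Book a meeting in my calendar after looking at my emails",
    "steps": [
        {"tool": "get_emails",   "output": "Thursday 3pm works for both"},
        {"tool": "get_calendar", "output": "Thursday 3pm is free"},
        {"tool": "create_event", "output": "event_id=42"},
    ],
    "final_response": "Booked Thursday 3pm with Alex",
}

def eval_final_response(record: dict) -> bool:
    # End-to-end check: is the outcome right, regardless of how we got there?
    return "Booked" in record["final_response"]

def eval_single_step(step: dict, expected_tool: str) -> bool:
    # Step-level check: did this hop call the right tool and produce some output?
    return step["tool"] == expected_tool and bool(step["output"])

def eval_trajectory(record: dict, expected_tools: list) -> bool:
    # Trajectory check: was the planned sequence of actions the right one (a classification-style check)?
    return [s["tool"] for s in record["steps"]] == expected_tools

print(eval_final_response(trace_record))                                              # True
print(eval_single_step(trace_record["steps"][0], "get_emails"))                       # True
print(eval_trajectory(trace_record, ["get_emails", "get_calendar", "create_event"]))  # True
```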

Bruno Lannoo [00:14:17]: If there are no other questions: for the evaluation part, I already noticed with regular LLMs that the expectations, the asserts you would add at the end of your evaluation, are way wider than they would have been with traditional automation, and even with traditional ML, where you have a single thing. It might be a bit random, so you have statistical challenges, but it's still one thing you're trying to assert. Well, with LLMs it's hard, because I don't want the LLM to give an answer that's only fitting one criterion. But with agentic systems, I feel like you broaden the scope even wider and you expect it to do a ton of stuff. Are there any steps in the direction of figuring out how we're going to handle that?

Nehil Jain [00:15:04]: Yeah, I think, again, the paper doesn't talk about it, but from my practitioner's standpoint, what I've seen is you trace everything, and then, kind of similar to what you do with LLMs, as you said, instead of saying match this specific class, you say if the output contains this substring or something; it's already wider than a specific string assertion. You kind of have to play in the same domain: you look at all the data and then you figure out how you will assert the output based on the input, if you're doing the final response thing. The single step thing is similar to what you were saying already, where it's one LLM call, you take an input, you get an output; how do you just check it did the right thing? So I think it depends on breaking it down into these different levels of evals and then handling them slightly differently. But you have to look at the data. That's the only answer we have right now.

Nehil Jain [00:15:54]: Eventually it'll get better.

Adam Becker [00:15:56]: Yeah, can I maybe offer an angle on this, and let's see if that works. So I spoke to someone the other day and we kind of looked at the question of whether the prompt itself will need to be unbundled into a much more specific composition of prompts. Because, right, just imagine when you're starting out programming, you write a giant function and then you realize, I can't test a function like this; that's not proper, I'm doing way too many things I can't test. So then you start to break it down and you compose your kind of flow through different functions, each of which is much more easily testable. And I feel like what we have right now is we just kind of dump everything into a single prompt or a couple of prompts or something. But each of these prompts is just becoming like a very large function that becomes very difficult to test, because like you said, how are you going to prioritize the testing? And there are so many different components to consider.

Adam Becker [00:16:55]: So could it be that we end up being not just wiser about evaluation, but about prompt construction?

Valdimar Eggertsson [00:17:04]: Yeah, I think for sure.

Nehil Jain [00:17:05]: And also, the other angle is that intelligence is going to get better. So maybe you will get to a point where you don't have to do this, you don't have to decompose your prompt, but the model itself will just do better at more complex things as time progresses. But yeah, the current solution is to reduce the amount of "don't do this, don't do this, don't do this" statements, which is basically saying let it do simple, smaller things in a given prompt and chain more.

Bruno Lannoo [00:17:30]: I do think it's a very interesting.

Nehil Jain [00:17:31]: It becomes multi-hop, and as we go into your traces conversation that will become more relevant as well.

Bruno Lannoo [00:17:37]: Yeah, I think it's a very interesting thought. If we develop some insight into the structure of prompts, we can be like, okay, we have a base layout, and then instead of testing the whole prompt against a bunch of others in comparison, we test this reference prompt and then vary one of the pieces of the structure, and then another piece, and we find the optimal version for each piece. I think that's going to make it a lot easier to get a feeling of, okay, this is a good one. So yeah, I do think it's an interesting perspective. Thanks.
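
A small sketch of that idea, assuming nothing beyond what was said here: the prompt is composed from named pieces, a reference layout is held fixed, and one piece is varied at a time so each variant can be evaluated in isolation. The piece names and variants are invented.

```python
REFERENCE = {
    "role":   "You are a careful scheduling assistant.",
    "task":   "Book a meeting that satisfies the user's request.",
    "format": "Answer with a single confirmed time slot.",
}

VARIANTS = {  # alternative versions of one structural piece at a time
    "format": [
        "Answer with a single confirmed time slot.",
        'Answer in JSON: {"slot": ..., "attendees": ...}',
    ],
}

def compose(pieces: dict) -> str:
    # The full prompt is just the ordered concatenation of its pieces.
    return "\n".join(pieces[key] for key in ("role", "task", "format"))

def candidate_prompts():
    # Hold the reference layout fixed and swap in one variant piece at a time.
    for key, options in VARIANTS.items():
        for option in options:
            yield {**REFERENCE, key: option}

for pieces in candidate_prompts():
    prompt = compose(pieces)
    # score = run_eval_suite(prompt)  # hypothetical evaluation harness for each variant
    print(prompt, "\n---")
```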

Adam Becker [00:18:08]: Yeah, I suspect it might come down to just where you throw in the costs. Right? Because if you're going to just keep pinging the LLM, I mean, just because you have lots of different prompts, each of which is doing a smaller thing, well, in that case you're going to spend a lot of money just pinging it a bunch of times. But at the same time, if you're going to need to unpack the response and then do a lot of evaluation on the response, you're also going to spend a bunch of money there. So it might be a trade-off at some point.

Valdimar Eggertsson [00:18:41]: Should I just continue with the presentation to the next part? Basically, what they did was, I'm kind of missing a slide on the overall picture, but they made this kind of mapping, like a mind map of what an AgentOps platform consists of. And they had this comparison listing earlier of AgentOps, Datadog, Langsmith, Langfuse, et cetera. Then I just dove into it and mapped out what you need to create an agentic system. So I'm just going to start kind of listing it out, going through it quickly, the fundamentals of it, because it's all the, what they call, traceable artifacts, data artifacts. There are a few different components. The first one is how are you going to create the agent? How do you define it? This is all pretty basic, but it's a good way to put us on the same page with kind of vocabulary and stuff on building agents. So every agent must have an ID, and ideally you want to version control them and give it a name. And I thought the goal was interesting, but that's just basically the introduction prompt, like giving it the goal which the agent ends up striving for and calling functions and stuff to do.

Valdimar Eggertsson [00:20:18]: We have different input data and a set of prompts that the agent reads and has access to. We configure the LLM, we give it access to helper functions or tools that it can call, and then one thing that they point out that I thought was fairly interesting is the role type. They thought about worker versus coordinator. So it's like the custom role of the agent. If I were doing it, I would maybe think of it as, is the agent going to talk to a customer or to staff, for example? So you give it fundamentally different roles; it can be this kind of variable. An important part of it is the guardrails, which we will see in a bit.
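
A hedged sketch of what such an agent definition might look like as a data structure; the field names paraphrase the items Valdimar lists (ID, version, name, goal, prompts, model config, tools, role type, guardrails) and are not the paper's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentDefinition:
    agent_id: str
    version: str                                    # version-control the definition itself
    name: str
    goal: str                                       # the goal prompt the agent strives for
    prompts: list = field(default_factory=list)     # prompt templates the agent reads
    llm_config: dict = field(default_factory=lambda: {"model": "gpt-4o", "temperature": 0.2})
    tools: list = field(default_factory=list)       # names of helper functions it can call
    role_type: str = "worker"                       # e.g. "worker" or "coordinator"
    guardrails: list = field(default_factory=list)  # constraints it must operate within

planner = AgentDefinition(
    agent_id="agent-001",
    version="0.3.1",
    name="calendar-planner",
    goal="Schedule meetings from email context",
    tools=["get_calendar", "create_event"],
    role_type="coordinator",
)
print(planner)
```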

Adam Becker [00:21:13]: Can I ask a question?

Valdimar Eggertsson [00:21:14]: Yeah, very.

Adam Becker [00:21:16]: About the role, because I find something there might be interesting. So is the idea, with the agent role type, that perhaps at some point you could start to limit access to different tools and set different guardrails depending on the role too, where it becomes something like, yeah, you encapsulate a bunch of different things even just by specifying a role type?

Valdimar Eggertsson [00:21:49]: I've seen that in terms of having like the role of the user depending on who's using it. Does the agent have access to all the data or just a part of the data, for example? So that's role in the kind of.

Adam Becker [00:22:06]: I'm thinking like in the permission sense.

Valdimar Eggertsson [00:22:08]: Yeah, authentication sense. I mean that would make sense if you had like a company of robots. They would have different privileges and maybe that'll be a part of future agents. But also if you are trying to solve some problem and you want to have a team of agents that work in some kind of little society, then it makes sense to have workers and coordinators. I think that's like maybe future stuff, but maybe not that far away. Have you guys seen any of these autonomous, like societies of agents that are like have a team working together? I haven't really seen it, but I guess people are working on this kind of stuff.

Adam Becker [00:22:58]: I've seen that in research.

Nehil Jain [00:23:00]: Right.

Adam Becker [00:23:00]: And a lot of like research people are doing, but then like, like the larger scale kind of like simulated societies.

Valdimar Eggertsson [00:23:08]: Yeah, I think it makes perfect sense to communicate. It's kind of how intelligence works also, just in our brains, with different smaller modules communicating. So yeah, when you have all of this, you have an agent, and this is kind of agnostic of whether you're using Langsmith or whatever. Personally, we made our own kind of software for keeping track of this stuff, and it has effectively all of this; maybe not the role type, which I thought was interesting. I'll just continue then. What we want to do is enhance the context, which is usually done by giving the bot access to retrieval, like a database it can retrieve from, or we have in-context learning.

Valdimar Eggertsson [00:24:05]: In-context learning is when you provide the model with an example of how to solve a task. And yeah, I was doing it yesterday; I was asking GPT to format something in markdown in a certain way, and it's just part of the prompt: instead of describing it in an abstract way, you're giving an in-context example, like this is how you're supposed to format it, and then it can solve the task much better. Then we have retrieval-augmented generation, which has just been a very hot topic, but they decompose it into a few different parts. You have the input, like a question the person asked, and you have the sources the AI uses. I don't remember what the keyword part was supposed to be, but it's if you want to have keyword search. So if you want hybrid search, you use both. This is maybe not a complete explanation of what RAG is all about, but you can have vectors or you can have keyword search.
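
A toy sketch of the hybrid idea, assuming only what is said here: score documents by keyword overlap and by a vector-style similarity, blend the two, and hand the top documents to the LLM as grounding context. Real systems would use BM25 and an embedding model; the scoring below is a stand-in.

```python
from collections import Counter

DOCS = {
    "doc1": "Quarterly revenue grew 12 percent year over year",
    "doc2": "The team meets every Thursday at 3pm in the main room",
}

def keyword_score(query: str, text: str) -> float:
    # Keyword-search stand-in: count overlapping terms (BM25 in a real system).
    q, t = Counter(query.lower().split()), Counter(text.lower().split())
    return float(sum((q & t).values()))

def vector_score(query: str, text: str) -> float:
    # Placeholder for cosine similarity over real embeddings.
    return keyword_score(query, text) / (len(text.split()) or 1)

def hybrid_retrieve(query: str, k: int = 1, alpha: float = 0.5):
    scored = {
        doc_id: alpha * keyword_score(query, text) + (1 - alpha) * vector_score(query, text)
        for doc_id, text in DOCS.items()
    }
    top = sorted(scored, key=scored.get, reverse=True)[:k]
    # These documents get pasted into the prompt as grounding context for the LLM.
    return [DOCS[doc_id] for doc_id in top]

print(hybrid_retrieve("when does the team meet"))
```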

Valdimar Eggertsson [00:25:10]: You just want to fetch documents somehow from a knowledge base and incorporate them into the context of the LLM. Then we have the prompt registry. We were talking about decomposing prompts a bit earlier, and they analyze what prompts are composed of into a fairly detailed structure. Here you want to give it an ID and version control it, give it a name. We have templates, for example, for retrieval-augmented generation: you want to have a template like, here's the question, here's the document.

Valdimar Eggertsson [00:25:54]: Please format your response in a polite manner, or whatever. Maybe the different templates here are related to what we would want to test; I don't know, just trying to connect it to something from the conversation earlier. Yeah, the user wants this, this is what it should do, here's the question, here are some examples, and all of this builds up the whole prompt, the whole set of instructions that go into the model. And we can vary each component, I guess, and test them, maybe, to get a properly functioning system. Finally, there's something they call prompt optimization techniques.

Valdimar Eggertsson [00:26:40]: I was a bit confused about this. They say it's for enhancing the performance of prompts fed to agents, but it's kind of also like chain of thought, tree of thought and whatnot; it's just different ways to generate the response. Chain of thought is like GPT o1, where it thinks step by step. Tree of thought is interesting. I remember actually covering it in this kind of journal club like a year ago. It's an alternative to chain of thought, which is just a chain sequence, where the LLM constructs different chains of thought in kind of a tree-like manner and the tree is traversed in a way to find the optimal solution. I'm not sure that's being used.

Valdimar Eggertsson [00:27:29]: It's kind of expensive and slow, probably. So yeah, you can have this on top of the prompt to generate answers in a cool way, iterate the prompt, and I think we talked about that with the demonstration-based RAG last time, or you can do it in a loop when programming around the prompt stuff. And finally, for my part here, there are guardrails within which an agent must operate, ensuring that it adheres to defined guidelines. They split it into the target of the guardrail: is it about, are we making a guardrail around the LLM or something else? And what we do about it is actions. I'm not sure how you build guardrails around these things. I usually have guardrails in the prompts to tell the LLM to not go off topic or whatever. Well, yeah, filters.

Valdimar Eggertsson [00:28:35]: That's a part of, let's say, using GPT on Microsoft Azure. It has a pretty strong content filter that makes sure you don't talk about anything inappropriate with the AI. This chapter is incredibly small; it doesn't explain anything, and I'm not really sure what the uniform rules or priority-enabled rules even are. Any comments on guardrails?

Bruno Lannoo [00:29:03]: Yeah, I'm a bit surprised that you mentioned that you put guardrails in your prompts, because in my mind guardrails were specifically a separate mechanism where, maybe before the prompt is sent to the model, you have a parser check for some keywords, and if some keywords are in there, you have an action rejecting even sending the prompt. Or after the response comes back from the LLM, you might do the same thing on the response; you might look at some kind of more deterministic approach. But I'm not sure if maybe guardrails is a broader term than what I thought.
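
Following Bruno's framing, here is a hedged sketch of a deterministic guardrail that sits outside the model: one check before the prompt is sent and one on the response. The blocked terms and the redaction pattern are purely illustrative.

```python
import re

BLOCKED_TERMS = {"password", "social security number"}   # illustrative input policy
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # illustrative output policy

def input_guardrail(user_text: str) -> str:
    # Deterministic pre-check: reject before the prompt ever reaches the model.
    if any(term in user_text.lower() for term in BLOCKED_TERMS):
        raise ValueError("Request rejected by input guardrail")
    return user_text

def output_guardrail(model_text: str) -> str:
    # Deterministic post-check: anonymize rather than reject, e.g. strip email addresses.
    return EMAIL_PATTERN.sub("[redacted email]", model_text)

safe_prompt = input_guardrail("Summarize this email thread for me")
safe_reply = output_guardrail("Sure - contact alice@example.com for the details")
print(safe_reply)  # "Sure - contact [redacted email] for the details"
```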

Valdimar Eggertsson [00:29:47]: Yeah, I mean, in this paper they just say guardrails, within which an agent must operate, ensuring that it adheres to defined guidelines. Yeah, maybe I'm just thinking of general instructions. Maybe a guardrail I can think of is to anonymize the output, or is it, yeah, the input. Maybe both.

Bruno Lannoo [00:30:12]: You can do both. I think that would fit more with guardrails, because for me a defining characteristic of guardrails is that they are more deterministic. But I'm not sure that that has to be the case, because I feel one of the points is to try to protect yourself against the non-determinism of the LLM by having guardrails that are more deterministic, so that you can at least ensure that some more extreme problems are definitely not going to happen.

Valdimar Eggertsson [00:30:48]: Yeah, so we can put guardrails on the data set. I don't know; I feel like they maybe just put everything here anyway. It's not my area of expertise, I'm not an expert in this.

Adam Becker [00:31:01]: I think that's the right discussion and train of thought though: where is the appropriate place to put guardrails? If you build too many guardrails into the model, I think you affect its judgment, and I think regardless you need to have external sorts of guardrails.

Valdimar Eggertsson [00:31:18]: And to ensure appropriate model behavior.

Adam Becker [00:31:24]: Yeah, I think that it's. Yeah. It comes down to. So this is how I see the targets. Right. So it's like you're not necessarily saying that every step needs to have guardrails and then it's up to you to pick and choose like where you want that in. Right. Where you want to throw that in.

Adam Becker [00:31:42]: And then some of those. I think you might choose to put a particular guardrail on a step even as a function of something that you see at runtime, something that you see in the prompt. If you detect some, somebody mentioned X in the prompt. In that case, put a guardrail on that thing. So I think that's where like the uniform might just be. Okay, this exists all the time. Negotiability might be like, okay, well only if you see this type of thing, you might consider adding a guardrail and not a guard. I think that that's kind of how I see all that.

Adam Becker [00:32:18]: And then. Yeah, and then like how do you handle like violations? Right. Then these are like the different actions. By the way, Bruno, to your point about the. I think you were saying about the user goal. Yeah, yeah, I've never seen that language either. So I'm not, I'm not entirely sure. But based on the paper, I think the way, the way they, they put it is like the prompt.

Adam Becker [00:32:47]: You know, I'll just read it: prompts serve as a foundational element for the agent's decision making and behavior, incorporating multiple layers of information such as goals, instructions and contextual data. I think they separate between the goal and the instruction. It's almost like, yeah, I do feel.

Bruno Lannoo [00:33:06]: Like, I was also running into this. This is this realization that it's starting to be important to differentiate between whatever the user put in and whatever we send to the model. Those are different things. And in the first approach, I had the tendency to call them both prompt, but that gets very confusing very quickly. So I think it's good for the field to have distinct names, but it would also be convenient if we all use the same distinct names. So that's why I was inquiring a bit whether people have heard different terms used and which one is the most popular currently, so that maybe we can already start aligning a little bit on which one would fit best.

Adam Becker [00:33:40]: But do you think the user like the concept of the user goal, do you think that's likely to cascade down to like all the different agents? Because I feel like maybe that would only be like one agent that takes in the user goal and then they start to kind of break down the different steps. And then most of the steps might not incorporate the user information at all because they're now operating on like some goals.

Bruno Lannoo [00:34:03]: Yeah, I'm not talking about the conceptual effect of the user goal. I guess there is a conceptual part to it, but there's also a literal part to it: whatever the user put in, that piece of text, needs a name. And you can call it User Goal, or you can call it User Request, or you can call it Prompt, but I think that's going to be ambiguous. And then you do all kinds of things to that. And then in an agentic system, you will go to the model multiple times. And whatever you send to the model each time also needs to have a name.

Bruno Lannoo [00:34:32]: Even though it might be the first time you go to the model, or it might not be; there are so many iterations, and I'm more tempted to call that the prompt. But I'm not talking about it conceptually. Conceptually, of course, inside that user goal, user request, there will be an intention. That intention will be sent to the first LLM as a prompt, together with extra things, and then it will be processed and maybe not show up as literally anymore, but it will hopefully be dribbling down into all the other requests, because hopefully you stay focused and keep on working towards that.

Adam Becker [00:35:06]: Actually, I think this dovetails nicely into the next slide. Valdimar, I'm good to start. Okay, the next concept here is the agent execution, and they break this down into a few different stages. The first is planning, and then reasoning, memory and workflow. So with respect to planning, I think that is connected to what we were just talking about. Because even if the user had initially submitted something, well, perhaps the first agent will say, let me break this down into much smaller steps and let's see how we can drive through those steps. I think the best way to understand planning here is to first consider the output. In the planning phase the output is basically a set of tasks, right? Or a queue of tasks.

Adam Becker [00:35:56]: And I don't think they described it very well here, but I copied and pasted a couple of things from other places about planning. So here they say: despite the abstract concept of planning, a general formulation of the planning task can be described as follows. Given a time step t, environment E, action space A, task goal g, and action step a_t in A, the planning procedure can be described as generating a sequence of actions. So the idea is, how do we just come up with a bunch of different actions? And there are different ways to even go about this, and there are different flavors of planning. I just thought this would be useful because as soon as I saw this, it kind of started to make sense how you get to choose the different methodologies of creating these. So one could be task decomposition, right? Like divide and conquer. What this looks like is you just keep breaking it down. Or multi-plan selection.
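
Reconstructed from the notation read out here (so treat it as a paraphrase of the quoted survey rather than the paper's exact equation), the formulation is roughly:

```latex
% Planning as action-sequence generation, with t the time step, E the environment,
% A the action space, g the task goal, and a_t \in A the action at step t:
\[
  p = (a_0, a_1, \ldots, a_t) = \mathrm{plan}(E,\; g), \qquad a_t \in A .
\]
```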

Adam Becker [00:36:54]: Let's come up with a bunch of different plans and then evaluate each of those plans and find the optimal one. Right? Or the next one could be external planner-aided. You can reflect and refine: you can look at the memory, at which types of breakdowns and planning have worked better, and others. And then you just come up with each of these plans. So basically, yeah, you have a planning phase. The output is a bunch of different tasks. Each task has a task description, which agent is going to be responsible for executing it.

Adam Becker [00:37:28]: Are there any tools? And what do you expect the output to look like? And, let's see, do I have... okay, yeah. What do you expect the output to look like? Then there's the actual taking of the action; I think it's like the executing of the task. So what's the action trigger? In this case it might just be, okay, the user submitted a query, now this agent needs to run. So the task, I think, is the smallest element of direction for the agent. The actual task here that is running is, say, the browsing of a website, and these are the parameters that have to go into that action.

Adam Becker [00:38:09]: So all of this is encapsulated in the task. Then basically, once you have that output, you feed that in; the input to each of the agents is then all of these. That's the planning phase. Next is reasoning. And you already saw some mention of reasoning even in the planning phase; you could have even done that if you wanted to rely on it.

Adam Becker [00:38:34]: The idea here is how do you use existing knowledge, draw conclusions, make predictions, construct explanations. Right. They didn't go into detail on this, which I thought is unfortunate because I think there's a lot here and that's pretty interesting. But the idea is you can now come up with different benchmarks and evaluate your agent's ability to reason; you can create reasoning tasks and see how well it does. They didn't go into this in detail though. Next is the memory. The idea here is you have short-term memory and you have long-term memory.

Adam Becker [00:39:11]: And so the short-term memory is, like you're saying, Bruno, how do we even make sure that the agent remains focused? Right? Does it actually remember what it just did, and does it even have the context of what the bigger plan is, or did it totally get lost along the way? So it needs to remember the intermediate outcomes, the recent interactions. Recent interactions even in the context of just this session: this user started out, they said something, now they're going through a few more iterations; do we still remember who they are and what they're after? The chat history, and then the context that is relevant for executing. Longer-term memory could be, I think, more like the retrieval documents, the knowledge database. I think this is where RAG and all of that might come in. Past executions, again, if we wanted to use that for the reflection or for the planning.
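
A hedged sketch of that short-term vs. long-term split as a data structure; the class and method names are invented, not an API from the paper or any framework.

```python
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 10):
        # Short-term: recent turns and intermediate outcomes for the current session.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term: past executions and retrieved knowledge, kept across sessions.
        self.long_term = []

    def remember_turn(self, role: str, content: str) -> None:
        self.short_term.append({"role": role, "content": content})

    def archive_run(self, plan: list, outcome: str) -> None:
        # What worked or failed, available later for reflection and planning.
        self.long_term.append({"plan": plan, "outcome": outcome})

    def build_context(self, query: str) -> str:
        recent = "\n".join(turn["content"] for turn in self.short_term)
        relevant = "\n".join(run["outcome"] for run in self.long_term
                             if query.lower() in run["outcome"].lower())
        return f"Recent:\n{recent}\n\nRelevant history:\n{relevant}"

memory = AgentMemory()
memory.remember_turn("user", "Book the meeting we discussed")
memory.archive_run(["get_calendar", "create_event"], "Booked a Thursday 3pm meeting")
print(memory.build_context("meeting"))
```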

Adam Becker [00:40:08]: What worked well, what didn't work well? All of that. Okay, so that's memory. Next we have a workflow. They didn't describe workflow, honestly, I think in any detail. And maybe I'll read to you what they said and you can see whether or not this picture you think maps onto it. So they say workflows reduce system complexity by breaking down complex tasks into smaller steps, reducing reliance on prompt engineering and model inference capabilities and connecting nodes with different functionalities. Developers can execute a series of operations within the workflow. So I guess, I think I got that from Lang's method.

Adam Becker [00:40:45]: The idea here is, okay, well maybe each agent is responsible for a particular thing, and then the workflow is creating some relationships between the different agents so that they produce a particular outcome systematically. And then you could just kind of abstract away all of the different interactions between the different agents and just consider that a single workflow. That's my best takeaway. Other people: if anybody has used workflows with these tools before, tell me if I'm totally off or if that's how you have it in mind.

Bruno Lannoo [00:41:26]: I haven't used it, but like it does sound like very similar to planning. I struggle a bit to see the difference.

Adam Becker [00:41:34]: I think that, so the workflows, it could be that a workflow could come out of a planning phase. But I suspect that workflows are, like, human-engineered. So we specify what the process needs to be. We've created it, like, okay, now we no longer have to think about the different steps along the way. And it could be this.

Bruno Lannoo [00:41:57]: Yeah, I could see a split being drawn in this, kind of into two categories: what have we pre-planned and what is kind of being invented on the spot by the model.

Nehil Jain [00:42:08]: Yeah, yeah.

Adam Becker [00:42:09]: But now how does the plan if, if there is space for a more dynamic type of plan. I'm not sure how that fits into an existing workflow. So maybe it's like, I don't know.

Bruno Lannoo [00:42:21]: Yeah, I was also kind of like thinking, but I'm not sure that fits much better of like whether the planning is like the coming up with the tasks and the workflow is maybe assigning the task to the specific components that are capable of doing them. That could also be a place you could split this process into two components. But I'm not sure if workflow would be then a perfect name for that second part either.

Adam Becker [00:42:44]: So I think that would be more like the supervision or coordination or something like that. The coordination, yeah. Okay, next we have the evaluation and feedback. The idea here is that developers can create an evaluation template to systematically assess the quality of the output. So how do you know if the agent is doing what you want it to do? The idea is that you come up with an evaluation template and you include the kinds of metrics that are relevant to you, the criteria, some script for running it, or some script for basically giving an example of how you might have evaluated a particular output in the past. Then you actually run it. So you register the eval in the evaluation registry or evaluation card. It gets an ID, a version, a name, the inputs.

Adam Becker [00:43:40]: Now the specific evaluation execution, or an evaluation run, will be like the prompt, the agent task, what it was supposed to do and then what it actually did. And then the evaluator output is telling you, okay, well, this is how well it actually did. Right. So this has been sort of trained on an evaluation data set that ideally you're compiling, that has some test set, a ground truth. You can run this over time to just continue to see how well your agents are performing. And then you can also begin to kind of intercept these and route many of them to a human so that the human can give their own score, and that then goes back into your evaluation data sets. Right. So we're creating some kind of feedback loop.

Adam Becker [00:44:29]: That's all they had to say about this. Yeah. So, human feedback: you can create a list of categories that are relevant to you, and then the human can go in and begin to assess on the basis of the categories, so things like toxicity and human correctness, answer relevance, and then assign the score. And then you stitch together the feedback loop to make sure you dynamically update the training set. I think the key of the analysis tends to come down to tracing and spans. So, okay, at the highest level, a trace represents the complete journey of a single request or task through the system.

Adam Becker [00:45:07]: In the context of an LLM agent, it would include all the operations, decisions and calls that occur to fulfill a single user query or task. So that's the trace. For example, if an agent receives a query and it's expected to summarize a document, the trace might start when the user sends the query. That's the beginning of the trace, and it encompasses the document retrieval, the pre-processing, every step along the way until the user gets back the output. So that's the trace, and it's composed of all the different spans that would contribute to completing the task. Now, I don't think they were particularly explicit about the differences between trace and span and session; at least I didn't get a very good impression from this.

Adam Becker [00:45:58]: So I pulled in a couple of other diagrams from other places. For people who are not familiar with it: a trace is composed of a bunch of different spans, and we start out with the root span. And the idea here is that there is some type of hierarchy, right, because some spans, let's say this span is for the entire call, but then you can break it down into a subspan that can then again delegate to other agents to do different tasks. And then all of those are composed together to create the trace. So the trace is all of these together. And then you can begin to interrogate each span to see, well, how much did I spend on this, and how long did that take, and what was the latency, and all of that. So the span is a single unit of work or operation within the trace.

Adam Becker [00:46:49]: It represents a specific step or activity like fetching data, calling a sub-agent, processing a response. And the span includes metadata like start and end times, all the different resources involved, error information. And they're hierarchical, so they can be nested: you can have the summarization task, but then there are smaller kinds of spans within it. Now, the session is related traces grouped together, usually corresponding to a single interaction session with the user. Yeah, so this might be a good way to think about it: we have the tracing levels. At the session level you have a session ID, user ID, timestamps, a click path, and then a bunch of different traces.
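
As a hedged illustration of that hierarchy, the nested structure below shows one session holding one trace, which holds a root span and its child spans; all field names and numbers are invented.

```python
session = {
    "session_id": "sess-123",
    "user_id": "user-42",
    "traces": [
        {
            "trace_id": "trace-1",  # one user request, end to end
            "spans": [
                {"span_id": "s1", "parent": None, "type": "agent",     "name": "summarize_document"},
                {"span_id": "s2", "parent": "s1", "type": "retrieval", "name": "fetch_document",
                 "latency_ms": 120},
                {"span_id": "s3", "parent": "s1", "type": "llm_call",  "name": "generate_summary",
                 "latency_ms": 840, "prompt_tokens": 912, "completion_tokens": 180},
            ],
        }
    ],
}

def trace_latency_ms(trace: dict) -> int:
    # Roll span-level metrics up to the trace level.
    return sum(span.get("latency_ms", 0) for span in trace["spans"])

print(trace_latency_ms(session["traces"][0]))  # 960
```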

Adam Becker [00:47:31]: And then each of those traces has a bunch of different spans. Yeah, and there are different types of spans. Right. So you can have one for just making a call to the LLM, one for making a call to an existing workflow, using tools, using specific tools, embedding, all that. Okay, now once you have the trace and you have the session information and you have each of the spans, you might be interested in monitoring. The idea here is that the aspects of the system you want to monitor give you your choice of metrics, and the level at which you're interested in doing the monitoring gives you the dimension. So for metrics, there are the common metrics like token usage, cost, latency, but then there are ones to figure out the quality of the output, so these are toxicity, answer relevance.

Adam Becker [00:48:24]: Right. So you can evaluate it at each of those levels, and then specific errors, like have there been any privacy issues detected, or latency? And then the dimensions are: well, now that you have this session and span and trace level, you can figure out, okay, are you interested in token usage for a bunch of sessions, for specific sessions, sessions tagged by something, for a trace, that entire trace, or for a very specific span, or for all of the spans of a particular type? You can evaluate these metrics based on users, based on the models that your agents are using, or perhaps based on prompt versions. Yeah, I think that's all. So, any questions, thoughts? At the end they give a couple of caveats, saying, well, there might be tools we didn't examine, there might be things we didn't read.
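
To show the metric-by-dimension idea concretely, the sketch below slices the same metric (token usage) by span type, prompt version, or session; the records are invented.

```python
from collections import defaultdict

span_records = [  # flattened span data, as an observability backend might store it
    {"session": "sess-1", "trace": "t1", "span_type": "llm_call",  "prompt_version": "v3", "tokens": 912},
    {"session": "sess-1", "trace": "t1", "span_type": "retrieval", "prompt_version": "v3", "tokens": 0},
    {"session": "sess-2", "trace": "t2", "span_type": "llm_call",  "prompt_version": "v4", "tokens": 640},
]

def token_usage_by(dimension: str) -> dict:
    # Aggregate the same metric along whichever dimension you care about.
    totals = defaultdict(int)
    for record in span_records:
        totals[record[dimension]] += record["tokens"]
    return dict(totals)

print(token_usage_by("span_type"))       # {'llm_call': 1552, 'retrieval': 0}
print(token_usage_by("prompt_version"))  # {'v3': 912, 'v4': 640}
print(token_usage_by("session"))         # {'sess-1': 912, 'sess-2': 640}
```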

Binoy Pirera [00:49:20]: Awesome. Well, thank you Adam, thank you Valdimar, thank you Bruno, and thank you Nehil, who's not even here. Nehil had to leave early. You can find the links to the speakers' LinkedIn profiles in the comments in the chat. So thank you, as usual, to everyone for joining. We really appreciate it. It was a very interactive session, and we hope to see you guys with more interesting papers next year.

