Assessing and Verifying Task Utility in LLM-Powered Applications // Julia Kiseleva // Agents in Production
Data is a superpower, and Skylar has been passionate about applying it to solve important problems across society. For several years, Skylar worked on large-scale, personalized search and recommendation at LinkedIn, leading teams to make step-function improvements in their machine learning systems to help people find the best-fit role. Since then, he has shifted his focus to applying machine learning to mental health care to ensure the best access and quality for all. To decompress from his workaholism, Skylar loves lifting weights, writing music, and hanging out at the beach!
The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what extent LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the need to verify the utility of LLM-powered applications, particularly by ensuring alignment between the application's functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application. This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria.
Skylar Payne [00:00:03]: We're bringing on Julia here from MultiOn. Julia has spent the last 15 years researching user satisfaction metrics and methods at scale, dating back to web search scenarios. She spent six years conducting research on agent-related concepts at Microsoft and now builds agents at MultiOn, whose tagline is building agents that complete tasks start to end. So I'm very curious to see what Julia has to share with us. With that, I'll let you take it away, and I'll come back in for Q&A at the end.
Julia Kiseleva [00:00:46]: Thank you so much for the introduction. Happy to be here and share some of my work on agent evaluation. To highlight what you will get out of this talk today: if you are building LLM-powered applications in the agentic space, across various domains, you want to understand what metrics you need to optimize your application for. Sometimes it is actually a lot of work to figure out what needs to be done. For example, for a great application like web search, it took a huge community a lot of work and research just to figure out the right metric to optimize for from the user's perspective. Because at the end of the day, we are all building these things for users. I was lucky to be part of AutoGen, so if you have looked at the agentic space, you have probably heard about this open-source programming framework, which was released in September of last year and attracted a lot of attention.
Julia Kiseleva [00:02:11]: I was just blown away by how many people are trying to use agents, LLM-powered engines, to solve all kinds of tasks. Especially in the Discord community, you could see people using it to build teachers, analysts, and things like that. But the difficult part in machine learning is figuring out what your objective is: what you are building, what you want to optimize for, and what you want to design as a metric. That is relatively easy for a task like object recognition, because you can get the ground truth, label it, and just focus on making sure your algorithm works well. But if you are building something for end users, you need to care about how satisfied or dissatisfied they are with the experience and what your application is even for. That was the motivation to get into this idea of a scalable approach for evaluating agent performance, or not even evaluating, but understanding what kind of utility it may or may not bring to users. We decided to explore this because we saw how many developers all over the world are starting to build applications for users, and they need tools to figure out whether the agents they are building actually meet the needs users have highlighted. Most of the time this kind of thing is done in machine learning settings where you have ground truth you collected and labeled, but that is not always possible, especially with such a rapid pace of technology, and even for that you need to at least understand your objectives.
Julia Kiseleva [00:04:24]: There is also a lot of research already showing that an LLM itself can be an efficient alternative for getting an idea of what your users actually want and helping you evaluate the systems you are building. You need to do it fast, and it is a relatively cost-efficient way of doing this. I will not go deep into this because it is not part of this talk, but if you are interested, you can find a lot of papers under the heading of LLM-as-a-judge. This process needs to be automated, which is especially important for tasks where you sometimes do not have any ground truth and you want to figure it out by going through your data. Another important reason we wanted to bring AgentEval out there is to empower developers and make them aware of the potential utility the applications they are designing can bring to end users, and hopefully care about it, so that while they are improving and changing the agentic system behind it, they are not acting in the dark but have a tool to figure out whether it is getting better or worse.
Julia Kiseleva [00:05:50]: Let me be a bit more structured about this. We have seen a lot of tasks that users are trying to solve using LLMs or LLM-powered applications, and there are tasks where success is not clearly defined. For us as machine learning researchers, success is great when it is on a binary scale: we have either completed the task or not. But if, for example, a system writes an email for you and you then copy that email and change it here and there, what exactly was the success? That is still an interesting direction for research. Then there are tasks where success is clearly defined. For example, if the goal of your LLM-powered application is to solve math problems, one of those benchmarks we all care about, then success is clearly defined.
Julia Kiseleva [00:07:03]: Because we have ground truth, we know the right answer and can judge whether a solution was correct or not. Another aspect of these tasks is whether you have a preferred optimal solution; in some cases there are multiple solutions that lead to the same, sometimes incorrect, answer, and you need to figure out which one is actually better for you. Why am I bringing this up? Because what we are going to discuss next is a new way of defining what utility an LLM-powered application can bring to end users. We try to be as domain-agnostic as possible, meaning we cannot really tie it to a concrete benchmark, and the whole idea of benchmarks at the moment probably needs its own separate talk. But while building this approach, we want to show that it makes sense, so we need to start somewhere. We decided it would obviously be much easier to start with tasks where success is clearly defined and see what we can do for that type of task in this space.
Julia Kiseleva [00:08:25]: For this presentation I will use the task of solving mathematical problems, where success is clearly defined. We also tried AgentEval on various other types of tasks, including ones where success is not so clearly defined; you can look that up in the paper or online. First, we need to revisit the idea that you need to define utility: you are not building the application for its own sake, you want to help users solve particular tasks. Usually you have some criteria in mind, like whether the solution you are building does something faster than another solution or requires less effort, but there might be criteria you cannot even see at the moment. And usually when you do that for a new task, for example how to rate teacher performance, you have to define a lot of criteria, which used to take a lot of effort and a lot of ongoing user research and so on. Our point here is that most developers do not have that type of support, time, or even the proper training to do that user research.
Julia Kiseleva [00:09:57]: So can we define a critic agent that helps developers define the criteria against which the application should be evaluated? Those criteria express some kind of utility to the end user, and a quantifier agent can then assess how well your particular solution is doing according to the defined criteria. We are basically trying to define a multidimensional task utility. To give you a sense of it, I picked the math example because I think we are all a bit of a domain expert in math here, since we all studied it at school or at some other level. Here you see the criteria for the math problem as defined by the critic agent. That is the agent you provide with a prompt in which you explain what the task is, along with one successful and one unsuccessful example of your agent solving a particular problem. Here I just picked four criteria, each with its description and accepted values.
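To make the shape of that output concrete, here is a minimal sketch of what such a criteria list can look like for the math task. The descriptions and accepted values are hypothetical wording; only the structure (name, description, accepted values) follows what is described above, and the criterion names echo the ones discussed later in the talk.

```python
# Illustrative criteria for the math-problem task, in the shape the critic agent
# returns: a name, a free-text description, and an ordered list of accepted values.
# The exact wording is hypothetical; only the structure mirrors the talk.
math_criteria = [
    {
        "name": "clarity",
        "description": "How clearly the solution explains each step of the reasoning.",
        "accepted_values": ["not clear", "moderately clear", "very clear"],
    },
    {
        "name": "efficiency",
        "description": "Whether the solution reaches the answer without unnecessary steps.",
        "accepted_values": ["inefficient", "acceptable", "efficient"],
    },
    {
        "name": "completeness",
        "description": "Whether all parts of the problem are addressed.",
        "accepted_values": ["incomplete", "partially complete", "complete"],
    },
    {
        "name": "error_analysis",
        "description": "Whether the solution checks its own intermediate results for mistakes.",
        "accepted_values": ["absent", "partial", "thorough"],
    },
]
```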
Julia Kiseleva [00:11:25]: All of this is the output of the agent; none of it is something I came up with myself. That is especially useful if, for example, I am designing a new application for a new domain. We tried the experiment in the robotics space, where I am not an expert, and the critic agent helped us figure out quite a complete set of criteria, which was later verified by a domain expert. That saves a lot of time in figuring out which criteria really matter for a domain.
Julia Kiseleva [00:12:07]: Again, you can do this very easily with just a critic agent, which gives you this table. It is very important to follow the accepted values as suggested by the critic agent. All these 0s, 1s, and 2s are what we mapped the values to for the later histograms, but in general you are better off sticking with the textual labels. I could give a longer explanation of why, but believe me, the best approach is to make sure you give the quantifier agent the textual description of the accepted values. All of that basically comes for free. And by the way, you might ask how important this is; it turned out to be really important.
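As a small aside on that mapping, here is a minimal sketch, with a hypothetical clarity criterion, of converting the ordered textual accepted values into 0/1/2 scores purely for plotting, while the quantifier itself is still prompted with the textual labels.

```python
# Map a criterion's ordered textual accepted values to 0/1/2 only for plotting;
# the quantifier agent itself is still prompted with the textual labels.
clarity = {
    "name": "clarity",
    "accepted_values": ["not clear", "moderately clear", "very clear"],  # worst -> best
}

def to_numeric(criterion: dict, verdict: str) -> int:
    # Position of the verdict in the criterion's ordered accepted values.
    return criterion["accepted_values"].index(verdict)

print(to_numeric(clarity, "very clear"))  # -> 2
```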
Julia Kiseleva [00:12:57]: Coming back to how important this is: one of the first criteria that showed up for the math problems was code quality, and I wondered, what code quality? Then I realized that AutoGen, one of the baselines we used, produces Python code to solve these problems. That is why it achieves very high quality; it is basically doing the task right. The question is whether solving the problem by writing code is what you actually want. For example, if you are building something like a teacher agent, is that something you want or not? Again, it is up to you.
Julia Kiseleva [00:13:43]: It is your application; we are just here to help you evaluate it and define what the task utility can be. Another thing to note is that the quantifier agent sees unseen examples from your agent and tries to quantify them. So if you define a new solution or provide a new baseline, and you stick with the task utility criteria you have, the quantifier agent can help you assess the unseen examples and see whether you are improving on some of those criteria. We also desperately needed a verifier agent here, because the verifier agent can tell you which criteria are good, or, better put, which of your criteria can be reliably assessed by the quantifier agent. So we specified a verifier agent that helps you filter down to the criteria that are actually stable and robust.
Julia Kiseleva [00:14:59]: They should also hold up against adversarial examples, and I will show you an example of that a bit later. You can download the paper here, and by the way, AgentEval is now part of AutoGen, so you can just go and run it. This is the main part of the talk you may want to remember. Imagine you are designing your LLM-powered application. You use agents that communicate with each other to solve the task, and they produce a lot of logs for you.
Julia Kiseleva [00:15:31]: Based on those logs, you can figure out what kind of criteria are important for your task, and hopefully for your end users as well. You can use the quantifier agent to assess the quality of each solution, seen or unseen, and then, using the verifier agent, filter out the criteria that are not really stable. That step matters because there is still a lot of work to be done to figure out how well the quantifier agent works. You can do this for any type of domain, and you do not need ground truth for it. If you have it, that is great; but if you do not and you just want to explore, this is something you can set up in a couple of minutes to get a first idea of what your users might be using your application for.
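As a rough illustration of that workflow, here is a minimal sketch of a critic and a quantifier driven by plain LLM calls. It uses the OpenAI Python SDK directly rather than the actual AgentEval module that ships with AutoGen, and the prompts, model name, and function names are assumptions made for illustration only.

```python
# Minimal sketch of the critic -> quantifier flow described above, written against
# the OpenAI Python SDK rather than the actual AutoGen AgentEval module.
# Prompts, the model name, and helper names are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model name; use whatever your application uses


def ask(prompt: str) -> str:
    """One LLM call, returning the raw text of the reply."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


def critic(task: str, success_log: str, failure_log: str) -> list[dict]:
    """Propose criteria (name, description, accepted_values) from one good and one bad log."""
    prompt = (
        f"Task description: {task}\n\n"
        f"Successful execution log:\n{success_log}\n\n"
        f"Unsuccessful execution log:\n{failure_log}\n\n"
        "Propose evaluation criteria for this task as a JSON list of objects with keys "
        "'name', 'description', and 'accepted_values' (ordered worst to best)."
    )
    # In practice you would add JSON-mode or output-parsing guards here.
    return json.loads(ask(prompt))


def quantify(task: str, criteria: list[dict], solution_log: str) -> dict[str, str]:
    """Rate one (possibly unseen) solution log against each criterion, using the textual labels."""
    prompt = (
        f"Task description: {task}\n\n"
        f"Criteria: {json.dumps(criteria)}\n\n"
        f"Solution log:\n{solution_log}\n\n"
        "For each criterion, choose exactly one of its accepted_values. "
        "Return a JSON object mapping criterion name to the chosen value."
    )
    return json.loads(ask(prompt))
```

A verifier-style check that filters out criteria whose scores do not separate successful from failed runs is sketched further below.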
Julia Kiseleva [00:16:29]: Let me also give you a bit of an example, and come back to the fact that we are developing something new. This is a new approach to doing this sort of assessment; we call it assessment because calling it evaluation is a bit of a stretch, in my opinion. That is why we started with a task where we can split runs into successful and unsuccessful ones. Here you see the same criteria; there are only four here, but you can usually get up to 25 criteria for the majority of tasks. And you can see three baselines: ReAct, a vanilla GPT-4 solver, and AutoGen. You can see, for example, that for clarity, efficiency, and completeness, successful tasks indeed score higher on average for each criterion.
Julia Kiseleva [00:17:25]: That is one of the things we were trying to make sure held up: because we are doing something new, we need some way of verifying whether we are doing it right. You can also see that there is variation in how the various criteria are assessed across the different baselines, and it is then up to you to decide whether you want to go with one of them. Another important part is to keep those assessments around: if you change your underlying agents, you can rerun the process and see, first of all, whether your new solution brings new criteria into the picture, because a new solution sometimes introduces new criteria, and you can decide whether that is something you want to go with. However, the error analysis criterion does not look good here. That is where we think the quantifier could not do a proper job of assessing error analysis, and that is what the verifier is for.
Julia Kiseleva [00:18:45]: Because a verifier will tell you that maybe error analysis is a good criteria you want to have but we just cannot assess it in the current state. So and that's definitely more research needs to be done towards figuring out if we can do something about it. So and this is another important part to figure it out is that there's a task Based criteria and there's a solution based criteria. So sometimes if you just describe to OLM what you want to evaluate based on a task, I can suggest you different criteria. But providing the successful and unsuccessful example can open up the types of again like if you just described to LLM like what's the criteria for solving the math problem? The code efficiency would not be part of it at least like at the time when we tried it. But since now sometimes we use the code to solve the math problems that's kind of. That's part of it now. So but it's also kind of important what you could see here that you at some point kind of that would be suspicious if our critic will continue like give us the more kind of iterations we run the more criteria is going to suggest.
Julia Kiseleva [00:20:06]: So like it looks like at some point we can a number of criteria which is a good sign. So like basically there's a limited number of deaths and from my own. My own intake here is like I would not worry too much about the accuracy of this criteria and those people say like are they going to be correct? It's only like in the worst case, 25 criteria, 30 criteria per domain. As a domain expert you can easily verify some of those and you can remove the one which you're not interested. I would be more concerned figuring out how complete is the set. So and that's research needs to be done on this in this area as well. But another intake here please try to as soon as you propose a new solution around your critic you would be surprised because the critic might discover new criteria which is introduced by your solution. So and this is another kind of ideas about how can you see that's again we have this advantage here that we had like successful and unsuccessful cases.
Julia Kiseleva [00:21:10]: So and then just to see the distribution of a quantifier output for successful we could see there's a dark blue and failed cases. And for example again for error analysis you could see that basically the scores are kind of almost the same on average. That's not what you want to see. So but again we learned it based on the examples how to based on examples where we have success criterias and like using the verifier you can filter the criterias which are unstable even if you don't know the success of it. So and another thing is that like was interesting to do with figuring out if the sort of worse examples a worse solution is actually shown to be worsened for different criteria. It's not such an easy task because again we just have here like either successful or not successful solutions. That's why so like you didn't have a gradation. So like as you could see here, this average values which is corresponded to the categories I showed you in the beginning.
Julia Kiseleva [00:22:35]: So what we did we just take and introduce the noise in an existing solution and then the hypothesis was that the samples which would be having contained in noise. So a disturbed sample as you could see here should get lower score based on quantifier for in comparison to the same examples but without noise. And so we could see that this is actually the case. So that means that you can sort of trust the idea that quantifier will rank higher the examples where like which are like sort of of a better quality for various types of criteria. Again this is important because we trying to set up something absolutely new here and as a conclusions so we introduced a novel framework. We called it Agent Eval. So it's designed to quickly evaluate LLM part agentic application. Here you have a QR code you can go to the blog post was on as part of the Autogen library and just try Agent Eval for URL powered application.
Julia Kiseleva [00:24:04]: It is also based on an academic paper, which will be presented tomorrow, I believe at EMNLP. We believe it can be used for any type of agentic application where you have logs you can analyze to see how your agents are behaving. It is also very scalable and cost-efficient; you do not need to make many LLM calls to get a first sense of the various criteria. It can give you an interesting outline of what the potential utility can be. For example, for this math application, one of the criteria was whether the solution was verbose, and maybe you do want verbose solutions, especially if you are designing something like a teacher. The beauty of this is that, with this in place, you can keep optimizing your agents to provide more verbose solutions, or you can even take it to your users and ask them what kind of solutions they want, what kind of utility matters to them.
Julia Kiseleva [00:25:21]: Having those criteria, and a good way of figuring out with the quantifier whether your solution actually follows them, can help you optimize your application toward user needs. Another thing, as I said, is that AgentEval helps you uncover your application's capabilities and see how they evolve over time, because if you propose a new solution, there is a chance it will bring new criteria with it as well, so keep an eye on that. We hope this will help the developer community figure out how to assess their solutions and align them with the needs, or the criteria, they want their solutions to be associated with. There is definitely a lot of potential here, and this is one of the first directions we can go. I am just very happy to be here today to share this work, because I think it can uncover ways for developers to figure out the strengths and weaknesses of their agentic applications based on real data and real interactions, and also bring awareness that we need to optimize our applications not only toward the obvious metrics like latency or even success, but also toward this underlying, multidimensional way in which users perceive our applications. And we have a way to influence that through this. Thank you so much for your attention; that pretty much completes my talk today.
Julia Kiseleva [00:27:27]: Awesome.
Skylar Payne [00:27:27]: Thank you so much for coming. Unfortunately, we don't have time for questions, so at this time everyone can mosey on over back to track 1 to view the closing. But I really appreciate everyone coming and viewing, and I really appreciate your time, Julia.
Julia Kiseleva [00:27:48]: Yeah, thank you so much. If you have any questions, please reach out to me on LinkedIn; my DMs on Twitter are also open, so I'm happy to follow up on any questions about this work. Again, it is available as part of the AutoGen library, and a more extensive explanation is in the paper.
Skylar Payne [00:28:12]: Yeah, awesome. Thank you so much. Take care.
Julia Kiseleva [00:28:16]: Thank you.