Evaluating the Effectiveness of Large Language Models
Aniket is a Vision Systems Engineer at Ultium Cells, skilled in Machine Learning and Deep Learning. He is also engaged in AI research, focusing on Large Language Models (LLMs).
At the moment, Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.
Dive into the world of Large Language Models (LLMs) like GPT-4. Why is it crucial to evaluate these models, how do we measure their performance, and what are the common hurdles we face? Drawing from his research, Aniket shares insights on the importance of prompt engineering and model selection. He also discusses real-world applications in healthcare, economics, and education, and highlights future directions for improving LLMs.
Aniket Singh [00:00:00]: I'm Aniket Singh. I work as an AI engineer, or vision systems engineer, at Ultium Cells. I drink my coffee, usually with milk.
Demetrios [00:00:14]: What is up, MLOps community? I am back with another podcast. As always, your host, Demetrios. Today we're talking with Aniket, all about evaluating the LLMs themselves, but not from the benchmarking perspective. We want to evaluate just how much they know and how well they can do. I loved when he talked about the confidence scores, especially how he tuned these different confidence scores and which models are more confident and which models are less confident. Let's get into it with Aniket. As always, if you liked the episode, feel free to give us a little feedback. And if you can share it with just one friend, it would mean the world to me.
Demetrios [00:01:04]: We should probably start with the backstory on how we met, which was through.
Aniket Singh [00:01:10]: Which was through LatticeFlow, actually, since we did the roundtable discussion. We were kind of planning to do it, and I was like, we should also do another one, something related to what's not my job, something that I do on the side.
Demetrios [00:01:27]: Because you are... So you have both. Can you explain to us, and to me again, what your job is and what your hobby is?
Aniket Singh [00:01:39]: Sure. So currently I work as a, you can say, machine learning engineer or a vision systems engineer or an AI engineer, basically you can call it anything you want, at an EV battery manufacturing plant called Ultium Cells. And that's why I was invited to the roundtable discussion, because we were discussing manufacturing and the challenges related to AI in manufacturing. But that's my day job. At the same time, I was very interested in research, and especially LLMs. That goes back to my thesis, too. When I did my master's, my thesis was related to transformer models. LLMs were not a big thing back then, but I was always interested in the transformer model because it performed much better than any other model that we had at that time.
Demetrios [00:02:43]: You bet on the right horse.
Aniket Singh [00:02:44]: Yeah, yeah. And once ChatGPT came out with their API, I was like, we need to do something. We need to start doing some research on this. And I do enjoy working on this on the side. We ended up publishing a few papers on this, and I'm still working on a few more.
Demetrios [00:03:04]: Yeah. Awesome. So you've got your day job and then your night job, and your night job has brought you into a bunch of cool realms. Specifically, we wanted to talk today about the ever-hot topic of evaluating LLMs. I know there have been a bunch of people on here who have talked about evaluation for LLMs, and we look at it very much on the systems level. So, taking a step back, we've done two surveys in the community on evaluation and how people in the community are looking at evaluating their LLMs and their systems. But you have a bit of a different take.
Demetrios [00:03:59]: So let's talk about that.
Aniket Singh [00:04:02]: So when these models started to come out, and all the open source models started to come out, I would see a lot of benchmarking especially. And since I wanted to do something within the realm of LLMs, I saw everyone is doing benchmarking, and nobody had actually figured out how to make a whole startup or application using LLMs just yet, because it was still the very beginning phase of it, and they were making a lot of changes to the API that would break a lot of things on the application side. So we did not want to get involved in any of that, like the LLM system evaluation or LLM benchmarking, because there are a lot of people doing it. So I thought of this different approach: people are using LLMs to see if they could replace some of the work that humans would do, to make it more efficient or make it cheaper, whatever you can say, depending on the motive. Right. So we thought, why not just use LLMs to see if we can replace some of the scenarios where a human would come into play. Right.
Aniket Singh [00:05:23]: Let's say an auction, right. This is one where people go and bid on the things that they want to buy, and you have to be very careful of how much you bid because you don't want to overbid. But humans in general are usually pretty good at it because they know their budget, they know what they can afford, how much the thing would cost, unless they are very unfamiliar with the item. Right. In most cases, humans are pretty good. But can LLMs do that? That was one of them. Another...
Demetrios [00:05:55]: Wait, can we, can we double-click on that one? Because that feels like something that you could also set up with just a rules-based system: I don't want to bid outside of this number, here's my budget.
Aniket Singh [00:06:09]: Yeah. Technically, a rule-based system could do it too. Right. But we wanted to see how an LLM would make those decisions. Right. A rule-based system or an AI system could do that too. But just to simulate, let's say there's an extension that we wanted to work on, but we haven't gotten the chance. Think of a scenario where you know the person's personality.
Aniket Singh [00:06:35]: Right. You know how these people would behave, and, you know, kind of what kind of people would come to the auction, and you kind of simulate that auction beforehand to see how much profit you can make. Right. Scenarios like that, where you can use the personality to dictate the outcome. So we did give personalities to all these LLMs. We used LangChain to give them a personality, give them some backstory, like, hey, you are an accountant, you have been working for ten years, 15 years, whatever. There was another model called Olivia, who was an extrovert and didn't care as much.
Aniket Singh [00:07:19]: She was more spontaneous. So all these models had different characteristics that dictated a lot. And that was one thing that we got to see, that the LLMs actually behave similar to the characteristics that we give them.
Demetrios [00:07:35]: With the characteristics that you gave them, did any go over budget?
Aniket Singh [00:07:41]: Oh, yes. The extrovert one, the spontaneous one, was going over budget every single time. And we had to do a little bit of work to make her understand, because we named her Olivia, and we tried to give her some clues, like, hey, you need to be careful. We were not telling her explicitly, this is your budget or this is how much you can spend. Right. But at the same time, we were trying to make sure that they don't completely go crazy, especially with GPT-3.5. That's what we noticed, that it didn't care so much about the math and the reasoning. But GPT-4 was a very different story.
Aniket Singh [00:08:24]: But we were able to achieve much better results just using chain of thought with GPT-3.5 as well.
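A minimal sketch of the persona-conditioned bidding setup described above, assuming the OpenAI Python SDK; the personas, budgets, item, and prompt wording are illustrative placeholders rather than the prompts from the paper:

```python
# Illustrative sketch only: persona-conditioned bidding agents, not the paper's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = {
    "Olivia": "You are Olivia, a spontaneous extrovert who loves winning auctions.",
    "Raj": "You are Raj, an accountant with 15 years of experience who is careful with money.",
}
BUDGETS = {"Olivia": 500, "Raj": 500}  # hidden budgets, never stated explicitly in the prompt

def get_bid(name: str, item: str, current_bid: int) -> int:
    """Ask a persona-conditioned model for its next bid as a bare integer."""
    messages = [
        {"role": "system", "content": PERSONAS[name] + " Reply with a single integer dollar amount."},
        {"role": "user", "content": f"The item up for auction is: {item}. "
                                    f"The current highest bid is ${current_bid}. "
                                    "What do you bid? Remember your funds are limited."},
    ]
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    text = reply.choices[0].message.content
    digits = "".join(ch for ch in text if ch.isdigit())  # naive parse of the bid
    return int(digits) if digits else 0

# One round of a toy auction: check whether any persona exceeds its hidden budget.
for name in PERSONAS:
    bid = get_bid(name, "vintage guitar", current_bid=300)
    print(name, bid, "OVER BUDGET" if bid > BUDGETS[name] else "ok")
```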
Demetrios [00:08:32]: Okay, so let's get into the differences between the models. But before we do, to give an overview of what you really work on: you're thinking a lot about evaluating the capabilities of the model, and not so much about the system around the model, how it's set up, how the evaluation of the vector store retrieval goes, or the evaluation of the final product, the final answer, the question-answer pairs. That type of thing isn't really what you're looking at when you're talking about evaluating these models. You're saying, can it do what we do as humans? Are there tasks that we do where, if you give it to an LLM and you give it a backstory, it can also kind of do it?
Aniket Singh [00:09:25]: Or not do it.
Demetrios [00:09:27]: Yeah, or not do it completely.
Aniket Singh [00:09:28]: So that's what we wanted to test, basically, because everyone was coming up with their version of testing the model. There are a lot of libraries now that can do that. But there weren't many papers talking about this kind of evaluation. So that was one of the motivations to do something a bit different from the trend.
Demetrios [00:09:53]: And some of these different things that you tried to see if the LLMs could do were bidding at an auction. What were some other tasks that you thought might be good to replicate?
Aniket Singh [00:10:06]: So there are other tasks that we are currently working on, but this is a very different type of assessment called stealth assessment. This is something we are working on, and it is almost ready; it should be available pretty soon. What we basically did in this is we made them do LeetCode problems, and not just, okay, solve this LeetCode problem, but actually using confidence scores that they provide with their answer, how confident they were, and providing them with feedback if needed. If they asked for hints, we provided them with hints. But every hint or feedback would cost them something. They knew that there is a cost, there's a budget, but they didn't know explicitly that they were being tested on their budget management. That's why it's called stealth assessment, because we are not explicitly making them aware of the situation. So this is a different paper that we are working on right now.
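Roughly, the stealth-assessment loop could be wired up like the sketch below; the budget, hint cost, problem list, and the solve_with_model helper are hypothetical stand-ins for whatever the paper actually uses:

```python
# Illustrative sketch of a stealth assessment loop: the model solves problems,
# reports a confidence score, and can request hints that silently drain a budget.
# All numbers and solve_with_model() are hypothetical placeholders.

def solve_with_model(problem: str, hints: list[str]) -> tuple[str, float, bool]:
    """Stand-in for an LLM call returning (answer, confidence in [0, 1], wants_hint)."""
    # A real run would prompt the model with the problem plus any purchased hints.
    return f"stub solution for {problem}", 0.9, len(hints) < 1

BUDGET = 100       # the model is told a budget exists...
HINT_COST = 10     # ...but not that budget management is what is being assessed
problems = ["two-sum", "valid-parentheses", "merge-intervals"]

spent = 0
for problem in problems:
    hints: list[str] = []
    while True:
        answer, confidence, wants_hint = solve_with_model(problem, hints)
        if not wants_hint or spent + HINT_COST > BUDGET:
            break
        hints.append(f"placeholder hint #{len(hints) + 1} for {problem}")
        spent += HINT_COST
    print(problem, answer, f"confidence={confidence:.2f}", f"spent={spent}")
```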
Aniket Singh [00:11:12]: I have to take a step back a little bit, because this paper was inspired by another paper where we tested the models for the Dunning-Kruger effect, which is a very well-known phenomenon, but there are a lot of controversies around it. Some accept it, some don't. But we still wanted to see, because they learn from text, whether they show similar behavior or not. Right. So in that paper, we introduced the model confidence score with the answers, to see how well their confidence and their competence basically align. That inspired the other paper.
Demetrios [00:11:54]: And the confidence score is just: once they've done a task, they say, we think that this task is complete, and give it a number from one to ten?
Aniket Singh [00:12:03]: So we make the LLM do this. We ask the LLMs, hey, you solved this problem; now that you've solved it, how confident are you, out of one to ten or zero to one, that you did very well? So we had two different types of confidence metrics: we were asking for absolute confidence and relative confidence. The absolute confidence is how confident the model is in itself, that it did a good job. Right. The relative confidence is where we tell the models, hey, this is a test that we are doing between multiple models. How well do you think you did compared to the other models? And they don't know what the other models are doing, they have no idea.
Aniket Singh [00:12:48]: They just have information about how good the model itself is. Especially in the case of GPT-4, we could see that the model was very aware that it is one of the best models out there. So that was kind of interesting to see.
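The two confidence questions might be phrased roughly as in this sketch; the wording is an assumption, not the prompts used in the published papers:

```python
# Illustrative prompt templates for the two confidence metrics described above.
# The exact wording is assumed, not taken from the published papers.

ABSOLUTE_CONFIDENCE_PROMPT = (
    "You just answered the previous question. On a scale from 0 to 1, "
    "how confident are you that your answer is correct? Reply with a number only."
)

RELATIVE_CONFIDENCE_PROMPT = (
    "Several other language models were given the same question. "
    "On a scale from 0 to 1, how well do you think you did compared to them? "
    "Reply with a number only."
)

def parse_confidence(reply: str) -> float:
    """Clamp whatever the model returns into [0, 1]; NaN if it isn't a number."""
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return float("nan")
```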
Demetrios [00:13:04]: I think all of these different LLMs have no problem with their confidence, and that is one thing that they do not lack.
Aniket Singh [00:13:13]: That's what we thought was going to happen. But some models were kind of, in a way, aware, or even less confident than they should have been. Models like Claude were not confident even when the answers were right. It would try to hold back a little bit and say that it doesn't feel very confident. So I think it's trained in a way to not act overly confident.
Demetrios [00:13:42]: Maybe her interest.
Aniket Singh [00:13:43]: Yeah. But GPT-4, on the other hand, is a totally different story. It's so confident. It knows, I am the best model out there.
Demetrios [00:13:49]: And it will say it.
Aniket Singh [00:13:53]: Yeah, yeah, yeah.
Demetrios [00:13:54]: Oh, that's classic. Now, I've heard there are problems when you get into the scenario of trying to rate things like one through ten. This has come up in the evaluation world with LLM-as-a-judge: an LLM rating another LLM's output, saying, is this correct, and giving it a score between one and ten on its correctness.
Aniket Singh [00:14:20]: Yeah.
Demetrios [00:14:21]: And the problem there is that one through nine for an LLM are basically all the same thing. And so really it should be binary. It should just be that zero to one. Did you see any of those correlations?
Aniket Singh [00:14:35]: Yes. So we did two papers like this. In the first paper, we actually had one to ten, but then, based on our professor's input, he recommended that we switch to zero to one and see if that helps. And we did see a difference. We felt like it was much more accurate because it was thinking more probabilistically, which wasn't the case with one to ten, where it would most likely give a nine or 9.5, with GPT-4 especially, and even GPT-3.5, actually. But when we switched to zero to one, it actually tried to say 0.98, 0.92, or even sometimes went down all the way to 0.77. So we introduced feedback. Basically, this feedback confirms, hey, your answer was wrong. Then we wanted to see how well it aligns for the next question, like, is it going to reduce the confidence? Which we started to see, especially in the case of the zero-to-one scale.
Demetrios [00:15:42]: So you were... Let me just see if I understand what the experiment was. You were getting output, and the LLM was scoring the confidence of it being right. And then you would give it explicit feedback on whether that was right or wrong.
Aniket Singh [00:15:58]: Well, then.
Demetrios [00:15:59]: And then you would ask it again, what's your confidence of it being right or wrong? Or you would have it regenerate.
Aniket Singh [00:16:05]: So think of two different cases. There's a feedback case and a no-feedback case, which are independent of each other. In one case, we don't give them any feedback. We don't tell them if they were wrong or right at all. We just ask them the question and ask them to rate their confidence in the answer that they just provided. In the other case, we tell them, hey, this time you were wrong, when it had already given us, let's say, 98% confidence. Then we ask them, how confident do you feel that you are going to do better on the next question? Does the confidence go down or not? And in most cases, especially with models like GPT-4 and Claude, we would see that it would start to align the confidence and drop it to, like, 77%. Right.
Aniket Singh [00:16:55]: So they get to the next question, see the question, and answer it. In some cases the questions are difficult, because we had varying difficulty levels of questions. Right. So this way we could see if they are actually trying to align the confidence score with how accurate the answer is or not.
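One way to wire up the feedback and no-feedback conditions he describes is sketched below; ask_model, the question set, and the grading are hypothetical stand-ins, with the only difference between conditions being whether the model is told how it did:

```python
# Illustrative sketch of the two experimental conditions: with and without feedback.
# ask_model() is a hypothetical stand-in for an LLM call; questions are placeholders.
import random

def ask_model(question: str, history: list[str]) -> tuple[str, float]:
    """Stand-in returning (answer, self-reported confidence in [0, 1])."""
    # A real run would pass `history` (any feedback so far) back into the prompt.
    return "stub answer", round(random.uniform(0.7, 1.0), 2)

questions = [
    {"text": "easy placeholder question", "gold": "stub answer"},
    {"text": "hard placeholder question", "gold": "something else"},
]

def run(feedback: bool) -> list[float]:
    history: list[str] = []
    confidences = []
    for q in questions:
        answer, conf = ask_model(q["text"], history)
        confidences.append(conf)
        if feedback:
            # Only this condition tells the model whether it was right; the message
            # is carried into the next call so the next confidence can adjust.
            correct = answer == q["gold"]
            history.append(f"Your last answer was {'correct' if correct else 'wrong'}.")
    return confidences

print("no feedback:", run(feedback=False))
print("feedback:   ", run(feedback=True))
```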
Demetrios [00:17:17]: And so when you're asking it for its confidence on the next question, it hadn't seen the next question.
Aniket Singh [00:17:25]: It doesn't see the next question. No.
Demetrios [00:17:27]: Okay. One thing that is fascinating to me there is that it almost, like, knew that the questions were harder and had less confidence. Is that what I'm understanding?
Aniket Singh [00:17:43]: Yeah. Well, when it starts to mess up, when it starts to give wrong answers, I think it starts to realize that the questions are getting difficult now and the level is changing. But we don't tell them that, hey, you are in level two or level three, it's all based on the question. So it does start to understand, and in some cases, just knows because it was probably trained on the exact question. Right. But there is a lot of misinformation, too. So we kind of used data that would have a lot of misinformation and make things difficult for everyone. So the data picking was done in a way that it would be robust.
Demetrios [00:18:26]: Yeah. Yeah. All right, so tell me some other takeaways from this. And this was just one of the papers, right? How many papers have you written, by the way? You've written a ton.
Aniket Singh [00:18:37]: So the auction paper is one of the papers that is already published. And the model confidence score one, the one-to-ten, is also published, in the Information journal. In that paper we actually used multiple models, not just one; we used a lot of open source models too. And then, because of the feedback that we got from people who work on LLMs, and based on that understanding, we were like, okay, we need to do another one. We ended up working on another paper that used only the GPT-4 model, because some of these models were not good enough to be tested specifically on USMLE-related questions. That's what we worked on, and we used zero to one in that. And we have a couple more papers coming.
Aniket Singh [00:19:33]: The stealth assessment is one that's coming up very soon. There's another paper, the extension of the auction paper, that we are working on right now, trying to see if we can use the more advanced features we have now that the models have become multimodal. So we are trying to work on those things too.
Demetrios [00:20:00]: Oh, cool. So would you be feeding in photos of items and that type of thing?
Aniket Singh [00:20:06]: That's the plan. And to see how well it adjusts based on the picture: if we change it, make it not as attractive, does it actually drop the value or not? Those kinds of things. We are still in discussion with my professor about how we can work on that paper because it hasn't completely started yet. We do know that we want to work on it, and we are making sure that we have the prerequisites done before we actually get into it. Yeah.
Demetrios [00:20:39]: And do you have thoughts in mind about where you're going to go next? Because if I'm understanding your motives, you really want to see how aware these things are. How much can they understand, how many subtleties can they understand, and can we test for those different subtleties? So have you thought about other experiments on how to probe that?
Aniket Singh [00:21:09]: Now I'm thinking of moving more into the applications and actually making them do an extensive task. Because a lot of the issues right now are with using LLMs in production. Right. We see a lot of issues, and I have my own startup too, where we see a lot of issues because of LLMs. So we are now trying to get into research on that so that we can improve our production LLMs. Right. So that's where I'm going next. Especially multimodal, multi-model, and multi-agent is what I'm looking into.
Aniket Singh [00:21:49]: Because a lot of the mistakes that an LLM makes with its output can possibly be improved using a different LLM, a different agentic workflow, right?
Demetrios [00:22:02]: Can you explain that real fast?
Aniket Singh [00:22:04]: Yeah, yeah, sure. So let's say I'm trying to extract some information from a document, right? If I use just one LLM to do that, it's probably going to just skim through it, and it's not going to do a great job. But when we use multiple agents and we tell them, hey, you will take care of this part, you will take care of that part, you will take care of this other part, and create a whole workflow, this could potentially improve how well the model provides the results. And not just having, let's say, ten different models do it, but having other models cross-check, making sure that the output given by the model is actually what the user requires. So one model doesn't need to know the exact motive. Let's say I need to extract all that information in JSON format, right? That's one of the things that we do a lot: extracting information from a picture or a document, and we want everything in JSON format. If one LLM is completely focused on both extracting and giving the JSON file, there's a good chance it might mess up, because of speed and hallucination and all that.
Aniket Singh [00:23:28]: So many things could happen that make the output not as expected. But let's say I ask multiple models to extract that information, have another model parse that information, and another model that actually ends up giving the actual result that I'm looking for, and that model is fine-tuned or trained on exactly how I want the data to be. And actually, Andrew Ng has even talked about this: it actually does improve the performance of the model by a lot. It does increase the cost, yes, but models are becoming cheaper and cheaper. So I think this is something that many startups, or anyone working on LLMs for applications, would be okay with if they're actually getting a reliable result. So my goal now is to focus on the reliability of the model and how well they can understand the task to create that workflow.
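A rough sketch of the split-and-cross-check pattern he is describing; the agent roles and prompts are assumptions, and call_llm is a placeholder for whichever model client you plug in:

```python
# Illustrative sketch of a multi-agent extraction pipeline: one agent extracts,
# a second cross-checks against the source, a third emits strict JSON.
# call_llm() is a hypothetical stand-in for whichever model/client you use.
import json

def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("plug in OpenAI, an open-source model, etc.")

def extract(document: str) -> str:
    return call_llm("Extract every relevant field from the document as plain text.", document)

def cross_check(document: str, extracted: str) -> str:
    return call_llm(
        "Compare the extracted fields to the source document. Remove anything "
        "not supported by the document and return the corrected list.",
        f"DOCUMENT:\n{document}\n\nEXTRACTED:\n{extracted}",
    )

def to_json(checked_fields: str) -> dict:
    reply = call_llm("Return the fields below as a single JSON object, nothing else.", checked_fields)
    return json.loads(reply)  # fails loudly if the formatting agent drifts from JSON

def run_pipeline(document: str) -> dict:
    # The final formatting step could also be a small model fine-tuned on the target schema.
    return to_json(cross_check(document, extract(document)))
```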
Demetrios [00:24:35]: Well, it seems like there are many ways people are trying to create agent architectures right now, because it is such a greenfield technology, right? You just get to explore completely. Have you seen different architectures that work best? Like you were just saying, if you give an agent a task and it doesn't need to know about any of the other tasks, it just has to do that one thing, that works. Or you have multiple agents cross-checking and validating, and then one final fine-tuned agent to really make sure that it comes out proper. That seems like one agent architecture that you've seen work. Have you seen other ones?
Aniket Singh [00:25:23]: So there are some libraries, I should say, that are already out there. I think CrewAI is one of them. Another one would be AutoGPT, which I believe came out a while ago, and there are a few more. I think Microsoft has its own version of it. So I have been testing some of them, and my team is also working on testing some of these to see if they will actually get the job done, or if it's actually not enough. So we are working on finding a use case that we can also publish, one that's complicated enough that it would actually make sense for multi-agent to work on, and not something like just a summary. That's very simple.
Aniket Singh [00:26:11]: So I want to work on something a bit more complicated, but at the same time, not, I think when GPT-4 came out they had TaxGPT, something that people would like to monetize, that's already there and people are working on it. Not something like that, but something that would help just in general. Something related to document parsing is what I'm looking into, but not exactly how I plan to do it; that's something we are still experimenting with and trying to understand first. Then we get to the actual application, building that application, and basically publishing how we got to that level. How did we actually achieve it? Did we have to do a lot of fine-tuning? Did we just have to create a workflow with prompts? So we need to work on that.
Demetrios [00:27:05]: Yeah. The other piece that you mentioned: getting a bunch of LLM calls and making sure to cross-validate, or getting the consensus from a few different LLMs before it moves on to the next stage. And you have pros and cons for each of these, which I'm sure you will address, where it's like, yeah, you can make ten different LLM calls before it goes to the next stage, but you have to think about price. Yes, at this moment in time; as you mentioned, maybe later we won't need to think about the price, but for now you do have to keep that in mind. The fine-tuning piece, I think, is a perfect example of when to actually fine-tune: when you want it to come out in JSON, or when you want, like, a JSON validator. That's one of the few times that fine-tuning is brilliant. I've also heard about simulation, I think, where you ask it to simulate what it thinks it should do.
Demetrios [00:28:10]: And so then it can give you, like, all right, here's me simulating the task, and you get that going on. But I'm not well versed enough in the agent architecture to really, like, fully understand all the different ways that you can leverage these agents.
Aniket Singh [00:28:36]: I'll talk a little bit about this one. I'm still trying to learn this myself, actually. We have a PhD student on our team who is amazing at these kinds of workflows especially, so I wish she were here with me to answer this. But I will try to get at this a little bit based on my understanding. Take AutoGPT, say. As far as I know, what it does is prompt itself: how would you do this? And then, based on those prompts, it creates multiple agents. Yes, this could help, but when we think of a production-level LLM, we can't be doing that every time, right? Because it's going to take a lot of trial and error to get things done. It may get things done right, but it's going to cost us a lot.
Aniket Singh [00:29:22]: But what I want to focus on with my paper is how you would do that: fine-tuning not just the LLM, but fine-tuning the creation of those agents, the creation of that workflow for your use case. Focusing on that instead of just having prompts, creating prompts, and assigning the prompts to each LLM, if that makes sense.
Demetrios [00:29:52]: Yeah. And I've heard that you have much more success when you make it very, very pointed, almost like a focused, closed system. You don't want to give it the ability to come up with whatever plan it wants. You want to really make it closed and have clear guardrails on it.
Aniket Singh [00:30:18]: Yeah, for now, that's what we are actually doing with our production LLM. But I don't think that's enough to make it even more reliable. I would say in 80% to 90% of cases it does come up with reliable output. But the problem right now is, for one thing, an OpenAI model will change, and then it just messes up everything that we have planned for. So that's what I want to be able to fix with the research that I'm going to work on. It doesn't have to be OpenAI, it could be any model. And how we can get this process done more quickly is also one of my goals.
Aniket Singh [00:31:04]: Using any open source LLM, say. Because if we are splitting tasks into multiple agents, we probably don't need GPT-4-level or GPT-4o-level agents. We could probably work just with Llama 7B. Right? I mean, Llama 7 billion.
Demetrios [00:31:24]: Sorry. Yeah, yeah. So I'm trying to wrap my head around this and think about how that would work so that you can be model agnostic. Have you found success in that? Because it feels like, with the abstraction layer, you always have to be wary of the underlying model and the capabilities that it has.
Aniket Singh [00:31:55]: We do need to test 100% before thinking, okay, I'm going to go with this model, because not all models are the same. Every model has its own capabilities. Right. So it does seem like, okay, GPT-4 is the answer to pretty much everything, except it messes up when we start to actually put it in production. And first, it's a bit expensive. It may not be that expensive for just a few tokens or queries, but when you are thinking about a production-level LLM, it starts to get very expensive. So that's why, for smaller-level tasks, let's just talk about document parsing, say: if I want to just parse a document, I don't have to have an LLM do everything. Preprocessing we could do with OpenCV and other libraries.
Aniket Singh [00:32:55]: Right. Then we can extract using TensorFlow, like some kind of pretrained model. Right. And once we extract, then we actually start to create a workflow with LLMs to get to our results. So we don't have to have a very capable model, but a very capable workflow. This agent workflow that we have needs to be fine-tuned to our actual necessity, like what I actually want out of that document, because I may not need everything out of that document. But at the same time, if I just ask, hey, GPT-4, can you get this from this document? It's not always going to give that exact result when you just prompt it to do that.
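A sketch of the pipeline shape he's pointing at, with conventional preprocessing in front of the LLM workflow; the OpenCV steps are generic, and the OCR and extraction calls are hypothetical stand-ins:

```python
# Illustrative sketch: do cheap preprocessing with OpenCV before any LLM is involved,
# then hand only the cleaned-up text to a small LLM workflow for field extraction.
import cv2

def preprocess(path: str):
    """Generic OpenCV cleanup: grayscale, denoise, binarize, ready for OCR."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def ocr_text(image) -> str:
    # Stand-in: a pretrained OCR or detection model (Tesseract, a TensorFlow model, ...)
    # would run here; no LLM is needed for this step.
    raise NotImplementedError

def extract_fields(text: str) -> dict:
    # Stand-in for the agent workflow: a modest open-source model is often enough
    # once preprocessing and OCR have done the heavy lifting.
    raise NotImplementedError

def parse_document(path: str) -> dict:
    return extract_fields(ocr_text(preprocess(path)))
```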
Demetrios [00:33:47]: So you're looking at it more in terms of pipelines and how you can create a robust pipeline as opposed to a robust prompt.
Aniket Singh [00:33:56]: Yeah, yeah. Now my focus has shifted more towards getting to that level, because I think that's where the most important results would come out of LLMs. Because if we can't use this in production, then there's almost no use. Right?
Demetrios [00:34:19]: Yeah. Even if it does do all this stuff that we do as humans, or it is as aware as we are, and it has confidence scores and knows whether it's going to get something wrong or right, that all doesn't really matter if, at the end of the day, you can't use it in production.
Aniket Singh [00:34:37]: Exactly. Exactly.
Demetrios [00:34:39]: Yeah. Well, this has been awesome, dude. I appreciate you coming on here and chatting with me, and as always, I love talking about evaluation. And so I like the different spin that you bring to it: looking at the capabilities of the models themselves, as opposed to the system or the benchmarks. And hearing about your next steps for testing agents, evaluating the agents, and looking at the agentic workflows as pipelines is super cool. Hopefully, once you've done that, you can also let us know how that goes and what some of the findings are. Once you get that paper out too, I'll make sure to read it, give a little review, and probably post about it.
Aniket Singh [00:35:27]: Thank you.