MLOps Community

How Autonomous Agents Can Help You Get LLMs Ready for Production

Posted Mar 18, 2024 | Views 725
# Autonomous Agents
# LLMs
# NatWest Groups
# natwestgroup.com
speaker
Chris Booth
Product Owner - Machine Learning @ NatWest Group

Chris Booth is the product owner for machine learning and innovation for Cora, NatWest Group's AI agent. His core focus is observing emerging technologies, creating a roadmap of experiments to bring into production, and delivering value to the group.

SUMMARY

Chris Booth shared insights into using autonomous agents to get large language models (LLMs) ready for production. Drawing on his experience as a Product Owner for machine learning, Chris highlighted the role of autonomous agents in streamlining processes and demonstrated their capabilities with a financial data extraction demo. He emphasized the challenges of reasoning, latency, and explainability, and advocated techniques such as chain prompting and more capable models. Chris also called for open-source collaboration to fine-tune LLMs and integrate knowledge graphs for better performance and reliability.

TRANSCRIPT

Chris Booth [00:00:00]: Okay, I'm just going to ask for a show of hands first, just to see what level I need to pitch this at and who the audience is. So if I say LLMs, is there anyone who doesn't know what that is? All right, good. Okay, quantization, we've covered that. All right, good, I can talk freely. RAG: anyone who doesn't know RAG? All right, there's one. Fine, two. So we're going to talk about autonomous agents.

Chris Booth [00:00:23]: Who doesn't know autonomous agents? Okay, a few. Great. So we're going to cover really briefly what Jarvis is, what autonomous agents are, and how they work. Then, from my perspective, one of my responsibilities as product owner is trying to get LLMs into production, so I'll cover how I'm seeing agents kill two birds with one stone by providing more functionality while also solving a lot of production issues. About me: this picture needs to be updated. I'm much older and more haggard now. Product owner, machine learning: what does that mean? I sit in between product management and ML. I manage two product managers who have two squads themselves, working on ML projects, and my specific specialty is natural language processing, which I've been doing since 2015. I've worked with most FTSE 250 companies, and I work mostly on natural language processing and what used to be chatbots, which used to be the annoying things that didn't work, but now everyone wants one. So I'm product owner

Chris Booth [00:01:32]: for Cora. Cora is the bank's chatbot, or now virtual assistant, agent, insert your buzzword here. And it's probably one of the more impressive and performant chatbots in Europe. I think it's rated first or second in terms of performance and functionality. So it's 15 million customers a day. So we know production. It supports six different brands with various conversational dynamics and branding, and it's still relatively basic functionality, like smart card queries. You can manage your roles and users there, and you can open and close accounts.

Chris Booth [00:02:12]: But to the point of autonomous agents, we're now seeing some really exciting opportunities to help customers, in retail and internally too. You name it, we're looking at it. Hands up if you don't know who this is or what that is. First time ever I've had an audience where nobody has had to raise a hand. Great. Okay. For the real nerds in the room: who knew that Jarvis stood for "Just A Rather Very Intelligent System"? That's why autonomous agents are so popular, and it's clearly resonating.

Chris Booth [00:02:41]: So TensorFlow is the other line there, since ML started becoming popular in 2016, and AutoGPT, in red, has clearly resonated with everyone. We all want our own Jarvis, and the best way I like to think about these agents is what I call robotic process automation 2.0. And we've got a live demo for you. Kind of, yeah. Here's one I made earlier. This is Evo, which is currently the top-performing agent on AutoGPT. They've done an amazing job of creating a kind of leaderboard and getting these agents to compete, similar to the Hugging Face example Luke gave earlier.

Chris Booth [00:03:24]: And they compete on how many tasks, and what range of tasks, they can carry out. So I think we can all agree this prompt I gave here is a pretty complex request: extract these documents. These documents are two filings; they're just filings investors have to make to report what trades they make. These in particular are Nancy Pelosi's, who's a big figure in the US Congress. She somehow magically beats the S&P 500 by 60% a year. How does she do that? I'm out to copy her trades, because she knows something I don't.

Chris Booth [00:04:02]: So these are her documents, just two of them to keep it simple. I'm asking Evo here, which is an agent, to extract the trades, the dates, all the information I want, and then write a text file and a CSV with all the information. This, if I was doing it myself, would probably take me a couple of hours. Evo.ninja here did it in a minute and a half, two minutes. So there you go. It's taken out all the assets and the transaction types. So you've got your CSV file there, and it's even given me a nice little summary of whether it would match my investment preferences as a high-risk, long-term investor. And I think that's absolutely mind-boggling.

Chris Booth [00:04:41]: Mind-boggling, how fantastic that is. And this is just one use case. Luke gave a nice range of potential other applications too. If we have time, I'll let you guys fire off some requests and see how it does. How do these help us get ready for production? We are very serious at NatWest about getting these agents into production, just because of the amount of benefit we can bring customers, and I'll touch on that in a minute. But this is my good friend and colleague Ewan, a machine learning lead, and we managed to get Mistral into our security perimeter to start testing locally, sorry, within our environment, in two days, which, as anyone who works in enterprise knows, is warp speed, as he says. And what we're going to cover here is just a few frameworks and examples we've given to our governance and risk teams to help them understand the technology, de-risk it, and, yeah, put it to use. So one of our goals in Cora, and I'm really passionate about this personally, is: what if we could provide financial coaching to every customer? Because I don't know about you, but I'm terrible at my finances and I've not made good investments.

Chris Booth [00:05:48]: So I think we can do a lot more to help our customers understand finance. Kind of get rid of the terminology and the complexity behind it, and hopefully make us all a little bit wealthier, or, in the worst case, get people out of debt and out of worse financial situations. The problem with using language models, and this would be a nightmare, would be Cora starting to tell them to buy Bitcoin, regardless of your opinion of it. I like your disclaimer at the bottom. Just to be very clear, we're not saying buy Bitcoin. So what I'm going to run you guys through is something I learned in my aircraft engineering days, which is a couple of risk management techniques. You can imagine using language models for use cases ranging from the far left, being safe, all the way to the right, being super risky.

Chris Booth [00:06:42]: So far left would be things like using them for internal Q&A and some simple use cases. Generating images in the internal environment shouldn't be any problem. If I could, I would have three more slides with an arrow going along here. Using LLMs for financial guidance: when I first proposed it, I might as well have been kicked out of the room. But now we're treating it seriously, and I'll show you why. There are a lot of challenges ahead.

Chris Booth [00:07:01]: These are some of the things I'm trying to wrangle at the moment. We don't have time for all of this, sadly, so we're just going to cover these four points: hallucinations, reasoning, latency, and explainability and repeatability. Who's not familiar with the Swiss cheese risk management model? Yeah, there we go, finally. It's very simple, really. You put the risk factors you're worried about on this tangent here, and there's your risk outcome.

Chris Booth [00:07:29]: The plane crashes; the bank suggests Bitcoin. Really bad. And what you do is you just put in slices of Swiss cheese, which are the cheese slices with holes in them, and you put them in as barriers. And the theory is that no single measure of defense that you put in will be 100% guaranteed to work. So, especially in aircraft engineering, we had a quadruplex system, which is a fancy way of saying you need to put four redundancies in place for it to be deemed reliably safe to fly. And even with language models, let's say you somehow fine-tuned a model to get to 99.99% accuracy. That 0.01% over a million users is still not good enough. So we're going to need multiple layers of that to make sure we do a good job. Ergo, model performance: as impressive as GPT-4 is, 80% accuracy is not good enough for Bitcoin.
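
The layered-defense arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the bank's risk model: it assumes the layers fail independently (real layers often fail in correlated ways, so treat the result as a best case), and the function names are hypothetical. The figures reuse the talk's own numbers: a million users, a 99.99%-accurate layer, and an 80%-accurate model, stacked four deep in the spirit of the quadruplex example.

```python
from math import prod

def residual_failure_rate(layer_failure_rates):
    """Chance that a bad output gets through every layer.

    Assumes the layers fail independently, which is an optimistic
    simplification made only for this sketch.
    """
    return prod(layer_failure_rates)

def expected_failures(users, layer_failure_rates):
    """Expected number of users who see a bad output."""
    return users * residual_failure_rate(layer_failure_rates)

USERS = 1_000_000

# A single layer that is 99.99% accurate fails 0.01% of the time.
one_strong_layer = expected_failures(USERS, [0.0001])

# A single layer that is 80% accurate fails one time in five.
one_weak_layer = expected_failures(USERS, [0.2])

# Four independent 80%-accurate layers: 0.2 ** 4 residual failure rate.
four_weak_layers = expected_failures(USERS, [0.2] * 4)

for label, value in [
    ("one 99.99% layer", one_strong_layer),
    ("one 80% layer", one_weak_layer),
    ("four 80% layers", four_weak_layers),
]:
    print(f"{label}: about {value:,.0f} bad outcomes per {USERS:,} users")
```

Even the best case here leaves bad outcomes on the table, which is the argument for stacking several different kinds of barrier rather than relying on model accuracy alone.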

Chris Booth [00:08:21]: That's one in five saying that you should buy Bitcoin. Not good. So the first step to agents is chain prompting. Who's not familiar with chain of thought, chain prompting, prompting techniques? Cool. So I'll go through the basics. This is the ChatGPT API. You put your input in, it gives you an output, as Luke very well explained earlier. Then you've got few-shot prompting, which is a prompt.

Chris Booth [00:08:48]: Sorry, this is just a prompt: "Answer the question like a pirate." Few-shot is when you give a few examples to kind of do some fine-tuning on the go. And now it'll start being more specific about its output: "Arr, me hearties." Again, we're not training Cora to do this, by the way. Then comes chain of thought. So there's this paper that was released, I'll show it in a minute. And essentially, as part of the prompt, you just get the model to say, "Let's think about this step by step." So, much in the same way that you'd take a maths test and be asked to explain your reasoning and show your working out, you get the model to do the same.
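
The three prompting styles described here differ only in the text sent to the model, so they can be sketched as plain string builders. This is a minimal sketch with no model call: the function names and the example questions are made up for illustration, and the exact wording of each prompt is an assumption rather than what was on the slides.

```python
INSTRUCTION = "Answer the question like a pirate."

def basic_prompt(question):
    """A single instruction plus the question, with no examples."""
    return f"{INSTRUCTION}\nQ: {question}\nA:"

def few_shot_prompt(question, examples):
    """Prepend a few worked question/answer pairs to steer the output style."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{INSTRUCTION}\n{shots}\nQ: {question}\nA:"

def chain_of_thought_prompt(question):
    """Ask the model to show its working before giving an answer."""
    return f"Q: {question}\nA: Let's think step by step."

# Hypothetical examples, invented for this sketch.
EXAMPLES = [
    ("What is a savings account?", "Arr, it be a chest where yer gold earns interest."),
    ("What is a debit card?", "Arr, it be a plank o' plastic that spends yer own doubloons."),
]

print(basic_prompt("What is an overdraft?"))
print(few_shot_prompt("What is an overdraft?", EXAMPLES))
print(chain_of_thought_prompt("If I save 50 pounds a month, how much do I have after a year?"))
```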

Chris Booth [00:09:22]: And lo and behold, this is quite old now, relatively, but it causes performance to absolutely leap up by, what's that, a factor of two, three? Just by getting the chain of thought. So this was the first breakthrough, and this is where people started clocking on: maybe, by having a chain of thought, you can get the model to carry out tasks. Then came a paper called Reflexion. I'm just going to call it "reflect" because I can't talk. So you've got the chain-of-thought prompt in the top left there.

Chris Booth [00:09:51]: And then what these researchers did was get it to provide multiple outputs of varying types, and then you can weight across them. There's a couple of ways you can do this, but basically you get it to compare itself against its own output and analyze itself. And lo and behold, performance jumps dramatically again, just with the same model. No fine-tuning, no RAG, just some simple prompts and some loops in the system, basically. Okay, cool. We can add some clever prompting, but as we saw, performance is still questionably in the 80 to 90% range, depending on the topic and depending on whether you rely on the benchmarks. Not quite good enough. So we've got that layer, but the holes still let things through.
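
One simple way to "weight across" several outputs is a majority vote over independently sampled answers. The sketch below shows that idea only; it is not the method of the paper mentioned in the talk, and `fake_sampler` is a hypothetical stand-in for repeated model calls so the example runs without an API.

```python
from collections import Counter

def vote(answers):
    """Return the most common answer and the share of samples that agreed."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(answers)

def answer_with_voting(sample_fn, prompt, n_samples=5):
    """Sample the model several times on the same prompt and vote on the result."""
    answers = [sample_fn(prompt, i) for i in range(n_samples)]
    return vote(answers)

def fake_sampler(prompt, i):
    """Hypothetical stand-in for a model call: canned, slightly noisy answers."""
    canned = ["42", "42", "41", "42", "40"]
    return canned[i % len(canned)]

answer, agreement = answer_with_voting(fake_sampler, "What is 6 * 7?")
print(answer, agreement)
```

The agreement share is a cheap confidence signal: a low value could route the request to a human or to another layer of checks.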

Chris Booth [00:10:36]: So what else can we do? Luke alluded to it earlier. We can have something called ReAct, reasoning and acting. So you've got our two clever prompts now. Now comes a third. We've got RAG, retrieval-augmented generation, and I'll touch on that briefly. And then your tools and APIs. So these are the steps, right? This kind of goes in a loop and uses different tools. So Evo makes a prediction: it just feeds the prompt in and tries to predict what step it should take.

Chris Booth [00:11:09]: It selects an agent or a tool. The tool here is the synthesizer. And it's gone and synthesized a couple of files together, and then it's gone through that loop again, done it again, and so on and so forth. See if it does a different step somewhere. And like Luke said, what's great about retrieval-augmented generation is that it reduces the hallucination risk by feeding the model ground truths and reliable sources of information. And then APIs. You can do web scraping. I'm sure we've all used ChatGPT now, where it can use DALL-E 3 and other tools.
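
The loop described here (predict the next step, pick a tool, run it, feed the result back in, repeat) can be sketched as follows. This is a minimal sketch under stated assumptions, not Evo's or AutoGPT's implementation: `scripted_policy` is a hypothetical stand-in for the language model's next-step prediction, `search_docs` is a toy keyword retriever standing in for RAG, and the documents are invented.

```python
def search_docs(query, docs):
    """Toy retrieval: return documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [d for d in docs if words & set(d.lower().split())]

def run_agent(task, policy, tools, max_steps=5):
    """Reason-then-act loop: the policy chooses a tool, the loop runs it."""
    trace = []  # (thought, tool, observation) triples, kept for later review
    for _ in range(max_steps):
        thought, tool_name, tool_input = policy(task, trace)
        if tool_name == "finish":
            trace.append((thought, "finish", tool_input))
            return tool_input, trace
        observation = tools[tool_name](tool_input)
        trace.append((thought, tool_name, observation))
    return None, trace  # gave up after max_steps

# Invented documents standing in for a retrieval source.
DOCS = [
    "Filing A lists a purchase of ACME shares",
    "Filing B lists a sale of Globex shares",
]

def scripted_policy(task, trace):
    """Hypothetical stand-in for the model: retrieve first, then answer."""
    if not trace:
        return ("I need the source text first", "search", "purchase")
    last_observation = trace[-1][2]
    return ("I have enough to answer", "finish", f"Found: {last_observation[0]}")

tools = {"search": lambda query: search_docs(query, DOCS)}
answer, trace = run_agent("What was purchased?", scripted_policy, tools)
print(answer)
for step in trace:
    print(step)
```

Because the answer is built from the retrieved text rather than from the model's memory, the retrieval step grounds the output, and the recorded trace shows which thought led to which tool call.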

Chris Booth [00:11:44]: But what's clever about that is that this also increases performance. It's kind of an emergent effect, basically. So, okay, let's put ReAct in place. This kind of solves explainability. And so when we ran this past the risk and governance teams, they were actually quite happy with it, which surprised us. Yes, you can't explain what the weights are doing in the model, not without some really clever data science that I don't know of, but because it provides this chain of thought, its reasoning, that's kind of enough of an audit trail for them to be happy with audits, reviews, and getting it to improve its performance. But again, not good enough. We still need more model performance.

Chris Booth [00:12:25]: We've got latency and reasoning issues too. I love this diagram. This is HuggingGPT, or JARVIS, which Microsoft released, and it kind of explains step by step how these agents generally work. Step one, task planning. So you saw with Evo, these agents generally try to predict what tasks to do. It will store these tasks somewhere and go, "Right, step one is this, step two is this."

Chris Booth [00:12:50]: It'll store them. HuggingGPT here selects the model from the Hugging Face repository. But replace model selection with tools, with actions, with writing, whatever. It executes in a loop and then outputs your response. I'll pause there. Does this make sense to everyone? Any questions up until now? Or is it all clear?
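
The staged flow in the diagram (plan the tasks, store them, select something to run each one, execute, then respond) can be sketched as a small pipeline in which the whole plan is written down before anything runs. This is a minimal sketch, not HuggingGPT's code: `plan` splits the request on the word "then" as a stand-in for the model's task planning, and the `REGISTRY` of handlers is hypothetical.

```python
from collections import deque

def plan(request):
    """Stage 1, task planning: a stand-in that splits the request on ' then '."""
    return deque(step.strip() for step in request.split(" then "))

# Hypothetical registry: swap these for models, tools, or API calls.
REGISTRY = {
    "extract": lambda task: f"extracted({task})",
    "summarize": lambda task: f"summary({task})",
}

def select(task):
    """Stage 2, selection: choose a handler from the first word of the task."""
    verb = task.split()[0]
    return REGISTRY[verb]

def run(request):
    """Stages 3 and 4: execute the stored tasks in order, then build the response."""
    tasks = plan(request)
    results = []
    while tasks:
        task = tasks.popleft()
        handler = select(task)
        results.append(handler(task))
    return " | ".join(results)

response = run("extract trades then summarize trades")
print(response)
```

Storing the plan up front makes it something a reviewer can read and approve before any step is executed.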

Guest1 [00:13:12]: Yeah. For chain of thought, is there any way that you try to actually evaluate the steps?

Chris Booth [00:13:20]: The steps of it?

Guest1 [00:13:22]: So with chain of thought, you generate a step-by-step.

Chris Booth [00:13:26]: I think.

Guest1 [00:13:26]: Right. Is there any way that you try to actually check each step of it, whether it makes sense or not?

Chris Booth [00:13:33]: Spot-on question. And yeah, that's what we're doing. And most interestingly, who hasn't heard of the Q* leaks from OpenAI? I'm going to go on a little diversion here. So Q* is this leak from OpenAI. The problem with language models, by their very nature as autoregressive models, is that you can't use reinforcement learning on them. And reinforcement learning is great for scalability, for not worrying about data. AlphaGo and all these models massively outperformed anything else because of reinforcement learning. But by the very nature of language, there's no reward function. There's no clear "go after this."

Chris Booth [00:14:14]: For language models, chain of thought provides that structure and the reward. You can reward how well it answers something. You can reward those steps. And so the rumor is that OpenAI found a way to add a reward function and a reinforcement learning loop onto the chain of thought, and performance went through the roof. So we are not looking into that yet, but we've been manually, by hand, reviewing the chain-of-thought outputs. And, as I might reveal in a minute, which Luke might help us with, yeah, you can create a dataset to fine-tune the model on the chain of thought. I've kind of given it away here, but can anyone guess what the problem with going back to the language model over and over might be? So the problem we've got is latency.

Chris Booth [00:15:05]: Even the small ones aren't fast in the grand scheme of data engineering architecture, right? There's a really good article written by Chip Huyen. Highly recommend it. She deep-dives into MLOps and the problems with latency. And the issue is not the tokens being put into the encoder; it's the decoding that takes time, is the long and short of it, by a factor of three-ish, two even.
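
The latency point can be made concrete with a back-of-the-envelope model: input tokens are processed in parallel, output tokens are generated one at a time, and an agent pays that cost once per trip around the loop. The sketch below is illustrative only. Every number in it is a placeholder chosen for round arithmetic, not a measurement of any model, and the function names are hypothetical.

```python
def call_latency_s(input_tokens, output_tokens, s_per_input_token, s_per_output_token):
    """Rough latency of one model call: reading the input plus generating the output."""
    return input_tokens * s_per_input_token + output_tokens * s_per_output_token

def agent_latency_s(n_calls, **call_kwargs):
    """An agent that loops n_calls times pays the per-call latency each time."""
    return n_calls * call_latency_s(**call_kwargs)

# Placeholder figures, assumed purely for illustration.
ASSUMED = dict(
    input_tokens=1000,
    output_tokens=200,
    s_per_input_token=0.0005,  # input side: cheap per token
    s_per_output_token=0.02,   # output side: generated sequentially
)

single_call = call_latency_s(**ASSUMED)
five_step_agent = agent_latency_s(5, **ASSUMED)

print(f"one call: {single_call:.1f} s")
print(f"five-step agent: {five_step_agent:.1f} s")
```

With these placeholders, five times fewer output tokens still account for most of the per-call time, and chaining several calls multiplies the total, which is why reflection loops and agent steps are costly at the scale of millions of customers.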

Chris Booth [00:15:33]: So, yeah. Someone asked a really good question about the trifecta of ways you could make a language model more reliable. And one of the problems with large token sizes, even with all the recent developments, is just the inference time. That's not going to suit 15 million customers. It's not going to be viable. I'll skip one solution, which is that you shrink the models to make them faster. Luke's already covered this, so I'll pass on this.

Chris Booth [00:15:59]: So you're making them smaller. Great. Okay, making them smaller. That's wrong, it should be model performance. It doesn't really solve latency, doesn't solve anything. So we can also fine-tune them.

Chris Booth [00:16:14]: This is "Textbooks Are All You Need", which is the approach behind Phi-2, released by Microsoft. Again, Luke touched on it briefly, but considering the size of the model, 2.7 billion parameters, the performance is absolutely phenomenal. And the hypothesis is that, as the trend keeps going, we should start solving the latency issue. At least we believe so, anyway. But even if we put all these systems in place, as we saw just a second ago, there's one big problem, which is reasoning. These language models, by their very nature, are never going to be that good at reasoning. So we've got a bit of an issue there. This was released recently in Google's Gemini paper. Web agents is probably one of the closest well-recognized benchmarks at the moment, and GPT-4 only scores 15, and on reasoning, 83.

Chris Booth [00:17:02]: But again, we can debate what reasoning actually means. So, to the point, we've still got a long way to go to get these agents to function reliably over large task areas. So my prediction, which was just confirmed today, is that this year is going to be about large action models. There have just been releases about a new product called Rabbit, which is basically a handheld language model, but they fine-tuned it on large actions. So, getting the model to be very good at knowing what actions it takes to complete a task based on your Linux or based on your apps. And this is why I think it's the last step. I think if we can crack reasoning, then these models will be ready for production, and it will finally give you options. Not advice, whatever good advice is, just options.

Chris Booth [00:17:55]: However, some issues will still persist, even if we manage all this. How can we clarify our knowledge of the user when intent confidence is low? That's a complex way of saying that if I ask a language model for a movie recommendation, it will just blurt out a random response, not really personalized to me. Whereas if I ask one of my friends, like Luke, and he knows I like Marvel, he would probably go, "Oh, you like Marvel? You should watch..." You shouldn't, because they're all crap now. But he'd recommend based on my preferences, right? How do we achieve that if we want to provide a really personalized experience? How can we easily structure our data? How can we draw on a whole data lake at speed, because we want to be able to do RAG at scale? And how can we achieve not just explainability, but traceability? Anyone want to take a guess? MLOps guys, keep quiet. Any guesses for what technology stack might be able to solve all this? Knowledge graphs. And this is the nerdiest way to flex on this: this is my Obsidian data store of my knowledge of language models and knowledge graphs. However, that will have to wait for another day.
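
The movie example hints at why a graph helps: preferences are stored as explicit facts, and every recommendation can carry the path of facts that produced it. The sketch below is a toy illustration using a plain list of triples, not a real graph database and not the speaker's architecture; all names and facts in it are invented.

```python
# (subject, predicate, object) facts, invented for this sketch.
TRIPLES = [
    ("chris", "likes", "marvel"),
    ("iron_man", "belongs_to", "marvel"),
    ("thor", "belongs_to", "marvel"),
    ("heat", "belongs_to", "crime"),
]

def objects(subject, predicate, triples):
    """Everything the subject points to via the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects(predicate, obj, triples):
    """Everything that points to the object via the given predicate."""
    return [s for s, p, o in triples if p == predicate and o == obj]

def recommend(user, triples):
    """Recommend films from franchises the user likes, with the supporting facts."""
    recommendations = []
    for franchise in objects(user, "likes", triples):
        for film in subjects("belongs_to", franchise, triples):
            path = [(user, "likes", franchise), (film, "belongs_to", franchise)]
            recommendations.append((film, path))
    return recommendations

for film, path in recommend("chris", TRIPLES):
    print(film, "because", path)
```

The returned path is what gives traceability: a reviewer can see exactly which stored facts led to each suggestion, and a user with no stored preferences gets no recommendation rather than a random one.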

Chris Booth [00:19:05]: I'm afraid you'll just have to invite me again to talk about that, because we're running out of time. So, to wrap up, I'm calling for open-source contributors. I've got two ambitions at the moment. I'd love to develop, or fine-tune, my own mixture-of-agents model, and fine-tune it using QLoRA and Phi-2-style techniques. It's not really been done, as far as I'm aware, aside from that closed-source one. So I want to get Bristol into the limelight. I think we can do it. We've got a few little musketeers on the go, so do please get in touch if you're interested in learning about agents, language models, or graphs, or just to help with the code base.

Chris Booth [00:19:48]: What are we building? This monstrosity. This is very similar. I won't go into detail right now because it's on the repo, but this is probably one of the more complex and leading architectures that I know of as an expert in the field, built by myself, and it uses multiple agents to carry out different tasks, much like we saw in the demo. However, the key difference, which I've not seen elsewhere yet, is that it uses a knowledge graph to collect that kind of memory and those personal details, and infer on that. And I'm happy to go into detail another day. Here are some repos. I highly recommend InferGPT. That's the open-source one I'm running. And AutoGPT.

Chris Booth [00:20:30]: Definitely check that out. And GPT Engineer is also really, really good. I managed to build a full app in about 30 minutes, when it would normally take me forever, literally, because I'm bad at JavaScript. If you want to talk more, get in touch about the repo, whatever you want. If you want my slides, just connect with me on LinkedIn. I'll leave it up there for a minute for questions. Thank you.
