How Reinforcement Learning can Improve your Agent // Patrick Barker // Agents in Production 2025
SPEAKER

When Patrick is not occupied with building his AI company, he enjoys spending time with his wonderful kids or exploring the hills of Boulder.
SUMMARY
In this talk we will bring to light the open secret in the AI community: most agents don't work reliably. We'll explore the most common ways agents fail, highlighting how fundamental issues with the model often can't be overcome with prompting. If that's true, then why aren't we correcting those paths in the model itself? Reinforcement learning offers the most promising path to reliable agents, and designing reward signals is the future of agentic development. In the next few years we will transition from agents that are programmed deterministically to agents that are taught interactively. We don't need to be stuck in the ice age of frozen models; we can take our agents to the next level of success.
TRANSCRIPT
Patrick Barker [00:00:00]: We're going to talk about reinforcement learning for reliable agents today. It's an open secret in the industry that most AI agents just don't work right. A lot of times we see these fancy demos, and they don't really hold up in production. It depends on how much agentic behavior we're talking about: a simple agent versus a multi-step, multi-tool agent all the way out to a fully autonomous one, and they tend to get worse and worse as you go out that spectrum. It's also clear that prompting isn't enough, at least for now; maybe in the future it will be. If prompting was enough, we would have reliable agents today. But it's clearly not quite enough at this stage.
Patrick Barker [00:00:53]: So the question we've been asking is: can reinforcement learning get us to reliable agents? A little bit of background on reinforcement learning with LLMs. Originally we had RLHF, which is reinforcement learning from human feedback. This is the magic that actually gave us ChatGPT. A lot of people, when they think of reinforcement learning with LLMs, think about o1 or R1, these amazing reasoning models we have now. But reinforcement learning has actually been around for a long time, and we used it to bring about ChatGPT with something called InstructGPT, where we basically took a base LLM and taught it multi-turn conversations. The most common algorithms here are PPO and DPO. The way it works is the LLM generates multiple responses, a reviewer ranks those responses, a reward model is trained to mimic those user preferences, and then the reward model is used to train the LLM.
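For reference, a minimal sketch of that step, the reward model learning to mimic ranked preferences, using the standard pairwise Bradley-Terry loss. `reward_model` is an assumed callable that returns a scalar score tensor for a (prompt, response) pair; a ranking over several responses is typically broken into chosen/rejected pairs like this.

```python
# Pairwise (Bradley-Terry) preference loss: push the preferred response's score
# above the rejected one's. `reward_model` is an assumed callable returning a
# scalar score tensor for a (prompt, response) pair.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # score for the preferred response
    r_rejected = reward_model(prompt, rejected)  # score for the less-preferred response
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen scores well above rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```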
Patrick Barker [00:01:55]: The reward signal can come from a number of different things. You can have a binary good/bad signal, or you can have a ranked signal, where you take, say, four responses and rank them in order of preference, scoring them from 0 to N; it's really common to score 0 to 4 and give each response a different rank. And then you can also just have textual feedback, which is really interesting. We've been playing around with that a lot recently, modeling multi-turn conversations where the LLM models both the assistant and the user, and I actually think it's a really promising path forward. So what's the other type? We have RLVR, and this is the magic that gave us o1 and R1. The idea is that we have verifiable rewards instead of a reward model.
Patrick Barker [00:02:41]: The most common algorithm here is GRPO. What happens is the LLM again generates multiple responses to any given prompt, but those responses are deterministically verified. If it's a math problem, there's only one right answer, not a bunch of right answers; the model just needs to output that one right answer.
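A minimal sketch of that verifiable-reward setup in the GRPO style, assuming a math task with a single known answer; `generate_k` and `extract_answer` are illustrative placeholders, not any particular library's API.

```python
# GRPO-style sketch: sample a group of responses, verify each deterministically,
# and normalize rewards within the group. `generate_k` and `extract_answer` are
# illustrative placeholders.
import torch

def verifiable_reward(response: str, gold_answer: str) -> float:
    # Deterministic check: exactly one right answer, so the reward is 1.0 or 0.0.
    return 1.0 if extract_answer(response) == gold_answer else 0.0

def group_relative_advantages(prompt: str, gold_answer: str, k: int = 8):
    responses = generate_k(prompt, k)  # k sampled completions for the same prompt
    rewards = torch.tensor([verifiable_reward(r, gold_answer) for r in responses])
    # Each response's advantage is its reward relative to the group mean,
    # so no separate learned reward or value model is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    return responses, advantages
```

Because the reward is checked deterministically and advantages are computed relative to the group, there is no learned reward model to fool.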
Patrick Barker [00:03:02]: So we can deterministically validate whether that answer is correct, and that verified signal is what trains the LLM, rather than a reward model trained to mimic user preferences. So why does RLVR work so well? We had a lot of reinforcement learning back with ChatGPT in the form of RLHF, and it obviously continued; I don't think a lot of people were really aware of it, but it continued to evolve. Then we had this massive liftoff with RLVR, and we saw o1 and R1 doing these amazing things. So what's the magic behind RLVR? Well, it's just that it doesn't really have to deal with reward hacking.
Patrick Barker [00:03:39]: So here's an example where OpenAI used the game score as the reward, but the agent just learned it could drive in a circle and pick up power-ups every time it went around, and it would win the game, it would get the highest score. Models do this all the time. If anyone's spent time training models, you'll know that anytime they can cheat, they will cheat; they will take every opportunity they can get to cheat, and this gets into gradient descent and how that works. This is a real challenge, because most problems aren't RLVR problems. We have this thing, RLVR, that works amazingly, but it covers only a very small slice of the problems we'd want to use LLMs for.
Patrick Barker [00:04:25]: So the bigger prize is actually figuring out how we can defeat reward hacking in one form or another. One of the things we've been asking is: can reasoning defeat reward hacking? This is GAN-like: you'd have an agent trying to solve the puzzle, and a validator that's also reasoning about it, asking, hey, was this actually successful? And you train them in tandem; a rough sketch follows below. The other thing we can maybe do here is just use less data. Right now, a big reason we need these reward models is that you have to train these agents on so much data, but we as humans just don't need that much data; we just need a bit of feedback and a bit of direction, and we can piece the whole thing together. There's been a lot of work recently on learning from really minimal examples. Obviously we get some of this with prompt engineering and few-shot learning, but it's starting to reach training as well. You could potentially do this with a lot of data augmentation, with an LLM generating augmented data for your given prompt.
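A speculative sketch of that solver/validator idea. `solver` and `validator` are hypothetical model wrappers with generate/learn methods, and how the validator itself is kept honest while the two are trained in tandem is exactly the open question.

```python
# Speculative sketch of training a solver against a reasoning validator.
# `solver` and `validator` are hypothetical model wrappers; how the validator
# itself stays honest is the open research question.
def tandem_step(task: str) -> float:
    attempt = solver.generate(task)
    verdict = validator.generate(
        f"Task: {task}\nAttempted solution: {attempt}\n"
        "Reason step by step about whether the task was actually completed, "
        "then answer PASS or FAIL."
    )
    reward = 1.0 if verdict.strip().endswith("PASS") else 0.0
    solver.learn(task, attempt, reward)  # RL update on the solver from the validator's verdict
    return reward
```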
Patrick Barker [00:05:34]: The question is, will this work for agents? If we're getting into multi-turn, tool-calling agents, can RL help with that? We've already seen companies like Manus openly discuss that making agents work is more of an alignment problem than anything, which is a pretty good signal about what they're seeing internally. The intuition here is that it's way easier for an LLM to verify that an action was correct than it is to actually generate the correct action. So there's a way to leapfrog up: you train an LLM to figure out whether something was done right, and that actually works pretty well. And then, using that, we can get into the data flywheel. We talked about this a little bit in the last talk here. Once you've approximated the human signal, you can replace the human with an LLM, which is known as RLAIF, reinforcement learning from AI feedback. The LLM just sits in that loop: you generate a bunch of tasks, maybe within a skill set, and then have the LLM become the verifier instead of the human.
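A minimal sketch of that flywheel with the LLM in the verifier seat; `task_generator`, `agent`, and `verifier` are illustrative placeholders for whatever models and rollout harness you use.

```python
# RLAIF-style flywheel sketch: an LLM proposes tasks within a skill, the agent
# attempts them, and an LLM verifier (not a human) scores the attempts.
# `task_generator`, `agent`, and `verifier` are illustrative placeholders.
def flywheel_iteration(skill: str, n_tasks: int = 100):
    batch = []
    for _ in range(n_tasks):
        task = task_generator.generate(f"Propose a realistic task that exercises this skill: {skill}")
        trajectory = agent.run(task)  # multi-turn, tool-calling rollout
        verdict = verifier.generate(
            f"Task: {task}\nTrajectory: {trajectory}\n"
            "Did the agent complete the task? Answer yes or no."
        )
        reward = 1.0 if verdict.strip().lower().startswith("yes") else 0.0
        batch.append((task, trajectory, reward))
    agent.train(batch)  # RL update on the verified rollouts
    return batch
```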
Patrick Barker [00:06:47]: This gives you basically infinite data and you can just train away. The challenge here, again, is reward hacking. And the question is: how well does this work? We've been working on this for a little while now, and it definitely does work. You can pretty easily outperform most frontier models, particularly on complex tasks and tasks that are out of distribution for them. It's also cheap, fast, and reliable, which is a nice add. People who have dealt a lot with frontier models probably know they have absolutely horrendous uptime; I actually have no idea how you build something on a single frontier model. I guess that's why LiteLLM and Portkey are so popular, because the uptime is just absolutely terrible, and if you want any sort of production service, that's pretty challenging. So we have the ability to run these things locally: train up a model that's much smaller and faster, run it on our own infrastructure, and have it perform better.
Patrick Barker [00:07:49]: I think this is an exciting direction overall. Then, of course, we've also explored whether we can do interactive learning: can we teach an agent, in real time, how to do a specific task? We've created this little open source library called Dark RL, which you can see playing right now. It does the ChatGPT thing of generating multiple candidate responses; you choose one of those responses, and it actually learns in real time. It does RL training in real time: as you click that button, it's immediately trained. We developed some kernels to improve the speed of both training and inference within a single model, which I think is really exciting. We're asking the question: what if every LLM was an online LLM? What if, wherever you had a generate method, you also had a learn method you could just send data to, and the model was immediately updated? That's it.
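As a thought experiment, that "online LLM" interface might look something like the sketch below. This is an illustration of the idea only, not Dark RL's actual API.

```python
# Hypothetical shape of an "online LLM": a learn() next to every generate(), with
# weights updated immediately. An illustration of the idea, not Dark RL's actual API.
class OnlineLLM:
    def generate(self, prompt: str, n: int = 2) -> list[str]:
        """Sample n candidate responses (implementation omitted)."""
        raise NotImplementedError

    def learn(self, prompt: str, chosen: str, rejected: list[str]) -> None:
        """Run one RL update that reinforces the chosen response (implementation omitted)."""
        raise NotImplementedError

def interactive_step(llm: OnlineLLM, prompt: str, pick) -> str:
    """One interactive round: generate candidates, let the user pick, learn immediately."""
    candidates = llm.generate(prompt)
    chosen = pick(candidates)  # e.g. whichever response the user clicks in the UI
    llm.learn(prompt, chosen, [c for c in candidates if c is not chosen])
    return chosen
```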
Patrick Barker [00:08:52]: Thanks. If you want to catch up, we have this library where we're playing around with these ideas. It's called Dark RL, so come check it out. And I'm always happy to chat with anybody that's around. Thank you.
Skylar Payne [00:09:04]: Awesome. Thank you so much. Again, these lightning talks are always so dense.
Patrick Barker [00:09:10]: There's so much information.
Skylar Payne [00:09:12]: So, love it. Folks, if you have questions, feel free to drop them in the chat. We have another talk in about five minutes, so we have a little bit of time. But while we're here, is there a way people should connect with you? Should they just go to the GitHub if they have other questions and want to follow up?
Patrick Barker [00:09:32]: Yeah, the GitHub's probably the best place to keep up; I'd say that's probably the easiest right now. Feel free to follow the org or whatever. We're doing a lot of stuff in this space, so reach out from there. The GitHub will point you to a Discord and a Slack and stuff, so feel free to jump in there for a chat.
Skylar Payne [00:09:52]: Awesome. Cool. I don't see any questions in the chat, so maybe just one from me. Getting to the point where you have a reward function that's mimicking human taste, more or less: how do you think about the balance between quality and quantity? Do you spend a lot of time on high-quality data? How much data do you typically need? What does it look like?
Patrick Barker [00:10:19]: Yes. We found a couple of really interesting things here. One is that we tried to train binary reward models that just said yes or no to a task, because it was particularly easy to have a user give a thumbs up or thumbs down on a particular response. We actually had a lot of trouble training binary reward models. You really have to balance the data, because they get super biased: if your dataset has, say, 60% yeses and 40% nos, the model will just naturally drift toward saying yes 60% of the time without actually knowing what it's doing. It just gets biased in that direction.
Patrick Barker [00:10:53]: And that becomes really hard with online learning. If you're offline, you can always rebalance the dataset, but if you're trying to do this online, it becomes really, really challenging. We're always thinking of ways to do this with less data, and it's challenging; a lot of data augmentation seems possible here.
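One common offline fix for the bias described above is to weight the loss by inverse class frequency; a minimal sketch:

```python
# Weight the binary reward-model loss by inverse class frequency so a skewed
# yes/no split doesn't bias the model toward the majority label.
import torch
import torch.nn.functional as F

def balanced_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: raw scores from the reward model; labels: 1.0 = thumbs up, 0.0 = thumbs down."""
    pos_frac = labels.mean().clamp(1e-3, 1 - 1e-3)
    # Positives weighted by 1/pos_frac, negatives by 1/(1 - pos_frac), so each class
    # contributes equally to the loss regardless of how the batch is skewed.
    weights = labels / pos_frac + (1 - labels) / (1 - pos_frac)
    return F.binary_cross_entropy_with_logits(logits, labels, weight=weights)
```

Rebalancing like this assumes you can see the label distribution up front, which is exactly what gets hard in the online setting described above.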
Patrick Barker [00:11:17]: We have an intuition here, and we've sort of found that for every task an agent does, and for every specific action within that task, there are numerous other tasks in which that action would also be valid. So you can basically take a task and an action, send them to an LLM, and ask: what other tasks would this action be valid for? It turns out you can generate a lot of other tasks that way, and you can bootstrap up training with less data.
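A minimal sketch of that augmentation step; `llm.generate` stands in for whatever chat-completion call you use.

```python
# Ask an LLM which other tasks a given action would also be valid for, to bootstrap
# more training tasks from a single trajectory. `llm.generate` is a placeholder.
def augment_tasks(task: str, action: str, n: int = 10) -> list[str]:
    prompt = (
        "An agent working on the task below took the following action.\n"
        f"Task: {task}\n"
        f"Action: {action}\n"
        f"List {n} other distinct tasks for which this exact action would also be a valid step, "
        "one per line."
    )
    response = llm.generate(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]
```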
Skylar Payne [00:11:48]: Awesome, awesome. Just checking the chat real quick to see if there was another question that came in. I don't see anything. So I think with that we'll say goodbye to Patrick Barker, who I assume has no relation to Travis Barker.
Patrick Barker [00:12:04]: That's correct, unfortunately. Yeah.
Skylar Payne [00:12:08]: I'm just trying to get into that Kardashian party, you know?
Patrick Barker [00:12:10]: Yeah, yeah, I'm a pretty bad drummer too, so.
Skylar Payne [00:12:15]: Awesome. Take care. Thank you so much for coming. Thank you.
Patrick Barker [00:12:18]: Cheers.
