MLOps Community

Applying the ideas of R1 to Computer Use // Patrick Barker // Agent Hour

Posted Feb 06, 2025 | Views 90
Patrick Barker
CTO @ Kentauros AI

When Patrick is not occupied with building his AI company, he enjoys spending time with his wonderful kids or exploring the hills of Boulder.

SUMMARY

R1 > computer? In this talk, we will explore applying the ideas of R1-style RL fine-tuning to computer use.

TRANSCRIPT

Demetrios [00:00:00]: I'm excited for your talk. I know that you've been doing some of the most advanced stuff out of anybody that I know when it comes to just thinking through agents and how really like the computer use agents and the web browsing agents work. I'll hand it over to you, man, and then we'll be back and we can have this roundtable in a bit afterwards.

Patrick Barker [00:00:23]: Cool. It sounds good, man.

Demetrios [00:00:25]: All right.

Patrick Barker [00:00:27]: All right. So we were talking about R1 computer use today. A little bit about me: I'm the CTO of Kentauros AI. We focus heavily on GUI automation, specifically a little bit beyond browser automation, where we're just doing multimodal models that operate directly off of a keyboard and a mouse. And we've been working on this for about a year, so it's been a really fun project to work on, and I think we're coming along the way to some things that are pretty exciting. So, you know, R1 should need no introduction, obviously.

Patrick Barker [00:01:03]: You know, I think everyone's heard of it in the last couple of weeks. You know, it's a tremendous success: using sort of hard deterministic verifiers and a GRPO policy loss, we can induce reasoning relatively cheaply, which is super exciting. Obviously computer use as well, which I'm sure people have been hearing about. The most recent one was OpenAI's Operator, but yeah, there's a long history that led up to this. Obviously we had CLIP, which is the backbone of a lot of vision language models today. Then we had Flamingo, which was really the first open question answering vision language model. Then we had Molmo this last year, which a lot of people actually don't know about. But AI2 is doing really incredible work.
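
To make the GRPO piece concrete, here is a minimal sketch (illustrative only, not Kentauros' or DeepSeek's actual code) of the group-relative advantage GRPO uses: sample several completions for the same prompt, score each with a hard verifier, and normalize each reward against its own group.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: score each sampled completion relative to the
    mean and standard deviation of its own group for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# e.g. four completions for one prompt, scored by a hard verifier (1 = correct):
print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))  # [-1.0, 1.0, 1.0, -1.0]
```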

Patrick Barker [00:01:47]: The cool thing about Molmo is the full training recipe is available. They open sourced their data sets as well, which I wish more people would do. But they were really the first people to release a grounding coordinates data set of about a million examples, which is really exciting. This is just super critical because vision language models, pretty much all of them, are not trained on this data at all. If you're telling it, like, hey, I need to click on something on a screen, that data was just horribly out of distribution for a really long time, and there was no large data set. We at Kentauros actually collected a pretty large data set of about 80,000 examples, and it worked fairly well. But then Molmo was released a couple months later and they have a million data points, which is amazing.

Patrick Barker [00:02:35]: So Molmo was really a big release for the community that I think is kind of under-recognized in a lot of ways. And then yeah, about a month later Claude, you know, came out with computer use. That was sort of the first frontier model to do that. And then, you know, last month we had Operator as well. These are all really exciting developments. What I'd say is, you know, none of them are reliable yet. So this is definitely a problem we haven't cracked yet.

Patrick Barker [00:03:00]: It's getting better. But you know, we only see roughly 60 to maybe 80% reliability in general, you know, for any sort of medium-grade task. So yeah, we can highlight sort of where computer use struggles overall. Clicking: like I just said, historically, you know, vision language models were horribly out of distribution for this. Recently, the Molmo Pixmo data set is probably the best one on the market right now, so this problem is getting better; we're seeing a lot more click models that are quite reliable. Reasoning: this is still out of distribution for almost all VLMs, specifically for computer use.

Patrick Barker [00:03:44]: I don't know of actually any data sets. I think Kentauros has a very small one we've started and we're growing, but I think that's the only one in the market that I'm aware of currently. And then recovery: what I mean by this is, if the agent fails, which they're going to fail frequently, you really need to train it to recover from failed states. This is critical. I think one of the first papers that really talked about this was the π0 paper, which is a robotics paper, but they really showed that training on recovery is as important as training on a perfect run. But the irony is, in the ecosystem there's almost no data sets that do this. Every data set is golden runs, and it's just a series of golden runs.

Patrick Barker [00:04:29]: And really what you actually want is about half and half. So you want half the data to just be a perfect run through the task on the computer, and the other half to actually have a bunch of errors that the model has to learn to correct from. And then reliability: overall they're just not reliable, like I just kind of talked about a little bit. We're seeing, you know, really 80% max on any medium-grade task. On harder tasks, like multi-app tasks and stuff like that, it's much, much worse; we're seeing about 10%. And OSWorld is currently the top benchmark kind of in this space. It's really, really poorly put together. So we're actually very close this month to releasing a new benchmark we're calling OS Universe.
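
As a rough illustration of that half-golden, half-recovery mix (hypothetical structures, not the actual Kentauros pipeline), one way to assemble such a training set:

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list          # sequence of (observation, action) pairs
    had_failure: bool    # True if the run contains a mistake the agent then corrects

def build_training_mix(golden: list, recovery: list, size: int, seed: int = 0) -> list:
    """Sample roughly 50% clean runs and 50% runs that recover from errors."""
    rng = random.Random(seed)
    half = size // 2
    mix = (rng.sample(golden, min(half, len(golden))) +
           rng.sample(recovery, min(size - half, len(recovery))))
    rng.shuffle(mix)
    return mix
```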

Patrick Barker [00:05:11]: So that should be out soon. So yeah, I got to wondering, obviously, as R1 was released, and we've been sort of playing with the reasoning stuff for a while: could we combine the R1-style reasoning with computer use? And could we solve some of these problems leveraging the work that they did? The big challenge of this is that R1 relies very heavily on hard verification for the reward models. And what that means is, like, if I do a math problem, it has one answer and I can verify that immediately, and I can deterministically know if it was right or wrong. And why this is so important in the current state of the world is that if you use a non-deterministic method and use a model for the reward, the actor model will generally find a way to hack the reward model. And it's just kind of this funny thing we've been dealing with in reinforcement learning for a very long time. The reason why hard verification works so well is it really forces the model to create a true representation in its latent space of the reasoning. This is the primary challenge of trying to apply that idea to GUI data.
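
For contrast, here is what a hard verifier looks like in the math case, plus a note on where a learned reward model would plug in instead; this is an illustrative sketch, not the R1 implementation.

```python
import re

def verify_math_answer(completion: str, gold: str) -> float:
    """Hard, deterministic verifier: compare the last number in the completion
    to the gold answer. Right or wrong is known immediately."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold else 0.0

# A learned (non-deterministic) reward would instead be something like:
#   score = reward_model(observation, action)   # an approximate, learned score
# which the actor can learn to exploit ("reward hacking"), because the score is
# not grounded in a ground-truth check the way verify_math_answer is.

print(verify_math_answer("...so the final answer is 42", "42"))  # 1.0
print(verify_math_answer("...so the final answer is 41", "42"))  # 0.0
```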

Patrick Barker [00:06:26]: This is because we really just don't have hard verification at scale for GUI data. Like, you can go through and do some hard verification tasks for certain types of things, but it's fairly limited and it's really cumbersome to collect that data. So there's really just not a lot of this data that's out there. And so we got wondering, you know, could we do reasoning reward models? And what I mean by this is, at the top you have an actor, and the actor is doing sort of a ReAct-style loop: it's going to observe, it's going to think, and it's going to act. And then you have the reward model, and the reward model does something very similar. So it would observe, it would think, and then it would reward.
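
A minimal sketch of those two loops (the interfaces and names here are assumptions for illustration, not the Kentauros implementation): the actor observes, thinks, and acts; the reward model observes the same step, thinks, and emits a score.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    observation: str   # e.g. a screenshot description or accessibility tree
    reasoning: str     # the actor's chain of thought for this step
    action: str        # e.g. 'click(412, 233)' or 'type("hello")'
    reward: float      # score assigned by the reasoning reward model

def run_episode(observe: Callable[[], str],
                actor: Callable[[str], tuple[str, str]],
                reward_model: Callable[[str, str, str], tuple[str, float]],
                max_steps: int = 10) -> list[Step]:
    """ReAct-style actor loop with a reasoning reward model scoring each step."""
    trajectory = []
    for _ in range(max_steps):
        obs = observe()                                        # actor: observe
        thought, action = actor(obs)                           # actor: think, act
        critique, score = reward_model(obs, thought, action)   # RM: observe, think, reward
        trajectory.append(Step(obs, thought, action, score))
    return trajectory
```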

Patrick Barker [00:07:16]: And so the question becomes, how do we train a reasoner that is potentially capable of doing this? So we've developed sort of a platform, which you can see down here, that allows you to review agentic actions as they take place. And our hope here is that we can scale enough human-labeled data to train a reliable reward model, or at least the start of that. And the exciting part is, if that is possible, then the agent would be able to learn autonomously, which is very exciting. So we are rapidly heading towards this goal. One challenge we've immediately faced is it's a lot slower. So when you add the reasoning data to, say, the actor: they're already quite slow, so you add more reasoning data and they get even slower. So the thing we've been playing with now that seems to be working is you train once with reasoning, and then you train just another example, same model and everything, but without the reasoning.

Patrick Barker [00:08:24]: And the idea here is that it'll create a shared latent representation. It's almost as though the reasoning, when you train on it, is guidance for the attention in a way. And then once the attention's sort of been learned in that format, you can usually get away without the reasoning. Now, you know, the version with reasoning continues to work a bit better, but you'll see that the version without the reasoning gets significantly better. So this is a pretty promising approach. I haven't seen a lot of people doing this yet, so I think it's fairly novel. And then, where are we in the process of doing this? We are currently collecting this base training set of reasoning data. We've started training some base models for this with a limited set.
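
One way to read that two-pass idea (an illustrative sketch only; the actual recipe isn't spelled out in the talk) is to build two supervised targets from the same step, one with the reasoning trace and one with it stripped, and train on the reasoned version first.

```python
def make_targets(observation: str, reasoning: str, action: str) -> tuple[dict, dict]:
    """Build two SFT targets from the same step: with and without the reasoning."""
    prompt = f"Observation:\n{observation}\n\nWhat is the next action?"
    with_reasoning = {"prompt": prompt,
                      "target": f"<think>{reasoning}</think>\n{action}"}
    without_reasoning = {"prompt": prompt, "target": action}
    return with_reasoning, without_reasoning

# Training order described above: first a pass over the reasoned targets (the
# reasoning acting as guidance for attention), then a pass over the same steps
# with the reasoning stripped, so inference can skip the slow reasoning tokens.
```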

Patrick Barker [00:09:11]: We have annotators kind of working around the clock on this. We're gathering quite a large data set, and then, yeah, soon we're going to be starting off with the RL-style training. And yeah, if you're interested in this work, we just open sourced a repo called R1 computer use, and we are building this all in the open. The data sets are going to be in the open too, so if it's something that interests you, we'd love to chat. So yeah, that's all I got for today.

Demetrios [00:09:38]: Awesome, dude. So yeah, to echo that, I think there's something super cool that you're kind of crowdsourcing the whole getting-quality-data piece, because that's where you're seeing the biggest bottleneck or the most difficult part, right?

Patrick Barker [00:09:58]: Yeah, yeah, that's definitely the hardest part with the computer stuff: there's really not many data sets at all that exist in the world. So the way we're thinking about this is, it's easier to verify that an action was correct than it is to take the action, right? It's significantly easier, and models are already quite a bit better at this. If you think about it, LLM-as-a-judge is really rapidly becoming LLM-as-a-reward-model, right? So the idea is, because it's easier for the model to do that, it'll hopefully take less training data to train a reasoning model that can be a reward model, and then that model trains the actor.

Patrick Barker [00:10:39]: So it's a way to sort of hopscotch up, and then you can sort of have it just run off on its own once the verifier, you know, approximates human signal well enough. Which is similar to what we do with RLHF, actually. It's just sort of now being applied to reasoning.
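
A hedged sketch of that hopscotch: only let the learned verifier drive autonomous training once it agrees with held-out human judgments often enough. The thresholds and interfaces below are assumptions for illustration, not the described system.

```python
from typing import Callable

def verifier_agreement(verifier: Callable[[str, str], float],
                       human_labeled: list[tuple[str, str, bool]]) -> float:
    """Fraction of held-out human judgments the learned verifier reproduces."""
    hits = sum(1 for obs, act, label in human_labeled
               if (verifier(obs, act) >= 0.5) == label)
    return hits / len(human_labeled)

def self_training_batch(verifier: Callable[[str, str], float],
                        candidate_runs: list[dict],
                        human_labeled: list[tuple[str, str, bool]],
                        min_agreement: float = 0.9,
                        min_score: float = 0.8) -> list[dict]:
    """Only bootstrap autonomously once the verifier approximates human signal."""
    if verifier_agreement(verifier, human_labeled) < min_agreement:
        return []  # keep collecting human labels instead
    return [run for run in candidate_runs
            if verifier(run["observation"], run["action"]) >= min_score]
```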

Demetrios [00:10:54]: But the Molmo set wasn't enough to, like, just do this R1 on it? It didn't work well enough?

Patrick Barker [00:11:04]: Yeah. So with the Molmo set, the main thing I think they added is Pixmo, which is a clicking... well, it's not really clicking, it's a coordinate grounding data set. So given an image, sort of like the old convolutional detection things, where you get an image and draw a bunch of bounding boxes. LLMs have historically been very bad at this. So I'd say their biggest contribution was Pixmo, which is about a million or two million data points of just really high quality grounding annotations. And so that way you can actually move the mouse and then click on the appropriate thing.

Patrick Barker [00:11:44]: And so I'd say that has taken us a long way on our benchmarks for clicking. We've, you know, trained on Pixmo, and we get about 87% accuracy on most things with the clicks, which is pretty good. We've got some other sort of training techniques we've worked into that, and we've gotten it close to 90% now. So it's getting better. I would say Pixmo is still a little light on GUI data; a lot of it is just sort of robotic-style data in the real world, and we could use more sort of GUI interaction.

Patrick Barker [00:12:19]: Grounding data would probably take that further. So.
