Deploying Autonomous Coding Agents // Graham Neubig // Agents in Production
Graham Neubig is a Professor of Computer Science at CMU and Chief Scientist at All Hands AI, where he is researching language-based autonomous agents, particularly with applications to coding. He is also one of the maintainers of the OpenHands coding agent, built by a community of 180 contributors and garnering 31k stars on GitHub.
At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
In this talk, I will discuss practical issues and technical fundamentals of coding agents. This will cover the promise of what coding agents may be able to do if they are implemented properly, what tasks they can do well now, some challenges in using them appropriately, and some of the challenges we need to solve to make them more effective -- file localization, file editing, planning, error recovery, and "good taste". I will also give a few demos of our open-source OpenHands toolkit (https://github.com/All-Hands-AI/OpenHands) to illustrate each of the points.
Demetrios Brinkmann [00:00:20]: Yes. All right, Graham, I bet a good sum of money you were not expecting to see a llama today.
Graham Neubig [00:00:32]: I wasn't, but I'm happy I did.
Demetrios Brinkmann [00:00:34]: Yes. Now, you've been doing a lot of work with autonomous coding agents. I want to say thank you for making all this work open source and I'm going to hand it over to you and let you cook.
Graham Neubig [00:00:48]: Great. I'm happy to give a talk here today about deploying agents for software development. And so the reason why I'm interested in software development may be kind of obvious, but it's basically that more and more of the stuff we do is running on software. And so if we gave everybody the ability to quickly write software to achieve their goals, imagine the various things that all kinds of people would be able to do. If we look at what's involved in developing software every day, a lot of people might think about coding, but coding is not the only thing that software developers do. If you look at this paper from Microsoft Research, they actually look through the day of a software developer, and 15% of the time was spent on coding. And then separately from that, there's also bug fixing, testing, documentation, reviews, communication and other things like this.
Graham Neubig [00:01:45]: So we have a lot of things. You know, "other" includes eating and going to the bathroom and stuff like that, things we don't need agents to do, but we need to do. So I think a lot of people have used development copilots, and these kind of work synchronously with the developer to write code. These can include things like GitHub Copilot, Cursor code completion and other things like this. But what I'm aiming to do is something more autonomous that essentially can go and solve full issues for you without too much intervention. And so this is an example of adding tests to something in an existing repo: you just enter a relatively large amount of instructions and it can go through and write the tests for you. And I won't go all the way to the end of this because time is limited today. But if you go all the way to the end of the video, it pushes to GitHub and I can review its pull request.
Graham Neubig [00:02:51]: So another way that we do this is autonomous issue resolution. So this is done through a GitHub action where we basically run an agent, and you can tag any issue with something like "fix-me" and the agent will then go in and start fixing the issue. All of this is running on GitHub's servers or the LLM provider's servers, and then in a few minutes, basically, it says a potential fix has been generated and it can send a PR to your repository. However, this is kind of the ideal, and in reality this is a tricky task because there's lots of things that go into good coding. And I'd like to talk a little bit about some of the challenges we've encountered in building coding agents along the way. And the first thing is defining the environment and how you interact with the environment, when your environment is as broad as writing and testing and debugging code. Once you have your environment, how can you observe the environment and act in the environment in an effective way? How can you actually generate code? And for this, we're mostly just leaving it up to a base language model. We're not doing a whole lot of innovation.
Graham Neubig [00:04:06]: But I could talk a little bit about that if people are interested in the Q&A. A really big problem is file localization, and that's specifically identifying which places in the code you should be modifying based on user intent. Then there's planning and error recovery: how can you come up with good plans for solving coding tasks? And safety. So how do we work with and interact with code? Coding agents need to do things like understand repository structure, be able to identify the structure of a large repository, read in existing code, modify or produce code, and run and debug code. Another thing that I actually didn't put on this slide, but that we're also doing in our software toolkit for software development, OpenHands, is web browsing. So I kind of feel if you can modify code, write code and do web browsing, then you can mostly do a lot of the tasks that I do in my everyday work. So it's a pretty broad action space. And exactly how we realize this within OpenHands, which is our software toolkit that gets pretty good accuracy on a lot of coding tasks, is something called CodeAct. And what we do is, whenever we interact with the environment, we do it essentially by writing code and executing code.
Graham Neubig [00:05:39]: And so if we look at the left side of this figure, we have kind of the typical way that a lot of AI agents interact with their environments, which is they use tool use and they call a single tool. And based on that tool, they get the results from that tool, they call another tool, they get the results from that tool, and step through a big task. But the problem is, what if the task is relatively complex? You might need 20 or 30 steps to solve even a simple information-seeking task like the one in this figure. And what CodeAct does instead is essentially call tools by just writing Python code. And so this includes not just writing new code, but also executing code to do actions that manipulate things in your file system and other things like this. So, in our framework anyway, this allows you to execute bash commands and Jupyter commands, and it results in faster resolution and higher success than direct tool use. But then one question becomes what tools do you use to interact with the environment? And a kind of early work in this area was something called SWE-agent. And what it does is it defines specialized tools that make it possible to efficiently explore repositories and edit code.
Graham Neubig [00:07:00]: And so what you do is you come up with a set of tools, just like kind of any computer tool use method. But these tools are specifically designed to allow you to effectively navigate repositories and to edit files in them. So to give an example, you might have an open-file command that says, okay, I want to open this solver file on line 405. And if you do that, it will open line 405 and also give you a window above and below line 405 so you can kind of see what's going on there. And so that allows you to observe files without spending tons and tons of tokens on understanding what's going on in very long files. And then you also have file-editing actions. You have a thought about what the agent is doing, and you can display that to the user so they have an idea what the agent is currently working on. And you have things like an edit action over here to edit the code.
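To make the file-viewing idea concrete, here is a minimal sketch of what such a skill could look like; the function name, window size, and output format are illustrative assumptions, not the actual SWE-agent or OpenHands implementation.

```python
# Hypothetical sketch of a SWE-agent-style file viewer: open a file at a given
# line and show a fixed window of context, so the agent doesn't spend tokens
# reading the whole file. Names, window size, and formatting are illustrative.

def open_file(path: str, line: int, window: int = 50) -> str:
    """Return `window` lines centred on `line`, with line numbers."""
    with open(path) as f:
        lines = f.readlines()
    start = max(0, line - 1 - window // 2)
    end = min(len(lines), line - 1 + window // 2)
    shown = [f"{i + 1}: {lines[i].rstrip()}" for i in range(start, end)]
    header = f"[File: {path} ({len(lines)} lines total), showing {start + 1}-{end}]"
    return "\n".join([header] + shown)

# Example: show the area around line 405 of some solver file.
# print(open_file("solver.py", 405))
```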
Graham Neubig [00:08:06]: And the way we handle these two things in OpenHands is essentially that we give the model the ability to run Python, run bash commands, and browse the Internet. And within the Python that we allow it to run, we define particular tools that the agent is able to interact with, including the SWE-agent-style skills for manipulating things. And this allows you to, for example, write a Flask app, run the Flask app, and then browse to the Flask app to see what's going on in it. So this is kind of interesting because the interface itself is very simple. It's just run Python, run bash, or browse. But it's very powerful because each of these tools is powerful. So moving on to how we localize files, there's a number of different ways you can do this, and what you want to do is find the correct files to edit given a user intent. And if we look at what a user intent looks like, this is an actual issue from our repo.
Graham Neubig [00:09:18]: And it says, when in confirmation mode, it's not possible to give instructions between steps. You have to reject an action, and it seems like it doesn't know that the action was rejected. So this is the problem the user wants to solve. But this doesn't tell you anywhere which files you should be modifying or other things like this. So it becomes difficult to know which JavaScript file you should modify in the front end, for instance. And this is analogous to kind of environment understanding or exploration problems in other agents or reinforcement learning or other stuff like that. So we want to go in and solve this problem. So solution number one is just offload this problem to the user.
Graham Neubig [00:10:01]: So you have the user give a relatively specific instruction and say, okay, you probably want to go in and edit these files. And this actually works if you're working with someone who's familiar with working with agents. So this particular intent was written by me, and it worked perfectly when I asked the agent to resolve this issue. But it's not perfect if you don't know which files you should be editing, or if it's an unfamiliar user, or other stuff like this. So the second option that people have explored a bit is to prompt the agent with search tools. And so SWE-agent provides a tool for searching repositories. There's also other possibilities for building kind of a RAG-style retrieval system over the repository. But all of these require a bit of extra overhead, of course.
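As a rough illustration of the search-tool option, a repository search skill exposed to the agent might look something like the sketch below; the function name and the plain substring matching are assumptions for illustration, not how SWE-agent or OpenHands actually implement search.

```python
# Illustrative sketch of a repository search skill an agent could call.
# This is plain substring matching; a real system might use ripgrep or a
# RAG-style index instead. All names here are hypothetical.

import os

def search_repo(root: str, query: str, exts=(".py", ".ts", ".js")) -> list[str]:
    """Return 'path:line: text' hits for `query` under `root`."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as f:
                    for i, line in enumerate(f, 1):
                        if query in line:
                            hits.append(f"{path}:{i}: {line.strip()}")
            except OSError:
                continue
    return hits

# e.g. search_repo(".", "confirmation mode") to find where a feature lives.
```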
Graham Neubig [00:10:55]: And the final thing that people have explored a bit is a priori mapping the repo. So we can create a map of the repo and prompt the agent with it. There's a tool called Aider that creates a tree-structured map of the repo you're working with. There's another one called Agentless that does a hierarchical search for every issue. And the way this works is it essentially goes down, starting with a map of the entire repo and asking a language model to localize the top files. Then, given the top files, asking the language model to localize the classes and functions, and then, given these functions, asking the agent to localize to specific locations that it would like to edit using the edit tool. So this is kind of how agents find files. Right now in OpenHands, we're actually doing something relatively simple.
Graham Neubig [00:11:50]: And we're just relying on the agent to use grep and find and the other tools at its disposal to find the files that it needs to work on. But I definitely think that more sophisticated solutions are useful, and we're working on that right now. So the next question is how to allow the agent to plan. And there's a few ways you can do this. You can have a hard-coded task completion process. And so, just to give one example, there's a paper called Agentless that did relatively well, and its process includes file localization, function localization, patch generation, and patch application. And the way this works is one of two ways. The first way is you can actually have the agent explicitly step through each of these steps.
Graham Neubig [00:12:35]: So you run the file localization step first and then the function localization step second, then patch generation, then patch application. Another option is just to prompt the model. And that's actually the approach that we take. Again, relatively simple but flexible. So we tell the agent, first you should localize files, then localize functions, then generate patches and apply the patches, for instance, if we wanted to follow this workflow. And that's easy to implement. But more importantly, it also allows the agent to kind of get off the beaten path if it needs to do something else. So we found that approach pretty effective.
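For illustration, prompting the model with a suggested workflow can be as simple as prepending a few instructions to the task. The wording and helper below are hypothetical, not the actual OpenHands system prompt.

```python
# Hypothetical example of "planning by prompting": the workflow is suggested in
# the prompt rather than hard-coded, so the agent can deviate when it needs to.

SUGGESTED_WORKFLOW = """
When resolving an issue, a reasonable workflow is:
1. Localize the files that are relevant to the issue.
2. Localize the specific functions or classes to change.
3. Generate a patch.
4. Apply the patch and run the tests.
You may deviate from this workflow if the task requires it.
"""

def build_prompt(issue_text: str) -> str:
    """Combine the suggested workflow with the user's issue description."""
    return f"{SUGGESTED_WORKFLOW}\n\nIssue to resolve:\n{issue_text}"
```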
Graham Neubig [00:13:13]: There are other methods, though. So for instance, there are methods that have the agents explicitly generate a plan and then, based on the plan, delegate to other agents that do things like reproducing bugs, localizing the files, editing the files, and verifying whether that worked. So you can also have a more explicit planning approach. And finally, another approach that I'm pretty excited about recently is coming up with a plan, delegating the plan to an agent that is tasked with executing that plan, but, if the plan fails, kicking it back up to the global planning agent so that the global planning agent can reformulate the plan based on the fact that some steps of the initial plan were not successful. One thing I'm excited about with respect to this planning-style approach is that if you have a really complicated software development task, you could theoretically break it up into small bits and deploy them to parallel agents. And those parallel agents would work on each bit one at a time and then report back, so you could finish the overall task faster. But we haven't quite gotten to that point yet. So another thing that's really, really critical, of course, for anything you do in production is good evaluation.
Graham Neubig [00:14:35]: And if you look at the types of environments that we'd like to have these kinds of coding agents work in, the actual environments include repos like those you get from GitHub and GitLab, but also task management software like Jira and Linear, office software like Google Docs and Microsoft Office, and communication tools like Gmail and Slack. In reality, the testing environments are mostly focused on coding, even though of course developers do more, like browsing the web. So I'm actually pretty excited: we have ongoing work that's looking into a little bit broader tasks, but the benchmarks that exist right now are mostly focused on coding. So I'll introduce a few of those. The first thing is whether your language model can code at all, because all of our modern agents are built on language models. And for this, datasets like HumanEval and MBPP have examples of usage of the Python standard library.
Graham Neubig [00:15:31]: So for example, given a non-empty list of integers, return the sum of all the odd elements that are in even positions. And I kind of view these, if you've ever seen them in language model evaluations, as kind of like the LeetCode of coding agent evaluation, because they're these very algorithmic questions that are kind of difficult to grok and solve, but actually we don't experience a whole lot of them in our everyday life as programmers. So a new dataset that maybe some people have heard of, and that is very nice for evaluating agents, is SWE-bench. And what SWE-bench basically consists of is issues from GitHub plus their code bases, and your goal is to generate a pull request. So you get an issue like "data leak in GBDT due to warm start", yada yada yada, and you get the code base that this issue was filed on. This is sent to a language model, the language model generates a PR, and based on the PR you then run unit tests. And some of these unit tests were failing previous to the PR, and then they need to pass after the PR.
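As a simplified sketch of that evaluation loop (not the official SWE-bench harness), the check is essentially: apply the model's patch, then require that the previously failing tests now pass while the previously passing tests keep passing. The function and argument names below are assumptions for illustration.

```python
# Simplified sketch of a SWE-bench-style check, not the official harness.
# Assumes the repo is already checked out at the issue's base commit and the
# model's patch is a unified diff stored in `patch_file`.

import subprocess

def run(cmd: list[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def evaluate(patch_file: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    if not run(["git", "apply", patch_file]):
        return False  # the patch didn't even apply
    # Tests that failed before the PR must now pass...
    fixed = all(run(["python", "-m", "pytest", t]) for t in fail_to_pass)
    # ...and tests that passed before must not regress.
    stable = all(run(["python", "-m", "pytest", t]) for t in pass_to_pass)
    return fixed and stable
```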
Graham Neubig [00:16:45]: So basically what you're doing is you're testing the ability of language models to solve real GitHub issues. And so this is a great dataset. It requires long context understanding and very precise implementation, and it also consists of a full dataset with all issues, a Lite dataset with a smaller number of simpler issues, and a Verified dataset where all of the issue descriptions are verified to be concrete enough that a programmer could go in and actually solve the problem. There's also some other interesting datasets. I'll just introduce one, Design2Code, which is code generation from website screenshots. So this helps test the multimodal abilities of models, and they also propose some evaluations and models to work on this. So, based on these evaluations, and I talked about a variety of agents, how good are agents at actually solving GitHub issues? If you look at the SWE-bench Verified leaderboard, this says current, but it might be from last week or so. Our system OpenHands was on the top of the leaderboard with a 53% resolve rate.
Graham Neubig [00:17:59]: And what this means is 53% of the kind of non-vague issues on GitHub from real open source repositories could be solved by AI agents autonomously, with no interaction with humans. And honestly, this is pretty amazing. And if you go and download an agent like OpenHands and actually play around with it and try to use it to solve issues, I think you'll be seriously impressed by how good they've gotten, especially in the past two months or so. And we use it for anywhere between 30% to 50% of the pull requests on the various software repositories that we're developing ourselves. So they're getting really good at being able to pick off especially the kind of issues that require less deep thought about implementation decisions and things like this. Another question that a lot of people might have is, which language models could I be using for these coding agents? And these results are a little bit older than the 53% result I just showed you. And it's on a different dataset, SWE-bench Lite, but I'd like to present it anyway. And so we basically compared Claude 3.5 Sonnet, GPT-4o, o1-mini and a bunch of other models, including open source models like Llama and DeepSeek.
Graham Neubig [00:19:17]: And what we found is Claude was quite good at solving this. We had GPT-4o trailing a bit behind. o1-mini did a little bit worse than GPT-4o, which might be a little bit surprising, but we found it was kind of overthinking things and doing things that were not necessary to actually solve the issue. And then in terms of the best open source models, Llama 3.1 405B and DeepSeek 2.5 were the best when we evaluated them. But we're working on evaluating Qwen 2.5 Coder as well, since it just came out. So a final thing I'd like to talk about, that's particularly pertinent to deployment in particular, is safety, and coding models can cause harm. And I'm actually a bit more worried about them causing harm by accident, by doing things that we didn't really want. So, for example, one thing is the coding models often like to push to your main branch.
Graham Neubig [00:20:15]: And, you know, this is something you usually don't want to happen unless you've confirmed it. Another thing that's a little bit interesting, a little bit scary even, is we tell the coding model that it needs to make the tests pass, so it just deletes the tests and makes sure that all the remaining tests pass. So this is a problem, obviously. But also, some people have been studying intentional harms from language models and have demonstrated that coding agents can be used for hacking. And so we have a few mitigation strategies. The first one is sandboxing: run everything in a Docker sandbox, and sometimes on the cloud or other things like this, so we can limit the execution environment.
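A minimal sketch of the sandboxing idea, assuming Docker is available on the host; the image, resource limits, and network policy below are placeholder assumptions rather than the actual OpenHands runtime configuration.

```python
# Illustrative Docker sandboxing of agent commands: no network, capped memory
# and CPU, and a mounted workspace. Image and limits are placeholders.

import subprocess

def run_in_sandbox(command: str, workspace: str) -> str:
    """Run a shell command inside a throwaway, resource-limited container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",          # no outbound network by default
            "--memory", "2g", "--cpus", "1",
            "-v", f"{workspace}:/workspace",
            "-w", "/workspace",
            "python:3.12-slim",
            "bash", "-lc", command,
        ],
        capture_output=True, text=True, timeout=300,
    )
    return result.stdout + result.stderr
```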
Graham Neubig [00:21:03]: Another thing is giving the agents appropriate credentials. So the principle of least privilege: make sure that they only have access to the actions that they should have access to. And a final thing that's pretty interesting is post hoc auditing. So can we audit the actions of the model so that it doesn't kind of step over the bounds of the things that it's allowed to do? And we can use language models for this, we can use security analyzers, and other things like this. And we have a security analyzer implemented within OpenHands to prevent that. So, yeah, that's all I have for today. There's a lot of interesting future directions like agentic training methods, human-in-the-loop systems, and evaluation.
Graham Neubig [00:21:47]: How can we get the agents to confirm better, and also handle broader tasks than coding? And all the stuff I talked about is open source. You can go download it, play with it, join the community and stuff. So, yeah, thanks a lot, man.
Demetrios Brinkmann [00:22:02]: Thank you. I appreciate you doing this. I know there's two really cool questions that came through in the chat. One is about overfitting when it comes to the benchmarks and the dangers there. Have you seen anything along those lines?
Graham Neubig [00:22:15]: Yeah, that's a great question. So I think there is some overfitting going on in the benchmarks, and there have been some examinations of SWE-bench, and they said that up to 20 to 30% of the solutions that were generated demonstrated some signs of overfitting. Nonetheless, my impression is that better models on SWE-bench also tend to be better when I practically use them myself. And this model that we have here is much, much better than the model we had two months ago that had a lower score on SWE-bench. The fortunate thing is that all of these are created naturalistically from GitHub. So if you can crawl new data from GitHub, you can create things that probably aren't leaked into the language models. So I do think that there's a way around that.
Graham Neubig [00:23:09]: And, you know, people are solving this as we speak.
Demetrios Brinkmann [00:23:12]: So. Last one for you, and then we've got to jump. What would be an approach while working with a code migration use case, i.e. legacy code conversions? How do you go about parsing the legacy code and then iteratively updating the target code?
Graham Neubig [00:23:27]: Yeah, that's a great question. My first answer would honestly be just ask it to do it, and it will kind of go in and do a reasonably good job of it. Like, literally right before this, I tried to incorporate a relatively large repository that had different linting rules into our main OpenHands repository. And anybody who's tried to do that before knows that's a pain. You need to go in and reformat all of your code and stuff like this. And it did it pretty well. So I'd start out by saying that. And then if there are particular guidelines that you want to give it, or maybe you want to give it a tool that it can use to rename things en masse, that might be one way. But overall, I think it might just do it for you pretty well.
Demetrios Brinkmann [00:24:17]: Well said. That was supposed to be the last one, but it's not, because another great question came through. Actually, more great questions are rolling through and we have a little bit of buffer time. So I'm going to use that buffer time and ask you this one. What has been your strategy to ensure minimal latency and performance?
Graham Neubig [00:24:36]: Yeah, it's a great question. Latency kind of comes with the territory if you're running code, because running code takes time. We do see some latency from language models also, but one of the most important things actually is starting up quickly, because when you want to code something, you don't want to wait. For example, when we're doing sandboxing, in our online offering we have sandboxes that are always live on the cloud, so you can grab a live sandbox and start playing around with it right away. And so I think reducing startup time is also a big thing for user experience. So that's been one thing we've been working on too.
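To illustrate the warm-start idea (a sketch under assumed names, not how the hosted OpenHands service is actually implemented), you can keep a small pool of sandboxes running ahead of time and hand one out immediately when a session starts:

```python
# Hypothetical warm sandbox pool: containers are started ahead of time so a new
# session doesn't pay the startup cost. Image and pool size are placeholders.

import queue
import subprocess
import threading

POOL_SIZE = 4
pool: "queue.Queue[str]" = queue.Queue()

def start_sandbox() -> str:
    """Start a detached container and return its id."""
    out = subprocess.run(
        ["docker", "run", "-d", "python:3.12-slim", "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def refill() -> None:
    """Top the pool back up to POOL_SIZE warm sandboxes."""
    while pool.qsize() < POOL_SIZE:
        pool.put(start_sandbox())

def acquire_sandbox() -> str:
    """Hand out a warm sandbox and refill the pool in the background."""
    threading.Thread(target=refill, daemon=True).start()
    return pool.get()

# Call refill() once at service startup so the first session is also fast.
```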
Demetrios Brinkmann [00:25:23]: Yeah, yeah. No one wants to have the stroke of inspiration and then sit for 10 minutes while everything is starting up. I could see why that could be a very painful user experience. Jan is asking, have you experimented with more sophisticated retrieval methods like tree summarizers, and what about huge context windows with Gemini? How much of the performance is related to retrieval?
Graham Neubig [00:25:49]: Yeah, that's a great question. So the funny thing is, when we started out, we started out with a relatively complex setup, and we've kind of aggressively simplified it down to something very simple and nonetheless have pretty good performance. And we've tried a couple times, not super seriously, to incorporate things like mapping the repo and coming up with RAG systems and stuff like this. And we've seen, like, not huge bumps in benchmark scores. But I think that also might just be a problem of not trying hard enough. So we're going to continue trying. But we have tried RAG, we've tried mapping the repo and stuff like that, and it was hard to beat the, like, find-and-grep baseline, basically.
Demetrios Brinkmann [00:26:33]: Yeah. And this one's a personal question, because I know you gave three approaches there on how to get better performance. Do you personally like one of those three and use one of those three more often, or is it you just choose the right tool for the right.
Graham Neubig [00:26:49]: Time or for like mapping the repo and finding the files?
Demetrios Brinkmann [00:26:53]: Yeah. And also because I know you gave three different options of how, I think it was, to get better performance. One was mapping, one was just asking it to do what you're looking for. Right?
Graham Neubig [00:27:07]: Yeah. So if you can get the user to provide you a little bit more information, that can be huge. I mean, it basically solves the problem for you. So practically, that's maybe not a bad option. But I'm hopeful for RAG. We actually have a paper called CodeRAG-Bench where we tried to evaluate the ability of lots of models to do retrieval over code, and there actually aren't really great embedding models for retrieval over code. So I think that's a little bit of a bottleneck also.
Graham Neubig [00:27:41]: And so, yeah, I think it just needs a bit more work. You know, three months more work and then we'll probably have a good solution. But I don't have a really great one right now.
Demetrios Brinkmann [00:27:49]: Three months is ambitious. I like it. Well, Graham, thank you so much for agreeing to do this. I really appreciate you coming on here and chatting with me and being so generous with your time.