From Few Shot Code Generation to Autonomous Software Engineering Agents // John Yang
John Yang is a 1st year PhD student at Stanford University advised by Professor Diyi Yang. Previously, he was a Master's student at Princeton University, where he was advised by Karthik Narasimhan. John's research directions include Language Agents, Language Model Evaluation, and Software Engineering.
Software Engineering can serve as a diverse, rich testbed for evaluating the next generation of language models and driving their development. This talk introduces a line of three works that has established the potential of this research direction and guided industry advancements towards autonomous software engineers. First, SWE-bench is a benchmark that evaluates an AI system's capability to resolve real-world GitHub issues, featuring 2294 task instances collected from 12 distinct Python repositories. Second, SWE-agent is an autonomous system that uses a language model to interact with a computer to solve software engineering tasks, setting a state-of-the-art 12.5% resolved rate on the SWE-bench test set. Lastly, SWE-bench Multimodal, a new dataset of 617 task instances from 17 JavaScript repositories, shows how many existing coding agents are overfit to Python, raising "generalizability" as an overlooked but desirable trait of AI systems.
Andrew Tanabe [00:00:03]: Hi everybody. Welcome back to stream 4 here. This is our final talk of the evening and we've got John here, who is a PhD student at Stanford, and he's going to share with us some thoughts and observations on going from few-shot code generation to autonomous software engineering agents. So John, just want to provide a quick set of house rules here. We'll have about 20 minutes for your presentation and then 5 minutes for Q and A. I'll watch the Q and A on my side, and when it's time for that portion I'll jump back on and help out with the questions from the audience here. Otherwise, I'll give you a time check about two minutes until the end, and I'll leave you to it. So John, looking forward to it.
John Yang [00:00:54]: All right, gotcha. Thank you so much, Andrew. I really appreciate it. Cool. Good afternoon, good evening, wherever you are in the world. It's really nice to be invited. I'm super excited to share a little bit of my ongoing work with this great community. So, yeah, as Andrew mentioned, my name is John.
John Yang [00:01:13]: I'm currently a first-year PhD student at Stanford University. I'm working with Professor Diyi Yang, and before this I was also at Princeton University working with Professor Karthik Narasimhan, both of whom have been my advisors throughout the course of these projects. So for today's presentation, what I'm going to do is introduce this suite of works that has really set the landscape for how we evaluate and think about the development of AI agents specifically for software engineering, not only in the past couple of months, but also going forward. Cool. So as you may be aware, this year, starting in March with the announcement of Devin from Cognition Labs and also the open-source alternative SWE-agent, which I will talk about, AI software engineering has been all the hype. There's been a lot of excitement around taking language models and adding a good amount of scaffolding and programmatic reasoning that enables them to solve real-world software engineering issues, like reading GitHub problems and creating pull requests. And since the beginning of this year, we've seen a flurry of submissions to SWE-bench. So what I'm going to do in this talk in particular is cover the history of how these projects were developed and, in the process, offer some insights on where exactly the field has been going, what potential directions there are, and what we might see as the future of how AI collaborates with human software engineers.
John Yang [00:03:00]: Cool. Great. So the first project I'm going to cover is SWE-bench, which is very much the benchmark that did a lot to characterize what it means for a system to perform software engineering and how we evaluate the performance of language-model-based systems on such a complicated task. So the original inspiration for this project, back in, I want to say, July of last year, was that the state of the art when it came to evaluating language models like ChatGPT or Anthropic's Claude came in two styles. One style, which is still very popular today, is this kind of potpourri style. Within the research literature there are things like MMLU or SuperGLUE; if you're not familiar with these terms, where you might have seen them before is whenever OpenAI or Anthropic or Google or Meta release a new model, they'll tell you, oh, this is how well our models perform on these benchmarks. And these benchmarks tend to be a huge collection of mutually distinct tasks. The idea is that if you evaluate on a lot of them, you get a good picture of how well a model does on translating from English to French, or recognizing different proper nouns, or doing sentiment analysis.
John Yang [00:04:28]: So a lot of these different collections of tasks. This was great for guiding the research field for quite a bit of time. But in recent years, if you look at the performance on these benchmarks, those numbers have, well, been saturated. You're getting like 70, 80, 90-plus percent performance. And so there's a real call to action in terms of defining new tasks that are compelling and can evaluate the development of the next generation of language models. To that end, code generation has emerged in recent years, and you had these really simple problems, kind of LeetCode, software-engineering-interview style, where you would ask the language model, hey, can you implement merge sort, or can you implement a heap, or something like that? So the task was taking a natural-language docstring and asking a language model to spit out the self-contained Python program that solves the problem. And you had a lot of follow-up work that implements a lot of different variations of these tasks. One of the most popular benchmarks in this realm is HumanEval, which came out from OpenAI about two or three years ago.
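For readers following along without the slides, a HumanEval-style problem looks roughly like the sketch below: the model is given a signature plus docstring and has to produce the body. This is an illustrative example of the format, not a claim about any specific benchmark item.

```python
# Illustrative HumanEval-style task: the prompt is the signature and
# docstring; the model must generate the function body.

def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold."""
    # A completion the model might produce:
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```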
John Yang [00:05:43]: And so as you might expect, especially if you've worked with a little bit of basic Python before, you get that docstring in green, and what's highlighted in yellow is what the language model is supposed to generate. So it was a very good problem style for quite a bit of time. But what's happened in recent years is that this has also been saturated; language models are very good at these self-contained completions. So, with language models getting a lot of performance gains and things saturating, it begs the question of revisiting what makes a good benchmark. And there are three guiding principles that led to SWE-bench. One is that it's challenging for state-of-the-art models. For things like GPT-4 and Claude Sonnet, ideally you're seeing performance that is not close to 100%, but something that they reasonably struggle with, and the numbers reflect it. And if you look at the reasoning chains, those reflect it too.
John Yang [00:06:46]: The second is that now we can really think about realistic use cases. So as opposed to before, where we were kind of dumbing down the problem, we can lean on real-world, complex problems as a good test bed. And last but not least, being able to evaluate these solutions easily is also incredibly important. All of these points are something that software engineering really encompasses, right? It's a very challenging problem; there are entire markets, entire careers dedicated to software engineering. It's extremely realistic. And within the realm of software engineering you have things like unit tests, integration tests, and end-to-end tests that allow one to evaluate solutions very rigorously, so it's not a matter of opinion.
John Yang [00:07:34]: Cool. So finally, diving into the actual project itself, the main inspiration for SWE-bench is this general idea of, well, can you look at real-world workflows within the realm of software engineering, and given that those workflows produce enough data, can you take that data and convert it into a good evaluation? In our case, that means looking at the open-source software development pipeline. So what does that actually mean? If you go on GitHub, which is a popular online repository hosting site, you can look at a code base, and you'll see, for example, if you've used NumPy or pandas or a lot of these popular Python packages, that these code bases have issues that users report, saying, hey, this doesn't work, or I want this particular feature, basically giving developers feedback on how to improve the code base. So the idea here is that usually a person is going to be the one fixing the issue for that code base. But now, what if we put a language model in place of that person? In this setting, the task is: given an issue and a code base, the language model must generate a pull request, or generally speaking a fix, that resolves that issue. Then you have a bunch of unit tests that correspond to the issue, and you can check whether that fix is successful by simply running it against those tests. So the general formulation is: given the code base and issue, generate a fix, and then verify whether that fix works with the tests.
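As a rough illustration of that formulation, here is a minimal sketch: apply the candidate fix plus the reference tests in a repository checkout, then check that the issue's tests pass. This is not the official SWE-bench harness; the pytest invocation and arguments are placeholders.

```python
# Minimal sketch of the evaluation formulation, not the official harness.
import subprocess

def resolved(repo_dir: str, model_patch: str, test_patch: str,
             issue_tests: list[str]) -> bool:
    """Return True if the candidate fix makes the issue's tests pass."""
    try:
        for patch in (model_patch, test_patch):
            # Apply the model's fix and the reference test patch from stdin.
            subprocess.run(["git", "apply", "-"], input=patch, text=True,
                           cwd=repo_dir, check=True)
    except subprocess.CalledProcessError:
        return False  # patch did not apply cleanly
    # Run only the tests associated with this issue.
    result = subprocess.run(["python", "-m", "pytest", *issue_tests],
                            cwd=repo_dir)
    return result.returncode == 0
```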
John Yang [00:09:21]: Again, just revisiting the real-world basis for this: you have the repository, someone reports a real issue, and then what usually happens on GitHub is someone, especially one of the code base maintainers, will take a look at the issue and say, okay, this looks like a reasonable fix, here is a pull request that I've created, and the pull request gets merged. So through this simple pattern of reporting an issue and then developing a pull request that resolves it, you get a lot of great data that can become the test bed for evaluating software engineering as a task. In terms of how we actually constructed SWE-bench, it involves a fairly complicated scraping process on GitHub. First, you target those issue-PR pairs that we talked about. We looked specifically at 12 very popular Python repositories, PyPI packages, things like Django, Flask, matplotlib, seaborn, pytest, development tools that most people might be familiar with if they've worked with Python. And then what you do is look for pull requests that have an issue associated with them.
John Yang [00:10:39]: So you're going to want to filter through the issues and pull requests and look for a couple of things. The pull request should have an issue, because that's going to serve as the problem description. The pull request should also contribute a certain number of tests. What that means is, if you look at the code changes that were submitted, part of those code changes are the actual fix, but part of them are also the unit tests that verify the fix is working correctly and that the behavior is sustained in the future. So you want those two criteria to be satisfied. And last but not least, for step three, you need to set up a lot of execution environments. Practically speaking, we're referring to Docker containers, or execution environments generally, that allow one to apply different code fixes and then actually run those tests to verify whether the fixes work. Once you have that pipeline in place, with a little bit of manual intervention and some execution environment construction, you can really put together a big problem set.
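A very simplified sketch of those first two filtering criteria is below; the field names are illustrative placeholders, not the exact schema of the collection pipeline.

```python
# Simplified sketch of the issue/PR filter; field names are illustrative.

def is_candidate(pull_request: dict) -> bool:
    """Keep a merged PR only if it links an issue (the problem statement)
    and its diff touches both test files and non-test source files."""
    if not pull_request.get("merged"):
        return False
    if not pull_request.get("linked_issues"):                # criterion 1: references an issue
        return False
    changed = pull_request.get("changed_files", [])
    has_tests = any("test" in path for path in changed)      # criterion 2: contributes tests
    has_fix = any("test" not in path for path in changed)    # ...and an actual code fix
    return has_tests and has_fix
```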
John Yang [00:11:48]: And what each instance looks like is what we see here. You have some metadata at the top that tells you which repository and pull request the problem came from. Then you have your problem statement, which is verbatim the original issue. In this case, what we're looking at is the SymPy library, a popular package for doing symbolic math operations in Python, and you see there's some sort of issue with a matrix expression. Then, from the actual code changes in the pull request, you get the reference solution, which we call the gold patch, and you also get the test patch, which corresponds to the new or updated tests that capture the new behavior that the gold patch implements. There you go: you have a dataset, you have the problem, and you have the evaluation for it.
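Concretely, a single task instance ends up looking roughly like the record below. The field names approximate the released dataset and the values are truncated placeholders, so treat this as a sketch rather than a verbatim instance.

```python
# Approximate shape of one SWE-bench task instance; values are placeholders.
instance = {
    "repo": "sympy/sympy",
    "instance_id": "sympy__sympy-XXXXX",                   # hypothetical ID
    "base_commit": "<commit the pull request was built against>",
    "problem_statement": "<verbatim text of the GitHub issue>",
    "patch": "<gold patch: the reference fix from the pull request>",
    "test_patch": "<new or updated tests capturing the fixed behavior>",
    "FAIL_TO_PASS": ["<tests that fail before the fix and pass after it>"],
    "PASS_TO_PASS": ["<existing tests that must keep passing>"],
}
```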
John Yang [00:12:46]: Just by scaling this collection up, you get SWE-bench as a benchmark. It's a collection of 2294 problems across those 12 Python repositories. As you can see, Django is a big contributor, but you also get a lot of cool things like linting, or web requests, or machine learning libraries, or even symbolic mathematical computation with SymPy, or Astropy, which is an astrophysics Python library, which is quite cool. But it's very challenging compared to the prior benchmarks we talked about, given the size of a code base. You're not just reading an issue; you're looking around a code base, and you have to actually do what a software engineer does, which is understand the code base, try out different fixes, implement the tests. So it becomes a very challenging benchmark. I'll go through this part very quickly: for the initial work we tried a few very simple baselines, particularly retrieval-augmented generation, which does not involve any execution, multi-step, or agentic approaches, and we found that the performance wasn't very great.
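The retrieval step in that kind of baseline can be approximated with lexical ranking over the repository's files. Here is a minimal sketch using the rank_bm25 package; it mirrors the idea rather than the exact baseline code.

```python
# Minimal sketch of BM25 file retrieval for a RAG-style baseline; this mirrors
# the idea, not the exact SWE-bench baseline implementation.
from pathlib import Path
from rank_bm25 import BM25Okapi

def retrieve_files(repo_dir: str, issue_text: str, k: int = 5) -> list[str]:
    """Rank the repository's Python files by BM25 similarity to the issue."""
    paths = list(Path(repo_dir).rglob("*.py"))
    corpus = [p.read_text(errors="ignore").split() for p in paths]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda pair: pair[0], reverse=True)
    return [str(path) for _, path in ranked[:k]]
```

The retrieved files (or, in the oracle setting, the files the gold patch actually edits) are then placed into the prompt alongside the issue text.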
John Yang [00:14:01]: So yeah, if you look at the initial results for SWE-bench, which was exactly one year ago, around October or November of last year: out of 2294 problems, the best model is only resolving 2 or 3%. And so a lot of very obvious challenges become immediate. One is this file localization problem, where even given the issue description, it becomes fairly difficult to correctly identify which files you're supposed to edit. That's evidenced by the fact that if you simply give the model the correct files, as opposed to having a retrieval method try to decide which files to edit, you get a boost in performance. But yeah, long-context reasoning is quite difficult; reasoning across a massive code base is much more challenging than just reading a single document. And there's also just coding style; if you can imagine, think about LeetCode.
John Yang [00:15:00]: If you're writing solutions to these self-contained Python programs, usually you're dealing with pretty primitive Python. You're not dealing with higher-order functions or objects or other classes and entities that were created in the rest of the code base. So the problem is simpler at the LeetCode level. But with software engineering, all of a sudden it's not just a matter of knowing the exact solution, because the solution is not usually readily apparent. Understanding the rest of the code base, what context to include, what to exclude, and what not to consider is an incredibly important and also challenging part of this benchmark. So now, with SWE-bench, we've established this home base, this groundwork, for better understanding and guiding the development of language models that can do software engineering. From last October all the way until this March, it actually wasn't quite obvious how you could make progress on this benchmark. A lot of the initial feedback that we got, particularly from people in academia, was that this benchmark is way too difficult, especially in this kind of execution-less, no-agents, no-interaction setting.
John Yang [00:16:21]: And so that leads us to the next work, which is SWE-agent, the very first work that introduces language agents as a solution for software engineering. If you're not familiar, the idea of a language agent is this: as opposed to the traditional sequence-to-sequence style of using a language model, where given an input you directly generate the output, language agents adapt a lot of pre-existing literature and task formulations from RL, so that instead of learning a policy or training a traditional reinforcement learning agent, you simply put the language model in the role of being that agentic entity. At this point the idea has taken off and we're all quite familiar with what a language agent is, but in terms of the academic framing, that's a way to think about it. Before SWE-agent, I had a project where we tried the simple thing of just taking the language model and having it operate directly in a Bash terminal: just have it write cd, ls, and navigate around Bash. It worked pretty well for fairly simple problems like coming up with Bash queries and doing basic SQL database manipulations. But for more complex things like software engineering, where you're going to want to edit a code base, cd into different directories, maybe grep and search for different symbols within the code base, or execute certain files or certain tests, it becomes a lot more involved, and the actions required to get to the solution, that action space, become a lot larger. So what we noticed is that by simply plugging language models directly into the tools we're familiar with, you don't get great performance if you're just plugging it directly into VS Code.
John Yang [00:18:21]: Having it operate as a software engineer through VS Code, that might be a little bit tough. The hypothesis that we had is that, in the same way that humans and tools co-evolve, where humans learn to use tools but those tools also become improved and optimized to make human operations on the task a little bit easier, language models can improve and get better performance if you simply give them a better interface, a better set of tools to operate. So, I see I'm running out of time, so I'll go through these next slides a little bit quickly. But basically we come up with a new set of principles for what it means to design good interfaces for agents. You can read about this more in the paper if you're interested; some of the findings are quite straightforward, but it's more in the design process that they bubble up. So again, when we talk about SWE-agent, what we are referring to is a language model plus an agent-computer interface. And what this means, especially with regards to LM-friendly commands, is that on the left you have this really basic Bash setup, but if you optimize things like editing or searching through a code base to be more amenable to the inclinations of a language model, then you'll get better performance.
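As a loose illustration of the agent-computer interface idea, here is the bare interaction loop such an interface plugs into. The query_model callable and the instruction text are hypothetical stand-ins; SWE-agent's actual contribution is replacing raw Bash with LM-friendly commands for searching, viewing, and editing files, which this sketch does not reproduce.

```python
# Bare agent loop: the LM proposes one command per turn, the environment
# executes it, and the observation is fed back into the context.
# query_model() is a hypothetical stand-in for a call to a language model.
import subprocess

def run_command(command: str, repo_dir: str) -> str:
    """Execute one shell command and return truncated output as the observation."""
    result = subprocess.run(command, shell=True, cwd=repo_dir,
                            capture_output=True, text=True, timeout=60)
    return (result.stdout + result.stderr)[-2000:]

def agent_loop(issue: str, repo_dir: str, query_model, max_turns: int = 20) -> list[str]:
    """Alternate between asking the model for a command and executing it."""
    history = [f"ISSUE:\n{issue}\n\nReply with one shell command per turn, "
               "or 'submit' when you are done."]
    for _ in range(max_turns):
        command = query_model(history)      # e.g. "grep -rn MatrixExpr sympy/"
        if command.strip() == "submit":
            break
        history.append(f"$ {command}\n{run_command(command, repo_dir)}")
    return history
```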
John Yang [00:19:46]: So anyway, I'll go through these diagrams very quickly, but the exciting thing is that the performance jumped up. You went from like 2, 3, 4% in a RAG setting to all of a sudden 13%. And if you saw Devin and Cognition Labs' announcement from earlier this year, you know that was a really big breakthrough. We have a bunch of supporting ablations; I'll go through these quickly, so feel free to read more or connect with me and ask. But one thing I did want to highlight is that with the right set of tools, you start seeing certain problem-solving patterns emerge. For example, what we're seeing here, especially with these colors, hopefully it's somewhat apparent, is that from turns 0 to 3 a language model will either, one, try to reproduce the bug.
John Yang [00:20:34]: If you look at that big green bar that refers to the create command, it's creating a Python file and trying to recreate the bug. Or it's going to start localizing and looking for where the problem is. Then there'll be a middle phase where it edits and then runs tests, edits and runs tests. And then slowly it'll start submitting solutions after a couple of turns of editing. Cool. So I'll skip all this information, but basically there's been really great progress since then. After SWE-agent, we saw a lot of people get excited, and we've had a lot of submissions to the SWE-bench leaderboard, so please feel free to check it out. OpenAI and Anthropic have taken it up as one of the newer good evaluations, and we've seen tremendous jumps.
John Yang [00:21:19]: The latest top performance on one split, SWE-bench Verified, is more than 50%, which is incredibly exciting. So I'll leave it at that. I want to thank all my collaborators; this is definitely not a solo effort. There are a lot of great people that supported me on this project and on the work we're doing right now. And if you're interested, I really encourage you to check out the repositories for SWE-bench and SWE-agent.
Andrew Tanabe [00:21:48]: Thank you so much, John. What a fantastic presentation, and really interesting the way that you and your team are thinking about real-world application beyond that purely academic, earlier-stage benchmarking. It's something that we really think a lot about here at Prosus. We have our own ProLLM benchmark that is not quite as technical as that, but really goes into use cases that we see across our portfolio companies, as well as basing it off of question-and-answer pairs from Stack Overflow and other public sources. One of the questions that we are seeing here has to do with exactly that step that you all took, saying, okay, here's a really complicated, very challenging benchmark, but what if we allow the system to take multiple turns? What if we allow that agent process to move forward? And the sense that I'm getting from the comments here is: where does it go beyond there? Is it that the next step is multiple agents being strung together? Is it giving more and more time, allowing more context, more planning capacity? Where do you see the next inroads for actually conquering this benchmark?
John Yang [00:23:29]: Yeah, definitely. I think that's a great question, and it's also really cool to hear about that. Yeah, I think evaluations are so important. I guess instinctively there would be two answers. Since this benchmark came out, I think we saw a lot of people develop new solutions, and certainly the agentic one is popular. You've also seen people explicitly define what that problem-solving pipeline is, so the language model isn't as free in some sense, but it's prescribed: first you should look for the files to edit, then test, and so on. So to your point, yeah, I think planning, searching, a lot of the test-time compute, o1-style things that people are excited about these days in academia.
John Yang [00:24:16]: I certainly think that has a part to play, especially for software engineering. At least with these base models, I think what we're seeing is that sometimes there are cascading errors: if it tries to make an edit and it doesn't work, it'll keep hammering away, whereas a person might take a step back. That's part of it. And certainly SWE-bench, while it's quite compelling, captures an important but still just one of the facets of software engineering. So there are certainly a lot of workflows that we haven't tapped into so far. But with people examining enterprise workflows, or workflows within software engineering but also in different markets, like I think a lot of the other great presenters today have shown that side a bit, I would be excited to see whether those would make good evaluations.
Andrew Tanabe [00:25:07]: Yeah, no, I mean with so many different models, different problems in the world, different ways of approaching it, and the complexity, well, the non-deterministic element, of course, of these AI systems, the need for these benchmarks is just becoming more and more obvious. We were hearing a lot earlier about different ways to test agent systems, whether you're putting them onto real-world data but in a secure environment, or you're using different benchmarks. It's really critical work that helps a lot of engineering teams out there put things into production. So thank you for your work.
John Yang [00:25:51]: Yeah, no problem. Thank you so much.
Andrew Tanabe [00:25:53]: Yeah, thank you, John. We're going to head off now.
John Yang [00:25:56]: Bye.