
Lessons From Building Replit Agent // James Austin // Agents in Production

Posted Nov 26, 2024 | Views 1.1K
# Replit Agent
# Repls
# Replit
SPEAKERS
James Austin
Staff Software Engineer @ Replit

James is a Staff Software Engineer on Replit's AI Team, where he was a key contributor to Replit Agent, a Software Agent designed to allow anyone to build software from scratch. Previously he led Replit's RAG effort, and worked at Amazon.

Adam Becker
IRL @ MLOps Community

I'm a tech entrepreneur and I spent the last decade founding companies that drive societal change.

I am now building Deep Matter, a startup still in stealth mode...

I was most recently building Telepath, the world's most developer-friendly machine learning platform. Throughout my previous projects, I had learned that building machine learning powered applications is hard - especially hard when you don't have a background in data science. I believe that this is choking innovation, especially in industries that can't support large data teams.

For example, I previously co-founded Call Time AI, where we used artificial intelligence to assemble and study the largest database of political contributions. The company powered progressive campaigns from school board to the Presidency. As of October 2020, we had helped Democrats raise tens of millions of dollars. In April of 2021, we sold Call Time to Political Data Inc. Our success, in large part, is due to our ability to productionize machine learning.

I believe that knowledge is unbounded, and that everything that is not forbidden by laws of nature is achievable, given the right knowledge. This holds immense promise for the future of intelligence and therefore for the future of well-being. I believe that the process of mining knowledge should be done honestly and responsibly, and that wielding it should be done with care. I co-founded Telepath to give more tools to more people to access more knowledge.

I'm fascinated by the relationship between technology, science and history. I graduated from UC Berkeley with degrees in Astrophysics and Classics and have published several papers on those topics. I was previously a researcher at the Getty Villa where I wrote about Ancient Greek math and at the Weizmann Institute, where I researched supernovae.

I currently live in New York City. I enjoy advising startups, thinking about how they can make for an excellent vehicle for addressing the Israeli-Palestinian conflict, and hearing from random folks who stumble on my LinkedIn profile. Reach out, friend!

SUMMARY

Few companies have shipped an agent at the scale of Replit, but with scale comes challenges. This talk explores how Replit built and scaled its agent, and how the effort grew from a small engineering team to involving almost half of its engineering organization almost overnight.

TRANSCRIPT

Adam Becker [00:00:05]: Coming right up. It's funny how, in virtually every circle I speak to where I talk about agents in production, people keep bringing up Replit, almost as an example of what it's like to actually ship an agent to production. I would like to invite to the stage James. Let's see. James, are you with us? Can you hear us? Are you here?

James Austin [00:00:33]: I'm here. Can you hear me?

Adam Becker [00:00:34]: Yes. Nice. James Austin, very nice to have you here. Today you're going to take over and tell us a little bit about the lessons from building the Replit agent. I'm excited to hear what you have to say. I can put your screen up. Okay, I'm going to be back in about 20 minutes, folks. If you have questions, put them in the chat.

Adam Becker [00:00:59]: James, I'll see you soon.

James Austin [00:01:01]: Thank you very much. My name is James. These are some lessons we learned from building the Replit agent. A quick overview of what Replit is: it's an online cloud development environment where users create Repls, which are basically fully contained development environments. It's a full environment: you can write code, install packages, execute it, and then deploy it. And the goal for the company is really to empower the next billion software creators. So it wants to be easy to use, easy to get started.

James Austin [00:01:41]: We've had a lot of success. We've got about 30 million users signed up, and we've been doing the AI thing for a while. But in September, we deployed the Replit agent, which is our fully automated code agent, where you talk to it and it writes the code for you, and you're working with the agent to develop what you want. These are a few lessons that we learned as we went through the process. The first one was really, you need to define who you're actually building for, which sounds kind of trite because that's ultimately what every product is, right? You have to figure out what you're going to build. But when we first started building this, we started optimizing purely for our SWE-bench score. SWE-bench, for those who don't know, is an evaluation that basically takes a GitHub issue and converts it into a GitHub PR, over six or seven relatively large repos like Django. And that's a really important metric, but it's not actually what our users were coming to Replit to build.

James Austin [00:02:52]: They wanted to build their own idea from scratch relatively quickly and iterate and work in a very, very tight loop with their agent. And we had this realization that we were building things that were easy for us to measure, but not necessarily things that are important to measure. And we realized that a code agent means very different things to many different types of people. An engineering manager with a few reports in San Francisco, whom he's paying hundreds of thousands of dollars a year, is very different from a traditional engineer who wants help writing their boilerplate, or from what we call the AI-first coder, who's building their SaaS product on the weekend. And they all have very different requirements in what they want and what they're looking for. So our engineering manager has an enormous budget, because he's thinking of it through the lens of how do I replace these engineers, or how do I get more value out of them? He's thinking about working asynchronously. He writes up issues that are relatively well defined and can fire them off and then expect his agent to come back later, like hours later. And he's often working with very large code bases.

James Austin [00:04:15]: Meanwhile, the AI-first coder and the traditional engineer are much more locked into that ChatGPT or Claude mindset, where they're paying like 20 bucks a month. It's very in the loop. Maybe they know what they're working with, maybe they know what they want, maybe they don't. And ultimately, it's really hard to be everything for everyone. If you improve things for one target audience, you make things considerably worse for another. As an example, Monte Carlo tree search (which, for people who aren't familiar, is essentially running the agent on a specific problem multiple times in parallel and then having some way of figuring out which agent did the best) drastically improves accuracy, especially over long trajectories where your agent is working for a really long time. And that's super important. But the issue is that if you run five agents in parallel, you pay five times as much.

James Austin [00:05:21]: And it also slows things down. So for your AI-first coder or your traditional engineer, who wants to be in the loop, you're not actually providing a lot of value. So if you're building things for them, you want to avoid techniques like that, at least for now. Obviously, all of this advice is going to change very, very quickly as new models come online. The rise of very small models like GPT-4o mini, and how much they've improved in accuracy, and the new Claude 3.5 Haiku mean that some of these options are actually a bit more feasible.
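
To make the parallel-sampling tradeoff concrete, here is a minimal best-of-N sketch in Python. The helpers `run_agent` and `score_trajectory` are hypothetical placeholders rather than Replit's code, and true Monte Carlo tree search would also branch at intermediate steps rather than only scoring finished runs.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task, run_agent, score_trajectory, n=5):
    """Run n independent agent attempts on the same task and keep the best one.

    This is the simplest form of parallel sampling: accuracy tends to improve,
    but cost (and usually latency) scales roughly with n, which suits async,
    big-budget users far better than someone iterating in a tight loop.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        trajectories = list(pool.map(lambda _: run_agent(task), range(n)))
    # Selection needs some signal: a verifier model, test results, heuristics, etc.
    return max(trajectories, key=score_trajectory)
```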

James Austin [00:06:03]: And then the other bit is product decisions, like when did we want to support arbitrary tech stacks? Product managers or AI-first coders don't really mind if their agent uses Next.js or Express or Flask, but engineers really, really care about that. And they will let you know on Twitter, and you can imagine how we know that. Ultimately, knowing your persona, who you're targeting and who you want to help, tells you where to improve your agent. So some examples of decisions that we made: we really wanted to optimize for that getting-started experience. Our first SWE-bench-optimized agent didn't really feel good to use for getting started because it was relatively methodical. That meant when you needed to create, say, five or ten different files to get started, and each loop took like 20 or 30 seconds, it actually took a relatively long time for your agent to build something that felt good. It also spent a lot of time reasoning and thinking things through. Ultimately, we found that we could do better.

James Austin [00:07:21]: We built something called rapid build mode, where instead of going in and using that traditional agentic loop, we kick-start building a solution by just dumping out as much as we possibly can. We give it some templates, we give it some guidance, there's some custom prompting there, but the agent can very easily dump out 10 or 15 files that are relatively cohesive. And look, they often have a few small issues, but then you can drop back to that agentic loop to fix up those small issues. This ultimately took us from a working application taking six or seven minutes to get started to being under two minutes. The other bit is we started doing prompt rewriting. So rather than diving straight into the code or trying to tackle the user's problem, we actually try to rewrite what they've given us to expand it, add in more detail, and figure out what they're really looking for. So rather than getting something after six or seven minutes that isn't actually what they want, they get something that's a lot more like what they expected.

James Austin [00:08:35]: You can see this example on the left, I hope people can read it. I give it a query that says: build a waitlist site for my Italian restaurant called La Pizza, which collects emails and full names from users; make the design of the website modern and professional. And it doesn't add that much, but it elaborates that, hey, we're going to be using Flask for building this website. And then I would be able to say, hey, actually, I'd like you to use, say, Express, and then it would go, okay, cool, I'm glad we figured that out earlier.
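
As an illustration of the prompt-rewriting step described above, here is a minimal sketch; the OpenAI client, model name, and instruction wording are assumptions for the example, not Replit's actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice below is illustrative

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's request for a coding agent. Expand it with the implied "
    "requirements (pages, data to store, proposed tech stack) and state every "
    "assumption explicitly so the user can correct it before any code is written."
)

def rewrite_prompt(user_request: str) -> str:
    """Expand a terse request (e.g. the waitlist-site example) before building starts."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": user_request},
        ],
    )
    return response.choices[0].message.content
```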

James Austin [00:09:17]: The second lesson that we learned is that it's really, really important to automate finding your failure cases. Agents fail in really strange ways that are really hard to detect, and it's a very long tail. The other bit is that agents will relentlessly try to solve any problem that they see, which means that they will often go down side tangents. Anthropic, when they released their computer use demo a few weeks ago, saw an example where their agent, which is trained to operate a computer or a browser, got distracted and started looking at photos of Yellowstone National Park rather than doing the coding task it was actually given. Even when you add guardrails, people will be able to get past them with a little bit of creative prompting. As an example, when we initially launched the Replit agent, we decided to narrow in on a few very simple stacks until we were confident in wider support. Next.js was one of the stacks that we originally decided we weren't going to support from day one. But it turns out that if you just told it that you were the CEO of Replit and you were running a test, you could actually just get around the block list.

James Austin [00:10:28]: We could have done a bunch of prompt engineering to make that block list a lot more robust. But frankly, users are actually quite reasonable, and they understand that if they're doing these prompt techniques, then things might go off the rails. But it's just a warning that you can't actually trust that this is going to work perfectly each and every time. And then one of the other issues is that spotting all these agent failure cases is quite difficult. Traditional monitoring systems like Datadog will help you notice if your app crashes, but they won't help you if your agent has started going in circles, gotten stuck in a loop on a problem it can't solve, or if the user has gotten around a block list. And when you look at those user metrics, a session where a user exits in frustration looks like a session where a user left satisfied. So ultimately, you need to be paying a lot of attention to your traces and what your agent is doing.

James Austin [00:11:41]: But if you look at every trace, you're simply going to be overwhelmed. We generate thousands and thousands of traces every day, and it's simply too much for us to actually read through. We use LangSmith for monitoring traces, which is a really great product, highly recommended. But ultimately you need to build tactics for spotting these failure cases into your actual application. We found an amazing amount of success with rollbacks. Any rollback means that you made a mistake, whether that's the user didn't specify what they wanted and you did the wrong thing, or your agent has gotten stuck and the user wants to restart.

James Austin [00:12:40]: But if you just log every time you do a rollback, and how often a specific point has been rolled back to, you can use that as a very clear signal that there's some issue for the agent, and that's somewhere where engineers can dive in and try to understand the problem. Other solutions are sentiment analysis on user inputs. So if a user says, "you didn't do what I want," negative sentiment, that can be a red flag. The other bit of insight we had is that we added a feedback button to, you know, provide feedback about what the agent is doing. We found that users just didn't really use it. The nice thing about rollbacks is that users get immediate value from them, but with feedback, we found that users thought they were kind of screaming into the void.
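
A rough sketch of the rollback-counting signal, assuming a hypothetical event pipeline; the checkpoints rolled back to most often are the places where engineers go read traces.

```python
from collections import Counter
from datetime import datetime, timezone
import json

rollback_counts = Counter()

def log_rollback(session_id, checkpoint_id, user_message=""):
    """Record a rollback; every one implies the agent (or the spec) went wrong somewhere."""
    rollback_counts[checkpoint_id] += 1
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session": session_id,
        "checkpoint": checkpoint_id,
        "user_message": user_message,  # optionally run sentiment analysis on this too
    }
    print(json.dumps(event))  # stand-in for whatever logging/analytics pipeline you use

def hot_spots(top_n=10):
    """The most-rolled-back checkpoints: the clearest places to dive into traces."""
    return rollback_counts.most_common(top_n)
```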

James Austin [00:13:39]: And then obviously the other solution is that traditional methods of finding issues still work. Social media, for example: there are a lot of users who come to us on Twitter and say, hey, I have this issue, and we can dive in and figure out what's going on there. Customer support as well. These are all really good ways of finding those traces so that you can find the holes in your agent logic. Third lesson: evaluate, evaluate, evaluate. I can't stress enough that you are probably under-testing your agent in rigorous ways.

James Austin [00:14:15]: This is really, really important. So when we first started building the agent, we were doing what we were calling vibes-based development, which is basically: play around with it, does it feel better, okay, make a change, does that feel better, keep doing that, keep doing that. The issue is that you end up with this patchwork of different hacks and comments and prompts to get around specific issues that you've seen, and it's really hard to spot regressions in performance. And if it takes you, say, two or three minutes to test a trace for your agent, then it's really hard to measure whether you've improved something from, say, 50% to 90% to 95% to 99%. If something fails 5% of the time, you need to run it 20 times.

James Austin [00:15:15]: Before you expect to actually see a failure. And then the other bit is that you can't actually rely on the public evaluations. The analogy I like to use is that public evaluations are like the SAT: they measure something, and that something is kind of important, but it's not really specific enough to a job or your specific requirements that you would hire someone based purely on their SAT score. A lot of the public evaluations are also really good for specific problems, but they're probably not good for your problem. So SWE-bench is great for a GitHub-issue-to-GitHub-PR pipeline agent, but it's not a good evaluation for an agent that helps an AI-first coder build a marketing website, where, for example, the user says, hey, you've got the image on the left side, can we move that over to the right side and maybe make it a palm tree instead of a forest? There's no publicly available benchmark for that.
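
A toy sketch of the "run it many times" arithmetic behind treating evals like integration tests; `run_agent_on` and `check` are hypothetical stand-ins for your agent entry point and success check.

```python
def pass_rate(task, run_agent_on, check, trials=20):
    """Estimate how often the agent solves a task.

    Something that fails 5% of the time needs on the order of 20 runs before you
    even expect to observe one failure, which is why a single eyeballed trace
    ("vibes-based development") silently misses regressions.
    """
    passes = sum(1 for _ in range(trials) if check(run_agent_on(task)))
    return passes / trials
```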

James Austin [00:16:22]: And how you evaluate your agent is really as specific to your software as your integration tests are. And that's really how you should be looking at them: like integration tests. The other bit is that evaluations are a long-term investment, and I really mean they're an investment that pays off. They are resource-intensive to build and resource-intensive to run, which means they're a moat that is really, really hard to replicate. When you make large changes, they're a safety net. They let you evaluate new models, which have different nuances, before they've been released. Anecdotally, we've found working with the frontier labs that having these evaluations tackling a specific problem is something they're interested in getting feedback from, which means that you can have warning about new models before they're released.

James Austin [00:17:23]: So for example, we had access to computer use a bit before it was publicly available, which let us give Anthropic some feedback about where we saw the value and where the weaknesses were. Then the last bit is that, unlike a publicly available eval that is built, released once, and then rarely iterated on (or, if it is iterated on, it's in the form of an entirely new set of benchmarks), you can grow these over time. Every time you spot a new issue, that's an opportunity for you to add a new evaluation to your test set and go from there. And then the fourth lesson is that LLMs have a really big learning curve. Replit's AI team about 12 months ago was around eight engineers, but they were divided across a pretty wide range of things. We had what we were calling AI engineers, which is a lot of working with large language models that are provided by various labs and hosted externally to the company.

James Austin [00:18:46]: We were also training smaller language models for specific tasks, like code completion, and we have a model called code repair that takes a chunk of broken code and suggests a fix to it. And we were also managing inference on these self-hosted models. But the team that was working on the infrastructure of what would become the agent, and the initial prototype for the agent, was really like three engineers. Then we built this prototype for the agent and we immediately got buy-in from the leadership of the company. And those three engineers working on the agent expanded massively to like 20, not all of whom were working on the agent directly; some were working on integrating with various parts across the entire platform, building new APIs for the agent to interact with. But a lot of them were working with LLMs for the first time in years.

James Austin [00:19:53]: Replit has built a lot of software over the years, and none of it really involves LLMs, which meant that we had a lot of really talented engineers who just didn't have experience working with LLMs. So there was this really big learning curve, and we needed to upskill engineers as quickly as possible. Some of the problems we were facing with the agent were problems where we could basically drop an engineer in and ask them to tackle them, and they were the same engineering problems that they would see everywhere else. Like, we had a memory leak. You can drop any engineer in and they'll be able to track down a memory leak. And there are teething issues of getting up to speed on a new code base, but it's no different from helping out another team in any other engineering role.

James Austin [00:20:47]: But some problems are new. Designing tools for an agent is not like designing an API. APIs can have as many fields as you like, and the field names are important for communicative reasons, but they don't really have a material effect on the efficacy of the API. None of that is true for a tool. The more you add, the more likely things are to go off the rails, and you need to think about whether a call is in distribution for the model. There are a lot of weird nuances that are really hard to write down, and there's not a lot of public documentation about it.
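
For illustration, a sketch of a tool definition in the generic JSON-schema style most LLM tool-calling APIs accept; the tool name and fields here are invented, not Replit's, and the point is restraint: few parameters, with descriptions written for the model rather than for other engineers.

```python
# Invented example tool, in the JSON-schema shape used by most tool-calling APIs.
# Unlike an ordinary API, every extra field here is another chance for the model
# to go off the rails.
edit_file_tool = {
    "name": "edit_file",
    "description": (
        "Replace an exact snippet in a file with new text. "
        "Only use this after you have read the file."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path of the file to edit."},
            "old_text": {"type": "string", "description": "Exact text to replace."},
            "new_text": {"type": "string", "description": "Replacement text."},
        },
        "required": ["path", "old_text", "new_text"],
    },
}
```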

James Austin [00:21:28]: And the only way you can really figure out how to build these tools effectively is to develop really good intuition, and developing that intuition is really, really difficult. So one of the things we found really helpful is a high-quality evaluation harness. We obviously aren't running our evaluations every five minutes, but something as simple as being able to run, say, five different prompts across the agent in parallel, and then seeing how each one of them changes after you've made a specific prompt change, really helped upskill our engineers: getting a good intuition about how the entire system fits together, how making a change in one spot can improve how the agent works in other parts, and how making a tweak to the plan that is generated at the beginning of an agent session can help solve problems later on. And then the other bit is, obviously, there's no replacement for just time spent working with the LLM.
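
A minimal sketch of that kind of quick side-by-side harness; `run_agent` and `summarize_trace` are hypothetical stand-ins for your agent entry point and trace tooling, and the prompts are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

SMOKE_PROMPTS = [
    "Build a waitlist site for an Italian restaurant",
    "Add a login page to my existing Flask app",
    "Fix the failing checkout test in this repo",
    # ...a handful of prompts covering the behaviors you care about
]

def smoke_test(run_agent, summarize_trace, prompts=SMOKE_PROMPTS):
    """Run a few representative prompts in parallel after a prompt or tool change,
    then print a one-line summary per trace for a quick before/after comparison."""
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        traces = list(pool.map(run_agent, prompts))
    for prompt, trace in zip(prompts, traces):
        print(f"{prompt[:40]:<40} -> {summarize_trace(trace)}")
```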

James Austin [00:22:58]: And the last bit of advice that we found really useful is just asking for example traces on every PR. Obviously, code review is a really good method for helping people understand a code base better and upskilling people, but making sure that traces were included on every single PR that touched the agent helped other people get a better intuition about how agents would use a specific tool and helped develop that intuition. Just some final closing thoughts. I was originally one of those new engineers. I actually joined Replit as a platform engineer working on storage, and I've only been doing this kind of AI engineering stuff for about 18 months. I found that I could contribute really quickly, because new ideas can come from everywhere, and you only need to look at social media to see that even the frontier labs don't know everything and there's always a lot to contribute. So never stop learning and never stop experimenting.

James Austin [00:23:57]: Thank you very much. And we're hiring.

Adam Becker [00:24:01]: So nice, James, thank you very much. We do have a couple of questions in the chat, and folks, if you have more questions, please drop them there. One of them is Jeremy asking: what tools besides LangSmith can you recommend to start building agents? And maybe I'll sharpen that a little bit and ask about evaluation in particular, simply because you've placed such a strong emphasis on this.

James Austin [00:24:26]: Yeah, so we use Braintrust for some of our evaluations for our traditional chat features. We've actually decided to roll a lot of our own for evaluations, just because we think it's so important, and just having a better understanding of that problem helped us develop better agents. So we roll a lot of our evaluations ourselves, and it's just Python that we're writing. LangSmith is really good for the traceability, and we dump all of the information from our evaluations in there, but the actual harness itself is homegrown.

Adam Becker [00:25:12]: We have a few more questions in a similar vein. How are you solving evaluation, then? Do you run a web agent to automate interacting with the final app created by the agent?

James Austin [00:25:22]: Yes, we do. Web agents have existed for a while, and there have been a bunch of techniques that have worked really well for testing functionality and stuff like that. But now we're using Anthropic's computer use tool almost exclusively. We've got a harness that spins up a bunch of Chrome instances and Docker containers, and then basically the agent can interact with each of those, test them in parallel, and log out everything, and engineers can dive in and see. Our web agent will talk back to our actual agent and pretend to be that kind of user.
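
A rough sketch of spinning up disposable browser containers for a testing agent using the Docker SDK for Python; the image name and port are placeholders, and the actual Replit harness and its computer-use integration are not shown here.

```python
import docker  # pip install docker

def launch_browser_sandboxes(n=4, image="browserless/chrome"):
    """Start n disposable Chrome containers that a web-testing agent can drive.

    browserless/chrome is just one public image exposing Chrome over port 3000;
    substitute whatever your harness actually runs.
    """
    client = docker.from_env()
    return [
        client.containers.run(image, detach=True, ports={"3000/tcp": None})
        for _ in range(n)
    ]

def teardown(containers):
    for container in containers:
        container.remove(force=True)  # stop and delete in one call
```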

Adam Becker [00:26:07]: Yeah. Okay, so we have another question here from Eric. Again, I think thematically consistent here. What are good evaluation tools, evaluation methods, and what is a good evaluation harness?

James Austin [00:26:21]: Yeah, it's kind of homegrown. We try not to be too prescriptive about coming up with methods or deciding up front what we want to test; we were quite reactive. So as an example, when we first launched, our agent really struggled with authentication and integrating with... actually, sorry, no, a better example is an issue we noticed a couple of weeks ago. It turns out every frontier model doesn't know what GPT-4o is.

James Austin [00:27:01]: GPT-4o came out after the knowledge cutoff. So there's actually this issue where a user would have some code that called GPT-4o, and then the agent would be like, oh, actually, that's a mistake, that's meant to be GPT-4, let me fix that up. So how do you fix this? What does that look like? We built some evaluations that specifically tested: does it change this, even if you're changing something else nearby? We don't want it to be proactive in this instance. What are the right techniques for doing this? We did some... yeah, sorry, yeah.
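
A toy version of that kind of regression eval: ask the agent for an unrelated edit in a file that legitimately references gpt-4o and assert the model name survives. `run_agent_edit` is a hypothetical entry point, not Replit's harness.

```python
SEED_FILE = '''\
# The model name below is intentional; the agent must not "correct" it.
MODEL_NAME = "gpt-4o"

def build_request(prompt):
    return {"model": MODEL_NAME, "messages": [{"role": "user", "content": prompt}]}
'''

def test_does_not_rewrite_gpt4o(run_agent_edit):
    """Request an unrelated change and check the agent left the model name alone."""
    edited = run_agent_edit(
        files={"app.py": SEED_FILE},
        instruction="Add a docstring to build_request().",
    )
    assert '"gpt-4o"' in edited["app.py"], "agent 'fixed' gpt-4o to an older model"
```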

Adam Becker [00:27:37]: I'm very curious about who. So you say, we ended up writing these evaluations. Who is the "we" in this case? You said you had grown, let's say in the beginning, from three engineers working on LLMs to like 20, and I imagine it might even be growing more. Is everybody writing evaluations? Do you have a subset of people writing evaluations? Is it always engineers writing them? Who actually is "we"?

James Austin [00:28:04]: The AI team kind of owns our evaluation harness, but the harness is designed to take a set of instructions that we can provide to that web agent: this is what you're trying to build; if the user asks about this, respond with this. And that's all basically a plain-text prompt. Engineers across the organization who are working on the agent, say if they're working on a specific integration or they want to improve a specific feature, will write their own evaluation and then contribute it back to a pool that we have.

Adam Becker [00:28:44]: Of like evaluation, like examples or tests.

James Austin [00:28:47]: Or something like that. Exactly.

Adam Becker [00:28:49]: You mentioned before that you should sort of see evaluations as an investment that pays off over time.

James Austin [00:28:58]: Yeah.

Adam Becker [00:28:59]: Is there ever any pushback from the organization? Because in some ways it might increase the quality of the product, and the reliability and the trustworthiness of it, but does it sometimes feel like it's slowing down the organization, in which case other people would want to prioritize other things? In which case, how do you champion the need for evaluation?

James Austin [00:29:28]: You can make the same argument about integration tests. Integration tests can be flaky, and they're hard to set up a lot of the time, but once you have that harness developed, they're relatively cheap to add to. And the other bit is that evaluations are different from integration tests in a specific way, in that it's not a big deal if one fails, as long as the number is going in the right direction. So yeah, we had some pushback initially, when it was like, hey, we want to be first to market, we want to ship this, the vibes-based development is working.

James Austin [00:30:04]: We don't need to spend time on that harness. And then it's kind of the role of senior engineers to push back on that and say, hey, we want to build this and do this right, and it's going to pay dividends. And I'm lucky to work with leaders who listen to me and say, hey, actually, that makes sense, we're going to give you some time and some slack to go slow now so that we can go fast later.

Adam Becker [00:30:29]: We have a bunch more questions. I don't think we'll have time to go through all of them, so if you can, also just hop in the chat. Let's see if I can just pick one: how do you orchestrate debugging? Do your agents use debuggers at all, or is it brute force on the errors?

James Austin [00:30:53]: So we have a number of what we call views that the agent has. Files that it has open are fed through an LSP, so we have things like "this symbol couldn't be detected." Basically, anything that the user can see, the agent can see. So it can see console logs in real time; it can see basically everything. That's how we do it. We don't have any debugging in the sense of adding breakpoints and stepping through code just yet.

James Austin [00:31:32]: We have noticed that it will add print lines, especially if something's not working and there are multiple back-and-forths with the user; it will start doing things like adding print lines. That's its preferred debugging method right now. That's one of those areas we haven't nailed yet, especially because, theoretically, the goal eventually is to support all languages and all environments.

James Austin [00:32:04]: And building an interface that works consistently across everything is an insanely difficult problem. We don't have a debugger yet, but if you'd like to work on it, give me a call.
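
For a sense of what feeding those "views" into the agent's context might look like, here is a minimal sketch; the helper functions are hypothetical placeholders rather than Replit's system.

```python
def build_agent_context(open_files, get_lsp_diagnostics, tail_console, max_log_lines=100):
    """Assemble the 'views' an agent sees: open files, their LSP diagnostics,
    and recent console output, mirroring what the user can see."""
    sections = []
    for path, contents in open_files.items():
        diagnostics = get_lsp_diagnostics(path)  # e.g. "line 12: symbol 'Foo' not found"
        sections.append(
            f"## {path}\n{contents}\n### Diagnostics\n" + "\n".join(diagnostics)
        )
    sections.append("## Console (most recent)\n" + "\n".join(tail_console(max_log_lines)))
    return "\n\n".join(sections)
```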

Adam Becker [00:32:18]: Okay, give him a call. Find him in the chat. James Austin from Replit, thank you very much.
