Claude Plays Pokémon - A Conversation with the Creator
SPEAKERS

David Hershey devoted most of his career to machine learning infrastructure and trying to abstract away the hairy systems complexity that gets in the way of people building amazing ML applications.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
Demetrios chats with David Hershey from Anthropic’s Applied AI team about his agent-powered Pokémon project using Claude. They explore agent frameworks, prompt optimization vs. fine-tuning, and AI’s growing role in software, legal, and accounting fields. David highlights how managed AI platforms simplify deployment, making advanced AI more accessible.
TRANSCRIPT
David Hershey [00:00:00]: I'm David Hershey. I am on the Applied AI team at Anthropic, and I like a latte in the morning.
Demetrios [00:00:07]: Welcome back to the ML Ops community podcast. I'm your host, Demetrios, and today I'm talking to an old friend, David. He's been working at Anthropic, and it is a blast every time I catch up with him because I learn new stuff, I chat with him, and he opens my mind a little bit. Recently he's gotten quite famous because of the Anthropic model that has come out and is playing Pokémon. That is cool. He made that, and we talk all about it. Let's get into this conversation with Señor David. I had to put on my nice outfit because I knew that I was going to be talking to someone who is famous.
David Hershey [00:01:02]: I don't know about that, but I appreciate it.
Demetrios [00:01:04]: Dude, you created the Pokémon thing. Claude Plays Pokémon. And I didn't even know it. You were showing me that, and I was like, yeah, this is great. I thought it was just Anthropic that did that. I didn't realize that was your thing.
David Hershey [00:01:18]: Yeah, that is. That is my baby. My side project for. For a while. It has an interesting, like, story of how it came to be a little bit. Cause it wasn't always like this, you know, it got put out into the world this weekend. For a long time, it was just sort of like my. My little fun side hustle.
David Hershey [00:01:35]: But. But yeah, that is my baby. Somehow people are watching it. Like, it's been stuck in one spot for two days and there's 1,500 people still watching it. So I'm. I'm sort of, like, amazed that people care and are still watching it.
Demetrios [00:01:50]: That is incredible. Tell me about, like, how did you even get that idea? How did you go about executing on it?
David Hershey [00:01:56]: Yeah, no doubt. So, like, honestly, so it started, like, the first time I tried this was in, like, June last year. There's basically, like, two things that were true. I was working. I work with customers at Anthropic. I was working with a bunch of customers who were all working on agents. And, like, I wanted someplace for me to be able to, like, build out agents to some extent.
David Hershey [00:02:18]: Right. Like, some of our more successful customers were building really cool agents. And it's just like, I needed some playground for myself to build it on. And so it's like, okay, I'm excited to try to build with agents. How do I want to do it? It's like, well, I should probably do the thing that's going to be, like, the literal most fun. Like, if I'm going to really get into it, I'm going to have some fun along the way. And someone else at Anthropic, actually before me, had tried hooking up Claude to Pokémon, like an initial little test, and so I was just like, do it. So I, I.
David Hershey [00:02:45]: Back in June last year I, I kind of like dove in and, and built out a little bit of like different, a handful of different like agent frameworks to, to try to play Pokemon. And then since then it's just like kind of like every time we've released new models I've been slowly iterating and improving and, and sort of like working on it, seeing progress, that kind of thing. So. But yeah, it just started out of like purely like I want to do this thing and I'm going to like make sure I have a great time along the way. Which on a side note, I then became like deeply obsessed. If you ask my wife, she's probably like kind of upset at me at how much how obsessed I am with this thing.
Demetrios [00:03:22]: And you realize that it was good enough to release to the world now because why?
David Hershey [00:03:28]: Yeah, like okay, so we tried this on Sonnet 3.5 when it came out in June.
Demetrios [00:03:35]: Yeah.
David Hershey [00:03:35]: And, like, you could see it do some stuff. Like, it got out of the house. Like, it meandered around, but it struggled. Right. We tried it with Sonnet 3.6, unofficially, AKA Sonnet 3.5 (new), in October, and it got a little bit further. Like, it got a starter Pokémon, it got out, like, it did some stuff, right. And I actually thought about releasing it then. And the thing that was kind of happening in the middle was I would post these updates to our Slack about this. Like, I have a Slack channel at Anthropic about this, all about Pokémon. The internal one is all about Claude Plays Pokémon too.
David Hershey [00:04:14]: People were following along, right. Like, so I joked that I was, like, Claude's social media manager while it played Pokémon, kind of, for a while. Like, I would just pull out GIFs and clips of it, like, doing stuff, and people would get kind of hyped. And it was, like, fun enough even back with the last model that it was like, should we put this out? Like, it's kind of fun, but it was pretty bad. Like, it didn't get very far. It would have been, like, a 12-hour experiment. There really wouldn't have been much to it. And so then with this model, it just kind of reached the breakthrough where you could see it kind of meaningfully do stuff and make progress.
David Hershey [00:04:50]: And it's still, like, to be clear, in its current state, there's stuff it's good at, there's stuff it's bad at, but it's, like, enough that it does move through the game slowly. And there's also this, like, give and take where it, like, does something really stupid for a while and you're like, no, Claude, why? And then it, like, solves it. It's just, like, kind of. It's got that tension that's good in content a little bit. And so it became like. And I think people internally following along were just, like, having a really good time. By the time Claude was, like, beating gym leaders and stuff, you know, people were, like, freaking out, having a good time.
Demetrios [00:05:25]: So that pushed you to release this?
David Hershey [00:05:26]: Oh yeah, a little bit of that. The release actually came late. But then the other side, like, honestly is like, we put it out in our research blog, like the chart of the different models making different amounts of progress. And there's, like, literally just some amount of, like, it's an interesting way to see how models handle these, like, long-horizon decisioning things. All of these evals that people are used to, like MMLU and GPQA. There's all these evals that exist out there, and most of them are like, here's a prompt and I get a response, and, like, does it get it right or not? You know, and there's, like, fleetingly few that are actually testing the model's ability to, like, take in new information, try another thing, make progress. And that's because it's, like, pretty hard to actually measure those things.
David Hershey [00:06:10]: Like doing 10 hours of work. It's like pretty hard to measure like, how good was that 10 hours of work? You know, when a model does it. But Pokemon, it's like, I don't know, you beat a gym leader. That's a thing that happens after 10 hours, right? And so it actually kind of to some extent even internally for us, became a measuring stick of like, how well can this model, like stay coherent over hours and hours and hours and prompts and prompts and prompts of like taking new information, trying to learn, update, do stuff. It became interesting enough for us to like understand what the models were good and bad at still. But it was like also just like a good thing to put out there to show people what can this do. Like, why does this matter? I know Pokemon's kind of like a goofy way to do that to some extent. I think it resonates with some people that, like, oh, these models aren't just a chat thing.
David Hershey [00:07:01]: I type in a prompt, but they can kind of go do stuff sometimes. Even if it's not that good comparatively, it's better than it has been.
Demetrios [00:07:07]: Do you feel like you're going to now start simulating a whole bunch of these to get some data and then maybe try and make it better for the next one?
David Hershey [00:07:19]: I think, like, part of what's fun about this is that, like, it's not trained on Pokemon. You know, like, part of the fun thing is like, this. It's exploring this thing for the first time and getting a feel for that. And so I think we're going to stay in the version of this where it's like just sort of like a good way to see how it experiences these new environments that it's never really been trained to do.
Demetrios [00:07:42]: Well, take me through the internals. What does it actually look like? You mentioned you created an agent framework for it. You didn't want to grab anything off the shelf or you, like, what's.
David Hershey [00:07:52]: I'm a learn-by-doing kind of guy. You know, sometimes you, like, don't get to the depth of, like, how does this thing work and why does it work that you wanted to get to? So that that was where it came from. I learned this, like, my favorite way to learn this. This is a complete tangent, but I took Karpathy's, like, computer vision class in grad school, and he has this, like, write-gradient-descent-from-scratch exercise, like, homework exercise he did, where it's like, you know, like, TensorFlow had just come out at the time. You could do that, but it's like, no, you're gonna, like, you're gonna write it. You're gonna figure out how to, like, implement a machine learning framework yourself. And that's, like, the way that I understand machine learning now is, like, 30% because of that one assignment.
Demetrios [00:08:30]: The. The value. The time to value that you got. You still remember that one.
David Hershey [00:08:34]: Yeah, yeah, yeah, yeah. Part of what's so fun about, like, doing this and building in this way with these models is like, you just like, learn a lot about them when you stare at them, right? A million tokens. You know, like, I've seen Claude write, I don't even want to know how many words about Pokemon. But, like, you learn a lot about, like, how it thinks by just, like, reading it a lot. Like, getting into the weeds, seeing how it reacts to different prompts and stuff like that. That's like, the core of why I decided to kind of go my own way. I tried a few, like, the published papers. Like, I tried Voyager back when that was a thing, and a few other things.
David Hershey [00:09:08]: But, like, at the end of the day, I just kind of, like, wanted to hook something up myself.
Demetrios [00:09:11]: And what does it actually do? Like, what is the way that it takes in world information and then, like, acts on that?
David Hershey [00:09:20]: Yeah, it's actually, like, pretty simple. Over time, I've, like, actually stripped out a lot of complexity from it. So I'll go over it quickly, but it's like, it's not the craziest thing in the world. It has, like, a quick prompt to tell it, like, it's playing Pokémon. And I give it access to three tools, essentially. It has, like, the ability to press buttons. It can press, like, A, B, Start, Select, up, down, left, right, you know, like, press buttons. It has this, like, concept of a knowledge base that's actually, like, stuck in its prompt, but it's just, like, bits of information that it stores to keep track of things over long periods of time.
David Hershey [00:09:57]: So it's like, electric is super effective against water. That's the thing it could, like, decide to write there, potentially, but it, like, fully controls that, so it can, like, add sections to it, edit sections, that kind of thing. And then, so Claude's, like, still not that good at, like, actually seeing the screen. So the last tool I have is what I call Navigator, and that lets it, like, point to a place it wants to go on the screen. And, like, it'll just, like, automatically move it there if it's, like, within reach of, like, the current screen. It doesn't have, like, a great understanding still of, like, the difference in its position between, like, where it is and where it wants to go. So, like, if you just, like, let it say where it wants to go, it does a little better. It's the only, like, simplification we really make for it.
David Hershey [00:10:42]: And then, like, how it actually sees the world is when it presses a button, it gets a screenshot back, like, to see where. Where it is after the button press. It also gets, like, a little dump of some stuff that, like, reads directly from the game, like, its current location and a few other things. But it's, like, pretty minor. But basically, like, it presses buttons, it sees what happens afterwards, and then it presses the next button and it goes and goes and goes.
Demetrios [00:11:06]: Kind of like me with a new program.
David Hershey [00:11:08]: Yeah, yeah, yeah, more or less. It's not that different. It's not that different from how you play it.
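A rough sketch of the loop David just described, for the curious. Everything here — the tool names, the emulator interface, `call_claude` — is a hypothetical stand-in for illustration, not the actual harness:

```python
# Hypothetical sketch of the observe-act loop behind Claude Plays Pokemon:
# the model gets a prompt plus a screenshot, picks one of three tools, the
# harness executes it, and the next screenshot comes back as the new
# observation. All names here are illustrative assumptions.

BUTTONS = {"a", "b", "start", "select", "up", "down", "left", "right"}

TOOLS = [
    {"name": "press_button",
     "description": f"Press one Game Boy button: {sorted(BUTTONS)}."},
    {"name": "update_knowledge_base",
     "description": "Add or edit a named section of long-term notes that "
                    "stays pinned in the prompt."},
    {"name": "navigate",
     "description": "Point at a reachable on-screen tile; the harness "
                    "walks the character there."},
]

def agent_step(emulator, knowledge_base, history, call_claude):
    """One step: show the model the latest screenshot and game state,
    then execute whichever tool call it returns."""
    observation = {
        "screenshot": emulator.screenshot(),
        "state": emulator.read_state(),  # e.g. current map location
    }
    action = call_claude(
        system="You are playing Pokemon Red.",
        knowledge_base=knowledge_base,  # model-managed notes, in the prompt
        history=history,                # recent turns (older ones summarized)
        observation=observation,
        tools=TOOLS,
    )
    if action.tool == "press_button":
        emulator.press(action.input["button"])
    elif action.tool == "update_knowledge_base":
        knowledge_base[action.input["section"]] = action.input["text"]
    elif action.tool == "navigate":
        emulator.walk_to(action.input["x"], action.input["y"])
    history.append(action)
    return action
```

The point of the sketch is how thin the harness is: aside from the Navigator concession, the model does all the deciding.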
Demetrios [00:11:14]: But wait, so there was one thing there that is interesting is the prompt that Claude itself has the ability to update things in its own prompt.
David Hershey [00:11:26]: Yeah. So the key insight there is like, you know, you're playing Pokémon. Let's see, I'm looking at the stream over here. In the, like, three days since it's been up, it's taken 16,884 actions as of this exact moment in time. Wow. Um, and so, like, if you think about that, that, like, roughly correlates to, like, 16,000 screenshots, like, a whole bunch of stuff that it has seen over that time.
David Hershey [00:11:52]: And, like, that much information doesn't fit into the context window of a language model. Right. So you need some way to, like, be able to condense, like, to get rid of old information. And so the knowledge base, I'm, like, literally not sure this is the most optimal way to manage this, but the knowledge base is one way that it can keep track of information over longer periods of time. So what ends up happening is it, like, takes 30 actions and then it summarizes the conversation, like, the things it did for the last 30 actions, and, like, chunks that down. And then it does it again. It sort of, like, accordions like that. Right.
David Hershey [00:12:30]: So it takes 30 actions, does a summary, takes 30 actions, does a summary, that kind of thing. But if it tried to keep everything in the summaries it writes, they would kind of bloat. So the knowledge base is a way for it to, like, kind of track longer-horizon things.
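That accordion can be sketched in a few lines. A minimal, illustrative version, assuming a 30-action chunk size and a `summarize` function that in the real system would be another model call:

```python
# Illustrative "accordion" context management: every time the transcript
# grows past one chunk, fold the oldest turns into a running summary and
# keep only the most recent turns verbatim. The knowledge base (not shown)
# stays pinned in the prompt separately for longer-horizon facts.

ACTIONS_PER_CHUNK = 30  # roughly every 30 actions, per the interview

def manage_context(turns, running_summary, summarize):
    """Return (recent_turns, new_summary) small enough to fit in context."""
    if len(turns) <= ACTIONS_PER_CHUNK:
        return turns, running_summary
    old, recent = turns[:-ACTIONS_PER_CHUNK], turns[-ACTIONS_PER_CHUNK:]
    return recent, summarize(running_summary, old)

# Toy stand-in summarizer that just counts what it folded away:
def toy_summarize(prev, old_turns):
    folded = int(prev.split()[0]) if prev else 0
    return f"{folded + len(old_turns)} earlier actions summarized"

turns, summary = manage_context(list(range(75)), "", toy_summarize)
# turns is now the 30 most recent actions; summary covers the other 45
```

The design trade-off David names shows up directly here: anything the summarizer drops is gone, which is why a separate, model-editable knowledge base is useful for facts that must outlive many accordion folds.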
Demetrios [00:12:43]: Yeah, that makes sense. Well, dude, talk to me about some of the stuff that you're doing right now at Anthropic. I know you're leading a team that is engaging on fine-tuning, and I had the big question of, hey, is fine-tuning really all it's cracked up to be? And I say that because we did a roundtable with Nvidia probably, like, a month ago. And one thing that was abundantly clear, because we did the whole roundtable on fine-tuning, was folks were like, I've seen way more lift for way less effort by just tuning my prompt, not fine-tuning the model. But what are your opinions on it?
David Hershey [00:13:28]: I think that's about right for the vast majority of things. Honestly, like, to some extent, that's the best thing about language models. Right. Like, I come from the world of machine learning, you and me both. Like, when you have to make a change to a model by, like, training it and getting new data and stuff like that, that is slow. There's a reason MLOps is such a hard thing: because getting that all right is really freaking hard. And comparatively, like, prompting is a miracle where you can, like, iterate on this thing with a, you know, even if you have a big eval suite, it's, like, a 10-minute iteration cycle, not a three-week iteration cycle or whatever.
David Hershey [00:14:09]: And so, from my experience, I frequently encourage anybody who's thinking about fine-tuning. Like, one of the first things I go in and do is ask, how far have you really pushed prompting? Like, how far have you really gotten with it? And, you know, there are people that have, like, really stretched it to the limit. But there are a lot of people who, maybe because prompting is probably still a little weird, haven't quite figured out how to get it right. It can be finicky. It's not easy to, like, get the best prompt. And I think so.
David Hershey [00:14:45]: Some people, like, get stuck partway and think they're at the top. But until you, like, really are confident that you've gotten the most out of a prompt, I think it's, like, almost never a good idea to even consider fine-tuning for most use cases. And there's so much you can do to add knowledge with RAG, with, like, whatever other way you want to put things in the prompt, that are way easier than trying to, like, do fine-tuning. Yeah. For someone whose job is to help people with fine-tuning, broadly, like, I spend a lot more time telling people that they should, like, steer away from it to start, because I think you gotta be, like, pretty precise. It's really expensive.
David Hershey [00:15:23]: Like, it's just, like, it's challenging. I don't mean expensive from, like, the cost of even doing it. It's time. The cost of people's time working on it is just really, really hard to justify more often than not. Yeah, obviously, like, I have a job, so I don't think that means it's always a bad idea.
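That 10-minute iteration cycle is really just: change the prompt, re-run a small eval set, compare scores. A toy sketch of such a harness, where `fake_model` stands in for a real model API call and everything else is illustrative rather than anything Anthropic ships:

```python
# Minimal prompt-iteration harness: score each candidate system prompt
# against a small labeled eval set and keep the best one.

def evaluate_prompt(prompt, eval_set, call_model):
    """Fraction of eval examples answered correctly under this prompt."""
    correct = sum(
        call_model(prompt, ex["input"]) == ex["expected"] for ex in eval_set
    )
    return correct / len(eval_set)

def best_prompt(candidates, eval_set, call_model):
    scores = {p: evaluate_prompt(p, eval_set, call_model) for p in candidates}
    return max(scores, key=scores.get), scores

# Toy run: a fake "model" that only behaves when the prompt mentions math.
eval_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
fake_model = lambda prompt, x: str(eval(x)) if "math" in prompt else "?"
winner, scores = best_prompt(
    ["be brief", "you are a careful math tutor"], eval_set, fake_model
)
# winner == "you are a careful math tutor"
```

Swapping the candidate list and re-running takes minutes; swapping a fine-tuned model takes a data pipeline and a training run, which is the asymmetry David is pointing at.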
Demetrios [00:15:40]: You have a job and a team that you lead. So when is it right to do it?
David Hershey [00:15:45]: Yeah, great question. I think there are a handful of use cases at times where fine-tuning really makes sense. Um, I work on and see a subset of them, but I'll try to, like, give you an overview of my viewpoint on where it can be a good idea. Um, one I guess I'll call out first is a thing that works but also feels a little bit to me like a trap, which is the pattern of trying to train a smaller model to do something that a bigger model can do well, particularly to save money. That's the thing that I've seen occasionally. The reason that I think that is, like, a bit of a trap is when you look at the last few years of model development, like, every handful of months a cheaper, smarter model comes out. And just by, like, kind of doing nothing with your whole dev team, you can often get that same outcome. Like, you can get a cheaper, faster model that does the thing that the last model did fine, just with a little bit of patience. With just patience, right? And I think, like, unless it's such a layup because you already have data and you know exactly how it's going to work, the process of getting fine-tuning right is still pretty challenging.
David Hershey [00:17:00]: Like, building the right data set, making it work well, making it not, like, degrade the other capabilities of a model, these are all pretty challenging things to get right. And you're going to end up sinking a lot of developer time into something that you could get by just, like, doing nothing. So that's one that I think, like, it does work. Like, you can have success. Like, you can do the thing. But I think people need to be a little bit more, like, skeptical of whether they should do it, even if it's possible.
Demetrios [00:17:25]: Because that was that use case, right? One of the folks that was in that roundtable, they said, look, I cut my cost by, like, 90% because of that very thing that you're talking about. But that's once they get a model that is working right. So the cost of the model spitting out information is probably 90% cheaper. But what was the cost of getting that model?
David Hershey [00:17:50]: I mean, there's, like, an easy corollary to that within Anthropic, which is we put out Haiku 3.5 four months ago, right? I think it was November, my time. I lose track of time here, but it was somewhere around there. And benchmarks-wise it's about as good as the Opus that we put out in March of last year. So it's, like, eight months or something like that between the two, and it's, like, literally a tenth of the cost. So that's. There's your 90% cost savings there, right? So it's like, this is the actual thing that happened. So, like, maybe you care about that six months. Like, maybe you get it done in two or three months.
David Hershey [00:18:27]: And you saved money for six months, and that makes a big difference for you. But, like, I don't know, the opportunity cost of people who could work on machine learning is pretty high, I think. It's not clear to me that the trade-off is great, is all I would say. Which is, again, it works. Like, if that math makes sense to you, I think it makes sense.
Demetrios [00:18:45]: If you're at that scale, then of course, go for it.
David Hershey [00:18:48]: Right? Yeah, right. Like, if you save $5 million for a million dollars of work or something like that, then, like, you made $4 million. I suppose the flip side does still exist, though, that, like, you could have done maybe something else that was more valuable than that with your time. Anyway, we're getting into the nuance of this, but I just, like, I personally would approach that with a little bit of skepticism is all.
Demetrios [00:19:12]: And the other side is for format, right?
David Hershey [00:19:15]: Uh, yeah. And so then there's like this other form of like, kind of relatively simple fine tuning that I think works, which is just like, there's a handful of cases where it's like, I needed to follow this output format a little better. I needed to, like, understand this input format of data I have. So, like, maybe I've got some specific kind of document that I want the model to, like, understand a little bit better and be able to work with a little bit better. And there are just like a bunch of, like, little examples of like, my data looks like this. This data is like, a little different than what the model's ever seen before. Yeah, it's not that hard to, like, reason about, but, like, getting the model to like, actually get it might be really powerful. And I've seen, like, other things, like, you know, classifiers with language models.
David Hershey [00:19:58]: Like, language models have this, like, pretty good understanding of text in general. And, like, you can get them to do, like, simple classification tasks that work really well. There's a handful of stuff like that that's, like, kind of in the, this is an easy task, but I just kind of need to get this model that doesn't know much about it to understand this data. Those are, like, pretty reasonable. I've seen a decent number of people have success with those. And then there's this, like, third category, which, frankly, is where I spend most of my time, which is actually taking a really good model and trying to make it better at something. And that's really hard, to be clear, you know. Like, there's a reason why the research labs make good models, and make better models than most other people can.
David Hershey [00:20:44]: 'Cause, like, they hire pretty smart people to work on this task of, like, how do you take a model and do the last-mile fine-tuning to make it really good? It's, like, sufficiently hard that most people really should, like, not engage with it. I don't think that extra juice is worth the squeeze for a lot of people. I think, like, in the vast majority of these cases, you could typically take a model and get good-enough performance off the shelf. But if you just, like, happen to be in the place where you really care about the 5% past where a model can get right now, maybe that's, like, why you're competitively differentiated. Maybe the difference between your product kind of working and not is that last 5%. Then I think it can make sense to do research, and that gets, like, pretty sophisticated. I think there's, like, all sorts of different fine-tuning methods that can work there. And you have to be, like, it's a research project very much at that point. It is not, like, an engineering project.
David Hershey [00:21:35]: It's like we need to do research to figure out what data methods and tooling are going to actually make a model better at a task. But it's certainly possible. Like I mean labs all the time make models better at specific tasks. You've seen it in a whole bunch of different ways as new models have come out over the last like in particular six or eight months. So it's like obvious if you squint that labs have the capability to like take a model and make it better at a thing. And if that's true, then that should be true for anybody effectively. But the cost is really high. So like it just has to be like, so worth it to go down this, this sort of research endeavor.
David Hershey [00:22:14]: But it does work. Like, you can take an arbitrary problem and think about how do I make a model better at that problem.
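For the second category David describes — teach the model a format or a simple classification — the training data is often just input-label pairs in JSONL. A hypothetical sketch (the `prompt`/`completion` field names are an assumption for illustration; every fine-tuning API defines its own schema):

```python
# Illustrative ticket-classification dataset in JSONL form, the common
# upload format for fine-tuning jobs. Check your fine-tuning API's docs
# for the field names it actually expects.
import json

examples = [
    {"prompt": "Ticket: 'App crashes on login'\nLabel:",
     "completion": " bug"},
    {"prompt": "Ticket: 'Please add dark mode'\nLabel:",
     "completion": " feature-request"},
    {"prompt": "Ticket: 'How do I reset my password?'\nLabel:",
     "completion": " question"},
]

# One JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)

with open("train.jsonl", "w") as f:
    f.write(jsonl + "\n")
```

Even for this "easy" category, the hard part is what David emphasizes elsewhere: curating enough examples that the model learns the format without degrading its other behavior.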
Demetrios [00:22:20]: Well, it's funny that you mentioned before also something that I wholeheartedly agree with you on. Since we're not super clear on what is happening when you send in the prompt, and you don't know if the output is not good enough because my prompt is not good enough or because I need to now make the actual model better, a lot of folks potentially default to fine-tuning as the next step. Because it's just like, oh well, I gotta make this output better. So I guess if I can't get there with my prompts that I bought from an AI influencer off of Twitter, then I think I should now start looking into fine-tuning. And that is hopefully what we are advising folks not to do.
David Hershey [00:23:16]: Yeah, I think that's, like, exactly right. There's this, like, kind of mysterious promise of fine-tuning that exists sometimes, I think, where it's like, if you put all of your data into the machine learning box, surely it will get better. And as someone who has, like, put a lot of data into the machine learning box and watched the models get worse more often than they get better for, like, the average set of data you get, I can tell you that's, like, a very expensive task to take on by default. Like, it's not straightforward, it's not obvious. It's not obvious that any single set of data is going to make the model better at what you care about. Like, that's why I describe some of this as, like, research to some extent, because actually figuring out what's going to impact the behavior you care about just takes time, and it's not easy. And so I think, like, yeah, there's something about the pull of the power of machine learning getting us there. It's gotten us a long ways, but I have certainly seen a lot more teams have a lot more success working on prompting longer than fine-tuning, especially in, like, raw count.
Demetrios [00:24:27]: I really want to talk to you about agents because that is the buzzword of the year and it feels like everyone is trying to do things around it. I have recorded a ton of episodes with folks that are putting agents into production. I feel like you've probably seen some really cool use cases with agents. What are some just off the top of your head, areas you want to jam on around agents?
David Hershey [00:24:55]: Obviously, like, we have seen the same thing, which is, like, I think a lot of the people that we've seen have the most success with Claude have been building agents. Yeah. And we've put a lot of work into making models that are, like, better at doing agentic-shaped things. We are very much of the belief that that is what's to come. You know, like, it's funny. I think you think about agents, and part of the funny thing that I think has happened, and part of the funny thing about working in this industry, is it's kind of hard to know when and what model is going to come out that makes a specific agent work well. I don't think we internally at the labs, like, have a perfect grasp, forecasting models for the future, on what's going to be good enough to solve some agent task.
David Hershey [00:25:51]: Right. So part of the funny thing is, like, you kind of just sit around knowing that there's something, like, you squint and something looks like it's going to work, and then maybe one of these models is, like, kind of good enough to hit the liftoff point where it's actually valuable. So to give you an example, like, that clearly happened with coding last year, right? I think, like, at the tail half of last year, and especially when we released the updated Sonnet 3.5 in October, you saw a lot of the coding agents really, really explode. And I think that's because it was the first time where a model really got to that next tier, where it went from, like, a cute thing that I'm happy to play with to, like, whoa, this is really, really good.
Demetrios [00:26:36]: Yeah, it was useful. It saves me time. I can describe things. That's when you saw the YouTube videos of, like, my 8-year-old daughter just built a website.
David Hershey [00:26:47]: Exactly, exactly. And that's the thing that I think is kind of interesting, because what that implies a little bit is you just kind of have to get a feel for what we're close to, like, what could pop next, a little bit. And maybe it's hard to know if it will, when it will, which release from which lab. Like, who knows what's going to make something really, really great. But I think you can, like, start to see different things take off. So coding is the one that has obviously taken off. We released Claude Code this week as part of the release, and that's, I think, a really awesome example. Like, part of what's cool about Claude Code is it's still, like, a fairly lightweight thing around the model, and it really highlights how the model itself is getting better at being this thing. So coding has quite clearly exploded, and I have, like, hunches about other things I think could.
David Hershey [00:27:51]: You know, it's hard for me to say for sure what will, from the same perspective that you have. But I think there's, like, some more sophisticated, for example, legal workflows I could imagine getting a lot better over long horizons. And we have people that work at Anthropic who have done work on things like accounting workflows. Not, like, a product we're working on, to be clear. But, like, I know people at Anthropic who have in the past worked on that kind of thing. I think there's, like, a lot of stuff that is probably further out, but, like, you could imagine a model that gets good enough at using a spreadsheet. There's just, like, a ton of manipulation of spreadsheets. So much of the work that gets done is, like, menial moving things around spreadsheets and running formulas and stuff. You can imagine, like, models getting better at that kind of thing, and that having a really big impact on what kinds of agents can do actually useful work. If you can go to claude.ai and say, like, here's the spreadsheet and here's this analysis I need to run. Can you, like, go do it and.
Demetrios [00:28:55]: Create a data visualization from it?
David Hershey [00:28:58]: Yeah, like, like build the analysis, build a model, like predict, forecast this thing, tell me what the answer is and like give me the sheet to show it.
Demetrios [00:29:05]: I need a dashboard for my boss.
David Hershey [00:29:07]: Sure. Right.
Demetrios [00:29:08]: In 20 minutes when my meeting is.
David Hershey [00:29:11]: Yeah, yeah. I think part of the frustrating thing on the flip side of this is that even I, with about as much information as I could have, find it hard to predict what the next thing that's going to work is. But if you asked me what the pattern is for what's going to happen with agents, it's that it could happen because of this model we just put out this week. It's hard to know, to some extent. But models are going to come out, some startup is going to be building an agent in some category, and it's like, oh, it works now. It didn't work, and now it's good. And then they're just going to explode. It's going to be a month and they're going to explode, and it's going to be all you hear about, the way Claude Code took off, à la Cursor.
Demetrios [00:30:02]: 100 million in whatever X amount of months, not years.
David Hershey [00:30:07]: Right? Yeah. And that's like, I think we're going to see, you're going to see that pattern happen with just like a handful of different products, workflows and tools people have where like around some model that gets like the key set of Small capabilities. You need to make it so that it went from like, oh, it looks like it's about to do this, but then it, like, tripped and stubbed its toe. And now I have to go figure out what went wrong. And it, like, ruins the immersion of trying to use this agent to like, oh, I asked it to write this code and it just did it. And I, like, didn't think about it. And that's like, a world of difference.
Demetrios [00:30:40]: It's funny you mention that, too, because of the hackiness you have to do when your agent doesn't work. You spend all this time trying to just get it there so you don't have that experience, or your end user doesn't have that experience of, oh, I thought it was going to work, and then it didn't. And so the reliability of the agents is so much of why.
David Hershey [00:31:03]: It's like the tipping point, you know, it's like, as soon as you have to go in and figure out what didn't go wrong, well, like, you waste all of it. Like, you have to get the full context on why it went wrong and what happened in the middle. And it's like you are, like, nearly at the point where you may as well have done it yourself. And even if it's like one little tiny mistake at the end, that happens. Right? But it happens, like, enough. Then it's like, oh, I'm trying to have it do this thing, but I have to actually go, like, do the whole thing to figure out what went wrong. But as soon as it, like, crosses that threshold of, like, more often than not, it. You don't need to check it.
David Hershey [00:31:37]: Like, it, it looks that it. Then it's like, oh, this is the coolest thing I've ever seen. It's incredible. Yeah. Which I think, like, again, I'm just gonna fall back on code. Cause it's the thing. We've seen so much, Bill. It's like I've had AI write code for me for a long time.
David Hershey [00:31:51]: Like, I. I've been an anthropic for a year. Like, I've used AI to write code a lot over the last year. But, like, there's this big difference from, like, I get, like, I copy and paste it in and I, like, poke through it and I, like, kind of get it there to, like, it just kind of happens that is, like, way different. You know, I, I wish I had, like, some hot, spicy prediction of, like, exactly what the next thing is that it's going to take off. I really don't. But that's like the thing that I expect to be true for agents is that like a model comes out, a month goes by, the thing clicks for someone, it explodes and then like, bam. There's another industry that's like, really got a new way of writing by.
Demetrios [00:32:34]: Yeah, it takes the menial work, all of these tasks that the people in that sector normally do, and now they're doing it a lot faster, so they're able to produce a lot more. And I think it's exactly what you're saying: it's just going to get more reliable, and that reliability is going to echo out into the world.
David Hershey [00:33:03]: Yeah, yeah, no, I completely agree. I completely agree.
Demetrios [00:33:08]: And it's, it's funny you mentioned how the model updates will cause this type of second or third order effect and also how folks almost need to reset when new models come out. Because I was talking to some friends who are building agents and they said that every time they upgrade to a new model, it's almost like they have to start from scratch with the prompts again. Because a lot of what you're saying is the models are better at doing things. So now the prompts, you don't need to specify all these little edge cases. You don't need to tell it to do that thing anymore because it already does it. It's already in the training or the fine tuning.
David Hershey [00:33:50]: Yeah. I mean, with Pokemon to bring it back to that. Like every time I've tested on one of the new models, like the, the most common change it makes is I'm like deleting stuff. You know, it's like, oh, I had all of this like prompt stuff in there. Like, tell it to not make all of these stupid mistakes and to like try to put band aids all over it. And then if I just delete all the band aids, it's just way better. I think, like, part of why people have a hard time seeing new models come out sometimes. I think it's true for like, almost every model that comes out is like, you look at these benchmarks, but they're not typically like, the thing that's holding the model back is not like some, some.
David Hershey [00:34:26]: Not some big thing. It's not like it learned physics overnight. And that's like how it, like why it's been to be better at filling out a spreadsheet. It's like, oh, it got just like good enough to click on a cell in this spreadsheet. You know, like, it. It clicks on the right cell now and. And then it all works, right? And it's, like, kind of imperceptible if you're just chatting with quad AI, that it's like this model that could, for the most part, look like. It's like just kind of sounds like maybe incrementally a little smarter or whatever.
David Hershey [00:34:55]: Like, not much, but like, if it got a little better at clicking on a spreadsheet and that means that it can fill out spreadsheets now, it's like, oh, they'll be game over.
Demetrios [00:35:03]: Yeah.
David Hershey [00:35:04]: And so I think that's part of, like, we're in this funny era where models come out. I think, to some, like, at first glance, for some people, it's like, what's the big deal? What's so much different? Kind of looks the same, like on cloud AI and feels the same, right? And people notice a little. But it's not like this, like, gargantuan thing for everything until you find the thing that it's actually way better at. Like this. Oh, this. When I asked it for this thing, it's like, yeah, way better. And then it's like, you know that that's where you end up seeing a lot of the change from my perspective. And in Pokemon, that's what happened, right? Like, it's like it wasn't obvious.
David Hershey [00:35:42]: Some of these benchmarks look kind of similar. And then suddenly, like, we hook it up to Pokemon and it's like, oh, it's like night and day. This model, like, can do it, you know? It's like, whoa, that's. That's crazy. That's so weird.
Demetrios [00:35:53]: We got to show more people, release this to the world. That's what we can expect as far as the next big breakthrough is agents on your video games. And so actually, one thing that it reminded me of there is. It's like one. What was the quote from Neil Armstrong when he came off of the moon landing when he said, one small step.
David Hershey [00:36:24]: For man, one giant leap for mankind?
Demetrios [00:36:26]: It's like one small thing for the model, but one giant leap for mankind in a way.
David Hershey [00:36:32]: Yeah.
Demetrios [00:36:33]: Makes me think about it. So what. What other stuff have you been thinking about lately? I remember, and to let you know, whenever I chat with you, and I always love chatting with you, I come out of these conversations like, oh, damn, I wasn't viewing that that way, but now you totally changed my mind and point. In fact, one of the first times that we talked after. After the whole AI revolution started and got underway, you were saying, dude, there's like millions of developers that can now use AI as opposed to the 1 million or hundreds of thousands of machine learning engineers that use AI. Like, do you realize what is going to change now? And I was like, are you sure? I don't think so. It doesn't seem like it's that good. And after that conversation with you, I was like, oh, well, I guess maybe I'm going to be open minded about it.
Demetrios [00:37:34]: So are there other areas in. And I'm not saying that there's been such a big disruption recently, but things that you're thinking about now that you're excited about that don't involve Pokemon.
David Hershey [00:37:48]: What do you mean? What else is there?
Demetrios [00:37:52]: That's, that's probably been your life for the past, whatever.
David Hershey [00:37:55]: At least the last four days. It's been a little bit, a little bit of my life. But I'm gonna give you like a series of boring answers because, like, I've just been so enveloped in trying to, to make better models and help people use them better. And so like, I honestly like feel a little blinders on, which is like all I can think about these days is like, what's gonna happen with new language models and how, how are they gonna change how people use them? And so I'm kind of like still stuck on the same thought that I gave you when we, we talked a while ago, which is, has it changed?
Demetrios [00:38:32]: Has that like assumption of, hey, now everyone is going to start using AI? You still feel strongly about that? Are you doubling down? Are you backing off?
David Hershey [00:38:42]: Like, yeah, I mean, I mean if you go talk to the world, like, if you talk to the people that we talk to, like, it is not, it is not just ML people that we talk to to build AI features. You know, it's not even close. Like, the vast majority of the people that build with Quad are, are that build with Claude are engineers. And then you talk about the people who are like actually getting value out of it. It's like people going to Claude AI and doing all sorts of other stuff too, right? It's, it's like. And you can build workflows on top of Quad. Like you, I saw you are interested in mcp. Like you can build like pretty significant workflows without really like needing a huge engineering effort.
David Hershey [00:39:28]: Like you expose a few things via MCP and like you can start building out like workflows that really do a lot of your job for you just by like gluing some of the pieces together that have been exposed in your organization. And so like, yeah, much More radicalized than ever that the part of what's happening here is that, like, we are dramatically raising the bar on who can use it. I think there's a long ways to go. Like, I, I think the experience of go to a blank chatbot and like, try to figure out how to use it is not good enough for some people. Right. And so I think, like, figuring out how to elevate people to feel like they understand, like, what can I do with this thing? Is an unsolved problem in general.
Demetrios [00:40:11]: Yeah. I heard it put that this way that I really like is how much cognitive load are you putting on the end user.
David Hershey [00:40:19]: Yeah.
Demetrios [00:40:20]: And when it is a blank chatbot and they have to create the question or create the whole prompt, that's a lot of cognitive load as opposed to scrolling TikTok, you know, and so, yeah, it is.
David Hershey [00:40:34]: And it, it just like, I don't know, it's AI, you just like, see a lot in the world. And I think it can be like, somewhat overwhelming if it's not grounded in, like, what does it do for me?
Demetrios [00:40:46]: Yeah.
David Hershey [00:40:47]: And I think people are slowly figuring that out. But, you know, maybe this is like, ties back to you to like, software engineers clearly figured out what AI can do for them. You know, I. It's been a while since I've run into someone who is not, at least to some extent figured out a way that using language models, like, has a pretty big impact on how they do their job.
Demetrios [00:41:09]: Yeah. Even just with the coding. That's true. I may be biased, obviously, but in my eyes, there's almost like the rabbit hole that every software engineer has to go down when they start trying to build AI features. And that is they start learning about AI more and more and more, and all. Next thing you know, it's like, okay, well, you're almost like this hybrid of someone who is a data engineer. Like, they have to learn about pipelines, they have to learn about all this stuff that we've been doing. It's almost like this gateway drug into the ML world, or AI world, I guess, is what they would call it.
David Hershey [00:41:51]: It's funny, like, a lot of the core skill set that I think was relevant for machine learning people, once you can, like, give up control of the gradient, descent is actually still a pretty relevant, relevant skill set. Like, a lot of what you do to make these systems better is you like, build the data sets you need to evaluate them. You do this like, sort of like stochastic iteration process over prompts or whatever it is, or like the various systems that build out an agent or whatever it is. But there's like a lot of just like pure experimentation you need to do to like get it. And it needs to be good. Like you need to track your experiments, you need to be like thoughtful about what and how you did it. If you're imprecise with that, you end up in the same holes that we've learned a lot about doing machine learning in the day. And so there's just all of these sort of crossover skill sets that I think are true where there's a lot that people that engineers building with language models can learn from the history of machine learning.
David Hershey [00:42:53]: And there's a lot of skills that people with machine learning backgrounds I think can use. Um, and, and I think like, it's just a question of like figuring out where and how to like let go and live in that hybrid world. Like ideally you don't need to worry about inference anymore. You know, like, you get, it's like one of the coolest, like one of maybe the least talked about things I think about what labs do. But like they figure out how to serve machine learning models at like incredible efficiency to, to people. And in the past it was like, oh, I'm gonna have to figure out like a GPU cluster and serving and routing and all. Like there's just like all of this annoying stuff you need to figure out to use ML that's just like fancy API where you get to pay as you go on demand. Like it's crazy how convenient that is to use machine learning and like how quickly people got hooked on the drug that is token based pricing where you just pay for API calls instead of having to pay to like host GPUs and deal with it.
David Hershey [00:43:49]: Like so I guess like that I use that as an example of like, hopefully some of this has just like gotten much easier. Like you don't need to worry and think as much about GPUs if you're just like going down the managed route that I think most people should be on. There's like all of this infrastructure hassle that you can kind of not have to worry as much about. Like, the data problems tend to be like a little less painful. It's not like cluster scale data. A lot of the times that you're thinking about with language models it tends to be like, you know, small amounts of information about people maybe, but like, it's not like a lot that you're really working with compared to like, you know, our background, we both have seen like really gnarly data problems. Yeah and in my time in Anthropic I've seen nowhere near that kind of data problem which is just like, it's just getting a little simpler.
Demetrios [00:44:35]: So yeah, my one buddy told me who and he works at a bank so you can imagine the strictest of regulations. He was just like for the new gen because I was asking him what are you doing? He's serving both gen AI use cases and traditional ML use cases. So you can. All that fraud detection fun stuff he's doing and the gen AI stuff.
David Hershey [00:44:58]: Totally.
Demetrios [00:44:58]: He said man, when you can get away with it, outsource everything to the like get rid of all the headache for the platform and just outsource it to these labs.
David Hershey [00:45:11]: I know, yeah. I think that one like has just taken a little bit of convincing for people because I think it happened so fast where we like a lot of people invested so much in figuring out all of this really complicated infrastructure to be able to participate in machine learning. And I think like the idea of oh we can do it without all of that, you know, like we can build on top of this like place that's going to like in some cases host training and inference and everything and I just like submit data sets or submit inference calls and like it all happens and it auto scales and it's as big a scale as I want. Like everything's perfect. That's like a thing that I think it takes some adapting if you've like spent this muscle building out the infrastructure. But man, it's like way, it's way easier. People who let go and embrace the free infrastructure they're getting to some extent the free management they're getting. I think I like been able to make a ton of progress on this stuff.
Demetrios [00:46:09]: Yeah, it's like the grandpa's yelling at their kids. Back in my day we had to actually hook up all this and figure out our own gradient descent.
David Hershey [00:46:21]: This is what I said to you though. This is like why I was so excited to join because I spent so much time helping people figure out how to hook up all of this stuff and it's so hard and challenging and.
Demetrios [00:46:28]: Now I just like all custom right.
David Hershey [00:46:32]: Words over here and you get some words out the other side and. And we all just have a good time now. It's great.
Demetrios [00:46:37]: Yeah. And then. Yeah, exactly. We can play Pokemon or at least watch some AI play Pokemon.
David Hershey [00:46:43]: Yeah, yeah, yeah. Too badly.