Greg Kamradt: Benchmarking Intelligence | ARC Prize
speakers

Greg has mentored thousands of developers and founders, empowering them to build AI-centric applications. By crafting tutorial-based content, Greg aims to guide everyone from seasoned builders to ambitious indie hackers. Some of his popular works: 'Introduction to LangChain Part 1, Part 2' (+145K views), and 'How To Question A Book' featuring Pinecone (+115K Views). Greg partners with companies during their product launches, feature enhancements, and funding rounds. His objective is to cultivate not just awareness, but also a practical understanding of how to optimally utilize a company's tools. He previously led Growth @ Salesforce for Sales & Service Clouds in addition to being early on at Digits, a FinTech Series-C company.

At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios is constantly learning and engaging in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether that be analyzing the best paths forward, overcoming obstacles, or building lego houses with his daughter.
SUMMARY
What makes a good AI benchmark? Greg Kamradt joins Demetrios to break it down—from human-easy, AI-hard puzzles to wild new games that test how fast models can truly learn. They talk hidden datasets, compute tradeoffs, and why benchmarks might be our best bet for tracking progress toward AGI. It’s nerdy, strategic, and surprisingly philosophical.
TRANSCRIPT
Greg Kamradt [00:00:00]: This is too good.
Demetrios [00:00:01]: You were awesome, man. This is great. That's how we know it's legit now.
Greg Kamradt [00:00:13]: Well, I mean, I was looking forward to this.
Demetrios [00:00:16]: Dude, I can't tell you how stoked I am. Tell me exactly how it went down that you were on a live stream with Sam Altman.
Greg Kamradt [00:00:23]: Yeah. So I run ARC Prize right now, and we run an AI benchmark called ARC-AGI. We want AGI progress pulled forward, we want tech progress pulled forward, because we believe it's going to be one of the best technologies that humanity's ever had. There's a big question on, well, how the heck do you make progress go faster? And the route that we've chosen is through a benchmark, created by Francois Chollet in 2019.
Greg Kamradt [00:00:51]: It takes a very interesting approach. There's a lot of benchmarks out there that go for PhD-plus-plus problems. They'll ask you the hardest questions, and then even harder ones, and they'll say, this is the last test we're ever going to have to take because we can't come up with any harder questions. AI ends up solving those. It ends up doing it well. The ceiling on AI is really high. It's insane. It's already doing some superhuman stuff.
Greg Kamradt [00:01:12]: So we take a different approach on that. We want to know what types of problems are easy for humans but hard for AI.
Demetrios [00:01:19]: I love that.
Greg Kamradt [00:01:20]: And the reason why, just getting into the logic behind it, is because we have one proof point of general intelligence right now, and that's the fricking human brain.
Demetrios [00:01:28]: So these are things like Strawberry.
Greg Kamradt [00:01:30]: I would say that that is a class of problems where if you can find things like that, it's like, dang, AI can't do that, but humans still can. We probably don't have AGI if we can come up with those problems right now. The hard part is those are one-off questions. It's easy to find one-off questions. But if you want to find a domain where you can come up with, like, 200 questions in the same class, so that you can actually quantify this, then that becomes a lot more difficult. And so our theory about AGI, and this is more of a working, observational definition rather than an inherent one, is when we can no longer come up with problems that humans can do but AI can't, then we have AGI.
Demetrios [00:02:11]: Wow. Okay. The.
Greg Kamradt [00:02:12]: The inverse of that, though, is if we can come up with problems that humans can do and AI can't do, then we don't have AGI. Not there yet. And by virtue of ARC-AGI-1, our first version of our benchmark, being out there, the fact that it's even out there and unsolved, that's a class of problems that humans can do. We just came out with ARC-AGI-2, and we actually went and gathered up 400 different people and tested them on ARC-AGI-2, on every single task within there. And we made sure, because if we're going to claim that humans can do this, humans better be able to do it. So we got 400 different people down in San Diego and we tested them on all of this, and every task that was in there was solved by at least two people in under two attempts. So humans can do it.
Greg Kamradt [00:02:52]: We have first party data for that, but AI still can't do it. So we claim that we don't yet have AGI for that.
Demetrios [00:02:59]: But they're kind of hard tasks.
Greg Kamradt [00:03:00]: They get harder for sure. Yeah, well, that's what's crazy. The way that I think about it is there's a gap between what humans can do and what AI can't. That gap is narrowing, and so we need to make sure that humans can still do it within a reasonable attempt. We're not looking at PhDs, and we're not looking at 2-year-olds to see if they can do these. A competent person: give them these tasks and see if they can actually do it.
Demetrios [00:03:20]: So just if you pluck somebody off the street, they have a college education, type thing.
Greg Kamradt [00:03:24]: More or less. So when we did our filtering, we made sure that they could use the Internet, things like that.
Demetrios [00:03:29]: Like, my mom is out.
Greg Kamradt [00:03:32]: We didn't want to teach them how to use a computer when we taught them what ARC was, you know what I mean? But that doesn't allow us to make the claim about the average human. So we're careful about that; that's not what we're going for. We're going for a capable human. Yeah, some people like to argue with us on it, but that's a different conversation. So we run this benchmark, ARC-AGI-1. Okay, great. We get an email from one of OpenAI's board members, who we have a relationship with, in early December, and it more or less said, hey, we have a new model, we want to test it on ARC.
Demetrios [00:04:07]: And back in that day it was Strawberry. It was the Strawberry.
Greg Kamradt [00:04:10]: You know, there's so many names going around. I mean, there was Orion even at that point. There was Strawberry. There was, what did Ilya see? There's so much stuff going around. So who the heck knows what rumor refers to what production version.
Demetrios [00:04:21]: And it hasn't gotten better?
Greg Kamradt [00:04:22]: No.
Demetrios [00:04:23]: The official names are probably worse than.
Greg Kamradt [00:04:25]: The rumors. And I think that tells you: don't expect it to get better. Yeah, because it won't get better. So again, the email says, we've got a new model, we want to test it. Okay, cool. Yeah, sounds great. It's OpenAI. They have a new model and they claim to have a very good score, but they didn't say what the score was in the email. And so on ARC Prize.
Greg Kamradt [00:04:42]: On arcprise because we have public data and so the way we run our benchmark is there's a bunch of public data that you can go train on and you can go kind of test yourself on it. But then we have a hidden holdout set.
Demetrios [00:04:51]: Nice.
Greg Kamradt [00:04:52]: That we can get into why that's important in the first place.
Demetrios [00:04:55]: It's the only way to do it.
Greg Kamradt [00:04:56]: It's the only way to do it, to have a hidden holdout set for it. And they said, we want to see: are we overfit to this? Because we think we're doing pretty good, but we want to try it on your holdout set. Will you come and test it for us? So we spent the next two weeks testing it, basically working with our team to go do it. This was through NeurIPS 2024 of last year too. So I'm at NeurIPS in Vancouver, thinking I'm going to relax and just watch talks the whole time, and I'm literally testing and hitting OpenAI's API endpoints for that. But we get through and it's like, holy shit.
Greg Kamradt [00:05:25]: I mean, frickin' SOTA. It was multiples better than we had ever seen another model do beforehand. And keep in mind that this thing had been out there for five years so far and there hadn't been this type of progress. So we're like, holy cow. And so I get on a meeting with Jerry and Nat to kind of pre-brief before we go and do the testing, and I say, what score do you claim on ARC-AGI? Because we're gonna go and verify that, and if they claim one thing and it's different, then that'd be a big story. They claimed 87%. And keep in mind, the highest on here with a publicly available model was like in the twenties. And then there were custom-built solutions, purpose-built just to try to beat ARC.
Greg Kamradt [00:06:05]: They were scoring in the 40s and 50s at that time, and here they are claiming 87%. So it's like, all right, this is a really big deal for us. Anyway, long story short, we go through it, and what's interesting is in the inference-time-compute world you can no longer just say, here's our model, here's our score. It's: here's the model, here's how much inference-time compute we spent, here's our score. So now there's another variable in it. And what we confirmed for them is that on low compute, and we can get into what low compute actually means in a second here, they scored 75%, and on high compute we saw.
Greg Kamradt [00:06:36]: Yep, more or less. We validated their 87% score.
Demetrios [00:06:39]: Wow.
Greg Kamradt [00:06:39]: Which, it's like, okay, yep, it's validated. So we write up our blog post, and it's just like a one-pager Google Doc, and somehow, or not somehow, but somewhere along the lines, Sam ended up getting pulled onto the thread, the email thread that we had going back and forth, and we said, we have our results, here they are, we want to discuss them live. He goes, great, I'm free Tuesday at 5:30, or whatever. And it's like, oh yeah, let's go. It's wonderful. I mean, this is a huge opportunity. So we put together our blog post and we get inside the room and we show them, basically put it up on the screen.
Greg Kamradt [00:07:09]: We show the blog post, everyone reads through it, and there's discussion. And he goes, okay, great, you guys should join our livestream on Friday. And we're sitting there, like, we hadn't even considered that we were gonna be testing this new model. It was surreal, man. And then he says, on Tuesday, you guys should come join us on Friday. And our one requirement was that we didn't want to just go up there and have them tell us what to say. We didn't want them to write the script for us.
Greg Kamradt [00:07:36]: And so Mike and I, Mike Knoop, who co-founded ARC Prize, we basically wrote the script that we were happy and comfortable with. We gave it to them. They said, yeah, looks great. All right, cool. That's that. See you Friday. Exactly.
Greg Kamradt [00:07:51]: They had a very big production, or I would say a big production crew, call it 12 people between marketing, comms, videographers, events, sound, everything, that were in there. And we had two rehearsals for that. Two rehearsals went through, went great, made edits, and then the livestream went out on Friday.
Demetrios [00:08:15]: Wow.
Greg Kamradt [00:08:16]: It's funny because there was a room that wasn't much bigger than this. Well, it was a much bigger room, but it was partitioned off on all the sides, you know, with kind of just, like, staging or whatever.
Demetrios [00:08:25]: Set design. Yeah. Meant to look all... It was just a table. You guys were sitting around the table.
Greg Kamradt [00:08:31]: Yeah, they did a really cool job with that. So I'm on the other side of the partition, and I'm hearing Sam and Mark Chen talk about it, at the time their SVP of Research, and they're like, now we'd like to invite Greg, and then I just walk out from behind, like the wizard behind the curtain, and jump on there. And what's wild is that you knew how many people were watching on the livestream, but it was a small room.
Demetrios [00:08:49]: Yeah.
Greg Kamradt [00:08:49]: Like I said, only like 10 people in there. But it was cool. So we did that. But this was an unreleased model, so now we call it o3-preview, because there's a preview model for it. And it was more of a capabilities demonstration: what if you push this to the max, what could you actually do? And to put this into perspective, on the low compute.
Greg Kamradt [00:09:10]: They were spending about $20 per task on this, and we verified with 500 tasks. And so that's about 10,000 bucks of compute that they spent on this just for arc. It's like, dang, that's a lot, right? Yeah, like, that's a. You're not going to find that many people that want to spend ten grand on solving ARC tasks. Right.
Demetrios [00:09:23]: But that was low compute. And then the o3-preview was the first reasoning model.
Greg Kamradt [00:09:28]: No, no, no. It wasn't the first reasoning model, because, depending on when you want to call it, I believe their first reasoning model was when they did o1-preview. I don't even know when that was. That must have been early 2024, mid 2024, maybe even 2023. But so we tested o3-preview low compute, and then we also tested high compute.
Demetrios [00:09:48]: Which was way more money that it used.
Greg Kamradt [00:09:50]: I forget the exact dollar amounts. It was in the thousands of dollars per task. What, per task? And again, remember, we tested, you know, 500 tasks or whatever. I think it was 170 times the amount of compute that was used on the low compute side. But either way, the TLDR is, okay, so what? That's a lot of money. What's the important part? The important part is it just reconfirmed that you can pay for better performance, which is crazy, right? And there's open questions as to where that scaling actually tops out. And so we haven't done as thorough an analysis on that system as we'd like to.
Greg Kamradt [00:10:21]: There's a big open question on: does it asymptote, or can you get to 100% if you just threw more and more and more at it? But keep in mind, it's log what you give it. So let's just say you spent a million dollars in order to get a couple percentage points of upgrade; then you need to spend 10 million, and then you spend 100 million. It's like, well, where does it actually end up stopping for us?
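To make the "it's log what you give it" point concrete, here is a tiny illustrative sketch. The numbers are invented placeholders, not ARC Prize measurements; the only point is that if each 10x increase in spend buys a fixed bump in score, the dollars per extra point explode.

```python
import math

def toy_score(cost_usd, base=55.0, points_per_10x=10.0):
    """Toy model: score rises by a fixed number of points per 10x of spend.
    All constants are hypothetical, for illustration only."""
    return min(100.0, base + points_per_10x * math.log10(cost_usd / 1_000))

for cost in [1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    print(f"${cost:>12,} of inference -> ~{toy_score(cost):.0f}%")
```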
Demetrios [00:10:39]: Yeah. What's that trade off?
Greg Kamradt [00:10:40]: Yeah.
Demetrios [00:10:40]: And also, are you going to have to wait a year before it gets done?
Greg Kamradt [00:10:44]: Well, it's a long time. So for the high compute, the job took overnight, more or less, maybe even longer than that. I forget the exact duration for it, but it was not a short amount of time. You're not gonna be sitting there waiting for the response.
Demetrios [00:10:58]: Yeah, it's something you do. Yeah, you let it run, go out, have your life and then come back and see if it was able to do it.
Greg Kamradt [00:11:05]: But actually, just today: OpenAI, last week, launched their production-grade o3. And so there's a big open question: okay, well, how does what we tested in December match up to what was publicly released today? And we asked OpenAI, we asked Jerry, for confirmation on this, and there's a bunch of nuance. TLDR: it's not the same model, exactly, and there's less compute being used. So we should not expect the same scores. And so we tested on it today, and yeah, as expected, it does really, really well.
Greg Kamradt [00:11:35]: It doesn't do as well as the model that we tested.
Demetrios [00:11:37]: Not the 87%.
Greg Kamradt [00:11:38]: It's not the 87%. But also they released o4-mini, so they're just fricking keeping the models coming, right? And so a lot of good testing, a lot of good stuff. But what's cool is that ARC-AGI is the tool that we're using to evaluate these things. We have ARC-AGI-2, and all these models are still scoring really, really low on ARC-AGI-2.
Demetrios [00:11:58]: So basically ARC-AGI-2 is meant to be like that next step, like an order of difficulty more.
Greg Kamradt [00:12:06]: So I've been looking for the good analogy for it, so apologies, I don't have it down yet. But the way I think about it, and this may be an incorrect one, so apologies if I butcher it: if ARC-AGI-1 is really good at measuring car speed between 20 miles an hour and 40 miles an hour, then below 20 it's not very good, and over 40 it's not very good, because it just maxed out. It's literally redlining; it's over at the top. ARC-AGI-2 is measuring cars from 40 miles an hour to 80 miles an hour.
Demetrios [00:12:34]: Nice.
Greg Kamradt [00:12:34]: So below 40 miles an hour, you're not going to get much signal. You're going to get a little bit of, you know, a little bit of stuff.
Demetrios [00:12:39]: It's got to be those premium models.
Greg Kamradt [00:12:41]: It's going to have to be premium models. And we're not yet seeing that models are making substantial progress on it. So I think the best open source model right now is getting like 3 to 4%. I'm sorry, not best open source; even when we tested o3 at medium, it was getting like 3 to 4% on this. And down at that range, we're only talking about 120 tasks, so down at that range we're talking about noise. It's not until it starts to hit like 10, 15% that you're really going to start to see something substantial from it.
Greg Kamradt [00:13:08]: Yeah.
Demetrios [00:13:09]: How did, and how do you go about deciding these questions? And then there's also the other side of it where you can't just have it be a super hard task. You almost have to be creative about it.
Greg Kamradt [00:13:20]: Right, because again, we hold ourselves to the restriction that humans need to be able to do it. And that restricts you from just going hard, hard, hard, hard.
Demetrios [00:13:28]: Yeah, it can't be this PhD plus.
Greg Kamradt [00:13:30]: It can't be the PhD-plus. But as long as we can come up with those problems, that tells you there is a gap between human intelligence and AI. And people argue with me and say, oh, you don't need to aim for human intelligence if you want to aim for AGI, because they're two different things. I agree. But our hypothesis is that the fast track towards AGI is understanding how the human brain works and understanding where the gaps are. Because if we aim for those gaps, that's going to tell us something interesting. We can talk about this later, but the human brain is nowhere near theoretically optimal intelligence.
Greg Kamradt [00:14:01]: Like, we got a lot of biological baggage.
Demetrios [00:14:03]: I could tell you that right now, man.
Greg Kamradt [00:14:05]: I.
Demetrios [00:14:05]: My human brain is not working to full capacity.
Greg Kamradt [00:14:09]: Exactly.
Demetrios [00:14:09]: Ever.
Greg Kamradt [00:14:10]: So by no means am I saying it's the best example, but it is our only example of general intelligence. And so we see it as a useful model to go after. Anyway, so how do we pick the problems? In 2019, Francois Chollet came out with this paper called 'On the Measure of Intelligence,' which is so fascinating, because how do you come up with the problems? That's actually not the question to start with. The question to start with is, how do you define intelligence? Because if you can define it clearly, then you can come up with problems for it, which is the fricking fascinating thing. And his definition of intelligence was: what is your ability to learn new things? It's not how good you are at chess, it's not how good you are at Go, it's not how good you are at self-driving. It's: if I give you a new task in a new domain, a new set of skills that you need to learn in order to do it, can you successfully learn that thing?
Demetrios [00:15:04]: Is it how fast you learn that?
Greg Kamradt [00:15:06]: So now that's a great question. My opening definition of intelligence is always just binary: can you learn new things or can you not? But his actual definition of intelligence is your efficiency at learning new things. So, for example, I like to think about efficiency in terms of two axes. Number one is the amount of energy required to learn new things, and we'll get into that in a second. The second dimension is the amount of training data that you need to learn that new thing.
Demetrios [00:15:34]: So basically how many times you need to do it before you learned it.
Greg Kamradt [00:15:37]: Exactly. So a crude, crude example is: if I'm going to teach you how to play Go, we might need like six hours. I'll teach you the rules and you'll become basic at it; we can at least have a conversation around it. Think about how much training data went into the system that ended up beating Go. A lot, right? And so, of course, that got better skill out of it, but there was outsized training data that went into it.
Greg Kamradt [00:16:01]: So another way to put this is: do humans have an Internet's worth of training data in their head to output the intelligence that you see from us right now? And the answer is no, they don't. Language models do. And on the recent podcast, it was the internal one with OpenAI, with Sam Altman, and I believe the fellow's name was Daniel, he was talking about the efficiency of language: what is an LLM's efficiency with language versus a human's efficiency with language? And he said, by his estimate, and I think this might be a little low, that humans are 100,000 times more efficient with language than current LLMs. And one of the underlying things that they kept talking about in the pod was that, look, compute isn't what's blocking us anymore.
Greg Kamradt [00:16:42]: We have a shit ton of compute. Like, we have a lot of compute: Stargate, all of Nvidia, we have so much fricking compute. What's blocking us right now is more on the data side. But underlying all that, what's blocking us more is also on the algorithmic side. We just literally need new algorithms, basically breakthroughs, in order to get to human levels of efficiency. Just a random point to really drive this home: the other reason why I love using the human brain as a benchmark is because you know how much energy the human brain takes. Literally calories: how many calories does a human brain consume? You convert calories into energy, and then you compare that to the inference energy used to solve ARC.
Greg Kamradt [00:17:21]: Like, you can already tell. You're.
Demetrios [00:17:22]: You're miles ahead.
Greg Kamradt [00:17:24]: You're miles and miles ahead. So the human brain is a good benchmark for us.
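To put rough numbers on that comparison: the ~20-watt figure for the human brain is a commonly cited estimate, while the GPU power draw and the solve times below are made-up placeholders rather than ARC Prize measurements.

```python
# Back-of-the-envelope energy comparison (illustrative numbers only).
BRAIN_WATTS = 20            # commonly cited resting power draw of the human brain
HUMAN_SOLVE_SECONDS = 120   # assume ~2 minutes for a person to solve one ARC task

GPU_WATTS = 700             # placeholder: one high-end accelerator under load
AI_SOLVE_SECONDS = 600      # placeholder: 10 minutes of inference for one task

human_joules = BRAIN_WATTS * HUMAN_SOLVE_SECONDS
ai_joules = GPU_WATTS * AI_SOLVE_SECONDS

print(f"Human: ~{human_joules:,} J (~{human_joules / 4184:.2f} kcal)")
print(f"AI:    ~{ai_joules:,} J, roughly {ai_joules / human_joules:.0f}x the energy")
```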
Demetrios [00:17:29]: Also, we should note: did this all just start you down this path from the Needle in a Haystack? Was that, like, what blew up?
Greg Kamradt [00:17:38]: You know, Needle in a Haystack is a fun bullet point on the journey that I've been on so far. I wouldn't call it the thing that did it, right? I mean, it was cool, but it didn't make me rich, it didn't blow me up, you know what I'm saying? It was a small little thing. Like, you got a few retweets.
Greg Kamradt [00:17:57]: I got a few retweets from it. You know, it's like, yeah, I got a few likes on Twitter, but it wasn't like, it wasn't much from that. Um, so no, but like the inherent thing, like what. Whatever it is about what drives me and like whatever it is about me that makes me put my energy where I do needle and haystack came out of that spot. Like other stuff comes out of that spot and like, you know, everything. And so like I would say that all, like all the activities that happened were symptoms of where I choose to put my energy and consequences of it. And those consequences line themselves up to put myself on the path of where I am.
Demetrios [00:18:30]: And then it opens doors. It's like, hey, this happened to you. Because those listening should also know that you were doing amazing tutorials.
Greg Kamradt [00:18:39]: I was, like, deep in YouTube work.
Demetrios [00:18:40]: Yeah, that's how I found you, back in the day when you were doing the YouTube tutorials. You were like the first guy making LangChain tutorials.
Greg Kamradt [00:18:47]: It's wild. So that's another wild story. I'll be super brief on it. I remember the first time I saw LangChain. I was scrolling Hacker News, just trolling or whatever, and I saw it. This was October '22.
Demetrios [00:18:58]: Yeah. So like right when ChatGPT came out.
Greg Kamradt [00:19:01]: Out, right when chat GPT maybe even a hair before. And it said show hacker news LangChain. So it's like literally like there's like the launch blog post, the LangChain, and I'm looking at this, I'm like, holy shit. This is solving a lot of the problems that I had building with the RAW API at the time. Because keep. Keep in mind at the time, there wasn't a chat model, it was this DaVinci O3. And so like trying to work with that thing was obnoxious like you. There's a lot of friction to get the value out of it.
Greg Kamradt [00:19:25]: Anyway, LangChain helped out with that a little bit more, and I was like, this is so cool. And I had a previous history of doing pandas tutorials on YouTube that went nowhere. They frickin' sucked. Like, we're talking about me in my mom's basement, in my underwear, making pandas tutorials.
Demetrios [00:19:39]: At least you're on it.
Greg Kamradt [00:19:40]: It wasn't exactly that, but it was along those lines. So I'd made, call it, like 80 pandas tutorials or something like that. Because that was my craft: data analysis was my craft at that time, that's what I prided myself on. And I went to YouTube and I typed in LangChain, and there was one tutorial, by a guy who I ended up getting to know a little bit later on. Super awesome.
Greg Kamradt [00:20:00]: His name is James Briggs. And there was one LangChain tutorial, and I kind of had just, like, one of those small, little, like, little light bulb moments. I was like, dude, Greg, you should do what you did for pandas, but you should do it for LangChain. And all I did for Pandas was just see what I was curious in. Go and make a bunch of, like, tutorials and functions. And so at the time, just based off just riding my pandas, kind of like success or ripples. And it wasn't success. I just mean, like, whatever was coming from it.
Greg Kamradt [00:20:23]: I was getting, like, three or four new YouTube subscribers a day. And I did my first LangChain tutorial, and I got 16 new subscribers after that one. I was like, that's 4x success. That's 4x of where I was. Anyway, I did number two, and that next day I got 25. And then I did number three, and that next day I got 50. And keep in mind, that's 10x what I was doing beforehand. I pulled my wife into the room.
Greg Kamradt [00:20:41]: I'm like, holy shit, Eliza. There's something here. I've retold this story a few times, but there's a few times in life when you notice the ROI on the energy you put out. Often in life, you put out one unit of energy and you're getting, like, 20% back. It's really not much. You might get some money, but you're not getting, like, fulfillment, you know, blah, blah, blah.
Demetrios [00:21:03]: Yeah.
Greg Kamradt [00:21:04]: At that moment in life, I was putting in one unit of energy, and I was getting, like, two or three times back.
Demetrios [00:21:08]: Because you were getting energy?
Greg Kamradt [00:21:09]: I was getting energy. Like, I couldn't sleep. Like, I was just like, I gotta wake up. What am I doing today? Like, what tutorial am I making today?
Demetrios [00:21:15]: You're jazzed.
Greg Kamradt [00:21:16]: Upgraded my setup. I was so freaking jazzed on it. I met Harrison, did all this other stuff. And just through that, natural questions came around, like, how do you do better retrieval? All these business questions that I had beforehand: how do you do better on that? And one of them was Needle in the Haystack, which was, everybody was talking about long context. Oh, it's longer, longer, longer, longer.
Greg Kamradt [00:21:36]: And I'd seen some tweets that were like, yeah, but it's actually not that good at long context. I was like, you guys are idiots. Let's just go and test this thing.
Demetrios [00:21:43]: Yeah, there's a process.
Greg Kamradt [00:21:45]: Remember, I'm a data dude, so that's my craft. And all I saw in my head was a heat map, with the context length on one axis. And then there was that whole question around whether the position of where your needle was had a factor in it. I was like, might as well throw a two-by-two together, because it's going to look pretty if nothing else. And so I ended up doing that. And that's where Needle in the Haystack came from.
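For anyone who hasn't seen it, the Needle in a Haystack test Greg is describing boils down to a loop like the one below: plant a known sentence at a chosen depth in a long filler context, ask the model to retrieve it, and sweep both context length and depth so the scores fill out a heat map. `call_llm` is a stand-in for whatever completion API is under test, and the pass/fail scoring here is cruder than the LLM-graded scoring the original used.

```python
# Sketch of a Needle in a Haystack evaluation (call_llm is a placeholder).
NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"

def build_context(filler: str, context_len: int, depth: float) -> str:
    """Insert the needle at a fractional depth inside `context_len` chars of filler."""
    haystack = filler[:context_len]
    pos = int(len(haystack) * depth)
    return haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]

def run_grid(filler, call_llm, lengths=(1_000, 10_000, 100_000),
             depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    scores = {}
    for n in lengths:
        for d in depths:
            prompt = build_context(filler, n, d) + "\n\n" + QUESTION
            answer = call_llm(prompt)
            scores[(n, d)] = float("dolores park" in answer.lower())
    return scores  # plot as a heat map: context length x needle depth
```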
Demetrios [00:22:01]: That's so wild, man.
Greg Kamradt [00:22:03]: Yeah.
Demetrios [00:22:03]: So now we were talking before about the reasoning models and just this test time compute and you have thoughts.
Greg Kamradt [00:22:10]: Here's the indisputable fact: you spend more money at inference time, you get better performance. The open question, and this is where people argue with me, but I still believe it's open, is: for top frontier models, does it asymptote sub-100%, or can you get to 100? I think there's too much money that you need to spend to go try to answer that question. Yeah, that's a big one.
Demetrios [00:22:30]: So it's, it's so high risk that why even try?
Greg Kamradt [00:22:33]: Well, not high risk per se. The cost is guaranteed: you're gonna spend a ton of money. What you get returned, TBD. Where it is right now, it's not worth it. But here's the other thing. I harp a lot on AGI and a lot of that stuff; we have really, really useful, economically useful models right now without having AGI. That's cool. That's great.
Greg Kamradt [00:22:55]: I love it. That's value to the world. I'm a capitalist at heart. I want good tools to be used for the good of humanity. LLMs, o3, o4-mini, all that stuff are great tools that are going to bring us really, really good progress. The AGI conversation is a separate conversation, and that's more of a theoretical, philosophical, scientific one: well, what is AGI? How do you actually define it, and how are we going for that?
Demetrios [00:23:14]: Yeah, what is intelligence?
Greg Kamradt [00:23:15]: What is intelligence? And what's wild, what blows my mind, and you're freaking getting me going, man, I'm already ranting, what's wild is that we don't have a formal definition of intelligence that the community relies on. For something as hot a topic as AGI, what we have right now makes me wonder if it can even be formally defined, if it hasn't been already. There's a few stories I can tell in my head. One is it can't be. But that also takes a very 'humans are really smart' approach.
Greg Kamradt [00:23:43]: And we've seen many times over and over again that like humans, we're not as smart as we think we are. So the alternative is that maybe we just don't have a sensitive enough understanding about like the actual tools about what we need for it.
Demetrios [00:23:53]: But then the other story is, the other story that you play in your head is like, yeah, it can be.
Greg Kamradt [00:23:58]: We just don't, we just don't know potentially.
Demetrios [00:24:00]: And we're never gonna know.
Greg Kamradt [00:24:02]: Potentially. Yeah. And then there's, there's a whole different subclass of intelligence which is human relevant intelligence. So like there's a certain class of intelligence that you need to survive on Earth right here. That's what humans have. That's what we have and built up there. But if you, if you really expand out and this is where we get into like more philosophical, like in the, in the grand scheme of things, the Earth is, is a pretty small piece. Right.
Greg Kamradt [00:24:23]: So if you're talking about universal intelligence, talking about theoretical intelligence, and let's not go here, but I'll just light the match: people are going to think I'm going over the edge with this, but if you jump into, like, simulation theory, what's the intelligence that governs that type of thing, that would make our own world? I guarantee it's not human-relevant intelligence. And there's a theoretical optimum that we're not even going to touch. But that's the other thing too: you've got to walk before you run. We're going to start with human intelligence first anyway.
Greg Kamradt [00:24:55]: Reasoning models, though. I mean, they're great. You can scale them, throw more money at them, get better performance. They take longer, thinking for longer. There's big open questions on how these reasoning models actually work. One simple way to look at it: the very first reasoning model that people ever came up with was, they told the model, please think out loud first and then give me your answer. That ended up giving better performance. Crazy, right? And then what you go do is you go train on processes like that for much, much longer.
Greg Kamradt [00:25:23]: And that's another way to scale these things up. You say, think for longer, think for longer. Wait, reflect, do a reflection step, you know, and you say, keep on going. Another method to scale these things up is to say, all right, I want 10 of you to think out loud, and then I'm going to see what all 10 of you respond, and I'm going to pick the best answer that comes from there. There's even further ways to do it, which is like, I want you to think of the first step in your process.
Greg Kamradt [00:25:47]: Okay, now, what are 10 potential steps that would come after that first step? All right, I'm going to pick the best one of those ten. Okay, now I'm on step two. Think up ten potential step threes. I pick the best one, and then boom, boom, boom, go all the way down. There's always latency and cost trade-offs that come with those things. But either way, it's undeniable, the performance you're getting from these things and how good they are, even through vibe anecdotes and even through ARC-AGI performance. So they're very impressive.
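A compressed sketch of two of the scaling patterns Greg just listed, best-of-N sampling and greedy step-by-step expansion. `generate`, `propose_steps`, and `score` are placeholders for a sampling call and a grader; real systems typically use trained verifiers or majority voting rather than this toy setup.

```python
def best_of_n(prompt, generate, score, n=10):
    """Sample n independent 'think out loud' answers and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

def greedy_step_search(prompt, propose_steps, score, max_depth=5, branch=10):
    """At each step, propose `branch` candidate next steps and keep only the best one."""
    trace = []
    for _ in range(max_depth):
        options = propose_steps(prompt, trace, n=branch)
        if not options:
            break
        trace.append(max(options, key=score))
    return trace
```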
Demetrios [00:26:12]: Yeah. So it's almost subjective and objective.
Greg Kamradt [00:26:16]: Don't get me started, dude. I mean, this is another one of my things. ARC-AGI is a verifiable domain: you can just go check, is the answer right? What blows my mind is there's no right answer for how good a summary is. There's no right answer for how well an AI took notes on your call and then went and put them into Salesforce. Like, how good are the notes? Right.
Demetrios [00:26:36]: Yeah. And how good are they to who?
Greg Kamradt [00:26:38]: Well, so that's the whole point: you have to keep in mind what the background engine is, what the eval engine is that you're evaluating these things against. With ARC, it's an equality check. We can tell: we have the right answer, and the output either matches or it doesn't. For much of what drives the economy and drives humans and everything,
Greg Kamradt [00:26:56]: The eval engine is human preference.
Demetrios [00:26:58]: Well, that's what I was going to say. With ARC, don't you find that the answers can be subjective?
Greg Kamradt [00:27:05]: So if you're just looking at whether or not the task is correct? Yes. If you're looking at claims like, is this human-solvable or whatever, then it's a lot more subjective. There's a lot more subjectivity that comes from there. But in terms of eval engines, I have a priority order of my favorite eval engines. Number one is going to be physics. And what I mean by that is, I think the coolest thing that we could have AI do for us is discover new knowledge about reality, basically about physics. So you think about, what is the right answer?
Greg Kamradt [00:27:37]: Well, it's what the scientific process says, with physics as the eval engine. That's so freaking cool. There's no umbrella that encompasses physics; physics is what we're in, right? So I think that's number one, that's super cool. Number two: capitalism. Capitalism is a human construct, a set of rules that we all play by, a system, and there's laws and there's how we choose to do things.
Greg Kamradt [00:27:58]: Running a business is an experiment playing in that world, right? So it's almost like capitalism is the eval engine: I'm going to go try to make a whole bunch of money, but you've got to do it within the rules. So I think capitalism is a really interesting eval engine, and then human preference after that, which is, how good is this summary? But the wild thing about human preference is there's no way to quantify that at scale, which is really tough. Which is why, when you do RLHF, you've got to go spin up, not data centers, but huge conference rooms of hundreds and thousands of people giving you preference judgments on which one's better, right?
Greg Kamradt [00:28:37]: That's how you do it. Which that's crazy, but that's what it takes in order to do these things.
Demetrios [00:28:40]: So go back real fast to this capitalism one, or even the physics one, because in a way we are assuming that what is happening to us as humans is discoverable or is the engine, the eval engine. But potentially it's not. It's just us as humans.
Greg Kamradt [00:29:01]: Yeah. So a big caveat with that, the way that I think about it, is: if it's true that what we see is what we get, if it's true that reality is what it appears to be. And I know, even as you start to delve into the quantum stuff, we don't know what's on the many-worlds side; there could be something surprising coming out of there, which I would love, because I want the truth, and if that's the truth, then freaking so be it. That's freaking awesome. Pending all that, assuming what you see is what you get, then I think what I say still holds. If there's some unexplainable thing that's just out of our reach,
Greg Kamradt [00:29:41]: I like answers that we don't have an explanation for a little less, at least. But I'm not ruling it out. I'm saying, yes, that is a caveat. I'm operating, looking at it this way, for it.
Demetrios [00:29:50]: Though when I think about it, it's like there's something beyond our understanding.
Greg Kamradt [00:29:56]: Yeah.
Demetrios [00:29:57]: Potentially that is what we are going to get helped to understand. AI can help us understand it, but it's going to be outside of what we are looking at. Just like when you have the chess move that is played, and then later it's like, oh yeah, of course. Well, I never would have thought of that, or it would have taken us decades to figure that out. Now we get to see it. But that's a wild one.
Greg Kamradt [00:30:23]: I'm with you man. And humans are traditionally very poor at forecasting the unknown unknowns. And right now that's all unknown unknowns.
Demetrios [00:30:30]: Yeah.
Greg Kamradt [00:30:31]: And countless examples. Go and ask somebody in the 1800s, what would today be like? They have no freaking idea. They just had no idea what came from it. So that will happen to us, whatever happens. And even with how accelerated these timelines people talk about are, I mean, even call it 10 years from now.
Demetrios [00:30:46]: Like, you know what? I had a great dinner two nights ago with a friend, and he had done a thought experiment: come up with headlines for what different magazines would be saying in 2030. So he said, I created one headline for Wired, and it was 'Teen 3D-prints microchip in their basement,' type thing. So that was one, and then another one, and this is a complete tangent, but it's trying to think forward on what could be possible: he was saying 'Data center on the moon opened,' or 'Second data center on the moon is opened by the US,' type thing. So you're like, well, maybe that's not too far off.
Greg Kamradt [00:31:35]: Yeah, both of those seem very tractable to me, because you could lay out the path to do them; it's straightforward. If you said something that didn't have a clear, obvious lineage to get toward it, then I would start to think about it a little bit more. But yeah, I'm of the David Deutsch philosophy that all problems are solvable. And that's the argument for optimism: if you believe all problems are solvable, then there's nothing out there that should really worry you that much, because you can go figure it out, go do it all.
Demetrios [00:32:02]: After he told me that, I was trying to think, what would my headline be for 2030? Where would I go with that?
Greg Kamradt [00:32:09]: Only five years from now? Yeah, four years and three quarters.
Demetrios [00:32:13]: We're going to be specific.
Greg Kamradt [00:32:14]: Yeah, I mean, I mean, you kind of got to be with these things. It's like, yeah, I mean, I'm.
Demetrios [00:32:19]: That quarter could be a big one.
Greg Kamradt [00:32:21]: Because I've been listening, I've been going deep, really deep on this. There's a big conversation around intelligence explosion, 30% GDP growth rates, and all that. And one of the criticisms I have with some of the more outlandish ideas is that they're not as tactical and not as concrete as I really wish some of these projections were. So getting concrete, saying four years and three quarters, it's like, well, damn, OpenAI just came out with o4-mini this past week. When is o4 coming out? When is o4-pro coming out? Could it be at the beginning of 2025, 2026? If so, you've only got three years left with those types of things to go for it.
Greg Kamradt [00:32:58]: And so concretely, how is GDP going to grow 30%? How is that data center going to get up to Mars, or even up to the moon, or whatever it may be? How many launch windows are there left? One thing that's caught me, that I've nerded out on a little bit, is: Elon has a Mars window that he wants to go shoot for. Humans are not going to be the first.
Demetrios [00:33:21]: Yeah, why would we.
Greg Kamradt [00:33:22]: Why would you.
Demetrios [00:33:23]: We already have the Mars rover.
Greg Kamradt [00:33:24]: We have the Mars rover. And so humans are not going to be the first ones there. That means they're going to send Optimus up there. Are we going to have AGI on Earth before that window? If so, then you pretty much have AGI on Optimus, because you just go send a bunch of commands. And so, next thing you know, I feel a little bit, like, almost insecure, but then I need to remind myself not to be so emotional. But it's like, damn, humans weren't the first on Mars.
Demetrios [00:33:49]: I missed that one, you know? I mean.
Greg Kamradt [00:33:50]: Mean, I mean, sort of. It sounds so lame, but to think about it. But that's. That was my first reaction. I was like, damn, there's gonna be this robot that is intelligent, that's its own human being, but it's not a human. And then I think it's like, damn, am I just speciesist? And I just love, like, the human race so much. And now I need to open my.
Demetrios [00:34:03]: I wanted to claim it. I wanted to plant that US flag on Mars, you know. I don't know.
Greg Kamradt [00:34:08]: Even if it's just like humanity's flag or whatever it may be.
Demetrios [00:34:11]: But if you think about it that way, there's already been the Mars Rover. So how is it different than the Mars Rover?
Greg Kamradt [00:34:16]: And that's where my biological baggage is bringing me down, you know, Just because.
Demetrios [00:34:19]: It's like a humanoid shape.
Greg Kamradt [00:34:21]: Humanoid. I think it's less the humanoid shape for me, and more just that it's a generally intelligent being that can do that, and that's artificial. Does it need to be.
Demetrios [00:34:30]: But isn't the Mars Rover. The Mars Rover is not being controlled.
Greg Kamradt [00:34:33]: I think it is. I think it is. Don't they send it instructions and tell it what to go do?
Demetrios [00:34:38]: That's a good question.
Greg Kamradt [00:34:39]: I should figure that out, because it moves pretty slowly. Like, it waits.
Demetrios [00:34:44]: We gotta fact check that one. It takes actions, it's like, what next? And then three or four minutes later: okay, turn right, or pick up the rock.
Greg Kamradt [00:34:55]: I mean, whatever. I don't think it's that far off. I'm pretty sure it's like that.
Demetrios [00:34:57]: That's funny. Yeah. I thought it was a bit more autonomous.
Greg Kamradt [00:35:00]: Yeah.
Demetrios [00:35:00]: And. Or maybe they send three or four instructions at once. And if it fails, then resend no more. Figure out where we're at now.
Greg Kamradt [00:35:08]: Yeah, something like that.
Demetrios [00:35:10]: Somebody will have to give us that one, because that is. That is hilarious. What else have you been thinking about?
Greg Kamradt [00:35:16]: Yeah. Well, in terms of headlines, I still haven't given you a headline. The one I'm thinking about here: headline, 2030, Wired. I don't think it's outside the question that there could be a headline that says humans are no longer able to come up with questions that AI can't answer. Which isn't that sensationalist; it's kind of muted from a sensation standpoint. But if you use our definition, the observational definition of AGI, then what other problems are there? Right.
Demetrios [00:35:46]: But I still wonder if there's a world where you have run out of questions but you're still not seeing it, where every once in a while you'll find that question. It's not that you can find a hundred of them, but there's still those stupid questions, like the strawberry one or the 9.11 one.
Greg Kamradt [00:36:02]: Yeah. And here's the thing. I don't want to give the viewer the impression that I'm relying on this as a formal definition. I think it's a pretty good working definition for sure; it's easy to communicate and it's easy for us to go against. I think we'll come up with a formal definition. But to your point, how often do you ask a human a question and it's like, what were you thinking? You know what I mean? So efficiency is such a big piece of this. It's like that Will Smith I, Robot meme that keeps going around. You ask him a question: can you? Right.
Demetrios [00:36:33]: So yeah. So then I could see that though. Yeah. We can't come up with more questions. We or we have to have AI come up with the questions that it can answer potentially.
Greg Kamradt [00:36:41]: And you know, that's a whole other thing. I think that's very under-explored. It's talked about: using AI to help build AI, to help align AI, to help test it, all that other stuff. And that will happen, because, again, definitions are important, but look at all the people using Cursor right now to go build AI models. Is that using AI to help you build AI? Yeah, it is. So it just depends on how directly you want to use AI for it. But yeah, so here's what we're thinking about for ARC-AGI-3. Because we're coming out with ARC-AGI-2, and it's going to get beaten one day, right? We know that ARC-AGI-2 could be brute forced.
Greg Kamradt [00:37:15]: So if you give a data center's worth of compute and energy and time, like a month's worth of a data center, and go brute force it and literally try all random permutations using one of the DSLs to try to solve ARC, yeah, you're going to figure it out. But that's why efficiency is a big piece of this: the energy and the cash that you needed to go do that. That doesn't make us interested, even though it's a verifiable domain.
Demetrios [00:37:36]: Do you consider ARC-AGI-1 beaten because of that 87%? Like, is 87% a passing grade?
Greg Kamradt [00:37:44]: That's like a B-plus. So for a long time we talked about 85% being basically the human threshold on ARC-AGI-1. I think that, much like the battle over MMLU, where people were like, we got 88.8, well, we got 88.9, well, we got 90.1, at that point you're redlining what the signal is actually telling you, and you're losing signal; you get diminishing returns on the signal that comes from it.
Greg Kamradt [00:38:09]: So I think ARC-AGI-1, for anything between like 5 to probably 90%, gives you really good signal on where something is. Anything outside those bounds isn't giving you a ton. So I think it's still a really useful tool today. It'll eventually go out of vogue, though, once models get so good at it.
Demetrios [00:38:30]: It's getting closer and closer. So it's almost like you see the end of this lifespan.
Greg Kamradt [00:38:35]: And here's the deal with benchmarks: there isn't one benchmark to rule them all. If you want to understand a model's capabilities, you need a portfolio. Not only that, but look how many benchmarks had their place and then were phased out because they did their job. Just for example, look at ImageNet. What happened with that? 2012, a big dataset of images that had a huge impact on the industry, and it did its job. Does anybody report on ImageNet today? They don't, and that's okay. They have other types of benchmarks that go deeper into image and vision capabilities, to do a better job of it.
Greg Kamradt [00:39:10]: So that's where ARC-AGI-2 sits. But like I said, we're not seeing meaningful performance on it yet, enough to have it give us a ton of signal.
Demetrios [00:39:17]: Brute force it if you want, but then you're kind of defeating the purpose, totally. And so even though you can do it, should you?
Greg Kamradt [00:39:28]: So we run a Kaggle competition to try to beat ARC-AGI-2. The incentives there are to beat it by any means necessary, because we have money on the line, and within the competition rules there's no requirement on the type of solution. So people brute force the crap out of that all the time. That's a whole other part of my life. If anyone wants to talk to me about benchmarks, great, I can talk about that all day long. If anybody wants to talk to me about running an AI competition, talk to me about it all the time.
Greg Kamradt [00:39:52]: We did it all last year. We put a million dollars up to anybody who could beat ARC-AGI on Kaggle, and nobody was able to. But we saw leaderboard probing, we saw people getting around the rules, we saw where our incentives were off. We made assumptions, and participants' incentives were not in line with our assumptions.
Demetrios [00:40:10]: Wait, how so?
Greg Kamradt [00:40:13]: Yeah, basically, if you wanted to win prize money, you needed to open source your solution. We thought that the monetary incentive would be enough to make people open source their solution. There was one group out there who had a really strong solution, really, really awesome, and they made the choice not to. I'm not exactly sure of the exact reason; it was one of two. Either they thought they had a better chance by not open sourcing their solution and competing next year for the grand prize, so they wanted the $700,000 instead of just the yearly $100,000, or it was because it was so close to their startup's proprietary information that they didn't want to open source it.
Greg Kamradt [00:40:54]: Both of which are not necessarily in the spirit of what we were aiming for as a competition. But we did not properly construct the incentives, or communicate early enough that this was an issue. And so we basically did what we could, which is we took them off the leaderboard, because you're not placing if you don't open source. And then this year we're being much more clear about our intentions with this.
Demetrios [00:41:20]: How, how are you aligning incentives now?
Greg Kamradt [00:41:23]: Through better communication. And not only that, this year we have a public and a private leaderboard. So the leaderboard that's seen right now is all just based off of public data. But the final leaderboard, the one that says whether or not you've even placed or done well, is all on hidden data. And if you want to get your private score, you need to open source.
Demetrios [00:41:40]: Okay.
Greg Kamradt [00:41:40]: Yeah, so we're hoping that that does it. Either way, we're not talking about hundreds of thousands of teams here. We're Talking about maybe 10 teams that are in the running. I can go and I can go and have conversations with each one of those tens and make sure that they're seeing it the same way.
Demetrios [00:41:55]: And also the gaming of the leaderboard, you saw that?
Greg Kamradt [00:41:58]: Yeah. I mean, people get creative, because money's on the line. And Kagglers are professional competition people. They're really good at data science stuff and they're really good at playing competitions. One thing that I saw is that they will try to suss out attributes about the hidden ARC tasks one at a time. What they'll do is put a wait statement in their script that says, if you see this particular task attribute, wait 50 seconds. And then when they submit their solution to Kaggle, they see: did it run instantly, or did it wait 50 seconds? That's a way you can tease out some more information about it.
Greg Kamradt [00:42:33]: Because the only other information you get is a score; you get a single integer, which is your score, out the other end. You can't really tell that much from that. Kaggle tries to prevent this a little bit with some obfuscation about how long it actually took. But people get creative like that.
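What that probing trick amounts to is a timing side channel: the only output a competitor sees is a single score, so they smuggle an extra bit out through runtime. A hypothetical sketch of what such a probe might look like inside a submission script; the attribute being tested here is made up.

```python
import time

def has_attribute_under_test(task):
    """Hypothetical predicate being probed, e.g. 'grid is larger than 20x20'."""
    return len(task["input"]) > 20

def solve(task):
    prediction = task["input"]  # trivial placeholder "solution"
    # Probe: if any hidden task has the attribute, the whole submission takes
    # ~50 extra seconds, which the submitter can observe from the outside.
    if has_attribute_under_test(task):
        time.sleep(50)
    return prediction
```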
Demetrios [00:42:48]: And now with the ARC Prize 2.
Greg Kamradt [00:42:52]: Yeah.
Demetrios [00:42:54]: Do you have to create a variety of different tasks or is it very much in? All right, we're in this one field.
Greg Kamradt [00:43:03]: Sure.
Demetrios [00:43:04]: Trying to do it well.
Greg Kamradt [00:43:05]: So you brought up a question earlier which was good and I didn't answer fully, which is: does it take a lot of research and deep thinking to build these things? I would say, for Francois's paper in 2019, it took a lot of work to put that hypothesis together, that definition of intelligence. Out of that came: okay, using this definition of intelligence, what would a problem look like that would actually test these things? And that's where the ARC paradigm came in. What it is, basically, is you have an input grid and an output grid, and it looks like a checkerboard. And you see, okay, the input turns into the output some way; I need to figure out how you transform the input into the output. You get a few examples and then you get a test. And on that test, you only have the input.
Greg Kamradt [00:43:47]: Your goal is to go cell by cell and type out what the output would be. The important part is that each separate problem on ARC-AGI requires a different rule, a different transformation, to actually solve. So what I mean by that: it's super variety, super variety. It's almost like a meta thing, and I'll get to why this is important in a second. Let's just say one ARC task has a square on it, and on the input-output pairs, all you're doing is adding a border to the square. Okay.
Greg Kamradt [00:44:18]: Now on the test input, we're going to give you a square, and on the output you just need to put a border. Okay, cool. That border transformation rule will only be asked once. On another task, what we might ask you to do is fill in the corner of every single shape, so you go fill in the corners of all those different shapes. And so what we're forcing the tester to do is learn a new mini skill in each one of those questions, and then we're forcing you to demonstrate that you've learned that skill on the test.
Demetrios [00:44:47]: By doing it.
Greg Kamradt [00:44:48]: By doing it. Which goes back to Francois's definition of intelligence, which is learning new skills.
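To make the format concrete: an ARC task is just small integer grids, and the 'add a border' rule can be written in a few lines. This is a hand-rolled illustration of the idea, not code from the ARC repositories.

```python
def add_border(grid, color=2):
    """Return a copy of the grid wrapped in a one-cell border of `color`."""
    width = len(grid[0])
    border_row = [color] * (width + 2)
    middle = [[color] + row + [color] for row in grid]
    return [border_row] + middle + [border_row]

# One training pair: the solver has to infer the 'add a border' rule from
# examples like this, then apply it to a fresh test input it has never seen.
example_input = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
for row in add_border(example_input):
    print(row)
```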
Demetrios [00:44:53]: That is so simple for us, right? Like write a border on a square. But it is, but it's exactly that.
Greg Kamradt [00:45:01]: And the reason why it's so hard for machines is because humans are very good at abstraction and reasoning. It's like, oh, duh, just put a border. Yeah, okay. But that's actually really hard for AI to go do. Now, with ARC-1, people are like, oh, it's so simple, it's not a good test of AI. Well, keep in mind, for five years it was unbeaten. Right.
Greg Kamradt [00:45:18]: And it actually pinpointed the moment: right when models started to get good on it was the exact moment that reasoning models took off. That's something really interesting about reasoning models, and using ARC-AGI-1 as a capabilities assertion, you can actually tell something about reasoning models: there's a non-zero level of fluid intelligence that actually comes from them. Which is very cool. ARC-AGI-2 is a simple extension of the ARC-AGI-1 domain. We still have input-output, we still ask you to figure out rules. The difference is that the rules are much deeper and they require a bit more thought from a human perspective to do.
Greg Kamradt [00:45:49]: So instead of just doing a border, we might ask you to do a border and do the corners.
Demetrios [00:45:53]: Yeah. So there's put an X in or.
Greg Kamradt [00:45:55]: Put an X and now there's two rules. I won't go to the details on it. We actually have a full. Francois put it together. A We hosted a private preview of ARC AGI 2 for donors for. For Arcprice because we're a nonprofit. I should have said that earlier. Nonprofit.
Greg Kamradt [00:46:07]: And he gave a 30-minute presentation on ARC-AGI-2.
Demetrios [00:46:10]: Wow.
Greg Kamradt [00:46:11]: But what I want to talk about is ARC-AGI-3. Of course, ARC-AGI-3 is going to be departing from the ARC-AGI-1 and ARC-AGI-2 framework, that style of doing it. That's a very scoped and narrow domain: if you just have matrices, input, output, fill them in or whatever, it's very scoped. You don't have very many axes of freedom there. So we are taking inspiration from simulations and games.
Demetrios [00:46:37]: Oh, nice.
Greg Kamradt [00:46:38]: So back in, I think, 2017, DeepMind put together an exploration they called Agent 57. They tried to get an agent, more or less an RL agent, to go and try to beat a bunch of different Atari games, right? There were like four that they didn't solve, which is super fascinating. What ARC-AGI-1 and 2 don't make you do is: they don't make you figure out what the goal is, they don't make you figure out the rules of the environment, and they don't make you have long-term memory with hidden states, where you learn something early on in the game and you have to remember that it still applies later on in the game. And so what I tell people is, if you can make an AI that beats one game.
Greg Kamradt [00:47:14]: Well, we've done that a bunch. We've made AI beat chess, we've made AI beat Go. Okay, cool. If you can make an AI that beats 50 games, that's much more interesting. But the problem is that those 50 games are all public, and you can have developer intelligence and developer intuition as to how to go beat those 50 games. What ARC-AGI-3 is going to be is: we're going to make AI beat 50 games it has never seen beforehand, and they're each novel from each other. And that is a much further extension, and many more axes of freedom, in where we're taking this.
Greg Kamradt [00:47:46]: What you can assert about the model that beats it is it will have had no choice but to interact with its environment, learn the rules of the game in 50 different novel situations.
Demetrios [00:47:55]: But you're not letting it just simulate for hours and hours. Or maybe you are. That's the test time compute type thing.
Greg Kamradt [00:48:02]: We will. And that's where efficiency comes into it. So what we're going to do is test 400 humans on those 50 games, and we're going to see how many actions it takes for a human to actually solve a game, and how many actions it takes for an AI to solve it. That's where we get our efficiency metric, in addition to the cost and energy that comes with it.
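A minimal sketch of the efficiency comparison being described: per game, how many actions did the AI need relative to the humans who solved it. The counts below are invented placeholders.

```python
def action_efficiency(human_actions, ai_actions):
    """Per-game ratio of AI actions to human actions (lower is better)."""
    return {game: ai_actions[game] / human_actions[game] for game in human_actions}

# Invented placeholder counts, just to show the shape of the comparison.
human = {"game_01": 40, "game_02": 75}
ai = {"game_01": 2_000, "game_02": 900}
print(action_efficiency(human, ai))  # {'game_01': 50.0, 'game_02': 12.0}
```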
Demetrios [00:48:18]: I think we gotta go.
Greg Kamradt [00:48:19]: I just saw. Beautiful, man.
Demetrios [00:48:20]: Boss man. That's a great way of ending it, man. We'll cut it there.
Greg Kamradt [00:48:23]: It's too good.
Demetrios [00:48:25]: You were awesome, man. This is great. But it was perfect. It was like.