Good question! This was one of the main motivations of our "Paper Prize" track. We wanted to reward conceptual progress over leaderboard chasing. In fact, when we increased the prizes mid-year, we put more money toward the paper track than toward the top score.
We had 40 papers submitted last year and 8 were awarded prizes. [1]
One of the main teams, MindsAI, just published their paper on their novel test-time fine-tuning approach. [2]
Jan/Daniel (1st place winners last year) talk all about their progress and the journey of building out their solution here [3]. Stories like theirs help push the field forward.
Alongside Mike Knoop and François Chollet, we’re launching ARC-AGI-2, a frontier AI benchmark that measures a model’s ability to generalize on tasks it hasn’t seen before, and the ARC Prize 2025 competition to beat it.
In Dec ‘24, ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization, as demonstrated by OpenAI's o3.
ARC-AGI-2 targets test-time reasoning.
My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.
Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
Every ARC-AGI-2 task (100%), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.
Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.
Change log from ARC-AGI-1 to ARC-AGI-2:
* The two main evaluation sets (semi-private eval, private eval) have each been increased to 120 tasks
* Solving tasks requires more reasoning vs pure intuition
* Each task has been confirmed to have been solved by at least 2 people (often many more) out of an average of 7 test takers, in 2 attempts or fewer
* Non-training task sets are now difficulty-calibrated
The 2025 Prize ($1M, open-source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) drew 1.5K participating teams and produced 40+ research papers.
> Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI" then we effectively have AGI.
I don’t think that follows. Just because people fail to create ARC-AGI problems that are difficult for an AI to solve, doesn’t mean that said AI can just be plugged into a humanoid robot and it will now reliably cook dinner, order a pizza and drive to pick it up, take a bus to downtown to busk on the street and take the money back home, etc.
ARC-AGI is an interesting benchmark, but it’s extremely presumptive to think that these types of tests are going to demonstrate AGI.
That’s precisely what I meant in my comment by “these types of tests.” People are eventually going to have some sort of standard for what they consider AGI. But that doesn’t mean the current benchmarks are useful for this task at all, and saying that the benchmarks could be completely different in the future only underscores this.
How are any of these a useful path to asking an AI to cook dinner?
We already know many tasks that most humans can do relatively easily, yet most people don’t expect AI to be able to do them for years to come (for instance, L5 self-driving). ARC-AGI appears to be going in the opposite direction - can these models pass tests that are difficult for the average person to pass?
These benchmarks are interesting in that they show increasing capabilities of the models. But they seem to be far less useful at determining AGI than the simple benchmarks we’ve had all along (can these models do everyday tasks that a human can do?).
The tasks you mention require intelligence but also a robot body with a lot of physical dexterity suited to a designed-for-humanoids world. That seems like an additional requirement on top of intelligence? Maybe we do not want an AGI definition to include that?
There are humans who cannot perform these tasks, at least without assistive/adapted systems such as a wheelchair and accessible bus.
> at least without assistive/adapted systems such as a wheelchair and accessible bus.
Which is precisely what the robotic body I mentioned would be.
You're talking about humans who have the mental capacity to do these things, but who don't control a body capable of doing them. That's the exact opposite of an AI that controls a body capable of doing these things, but lacks the mental capacity to do them.
I read that as “humans can perform these tasks, at least with…”
Put the computer in a wheelchair of its choice and let it try to catch the bus. How would you compare program and human reasoning abilities while disregarding the human's ability to interact with the outside world?
Edit: ARC-AGI itself is only approachable by humans with full sight and manual dexterity; others need assistive devices.
What are you doing to prevent the test set being leaked? Will you still be offering API access to the semi private test set to the big model providers who presumably train on their API?
1. Public Train - 1,000 tasks that are public
2. Public Eval - 120 tasks that are public
So for those two we don't have protections.
3. Semi Private Eval - 120 tasks that are exposed to 3rd parties. We sign data agreements where we can, but we understand this is exposed and not 100% secure. It's a risk we are open to in order to keep testing velocity. In practice it is very difficult to secure this 100%. The cost to create a new semi-private test set is lower than the effort needed to secure it 100%.
4. Private Eval - Only on Kaggle, not exposed to any 3rd parties at all. Very few people have access to this. Our trust vectors are with Kaggle and the internal team only.
What prevents everything in 4 from becoming part of 3 the first time the test set is run on a proprietary model? Do you require competitors like OpenAI to provide models Kaggle can self-host for the test?
Sorry, I probably phrased the question poorly. My question is more along the lines of "when you already scored, e.g., OpenAI's o3 on ARC-AGI-2, how did you guarantee OpenAI can't just look at its server logs to see question set 4?"
1. We had a no-data retention agreement with them. We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing
2. We only tested o3 against the semi-private set. We didn't test it with the private eval.
Are you aware that OpenAI brazenly lied and went back on its word about its corporate structure, board governance, and for-profit status, and of the opinion that your data sharing agreement is different and less likely to be ignored? Or are you at step zero where you aren’t considering malfeasance as a possibility at all?
>> We were assured by the highest level of their company + security division that the box our test was run on would be wiped after testing
Uri Geller assured us he was bending the spoons with his mind. Somehow it was only when the Amazing Randi was present that Uri Geller couldn't bend the spoons with his mind.
Ironically, "I have a magic AI test but nobody is allowed to use it" is a lot closer to the Uri Geller situation. Tests are meant to be taken; that should be clear. And...maybe this does not apply in the academic domain, but to some extent, if you cheat on an AI test "you're only cheating yourself."
And end users and developers and the general public too...
But here is the thing: if it's rote memorization, why couldn't GPT-4o perform just as well on ARC-AGI-1? Or did the "reasoning" help in some way?
I'm really pleased to see this! The original ARC-AGI-1 paper still informs how I think about "what is intelligence" today. I was thrilled to see AI models make real progress on that test precisely when we had the next big idea (reasoning). Here's hoping round 2 coincides with a similarly big breakthrough!
That's a very small sample size per task. I wonder what the result would be if they gave the whole data set to an average human. I tried some simple tasks and they are doable, but I couldn't figure out the hard ones.
No, they're saying that the problems have been reviewed/play-tested by ≥2 humans, so they are not considered unfair or too ambiguous to solve in two attempts (a critique of some ARC-AGI-1 puzzles that o3 missed). They have a lot of puzzles, so they were divided among some number of testers, but I don't think every tester had to try every problem.
I think a lot of people got discouraged, seeing how OpenAI solved ARC-AGI-1 by what seems like brute-forcing and throwing money at it. Do you believe ARC was solved in the "spirit" of the challenge? Also, all the open-sourced solutions seem super specific to solving ARC. Is this really leading us to human-level AI at open-ended tasks?
Just want to say I really love these new problems - feels like some general intelligence went into conceiving of and creating these puzzles: we just did a few over dinner as a family.
You have my wheels turning on how to get computers better at these. Looking forward to seeing the first computer tech that can get 30-50% on these!
Why wasn’t the ICOM framework (D. Kelley) allowed to make a scoring submission after they claimed to have beaten the scores? Are you concerned that may appear to contradict your mission statement and alienate the AGI community?
Which puzzles had the lowest solve rate? I did the first 10 and they all felt easy (mentally solved in 10-20 seconds for the easier ones and 30-60 seconds for the harder ones). I’d like to try the most difficult ones.
From ChatGPT 3.5 to o1, all LLM progress came from investment in training: either by using much more data, or by using higher-quality data thanks to artificial data.
o1 (and then o3) broke this paradigm by applying a novel idea (RL + search on CoT), and it's because of this that they were able to make progress on ARC-AGI.
So IMO the success of o3 supports the argument that we are in an idea-constrained environment.
This isn't a novel idea - some people tried the exact same thing the day GPT4 came out.
And going back even further, there's Goal Oriented Action Planning - an old-school video game AI technique that's basically searching through solution space to construct a plan:
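For anyone who hasn't run into GOAP: at its core it's just a search over actions with preconditions and effects until a goal state is satisfied. A minimal sketch in Python; the actions and state keys are made up purely for illustration, not taken from any real game:

    from collections import deque

    # Toy action set: name -> (preconditions, effects). Entirely illustrative.
    ACTIONS = {
        "get_axe":    ({},                 {"has_axe": True}),
        "chop_wood":  ({"has_axe": True},  {"has_wood": True}),
        "build_fire": ({"has_wood": True}, {"warm": True}),
    }

    def plan(state, goal):
        """Breadth-first search over world states; returns a list of action names."""
        start = frozenset(state.items())
        queue, seen = deque([(start, [])]), {start}
        while queue:
            current, steps = queue.popleft()
            world = dict(current)
            if all(world.get(k) == v for k, v in goal.items()):
                return steps
            for name, (pre, effect) in ACTIONS.items():
                if all(world.get(k) == v for k, v in pre.items()):
                    nxt = frozenset({**world, **effect}.items())
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, steps + [name]))
        return None

    print(plan({}, {"warm": True}))  # -> ['get_axe', 'chop_wood', 'build_fire']

Real GOAP implementations typically use A* with per-action costs instead of plain BFS, but the search-through-action-space idea is the same.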
Not Greg/team, so unaffiliated opinion. The o3 solution for ARC v1 was incredibly expensive. At minimum, some good ideas are needed to take that cost down by a factor of 100-10,000x.
Yeah, my analogy for that solution is that it's like claiming to have solved sorting arrays by using enormous compute to try all possible orderings of an array of length 100.
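To make the scale of that analogy concrete, here's a rough sketch (the function name is mine, not the commenter's): brute-force "sorting" by trying every permutation works for tiny inputs but faces 100! ≈ 9.3×10^157 orderings at length 100.

    import itertools
    import math

    def brute_force_sort(xs):
        """Return a sorted copy of xs by trying every possible ordering."""
        for perm in itertools.permutations(xs):
            if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
                return list(perm)

    print(brute_force_sort([3, 1, 2]))  # [1, 2, 3] -- fine for tiny arrays
    print(math.factorial(100))          # number of orderings at length 100 (~9.3e157)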
Wow, these are gorgeous! Can I ask where you get your DEM data and at what resolution? I’ve been wanting to play around with some relief maps of various bioregions (eg the Great Lakes watershed, Cascadia, etc), but I’ve had trouble figuring out where to find data at the right resolution
Where do we get it?
Only publicly available sources. USGS has a great portal. Private data is too expensive to get; I was quoted six figures for a larger area. They were going to fly a plane and capture it :)
What resolution?
Totally depends on the area the customer would like to cover. If it’s their ranch or property, we usually need 1-meter. If it’s a mountain range, then 30-meter works.
It mainly depends on the resolution limit for 3D printing. So it also depends on the size of the model they want.
Unfortunately, not all areas are covered with high-res data.
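A rough back-of-the-envelope check of why 30-meter data is usually enough for a mountain range, using assumed numbers (a ~100 km wide area printed ~200 mm wide), not figures from the post:

    terrain_width_m = 100_000   # assumed: ~100 km wide mountain range
    print_width_mm  = 200       # assumed: desktop-printer sized model
    dem_res_m       = 30        # SRTM-class 30 m data

    real_mm_per_print_mm = terrain_width_m * 1000 / print_width_mm
    cell_on_model_mm = dem_res_m * 1000 / real_mm_per_print_mm
    print(f"one 30 m DEM cell = {cell_on_model_mm:.3f} mm on the model")
    # ~0.06 mm per cell -- already finer than a typical 0.4 mm FDM nozzle,
    # so higher-resolution data only pays off for small areas like a single property.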
For the benefit of those who aren't familiar: SRTM was the Shuttle Radar Topography Mission. 30m resolution is now available (I remember 30m being released for the first time sometime in the 2000s, I think).
What are you using to get it into an STL? I've had OK results with the DEMto3D plugin, but it has some weird artifacts that I can't seem to get rid of. Are there any better options you're aware of?
I actually use DEMto3D. It's touchy, but I do post-work on the .stl/3D model in Blender, so it works out OK for me.
If you have weird artifacts, I'm guessing that is due to the underlying data rather than QGIS itself. Have you looked at their documentation (https://demto3d.com/en/)?
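For a scripted alternative to the plugin, here's a minimal sketch assuming a GeoTIFF heightmap and the rasterio + numpy-stl packages; the file names, downsample factor, and vertical exaggeration are placeholders, and x/y are left in grid-cell units (scale to real units in Blender or by multiplying by the pixel size):

    import numpy as np
    import rasterio
    from stl import mesh

    with rasterio.open("dem.tif") as src:   # placeholder input file
        z = src.read(1).astype(float)

    z = z[::10, ::10]   # downsample so the mesh stays printable
    z *= 1.5            # optional vertical exaggeration

    rows, cols = z.shape
    tris = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            # two triangles per grid cell, with (col, row, height) as (x, y, z)
            a = (c,     r,     z[r, c])
            b = (c + 1, r,     z[r, c + 1])
            d = (c,     r + 1, z[r + 1, c])
            e = (c + 1, r + 1, z[r + 1, c + 1])
            tris.append([a, b, d])
            tris.append([b, e, d])

    surface = mesh.Mesh(np.zeros(len(tris), dtype=mesh.Mesh.dtype))
    surface.vectors[:] = np.array(tris, dtype=np.float32)
    surface.save("terrain.stl")   # top surface only; add walls and a base in Blender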
You have fewer products for sale than I was expecting.
I live in NE Los Angeles, which has the Verdugo and San Gabriel mtns, plus Mt Washington and other modest peaks - I think it would look great in this style. Especially because all the development would be excluded.
For every new location you do, the process looks like:
1. Get the data and prep it for print (fixed)
2. 3D print it (fixed)
3. Rubber Mold (fixed)
4. Wax Model (variable)
5. Bronze (variable)
Steps 1-3 are 40-60% of the costs. So I haven't put the money out of pocket yet to put up new locations. I've let customers ask first and then done them.
+1 to this. I can think of a few places I'd order for, but I couldn't figure out how. Did I miss the link? It looks like it refers to "custom orders" but I wasn't sure how that works.
Use these plates on a larger scale to make terra-cotta impressions and turn them into 'chia-pets' so that you can grow microgreens on the landscape of a particular area.
I agree. Having the experience to spot common pitfalls or 'weak' looking stats is key. Like any craft, there is no easy way to learn this other than experience.
Whenever I'm advising rising data analysts/scientists, I tell them to understand three areas of background knowledge that will multiply their ability to pull out an insight. They'll help them connect disparate ideas together.
Stats is easy but connecting ideas is hard. That's where the real magic happens. Plus, computers have a hard time automating this, for now.
1) Product Knowledge - How well do you know the product? This is easy for a simple app, but for large enterprise apps there are many features to keep track of. If you don't know the full context of your product, how could you frame your analysis in a larger picture?
2) Stakeholder Empathy - Whether you like it or not, as a data person you're advancing a business cause/mission. This means you need to fully understand where the business has been and where it is going. The basic question is - what are your stakeholders' priorities? Why do they matter?
3) Customer Empathy - Arguably the most important of them all: how well do you know the customer? I encourage data scientists to get away from the computer and in front of the customer. Listen in on user research calls and ask questions. Drive as a Lyft driver, deliver food [1], etc.
Unfortunately these three areas are soft skills and you won't know you've improved until you find yourself reciting a fact. Usually you'll think "well duh, because the customer thinks this." It'll seem obvious, but it is only because you went through the trenches to learn that fact.
Original author here. Super excited to see this on HN!
That's a great breakdown. I hadn't put my finger on stakeholder or customer empathy before, but I agree they're critical skills.
> Unfortunately these three areas are soft skills and you won't know you've improved until you find yourself reciting a fact. Usually you'll think "well duh, because the customer thinks this." It'll seem obvious, but it is only because you went through the trenches to learn that fact.
Definitely. This reminds me of Sivers' "Obvious to you, Amazing to others" (https://sive.rs/obvious).
I'm trying to spend the rest of the year documenting as much of this soft-knowledge as I can. A lot of the data science hype over-focuses on hard skills and misses these soft skills.
How To Fail At Almost Everything And Still Win Big - Scott Adams (2013)
One of my favorite quotes:
“I put myself in a position where luck was more likely to happen. I tried a lot of different ventures, stayed optimistic, put in the energy, prepared myself by learning as much as I could, and stayed in the game long enough for luck to find me.” (pg. 158)
1. Use systems, not goals. A system lets you feel good every time you follow it, whereas a goal only makes you feel good when you reach it
2. Combination of skills. If you can be good (say top 20%) in more than one domain, then that combination of skills can be enough to make you very sought after.
3. What all adults should know, like public speaking, psychology, business writing, accounting, design, and conversations.
4. Learning from failures. This is a theme throughout the book. Each failure can teach you something. If you attempt something and fail, you at least gained experience. This experience will be useful for your next project.
Regarding #2: Does the book go into any specifics on how exactly you're going to be sought after, or at least how to look for the people who need generalists?
I consider myself a generalist but I never ever see much interest in hiring someone like me. There's always demand for a person who's a focused pro in some niche area AND also possesses a cloud of tangential skills, though.
> Everyone has at least a few areas in which they could be in the top 25% with some effort. In my case, I can draw better than most people, but I’m hardly an artist. And I’m not any funnier than the average standup comedian who never makes it big, but I’m funnier than most people. The magic is that few people can draw well and write jokes. It’s the combination of the two that makes what I do so rare. And when you add in my business background, suddenly I had a topic that few cartoonists could hope to understand without living it.
The idea isn't about being a generalist, but rather being valuable because you're like getting two okay guys or gals in one package.
Adams' own example is how he's by no means a highly talented artist nor is he a top comedian, but the combination of being halfway decent with a pen and having a better than average sense of humor suddenly puts a person into a much smaller group on the Venn diagram. And adding in just one more thing - his experience in the corporate business world, allowing him to create strips a lot of people could relate to - was enough to catapult Dilbert into a global phenomenon.
There are tons of moderately funny people in the world. And many okay line artists. And it's not hard to find someone with experience working in a corporate office. But the number of people who meet all three criteria is incredibly tiny. Heck, just having two of the three is quite rare.
The point being, it's far easier to become a big success by being above average in a few things, than it would be to try to be one of the best in a single area.
Finding a way to combine your skills to make that success is the key, of course. And it may require learning new skills or improving areas in which you're merely average.
One of the big themes of the book is how you shouldn't worry too much about trying new things and failing. For one, humans are terrible at anticipating what sort of work we would truly enjoy or be good at, and the only reliable way to find a true match is to try a lot of things and keep redirecting yourself. And for another thing, any skills you learn along the way only increase the odds of eventually finding a combination of skills that can lead to great success.
One thing many people reading the book overlook, I think, is that being mediocre in a lot of skills isn't the point. It can't hurt, of course, but the idea is to be above average in a combination of skills that can be utilized together in an interesting way. Recognizing that combination is more likely to be a process of trial and error than of high-minded planning.
I think the idea is that you are somewhat a specialist in 2 areas (but not a worldclass specialist). As I recall, that advice does not apply to being a generalist.
Top 20% is definitely good enough; the value of bridging two normally unrelated fields will improve your advantage by many factors.
I also used to enjoy Scott Adams' blog and ideas; that's why it was quite shocking to me when I revisited his site and found out what he has turned into. I can't take him seriously anymore.
It's not just his politics. He has become a delusional and egotistical person.
He kept talking about how Twitter shadowbanned him for months. When he was on Joe Rogan's podcast, Joe Rogan suggested ways to test whether this was true. He got upset and tried to change the subject. When pressed by Rogan, he finally said: "I don't really want to find out. I just like the idea that I am important enough to be shadowbanned."
"what he has turned into" is roughly "a Trump fanboy" for those who don't want to go trawling through his blog. I also used to really enjoy his blog for his unconventional but mind-opening ideas before he started blogging about Trump in the pre-election runup.
It's more than being a "Trump fanboy". It's his confabulating of abstruse theories like "master puppeteer" and all this gibberish talk.
I find it irresponsible, and the opposite of what he has done before. Is this really the same person who created Dilbert and the above-mentioned useful book?
I 100% agree with your opinion. Initially, when he was discussing his analysis of Trump, it felt like he was going to tell us about Trump’s skills in persuasion etc., but it has just turned into him being a mouthpiece for Trump. It’s so sad that such a talented person can become that. I truly enjoyed the book, though.
That Adams would join the Trump train isn't all that surprising when you read the book. Right from the beginning, he mentions that he didn't get a promotion at work because of upper management's 'minority hiring' policies (this was in the '80s, I believe). He doesn't provide much information about this policy, only that he perceived himself to have been left behind because of it.
He does provide information later in the book: an explicit policy to no longer promote white males. That policy was communicated to him by his boss, and was one of his primary motivations to become self employed.