Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.
A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.
It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.
We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
Are there any single-step non-reasoner models that do well on this benchmark?
I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.
| Name | Semi-private eval | Public eval |
|--------------------------------------|-------------------|-------------|
| Jeremy Berman | 53.6% | 58.5% |
| AkyĂźrek et al. | 47.5% | 62.8% |
| Ryan Greenblatt | 43% | 42% |
| OpenAI o1-preview (pass@1) | 18% | 21% |
| Anthropic Claude 3.5 Sonnet (pass@1) | 14% | 21% |
| OpenAI GPT-4o (pass@1) | 5% | 9% |
| Google Gemini 1.5 (pass@1) | 4.5% | 8% |
https://arxiv.org/pdf/2412.04604

Here are the results for base models[1]:

| Name              | Semi-private eval | Public eval |
|-------------------|-------------------|-------------|
| o3 (coming soon)  | 75.7%             | 82.8%       |
| o1-preview        | 18%               | 21%         |
| Claude 3.5 Sonnet | 14%               | 21%         |
| GPT-4o            | 5%                | 9%          |
| Gemini 1.5        | 4.5%              | 8%          |

[1] Score (semi-private eval) / Score (public eval)

This emphasizes persons and a self-conceived victory narrative over the ground truth.
Models have regularly made progress on it, this is not new with the o-series.
Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well-thought-out exploration of True Intelligence(tm). It's one type of visual puzzle.
I don't mean to be negative, but to inject a memento mori. The real story is that some guys got together and rode on Chollet's name with some visual puzzles from ye olde IQ test, and the deal was that Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.
Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.
Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)
What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.
My initial impression: it's very impressive and very exciting.
My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.
As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.
I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
These comments are getting ridiculous. I remember when this test was first discussed here on HN and everyone agreed that it clearly proves current AI models are not "intelligent" (whatever that means). And people tried to talk me down when I theorised this test would get nuked soon - like all the ones before. It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you'd better bet it'll be sooner rather than later, and start preparing.
Failing the test may prove the AI is not intelligent. Passing the test doesn't necessarily prove it is.
What kind of preparation are you suggesting?
Start learning a trade
This is far too broad to summarise here. You can read up on Sutskever or Bostrom or hell even Stephen Hawking's ideas (going in order from really deep to general topics). We need to discuss everything - from education over jobs and taxes all the way to the principles of politics, our economy and even the military. If we fail at this as a society, we will at the very least create a world where the people who own capital today massively benefit and become rich beyond imagination (despite having contributed nothing to it), while the majority of the population will be unemployable and forever left behind. And the worst case probably falls somewhere between the end of human civilisation and the end of our species.
You should look up the terms necessary and sufficient.
The real issue is people constantly making up new goalposts to keep their outdated world view somewhat aligned with what we are seeing. But these two things are drifting apart faster and faster. Even I got surprised by how quickly the ARC benchmark was blown out of the water, and I'm pretty bullish on AI.
It doesn't need to be general intelligence or perfectly map to human intelligence.
All it needs to be is useful. Reading the constant comments that LLMs can't be general intelligence, lack reasoning, etc. seems to me like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).
It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.
> to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings
To me it is more like someone jumping on a pogo ball while flapping their arms and saying that they are flying whenever they hop off the ground.
Skeptics say that they are not really flying, while adherents say that "with current pogo ball advancements, they will be flying any day now"
> It doesn't need to be general intelligence or perfectly map to human intelligence.
> All it needs to be is useful.
Computers were already useful.
The only definition we have for "intelligence" is human (or, generally, animal) intelligence. If LLMs aren't that, let's call it something else.
People aren't responding to their own assumption that AGI is necessary, they're responding to OpenAI and the chorus constantly and loudly singing hymns to AGI.
I agree. If the LLMs we have today never got any smarter, the world would still be transformed over the next ten years.
I just googled ARC-AGI questions, and it looks like it is similar to an IQ test with Raven's matrices. Similar as in: you get some example images before and after, then a new "before" image, and you have to guess the "after".
Could anyone confirm whether this is the only kind of question in the benchmark? If yes, how is there such a direct connection to "oh this performs better than humans", when LLMs can be quite a bit better than us at understanding and forecasting patterns? I'm just curious, not trying to stir up controversy.
It's a test on which (apparently until now) the vast majority of humans have far outperformed all machine systems.
But it's not a test that directly shows general intelligence.
I am excited nonetheless! This is a huge improvement.
How does this do on SWE Bench?
ML is quite good at understanding and forecasting patterns when you train on the data you want to forecast. LLMs manage to do so much because we just decided to train on everything on the internet and hope that it included everything we ever wanted to know.
This tries to create patterns that are intentionally not in the data and see if a system can generalize to them, which o3 super impressively does!
Yes, it's pretty similar to Raven's. The reason it is an interesting benchmark is because humans, even very young humans, "get" the test in the sense of understanding what it's asking and being able to do pretty well on it - but LLMs have really struggled with the benchmark in the past.
Chollet (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollet and others) that models weren't "truly reasoning" but rather just completing based on things they'd seen before - when the models were confronted with things they hadn't seen before, the novel visual patterns, they really struggled.
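For anyone who hasn't looked at the tasks, here is a minimal sketch of what an ARC task looks like and what "checking a rule against the examples" means. The JSON shape ("train"/"test" lists of input/output grids of integers 0-9) is my understanding of the public ARC dataset; the toy task and the `mirror_rows` rule are invented purely for illustration.

```python
import json

# Toy example in (what I believe is) the public ARC dataset's JSON shape:
# a task has "train" and "test" lists of {"input": grid, "output": grid}
# pairs, where a grid is a list of rows of integers 0-9 (colors).
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # the solver must infer the rule and produce the output
    ],
}

def mirror_rows(grid):
    """Hypothetical rule for this toy task: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A candidate rule is only trusted if it reproduces every training pair.
assert all(mirror_rows(p["input"]) == p["output"] for p in example_task["train"])
print(json.dumps(mirror_rows(example_task["test"][0]["input"])))  # [[0, 3], [3, 0]]
```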
> My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.
But isn't it interesting to have several benchmarks? Even if it's not about passing the Turing test, benchmarks serve a purpose, similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.
It's certainly interesting. I'm just not convinced it's a test of general intelligence, and I don't think we'll know whether or not it is until it's been able to operate in the real world to the same degree that our general intelligence does.
Human performance is 85% [1]. o3 high gets 87.5%.
This means we have an algorithm to get to human level performance on this task.
If you think this task is an eval of general reasoning ability, we have an algorithm for that now.
There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.
Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!
[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
NNs are not algorithms.
As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.
But, still, this is incredibly impressive.
What's interesting is it might be much closer to human intelligence than to some "alien" intelligence, because after all it is an LLM trained on human-made text, which in a sense represents human intelligence.
In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.
In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.
I wonder how much of an effect amount of time to answer has on human performance.
Still, it's comparing average human-level performance with the best AI performance. Examples of things o3 failed at are insanely easy for humans.
The o3 high (tuned) model scored an 88% at what looks like $6,000/task, haha.
I think soon we'll be pricing any kind of task by its compute cost. So basically: human = $50/task, AI = $6,000/task, use the human. If the AI beats the human, use the AI? Of course, that's assuming both get 100% scores on the task.
This makes me speculate whether the solution comprises a "solver" trying semi-random or more targeted things and a "checker" checking them. Usually checking a solution is cognitively (and computationally) easier than coming up with it. Otherwise I cannot think what sort of compute would burn $6,000 per task, unless you are going through a lot of loops and you have somehow solved the part of the problem that can figure out whether a solution is correct, while coming up with the actual correct solution is not yet solved to the same degree. Or maybe I am just naive and these prices are just like breakfast for companies like that.
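For what it's worth, here's a rough sketch of the solver/checker loop being speculated about; earlier high-scoring LLM-based ARC entries (e.g. Greenblatt's) worked roughly this way, sampling thousands of candidate programs and keeping only those that reproduce every training pair. Whether o3 does anything like this internally is pure speculation, and `propose_program` is just a stand-in for the expensive sampling step.

```python
from typing import Callable, List

Grid = List[List[int]]

def verify(program: Callable[[Grid], Grid], train_pairs) -> bool:
    """Checking is cheap: keep a candidate only if it reproduces every training output."""
    try:
        return all(program(p["input"]) == p["output"] for p in train_pairs)
    except Exception:
        return False

def solve(task, propose_program: Callable, budget: int = 1000):
    """Spend up to `budget` proposals; return predictions from the first verified candidate."""
    for _ in range(budget):
        candidate = propose_program(task["train"])  # expensive sampling step (e.g. an LLM call)
        if verify(candidate, task["train"]):        # cheap deterministic check
            return [candidate(t["input"]) for t in task["test"]]
    return None  # budget exhausted with no candidate surviving verification
```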
Isn't that generally what... all jobs are? Automation cost vs. long-term human cost... it's why Amazon did the weird "our stores are AI driven" thing, but in reality it was cheaper to hire a bunch of guys in a sweatshop to look at the cameras and write things down, lol.
The thing is, given what we've seen from distillation and tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient processing hardware and software.
I remember hearing about Tesla trying to automate all of production, but some things just couldn't be, like the wiring, which humans still had to do.
Compute costs for AI at roughly the same capability have been halving every ~7 months.
That makes something like this competitive in ~3 years
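A quick back-of-envelope under those assumptions (just a sketch; the $6,000/task and $50/task figures are the rough numbers quoted earlier in this thread, not official pricing):

```python
import math

# All numbers are the rough figures quoted in this thread, not official pricing.
def projected_cost(start_cost: float, months: float, halving_months: float = 7) -> float:
    """Cost per task after `months`, if costs halve every `halving_months` months."""
    return start_cost / 2 ** (months / halving_months)

print(projected_cost(6_000, 36))    # ~$170/task after three years
print(7 * math.log2(6_000 / 50))    # ~48 months until it reaches ~$50/task
```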
That's the elephant in the room with the reasoning/COT approach, it shifts what was previously a scaling of training costs into scaling of training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars worth of compute every time you run a query.
Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.
Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.
Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
> the other is that there was an approach that made the task easier than we expected.
from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.
This feels like big news to me.
First of all, ARC is definitely an intelligence test for autistic people. I say this as someone with a tad of the neurodiversity. That said, I think it's a pretty interesting one, not least because as you go up in the levels, it requires (for a human) a fair amount of lateral thinking and analogy-type thinking, and of course, it requires that this go in and out of visual representation. That said, I think it's a bit funny that most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image. I continue to hope for some poet- and painter-derived intelligence tests to be added to the next gen tests we all look at and score.
For those reasons, I've always really liked ARC as a test -- not as some be-all end-all for AGI, but just because I think that the most intriguing areas next for LLMs are in these analogy arenas and ability to hold more cross-domain context together for reasoning and etc.
Prompts that are interesting to play with right now on these terms range from asking multimodal models to, say, count to ten in a Boston accent, and then propose an equivalent regional French accent and count to ten in that. (To my ear, 4o is unconvincing at this.) Similar in my mind is writing and architecting code that crosses multiple languages and APIs, and asking for it to be written in different styles. (Claude and o1-pro are... okay at this, depending.)
Anyway. I agree that this looks like a large step change. I'm not sure if the o3 methods here involve the spinning up of clusters of python interpreters to breadth-search for solutions -- a method used to make headway on ARC in the past; if so, this is still big, but I think less exciting than if the stack is close to what we know today, and the compute time is just more introspection / internal beam search type algorithms.
Either way, something had to assess answers and think they were right, and this is a HUGE step forward.
> most of the people training these next-gen AIs are neurodiverse
Citation needed. This is a huge claim based only on stereotype.
The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in the terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forward.
YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)
Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...
here's the right timestamp: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
It's good that it works, since if you ask GPT-4o to use the OpenAI SDK it will often produce invalid, out-of-date code.
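For reference, a minimal sketch of current-style (openai>=1.0) SDK usage, as opposed to the deprecated `openai.ChatCompletion.create(...)` pattern that models trained on older data tend to reproduce; the model name and prompt here are just placeholders.

```python
# Minimal sketch of current-style (openai>=1.0) SDK usage, in contrast to the
# deprecated `openai.ChatCompletion.create(...)` pattern older training data teaches.
# Assumes OPENAI_API_KEY is set in the environment; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```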
I would say they didn't need to demo anything, because if you are going to use the output code live in a demo it may have compile errors, and then they'd look stupid trying to fix it live.
Isn't this at the level now where it can sort of self-improve? My guess is that they will just use it to improve the model, and the cost they are showing per evaluation will go down drastically.
So, the next step in reasoning is open-world reasoning now?