The benchmark is designed to test for general intelligence, specifically the ability to solve novel problems.
If the hypothesis is that LLMs are the “computer” that drives the AGI, then of course the benchmark is relevant to testing for AGI.
I don’t think you understand the benchmark and its motivation. ARC-AGI problems are extremely easy for humans, but LLMs fail spectacularly at them. Why they fail is irrelevant; the fact that they fail means we don’t have AGI.
> The benchmark is designed to test for general intelligence, specifically the ability to solve novel problems.
It's a bunch of visual puzzles. They aren't a test for AGI because they're not general. If models (or any other system, for that matter) could solve them, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some specific kind of intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?
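For anyone who hasn't looked at the dataset: a task is a handful of small integer grids, distributed as JSON in the public ARC repo. Here's a made-up toy in that shape (not a real task), written as Python:

    # A toy task in the shape of an ARC puzzle (invented for illustration,
    # not from the real dataset). A few train pairs demonstrate a rule;
    # the solver must apply it to the test input. Cells are colors 0-9.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[0, 2], [2, 0]], "output": [[2, 0], [0, 2]]},
        ],
        "test": [
            {"input": [[0, 3], [3, 0]]},  # expected answer: [[3, 0], [0, 3]]
        ],
    }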
The name is marketing hype.
The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for it, because it's not a valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We could throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we're at it.
And all that is kind of irrelevant, because if LLMs had human-level general intelligence, they would solve all these questions correctly without blinking.
No human would score high on that puzzle if the images were given to them as a series of tokens. Even previous LLMs scored much better than humans if tested in the same way.
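To make "a series of tokens" concrete, here's roughly what the model is working with. The exact format varies by eval harness, so treat this as a sketch, not any lab's actual prompt:

    # Sketch of how a grid reaches an LLM: the 2D image is flattened into
    # a string, then chopped into subword tokens that carry no built-in
    # notion of rows, columns, or adjacency. (Format is illustrative only.)
    grid = [
        [0, 0, 7],
        [0, 7, 0],
        [7, 0, 0],
    ]

    def serialize(grid):
        # One line per row, cells space-separated.
        return "\n".join(" ".join(str(c) for c in row) for row in grid)

    print(serialize(grid))  # the "picture" a human sees becomes plain text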
And most humans would do poorly on maths problems if the input was given to them as binary. The reason that reversal isn't important is that tokens are an implementation detail of how an AI is meant to solve real-world problems that humans face, while no one cares about humans solving tokens.
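To spell out the analogy: nothing about the problem changes, only the encoding, and the encoding alone is enough to sink most people:

    # The same arithmetic in two encodings. Most humans answer the first
    # line instantly and stall on the second, though the math is identical.
    a, b = 45, 6
    print(f"{a} x {b} = {a * b}")                 # 45 x 6 = 270
    print(f"{bin(a)} x {bin(b)} = {bin(a * b)}")  # 0b101101 x 0b110 = 0b100001110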
Humans communicate with each other to get things done. We have to think carefully about how we communicate with each other, given the shortcomings of humans and the shortcomings of different communication mediums.
The fact that we might need to be mindful of how we communicate with a person/system/whatever doesn't mean much in the context of AI. Just as with humans, the details of how these systems work will need to be considered, and the standard trope of "that's an implementation detail" won't work.