Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.
But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.
I'd agree with you if there hadn't been very deliberate work towards solving ARC for years, and if the conceit of the benchmark weren't specifically based on a conception of human intuition as, put simply, learning and applying out-of-distribution rules on the fly. ARC wasn't some arbitrary inverse set; it was designed to benchmark a fundamental capability of general intelligence.