
Yes, it's pretty similar to Raven's Progressive Matrices. The reason it's an interesting benchmark is that humans, even very young humans, "get" the test, in the sense of understanding what it's asking, and can do pretty well on it - but LLMs have really struggled with it in the past.

Chollet (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollet and others) that the models weren't "truly reasoning" but rather just completing based on things they'd seen before - when confronted with things they hadn't seen before, namely novel visual patterns, they really struggled.
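
For concreteness, here's a minimal sketch of what an ARC task looks like. The public dataset (github.com/fchollet/ARC) stores each task as a JSON file with a few "train" demonstration pairs and one or more held-out "test" pairs, where every grid is a small 2D array of integers 0-9 standing for colors. The grids below are made up for illustration (the toy rule is "swap the two rows"); they're not real tasks from the dataset:

    import json

    # One ARC task: a few "train" demonstration pairs plus a held-out
    # "test" pair. Every grid is a 2D array of integers 0-9, each integer
    # mapping to a color. The solver must infer the transformation from
    # the train pairs and apply it to the test input.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        ],
        "test": [
            {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
        ],
    }

    # Print each pair; the hidden rule in this toy example is a row swap.
    for pair in task["train"] + task["test"]:
        print(json.dumps(pair["input"]), "->", json.dumps(pair["output"]))

The point is that each task has its own hidden rule, demonstrated by only a handful of examples, which is why memorizing patterns from training data doesn't help much.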
