Are there any single-step non-reasoner models that do well on this benchmark?

I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.




    | Name                                 | Semi-private eval | Public eval |
    |--------------------------------------|-------------------|-------------|
    | Jeremy Berman                        | 53.6%             | 58.5%       |
    | Akyürek et al.                       | 47.5%             | 62.8%       |
    | Ryan Greenblatt                      | 43%               | 42%         |
    | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
    | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
    | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
    | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |

https://arxiv.org/pdf/2412.04604
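
(For anyone unfamiliar with the notation: pass@1 means the model gets a single sampled attempt per task. More generally, pass@k is usually estimated with the unbiased estimator from the Codex paper, Chen et al. 2021; a quick sketch in Python, names mine:)

  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      # Unbiased pass@k estimator (Chen et al., 2021): the chance that at
      # least one of k samples, drawn without replacement from n attempts
      # of which c are correct, passes.
      if n - c < k:
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  print(pass_at_k(n=10, c=3, k=1))  # 0.3; pass@1 reduces to the fraction correct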


Why is this missing the o1 release / o1 pro models? I'd love to know how much better they are.


This might be because they are referencing single-step models, and I do not think o1 is single-step.


Akyürek et al. use test-time compute.


Here are the results for base models[1]:

    | Name              | Semi-private eval | Public eval |
    |-------------------|-------------------|-------------|
    | o3 (coming soon)  | 75.7%             | 82.8%       |
    | o1-preview        | 18%               | 21%         |
    | Claude 3.5 Sonnet | 14%               | 21%         |
    | GPT-4o            | 5%                | 9%          |
    | Gemini 1.5        | 4.5%              | 8%          |

[1]: https://arcprize.org/2024-results


It's easy to miss, but if you look closely at the first sentence of the announcement, they mention that they used a version of o3 trained on the public ARC-AGI training set, so technically it doesn't belong on this list.


It's all a scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.


Just a clarification: they tuned on the public training dataset, not the semi-private one. The 87.5% score was on the semi-private eval, which means the model was still able to generalize well.

That being said, the fact that this is not a "raw" base model, but one tuned on the ARC-AGI test distribution, takes away from the impressiveness of the result. How much? I'm not sure; we'd need the score of the un-tuned base o3 model to know.

In the meantime, comparing this tuned o3 model to other un-tuned base models is unfair (an apples-to-oranges comparison).


Did they definitely do that, or just probably? Is there any source for it, so I can point it out to people?


I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.



