> Models look god-tier on paper:
> - they pass exams
> - solve benchmark coding tasks
> - reach crazy scores on reasoning evals
Models don't look "god-tier" from benchmarks; an 80% is hardly godlike. Though I would really like more human baselines on these benchmarks, to get a sense of what an 80% actually means.
I would not say that any model shows a "crazy" score on ARC-AGI.
Broadly, I have seen incremental improvement on benchmarks since 2020, mostly at a level I'd judge to be below average human reasoning but above average human knowledge. No one would call GPT-3 godlike, and on benchmarks it is quite similar to modern models; it is not a difference of 1% vs 90%. I think most people would consider GPT-3 to be closer to Opus 4.5 than Opus 4.5 is to a human.
Roughly I'd agree, although I don't have hard numbers, and I'd say GPT-4 in 2023 vs GPT-3 as the last major "wow" release from a purely-model perspective. But they've also gotten a lot faster, which has its own value. And the tooling around them has gotten MASSIVELY better - remember the "prompt engineering" craze? Now there are plenty of tools that will take your two-sentence prompt, figure out how best to execute it based on local context (like a code repository), sometimes even asking you clarifying questions, and iterate by "re-prompting" themselves over and over, in a fraction of the time manual "prompt engineering" would have taken.
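Mechanically, that iteration loop is pretty simple. Here's a hypothetical sketch (not any particular tool's implementation; `call_model` stands in for whatever chat-completion API the tool wraps, and real tools layer planning, tool use, and repo-aware retrieval on top):

```python
# Minimal sketch of a self-"re-prompting" loop, as described above.
# Everything here is illustrative, not a real tool's code.

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call (swap in your provider's client)."""
    raise NotImplementedError

def run_task(task: str, context: str, max_rounds: int = 5) -> str:
    """Turn a terse request into a result by letting the model refine
    its own prompt against local context (e.g. a code repository)."""
    prompt = (
        f"Task: {task}\n"
        f"Local context:\n{context}\n"
        "Produce a result for the task."
    )
    result = ""
    for _ in range(max_rounds):
        result = call_model(prompt)
        followup = call_model(
            f"Task: {task}\nDraft:\n{result}\n"
            "If the draft fully satisfies the task, reply exactly DONE. "
            "Otherwise, write an improved prompt that fixes the gaps."
        )
        if followup.strip() == "DONE":
            break
        # The tool "re-prompts" itself with the model's own revision.
        prompt = followup
    return result
```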
Though I don't fully know where the boundary lies between "a model prompted to iterate and use tools" and "a model trained to be more iterative by design." How meaningful is that distinction?
But the people who don't get this are the less-technical/less-hands-on VPs, CEOs, etc, who are deciding on layoffs, upcoming headcount, "replace our customer service or engineering staffs with AI" things. A lot of those moves are going to look either really silly or really genius depending on exactly how "AGI-like" the plateau turns out to be. And that affects a LOT of people's jobs/livelihood, so it's good to see the hype machine start to slow down and get more realistic about the near-term future.
> I'd say GPT-4 in 2023 vs GPT-3 as the last major "wow" release from a purely-model perspective. But they've also gotten a lot faster, which has its own value. And the tooling around them has gotten MASSIVELY better
Tooling vs model is a false dichotomy in this case. The massive improvements in tooling are directly traceable back to massive improvements in the models.
If you took the same tooling and scaffolding and stuck GPT-3 or even GPT-4 in it, they would fail miserably and from the outside the tooling would look abysmal, because all of the affordances of current tooling come directly from model capability.
All of the tooling approaches of modern systems were proposed and prototypes were made back in 2020 and 2021 with GPT-3. They just sucked because the models sucked.
The massive leap in tooling quality directly reflects a concomitant leap in model quality.
How do you avoid overfitting with the automated prompts? From what I've seen, they tend to accumulate lots of special-case exceptions rather than generalizing the way a human would.
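For what it's worth, one common guard (I can't speak to what any specific tool does) is to score candidate prompts on a held-out split of examples they weren't tuned against. A minimal sketch, all names hypothetical and `score` standing in for your actual eval harness:

```python
# Minimal sketch of held-out validation for automated prompt tuning.
# All names are illustrative, not any particular tool's API.

import random

def score(prompt: str, examples: list) -> float:
    """Placeholder: fraction of examples the prompt handles correctly."""
    raise NotImplementedError

def pick_prompt(candidates: list, examples: list, seed: int = 0) -> str:
    """Prefer the candidate that generalizes, not the one that memorizes."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    split = int(0.7 * len(shuffled))
    train, held_out = shuffled[:split], shuffled[split:]

    best, best_val = candidates[0], float("-inf")
    for prompt in candidates:
        train_acc = score(prompt, train)
        val_acc = score(prompt, held_out)
        # A big train/held-out gap is exactly the "lots of exceptions"
        # smell: the prompt fits its tuning examples but won't generalize.
        if train_acc - val_acc > 0.2:
            continue
        if val_acc > best_val:
            best, best_val = prompt, val_acc
    return best
```

This doesn't stop exceptions from creeping in during tuning, but it at least catches prompts that won't generalize before you keep them.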
I dunno, some of the questions on things like Humanity's Last Exam sure strike me as "godlike." Yes, I'm happy that I can still crush LLMs on ARC-AGI-2, but I see the writing on the wall there, too. Barely over a year ago, LLMs were scoring what, single-digit percentages on ARC-AGI-1?
I would hope a god could do better than 40% on a test. If you selected human experts from the relevant fields, together they would get at least a passing grade (70%). A group of 20 humans is not godlike.