> Models look god-tier on paper:
> - they pass exams
> - solve benchmark coding tasks
> - reach crazy scores on reasoning evals
Models don't look "god-tier" from benchmarks; an 80% is hardly godlike. Though I would really like more human baselines on these benchmarks, to get a sense of what an 80% actually means.
I would not say that any model shows a "crazy" score on ARC-AGI.
Broadly, I have seen incremental improvement on benchmarks since 2020, mostly at a level I'd judge to be below average human reasoning but above average human knowledge. No one would call GPT-3 godlike, and on benchmarks it is quite similar to modern models; it is not a difference of 1% vs 90%. I think most people would consider GPT-3 to be closer to Opus 4.5 than Opus 4.5 is to a human.
Roughly I'd agree, although I don't have hard numbers, and I'd say GPT-4 in 2023 vs GPT-3 as the last major "wow" release from a purely-model perspective. But they've also gotten a lot faster, which has its own value. And the tooling around them has gotten MASSIVELY better - remember the "prompt engineering" craze? Now there are plenty of tools that will take your two-sentence prompt, figure out how best to execute it based on local context (like a code repository), sometimes even asking you clarifying questions, and iterate by "re-prompting" themselves over and over, in a fraction of the time manual "prompt engineering" would have taken.
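Mechanically, that iteration loop is pretty simple. Here's a hypothetical sketch (not any particular tool's implementation; `call_model` stands in for whatever chat-completion API the tool wraps, and real tools layer planning, tool use, and repo-aware retrieval on top):

```python
# Minimal sketch of a self-"re-prompting" loop, as described above.
# Everything here is illustrative, not a real tool's code.

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call (swap in your provider's client)."""
    raise NotImplementedError

def run_task(task: str, context: str, max_rounds: int = 5) -> str:
    """Turn a terse request into a result by letting the model refine
    its own prompt against local context (e.g. a code repository)."""
    prompt = (
        f"Task: {task}\n"
        f"Local context:\n{context}\n"
        "Produce a result for the task."
    )
    result = ""
    for _ in range(max_rounds):
        result = call_model(prompt)
        followup = call_model(
            f"Task: {task}\nDraft:\n{result}\n"
            "If the draft fully satisfies the task, reply exactly DONE. "
            "Otherwise, write an improved prompt that fixes the gaps."
        )
        if followup.strip() == "DONE":
            break
        # The tool "re-prompts" itself with the model's own revision.
        prompt = followup
    return result
```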
Though I don't fully know where the boundary lies between "a model prompted to iterate and use tools" and "a model trained to be more iterative by design." How meaningful is that distinction?
But the people who don't get this are the less-technical/less-hands-on VPs, CEOs, etc, who are deciding on layoffs, upcoming headcount, "replace our customer service or engineering staffs with AI" things. A lot of those moves are going to look either really silly or really genius depending on exactly how "AGI-like" the plateau turns out to be. And that affects a LOT of people's jobs/livelihood, so it's good to see the hype machine start to slow down and get more realistic about the near-term future.
> I'd say GPT-4 in 2023 vs GPT-3 as the last major "wow" release from a purely-model perspective. But they've also gotten a lot faster, which has its own value. And the tooling around them has gotten MASSIVELY better
Tooling vs model is a false dichotomy in this case. The massive improvements in tooling are directly traceable back to massive improvements in the models.
If you took the same tooling and scaffolding and stuck GPT-3 or even GPT-4 in it, they would fail miserably and from the outside the tooling would look abysmal, because all of the affordances of current tooling come directly from model capability.
All of the tooling approaches of modern systems were proposed and prototypes were made back in 2020 and 2021 with GPT-3. They just sucked because the models sucked.
The massive leap in tooling quality directly reflects a concomitant leap in model quality.
How do you avoid overfitting with the automated prompts? From what I've seen, they tend to accumulate lots of special-case exceptions rather than generalizing the way a human would.
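For what it's worth, one common guard (I can't speak to what any specific tool does) is to score candidate prompts on a held-out split of examples they weren't tuned against. A minimal sketch, all names hypothetical and `score` standing in for your actual eval harness:

```python
# Minimal sketch of held-out validation for automated prompt tuning.
# All names are illustrative, not any particular tool's API.

import random

def score(prompt: str, examples: list) -> float:
    """Placeholder: fraction of examples the prompt handles correctly."""
    raise NotImplementedError

def pick_prompt(candidates: list, examples: list, seed: int = 0) -> str:
    """Prefer the candidate that generalizes, not the one that memorizes."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    split = int(0.7 * len(shuffled))
    train, held_out = shuffled[:split], shuffled[split:]

    best, best_val = candidates[0], float("-inf")
    for prompt in candidates:
        train_acc = score(prompt, train)
        val_acc = score(prompt, held_out)
        # A big train/held-out gap is exactly the "lots of exceptions"
        # smell: the prompt fits its tuning examples but won't generalize.
        if train_acc - val_acc > 0.2:
            continue
        if val_acc > best_val:
            best, best_val = prompt, val_acc
    return best
```

This doesn't stop exceptions from creeping in during tuning, but it at least catches prompts that won't generalize before you keep them.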
I dunno, some of the questions on things like Humanity's Last Exam sure strike me as "godlike." Yes, I'm happy that I can still crush LLMs on ARC-AGI-2, but I see the writing on the wall there, too. Barely over a year ago, LLMs were scoring what, single-digit percentages on ARC-AGI-1?
I would hope a god could do better than 40% on a test. If you selected human experts from the relevant fields, together they would get at least a passing grade (70%). A group of 20 humans is not godlike.