For reference, here is the terminal-bench leaderboard: https://www.tbench.ai/lea...

segmondy · 2025-08-21T21:37:42 1755812262

garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools will stay the same and then we can really compare the models.

there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.

paradite · 2025-08-22T05:00:28 1755838828

Hey. I like your roast on benchmarks.

I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:

Example recent one on GPT-5:

https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...

All results:

https://eval.16x.engineer/evals/coding

jstummbillig · 2025-08-22T18:56:11 1755888971

Which benchmarks are not garbage?

I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.

guluarte · 2025-08-21T20:53:25 1755809605

tbh companies like anthopic, openai, create custom agents for specific benchmarks

bazmattaz · 2025-08-21T21:07:30 1755810450

Do you have a source for this? I’m intrigued

guluarte · 2025-08-21T21:21:14 1755811274

https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a... "we iteratively refine prompting by analyzing failure cases and developing prompts to address them."

amelius · 2025-08-21T23:24:31 1755818671

Aren't good benchmarks supposed to be secret?

wkat4242 · 2025-08-21T23:57:59 1755820679

This industry is currently burning billions a month. With that much money around I don't think any secrets can exist.

noodletheworld · 2025-08-22T09:20:34 1755854434

How can a benchmark be secret if you post it to an API to test a model on it?

"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"

:P

If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.

So no.

They're not secret.

dmos62 · 2025-08-22T10:47:55 1755859675

How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that previous model's responses weren't smart enough? Sounds like a chicken and egg problem.

irthomasthomas · 2025-08-22T11:28:01 1755862081

It just means that once you send your test questions to a model API, that company now has your test. So 'private' benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.

dmos62 · 2025-08-22T12:30:00 1755865800

Sounds a bit presumptious to me. Sure, they have your needle, but they also need a cost-efficient way to find it in their hay stack.

lucianbr · 2025-08-22T16:59:34 1755881974

They have quite large amounts of money. I don't think they need to be very cost-efficient. And they also have very smart people, so likely they can figure out a somewhat cost-efficient way. The stakes are high, for them.

noodletheworld · 2025-08-22T13:53:01 1755870781

Security through obscurity is not security.

Your api key is linked to your credit card, which is linked to your identity.

…but hey, youre right.

Lets just trust them not to be cheating. Cool.

merelysounds · 2025-08-22T14:24:38 1755872678

Would the model owners be able to identify the benchmarking session among many other similar requests?

irthomasthomas · 2025-08-22T14:47:43 1755874063

Depends. Something like arc-agi might be easy as it follows a defined format. I would also guess that the usage pattern for someone running a benchmark will be quite distinct from that of a normal user, unless they take specific measures to try to blend in.

make3 · 2025-08-28T19:36:44 1756409804

if you're able to submit multiple times, then you can learn from the hidden set

YetAnotherNick · 2025-08-21T20:37:07 1755808627

Depends on the agent. Rank 5 and 15 are claude 4 sonnet, and this stands close to 15th.

coliveira · 2025-08-21T20:28:14 1755808094

My personal experience is that it produces high quality results.

amrrs · 2025-08-21T20:32:50 1755808370

Any example or prompt you use to make this statment?

imachine1980_ · 2025-08-21T20:49:45 1755809385

I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, "I don't know a quote about that specific topic, but you might mean this other thing." or something like that then cited a real quote in the same topic, after acknowledging that it wasn't able to find the one I had read in an old book. i don't use it for coding, but for things that are more unique i feel is more precise.

mycall · 2025-08-21T22:14:24 1755814464

I wonder if Conway's law is at all responsible for that, in the similarity it is based on; regional trained data which has concept biases which it sends back in response.

valtism · 2025-08-22T02:33:13 1755829993

Was that true for GPT-5? They claim it is much better at not hallucinating

sync · 2025-08-21T23:10:16 1755817816

I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.

antman · 2025-08-22T11:33:35 1755862415

Nice point. How did you test for coreference resolution? Specific prompt or dataset?

dr_dshiv · 2025-08-21T23:59:39 1755820779

Strong claim there!

SV_BubbleTime · 2025-08-22T03:28:36 1755833316

Vine is about the only benchmark I think is real.

We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?

seunosewa · 2025-08-21T20:08:31 1755806911

The DeepSeek R1 in that list is the old model that's been replaced. Update: Understood.

yorwba · 2025-08-21T20:33:52 1755808432

Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.

tonyhart7 · 2025-08-21T21:55:30 1755813330

Yeah but the pricing is insane, I don't care about SOTA if its not break my bank