PaperBench

no_multitudes · 2025-04-02T21:12:31 1743628351

Are there examples of the outputs the LLMs under test generated? I couldn't find any detailed ones in the paper or code.

The result here seems to be "Our Judge LLM gave another LLM a 21% grade for some code it generated", which is ... not qualitatively meaningful at all to me.

smusamashah · 2025-04-02T17:40:07 1743615607

    We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

attentive · 2025-04-02T21:57:55 1743631075

"We wished to also evaluate Claude 3.7 Sonnet, but were unable to complete the experiments given rate limits with the Anthropic API."

swyx · 2025-04-02T22:49:38 1743634178

Overall I REALLY like this paper and effort, but this part sounds like a bit of bullshit. They don't have the ability to implement retries and backoffs to deal with rate limits?
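For reference, a minimal sketch of retries with exponential backoff and jitter. `RateLimitError` and `call_with_backoff` are hypothetical names for illustration, not part of any provider's SDK or of the PaperBench code; a real client would catch whatever exception its SDK raises for an HTTP 429.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    The delay cap doubles on every attempt (up to max_delay) and the actual
    sleep is drawn uniformly from [0, cap] ("full jitter") so that parallel
    workers do not all retry at the same moment.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The `sleep` parameter is injectable only so the behaviour can be checked without actually waiting.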

eightysixfour · 2025-04-02T23:03:55 1743635035

Because they used wall clock time, not compute time, flops, or watts, to standardize. 24 hours and 36 hours of compute.

They could build a system which gives them equal compute time by ignoring time spent rate limiting and such, but they chose not to.
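A rough sketch of what "ignoring time spent rate limiting" could mean in practice: keep a budget of active time and subtract any interval explicitly spent waiting. `ActiveTimeBudget` is a hypothetical helper, not something from the paper or its code.

```python
import time


class ActiveTimeBudget:
    """Time budget that excludes intervals explicitly marked as waiting."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.budget_seconds = budget_seconds
        self._clock = clock
        self._start = clock()
        self._waited = 0.0

    def wait(self, seconds, sleep=time.sleep):
        """Sleep (e.g. for a rate limit) and record how long the wait took."""
        before = self._clock()
        sleep(seconds)
        self._waited += self._clock() - before

    def wall_elapsed(self):
        """Total wall-clock time since the run started."""
        return self._clock() - self._start

    def active_elapsed(self):
        """Wall-clock time minus the time spent in wait()."""
        return self.wall_elapsed() - self._waited

    def exhausted(self):
        """True once the active (non-waiting) time reaches the budget."""
        return self.active_elapsed() >= self.budget_seconds
```

With a budget like this, an agent that loses time to rate limiting still gets the same amount of active time as one that does not, at the cost of runs no longer ending at a fixed wall-clock deadline.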

swyx · 2025-04-03T03:39:52 1743651592

ah. fair answer.

moralestapia · 2025-04-03T01:53:36 1743645216

"Why don't they just break the TOS"

Damned if you do, damned if you don't.

tetris11 · 2025-04-03T15:31:04 1743694264

Sounds like a good initiative, but not one that should be under the ownership of a for-profit company with a massive stake in the race.

amelius · 2025-04-02T20:55:04 1743627304

One thing I'd be interested in is a UI for reading papers with AI assistance.

benbreen · 2025-04-02T21:05:49 1743627949

I've been developing a more elaborate variation on the "chat with a pdf" idea for my own use as a researcher. It's mostly designed for a historian's workflow but it works pretty well for science and engineering papers too. Currently Flash 2.0 is the default but you can select other models to use to analyze pdfs and other text through various "lenses" ranging from a simple summary to text highlighting to extracting organized data as a .csv file:

https://source-lens.vercel.app

(Note: this is not at all a production ready app, it's just something I've been making for myself, though I'm also now sharing it with my students to see how they use it. If anyone reads this and is interested in collaborating, let me know).

boodleboodle · 2025-04-03T02:36:15 1743647775

Wow I just tried this and it's great!

I regularly paste papers into LLM interfaces, but they all spit out generic, unhelpful answers. Your app is the only one I've seen that actually helps me understand.

I am using Gemini 2.0 pro

harshPaliwal · 2025-04-03T07:55:46 1743666946

Awesome tool, Ben. I tried talking to the author, but instead of answering, it keeps telling me to read the paper. Is that the intent?

rfurmani · 2025-04-02T21:17:11 1743628631

I'm building such tools at https://sugaku.net. Right now there's chatting with a paper and browsing similar papers. Generally arXiv and other repositories want you to link to them and not embed their papers, which makes it hard to build inline reading tools, but it's on my roadmap to support that for uploaded papers. Would love to hear if you have any feature requests.

amelius · 2025-04-03T09:43:11 1743673391

One feature could be automatically fetching the papers that it refers to and feeding them through the LLM as well, maybe applying that recursively. This could give the AI a better overview of the related literature.
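A minimal sketch of the recursive part, assuming some `get_references` function exists that maps a paper ID to the IDs it cites (hypothetical here; in practice it would wrap a citation index or a PDF parser). The depth and size limits matter because citation graphs fan out quickly and contain cycles.

```python
from collections import deque


def collect_related(root_id, get_references, max_depth=2, max_papers=50):
    """Breadth-first walk of the citation graph starting at root_id.

    get_references is a caller-supplied (hypothetical) function that maps a
    paper ID to the list of paper IDs it cites. Returns a dict mapping each
    collected paper ID to the depth at which it was first reached.
    """
    depths = {root_id: 0}
    queue = deque([root_id])
    while queue:
        paper = queue.popleft()
        depth = depths[paper]
        if depth >= max_depth:
            continue  # do not expand papers at the depth limit
        for ref in get_references(paper):
            if ref in depths:
                continue  # already seen: avoids loops in the citation graph
            if len(depths) >= max_papers:
                return depths  # hard cap on how much gets fed to the model
            depths[ref] = depth + 1
            queue.append(ref)
    return depths
```

The returned depths could be used to prioritise what goes into the context window, e.g. full text for depth 1 and only abstracts for depth 2.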

timabdulla · 2025-04-02T19:54:31 1743623671

What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?

macleginn · 2025-04-03T09:29:36 1743672576

Depending on how well the exact algorithms, implementation details, and experimental design were documented, replication can easily take days, if not weeks. (Personally, I would start by filtering out papers that cannot be replicated by skilled researchers in a fixed amount of time and only give the replicable ones to the agents.)

riku_iki · 2025-04-02T20:53:48 1743627228

I don't get the idea of this benchmark: they ask agents to produce code that replicates the results of papers that already have code on GitHub?

kelseyfrog · 2025-04-03T18:15:43 1743704143

The point is to bootstrap self-improving AI. Once a measurement becomes a goal, model makers target saturating it.

There is a coefficient of intelligence replication, i.e., a model M with intelligence I_m can reproduce a model N with intelligence I_n. When (I_n / I_m) > 1, we'll have a runaway intelligence explosion. There are of course several elements in the chain - akin to the Drake equation for intelligent machines - and their combined multiplicative effect determines the overall intelligence of the system. If f(paper) -> code is the weakest part of the chain, it makes sense to target that.
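One way to read the Drake-equation analogy, as a toy sketch: treat I_n / I_m as a product of per-stage factors and look for the smallest one. The stage names and the idea of assigning each stage a single number are assumptions made for illustration only; nothing here is measured.

```python
import math


def replication_coefficient(stage_factors):
    """Product of per-stage factors, read here as the ratio I_n / I_m."""
    return math.prod(stage_factors.values())


def weakest_stage(stage_factors):
    """Name of the stage with the smallest factor, i.e. the bottleneck."""
    return min(stage_factors, key=stage_factors.get)


def is_runaway(stage_factors):
    """True when the combined coefficient exceeds 1."""
    return replication_coefficient(stage_factors) > 1
```

Because the factors multiply, a single stage well below 1 (say, paper -> code) can keep the whole product below 1 even if every other stage is above it, which is the argument for targeting the weakest stage first.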

riku_iki · 2025-04-03T18:22:32 1743704552

> If f(paper) -> code is the weakest part of the chain, it makes sense to target that.

My point is that LLMs have potentially already seen the solution on GitHub, so you can't use this benchmark as a metric unless there is some explanation of how that is ruled out.

kelseyfrog · 2025-04-03T18:46:40 1743706000

How does that work with knowledge cutoff?

riku_iki · 2025-04-03T18:51:07 1743706267

It could work with a knowledge cutoff if they can reliably guarantee it and also make sure the LLMs are not searching GitHub under the surface.
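A sketch of the cutoff check itself, assuming you trust the cutoff dates the vendors report (which is exactly the part that is hard to guarantee). The function name, the `published` field, and the model names in any usage are hypothetical.

```python
from datetime import date


def papers_after_cutoffs(papers, model_cutoffs, margin_days=0):
    """Keep papers published after the latest cutoff among the tested models.

    papers: list of dicts with a "published" datetime.date (assumed schema).
    model_cutoffs: dict of model name -> self-reported training cutoff date.
    margin_days: extra safety gap required beyond the latest cutoff.
    """
    latest_cutoff = max(model_cutoffs.values())
    return [
        p for p in papers
        if (p["published"] - latest_cutoff).days > margin_days
    ]
```

This only addresses training-data leakage by date. It does nothing about an agent finding the original repository through web search at run time, which would need separate controls.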

kelseyfrog · 2025-04-03T19:01:32 1743706892

What's the likelihood that the researchers have done this? It seems fairly easy.

riku_iki · 2025-04-03T19:04:08 1743707048

I honestly have no idea how OAI researchers can guarantee a cutoff date for Anthropic models, for example.

eightysixfour · 2025-04-02T21:22:05 1743628925

You don’t see the value of independent replication of findings?

The agent didn’t have access to the code, although they acknowledge it could theoretically be in the training set. Even then the original code wouldn’t conform to the structure of the test.

riku_iki · 2025-04-03T18:21:23 1743704483

> even then the original code wouldn’t conform to the structure of the test.

Yeah, this part should be central to this work: how well those tests are built, whether they actually catch data leakage, how this is measured, etc.

DrillShopper · 2025-04-02T19:26:13 1743621973

PaperBench sounds like a benchmarking software package for recently released GPUs.

aSanchezStern · 2025-04-02T19:35:41 1743622541

Where would the "paper" part come in? Is that just based on the word "bench" in general?

hnuser123456 · 2025-04-02T20:42:30 1743626550

The recent GPUs were a "paper launch"

antonkar · 2025-04-03T00:49:33 1743641373

There is a planet-wide, eternal, 100% safe AI solution that can be a billion-dollar startup, too:

Put all the GPUs in a cloud (or clouds) controlled by international scientists. (You can still use your GPU from any device and can earn money by renting it out when you don’t need it; nothing changes except that you need to be online to use it, but we’ll have 5G and better worldwide. You can develop, sell, or release for free math-proven safe AI models in this cloud “AI App Store”, etc.)

The main risk is an AI agent botnet. Current GPUs are like nukes that are 100% unprotected: any hacker can make a virus with an AI agent component just to steal money. This AI will not be aligned at all, and it will become a perpetual and eventually autonomous botnet.