They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for this much higher-stakes (and harder-to-benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for this much higher-stakes (and harder-to-benchmark) task
Not really. The "Harry Potter Problem" as formulated asks an LLM to solve a problem it is architecturally unsuited for. LLMs do poorly at counting and similar algorithmic tasks no matter the size of the context provided. The correct approach to letting an AI agent solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge it needs to write code for, then have it write the code and execute it.
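To make that concrete, here's a minimal sketch of the kind of code such an agent might write and run, assuming the task is something like counting mentions of a phrase across the whole book. The file name and phrase are placeholders, not from the original post:

```python
import re

def count_mentions(path: str, phrase: str) -> int:
    # Read the entire book text; a 100k-word novel fits in memory easily.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Case-insensitive, whole-phrase match; \b keeps "wand" from matching "wander".
    pattern = re.compile(rf"\b{re.escape(phrase)}\b", re.IGNORECASE)
    return len(pattern.findall(text))

if __name__ == "__main__":
    # Hypothetical example inputs.
    print(count_mentions("harry_potter_1.txt", "invisibility cloak"))
```

The exact-count part of the problem becomes trivial once it's offloaded to code; the interesting part is getting the model to recognize when it should do that.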
Asking specific questions about your insurance policy is a qualitatively different type of problem: algorithms are bad at it, but LLMs are already very good at it over smaller context windows. Making progress on that type of problem requires only extending a model's ability to use its context, not simultaneously building out a framework for solving algorithmic problems.
So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.
LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't. But you won't find out by asking it to count.
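A quick sketch of why tokenization gets in the way, using the tiktoken library (the encoding name and sentence are just examples): the model never sees words or letters, only subword token IDs whose boundaries don't line up with either.

```python
import tiktoken  # OpenAI's tokenizer library

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The invisibility cloak shimmered.")
print(tokens)                             # integer token IDs -- what the model actually operates on
print([enc.decode([t]) for t in tokens])  # the subword pieces; boundaries don't match words or letters
```

Counting anything defined at the word or character level has to be reconstructed from those pieces, which is exactly the kind of exact bookkeeping transformers are weak at.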