I, too, could not read a chapter of Harry Potter and then tell you how many times a word was used. This isn't what my brain (and by extension LLMs) is good at. However, if you told me ahead of time that was my goal for reading a chapter, I'd approach reading differently. I might have a scratch pad for tallying. Or I might just do a word find on a document. I'd design a framework to solve the problem.
"The Harry Potter Problem" has the feel of a strawman. LLMs are not universal problem solvers. You still have to break down tasks or give it a framework for working things through. If you ask an LLM to produce a code snippet for word counting, it will do great. Maybe that isn't as sexy, but what are you really trying to achieve?
They very clearly explain why this matters in the "Why should I care?" section. Partially quoting them:
> Harry Potter is an innocent example, but this problem is far more costly when it comes to higher value use-cases. For example, we analyze insurance policies. They’re 70-120 pages long, very dense and expect the reader to create logical links between information spread across pages (say, a sentence each on pages 5 and 95). So, answering a question like “what is my fire damage coverage?” means you have to read: Page 2 (the premium), Page 3 (the deductible and limit), Page 78 (the fire damage exclusions), Page 94 (the legal definition of “fire damage”).
It's not at all obvious how you could write code to do that for you. Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much higher-stakes (and harder to benchmark) task, even if there are "better" ways of solving the Harry Potter problem.
> Solving the "Harry Potter Problem" as stated seems like a natural prerequisite for doing this much more high stakes (and harder to benchmark) task
Not really. The "Harry Potter Problem" as formulated is asking an LLM to solve a problem that they are architecturally unsuited for. They do poorly at counting and similar algorithms tasks no matter the size of the context provided. The correct approach to allowing an AI agent to solve a problem like this one would be (as OP indicates) to have it recognize that this is an algorithmic challenge that it needs to write code to solve, then have it write the code and execute it.
Asking specific questions about your insurance policy is a qualitatively different type of problem, one that algorithms are bad at but that LLMs are already very good at in smaller context windows. Making progress on that type of problem requires only extending a model's ability to use its context, not simultaneously building out a framework for solving algorithmic problems.
So if anything it's the reverse: solving the insurance problem would be a prerequisite to solving the Harry Potter Problem.
LLMs can't count well. This is in large part a tokenization issue. That doesn't mean they couldn't answer all those kinds of questions. Maybe the current state of the art can't. But you won't find out by asking it to count.
Some counterarguments:
1. If an AI company promises that their LLM has a million token context window, but in practice it only pays attention to the first and last 30k tokens, and then hallucinates, that is a bad practice. And prompt construction does not help here - the issue is with the fundamentals of how LLMs actually work. Proof: https://arxiv.org/abs/2307.03172
2. Regarding writing the code snippet: as I described in my post, the main issue is that the model does not understand the relationships between information in the long document. So yes, it can write a script that counts the number of times the word "wizard" appears, but if I gave it a legal case of similar length, how would it write a script that extracts all of the core arguments that live across tens of pages?
I'd do it like a human would. If a human were reading the legal case, they would have a notepad with them where they would note the locations and summaries of key arguments, page by page. I'd code the LLM to look for something that looks like a core argument on each page (or other meaningful chunk of text) and then have it give a summary if one occurs. I may need to do some few-shot prompting to give it an understanding of what to look for. If you are looking for reliable structured output, you need to formulate your approach to be more algorithmic and use the LLM for its ability to work with chunks of text.
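Concretely, that page-by-page "notepad" approach might look something like this (a rough sketch; `llm_complete` is again a made-up placeholder, and the few-shot prompt is purely illustrative):

```python
# Sketch of the page-by-page "notepad" approach described above.
# llm_complete() is a placeholder for your model API, not a real library.
from typing import Optional

FEW_SHOT = """You extract core legal arguments from a single page of a case.
If the page contains no core argument, reply exactly: NONE.
Example page: "...the appellant contends the notice period was waived..."
Example output: "Appellant argues the notice period was waived by conduct."
"""

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your model provider here")

def extract_argument(page_text: str) -> Optional[str]:
    reply = llm_complete(f"{FEW_SHOT}\nPage:\n{page_text}\nOutput:")
    return None if reply.strip() == "NONE" else reply.strip()

def build_notepad(pages: list[str]) -> list[tuple[int, str]]:
    # The "notepad": (page number, summary) for every page with an argument.
    notepad = []
    for i, page in enumerate(pages, start=1):
        summary = extract_argument(page)
        if summary:
            notepad.append((i, summary))
    return notepad
```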
Totally agree there. And that's one of my points: you have to design around this flaw by doing things like what you proposed (or build an ontology like we did, which is also helpful). And the first step in this process is figuring out whether your task falls into a category like the ones I described.
The structured output element is really important too - subject for another post though!
But you could easily write code to do exactly that.
So surely it’s not about LLMs being able to do that, but being smart enough to understand “hey this is something I could write a Python script to do” and be able to write the script, feed the Harry Potter chapter into it, run the script and parse the results (in the same way a human would do)?
Yes, and that's a completely different kind of problem than extending a model's ability to use its context window effectively.
If you solve the problem that way you don't even really need the Harry Potter chapter in the context at all, you could put it as an external document that the agent executes code against. This makes it qualitatively a different problem than the insurance policy questions that the article moves on to.
> I, too, could not read a chapter of Harry Potter and then tell you how many times a word was used. This isn't what my brain (and by extension LLMs) is good at. However, if you told me ahead of time that was my goal for reading a chapter, I'd approach reading differently. I might have a scratch pad for tallying. Or I might just do a word find on a document. I'd design a framework to solve the problem.
What is the relevance in what a human would do? This is not a human and does not work like a human.
I would expect any piece of software that allows me to input a text and ask it to count occurrences of words to do so accurately.
You absolutely could count the number of times "wizard" was used if you had the book in front of you. Similarly the LLM does have the chapter available "to look at." Documents pasted in the context window aren't ethereal.
Part of the confusion here is that some people (apparently including you) use the word "LLM" to refer to the entire system that's built up around the language model itself, while others (like OP) are specifically referring to the large language model.
The large language model's context window absolutely is ephemeral. By the time inference begins, all you have is a set of vectors that represent the context to date. This means that the model itself does not have the text available to look at; it only has the encoded "memory" of that text.
OP is simply saying that the underlying model is unsuitable for solving problems like this directly, so it makes a bad example for how models don't use their context effectively. A production grade AI agent should be able to solve problems like this, but it will likely do that through external scaffolding, not through improvements to the model itself, whereas improvements to the context window will probably need to occur at the model level.
Yeah. Humans can't count more than ~5 things intuitively; we have to run the "counting algorithm" in our heads. We just learn it so early in life that we don't really think of it as an algorithm. Not surprising at all that LLMs have the same limitation, but fortunately computers are extremely good at running algorithms once instructed to do so.
"The Harry Potter Problem" has the feel of a strawman. LLMs are not universal problem solvers. You still have to break down tasks or give it a framework for working things through. If you ask an LLM to produce a code snippet for word counting, it will do great. Maybe that isn't as sexy, but what are you really trying to achieve?