I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.
I'm not saying it applies to the new architecture, I'm saying that's a big issue I've observed in existing models and that so far we have no info on whether it's solved in the new one (i.e. accurate recall doesn't imply much in that regard).
Ah, apologies for the misunderstanding. What tests would you suggest to evaluate "muddiness"?
What comes to my mind: run the usual gamut of tests, but with the excess context window saturated with irrelevant(?) data. Measure test answer accuracy/verbosity as a function of context saturation percentage. If there's little correlation between these two variables (e.g. 9% saturation is just as accurate/succinct as 99% saturation), then "muddiness" isn't an issue.
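As a rough illustration of what I mean (purely a hypothetical sketch; ask_model, load_filler_text, and the QA pair are made-up placeholders, not any real API):

    # Hypothetical sketch of a context-saturation sweep.
    # ask_model() and load_filler_text() are stand-ins for whatever
    # model API and irrelevant filler corpus you'd actually use.

    CONTEXT_LIMIT = 1_000_000  # assumed context window size, in tokens

    def ask_model(prompt: str) -> str:
        return "30 days written notice"  # stub; call your model here

    def load_filler_text(n_tokens: int) -> str:
        return "lorem ipsum " * n_tokens  # stub; real irrelevant text goes here

    QA_PAIRS = [  # the "usual gamut" of tests, held fixed across runs
        ("What does clause 4.2 say about termination?", "30 days written notice"),
    ]

    def run_sweep(saturation_levels=(0.09, 0.5, 0.99)):
        results = {}
        for level in saturation_levels:
            filler = load_filler_text(int(CONTEXT_LIMIT * level))
            correct, total_words = 0, 0
            for question, expected in QA_PAIRS:
                answer = ask_model(f"{filler}\n\n{question}")
                correct += int(expected.lower() in answer.lower())
                total_words += len(answer.split())
            results[level] = {
                "accuracy": correct / len(QA_PAIRS),
                "avg_answer_words": total_words / len(QA_PAIRS),
            }
        # Little variation across saturation levels => "muddiness" isn't an issue.
        return results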
Manual testing on complex documents. A big legal contract, for example. An issue can be referred to in 7 different places in a 100-page document. Does it give a coherent answer?
A handful of examples show whether it can do it. For example, GPT-4 Turbo is downright awful at something like that.
You need to use relevant data. The question isn't random sorting/pruning, but being able to apply large numbers of related hints/references/definitions in a meaningful way. To me this would be the entire point of a large context window. For entirely different topics you can always just start a new instance.
It's only with web-interface LLMs over the past few years that Google has been lackluster. However, that statement isn't correct for their overall history: W2V-based language models and BERT/Transformer models in the early days (publicly available, though not through a web interface) were far ahead of the curve, as they were the ones that produced these innovations.
Effectively, DeepMind/Google are academics (where the real innovations are made), but they do struggle to produce corporate products (where OpenAI shines).
I am skeptical of benchmarks in general, to be honest. It seems to be extremely difficult to come up with benchmarks for these things (that may be inherent to intelligence as a quality...). It's almost an anti-signal to proclaim good results on benchmarks. The best barometer of model quality has been vibes, in places like /r/localllama where cracked posters are actively testing the newest models out.
Based on Google's track record in the area of text chatbots, I am extremely skeptical of their claims about coherency across a 1M+ context window.
Of course, none of this even matters anyway, because the weights are closed, the architecture is closed, and nobody has access to the model. I'll believe it when I see it.
Their in-context long-sequence understanding "benchmark" is pretty interesting.
There's a language called Kalamang with only 200 native speakers left. There's a set of grammar books for this language that adds up to ~250K tokens. [1]
They set up a test of in-context learning capabilities at long context - they asked 3 long-context models (GPT-4 Turbo, Claude 2.1, Gemini 1.5) to perform various Kalamang -> English and English -> Kalamang translation tasks. These are done either 0-shot (no prior training data for kgv in the models), half-book (half of the kgv grammar/wordlists - 125k tokens - fed into the model as part of the prompt), or full-book (the whole 250k tokens fed into the model). Finally, they had human raters check these translations.
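To make the setup concrete, the three conditions amount to roughly the following (just an illustrative sketch of the prompting scheme; the actual prompt wording and evaluation in the paper differ):

    # Illustrative sketch of the three prompting conditions described above.
    # grammar_book stands in for the ~250k-token Kalamang field-linguistics
    # material; the actual prompts used in the paper will differ.

    def build_prompt(sentence: str, grammar_book: str, condition: str) -> str:
        task = ("Translate the following Kalamang sentence into English:\n"
                + sentence)
        if condition == "0-shot":
            # No Kalamang material at all -- the model has essentially
            # never seen kgv during training.
            return task
        if condition == "half-book":
            # Roughly the first ~125k tokens of the grammar/wordlists.
            return grammar_book[: len(grammar_book) // 2] + "\n\n" + task
        if condition == "full-book":
            # The entire ~250k tokens in context.
            return grammar_book + "\n\n" + task
        raise ValueError(f"unknown condition: {condition}")

    # Human raters then score the resulting translations, so the comparison
    # is full-book vs. half-book vs. 0-shot quality for each model.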
This is a really neat setup, it tests for various things (e.g. did the model really "learn" anything from these massive grammar books) beyond just synthetic memorize-this-phrase-and-regurgitate-it-later tests.
It'd be great to make this and other reasoning-at-long-ctx benchmarks standard fare for evaluating context extension. I can't tell which of the many context-extension methods (PI, E2 LLM, PoSE, ReRoPE, SelfExtend, ABF, NTK-Aware ABF, NTK-by-parts, Giraffe, YaRN, Entropy ABF, Dynamic YaRN, Dynamic NTK ABF, CoCA, Alibi, FIRE, T5 Rel-Pos, NoPE, etc etc) is really SoTA, since they all use different benchmarks, meaningless benchmarks, or methodologies so drastically different that there's no fair comparison.
The available resources for Kalamang are: field linguistics documentation comprising a ~500-page reference grammar, a ~2000-entry bilingual wordlist, and a set of ~400 additional parallel sentences. In total, the available resources for Kalamang add up to ~250k tokens.
Haha. This was my thinking this morning. Like: "Oh cool... a talking computer.... but can it read a 2000 page book, give me the summary and find a sentence out of... it can? Oh... well it's lame anyway."
The Sora release is even more mind-blowing - not the video generation itself, in my mind, but the idea that, to properly generate realistic video, it has to infer properties of reality and encode them in its weights. A side effect of that ability is literally a small universe of understanding.
I was thinking that I want to play with audio-to-audio LLMs. Not text-to-speech and the reverse, but literally sound in, sound out. It clears away the problem of document layout etc. and leaves room for experimentation on the properties of a cognitive being.