The point is that LLMs can’t backtrack after deciding on a token. So the probability that at least one token along a long generation will lead you down the wrong path does indeed increase as the sequence gets longer (especially since we typically sample from these things), whereas humans can plan their outputs in advance, revise/refine, etc.
Humans can backtrack, but the probability of a "correct" output is still (1-epsilon)^n. Not only can any token introduce an error, but the human author will not perfectly catch errors they have previously introduced. The epsilon ought to be lower for humans, but it's not zero.
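To put rough numbers on the (1-epsilon)^n argument, here is a minimal sketch. It assumes every token fails independently with the same probability epsilon, which is a simplification; the function name `p_all_correct` is just for illustration.

```python
def p_all_correct(epsilon: float, n: int) -> float:
    """Probability that all n tokens are correct, assuming each token
    independently goes wrong with probability epsilon."""
    return (1.0 - epsilon) ** n


# Compare a sloppier writer (epsilon = 0.01) with a more careful one (0.001).
for eps in (0.01, 0.001):
    for n in (100, 1000):
        print(f"epsilon={eps}, n={n}: {p_all_correct(eps, n):.3f}")
```

A lower epsilon pushes the decay out to longer sequences, but the curve has the same shape for any nonzero epsilon.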
But more to the point, in the deck provided, LeCun's point is _not_ about backtracking per se. The highlighted / red text on the preceding slide is:
> LLMs have no knowledge of the underlying reality
> They have no common sense & they can't plan their answer
Now, we generally generate from LLMs by sampling left to right, but it isn't hard to use essentially the same structure to generate tokens conditioned on both the preceding and the following sequence. If you ran generation for tokens 1...n, and then ran m iterations of re-sampling an internal token i based on (1..i-1, i+1..n), it would sometimes "fix" issues created by the initial generation pass. It would also sometimes introduce new issues in spots that were fine in the original generation. Process-wise, it would look a lot like MCMC at generation time.
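Here is a rough sketch of that resampling loop. It does not use an LLM: a hypothetical three-word bigram table (`TRANS`) stands in for the model, because for a bigram model the conditional of token i given both neighbours can be written down exactly. All names here are made up for illustration.

```python
import random

# Hypothetical toy bigram model: TRANS[a][b] = P(next = b | prev = a).
VOCAB = ["the", "cat", "sat"]
TRANS = {
    "the": {"the": 0.05, "cat": 0.90, "sat": 0.05},
    "cat": {"the": 0.10, "cat": 0.05, "sat": 0.85},
    "sat": {"the": 0.80, "cat": 0.10, "sat": 0.10},
}


def sample_forward(n, rng, start="the"):
    """Ordinary left-to-right sampling: each token depends only on its predecessor."""
    seq = [start]
    while len(seq) < n:
        probs = TRANS[seq[-1]]
        seq.append(rng.choices(VOCAB, weights=[probs[w] for w in VOCAB])[0])
    return seq


def resample_position(seq, i, rng):
    """Redraw token i given both neighbours (i >= 1):
    P(x_i | x_{i-1}, x_{i+1}) is proportional to P(x_i | x_{i-1}) * P(x_{i+1} | x_i)."""
    weights = []
    for w in VOCAB:
        p = TRANS[seq[i - 1]][w]
        if i + 1 < len(seq):  # the last token has no right-hand neighbour
            p *= TRANS[w][seq[i + 1]]
        weights.append(p)
    seq[i] = rng.choices(VOCAB, weights=weights)[0]


def gibbs_refine(seq, m, rng):
    """Run m single-token resampling steps on a copy of seq (position 0 stays fixed)."""
    seq = list(seq)
    if len(seq) < 2:
        return seq
    for _ in range(m):
        resample_position(seq, rng.randrange(1, len(seq)), rng)
    return seq


rng = random.Random(0)
draft = sample_forward(8, rng)
print("draft:  ", " ".join(draft))
print("refined:", " ".join(gibbs_refine(draft, 50, rng)))
```

Note that nothing in the loop knows whether a change is an improvement. Each step just draws from the model's own conditional, so it can undo a good token as easily as a bad one.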
The ability to "backtrack" does _not_ on its own add knowledge of reality, common sense, or "planning".
When a human edits, they're reconciling their knowledge of the world and their intended impact on their expected audience, neither of which the LLM has.
- GPT-style language models end up internally implementing a mini "neural network training algorithm" (gradient descent fine-tuning for given examples): https://arxiv.org/abs/2212.10559
This is false. Standard decoding algorithms like beam search can "backtrack" and are widely used in generative language models.
It is true that the runtime of an exhaustive search is exponential in the length of the sequence, so lots of heuristics are used to reduce the runtime in practice, and this limits the "backtracking" ability. But this limitation is purely for computational convenience and not something inherent in the model.
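As a small illustration of the kind of "backtracking" beam search gives you, here is a sketch over a hypothetical two-token model (`NEXT` is made up so that the greedy first choice turns out to be the wrong one). It is a bare-bones fixed-width beam, not any particular library's implementation.

```python
import math

# Hypothetical toy next-token model: NEXT[prev][tok] = P(tok | prev).
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.1, "b": 0.9},
}


def beam_search(length, width):
    """Keep the `width` most probable prefixes at each step.
    Returns (tokens, probability) for the best sequence found."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    for _ in range(length):
        candidates = []
        for logp, toks in beams:
            for tok, p in NEXT[toks[-1]].items():
                candidates.append((logp + math.log(p), toks + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:width]  # the pruning heuristic
    best_logp, best_toks = beams[0]
    return best_toks[1:], math.exp(best_logp)


print(beam_search(length=2, width=1))  # width 1 is plain greedy decoding
print(beam_search(length=2, width=2))
```

With width 1 the search commits to "a" and can never recover. With width 2 the "b" prefix stays alive and ends up winning, which is the limited form of backtracking being described. Each step costs width times vocabulary size, whereas an exhaustive search would have to score every one of the vocabulary-size-to-the-length sequences.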
I could be wrong, but I think the only probabilistic component of an LLM is the statistical word fragment selection at the end. Assuming this is true, one could theoretically run the program multiple times, making different fragment choices. This (while horribly inefficient) would allow a sort of backtracking.
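A crude sketch of the "run it multiple times" idea follows. The toy table `NEXT` stands in for the model's final token distribution, and ranking runs by their own log-probability is my assumption; the comment above does not say how you would choose among the runs.

```python
import math
import random

# Hypothetical stand-in for the model's final sampling step: P(tok | prev).
NEXT = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.1, "b": 0.9},
}


def sample_once(length, rng):
    """One ordinary sampling run. Returns (tokens, log-probability of the run)."""
    toks, logp = ["<s>"], 0.0
    for _ in range(length):
        dist = NEXT[toks[-1]]
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        logp += math.log(dist[tok])
        toks.append(tok)
    return toks[1:], logp


def best_of(runs, length, seed=0):
    """Repeat the whole generation `runs` times and keep the highest-scoring run."""
    rng = random.Random(seed)
    return max((sample_once(length, rng) for _ in range(runs)), key=lambda r: r[1])


print(best_of(runs=1, length=3))
print(best_of(runs=50, length=3))
```

Everything before the sampling step is deterministic here, so different runs differ only in the random draws, which matches the assumption in the comment.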
Do you know of any work on holistic response quality in LLMs? We currently have the LLM equivalent of the HTML line-breaking and hyphenation algorithm, when what we want is the LaTeX version of that algorithm.
Nice post! I work on NLP and I think a lot of ideas in this post resonate with what I find exciting about working on the intersection of language + the real world: large text datasets as sources of abundant prior knowledge about the world, structure of language ~ structure of concepts that matter to humans, etc.
I feel like the bottleneck is getting access to paired (language, other modality) data though (if your other modality isn't images). i.e. "bolt on generalization" is an intuitively appealing concept, but then it reduces to the hard problem of "how do I learn to ground language to e.g. my robot action space?" I haven't seen a robotics + language paper that actually grapples with the grounding problem / tries to think about how to scale the data collection process for language-conditioned robotics beyond annotating your own dataset as a proof-of-concept. Unlike language modeling / CLIP-type pretraining, it seems (fundamentally?) more difficult to find natural sources of supervision of (language, action). I'd be curious about your thoughts on this!
> When it comes to combining natural language with robots, the obvious take is to use it as an input-output modality for human-robot interaction. The robot would understand human language inputs and potentially converse with the human. But if you accept that “generalization is language”, then language models have a far bigger role to play than just being the “UX layer for robots”.
You should check out Jacob Andreas's work, if you haven't seen it already - esp. his stuff on learning from latent language (https://arxiv.org/abs/1711.00482).
My hope is that sufficiently rich language models obviate the need for a lot of robot-language grounding data.
LfP (https://learning-from-play.github.io/) was a work that inspired me a lot. They relabel a few hours of open-ended demonstrations (humans instructed to play with anything in the environment) with a lot of hindsight language descriptions, and show some degree of general capability acquired through this richer language. You can describe the same action with a lot of different descriptions, e.g. "pick up the leftmost object unless it is a cup" could also be relabeled as "pick up an apple".
That being said, the LfP paper stops short of testing whether we can improve robotics solely by scaling language. The role of "open-ended play data" was a confounding factor and central to their narrative. We do need some paired data to ground (language, robot-specific sensor/actuator modalities), but perhaps we can scale everything else with language-only data.
Thanks for the pointer to the Andreas paper! This is indeed quite relevant to the spirit of what I'm arguing for, though I prefer the implementation realized by the Lu et al '21 paper.
A couple of under-explored rich sources of training data on actions are videos and code. Videos, showing how people interact with objects in the world to achieve goals, might also come with captions and metadata, while code comes with comments, messages and variable names that relate to real world concepts, including millions of tables and business logic.
Maybe in the future we will add rich brain scans as an alternative to text. That kind of annotation would be so easy to collect in large quantities, provided we can wear neural sensors. If it's impractical to scan the brain, we can wear sensors and video cameras and use eye tracking and body tracking to train the system.
I am optimistic that language modelling can become the core engine of AI agents, but we need a system that has both a generator and a critic, going back and forth for a few rounds, doing multi-step problem solving. Another must is to allow search engine queries in order to make models more efficient and more correct; not all knowledge needs to be burned into the weights.
> My hope is that sufficiently rich language models obviate the need for a lot of robot-language grounding data.
I feel like this is “missing the trees for the forest.” In my experience, generality only emerges after a critical mass of detailed low-level examples is collected and arranged into a pattern. Humans can’t actually reason about purely abstract ideas very well. Experts always have specifics in mind they are working from.
So I'm not convinced leaving it to the model gets you anything new.
I feel that the (IMHO plausible) idea is that a sufficiently rich language model can enable transfer learning for robotics, where you can effectively replace a lot of robot-language grounding data with a small amount of robot-language grounding and a lot of pure language data.
> My theory is that taste is one quality that separates the academics from the business people. Academia doesn't necessitate a lot of taste. If you have it, great. If you don't have it, no big deal.
This might be true for academics in ancient Greek literature, but certainly isn't true for academics in CS nowadays. If you don't have good taste in research problems that are {important for downstream industry applications, scientifically interesting, tractable}, you won't get anything done, and you won't get published. If anything, the pressure for academics to develop good taste is stronger than for people designing products. You can have a product that provides just one utility that users desperately need and have terrible taste on all the other axes that make a product "good," and do just fine. Academic papers get judged (in peer review / traction after publication) purely against the taste and aesthetics of other people in your community.
This is where an ancient Greek professor butts in and says that you need good taste in ancient Greek, but it's those damn ancient Aramaic professors who don't need good taste.
I'm just kidding. I agree that it's not fair to make such an overarching statement about academics. I guess I was trying to express that I've noticed there's a difference between academic intelligence and taste, and my HN instincts told me that blaming academics would appeal to the crowd :P. More seriously, I think there's certainly a sense of taste in academia, but it's a very particular, very niche taste, as you've described. Someone who is tapped into the taste of the crowd will not be adept at understanding the taste of the few and vice versa.
That’s far more sophisticated than I had imagined it would be before clicking the link. Has your friend done research in graph drawing, or SMT solvers?
Nice, I didn't know that course, and I noticed that I can follow along since the classes are recorded. Do you think it's worth the investment for someone interested in distributed systems? The only downside is that I can't have my lab exercises validated.
Yeah, it's an excellent course. That is typically the unfortunate downside of self-studying courses, but AFAIK the tests aren't that sophisticated - you can mostly test it yourself just by running the test suite over and over again, making sure that it _really_ works 100% of the time and not just 98% of the time (given the nondeterministic nature of distributed systems failures). This is basically what students would do before submitting their labs, and most people got a full score if they ran it enough times :P
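To see why "over and over again" matters, here is a small sketch. The flaky test is simulated (a real lab test would exercise your actual code), and every name is made up; the 2% failure rate just mirrors the "98% of the time" above.

```python
import random


def count_failures(test_fn, runs):
    """Run test_fn `runs` times and count how many runs raise AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures


def make_flaky_test(failure_rate, seed=0):
    """Simulated test that fails with probability failure_rate on each call."""
    rng = random.Random(seed)

    def test():
        assert rng.random() >= failure_rate, "simulated rare failure"

    return test


def p_flake_seen(failure_rate, runs):
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


print(count_failures(make_flaky_test(0.02), runs=200))
for runs in (1, 10, 50, 200):
    print(runs, round(p_flake_seen(0.02, runs), 3))
```

A single green run says very little about a bug that shows up 2% of the time. You need on the order of a couple hundred runs before a clean result is strong evidence, assuming the runs are independent.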