The hypothesis that you can't learn some things from text, that you need real-life experience, is intuitive, and I used to think it was true. But there are interesting results from just a few days ago suggesting that text by itself is also enough:
> We test a stronger hypothesis: that the conceptual representations learned by text only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection.
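For intuition, here is a minimal sketch of the mechanism the quoted abstract describes, with hypothetical dimensions and random numpy arrays standing in for the frozen vision encoder and the frozen LM's token embeddings (this is not the paper's code, just the shape of the idea):

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_lm, n_patches = 512, 768, 4  # hypothetical dimensions

# Frozen: output features from a vision model for one image.
image_features = rng.normal(size=(n_patches, d_vision))

# The ONLY trainable component: a single linear projection (W, b).
W = rng.normal(size=(d_vision, d_lm)) * 0.02
b = np.zeros(d_lm)

# Project image features into the LM's input-embedding space.
# These projected vectors act as "continuous prompts".
soft_prompts = image_features @ W + b            # shape (n_patches, d_lm)

# Frozen: embedded tokens of a text prefix (stand-in values here).
text_embeddings = rng.normal(size=(6, d_lm))

# The frozen LM would consume image "tokens" followed by text tokens.
lm_input = np.concatenate([soft_prompts, text_embeddings], axis=0)

print(lm_input.shape)  # (10, 768): 4 image tokens + 6 text tokens
```

Everything except `W` and `b` stays frozen; if a linear map suffices to make the LM use the image features, that is evidence the two representation spaces already line up.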
The claim isn't that you can't learn it from text, but rather that this is why models require so much text to train on: they're learning the stuff that humans learn from video.
The key issue is learning effort (energy and time, say). Congenitally deaf-blind people with no accompanying cognitive disabilities can learn just fine as children, without any video or sound, from comparatively low-bandwidth channels like proprioception and touch.
Another issue is that what we really care about is scientific reasoning, and there, if anything, nature has given us an anti-bias, at least at the level of interfacing with facts. People aren't born biased towards learning metric tensors and Christoffel symbols, yet it takes only a few years at a handful of hours a day, using a small number of joules, for many humans to get it (I'm counting from all grade-school prerequisites, vs. GPU watts x time). Far fewer for genius children.
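A back-of-envelope version of that joules comparison, with every number an illustrative assumption (brain power draw, study hours, cluster size, and training duration are all rough guesses, not measurements):

```python
# Human side: ~20 W brain power, ~3 h/day of study for ~10 years of
# prerequisites (all assumed figures).
BRAIN_WATTS = 20
STUDY_HOURS = 3 * 365 * 10
human_joules = BRAIN_WATTS * STUDY_HOURS * 3600

# Model side: a hypothetical 1,000-GPU cluster at 400 W per GPU
# running for one month (all assumed figures).
GPU_WATTS = 400
GPU_COUNT = 1000
TRAIN_HOURS = 24 * 30
model_joules = GPU_WATTS * GPU_COUNT * TRAIN_HOURS * 3600

print(f"human: {human_joules:.1e} J, model: {model_joules:.1e} J, "
      f"ratio: {model_joules / human_joules:.0f}x")
```

Even with generous assumptions on the human side, the gap comes out at roughly three orders of magnitude, which is the point being made about learning effort.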
I'm testing this argument out, but doesn't this apply to all tasks, not just language? I can learn to paint from scratch in, what, 300 attempts? 1,000 attempts? It takes far more examples to train a guided diffusion model. I'd struggle to believe that our brains are hardwired for painting.
Linearly Mapping from Image to Text Space - https://arxiv.org/abs/2209.15162