If you were teaching your kid English and they came back to you and said "Dad/Mum, I finished reading the entire internet but I still don't understand English fully", would you say "OK son, now go and stare at the Twitter firehose until you grok perfect English"?
It's clear that these models have orders of magnitude too much data already.
It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.
> It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.
I agree with your main point, but think this analogy isn't an apt one. If you want to see what particles are created at higher energies you kinda need the bigger particle accelerators. (This isn't to say that we shouldn't be investigating lower energy collisions, but at a certain point you do need "bigger colliders" to see new things)
The general point is that there is a huge volume of training data generated daily, not that Twitter is a great source of it. Though I believe GPT-3, for example, was trained on the Common Crawl dataset, which would contain both Twitter and Reddit.
>It's clear that these models have orders of magnitude too much data already.
Seems like a strange claim. The scaling laws are showing that you can still make gains with more data and more parameters.
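For a concrete sense of what those gains look like, here's a rough sketch using a Chinchilla-style parametric loss, L(N, D) = E + A/N^alpha + B/D^beta; the constants are roughly the fits reported by Hoffmann et al. (2022) and are purely illustrative:

    # Rough sketch: Chinchilla-style parametric scaling law,
    #   L(N, D) = E + A / N**alpha + B / D**beta,
    # where N = parameter count and D = training tokens.
    # Constants are approximately the fits from Hoffmann et al. (2022),
    # used here purely as an illustration.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def predicted_loss(n_params: float, n_tokens: float) -> float:
        """Predicted pretraining loss under the assumed power law."""
        return E + A / n_params**alpha + B / n_tokens**beta

    # Both more parameters and more data keep lowering the predicted loss,
    # just with diminishing returns.
    for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12), (1e12, 2e13)]:
        print(f"N={n:.0e}, D={d:.0e} -> loss ~ {predicted_loss(n, d):.3f}")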
>It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.
This is literally true, though: we couldn't have found the Higgs without the LHC, and most GUT candidates would only start being ruled out at higher energy levels.
Common Crawl actually does not contain Twitter; you can go check the indexes: https://github.com/ikreymer/cdx-index-client . Twitter is extremely aggressive about scraping/caching, and I guess that blocks CC. Models like GPT-3 still know a decent amount of Twitter material, and I figure that this is due to tweets being excerpted or mirrored manually at non-Twitter.com URLs (e.g. all the Twitter-mirroring bots on Reddit).
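If you want to check for yourself, the Common Crawl CDX index can also be queried directly over HTTP; a minimal sketch (the crawl id below is just an example, any crawl listed at https://index.commoncrawl.org/ works):

    # Minimal sketch: query a Common Crawl CDX index for twitter.com captures.
    # The crawl id below is an arbitrary example; any crawl listed at
    # https://index.commoncrawl.org/ can be substituted.
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2023-50-index",
        params={"url": "twitter.com/*", "output": "json", "limit": "5"},
        timeout=30,
    )

    # Per the comment above, this typically comes back empty / "no captures",
    # while the same query for reddit.com/* returns plenty of records.
    print(resp.status_code)
    print(resp.text or "<no captures>")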
> Seems like a strange claim. The scaling laws are showing that you can still make gains with more data and more parameters.
But then we’ve given up on matching human intelligence which is all about working efficiently with small training data, and certainly training a human does not need anywhere near as much data as GPT-3.
GPT-3 was interesting as a proof-of-concept of what happens when you use a gigantic amount of training data. We don’t need a bigger one until we can figure out how to make a smaller one that is just as effective.
If scaling laws are telling us to keep putting even more training data into the thing, then the conclusion should be that the architecture is just not working out.
>But then we’ve given up on matching human intelligence which is all about working efficiently with small training data, and certainly training a human does not need anywhere near as much data as GPT-3.
I don't think we should really take so much inspiration from the brain. We didn't make airplanes work by building bird machines, so why should we do that here?
>GPT-3 was interesting as a proof-of-concept of what happens when you use a gigantic amount of training data. We don’t need a bigger one until we can figure out how to make a smaller one that is just as effective.
This feels like a non sequitur. We can certainly keep making larger models, and we will, because we can continue to make performance gains by doing so.
>If scaling laws are telling us to keep putting even more training data into the thing, then the conclusion should be that the architecture is just not working out.
I don't think anyone in the field would agree with this point. Researchers see an easy avenue to better performance, so they take it. DeepMind's model shows you can get similar results with a more refined architecture, but that was released well after GPT-3. When teams significantly advance the state of the art with a much smaller model, I think we should take notice, but that hasn't happened yet.
> I don't think we should really take so much inspiration from the brain. We didn't make airplanes work by building bird machines so why should we do that here.
It’s not that we should mimic the brain’s implementation, but we should certainly strive to match the brain’s capabilities. One of its outwardly observable capabilities is that it is extremely efficient in the size of the training data set it requires.
Efficiency isn’t an implementation detail, it’s definitional to what “highly intelligent” means.
GPT-3 is not an airplane, it’s a zeppelin. Zeppelins also have scaling laws dictating that a zeppelin should be very very large. Building bigger and bigger zeppelins is one thing, justifying expending resources on gigantic zeppelins by stating the scaling law and concluding that a jet aircraft will magically pop out if you build a big enough zeppelin is quite another.
Your earlier analogy kind of feels like saying that because you can go further by adding more fuel to a jet engine's fuel tank, you have failed at efficiency and should redesign the engine.
But generally I think the better analogy is a rocket ship. If we can still go higher and faster with more fuel we should try to do that before we worry about engine efficiency. You have to get to the moon before you can colonize the galaxy.
> It's clear that these models have orders of magnitude too much data already.
I have a toy disproof for your claim that this is clear.
Imagine that you are training an ML system using oracle access to Mum. The ML training system can request 10 million representative samples of Mum output, and then we could judge whether the ML system has adequately reproduced Mum.
Now also imagine that Mum frequently tells people that she knows a 23-letter secret, and while she won't tell anyone what it is outright, she'll answer queries about whether a guess is lexicographically higher or lower. We could even imagine that the ML has seen Mum's side of some interactions where she does that.
Would the ML know Mum's secret? No.
Would a child that could interact with Mum? Yes -- after at most ceil(23 * log2(26)) = 109 queries, if the child does an efficient binary search over the 26^23 possible strings.
Learning in an interactive context is not the same as learning from written material, so the fact that children learn English from less text doesn't mean that a non-interactive ML system could learn English from the same amount. Q.E.D.
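A small sketch of that interactive argument, assuming a lowercase a-z alphabet and using a stand-in oracle for Mum:

    # Sketch of the interactive argument: with only higher/lower answers,
    # binary search over the 26**23 candidate strings recovers the secret
    # in at most ceil(23 * log2(26)) = 109 queries.
    import math
    import random
    import string

    SECRET = "".join(random.choices(string.ascii_lowercase, k=23))

    def oracle(guess: str) -> int:
        """Mum's answer: -1 if the guess is lexicographically low, 0 if right, 1 if high."""
        return (guess > SECRET) - (guess < SECRET)

    def index_to_string(i: int) -> str:
        """Map an integer in [0, 26**23) to a 23-letter string (base 26)."""
        letters = []
        for _ in range(23):
            i, r = divmod(i, 26)
            letters.append(string.ascii_lowercase[r])
        return "".join(reversed(letters))

    lo, hi, queries = 0, 26**23 - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        queries += 1
        answer = oracle(index_to_string(mid))
        if answer == 0:
            break
        elif answer < 0:   # guess too low, secret is lexicographically higher
            lo = mid + 1
        else:              # guess too high, secret is lexicographically lower
            hi = mid - 1

    print(f"recovered in {queries} queries (bound: {math.ceil(23 * math.log2(26))})")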
Now, if someone figures out how to efficiently train these natural language models with reinforcement learning...
I disagree with this take, because you grok English not only from the text you read, but also from the context of the physical world around you. And that context is enormous: assuming 8000x8000 vision in each of two eyes, with 3 one-byte color channels at 24fps and no compression, you take in about 3e+17 bytes (~300 petabytes) of data alongside your reading per year.
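The back-of-the-envelope arithmetic behind that figure (every number is a rough assumption, not a measurement):

    # Back-of-the-envelope for the figure above; every number here is a
    # rough assumption, not a measurement.
    width, height, eyes = 8000, 8000, 2
    bytes_per_pixel = 3                  # 3 color channels, 1 byte each, uncompressed
    fps = 24
    seconds_per_year = 60 * 60 * 24 * 365

    bytes_per_year = width * height * eyes * bytes_per_pixel * fps * seconds_per_year
    print(f"{bytes_per_year:.1e} bytes/year (~{bytes_per_year / 1e15:.0f} petabytes)")
    # -> ~2.9e+17 bytes/year, i.e. roughly 300 petabytes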