
I love how AI apologists trip over themselves to explain away deficiencies in GPT and explain how it will “get better with time”.

Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

Worse, since we expect AI generated content to start becoming widespread and difficult to distinguish from human content, GPT will eventually start cannibalizing its own outputs as training data, leading to shittier and shittier models over time that get overfitted and no longer seem human enough for any practical use.

It’s a fading dream.




Not sure why so many downvotes for you. HN is clearly very pro-AI.

I think you're correct about the training data problem. This is starting to remind me of "The Last Question" by Asimov: "There is as yet insufficient data for a meaningful answer." While in that story the AIs kept progressing, I think in reality we will forever be stuck with diminishing quality of training data.

Even just consider post-Copilot GitHub. Presumably there is now code publicly available that was generated by an AI. Next time somebody slurps up GitHub to train a new model, some of that code will be included. Overfitting ensues.


How is that overfitting? If the code accomplishes a task with the given requirements, it satisfies the problem.

So many devs just copy-paste code as is.


>Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

One word: AlphaZero. DeepMind ran out of human Go games to study, but it turned out that self-play was dramatically better. Your argument only holds if a) there's a linear relationship between the amount of training data and the quality of a model and b) GPT is close to maximally efficient in converting training data into useful weights. Both of these premises are demonstrably false.

GPT-4 is, in the scheme of what's possible, an incredibly primitive model that uses training data very inefficiently. In spite of that, a dumb brute force architecture still managed to vastly exceed everyone's expectations and advance the SOTA by a huge leap.


In Go, or similarly chess, the AI can play a stupendous number of games against itself and get accurate feedback for every single game. Everything you need to create your own training set is there just from knowing the rules. But outside of such games, how does an AI create its own training data when there is no function to tell it how well it is doing? This might be a dumb question; I don't have any idea how LLMs work.


One such function is “what happens next?” which may work as well in the real world as on textual training data. Certainly it’s part of how human babies learn, via schemas.
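As a minimal sketch (assuming PyTorch and a hypothetical sequence model), the "what happens next?" signal is just next-token prediction: the data labels itself, so no external grader is needed.

    # Self-supervised "what happens next?" objective: predict token t+1
    # from tokens up to t; the text itself provides the labels.
    import torch.nn.functional as F

    def next_token_loss(model, tokens):     # tokens: (batch, seq_len) integer tensor
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)               # (batch, seq_len-1, vocab); hypothetical model
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))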


Creating something is much harder than verifying it.

A simple setup for improving coding skills is the following (a rough code sketch appears after the list):

1. GPT is given a coding task to implement as a high level prompt.

2. It generates unit tests to verify that the implementation is correct.

3. It generates code to implement the algorithm.

4. It runs the generated code against the generated unit tests. If there are errors generated by the interpreter/compiler, go back to Step 3, modify the code appropriately and try again.

5. If there are no errors found, take the generated code as a positive example and update the model weights with reinforcement learning.
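
A minimal sketch of that loop in Python, assuming a hypothetical model.complete() wrapper around whatever LLM is being used and pytest available locally; the RL update in step 5 is left out since it depends entirely on the training setup.

    # Hypothetical sketch of the generate-tests / generate-code / verify loop.
    # model.complete() stands in for an LLM call; it is not a real library API.
    import subprocess
    import tempfile

    def self_improvement_step(model, task_prompt, max_attempts=5):
        # Step 2: generate unit tests from the high-level prompt.
        tests = model.complete("Write pytest unit tests for: " + task_prompt)
        for _ in range(max_attempts):
            # Step 3: generate an implementation.
            code = model.complete("Implement solution.py for: " + task_prompt)
            with tempfile.TemporaryDirectory() as d:
                with open(d + "/solution.py", "w") as f:
                    f.write(code)
                with open(d + "/test_solution.py", "w") as f:
                    f.write(tests)
                # Step 4: run the generated tests against the generated code.
                result = subprocess.run(["pytest", d], capture_output=True, text=True)
            if result.returncode == 0:
                # Step 5: keep (prompt, code) as a positive example for the RL update.
                return code, tests
            # Otherwise feed the failure output back and retry step 3.
            task_prompt += "\nPrevious attempt failed with:\n" + result.stdout[-2000:]
        return None, tests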


What if it’s wrong at step 2?


The most naive way you could do things would be to procedurally generate immense amounts of Python code, then ask the model to predict whether the code will compile, whether it will crash, what its outputs will be given certain inputs, etc.
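
A rough sketch of that idea, with a toy stand-in for the procedural generator (a real pipeline would emit far richer programs):

    # Build self-labelling training data by executing generated snippets
    # and recording whether they run and what they print.
    import random
    import subprocess
    import sys

    def random_snippet():
        # Toy stand-in for a real procedural code generator.
        a, b = random.randint(0, 9), random.randint(0, 9)
        op = random.choice(["+", "-", "*", "/", "//"])
        return f"print({a} {op} {b})"

    def label(code):
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=2)
        return {"code": code,
                "crashes": proc.returncode != 0,   # e.g. ZeroDivisionError when b == 0
                "output": proc.stdout.strip()}

    dataset = [label(random_snippet()) for _ in range(1000)]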


Code execution is also a good way to collect feedback signals.


Well, there sort of is a linear relationship between the amount of training data and the quality of the model [1].

[1]: https://arxiv.org/abs/2203.15556
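
For what it's worth, the fit in that paper is a power law in both parameters and data rather than strictly linear; a quick sketch of the published parametric form (constants are the paper's approximate point estimates):

    # Approximate loss fit from arXiv:2203.15556 (Chinchilla), Approach 3:
    # loss as a function of parameter count N and training tokens D.
    def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        return E + A / N**alpha + B / D**beta

    # e.g. roughly Chinchilla itself: 70B parameters on 1.4T tokens
    print(chinchilla_loss(70e9, 1.4e12))   # about 1.94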


That's why I'm advocating against being (almost) unconditionally "for" or "against" a certain technology.

It seems like some people struggle with the notion of something being good in some areas and bad in others.

Why not evaluate it on the merits of what is possible now and extrapolate into the near future? Any prediction beyond that is most likely futile anyway.

Unless you are planning to run a business or work in academia, there is most likely no need to overreact even if it's groundbreaking next month. Everything moves more slowly than we expect anyway.

In the end useful technologies will stay while the rest will disappear into the void sooner or later.


The internet has turned people into a mindless mob, myself included.

Most people tend to take the same side their favourite celebrities do, and then argue for that to the extremes.


This is pretty pessimistic. I don't know what kind of expectations you have about LLMs. Less than 10 years have passed since the original Transformers paper and we're seeing tangible and useful software out of it. I've seen far worse vaporware.

PS: Anyways, regarding your argument, yes, transformer-based models right now are "shit" (by whatever measure you are using, though I still don't know what you are comparing them with; human-level intelligence, I suppose), but more training data is not the only way to make better models.


It's a good point for text input, but if you go multi-modal and somehow find a way to make good use of audio and video, there's practically unlimited data available.

Also, considering that humans will probably still only publish the output that looks good, even that provides a weak signal on quality.


I'm skeptical that multimodal input can help with programming or logic problems - or even most scientific problems.


Having diagrams (think free body diagrams in statics, or a T-s diagram in thermodynamics) makes a lot of non-trivial problems a lot simpler to communicate. And correctly understanding an unambiguous definition of a problem is a major step towards solving it.

If language were enough (or, similarly, if multimodal input were not useful), college math professors wouldn't use so much chalk making drawings and diagrams to explain their ideas.


> Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available.

This assumes that GPT4 is trained on all currently-available training data.

Is that true?


Will you eat a hat if GPT5 and 6 are huge improvements?


This is such an extraordinary claim. That this technology arms race the like of which we’ve possibly never seen before is somehow going to end at GPT-4 because ‘we’ve run out of training data’. Do you think the entire research staff at OpenAI is just there to figure out how to scale up transformers?


Agreed... I'm just disturbed by how badly people want to believe



