
I love how AI apologists trip over themselves to explain away deficiencies in GPT and explain how it will “get better with time”.

Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

Worse, since we expect AI generated content to start becoming widespread and difficult to distinguish from human content, GPT will eventually start cannibalizing its own outputs as training data, leading to shittier and shittier models over time that get overfitted and no longer seem human enough for any practical use.

It’s a fading dream.




Not sure why so many downvotes for you. HN is clearly very pro-AI.

I think you're correct about the training data problem. This is starting to remind me of "The Last Question" by Asimov: "There is as yet insufficient data for a meaningful answer." While in that story the AIs kept progressing, I think in reality we will forever be stuck with diminishing quality of training data.

Even just consider post-Copilot GitHub. Presumably there is now code publicly available that was generated by an AI. Next time somebody slurps up GitHub to train a new model, some of that code will be included. Overfitting ensues.


How is that overfitting? If the code accomplishes a task with the given requirements, it satisfies the problem.

So many devs just copy-paste code as is.


>Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available. But new content will come in slowly, and it will probably take decades just to double the amount of training data we have today.

One word: AlphaZero. DeepMind ran out of human Go games to study, but it turned out that self-play was dramatically better. Your argument only holds if a) there's a linear relationship between the amount of training data and the quality of a model and b) GPT is close to maximally efficient in converting training data into useful weights. Both of these premises are demonstrably false.

GPT-4 is, in the scheme of what's possible, an incredibly primitive model that uses training data very inefficiently. In spite of that, a dumb brute force architecture still managed to vastly exceed everyone's expectations and advance the SOTA by a huge leap.


In Go, or similarly chess, the AI can play a stupendous number of games against itself and get accurate feedback for every single game. Everything you need to create your own training set is there just from knowing the rules. But outside of such games, how does an AI create its own training data when there is no function to tell it how well it is doing? This might be a dumb question; I don't have any idea how LLMs work.


One such function is “what happens next?” which may work as well in the real world as on textual training data. Certainly it’s part of how human babies learn, via schemas.
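As a minimal sketch (assuming PyTorch and a hypothetical sequence model), the "what happens next?" signal is just next-token prediction: the data labels itself, so no external grader is needed.

    # Self-supervised "what happens next?" objective: predict token t+1
    # from tokens up to t; the text itself provides the labels.
    import torch.nn.functional as F

    def next_token_loss(model, tokens):     # tokens: (batch, seq_len) integer tensor
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)               # (batch, seq_len-1, vocab); hypothetical model
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))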


Creating something is much harder than verifying it.

A simple setup for improving coding skills is the following (a rough code sketch appears after the list):

1. GPT is given a coding task to implement as a high level prompt.

2. It generates unit tests to verify that the implementation is correct.

3. It generates code to implement the algorithm.

4. It runs the generated code against the generated unit tests. If there are errors generated by the interpreter/compiler, go back to Step 3, modify the code appropriately and try again.

5. If there are no errors found, take the generated code as a positive example and update the model weights with reinforcement learning.
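
A minimal sketch of that loop in Python, assuming a hypothetical model.complete() wrapper around whatever LLM is being used and pytest available locally; the RL update in step 5 is left out since it depends entirely on the training setup.

    # Hypothetical sketch of the generate-tests / generate-code / verify loop.
    # model.complete() stands in for an LLM call; it is not a real library API.
    import subprocess
    import tempfile

    def self_improvement_step(model, task_prompt, max_attempts=5):
        # Step 2: generate unit tests from the high-level prompt.
        tests = model.complete("Write pytest unit tests for: " + task_prompt)
        for _ in range(max_attempts):
            # Step 3: generate an implementation.
            code = model.complete("Implement solution.py for: " + task_prompt)
            with tempfile.TemporaryDirectory() as d:
                with open(d + "/solution.py", "w") as f:
                    f.write(code)
                with open(d + "/test_solution.py", "w") as f:
                    f.write(tests)
                # Step 4: run the generated tests against the generated code.
                result = subprocess.run(["pytest", d], capture_output=True, text=True)
            if result.returncode == 0:
                # Step 5: keep (prompt, code) as a positive example for the RL update.
                return code, tests
            # Otherwise feed the failure output back and retry step 3.
            task_prompt += "\nPrevious attempt failed with:\n" + result.stdout[-2000:]
        return None, tests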


What if it’s wrong at step 2?


The most naive way you could do things would be to procedurally generate immense amounts of Python code, then ask the model to predict whether the code will compile, whether it will crash, what its outputs will be given certain inputs, etc.
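
A rough sketch of that idea, with a toy stand-in for the procedural generator (a real pipeline would emit far richer programs):

    # Build self-labelling training data by executing generated snippets
    # and recording whether they run and what they print.
    import random
    import subprocess
    import sys

    def random_snippet():
        # Toy stand-in for a real procedural code generator.
        a, b = random.randint(0, 9), random.randint(0, 9)
        op = random.choice(["+", "-", "*", "/", "//"])
        return f"print({a} {op} {b})"

    def label(code):
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=2)
        return {"code": code,
                "crashes": proc.returncode != 0,   # e.g. ZeroDivisionError when b == 0
                "output": proc.stdout.strip()}

    dataset = [label(random_snippet()) for _ in range(1000)]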


Code execution is also a good way to collect feedback signals.


Well, there sort of is a linear relationship between the amount of training data and the quality of the model [1].

[1]: https://arxiv.org/abs/2203.15556
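
For what it's worth, the fit in that paper is a power law in both parameters and data rather than strictly linear; a quick sketch of the published parametric form (constants are the paper's approximate point estimates):

    # Approximate loss fit from arXiv:2203.15556 (Chinchilla), Approach 3:
    # loss as a function of parameter count N and training tokens D.
    def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
        return E + A / N**alpha + B / D**beta

    # e.g. roughly Chinchilla itself: 70B parameters on 1.4T tokens
    print(chinchilla_loss(70e9, 1.4e12))   # about 1.94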


That's why I'm advocating against being (almost) unconditionally "for" or "against" a certain technology.

It seems like some people struggle with the notion of something being good in some areas and bad in others.

Why not evaluate it on the merits of what is possible now and extrapolate into the near future? Any prediction beyond that is most likely futile anyway.

Unless you are planning to run a business or work in academia, there is most likely no need to overreact even if it's groundbreaking next month. Everything moves more slowly than we expect anyway.

In the end useful technologies will stay while the rest will disappear into the void sooner or later.


The internet has turned people into a mindless mob, myself included.

Most people tend to take the same side their favourite celebrities do, and then argue for that to the extremes.


This is pretty pessimistic. I don't know what kind of expectations you have about LLMs. Less than 10 years have passed since the original Transformers paper and we're seeing tangible and useful software out of it. I've seen far worse vaporware.

PS: Anyways, regarding your argument, yes, transformer-based models right now are "shit" (by whatever measure you are using, though I still don't know what you are comparing them with; human-level intelligence, I suppose), but more training data is not the only way to make better models.


It's a good point for text input, but if you go multi-modal and somehow find a way to make good use of audio and video, there's practically unlimited data available.

Also, considering that humans will probably still only publish the output that looks good, even that provides a weak signal on quality.


I'm skeptical that multimodal input can help with programming or logic problems - or even most scientific problems.


Having diagrams (think free body diagrams in statics, or a T-s diagram in thermodynamics) makes a lot of non-trivial problems a lot simpler to communicate. And correctly understanding an unambiguous definition of a problem is a major step towards solving it.

If language were enough (or, similarly, if multimodal input were not useful), college math professors wouldn't use so much chalk making drawings and diagrams to explain their ideas.


> Sorry, it won’t. This might even be peak GPT. Training data comes from human content, and currently there is decades worth of pure human content available.

This assumes that GPT4 is trained on all currently-available training data.

Is that true?


Will you eat a hat if GPT5 and 6 are huge improvements?


This is such an extraordinary claim. That this technology arms race the like of which we’ve possibly never seen before is somehow going to end at GPT-4 because ‘we’ve run out of training data’. Do you think the entire research staff at OpenAI is just there to figure out how to scale up transformers?


Agreed... I'm just disturbed by how badly people want to believe



