
The solution won't be just "bigger". A model with a trillion parameters will be more expensive to train and to run, but is unlikely to be better. Think of the early days of flight: you had biplanes; then you had triplanes. You could have followed that further and added more wings - but it wouldn't have improved things.

Improving AI will involve architectural changes. No human requires the amount of training data we are already giving the models. Improvements will make more efficient use of that data, and (no idea how - innovation required) allow them to generalize and reason from that data.




> "No human requires the amount of training data we are already giving the models."

Well, humans are also trained differently. We interact with other humans in real time and get immediate feedback on their responses. We don't just learn by reading through reams of static information. We talk to people. We get into arguments. And so on.

Maybe the ideal way to train an AI is to have it interact with lots of humans, so it can try things on its own? The implication of that is maybe the best trained AI will be the center of some important web property, like, say, the Google search engine (I'm imagining something like Google search now, but more conversational -- it asks you if that was what you're looking for, and asks clarifying questions.) Whoever has the most users will have the best AI, which creates the best product, which attracts the most users... and so on.

I do agree that architectural improvements could be hugely significant too.


Yeah I'm not totally convinced humans don't have a tremendous amount of training data - interacting with the world for years with constant input from all our senses and parental corrections. I bet if you add up that data it's a lot.

But once we are partially trained, training more requires a lot less.


> I bet if you add up that data it's a lot.

Let's Fermi estimate that.

A 4k video stream is about 50 megabits/second. Let's say that humans have the equivalent of two of those going during waking hours, one for vision and one for everything else. Humans are awake for 18 hours/day, and we'll say a human's training is 'complete' at 25.

Multiply that together, and you end up with about 7.4e15 bytes, or roughly 7 petabytes of data.
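
For anyone who wants to tweak the assumptions, here's the arithmetic in a few lines of Python (the bitrate, waking hours, and cutoff age are of course just the guesses above):

    # Back-of-the-envelope: raw sensory input over a human "training period"
    bitrate_bits_per_s = 2 * 50e6      # two 4K-ish streams at 50 Mbit/s each
    waking_s_per_day = 18 * 3600       # awake 18 hours/day
    days = 25 * 365                    # "training complete" at age 25

    total_bytes = bitrate_bits_per_s * waking_s_per_day * days / 8
    print(f"{total_bytes:.1e} bytes")  # ~7.4e15, i.e. a handful of petabytes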

There's plenty of reason to think that we don't learn effectively from all of this data (there's lots of redundancy, for example), but at the grossest orders of magnitude you seem to be right that at least in theory we have access to a tremendous amount of data.


> Let's Fermi estimate that.

I couldn't agree more. Fermi estimates are underused.

> Multiply that together, and you end up with about 7.4e15 bytes, or roughly 7 petabytes of data.

And for comparison, GPT-4's training data is estimated by Bill Dally (NVIDIA) at around 10^12 to 10^13 tokens [1]. Let's assume about 1 word per token and 5 characters per word, and since US-ASCII characters take one byte each in UTF-8, that gives about 50 terabytes [Edited after comment below].
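
Or, spelled out with those assumed conversion factors:

    tokens = 1e13                      # upper end of the 1e12-1e13 estimate
    bytes_total = tokens * 1 * 5 * 1   # 1 word/token, 5 chars/word, 1 byte/char
    print(f"{bytes_total:.0e} bytes")  # 5e13 bytes, i.e. ~50 TB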

As a side note, I would guess that GPT-4 knows more "things" than any single person, if you could count them all up. For example, it knows more languages, more facts about cities, more facts about people. However, people know way more inside their own specialization.

[1]: https://youtu.be/kLiwvnr4L80?si=77pXmIBmlp8dsSCG&t=349


> Let's assume about 1 word per token and 5 characters per word, and since US-ASCII characters take one byte each in UTF-8, that gives about 40 petabyte.

Thanks, I was looking for that estimate but could not quickly find it. (Although are you sure about that 'petabyte' figure? I see 1e13 * 5 = 5e13 = 50 TB?)

I love these sorts of bulk comparisons, since they let us start to reason about how much humans learn from "each bit of experience" versus LLMs. On one hand humans process a lot of sensory information (most of it by discarding it, mind you), but on the other humans do a lot of 'active learning', whereby experience is directly tied to actions. LLMs don't have the latter, since they're trained on passive feeds of tokens.


This is completely and fundamentally an incorrect approach from start to finish. The human body - and the human mind - do have electrical and logic components but they are absolutely not digital logic. We do not see in “pixels”. The human mind is an analog process. Analog computing is insanely powerful and exponentially more efficient (time, energy, bandwidth) than digital computing but it is ridiculously hard to pack or compose, difficult to build with, limited in the ability to perform imperative logic with, etc.

You cannot compare what the human eyes/brain/body does with analog facilities to its digital “equivalent” and then “math it” from there.

This is also why trying to replicate the human brain with digital logic (the current AI approach) is so insanely expensive.


You missed the point of the exercise. Of course it's extremely difficult to compare the two, but the question was: do humans get nearly as much training data as LLMs do? This analysis is good enough to say "actually humans receive much more raw input data than LLMs during their 'training period'." You're concerned with what the brain is doing with all that data compared to LLMs, but that's not the point of the exercise.


No, you can’t even compare that because the information isn’t packetized. Again, we don’t see in pixels so you can’t just consider sensory input in that manner.


What about using energy cost as a stand-in for the data?

The net energy cost of training and maintaining humans vs. training and maintaining (updating and augmenting) AI models.

The human body is incredibly efficient but there’s always a larger ramp-up cost.


Your estimate is using compressed video rates, which might arguably be a fair way to model how the brain processes and fills in presumably "already known" or inferred information. I don't know enough about this subject to make that argument.

Uncompressed rates are easy to calculate if you want to adjust your approximation:

24-bit, 4K UHD @ 60 fps: 24 × 3840 × 2160 × 60 = 11.9 Gbit/s.
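
In Python, for anyone who wants to plug the uncompressed rate into the estimate above:

    bits_per_pixel = 24
    width, height, fps = 3840, 2160, 60
    uncompressed_bps = bits_per_pixel * width * height * fps
    print(f"{uncompressed_bps / 1e9:.1f} Gbit/s")  # ~11.9, vs ~0.05 for a compressed stream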


Children are awake 12 hours per day, and a 90-minute movie used to fit on a CD-ROM, so that's eight 700MB CD-ROMs a day; times 365, that's about 2TB per year. Alexander the Great became king aged 20, and he seemed trained enough by then, so that's about 40TB, or 4e13 bytes of data.
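
The same low-ball estimate in Python, if anyone wants to fiddle with the assumptions:

    movie_bytes = 700e6               # one ~90-minute movie on a CD-ROM
    movies_per_day = 12 * 60 / 90     # 12 waking hours, 90 minutes per movie
    years = 20                        # Alexander the Great crowned at 20

    total_bytes = movie_bytes * movies_per_day * 365 * years
    print(f"{total_bytes:.0e} bytes") # ~4e13, i.e. ~40 TB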

Pretty unconvinced by this argument, it mostly hinges on the video stream size.


I think it is just an estimate, and yours is on the extreme low end of things. The GP was using 4K as a stand-in for touch, smell, taste, pain, hunger, pleasure, etc. Those are a bit more complex than just video, and you can't encode them on a CD.

Regardless, yours still came out comparable to what GPT was trained on, despite being such a low-ball estimate. So I guess the original argument makes some sense.


So much of that data is totally useless though. Most of the data we get is visual, and I would argue that visual data is some of the least useful relative to its volume. Think about the amount of time we spend looking at blank colours with nothing to learn from. Once you've seen one wall, one chair, one table, etc., there's not much left to know about those things. An encyclopedia, for example, is much less data than a few hours of high-res video of clouds passing by, yet it's clearly orders of magnitude more information-rich.


Also have to disagree here. You see that one object, yes, but you see it from 10,000,000s of different angles and distances. In different settings. In different lighting conditions. With different arrangements. And you see the commonalities and differences. You poke it. Prod it. Hit it. Break it. You listen to it.

This is the basis for 'common sense' and I’m pretty sure everything else needs that as the foundation.

Go watch a child learning and you'll see a hell of a lot of this going on. They want to play with the same thing over and over and over again. They want to watch the same movie over and over and over again. Or the same song over and over and over and over and over again.


This just sounds so hilariously wrong from start to finish. The claim that any of the data you collect is useless is already a big [citation needed].


Nit: The bandwidth of your optic nerve is only about 10 kilobits per second. You think you're seeing in 4K but most of it is synthetic.


Academic sources I’ve seen are in the 10Mbps range, e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564115/ (estimating 875kbps for a guinea pig and ~10x that for a human).


That cannot be right.

10 kilobits per second would only get you some awful telephone quality audio. Even super low res video would be orders of magnitude more than that.


Someone added it up:

https://osf.io/preprints/psyarxiv/qzbgx

> Large language models show intriguing emergent behaviors, yet they receive around 4-5 orders of magnitude more language data than human children.


_language_ data.

The point is that humans get a lot more input than just language. We get to see language out in the physical world. LLMs have the (really quite difficult) task of inferring all of the real world and its connections, common sense, and laws of physics while locked in a box reading Wikipedia.

How capable do you think a child would be if you kept them in a dark box and just spoke to them through the keyhole? (not recommending this experiment)


The first humans were kinda dumb. Yet by interacting with the world and each other, humans got smart. I wonder whether neural network instances that interact with a simulated environment and with each other could become smart enough for practical purposes.

Kinda like AlphaGo Zero, but with "life".


Conversation is essential, but so is interaction with the real world.

If a bunch of AIs started discussing things with each other, and learning from each other, without validation, things could go wrong very easily.

I bet a lot of people thought that strapping on a pair of feathery wings was all that was needed to fly. This "knowledge" could've been passed from one person to another through discussion. But it only took one person (or a few) actually trying it in the real world to learn that, no, it doesn't work like that.

AI communities (as in communities of AIs talking to each other) might end up filled with the most outrageous "conspiracy theories" that sound plausible but have no substance.


> The first humans were kinda dumb

How can you assert this? Do you have any evidence?


Can't tell if you're joking or serious.

In the latter case, the first humans (H. habilis) had about half of H. sapiens' brain to work with, and a much smaller fraction of neocortex.

If that doesn't satisfy you, let's say I was speaking about some sort of human ancestor before that, which would have been about as dumb as chimps, unless you require proof of their dumbness as well.


Not your original objector, but ... I was willing to accept that early humans were dumb, but your explanation relies on evolution and not social interaction.


We know their brains didn’t grow much after birth, unlike humans, which also suggests faster maturity akin to living apes and likely less social support for extended adolescence.


Humans didn't evolve knowledge, they evolved the capacity to get more knowledge.

All the new knowledge they got they figured out from the environment.


Exactly. Humans didn't learn their way out of only having half a brain, more the opposite.


Hard to imagine anyone believes these prehuman theories. Just look at the H. habilis wiki page to see that 90% of it is pure speculation and debated.


Even tracing it back to the origin of Homo erectus, the line would still be too blurry. I don't think "the first humans" can actually mean anything in a science-based discussion.


It's not just humans we interact with - it's reality. For an LLM, all "facts" are socially determined; for a human, they are not.


Also, humans have multiple internal models for various aspects of reality that seem to exist as a layer on top of the "raw" training. They have the ability to extrapolate based on those models (which transformers cannot do, as far as I understand). Perhaps GP is right -- perhaps what's missing in the architecture is an analog of these models.


While I agree with you that advances will come from being able to train with less data using as yet undevised techniques, I think you are jumping to a conclusion with this particular work.

First, yes, bigger appears to be better so far. We haven't yet found the plateau. No, bigger won't solve the well known problems, but it's absolutely clear that each time we build a bigger model it is qualitatively better.

Second, it's not clear that this work is trying to build AGI, which I assume you are referring to when you say "the solution." Of all the use cases for language models, building one on all the world's scientific data, like they are doing in this project, is probably the most exciting to me. If all it can do is dig up relevant work for a given topic from the entire body of scientific literature, it will be revolutionary for science.


> you had biplanes; then you had triplanes. You could have followed that further and added more wings - but it wouldn't have improved things.

But people did try! https://en.wikipedia.org/wiki/Besson_H-5 https://en.wikipedia.org/wiki/Multiplane_(aeronautics)#Quadr...


Why stop here? This engineer experimented with 20 wings and more:

https://en.wikipedia.org/wiki/Horatio_Frederick_Phillips


Should've kept going.


Karpathy in his recent video [1] agrees, but at this point scaling is a very reliable way to get better accuracy.

[1]: https://youtu.be/zjkBMFhNj_g?si=eCH04466rmgBkHDA


Seems like he actually disagrees here:

If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve. So algorithmic progress is not necessary, it's a very nice bonus, but we can sort of get more powerful models for free, because we can just get a bigger computer, which we can say with some confidence we're going to get, and just train a bigger model for longer, and we are very confident we are going to get a better result.

https://youtu.be/zjkBMFhNj_g?t=1543 (23:43)


And then at 35 minutes he spends a few minutes talking about ideas for algorithmic improvements.


This. We have not exhausted all the techniques at our disposal yet. We do need to look for a new architecture eventually, but these are orthogonal efforts.


OpenAI's big innovation was "bigger". It is not clear when we should stop scaling.


I asked ChatGPT how many parameters the human brain has, and it said 86B neurons * 1000 connections, so 86T parameters.

It does seem like bigger models give better responses when given benchmarks. It might plateau or overfit the data at some point, but I'm not sure we've reached it yet.


Unlike biplanes, CPUs with more transistors are more powerful than those with fewer. And adding more CPU cores keeps increasing the number of threads you can run at the same time.

Why would LLMs be more like the biplane analogy, and less like the CPU analogy?


In general you can view "understanding" as a compression of information. You take in a bunch of information, detect an underlying pattern, and remember the pattern and necessary context, instead of the entire input.

The "problem" with larger neural networks is that they can store more information, so they can substitute understanding with memorization. Something similar happens with human students, who can stuff lots of raw information into short-term memory, but to fit it into the much more precious long-term memory you have to "understand" the topic, not just memorize it. In neural networks we call that memorization a failure to generalize. Just like a human, a network that just memorizes doesn't do well if you ask it about anything slightly different from the training data.

Of course it's a balancing act, because a network that's too small doesn't have space to store enough "understanding" and world model. A lot of the original premise of OpenAI was to figure out whether LLMs keep getting better if you make them bigger, and so far that has worked. But there is bound to be a ceiling on this, where making the model bigger starts making it dumber.
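
A toy illustration of that memorization-vs-generalization trade-off (just a polynomial fit, not an LLM, so treat it as an analogy only): a model with exactly as many parameters as training points can hit every training point and still do worse everywhere else.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, x_train.size)
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (3, 9):  # small model vs. one big enough to memorize the noise
        coefs = np.polyfit(x_train, y_train, degree)
        train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
        test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
        print(degree, round(train_err, 5), round(test_err, 5))
    # Typically the degree-9 fit nails the training points but does worse off them.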


No one expected larger LLMs to be this amazing, so although it may seem unlikely that these even larger models will do anything, it was also unlikely that we would end up in our current situation with LLMs.


The really early days of flying were like "let's add a few feathers and then add some more". Though architectural changes were made too.


What? The defining trend of the last 5 or so years is the victory of the scaling hypothesis. More scale = more intelligence. GPT-4 is way smarter than 3.5, and this trend is ongoing.

You need more data to utilize more parameters, but folks at the forefront are confident that they are not going to run out of data any time soon.

If you mean “solution to AGI”, maybe. But perhaps in-context scratchpads and agent loops will be sufficient to get this architecture to human-level performance, with enough data/parameters. (Sutskever and Hinton have both expressed belief that the current architectures might get us there.)

All that said, it’s also possible that new architectures will be needed at some point, I’m just pushing back on your claim that we already hit the cliff.


The main hero here is not model size but the dataset and the environment that created it. All this model talk misses the point - without the repository of human experience captured in language, these models would not get so smart. And the improvement path is the same - assimilate more experience. This time the AI agent can create its own interactions and feedback signals, which would help it fix its flaws.

Learning in the third person from past data can only take AI so far. It needs to act and learn in the present, in the first person, to be truly smart. No architectural change is needed, but the model needs to be placed in a real environment to get feedback signals.



