The paper shows that their model first overfitted the data. By overfitting I mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model never gets any feedback from the validation dataset through the training procedure.
Everyone's expectation would be that this is it. The model is overfitted, so it is useless. The model is as good as a hash map, 0 generalization ability.
The paper provides empirical, factual evidence that as you continue training there is still something happening in the model. After the model has memorized the whole training dataset, and while it still has not received any feedback from the validation dataset, it starts to figure out how to solve the validation dataset.
Mind you, this is not interpretation, this is factual. Long after 100% overfitting, the model is able to keep increasing its accuracy on a dataset it has not seen.
It's as if we discovered that water can flow upwards.
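To make the setup concrete: below is a minimal sketch of the kind of experiment involved, i.e. modular addition data, a tiny network, weight decay, and training far past the point where the training set is memorized. The architecture and hyperparameters are my own illustrative choices, not the paper's exact configuration, so don't expect it to reproduce the curves as-is.

```python
# Minimal sketch (not the paper's exact setup): train a small network on
# modular addition, keep training long after it memorises the training split,
# and track validation accuracy separately. Hyperparameters are illustrative.
import torch
import torch.nn as nn

p = 97                                   # modulus for a + b (mod p)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))          # small training fraction, as in the grokking setting
train_idx, val_idx = perm[:n_train], perm[n_train:]

def one_hot(x):
    # represent (a, b) as two concatenated one-hot vectors
    return torch.cat([nn.functional.one_hot(x[:, 0], p),
                      nn.functional.one_hot(x[:, 1], p)], dim=1).float()

X_train, y_train = one_hot(pairs[train_idx]), labels[train_idx]
X_val, y_val = one_hot(pairs[val_idx]), labels[val_idx]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # weight decay is the knob that seems to matter
loss_fn = nn.CrossEntropyLoss()

def accuracy(X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

for step in range(100_000):              # deliberately train far past memorisation
    opt.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(X_train, y_train):.3f}  "
              f"val acc {accuracy(X_val, y_val):.3f}")
```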
Grokking was discovered by someone forgetting to turn off their computer.
Nobody knows why. So, nobody is able to make any theoretical deductions about it.
But I agree that Fig. 3 requires interpretation. By itself it does not say a lot, but similar structures appear in other models, like the one where we discussed element-sequence prediction. To me, the models figure out some underlying structure of the problem, and we are able to interpret that structure.
I tend to look at it from a Bayesian perspective. This type of evidence increases my belief that the models are learning what I would call semantics. It's a separate line of evidence from looking at benchmark results. Here we get a glimpse of how some models may be making simple predictions, and it does not look like memorization.
>> The paper shows that their model first overfitted the data. By overfitting I
mean 100% train dataset accuracy and ~0% validation dataset accuracy. The model
never gets any feedback from the validation dataset through the training
procedure.
Yes, but the researchers get plenty of feedback from the validation set and
there's nothing easier for them than to tweak their system to perform well on
the validation set. That's overfitting on the validation set by proxy. It's
absolutely inevitable when the validation set is visible to the researchers and
it's very difficult to guard against because of course a team who has spent
maybe a month or two working on a system with a publication deadline looming are
not going to just give up on their work once they figure out it doesn't work very
well. They're going to tweak it and tweak it and tweak it, until it does what
they want it to. They're going to converge -they are going to converge- on
some ideal set of hyperparameters that optimises their system's performance on
its validation set (or the test set, it doesn't matter what it's called, it
matters that it is visible to the authors). They will even find a region of the
weight space where it's best to initialise their system to get it to perform
well on the validation set. And, of course, if they can't find a way to get good
performance out of their system, you and I will never hear about it because
nobody ever publishes negative results.
So there are very strong confirmation and survivorship biases at play and it's
not surprising to see, like you say, that the system keeps doing better. And
that suffices to explain its performance, without the need for any mysterious
post-overfitting grokking ability.
But maybe I haven't read the paper that carefully and they do guard against this
sort of overfitting-by-proxy? Have you found something like that in the paper? If
so, sorry for missing it myself.
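For reference, the kind of guard I mean is the usual three-way split discipline: tune on the validation set, touch a held-out test set exactly once at the end, and only if nobody iterates on that final number. A minimal sketch, with a dataset, model and hyperparameter grid of my own choosing:

```python
# Sketch of the standard guard against tuning-on-validation leakage:
# a test split that plays no role whatsoever in model selection.
# Dataset, model, and hyperparameter grid here are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# train / validation / test: the test set is set aside before any tuning
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_score, best_C = -np.inf, None
for C in [0.01, 0.1, 1.0, 10.0]:          # hyperparameter search uses validation only
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_C = score, C

final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("validation accuracy (used for tuning):", best_score)
print("test accuracy (reported once, at the end):", final.score(X_test, y_test))
```

Of course, as I argued above, this only helps if that final evaluation really does happen once and is not itself iterated on.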
> And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability.
It actually still does not suffice. The effect is simply not expected, no matter what the authors were doing.
Just the fact that they managed to get that effect is interesting.
Granted, the phenomenon may be limited in scope. For example, on ImageNet it may require ridiculously long time scales. But maybe there is some underlying reason we can exploit to get to grokking faster.
It's basically all in Fig. 2:
- they use 3 random seeds per result
- they show results for 12 different simple algorithmic datasets
- they evaluate 12 different combinations of hyperparameters
- for each hyperparameter combination they use 10+ different train/validation split ratios
So they do some 10*12*3*2 = 720 runs.
They conclude that hyperparameters are important. Weight decay seems especially important for the grokking phenomenon to happen when the model has access to a low ratio of training data.
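To make the bookkeeping concrete, here is a rough sketch of what such a sweep looks like; the axis sizes and names below are placeholders of my own, not the exact grid from the paper:

```python
# Bookkeeping sketch of a seeds x datasets x splits x hyperparameters sweep.
# The axis sizes and names below are placeholders, not the paper's exact grid.
from itertools import product

datasets = ["mod_addition", "mod_subtraction", "mod_division"]  # illustrative task names
weight_decays = [0.0, 0.1, 1.0]                                 # the knob that seems to matter
train_fractions = [0.2, 0.3, 0.4, 0.5]                          # train/validation split ratios
seeds = [0, 1, 2]                                               # seeds per configuration

grid = list(product(datasets, weight_decays, train_fractions, seeds))
for dataset, wd, frac, seed in grid:
    pass  # here you would train one model and log its train/validation accuracy curves

print("total runs in this placeholder grid:", len(grid))        # 3 * 3 * 4 * 3 = 108
```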
Also, at least 2 other people managed to replicate those results:
I don't agree. Once a researcher has full control of the data, they can use it
to prove anything at all. This is especially so in work like the one we discuss,
where the experiments are performed on artificial data and the researchers have
even more control over it than usual. As they say themselves, such effects are
much harder to obtain on real-world data. This hints at the fact that the effect
depends on the structure of the dataset and so it's unlikely to, well,
generalise to data that cannot be strictly controlled.
You are impressed by the fact that one particular, counter-intuitive result was
obtained, but of course there is an incentive to publish something that stands
out, rather than something less notable. There is a well-known paper by John
Ioannidis on cognitive biases in medical research:
It's not about machine learning per se, but its observations can be applied to
any field where empirical studies are common, like machine learning.
Especially in the field of deep learning where scholarly work tends to be
primarily empirical and where understanding the behaviour of systems is impeded
by the black-box nature of deep learning models, observing something mysterious
and unexpected must be cause for suspicion and scrutiny of methodology, rather
than being accepted unconditionally as an actual observation. In particular, any
hypothesis that tends towards magick, for example suggesting that a change in
quantities (data, compute, training time) yields qualitative improvements
(prediction transmogrifying into understanding, overfitting transforming into
generalisation), should be discarded with extreme prejudice.
> In particular, any hypothesis that tends towards magick, for example suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.
It does not tend towards magic. It does happen and people can replicate it. Melanie Mitchell recently brought back Drew McDermott's point that AI people tend to use wishful mnemonics. Words like understanding or generalisation can easily be just wishful mnemonics. I fully agree with that.
But the fact remains. A model that has ~100% training accuracy and ~0% validation accuracy on a simple but non-trivial dataset is able, with further training, to reach ~100% training and ~100% validation accuracy.
> This hints at the fact that the effect depends on the structure of the dataset and so it's unlikely to, well, generalise to data that cannot be strictly controlled.
Indeed, but it is still interesting. It may be that it manifests itself because there is a very simple rule underlying the dataset and the dataset is finite. But it also seems to work under some degree of noise and that's encouraging.
For example, the fact that it may help study the connection between wide, flat local minima and generalization is encouraging.
> You are impressed by the fact that one particular, counter-intuitive result was obtained
I'm impressed by the double descent phenomenon as well. And this one shows up all over the place.
> There is a well-known paper by John Ioannidis on cognitive biases in medical research: Why most published research findings are false
I know about John Ioannidis. I have written and thought a lot about the replication crisis in science in general. BTW, it's quite a pity that Ioannidis himself started selecting data to fit his thesis with regard to COVID-19.
> It's not about machine learning per se, but its observations can be applied to any field where empirical studies are common, like machine learning.
Unfortunately, it applies to theoretical findings too. For example, the universal approximation theorem, the no-free-lunch theorem, and the incompleteness theorems are widely misunderstood. There are also countless lesser-known theoretical results that are similarly misunderstood.
As far as I can tell the replications are on the same dataset, or at least the
same task, of modular arithmetic. Until we've seen comparable results on
radically different datasets, e.g. machine vision datasets, replications aren't
really telling us much. Some dudes ran the same program and they got the same
results. No surprise.
I confess that I'd be less suspicious if it reached less than full accuracy on
the validation set. 100% accuracy on anything is a big red flag and there's a
little leprechaun holding it and jumping up and down pointing at something. I'm
about 80% confident that this "grokking" stuff will turn out to be an artifact
of the dataset, or the architecture, or some elaborate self-deception of the
researchers by some nasty cognitive bias.
Perhaps one reason I'm not terribly surprised by all this is that uncertainties
about convergence are common in neural nets. See early stopping as a
regularisation procedure, and also, yes, double descent. If we could predict
when and how a neural net should converge, neural networks research would be a
more scientific field and less a
let's-throw-stuff-at-the-wall-and-see-what-sticks kind of field.
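To be clear about what I mean by early stopping as a regularisation procedure: the usual recipe halts (or rolls back) training once validation loss stops improving for some patience window. A minimal sketch, with toy data and a patience value of my own choosing:

```python
# Minimal sketch of patience-based early stopping as a regulariser:
# stop, and roll back, once validation loss has not improved for `patience` checks.
# Model, data, and patience value are illustrative assumptions.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(500, 10)
y = (X.sum(dim=1) > 0).long()
X_train, y_train, X_val, y_val = X[:400], y[:400], X[400:], y[400:]

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, best_state, patience, bad_checks = float("inf"), None, 10, 0
for epoch in range(1000):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, best_state, bad_checks = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_checks += 1
        if bad_checks >= patience:       # validation stopped improving: stop early
            break

model.load_state_dict(best_state)        # roll back to the best validation checkpoint
```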
But, who knows. I may be wrong. It's OK to be wrong, even mostly wrong, as long
as you're wrong for the right reasons. Science gives us the tools to know when
we're wrong, nothing more. The scientist must make peace with that. Thinking one
can be always right is hubris.
Speaking of which, John Ioannidis is one of my personal heroes of science
(sounds like an action figure line, right? The Heroes of Science!!
dun-dun-duuunnn). I was a bit shocked that he came out so strongly sceptical
against the mainstream concerns about Covid-19, and I've heard him make some
predictions that soon proved to be false, like the number of people who would
get Covid-19 in the USA (I think he said something like 20,000 people?). He
really seemed to think that it was just another flu. Which btw kills lots of
people and we're just used to it, so perhaps that's what he had in mind. But, I
have the privilege of sharing my mother tongue with Ioannidis (he's Greek,
like me) and so I've been able to listen to him speak in Greek news channels, as
well as in English-speaking ones, and he remains a true scientist, prepared to
express his knowledgeable opinion, as is his responsibility, even if it may be
controversial, or just plain wrong. In the end, he's an infectious disease
expert and even his contrarian views lack that certain spark of madness you see
in the eye of most others who share his opinions. I mean, because he's speaking with
knowledge, rather than just expressing some random view he's fond of. He's still
a role model for me. Even if he was wrong in this case.
>> Unfortunately, it applies to theoretical findings too. For example, the
universal approximation theorem, the no-free-lunch theorem, and the
incompleteness theorems are widely misunderstood. There are also countless
lesser-known theoretical results that are similarly misunderstood.
I guess? Do you have some example you want to share? For my part, I try to avoid
talking of things I don't work with on a daily basis, on the internet. I know
what I know. I don't need to know -or have an opinion- on everything...