Meta Learning Backpropagation and Improving It (2020) (arxiv.org)
138 points by lnyan on Oct 25, 2022 | 29 comments



Neural networks are merely generalized methods used to translate data. These translations happen by applying weights to the input (multiplying, adding, and so on) and passing the output on to the next node(s).

RNNs additionally keep a hidden state of prior values, so a “memory” can be carried forward from step to step.

In the end, what this paper does is replace components of the computation graph with mini-RNNs, with pruning guided by another network that oversees the first.
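For intuition, here's a minimal sketch of the "replace each weight with a tiny shared RNN" idea. To be clear, this is not the paper's actual architecture; the layer sizes, names, and update rule below are all made up for illustration:

    import numpy as np

    # Instead of a scalar weight per connection, each connection carries a
    # tiny hidden state that is updated by ONE shared RNN cell.
    rng = np.random.default_rng(0)
    n_in, n_out, state_dim = 4, 3, 2   # tiny layer, tiny per-connection state

    # Parameters of the single shared cell (used by every connection).
    W_h = rng.normal(scale=0.1, size=(state_dim, state_dim))
    W_x = rng.normal(scale=0.1, size=(state_dim, 1))
    w_out = rng.normal(scale=0.1, size=(state_dim,))

    # One small hidden state per connection, replacing the scalar weight.
    h = np.zeros((n_in, n_out, state_dim))

    def layer_forward(x, h):
        # Each connection turns its state into an effective weight, then the
        # layer behaves like an ordinary linear layer.
        eff_w = h @ w_out                                  # (n_in, n_out)
        y = x @ eff_w                                      # (n_out,)
        # Update every connection's state with the shared cell, fed by its input.
        msg = np.repeat(x[:, None, None], n_out, axis=1)   # (n_in, n_out, 1)
        h_new = np.tanh(h @ W_h.T + msg @ W_x.T)
        return y, h_new

    x = rng.normal(size=(n_in,))
    y, h = layer_forward(x, h)
    print(y.shape)   # (3,)

The point is that the cell parameters (W_h, W_x, w_out) are shared across every connection, so the meta-learned part stays tiny no matter how large the layer is.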

Having done quite a bit of work in this space I have a couple of thoughts.

What was the major advance in games? It was networks playing themselves.

Here we may want to do the same thing. Meta-learners need lots of “experience” (data) just like anything living.

This paper doesn’t dive too deeply into the idea that the meta-learner can be made general, but it can be. There are only so many problem types (classification, regression, generation, etc.) and only so many data formats. Further, networks are fairly well defined, much more so than language.

This is the field of AutoML, if anyone's interested.


> What was the major advance in games? It was networks playing themselves.

I'm not sure. We have been using self-play since forever. That's how Alpha-Beta search works.
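Since alpha-beta came up: here's a toy sketch of minimax with alpha-beta pruning on Nim (take 1-3 stones, whoever takes the last stone wins; the game and values are my own toy choices). One program searches moves for both sides, which is the sense in which classic game-tree search already involves a kind of self-play:

    # Value is from the maximizing player's perspective: +1 win, -1 loss.
    def alphabeta(stones, alpha, beta, maximizing):
        if stones == 0:
            # The previous player took the last stone, so the side to move lost.
            return -1 if maximizing else 1
        if maximizing:
            best = -1
            for take in (1, 2, 3):
                if take <= stones:
                    best = max(best, alphabeta(stones - take, alpha, beta, False))
                    alpha = max(alpha, best)
                    if alpha >= beta:   # prune: the opponent won't allow this line
                        break
            return best
        else:
            best = 1
            for take in (1, 2, 3):
                if take <= stones:
                    best = min(best, alphabeta(stones - take, alpha, beta, True))
                    beta = min(beta, best)
                    if alpha >= beta:
                        break
            return best

    print(alphabeta(12, -1, 1, True))   # -1: multiples of 4 are lost for the mover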


Self-play probably refers to different agents playing each other, but that has also been done since forever. It just didn't seem that interesting when the models were small and the games were things like 4x4 tic-tac-toe.


Tesauro used self-play to train a master-level backgammon player nearly 30 years ago.

https://www.semanticscholar.org/paper/TD-Gammon%3A-A-Self-Te...


> Neural networks are merely generalized methods used to translate data.

That's true of all computation!

To see this, consider lambda-calculus, which is a calculus of functions (i.e. things that convert inputs into outputs). The computational equivalence of lambda-calculus with Turing machines was one of the early results of computability theory.
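As a toy illustration of that "functions converting inputs into outputs" view, here are Church numerals written in Python (my own example, nothing specific to the paper):

    # A number n is encoded as the function that applies f to x exactly n
    # times; arithmetic is then just composition of functions.
    ZERO = lambda f: lambda x: x
    SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
    ADD = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

    def to_int(church):
        # Translate a Church numeral back into an ordinary Python int.
        return church(lambda k: k + 1)(0)

    TWO = SUCC(SUCC(ZERO))
    THREE = SUCC(TWO)
    print(to_int(ADD(TWO)(THREE)))   # 5

Here the "data" is nothing but functions translating other functions.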


Yeah, that cringey start to the OP's point removed quite a bit of credibility, despite their stating that they 'work in the area'. Probing towards a meta-learner that is useful in many contexts is an intriguing idea, though. I also wonder about the dream for meta-learners: can they approach a limit different from 'Adam and crew' once you get away from domain-specific parameter landscapes? I imagine much of the work will tend towards context switching to deal with what are effectively multiple domain-specific optimizers, but I may be very far off from what is observed here.


What's the reasoning behind choosing LSTMs here? Won't an attention model be even more effective?


Did you see who the co-author of the paper is? :)


As you should know, Schmidhuber even invented Transformer models; see: https://arxiv.org/abs/2102.11174

;)


The claim of formal equivalence is restricted to "linear transformers". The tone of the comment seems to suggest that S.'s claim is slightly ridiculous. Can you point out a concrete technical shortcoming of the paper?
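For what it's worth, the core equivalence for the simplest (unnormalized, causal) case is easy to check numerically. This is my own minimal sketch, not the paper's full construction (no kernel feature map, no normalization):

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 4
    K = rng.normal(size=(T, d))   # keys
    V = rng.normal(size=(T, d))   # values
    Q = rng.normal(size=(T, d))   # queries

    # View 1: linear attention -- output_t = sum_{i<=t} (q_t . k_i) v_i
    attn_out = np.zeros((T, d))
    for t in range(T):
        scores = Q[t] @ K[: t + 1].T       # no softmax
        attn_out[t] = scores @ V[: t + 1]

    # View 2: fast weight programmer -- W_t = W_{t-1} + v_t k_t^T, y_t = W_t q_t
    W = np.zeros((d, d))
    fw_out = np.zeros((T, d))
    for t in range(T):
        W = W + np.outer(V[t], K[t])       # write: outer-product update
        fw_out[t] = W @ Q[t]               # read: query the fast weight matrix

    print(np.allclose(attn_out, fw_out))   # True

Both loops compute the same sums; the "fast weight" view just accumulates the outer products v_i k_i^T into a matrix first.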


Naaa, I just wanted to point out his attitude towards other research.


It is valuable to point out connections with existing work, if only to avoid reinventing the wheel and to properly stand on the shoulders of giants: אֵין כָּל חָדָשׁ תַּחַת הַשָּׁמֶשׁ‎ (there is nothing new under the sun).


Yep, but you can do this without acting like an ass.


I bet OP was his student


How much are you willing to bet?


My guess is that a tiny LSTM is less computationally expensive than a tiny transformer. Since the paper proposes replacing every weight with this tiny shared net, you want it to be as tiny and performant as possible.
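Back-of-envelope, with a width I picked arbitrarily (not numbers from the paper), the parameter counts already favor the LSTM at equal width, and the LSTM also avoids attending over a growing history at every step:

    def lstm_params(d_in, d_hidden):
        # 4 gates, each with input weights, recurrent weights, and a bias.
        return 4 * (d_hidden * d_in + d_hidden * d_hidden + d_hidden)

    def transformer_block_params(d_model, d_ff=None):
        d_ff = d_ff or 4 * d_model
        attn = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
        ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
        return attn + ffn                          # ignoring layer norms

    d = 16  # a "tiny" width, chosen only for illustration
    print(lstm_params(d, d))              # 2112
    print(transformer_block_params(d))    # 3216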


Transformers are more complicated than RNNs and require more fine-tuning. I’m guessing the RNNs were used to simplify the problem. I’m not even sure transformers would work here, given where they’re being dropped into the process.


If you want it tiny and performant and are choosing an LSTM over transformers, why not skip the LSTM and use GRUs?


Anyone else read this headline as "Facebook parent company does a thing"?



That one always cracks me up. Munroe truly is a genius. The 'imaginary friends' one is just as funny.


Can we start calling Zuckerberg's Meta something else? Zmeta?


Facebook works!


Inb4 any learning algorithm released in the next 20 years is claimed to be just a special case that can be learnt via this meta learning framework.

(For the record, I do believe the man has been done dirty a few times. Hate the meme, not the memer.)


1) This is by Schmidhuber, so take it with an extra grain of salt. 2) The experiments were on toy datasets; there is a reason they did not do this on a real dataset.


It being from Schmidhuber's lab makes it dramatically more credible, in my view. They've been practically a decade ahead of everybody for ages now, from a theoretical point of view.


Agreed. I'd go as far as to put Schmidhuber and his lab above the laureates.


> This is by Schmidhuber so take it with an extra grain of salt

What makes you say that?


Yes! Spill the tea!



