DeepNet: Scaling Transformers to 1k Layers (arxiv.org)
194 points by homarp on March 2, 2022 | 38 comments



An extra 5 BLEU points is massive, especially with 4x fewer params.

The fact that the layers themselves are narrower means that training and evaluation of the NN are also much faster.


Why would evaluation be faster with narrower layers? With fewer tokens it would definitely be faster, because attention scales with tokens^2, but here "narrow" means the number of channels, presumably for the same number of tokens.


If I remember correctly, the fully connected layers after the attention block multiply [?, a*h] by [a*h, b*h] (for some scalars a, b and hidden size h), which means transformers also scale with h^2. I don't know what fraction of the total FLOPs that part of the model takes at practical model sizes, but it would suggest that making the model narrower for the same number of params reduces compute.
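A quick back-of-the-envelope for that h^2 scaling, assuming the usual 4h expansion in the feed-forward block (the function and numbers here are just for illustration):

    def ffn_flops(tokens, h, expansion=4):
        """Rough multiply-accumulate count for one feed-forward block:
        [tokens, h] x [h, expansion*h] followed by [tokens, expansion*h] x [expansion*h, h]."""
        up = tokens * h * (expansion * h)
        down = tokens * (expansion * h) * h
        return up + down  # = 2 * expansion * tokens * h**2, i.e. quadratic in h

    # Halving the width cuts the per-layer FFN compute by ~4x:
    print(ffn_flops(tokens=1024, h=1024))  # ~8.6e9 MACs
    print(ffn_flops(tokens=1024, h=512))   # ~2.1e9 MACs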


Oh yeah, I think you're right. If it's all fully connected (and parts of a transformer are), then more, thinner layers use fewer FLOPs than fewer, thicker layers for the same number of parameters. As long as the layers are wide enough to keep the GPU busy, it'll run faster.


It is notable that they got only 2.5 extra BLEU when translating into English, and 6.2 extra when translating from English into another language.

Since the network will have seen far more English text than text in other languages, this suggests the improvement is largest where training data is limited.


That's exciting. Getting better performance with more data is trivial; the hard part is getting better performance with less data.


Actually, no. Each layer requires the output from the previous one, which means sequential computation, while wider layers can make better use of the GPU's parallelism. It's a trade-off between less memory (fewer parameters) and longer runtime.


A lot of previous transformer-based models did not have great depth simply because "shallow" transformers already took forever to train.

I am concerned that we are approaching an age where only massive companies and research groups with tons of GPU resources will be able to train next-gen models.


This study suggests that the number of parameters can be significantly reduced using deeper networks, which is a consistent finding in ML research. So these results could actually help to decrease the amount of computing resources required.


I reached the same conclusion but was actually happy about it. When I just started working on neural networks I had almost no competition. Then starting from ~2015 suddenly everybody did neural networks, but all backprop. Right now almost nobody can work on backprop anymore because to improve the status quo you need massive budgets. So I can finally work in peace again on other more obscure neural networks that are way more fun in my opinion. The work is more creative than just tweaking backprop a bit more.


> approaching

We were already there in 2020.


That's why open research groups like EleutherAI are critical. They've used elbow grease and enthusiasm to follow in the tracks of OpenAI and Microsoft, and some of their members have prolifically coded implementations of the latest and greatest theory in papers.

Huggingface is also worthy of great praise, making the models accessible and lowering the barrier to actual commercial use.

Lastly, Google deserves props for Colab. You can run models that are 80% as good as the ones that initially cost millions of USD to produce. Even the free tier is excellent, and I've pointed various people to it as a resource for developing practical hands-on experience with machine learning.

This is a wonderful AI summer we're having; appreciate it while you can!


There is still plenty of interesting and relevant ml research available to people who don't have access to CERN-sized compute clusters, just like there is still plenty of interesting and relevant research for physicists outside of CERN.


> only massive companies and research groups with tons of GPU resources will be able to train next-gen models

Oh, don't be. We are already at this stage.


I'm not too concerned. GPUs scale very well horizontally and aren't as affected by Moore's law as single-thread-bound algorithms. We just discovered a new class of algorithms, and hardware needs time to catch up, from big GPU clusters down to end-user hardware.


I'm not particularly concerned about this, tbh. What I mean is that ML is highly relevant with fewer resources as well, and the fact that only a handful of companies or research groups have the financial power to push the state of the art isn't really a problem.


Wow. This sounds a lot like the ResNet moment for Transformers.


Interestingly, the Transformer already used residual connections. The new part here is that they no longer use the standard residual connection: they normalize after every layer (including the residual connection) and also put a factor on the residual connection. So you don't have a direct identity pathway anymore.
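Concretely, my reading of the update is something like this (a rough sketch, not their exact code; alpha is that factor on the residual connection):

    import torch.nn as nn

    class DeepNormResidual(nn.Module):
        """Post-norm residual with a scaled skip: LayerNorm(alpha * x + sublayer(x)).
        (The paper also rescales some sublayer weights at init; omitted here.)"""
        def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)
            self.alpha = alpha

        def forward(self, x):
            return self.norm(self.alpha * x + self.sublayer(x))

    # iirc they set alpha from the depth, e.g. alpha = (2 * N) ** 0.25 for an
    # N-layer encoder-only model, so deeper nets weight the skip path more heavily.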

The argument is that otherwise the gradient magnitude for the lower layers becomes too big, which intuitively makes sense: due to the residual connections, all error signals from the upper layers end up at the lower layers.

I wonder why this is apparently not a problem for ResNet.


This also moves it closer to a "temporal" integration, since the factor on the residual branch can be seen as a timestep in an Euler discretization.


I hope so, but I'm not convinced. The experimental results don't persuade me that deep transformers are better than the existing extremely wide transformers with tons of parameters. Yes, they show a much better result on the multilingual benchmark (M2M-100), but that's not a super well studied problem, so I don't know that the baseline they're beating is very well optimized. And on OPUS-100 they only beat the baseline by building models with vastly more parameters, which shouldn't really surprise anybody.


What’s that?


Residual neural networks[1] employ skip-layer connections. It's thought that this helps prevent gradient decay in deep networks. It's a very important advancement.

[1] https://www.wikiwand.com/en/Residual_neural_network
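In code the idea is tiny; a minimal sketch of a generic residual block (not the original convolutional ResNet):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = x + F(x): the identity skip gives gradients a direct path around F,
        which is what lets very deep stacks of these blocks still train."""
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, x):
            return x + self.f(x)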


Resnet was when we figured out how to throw a fuckton of perceptrons together and make it actually work. We didn't think networks scaled before that.


IIRC that was AlexNet; ResNet came three years later, and its refinements ended up making it better than a human at object detection.


I think the argument is that before ResNets, the depth of the network was constrained and could not be scaled easily. With ResNets (and of course Highway Nets; hi Jürgen!) the depth is just another hyperparam.


In between those, VGG was parameterized by layers, though severely limited in scope (16-19 layers).


Anything that can reduce the number of parameters required for a transformer is good news. Wonder if this approach can be easily plugged into the recent retrieval transformer.


Why is the number of parameters relevant?

Training and decoding runtime and memory consumption are what matter, and the number of parameters is not really connected to those.

E.g. I assume this DeepNet with 3.2B params is slower and requires more memory than the M2M-100 with 12B params.

The number of layers is more closely connected to both runtime and memory consumption.


Firstly, the number of parameters is important in itself, due to the finite VRAM on commonly available GPU cards. But secondly, the bulk of the ops you do are scalar products (arranged as matrix muls), where one parameter means one multiply-accumulate.

So regardless of the actual net topology, number of layers, etc., you should see a rough scaling with the number of parameters, in the ideal case with no other overhead (you can write code that scales badly or weirdly). I didn't look at this DeepNet; maybe it will run slower like you say due to some other architectural choice.

Also, yes, an architecture that requires 1000x more training iterations will of course be worse at the training stage even if it has 100x fewer parameters, but maybe the resulting inference model is much faster. Lots of stuff to consider :)
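For the matmul-dominated parts that rough scaling is easy to see; a toy count for a stack of square fully connected layers (ignoring attention's tokens^2 term, biases, and other overhead):

    def mlp_stack(layers, width, tokens=1):
        """Parameters and multiply-accumulates for `layers` square [width x width] matmuls."""
        params = layers * width * width
        macs = tokens * params  # one MAC per parameter per token
        return params, macs

    # Same parameter budget, different depth/width split -> same dense MACs per token:
    print(mlp_stack(layers=12, width=2048))  # (50331648, 50331648)
    print(mlp_stack(layers=48, width=1024))  # (50331648, 50331648)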


In a Transformer, the most expensive op is the self-attention, so with more layers this becomes more expensive.

You could also share, e.g., all the parameters across layers (like the Universal Transformer) and thus reduce the number of params drastically, but without any change in compute time and with only a negligible difference in memory consumption, because the hidden activations take most of the memory, not the params.
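A sketch of that idea (not the actual Universal Transformer code, just reusing one layer's weights across depth):

    import torch
    import torch.nn as nn

    class SharedDepthEncoder(nn.Module):
        """Apply the same encoder layer repeatedly: the parameter count stays constant,
        while compute and activation memory still grow with the number of steps."""
        def __init__(self, d_model=512, nhead=8, steps=12):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.steps = steps

        def forward(self, x):
            for _ in range(self.steps):
                x = self.layer(x)  # same weights at every step
            return x

    x = torch.randn(2, 16, 512)           # (batch, tokens, d_model)
    print(SharedDepthEncoder()(x).shape)  # torch.Size([2, 16, 512])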


Ah ok, I see what you mean; I didn't consider shared parameters (out of habit, since in my own AI work I have no shared parameters). But isn't the self-attention also trainable and thus reflected in the number of parameters (with or without sharing)?

I naively see the transformer architecture as a generalization of other common topologies like convnets, i.e. you can arrange the token stream and the attention heads (if you have enough of them) to emulate a 2D convnet if you want, even though the first uses were for 1D language token streams, and so the transformer arch with proper training can converge to a convnet arch if that is the best solution. Or is this a completely backward generalization? :)


Hadn't come across parameter sharing in transformers; that's something I'll definitely look into. Why is it not widespread, though? Most famous and powerful models have a huge number of parameters.


Yeah, definitely, the ideas are orthogonal.


I love the tl;dr for practitioners part. All ML papers should have one.


The abstract? Or what do you mean?


In the actual paper, right after the introduction.

They have a section with pseudocode and a TL;DR explanation.


This seems really impressive. I would love to see a rough comparison of how much this improves training time compared to past methods.


Does this apply to non-transformer based architectures as well?





