DeepNet: Scaling Transformers to 1k Layers (arxiv.org)
194 points by homarp on March 2, 2022 | 38 comments



An extra 5 BLEU points is massive, especially with 4x fewer params.

The fact that the layers themselves are narrower means that training and evaluation of the NN are also much faster.


Why would evaluation be faster with narrower layers? With fewer tokens it would definitely be faster, because attention scales with tokens^2, but here "narrow" means the number of channels, presumably for the same number of tokens.


If I remember correctly, the fully connected layers after the attention block multiply [?, a*h] by [a*h, b*h] (for some scalars a, b and hidden size h), which means transformers also scale with h^2. I don't know what fraction of the total FLOPs that part of the model takes at practical model sizes, but it would suggest that making the model narrower for the same number of params reduces compute.
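A quick back-of-the-envelope for that h^2 scaling, assuming the usual 4h expansion in the feed-forward block (the function and numbers here are just for illustration):

    def ffn_flops(tokens, h, expansion=4):
        """Rough multiply-accumulate count for one feed-forward block:
        [tokens, h] x [h, expansion*h] followed by [tokens, expansion*h] x [expansion*h, h]."""
        up = tokens * h * (expansion * h)
        down = tokens * (expansion * h) * h
        return up + down  # = 2 * expansion * tokens * h**2, i.e. quadratic in h

    # Halving the width cuts the per-layer FFN compute by ~4x:
    print(ffn_flops(tokens=1024, h=1024))  # ~8.6e9 MACs
    print(ffn_flops(tokens=1024, h=512))   # ~2.1e9 MACs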


Oh yeah, I think you're right. If it's all fully connected (and parts of a transformer are), then more, thinner layers use fewer FLOPs than fewer, thicker layers for the same number of parameters. As long as the layers are wide enough to keep the GPU busy, it'll run faster.


It is notable that they got only 2.5 extra BLEU when translating into English, and 6.2 extra when translating from English into another language.

Since the network will have seen far more English text than text in other languages, this suggests the improvement is largest where training data is limited.


That's exciting. Getting better performance with more data is trivial; the hard part is getting better performance with less data.


Actually, no. Each layer requires the output from the previous one, which means sequential computation, while wider layers can make better use of the GPU's parallelism. It's a trade-off between less memory (fewer parameters) and longer runtime.


A lot of previous transformer-based models did not have great depth simply because "shallow" transformers already took forever to train.

I am concerned that we are approaching an age where only massive companies and research groups with tons of GPU resources will be able to train next-gen models.


This study suggests that the number of parameters can be significantly reduced using deeper networks, which is a consistent finding in ML research. So these results could actually help to decrease the amount of computing resources required.


I reached the same conclusion but was actually happy about it. When I just started working on neural networks I had almost no competition. Then starting from ~2015 suddenly everybody did neural networks, but all backprop. Right now almost nobody can work on backprop anymore because to improve the status quo you need massive budgets. So I can finally work in peace again on other more obscure neural networks that are way more fun in my opinion. The work is more creative than just tweaking backprop a bit more.


> approaching

We were already there in 2020.


That's why open research groups like EleutherAI are critical. They've used elbow grease and enthusiasm to follow in the tracks of OpenAI and Microsoft, and some of their members have prolifically coded implementations of the latest and greatest theory in papers.

Huggingface is also worthy of great praise, making the models accessible and lowering the barrier to actual commercial use.

Lastly, Google deserves props for Colab. You can run models that are 80% as good as the ones that initially cost millions of USD to produce. Even the free tier is excellent, and I've pointed various people to it as a resource for developing practical hands-on experience with machine learning.

This is a wonderful AI summer we're having; appreciate it while you can!


There is still plenty of interesting and relevant ml research available to people who don't have access to CERN-sized compute clusters, just like there is still plenty of interesting and relevant research for physicists outside of CERN.


> only massive companies and research groups with tons of GPU resources will be able to train next-gen models

Oh, don't be. We are already at this stage.


I'm not too concerned. GPUs scale very well horizontally and aren't as affected by Moore's law as single-thread-bound algorithms. We just discovered a new class of algorithms, and hardware needs time to catch up, from big GPU clusters down to end-user hardware.


I'm not particularly concerned about this, tbh. What I mean is that ML is highly relevant with fewer resources as well, and the fact that only a handful of companies or research groups have the financial power to push the state of the art isn't really a problem.


Wow. This sounds a lot like the ResNet moment for Transformers.


Interestingly, the Transformer already used residual connections. The new part here is that they no longer use the standard residual connection: they normalize after every layer (including the residual connection) and also put a factor on the residual connection. So you don't have a direct identity pathway anymore.
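Concretely, my reading of the update is something like this (a rough sketch, not their exact code; alpha is that factor on the residual connection):

    import torch.nn as nn

    class DeepNormResidual(nn.Module):
        """Post-norm residual with a scaled skip: LayerNorm(alpha * x + sublayer(x)).
        (The paper also rescales some sublayer weights at init; omitted here.)"""
        def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
            super().__init__()
            self.sublayer = sublayer
            self.norm = nn.LayerNorm(d_model)
            self.alpha = alpha

        def forward(self, x):
            return self.norm(self.alpha * x + self.sublayer(x))

    # iirc they set alpha from the depth, e.g. alpha = (2 * N) ** 0.25 for an
    # N-layer encoder-only model, so deeper nets weight the skip path more heavily.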

The argument is that otherwise the gradient magnitude for the lower layers becomes too big, which intuitively makes sense: due to the residual connections, all error signals from the upper layers end up at the lower layers.

I wonder why this is apparently not a problem for ResNet.


This also moves it closer to a "temporal" integration, since the factor on the residual branch can be seen as a timestep in an Euler discretization.


I hope so, but I'm not convinced. The experimental results don't persuade me that deep transformers are better than the existing extremely wide transformers with tons of parameters. Yes, they show a much better result on the multilingual benchmark (M2M-100), but that's not a super well studied problem, so I don't know that the baseline they're beating is very well optimized. And on OPUS-100 they only beat the baseline by building models with vastly more parameters, which shouldn't really surprise anybody.


What’s that?


Residual neural networks[1] employ skip-layer connections. It's thought that this helps prevent gradient decay in deep networks. It's a very important advancement.

[1] https://www.wikiwand.com/en/Residual_neural_network
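In code the idea is tiny; a minimal sketch of a generic residual block (not the original convolutional ResNet):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """y = x + F(x): the identity skip gives gradients a direct path around F,
        which is what lets very deep stacks of these blocks still train."""
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

        def forward(self, x):
            return x + self.f(x)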


Resnet was when we figured out how to throw a fuckton of perceptrons together and make it actually work. We didn't think networks scaled before that.


IIRC that was AlexNet; ResNet came three years later, and its refinements ended up making it better than a human at object detection.


I think the argument is that before ResNets, the depth of the network was constrained and could not be scaled easily. With ResNets (and of course Highway Nets; hi Jürgen!) the depth is just another hyperparam.


In between those, VGG was parameterized by layers, though severely limited in scope (16-19 layers).


Anything that can reduce the number of parameters required for a transformer is good news. Wonder if this approach can be easily plugged into the recent retrieval transformer.


Why is the number of parameters relevant?

Training and decoding runtime and memory consumption are what matter, and the number of parameters is not really connected to those.

E.g. I assume this DeepNet with 3.2B params is slower and requires more memory than the M2M-100 with 12B params.

The number of layers is more closely connected to both runtime and memory consumption.


Firstly, the number of parameters is important in itself, due to the finite VRAM on commonly available GPU cards. But secondly, the bulk of the ops you do are scalar products (arranged as matrix muls), where one parameter means one multiply-accumulate.

So regardless of the actual net topology, number of layers, etc., you should see a rough scaling with the number of parameters, in the ideal case with no other overhead (you can write code that scales badly or weirdly). I didn't look at this DeepNet; maybe it will run slower like you say due to some other architectural choice.

Also, yes, an architecture that requires 1000x more training iterations will of course be worse at the training stage even if it has 100x fewer parameters, but maybe the resulting inference model is much faster. Lots of stuff to consider :)
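For the matmul-dominated parts that rough scaling is easy to see; a toy count for a stack of square fully connected layers (ignoring attention's tokens^2 term, biases, and other overhead):

    def mlp_stack(layers, width, tokens=1):
        """Parameters and multiply-accumulates for `layers` square [width x width] matmuls."""
        params = layers * width * width
        macs = tokens * params  # one MAC per parameter per token
        return params, macs

    # Same parameter budget, different depth/width split -> same dense MACs per token:
    print(mlp_stack(layers=12, width=2048))  # (50331648, 50331648)
    print(mlp_stack(layers=48, width=1024))  # (50331648, 50331648)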


In a Transformer, the most expensive op is the self-attention, so with more layers this becomes more expensive.

You could also share, e.g., all the parameters across layers (like the Universal Transformer) and thus reduce the number of params drastically, but without any change in compute time and with only a negligible difference in memory consumption, because the hidden activations take most of the memory, not the params.
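A sketch of that idea (not the actual Universal Transformer code, just reusing one layer's weights across depth):

    import torch
    import torch.nn as nn

    class SharedDepthEncoder(nn.Module):
        """Apply the same encoder layer repeatedly: the parameter count stays constant,
        while compute and activation memory still grow with the number of steps."""
        def __init__(self, d_model=512, nhead=8, steps=12):
            super().__init__()
            self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.steps = steps

        def forward(self, x):
            for _ in range(self.steps):
                x = self.layer(x)  # same weights at every step
            return x

    x = torch.randn(2, 16, 512)           # (batch, tokens, d_model)
    print(SharedDepthEncoder()(x).shape)  # torch.Size([2, 16, 512])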


Ah ok, I see what you mean; I didn't consider shared parameters (out of habit, since in my own AI work I have no shared parameters). But isn't the self-attention also trainable and thus reflected in the number of parameters (with or without sharing)?

I naively see the transformer architecture as a generalization of other common topologies like convnets, i.e. you can arrange the token stream and the attention heads (if you have enough of them) to emulate a 2D convnet if you want, even though the first uses were for 1D language token streams, and so the transformer arch with proper training can converge to a convnet arch if that is the best solution. Or is this a completely backward generalization? :)


Hadn't come across parameter sharing in transformers; that's something I'll definitely look into. Why is it not widespread, though? Most famous and powerful models have a huge number of parameters.


Yeah, definitely, the ideas are orthogonal.


I love the tl;dr for practitioners part. All ML papers should have one.


The abstract? Or what do you mean?


In the actual paper, right after the introduction.

They have a section with pseudocode and a TL;DR explanation.


This seems really impressive. I would love to see a rough comparison of how much this improves training time compared to past methods.


Does this apply to non-transformer based architectures as well?





