10B Parameter Neural Networks in Your Basement [pdf] (gputechconf.com)
114 points by signa11 on July 1, 2015 | 15 comments




This was a fantastic presentation. It wasn't just a brain dump like too many other presentations. The speaker clearly took the time to craft his talk.


I gotta say this is pretty impressive, particularly the part where he reproduced the results of another paper. Being a PhD dropout in computer vision, I can appreciate that having reproducible results in this field is, ahem, novel.


I wish they would post source code. Actually, does anyone know if they publish the source code, and perhaps setup instructions, to re-create their results?


I think there are a lot of possible advances in optimizing the training of neural networks.

(My suspicion is that the initial layers can be "frozen" or trained separately, since low-level features are pretty much the same for most images. Also, maybe someone will figure out a way of merging several NNs into one, so you can parallelize the whole training.)
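
To make the "freezing" idea concrete, here's a toy numpy sketch (purely illustrative, nothing to do with the talk's actual code): the lower layer W1 simply stops receiving gradient updates after some number of steps, while the upper layer keeps training.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 2-layer net: x -> h = relu(x W1) -> y = h W2, squared-error loss.
    W1 = rng.normal(scale=0.1, size=(64, 32))   # "low-level" layer
    W2 = rng.normal(scale=0.1, size=(32, 10))   # "high-level" layer

    X = rng.normal(size=(256, 64))              # fake data, fake targets
    Y = rng.normal(size=(256, 10))
    lr, freeze_after = 0.01, 50

    for step in range(200):
        h = np.maximum(0, X @ W1)
        err = h @ W2 - Y                        # dL/dpred for squared error
        grad_W2 = h.T @ err / len(X)
        grad_W1 = X.T @ ((err @ W2.T) * (h > 0)) / len(X)
        W2 -= lr * grad_W2
        if step < freeze_after:                 # after this point W1 is "frozen"
            W1 -= lr * grad_W1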

This presentation is only one small example.


[citation needed]

Sure, there's a huge amount of research happening in the field right now. But you make it sound like there's a ton of low hanging fruit, which is emphatically not true.

Also, freezing the representations in the bottom layers usually doesn't lead to very good results. The representation the bottom layers learn in the standard formulation of deep feedforward networks is informed by gradient information from the output layers [1]. If you stop training the bottom layers after some time, you're sacrificing representational power and in fact increasing the ultimate amount of computation that will need to be performed to train the network to a given level.

In fact, in a recent paper [2] some folks at Google describe how they achieved great performance training a huge CNN by inserting temporary output layers in the middle of the network while training (among other things). This increased the amount of gradient information that was propagated back to the bottom layers, forcing them to learn more powerful representations.

[1] https://en.wikipedia.org/wiki/Backpropagation
[2] http://arxiv.org/abs/1409.4842
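
The mechanics are roughly like this (toy numpy with squared error instead of softmax, nothing like the real Inception architecture; the 0.3 is the discount weight on the auxiliary losses that I recall from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(128, 64))                  # fake data and targets
    Y = rng.normal(size=(128, 10))

    # Three stacked layers; an extra "auxiliary" head reads the middle representation.
    W1 = rng.normal(scale=0.1, size=(64, 32))
    W2 = rng.normal(scale=0.1, size=(32, 32))
    W3 = rng.normal(scale=0.1, size=(32, 10))       # main output head
    W_aux = rng.normal(scale=0.1, size=(32, 10))    # temporary auxiliary head
    lr, aux_weight = 0.01, 0.3                      # aux losses get a discount weight

    for step in range(200):
        h1 = np.maximum(0, X @ W1)
        h2 = np.maximum(0, h1 @ W2)
        main_err = h2 @ W3 - Y                      # squared-error gradient, main head
        aux_err = h1 @ W_aux - Y                    # squared-error gradient, aux head

        # The lower layer gets gradient from BOTH heads, i.e. extra signal
        # it wouldn't see from the distant main head alone.
        d_h2 = (main_err @ W3.T) * (h2 > 0)
        d_h1 = (d_h2 @ W2.T + aux_weight * (aux_err @ W_aux.T)) * (h1 > 0)

        W3 -= lr * h2.T @ main_err / len(X)
        W_aux -= lr * aux_weight * h1.T @ aux_err / len(X)
        W2 -= lr * h1.T @ d_h2 / len(X)
        W1 -= lr * X.T @ d_h1 / len(X)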


Regarding backpropagation and training sections of the NN at different times: there are other training algorithms. Evolutionary training algorithms come to mind, and you could evolve any section you wanted. You could even train the output of each layer, one by one, to represent a certain form of input for the next layer.
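
Something like this, in toy numpy form (a bare-bones (1+1) evolution strategy, no gradients anywhere, purely to show the shape of the idea):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(128, 16))              # fake data and targets
    Y = rng.normal(size=(128, 4))

    def loss(W):
        # Tiny two-layer net: relu(x W0) W1, squared error.
        return np.mean((np.maximum(0, X @ W[0]) @ W[1] - Y) ** 2)

    # (1+1) evolution strategy: perturb all weights, keep the child only if it improves.
    W = [rng.normal(scale=0.1, size=(16, 8)), rng.normal(scale=0.1, size=(8, 4))]
    best = loss(W)
    sigma = 0.02                                # mutation step size
    for step in range(5000):
        child = [w + rng.normal(scale=sigma, size=w.shape) for w in W]
        c = loss(child)
        if c < best:
            W, best = child, c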


Virtually everyone uses gradient-based methods in the end to fine-tune the weights.

Yes, there are other methods. Contrastive divergence seems to be king right now - of note is minimum probability flow learning [1] (of which CD is a special case). However, the flavor of these methods tends to be tuning the weights of the model so as to maximize how closely the model comes to sharing the probability distribution of the data. One generally cannot constrain the model parameters (e.g. by freezing a layer) and retain the model's ability to 'learn' the data distribution.

[1] http://arxiv.org/abs/0906.4779
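
For anyone who hasn't seen it, a CD-1 update in toy numpy form (binary RBM, biases left out for brevity) - the point being that the update nudges the model's statistics toward the data's statistics, which is the "share the data's probability distribution" flavor described above:

    import numpy as np

    rng = np.random.default_rng(3)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy binary RBM trained with CD-1.
    data = (rng.random((500, 20)) < 0.3).astype(float)   # fake binary data
    W = rng.normal(scale=0.01, size=(20, 10))            # visible x hidden weights
    lr = 0.05

    for epoch in range(100):
        # Positive phase: hidden unit statistics driven by the data.
        h_prob = sigmoid(data @ W)
        pos = data.T @ h_prob

        # Negative phase: one Gibbs step to get (approximate) model statistics.
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T)
        h_recon = sigmoid(v_recon @ W)
        neg = v_recon.T @ h_recon

        # Push the model's distribution toward the data's distribution.
        W += lr * (pos - neg) / len(data)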


Those training algorithms should accomplish the same goal, but backpropagation (or some variant) turns out to be the best choice for the kind of network shown here (large, non-recurrent), because it directly uses the learning gradient, while evolutionary methods don't usually make use of this information. I'm not a researcher in this area, but I'm quite sure you'd have a bad time going with pure evolutionary methods for this kind of network.


While it's true there's a lot of research, using ReLU instead of tanh and sigmoid was pretty much low-hanging fruit that took a long time to be adopted.
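
The gain is mostly that the gradient doesn't saturate, which a couple of lines of numpy show:

    import numpy as np

    x = np.linspace(-6, 6, 7)
    s = 1 / (1 + np.exp(-x))
    sigmoid_grad = s * (1 - s)           # ~0.0025 at |x| = 6: gradients all but vanish
    relu_grad = (x > 0).astype(float)    # exactly 0 or 1: no saturation on the positive side

    print(np.round(sigmoid_grad, 4))
    print(relu_grad)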


Hasn't the "separate training" optimisation already been done? As I understand it, pre-training is very common.
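
(By pre-training I mean the greedy layer-wise thing, where each layer is first trained unsupervised on the output of the layer below and the whole stack is fine-tuned afterwards. Very roughly, in toy numpy, with linear tied-weight autoencoders standing in for the RBMs or denoising autoencoders people actually use:)

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(256, 32))              # fake input data
    lr = 0.01

    def train_layer(data, hidden, steps=300):
        # One greedy stage: learn W so that data @ W @ W.T reconstructs data
        # (a linear autoencoder with tied weights, for brevity).
        W = rng.normal(scale=0.1, size=(data.shape[1], hidden))
        for _ in range(steps):
            h = data @ W                        # encode
            err = h @ W.T - data                # reconstruction error
            grad = data.T @ (err @ W) + (h.T @ err).T
            W -= lr * grad / len(data)
        return W

    # Each layer is trained on the previous layer's output; afterwards the whole
    # stack would normally be fine-tuned end-to-end with backprop on the labels.
    W1 = train_layer(X, 16)
    W2 = train_layer(X @ W1, 8)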


I wonder: if GPGPU becomes more common, would that mean it's time to switch to chips that have smaller cores, but more of them?

I mean, current GPU architectures are very good at vectorization, but is it possible to have a chip whose cores are bigger than a GPU's, but smaller than the ones on a CPU?

It really seems that we have big cores because managing an OS requires it, but I wonder how realistic it is to make a computer that uses more massive parallelism. There are many algorithms out there that can serve as alternatives to sequential ones.


Great talk. It really brings the technical details into focus.


Does this imply that to tune 11b params your basement needs to have 16 machines with 4 GPUs each (64 GPUs in all)?


Slides 28 and 35 suggest it was 3 servers, each with 4 GPUs, i.e. 12 GPUs total. If that's the case, then you can probably build a Google-scale network (the 1-billion-parameter, 9-layer neural network that needed 1,000 machines and 16,000 cores running for a week in 2012) at home for around £4K (US$6.2K).



