Large Scale Deep Learning – Jeff Dean [pdf]

brandonb · on Dec 8, 2014

For those of you who want to learn the nuts and bolts of deep neural networks, Andrew Ng's tutorial on Unsupervised Feature Learning and Deep Learning is getting older but still great: http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

The best research results from 2014 and 2013 make less use of the unsupervised techniques than initially expected, so I would start by focusing on the below sections, which focus more on supervised learning with deep neural networks:

Sparse Autoencoder: Neural Networks, Backpropagation Algorithm

Building Deep Networks for Classification: Deep Networks: Overview, Fine-tuning Stacked AEs

Working with Large Images: Feature extraction using convolution

You'll need some background in matrix algebra, calculus, and probability to understand this. Having taken a previous machine learning course, although not strictly necessary, is probably extremely helpful--I'd recommend taking any standard course on ML on Coursera or Udacity, or going through any standard textbook.

EDIT: I almost forgot that Michael Nielsen (who wrote the standard textbook on quantum computation) is also writing a free online textbook on Neural Networks and Deep Learning. Chapters 1-4 are currently available and would get you pretty far: http://neuralnetworksanddeeplearning.com/

kastnerkyle · on Dec 8, 2014

Bookwise, Yoshua Bengio, Aaron Courville, and Ian Goodfellow are nearly finished with their MIT Press book on deep learning: http://www.iro.umontreal.ca/~bengioy/dlbook/ . It is pretty strong on the true theory of what is going on in deep networks, and has fairly good intuition for how and why things work. Paired with the deep learning tutorials http://www.deeplearning.net/tutorial/, as well as the content from UFLDL it is a pretty strong foundation for advanced study.

Michael's book seems to target a more introductory level - a beginner might be better off to start with that, follow with Andrew Ng's ML course, which has a section on neural nets including an assignment implementing backpropagation, then continue with the deep learning book and the {deep learning, UFLDL} tutorials. This should be solid enough to at least read most of the cutting edge work and papers, if that is the aim.

Hugo Larochelle's youtube course https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7Tqgh... and Hinton's coursera course https://www.coursera.org/course/neuralnets are also great references.

brandonb · on Dec 8, 2014

Didn't realize Yoshua & co had a book coming! That would definitely be the one to read.

BTW, for anybody who wants to learn machine learning in general, Kyle's blog also seems to be packed full of clear explanations with working demo code: http://kastnerkyle.github.io/

Very nice!

kastnerkyle · on Dec 8, 2014

Thanks for checking it out! I am planning to add a few deep learning related posts during the holidays. The recent results for NLP, captioning and speech using encoder/decoder models are just too cool not to demo.

ehurrell · on Dec 8, 2014

Thanks for the resources, my current hobby project is attempting to recreate deep learning in Clojure, have been burning through material, cheers!

nl · on Dec 8, 2014

This whole slide deck is worth reading. A couple of highlights:

Pg 26, quote: "Anything humans can do in 0.1 sec, the right big 10-layer network can do too". That is a very bold claim. It encompasses the entire fields of image and voice recognition as well as knowledge encoding. It's slowly becoming clear that this is likely to be true.

Pg 39, 40: Google's ImageNet-winning system in 2011 had 7 layers and an error rate of around 16%. The 2014 system had 24 layers and an error rate of 6.66%. Note that trained humans have an error rate of around 5%[1].

Page 50-57 talk about the miracle that is Word2Vec, and what is possible with that.

Page 60-70 talks about paragraph embedding. I haven't seen this published before.

Page 70-73 extends word/paragraph embedding for translation. I've seen a slide deck showing this works before, but I need to read the new paper cited there.

Page 74+ talks about cross-modal embeddings, especially the caption generation stuff. HN has had a few things on that over the past month or so.

[1] http://karpathy.github.io/2014/09/02/what-i-learned-from-com...

xfs · on Dec 8, 2014

"Anything humans can do in 0.1 sec, the right big 10-layer network can do too"

Actually this was argued by Connectionists in 1980s. It is called Feldman's 100-step rule:

The critical resource that is most obvious is time. Neurons whose basic computational speed is a few milliseconds must be made to account for complex behaviors which are carried out in a few hundred milliseconds (Posner, 1978). This means that entire complex behaviors are carried out in less than a hundred time steps. Current AI and simulation programs require millions of time steps.

Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, p. 206.

waterlesscloud · on Dec 8, 2014

"Anything humans can do in 0.1 sec, the right big 10-layer network can do too"

That's an interesting direction from which to view things.

Then AI progress can be measured by increasing that timeframe.

Though I suspect there are some pretty gigantic discontinuities in there. The things humans can do in 3-4 seconds are qualitatively different from what they can do in less than 1 second, for example.

Still, it's a useful perspective to keep in mind.

crazypyro · on Dec 8, 2014

There's also the issue that we don't even fully understand what the brain even does, so how can we claim that anything in this incomplete set of operations is possible with a computer...

sanxiyn · on Dec 8, 2014

> Page 60-70 talks about paragraph embedding. I haven't seen this published before.

Paragraph embedding was published this year as "Distributed Representations of Sentences and Documents". http://arxiv.org/abs/1405.4053

nl · on Dec 8, 2014

Thanks.

It's interesting how the state of the art is outpacing publishing.

From a quick scan that appears quite similar to the approach in papers like "Parsing Natural Scenes and Natural Language with Recursive Neural Networks" (2011)[1]. Edit: I see they cite this paper too.

[1] http://nlp.stanford.edu/pubs/SocherLinNgManning_ICML2011.pdf

agibsonccc · on Dec 8, 2014

There's also better embeddings than word2vec now:

http://nlp.stanford.edu/projects/glove/

nl · on Dec 8, 2014

The characterisation of Glove as better than Word2Vec is controversial. I'm on mobile now, but one of the word2vec authors had a Google doc going through the claims, and pointing out that similar performance was possible from word2vec by changing the parameters word2vec is used with.

Edit: a link about this. https://news.ycombinator.com/item?id=8660624

agibsonccc · on Dec 8, 2014

Speaking from personal experience. I get paid to do deep learning. One of skymind's biggest app areas is text.

That being said: I will be benchmarking deeplearning4j's glove with word2vec here soon. Any machine learning algorithm is better when you tune it.

I personally like glove due to having less knobs. The mechanics involving document statistics being part of the gradient update is also interesting.

I've also messed quite a bit with the distributed representations.

I'm not partial to any particular implementation. I'll use what works. That being said, I'm not armchair. I'll be backing this up with my own data as well.

agibsonccc · on Dec 8, 2014

I'm aware the tone was a little condescending and I won't take back what I said. I will actually back it up though ;).

My main point is just because something is controversial shouldn't stop you from trying it. That's what research is: trying new things.

_ezkx · on Dec 8, 2014

> Pg 26, quote: "Anything humans can do in 0.1 sec, the right big 10-layer network can do too". That is a very bold claim. It encompasses the entire fields of image and voice recognition as well as knowledge encoding.

Actually, I'm not sure we can do those things in quite 0.1 sec. All I know about this is from around minute 9 from http://www.radiolab.org/story/267176-never-quite-now/ . One guest on the show even estimates thinking the simplest thought to be on the order of 0.25-0.5 sec.

rasz_pl · on Dec 8, 2014

maybe concious thought, most of the stuff in the brain is on autopilot and happens behind our backs, conciousness gets the results and just builds narrative (often rationalizing the results to make us feel better)

Strilanc · on Dec 8, 2014

Whelp, Page 54 blew my mind right out the window.

> E(hotter) - E(hot) + E(big) ≈ E(bigger)

> E(Rome) - E(Italy) + E(Germany) ≈ E(Berlin)

These things are linearly separable?!

abrichr · on Dec 8, 2014

Deep neural networks essentially transform input data into a vector space where the data is "easier" to model. So while the input vectors may not be linearly separable in the input space, the network learns how to transform the input vectors into a space where they are.

nl · on Dec 8, 2014

Yes. Hence 'miraculous' Word2vec.

fuelfive · on Dec 9, 2014

These are great highlights, although your second point from [1] is a bit misleading. Humans have a 5% error rate because the average person can't differentiate between 150 different breeds of dogs and other similar nonsense.

agibsonccc · on Dec 8, 2014

I know distributed neural nets are 10000 miles out there for most, but I just want to add a few nuggets for those considering it.

I work with distributed deep nets quite a bit. It's a different animal than training on a GPU.

I am working on benchmarks with my framework deeplearning4j now.

That aside, a few neat references/projects that will be digestible for people.

For those of you already in neural net land, there's a few key takeaways when doing distributed neural nets:

parameter averaging across mini batches

(depending on the algorithm) adagrad

momentum

[1] Project Adam: http://www.wired.com/2014/07/microsoft-adam/ [2]: Associated Paper: https://www.usenix.org/system/files/conference/osdi14/osdi14...

[3]: Hogwild algorithm: http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf

[4]: A variation of this I use called Iterative Reduce done by my partner Josh Patterson: https://github.com/jpatanooga/KnittingBoar/wiki/Iterative-re...

[5]: Sandblaster LBFGS by Dean and Co. http://research.google.com/archive/large_deep_networks_nips2...

andrewcamel · on Dec 8, 2014

Reading through documents like this really pains me because I it seems like such interesting work and I immediately want to understand it better, but then realize the time required to acquire the knowledge and experience necessary to understand and apply this technology is so great, that it almost seems like a waste of time. After all, think of all the things one could build in the 2 full-time years it would take to fully comprehend all of this to the point where it's useful in any practical application.

sytelus · on Dec 8, 2014

Here's another way to see: What have you been doing for past two years? Now imagine you had started learning this about 2 years ago. You would have been done by now and ready to tackle some of the most interesting problems instead of continue to do same boring stuff you had been doing for past 2 years. In 2016, come back here and look at this comment again :).

PS: For people who are saying you can "apply" DNNs in a day or learn it by a coursera course in 6 weeks - they are only very superficially right. Yeah, anyone can build ML model for a sample training data using tool in the same sense that anyone can compile sample code and have a working app. The problem is that most models don't work the first time as expected. The challenge lies in debugging the model and fix many of N possibilities to make it work. This is what working in ML is all about. It's like usual programming where it takes years of experience to debug the code and make it work for your purpose. The added twist in ML is that debugging is almost entirely statistical. When your model doesn't work, it doesn't work only in statistical sense. Your problem would be essentially that the model doesn't give expected answer this 12% of the time. For this 12% of the time, it doesn't work not because of some wrong "if" condition or misplaced subroutine call. The debugging is almost always statistical debugging - there are no breakpoints to put or no watch to set or not even exceptions. So it takes pretty solid background in statistics and probability to effectively work in ML. And yes, most likely it would take much more than 2 years.

kastnerkyle · on Dec 8, 2014

Truly understanding this to the point where you are "caught up" with the field may take 2 years, but one of the big blessings of deep learning is abstraction. You can go from very high level "black box" approaches simply using and following example code from Torch, Theano, or Caffe, all the way to nitty gritty study of the details of various architectures, how to optimize them, and how to apply them.

Watching videos of presentations and reading slides is often much easier than comprehending papers, though ultimately the paper should have much richer detail.

Personal anecdote: Two years ago I just started learning about these things, coming from an undergraduate degree in electrical engineering. Now I am in graduate school for deep learning and AI working to push things forward, one small step at a time. It is totally possible to learn this stuff in a reasonable amount of study, and there are more free resources than ever. Note that I had a full time engineering job until 6 months ago... doing something totally different!

frozenport · on Dec 8, 2014

You should be complaining about the computing resources to train 24 hidden layers.

tonydiv · on Dec 8, 2014

GPUs make a lot of this tractable. Nvidia is actually offering some free compute time for researchers:

http://www.nvidia.com/object/gpu-test-drive.html

31reasons · on Dec 8, 2014

Its not as complicated as it seems. 2 full-time year might be required if you want to start in the PhD program for it but not for its applications.

andrewcamel · on Dec 8, 2014

That's pretty comforting to hear. Do you know of any resources off the top of your head to learn this type of thing?

Houshalter · on Dec 8, 2014

Geoffrey Hinton (a notable deep learning researcher) did a coursera course on neural networks awhile ago. It's over but you can still see the lectures which are very good: https://www.coursera.org/course/neuralnets

Metacademy is also a very useful resource for anything machine learning: http://www.metacademy.org/

zonzo · on Dec 8, 2014

torch7 used by facebook and google. Fast becoming the standard industrial neural nets library. http://torch.ch Read the tutorials and you can get something working in a day

discardorama · on Dec 8, 2014

"Used by Facebook and Google" ? Citation needed. AFAIK, at least Google has an internal homebrew solution, that automatically scales to large clusters.

benanne · on Dec 8, 2014

Google DeepMind chiefly uses Torch7. I presume parts of Twitter also use it now since they acquired Clement Farabet's startup MadBits.

Facebook's AI research lab has contributed to the Torch7 project (which is unsurprising since it is lead by Yann LeCun, and Torch7 was originally developed in his group at NYU).

I wouldn't go as far as to say it's "becoming the industry standard" though. Caffe and Theano are also very popular.

olavgg · on Dec 8, 2014

AMA with Yann LeCun which confirms they are using it. http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_...

discardorama · on Dec 8, 2014

I stand corrected. Thanks for the link. I had somehow missed that AMA; it's fascinating reading. (Hinton's too).

ajtulloch · on Dec 8, 2014

Facebook AI Research uses Torch heavily, and contributes improvements back to the open-source version.

chestervonwinch · on Dec 8, 2014

There's not a lot of theoretical background beyond the basics of pattern recognition to get started understanding this stuff. Of course, pattern recognition requires some knowledge of probability, statistics, linear algebra, and vector calculus. There are books on pattern recognition that are rather friendly towards this prerequisite knowledge, though.

jskonhovd · on Dec 8, 2014

I disagree. I believe that the foundation you will learn in statistics, linear algebra, and other mathematics would be worth the effort to any computer scientist. You would enjoy the introduction Machine Learning courses on Udacity or Coursera.

pacala · on Dec 8, 2014

This video uses the same slide deck: https://www.youtube.com/watch?v=vvK-XOiKXOs

cr4zy · on Dec 8, 2014

The second half of this video has an earlier talk that goes along with the slides: http://youtu.be/S9twUcX1Zp0?t=22m49s

ivan_ah · on Dec 9, 2014

HN readers in the Montreal area will have a chance to listen to the talk in person at the McGill colloquium:

Scaling Deep Learning, Wednesday, December 10th, 2:00PM-3:00 PM at the McGill University M1 amphitheater of the Strathcona building at 3640 University Street.

mr_overalls · on Dec 8, 2014

Does Deep Learning on this scale offer any obvious benefits for genomic analysis? Does it make sense to use data from the 1000 Genomes Project (or similar large-scale sequencing effort) and perform association studies?

forgotAgain · on Dec 8, 2014

I wonder how many machines are chomping on raw data generated by chrome, android and search. What is Google trying to learn and what does it already know?