> Static random-access memory (static RAM or SRAM) is a type of semiconductor memory that uses bistable latching circuitry (flip-flop) to store each bit. SRAM exhibits data remanence, but it is still volatile in the conventional sense that data is eventually lost when the memory is not powered.
> The term static differentiates SRAM from DRAM (dynamic random-access memory), which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.
Is the embedded case really that interesting? Almost by definition, the embedded device will be receiving a tiny fraction of the data in the world that it may be concerned about. It seems unlikely to me that an embedded, power-constrained device is going to "deep learn" anything all that useful that wouldn't be better learned in something with more data and power available. But I do mean this as a question, if anybody's got a really cool use case in hand. (Please something more specific than text that boils down to "something something sensor network internet of things local conditions something".)
This isn't for training, is it? It's for using the results of training immediately (inferring something), without the need for a network round trip (as far as I understand it).
So, you might still send the request to the network to continue training the model, but by the time you do, your answer has already been computed on the local machine for local consumption.
True. In the cases where this would be useful (extremely little training data), there are machine learning methods that are much, much better than deep learning, which is comparatively a glutton for data. E.g. handwriting recognition or signature recognition (where you would have only a few samples).
The large nets, yes. Partly because bigger nets tend to do better, we've built massive nets that run pretty well on big GPUs. But that's the state of things, we have impressive results but from huge neural nets that we really cannot run in realtime on a mobile device.
It's particularly bad if you want to run them on a small, battery-powered device, and far more so if you only have one image to process: here they're seeing single-image speedups of over 10x compared to GPUs (though slower than batched processing) and energy efficiency 1000+ times better than mobile GPUs (and far more compared to the beasts in your desktop).
> But that's the state of things, we have impressive results but from huge neural nets that we really cannot run in realtime on a mobile device.
This is absolutely incorrect.
A mobile device can execute a pretrained model fine. See, for example, [1][2][3]. Google's new TensorFlow NN system is explicitly designed to be able to run on mobile devices and comes with a pre-trained image classification NN that works fine on mobile devices.
This doesn't mean that energy saving is unimportant on a mobile device, of course. But there are very widely deployed production systems (e.g. all of Android) that use them now with no special GPU acceleration.
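To make the point concrete, here's a minimal sketch (plain NumPy with random placeholder weights, not TensorFlow's actual mobile code path): executing a pretrained model is just a forward pass with fixed weights, no gradients or backprop involved, which is why a phone can do it.

    # Minimal sketch: inference = forward pass with frozen weights.
    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Hypothetical "pretrained" weights; on a device these would be loaded
    # from a file shipped with the app, not generated at random.
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((256, 64)), np.zeros(64)
    W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)

    def infer(x):
        h = relu(x @ W1 + b1)        # hidden layer
        return softmax(h @ W2 + b2)  # class probabilities

    print(infer(rng.standard_normal(256)).argmax())  # predicted class index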
> A mobile device can execute a pretrained model fine.
Of course it can. The problem is with the size of the network you might want to run.
From your first link, which is about single-character-level image processing:
> We needed to develop a very small neural net, and put severe limits on how much we tried to teach it—in essence, put an upper bound on the density of information it handles.
And from the voice training paper:
> While our server-based model has 50M parameters (k = 4, nh = 2560, ni = 26 and no = 7969), to reduce the memory and computation requirement for the embedded model, we experimented with a variety of sizes and chose k = 6, nh = 512, ni = 16 and no = 2000, or 2.7M parameters
AlexNet is, what, 60M+ parameters? VGG is much bigger still (VGG-16 is around 138M).
I guess I wasn't too clear. Yes, there are good results from nets that we can run on mobile devices in realtime. We do, however, want to run significantly larger nets.
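Rough back-of-envelope on why those parameter counts matter on a phone, assuming plain 32-bit float weights (ballpark only):

    # Approximate weight storage at 4 bytes per parameter (float32).
    for name, params in [("embedded voice model", 2.7e6),
                         ("server voice model", 50e6),
                         ("AlexNet", 60e6),
                         ("VGG-16", 138e6)]:
        print(f"{name}: ~{params * 4 / 1e6:.0f} MB of weights")
    # ~11 MB, ~200 MB, ~240 MB, ~550 MB respectively -- and that's before
    # activations and framework overhead.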
[1] is a Google-authored demo running Inception-v3[2] (about twice as accurate as VGG) on Android phones.
I don't know how many parameters Inception v3 has, but I know Google considers it more efficient than VGG ("Although our network is 42 layers deep, our computation cost is only about 2.5 higher than that of GoogLeNet and it is still much more efficient than VGGNet".)
Yes, being able to run big networks is great. But ultimately it's what you do with it, and a 3% error rate on ImageNet is a pretty compelling argument that size isn't the only factor.
Well, it depends on the neural network you want to use. At computer vision/image processing conferences I find that it is much more common for people to use preexisting DNNs and plug them into their systems rather than train their own (why would you waste weeks or months on that?). The good pretrained DNNs available today tend to be huge, requiring a large amount of video RAM. I've talked to people writing realtime image processing software (GPU acceleration required) intended for handheld devices which can't yet actually run on the device because they don't have a suitable DNN which will fit in VRAM.
Performing inference with deep learning models is computationally costly as well. It's a reason why Nvidia has two streams of business: Tesla / Titan / upcoming Volta GPUs to train models, and Tegra boards to enable near-real-time inference.
Those interested in optimal neural network compression might consider the paper "Bitwise Neural Networks" by Kim and Smaragdis http://paris.cs.illinois.edu/pubs/minje-icmlw2015.pdf which enables much better compression than simple quantization and pruning.
How do you mean "much better compression"? Won't replacing 32-bit multiplies with bitwise operations save at most 32x the memory[1]? Han et al. show not only a 35-49x improvement, but on much more difficult benchmarks (MNIST vs. AlexNet/VGG).
Combining these two techniques would be really cool, and if the bitwise approach can work with larger, more complex networks like VGG, it would be a massive game-changer, allowing these nets to fit on almost any device.
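For anyone wondering what the 1-bit idea looks like, here's a toy sketch using simple sign binarization with one scale per layer (in the spirit of bitwise/binary nets generally, not the exact scheme from the Kim and Smaragdis paper): one bit per weight instead of 32 is where the ~32x storage argument comes from.

    import numpy as np

    def binarize(W):
        # Keep only the sign of each weight plus one scale factor per layer.
        scale = np.abs(W).mean()
        return np.signbit(W), scale          # bit-packable mask + one float

    def binary_matvec(bits, scale, x):
        # Reconstruct {-scale, +scale} weights on the fly; a real implementation
        # would use packed bits and popcount-style ops instead of float math.
        W_hat = np.where(bits, -scale, scale)
        return W_hat @ x

    W = np.random.randn(512, 512).astype(np.float32)
    bits, s = binarize(W)
    x = np.random.randn(512).astype(np.float32)
    print(np.corrcoef(W @ x, binary_matvec(bits, s, x))[0, 1])  # crude fidelity check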
In the past, specialized architectures tended to lose out to the performance gains of generic x86 chips. I'm thinking RISC vs. CISC, or Sun's attempt at a custom Java-optimized processor.
Is that no longer the case? ARM architectures seem to be beating x86 for low power, mobile devices. GPUs are being used for many easily parallelizable workloads.
Is an overall slowdown in Moore's law making chips designed for specific tasks (like Deep Neural Nets) attractive again?
Deep neural networks involve convolution and matrix multiplication operations whose speed has a well-defined hard upper bound in terms of the number of cores; GPUs (with thousands of processors) are thus indispensable for achieving faster performance. There are newer, interesting approaches, such as reducing multiplication operations, which might improve inference/forward speed even further. http://arxiv.org/abs/1510.03009
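For a sense of scale, here's the textbook multiply-accumulate count for a single dense convolution layer (nothing specific to the linked paper; layer sizes below are just an example):

    def conv_macs(h_out, w_out, c_in, c_out, k):
        # Multiply-accumulates for a standard (dense) conv layer: every output
        # pixel of every output channel needs a k*k*c_in dot product.
        return h_out * w_out * c_out * (k * k * c_in)

    # e.g. one 3x3 conv layer, 256 -> 256 channels, on a 56x56 feature map:
    print(conv_macs(56, 56, 256, 256, 3) / 1e9, "GMACs")  # ~1.85 GMACs for one layer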
I'm not sure CISC beat RISC, except for complicating the instruction set enough to keep costs high for competitors.
Underneath, Intel processors translate x86 into simpler, RISC-like micro-ops. They could make a more efficient chip without this translation component, which probably holds them back a bit in low-power stuff.
It's more like x86 beat everything, which is also saying that externally CISC/RISC didn't matter so much. I like the section on RISC here: http://danluu.com/butler-lampson-1999/ Specifically part of the last paragraph: "It’s possible to nitpick RISC being a no by saying that modern processors translate x86 ops into RISC micro-ops internally, but if you listened to talk at the time, people thought that having a external RISC ISA would be so much lower overhead that RISC would win, which has clearly not happened. Moreover, modern chips also do micro-op fusion in order to fuse operations into decidedly un-RISC-y operations."
You can do fast array math on video cards and specialized FPGAs; neural nets require fast array math.
This chip optimizes away some memory overhead involved in DL. Probably lots of such optimizations are possible in DL, since it is a fairly well-defined problem.
Yes, it's still the case. Moore's Law is still alive (e.g. Tegra X1 vs K1).
However, this paper is not about using specialized architectures, it's about compressing weight matrices and putting them in cache.
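Very roughly, the compression in question boils down to pruning plus weight sharing so the matrix fits in on-chip memory. A toy sketch, using scipy's CSR format rather than the paper's actual encoding, and ignoring the index-compression tricks that push the ratio much higher:

    import numpy as np
    from scipy.sparse import csr_matrix

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 1024)).astype(np.float32)

    # 1. Prune: zero out small weights so the matrix becomes sparse.
    W[np.abs(W) < 1.5] = 0.0                      # keeps roughly 13% of the weights
    W_sparse = csr_matrix(W)                      # store only the survivors

    # 2. Share: snap each surviving weight to the nearest of 16 shared values,
    #    so each one only needs a 4-bit index into a tiny codebook.
    codebook = np.linspace(W_sparse.data.min(), W_sparse.data.max(), 16)
    idx = np.abs(W_sparse.data[:, None] - codebook[None, :]).argmin(axis=1)
    W_sparse.data[:] = codebook[idx]

    x = rng.standard_normal(1024).astype(np.float32)
    y = W_sparse @ x                              # inference uses the small, cacheable form

    dense = W.nbytes
    compressed = idx.size // 2 + codebook.nbytes + W_sparse.indices.nbytes
    print(f"~{dense / compressed:.0f}x smaller (very rough; ignores index compression)")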
x86 beat everything largely due to momentum. Intel had the greatest ability to invest in process shrinking technology and the brute force to engineer their way out of architectural dead-ends (see NetBurst).
Now that process shrinking has become extremely difficult, everybody else is starting to catch up.
Mark my words, in the next couple of years we'll see custom silicon that massively improves performance per watt for DNNs by using fixed point, quantization, and saturation arithmetic. The gain in performance per watt will be at least an order of magnitude. This will make DNNs worthwhile in a lot more classification problems where currently they are simply too slow.
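As a sketch of what "fixed point, quantization, and saturation arithmetic" could mean in practice (8-bit here; actual silicon would of course differ):

    import numpy as np

    def quantize_int8(x, scale):
        # Fixed-point with saturation: scale, round, then clamp to the int8
        # range instead of wrapping around on overflow.
        q = np.rint(x / scale)
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(1000).astype(np.float32)
    scale = np.abs(w).max() / 127.0               # one scale per tensor
    w_q = quantize_int8(w, scale)                 # 1 byte per weight instead of 4
    print(np.abs(w - dequantize(w_q, scale)).max())  # worst-case rounding error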
Mark my words, DNNs are not really the most efficient structure for a predictive model, due to the "distributed" representation that makes them good predictors but also makes them hard to train and resource-consuming to apply. In a few years DNNs will be replaced by more efficient models.
Sorry, nothing definite yet. Any sort of tree-based computation which makes early pruning decisions is more efficient than full matrix multiplication, even if the matrix is sparse. Then, higher-order computation like powers is again more efficient than a simple linear model. Those two directions are quite probable.
Reminds me of the "Expert Systems" from the 1980s. They saw similar levels of (inflation-adjusted) hype too. Decision trees have had their heyday; wonder if they'll be back in fashion soon.
Please don't compare the complexity of the brain to artificial neural networks. A single neuron is far more sophisticated than anything we use for machine learning. Many people are starting to refer to machine-learning "neurons" as "units" to avoid this comparison.
The compressed network achieves a decent speedup and energy saving on current hardware too (desktop/mobile), without significant loss of accuracy.