> Static random-access memory (static RAM or SRAM) is a type of semiconductor memory that uses bistable latching circuitry (flip-flop) to store each bit. SRAM exhibits data remanence, but it is still volatile in the conventional sense that data is eventually lost when the memory is not powered.
> The term static differentiates SRAM from DRAM (dynamic random-access memory), which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.
Is the embedded case really that interesting? Almost by definition, the embedded device will be receiving a tiny fraction of the data in the world that it may be concerned about. It seems unlikely to me that an embedded, power-constrained device is going to "deep learn" anything all that useful that wouldn't be better learned in something with more data and power available. But I do mean this as a question, if anybody's got a really cool use case in hand. (Please something more specific than text that boils down to "something something sensor network internet of things local conditions something".)
This isn't for training, is it? It's for using the results of training immediately (inferring something), without the need for a network round trip (as far as I understand it).
So, you might still send the request to the network to continue training the model, but by the time you do, your answer has already been computed on the local machine for local consumption.
True. In the cases where this would be useful (extremely little training data), there are machine learning methods that are much, much better than deep learning, which is comparatively a glutton for data. E.g. handwriting recognition or signature recognition (where you would have only a few samples).
The large nets, yes. Partly because bigger nets tend to do better, we've built massive nets that run pretty well on big GPUs. But that's the state of things, we have impressive results but from huge neural nets that we really cannot run in realtime on a mobile device.
It's particularly bad if you want to run them on a small, battery-powered device, and far more so if you only have one image to process: here they're seeing single-image speedups of over 10x compared to GPUs (though slower than batched processing) and energy efficiency 1000+ times better than mobile GPUs (and far more compared to the beasts in your desktop).
> But that's the state of things, we have impressive results but from huge neural nets that we really cannot run in realtime on a mobile device.
This is absolutely incorrect.
A mobile device can execute a pretrained model fine. See, for example, [1][2][3]. Google's new TensorFlow NN system is explicitly designed to be able to run on mobile devices and comes with a pre-trained image classification NN that works fine on mobile devices.
This doesn't mean that energy saving is unimportant on a mobile device, of course. But there are very widely deployed production systems (e.g. all of Android) that use them now with no special GPU acceleration.
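To make the point concrete, here's a minimal sketch (plain NumPy with random placeholder weights, not TensorFlow's actual mobile code path): executing a pretrained model is just a forward pass with fixed weights, no gradients or backprop involved, which is why a phone can do it.

    # Minimal sketch: inference = forward pass with frozen weights.
    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Hypothetical "pretrained" weights; on a device these would be loaded
    # from a file shipped with the app, not generated at random.
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((256, 64)), np.zeros(64)
    W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)

    def infer(x):
        h = relu(x @ W1 + b1)        # hidden layer
        return softmax(h @ W2 + b2)  # class probabilities

    print(infer(rng.standard_normal(256)).argmax())  # predicted class index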
> A mobile device can execute a pretrained model fine.
Of course it can. The problem is with the size of the network you might want to run.
From your first link, which is about single-character-level image processing:
> We needed to develop a very small neural net, and put severe limits on how much we tried to teach it—in essence, put an upper bound on the density of information it handles.
And from the voice training paper:
> While our server-based model has 50M parameters (k = 4, nh = 2560, ni = 26 and no = 7969), to reduce the memory and computation requirement for the embedded model, we experimented with a variety of sizes and chose k = 6, nh = 512, ni = 16 and no = 2000, or 2.7M parameters
AlexNet is, what, 60M+ parameters? VGG is much bigger still (VGG-16 is around 138M).
I guess I wasn't too clear. Yes, there are good results from nets that we can run on mobile devices in realtime. We do, however, want to run significantly larger nets.
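Rough back-of-envelope on why those parameter counts matter on a phone, assuming plain 32-bit float weights (ballpark only):

    # Approximate weight storage at 4 bytes per parameter (float32).
    for name, params in [("embedded voice model", 2.7e6),
                         ("server voice model", 50e6),
                         ("AlexNet", 60e6),
                         ("VGG-16", 138e6)]:
        print(f"{name}: ~{params * 4 / 1e6:.0f} MB of weights")
    # ~11 MB, ~200 MB, ~240 MB, ~550 MB respectively -- and that's before
    # activations and framework overhead.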
[1] is a Google-authored demo running Inception-v3[2] (about twice as accurate as VGG) on Android phones.
I don't know how many parameters Inception v3 has, but I know Google considers it more efficient than VGG ("Although our network is 42 layers deep, our computation cost is only about 2.5 higher than that of GoogLeNet and it is still much more efficient than VGGNet".)
Yes, being able to run big networks is great. But ultimately it's what you do with it, and a 3% error rate on ImageNet is a pretty compelling argument that size isn't the only factor.
Well, it depends on the neural network you want to use. At computer vision/image processing conferences I find that it is much more common for people to use preexisting DNNs and plug them into their systems rather than train their own (why would you waste weeks or months on that?). The good pretrained DNNs available today tend to be huge, requiring a large amount of video RAM. I've talked to people writing realtime image processing software (GPU acceleration required) intended for handheld devices which can't yet actually run on the device because they don't have a suitable DNN which will fit in VRAM.
Performing inference with deep learning models is computationally costly as well. It's a reason why Nvidia has two streams of business: Tesla / Titan / upcoming Volta GPUs to train models, and Tegra boards to enable near-real-time inference.
Those interested in optimal neural network compression might consider the paper "Bitwise Neural Networks" by Kim and Smaragdis http://paris.cs.illinois.edu/pubs/minje-icmlw2015.pdf which enables much better compression than simple quantization and pruning.
How do you mean "much better compression"? Won't replacing 32-bit multiplies with bitwise operations save at most 32x the memory[1]? Han et al. show not only a 35-49x improvement, but on much more difficult benchmarks (MNIST vs. AlexNet/VGG).
Combining these two techniques would be really cool, and if the bitwise approach can work with larger, more complex networks like VGG, it would be a massive game-changer, allowing these nets to fit on almost any device.
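For anyone wondering what the 1-bit idea looks like, here's a toy sketch using simple sign binarization with one scale per layer (in the spirit of bitwise/binary nets generally, not the exact scheme from the Kim and Smaragdis paper): one bit per weight instead of 32 is where the ~32x storage argument comes from.

    import numpy as np

    def binarize(W):
        # Keep only the sign of each weight plus one scale factor per layer.
        scale = np.abs(W).mean()
        return np.signbit(W), scale          # bit-packable mask + one float

    def binary_matvec(bits, scale, x):
        # Reconstruct {-scale, +scale} weights on the fly; a real implementation
        # would use packed bits and popcount-style ops instead of float math.
        W_hat = np.where(bits, -scale, scale)
        return W_hat @ x

    W = np.random.randn(512, 512).astype(np.float32)
    bits, s = binarize(W)
    x = np.random.randn(512).astype(np.float32)
    print(np.corrcoef(W @ x, binary_matvec(bits, s, x))[0, 1])  # crude fidelity check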
In the past, specialized architectures tended to lose out to the performance gains of generic x86 chips. I'm thinking RISC vs. CISC, or Sun's attempt at a custom Java-optimized processor.
Is that no longer the case? ARM architectures seem to be beating x86 for low power, mobile devices. GPUs are being used for many easily parallelizable workloads.
Is an overall slowdown in Moore's law making chips designed for specific tasks (like Deep Neural Nets) attractive again?
Deep neural networks involve convolution and matrix multiplication operations whose speed has a well-defined hard upper bound in terms of the number of cores; GPUs (with thousands of processors) are thus indispensable for achieving faster performance. There are newer, interesting approaches, such as reducing multiplication operations, which might improve inference/forward speed even further. http://arxiv.org/abs/1510.03009
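For a sense of scale, here's the textbook multiply-accumulate count for a single dense convolution layer (nothing specific to the linked paper; layer sizes below are just an example):

    def conv_macs(h_out, w_out, c_in, c_out, k):
        # Multiply-accumulates for a standard (dense) conv layer: every output
        # pixel of every output channel needs a k*k*c_in dot product.
        return h_out * w_out * c_out * (k * k * c_in)

    # e.g. one 3x3 conv layer, 256 -> 256 channels, on a 56x56 feature map:
    print(conv_macs(56, 56, 256, 256, 3) / 1e9, "GMACs")  # ~1.85 GMACs for one layer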
I'm not sure CISC beat RISC, except for complicating the instruction set enough to keep costs high for competitors.
Underneath, Intel processors translate x86 into simpler, RISC-like micro-ops. They could make a more efficient chip without this translation component, which probably holds them back a bit in low-power stuff.
It's more like x86 beat everything, which is also saying that externally CISC/RISC didn't matter so much. I like the section on RISC here: http://danluu.com/butler-lampson-1999/ Specifically part of the last paragraph: "It’s possible to nitpick RISC being a no by saying that modern processors translate x86 ops into RISC micro-ops internally, but if you listened to talk at the time, people thought that having a external RISC ISA would be so much lower overhead that RISC would win, which has clearly not happened. Moreover, modern chips also do micro-op fusion in order to fuse operations into decidedly un-RISC-y operations."
You can do fast array math on video cards and specialized FPGAs; neural nets require fast array math.
This chip optimizes away some memory overhead involved in DL. Probably lots of such optimizations are possible in DL, since it is a fairly well-defined problem.
Yes, it's still the case. Moore's Law is still alive (e.g. Tegra X1 vs K1).
However, this paper is not about using specialized architectures, it's about compressing weight matrices and putting them in cache.
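Very roughly, the compression in question boils down to pruning plus weight sharing so the matrix fits in on-chip memory. A toy sketch, using scipy's CSR format rather than the paper's actual encoding, and ignoring the index-compression tricks that push the ratio much higher:

    import numpy as np
    from scipy.sparse import csr_matrix

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 1024)).astype(np.float32)

    # 1. Prune: zero out small weights so the matrix becomes sparse.
    W[np.abs(W) < 1.5] = 0.0                      # keeps roughly 13% of the weights
    W_sparse = csr_matrix(W)                      # store only the survivors

    # 2. Share: snap each surviving weight to the nearest of 16 shared values,
    #    so each one only needs a 4-bit index into a tiny codebook.
    codebook = np.linspace(W_sparse.data.min(), W_sparse.data.max(), 16)
    idx = np.abs(W_sparse.data[:, None] - codebook[None, :]).argmin(axis=1)
    W_sparse.data[:] = codebook[idx]

    x = rng.standard_normal(1024).astype(np.float32)
    y = W_sparse @ x                              # inference uses the small, cacheable form

    dense = W.nbytes
    compressed = idx.size // 2 + codebook.nbytes + W_sparse.indices.nbytes
    print(f"~{dense / compressed:.0f}x smaller (very rough; ignores index compression)")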
x86 beat everything largely due to momentum. Intel had the greatest ability to invest in process shrinking technology and the brute force to engineer their way out of architectural dead-ends (see NetBurst).
Now that process shrinking has become extremely difficult, everybody else is starting to catch up.
Mark my words, in the next couple of years we'll see custom silicon that massively improves performance per watt for DNNs by using fixed point, quantization, and saturation arithmetic. The gain in performance per watt will be at least an order of magnitude. This will make DNNs worthwhile in a lot more classification problems where currently they are simply too slow.
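As a sketch of what "fixed point, quantization, and saturation arithmetic" could mean in practice (8-bit here; actual silicon would of course differ):

    import numpy as np

    def quantize_int8(x, scale):
        # Fixed-point with saturation: scale, round, then clamp to the int8
        # range instead of wrapping around on overflow.
        q = np.rint(x / scale)
        return np.clip(q, -128, 127).astype(np.int8)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(1000).astype(np.float32)
    scale = np.abs(w).max() / 127.0               # one scale per tensor
    w_q = quantize_int8(w, scale)                 # 1 byte per weight instead of 4
    print(np.abs(w - dequantize(w_q, scale)).max())  # worst-case rounding error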
Mark my words, DNNs are not really the most efficient structure for a predictive model, due to the "distributed" representation that makes them good predictors but also makes them hard to train and resource-consuming to apply. In a few years DNNs will be replaced by more efficient models.
Sorry, nothing definite yet. Any sort of tree-based computation which makes early pruning decisions is more efficient than full matrix multiplication, even if the matrix is sparse. Then, higher-order computation like powers is again more efficient than a simple linear model. Those two directions are quite probable.
Reminds me of the "Expert Systems" from the 1980s. They saw similar levels of (inflation-adjusted) hype too. Decision trees have had their heyday; wonder if they'll be back in fashion soon.
Please don't compare the complexity of the brain to artificial neural networks. A single neuron is far more sophisticated than anything we use for machine learning. Many people are starting to refer to machine-learning "neurons" as "units" to avoid this comparison.
The compressed network achieves a decent speedup and energy saving on current hardware too (desktop/mobile), without significant loss of accuracy.