Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine (github.com/amznlabs)
161 points by j_juggernaut on May 11, 2016 | 53 comments



At a glance:

  - Only supports fully connected layers for now. No convnets or RNNs.

  - Requires a GPU. No option to run on CPU, not even for development. 

  - Setup instructions for Ubuntu only. No Mac or Windows.

  - Uses JSON to define the network architecture, which limits what you can build.

  - Takes in data in NetCDF format only.

  - Very little documentation.

  - The name is bad. I'm not going to remember how to spell DSSTNE.
It seems like a very early proof of concept. I wouldn't expect it to be useful to most people at this point. Built-in support for sparse vectors is interesting, but not a strong selling point by itself. I hope Amazon continues to develop it. Or, even better, contributes to one of the existing, more mature frameworks.


It's more than that, and it's in use in production at Amazon. Eight TitanX GPUs can hold networks with up to 6 billion weights. As Geoffrey Hinton once said:

"My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

And you're right that it's a specialized framework/engine. But IMO making it more general purpose is a matter of cutting and pasting in the right cuDNN code; alternatively, we could double down on emphasizing sparse data. IMO, Amazon OSSed this partly to see what people would want here.


> "My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

An interesting quote.

Replicating the functioning of the brain, or some major subsystem of it, is no doubt going to require far more than just billions of parameters. The cortex contains >15 billion neurons, but there are also the neurons contained in all the other brain structures. Furthermore, neurons connect via dense dendritic trees, the human brain having on the order of 100 trillion synapses.

Adding to the complexity, neurons have numerous "communication ports", including numerous pre- and postsynaptic neurotransmitter receptors and a wide range of receptors for endocrine, immune-system and other types of signals. Message propagation also typically involves a layer of complex intracellular "second-messenger" transformations.

While it's highly probable that future NNs will be developed that do even more amazing things than are now possible, I think the challenge of equaling what real brains do is, to say the least, enormously daunting.

Somebody smarter than me could probably figure out the magnitude (how many nodes or weights it would take for an NN to function like the brain), though I imagine it will be a really impressive number.
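
As a crude back-of-envelope, treating each synapse as a single fp32 weight (an assumption that ignores all the receptor-level complexity above):

  # One 4-byte weight per synapse, using the ~100 trillion figure above.
  synapses = 100e12
  total_bytes = synapses * 4
  print(total_bytes / 1e12)   # 400.0 TB just to store the weights
  print(total_bytes / 12e9)   # ~33,000 TitanX-sized (12 GB) GPUs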

Edit: typos


> "My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

While that may be true, I find this compelling:

"The fundamental unit of biological information processing is the molecule, rather than any higher level structure like a neuron or a synapse; molecular level information processing evolved very early in the history of life."

http://www.softmachines.org/wordpress/?p=1558#more-1558

Edit: formatting


>Replicating functioning of the brain, or some major subsystem of it, is no doubt going to require far more than just billions of parameters.

Maybe, but we shouldn't forget that computers do not suddenly lose their capability to function as exact, deterministic, programmable machines just because they happen to run an ANN.

What I mean is that there may be shortcuts to reduce the number of required nodes dramatically.

If you take the state of an ANN after it was trained to perform some specific task, you can ask the question whether there is a simpler function, i.e. one with much fewer parameters, that approximates the learned function.

Sort of like a human with the Occam's razor gene. I think the fact that the number of neurons does not correlate perfectly with intelligence in animals is an indication that there is room for optimization.
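
A minimal sketch of that idea in numpy: fit a small "student" network to the outputs of a big trained "teacher". The shapes here and the distillation-style setup are illustrative assumptions, nothing DSSTNE-specific:

  import numpy as np

  rng = np.random.default_rng(0)

  # Stand-in "teacher": a wide network pretending to be the trained ANN.
  W1 = rng.normal(size=(10, 256)) * 0.3
  W2 = rng.normal(size=(256, 1)) * 0.3

  def teacher(x):
      return np.tanh(x @ W1) @ W2

  # "Student": far fewer parameters, trained to match the teacher's outputs.
  V1 = rng.normal(size=(10, 8)) * 0.1
  V2 = rng.normal(size=(8, 1)) * 0.1
  lr = 1e-3
  for _ in range(5000):
      x = rng.normal(size=(64, 10))
      h = np.tanh(x @ V1)
      err = h @ V2 - teacher(x)      # match the teacher, not raw labels
      V2 -= lr * h.T @ err / len(x)
      V1 -= lr * x.T @ ((err @ V2.T) * (1 - h ** 2)) / len(x)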


Absolutely 100% agree, but at the same time, I think we will ultimately need to build and evaluate models that can span the memory of more than one processor. I don't think a single GTX Titan X, GTX 1080 or even a server is enough here.

Additionally, data parallelization broadly disallows these larger models (yes, I know about send/receive nodes in TensorFlow, but they're not general or automatic enough for researchers IMO), while ASGD makes horribly inefficient use of the very limited bandwidth between processors. All IMO of course. There are hacks and tricks here, but I think those should be late-stage optimizations, not requirements to achieve scaling.

Finally, I'm a stickler for deterministic computation as someone who spent a decade writing graphics drivers before joining the CUDA team in 2006, but that's pretty much a "hear me now, believe me later" opinion of mine after tracking down too many bizarro race conditions late into the night in that former life :-). Of course, one person's race condition can sometimes be an ANN's regularizer, but I digress.

I also agree we'll do some amazing things with far fewer neurons and weights than an actual human brain, but I'll bet you good money we end up needing more than 12GB to do it. AlphaGo alone was 200+ GPUs, right?


Thanks for the clarification. I'd change "early proof of concept" to "a specialized framework", but the other observations stand, I believe.

It's totally fine that it's a specialized framework, and it doesn't need to become general purpose. I just think the product description should do a better job positioning it and explaining what it's NOT intended for to set expectations correctly.


Why does JSON limit what you can build? Or do you just mean it only supports certain architectures because there are no options to specify other ones in JSON?


Exactly what you said. The declarative approach is great for the common architectures, but if your requirements are different and the JSON format doesn't have a way to declare them, you're stuck.
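
For reference, a config in roughly the shape the repo's examples use (the field names here are from memory and may not match the spec exactly):

  {
    "Version" : 0.7,
    "Name" : "AutoEncoder",
    "Kind" : "FeedForward",
    "Layers" : [
      { "Name" : "Input",  "Kind" : "Input",  "N" : "auto", "DataSet" : "gl_input", "Sparse" : true },
      { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
      { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "N" : "auto", "DataSet" : "gl_output", "Activation" : "Sigmoid", "Sparse" : true }
    ],
    "ErrorFunction" : "ScaledMarginalCrossEntropy"
  }

If the architecture you want can't be expressed in those fields, there's no escape hatch short of editing the engine.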


> It seems like a very early proof of concept.

Agreed, it looks like a rushed response to TensorFlow.


Amazon is turning over a new leaf. They stopped publishing at major conferences after their last significant paper, Dynamo.

My perception of Amazon is that they take everything from open source but don't actively give back. Amazon and open source never went hand in hand. Making their deep learning framework open source is cool. Kudos to the team that managed to do this. I am sure that, internally, it must have been a huge struggle to get approval from the execs.

[Edit: Grammar]


For a second, the thought crossed my mind that Amazon is actively trying to change its external perception after the NY Times article and is trying to cozy up to developers. I found this on Glassdoor. Apparently, it will take a long time for them to make their culture less toxic.

===From Glassdoor===

Cons

====

The management process is abusive, and I'm currently a manager. I've seen too much "behind the wall" and hate how our individual performers can be treated. You are forced to ride people and stack rank employees... I've been forced to give good employees bad overall ratings because of politics and stack ranking.

Advice to Management

Don't pretend that the recent NY Times article was all about "isolated incidents". The culture IS abusive and it WILL backfire once stock value starts to drop. I'm an 8-year veteran and I no longer recommend former peers to interview with Amazon.

== [Edit: Formatted to make it clear what was pulled from Glassdoor]


I just joined AWS ProServ and I really don't see any of these things. Pretty amazing team, and some of the best work-life balance I've seen at a tech company so far. I have 4 other friends who work at AWS and all seem very happy so far. I found the Glassdoor comment, and it seems to be from an engineering manager. I have a friend who manages one of the AWS products, and he seems to be pretty happy.

I just joined, so I really am not a statistically significant case, but so far it's nowhere near what was in that NYT article.

Edit: I can't read, apparently :) Thanks heuving for clarifying, and the commenter for reformatting.


I suspect an inverse survivorship bias in the public representation of the company by ex-employees. Those of us with positive recollections tend to say very little, and (in my case at least) that's due to respect for Amazon's culture.


Eh, I had a pretty good ride there myself. I believe every incident in that NY Times article happened.

If you were at Amazon for any length of time and didn't notice the existence of toxic teams and the random chance element of being hired into one of them, you weren't paying attention.


Concur


The GP pulled that from Glassdoor.


Amazon management is more likely to optimize for optics than actually fix the "problem". In Bezos's mind, the problem is the NYT article and the external reputation. The work culture isn't an accident, nor is it intended to be eventually fixed.

In my tenure at Amazon, I went from getting a 2 and a PIP, then to a 4, then to having my promotion held up because my VP didn't like me. Finally, when I left for Google, they countered with SDE3 and another $15k a year. I didn't take that offer.

This was all back in the 2001-2006 timeframe. Sounds like nothing has changed.


You know what I find unbelievable? In spite of several people calling out that Amazon doesn't officially offer paternity leave, and that it sucks, Amazon leadership is simply doing nothing about it. Google/FB/LinkedIn/Netflix/Microsoft have officially announced paternity leave ranging from 3 months to unlimited. It appears Amazon simply doesn't care about employees, or even about the optics in this case. How do they officially justify their stand? "We don't offer paternity leave because ???" WTF. Really.

Update: They have been offering 6 weeks of paternity leave since Jan 2016. 20+ weeks is what the other companies mentioned seem to offer.


"The company said it is now offering up to 20 paid weeks of leave, consisting of four weeks of paid pre-partum medical leave for pregnant employees, followed by 10 weeks of paid maternity leave and six weeks of paid parental leave. The latter is the new element and is also available to “all other new parents who have been at Amazon for a year or more,” the company said in a Nov. 2 e-mail to employees. "



Thanks for pointing me to this. I didn't know they started offering 6 weeks of leave regardless of gender. 6 weeks feels half-hearted, but it's better than nothing.


> take everything from open-source but don't actively give back

There's nothing wrong with this. There's no contract when using open source, and this is probably how 99% of people interact with it.


You really think Amazon has something to contribute? A popular thing to do at Amazon is to take a complex open source package, wrap it in a web server, and announce your team has launched a revolutionary new PaaS. Or take someone else's web service and build a new web service on top of it with minimal new features and more restrictions. Then announce it and hope for Jeff visibility.


First TensorFlow and now this. "Tensor" is quickly becoming a mathematical-term-that-sounds-familiar-to-developers-but-most-don't-actually-know-what-it-is.

Another example is topology =)


When I entered college after high school* in India (around 1990), I was enamored of its library (my school didn't have one), and I was a math enthusiast (I had also placed in a few state-level math talent competitions). After being introduced to vectors (in math and physics), I chanced upon tensors, and they seemed interesting. I found some good books in the catalog and asked the librarian to issue one. He just refused to lend it to me, saying that it was a topic for "higher level/senior studies" (BSc/MSc). Unfortunately, at that time I could not get any other source for it, so it remained sufficiently far off my radar that I never managed to get back to it. Surprisingly, looking back, it never got covered even in my engineering curriculum, probably because it was (and is) considered a higher-mathematics topic without much engineering application. I did come across it while scanning through the relativity literature, but never attempted to understand it in depth. Now seems to be the time to do it!

* College (or 11th std) in India is the same as 11th grade of high school in the US.


Another one is "isomorphic". Anything that sounds sciencey or mathy will be adopted. There is no other way ;-)


My new programming language has isomorphic tensors built in as a first-class language feature :-)


Sold!

I'll use it to write microservices for my new IoT application.


But is it reactive?


But can we say we know what vectors are, though? As far as I know, tensors are derived from vectors, and I would imagine programmers don't know what vectors are in a mathematical sense.


Tensors can work on other things apart from vector spaces (e.g. modules), but programmers don't know those either.


functor is another one.


field, group...



Lead author of DSSTNE here...

1. DSSTNE was designed two years ago specifically for product recommendations from Amazon's catalog. At that time, there was no TensorFlow, only Theano and Torch. DSSTNE differentiated itself from those two frameworks by optimizing for sparse data and multi-GPU-spanning neural networks. What it's not, currently, is another framework for running AlexNet/VGG/GoogLeNet etc., but about 500 lines of code plus cuDNN could change that if the demand exists. Implementing Krizhevsky's "one weird trick" is mostly trivial since the harder model-parallel part has already been written.

2. DSSTNE does not yet explicitly support RNNs, but it does have support for shared weights, and that's more than enough to build an unrolled RNN; we tried a few, in fact (a toy unrolled-RNN sketch follows at the end of this list). cuDNN 5 can be used to add LSTM support in a couple hundred lines of code. But since (I believe) the LSTM in cuDNN is a black box, it cannot be spread across multiple GPUs. Not too hard to write from the ground up, though.

3. There are a huge number of collaborators and people behind the scenes that made this happen. I'd love to acknowledge them openly, but I'm not sure they want their names known.

4. Say what you want about Amazon, and they're not perfect, but they let us build this from the ground up, and now they have given it away. Google, OTOH, hired me away from NVIDIA in 2011 (another one of those offers I couldn't refuse) but blind-allocated me into search and would not let me work with GPUs, despite my being one of the founding members of NVIDIA's CUDA team, because they had not yet seen them as useful. I didn't stay there long. DSSTNE is 100% fresh code, warts and all, and I thank Amazon both for letting me work on a project like this and for OSSing the code.

5. NetCDF is a nice efficient format for big data files. What other formats would you suggest we support here?

6. I was boarding a plane when they finally released this. I will be benchmarking it in the next few days. TLDR spoilers: near-perfect scaling for hidden layers with 1000 or so hidden units per GPU in use, and effectively free sparse input layers because both activation and weight gradient calculation have custom sparse kernels.

7. The JSON format made sense in 2014, but IMO what this engine needs now is a TensorFlow graph importer. Since the engine builds networks from a rather simple underlying C struct, this isn't particularly hard, but it does require supporting some additional functionality to be 100% compatible.

8. I left Amazon 4 months ago after getting an offer I couldn't refuse. I was the sole GPU coder on this project. I can count the people I'd trust with an engine like this on two hands, and most of them are already building deep learning engines elsewhere. I'm happy to add whatever functionality is desired here. CNN and RNN support seem like two good first steps, and the spec already accounts for this.

9. Ditto for a Python interface, easily implemented IMO through the Python C/C++ extension mechanism: https://docs.python.org/2/extending/extending.html
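
On point 2, here's a toy illustration of why shared weights are enough for an unrolled RNN; plain numpy and made-up sizes, purely for exposition:

  import numpy as np

  rng = np.random.default_rng(0)
  W_in  = rng.normal(size=(8, 16)) * 0.1
  W_rec = rng.normal(size=(16, 16)) * 0.1

  def unrolled_rnn(xs):
      # One fully connected "layer" per timestep, but every timestep
      # reuses (shares) the same W_in/W_rec -- that sharing is all an
      # unrolled RNN is.
      h = np.zeros(16)
      for x_t in xs:
          h = np.tanh(x_t @ W_in + h @ W_rec)
      return h

  print(unrolled_rnn(rng.normal(size=(4, 8))))   # 4 timesteps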

Anyway, it's late, and it's turned out to be a fantastic day to see the project on which I spent nearly two years go OSS.


Thanks for sharing your story!

Let me comment on file formats as someone familiar with both netCDF and deep learning.

I agree that netCDF is a sane binary file format for this application. It's designed for efficient serialization of large arrays of numbers. One downside is that netCDF does not support streaming without writing the data to intermediate files on disk.

Keep in mind that netCDF v4 is itself just a thin wrapper around HDF5. Given that your input format is basically a custom file format written in netCDF, I would have just used HDF5 directly. The API is about as convenient, and this would skip one layer of indirection.

The native file format for TensorFlow is its own custom TFRecords file format, but it also supports a number of other file formats. TFRecords is much simpler technology than NetCDF/HDF5. It's basically just a bunch of serialized protocol buffers [1]. About all you can do with a TFRecords file is pull out examples -- it doesn't support the fancy multi-dimensional indexing or hierarchical structure of netCDF/HDF5. But that's also most of what you need for building machine learning models, and it's quite straightforward to read/write them in a streaming fashion, which makes it a natural fit for technologies like map-reduce.

[1] https://www.tensorflow.org/versions/r0.8/api_docs/python/pyt...
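
To make the "HDF5 directly" suggestion concrete, here's roughly how little code it takes with h5py (the CSR-style index/offset layout is my own toy example, not DSSTNE's actual schema):

  import h5py
  import numpy as np

  # Write a toy sparse dataset: non-zero feature ids plus per-example offsets.
  with h5py.File("examples.h5", "w") as f:
      f["indices"] = np.array([3, 17, 42, 7, 99], dtype=np.int64)
      f["offsets"] = np.array([0, 3, 5], dtype=np.int64)   # 2 examples

  # Read back example 1 without loading the whole file.
  with h5py.File("examples.h5", "r") as f:
      start, end = f["offsets"][1], f["offsets"][2]
      print(f["indices"][start:end])                       # [ 7 99]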


Thanks for that! And boy, I wish I had the resources the TensorFlow team has to build standards like this and also to write their own custom CUDA compiler.

I do want the multi-dimensional indexing for RNN data, though. Maybe supporting HDF5 directly is the path forward.

Thanks again!


Where do you work now? It's interesting to hear what offer you couldn't refuse after you've been at so many places.



Thanks!


Deep Learning systems are becoming C++11's halo projects. Here are some deep learning libraries from the Internet Big 4.

Amazon DSSTNE - https://github.com/amznlabs/amazon-dsstne

Google TensorFlow - https://github.com/tensorflow/tensorflow/

Microsoft CNTK - https://github.com/Microsoft/CNTK/

Facebook fbcunn - https://github.com/facebook/fbcunn/

They all utilize C++11 or later. Just as Hadoop pushed Java in the big data/map-reduce realm, I think these libraries will push C++11 in the Deep Learning realm.


I get that the acronym is easy to pronounce with the suggested word, but why not just use the suggested word (destiny) as the name instead of the acronym? So much easier to read and write. They could explain the name's origin in the README.


"Destiny" would also be ungooglable.


Meanwhile, DSSTNE is completely unmemorable, so even if you wanted to google it, you're going to end up typing "amazon destiny machine learning" or something.


I don't know about you, but I'm much more likely to be googling a project I'm already working with than googling for general information. In that context, "DSSTNE [problem keywords]" seems more useful to me.


That's true... I was mostly thinking about the case where you aren't using it, but remember hearing about it or want to check it out.


Maybe someone who works on deep learning could comment on what this provides vs. other open source systems like Theano, TensorFlow, Torch, etc.


They claim it's twice as fast as TensorFlow, which is not blow-you-out-of-the-water (compare that to the ~50x speedup a GPU gives in most places), but it's a solid speedup.

It's easily parallelizable on GPUs, or so the claim goes.

Its configuration language is much, much shorter than Caffe's, but on inspection the configuration language also looks much less flexible than Caffe's, and they've implemented a damn sight less stuff. No recurrent anything, for example: no LSTM, none of the gating machinery you would need if you were doing LSTM, no residual-net stuff, just off the top of my head.

The docs also look much, much less complete than those of TF and Theano and the like. Note that the dropout probability is mentioned in the user docs, but the actual documentation for the dropout feature is hidden away inside the repo.

The important thing, however, is that they claim a significant improvement when training on extraordinarily sparse datasets, as in recommender systems and the like. It seems very specialized for that exact purpose: witness it only accepting NetCDF-format data, which is common enough in climatology-land but less common in machine-learning-land proper.

The test coverage... to a first approximation, there is no test coverage. It seems quite research-project-y.


One important difference is model-parallel training. From the FAQ:

DSSTNE instead uses “model-parallel training”, where each layer of the network is split across the available GPUs so each operation just runs faster. Model-parallel training is harder to implement, but it doesn’t come with the same speed/accuracy trade-offs of data-parallel training.

https://github.com/amznlabs/amazon-dsstne/blob/master/FAQ.md
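
A toy picture of the difference, with numpy arrays standing in for GPUs (sizes made up; this just shows where the split happens):

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=(32, 1000))             # one minibatch
  W = rng.normal(size=(1000, 4000)) * 0.01    # one big hidden layer

  # Model-parallel: each "GPU" owns a column slice of W; every GPU sees
  # the whole batch and computes a slice of the activations.
  h_model = np.concatenate([x @ Wk for Wk in np.split(W, 4, axis=1)], axis=1)

  # Data-parallel: each "GPU" holds a full copy of W but only a slice of
  # the batch; gradients must then be synchronized across GPUs.
  h_data = np.concatenate([xk @ W for xk in np.split(x, 4, axis=0)], axis=0)

  assert np.allclose(h_model, x @ W) and np.allclose(h_data, x @ W)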


They claim to perform much better on sparse data sets: "DSSTNE is much faster than any other DL package (2.1x compared to Tensorflow in 1 g2.8xlarge) for problems involving sparse data". It also has good support for distributing computation over multiple GPUs; Theano, for example, can't do anything like that. On the other hand, using JSON to design my models sounds much worse than using a programming language.
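
The sparse-data win is easy to see in a toy forward pass: if only a handful of the inputs are non-zero, you only ever need the matching rows of the weight matrix (illustrative numpy, not DSSTNE's actual kernels):

  import numpy as np

  rng = np.random.default_rng(0)
  n_features, n_hidden = 100_000, 128
  W = rng.normal(size=(n_features, n_hidden)).astype(np.float32)

  active = np.array([12, 4_003, 51_200, 77_777])   # 4 non-zeros out of 100k
  x = np.zeros(n_features, dtype=np.float32)
  x[active] = 1.0

  h_dense = x @ W                     # touches every row of W
  h_sparse = W[active].sum(axis=0)    # touches only the 4 active rows
  assert np.allclose(h_dense, h_sparse, atol=1e-4)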


Soo... what is the application for this (other than buzzwords)?


Srsly? All the discussion above, 10hrs+ before your comment, and that's your question?

RTFM: https://github.com/amznlabs/amazon-dsstne/blob/master/FAQ.md



