Hacker News
Benchmarking CNTK on Keras: Is It Better at Deep Learning Than TensorFlow? (minimaxir.com)
131 points by minimaxir on June 12, 2017 | 27 comments



This comment [1] in Tensorflow's discussion group gives some valid points explaining some of the crazy numbers reported by CNTK.

[1] https://groups.google.com/a/tensorflow.org/d/msg/discuss/Dhy...


Thanks, great to see TF recognizing the issue with pipelines/queues and simplifying it. I wish they did the NCHW conversion automatically by detecting the GPU.
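
In the meantime, a minimal sketch (mine, not from the article, assuming Keras 2) of opting into the GPU-friendly NCHW layout by hand:

    # Sketch: manually requesting the NCHW ("channels_first") layout in Keras,
    # which cuDNN convolutions generally prefer on NVIDIA GPUs.
    from keras import backend as K
    from keras.layers import Conv2D

    K.set_image_data_format('channels_first')  # global default becomes NCHW

    # Or per layer:
    conv = Conv2D(64, (3, 3), activation='relu', data_format='channels_first')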

I think TensorFlow is developed in a very principled manner, focusing on the ability to add more ops and platforms in the future. While this has hurt TF usability in the short term, I am bullish on its future.


PyTorch is also much faster than TensorFlow on LSTMs: http://deeplearningathome.com/2017/06/PyTorch-vs-Tensorflow-.... There is also a reason for that.


> There is also a reason for that

Don't be shy. Do share what that reason is.


From the linked article, to help others like me who mostly scan the comments:

> PyTorch LSTM network is faster because, by default, it uses cuDNN’s LSTM implementation which fuses layers, steps and point-wise operations. See blog-post on this here.

> Tensorflow’s RNNs (in r1.2), by default, do not use cuDNN’s RNN, and their ‘call’ function describes only one time-step of computation, hence a lot of optimization opportunities are lost. On the flip side, though, this gives the user much more flexibility, provided that the user knows what he is doing.
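
To make that concrete, here's a rough sketch (mine, not from the post, assuming a recent PyTorch and a CUDA GPU) of the fused path: the whole multi-layer, multi-step LSTM is handed to cuDNN in one call instead of being unrolled one cell call per time step.

    import torch
    import torch.nn as nn

    # nn.LSTM dispatches the full sequence (all layers, all time steps) to a
    # fused cuDNN kernel when the module and its inputs live on the GPU.
    lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
                   batch_first=True).cuda()
    x = torch.randn(32, 100, 128).cuda()   # (batch, time, features)
    output, (h_n, c_n) = lstm(x)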


Yeah, TF RNNs are slow. If you look at the benchmark results here, they're all comparable except for the LSTM examples.

There are 'fused' implementations in TF, but they aren't the default, and I haven't tried them out yet...
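
For reference, a sketch of one of those non-default fused variants (it lives in contrib in TF 1.x, so details may vary by release):

    import tensorflow as tf

    # LSTMBlockFusedCell runs the whole time loop inside a single op instead
    # of one Python-level cell call per step. Inputs are time-major.
    inputs = tf.placeholder(tf.float32, [100, 32, 128])  # (time, batch, features)
    fused_cell = tf.contrib.rnn.LSTMBlockFusedCell(num_units=256)
    outputs, final_state = fused_cell(inputs, dtype=tf.float32)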


How much of this is solely due to the 1-bit gradient descent? It's a genuine breakthrough and MSR (or whoever came up with it) deserves all the credit, but assuming it's not patented I imagine Google will be adding it to TF sooner or later and that will close most of the gap.

EDIT: My bad, I did not see at the end of the article that 1-bit SGD is not enabled on Keras yet, so the performance wins are coming from somewhere else. Neato.
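
(For anyone curious what 1-bit SGD actually does, here is a purely conceptual numpy sketch of the idea from Seide et al. 2014, not CNTK's implementation: quantize each gradient value down to one bit and carry the quantization error forward to the next step.)

    import numpy as np

    def one_bit_quantize(grad, residual):
        # Add the error carried over from the previous step, quantize each
        # value to a single bit (its sign), and keep the new error for next time.
        corrected = grad + residual
        scale = np.mean(np.abs(corrected))       # simplified: one scale per tensor
        quantized = np.where(corrected >= 0, scale, -scale)
        residual = corrected - quantized         # error feedback
        return quantized, residual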


The container specifically uses the non-1-bit-SGD version of CNTK, after I learned that 1-bit SGD does not work with Keras, just to be sure.

MSFT employees talk about LSTM gains on the original submission: https://news.ycombinator.com/item?id=14473255


Last time I checked, 1-bit gradient descent is one of the only bits of CNTK that is closed source. I expect it is patented.


If you haven't tried CNTK and want to get an overview w/o installing anything, try:

https://notebooks.azure.com/cntk/libraries/tutorials

Cheers.


Assessing accuracy? If you're running the same architecture with the same training regime, why would you expect differing accuracy numbers (other than a bug of some sort)? Seems a bit strange to include.


From the CNTK vs. TensorFlow page: https://docs.microsoft.com/en-us/cognitive-toolkit/reasons-t...

> TensorFlow shared the training script for Inception V3, and offered pre-trained models to download. However, it is difficult to retrain the model and achieve the same accuracy, because that requires additional understanding of details such as data pre-processing and augmentation. The best accuracy that was achieved by a third party (Keras in this case) is about 0.6% worse than what the original paper reported. Researchers in the CNTK team worked hard and were able to train a CNTK Inception V3 model with 5.972% top-5 error, even better than the original paper reported!

This suggests that an improvement in accuracy is possible by switching to CNTK, hence why I included an accuracy metric from both frameworks. (and also for sanity checking, as you note)


No, if you re-read the first sentence there, it says that the different results are attributed to differences in "pre-processing and augmentation". The choice of NN framework is essentially irrelevant.


Note, though, that the preprocessing and augmentation are (at least in TF) done within the framework itself. I helped debug the pure-TensorFlow version of the Inception input pipeline, and getting it to match the earlier DistBelief version was agonizing -- it really shows all of the differences (and bugs) in the image processing ops. And there can be subtle effects -- differences in which image resizing algorithm you use, for example.

But it's worth noting that this code is all released:

https://github.com/tensorflow/models/blob/master/inception/i...

It may be hard to replicate that across all platforms, though -- as an example, the distortions include using four different image resizing algorithms.
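
As a sketch of the kind of thing that differs (my own illustration, not the actual Inception code), TF 1.x exposes those resizing algorithms like this, and each produces slightly different pixel values:

    import tensorflow as tf

    image = tf.placeholder(tf.float32, [None, None, 3])
    # Four different resize methods; results differ slightly between them.
    bilinear = tf.image.resize_images(image, [299, 299], method=tf.image.ResizeMethod.BILINEAR)
    nearest  = tf.image.resize_images(image, [299, 299], method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
    bicubic  = tf.image.resize_images(image, [299, 299], method=tf.image.ResizeMethod.BICUBIC)
    area     = tf.image.resize_images(image, [299, 299], method=tf.image.ResizeMethod.AREA)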

Some of it was true preprocessing, i.e., cleaning up the imagenet data. I wrote a bit about that here: https://da-data.blogspot.com/2016/02/cleaning-imagenet-datas...

(tl;dr - there are some invalid images and bboxes, etc., and some papers chose to deal with the "blacklisted" images differently.)


Should be irrelevant... It's still worth testing though, to see that the default implementations are actually doing what it says on the box...


That's not really related to the network. When you build a NN, you shouldn't expect different accuracy using different frameworks (given that you're using the same hyperparameters and training/evaluating on the same data).


I mean this in the nicest possible way (so please don't take offense), but I think you're missing what is being said there. The framework itself should have no impact, positive or negative, on model accuracy. That being said, it can be extremely challenging to reproduce results given the stochasticity of random batches and asynchronous updates. Furthermore, precisely specifying the methods of data augmentation can be tedious, thus the protocol is often only partially detailed in published work which further exacerbates the challenges of reproducing results.


We could introduce a simple rule to get rid of this kind of problem: if it can't be reproduced with the data supplied, then it isn't true, so you can't publish.


Not to mention specifying the random seed and PRNG algorithm.
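
In a Keras/TF setup that would mean something like the following (a sketch; even with all seeds pinned, some GPU ops such as cuDNN convolutions remain non-deterministic):

    import random
    import numpy as np
    import tensorflow as tf

    # Pin every PRNG involved in a typical Keras + TF run.
    random.seed(42)
    np.random.seed(42)
    tf.set_random_seed(42)  # graph-level seed in TF 1.x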


While this sounds like it should be true, in practice it isn't.

See for example the TF vs Torch implementations of Pix2Pix https://github.com/affinelayer/pix2pix-tensorflow (at the bottom of the page).

The reasons for this are many: different order of operations, failure to propagate random seeds, different order of processing files, etc. I don't think there is a structured study of this, so that would be a great thing to do!


Didn't you hear? Azure has an entropy converter that makes deep nets deterministic!


The thing is, TensorFlow is racking up GitHub stars at 5x the rate of CNTK. It has all the momentum, and it's hard to see MS being able to slow it down.


Although it's fine for your relative benchmarking, did you notice whether Docker introduces noticeable overhead to training time vs. running without Docker?


From my testing before starting the benchmarks, no. I also searched around the internet beforehand, and others mention that there is no noticeable overhead.


It would be interesting to quantify this.

I did have to go through the rigmarole of building TensorFlow for GPU training in GCE (building TF, installing nvidia drivers, cuDNN, etc.), and Docker would have definitely been a boon! Or it would be nice if GCE had an image marketplace like AWS.


I wonder if this will start a flame war with fchollet.


In the PR he is very supportive of the addition of CNTK: https://github.com/fchollet/keras/pull/6800



