Well, first of all, just to clarify, I was asking a question, not making a statement. I recently ordered an HPC server for my company; we're using Caffe to train and run detection on very large datasets.
I went with the K80. The company we ordered it from charged us $4400 for the card, so there must be a good amount of markup that can be negotiated out of it.
I have since read some material comparing the K40 to the 980 that gives a slight edge to the 980, which is surprising considering the price points, but I have not yet found any good benchmarks or posts about the K80 vs the 980. The K80 is not just two K40s glued together: it uses the GK210 Tesla chip rather than the GK110. The GK210 is a more advanced chip with more cache and better energy efficiency, but I'm really not sure how that translates into real-world performance.
If anybody has any data or perspective on this, I would appreciate it.
Just my 2 cents. I've been trying my scientific computing code (QM/MM, not ML) on a cluster using various configs (6xK40, 4xK80, 6xK20, etc.), and the performance I've seen from the K80 is quite strange. I've been using CUDA devices 0,1,2,3 of that config, and if I try to use more than one logical GPU, the scaling is not 1:1 but more like 1:0.6.
The only conclusion I've been able to reach is that the K80 presents itself as two separate devices (0,1 or 2,3 in that config), but the performance is nowhere near 2x. There is quite a lot of PCIe bus contention, which badly hurts the performance of my code (it launches many <10 ms kernels at a time). So far, 2xK40 seems to be a better value and performance proposition than 1xK80 on the same bus, but the flops/watt side of that equation greatly favors the K80.
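For what it's worth, the scaling I'm describing is easy to quantify. A minimal sketch of the arithmetic (the timings below are made-up placeholders chosen to reproduce the ~1:0.6 behavior, not my actual benchmark numbers):

```python
def gpu_scaling(t_single, t_multi):
    """Return (speedup, extra) for a 2-GPU run vs a 1-GPU run.

    t_single: wall time of the job on one logical GPU (seconds)
    t_multi:  wall time of the same job on two logical GPUs
    extra:    effective contribution of the second GPU (speedup - 1),
              where 1.0 would be perfect linear scaling
    """
    speedup = t_single / t_multi
    return speedup, speedup - 1.0

# Hypothetical timings: 100 s on one logical GPU, 62.5 s on two.
speedup, extra = gpu_scaling(100.0, 62.5)
print(round(speedup, 2), round(extra, 2))  # 1.6 0.6 -- the second GPU adds only ~0.6x
```

With PCIe contention on many short kernels, the second logical GPU on the same board ends up contributing well under a full GPU's worth of throughput, which is what that 1:0.6 figure reflects.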