NVIDIA Announces the GeForce GTX 1000 Series (anandtech.com)
408 points by paulmd on May 7, 2016 | 221 comments



This article buys into the hype. It reads like they just copied statements from the press release. The article on AnandTech[0] is a lot better.

Also, do not forget that benchmarks made by Nvidia mean nothing. How fast the cards really are will only be clear once independent people do real benchmarks. That's true for games, but it will also be true for all the speculation found here about ML performance.

[0] http://anandtech.com/show/10304/nvidia-announces-the-geforce...


Fair enough, but when neither Intel nor AMD is bothering to put out competitive software/hardware solutions for deep learning, which is apparently going to be a ~$40B market by 2020, I predict continued NVIDIA dominance here. That said, since AMD is getting beaten by a factor of 2-3 in performance while Intel's accelerator/FPGA solutions are getting beaten by a factor of 5-10, AMD has a real opportunity to be the competitor to NVIDIA here.

And despite questions about it, GTX 1080 will be a fine deep learning board, especially if the framework engineers get off their lazy butts and implement Alex Krizhevsky's one weird trick algorithm, an approach which allows an 8-GPU Big Sur with 12.5GB/s P2P bandwidth to go toe to toe with an 8-GPU DGX-1 on training AlexNet.

http://arxiv.org/abs/1404.5997


That's just a combination of GPU model and data parallelism, which have been supported for about 2 years in Torch among others. You can add in the outer product trick for fc layers as well.

https://github.com/soumith/imagenet-multiGPU.torch/blob/mast...

nccl is in Torch as well, but it doesn't always win, and has some weird interactions regarding streams and other such things with its use of persistent kernels.

https://github.com/torch/cunn/blob/master/DataParallelTable....

However, this feels more like a benchmarking thing now. Networks tend to be over-parameterized and redundant in many ways; the action has been in sending less data with much deeper networks that have smaller parameter counts (e.g., residual networks, GoogLeNet, etc.), or in non-synchronous forms of communication among workers. Trying to squeeze out every last drop of GPU-to-GPU bandwidth is not as important as iterating on the architectures and learning algorithms themselves.


I don't use NCCL, I wrote my own ring collectives after reinventing them in 2014 (I felt really clever for a whole 20 minutes before a Google search popped my balloon) to avoid working with a vendor that was trying to gouge us on servers: they beat NCCL by ~50%.

That said, most people couldn't write them so I advise NCCL. You're the third person to tell me NCCL has ish(tm), fascinating.

And sure, you could do the outer product trick. You could use non-deterministic ASGD. And you can do a lot of weird(tm) tricks. But why do these (IMO ad-hoc and potentially sketchy task-specific) things when there's an efficient way to parallelize the original network in a deterministic manner for training that allows performance on par with a DGX-1 server?

Because for me, the action is in automagically discovering the optimal deterministic distribution of the computation so researchers don't need to worry about it. And IMO that's where the frameworks fail currently.


Pretty sure most frameworks support "one weird trick" style learning.

Eg Caffe: https://github.com/BVLC/caffe/blob/master/docs/multigpu.md


"The current implementation uses a tree reduction strategy. e.g. if there are 4 GPUs in the system, 0:1, 2:3 will exchange gradients, then 0:2 (top of the tree) will exchange gradients, 0 will calculate updated model, 0->2, and then 0->1, 2->3."

Nope... This is the stupid way... They should be using ring reductions, handily provided by the NCCL library:

https://github.com/NVIDIA/nccl
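
For anyone unfamiliar with the distinction: a ring all-reduce pipelines fixed-size chunks around the ring so every link is busy at every step, rather than funneling whole gradients up a tree. A toy single-process simulation of the communication schedule (just the pattern, not NCCL's actual implementation):

    #include <cstdio>
    #include <vector>

    int main() {
        const int P = 4;            // simulated "GPUs" in the ring
        const int N = 8;            // gradient elements, split into P chunks
        const int C = N / P;        // chunk size

        // Each rank starts with its own gradient (here: every element = rank+1).
        std::vector<std::vector<float>> buf(P, std::vector<float>(N));
        for (int r = 0; r < P; ++r)
            for (int i = 0; i < N; ++i) buf[r][i] = float(r + 1);

        // Phase 1, reduce-scatter: in step s, rank r passes chunk (r - s) mod P to
        // rank r+1, which adds it. After P-1 steps each rank owns one fully summed chunk.
        for (int s = 0; s < P - 1; ++s) {
            auto next = buf;
            for (int r = 0; r < P; ++r) {
                int chunk = ((r - s) % P + P) % P, dst = (r + 1) % P;
                for (int i = 0; i < C; ++i)
                    next[dst][chunk * C + i] += buf[r][chunk * C + i];
            }
            buf = next;
        }

        // Phase 2, all-gather: the finished chunks travel around the ring once more,
        // with receivers overwriting instead of adding.
        for (int s = 0; s < P - 1; ++s) {
            auto next = buf;
            for (int r = 0; r < P; ++r) {
                int chunk = ((r + 1 - s) % P + P) % P, dst = (r + 1) % P;
                for (int i = 0; i < C; ++i)
                    next[dst][chunk * C + i] = buf[r][chunk * C + i];
            }
            buf = next;
        }

        for (int i = 0; i < N; ++i) printf("%.0f ", buf[0][i]);  // prints 10 for every element
        printf("\n");
        return 0;
    }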


I wouldn't call it stupid. It was fine for when they wrote it. I would say their solution is a bit outdated though. I am a bit surprised that NVIDIA isn't allocating more engineering support to Caffe to make sure they support all of NVIDIA's latest features in a timely fashion. Torch at least tries to use nccl for nn.DataParallelTable by default.


IMO no it wasn't. Ring reductions are the sort of thing the MPI guys had a decade ago. This is (once again IMO) the sort of thing that happens when one doesn't do lit searches, and then Juergen Schmidhuber has a temper tantrum when we all find out one of his students probably has a good claim to have previously invented something supposedly new.

My heuristic whenever I think I've invented something is to aggressively exploit Google to prove me wrong. It usually does so, but sometimes in very unusual ways. Glad to hear about Torch though. Do they have automagic model parallelization (not data parallelization) as well?


> Do they have automagic model parallelization (not data parallelization) as well?

Not that I know of.

Regarding claims of novelty, I don't think the Caffe maintainers are claiming that their multi-gpu update method is novel or even very good. I think it was just the easiest thing someone could think of. I think Flickr originally wrote the multi-gpu extensions and the maintainers simply accepted the pull request.

If anything, I think the maintainers are more than willing to listen to people in the scientific computing community with experience. Even better if they have a pull request in hand. But otherwise, they probably won't know about better methods and won't care.


So from what I've heard unofficially from friends at NVIDIA, they've been pretty hard to work with such that NVIDIA just ended up making their own fork of Caffe for use within DIGITS.

https://github.com/NVIDIA/caffe

Am I missing something here?


OK, we've changed the URL to that from http://wccftech.com/nvidia-geforce-gtx-1080-launch/. Thanks.


Anandtech also has extremely good benchmarks, if you're ever looking to optimize bang for your buck when it comes to GPUs they have an excellent comparison tool.


I know, I use them in my hardware recommender ;) They also are very accessible under http://anandtech.com/bench/GPU15/1248, it is a nice tool.


What really blew me away is that they went straight for DisplayPort 1.4 (https://en.wikipedia.org/wiki/DisplayPort#1.4), which was only announced on the 1st of March 2016. DP 1.3 was approved on the 15th of September 2014, and as of today there are no cards supporting it except this one (with backwards compatibility).

The bad news is that there's no news about new-gen displays that could take advantage of this graphics card. I'm talking about 4K 120Hz HDR (https://en.wikipedia.org/wiki/High-dynamic-range_rendering) displays. This is a total WTF - we have a graphics card with DP 1.4 and we don't even have a single display with so much as DP 1.3...


IMO still preferable to what we had before - displays coming out with hacked-up interfaces because they needed more bandwidth than any video card could drive.

Displays get to play catchup for a while. This is great - it means we can avoid stupid hacks like MST.


Well, chicken and egg problem

It would also be useless to have the displays first and cards later

I guess some manufacturers have it lined up already


Well, that's understandable, however comments like this are really discouraging: "Proper HDR support for monitors is years away. Some budget trash with HDR will be released, but they won't come close to matching the HDR standard and will have tons of issues." (https://www.reddit.com/r/Monitors/comments/4hj4xv/what_are_s...)


There are monitors with 10 bit color depth, what exactly are you hoping to get beyond that?


10 bit is not the same as HDR and vice versa - http://www.dpreview.com/forums/thread/3822463


I think this article (from 2005!) is still a quite good introduction to HDR: http://www.bit-tech.net/hardware/2005/10/04/brightside_hdr_e...

Your link is now a bit outdated in that these days there are plenty of consumer HDR standards/names out there. HDR10 and Dolby Vision are the two main competing standards, and we have marketing terms like "Ultra HD Premium" or "HDR Compatible".

https://www.cta.tech/News/News-Releases/Press-Releases/2015-...

http://www.whathifi.com/advice/ultra-hd-premium-what-are-spe...

I think NCX is being a bit overly negative (in his characteristic style) in his predictions. Sure, there will be cheap crap released, but considering that almost all major manufacturers are pushing this tech heavily on TV screens, it can't be that long until it trickles down to monitors too. Of course you can plug your PC right now into one of those fancy new HDR TVs; I'm not sure though if you can get HDR actually working. I guess that would be mostly a software issue?


I didn't say it was. None of that explains what HDR means technically as it relates to monitors. Everything uses the term, but we already have numbers for the bit depth of the signal and the contrast ratio of the monitor.

So if it isn't just more contrast, then what does it actually mean? In images it means the image has values stored above what can be displayed.


I think it's actually a smart choice: as much as I doubt there will be monitors with that spec, there will for sure be VR headsets that will be able to use it, and likely not that far in the future.


What cable transfers 4k 120-144hz? What are the benefits of DisplayPort 1.4 over something like Dual Link DVI or the newest HDMI?


I don't have access to the spec for 1.4, but from what I can understand it's still a single cable that will do 4K 120hz HDR. Obviously it'll have to be a supported 1.4 capable cable, but that's the same as CAT 5 vs CAT 6

Dual link DVI is significantly old-hat compared to the newer DisplayPort and HDMI standards. A single dual link DVI cable can carry either half a 4K frame (1920x2160) at 8 bits per color, 60hz, OR a single 1080p frame at 10 bits per color, 60hz. To get a 4K 60hz display running you need to use two dual link DVI cables, which is insanely bulky. Plus you at best get 8 bit color out of it, and no audio.

The latest HDMI standard can do 4096x2160 60Hz in deep color (12 bits per color).

DisplayPort 1.4 has a higher maximum bandwidth than HDMI 2.0 (32.4 Gbps vs 18 Gbps). This allows it to do 120Hz, deep color/HDR, or carry multiple 4K streams (multiple monitors or 3D) on one cable. It's by far the most advanced standard out there for consumer video transmission.

Source: I'm an engineer working with video systems, and Wikipedia


Displayport 1.4 is a standard for socket and cable, so it will transfer 4K 120Hz.

Advantages over other standards: https://en.wikipedia.org/wiki/DisplayPort#Advantages_over_DV...

Also, another major advantage - USB Type-C compatibility (http://www.displayport.org/what-is-displayport-over-usb-c/)


Why didn't video cables go optical? Like TOSLINK did with audio?


There are lots of options for optical cabling (e.g. cheap plastic fibers vs. glass fibers) and all have different tradeoffs, e.g. available bandwidth vs. cost. However, from what I've seen, basically all options are more expensive than equivalent electrical connections (we use both optical and electrical networks in cars).

A major advantage of optical is that there are no EMI/EMC problems and that you are electrically decoupled (e.g. no ground/hum loops). This might be one of the reasons why TOSLINK is optical. For displays, ground loops are typically much less of a problem than for amplifiers/speakers.


Thunderbolt was supposed to go optical. You can read about it here, under "Copper vs. optical":

https://en.wikipedia.org/wiki/Thunderbolt_(interface)


USB3 was also supposed to have fiber components. Really hard to find references for this online since at CES 2008 they showed the connectors without any optical stuff, but I found this: http://www.theregister.co.uk/2007/09/19/idf_usb_3_announced/

Anyone know why they didn't end up going optical + copper for USB3?


No advantage. It's still all wires inside the computer and the monitor.


Yes, but inside the computer you have the luxury of static, pre-calculated track layouts that optimise for latency, noise and general integrity. A general purpose, variable length, high bandwidth cable and connectors is a much harder problem, thus the kind of issues we see here sending raw video signals of extremely high fidelity between card and monitor.


Latency and cost, probably.


What latency?


Electric to light to electric would add latency perhaps? Random guess


It wouldn't, not at the scale of inter-device cables.


I wouldn't call it an advantage, but DP 1.4 can cheat in the bandwidth department by using lossy compression.

Their press statements say it's visually transparent, but the technical overview[1] says this:

> All of the analyses showed that the DSC algorithm outperformed five other proprietary algorithms on these picture quality tests, and was either visually lossless or very nearly so for all tested images at 8 bits/pixel.

[1] http://www.vesa.org/wp-content/uploads/2014/04/VESA_DSC-ETP2...


Ugh, I can see this getting very annoying - having to work out whether something is "true" DisplayPort 1.4.


This screen is supposed to do 4K 120Hz; I haven't seen a hands-on review yet though. http://www.144hzmonitors.com/monitors/dell-up3017q-30-inch-4...


How come Thunderbolt 3 still supports only DisplayPort 1.2? Which is the same as what Thunderbolt 2 supported.


It likely has to do with the timing at which the standards were finalized. DP1.3 / DP1.4 probably weren't ready when they needed to finalize TB3. Unfortunate but alas, standards.


Won't something like a Thunderbolt 3 over usb-c be required for a 120Hz HDR 4K vid stream?


DisplayPort 1.4 can support 8K UHD (7680×4320) at 60 Hz with 10-bit color and HDR, or 4K UHD (3840×2160) at 120 Hz with 10-bit color and HDR. 4K at 60 Hz with 10-bit color and HDR can be achieved without the need for DSC. On displays which do not support DSC, the maximum limits are unchanged from DisplayPort 1.3 (4K 120 Hz, 5K 60 Hz, 8K 30 Hz). https://en.wikipedia.org/wiki/DisplayPort#1.4
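
A rough back-of-the-envelope check on why DSC enters the picture at 4K 120 Hz with 10-bit color (this ignores blanking overhead, so the real requirement is a bit higher):

    #include <cstdio>
    int main() {
        double pixels_per_s = 3840.0 * 2160 * 120;       // 4K UHD at 120 Hz
        double needed_gbps  = pixels_per_s * 30 / 1e9;   // 10 bits per color channel
        double dp14_payload = 32.4 * 8 / 10;             // HBR3 raw rate minus 8b/10b coding
        printf("needed ~%.1f Gbps vs ~%.2f Gbps of DP 1.4 payload\n",
               needed_gbps, dp14_payload);               // ~29.9 vs 25.92 -> DSC required
        return 0;
    }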


Note that in terms of pure compute performance the new 16nm Nvidia GTX 1080 (2560 shaders at up to 1733 MHz = 8873 SP GFLOPS) barely equals the performance of the previous-generation 28nm AMD Fury X (4096 shaders at up to 1050 MHz = 8602 SP GFLOPS). Of course the 16nm chip does so at a significantly lower TDP (180 Watt) than the 28nm chip (275 Watt), so it will be interesting to see what Nvidia can achieve at a higher thermal/power envelope with a more high-end card... I am waiting impatiently to see how AMD's upcoming 14nm/16nm Polaris chips will fare, but from the looks of it it seems like Polaris will beat Nvidia in terms of GFLOPS per Watt.
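
For reference, the arithmetic behind those figures: single-precision GFLOPS is usually quoted as shader count x 2 FLOPs per clock (one fused multiply-add) x clock in GHz.

    #include <cstdio>
    int main() {
        double gtx1080 = 2560 * 2 * 1.733;   // ~8873 SP GFLOPS at the 1733 MHz boost clock
        double fury_x  = 4096 * 2 * 1.050;   // ~8602 SP GFLOPS at 1050 MHz
        printf("GTX 1080: %.0f GFLOPS, Fury X: %.0f GFLOPS\n", gtx1080, fury_x);
        return 0;
    }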


The most interesting thing about AMD is their ability to create GPUs with superior synthetic specs to their NVIDIA competitors, and yet they lose across the board in anything but embarrassingly parallel applications. Not only do they lose, but they lose big.

Here are two examples. First, the engine behind Folding@Home: https://simtk.org/plugins/moinmoin/openmm/BenchmarkOpenMMDHF...

See also AlexNet (Deep Learning's most prominent benchmark): https://github.com/amd/OpenCL-caffe versus GTX TitanX: https://github.com/soumith/convnet-benchmarks

TLDR: AMD losing by a factor of 2 or more...

So unless this changes dramatically, NVIDIA IMO will continue to dominate.

But hey, they both crush Xeon Phi and FPGAs so it doesn't suck to be #2 when #3 is Intel.


Which FPGAs in which workloads do they crush?


Arria 10: http://www.nextplatform.com/2015/08/27/microsoft-extends-fpg...

600 images/s inference (the highest ever publicly reported), projected to be as high as 900 images/s. ~$5000 per FPGA.

Compare and contrast with TitanX/GTX 980 TI now topping 5000 images/s. GTX 1080 will only be faster than this and $600.

https://github.com/soumith/convnet-benchmarks

And so far, no FPGA training numbers. But here are some distributed training numbers for $10Kish Xeon servers:

https://software.intel.com/en-us/articles/caffe-training-on-...

64 of them train AlexNet in ~5 hours. A DGX-1 does it in 2. Don't have/Can't get a $129K DGX-1? Fine, buy a Big Sur from one of several vendors and throw in 8 GTX 1080s as soon as they're out, implement "One Weird Trick", and you'll do it in under 3 hours. That ought to run you about $30-$35K versus >$600K for those 64 Xeon servers.

FPGAs OTOH are fantastic at low memory bandwidth embarrassingly parallel computation like bitcoin mining. Deep Learning is not such a domain.


I don't see anywhere on that page any benchmark about 5000 images/sec, maybe you should update your link to a page showing that info or update your claim to match what you link to?

Even your previous post had links to two pages that were not measuring the same thing... and thus could not support your claims?

I think you need to clarify.


Based on the text you apparently couldn't be bothered to read, for AlexNet, a forward pass on 128 images (Input 128x3x224x224) takes ~25 milliseconds (ignoring O(n) dropout and softmax where n is 128). I'll let you do the math for the rest of this...
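
For anyone who doesn't want to do that math, a rough conversion of the quoted forward-pass time into throughput (ignoring the dropout/softmax caveat above):

    #include <cstdio>
    int main() {
        double batch = 128, seconds = 0.025;           // 128 images per ~25 ms forward pass
        printf("~%.0f images/s\n", batch / seconds);   // ~5120, i.e. the >5000 figure quoted earlier
        return 0;
    }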


Yeah, it will be interesting to see how good AMD's Polaris chips will be, especially when it comes to energy efficiency. Fury was a step in the right direction (I'm wondering how much of this can be attributed to HBM's lower voltage etc.). But I'm even more interested in AMD Vega - the high-end competitor to Nvidia's GP100 that is equipped with HBM2. Especially if AMD has an advantage over Nvidia when it comes to implementing HBM, since they developed it together with SK Hynix and have previous experience with Fury, or if this doesn't make a difference.

Vega is also exciting because it should be released around when AMD's Zen CPUs will be released, and there were rumors about an HPC Zen APU with Vega and HBM.


If we're discussing future GPUs, does anyone know what exactly AMD means by "next-gen memory" when it talks about Navi?

At first I thought it means that HBM2 would be available across their whole lineup, but I think if that was the case they would've just said so.

I've also read a bit about how Nvidia wanted to go with HMC (Hybrid Memory Cube) instead of HBM, but it seems it's still at least twice as expensive per GB, and they needed to go with high-end GPUs that have at least 16GB. HMC is not even at 8GB yet. Intel also seems to have adopted HMC for some servers.

So is the "next-gen" memory HMC, or something else? AMD is supposed to come out with it in 2018, hopefully at 10nm.


Wasn't the difference between HMC and HBM basically that HMC had good latency comparable to normal main RAM like DDR3 plus good bandwidth (up to 400 GB/s), while HBM had worse latency than DDR3 (like GDDR5) with a very wide bus and really good bandwidth (1 TB/s)? Will indeed be interesting to see what AMD means by next-gen memory.


No. HMC is fully buffered and provides a packet-based interface to the CPU it's connected to. This means that the CPU doesn't need to know anything about the memory it uses, and it saves die space on the CPU side, but it adds latency. HBM has lower latency than HMC.

Also, GDDR5 does not have higher latency than DDR3. The memory arrays themselves are the same and exactly as fast, and the bus doesn't add any extra latency. GDDR5 doesn't trade latency for bandwidth, it trades density for bandwidth.

Things built to use GDDR5 often have much higher latencies than things built to use DDR3, but this has nothing to do with the memory and is instead about how the memory controllers in GPUs delay accesses to merge ones close to each other to optimize for bandwidth.


Thank you for the detailed explanation. Always thought the added latency came from GDDR5. Good to know.


Both Vega and Zen are supposed to be due sometime next year, correct?


Yes, Zen for the desktop (Summit Ridge) is even listed for the end of 2016 [1], followed by the first Zen APUs (Raven Ridge) sometime in 2017 and Vega in early 2017. [2]

[1] http://imgur.com/rRjKZDk [2] http://i.imgur.com/lkjgayO.jpg?1


AMD chips have almost always had a lead over Nvidia from a raw-FLOPS perspective, and AMD's architecture is definitely ahead of Nvidia as far as supporting integer operations and bitwise ops efficiently (see their dominance of cryptocurrency mining). However, Nvidia's software tooling has always been superb, and they basically own the GPGPU-for-HPC space because of it. On top of that, one of the biggest growth areas for consumers of GPGPU in recent years has been "deep learning", which Nvidia has embraced and for which it has great tooling and software on its hardware. AMD has the potential to compete with Nvidia in this area, but they don't seem to have the resources to develop optimized neural network software to run on their GPUs like Nvidia does, so they are way behind. Basically, GFLOPS per Watt is nice and all, but if nothing you care about runs on the hardware, you won't buy it. AMD seems to have been stuck in this position for HPC, and it doesn't seem to be changing, sadly.


AMD pulls off better compute in pure GFLOPS but you'll have to extract that compute power in OpenCL. AMD still doesn't have a good answer to CUDA.


AMD are supporting open standards like OpenCL, which is effectively their answer to CUDA. The problem is that Nvidia refuse to keep their drivers up to date with the latest OpenCL capabilities. The latest version nVidia supports is OpenCL 1.1, but it's now up to OpenCL 2.2. nVidia are years behind on OpenCL. This is pretty much the primary reason why CUDA seems so far ahead.

nVidia released new features and capabilities in CUDA that later versions of OpenCL also supported, but which were not available on nVidia cards due to lack of OpenCL driver support. Hence, lots of people developed libraries and tools for CUDA, with great encouragement and support from nVidia (which attracted developers). These tools were useful and powerful, encouraging others to also use nVidia hardware to take advantage of those tools. CUDA has now taken root.

If nVidia were to start keeping their drivers updated with the latest version of OpenCL, I suspect CUDA's dominance would start to wane. Though the CUDA roots have gone deep, so it may take several years to reverse its dominance. However, it's highly unlikely that nVidia will update their drivers any time soon. The only thing that would probably force them is AMD gaining market dominance for a few years.


AMD is also developing a tool to convert CUDA into HIP but I haven't looked into that

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/...

https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP


AMD are "supporting" OpenCL as in it works. If you compare the resources NVidia are putting into their deep learning support vs AMD it is miles ahead.

If you read between the lines of how dismissive AMD were of NVidia's work on self driving cars you can see they don't care about compute.

They are chasing the VR market, and that is fine.


OpenCL is however a fairly nice target for compilers, which opens up different ways of using it.

Witness also PyOpenCL versus PyCUDA - both are essentially equally convenient to use, while raw OpenCL C is a nightmare compared to CUDA C.


I think the near future in compilers is going to be dominated by accelerator-capable LLVM variants (such as NVVM). Ideally, all that Nvidia, AMD and Intel should need to do is provide LLVM-based backends. Frontends just need a sane way of specifying SIMT. And no, loop + directive based stuff like OpenMP and OpenACC is IMO just a stopgap measure in order to serve higher-ups the lie that porting is going to be swift and easy. There's basically one sane way to deal with data-parallel problems, and that's to treat what's going on in one core as a scalar program with precomputed indices (i.e. the CUDA / OpenCL way).

Python has some good support for this already. I haven't done a project like this yet, but if setting up something fresh I'd immediately jump on NumPy interacting with my own kernels and some Python glue for the node level parallelism.
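
For anyone wondering what "a scalar program with precomputed indices" looks like in practice, the canonical example is SAXPY in CUDA (shown purely as an illustration, not anyone's production kernel):

    // Each thread computes its own index and then runs what is otherwise an
    // ordinary scalar program on that one element.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // the precomputed index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // host-side launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);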


> AMD still doesn't have a good answer to CUDA.

I don't understand why AMD doesn't just jump on the CUDA bandwagon and support the API?


Alternatively they could make their own API, independent of OpenCL. I have written a small amount in each language (a rainbow table generator), and found the CUDA version much more pleasant to use. CUDA made it really easy to get a good idea of the underlying hardware, and make architectural changes in order to make the code faster.

OpenCL (using an ATI card) was much harder to program, since the abstraction level was much higher - it was harder to write two separate kernels and have each be faster than a generic version that ends up compromising for compatibility.

The OpenCL one ended up being faster, but I suspect that's due to ATI hardware being superior at the time.


They are working on something like that - http://www.extremetech.com/extreme/218092-amds-boltzmann-ini...

The current versions only work with a couple of the latest AMD chips though. And CUDA is just one part of it; they have nothing equivalent to cuDNN.


1. They have no say in the API so NV would just steer the API in ways to make sure it would be suboptimal on AMD hardware.

2. It's not just the API. It's the entire SDK that has NV specific hooks that AMD can't tap into


They do. They have a retargeting compiler. Its performance sucks.


The Fury X is AMD's current flagship, not last-gen.


I meant previous-gen chip fab process (28nm) vs. new-gen (14/16nm).


Worth noting - IIRC the stream mentioned that it could do up to 16 simultaneous projections at little additional performance cost. This is important for VR... a big part of the cost, when you are dumping many vertices to the GPU, is performing a transform on each vertex (a four-component vector multiplied by a 4x4 matrix) +. An even bigger cost comes from filling the resulting polygons, which if done in two passes (as is fairly common) results in something that violates cache across the tiles that get filled. So, in other words, it's expensive to render something twice, as is needed for each eye in VR - from what they have shown, their new architecture largely reduces this problem.

+ This is a "small" part of the cost, but doing 5m polygons at 60 fps can result in about 30 GFLOPS of compute for that single matrix operation (in reality, there are many vertex operations and often many more fragment operations).


I was thinking about this earlier - the transition from a warp of 32 to a warp of 64 that Pascal supposedly made sounds exactly like how you would accelerate multiple projections (by at least a factor of 4).

edit: apparently Pascal still has a warp-size of 32.


I am not sure I follow - how does processing 64 vertices instead of 32 at once accelerate anything, provided the total number of ALUs is the same?


A major concept in GPGPU programming is "warp coalescing".

Threads are executed an entire warp at a time (32 or 64 threads). All threads execute all paths through the code block - e.g. if ANY thread executes an if-statement, ALL threads execute an if-statement. The threads where the if-statement is false will execute NOP instructions until the combined control flow resumes. This formula is followed recursively if applicable (this is why "warp divergence/branch divergence" absolutely murders performance on GPUs).
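
A minimal illustration of that divergence point (hypothetical kernel, not from any real codebase): even and odd lanes of the same warp take different paths, so the warp executes both paths back to back with half the lanes masked off each time.

    __global__ void divergent(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes run this while odd lanes idle...
        else
            out[i] = in[i] + 1.0f;   // ...then odd lanes run this while even lanes idle
    }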

When threads execute a memory load/store, they do so as a bank. The warp controller is designed to combine these requests if at all possible. If 32 threads all issue a request for 32 sequential blocks, it will combine them into 1 request for 32 blocks. However, it cannot do anything if the request isn't either contiguous (a sequential block of the warp-size) or strided (a block where thread N wants index X, thread N+1 wants index X+s, thread N+2 wants X+2s, etc). In other words - it doesn't have to be contiguous, but it does have to be uniform. The resulting memory access will be broadcast to all units within the warp, and this is a huge factor in accelerating compute loads.

Having a warp-size of 64 hugely accelerates certain patterns of math, particularly wide linear algebra.

edit: apparently Pascal still has a warp-size of 32.


Wow, memory access on NVidia is pretty bad. AMD has a separate unit that coalesces memory requests and goes through the cache, so if you do strided loads, for example, the next load will likely be reading data cached by the previous one, and it does not matter how many lanes are active. AMD has 64-wide "warps" btw, and it does not seem superior to NV in computation on the same number of ALUs.


I did my grad research on disease transmission simulations on GPUs, so this is super interesting to me. Could you please hit me with some papers or presentations?

The NVIDIA memory model also goes through L1 cache - but that's obviously not very big on a GPU processor (also true on AMD IIRC). Like <128 bytes per thread. It's great if your threads hit it coalesced, otherwise it's pretty meaningless.


I program AMD chips in game consoles so I use a different set of manuals but AMD has a lot of docs available to public at http://developer.amd.com/resources/documentation-articles/de...

At a glance there is a lot of legacy stuff, so I'd look at anything related to GCN, Sea Islands and Southern Islands. Evergreen, R600-800 etc. are legacy VLIW ISAs as far as I know.


There's also the fairly recent GCN3 ISA documentation from 2015 available that sheds light on their modern hardware architecture.


Well, sheds light on their Compute Unit architecture.


A friend of mine will be starting an epidemiology grad program this fall. Do you have some good basic pointers on the use of GPUs in the field? Also .. what is your opinion of the field in general?


the hardware performs some coalescing, but it's complicated...

memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.

thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.
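
to put numbers on that (a quick sanity check of the ~3% figure, nothing more):

    #include <cstdio>
    int main() {
        const double lanes = 32, word = 4, line = 128;   // 32 lanes loading 4-byte words
        // all lanes within one 128-byte range: 1 cache line moved, all of it used
        printf("within one 128B range: %.1f%%\n", 100 * lanes * word / line);
        // 128-byte stride: every lane pulls in its own 128-byte line for 4 useful bytes
        printf("128B stride:           %.3f%%\n", 100 * lanes * word / (lanes * line));
        return 0;
    }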


Thanks, this is about how I imagined all GPUs work. With most games using interleaved vertex buffers it would be a very strange decision to penalize this access pattern.


Actually...that might be changing on GCN. Check out Graham Wihlidal's presentation from this past GDC. Advocates some benefits of de-interleaved VBs

http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf


It might, but most likely it won't: splitting position data from the rest of the vertex has been around since at least the PS3 (for similar reasons - culling prims), and yet how many games did Edge's tri-culling in that generation? And even if you go ahead and split off position, the rest of the vertex is still better off interleaved.


It wouldn't, because as you say the ALUs are the same; it would just change the unit that the warp scheduler deals with (and entail the relevant changes to kernel occupancy).


Unless you halve the precision going into the ALU (FP16 vs FP32), of course...


the warp scheduler can dual-issue instructions; it has to because otherwise 192 fp32 alus can't be used by the 4 warp schedulers (4 x 32) effectively. increased ipc can do this for you. this was maxwell though; the ratio of resources to schedulers to sm has changed a lot in pascal.

look up "dual-issue":

http://on-demand.gputechconf.com/gtc/2013/presentations/S346...


Warp size is still 32 in Pascal.


I've found the application to multi-screen setups more exciting.


That's exciting as well, but for whatever reason I've always personally found multi-monitor gaming to be a little annoying (even with compensated projection). Probably ok if you have super small monitor bezels though.


Can someone provide a quick overview of the current GPU landscape?

There seems to be Nvidia's pascal, gtx, titan, etc. Something called geforce. And I believe these are just from Nvidia.

If I'm interested in building out a desktop with a GPU for: 1. Learning computation on GPUs (matrix math such as speeding up linear regression, deep learning, CUDA) using C++11 2. Trying out the Oculus Rift

Is this the right card? Note that I'm not building production models. I'm learning to use gpus. I'm also not a gamer, but am intrigued by oculus. Which GPU should I look at?


Do you need FP64? If so, right now the OG GTX Titan is the default choice - it offers full double-precision performance with 6 GB of VRAM. There's nothing better south of $3000.

If not, the 980 Ti or Titan X offer excellent deep learning performance, albeit only at FP32. And their scheduling/preemption is not entirely there, they may not be as capable of Dynamic Parallelism as Kepler was. The 780 Ti is actually a more capable card in some respects.

The new consumer Pascal cards will almost certainly not support FP64, NVIDIA has considered that a Quadro/Tesla feature since the OG Titan. If DP performance is a critical feature for you and you need more performance than an OG Titan will deliver, you want the new Tesla P100 compute cards, and you'll have to convince NVIDIA you're worthwhile and pay a 10x premium for it if you want it within the next 6 months. But they probably will support compute better, although you should wait for confirmation before spending a bunch of money...

For VR stuff or deep learning, the consumer Pascal cards sound ideal. Get a 1070 or 1080, definitely. The (purportedly) improved preemption performance alone justifies the premium over Maxwell, and the FP16 capability will significantly accelerate deep learning (FP16 vs FP32 is not a significant difference in overall net output in deep learning).


Question following up on the remark "Do you need FP64? If so, right now the OG GTX Titan is the default choice"

I am purely after FP64 performance for scientific compute.

a) what does the "OG" stand for? b) what about the titan black model? seems to offer yet a bit more FP64 performance than the normal Titan?


nothing better south of $3k? does that include the new AMD Pro Duo? or is that just useful for VR?


The AMD Pro Duo seems aimed at the workstation/rendering market rather than the compute market.

NVIDIA is totally dominant in the compute market. They have an enormous amount of software running on CUDA that you would be locking yourself out of with AMD, and since NVIDIA has such a dominant share of the compute hardware you would also be tuning for what amounts to niche hardware.

AMD has recently been working on getting a CUDA compatibility layer working, hopefully this will improve in the future.


The other replies cover what's best for machine learning, but as for your terminology question:

- Pascal: codename for Nvidia's latest microarchitecture, used in the new GeForce 10-series announced today. (AMD counterpart is "Polaris".)

- GeForce: Nvidia's consumer graphics card brand. (AMD equivalent is "Radeon".)

- GTX: label for GeForce graphics cards meant for gaming and other demanding tasks (see AMD Radeon "R9").

- Titan: special name for super high-end GeForce graphics cards. (For AMD, see "Fury".)

You might also encounter Nvidia Quadro graphics cards (comparable to AMD FirePro), which are meant for professional workstations, and Nvidia Tesla graphics cards, which target high performance computing.


If you have any interest in deep learning, NVIDIA is your only option. They are miles ahead of everyone else. For deep learning the 1080 is the best option. You'll be waiting for models to train and the extra money is well worth the many hours it will save you. If you aren't serious about deep learning then get the 1070, it's more than enough for VR and CUDA experimentation. Or, if you just can't wait until June (or July/August with supply constraints probably), get a 970 now (min spec for VR).

The 1080 will probably be the best card available until next year, when HBM2 cards (~3x memory bandwidth) reach general availability. I'm hoping for a 1080 Ti or a new Titan then.


Question:

What are the characteristics of NVIDIA GPUs that make them superior for deep learning applications?

Phrased another way, if you're designing a card specifically to be good for training deep neural nets, how does it come out differently from cards designed to be good at other typical applications of GPGPU?


Nvidia's main advantage over AMD for deep learning is software, not the hardware itself. Nvidia's software advantage has led to network effects where almost all GPU-supporting deep learning libraries build on top of Nvidia. There's no Tensorflow/Theano/Torch/Keras/etc for AMD/OpenCL for the most part. Nvidia also releases high-performance kernels for common neural net operations (such as matrix multiply and convolution).

On the hardware end, Nvidia has slightly superior floating point performance (which is the only thing that matters for neural nets). Pascal also contains 16 bit floating point instructions, which will also be a big boost to neural net training performance.


That makes a lot of sense about the software.

I would be interested to hear about the difference in floating point performance. I would have guessed that, at this point, pretty much every chip designer in the world knows equally well how to make a floating-point multiplier. So it must be that Nvidia vs AMD have made different trade-offs when it comes to caching memory near the FP registers or some such thing?


I'm unsure about the floating point performance differences, but 2 other reasons for potential differences are (1) number of floating point units and (2) different clock speeds.


It doesn't hurt that the Titan X and its professional counterpart have 12 GB of VRAM.


Buy a GTX 1070 if you are on a "budget" (around $350ish, comes out June 10th) or a GTX 1080 if not ($600ish, comes out may 27th). Do not buy a Titan model unless you feel you really need to work with 64 bit numbers on the GPU (32 bit floats should cover most of your cases).


I would get a NVIDIA card that Oculus recommends. Any card that can support Oculus Rift will have more than enough power for learning GPU computation.


I am waiting for the Nvidia GTX 1070, and sincerely hope NVidia doesn't fuck it up again like the GTX 970.

GTX 970: It was revealed that the card was designed to access its memory as a 3.5 GB section, plus a 0.5 GB one, access to the latter being 7 times slower than the first one. -- https://en.wikipedia.org/wiki/GeForce_900_series#False_adver...


Nvidia didn't release as much information on the GTX 1070, but we know it will have 8GB GDDR5 (not GDDR5X) and offer 6.5 TFLOPs (vs. 9 TFLOPs for the GTX 1080). It will have a $379 MSRP and officially launch on June 10.

Source: http://www.anandtech.com/show/10304/nvidia-announces-the-gef...


I have a GTX 970 and have not yet noticed any issues as a result of that memory issue. Maybe I'm not pushing the card hard enough...


Weird comment. Like a grandma driving a Ferrari with a broken gearbox: "I am perfectly fine driving at moderate speed, maybe I am not pushing the car hard enough. I am perfectly fine that the mechanic deactivated the malfunctioning higher gears".

You may not notice it nowadays because several games received patches and newer nVidia drivers limit the memory consumption to just below 3.5 GB. Otherwise you would experience a major slowdown. Also a major problem for CUDA.


If games truly don't need access to the last .5GB (and they seem to not in most cases if there's patches there) and all the parent does is game (which is what I do on my 970) then why would it even matter if the last .5GB is slow, or that the card wasn't being "pushed hard enough"? It's a gaming card, and using it for gaming seems to be working just fine, in my anecdotal experience at least. Gaming benchmarks have also been great for the card.

Just because not everyone writes CUDA applications that need access to all 4GB does not mean that people are using it wrong, or that they should mind that the last .5GB is slow. If it works for others' use then that's great for them.


I see your point, but I use a 970 and just finished Dark Souls 3 on max settings without a single hiccup.

No doubt that with future games the 970 will not perform as well as it should, but I will have a different card by then. What they did is a shame and shouldn't happen again, but I haven't noticed any real-world ramifications yet.


How was DS3? I liked DS but haven't tried DS2 yet.


It's amazing(ly frustrating).


Nope, you are not. Write a basic CUDA application that uses all 4GB of vram, you will see the difference in access speed as soon as you cross 3.5GB.
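
A rough sketch of that experiment (names like touch/time_window are mine; it also assumes a single large cudaMalloc on a 4 GB GTX 970 spans the slow 512 MB segment and that the slow part is mapped last, neither of which is guaranteed):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float* p, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) p[i] += 1.0f;
    }

    static float time_window(float* base, size_t nfloats) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        touch<<<(unsigned)((nfloats + 255) / 256), 256>>>(base, nfloats);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        return ms;
    }

    int main() {
        const size_t total  = 3800ull << 20;                  // ~3.8 GB of the 4 GB card
        const size_t window = (256ull << 20) / sizeof(float); // same 256 MB window both times
        float* buf = nullptr;
        if (cudaMalloc(&buf, total) != cudaSuccess) { printf("alloc failed\n"); return 1; }
        time_window(buf, window);                             // warm-up launch (very rough test)
        float low  = time_window(buf, window);                // window near the start
        float high = time_window(buf + (3500ull << 20) / sizeof(float), window); // past 3.5 GB
        printf("low offset: %.2f ms, past 3.5 GB: %.2f ms\n", low, high);
        cudaFree(buf);
        return 0;
    }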


driver update prevents the card from using the last 0.5 GB IIRC


Although I agree that they fucked up the marketing, they didn't fuck up the card.

Even with 3.5GB it's still a great cost/performance card.


I was helping a friend put together a Titan X "rig" and we realized that case space, power supply and motherboard slots were some mundane but frustrating challenges. For someone building out a new rig for personal/hobbyist work in deep learning, any recommendations? Is the best setup to get two 1080s, 16-32 GB of RAM and a 6th generation i7?


I built a rig a couple days ago; it works nicely both for deep learning and for RAM and CPU-heavy tasks: http://pcpartpicker.com/p/GvTGsY - two 8 core xeons, 128GB RAM, slots for 2 GPUs for ~2000$. Currently it has gtx970; likely I'll add 1080 at some point in future.

I've heard the 6th gen i7 is not good for deep learning because its PCIe connectivity is limited (16 PCIe lanes instead of 40 in previous generations; it should matter for the dual-GPU use case). Don't quote me on that ;) Used Xeon 2670 v1 CPUs are dirt cheap on eBay now, and they are modern enough. Single-core performance is worse than in modern desktop CPUs, but not by too much, multi-core performance is great, and these Xeons allow you to install lots of RAM.

If you don't want that much RAM then for the same price a single desktop CPU (i7 5th gen?) could work better because of a higher clock rate.


This is an awesome build. It has everything I was thinking of building, dual xeon, huge ram, a nice gpu. I may use this as a model for my first desktop in years. I will probably add a couple of spinning disks so I can easily move between linux server and a windows (oculus) desktop.

Usually pcpartpicker builds have comments, did you not submit your build for review? (I don't actually know how the site works).


Thanks! Yeah, adding a spinning drive is a good idea. It seems part list should be saved to enable comments (http://pcpartpicker.com/user/kmike/saved/QnGp99). I used pcpartspicker for the first time to well, pick pc parts (the UI is nice); haven't shared this list before.


@kmike84 Is ASRock EP2C602-2T/D16 okay? can you please recommend other motherboards that are compatible with the setup? Thanks for posting the rig :)


Hey,

ASRock EP2C602-2T/D16 has an EEB 12" x 13" form factor, so it may be a bit harder to find a case for it (but that's possible, at least the form factor is standard). Also, it has 2 PCIe slots, which means you can insert only a single GPU - each GPU usually takes 2 nearby slots. The nice thing is that you can put in more RAM sticks; it means you can either get more RAM or save money by using e.g. 8GB sticks (which are cheaper per GB) instead of 16GB sticks.

There are Supermicro motherboards similar to the ASUS Z9PA-D8C, but AFAIK they have non-standard mounting points, and Supermicro-branded cases are very pricey. There are also a few other ASUS motherboards (Z9PE-D8 WS, Z9PE-D16); they have nice features - the D16 has more RAM slots and the D8 WS has more PCIe slots, but the D8 WS costs more and neither uses the ATX form factor (though their form factors are still standard).

The motherboard I've chosen is not without gotchas - the CPU sockets are very close to each other, so you can't use a cooler that exceeds the socket footprint; most popular coolers don't fit. To install Ubuntu I had to first use the integrated video to install the nvidia drivers, and then I had to disable the integrated video using an on-board jumper (there were some Ubuntu configuration problems with both the internal and nvidia GPUs active), but likely this is not specific to this MB.

There may be other motherboards; look for the C602 chipset, check the form factor, how many RAM and PCIe 16x slots there are, how the PCIe speed is adjusted when several slots are occupied (e.g. often a 16x slot becomes an 8x slot when something is inserted in another slot), and check the MB image to understand how the PCIe slots are laid out: when two of them are next to each other, a GPU covers them both. E.g. the ASRock Rack EP2C602 looks fine (it is EEB, and has no USB3, but the CPU sockets are laid out better than on the ASUS Z9PA-D8C, you can use more cooler models, and it is a bit cheaper). No idea about which brands are better, I only built such a computer once.


Thanks a lot for explaining in such detail. Really helpful.


Neat, I hadn't heard of PCpartPicker before. The Xeon 2670 you listed for $90 goes for >$1500 on NewEgg; that is quite a retail / eBay price split! Any idea why there's such a gap?


"late last year, the supply became huge when thousands of these processors hit the market as previous-gen servers from Facebook and other big Internet companies were decommissioned by used equipment recyclers" [1]

[1] http://www.techspot.com/review/1155-affordable-dual-xeon-pc/


Amazon/google/facebook/something else replaced a lot of servers with those CPUs, so now ebay is flooded with them - hence price drop ;)


I found there was a lack of supply at traditional online parts retailers for Xeon brands (consumer v enterprise processors). I picked up a Xeon 2620v3 last year at Microcenter for under $300 b/c of a sale. Also, the 2670 is older, I'd presume he picked it up second hand at that price.


I've had pretty good luck buying rack servers and workstations for HPC from Thinkmate (http://www.thinkmate.com). Their systems come prebuilt but their markup is pretty reasonable, considering they give you a known-good configuration. You can pick and choose whatever combo of Xeons + GPUs you want [read: can afford].


I was just pondering about this too. I would love to get some inputs from HN.


Why 2 gpus? Isn't SLI annoying (e.g. you need to wait for support, if you get it at all) and almost not worth it?


SLI is not necessarily relevant for deep learning; he may be using each card for different tasks.


I would wait for the new Titan and hope it has HBM.


The lower TDP is just as significant as speed. I've got a pair of GTX 480 that I can use to heat my office with a light workload. How many 1080s could run in a single workstation?


Haswell-E chips have 40 PCIe lanes. If you give 8 lanes to each GPU then 4 cards is a go, although cooling that will be fun. Now, four dual-slot cards already exceed ATX's 7 slots, but many cases (especially those that offer SSI EEB compatibility) offer 8 slots. Four such GPUs, a Haswell-E, the motherboard -- it'd be challenging to assemble this below $3K but I think $4K is a reasonable target.

Going beyond that is not impossible but requires server-grade hardware. For example, the Dell R920/R930 has 10 PCIe slots, as does the Supermicro SuperServer 5086B. The barebone for the latter is above $8K. You need to buy Xeon E7 chips for these, and those will cost you more than a pretty penny. I do not think $20K is unreasonable to target.

Not enough? A single SGI UV 300 chassis (5U) provides 4 Xeon E7 processors, up to 96 DIMMs, and 12 PCIe slots. You can stuff 8 Xeon E7 CPUs into the new HP Integrity MC990 X and have 20 (!) PCIe slots. How much do these cost? An arm and two legs. I can't possibly imagine how such a single workstation would be worth it instead of a multitude of cheaper workstations with just 4 GPUs (you'd likely end up with an order of magnitude more GPUs in this case -- E7 CPUs and their base systems are hideously expensive) but to each their own.


I left out something: the Supermicro 6037R-TXRF server is dual socket and has ten slots, and since it's an older dual-socket system, even with CPUs and RAM it can be had for <$3K, so together with five GPUs you can have it for less than $6K. It's a whole other question whether almost doubling the price is worth it for adding one more GPU (no).


Wow thanks for the detailed response on hardware. I was aware of the Supermicro beast.

Setting aside this 1080, just on per CUDA core prices, cheaper workstations with pairs of 970s are much better deals, because everything gets cheaper, not just the GPUs.


Not so specific to this new card, but WTF: CPUs at 65 W TDP and GPUs at 200+ W? It just seems crazy that all other areas of computing have been pushing hard for lower power use for many years now while graphics rips the other way.


I am just curious (and a total machine learning novice). If you were to experiment with ML, what are the benefits of getting a fast consumer card like this versus using something like AWS GPU instances (or some other cloud provider)? Or phrased differently: when does it make sense to move from a local GPU to the cloud and vice versa?


I'm a relative novice also, but we have run an EC2 8 way GPU instance. Basically the GPUs are a bit slower and they only have 4 GB of ram. Also getting everything set up and uploading a data set is tedious. Costs weren't too bad. I could see AWS being used for training smaller problems and learning. If you have a serious project it probably won't fit the bill.


AWS would make sense if you don't yet know if your ML algo scales well on a GPU: you can try the hardware without losing much money.

Consumer cards are a good deal if you know your algo works well on GPUs in single (32-bit) floating point and you don't mind dicking around with hardware configurations to save money.


It's about how much machine learning you're currently doing and how much you expect to spend.

E.g. if you play around for a little bit, you could end up spending $50 on the cloud. By contrast, a card will set you back $350+.

If you're doing lots of machine learning, the cloud will quickly rack up.

There's other factors. E.g. the cloud's performance can vary depending on the time of day (and who's using the application).

Don't buy a graphics card unless you're certain you're going to do lots of ML. Or buy one if you think that'll motivate you to justify the purchase by doing lots of ML.


The AWS Linux GPU instances only have 4GB of memory, which means processing can be slow (memory swaps) if your dataset is large.

Generally, it makes sense to have a local GPU to try out models, and move to the cloud for large scale computing, once you have a better idea of the model.


Image analysis is cumbersome on the cloud; copying data back and forth gets tedious really quickly. I would use the cloud only when I was confident in what I was doing.


Isn't the Titan X also about RAM? It has 12GB to the GTX 1080's 8GB. At the low end you could just buy more than one GTX 1080, so it looks like a good deal there, but at the top end you are running out of slots for cards.


Larger learning sets, essentially. I'm a neural net novice, but I think the general perspective seems to be that 8 vs 12 is not a big deal; either you are training something that's going to have to get split up into multiple GPUs anyway, or there's probably a fair amount of efficiency you can get in your internal representations shrinking RAM usage.

One thing not mentioned in this gaming-oriented press release is that the Pascal GPUs have additional support for really fast 32 bit (and do I recall 16 bit?) processing; this is almost certainly more appealing for machine learning folks.

On the VR side, the 1080 won't be a minimum required target for some time is my guess; the enthusiast market is still quite small. That said, it can't come too quickly; better rendering combined with butter-smooth rates has a visceral impact in VR which is not like on-screen improvements.


8 vs 12 is a really big difference - especially with state of the art deep learning architectures. The problem isn't the size of the dataset but of the model. Large Convolutional Neural Networks often need the whole 12 GB VRAM of a Titan card for themselves and distributing over multiple cards/machines adds a lot of complexity.


I wonder what the idle power usage of one of these would be? I wish my motherboard allowed me to turn off the dedicated card and fall back to the Intel chipset when I didn't need the performance.


From the slides in Nvidia's presentation, it looks like the max power consumption is about 180 W. That puts it squarely between the GTX 980 (165 W) and the GTX 980 Ti (250 W).

Idle power consumption is an order of magnitude less. For the 980 Ti, you're looking at about 10 W of power consumption while running outside of a gaming application. Maybe 40 W if you're doing something intensive. [1]

[1]: http://www.tomshardware.com/reviews/nvidia-geforce-gtx-980-t...


My laptop does that, but it's frequently a pain. Pull up a demo of OpenGL ES stuff in Chrome and it won't switch to the dedicated card, even if I ask it to, since Chrome 'isn't a graphics application'. Plus Linux drivers for it are hit-or-miss (it's improved since launch). Still a good idea though, and it keeps the fan from running much of the time.


All the benchmarks I can find are comparing this against the GTX 980. I'm curious how it compares to a 980 Ti.


The 980 Ti is approximately equal to the Titan X. The new 1080 should be faster than the Titan X by a good 20% or so.

Source: http://www.anandtech.com/show/10304/nvidia-announces-the-gef...


I've always been amazed by this chart http://www.videocardbenchmark.net/high_end_gpus.html

I'm sure some of those expensive cards have extra features, but the pricing differences are still crazy.


A big part of what you're paying for with the workstation-grade cards (EG Quadro/FirePro) is certification. If you're using many "serious" applications like 3D CAD, you must be using a certified (e.g. not consumer grade) card to be fully supported. They run fine on the consumer cards - but don't expect support.


And as it happens, the companies that specialize in cad support will use every opportunity to shove these overpriced cards down your throat. As far as I've understood their business model, they earn almost nothing from selling licenses, so they live on their maintenance contracts and selling Quadro/Fire cards.

Meanwhile we have no problems with an office running on 950Ms and 780/980 GTXes, except for some weird driver bugs that only occur on the single Quadro machine that we keep for reference.


No artificial benchmark is perfect, but if I had to pick one I'd prefer 3DMark from Futuremark: https://www.futuremark.com/hardware/gpu

AFAICT, PassMark's GPU benchmarks are most often used simply because it's usually the first Google result for "video card benchmark".


I know that synthetic tests only mean so much, but are these scores roughly linear? I'm just kind of blown away at a ten times difference in power over my laptop's card. I guess it's been a while since I've had a desktop, or bought a demanding game for that matter.


Does anybody know what kind of performance a modern Nvidia card like this can provide for offloading SSL ciphers (aes gcm 128 or aes gcm 256)?


There are data dependencies in most AES modes (except ECB) which prevent parallelization within a stream. However, if you run 2k separate AES encryption streams you could expect a nice performance increase. But for a single stream there is little performance to be had.
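
To illustrate the kind of data dependency meant here, take CBC as the clearest case: block i's input depends on block i-1's ciphertext, so one stream can't be spread across GPU threads (aes_encrypt_block below is a placeholder, not a real library call):

    #include <cstdint>
    #include <cstring>

    // placeholder for one AES-128 block encryption; not a real library function
    void aes_encrypt_block(const uint8_t key[16], const uint8_t in[16], uint8_t out[16]);

    void cbc_encrypt(const uint8_t key[16], const uint8_t iv[16],
                     const uint8_t* pt, uint8_t* ct, size_t nblocks) {
        uint8_t prev[16];
        std::memcpy(prev, iv, 16);
        for (size_t i = 0; i < nblocks; ++i) {
            uint8_t tmp[16];
            for (int b = 0; b < 16; ++b)
                tmp[b] = pt[i * 16 + b] ^ prev[b];      // depends on previous ciphertext...
            aes_encrypt_block(key, tmp, ct + i * 16);   // ...so block i must wait for block i-1
            std::memcpy(prev, ct + i * 16, 16);
        }
    }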


We run tens of thousands of streams in parallel.. So that's not a problem. :)

I'm just realizing that GPUs could be a viable alternative to specialized (eg, expensive) encryption offload engines or very high-end (eg, expensive) v4 Xeons with many, many cores. However, it is rather hard to find data on the bandwidth at which GPUs can process a given cipher. I'm just now starting to look into this, so I could very well be misunderstanding something.


Throughput 100% depends on how hard they need to hit memory. GPUs love stuff that they can do entirely within registers. Having to hit (the limited quantity of) shared memory slows things down much more. Inefficient/non-coalesced use of shared memory or global memory too often can trash performance. Hitting host/CPU memory in any non-trivial amount usually dooms a program.

With each step down that hierarchy you not only cut your bandwidth but also increase your latency, and that has a huge impact.
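
Rough Python sketch of why that matters for something like SSL offload; the bandwidth figures are ballpark assumptions (roughly PCIe 3.0 x16 and 980 Ti-class GDDR5), not measurements:

    # Upper bound on GPU bulk-cipher throughput when every byte has to cross
    # the PCIe bus twice (plaintext in, ciphertext out), vs. data already in
    # VRAM. Bandwidth numbers are ballpark assumptions.
    PCIE3_X16_GBPS = 12.0    # usable host <-> device bandwidth, GB/s (assumed)
    VRAM_GBPS = 336.0        # 980 Ti-class memory bandwidth, GB/s (spec sheet)

    def ceiling(bus_gbps, crossings=2):
        return bus_gbps / crossings

    print(f"PCIe-bound ceiling: ~{ceiling(PCIE3_X16_GBPS):.0f} GB/s of cipher throughput")
    print(f"VRAM-bound ceiling: ~{ceiling(VRAM_GBPS):.0f} GB/s if data already lives on the card")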


Guys have a look at SSLShader[1]

[1]: http://shader.kaist.edu/sslshader/


Thanks!


What about a stream cipher like ChaCha20?

Also, would data dependencies be any different on a GPU than with parallelization on a CPU? Because with modes like GCM (i.e. CTR), both encryption and decryption are parallelizable (with CBC it's only decryption).
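
A toy sketch of that point, with HMAC-SHA256 standing in for the block cipher (so this is not real AES/GCM, just an illustration of why counter-mode keystream blocks are independent and can be computed in any order or in parallel):

    # Toy counter mode: each keystream block depends only on (key, nonce,
    # block index), so blocks can be computed out of order or across many
    # parallel workers. HMAC-SHA256 is a stand-in PRF, not real AES.
    import hmac, hashlib, os

    def keystream_block(key, nonce, i):
        return hmac.new(key, nonce + i.to_bytes(8, "big"), hashlib.sha256).digest()

    def ctr_xor(key, nonce, data):
        out = bytearray(data)
        for i in range(0, len(data), 32):                 # embarrassingly parallel
            block = keystream_block(key, nonce, i // 32)
            for j, b in enumerate(block[: len(data) - i]):
                out[i + j] ^= b
        return bytes(out)

    key, nonce = os.urandom(32), os.urandom(16)
    msg = b"counter mode parallelizes both directions " * 20
    ct = ctr_xor(key, nonce, msg)
    assert ctr_xor(key, nonce, ct) == msg                 # same operation decrypts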

I wonder if AMD GPUs would be better suited to the task than Nvidia's (at least their existing GPUs), provided you could actually saturate the GPU with enough work to make it worth it.


Bit math (bfe.*, logical bit ops, ...) is nowhere near as fast as float32 ops on their GPUs; it's slower by a factor of 2-4x because there are many fewer functional units for it. Maxwell also doesn't have support for many 64-bit bit ops (they get emulated in SASS using the operations that it does have, e.g. bfe bitfield extract), so you'd have to do it in uint32.
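
A sketch of what that uint32 emulation looks like in spirit (pure Python, just to show the decomposition; the actual SASS the compiler emits will differ):

    # Emulating a 64-bit bitfield extract with only 32-bit halves (lo, hi),
    # the kind of decomposition required when the hardware only has 32-bit
    # bfe. Exposition only; not what the compiler literally generates.
    import random

    def bfe32(x, pos, length):
        return (x >> pos) & ((1 << length) - 1)

    def bfe64_from_u32(lo, hi, pos, length):
        assert 0 < length <= 32 and 0 <= pos and pos + length <= 64
        if pos >= 32:                        # field entirely in the high word
            return bfe32(hi, pos - 32, length)
        if pos + length <= 32:               # field entirely in the low word
            return bfe32(lo, pos, length)
        low_bits = 32 - pos                  # field straddles the 32-bit boundary
        return bfe32(lo, pos, low_bits) | (bfe32(hi, 0, length - low_bits) << low_bits)

    for _ in range(1000):                    # check against a plain 64-bit extract
        x = random.getrandbits(64)
        pos = random.randrange(64)
        length = random.randint(1, min(32, 64 - pos))
        assert bfe64_from_u32(x & 0xFFFFFFFF, x >> 32, pos, length) == bfe32(x, pos, length)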

Issue rates are in the CUDA documentation for Maxwell (cc 5.2) at least; presumably they'll update it with Pascal soon:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.h...

Where they would win is in memory bandwidth for intermediate results (e.g., in/out of registers or smem or even global memory), assuming that you aren't ultimately bound by I/O over the bus from the CPU.


I'm somewhat dismayed that the GTX 1080 has only 8GB of VRAM considering that previous generation AMD products already had 8GB, and the previous generation Titan model had 12GB—the latter of course being outperformed by the GTX 1080.

Then again, rendering a demanding title at 8192x4320 on a single card and downsampling to 4k is probably wishful thinking anyways. However, it's definitely a legitimate concern for those with dual/triple/quad GPU setups rendering with non-pooled VRAM.

On the bright side, 8GB should hopefully be sufficient to pull off 8k/4k supersampling[0] with less demanding titles (e.g. Cities: Skylines). Lackluster driver or title support for such stratospheric resolutions may prove to be an issue, though.

It's possible Nvidia is saving the 12GB configuration for a 1080 Ti model down the road. If they release a new Titan model, I'm guessing it'll probably weigh in at 16GB. Perhaps less if those cards end up using HBM2 instead of GDDR5X.

[0] https://en.wikipedia.org/wiki/Supersampling


Correction: 7680×4320.

I accidentally doubled DCI 4K (4096x2160) instead of UHD 4K (3840x2160).
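
For reference, the corrected arithmetic, plus what one color buffer at that size actually costs in VRAM (real renderers keep many render targets plus textures, so this is only a lower bound):

    # UHD 4K doubled in each dimension, and the raw size of one RGBA8 buffer
    # at that resolution. A lower bound only; real games hold far more.
    uhd_w, uhd_h = 3840, 2160
    ss_w, ss_h = uhd_w * 2, uhd_h * 2        # 7680 x 4320
    buffer_mb = ss_w * ss_h * 4 / 2**20      # 4 bytes per RGBA8 pixel
    print(f"{ss_w}x{ss_h}: ~{buffer_mb:.0f} MB per color buffer")   # ~127 MB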


Towards the end of the article, in the table overview, they list an `NVIDIA GP100` model with a memory bus width of 4096 bits. It's still memory shared across the whole chip, but considering bcrypt only requires a 4KB working set, that now fits into 8 transfers instead of the 128 needed on 256-bit bus architectures...

Am I wrong to think this card could really shake bcrypt GPU cracking up?
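
The arithmetic behind that, as a sketch; it ignores burst length, channel splitting, and latency, so treat it as an intuition pump rather than a performance model:

    # Naive count of bus transfers needed to move bcrypt's 4 KB working set.
    WORKING_SET_BYTES = 4 * 1024

    for bus_bits in (256, 4096):
        bytes_per_transfer = bus_bits // 8
        transfers = WORKING_SET_BYTES // bytes_per_transfer
        print(f"{bus_bits:>4}-bit bus: {bytes_per_transfer:>3} B per transfer -> {transfers} transfers")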


The AMD Radeon R9 Fury series has a 4096-bit bus (HBM1) and has been out for months, so it would already be shaking things up if so.


Ah, did not know that. Nevermind.


I'll buy when there's Vulkan/DX12 benchmarks and real retail prices for both AMD and NVIDIA next-gen cards. Buying now seems slightly premature.

But oh man am I excited!


This feels like a leap similar to the one I felt when the 3Dfx Voodoo 2 SLI came out. The possibilities seem pretty amazing.

I'm interested to know how quickly I can plug in a machine learning toolkit; it was a bit finicky to get up and running on a box with a couple of 580GTs in it, but that might just be because it was an older board.


I watched the livestream; it looked good, and I love the performance per watt on a 180 W card. Way more GPU power than I need professionally or for fun, though. I'm actually all-in on the new Intel gaming NUCs. Skull Canyon looks fantastic and has enough GPU performance for the gaming I still do: mostly older games, some CS:GO (my friends and I play at 1280x1024 anyway since it's competitive gaming) and League of Legends.

It's also nice to have an all-Intel machine for Linux. I'd use a low-end NV Pascal to breathe new life into an older desktop machine, since NV seems to always have a bit better general desktop acceleration that really helps out old CPUs. If building a high-end gaming rig I'd probably wait for AMD's next chip. I've liked them more on the high end for a few generations now: async compute, fine-grained preemption, generally better hardware for Vulkan/DX12. AMD is also worth keeping an eye on for their newfound console dominance and the impact if they push dual 'Crossfire' GPUs into an Xbox One v2, the PS4K and the Nintendo NX. That would be a master stroke: games programmed at a low level for their specific dual GPUs by default. Also, the return of low-level APIs to replace OGL/DX11 removes the driver-middleware mess, which will take the software monkey off AMD's back. That always plagued them, and the former ATI, a bit.

I'll probably buy the Kaby Lake 'Skull Canyon' NUC revision next year, and if I end up missing the GPU power, hook up the highest-end AMD Polaris over Thunderbolt. Combining the 256MB L4 cache that Kaby Lake-H will have with Polaris will truly be next-level. Kaby also has Intel Optane support in the SODIMM slots; it's possible we'll finally see RAM and SSDs merge into a single chip.

But more than anything, I want Kaby Lake because it's Intel's last 14nm product, so here's hoping they finally sort out their electromigration problems. Best to take a wait-and-see approach on these 16nm GPUs for the same reason. I'm moving to a 'last revision' purchase policy on these <=16nm processes.


>my friends and I play at 1280x1024 anyway since it's competitive gaming

Can you elaborate on this? Never heard of this.


It's common for pros and old-timer Counter-Strike players. My friends and I have been PC gaming since the mid-80s, so we've been in on CS since the earliest days. Here's a spreadsheet of some pro players' setups.

https://docs.google.com/spreadsheets/d/1UaM765-S515ibLyPaBtM...

A little bit of everything but note how common 1024x768 is. It looks a little sharper, enough that it doesn't need AA (which I always felt introduced some strange input lag).

Guarantees good performance, makes the models slightly bigger. Gives you a slight edge. You can also run a much slower GPU, that's how I'm getting away with using stuff like Intel Iris Pro. I'd run 1280x1024 with all options on lowest even if I had a Fury X. Other people I know have GF780s and do the same. I only turn up graphics options in single player games, at which point I don't care about FPS dips. When playing competitively I want every edge I can get.


> When playing competitively I want every edge I can get.

Jesus christ, I can't even imagine the leaps in skill I'd have to make before the difference between 50fps and 150fps had any impact on my performance. Watching gaming at that level feels like being a civilian in Dragonball Z, just seeing a bunch of blurs zipping in the sky while wondering if I'm about to become obsolete.


> I can't even imagine the leaps in skill I'd have to make before the difference between 50fps and 150fps had any impact on my performance

Keep in mind FPS figures are normally given as averages. You can go right up to a wall with nothing on screen and get an insane FPS, or you can back out and take in the whole area at max view distance and see a much lower FPS. Add in players/objects/explosions etc. and your FPS dips and dives.

Having an average of 50 FPS means you likely ARE going to notice certain dips, while at a 150 average you're much less likely to see a perceptible dip.
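
Frame-time numbers make that concrete (pure arithmetic; perceived smoothness also depends on frame-time variance and refresh rate):

    # Frame budget at a few FPS figures. A dip from 150 to 90 fps costs about
    # 4.4 ms per frame; a dip from 50 to 30 fps costs about 13.3 ms.
    for fps in (150, 90, 60, 50, 30):
        print(f"{fps:>3} fps -> {1000.0 / fps:5.1f} ms per frame")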


The jump from 60 to 100 fps (even on 60Hz monitors) really does make a difference. I haven't experienced playing on 120/144Hz, but I guess it would be even better. Pros demand 300+ fps at top championships (played on 120/144Hz monitors). For me it's more about the confidence that I won't have frame drops (the smoke graphics in CS:GO are bad; I drop from 90ish fps to 30-45 fps at the edges of smoke), perceived smoothness, and input lag. The actual game itself (the timeline on the servers) only runs at 64 ticks (64Hz) online or 128 ticks on select community servers and competitive matches.


1280x1024 is a lower resolution than 1080p, so it renders faster. It's also the "recommended" resolution of many high-end CRTs, some of which can run it at 160 Hz (or even more) with virtually zero latency, essentially allowing people with fast enough reflexes to use their full potential.


AFAIK no CRT was ever manufactured in a 5:4 format. To avoid distortion you'd have to use 1280x960, which no one ever did.

Also there are plenty of full HD displays with 120Hz+ now ...


ViewSonic CRTs, which I swore by for years, were 5:4 with square pixels. They were optimum Quake deathmatch kit :)


A model will be larger on a 20" monitor at 1280x1024 vs say 1920x1080 and this makes aiming easier.


I'd assume it's because higher resolutions would offer an advantage due to simply having more pixels to aim at, and possibly higher FOV if you're using a widescreen resolution.


The player hitboxes aren't defined by what the player can see though, right? So having those chunkier pixels will be less representative of what your bullets will actually hit, no?

Back in the CS < 1.3 days everyone used 640x480 but I never really understood why.


It's not about what your bullet hits, but about how easy it is for the eyes to recognise that your crosshair is already on the right pixels. Some players even use a stretched display mode (a 4:3 render displayed full-width on a 16:9 monitor) to push this advantage further.


My parent comment has been downvoted; can someone explain to me why you downvoted me? What was so offensive? Is it because I'm not a huge NV fan and would only use their cards at the low end?


> The GP104 Pascal GPU Features 2x the performance per watt of Maxwell.

And

> The GTX 1080 is 3x more power efficient than the Maxwell Architecture.

I think someone got carried away by their imagination.

I found that the 980 has 4.6 TFLOPS (single precision). Assuming the 1080's 9 TFLOPS figure is also single precision and the new card has the same TDP, that's a 1.95x increase, so ~2x.

EDIT: I found that the 1080 will have a 180W TDP, where the 980 has 165W, so correction: it's a 1.79x increase.
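
The back-of-envelope in code form, using spec-sheet TFLOPS and TDP (TDP is only a proxy for real power draw, so this is marketing-adjacent arithmetic at best):

    # Spec-sheet performance-per-watt comparison: peak FP32 TFLOPS over TDP.
    gtx_980 = {"tflops": 4.6, "tdp_w": 165}
    gtx_1080 = {"tflops": 9.0, "tdp_w": 180}

    def perf_per_watt(card):
        return card["tflops"] / card["tdp_w"]

    ratio = perf_per_watt(gtx_1080) / perf_per_watt(gtx_980)
    print(f"GTX 1080 vs GTX 980 perf/W: {ratio:.2f}x")    # ~1.79x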


Not long ago I found this 2009 project https://en.wikipedia.org/wiki/Fastra_II

built to give 'desktop' cluster-class performance based on GPUs. I wonder how it would fare against a 1080.


On paper, the Fastra II would still be a fair bit faster, and would have more RAM. However, in a standoff I'd put my money on the 1080, simply because of the overhead of wrangling all those GPUs to work together.


I'm more interested in the impact it will have on the 10 series as a whole. I literally just bought a 950, and now I'm wondering if the 1050 will be priced just as reasonably and offer big performance gains. Also, what is the timetable for the rest of the 10 series?


Based on the previous generations (since the 400 series) you can expect about a 20-25% gain from one generation to the next. Traditionally the releases have been: April to June, paper release (read: very limited supply of the top of the range); July to October, the x70/x60 range (which are the main gaming GPUs); November to December, the rest of the range released with ample supply for the end-of-year festive season; January to March, announcement of the next generation. Rinse and repeat.


How good does it look for a hobbyist manipulating large matrices for use in machine-learning?


Pretty good; 8GB of memory and impressive transfer speeds. The 980 Ti is pretty common in the deep learning community. Only the Titan X is more popular, and that thing is far outside any average person's budget. I'm definitely looking to pick up one of these soon.
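
For a rough sense of scale against that 8GB (weights or a single matrix only; activations, gradients, and optimizer state multiply these figures several times over):

    # How big common objects are relative to 8 GB of VRAM, in float32.
    BYTES_F32 = 4
    GB = 2**30

    def matrix_gb(rows, cols):
        return rows * cols * BYTES_F32 / GB

    print(f"10k x 10k float32 matrix: {matrix_gb(10_000, 10_000):.2f} GB")    # ~0.37 GB
    print(f"60M-parameter model     : {60e6 * BYTES_F32 / GB:.2f} GB")        # ~0.22 GB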


Titan budgets are definitely a problem, which is why you see so many acknowledgements of Nvidia in deep learning, thanking them for giving a Titan.


Does anyone know why video cards stayed on a 28nm process for so long? It appears that a significant factor in this incredible leap in performance is the process change, but I'm puzzled as to why 22nm was skipped.


TSMC doesn't have a 22nm process -- they offer a 20nm process instead, but it was heavily delayed, had huge yield issues, and most of its production capability was double-booked with Apple, all of which meant it wasn't feasible to use in time for Nvidia's previous generation (Maxwell) launch.

Maxwell had been planned as 20nm, but when TSMC couldn't solve their process issues the first Maxwell part was taped out at 28nm (GM107, aka the GeForce 750), while they waited on the rest of the lineup. Six months later, with TSMC still having issues at 20nm, the GM204 (GeForce 980/970) was launched at 28nm.

16nm FinFET wasn't subject to massive delays (it arrived early, in fact) and seems to have had better yields early on than 20nm did, so Pascal skipped right down to it.


Thanks for the explanation! I haven't been following fab process news for some time so I've been very much out of the loop.


I'm hopeful this will provide a significant speedup for my 3D modeling/rendering. The number of CUDA cores is only slightly higher than the 780's, though, so I'll definitely wait for more benchmarks.


Could someone clarify if they meant that one 1080 is faster than two 980s in SLI? Or did they mean two 1080s were faster than two 980s?


From the article:

> The performance of NVIDIA’s GeForce GTX 1080 is just off the charts. NVIDIA’s CEO, Jen-Hsun, mentioned at the announcement that the GeForce GTX 1080 is not only faster than one GeForce GTX 980 but it crushes two 980s in SLI setup.


He meant one 1080 is faster than two 980s in SLI, but that seems pretty hard to believe; we're going to need to wait for some 3rd party reviews.


I might believe 980s, but not 980 Tis. That would be a feat.


The 1080 should be a bit faster than a single 980 Ti, or about as fast as an overclocked 980 Ti, which would put it pretty much at 980 SLI performance.


> a bit faster than a single 980Ti

From the article, I got the impression that the difference was like night and day. I wonder when we will be getting some early real-world benchmarks.


No word on double-precision performance? It could replace the Titan Z and Titan (Black) as a good CUDA-application card.


DP will be better than on Maxwell cards, but you'll need to wait and see if it can beat Kepler (IMO it will).


GeForce cards have a driver with reduced pinned memory performance...


"GTX 1080"

"10 Gaming Perfected"

Seems like a missed marketing opportunity to make it a 10 GFLOPs card.


(nitpick) actually 10 TFLOPs is a more appropriate target. Doesn't make sense to aim to cripple your performance by a factor of 822 for marketing reasons...
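
For reference, the usual peak-FLOPS arithmetic (2 FLOPs per CUDA core per clock via FMA), using the announced GTX 1080 core count and clocks:

    # Peak FP32 throughput from announced specs: 2560 CUDA cores,
    # 1607 MHz base / 1733 MHz boost, 2 FLOPs per core per clock (FMA).
    cores = 2560
    for label, ghz in (("base ", 1.607), ("boost", 1.733)):
        tflops = cores * ghz * 2 / 1000
        print(f"{label}: ~{tflops:.2f} TFLOPS (~{tflops * 100:.0f}x a 10 GFLOPS card)")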


There is likely to be a GTX 1080 Ti with a significantly larger GPU (GP200) that will exceed 10 TFLOPS.


Any idea when it'll be released?


What is the "founders edition"?


http://www.anandtech.com/show/10304/nvidia-announces-the-gef...

> There will be two versions of the card, the base/MSRP card at $599, and then a more expensive Founders Edition card at $699. At the base level this is a slight price increase over the GTX 980, which launched at $549. Information on the differences between these versions is limited, but based on NVIDIA’s press release it would appear that only the Founders Edition card will ship with NVIDIA’s full reference design, cooler and all. Meanwhile the base cards will feature custom designs from NVIDIA’s partners. NVIDIA’s press release was also very careful to only attach the May 27th launch date to the Founders Edition cards.


Probably a way to mask limited availability at launch.


It means it's another paper launch by Nvidia.


And still not fully DX12. Oh, Nvidia.


Maxwell currently leads when it comes to DX12 feature support (along with Skylake); see https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Dir... . I don't know what Pascal has, but I doubt it'd be a regression.


This will be pretty nice for VR because it will push older-generation cards down in price.


Why are these coprocessors still positioned as graphics cards?


Because their primary purpose is for gaming. You can use these for CUDA, but you can't use server GPUs for gaming.


Because consumers buy them for video games.


But you can look at it this way: if, from a business (money) perspective, it ever becomes interesting to lock these graphics cards down to specific games, then the computational folks lose their processing power. They were not the market in the first place (relatively speaking).


Because they have display output.


You can calculate many small things in parallel on a GPU way, way faster than on a CPU.



