The data transfer rates are what will bite you, and actually are different than ...

paulmd · on Jan 27, 2015

It's somewhat misleading to look at an arithmetic average of the bandwidth of the fast/slow segments. Due to the way they architected it, you cannot access both the fast segment and the slow segment during the same memory-fetch cycle; it's either/or. If control flow depends on data that's stuck in the slow segment, performance could be significantly degraded.

Now - blah blah prediction, blah blah heuristics, yadda yadda. If you don't use the memory fully (compute, 4K, etc) there's no problem, and even then you can optimize the problem away somewhat. This will work pretty well for AAA-grade game engines that get special attention - Unreal, CryEngine, Unity. But for memory-bound (especially latency-sensitive) compute applications, what you have here is a 3.5GB card, not a 4GB card.

Having a card show up with 1/8th of its specced memory units turned off is not acceptable, regardless.

paulmd · on Jan 27, 2015

Actually I should correct this: if you access the slow segment at all, performance will be degraded, since you cannot also access the fast memory during the same cycle.

Looking at it on a 2-cycle basis: since the fast segment's bandwidth is 7x that of the slow segment, you can access either (7+7) or (7+1) chunks of memory. That's a 43% performance drop (8 chunks instead of 14) if even 1 of the 32 threads in a warp consistently needs to touch the slow segment.

That data being used for control flow will amplify the problem, of course, since latency will double.
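The two-cycle estimate above can be checked with a few lines of Python. This is only a sketch of the simplified model in the comment (one segment per cycle, 7 chunks from the fast segment or 1 from the slow one); the function name and the "chunk" unit are illustrative assumptions, not anything from NVIDIA.

```python
def two_cycle_throughput(fast_chunks=7, slow_chunks=1):
    """Compare two all-fast cycles against one fast plus one slow cycle.

    Assumes the simplified model from the comment: each cycle serves
    either the fast segment or the slow segment, never both.
    """
    all_fast = fast_chunks + fast_chunks   # both cycles hit the fast segment
    one_slow = fast_chunks + slow_chunks   # one cycle is spent on the slow segment
    drop = 1 - one_slow / all_fast         # fraction of throughput lost
    return all_fast, one_slow, drop

all_fast, one_slow, drop = two_cycle_throughput()
print(f"{all_fast} vs {one_slow} chunks, drop = {drop:.1%}")
# 14 vs 8 chunks, drop = 42.9%
```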

sharpneli · on Jan 27, 2015

There is an actual difference even for the fast segment, because the bandwidth of the fast 3.5GB segment is still only 7/8 that of the GTX980.

See http://dl5.guru3d.com/gtx-970vs-980.png

So the peak bandwidth of the GTX970's fast segment is 196GB/s.
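As a rough check on that figure, here is the arithmetic as a small Python sketch. It assumes the GTX980's published peak of 224GB/s (that number is not stated in this thread) and splits it evenly across 8 memory units, 7 of which serve the fast segment; the helper name is made up for illustration.

```python
# Assumption: GTX980 published peak memory bandwidth, not taken from the thread.
GTX980_PEAK_GBPS = 224.0

def segment_bandwidth(full_peak, units_total=8, units_fast=7):
    """Split peak bandwidth evenly across memory units (simplified model)."""
    per_unit = full_peak / units_total
    fast = per_unit * units_fast                   # fast 3.5GB segment
    slow = per_unit * (units_total - units_fast)   # slow 0.5GB segment
    return fast, slow

fast, slow = segment_bandwidth(GTX980_PEAK_GBPS)
print(fast, slow)
# 196.0 28.0
```

Under the same assumption the slow segment comes out at 28GB/s, which matches the 7x ratio used earlier in the thread.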