The data transfer rates are what will bite you, and they actually differ from what was advertised.
I know of at least one group that bought a bunch of GTX 970 cards because they were supposed to have the same memory bandwidth as the GTX 980, just with less compute power. Their application is memory-bandwidth bound, so the additional compute would have been wasted.
However, this means they didn't really get what was promised: 196GB/s instead of 224GB/s.
Even so, it still has the best performance/price combo for that particular GPGPU application.
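For what it's worth, the 196GB/s figure is exactly 7/8 of the advertised 224GB/s. A minimal sketch of that arithmetic, assuming (my assumption, not a vendor statement) that peak bandwidth scales linearly with the number of active memory units; the function and constant names are made up for illustration:

```python
# Assumption: peak bandwidth scales linearly with the number of
# active memory units. Names here are illustrative only.
ADVERTISED_GBPS = 224.0  # advertised bandwidth, from the thread
TOTAL_UNITS = 8          # memory units on the spec sheet
ACTIVE_UNITS = 7         # units serving the fast segment


def effective_bandwidth(advertised_gbps, active_units, total_units):
    """Scale the advertised bandwidth by the fraction of active units."""
    return advertised_gbps * active_units / total_units


fast_gbps = effective_bandwidth(ADVERTISED_GBPS, ACTIVE_UNITS, TOTAL_UNITS)
print(fast_gbps)  # 196.0
```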
It's somewhat misleading to look at an arithmetic average of the bandwidth of the fast/slow segments. Due to the way they architected it, you cannot access both the fast segment and the slow segment during the same memory-fetch cycle; it's either/or. If control flow depends on data that's stuck in the slow segment, performance could be significantly degraded.
Now - blah blah prediction, blah blah heuristics, yadda yadda. If you don't use the memory fully (compute, 4K, etc) there's no problem, and even then you can optimize the problem away somewhat. This will work pretty well for AAA-grade game engines that get special attention - Unreal, CryEngine, Unity. But for memory-bound (especially latency-sensitive) compute applications, what you have here is a 3.5GB card, not a 4GB card.
Having a card show up with 1/8th of its specced memory units turned off is not acceptable, regardless.
Actually I should correct this - if you access the slow segment at all, performance will be degraded, since you cannot also access the fast memory during the same cycle.
Looking at it on a 2-cycle basis: since the fast segment is 7x as fast as the slow one, you can access either (7+7) or (7+1) chunks of memory. That's a 43% performance drop if even 1 of the 32 threads in a warp consistently needs to touch the slow segment.
That data being used for control flow will amplify the problem, of course, since latency will double.
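The 2-cycle estimate above can be checked with a few lines. This is only a sketch of that back-of-the-envelope model, assuming a cycle moves either 7 chunks from the fast segment or 1 chunk from the slow segment, never both; `two_cycle_drop` is a made-up name, and nothing here measures real hardware:

```python
def two_cycle_drop(fast_chunks=7, slow_chunks=1):
    """Fractional throughput loss over two cycles when one of them
    hits the slow segment instead of the fast one.

    Assumption: each cycle serves exactly one segment (either/or).
    """
    best = fast_chunks + fast_chunks   # both cycles hit the fast segment
    mixed = fast_chunks + slow_chunks  # one cycle is spent on the slow segment
    return 1 - mixed / best


drop = two_cycle_drop()
print(f"{drop:.0%}")  # 43%
```

Under these assumptions the loss is 6/14, so the 43% figure holds up. It doesn't model the doubled latency for control-flow-dependent data.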