Seeing as it is supposed to deliver 1 PFLOP, its memory will need to be about as fast as that of the "native" (GDDR) counterpart; otherwise it will only be able to hit that performance as long as all the data is in the cache...
My guess is that they will use the RTX 5070 Ti laptop version (992 TFLOPS, clocked slightly higher to reach 1000 TFLOPS / 1 PFLOP).
Their big GB200 chips have 546 GB/s to their LPDDR memory, so they could use the same memory controller on the GB10. They don't need to design a new one. It would still be slower than what they are currently using on the RTX 5070 Ti laptop GPU, but any slower than that and there is no chance they could argue that it would hit anywhere near 1 PFLOP of FP4. That would only be possible in extreme edge cases where all the data fits in its 40 MB L2 cache.
I think you have the reasoning backwards; there's no "must" here. Historically, lots and lots of systems have struggled to approach their peak FLOPS in real-world apps due to off-chip bottlenecks.
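To put rough numbers on the peak-versus-bandwidth argument, here is a minimal roofline-style sketch. It assumes the figures quoted in this thread (1 PFLOP of FP4 peak, and the speculated 546 GB/s if the GB10 reused the GB200's LPDDR controller); neither is a confirmed GB10 spec, and the function names and sample intensities are made up for illustration.

```python
# Simple roofline model: attainable throughput is capped either by the
# compute peak or by memory bandwidth times arithmetic intensity.
# Both constants below are figures quoted in the thread, not confirmed specs.

PEAK_FLOPS = 1e15          # 1 PFLOP (FP4), as claimed
BANDWIDTH_BYTES = 546e9    # 546 GB/s, the speculated LPDDR bandwidth


def ridge_intensity(peak_flops: float, bandwidth: float) -> float:
    """FLOPs needed per byte fetched from memory to become compute-bound."""
    return peak_flops / bandwidth


def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound for a kernel with the given FLOP/byte intensity."""
    return min(peak_flops, bandwidth * intensity)


ridge = ridge_intensity(PEAK_FLOPS, BANDWIDTH_BYTES)
print(f"ridge point: {ridge:.1f} FLOP/byte")

# Illustrative intensities only: a low-reuse kernel, a moderately blocked
# one, and one that sits past the ridge point.
for intensity in (2.0, 200.0, 2000.0):
    bound = attainable_flops(PEAK_FLOPS, BANDWIDTH_BYTES, intensity)
    print(f"{intensity:7.1f} FLOP/byte -> {bound / 1e12:8.1f} TFLOPS attainable")
```

Under these assumptions a kernel has to do on the order of 1800 FLOPs per byte pulled from DRAM before the 1 PFLOP figure is even reachable, which is why both readings in this thread can hold: the peak number can be real on paper while most workloads stay bandwidth-bound unless their working set fits in cache.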