It needs to support 64-bit integer arithmetic to handle 64-bit address calculations efficiently. Since Volta, the SASS ISA has had explicit 32I-suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired (see the cuda-binary-utilities link below).
Hmm, it looks like the underlying SASS does have 64-bit integer instructions now, but only at compute capability 12.0 (sm_120) on the recent Blackwell processors. Older architectures emulate them with chained 32-bit instructions. Take this example kernel:
So if you want real 64-bit support, have fun getting your hands on a 5070! But even on sm_120, things like 64-bit × immediate 32-bit take a UIMAD.WIDE.U32 + UIMAD + UIADD3 sequence, so the support isn't all that complete.
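To make the "chained 32-bit instructions" point concrete, here is a minimal host-side C++ sketch of how a 64-bit add and a 64-bit × 32-bit multiply can be built from 32-bit pieces. It is not the compiler's actual output: the names `U64Pair`, `add64`, `mul64x32`, `split`, `join` and `check_emulation` are made up for illustration, and the mapping of the three multiply steps onto a wide multiply, a plain multiply and an add is my reading of the instruction sequence mentioned above, not something taken from the documentation.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit halves, the way a pair of
// 32-bit registers would hold it. (Hypothetical helper type.)
struct U64Pair {
    uint32_t lo;
    uint32_t hi;
};

inline U64Pair split(uint64_t v) {
    return {static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32)};
}

inline uint64_t join(U64Pair p) {
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add as two chained 32-bit adds: the low add produces a
// carry that the high add consumes.
inline U64Pair add64(U64Pair a, U64Pair b) {
    uint32_t lo = a.lo + b.lo;
    uint32_t carry = (lo < a.lo) ? 1u : 0u;  // wraparound means carry out
    uint32_t hi = a.hi + b.hi + carry;
    return {lo, hi};
}

// Low 64 bits of (64-bit value) * (32-bit constant) in three steps:
//   1. widening multiply: lo * k -> 64-bit product
//   2. plain multiply:    low 32 bits of hi * k
//   3. add:               fold step 2 into the high word of step 1
inline U64Pair mul64x32(U64Pair a, uint32_t k) {
    uint64_t wide = static_cast<uint64_t>(a.lo) * k;
    uint32_t cross = a.hi * k;
    uint32_t hi = static_cast<uint32_t>(wide >> 32) + cross;
    return {static_cast<uint32_t>(wide), hi};
}

// Sanity check against the compiler's native 64-bit arithmetic.
inline void check_emulation(uint64_t a, uint64_t b, uint32_t k) {
    assert(join(add64(split(a), split(b))) == a + b);
    assert(join(mul64x32(split(a), k)) == a * k);
}
```

Calling something like `check_emulation(0x123456789ABCDEF0ull, 0xFFFFFFFFull, 1000u)` compares the emulated results with the native ones. The point is only that the multiply needs three dependent 32-bit operations even when a wide multiply is available.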
(I've been looking into the specifics of CUDA integer arithmetic for some time now, since I've had the mad idea of doing 'horizontal' 448-bit integer arithmetic by storing one word in each thread and using the warp-shuffle instructions to send carries up and down. Given that the underlying arithmetic is all 32-bit, it doesn't make any sense to store more than 31 bits per thread. Then again, I don't know whether this mad idea makes any sense in the first place, until I implement and profile it.)
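For what it's worth, the 31-bit reasoning can be sketched on the host without any GPU at all. The following is a sequential C++ simulation, assuming one limb per lane of a 32-lane warp; `kLanes`, `Limbs` and `add_limbs` are hypothetical names, and the sketch does not model the 448-bit size specifically. In a real kernel the carry hand-off between neighbouring lanes would presumably use a warp shuffle such as `__shfl_up_sync`, and the simple ripple loop here would need to become several shuffle rounds or some carry-lookahead scheme.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 32;            // assumption: one limb per warp lane
constexpr uint32_t kLimbBits = 31;    // payload bits stored per lane
constexpr uint32_t kLimbMask = (1u << kLimbBits) - 1;

// limbs[0] is the least significant limb; each entry is < 2^31.
using Limbs = std::array<uint32_t, kLanes>;

// Add two multi-limb numbers, dropping the carry out of the top limb.
inline Limbs add_limbs(const Limbs& a, const Limbs& b) {
    Limbs sum{};

    // Step 1: every lane adds its own limbs independently. Both inputs
    // are below 2^31, so the 32-bit sum cannot wrap around.
    for (int lane = 0; lane < kLanes; ++lane) {
        assert(a[lane] <= kLimbMask && b[lane] <= kLimbMask);
        sum[lane] = a[lane] + b[lane];
    }

    // Step 2: pass carries upward. Bit 31 of each partial sum is the
    // carry for the next lane; adding an incoming carry of at most 1
    // still fits in 32 bits because the partial sum is at most 2^32 - 2.
    uint32_t carry = 0;
    for (int lane = 0; lane < kLanes; ++lane) {
        uint32_t t = sum[lane] + carry;
        carry = t >> kLimbBits;
        sum[lane] = t & kLimbMask;
    }
    return sum;
}
```

With a full 32 bits per lane, step 1 would already lose the carry to wraparound and each lane would have to recover it with an extra comparison, which is the cost the 31-bit layout avoids.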
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...