Hi! I recently published a blog post about this technique (using SIMD-like warp intrinsics), and I wanted to share the runnable code as well. It should be fairly portable.
tl;dr: On an RTX 3090, a warp-wide (32-way) bitonic merge sort is 30-40% faster than using shared memory (the on-chip storage that doubles as L1 cache), and around 50% faster than naively using global memory.
(Let me know if this technically counts as a repost; I don't want to break any rules.)
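
For anyone who hasn't seen the idea before, here is a minimal sketch of a warp-wide bitonic sort built on `__shfl_xor_sync`. It is an illustration under my own assumptions, not the benchmarked code: the names `warp_bitonic_sort`, `sort32` and `sort32_on_device` are made up for this example, it sorts exactly 32 `int`s (one per lane) in ascending order, and it skips CUDA error checking on everything except the final sync.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each of the 32 lanes holds one value in a register. Lanes exchange values
// with __shfl_xor_sync, so no shared or global memory is touched while sorting.
__device__ int warp_bitonic_sort(int v) {
    const unsigned lane = threadIdx.x & 31u;
    for (unsigned k = 2; k <= 32; k <<= 1) {         // length of the sequences being merged
        for (unsigned j = k >> 1; j > 0; j >>= 1) {  // distance to the partner lane
            int other = __shfl_xor_sync(0xffffffffu, v, j);
            bool ascending = (lane & k) == 0;  // sort direction of this lane's block
            bool lower = (lane & j) == 0;      // this lane is the lower index of the pair
            // Lower lane of an ascending pair keeps the min; the roles flip
            // for the upper lane and flip again for descending blocks.
            v = (ascending == lower) ? min(v, other) : max(v, other);
        }
    }
    return v;
}

// Hypothetical demo kernel: one warp, one element per thread.
__global__ void sort32(int *data) {
    data[threadIdx.x] = warp_bitonic_sort(data[threadIdx.x]);
}

// Hypothetical host helper: sorts h[0..31] in place on the GPU.
// Returns 0 on success, otherwise the cudaError_t value as an int.
int sort32_on_device(int *h) {
    int *d = nullptr;
    cudaMalloc(&d, 32 * sizeof(int));
    cudaMemcpy(d, h, 32 * sizeof(int), cudaMemcpyHostToDevice);
    sort32<<<1, 32>>>(d);
    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h, d, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return static_cast<int>(err);
}

int main() {
    int h[32];
    for (int i = 0; i < 32; ++i) h[i] = (i * 37 + 11) % 101;  // arbitrary test input
    assert(sort32_on_device(h) == 0);
    for (int i = 1; i < 32; ++i) assert(h[i - 1] <= h[i]);
    return 0;
}
```

It should build with something like `nvcc -o warp_sort warp_sort.cu`. The point is that every compare-exchange step is a register-to-register shuffle across the full warp, which is why there are no `__syncthreads()` calls and no shared-memory traffic.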