Show HN: Faster sorting with register shuffling in CUDA (github.com/wiwa)
3 points by winwang on March 15, 2024
Hi! I recently published a blog post about this technique (using SIMD-like warp intrinsics), and I also wanted to share the runnable code. It should be fairly portable.

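tl;dr: On an RTX 3090, a warp-wide (32-way) bitonic mergesort that exchanges values with register shuffles is 30%-40% faster than one that goes through shared memory (which shares on-chip storage with the L1 cache), and around 50% faster than one that naively uses global memory.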
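For readers who have not seen the technique, here is a minimal sketch of a warp-wide bitonic sort built on `__shfl_xor_sync`. It is not the code from the linked repository; the names `warp_bitonic_sort`, `sort32_kernel`, and `sort32_on_gpu` are hypothetical, and it assumes exactly 32 `int` keys sorted by a single full warp. Each lane keeps one key in a register and swaps it with a partner lane directly, so the compare-exchange steps never touch shared or global memory.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts 32 keys held one per lane across a full warp, ascending by lane index.
// Assumption: all 32 lanes of the warp are active when this is called.
__device__ int warp_bitonic_sort(int key) {
    const unsigned full_mask = 0xffffffffu;
    const int lane = threadIdx.x & 31;
    // k is the size of the bitonic sequences being merged in this stage.
    for (int k = 2; k <= 32; k <<= 1) {
        // j is the lane distance between compare-exchange partners.
        for (int j = k >> 1; j > 0; j >>= 1) {
            // Read the partner lane's key straight from its register.
            int partner = __shfl_xor_sync(full_mask, key, j);
            bool ascending = (lane & k) == 0;  // direction of this lane's block
            bool lower = (lane & j) == 0;      // is this the lower lane of the pair?
            // The lower lane of an ascending pair (or the upper lane of a
            // descending pair) keeps the smaller key; the other keeps the larger.
            bool keep_min = (ascending == lower);
            key = keep_min ? min(key, partner) : max(key, partner);
        }
    }
    return key;
}

// One warp: load a key per lane, sort in registers, store the result.
__global__ void sort32_kernel(int* data) {
    int key = data[threadIdx.x];
    key = warp_bitonic_sort(key);
    data[threadIdx.x] = key;
}

// Host helper (hypothetical): sorts exactly 32 ints in place on the GPU.
// Returns false on a size mismatch or any CUDA error.
bool sort32_on_gpu(std::vector<int>& keys) {
    if (keys.size() != 32) return false;
    const size_t bytes = 32 * sizeof(int);
    int* d_keys = nullptr;
    if (cudaMalloc(&d_keys, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(d_keys, keys.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sort32_kernel<<<1, 32>>>(d_keys);
        ok = cudaGetLastError() == cudaSuccess;
    }
    if (ok) {
        ok = cudaMemcpy(keys.data(), d_keys, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_keys);
    return ok;
}
```

To try it, fill a `std::vector<int>` with 32 values, call `sort32_on_gpu`, and compare against `std::sort` on the host. A shared-memory variant would replace the shuffle with a store, a barrier, and a load per step, which is presumably where the difference in the timings above comes from.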

(Let me know if this technically counts as a repost; I don't want to break the rules.)



