I don’t know enough about these implementations to know if this can be interpret...

dahart · 2025-02-09T17:44:36

A GPU/SIMT branch works by running both sides, unless all threads in the thread group (warp/wavefront) make the same branch decision. As long as both paths have at least one thread, the GPU will run both paths sequentially and simply set the active mask of threads for each side of the branch. In other words, the threads that don’t take a given branch sit idle while the active threads do their work. (Note “sit idle” might involve doing all the work and throwing away the result.)
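To make the active-mask idea concrete, here is a minimal sketch in plain C++ that models one warp on the CPU. It is a toy model, not how any particular GPU is implemented: the 32-lane width, the `run_branch` function, the `WarpResult` struct and the two arithmetic "sides" are all assumptions made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed warp width for this toy model; some GPUs use 64-wide wavefronts.
constexpr int kWarpSize = 32;

struct WarpResult {
    std::array<int, kWarpSize> values;  // per-lane output
    int passes;                         // how many sides of the branch were issued
};

// Bit for a given lane in a 32-bit mask (hypothetical helper).
inline std::uint32_t lane_bit(int lane) {
    assert(lane >= 0 && lane < kWarpSize);
    return 1u << lane;
}

// Toy model of `if (cond) v = v * 2; else v = v + 100;` executed by one warp.
// cond_mask has bit i set when lane i takes the "then" side.
inline WarpResult run_branch(const std::array<int, kWarpSize>& input,
                             std::uint32_t cond_mask) {
    WarpResult r{input, 0};
    const std::uint32_t then_mask = cond_mask;
    const std::uint32_t else_mask = ~cond_mask;

    // A side is issued only if at least one lane wants it.
    if (then_mask != 0) {
        ++r.passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            // Lanes outside the mask are idle: their result is not written.
            if (then_mask & lane_bit(lane)) r.values[lane] = input[lane] * 2;
        }
    }
    if (else_mask != 0) {
        ++r.passes;
        for (int lane = 0; lane < kWarpSize; ++lane) {
            if (else_mask & lane_bit(lane)) r.values[lane] = input[lane] + 100;
        }
    }
    return r;
}
```

In this model, `run_branch(input, 0xFFFFFFFFu)` and `run_branch(input, 0u)` each report one pass, while a single dissenting lane (for example `cond_mask == 1u`) makes the whole warp pay for both passes.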

If you have two branches, and one is trivial while the other is expensive, and if the compiler doesn’t optimize away the branch already, it may be better for performance to write the code to take both branches unconditionally, and use a conditional assignment at the end.
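A rough sketch of what "take both branches unconditionally and use a conditional assignment" can look like, written as plain C++ so it runs anywhere. The functions `cheap_path`, `expensive_path`, `shade_branchy` and `shade_select` are hypothetical stand-ins, not from any real codebase. Whether the final `?:` becomes a select instruction or a branch is still up to the compiler.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the two sides of the branch.
inline float cheap_path(float x) { return x + 1.0f; }
inline float expensive_path(float x) { return std::sqrt(x * x + 1.0f) * 0.5f; }

// Branchy form: each thread evaluates only the side it needs, but a
// diverged warp still ends up issuing both sides one after the other.
inline float shade_branchy(float x, bool use_expensive) {
    if (use_expensive) {
        return expensive_path(x);
    }
    return cheap_path(x);
}

// Select form: compute both results unconditionally, then pick one with a
// conditional assignment at the end.
inline float shade_select(float x, bool use_expensive) {
    const float a = cheap_path(x);
    const float b = expensive_path(x);
    return use_expensive ? b : a;
}

// Sanity check that the two forms agree for a given input.
inline void check_same(float x) {
    assert(shade_branchy(x, true) == shade_select(x, true));
    assert(shade_branchy(x, false) == shade_select(x, false));
}
```

The select form is only safe when both sides are free of side effects and are valid to evaluate for every input. Whether it is actually faster depends on the hardware and on how often the warp diverges, so it is worth measuring.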

It’s worth knowing that often there are clever techniques to completely avoid branching. Sometimes these techniques are simple, and sometimes they’re invasive and difficult to implement. It’s easy (for me, anyway) to get stuck thinking in a single-threaded CPU way and not see how to avoid branching until you’ve bumped into and seen some of the ways smart people solve these problems.

TinkersW · 2025-02-09T14:18:05

A real branch is useful if you can realistically skip a bunch of work, but this requires all the lanes to agree. On a GPU, that means all 32 to 64 lanes have to make the same decision. And for something basic like a few arithmetic ops, there is no point.