While I agree, that CUDA is the best in class API for GPU programming, OpenCL, V...

keldaris · 2025-04-11T20:12:25 1744402345

How are you writing compute shaders that work on all platforms, including Mac? Are you just writing Vulkan and relying on MoltenVK?

AFAIK, the only solution that actually works on all major platforms without additional compatibility layers today is OpenCL 1.2 - which also happens to be officially deprecated on MacOS, but still works for now.

coffeeaddict1 · 2025-04-11T21:14:00 1744406040

Yes, MoltenVK works fine. Alternatively, you can also use WebGPU (there are C++ and Rust native libs) which is a simpler but more limiting API.

keldaris · 2025-04-11T21:24:20 1744406660

WebGPU has no support for tensor cores (or their Apple Silicon equivalents). Vulkan has an Nvidia extension for it, is there any way to make MoltenVK use simdgroup_matrix instructions in compute shaders?

coffeeaddict1 · 2025-04-11T21:39:15 1744407555

AFAIK, MoltenVK doesn't. Dawn (Google's C++ WebGPU implementation) does have some experimental support for it [0][1].

[0] https://issues.chromium.org/issues/348702031

[1] https://github.com/gpuweb/gpuweb/issues/4195

pjmlp · 2025-04-11T21:11:19 1744405879

And is stuck with C99, versus C++20, Fortran, Julia, Haskell, C#, anything else someone feels like targeting PTX with.

keldaris · 2025-04-11T21:22:19 1744406539

Technically, OpenCL can also include inline PTX assembly in kernels (unlike any compute shader API I've ever seen), which is relevant for targeting things like tensor cores. You're absolutely right about the language limitation, though.

pjmlp · 2025-04-12T08:20:32 1744446032

At which point why bother, PTX is CUDA.

keldaris · 2025-04-12T17:33:38 1744479218

Generally, the reason to bother with this approach is if you have a project that only needs tensor cores in a tiny part of the code and otherwise benefits from the cross platform nature of OpenCL, so you have a mostly shared codebase with a small vendor-specific optimization in a kernel or two. I've been in that situation and do find that approach valuable, but I'll be the first to admit the modern GPGPU landscape is full of unpleasant compromises whichever way you look.

pjmlp · 2025-04-11T21:09:11 1744405751

No they aren't, because they lack the polyglot support from CUDA and as you acknowledge the debugging experience and ecosystem sucks.

fragmede · 2025-04-11T20:13:45 1744402425

why do you need to run across all those platforms? what's the cost benefit for doing so?

coffeeaddict1 · 2025-04-11T21:18:57 1744406337

Well it really depends on the kind of work you're doing. My (non-AI) software allows users to run my algorithms on whatever server-side GPU or local device they have. This is a big advantage IMO.

fragmede · 2025-04-11T23:09:01 1744412941

interesting! Can you say more about what kind of algorithms your software runs?

coffeeaddict1 · 2025-04-12T08:17:15 1744445835

My work is primarily about the processing of medical images (which usually are large 3D images). Doing this on the GPU, can be up to 10-20x faster.

fragmede · 2025-04-12T19:33:27 1744486407

But what about that wants to be multi-platform instead of picking one an specializing, probably picking up some more optimizations along the way?