Weird. A library that wraps SYCL within MPI, yet requires all processes to hold a copy of all the memory?
One of the main reasons to use MPI is to solve problems that do not fit within the memory available in a single cluster node.
There is a presentation [0] from 2020-02-20 that's not very impressive. Particularly, they compare against MPI+OpenCL, but do not show a comparison against MPI-CUDA.
Doing a distributed MatMul using MPI-CUDA is trivial. I wonder how the cyclomatic complexity and performance compare for that case. Instead, they only compare doing one MatMul per process using MPI-OpenCL... that's... a two-liner with CUDA (just call MPI_Init followed by a cuBLAS call).
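Roughly what that per-rank version amounts to (a minimal sketch with placeholder sizes and data; each rank just multiplies its own local matrices, with no inter-rank communication):

    // One local SGEMM per MPI rank, no communication between ranks.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        const int N = 1024;  // placeholder matrix size
        std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f), hC(N * N);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, N * N * sizeof(float));
        cudaMalloc(&dB, N * N * sizeof(float));
        cudaMalloc(&dC, N * N * sizeof(float));
        cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C = A * B on this rank's data only (cuBLAS assumes column-major storage).
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
        cudaMemcpy(hC.data(), dC, N * N * sizeof(float), cudaMemcpyDeviceToHost);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        MPI_Finalize();
        return 0;
    }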
I'm honestly much more worried about NVIDIA's overwhelming success in buying their way into the entire scientific community.
CUDA is proprietary, not standardised, and tied to a single hardware vendor. Technical issues aside, I applaud any effort to break the dominance of CUDA. CUDA is a pest that needs to either open up or be replaced.
HPC is a very heterogeneous environment (on the whole) and making it easier to use the full potential of the hardware is understandable. Also keep in mind that it's a research project and very early in its development.
> CUDA is a pest that needs to either open up or be replaced.
My main problem with CUDA is not that it is un-free; it is that the source code is unavailable, and consequently you have to deal with a badly packaged, badly supported blob that works only with a restricted set of compilers.
The consequence is that every piece of software depending on CUDA is a pain in the butt to distribute. Even major projects like PyTorch or TensorFlow need separate releases just because of CUDA.
It is particularly sad that Nvidia in 2020 has not understood what Mellanox understood 10 years ago: that successful leadership in hardware is compatible with an OSS software stack.
Thank Khronos for being too focused on C and expecting partners to develop any kind of tooling or SDKs.
It took the CUDA bulldozer for them to consider giving C++ more love and introducing bytecode formats that allow more languages to easily target Khronos standards.
Incidentally, those were also the biggest complaints about Vulkan in the last LunarG survey: tooling and documentation.
> I'm honestly much more worried about NVIDIA's overwhelming success in buying their way into the entire scientific community.
They bought their way by providing software along with the hardware, and by paying developers and researchers to work full-time on improving the performance of user applications on their hardware.
If you are a moderately important HPC application, the chances are quite high that there is an NVIDIA engineer working full-time on making it perform well on GPUs.
What's worrying is that Intel and AMD haven't realized, 20 years later, that hardware without software is useless. Even if they were to make GPUs that outperform NVIDIA's hardware in raw numbers by a 10x factor, NVIDIA's GPUs would still probably be the better product to buy because of the software and support you get.
> Weird. A library that wraps SYCL within MPI, yet requires all processes to hold a copy of all the memory?
> One of the main reasons to use MPI is to solve problems that do not fit within the memory available in a single cluster node.
You are absolutely right; this is a problem that Celerity currently has. However, I'm happy to say that we are actively working on a solution that should cover a good portion of use cases and will hopefully be available quite soon!
Of course, as was already pointed out in this thread (and on the website), this is a research project, and we only have limited resources. If you look a little closer, I'm sure you will find all sorts of issues ;-). However, we are committed to continuously improving Celerity, and ultimately strive to build a robust and modern HPC programming framework that is usable by non-experts (i.e., domain scientists) while also delivering solid performance.
> Doing a distributed MatMul using MPI-CUDA is trivial. I wonder how the cyclomatic complexity and performance compare for that case.
I would assume the programmability metrics to be somewhere in between OpenCL and SYCL. CUDA is a bit less verbose than OpenCL, but you still have to manually deal with MPI (which is also the main factor for MPI+SYCL having much higher cyclomatic complexity than Celerity). Performance-wise I would also not expect much difference, given that we are talking about naive matmul here.
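For a sense of what "manually deal with MPI" means here, this is roughly the plumbing a hand-written MPI+SYCL/CUDA version needs for a row-partitioned distributed C = A * B. It is illustrative only: the partitioning scheme and names are not taken from the benchmark, and the local compute is a plain loop standing in for the device kernel.

    // Illustrative MPI plumbing for a row-partitioned distributed C = A * B:
    // scatter rows of A, broadcast B, compute the local block, gather rows of C.
    #include <mpi.h>
    #include <vector>

    void distributed_matmul(const std::vector<float>& A,  // N x N, valid on rank 0
                            const std::vector<float>& B,  // N x N, valid on rank 0
                            std::vector<float>& C,        // N x N, gathered on rank 0
                            int N, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        const int rows = N / size;  // assumes N is divisible by the number of ranks

        std::vector<float> localA(rows * N), localB(N * N), localC(rows * N, 0.0f);
        MPI_Scatter(A.data(), rows * N, MPI_FLOAT,
                    localA.data(), rows * N, MPI_FLOAT, 0, comm);
        if (rank == 0) localB = B;
        MPI_Bcast(localB.data(), N * N, MPI_FLOAT, 0, comm);

        // Local compute: a plain loop here; in practice a SYCL kernel or cuBLAS call.
        for (int i = 0; i < rows; ++i)
            for (int k = 0; k < N; ++k)
                for (int j = 0; j < N; ++j)
                    localC[i * N + j] += localA[i * N + k] * localB[k * N + j];

        MPI_Gather(localC.data(), rows * N, MPI_FLOAT,
                   C.data(), rows * N, MPI_FLOAT, 0, comm);
    }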
In general, we're not trying to beat [insert your favorite BLAS library]. At this level of abstraction, that would be pretty much impossible. We are showing off results for matmul because everybody knows it and has at least somewhat of an understanding of the MPI operations required to do it in a distributed setting. You can see it as a stand-in for whatever domain-specific algorithm you want to run in a distributed setting.
> Instead, they only compare doing one MatMul per process using MPI-OpenCL... that's... a two-liner with CUDA (just call MPI_Init followed by a cuBLAS call).
That's not quite true. What we showed here are several successive matrix multiplications, each one being computed in distributed fashion across all participating GPUs. The splitting of work as well as all intermediate data transfers happen completely transparently to the user (save for having to provide a "range mapper"), which I think is one of the main selling points of the project!
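To make the range-mapper point a bit more concrete, here is a sketch loosely based on Celerity's public matmul example (API spelling is from memory and may differ between Celerity versions): the slice and one_to_one range mappers declare which parts of each buffer a kernel chunk reads or writes, and the runtime derives the work splitting and data transfers from that.

    // Sketch loosely based on Celerity's matmul example (ca. 2020);
    // API details are approximate and may differ between versions.
    #include <celerity.h>

    constexpr size_t MAT_SIZE = 1024;

    void multiply(celerity::distr_queue& queue,
                  celerity::buffer<float, 2>& mat_a,
                  celerity::buffer<float, 2>& mat_b,
                  celerity::buffer<float, 2>& mat_c) {
        queue.submit([=](celerity::handler& cgh) {
            // Range mappers: which parts of each buffer a given chunk needs.
            auto a = mat_a.get_access<cl::sycl::access::mode::read>(
                cgh, celerity::access::slice<2>(1));       // full rows of A
            auto b = mat_b.get_access<cl::sycl::access::mode::read>(
                cgh, celerity::access::slice<2>(0));       // full columns of B
            auto c = mat_c.get_access<cl::sycl::access::mode::discard_write>(
                cgh, celerity::access::one_to_one<2>());   // one output element each
            cgh.parallel_for<class mat_mul>(
                cl::sycl::range<2>(MAT_SIZE, MAT_SIZE), [=](cl::sycl::item<2> item) {
                    float sum = 0.f;
                    for (size_t k = 0; k < MAT_SIZE; ++k) {
                        sum += a[{item[0], k}] * b[{k, item[1]}];
                    }
                    c[item] = sum;
                });
        });
    }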
Fun fact: I did a port of Franz Lisp[1] to the Celerity, having the compiler (liszt) generate code for the backend of the C compiler, since they told us: "in no way will you be able to generate assembly code out of your compiler for this architecture." They were an early Sun Microsystems competitor that didn't make it.
[1] Not to be confused with Franz's Common Lisp, this is the MacLisp compatible Lisp from UC Berkeley.
[0] https://www.uibk.ac.at/fz-hpc/events/resources/2020_02_20_ah...