Julia GPU (notamonadtutorial.com)
131 points by cdsousa on Oct 20, 2020 | 26 comments



I was really confused for a moment, because the article mentions CUDA a lot, which is an Nvidia-specific API/framework/language. I guess that's mainly to appeal to the CUDA crowd? However, since Julia seems to be built on LLVM, interfacing it with AMD GPUs should be quite doable:

> Much of the initial work focused on developing tools that make it possible to write low-level code in Julia. For example, we developed the LLVM.jl package that gives us access to the LLVM APIs. Recently, our focus has shifted towards generalizing this functionality so that other GPU back-ends, like AMDGPU.jl or oneAPI.jl, can benefit from developments to CUDA.jl. Vendor-neutral array operations, for example, are now implemented in GPUArrays.jl, whereas shared compiler functionality now lives in GPUCompiler.jl. That should make it possible to work on several GPU back-ends, even though most of them are maintained by only a single developer.

This is the takeaway for me: first-class access to GPU accelerators using the same syntax, regardless of the vendor :)


Yes, it would be great if the .jl source code didn't even mention CUDA (but right now it does, with statements such as "using CUDA" and "CuArray(...)").


Yes, that's fair. I focused on CUDA.jl because it is the most mature, easiest to install, etc., but as I mentioned, we're actively working on generalizing that support as much as possible, and as a result support for AMD (AMDGPU.jl) and Intel (oneAPI.jl) GPUs is rapidly catching up.


This is a complete novice, ill-informed question, so forgive it in advance, but why have an AMD-specific backend at all? Couldn't you just use AMD's HIP/HIPIFY tool on the CUDA backend and get an AMD-friendly version out?

https://github.com/ROCm-Developer-Tools/HIP

I realize these sorts of tools aren't magic and whatever it spits out will need work, but it seems like a really good, thin starting place for AMD support with a lower overhead for growth.

Once the original CUDA bits can "cross-compile", the workflow is greatly simplified, right?

Workflow:

- update CUDA code

- push through the HIPIFY tool

- Fix what is broken (if you can fix it on the CUDA side)

After enough iterations, the CUDA code will grow friendly to HIPification...


> This is a complete novice, ill-informed question, so forgive it in advance, but why have an AMD-specific backend at all? Couldn't you just use AMD's HIP/HIPIFY tool on the CUDA backend and get an AMD-friendly version out?

HIP and HIPify only work on C++ source code, via a Perl script. Since we start with plain Julia code, and we already have LLVM integrated into Julia's compiler, it's easiest to just change the LLVM "target" from Native to AMDGPU (or NVPTX in CUDA.jl's case) to get native machine code, while preserving Julia's semantics for the most part.
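
To make that concrete, here is a rough sketch (assuming CUDA.jl is installed and an Nvidia GPU is available) of how a plain Julia function can be inspected as device code; CUDA.jl ships reflection macros like @device_code_ptx for exactly this, and AMDGPU.jl has analogous macros for its target:

    using CUDA
    # A trivial kernel written in plain Julia.
    function add_one!(a)
        i = threadIdx().x
        if i <= length(a)
            a[i] += 1f0
        end
        return nothing
    end
    a = CUDA.ones(Float32, 32)
    # Print the NVPTX code that Julia's compiler + LLVM generate for this launch.
    @device_code_ptx @cuda threads=32 add_one!(a)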

Also, interfacing to ROCR (AMD's implementation of the Heterogeneous System Architecture or HSA runtime) was really easy when I first started on this, and codegen through Julia's compiler and LLVM is trivial when you have CUDAnative.jl (CUDA.jl's predecessor) to look at :)

I should also mention that not everything CUDA does maps well to AMD GPUs; CUDA's streams are generally in-order (blocking), whereas AMD's queues are non-blocking unless barriers are scheduled. Also, things like hostcall (calling a CPU function from the GPU) don't have an obvious equivalent in CUDA.


Thank you for taking the time! I found this quite helpful.


Something that is hinted at, but not spelled out, in our posts is that AMD actively upstreams and maintains an LLVM back-end for their GPUs, so it really is a matter of switching the binary target for the generated code, at least in theory :)


> From a Python programmer perspective, how does CUDA.jl compare to PyCUDA?

I think the relevant comparison today is with Numba. Here's a kernel from a real-world recurrence analysis:

    import math
    import numba as nb
    from numba import cuda
    @cuda.jit
    def _sseij(Y, I, J, O):
        # strides
        sty = cuda.blockDim.x
        sbx = sty * cuda.blockDim.y
        sby = sbx * cuda.gridDim.x
        sbz = sby * cuda.gridDim.y
        # this thread's index
        t = (cuda.threadIdx.x
           + cuda.threadIdx.y * sty
           + cuda.blockIdx.x * sbx
           + cuda.blockIdx.y * sby
           + cuda.blockIdx.z * sbz)
        if t < I.size:
            i = I[t]
            j = J[t]
            x = nb.float32(0.0)
            for k in range(Y.shape[0]):
                x += (Y[k, i] - Y[k, j])**2
            O[t] = math.sqrt(x)
Most of it is index calculation, but it's super easy.
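
For comparison, a rough CUDA.jl translation of the same kernel might look like this (untested sketch; note that Julia indexing and threadIdx()/blockIdx() are 1-based):

    using CUDA
    function _sseij!(Y, I, J, O)
        # strides
        sty = blockDim().x
        sbx = sty * blockDim().y
        sby = sbx * gridDim().x
        sbz = sby * gridDim().y
        # this thread's (1-based) linear index
        t = (threadIdx().x
           + (threadIdx().y - 1) * sty
           + (blockIdx().x - 1) * sbx
           + (blockIdx().y - 1) * sby
           + (blockIdx().z - 1) * sbz)
        if t <= length(I)
            i = I[t]
            j = J[t]
            x = 0.0f0
            for k in 1:size(Y, 1)
                x += (Y[k, i] - Y[k, j])^2
            end
            O[t] = sqrt(x)
        end
        return nothing
    end
    # Launch, e.g.: @cuda threads=256 blocks=cld(length(I), 256) _sseij!(Y, I, J, O)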


Does numba.cuda allow passing buffers between kernels now? Or does it still pretty much require you to write a single superkernel in a buggy, difficult-to-debug language subset?

C isn't usually a breath of fresh air, but the last two times I tried to use numba.cuda and failed (~1.5 years ago), it sure felt that way.

EDIT: yes, looks like it supports on-device buffers now. I'm still in "once bitten, twice shy" mode on account of the bugs and debug story, but I'm cautiously optimistic.


I'm not sure I understand the bounds checking example. Can someone shed some light on how CUDA.jl does bounds checking? It's not entirely straightforward to do on a GPU.


Since we use fat array objects, and not raw pointers, we know the size of the array and can perform bounds checks at run time. We then have a mechanism to throw an exception and signal it to the CPU to display it there. That's obviously quite expensive, so you can disable it with that annotation (the Julia debug setting also controls the granularity, and thus how expensive the exception handling is). It's fairly primitive, i.e. no full-featured exception handling (for now), but has proven very useful already.
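
For illustration, a minimal sketch (assuming CUDA.jl): out-of-bounds indexing into a device array throws an error that gets reported back on the CPU, and the @inbounds annotation is what turns the check off:

    using CUDA
    function copy_kernel!(dst, src)
        i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
        if i <= length(dst)
            @inbounds dst[i] = src[i]  # @inbounds skips the (expensive) bounds check
        end
        return nothing
    end
    dst = CUDA.zeros(Float32, 1024)
    src = CUDA.rand(Float32, 1024)
    @cuda threads=256 blocks=4 copy_kernel!(dst, src)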


How do you terminate the CUDA kernel when a bounds violation is encountered by a single thread? I don't think the CUDA API exposes a mechanism to do that safely.


You can emit `trap` or `exit` in the PTX code (although that has exposed many bugs in the PTX assembler because it does not expect that kind of often divergent control flow). But even if you'd just have the kernel return and otherwise produce invalid results, the fact that you can then report a bounds error instead of silently corrupting data and/or generating a fatal ERROR_ILLEGAL_ACCESS (requiring an application restart) is a significant usability improvement.


The Nvidia-proprietary bit does give HLLs like Futhark or DSLs like Neanderthal a bit of an advantage. Are there ways of doing portable GPU things with Julia?


The Julia array abstractions make it so that most code is vendor-neutral already, and you execute on whatever GPU back-end you want by using the appropriate array type. For vendor-neutral kernel programming there's GPUArrays.jl and KernelAbstractions.jl, but neither is currently very user-friendly (both are actively used as building blocks for user-facing applications and APIs, though).
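
As a rough illustration, the same generic Julia function runs on whichever back-end's array type you hand it (sketch, assuming the respective packages and hardware are available):

    # Generic code: nothing here is vendor-specific.
    sumsq(A) = sum(abs2, A)
    using CUDA        # Nvidia
    sumsq(CuArray(rand(Float32, 1_000)))
    using AMDGPU      # AMD
    sumsq(ROCArray(rand(Float32, 1_000)))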


JuliaGPU appears to be a textbook example of 'how to present a bug as a feature'.

"You're a CUDA programmer. Well, why don't you learn it all over again the Julia way, and also unlearn CUDA, so the next time you have to program an nVidia GPU, you don't have a choice but to do it in Julia"

Vendor lock-in FTW.


Most of the Julia GPU stuff is being made vendor-agnostic by moving infrastructure from CUDA.jl to GPUCompiler.jl. The AMDGPU.jl package is coming along very well and will be pluggable with all this stuff.

https://github.com/JuliaGPU/GPUCompiler.jl

https://github.com/JuliaGPU/AMDGPU.jl


Tim has been working on making it easy to use CUDA.jl and AMDGPU.jl pretty interchangeably through GPUArrays.jl, and this approach seems to be pretty extensible to other accelerators like Intel's dGPUs. KernelAbstractions.jl will also be gaining AMDGPU.jl support soon, so it'll be easy to write generic kernels without buying into a single vendor's cards.
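
A rough sketch of what a vendor-agnostic kernel looks like with KernelAbstractions.jl (the launch API has shifted between versions, so treat this as illustrative):

    using KernelAbstractions
    # The kernel body is written once, with no vendor-specific intrinsics.
    @kernel function mul2!(A)
        I = @index(Global)
        A[I] = 2 * A[I]
    end
    # Instantiate for a back-end (CPU here; a GPU back-end works the same way)
    # and launch over the whole array.
    A = ones(Float32, 1024)
    kernel = mul2!(CPU(), 64)
    event = kernel(A, ndrange=length(A))
    wait(event)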



So we're freed from GPU vendor lock-in by getting a programming language lock-in.

Out of the frying pan, into the fire. Great.


I do not get what you are trying to say. Someone is developing an open-source solution that lets you use a vast array of different numerical accelerators more efficiently and with simpler code. All the while, the work is done in the open, and you can copy any subset of it if you want to make your own tool. Why is any of this bad?


What do you mean? What is programming language lock-in, and how is that worse than (or even comparable to) vendor lock-in? What's your goal?


What is programming language lock-in?

When any code is written in some language?


Maybe the issue raised is that libraries/APIs that define an ABI (Application Binary Interface) can be used from multiple programming languages that share calling conventions. So one question is how Julia libraries can be called from other programming languages. The "Embedding Julia" chapter of the manual says that from C there is an interface layer to look up function objects, box parameters, and make the call.
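
For the other direction, a rough sketch (the function name here is just illustrative): Julia can mark a function with Base.@ccallable to give it a C-compatible entry point, which a shared library built with PackageCompiler.jl can then export to any language that speaks the C ABI.

    # Mark a Julia function as callable through the C calling convention.
    # Typically compiled into a shared library (e.g. with PackageCompiler.jl),
    # after which C, or anything with a C FFI, can call `increment`.
    Base.@ccallable function increment(x::Cint)::Cint
        return x + Cint(1)
    end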


As opposed to being trapped in the C++ CUDA API? Julia isn't that much harder to learn than Python. Does anyone really think the hard part of, say, deep learning is learning Python or PyTorch?


I don't think there is anything in our programming world that is programming language agnostic, except for theory.



