Ask HN: Resources for GPU Compilers?
74 points by zvikinoza 60 days ago | 21 comments
Hi folks, I've done some CPU compiler work for x86, RISC-V, and LLVM (mostly for fun and some for profit). I'm eager to learn more about GPU/TPU compilers and have been looking into Triton and XLA. What resources would be useful for learning HPC and/or GPU compilers in depth (or any adjacent areas)?

Any books, courses, etc. will be much appreciated.




Newer editions of Computer Organization and Design: The Hardware/Software Interface cover GPUs [1]

Multiflow still has some relevant ideas [2]

Programming on Parallel Machines: GPU, Multicore, Clusters and More gives you a look at some of the issues [3]

SPIRV-VM is a virtual machine for executing SPIR-V shaders [4]

NyuziRaster: Optimizing Rasterizer Performance and Energy in the Nyuzi Open Source GPU [5]

Ocelot is a modular dynamic compilation framework for heterogeneous systems, providing various backend targets for CUDA programs and analysis modules for the PTX virtual instruction set. [6]

glslang is the Khronos-reference front end for GLSL/ESSL, partial front end for HLSL, and a SPIR-V generator. [7]

[1]: https://www.goodreads.com/book/show/83895.Computer_Organizat...

[2]: https://en.wikipedia.org/wiki/Multiflow

[3]: http://heather.cs.ucdavis.edu/parprocbook

[4]: https://github.com/dfranx/SPIRV-VM

[5]: https://www.cs.binghamton.edu/~millerti/nyuziraster.pdf

[6]: https://code.google.com/archive/p/gpuocelot/

[7]: https://github.com/KhronosGroup/glslang


And a few more:

Pixel Planes/Pixel Flow [1]

The Geometry Engine: A VLSI Geometry System for Graphics [2]

Tim Purcell's research [3]

BrookGPU [4]

GRAMPS: A Programming Model for Graphics Pipelines [5]

[1]: https://www.cs.unc.edu/~pxfl/

[2]: https://graphics.stanford.edu/courses/cs148-10-summer/docs/1...

[3]: http://graphics.stanford.edu/~tpurcell/

[4]: http://graphics.stanford.edu/projects/brookgpu/

[5]: http://graphics.stanford.edu/papers/gramps-tog/


Wow, thanks! Some super interesting resources, some of which I hadn't even heard of.


Just some links I had lying around. Well, at least the ones that still worked.

A major problem I had was register pressure, even with a decent-sized register file. With memory accesses in the hundreds-of-cycles range, it was very difficult to fill those 'free' cycles while waiting for data to be loaded. Double buffering data really just cut the effective size of the register file in half, making everything worse. That did lead to some interesting ways of optimising code to reduce temp storage over execution time.
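
By way of illustration, here's a minimal CUDA sketch of that double-buffering tradeoff (illustrative only; TILE, the x2.0 "compute" step, and the single-block launch are all assumptions, not anyone's real code):

    #include <cuda_runtime.h>

    #define TILE 256

    // Double buffering: buf[] holds two tiles, so it halves the
    // effective on-chip storage, which is exactly the cost described
    // above. Assumes one block of TILE threads and n a multiple of TILE.
    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float buf[2][TILE];
        int tid = threadIdx.x;
        int cur = 0;

        buf[cur][tid] = in[tid];  // preload the first tile
        __syncthreads();

        for (int base = TILE; base < n; base += TILE) {
            buf[cur ^ 1][tid] = in[base + tid];            // fetch the next tile...
            out[base - TILE + tid] = buf[cur][tid] * 2.0f; // ...while using the current one
            __syncthreads();
            cur ^= 1;
        }
        out[n - TILE + tid] = buf[cur][tid] * 2.0f;  // drain the last tile
    }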

Another problem was accessing scattered memory locations and getting poor cache usage, which left even larger gaps to try to fill with something useful.

This was some time ago now, so the tradeoffs will be different: different types of RAM and transistor budgets.


If you're already familiar with compilers in the abstract, start by implementing some high-level solutions leveraging a GPU, some low-level performance-optimized kernels, and build up a bit of intuition. The high-level code depends on your goals, but for low-level code maybe try optimizing something equivalent to a binary tree (both with leaves smaller and larger than 4 KB), something benefiting from operator fusion (e.g., a matmul followed by an element-wise exponential), and something benefiting from deeply understanding the memory hierarchy (e.g., multiplying two very large square matrices, and also inner/outer producting two very narrow matrices).
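
As a toy stand-in for the fusion exercise (axpb/expk are made-up names, and fusing an exp into a real matmul epilogue involves more machinery, but the principle is the same):

    #include <cuda_runtime.h>
    #include <math.h>

    // Unfused: two kernels, with tmp making a full round trip
    // through global memory between them.
    __global__ void axpb(const float *x, float *tmp, float a, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tmp[i] = a * x[i] + b;
    }
    __global__ void expk(const float *tmp, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = expf(tmp[i]);
    }

    // Fused: one kernel, one global read and one global write per
    // element; the intermediate value lives in a register.
    __global__ void axpb_exp(const float *x, float *y, float a, float b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = expf(a * x[i] + b);
    }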

From there, hopefully you'll have the intuition to actually evaluate whether a given resource being recommended here is any good.


Here is a Phd thesis for a GPU programming language that compiles on the gpu itself:

https://scholarworks.iu.edu/dspace/items/3ab772c9-92c9-4f59-...


That undersells it IMHO. Here's the abstract:

This work describes a general, scalable method for building data-parallel-by-construction tree transformations that exhibit simplicity, directness of expression, and high performance on both CPU and GPU architectures when executed on either interpreted or compiled platforms across a wide range of data sizes, as exemplified and expounded by the exposition of a complete compiler for a lexically scoped, functionally oriented commercial programming language. The entire source code to the compiler written in this method requires only 17 lines of simple code compared to roughly 1000 lines of equivalent code in the domain-specific compiler construction framework, Nanopass, and requires no domain-specific techniques, libraries, or infrastructure support. It requires no sophisticated abstraction barriers to retain its concision and simplicity of form. The execution performance of the compiler scales along multiple dimensions: it consistently outperforms the equivalent traditional compiler by orders of magnitude in memory usage and run time at all data sizes and achieves this performance on both interpreted and compiled platforms across CPU and GPU hardware using a single source code for both architectures and no hardware-specific annotations or code. It does not use any novel domain-specific inventions of technique or process, nor does it use any sophisticated language or platform support. Indeed, the source does not utilize branching, conditionals, if statements, pattern matching, ADTs, recursions, explicit looping, or other non-trivial control or dispatch, nor any specialized data models.


Does it really matter that it's only 17 lines if they're dense, undecipherable APL? See page 210 in the PDF.


Do you know APL? It's pretty asinine to jump straight into the middle of a technical document and complain that something is indecipherable to you. The page 210 excerpt is completely readable once you've gone through the dissertation, especially since the author goes way out of his way to teach you the necessary APL along the way.


> 17 lines of simple code

17 lines of very much not simple code


For Intel you can just look at the sources: https://github.com/intel/intel-graphics-compiler ;) The AMDGPU and NVPTX targets in LLVM might also be interesting.


Fair warning that the dominant model in GPU compilers is to use an ISA that looks like GPUs did long ago. Expect to see i32 used to represent a machine vector of ints where you might reasonably expect <32 x i32>. There are actual i32 scalar registers as well, which are cheap to branch on, and which are also represented in IR as i32; that is, the same as the vector. There are intrinsics that sometimes distinguish the two.
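
Concretely, in CUDA terms (a toy kernel; the SGPR note is the AMD case, and the comments are the point):

    __global__ void k(const int *in, int *out, int n) {
        // "int", but really one lane of a warp/wave-wide vector;
        // in LLVM IR this is plain i32, not <32 x i32>.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Uniform across the warp: on AMD this can live in a true
        // scalar (SGPR) register, yet it is also just i32 in IR.
        int limit = n;

        // The branch condition is a per-lane mask, not a single bool.
        if (i < limit) out[i] = in[i] + 1;
    }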

This makes a rather spectacular mess of the tooling. Instead of localising the CUDA semantics in Clang, we scatter them throughout the entire compiler pipeline, where they do especially nasty things to register allocation and generally obstruct non-CUDA programming models. It's remarkably difficult to persuade GPU people that this is a bad thing.

Also, the GPU programming languages use very large compiler runtimes to paper over, to a degree, the CPU-host/GPU-target assumption that also dates from long ago, so expect to find a lot of complexity in multiple libraries acting somewhat like compiler-rt. Those are optional in reality, but the compiler usually emits a lot of symbols that resolve to various vendor libraries.


The Deep Learning Compiler: A Comprehensive Survey (2020) is in fact a comprehensive-as-of-2020, but now somewhat dated, survey of the field. Unfortunately, I am not aware of a better, more up-to-date overall survey.

https://arxiv.org/abs/2002.03794



https://theartofhpc.com/

Great book series on the subject of HPC. Not sure if it actually touches GPUs. Great material anyway. BONUS: it's free!


Yeah, thanks, I've started reading it! It's not exactly what I was looking for, but it's very deep/rigorous, and I feel the concepts will be transferable.



Staged FP in Spiral: https://www.youtube.com/playlist?list=PL04PGV4cTuIVP50-B_1sc...

Some of the stuff in this playlist might be relevant to you, though it is mostly about programming GPUs in a functional language that compiles to CUDA. The author (me) sometimes works on the language during the videos, either fixing bugs or adding new features.


For NVIDIA,

1. play around with the NVPTX LLVM backend and/or try compiling CUDA with Clang,

2. get familiar with the PTX ISA,

3. play around with ptxas + nvdisasm (see the sketch below).
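
A minimal kernel to poke at all three with (assumes a clang build with NVPTX support and the CUDA toolkit on PATH; sm_80 is an assumption, match your GPU):

    // add.cu
    //
    // Step 1, CUDA via clang (device-side LLVM IR, then PTX):
    //   clang++ -x cuda --cuda-gpu-arch=sm_80 --cuda-device-only -S -emit-llvm add.cu -o add.ll
    //   clang++ -x cuda --cuda-gpu-arch=sm_80 --cuda-device-only -S add.cu -o add.ptx
    // Step 3, assemble the PTX and disassemble the result:
    //   ptxas -arch=sm_80 add.ptx -o add.cubin
    //   nvdisasm add.cubin
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }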


Faith Ekstrand has an impressive track record of compiler work and has written a few blog posts [1], [1a]. Her Mastodon [2] is also worth a follow.

SPIR-V is important in the compute shader space, especially because DXIL and Metal's AIR are similar. I'm going to link three articles critical of SPIR-V: [3], [4], [5].

WebGPU [6] is interesting for a number of reasons, largely because they're trying to actually nail down the semantics, and also make it safe (see the uniformity analysis [7] in particular, which is a very "compiler" approach to a GPU-specific problem). Both the Tint and naga projects are open source, with lots of high-quality discussion in the issue trackers.
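
To make the uniformity point concrete: WGSL statically rejects barriers (and derivative ops) under non-uniform control flow, whereas the analogous CUDA code is merely undefined behaviour you get to discover at runtime. A sketch of the hazard (hypothetical kernel):

    __global__ void bad(const float *in, float *out) {
        int i = threadIdx.x;
        if (in[i] > 0.0f) {   // data-dependent, hence non-uniform
            __syncthreads();  // some threads may never reach this barrier: UB
        }
        out[i] = in[i];
    }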

Shader languages suck, and we really need a good one. Promising approaches are Circle [8] (C++-based, very advanced, but not open source) and Slang [9] (an evolution of HLSL). The Vcc work (also related to [4]) is worth studying.

Best of luck! This is a fascinating, if frustrating, space, and there's lots of room to improve things.

[1]: https://www.gfxstrand.net/faith/blog/

[1a]: https://www.collabora.com/news-and-blog/blog/2024/04/25/re-c...

[2]: https://mastodon.gamedev.place/@gfxstrand

[3]: https://kvark.github.io/spirv/2021/05/01/spirv-horrors.html

[4]: https://xol.io/blah/the-trouble-with-spirv/

[5]: https://themaister.net/blog/2022/08/21/my-personal-hell-of-t...

[6]: https://github.com/gpuweb/gpuweb

[7]: https://www.w3.org/TR/2022/WD-WGSL-20220505/#uniformity-over...

[8]: https://www.circle-lang.org/site/index.html

[9]: https://github.com/shader-slang/slang


The shader languages are optional though. You can absolutely write freestanding C or C++ instead if you choose to. This doesn't seem widely appreciated.

There's increasing danger of non-freestanding C++ working too as the libc++ port improves. Some fraction of libc is there already.



