> Mesa 3D - driver library that provides open source driver (mostly copy of AMD-VLK)
Wrong, wrong, very wrong. No copy here. Mesa's RADV was developed completely independently from AMD; in fact, it predates the public release of AMDVLK. It's also possibly the best Vulkan implementation out there. Valve invested heavily in the ACO compiler backend, so it compiles shaders both very well and very quickly.
Not to mention that there is much more to Mesa than RADV: Gallium3D and its state trackers, RadeonSI, and a handful of other drivers (Apple M1, Qualcomm Adreno, Broadcom, Mali), to name only a few.
Ah, yes. Gallium is itself a bit like Vulkan IIRC, though a bit higher level.
There are state trackers for OpenGL, D3D9-10, OpenCL, and probably others. I guess one could make a Vulkan "state tracker", but it might not be efficient.
Gallium can be used on top of bare metal, but also on the CPU (LLVMpipe/OpenSWR), on Vulkan through Zink, or now on D3D12 thanks to Microsoft's recent work.
Valve is contributing quite a bit, with their ACO compiler for instance, but the RADV driver did not in fact originate with them. We owe a great deal to David Airlie for that: he wrote a complete Vulkan driver years before AMD released the source code for AMDVLK. Valve started contributing a few years in, IIRC.
To AMD's credit, they did help with documentation and questions, and much of the work was built on top of what was already there for RadeonSI (OpenGL): the shader compiler back-end, etc.
Parallelism across the GPU's computing units permeates the entire computing model. For the longest time I didn't get how partial derivative instructions like dFdx/dFdy/ddx/ddy work. None of the docs helped. These instructions take in a number and return its partial derivative, just a generic number, with nothing to do with graphics, geometry, or any explicit function. The number could have been the dollar amount of a mortgage and its partial derivative would be returned.
It turns out these functions are tied to how the GPU architecture runs its computing units in parallel. The computing units are arranged to run in parallel in a geometric grid matching the input data model. They run the same program in LOCK STEP (or at least they are in lockstep upon arriving at the dFdx/dFdy instruction), and they know their neighbors in the grid. When a computing unit encounters a dFdx instruction, it reaches across and grabs its neighbors' input values to their dFdx instructions. All the neighbors arrive at dFdx at the same time with their input values ready. With the neighbors' numbers and its own number as the midpoint, it can compute the partial derivative as a gradient slope.
That sounds like a useful mental model. But it can't be quite right, can it? There aren't enough cores to do _all_ pixels in parallel, so how is that handled? Does it render tiles and compute all edges several times for this?
Pixels are grouped into independent 2x2 blocks called quads and the derivatives are calculated only with respect to the other pixels in the quad.
At the edges of a triangle, when it doesn't cover all four pixels in a quad, the pixel shader is still evaluated on all four pixels to calculate derivatives, and at the end the results for pixels outside the triangle are thrown away. Your pixel shader had better not blow up outside the triangle, or you'll get bad derivatives for the pixels that are inside it.
As triangles get smaller, the percentage of pixel shader evaluations that are happening outside of triangles goes up. In the extreme case if the triangle covers only one pixel, the pixel shader is executed 4 times and 3 results are thrown out! This has started to become a problem for games with high geometric detail and it's one of the reasons why Epic started using software rasterization for their Nanite virtualized geometry system in Unreal 5.
You might say "why don't you just stop doing this approximate derivative thing so you don't have to execute pixels that are thrown away? Are derivatives that important?" The answer is that derivatives are used by the hardware during texture sampling to select the correct mip level, which is pretty much essential for performance and eliminating aliasing. So you'd have to replace that with manual calculation of derivatives by the application which would be a big change.
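To make the mip-selection point concrete, here is a rough sketch of the math the hardware performs with those derivatives. It's CUDA-flavored C, the function name and parameters are mine, and real hardware is free to use cheaper approximations than this textbook LOD formula:

```cuda
// Sketch of mip-level selection from screen-space derivatives.
// du_dx etc. are the derivatives of the texture coordinates, already scaled
// into texel units. This mirrors the textbook/OpenGL-spec LOD calculation.
__device__ float mip_level(float du_dx, float dv_dx, float du_dy, float dv_dy)
{
    // Size of the pixel's footprint in texel space along screen x and y.
    float rho_x = sqrtf(du_dx * du_dx + dv_dx * dv_dx);
    float rho_y = sqrtf(du_dy * du_dy + dv_dy * dv_dy);
    // One mip level further down for every doubling of that footprint.
    return log2f(fmaxf(rho_x, rho_y));
}
```

If the helper pixels outside the triangle produced garbage, this is exactly the value that comes out wrong.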
That's correct: there aren't enough physical cores to do all cells in parallel. A 1000x1000 grid would have 1M cells. These are virtual computing units; conceptually there are 1M of them, one per cell. The thousands of physical cores take on the cells one batch at a time until all of the cells have run. In fact, multiple cores can run the code of one cell in parallel, e.g. a for-loop is unrolled and each core takes on a different iteration of it. The scheduling of cores onto virtual computing units is like an OS scheduling CPUs onto different processes.
Most instructions have no inter-dependency between cells. These can be executed by a core as fast as it likes on one computing unit, until an inter-dependent instruction like dFdx is encountered that requires syncing. The computing unit is put in a wait state while the core moves on to another one. When all the computing units are synced up at the dFdx instruction, they're then executed by the cores batch by batch.
No, multiple cores can't run the code of the same cell in parallel. The key here is that these cores run in lockstep, so the program counter is exactly the same for all cells being run at once. In practice, each core supports 32 (or sometimes 64) cells at a time. There is also no syncing required for dFdX! Since the quads run in lockstep, the results are guaranteed to be computed at the same time. In practice, these are all computed in vector SIMD registers, and a dFdX simply pulls a register from another SIMD lane (sketched below).
If you want to execute more tasks than this, throw more cores at it.
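To illustrate the "pulls a register from another SIMD lane" part, here is a minimal CUDA sketch that emulates a quad-row dFdx with a warp shuffle. The lane layout (left and right pixels of a quad row sitting in adjacent lanes) is my assumption for illustration, not how any particular driver packs pixel-shader work:

```cuda
// Emulating a quad-row dFdx with a warp shuffle. No barrier is needed:
// the lanes of a warp execute in lockstep, so the neighbor's value is
// guaranteed to already be sitting in its register.
__device__ float quad_dfdx(float v)
{
    // XOR-ing the lane index with 1 pairs up adjacent lanes, i.e. it swaps
    // the "left" and "right" pixels of a quad row in this toy layout.
    float neighbor = __shfl_xor_sync(0xffffffffu, v, 1);
    int lane = threadIdx.x & 31;   // lane id, assuming a 1D thread block
    // Both lanes of the pair return the same value: right minus left.
    return (lane & 1) ? (v - neighbor) : (neighbor - v);
}
```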
All the computing units running in lockstep is how it appears to the program; in reality they don't have to run in lockstep for every single instruction. SIMT works on warps of up to 32 threads [1]. Different warps working on different cell regions can run with different program counters.
Also, the newer Volta architecture (section 3.2 in [1]) allows independent thread scheduling within a warp, such that each thread can have its own program counter and stack.
There certainly is syncing. See how barriers are used (9.7.12.1 in [1]) to synchronize threads.
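As a minimal sketch of the kind of barrier meant here (the kernel and names are mine): threads in a block stage data in shared memory and must hit `__syncthreads()` before reading a neighbor's slot, precisely because different warps in the block can be at different points of the program:

```cuda
// Cross-warp communication needs an explicit barrier. Without the
// __syncthreads() below, a thread could read its neighbor's slot before the
// neighbor (possibly in a different warp) has written it.
__global__ void neighbor_diff(const float* in, float* out, int n)
{
    __shared__ float tile[256];                  // one slot per thread; launch with 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // every thread writes, so every thread reaches the barrier
    __syncthreads();                             // wait for all warps in the block

    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = tile[threadIdx.x] - tile[left];
    }
}
```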
This is not how Volta's "Independent Thread Scheduling" works under the hood, though I'm not at liberty to say what's actually going on (unfortunately, one of the upsetting things about doing GPU programming is how much "behind NDA" tribal knowledge there tends to be. If you have an NVIDIA devrel contact, ask for their "GPU Programming Guide", that should hopefully clear things up). So for now, you'll just have to trust me.
Suffice to say, the Volta "Independent Thread Scheduling" is only used for compute shaders, not for vertex and pixel shaders.
You're right to note that different warps can run with different program counters, and when communicating across warps, those need synchronization. That's what those barriers are necessary for (see GroupMemoryBarrier and friends in the high-level languages). However, the definition of dFdX chosen by these languages basically requires that all threads participating in the derivative be scheduled within the same warp. dFdX will never synchronize or wait on another warp.
So the function F is implicitly defined to have value F(x,y) = v, where x and y are coordinates of the core, and v is the input value to the dFdx/dFdy instruction? Then the output of the instruction running on the x,y core (take dFdx for example) is supposedly equal to (F(x+1,y)-F(x-1,y))/2?
Yes, that's the idea. Though I'm not exactly sure how the GPU uses the neighboring values to interpolate the partial derivative; it's probably GPU dependent.
Pretty much all of them work in 2x2 quads. dFdx/dFdy are calculated within a quad, but not across quads. This is an imperfect approximation to the derivative that needs to be accounted for, but it's useful, very cheap, and trivially easy.
It's defined to work on a 2x2 grid called a quad. If you're the top left pixel, then dFdX(v) = v(x+1,y) - v. If you're the top right pixel, then it's dFdX(v) = v - v(x-1,y), so both pixels in the row see the same value.
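Spelled out for a whole quad, with a hypothetical helper just to enumerate the formulas (`v00` is the top-left value, `v10` top-right, `v01` bottom-left, `v11` bottom-right):

```cuda
// Per-pixel derivatives within one 2x2 quad. Both pixels in a row share the
// same dFdx and both pixels in a column share the same dFdy; "coarse"
// variants of the instructions may reuse a single row/column for all four.
struct QuadDerivs { float dfdx_top, dfdx_bottom, dfdy_left, dfdy_right; };

__host__ __device__ QuadDerivs quad_derivatives(float v00, float v10,
                                                float v01, float v11)
{
    QuadDerivs d;
    d.dfdx_top    = v10 - v00;   // x derivative for the top row
    d.dfdx_bottom = v11 - v01;   // x derivative for the bottom row
    d.dfdy_left   = v01 - v00;   // y derivative for the left column
    d.dfdy_right  = v11 - v10;   // y derivative for the right column
    return d;
}
```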
Thanks a lot, always enjoy your posts on here and r/hardware! Do you have a hardcore introduction with even more detail / perhaps even with examples of implementations? :)
I find white papers quite good (although I admit there are many things I don't understand yet and constantly have to look up), but even these sometimes feel a bit general.
In particular, "Section 5: Performance Guidelines": https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... gives a lot of micro-architectural details, including those nasty "bank conflicts" that people keep talking about. Honestly, the original documentation says it best with just a few paragraphs on that particular matter:
> To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
>
> However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
See? It's really not that hard or unapproachable. Just read the original docs, it's all there.
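For a concrete picture of the padding trick people use to dodge those conflicts, here's the classic shared-memory transpose sketch (the tile size and kernel name are mine; it assumes a square matrix whose side is a multiple of 32, launched with 32x32 thread blocks):

```cuda
// Classic bank-conflict illustration: a 32x32 shared-memory tile used for a
// matrix transpose. Reading a *column* of a plain tile[32][32] makes all 32
// threads of a warp hit the same bank (stride of 32 words), serializing the
// access. Padding each row by one word shifts successive rows into
// different banks, so the column read is conflict-free too.
#define TILE 32

__global__ void transpose(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids 32-way conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read
}
```

Drop the `+ 1` and every column read becomes a 32-way bank conflict.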
AMD's documentation is scattered to the winds, but the same information is around. I'd say the #1 performance guide is the ancient OpenCL optimization guide from 2015. It's a bit dated, but it's fine: http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_...
Chapter 1 and Chapter 2 are relevant to today's architectures (even RDNA, though some details have changed). Chapter 2 (GCN) applies to all AMD GPUs from the 7xxx series through the Rx 2xx, 3xx, 4xx, and 5xx series, Vega, and the CDNA (aka MI100) architectures.
RDNA does not have as good an architectural guide. Start with the OpenCL guide for optimization, and then "update" your knowledge with the rather short RDNA guide: https://gpuopen.com/performance/
* PTX is a portable assembly language; NVidia continuously updates the actual GPU hardware underneath it. Volta was studied in depth by this paper: https://arxiv.org/abs/1804.06826. I'd suggest reading it AFTER you learn the basics of PTX.
I know there are a lot of Javascript developers on this forum. If you want to get into GPU programming, I highly recommend the gpu.js [1] library as a jumping-off point. It's amazing how powerful computers are and how we squander most of our cycles.
Note that GPU.js is not for graphics, it's primarily for doing GPU Compute. I'd recommend looking at a 3D scene library to get your feet wet (and there are plenty of those in JS), or if you're interested in the underlying workings, to look at either WebGL or WebGPU.
On the same basis, it would also help if you could provide a comparison between GPUs commonly used for ML: Tesla K80, P100, T4, V100, and A100. How has the architecture evolved to make the A100 significantly faster? Is it just the 80GB RAM, or is there more to it from an architecture standpoint?
> How has the architecture evolved to make the A100 significantly faster?
Oh, very much so. By way more than an order of magnitude. For a deeper read, have a look at the "architecture white papers" for Kepler, Pascal, Volta/Turing, and Ampere:
or check out the archive of NVIDIA's parallel4all blog ... hmm, that's weird, it seems like they've retired it. They used to have really good blog posts explaining what's new in each architecture.
As a starter, the T4 is heavily optimized for low power consumption on inference tasks. IIRC it doesn't even require additional power beyond what the PCIe bus can provide, but it's basically useless for training, unlike the others.
One day I'll get my hands on both an A40 and an A100 and I'll maybe get an answer to the question: does the 5120-bit memory bus help that much? The A100 has fewer CUDA cores and around 1/4 more tensor cores, but it seems to be the preferred 'compute' and 'AI training' option all around. What gives?
Have folks seen good Linux tools for actually monitoring/profiling the GPU's inner workings? Soon I'll need to scale running ML models. For CPUs, I have a whole bag of tricks for examining and monitoring performance. But for GPUs, I feel like a caveman.
In NVIDIA land, `nvidia-smi` is like `top` for your GPUs. If you're running compiled CUDA, `nvprof` is very useful. But I'm not sure how much work it would take to profile something like a PyTorch model.
Very nice. This is gentle like movie dinosaurs are "cheeky lizards". I'd hate to see the "Turkish prison BDSM porn" version.
I'm looking at graphics code, again, from an "I know enough C to shoot myself in the foot and want to draw a circle on the screen" perspective. It's hilarious how much "stack" there is in all the ways of doing that; I look at some of this shit and want to go back to Xlib for its simple grace.
2D graphics is very different and mostly doesn't require the GPU's assistance. If you want to plot a circle on an image and then display that to the screen, you don't require any of this stack. If you want a high-level library that will draw a circle for you, you can use something like Skia or Cairo which will wrap this for you into a C API.
GPUs solve problems of much larger scale, and so the stack has evolved over time to meet the needs of those applications. All this power and corresponding complexity has been introduced for a reason, I assure you.
> 2D graphics is very different and mostly doesn't require the GPU's assistance.
There are interesting tasks in 2D graphics that may easily warrant GPU acceleration, such as smooth animation of composited surfaces (even something as simple as displaying a mouse pointer falls under this, and is accelerated in most systems) or rendering complex shapes including text.
A better explanation is that the main problems of processor design are that memory reads take 100-1000 times as long as an arithmetic operation, and that hardware is faster when operations are run in parallel.
CPUs handle those issues by having large memory caches and lots of circuitry to execute instructions "out of order", i.e. run other instructions that don't depend on the memory read result or the result of other operations. This is great for running sequential code as fast as possible, but quite inefficient overall.
GPUs instead handle the memory problem by switching to another thread in hardware, and the parallelism problem mainly by using SIMD (with masking and scatter/gather memory accesses, so the lanes look like multiple threads). This works well if you are doing mostly the same operations on thousands or millions of values, which is what graphics rendering and GPGPU are (see the sketch below).
Then there are also DSPs, which solve the memory access issue by having only a small amount of on-chip memory plus either explicit DMA or memory reads that deliver a delayed result, and which get parallelism by being VLIW.
And finally the in-order/microcontroller CPUs that simply don't care about performance and do the cheapest and simplest thing.
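Tying the GPU point above to code, here is a tiny CUDA sketch of the "SIMD with masking plus hardware thread switching" model (the kernel and names are mine):

```cuda
// Each "thread" below is really one lane of a wide SIMD unit. When lanes of
// a warp disagree at the branch, the hardware runs both sides with inactive
// lanes masked off. The latency of the in[i] load is hidden by switching to
// other resident warps, not by out-of-order execution.
__global__ void clamp_negative(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];       // long-latency memory read
        if (v < 0.0f)          // divergence handled by lane masks
            out[i] = 0.0f;
        else
            out[i] = v;
    }
}
```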
This article says that AMD GPUs are vector in nature, but I think that stopped being the case with GCN; before that they had some weird vector stuff with 5 elements or something.
Wrong. GPUs are definitely vector (GCN in particular being 64 x 32-bit wide vectors).
What you're confusing is "Terascale" (aka the Radeon HD 2xxx through 6xxx series), which was VLIW _AND_ SIMD. Which was... a little bit too much going on, and very difficult to optimize for. GCN made things way easier and more general purpose.
Terascale theoretically had more GFLOPs than the first GCN processors (after all, VLIW unlocks a good amount of performance), but actually utilizing all those VLIW units per clock tick was a nightmare. It's hard enough to write good SIMD code as it is.
There were two dimensions of "vector": whether the ISA used XMM-style wide registers, and whether each thread operated on short per-thread vectors, requiring the compiler to auto-vectorize different operations. The old vector ISAs had things like dot product instructions. Those things are now all gone, because it turns out they're hard to optimize for.
We in the industry call scalar-for-each-thread ISAs "scalar ISAs" these days. For instance, Mali describes their transition from a vec4-for-each-thread ISA to a scalar-for-each-thread ISA as going from "vector to scalar" [0].
Connection Machine / Thinking Machines Corporation's use of "scalar-like SIMD" was around in the late 1980s and early 1990s. Look up *Lisp, the parallel language that Thinking Machines Corporation used. You can still find *Lisp manuals today. https://en.wikipedia.org/wiki/*Lisp
Intel implemented SIMD using a technique they called SWAR: SIMD within a Register (aka: the 64-bit MMX registers), which eventually evolved into XMM (SSE), YMM (AVX), and ZMM (AVX512).
Today's GPUs are programmed using the old 1980s style / Connection Machine *Lisp, which is clearly the source of inspiration for HLSL / GLSL / OpenCL / CUDA / etc. etc.
Granted, today's GPUs are still SWAR (GCN's vGPR really is just a 64-wide x 32-bit register). We can see with languages like ispc (Intel's SPMD program compiler) that we can indeed implement a CUDA-like language on top of AVX / XMM registers.
I don't think any of it was based on *Lisp. Graphics mostly developed independently. And as such, we use different words and different terminology sometimes, like saying "scalar ISA" when we talk about designing ISAs that don't mandate cross-lane interaction in one thread. Sorry!
As far as I know, the first paper covering the use of SIMD for graphics was Pixar's "Channel Processor", or Chap [0], in 1984. This later became one of the core implementation details of their REYES algorithm [1]. By 1989, they had their own RenderMan Shading Language [2], an improved version of Chap, and you can see the similarities from just the snippet at the start of the code. This is what Microsoft took major inspiration from when designing HLSL, and what NVIDIA then started to extend with their own Cg compiler. 3dlabs then copy/pasted this for GLSL.
The terms are definitely overloaded and can differ subtly between vendors. Those "scalar-for-each-thread" instructions are prefixed v_ on AMD's GCN. The "v" doesn't stand for "scalar" ;)
Yeah, certainly having "V"GPRs in a "scalar ISA" is a bit off. In graphics parlance, the term really just refers to the lack of things like dot product and matrix multiply instructions, where the compiler has to line numbers up into "per-thread" vectors for the instructions to work out. It turns out you can save a lot of work by not having that kind of ISA, and just moving to ones where you operate on plain number registers (which are then "vector"ed to be thread-count wide).
Wouldn't be graphics unless we overloaded a piece of terminology 20 times.