> Mesa 3D - driver library that provides open source driver (mostly copy of AMD-VLK)
Wrong, wrong, very wrong. No copy here. Mesa's RADV was developed completely independently from AMD; in fact, it predates the public release of AMDVLK. It's also possibly the best Vulkan implementation out there. Valve invested heavily in the ACO compiler backend, so it compiles shaders both very well and very quickly.
Not to mention that there is much more to Mesa than RADV: Gallium3D and its state trackers, RadeonSI, and a handful of other drivers (Apple M1, Qualcomm Adreno, Broadcom, Mali), to name only a few.
Ah, yes. Gallium is itself a bit like Vulkan IIRC, though a bit higher level.
There are state trackers for OpenGL, D3D9-10, OpenCL, and probably others. I guess one could make a Vulkan "state tracker", but it might not be efficient.
Gallium can be used on top of bare metal, but also on the CPU (LLVMpipe/OpenSWR), on Vulkan through Zink, or now on D3D12 thanks to Microsoft's recent work.
Valve is contributing quite a bit, with their ACO compiler for instance, but the RADV driver did not in fact originate with them. We owe a great deal to David Airlie for that: he wrote a complete Vulkan driver years before AMD released the source code for AMDVLK. Valve started contributing a few years in, IIRC.
To AMD's credit, they did help with documentation and questions, and much of the work was built on top of what was already there for RadeonSI (OpenGL): the shader compiler back-end, etc.
Parallelism across the GPU's computing units permeates the entire computing model. For the longest time I didn't get how partial derivative instructions like dFdx/dFdy/ddx/ddy work. None of the docs helped. These instructions take in a number and return its partial derivative, just a generic number, with nothing to do with graphics, geometry, or any explicit function. The number could have been the dollar amount of a mortgage and its partial derivative would be returned.
It turns out these functions are tied to how the GPU architecture runs its computing units in parallel. The computing units are arranged to run in parallel in a geometric grid matching the input data model. They run the same program in LOCK STEP (or at least they are in lockstep upon arriving at the dFdx/dFdy instruction), and they know their neighbors in the grid. When a computing unit encounters a dFdx instruction, it reaches across and grabs its neighbors' input values to their dFdx instructions. All the neighbors arrive at dFdx at the same time with their input values ready. With the neighbors' numbers and its own number as the midpoint, it can compute the partial derivative as a gradient slope.
That sounds like a useful mental model. But it can't be quite right, can it? There aren't enough cores to do _all_ pixels in parallel, so how is that handled? Does it render tiles and compute all edges several times for this?
Pixels are grouped into independent 2x2 blocks called quads and the derivatives are calculated only with respect to the other pixels in the quad.
At the edges of a triangle, when it doesn't cover all four pixels in a quad, the pixel shader is still evaluated on all four pixels to calculate derivatives, and at the end the results for pixels outside the triangle are thrown away. Your pixel shader had better not blow up outside the triangle, or you'll get bad derivatives for the pixels that are inside it.
As triangles get smaller, the percentage of pixel shader evaluations that are happening outside of triangles goes up. In the extreme case if the triangle covers only one pixel, the pixel shader is executed 4 times and 3 results are thrown out! This has started to become a problem for games with high geometric detail and it's one of the reasons why Epic started using software rasterization for their Nanite virtualized geometry system in Unreal 5.
You might say "why don't you just stop doing this approximate derivative thing so you don't have to execute pixels that are thrown away? Are derivatives that important?" The answer is that derivatives are used by the hardware during texture sampling to select the correct mip level, which is pretty much essential for performance and eliminating aliasing. So you'd have to replace that with manual calculation of derivatives by the application which would be a big change.
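To make the mip-selection point concrete, here is a rough sketch of the math the hardware performs with those derivatives. It's CUDA-flavored C, the function name and parameters are mine, and real hardware is free to use cheaper approximations than this textbook LOD formula:

```cuda
// Sketch of mip-level selection from screen-space derivatives.
// du_dx etc. are the derivatives of the texture coordinates, already scaled
// into texel units. This mirrors the textbook/OpenGL-spec LOD calculation.
__device__ float mip_level(float du_dx, float dv_dx, float du_dy, float dv_dy)
{
    // Size of the pixel's footprint in texel space along screen x and y.
    float rho_x = sqrtf(du_dx * du_dx + dv_dx * dv_dx);
    float rho_y = sqrtf(du_dy * du_dy + dv_dy * dv_dy);
    // One mip level further down for every doubling of that footprint.
    return log2f(fmaxf(rho_x, rho_y));
}
```

If the helper pixels outside the triangle produced garbage, this is exactly the value that comes out wrong.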
That's correct: there aren't enough physical cores to do all cells in parallel. A 1000x1000 grid would have 1M cells. These are virtual computing units; conceptually there are 1M of them, one per cell. The thousands of physical cores take on the cells one batch at a time until all of the cells have run. In fact, multiple cores can run the code of one cell in parallel, e.g. a for-loop is unrolled and each core takes on a different iteration of it. The scheduling of cores onto virtual computing units is like an OS scheduling CPUs onto different processes.
Most instructions have no inter-dependency between cells. These can be executed by a core as fast as it likes on one computing unit, until an inter-dependent instruction like dFdx is encountered that requires syncing. The computing unit is put in a wait state while the core moves on to another one. When all the computing units are synced up at the dFdx instruction, they're then executed by the cores batch by batch.
No, multiple cores can't run the code of the same cell in parallel. The key here is that these cores run in lockstep, so the program counter is exactly the same for all cells being run at once. In practice, each core supports 32 (or sometimes 64) cells at a time. There is also no syncing required for dFdX! Since the quads run in lockstep, the results are guaranteed to be computed at the same time. In practice, these are all computed in vector SIMD registers, and a dFdX simply pulls a register from another SIMD lane (sketched below).
If you want to execute more tasks than this, throw more cores at it.
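To illustrate the "pulls a register from another SIMD lane" part, here is a minimal CUDA sketch that emulates a quad-row dFdx with a warp shuffle. The lane layout (left and right pixels of a quad row sitting in adjacent lanes) is my assumption for illustration, not how any particular driver packs pixel-shader work:

```cuda
// Emulating a quad-row dFdx with a warp shuffle. No barrier is needed:
// the lanes of a warp execute in lockstep, so the neighbor's value is
// guaranteed to already be sitting in its register.
__device__ float quad_dfdx(float v)
{
    // XOR-ing the lane index with 1 pairs up adjacent lanes, i.e. it swaps
    // the "left" and "right" pixels of a quad row in this toy layout.
    float neighbor = __shfl_xor_sync(0xffffffffu, v, 1);
    int lane = threadIdx.x & 31;   // lane id, assuming a 1D thread block
    // Both lanes of the pair return the same value: right minus left.
    return (lane & 1) ? (v - neighbor) : (neighbor - v);
}
```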
All the computing units running in lockstep is how it appears to the program; in reality they don't have to run in lockstep for every single instruction. SIMT works on warps of up to 32 threads [1]. Different warps working on different cell regions can run with different program counters.
Also, the newer Volta architecture (section 3.2 in [1]) allows independent thread scheduling within a warp, such that each thread can have its own program counter and stack.
There certainly is syncing. See how barriers are used (9.7.12.1 in [1]) to synchronize threads.
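As a minimal sketch of the kind of barrier meant here (the kernel and names are mine): threads in a block stage data in shared memory and must hit `__syncthreads()` before reading a neighbor's slot, precisely because different warps in the block can be at different points of the program:

```cuda
// Cross-warp communication needs an explicit barrier. Without the
// __syncthreads() below, a thread could read its neighbor's slot before the
// neighbor (possibly in a different warp) has written it.
__global__ void neighbor_diff(const float* in, float* out, int n)
{
    __shared__ float tile[256];                  // one slot per thread; launch with 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // every thread writes, so every thread reaches the barrier
    __syncthreads();                             // wait for all warps in the block

    if (i < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[i] = tile[threadIdx.x] - tile[left];
    }
}
```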
This is not how Volta's "Independent Thread Scheduling" works under the hood, though I'm not at liberty to say what's actually going on (unfortunately, one of the upsetting things about doing GPU programming is how much "behind NDA" tribal knowledge there tends to be. If you have an NVIDIA devrel contact, ask for their "GPU Programming Guide", that should hopefully clear things up). So for now, you'll just have to trust me.
Suffice to say, the Volta "Independent Thread Scheduling" is only used for compute shaders, not for vertex and pixel shaders.
You're right to note that different warps can run with different program counters, and when communicating across warps, those need synchronization. That's what those barriers are necessary for (see GroupMemoryBarrier and friends in the high-level languages). However, the definition of dFdX chosen by these languages basically requires that all threads participating in the derivative be scheduled within the same warp. dFdX will never synchronize or wait on another warp.
So the function F is implicitly defined to have value F(x,y) = v, where x and y are coordinates of the core, and v is the input value to the dFdx/dFdy instruction? Then the output of the instruction running on the x,y core (take dFdx for example) is supposedly equal to (F(x+1,y)-F(x-1,y))/2?
Yes, that's the idea. Though I'm not exactly sure how the GPU uses the neighboring values to interpolate the partial derivative; it's probably GPU dependent.
Pretty much all of them work in 2x2 quads. dFdx/dFdy are calculated within a quad, but not across quads. This is an imperfect approximation to the derivative that needs to be accounted for, but it's useful, very cheap, and trivially easy.
It's defined to work on a 2x2 grid called a quad. If you're the top left pixel, then dFdX(v) = v(x+1,y) - v. If you're the top right pixel, then it's dFdX(v) = v - v(x-1,y), so both pixels in the row see the same value.
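Spelled out for a whole quad, with a hypothetical helper just to enumerate the formulas (`v00` is the top-left value, `v10` top-right, `v01` bottom-left, `v11` bottom-right):

```cuda
// Per-pixel derivatives within one 2x2 quad. Both pixels in a row share the
// same dFdx and both pixels in a column share the same dFdy; "coarse"
// variants of the instructions may reuse a single row/column for all four.
struct QuadDerivs { float dfdx_top, dfdx_bottom, dfdy_left, dfdy_right; };

__host__ __device__ QuadDerivs quad_derivatives(float v00, float v10,
                                                float v01, float v11)
{
    QuadDerivs d;
    d.dfdx_top    = v10 - v00;   // x derivative for the top row
    d.dfdx_bottom = v11 - v01;   // x derivative for the bottom row
    d.dfdy_left   = v01 - v00;   // y derivative for the left column
    d.dfdy_right  = v11 - v10;   // y derivative for the right column
    return d;
}
```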
Thanks a lot, always enjoy your posts on here and r/hardware! Do you have a hardcore introduction with even more detail / perhaps even with examples of implementations? :)
I find white papers quite good (although I admit there are many things I don't understand yet and constantly have to look up), but even these sometimes feel a bit general.
In particular, "Section 5: Performance Guidelines": https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.... gives a lot of micro-architectural details, including those nasty "bank conflicts" that people keep talking about. Honestly, the original documentation says it best with just a few paragraphs on that particular matter:
> To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
>
> However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
See? It's really not that hard or unapproachable. Just read the original docs, it's all there.
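For a concrete picture of the padding trick people use to dodge those conflicts, here's the classic shared-memory transpose sketch (the tile size and kernel name are mine; it assumes a square matrix whose side is a multiple of 32, launched with 32x32 thread blocks):

```cuda
// Classic bank-conflict illustration: a 32x32 shared-memory tile used for a
// matrix transpose. Reading a *column* of a plain tile[32][32] makes all 32
// threads of a warp hit the same bank (stride of 32 words), serializing the
// access. Padding each row by one word shifts successive rows into
// different banks, so the column read is conflict-free too.
#define TILE 32

__global__ void transpose(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids 32-way conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read
}
```

Drop the `+ 1` and every column read becomes a 32-way bank conflict.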
AMD's documentation is scattered to the winds, but the same information is around. I'd say the #1 performance guide is the ancient OpenCL optimization guide from 2015. It's a bit dated, but it's fine: http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_...
Chapter 1 and Chapter 2 are relevant to today's architectures (even RDNA, though some details have changed). Chapter 2 (GCN) applies to all AMD GPUs from the 7xxx series through the Rx 2xx, 3xx, 4xx, and 5xx series, Vega, and the CDNA (aka MI100) architectures.
RDNA does not have as good an architectural guide. Start with the OpenCL guide for optimization, and then "update" your knowledge with the rather short RDNA guide: https://gpuopen.com/performance/
* PTX is a portable assembly language; NVidia continuously updates the actual GPU hardware underneath it. Volta was studied in depth by this paper: https://arxiv.org/abs/1804.06826. I'd suggest reading it AFTER you learn the basics of PTX.
I know there are a lot of Javascript developers on this forum. If you want to get into GPU programming, I highly recommend the gpu.js [1] library as a jumping-off point. It's amazing how powerful computers are and how we squander most of our cycles.
Note that GPU.js is not for graphics, it's primarily for doing GPU Compute. I'd recommend looking at a 3D scene library to get your feet wet (and there are plenty of those in JS), or if you're interested in the underlying workings, to look at either WebGL or WebGPU.
On the same basis, it would also help if you could provide a comparison between GPUs commonly used for ML: Tesla K80, P100, T4, V100, and A100. How has the architecture evolved to make the A100 significantly faster? Is it just the 80GB RAM, or is there more to it from an architecture standpoint?
> How has the architecture evolved to make the A100 significantly faster?
Oh, very much so. By way more than an order of magnitude. For a deeper read, have a look at the "architecture white papers" for Kepler, Pascal, Volta/Turing, and Ampere:
or check out the archive of NVIDIA's parallel4all blog ... hmm, that's weird, it seems like they've retired it. They used to have really good blog posts explaining what's new in each architecture.
As a starter, the T4 is heavily optimized for low power consumption on inference tasks. IIRC it doesn't even require additional power beyond what the PCIe bus can provide, but it's basically useless for training, unlike the others.
One day I'll get my hands on both an A40 and an A100 and I'll maybe get an answer to the question: does the 5120-bit memory bus help that much? The A100 has fewer CUDA cores and around 1/4 more tensor cores, but it seems to be the preferred 'compute' and 'AI training' option all around. What gives?
Have folks seen good Linux tools for actually monitoring/profiling the GPU's inner workings? Soon I'll need to scale running ML models. For CPUs, I have a whole bag of tricks for examining and monitoring performance. But for GPUs, I feel like a caveman.
In NVIDIA land, `nvidia-smi` is like `top` for your GPUs. If you're running compiled CUDA, `nvprof` is very useful. But I'm not sure how much work it would take to profile something like a PyTorch model.
Very nice. This is gentle like movie dinosaurs are "cheeky lizards". I'd hate to see the "Turkish prison BDSM porn" version.
I'm looking at graphics code, again, from an "I know enough C to shoot myself in the foot and want to draw a circle on the screen" perspective. It's hilarious how much "stack" there is in all the ways of doing that; I look at some of this shit and want to go back to Xlib for its simple grace.
2D graphics is very different and mostly doesn't require the GPU's assistance. If you want to plot a circle on an image and then display that to the screen, you don't require any of this stack. If you want a high-level library that will draw a circle for you, you can use something like Skia or Cairo which will wrap this for you into a C API.
GPUs solve problems of much larger scale, and so the stack has evolved over time to meet the needs of those applications. All this power and corresponding complexity has been introduced for a reason, I assure you.
> 2D graphics is very different and mostly doesn't require the GPU's assistance.
There are interesting tasks in 2D graphics that may easily warrant GPU acceleration, such as smooth animation of composited surfaces (even something as simple as displaying a mouse pointer falls under this, and is accelerated in most systems) or rendering complex shapes including text.
A better explanation is that the main problems of processor design are that memory reads take 100-1000 times as long as an arithmetic operation, and that hardware is faster when operations are run in parallel.
CPUs handle those issues by having large memory caches and lots of circuitry to execute instructions "out of order", i.e. run other instructions that don't depend on the memory read result or the result of other operations. This is great for running sequential code as fast as possible, but quite inefficient overall.
GPUs instead handle the memory problem by switching to another thread in hardware, and the parallelism problem mainly by using SIMD (with masking and scatter/gather memory accesses, so the lanes look like multiple threads). This works well if you are doing mostly the same operations on thousands or millions of values, which is what graphics rendering and GPGPU are (see the sketch below).
Then there are also DSPs, which solve the memory access issue by having only a small amount of on-chip memory plus either explicit DMA or memory reads that deliver a delayed result, and which get parallelism by being VLIW.
And finally the in-order/microcontroller CPUs that simply don't care about performance and do the cheapest and simplest thing.
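Tying the GPU point above to code, here is a tiny CUDA sketch of the "SIMD with masking plus hardware thread switching" model (the kernel and names are mine):

```cuda
// Each "thread" below is really one lane of a wide SIMD unit. When lanes of
// a warp disagree at the branch, the hardware runs both sides with inactive
// lanes masked off. The latency of the in[i] load is hidden by switching to
// other resident warps, not by out-of-order execution.
__global__ void clamp_negative(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];       // long-latency memory read
        if (v < 0.0f)          // divergence handled by lane masks
            out[i] = 0.0f;
        else
            out[i] = v;
    }
}
```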
This article says that AMD GPUs are vector in nature, but I think that stopped being the case with GCN; before that they had some weird vector stuff with 5 elements or something.
Wrong. GPUs are definitely vector (GCN in particular being 64 x 32-bit wide vectors).
What you're confusing is "Terascale" (aka the Radeon HD 2xxx through 6xxx series), which was VLIW _AND_ SIMD. Which was... a little bit too much going on, and very difficult to optimize for. GCN made things way easier and more general purpose.
Terascale theoretically had more GFLOPs than the first GCN processors (after all, VLIW unlocks a good amount of performance), but actually utilizing all those VLIW units per clock tick was a nightmare. It's hard enough to write good SIMD code as it is.
There were two dimensions of "vector": whether the ISA used XMM-style wide registers, and whether each thread operated on short per-thread vectors, requiring the compiler to auto-vectorize different operations. The old vector ISAs had things like dot product instructions. Those things are now all gone, because it turns out they're hard to optimize for.
We in the industry call scalar-for-each-thread ISAs "scalar ISAs" these days. For instance, Mali describes their transition from a vec4-for-each-thread ISA to a scalar-for-each-thread ISA as going from "vector to scalar" [0].
Connection Machine / Thinking Machines Corporation's use of "scalar-like SIMD" was around in the late 1980s and early 1990s. Look up *Lisp, the parallel language that Thinking Machines Corporation used. You can still find *Lisp manuals today. https://en.wikipedia.org/wiki/*Lisp
Intel implemented SIMD using a technique they called SWAR: SIMD within a Register (aka: the 64-bit MMX registers), which eventually evolved into XMM (SSE), YMM (AVX), and ZMM (AVX512).
Today's GPUs are programmed using the old 1980s style / Connection Machine *Lisp, which is clearly the source of inspiration for HLSL / GLSL / OpenCL / CUDA / etc. etc.
Granted, today's GPUs are still SWAR (GCN's vGPR really is just a 64-wide x 32-bit register). We can see with languages like ispc (Intel's SPMD program compiler) that we can indeed implement a CUDA-like language on top of AVX / XMM registers.
I don't think any of it was based on *Lisp. Graphics mostly developed independently. And as such, we use different words and different terminology sometimes, like saying "scalar ISA" when we talk about designing ISAs that don't mandate cross-lane interaction in one thread. Sorry!
As far as I know, the first paper covering the use of SIMD for graphics was Pixar's "Channel Processor", or Chap [0], in 1984. This later became one of the core implementation details of their REYES algorithm [1]. By 1989, they had their own RenderMan Shading Language [2], an improved version of Chap, and you can see the similarities from just the snippet at the start of the code. This is what Microsoft took major inspiration from when designing HLSL, and what NVIDIA then started to extend with their own Cg compiler. 3dlabs then copy/pasted this for GLSL.
The terms are definitely overloaded and can differ subtly between vendors. Those "scalar-for-each-thread" instructions are prefixed v_ on AMD's GCN. The "v" doesn't stand for "scalar" ;)
Yeah, certainly having "V"GPRs in a "scalar ISA" is a bit off. In graphics parlance, the term really just refers to the lack of things like dot product and matrix multiply instructions, where the compiler has to line numbers up into "per-thread" vectors for the instructions to work out. It turns out you can save a lot of work by not having that kind of ISA, and just moving to ones where you operate on plain number registers (which are then "vector"ed to be thread-count wide).
Wouldn't be graphics unless we overloaded a piece of terminology 20 times.