Incredible work, though. I had no idea GPU shaders were run as Von Neumann programs. I always thought they were really tiny sets of math operations with minimal branching, because they had to scale across a bunch of little cores. But apparently that's not entirely true!
We're basically approaching the part of the Myer and Sutherland wheel of reincarnation[0] where the display processor is as powerful as the main processor and therefore it should all be folded back together. Of course, this turn was anticipated by AMD well in advance, and the GCN architecture, APU hardware, etc. all play into it.
At the end of the day, the optimization envelope is always shifting around, and industrial computing architectures chase that incrementally, so they'll always be looping around the wheel as today's "narrow fast path" gradually becomes tomorrow's "general purpose".
I just started learning a little CUDA, but otherwise I know little about GPUs.
Is it surprising that they are Turing complete or surprising that they are using a von Neumann architecture?
You seem to be referring to Turing completeness when talking about branches. Von Neumann architecture means you can execute data as code, which seems to be closer to what the presentation is about(?)
Given that CUDA exists and runs on the same hardware, I don't see why it's really surprising that you can do advanced things with OpenGL shaders. CUDA is definitely Turing complete, and I think OpenCL is the same.
They can read and write memory (to access textures), do math (to calculate output colors), and jump (necessary for bounds checks, etc.). That's pretty much the definition of a Von Neumann program:
program variables ↔ computer storage cells
control statements ↔ computer test-and-jump instructions
assignment statements ↔ fetching, storing instructions
expressions ↔ memory reference and arithmetic instructions.
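To make the mapping concrete, here's a minimal sketch of my own (not from the presentation; the kernel name and parameters are made up) of a CUDA kernel that hits each item on that list:

    __global__ void brighten(const float* in, float* out, int n, float gain)
    {
        // expressions -> memory reference and arithmetic instructions
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // control statement -> test-and-jump (the bounds check)
        if (i >= n) return;
        // program variable -> storage cell; assignment -> fetch and store
        float v = in[i];
        out[i] = v * gain;
    }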
This is all true, but... AFAIK GPUs are optimized for graphics and therefore suck at general-purpose programs. Sure, it would be fun to get some general code to run on them, just as it's fun to make a Game Boy or a toaster run Linux.
There's also the issue that GPUs are not preemptable, which kind of makes preemptive multitasking hard.
Modern GPUs (last 5 years) are optimized for GPGPU in addition to graphics. They also have preemptive multitasking on either a per-batch or per-workgroup basis.
Intel's latest GPU architecture has an embedded OS running on the GPU for scheduling command batches; I'm not sure what AMD and Nvidia do.
I still wouldn't write a general purpose OS for it.
I think we have different definitions of "preemptive multitasking". There is no GPU I know of that can be preempted once given a command to draw. Once it starts, if that drawing command takes 30 seconds, there's no preempting it. This is why Windows has a timeout that resets the GPU if it doesn't respond. (I believe other OSes have added that feature, but I'm not 100% sure.) Anyway, I've yet to use a GPU or an OS that supports preempting the GPU. I'd be happy to be proven wrong. I can also give you samples to test. They don't require fancy shaders; all they require is lots of large polygons in one draw call.
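For what it's worth, here's a rough CUDA analogue of that test (my sketch, not the samples mentioned above): a single kernel launch that busy-loops long enough that, on a GPU driving the display, the OS watchdog resets the driver instead of preempting the kernel mid-execution.

    // Hypothetical long-running kernel; nothing inside the launch yields
    // to the scheduler once it has started.
    __global__ void spin(long long cycles)
    {
        long long start = clock64();
        while (clock64() - start < cycles) {
            // burn GPU time
        }
    }

    // e.g. spin<<<1, 1>>>(some huge cycle count); on Windows the default
    // TDR timeout is around 2 seconds before the driver is reset.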
That's what I know too for graphics draw calls. For GPGPU there's been hard work toward finer-grained preemption; last I looked into it (~1 year ago) on Linux it was, to put it kindly, a work in progress.
If you're curious, look up the Intel Broadwell GPU specs; there are sections devoted to the various levels of preemption. If you're really curious, look up the workarounds needed for the finest-grained preemption (this would be preempting a single GPGPU draw call).
Then decide that enabling fine-grained preemption should probably wait for Skylake, unless you took too much Adderall and no challenge sounds impossible. Do I speak from personal experience? I plead the Fifth.
I've no experience with how fine-grained Nvidia's preemption is.
>Intel's latest GPU architecture has an embedded OS running on the GPU for scheduling command batches; I'm not sure what AMD and Nvidia do.
Same on AMD and Nvidia, except it's been like this for the past 10-15 years (depending on whether you count at the bottom or the top of the hardware release pipeline).
To be more precise, GPUs are optimized for embarrassingly parallel workloads. AMD's GCN, for example, has scalar and vector instructions, where the vector instructions are 64 items wide. For graphics, this is used for running a shader on up to 64 items (vertices or pixels) simultaneously. Furthermore, the individual compute units are ridiculously hyper-threaded.
The advantage is that most of the silicon can go towards the actual computation rather than stuff like branch prediction and out-of-order execution. The disadvantage is that branching and looping are problematic: when only one item wants to go down the other branch of an if-else statement, the GPU has to run through both branches for all items (with execution masked off on a per-item basis).
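A quick CUDA illustration of that divergence penalty (my own sketch, names invented): when lanes in the same warp/wavefront disagree at a branch, both sides execute and inactive lanes are masked off.

    __global__ void divergent(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (in[i] > 0.0f)
            out[i] = sqrtf(in[i]);   // runs with some lanes masked off...
        else
            out[i] = 0.0f;           // ...then the other side runs for the rest
    }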
This works extremely well for graphics and high-dimensional numerics workloads (linear algebra, finite elements). It doesn't work at all for, say, spell-checking.
This makes me... uneasy.