Yeah, most audio processing is well within CPU capabilities even with a 100x performance disadvantage, so there's just no need to reach for hardware acceleration and all the complexity that comes with it. Rendering cutting edge graphics, especially on modern 4k 120Hz displays, is well outside CPU capabilities so that 100x boost is unfortunately essential.
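A rough back-of-envelope comparison illustrates the gap; these are just the raw sample/pixel output rates (ignoring per-element work), so treat them as illustrative, not as a benchmark:

```python
# Back-of-envelope: raw output rates for audio vs. modern graphics.
# Illustrative figures only, not measurements.

audio_rate = 48_000 * 2            # 48 kHz stereo -> 96,000 samples/s
video_rate = 3840 * 2160 * 120     # 4K at 120 Hz  -> ~995M pixels/s

print(f"audio: {audio_rate:,} samples/s")
print(f"video: {video_rate:,} pixels/s")
print(f"ratio: {video_rate / audio_rate:,.0f}x")   # ~10,000x more elements/s
```

Even before you account for the heavy per-pixel shading work, graphics pushes roughly four orders of magnitude more elements per second than stereo audio.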
Interestingly, for really advanced audio processing like sound propagation in 3D environments or physics-based sound synthesis [1], GPUs will probably be necessary.
Physically modelled synthesis is no more advanced than anything going on in a modern game engine for video, and in some senses it's significantly less.
So to restate something I said just below: the fact that "most audio processing is well within CPU capabilities" really just says "audio processing hasn't been pushed the way video has, because there's been no motivating factor for it". If enough people were actually interested in large-scale physically modelled synthesis, nobody would describe audio processing as well within CPU capabilities. Why aren't there many people who want/do this? Well ... classic chicken and egg. The final market is no doubt smaller than the game market, so I'd probably blame the lack of hardware for preventing the growth of the user community.
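For a flavour of what "physically modelled synthesis" means at its absolute simplest, here's a Karplus-Strong plucked string (a digital waveguide in miniature). This is a toy sketch, not production DSP; the point is that large-scale physical modelling means thousands of interacting elements like this, each updated every sample:

```python
import random

def karplus_strong(freq_hz, sample_rate=48_000, n_samples=48_000, damping=0.996):
    """Minimal Karplus-Strong plucked string: a noise burst ("the pluck")
    circulating through a delay line, with an averaging lowpass in the
    loop standing in for the string's frequency-dependent losses."""
    period = int(sample_rate / freq_hz)      # delay-line length sets the pitch
    line = [random.uniform(-1.0, 1.0) for _ in range(period)]
    out = []
    for i in range(n_samples):
        s = line[i % period]
        nxt = line[(i + 1) % period]
        # Average of adjacent samples = crude lowpass; damping = energy loss.
        line[i % period] = damping * 0.5 * (s + nxt)
        out.append(s)
    return out

samples = karplus_strong(440.0, n_samples=4800)   # 100 ms of a 440 Hz "string"
```

One string like this is trivially cheap; a full instrument modelled as a 2D or 3D mesh of coupled elements is where the cost explodes.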
Ironically, sound propagation in 3D environments is more or less within current CPU capabilities.
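A rough illustration of why geometric (ray/image-source style) propagation is CPU-friendly: per source you compute a handful of distances, delays, and gains per audio block, not per sample. A toy sketch of the direct path only (constants and the clamp are my assumptions, not any engine's actual model):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at roughly room temperature

def direct_path(source, listener, sample_rate=48_000):
    """Delay (in samples) and inverse-distance gain for the direct path
    from a point source to the listener."""
    dist = math.dist(source, listener)
    delay = int(round(dist / SPEED_OF_SOUND * sample_rate))
    gain = 1.0 / max(dist, 1.0)   # clamp so gain doesn't blow up near the listener
    return delay, gain

# A source 343 m away arrives exactly 1 s (48,000 samples) later.
delay, gain = direct_path((343.0, 0.0, 0.0), (0.0, 0.0, 0.0))
```

Early reflections via image sources are just more of the same arithmetic, which is why this class of technique stays comfortably on the CPU; it's wave-equation solvers that don't.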
I agree that physically modeled synthesis could easily push audio processing requirements far beyond CPU limits and require hardware acceleration similar to graphics. But I disagree that it's a chicken and egg problem. It's a cost/benefit tradeoff and the cost of realtime physically modelled synthesis is higher than its benefit for most applications. Mostly because the alternative of just mixing prerecorded samples works so well, psychoacoustically. (Though I think VR may be the point where this starts to break down; I'd love to try a physically modeled drum set in VR.)
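To put a number on "far beyond CPU limits": a finite-difference simulation of a drum membrane updates every grid point every time step, and the time step has to be small for stability and bandwidth. The grid size, ops-per-update, and step rate below are illustrative assumptions, not figures from any particular solver:

```python
# Back-of-envelope cost of a 2D FDTD membrane simulation (illustrative).
grid = 1000 * 1000        # 1M grid points for a finely resolved membrane
ops_per_update = 10       # assumed floating-point ops per point per step
steps_per_sec = 192_000   # oversampled time step for stability/bandwidth

flops = grid * ops_per_update * steps_per_sec
print(f"{flops / 1e12:.1f} TFLOPS sustained")   # ~1.9 TFLOPS
```

Sustained teraflops is out of reach for a CPU doing this in real time alongside everything else, but routine for a GPU, whereas triggering a prerecorded sample costs essentially nothing.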
I also doubt that specialized audio hardware could provide a large benefit over a GPU for physically modeled synthesis. Ultimately what you need is FLOPS and GPUs these days are pretty much FLOPS maximizing machines with a few fixed function bits bolted on the side. Swapping the graphics fixed function bits for audio fixed function bits is not going to make a huge difference. Certainly nothing like the difference between CPU and GPU.
I suppose that I broadly agree with most of this. The only caveat would be that you also need a very fast round trip from wherever all those FLOPS happen to the audio interface buffer, and my understanding is that even today, in 2020, current GPUs do not provide that (data has to come back to the CPU, and for understandable reasons, this hasn't been optimized as a low-latency path).
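The latency budget involved is easy to quantify: at typical low-latency buffer sizes, the entire compute-plus-roundtrip has to fit inside each audio callback, every time, or you get a dropout.

```python
# Time budget per audio callback at common low-latency buffer sizes.
sample_rate = 48_000
for buffer in (64, 128, 256):
    budget_ms = buffer / sample_rate * 1000
    print(f"{buffer:3d} samples -> {budget_ms:.2f} ms per callback")
```

At 64 samples that's about 1.3 ms for dispatch, compute, and readback combined, with no tolerance for an occasional slow frame, which is a much harsher constraint than graphics faces.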
Re: physically modelled drums ... watch this (from 2008) !!!
Consider the sound made right before the 3:00 mark when Randy just wipes his hand over the "surface". You cannot do this with samples :)
It says something about the distance between my vision of the world and reality that even though I thought this was the most exciting thing I saw in 2008, there's almost no trace of it having any effect on music technology down here in 2020.
Very cool video! I agree that it's a little disappointing that stuff like that hasn't taken off.
I would say that it's not impossible to have a low latency path from the GPU to the audio buffers on a modern system. It would take some attention to detail but especially with the new explicit graphics APIs you should be able to do it quite well. In some cases the audio output actually goes through the GPU so in theory you could skip readback, though you probably can't with current APIs.
Prioritization of time critical GPU work has also gotten some attention recently because VR compositors need it. In VR you do notice every missed frame, in the pit of your stomach...
Even with a fast path (one could imagine, say, the GPU acting as a DMA master with direct access to the same system memory the audio driver allocated for audio output) ... there's a perhaps more fundamental problem with using the GPU for audio "stuff": GPUs have been designed around their typical workloads (notably those associated with games).
The problem is that the workload for audio is significantly different. In a given signal flow, you may have processing stages implemented by the primary application, then the result of that is fed to a plugin which may run on the CPU even in GPU-for-audio world, then through more processing inside the app, then another plugin (which might even run on another device) and then ... and so on and so forth.
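A toy model of the ping-ponging described above: a signal chain whose stages alternate between devices, forcing a transfer at every boundary. The stage and device names here are made up purely for illustration:

```python
# Hypothetical DAW signal chain: (stage, device) pairs.
# Every change of device is a transfer that must complete within
# the audio callback's millisecond-scale deadline.
chain = [
    ("eq",            "gpu"),
    ("3rd-party fx",  "cpu"),   # plugin the host can't relocate
    ("mix",           "gpu"),
    ("outboard",      "dsp"),   # external hardware insert
    ("limiter",       "gpu"),
]

transfers = sum(1 for a, b in zip(chain, chain[1:]) if a[1] != b[1])
print(f"{transfers} device crossings per block")
```

Four crossings per block, every block, is exactly the access pattern GPUs were never designed for, in contrast to graphics where the whole frame stays on one device.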
The GPU-centric workflow is completely focused on collapsing the entire rendering process onto the GPU: you build up an intermediate representation of what is to be rendered, ship it to the GPU, and voilà, it appears on the monitor.
It is hard to imagine an audio processing workflow that would ever result in this sort of "ship it all to the GPU-for-audio and magic will happen" model. At least, not in a DAW that runs 3rd party plugins.
I agree it wouldn't make sense unless you could move all audio processing off of the CPU. You couldn't take an existing DAW's CPU plugins and run them on the GPU, but a GPU based DAW could totally support third party plugins with SPIR-V or other GPU IR. You could potentially even still write the plugins in C++. Modern GPUs can even build and dispatch their own command buffers, so you have the flexibility to do practically whatever work you want all on the GPU. The only thing that would absolutely require CPU coordination is communicating with other devices.
GPU development is harder and the debugging tools are atrocious, so it's not without downsides. But the performance would be unbeatable.
If I've got my history right, one of the big motivations for the video-side analogue of physically modelled synthesis, i.e. 3D graphics in games, was cost. At least that's the argument The Digital Antiquarian makes[1]:
From the standpoint of the people making the games, 3D graphics had another massive advantage: they were also cheaper than the alternative. When DOOM first appeared in December of 1993, the industry was facing a budgetary catch-22 with no obvious solution. ... Even major publishers like Sierra were beginning to post ugly losses on their bottom lines despite their increasing gross revenues.
3D graphics had the potential to fix all that, practically at a stroke. A 3D world is, almost by definition, a collection of interchangeable parts. ... Small wonder that, when the established industry was done marveling at DOOM‘s achievements in terms of gameplay, the thing they kept coming back to over and over was its astronomical profit margins. 3D graphics provided a way to make games make money again.
3D games were cheaper than Sierra-style adventure games because those games were moving towards lots of voice acting and full-motion video. The 7th Guest came out around the same time as Doom, needed two CD-ROMs, and cost $1 million to make (the biggest video game budget at that time). Phantasmagoria came out two years later, cost $4.5 million, and needed seven discs (8 for the Japanese Saturn release). Doom fit on four floppies and wasn't expensive to make (but I can't find a number).
Of course, today's game budgets are huge: detailed 3D models are expensive, FMV cutscenes are nearly mandatory, and games have expanded to tens of gigabytes.
[1] https://graphics.stanford.edu/projects/wavesolver/