It is fascinating to me, as someone who has been doing real time audio work for 20 years, to see how two fundamental processes have been at work during that time:
* audio developers recognizing the parallels between the way audio and video hardware works, and taking as many lessons from the video world as possible. The result is that audio data flow (ignoring the monstrosity of USB audio) is now generally much better than it used to be, and in many cases is conceptually very clean. In addition, less and less processing is done by audio hardware, and more and more by the CPU.
* video developers not understanding much about audio, and failing to notice the parallels between the two data handling processes. Rather than take any lessons from the audio side of things, stuff has just become more and more complex and more and more distant from what is actually happening in hardware. In addition, more and more processing is done by video hardware, and less and less by the CPU.
In both cases, there is hardware which requires data periodically (on the order of a few msec). There are similar requirements to allow multiple applications to contribute what is visible/audible to the user. There are similar needs for processing pipelines between hardware and user space code.
(one important difference: if you don't provide a new frame of video data, most humans will not notice the result; if you don't provide a new frame of audio data, every human will hear the result).
I feel as if both these worlds would really benefit from somehow having a full exchange about the high and low level aspects of the problems they both face, how they have been solved to date, how they might be solved in the future, and how the two are both very much alike, and really quite different.
I like your overall point a lot. I'm an ex-game developer who has also dabbled in audio programming and music, so I see a little of both sides.
I think the key difference, though, is that consumer video needs are probably around two orders of magnitude more computationally complex than consumer audio needs. On consumer machines, almost no interesting real-time audio happens. It's basically just playback with maybe a little mixing and EQ. Your average music listener is not running a real-time software synthesizer on their computer. Gamers are actually probably the consumers with the most complex audio pipelines because you're mixing a lot of sound sources in real-time with low latency, reverb, and other spatial effects.
The only people doing real heavyweight real-time audio are music producers, and for them it's a viable strategy for audio software developers to expect an upgrade to beefier hardware.
With video, almost every computer user is doing really complex real-time rendering and compositing. A graphics programmer can't easily ask their userbase to get better hardware when the userbase is millions and millions of people.
Also, of course, generating 1470 samples of audio per frame (44100 sample rate, stereo, 60 FPS) is a hell of a lot easier than 2,073,600 pixels of video per frame.
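Back-of-the-envelope, assuming a 1080p display (the exact resolution is my assumption; the point is only the scale gap):

```cpp
#include <cstdio>

int main() {
    // Rough per-frame workload comparison at 60 FPS (figures from the comment above).
    const int sample_rate = 44100;      // Hz
    const int channels    = 2;          // stereo
    const int fps         = 60;
    const int width = 1920, height = 1080;

    int audio_samples_per_frame = sample_rate * channels / fps;   // 1470
    int pixels_per_frame        = width * height;                 // 2,073,600

    printf("audio sample values per frame: %d\n", audio_samples_per_frame);
    printf("pixels per frame:              %d\n", pixels_per_frame);
    printf("ratio: ~%dx\n", pixels_per_frame / audio_samples_per_frame);  // ~1410x
    return 0;
}
```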
I agree that audio pipelines are a lot simpler, but I think that's largely a luxury coming from having a much easier computational problem to solve and a target userbase more willing to put money into hardware to solve it.
> Gamers are actually probably the consumers with the most complex audio pipelines because you're mixing a lot of sound sources in real-time with low latency, reverb, and other spatial effects.
There's a GDC talk out there about the complexities of audio in Overwatch, a competitive game. Not only do you have invaluable audio cues coming from 12 players at the same time, but also all of their combined abilities and general sound effects, like countdown timers and such.
One of the bigger problems talked about was that in a competitive game you can't just scale the loudness of an opponent's footsteps by the direct distance between them and you. You need to take into account level geometry, and so the game has to raycast in order to determine how loud any single sound effect should sound to every active player in the game.
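This is not Overwatch's actual implementation (the GDC talk only describes the idea at a high level), but a toy sketch of distance-plus-occlusion attenuation; `raycast_blocked` stands in for whatever line-of-sight query a real engine would provide:

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical line-of-sight query against level geometry. A real engine would
// walk its collision structures here; this stub just pretends nothing blocks.
bool raycast_blocked(const Vec3& /*from*/, const Vec3& /*to*/) { return false; }

// Toy per-listener gain for one sound source: inverse-distance falloff,
// with an extra penalty when level geometry occludes the direct path.
float source_gain(const Vec3& source, const Vec3& listener) {
    float dx = source.x - listener.x;
    float dy = source.y - listener.y;
    float dz = source.z - listener.z;
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz);

    float gain = 1.0f / std::max(dist, 1.0f);   // simple 1/r falloff
    if (raycast_blocked(source, listener))
        gain *= 0.3f;                            // arbitrary occlusion penalty
    return gain;
}
```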
One more thing to add about Overwatch - with all the attention they give to audio, if your CPU is pegged and the game is trying to render 3d audio, nothing can save it and it drops audio.
The end effect is that on low-powered hardware your audio can end up buzzing. In addition, the high cost of Dolby Atmos processing on the CPU affects the rest of the game, such that turning off Dolby Atmos is a viable way to get higher FPS at the cost of lower-fidelity audio.
> The only people doing real heavyweight real-time audio are music producers
I wouldn't discount the number of people using voice assistants and relying on noise reduction/echo cancellation in their meetings, or the near future where we have SMPTE 2098 renderers running on HTPCs.
I really think we're only a few years away from seeing a lot more realtime DSP in consumer applications. Conferencing, gaming, and content consumption all benefit immensely, and it would be good for us all to start thinking about the latency penalty on audio-centric HCI like we do for video.
You don't generate 1470 samples of audio per frame.
You generate N channels worth of 1470 samples per frame, and mix (add) them together. Make N large enough, and make the computation processes associated with generating those samples complex enough, and the difference between audio and video is not so different.
Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs, and so for sections where there's something going on in all tracks (rare), it's more in the range of 400k-900k samples per frame to be dealt with. This sort of track count is also typical in movie post-production scenarios. If you were actually synthesizing those samples rather than just reading them from disk, the workload could exceed the video workload.
And then there's the result of missing the audio buffer deadline (CLICK! on every speaker ever made) versus missing the video buffer deadline (some video nerds claiming they can spot a missing frame :)
> You generate N channels worth of 1470 samples per frame, and mix (add) them together. Make N large enough, and make the computation processes associated with generating those samples complex enough, and the difference between audio and video is not so different.
Sure, but graphics pipelines don't only touch each pixel once either. :)
> Jacob Collier routinely uses 300-600 tracks in his mostly-vocal overdubs, and so for sections where there's something going on in all tracks (rare), it's more in the range of 400k-900k samples to be dealt with.
Sure, track counts in modern DAW productions are huge. But like you note, in practice most tracks are empty most of the time and it's pretty easy to architect a mixer that can optimize for that. There's no reason to iterate over a list of 600 tracks and add 0.0 to the accumulated sample several hundred times.
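For illustration, a minimal sketch of that kind of mixer (not any particular DAW's code); it simply skips blocks that carry no audio:

```cpp
#include <cstddef>
#include <vector>

struct Track {
    std::vector<float> samples;  // this block's samples, empty if the track is silent here
    bool active = false;         // e.g. set when a region actually overlaps this block
};

// Mix only the tracks that have content in this block,
// instead of adding 0.0f from hundreds of empty tracks.
void mix_block(const std::vector<Track>& tracks, float* out, std::size_t nframes) {
    for (std::size_t i = 0; i < nframes; ++i) out[i] = 0.0f;

    for (const Track& t : tracks) {
        if (!t.active || t.samples.size() < nframes) continue;  // skip silent tracks
        for (std::size_t i = 0; i < nframes; ++i)
            out[i] += t.samples[i];
    }
}
```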
> If you were actually synthesizing those samples rather than just reading them from disk, the workload could exceed the video workload.
Yes, but my point is that you aren't. Consumers do almost no real-time synthesis, just a little mixing. And producers are quite comfortable freezing tracks when the CPU load gets too high.
I guess the interesting point to focus on is that with music production, most of it is not real-time and interactive. At any point in time, the producer is usually only tweaking, recording, or playing a single track or two, and it's fairly natural to freeze the other things to lighten the CPU load.
This is somewhat analogous to how game engines bake lighting into static background geometry. They partition the world into things that can change and things that can't, and use a separate pipeline for each.
> And then there's the result of missing the audio buffer deadline (CLICK! on every speaker ever made)
Agreed, the failure mode is catastrophic with audio. With video, renderers will simply use as much CPU and GPU as they can and players will max everything out. With audio, you set aside a certain amount of spare CPU as headroom so you never get too close to the wall.
I think the most concrete argument in support of your position is that typical high-performance audio processing doesn't take an external specialized supercomputer-for-your-computer with 12.3 gazillion parallel cores, a Brazil-like network of shiny ducts hotter than a boiler-room pipe and a memory bus with more bits than an Atari Jaguar ad.
> doesn't take an external specialized supercomputer-for-your-computer with 12.3 gazillion parallel cores,
Not anymore at least. :)
In the early days of computer audio production, it was very common to rely on external PCI cards to offload the DSP (digital signal processing) because CPUs at the time couldn't handle it.
You're restating the grandparent's point: the only people doing real heavyweight real-time audio are music producers. Once Jacob Collier is done rendering his song, it's distributed as a YouTube video with a single stereo audio track.
And, talking about learning lessons from other fields, there's no particular reason that Jacob has to render his audio at full fidelity in real time. Video editors usually do their interactive work with a relatively low-fi representation, and do the high fidelity rendering in a batch process that can take hours or days. As I'm sure you're aware.
> Film sound editors in movie post-production do not work the way you describe.
They don't generally lower fidelity. That is needed for video simply because the data sizes for video footage are so huge that they are hard to work with.
But DAWs do let users "freeze", "consolidate" or otherwise pre-render effects so that everything does not need to be calculated in real-time on the fly.
I'm the original author of a DAW, so I'm more than familiar with what they do :)
Film sound editors do not do what you're describing. They work with typically 600-1000 tracks of audio. They do not lower fidelity, they do not pre-render. Ten years ago, one of the biggest post-production studios in Hollywood used TEN ProTools systems to be able to function during this stage of the movie production process.
One big difference though is jitter tolerance. To some extent, dropping visual frames will be seen as a reduction in average frame rate. For example, if scroll stutters, people may brush it off as "oh yeah the machine's loaded right now" and forgive and forget. However, the tolerance level of audio glitches is much lower (I think for most, but certainly for me). So the leeway for a computation to bleed over from one frame to the next is far less than for video.
> [video] stuff has just become more and more complex and more and more distant from what is actually happening in hardware
There's been a huge push recently to remove layers from graphics APIs and expose hardware more directly, in the form of Vulkan, DX12, and Metal. The problem that this has uncovered is that graphics hardware is not only very complex but quite varied, so exposing "what is actually happening in hardware" is incompatible with portability and simplicity. As a result you get APIs that are either not portable or hard to use.
> more and more processing is done by video hardware, and less and less by the CPU.
Unfortunately this will remain a reality because it comes with a 20-100x performance advantage.
The "lots of different hardware" side of it is where audio hardware was 20 years ago. Devices with multiple streams in hardware, and without. Devices with builtin processing effect X, and without. Devices that could use huge buffer sizes and ones that could only use small ones. Devices that could use just 2 buffers, and devices that could use many. Devices that supported all sample rates in hardware, devices that only supported one.
Things are not totally cleared up, but the variation has been much reduced, and as a result the design of the driver API, and of the user-space APIs above it, has become more straightforward and more consistent, even across platforms.
> 20-100x performance advantage.
Yes, for sure this is one aspect of the comparison that really breaks down. Sure, there are audio interfaces (e.g. those from UAD) that have a boatload of audio processing capabilities on board, and offloading that DSP to the hardware saves oodles of CPU cycles. But these are unusual, far from the norm, and not widely used. Audio interfaces have mostly converged on a massively less capable hardware design than they had 20 years ago.
Yeah, most audio processing is well within CPU capabilities even with a 100x performance disadvantage, so there's just no need to reach for hardware acceleration and all the complexity that comes with it. Rendering cutting edge graphics, especially on modern 4k 120Hz displays, is well outside CPU capabilities so that 100x boost is unfortunately essential.
Interestingly, for really advanced audio processing like sound propagation in 3D environments or physics-based sound synthesis [1], GPUs will probably be necessary.
physically modelled synthesis is no more advanced than anything going on in a modern game engine for video, and in some senses, significantly less.
so to restate something i said just below, the fact that "most audio processing is well within CPU capabilities" is only really saying "audio processing hasn't been pushed in the way that video has because there's been no motivating factor for it". if enough people were actually interested in large scale physically modelled synthesis, nobody would see "audio processing [as] well within CPU capabilities". why aren't there many people who want/do this? Well ... classic chicken and egg. The final market is no doubt smaller than the game market, so I'd probably blame lack of the hardware for preventing the growth of the user community.
ironically, sound propagation in 3D environments is more or less within current CPU capabilities.
I agree that physically modeled synthesis could easily push audio processing requirements far beyond CPU limits and require hardware acceleration similar to graphics. But I disagree that it's a chicken and egg problem. It's a cost/benefit tradeoff and the cost of realtime physically modelled synthesis is higher than its benefit for most applications. Mostly because the alternative of just mixing prerecorded samples works so well, psychoacoustically. (Though I think VR may be the point where this starts to break down; I'd love to try a physically modeled drum set in VR.)
I also doubt that specialized audio hardware could provide a large benefit over a GPU for physically modeled synthesis. Ultimately what you need is FLOPS and GPUs these days are pretty much FLOPS maximizing machines with a few fixed function bits bolted on the side. Swapping the graphics fixed function bits for audio fixed function bits is not going to make a huge difference. Certainly nothing like the difference between CPU and GPU.
I suppose that I broadly agree with most of this. The only caveat would be that you also need a very fast roundtrip from where all those FLOPS happen to the audio interface buffer, and my understanding is that even today in 2020, current GPUs do not provide that (data has to come back to the CPU, and for understandable reasons, this hasn't been optimized as a low latency path).
Re: physically modelled drums ... watch this (from 2008) !!!
Consider the sound made right before the 3:00 mark when Randy just wipes his hand over the "surface". You cannot do this with samples :)
It says something about the distance between my vision of the world and reality that even though I thought this was the most exciting thing I saw in 2008, there's almost no trace of it having any effect on music technology down here in 2020.
Very cool video! I agree that it's a little disappointing that stuff like that hasn't taken off.
I would say that it's not impossible to have a low latency path from the GPU to the audio buffers on a modern system. It would take some attention to detail but especially with the new explicit graphics APIs you should be able to do it quite well. In some cases the audio output actually goes through the GPU so in theory you could skip readback, though you probably can't with current APIs.
Prioritization of time critical GPU work has also gotten some attention recently because VR compositors need it. In VR you do notice every missed frame, in the pit of your stomach...
Even with a fast path (one could perhaps imagine the GPU being a DMA master with direct access to the same system memory that the audio driver allocated for audio output) ... there's a perhaps more fundamental problem with using the GPU for audio "stuff". GPUs have been designed around the typical workloads (notably those associated with games).
The problem is that the workload for audio is significantly different. In a given signal flow, you may have processing stages implemented by the primary application, then the result of that is fed to a plugin which may run on the CPU even in GPU-for-audio world, then through more processing inside the app, then another plugin (which might even run on another device) and then ... and so on and so forth.
The GPU-centric workflow is completely focused on collapsing the entire rendering process onto the GPU, so you basically build up an intermediate representation of what is to be rendered, ship it to the GPU et voila! it appears on the monitor.
It is hard to imagine an audio processing workflow that would ever result in this sort of "ship it all to the GPU-for-audio and magic will happen" model. At least, not in a DAW that runs 3rd party plugins.
I agree it wouldn't make sense unless you could move all audio processing off of the CPU. You couldn't take an existing DAW's CPU plugins and run them on the GPU, but a GPU based DAW could totally support third party plugins with SPIR-V or other GPU IR. You could potentially even still write the plugins in C++. Modern GPUs can even build and dispatch their own command buffers, so you have the flexibility to do practically whatever work you want all on the GPU. The only thing that would absolutely require CPU coordination is communicating with other devices.
GPU development is harder and the debugging tools are atrocious, so it's not without downsides. But the performance would be unbeatable.
If I got my history right, one of the big motivations for physically modeled synthesis in video aka 3d graphics in games was cost. At least that's the argument The Digital Antiquarian makes[1]:
From the standpoint of the people making the games, 3D graphics had another massive advantage: they were also cheaper than the alternative. When DOOM first appeared in December of 1993, the industry was facing a budgetary catch-22 with no obvious solution. ... Even major publishers like Sierra were beginning to post ugly losses on their bottom lines despite their increasing gross revenues.
3D graphics had the potential to fix all that, practically at a stroke. A 3D world is, almost by definition, a collection of interchangeable parts. ... Small wonder that, when the established industry was done marveling at DOOM‘s achievements in terms of gameplay, the thing they kept coming back to over and over was its astronomical profit margins. 3D graphics provided a way to make games make money again.
3D games were cheaper than Sierra-style adventure games because those games were moving towards lots of voice acting and full-motion video. The 7th Guest came out around the same time as Doom, needed two CD-ROMs and cost $1 million to make (the biggest video game budget at that time). Phantasmagoria came out two years later, cost $4.5 million and needed seven discs (8 for the Japanese Saturn release). Doom fit on four floppies and wasn't expensive to make (but I can't find a number).
Of course, today's game budgets are huge --- detailed 3d models are expensive, and FMV cutscenes are nearly mandatory. And games have expanded to 10s of gigabytes.
Hardware-wise, both audio and video have converged on exactly the same thing: there is a buffer that should be scanned/played out. The fact that the contents of this buffer are generated by a bunch of specialized hardware in the video case and can be perfectly well computed in software for audio is just a question of bandwidth and CPU performance.
You still can't render an orchestral-size collection of physically modelled instruments on the CPU. A processor that could would be something of an analogue to the modern GPU, but for various reasons, there has been little investment in chip design that would move further in this direction.
Games in particular created a large push for specialized GPU architectures, and there's really no equivalent motivation to do the same for audio, in part because there's no standard synthesis approach that things would converge on (the way they have in GPUs).
So, "perfectly well computed in software for audio" is more of a statement about the world as-it-is rather than one about any fundamental difference between the two media.
Is it possible to do audio processing using a GPU?
I don't have enough working knowledge or mathematics to fully comprehend the 3d pipeline but as I understand, it is a huge linear algebra/matrix manipulator and audio processing would fall into this. My reason for asking is that at least in the guitar world I've seen issues related to CPU processing and multiple VST plugins all wired together. Some guitarists use fairly complicated rigs to get their tone and when you have multiple amps, speaker cabinet simulation, and effects running you need a pretty powerful machine to do everything real-time without under-running buffers.
There are specific hardware solutions for this but I've always wondered if you could just create shaders or something and dump the audio stream to the GPU and get back the transformed audio stream with minimal CPU burden. Any insight here would be very helpful to me.
The last time I checked, the problem with GPUs and realtime audio is that although the GPU is extremely powerful and the bandwidth is fairly large, the latency for a roundtrip from/to the CPU is too large for what we typically consider "low latency audio".
Put differently, the GPU is plenty powerful to do the processing you're thinking of, but it can't get the results back to the CPU in the required time.
For video, there is no roundtrip: we send a "program" (and maybe some data) from the CPU to the GPU; the GPU delivers it into the video framebuffer.
For audio, the GPU has no access to the audio buffer used by the audio interface, so the data has to come back from the GPU before it can be delivered to the audio interface.
I would be happy to hear that this has changed.
None of this prevents incredibly powerful offline processing of audio on a GPU, or using the GPU in scenarios where low latency doesn't matter. Your friends with guitars and pedal boards are not one of those scenarios, however.
I think the latency overhead can be low enough these days, say on the order of 10-100 us over PCIe for the whole sequence: kernel launch, copy data from CPU to GPU, read from global memory, do some compute, write to global memory, copy data back to the CPU, and either wait for completion on the CPU side (which might copy the data to some other buffer) or receive a non-blocking DMA write instead. But there is a tradeoff between the unit of work that one would give the GPU, the efficiency of the compute (and the working-set size of data that you need to load in from global memory to do the compute), and the number of individual kernel launches that one would need to do to produce small pieces of the output. There are some tricks involving atomics (or, in CUDA in particular, cooperative thread groups) that allow for persistent kernels that are always producing data and periodically ingesting commands from the CPU, so the CPU doesn't constantly need to tell the GPU what to do.
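To put those numbers against a typical low-latency audio deadline, a purely illustrative budget check (the 64-frame/48 kHz buffer and the 100 us roundtrip are assumptions, not measurements):

```cpp
#include <cstdio>

int main() {
    // Illustrative budget check: can a GPU roundtrip fit inside one audio callback?
    const double sample_rate   = 48000.0;  // Hz
    const double buffer_frames = 64.0;     // a common "low latency" buffer size
    const double callback_budget_us = buffer_frames / sample_rate * 1e6;  // ~1333 us

    const double roundtrip_us = 100.0;     // pessimistic end of the 10-100 us estimate above

    printf("callback budget: %.0f us\n", callback_budget_us);
    printf("one GPU dispatch + readback: ~%.0f us (%.1f%% of the budget)\n",
           roundtrip_us, 100.0 * roundtrip_us / callback_budget_us);
    return 0;
}
```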
Both those links describe audio processing on a GPU. Neither of them address (from a quick skim) the roundtrip issue that occurs when doing low latency realtime audio.
For the unpredictable part of the input, you can't have the best latency and throughput at the same time. If you want realtime in a variable load scenario, you have to cap the GPU usage.
That's how gamers do it when they want the lowest latency possible, anyway. Something like "find the lowest frame rate your game runs on, and cap it to 80% of that".
We have HDMI audio, is that a start? I assume that means the video card is registering as a sound card as well; no idea if any processing is on the GPU...
the GPU is not involved in computing the audio stream. the audio side of the HDMI appears to the OS as an audio device, and the CPU generates and delivers the audio data to it. It's essentially a passthrough to the HDMI endpoint.
> Is it possible to do audio processing using a GPU?
This thread fascinates me. A sibling comment addresses why it can't be done in realtime (barring some pedestrian-sounding hardware improvements), but the question of "how" fascinates me. I know a little bit about audio, I'm a classical musician with some experience with big halls, I know some physics, and I wrote a raytracer in JS back in the early aughts. Can we raytrace sound? It's a wave, it's got a spectrum, different materials reflect, refract, and absorb various parts of the spectrum differently... sounds pretty similar! The main differences I see is that we usually only have a handful of "single pixel cameras" (I expect to lose a bit of parallelism to that) and more problematic, we can't treat the speed of sound as infinite. I can imagine a cool hack to preprocess rays for a static scene, but my imagined approach breaks down entirely if the mic moves erratically.
We can already do this (without GPUs, and quite likely on them). It is done, for example, by a certain class of reverb algorithm which more or less literally "raytraces" the sound in a room. It's not the most efficient way to implement a reverb (convolution algorithms are better for this). But it does allow for a large degree of control that convolution does not.
In addition, you may also consider physical modelling synthesis to be an example of this. Look up Pianoteq. It is a physically modelled piano that literally does all the math for the sound from a hammer of particular hardness striking a string of given size and tension at a particular speed, and then that sound generating resonances within the body of the piano and other strings, and then out into the room and interacting with microphone placement, lid opening etc. etc.
Most (not all) people agree that Pianoteq is the best available synthetic piano. It's all math, no samples.
There are similar physical modelling synth engines for many other kinds of sounds, notably drums. They are not particularly popular because they use more CPU than samples and can be complicated to tweak. They also really benefit from a more sophisticated control surface/instrument than most people have access to (i.e. playing a physically modelled drum from a MIDI keyboard doesn't really allow you to get into the nuances of the drum).
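For a flavour of what physical modelling looks like at its very simplest (nothing like Pianoteq's model, which is far more elaborate), here is a minimal Karplus-Strong plucked string: a burst of noise circulating in a delay line, low-pass filtered each pass so it decays like a string:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Minimal Karplus-Strong plucked string.
std::vector<float> pluck(float freq_hz, float sample_rate, std::size_t nsamples) {
    std::size_t delay = static_cast<std::size_t>(sample_rate / freq_hz);
    std::vector<float> line(delay);

    std::mt19937 rng(42);
    std::uniform_real_distribution<float> noise(-1.0f, 1.0f);
    for (float& s : line) s = noise(rng);                // the "pluck": random excitation

    std::vector<float> out(nsamples);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < nsamples; ++i) {
        std::size_t next = (pos + 1) % delay;
        float sample = 0.5f * (line[pos] + line[next]);  // averaging = gentle low-pass
        out[i] = sample;
        line[pos] = sample * 0.996f;                     // feedback with slight damping
        pos = next;
    }
    return out;
}
```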
Yes, sound can be raytraced - in realtime on current GPUs even. Using a ray model for sound waves is called Geometric Acoustics and works reasonably well in most cases, although it has a few shortcomings, mostly when dealing with low frequencies (room modes, diffraction).
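A toy sketch of the geometric-acoustics idea: each path the ray tracer finds just becomes a delayed, attenuated tap in an impulse response, which you then convolve with the dry signal. The path tracing itself is omitted here, and `Path` is a made-up structure for illustration:

```cpp
#include <cstddef>
#include <vector>

struct Path {            // one sound path found by tracing rays through the room
    float length_m;      // total travel distance (assumed > 0)
    float energy;        // product of wall reflection coefficients along the way
};

// Turn a set of traced paths into an impulse response: each path becomes a
// delayed, attenuated tap. Convolving a dry signal with this IR gives the reverb.
std::vector<float> build_ir(const std::vector<Path>& paths,
                            float sample_rate, float speed_of_sound = 343.0f) {
    std::vector<float> ir;
    for (const Path& p : paths) {
        std::size_t delay =
            static_cast<std::size_t>(p.length_m / speed_of_sound * sample_rate);
        if (ir.size() <= delay) ir.resize(delay + 1, 0.0f);
        ir[delay] += p.energy / p.length_m;  // 1/r spreading loss plus wall absorption
    }
    return ir;
}
```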
I think it's too pessimistic to say it can't be done in real time on current GPUs. I think it should be possible to do reliable low latency readback with modern explicit graphics APIs. The harder problem is scheduling the audio work at high priority so it always finishes before the deadline, but this too has improved recently because VR compositors also have strict deadlines they must meet.
I remember hearing that Nvidia were working on an audio spatialiser using their RTX hardware/expertise, but I can't recall hearing about it after 2018 or so, which is where the project I was working on that I wanted audio spatialisation for ended.
There are a number of clever audio processing algorithms implemented on GPUs already. But that's not the issue at hand, which is doing so in a low-latency realtime context. GPUs are insanely powerful, and if you're just doing playback of material from disk, the tiny bit of extra latency they may cause isn't a problem, so go for whatever they can do. But if you're using the computer for synthesis or FX processing, you need better than GPUs can offer in terms of latency (for now).
Could you be a bit more tangible about how video developers have just made things more complex? I don't necessarily disagree, since I can't wrap my mind around modern video codecs easily (in concept sure, but they're harder to use in practice), but that might be just because I'm ignorant?
I think Raph (Levien) did a fine job of that in the original article, albeit primarily focused on the issues surrounding compositing.
I wasn't talking about codecs, but about how software interacts with the hardware. We're already dealing with the move away from OpenGL, which for a while had looked as if it might actually finally wrap most video in a consistent, portable model. Metal and Vulkan are moderately consistent with each other, but less so with the existing OpenGL-inspired codebase(s).
But there are even simpler levels than this: in audio, on every (desktop) platform, it is trivial for audio I/O to be tightly coupled to the hardware "refresh" cycle. WDM-WASAPI/CoreAudio/ALSA all make this easy, if you need it (and for low latency audio, you do). Doing this for graphics/video is an insanely complex task. I know of no major GUI toolkit that even exports the notion of sync-to-vblank, let alone does so in any actually portable way (Qt and GTK have both made a few steps in this direction, but if you dive in deep, you find that it's a fairly weak connection compared to the way things work in audio).
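For comparison, this is roughly the shape of the callback model those audio APIs expose. The names here are invented for illustration; CoreAudio, WASAPI and ALSA each spell the same idea differently:

```cpp
#include <cstdint>

// The shape of the callback-driven model that low-latency audio APIs expose.
struct AudioCallbackInfo {
    float*   output;        // interleaved buffer to fill
    uint32_t nframes;       // frames the hardware needs right now
    uint32_t channels;
    double   stream_time;   // when this buffer will actually hit the DAC
};

// Called by the driver once per hardware buffer, on a high-priority thread.
// Miss the deadline and every speaker clicks; there is no "dropped frame".
void audio_callback(const AudioCallbackInfo& info) {
    for (uint32_t i = 0; i < info.nframes * info.channels; ++i)
        info.output[i] = 0.0f;   // silence; a real app renders its mix here
}
```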
It seems like we are coming full circle in trying to reduce latency in all parts of the chain. With the next gen gaming consoles supporting VRR and 120 Hz, recently announced Nvidia Reflex low latency modes, 360 Hz monitors, and things like the Apple Pencil 2 reducing the latency from pen input to on screen down to 9 ms, we are working our way back to something close to what we used to have.
But I feel like we have a number of years to go until we can really get back to where we used to be with vintage gaming consoles and CRT displays.
I think that with the end of Moore's law, there is a gradual unwinding of software bloat and inefficiency. Once the easy hardware gains die away, it is worth studying how software can be better, but it's a cultural transition that takes time.
It is also the case that in some/many areas software has improved performance by several multiples of the improvement in hardware performance over the last 20 years. So it makes sense that this would invigorate software performance investment.
Can you give some examples of areas where software performance improvement has been several multiples of hardware improvement?
Most of the examples I can think of are ones where the software slowdown has more than cancelled out the hardware improvements. Then there are some areas where hardware performance improvement was sufficient to overcome software slowdown. Software getting faster? Software getting faster than hardware??
Compilers are way smarter and can make old code faster than it used to run. They also parse the code way faster, and would as such compile faster if you restrict them to the level of optimizations they used to do.
"Interpreters" are faster as well: Lisp runs faster, JavaScript is orders of magnitudes more efficient as well.
Algorithms have been refined. Faster paths have been uncovered for matrix multiplication [1] (unsure if the latest improvements are leveraged) and other algorithms.
Use-cases that have been around for a while (say, H.264 encode/decode) are more optimized.
We now tend to be a lot better at managing concurrency too (see: rust, openmp and others), with the massively parallel architectures that come out nowadays.
However, I can't really agree that they are examples of software improvements being multiples of hardware improvements.
1. Compiler optimizations
See Proebsting's Law [1], which states that whereas Moore's Law provided a doubling of performance every 18-24 months, compiler optimizations provide a doubling every 18 years at best. More recent measurements indicate that this was optimistic.
2. Compilers getting faster
Sorry, not seeing it. Swift, for example, can take a minute before giving up on a one line expression, and has been clocked at 16 lines/second for some codebases.
All the while producing slow code.
See also part of the motivation for Jonathan Blow's Jai programming language.
3. Matrix multiplication
No numbers given, so ¯\_(ツ)_/¯
4. h.264/h.265
The big improvements have come from moving them to hardware.
You are reading my examples in bad faith :) (though I originally missed your point about "multiples of")
You want examples where software has sped up at a rate faster than hardware (meaning that new software on old hardware runs faster than old software on new hardware).
Javascript might not have been a good idea in the first place, but I bet that if you were to run V8 (if you have enough RAM) on 2005-era commodity hardware, it would be faster than running 2005-SpiderMonkey on today's hardware. JIT compilers have improved (including lisps, php, python, etc).
Can you give me an example of 2005-era swift running faster on newer hardware than today's compiler on yesterday's hardware? You can't, as this is a new language, with new semantics and possibilities. Parsing isn't as simple as it seems, you can't really compare two different languages.
These software improvements also tend to pile up along the stack. And comparing HW to SW is tricky: you can always cram more HW to gain more performance, while using more SW unfortunately tends to have the opposite effect. So you have to restrict yourself HW-wise: same price? same power requirements? I'd tend to go with the latter as HW has enjoyed economies of scale SW can't.
Concurrency might be hardware, but in keeping with the above point, more execution cores will be useless for a multithread-unaware program. Old software might not run better on new HW, but old HW didn't have these capabilities, so the opposite is probably true as well. Keep in mind that these new HW developments were enabled by SW developments.
> No numbers given, so ¯\_(ツ)_/¯
Big-O notation should speak for itself, I am not going to try and resurrect a BLAS package from the 80s to benchmark against on a PIC just for this argument ;)
Other noteworthy algorithms include the FFT [1]. (I had another one in mind but lost it).
> The big improvements have come from moving them to hardware.
I'm talking specifically about SW implementations. Of course you can design an ASIC for most stuff. And most performance-critical applications probably had ASICs designed for them by now, helping prove your point. SW and HW are not isolated either, and an algorithm optimized for old HW might be extremely inefficient on new HW, and vice-versa.
And in any case, HW developments were in large part enabled by SW developments with logic synthesis, place and route, etc. HW development is SW development to a large extent today, though that was not your original point.
What can't be argued against, however, is that both SW and HW improvements have made it much easier to create both HW and SW. Whether SW or HW has been most instrumental with this, I am not sure. They are tightly coupled: it's much easier to write a complex program with a modern compiler, but would you wait for it to compile on an old machine? Likewise for logic synthesis tools and HW simulators. Low-effort development can get you further, and that shows. I guess that's what you are complaining about.
> I originally missed your point about "multiples of"
That wasn't my point, but the claim of the poster I was replying to, and it was exactly this claim that I think is unsupportable.
Has some software gotten faster? Sure. But mostly software has gotten slower, and the rarer cases of software getting faster have been outpaced significantly by HW.
> You want examples where software has sped up at a rate faster than hardware
"several multiples":
that in some/many areas software has improved performance by several multiples of the improvement in hardware performance over the last 20 years
> [JavaScript] JIT compilers have improved
The original JITs were done in the late 80s / early 90s. And their practical impact is far less than the claimed impact.
As an example the Cog VM is a JIT for Squeak. They claim a 5x speedup in bytecodes/s. Nice. However the naive bytecode interpreter, in C, on commodity hardware in 1999 (Pentium/400) was 45 times faster than the one microcoded on a Xerox Dorado in 1984, which was a high-end, custom-built ECL machine costing many hundred thousands of dollars. (19m bytecodes/s vs. 400k bytecodes/s).
So 5x for software, at least 45x for hardware. And the hardware kept improving afterward, nowadays at least another 10x.
> [compilers] Parsing isn't as simple as it seems [..]
Parsing is not where the time goes.
> 2005-era swift running faster
Swift generally has not gotten faster at all. I refer you back to Proebsting's Law and the evidence gathered in the paper: optimizer (=software) improvements achieve in decades what hardware achieves/achieved in a year.
There are several researchers that say optimization has run out of steam.
(the difference between -O2 and -O3 is just noise)
> Big-O notation should speak for itself
It usually does not. Many if not most improvements in Big-O these days are purely theoretical findings that have no practical impact on the software people actually run. I remember when I was studying that "interior point methods" were making a big splash, because they were among the first linear optimization algorithms with polynomial complexity, whereas the Simplex algorithm has exponential worst-case complexity. I don't know what the current state is, but at the time the reaction was a big shrug. Why? Although Simplex is exponential in the worst case, it typically runs in linear or close to linear time and is thus much, much faster than the interior point methods.
Similar for recent findings of slightly improved multiplication algorithms. The n required for the asymptotic complexity to overcome the overheads is so large that the results are theoretical.
> FFT
The Wikipedia link you provided goes to algorithms from the 1960s and 1940s, so not sure how applicable that is to the question of "has software performance improvement in the last 20 years outpaced hardware improvement by multiples?".
Are you perchance answering a completely different question?
> [H264/H265] I'm talking specifically about SW implementations
Right, and the improvements in SW implementations don't begin to reach the improvement that comes from moving significant parts to dedicated hardware.
And yes, you have to modify the software to actually talk to the hardware, but you're not seriously trying to argue that this means this is a software improvement??
But let's agree to put this argument to rest. I generally agree with you that
1. Current software practices are wasteful, and it's getting worse
2. According to 1. most performance improvements can be attributed to HW gains.
I originally just wanted to point out that this was true in general, but that there were exceptions, and that hot paths are optimized. Other tendencies are at play, though, such as the end of Dennard scaling. I tend to agree with https://news.ycombinator.com/item?id=24515035 that to achieve future gains, we might need tighter coupling between HW and SW evolution, as general-purpose processors might not continue to improve as much. Feel free to disagree, this is conjecture.
> And yes, you have to modify the software to actually talk to the hardware, but you're not seriously trying to argue that this means this is a software improvement??
My point was more or less the same as the one made in the previously linked article: HW changes have made some SW faster, other comparatively slower. These two do not exist in isolated bubbles. I'm talking of off-the-shelf HW, obviously. HW gets to pick which algorithms are considered "efficient".
Recursive descent has been around forever, the Wikipedia[1] page mentions a reference from 1975[2]. What recent advances have there been in parsing performance?
> 1. Current software practices are wasteful, and it's getting worse
> 2. According to 1. most performance improvements can be attributed to HW gains.
Agreed.
3. Even when there were advances in software performance, they were outpaced by HW improvements, certainly typically and almost invariably.
I think a better answer than I could give can be found at [1]. One pull quote they highlight:
But the White House advisory report cited research, including a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds, according to the research by Martin Grotschel, a German scientist and mathematician. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.
I actually took Professor Grötschel's Linear Optimization course at TU Berlin, and the practical optimization task/competition we did for that course very much illustrates the point made in the answer to the stackexchange question you posted.
Our team won the competition, beating the performance not just of the other student teams' programs, but also the program of the professor's assistants, by around an order of magnitude. How? By changing a single "<" (less than) to "<=" (less than or equal), which dramatically reduced the run-time of the dominant problem of the problem-set.
This really miffed the professor quite a bit, because we were just a bunch of dumb CS majors taking a class in the far superior math department, but he was a good sport about it and we got a nice little prize in addition to our grade.
It also helped that our program was still fastest without that one change, though now with only a tiny margin.
The point being that, as the post notes, this is a single problem in a single, very specialized discipline, and this example absolutely does not generalize.
I think that your original comment is the one that risks over-generalization. The "software gets slower" perception is in a very narrow technical niche: user interface/user experience for consumer applications. And it has a common cause: product managers and developers stuff as much into the user experience as they can until performance suffers. But even within that over-stuffing phenomenon you can see individual components that have the same software improvement curve.
In any area where computing performance has been critical - optimization, data management, simulation, physical modeling, statistical analysis, weather forecasting, genomics, computational medicine, imagery, etc. there have been many, many cases of software outpacing hardware in rate of improvement. Enough so that it is normal to expect it, and something to investigate for cause if it's not seen.
First, I think your estimate of which is the niche, High Performance Computing or all of personal computing (including PCs, laptops, smartphones, and the web) is not quite correct.
The entire IT market is ~ $3.5 trillion, HPC is ~ $35 billion. Now that's nothing to sneeze at, but just 1% of the total. I doubt that all the pieces I mentioned also account for just 1%. If so, what's the other 98%?
Second, there are actually many factors that contribute to software bloat and slowdown, what you mention is just one, and many other kinds of software are getting slower, including compilers.
Third, while I believe you that many of the HPC fields see some algorithmic performance improvements, I just don't buy your assertion that this is regularly more than the improvements gained by the massive increases in hardware capacity, and that one singular example just doesn't cut it.
> But I feel like we have a number of years to go until we can really get back to where we used to be with vintage gaming consoles and CRT displays.
Interesting. CRTs also had a fixed framerate. Let's make that 60 fps for the sake of the argument.
It really depends on what you are calling "vintage". Most later consoles (with GPUs) just composited images at a fixed frame-rate.
Earlier software renderers are quite interesting, though, in that they tend to "race the beam" and produce pixel data a few microseconds before it is displayed. Does that automatically transfer to low latency? I'm not sure. If the on-screen character is supposed to move by a few pixels with the last input, it really depends on whether you have drawn it already. Max latency is 16 ms, min is probably around 100 µs. That gives you 8 ms of expected latency. And I think you still get tearing, in some cases.
There is also no reason it couldn't be done with modern hardware, except the wire data format for HDMI/DP might need to be adjusted.
However, and I've said this for a long time, one key visible difference between CRTs and LCDs is persistence. Images on an LCD persist for a full frame, instead of letting your brain interpolate the images. The result is a distinctively blurry edge on moving objects. Some technologies such as backlight strobing (aka ULMB) aim to alleviate this (you likely need triple buffering to combine this with adaptive sync, which I haven't seen).
I wonder if rolling backlights could allow us to race the beam once again?
QLED/OLED displays could theoretically bring a better experience than CRTs if the display controller allowed it: every pixel emits its own light, so low persistence is achievable. You don't have a beam with fixed timings, so you could just update what's needed in time for displaying it.
A separate latency difference is in how long it takes for the pixels to switch, i.e. the time between when the device starts trying to change the colour of a pixel and when you see that colour change. This is a relatively long time for an LCD but not so long for a CRT.
Edit: regarding my older comment, I thought that QLED were quantum dots mounted on individual LEDs. They are regular LCDs, with quantum dots providing the colour conversion. That makes more sense from an economical perspective, less so for performance. Maybe OLED and QLED could be combined to leverage the best OLED for all colors?
Persistence is the time it takes between when the luminosity goes up and when it goes down. It is low for CRTs and high for LCDs (which stay on until they need to switch, unless the monitor is low persistence or low persistence is emulated with a high refresh rate and black frame insertion).
The switching time is the time between when the signal starts (or rather when the monitor has finished working out what the signal means and has started trying to change the luminosity) and when it gets bright (or in the case of LCDs, to the right luminosity).
Ghosting comes from not switching strongly enough and overdrive (switching too hard) tends to lead to poor colour accuracy.
You are right. That directly contributes to latency. This effect is called pixel response time [1], and is usually measured "grey-to-grey" [2]. Nowadays, though, I think monitors usually have a short response time (<1ms for "gamer" monitors).
> Persistence is the time it takes between when the luminosity goes up and when it goes down
Right. But that proves the two are somewhat linked, especially as pixel response isn't symmetric (0->1 and 1->0).
Reading a bit more into it, the industry terms for persistence seem to be both grey-to-grey (GtG), and Moving Picture Response Time (MPRT). The latter is a measurement method [3] for perceived motion blur. It directly depends on the time each pixel remains lit with the same value ("persistence" strikes again), so a slow (>1 frame) or incomplete transition can create motion blur (contributes to persistence).
> low persistence is emulated with a high refresh rate and black frame insertion
It can also be achieved with backlight strobing on backlit displays: strobing for 1ms on a "full-persistence" (pixel always on) display gives a 1ms persistence regardless of the frame rate. A ~1000 FPS display would be necessary to get the same level of persistence with black frame insertion alone. I believe this is part of the reason Valve went with an LCD instead of an OLED screen to get the 0.33ms persistence on the Index HMD.
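To make the arithmetic explicit (a tiny illustrative calculation, using the 1 ms strobe figure from above):

```cpp
#include <cstdio>

int main() {
    // Persistence if each frame stays lit for its full frame time (sample-and-hold),
    // compared to a fixed 1 ms backlight strobe, which is refresh-rate independent.
    const double strobe_ms = 1.0;
    const double hz[] = {60.0, 144.0, 1000.0};

    for (double h : hz)
        printf("%4.0f Hz sample-and-hold: %5.2f ms persistence\n", h, 1000.0 / h);
    printf("1 ms strobe: %.2f ms persistence at any refresh rate\n", strobe_ms);
    return 0;
}
```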
It seems to be a normal engineering process: first make it work, then look for those high-level things that you are ready to break in order to gain performance through optimization and micro-optimization.
I've always really liked the Nintendo DS's GPU, and thought something with a similar architecture but scaled up would make a lot of sense in a few places in the stack. Unlike most GPUs, you submitted a static scene, and then the rendering happened pretty close to lock step with the scan out to the LCD. There was a buffer with a handful of lines stored, but it was pretty close; the latency was pretty much unbeatable. In a lot of ways it was really a 3D extension to their earlier 2D (NES, SNES, GB, GBA) GPU designs.
Something like that sitting where the scan out engines exist today (with the multiple planes they composite today) would be absolutely killer if you could do hardware/software co-design to take advantage of it.
I've also thought that something like that would be great in VR/AR.
Buffers see extensive use in modern shaders. A frame will get rasterized several times for different shaders before those buffers are all composited into an output buffer. Several types of things can only be known if you've got the full scene rasterized or at least more than a few lines.
If you've only got fixed function shaders and a relatively simple rasterizer the raster-to-display works fine. The DS also only has a single contributor to the output. A game on a PS4 or a PC is just one of several contributors to a compositing window/display manager which is actually doing final composition to the output device(s).
Right, but my point here was mainly aimed at the compositor. Those shaders are much much simpler. They're just combining existing buffers in a single pass every time I've seen it. I'm not saying swapping out the main GPU, but instead the programmability that already exists in the scan out engine.
Awesome! I'd really suggest Martin Korth's fantastic documentation in that case. In a lot of ways it was better than even the official docs given to developers.
This guy has articles about most of the major consoles from the last 30 years, I find them fascinating to read even though much of it goes over my head.
So one thing to note is that hardware overlays are really uncommon on desktop devices. NVIDIA only has a single YUV overlay and a small (used to be 64x64, now it might be 256x256?) cursor overlay. And even then, the YUV overlay might as well not exist -- it has a few restrictions that mean most video players can't or won't use it. After all, you're already driving a giant power-hungry beast like an NVIDIA card, there's no logical reason to save power and move a PPU back into the CRTC. So hardware overlays won't help the desktop case.
I still think we get better compositor performance by decoupling "start of frame" / "end of frame" and the drawing API. The big thing we lack on the app side is timing information -- an application doesn't know its budget for how long it should take and when it should submit its frame, because the graphics APIs only expose vsync boundaries. If the app could take ~15ms to build a frame, submit it to the compositor, and the compositor takes the remaining ~1ms to do the composite (though likely much, much less, these are just easy numbers), it could be made to display in the current vsync cycle. We just don't have accurate timing feedback for this though.
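A sketch of the frame pacing an app could do if the compositor actually reported that timing; the `CompositorTiming` feedback is hypothetical, and the 15 ms / 1 ms split is just the same easy numbers as above:

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical feedback a compositor could expose (it generally does not today).
struct CompositorTiming {
    Clock::time_point next_vsync;
    std::chrono::microseconds composite_cost;  // e.g. ~1 ms
};

// Start rendering as late as possible while still making the current vsync,
// based on our own estimate of how long a frame takes to build (e.g. ~15 ms).
void pace_frame(const CompositorTiming& t,
                std::chrono::microseconds estimated_render_time) {
    auto latest_start = t.next_vsync - t.composite_cost - estimated_render_time;
    std::this_thread::sleep_until(latest_start);
    // render_frame();  // build and submit the frame here
}
```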
One of my favorite gamedev tricks was used on Donkey Kong Country Returns. There, the developers polled input far above the refresh rate, and rendered Donkey Kong's 3D model at the start of the frame into an offscreen buffer; then, as the frame was being rendered, they processed input and did physics. Only at the end of the frame did they composite Donkey Kong into the scene using the updated physics. So they in fact cut the latency to be sub-frame through clever trickery, at the expense of small inaccuracies in animation. Imagine if windows got "super late composite" privileges, where they could submit their image just in the nick of time.
(Also, I should probably mention that my name is "Jasper St. Pierre". There's a few tiny inaccuracies in the history -- old-school Win95/X11 still provides process separation as the display server bounds the window's drawing to the window's clip list for the app, and Windows 2000 also had a limited compositor, known as "layered windows", where certain windows could be redirected offscreen [0], but these aren't central to your thesis)
Apologies for getting your name wrong; it should be fixed now. I also made some other changes which I hope address the points you brought up. I didn't know about Windows 2000 layered windows, but I think the section on which systems had process separation was just confusingly worded.
Counting Intel, most (by number) desktop devices have pretty sophisticated hardware layer capabilities. Skylake has 3 display pipes, each of which has 3 display planes and a cursor. The multiple pipes are probably mostly used for multi-monitor configurations, but it's still a decent setup. From the hints I've picked up, I believe DirectFlip was largely engineered for this hardware.
There are lots of interesting latency-reducing tricks. Some of those might be good inspiration, but what I'm advocating is a general interface that lets applications reliably get good performance.
Technically I think it goes
1. DirectFlip (Windows 8)
2. Independent Flip
3. Multi-plane overlays (Windows 10)
DirectFlip can be supported on a lot of hardware, but is limited to borderless fullscreen unless you have overlays. https://youtu.be/E3wTajGZOsA?t=1531 has an explanation.
Intel GPUs support hardware overlays. Chrome grants one to a top most eligible canvas element.
https://developers.google.com/web/updates/2019/05/desynchron... has half the story for getting extremely low latency inking on Chromebooks with Intel GPUs. The other half is ensuring the canvas is eligible for hardware overlay promotion, which the ChromeOS compositor will do.
Nvidia's chips support hardware overlays. Historically CAD software used them, so Nvidia soft-locks the feature in the GeForce drivers to force CAD users to buy Quadro cards for twice the price instead. Their price discrimination means we can't have nice things.
Huh, I've never heard of this. Traditionally, hardware overlays are consumed by the system compositor. Do they have an OpenGL extension to expose it to the app?
I guess it's possible that their overlay support is limited and not fully equivalent to more modern overlays. It's tough to tell for sure from that description. But even if so, the price discrimination aspect may still have stopped them from wanting to implement a more capable feature and expose it in the GeForce drivers.
Ah, this is a classic "RGB overlay", as pioneered by Matrox for the workstation market, which doesn't really mean much, and I assume is fully emulated on a modern chip. Nothing like a modern overlay pipe like you see in the CRTCs on mobile chips.
Interestingly, many old workstations that did 3d (primarily SGI), had what was called an overlay plane. This was essentially a separate frame buffer using indexed color into a palette. One of the colors would indicate transparency. The overlay was traditionally used for all UI or debug output on top of a 3d scene. For example, the Alias/Wavefront 3d modeler required this feature in order to run. It allowed the slower 3d hardware of that era to focus on a complex scene vs the scene AND the UI.
Great article. The brute force solution to this problem is switching to high refresh rate monitors, but if you want to beat Apple II latency by brute force without fixing the underlying issues in modern software then 120 Hz probably isn't enough. You need crazy high refresh rates like 240 or 360 Hz. I do enjoy high refresh rates, but they are not great for power consumption and limit the resolution you can support.
A high refresh monitor isn't going to do much if your synchronization primitives and scheduling capabilities aren't up to the task. A millisecond of latency in the compositor is still a millisecond no matter what output refresh rate you have.
Latency is almost always measured in frames: the renderer is one frame behind the compositor, which is one frame behind scanout.
If each step takes 4 msec (240 Hz) instead of 16.7 msec (60 Hz), and you're the same number of steps behind, the latency is reduced by (16.7 - 4) * nsteps; with three steps that's roughly 50 msec - 12 msec = 38 msec.
Now that's assuming your system can actually keep up and produce frames and composite that fast, which is where the "brute force" comes into play.
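To make the arithmetic concrete, here is a tiny back-of-the-envelope sketch in TypeScript; the three-step pipeline and the function name are just assumptions for illustration, not measurements of any real system:

    // Back-of-the-envelope: pipeline latency = steps behind * refresh period.
    // nsteps = 3 (render -> composite -> scanout) is an assumption.
    function pipelineLatencyMs(nsteps: number, hz: number): number {
      return nsteps * (1000 / hz);
    }

    console.log(pipelineLatencyMs(3, 60).toFixed(1));   // "50.0" ms at 60 Hz
    console.log(pipelineLatencyMs(3, 240).toFixed(1));  // "12.5" ms at 240 Hz
    // The ~37.5 ms difference is what brute-force refresh rate buys you,
    // assuming every stage really can keep up at the higher rate.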
Why is that delay coupled to the monitor that's plugged into the computer? Could we ask the computer to produce graphics at 240Hz and then take every 4th frame?
Then we have "average" delay in ms rather than frames.
(Asking as someone who doesn't know this area well)
You could do that in theory but it would be wasteful to render frames that are never displayed. If you can modify the software, better to modify it as the article advocates for, to strip latency without the waste.
Only if you know how long it will take to render a frame.
If you render "as fast as possible" (up to a max of xyz Hz), you get better worst-case performance than if you delay compositing: a render time of 2 physical frames results in only 1 missed frame, whereas with delayed compositing a render time of 1.26 physical frames can result in 2 missed frames.
Of course nothing is simple in the real world; the flip side is that rendering the extra frames creates waste heat, which slows down modern processors with insufficient cooling.
Indeed it doesn't. I have a 120Hz CRT on one of my old PCs where moving/resizing windows, etc. feels butter smooth, so I decided to buy a 165Hz monitor for my main PC to try and capture the same experience.
All I ended up capturing was something approximating the experience I had back when Windows allowed me to disable the compositor, which wasn't vsynced. It still doesn't feel as smooth, just smoother. Nowhere near the smoothness of my CRT though.
And the brute force solution is the better solution in this case. 240 or 360 Hz are not that "crazy" in the grand scheme of things considering 8K displays will be the norm soon, and the additional complexity in software and GPU hardware proposed in this article will be completely useless and wasted effort once we are there.
> [...] the worst that can happen is just jank, which people are used to, rather than visual artifacts.
This is an important point. There are many solutions out there (beyond compositing) that add incredible complexity along with other costs in order to solve a problem perceived by their proponents... not uncommonly, those same people pushing the solution are incapable of fairly judging the cost/benefit - sometimes the solutions are just not worth the cost.
There is currently an issue in Chromium compositing with checkerboarding (blanking whole regions of the screen with white) when scrolling, due to an optimization released to reduce jank:
You will notice this in recent releases of Chromium when you scroll any non-trivial page fast enough... or a simple page very fast by yanking the scroll bar... unless you have an extremely fast computer and graphics card. When reading through that thread you will notice how defensive they are; it's a complex and no doubt incredible bit of coding they have done - unfortunately the side effect is worse than the original problem, which most users don't even notice.
In the long run I think time will prove this problem and its solutions are purely transient... In the same way font anti-aliasing is becoming obsolete with DPIs higher than we can perceive, solving jank isn't necessary if you cannot perceive it (i.e. using frame rates > 60 Hz)... and the original problem really isn't as bad as all the people attempting to solve it think it is anyway.
> In the same way font anti-aliasing is becoming obsolete with DPIs higher than we can perceive
Screens might exceed the acuity of most users but not hyperacuity. Even on a 400ppi phone screen at typical viewing distances, it is possible to tell whether a slanted line is anti-aliased or not. Font anti-aliasing is not becoming obsolete any time soon.
> When pressing a key in, say, a text editor, the app would prepare and render a minimal damage region
Is this similar to the "dirty rectangles" technique? [1]
It seems difficult to implement some sort of scene system (for either an app, game or general GUI) that given a single key press can determine a minimum bounding box of changes on the screen no matter what the scene is, and is able to render just that bounding box.
If the key press occurs in a text area, potentially the whole text area could be "damaged", right? Edit: I was looking at this with the "Rendering" tab of Chrome dev tools, enabling "Paint flashing" and "Layout Shift Regions". It seems like the text area is its own layer and the space is partitioned pretty cleverly on things like paragraphs and lines, but from time to time the whole text area just flashes, which tells me the algorithm sometimes is not sure what is dirty and just repaints the whole thing, but not always.
> Is this similar to the "dirty rectangles" technique?
Yes.
> It seems difficult to implement some sort of scene system (for either an app, game or general GUI) that given a single key press can determine a minimum bounding box of changes on the screen no matter what the scene is, and is able to render just that bounding box.
I don't think it's that bad. In the days before hardware graphics, basically every 2D sprite engine used in every computer game you liked had to do this logic.
Text editors are a harder case in some ways because of line wrapping, but the editor does need to figure out where everything goes spatially in order to render, so extending that to tell which things have not moved is, I think, not that difficult.
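As a rough illustration of the bookkeeping, here is a minimal dirty-rectangle sketch in TypeScript for a line-based editor; all names are hypothetical, and a real editor would also have to handle wrapping, scrolling, and multi-line edits:

    // Minimal dirty-rectangle tracking for a line-based text editor (sketch).
    interface Rect { x: number; y: number; w: number; h: number; }

    class DamageTracker {
      private rects: Rect[] = [];

      // Record that a line's on-screen box changed (e.g. after a keypress).
      addLineDamage(lineIndex: number, lineHeightPx: number, viewWidthPx: number) {
        this.rects.push({ x: 0, y: lineIndex * lineHeightPx, w: viewWidthPx, h: lineHeightPx });
      }

      // Hand the renderer one bounding box and reset; good enough when edits are local.
      takeDamage(): Rect | null {
        if (this.rects.length === 0) return null;
        const minX = Math.min(...this.rects.map(r => r.x));
        const minY = Math.min(...this.rects.map(r => r.y));
        const maxX = Math.max(...this.rects.map(r => r.x + r.w));
        const maxY = Math.max(...this.rects.map(r => r.y + r.h));
        this.rects = [];
        return { x: minX, y: minY, w: maxX - minX, h: maxY - minY };
      }
    }

On a keypress that only touches one line, takeDamage() returns that single line's box; the renderer repaints just that region and can pass it along as the damage rect.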
There are some hardware fixes that are becoming available.
Variable refresh rate (VRR / G-Sync / FreeSync / Adaptive Sync) and high-refresh-rate hardware is now starting to gain traction on the high end, usually with both technologies combined. Apple's ProMotion is often touted as a "120Hz" feature, but it's also variable refresh, in that it can support much lower framerates. Without a static v-sync interval, you can display frames as they become available, for both perfect tear-free visuals and lower latency. It's quite likely we'll have that on Macs as well.
I recently got a 144Hz FreeSync (G-Sync compatible) "business" monitor, so this tech is starting to filter down from the gamer side. It works great with a compositor on Linux! Ultra smooth mouse and input response at 144Hz, and completely tear-free. I would highly recommend it for developers as well.
> It works great with a compositor on Linux! Ultra smooth mouse and input response at 144Hz, and completely tear-free. I would highly recommend it for developers as well.
I got a 165Hz monitor and while it is indeed smoother than a 60Hz monitor with a vsync'd compositor, it still isn't as smooth as a 60Hz monitor without a compositor.
I was surprised that the tile-based updates from VNC, Teradici, PCoIP, etc. weren't in the related work. In particular, a challenge with "send *smaller* updates" rather than whole frames is how you decide to deal with massive updates, where you'd end up causing more work / traffic than the whole-frame approach.
This post is focused on local compositing, but I think the same arguments apply as with the networked case: too many updates is actually worse than vsync, but the usual case of "just ship deltas" is amazing (for remote display you get the bandwidth and latency win, here you'd get latency/power/whatever).
I think a low-level "updates" API would make sense for some sophisticated applications, but I'm not convinced that this quote holds:
> I think this design can work effectively without much changing the compositor API, but if it really works as tiles under the hood, that opens up an intriguing possibility: the interface to applications might be redefined at a lower level to work with tiles.
Seems like if you can show Chrome and VLC both working fine, that'd be a great proof of concept!
I feel like this almost always comes back to a reinvention of I-P-B Frames.
I am curious whether there is a solid data structure that could be used to collect updates together, such that new updates could be coalesced into an existing collection of updates efficiently.
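One possible shape for such a structure, sketched in TypeScript with invented names: keep a small list of damage rects, merge a new rect into anything it overlaps, and fall back to one bounding box when the list gets too fragmented.

    interface Rect { x: number; y: number; w: number; h: number; }

    const overlaps = (a: Rect, b: Rect) =>
      a.x < b.x + b.w && b.x < a.x + a.w && a.y < b.y + b.h && b.y < a.y + a.h;

    const union = (a: Rect, b: Rect): Rect => {
      const x = Math.min(a.x, b.x);
      const y = Math.min(a.y, b.y);
      return {
        x, y,
        w: Math.max(a.x + a.w, b.x + b.w) - x,
        h: Math.max(a.y + a.h, b.y + b.h) - y,
      };
    };

    class UpdateList {
      constructor(private maxRects = 8, private rects: Rect[] = []) {}

      add(r: Rect) {
        // Merge with the first overlapping rect, otherwise append. A real
        // implementation would re-check, since the merged rect may now
        // overlap other entries.
        const i = this.rects.findIndex(existing => overlaps(existing, r));
        if (i >= 0) this.rects[i] = union(this.rects[i], r);
        else this.rects.push(r);
        // Too fragmented? Collapse everything into one bounding box.
        if (this.rects.length > this.maxRects) {
          this.rects = [this.rects.reduce(union)];
        }
      }

      drain(): Rect[] { const out = this.rects; this.rects = []; return out; }
    }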
Variable refresh rate technologies like FreeSync and G-Sync are used in some video games. Wouldn't this improve the latency of desktop compositing too? After all, the idea is that the GPU sends out a frame to the monitor once it's finished, not once a periodic interval is reached.
We had a discussion on this in #gpu on our Zulip. The short answer is that if the screen is updating at 60fps, then FreeSync and its cousins are effectively a no-op. If the display is mostly idle, then there is potentially latency benefit in withholding scanout until triggered by a keypress. Older revisions of FreeSync have a "minimum frame rate" which actually doesn't buy you much; 40Hz is common: https://www.amd.com/en/products/freesync-monitors
I would expect that VRR helps if you try to make your compositor only start compositing as late as possible while still finishing on time. With VRR, one could delay the next frame slightly if the deadline gets missed. This would work even when rendering at 60Hz.
Problem is that causes jitter which is extremely noticeable even in small amounts. You can't have smooth animations if the frame timing is constantly varying.
I'm of the opinion that a real-time desktop is already overdue. I can understand that in the olden days the perf benefit from dropping latency guarantees was worth it (classic latency vs. throughput optimization), especially on the servers where current operating systems draw their lineage (NT and UNIX). But these days, with beefy, even overpowered CPUs, and 25 years of OS/real-time research later, I would imagine that we would be in a place where the tradeoff would look different.
Of course, just slapping an RT kernel into a modern system alone would not do much good. The whole system needs to be designed with RT in mind, starting from the kernel and drivers, through toolkits/frameworks and services, up to the applications themselves. But ultimately the APIs for application developers should be nudging people to do the right things (whatever those would be).
You need to go down to the hardware, even; a realtime kernel running on mainstream CPUs and GPUs might not be enough (I wonder if a realtime system can even be built with current mainstream CPUs and GPUs):
The video systems in 8-bit home computers only worked because the entire system provided "hard realtime" guarantees. CPU instructions always took the same number of cycles and memory access happened at exactly defined clock cycles within an instruction. Interrupt timing was completely predictable down to the clock cycle, etc. Modern computers get most of their performance because they dumped those assumptions and as a result made timings entirely unpredictable, both inside the CPU (via caches, pipelining, branch prediction, etc.) and between system components (e.g. CPU and GPU don't run cycle-locked with each other).
We're probably already halfway there with modern apps, if you consider that any program with hardware-accelerated graphics is already performing rendering as a stateless operation in a separate thread pool running on an entirely different processor.
I think most modern OS kernels support realtime threads with a guarantee strong enough to support this kind of workflow. As Paul Davis points out in a different comment, the latency needs of audio and video are actually more similar than they are different, and pro audio is possible (though still harder than it should be).
But I don't want to underestimate the engineering challenges either. You'd have to make this stuff actually work.
I wouldn't hold my breath for this happening anytime soon on Linux. At the very least the GPU scheduler would probably need some significant enhancements, and it would need to be exposed to userspace, and work would need to be done to move all the existing drivers over to it. Windows 10 is far ahead of Linux in this regard.
Not at all, I thought it was a fascinating conversation. As you probably know, I worked on low-latency audio a bit as a "serious 20%" project while on Android, so I'm very interested in audio as well, and I think the two communities have lots to learn from each other.
I don't get this obsession with compositors. I just switch them off. They have no benefit for me. If I'm so inclined, I can move windows with Full-HD video running in them across several screens like mad, without any tearing. When I'm in a terminal window of some sort I don't care for the real transparency, because it is unergonomic. When I'm in some other application the same applies.
The app thumbnails in macOS’s Mission Control [1] and GNOME’s Activities overview [2] provide a nice way of managing graphical windows and workspaces which requires a compositor. Animated window management in general is a UI affordance that I value. I’m not sure if it’s a “real benefit,” but consistently getting this right is one thing that attracts users to macOS and iOS.
I know them. Maybe I'm just an old fart and "set in my ways", but it's not even "nice to have" for me when it compromises the general haptics of the environment, as in dragging it down with sluggishness.
The first bit is that practically all that is argued for here is in Android - and has been for a long time (~2013, last I worked on the topic in that context) - and it is still not enough - they will go (and are going) to higher refresh rates.
Even at the time there were crazy things like tuning the CPU governor to wake up when the touch screen detects incoming 'presence' (before you can get an accurate measurement of where and how hard the finger hit) - it's not like you are going to change your muscle intention mid-flight. Input events were specifically tagged so that the memory would be re-used rather than GCed and reallocated.
On a good frame you could go from motion to photon in 30ms. Then something happens and the next takes 110ms. That something is often garbage collectors and other runtime systems that think it is safe to do things, as there is no shared systemic synchronization mechanism around for these things.
The second is that this is -ultimately- a systemic issue, and the judge, jury and executioner is the user's own perception. That's what you are optimising for. Treating it as a graphics-only thing is not putting the finger on something, it is picking your nose. The input needs to be in on it, audio needs to be in on it, and the producer needs to be cooperative. Incidentally, Guitar Hero-style games and anything VR are good qualitative evaluation targets, but with brinelling-like incremental loads.
(Larger scale)
1. communicate deadlines to client so the renderer can judge if it is even worth doing anything.
2. communicate presentation time to client so animations and video frames arrive at the right time.
3. have a compositor scheduler that will optimize for throughput, latency, minimizing jitter or energy consumption (see the sketch below).
4. inform the scheduler about the task that is in focus, and bias client unlock and event routing to distinguish between input focus and the thundering herd.
5. type annotate client resources so the scheduler knows what they are for.
6. coalesce / resample input and resize events.
7. align client audio buffers to video buffers.
8. clients with state store/restore capabilities can rollback, inject and fastforward (http://filthypants.blogspot.com/2016/03/using-rollback-to-hi...)
There is a bunch of more eccentric big stuff after that, as well as the well-known quality-of-life improvements (beyond the obvious gamedev lessons like don't just malloc in your renderloop, and minimize the amount of syscall jitter in the entire pipeline), but baby steps.
Oh, and all of the above++ is already in Arcan and yet it is still nowhere near good enough.
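To make points 1 and 2 from the list concrete, here is a hypothetical client-side sketch in TypeScript; none of these names come from a real API (Arcan or otherwise), it is just the rough shape of the contract:

    // The compositor tells the client when its frame will be shown and the
    // latest time a submitted buffer still makes it. Names are invented.
    interface FrameCallback {
      targetPresentTimeUs: number;  // when this frame reaches the screen (point 2)
      renderDeadlineUs: number;     // latest submit time that still makes it (point 1)
    }

    function onFrame(cb: FrameCallback, now: () => number,
                     render: (presentTimeUs: number) => void) {
      const estimatedRenderTimeUs = 2000;  // the client's own running estimate
      if (now() + estimatedRenderTimeUs > cb.renderDeadlineUs) {
        return;  // not worth starting; the compositor keeps showing the previous frame
      }
      // Animate toward the time the user will actually see the frame, not "now".
      render(cb.targetPresentTimeUs);
    }

The scheduler side (point 3) would live in the compositor and decide how aggressively to hand out those deadlines.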
Android scroll still feels surprisingly laggy to me in comparison to even something like GPE or OPIE, which used GTK+ and Qt mobile respectively, and had fully CPU-based animation.
It tells you just how much more garbage has been thrown into the Linux stack since 2007 that even effectively kernel-based, hardware-accelerated animation stutters today.
Android may have all the parts to do it, but widget instantiation takes way too long, so the app can never hit the frame deadline if it has to show a new widget.
The beam-racing compositor is a cool idea. It seems to me, however, that it could create visible jitter for applications that use traditional vsync-driven frame timing --- e.g. an animation of vertical motion would see a one-frame latency change when the damage rectangle crosses some vertical threshold.
If you're animating (or doing smooth window resizing), then your "present" request includes a target frame sequence number. The way I analyze the text editor case is that it's something of a special case of triple-buffering, but instead of spamming frames as fast as possible at the swapchain, you do multiple revisions for a single target sequence number. The last revision to make it ahead of the beam wins. If you have a smooth clock animation in addition to some latency-critical input sensitive elements, then the clock hands would be in the same position for each revision, only the other elements would change.
If done carefully, this would give you smoother animations at considerably lower power consumption than traditional triple buffering. It's not the way present APIs are designed today, though, one of my many sadnesses.
As I hinted, this is a good way to synchronize window resize as well. The window manager tells the application, "as of frame number 98765, your new size is 2345x1234." It also sends the window frame and shadows etc with the same frame number. These are treated as a transaction; both have to arrive before the deadline for the transaction to go through. This is not rocket science, but requires careful attention to detail.
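A hypothetical TypeScript sketch of that resize transaction (all names invented, not from any real compositor):

    // The WM announces: "as of frame 98765 your size is 2345x1234". Both the
    // client contents and the WM's frame/shadow decoration must land with that
    // sequence number before the deadline, or the transaction is deferred.
    interface Submission { frameSeq: number; buffer: unknown; }

    class ResizeTransaction {
      private fromWm?: Submission;
      private fromClient?: Submission;

      constructor(readonly frameSeq: number,
                  readonly newSize: { w: number; h: number }) {}

      submitWm(s: Submission)     { if (s.frameSeq === this.frameSeq) this.fromWm = s; }
      submitClient(s: Submission) { if (s.frameSeq === this.frameSeq) this.fromClient = s; }

      // The compositor calls this at the deadline for `frameSeq`; if either
      // half is missing, it keeps showing the old size and retries next frame.
      ready(): boolean {
        return this.fromWm !== undefined && this.fromClient !== undefined;
      }
    }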
A year or few back, I did camera-passthru XR in VR HMDs, on a DIY stack with electron.js (which is chromium) as compositor. So a couple of thoughts...
The full-screen electron window apparently displaced the window system compositor. So "compositing" specialized compositors can be useful.
Chromium special cased video, making it utterly trivial to do comfortable camera passthru, while a year+ later, mainstream stacks were still considering that hard to impossible. So having compositor architecture match the app is valuable.
That special case would collapse if the CPU needed to touch the camera frames for analysis. So carefully managing data movement between GPU and CPU is important. That's one design objective of google's mediapipe for instance. Perhaps future compositors should permit similar pipeline specifications?
My XR focus was software-dev "office" work, not games. So no game-dev "oh, the horror! an immersion-shattering visual artifact occurred!" - people don't think "Excel, it's ghastly... it leaves me aware I'm sitting in an office!". Similarly, with user balance based on nice video passthru, the rest of the rendering could be slow and jittery. Game-dev common wisdom was "VR means 90 fps, no jitters, no artifacts, or user-sick fail - we're GPU-tech limited", and I was "meh, 30/20/10 fps, whatever" on laptop integrated graphics. So... Games are hard - don't assume their constraints are yours without analysis. And different aspects of the rendered environment can have very different constraints.
My fuzzy recollection was chromium could be persuaded to do 120 Hz, though I didn't have a monitor to try that. Or higher? - I fuzzily recall some variable to uncap the frame rate.
I've used linux evdev (input pipeline step between kernel and libinput) directly from an electron renderer process. Latency wasn't the motivation, so I didn't measure it. But that might save an extra ms or few. At 120 Hz, that might mean whatever ms HID to OS, 0-8 ms wait for the next frame, 8 ms processing and render, plus whatever ms to light. On electron.js.
New interface tech like speech and gesture recognition may start guessing at what it's hearing/seeing many ms before it provides its final best guess at what happened. Here, low-latency responsiveness is perhaps more about app system architecture than pipeline tuning, with app state and UI supporting iterative speculative execution.
Eye tracking changes things. "Don't bother rendering anything for the next 50 ms, the user is saccading and thus blind." "After this saccade, the eye will be pointing at xy, so only that region will need a full resolution render" (foveated rendering). Patents... but eventually.
As someone who uses a tiling window manager it has been a long time since I have actually used a compositor. I have to ask if the cool window effects are actually worth the bother? I just can't see anyone actually caring about stuff like that in 20 years.
Compositors are not just about visual effects. They are also a pretty good way to leverage GPUs.
If you don't care that much about latency, there's a clear benefit in never having windows getting "damaged" just because other stuff has temporarily occluded them. You will never get "ghost" images because one app has locked up or is slow. No flurry of "WM_PAINT", "WM_NCPAINT" or whatever equivalent just because you are moving windows. No matter how fast your system is, any 'movement' is much more fluid (as position changes are almost free).
Essentially, having windows completely isolated from one another - except when they are "composited" back - simplifies a lot. There's a lot of cruft and corner cases that you don't have to care about.
There are features that can enhance usability that are trivial to do with compositors but difficult without - for instance, showing a window "preview" when you are switching apps or when you mouse over the taskbar. While you can do this without a compositor, with one you get it very easily and almost for 'free'.
And, as you point out, you can post-process and apply whatever effects you want. But this is a minor detail and was used initially as a selling point.
Of course, for a tiling window manager none of this matters as they never overlap. But most users don't use tiling window managers.
> They are also a pretty good way to leverage GPUs.
Compositors don't leverage the GPU much beyond drawing a bunch of textured quads for the final image, and those textures are assembled on the CPU (some toolkits have GPU backends, but even then things go back and forth to the CPU because not everything is done through the GPU).
It would take a big rewrite of the GUI stack to actually take advantage of GPUs, with whatever service/server handles it maintaining a full scene tree whose desktop contents are resident (and, whenever possible, manipulated) in GPU memory.
As things are nowadays, however, at best you get a mix of the two (and that itself isn't without its drawbacks too).
> If you don't care that much about latency, there's a clear benefit in never having windows getting "damaged" just because other stuff has temporarily occluded them.
Isn't that why X11 provided an (optional) backing store?
I have set up a keyboard shortcut to toggle my compositor on and off and tbh I find it tends to remain off most of the time, since I don't care for 'cool window effects' either. Most of the especially visual applications I use - eg. video players, games - tend to handle themselves in this respect, taking their steps for vsync as necessary. On the rare occasion that I'm offended by a bit of tearing I'll just activate the shortcut and then no problem. I just worry that with the badmouthing of X11 that goes around, one day this control - the ability to run without a compositor, if I want to - might be taken away from me.
I predict there will be a systemd-style debate about removing X11 by 2030 and it will be completely gone by 2040 (since someone mentioned 20 years). There won't be a single non-composited OS available at all and people born after 2000 won't really remember anything else.
There are still distributions that do not use systemd though, e.g. two that come to my mind are Slackware if you are more conservative or Void if you prefer a rolling distro.
And TBH, IMO the choice of window system affects the day-to-day operation of a desktop-oriented distribution far more than the init system ever did, so I expect slightly more friction about removing Xorg (as a sidenote, the server is Xorg, X11 is the protocol, and another server - most likely a fork of Xorg - can continue providing it if Xorg shuts down).
On Android, you have Picture-In-Picture video support now. How do you imagine it working without a compositor? Do you think nobody cares about that feature?
Picture in picture videos used to be done with chromakey. Playing a video in one window on top of another window isn't some new thing compositing provides. You could do it in Mac classic or Windows 3.1 or BeOS.
Nobody might care about the 3D cube, but stuff like shadows under windows has a clear usability benefit, aka knowing which window is on top of which. Of course in a tiling WM you don't care about that, but the vast majority of people don't want to use tiling WMs.
Why would I need to know the Z-order of overlapping windows? None of the windows I'm actively working with overlap (it would be too annoying to have them constantly blocking each other), and all the rest just have "background" status. I don't care how far they are into the background, so drop shadow is useless visual clutter.
You are you. Other people work differently. If you want to hear how a power user uses overlapping windows I’d recommend listening to this ATP episode: https://atp.fm/96
Shadows don't need a compositor though, do they? At the least, affordances don't need a compositor. So, make the "active" border less soft and this is easily doable. (I feel like this is how things used to be?)
How many windows do you tend to have open? I find it quite difficult to use a tiling window manager when you have more windows open than you can display on screen. I recently switched to PaperWM and I find the animated transitions between workspaces, and the ability to scroll through windows tiled across a rectangle that exceeds the dimensions of my physical screen, to be a big improvement compared to my previous setup with sway.
I’d really like to get 60 Hz window resizing as well, as in iPadOS’s split screen, but even compositing tiling window managers like PaperWM and sway struggle with this at the moment.
Hate to go down this road, but JavaScript is anything but slow. What is slow is the layer upon layer of complexity + poor coding of many web apps. That's not an exclusive problem to JavaScript.
There is some amount of overhead in the VM due to the dynamic nature of JavaScript values and the need for a GC that can't be avoided (even though VM implementations have had years of effort and optimization put in to address this). You can get around a lot of this by using WASM and having control over the data layout, though. In my experiments, sometimes the browser just kicks in and drops some frames anyway. :/ Things just always feel a lot smoother in native apps. Figma, for example, is doing a pretty good job with a C++ engine running in WASM and using WebGL for their canvas (https://www.figma.com/blog/webassembly-cut-figmas-load-time-...).
It can be avoided to a degree if you understand how JavaScript memory allocation works for a particular VM, and use custom methods instead of built-ins for certain operations, to avoid the overhead of the generalized nature of the built-in.
You might reduce it for sure, but you aren't going to avoid it entirely. The best case is when it gets to a tracing JIT and all of the calls have a similar codepath. And the fact that you need to jump through hoops like this, instead of just writing normal code and expecting it to do only what you're saying rather than a bunch of other stuff too, is part of the problem. Like, what you are saying is exactly the point: the default way of working introduces performance issues and makes it hard to reason about performance in a way that is predictable, requiring you to understand VM internals and just 'hope' that the runtime works it out. So you are essentially saying that the default path is slower. It's also hard to ensure that your objects have a predictable layout in memory (important for cache behavior) other than by just using a `TypedArray` and re-inventing structs within that.
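For what it's worth, the struct-in-a-TypedArray trick mentioned above might look something like this (TypeScript sketch only; the particle example and field layout are made up):

    // "Structs" packed into a typed array: one flat allocation, predictable
    // layout, no per-element objects for the GC to trace.
    const FIELDS = 2;                 // [x, vx] per particle (illustrative)
    const N = 10_000;
    const particles = new Float64Array(N * FIELDS);

    function step(dt: number): void {
      for (let i = 0; i < N; i++) {
        const base = i * FIELDS;
        particles[base] += particles[base + 1] * dt;  // x += vx * dt
      }
    }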
The semantics of JS itself make certain optimizations hard. Even a simple VM can do arithmetic faster than no-JIT Node.js and JSC: https://godbolt.org/z/5GbhnK (an example I've been working on lately) (I'm interested in no-JIT perf because that's what you're allowed to do in your own runtime on iOS). LuaJIT is the best at this, and even there Lua's semantics prevent certain optimizations--https://news.ycombinator.com/item?id=11327201
For sure, I think you can mitigate it, but I would take that as being different from avoidance (simply because there are other languages / runtimes where there is clear actual avoidance). Regardless I think we have enough details in the conversation at this point to understand what each other is saying. :)
> Hate to go down this road, but JavaScript is anything but slow. What is slow is the layer upon layer of complexity + poor coding of many web apps. That's not an exclusive problem to JavaScript.
Where are those fast JavaScript apps? Where? Compare the super snappiness of Telegram Desktop with the molasses of $every_other_chat_client_using_electron, for instance.
And if you are going to say that VSCode is fast, you have no idea what truly fast software feels like - if you come across one, try a QNX 6 demo ISO and be amazed at how fast your computer responds to literally everything.
What makes you say that they aren't also due to the kind of programming style that JS encourages ?
Also, my anecdotal experience writing some audio algorithms in pure JS is that it's... not quite an order of magnitude slower than the same code written in C++ with std::vector<double>s, but not far off either. Likely there's some magic that could be done to improve it, but the C++ code is written naïvely and still ends up extremely well optimized.
The poor coding is one thing, but it's JavaScript's fault that layers of complexity inherently result in performance losses beyond those caused by leaky abstractions.
JavaScript is just a programming language with the same constructs as any other. And it's much faster than most languages, especially dynamically typed languages.
It sounds like you don't understand how a browser works. Performance issues are typically due to the browser runtime environment which includes a slow and complex DOM.
JavaScript will mop up most CLI apps in performance, including python and ruby, and will get close in perf to C if written carefully.
> [...] the browser runtime environment which includes a slow and complex DOM
Counter to this narrative, the DOM is very fast, and compared to many other platforms it's much faster at displaying a large number of views and then resizing and reflowing them.
ex. Android warns when there's more than 80 Views, meanwhile Gmail is 3500 DOM elements and the new "fast" Facebook is ~6000 DOM elements. Neither of those apps spends the majority of their time in the browser code on load, it's almost entirely JS. Facebook spends 3 seconds running script on page load on my $3k Macbook Pro. That's not because the DOM is slow, that's because Facebook runs a lot of JS on startup.
> JavaScript will mop up most CLI apps in performance
I'll bet half the coreutils are a minimum of 10x slower in their day-to-day use cases (i.e. small n) when re-implemented in any interpreted language like JS. Cold start overhead matters a lot for CLI programs.
> JavaScript will mop up most CLI apps in performance, including python and ruby, and will get close in perf to C if written carefully.
asm.js is what you mean?
If so, you are no better off than handwriting assembly yourself.
I am kind of an involuntary webdev from time to time, and I am often the one invited to fix the consequences of coding under the influence, like a decision to write a GPS server with a 100k+ packets-per-second load in JavaScript.
People claiming that JS can be made as fast as C have no idea at all what they are saying. Even Java with a hand-jacked VM is easier to scale for such a load than a JS server reduced to 1k lines of vanilla JS, using each and every performance trick Node.js provides.
No I don't mean asm.js. You can write vanilla JS that avoids memory allocation beyond initialization as well as garbage collection, and runs certain algorithms faster than typical C due to differences such as JIT optimizations or the lack of a sentinel to terminate strings, etc.
My point was never to say that JS is in general as fast as C, only that a careful dev that is familiar with the VM's operation can get close for some tasks.
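A minimal sketch of that "allocate once, reuse forever" style, in TypeScript with illustrative names (not claiming this is how any particular codebase does it):

    // Allocate everything up front; the per-call path only mutates existing
    // storage, so steady-state GC work is roughly zero.
    const result = { sum: 0, peak: 0 };  // reused output object

    function analyze(samples: Float32Array) {
      let sum = 0;
      let peak = 0;
      for (let i = 0; i < samples.length; i++) {
        const s = samples[i];
        sum += s;
        if (s > peak) peak = s;
      }
      result.sum = sum;    // write into the preallocated object
      result.peak = peak;  // instead of returning a fresh one each call
      return result;
    }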
Giants like Airbnb have the resources to write fast and responsive apps, but their app is still slow. There is no point in worrying about 16ms latency if the web app refreshes 6 times just to load a couple of images.
"As the experience of realtime audio illuminates, it’s hard enough scheduling CPU tasks to reliably complete before timing deadlines, and it seems like GPUs are even harder to schedule with timing guarantees."
Well, I suppose that's way more feasible in RISC CPUs.
More modern/better ARM cores and common environments (starting at approximately Cortex-M3 microcontrollers, where flash instruction fetching depends on page alignment) are also pretty unpredictable. In fact, most advanced pipeline designs, no matter the ISA, have very hard-to-predict instruction timings. Couple this with dynamic frequency scaling and the instruction timings are irrelevant.