So one thing to note is that hardware overlays are really uncommon on desktop devices. NVIDIA only has a single YUV overlay and a small cursor overlay (it used to be 64x64; now it might be 256x256?). And even then, the YUV overlay might as well not exist -- it has a few restrictions that mean most video players can't or won't use it. After all, if you're already driving a giant power-hungry beast like an NVIDIA card, there's no logical reason to save power by moving a PPU back into the CRTC. So hardware overlays won't help the desktop case.
I still think we'd get better compositor performance by decoupling "start of frame" / "end of frame" from the drawing API. The big thing we lack on the app side is timing information -- an application doesn't know its budget for how long a frame should take or when it should submit, because the graphics APIs only expose vsync boundaries. If the app could take ~15ms to build a frame, submit it to the compositor, and the compositor took the remaining ~1ms to do the composite (though likely much less -- these are just easy numbers), the frame could be made to display in the current vsync cycle. We just don't have accurate timing feedback for this today.
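For the curious, here's roughly the feedback loop I'm imagining, as a C++ sketch. compositor_next_deadline() and compositor_submit() are made-up names -- that deadline callback is precisely the thing no current API gives you:

    // Sketch of a deadline-driven render loop. compositor_next_deadline()
    // and compositor_submit() are hypothetical -- that deadline is exactly
    // the feedback today's graphics APIs don't give an application.
    #include <chrono>
    #include <thread>

    using Clock = std::chrono::steady_clock;
    using std::chrono::milliseconds;

    Clock::time_point compositor_next_deadline(); // hypothetical: latest
                                                  // submit time that still
                                                  // makes this vsync
    void compositor_submit();                     // hypothetical
    void poll_input();                            // app-side stand-in
    void render_frame();                          // app-side stand-in

    void render_loop() {
        milliseconds render_cost(15); // measured cost of our own rendering

        for (;;) {
            Clock::time_point deadline = compositor_next_deadline();

            // Start as late as we can afford -- deadline minus our cost
            // minus a safety margin -- so the input we sample is fresh.
            std::this_thread::sleep_until(deadline - render_cost
                                          - milliseconds(1));

            Clock::time_point t0 = Clock::now();
            poll_input();
            render_frame();
            compositor_submit(); // compositor spends its ~1ms compositing

            // Adapt next frame's budget to what we actually took.
            milliseconds taken = std::chrono::duration_cast<milliseconds>(
                Clock::now() - t0);
            render_cost = (render_cost * 7 + taken) / 8;
        }
    }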
One of my favorite gamedev tricks was used on Donkey Kong Country Returns. There, the developers polled input far above the refresh rate, rendered Donkey Kong's 3D model at the start of the frame into an offscreen buffer, and then, while the frame was being rendered, kept processing input and running physics. Only at the end of the frame did they composite Donkey Kong at the position the updated physics produced. So they in fact cut the latency to be sub-frame through clever trickery, at the expense of small inaccuracies in animation. Imagine if a window got "super late composite" privileges, where it could submit its image just in the nick of time.
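A C++ sketch of that ordering, as I understand it (all the names are stand-ins, not the actual engine's API):

    // All the names here are stand-ins; only the ordering matters.
    struct InputState   { float stick_x, stick_y; };
    struct PhysicsState { float dk_x, dk_y; };
    struct Texture      { unsigned id; };

    InputState   sample_input();
    Texture      render_character_offscreen(InputState stale);
    bool         gpu_render_finished();
    PhysicsState step_physics(PhysicsState prev, InputState in);
    void         composite_at(Texture t, float x, float y);
    void         present();

    void frame(PhysicsState &phys) {
        // 1. Kick off the expensive character render immediately, into an
        //    offscreen buffer, using slightly stale input.
        Texture dk = render_character_offscreen(sample_input());

        // 2. While the GPU chews on that, keep polling input well above
        //    the refresh rate and run physics on the freshest samples.
        while (!gpu_render_finished())
            phys = step_physics(phys, sample_input());

        // 3. Only at the end of the frame, composite the pre-rendered
        //    character at the position the *updated* physics produced.
        //    Latency becomes sub-frame; the animation pose is the only
        //    thing that's a frame stale.
        composite_at(dk, phys.dk_x, phys.dk_y);
        present();
    }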
(Also, I should probably mention that my name is "Jasper St. Pierre". There are a few tiny inaccuracies in the history -- old-school Win95/X11 still provides process separation, since the display server bounds each window's drawing to that window's clip list, and Windows 2000 also had a limited compositor, known as "layered windows", where certain windows could be redirected offscreen [0] -- but these aren't central to your thesis)

[0] https://docs.microsoft.com/en-us/windows/win32/winmsg/window...
Apologies for getting your name wrong; it should be fixed now. I also made some other changes which I hope address the points you brought up. I didn't know about Windows 2000 layered windows, but I think the section on which systems had process separation was just confusingly worded.
Counting Intel, most desktop devices (by number) have pretty sophisticated hardware layer capabilities. Skylake has 3 display pipes, each of which has 3 display planes and a cursor plane. The multiple pipes are probably mostly used for multi-monitor configurations, but it's still a decent setup. From the hints I've picked up, I believe DirectFlip was largely engineered for this hardware.
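You can poke at this plane topology yourself on Linux through KMS. A quick libdrm sketch (error handling elided; /dev/dri/card0 is an assumption, and you build with -ldrm):

    // Enumerate the KMS display planes with libdrm. On a Skylake part this
    // should list the primary, overlay ("sprite"), and cursor planes that
    // each display pipe exposes.
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    int main() {
        int fd = open("/dev/dri/card0", O_RDWR);

        // Without this cap the kernel hides primary and cursor planes and
        // only reports the "extra" overlay planes.
        drmSetClientCap(fd, DRM_CLIENT_CAP_UNIVERSAL_PLANES, 1);

        drmModePlaneRes *res = drmModeGetPlaneResources(fd);
        for (uint32_t i = 0; i < res->count_planes; i++) {
            drmModePlane *p = drmModeGetPlane(fd, res->planes[i]);
            printf("plane %u: CRTC mask 0x%x, %u formats\n",
                   p->plane_id, p->possible_crtcs, p->count_formats);
            drmModeFreePlane(p);
        }
        drmModeFreePlaneResources(res);
        close(fd);
        return 0;
    }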
There are lots of interesting latency-reducing tricks. Some of those might be good inspiration, but what I'm advocating is a general interface that lets applications reliably get good performance.
Technically I think it goes:
1. DirectFlip (Windows 8)
2. Independent Flip
3. Multi-plane overlays (Windows 10)
DirectFlip can be supported on a lot of hardware, but is limited to borderless fullscreen unless you have overlays. https://youtu.be/E3wTajGZOsA?t=1531 has an explanation.
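If you want to check whether your output advertises the overlay path, DXGI exposes a query for it. A minimal sketch using IDXGIOutput2::SupportsOverlays() from dxgi1_3.h (error handling mostly elided; this only reports the capability, it doesn't control promotion):

    // Check hardware overlay support per adapter/output via DXGI 1.3.
    #include <dxgi1_3.h>
    #include <stdio.h>
    #pragma comment(lib, "dxgi.lib")

    int main() {
        IDXGIFactory1 *factory = nullptr;
        if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
            return 1;

        IDXGIAdapter1 *adapter = nullptr;
        for (UINT a = 0; factory->EnumAdapters1(a, &adapter) == S_OK; a++) {
            IDXGIOutput *output = nullptr;
            for (UINT o = 0; adapter->EnumOutputs(o, &output) == S_OK; o++) {
                IDXGIOutput2 *output2 = nullptr;
                if (SUCCEEDED(output->QueryInterface(IID_PPV_ARGS(&output2)))) {
                    printf("adapter %u, output %u: overlays %s\n", a, o,
                           output2->SupportsOverlays() ? "yes" : "no");
                    output2->Release();
                }
                output->Release();
            }
            adapter->Release();
        }
        factory->Release();
        return 0;
    }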
Intel GPUs support hardware overlays. Chrome grants one to the topmost eligible canvas element.
https://developers.google.com/web/updates/2019/05/desynchron... has half the story for getting extremely low latency inking on Chromebooks with Intel GPUs. The other half is ensuring the canvas is eligible for hardware overlay promotion, which the ChromeOS compositor will do.
Nvidia's chips support hardware overlays. Historically CAD software used them, so Nvidia soft-locks the feature in the GeForce drivers to force CAD users to buy Quadro cards for twice the price instead. Their price discrimination means we can't have nice things.
Huh, I've never heard of this. Traditionally, hardware overlays are consumed by the system compositor. Do they have an OpenGL extension to expose it to the app?
I guess it's possible that their overlay support is limited and not fully equivalent to more modern overlays. It's tough to tell for sure from that description. But even if so, the price discrimination aspect may still have stopped them from wanting to implement a more capable feature and expose it in the GeForce drivers.
Ah, this is a classic "RGB overlay", as pioneered by Matrox for the workstation market. That doesn't really mean much, and I assume it's fully emulated on a modern chip -- nothing like the modern overlay pipes you see in the CRTCs on mobile chips.
Interestingly, many old workstations that did 3d (primarily SGI), had what was called an overlay plane. This was essentially a separate frame buffer using indexed color into a palette. One of the colors would indicate transparency. The overlay was traditionally used for all UI or debug output on top of a 3d scene. For example, the Alias/Wavefront 3d modeler required this feature in order to run. It allowed the slower 3d hardware of that era to focus on a complex scene vs the scene AND the UI.
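To make the mechanism concrete, here's a toy model of that blend in C++. The real hardware did this per pixel at scanout, which is why redrawing the UI never disturbed the 3D framebuffer underneath (the reserved transparency index here is made up; the real value varied):

    // Toy model of an SGI-style indexed overlay plane: an 8-bit index
    // buffer plus a palette, with one reserved index meaning
    // "transparent, show the 3D scene underneath".
    #include <stddef.h>
    #include <stdint.h>

    enum { TRANSPARENT_INDEX = 0 };

    void scanout_blend(const uint8_t *overlay,      // indexed UI plane
                       const uint32_t palette[256], // index -> RGBX color
                       const uint32_t *scene,       // 3D framebuffer
                       uint32_t *out, size_t npixels) {
        for (size_t i = 0; i < npixels; i++) {
            uint8_t idx = overlay[i];
            out[i] = (idx == TRANSPARENT_INDEX) ? scene[i] : palette[idx];
        }
    }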