Something I've been thinking about that maybe the HN hivemind can help with; I know just enough about each of io_uring and GPUs to be dangerous.
The round trip for GPU command buffer submission is huge by my estimate, around 100µs. On a 10 TFLOPS card, which is nowhere near top of the line, that's 1 billion floating-point operations of potential work lost per round trip. I don't know exactly where all the time goes, but I suspect it's a bunch of process and kernel transitions between the application, the userland driver, and the kernel driver.
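Back-of-envelope for that number (assuming the worst case, where the GPU sits completely idle for the whole round trip):

    10 TFLOPS   = 1e13 FLOP/s
    100 µs      = 1e-4 s
    1e13 * 1e-4 = 1e9 FLOPs of capacity idle per submission round trip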
My understanding is that games mostly work around this by batching up a lot of work (many dozens of draw calls, for example) into one submission. But it's still a problem when CPU readback is part of the workload.
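For concreteness, here's a rough sketch of what I mean by batching (Vulkan, untested, the function name is mine): record lots of work into command buffers up front, then pay the submission cost once for the whole batch.

    #include <vulkan/vulkan.h>

    /* Hand many pre-recorded command buffers to the kernel driver in a single
       vkQueueSubmit, so the ~100µs submission round trip is amortized over
       dozens of draw calls. queue, cmd_bufs, and fence are assumed to have
       been created elsewhere. */
    void submit_batched(VkQueue queue, uint32_t count,
                        const VkCommandBuffer *cmd_bufs, VkFence fence)
    {
        VkSubmitInfo info = {
            .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
            .commandBufferCount = count,
            .pCommandBuffers = cmd_bufs,
        };
        /* One call, one set of user/kernel transitions for the whole batch. */
        vkQueueSubmit(queue, 1, &info, fence);
    }

But as soon as the CPU needs a result back, you end up waiting on the fence and eating the round trip anyway.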
So my question is: can a technique like io_uring be used here to keep the GPU pipeline full and only take the expensive transitions when absolutely needed? I suspect the programming model would be different and in some cases harder, but that's already part of the territory with GPUs.
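To make the analogy concrete, this is the io_uring shape I have in mind (a minimal liburing sketch; "some_file" is just a placeholder): queue a bunch of operations into a shared submission ring, pay for one kernel transition (or none, with SQPOLL) to kick them off, then read completions back out of a shared completion ring.

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    #define N      8
    #define BUF_SZ 4096

    int main(void)
    {
        struct io_uring ring;
        static char bufs[N][BUF_SZ];

        if (io_uring_queue_init(32, &ring, 0) < 0)
            return 1;

        int fd = open("some_file", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Queue N reads. No syscalls happen here; the SQEs just land in
           the shared submission ring. */
        for (int i = 0; i < N; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BUF_SZ, (__u64)i * BUF_SZ);
            io_uring_sqe_set_data(sqe, (void *)(long)i);
        }

        /* One transition into the kernel covers all N operations.
           With IORING_SETUP_SQPOLL even this call can be skipped. */
        io_uring_submit(&ring);

        /* Completions come back through the shared completion ring. */
        for (int i = 0; i < N; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            printf("read %ld: %d bytes\n",
                   (long)io_uring_cqe_get_data(cqe), cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
    }

Something with that shape for command buffer submission and readback (userland appends to a ring, the kernel or hardware drains it, and you only block when you truly have to) is roughly what I'm asking about.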