
> That said, I think there have been efforts to use io_uring on Linux. I'm not sure how that would work with the process per connection model. Haven't been following it...

There are some minor details that are easier with threads in that context, but on the whole it doesn't make much of a difference.




I don't understand how it works with thread per connection either. io_uring is designed for systems with a thread and a ring per core: you hand it a bunch of IO to do at once (batches and chains), and your threads do other work in the meantime. The syscall cost is amortized, or even (through IORING_SETUP_SQPOLL) eliminated. If your code is instead designed to be synchronous, and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.

Possibly they'd have a ring per connection and just get an advantage when there's parallel IO going on for a single query? Or these per-connection processes wouldn't directly do IO but would send it via IPC to some IO-handling thread/process? I'm not sure either of those models is actually an improvement over the status quo, but who knows.


> io_uring is designed for systems that have a thread and ring per core

That's not needed to benefit from io_uring.

> for you to give it a bunch of IO to do at once (batches and chains), and your threads to do other work in the meantime.

You can see substantial gains even if you just submit multiple IOs at once, and then block waiting for any of them to complete. The cost of blocking on IO is amortized to some degree over multiple IOs. Of course it's even better to not block at all...

> If your code is instead designed to be synchronous and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.

We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with, among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...


Oh, I hadn't realized until now I was talking with someone actually doing this work. Thanks for popping into this discussion!

> > io_uring is designed for systems that have a thread and ring per core

> That's not needed to benefit from io_uring

90% sure I read Axboe saying that's what he designed io_uring for. If it helps in other scenarios, though, great.

> Of course it's even better to not block at all...

Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the transition itself and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here, which Tom Lane has already said would be a disaster.

> We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...

Nice.

Is the benefit you're getting simply from adding IO parallelism where there was none, or is there also a CPU reduction?

Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true: a smallish ring, for the smallish number of IOs in flight per query, might be enough.

Speaking of large number of incoming connections, would/could the process->thread change be a step toward having a thread per active query rather than per (potentially idle) connection? To me it seems like it could be: all the idle ones could just be watched over by one thread and queries dispatched. That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...

[1] or even if they only needed one layer of pooler instead of two, as I read some people have!


> > Of course it's even better to not block at all...

> Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the delta and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here, which Tom Lane has already said would be a disaster.

Depends on how you define it. In a lot of scenarios you can avoid blocking by scheduling IO in a smart way - and I think we can get quite far toward that for a lot of workloads, with substantial wins. But that obviously cannot alone guarantee that you never block.

I think we can get quite far avoiding blocking, but I don't think we're going to move to a completely asynchronous model in the foreseeable future. It seems more feasible to incrementally make common blocking locations support asynchronicity. E.g. when a query scans multiple partitions, switch to processing a different partition while waiting for IO.

> Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true. A smallish ring for a smallish number of IOs for the query is enough.

It depends on the kernel version, etc. The amount of memory isn't huge, but initially it was affected by RLIMIT_MEMLOCK... That's one reason the AIO patchset has a smaller number of io_uring "instances" than allowed connections. The other reason is that we need to be able to complete IOs that other backends started (otherwise there would be deadlocks), which in turn requires having the file descriptor for each ring available in all processes... which wouldn't be fun with a high max_connections.

> Speaking of large number of incoming connections, would/could the process->thread change be a step toward having a thread per active query rather than per (potentially idle) connection?

Yes. Moving to threads would really mainly be about making subsequent improvements more realistic...

> That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...

You'd likely often still want a pooler on the "application server" side, to avoid TCP / SSL connection establishment overhead. But that can be a quite simple implementation.



