Yep this would be fantastic. The ironic thing about it is that a fine-grained write completion callback like you mention would actually improve performance compared to the coarse fsync we have today.
Another simpler way to improve the filesystem API would be to add a write barrier system call. If I call write(file, a); barrier(file); write(file, b); then the OS guarantees that the second write can't start before the first write is fully committed to disk. This is enough to implement most journaling systems, databases and log files.
This would perform much better than the current POSIX API because you don't need to "stop the world" with an expensive fsync before issuing subsequent write commands. The kernel can buffer chunks of writes separated by barriers, and issue them to the underlying device as needed.
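For comparison, here is a rough sketch of what ordering two writes looks like with the API we have today, using fsync as the only available barrier. The names `append_record` and `write_all` are made up for illustration, and `barrier(fd)` in the comment is the hypothetical call from the comment above, not a real syscall.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """Keep calling os.write until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_record(path: str, payload: bytes, commit_marker: bytes) -> None:
    """Append a payload, then a commit marker, forcing the payload to disk first."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        write_all(fd, payload)
        # Today: block here until the payload is durable. With the proposed
        # (hypothetical) barrier(fd), this call would return immediately and
        # the kernel would only have to preserve the ordering.
        os.fsync(fd)
        write_all(fd, commit_marker)
        # Still needed if the caller wants to know the commit itself is durable.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The first fsync is the "stop the world" step: the caller needs ordering there, not durability, but the current API has no way to ask for one without paying for the other.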
The completion API is nicer though, because you can, for example, hold off on responding to a network request that updates the database until the changes are flushed to disk.
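You can approximate that today by pushing the write and fsync onto a worker thread and only replying once it finishes. This is a minimal sketch under that assumption, not a real kernel completion API; `handle_update` and `send_response` are hypothetical names.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# A single worker keeps writes to the file in submission order.
_flusher = ThreadPoolExecutor(max_workers=1)


def handle_update(fd: int, record: bytes,
                  send_response: Callable[[str], None]) -> Future:
    """Write a record and reply to the client only after fsync has returned."""

    def work() -> None:
        try:
            written = 0
            while written < len(record):
                written += os.write(fd, record[written:])
            os.fsync(fd)  # stand-in for a per-write completion notification
        except OSError:
            send_response("error")
            raise
        send_response("ok")

    # The caller is free to keep accepting requests while this runs.
    return _flusher.submit(work)
```

The fsync here still flushes everything pending on the file, so this only moves the cost off the request thread. A real fine-grained completion callback would let the kernel signal each write individually.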
The underlying SCSI and SATA protocols would support this with command queueing. I would have envisioned using an fcntl() to set the current group ID on an fd.
Grouped writes seem more or less equivalent to write barriers in terms of their semantics, but probably with one fewer syscall, which is nice. But I'd take any of these approaches over years of discussion and no solutions.
I wonder what it would take to actually implement this stuff in Linux. A lot of work, sure, but far from impossible.
Thanks for LMDB btw! The design is lovely. I've used it in a couple of projects, and I had a read through your code for inspiration while I was designing a little storage engine for something I've been working on. It's delightful.
POSIX doesn't stop Linux from introducing new, non-POSIX APIs. E.g., io_uring isn't part of POSIX but Linux has it anyway. As far as I can tell, an API for write barriers, grouped writes, transactional writes or IOCP should be able to work alongside the POSIX API just fine. It's just another API for writing files which user programs can opt in to using.
They hurt performance for programs that expect a zero-latency filesystem and access it in a C-style sequential fashion. If we rewrote all UNIX utilities to be concurrent by default, the file API wouldn't make things that much worse.
It’s interesting to me that in the original proposal for fixing fsync to actually be durable (https://lwn.net/Articles/270891/), there was thought given to the desire for a non-flushing write barrier (ctrl-f SYNC_FILE_RANGE_NO_FLUSH), but it appears that it never got implemented.