- SQLite itself would be simpler and probably faster if it could run on top of a better OS-level API
- It doesn't make sense to build every other program on top of SQLite. E.g., other databases (redis, postgres), anything performance-sensitive, etc. The filesystem is the system interface, not SQLite.
Aesthetically, it's also pretty gross to make a mess in POSIX, then solve it in userland (via SQLite). It's much better to just have the OS expose an API that is actually nice to work with and avoid all the slowdown, complexity and bugs which can arise from the nasty fsync API. As an example, redis runs a whole extra userland thread dedicated to issuing fsync commands to the operating system. There's a whole signalling system needed to manage that thread. With an API like the GP proposed, none of that code would be needed.
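To make that concrete, here's a minimal sketch of the shape of that machinery (not redis's actual code, just the same pattern: a worker thread fed fsync requests through a mutex/condvar):

```c
/* Minimal sketch of a dedicated fsync thread, the same *shape* as
 * redis's appendfsync machinery -- not its actual code. */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int  pending_fd = -1;        /* fd waiting to be fsynced */
static bool shutting_down = false;

static void *fsync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending_fd < 0 && !shutting_down)
            pthread_cond_wait(&cond, &lock);    /* the signalling system */
        if (shutting_down) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        int fd = pending_fd;
        pending_fd = -1;
        pthread_mutex_unlock(&lock);
        fsync(fd);          /* block here, not on the main thread */
    }
}

/* Main thread hands the fd off instead of blocking on fsync itself. */
static void request_fsync(int fd) {
    pthread_mutex_lock(&lock);
    pending_fd = fd;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}
```

All of that boilerplate exists only to get fsync off the latency path; with a kernel-level barrier or completion API it would simply disappear.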
True - but that's not exactly a high bar for success. I don't want every program I run to be an electron-style hodgepodge mess which takes 1800 dependencies to copy an ISO to a USB stick (or whatever).
Right. When fsync returns an error code, it means (I think) that one of the earlier writes has failed or partially failed. This is not an obvious behaviour, and it wasn’t clear from the fsync documentation. And fsync doesn’t tell you which write failed! It’s spooky action at a distance.
We can blame Postgres for not handling this case properly, but I think the design of fsync bears just as much blame here. A write completion callback (as suggested upthread) would be a much more sensible and intuitive way to pass write errors back to the database. I think if Linux exposed an API like that, there's a much lower chance this bug would have happened at all.
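Something like this, say -- entirely hypothetical, nothing like it exists in POSIX or Linux today:

```c
/* Hypothetical API sketch -- does NOT exist in POSIX or Linux. The
 * kernel would invoke the callback once this particular write is
 * durable (or has failed), so errors map back to individual writes. */
#include <sys/types.h>

typedef void (*write_cb)(void *ctx, int error);

ssize_t write_durable(int fd, const void *buf, size_t len, off_t off,
                      write_cb on_durable, void *ctx);
```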
It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly why fsync exists -- write() doesn't write through, for obvious performance reasons.
So any error reported from fsync() could apply to any write that happened since the last successful fsync(). In fact, storage errors that happen at the time of asynchronous writeback cannot be traced back to individual write() requests (write() requests can overwrite each other, can be coalesced, etc.), so it is technically impossible for fsync() to report which write it was.
But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage errors in general cannot be traced back to individual write() calls.
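The practical upshot for application code: everything written since the last successful fsync() has to stay replayable. A rough sketch of that pattern (the pending-buffer bookkeeping here is made up for illustration, not any real API):

```c
/* Sketch: keep records in memory until fsync() confirms durability,
 * since a failed fsync() taints every write since the last good one.
 * pending_buf/pending_len are hypothetical bookkeeping. */
#include <string.h>
#include <unistd.h>

static char   pending_buf[1 << 16];  /* copies of not-yet-durable bytes */
static size_t pending_len = 0;

int append_record(int fd, const void *rec, size_t len) {
    if (pending_len + len > sizeof pending_buf)
        return -1;                        /* real code would flush first */
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;
    memcpy(pending_buf + pending_len, rec, len);  /* keep it replayable */
    pending_len += len;
    return 0;
}

int commit(int fd) {
    if (fsync(fd) != 0)
        return -1;       /* every byte in pending_buf is now suspect */
    pending_len = 0;     /* durable: safe to forget */
    return 0;
}
```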
> It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly why fsync exists -- write() doesn't write through, for obvious performance reasons.
Hah - apparently it wasn't clear to the developers of postgres!
> But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage errors in general cannot be traced back to individual write() calls.
Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the Linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
One of the commenters up-thread said this:
> Many databases (including SQLite) don't use the filesystem correctly. The fault doesn't lie with the filesystem but with the database.
I think the problems postgres had are about 95% the fault of the terrible design of fsync. We need a better API!
Which postgres bug are you referring to? I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never really worked with databases much, mostly just sqlite3).
Multiple processes writing to the same file would need some coordination anyway, so it's not much of a problem that you see errors "from" other processes. Note that under Linux, all processes will be reported the last error that happened since opening that file (errors are reported from write() and other functions as well, and any error is reported only once, AFAIK). Again, looking at the implementation side, where the page cache has a partial view of the file address space, and writes just write to the pages in that view, the behaviour makes quite a bit of sense. (I'm not saying that this is a great model to build a database on.)
Yep: if fsync() returns an error, in general probably just assume that any of your writes since the last fsync() may or may not have hit the disk. Maybe it means even less than that in some situations in practice.
You can try sync_file_range() as well (Linux specific).
As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring will enable such a style of programming some day?
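For anyone curious, sync_file_range() looks roughly like this -- with the big caveat that it only schedules or waits on writeback of a byte range and flushes neither metadata nor the device write cache, so it's a performance hint, not a durability primitive:

```c
/* Linux-only sketch of sync_file_range(). This starts (or waits on)
 * writeback for a byte range, but does NOT flush file metadata or the
 * device write cache -- fsync()/fdatasync() are still needed for
 * actual durability. */
#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off writeback of [off, off+len) and return immediately. */
int start_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Wait for writeback of the range (started earlier) to complete. */
int wait_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```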
> I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never really worked with databases much, mostly just sqlite3).
As I understand it, O_DIRECT doesn't change fsync's wonky behaviour. It just asks the OS to stop using its filesystem cache.
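A minimal Linux sketch of what using it looks like -- note the alignment requirements, and that you still call fdatasync() at the end:

```c
/* Linux sketch: O_DIRECT bypasses the page cache, but fsync()'s
 * semantics are unchanged -- you still need fsync()/fdatasync() to
 * flush the device cache and metadata. Buffer, offset, and length
 * must be aligned (typically to the logical block size, e.g. 512
 * or 4096 bytes). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int write_direct(const char *path, size_t block) { /* block: aligned size */
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, block) != 0) {  /* aligned buffer */
        close(fd);
        return -1;
    }
    /* ... fill buf ... */
    ssize_t n = write(fd, buf, block);  /* goes straight to the device */
    int rc = (n == (ssize_t)block && fdatasync(fd) == 0) ? 0 : -1;

    free(buf);
    close(fd);
    return rc;
}
```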
> As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
Maybe.
The nice thing about barriers or write groups is that you don't need to move to an async API like io_uring to keep issuing write commands to the OS. The kernel can still buffer writes and issue them to the underlying hardware when it's ready, and your program can go back to responding to requests. I think that'll always be nicer to work with than some io_uring + sync_file_range combination.
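That said, io_uring can already express per-pair ordering with linked SQEs -- a write chained to an fsync -- which is at least barrier-shaped. A sketch using liburing:

```c
/* Sketch using liburing: IOSQE_IO_LINK chains the fsync behind the
 * write, so it runs only if/after the write completes -- an ordering
 * guarantee for this pair without blocking the submitting thread.
 * (Assumes the submission queue has room; real code checks for NULL.) */
#include <liburing.h>

int linked_write_fsync(struct io_uring *ring, int fd,
                       const void *buf, unsigned len, off_t off) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);  /* write before fsync */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    return io_uring_submit(ring);                /* returns immediately */
}
```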
> Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the Linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
The fsync(2) man page says:
> fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
That's pretty clearly operating on a per-file basis, and pretty clearly asserting stuff against both the fs cache and the underlying device. It is also pretty clear that it doesn't isolate operations from one process vs. another if they are made against the same file.
And yes, definitely: a write followed by a read, without a guaranteed fsync between the two, can result in data that "snaps back" to whatever is on disk if that write fails to sync to the device in the future.
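For concreteness, the sequence being discussed looks like this; on a healthy disk the read is simply served from the page cache, and you'd need fault injection (e.g. a device-mapper error target) to actually see the snap-back:

```c
/* The write -> read -> fsync sequence under discussion. The pread()
 * is served from the page cache. If the later fsync() fails, the
 * kernel may drop the dirty page, and a future pread() can return
 * the old on-disk contents again -- the "snap back". On a healthy
 * disk this program just succeeds. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("demo.txt", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    char buf[4] = {0};
    pwrite(fd, "new", 3, 0);   /* dirties a page in the cache */
    pread(fd, buf, 3, 0);      /* served from cache: sees "new" */

    if (fsync(fd) != 0) {
        /* Writeback failed: "new" may never reach disk, and a later
         * pread() may once again return whatever preceded it. */
        perror("fsync");
        return 1;
    }
    printf("durable: %.3s\n", buf);
    close(fd);
    return 0;
}
```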
These things are obviously not ideal, but they are both well understood and, in fact, unavoidable if there is going to be a cache between user-space fs operations and the underlying physical device.
The design of fsync is not improvable in any meaningful way.
> The design of fsync is not improvable in any meaningful way.
I think fsync should be removed entirely. A completion-based API, or simply write barriers like macOS has, would both be more ergonomic for developers and result in faster software (since you wouldn't need to block your thread waiting for all outstanding writes to be confirmed).
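For reference, the macOS barrier is exposed through fcntl; a sketch, as I understand the Apple-only interface:

```c
/* Apple-platforms-only sketch. F_BARRIERFSYNC orders all writes
 * issued before the barrier ahead of any issued after it, without
 * waiting for the full media flush that F_FULLFSYNC performs. */
#include <fcntl.h>

int write_barrier(int fd) {
    return fcntl(fd, F_BARRIERFSYNC);   /* cheap ordering guarantee */
}

int full_flush(int fd) {
    return fcntl(fd, F_FULLFSYNC);      /* expensive: flush to media */
}
```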