- SQLite itself would be simpler and probably faster if it could run on top of a better OS-level API
- It doesn't make sense to build every other program on top of SQLite. E.g., other databases (redis, postgres), anything performance-sensitive, etc. The filesystem is the system interface, not SQLite.
Aesthetically, it's also pretty gross to make a mess in POSIX, then solve it in userland (via SQLite). It's much better to just have the OS expose an API that is actually nice to work with and avoid all the slowdown, complexity and bugs which can arise from the nasty fsync API. As an example, redis runs a whole extra userland thread dedicated to issuing fsync commands to the operating system. There's a whole signalling system needed to manage that thread. With an API like the GP proposed, none of that code would be needed.
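To make that concrete, here's a minimal sketch of the shape of that machinery (not redis's actual code, just the same pattern: a worker thread fed fsync requests through a mutex/condvar):

```c
/* Minimal sketch of a dedicated fsync thread, the same *shape* as
 * redis's appendfsync machinery -- not its actual code. */
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int  pending_fd = -1;        /* fd waiting to be fsynced */
static bool shutting_down = false;

static void *fsync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending_fd < 0 && !shutting_down)
            pthread_cond_wait(&cond, &lock);    /* the signalling system */
        if (shutting_down) {
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        int fd = pending_fd;
        pending_fd = -1;
        pthread_mutex_unlock(&lock);
        fsync(fd);          /* block here, not on the main thread */
    }
}

/* Main thread hands the fd off instead of blocking on fsync itself. */
static void request_fsync(int fd) {
    pthread_mutex_lock(&lock);
    pending_fd = fd;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}
```

All of that boilerplate exists only to get fsync off the latency path; with a kernel-level barrier or completion API it would simply disappear.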
True - but that's not exactly a high bar for success. I don't want every program I run to be an electron-style hodgepodge mess which takes 1800 dependencies to copy an ISO to a USB stick (or whatever).
Right. When fsync returns an error code, it means (I think) that one of the earlier writes has failed or partially failed. This is not an obvious behaviour, and it wasn’t clear from the fsync documentation. And fsync doesn’t tell you which write failed! It’s spooky action at a distance.
We can blame Postgres for not handling this case properly, but I think the design of fsync bears just as much blame here. A write completion callback (as suggested upthread) would be a much more sensible and intuitive way to pass write errors back to the database. I think if Linux exposed an API like that, there's a much lower chance this bug would have happened at all.
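Something like this, say -- entirely hypothetical, nothing like it exists in POSIX or Linux today:

```c
/* Hypothetical API sketch -- does NOT exist in POSIX or Linux. The
 * kernel would invoke the callback once this particular write is
 * durable (or has failed), so errors map back to individual writes. */
#include <sys/types.h>

typedef void (*write_cb)(void *ctx, int error);

ssize_t write_durable(int fd, const void *buf, size_t len, off_t off,
                      write_cb on_durable, void *ctx);
```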
It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly why fsync exists -- write() doesn't write through, for obvious performance reasons.
So any error reported from fsync() could apply to any write that happened since the last successful fsync(). In fact, storage errors that happen at the time of asynchronous writeback cannot be traced back to individual write() requests (write() requests can overwrite each other, can be coalesced, etc.), so it is technically impossible for fsync() to report which write it was.
But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage errors in general cannot be traced back to individual write() calls.
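The practical upshot for application code: everything written since the last successful fsync() has to stay replayable. A rough sketch of that pattern (the pending-buffer bookkeeping here is made up for illustration, not any real API):

```c
/* Sketch: keep records in memory until fsync() confirms durability,
 * since a failed fsync() taints every write since the last good one.
 * pending_buf/pending_len are hypothetical bookkeeping. */
#include <string.h>
#include <unistd.h>

static char   pending_buf[1 << 16];  /* copies of not-yet-durable bytes */
static size_t pending_len = 0;

int append_record(int fd, const void *rec, size_t len) {
    if (pending_len + len > sizeof pending_buf)
        return -1;                        /* real code would flush first */
    if (write(fd, rec, len) != (ssize_t)len)
        return -1;
    memcpy(pending_buf + pending_len, rec, len);  /* keep it replayable */
    pending_len += len;
    return 0;
}

int commit(int fd) {
    if (fsync(fd) != 0)
        return -1;       /* every byte in pending_buf is now suspect */
    pending_len = 0;     /* durable: safe to forget */
    return 0;
}
```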
> It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly why fsync exists -- write() doesn't write through, for obvious performance reasons.
Hah - apparently it wasn't clear to the developers of postgres!
> But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage errors in general cannot be traced back to individual write() calls.
Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the Linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
One of the commenters up-thread said this:
> Many databases (including SQLite) don't use the filesystem correctly. The fault doesn't lie with the filesystem but with the database.
I think the problems postgres had are about 95% the fault of the terrible design of fsync. We need a better API!
Which postgres bug are you referring to? I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never really worked with databases much, mostly just sqlite3).
Multiple processes writing to the same file would need some coordination anyway, so it's not much of a problem that you see errors "from" other processes. Note that under Linux, all processes will be reported the last error that happened since opening that file (errors are reported from write() and other functions as well, and any error is reported only once, AFAIK). Again, looking at the implementation side, where the page cache has a partial view of the file address space, and writes just write to the pages in that view, the behaviour makes quite a bit of sense. (I'm not saying that this is a great model to build a database on.)
Yep: if fsync() returns an error, in general probably just assume that any of your writes since the last fsync() may or may not have hit the disk. Maybe it means even less than that in some situations in practice.
You can try sync_file_range() as well (Linux specific).
As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring will enable such a style of programming some day?
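For anyone curious, sync_file_range() looks roughly like this -- with the big caveat that it only schedules or waits on writeback of a byte range and flushes neither metadata nor the device write cache, so it's a performance hint, not a durability primitive:

```c
/* Linux-only sketch of sync_file_range(). This starts (or waits on)
 * writeback for a byte range, but does NOT flush file metadata or the
 * device write cache -- fsync()/fdatasync() are still needed for
 * actual durability. */
#define _GNU_SOURCE
#include <fcntl.h>

/* Kick off writeback of [off, off+len) and return immediately. */
int start_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Wait for writeback of the range (started earlier) to complete. */
int wait_writeback(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```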
> I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never really worked with databases much, mostly just sqlite3).
As I understand it, O_DIRECT doesn't change fsync's wonky behaviour. It just asks the OS to stop using its filesystem cache.
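A minimal Linux sketch of what using it looks like -- note the alignment requirements, and that you still call fdatasync() at the end:

```c
/* Linux sketch: O_DIRECT bypasses the page cache, but fsync()'s
 * semantics are unchanged -- you still need fsync()/fdatasync() to
 * flush the device cache and metadata. Buffer, offset, and length
 * must be aligned (typically to the logical block size, e.g. 512
 * or 4096 bytes). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int write_direct(const char *path, size_t block) { /* block: aligned size */
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, block) != 0) {  /* aligned buffer */
        close(fd);
        return -1;
    }
    /* ... fill buf ... */
    ssize_t n = write(fd, buf, block);  /* goes straight to the device */
    int rc = (n == (ssize_t)block && fdatasync(fd) == 0) ? 0 : -1;

    free(buf);
    close(fd);
    return rc;
}
```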
> As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
Maybe.
The nice thing about barriers or write groups is that you don't need to move to an async API like io_uring to keep issuing write commands to the OS. The kernel can still buffer writes and issue them to the underlying hardware when it's ready, and your program can go back to responding to requests. I think that'll always be nicer to work with than some io_uring + sync_file_range combination.
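That said, io_uring can already express per-pair ordering with linked SQEs -- a write chained to an fsync -- which is at least barrier-shaped. A sketch using liburing:

```c
/* Sketch using liburing: IOSQE_IO_LINK chains the fsync behind the
 * write, so it runs only if/after the write completes -- an ordering
 * guarantee for this pair without blocking the submitting thread.
 * (Assumes the submission queue has room; real code checks for NULL.) */
#include <liburing.h>

int linked_write_fsync(struct io_uring *ring, int fd,
                       const void *buf, unsigned len, off_t off) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);  /* write before fsync */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    return io_uring_submit(ring);                /* returns immediately */
}
```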
> Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the Linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
The fsync(2) man page says:
> fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
That's pretty clearly operating on a per-file basis, and pretty clearly asserting stuff against both the fs cache and the underlying device. It is also pretty clear that it doesn't isolate operations from one process vs. another if they are made against the same file.
And yes, definitely: a write followed by a read, without a guaranteed fsync between the two, can result in data that "snaps back" to whatever is on disk if that write fails to sync to the device in the future.
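For concreteness, the sequence being discussed looks like this; on a healthy disk the read is simply served from the page cache, and you'd need fault injection (e.g. a device-mapper error target) to actually see the snap-back:

```c
/* The write -> read -> fsync sequence under discussion. The pread()
 * is served from the page cache. If the later fsync() fails, the
 * kernel may drop the dirty page, and a future pread() can return
 * the old on-disk contents again -- the "snap back". On a healthy
 * disk this program just succeeds. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("demo.txt", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    char buf[4] = {0};
    pwrite(fd, "new", 3, 0);   /* dirties a page in the cache */
    pread(fd, buf, 3, 0);      /* served from cache: sees "new" */

    if (fsync(fd) != 0) {
        /* Writeback failed: "new" may never reach disk, and a later
         * pread() may once again return whatever preceded it. */
        perror("fsync");
        return 1;
    }
    printf("durable: %.3s\n", buf);
    close(fd);
    return 0;
}
```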
These things are obviously not ideal, but they are both well understood and, in fact, unavoidable if there is going to be a cache between user-space fs operations and the underlying physical device.
The design of fsync is not improvable in any meaningful way.
> The design of fsync is not improvable in any meaningful way.
I think fsync should be removed entirely. A completion-based API, or simply write barriers like macOS has, would both be more ergonomic for developers and result in faster software (since you wouldn't need to block your thread waiting for all outstanding writes to be confirmed).
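For reference, the macOS barrier is exposed through fcntl; a sketch, as I understand the Apple-only interface:

```c
/* Apple-platforms-only sketch. F_BARRIERFSYNC orders all writes
 * issued before the barrier ahead of any issued after it, without
 * waiting for the full media flush that F_FULLFSYNC performs. */
#include <fcntl.h>

int write_barrier(int fd) {
    return fcntl(fd, F_BARRIERFSYNC);   /* cheap ordering guarantee */
}

int full_flush(int fd) {
    return fcntl(fd, F_FULLFSYNC);      /* expensive: flush to media */
}
```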