File system semantics need a rethink. Here's what I've proposed. This is not original with me; these are all things implemented in some non-UNIX-based systems.
Files should come in several kinds:
- Unit files. They're immutable, but replaceable. This is the normal case. You create them with "creat", write them, and when closed, they replace any previous version with the same name. Any program which opens one for reading sees a complete file. On a crash before close, the new file is dropped and the old version remains. If a program aborts, that also happens. An explicit close commits the file update. (A rough POSIX approximation of this pattern is sketched after the list.)
- Log files. These are append only. On a crash, you are guaranteed that the file will be intact to some point ending at a write boundary. The last transactions may be lost, but the file must not tail off into junk. For journals, log files, etc.
- Temp files. These disappear on a crash.
- Managed files. These are for databases. They can do everything files can do.
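For reference, the closest a plain POSIX filesystem gets to unit-file semantics today is the temp-file-plus-rename dance. A rough sketch, with illustrative names and abbreviated error handling:

    /* Rough POSIX approximation of "unit file" semantics today:
     * write a temporary file, force it to disk, then atomically
     * rename() it over the old version. Names are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *dir, const char *name,
                     const void *data, size_t len)
    {
        char tmp[4096], dst[4096];
        snprintf(tmp, sizeof tmp, "%s/.%s.tmp", dir, name);
        snprintf(dst, sizeof dst, "%s/%s", dir, name);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* contents durable */
        close(fd);

        if (rename(tmp, dst) < 0) return -1;           /* atomic replace */

        int dfd = open(dir, O_RDONLY);                 /* make the rename durable */
        if (dfd < 0) return -1;
        int r = fsync(dfd);
        close(dfd);
        return r;
    }

Note how much ceremony it takes to get what "an explicit close commits the update" would give you for free.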
Along with this, you need async I/O. This is primarily for log files and managed files. When you do a write, you get two callbacks, one when the buffer has been copied, and one when the data has been committed to permanent storage and will survive a crash. It is the job of the OS, drivers, and devices to not send the data-committed completion until the data is safe.
Only a few programs need that, but they're the important ones - database managers.
This turns the fsync problem around. Instead of doing an fsync to force write-out, databases wait for a completion before considering the data committed. It's not a flush-everything; it's a commit-complete for a single write. So if you have multiple I/O operations pending on a single file, a normal event in database land, there's no big stall for a cache flush.
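To make that concrete, a purely hypothetical sketch of what such an interface could look like (none of these names exist in any current kernel):

    /* Hypothetical shape of the proposed two-callback write API.
     * Invented for illustration only. */
    #include <stddef.h>
    #include <sys/types.h>

    struct write_req {
        const void *buf;
        size_t      len;
        off_t       offset;
        /* callback 1: the kernel has copied the buffer; the caller
           may reuse or free it now */
        void (*on_accepted)(struct write_req *req);
        /* callback 2: the device reports the data is on stable
           storage and will survive a crash (error != 0 means it isn't) */
        void (*on_committed)(struct write_req *req, int error);
    };

    /* hypothetical syscall: queue the write and return immediately */
    int async_write(int fd, struct write_req *req);

A database would acknowledge a transaction to its client only from the second callback, instead of stalling in fsync().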
There's also no incentive for drive manufacturers to send commit-completes early. The only people who use them are the ones who will notice if someone cheats. Most bulk I/O is on unit files, where something is being copied.
(There are mainframe systems that had this sort of thing half a century ago.)
Yep this would be fantastic. The ironic thing about it is that a fine-grained write completion callback like you mention would actually improve performance compared to the coarse fsync we have today.
Another simpler way to improve the filesystem API would be to add a write barrier system call. If I call write(file, a); barrier(file); write(file, b); then the OS guarantees that the second write can't start before the first write is fully committed to disk. This is enough to implement most journaling systems, databases and log files.
This would perform much better than the current POSIX API because you don't need to "stop the world" with an expensive fsync before issuing subsequent write commands. The kernel can buffer chunks of writes separated by barriers, and issue them to the underlying device as needed.
The completion API is nicer though, because you can (for example), not respond to a network request to update the database until the changes are flushed to disk.
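A sketch of what using such a barrier could look like (barrier() is hypothetical; nothing like it exists in Linux today):

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical: barrier() is the call proposed above. It only
     * orders writes on this fd -- it does not block the caller or
     * flush anything. */
    int barrier(int fd);

    void journaled_update(int fd, const void *rec, size_t rec_len,
                          const void *page, size_t page_len, off_t page_off)
    {
        write(fd, rec, rec_len);              /* journal record first */
        barrier(fd);                          /* nothing below may reach the
                                                 device before the record */
        pwrite(fd, page, page_len, page_off); /* then the in-place update */
        /* no fsync needed just for ordering; durability is a separate
           concern (and a separate wait, if you need it) */
    }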
The underlying SCSI and SATA protocols would support this with command queueing. I would have envisioned using an fcntl() to set the current write-group ID on an fd.
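For concreteness, a sketch of that fcntl-style grouping (F_SET_WRITE_GROUP is an invented name and value, used purely for illustration):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical write grouping via fcntl(): all writes tagged with
     * group N must be durable before any write tagged N+1 may start. */
    #define F_SET_WRITE_GROUP 0x7f01   /* made-up command */

    void grouped_update(int fd, const void *rec, size_t rec_len,
                        const void *page, size_t page_len, off_t page_off)
    {
        fcntl(fd, F_SET_WRITE_GROUP, 1);
        write(fd, rec, rec_len);              /* group 1: journal record */

        fcntl(fd, F_SET_WRITE_GROUP, 2);
        pwrite(fd, page, page_len, page_off); /* group 2: ordered after group 1 */
    }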
Grouped writes seem more or less equivalent to write barriers in terms of their semantics, but probably with 1 less syscall - which is nice. But I'd take any of these approaches over years of discussion & no solutions.
I wonder what it would take to actually implement this stuff in linux. A lot of work, sure - but far from impossible.
Thanks for lmdb btw! The design is lovely. I've used it in a couple of projects, and I had a read through your code for inspiration while I was designing a little storage engine for something I've been working on. It's delightful.
POSIX doesn't stop linux from introducing new, non-POSIX APIs. E.g., io_uring isn't part of POSIX but linux has it anyway. As far as I can tell, an API for write barriers, grouped writes, transactional writes or IOCP should be able to work alongside the POSIX API just fine. It's just another API for writing files which user programs can opt in to using.
They do degrade performance for programs that expect a zero-latency filesystem and access it in a C-style sequential fashion. If we rewrote all the UNIX utilities to be concurrent by default, though, the file API wouldn't make things that much worse.
It’s interesting to me that in the original proposal for fixing fsync to actually be durable (https://lwn.net/Articles/270891/), there was thought given to the desire for a non-flushing write barrier (ctrl-f SYNC_FILE_RANGE_NO_FLUSH), but it appears that it never got implemented.
> Along with this, you need async I/O. This is primarily for log files and managed files. When you do a write, you get two callbacks, one when the buffer has been copied, and one when the data has been committed to permanent storage and will survive a crash.
This is what you want if you need to communicate the durability guarantee to the outside world, but if you’re just trying to enforce some consistency guarantees for internal purposes (e.g. transaction journal write goes out before data write), callbacks impose an unnecessary CPU-disk-CPU roundtrip. That’s what Featherstitch[1] was trying to avoid. Or is that not worth it?
More generally, how does this compare to other throughput-oriented queue-adjacent things, such as GPU command buffers and fences, or Linux io_uring and Windows RIO? (A comparison to NIC queues is probably less interesting because NICs don’t reorder things—or do they? What about radios?) Do all of these need to be separate mechanisms?
Sudden thought: Featherstitch is to completion callbacks as promise pipelining[2] is to traditional RPC.
How much are current database implementations hamstrung by the standard file semantics? If it's a lot, I'm surprised there aren't more databases that just mount a bare disk or partition and do whatever they want with it. Or ship a special kernel module they can talk to that implements a better interface to the underlying block device?
Or does none of this really matter in a world of virtualized-everything where your "block device" is still ultimately hitting multiple layers of actually-just-being-a-file on its way to hardware?
LMDB performance suffers when used on a journaling filesystem. Consequently, LMDB supports using a plain block device, to eliminate all FS overheads. And yes, all of the perf benefits go out the window if you're running on several layers of virtualization anyway.
~15 years ago, the last time I used Oracle in production, that was still standard practice. I'd be interested to hear from someone who's used it recently.
Have you read "Designing Data-Intensive Applications" by Martin Kleppmann? I'm sure most of it will be old news to you, but it does add multiple computers to what you've described here. You will have to read like 80% of the book to get to that part though.
Forgive my lack of knowledge, but wouldn’t something like io_uring satisfy the need for an async API, and also obviate the need for the first callback? You write into the ring (which the kernel does not have to do a secondary copy from) and it writes that onto your disk and tells you when it’s done? In fact, IIUC, direct IO goes a step further and lets you write directly to the storage, and in return you give up some niceties like OS-managed caching and prefetching because you have to do them yourself.
In which case, a step towards this better FS API you’ve proposed would be some higher-level APIs around io_uring/direct IO to make it better to use?
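Roughly, yes. With liburing you can already chain a write and an fsync with IOSQE_IO_LINK, so the fsync doesn't start until the write completes, and its completion acts as the "data is durable" notification. A sketch (assumes liburing; error handling abbreviated):

    #include <sys/types.h>
    #include <liburing.h>

    /* Queue a write and a linked fsync; the fsync SQE only runs after
     * the write completes, and its CQE tells us the data should be
     * durable. */
    int durable_append(struct io_uring *ring, int fd,
                       const void *buf, unsigned len, off_t off)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, off);
        sqe->flags |= IOSQE_IO_LINK;        /* order the fsync after the write */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, fd, 0);

        io_uring_submit(ring);

        /* wait for both completions; the second is the
           "committed to stable storage" notification */
        struct io_uring_cqe *cqe;
        for (int i = 0; i < 2; i++) {
            if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
            int res = cqe->res;
            io_uring_cqe_seen(ring, cqe);
            if (res < 0) return res;
        }
        return 0;
    }

It still pays for a full per-file fsync on every chain though, which is exactly the coarseness the proposal above is trying to get rid of.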
- SQLite itself would be simpler and probably faster if it could run on top of a better OS-level API
- It doesn't make sense to build every other program on top of SQLite. Eg, other databases (redis, postgres), anything performance sensitive, etc. The filesystem is the system interface. Not sqlite.
Aesthetically, it's also pretty gross to make a mess in POSIX, then solve it in userland (via sqlite). It's much better to just have the OS expose an API that is actually nice to work with and avoid all the slowdown, complexity and bugs which can arise from the nasty fsync API. As an example, redis runs a whole extra userland thread dedicated to issuing fsync commands to the operating system. There's a whole signalling system needed to manage that thread. With an API like the GP proposed, none of that code would be needed.
True - but that's not exactly a high bar for success. I don't want every program I run to be an electron style hodgepodge mess which takes 1800 dependencies to copy an iso to a USB stick (or whatever).
Right. When fsync returns an error code, it means (I think) that one of the earlier writes has failed or partially failed. This is not an obvious behaviour, and it wasn’t clear from the fsync documentation. And fsync doesn’t tell you which write failed! It’s spooky action at a distance.
We can blame Postgres for not handling this case properly, but I think the design of fsync bears just as much blame here. A write completion callback (as suggested upthread) would be a much more sensible and intuitive way to pass write errors back to the database. I think if Linux exposed an api like that, there’s a much lower chance this bug would have happened at all.
It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly the point why fsync exists -- write() doesn't write through, for obvious performance reasons.
So any error reported from fsync() could apply to any write that happened since the last successful fsync(). In fact, storage errors that happen at the time of asynchronous writeback cannot be traced back to individual write() requests (write() requests can overwrite each other, can be collated, etc) so it is technically impossible for fsync() to report which write it was.
But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage error in general cannot be traced back to individual write() calls.
> It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly the point why fsync exists -- write() doesn't write through, for obvious performance reasons.
Hah - apparently it wasn't clear to the developers of postgres!
> But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage error in general cannot be traced back to individual write() calls.
Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
One of the commenters up-thread said this:
> Many databases (including SQLite) don't use the filesystem correctly. The fault doesn't lie with the filesystem but with the database.
I think the problems postgres had are about 95% the fault of the terrible design of fsync. We need a better API!
Which postgres bug are you referring to? I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, mostly just sqlite3).
Multiple processes writing to the same file would need some coordination anyway, so it's not much of a problem that you see errors "from" other processes. Note that under Linux, all processes will always be reported the last error that happened since opening that file (errors are reported from write() and other functions as well, and any error is reported only once, AFAIK). Again, looking on the implementation side, where the page cache has a partial view of the file address space, and writes just write to the pages in that view, the behaviour makes quite a bit of sense. (I'm not saying that this is a great model to build a database on).
Yep, if fsync() returns an error in general probably just assume that any of your writes since the last fsync() may or may not have hit the disk. Maybe it means even less in some situations in practice.
You can try sync_file_range() as well (Linux specific).
As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
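For reference, sync_file_range() usage looks roughly like this (Linux-only; note it starts or waits for writeback of a byte range but does not flush the device cache or file metadata, so it is not a durability guarantee on its own):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* kick off writeback of a byte range without blocking on it */
    int start_writeback(int fd, off64_t off, off64_t len)
    {
        return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
    }

    /* later: wait until that range has been written back */
    int wait_writeback(int fd, off64_t off, off64_t len)
    {
        return sync_file_range(fd, off, len,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }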
> I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, mostly just sqlite3).
As I understand it, O_DIRECT doesn't change fsync's wonky behaviour. It just asks the OS to stop using its filesystem cache.
> As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
Maybe.
The nice thing about barriers or write groups is that you don't need to move to an async API like io_uring to keep issuing write commands to the OS. The kernel can still buffer writes and issue them to the underlying hardware when it's ready, and your program can go back to responding to requests. I think that'll always be nicer to work with than some io_uring + sync_file_range combination.
> Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
fsync says
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage
device) so that all changed information can be retrieved even if
the system crashes or is rebooted. This includes writing through
or flushing a disk cache if present. The call blocks until the
device reports that the transfer has completed.
pretty clearly operating on a per-file basis, and pretty clearly asserting stuff against both the fs cache and the underlying device. it is also pretty clear that it doesn't isolate operations from one process vs. another if they are made against the same file.
yes definitely a write followed by a read without a guaranteed fsync between the two can result in data that "snaps back" to whatever is on disk if that write fails to sync to the device in the future
these things are obviously not ideal, but they are both well-understood and in fact unavoidable, so long as there is a cache between user space fs operations and the underlying physical device
the design of fsync is not improvable in any meaningful way
> the design of fsync is not improvable in any meaningful way
I think fsync should be removed entirely. A completion-based API, or simply write barriers like macOS has, would both be more ergonomic for developers and result in faster software (since you wouldn't need to block your thread waiting for all outstanding writes to be confirmed).
The semantics of a bare file system cannot be (easily) changed, I think. What would be nice is a thin shim layer on top of the existing semantics to handle more advanced file abstractions. Applications nowadays repeatedly need to do things that the file system doesn't provide, but sometimes their solution partially overlaps with the fs. A middleware that provides a clearer separation would be nice.
Last time I looked, about three years ago, one still couldn’t assume O_TMPFILE would be supported, too many file systems and old kernels didn’t have it. It is a nice flag, so hopefully it’s more likely to be supported now.
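For anyone who hasn't used it, the pattern looks roughly like this (sketch; error handling omitted, and note linkat() refuses to replace an existing name, so you still need the rename() trick if you want to overwrite):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* O_TMPFILE gives you an unnamed file that simply vanishes if you
     * crash before linking it in -- close to the "temp file" /
     * "unit file" behaviour discussed above. */
    int publish_new_file(const char *dir, const char *path,
                         const void *data, size_t len)
    {
        int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) return -1;

        write(fd, data, len);
        fsync(fd);                          /* make the contents durable */

        char proc[64];
        snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
        int r = linkat(AT_FDCWD, proc, AT_FDCWD, path, AT_SYMLINK_FOLLOW);
        close(fd);
        return r;
    }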
I'm building a custom storage engine for a CRDT at the moment and I don't want my data corrupted under any circumstances. Writing code to do this reliably is painful. Even figuring out the rules is difficult - is fsync enough? fdatasync? Will it still corrupt on power loss events in modern consumer SSDs? How do I test / simulate my software under these failure conditions? If I get an error when I fsync, what assumptions can I make about the state of my data?
Some atomicity rules seem to be page-size specific, but I can't find any way to query the OS for the kernel's page size, or the page size of the underlying device. The status quo is horrid.
I'm developing on linux, but my code should work on any modern platform. (Windows, MacOS, iOS, Android, Linux).
On Apple's hardware, fsync (the "no really" version) actually does durably flush data to disk when you ask it to. And darwin has a write barrier command, which is great because I don't need to fsync at all in order to append to my transaction log. But linux is missing a barrier command, and I've heard most consumer grade SSDs lie when you ask them to fsync.
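Concretely, on Darwin that looks something like this (both fcntl commands are real on macOS/iOS; neither exists on Linux):

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Darwin-only sketch of the two commands mentioned above. */
    void append_record_darwin(int fd, const void *rec, size_t len)
    {
        write(fd, rec, len);

        /* F_BARRIERFSYNC: order everything written so far ahead of any
           later writes, without waiting for the media flush */
        fcntl(fd, F_BARRIERFSYNC);
    }

    void commit_checkpoint_darwin(int fd)
    {
        /* F_FULLFSYNC: the "no really, flush it to the media" version
           of fsync */
        fcntl(fd, F_FULLFSYNC);
    }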
Generally, linux doesn't really seem designed to preserve your data when the system crashes or experiences a power loss. The single API it gives you to help (fsync) is badly designed, coarse-grained, slow, badly documented and hard to use correctly. And worst of all, it often doesn't work anyway thanks to crap SSDs.
There's lots of better APIs out there. IO completion ports on windows are a better design. Likewise, adding write barriers (like macos has done) would help a lot.
The filesystem APIs on linux are long overdue for an overhaul. They're sorely lacking if you want your software to be reliable.
There's the problem of badly designed APIs, and there's the problem of crap hardware that lies to you.
IME, it's difficult to ensure that data "written" to disk has actually reached the disk, without disabling the drive's cache. This was true even with spinning rust, and continues to be true with SSDs. I'd rather not get started on read-after-write, with inline checksums in order to verify that your disk isn't lying to you.
In an ideal world, we'd have both well-designed APIs as well as hardware that isn't crap. If I had to choose between the two, I'd go with good hardware, because the API problem can at least be worked around - albeit with a significant investment in time, effort and pain.
I feel your pain, and wish you good luck in your efforts.
Thanks! But as for the two problems (bad APIs and bad firmware on SSDs), we should do both!
Personally, I don't have influence over my solid state devices, but linux is open source. We should absolutely fix linux to have better APIs here, even if the underlying devices lie sometimes.
It might even help with the firmware. The reason for SSDs to lie about flush commands is so they look better in benchmarks. If linux had a better API (with barriers, write groups, transactions, whatever), then benchmarks would probably depend less on fsync. Maybe, eventually, we can solve that problem too.
It's the opposite: you want fewer layers of abstraction. At least you want your OS to give you fewer abstractions, so you can choose your own.
See 'exokernels' for that: their idea was that the OS should concentrate on safely multiplexing hardware, and leave abstracting to user-space libraries. Do one thing, and do it well; instead of serving multiple masters badly.