File system semantics need a rethink. Here's what I've proposed. This is not original with me; these are all things implemented in some non-UNIX-based systems.
Files should come in several kinds:
- Unit files. They're immutable, but replaceable. This is the normal case. You create them with "creat", write them, and when closed, they replace any previous version with the same name. Any program which opens one for reading sees a complete file. On a crash before close, the new file is dropped and the old version remains. If a program aborts, that also happens. An explicit close commits the file update. (A rough POSIX approximation of this pattern is sketched after the list.)
- Log files. These are append only. On a crash, you are guaranteed that the file will be intact to some point ending at a write boundary. The last transactions may be lost, but the file must not tail off into junk. For journals, log files, etc.
- Temp files. These disappear on a crash.
- Managed files. These are for databases. They can do everything files can do.
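For reference, the closest a plain POSIX filesystem gets to unit-file semantics today is the temp-file-plus-rename dance. A rough sketch, with illustrative names and abbreviated error handling:

    /* Rough POSIX approximation of "unit file" semantics today:
     * write a temporary file, force it to disk, then atomically
     * rename() it over the old version. Names are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *dir, const char *name,
                     const void *data, size_t len)
    {
        char tmp[4096], dst[4096];
        snprintf(tmp, sizeof tmp, "%s/.%s.tmp", dir, name);
        snprintf(dst, sizeof dst, "%s/%s", dir, name);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; }   /* contents durable */
        close(fd);

        if (rename(tmp, dst) < 0) return -1;           /* atomic replace */

        int dfd = open(dir, O_RDONLY);                 /* make the rename durable */
        if (dfd < 0) return -1;
        int r = fsync(dfd);
        close(dfd);
        return r;
    }

Note how much ceremony it takes to get what "an explicit close commits the update" would give you for free.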
Along with this, you need async I/O. This is primarily for log files and managed files. When you do a write, you get two callbacks, one when the buffer has been copied, and one when the data has been committed to permanent storage and will survive a crash. It is the job of the OS, drivers, and devices to not send the data-committed completion until the data is safe.
Only a few programs need that, but they're the important ones - database managers.
This turns the fsync problem around. Instead of doing an fsync to force write-out, databases wait for a completion before considering the data committed. It's not a flush-everything; it's a commit-complete for a single write. So if you have multiple I/O operations pending on a single file, a normal event in database land, there's no big stall for a cache flush.
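To make that concrete, a purely hypothetical sketch of what such an interface could look like (none of these names exist in any current kernel):

    /* Hypothetical shape of the proposed two-callback write API.
     * Invented for illustration only. */
    #include <stddef.h>
    #include <sys/types.h>

    struct write_req {
        const void *buf;
        size_t      len;
        off_t       offset;
        /* callback 1: the kernel has copied the buffer; the caller
           may reuse or free it now */
        void (*on_accepted)(struct write_req *req);
        /* callback 2: the device reports the data is on stable
           storage and will survive a crash (error != 0 means it isn't) */
        void (*on_committed)(struct write_req *req, int error);
    };

    /* hypothetical syscall: queue the write and return immediately */
    int async_write(int fd, struct write_req *req);

A database would acknowledge a transaction to its client only from the second callback, instead of stalling in fsync().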
There's also no incentive for drive manufacturers to send commit-completes early. The only people who use them are the ones who will notice if someone cheats. Most bulk I/O is on unit files, where something is being copied.
(There are mainframe systems that had this sort of thing half a century ago.)
Yep this would be fantastic. The ironic thing about it is that a fine-grained write completion callback like you mention would actually improve performance compared to the coarse fsync we have today.
Another simpler way to improve the filesystem API would be to add a write barrier system call. If I call write(file, a); barrier(file); write(file, b); then the OS guarantees that the second write can't start before the first write is fully committed to disk. This is enough to implement most journaling systems, databases and log files.
This would perform much better than the current POSIX API because you don't need to "stop the world" with an expensive fsync before issuing subsequent write commands. The kernel can buffer chunks of writes separated by barriers, and issue them to the underlying device as needed.
The completion API is nicer though, because you can (for example), not respond to a network request to update the database until the changes are flushed to disk.
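A sketch of what using such a barrier could look like (barrier() is hypothetical; nothing like it exists in Linux today):

    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical: barrier() is the call proposed above. It only
     * orders writes on this fd -- it does not block the caller or
     * flush anything. */
    int barrier(int fd);

    void journaled_update(int fd, const void *rec, size_t rec_len,
                          const void *page, size_t page_len, off_t page_off)
    {
        write(fd, rec, rec_len);              /* journal record first */
        barrier(fd);                          /* nothing below may reach the
                                                 device before the record */
        pwrite(fd, page, page_len, page_off); /* then the in-place update */
        /* no fsync needed just for ordering; durability is a separate
           concern (and a separate wait, if you need it) */
    }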
The underlying SCSI and SATA protocols would support this with command queueing. I would have envisioned using an fcntl() to set the current write-group ID on an fd.
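For concreteness, a sketch of that fcntl-style grouping (F_SET_WRITE_GROUP is an invented name and value, used purely for illustration):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical write grouping via fcntl(): all writes tagged with
     * group N must be durable before any write tagged N+1 may start. */
    #define F_SET_WRITE_GROUP 0x7f01   /* made-up command */

    void grouped_update(int fd, const void *rec, size_t rec_len,
                        const void *page, size_t page_len, off_t page_off)
    {
        fcntl(fd, F_SET_WRITE_GROUP, 1);
        write(fd, rec, rec_len);              /* group 1: journal record */

        fcntl(fd, F_SET_WRITE_GROUP, 2);
        pwrite(fd, page, page_len, page_off); /* group 2: ordered after group 1 */
    }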
Grouped writes seem more or less equivalent to write barriers in terms of their semantics, but probably with 1 less syscall - which is nice. But I'd take any of these approaches over years of discussion & no solutions.
I wonder what it would take to actually implement this stuff in linux. A lot of work, sure - but far from impossible.
Thanks for lmdb btw! The design is lovely. I've used it in a couple of projects, and I had a read through your code for inspiration while I was designing a little storage engine for something I've been working on. It's delightful.
POSIX doesn't stop linux from introducing new, non-POSIX APIs. E.g., io_uring isn't part of POSIX but linux has it anyway. As far as I can tell, an API for write barriers, grouped writes, transactional writes or IOCP should be able to work alongside the POSIX API just fine. It's just another API for writing files which user programs can opt in to using.
They do degrade performance for programs that expect a zero-latency filesystem and access it in a C-style sequential fashion. If we rewrote all the UNIX utilities to be concurrent by default, though, the file API wouldn't make things that much worse.
It’s interesting to me that in the original proposal for fixing fsync to actually be durable (https://lwn.net/Articles/270891/), there was thought given to the desire for a non-flushing write barrier (ctrl-f SYNC_FILE_RANGE_NO_FLUSH), but it appears that it never got implemented.
> Along with this, you need async I/O. This is primarily for log files and managed files. When you do a write, you get two callbacks, one when the buffer has been copied, and one when the data has been committed to permanent storage and will survive a crash.
This is what you want if you need to communicate the durability guarantee to the outside world, but if you’re just trying to enforce some consistency guarantees for internal purposes (e.g. transaction journal write goes out before data write), callbacks impose an unnecessary CPU-disk-CPU roundtrip. That’s what Featherstitch[1] was trying to avoid. Or is that not worth it?
More generally, how does this compare to other throughput-oriented queue-adjacent things, such as GPU command buffers and fences, or Linux io_uring and Windows RIO? (A comparison to NIC queues is probably less interesting because NICs don’t reorder things—or do they? What about radios?) Do all of these need to be separate mechanisms?
Sudden thought: Featherstitch is to completion callbacks as promise pipelining[2] is to traditional RPC.
How much are current database implementations hamstrung by the standard file semantics? If it's a lot, I'm surprised there aren't more databases that just mount a bare disk or partition and do whatever they want with it. Or ship a special kernel module they can talk to that implements a better interface to the underlying block device?
Or does none of this really matter in a world of virtualized-everything where your "block device" is still ultimately hitting multiple layers of actually-just-being-a-file on its way to hardware?
LMDB performance suffers when used on a journaling filesystem. Consequently, LMDB supports using a plain block device, to eliminate all FS overheads. And yes, all of the perf benefits go out the window if you're running on several layers of virtualization anyway.
~15 years ago, the last time I used Oracle in production, that was still standard practice. I'd be interested to hear from someone who's used it recently.
Have you read "Designing Data-Intensive Applications" by Martin Kleppmann? I'm sure most of it will be old news to you, but it does add multiple computers to what you've described here. You will have to read like 80% of the book to get to that part though.
Forgive my lack of knowledge, but wouldn’t something like io_uring satisfy the need for an async API, and also obviate the need for the first callback? You write into the ring (which the kernel does not have to do a secondary copy from) and it writes that onto your disk and tells you when it’s done? In fact, IIUC, direct IO goes a step further and lets you write directly to the storage, and in return you give up some niceties like OS-managed caching and prefetching because you have to do them yourself.
In which case, a step towards this better FS API you’ve proposed would be some higher-level APIs around io_uring/direct IO to make it better to use?
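Roughly, yes. With liburing you can already chain a write and an fsync with IOSQE_IO_LINK, so the fsync doesn't start until the write completes, and its completion acts as the "data is durable" notification. A sketch (assumes liburing; error handling abbreviated):

    #include <sys/types.h>
    #include <liburing.h>

    /* Queue a write and a linked fsync; the fsync SQE only runs after
     * the write completes, and its CQE tells us the data should be
     * durable. */
    int durable_append(struct io_uring *ring, int fd,
                       const void *buf, unsigned len, off_t off)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, off);
        sqe->flags |= IOSQE_IO_LINK;        /* order the fsync after the write */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, fd, 0);

        io_uring_submit(ring);

        /* wait for both completions; the second is the
           "committed to stable storage" notification */
        struct io_uring_cqe *cqe;
        for (int i = 0; i < 2; i++) {
            if (io_uring_wait_cqe(ring, &cqe) < 0) return -1;
            int res = cqe->res;
            io_uring_cqe_seen(ring, cqe);
            if (res < 0) return res;
        }
        return 0;
    }

It still pays for a full per-file fsync on every chain though, which is exactly the coarseness the proposal above is trying to get rid of.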
- SQLite itself would be simpler and probably faster if it could run on top of a better OS-level API
- It doesn't make sense to build every other program on top of SQLite. Eg, other databases (redis, postgres), anything performance sensitive, etc. The filesystem is the system interface. Not sqlite.
Aesthetically, it's also pretty gross to make a mess in POSIX, then solve it in userland (via sqlite). It's much better to just have the OS expose an API that is actually nice to work with and avoid all the slowdown, complexity and bugs which can arise from the nasty fsync API. As an example, redis runs a whole extra userland thread dedicated to issuing fsync commands to the operating system. There's a whole signalling system needed to manage that thread. With an API like the GP proposed, none of that code would be needed.
True - but that's not exactly a high bar for success. I don't want every program I run to be an electron style hodgepodge mess which takes 1800 dependencies to copy an iso to a USB stick (or whatever).
Right. When fsync returns an error code, it means (I think) that one of the earlier writes has failed or partially failed. This is not an obvious behaviour, and it wasn’t clear from the fsync documentation. And fsync doesn’t tell you which write failed! It’s spooky action at a distance.
We can blame Postgres for not handling this case properly, but I think the design of fsync bears just as much blame here. A write completion callback (as suggested upthread) would be a much more sensible and intuitive way to pass write errors back to the database. I think if Linux exposed an api like that, there’s a much lower chance this bug would have happened at all.
It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly the point why fsync exists -- write() doesn't write through, for obvious performance reasons.
So any error reported from fsync() could apply to any write that happened since the last successful fsync(). In fact, storage errors that happen at the time of asynchronous writeback cannot be traced back to individual write() requests (write() requests can overwrite each other, can be collated, etc) so it is technically impossible for fsync() to report which write it was.
But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage error in general cannot be traced back to individual write() calls.
> It was always clear from what fsync does that it doesn't indicate which write failed. That's exactly the point why fsync exists -- write() doesn't write through, for obvious performance reasons.
Hah - apparently it wasn't clear to the developers of postgres!
> But what may be surprising is that fsync() can report errors that were "caused" by other processes writing to the same file -- because, again, storage error in general cannot be traced back to individual write() calls.
Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
One of the commenters up-thread said this:
> Many databases (including SQLite) don't use the filesystem correctly. The fault doesn't lie with the filesystem but with the database.
I think the problems postgres had are about 95% the fault of the terrible design of fsync. We need a better API!
Which postgres bug are you referring to? I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, mostly just sqlite3).
Multiple processes writing to the same file would need some coordination anyway, so it's not much of a problem that you see errors "from" other processes. Note that under Linux, all processes will always be reported the last error that happened since opening that file (errors are reported from write() and other functions as well, and any error is reported only once, AFAIK). Again, looking on the implementation side, where the page cache has a partial view of the file address space, and writes just write to the pages in that view, the behaviour makes quite a bit of sense. (I'm not saying that this is a great model to build a database on).
Yep, if fsync() returns an error in general probably just assume that any of your writes since the last fsync() may or may not have hit the disk. Maybe it means even less in some situations in practice.
You can try sync_file_range() as well (Linux specific).
As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
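For reference, sync_file_range() usage looks roughly like this (Linux-only; note it starts or waits for writeback of a byte range but does not flush the device cache or file metadata, so it is not a durability guarantee on its own):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* kick off writeback of a byte range without blocking on it */
    int start_writeback(int fd, off64_t off, off64_t len)
    {
        return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
    }

    /* later: wait until that range has been written back */
    int wait_writeback(int fd, off64_t off, off64_t len)
    {
        return sync_file_range(fd, off, len,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }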
> I've always been under the impression that the big databases use O_DIRECT to be in control, combined with userland I/O workers for performance. But apparently that isn't the case (I've never even used databases much, mostly just sqlite3).
As I understand it, O_DIRECT doesn't change fsync's wonky behaviour. It just asks the OS to stop using its filesystem cache.
> As is mentioned elsewhere in the comments here, for more demanding workloads, ordering guarantees in the form of I/O barriers are needed. I suppose that is what you can have with O_DIRECT and an I/O thread pool. Maybe io_uring() will enable such a style of programming some day?
Maybe.
The nice thing about barriers or write groups is that you don't need to move to an async API like io_uring to keep issuing write commands to the OS. The kernel can still buffer writes and issue them to the underlying hardware when it's ready, and your program can go back to responding to requests. I think that'll always be nicer to work with than some io_uring + sync_file_range combination.
> Thanks for clarifying. This is utterly ridiculous behaviour. It basically turns any error returned by fsync into a fatal error, because there's no way to know what's gone wrong. And what about the file cache? If I issue a write, followed by a read, my understanding is that the OS can return the "written" data straight from the filesystem cache. But what happens if the subsequent fsync fails? Will reads now once more return the data as it is on disk? If so, I guess the data I get back from my read() calls must suddenly "snap" back to the previous value at some point before fsync returns? What a mess. Oh, and according to the linux man page, some of the specifics of this behaviour are also filesystem-specific. Lovely.
fsync says
fsync() transfers ("flushes") all modified in-core data of (i.e.,
modified buffer cache pages for) the file referred to by the file
descriptor fd to the disk device (or other permanent storage
device) so that all changed information can be retrieved even if
the system crashes or is rebooted. This includes writing through
or flushing a disk cache if present. The call blocks until the
device reports that the transfer has completed.
pretty clearly operating on a per-file basis, and pretty clearly asserting stuff against both the fs cache and the underlying device. it is also pretty clear that it doesn't isolate operations from one process vs. another if they are made against the same file.
yes definitely a write followed by a read without a guaranteed fsync between the two can result in data that "snaps back" to whatever is on disk if that write fails to sync to the device in the future
these things are obviously not ideal, but they are both well-understood and in fact unavoidable, so long as there is a cache between user space fs operations and the underlying physical device
the design of fsync is not improvable in any meaningful way
> the design of fsync is not improvable in any meaningful way
I think fsync should be removed entirely. A completion-based API, or simply write barriers like macOS has, would both be more ergonomic for developers and result in faster software (since you wouldn't need to block your thread waiting for all outstanding writes to be confirmed).
The semantics of a bare file system cannot be (easily) changed, I think. What would be nice is a thin shim layer on top of the existing semantics to handle more advanced file abstractions. Applications nowadays repeatedly need to do things that the file system doesn't provide, but sometimes their solution partially overlaps with the fs. A middleware that provides a clearer separation would be nice.
Last time I looked, about three years ago, one still couldn’t assume O_TMPFILE would be supported, too many file systems and old kernels didn’t have it. It is a nice flag, so hopefully it’s more likely to be supported now.
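For anyone who hasn't used it, the pattern looks roughly like this (sketch; error handling omitted, and note linkat() refuses to replace an existing name, so you still need the rename() trick if you want to overwrite):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* O_TMPFILE gives you an unnamed file that simply vanishes if you
     * crash before linking it in -- close to the "temp file" /
     * "unit file" behaviour discussed above. */
    int publish_new_file(const char *dir, const char *path,
                         const void *data, size_t len)
    {
        int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0) return -1;

        write(fd, data, len);
        fsync(fd);                          /* make the contents durable */

        char proc[64];
        snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
        int r = linkat(AT_FDCWD, proc, AT_FDCWD, path, AT_SYMLINK_FOLLOW);
        close(fd);
        return r;
    }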
I'm building a custom storage engine for a CRDT at the moment and I don't want my data corrupted under any circumstances. Writing code to do this reliably is painful. Even figuring out the rules is difficult - is fsync enough? fdatasync? Will it still corrupt on power loss events in modern consumer SSDs? How do I test / simulate my software under these failure conditions? If I get an error when I fsync, what assumptions can I make about the state of my data?
Some atomicity rules seem to be page-size specific, but I can't find any way to query the OS for the kernel's page size, or the page size of the underlying device. The status quo is horrid.
I'm developing on linux, but my code should work on any modern platform. (Windows, MacOS, iOS, Android, Linux).
On Apple's hardware, fsync (the "no really" version) actually does durably flush data to disk when you ask it to. And darwin has a write barrier command, which is great because I don't need to fsync at all in order to append to my transaction log. But linux is missing a barrier command, and I've heard most consumer grade SSDs lie when you ask them to fsync.
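Concretely, on Darwin that looks something like this (both fcntl commands are real on macOS/iOS; neither exists on Linux):

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Darwin-only sketch of the two commands mentioned above. */
    void append_record_darwin(int fd, const void *rec, size_t len)
    {
        write(fd, rec, len);

        /* F_BARRIERFSYNC: order everything written so far ahead of any
           later writes, without waiting for the media flush */
        fcntl(fd, F_BARRIERFSYNC);
    }

    void commit_checkpoint_darwin(int fd)
    {
        /* F_FULLFSYNC: the "no really, flush it to the media" version
           of fsync */
        fcntl(fd, F_FULLFSYNC);
    }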
Generally, linux doesn't really seem designed to preserve your data when the system crashes or experiences a power loss. The single API it gives you to help (fsync) is badly designed, coarse-grained, slow, badly documented and hard to use correctly. And worst of all, it often doesn't work anyway thanks to crap SSDs.
There's lots of better APIs out there. IO completion ports on windows are a better design. Likewise, adding write barriers (like macos has done) would help a lot.
The filesystem APIs on linux are long overdue for an overhaul. They're sorely lacking if you want your software to be reliable.
There's the problem of badly designed APIs, and there's the problem of crap hardware that lies to you.
IME, it's difficult to ensure that data "written" to disk has actually reached the disk, without disabling the drive's cache. This was true even with spinning rust, and continues to be true with SSDs. I'd rather not get started on read-after-write, with inline checksums in order to verify that your disk isn't lying to you.
In an ideal world, we'd have both well-designed APIs as well as hardware that isn't crap. If I had to choose between the two, I'd go with good hardware, because the API problem can at least be worked around - albeit with a significant investment in time, effort and pain.
I feel your pain, and wish you good luck in your efforts.
Thanks! But as for the two problems (bad APIs and bad firmware on SSDs), we should do both!
Personally, I don't have influence over my solid state devices, but linux is open source. We should absolutely fix linux to have better APIs here, even if the underlying devices lie sometimes.
It might even help with the firmware. The reason for SSDs to lie about flush commands is so they look better in benchmarks. If linux had a better API (with barriers, write groups, transactions, whatever), then benchmarks would probably depend less on fsync. Maybe, eventually, we can solve that problem too.
It's the opposite: you want fewer layers of abstraction. At least you want your OS to give you fewer abstractions, so you can choose your own.
See 'exokernels' for that: their idea was that the OS should concentrate on safely multiplexing hardware, and leave abstracting to user-space libraries. Do one thing, and do it well; instead of serving multiple masters badly.