
> It just (mostly) decouples discarding state from power failures, making it intentional rather than accidental.

Just wondering, but how is an OS supposed to achieve this without fsync()ing every write to memory out to stable storage and being slow as dog$h!t? Having a "single level" of storage makes it harder to tell what is purely ephemeral and can thus be disregarded altogether if RAM or disk caches are lost in a failure. You might think you can do this via new "volatile segments" APIs, but then you're just reintroducing the volatile storage that you were trying to get rid of.




> Just wondering, but how is an OS supposed to achieve this without fsync()ing every write to memory out to stable storage and being slow as dog$h!t?

Separate from the persistent storage thing, I've wanted traditional file systems that work like this:

* Unit files. The unit of file consistency is the file. The default case. Open, create, close, no other open can read it before close, on close it replaces the old file for all later opens. On crashes, you always get a consistent file, the old version if the close hadn't committed yet.

* Log files. The unit of file consistency is one write. Append only. On crashes, you get a consistent file up to the end of a previous write. No trailing off into junk.

* Temp files. There is no file consistency guaranteed. Random access. On crashes, deleted.

* Managed files. A more complex API is available. Random access. On a write, you get two completion events. The first one is "buffer has been copied". The second is "data is safely in permanent storage". This is for databases, which really care about when the data is safely stored. The second full completion event might also be offered for other file types, primarily log files.

That covers the standard use cases and isn't too hard to do.
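
To make that concrete, here's a minimal sketch of what such an interface could look like as a C header. All of these names (fs_open, fs_write, the FS_* kinds, fs_ticket) are invented for illustration, not any existing API; it just restates the four file types above as code.

    /* Hypothetical interface for the four file types described above.
     * None of these names exist in any real OS; this is only a sketch. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    enum fs_file_kind {
        FS_UNIT,    /* whole-file consistency: new version visible only on close */
        FS_LOG,     /* append-only: after a crash, consistent up to some earlier write */
        FS_TEMP,    /* random access, no guarantees, deleted on crash */
        FS_MANAGED  /* random access with explicit durability notifications */
    };

    /* Open (or create) a file of the given kind. */
    int fs_open(const char *path, enum fs_file_kind kind, int flags);

    /* For FS_MANAGED files: a write returns a ticket, and the caller can
     * wait for the two completion events separately. */
    typedef uint64_t fs_ticket;
    fs_ticket fs_write(int fd, const void *buf, size_t len, off_t offset);
    int fs_wait_buffered(int fd, fs_ticket t);  /* "buffer has been copied" */
    int fs_wait_durable(int fd, fs_ticket t);   /* "data is safely in permanent storage" */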


As additional support for this thesis, the interface to the logical (sometimes physical) file system inside many database kernels is almost exactly this. All four file types have a concrete use case that requires their existence.

One of the things that makes them easier to use in a database kernel context is that even though they are all built on top of the same low-level storage primitives, each file type usually has its own purpose-built API instead of a generic overloaded file API, which would get messy very quickly. A file's type never changes and the set of types is enumerable, so this is clean.

The only thing keeping this from being a user-space thing is that there is no practical way of making it POSIX compliant. POSIX has all but permanently frozen what is feasible in a file system. Many database kernels work off either raw block devices or an emulated raw block device on top of a POSIX file system, on which they build their own file system; but it is never exposed to the world, which eliminates any requirement for compliance with external standards. Imagine how much easier it would be to build things like databases if this were a first-class API.
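
For what it's worth, the "emulated raw block device" part is easy to picture: the database opens one big file (or an actual block device) with O_DIRECT, does aligned page-sized reads and writes, and builds its own file abstraction on top, bypassing the kernel page cache entirely. A rough sketch, assuming Linux, 4 KiB pages, and almost no error handling:

    /* Database-style "raw block" layer in miniature: one big file opened with
     * O_DIRECT, accessed as aligned 4 KiB pages. The page size is arbitrary;
     * real engines layer their own cache and file system on top of this. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int blockdev_open(const char *path)
    {
        /* O_DIRECT bypasses the kernel page cache; the database caches pages itself. */
        return open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
    }

    int page_write(int fd, long pageno, const void *data)
    {
        void *buf;
        /* O_DIRECT requires block-aligned buffers, offsets, and lengths. */
        if (posix_memalign(&buf, PAGE_SIZE, PAGE_SIZE) != 0)
            return -1;
        memcpy(buf, data, PAGE_SIZE);
        ssize_t n = pwrite(fd, buf, PAGE_SIZE, (off_t)pageno * PAGE_SIZE);
        free(buf);
        if (n != PAGE_SIZE)
            return -1;
        return fdatasync(fd);  /* force the page to stable storage */
    }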


not being POSIX compliant is actually needed for a modern OS. with POSIX you'll never get to safe concurrency, so you can hardly make use of all the CPU cores.

it must be a microkernel, supporting only safe languages. certainly not Rust.

Genode is close, Pony was close, Fuchsia is getting there, Singularity was there already. Concurrent Pascal also.


Hmm, are you saying POSIX compliance makes scalability impossible? Because it makes it possible to run software written in unsafe languages? I'm pretty sure Fuchsia will always be able to run software written in C.


not impossible, but hard. more importantly, not safe.

Fuchsia supports C++, sure. But Fuchsia is not a safe OS; like Rust, it's just a way to better safety. Dart for Flutter is their safe language there.


Sounds reasonable. I didn't know about Dart for Flutter!


I agree that this would be useful. Have you ever written a prototype of it?


> The unit of file consistency is the file. The default case. Open, create, close, no other open can read it before close

"$PROGRAM cannot open document $FILENAME because it's already open in $PROGRAM. Try closing any open applications and open $FILENAME again." Who thinks this is actually sane behavior? Unix does it right.


> "$PROGRAM cannot open document $FILENAME because it's already open in $PROGRAM. Try closing any open applications and open $FILENAME again." Who thinks this is actually sane behavior? Unix does it right.

Wouldn't a better approach be: "$FILENAME is already open in $PROGRAM. Would you like to: (1) Switch to $PROGRAM and continue editing it there; (2) Ask $PROGRAM to close it; (3) Edit it anyway (DANGEROUS!); (4) Abort". (For non-interactive processes, you wouldn't want to ask, you'd probably just want to always abort–which causes problems for some situations like "I want to install an upgrade to this shared library but $PROGRAM is executing using it", but there are ways to solve those problems.)

Kind of like swap files in vim – although vim doesn't have the "switch to $PROGRAM" option, because the underlying operating systems don't have good support for it – but that's an application-level feature which only works in one app. It would be better if it was an OS-level feature which worked across all apps.
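
You can fake a weak version of that today with advisory locks, though nothing in the system can tell you *which* program holds the lock, which is exactly the piece that needs OS support. A sketch using flock() (the prompt wording is made up, and a real editor would do something friendlier than getchar()):

    /* Crude approximation of "already open in another program" via an advisory
     * lock. There is no portable way to discover WHO holds the lock, which is
     * the part that would need real OS support. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    int open_for_editing(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        if (flock(fd, LOCK_EX | LOCK_NB) == 0)
            return fd;  /* we are the only editor */

        if (errno == EWOULDBLOCK) {
            /* Someone else holds the lock: offer the dangerous option. */
            fprintf(stderr, "%s is already open elsewhere. Edit anyway? [y/N] ", path);
            int c = getchar();
            if (c == 'y' || c == 'Y')
                return fd;  /* proceed without the lock (DANGEROUS!) */
        }
        close(fd);
        return -1;
    }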


Instead of giving you a mutual exclusion error, a newly created file written this way might simply be invisible until it's complete, at which point it atomically replaces the old version of the file (which might still be accessible through mechanisms like VMS version numbers, WAFL snapshots, Linux reflinks, or simply having opened it before the replacement). This is probably what a transactional filesystem would give you: the creation, filling, and naming of the file would normally be part of the same transaction.

But, with optimistic synchronization, transactions that were reading an old version of the file might require redoing from the start when a new version was written.
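
For reference, the closest POSIX approximation of that "invisible until complete, then atomically replaces the old version" behavior is the usual write-a-temp-file-then-rename() dance; a minimal sketch, with ad hoc temp naming and abbreviated error handling:

    /* Write a new version invisibly, then atomically replace the old one.
     * Readers that open `path` see either the old file or the complete new
     * file, never a half-written one. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const void *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);  /* ad hoc temp name */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        /* rename() is atomic with respect to other opens of `path`; strictly,
         * you also need to fsync the containing directory to make the rename
         * itself durable across a crash. */
        if (rename(tmp, path) != 0) {
            unlink(tmp);
            return -1;
        }
        return 0;
    }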


Both KeyKOS and EUMEL used periodic globally consistent checkpoints supplemented by an explicit fsync() call for use on things that required durability like database transaction journals. So when the system crashed you would lose your last few minutes of work; EUMEL ran on floppies, so the interval could be as long as 15 minutes, but I think even KeyKOS only took a checkpoint every minute or so, perhaps because it would pause the whole system for close to a second on the IBM 370s it ran on. (Many years later, Norm Hardy showed me a SPARC workstation with I think 32 or 64 MiB of RAM; he said it was the biggest machine KeyKOS had ever run on, twice the size of the mainframe that was the previous record holder. Physically it was small enough to hold in one hand.)

I assume L3 used the same approach as EUMEL. I don't know enough about OS/400 or Multics to comment intelligently, but at least OS/400 is evidently reasonably performant and reliable, so I'd be surprised if it isn't a similar strategy.

There's a latency/bandwidth tradeoff: if you take checkpoints less often, more of the data you would have checkpointed gets overwritten, so your checkpoint bandwidth gets smaller. NetApp's WAFL (not a single-level store, but a filesystem using a similar globally-consistent-snapshot strategy) originally (01996) took a checkpoint ("snapshot") once a second and stored its journals in battery-backed RAM, but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.

Most of these systems did have "volatile segments" actually, but not for the reason you suggest, which is performance; rather, it was for device drivers. You can't swap out your device drivers (in most cases) and they don't have state that would be useful across power failures anyway. So they would be, as EUMEL called it, "resident": exempted from both the snapshot mechanism and from paging.
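
The "explicit fsync() for durability" part is the same contract databases rely on today: append the journal record, force it to stable storage, and only then acknowledge the transaction. In miniature (the journal fd is assumed to have been opened with O_APPEND; the record format is up to the caller):

    /* The explicit-durability path layered on top of periodic checkpoints:
     * a transaction is not acknowledged until its journal record has been
     * forced to stable storage. */
    #include <stddef.h>
    #include <unistd.h>

    int journal_commit(int journal_fd, const void *record, size_t len)
    {
        if (write(journal_fd, record, len) != (ssize_t)len)
            return -1;
        if (fsync(journal_fd) != 0)  /* the durable-commit point */
            return -1;
        return 0;                    /* now it is safe to acknowledge the transaction */
    }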


> So when the system crashed you would lose your last few minutes of work

So, pretty much how disk storage is managed today? But then if you're going to checkpoint application programs from time to time, it might be better to design them so that they explicitly serialize their program state to a succinct representation that they can reload from, and only have OS-managed checkpoint and restore as a last resort for programs that aren't designed to do this... which is again quite distinct from having everything on a "single level".

> but with modern SSDs a snapshot every 10 milliseconds or so seems feasible, depending on write bandwidth.

Not so. Even swapping to SSD is generally avoided, because every single write causes wear that shortens the lifetime of the device.


Well, I don't think it's "pretty much how disk storage is managed today," among other things because you don't have a multi-minute reboot process, and you usually lose some workspace context in a power failure.

The bigger advantage of carefully designed formats for persistent data is usually not succinctness but schema upgrade.

It's true that too much write bandwidth will destroy your SSDs prematurely, and that might indeed be a drawback of a single-level store, because you might end up spending that bandwidth on things that aren't worth it. This is especially true now that DWPD (drive writes per day) has collapsed from 10 or more down to, say, 0.2. But they're fast enough that you can do submillisecond durable commits, potentially hundreds of them per second. In the limit where each snapshot writes a single 4 KiB page, 100 snapshots per second is 0.4 MB/s.

If you have a 3-year-life 1 TiB SSD rated for 0.2 DWPD, you can write about 220 TiB to it before it fails. So if you have 4 GiB of RAM, in the worst case where every checkpoint dirties every page, you can only take about 56,000 checkpoints before killing the drive: roughly nine minutes' worth if you're doing it every 10 ms. But actually the drive probably isn't that fast. If you limit yourself to an Ultra DMA 2 speed of 33 MB/s on average, while still permitting bursts at the SSD's maximum speed, the drive lasts 2.8 months, but the maximum checkpoint latency is more than 2 minutes. You have to scale back to 7.7 MB/s to guarantee a disk life of a year, at which point your worst-case checkpoint time rises to over 9 minutes.
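
If you want to poke at those numbers, the arithmetic is easy to redo; here's a little C program that just recomputes the figures quoted above (1 TiB drive, 0.2 DWPD, 3-year rating, 4 GiB of RAM):

    /* Recompute the SSD-endurance arithmetic above. */
    #include <stdio.h>

    int main(void)
    {
        double drive_bytes = 1024.0 * 1024 * 1024 * 1024;  /* 1 TiB */
        double dwpd        = 0.2;                          /* drive writes per day */
        double rated_days  = 3 * 365.0;                    /* 3-year rating */
        double ram_bytes   = 4.0 * 1024 * 1024 * 1024;     /* 4 GiB */

        double endurance   = drive_bytes * dwpd * rated_days;  /* total writable bytes */
        double checkpoints = endurance / ram_bytes;            /* worst-case full-dirty checkpoints */

        printf("endurance: %.0f TiB\n", endurance / drive_bytes);              /* ~219 */
        printf("checkpoints before death: %.0f\n", checkpoints);               /* ~56,000 */
        printf("at one per 10 ms: %.1f minutes\n", checkpoints * 0.010 / 60);  /* ~9.3 */
        printf("drive life at 33 MB/s: %.1f months\n",
               endurance / 33e6 / 86400 / 30);                                 /* ~2.8 */
        printf("rate for a 1-year life: %.1f MB/s\n",
               endurance / (365.0 * 86400) / 1e6);                             /* ~7.6 */
        printf("worst-case checkpoint at that rate: %.1f minutes\n",
               ram_bytes / (endurance / (365.0 * 86400)) / 60);                /* ~9.4 */
        return 0;
    }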



