How can you trust a disk to write data? (eclecticlight.co)
53 points by goranmoomin on Feb 22, 2022 | 34 comments



Which of these guarantee that data is committed to storage?

    flush()
    fsync()
    close()
What about if the storage is on NFS?

The reality is that nothing is really guaranteed unless you lean on platform-specific behaviour and are certain you're not, for example, dealing with files over NFS, or with hardware that happens to relax whatever guarantees you thought you had.

The best defense I've seen against the uncertainty of whether files are written is the strategy required when writing to a Maildir-format mailbox. This uses a directory for temporary writes to unique filenames, followed by a rename into place in the 'real' directory, where a file only becomes visible once its write is deemed complete.
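A minimal sketch of that pattern in C (the function name and parameters are hypothetical, error handling is abbreviated, and whether the final fsync() reaches past the drive's own cache is exactly the question discussed further down):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical sketch: write new contents to tmp_path, then atomically
     * move it over final_path inside dir_path. */
    int write_then_rename(const char *dir_path, const char *tmp_path,
                          const char *final_path, const void *buf, size_t len)
    {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }  /* flush file data first */
        if (close(fd) != 0) return -1;

        if (rename(tmp_path, final_path) != 0)  /* atomic wrt other processes */
            return -1;

        int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);                    /* record the rename itself */
        close(dfd);
        return rc;
    }

On many filesystems the fsync() on the directory is what makes the rename durable; without it the new name can disappear after a crash even though the file's data was flushed.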

Those who've been around for some time may have experienced corrupt mailboxes where the storage format was something mbox-like, with all messages held in one file (and no fancy database-style transaction handling to protect integrity).

Any kind of cross-platform development that writes files is made more difficult and less performant by this problem. In many cases, I'm afraid this is poorly understood, or ignored. I don't think there's even a way to measure how often data corruption occurs due to this not being accounted for.


Any such guarantees only hold for the OS and filesystem (i.e. the data has been transferred from main memory across whatever physical bus the disk is attached to: IDE/SCSI/SATA/SPI/network/NVMe, etc.).

In modern drives, the specific write operation may be queued up on the physical disk in volatile storage outside the control of the operating system. A flush can’t help you with this problem; flush is for the operating system to push its own cache out to the disk.

This is where F_FULLFSYNC comes in. It is a request to the disk to finish flushing the volatile caches to nonvolatile storage. Some disks just ignore this request, claiming to be done before that is actually true.

In these cases no userland or even kernel level programming tricks can save you from data loss, since you can’t actually know when the data becomes safe on the disk.
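For what it's worth, here's a hedged sketch of asking for the stronger flush where it exists; the function name is mine, the F_FULLFSYNC branch only compiles where the platform headers define it, and, as above, a drive can still acknowledge it dishonestly:

    #include <fcntl.h>
    #include <unistd.h>

    /* Ask for the strongest flush available.  On macOS, F_FULLFSYNC requests
     * that the drive drain its own volatile cache; elsewhere fall back to
     * fsync(), which only promises the OS has handed the data to the device. */
    int full_sync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0;
        /* Some filesystems/devices don't support it; fall back to fsync(). */
    #endif
        return fsync(fd);
    }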


> This uses a directory for temporary writes to unique filenames, followed by a rename into place in the 'real' directory, where a file only becomes visible once its write is deemed complete.

This still seems like it makes the assumption that the move will complete. Or is 'mv' significantly different from the full write?


The move (rename) is a metadata-only operation and is atomic on POSIX-compliant platforms, and in practice on real filesystems too.


Rename being atomic doesn't imply that after a rename completes, it's flushed to cold storage. It just implies that a rename is either completely flushed to cold storage, or the data in cold storage looks as if there was never a rename.

Also, the data you wrote to a file before the rename may not have been flushed completely to cold storage. Rename doesn't change anything here. The data may still be served from the disk controller's cache.

The rename trick is mostly useful for "committing" long operations, which even on the high level are not atomic. But it has nothing to do with flushing.

Also, renaming a file is a filesystem-level operation. From the POV of a disk, it looks the same as any other write. So if you can't make sure that normal writes are flushed, you can't make sure that renames are flushed.

Another note: rename(2) wasn't even atomic with respect to power loss on ext4 at one point (which was later fixed), so it's not guaranteed. POSIX itself only guarantees atomicity with respect to other filesystem operations. Atomicity against power loss is a distinct checkbox. You need to check your specific filesystem.


I'll add this: on spinning disks it's normal to reorder write operations; it is optimal to order writes in a way that matches the layout on disk. That way, each sector is written as the write head passes over the sector. We like to think of a disk-backed filesystem as slow, persistent RAM, but in reality it's a lot more complex than that.


Sure; IIRC, it only changes the inode. However, it's still trying to change bits on the disk, regardless of whether those are data or metadata. Can the metadata-only bits that the disk claims to write in order to implement the move be trusted to have been written?


It might be atomic on the OS level (no processes can see an "intermediate" state), but it might not be atomic when you yank the power cord. POSIX in general doesn't offer any guarantees there, AFAIK.


> In many cases, I'm afraid this is poorly understood, or ignored.

Definite agreement on that. In many ways it's even less understood nowadays as pervasive reliable building power and UPSes are largely taken for granted, so the operational knowledge of the problem has been fading away until some cloud service loses an important circuit and suddenly a database has unpredictable data corruption.


I don’t trust the disk to write data. I trust it to report back when the data is written. If the disk confirms the write, I assume it won’t somehow lose that data if there’s a power loss shortly after confirmation. If the disk doesn’t confirm the write, all bets are off.

The problem is that disks confirm writes before they actually store the data in the non-volatile target.


> The problem is that disks confirm writes before they actually store the data in the non-volatile target.

Consumer NVMe drives are bad about this. They'll fill a write buffer on the drive and flush it when it is convenient for the device, regardless of what the OS wants.

The Samsung 963 NVMe is an interesting device though because it has little capacitors on it that will flush the contents of that buffer when power is suddenly lost. But it still also just fills that write buffer and does not commit to disk when it said it did.

This is how they are gaming the benchmark tests.


> The Samsung 963 NVMe is an interesting device though because it has little capacitors on it that will flush the contents of that buffer when power is suddenly lost. But it still also just fills that write buffer and does not commit to disk when it said it did.

As long as it's truly "past the point of no return" I don't particularly care where the data is stored (QLC, SLC, DDR+Capacitor). I just care that the confirmation happens after the data is no longer able to be lost. I'm okay with the first 512MB getting confirmation very quickly because it's going to volatile DDR RAM on the SSD package -- I just want the hardware to guarantee that there's enough stored energy to eventually write any confirmed data. This doesn't need to have a huge impact on performance.

This goes for either a power loss event or a kernel panic. Either way, the hard disk should eventually store all confirmed data in a non-volatile way -- even if it's not yet in non-volatile storage at the time of confirmation.


I can't seem to dig this out, but I vaguely remember some disks (HDD, not SSD) had a mode where data is:

  * acknowledged & buffered in the cache

  * not physically written to the platter (yet)

  * but still guaranteed to be written, even when power fails
Can anybody confirm that?


Some hardware RAID controllers do that with their battery-backed cache - they report the write as complete before it's flushed to disk, because they have a battery in case of power interruption.


I always wondered... does the battery power the disk so the writes are done or does it just hold the data in the RAID controller until the computer is re-powered?


The second one.

On power loss the drives will immediately lose power; they have little capacitors in them that allow the head to come safely to rest (it used to be that the head just crashed into the spinning platter, can you even imagine).

Part of the server boot-up is to flush the RAID caches, which are battery-backed volatile storage (and the batteries are fairly big, so they can last about 7 days in most cases).

The battery being big doesn't help the drives, though: there would be no guarantee that you could finish writing the cache out to the drives on the battery's capacity, since drives use considerably more power than keeping some DIMMs hot.


> it used to be that the head just crashed into the spinning platter, can you even imagine

If I remember correctly, it was not that bad. The head being left in a random place meant it could damage the platter if the disk was moved or shaken. I think that modern disks would indeed crash (if not for autoparking), but they have much smaller flying heights and thus require serious active magic to keep the head from hitting the platter.


This is correct - parking moved the head to the side so it wasn't above the platter, while just pulling power left the head above the platter, but without damage.

Parking would mean you could move the drive and the head wouldn't bounce off the platter.


Exactly - the better controllers would "lock" and basically refuse to do anything until they saw the array they wanted to write to (which was always fun when dealing with a remaindered system pulled from production, often requiring a full BIOS reset or wipe on the card).


> On power loss the drives will immediately lose power; they have little capacitors in them that allow the head to come safely to rest (it used to be that the head just crashed into the spinning platter, can you even imagine).

I remember as a kid running PARK.COM to park the heads before powering off the computer every night; I continued this habit for many years after I got my own computer, until I learned that hard drives now have mechanisms that make this unnecessary.


The controllers typically accept writes into a log structure that can be replayed once power is restored. Because the battery is only powering some SRAM (more likely DRAM these days) and a small ASIC or whatever, not even the whole controller, a small laptop-sized battery can sometimes keep that up for days. There's a lot of variability in how these are implemented, depending on how enterprise-y and expensive they are.


Others, like the Samsung 963 NVMe, have little capacitors on them that flush the cache to non-volatile storage in case of power loss. It's kind of cool to see.


There was a drive which used the rotational energy in the platters to generate enough electricity to write out what was in the cache.

I wanna say it was IBM.



> acknowledged & buffered in the cache

I remember some enterprise drives having a capacitor for this exact reason, although at the time some manufacturers (maybe it was Hitachi, can't remember exactly who) were experimenting with using flash storage (probably SLC) as the cache.


Some SSDs behave like this (well, not with platters, but you get the gist). The issue is that others don't have the third guarantee, and you have to test to know which is which, unless it's a server-grade SSD which advertises power-loss protection (and even then you should probably test it).


Related: https://www.deconstructconf.com/2019/dan-luu-files - mostly about the challenges filesystems face in writing data to disk, and a bit about disks as well.


> macOS has two low-level ways of flushing disk writes: fsync() and the F_FULLFSYNC command in fcntl().

Through the recent spat / flareup (which I expect is what triggered this article), I learned that macOS has an intermediate level called F_BARRIERFSYNC whose purpose is pretty much what it says on the tin.

But of course, as the article notes, that only goes as far as sending the relevant commands to the hardware; whether the hardware responds appropriately is, from what I gathered (from discussions here and elsewhere), a whole other kettle of fish.
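For reference, it's issued through the same fcntl() interface as F_FULLFSYNC; a hedged sketch (the function name is mine, and the guard assumes the Apple SDK headers define F_BARRIERFSYNC):

    #include <fcntl.h>
    #include <unistd.h>

    /* F_BARRIERFSYNC asks for a write barrier (I/O issued before it should
     * complete before I/O issued after it) without forcing a full drain of
     * the drive's volatile cache the way F_FULLFSYNC does. */
    int barrier_sync(int fd)
    {
    #ifdef F_BARRIERFSYNC
        return fcntl(fd, F_BARRIERFSYNC);
    #else
        return fsync(fd);   /* no barrier semantics on other platforms */
    #endif
    }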


Fact of the matter is, when a disk suddenly loses power, all bets are off. You can design intricate and complex systems of buffers all you want, but when the chips are down, at some point bits need to be flipped in the right positions to store the data, and for this you need electricity.

The only way around this would be if disks incorporated some kind of internal backup power system, which kicks in and flushes their internal buffers in case of a power loss.

However, even that doesn't really solve the problem; it just moves it somewhere else... because the internal backup power is just another system that can fail.


> Fact of the matter is, when a disk suddenly loses power, all bets are off.

The problem is not “the disk lost power during a write”, the problem is “I performed a write, the disk acknowledged the write, some time later it lost power”.

> The only way around this would be if disks incorporated some kind of internal backup power system, which kicks in and flushes their internal buffers in case of a power loss.

That actually happens; caps can be used for exactly that. Apparently an alternative (or complement?) for some SSDs is an SLC write cache (which is thus durable; not as fast as memory, but much faster than TLC or some such).


> The problem is not “the disk lost power during a write”, the problem is “I performed a write, the disk acknowledged the write, some time later it lost power”.

I don't think this is right.

There are a lot of things that can happen after the "disk" acknowledged the write that are a lot more common than a power failure (especially in a machine with batteries!) that fsync() does nothing about, so a system that gets its reliability from fsync() cannot in practice offer any more data persistence guarantees than one that doesn't.


> I don't think this is right.

You can think what you want; the entire thing started from marcan (re)discovering that fsync() on macOS only flushes from the FS to the disk, but does not require any guarantee from the disk itself.

> There are a lot of things that can happen after the "disk" acknowledged the write that are a lot more common than a power failure

Like what, putting a bullet through the SSD?

> especially in a machine with batteries!

Not every machine has a battery. Not every SSD is internal and hooked onto that battery.

> a system that gets its reliability from fsync() cannot in practice offer any more data persistence guarantees than one that doesn't.

So your point of view is what, reliability nihilism where data does not exist and death comes to us all?


Both are problems.

In one case you care about consistency: how can you guarantee a transaction fully succeeded vs only partially succeeded if the drive tells you the data was written, even though it wasn't?

In the other case, e.g. writing a log file or saving your word processing document, you want to ensure as much data is written as possible, even if out of order, so you can recover at least some of your work in the event of a failure.
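A minimal sketch of that second goal (the name and error handling are mine, and whether each fsync() truly reaches stable media is, again, the subject of this thread): append each record and sync as you go, so a crash should cost at most the tail written since the last successful sync.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical append-only logger: append one record and sync it. */
    int log_append(const char *path, const char *record)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        size_t len = strlen(record);
        int ok = write(fd, record, len) == (ssize_t)len && fsync(fd) == 0;
        close(fd);
        return ok ? 0 : -1;
    }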


> Both are problems.

One is what the discussion is about and the other is a completely different discussion nobody is actually having.

And I don’t think many people actually care about what happens when you just yolo your writes; it’s not like it’s going to be anything you can usefully rely on.



