Hacker News new | past | comments | ask | show | jobs | submit login
Why `fsync()`: Losing unsynced data on a single node leads to global data loss (redpanda.com)
72 points by avinassh on May 16, 2023 | hide | past | favorite | 45 comments



I'm pretty sure Redpanda has identified a real issue here, but my only response to any situation where you need to be absolutely, positively, 100% sure that `fsync()` has done The Right Thing would be: "Good luck with that!"

Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

In a typical 'Enterprise' stack, which would extend the above with a healthy dose of Fibre Channel over Ethernet, load-balancing switches, distributed RAM caches, and a bunch of other complexity, I would not even bet the life of my goldfish on it.

The article mentions Byzantine systems as a possible solution, but my only knowledge about those comes from reading https://www.usenix.org/system/files/login/articles/13_micken..., which makes me not exactly optimistic about their real-world applicability.

One thing that comes to mind when reading things like this is 'Distributed journaling for distributed systems', but since nobody seems to be doing that, it's probably a silly idea. I'll probably stick to SQLite write-ahead logs, at least I can wrap my mind around those...


" the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer"

Well, kind of. fsync()'s job is only to ensure caches are written to durable storage. Its job is not to ensure the integrity of durable storage. Your idea of fsync()'s "Right Thing" is not quite correct, because these types of durable storage failures can happen outside of the context of a write -- so it's a bit silly to point a finger at fsync().

For example, you can write bytes to disk today and the drive might experience corruption on that sector next week.

"(and even then, I would not bet my life on it)"

All hardware fails, so of course you shouldn't bet your life on it. That's why we have both onsite and offsite backups. The durable media itself can always fail and lose data -- even if there aren't any writes occurring. Data loss can even occur on a powered off system.

Again, fsync()'s job is only to ensure the data is moved to durable storage. Solving the problem of data loss is done in other ways -- for example raid5 addresses the problem of durable storage failing by using parity and extra copies.

The real takeaway from the blog is perhaps that Raft as a protocol is inadequate, in the same way a raid1 mirror system is inadequate in terms of protecting from data corruption on durable storage (which of the two mirrored drives has the correct version of the data?). I'm frankly surprised that this situation is undetectable. It's a solved problem in other storage layers.

"I'll probably stick to SQLite write-ahead logs, at least I can wrap my mind around those"

Well, keep in mind: those have the same problem in the event of durable storage failure.


> Again, fsync()'s job is only to ensure the data is moved to durable storage.

The problem is that every layer in the stack lies. They say "yes it is definitely written to durable storage" when it is just in some cache layer and about to be written.


"The problem is that every layer in the stack lies. "

Not enterprise gear. Reliable storage does exist. I have tested many vendors myself, and gone through spec sheets under NDA (as mentioned above).

"They say "yes it is definitely written to durable storage" when it is just in some cache layer and about to be written."

Enterprise hardware contains batteries specifically designed so that caches can still be written out to durable storage in the event of power loss. Have you ever dealt with managing battery learning cycles on a Dell PERC?


> it is just in some cache layer and about to be written

Definitely not. I get that sometimes we find things that do lie, but lying about this is a huge P0 bug and everyone with an interest in data storage knows to be on the lookout for such brokenness. E.g. such people do not buy SSD that lack some sort of power fail flush protection.


> One thing that comes to mind when reading things like this is 'Distributed journaling for distributed systems', but since nobody seems to be doing that, it's probably a silly idea

That's exactly what Redpanda, BookKeeper, LogDevice, Pravega, and countless other proprietary systems at big tech companies do.


There is no way to know for sure that a Byzantine system is actually operating in a way that replicated copies of your data are actually safely written. Both due to the exact same issue of the drives themselves, and also that Byzantine system software is liable to have a variety of bugs and invalid states that will keep it operating as normal, even though the nodes are actually in a fault mode. (the problem is twofold: 1) the node and/or system not refusing writes when in a fault state, 2) the system not actually knowing that it's in a fault state) Even if you do all kinds of Jepsen simulation and mathematical proofs of the software (including the operating system!), you still can't trust the drives.

I think the only way to solve the problem is new storage firmware and hardware that is open and guarantees a write is done. I'm sure some companies may claim such functionality but we need an open source architecture and code to be sure.

In the meantime I think synchronous writes to multiple nodes is the safest option. Avoids complexity and bugs in fancy software, and the hardware is what it is.


> In the meantime I think synchronous writes to multiple nodes is the safest option.

This is what is being proposed though, right? Like no one is saying just use fsync, it's to use fsync across multiple systems, no?


> There is no way to know for sure that a Byzantine system is actually operating in a way that replicated copies of your data are actually safely written.

Isn't this how Bitcoin adds to the ledger though? Using a merkle tree and slowing things down significantly with those 6+ confirmations.


Technically Bitcoin can work without ever safely writing to disk. It's potentially all magical caches until the firmware decides to put bits to media.


That's actually a really good point. It isn't dependent on the hardware, it depends on the math.


That’s right. Raft uses sync writes which is what redpanda uses.

Complexity and industrial level reference impls are crucial


> Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

The answer is easy in all cases: No.

Media can always fail without notice.


Media can fail, but the parent there is saying they can fail with byzantine faults.

Most of us don't protect against such faults. (Neither Redpanda nor Kafka do, but that's not the point of the OP; the OP's point is that Kafka is not protecting against non-byzantine faults in the disk.)


Media failures after the data is written seems out of scope.


Using fsync() to flush data to disk is, generally speaking, the wrong way to do things. fsync() is like using a piece of wood to hammer a nail into the ground. It'll work under very limited circumstances, but will fail horribly to do the heavy lifting required in other cases. fsync()'s semantics are complex and poorly defined, doubly so if multiple threads or processes are involved. Who gets the i/o error for a failed write? Nobody knows! fsync() is probably one of the worst APIs warts in Linux/Unix.

Developers competent in matters of storage generally use other methods / APIs like open()ing a file with O_SYNC and then checking the results of write() or pwrite() which will clearly identify when data is or is not successfully written to disk. If you want more performance, use async writes and it'll perform just fine. Seeing fsync() in code is a sign that the code has a bad smell and probably needs to be thoroughly reviewed.


There may be performance reasons to still use fsync with O_DIRECT instead of only O_SYNC.

https://news.ycombinator.com/item?id=15539828


> sing fsync() to flush data to disk is, generally speaking, the wrong way to do things.

I think this is an old "statement" due to the Linux ext4 sync issue from quite awhile ago. As far as I know it has been fixed. I doubt it is discouraged on the BSDs.

But, curious, what is the Right Thing to do ? There are cases where you want to be sure the data reaches the platter.

On a mini I use to program on, if a filename started with an '@' sign, that said all writes went directly to disk. Based upon that were were able to create a roll-forward recovery for our custom applications. So, as far as I know, fsync(2) should accomplish the same thing using the open descriptor.


I stand by my position. The problem with fsync() is the question of when and where do errors get reported. If code checks errors on completion of each write it becomes significantly easier to know what write has failed. The error reporting is the granularity of the write being performed.

In contrast, fsync() has to report errors for everything that has happened since the last fsync() call. If you are doing anything complicated like a database, which is virtually every single modern application, you need to have some idea of which writes have failed to figure out how to recover. fsync() gives you none of that. Different .

In kernel you get results for each i/o performed, so filesystems can make the appropriate decision about how to respond to a failure.

If you're building a high performance application, fsync() is a disaster. Unrelated writes will get bundled up when different threads call fsync() on the same file descriptor. Having multiple file descriptors open on the same file is silly and causes issues with applications that need to have many file descriptors open. The Linux kernel goes to great lengths to ensure that filesystems are high performance for writes on multiple threads.

fsync() sucks for anything more complex than "I have a single file to write out".


it tends to be true. we do things at the application tier to minimize these things.

for example, assume you have to write A, B, C ... in syscalls it would be

write( A ), flush() write( B ), flush() write( C ), flush()

so if you add a debounce of say 4ms you get

Write( A ), Write( B ), Write( C ), Flush()

and saved 2 flush()-es

This is common with protocol aware storage applications.


And I guess the idea is furthermore that just because you don't fsync after every write does not mean that you haven't fsync-ed before responding to the user request saying the data is stored durably. I assume that you do actually guarantee the flush before returning a success to the user.


Yes, this is exactly how it works. It effectively batches the sync part of multiple user requests, so none of those user requests are ack-ed until the sync completes. It is a low-latency analogue to checkpointing storage write-backs to minimize the number of round-trips to the I/O syscalls.


exactly. you improve latency and throughput at the same time. is kinda cool. by delaying the reponses just a small bit you get huge benefits


Wrong. Use something like aio or io_uring to submit 4 asynchronous writes in a single system call and you'll get way better performance. The kernel has all kinds of infrastructure that tries to coalesce and batch things that the write()+fsync() syscalls make horribly inefficient, and in the modern world of really fast nvme drives, you want to make your calls into the device driver as efficient as possible. You'll burn far fewer CPU cycles by giving the kernel the whole set of i/os in one single go, burn less on synchronization. It really is better to avoid write() + fsync().


what makes you think that we haven't tested this? seastar's io engine is defaulted to io-uring... there are about 10 things here to comment on. on optimized kernels we disable block coalescing at the kernel level, second we tell the kernel to use fifo, etc. these low hanging fruit was already something we've done for a very very long time.


Right. But that line of thinking gets you to a place where one is like “how does anything work” haha.

In general this was a response to Confluent attempting to dismiss fsync() as a neat trick rather than an actual safety problem and why when we benchmarked we showcase the numbers we did.


Which numbers? The ones on the blog that shows Redpanda significantly lagging behind open source Kafka?

https://jack-vanlightly.com/blog/2023/5/15/kafka-vs-redpanda...


this is a burner account created at the time this post was up.


> Even in a relatively simple modern storage stack, say a single NVMe Flash device on a local PCIe controller, the question 'are these exact bits irrevocably committed to the media?' is pretty much impossible to answer without an NDA with the disk manufacturer and a hefty dose of proprietary IOCTLs (and even then, I would not bet my life on it).

This would limit throughput quite a bit and might significantly shorten the lifespan of the disk, but you could put the disk on a separate power supply that the computer could control. Don't consider data as written until you've power cycled the disk and been able to read that data back. :-)


> you need to be absolutely, positively, 100% sure that `fsync()` has done The Right Thing would be: "Good luck with that!"

But it's safe to assume the right thing is not done if we haven't even called `fsync` to start with and just hope and pray that the dirty page flushing mechanism does it for us.


I'm confused here. If I'm reading this correctly, the following is happening:

- Node 3 is the leader.

- Node 1 is isolated (failure #1).

- 10 records are written. They are successful because a majority is still alive (persisted to Node 2 and 3).

- Node 3 loses data (failure #2).

- Node 1 comes back up again.

However, isn't this two failures at the same time? Kafka with three nodes can only guarantee a single failure, no?


Strongly consistent protocols such as a Paxos and Raft always choose consistency over availability and when consistency isn't certain they refuse to answer.

Raft & Paxos: any number of nodes may be down, as soon as the majority is available a replicated system is available and doesn't lie.

Kafka as it's described in the post(): any number of nodes may be down, at most one power outage is allowed (loss of unsynced data), as soon as the majority is available a replicated system is available and doesn't lie.

The counter-example simulates a single power outage

() https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...


It just feels like two widely different scenarios we're talking about here.

https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-... talks about the case of a single failure and it shows how (a) Raft without fsync() loses ACK-ed messages and (b) Kafka without fsync() handles it fine.

This post on the other hand talks about a case where we have (a) one node being network partitioned, (b) the leader crashing, losing data, and combing back up again, all while (c) ZooKeeper doesn't catch that the leader crashed and elects another leader.

I think definitely the title/blurb should be updated to clarify that this is only in the "exceptional" case of >f failures.

I mean, the following paragraph seems completely misleading:

> Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.

The next section (and the Kafka example) is talking about loss of power on a single node combined with another node being isolated. That's very different from just "loss of power on a single node".


We can't ignore or pretend that network partitioning doesn't happen. When people talk about choosing two out of CAP the real question is C or A because P is out of our control.

When we combine network partitioning with single local data suffix loss it either leads to a consistency violation or to a system being unavailable desperate the majority of the nodes being are up. At the moment Kafka chooses availability over consistency.

Also I read Kafka source and the role of network partitioning doesn't seem to be crucial. I suspect that it's also possible to cause similar problem with a single node power-outage https://twitter.com/rystsov/status/1641166637356417027 and unfortunate timing


For what it’s worth, this form of loss wouldn’t be possible under KRaft since they (ironically?) use Raft for the metadata and elections. Ain’t nobody starting a new cluster with Zookeeper these days.


you are right, this is a failure of both zookeeper and leader node, so two independent failures at the same time


Exactly my thoughts.


I don't know what consistency Kafka is targeting but usually with 'consistent' systems in the situation where the cluster cannot serve valid data it should serve failure responses instead of giving back bogus data. if your system is 'eventually consistent' then that's a different game.



> A system must use cutting-edge Byzantine fault-tolerant (BFT) replication protocols, which neither of these systems currently employ.

Cutting-edge? pBFT (Practical Byzantine Fault Tolerance) was published in 1999. The first Tendermint release was in 2015. With few exceptions, almost all big proof of stake blockchains are powered by variations of pBFT and have been for many years.


Yep, I still consider them to be cutting edge. Paxos was written in 1990 but the industry adopted it only in 2010s. For example I've looked through pBFT and it doesn't mention reconfiguration protocol which is essential for industry use. I've found one from 2012 so it should be getting ripe by now.


Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption. https://www.usenix.org/conference/fast18/presentation/alagap...


> Protocol Aware Recovery is really needed if you want Raft to tolerate disk corruption

I believe OP does not make any claims about arbitrary log corruption. Neither raft nor Kafka protocol can handle it. It is about losing tail of the log due to fsync failures or rather lack of fsyncs.


the blog posts mentions that it is for any protocol that is non bft


I wish articles such as this would also talk about the risk-adjusted cost of absolute data safety or provide treatment discussing the cost of ensuring the last 0.00001% of data is unconditionally guaranteed in the event of black swan events. For example, what's better for the business? Double EC2 operating costs from $100K/mo to $200K/mo or pay out $10K in credits/yr as a direct loss to the business plus whatever the reputational or regulatory cost? Sometimes you need 100%. Sometimes you don't. Not acknowledging this tradeoff makes the article very FUD'y and is a disservice to engineers who get sucked into black-and-white thinking.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: