Yep, I still consider them to be cutting edge. Paxos was written in 1990 but the industry adopted it only in the 2010s. For example, I've looked through pBFT and it doesn't mention a reconfiguration protocol, which is essential for industry use. I've found one from 2012, so it should be getting ripe by now.


Strongly consistent protocols such as Paxos and Raft always choose consistency over availability, and when consistency isn't certain they refuse to answer.

Raft & Paxos: any number of nodes may be down; as soon as a majority is available, the replicated system is available and doesn't lie.

Kafka as it's described in the post (*): any number of nodes may be down and at most one power outage is allowed (loss of unsynced data); as soon as a majority is available, the replicated system is available and doesn't lie.

The counter-example simulates a single power outage

(*) https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...


It just feels like we're talking about two very different scenarios here.

https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-... talks about the case of a single failure and it shows how (a) Raft without fsync() loses ACK-ed messages and (b) Kafka without fsync() handles it fine.

This post on the other hand talks about a case where we have (a) one node being network partitioned, (b) the leader crashing, losing data, and coming back up again, all while (c) ZooKeeper doesn't catch that the leader crashed and elects another leader.

I think the title/blurb should definitely be updated to clarify that this only happens in the "exceptional" case of >f failures.

I mean, the following paragraph seems completely misleading:

> Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.

The next section (and the Kafka example) is talking about loss of power on a single node combined with another node being isolated. That's very different from just "loss of power on a single node".


We can't ignore network partitioning or pretend that it doesn't happen. When people talk about choosing two out of three in CAP, the real question is C or A, because P is out of our control.

When we combine network partitioning with the loss of a local data suffix on a single node, it either leads to a consistency violation or to the system being unavailable despite a majority of the nodes being up. At the moment Kafka chooses availability over consistency.

Also, I've read the Kafka source and the role of network partitioning doesn't seem to be crucial. I suspect it's also possible to cause a similar problem with a single-node power outage and unfortunate timing: https://twitter.com/rystsov/status/1641166637356417027


For what it’s worth, this form of loss wouldn’t be possible under KRaft since they (ironically?) use Raft for the metadata and elections. Ain’t nobody starting a new cluster with Zookeeper these days.


You are right, this is a failure of both ZooKeeper and the leader node, so two independent failures at the same time.


Exactly my thoughts.


Hey folks, I wrote this post and I'm happy to answer questions


I realize running it through Mr. Kingsbury's Jepsen is a solid PR move; 10/10 top nerds love what submitting to Jepsen signals: confidence and commitment to correct behavior. I doubt anyone would fault you for dropping a hundred grand, it's more or less the ticket price to enter the arena of "proper" distributed systems.

I'm curious though: what _new_ bugs or integrity violations did you learn about from the Jepsen runs? Your post mentions you were already aware of most or all of the problems through in-house chaos monkey testing. Did I read that correctly?


I've been following Jepsen results since Kyle's first post, and it's amazing that the blog post series became a well-respected company.

The report revealed the following unknown consistency issues which we had to fix:

  - duplicated writes by default
  - aborted read with InvalidTxnState
  - lost transactional writes
The first issue was caused by the client: starting with recent versions the client has idempotency on by default, but when the server side doesn't support it (we had idempotency behind a feature flag) the client doesn't complain. We will enable idempotency by default in 21.1.1, so it shouldn't be an issue. It's also possible to turn the feature flag on for older versions.
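
If anyone wants to double-check their own setup, here's a minimal sketch (standard Java client; the broker address and topic name are made up) of turning idempotence on explicitly instead of relying on the client-side default:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class IdempotentProducerExample {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      // don't rely on the client-side default: ask for idempotence explicitly,
      // so a broker that doesn't support it is easier to notice during testing
      props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
      // idempotence requires acks=all on the producer side
      props.put(ProducerConfig.ACKS_CONFIG, "all");
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>("events", "key", "value"));
      }
    }
  }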

The other two issues were related to transactions; we hadn't chaos-tested the abort operation since it's very similar to commit, but even the tiny difference in logic was enough to hide the bug. It's fixed now too.


Cool! What did you test? I played with Jepsen and Cosmos DB when I was at Microsoft, but we had to ditch ssh, write a custom agent, and inject faults with PowerShell cmdlets.


Can you clarify what you mean? AFAIK with manual commit you have the most control over when the commit happens

Take a look at this blog post describing data loss caused by auto-commit: https://newrelic.com/blog/best-practices/kafka-consumer-conf...

There may also be more subtle issues with auto-commit: https://github.com/edenhill/librdkafka/issues/2782
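
To make the "control" point concrete, here's a rough sketch of the manual-commit pattern with the Java consumer (broker address, group id, and topic are made up): auto-commit is disabled and the offsets are committed only after the records have actually been processed.

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class ManualCommitExample {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
      props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      // disable auto-commit: offsets advance only when we say so
      props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(List.of("events"));
        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
          for (ConsumerRecord<String, String> record : records) {
            process(record); // do the actual work first
          }
          // commit only after every record from this poll has been processed,
          // so a crash replays records instead of silently dropping them
          consumer.commitSync();
        }
      }
    }

    static void process(ConsumerRecord<String, String> record) {
      System.out.println(record.value());
    }
  }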


I'm afraid the article is also wrong; this is a typical misconception when working with Kafka. Offsets are committed in the next poll() invocation. If the previous messages weren't processed, a rebalance occurs and the messages are processed by another instance. This is an implementation detail of the Java client library, but it allows at-least-once semantics with auto-commit. The book Effective Kafka has a better explanation.
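
A rough sketch of that timing with the Java consumer (broker address, group id, and topic are placeholders), assuming records are processed synchronously inside the loop: with auto-commit left at its default, the offsets of records returned by one poll() are committed by a later poll() (or close()), i.e. only after the loop body has already handled them.

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class AutoCommitTimingExample {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(ConsumerConfig.GROUP_ID_CONFIG, "auto-commit-demo");
      props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      // enable.auto.commit defaults to true; left as-is on purpose
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(List.of("events"));
        while (true) {
          // this call is also where the Java client auto-commits the offsets of
          // records returned by the *previous* poll (subject to
          // auto.commit.interval.ms), i.e. after the loop below handled them
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
          for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.value()); // process synchronously
          }
        }
      }
    }
  }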

librdkafka isn't part of official Kafka, so it may have problems with this, just as it has other limitations.

In any case, the report isn't right about this and it doesn't use the safest options. Committing offsets manually is the most flexible way, but it isn't easy; the most common error is committing offsets individually.


> Offsets are committed in the next poll() invocation.

I'm a little surprised by this--not that you're necessarily wrong, but our tests consumed messages synchronously, and IIRC (pardon, it's been 3 months since I was working on Redpanda full time and my time to go get a repro case is a bit limited) we did see lost messages with the default autocommit behavior. At some point I'll have to go dig into this again.


The documentation is a bit confusing: the protocol evolved over time (new KIPs) and there is a mismatch between the database model and the Kafka model. But we see a lot of potential in the Kafka transactional protocol.

At Redpanda we were able to push up to 5k distributed transactions across replicated shards. It would be mind-blowing for a database to achieve the same result.

Also, the Kafka transactional protocol works at a low level, so it's very easy to build systems on top of it. For example, it's very easy to build a Calvin-inspired system: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...


The mess is mostly the result of the mismatch between the classic database transactional model and the Kafka transactional model (the G0 anomaly). If you read the documentation without a database background it seems OK, but once you notice the differences between the models it becomes hard to tell whether something is a bug or a property of the Kafka protocol.

There is a lot of research happening around this area, even in the database world. The list of isolation levels isn't final, and some of the recent developments include PC-PSI and NMSI, which also seem to "violate" the order. I hope one day we get a formal academic description of the Kafka model. It looks very promising.


I agree--rystsov has covered this well here and in other parts of the thread. I just want to add that some of the Kafka documentation did claim writes were isolated (where other Kafka documentation contradicts that claim!) so it's possible that depending on which parts of the docs users read, they might expect that G0 was prohibited. That's why this report discusses it in such detail. :-)


Are there good research groups or journals to follow to keep apprised of the state of the art here?


I created this list a while ago: https://github.com/redpanda-data/awesome-distributed-transac.... Maybe it's time to update it.

Usually I start with a couple of seed papers, then follow the references and look at the other papers the authors wrote. When a PhD student explores an area they write several papers on the topic, so there is a lot of material to read. But the real gem is the thesis: it has depth, context, and a lot of links to other work in the area.


Thank you for putting this together!


Different systems solve different problems and have different functional characteristics. Actually, one of the things Kyle highlighted in his report is write cycles (the G0 anomaly); it isn't a problem with the Redpanda implementation but a fundamental property of the Kafka protocol. Records in the Kafka protocol don't have preconditions and they don't overwrite each other (unlike database operations), so it doesn't make sense to enforce an order on the transactions and it's possible to run them in parallel. This gives enormous performance benefits and doesn't compromise safety.
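
To make the append-only point concrete, here's a hypothetical sketch with two transactional Java producers (the transactional ids, topic name, and broker address are all invented): both transactions write to the same partitions concurrently, and because the records are pure appends with no preconditions, the interleaving (a G0 write cycle) doesn't lose or overwrite anything, so serializing the transactions wouldn't add safety.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class ParallelTransactionsSketch {
    static KafkaProducer<String, String> producer(String txnId) {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, txnId);
      KafkaProducer<String, String> p = new KafkaProducer<>(props);
      p.initTransactions();
      return p;
    }

    public static void main(String[] args) {
      KafkaProducer<String, String> a = producer("txn-a");
      KafkaProducer<String, String> b = producer("txn-b");

      // both transactions append to partitions 0 and 1 of the same topic;
      // since records never overwrite each other, the two transactions can
      // interleave (a write cycle) without losing or corrupting anything
      a.beginTransaction();
      b.beginTransaction();
      a.send(new ProducerRecord<>("events", 0, "k", "a->p0"));
      b.send(new ProducerRecord<>("events", 0, "k", "b->p0"));
      b.send(new ProducerRecord<>("events", 1, "k", "b->p1"));
      a.send(new ProducerRecord<>("events", 1, "k", "a->p1"));
      a.commitTransaction();
      b.commitTransaction();

      a.close();
      b.close();
    }
  }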


Hey folks, I worked with Kyle Kingsbury on this report from the Redpanda side and I'm happy to help if you have questions.


Thanks for working with Jepsen. Being willing to subject your product to their testing is a huge boon for Redpanda's credibility.

I have two questions:

1. How surprising were the bugs that Jepsen found?

2. Besides the obvious regression tests for bugs that Jepsen found, how did this report change Redpanda's overall approach to testing? Were there classes of tests missing?


It wasn't a big surprise for us. Redpanda is a complex distributed system with multiple components even at the core level (consensus, idempotency, transactions), so we were prepared for something to be off (but we were pleased to find that all the safety issues were with things that were behind feature flags at the time).

Also, we have internal chaos tests, and by the time the partnership with Kyle started we had already identified half of the consistency issues and sent PRs with fixes. The issues still made it into the report because the changes hadn't been released yet when we started. But this is acknowledged in the report:

> The Redpanda team already had an extensive test suite— including fault injection—prior to our collaboration. Their work found several serious issues including duplicate writes (#3039), inconsistent offsets (#3003), and aborted reads/circular information flow (#3036) before Jepsen encountered them

We missed the other issues because we hadn't exercised some scenarios. As soon as Kyle found them, we were able to reproduce them with the in-house chaos tests and fix them. This dual-testing approach (Jepsen plus the existing chaos harness) was very beneficial: we were able to check the results and give Kyle feedback on whether he had found a real issue or something that looked more like expected behavior.

We fixed all the consistency (safety) issues, but there are several unresolved availability dips. We'll stick with Jepsen (the framework) until we're sure we've fixed them too. After that we'll probably rely just on the in-house tests.

Clojure is a very powerful language and I was truly amazed how fast Kyle was able to adjust his tests to new information, but we don't have Clojure expertise and even simple tasks take time. So it's probably wiser to use what we already know, even if it's a bit more verbose.


A complete nit, but the testimonial from the CTO of The Seventh Sense on https://redpanda.com/ spells Redpanda as "Redpand".


Thank you. Fixed.


Yes and no, we do both. For acks=-1 we wait for fsync and replication confirmation, but for other modes we relaxed the behavior.
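
For context, this is the producer-side knob being referred to; a minimal sketch assuming the standard Java client and a made-up broker address and topic (acks=-1 is the same as acks=all):

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class AcksAllProducer {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      // acks=-1 (alias "all"): per the comment above, this is the mode where the
      // write is only acknowledged after replication (and, on Redpanda, fsync);
      // acks=0/1 relax that for lower latency
      props.put(ProducerConfig.ACKS_CONFIG, "-1");
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>("events", "key", "value"));
      }
    }
  }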

