I just came here to thank Jepsen for these amazing reports. What a wonderful way to use your intellect to contribute to the wellbeing of the entire community.
Also I wanted to say to redpanda: I was on the fence but now I’m convinced. Will definitely be deploying on my next project, which has already kicked off. I only wish I could run it natively on MacOS instead of requiring docker.
Thanks for working with Jepsen. Being willing to subject your product to their testing is a huge boon for Redpanda's credibility.
I have two questions:
1. How surprising were the bugs that Jepsen found?
2. Besides the obvious regression tests for bugs that Jepsen found, how did this report change Redpanda's overall approach to testing? Were there classes of tests missing?
It wasn't a big surprise for us. Redpanda is a complex distributed system with multiple components even at the core level: consensus, idempotency, transactions. So we were prepared for something to be off (but we were pleased to find that all the safety issues were with things that were behind feature flags at the time).
Also, we have internal chaos tests, and by the time the partnership with Kyle started we had already identified half of the consistency issues and sent PRs with fixes. The issues got into the report because the fixes weren't released yet when the collaboration started. But it is acknowledged in the report:
> The Redpanda team already had an extensive test suite— including fault injection—prior to our collaboration. Their work found several serious issues including duplicate writes (#3039), inconsistent offsets (#3003), and aborted reads/circular information flow (#3036) before Jepsen encountered them
We missed the other issues because we hadn't exercised some scenarios. As soon as Kyle found the issues we were able to reproduce them with the in-house chaos tests and fix them. This dual testing approach (Jepsen + the existing chaos harness) was very beneficial. We were able to check the results and give Kyle feedback on whether he had found a real issue or something that looked more like expected behavior.
We fixed all the consistency (safety) issues, but there are several unresolved availability dips. We'll stick with Jepsen (the framework) until we're sure we've fixed them too, but after that we'll probably rely just on the in-house tests.
Clojure is a very powerful language and I was truly amazed how fast Kyle was able to adjust his tests to new information, but we don't have Clojure expertise and even simple tasks take time. So it's probably wiser to use what we already know, even if it's a bit more verbose.
I happened to know the Redpanda founder back in the days when he was at Concord.io (as a founder and the main dev). The level of obsession with performance and optimization of this guy was insane. He's not only extremely skilled with C++, but also very passionate about rethinking large and complex systems and rebuilding them to enable 10-100x speed improvements. It's like his personal hobby: take a piece of software everyone uses, and optimize it to the limits of physics, usually by implementing a better version from scratch himself :) Plus, he's an excellent communicator. Watching their team work, I always thought that successful companies can only be built with that level of passion and expertise in a single package.
Kyle is very friendly and I recommend reaching out. We can't and wouldn't disclose any pricing that is not public information; that would be unethical on my part. All I can say is we wish to continue our work with him indefinitely, as long as we keep making progress on the product :)
There's been a lot of interest in this, and I keep meaning to put together an open-to-the-public class when I'm less swamped! Might get a chance to do this shortly--I'll post on https://groups.google.com/a/jepsen.io/g/announce when it happens.
This isn't anything against Redpanda, but I'm always amazed at how badly all these distributed databases do in Jepsen.
What would one use them for in practice that wouldn't be better served by, say, PostgreSQL with streaming replication in case the server goes down (the thing I've used)? (I'm not suggesting there isn't a good application, just that I'm not knowledgeable enough to know of one.)
When a distributed database is designed, you must navigate and optimize several complex technical tradeoffs to meet the architecture and product objectives. The specific set of tradeoffs made -- and they are different for every platform -- will determine the kinds of data models and workloads that the database will be suitable for, especially if performance and scalability are critical as in this case.
The reason distributed databases tend to be buggy, especially in the first iterations, is straightforward if not simple to address. While it is convenient to describe technical design tradeoffs as a set of discrete, independent things, in a real implementation they are all interconnected in subtle, complex, nuanced ways. Modifying one design tradeoff in code can have unanticipated consequences for other intended tradeoffs. In other words, there isn't a set of simple tradeoffs; there is a single, extremely high-dimensionality tradeoff that is being optimized. Not only are complex high-dimensionality design elements difficult to reason about when writing the code the first time, but any changes to the code may shift how the tradeoffs interact in non-obvious ways. Humans have finite cognitive budgets, so unless it is obvious that a code change has the potential for unintended side effects, we generally don't spend the time to fully verify this.
I can't tell you how many times I've seen tiny innocuous code changes alter the behavior of distributed databases in surprising ways. This is also why once the core code seems to be correct, people are reluctant to modify it if that can be avoided at all.
I'm constantly surprised more folks don't use FoundationDB. I'm pretty sure the Jepsen folks said something to the effect that the way FoundationDB is tested goes far beyond what Jepsen does (good talk on FDB testing: https://www.youtube.com/watch?v=4fFDFbi3toc).
My read is that most use cases just need something that works _enough_ at scale that the product doesn't fall over and any issues introduced by such bugs can be addressed manually (i.e. through customer support, or just sufficient ad-hoc error handling). Couple that with the investment some of these databases have put into onboarding and developer-acquisition, and you have something that can be quite compelling even compared to something which is fundamentally more correct.
As someone who is switching to FoundationDB: because it's not easy. It doesn't look like other databases, it isn't in fashion (yes, these things matter), and it requires thinking and adapting your application to really use it to its full potential. It could also benefit from a bit more developer marketing.
Having looked at FoundationDB a bit it wasn't clear why I would choose it. It has transactions, which is nice, but not that big of a deal despite how much time they put into talking about it. I actually don't even need transactions since all of my writes commute, so it's particularly uninteresting to me.
They say they're fast, but I didn't find a ton of information about that.
Ultimately the sell seemed to be "nosql with transactions" and I just couldn't justify putting more time into it. I did watch their excellent talk on testing, and I respect that they've put that level of effort into it, and it was why I even considered it, but yeah, what am I missing?
Different systems solve different problems and have different functional characteristics. Actually, one of the things Kyle highlighted in his report is write cycles (the G0 anomaly); it isn't a problem with the Redpanda implementation but a fundamental property of the Kafka protocol. Records in the Kafka protocol don't have preconditions and they don't overwrite each other (unlike database operations), so it doesn't make sense to enforce an order on the transactions and it's possible to run them in parallel. This gives enormous performance benefits and doesn't compromise safety.
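To make that concrete, here's a minimal sketch (my own illustration, not from the report) of two transactional producers appending to the same pair of topics in opposite orders. In a database this interleaving could form a G0 write cycle; with Kafka both transactions are pure appends with no preconditions, so both can commit safely. The broker address, topic names, and transactional ids are all made up:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import java.util.Properties;

    public class ParallelTxnSketch {
        static KafkaProducer<String, String> producer(String txnId) {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, txnId);             // hypothetical ids
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> kp = new KafkaProducer<>(p);
            kp.initTransactions();
            return kp;
        }

        public static void main(String[] args) {
            KafkaProducer<String, String> t1 = producer("txn-1");
            KafkaProducer<String, String> t2 = producer("txn-2");

            // T1 appends to topic-a then topic-b; T2 appends to topic-b then topic-a.
            // If the appends interleave, the log order forms a write "cycle" (G0 in
            // database terms), but since records never overwrite each other and have
            // no preconditions, both transactions can still commit.
            t1.beginTransaction();
            t2.beginTransaction();
            t1.send(new ProducerRecord<>("topic-a", "k", "from-t1"));
            t2.send(new ProducerRecord<>("topic-b", "k", "from-t2"));
            t1.send(new ProducerRecord<>("topic-b", "k", "from-t1"));
            t2.send(new ProducerRecord<>("topic-a", "k", "from-t2"));
            t1.commitTransaction();
            t2.commitTransaction();

            t1.close();
            t2.close();
        }
    }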
There's a lot of different ways to answer this, but I think about it as a different architectural paradigm. Yes you can do stream-ish things with Postgres but at some level of scale you'd be putting a square peg in a round hole.
Databases are only as bad as the filesystem wrt being mutable, but since we do expect them to be a "better filesystem" it's surprising we let them get away with losing data. IMO beyond transactions you should just be able to unwind any writes to a SQL database, including deletes, for at least a day.
But if you ask Alan Kay he'd say all programs should have an explicit concept of time and be able to operate on the past and future state of everything.
CAP theorem and configuration. When choosing a distributed database you are, in a way, already by definition giving up a chunk of the C, Consistency.
Once you have a distributed database, they often have a myriad of tuning parameters that all affect which corner of the CAP triangle you end up in. Using these you have to choose what risks you are willing to accept. If all my replicas are in the same rack, any timeout issues found would often just be academic, and I can pragmatically design such a system very differently than if they are in different parts of the globe.
I might also have such a high influx of low-value data that I can accept some losses; the cost of a total deadlock would be more than just one transaction lost in cyberspace, and not everyone is building a bank. That said, it still sucks a lot to have inconsistent data regardless of your application, so in such cases aim for lost data rather than wrong data. So in theory the distributed approach is considered wrong, but in practice it might just be good enough.
This also ties into how you model your data: a lot of the faults found in the latest MongoDB analysis were around multi-document transactions, but nobody uses MongoDB this way; mostly it's just a place where you dump standalone documents.
In the end, you have to go back to the original question of why you are choosing a distributed DB in the first place: is it for scale, HA, regionality, or other reasons? Then design around that; it's never a silver-bullet, fix-all solution.
Taking your example of streaming replication: how would that behave if the primary acked to the client and then crashed before the replica received the transaction? The alternative of waiting for the replica before you ack instead gives you two sources of failure. You've now reinvented a distributed system and are in the same soup as all these other databases, just with other tradeoffs. :)
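For concreteness, in PostgreSQL that choice boils down to whether the primary waits for a standby before acking. A rough sketch of the synchronous variant in the primary's postgresql.conf (the standby name is made up, and this is only the gist, not a complete HA setup):

    # primary's postgresql.conf (sketch, names are hypothetical)
    synchronous_standby_names = 'FIRST 1 (standby1)'   # wait for this standby...
    synchronous_commit = on                            # ...before acking commits to clients

With these unset (async replication), the primary can ack and then crash before the standby has the transaction; with them set, a standby outage can block commits instead. Same tradeoff, different failure mode.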
> When choosing a distributed database you are in a way already by definition giving up a chunk of the C, Consistency.
Er, I don't think that's right. Distributed systems (even those running in asynchronous networks) can and often do satisfy linearizability. Gilbert & Lynch's proof of the CAP theorem just says that if you do choose linearizability (C) in an asynchronous network (P), you can't also guarantee total availability (A)--under some network faults, some operations may not complete. https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.p...
Totally different approaches though. People have tried what you proposed many times before, and at some scale it has succeeded. Hard to compare at all when you dig into the details.
Expect a companion post. It was super fun to partner with Kyle on this. +1, would recommend to anyone building a storage system.
This report seems to have some wrong insights. Auto-committing offsets doesn't imply data loss if records are processed synchronously. This is the safest way to test Kafka, rather than committing offsets manually.
I'm afraid the article is also wrong; this is a typical misconception when working with Kafka. Offsets are committed in the next poll() invocation. If the previous messages weren't processed, a rebalance occurs and the messages are processed by another instance. This is an implementation detail of the Java client library, but it allows at-least-once semantics with auto-commit. The book Effective Kafka has a better explanation.
librdkafka isn't part of official Kafka, so it may have problems with this, as it has other limitations.
In any case, the report isn't right about this, and it doesn't use the safest options. Committing offsets manually is the most flexible way, but it isn't easy; the more usual mistake is committing offsets individually.
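To make the auto-commit vs. manual distinction concrete, here's a minimal sketch (my own, not from the report) of the manual pattern: disable auto-commit and only commit after the whole batch from a poll has actually been processed. The broker address, group id, and topic name are made up:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ManualCommitSketch {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            p.put(ConsumerConfig.GROUP_ID_CONFIG, "sketch-group");            // hypothetical group
            p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // manual commits only
            p.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
            p.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
                consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        process(r); // do the actual work first
                    }
                    // Commit the whole batch only after processing succeeded: a crash
                    // before this line means reprocessing, not loss (at-least-once).
                    consumer.commitSync();
                }
            }
        }

        static void process(ConsumerRecord<String, String> r) {
            System.out.printf("%s[%d]@%d %s%n", r.topic(), r.partition(), r.offset(), r.value());
        }
    }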
> Offsets are committed in the next poll() invocation.
I'm a little surprised by this--not that you're necessarily wrong, but our tests consumed messages synchronously, and IIRC (pardon, it's been 3 months since I was working on Redpanda full time and my time to go get a repro case is a bit limited) did see lost messages with the default autocommit behavior. At some point I'll have to go dig into this again.
Always great to read this. I performed a Jepsen test on Microsoft internal infra and it was a huge insight. From an academic side it's just as interesting to look into the lack of standards around consistency and the definitions of it.
Cool! What did you test? I played with Jepsen and Cosmos DB when I was at Microsoft, but we had to ditch SSH, write a custom agent, and inject faults with PowerShell cmdlets.
> A KafkaConsumer, by contrast, will happily connect to a jar of applesauce and return successful, empty result sets for every call to consumer.poll. This makes it surprisingly difficult to tell the difference between “everything is fine and I’m up to date” versus “the cluster is on fire”, and led to significant confusion in our tests.
This tickled my funny bone. Never expected humor in a Jepsen writeup. Kudos!
Well, when you think about distributed databases, you have a bunch of servers "calling" one another. Some of Jepsen's testing is about disrupting that and seeing what happens.
The mess is mostly the result of the mismatch between the classic database transactional model and the Kafka transactional model (the G0 anomaly). If you read the documentation without a database background it seems fine, but once you notice the differences between the models it becomes hard to tell whether something is a bug or a property of the Kafka protocol.
There is a lot of research happening in this area even in the database world. The list of isolation levels isn't final, and some of the recent developments include PC-PSI and NMSI, which also seem to "violate" order. I hope one day we get a formal academic description of the Kafka model. It looks very promising.
I agree--rystsov has covered this well here and in other parts of the thread. I just want to add that some of the Kafka documentation did claim writes were isolated (where other Kafka documentation contradicts that claim!) so it's possible that depending on which parts of the docs users read, they might expect that G0 was prohibited. That's why this report discusses it in such detail. :-)
Usually I start with a couple of seed papers, then follow the references and look at the other papers the authors wrote. When a PhD student explores an area they write several papers on the topic, so there is a lot of material to read. But the real gem is the thesis: it has depth, context, and a lot of links to other work in the area.
I wonder if Redpanda thinks about or offers some alternative protocol that would be better defined in terms of transaction guarantees. At this point it looks like Kafka’s protocol was a nice try but it needs a major refactoring.
The documentation is a bit confusing: the protocol evolved over time (new KIPs) and there is a mismatch between the database model and the Kafka model. But we see a lot of potential in the Kafka transactional protocol.
At Redpanda we were able to push 5k distributed transactions across replicated shards. It would be mind-blowing for a database to achieve the same result.
Redpanda (back when they were VectorizedIO) spammed my work email after I starred one of their repos, denied it after I called them out on it and I just noticed that they had deleted their response to me.
Pretty sneaky to go back and delete the tweets first denying and then apologizing.
hi newman314 - i mentioned in the tweet this was a mistake and offered an apology there. an sdr reached out to you, and when i realized that, i apologized. no ill intent. feel free to test this with a fake github account. my tweets automatically delete after 6 months, all of them, on a rolling window. nothing special about this interaction. there is no sneakiness, though feel free to disagree.