I realize running it through Mr. Kingsbury's Jepsen is a solid PR move; 10/10 top nerds love what submitting to Jepsen signals: confidence and a commitment to correct behavior. I doubt anyone would fault you for dropping a hundred grand; it's more or less the ticket price to enter the arena of "proper" distributed systems.
I'm curious, though: what _new_ bugs or integrity violations did you learn about from the Jepsen runs? Your post mentions you were already aware of most or all of the problems through in-house chaos monkey testing. Did I read that correctly?
I've been following the Jepsen results since Kyle's first post, and it's amazing that the blog post series became a well-respected company.
The report revealed the following previously unknown consistency issues, which we had to fix:
- duplicated writes by default
- aborted read with InvalidTxnState
- lost transactional writes
The first issue was caused by the client: recent client versions have idempotency on by default, but when the server side doesn’t support it (we had idempotency behind a feature flag), the client doesn’t complain and retried writes can end up duplicated. We will enable idempotency by default in 21.1.1, so it shouldn't be an issue. It's also possible to turn the feature flag on for older versions.
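For anyone unfamiliar with why this produces duplicates, here is a toy sketch of the general mechanism. It is not the actual client or server code: `Broker`, `send_with_lost_ack`, and `run` are hypothetical names, and the model is heavily simplified (a single log, one retry per write, no sequence-gap checks). It assumes the usual scheme in which the producer tags each record with a producer ID and a sequence number, and the server drops a retry it has already appended.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Toy broker (hypothetical): one log, optional sequence-number dedup."""
    idempotency_enabled: bool
    log: list = field(default_factory=list)
    # producer_id -> highest sequence number already appended
    last_seq: dict = field(default_factory=dict)

    def append(self, producer_id, seq, value):
        if self.idempotency_enabled:
            if seq <= self.last_seq.get(producer_id, -1):
                return  # retry of a record we already have: ack without appending
            self.last_seq[producer_id] = seq
        # With the feature off, the sequence number is simply ignored.
        self.log.append(value)


def send_with_lost_ack(broker, producer_id, seq, value):
    """Simulate a write whose acknowledgement is lost, so the client retries once."""
    broker.append(producer_id, seq, value)  # first attempt lands, ack is lost
    broker.append(producer_id, seq, value)  # client retries with the same seq


def run(idempotency_enabled):
    broker = Broker(idempotency_enabled)
    for seq, value in enumerate(["a", "b", "c"]):
        send_with_lost_ack(broker, producer_id=1, seq=seq, value=value)
    return broker.log


print(run(True))   # dedup on the server side
print(run(False))  # server ignores sequence numbers, client is none the wiser
```

In this model the client behaves identically in both runs, which is the point: it believes its retries are safe, and only the server-side setting decides whether they are.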
The other two issues were related to transactions. We hadn't chaos-tested the abort operation since it’s very similar to commit, but even that tiny difference in logic was enough to hide the bug. It’s fixed now too.