Personally, this kind of thing actually gives me _more_ confidence in Postgres rather than less. The core team's responsiveness to this bug report was incredibly impressive.
Around June 4th, the article's author comes in with a bug report that basically says "I hammered Postgres with a whole bunch of artificial load and made something happen" [1].
By the 8th, a preliminary patch is ready for review [2]. That includes all the time to get the author's testing bootstrap up and running, reproduce, diagnose the bug (which, lest we forget, is the part of all of this that is actually hard), and assemble a fix. It's worth noting that it's no one's job per se on the Postgres project to fix this kind of thing — the hope is that someone will take interest, step up, and find a solution — and as unlikely as that sounds to work in most environments, amazingly, it usually does for Postgres.
Of note to the hacker types here, Peter Geoghegan was able to track the bug down through the use of rr [4] [5], which allowed an entire problematic run to be captured, and then stepped through forwards _and_ backwards (the latter being the key for not having to run the simulation over and over again) until the problematic code was identified and a fix could be developed.
With distributed and multicore being the path forward now that Moore's law is ending, your work has been instrumental in helping open source distributed systems improve.
Because distributed systems are so difficult and complicated, salespeople and zealots can both deny issues and overstate capabilities.
Your work is a shining star in that darkness. Thank you.
Thank you for this comment that gives credit where it's due, this is a very impressive set of threads to read through.
And I agree. For me, one of the most important measures of the reliability of a system is how that system responds to information that it might be wrong. If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand, is technical, validating, and prompt.
Every system has bugs, but depending on these cultural features, not every system is capable of systematically removing those bugs. With logs like these, the pg community continues to prove capable. Kudos!
>If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand is technical, validating, and prompt.
This resonates with me for teams inside a company as well.
We have a few teams that just deflect issues. They'll find any issue with the bug report, be it an FQDN in a log search, and poof, it goes back to sender; they don't care. Engineers on my team just don't bother to report bugs there anymore, regardless of how simple they are. Usually it's faster and less frustrating to just work around the problem or ignore it. You could be fighting windmills for weeks, or just fudge around it.
Other teams are far more receptive to bugs: engineers end up curious and just poke around until they understand what's up. And then you get bug reports like "OK, if I create these 57 things over here, and toggle thing 32 to off, and then toggle two things indexed by prime numbers on, then my database connection fails. I've reproduced this from empty VMs. If 32 is on, I need to toggle two perfect squares on, but not 3." And then a lot of things just get fixed.
If you reject the "it's not that bad" response then you are saying all bugs are equal in severity / urgency / priority.
Which is obviously not true.
A better response might be something like "I understand the problem is X however the fix might take Y months and the next release will deprecate that feature anyway. Can you quantify or guesstimate the impact on our product if we allow this bug to stay unfixed?"
Yeah, in retrospect, I disagree with myself, too. (And agree with your rationale!) :)
I think if I were to try and salvage the quote, I'd change it to "It's probably not that bad". In this telling, I'm meaning to imply that the responder has made an a priori value judgement about a bug without determining scope and cause.
I agree with your ideal response. The only caveat I'd personally make is that I'd recommend "planting the flag" by providing your own guesstimate of product impact, and asking the reporter for their opinion on moving it.
In my perspective, it's important to acknowledge that it has an impact, and it's especially important to convey your sense of impact concretely with some napkin math. If you're terribly wrong, the reporter will try to convince you of it[1], and you'll learn something along the way. It also conveys empathy (i.e. we're both trying to solve your problem, even if it's only a problem for you).
Practically everyone in distributed systems knows who Aphyr is, and to get a bug reported by him is like a badge of honor. I am willing to bet that people at PostgreSQL jousted over who got to work with him on this bug, because it would be such an amazing learning opportunity. I know I would want to!
> Memory-mapped files are almost entirely just the executables and libraries loaded by tracees. As long as the original files don’t change and are not removed, which is usually true in practice, their clones take up no additional space and require no data writes
> Like cloned files, cloned file blocks do not consume space as long as the underlying data they’re cloned from persists.
they conclude the section with:
> In any case, in real-world usage trace storage has not been a concern
I imagine that over "longer debugging sessions" the metadata footprint would expand linearly, but probably with a constant smaller than the logs for the average program.
Indeed — it's great to see a vendor (team, in this case) that doesn't try to downplay a Jepsen result, and instead fixes the issues.
However, there is one more takeaway here. I've heard too many times "just use Postgres", repeated as an unthinking mantra. But there are no obvious solutions in the complex world of databases. And this isn't even a multi-node scenario!
I don't think the "just use Postgres" mantra takes any hits at all from this. (If anything, I feel better about it).
I've used maybe a dozen (?) databases/stores over the years - graph databases, NoSQL databases, KV stores, the most boring old databases, the sexiest new databases - and my general approach is now to just use Postgres unless it really, really doesn't fit. Works great for me.
I too have been there, done that when it comes to persistent storage, but my similar conclusion is “use SQLite unless it really, really doesn’t fit, in which case, use PostgreSQL.” (It’s usually easy to figure out which of the two you’ll end up needing.)
I'm happy Postgres works for you. It works for me, too, in a number of setups. But one should never accept advice like "just use Postgres" without thinking and careful consideration. As the Jepsen article above shows.
I have rarely seen people give "just use PostgreSQL" as advice, but rather "just use PostgreSQL unless you have a compelling reason not to". There's a pretty big difference between the two.
"just use PostgreSQL unless you have a compelling reason not to" is just as a bad advice tbh. At work we did not even consider it because no HA support / availability. What do you do exactly when your master is down, your entire backend just stop working? PG is not a silver bulet especially not when you're looking for good managed / HA solution.
There is no such thing as "use x.y.z unless" this statement does not make any sense.
Setting up a HA PostgreSQL setup might not be as easy as with some turnkey solutions where you just deploy three instances and then it (probably) works, but I'd still trust my data to a simple PostgreSQL replication setup over something fancier where the behaviour in failure scenarios is not as straightforward.
In simple cases, what people usually need is what I would call "operational" HA anyway; database servers going down unintentionally is (hopefully!) not a common occurrence, and even with a super simple 2-node primary/failover asynchronous replication setup, you can operate on the database servers with maybe a few seconds of downtime, which is about the same as you'd get from "natively" clustered solutions.
In true failure scenarios, you might get longer downtime (though setting up automatic failover with eg. pacemaker is also possible), but in situations where that ends up truly mattering you likely have the resources to make it work anyway.
The classic high availability setup for postgresql is to have standby instances, usually one, ready to become the new primary if the primary goes down.
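For a rough idea of what the replication half of that looks like on the primary, here's a minimal sketch. The names are made up for illustration, and it assumes a streaming standby is already running and registered with application_name 'standby1':

-- On the primary: require one synchronous standby to acknowledge commits.
alter system set synchronous_standby_names = 'FIRST 1 (standby1)';
-- 'remote_apply' waits until the standby has applied the WAL; plain 'on'
-- only waits for the WAL to be flushed on the standby.
alter system set synchronous_commit = 'remote_apply';
select pg_reload_conf();

-- Verify the standby is attached and marked as sync.
select application_name, state, sync_state from pg_stat_replication;

Promoting the standby (e.g. select pg_promote(); on PG 12+, or pg_ctl promote) still has to be triggered by something that has decided the primary is really gone, which is the part tools like Patroni, Stolon, or pacemaker automate, and where most of the disagreement in this thread lives.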
This won't work for everyone, but if you are on the AWS stack, then using PostgreSQL on Amazon RDS is a compelling option. There are 6 copies of your data, stored on 3 servers and in 3 separate availability zones, and the replication between the nodes is less than 1s (usually around 50ms). I'd agree that a simple HA solution is the most obvious thing lacking for PostgreSQL, but with RDS it's not something you have to worry about anymore.
No one claims it to be, hence the "unless there's a good reason not to" part. However, it is a good default that works well for a wide range of – though hardly all – use cases.
Replication in general is not PostgreSQL's strongest point and if you require advanced replication then PostgreSQL may indeed not be the choice for you, but most projects don't need this. That being said, "no HA support" is just not true.
"Use PostGres until you have an engineering - data driven rationale not to" is my standard answer for non-blob data storage when a project starts.
Why? because when `n` is small (tables, rows, connections), postgres works well enough, and if `n` should ever become large, we'll have interest in funding the work to do a migration, if that's appropriate - and we'll be able to evaluate the system at scale with real data.
postgres works great for `n` in ranges in excess of (10k, 10B, 1k) at least, which not many would consider small.
The only real justification I've heard is that Postgres' replication story still lags MySQL's, plus the write-amplification issue, which they've done a lot of work on for pg13 (see [0]).
Even for many very large companies I've seen, noSQL databases are mostly used as caches that can afford to be delayed or a bit out of sync, not as the database of record.
I mean, you can go a loooong way on Postgres. Things that probably are going to start causing issues:
- the story of replication
- ultra-high volume write contention
- high volume timeseries data streams
For #3 there, I was consulting on a project a few years ago where there was about 1 gigabyte of timeseries data in Postgres/RDS. Some ad hoc queries were really running into slowness - COTS RDS isn't great at speed. I want to say 1 billion rows, but that seems a bit off. Anyway, we were talking data sharding, etc. It didn't go anywhere, because the project was canceled. But we could have done a nice job if we wanted to.
Point being, there are design limits on Postgres, particularly regarding distributed systems. Which is fine. I just have this allergy to designing for scale on a Very Small Project.
Also worth noting is how the anti-monolith dogma plays out here: a few well-built monoliths are probably better for a SQL database than "one bajillion microservices".
I'm totally with you that those three are reasonable limits (hell, I named two in my rebuttal)... Just that scale in itself is not the issue; it comes with highly specific issues. And high-update time-series workloads are, in my experience, usually metrics workloads that, say, Prometheus serves well, while PG extensions like timescaledb do a stellar job when you truly need ACID semantics.
"I've heard too many times 'just use Postgres', repeated as an unthinking mantra."
You use "unthinking" pejoratively, but being able to skip past some decisions without over-analyzing is really important. If you are an 8-person startup, you don't have time for 3 of the people to spend weeks discussing and experimenting with alternatives for every decision.
Databases are really important, but people make tons of important decisions based on little information. If you have little information other than "I need to pick a database", then Postgres is a pretty good choice -- a sensible default, if you will.
Everyone wants to be the default, of course, so you need some criteria to choose one. It could be a big and interesting discussion, but there are relatively few contenders. If it's not free, it's probably out. If it's not SQL, it's probably out. If it's not widely used already, it's probably out. If it's tied to any specific environment/platform (e.g. Hadoop), it's probably out (or there will be a default within that specific platform). By "out", I don't mean that it's unreasonable to choose these options, just that they would not make a sensible default. It would be weird if your default was an expensive, niche, no-SQL option that runs inside of Hadoop.
So what's left? Postgres, MySQL, and MariaDB. SQL Server and MongoDB, while not meeting all of the criteria, have a case for being a default choice in some circles, as well. Apparently, out of those options, many on HN have had a good experience with Postgres, so they suggest it as default.
But if you have additional information about your needs that leads you to another choice, go for it.
"Just use Postgres" may have become meme but for good reason and is well grounded IMO.
Many immature databases with not much wide use are better avoided though, we manage to break datomic three times during development, the first two bugs were fixed in a week, the third took a month, which they called in their changelog "Fix: Prevent a rare scenario where retracting non-existent entities could prevent future transactions from succeeding" so yeah, we went back to "just use postgres", who wants to go through the nightmare of hitting those bugs in production and who knows how many more?scary situation.
Thanks for this summary. I take for granted that I have Postgres, a powerful and reliable database that I get to use for free in all my projects and work.
This is just such a pleasure to read, even as someone that has only surface awareness of database internals at all. Both for the incredibly friendly and professional tone, and for the obvious deep technical knowledge on both sides.
And that first email, my god, that should be titanium-and-gold-plated standard of a bug report.
> that first email, my god, that should be titanium-and-gold-plated standard of a bug report.
It's a thing of beauty. It even includes versions of software used!
My daily experience with bug reports is that, 50/50, they won't even include a description, just a title. It's such a cliche already, but "project name is broken" makes my blood boil. What environment? What were you doing? Is this production? How do I test this bug? (From an Ops perspective.) When did you notice this? Has anything changed recently that could have caused an error?
It is very rare to see a Jepsen report that concludes with a note that a project is being too humble about their consistency promises.
Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
> Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
They also admitted that the testing framework cannot evaluate more complex scenarios with subqueries, aggregates, and predicates. So it is possible that PG's consistency promises are spot on, or maybe even overpromising.
Strong consistency is easier when throughput and concurrency are limited.
ZooKeeper is impressive in many ways. However, unless something has changed drastically in the last two years, ZooKeeper's throughput will always limit it to configuration/metadata/control-plane use cases rather than primary/data-plane ones.
>PostgreSQL has an extensive suite of hand-picked examples, called isolationtester, to verify concurrency safety. Moreover, independent testing, like Martin Kleppmann’s Hermitage has also confirmed that PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why, then, did we immediately find G2-item with Jepsen? How has this bug persisted for so long?
This is super interesting. Jepsen seems to be like Hypothesis for race conditions: you specify the race condition to be triggered and it generates tests to simulate it.
Yesterday, GitLab acquired a fuzz-testing company[1]. I wonder if Jepsen was envisioned as a full CI-integrated testing system.
Yes. Jepsen and Hypothesis both descend from a long line of property-based testing systems--most notably, Haskell & Erlang's QuickCheck. Jepsen makes a number of unusual choices specific to testing concurrent distributed systems: notably, we don't do much shrinking (real-world systems are staggeringly nondeterministic). Jepsen also includes tooling for automated deploys, fault injection, a language for specifying complex concurrent schedules, visualizations, storage, and an array of sophisticated property checkers.
You can test pretty much any kind of concurrent system using Jepsen: in-memory data structures, filesystems, databases, queues, APIs, services, etc. Not all the tooling is applicable to every situation, but it's pretty darn general.
> "I cannot begin to convey the confluence of despair and laughter which I encountered over the course of three hours attempting to debug this issue. We assert that all keys have the same type, and that at most one integer type exists. If you put a mix of, say, Ints and Longs into this checker, you WILL question your fundamental beliefs about computers" [1].
I feel like Jepsen/Elle is a great argument for Clojure, reading the source is actually kind of fun. Not what you'd expect for a project like this.
Normally I'm a core.typed person, but static type constraints don't quite make sense here. We want heterogeneity in some cases (e.g. you want to be able to mix nils and ints), but not in others (e.g. this short and int mixing, which could be intentional, but also, might not be)
I've considered spec as well, but spec has a weird insistence that a keyword has exactly one meaning in a given namespace, which is emphatically not the case in pretty much any code I've tried to verify. Also its errors are... not exactly helpful.
This is interesting to me as a Clojure person - you would be approximately the first person I've seen using core.typed since CircleCI's post in 2015 discussing why it didn't work for them. Are you using more modern versions of core.typed? What's the experience like these days?
I don't use it often. In general, I've found the number of bugs I catch with core.typed doesn't justify the time investment in convincing things to typecheck--my tests generally (not always, of course!) find type issues first. I also tend to do a lot of weird performance-oriented stateful stuff with java interop, which brings me into untyped corners of the library.
That said, I've found core.typed helpful in managing complex state transformations, especially in namespaces which have, say, five or six similar representations of the same logical thing. What do you do when a "node" is a hostname, a logical identifier in Jepsen, an identifier in the database itself, a UID, and a UID+signature pair? Managing those names can be tricky, and having a type system really helps.
Interesting, thanks for the feedback. I often miss having a type system in Clojure and I was really hoping that core.typed might be more ergonomic these days. It sounds like it's still a pretty serious tradeoff, and probably not worth it for most code.
Elle is pretty new so I would guess not--unless it's been lurking somewhere else. Dunno what aphyr's thoughts on spec are, plus I'm an amateur clojurian so, I'm not sure what community consensus is or if spec has drawbacks that make it not a good fit.
This is my understanding of what a G2-Item Anti-dependency Cycle is, from the linked paper's example:
-- Given (roughly) the following transactions:
-- Transaction 1 (SELECT, T1)
with all_employees as (
select sum(salary) as salaries
from employees
),
department as (
select department, sum(salary) as salaries
from employees group by department
)
select (select salaries from all_employees) - (select sum(salaries) from department) as difference;
-- Transaction 2 (INSERT, T2)
insert into employees (name, department, salary)
values ('Tim', 'Sales', 70000);
-- G2-Item is where the INSERT completes between all_employees and department,
-- making the SELECT result inconsistent
This is called an "anti-dependency" issue because T2 clobbers the data T1 depends on before it completes.
They say Elle found 6 such cases in 2 min, which I'm guessing is a "very big number" of transactions, but can't figure out exactly how big that number is based on the included logs/results.
Also, "Elle has found unexpected anomalies in every database we've checked"
Yeah, it was relatively infrequent in that particular workload--dramatically less than PostgreSQL's "repeatable read" exhibited. These histories are roughly 15K successful transactions long--see the stats field in results.edn. I'm hesitant to make strong statements about frequency, because I suspect this kind of thing depends strongly on workload, but I would hazard a gueesssss that it's not super common.
Props to Jepsen for exposing this longtime bug. Props to the PG team for identifying the culprit and for their response. This report just strengthens my faith in the project.
I started this analysis intending to do just that--it's been difficult, however, to figure out which of the dozens of replication/HA configurations to actually test. I settled on Stolon, since it seemed to make the strongest safety claims. However, I found bugs which turned out to be PostgreSQL's fault, so I backed off to investigate those first.
And herein lies the rub: HA Postgres is an extremely painful proposition. Based on our non-scientific research, Patroni seems to be the most battle tested solution, and as popular if not more so than Stolon.
"HA in Postgres" does not have a very well-defined meaning. The Postgres documentation provides an overview of different viable solutions: https://www.postgresql.org/docs/12/different-replication-sol... with features and drawbacks for each. But to call it "extremely painful" seems to be a bit overstated.
Patroni does have a synchronous_mode_strict setting, which may be what you're looking for:
This parameter prevents Patroni from switching off synchronous replication on the primary when no synchronous standby candidates are available. As a downside, the primary is not available for writes (unless the Postgres transaction explicitly turns off synchronous_mode), blocking all client write requests until at least one synchronous replica comes up.
Er, again, the docs say "it is still possible to lose transactions even when using synchronous_mode_strict". I've talked about this with some of the Stolon folks on Twitter, and we're not exactly sure how that manifests--possibly an SI or SSI violation.
> documentation also seems to suggest that even with the strongest settings, it can lose transactions;
This can be reproduced even on a single-node Postgres. Just hammer it with inserts and maintain a local counter of inserts performed.
Then kill -9 the Postgres process. You'd expect your local counter to match the actual rows inserted, but you'll find that your counter will always be "less" than the actual rows inserted.
Like any "networked" system, it is possible to lose commit acknowledgments even if the commit itself was successful.
So yes, you've not "lost" transactions per se. You've "gained" them, but it is still a data issue in either case.
Ah, I presumed you were talking about distributed failure situations (split brain, etc.) as opposed to the PG-level replication (which most solutions orchestrate anyway).
People like to criticise NoSQL databases like MongoDB etc., but at least they took on the challenge of making clustering easy enough to use and safe enough to rely on. Especially because it is such a complex and error-prone challenge.
Odd that you would point out MongoDB as your named example, as it is pretty awful at sharding/clustering. For HA, the better example would be Cassandra or Scylla.
Mongo's success is more tied to the ease of development with a native JSON document DB, rather than any claims to scalability. (Insert "Mongodb is webscale" video here.)
> HA Postgres is an extremely painful proposition.
Does anyone here know how Amazon RDS's HA setup, particularly their multi-AZ option, works? That seems to be a switch that the AWS customer can just turn on. Do they have a proprietary implementation, even for non-Aurora Postgres?
So if the hardware running the database is suddenly destroyed they try to start another instance really fast?
That seems inferior to having multiple sync replicas ready to take over without having to start a process and replay the WAL.
Also, such an HA block store seems very easy to replicate ( I'd guess there would be something open source already), not much of a competitive advantage.
- Single instance, if the instance dies they start a new one and mount the same storage. This can take some time, in general under 5min but I have seen it take 45min, especially for the large instance types.
- Multi A-Z, they run a hot standby with replicated and physically separated storage. Failover takes about a minute. The replication happens at storage level, every write has to be acknowledged by both availability zones. I'm not sure if Postgres is always running or if it gets started when failing over.
Is there a proposed roadmap for a basic / default solution for HA Postgres? It seems MySQL has this well covered, while Postgres continues to treat it as not a core part of the DB and relies on third parties. (Not suggesting that is necessarily a bad thing.)
Fun story: after the last report which called them out for not talking about write loss by default, MongoDB updated their Jepsen page to say that the analysis didn't observe lost writes. I guess they assumed that people wouldn't... read the abstract? Let alone the report?
For me, MongoDB has a track record of boasting a lot ("webscale") and hiding/denying mistakes.
PostgreSQL is quite the opposite on that front: confident, yet open to criticism and able to admit mistakes. Hell, I've even seen them present their mistakes at conferences and ask for help.
Yes, for instance not returning errors in some cases when writes fail. I think this was until version 2 but to be fair they fixed this kind of stuff and started to deal with this differently later on. However their reputation never fully recovered from this.
> Neither process crashes, multiple tables, nor secondary-key access is required to reproduce our findings in this report. The technical justification for including them in this workload is “for funsies”.
Yes, it started as a hobby and turned into a business, but yes, the song is the inspiration. It basically was testing distributed systems with network partitioning (i.e. services not calling back etc)
When I first discovered aphyr's site, all of the test articles began with "Call Me Maybe:" rather than "Jepsen:", and then one day all the articles were renamed.
I've always suspected he changed it for legal reasons, and his comment elsewhere in this thread pretty much confirms it.
I am still wondering when we will see PostgreSQL being tested in an HA configuration.
It's just extraordinary to me that it's 2020 and it still does not have a built-in, supported set of features for supporting this use case. Instead we have to rely on proprietary vendor solutions or dig through the many obsolete or unsupported options.
The option to install postgres on three instances, specify that they're in cluster "foo" and then it just works, including automatically fixing any issues when one of the instances drops out and rejoins.
That's what other DBs have but it seems to be missing from postgres. If it now exists could you point me to the doc explaining how to do this?
Stolon, or an equivalent way of automatically deploying on Kubernetes with a Helm chart, being officially blessed by the PostgreSQL team and made part of the official distribution.
There were edge cases in PostgreSQL’s SERIALIZABLE isolation level - which is supposed to ensure that concurrent transactions behave as if they were committed sequentially.
Specifically: if running a transaction as SERIALIZABLE, there was a very small chance that you might not see rows inserted by another transaction that committed before you in the serial order. Many applications don't need this level of transaction isolation, but for those that do, it's somewhat scary to know this was lurking under the bed.
Every implementation of a "bank" system where you keep track of deposits and withdrawals is a use case for SERIALIZABLE, and this means a double-spend could happen because the next transaction didn't see that an account had just had a transaction that drained its balance, for example.
The bank example is useful, because it tends to elicit the right thinking for people, but banking has a long history of eventual consistency.
For the vast majority of the history of banking, local branches (which is a very loose term here, e.g. a family member of the guy you know in your hometown, rather than an actual physical establishment) would operate on local knowledge only. Consistency is achieved only through a regular reconciliation process.
Even in more modern, digital times, banks depend on large batch processes and reconciliation processes.
I'd say MOST non-trivial applications require SERIALIZABLE. Every time an app does `BEGIN; SELECT WHERE; INSERT/UPDATE; COMMIT` it needs `SERIALIZABLE`, because it is the only level that catches cases where a concurrent transaction adds rows so that the SELECT WHERE changes its result set, and therefore the subsequent INSERT/UPDATE should have been done with different values.
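A minimal sketch of that failure mode, with a made-up ledger table (the schema and numbers are purely illustrative):

-- create table ledger (account int, amount int);
-- account 1 starts with a single row (1, 100)

-- Session A:
begin isolation level serializable;
select sum(amount) from ledger where account = 1;  -- sees 100

-- Session B, concurrently:
begin isolation level serializable;
select sum(amount) from ledger where account = 1;  -- also sees 100

-- Both applications conclude a 100-unit withdrawal is covered:
-- Session A:
insert into ledger (account, amount) values (1, -100);
commit;
-- Session B:
insert into ledger (account, amount) values (1, -100);
commit;

-- Under READ COMMITTED or REPEATABLE READ both commits succeed and the
-- balance silently goes to -100. Under SERIALIZABLE, PostgreSQL aborts one
-- of the two transactions with a serialization failure (SQLSTATE 40001),
-- and the application retries it.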
The common bank example as I understand it doesn't require serializable, but only snapshot isolation: If two transactions both drain the source balance, the one that commits last will fail, because its snapshot doesn't match the state anymore.
If you’re UPDATEing a balance on some account table - yes. If you’re using a ledger and calculating balances (which you SHOULD) then SERIALIZABLE is needed.
It's an isolation issue but if you're coming from Mongo I'd broadly guess it's not one you're going to trigger. Also, look at their other analyses ... they're very detailed and upfront about serialization isolation issues in a lot of huge databases/datastores.
Noteworthy:
"In most respects, PostgreSQL behaved as expected: both read uncommitted and read committed prevent write skew and aborted reads."
Technically it does. You can ask for read uncommitted mode, though you'll just get read committed mode. This is correct because you're getting the minimal guarantees that you asked for. The SQL standard allows this.
The default isolation level is read committed mode, whereas the bug in question only affected applications that use serializable mode. You have to ask for serializable mode explicitly; if you're not, then you cannot possibly be affected by the bug. (Perhaps you should consider using a higher isolation level, but that would be equally true with or without this bug.)
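For reference, opting in looks something like this (per transaction or per session):

-- Per transaction:
begin isolation level serializable;
-- ... statements ...
commit;

-- Or for every transaction opened by this session:
set default_transaction_isolation = 'serializable';

-- Check what the current transaction is actually running at:
show transaction_isolation;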
- Yes, but not the only one; it was a succession of problems. (Why did I use Mongo in the first place, on a transaction-heavy call-center database? Because the customer forced it, since it was the only thing he knew.)
- No, just a single huge instance, managed on Azure.
So this does not affect SSI guarantees if the transactions involved all operate on the same row? Is my understanding correct?
For instance can I update a counter with serializable isolation and not run into this bug?
I think so, yeah. You could theoretically have a G2-item anomaly on a single key, but in PostgreSQL's case, the usual write-set conflict checking seems to prevent them.
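A sketch with a hypothetical single-row counter, for the curious (the row-level first-updater-wins check kicks in before SSI ever needs to):

-- create table counters (id int primary key, n int);

-- Session A:
begin isolation level serializable;
update counters set n = n + 1 where id = 1;

-- Session B, concurrently:
begin isolation level serializable;
update counters set n = n + 1 where id = 1;  -- blocks on A's row lock

-- Session A:
commit;
-- Session B's update now fails with "could not serialize access due to
-- concurrent update" (SQLSTATE 40001); retrying the whole transaction
-- produces the correct final count.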
For the repeatable read issue, I don't intuitively understand why the violation mentioned would be a problem. In particular, even though the transaction sequence listed wouldn't make sense for a serializable level, it seems consistent with what I'd expect from repeatable read (though I have not read the ANSI SQL standard's definition of repeatable read).
Any insights into why we should want repeatable read to block that? It feels like blocking that is specifically the purpose of serializable isolation.
The report talks at length about this, but maybe it bears rephrasing!
The ANSI definitions are bad: they allow multiple interpretations with varying results. 25 years ago, Berenson, O'Neil, et al. published a paper showing the ANSI definitions had this ambiguity, and that what the spec meant to define should have been a broader class of anomalies. They literally say that the broad interpretation is the "correct" one, and the research community basically went "oh, yeah, you're right". Adya followed up with generalized isolation level definitions, and pretty much every paper I've read has gone with these versions since. That didn't make its way back into the SQL spec though: it's still ambiguous, which means you can interpret RR as allowing G2-item.
Why prevent G2-item in RR? Because then the difference between repeatable read and serializable is specifically phantoms, rather than phantoms plus... some other hard-to-describe anomalies. If you use the broad/generalized interpretation, you can trust that a program which only accesses data by primary key, running under repeatable read, is actually serializable. That's a powerful, intuitive constraint. If you use the strict interpretation, RR allows Other Weird Behaviors, and it's harder to prove an execution is correct.
For a very thorough discussion of this, see either Berenson or Adya's papers, linked throughout the report.
Thanks, I got the part about the spec being ambiguous, am more interested in the "why" aspect, since the current behaviour seems intuitive to the name "repeatable read". But on closer inspection, I see PostgreSQL's repeatable read blocks phantom reads even though the ANSI spec permits that! I don't get why phantom reads would be acceptable under "repeatable read"... I probably should give those papers a read some time. But in the meantime, given the choice of phantom reads or G2-item, I think I'd pick blocking phantom reads. (It might be nice to have the option to choose though!)
In PostgreSQL's case, if they somehow made repeatable read to prevent G2-item without sacrificing the phantom reads, would that mean repeatable read is then "serializable" according to the ANSI definition?
> But in the meantime, given the choice of phantom reads or G2-item, I think I'd pick blocking phantom reads.
Well... it's not quite so straightforward. SI still allows some phantoms. It only prohibits some of them.
> In PostgreSQL's case, if they somehow made repeatable read to prevent G2-item without sacrificing the phantom reads, would that mean repeatable read is then "serializable" according to the ANSI definition?
I'm not quite sure I follow--If you're asking whether snapshot isolation (Postgres "Repeatable Read") plus preventing G2-item is serializable, the answer is no--that model would still allow G2 in general--specifically, cycles involving non-adjacent rw dependencies with predicates.
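To make the predicate case concrete, here's the classic on-call sketch (hypothetical table; under PostgreSQL's repeatable read, i.e. snapshot isolation, this is allowed, while serializable would abort one side):

-- create table doctors (name text primary key, on_call bool);
-- both rows start with on_call = true; the invariant is "at least one
-- doctor stays on call"

-- Session A:
begin isolation level repeatable read;
select count(*) from doctors where on_call;            -- sees 2, so "safe"
update doctors set on_call = false where name = 'alice';

-- Session B, concurrently:
begin isolation level repeatable read;
select count(*) from doctors where on_call;            -- also sees 2
update doctors set on_call = false where name = 'bob';

-- Session A: commit;   Session B: commit;
-- Both succeed and nobody is left on call: each write hit a different row,
-- but each one invalidated the predicate the other transaction read. That's
-- a G2 cycle built from predicate rw dependencies, which snapshot isolation
-- does not detect.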
I don’t know why the author is surprised that Postgres offers stronger guarantees than serializability in practice. Serializability per se allows anomalies that would be disastrous in practice: http://dbmsmusings.blogspot.com/2019/06/correctness-anomalie....
It's not particularly surprising, but it is noteworthy. Other systems I've tested have offered serializability but not strict serializability: for instance, CockroachDB.
> This behavior is allowable due to long-discussed ambiguities in the ANSI SQL standard, but could be surprising for users familiar with the literature.
Should that be "not familiar"? And which literature - the standard or the discussions?
Familiar. If you've read any of the literature on snapshot isolation or isolation levels in general (Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, Adya, Liskov, Fekete, Cerone, Bailis, Alvisi, Clement, Crooks, Pu...) you would probably not expect to see SI treated as stronger than RR. The paper which first defined SI says it's not. Heck, Jim Melton, who wrote that part of the ANSI spec, is a co-author on that paper, which goes on to say the strict interpretation is incorrect and not what ANSI intended!
Thanks for doing these, they're incredibly interesting, useful, amusing (Oh no! the schadenfreude!) and also, incredibly inspiring to me to be a better engineer, so thank you again :)
---
[1] https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDt...
[2] https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPP...
[3] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...
[4] https://en.wikipedia.org/wiki/Rr_(debugging)
[5] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...