Personally, this kind of thing actually gives me _more_ confidence in Postgres rather than less. The core team's responsiveness to this bug report was incredibly impressive.
Around June 4th, the article's author comes in with a bug report that basically says "I hammered Postgres with a whole bunch of artificial load and made something happen" [1].
By the 8th, a preliminary patch is ready for review [2]. That includes all the time to get the author's testing bootstrap up and running, reproduce, diagnose the bug (which, lest we forget, is the part of all of this that is actually hard), and assemble a fix. It's worth noting that it's no one's job per se on the Postgres project to fix this kind of thing — the hope is that someone will take interest, step up, and find a solution — and as unlikely as that sounds to work in most environments, amazingly, it usually does for Postgres.
Of note to the hacker types here, Peter Geoghegan was able to track the bug down through the use of rr [4] [5], which allowed an entire problematic run to be captured, and then stepped through forwards _and_ backwards (the latter being the key for not having to run the simulation over and over again) until the problematic code was identified and a fix could be developed.
With distributed and multicore being the path forward now that Moore's law is ending, your work has been instrumental in helping open source distributed systems improve.
Because distributed systems are so difficult and complicated, salespeople and zealots can both deny issues and overstate capabilities.
Your work is a shining star in that darkness. Thank you.
Thank you for this comment that gives credit where it's due, this is a very impressive set of threads to read through.
And I agree. For me, one of the most important measures of the reliability of a system is how that system responds to information that it might be wrong. If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand, is technical, validating, and prompt.
Every system has bugs, but depending on these cultural features, not every system is capable of systematically removing those bugs. With logs like these, the pg community continues to prove capable. Kudos!
>If the response is defensiveness, evasiveness, or persuasive in any way, i.e. of the "it's not that bad" variety, run for the hills. This, on the other hand is technical, validating, and prompt.
This resonates with me for teams inside a company as well.
We have a few teams that just deflect issues. They'll find any issue with the bug report, be it an FQDN in a log search, and poof, it goes back to sender; they don't care. Engineers on my team just don't bother to report bugs there anymore, regardless of how simple they are. Usually it's faster and less frustrating to just work around the problem or ignore it. You could be fighting windmills for weeks, or just fudge around it.
Other teams are far more receptive to bugs: engineers end up curious and just poke around until they understand what's up. And then you get bug reports like "OK, if I create these 57 things over here, and toggle thing 32 to off, and then toggle two things indexed by prime numbers on, then my database connection fails. I've reproduced this from empty VMs. If 32 is on, I need to toggle two perfect squares on, but not 3." And then a lot of things just get fixed.
If you reject the "it's not that bad" response then you are saying all bugs are equal in severity / urgency / priority.
Which is obviously not true.
A better response might be something like "I understand the problem is X however the fix might take Y months and the next release will deprecate that feature anyway. Can you quantify or guesstimate the impact on our product if we allow this bug to stay unfixed?"
Yeah, in retrospect, I disagree with myself, too. (And agree with your rationale!) :)
I think if I were to try and salvage the quote, I'd change it to "It's probably not that bad". In this telling, I'm meaning to imply that the responder has made an a priori value judgement about a bug without determining scope and cause.
I agree with your ideal response. The only caveat I'd personally make is that I'd recommend "planting the flag" by providing your own guesstimate of product impact, and asking the reporter for their opinion on moving it.
In my perspective, it's important to acknowledge that it has an impact, and it's especially important to convey your sense of impact concretely with some napkin math. If you're terribly wrong, the reporter will try to convince you of it[1], and you'll learn something along the way. It also conveys empathy (i.e. we're both trying to solve your problem, even if it's only a problem for you).
Practically everyone in distributed systems knows who Aphyr is, and to get a bug reported by him is like a badge of honor. I am willing to bet that people at PostgreSQL jousted over who got to work with him on this bug, because it would be such an amazing learning opportunity. I know I would want to!
> Memory-mapped files are almost entirely just the executables and libraries loaded by tracees. As long as the original files don’t change and are not removed, which is usually true in practice, their clones take up no additional space and require no data writes
> Like cloned files, cloned file blocks do not consume space as long as the underlying data they’re cloned from persists.
they conclude the section with:
> In any case, in real-world usage trace storage has not been a concern
I imagine that over "longer debugging sessions" the metadata footprint would expand linearly, but probably with a constant smaller than the logs for the average program.
Indeed — it's great to see a vendor (team, in this case) that doesn't try to downplay a Jepsen result, and instead fixes the issues.
However, there is one more takeaway here. I've heard too many times "just use Postgres", repeated as an unthinking mantra. But there are no obvious solutions in the complex world of databases. And this isn't even a multi-node scenario!
I don't think the "just use Postgres" mantra takes any hits at all from this. (If anything, I feel better about it).
I've used maybe a dozen (?) databases/stores over the years - graph databases, NoSQL databases, KV stores, the most boring old databases, the sexiest new databases - and my general approach is now to just use Postgres unless it really, really doesn't fit. Works great for me.
I too have been there, done that when it comes to persistent storage, but my similar conclusion is “use SQLite unless it really, really doesn’t fit, in which case, use PostgreSQL.” (It’s usually easy to figure out which of the two you’ll end up needing.)
I'm happy Postgres works for you. It works for me, too, in a number of setups. But one should never accept advice like "just use Postgres" without thinking and careful consideration. As the Jepsen article above shows.
I have rarely seen people give "just use PostgreSQL" as advice, but rather "just use PostgreSQL unless you have a compelling reason not to". There's a pretty big difference between the two.
"just use PostgreSQL unless you have a compelling reason not to" is just as a bad advice tbh. At work we did not even consider it because no HA support / availability. What do you do exactly when your master is down, your entire backend just stop working? PG is not a silver bulet especially not when you're looking for good managed / HA solution.
There is no such thing as "use x.y.z unless" this statement does not make any sense.
Setting up a HA PostgreSQL setup might not be as easy as with some turnkey solutions where you just deploy three instances and then it (probably) works, but I'd still trust my data to a simple PostgreSQL replication setup over something fancier where the behaviour in failure scenarios is not as straightforward.
In simple cases, what people usually need is what I would call "operational" HA anyway; database servers going down unintentionally is (hopefully!) not a common occurrence, and even with a super simple 2-node primary/failover asynchronous replication setup, you can operate on the database servers with maybe a few seconds of downtime, which is about the same as you'd get from "natively" clustered solutions.
In true failure scenarios, you might get longer downtime (though setting up automatic failover with eg. pacemaker is also possible), but in situations where that ends up truly mattering you likely have the resources to make it work anyway.
The classic high availability setup for postgresql is to have standby instances, usually one, ready to become the new primary if the primary goes down.
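For a rough idea of what the replication half of that looks like on the primary, here's a minimal sketch. The names are made up for illustration, and it assumes a streaming standby is already running and registered with application_name 'standby1':

-- On the primary: require one synchronous standby to acknowledge commits.
alter system set synchronous_standby_names = 'FIRST 1 (standby1)';
-- 'remote_apply' waits until the standby has applied the WAL; plain 'on'
-- only waits for the WAL to be flushed on the standby.
alter system set synchronous_commit = 'remote_apply';
select pg_reload_conf();

-- Verify the standby is attached and marked as sync.
select application_name, state, sync_state from pg_stat_replication;

Promoting the standby (e.g. select pg_promote(); on PG 12+, or pg_ctl promote) still has to be triggered by something that has decided the primary is really gone, which is the part tools like Patroni, Stolon, or pacemaker automate, and where most of the disagreement in this thread lives.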
This won't work for everyone, but if you are on the AWS stack, then using PostgreSQL on Amazon RDS is a compelling option. There are 6 copies of your data, stored on 3 servers and in 3 separate availability zones, and the replication between the nodes is less than 1s (usually around 50ms). I'd agree that a simple HA solution is the most obvious thing lacking for PostgreSQL, but with RDS it's not something you have to worry about anymore.
No one claims it to be, hence the "unless there's a good reason not to" part. However, it is a good default that works well for a wide range of – though hardly all – use cases.
Replication in general is not PostgreSQL's strongest point and if you require advanced replication then PostgreSQL may indeed not be the choice for you, but most projects don't need this. That being said, "no HA support" is just not true.
"Use PostGres until you have an engineering - data driven rationale not to" is my standard answer for non-blob data storage when a project starts.
Why? because when `n` is small (tables, rows, connections), postgres works well enough, and if `n` should ever become large, we'll have interest in funding the work to do a migration, if that's appropriate - and we'll be able to evaluate the system at scale with real data.
postgres works great for `n` in ranges in excess of (10k, 10B, 1k) at least, which not many would consider small.
The only real justification I've heard is that Postgres' replication story still lags MySQL's, plus the write-amplification issue, which they've done a lot of work on for pg13 (see [0]).
Even for many very large companies I've seen, noSQL databases are mostly used as caches that can afford to be delayed or a bit out of sync, not as the database of record.
I mean, you can go a loooong way on Postgres. Things that probably are going to start causing issues:
- the story of replication
- ultra-high volume write contention
- high volume timeseries data streams
For #3 there, I was consulting on a project a few years ago where there was about 1 gigabyte of timeseries data in Postgres/RDS. Some ad hoc queries were really running into slowness - COTS RDS isn't great at speed. I want to say 1 billion rows, but that seems a bit off. Anyway, we were talking data sharding, etc. It didn't go anywhere, because the project was canceled. But we could have done a nice job if we wanted to.
Point being, there are design limits on Postgres, particularly regarding distributed systems. Which is fine. I just have this allergy to designing for scale on a Very Small Project.
Also worth noting is how the anti-monolith dogma plays out here: a few well-built monoliths are probably better for a SQL database than "one bajillion microservices".
I'm totally with you that those three are reasonable limits (hell, I named two in my rebuttal)... Just that scale in itself is not the issue; it comes with highly specific issues. And high-update time-series workloads are, in my experience, usually metrics workloads that, say, Prometheus serves well, while PG extensions like timescaledb do a stellar job when you truly need ACID semantics.
"I've heard too many times 'just use Postgres', repeated as an unthinking mantra."
You use "unthinking" pejoratively, but being able to skip past some decisions without over-analyzing is really important. If you are an 8-person startup, you don't have time for 3 of the people to spend weeks discussing and experimenting with alternatives for every decision.
Databases are really important, but people make tons of important decisions based on little information. If you have little information other than "I need to pick a database", then Postgres is a pretty good choice -- a sensible default, if you will.
Everyone wants to be the default, of course, so you need some criteria to choose one. It could be a big and interesting discussion, but there are relatively few contenders. If it's not free, it's probably out. If it's not SQL, it's probably out. If it's not widely used already, it's probably out. If it's tied to any specific environment/platform (e.g. Hadoop), it's probably out (or there will be a default within that specific platform). By "out", I don't mean that it's unreasonable to choose these options, just that they would not make a sensible default. It would be weird if your default was an expensive, niche, no-SQL option that runs inside of Hadoop.
So what's left? Postgres, MySQL, and MariaDB. SQL Server and MongoDB, while not meeting all of the criteria, have a case for being a default choice in some circles, as well. Apparently, out of those options, many on HN have had a good experience with Postgres, so they suggest it as default.
But if you have additional information about your needs that leads you to another choice, go for it.
"Just use Postgres" may have become meme but for good reason and is well grounded IMO.
Many immature databases with not much wide use are better avoided though, we manage to break datomic three times during development, the first two bugs were fixed in a week, the third took a month, which they called in their changelog "Fix: Prevent a rare scenario where retracting non-existent entities could prevent future transactions from succeeding" so yeah, we went back to "just use postgres", who wants to go through the nightmare of hitting those bugs in production and who knows how many more?scary situation.
Thanks for this summary. I take for granted that I have Postgres, a powerful and reliable database that I get to use for free in all my projects and work.
This is just such a pleasure to read, even as someone that has only surface awareness of database internals at all. Both for the incredibly friendly and professional tone, and for the obvious deep technical knowledge on both sides.
And that first email, my god, that should be titanium-and-gold-plated standard of a bug report.
> that first email, my god, that should be titanium-and-gold-plated standard of a bug report.
It's a thing of beauty. It even includes versions of software used!
My daily experience with bug reports is that, 50/50, they won't even include a description, just a title. It's such a cliche already, but "project name is broken" makes my blood boil. What environment? What were you doing? Is this production? How do I test this bug? (From an Ops perspective.) When did you notice this? Has anything changed recently that could have caused an error?
It is very rare to see a Jepsen report that concludes with a note that a project is being too humble about their consistency promises.
Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
> Finding effectively only a single obscure and now fixed issue where real-world consistency did not match the promised consistency is pretty impressive.
They also admitted that the testing framework cannot evaluate more complex scenarios with subqueries, aggregates, and predicates. So it is possible that PG's consistency promises are spot on, or maybe even overpromising.
Strong consistency is easier when throughput and concurrency are limited.
ZooKeeper is impressive in many ways. However, unless something has changed drastically in the last two years, ZooKeeper's throughput will always limit it to configuration/metadata/control-plane use cases rather than primary/data-plane ones.
>PostgreSQL has an extensive suite of hand-picked examples, called isolationtester, to verify concurrency safety. Moreover, independent testing, like Martin Kleppmann’s Hermitage has also confirmed that PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why, then, did we immediately find G2-item with Jepsen? How has this bug persisted for so long?
This is super interesting. Jepsen seems to be like Hypothesis for race conditions: you specify the race condition to be triggered and it generates tests to simulate it.
Yesterday, GitLab acquired a fuzz-testing company[1]. I wonder if Jepsen was envisioned as a full CI-integrated testing system.
Yes. Jepsen and Hypothesis both descend from a long line of property-based testing systems--most notably, Haskell & Erlang's QuickCheck. Jepsen makes a number of unusual choices specific to testing concurrent distributed systems: notably, we don't do much shrinking (real-world systems are staggeringly nondeterministic). Jepsen also includes tooling for automated deploys, fault injection, a language for specifying complex concurrent schedules, visualizations, storage, and an array of sophisticated property checkers.
You can test pretty much any kind of concurrent system using Jepsen: in-memory data structures, filesystems, databases, queues, APIs, services, etc. Not all the tooling is applicable to every situation, but it's pretty darn general.
> "I cannot begin to convey the confluence of despair and laughter which I encountered over the course of three hours attempting to debug this issue. We assert that all keys have the same type, and that at most one integer type exists. If you put a mix of, say, Ints and Longs into this checker, you WILL question your fundamental beliefs about computers" [1].
I feel like Jepsen/Elle is a great argument for Clojure, reading the source is actually kind of fun. Not what you'd expect for a project like this.
Normally I'm a core.typed person, but static type constraints don't quite make sense here. We want heterogeneity in some cases (e.g. you want to be able to mix nils and ints), but not in others (e.g. this short and int mixing, which could be intentional, but also, might not be)
I've considered spec as well, but spec has a weird insistence that a keyword has exactly one meaning in a given namespace, which is emphatically not the case in pretty much any code I've tried to verify. Also its errors are... not exactly helpful.
This is interesting to me as a Clojure person - you would be approximately the first person I've seen using core.typed since CircleCI's post in 2015 discussing why it didn't work for them. Are you using more modern versions of core.typed? What's the experience like these days?
I don't use it often. In general, I've found the number of bugs I catch with core.typed doesn't justify the time investment in convincing things to typecheck--my tests generally (not always, of course!) find type issues first. I also tend to do a lot of weird performance-oriented stateful stuff with java interop, which brings me into untyped corners of the library.
That said, I've found core.typed helpful in managing complex state transformations, especially in namespaces which have, say, five or six similar representations of the same logical thing. What do you do when a "node" is a hostname, a logical identifier in Jepsen, an identifier in the database itself, a UID, and a UID+signature pair? Managing those names can be tricky, and having a type system really helps.
Interesting, thanks for the feedback. I often miss having a type system in Clojure and I was really hoping that core.typed might be more ergonomic these days. It sounds like it's still a pretty serious tradeoff, and probably not worth it for most code.
Elle is pretty new so I would guess not--unless it's been lurking somewhere else. Dunno what aphyr's thoughts on spec are, plus I'm an amateur clojurian so, I'm not sure what community consensus is or if spec has drawbacks that make it not a good fit.
This is my understanding of what a G2-Item Anti-dependency Cycle is, from the linked paper's example:
-- Given (roughly) the following transactions:
-- Transaction 1 (SELECT, T1)
with all_employees as (
select sum(salary) as salaries
from employees
),
department as (
select department, sum(salary) as salaries
from employees group by department
)
select (select salaries from all_employees) - (select sum(salaries) from department) as difference;
-- Transaction 2 (INSERT, T2)
insert into employees (name, department, salary)
values ('Tim', 'Sales', 70000);
-- G2-Item is where the INSERT completes between all_employees and department,
-- making the SELECT result inconsistent
This is called an "anti-dependency" issue because T2 clobbers the data T1 depends on before it completes.
They say Elle found 6 such cases in 2 min, which I'm guessing is a "very big number" of transactions, but can't figure out exactly how big that number is based on the included logs/results.
Also, "Elle has found unexpected anomalies in every database we've checked"
Yeah, it was relatively infrequent in that particular workload--dramatically less than PostgreSQL's "repeatable read" exhibited. These histories are roughly 15K successful transactions long--see the stats field in results.edn. I'm hesitant to make strong statements about frequency, because I suspect this kind of thing depends strongly on workload, but I would hazard a gueesssss that it's not super common.
Props to Jepsen for exposing this longtime bug. Props to the PG team for identifying the culprit and for their response. This report just strengthens my faith in the project.
I started this analysis intending to do just that--it's been difficult, however, to figure out which of the dozens of replication/HA configurations to actually test. I settled on Stolon, since it seemed to make the strongest safety claims. However, I found bugs which turned out to be PostgreSQL's fault, so I backed off to investigate those first.
And herein lies the rub: HA Postgres is an extremely painful proposition. Based on our non-scientific research, Patroni seems to be the most battle tested solution, and as popular if not more so than Stolon.
"HA in Postgres" does not have a very well-defined meaning. The Postgres documentation provides an overview of different viable solutions: https://www.postgresql.org/docs/12/different-replication-sol... with features and drawbacks for each. But to call it "extremely painful" seems to be a bit overstated.
Patroni does have a synchronous_mode_strict setting, which may be what you're looking for:
This parameter prevents Patroni from switching off synchronous replication on the primary when no synchronous standby candidates are available. As a downside, the primary is not available for writes (unless the Postgres transaction explicitly turns off synchronous_mode), blocking all client write requests until at least one synchronous replica comes up.
Er, again, the docs say "it is still possible to lose transactions even when using synchronous_mode_strict". I've talked about this with some of the Stolon folks on Twitter, and we're not exactly sure how that manifests--possibly an SI or SSI violation.
> documentation also seems to suggest that even with the strongest settings, it can lose transactions;
This can be reproduced even on a single-node Postgres. Just hammer it with inserts and maintain a local counter of inserts performed.
Then kill -9 the Postgres process. You'd expect your local counter to match the actual rows inserted, but you'll find that your counter will always be "less" than the actual rows inserted.
Like any "networked" system, it is possible to lose commit acknowledgments even if the commit itself was successful.
So yes, you've not "lost" transactions per se. You've "gained" them, but it is still a data issue in either case.
Ah, I presumed you were talking about distributed failure situations (split brain, etc.) as opposed to the PG-level replication (which most solutions orchestrate anyway).
People like to criticise NoSQL databases like MongoDB etc., but at least they took on the challenge of making clustering easy enough to use and safe enough to rely on. Especially because it is such a complex and error-prone challenge.
Odd that you would point out MongoDB as your named example, as it is pretty awful at sharding/clustering. For HA, the better example would be Cassandra or Scylla.
Mongo's success is more tied to the ease of development with a native JSON document DB, rather than any claims to scalability. (Insert "Mongodb is webscale" video here.)
> HA Postgres is an extremely painful proposition.
Does anyone here know how Amazon RDS's HA setup, particularly their multi-AZ option, works? That seems to be a switch that the AWS customer can just turn on. Do they have a proprietary implementation, even for non-Aurora Postgres?
So if the hardware running the database is suddenly destroyed they try to start another instance really fast?
That seems inferior to having multiple sync replicas ready to take over without having to start a process and replay the WAL.
Also, such an HA block store seems very easy to replicate ( I'd guess there would be something open source already), not much of a competitive advantage.
- Single instance, if the instance dies they start a new one and mount the same storage. This can take some time, in general under 5min but I have seen it take 45min, especially for the large instance types.
- Multi A-Z, they run a hot standby with replicated and physically separated storage. Failover takes about a minute. The replication happens at storage level, every write has to be acknowledged by both availability zones. I'm not sure if Postgres is always running or if it gets started when failing over.
Is there a proposed roadmap for a basic / default solution for HA Postgres? It seems MySQL has this well covered, while Postgres continues to treat it as not a core part of the DB and relies on third parties. (Not suggesting that is necessarily a bad thing.)
Fun story: after the last report which called them out for not talking about write loss by default, MongoDB updated their Jepsen page to say that the analysis didn't observe lost writes. I guess they assumed that people wouldn't... read the abstract? Let alone the report?
For me, MongoDB has a track record of boasting a lot ("webscale") and hiding/denying mistakes.
PostgreSQL is quite the opposite on that front: confident, yet open to criticism and able to admit mistakes. Hell, I've even seen them present their mistakes at conferences and ask for help.
Yes, for instance not returning errors in some cases when writes fail. I think this was until version 2 but to be fair they fixed this kind of stuff and started to deal with this differently later on. However their reputation never fully recovered from this.
> Neither process crashes, multiple tables, nor secondary-key access is required to reproduce our findings in this report. The technical justification for including them in this workload is “for funsies”.
Yes, it started as a hobby and turned into a business, but yes, the song is the inspiration. It basically was testing distributed systems with network partitioning (i.e. services not calling back etc)
When I first discovered aphyr's site, all of the test articles began with "Call Me Maybe:" rather than "Jepsen:", and then one day all the articles were renamed.
I've always suspected he changed it for legal reasons, and his comment elsewhere in this thread pretty much confirms it.
I am still wondering when we will see PostgreSQL being tested in an HA configuration.
It's just extraordinary to me that it's 2020 and it still does not have a built-in, supported set of features for supporting this use case. Instead we have to rely on proprietary vendor solutions or dig through the many obsolete or unsupported options.
The option to install postgres on three instances, specify that they're in cluster "foo" and then it just works, including automatically fixing any issues when one of the instances drops out and rejoins.
That's what other DBs have but it seems to be missing from postgres. If it now exists could you point me to the doc explaining how to do this?
Stolon, or an equivalent way of automatically deploying on Kubernetes with a Helm chart, being officially blessed by the PostgreSQL team and made part of the official distribution.
There were edge cases in PostgreSQL’s SERIALIZABLE isolation level - which is supposed to ensure that concurrent transactions behave as if they were committed sequentially.
Specifically: if running a transaction as SERIALIZABLE, there was a very small chance that you might not see rows inserted by another transaction that committed before you in the serial order. Many applications don't need this level of transaction isolation, but for those that do, it's somewhat scary to know this was lurking under the bed.
Every implementation of a "bank" system where you keep track of deposits and withdrawals is a use case for SERIALIZABLE, and this means a double-spend could happen because the next transaction didn't see that an account had just had a transaction that drained its balance, for example.
The bank example is useful, because it tends to elicit the right thinking for people, but banking has a long history of eventual consistency.
For the vast majority of the history of banking, local branches (which is a very loose term here, e.g. a family member of the guy you know in your hometown, rather than an actual physical establishment) would operate on local knowledge only. Consistency is achieved only through a regular reconciliation process.
Even in more modern, digital times, banks depend on large batch processes and reconciliation processes.
I'd say MOST non-trivial applications require SERIALIZABLE. Every time an app does `BEGIN; SELECT WHERE; INSERT/UPDATE; COMMIT` it needs `SERIALIZABLE`, because it is the only level that catches cases where a concurrent transaction adds rows so that the SELECT WHERE changes its result set, and therefore the subsequent INSERT/UPDATE should have been done with different values.
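A minimal sketch of that failure mode, with a made-up ledger table (the schema and numbers are purely illustrative):

-- create table ledger (account int, amount int);
-- account 1 starts with a single row (1, 100)

-- Session A:
begin isolation level serializable;
select sum(amount) from ledger where account = 1;  -- sees 100

-- Session B, concurrently:
begin isolation level serializable;
select sum(amount) from ledger where account = 1;  -- also sees 100

-- Both applications conclude a 100-unit withdrawal is covered:
-- Session A:
insert into ledger (account, amount) values (1, -100);
commit;
-- Session B:
insert into ledger (account, amount) values (1, -100);
commit;

-- Under READ COMMITTED or REPEATABLE READ both commits succeed and the
-- balance silently goes to -100. Under SERIALIZABLE, PostgreSQL aborts one
-- of the two transactions with a serialization failure (SQLSTATE 40001),
-- and the application retries it.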
The common bank example as I understand it doesn't require serializable, but only snapshot isolation: If two transactions both drain the source balance, the one that commits last will fail, because its snapshot doesn't match the state anymore.
If you’re UPDATEing a balance on some account table - yes. If you’re using a ledger and calculating balances (which you SHOULD) then SERIALIZABLE is needed.
It's an isolation issue but if you're coming from Mongo I'd broadly guess it's not one you're going to trigger. Also, look at their other analyses ... they're very detailed and upfront about serialization isolation issues in a lot of huge databases/datastores.
Noteworthy:
"In most respects, PostgreSQL behaved as expected: both read uncommitted and read committed prevent write skew and aborted reads."
Technically it does. You can ask for read uncommitted mode, though you'll just get read committed mode. This is correct because you're getting the minimal guarantees that you asked for. The SQL standard allows this.
The default isolation level is read committed mode, whereas the bug in question only affected applications that use serializable mode. You have to ask for serializable mode explicitly; if you're not, then you cannot possibly be affected by the bug. (Perhaps you should consider using a higher isolation level, but that would be equally true with or without this bug.)
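For reference, opting in looks something like this (per transaction or per session):

-- Per transaction:
begin isolation level serializable;
-- ... statements ...
commit;

-- Or for every transaction opened by this session:
set default_transaction_isolation = 'serializable';

-- Check what the current transaction is actually running at:
show transaction_isolation;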
- Yes, but not the only one; it was a succession of problems. (Why did I use Mongo in the first place, on a transaction-heavy call-center database? Because the customer forced it, since it was the only thing he knew.)
- No, just a single huge instance, managed on Azure.
So this does not affect SSI guarantees if the transactions involved all operate on the same row? Is my understanding correct?
For instance can I update a counter with serializable isolation and not run into this bug?
I think so, yeah. You could theoretically have a G2-item anomaly on a single key, but in PostgreSQL's case, the usual write-set conflict checking seems to prevent them.
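A sketch with a hypothetical single-row counter, for the curious (the row-level first-updater-wins check kicks in before SSI ever needs to):

-- create table counters (id int primary key, n int);

-- Session A:
begin isolation level serializable;
update counters set n = n + 1 where id = 1;

-- Session B, concurrently:
begin isolation level serializable;
update counters set n = n + 1 where id = 1;  -- blocks on A's row lock

-- Session A:
commit;
-- Session B's update now fails with "could not serialize access due to
-- concurrent update" (SQLSTATE 40001); retrying the whole transaction
-- produces the correct final count.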
For the repeatable read issue, I don't intuitively understand why the violation mentioned would be a problem. In particular, even though the transaction sequence listed wouldn't make sense for a serializable level, it seems consistent with what I'd expect from repeatable read (though I have not read the ANSI SQL standard's definition of repeatable read).
Any insights into why we should want repeatable read to block that? It feels like blocking that is specifically the purpose of serializable isolation.
The report talks at length about this, but maybe it bears rephrasing!
The ANSI definitions are bad: they allow multiple interpretations with varying results. 25 years ago, Berenson, O'Neil, et al. published a paper showing the ANSI definitions had this ambiguity, and that what the spec meant to define should have been a broader class of anomalies. They literally say that the broad interpretation is the "correct" one, and the research community basically went "oh, yeah, you're right". Adya followed up with generalized isolation level definitions, and pretty much every paper I've read has gone with these versions since. That didn't make its way back into the SQL spec though: it's still ambiguous, which means you can interpret RR as allowing G2-item.
Why prevent G2-item in RR? Because then the difference between repeatable read and serializable is specifically phantoms, rather than phantoms plus... some other hard-to-describe anomalies. If you use the broad/generalized interpretation, you can trust that a program which only accesses data by primary key, running under repeatable read, is actually serializable. That's a powerful, intuitive constraint. If you use the strict interpretation, RR allows Other Weird Behaviors, and it's harder to prove an execution is correct.
For a very thorough discussion of this, see either Berenson or Adya's papers, linked throughout the report.
Thanks, I got the part about the spec being ambiguous, am more interested in the "why" aspect, since the current behaviour seems intuitive to the name "repeatable read". But on closer inspection, I see PostgreSQL's repeatable read blocks phantom reads even though the ANSI spec permits that! I don't get why phantom reads would be acceptable under "repeatable read"... I probably should give those papers a read some time. But in the meantime, given the choice of phantom reads or G2-item, I think I'd pick blocking phantom reads. (It might be nice to have the option to choose though!)
In PostgreSQL's case, if they somehow made repeatable read to prevent G2-item without sacrificing the phantom reads, would that mean repeatable read is then "serializable" according to the ANSI definition?
> But in the meantime, given the choice of phantom reads or G2-item, I think I'd pick blocking phantom reads.
Well... it's not quite so straightforward. SI still allows some phantoms. It only prohibits some of them.
> In PostgreSQL's case, if they somehow made repeatable read to prevent G2-item without sacrificing the phantom reads, would that mean repeatable read is then "serializable" according to the ANSI definition?
I'm not quite sure I follow--If you're asking whether snapshot isolation (Postgres "Repeatable Read") plus preventing G2-item is serializable, the answer is no--that model would still allow G2 in general--specifically, cycles involving non-adjacent rw dependencies with predicates.
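To make the predicate case concrete, here's the classic on-call sketch (hypothetical table; under PostgreSQL's repeatable read, i.e. snapshot isolation, this is allowed, while serializable would abort one side):

-- create table doctors (name text primary key, on_call bool);
-- both rows start with on_call = true; the invariant is "at least one
-- doctor stays on call"

-- Session A:
begin isolation level repeatable read;
select count(*) from doctors where on_call;            -- sees 2, so "safe"
update doctors set on_call = false where name = 'alice';

-- Session B, concurrently:
begin isolation level repeatable read;
select count(*) from doctors where on_call;            -- also sees 2
update doctors set on_call = false where name = 'bob';

-- Session A: commit;   Session B: commit;
-- Both succeed and nobody is left on call: each write hit a different row,
-- but each one invalidated the predicate the other transaction read. That's
-- a G2 cycle built from predicate rw dependencies, which snapshot isolation
-- does not detect.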
I don’t know why the author is surprised that Postgres offers stronger guarantees than serializability in practice. Serializability per se allows anomalies that would be disastrous in practice: http://dbmsmusings.blogspot.com/2019/06/correctness-anomalie....
It's not particularly surprising, but it is noteworthy. Other systems I've tested have offered serializability but not strict serializability: for instance, CockroachDB.
> This behavior is allowable due to long-discussed ambiguities in the ANSI SQL standard, but could be surprising for users familiar with the literature.
Should that be "not familiar"? And which literature - the standard or the discussions?
Familiar. If you've read any of the literature on snapshot isolation or isolation levels in general (Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, Adya, Liskov, Fekete, Cerone, Bailis, Alvisi, Clement, Crooks, Pu...) you would probably not expect to see SI treated as stronger than RR. The paper which first defined SI says it's not. Heck, Jim Melton, who wrote that part of the ANSI spec, is a co-author on that paper, which goes on to say the strict interpretation is incorrect and not what ANSI intended!
Thanks for doing these, they're incredibly interesting, useful, amusing (Oh no! the schadenfreude!) and also, incredibly inspiring to me to be a better engineer, so thank you again :)
---
[1] https://www.postgresql.org/message-id/CAH2-Wzm9kNAK0cbzGAvDt...
[2] https://www.postgresql.org/message-id/CAH2-Wzk%2BFHVJvSS9VPP...
[3] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...
[4] https://en.wikipedia.org/wiki/Rr_(debugging)
[5] https://www.postgresql.org/message-id/CAH2-WznTb6-0fjW4WPzNQ...