I’ve been involved in various Postgres roles (developer on it, consultant for it for BigCo, using it at my startup now) for around 18 years.
I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever. I have seen corrupted DBs, failovers for no reason, etc. The worst things always happen when the failover doesn’t go according to plan and someone accidentally nukes the DB at 2am.
The lesson I’ve taken from this is that it’s better to have 15-20 minutes of downtime in the unlikely event a primary goes down and run a manual failover/takeover script than it is to rely on automation. pgbouncer makes this easy enough.
That said, there was a lot of bad luck involved in this incident.
Our PostgreSQL failover plan is very pedestrian, but it works well (even across datacenters). We run streaming replication off of the primary to a pair of replicas, with one replica in another datacenter. The primary write DB advertises a loopback IP out OSPF into a top-of-rack switch, where it's aggregated by BGP and distributed throughout our network. There's a health check script [0] running every 3 seconds that makes sure PG is happy and that it is still writeable.
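Roughly, the writability part of such a check can be as small as this (a sketch, not the actual script from [0]; the healthcheck database, heartbeat table and connection details are made up for illustration):

    # if either check fails, exit non-zero and the routing daemon stops
    # announcing the loopback IP, so the route is withdrawn from BGP
    psql -qAt -h 127.0.0.1 -d postgres -c "SELECT pg_is_in_recovery();" | grep -qx f || exit 1
    psql -qAt -h 127.0.0.1 -d healthcheck -c "UPDATE heartbeat SET ts = now();" || exit 1
    exit 0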
If we want to fail over (nothing automatic), we stop the primary (or it's already dead) and the route is withdrawn. An operator touches the recovery file on the new primary, the health checker sees that, and the IP is announced back into the network. Yes, it's a "VIP", but it's one controlled by our operations team, not automation software. One nice thing about this is that you can fail over across datacenters (remember, it's advertised into our network over BGP) without reconfiguring DNS or messing with application servers.
While the mechanisms are different, we do something very similar for MySQL with MHA. It's still an operator running scripts intentionally though (which is what we want).
I will definitely agree with you that manual operator intervention is better than automated failover.
Pacemaker/Corosync is a good tool if you know exactly how it works. However, very few people know exactly how it works. This is a very old problem for that team if you check the mailing lists. If they offered a few "blessed" configurations or even an online config generator, I'm sure that people would have nicer things to say about it.
I've used heartbeat/pacemaker/corosync with DRBD for over a decade and I'm pretty pleased with it...now. But it took a bit of trial by fire to get it right. Luckily I never lost any data and found the issues in testing. Which gets at the heart of any failover mechanism -- it's all useless unless you test it on a regular basis (just like your backups).
I'd generalize this to many kinds of failover setups, DB or non-DB. Practically each and every one I've seen is in itself less reliable than the thing it's covering for.
> I’ve never in that time seen a pacemaker/corosync/etc/etc configuration go well. Ever.
That mirrors my experience - PG, DRBD, anything else, this form of clustering is a complete joke. If it's important, it's worth spending the money on a grown-up solution. Veritas has its flaws, sure, but in 20+ years on half a dozen different OSes it's never let me down when the balloon went up.
The post you link is still relevant to this discussion, but we can't ignore that 5 years have passed. It would be good to see 2017's Baron revisit the topic :)
How does pgbouncer make this process easy? Just because there are fewer connections to go to the final DB? I've also got a random question you might be able to answer (having a hard time googling)... when using pgsql in streaming replication mode, are created functions replicated/updated to everything else as well?
(just learning about postgres and saw an opportunity to ask someone in the know)
Cheers!
One reason is that some applications do not properly support reconnecting to the database if the connection is lost. With pgbouncer that is not an issue. Another is to avoid having floating IPs or updating the DNS.
I'd assume it's because you can just reconfigure pgbouncer to redirect everything to the secondary, rather than having to update all the applications using pgbouncer. It centralizes the configuration of which database is active.
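As a sketch of what that manual switch can look like (hostnames, paths and the admin port are made up for illustration; pgbouncer's admin console is assumed to be listening on 6432):

    # drain in-flight transactions and hold new ones
    psql -p 6432 -U pgbouncer pgbouncer -c "PAUSE;"
    # repoint the database entry at the promoted standby
    sed -i 's/host=old-primary/host=new-primary/' /etc/pgbouncer/pgbouncer.ini
    # re-read the config and let clients continue against the new primary
    psql -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"
    psql -p 6432 -U pgbouncer pgbouncer -c "RESUME;"

The applications keep talking to the same pgbouncer address the whole time.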
Zero. Update configuration in a central location. Push an update to the configuration manager agent on each node (using a dead simple gossip protocol). The agent wakes up, pulls the new configuration, writes it and reloads pgbouncer. Using an aggressive gossip protocol, the whole process shouldn't take more than 5 seconds on a large cluster, let alone a couple of nodes.
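A minimal sketch of that agent step, under the assumptions above (the central HTTP config store and the pgbouncer admin port are made up for illustration):

    new=$(mktemp)
    curl -fsS http://config-store/pgbouncer.ini -o "$new" || exit 1
    if ! cmp -s "$new" /etc/pgbouncer/pgbouncer.ini; then
        install -m 0640 "$new" /etc/pgbouncer/pgbouncer.ini   # write the new config
        psql -p 6432 -U pgbouncer pgbouncer -c "RELOAD;"      # reload pgbouncer
    fi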
That's a fair amount of moving parts. What if one of the agents misses the update and keeps using the old server? What happens during that period of 5 seconds where the hosts are switching? It doesn't sound like an atomic, instant change process.
It's gossip. There is no precise gossip protocol; I've just mentioned one approach. It's not difficult to modify it so that each node gets the message multiple times, to ensure everyone sees the change. Central configuration management systems like Zookeeper/etcd, ... will also suffer from the same problem: message propagation takes time. Even with a floating IP, a couple of requests may fail until routing tables converge. So there is no atomic configuration switch in either case anyway. The argument was whether a floating IP is easier to deploy than a pure software approach, which is not the case: both are equally challenging. One requires experienced network engineers, the other requires experienced distributed-system designers. In case of failure, the effort to switch away from the failed node is zero in both approaches.
You need the .so's on the replicas simply for the database structure, but afaik, it's only the on-disk format that's being synchronized; functions are not executed on the replicas. I'm not even sure this would throw errors if the .so's were missing - but it will cause problems when a replica is promoted to the new master.
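To make that concrete: CREATE FUNCTION is just a catalog change, so the definition itself travels through the WAL like any other write and shows up on the replicas; only the shared libraries behind C-language functions have to be installed on each node by hand. A quick way to check (hostnames are placeholders):

    # on the primary
    psql -h primary -c "CREATE FUNCTION answer() RETURNS int LANGUAGE sql AS 'SELECT 42';"
    # a moment later on a streaming replica
    psql -h replica -c "\df+ answer"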
My understanding is that the problem is not really with pacemaker/corosync. Those tools are just as consistent as ZK/etcd/Consul. There is also STONITH to make sure the node that goes down can't cause damage once it is back.
The problem is not these tools, but implementing the right thing to do during an outage, or even properly detecting one (which is what happened with GitHub). Your solution might work in 99 cases out of 100, but that remaining 1 case might cause data loss.
When a human is required to do the switch, he/she can typically investigate what happened and make the right decision.
It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.
well corosync/pacemaker is definitely not the same as zk/etcd/consul.
STONITH is mostly a bad idea. Two node clusters are actually always a bad idea. Using a VIP is a bad idea, too.
This is what I learned at the small scale, and at the big scale it's even worse.
The problem in this topic was that they didn't understand corosync/pacemaker correctly. The syntax is awkward and it's hard to configure.
With consul + patroni they would have a way better architecture that would be far easier to understand.
They would not need a VIP (it would work over DNS).
They used archive_command to get WAL files from the primary onto a sync replica. This should NEVER be done if archive_command did not return a sane status code (which in fact it probably did not).
They did not read https://www.postgresql.org/docs/10/static/continuous-archivi... at all.
Last but not least, you should never use restore_command on a sync node when it doesn't need to (always check if the master is alive/healthy before doing it; maybe even check how far behind you are).
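A guarded restore_command along those lines might look like this (purely a sketch; hostnames and paths are invented, and a real one should be more careful):

    #!/bin/bash
    # configured as: restore_command = '/usr/local/bin/guarded_restore %f %p'
    WAL_FILE="$1"; TARGET="$2"
    # if the primary still answers, leave the archive alone and let streaming catch up
    if pg_isready -q -h old-primary -p 5432; then
        exit 1   # non-zero means "not available here"; the standby keeps streaming
    fi
    scp "archive-host:/wal_archive/$WAL_FILE" "$TARGET"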
patroni would've worked in their case. patroni would've made it easy to restart the failed primary.
patroni would be in control of postgresql, which is way better than using pacemaker/corosync (especially combined with a watchdog/softdog).
what would also have helped is two sync nodes, failing over to either of them (this is harder, since sync nodes need to be detached if unhealthy).
and the best thing is that with etcd/consul/zk you could run the etcd/consul/zk cluster on three different nodes from your 3 database servers (this helps a lot).
It's a little lost in another comment thread (https://news.ycombinator.com/item?id=15862584), but I'm definitely excited about solutions like Patroni and Stolon that have come along more recently.
Well you should definitely look into them.
In the past we used corosync/pacemaker a lot (even for different things than just database-ha) but trust me... it was never a sane system. as long as nothing broke, it worked. if something broke it was horrible to actually get back to any sane state at all.
we migrated to patroni (yeah, stolon is cool as well, but since it's a little bit bigger than we need, we went with patroni).
the hardest part for patroni is actually creating a script which creates service files for consul (consul is a little bit weird when it comes to services) or somehow changes dns/haproxy/whatever to point to the new master (this is not a problem on stolon)
but since then we have tried all sorts of failures and never had a problem. we pulled plugs (hard drive, network, power cord) and nothing bad happened no matter what we did. the watchdog worked better than expected in some cases where we tried to fire bad stuff at patroni/overload it. and since it's in python, the characteristics/memory/cpu usage are well understood. (the code is also easy to reason about, at least better than corosync/pacemaker.) etcd/zk/consul is battle tested and kept working even though we have way more network partitions than your typical network (this was bad for galera.. :(:()
we never autostart a failed node after a restart/clean start. we always look into the node and manually start patroni. we also use the role_change/etc hooks to create/delete service files in consul and to ping us if anything on the cluster happens.
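for reference, such a role_change hook can be tiny (patroni passes the action, the new role and the cluster name to the callback; the consul file name and the alerting bit are placeholders for what we actually run):

    #!/bin/bash
    # configured in patroni.yml under postgresql -> callbacks -> on_role_change
    ACTION="$1"; ROLE="$2"; SCOPE="$3"
    SVC=/etc/consul.d/pg-master.json
    if [ "$ROLE" = "master" ]; then
        printf '{"service": {"name": "pg-master", "port": 5432}}\n' > "$SVC"
    else
        rm -f "$SVC"
    fi
    consul reload   # pick up the added/removed service definition
    # ...plus a notification so we get pinged on every role change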
I am currently using Stolon with synchronous replication for a setup, and overall it's great.
It gives me automated failover, and -- perhaps more importantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple of days).
I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance of accidental misconfigurations of the kind the article describes.
I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres, and if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
In general, the Stolon/Patroni approach feels like the "right way" (in absence of failover being built directly into the DB, which would be great to have in upstream postgres).
Cons:
Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
> I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python)
Sounds like a holy-war topic :)
But let's be serious. How does a statically typed language help you avoid bugs in the algorithms you implement? The rest is about proper testing.
> and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres
It seems like you are running a single database cluster. When you have to run and support hundreds of them, you will change your mind.
> if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
HAProxy will do absolutely the same.
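For reference, a minimal sketch of that HAProxy setup (addresses and ports are illustrative; it assumes Patroni's REST API on 8008 returns 200 for /master only on the leader):

    # excerpt from haproxy.cfg -- routes writes only to the current Patroni leader
    listen postgres-primary
        bind *:5000
        option httpchk GET /master
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server pg1 10.0.0.1:5432 check port 8008
        server pg2 10.0.0.2:5432 check port 8008

The on-marked-down shutdown-sessions part is what kills live connections to the old master once its health check fails.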
> Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
Yeah, it proves one more time:
* don't reinvent the wheel: HAProxy vs stolon-proxy
* using statically typed language doesn't really help you to have less bugs
> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
As I've already said before: we are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper. Never had such strange problems.
> > if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
> HAProxy will do absolutely the same.
well I think that is not the same as what stolon-proxy actually provides.
(actually I use patroni)
but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if an application were still connected to the split-off master.
however I do not get the point, because etcd/consul would not allow it to keep the master role, which means the split-off master would lose the master role and thus either die (because it cannot connect to the new master) or just become a read replica, and the application would then probably throw errors if users are still connected to the split-off side.
it highly depends on how big your etcd/consul cluster is and how well your application detects failures.
(since we are highly dependent on our database, we actually kill hikaricp (java) in case of too many master write failures and just restart it after a set amount of time.
we are also looking into creating a small lightweight async driver based on akka, where we do this in a slightly more automated fashion.)
> well I think that is not the same as what stolon-proxy actually provides. (actually I use patroni) but if your network gets split and you end up with two masters (one application writes to the old master) there would be a problem if an application were still connected to the split-off master.
On a network partition Patroni will not be able to update the leader key in etcd and will therefore restart postgres in read-only mode (create recovery.conf and restart). No writes will be possible.
it would be interesting to know how stolon/patroni deal with the failover edge cases and how this impacts availability. like, if you're accessing the DB but can't contact etcd/consul then you should stop accessing the DB because you might start doing unsafe writes. but this means that consul/etcd is now a point of failure (though this usually runs on multiple nodes, so it shouldn't happen!). but you can end up in a situation where bugs/issues with the HA system end up causing you more downtime than a manual failover would.
you also have to be careful to ensure there are sufficient time gaps when failing over, to cover the case where the master is not really down and connections are still writing to it. like, the patroni default haproxy config doesn't even seem to kill live connections, which seems kind of risky.
Thanks for the extra info, and the insight into how you're using Patroni. Always helpful to hear about someone using it for real, especially someone who's come from Pacemaker. :)
Patroni is great. Running dockerised Postgres with a consul backend for years without a hitch. Haproxy as the lb. What's that? A replica needs a reboot? Just reboot it. The primary? Just fail over to a replica and reboot. Undesired reboots recover in under 10 seconds, during which only the primary is unavailable but the replicas still are.