My understanding is that the problem is not really with pacemaker/corosync. Those tools are also always consistent, just like ZK/etcd/Consul. There is also STONITH to make sure a node that goes down can't cause damage once it is back.
The problem is not these tools, but implementing the right thing to do during an outage, or even properly detecting one (which is what happened with github). Your solution might work in 99 cases out of 100, but the remaining 1 case might cause you data loss.
When a human is required to do the switch, he/she can typically investigate what happened and make the right decision.
It's theoretically possible to have a foolproof solution that always works right, but that's extremely hard to implement, because you need to know in advance what kind of issues you will have, and if you miss something, that's one case where your tool might make a wrong decision.
Well, corosync/pacemaker is definitely not the same as zk/etcd/consul.
STONITH is mostly a bad idea. Two node clusters are actually always a bad idea. Using a VIP is a bad idea, too.
This is what I learned at small scale, and at large scale it's even worse.
The problem in this case was that they didn't understand corosync/pacemaker correctly. The syntax is awkward and it's hard to configure.
With consul + patroni they would have a much better architecture that would be much easier to understand.
They would not need a VIP (it would work over DNS).
They used archive_command to get a WAL file from the primary on a sync replica. This should NEVER be done if archive_command did not return a sane status code (which in fact it probably did not).
They did not read https://www.postgresql.org/docs/10/static/continuous-archivi... at all.
Last but not least, you should never use restore_command on a sync node when it doesn't need to (always check if the master is alive/healthy before doing it; maybe even check how far behind you are).
Patroni would've worked in their case. Patroni would've made it easy to restart the failed primary.
Patroni would be in control of postgresql, which is way better than using pacemaker/corosync (especially combined with a watchdog/softdog).
What also would have helped is two sync nodes and failing over to either of them (this is harder, since sync nodes need to be detached if unhealthy).
And the best thing is that with etcd/consul/zk you could run the etcd/consul/zk cluster on three nodes separate from your 3 database servers (this helps a lot).
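To make the "check how far behind you are" part concrete, here is a minimal sketch that turns two WAL positions into a byte lag. It assumes you already fetched the LSN strings yourself (for example from pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica in PostgreSQL 10). The function names and the 1 MiB threshold are made up for illustration.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' into a byte position.

    An LSN is printed as two hex numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))


def is_caught_up(primary_lsn: str, replica_lsn: str,
                 max_lag_bytes: int = 1024 * 1024) -> bool:
    """Hypothetical policy: treat the replica as caught up below 1 MiB of lag."""
    return replay_lag_bytes(primary_lsn, replica_lsn) <= max_lag_bytes


if __name__ == "__main__":
    print(replay_lag_bytes("0/3000060", "0/3000000"))
```

A check like this only tells you the distance; what to do when the lag is large (keep streaming, pull from the archive, page a human) is still a policy decision.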
It's a little lost in another comment thread (https://news.ycombinator.com/item?id=15862584), but I'm definitely excited about solutions like Patroni and Stolon that have come along more recently.
Well, you should definitely look into them.
In the past we used corosync/pacemaker a lot (even for things other than database HA), but trust me... it was never a sane system. If it ain't broke, it worked. If something broke, it was horrible to get back to any sane state at all.
We migrated to patroni (yeah, stolon is cool as well, but since it's a little bit bigger than we need, we used patroni).
The hardest part for patroni is actually creating a script which creates service files for consul (consul is a little bit weird when it comes to services) or somehow changes dns/haproxy/whatever to point to the new master (this is not a problem on stolon).
But since then we have tried all sorts of failures and never had a problem. We pulled plugs (hard drive, network, power cord); nothing bad happened no matter what we did. Watchdog worked better than expected in some cases where we tried to fire bad stuff at patroni or overload it. And since it's in Python, the performance characteristics (memory/CPU usage) are well understood. (The code is also easy to reason about, at least easier than corosync/pacemaker.) etcd/zk/consul is battle tested and worked even though we have way more network partitions than your typical network (this was bad for galera.. :( )
We never autostart a failed node after a restart/clean start. We always look into the node and manually start patroni. We also use the role_change/etc hooks to create/delete service files in consul and to ping us if anything happens on the cluster.
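A rough sketch of what such a role-change hook can look like, not the exact script described above. It assumes Patroni calls the callback as `<script> <action> <role> <cluster-name>`, that the leader role is reported as "master" or "primary" depending on the Patroni version, and that the local Consul agent loads service definitions from a config directory. The directory, service name and tags are hypothetical.

```python
import json
import sys
from pathlib import Path

# Hypothetical directory that the local Consul agent loads service files from.
SERVICE_DIR = Path("/etc/consul.d")


def service_definition(role, cluster, port=5432):
    """Return a Consul service definition for a leader, or None for any other role."""
    if role in ("master", "primary"):
        return {
            "service": {
                "name": f"{cluster}-primary",  # hypothetical naming scheme
                "tags": ["postgres", "primary"],
                "port": port,
            }
        }
    return None


def apply_role_change(role, cluster, service_dir=SERVICE_DIR):
    """Create the service file when leading, delete it otherwise."""
    path = Path(service_dir) / f"{cluster}-primary.json"
    definition = service_definition(role, cluster)
    if definition is None:
        path.unlink(missing_ok=True)  # needs Python 3.8+
        return None
    path.write_text(json.dumps(definition, indent=2))
    return path


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed argument order: action, role, cluster name.
    _action, role, cluster = sys.argv[1:4]
    apply_role_change(role, cluster)
    # The agent still has to pick up the change, e.g. via `consul reload`.
```

Alerting ("ping us") would hang off the same entry point; it is left out here to keep the sketch short.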
I am currently using Stolon with synchronous replication for a setup, and overall it's great.
It gives me automated failover, and -- perhaps more importantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple days).
I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance of accidental misconfiguration of the kind the article describes.
I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres, and if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
In general, the Stolon/Patroni approach feels like the "right way" (in absence of failover being built directly into the DB, which would be great to have in upstream postgres).
Cons:
Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
> I currently prefer Stolon over Patroni because statically typed languages make it easier to have less bugs (Stolon is Go, Patroni is Python)
Sounds like a holy-war topic :)
But let's be serious. How does a statically typed language help you avoid bugs in the algorithms you implement? The rest is about proper testing.
> and because the proxy it brings out of the box makes it convenient: On any machine I connect to localhost:5432 to get to postgres
It seems like you are running a single database cluster. When you have to run and support hundreds of them, you will change your mind.
> if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
HAProxy will do absolutely the same.
> Bugs. While Stolon works great most of the time, every couple months I get some weird failure. In one case it was that a stolon-keeper would refuse to come back up with an error message, in another that a failover didn't happen, in a third that Consul stopped working (I suspect a Consul bug, the create-session endpoint hung even when used via plain curl) and as a result some stale Stolon state accidentally accumulated in the Consul KV store, with entries existing that should not be there and thus Stolon refusing to start correctly.
Yeah, it proves one more time:
* don't reinvent the wheel: HAProxy vs stolon-proxy
* using statically typed language doesn't really help you to have less bugs
> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is if more people use Stolon.
As I've already said before: we are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper. Never had such strange problems.
> > if the Postgres fails over, it ensures to disconnect me so that I'm not accidentally connected to a replica.
> HAProxy will do absolutely the same.
Well, I think that is not the same as what stolon-proxy actually provides.
(actually I use patroni)
But if your network gets split and you end up with two masters (one application writes to the old master), there would be a problem if an application were still connected to the partitioned master.
However, I do not get the point, because etcd / consul would not allow the old node to keep holding the master role. That means the partitioned master would lose the master role and thus either die, because it cannot connect to the new master, or just become a read slave, and the application would then probably throw errors if users are still connected to the partitioned application.
It highly depends on how big your etcd/consul is and how well your application detects failures.
(Since we are highly dependent on our database, we actually kill hikaricp (java) in case of too many master write failures and just restart it after a certain amount of time.
We are also looking into creating a small lightweight async driver based on akka, where we do this in a slightly more automated fashion.)
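The "kill the pool after too many write failures, restart it after a while" idea is basically a circuit breaker. A minimal sketch of that pattern, not the actual HikariCP setup described above; the class name, thresholds and injected clock are assumptions for illustration:

```python
import time


class WriteFailureBreaker:
    """Stop sending writes after repeated failures, retry after a cooldown."""

    def __init__(self, max_failures=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._failures = 0           # consecutive write failures seen so far
        self._opened_at = None       # time the breaker tripped, None if closed

    def record_success(self):
        self._failures = 0

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.max_failures and self._opened_at is None:
            self._opened_at = self._clock()  # trip: "kill the pool"

    def allow_writes(self):
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown over: "restart the pool" and start counting again.
            self._opened_at = None
            self._failures = 0
            return True
        return False
```

The application would call allow_writes() before handing out a connection and record_failure()/record_success() after each write attempt.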
> Well, I think that is not the same as what stolon-proxy actually provides. (actually I use patroni) But if your network gets split and you end up with two masters (one application writes to the old master), there would be a problem if an application were still connected to the partitioned master.
On a network partition, Patroni will not be able to update the leader key in etcd and will therefore restart postgres in read-only mode (create recovery.conf and restart). No writes will be possible.
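A toy model of that leader-key behaviour, to show why a partitioned node ends up read-only. This is not Patroni's code: FakeDCS is an in-memory stand-in for etcd/Consul/ZooKeeper, and the names, the TTL and the per-node partition set are assumptions.

```python
class FakeDCS:
    """In-memory stand-in for a leader key with a TTL; not a real etcd client."""

    def __init__(self):
        self.leader = None
        self.expires_at = 0.0
        self.unreachable_from = set()  # nodes that are partitioned away

    def acquire_or_renew(self, node, ttl, now):
        if node in self.unreachable_from:
            raise ConnectionError("DCS unreachable")
        # Free key, our own key, or an expired lease: take or renew it.
        if self.leader in (None, node) or now >= self.expires_at:
            self.leader, self.expires_at = node, now + ttl
            return True
        return False


def ha_step(dcs, node, ttl, now):
    """Return the role the node should run in after one HA loop iteration."""
    try:
        holds_key = dcs.acquire_or_renew(node, ttl, now)
    except ConnectionError:
        # Cannot prove leadership any more, so stop accepting writes.
        return "replica"
    return "primary" if holds_key else "replica"


if __name__ == "__main__":
    dcs = FakeDCS()
    print(ha_step(dcs, "a", ttl=30, now=0))
    dcs.unreachable_from.add("a")
    print(ha_step(dcs, "a", ttl=30, now=10))
```

In this model the other node can only take over once the old lease has expired, so there is a window with no primary at all rather than two. That only holds if the old primary really demotes itself before the TTL runs out.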
It would be interesting to know how stolon/patroni deal with the failover edge cases and how this impacts availability. Like, if you are accessing the DB but can't contact etcd/consul, then you should stop accessing the DB because you might start doing unsafe writes. But this means that consul/etcd is now a point of failure (though this usually runs on multiple nodes, so it shouldn't happen!). And you can end up in a situation where bugs/issues with the HA system end up causing you more downtime than manual failover would.
You also have to be careful to ensure there are sufficient time gaps when failing over, to cover the case where the master is not really down and connections are still writing to it. For example, the patroni default haproxy config doesn't even seem to kill live connections, which seems kind of risky.
Thanks for the extra info, and the insight into how you're using Patroni. Always helpful to hear about someone using it for real, especially someone who's come from Pacemaker. :)