I am currently using Stolon with synchronous replication for a setup, and overall it's great.
It gives me automated failover, and -- perhaps more importantly -- the opportunity to exercise it a lot: I can reboot single servers willy-nilly, and do so regularly (for security updates every couple of days).
I picked the Stolon/Patroni approach over Corosync/Pacemaker because it seems simpler and more integrated; it fully "owns" the postgres processes and controls what they do, so I suspect there is less chance of accidental misconfigurations of the kind the article describes.
I currently prefer Stolon over Patroni because statically typed languages make it easier to have fewer bugs (Stolon is Go, Patroni is Python), and because the proxy it brings out of the box is convenient: on any machine I connect to localhost:5432 to get to postgres, and if Postgres fails over, it makes sure to disconnect me so that I'm not accidentally still connected to a replica.
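To illustrate, here is a minimal sketch of what that looks like from an application's point of view (the DSN and retry policy are my own assumptions, not Stolon specifics): the app always dials localhost:5432, and when the proxy drops the connection on failover, it simply reconnects.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed driver; any postgres driver works
)

func main() {
	// Always connect to the local stolon-proxy, never to a specific node.
	// Connection string values are placeholders for illustration.
	db, err := sql.Open("postgres",
		"host=localhost port=5432 user=app dbname=app sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// On failover the proxy closes our connections, so this ping
		// fails once and succeeds again once the pool reconnects to
		// the new master.
		if err := db.Ping(); err != nil {
			log.Printf("connection dropped (failover?): %v", err)
		}
		time.Sleep(time.Second)
	}
}
```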
In general, the Stolon/Patroni approach feels like the "right way" (in the absence of failover being built directly into the DB, which would be great to have in upstream postgres).
Cons:
Bugs. While Stolon works great most of the time, every couple of months I get some weird failure. In one case a stolon-keeper refused to come back up, printing an error message; in another, a failover didn't happen; in a third, Consul stopped working (I suspect a Consul bug: the create-session endpoint hung even when used via plain curl), and as a result stale Stolon state accumulated in the Consul KV store, with entries that should not have been there, so Stolon refused to start correctly.
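For what it's worth, that hang is easy to reproduce or rule out with a direct call against Consul's session API (PUT /v1/session/create is the real endpoint; the local agent address, payload, and timeout here are assumptions):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Fail fast instead of hanging forever, so a stuck endpoint is visible.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPut,
		"http://127.0.0.1:8500/v1/session/create", // default local Consul agent
		strings.NewReader(`{"Name": "probe", "TTL": "15s"}`))
	if err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatalf("session create failed or hung: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("session create returned HTTP %d", resp.StatusCode)
}
```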
I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is for more people to use Stolon.
> I currently prefer Stolon over Patroni because statically typed languages make it easier to have fewer bugs (Stolon is Go, Patroni is Python)
Sounds like a holy-war topic :)
But let's be serious: how does a statically typed language help you avoid bugs in the algorithms you implement? The rest is about proper testing.
> and because the proxy it brings out of the box is convenient: on any machine I connect to localhost:5432 to get to postgres
It seems like you are running a single database cluster. When you have to run and support hundreds of them, you will change your mind.
> if Postgres fails over, it makes sure to disconnect me so that I'm not accidentally still connected to a replica.
HAProxy will do absolutely the same.
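For context: in the usual Patroni setup, HAProxy health-checks each node via Patroni's REST API, which answers HTTP 200 only on the current leader, and routes connections accordingly. A sketch of that check from a client's point of view (port 8008 and the /master endpoint are the common defaults; the hostnames are hypothetical):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// isLeader mimics the HAProxy health check: Patroni's REST API answers
// 200 on the node that currently holds the leader role, non-200 otherwise.
func isLeader(host string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:8008/master", host))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	for _, host := range []string{"pg1.example", "pg2.example"} { // hypothetical hosts
		fmt.Printf("%s leader=%v\n", host, isLeader(host))
	}
}
```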
> Bugs. While Stolon works great most of the time, every couple of months I get some weird failure. In one case a stolon-keeper refused to come back up, printing an error message; in another, a failover didn't happen; in a third, Consul stopped working (I suspect a Consul bug: the create-session endpoint hung even when used via plain curl), and as a result stale Stolon state accumulated in the Consul KV store, with entries that should not have been there, so Stolon refused to start correctly.
Yeah, it proves once again:
* don't reinvent the wheel: HAProxy vs stolon-proxy
* using a statically typed language doesn't really help you have fewer bugs
> I suspect that, as with other distributed systems that are intrinsically hard to get right, the best way to get rid of these bugs is for more people to use Stolon.
As I've already said: we are running a few hundred Patroni clusters with etcd and a few dozen with ZooKeeper, and we've never had such strange problems.
> > if Postgres fails over, it makes sure to disconnect me so that I'm not accidentally still connected to a replica.
> HAProxy will do absolutely the same.
Well, I think that is not the same as what stolon-proxy actually provides.
(actually I use Patroni)
But if your network gets split and you end up with two masters (one application writing to the old master), there would be a problem if an application were still connected to the partitioned master.
However, I don't quite see the problem, because etcd/consul would not allow the old master to keep holding the master role; the partitioned master would lose the role and thus either die (because it cannot connect to the new master) or become a read-only replica, and the application would then probably throw errors if users are still connected to the partitioned node.
It highly depends on how big your etcd/consul cluster is and how well your application detects failures.
(Since we are highly dependent on our database, we actually kill HikariCP (Java) when there are too many master write failures and just restart it after a set amount of time; a sketch of the idea follows below. We are also looking into creating a small, lightweight async driver based on Akka, where we do this in a slightly more automated fashion.)
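In Go terms (our actual implementation is Java/HikariCP; this is just a rough sketch of the same idea, with made-up thresholds and a placeholder write): count consecutive write failures, tear the pool down past a threshold, and recreate it after a pause.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed driver
)

const (
	maxFailures = 5                // made-up threshold
	cooldown    = 30 * time.Second // made-up restart delay
)

func main() {
	dsn := "host=localhost port=5432 user=app dbname=app sslmode=disable" // placeholder DSN
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}

	failures := 0
	for {
		// Placeholder write; any master-only statement would do.
		if _, err := db.Exec("UPDATE heartbeat SET ts = now()"); err != nil {
			failures++
			if failures >= maxFailures {
				// Too many master write failures: kill the pool entirely,
				// wait, then recreate it -- the same move we do with HikariCP.
				db.Close()
				time.Sleep(cooldown)
				if db, err = sql.Open("postgres", dsn); err != nil {
					log.Fatal(err)
				}
				failures = 0
			}
			continue
		}
		failures = 0
		time.Sleep(time.Second)
	}
}
```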
> Well, I think that is not the same as what stolon-proxy actually provides. (actually I use Patroni) But if your network gets split and you end up with two masters (one application writing to the old master), there would be a problem if an application were still connected to the partitioned master.
On a network partition, Patroni will not be able to update the leader key in etcd and will therefore restart postgres in read-only mode (it creates a recovery.conf and restarts). No writes will be possible.
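You can also verify this demotion from the client side: pg_is_in_recovery() is a standard Postgres function that returns true once a node is running as a read-only replica. A minimal guard (connection details are placeholders):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed driver
)

func main() {
	db, err := sql.Open("postgres",
		"host=localhost port=5432 user=app dbname=app sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// pg_is_in_recovery() reports true on a demoted/read-only node,
	// so writes can be refused before they ever reach the server.
	var inRecovery bool
	if err := db.QueryRow("SELECT pg_is_in_recovery()").Scan(&inRecovery); err != nil {
		log.Fatal(err)
	}
	if inRecovery {
		log.Fatal("node is read-only (demoted); refusing writes")
	}
	log.Println("node accepts writes")
}
```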
It would be interesting to know how Stolon/Patroni deal with the failover edge cases and how this impacts availability. For example, if you can access the DB but can't contact etcd/consul, then you should stop accessing the DB, because you might start doing unsafe writes. But this means that consul/etcd is now a point of failure (though it usually runs as multiple nodes, so this shouldn't happen!). And you can end up in a situation where bugs/issues with the HA system cause you more downtime than manual failover would.
You also have to be careful to ensure there are sufficient time gaps when failing over, to cover the case where the master is not really down and connections are still writing to it. For example, the default Patroni HAProxy config doesn't even seem to kill live connections, which seems kind of risky.
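One client-side mitigation (my suggestion, not anything Patroni ships): cap connection lifetime in the pool, so a connection that survived a failover can't keep pointing at the old master indefinitely. In Go's database/sql that is a one-liner; the endpoint and lifetime value below are arbitrary assumptions.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumed driver
)

func main() {
	// Hypothetical HAProxy frontend address; adjust to your setup.
	db, err := sql.Open("postgres",
		"host=haproxy.example port=5000 user=app dbname=app sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// Recycle every pooled connection after 30s (an arbitrary value), so even
	// if the proxy never kills live connections, stale ones age out quickly.
	db.SetConnMaxLifetime(30 * time.Second)
	log.Println("pool configured; use db as usual")
}
```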