I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. That exceeds my threshold of acceptability even for non-HA requirements. After all, if I can't trust them with this, why would I ever consider giving them my HA business? Based on napkin math for us, this could have meant a loss of nearly half a million dollars. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.
I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all. I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.
If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that they are somehow responsible for the integrity of the data on any single host.
They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.
If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.
Directly attached storage in AWS is a special niche that disappears if you so much as hibernate. And even then, they tell you that a disk failure loses the data but a power failure won't.
This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.
In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made this comment I hadn't yet learned some of the bits you mentioned here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly even reconsider offering this configuration, given that they refer to it as a "single-node cluster", which is an oxymoron.
I would think so; it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one; the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to operate, and has operated, for the past decade.
You're saying a single server failure is going to cost your business half a million dollars?
This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.
No, it wouldn't, at least not given the contextual details of this situation, because we wouldn't do that. Honestly, there are parts of my above comment that hold, but I admit it was a bit impulsive of me in the moment because I hadn't yet learned all of the details necessary to make that judgment call. That number is right under slightly different circumstances, if you're asking, but it sounds like you were trying to prove a point. If that's true, you succeeded. I learned a bit later that what they were calling a cluster was a single server, and that's just... yeah.