I'm surprised by your risk tolerance. If I had any cloud service at this level in my stack go down for three days, I'd start shopping for an alternative. That exceeds my threshold of acceptability even for non-HA requirements. After all, if I can't trust them with this, why would I ever consider giving them my HA business? Based on napkin math for us, this could have meant a loss of nearly half a million dollars. Up until this point, I've looked at Fly.io's approach to PR and their business as unconventional but endearing. Now I'm beginning to look at them as unserious. I'm sorry if that sounds harsh. It's the cold truth.
I think you're not exposed enough to the reality of hardware. There was no need for the host to come back online at all. I think it was a mistake for Fly.io to even attempt it. Just tell the customer the host was lost and offer them a new one (with a freshly zeroed volume attached). You rent a machine, it breaks, you get a new one.
If they're sad that they lost their data, it's their fault for running on a single host with no backup. By actually performing an (apparently) difficult recovery, they reinforced their customers' erroneous expectation that they are somehow responsible for the integrity of the data on any single host.
They're not responsible for extreme data recovery, but (almost?) all of the customer data volumes on that server were completely intact. They damn well should be responsible for getting that data back to their customers, whether or not they get the server going again.
If you run off a single drive, and the drive dies, any resulting data loss is your fault. But not if something else dies.
Directly attached storage in AWS is a special niche that disappears if you so much as hibernate. And even then, they tell you that a disk failure loses the data but a power failure won't.
This is much closer to EBS breaking. It happens sometimes, but if the data is easily accessible then it shouldn't get tossed.
In hindsight I wish I could edit, because my above comment was pretty trigger-happy and overly focused on the amount of downtime. It was colored by some existing preconceptions I had about Fly, and I'm honestly surprised it continues to be upvoted. When I made this comment I hadn't yet learned some of the bits you mentioned here at the end from another thread. Anyway, I tend to agree overall. I actually suggested Fly even reconsider offering this configuration, given that they refer to it as a "single-node cluster", which is an oxymoron.
I would think so; it's honestly strange to think about. The idea of having the node come back after it broke is a bit ridiculous to me. A node breaks, you delete it from your interface and provision a new one; the idea of even waiting 5 minutes for it to come up is strange. This whole conversation seems detached from how the cloud is supposed to operate, and has operated, for the past decade.
You're saying a single server failure is going to cost your business half a million dollars?
This was a server with local NVMe storage. The simplest thing to do would have been to just get rid of it, but we have quite a few free users with data they care about running on single node Postgres (because it's cheaper). It seemed like a better idea to recover this thing.
No, it wouldn't, at least not given the contextual details of this situation, because we wouldn't do that. Honestly, there are parts of my above comment that hold, but I admit it was a bit impulsive of me in the moment because I hadn't yet learned all of the details necessary to make that judgment call. That number is right under slightly different circumstances, if you're asking, but it sounds like you were trying to prove a point. If that's true, you succeeded. I learned a bit later that what they were calling a cluster was a single server, and that's just... yeah.