The way to not be exposed to this is to run an HA configuration with more than one instance.
If you're running an app on Fly.io without local durable storage, then it's easy to fail over to another server. But durable storage on Fly.io is attached NVMe storage.
By far the most common way people use durable storage on Fly.io is with Postgres databases. If you're doing that on Fly.io, we automatically manage failover at the application layer: you run multiple instances, they configure themselves in a single-writer multi-reader cluster, and if the leader fails, a replica takes over.
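To make the failover behavior concrete, here is a toy Python sketch of a single-writer multi-reader cluster. It is only an illustration, not Fly.io's actual implementation: the `Instance` and `Cluster` classes, the `pg-1` style names, and the "promote the first healthy replica" rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str             # hypothetical instance name, e.g. "pg-1"
    healthy: bool = True


@dataclass
class Cluster:
    instances: List[Instance]
    leader: Optional[str] = None  # name of the single writer, if any

    def __post_init__(self) -> None:
        self.elect()

    def elect(self) -> Optional[str]:
        """Keep the current leader if it is healthy; otherwise promote
        the first healthy instance (an assumed, simplistic rule)."""
        by_name = {i.name: i for i in self.instances}
        if self.leader is not None and by_name[self.leader].healthy:
            return self.leader
        self.leader = next((i.name for i in self.instances if i.healthy), None)
        return self.leader

    def fail(self, name: str) -> Optional[str]:
        """Mark an instance as failed and re-run the election."""
        for i in self.instances:
            if i.name == name:
                i.healthy = False
        return self.elect()

    def writable(self) -> bool:
        """The cluster accepts writes only while it has a leader."""
        return self.leader is not None


# Three instances: losing the leader promotes a replica.
ha = Cluster([Instance("pg-1"), Instance("pg-2"), Instance("pg-3")])
ha.fail("pg-1")

# One instance: losing it leaves nothing to promote.
single = Cluster([Instance("pg-1")])
single.fail("pg-1")
```

In the three-instance case the cluster stays writable after the leader fails; in the single-instance case there is nothing left to take over, which is the availability hit described here.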
We will let you run a single-instance Postgres "cluster", and people definitely do that. The downside to that configuration is, if the host you're on blows up, your availability can take a hit. That's just how the platform works.
I see. Have you considered eliminating this configuration from your offering? It sounds like the terminology could confuse people, and they may be assuming that a "cluster" is something more than what it really is (a single host). This kind of thing is difficult for those seeking to build managed services: I think people expect you to provide offerings that can't harm them when the cause is the service they're paying for, and it's difficult to figure out which sharp objects they understand and which ones they don't. People should know better, but if they did, would they need you?
If this sounds ludicrous, then I think I probably don't understand who Fly.io wants to be and that's okay. If I don't understand, however, you may want to take a look at your image and messaging to potentially recalibrate what kind of customers you're attracting.
Plenty of people would rather take downtime than pay for redundancy, for example for a test database.
AWS RDS lets you spin up an RDS instance that costs 3x less and regularly has downtime (the 'single-az' one), quite similar to this.
Anyone who's used servers before knows "A single instance" is the same as "sometimes you might have downtime".
Computers aren't magic. Everyone from Heroku (you must have multiple dynos to be highly available) to EC2 (multiple instances across AZs) agrees that "a single machine is not redundant".
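As a rough back-of-the-envelope illustration of why one machine is not redundant, here is a small Python sketch. It assumes instances fail independently, which real hosts in one region often don't, and the 99% per-instance figure is a made-up number for the example, not a figure from any provider.

```python
def availability(per_instance: float, n: int) -> float:
    """Probability that at least one of n instances is up,
    assuming each is up with probability per_instance and
    failures are independent (a simplifying assumption)."""
    return 1 - (1 - per_instance) ** n


# Hypothetical 99% per-instance availability.
one = availability(0.99, 1)  # single instance
two = availability(0.99, 2)  # two independent instances
```

Under those assumptions, a second instance takes you from roughly 0.99 to roughly 0.9999, which is the gap people are paying for (or choosing not to pay for) when they pick between a single instance and an HA setup.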
I don't see how Fly's messaging is out of line with that. They don't tell you anywhere "Our apps and machines are literally magic and will never fail".
Sure, but isn't this really about risk tolerance and how much your customers care? The responsibility should be on the customer's end. Running on EBS/RDS doesn't guarantee you won't lose data. If you care about it, you enable backups and test recovery.
Just because some customers are less tolerant of faults than others doesn't mean we shouldn't offer those options to people who don't have the same requirements or are willing to work around the limitations.
Unless something has changed and I'm out of date, I think a piece of context here is that Fly Postgres isn't really a managed service offering. From what I've seen, Fly does try to message this, but I think it's still easy for some subset of customers to miss that they're deploying an OSS component, to deploy a non-HA setup and forget about it, and to not realize it's not the same as buying a database as a service.
So hopefully as Fly.io gets more popular, there will be some compelling managed offerings. I saw comments at one point from the Neon CEO about a Fly.io offering, but I'm not sure if that went anywhere. I'm sure customers can also use Crunchy or other offerings.