It's really disappointing that they made this forum thread private, apparently in response to this HN thread blowing up. This is the first negative HN thread I've seen about them, and it's not even really that bad: this kind of downtime is expected, they can't get to every forum post, and their response that someone posted here is totally reasonable in my opinion.
So why is the link to the thread 404ing, and why does this post have to link to the Google webcache of it? I've grown to like fly.io and use them for my side projects now, and this just isn't something they would do. Going through some minor cognitive dissonance right now :/
I wonder if there will ever be a wake-up call for the arrogance of people at fly.io.
At work, when it came up in a meeting, people went around with horror stories of broken elements while the status page wasn't updated, terrible communication, and an overall attitude that nothing is wrong, even when servers go down for days at a time.
There's a global status page, and then there's a local update for people with instances on an affected host --- past some threshold of hosts, the probability of having an issue on some random host gets pretty high just because math. The local status thing happened for people with instances on that machine.
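To put rough numbers on the "just because math" part, here's a quick sketch. It assumes each host independently has some small chance of being in trouble at any given moment; the 0.1% figure and the fleet sizes are made up for illustration, not real fly.io numbers.

```python
def prob_any_host_down(num_hosts: int, per_host_prob: float) -> float:
    """Chance that at least one of num_hosts has an issue right now.

    Assumes hosts fail independently with the same probability, which is
    a simplification: real outages are often correlated.
    """
    return 1.0 - (1.0 - per_host_prob) ** num_hosts


if __name__ == "__main__":
    # Hypothetical per-host probability of an ongoing issue (0.1%).
    p = 0.001
    for n in (10, 100, 1000, 5000):
        print(f"{n:>5} hosts -> {prob_any_host_down(n, p):.1%} chance some host is affected")
```

The point is just that the fleet-wide chance climbs toward 100% as the host count grows, even when any individual host is almost always fine, which is why a per-host notice makes more sense than lighting up the global page for every single-host blip.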
Ordinarily, a single-host incident takes a couple minutes to resolve, and, ordinarily, when it's resolved, everything that was running on the host pops right back up. This single-host outage wasn't ordinary. Somehow, a containerd boltdb got corrupted, and it took something like 12 hours for a member of our team (themselves a containerd maintainer) to do some kind of unholy surgery on that database to bring the machine back online.
The runbook we have for handling and communicating single-host outages wasn't tuned for this kind of extended outage. It will be now. Probably we'll just paint the global status page when a single-host outage crosses some kind of time threshold.