I worked on a web store once that depended on a couple of 3rd party web services and set up subscription billing. The project turned out to be like an iceberg, where 10% of the work went into features, and 90% of the work went into heading off all of the various possible failure modes. When the system failed, people's credit cards got billed the wrong amount and either the store didn't get paid or the customer got overcharged. So I can relate to all of the effort they put into fail-safety.
As in, the mission-critical, high-bandwidth stuff is on distributed servers they manage themselves, like the traditional companies do. Unsurprising to us skeptics of going all-in on clouds.
Architecting applications for the cloud is a non-trivial problem. A lot of forethought needs to go into choosing each infrastructure component. And resiliency and robustness (which can be bucketed into 'graceful degradation'), along with the most important piece, monitoring, need to be put in place right from the initial days of the app, so that rearchitecting does NOT end up being a costly affair.
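To make 'graceful degradation' plus monitoring a bit more concrete, here is a minimal sketch of one common pattern: wrap a non-critical dependency so that a failure is logged and the page still renders with a reduced result. The names `with_fallback`, `flaky_recommendations` and `render_product_page` are made up for illustration, and the "monitoring" here is only a log line that you would presumably wire into real alerting.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("degrade")


def with_fallback(primary: Callable[[], T], fallback: T, name: str) -> T:
    """Call primary(); on any failure, log it and return the fallback value."""
    try:
        return primary()
    except Exception:
        # The log record is the monitoring hook (assumption: logs feed alerts).
        log.exception("dependency %s failed, serving degraded response", name)
        return fallback


def flaky_recommendations() -> list:
    # Hypothetical third-party call that is currently failing.
    raise TimeoutError("recommendation service timed out")


def render_product_page() -> dict:
    # Recommendations are nice to have, so an empty list is an acceptable fallback.
    recs = with_fallback(flaky_recommendations, [], "recommendations")
    return {"product": "widget", "recommendations": recs}
```

The design choice worth arguing about is which dependencies get a fallback at all: something like billing probably should fail loudly rather than degrade quietly.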
Couldn't agree more. The cloud just makes it clear where you didn't architect for resiliency properly; in your own DC you work around this by doing things that you can't do in the cloud.
But the biggest difference is automation. Because in the cloud you have a full suite of APIs, you can literally automate every piece of your infrastructure. In your own DC you have to design this stuff to be automated from the ground up, and just making a poor layer 2 choice can make that task astronomically more difficult.
At my company, we've decided that if multiple AWS regions are down, we're down too.
We use so many AWS services that porting to a second cloud provider is cost-prohibitive at this point. So we're willing to accept that if 2 USA regions and one European region are down, we're down too.
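A rough sketch of what that policy looks like in code, assuming a health map that something else (a health checker, not shown) keeps up to date. The region names and the `pick_region` helper are illustrative, not our actual setup.

```python
from typing import Dict, List, Optional

# Illustrative priority order: two USA regions, then one European region.
REGION_PRIORITY: List[str] = ["us-east-1", "us-west-2", "eu-west-1"]


def pick_region(
    health: Dict[str, bool],
    priority: List[str] = REGION_PRIORITY,
) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in priority:
        # A region missing from the health map is treated as down.
        if health.get(region, False):
            return region
    # Every configured region is down: we accept being down too.
    return None
```

For example, `pick_region({"us-east-1": False, "us-west-2": True})` returns `"us-west-2"`, and an empty health map returns `None`, which is the "we're down too" case.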
> At my company, we've decided that if multiple AWS regions are down, we're down too.
We've come to the same conclusion. We cannot afford the engineering hours to run duplicate infrastructure just in case the big boys go down.
If multiple AWS regions go down... we wait until they are not down. Our developers sighed in relief when we decided that this was going to be our course of action... but we're approaching the scale that requires us to really mitigate against large scale outages like this. For now we bury our heads.
I think it's best to keep your critical infrastructure portable. It usually doesn't make financial sense to have a hot backup on another cloud service but having a plan to quickly switch it over in case of emergency is a good idea.
For example, using something like Terraform you could spin up infrastructure on AWS, but in a pinch you could make your templates work for Google, Azure, etc.
That said, if your infrastructure is at all complex it can be very difficult to keep it portable.
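On the application side, one way to keep things portable is to code against a small interface of your own instead of a provider SDK. The sketch below is an assumption about how that could look, not a recipe: `BlobStore`, `InMemoryStore` and `save_invoice` are hypothetical names, and the in-memory class stands in for real adapters that would wrap each provider's storage client.

```python
from typing import Dict, Protocol


class BlobStore(Protocol):
    """The only storage operations the application is allowed to depend on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a provider-specific adapter (hypothetical, for illustration)."""

    def __init__(self, provider: str) -> None:
        self.provider = provider
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def save_invoice(store: BlobStore, invoice_id: str, body: bytes) -> str:
    # Application code sees only BlobStore, never a provider SDK.
    key = f"invoices/{invoice_id}"
    store.put(key, body)
    return key
```

Switching providers then means writing one new adapter and passing it to `save_invoice`, which is roughly where the difficulty starts: every managed service you use beyond plain storage needs its own interface like this.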
Do you know of any public example of someone who has actually done that? This seems like advice similar to eating 6 portions of vegetables a day: everyone says it's critical, nobody is doing it.
In all the cases of Amazon and Google Cloud outages I've seen that were beyond a single zone, it all hit the high-valued companies hard. I've never seen anyone declare that they decided to activate their alternative provider where they have a continuously updated copy of all the data so the data loss is minimal.
Everyone is just assuming Amazon or Google will fix it and that any data loss from activating a non-hot standby is not worth it.
I don't know of any public examples where someone made an emergency switch from one provider to another. I doubt it's happened often, mostly because multi-region failures are so rare. Like I said, if your infrastructure is at all complex it's very difficult to keep it portable. I doubt any "high-value" company would be able to make an emergency switch without their downtime exceeding what they would have experienced if they just stayed put.
There are products on the market dedicated to abstracting clouds so you can use more than one or switch easily. I'd be shocked if there weren't companies using them for the former. Those most concerned with availability usually just avoid clouds, though.