Is infrastructure at this scale typically unable to do a cold start? I can believe that this is very difficult to design for, but being unable to do it sounds dangerous to me.

(Edit for the downvoters: I was genuinely curious how these kinds of things work at Google’s scale. Asking stupid questions is sometimes necessary for learning.)




I guess it depends what "infrastructure" means.

If you mean "all of Google" then a cold restart would probably be very hard. At Facebook a cold restart/network cutoff of a datacenter region (a test we did periodically) took considerable planning. There is a lot to coordinate — many components and teams involved, lots of capacity planning, and so on. Over time this process got faster but it is still far from just pulling out the power cord and plugging it in again.

If you mean a single backend component then cold starting it may or may not be easy. It's easy if it's a stateless service that's not in the critical path. But this GCP outage appears to have been in the load-balancing layer, which is likely much harder to handle. A parent comment suggested it could be restarted in 15s, which is probably far from the truth: if it takes 5s to get an individual node restarted and serving traffic, you'd need to take down a third of the capacity at a time, almost certainly overloading the rest.
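A rough back-of-envelope makes the problem concrete (the 5s and 15s figures come from the thread; everything else is illustrative):

    # Back-of-envelope: rolling-restart a fleet within a 15s window.
    # Figures are illustrative, not measurements from the outage.
    restart_window_s = 15    # target time to cycle the whole fleet
    per_node_restart_s = 5   # time for one node to restart and start serving

    batches = restart_window_s // per_node_restart_s  # sequential batches that fit -> 3
    fraction_down = 1 / batches                       # fleet down at any moment -> ~33%
    load_on_survivors = 1 / (1 - fraction_down)       # load multiplier on the rest -> 1.5x

    print(f"{batches} batches, {fraction_down:.0%} of capacity down at once, "
          f"{load_on_survivors:.1f}x load on the remaining nodes")

A fleet running anywhere near its normal utilization has no headroom to absorb a 1.5x spike, which is the point above.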

In some cases the component may also have state that needs to be kept or refilled. Again, at FB, cold starting the cache systems was a fairly tricky process. Just turning them off and on again would leave cold caches and overload all the systems behind them.
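A minimal sketch of why a restarted (cold) cache overloads the systems behind it; the hit rate and traffic numbers are assumptions, not actual FB figures:

    # Only cache misses fall through to the backing systems.
    # Traffic and hit-rate numbers are invented for illustration.
    total_qps = 1_000_000

    def backend_qps(hit_rate: float) -> float:
        return total_qps * (1 - hit_rate)

    warm = backend_qps(0.95)  # steady state: 50,000 qps reaches the backends
    cold = backend_qps(0.0)   # just restarted: all 1,000,000 qps falls through

    print(f"warm: {warm:,.0f} qps, cold: {cold:,.0f} qps "
          f"({cold / warm:.0f}x the load the backends were sized for)")

The backends are typically provisioned for the warm-cache miss rate, so that cold-start multiplier is what takes them down.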

Lastly, needing to be able to quickly cold restart something is probably a design smell. In the case of this GCP outage, rather than building infra that can handle all the load balancers restarting in 15s, it would probably be easier and safer to keep the last known good configuration in memory and expose a mechanism to roll back to it quickly. This wouldn't avoid restarts for code bugs in the service, but it would provide some safety from configuration-related issues.
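A toy sketch of the "last known good configuration" idea; the class and method names are invented for illustration, not how GCP's load balancers actually work:

    # Keep the previously-working config in memory so a bad push can be
    # rolled back without restarting the process. All names are invented.
    import threading

    class ConfigHolder:
        def __init__(self, initial_config: dict):
            self._lock = threading.Lock()
            self._live = initial_config
            self._last_known_good = initial_config

        def apply(self, new_config: dict, validate) -> bool:
            """Promote a new config only if it validates; otherwise keep serving the old one."""
            with self._lock:
                if not validate(new_config):
                    return False
                self._last_known_good = self._live  # the config that was serving fine
                self._live = new_config
                return True

        def rollback(self) -> None:
            """Drop back to the last config that was known to work."""
            with self._lock:
                self._live = self._last_known_good

        def current(self) -> dict:
            with self._lock:
                return self._live

The useful property is that the rollback path touches nothing outside the process: no config distribution, no restart, no cold caches.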


> Lastly, needing to be able to quickly cold restart something is probably a design smell.

For everyone not operating at a scale where they can afford their own transoceanic fiber cables, a major internet service disruption is equivalent to a cold start. And as long as hackers or governments can push utter bullshit to the global BGP tables with a single mouse click, that threat remains present.


The comment I was replying to mentions "at [Google] scale", so my answer was with that in mind.


When Amazon S3 in us-east-1 failed a few years ago, the reason for the long outage (6 hours? 8 hours? I don't recall) was that they needed to restart the metadata service, and it took a long time for it to come back given the mind-boggling amount of data on S3. Cold starts are hard to plan for at precisely this kind of scale.
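A crude back-of-envelope of why restart time scales with the amount of metadata; every number below is invented, none of it comes from AWS:

    # Why re-reading a huge metadata index takes hours, not seconds.
    # All figures are invented for illustration.
    objects = 100e12                # assume ~100 trillion objects' worth of metadata
    entries_per_sec_per_node = 1e6  # assume each node re-loads/verifies 1M entries/s
    nodes = 10_000                  # assume the index is spread across 10k nodes

    seconds = objects / (entries_per_sec_per_node * nodes)
    print(f"~{seconds / 3600:.1f} hours just to re-read the index")  # ~2.8 hours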


It can be done. It takes a heck of a lot longer than 15s though.



