There's definitely miscommunication around this. I know I've miscommunicated impact, or had my communication misinterpreted, as it jumped across the two or three people between me and the status page.
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down, but it's actually not. What's the best answer to make everyone happy here? I can't think of one.
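A minimal sketch, using boto3, of why an STS outage reads to a customer as "S3 is down": the read never reaches S3 because the credential call in front of it fails first. The role ARN, bucket, and key here are hypothetical placeholders, not anything from the incident.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def fetch_object(role_arn: str, bucket: str, key: str) -> bytes:
    sts = boto3.client("sts")
    try:
        # If STS is unavailable, we fail here -- before ever touching S3.
        creds = sts.assume_role(
            RoleArn=role_arn, RoleSessionName="reader"
        )["Credentials"]
    except (ClientError, EndpointConnectionError) as exc:
        # From the caller's side this is indistinguishable from "I can't
        # read from S3", even though the storage layer is healthy.
        raise RuntimeError("S3 effectively unavailable: STS is down") from exc

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```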
"S3 is unavailable because X, Y, and Z services are unavailable."
A graph of dependencies between services is surely known to AWS; if not, they ought to create one post-haste.
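As a toy illustration of that idea: record direct faults per service, then derive "effectively down" by walking the dependency edges, which yields exactly the kind of message suggested above. The services and edges below are illustrative, not AWS's real internal graph.

```python
DEPENDS_ON = {
    "S3": ["STS", "DynamoDB"],   # illustrative edges only
    "STS": [],
    "DynamoDB": [],
    "Lambda": ["S3", "STS"],
}

def failed_upstream(service, down, _seen=None):
    """Return the failed services (transitively) making `service` unusable."""
    if _seen is None:
        _seen = set()
    if service in _seen:          # guard against dependency cycles
        return []
    _seen.add(service)
    if service in down:
        return [service]
    failed = []
    for dep in DEPENDS_ON.get(service, []):
        failed += failed_upstream(dep, down, _seen)
    return failed

down = {"STS"}                    # only STS has a direct fault
for svc in DEPENDS_ON:
    culprits = [c for c in failed_upstream(svc, down) if c != svc]
    if svc in down:
        print(f"{svc} is down")
    elif culprits:
        print(f"{svc} is unavailable because {', '.join(culprits)} is unavailable")
```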
Trying to externalize Amazon's internal AWS politicking over which service is down is unproductive for the customers who check the dashboard and see that their service ought to be up, but... well, it isn't?
Because those same customers have to explain to their clients and bosses why their systems are malfunctioning, yet it "shows green" on a dashboard somewhere that almost never shows red.
(And I can levy this complaint against Azure too, by the way.)
Yes, I can envision a (simplified) AWS X-Ray dashboard showing the relationships between the systems and the performance of each one. Then we could see at a glance what was going on. Almost anything is better than that wall of text, tiny status images, and RSS feeds.
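A rough sketch of what that at-a-glance view might boil down to: one line per service showing its own health plus any upstream culprit. The data here is made up; in practice it would come from per-service health checks and a dependency walk like the one sketched above.

```python
HEALTH = {
    # service: (own_state, upstream_culprit_or_None) -- illustrative data
    "STS":      ("DOWN",    None),
    "S3":       ("HEALTHY", "STS"),
    "DynamoDB": ("HEALTHY", None),
    "Lambda":   ("HEALTHY", "STS"),
}

for svc, (state, via) in sorted(HEALTH.items()):
    effective = "IMPAIRED" if (state == "DOWN" or via) else "OK"
    note = f"(via {via})" if via and state != "DOWN" else ""
    print(f"{svc:<10} own={state:<8} effective={effective:<9} {note}")
```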
Later on in the process, you could do something like this, once you know what else is impacted and how that looks to your customers. But by then the problem is most likely over, or at least on the way to being fixed, and hours may have gone by before you get to that point.
Early in the process, when you’re flying blind because you don’t know what’s going on around you and you look at your own systems and they appear to be fine, you can’t really say anything useful.
These weird edge cases are hard to adjudicate because they've never happened before; otherwise fixes would already be in place to prevent them. And nothing quite like them has ever happened before at this scale.
I understand the frustration, but when everything you think you know turns out to be wrong, or at least you are unable to confirm whether it’s right or wrong, what do you do?
Read the RCA: when AWS got to that point, they did actually update the SHD with a banner across the top of the page, but that ended up causing even more problems. There's a reason why you try to do these sorts of things safely, which may mean using manual methods in some cases. And sometimes even those safe manual methods have their own weird side effects.
Sometimes shit is hard. Sometimes you run into problems that no one else on the planet has ever experienced before, and you have to figure out what the laws of physics are in this new part of the world as you go about fixing whatever it was that broke or acted in an unexpected manner.
Disclaimer: my opinions are my own and are not necessarily shared or reflective of my employer.