There's definitely miscommunication around this. I know I've miscommunicated impact, or had my communication misinterpreted, as it jumped across the two or three people between me and the status page.
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down, but it's actually not. What's the best answer to make everyone happy here? I can't think of one.
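A minimal sketch, using boto3, of why an STS outage reads to a customer as "S3 is down": the read never reaches S3 because the credential call in front of it fails first. The role ARN, bucket, and key here are hypothetical placeholders, not anything from the incident.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

def fetch_object(role_arn: str, bucket: str, key: str) -> bytes:
    sts = boto3.client("sts")
    try:
        # If STS is unavailable, we fail here -- before ever touching S3.
        creds = sts.assume_role(
            RoleArn=role_arn, RoleSessionName="reader"
        )["Credentials"]
    except (ClientError, EndpointConnectionError) as exc:
        # From the caller's side this is indistinguishable from "I can't
        # read from S3", even though the storage layer is healthy.
        raise RuntimeError("S3 effectively unavailable: STS is down") from exc

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```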
"S3 is unavailable because X, Y, and Z services are unavailable."
A graph of dependencies between services is surely known to AWS; if not, they ought to create one post-haste.
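As a toy illustration of that idea: record direct faults per service, then derive "effectively down" by walking the dependency edges, which yields exactly the kind of message suggested above. The services and edges below are illustrative, not AWS's real internal graph.

```python
DEPENDS_ON = {
    "S3": ["STS", "DynamoDB"],   # illustrative edges only
    "STS": [],
    "DynamoDB": [],
    "Lambda": ["S3", "STS"],
}

def failed_upstream(service, down, _seen=None):
    """Return the failed services (transitively) making `service` unusable."""
    if _seen is None:
        _seen = set()
    if service in _seen:          # guard against dependency cycles
        return []
    _seen.add(service)
    if service in down:
        return [service]
    failed = []
    for dep in DEPENDS_ON.get(service, []):
        failed += failed_upstream(dep, down, _seen)
    return failed

down = {"STS"}                    # only STS has a direct fault
for svc in DEPENDS_ON:
    culprits = [c for c in failed_upstream(svc, down) if c != svc]
    if svc in down:
        print(f"{svc} is down")
    elif culprits:
        print(f"{svc} is unavailable because {', '.join(culprits)} is unavailable")
```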
Trying to externalize Amazon's internal AWS politicking over which service is down is unproductive for the customers who check the dashboard and see that their service ought to be up, but... well, it isn't?
Because those same customers have to explain to their clients and bosses why their systems are malfunctioning, yet it "shows green" on a dashboard somewhere that almost never shows red.
(And I can levy this complaint against Azure too, by the way.)
Yes, I can envision a (simplified) AWS X-Ray dashboard showing the relationships between the systems and the performance of each one. Then we could see at a glance what was going on. Almost anything is better than that wall of text, tiny status images, and RSS feeds.
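A rough sketch of what that at-a-glance view might boil down to: one line per service showing its own health plus any upstream culprit. The data here is made up; in practice it would come from per-service health checks and a dependency walk like the one sketched above.

```python
HEALTH = {
    # service: (own_state, upstream_culprit_or_None) -- illustrative data
    "STS":      ("DOWN",    None),
    "S3":       ("HEALTHY", "STS"),
    "DynamoDB": ("HEALTHY", None),
    "Lambda":   ("HEALTHY", "STS"),
}

for svc, (state, via) in sorted(HEALTH.items()):
    effective = "IMPAIRED" if (state == "DOWN" or via) else "OK"
    note = f"(via {via})" if via and state != "DOWN" else ""
    print(f"{svc:<10} own={state:<8} effective={effective:<9} {note}")
```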
Later on in the process, you could do something like this, once you know what else is impacted and how that looks to your customers. But by then the problem is most likely over, or at least on the way to being fixed, and hours may have gone by before you get to that point.
Early in the process, when you’re flying blind because you don’t know what’s going on around you and you look at your own systems and they appear to be fine, you can’t really say anything useful.
These weird edge cases are hard to adjudicate because they've never happened before; otherwise fixes would already be in place to prevent them. And nothing quite like them has ever happened before at this scale.
I understand the frustration, but when everything you think you know turns out to be wrong, or at least you are unable to confirm whether it’s right or wrong, what do you do?
Read the RCA: when AWS got to that point, they did actually update the SHD with a banner across the top of the page, but that ended up causing even more problems. There's a reason why you try to do these sorts of things safely, which may mean using manual methods in some cases. And sometimes even those safe manual methods have their own weird side effects.
Sometimes shit is hard. Sometimes you run into problems that no one else on the planet has ever experienced before, and you have to figure out what the laws of physics are in this new part of the world as you go about fixing whatever it was that broke or acted in an unexpected manner.
Disclaimer: my opinions are my own and are not necessarily shared or reflective of my employer.