Complex systems are really, really hard. I'm not a big fan of seeing all these folks bash AWS for this without really understanding the complexity or nastiness of situations like this. Running the kind of services they do, for the kind of customers they have, this is a VERY hard problem.
We ran into a very similar issue, but at the database layer, at our company literally two weeks ago: connections to our MySQL instance exploded and completely took down our data tier, causing a multi-hour outage compounded by retries and thundering herds. Understanding a problem like this in the middle of a stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky.
Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and they're difficult to model and predict, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when you integrate them, things get out of whack. I can't stress enough how difficult it is to model or even think about these systems; they're very, very hard. Combine that with the knowledge being distributed among many people, and you're dealing not only with distributed systems but also with distributed people, which makes it even harder to wrap your head around.
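For what it's worth, the standard mitigation for the retry/thundering-herd half of this is capped exponential backoff with full jitter on the clients, plus hard limits on connection pools. A rough Python sketch of the idea; the names and limits here are made up for illustration, not what we actually run:

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `operation` with capped exponential backoff and full jitter.

    Spreading retries out randomly keeps a fleet of clients from all
    hammering a recovering database at the same instant (the thundering herd).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff, capped, with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)

# Hypothetical usage: call_with_backoff(lambda: conn.execute("SELECT 1"))
```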
Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.
> Outrage is the easy response. Empathy and learning is the valuable one.
I'm outraged that AWS, as a company policy, continues to lie about the status of their systems during outages, making it hard for me to communicate to my stakeholders.
Empathy? For AWS? AWS is part of a mega corporation that is closing in on 2 TRILLION dollars in market cap. It's not a person. I can empathize with individuals who work for AWS, but it's weird to ask us to have empathy for a massive, faceless, ruthless, relentless, multinational juggernaut.
My reading of GP's comment is that the empathy should be directed towards AWS' team, the people who are building the system and handling the fallout, not AWS the corporate entity.
It seems obvious to me that they're specifically talking about having empathy for the people who work there, the people who designed and built these systems and yes, empathy even for the people who might not be sure what to put on their absolutely humongous status page until they're sure.
But I don’t see people attacking the AWS team; at worst it's the “VP” who has to approve changes to the dashboard. That’s management, and that “VP” is paid a lot.
I think most of the outrage is not because "it happened" but because AWS is saying things like "S3 was unaffected" when the anecdotal experience of many in this thread suggests the opposite.
That and the apparent policy that a VP must sign off on changing status pages, which is... backwards to say the least.
> a VP must sign off on changing status pages, which is... backwards to say the least.
I think most people's experience with "VPs" makes them not realize what AWS VPs do.
VPs here are not sitting in an executive lounge wining and dining customers, chomping on cigars and telling minions to "Call me when the data center is back up and running again!"
They are on the tech call, working with the engineers, evaluating the problem, gathering the customer impact, and attempting to balance communicating too early with being precise.
Is there room for improvement? Yes. I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
But the reason we don't doesn't have anything to do with having to get VP approval to put that message up. The VPs are there in the trenches most of the time.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I gotta say, the implication that you can't register an outage until you know why it happened is pretty damning. The status page is where we look to see if services are affected; if that information can't be shared there until you understand the cause, that's very broken.
The AWS status page has become kind of a joke to customers.
I was encouraged to see the announcement in OP say that there is "a new version of our Service Health Dashboard" coming. I hope it can provide actual capabilities to display, well, service health.
From how people talk about it, it sounds like updates to the Service Health Dashboard are currently a purely manual process, rather than automated monitoring updating the dashboard in any way at all. I find that a surprising implementation for an organization of Amazon's competence and power. That alarms me more than the question of who has the power to manually update it; I agree that I don't have enough knowledge of AWS internal org structures to have an opinion on whether it's the "right" people or not.
I suspect AWS must have internal service health pages that are automatically updated by monitoring in some way, i.e. that actually work to display service health. If the public-facing system has no inputs but manual human entry, that seems like a business decision rather than a technical challenge, but that's just how it looks from the outside; I may not have full information. We only have what Amazon shares with us, of course.
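To make concrete what "automated" could even mean here, a toy sketch: track a rolling window of request outcomes and flip a service's published status when the error rate crosses a threshold. The thresholds and status names below are invented for illustration; I obviously have no idea how AWS actually wires its dashboard.

```python
from collections import deque

class ServiceHealth:
    """Toy error-rate monitor: flips status when errors exceed a threshold."""

    def __init__(self, window_size=1000, degraded_at=0.05, down_at=0.25):
        self.outcomes = deque(maxlen=window_size)  # True means the request failed
        self.degraded_at = degraded_at
        self.down_at = down_at

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def status(self) -> str:
        if not self.outcomes:
            return "ok"
        error_rate = sum(self.outcomes) / len(self.outcomes)
        if error_rate >= self.down_at:
            return "outage"
        if error_rate >= self.degraded_at:
            return "degraded"
        return "ok"

# A publisher could poll status() every minute and update the public page
# automatically, with humans adding detail later rather than gating the flip.
```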
Can you please help me understand why you, and everyone else, are so passionate about the status page?
I get that it not being updated is an annoyance, but I cannot figure out why it is the single most discussed thing about this whole event. I mean, entire services were out for almost an entire day, and if you read HN threads it would seem that nobody even cares about lost revenue/productivity, downtime, etc. The vast majority of comments in all of the outage threads are screaming about how the SHD lied.
In my entire career of consulting across many companies and many different technology platforms, never once have I seen or heard of anyone even looking at a status page outside of HN. I'm not exaggerating. Even over the last 5 years when I've been doing cloud consulting, nobody I've worked with has cared at all about the cloud provider's status pages. The only time I see it brought up is on HN, and when it gets brought up on HN it's discussed with more fervor than most other topics, even the outage itself.
In my real-life (non-HN) experience, when an outage happens, teams ask each other "hey, you seeing problems with this service?" "yeah, I am too, heard maybe it's an outage" "weird, guess I'll try again later" and go get a coffee. In particularly bad situations, they might check the news or ask me if I'm aware of any outage. Either way, we just... go on with our lives? I've never needed, nor have I ever seen people need, a status page to inform them that things aren't working correctly, but if you read HN you would get the impression that entire companies of developers are completely paralyzed unless the status page flips from green to red. Why? I would even go as far as to say that if you need a third party's SHD to tell you whether things are working right, then you're probably doing something wrong.
Seriously, what gives? Is all this just because people love hating on Amazon and the SHD is an easy target? Because that's what it seems like.
A status page gives you confidence that the problem indeed lies with Amazon and not your own software. I don't think it's very reasonable to notice issues, ask other teams if they are also having issues, and if so just shrug it off and get a cup of coffee without more investigation. Just because it looks like the problem is with AWS, you can't be sure until you investigate further, especially if the status page says it's all working fine.
I think it goes without saying that having an outage is bad, but having an outage which is not confirmed by the service provider is even worse. People complain about that a lot because it's the least they could do.
I care about status pages because when something breaks upstream I need to know whether it's an issue I need to report, whether there are additional problems related to the outage I need to look out for, or whether there are workarounds I can deploy. If I find out anything that helps me narrow down the ETA for a fix, that's bonus fries.
I don't gripe about it on HN, but it is generally a disappointment to me when I stumble upon something that looks like a significant outage but a company is making no indication that they've seen it and are working on it (or waiting for something upstream of them, as sometimes happens).
It is extremely common for customers to care about being informed accurately about downtime, and not just for AWS. I think your experience of not caring and not knowing anyone who cares may be an outlier.
> Can you please help me understand why you, and everyone else, are so passionate about the status page?
I don't think people are "passionate about status page." I think people are unhappy with someone they are supposed to trust straight up lying to their face.
aws isn’t a hobby platform. businesses are built on aws and other cloud providers. those businesses’ customers have the expectation of knowing why they are not receiving the full value of their service.
it makes sense that, as part of marketing yourself as viable infrastructure upon which other businesses can operate, you’d provide more granular and refined communication, allowing better communication up and down the chain instead of forcing your customers to rca your service in order to communicate to their own customers.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I think that's the crux of the matter? AWS seems to now have a reputation for ignoring issues that are easily observable by customers, and by the time any update shows up, it's way too late. Whether VPs make this decision or not is irrelevant. If this becomes a known pattern (and I think it has), then the system is broken.
disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
> disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
I'd like to share my experience here. This outage definitely impacted my company. We make heavy use of autoscaling, we use AWS CodeArtifact for Python packages, and we recently adopted AWS Single Sign-On and EC2 Instance Connect.
So, you can guess what happened:
- No one could access the AWS Console.
- No one could access services authenticated with SAML.
- Very few CI/CD, training or data pipelines ran successfully.
- No one could install Python packages.
- No one could access their development VMs.
As you might imagine, we didn't do a whole lot that day.
With that said, this experience is unlikely to change our cloud strategy very much. In an ideal world, outages wouldn't happen, but the reason we use AWS and the cloud in general is so that, when they do happen, we aren't stuck holding the bag.
As others have said, these giant, complex systems are hard, and AWS resolved it in only a few hours! Far better to sit idle for a day rather than spend a few days scrambling, VP breathing down my neck, discovering that we have no disaster recovery mechanism, and we never practiced this, and hardware lead time is 3-5 weeks, and someone introduced a cyclical bootstrapping process, and and and...
Instead, I just took the morning off, trusted the situation would resolve itself, and it did. Can't complain. =P
I might be more unhappy if we had customer SLAs that were now broken, but if that was a concern, we probably should have invested in multi-region or even multi-cloud already. These things happen.
Saying "S3 is down" can mean anything. Our S3 buckets that served static web content stayed up no problem. The API was down though. But for the purposes of whether my organization cares I'm gonna say it was "up".
> We are currently experiencing some problems related to FOO service and are investigating.
A generic, utterly meaningless message, which is still a hell of a lot more than usually gets approved, and approved far too late.
It is also still better than "all green here, nothing to see" which has people looking at their own code, because they _expect_ that they will be the problem, not AWS.
Most of what they actually said via the manual human-language status updates was "Service X is seeing elevated error rates".
While there are still decisions to be made about how you monitor errors and what sorts of elevated rates merit an alert, I would bet that AWS has internally facing systems that can display service health in exactly this way, based on automated monitoring of error rates (as well as other things), because they know it means something.
They apparently choose to make their public-facing service health page only show alerts via a manual process that often results in an update only several hours after lots of customers have noticed problems. This seems like a choice.
What's the point of a status page? To me, the point of it is, when I encounter a problem (perhaps noticed because of my own automated monitoring), one of the first thing I want to do is distinguish between a problem that's out of my control on the platform, and a problem that is under my control and I can fix.
A status page that does not support me in doing that is not fulfilling its purpose. The AWS status page fails to help customers do that by regularly showing all green, with no alerts, hours after widespread problems occurred.
It doesn’t matter what the VPs are doing, that misses the point. Every minute you know there is a problem and you haven’t at least put up a “degraded” status, you’re lying to your customers.
It was on the top of HN for an hour before anything changed, and then it was still downplayed, which is insane.
I don't think the matter is whether or not VPs are involved, but the fact that human sign off is required. Ideally the dashboard would accurately show what's working or not, regardless if the engineers know what's going on.
There's definitely miscommunication around this. I know I've miscommunicated impact, or my communication was misinterpreted across the 2 or 3 people it had to jump before hitting the status page.
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down, but it's actually not. What's the best answer to make everyone happy here? I can't think of one.
"S3 is unavailable because X, Y, and Z services are unavailable."
A graph of dependencies between services is surely known to AWS; if not, they ought to create one post-haste.
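Even a toy version of that graph would let a dashboard say "effectively down via dependency" when a service's own storage layer is healthy but something it needs (like STS in this outage) is not. A rough sketch; the service names and edges are illustrative guesses, not AWS's real dependency graph:

```python
# Hypothetical dependency graph: service -> services it needs to serve requests.
DEPENDS_ON = {
    "S3-API": ["STS"],
    "STS": [],
}

def effective_status(service, raw_status, deps=DEPENDS_ON, seen=None):
    """raw_status maps service -> what its own health checks say ('up'/'down').

    Returns 'down', 'degraded (dependency: ...)', or 'up'.
    """
    seen = seen or set()
    if raw_status.get(service) == "down":
        return "down"
    for dep in deps.get(service, []):
        if dep in seen:
            continue  # avoid cycles in the (hypothetical) graph
        seen.add(dep)
        if effective_status(dep, raw_status, deps, seen) != "up":
            return "degraded (dependency: %s)" % dep
    return "up"

# During the outage: S3's storage layer was fine, but STS was not.
print(effective_status("S3-API", {"S3-API": "up", "STS": "down"}))
# -> degraded (dependency: STS)
```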
Trying to externalize Amazon's internal AWS politicking over which service is down is unproductive to the customers who check the dashboard and see that their service ought to be up, but... well, it isn't?
Because those same customers have to explain to their clients and bosses why their systems are malfunctioning, yet it "shows green" on a dashboard somewhere that almost never shows red.
(And I can levy this complaint against Azure too, by the way.)
Yes, I can envision a (simplified) AWS X-Ray dashboard showing the relationships between the systems and the performance of each one. Then we could see at a glance what was going on. Almost anything is better than that wall of text, tiny status images, and RSS feeds.
Later on in the process, you could do something like this. When you know what else is impacted and how that looks to your customers. But by then the problem is most likely over or at least on the way to being fixed. And hours may have gone by before you get to that point.
Early in the process, when you’re flying blind because you don’t know what’s going on around you and you look at your own systems and they appear to be fine, you can’t really say anything useful.
These weird edge cases are hard to adjudicate because they’ve never happened before — otherwise fixes would already be in place to prevent them. And nothing quite like them has ever before happened at this scale.
I understand the frustration, but when everything you think you know turns out to be wrong, or at least you are unable to confirm whether it’s right or wrong, what do you do?
Read the RCA — When AWS got to that point, they did actually update the SHD with a banner across the top of the page, but that ended up actually causing even more problems. There’s a reason why you try to do these sorts of things safely, which may mean using manual methods in some cases. And sometimes even those safe manual methods have their own weird side effects.
Sometimes shit is hard. Sometimes you run into problems no one else on the planet has ever experienced before, and you have to figure out what the laws of physics are in this new part of the world as you go about fixing whatever broke or acted in an unexpected manner.
Disclaimer: my opinions are my own and are not necessarily shared or reflective of my employer.
I’m not all that angry over the situation, but more disappointed that we’ve all collectively handed the keys over to AWS because “servers are hard”. Yeah, they are, but it’s not like locking ourselves into one vendor with flaky docs and a black box of bugs is any better; at least when your own servers go down it’s on you, and you don’t take out half of North America.
If you aren't going to rely on external vendors, servers are really, really hard. Redundancy in power, cooling, networking? Those get expensive fast. Drop your servers into a data center and you're in a similar situation to dropping them into AWS.
A couple years ago all our services at our data center just vanished. I call the data center and they start creating a ticket. "Can you tell me if there is a data center outage?" "We are currently investigating and I don't have any information I can give you." "Listen, if this is a problem isolated to our cabinet, I need to get in the car. I'm trying to decide if I need to drive 60 miles in a blizzard."
That facility has been pretty good to us over a decade, but they were frustratingly tight-lipped about an entire room of the facility losing power because one of their power feeder lines was down.
Could AWS improve? Yes. Does avoiding AWS solve these sorts of problems? No.
Servers are not hard if you have a dedicated person (long ago known as a system administrator), and fun fact: it's sometimes even much cheaper and more reliable than having everything in the "cloud".
Personally, I am a believer in mixed environments: public webservers etc. in the "cloud", and locally used systems and backups "in house" with a second location (both in data centers, or at least one). And no, I'm not talking about the next Google, but about the 99% of businesses.
You can either pay a dedicated team to manage your on-prem solution, go multi-cloud, or simply go multi-region on AWS.
My company was not affected by this outage because we are multi-region. It's the cheapest and quickest option if you want at least some fault tolerance.
> ... multi region. Cheapest and quickest option if you want to have at least some fault tolerance.
That is simply not true. You have to adapt your application to be multi-region aware to start with, and if you do that on AWS you are basically locked in, to one of the most expensive cloud providers out there.
You're saying it's not true, but do you have another example of a quick and cheap way to do this?
I'm not saying this can be done in 1 day for 2 cents; I'm saying that it's quick and cheap compared to the other options.
> adapt your application to be multi region aware
Compare that with adapting your application to support multi-cloud deployments, or going from the cloud to doing on-prem with a dedicated team; you can take your bets.
On AWS you can set up Route 53 to point to multiple regions based on health checks or latency.
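Concretely, that means publishing the same record name once per region with latency-based routing and a health check attached, so traffic fails away from a region that stops answering. A hedged boto3 sketch; the zone ID, IPs, and health-check IDs are placeholders, not real resources:

```python
import boto3

route53 = boto3.client("route53")

def upsert_regional_record(zone_id, name, region, ip, health_check_id):
    """Create/update a latency-routed A record for one region.

    Route 53 answers with the lowest-latency *healthy* record, so a regional
    outage (as seen by the health check) shifts traffic to the other regions.
    """
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": region,   # must be unique within the record set
                "Region": region,          # enables latency-based routing
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
                "HealthCheckId": health_check_id,
            },
        }]},
    )

# Placeholder IDs and addresses for illustration only:
upsert_regional_record("Z0HYPOTHETICAL", "api.example.com", "us-east-1", "203.0.113.10", "hc-east")
upsert_regional_record("Z0HYPOTHETICAL", "api.example.com", "eu-west-1", "198.51.100.20", "hc-west")
```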
Excuse me, but do we need all that complexity? Is saying that it is "hard" a justification?
It is naive to assume that the people bashing AWS are incapable of running things better, cheaper, and faster across many other vendors, on-prem, colocation, or whatnot.
> Outrage is the easy response.
That is what made AWS gain the market share it has now in the first place: the easy responses.
The main selling point of AWS in the beginning was how easy it is to spin up a virtual machine. After basically every layman started recommending AWS and we flocked there, AWS started making things more complex than they should be. Was that to make it harder to get out? I don't know.
> Empathy and learning is the valuable one.
When you run your own infrastructure and something fails and you are not transparent, your users will bash you, no matter who you are.
And that was another "easy response" used to drive companies towards AWS. We developers were echoing that "having an infrastructure team or person is not necessary", etc.
Now we are stuck in this learned helplessness where every outage is a complete disaster in terms of transparency: multiple services failing, even for multi-region and multi-AZ customers; we say "this service here is also not working" and AWS simply states that the service was fine, not affected, up and running.
If it were a sysadmin doing that, people would be coming for their neck with pitchforks.
> AWS started making things more complex than it should
I don’t think this is fair for a couple reasons:
1. AWS would have had to scale regardless just because of the number of customers. Even without adding features. This means many data centers, complex virtual networking, internal networks, etc. These are solving very real problems that happen when you have millions of virtual servers.
2. AWS hosts many large, complex systems like Netflix. Companies like Netflix are going to require more advanced features out of AWS, and this will result in more features being added. While this is added complexity, it’s also solving a customer problem.
My point is that complexity is inherent to the benefits of the platform.
Thanks for these thoughts; they resonated with me. I feel we are sleepwalking into major fiascos when a simple doorbell needs to sit on top of this level of complexity. It's in our best interest not to tie every small thing into layers and layers of complexity. Mundane things like doorbells should at least have their fallbacks done properly so they can function locally without relying on complex cloud systems.
The problem isn't AWS per se. The problem is it's become too big to fail. Maybe in the past an outage might take down a few sites, or one hospital, or one government service. Now one outage takes out all the sites, all the hospitals and all the government services. Plus your coffee machine stops working.
> I'm not a big fan of seeing all these folks bash AWS for this,
The disdain I saw was towards those claiming that all you need is AWS, that AWS never goes down, and don't bother planning for what happens when AWS goes down.
AWS is an amazing accomplishment, but it's still a single point of failure. If you are a company relying on a single supplier and you don't have any backup plans for that supplier being unavailable, that is ridiculous and worthy of laughter.
But Amazon advertises that they DO understand the complexity of this, and that their understanding, knowledge and experience is so deep that they are a safe place to put your critical applications, and so you should pay them lots of money to do so.
Totally understand that complex systems behave in incomprehensible ways (hopefully only temporarily incomprehensible). But they're selling people on the idea of trading your complex system, for their far more complex system that they manage with such great expertise that it is more reliable.
Not sure why I got downvoted for an honest question. Most start-ups are founders, developers, sales, and marketing. Dedicated infrastructure, network, and database specialists don't get factored in because "smart CS graduates can figure that stuff out". I've worked at companies that held onto that false notion way too long and almost lost everything as a result (a "company extinction event", like losing a lot of customer data).
I am always amazed at how little my software-dev spouse understands about infrastructure; basic networking troubleshooting is beyond her. She is a great dev, but terrible at ops. Fortunately she is at a large company with lots of devs, sysadmins, and SREs.