What's the point of having a status page if it doesn't indicate the issues? http...

VyseofArcadia · on Jan 25, 2023

When anything on that page turns not-green, there are news stories about it. Not positive ones. So exec approval is needed, because the decision to flip something on that page is ultimately the decision to cause stories that are negative for MS to be published. The exec has to weigh whether pissing off the customers (by failing to acknowledge reality) is worth it to avoid the bad press and SLA fallout.

ctvo · on Jan 25, 2023

It has nothing to do with press. This is negative press already, and journalists can use this to write their stories without waiting for the official light to go from green to yellow.

It's about contractual obligations and SLAs. Things are not officially down in most agreements until MSFT acknowledges they're down. Refunds issued to your largest customers because your blob storage failed to meet 99.9999% uptime are directly tied to these statuses.

Enginerrrd · on Jan 25, 2023

I'm not going out of my way to be hyperbolic or anything here, but that sounds suspiciously like "fraud" to me.

ctvo · on Jan 25, 2023

I don't think they're committing fraud.

I think it's an important enough page that it can't be automated. Because of the various downstream effects, even the most basic change needs manual approval from a human, even when the status reporting system is operating correctly.

PenguinCoder · on Jan 25, 2023

Which means it's not a status page any more. Defeating the supposed purpose.

cutemonster · on Jan 25, 2023

"SLA refund page"?

maushu · on Jan 25, 2023

The point is PR. Never trust a status page if it's not directly connected to the monitoring system.

wrldos · on Jan 25, 2023

They never attach it to the monitoring because monitoring systems usually generate a lot of false positives which affect their published SLA.

polack · on Jan 25, 2023

Then they should have a "?" status that can be triggered by automated systems that acknowledge that it looks to be an issue but that they are manually investigating.

If it's a false positive they just resolve it without it affecting SLA and if it's a real problem then us customers wouldn't have to debug our own stack for 2 hours before Microsoft informs us that they are the problem.

EDIT: Wonder how many man-years of extra debugging work their non-working status page has caused the customers.
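A minimal sketch of the "?" idea above, assuming a three-state model. The names `Status`, `Component`, and the counter `sla_downtime_events` are hypothetical and not taken from any real status page system. Automation may only raise the "?" state; only a human can turn it into a declared outage that counts toward the SLA.

```python
from enum import Enum


class Status(Enum):
    OK = "green"
    INVESTIGATING = "?"   # raised automatically, no SLA impact yet
    OUTAGE = "red"        # declared by a human, counts toward the SLA


class Component:
    """Hypothetical status-page entry with an automated '?' state."""

    def __init__(self, name):
        self.name = name
        self.status = Status.OK
        self.sla_downtime_events = 0

    def automated_alert(self):
        # Monitoring can only flag a possible issue, never declare an outage.
        if self.status is Status.OK:
            self.status = Status.INVESTIGATING

    def human_confirms_outage(self):
        # Only a confirmed outage is counted against the SLA.
        if self.status is Status.INVESTIGATING:
            self.status = Status.OUTAGE
            self.sla_downtime_events += 1

    def human_dismisses(self):
        # False positive: back to green with no SLA impact.
        if self.status is Status.INVESTIGATING:
            self.status = Status.OK

    def resolve(self):
        self.status = Status.OK


blob = Component("blob storage")
blob.automated_alert()
print(blob.name, blob.status.value)
```

In this sketch customers see the "?" as soon as monitoring fires, while the SLA counter only moves after a human confirms.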

jhoechtl · on Jan 25, 2023

They never attach it to the monitoring because monitoring systems usually generate a lot of correct positives which affect their published SLA.

Works equally well. See the point?

mdip · on Jan 25, 2023

Which means if one were to require monitoring and status pages to be connected, one of two things happens (for each monitored component):

(1) The monitoring system would be altered to ignore tests that return false positives (at the expense of missing the alert when it represents an outage).

(2) The monitoring would be fixed. It wasn't working for the sysadmins/operators anyway, since it had so many false positives that their "mental model" was essentially based on (1).

At least, where I've forced the issue of doing just this, that's exactly what happened. At the end of the day, especially since SLAs took a hit and that affected bonus payouts, monitoring got a lot better -- as did overall team function when we truly realized how bad things were -- we stopped doing workarounds and started fixing problems at a more fundamental level which led to SLAs that were both accurate and excellent.

It helped bring attention to a hidden problem which resulted in time being allocated to fix tests that dropped constant false-positives and to evaluate each for whether or not it should exist in the first place.

steveBK123 · on Jan 25, 2023

Which impacts economics because some customers surely got deals guaranteeing some amount of credits based on up/downtime as reported by the status page.

And so updates to the status page become political and locked behind senior management approvals, like at AWS.

mirekrusin · on Jan 25, 2023

Yeah, that's why SLA reports never include <30m downtimes, convenient truth bending.
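A rough illustration of how much a cutoff like that can move the number. The 30-minute threshold comes from the comment above; the outage durations and the 30-day period are made-up assumptions, and the function names are hypothetical.

```python
def downtime_minutes(outages, min_reported=0):
    """Sum outage durations, ignoring any shorter than min_reported minutes."""
    return sum(m for m in outages if m >= min_reported)


def uptime_percent(outages, period_minutes, min_reported=0):
    """Uptime over the period, counting only outages at or above the cutoff."""
    down = downtime_minutes(outages, min_reported)
    return 100.0 * (period_minutes - down) / period_minutes


# Hypothetical month: four outages, durations in minutes.
outages = [5, 12, 29, 40]
period = 30 * 24 * 60  # assumed 30-day reporting period

actual = uptime_percent(outages, period)
reported = uptime_percent(outages, period, min_reported=30)
print(f"actual:   {actual:.3f}%")
print(f"reported: {reported:.3f}%")
```

With these invented numbers the reported figure clears a 99.9% target while the actual figure does not.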

edf13 · on Jan 25, 2023

It's updated now - updates for service outages at this level generally need signoff from someone higher up the chain

saghm · on Jan 25, 2023

Someone has to "approve" the status pages showing what's actually happening? From a customer perspective, it seems far worse to have status pages fail to reflect actual outages than to have them accidentally report an outage when there isn't one because no one really cares about what the status page says if they're not having issues. It's hard to see how the goal here could be anything other than trying to add plausible deniability for what would otherwise be obvious deception.

vineyardmike · on Jan 25, 2023

> it seems far worse to have status pages fail to reflect actual outages than to have them accidentally report an outage when there isn't one

That's not the goal.

> It's hard to see how the goal here could be anything other than trying to add plausible deniability for what would otherwise be obvious deception

That's the goal. The "status page" is considered the source of truth for most of the big contracts. If status-page=OK then your contract with them isn't violated. So changing the status page is a big deal, with real financial implications. The status page isn't a view into the SRE's tickets, it's a declaration that the service isn't being provided.

mattclarkdotnet · on Jan 25, 2023

Utter rubbish. Major contracts have account managers and it all gets hashed out 1-1.

squeaky-clean · on Jan 25, 2023

Don't know why this was downvoted. We've definitely been able to provide proof of an outage when the status page showed otherwise and get a refund in the form of server credits by contacting them directly. For all 3 big vendors, AWS, Azure, GCP

RajT88 · on Jan 25, 2023

Agree here as well. It's usually not that hard to prove, based on the many, many metrics Azure resources emit, that their SLA was breached.

What might be happening is that there is fine print you have to read and be in compliance with in order to be eligible for the SLA.

For example, look at all the conditions which have to be met for a breach of VM SLA in Azure:

https://azure.microsoft.com/en-us/support/legal/sla/virtual-...

Hidden in the SLA details are typically hints on how you can become more resilient in the cloud. So it pays to read the SLA details and really deeply understand what they are telling you.

oefrha · on Jan 25, 2023

Exec approval for showing major outages on status dashboard is pretty much standard practice across large companies. The main differentiator is whether it’s approved within five minutes or two hours.

remus · on Jan 25, 2023

> it seems far worse to have status pages fail to reflect actual outages than to have them accidentally report an outage when there isn't one because no one really cares about what the status page says if they're not having issues.

I disagree. What if you're having issues and the status page is incorrectly reporting an incident? It would be easy to waste a load of time waiting for the status page to sort itself out, only to find out you've still got an issue.

UK-AL · on Jan 25, 2023

You can't approve a fact.

hdjjhhvvhga · on Jan 25, 2023

As others noted, the so-called "status" pages of big service providers don't serve to reflect reality but to shape it. For actual status you need to consult independent monitoring services.

2Gkashmiri · on Jan 25, 2023

well.... if that fact can be delayed by just a tiny bit... that's enough

alkonaut · on Jan 25, 2023

But shouldn't the individual service dots automatically turn a color other than green? I mean it's an automated service status page, right? I understand that a human-written message at the top can take some time.

luckylion · on Jan 25, 2023

No, it's not automated. I'm sure the underlying tech is automated, but once companies grow beyond a certain size, it needs a human to say "show this status change to the world" because there are lots of things depending on it (e.g. SLAs, but also bonuses, I assume), so they don't want a potential bug in the status system to influence that.

It's weird how slow they are with manual sign-off though.

adql · on Jan 25, 2023

I haven't seen any SLA deal that says the status page must show 99.9% uptime...

cm2187 · on Jan 25, 2023

No but if msft’s own status page shows downtime more than 0.01% of the time msft will struggle to argue they haven’t breached their SLA, so financial consequences to the company.
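For scale, here is what the percentages in this exchange mean in minutes. This is plain arithmetic, assuming a 30-day measurement period (real SLAs define their own periods and exclusions); `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(uptime_percent, period_days=30):
    """Minutes of downtime permitted in the period at the given uptime target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - uptime_percent) / 100.0


for target in (99.9, 99.99):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {budget:.2f} min per 30 days")
```

So 99.9% uptime leaves roughly 43 minutes of downtime per 30 days, and 0.01% downtime (99.99% uptime) leaves a little over 4 minutes.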

alkonaut · on Jan 25, 2023

But I don’t want the page connected to their bonuses or SLAs. I just want to know whether they are having any issues anywhere. And I need to know within a minute of my own service not working so I’m not chasing the wrong thing. This can’t be an unreasonable thing to ask for?

luckylion · on Jan 25, 2023

I agree. I'm already annoyed at Hetzner with their 5 minute lag in reporting network outages where I'm regularly noticing them, investigating, checking status and then only after a few minutes see them updating and saying "it's us".

If you work with Microsoft, you might as well spend a few bucks extra and have an external monitoring system monitor Microsoft's systems so you get real-time third-party confirmation when your monitoring alerts you of issues concerning your system. It's the price you pay for scale, I guess. More money involved = more lawyers involved = more accountants involved = more MBAs involved = more corporate bullshit.

copperroof · on Jan 25, 2023

An automated one would be red 100% of the time.

steve1977 · on Jan 25, 2023

Maybe they couldn't update the status page due to the network outage.

I'm joking, but...

867-5309 · on Jan 25, 2023

or, they could be automated and transparent

nikau · on Jan 25, 2023

Then some group scrapes the uptime from their competitor's page and reports that "competitor is x times more reliable than transparent co"

Yuioup · on Jan 25, 2023

That's what happens when you don't have an independent party that keeps tabs on this.

SkyPuncher · on Jan 25, 2023

We've concluded that status pages are a complete joke.

berkut · on Jan 25, 2023

status.office.com had been down for 15 mins, but it's back up now...

belter · on Jan 25, 2023

We have been here before...HN is the only status page that matters.

funnymony · on Jan 25, 2023

It's a public relations page.