Microsoft's Azure cloud down and out for 8 hours

powertower · on Feb 29, 2012

This seems to be about some odd certificate issue, not the network, which caused Microsoft to take access to its service management system down.

From the article itself:

> It later added that less than 3.8 per cent of hosted services had been affected.

If this were about Google or Apple, this submission would already have been flagged and taken off the front page, and several accounts would have been hell-banned.

</endjoke, but it's really true>

It will be interesting to see if Microsoft expands on this issue and if we can learn something about it... Perhaps other datacenters are also vulnerable to cert issues in their management systems.

ChuckMcM · on Feb 29, 2012

"We have identified the root cause of this incident. It has been traced back to a cert issue triggered on 2/29/2012 GMT," the software giant said.

So here is a quiz for you: does your software know that this is a leap year? I'll speculate that someone's software didn't recognize it as such. Personally I'm always on the lookout for this sort of 'weird' thing, because edge cases are often poorly tested.

stickfigure · on Feb 29, 2012

Compare this to administering machines yourself: Your "admin console" goes down every night for 8 hours while your sysadmin is asleep.

ithkuil · on Feb 29, 2012

It's not unusual for my sysadmin to be woken up in the middle of the night

stickfigure · on Feb 29, 2012

Lucky him.

(or her)

danielsoneg · on Feb 29, 2012

In my rather limited experience as a sysadmin, I've found avoiding those calls to be one of my best motivators: My systems work like they're supposed to because I like my sleep.

Consider it a performance-based bonus.

ww520 · on Feb 29, 2012

Can you elaborate on what the cert issue is that caused the service to go down? It's always good to learn from a failure like this.

powertower · on Feb 29, 2012

Today, February 29th, is a "leap day" which happens every 4 years in a "leap year"... Otherwise, the last day of February is the 28th, and today would normally be March 1st.

I'm assuming that when the day rolled over, it caused an issue with an internal system at Microsoft, that had something to do with certificate dates and/or timestamps.

It's not just Microsoft that it's happening to. Today I could not automatically renew any domains at Namecheap that expire tomorrow, because the registry could not produce the correct expire-at date.

Leap days cause odd issues.
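To illustrate how a "same day next year" calculation can trip on a leap day, here is a minimal Python sketch. It is only a guess at the general class of bug, not what Microsoft's or the registry's code actually does; the function names and the clamp-to-Feb-28 policy are assumptions made for the example.

```python
from datetime import date


def add_one_year_naive(d: date) -> date:
    # Naive approach: keep month and day, bump the year.
    # For Feb 29 this raises ValueError, because the following
    # year has no Feb 29.
    return d.replace(year=d.year + 1)


def add_one_year_clamped(d: date) -> date:
    # One possible policy (an assumption, not a standard):
    # fall back to Feb 28 when the target year has no Feb 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(add_one_year_clamped(leap_day))
    try:
        add_one_year_naive(leap_day)
    except ValueError as exc:
        print("naive version failed:", exc)
```

Whether to clamp to Feb 28 or roll forward to March 1 is a policy decision; the point is that the code has to make that decision explicitly.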

DrJokepu · on Feb 29, 2012

How the hell is that possible? Every platform out there has high quality date libraries. It's a cliché that date & time operations are very complicated (leap years, leap seconds, time zones, daylight saving time and so on). Every even marginally competent developer knows instinctively to stay away from writing their own date & time code and to trust a library to do the heavy lifting instead. In addition, every tester should know that when dates are involved, they should try Feb 29th and see if it breaks anything. How can bugs like this slip through?

crcastle · on March 1, 2012

Also whether or not a year is a leap year is easy to calculate:

Year divisible by 4? => leap year; year also divisible by 100? => not a leap year unless year also divisible by 400
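As a rough sketch, that rule (the Gregorian one) fits in one line of Python. The function name is made up for the example, and in real code the standard library's `calendar.isleap` already does the same thing.

```python
import calendar


def is_leap_year(year: int) -> bool:
    # Divisible by 4, except century years, unless divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


if __name__ == "__main__":
    for y in (1900, 2000, 2012, 2013):
        # Compare against the standard library implementation.
        print(y, is_leap_year(y), calendar.isleap(y))
```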

drivebyacct2 · on Feb 29, 2012

> </endjoke, but it's really true>

Seriously? Alarmist flamebait headlines about Google and Apple stuff get posted nearly daily.

bry · on Feb 29, 2012

The most frustrating thing for me was the complete lack of any real communication from Microsoft. For a while, even their status dashboard was down. I only found out about it after I got a PagerDuty alert and had to search Twitter (other people complaining about it) to confirm.

We have an Azure CDN backed by a Compute Instance, and still zero official notice from Microsoft about this. I've learned more about the problem from news articles than from the company that provides the service. Fortunately we haven't finished migrating the rest of the site to Azure. No emails from them, nothing. Not even a tweet on their official @WindowsAzure account. Frustrating.

TeHCrAzY · on Feb 29, 2012

They finally pushed something out via Twitter, about an hour ago: http://twitter.com/#!/WindowsAzure/status/174954154362548224

d4nt · on Feb 29, 2012

I guess this sort of thing was bound to happen at some point. As the article says, all the major cloud platforms have had outages at some point.

As someone who is actively considering building products using this platform, I'm keen to see how well they manage this issue. How do they communicate during the outage, how open are they afterwards about what went wrong, do they learn from it, and so on. I particularly like the efforts Amazon have gone to in the past to show they are learning from these issues (see http://aws.amazon.com/message/65648/). I'm hopeful that Microsoft will show the same level of openness.

chollida1 · on Feb 29, 2012

I feel bad whenever I hear about outages at companies like this.

I'm in charge of technology for a hedge fund with 15 people and I get freaked out each time we roll out a new piece of software.

latchkey · on Feb 29, 2012

Heroku dynos were down for a lot of clients for several hours on Saturday. One site that I use a lot, Intercom.io, was completely off the grid because of this. I was wondering why there was no news about it here.

https://status.heroku.com/incident/308

andrewem · on Feb 29, 2012

It's unclear how big a deal this outage was, as I can never make heads or tails of what's really behind an article in The Register.

That said, I run an app on Heroku and noticed that outage - Pingdom sent me a bunch of emails, which I saw after a day spent outside. The first thing I do when I get a Pingdom failure notice is check status.heroku.com, and sure enough they had an issue and were working on fixing it. I wasn't particularly expecting there to be mention of it on HN because, first, it wasn't an enormous outage, second, it was on the weekend, and third Heroku always does such a great job of posting about issues.

I don't know if they did so, but I'd expect Intercom.io and other affected sites to post information letting people know that the site is down and why. I haven't tried it yet but Heroku allows you to specify a custom error page to be served [1], which you could update on the fly to let people know what's going on. For other outages I have let users know via Twitter and Facebook, though obviously there's no substitute for a nice up-to-date static error page.

[1] http://devcenter.heroku.com/articles/custom_error_pages

andypants · on Feb 29, 2012

I'm sure I read about the Heroku downtime on Hacker News. Maybe it got pushed off the front page pretty quickly though.

krobertson · on Feb 29, 2012

It was on the front page for a while; however, it was the weekend, so it likely didn't garner as much attention. And it was like 2 hours... not 8-9+ like Azure.

Tloewald · on Feb 29, 2012

So is this a Feb 29 bug? The problem occurred, it seems, at the advent of 2/29 GMT. Worst date handling ever?

Maxious · on Feb 29, 2012

Apparently electronic payment systems from ATMs to merchant terminals to HMO claim machines all went straight to March 1st. Hilarity ensued.

bouncing · on Feb 29, 2012

Yeah, and I noticed my paycheck arrived a day early.

joshmaker · on Feb 29, 2012

"We have identified the root cause of this incident. It has been traced back to a cert issue triggered on 2/29/2012 GMT"

viraptor · on Feb 29, 2012

It's a bit of an unfortunate description of time, I think. They could at least say "2/29/2012 01:45 GMT". Otherwise it looks like there's a different date in each timezone (yeah, there are two different ones, but not a different one in each; the bug started closer to the beginning of 2/29/2012 GMT+2).
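A quick Python sketch of why a bare date is ambiguous: the same instant falls on different calendar dates depending on the UTC offset. The 01:45 time is just the example format from above, and the two fixed offsets (UTC-8 and UTC+2) are picked for illustration.

```python
from datetime import datetime, timedelta, timezone

# Example instant, written with an explicit time and zone.
INSTANT = datetime(2012, 2, 29, 1, 45, tzinfo=timezone.utc)

# Fixed offsets chosen for illustration only (no DST handling).
UTC_MINUS_8 = timezone(timedelta(hours=-8))
UTC_PLUS_2 = timezone(timedelta(hours=2))


def local_date(instant: datetime, tz: timezone) -> str:
    # Convert the instant to the given offset and return its calendar date.
    return instant.astimezone(tz).date().isoformat()


if __name__ == "__main__":
    print("UTC-8:", local_date(INSTANT, UTC_MINUS_8))
    print("UTC  :", local_date(INSTANT, timezone.utc))
    print("UTC+2:", local_date(INSTANT, UTC_PLUS_2))
```

West of GMT it is still Feb 28 at that instant, which is why a timestamp with a time and zone is less confusing than a date alone.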

kokey · on Feb 29, 2012

I think this is the problem with rewriting platforms, libraries and languages from scratch instead of incrementally chiseling it into something stable like in the open unix world.

batista · on Feb 29, 2012

> instead of incrementally chiseling it into something stable like in the open unix world.

LOL. If I had a penny for every time the "open unix world" chose to reimplement some stuff from scratch, I'd be rich. From Gnome and KDE always changing stuff (especially the KDE multimedia architecture, which took this to comical levels), to FreeBSD moving to Clang...

speedracr · on Feb 29, 2012

As a non-tech observer, Azure actually struck me as an honest attempt by Microsoft to add a compelling offer to the mix. However, their status page-cum-website seems to be hosted on Azure itself, which is ridiculous in a situation like this. (Almost like Twitter having a status page on Tumblr.) Worst of all, even www.twitter.com/windowsazure offers no comment at all so far. Isn't this wiping out any credibility they might have built up with developers? Is anyone affected?

Edit: if this truly is because of 2/29, I guess anyone signing up from now on will get perfect service.

ot · on Feb 29, 2012

> Edit: if this truly is because of 2/29, I guess anyone signing up from now on will get perfect service.

At least for the next 4 years

sriramk · on Feb 29, 2012

I don't think that's true, because I remember giving somebody a lot of pain for exactly that when I worked there. Of course, it might have gone down for other reasons.

TeHCrAzY · on Feb 29, 2012

Weird, we have a couple of respectable (80+ requests per second) services running on Azure in the SE Asia zone, and I can't see any problems at all.

That said, we only rely on table storage and our instance count is mostly static.

chubot · on Feb 29, 2012

So just "service management" is down, but the apps themselves are up? If so that's better than Heroku's recent downtime.

sriramk · on Feb 29, 2012

Disclaimer: ex-Windows Azure person, no insider knowledge on this particular outage

That's because the 'service management service' typically only kicks in when it needs to do things - allocate capacity, restart things when stuff goes down, etc. By default, it isn't touching the running apps inside VMs. There is no Windows Azure equivalent to Heroku's routing mesh to be taken down; the requests go to the VMs directly via the various networking layers.

ahrens · on Feb 29, 2012

We had problems today. We had the bad luck that one of our web roles crashed during the time the admin interface was down. That meant we couldn't restart it and neither could Microsoft. We will be adding instances to the role to avoid similar problems. Otherwise, we are very happy with Azure.

TeHCrAzY · on Feb 29, 2012

Can you explain what your problem is? Azure will automatically restart instances that are dead afaik (unless you are saying that the automatic restarting was broken as well?).

dragosstancu · on Feb 29, 2012

That sucks, we have a bunch of media stored with them and we're looking forward to using the cloud for some .NET based intensive processing (reporting).

A bit off topic: I've been trying for days to get Azure to work with .webm or .ogv files. Maybe I'm using the wrong tool (CloudBerry). I want to be able to deliver HTML5 video from the cloud, but so far I've had no success for FF users. Luckily, my video player features a Flash fallback, which is awesome.

barranger · on Feb 29, 2012

Was the media stored in blob storage? From the article it looks like storage accounts weren't affected, just the management API and ~4% of service accounts.

securingsincity · on Feb 29, 2012

Considering they are announcing the Windows 8 Consumer Preview right now, and possibly other new cloud-related features, this is not good timing at all.

JonoW · on Feb 29, 2012

Article is not clear whether Azure instances are down, or just the service management tools? But either way, ouch!

noahhs · on Feb 29, 2012

Today's outage took down our website, our computing grid, everything. And this afternoon, when Microsoft said "the majority" of Azure clients were back up and running, we were still in the dark. Dammit.

krmmalik · on Feb 29, 2012

I wonder if this explains why I have been having problems with Siri? I know that sounds far-fetched, but isn't Apple relying on Azure now?

jstclair · on Feb 29, 2012

Don't know why you got downvoted - AFAIK Siri isn't tied to Azure (although it could be using it for storage), but iCloud runs across both Azure and Amazon.

wonderercat · on Feb 29, 2012

I think it's interesting that the #1 feature on their features page is labeled "Always up. Always on."

cache77 · on Feb 29, 2012

Maybe they forgot to account for leap year in their programming :/

rmc · on Feb 29, 2012

Not Itchy & Scratchy Land Europe!

bouncing · on Feb 29, 2012

The bad news for Microsoft isn't that Azure is down, it's that only a tiny number of people even noticed.

recoiledsnake · on Feb 29, 2012

Less than 3.8% of users were affected. Take your trolling elsewhere.

huggyface · on Feb 29, 2012

Is it really trolling? Honestly, who uses Azure? Like most MSDN subscribers I did some mashing in it -- that still exists -- but did absolutely nothing real in it. My sense is that very, very few did anything beyond prototyping in it.

And just to add some opinion, the reason I wouldn't even consider it is Microsoft's absolutely flippant ADD when it comes to online services. I have zero faith that they won't just shut it down tomorrow.

kruipen · on Feb 29, 2012

Apple for one: http://apple.slashdot.org/story/11/09/04/0051209/apples-iclo...

cooldeal · on Feb 29, 2012

A whole bunch of universities and government agencies.

huggyface · on Feb 29, 2012

Such as? Microsoft's case studies for Azure are absolutely dismal.

I note the other comment mentions Apple, which was a pre-release beta rumor based upon the IP that a beta iMessage lived at (since moved). It speaks volumes, I think, when such a disproven pre-release claim is still held as the example.

huggyface · on Feb 29, 2012

-1? For real? "A whole bunch" is not a credible statement of fact.

wonderercat · on Feb 29, 2012

And unfortunately, with outages like this, that number's probably not increasing anytime soon.

dutchbrit · on Feb 29, 2012

http://www.windowsazure.com isn't even loading here.

ping www.windowsazure.com

PING wamktg-prod-db-001.cloudapp.net (65.52.64.144): 56 data bytes

Request timeout for icmp_seq 0

Request timeout for icmp_seq 1

Request timeout for icmp_seq 2

Request timeout for icmp_seq 3

Request timeout for icmp_seq 4

Request timeout for icmp_seq 5

NARKOZ · on Feb 29, 2012

ping responses are disabled

tathagatadg · on Feb 29, 2012

Makes sense, but at least www.windowsazure.com/en-us/support/service-dashboard/ shouldn't time out ...

dutchbrit · on Feb 29, 2012

Exactly. Not sure why my comment was downvoted. Sure, maybe pinging is disabled and that part of my comment wasn't valid, but the site still isn't responding.