Can you imagine if Twitter and Google went down at the same time?
People would be reactivating their Facebook accounts and having to sift through conspiracy theory posts about Hillary Clinton still just to figure out what was going on.
Edit: The points on this post keep going up and down every time I check these comments. Yes, it was sarcasm, I was joking, but I was trying to point out that most people rely on a small set of services. "Cloud" has centralized things a lot.
Whenever I hear that some service is down, I immediately go to that service to confirm. Then I repeatedly hit reload if it doesn't work to see if it can come up. I guess many people do the same, and that may contribute to the problem...
Years ago I worked at a large online casual gaming company whose name ended in -ynga. Our web tier was split in two: one half served the static content required to load the HTML, Flash app, assets, etc. The other handled actual communication about actions taken in game.
Whenever we had any sort of issue we could generally get a good idea of what was happening by looking at changes in traffic in those two web tiers.
If people couldn't play for most reasons, game action traffic would drop to near zero, but the static asset tier traffic would usually at least triple.
So yeah, there are a lot of F5 buttons being hit out there when pages don't load.
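The two-tier signal described above can be turned into a crude automated check. A minimal sketch, with made-up function names, thresholds, and baselines purely for illustration:

```python
def diagnose(static_rps: float, game_rps: float,
             static_baseline: float, game_baseline: float) -> str:
    """Crude outage heuristic from the two-tier pattern described above:
    game traffic collapsing while static traffic spikes means players
    are gone but their F5 keys are not."""
    game_ratio = game_rps / game_baseline
    static_ratio = static_rps / static_baseline
    if game_ratio < 0.1 and static_ratio > 2.0:
        return "likely outage: reload storm on static tier"
    if game_ratio < 0.1 and static_ratio < 0.5:
        return "likely network problem: nobody reaching us at all"
    return "normal"
```

The two failure shapes are distinguishable: an app-level outage leaves the static tier reachable (so refreshes pile up there), while a network/CDN problem drops both tiers at once.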
I don't believe Gmail was ever fully down. For me, I was just having problems with attachments. I also noticed app icons in the play store failing to load.
I experienced issues with Drive last night but it was never fully down. I was trying to work on a 500GB file and the API would drop the link intermittently. I could parse the directory no problem, just couldn't reliably access the files.
Back in my tech support days I received an email from a customer "I am unable to send or receive emails" I replied "I am very sorry for your inconvenience, I have resolved the issue".
Customers in 1999 really couldn't believe no one had replied to their emails within a day or two.
When it all goes down at the same time you should be worried. Not because of the lack of Twitter or FB or Gmail, but for what it means if it's all down.
It really goes to show how different all of our feeds are based on who we are friends with (and, of those, who we interact with the most). Anecdotally, I have 3 people who will share any "conservative" attack meme they can get their hands on. At this point I don't even know if it counts as conservative so much as just outright attacking Democrats; sometimes it reads more like an attack for the sake of attacking than a statement of belief in something different. Kind of weird. Part of me wonders if some of these accounts do conservative attacks and others do liberal attacks just to get shares, with no political interest whatsoever: effectively acting as an arms dealer of the meme variety. Of my friends of a more liberal view, it is mostly policy things they share (pro-choice, anti-rape, etc.). There's one dude who is pretty anti-Trump, but he mostly stands out as an exception. Most of the posts I see referencing Trump directly (outside of periods where he did something Democrats felt was highly suspect) are more in support of him than anything.
Which is why I have to refrain from taking an "over the top much?" slant when people post the pro-Trump/victim of the left type memes as I don't see a ton of attacks on him directly, but then again, their feed could be totally different from mine so who knows?
I've got a few people in my feed who just dump on the right no matter what. I've fact checked them a little bit. Again it's about a 2-1 for me fact checking my right leaning and left leaning friends (typically older). But I keep all of them on there just to have an ear to the ground.
Yeah, I appreciate the various viewpoints, even though items shared in my feed from both conservative and liberal viewpoints have been wrong (almost consistently so).
I think the real bummer is when you present the actual video of what the person says, and the response is essentially “yeah? Well, they still suck” or something in that ballpark. I have zero issue with someone not liking another person’s views, but a lot of it is just outright libel.
I was out shoveling, and came back in to my phone blowing up. Our systems at IronMountain (formerly Fortrust) in Denver all rebooted at once. These are all on redundant power: each system's redundant power supplies connect to different circuits entering the cabinet, and those two circuits are fed from 3 PDUs (two separate, one shared). Each of those is supposed to be fed by a separate UPS and generator. The last status update I had says they are running off generators, but they've been shockingly tight-lipped about it.
Don't get me wrong, it was hi-LAR-ious to call into their NOC and have them pretend that I was the only one having problems. "Can you tell me if there is a major data center outage going on?" "We are trying to gather information, we are making a bunch of client phone calls, we will know after we make those calls." "... Why are you making a bunch of client calls if you aren't having an outage?"
They do run quarterly 'storms' where a datacenter is shut down to test failover and resiliency. I have no idea if today is one of those days, since I left last year.
Theoretically, a real shutdown might go differently than previous tests or simulations. For instance, in a test you might cut the connection completely, while in the real case only some power circuits go down, or whatever.
For instance GitHub's relatively recent shutdown was due to a fail-over heartbeat not going as expected.
Test failures are all well and good, but don't always match reality. In this case, the design of the power infrastructure was solid, and their plans include running monthly generator testing and quarterly "disconnect from the grid" testing. But apparently something about this failure of both of the incoming power lines caused failures in multiple UPSes. Still waiting on the after-action review.
it's only so quick because stuff isn't actually turned off with disks wiped. The machines are still running, with applications loaded, just with no traffic directed towards them.
Last time I dealt with a cloud provider outage the status page was unresponsive during the outage because the status page had some kind of dependency on the resources that were down...
Sure, I understand the "so let us do our job". I've been on the other side of that.
On the other hand, I need information to be able to do my job: Is this only our cabinet having problems and I need to start rolling to the datacenter (in the middle of a giant blizzard)? Is this possibly some sort of problem with our own power infrastructure? Is something on fire (an EPO triggered by fire could cause this)? Did the roof cave in under the weight of the snow we are getting? Is the power stabilized or is there some indication that power might be up and down?
In short, I need answers to: Do I need to gracefully take down my site to prevent lost transactions and database corruption? Do I need to switch to our backup site?
For context: All of our servers powering off at once and then back on shouldn't be possible. It should require the failure of at least 3 independent pieces of equipment (except at the breaker panel or in our cabinet where it could be only two failures). It is extremely unusual for this to happen, first time it's happened for me and I've been in that facility since 2004.
So, yes, I respect that you need to do your job. But I also need to do my job.
Plus, I'm pretty sure the guy answering the trouble line, his job WAS talking with the customers. The people working the problem likely didn't include him. This is a huge data center run by a ginormous company. I don't think I was taking him away from twisting a wrench. :-)
I wouldn't be surprised if they think a status page would open up liability for not putting it up soon enough, or for too long, or for some text that turned out to be wrong or unnecessary.
"The storm"? It's sunny in the Bay Area for the first time in I don't know how long. I imagine it's nice in other parts of the world as well, other than where this localized "storm" is.
Yes. Believe it or not, it's not really major news for those not impacted. We have the local evening news on most nights in the background, haven't heard a thing about it. I also regularly read NYT and WaPo and follow the Internets.
It's sad that such outlets don't bother to print about stuff that impacts middle America. They could regain some credibility with a lot of said middle Americans without ever changing their political alignment just by giving them a little more coverage.
Probably a combination of that and to curtail the "I just spoke to Brad in Customer Service who confirmed _the whole datacenter is offline_" type posts.
But that's my presumption, I don't actually know anything and don't want to imply I do.
It's easy to be cynical but it's optimistic expectation management.
It might already be resolved; it has to get worse before you escalate it further. They might not know the full facts. It might look worse than it really is. How do you know? You can't judge that just because your personal rendering of Facebook failed. You have load balancers and CDNs and A/B testers all getting in the way of delivering data to your machine.
It's too easy to draw a conclusion from the client-side armchair and the provider is absolutely not going to make false promises, for the worse or for the better.
You want to hope that Facebook, in this case, acts on more complete information.
That's the trust issue with current agreements we are solving.
If an API is down, the binding agreement is enforced instantly on our platform: no lies, no call, no pain.
We are actually onboarding companies to try it out!
https://stacktical.com
TLDR: Because Smart Contracts on the blockchain are the right tool for Secure Digital Agreements.
Paperweight contracts are irrelevant in a world of data
* A Smart Contract is cheaper to publish than the stack of paper handled by lawyers.
* Code is cheap to iterate on, whereas traditional SLAs are expensive/slow to renegotiate.
Over time, SLAs drive behaviors that are focused on delivering a minimum level of service at minimum cost to the provider.
* A Smart Contract is code you can trust, understand and expect to behave instantly, compared to traditional SLM.
so, are you saying we should replace social media platforms w/ decentralized sharing & aggregation driven by smart contracts? sounds intriguing but daunting
unscheduled outages are always painful and people will always call, I agree.
But instant compensation does a better job at damage control than a status page.
Keeping customers satisfied even in bad situations is key in a world of high availability expectations.
And with distributed, non-partisan metric sourcing about the availability of an API, it's no longer possible for a Service Provider to lie.
I don’t understand something: what kind of company is so down to the wire with cash flow that an outage requires income within seconds/minutes instead of weeks? Anyone with a financial runway so short that it can be described as “instantaneous” doesn’t sound like a customer you would want to be in business with.
The kind that will make a lot of noise as publicly as possible and create ample work for your support/admin people if you don't keep them happy...
> doesn’t sound like a customer you would want to be in business with
I could say that about most of the companies I have had the dubious pleasure of doing business with! Very few are pleasant when something goes awry even for a moment.
A company as large and sophisticated as FB has data centers and cloud services in multiple countries, and in the US, probably colocated data centers. Certainly nothing localized to where you are.
If the outage is at all infrastructure-related, the root cause was something that at some point was local and cascaded. Unless someone git pushed to a repo used by both companies and it's taken all day to get it git revert'd, their redundancy obviously didn't work, did it? There's effectively a category-2 hurricane moving from the Rockies through the mid-west right now.
Reminds me when I was contacting Deutsche Telekom last year regarding an outage in Monschau area. "We have no problems". In fact, the whole exchange was down, press got a sniff of it when people could not contact emergency services anymore: https://www.aachener-nachrichten.de/lokales/eifel/netzstoeru...
Are you saying that a cold-war-era system like the internet/arpanet meant to survive a nuclear war might be vulnerable to an attack if we take all the code and data and store it in the same place?
Once upon a time Multics' developers predicted that someday computing power would be treated like a public utility that homes and businesses would buy like electricity or water or natural gas.
Given the proliferation of minute/hourly billing among service providers, it looks like the Multics folks guessed right. It just happened on top of Unix(-like systems) instead of Multics.
I wonder how long it'll be before we start seeing municipal datacenters?
C++ is the most popular language at Facebook, I know that one for sure. They used to run PHP on HipHop VM which was written in C++, but now they transpile PHP to C++.
The transpiler is HipHop. They discontinued that in favor of HHVM, which does JIT compilation instead. More info: https://hhvm.com/
EDIT: apparently, though, HHVM stopped supporting PHP itself last month; now it only supports Hack. I'm not familiar enough with Hack to know how much it actually deviates from / improves upon PHP.
Facebook has to hire thousands of engineers per year. They may incorporate more Erlang into Facebook, but they have to have a core tech stack that can easily onboard engineers from a variety of backgrounds. I don't have the foggiest idea of whether Erlang can be part of that or not, but people talk about it as if it's a special-purpose tool.
When I interviewed for SRE at Google, they'd had a non-trivial cross product outage days before. Good conversation starter, but I couldn't get many details out of them.
I’ve seen many systems go down over the last few days worldwide. Aside from the possibility of a mega-DDoS attack (which Facebook denies), all of these organizations have fairly diverse tech stacks to my knowledge. Google’s issue (supposedly) had to do with their Blobstore API, we don’t know what happened with Facebook, and many other, smaller services have had issues as well, including three intranet services at my workplace.
This leaves me wondering what software all these places have in common. The application layers are all different, the databases are all different, the containerization and provisioning systems are different, but I imagine that all these systems rely on two things: the global Internet backbone, and maybe the Linux kernel.
Have there been major security vulnerabilities patched lately in the Linux kernel that could have had unintended consequences?
Both companies are massive and have tons of developers. It becomes almost impossible to look at the system as a whole with the amount of changes coming through. And you get scenarios where small failures cascade through the stack wreaking havoc. Oftentimes it's just one config change.
It's telling that one of the hottest areas of distributed systems research these days is the boring topic of configuration management. Google, Microsoft, etc. are paying researchers top dollar to figure out how to prevent massive outages through novel techniques. It is one of the harder problems to solve and requires massive investment in tooling, refactoring, etc.
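To make the problem concrete: even a toy config-rollout pipeline needs validation, a staged push, health checks, and automatic rollback. A minimal sketch under assumed interfaces (the `Fleet` object, the config keys, and the thresholds are all hypothetical):

```python
def validate(config: dict) -> list:
    """Reject obviously-bad values before they reach any machine."""
    errors = []
    if not (0 < config.get("timeout_ms", 0) <= 60_000):
        errors.append("timeout_ms out of range")
    if config.get("replicas", 0) < 2:
        errors.append("need at least 2 replicas")
    return errors

def staged_rollout(config, fleets, health_check):
    """Push to one fleet at a time; stop and roll back at the first
    sign of trouble instead of letting a bad config reach everything."""
    errs = validate(config)
    if errs:
        raise ValueError("rejected: %s" % errs)
    applied = []
    for fleet in fleets:
        fleet.apply(config)
        applied.append(fleet)
        if not health_check(fleet):
            for f in applied:  # roll back everything we touched
                f.rollback()
            raise RuntimeError("rollout halted at %r" % fleet)
    return applied
```

The hard research problems are exactly the parts this sketch hand-waves: knowing what "healthy" means fast enough, and rolling back when the bad config has broken the channel you would roll back through.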
You’re undeniably right about not looking at Facebook or Google as one whole system, but there have also been what seems like an unprecedented number of strange little outages (see the ones mentioned by https://news.ycombinator.com/item?id=19382418) that aren’t huge companies. My workplace had some of their own today that I haven’t heard an incident report about (it’s a pretty large company and I’m not in IT).
The best explanation is coincidence, I think. I have direct knowledge of two of the incidents in the past few weeks, and they have completely unrelated causes.
That’s certainly possible. We’re probably still too early to tell, but the innate conspiracy theorist slash pattern-matching part of my brain wants to find a probable connection.
FANGs all use white box hardware with "merchant silicon", meaning they buy the chipsets directly from Broadcom, Mellanox, etc. and build their own devices. However, they do all have Broadcom and Mellanox in common, and Cisco, Juniper, and Arista do too.
I run a messenger bot platform - the webhooks stopped being delivered _hours_ ago... nothing on their status page until it had been down for hours.
Their current issue...
"We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution."
I'm pretty sure businesses use status pages to divert attention from support resources, they never seem to give useful information about outages and half the time don't even mention the outage.
But for someone who runs a business that relies on Instagram for marketing, and pays for advertising on that platform, it's a bit scary when the whole thing is down. Obviously this was only temporary this morning, and sure, no charges while down, but doesn't do me much good...
It looks like something much larger is going on. If you look at the front page of https://downdetector.com/ you'll see most major sites/backbones are having issues (Verizon/ATT/Sprint/CenturyLink/TMobile/Comcast/Level3/etc).
I strongly suspect users are reporting "my Internet is having troubles" because their FB, Messenger, etc. isn't working right.
For example, in the comments of the T-Mobile outage page, there's stuff like "Haven't been able to upload anything to social media all day" and "Cannot send pictures through whatsapp and fb messenger".
That doesn't make any sense, given that the "traffic" tab's scale says "7% above normal".
The red are the areas with the most attacks, and as you'd expect, they correspond to large population centers. (It's also not very granular, and appears to largely correspond to "where does Akamai have a datacenter".)
So yesterday Google had a major (and out of character) outage across its apps, and today Facebook has a major (and also out of character) outage across its apps.
I can't wait to see the RCA for both of these and if they're related.
Private post-mortem:
The NSA middleware we are required to run (that took time to deploy to each of our social partners) is breaking something so let’s revert.
That's why it's called PRISM. It's exactly what you describe. Splitting an optical signal into 2 using, basically, a prism. One signal goes out to the net as normal, the other goes to their own datacenters, which they keep continually building and expanding. The newer ones are being built on military bases, for added security. Check em out. Look at the size and cost of them. Some are over a million sq. ft. That's a lot of data. They measure it in terms of yottabytes and zettabytes (in 2013, a lifetime ago in terms of storage space):
Nah, PRISM referred to the front door for lawful access to customer records under warrant. That's the sort of portal that China once hacked Gmail by gaining access to.. the companies explicitly built those access relays.
The beam splitter stuff (e.g. Room 641A) went by different codenames, TRAFFICTHIEF and TURMOIL iirc. That's the back door.
> The newer ones are being built on military bases, for added security.
IIRC, the NSA is organizationally part of the military, and it's currently headed by a military officer who gives congressional testimony in his uniform (https://www.youtube.com/watch?v=nMi241XLeQ8). It makes sense they'd build on military bases, it'd be kinda weird if they didn't.
Also, the majority of tech and data companies have closed this loophole by encrypting traffic between data centers. Nobody thought it was necessary to do it before over dark fiber before because, hey, who was listening? (answer: the NSA was)
One of my coworkers came from a large telecom. He mentioned they had to get technology from an Israeli firm that specializes in quantum cryptography on the fiber optic line to fend off the NSA and GCHQ, who are apparently worse than the NSA. (IIRC) the tech encrypts data streams on one side and checks whether the hash is the same on the other side; if something's off (evidence of tampering) it instantly changes the cipher.
And they have been using the USS Jimmy Carter sub with the front huge cable splice bay for decades to compromise all undersea cables.
>>>The New York Times reported in 2005 that the USS Jimmy Carter, a highly advanced submarine that was the only one of its class built, had a capability to tap undersea cables. An Institute of Electrical and Electronics Engineers report speculated that a 45-foot extension added to the Jimmy Carter provided this capability by allowing engineers to bring the cable up into a floodable chamber to install a tap. But it is unlikely that the USS Jimmy Carter routinely taps cables since U.S. intelligence agencies can much more easily (and lawfully) obtain cable data through taps at above-ground cable landing stations.
Optical tap for unauthorized access, but surely port mirrors for the national-security-letter stuff. Wouldn't make sense to go through the hassle of a tap install and the ongoing risk of it failing, versus using a capability available on almost all serious switching hardware to give you a guaranteed 1:1.
interesting that facebook's cavalrylogger is still being successfully injected despite there being nothing but a blank page
also interesting that cavalrylogger has a function that lets you bind key-presses to callbacks
even more interesting is that cavalrylogger seems to come prepackaged with any facebook like button! cheers for the keylogger, facebook
I don't think it's the NSA this time; for once they don't have to do deep packet inspection or install any MITM device, since they get the whole info in bulk. Maybe it's just a 400-pound hacker.
The heavy traffic is due to sports events - champions league last-16 matchday live streaming: Bayern Munich vs. Liverpool, FC Barcelona vs Olympique Lyon. The heatmap matches the clubs' home countries UK, Germany, France & Spain quite well.
Multicast was designed exactly for this - same data streamed to many endpoints at the same time. Too bad it's not being more widely used, the bandwidth savings would likely be huge.
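For the unfamiliar, the mechanics are simple: receivers join a group address and the network fans out a single datagram to all of them. A minimal sketch in Python (the group address and port here are arbitrary choices for a local experiment, not anything standard):

```python
import socket
import struct

GROUP = "224.1.1.1"  # any address in 224.0.0.0/4 works for local tests
PORT = 5007

def make_receiver(group, port):
    """Join a multicast group: the network delivers one copy of each
    datagram sent to the group, no matter how many receivers join."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq struct: group address + local interface (INADDR_ANY = default)
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def make_sender():
    """The sender transmits ONE datagram to the group address; the
    network duplicates it, so upstream bandwidth stays flat regardless
    of audience size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the traffic on the local segment for this sketch
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    return sock
```

A sender would call `sock.sendto(b"frame", (GROUP, PORT))` and every joined receiver gets a copy. The catch, and likely why it never took off for internet-scale streaming, is that inter-domain multicast is rarely enabled by ISPs, so CDNs fall back to one unicast stream per viewer.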
That chart doesn’t support your assertion. Akamai’s traffic and attack charts usually look like that, and the attack chart even says it’s currently low.
"This usually means we're making an improvement to the database your account is stored on. While this process won't affect your account, you temporarily won't be able to access the site." https://www.facebook.com/help/134401680031995
I guess that this is all that I will get. Facebook is never down, it is just making improvements (like restarting the services to make them work again).
They could be consolidating all of the DB infrastructure for their platforms.
A zero-downtime cutover would not be possible, as they would need to nearly double their DB infrastructure.
Short planned temporary outages of various features probably become long unplanned cross-platform outages.
They probably decided to not rollback the migration after the first outage.
What manner of failure would cause such globally deployed and distributed systems to go down like this? I'm very interested to read up on this when they release details of the failure.
I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.
Screw ups related to data loss are rare (I've been here years and haven't seen one with the DBs that the stuff I work with uses) but failures at this scale tend to cascade a little ways and it takes time to dig out of the hole. They probably have the problem solved but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.
We have a different dedicated page that gives an overview of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers and whether things are "normal", and if not, allow you to quickly identify what is not normal.
I have no inside knowledge of this one, but broadly speaking, these sorts of failures can be caused by a change thought innocent at the time to the core software that is then widely deployed using automated systems. If the core's tests didn't catch a real issue in production (and for whatever reason, the rollout happens faster than the regular small-release verification process can catch the error), things can go sour in a way that's expensive to un-sour.
Amazon once pushed a seemingly-innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand---datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. Cost a fortune in lost sales for the time the whole site was inaccessible.
As someone who works at a large company in the networking space, I can tell you that minor changes to configuration can cause catastrophic failures that are really challenging to come back from.
Network failures are usually really bad when your system is globally deployed and distributed -- often times you can't even communicate with your machines to deliver fixes :p
Increased Error Rates — Created by Gary Fitzpatrick · Facebook Team — Today at 10:32 AM
Current State: Investigating
Description: We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution.
Start Time:
2 hours ago
Last Update:
about an hour ago
Updates:
There are currently no updates for this issue.
Serious question: was any value lost? (this may appear sarcastic)
Facebook obviously loses some ad revenue and Facebook customers may lose sales. But do Facebook/Instagram users suffer? How does losing social media for several hours affect the quality of life of users?
I am not a big fan of social media either, but you would be surprised... For example, here in Sudan (East Africa) the country has been under continuous protests for over 2 months now (53 dead, 4k+ detained, 500+ injured), with strong censorship from the regime and silence from the international community. So Facebook, WhatsApp & Twitter are the only media left for the people to fight for freedom. Every Thursday holds the main protests of the week, and this Wednesday night's outage might affect that, as thousands around Sudan won't know about the meeting points for tomorrow!!!
Actually the government did block all social media for over a month, but that was fixable with a VPN. (Follow the hashtag #SudanUprising on Twitter to learn more)
I doubt Sudanese protestors are watching football and eating big macs on their off-time. Most likely limiting their protests to once a week 1. makes for a single, effective push, 2. keeps the protestors and their families from starving.
Lmao "no disrespect intended" "scheduling it between watching the football and eating big macs." pick one and keep in mind you're talking about Sudan and not the US...
As I said, the main protests are on Thursday, so no, Sudanese people don't protest just once a week. And btw, people get shot here just for standing out and peacefully protesting, so it's far away from the picture you have in mind. I've put a hashtag where you can see photos/videos & learn more, and of course share productivity tips.
What I asked was: what is the effect of sporadic interruptions of a few hours? I mean, if Facebook had 30% availability, would I lose anything valuable from the experience? Is it that we are just used to it and want it to be there always?
The value of 99.5% availability for __users__ is not clear to me. Instant messaging is an exception to this.
I know parents who keep in touch with their children via Messenger, in part because it works in more places: Messenger works wherever there's internet over wifi, not just where there's cell service. People rely on Facebook for non-trivial reasons, whether or not I (or someone else) think it's a good idea.
It might seem pithy, but my wife has a small internet based business and uses facebook as a login for one of the sites she sells on. So, today instead of being able to autofill labels directly for shipping, she had to hand type addresses in for all shipping labels for products sold on that site.
I reached out to an old acquaintance that could be a great help to my company. I reached out over Facebook. Now that contact can not respond and may have not even seen my message. I have no other way of contacting this person. This affects my business.
I hate Facebook, but to deny its value is pretty naive.
FB has effectively replaced all other text messaging for several of my social circles. It's nice when you have groups that kinda change over time, otherwise group-texts always end up with numbers you don't necessarily have in your phone, etc.
I consider my relationship with Messenger separate from FB. Most of my conversations happen there. I've deleted FB from my phone, but I don't think I could ever go without Messenger.
I used to be like that. One day I just sent the same message to all people I still contacted on messenger saying that I was getting rid of it in one week and listing 3 alternate services people could use to contact me. Didn't lose a single contact and never looked back.
For hours, not especially - it's annoying but no worse than a power cut. There could even be benefits.
On the other hand, if someone were to sabotage the platform and prove/convincingly argue that they induced the failure, at minimum it would do significant damage to the tech sector and at maximum cause public panic.
This is a hypothetical, not speculation on the cause of this outage.
Obviously this could be argued differently from a shareholder perspective, but I would say otherwise no. Interestingly, this might be one of the only times where a large outage could be claimed to be adding value. Again, not for FB, but for users, sure.
After a few hours of not being able to use an app people might start realizing how addicted they are to it. "I was bored initially but then realized talking to people in real life still works."
I've also seen issues uploading images to Whatsapp in the past half hour. I wonder if there's anything to do with the Google Cloud Storage outage that took down Gmail yesterday?
The only things I can think of that would cause this scale of outage are either a Tier 1 datacenter outage or (conspiracy hat on) a major hack with everyone rush patching.
Would be interesting to read the post mortem if there is any regardless
If rush patching were going on, we’d likely see some hints in commit messages of open source projects, like the Linux kernel commits that were tipoffs to Meltdown and Spectre.
Edit: Has anyone seen anything of this sort in any of the projects they follow?
No, but everyone is 99% sure it was related to killing Google+, which was announced not long before. Everyone who used YouTube way back when knows they had to make a Google+ identity at one point to link them. HMMMMM....
^^^ I noticed this too. GCP is under all 50 shades of outages since past few days. Feel I might need to rush back to my house and start digging a bunker
The Bay Area Peninsula has been having strong winds and heavy rains for the past few months. The last 3 days, there have been major power outages across the area. Redwood City had power outages 2 days in a row, and Pacifica lost power to a good chunk of the city for like 7 hours last night. It wouldn't surprise me if all these major tech outages that have been happening this week are related to poor Bay Area infrastructure.
Instagram seems to load the feed here fine (EU), but doesn't allow you to log in from any device or post anything new. FB is totally fine for reading if you are logged in, but you also can't log in if logged out.
Barely - BGP is that complex, I'm afraid, but it's the wild wild west when it comes to potential nation-state attacks on internet traffic. I wonder if these patterns are typical or worrisome.
Coincidentally, just watched The Social Network, the plot of which includes that quote by Mark:
> Let me tell you the difference between Facebook and everybody else. We don't crash ever! If the servers are down for even a day, our entire reputation is irreversibly destroyed. <…>
> Even a few people leaving would reverberate through the entire user base. The users are interconnected. That is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go.
Can’t argue with that, however the fact that Facebook used to go down before would not preclude that each time was (and maybe still is) seen as a major incident within the original corporate culture.
Doesn't their world-class team make such a long outage quite unlikely? How hard would it be to devote ample resources to a cover story for the "incident report"? Is the timing relative to the plethora of indictments relevant at all? Is it reasonable that this may be related to shredding of data and/or code, or even cooperation to turn over data to the government in a secret deal?
Why? When Facebook is down, you cannot use Facebook to communicate about the issue. That is also the reason why FB still uses* IRC instead of Messenger for coordinating the resolution of such issues.
* Or at least it did a year ago, when I was working there.
Hmm, I wonder if we could leverage that to make IRC more popular again. "FB uses X" is often all the justification small startups need for picking a tech.
Especially if they can't sign in via OAuth. To an average user who signs into Spotify with their Facebook account, "I can't sign into Spotify" means Spotify is down, not Facebook.
> The team at Jefferies remains reasonably positive, and in the firm's top growth stock calls for the week we found four tech stocks that are offering more aggressive accounts good entry points.
I got my GitHub two-factor auth SMS two hours late. Fortunately it was just my old laptop. I wonder if it was related. Good reminder to set up an authenticator app on my new phone so I don't have to rely on SMS!
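Worth noting there's no magic in those authenticator apps: TOTP (RFC 6238) is just HMAC-SHA1 over a 30-second time counter, small enough to sketch with the Python stdlib. This is an illustration of why the codes work offline (unlike SMS), not something to use instead of a real app:

```python
import base64
import hmac
import struct
import time

def totp(secret_b32, for_time=None, step=30, digits=6):
    """RFC 6238 TOTP: HMAC-SHA1 over the number of `step`-second intervals
    since the Unix epoch, dynamically truncated to `digits` decimal digits."""
    key = base64.b32decode(secret_b32.upper())
    counter = int((time.time() if for_time is None else for_time) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset (RFC 4226)
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)
```

Since the only shared state is the secret and the clock, the code generation needs no network at all, which is exactly the property a delayed SMS lacks.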
SMS is still an option if you have a working 2FA Authenticator on GitHub. And even if I went through the trouble to disable it, I disagree. There are conceivable ways people could get to my email to initiate a password reset without getting to my phone, such as snatching my laptop from me while I'm working at a coffee shop.
Whatever happens right now at Facebook is less important than the fact that they will never say what affected them. Of course nobody would say "hey, we're down right now due to a 0-day / a mistake", but...
We're up to around 7 hours of partial outage now. I have spoken to FB employees in the past; every hour they are even partially down, they are losing millions and millions in revenue.
My first thought when I heard the news was BGP hijacking (setting aside whether accidental or deliberate). Don't the symptoms fit other known cases, like the Telegram incident in Iran last year, just at a larger scale?
Admittedly networking is not my strength, so perfectly happy for someone to shoot down this hypothesis.
I haven't been able to post anything on Facebook, neither a new post to my wall nor a comment on a friend's post, since mid-morning US/Eastern, and this is still the case. In addition, I can't log in to the site; I can only access it where I'm already logged in.
You're already logged in, so it's just going to show you old content. I'd be surprised if you're able to post anything, or if any "like" you give during the outage is saved. The same thing is happening with both the Instagram and FB apps on Android for me.
This is the first time I've experienced this. Also note that current sessions on messenger.com still work; we can still send/receive messages, but can't upload any image or send stickers. Looking forward to a post-mortem analysis of this.
My hunch is that it's the end of Q1 and people are trying to release code changes so they can pad their Q1 performance reviews: "designed and delivered feature X on time in Q1".
I've seen all quarter ends being targets for releases. And things that have been delayed since the start of the year are usually pushed to the end of Q1.
Perhaps relevant: npm has been having issues too, though they only recently caught and fixed them. Scoped private npm packages were getting Cloudflare 503 errors.
Since it went down for PC and not mobile, I wondered whether it was some kind of audience test, part of a move to an app-only platform.
Meanwhile in Russia they are talking about disconnecting their network from the rest of the world.
Some test gone south?... Maybe...
Someone has a traceroute handy?
My fiancée's uncle sent something today saying that because of a school shooting in Brazil, they were blocking all images and video shared to social networks like "WhatsApp, Instagram, Facebook and other social networks". I haven't been able to verify this myself or from any other sources, but I wonder if people are misinterpreting the FB outage, or if Brazil blocking content is having weird ripple effects.