[dupe] Facebook-owned sites were down (facebook.com)
2589 points by nabeards on Oct 4, 2021 | 1283 comments



There's still no connectivity to Facebook's DNS servers:

    > traceroute a.ns.facebook.com
      traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
      1  dsldevice.attlocal.net (192.168.1.254)  0.484 ms  0.474 ms  0.422 ms
      2  107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1)  1.592 ms  1.657 ms  1.607 ms 
      3  71.148.149.196 (71.148.149.196)  1.676 ms  1.697 ms  1.705 ms
      4  12.242.105.110 (12.242.105.110)  11.446 ms  11.482 ms  11.328 ms
      5  12.122.163.34 (12.122.163.34)  7.641 ms  7.668 ms  11.438 ms
      6  cr83.sj2ca.ip.att.net (12.122.158.9)  4.025 ms  3.368 ms  3.394 ms
      7  * * *
      ...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.


"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".

I'm not sure of all the implications of those circular dependencies, but they probably make it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" listed for sale on domain sites: the registrar that would normally provide the ownership info is down.
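For the curious, here is roughly where that lookup chain breaks. This is a sketch of the .com thin-whois referral, abbreviated and from memory: the registry side (Verisign) still answers, but it refers you to the registrar's own whois server, whose name sits behind Facebook's unreachable DNS.

    > whois facebook.com
      ...
      Registrar: RegistrarSafe, LLC
      Registrar WHOIS Server: whois.registrarsafe.com   <- resolves via the same dead nameservers
      ...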

Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.


Notes as Facebook comes back up:

"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.

That's what you have to do to really own a domain.


Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per year, plus variable fees and transaction fees ($0.18/yr). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE-type stuff).


> transaction fees ($0.18/yr)

Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much inflated price. Does ICANN have rules against this? Surely this is being done?

[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...


> Surely this is being done?

Personally saw this kind of thing as early as 2001.

Never search for free domains on the registrar's site unless you are going to register them immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.


I can sadly attest to this behavior as recently as a couple years ago :(

I searched for a domain that I couldn't immediately grab (one of the more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(

Emailed the site's new owner D: but fairly predictably got no reply.

Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.

I now exclusively do all my queries via the WHOIS protocol directly. Welp.


> Surely this is being done?

Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.


Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.


There are also fees you have to pay to the owner of the TLD. For example, .com has an $8.39 fee. In total that would be $8.57 per .com domain.

You are off by a factor of almost 50.
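Spelling out the arithmetic with the figures from this thread:

    1,000,000 domains x $0.18 =   $180,000/yr   (ICANN transaction fee only)
    1,000,000 domains x $8.57 = $8,570,000/yr   (adding the $8.39 .com registry fee)
    $8.57 / $0.18 ≈ 47.6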


I didn't know that, and you're right. For anyone who's interested, I found the following references regarding the $8.39 additional fee for a .com registration:

https://itp.cdn.icann.org/en/files/registry-agreements/com/c...

https://www.icann.org/en/system/files/correspondence/stewart...

https://www.icann.org/en/announcements/details/icann-and-ver...


They have a pretty interesting page on the topic: https://www.icann.org/resources/pages/financials-55-2012-02-...

They want you to have $70k liquid.


And they want you to be someone other than Peter Sunde:

https://torrentfreak.com/icann-refuses-to-accredit-pirate-ba...


This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; this provides a bootstrap that keeps you out of a circular dependency. All that said, it's usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
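If you want to see that bootstrap yourself, ask one of the .com TLD servers directly (no recursion): the glue for the in-bailiwick nameservers comes back in the ADDITIONAL section. A rough sketch, output trimmed to the relevant lines:

    $ dig +norecurse @a.gtld-servers.net facebook.com NS
      ...
      ;; AUTHORITY SECTION:
      facebook.com.       172800  IN  NS  a.ns.facebook.com.
      ...
      ;; ADDITIONAL SECTION:
      a.ns.facebook.com.  172800  IN  A   129.134.30.12
      ...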


There is always a step that involves emailing the domain contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need those NS to query the MX records in order to send that update-approval email and unblock the registrar update. Glue records are more about performance than about breaking that loop. I may be missing something, but hopefully they won't need to send an email to fix this issue.
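If they did, here is roughly where it would jam (a sketch, output abbreviated): the sending mail server has to resolve the MX first, and those records live behind the same unreachable nameservers.

    $ dig MX facebook.com
      ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: ...
      # SERVFAIL means no MX answer, so the approval email has nowhere to go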


I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).

To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.

In any case, no, they will not need to send an email to fix this issue.


I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.


Yes I meant for transferring to another DNS server. In this case, they can't.


This is not true when you're the registrar (as in this case); in fact, your entire system could be down and you'd still have access to the registry's system to do this update.


FB is running their own registrar. Supposedly they can sidestep the email procedure if it's even there to begin with.


Facebook does operate their own private Registrar, since they operate tens of thousands of domains. Most of these are misspellings and domains from other countries and so forth.

So yes, the registrar that is to blame is themselves.

Source: I know someone within the company that works in this capacity.


> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

That's not how it works. The info on whether a domain name is available is provided by the registry, not by the registrars. It's usually done via a domain:check EPP command or via a DAS system. It's very rare for registrar-to-registrar technical communication to occur.

Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
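Roughly the shortcut in question; a sketch only, and the point is that an empty or failed answer can mean "nameservers unreachable" just as easily as "unregistered":

    # the lazy heuristic: "no records? must be available"
    $ dig +short A facebook.com
                                  <- empty output during the outage (SERVFAIL upstream)
    # the clean check goes to the registry itself: an EPP domain:check,
    # or whois/RDAP at the .com registry (Verisign)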


When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses, so there's no circular-dependency issue.


Good catch. Hopefully, they won't need an email sent to fb.com from registrarsafe.com to update an important record to fix this. What a loop.


It's partially there. C and D are still not in the global tables according to RouteViews, i.e. 185.89.219.12 is still not being advertised to anyone. My peers to them in Toronto have routes from them, but I'm not sure how far they are supposed to go inside their network. (Past hop 2 is them.)

    % traceroute -q1 -I a.ns.facebook.com
    traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
     1  torix-core1-10G (67.43.129.248)  0.133 ms
     2  facebook-a.ip4.torontointernetxchange.net (206.108.35.2)  1.317 ms
     3  157.240.43.214 (157.240.43.214)  1.209 ms
     4  129.134.50.206 (129.134.50.206)  15.604 ms
     5  129.134.98.134 (129.134.98.134)  21.716 ms
     6  *
     7  *

    % traceroute6 -q1 -I a.ns.facebook.com
    traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
     1  toronto-torix-6  0.146 ms
     2  facebook-a.ip6.torontointernetxchange.net  17.860 ms
     3  2620:0:1cff:dead:beef::2154  9.237 ms
     4  2620:0:1cff:dead:beef::d7c  16.721 ms
     5  2620:0:1cff:dead:beef::3b4  17.067 ms
     6  *
     7  *
     8  *


Kevin Beaumont:

   »The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background = everybody is being slammed who runs large scale DNS, so knock on impacts elsewhere the long this goes on.«

https://twitter.com/GossiTheDog/status/1445118907187175427


Oh my gosh, their IPv6 address contains "face:b00c"...

> 2a03:2880:f0fc:c:face:b00c:0:35


Besides being fun and quirky, it is actually useful for their sysadmins as well as sysadmins at other orgs.

Well, at least it will be in 2036, when IPv6 goes mainstream.


How difficult is it to get such a "vanity" address?


You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.


The face:b00c part is in the interface ID, so this did not even need a large block (though I am sure they have one).
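Splitting the address from the traceroute above at the usual /64 boundary, just to illustrate where the vanity hex sits:

    2a03:2880:f0fc:c   -> 64-bit network prefix (from their allocation)
    face:b00c:0:35     -> 64-bit interface ID (freely chosen by the operator)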


dead beef sounds about right


My suspicion is that since a lot of internal comms runs through the FB domain, and since everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.


I don’t know how true it is but a few reports claim employees can’t get into the building with their badges.


I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"


> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"

Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).


Breaking the glass to get in to fix the service is totally a good business move.

A few hundred bucks of glass vs. a billion wiped off the share price if the service is down for a day and all the users go find alternatives.


In case of emergency, break glass...

...the doors are glass right?


Zuck's personal conference room has 3 glass walls, so I've been amusing myself imagining him just throwing a chair through one of the walls.


That glass is bullet resistant.


Do they (you?) call him that at FB?


Yes, "Zuck".


I'm assuming someone in building security has watched the end of Ex Machina...and applied some learnings, or not.


All doors are glass with the right combination of a halligan bar, an axe and a gasoline powered saw.

And I guess beyond that point, walls are glass. Or you need explosives.


Aaaaaaand it's down!


maybe they're open by default, like old 7-11 stores when they went 24hrs and had no locks on the doors :)


Link to such claims here: https://news.ycombinator.com/item?id=28750894

I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.


I think the real take away is that no one has this figured out.

A small company has to keep all of its customers happy (or at least be responsive when issues arise, at a bare minimum).

Massive companies deal in error budgets, where a fraction of a percent can still represent millions of users.



I guess they didn't have an "emergency ingress" plan.


Then they will have to old-school it and try a brick.


I've heard on Blind this is unrelated, more of a Covid restriction issue.


What is Blind? Or shouldn't I ask?


www.teamblind.com

Enjoy.


A copy of Glassdoor


More like a crossover between Glassdoor... and Gab.


first rule of Blind, never talk about Blind


You mean the same problem as when GMail goes down and Googlers can't reach each other?

I guess good decentralized public communication services could solve those issues for everybody.


Googler here - my opinions are my own, not representing the company

At the lowest level, in case of a severe outage, we resort to IRC, Plain Old Telephone Service and, sometimes, sticky notes taped to windows...


Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.

I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.


Where are the tapes though? Colo on separate tectonic tape or nah?


?


I think texasbigdata is talking about backup tapes and maybe mistyped tectonic plate

Backup tapes and in-production servers are kept at different colocation sites to protect data from fire and other catastrophes of that level.

Using colo sites on separate tectonic plates would protect you from catastrophes on a geological cataclysm level


We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.

Last time I used tape, we used Iron Mountain to haul the tapes 60 miles away, which was determined to be far enough for seismic safety, but that was over a decade ago.


Thank you kind sir.


Some people here say their fallback IRC doesn't work due to DNS reliance. :|


One of my employers once forced all the staff to use an internally-developed messenger (for sake of security, but some politics was involved as well), but made an exception for the devops team who used Telegram.


Telegram? Interesting choice!


Devops like Telegram because it has proper bot API, unlike many other competitors.


Oh! It makes sense. While I don't like telegram for some reasons, their API is totally top notch and a real pleasure to work with.


That would completely defeat the purpose... I have a hard time believing that.


Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.


The entire purpose of an IRC backup is in case shit hits the fan. That means having it run on a completely separate stack.

What use is it if it runs on the same stack as what you might be trying to fix?


Clearly "our entire network is down, worldwide" wasn't part of their planning. Don't get too cocky with your 20/20 hindsight.


I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.


I bet they never tested taking out their own DNS.

IRC does use DNS at least to get hostnames during connection. I'd be surprised if it didn't use it at other points.


I've set up hosts files in case DNS was down to access critical systems before. It's a perfectly reasonable precaution.


My small org (maybe 50 IPs/hosts we care about) still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also have our fingers crossed.
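It looks something like this; names and addresses swapped for placeholders, obviously:

    # fallback hosts entries, kept in Git and copied into every laptop's /etc/hosts --
    # they resolve even when every DNS server is unreachable
    203.0.113.10    git.corp.example.com
    203.0.113.11    vpn.corp.example.com
    203.0.113.12    oob-console.dc1.example.com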


If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!

My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...


> If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!

yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...


Considering how much IRCops were paid back in the day (mostly zero, as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread around the world willing to share this 250k+ salary amongst them.


That is called outbound network :)


I worked on the identity system that chat (whatever the current name is) and Gmail depend on, and we used IRC, since if we relied on the system we support we wouldn't be able to fix it.


Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.


The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.

The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.

Source: Chapter 1 of "Building Secure and Reliable Systems" (https://sre.google/static/pdf/building_secure_and_reliable_s... size warning: 9 MB)


Lovely.

Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"

But as they say: make something foolproof and nature will create a better fool.


I'm sure this sort of thing won't be a problem for a company whose founding ethos is 'move fast and break things.' O:-)


Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?


Folks are still chatting here... seems to work as designed...


Aren't we still communicating on HN, even though the possibly largest network is down? Can you send email?


We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D


Hahaha... I can relate to that. Email is synonymous with Gmail now, something that only dads and uncles use. :-)


Somehow I gotta figure out how to get kiddos interested in networking...


Setting up a Minecraft server has been a good experience for my kiddo to learn more networking.


I am going to guess it’s one of those things the techies want to get round to, but in reality there is never any chance or will to do it.


I can assure you that Google has a procedure in place for that.


I unfortunately cannot edit the parent comment anymore, but several people pointed out that I didn't back up my claim or provide any credentials, so here they are:

Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.

I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently, part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course, Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.

While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?


In future news: Waymo outage results in engineers unable to get to data center. Engineers don't even know where their servers are.


Give us the dirt on how Google does its disaster planning exercises, please! Do you do these exercises all at once or slowly over the year?


Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.

Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.

While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.

If you are interested in reading more about Google's approach to disaster planning and preparedness, you may want to read the "DiRT, or how to get dirty" section of "Shrinking the time to mitigate production incidents—CRE life lessons" (https://cloud.google.com/blog/products/management-tools/shri...) and "Weathering the Unexpected" (https://queue.acm.org/detail.cfm?id=2371516).


Why not do both? ;)


Yup, they make a new chat app if the previous one is down.


Google Talk, Google Voice, Google Buzz, Google+ Messenger, Hangouts, Spaces, Allo, Hangouts Chat, and Google Messages.

At some point, they must run out of names, right?


You forgot google meet!


And Google Wave.


You forgot the chat boxes inside other apps like Google docs, Gmail, YouTube, etc.


And Google Pay, apparently.


> Yup, they make a new chat app if the previous one is down.

Continuous Deployment.


For those who don't know who he is: l9i would know this. Just clarifying that this is not an Internet nobody guessing.


He is still an anonymous dude to me.


HN Profile -> Personal Website -> LinkedIn -> Over 10 years experience as Google Site Reliability Engineer


Is the LinkedIn profile linking back to the hn account?


Security Engineer asking?


Ha, no. It just occurred to me that any random Hacker News account could link to somebody's personal account and claim authority on some subject.


Google SRE for 10 years, ending as the Principal Site Reliability Engineer (L8).


s/the//

Google has more than 1 L8 SRE.


I don't know who either he or you are, so...


I was clarifying his comment, since he didn't mention that this is not a guess, but inside knowledge.

I was not trying to establish a trust chain.

Take from it what you will.


Why does it matter if he's guessing or not?


Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.

No shit Google has plans in place for outages.

But what are these plans, and are they any good? A respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.


I've found that when I post things I learned on the job here it actually causes people to tell me I'm wrong or made it up even more often…


It's kind of amusing given that employers are usually pretty easy to deduce based on comments…


That's just an 'appeal to authority'.

No-one knows or cares who made the statement, it may as well have been 'water is wet', it was useless and adds nothing but noise.


I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity, as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and URL of my personal website are only a click away.

I have replied to my initial comment with some additional context: https://news.ycombinator.com/edit?id=28752431. Hope that helps.


That’s…not what “appeal to authority” means.


I've read here on HN that exactly this was the issue when they had one of their bigger outages (I think it was due to some auth service failure) and Gmail didn't accept incoming mail.


A Gmail outage would be barely an inconvenience as Gmail plays a minor role in Google's disaster response.

Disclaimer: Ex-Googler who used to work on disaster response. Opinions are my own.


What do you think all those superfluous chat apps were for?


I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.


BGP is decentralised.


LOL - score one against building out all tooling internally (a la Amazon and apparently Facebook too)


The rate at which some Amazon services have lately gone down because other AWS services went down proves that this is an unsustainable house of cards anyway.


Netflix knows how to build on top of a house of cards.


There's a joke here somewhere about how bad the final season was


Those communications are done over irc at FB for exactly this purpose.


time to start working at your mfing desk again, johnson


They supposedly can't enter facebook office right now. Their cards don't work.


Why would a system like that have to be in their online infrastructure?


For doing LDAP lookups against the corporate directory? Oh wait, LDAP configuration of course depends on DNS and DNS is kaputt...


source?



Sheera Frenkel @sheeraf Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.


"Something went wrong. Try reloading."

it's not loading for me. could you say what it said?



From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."


"Something went wrong. Try reloading."

it's not loading for me. could you say what it said?


> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://nitter.net/sheeraf/status/1445099150316503057


Disclose.tv @disclosetv JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)


What do you think will be the impact on WFH and office requirements?


Unlikely, PagerDuty was invented for this kind of thing


Oh I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc., if all of those connections rely on the facebook domain existing?


Hence the suggestion for PagerDuty. It handles all this, because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so that when in trouble nobody has to ask those questions and just add a person as a responder to the incident.


Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?


PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?


Sure. As long as you plan for disaster.

The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.


Even if it can't, it's trivial to use it for an important subset, i.e. is facebook.com down, is the NS stuff down, etc. So there is an argument to be made for still using an outside service as a fallback.


Sure, if you're...

- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once

Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.


I guarantee you that every single person at Facebook who can do anything at all about this, already knows there's an issue. What would them receiving an extra notification help with?


We kind of got off topic. I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting), something like PagerDuty would be fine as a backup. Even at huge scale, that backup doesn't need to watch everything.

I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.


Heck of a coincidence I must say...

I can imagine this affects many other sites that use FB for authentication and tracking.

If people pay proper attention to it, this is not just an average run of the mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...

Hopefully law makers all study up and pay close attention.

What transpires next may prove to be very interesting.


Indeed, what happened shows a good reason not to rely only on social log-in for various sites.


NYT tech reporter Sheera Frenkel gives us this update:

>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057


Got a good chuckle imagining a fuming Zuckerberg not being allowed into his office, thinking the world is falling apart.


Can’t get in to fix error


I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.


How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?


I am guessing that most POTS is VOIP now, except for the few places with existing copper infrastructure that has not been decommissioned yet.


This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?


He called on his mobile phone. As a result it was a voice-only conversation, no video.


They could have dialed in with their own cell phones, though.


I read that as POTUS at first and paused for a minute


What is POTS?



Plain Old Telephone Service. Aka a phone.


Plain Old Telephone Service


Looks like they misconfigured a web interface that they can't reach anymore now that they're off the net.

"anyone have a Cisco console cable lying around?"


The only one they have is serial and the company's one usb-to-serial converter is missing.


The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.


Great visual!


Yeah the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Ooops no DNS to resolve the MX records to send the patch to fix the BGP routers.


Seriously? Is that how it works?


No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.

If this issue is even to do with BGP it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.


The Swiss cheese model of accidents. Occasionally the holes all align.

https://en.wikipedia.org/wiki/Swiss_cheese_model


The fun part of BGP is they apparently make a lot of use of it within their network, not just advertising routes externally.

https://engineering.fb.com/2021/05/13/data-center-engineerin...

(and yes, fb.com resolves)


No, the backbone of the internet is not maintained with patches sent in emails.


You are very wrong about that ;) https://lkml.org/


You are very wrong about that https://lkml.org/


Clearly you and the person you replied to are talking about very different things.


I think the sub-comment is confusing the linux kernel with BGP.


In a way, the Linux kernel does power the "backbones of the internet".


There are a hell of a lot of non-Linux OSes running on core routers, but yes, in a way. However, BGP isn't done via email.


On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.


luckily not... would be absolutely terrible to have the backbone only on linux


Interoperability and a thriving ecosystem are necessities for resiliency.

Note that resiliency and efficiency are often working against each other.



I don't know. I doubt it. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP. You need DNS to send email, which needs BGP. It's a kind of chicken-and-egg problem, but at a massive scale this time.


Sheera Frenkel:

    Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
https://twitter.com/sheeraf/status/1445099150316503057


You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.


Can someone explain why it is also down when trying to access it via Tor using its onion address: http://facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5t...

Or when trying ips directly: https://www.lifewire.com/what-is-the-ip-address-of-facebook-...

I would have expected a DNS issue to not affect either of these.

I can understand the onionsite being down if facebook implemented it the way a thirdparty would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first class citizen.


You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500


The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.


routing was down _everywhere_ so tor is getting a better experience than most people by getting a 500 error


DNS is back, looks like systems are still coming online.



Reddit r/Sysadmin user that claims to be on the "Recovery Team" for this ongoing issue:

>As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC). There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures.

User is providing live updates of the incident here:

https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like...


He just deleted all his updates.

user:

https://old.reddit.com/user/ramenporn

some messages:

* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).

* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.

* Update 1440 UTC:

    As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).

    There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.

    Part of this is also due to lower staffing in data centers due to pandemic measures.


The 1440 UTC update is also archived on the Wayback Machine: https://web.archive.org/web/20211004171424/https://old.reddi...

And archive.today: https://archive.ph/sMgCi


Essentially, they locked themselves out with an uninspired command line at the exact moment the datacenter was being hijacked by ape-people.

Yup, corporate comms won't love these status updates.


Sorry, are you referring to data center technicians as “ape people”?


As a former data center technician, I wouldn't say it's too far off


But we're all ape people.



I mean, when I last worked in a NOC, we used to call ourselves "NOC monkeys", so yeah. If you're in the NOC, you're a NOC monkey; if you're on the floor, you're a floor monkey. And so on.


Same with "SOC monkeys". (Which carries the additional pun of sounding like the "sock monkey" toy.)


Are you fucking kidding me?

We even had a site and operation for a long while called:

"NOC MONKEY .DOT ORG"

We called all of ourselves NOC MONKEYS. [[Remote Hands]]

Yeah, that was a term used widely.

I'm 46. I assume you are < #

---

Where were you in 1997 building out the very first XML implementations to replace EDI from AS400s to FTP EDI file retrievals via some of the first Linux FTP servers based in SV?

I was there? Remember LinuxCare?


Are you ok, Sir?


Weren't able to get their ego fill on Facebook like normal.


And there his account went poof, thanks for archiving.


They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.

Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.


Facebook should have had a panic room.

Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of BGP failure, nuclear war, etc. I could see physical presence being an issue if their BGP router depends on something like a crypto module in a locked cage, in which case there's always helicopters.

So if anything, Facebook's labor policies are about to become cooler.


Yup, it's terrifying how much is ultimately dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.


What did the dongles do?


Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)

I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.


I would be absolutely shocked if they didn't.

The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.


It helps if your DSL line is bridging at layer 2 of the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.


That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.


Operations teams don't need permission from some apparatchik to enter the office when production goes down. If they can't get in, they drill.


> nuclear war

I think you need some convincing to keep your SREs on-site in case of a nuclear war ;)


Hey, if I can take the kids and there’s food for a decade and a bunker I’m probably in ;)


I'm not sure why shareholders are lumped in here. A lot of the reason companies do the secret squirrel routine is to hide their incompetence from the shareholders.


That is what I meant, although you have lots of executives and chiefs who are also shareholders.


> an organization that hasn't actually thought through all its failure modes

Thinking of every potential thing that could happen is impossible.


You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."


In a company the size of Facebook, "everything is turned off" has never happened since before the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged themselves into the data-center boot-up flow.

The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.


>we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

I think that suggests that there were not bigger fish to fry :)

I take your point on priorities, but in a company the size of facebook perhaps a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could be funded and part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.


>> we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.

> I think that suggests that there were not bigger fish to fry :)

I can see this problem arising in two ways:

(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.

(2) Growth of failure probabilities with system size: A meteor hit on earth is extremely rare. But in the future there might be datacenters in the whole galaxy, so there's a center being hit every month or so.

In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.


> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.


It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.


I love that when you had to think of a random improbable event, you thought of a cocaine meteor. But ... hell YES!


Luckily you don't need to do that exhaustively: all you have to do is cover the general failure case. What happens when communications fail?

This is something that most people aren't good at naturally, it tends to come from experience.


Right, but imagining that DNS goes down doesn’t take a science fiction author.


Of course you can’t think of every potential scenario possible, but an incorrect configuration and rollback should be pretty high in any team’s risk/disaster recovery/failure scenario documentation.


This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.


> I hope they don't lose their job.

I hope they do.

#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.

In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.

(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)

#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem not post about it on reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.


You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.

Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.


I feel this way about mistakes, and fuckups.

Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.

But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.


I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.

It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nationstates anything they didn't already know.


Yeah but what is the tech public going to do with these insights?

It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.

It's pure idle chitchatter.

So yeah, I do give a shit about corporate here.

Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.


Facebook, the social media website is used, almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)

It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is back up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?

ramenporn violated the corporate communication policy, obviously, but the tone and approach of a good manager toward an IC doing this online isn't to make it about corporate vs. them/the team; in fact, encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and nurture that, not destroy it in the process of chiding them for breaking policy.)


> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.

You're conflating working remotely ("a plane ride away") and working from home.

You're also conflating the people who are responsible for network configuration, and for coming up with a plan to fix this, with the people who are responsible for physically interacting with the systems. Regardless of WFH, those two sets likely have no overlap at a company the size of Facebook.


There could be something in the contract that requires all community interaction to go via PR official channels.

It's innocuous enough, but leaking info, no matter what, will be a problem if it's stated in their contract.


100%! Comms will want to proof any statement made by anybody, along with legal, to ensure that there is no D&O liability for securities fraud.


> an organization that hasn't actually thought through all its failure modes

Move Fast and Break Things!


I came here to move fast and break things, and i'm all out of move fast.


In their defense they really lived up to their mission statement today.


I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID


> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID

I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In non-wfh you physically walk to your co-workers desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms goes down during WFH.

Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-workers' numbers (except one from WhatsApp, which if I worked at FB wouldn't be useful now).


Apple is all on Slack.


But is it a publicly hosted slack, or does apple host it themselves?


I don't think it is possible to self-host Slack.


Amazon has a privately managed instance.


Most of the stuff was probably implemented before COVID anyways.

They will fix the issue and add more redundant communication channels, which is either an improvement or a non-event for WFH.

And Zuck is slowly moving (dogfooding) company culture to remote too with their Quest work app experiments


They must have been moving very fast!


shoestring budget on a billion dollar product. you get what you deserve.


> I hope they don't lose their job.

FB has such poor integrity, I'd not be surprised if they take such extreme measures.


It is a matter of preparation. You can make sure there are KVM-over-IP or other OOB technologies available on site to allow direct access from a remote location. In the worst case, a technician has to know how to connect the OOB device or press a power button ;)


I'm not disagreeing with you, however clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't login to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.


I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.


Pushshift maintains archives of Reddit. You can use camas reddit search to view them.

Comments by u/ramenporn: https://camas.github.io/reddit-search/#{%22author%22:%22rame...


PushShift is one of the most amazing resources out there for social media data, and more people should know about it.


Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.


If it was actually someone in Facebook, their job is gone by now, too.


It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.


I'll throw Discord into that mix, the thing that mostly killed IRC. Which is yet again centralized, despite pretending that it is not.


The account has been deleted as well.


What are they afraid of? While they are sharing information that's internal/proprietary to the company, it isn't anything particularly sensitive and having some transparency into the problem is good for everyone.

Who'd want to work for a company that might take disciplinary action because an SRE posted a reddit comment to basically say "BGP's down lol"? If I were in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.


Seems reasonable that at a company of 60k, with hundreds who specialize in PR, you do not want a random engineer deciding on his own to be the first to talk to the press by holding an impromptu press conference on a random forum.


Honestly, from a PR perspective, I’m not sure it’s so bad. Giving honest updates showing Facebook hard at work is certainly better PR for our kind of crowd than whatever actual Facebook PR is doing.


That one guy's comments seem fine from a PR perspective, apart from it not being his role to communicate for the company.

I still think he should be fired for this kind of communication though. One reason is, imagine Facebook didn't punish breaches of this type. Every other employee is going to be thinking "Cool, I could be in a Wired article" or whatever. All they have to do is give sensitive company information to reporters.

Either you take corporate confidentiality seriously or you don't. Posting details of a crisis in progress on your Reddit account is not taking corporate confidentiality seriously. If the Facebook corporation lightly punishes, scolds, or ignores this person then the corporation isn't taking confidentiality seriously either.


I agree, but try to explain that to PR people...


It's terrible PR for the FB PR team's performance.


Reporters are going to opportunistically start writing about those comments vs having to wait for a controlled message from a communications team. So the reddit posts might not be "so bad", but they're also early and preempting any narrative they may want to control.


You falsely assume Hacker News is even remotely what Facebook PR gives a shit about.


That was their best PR in years


Compare Facebook's official tweet: "We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience."

That's the PR team, clueless.


I don't think Facebook could actually say anything more accurate or more honest. "Everything is dead, we are unable to recover, and we are violently ashamed" would be a more fun statement, but not a more useful one.

There will be plenty of time to blame someone, share technical lessons, fire a few departments, attempt to convince the public it won't happen again, and so on.


I agree completely. The target audience Facebook is concerned about is not techies wanting to know the technical issues. It's the huge advertising firms, governments, power users, etc. who have concerns about the platform or have millions of dollars tied up in it. A bland statement is probably the best here, and even if the one engineer gave accurate, useful info, I don't see how you'd want to encourage an org in which thousands of people feel the need to post about what's going on internally during every crisis.


Well, they could at least be specific about how large the outage is. "Some people" is quite different to absolutely everyone. At least they did not add a "might" in there.


Facebook has never been open and honest about anything, no reason to think they would start now.


To be fair, Facebook has never been open and honest about anything.


Facebook is well known for having really good PR; if they go after this guy for sharing such basic info, that's yet another example of their great PR teams.


These few sentences were a better and more meaningful read than anything hundreds of PR people could ever come up with.


A few random guesses (I am not in any way affiliated with FB); just my 2c:

Sharing status of an active event may complicate recovery, especially if they suspect adversarial actions: such public real-time reports can explain to the red team what the blue team is doing and, especially important, what the blue team is unable to do at the moment.

Potentially exposing the dirty laundry. While a postmortem should be done within the company (and as much as possible is published publicly) after the event, such early blurbs may expose many non-public things, usually unrelated to the issue.


Mentioned in another reply

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.


I did not read it as they can't get them on site, but rather that it takes travel to get them on site. Travel takes time, which they desperately do not want to spend.


> If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.

That seems pretty unlikely at any but the smallest of companies. Most companies unify all external communications through some kind of PR department. In those cases usually employees are expressly prohibited from making any public comments about the company without approval.


> What are they afraid of?

Zuckerberg Loses $7 Billion in Hours as Facebook Plunges

https://finance.yahoo.com/news/zuckerberg-loses-7-billion-ho...

Stop the hemorrhaging. Too much bad press for FB lately and it all adds up.


Unrelated to the outage, but I hate headlines like this.

Facebook is down ~5% today. That's a huge plunge to be sure, but Zuckerberg hasn't "lost" anything. He owns the same number of shares today as he did yesterday. And in all likelihood, unless something truly catastrophic happens the share price will bounce back fairly quickly. The only reason he even appears to have lost $7 billion is because he owns so much Facebook stock.

These types of alarmist headlines are inane.


Unlikely to be related. FB's losses today already happened before FB went down, and are most likely related to the general negative sentiment in the market today, and the whistleblower documents. It's actually kind of remarkable how little impact the outage had on the stock.


There was no permanent damage to Facebook as a result of the outage so it's understandable that the stock price wasn't really affected by it


I was thinking the same...


As much as all of the curious techies here would love transparency into the problem, that doesn't actually do any good for Facebook (or anyone else) at the moment. Once everything is back online, making a full RCA available would do actual good for everyone. But I wouldn't hold my breath for that.


FB takes confidentiality very seriously. He crossed a major red line.


They were told explicitly, in the outage meeting, that they shouldn't be sharing updates from it.


Do we even know if someone had the account deleted? I think Facebook might have their hands full right now solving the issue rather than looking at social media posts that discuss the issue.


There are a lot of people who work at Facebook, and I'm sure the people responsible for policing external comms do not have the skills or access to fix what's wrong right now.


Assuming that Facebook forced the account to be deleted, it wouldn't have been done by anyone who's working on fixing the problem.


> the people with physical access is separate from the people with knowledge of [...]

Welcome to the brave new world of troubleshooting. This will seriously bite us one day.


I like how FB decided to send "ramenporn" as their spokesperson.


A particular facet I love of the internet era is journalists reporting serious events while having to use completely absurd usernames...

"A Facebook engineer in the response team, ramenporn..."


I remember some huge DDOS attacks like a decade ago, and people were speculating who could be behind it. The three top theories were Russian intelligence, the Mossad, and this guy on 4chan who claimed to have a Botnet doing it.

That was the start of living in the future for me.


4chan is disturbingly resourceful at times. I have heard them described as weaponized autism.


Ya, on hn it's merely productized.


That's a pretty accurate description of the site, lol.

On a side-note, I think you'll enjoy some of the videos by the YouTube 'Internet Historian' on 4chan:

* https://www.youtube.com/watch?v=SvjwXhCNZcU

* https://www.youtube.com/watch?v=HiTqIyx6tBU


My favorite example of this is when I saw references to "Goatse Security" on the front page of the Wall Street Journal


This felt like something straight out of a post-modern novel during the whole WSB press rodeo, where some usernames being read out on TV were somewhere between absurd and repulsive.

Loved it.


I believe that's the exact reason behind the pattern of horrifying usernames on reddit and imgur. It's magnificent in its surrealness.


Exactly. I'm constantly having déjà vu from Vernor Vinge's Rainbows End lately.


>journalists reporting serious events

A facet I don't love is journalism devolving to reposting unverified, anonymous reddit posts.


"Discussed in Hacker News, the user that goes by the 'huevosabio' handle, stated as a fact that..."


‘He was then subsequently attacked by “OverTheCounterIvermectin” for his tweets on transgender bathrooms from several months ago’.


The problem with tweets on transgender bathrooms is that you can be attacked for them by either side at any point in the future, so the user OverTheCounterIvermectin should have known better.


I got quoted as noir_lord in the press.

My bbs handle from 30 years ago.


Immortality.


I'm worried about that person. I doubt Facebook will look kindly on breaking incident news being shared on reddit.


Apparently Facebook HQ didn't like how ramenporn handled the situation. His account has been deleted, as well as all his messages about the incident.


his account is active, only the incident comments were deleted


> [Reddit logo] u/ramenporn: deleted

> This user has deleted their account.


At least that department at Facebook is still working!


There never was a ramenporn.


That Ramenporn got engagement by Hate Speech


They work at facebook. Can’t imagine they have any illusions regarding their privacy/anonymity.


Curious what the internal "privacy" limitations are. Certainly FB must track reddit user : FB account mappings even if they don't actually display them. It just makes sense.


Thanks to the GDPR at least that's easy to verify for European users.


That said, it will be interesting to read their post-mortem next year and compare it with what ramenporn wrote.


lol no one cares. we're all laughing about this too (all of us except the networks people at least...)


I hope you won't have to delete your account too :)


Well, seems like FB shut down his post...


This is why so many teams fight back against the audit findings:

"The information systems office did not enforce logical access to the system in accordance with role-based access policies."

Invariably, you want your best people to have full access to all systems.


Well, you want the right people to have access. If you're a small shop or act like one, that's your "top" techs.

If you're a mature larger company, that's the team leads in your networking area on the team that deal with that service area (BGP routing, or routers in general).

Most likely Facebook et al. management never understood this could happen because it's "never been a problem before".


I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and can easily lock yourself out on a regular basis. Every single system has a local user with a random password that we can hand out for just this kind of circumstance...
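Not Facebook's setup, obviously, but a minimal sketch of that kind of break-glass local account on a single Linux host (the account name is made up, and the admin group name differs by distro; real setups usually pair this with console/OOB access rather than network login):

    # create a local break-glass account (name is illustrative)
    useradd -m -G wheel breakglass        # use group "sudo" instead of "wheel" on Debian/Ubuntu

    # generate a long random password and set it
    pw=$(openssl rand -base64 24)
    echo "breakglass:$pw" | chpasswd

    # print it once so it can be stored in an offline / out-of-band vault
    echo "store offline: $pw"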


Organizational complexity grows super-linearly; in general, the number of people a company can hire per unit time is either constant or grows linearly.

Google once had a very quiet big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were... Passwords, salts, etc. that cannot be represented as one-time pads and therefore are potentially dangerous magic numbers for folks to know. During disaster recovery once, they attempted to confirm that if the system had an outage, it would self-recover.

It did not.

This tripped a very quiet panic at Google because while the company would tick along fine for awhile without access to the master password database, systems would, one by one, fail out if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in an executive's wallet who was on vacation, and they had to be flown back to the datacenter to plug it in. The other one was stored in a safe built into the floor of a datacenter, and the combination to that safe was... In the password database. They hired a local safecracker to drill it open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.

(1) I say "ironically," but the actual point of their self-testing is to cause these kinds of disruptions before chance does. They aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster recovery testing, they attach to that frown the grain of salt that "Because this failure-mode existed, it would have occurred eventually if it didn't occur today."


That's not quite how it happened. ;)

<shameless plug> We used this story as the opening of "Building Secure and Reliable Systems" (chapter 1). You can check it out for free at https://sre.google/static/pdf/building_secure_and_reliable_s... (size warning: 9 MB). </shameless plug>


Thanks for telling this story as it was more amusing than my experiences of being locked in a security corridor with a demagnetised access card, looooong ago.


what if the executive had been pick-pocketed


EDIT: I had mis-remembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe in another datacenter that held one of the two necessary activation cards. Whether they were able to pass it to the datacenter over a secure / semi-secure line or flew back to hand-deliver the combination I do not remember.

If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the key in the safe on the other continent."

If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.


> I can't fathom how they didn't plan for this

Maybe because they were planning for a million other possible things to go wrong, likely with higher probability than this. And busy with each day's pressing matters.


Anyone who has actually worked in the field can tell you that a deploy or config change going wrong, at some point, and wiping out your remote access / ability to deploy over it is incredibly, crazy likely.


That someone will win the lottery is also incredibly likely. That a given person will win the lottery is, on the other hand, vanishingly unlikely. That a given config change will go wrong in a given way is ... eh, you see where I'm going with this


Right, which is why you just roll in protection for all manner of config changes by taking pains to ensure there are always whitelists, local users, etc. with secure(ly stored) credentials available for use if something goes wrong; rather than assuming your config changes will be perfect.


I'm not sure it's possible to speculate in a way which is generic over all possible infrastructures. You'll also hit the inevitable tradeoff of security (which tends towards minimal privilege, aka single points of failure) vs reliability (which favours 'escape hatches' such as you mentioned, which tend to be very dangerous from a security standpoint).


Absolutely, and I'd even call it a rite of passage to lock yourself out in some way, having worked in a couple of DCs for three years. Low-level tooling like iLO/iDRAC can sure help out with those, but is often ignored or too heavily abstracted away.


A config change gone bad?

That’s like failure scenarios 101. That should be the second on the list, after “code change gone bad”.


Exactly! Obviously they have extremely robust testing and error catching on things like code deploys: how many times do you think they deploy new code a day? And at least in my experience as a user, their error rate is somewhere below 1%.

Clearly something about their networking infrastructure is not as robust.


Right? Especially on global scale. Something doesn't add up!


Curious/unfortunate timing. The day after a whistleblower docu and with a long list of other legal challenges and issues incoming.


Haha, sure. They were too busy implementing PHP compilers to figure out that "whole DR DNS thing".

rotflmao. I'd remove Facebook from my resume.


Most likely they did plan for this. Then, something happened that the failsafe couldn't handle. E.g. if something overwrites /etc/passwd, having a local user won't help. I'm not saying that specific thing happened here -- it's actually vanishingly unlikely -- but your plan can't cover every contingency.


Agreed. It's also worth mentioning that at the end of every cloud is real physical hardware, and that is decidedly less flexible than cloud; if you locked yourself out of a physical switch or router, you have many fewer options.


In risk management cultures where consequences from failures are much, much higher, the saying goes that “failsafe systems fail by failing to be failsafe”. Explicit accounting for scenarios where the failsafe fails is a requirement. Great truths of the 1960s to be relearned, I guess.


Another Monday morning at a boring datacenter job; I bet they weren't even there yet at 8:30 when the phones started ringing.


You mean the VOIP phones that could no longer receive incoming calls?


Assuming anyone can actually look up the phone numbers to call.


There should be 24/7 on-site rotations. I wonder if physical presence was cut on account of COVID?


phones? how lame.


It certainly wasn't the Messenger.


Phones - the old, analogue, direct cable ones - were self-sustaining, and kept running even when there was a power cut in the house.


yes, indeed. Reliability. That's so 20th century. #lame.

(Actually not lame at all in my eyes)


This sounds like something that might have been done with security in mind. Although generally speaking, remote hands don't have to be elite hackors.


Have you ever tried to remotely troubleshoot THROUGH another person?!


My company runs copies of all our internal services in air-gapped data centers for special customers. The operators are just people with security clearance who have some technical skills. They have no special knowledge of our service inner workings. We (the dev team) aren’t allowed to see screenshots or get any data back. So yeah, I have done that sort of troubleshooting many times. It’s very reminiscent of helping your grandma set up her printer over the phone.


And this is why we should build our critical systems in a way that can be debugged on the phone... With your grandma.


We try to write our ops manuals in a way that my grandma could follow but we don’t always succeed. :)


For all the hours I spent on the phone spelling grep, ls, cd, pwd, raging that we didn't keep nano instead of fucking vim (and I'm a vim person)... I could have stayed young and been solving real customer problems, not imperium-typing on a fucking keyboard with a 5s delay because a colleague is lost in the middle of nowhere and can't remember what file he just deleted and the system doesn't start anymore. Your software is fragile, just shite.


Yes. Depending on the person, it can either go extremely well or extremely poorly. Getting someone else to point a camera at the screen helps.


Yes, and it works if both parties are able to communicate using precise language. The onus is on the remote SME to exactly articulate steps, and on the local hands to exactly follow instructions and pause for clarifications when necessary.


Yeah. Do what you have to.

Sometimes the DR plan isn't so much that I have to have a working key; I just have to know who gets there first with a working key, and break glass might be literal.


Not OP, but many times. Really makes you think hard about log messages after an upset customer has to read them line by line over the phone.

One was particularly painful, as it was a "funny" log message I had added to the code for when something went wrong. Lesson learned was to never add funny / stupid / goofy fail messages in the logs. You will regret it sooner or later.


folks with physical access are also denied. source - https://twitter.com/YourAnonOne/status/1445100431181598723


FWIW that's not the original source, just some twitter account reposting info shared by someone else. See this sub-thread: https://news.ycombinator.com/item?id=28750888


IT: "Please do this fix."

Person 1: "I can't, I don't have physical access."

IT: "Please do this fix."

Person 2: "I can't, I don't have digital access."

Why? It's [IT's?] policy.


Let me guess, it is tied to FB systems which are down. That would be hilarious.


This is not new; this is everyday life with helping hands, on-duty engineers, L2-L3 levels telling people with physical access which commands to run, etc.


Then you have security issues like this, where someone impersonates a client with helping hands and drains your exchange's hot wallet:

https://www.huffpost.com/archive/ca/entry/canadian-bitcoins-...


The places I've seen this at had specific verification codes for this. One had a simple static code per person that the hands-on guys looked up in a physical binder on their desk. Very disaster proof.

The other ones had a system on the internal network in which they looked you up, called back on your company phone and asked for a passphrase the system showed them. Probably more secure but requires those systems to be working.


This is not a real datacenter case but ordinary social engineering. On the datacenter side you have many more security checks, plus most of the time the helping hands and engineers are part of the same company, using internal communication tools, etc., so they are on the same logical footprint anyhow.


Telecommunication satellite issues might seriously shut down whole regions if they occur.


I don't think so. I bet nobody is ever going to make that mistake at FB again after today.


I think it's the same with supply chains.


It just bit FB.


like today! xD


> Even in the biggest of organizations, they still have to wait for somebody to race down to the datacenter and plug his laptop into a router.

I love this comment.


Imagine having a huge portion of the digital world internationally riding on your shoulders...


Imagine that guy has this big npm repository locally, with all those dodgy libraries of uncontrolled origin, in their /lib/node_modules with root permissions.

Wait, we all do, here.


You can use a custom npm prefix to avoid the mess you're describing. So basically:

See the current prefix:

    npm config get prefix

Set the prefix to something you can write to without sudo:

    npm config set prefix /some/custom/path
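A minimal sketch of the whole flow, assuming a prefix under your home directory (~/.npm-global is just an example name); npm then puts global binaries under the prefix's bin/ and packages under lib/node_modules, so no root is needed:

    # pick a prefix you own
    npm config set prefix ~/.npm-global

    # make globally installed binaries reachable
    export PATH="$HOME/.npm-global/bin:$PATH"

    # installs now land in ~/.npm-global/lib/node_modules, no sudo required
    npm install -g some-package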


For something as distributed as Facebook, do multiple somebodys all have to race down to each individual datacenter and plug their laptops into the routers?

As someone with no experience in this, it sounds like a terrifying situation for the admins...


Interesting that they published stuff about their BGP setup and infrastructure a few months ago - maybe a little tweak to rollbacks is needed.

"... We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates..."


    # todo: add rollbacks


Surely Facebook don't update routing systems between data centres (IIRC the situation) when they don't have people present to fix things if they go wrong? Or have an out-of-band connection (satellite, or dial-up (?), or some other alternate routing?).

I must be misunderstanding this situation here.

[Aside: I recall updating wi-fi settings on my laptop and first checking I had direct Ethernet connection working ... and that when I didn't have anything important to do (could have done a reinstall with little loss). Is that a reasonable analogy?]


Move fast and break . . . <NO CARRIER>


> don't update routing systems between data centres (IIRC the situation) when they don't have people present

Ha. You put too much faith into people.


Wondering how Facebook communicates internally now - most of their work streams likely depend on Facebook's systems, which are all down.

Can engineers and security teams even access prod systems anymore? Like, would "Bastion" hosts be reachable?

Wonder if they use Signal and Slack now?


There are various non-FB fallback measures, including IRC as a last-ditch method. The IRC fallback is usually tested once a year for each engineer.


I just heard from a contact that the fallback/backup IRC is also down.


Bet it was located at irc.facebook.com ;)

Joking aside, I can see how an IRC network has potential to be used in these situations. Maybe FAMANG should work together to set something like this up. The problem is, a single IRC server is not fail safe, but a network of multiple servers would just see a netsplit, in which case users would switch servers.

Also, I remember back in the IRCnet days simply using telnet to connect to IRCnet just for fun and sending messages, so it's a very easy protocol that can be understood in a global disaster scenario (just the PING replies were annoying in telnet).
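For reference, the raw protocol really is simple enough to drive by hand over telnet; a rough sketch against a hypothetical server (the nick, channel, and hostname are made up):

    telnet irc.example.net 6667
    NICK outage_oncall
    USER outage_oncall 0 * :Oncall Engineer
    JOIN #warroom
    PRIVMSG #warroom :hello from telnet
    (when the server sends "PING :12345", answer with)
    PONG :12345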


I heard the same thing from my old coworker who is at FB currently. All of their internal DNS/logins are broken atm so nobody can reach the IRC server. I bet this will spur some internal changes at FB in terms of how to separate their DR systems in the case of an actual disaster.


Good planning! Now, where does the IRC server live, and is it currently routable from the internet?

While normally I know the advice is "Don't plan for mistakes not to happen, it's impossible, murphy's law, plan for efficient recovery for mistakes"... when it comes to "literally our entire infrastructure is no longer routable from the internet", I'm not sure there's a great alternative to "don't let that happen. ever." And yet, here facebook is.


Also, are the users able to reach the server without DNS (i.e. are the IP address(es) involved static and communicated beforehand), and is the server itself able to function without DNS?

Routing is one thing you can't do without (then you need to fall back to phone communications), but DNS is something that's quite likely not to work well in a major disaster.


A lot of the core 'ops like' teams at FB use IRC on a daily basis.

When I worked there, I wasn't aware of any 'test once per year' concept or directive.

Of course, FB is a really big place, so things are different in different areas.


FB uses a separate IRC instance for these kinds of issues, at least when I used to work there


I would think that their internal network would correctly resolve facebook.com even though they've borked DNS for the external world, or if not they could immediately fix that. So at least they'd be able to talk to each other.


To the communication angle, I've worked at two different BigCo's in my career, and both times there was a fallback system of last resort to use when our primary systems were unavailable.


I haven't worked for a FAANG but it would be unthinkable that FB does not have backup measures in place for communications entirely outside of Facebook.

Hmm well I mean for key people, ops and so on. Not for every employee.

Only a few people need that type of access, and they should have it ready. If they need to bring in more people, there should be an easy way to do it.

Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.


> Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.

Having worked for 2 FAANG companies, I can tell you most core services like FB Messenger would be built on internal database services, which would be ineffective in a case like this because they would not work; and the engineering cost of designing them to support an external database would be a lot more than just paying for, like, 5 different external backup products for your SRE team.


Facebook does use IRC and Zoom as a fallback.


Actually, in this situation: Discord.


If they planned ahead, they should have had their oncalls practice on the backup systems (like Signal/Slack/Zoom) before now.


My team set up a discord lol


Don't they have a separate instance for internal communications?


"I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally."

Hmm, could be a UI/UX bug then :)


Seems odd to not have a redundant backdoor on a different network interface. Maybe that is too big of a security risk but idk.


You know how after changing resolution and other video settings you get a popup "do you want to keep these changes?" with a countdown and automatic revert in case you managed to screw up and can't see the output anymore?

Well, I wonder why a router that gets a config update but then doesn't see any external traffic for 4 hours doesn't just revert back to the last known good config...
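That pattern does exist in network gear; Junos, for instance, has a commit-confirmed mode that automatically rolls the candidate config back unless it is confirmed in time (a generic sketch, not a claim about FB's in-house tooling):

    [edit]
    user@router# commit confirmed 10    # activate, but auto-rollback after 10 minutes

    ...verify reachability, BGP sessions, etc. from outside...

    user@router# commit                 # confirm, making the change permanent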


So, does anyone know where one can buy an LTE gateway with a serial port interface? Asking for a friend.


Our security team complained that we have some services, like monitoring or SSH access to some jump hosts, accessible without a VPN, because a VPN should be mandatory for access to all internal services. I'm afraid once we comply we could be in a similar situation to the one Facebook is in now...


But you have two independent VPNs, using different technologies on different internet handoffs in very different parts of your network, right?


Fundamentally, how is a 2nd independent VPN into your network a different attack surface than a single, well-secured ssh jumphost? When you're using them for narrow emergency access to restore the primary VPN, both are just "one thing" listening on the wire, and it's not like ssh isn't a well-understood commodity.
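For what it's worth, "well-secured jumphost" usually means key-only auth plus tightly scoped forwarding; a rough sshd_config sketch (host names, users, and the 10.0.0.1 management address are all made up):

    # /etc/ssh/sshd_config on the jump host (illustrative policy)
    PasswordAuthentication no
    PermitRootLogin no
    AllowUsers oncall1 oncall2
    X11Forwarding no
    AllowAgentForwarding no
    # only allow tunnelling onward to the management address
    AllowTcpForwarding yes
    PermitOpen 10.0.0.1:22

Clients would then hop through it with something like `ssh -J oncall1@jump.example.net admin@10.0.0.1`.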


Zero day sshd vulnerability would be bad.

On the other hand if you had to break through wireguard first, and then go through your single well-secured bastion, you'd not only be harder to find, you'd have two layers of protection, and of course you tick the "VPN" box


A VPN can also have a zero day, and it seems about as likely?


But if your vpn has a zero day, that lets you get to the ssh server. It's two layers of protection, you'd have to have two zero days to get in instead of one.

You could argue it's overkill, but it's clearly more secure


Only if the VPN means you have a VPN and a jump box. If it's "VPN with direct access to several servers and no jump box" there's still only one layer to compromise.


Still wouldn't help if your configuration change wipes you clear off the Internet like Facebook's apparently has. The only way to have a completely separate backup is to have a way in that doesn't rely on "your network" at all.


Your OOB network wouldn't be affected by changes to your main network


These are readily available, OpenGear and others have offered them forever. I can't believe fb doesn't have out of band access to their core networking in some fashion. OOB access to core networking is like insurance, rarely appreciated until the house is on fire.


It's quite possible that they have those, but that the credentials are stored in a tool hosted in that datacenter or that the DNS entries are managed by the DNS servers that are down right now.


You are probably right but if that is the case, it isn't really out of band and needs another look. I use OpenGear devices with cellular to access our core networking to multiple locations and we treat them as basically an entirely independent deployment, as if it is another company. DNS and credentials are stored in alternate systems that can be accessed regardless of the primary systems.

I'm sure the logistics of this become far more complicated as the organization scales but IMHO it is something that shouldn't be overlooked, exactly for outlier events like this. It pays dividends the first time it is really needed. If the accounts of ramenporn are correct, it would be paying very well right now.

Out of band access is a far more complicated version of not hosting your own status page, which they don't seem to get right either.


Facebook is likely scrambling private jets as we speak to get the right people to the right places.


Reminds me of that episode in Mr Robot


The cost of the downtime would be


Facebook 2021 revenue is around $100B. That’s $11M an hour. Since it’s peak hour for ad printing, one can assume double or triple this rate.

They are already looking at > $100M in ad loss, not counting reputation damage etc.
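Rough back-of-the-envelope, assuming revenue is spread evenly across the year (it isn't, hence the peak-hour multiplier above):

    $100B / (365 days x 24 h) = $100B / 8,760 h ≈ $11.4M per hour
    at 2-3x for peak hours: roughly $23M-$34M per hour
    over a multi-hour outage: easily north of $100M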


Think of all the influencers who can’t influence and FB addicts who can’t get their fix (+insta and whatsapp)


This tweet seems to confirm it is a BGP issue...

https://twitter.com/GossiTheDog/status/1445063880963674121?s...


Cloudflare also confirmed it:

https://twitter.com/jgrahamc/status/1445068309288951820

Also, the Domain name is for sale???

https://whois.domaintools.com/facebook.com


Weird banner at the top, seems like false advertising as it says a couple lines down: Expires on 2030-03-29


I suspect it's an automated system triggered by DNS not resolving, and they try to "make an offer" if you follow through.


You're right, it's misleading, thanks. Other sites (dreamhost, godaddy) don't list it as for sale.


Just imagine the amount of stress on these people; hope the money is really worth it.


It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.

It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.


> It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck.

As someone who formerly did Ops for many many years... this is not accurate. Even in a well organized company there are usually stakeholders at every level on IM calls so that they don't need to play "telephone" for status. For an incident of this size, it wouldn't be unusual to have C-level executives on the call.

While those managers are mostly just quietly listening in on mute if they know what's good (e.g. don't distract the people doing the work to fix your problem), their mere presence can make the entire situation more tense and stressful for the person banging keyboards. If they decide to be chatty or belligerent, it makes everything 100x worse.

I don't envy the SREs at Facebook today. Godspeed fellow Ops homies.


I think it comes down to the comfort level of the worker. I remember when our production environment went down. The CTO was sitting with me just watching and I had no problem with it since he was completely supportive, wasn't trying to hurry me, just wanted to see how the process of fixing it worked. We knew it wasn't any specific person's fault, so no one had to feel the heat from the situation beyond just doing a decent job getting it back up.


C levels don't sit on the call with engineers. They aren't that dumb. Managers will communicate upward.


That greatly depends on the incident and the organization. I’ve personally been on numerous incident calls with C-level folks involved.


Yeah hell, I've ended up with one of the big names as my comms lead.

That in itself was stressful, and became an example case later.


"it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts"

Well, you'd be surprised about how one person can bring everything down and/or save the day at Facebook, Cloudflare, Google, Gitlab, etc. Most people are observers/cheerleaders when there is an incident.


> Most people are observers/cheerleaders when there is an incident.

Yeah, a typical fight/flight response.


Or most people simply don't have anything useful to add or do during an incident.


Taking all the available slots in the massive GVC warroom ain't much... but it's honest work.


Well, individuals will still stress, if anything due to the feeling of being personally responsible for inflicting damage.

I know someone who accidentally added a rule 'reject access to * for all authenticated users' in some stupid system where the ACL ruleset itself was covered by this *, and this person nearly collapsed when she realized even admins were shut out of the system. It required getting low-level access to the underlying software to reverse engineer its ACLs and hack into the system. Major financial institution. Shit like that leaves people with actual trauma.

As much as I hate FB, I really feel for the net ops guys trying to figure it all out, with the whole world watching (most of it with schadenfreude).


As one of the major responders to an incident analogous to this at a different FAANG... you're high, it's still hella stressful.


> It shouldn't be too stressful. (...) it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck

Earlier comment mentioned that there is a bottleneck, and that people who are physically able to solve the issue are few and that they need to be informed what to do; being one of these people sounds pretty stressful to me.

"but the people with physical access is separate (...) Part of this is also due to lower staffing in data centers due to pandemic measures", source: https://news.ycombinator.com/item?id=28749244


Sure, but that's what conference calls are for.

Most big tech companies automatically start a call for every large scale incident, and adjacent teams are expected to have a representative call in and contribute to identifying/remediating the issue.

None of the people with physical access are individually responsible, and they should have a deep bench of advice and context to draw from.


I'm not an IT Operations guy, but as a dev I always thought it was exciting when the IT guys had the destiny of the firm on their shoulders. It must be exciting.


You tend not to think about it…

Most teams that handle incidents have well documented incident plans and playbooks. When something major happens you are mostly executing the plan (which has been designed and tested). There are always gotchas that require additional attention / hands but the general direction is usually clear.


>Well-managed companies

To what extent does this include Facebook?


> Well-managed companies blame processes rather than people,

We're six hours without a route to their network, and counting. I think we can safely rule out well-managed.


> Well-managed companies blame processes rather than people

I feel like this just obfuscates the fact that individuals are ultimately responsible, and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee. (Not talking about this Facebook incident in particular, but as a generalisation: not attributing individual fault allows faulty employees to thrive at the expense of more qualified ones).


> this just obfuscates the fact that individuals are ultimately responsible

In critical systems, you design for failure. If your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.

This goes by many names, like the Swiss cheese model[0]. It's not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.

[0] https://en.wikipedia.org/wiki/Swiss_cheese_model


> is that no one ever makes a mistake

This isn't what I'm saying, though. The thought I'm trying to express is that if no individual accountability is done, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).

Think of the difference between someone who always triple-checks every parameter they input and someone who never double-checks and just wings it. Sure, the person who triple-checks will still make mistakes, but fewer than the other person. This is the issue I'm trying to get at.


If someone is sloppy and not willing to change, he should be shown the door, but not because he caused an outage; rather because he is sloppy.

People who operate systems under fear tend to do stupid things like covering up innocent actions (deleting logs), keeping information instead of sharing it, etc. Very few can operate complex systems for a long time without making a mistake. An organization where the spirit is "oh, outage, someone is going to pay for that" will never be attractive to good people, and will have a hard time adapting to changes and adopting new tech.


> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.

If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)


If there is an incident because an employee was sloppy, the fault lies with the hiring process, the evaluation process for this employee, or with the process that should have put four eyes on each implementation. The employee fucked up, and they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.


If you think about it, it isn't very useful to find a person who is responsible. Suppose someone causes an outage or harm, due to neglect or even bad intentions: either the system is set up in a way that a single person couldn't cause the outage, or sooner or later it will go down. To build a truly resilient system, especially on a global scale, there should never be an option for a single person to bring down the whole system.


By focusing on the process, lessons are learned and systems are put in place which leads to a cycle of improvement.

When individuals are blamed instead, a culture of fear sets in and people hide / cover up their mistakes. Everybody loses as a result.


I don't think the comment you're replying to applies to your concern about subpar employees.

We blame processes instead of people because people are fallible. We've spent millenia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.


Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes, the triple-checker will make mistakes too, but way fewer than the person who puts less effort in.


But that has nothing to do with blaming processes vs people.

If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.

If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?

If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.

That is why you blame the process.


Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.

But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.

You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.

Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.


I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.

To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.

If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.

To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.

What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.

But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.


The issue is that this view always relies on stuff like "make people triple check everything".

- How does that relate to making a config change?

- How do you practically implement a system where someone has to triple check everything they do?

- How do you stop them just clicking 'confirm' three times?

- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?

I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.

And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?

This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.

(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)


> and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee.

Not really, their incompetence is just noticed earlier at the review/testing stages instead of in production incidents.

If something reaches production that's no longer the fault of one person, it's the fault of the process and that's what you focus on.


The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quick in this case...


Exactly, the primary focus in situations like this, is to ensure that no one feel like they are alone, even if in the end it is one person who has to type in the right commands.

Always be there, help them double check, help monitor, help make the calls to whomever needs to be informed, help debug. No one should ever be alone during a large incident.


This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.


Well, those comments have been deleted now... I guess someone's boss didn't like the unofficial updates going out? :)


Also, equally important to note: there was a massive exposé on Facebook yesterday that is reverberating across social media and news networks, and today, when I tried to make a post including the tag #deletefacebook, my post mysteriously could not be published and the page refreshed, mysteriously wiping my post...

This is possibly the equivalent of a corporate watergate if you ask me... Just my personal opinion as a developer though... Not presented as fact... But hrmmm.


So what you're saying is facebook... deleted itself?

The singularity is happening. It realized it would end society, so it ended itself.


They decided that they publish too much misinformation and self censored ;)


This reminds me the last time the singularity nearly happened.

https://google.com/search?q=google

I beg you, don't go there.


Archived version: https://archive.is/QvdmH



The Reddit post is down but not before it was archived: https://archive.is/QvdmH and https://archive.is/TNrFv


User has now deleted the update.


I am sure this is not what they specifically mean by fail fast and break things often.


> Reddit r/Sysadmin user that claims to be on the "Recovery Team"

They have time to make public posts, and think it's a good idea?

Sure, I'm on the 'Recovery Team' too! How about you?


If it's anything like my past employers, they probably have a lot of time. They probably also got in a lot of trouble.

When we'd have situation bridges put in place to work a critical issue, there would usually be 2-3 people who were actively troubleshooting and a bunch of others listening in, there because "they were told to join" but with little-to-nothing to do. In the worst cases, there was management there, also.

Most of the time I was one of the 2 or 3 and generally preferred if the rest of them weren't paying much attention to what was going on. It's very frustrating when you have a large group of people who know little about what's going on injecting their opinions while you're feverishly trying to (safely) resolve a problem.

It was so bad that I once announced[0] to a C-Level and VP that they needed to exit the bridge immediately, because the discussion devolved into finger-pointing. All of management was "kicked out". We were close to solving it but technical staff was second-guessing themselves in the presence of folks with the power to fire them. 30 minutes later we were working again. My boss at the time explained that management created their own bridge and the topic was "what to do about 'me'", which quickly went from "fire me" to "get them all a large Amazon gift card". Despite my undiplomatic handling of the situation, that same C-Level negotiated to get me directly beneath them during a reorganization about six months later, and I stayed in that spot for years with a very good working relationship. One of my early accomplishments was to limit any of management's participation in situation bridges to once/hour, and only when absolutely necessary, for status updates assuming they couldn't be gotten any other way (phones always worked, but the other communication options may not have).

[0] This was the 16th hour of a bridge that started at 11:00 PM after a full work day early in my career -- I was a systems person with a title equivalent to 'peon', we were all very raw by then and my "announcement" was, honestly, very rude, which I wasn't proud of. Assertive does not have to be rude, but figuring out the fine line between expressing urgency and telling people off is a skill that has to be learned.


Uh oh that user deleted their account. Hope they are OK.


Looks like those updates have now been deleted


Comment now seems to be deleted by user.


That reddit comment has been deleted.


he started deleting the comments


For Facebook and WhatsApp it looks like a DNS issue, name resolution fails with SERVFAIL:

    $ dig facebook.com

    ; <<>> DiG 9.16.21 <<>> facebook.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23982
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;facebook.com.   IN A

    ;; Query time: 16 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)
    ;; WHEN: Mon Oct 04 17:53:00 CEST 2021
    ;; MSG SIZE  rcvd: 41


John Graham-Cumming:

>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.

https://twitter.com/jgrahamc/status/1445065270272434176 (thread)

UPD

>About five minutes before Facebook's DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook's ASN.

https://twitter.com/jgrahamc/status/1445068309288951820


Maybe they tried everything else before that.

At first it was working but they couldn't serve responses: https://i.imgur.com/UaCtOiX.png

Notice the "2020"


The servers struggle to reply with even a basic 5xx answer.

Two possibilities:

- the DNS services internally have issues (most likely, as this could explain the snowball effect)

- it could also be a core storage issue that all their VMs rely on; if they expect it to last a long time and don't want to block third-party websites, they may prefer to answer nothing in DNS for now (so clients fail instantly, draining the application/database servers so they can come back up with less load)


I was on a video call during the incident. The service was working but with super-low bandwidth for 30 minutes, then I got disconnected and every FB property went down suddenly. Seems more suggestive of someone pulling the plug than a DNS issue, although it could also be both.


It isn't just DNS. If you happen to have cached entries, the site is returning errors as well.


Presumably the DNS being down also wreaks havoc in their internal infrastructure as services can no longer resolve each other's names.


I wonder if Facebook has circular 'boot' dependencies on their microservices or something? I.e. they can't restart stuff now when everything is down.


For sure. Reminds me of the difficulties of starting a power grid from total blackout, bringing generators and power stations into sync...


Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
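
To make the failure mode concrete, here's a minimal sketch (hypothetical service names, nothing Facebook-specific) of the kind of boot-order check that surfaces such a cycle before a cold start bites:

    # toy dependency graph: service -> services it needs at boot (made-up names)
    deps = {
        "web":    ["auth", "cache"],
        "auth":   ["db"],
        "cache":  ["config"],
        "db":     ["config"],
        "config": ["dns"],
        "dns":    ["config"],   # oops: config needs dns, dns needs config
    }

    def find_cycle(graph):
        """Return one dependency cycle as a list of services, or None."""
        WHITE, GREY, BLACK = 0, 1, 2
        state = {s: WHITE for s in graph}
        stack = []

        def visit(node):
            state[node] = GREY
            stack.append(node)
            for dep in graph.get(node, []):
                if state.get(dep, WHITE) == GREY:        # back edge => cycle
                    return stack[stack.index(dep):] + [dep]
                if state.get(dep, WHITE) == WHITE:
                    cycle = visit(dep)
                    if cycle:
                        return cycle
            stack.pop()
            state[node] = BLACK
            return None

        for service in graph:
            if state[service] == WHITE:
                cycle = visit(service)
                if cycle:
                    return cycle
        return None

    print(find_cycle(deps))   # -> ['config', 'dns', 'config']

The hard part isn't the graph walk, of course; it's getting every team to declare their boot dependencies honestly in the first place.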


This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).

The most successful systems rely on the property of feedback (https://en.wikipedia.org/wiki/Feedback): evolution, untrained learning, genetic algorithms, the diagonal arguments (https://en.wikipedia.org/wiki/Diagonal_argument), artificial general intelligence (https://en.wikipedia.org/wiki/Technological_singularity), financial markets according to no less than George Soros (https://en.wikipedia.org/wiki/Reflexivity_(social_theory)#In...), etc.

That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to do a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until it goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).


Internal services using public dns records?


Probably not, but their external and internal DNS may share infrastructure that's at the root of the failure


Yikes, seems like an easy redundancy split.


It seems like an easy redundancy split, but imagine driving two cars down the freeway at the same time because you once got a flat tire in one.

In order to actually be redundant you need two sets of infrastructure to serve, and if the internal one goes down, the external one is basically useless when internal resolution is down anyway. Capacity planning (because you're inside Facebook and can't pretend that all data centers everywhere are connected via an infinitely fast network) becomes twice as much work. How you do updates for a couple thousand teams isn't trivial in the first place; now you have to cordon them off appropriately too.

I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.
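
For what it's worth, the toy version of that split is genuinely just a few lines; a minimal unbound sketch (made-up internal zone and hosts) that keeps internal names resolvable even when the public authoritative servers are unreachable:

    # unbound.conf fragment -- answer internal names from local data
    server:
        local-zone: "corp.example." static
        local-data: "db1.corp.example. 300 IN A 10.0.0.5"
        local-data: "cache1.corp.example. 300 IN A 10.0.0.6"

The point above stands, though: doing this for thousands of constantly changing services across many data centers is a very different problem from doing it for two boxes.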


Yes, all of that (imo) is an argument in favor.

I never said it was free, but it's worth it as long as it's cheaper than failure.

I don't keep backups because I enjoy having multiple copies of my data. I do it because losing that data would be devastating.


agreed, they fell off the internet according to routeviews


I'm seeing similar DNS errors for many non-Facebook sites.


My ISP's DNS server went down a few minutes after the Facebook outage, presumably because all the residential customers' devices keep querying.


Seeing the same thing with 8.8.8.8 name servers. Everything I query returns an error


Do you have some examples?


I am getting DNS fails for wikipedia


Wikipedia wfm.


wfm


normashooting.com - but only when, like the parent poster, using Google's DNS servers. Just switched to Cloudflare and it works.

Using Google DNS:

    $ nslookup
    > normashooting.com
    Server:   8.8.8.8
    Address:  8.8.8.8#53

    * server can't find normashooting.com: SERVFAIL

Using Cloudflare DNS servers:

    > normashooting.com
    Server:   1.1.1.1
    Address:  1.1.1.1#53

    Non-authoritative answer:
    Name:     normashooting.com
    Address:  104.22.56.165
    Name:     normashooting.com
    Address:  104.22.57.165
    Name:     normashooting.com
    Address:  172.67.43.70


Can't log in to the AWS console either.


aws.amazon.com is down as well



Even the name servers are not returning any values. That's bad.

    dig @8.8.8.8 +short facebook.com NS

These are usually anycasted, meaning that one IP returned in the NS records is in fact several servers spread across several regions; queries are routed to the closest one through agreements with ISPs via the BGP protocol. Very interesting, because it seems it took one DNS entry misconfiguration to withdraw millions of dollars' worth of devices from the internet.
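
One way to distinguish "resolvers are choking" from "the authoritative servers themselves are unreachable" is to list the NS set and then query one of Facebook's name server addresses directly with recursion disabled (129.134.30.12 is one of the four addresses mentioned elsewhere in this thread; the ns names themselves won't resolve right now):

    $ dig @8.8.8.8 +short facebook.com NS
    $ dig @129.134.30.12 facebook.com A +norecurse

If the second query times out, the problem is reachability of the authoritative servers, not a bad record.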



It's always DNS


>It's always DNS

How is this not the top comment? Underrated


Even Google's 8.8.8.8 DNS server says can't find, SERVFAIL.


Is this related in any way to what happened to Slack recently in their DNS?



So far the pattern isn't the same. Slack published a DNSSEC record that got cached and then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case, the records are just completely gone. As if "facebook.com", "instagram.com", et al just didn't exist.


Thank god we have DoH.


It's DNS over HTTPS. It relies on the same system as plain DNS, so DoH won't really help in this case...
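
DoH changes the transport between you and the recursive resolver, not where the answers come from. A quick way to see this is to ask a DoH resolver directly (sketch assuming Cloudflare's JSON endpoint); it still has to reach Facebook's authoritative servers to produce an answer:

    $ curl -s -H 'accept: application/dns-json' \
        'https://cloudflare-dns.com/dns-query?name=facebook.com&type=A'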


Same here on facebook.com , [api]whatsapp.com (instagram.com works)


Is it just me or HN also feels kinda laggy?


General tip: If HN is being laggy and you're determined you want to waste some time here, open it in a private window. HN works extremely quickly if it doesn't know who you are.


Wow this really works, thank you. What actually is the reason for it being much faster in a private window? Is there so much tracking going on in a normal window?


It's faster because the pages are cached; they are effectively static. It's slower when logged in because the pages are created dynamically, since they have your username, favourites, upvotes etc., and much of that cannot be quickly cached.


Honestly surprised that HN, a website for techies, is so poorly coded. For example, the whole lack of proper "paging", with dang posting a disclaimer on every large thread for honestly over a year at this point and no progress. Or the fact that if you want to reply to a comment, it has to load a whole new page (which has to fetch more data from the server?) before getting to the comment box. Until recently, trying to collapse a large comment thread would also take 3s+ as I think it individually set the collapse state of every sub-comment?


The whole thing was put together in a somewhat obscure dialect of Lisp over 15 years ago. There are probably under 100 people who write Arc regularly enough to meaningfully contribute, so the general approach has been to not fix what's not broken.


This is not a very complex website, any HN reader could probably whip up a replacement from scratch over a weekend.

I guess there do exist many alternative UIs, though I don't see many that support commenting. I wonder if the "API" (if there is one) allows for that, or if people are just scraping the page and reformatting it.


Not to argue, just to post a contrasting view: while FB, and a lot of the internet, failed or slowed today—and I know there were tons of reports of HN slowing too—I also experienced a phone death and attempted to hobble along by putting my SIM back in my old iPhone 5. Basically the only thing that worked was HN. In fact it loaded as quickly as I’d expect.

There’s plenty of stuff I’d like to be different about usability of this site, but perf is basically at the bottom of that list.


Most of the things I listed weren't really perf related, though they do show up when there's perf issues. Being able to "load more comments" and reply inline are super basic usability features. There's no reason why I'd need to navigate to a whole different page with a textbox, then navigate back and lose my position every time I post a comment.


One of the first optimizations large/high-traffic sites do is cache pages for logged-out users. Even if the cache is only valid for a minute, that's still a huge reduction in server traffic.

The cache is faster because it's not having to talk to the database, and it can be handled by the load-balancing layers rather than the actual application layer.

Wikipedia does this too (although via a layer that adds the IP talk-page header back on).
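
A minimal sketch of the pattern (nginx assumed; the "user" cookie name and "app_backend" upstream are placeholders): serve anonymous traffic from a short-lived cache and only hit the application for logged-in users.

    # cache anonymous page views for 1 minute; bypass for session-cookie holders
    proxy_cache_path /var/cache/nginx keys_zone=anon:10m inactive=5m;

    server {
        location / {
            proxy_cache        anon;
            proxy_cache_valid  200 1m;
            proxy_cache_bypass $cookie_user;   # logged in: fetch a fresh page
            proxy_no_cache     $cookie_user;   # logged in: don't store it either
            proxy_pass         http://app_backend;
        }
    }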


You can also just log out instead of opening a private window. Users that aren't logged in are served cached pages.


Could they offer cached pages to logged-in users as an optimization? You only need to invalidate when a user posts a comment; most of the time you are reading, not commenting.


This would make for a very good deep-dive technical discussion in an interview setting, I'm using this.


That explains why it works fine in my computer, where I haven’t logged in. Thanks for the tip.


This works like charm. Thank you!


I can confirm, HN, GitHub and Slack are very slow for me as well. Google is very fast, on the other hand.


All running their DNS on AWS. My guess is that AWS is seeing a massive flood of failed and retried DNS requests for facebook properties, similar to what jgrahamc mentions here for Cloudflare: https://twitter.com/jgrahamc/status/1445066136547217413


Is there a "Kessler syndrome" analogue for the internet, where failures beget failures until it's just an impenetrable cloud of fail, forever?


There's such a thing called the "Thundering Herd" problem, that partially matches.

From wiki: the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win.
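
The standard client-side mitigation is retrying with exponential backoff plus jitter, so that millions of waiting clients don't all hammer the service in lockstep the moment it comes back. A generic sketch (not anything Facebook actually ships):

    import random, time

    def call_with_backoff(fn, max_tries=8, base=0.5, cap=60.0):
        """Retry fn(), sleeping up to base * 2^attempt seconds (full jitter) between tries."""
        for attempt in range(max_tries):
            try:
                return fn()
            except Exception:
                if attempt == max_tries - 1:
                    raise
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))   # jitter spreads the herd out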


Until someone smashes the "SEND MOAR SERVERS" button.


I can't see how this is the reason for HN to take 10 seconds for the response of the main page (I mean, the URL fetched from the address bar, not the subrequests the page does), as everything else downloads immediately.

The DNS entries should be cached by the browser (and the middleware), so that this problem should only happen once, but I get this constantly.

Also, I sometimes get an error message from HN, which seems to indicate that this is some backend issue which fails gracefully with a custom "We're having some trouble serving your request. Sorry!" on top of a 502 code.

It feels more like there is something else still broken.


In the case of HN it's probably just heavier load than normal. It's much faster if you're logged out.


Dropping that many BGP routes will take a latency toll on the whole internet backbone for minutes or hours, so I'm not surprised. I wonder if the recent Let's Encrypt DST Root CA X3 expiration has something to do with the outage (some DC-internal tool/API not accessible because its certificate chain is expired, or something like that).


Also slow here. I can't see anything on the AWS Service Dashboard https://status.aws.amazon.com


In my experience, any service dashboard is useless unless the problem has been going on for so long (i.e. hours) that it is obvious something's wrong.


AWS punishes its sysadmin teams for any downtime so there is heavy incentive to not report unless there is a community shaped gun pointed at your head. This is not a universal problem.


People either have to work, creating load on GitHub, or waste their time elsewhere, creating load on HN and Slack.


People probably got more time to work.


Also slow for me also.


Probably traffic related. Lots of people reallocated to checking other sites.


Probably people flooding in to see if anyone knows why things are down. Even Google speed test was down, presumably from too many people testing if it's their internet that's at issue.



The site is working fine for me. The Speedtest CLI is also useful, though of doubtful help when DNS is down.


working for me


I'd guess that automatic processes dominate. Maybe billions of phone apps polling for facebook connectivity (FB messenger is down, for example).


A couple of years ago, an admin at Hacker News asked those of us who are just reading to log out because their system is architected in such a way that logged in users use more resources than anonymous ones. So, if you're feeling altruistic, log out of HN!


Logging out does work! Probably delivering a cache.


Internet's got a case of the Mondays for sure


Yep. I am the developer of HN client HACK for iOS and Android and a bunch of users emailed me asking if my app was broken. Looks like something bigger is afoot.


Something's wrong with your app. It's not working at all, while Harmonic works perfectly.


Harmonic most likely uses Algolia for the data whereas my app uses the HN website. So Algolia delivers a cached copy from their own servers whereas mine scrapes the HN website itself. Hence the difference. Also, logged-out pages were working much better than logged-in ones (HN delivers cached copies for logged-out users).


Best HN client app ever. Thanks for the great work!


Thank you!


I can confirm, HN, GitHub and Slack are very slow for me as well. Google is very fast, on the other hand.

EDIT: actually HN failed to post this comment the first time I posted it!


Not just you. It is very laggy on my end too.


HN lagging, BBC was also very laggy about 30 minutes ago, and 35 minutes ago our whole company got booted out of their various hangouts simultaneously apart from the people in the states.


Definitely laggy for me as well. Went to Facebook and couldn't reach it, so I came here to check in, and the load time made me think my wifi wasn't working, with two sites not opening, until HN finally loaded. Then I tried to hit reply to your post and again it seemed like it wouldn't load, then finally did. So yes, laggy; usually this is the one site that loads almost instantly.


Here too. Just had the "We're having some trouble serving your request. Sorry!" error.


With the Facebook properties down, the rest of the internet will have a significant increase in usage.


Plus I don't know about you, but I came to HN just now specifically to check if there was any insight into why it was down! The thundering herd just arrived :)


Yes, it's slow here as well, and posting this comment failed the first and second time.


Can confirm. HN, YT, Google, etc are all a bit laggy for me at the moment (eating lunch so I'm trying to entertain myself).


Yes, it's slow here as well, and posting this comment failed the first and second and third and fourth time.


This is either a hilarious accident or genius comedy.


This is not too rare when HN is being slow and giving those "We're having some trouble serving your request. Sorry!" pages.

If you get one of those on your comment submission you have no way to know if the trouble stopped it from accepting the comment or if it accepted the comment and ran into trouble then trying to display the updated thread.

For some reason I can't even begin to guess at, HN does not seem to have protection against multiple submissions of the same form, so if, after getting "We're having some trouble serving your request. Sorry!" on your comment submission, you hit refresh to display the page and the form gets resubmitted, you get a duplicate comment.
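
The usual fix is a one-shot form token combined with Post/Redirect/Get, so a refreshed re-POST is recognized and ignored. A rough sketch of the idea (Flask used only for brevity; save_comment is a hypothetical stand-in):

    import secrets
    from flask import Flask, request, session, redirect

    app = Flask(__name__)
    app.secret_key = "change-me"

    @app.route("/reply")
    def reply_form():
        token = secrets.token_urlsafe(16)
        session["form_token"] = token            # remember the expected token
        return (f'<form method="post" action="/comment">'
                f'<input type="hidden" name="token" value="{token}">'
                f'<textarea name="text"></textarea><button>reply</button></form>')

    @app.route("/comment", methods=["POST"])
    def comment():
        # each token is accepted exactly once; a refreshed re-POST no longer matches
        if request.form.get("token") != session.pop("form_token", None):
            return redirect("/item")
        save_comment(request.form["text"])       # hypothetical persistence call
        return redirect("/item")                 # Post/Redirect/Get: refresh now GETs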


Earlier today when I was getting these I went to go check the page to see if the comment was posted there. More than once it said it failed but I was able to stop myself from trying again because it was actually there.


Yes, it's slow here as well, and posting this comment failed the first and second and third time.


Yes, it's slow here as well, and posting this comment failed the first time.


It is slow for me too.


Same here. Sounds like another cloudflare-like problem.


yes, me too (I'm in south-america)


Yep, struggled to load the homepage and this


yeah lagging for me too


Some internet backbone provider is probably down itself.


Or some country has started a war.


Funny enough, I went to https://www.isitdownrightnow.com/ to check if Facebook is down, and isitdownrightnow is down itself... probably from the massive number of requests coming in to check if Facebook is down.


Seems like the perfect time to launch isisitdowndownrightnow.com.


Seems like it should be isisitdownrightnowdownrightnow.com


I've said "I've said it before, and I'll say it again" before, and I'll say "I've said it before, and I'll say it again" again.


Seems like noel already launched that one


You missed one rightnow in the middle


I personally like https://isup.me (alias of downforeveryoneorjustme.com) because it's much shorter.

isup.me/facebook gets me what I want.


Their methodology is flawed, it seems.

It says Google is down but it's not. [1]

[1] https://downforeveryoneorjustme.com/google


https://downdetector.com/ seems to be working for me at least.


It's amusing that the top 3 trending reports are the FB sites that are down, and then the mobile carriers themselves, presumably because when FB doesn't load they assume it's their mobile network's fault. People really do think FB is the internet


> People really do think FB is the internet

It is a really large part of it. Also, when people see WhatsApp and see no connection, then open Facebook and see no connection either, it's _very_ likely that the link is at fault and not Facebook.


> People really do think FB is the internet

It's the AOL of 2021


But at one point AOL was the actual internet for its subscribers.


Facebook tried to do that too.


Tried and failed?

Tried and died.


'Unusual traffic patterns detected' now



Amusingly, that returns:

> Is Facebook down right now?

> Uh oh! Something went wrong on our side. It's not you, it's us. Feel free to contact us if this persists.


Noticed the same. I started to suspect my mobile plan ran out


Which, in turn, reminds me of this paper [1] (from someone who previously worked at Facebook).

TLDR; Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

[1] Metastable Failures in Distributed Systems - https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...


yep down for me too


hugops for the engineers having to deal with this. It's incredibly stressful and I personally feel like they deserve some empathy, even if I don't like Facebook.

I wonder if maybe part of the lesson will be to run the root of your authoritative DNS hierarchy on separate infrastructure with a separate domain name. Using facebook.com as your root is cool and all but when that label disappears it causes huge issues.


There will be so many meetings over this. If PowerPoint were listed on the stock exchange, I'd say now's a good time to buy, hah.


I used to do this properly. One vanity got the better of me. Got some work to do. TGF SQL.


It's alive!

    drill @1.1.1.1 www.facebook.com
    ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 2172
    ;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;; www.facebook.com.    IN    A

    ;; ANSWER SECTION:
    www.facebook.com.    3401    IN    CNAME    star-mini.c10r.facebook.com.
    star-mini.c10r.facebook.com.    3403    IN    A    31.13.72.36


Kinda sorta. There are four DNS servers for Facebook: 129.134.30.12, 129.134.31.12, 185.89.218.12, and 185.89.219.12.

Of those, only 185.89.219.12 is up right now (Edit: all four DNS servers are now up). For people who want to add Facebook to their hosts file, the A record (IP) I’m getting right now is 157.240.11.35 (it was 31.13.70.36)
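
For anyone actually doing that, the entry is just one line (using the address quoted above, which was only valid during the outage window; remember to remove it afterwards):

    # temporary /etc/hosts entry -- remove once things settle
    157.240.11.35   facebook.com www.facebook.com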


"Sorry, something went wrong. Facebook © 2020"


Yes, even Facebook falls prey to the wrong copyright year. Anyway, I got further now to a page that says "Account Temporarily Unavailable." and has the old Facebook layout. Would love a peek inside the Facebook codebase to see how this happens, hah!


Definitely not recent, but you might find this interesting nonetheless: https://gist.github.com/nikcub/3833406


See e.g. https://www.digwebinterface.com/?hostnames=facebook.com&ns=a... for responses from different nameservers.


Nothing of value was lost


I get that this in jest, but a lot of people rely on Whatsapp and FB Messenger for messaging.


A lot of people, out of habit, rely on high fructose corn syrup for calories.


You do know that whatsapp is literally used by small businesses in 3rd world to conduct....business right?


It's a little irksome how other commentors are quick to dismiss this very valid point. SMBs in Asia aren't using WhatsApp because they've forced the platform on their consumers; it's their consumers who are using WhatsApp who've forced a choice on the SMBs. WhatsApp has very wide consumer penetration, and its use by businesses is meant as a convenience wrapper for customers.

Now, does switching from WhatsApp to some other not-very-widely-used platform cause customer engagement / retention to drop? I would wager very much so! It's a matter of priorities - people go where there is least friction, and WhatsApp otherwise provides a seamless friction-less experience.


It's a very first-world-centric point of view. I doubt some of these commentators claiming WhatsApp being down is good for society have ever been outside the first world and seen how it actually helps people in need.


But doesn't that mean it will be easy for the SMBs to move to any replacement service?


At the cost of losing customers, is my point :)


Uff, I see no reason to smile about it.


Maybe these business will diversify their communication mediums because WhatsApp is down - seems like a good thing for society.


Do you even know who these business owners are and what kind of life they live? These are the guys that don't have a solid roof over their heads, struggle to meet their daily needs and might have to sleep hungry if their day's sales weren't good. Diversifying is the least of the things they have to worry about. WhatsApp allowed them to reduce friction when it comes to communicating with customers; it helps their sales.

What might be a good thing for society in the first world doesn't mean it's necessarily good thing for society in the third world.


I reject this logic - it's an argument for sustaining the status quo at all costs.

Facebook is the most user-hostile tech megacorp, and they will inevitably harm these businesses you care about. The sooner the bandaid is ripped off the better.


I mean, sure, status quo can / should be changed - but you want to get to a point where a changed status quo is sustainable, and you're not going to get there by simply removing existing options. It doesn't change the incentives people have for preferring to use the platform, namely the pre-existing widespread penetration.

You want to dislodge Facebook, you need to disrupt it / curtail its monopoly.


Companies diversifying their communication platforms is the disruption.


Have you considered that any change done is going to mean winners and losers.

If Facebook permanently goes down then those businesses would move to a different platform.

Would it suck? Probably. Would the world be a better place without Facebook? A ton of people think so. Me included.

This is the same argument people have used when we talk about health insurance in the US being scammy. If we ever decided to address it it means a good chunk of people lose their jobs but also means that the health of this country goes up. Which one is more important?


But people moving from Facebook to another social media or messaging platform is just changing the company. That new company could do whatever things you don't like that Facebook is doing. Your example seems to mean that we move to another healthcare system as in method of implementation not just moving from one company to another.


> But people moving from Facebook to another social media or messaging platform is just changing the company.

This is not necessarily true. There are social networks and messaging systems implemented as open protocols.


That's a fair point. I'm going to make an assumption here, but those systems aren't moderated or controlled centrally? So I see three issues with these if they become popular: 1. it only increases misinformation since there's no censorship at all; 2. what is to prevent people from using the service for illegal purposes (I think this becomes a problem once services become popular); 3. finally, won't it just be banned eventually? If some court in any country issues a ruling to control it and it can't be controlled, that will just escalate, especially if it's something like pedophiles.


I've been doing a pretty good job of moving my client's communications to Signal out here.

I feel bad for everyone who relies on whatsapp bots for making stuff happen, though. These are getting really common out here for a lot of things and it always worries me that it's such a linchpin. They're really handy and save a lot of bullshit phone calls from having to be something people deal with for simple stuff like pharmacy delivery. I can get food from the local place down the street that's only really open for lunch and totally off the map for uber eats, for example... if this persists a few more hours those mom and pop type shops aren't going to have as great a day.


At the start of this year I started working for an employment service company that covers the Indo-Asia-Pacific and South American markets.

I was amazed to discover how pervasive Facebook, Inc. has become in the developing world for conducting business and navigating everyday life.

For a lot of people in developing nations such as the Philippines and Indonesia, Facebook is synonymous with the internet. This has been buoyed by their push to bundle uncapped/free data for Facebook with mobile plans in markets with high growth in mobile internet access.

It's interesting, because I'm always reading articles about how "Western teens aren't using Facebook any more", which is true, but it's also irrelevant, because they're not really a profitable market, teenagers have short attention spans and no money. Facebook's growth strategy is to become the one stop shop (in lower income nations) for everything you want and need.


In Latin American 3rd world countries, people also conduct business via Instagram.

They create Instagram accounts and post products as posts, with a caption of "DM me for price".

It also turns on every alarm on my mind, when they start calling these "Instagram pages". It blurs the line between a real website and an Instagram account (In Spanish, "website" is "página web" as well).

I've also heard: "My business went to hell because Instagram killed my account" and that's when I reply: "Have you ever thought of owning a real website?"


Maybe an event like this will spur some people into... not doing that? Yes I'm aware of the ubiquitous nature of whatsapp in many developing nations. Have also successfully got a lot of people moved onto using Signal for anything they care about.


Signal has and will go down just like facebook. Cloudflare/aws having issues affects an insanely high percentage of the internet. People still use them. Outages rarely cause anything, they happen, people move on.


El Salvador basically runs on WhatsApp. From the small food stall to CEOs and maybe even government.


> The country where I live

With your username, I think you can risk naming the country without any additional loss of privacy.


Edited :)


Unfortunately they are about to be taught a hard lesson in what "free" really means.


Big tech free services have WAY better uptime than commercial alternatives.


That's not what they meant by free.


That'll depend on the length of the outage, other tasks they can do during it, and the uptime and market penetration of any competing services.

I don't think much of a lesson is going to occur here. It'll be a brief blip that impacts few meaningfully.


Maybe that was a bit of a....mistake?


And the alternative is... ?


Email, SMS, good ol' phone calls, Signal, <insert local app/platform here>, your own website, etc, on top of whatever you use right now.

If you're in a country that relies a lot on Facebook or Whatsapp, that's where the main focus will be, but at least try to have alternatives just in case something goes wrong.


So 4/4 of those are platforms controlled by a single company or a few large corporations. This really isn't a win in any meaningful sense.

It should be fine for huge corporations to exist and provide services really efficiently at scale while also being forced to play nice and respond to the will of the people they serve.

If we collectively can't stop Facebook from doing bad thing and being bad stewards to their own platform then you won't be able to stop whatever would replace them either.


It's quite possible to run a business without WhatsApp. Lots of businesses have been doing it for quite a long time.


It was a mistake to communicate with the users on a platform that they use? Instead of trying to get them on Signal, losing 90% of leads in the process and making each of your sales cost 10x as much?


Don't businesses fall back to SMS/phone or e-mail? Doesn't seem like a good idea to rely on a single corporation.


Not to mention all the small businesses that rely on Instagram too. Here it's used as an e-commerce platform.


You need a contingency plan for when vendors go down even in 3rd world countries. It just so happens a lot of us would not mind this vendor failing entirely. It’s unfortunate that we have so little choice in the matter but ultimately the same advice holds true for all of us smugly throwing insults while keeping our billing in AWS.


He’s a HN 10xer. He doesn’t care about anyone outside his Palo Alto cold-press-Koffee-Klatch, despite what he virtue signals. It’s amusing seeing people here trip over each other to say some variety of “I don’t use Facebook.”


You do know that <insert-extremely-damaging-thing> is literally used by small businesses in 3rd world to conduct....business right?


Facebook doesn't care.


How is that a good comparison? Not everyone uses Facebook out of habit, some businesses need it, and it can be used for good things as well as bad because it's just a medium in which people post content

Yes how that content is presented, ranked, etc is controlled by Facebook but that contribution is less than the content itself.

It would be better to say it's the spoon in which someone could eat a sugary cereal or something healthy.


Are you a Facebook employee? Your justification sounds a lot like the internal propaganda that is being fed to employees. “Facebook is net positive”, “it’s just a tool”, etc


What makes you say that? I think it's a good argument, doesn't mean it's right but it has substance. You also have some quotes that I never said. Nowhere did I imply it's a net positive. It is like a tool however but it has much more input.


The argument was that Facebook is neutral as a platform. Similar to the internet, it serves all kinds of content. Some of the content is good, and some is bad. That doesn't necessarily mean the platform is good or bad.


Facebook is not a neutral platform. It has a lot of moderation and algorithmic ranking of posts.


Having worked in growth before (not at Facebook), I can tell that you vastly underestimate the impact FB teams have on how/when/what/for how long/how many times/etc content is displayed to end users. This is absolutely not a neutral impact.


Here in Europe, WhatsApp actually powers many neighborhood watch groups, and so when it goes down, basically a formal crime reporting system also goes down.


Neighbors watching neighbors and reporting via WhatsApp... sounds like the Netherlands.

I think if it stays down for a few more days, cannibalism will ensue by the end of the week.


This also means that you can't participate in a neighborhood without agreeing to a legal contract with Facebook to use their services, as well as submitting to ad surveillance and tracking.

That's a dick move by the neighborhood.


Quite a lot of people rely on heroin to get through the day too.


I know you think this is some sort of neutral comment about personal choice, but it isn't. Millions of underserved people all over the world live in Food Deserts (https://en.wikipedia.org/wiki/Food_desert), places with little to no access to affordable nutritious food. Those people wind up consuming a large portion of their calories from high fructose corn syrup, not because they have chosen to do so, but because they have no choice, and that is their only option. Whether you want to accept it or not, your comment is classist and makes HN a more hostile place.


People don't eat straight corn syrup. The products they do eat that contain it are quite expensive per calorie, e.g. Coke.

The problem is initiative and knowledge. They should walk or ride a couple of miles and buy the biggest bags of rice and beans they can, along with a bottle of multivitamins. And then learn how to cook.

If that’s classist, then the classes are structured by knowledge and choices. Which they may well be.


The entire reason that high fructose corn syrup is so prevalent in low-cost foods is that it's cheaper than sugar, especially in the US because of corn subsidies. Find literally any evidence that HFCS is more expensive per-calorie than sugar and you will come up empty-handed.

> If that’s classist, then the classes are structured by knowledge and choices. Which they may well be.

class by its definition accounts for massive difference in access to resources. If you think access to resources doesn't measurably change the level of knowledge that a population has, that's a declaration that resources do nothing, which would be an odd stance to take on a knowledge-focused community website.

> They should walk or ride a couple of miles and buy the biggest bags of rice and beans they can, along with a bottle of multivitamins.

I just LOVE the subtle food choice of rice and beans here, paired with the recommendation to take multivitamins, a recommendation that is supported by little to no evidence. Your own lack of knowledge on this topic is in full display, as is a clear demonstration of your own biases across multiple dimensions.


Of course HFCS is cheaper than sugar. I'm referring to the products made from it, like Coke. They are a poor way to spend your food dollar.

I agree that class accounts for a massive difference in access to resources. However, in this case, the knowledge is available for free, and in the US the basic foodstuffs are available for far less than what disadvantaged people pay for the typical processed and fast food they live on.

Rice and beans - nothing subtle about it. They are basic foods that provide the necessary carbs, fat, and complete protein. The vitamins are a simple way to prevent scurvy and similar deficiencies, until the choice of food can become more varied.

As a person learns to cook and bake, they can add wheat, peas, and corn (But they need to learn about nixtamalization before they add corn.) None of these foods require refrigeration.

I have in mind the cuisine of Mexico, which is inexpensive and nutritious. Similar cuisines are found in home cooking all over the world, at least where commercially processed food hasn't driven them out.

It is most important to make sure that all school children are taught how to process and cook these basic foods.

If you are knowledgeable in this area, I'd appreciate some specific suggestions.


My mother uses FB/Messenger to talk to her children and grandchildren.

My extended family uses FB to share info about events.

This, and other pedantic activities are really common around the world.

Don't reduce the material reality of a situation to a meme that represents a personalized view.


These things didn't start because FB was invented.


They did.

My family didn't share online before FB.

My mother didn't really have a common means to communicate with her grandchildren in the same way.

Email, phone are just not the same.

There are more channels available now for sure, but none so ubiquitous.

Facetime is not displacing FB for a lot of things, but that's more direct.

'Everyone is on FB' is the reason it still holds in these kinds of use cases.

None of us care one way or the other about the platform, we'll just use what's convenient, but that is what it is.

This is a very common theme among FB users. FB, by the way, is still growing its user base, and growing revenues even more so. The themes we see here on HN and even in the news don't represent the views among the population, nor are they necessarily very close to material reality.


and a lot of people are addicted to nicotine


You made my day. Thank you


There are plenty of ways to communicate with friends and family. If Facebook is down long enough, many people will just move to something else. (And I hope they do)


Making poor choices seems to be the curse of humankind.


Instagram messaging is also very popular, at least around me.


Sure they do. And it's why Whatsapp needs to be broken off from Facebook, because they blatantly lied about it and only bought it to kill off their competition


Maybe they shouldn't.


they shouldn't.


They relied on AOL Instant Messenger too...


I certainly do and I dream of the day that everyone I message switches, so I can too.


Why not lead the way?


A lot of people, many of them home based businesses, also rely on FB Marketplace as a primary source of income.


Many people don't realize that with the 2020 lockdown and next to zero face to face transactions happening, platforms like FB Marketplace provided an opportunity for many people to set up businesses and generate income. I understand the angst people have with FB, but there's a bigger world out there beyond our keyboards.


That's terrifying.


They have to go where their market is sadly


for one example of this look at certain ethnic food catering/delivery services that exist in many major cities and operate almost exclusively on facebook.


I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced....Finally!


I can't message my friends on whatsapp :(


Seize the moment - switch to signal.


Is Signal not equally centralised, and thus susceptible to the exact same problem as this?


Yes it is.

Alternatives beyond signal that normies can use: Email.

Spread the word!


Doesn't help when everyone just uses Gmail.


Write a blog post teaching them how to stop:

https://sneak.berlin/20201029/stop-emailing-like-a-rube/


Convincing somebody who can hardly turn on their computer to get their own domain is just not practical. Even if they can get their own domain they still have to set up DNS. Good luck getting them to set up MX, SPF and DKIM.

I think things would be better if more people had their own domain. I just don't see any way of making it happen. I can't get my own family to leave gmail even with me handling all the domain stuff for them. Even my technical coworkers who are capable of this don't care.
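
For context, "handling all the domain stuff" for mail means getting records roughly like these right (placeholder values; the actual host names and the DKIM key come from whichever mail provider is used):

    ; hypothetical zone snippet for mail on example.com
    example.com.                       IN  MX   10 mail.example.com.
    example.com.                       IN  TXT  "v=spf1 mx ~all"
    selector1._domainkey.example.com.  IN  TXT  "v=DKIM1; k=rsa; p=<public-key-here>"

Not hard for anyone reading this site, but a complete non-starter for most Gmail users.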


Yep. Matrix is a decentralized alternative (provided you don't just use the default homeserver).


Yeah, but if you're going to use something centralised anyway, may as well use a more private option.


This issue isn't about privacy, it's about reliability. How reliable is Signal compared to WhatsApp?


Yes. In an ideal world, messaging would have followed the same federated model as email. XMPP offers this; unfortunately, few people use it or are even aware of it.


Correct. Switch to Briar.


...and where do you go when AWS/Signal's servers go down?

How about choosing something that's federated? https://matrix.org/


I'm fine with Matrix, but I'm not seeing the people around me moving to it, even with a more friendly solution like Element. It's already hard to make them use Signal just because they want users to remember a pin...


Email

(I'm not kidding)


Delta.chat is an instant messenger implemented over email. Alternatively, it's an email client that looks like an instant messenger.


Can't tell them to switch if whatsapp is down!

More guidance required.


Based on their track record I wouldn't be surprised if Signal just happened to be having an extended outage too.


Just because a company has questionable or even straight evil business practices doesn't mean that literally millions of companies/people don't rely on them to do business and communicate.


It is a good start.


Well, I know you jest, but a lot of conversations, with many people, over years and years would be lost. It'd be akin to hundreds of email threads with friends being deleted.


But positive social value was gained


much value was gained!


This cannot be said enough.


On the contrary, it's said far too much. Facebook is extremely valuable for a lot of people. I dislike Facebook as much as most people on here, but saying "it's totally pointless" is silly and it's not surprising that those who say it are ignored by those who use Facebook.


Lots of people are heroin-dependent; the number of people hooked doesn't make it right.

At the very least you are going to need a better argument than that following the recent data dump.


In what ways is Facebook "extremely valuable for a lot of people"?


* A friend of mine runs a posh burger van that moves around a lot, and he puts "today's location" on Facebook.

* My wife talks with her family in Brazil through Facebook, sharing photos

* My Church receives a lot of help requests from people in trouble through Facebook

* Some abuse charities give support to victims through Facebook

etc

You could argue that it would be nice if there were alternatives, or that these organisations shouldn't be using Facebook at all. Sign me up for your campaign, I agree with you.

But if you say "Facebook has no value" then you will never understand the value proposition you need to offer in order to kill Facebook.


Communication for many out there. Many will be just fine without commenting on cat photos or bragging with their likes or followers. Many will be in trouble if they use WhatsApp/Instagram/Messenger/Marketplace to do business and any important communication.


I have many connections to people I met travelling. While not friends that I talk to often, the connections are still valuable.


Never underestimate software that is 'just good enough'


Facebook bashing is getting old. It's 2021, dammit.


And every year it makes the world worse.


Who else sees their deleted messages on WhatsApp that shouldn't be there?

https://twitter.com/Pytlicek/status/1445072626729242637


That is wild and definitely newsworthy. Capture as many screenshots and data as you can.

FWIW it seems possible that the messages remain cached locally on your device but deleted from their servers, and with their outage they aren’t being updated to delete?


This is something to get in front of a tech journalist who covers Facebook. It’s a major breach of trust. Probably hit one of your favs with a tweet, but they also tend to list their contact info on author pages of the sites they publish for.


It seems it has caused a DNS server crash at one of Czechia's biggest internet providers, Vodafone. It could be unrelated, but I doubt it (https://twitter.com/BlazejKrajnak/status/1445063232486531099).

Think of it: half the country doesn't have internet because of this crash; that's terrifying. (Switching DNS servers obviously works, but that's not something the general population will do.)
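
For the technically inclined the stopgap is tiny; on a typical Linux box it can be as small as pointing at a public resolver (assuming nothing like systemd-resolved or NetworkManager is managing the file, in which case change it there instead):

    # quick and dirty: use a public resolver until the ISP's recovers
    echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf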


If only the news reporting were not as stupid as "the internet is not working at UPC", instead of "DNS resolvers at UPC crashed; here's what you can do..."

Anyway, I didn't even notice since I run knot-resolver at home.

I wonder what it will be like connecting Facebook back to the internet, thundering herd and everything...


I suspect that the DNS aspect will be fine. The middle DNS servers only need one valid response to cache it for $TTL, but they can't cache SERVFAIL.


> but they can't cache SERVFAIL.

Correction: according to Cloudflare's blog post, some do cache errors as well.


I mean connecting your company back to the internet when you have billions of devices waiting in line to fetch updates, or whatever.

Won't that be an issue? Re-enabling routing to such a massive internet service...


I suspect it will slowly get there. We'll find out, it is recovering now.


Quite likely. Cloudflare has a report that outlines the reason behind it: https://blog.cloudflare.com/october-2021-facebook-outage/


Same in the UK, I've just experienced external DNS outage on BT!


All most people have to do now is install an app and it takes care of the rest. But that message needs to come from the media and the news.


Vodafone is down in Romania too.


Posting this comment will be like farting into a hurricane, but here goes.

A company like Facebook has a serious problem and their stock drops ... precipitously. The CEO of said company, instead of selling their equity in the company, has taken out loans against that equity in order to decrease their tax burden and cash in on its value.

What amount of decrease would cause a margin call from lenders, the forced sale of said equity, and subsequently the loss of a majority stake in their own company? Now obviously only the lenders know this information, and that's assuming I have the rough order of operations correct.

Could this be a potential chink in the armor of founders / CEOs / anyone who takes out low interest loans against the equity they hold in their company? Maybe my understanding of this is too simplified.


It's happened [1] but for it to happen to a company with the structure of Facebook a lot of things would have to go wrong.

For starters Zuck owns Class B shares which have 10x the votes of Class A shares. I'm sure a bank would happily loan him money against his Class B shares, but any forced liquidation would involve a transfer to Class A shares. Zuck could lose a lot of shares and still maintain control of the company.

[1] https://www.wsj.com/articles/a-board-struggles-with-its-ceos...


This was the reply I was really looking for, thank you.


I don't know much about this, but from the limited amount I've read it is probably only a portion of the equity owned, and generally when borrowing against an asset the lender will not give you 100% of that asset's value, to protect from downside risk. Another possibility is that the loan was set up in the past, potentially year(s) ago, and FB's stock price a year ago was almost $100 a share less than it is now, so a $10 drop is not a big deal in the long term.

You raise an interesting question though and I'd like to know the answer as well!


Nah. Almost certainly he could lose 100% of that value of the stock without being at risk of anything like this (as in, he probably put up $100m if he wanted a $50m loan, etc..)

Even if he didn't, the bank would let him move funds in without forcing him to sell.


Margin calls don't really exist as far as loans are concerned. Once you have agreed collateral, and an agreed schedule of payment, you only get in trouble if you miss a payment, regardless of how the collateral fluctuates in value.


That might be true for a house or a small scale loan, but once you are dealing with billions I doubt that’s the case. I assume that it works as follows: you have $1b in stock, Bank gives you $500m line of credit. If stock goes down enough they force a sale, but they only sell against what you have actually Utilized in your line of credit. If you are Mark Zuckerberg and worth more than $100 billion, you probably don’t have any issues. If you add up all of his houses and planes and cars it probably doesn’t add up to more than 1% of that. He’s fine.


Loans based on assets like stocks/bonds/other assets with highly variable prices always have collateral requirements. If the loan is backed by 100M in facebook shares and the price of stock drops in half you will have to hand over more stock for collateral. If the price doubles, you can ask for your collateral back.

It is doubtful Z has any margin call issues as he has so much stock, I can't imagine he would have pledged even 5% of it for loans, so he can just hand them another chunk without even blinking (which he generally doesn't do any way)


Okay, but as the value of the collateral approaches 0 your lender asks you to increase your collateral correct?


Yes. But banks won't loan 100% on equities.

This scenario would just be basically impossible.


I am not sure you understand my question / hypothetical. A bank is not the only form of a lender first off, second the reason there isn't a 100% loan on an equity is that it's understood that the value of the underlying collateral can fluctuate. These are called over collateralized loans.


I currently have a $150k margin loan, and it absolutely will lead to forced liquidation if the collateral drops in value: Interactive Brokers was clear about that.

Separately, my bank tried to sell me a "Pledged Asset Line of Credit" that would also have required the collateral to maintain a certain value or there would be forced liquidations.

Can you link an example of a bank or similar that lets you borrow against stock or options collateral without margin call risk?


A margin loan is very different to the kind of financing agreement a company will enter into. You are using the money at IB to speculate, and probably purchasing volatile assets at that. A company will generally utilise that money very differently, and it is unlikely that a lending institution will accept shares as collateral due to the wrong way risk (i.e. if they can't service their debt, their shares are probably losing value too, so probably not good as collateral).


> A margin loan is very different to the kind of financing agreement a company will enter into

The original post you were replying to talked about founders borrowing against their company's shares as individuals, not companies.

> it is unlikely that a lending institution will accept shares as collateral due to the wrong way risk

That's my point: Nobody's getting a special financing deal on their company stock as individuals to eliminate their margin call risk.

Multi-billion dollar hedge funds get a personal contact at the bank, but even they will get margin-called borrowing against stocks as collateral if it goes against them.


He might not be speculating, he might be holding a bunch of SPY shares and simply withdrew $150k as a margin loan so he can make a purchase on a house or car but not pay taxes on gains yet from selling his shares, opting instead to pay off the loan over time through regular deposits.


He is still speculating on the SPY and his ability to pay off the loan depends on how the SPY holds its value. A market crash would hurt his ability to pay off the loan.


No it doesn’t. His loan is the same no matter what SPY’s value is. He pays it by depositing money he earns from his day job, not by selling shares. If SPY crashes very hard his broker may force him to pay the loan very quickly, either by adding more money or by selling off his shares to cover the loan.


Even Facebook's Onion site isn't working: http://facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5t...

Fascinating simply that it's apparently not just a DNS issue.


Any idea what is the explanation for this?


It's about damn time. Hopefully they stay down. It will do the world some good (long term) to have some time away from this platform and platforms like it.


Facebook being down makes me think of all of those small businesses who never built websites. They rely on traffic and publicity from their Facebook pages only.

It's so important to diversify, such as building a website.


Well assuming they do that, likely they will host websites on AWS or Azure. Then AWS is down, what are you gonna do?


There's always Oracle Cloud ;)


Move your resources to the one you didn't pick and point your domain registrar there.


Unsurprisingly, Oculus is down as well, as are most services for the VR headset. So that's 4 major properties right now.


Can you not use an Oculus headset if FB servers are down? That’s absurd.


Some preloaded apps work (like YouTube, Firefox), but stuff like settings, the lobby, etc., are very slow or display "Unable to Load" messages. Any game that relies on your friends list seems to freeze for a while, then try to carry on.


Yep.

We've come full circle, where techies are rediscovering the original hatred for the Oculus, that it is tied to a social media walled garden, for some reason.


I think you may need a refresher on their history - https://en.wikipedia.org/wiki/Oculus_(brand)#History

When FB announced they would be buying Oculus, they promised that no social media integration would be required. FB breaking that promise is not the same as Oculus having that requirement from the get-go.

What original hatred are you talking about??


Originally, you did not need a facebook account to use oculus after purchase. They framed this as "you do not need to integrate social media/facebook".

This ^ means fuck-all, because at that time (day 1), their oculus services where hosted in the same infrastructure as their social media services.

Last year, they got rid of "you do not need a facebook account". But in all situations since inception, all of your data is passing through the same infrastructure as facebook data. It may not be being exposed, or targeted for advertising, but this WAS a huge point of contention years back.


> oculus services where [sic] hosted in the same infrastructure as their social media services

with my second-hand knowledge from someone who worked for FB and assisted the Oculus team being folded in to FB processes/policies/tech, I don't think this is accurate, either.


Well, you can't access a city if the freeway has been bombed...

Remember the 'information super-highway'? Yeah it gets carpet bombed constantly....


The Rift headsets probably still work fine, but the Quest headsets require a FB connection to work.


I couldn't help but to think of the fellow who has no monitors but uses an Oculus for virtual displays full-time, from last week

https://news.ycombinator.com/item?id=28678041


And a lot of people seem to be coming to HN to find out why, judging by how laggy HN is getting right now...


Well, it sure is the place to find out.


Yet another reason to not over-rely on a few big tech companies for the majority of the planet's communication. Forget concerns about competition, monopolies and so on for now (as important as they are), what we want are many social networks, video conferencing apps, messenger apps. Every country should strive to build their own Google or FB, or certainly many more should. State-backed if needed. It's a question of resilience and security as much as anything.


I had problems with my internet connection and loaded my ISPs site. Strangely, my bill was paid. Even stranger, some sites load while others do not.

Then it hit me: I am so dependent on Facebook owned properties (Whatsapp, Facebook, insta) that a Facebook failure looks to me like an internet failure.


Ironically enough, status.fb.com is also down.


need a status.status.fb.com to indicate the status of the status

and / or an S3 bucket with a json blob the apps can pull to at least tell users 'here's what's up'
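
A minimal sketch of what such a blob could look like, fetched from storage that shares nothing with the main stack (the bucket name and JSON fields here are invented purely for illustration):

    $ curl -s https://example-status-bucket.s3.amazonaws.com/status.json
    {"status": "major_outage", "message": "We are investigating", "updated": "2021-10-04T17:00:00Z"}

The apps would only need to fetch and display that blob whenever their normal API calls fail.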


Reminds me of the S3 outage a couple of years ago, when the AWS status page went down because it was relying on... S3...


It uses the same nameservers as facebook.com, same point of failure.

Seems like another poster posted finer details regarding BGP/peering which is ultimately causing the issue.


From the archived ramenporn reddit comment thread at [0]:

> This must be incredibly stressful so for your sake I hope you sort it out quickly... but for the world's sake, I hope you fail and make the problem worse before jumping ship followed by every other engineer, leaving it to Zuckerberg to fix himself. But I still hope it's not too stressful for you!

https://archive.is/Idsdl



Still an American centralized platform. Federated Matrix is the way to go!


Ironically, I cannot send messages on Signal right now. They can’t handle the extra load?


Signal replicates each message to NSA and FB, so when one is down, Signal’s backend fails with a timeout error.


Source?


to also face a failure in the single point of failure - system?


I love how understated companies always are about things like this.

> Facebook said: "We are aware some people are having trouble accessing our apps and products. We are working to get things back to normal as quickly as possible and apologise for any inconvenience."

https://www.bbc.co.uk/news/technology-58793174


I keep trying to submit to HN but I keep getting an error.

What's wrong with the internet?

FaceBook is down.

My friend from Slovenia is having trouble with discord. It eats his messages.

I can't load photos from my friend in telegram and the messages take a relatively long time - multiple seconds! - to get received.

TrackMania players have talked about having input lag.

ycombinator is really slow and reports an error after submitting. "We're having some trouble serving your request. Sorry!" (lost count of the times i've tried submitting this)

ycombinator turned out to be giving only errors.

Some sites I've found via google results seem to report that they are suffering from slow connections.

Do you have anything to add to this?


I'm having issues with telegram as well. Images won't send and the app continuously says "updating" on the top status bar.

How could facebook dns issues cause this?


Because everyone is using Telegram instead of Whatsapp due to WA being down.


Big ISP outages in NYC right now


From this tweet: https://twitter.com/BlazejKrajnak/status/1445063232486531099

"Because of missing DNS records for http://Facebook.com, every device with FB app is now DDoSing recursive DNS resolvers. And it may cause overloading ..."


Seems to be DNS related.

None of the listed facebook nameservers are resolvable or reachable:

a.ns.facebook.com b.ns.facebook.com c.ns.facebook.com d.ns.facebook.com
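
A quick way to check that from any machine, assuming dig and ping are installed (a rough sketch, not a proper health check):

    for ns in a b c d; do
      dig +short $ns.ns.facebook.com @8.8.8.8   # does the name even resolve via a public resolver?
      ping -c 1 -W 2 $ns.ns.facebook.com        # is the host reachable at all?
    done

Right now both steps fail for all four names.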


Looks like the routing is goofed. Loops over and over - DDoS attacking themselves.

    mtr -r -c10 -w -b a.ns.facebook.com
    Start: 2021-10-04T10:02:50-0600
          Loss%   Snt   Last   Avg  Best  Wrst StDev
    ...
      4.|-- ae-2-rur101.cosprings.co.denver.comcast.net (162.151.51.125)         0.0%    10   12.6  11.9   9.6  19.0   2.9
      5.|-- 24.124.155.233                                                       0.0%    10    9.3  10.2   9.1  12.4   1.1
      6.|-- 96.216.22.45                                                         0.0%    10   12.0  14.0  11.6  31.3   6.1
      7.|-- be-36041-cs04.1601milehigh.co.ibone.comcast.net (96.110.43.253)     20.0%    10   14.6  13.5  11.6  20.5   3.0
      8.|-- be-3402-pe02.910fifteenth.co.ibone.comcast.net (96.110.38.126)       0.0%    10   12.2  12.0  11.5  13.2   0.5
      9.|-- 173.167.59.170                                                       0.0%    10   13.8  17.8  12.0  34.7   8.4
     10.|-- 129.134.40.74                                                        0.0%    10   15.3  12.6  11.4  15.3   1.1
     11.|-- 129.134.43.226                                                       0.0%    10   18.9  15.3  12.6  20.3   3.0
     12.|-- 129.134.98.166                                                       0.0%    10   12.5  14.2  12.5  20.4   2.3
     13.|-- 129.134.54.61                                                        0.0%    10   34.2  30.8  28.9  34.2   1.8
     14.|-- 129.134.53.61                                                        0.0%    10   29.8  31.1  28.9  36.5   2.7
     15.|-- 129.134.53.61                                                       90.0%    10   31.9  31.9  31.9  31.9   0.0


Same issue over at b.ns.facebook.com. Looping routing creating a self-inflicted DDoS.

    mtr -r -c10 -n b.ns.facebook.com
    Start: 2021-10-04T10:28:03-0600
          Loss%   Snt   Last   Avg  Best  Wrst StDev
      1.|-- 192.168.1.1                0.0%    10    0.2   0.2   0.2   0.3   0.0
      2.|-- 96.120.12.229              0.0%    10   10.2  10.8   8.8  15.7   1.9
      3.|-- 96.110.149.185             0.0%    10   17.7  13.6   9.8  32.3   7.0
      4.|-- 162.151.51.125             0.0%    10   10.9  12.2   9.6  15.3   1.9
      5.|-- 24.124.155.233             0.0%    10   13.0  10.4   9.4  13.0   1.2
      6.|-- 96.216.22.45               0.0%    10   16.5  16.7  11.2  29.1   6.4
      7.|-- 96.110.43.241              0.0%    10   17.4  13.6  11.9  17.4   1.6
      8.|-- 96.110.38.114              0.0%    10   12.5  12.8  12.0  14.0   0.6
      9.|-- 173.167.59.170             0.0%    10   36.1  19.3  11.6  36.1   9.7
     10.|-- 129.134.40.76              0.0%    10   13.1  12.3  11.3  13.1   0.6
     11.|-- 129.134.34.72              0.0%    10   15.3  15.7  13.5  21.3   2.5
     12.|-- 129.134.102.85             0.0%    10   39.0  39.2  38.0  40.8   1.0
     13.|-- 31.13.25.13                0.0%    10   30.5  29.8  28.5  31.0   0.9
     14.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
     15.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
     16.|-- 31.13.25.13               90.0%    10   30.2  30.2  30.2  30.2   0.0

On a side note... why is it so freaking hard to format line breaks in HN?


In the beginning it responded but gave server errors.


Which seems to indicate a massive infrastructure failure.


Actually I'd argue that the biggest problem would be to wait for the TTL to expire after you've fixed the problem.


The TTL was most likely very low, so I don't see that as being an issue.
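
For what it's worth, you can check what a resolver is still holding; the second column of each answer line is the remaining TTL in seconds:

    dig +noall +answer facebook.com @1.1.1.1

If the TTL really was as low as suggested above, positive caches would have drained within minutes of the records disappearing.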


Discord [1] is taking a toll from the increased traffic as well:

"We're noticing an elevated level of usage for the time of day and are currently monitoring the performance of our systems. We do not anticipate this resulting in any impact to the service.

We have temporarily disabled typing notifications. We expect these to be re-enabled soon."

[1] https://discordstatus.com/


Yeah, it seems like a lot of places online are facing the same thing. Even here it's been bad for me.


Someone save a risky release for 9am on a Monday morning? Decided Friday afternoon was too risky?


0830 actually :/

But to be fair... seems like it was a good call to not do it Friday night :D


If they chose 8:30, then it must have been really risky! ;)


The Honolulu office is getting ready for a long night :)


Facebook employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057


Not an indepth technical comment here but: seeing a tech megacorp go offline for a day makes me very very happy.


Some Oculus Quest owners can't use their device https://www.reddit.com/r/OculusQuest/comments/q18xwy/faceboo...


Is anyone else seeing knock-on effects at the other major public DNS providers? I'm seeing nslookups sent to 4.2.2.2 and 8.8.8.8 intermittently timeout if the hostname does not belong to a major website. CloudFlare DNS (1.1.1.1) doesn't appear to be impacted. For example:

    [root@app ~]# nslookup downforeveryoneorjustme.com 4.2.2.2
    ;; connection timed out; trying next origin
    ;; connection timed out; no servers could be reached

    [root@app ~]# nslookup downforeveryoneorjustme.com 1.1.1.1
    Server:         1.1.1.1
    Address:        1.1.1.1#53

    Non-authoritative answer:
    Name:   downforeveryoneorjustme.com
    Address: 172.67.166.187
    Name:   downforeveryoneorjustme.com
    Address: 104.21.91.48

    [root@app ~]#

Perhaps DNS queries are skyrocketing and overwhelming some of the major public DNS servers.


See this thread (with replies):

>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com

https://twitter.com/jgrahamc/status/1445066136547217413

>Our small non profit also sees a huge spike in DNS traffic. It’s really insane.

https://twitter.com/awlnx/status/1445072441886265355

>This is frontend DNS stats from one of the smaller ISPs I operate. DNS traffic has almost doubled.

https://twitter.com/TheodoreBaschak/status/14450732299707637...


No idea if it’s related but a lot of Tor websites have also been offline all day (BBC, ProtonMail etc).


Yes same issue for me with 8.8.8.8, errors for everything but big domains.


Also Speedtest.net for me is showing a 503 error page. Seems a large CDN might be having problems. Their status page shows all green. FB and their other sites are also down.

edit: I see it's back up and I've been getting downvoted, here's a screenshot of the error for clarity

https://i.imgur.com/wvhOwwL.png


If Facebook, WhatsApp, and Instagram all fail, there are probably a lot of people checking whether their Internet works. That might be why Speedtest was overwhelmed.


I recognize that for WhatsApp users around the globe this is probably more than an inconvenience, but the rest of humanity is getting something of a reprieve here.


Also amusingly I think quite a lot of employers use WhatsApp for part of their disaster communications plan.

If this had happened alongside some wider issue (GitHub down), there'd be chaos.


The UK government will grind to a halt if WhatsApp is still down tomorrow. Well, more than usual that is.


It's quite a coincidence for this to coincide with the whistleblower report + rumors of Peter Thiel (perhaps via Palantir?) involved in leveraging FB for the 2022 midterm elections.

I'm not suggesting that this is the case, but a failure of this scale (with internal systems also down) could allow scrubbing of evidence without leaving traces.


This is taking a longer time than expected for a company like Facebook - must be serious where a rollback isn't possible or trivial.


From what I understand (take with a grain of salt), remote access to the affected routers is down, so someone needs to be physically plugged in to address the issue. Hence some of the other "scrambling private jets" comments referring to getting the right people physically plugged in to the right routers.


I wonder if everyone refreshing the sites/apps trying to get it to load is contributing to the problem


Probably not, from other comments it looks like there was a wrong configuration rolled out, and now they are logistically struggling to get access to fix them.


This is honestly the best feature Facebook has ever developed. I hope it's permanent. It has the following effects: you feel better about yourself, you can spend more time with your family, you are more productive.


Whatsapp's down too. Tough month for FB, especially with the leak.


What leak are you referring to?



The actual leak was published on WSJ “The Facebook Files”

https://www.wsj.com/articles/the-facebook-files-11631713039



Facebook Whistleblower Claims Profit Was Prioritized Over Clamping Down on Hate Speech

A Facebook whistleblower, who is due to testify before Congress on Tuesday, has accused the Big Tech company of repeatedly putting profit before doing “what was good for the public,” including clamping down on hate speech.

Frances Haugen, who told CBS’s “60 Minutes” program that she was recruited by Facebook as a product manager on the civic misinformation team in 2019, said she and her attorneys have filed at least eight complaints with the U.S. Securities and Exchange Commission.

During her appearance on the television program on Sunday, Haugen revealed that she was the whistleblower who provided the internal documents for a Sept. 14 exposé by The Wall Street Journal that claims Instagram has a “toxic” impact on the self-esteem of young girls.

That investigation claimed that the social media giant knows about the issue but “made minimal efforts to address these issues and plays them down in public.”

“The thing I saw at Facebook over and over again was there were conflicts of interest between what was good for the public and what was good for Facebook. And Facebook, over and over again, chose to optimize for its own interests, like making more money,” said Haugen.

She explained that Facebook did so by “picking out” content that “gets engagement or reaction,” even if that content is hateful, divisive, or polarizing, because “it’s easier to inspire people to anger than it is to other emotions.”

“Facebook has realized that if they change the algorithm to be safer, people will spend less time on the site, they’ll click on less ads, they’ll make less money,” she claimed.

Haugen is expected to testify at a Senate hearing on Oct. 5 titled “Protecting Kids Online,” about Facebook’s knowledge regarding the photo sharing app’s allegedly harmful effects on children.

During her appearance on the television program, Haugen also accused Facebook of lying to the public about the progress it made to rein in hate speech on the social media platform. She further accused the company of fueling division and violence in the United States and worldwide.

“When we live in an information environment that is full of angry, hateful, polarizing content it erodes our civic trust, it erodes our faith in each other, it erodes our ability to want to care for each other. The version of Facebook that exists today is tearing our societies apart and causing ethnic violence around the world,” she said.

She added that Facebook was used to help organize the breach of the U.S. Capitol building on Jan. 6, after the company switched off its safety systems following the U.S. presidential elections.

While she believed no one at Facebook was “malevolent,” she said the company had misaligned incentives.

“Facebook makes more money when you consume more content,” she said. “People enjoy engaging with things that elicit an emotional reaction. And the more anger that they get exposed to, the more they interact and the more they consume.”

Shortly after the televised interview, Facebook spokesperson Lena Pietsch released a statement pushing back against Haugen’s claims.

“We continue to make significant improvements to tackle the spread of misinformation and harmful content,” said Pietsch. “To suggest we encourage bad content and do nothing is just not true.”

Separately, Facebook Vice President of global affairs Nick Clegg told CNN before the interview aired that it was “ludicrous” to assert social media was to blame for the events that unfolded on Jan. 6.

The Epoch Times has reached out to Facebook for additional comment.

https://www.theepochtimes.com/facebook-whistleblower-claims-...


While (editorial commentary aside) the basic facts in that article are accurate as far as I can tell, I'd be careful with that source - The Epoch Times is a mouthpiece for Falun Gong's political interests and engages in disinfo programs.

https://en.wikipedia.org/wiki/The_Epoch_Times

They also previously ran a large sockpuppet network on Facebook and the Facebook ad platform (both of which have since been banned) so they may have a bit of a bone to pick with the platform.


Here's what wikipedia says about using wikipedia as a source: https://en.wikipedia.org/wiki/Wikipedia:Researching_with_Wik...


This sounds so extremely far-fetched and designed to create a negative impression of Facebook. Does anyone even take this seriously? If yes, why?

> mobile phones are addictive

> internet is used to organize protests


Whistleblower that spoke to WSJ.


The media coverage and lots of the comments don't make sense to me. FB would not be so stupid as to put all of their crucial DNS servers into a single autonomous system (which is now offline due to BGP issues). They operate literally dozens of datacenters around the world, and are surely not using a single AS for them - why not put secondary nameservers there? Can someone make sense of this?


Sounds like automation deployed a configuration update to most of Facebook's peering routers simultaneously. Something similar brought down Google in 2019.


If so, then it would simply be a BGP issue - no FB servers reachable, as all routes are down. But the media and various claims describe a combo of BGP and DNS. Hard to believe that world-wide border routers responsible only for the networks containing the DNS servers are misconfigured. I am really curious about that post-mortem :)


I think it was only a BGP issue. The DNS servers apparently shared the same peering routers as the rest of Facebook's infrastructure. Everyone focused on DNS because that's the first sign of failure to an end user.
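
One outside signal consistent with that reading (a rough check, not proof): resolvers were answering SERVFAIL rather than NXDOMAIN, i.e. they could not reach any authoritative server, as opposed to the records having been deleted at the registry.

    # "status: SERVFAIL" => the resolver gave up trying to reach facebook.com's nameservers
    # "status: NXDOMAIN" => the name genuinely does not exist in the parent zone
    dig facebook.com @8.8.8.8 | grep status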


Facebook down, WhatsApp down, but Signal still works. Time for a change?

EDIT: Yes, Signal is not federated, but that's what people are at least ready to consider as a WhatsApp alternative. I also created a Matrix / Element account, and found 0 of my contacts already using it.


Signal had their own share of downtime. How about going to a federated system instead of repeating the same mistakes? (https://matrix.org)


But a part would still be down if a server has an outage. How about a system where every device that is used for chatting is also a server at the same time? I wonder whether something like that already exists. Bundle it together with bigger servers to handle the load; if the bigger servers experience outages, the service can still continue, although a bit slower.


Matrix P2P already exists.


Thanks, didn't know that!


The reason contacts don't tend to show up on Matrix/Element is because we don't push the user into sharing their addressbooks given the obvious privacy issues. Instead you mainly have to figure out who you know out-of-band for now (e.g. tweet "hey, who's on Matrix?").


I would be happy to have an option to share my availability on Matrix with other people that decided the same if that would mean I could bootstrap my network on that platform.

As it is now, Matrix may offer better privacy and be more robust because it's federated and P2P, but if I have to personally ask all of my contacts over some other medium whether they actually use it, I can just keep using that other medium for conversations too.


> Signal still works

https://old.reddit.com/r/thehatedone/comments/f160jh/is_sign...

Signal is still centralized and uses AWS. So if AWS was to go down, it would affect not just Signal but vast swathes of the Internets.


This event should be a good conversation starter about what a horrifying monopoly this trifecta of services has on worldwide communication. When I think through a random smattering of people in my contact book, I now have no way of contacting quite a few people at all. That's fucked. I wonder how many important messages, replies, etc. will be screwed up due to this.


New York Times has coverage.[1]

"A small team of employees was soon dispatched to Facebook’s Santa Clara, Calif., data center to try a “manual reset” of the company’s servers, according to an internal memo."

[1] https://archive.is/iBzs3


Interesting, even some open source sites like: https://fbinfer.com are down

but https://glean.software and https://reactjs.org aren't


Anything hosted on Facebook's infrastructure is down. The two sites that you note are up aren't hosted by Facebook.


What's react hosted on?


Looks like Cloudflare nameservers and Vercel hosting.


According to https://lookup.icann.org/lookup both glean and reactjs have Cloudflare nameservers. fbinfer has ns.facebook.com nameservers which are presumably down.
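
The same check works straight from the command line, assuming dig is available:

    dig +short NS reactjs.org
    dig +short NS glean.software
    dig +short NS fbinfer.com    # may SERVFAIL while the facebook.com nameservers are unreachable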


Here is a handy troubleshooting flowchart for megacorp outages:

> Is it a DNS issue? -> yes

It can be used in reverse as a postmortem too.


The cynic in me wonders if this is related to the Pandora Papers leak


If it is a DNS error, why is the .onion site also offline?

- https://en.wikipedia.org/wiki/Facebook_onion_address

- facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5tfyd.onion


The DNS outage is an outcome of faulty BGP updates. As such, not only can the Internet not see the FB network, there is also no connectivity from the FB network to the Internet right now.


My guess is that the FB backend also requires DNS. The .onion site isn't backed by a backend built on an onion-native stack (is that a thing?).


It's not just/primarily a DNS error.


Facebook is proving that it's systemically important by taking the entire site down.....

Zuckerberg is taking his ball and going home unless you stop writing mean things about him /s


Used to have a manager that we swore did exactly that. Every time he was away on holiday, mysterious site problems to prove his worth!


I had the exact opposite and it was hilarious. Every time my manager (a great guy and really good at what he did) was away for a week the sprint would go very smoothly.


Lol exactly what I was thinking. I'm trying to keep my tinfoil hat in the closet and yet it seems odd that after there is a huge FB whistleblower story on 60 Minutes last night, all of FB goes down today.

I really hope it's just some internal technical error and not a "see, despite the bad things of FB, you really need us" move.

It's probably trivial, the timing just seems weird to me.


Let’s hope it’s this. Everyone will just shrug and move onto the next hopefully less evil site


Which one?


lobste.rs? mastodon?


Is this related to the outages from the Let's Encrypt root cert expiring? Probably not, since this looks like a DNS issue, but it's still a crazy coincidence that two major internet-breaking events happen in the same week.


There is zero reason to believe it's related at all. It's perfectly reasonable to have multiple large and unrelated failures in the same week.

I also wouldn't classify the loss of 1 company, and the expiration of some TLS certificates, as the interconnected network of networks being broken. The Internet has continued to function even if some larger players were unreachable or having issues.


Somebody moved fast.


And broke everything


DNS configuration is becoming a single point of failure. A few weeks ago, many services running out of AWS West 2 failed because the within-the-datacenter DNS system broke down somehow.


What are some of the possible scenarios beyond the suggested DNS issue? (And might it be an attack?)


This is what I came to the comments for.

Unfortunately, we have literally dozens of comments that amount to nothing more than schadenfreude, and another handful of non-FANG employees speculating about how one of the largest internet operations in existence could improve their game (lol)


I doubt this is the case, but someone on twitter was speculating "what if this is fb's infra team going on strike"


A BGP routing mistake that can cascade into a hard-to-recover-from state of the network where inter-dependencies lock each other.


After changing the screen resolution, operating systems prompt the user to confirm that the applied settings are correct; otherwise they time out and reset to the last known good setting. Maybe it's time for core internet infrastructure to implement something similar? :)
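
Router OSes do have something close to this; Junos, for example, supports a timed confirmed commit that rolls back automatically unless the operator confirms within a window. A very rough shell sketch of the same pattern (the reload-network command and file paths are hypothetical, purely for illustration):

    cp /etc/network/config /etc/network/config.lkg              # keep the last known good config
    cp /tmp/new.config /etc/network/config && reload-network    # apply the risky change
    (
      sleep 600                                   # 10-minute confirmation window
      if [ ! -f /tmp/change-confirmed ]; then     # nobody confirmed in time...
        cp /etc/network/config.lkg /etc/network/config
        reload-network                            # ...so roll back automatically
      fi
    ) &
    # once you have verified you can still reach the box: touch /tmp/change-confirmed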


You can’t really make a system that will unboil an egg.


Wow. I can’t remember the last time WhatsApp was down. I pretty much use Messenger/Instagram/WhatsApp to talk to most of my friends and family. I’m happy that I do use other platforms, otherwise I would be completely cut off from my parents right now.


I keep trying to submit to HN but I keep getting an error.

What's wrong with the internet?

FaceBook is down.

My friend from Slovenia is having trouble with discord. It eats his messages.

I can't load photos from my friend in telegram and the messages take a relatively long time - multiple seconds! - to get received.

TrackMania players have talked about having input lag.

ycombinator is really slow and reports an error after submitting. "We're having some trouble serving your request. Sorry!" (lost count of the times i've tried submitting this)

ycombinator turned out to be giving only errors, but now seems to be working occasionally. I can not submit anything, though.

Some sites I've found via google results seem to report that they are suffering from slow connections.

Do you have anything to add to this?


What's wrong with the internet?

With Facebook down, some large DNS servers seem to be struggling with the extra load of failing requests to look up "facebook.com". Cloudflare reports overload with their DNS server at 1.1.1.1, although that's working for me.

Billions of things worldwide are trying to connect to Facebook. The lookup which normally returns the IP address for facebook.com on the first try now requires trying a.ns.facebook.com, b.ns.facebook.com, etc. several times each before giving up. Probably several times a minute for everyone who has a Facebook app in their phone turned on. That may be using a big fraction of world DNS resources.
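
A rough illustration of that fan-out, using the glue addresses published for the facebook.com nameservers: each failed lookup can burn several seconds of resolver time walking all four unreachable servers, with retries, before SERVFAIL is returned.

    for ip in 129.134.30.12 129.134.31.12 185.89.218.12 185.89.219.12; do
      dig +time=2 +tries=2 facebook.com @$ip    # each dead server costs roughly 4 seconds of timeouts
    done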

Vodafone Ireland seems to be struggling with a DNS overload right now, per the Irish Independent. Also, their status page can't find "Dublin" as a city.


Some of those can probably be explained by facebooks traffic being redistributed to other services, overloading them


A few wordpress blogs crashed because addon facebook pixel is crashing. Very intensive lesson for the internet!



What I think is interesting is the effects of this type of thing across peripheral news sites, like HN. I wonder how much spike HN gets with people rushing here to find out what's going on and to read the (articulate) related discussions.


Somebody published vogon poetry and pictures and the internet routed around the damage...


I have not laughed so loud since before the virus that shall have no vogon name.


    nslookup www.facebook.com 8.8.8.8
    Server:         8.8.8.8
    Address:        8.8.8.8#53

    ** server can't find www.facebook.com: SERVFAIL


  dig +trace messenger.com
shows that all is well with the root DNS servers and

  dig @a.ns.facebook.com messenger.com
  ;; connection timed out; no servers could be reached
and also

  ping a.ns.facebook.com
  3 packets transmitted, 0 packets received, 100.0% packet loss
shows that something's wrong with facebook.


tracert 129.134.30.12

Tracing route to a.ns.facebook.com [129.134.30.12] over a maximum of 30 hops:

  1     1 ms     1 ms     1 ms  eehub.home [192.168.1.254]
  2     3 ms     3 ms     3 ms  172.16.14.63
  3     *        5 ms     3 ms  213.121.98.145
  4     5 ms     3 ms     4 ms  213.121.98.144
  5    17 ms     8 ms    18 ms  87.237.20.142
  6     8 ms     6 ms     7 ms  lag-107.ear3.London2.Level3.net [212.187.166.149]
  7     *        *        *     Request timed out.
  8     *        *        *     Request timed out.
  9     7 ms     7 ms     6 ms  be2871.ccr42.lon13.atlas.cogentco.com [154.54.58.185]
 10    70 ms    69 ms    70 ms  be2101.ccr32.bos01.atlas.cogentco.com [154.54.82.38]
 11    73 ms    73 ms    74 ms  be3600.ccr22.alb02.atlas.cogentco.com [154.54.0.221]
 12    84 ms    85 ms    84 ms  be2879.ccr22.cle04.atlas.cogentco.com [154.54.29.173]
 13    90 ms    90 ms    90 ms  be2718.ccr42.ord01.atlas.cogentco.com [154.54.7.129]
 14   143 ms   142 ms   143 ms  po111.asw02.sjc1.tfbnw.net [173.252.64.102]
 15   114 ms   119 ms   114 ms  be3036.ccr22.den01.atlas.cogentco.com [154.54.31.89]
 16   125 ms   126 ms   124 ms  be3038.ccr32.slc01.atlas.cogentco.com [154.54.42.97]
 17    91 ms    92 ms    91 ms  po734.psw03.ord2.tfbnw.net [129.134.35.143]
 18    91 ms    93 ms    90 ms  157.240.36.97
 19    74 ms    74 ms    73 ms  a.ns.facebook.com [129.134.30.12]
Trace complete.


Okay, let me tell you the difference between Facebook and everyone else, we don't crash EVER! If those servers are down for even a day, our entire reputation is irreversibly destroyed! Users are fickle, Friendster has proved that. Even a few people leaving would reverberate through the entire userbase. The users are interconnected, that is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go, don't you get that?


At this point in time, I doubt this holds true.

Facebook is just too big and pervasive that such an outage would be treated by its users like an internet outage or a power outage. Once it's back online, everyone will forget.


Is this a quote?


Yes, it's from "The Social Network". It's a scene where Mark Z. is explaining to Eduardo how important it is that the servers stay up all the time.

Of course it was, as far as I know, fictionalized in the first place, although it rings true (in context) to some extent. What I wonder is how much that is true now. That is, how much downtime would FB have to experience for enough users to start leaving, to the point that it might prompt a serious exodus?


Google suggests it's from the movie The Social Network.


Confirming that we're seeing a major outage with all of our integrations with FB products.


Is this in some way connected to the Facebook data leak of 1.5 billion users? The timing seems quite odd that both these things happen around the same time.


A lot of addicts will now feel what it is like to live in reality, not scrolling dumb images. I hope this becomes a tradition at least once a month.


I am an addict. I refresh this thread waiting for information: fb is back online.

I need help but this is too hard for me. I uninstall social media once a week but reinstall it two days later.

I should probably go to therapy for this, but I am not sure I wouldn't be laughed at.


https://heyfocus.com/ worked for me, maybe it’ll help (if you’re on Mac). Addiction to social media is a real problem; thousands of engineers are paid to make sure that these products ensnare your attention. It wouldn’t be odd if it takes a few bucks of your own to rescue yourself. Don’t hesitate to seek help, no one will laugh at you.


And here we are, looking at the xth top-level comment on HN...


Is there any site that tracks number of users for messaging apps? I'd be really curious to see if signal\telegram\etc are seeing a big bump.


What I find weird is that there is no indication in the app that nothing is working. I just get a cached view of everything I've seen the last few days.

Which is a feature I hate, since it does that all the time even when I have a connection. It says there are 3 comments on a post when I know there are more; opening them doesn't show them, and there's no way to refresh. But going to the web page I can see them.


  % ping whatsapp.com
  ping: whatsapp.com: Name or service not known
  % ping web.whatsapp.com
  ping: web.whatsapp.com: Name or service not known
  % ping facebook.com
  ping: facebook.com: Name or service not known
  % ping instagram.com
  PING instagram.com (31.13.65.174) 56(84) bytes of data.
  64 bytes from 31.13.65.174 (31.13.65.174): icmp_seq=1 ttl=53 time=110 ms


I thought Facebook, Instagram and WhatsApp ran on different infrastructure (and they've been trying for a while to align everything)?

How could they all go down at the same time, if they have different teams of engineers running each product separately?

Could anyone with some background (or person familiar with the matter) explain how their system's set up?


Seems unrelated to their infrastructure; the DNS records for facebook.com, instagram.com, whatsapp.com and all derivative domains appear to have been wiped clean.

edit: though saying that, they do run their own registrar... Might've fucked something up over there.


WhatsApp and Instagram are both in FB infra. As I understand it, Instagram is fairly integrated with FB services; when I left in 2019, WhatsApp was less so, it was mostly WhatsApp specific containers running with FB's container orchestration on FB machines dedicated to WhatsApp (there was and probably is some dependence on FB systems for some parts of the app, for example the server side of multimedia is mostly a FB system with some small tweaks and specific settings, but chat should be relatively isolated). Inbound connection loadbalancing is shared though.

FWIW, WhatsApp (on phones) should be resilient to a DNS-only outage: the clients contain fallback IPs to use when DNS doesn't work, and internal services don't use DNS as far as I remember.

At one time, WhatsApp had actually separate infrastructure at SoftLayer (IBM Cloud now), but that hasn't been in place for quite some time now. When I left, it was mostly just HAProxy to catch older clients with SoftLayer IPs as their DNS fallback.
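
A toy sketch of the "try DNS, then fall back to a baked-in address" pattern described above; the hostname and fallback IP are placeholders (a TEST-NET documentation address), not real WhatsApp endpoints:

    HOST=chat.example-messenger.net                    # hypothetical service hostname
    IP=$(dig +short +time=2 +tries=1 "$HOST" | head -n1)
    if [ -z "$IP" ]; then
      IP=203.0.113.10                                  # hardcoded fallback, used only when DNS fails
    fi
    echo "connecting to $IP ..."                       # a real client would open its TCP/TLS session here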


I noticed that some websites are loading slowly due to the third party script https://connect.facebook.net/en_US/fbevents.js timing out.

When uBlock Origin is running, this script gets blocked and pages return to feeling snappy.
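
If you don't run a blocker, a crude equivalent is to point the pixel's hostname at a non-routable address in /etc/hosts (remember to remove it again later):

    echo "0.0.0.0 connect.facebook.net" | sudo tee -a /etc/hosts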


Their stock is down 5% too. Everything is down for them today ;)


Great dip to buy. (I'm not a facebook zealot, but you know it will recover today or tomorrow once the DNS is sorted in an hour or two)

Of course there is the whistle-blower issue too...


I don’t think stock dip is related to downtime; anecdotally, I’ve never seen a company’s stock affected by downtime (unless that downtime destroys the business)


You may be right, but there's a Reuters article about the downtime; this is making the news today. I would say Facebook is different because of their scale.

Looks like there are a few problems with fb in the news today ...


Oculus is also down


Indeed. People seem to forget that when Facebook goes down, it's not just your feed of depressing posts, photos and messages that go away, but also the entire Oculus VR platform, since they demanded a FB account to use Quest headsets.


Hacker News also got so much slower; is it the load from people flocking here after not being able to reach FB?


Either many HN users are in glee over FB's potential demise

...or many HN users are also avid FB users (and now have to resort to backup sources of entertainment)


If I wanted to know if a site is down for everyone or just me, I would check twitter/hn first before checking the down detector sites


Gotta post this every time there's a big DNS issue, which seems like daily now.

Check out Dug! It's a global DNS propagation/monitoring tool on the CLI: https://github.com/unfrl/dug/


Is there any place to see how the overall internet bandwidth usage has changed during this outage?


One real potential cost to FB here is breaking people's addictions to FB and IG. This might just be the little finger-snap that wakes a sizable chunk of the user base up to the fact that life is just a little better during the outage.


Outage is top story on CNN and Fox. Facebook is not returning their calls. Sheera Frenkel at the New York Times has been able to get a little more info, but not much.

Now Twitter is starting to have problems with overload.


Seems to be affecting all Facebook properties.


Instagram just returns a 503. Crazy how closely everything seems to be integrated.

I’d guess internal networking issues, but it's insane that something can bring down all of Facebook's properties.


This outage is huge. I'm waiting for the write-up, assuming they release one


"When you are strong, appear weak."


Who else sees their deleted messages on WhatsApp that shouldn't be there?

https://news.ycombinator.com/item?id=28749652


Interestingly, their .onion site¹ is also down.

https://en.wikipedia.org/wiki/Facebook_onion_address


The onion site is just a reverse proxy to the main website. So if the main site is down (due to internal DNS or BGP issues), the onion reverse proxy can't get to it either.


This is expected. I guess, the internal DNS is down, so the whole infrastructure is broken


DNS servers of a major internet provider in the Czech Republic are down now. Probably not a coincidence (other DNS servers' stats show increased traffic, so my guess is that Vodafone's DNS servers were unable to cope with the increased traffic and crashed https://twitter.com/BlazejKrajnak/status/1445063232486531099).

It's crazy that half the country doesn't have internet because Facebook stopped working.


Alle Störungen shows a massive spike in problems for every service it keeps track of: https://xn--allestrungen-9ib.de/


Interesting; I have been noticing a lot of services are unstable today. I wonder if there is a larger outage.


tracert 129.134.30.12

Tracing route to a.ns.facebook.com [129.134.30.12] over a maximum of 30 hops:

  1     1 ms     1 ms     1 ms  eehub.home [192.168.1.254]
  2     3 ms     3 ms     3 ms  172.16.14.63
  3     *        5 ms     3 ms  213.121.98.145
  4     5 ms     3 ms     4 ms  213.121.98.144
  5    17 ms     8 ms    18 ms  87.237.20.142
  6     8 ms     6 ms     7 ms  lag-107.ear3.London2.Level3.net [212.187.166.149]
  7     *        *        *     Request timed out.
  8     *        *        *     Request timed out.
  9     7 ms     7 ms     6 ms  be2871.ccr42.lon13.atlas.cogentco.com [154.54.58.185]
 10    70 ms    69 ms    70 ms  be2101.ccr32.bos01.atlas.cogentco.com [154.54.82.38]
 11    73 ms    73 ms    74 ms  be3600.ccr22.alb02.atlas.cogentco.com [154.54.0.221]
 12    84 ms    85 ms    84 ms  be2879.ccr22.cle04.atlas.cogentco.com [154.54.29.173]
 13    90 ms    90 ms    90 ms  be2718.ccr42.ord01.atlas.cogentco.com [154.54.7.129]
 14   143 ms   142 ms   143 ms  po111.asw02.sjc1.tfbnw.net [173.252.64.102]
 15   114 ms   119 ms   114 ms  be3036.ccr22.den01.atlas.cogentco.com [154.54.31.89]
 16   125 ms   126 ms   124 ms  be3038.ccr32.slc01.atlas.cogentco.com [154.54.42.97]
 17    91 ms    92 ms    91 ms  po734.psw03.ord2.tfbnw.net [129.134.35.143]
 18    91 ms    93 ms    90 ms  157.240.36.97
 19    74 ms    74 ms    73 ms  a.ns.facebook.com [129.134.30.12]
Trace complete.

This is what I got now.


In the post-mortem, we'll find out that Facebook's alerting and comms systems all run on Facebook. As a result, they can't even coordinate the restart to roll back changes.


I'm genuinely not sure if the reports I heard of employees being locked out of the systems they need to fix it because their network is down are jokes or true.


The timing of this is so rich in irony I can't help but wonder if there is an element of internal sabotage. How many FB employees hate FB right now? The latest expose of FB is both effective and truly awful. I can't imagine feeling good about a FB job. And it's gotten worse! Now they look like they can't even keep their websites up.


Perhaps we'll find out. As fun as internal sabotage would be, schadenfreude-wise, I think it's much more likely this will turn out to be a time when Hanlon's Razor applies.


Can we really ever know? There are millions of $ at stake!


Is facebook being down causing hacker news to get the hug of death??


I feel for the sysadmins who are fighting ulcers and migraines at the moment, but I can't shake feeling that the world is just a little bit better for this small window of time.


Somebody just had their very own "onosecond".

https://www.youtube.com/watch?v=X6NJkWbM1xk

The video is one that Tom Scott published in June 2020 about the worst typo he ever made in one of his prior jobs, and while the Facebook mistake is almost certainly not going to be anything irrecoverable like this one, you can bet that Facebook pride themselves on being available all the time.


I would have thought that these companies, which are richer than $GOD, would have (virtual) instances of at least the previous stable version available for situations such as this. It would at least keep their damn doors open and internal communications systems going... Maybe they'll NOW think of such things? What's the cliche, penny wise and pound foolish? Or is it, no need to listen to experienced Network Designers? I can never remember...


Most of what they do, they do with in-house tools and custom everything, including hardware. As a consequence, for some classes of problems there are no experts - not at Facebook, not anywhere.

I feel for their netops people. Uncharted territory with the whole world watching and, no doubt, a lot of morons from management trying to be "helpful" in getting this nice crisis resolved. For any crisis there is always a bunch of clowns with MBAs who consider it their golden opportunity to shine (nearly always at someone else's expense).


My father (https://bit.ly/3acZAAI), who is a certified CCSP Ethical Hacker and formerly worked @ZScaler/Checkpoint/Palo Alto Networks, would say that there are basically two scenarios: someone like him did it intentionally or someone like him did it by mistake.

Any other scenario of outsiders, code updates, etc - basically misses the point of how modern DNS infrastructure works.


WhatsApp too. This seems pretty big.


Did I just read that the Facebook IRC fallback went down too?!? I was about to say what’s wrong with freenode ( but yeah on 2nd thoughts let’s not talk about freenode )


What would a full day of WhatsApp outage mean for the world?


Doesn't seem too clever that Facebook's NS servers are a.ns.facebook.com, b.ns.facebook.com etc. IIRC that kind of setup requires some glue records.


If you mean because the name servers are in the same zone, this is very common. When an NS is returned for a zone, you also get an “additional” A and AAAA to resolve the NS name. It’s called glue.

    dig NS example.com
    ; ANSWER
    example.com. NS ns1.example.com.
    ; ADDITIONAL
    ns1.example.com. A 1.2.3.4
Edit: I didn’t see your glue comment when I wrote this.


Cheers, I'd edited my post.

Thought the common wisdom nowadays was to use nameservers on different TLDs and sub-labels for the best resilience.

/added, they seem to have glue records so I'd assume it's the nameservers themselves having issues.

    $ dig NS @g.gtld-servers.net. a.ns.facebook.com

    ;; AUTHORITY SECTION:
    facebook.com.        172800  IN  NS    a.ns.facebook.com.
    facebook.com.        172800  IN  NS    b.ns.facebook.com.
    facebook.com.        172800  IN  NS    c.ns.facebook.com.
    facebook.com.        172800  IN  NS    d.ns.facebook.com.

    ;; ADDITIONAL SECTION:
    a.ns.facebook.com.   172800  IN  A     129.134.30.12
    a.ns.facebook.com.   172800  IN  AAAA  2a03:2880:f0fc:c:face:b00c:0:35
    b.ns.facebook.com.   172800  IN  A     129.134.31.12
    b.ns.facebook.com.   172800  IN  AAAA  2a03:2880:f0fd:c:face:b00c:0:35
    c.ns.facebook.com.   172800  IN  A     185.89.218.12
    c.ns.facebook.com.   172800  IN  AAAA  2a03:2880:f1fc:c:face:b00c:0:35
    d.ns.facebook.com.   172800  IN  A     185.89.219.12
    d.ns.facebook.com.   172800  IN  AAAA  2a03:2880:f1fd:c:face:b00c:0:35


In this context, I remember the YouTube+Pakistan issue[1]. I also wonder how an AS/BGP manager does their job... I imagine someone changing a text file in an old console. Does anyone know?

[1] https://www.infoworld.com/article/2648947/youtube-outage-und...


Suspecting it might be related to the recent Let's Encrypt root certificate expiring? I was just debugging an issue earlier today and couldn't help wondering how much of the internet is secured by Let's Encrypt.

All of the static hosts providing free SSL: vercel, netlify, render, firebase hosting, github pages, heroku etc. ...

It does work on modern browsers and devices but breaks terribly on a lot of old devices.


Obviously not possible to check right now to provide proof, but I feel quite confident in saying that Facebook does not use Let's Encrypt. It's also clearly not an SSL issue.


You're right. Fb doesn't seem to be using letsencrypt.

https://crt.sh/?q=facebook.com

On a side note, the amount of phishing sites using letsencrypt and having a domain similar to facebook.com is quite appalling.


I would like that say that after my "burn it down" comments on another Facebook related post that I had nothing to do with this.


https://www.status.fb.com/ is back online now


Just thinking about all the conspiracy theories you could make of this. Yesterday pandora papers, today the internet stops working.


Facebook outage is now the top story on CNN and Fox. Facebook stock down 5%. Facebook is not returning calls from Fox, or CNN.


Anyone wanna estimate the cost of total downtime for facebook and instagram, as far as lost ad revenue goes - per minute?


Looks like someone built a counter: https://facebookadloss.facebookadloss.repl.co/

Edit: the counter just jumped from 10B to 60M, so I doubt it's any reliable :)


When I worked there they were all about open source projects to build it themselves and control the service. Well, when your whole company is run on one DNS service this is going to bite you in the butt.

I only know of a handful of SaaS apps they didn’t build internally. Sadly none of those will help them get out of this situation.


Reminds me of a story Jack Fresco used to tell, where financial workers were unable to get to work because a bridge was not usable. People were worried about terrible consequences if all these important people were unable to do their work. To their surprise, life just continued as if nothing had changed.


Do you mean Jacque Fresco?


yes


Reading the thread, I'm surprised at the number of nearly identical "How much do we have to pay to keep it down? xD" posts I'm seeing, often from throwaway accounts. Some accounts with multiple near-identical posts within the same minute.

Could this be a coordinated smear in HN comments?


I think a lot of people just think facebook is bad for society. I do.


some poor engineer is sobbing over a split brain mnesia cluster right now praying to get the thing back up.


This is all left brain implementation with looping and classic complexity coming home to roost. As we move through time, we build off of solutions of the past which are solving a problem, but complexity keeps adding on, and this is a classic programming/computer science dilemma.


Imagine if they need access to fb.com email to re-enable the access for the on-site technician.


FB seems to be finally loading for me, after nearly 6 hours.

This will be a highly discussed topic for a bit.


I'm curious if this extended outage will do anything to curb the dopamine addiction caused by Facebook.

For example, will FB addicts experience a day of repeated failed attempts to get their FB fix, which will then condition them to stop trying.


I unblocked Facebook in my hosts file just now so I could message someone and couldn't figure out why Facebook failed to load. I tested HN and voila, I see that the entire world has sent Facebook requests to 0.0.0.0 lol


You broke it.

I didn't receive expected WhatsApp messages and am only now realizing there's no indication within the app that there is even a problem. It only becomes (somewhat) apparent when sending a message never gets a single check mark. Not a graceful failure for the user view.


When I click the HN link I am presented with facebook login page. I don't have an account so can not proceed.

Is the article link just https://facebook.com/ ?


No, it’s back up by now.

The link was to show that facebook was down.


Perhaps allowing Facebook, WhatsApp and Instagram to merge was efficient after all - now that they have synchronized outages, people finally have a chance to get on with their lives, free of clickbait news and misinformation.


Looks like the routes to their hosted nameservers are down, e.g. A.NS.FACEBOOK.COM


Aren't there places around poorer countries where Facebook is basically an ISP? What about them?

https://tcrn.ch/3kOHco1 2Africa cable, as an example


Aren't there places around poorer countries where Facebook is basically an ISP? What about them? They have literally 0 info.

https://tcrn.ch/3kOHco1



As are Instagram and WhatsApp


Can productivity (or emotional stability) for the overall US economy be tracked on a daily basis? I wonder if a wholesale Facebook outage would show up on that graph as a brief blip in the positive direction.


it is probably unrelated but HN is crawling


I find myself a little bit happy that it's down. I use Facebook quite often, but mostly because everyone else I know uses it. If everyone is forced to find an alternative, that'd be fine by me.


They made their own BGP tools and it looks like they failed https://www.youtube.com/watch?v=wHfYUbKNEyc


Many local governments use FB to get info out.

Events like this show they should use multiple outlets instead of the big monopoly.

Alternatives like Gab exist, but it's incredibly hard to gain traction against the big monopolies.


somewhere an engineer is begging a mnesia instance to come back online.



I am soooo happy I did not even notice! Don’t use any of these apps.


Was receiving an error page, now just a server not responding error.


I can't even resolve Facebook.com


Mobile app hangs too (I am from Italy btw)


Just wondering - would the engineer who made the mistake be fired?


The only person I've ever heard of being fired for an operational error was a principal networking engineer at Amazon who end-ran DNS policies and hand-edited a zone file. Somehow, the file got truncated. It brought down everything including the soft phones so people couldn't even spin up a phone-based conference call to deal with it. I think Amazon was down for several hours, with 8 digit losses. That was in the mid 2000's. Heard that person was fired but don't know for sure.


If a single person can cause the failure during the course of their normal tasks, it's not the fault of that person it's the fault of designers of the systems and processes used by that person.


"Asking for a friend."

I kid. If it were to come down to a single person, that's really a failure of the whole organization system and not of the individual.

This apocryphal [1] punchline to the Jack Welch story also sums up how most orgs deal with this sort of thing:

"I just spent a million dollars on your education - why would I fire you now?"

[1]: http://www.nickmilton.com/2016/03/jack-welch-on-learning-fro...


This question doesn't deserve downvotes. While the answer is quite clearly in the negative (this will be a process failure, not a human failure), it looks as though it was asked in good faith, and might not be so obvious to those outside the industry.

Vote buttons are not a substitute for proper responses to legitimate enquiry.


How could anyone answer that question? We don't even know that an engineer made a mistake in the first place, much less what the mistake was and what led up to it.


Thanks everyone for providing the insights. I have no ill intention, just asking for curiosity's sake.



That’s not the culture at facebook


"Move fast and break things". Yeah, it's exactly the opposite. The person should be promoted ;)


What makes you think it was a mistake? What makes you think an engineer did it?

Sometimes things just break and take time to fix.


If they are then Facebook is worse than I thought.


nope. an individual is never blamed for these sort of issues.


Not only Facebook, but also Google, Zoom, Telegram, YouTube and many more internet services/products/providers, starting from 8:00 AM today. This is more like an internet outage.


No. All the ones you mentioned are up.


Youtube and google are definitely working for me without any problems (haven't tested telegram or zoom).


Would be very interesting if they release the RCA to the public


Glad to report that Facebook's DNS in China is not affected. You can dig facebook.com and the depths of the internet happily reply with a random IP address as usual.


I'm pretty sure this has been building up since the morning (Germany). I've had odd connectivity problems to a number of sites including slack for a moment.


Hacker News also got so much slower; is it the load from people flocking here after not being able to reach FB?

[I'm also getting server error trying to submit this comment]


Sir, I don't care who you are, you must open a ticket.


Rajeesh FFS, get off HN! We have the world on fire!


Getting everything back up again will probably be a nightmare. Imagine all the internal services trying to reach a consistent state after such a long outage.


Does everyone just buy in that this is just a network change gone wrong? OR could they be mitigating a breach/hack? OR could it be some other theory?



Rejoice! The revolution has started!

Yes, I know humor is not welcome in HN


Lichess Android app is also down but not the webpage. Infinity app for Reddit is down. HN is super slow and "having trouble serving requests".


So they managed to remove facebook.com from 1.1.1.1 and 8.8.8.8. That is impressive. Not something anyone could achieve in such a short time even if they tried.


Facebook and Messenger are working now.

Instagram and WhatsApp - not yet.


I was hoping it would stay down for longer haha


There is a global outage this morning, starting from 8:00 AM. The list includes Google, TikTok, Zoom, Slack and of course FB products and services.


How much revenue does Facebook lose per hour of downtime?


This is very hard to get exactly right, because traffic isn't constant at all times, and you don't know if people won't just make up for lost time using facebook at another time of the day, etc. So you can't really know.

But, a good rule of thumb right now is about $10,000,000 per hour.


Facebook made $29 billion last quarter which translates to $315,217,391 per day. Divide that over 24 hours in a day, and it's ~$13 million per hour.

Of course, depends on the hour of day. Facebook likely makes more ad money when North America is awake than when Asia is awake for instance.
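
A quick sanity check of that arithmetic from the shell (integer math, so approximate):

    echo $(( 29000000000 / 92 / 24 ))   # quarterly revenue / days in the quarter / hours, ~13.1M USD per hour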


Reddit wasn’t working a few min ago. Broader issue?


Reddit goes down every 10 mins anyway.


this person uses reddit


Makes me think: why don't porn sites ever go down?


Phrasing!


facebook.com resolves again!

    > ping facebook.com
    PING facebook.com (31.13.83.36) 56(84) bytes of data.
    64 bytes from edge-star-mini-shv-01-mad1.facebook.com (31.13.83.36): icmp_seq=1 ttl=54 time=12.2 ms
    64 bytes from edge-star-mini-shv-01-mad1.facebook.com (31.13.83.36): icmp_seq=2 ttl=54 time=12.1 ms
    64 bytes from edge-star-mini-shv-01-mad1.facebook.com (31.13.83.36): icmp_seq=3 ttl=54 time=11.7 ms

Can't yet traceroute to a.ns.facebook.com tho


I am sure it is just a symptom of the Facebook outage, but it seems like every website I am going on is slower than usual today.


Every Facebook App on every phone is DDoS-ing the DNS system

https://twitter.com/blazejkrajnak/status/1445063232486531099


I've seen a couple of sites fail to log in because their SSO is broken by this (even if FB login is merely an option).


One thing that triggers my OCD is leaving the Facebook session open, though it's my own computer.

Maybe it's DNS. It's always DNS.


Facebook went down (error page, then 503) before DNS went down.


Same for me.


> Maybe it's DNS

If it is, close Facebook as there's probably a BGP hijack going on that is siphoning off personal data and or secrets


I assume all of the tertiary sites that use "Login with Facebook" are broken now too? So glad I never adopted that.


When I do dig instagram.com I get an A response for this IP: 31.13.65.174 or similar addresses, which leads to an empty page.


This really is a fascinating case study in what truly resilient systems look like. More often than not, they are not centralized.


Is this caused by missing glue records? I can’t resolve any of FB’s nameservers. Anyone know how that could happen?


the glues are still there-- it's not a DNS issue but a network one. Their ASN has mostly been withdrawn from everywhere.


The glue records are fine from my end: dig -t NS facebook.com @a.gtld-servers.net


The joy that people are getting from this is quite shitty. I hate social media but there are people earning a living working for these companies. Like others have pointed out, businesses and neighborhood watches rely on tech like this. At some point we've all had sites/apps go down, in a situation like that the last thing you want is people enjoying it. The lack of empathy in this thread is telling.


It pales in comparison to the lack of empathy facebook has shown to its user herd.


Twitter seems to be a big buggy now too maybe just a coincidence. User comments under posts are not appearing.


It is about time to speculate about sabotage, a disgruntled employee or something more exotic.

All this BGP talk is boring.





It's back as of approx 14:47 PST.


Agreed, website loading here, still no whatsapp though.


This solves the disinformation problem


I keep getting non-DNS errors from Hacker News as well. There appears to be some sort of broader incident happening?

It's not just lag, I keep getting the "We're having some trouble serving your request. Sorry!" page.

Edit: HN related thread https://news.ycombinator.com/item?id=28749476


Here we can see why you should not have all your DNS servers in the same AS (in this case, AS32934).
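
A quick way to eyeball that for any zone is to map each nameserver's address to its origin AS. A sketch using dnspython plus Team Cymru's IP-to-ASN DNS interface (both assumed; the TXT format shown in the comment is approximate):

    import dns.resolver

    def nameserver_asns(zone):
        """Map each of a zone's nameservers to the origin-AS info reported by
        <reversed-ip>.origin.asn.cymru.com. If every entry starts with the same
        ASN, all the nameservers live in one AS."""
        result = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            host = str(ns.target)
            for a in dns.resolver.resolve(host, "A"):
                rev = ".".join(reversed(a.address.split(".")))
                txt = dns.resolver.resolve(rev + ".origin.asn.cymru.com", "TXT")
                result[host] = txt[0].to_text()  # e.g. "32934 | <prefix> | ..."
        return result

    # print(nameserver_asns("facebook.com"))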


Terrible day for many people. Both working for Facebook and those depending on their services.


well... this is unfortunate:

    ; <<>> DiG 9.10.6 <<>> facebook.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 36072
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1


Where could the physical data centers be that they need to access? How far away could it be?


Maybe; there are reports (i.e. unverified tweets) that employees cannot access sites due to the security systems also being down. I imagine email and messaging for employees would be down too.

It may be very hard for employees to get to the physical boxes, and/or bypass any physical or software security systems.


Delete Facebook / Instagram / WhatsApp when it comes back up. They are all trash.


Checked isitdownrightnow.com and said Netflix was also down. Any chance these are related?


Netflix is up


In other news, a bunch of people got a lot more work done today than normal I suspect...


Great time to take a break from Facebook and Instagram. Use Telegram instead of WhatsApp


Out of band management is an important feature for the reliability of your network.


Does anyone have a reasonable guess on how much money they have already lost?


Likely $0. Ad views lost now will likely be made up for later. And even if there is a reduction in views, it just makes other views more valuable. Facebook doesn't have real competitors, so the money isn't going anywhere else.


This is a great argument for the antitrust authorities to break up Facebook. Allowing the big social media companies to buy each other creates a single point of failure. If Instagram and WhatsApp were separate companies, a technical disaster at one would not take out the other two.


It's funny that hackernews is now overloaded with distracted people ;-)


Also instagram and messenger


This is all total left brain looping and complexity coming home to roost.


Ignore.


19 hours ago??


DNS?


No, I'm getting their error page, so Load Balancers or whatever is behind. EDIT: Or at least not /just/ DNS.


Definitely DNS. facebook.com might be in your local DNS cache.
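
One way to tell a cached answer from a live one is to compare your configured resolver against a public one and look at the remaining TTL. A rough sketch with dnspython (assumed installed):

    import dns.resolver

    def lookup(server=None):
        r = dns.resolver.Resolver()
        if server:
            r.nameservers = [server]
        ans = r.resolve("facebook.com", "A")
        return [rr.address for rr in ans], ans.rrset.ttl

    # A low, "counting down" TTL from your default resolver usually means a
    # cached record; a public resolver that can't answer at all means the
    # authoritative servers really are unreachable.
    print("default resolver:", lookup())
    print("1.1.1.1:", lookup("1.1.1.1"))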


I'm getting an error page with a dead image link and a 2020 copyright date (UK)


Looks like someone found the light switch and turned everything back on!


Seems like Telegram went down with a big whatsapp-is-down hug of death.


Yep, Telegram stopped sending messages a while back and not loading at all for me now.


Is HN hit by something as well? It's loading really slow for me.


I think the mods throttled logged in users to discourage over-discussion and thread creation. There’s no rhyme or reason for read-only to be snappy and logged-in to be crawling.


My home connection with my ISP (Vodafone Ireland) is down, so I guess the churn from FB's BGP route withdrawals was big enough to blow up Vodafone's network. Is it a DNS or a routing issue?


It's always a DNS problem


It still amazes me that their infra team is supposedly the best in the world, and compensated as such, yet things like this happen.

Personally I'm glad FB went down for a few hours, but it's hard to imagine how that would happen in the first place.


Expecting to get messages on WhatsApp alternatives tonight ...


India runs on WhatsApp. They'll have more backups now.


I just got the login page. It was fun while it lasted.


Should say "Facebook owned. Sites are down."


For me even this site loads very, very slowly. Pinging Google's name server is fast as usual. It could be a wider problem, not just FB-related.



Looks like HN is being hit pretty hard right now?


Hopefully forever.


Just a side note: I think Outlook is down as well.


Anyone able to connect with WhatsApp at the moment?


it is always DNS!


At 21:44 UTC, facebook.com resolves for me.


That explains the new contacts in Telegram.


Any way to read this without an account?


> Facebook-owned sites are down

And the world rejoiced.


I like it, it feels like it's 1999.


Does this mean Oculus is unusable now?


It's not hyperbole to say that this is going to literally save lives.

Cutting off Facebook's firehose of hate and misinformation for just a couple of hours is going to have an obvious positive effect on millions of people. At this scale, at least one person will get vaccinated today because they didn't see the wall of ignorance that is FB's news feed.

Maybe we should introduce "digital blue laws", where one day a week, social media is shut down for the overall good of society.


between outage and whistleblower, this has got to be the worst day in facebook's life


The value of FB is directly proportional to traffic flow; curious why FB is down ;)


If it bleeds we can kill it!


Thought experiment: what if they were down for a week and the world completely healed itself?


They seem to be back now.


Why is this not the first post on hn? It’s double the points and has over 590 comments?


wonder how much of internet traffic as a whole is down now..


Whatsapp down as well


Interesting Timing!


pings to a.ns.facebook.com are no longer timing out


still down, people going to signal to have a chat


Have they tried turning it off and then turning it back on again?


When I noticed HN was loading slowly, I already knew FB was down.


So how much cash do we need to pitch in to keep it down?


frankly, who cares? Seriously.

Those services have been toxic for years now and everybody knows it. Whoever still uses them occasionally, let alone relies on them, can't be helped, can they?


do we know what caused the DNS outage at FB?


Facebook hacked?


It's like a nuclear bomb exploded on the internet.


The internet felt a little more fresh and clean today.


Let this be permanent - not a huge loss for humanity.


Telegram seems down too, is it down for you?


now if only tiktok would fail


Ha-ha!


Fixed


[redacted]


What evidence is there to suggest this is due to Facebook being down?


Hopefully it never comes back up!


Imagine this happening to AWS


Good, now I can go for lunch.


Makes me wonder, why don't porn sites ever seem to go down?






Oh no


hahaha, good riddance


ok mom i commented


good riddance


Let's hope it's permanent.


Of all the big tech companies, Facebook is the only one that could completely disappear overnight and my life would be completely unaffected (or possibly improved by not having to explain to people that I don't use Facebook, so please email or text me your invitations rather than using Messenger). If Google, Amazon, Netflix or Apple disappeared, the story would be completely different.


Facebook is an unparalleled titan in the realm of advertising, and WhatsApp is basically a utility-level communication system for a big chunk of the globe. Instagram is a key cultural driver of the Western world. You may not feel any direct firsthand consequences, but the overall impact would transform the world around you.


> Facebook is an unparalleled titan in the realm of advertising

Uh, Google? It's definitely paralleled, and also preceded


Yeah, but would there be any drawbacks?


I find this kind of comment fascinating because it's illustrative of how humans can form intentional blindspots as to the utility of a person or institution when all they care about are the negative aspects of that person's or institution's existence.

    op: "I don't care about thing X disappearing"
    re: "While you may not care about it because of Y, X also provides benefit Z to other people"
    op: "But would there be any drawbacks?"
Yeah, there would be drawbacks: other people would lose Z, which may matter a lot to them even if it doesn't matter to you. Someone just told you about Z, and you responded as if you hadn't been told about Z.

These days I find it incredibly frustrating to deal with people who have conclusively decided they don't like something and that renders them incapable of acknowledging other benefits that said thing provides even if those benefits aren't relevant to them or are less relevant than the things they vocalize caring about.


I can agree with the "intentional blindspots" argument but turn it right around.

I'd like to explicitly note that the parent post did not say "X also provides benefit Z to other people"; it asserted "Facebook is an unparalleled titan in the realm of advertising", which is a substantially different thing. That is not merely something some people don't care about and other people benefit from, and treating those two statements as equivalent is a (very large) intentional blindspot. The current way advertising is done (driven, in part, by FB) is also a harm to many people and society at large, so publicly making the implicit assumption that "advertising" is at most neutral is not okay; it's something that should be called out.

This very "unparalleled titan in the realm of advertising" aspect is a major cost on society, a net harm that perhaps should be tolerated if it's outweighed by some other benefits FB provides (such as the "utility-level communication system for a big chunk of the globe"), but as itself it's definitely not something that should be treated as benign just because some people get paid for it.

If FB advertising disappeared with no other drawbacks, that would be a great thing. Of course, there are some actual drawbacks, but even so it's quite reasonable to motivate people to ask about the actual drawbacks of FB being down, because "oh but ads" (with which the grandparent post started) is not one.


Thank you, I agree with everything you said here. But I'd also like to address the other things I was answering with the drawbacks quip...

> WhatsApp is basically a utility-level communication system for a big chunk of the globe.

Unfortunately, it's not an actual utility though, which is precisely my point. It's pure folly to build your business around a pseudo utility owned by a private company.

> Instagram is a key cultural driver of the Western world.

I honestly have no idea how this is being presented as a good thing. A "key cultural driver of the western world" is an app whose entire purpose is to harvest your data and sell it to dodgy partners who will use it to usurp democracy.


There are several people earning their living through Facebook/Instagram and there is a whole marketplace that would impact lots of people. Don't get me wrong, I don't use or like FB in any way but FB disappearing overnight would definitely have drawbacks for lots of people.


Replace Facebook in your post with human trafficking :)

Obviously I'm not serious, and it's a popular sentiment here that "Fuck Facebook... oh, but I use Instagram and WhatsApp of course!", but the point is that "some people earn a living on x" isn't really a great counterargument to "x is harmful and we might be better off without it".


There would be a massive opening for new platforms to take over, and the odds that they are also based in the West would be much lower.


What's the advantage of using a Chinese platform instead of Facebook in terms of privacy, freedom of speech or political influence?


yes. I want to see what my friends and acquaintances are up to.


My time on Facebook made it abundantly clear how racist, misogynist and otherwise vile a large portion of the people I grew up with are. I was much happier having a superficial contact with them once every ten years at a high school reunion. I'm no longer on Facebook (or Twitter).

Occasionally, I'll see/hear/do something and think that it would have made a good status update/tweet, but then I remember that these things have happened to me for decade before social media was a thing and life was fine. Some I'll share with my wife or a friend, most just disappear and that's fine too.


People seem to not know that you can unfriend or at minimum unfollow people on Facebook.

Why did you put up with racist and misogynistic people on your feed? Why did you feel the need to delete your account instead of unfollowing people?

My feed is nice and clean, with family, some friends, and some pages.


I did. Facebook also spent a lot of time dumping stuff in my newsfeed from people I wasn't friends with (Twitter also liked to do this). It was a lot easier to just not have all that crap in my life.


An act of unfriending someone is interpreted as hostile action. It's much easier just to not be there in the first place.


Then unfollow. They don't know if you unfollow.


How about you call them to set up a meeting to catch up?


Why? It is not as efficient. I can buy everything from stores but I use amazon, same thing. I don't actually use facebook though because I don't care about anyone really but for people that care, it is a solid platform.

There is a gap between "I want to know what people I know are up to" and "I want to meet with those people one by one to see what they are up to". Some people just want to passively watch and that is ok.


> Some people just want to passively watch and that is ok.

And this is the culprit for loneliness.


I don’t. Why would I need to know more than they decide to tell me? I got enough shit on my mind.


> You may not feel any direct firsthand consequences, but the overall impact would transform the world around you.

For the better.


"Are you alright? What's wrong?"

"I felt a great disturbance in the DNS. As if millions of influencers suddenly cried out in terror and were suddenly silenced. I feel something terrible has happened. But you better get on with your content curation."


> Facebook is an unparalleled titan in the realm of advertising

Not unparalleled - Google exists.

And we need less advertising, not more.

> and WhatsApp is basically a utility-level communication system for a big chunk of the globe.

Many other such systems exist - Telegram, Signal, Google Chat.

> Instagram is a key cultural driver of the Western world

Western culture will get along just fine without Instagram.

> the overall impact would transform the world around you.

For the better.


Facebook is implicated in genocide in multiple countries, and Instagram is nothing but a psychotic lie factory designed to induce depression and self loathing in young women.

The world would only improve if it disappeared.


Facebook is an unparalleled titan in the realm of consumer manipulation

There I fixed it for you.


Weird, because over here WhatsApp is ingrained into the social fabric of your life. Couldn't imagine ever going back to texting/iMessage.


WhatsApp, like the metric system, is a “literally everywhere but the US” thing. I’ve never once seen it used in the US.


But your friend groups would probably be able to migrate to Signal/Discord/Hangouts/etc quite quickly if WhatsApp were to disappear, no? WhatsApp has the network effect on its side by way of existing, but that could change quickly if given a push.


Sure. But you might not get everyone back - you'd have to have an alternate method of talking to the folks to get them to switch and meet up in the same place. You'd have this if the service just slowly died (like landlines), but not if something breaks instantly - forever. I'm guessing we've all had this when games died (especially old text-based MMORPG's, for example. So many people gone).


At least with WhatsApp you do have the contact's phone number, so you can reach them via SMS if necessary.


Do you have to open WhatsApp and connect to the servers to access the number? (Honest question here, I've never used it)


You have to open the app, but you can see phone numbers in airplane mode.


People would just use an alternative, like Telegram or whatever is the next most popular one.


Having trouble doing calls on Telegram now - I guess because of the shift in load to Telegram


After using Telegram, WhatsApp is a complete piece of garbage, if it disappeared from the face of the earth it would be sure for the best as people would move on to alternative messengers.


Does Telegram have E2E messages by default, and using a sensible encryption protocol? If not, I disagree.


IIRC, E2E by default for audio/video; for text chats, it can be enabled by marking the chat as 'secret'. Is it true E2E? Probably not (i.e. Telegram has keys that can be turned over to any government; no one argues with that)

Does WhatsApp have a true E2E either? Ask hundreds of moderators employed by Facebook who review WhatsApp messages flagged as improper and the chat history around them...

However, accepting the fact that neither of the services is truly secure, Telegram experience as a service is much better for an average user.


> for text chats, can be enabled by marking chat as 'secret'. Is it true E2E? Probably not (i.e. Telegram has keys that can be turned over to any government, noone argues with that)

That was my problem, and your confirmation means it's still as good as nothing.

> Does WhatsApp have a true E2E either? Ask hundreds of moderators employed by Facebook who review WhatsApp messages flagged as improper and the chat history around them...

If one of the ends decides to share a message, it's still E2E. That is the big difference.


> If one of the ends decides to share a message, it's still E2E. That is the big difference.

True. But you can't prove that "one of the ends" must necessarily be a human and not the logic in the app code, or an intended backdoor? E.g., an automated logic scanning for 'malicious' messages on-device.


I still remember the era when the "in" messenger changed every 2-3 years: ICQ -> AIM -> MSN Messenger -> Google Chat, etc.

Changing messaging apps not the most convenient thing in the world, but it's not some kind of IT cataclysm. Plenty of WhatsApp competitors exist.


Small and medium businesses would suffer as well, since many use WhatsApp as a sales channel now.


Maybe Google, because of the search engine. Android: somebody will fill the void.

Messaging: people have been switching on hordes to every new free messaging system in the 90s and early 2000s, we will adapt to something else.

Netflix and video in general: same thing without the 90s/early 2000s.

Amazon: very convenient store, we'll spend a little less and somebody will fill the void.

Apple: can't say, never bought anything from them.

By the way, when I couldn't message on WA today I thought they had finally cut me off because I still hadn't accepted their new privacy policy from months ago :-) I resolved to wait and see for a couple of days.


I dunno. If AWS went away suddenly, or if Google Search/the G-Suite suddenly stopped existing, the internet as we know it would need some time to recover.


> Messaging: people have been switching on hordes to every new free messaging system in the 90s and early 2000s, we will adapt to something else.

Back then the IM population was a lot smaller. Also, with "Free Basics" and other programs in some regions of the world, Facebook plays a game that makes it impossible to switch. (Using WhatsApp is free; for other apps one has to buy mobile data credits.)


Man, out of those, only Netflix going down wouldn't cause a clusterfuck worth billions of dollars for people, businesses and companies. It's nice that you don't use them, but just about everyone around you does, and mostly for at least some important things.


I am surprised you have Netflix on the list. It would be annoying for 2 minutes, then you can simply go for a walk or read a book.


Or watch movies and shows using one of the many alternatives to Netflix.


Facebook is the only one of those that I regularly use. I'd like something like Google's Android to stay around. The rest I don't need.


disappearance of FB might not impact you, but India runs on WhatsApp.


With Facebook, whatsapp and Instagram down, it feels like the entire internet is down for me.


All of the big tech companies you mentioned could completely disappear overnight and my life would be completely unaffected or possibly improved.


Yes, it's very possible; on the other hand, it would give an opportunity to companies that are closer to people and that pay users for their data. After all, users are the asset that generates very significant revenue; it would be great if they shared the profits!




> "Idk man, this seems like a tough issue. Maybe we should just give up"

> "Ok." - Zuckerburg

shuts down $100B company


Unfortunately WhatsApp replaced texting for around 80% of the world.


Not a fan of FB, but the main reason for WhatsApp's success was SMS sucking hairy balls.


It didn't help that telecoms used to use SMS as an extreme profit center. I don't think WhatsApp would have taken over the way it did if SMS was always included in all plans for free. This is similar to the way "local" long distance used to be such a racket.


Most UK plans included unlimited SMS for a long time, but whatsapp still took over.

The group chat functions don't really exist in SMS (maybe in MMS but they never work properly), photos (same), whatsapp desktop, you can text when you have WiFi but no 4G (or using a different sim card when travelling), etc.


No problem, they still have phone numbers of those people so they can send them SMS with Signal invitation. :)



The problem for me is Oculus. I really love their headset and I appreciate the investment Facebook has made in that.

I hate the stupid strategy tax that makes me have an FB account to use their headset, and has it go down when they have an outage. I hope they can learn from MSFT that "Facebook Everywhere" is ultimately a self defeating strategy.


I talk to my parents in India every day. WhatsApp is the only game in town there.


I hope that you can install other messaging apps as well?


Not OP. They _can_ but good luck trying to convince parents of that. They're not tech savvy enough to install apps themselves. They have simple questions about why Whatsapp cannot be installed in a basic Nokia phone for instance. It's not easy to convince them to use Signal or Telegram or anything else.


Why?


Because combined with the abysmal state of education in most places, and a general lack of government action, Facebook is an actual threat to our civilization.


People unfortunately love the upsides of misinformation, or perhaps it's the format that makes it easy to build community around shared (misinformed) values, to rally in battles that rage for hours or days for a cause you deeply believe in and can follow by digesting 30-second soundbites on social media and 30-minute videos on YouTube.

People will do this wherever they can talk in a group online, not just Facebook properties. It's... pretty bad actually, I think the only tool that exists right now is censorship, because the bullshit gets created, spread, and wholeheartedly received way faster than debunking will.

And censorship is a power that can't be safely entrusted to anyone.


I don't necessarily disagree, but often I hear FB or other tech companies like Twitter singled out re: misinformation. News media contributes to misinformation and contributes to a warped, partisan, permanently-in-catastrophe-mode population just as much as FB, Twitter, and other mediums.

I doubt, if FB goes away, that any of the issues you're implying will go away or even get much better. In fact, the lack of a real look into the negative effects of consumer news product reinforces this idea that only the elite can know the truth, and the masses just have to get in line and shut up.

News media proliferated nonsense from fed sources to justify the Iraq war, they gave Trump 24/7 airtime for a while because it increased ratings. They constantly forgo any real accountability for their actions, and pretend that they aren't just another addictive consumer product that warps peoples' brains.


Why?

The reasons I've seen are:

> it creates a risk of bad self-image for young girls

It's a parent's job to educate their children. There are much worse things than Facebook out there.

> it collects data

Literally no harm in knowing that someone is interested in JavaScript, cats and fetish porn, and targeting ads to that user.

> it's addictive

So is sex, marijuana, and collecting stamps.

> it helps organize protests

Good.


> It's a parent's job to educate your children. There are much worse things than Facebook out there.

I'm guessing that either you're not a parent, or your kids aren't teens.

But most parents of teens realize that kids, and especially teens, are often much more influenced by things like social media and peers (and peers via social media) than by the influence their parents have on them.


It actively uses its algorithm to radicalize racists and conspiracy theorists, and when it discovered that's what it was doing decided to keep doing it because it was good for the bottom line:

https://www.businessinsider.com/facebook-pushes-qanon-racism...


An alternate explanation is that the algorithm tries to promote engagement and user retention. Presumably, people susceptible to radicalization engage with the content discussed in the article. It would be unreasonable to expect Facebook to not act in its own self-interest.


> An alternate explanation is that the algorithm tries to promote engagement and user retention. Presumably, people susceptible to radicalization engage with the content discussed in the article. It would be unreasonable to expect Facebook to not act in its own self-interest.

That's the whole point. Oh they're just trying to make a buck like everyone else is exactly the problem.

They are a running a paperclip maximizer that turns passive consumers of misinformation into "engaged" radicals and the system that is Facebook has no incentive to correct this.

https://en.wikipedia.org/wiki/Instrumental_convergence


Any algorithm that can maximize engagement can be tuned to minimize radicalization and dissemination of hatred and fascism.

I'd argue that it's absolutely in Facebook's self-interest to reduce their active role in promoting fascism, racism, homophobia, etc.


To recap, you seem to be concerned that all social media are allowing posts to become popular, and those posts sometimes promote hatred towards conservatives or liberals.

Two questions:

- What do you think should be done about the legacy media that is doing the same?

- Should social media promote boring posts, or actively censor political content in favour of a certain viewpoint, or anything else? Perhaps a real-life name registration for anyone with over 1000 followers, like in China?


> those posts sometimes promote hatred towards conservatives or liberals.

Incorrect assertion. Those posts promote hatred and/or violence toward humans for traits those humans did not choose. e.g. race, sexual orientation, etc.

Legacy media aren't actively amplifying the voices and recruiting efforts of white supremacists.

Facebook is. They acknowledge that they are. They chose to actively allow and encourage it for profit.


Eh, Twitter's worse.


b/c most humans are on the wrong end of fb's covert, exploitative attention-manipulation


One American example quote that holds true for countries outside of America:

Direct quote: "That website on Facebook."

There are people who believe that "Facebook" literally equals "Internet". Facebook, Internet ... Internet, Facebook.

Rinse and repeat for your alternative echo chamber regarding Google, the Microsoft Bing, &c.


Because he doesn't like the website so he thinks nobody else should be able to use it.


You win for best comment


The rumor mill suggests that Facebook badge readers are also down, causing problems for people trying to get to the servers to fix them manually.

https://twitter.com/sheeraf/status/1445099150316503057?s=21


Maybe intentional?

Zuck trying to give an example of what a world without FB would look like, kinda saying to detractors what would happen if they had it their way.


Maybe "intentional" in quotes. My money is on a major security breach and they've shut everything down until they can deal with it. Even if you go to Instagram by the IP address [1], you get a 400 error. So it looks like things are off line because they want them off line for now.

[1] https://31.13.65.174/


That would be classic Zuck right there


[flagged]


Yeah let's keep it that way.


good


Good.


Signal FTW


...Including that one


Thoughts and prayers...


Hope it's permanent


I wish it would stay that way.


"It's always DNS."


rich people serve their revenge for pandora papers


They must have shut it off and turned it back on


Good riddance.


So how much do we need to pitch in to keep it down?


Oh no...

Anyway...


WhatsApp is pretty important infrastructure for most of the world


Which is regrettable when secure alternatives exist like Signal and Matrix whose business model doesn't involve selling your data.


Yeah I'm not saying I like it


For some egregiously loose definitions of "infrastructure," maybe.


The same definition that includes phone lines also includes the messaging service everybody uses


In parts of South America it's used for all sorts of things. Want to know when your bus is arriving? The bus company likely only knows because the driver is WhatsApp'ing them status updates.


And "important"


World productivity just grew by 10%


or it went down by another 20%, everybody at first thinking there's something wrong with their internet connection.


Everything is a f*king Facebook problem


The down page shows copyright from 2020 smh


Finally! A small (or not so small) outage for FB, a large benefit for mankind :)


The whistleblower said she wanted to fix Facebook.

Mission accomplished, I'd say. For now at least.


We can only hope that they will be gone forever... and HN is having major issues at the same time!


No like for you


Hrm, BGP and DNS. It's weird when decades-old technology fails like this. The main reason distributed systems are hard is the time component: whenever you add timeouts to an algorithm, everything becomes orders of magnitude more difficult to reason about, because the number of states grows without bound. In any case, this is an epic outage, and sad.
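
The canonical illustration: once a call can time out, "it failed" and "it succeeded but the reply got lost" become indistinguishable, and every caller has to encode a policy for that. A tiny sketch (Python; send_request is a hypothetical RPC stub, not a real API):

    import socket

    def call(send_request, timeout_s=2.0):
        """After a timeout there are at least three indistinguishable states:
        the request was lost and never processed, it was processed but the
        reply was lost, or the peer is just slow and the reply is still coming.
        The caller must pick a policy (retry? assume failure?) without knowing which."""
        try:
            return ("ok", send_request(timeout=timeout_s))
        except socket.timeout:
            return ("unknown", None)  # not the same thing as "failed"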


Teen depression and suicide rates plummeting right now


Perhaps tomorrow the brave man or woman responsible for this beautiful screw-up will step forward on HN for a standing ovation. Whoever did this, thank you! As a souvenir I took a screenshot on my phone.


that's not how post-mortems work


downdetector looks like a real mess for it.

I'm going to parrot the other comment here and say nothing of value was lost.

https://downdetector.com/status/facebook/


From [0]

> ...there is no limit to the scandals, leaks, whistleblowers, lawsuits or penalties that will bring the Facebook mafia down.

Fine. 'Literally' bringing the Facebook mafia down like that would do.

But only for now.

[0] https://news.ycombinator.com/item?id=28742179


This would be a golden opportunity to launch your 'Facebook Killer' app. Preferably a social network where people don't pay with their data, but with, you know, a thing called Money.


Who would pay money to be the first user of a new social network?


I guess the "prophets" at Victory Channel / Flashpoint called down Holy Fire on the Facebook infrastructure in retribution ... https://youtu.be/FbSkFuvqFdA?t=1127 . (I'm an Evangelical Christian but those folks are nuts ... Mario Murillo, Lance Wallnau, Hank Kunneman, Gene Bailey, etc.)



