There's still no connectivity to Facebook's DNS servers:
> traceroute a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
1 dsldevice.attlocal.net (192.168.1.254) 0.484 ms 0.474 ms 0.422 ms
2 107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1) 1.592 ms 1.657 ms 1.607 ms
3 71.148.149.196 (71.148.149.196) 1.676 ms 1.697 ms 1.705 ms
4 12.242.105.110 (12.242.105.110) 11.446 ms 11.482 ms 11.328 ms
5 12.122.163.34 (12.122.163.34) 7.641 ms 7.668 ms 11.438 ms
6 cr83.sj2ca.ip.att.net (12.122.158.9) 4.025 ms 3.368 ms 3.394 ms
7 * * *
...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.
"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".
I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.
"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.
That's what you have to do to really own a domain.
Out of curiosity, I looked up how much it costs to become a registrar. Based on the ICANN site, it is $4,000 USD per year, plus variable fees and transaction fees ($0.18/yr). Does anyone have experience or insight into running a domain registrar? Curious what it would entail (aside from typical SRE-type stuff).
Wow, I had no idea it was so cheap[1] once you're a registrar. The implication is that anyone who wants to be a domain squatting tycoon should become a registrar. For an annual cost of a few thousand dollars plus $0.18 per domain name registered, you can sit on top of hundreds of thousands of domain names. Locking up one million domain names would cost you only $180,000 a year. Anytime someone searched for an unregistered domain name on your site, you could immediately register it to yourself for $0.18, take it off the market, and offer to sell it to the buyer at a much inflated price. Does ICANN have rules against this? Surely this is being done?
[1] "Transaction-based fees - these fees are assessed on each annual increment of an add, renew or a transfer transaction that has survived a related add or auto-renew grace period. This fee will be billed at USD 0.18 per transaction." as quoted from https://www.icann.org/en/system/files/files/registrar-billin...
Personally saw this kind of thing as early as 2001.
Never search for free domains on the registrar site unless you are going to register it immediately. Even whois queries can trigger this kind of thing, although that mostly happens on obscure gTLD/ccTLD registries which have a single registrar for the whole TLD.
I can sadly attest to this behavior as recently as a couple years ago :(
I searched for a domain that I couldn't immediately grab (one of the more expensive kind) using a random free whois site... and when I revisited the domain several weeks later it was gone :'(
Emailed the site's new owner D: but fairly predictably got no reply.
Lesson learned, and thankfully on a domain that wasn't the absolute end of the world.
I now exclusively do all my queries via the WHOIS protocol directly. Welp.
Probably every major retail registrar was rumored to do this at some point. Add to your calculation that even some heavyweights like GoDaddy (IIRC) tend to run ads on domains that don't have IPs specified.
Network Solutions definitely did it. I searched for a few domains along the lines of "network-solutions-is-a-scam.com", and watched them come up in WHOIS and DNS.
I didn't know that, and you're right. For anyone who's interested, I found the following references regarding the $8.39 additional fee for a .com registration:
This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; that bootstrap is what keeps you out of a circular dependency. All that said, it's usually a bad idea. Someone as large as Facebook should have nameservers spread across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
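To see that bootstrap in practice, here's a minimal sketch using the third-party dnspython package (assuming dnspython 2.x is installed): it asks a .com TLD server directly for facebook.com's delegation and prints the glue addresses that come back in the additional section, which is exactly what lets a resolver find ns1.abc.com-style nameservers without already having the zone's own DNS.

    # Minimal sketch with dnspython (pip install dnspython); assumes
    # 192.5.6.30 is still a.gtld-servers.net, one of the .com TLD servers.
    import dns.message
    import dns.query

    query = dns.message.make_query("facebook.com", "NS")
    response = dns.query.udp(query, "192.5.6.30", timeout=5)

    print("Delegation (authority section):")
    for rrset in response.authority:
        print(rrset)

    print("Glue (additional section):")
    for rrset in response.additional:
        # These A/AAAA records for a.ns.facebook.com etc. are the glue that
        # breaks the circular dependency described above.
        print(rrset)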
There is always a step that involves emailing the domain's contact when a domain updates its information with the registrar. In this case, facebook.com and registrarsafe.com are managed by the same NS. You need those NS to resolve the MX records to deliver that update-approval email and unblock the registrar update. Glue records are more for performance than for breaking that loop. I'm maybe missing something, but hopefully they won't need to send an email to fix this issue.
I have literally never once received an email to confirm a domain change. Perhaps the only exception is on a transfer to another registrar (though I can't recall that occurring, either).
To be fair, we did have to get an email from EURid recently for a transfer auth code, but that was only because our registrar was not willing to provide it.
In any case, no, they will not need to send an email to fix this issue.
I just changed the email address on all my domains. My inbox got flooded with emails across three different domain vendors. If they didn't do it before, they sure are doing it now.
This is not true when you're the registrar (as in this case); in fact, your entire system could be down and you'd still have access to the registry's system to do this update.
Facebook does operate its own private registrar, since it operates tens of thousands of domains. Most of these are misspellings, domains for other countries, and so forth.
So yes, the registrar that is to blame is Facebook itself.
Source: I know someone within the company that works in this capacity.
> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.
That’s not how it works. The info on whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar-to-registrar technical communication to occur.
Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
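As a concrete illustration of why the dig shortcut misfired: NXDOMAIN usually means "unregistered", but a registered domain whose nameservers are unreachable fails differently (SERVFAIL/timeouts), and a sloppy checker that treats any failure as "available" gets it wrong. A hedged sketch with dnspython; the authoritative answer should still come from the registry (EPP domain:check / DAS), not from the DNS.

    # Sketch of a dig-style "is it available?" heuristic and its failure mode.
    import dns.resolver
    import dns.exception

    def looks_unregistered(domain: str) -> bool:
        try:
            dns.resolver.resolve(domain, "NS", lifetime=5)
            return False                                   # delegation answers: registered
        except dns.resolver.NXDOMAIN:
            return True                                    # usually means unregistered
        except (dns.resolver.NoAnswer,
                dns.resolver.NoNameservers,
                dns.exception.Timeout):
            # Registered but broken (nameservers down or unreachable) --
            # exactly facebook.com's situation during the outage.
            return False

    print(looks_unregistered("facebook.com"))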
When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses, so there's no circular-dependency issue.
It's partially there. C and D are still not in the global tables according to RouteViews, i.e. 185.89.219.12 is still not being advertised to anyone. My peers to them in Toronto have routes from them, but I'm not sure how far they are supposed to go inside their network. (Past hop 2 is them.)
% traceroute -q1 -I a.ns.facebook.com
traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
1 torix-core1-10G (67.43.129.248) 0.133 ms
2 facebook-a.ip4.torontointernetxchange.net (206.108.35.2) 1.317 ms
3 157.240.43.214 (157.240.43.214) 1.209 ms
4 129.134.50.206 (129.134.50.206) 15.604 ms
5 129.134.98.134 (129.134.98.134) 21.716 ms
6 *
7 *
% traceroute6 -q1 -I a.ns.facebook.com
traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
1 toronto-torix-6 0.146 ms
2 facebook-a.ip6.torontointernetxchange.net 17.860 ms
»The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background, so everyone who runs large-scale DNS is being slammed, with knock-on impacts elsewhere the longer this goes on.«
You just need to get a large enough block so that you can throw most of it away by adding your own vanity part to the prefix you are given. IPv6 really isn't scarce so you can actually do that.
My suspicion is that since a lot of internal comms runs through the FB domain and since everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.
I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
> I remember my first time having a meeting at Facebook and observing none of the doors had keyholes and thinking "hope their badge system never goes down"
Every internet-connected physical system needs to have a sensible offline fallback mode. They should have had physical keys, or at least some kind of offline RFID validation (e.g. continue to validate the last N badges that had previously successfully validated).
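A minimal sketch of the "validate the last N previously seen badges" idea, as a small local cache on the door controller; all names here are hypothetical, not any real access-control API.

    # Hypothetical sketch of an offline fallback for a badge reader:
    # keep the last N successfully validated badge IDs locally so doors
    # still open for recent, known-good badges when the backend is down.
    from collections import OrderedDict

    class OfflineBadgeCache:
        def __init__(self, capacity: int = 500):
            self.capacity = capacity
            self._recent = OrderedDict()          # badge_id -> last-seen marker

        def record_success(self, badge_id: str) -> None:
            """Called whenever the online system approves a badge."""
            self._recent[badge_id] = True
            self._recent.move_to_end(badge_id)
            while len(self._recent) > self.capacity:
                self._recent.popitem(last=False)  # evict least recently used

        def allow_offline(self, badge_id: str) -> bool:
            """Fallback check used only when the backend is unreachable."""
            return badge_id in self._recent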
I have no doubt that the publicly published post-mortem report (if there even is one) will be heavily redacted in comparison to the internal-only version. But I very much want to see said hypothetical report anyway. This kind of infrastructural stuff fascinates me. And I would hope there would be some lessons in said report that even small time operators such as myself would do well to heed.
Around here we use Slack for primary communications, Google Hangouts (or Chat or whatever they call it now) as secondary, and we keep an on-call list with phone numbers in our main Git repo, so everyone has it checked out on their laptop, so if the SHTF, we can resort to voice and/or SMS.
I remembered to publish my cell phone's real number on the on-call list rather than just my Google Voice number since if Hangouts is down, Google Voice might be too.
We don't use tapes, everything we have is in the cloud, at a minimum everything is spread over multiple datacenters (AZ's in AWS parlance), important stuff is spread over multiple regions, or depending on the data, multiple cloud providers.
Last time I used tape, we used Ironmountain to haul the tapes 60 miles away which was determined to be far enough for seismic safety, but that was over a decade ago.
One of my employers once forced all the staff to use an internally-developed messenger (for sake of security, but some politics was involved as well), but made an exception for the devops team who used Telegram.
Why? Even if it's not DNS reliance, if they self-hosted the server (very likely) then it'll be just as unreachable as everything else within their network at the moment.
I don't think it's cocky or 20/20 hindsight. Companies I've worked for specifically set up IRC in part because "our entire network is down, worldwide" can happen and you need a way to communicate.
My small org, maybe 50 IPs/hosts we care about, still maintains a hosts file for those nodes' public and internal names. It's in Git, spread around, and we also have our fingers crossed.
If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!
My bet is, FB will reach out to others in FAMANG, and an interest group will form maintaining such an emergency infrastructure comm network. Basically a network for network engineers. Because media (and shareholders) will soon ask Microsoft and Google what their plans for such situations are. I'm very glad FB is not in the cloud business...
> If only IRC would have been built with multi-server setups in mind, that forward messages between servers, and continues to work if a single - or even a set - of servers would go down, just resulting in a netsplit...Oh wait, it was!
yeah if only Facebook's production engineering team had hired a team of full time IRCops for their emergency fallback network...
Considering how much IRCops were paid back in the day (mostly zero as they were volunteers) and what a single senior engineer at FB makes, I'm sure you will find 3-4 people spread amongst the world willing to share this 250k+ salary amongst them.
I worked on the identity system that chat (whatever the current name is) and gmail depend on and we used IRC since if we relied on the system we support we wouldn’t be able to fix it.
Word is that the last time Google had a failure involving a cyclical dependency they had to rip open a safe. It contained the backup password to the system that stored the safe combination.
The safe in question contained a smartcard required to boot an HSM. The safe combination was stored in a secret manager that depended on that HSM.
The engineer attempted to restart the service, but did not know that a restart required a hardware security module (HSM) smart card. These smart cards were stored in multiple safes in different Google offices across the globe, but not in New York City, where the on-call engineer was located. When the service failed to restart, the engineer contacted a colleague in Australia to retrieve a smart card. To their great dismay, the engineer in Australia could not open the safe because the combination was stored in the now-offline password manager.
Safes typically have the instructions on how to change the combination glued to the inside of the door, and ending with something like "store the combination securely. Not inside the safe!"
But as they say: make something foolproof and nature will create a better fool.
Anyone remember the 90s? There was this thing called the Information Superhighway, a kind of decentralised network of networks that was designed to allow robust communications without a single point of failure. I wonder what happened to that...?
We are a dying breed... A few days ago my daughter asked me "will you send me the file on Whatsapp or Discord?". I replied I will send an email. She went "oh, you mean on Gmail?" :-D
I unfortunately cannot edit the parent comment anymore but several people pointed out that I didn't back up my claim or provided any credentials so here they are:
Google has multiple independent procedures for coordination during disasters. A global DNS outage (mentioned in https://news.ycombinator.com/item?id=28751140) was considered and has been taken into account.
I do not attempt to hide my identity here, quite the opposite: my HN profile contains my real name. Until recently a part of my job was to ensure that Google is prepared for various disastrous scenarios and that Googlers can coordinate the response independently from Google's infrastructure. I authored one of the fallback communication procedures that would likely be exercised today if Google's network experienced a global outage. Of course Google has a whole team of fantastic human beings who are deeply involved in disaster preparedness (miss you!). I am pretty sure they are going to analyze what happened to Facebook today in light of Google's emergency plans.
While this topic is really fascinating, I am unfortunately not at liberty to disclose the details as they belong to my previous employer. But when I stumble upon factually incorrect comments on HN that I am in a position to correct, why not do that?
Interesting that you are asking for the dirt given that DiRT stands for Disaster and Recovery Testing, at least at Google.
Every year there is a DiRT week where hundreds of tests are run. That obviously requires a ton of planning that starts well in advance. The objective is, of course, that despite all the testing nobody outside Google notices anything special. Given the volume and intrusiveness of these tests, the DiRT team is doing quite an impressive job.
While the DiRT week is the most intense testing period, disaster preparedness is not limited to just one event per year. There are also plenty of tests conducted throughout the year, some planned centrally, some done by individual teams. That's in addition to the regular training and exercises that SRE teams do periodically.
Because, it may shock you to know, but sometimes people just go on the Internet and tell lies.
No shit Google has plans in place for outages.
But what are these plans, and are they any good... a respected industry figure whose CV includes being at Google for 10 years doesn't need to go into detail describing the IRC fallback to be believed and trusted that there is such a thing.
I found a comment that was factually incorrect and I felt competent to comment on that. Regrettably, I wrote just one sentence and clicked reply without providing any credentials to back up my claim. Not that I try to hide my identity, as danhak pointed out in https://news.ycombinator.com/item?id=28751644, my full name and URL of my personal website are only a click away.
I've read here on HN that exactly this was the issue when they had one of their bigger outages (I think it was due to some auth service failure) and Gmail didn't accept incoming mail.
I think the issue there is that in exchange for solving the "one fat finger = outage" problem, you lose the ability to update the server fleet quickly or consistently.
The rate at which some Amazon services have lately gone down because other AWS services went down proves that this is an unsustainable house of cards anyway.
Sheera Frenkel
@sheeraf
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
From the Tweet, "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."
> Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
Disclose.tv
@disclosetv
JUST IN - Facebook employees reportedly can't enter buildings to evaluate the Internet outage because their door access badges weren’t working (NYT)
Oh I'm sure everyone knows what's wrong, but how am I supposed to send an email, find a coworker's phone number, get the crisis team on video chat, etc., if all of those connections rely on the facebook domain existing?
Hence the suggestion for PagerDuty. It handles all this, because responders set their notification methods (phone, SMS, e-mail, and app) in their profiles, so that when in trouble nobody has to ask those questions and just add a person as a responder to the incident.
Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?
PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?
The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and have an initial escalation for the right group within 15 minutes. When I left they were like 96% on target overall and 100% for infrastructure.
Even if it can’t, it’s trivial to use it for an important subset, i.e. is facebook.com down, is the NS stuff down, etc. So there is an argument to be made for still using an outside service as a fallback.
- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once
Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale it up by 1000. This is a historic day, I think it's quite likely that the scale of the outage will lead to the breakup of the company because it's the Big One that people have been warning about for years.
I guarantee you that every single person at Facebook who can do anything at all about this, already knows there's an issue. What would them receiving an extra notification help with?
We kind of got off topic, I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting) something like pager duty would be fine as a backup. Even at huge scale that backup doesn’t need to watch everything.
I don’t think it’s particularly relevant to this issue with fb. I suspect they didn’t need a monitoring system to know things were going badly.
I can imagine this affects many other sites that use FB for authentication and tracking.
If people pay proper attention to it, this is not just an average run of the mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...
Hopefully law makers all study up and pay close attention.
What transpires next may prove to be very interesting.
NYT tech reporter Sheera Frenkel gives us this update:
>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.
How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?
This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?
The voices, stories, announcements, photos, hopes and sorrows of millions, no, literally billions of people, and the promise that they may one day be seen and heard again now rests in the hands of Dave, the one guy who is closest to a Microcenter, owns his own car and knows how to beat the rush hour traffic and has the good sense to not forget to also buy an RS-232 cable, since those things tend to get finicky.
Yeah the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Ooops no DNS to resolve the MX records to send the patch to fix the BGP routers.
No. A network like Facebook's is vast and complicated and managed by higher-level configuration systems, not people emailing patches around.
If this issue is even to do with BGP it's much more likely the root of the problem is somewhere in this configuration system and that fixing it is compounded by some other issues that nobody foresaw. Huge events like this are always a perfect storm of several factors, any one or two of which would be a total noop alone.
On the other hand, I and my office mate at the time negotiated the setup of a ridiculous number of BGP sessions over email, including sending configs. That was 20 years ago.
I don't know. I doubt. It's just funny to think that you need email to fix BGP, but DNS is down because of BGP. You need DNS to send email which needs BGP. It's a kind of chicken and egg problem but at a massive scale this time.
Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.
You'd think they'd have worked that into their DR plans for a complete P1 outage of the domain/DNS, but perhaps not, or at least they didn't add removal of BGP announcements to the mix.
I would have expected a DNS issue to not affect either of these.
I can understand the onion site being down if Facebook implemented it the way a third party would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first-class citizen.
You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500.
The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.
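You can watch that situation from the outside. Here's a rough sketch that asks RIPEstat's public routing-status endpoint whether a prefix is visible in the global table; the /24 below just covers a.ns.facebook.com's address from the traceroutes earlier, and the exact response fields are an assumption, so check the API docs before relying on them.

    # Sketch: ask RIPEstat (public API) whether a prefix is visible in the
    # global routing table. Field names below are best-effort assumptions;
    # see https://stat.ripe.net/docs/data_api for the authoritative schema.
    import json
    import urllib.request

    PREFIX = "129.134.30.0/24"   # assumed prefix covering a.ns.facebook.com (129.134.30.12)
    url = "https://stat.ripe.net/data/routing-status/data.json?resource=" + PREFIX

    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)["data"]

    # If nobody is announcing the prefix, knowing the server's IP doesn't help.
    print("visibility:", data.get("visibility"))
    print("last seen:", data.get("last_seen"))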
Reddit r/Sysadmin user that claims to be on the "Recovery Team" for this ongoing issue:
>As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).
There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.
Part of this is also due to lower staffing in data centers due to pandemic measures.
User is providing live updates of the incident here:
* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).
* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.
* Update 1440 UTC:
As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).
There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.
Part of this is also due to lower staffing in data centers due to pandemic measures.
I mean, when I last worked in a NOC, we used to call ourselves "NOC monkeys", so yeah. If you're in the NOC, you're a NOC monkey; if you're on the floor, you're a floor monkey. And so on.
We even had a site and operation for a long while called:
"NOC MONKEY .DOT ORG"
We called all of ourselves NOC MONKEYS. [[Remote Hands]]
Yeah, that was a term used widely.
I'm 46. I assume you are < #
---
Where were you in 1997 building out the very first XML implementations to replace EDI from AS400s to FTP EDI file retrievals via some of the first Linux FTP servers based in SV?
They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.
Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.
Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.
Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of bgp failure, nuclear war, etc. I could see physical presence being an issue if their bgp router depends on something like a crypto module in a locked cage, in which case there's always helicopters.
So if anything, Facebook's labor policies are about to become cooler.
Yup, it's terrifying how much is ultimately dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.
Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)
I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.
The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.
It helps if your DSL line is bridging at layer 2 of the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.
That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.
I'm not sure why shareholders are lumped in here. A lot of reasons companies do the secret squirrel routine is to hide their incompetence from the shareholders.
You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."
In a company the size of Facebook, "everything is turned off" has never happened since before the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged themselves into the data-center boot-up flow.
The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
>we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
I think that suggests that there were not bigger fish to fry :)
I take your point on priorities, but in a company the size of facebook perhaps a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could be funded and part of the BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.
>> we're going to treat four simultaneous meteors as impossible. Which is great, but then one day, five of them hit at the same time.
> I think that suggests that there were not bigger fish to fry :)
I can see this problem arising in two ways:
(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.
(2) Growth of failure probabilities with system size: A meteor hit on earth is extremely rare. But in the future there might be datacenters in the whole galaxy, so there's a center being hit every month or so.
In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.
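A toy numerical illustration of point (1), with made-up probabilities: under independence, four simultaneous failures are absurdly rare; add a modest shared cause (say, one config pipeline that can take all four out together) and the odds jump by many orders of magnitude.

    # Toy numbers only -- illustrating why the independence assumption matters.
    p_single = 1e-4               # assumed chance one "meteor" hits in a given hour

    # Independent failures: all four at once.
    p_four_independent = p_single ** 4
    print(f"independent: {p_four_independent:.1e}")               # 1.0e-16

    # Shared cause (e.g. one bad config push) that takes all four out together.
    p_shared_cause = 1e-6
    p_four_with_shared_cause = p_four_independent + p_shared_cause
    print(f"with shared cause: {p_four_with_shared_cause:.1e}")   # ~1.0e-06, ten orders of magnitude worse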
> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."
The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.
It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.
Of course you can’t think of every potential scenario possible, but an incorrect configuration and rollback should be pretty high in any team’s risk/disaster recovery/failure scenario documentation.
This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.
#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.
In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.
(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)
#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem not post about it on reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.
You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.
Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.
Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.
But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.
I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.
It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nationstates anything they didn't already know.
Yeah but what is the tech public going to do with these insights?
It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.
It's pure idle chitchatter.
So yeah, I do give a shit about corporate here.
Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.
Facebook, the social media website is used, almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)
It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is back up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?
ramenporn violated the corporate communication policy, obviously, but the tone and approach for a good manager to an IC doing this online isn't to make it about corporate vs. them/the team, but in fact to encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and to nurture that rather than destroy it in the process of chiding them for breaking policy.)
> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.
You're conflating working remotely ("a plane ride away") and working from home.
You're also conflating the people who are responsible for network configuration, and for coming up with a plan to fix this, with the people who are responsible for physically interacting with systems. Regardless of WFH, those two sets likely have no overlap at a company the size of Facebook.
> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID
I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In a non-WFH setting you physically walk to your co-worker's desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms go down during WFH.
Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-workers' numbers (except one from WhatsApp, which if I worked at FB wouldn't be useful now).
It is a matter of preparation. You can make sure there are KVM-over-IP or other OOB technologies available on site to allow direct access from a remote location. In the worst case a technician has to know how to connect the OOB device or press a power button ;)
I'm not disagreeing with you, however clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't login to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.
I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.
Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.
It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.
I'll throw Discord into that mix, the thing that basically killed IRC. Which is yet again centralized despite pretending that it is not.
What are they afraid of? While they are sharing information that's internal/proprietary to the company, it isn't anything particularly sensitive, and having some transparency into the problem is good for everyone.
Who'd want to work for a company that might take disciplinary action because an SRE posted a reddit comment to basically say "BGP's down lol" - If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.
Seems reasonable that at a company of 60k, with hundreds who specialize in PR, you do not want a random engineer making the choice himself to be the first to talk to the press by giving a PR conference on a random forum.
Honestly, from a PR perspective, I’m not sure it’s so bad. Giving honest updates showing Facebook hard at work is certainly better PR for our kind of crowd than whatever actual Facebook PR is doing.
That one guy's comments seem fine from a PR perspective, apart from it not being his role to communicate for the company.
I still think he should be fired for this kind of communication though. One reason is, imagine Facebook didn't punish breaches of this type. Every other employee is going to be thinking "Cool, I could be in a Wired article" or whatever. All they have to do is give sensitive company information to reporters.
Either you take corporate confidentiality seriously or you don't. Posting details of a crisis in progress on your Reddit account is not taking corporate confidentiality seriously. If the Facebook corporation lightly punishes, scolds, or ignores this person then the corporation isn't taking confidentiality seriously either.
Reporters are going to opportunistically start writing about those comments vs having to wait for a controlled message from a communications team. So the reddit posts might not be "so bad", but they're also early and preempting any narrative they may want to control.
Compare Facebook's official tweet: "We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience."
I don't think Facebook could actually say anything more accurate or more honest. "Everything is dead, we are unable to recover, and we are violently ashamed" would be a more fun statement, but not a more useful one.
There will be plenty of time to blame someone, share technical lessons, fire a few departments, attempt to convince the public it won't happen again, and so on.
I agree completely. The target audience Facebook is concerned about is not techies wanting to know the technical issues. It's the huge advertising firms, governments, power users, etc., who have concerns about the platform or have millions of dollars tied up in it. A bland statement is probably the best here, and even if the one engineer gave accurate, useful info, I don't see how you'd want to encourage an org in which thousands of people feel the need to post about what's going on internally during every crisis.
Well, they could at least be specific about how large the outage is. "Some people" is quite different to absolutely everyone. At least they did not add a "might" in there.
Facebook is well known for having really good PR, if they go after this guy for sharing such basic info that's yet another example of their great PR teams.
A few random guesses (I am not in any way affiliated with FB); just my 2c:
Sharing status of an active event may complicate recovery, especially if they suspect adversarial actions: such public real-time reports can explain to the red team what the blue team is doing and, especially important, what the blue team is unable to do at the moment.
Potentially exposing the dirty laundry. While a postmortem should be done within the company (and as much as possible is published publicly) after the event, such early blurbs may expose many non-public things, usually unrelated to the issue.
> Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.
I did not read it as they can't get them on site, but rather that it takes travel to get them on site. Travel takes time, which they desperately do not want to spend.
> If I was in charge I'd give them a modest EOY bonus for being helpful in their outreach to my users in the wider community.
That seems pretty unlikely at any but the smallest of companies. Most companies unify all external communications through some kind of PR department. In those cases usually employees are expressly prohibited from making any public comments about the company without approval.
Unrelated to the outage, but I hate headlines like this.
Facebook is down ~5% today. That's a huge plunge to be sure, but Zuckerberg hasn't "lost" anything. He owns the same number of shares today as he did yesterday. And in all likelihood, unless something truly catastrophic happens the share price will bounce back fairly quickly. The only reason he even appears to have lost $7 billion is because he owns so much Facebook stock.
Unlikely to be related. FB's losses today already happened before FB went down, and are most likely related to the general negative sentiment in the market today, and the whistleblower documents. It's actually kind of remarkable how little impact the outage had on the stock.
As much as all of the curious techies here would love transparency into the problem, that doesn't actually do any good for Facebook (or anyone else) at the moment. Once everything is back online, making a full RCA available would do actual good for everyone. But I wouldn't hold my breath for that.
Do we even know if someone had the account deleted? I think Facebook might have their hands full right now solving the issue rather than looking at social media posts that discuss the issue.
There are a lot of people who work at Facebook, and I'm sure the people responsible for policing external comms do not have the skills or access to fix what's wrong right now.
I remember some huge DDOS attacks like a decade ago, and people were speculating who could be behind it. The three top theories were Russian intelligence, the Mossad, and this guy on 4chan who claimed to have a Botnet doing it.
That was the start of living in the future for me.
This felt like something straight out of a postmodern novel during the whole WSB press rodeo, where some usernames being used on TV were somewhere between absurd and repulsive.
The problem with tweets on transgender bathrooms is that you can be attacked for them by either side at any point in the future, so the user OverTheCounterIvermectin should have known better.
Curious what the internal "privacy" limitations are. Certainly FB must track the reddit user : FB account mapping even if they don't actually display it. It just makes sense.
Well, you want the right people to have access. If you're a small shop or act like one, that's your "top" techs.
If you're a mature larger company, that's the team leads in your networking area on the team that deal with that service area (BGP routing, or routers in general).
Most likely Facebook et al. management never understood this could happen because it's "never been a problem before".
I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and can easily lock yourself out on a regular basis. Every single system has a local user with a random password that we can hand out for just this kind of circumstance...
Organizational complexity grows super-linearly; in general, the number of people a company can hire per unit time is either constant or grows linearly.
Google once had a very quiet big emergency that was, ironically(1), initiated by one of their internal disaster-recovery tests. There's a giant high-security database containing the 'keys to the kingdom', as it were... Passwords, salts, etc. that cannot be represented as one-time pads and therefore are potentially dangerous magic numbers for folks to know. During disaster recovery once, they attempted to confirm that if the system had an outage, it would self-recover.
It did not.
This tripped a very quiet panic at Google because while the company would tick along fine for awhile without access to the master password database, systems would, one by one, fail out if people couldn't get to the passwords that had to be occasionally hand-entered to keep them running. So a cross-continent panic ensued because restarting the database required access to two keycards for NORAD-style simultaneous activation. One was in an executive's wallet who was on vacation, and they had to be flown back to the datacenter to plug it in. The other one was stored in a safe built into the floor of a datacenter, and the combination to that safe was... In the password database. They hired a local safecracker to drill it open, fetched the keycard, double-keyed the initialization machines to reboot the database, and the outside world was none the wiser.
(1) I say "ironically," but the actual point of their self-testing is to cause these kinds of disruptions before chance does. They aren't generally supposed to cause user-facing disruption; sometimes they do. Management frowns on disruption in general, but when it's due to disaster recovery testing, they attach to that frown the grain of salt that "Because this failure-mode existed, it would have occurred eventually if it didn't occur today."
Thanks for telling this story as it was more amusing than my experiences of being locked in a security corridor with a demagnetised access card, looooong ago.
EDIT: I had mis-remembered this part of the story. ;) What was stored in the executive's brain was the combination to a second floor safe in another datacenter that held one of the two necessary activation cards. Whether they were able to pass it to the datacenter over a secure / semi-secure line or flew back to hand-deliver the combination I do not remember.
If you mean "Would the pick-pocket have access to valuable Google data," I think the answer is "No, they still don't have the key in the safe on the other continent."
If you mean "Would the pick-pocket have created a critical outage at Google that would have required intense amounts of labor to recover from," I don't know because I don't know how many layers of redundancy their recovery protocols had for that outage. It's possible Google came within a hair's breadth of "Thaw out the password database from offline storage, rebuild what can be rebuilt by hand, and inform a smaller subset of the company that some passwords are now just gone and they'll have to recover on their own" territory.
Maybe because they were planning for a million other possible things to go wrong, likely with higher probability than this. And busy with each day's pressing matters.
Anyone who has actually worked in the field can tell you that a deploy or config change going wrong, at some point, and wiping out your remote access / ability to deploy over it is incredibly, crazy likely.
That someone will win the lottery is also incredibly likely. That a given person will win the lottery is, on the other hand, vanishingly unlikely. That a given config change will go wrong in a given way is ... eh, you see where I'm going with this
Right, which is why you just roll in protection for all manner of config changes by taking pains to ensure there are always whitelists, local users, etc. with secure(ly stored) credentials available for use if something goes wrong; rather than assuming your config changes will be perfect.
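One concrete pattern in that spirit (not claiming it's what Facebook uses) is the commit-confirm / dead-man's-switch approach some router OSes offer: apply the change, and unless someone can still reach the box to confirm within a window, it reverts on its own. A rough, vendor-agnostic sketch with hypothetical hooks:

    # Rough sketch of a commit-confirm style safety net for risky config pushes:
    # if the operator can no longer reach the device to confirm, the change
    # reverts on its own. apply_config/rollback_config/confirm_received are
    # hypothetical callables, not a real vendor API.
    import threading

    def apply_with_auto_rollback(apply_config, rollback_config,
                                 confirm_received, window_seconds=300):
        apply_config()                               # push the risky change

        def revert_if_unconfirmed():
            if not confirm_received():               # nobody could get back in to confirm
                rollback_config()                    # restore the last known-good config

        timer = threading.Timer(window_seconds, revert_if_unconfirmed)
        timer.start()
        return timer                                 # cancel this once the change is confirmed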
I'm not sure it's possible to speculate in a way which is generic over all possible infrastructures. You'll also hit the inevitable tradeoff of security (which tends towards minimal privilege, aka single points of failure) vs reliability (which favours 'escape hatches' such as you mentioned, which tend to be very dangerous from a security standpoint).
Absolutely, and I'd even call it a rite of passage to lock yourself out in some way, having worked in a couple of DCs for three years. Low-level tooling like iLO/iDRAC can sure help out with those, but is often ignored or too heavily abstracted away.
Exactly! Obviously they have extremely robust testing and error catching on things like code deploys: how many times do you think they deploy new code a day? And at least personally, their error rate is somewhere below 1%.
Clearly something about their networking infrastructure is not as robust.
Most likely they did plan for this. Then, something happened that the failsafe couldn't handle. E.g. if something overwrites /etc/passwd, having a local user won't help. I'm not saying that specific thing happened here -- it's actually vanishingly unlikely -- but your plan can't cover every contingency.
Agreed. It's also worth mentioning that at the end of every cloud is real physical hardware, and that is decidedly less flexible than cloud; if you lock yourself out of a physical switch or router, you have many fewer options.
In risk management cultures where consequences from failures are much, much higher, the saying goes that “failsafe systems fail by failing to be failsafe”. Explicit accounting for scenarios where the failsafe fails is a requirement. Great truths of the 1960s to be relearned, I guess.
My company runs copies of all our internal services in air-gapped data centers for special customers. The operators are just people with security clearance who have some technical skills. They have no special knowledge of our service inner workings. We (the dev team) aren’t allowed to see screenshots or get any data back. So yeah, I have done that sort of troubleshooting many times. It’s very reminiscent of helping your grandma set up her printer over the phone.
For all the hours I spent on the phone spelling out grep, ls, cd, pwd, raging that we didn't keep nano instead of fucking vim (and I'm a vim person)... I could have stayed young and been solving real customer problems, not imperium-typing on a keyboard with a 5s delay because a colleague is lost in the middle of nowhere and can't remember what file he just deleted and the system doesn't start anymore. Your software is fragile, just shite.
Yes, and it works if both parties are able to communicate using precise language. The onus is on the remote SME to exactly articulate steps, and on the local hands to exactly follow instructions and pause for clarifications when necessary.
Sometimes the DR plan isn't so much that I have to have a working key; I just have to know who gets there first with a working key, and "break glass" might be literal.
Not OP, but many times. Really makes you think hard about log messages after an upset customer has to read them line by line over the phone.
One was particularly painful, as it was a "funny" log message I had added to the code for when something went wrong. Lesson learned: never add funny / stupid / goofy fail messages in the logs. You will regret it sooner or later.
This is not new; this is everyday life, with helping hands, on-duty engineers, and L2/L3 levels telling people with physical access which commands to run, etc., etc.
The places I've seen this at had specific verification codes for this. One had a simple static code per person that the hands-on guys looked up in a physical binder on their desk. Very disaster proof.
The other ones had a system on the internal network in which they looked you up, called back on your company phone and asked for a passphrase the system showed them. Probably more secure but requires those systems to be working.
This is not a real datacenter case but ordinary social engineering. On the datacenter side you have many more security checks, plus much of the time the remote hands and engineers are part of the same company, using internal communication tools, etc., so they are on the same logical footing anyhow.
Imagine that guy has a big npm repository locally, with all those dodgy libraries of uncontrolled origin sitting in /lib/node_modules with root permissions.
For something as distributed as Facebook, do multiple somebodies all have to race down to each individual datacenter and plug their laptops into the routers?
As someone with no experience in this, it sounds like a terrifying situation for the admins...
Interesting that they published stuff about their BGP setup and infrastructure a few months ago - maybe a little tweak to rollbacks is needed.
"... We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates..."
Surely Facebook don't update routing systems between data centres (IIRC the situation) when they don't have people present to fix things if they go wrong? Or have an out-of-band connection (satellite, or dial-up (?), or some other alternate routing?).
I must be misunderstanding this situation here.
[Aside: I recall updating wi-fi settings on my laptop and first checking I had direct Ethernet connection working ... and that when I didn't have anything important to do (could have done a reinstall with little loss). Is that a reasonable analogy?]
Joking aside, I can see how an IRC network has potential to be used in these situations. Maybe FAMANG should work together to set something like this up. The problem is, a single IRC server is not fail safe, but a network of multiple servers would just see a netsplit, in which case users would switch servers.
Also, I remember back in the IRCnet days simply using telnet to connect to IRCnet just for fun and sending messages, so it's a very easy protocol that can be used in a global disaster scenario (just the PING replies were annoying in telnet).
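For anyone who never tried it: a minimal hand-typed session looks roughly like this (server name, nick and channel are made-up placeholders, not any real network):
> telnet irc.example.net 6667
NICK backupnick
USER backupnick 0 * :Emergency comms
PING :irc.example.net          <- the server sends this periodically
PONG :irc.example.net          <- you have to answer it by hand, the annoying part
JOIN #outage
PRIVMSG #outage :hello from raw telnet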
I heard the same thing from my old coworker who is at FB currently. All of their internal DNS/logins are broken atm so nobody can reach the IRC server. I bet this will spur some internal changes at FB in terms of how to separate their DR systems in the case of an actual disaster.
Good planning! Now, where does the IRC server live, and is it currently routable from the internet?
While normally I know the advice is "Don't plan for mistakes not to happen, it's impossible, murphy's law, plan for efficient recovery for mistakes"... when it comes to "literally our entire infrastructure is no longer routable from the internet", I'm not sure there's a great alternative to "don't let that happen. ever." And yet, here facebook is.
Also, are the users able to reach the server without DNS (i.e. are the IP address(es) involved static and communicated beforehand), and is the server itself able to function without DNS?
Routing is one thing you can't do without (then you need to fall back to phone communications), but DNS is something that's quite likely not to work well in a major disaster.
I would think that their internal network would correctly resolve facebook.com even though they've borked DNS for the external world, or if not they could immediately fix that. So at least they'd be able to talk to each other.
To the communication angle, I've worked at two different BigCo's in my career, and both times there was a fallback system of last resort to use when our primary systems were unavailable.
I haven't worked for a FAANG but it would be unthinkable that FB does not have backup measures in place for communications entirely outside of Facebook.
Hmm well I mean for key people, ops and so on.
Not for every employee.
Only a few people need that type of access, and they should have it ready. If they need to bring in more people, there should be an easy way to do it.
Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.
> Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.
Having worked for 2 FAANG companies, I can tell you that most core services, like whatever FB Messenger uses, rely on internal database services. Those would be useless in a case like this, and the engineering cost of designing them to also support an external database would be a lot more than just paying for, like, 5 different external backup products for your SRE team.
"I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally."
You know how after changing resolution and other video settings you get a popup "do you want to keep these changes?" with a countdown and automatic revert in case you managed to screw up and can't see the output anymore?
Well, I wonder why a router that gets a config update but then doesn't see any external traffic for 4 hours doesn't just revert back to the last known good config...
Our security team complained that we have some services, like monitoring or SSH access to some jump hosts, accessible without a VPN, because VPN should be mandatory for all internal services. I'm afraid that once we comply, we could end up in a situation similar to the one Facebook is in now...
Fundamentally, how is a 2nd independent VPN into your network a different attack surface than a single, well-secured ssh jumphost? When you're using them for narrow emergency access to restore the primary VPN, both are just "one thing" listening on the wire, and it's not like ssh isn't a well-understood commodity.
On the other hand if you had to break through wireguard first, and then go through your single well-secured bastion, you'd not only be harder to find, you'd have two layers of protection, and of course you tick the "VPN" box
But if your vpn has a zero day, that lets you get to the ssh server. It's two layers of protection, you'd have to have two zero days to get in instead of one.
You could argue it's overkill, but it's clearly more secure
Only if the VPN means you have a VPN and a jump box. If it's "VPN with direct access to several servers and no jump box" there's still only one layer to compromise.
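As a concrete sketch of the two-layer idea (all hostnames, addresses and key names here are invented), OpenSSH can chain through a bastion that is only reachable once the VPN tunnel is up:
# ~/.ssh/config -- emergency path: VPN tunnel first, then the bastion, then the gear
Host bastion
    HostName 10.8.0.2              # only routable over the VPN, never exposed to the internet
    User emergency
    IdentityFile ~/.ssh/emergency_ed25519
Host core-sw-*
    User netadmin
    ProxyJump bastion              # every management host is reached via the bastion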
Still wouldn't help if your configuration change wipes you clear off the Internet like Facebook's apparently has. The only way to have a completely separate backup is to have a way in that doesn't rely on "your network" at all.
These are readily available, OpenGear and others have offered them forever. I can't believe fb doesn't have out of band access to their core networking in some fashion. OOB access to core networking is like insurance, rarely appreciated until the house is on fire.
It's quite possible that they have those, but that the credentials are stored in a tool hosted in that datacenter or that the DNS entries are managed by the DNS servers that are down right now.
You are probably right but if that is the case, it isn't really out of band and needs another look. I use OpenGear devices with cellular to access our core networking to multiple locations and we treat them as basically an entirely independent deployment, as if it is another company. DNS and credentials are stored in alternate systems that can be accessed regardless of the primary systems.
I'm sure the logistics of this become far more complicated as the organization scales but IMHO it is something that shouldn't be overlooked, exactly for outlier events like this. It pays dividends the first time it is really needed. If the accounts of ramenporn are correct, it would be paying very well right now.
Out of band access is a far more complicated version of not hosting your own status page, which they don't seem to get right either.
It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.
It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.
> It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck.
As someone who formerly did Ops for many many years... this is not accurate. Even in a well organized company there are usually stakeholders at every level on IM calls so that they don't need to play "telephone" for status. For an incident of this size, it wouldn't be unusual to have C-level executives on the call.
While those managers are mostly just quietly listening in on mute if they know what's good (e.g. don't distract the people doing the work to fix your problem), their mere presence can make the entire situation more tense and stressful for the person banging keyboards. If they decide to be chatty or belligerent, it makes everything 100x worse.
I don't envy the SREs at Facebook today. Godspeed fellow Ops homies.
I think it comes down to the comfort level of the worker. I remember when our production environment went down. The CTO was sitting with me just watching and I had no problem with it since he was completely supportive, wasn't trying to hurry me, just wanted to see how the process of fixing it worked. We knew it wasn't any specific person's fault, so no one had to feel the heat from the situation beyond just doing a decent job getting it back up.
"it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts"
Well, you'd be surprised about how one person can bring everything down and/or save the day at Facebook, Cloudflare, Google, Gitlab, etc.
Most people are observers/cheerleaders when there is an incident.
Well, individuals will still stress, if anything due to the feeling of being personally responsible for inflicting damage.
I know someone who accidentally added a rule 'reject access to * for all authenticated users' in some stupid system where the ACL ruleset itself was covered by this *, and this person nearly collapsed when she realized even admins were shut out of the system. It required getting low-level access to the underlying software to reverse engineer its ACLs and hack back into the system. Major financial institution. Shit like that leaves people with actual trauma.
As much as I hate FB, I really feel for the net ops folks trying to figure it all out, with the whole world watching (most of it with schadenfreude).
> It shouldn't be too stressful. (...) it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck
An earlier comment mentioned that there is a bottleneck: the people who are physically able to solve the issue are few, and they need to be told what to do. Being one of those people sounds pretty stressful to me.
"but the people with physical access is separate (...) Part of this is also due to lower staffing in data centers due to pandemic measures", source: https://news.ycombinator.com/item?id=28749244
Most big tech companies automatically start a call for every large scale incident, and adjacent teams are expected to have a representative call in and contribute to identifying/remediating the issue.
None of the people with physical access are individually responsible, and they should have a deep bench of advice and context to draw from.
I'm not an IT Operations guy, but as a dev I always thought it was exciting when the IT guys had in their shoulders the destiny of the firm. I must be exciting.
Most teams that handle incidents have well documented incident plans and playbooks. When something major happens you are mostly executing the plan (which has been designed and tested). There are always gotchas that require additional attention / hands but the general direction is usually clear.
> Well-managed companies blame processes rather than people
I feel like this just obfuscates the fact that individuals are ultimately responsible, and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee. (Not talking about this Facebook incident in particular, but as a generalisation: not attributing individual fault allows faulty employees to thrive at the expense of more qualified ones).
> this just obfuscates the fact that individuals are ultimately responsible
In critical systems, you design for failure. If your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.
This goes by many names, like the Swiss cheese model[0]. It's not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.
This isn't what I'm saying, though. The thought I'm trying to express is that if no individual accountability is done, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).
The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.
If someone is sloppy and not willing to change, they should be shown the door - but not because they caused an outage, because they are sloppy.
People who operate systems under fear tend to do stupid things like covering up innocent actions (deleting logs), keeping information instead of sharing it, etc. Very few people can operate complex systems for a long time without making a mistake. An organization where the spirit is "oh, an outage, someone is going to pay for that" will never be attractive to good people, and will have a hard time adapting to changes and adopting new tech.
> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.
If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)
If there is an incident because an employee was sloppy, the fault lies with the hiring process, the evaluation process for this employee, or with the process that put four eyes on each implementation. The employee fucked up, they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.
If you think about it, it isn't very useful to find a person who is responsible. Suppose someone causes an outage or harm, through neglect or even bad intentions: either the system is set up so that one person can't cause the outage, or sooner or later it will go down. To build a truly resilient system, especially at global scale, there should never be a way for a single person to bring down the whole thing.
I don't think the comment you're replying to applies to your concern about subpar employees.
We blame processes instead of people because people are fallible. We've spent millennia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.
Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes the triple-checker will make mistakes too, but way less than the person who puts less effort in.
But that has nothing to do with blaming processes vs people.
If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.
If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?
If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.
Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.
But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.
You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.
Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.
I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.
To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.
If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.
To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.
What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.
But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.
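A toy illustration of "encode the rule instead of relying on triple-checking" (the parameter names and limits are invented for the example): the rule set the human used to eyeball the numbers lives in code, and a bad change is rejected before it can be applied.

from dataclasses import dataclass

@dataclass(frozen=True)
class Limit:
    low: int
    high: int

# The "rule set" the human used to check by hand, written down once.
LIMITS = {
    "bgp_local_pref": Limit(100, 400),
    "max_prefixes":   Limit(1, 1_000_000),
}

def validate(change):
    """Return a list of violations; an empty list means the change may proceed."""
    errors = []
    for key, value in change.items():
        limit = LIMITS.get(key)
        if limit is None:
            errors.append(f"unknown parameter: {key}")
        elif not (limit.low <= value <= limit.high):
            errors.append(f"{key}={value} outside [{limit.low}, {limit.high}]")
    return errors

if __name__ == "__main__":
    print(validate({"bgp_local_pref": 5000}))   # flags the typo a tired human might miss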
The issue is that this view always relies on stuff like "make people triple check everything".
- How does that relate to making a config change?
- How do you practically implement a system where someone has to triple check everything they do?
- How do you stop them just clicking 'confirm' three times?
- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?
I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.
And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?
This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.
(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)
The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quick in this case...
Exactly, the primary focus in situations like this, is to ensure that no one feel like they are alone, even if in the end it is one person who has to type in the right commands.
Always be there, help them double check, help monitor, help make the calls to whomever needs to be informed, help debug. No one should ever be alone during a large incident.
This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.
Also, equally important to note: there was a massive exposé on Facebook yesterday that is reverberating across social media and news networks, and today, when I tried to make a post including the tag #deletefacebook, my post mysteriously could not be published and the page refreshed, mysteriously wiping my post...
This is possibly the equivalent of a corporate watergate if you ask me... Just my personal opinion as a developer though... Not presented as fact... But hrmmm.
If it's anything like my past employers, they probably have a lot of time. They probably also got in a lot of trouble.
When we'd have situation bridges put in place to work a critical issue, there would usually be 2-3 people who were actively troubleshooting and a bunch of others listening in, there because "they were told to join" but with little-to-nothing to do. In the worst cases, there was management there, also.
Most of the time I was one of the 2 or 3 and generally preferred if the rest of them weren't paying much attention to what was going on. It's very frustrating when you have a large group of people who know little about what's going on injecting their opinions while you're feverishly trying to (safely) resolve a problem.
It was so bad that I once announced[0] to a C-Level and VP that they needed to exit the bridge, immediately, because the discussion devolved into finger-pointing. All of management was "kicked out". We were close to solving it but technical staff was second-guessing themselves in the presence of folks with the power to fire them. 30 minutes later we were working again. My boss at the time explained that management created their own bridge and the topic was "what to do about 'me'", which quickly went from "fire me" to "get them all a large Amazon gift card". Despite my undiplomatic handling of the situation, that same C-Level negotiated to get me directly beneath them during a reorganization about six months later and I stayed in that spot for years with a very good working relationship. One of my early accomplishments was to limit any of management's participation in situation bridges to once/hour, and only when absolutely necessary, for status updates assuming they couldn't be gotten any other way (phones always worked, but the other communication options may not have).
[0] This was the 16th hour of a bridge that started at 11:00 PM after a full work day early in my career -- I was a systems person with a title equivalent to 'peon', we were all very raw by then and my "announcement" was, honestly, very rude, which I wasn't proud of. Assertive does not have to be rude, but figuring out the fine line between expressing urgency and telling people off is a skill that has to be learned.
>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.
- the DNS services internally have issues (most likely, as this could explain the snowball effect)
- it could also be a core storage issue that all their VMs rely on; if they think it will last a long time and don't want to block third-party websites, they may prefer to answer nothing in DNS for now (so it fails instantly for the client, and drains the application/database servers so they can reboot with less load)
I was on a video call during the incident. The service was working but with super-low bandwidth for 30 minutes, then I got disconnected and every FB property went down suddenly. Seems more suggestive of someone pulling the plug than a DNS issue, although it could also be both.
Oh you bet they do. In large organizations with complex microservices these dependencies inevitably arise. It takes real dedication and discipline to avoid creating these circular dependencies.
This is very true. I tell everyone who'll listen that every competent engineer should be well versed in the nuances of feedback in complex systems (https://en.wikipedia.org/wiki/Feedback).
That said, virtuous cycles can't exist without vicious cycles. I think we as a society need to do a lot more work into helping people understand and model feedback in complex systems, because at scales like Facebook's it's impossible for any one person to truly understand the hidden causal loops until it goes wrong. You only need to look at something like the Lotka-Volterra equations (https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equatio...) to see how deeply counterintuitive these system dynamics can be (e.g. "increasing the food available to the prey caused the predator's population to destabilize": https://en.wikipedia.org/wiki/Paradox_of_enrichment).
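For reference, the Lotka-Volterra model mentioned above is just two coupled equations, with x the prey population, y the predator population, and the Greek letters positive interaction rates:

dx/dt = αx − βxy    (prey grow on their own, get eaten in proportion to encounters)
dy/dt = δxy − γy    (predators grow from those encounters, die off otherwise)

Even that tiny feedback loop produces the counterintuitive oscillations the linked articles describe.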
It seems like an easy redundancy split, but imagine driving two cars down the freeway at the same time just because you got a flat tire in one of them the other day.
In order to actually be redundant you need two full sets of infrastructure to serve from, and then if the internal one goes down, the external one is basically useless anyway while internal resolution is down. Capacity planning (because you're inside Facebook and can't pretend that all data centers everywhere are connected via an infinitely fast network) becomes twice as much work. Rolling out updates for a couple thousand teams isn't trivial in the first place; now you have to cordon them off appropriately too?
I don't know what Facebook's DNS serving infrastructure looks like internally, but it's definitely more complicated than installing `unbound` on a couple of left-over servers.
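(For contrast, the naive "left-over servers" version being dismissed here really is only a handful of lines; a plain recursive resolver for an internal network, with an assumed 10.0.0.0/8 client range, looks about like this:)
# /etc/unbound/unbound.conf
server:
    interface: 0.0.0.0
    access-control: 10.0.0.0/8 allow      # internal clients only
    access-control: 0.0.0.0/0 refuse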
Even the name servers are not returning any values. That's bad.
dig @8.8.8.8 +short facebook.com NS
These are usually anycasted, meaning the single IP returned in the NS records is in fact several servers spread across several regions. Queries are routed to the closest one through agreements with ISPs, via the BGP protocol. Very interesting, because it seems it took one DNS misconfiguration to withdraw millions of dollars' worth of devices from the internet.
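One way to watch this from the outside is to ask a public resolver programmatically and see whether you get answers, NXDOMAIN, SERVFAIL or a timeout. A quick sketch with dnspython (pip install dnspython); the choice of 8.8.8.8 is arbitrary:

import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.nameservers = ["8.8.8.8"]  # a public resolver, so we see what the rest of the world sees

for rdtype in ("NS", "A"):
    try:
        answers = resolver.resolve("facebook.com", rdtype)
        print(rdtype, [str(r) for r in answers])
    except dns.resolver.NXDOMAIN:
        print(rdtype, "NXDOMAIN - the resolver says the name does not exist")
    except dns.resolver.NoAnswer:
        print(rdtype, "name exists but no records of this type")
    except dns.resolver.NoNameservers:
        print(rdtype, "SERVFAIL - no server gave a usable answer")
    except dns.exception.Timeout:
        print(rdtype, "timed out")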
So far the pattern isn't the same. Slack published a DNSSEC record that got cached and then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case, the records are just completely gone. As if "facebook.com", "instagram.com", et al just didn't exist.
General tip: If HN is being laggy and you're determined you want to waste some time here, open it in a private window. HN works extremely quickly if it doesn't know who you are.
Wow this really works, thank you. What actually is the reason for it being much faster in a private window? Is there so much tracking going on in a normal window?
It's faster because the pages are cached; they are effectively static. It's slower when logged in because the pages are created dynamically, since they include your username, tracked favourites, upvotes etc., and much of that cannot be quickly cached.
Honestly surprised that HN, a website for techies, is so poorly coded. For example, the whole lack of proper "paging", with dang posting a disclaimer on every large thread for honestly over a year at this point and no progress. Or the fact that if you want to reply to a comment, it has to load a whole new page (which has to fetch more data from the server?) before getting to the comment box. Until recently, trying to collapse a large comment thread would also take 3s+ as I think it individually set the collapse state of every sub-comment?
The whole thing was put together in a somewhat obscure dialect of Lisp over 15 years ago. There’s probably under 100 people that write Arc regularly enough to meaningfully contribute, so the general approach has been to not fix what’s not broken.
This is not a very complex website, any HN reader could probably whip up a replacement from scratch over a weekend.
I guess there does exist many alternative UIs, though I don't see many that support commenting. I wonder if the "API" (if there's any) allows for that, or if people are just scraping the page and reformating it.
Not to argue, just to post a contrasting view: while FB, and a lot of the internet, failed or slowed today—and I know there were tons of reports of HN slowing too—I also experienced a phone death and attempted to hobble along by putting my SIM back in my old iPhone 5. Basically the only thing that worked was HN. In fact it loaded as quickly as I’d expect.
There’s plenty of stuff I’d like to be different about usability of this site, but perf is basically at the bottom of that list.
Most of the things I listed weren't really perf related, though they do show up when there's perf issues. Being able to "load more comments" and reply inline are super basic usability features. There's no reason why I'd need to navigate to a whole different page with a textbox, then navigate back and lose my position every time I post a comment.
One of the first optimizations large/high-traffic sites make is to cache pages for logged-out users. Even if the cache is only valid for a minute, that's still a huge reduction in server traffic.
The cache is faster because it's not having to talk to the database, and it can be served by the load-balancing layers rather than the actual application layer.
Wikipedia does this too (although with a layer to add back the IP talk-page header).
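As a rough sketch of that "cache only for logged-out users" trick, assuming nginx in front of the app and a session cookie named "session" (both assumptions, not how HN actually does it):

events {}
http {
    proxy_cache_path /var/cache/nginx keys_zone=anon:10m;
    upstream app { server 127.0.0.1:8080; }      # hypothetical application server

    server {
        listen 80;
        location / {
            proxy_pass http://app;
            proxy_cache anon;
            proxy_cache_valid 200 1m;            # even 60s of caching soaks up most anonymous traffic
            proxy_cache_bypass $cookie_session;  # logged-in requests skip the cache...
            proxy_no_cache     $cookie_session;  # ...and are never stored in it
        }
    }
}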
Could they offer cached pages to logged-in users as an optimization? You only need to invalidate when a user posts a comment, and most of the time you are reading, not commenting.
All running their DNS on AWS. My guess is that AWS is seeing a massive flood of failed and retried DNS requests for facebook properties, similar to what jgrahamc mentions here for Cloudflare: https://twitter.com/jgrahamc/status/1445066136547217413
There's a thing called the "thundering herd" problem that partially matches.
From wiki: the thundering herd problem occurs when a large number of processes or threads waiting for an event are awoken when that event occurs, but only one process is able to handle the event. When the processes wake up, they will each try to handle the event, but only one will win.
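The usual client-side mitigation is retrying with exponential backoff plus jitter, so the herd doesn't wake up in lockstep. A rough sketch (the function and parameter names are made up):

import random
import time

def fetch_with_backoff(do_request, max_attempts=6, base=0.5, cap=30.0):
    """Retry do_request(), sleeping a random ("full jitter") slice of base*2^n seconds between attempts."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)

# e.g. fetch_with_backoff(lambda: socket.gethostbyname("facebook.com"))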
I can't see how this is the reason for HN to take 10 seconds for the response of the main page (I mean, the URL fetched from the address bar, not the subrequests the page does), as everything else downloads immediately.
The DNS entries should be cached by the browser (and the middleware), so that this problem should only happen once, but I get this constantly.
Also, I sometimes get an error message from HN, which seems to indicate that this is some backend issue which fails gracefully with a custom "We're having some trouble serving your request. Sorry!" on top of a 502 code.
It feels more like there is something else still broken.
Dropping that many BGP routes will take a latency toll on the whole internet backbone for minutes or hours, so I'm not surprised. I wonder if the recent expiry of LE's DST Root CA X3 has something to do with the outage (some DC-internal tool/API not accessible because its certificate is expired, or something like that).
AWS punishes its sysadmin teams for any downtime so there is heavy incentive to not report unless there is a community shaped gun pointed at your head. This is not a universal problem.
Probably people flooding in to see if anyone knows why things are down. Even Google speed test was down, presumably from too many people testing if it's their internet that's at issue.
A couple of years ago, an admin at Hacker News asked those of us who are just reading to log out because their system is architected in such a way that logged in users use more resources than anonymous ones. So, if you're feeling altruistic, log out of HN!
Yep. I am the developer of HN client HACK for iOS and Android and a bunch of users emailed me asking if my app was broken. Looks like something bigger is afoot.
Harmonic most likely uses Algolia for the data whereas my app uses the HN website. So Algolia delivers a cached copy from their own servers, whereas mine scrapes the HN website itself; hence the difference. Also, logged-out pages were working much better than logged-in ones (HN delivers cached copies to logged-out users).
HN lagging, BBC was also very laggy about 30 minutes ago, and 35 minutes ago our whole company got booted out of their various hangouts simultaneously apart from the people in the states.
Definitely laggy for me as well. Facebook wouldn't load, so I came here to check in, and the load time made me think my wifi must be broken with two sites not opening - then HN finally loaded. Then I hit reply to your post and again it seemed like it wouldn't load, until it finally did. So yes, laggy; usually this is the one site that loads almost instantly.
Plus I don't know about you, but I came to HN just now specifically to check if there was any insight into why it was down! The thundering herd just arrived :)
This is not too rare when HN is being slow and giving those "We're having some trouble serving your request. Sorry!" pages.
If you get one of those on your comment submission you have no way to know if the trouble stopped it from accepting the comment or if it accepted the comment and ran into trouble then trying to display the updated thread.
For some reason I can't even begin to guess at HN does not seem to have protection against multiple submissions of the same form, so if after getting "We're having some trouble serving your request. Sorry!" on your comment submission you hit refresh again to display the page and the form gets resubmitted, you get a duplicate comment.
Earlier today when I was getting these I went to go check the page to see if the comment was posted there. More than once it said it failed but I was able to stop myself from trying again because it was actually there.
Funny enough, I went to https://www.isitdownrightnow.com/ to check if Facebook is down, and isitdownrightnow is down itself... probably from the massive number requests coming to check if Facebook is down
It's amusing that the top 3 trending reports are the FB sites that are down, and then the mobile carriers themselves, presumably because when FB doesn't load they assume it's their mobile network's fault. People really do think FB is the internet
It is a really large part of it. Also, when people see WhatsApp and see no connection, then open Facebook and see no connection either, it's _very_ likely that the link is at fault and not Facebook.
Which, in turn, reminds me of this paper [1] (from someone who previously worked at Facebook).
TLDR; Metastable failures occur in open systems with an uncontrolled source of load where a trigger causes the system to enter a bad state that persists even when the trigger is removed.
hugops for the engineers having to deal with this. It's incredibly stressful and I personally feel like they deserve some empathy, even if I don't like Facebook.
I wonder if maybe part of the lesson will be to run the root of your authoritative DNS hierarchy on separate infrastructure with a separate domain name. Using facebook.com as your root is cool and all but when that label disappears it causes huge issues.
Kinda sorta. There are four DNS servers for Facebook: 129.134.30.12, 129.134.31.12, 185.89.218.12, and 185.89.219.12.
Of those, only 185.89.219.12 is up right now (Edit All four DNS servers are now up). For people who want to add Facebook to hosts.txt, the A record (IP) I’m getting right now is 157.240.11.35 (it was 31.13.70.36)
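(For completeness, the hosts-file workaround is a single line; the address was only valid at that moment and rotates, so treat it as an example, not a fix:)
# /etc/hosts (or C:\Windows\System32\drivers\etc\hosts)
157.240.11.35   facebook.com www.facebook.com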
Yes, even Facebook falls prey to the wrong copyright year. Anyway, I got further now to a page that says "Account Temporarily Unavailable." and has the old Facebook layout. Would love a peek inside the Facebook codebase to see how this happens, hah!
It's a little irksome how other commenters are quick to dismiss this very valid point. SMBs in Asia aren't using WhatsApp because they've forced the platform on their consumers; it's their consumers, who use WhatsApp, who've forced the choice on the SMBs. WhatsApp has very wide consumer penetration, and its use by businesses is meant as a convenience wrapper for customers.
Now, does switching from WhatsApp to some other not-very-widely-used platform cause customer engagement / retention to drop? I would wager very much so! It's a matter of priorities - people go where there is least friction, and WhatsApp otherwise provides a seamless friction-less experience.
It's a very first-world-centric point of view. I doubt the commentators claiming WhatsApp being down is good for society have ever been outside the first world and seen how it actually helps people in need.
Do you even know who these business owners are and what kind of life they live? These are the folks who don't have a solid roof over their head, struggle to meet their daily needs, and might have to sleep hungry if the day's sales weren't good. Diversifying is the least of their worries. WhatsApp let them reduce friction when communicating with customers, and it helps their sales.
What might be a good thing for society in the first world isn't necessarily a good thing for society in the third world.
I reject this logic - it's an argument for sustaining the status quo at all costs.
Facebook is the most user-hostile tech megacorp, and they will inevitably harm these businesses you care about. The sooner the bandaid is ripped off the better.
I mean, sure, status quo can / should be changed - but you want to get to a point where a changed status quo is sustainable, and you're not going to get there by simply removing existing options. It doesn't change the incentives people have for preferring to use the platform, namely the pre-existing widespread penetration.
You want to dislodge Facebook, you need to disrupt it / curtail its monopoly.
Have you considered that any change is going to mean winners and losers?
If Facebook permanently goes down then those businesses would move to a different platform.
Would it suck? Probably. Would the world be a better place without Facebook? A ton of people think so. Me included.
This is the same argument people have used when we talk about health insurance in the US being scammy. If we ever decided to address it it means a good chunk of people lose their jobs but also means that the health of this country goes up. Which one is more important?
But people moving from Facebook to another social media or messaging platform is just changing the company. That new company could do whatever things you don't like that Facebook is doing.
Your example seems to mean that we move to another healthcare system as in method of implementation not just moving from one company to another.
That's a fair point. I'm going to make an assumption here that those systems aren't moderated or controlled centrally? If so, I see a few issues with them if they become popular.
1. This only increases misinformation since there's no censorship at all
2. What prevents people from using the service for illegal purposes (I think this becomes a problem once services become popular)
3. Finally won't it just be banned eventually? If some court in any country issues a ruling to control it and it can't be controlled that will just escalate. Especially if it's something like pedophiles.
I've been doing a pretty good job of moving my client's communications to Signal out here.
I feel bad for everyone who relies on whatsapp bots for making stuff happen, though. These are getting really common out here for a lot of things and it always worries me that it's such a linchpin. They're really handy and save a lot of bullshit phone calls from having to be something people deal with for simple stuff like pharmacy delivery. I can get food from the local place down the street that's only really open for lunch and totally off the map for uber eats, for example... if this persists a few more hours those mom and pop type shops aren't going to have as great a day.
At the start of this year I started working for an employment service company that covers the Indo-Asia-Pacific and South American markets.
I was amazed to discover how pervasive Facebook, Inc. has become in the developing world for conducting business and navigating everyday life.
For a lot of people in developing nations such as the Philippines and Indonesia, Facebook is synonymous with the internet. This has been buoyed by their push to bundle uncapped/free data for Facebook with mobile plans in markets with high growth in mobile internet access.
It's interesting, because I'm always reading articles about how "Western teens aren't using Facebook any more", which is true, but it's also irrelevant, because they're not really a profitable market, teenagers have short attention spans and no money. Facebook's growth strategy is to become the one stop shop (in lower income nations) for everything you want and need.
In Latin American 3rd world countries, people also conduct business via Instagram.
They create Instagram accounts and post products as posts, with a caption of "DM me for price".
It also sets off every alarm in my mind when they start calling these "Instagram pages". It blurs the line between a real website and an Instagram account (in Spanish, "website" is "página web" as well).
I've also heard: "My business went to hell because Instagram killed my account" and that's when I reply: "Have you ever thought of owning a real website?"
Maybe an event like this will spur some people into... not doing that? Yes I'm aware of the ubiquitous nature of whatsapp in many developing nations. Have also successfully got a lot of people moved onto using Signal for anything they care about.
Signal has and will go down just like facebook.
Cloudflare/aws having issues affects an insanely high percentage of the internet. People still use them.
Outages rarely cause anything, they happen, people move on.
Email, SMS, good ol' phone calls, Signal, <insert local app/platform here>, your own website, etc, on top of whatever you use right now.
If you're in a country that relies a lot on Facebook or Whatsapp, that's where the main focus will be, but at least try to have alternatives just in case something goes wrong.
So 4/4 of those are platforms controlled by a single company or a few large corporations. This really isn't a win in any meaningful sense.
It should be fine for huge corporations to exist and provide services really efficiently at scale while also being forced to play nice and respond to the will of the people they serve.
If we collectively can't stop Facebook from doing bad thing and being bad stewards to their own platform then you won't be able to stop whatever would replace them either.
It was a mistake to communicate with users on a platform that they actually use? Instead of trying to get them on Signal, losing 90% of leads in the process and making each sale cost 10x as much?
You need a contingency plan for when vendors go down even in 3rd world countries. It just so happens a lot of us would not mind this vendor failing entirely. It’s unfortunate that we have so little choice in the matter but ultimately the same advice holds true for all of us smugly throwing insults while keeping our billing in AWS.
He’s a HN 10xer. He doesn’t care about anyone outside his Palo Alto cold-press-Koffee-Klatch, despite what he virtue signals. It’s amusing seeing people here trip over each other to say some variety of “I don’t use Facebook.”
How is that a good comparison? Not everyone uses Facebook out of habit, some businesses need it, and it can be used for good things as well as bad because it's just a medium in which people post content
Yes how that content is presented, ranked, etc is controlled by Facebook but that contribution is less than the content itself.
It would be better to say it's the spoon with which someone could eat a sugary cereal or something healthy.
Are you a Facebook employee? Your justification sounds a lot like the internal propaganda that is being fed to employees. “Facebook is net positive”, “it’s just a tool”, etc
What makes you say that? I think it's a good argument, doesn't mean it's right but it has substance. You also have some quotes that I never said. Nowhere did I imply it's a net positive. It is like a tool however but it has much more input.
The argument was that Facebook is neutral as a platform. Similar to the internet, it serves all kinds of content. Some of the content is good, and some is bad. That doesn't necessarily mean the platform is good or bad.
Having worked in growth before (not at Facebook), I can tell that you vastly underestimate the impact FB teams have on how/when/what/for how long/how many times/etc content is displayed to end users. This is absolutely not a neutral impact.
Here in Europe, WhatsApp actually powers many neighborhood watch groups, and so when it goes down, basically a formal crime reporting system also goes down.
This also means that you can't participate in a neighborhood without agreeing to a legal contract with Facebook to use their services, as well as submitting to ad surveillance and tracking.
I know you think this is some sort of neutral comment about personal choice, but it isn't. Millions of underserved people all over the world live in Food Deserts (https://en.wikipedia.org/wiki/Food_desert), places with little to no access to affordable nutritious food. Those people wind up consuming a large portion of their calories from high fructose corn syrup, not because they have chosen to do so, but because they have no choice, and that is their only option. Whether you want to accept it or not, your comment is classist and makes HN a more hostile place.
People don’t eat straight corn syrup. The products they do eat that contain it are quite expensive per calorie. I.e. Coke.
The problem is initiative and knowledge. They should walk or ride a couple of miles and buy the biggest bags of rice and beans they can, along with a bottle of multivitamins. And then learn how to cook.
If that’s classist, then the classes are structured by knowledge and choices. Which they may well be.
The entire reason that high fructose corn syrup is so prevalent in low-cost foods is that it's cheaper than sugar, especially in the US because of corn subsidies. Find literally any evidence that HFCS is more expensive per-calorie than sugar and you will come up empty-handed.
> If that’s classist, then the classes are structured by knowledge and choices. Which they may well be.
class by its definition accounts for massive difference in access to resources. If you think access to resources doesn't measurably change the level of knowledge that a population has, that's a declaration that resources do nothing, which would be an odd stance to take on a knowledge-focused community website.
> They should walk or ride a couple of miles and buy the biggest bags of rice and beans they can, along with a bottle of multivitamins.
I just LOVE the subtle food choice of rice and beans here, paired with the recommendation to take multivitamins, a recommendation that is supported by little to no evidence. Your own lack of knowledge on this topic is in full display, as is a clear demonstration of your own biases across multiple dimensions.
Of course HFCS is cheaper than sugar. I'm referring to the products made from it, like Coke. They are a poor way to spend your food dollar.
I agree that class accounts for a massive difference in access to resources. However, in this case, the knowledge is available for free, and in the US the basic foodstuffs are available for far less than what disadvantaged people pay for the typical processed and fast food they live on.
Rice and beans - nothing subtle about it. They are basic foods that provide the necessary carbs, fat, and complete protein. The vitamins are a simple way to prevent scurvy and similar deficiencies, until the choice of food can become more varied.
As a person learns to cook and bake, they can add wheat, peas, and corn (But they need to learn about nixtamalization before they add corn.) None of these foods require refrigeration.
I have in mind the cuisine of Mexico, which is inexpensive and nutritious. Similar cuisines are found in home cooking all over the world, at least where commercially processed food hasn't driven them out.
It is most important to make sure that all school children are taught how to process and cook these basic foods.
If you are knowledgeable in this area, I'd appreciate some specific suggestions.
My mother didn't really have a common means to communicate with her grandchildren in the same way.
Email, phone are just not the same.
There are more channels available now for sure, but none so ubiquitous.
Facetime is not displacing FB for a lot of things, but that's more direct.
'Everyone is on FB' is the reason it still holds in these kinds of uses cases.
None of us care one way or the other about the platform; we'll just use what's convenient, but that is what it is.
This is a very common theme among FB users. FB, by the way, is still growing its userbase, and growing revenues even more so. The themes we see here on HN and even in the news don't represent the views of the population at large, nor are they necessarily very close to material reality.
There are plenty of ways to communicate with friends and family. If Facebook is down long enough, many people will just move to something else. (And I hope they do)
Sure they do. And it's why Whatsapp needs to be broken off from Facebook, because they blatantly lied about it and only bought it to kill off their competition
Many people don't realize that with the 2020 lockdown and next to zero face to face transactions happening, platforms like FB Marketplace provided an opportunity for many people to set up businesses and generate income. I understand the angst people have with FB, but there's a bigger world out there beyond our keyboards.
for one example of this look at certain ethnic food catering/delivery services that exist in many major cities and operate almost exclusively on facebook.
Convincing somebody who can hardly turn on their computer to get their own domain is just not practical. Even if they can get their own domain, they still have to set up DNS. Good luck getting them to set up MX, SPF and DKIM.
I think things would be better if more people had their own domain. I just don't see any way of making it happen. I can't get my own family to leave gmail even with me handling all the domain stuff for them. Even my technical coworkers who are capable of this don't care.
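For anyone wondering what "set up MX, SPF and DKIM" actually means, it boils down to roughly three records in the zone; example.com, the selector, and the key below are placeholders you'd replace with whatever your mail provider hands you:

; minimal mail-related records for a personal domain
example.com.                   MX    10 mail.example.com.
example.com.                   TXT   "v=spf1 mx -all"
mail._domainkey.example.com.   TXT   "v=DKIM1; k=rsa; p=<public key from your provider>"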
Yes. In an ideal world, messaging would have followed the same federated model as email. XMPP offers this; unfortunately, few people use it or are even aware of it.
I'm fine with Matrix, but I'm not seeing the people around me moving to it, even with a more friendly solution like Element. It's already hard to make them use Signal just because they want users to remember a pin...
Just because a company has questionable or even straight evil business practices doesn't mean that literally millions of companies/people don't rely on them to do business and communicate.
Well, I know you jest, but a lot of conversations, with many people, over years and years would be lost. It'd be akin to hundreds of email threads with friends being deleted.
On the contrary, it's said far too much. Facebook is extremely valuable for a lot of people. I dislike Facebook as much as most people on here, but saying "it's totally pointless" is silly and it's not surprising that those who say it are ignored by those who use Facebook.
* A friend of mine runs a posh burger van that moves around a lot, and he puts "today's location" on Facebook.
* My wife talks with her family in Brazil through Facebook, sharing photos
* My Church receives a lot of help requests from people in trouble through Facebook
* Some abuse charities give support to victims through Facebook
etc
You could argue that it would be nice if there were alternatives, or that these organisations shouldn't be using Facebook at all. Sign me up for your campaign, I agree with you.
But if you say "Facebook has no value" then you will never understand the value proposition you need to offer in order to kill Facebook.
Communication for many out there. Many will be just fine without commenting on cat photos or bragging with their likes or followers. Many will be in trouble if they use WhatsApp/Instagram/Messenger/Marketplace to do business and any important communication.
That is wild and definitely newsworthy. Capture as many screenshots and data as you can.
FWIW it seems possible that the messages remain cached locally on your device but deleted from their servers, and with their outage they aren’t being updated to delete?
This is something to get in front of a tech journalist who covers Facebook. It’s a major breach of trust. Probably hit one of your favs with a tweet, but they also tend to list their contact info on author pages of the sites they publish for.
Think of it - half the country doesn't have internet because of this crash, that's terrifying. (Switching DNS servers obviously works but that's not something the general population will do)
Posting this comment will be like farting into a hurricane, but here goes.
Company like Facebook has a serious problem and their stock drops ... precipitously. CEO of said company instead of selling their equity in their company has taken out loans against their equity in order to decrease their tax burden and cash in on the value of their equity.
What amount of decrease would cause a margin call from lenders, a forced sale of said equity, and subsequently the loss of a majority stake in their own company? Obviously only the lenders know this information, and I'm assuming I have the rough order of operations correct.
Could this be a potential chink in the armor of founders / CEOs / anyone who takes out low interest loans against the equity they hold in their company? Maybe my understanding of this is too simplified.
It's happened [1] but for it to happen to a company with the structure of Facebook a lot of things would have to go wrong.
For starters Zuck owns Class B shares which have 10x the votes of Class A shares. I'm sure a bank would happily loan him money against his Class B shares, but any forced liquidation would involve a transfer to Class A shares. Zuck could lose a lot of shares and still maintain control of the company.
I don't know much about this, but from the limited amount I've read, it is probably only a portion of the equity owned, and generally when borrowing against an asset the lender will not give you 100% of that asset's value, to protect from downside risk. Another possibility is that the loan was arranged in the past, potentially year(s) ago, and FB's stock price a year ago was almost $100 a share less than it is now, so a $10 drop is not a big deal in the long term.
You raise an interesting question though and I'd like to know the answer as well!
Nah. Almost certainly he could lose 100% of that value of the stock without being at risk of anything like this (as in, he probably put up $100m if he wanted a $50m loan, etc..)
Even if he didn't, the bank would let him move funds in without forcing him to sell.
Margin calls don't really exist as far as loans are concerned. Once you have agreed collateral, and an agreed schedule of payment, you only get in trouble if you miss a payment, regardless of how the collateral fluctuates in value.
That might be true for a house or a small-scale loan, but once you are dealing with billions I doubt that's the case. I assume it works as follows: you have $1b in stock, the bank gives you a $500m line of credit. If the stock goes down enough they force a sale, but they only sell against what you have actually utilized in your line of credit. If you are Mark Zuckerberg and worth more than $100 billion, you probably don't have any issues. If you add up all of his houses and planes and cars it probably doesn't come to more than 1% of that. He's fine.
Loans based on assets like stocks/bonds/other assets with highly variable prices always have collateral requirements. If the loan is backed by 100M in facebook shares and the price of stock drops in half you will have to hand over more stock for collateral. If the price doubles, you can ask for your collateral back.
It is doubtful Zuck has any margin call issues: he has so much stock that I can't imagine he has pledged even 5% of it for loans, so he could just hand them another chunk without even blinking (which he generally doesn't need to do anyway).
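To put toy numbers on the collateral mechanics (purely illustrative, not anyone's actual loan terms): borrow $500m against $1b of stock with a 35% maintenance requirement. The lender needs equity/collateral ≥ 0.35, i.e. (V − $500m)/V ≥ 0.35, which works out to V ≥ $500m/0.65 ≈ $769m. The stock would have to fall roughly 23% before a margin call is even on the table, and posting more shares resets the cushion.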
I am not sure you understand my question / hypothetical. First, a bank is not the only form of lender; second, the reason there isn't a 100% loan against equity is that it's understood the value of the underlying collateral can fluctuate. These are called over-collateralized loans.
I currently have a $150k margin loan, and it absolutely will lead to forced liquidation if the collateral drops in value: Interactive Brokers was clear about that.
Separately, my bank tried to sell me a "Pledged Asset Line of Credit" that would also have required the collateral to maintain a certain value or there would be forced liquidations.
Can you link an example of a bank or similar that lets you borrow against stock or options collateral without margin call risk?
A margin loan is very different to the kind of financing agreement a company will enter into. You are using the money at IB to speculate, and probably purchasing volatile assets at that. A company will generally utilise that money very differently, and it is unlikely that a lending institution will accept shares as collateral due to the wrong way risk (i.e. if they can't service their debt, their shares are probably losing value too, so probably not good as collateral).
> A margin loan is very different to the kind of financing agreement a company will enter into
The original post you were replying to talked about founders borrowing against their company's shares as individuals, not companies.
> it is unlikely that a lending institution will accept shares as collateral due to the wrong way risk
That's my point: Nobody's getting a special financing deal on their company stock as individuals to eliminate their margin call risk.
Multi-billion dollar hedge funds get a personal contact at the bank, but even they will get margin-called borrowing against stocks as collateral if it goes against them.
He might not be speculating, he might be holding a bunch of SPY shares and simply withdrew $150k as a margin loan so he can make a purchase on a house or car but not pay taxes on gains yet from selling his shares, opting instead to pay off the loan over time through regular deposits.
He is still speculating on the SPY and his ability to pay off the loan depends on how the SPY holds its value. A market crash would hurt his ability to pay off the loan.
No it doesn’t. His loan is the same no matter what SPY’s value is. He pays it by depositing money he earns from his day job, not by selling shares. If SPY crashes very hard his broker may force him to pay the loan very quickly, either by adding more money or by selling off his shares to cover the loan.
It's about damn time. Hopefully they stay down. It will do the world some good (long term) to have some time away from this platform and platforms like it.
Facebook being down makes me think of all of those small businesses who never built websites. They rely on traffic and publicity from their Facebook pages only.
It's so important to diversify - for example, by building your own website.
Some preloaded apps work (like YouTube, Firefox), but stuff like settings, the lobby, etc., are very slow or display "Unable to Load" messages. Any game that relies on your friends list seems to freeze for a while, then try to carry on.
We've come full circle, with techies rediscovering the original hatred for the Oculus: that, for some reason, it is tied to a social media walled garden.
When FB announced they would be buying Oculus, they promised that no social media integration would be required. FB breaking that promise is not the same as Oculus having that requirement from the get-go.
Originally, you did not need a facebook account to use oculus after purchase. They framed this as "you do not need to integrate social media/facebook".
This ^ means fuck-all, because at that time (day 1), their oculus services where hosted in the same infrastructure as their social media services.
Last year, they got rid of "you do not need a facebook account". But in all situations since inception, all of your data has passed through the same infrastructure as Facebook data. It may not be exposed or targeted for advertising, but this WAS a huge point of contention years back.
> oculus services where [sic] hosted in the same infrastructure as their social media services
With my second-hand knowledge from someone who worked for FB and assisted the Oculus team being folded into FB processes/policies/tech, I don't think this is accurate, either.
Yet another reason to not over-rely on a few big tech companies for the majority of the planet's communication. Forget concerns about competition, monopolies and so on for now (as important as they are), what we want are many social networks, video conferencing apps, messenger apps. Every country should strive to build their own Google or FB, or certainly many more should. State-backed if needed. It's a question of resilience and security as much as anything.
I had problems with my internet connection and loaded my ISPs site. Strangely, my bill was paid. Even stranger, some sites load while others do not.
Then it hit me: I am so dependent on Facebook owned properties (Whatsapp, Facebook, insta) that a Facebook failure looks to me like an internet failure.
From the archived ramenporn reddit comment thread at [0]:
> This must be incredibly stressful so for your sake I hope you sort it out quickly... but for the world's sake, I hope you fail and make the problem worse before jumping ship followed by every other engineer, leaving it to Zuckerberg to fix himself. But I still hope it's not too stressful for you!
I love how understated companies always are about things like this.
> Facebook said: "We are aware some people are having trouble accessing our apps and products. We are working to get things back to normal as quickly as possible and apologise for any inconvenience."
"Because of missing DNS records for http://Facebook.com, every device with FB app is now DDoSing recursive DNS resolvers. And it may cause overloading ..."
Discord [1] is taking a toll from the increased traffic as well:
"We're noticing an elevated level of usage for the time of day and are currently monitoring the performance of our systems. We do not anticipate this resulting in any impact to the service.
We have temporarily disabled typing notifications. We expect these to be re-enabled soon."
Is anyone else seeing knock-on effects at the other major public DNS providers? I'm seeing nslookups sent to 4.2.2.2 and 8.8.8.8 intermittently timeout if the hostname does not belong to a major website. CloudFlare DNS (1.1.1.1) doesn't appear to be impacted. For example:
[root@app ~]# nslookup downforeveryoneorjustme.com 4.2.2.2
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
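If you want to compare resolvers yourself, here's a minimal sketch (assuming the third-party dnspython package is installed; the resolver IPs are just the ones mentioned above):
```python
# Ask several public resolvers for the same name and report errors/timeouts.
import dns.resolver

RESOLVERS = {"Level3": "4.2.2.2", "Google": "8.8.8.8", "Cloudflare": "1.1.1.1"}

def check(hostname: str) -> None:
    for name, ip in RESOLVERS.items():
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [ip]
        try:
            answer = r.resolve(hostname, "A", lifetime=3.0)
            print(f"{name} ({ip}): {[rr.to_text() for rr in answer]}")
        except Exception as exc:  # covers NXDOMAIN, SERVFAIL, timeouts
            print(f"{name} ({ip}): {type(exc).__name__}")

check("downforeveryoneorjustme.com")
```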
>Now, here's the fun part. @Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for http://facebook.com
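A lot of that flood is simply client software retrying failed lookups in a tight loop. For illustration only (nothing Facebook- or Cloudflare-specific), a gentler client would back off exponentially with jitter, roughly like this sketch:
```python
# Retry a failing DNS lookup with exponential backoff and jitter instead of
# hammering the resolver in a tight loop. Purely illustrative.
import random
import socket
import time
from typing import Optional

def resolve_with_backoff(host: str, max_attempts: int = 5) -> Optional[str]:
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return socket.gethostbyname(host)
        except socket.gaierror:
            # Sleep for the current delay plus jitter, then double it (capped).
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 60.0)
    return None

print(resolve_with_backoff("facebook.com"))
```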
Also Speedtest.net for me is showing a 503 error page. Seems a large CDN might be having problems. Their status page shows all green. FB and their other sites are also down.
edit: I see it's back up and I've been getting downvoted, here's a screenshot of the error for clarity
If Facebook and WhatsApp and Instagram fails there are probably a lot of people checking whether their Internet works. That might be why Speedtest was overwhelmed.
I recognize that for WhatsApp users around the globe this is probably more than an inconvenience, but the rest of humanity is getting something of a reprieve here.
It's quite a coincidence that this lines up with the whistleblower report + rumors of Peter Thiel (perhaps via Palantir?) being involved in leveraging FB for the 2022 midterm elections.
I'm not suggesting that this is the case, but a failure of this scale (with internal systems also down) could allow scrubbing of evidence without leaving traces.
From what I understand (take with a grain of salt), remote access to the affected routers is down, so someone needs to be physically plugged in to address the issue. Hence some of the other "scrambling private jets" comments referring to getting the right people physically plugged in to the right routers.
Probably not. From other comments it looks like a bad configuration was rolled out, and now they are logistically struggling to get access to fix it.
This is honestly the best feature Facebook has ever developed. I hope it's permanent. It has the following effects: you feel better about yourself, you can spend more time with your family, you are more productive.
Facebook Whistleblower Claims Profit Was Prioritized Over Clamping Down on Hate Speech
A Facebook whistleblower, who is due to testify before Congress on Tuesday, has accused the Big Tech company of repeatedly putting profit before doing “what was good for the public,” including clamping down on hate speech.
Frances Haugen, who told CBS’s “60 Minutes” program that she was recruited by Facebook as a product manager on the civic misinformation team in 2019, said she and her attorneys have filed at least eight complaints with the U.S. Securities and Exchange Commission.
During her appearance on the television program on Sunday, Haugen revealed that she was the whistleblower who provided the internal documents for a Sept. 14 exposé by The Wall Street Journal that claims Instagram has a “toxic” impact on the self-esteem of young girls.
That investigation claimed that the social media giant knows about the issue but “made minimal efforts to address these issues and plays them down in public.”
“The thing I saw at Facebook over and over again was there were conflicts of interest between what was good for the public and what was good for Facebook. And Facebook, over and over again, chose to optimize for its own interests, like making more money,” said Haugen.
She explained that Facebook did so by “picking out” content that “gets engagement or reaction,” even if that content is hateful, divisive, or polarizing, because “it’s easier to inspire people to anger than it is to other emotions.”
“Facebook has realized that if they change the algorithm to be safer, people will spend less time on the site, they’ll click on less ads, they’ll make less money,” she claimed.
Haugen is expected to testify at a Senate hearing on Oct. 5 titled “Protecting Kids Online,” about Facebook’s knowledge regarding the photo sharing app’s allegedly harmful effects on children.
During her appearance on the television program, Haugen also accused Facebook of lying to the public about the progress it made to rein in hate speech on the social media platform. She further accused the company of fueling division and violence in the United States and worldwide.
“When we live in an information environment that is full of angry, hateful, polarizing content it erodes our civic trust, it erodes our faith in each other, it erodes our ability to want to care for each other. The version of Facebook that exists today is tearing our societies apart and causing ethnic violence around the world,” she said.
She added that Facebook was used to help organize the breach of the U.S. Capitol building on Jan. 6, after the company switched off its safety systems following the U.S. presidential elections.
While she believed no one at Facebook was “malevolent,” she said the company had misaligned incentives.
“Facebook makes more money when you consume more content,” she said. “People enjoy engaging with things that elicit an emotional reaction. And the more anger that they get exposed to, the more they interact and the more they consume.”
Shortly after the televised interview, Facebook spokesperson Lena Pietsch released a statement pushing back against Haugen’s claims.
“We continue to make significant improvements to tackle the spread of misinformation and harmful content,” said Pietsch. “To suggest we encourage bad content and do nothing is just not true.”
Separately, Facebook Vice President of global affairs Nick Clegg told CNN before the interview aired that it was “ludicrous” to assert social media was to blame for the events that unfolded on Jan. 6.
The Epoch Times has reached out to Facebook for additional comment.
While (editorial commentary aside) the basic facts in that article are accurate as far as I can tell, I'd be careful with that source - The Epoch Times is a mouthpiece for Falun Gong's political interests and engages in disinfo programs.
They also previously ran a large sockpuppet network on Facebook and the Facebook ad platform (both of which have since been banned) so they may have a bit of a bone to pick with the platform.
The media coverage and lots of the comments don't make sense to me. FB would not be so stupid as to put all of their crucial DNS servers into a single autonomous system (which is now offline due to BGP issues). They operate literally dozens of datacenters around the world, and are surely not using a single AS for all of them - why not put secondary nameservers there?
Can someone make a sense of this?
Sounds like automation deployed a configuration update to most of Facebook's peering routers simultaneously. Something similar brought down Google in 2019.
If so, then it would simply be a BGP issue - no FB servers reachable, as all routes are down. But the media and various claims describe a combination of BGP and DNS problems. It's hard to believe that world-wide border routers responsible only for the networks containing the DNS servers are misconfigured. I am really curious about that post-mortem :)
I think it was only a BGP issue. The DNS servers apparently shared the same peering routers as the rest of Facebook's infrastructure. Everyone focused on DNS because that's the first sign of failure to an end user.
Facebook down, WhatsApp down, but Signal still works. Time for a change?
EDIT: Yes, Signal is not federated, but that's what people are at least ready to consider as a WhatsApp alternative. I also created a Matrix / Element account, and had 0 contacts already using it.
But part of it would still be down if a server has an outage. How about a system where every device used for chatting is also a server? I wonder whether something like that already exists. Bundle it together with bigger servers to handle the load; if the bigger servers experience outages, the service can still continue, although a bit slower.
The reason contacts don't tend to show up on Matrix/Element is because we don't push the user into sharing their addressbooks given the obvious privacy issues. Instead you mainly have to figure out who you know out-of-band for now (e.g. tweet "hey, who's on Matrix?").
I would be happy to have an option to share my availability on Matrix with other people that decided the same if that would mean I could bootstrap my network on that platform.
As it is now, Matrix may offer better privacy and be more robust because it is federated and p2p, but if I have to personally ask all of my contacts whether they actually use it over some other medium, I can keep using that other medium for conversations too.
This event should be a good conversation starter on how horrifying a monopoly this trifecta of services has on worldwide communication. When I think through a random smattering of people in my contact book, I now have no way of contacting quite a few people at all. That's fucked. I wonder how many important messages, replies, etc. will be screwed up due to this.
"A small team of employees was soon dispatched to Facebook’s Santa Clara, Calif., data center to try a “manual reset” of the company’s servers, according to an internal memo."
According to https://lookup.icann.org/lookup both glean and reactjs have Cloudflare nameservers. fbinfer has ns.facebook.com nameservers which are presumably down.
The DNS outage is an outcome of faulty BGP updates. As such, not only can the Internet not see the FB network, there is also no connectivity from the FB network to the Internet right now.
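For anyone who wants to watch the BGP side from the outside, one option is RIPEstat's public data API. The sketch below assumes the routing-status endpoint and response layout from memory, so treat the details as approximate and check their docs:
```python
# Ask RIPEstat whether a prefix covering a.ns.facebook.com's address is
# currently visible in BGP. Endpoint name and response shape assumed from memory.
import json
import urllib.request

ip = "129.134.30.12"  # a.ns.facebook.com
url = f"https://stat.ripe.net/data/routing-status/data.json?resource={ip}"

with urllib.request.urlopen(url, timeout=10) as resp:
    payload = json.load(resp)

# Print the data section rather than relying on exact field names.
print(json.dumps(payload.get("data", {}), indent=2)[:2000])
```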
I had the exact opposite and it was hilarious. Every time my manager (a great guy and really good at what he did) was away for a week the sprint would go very smoothly.
Lol exactly what I was thinking. I'm trying to keep my tinfoil hat in the closet and yet it seems odd that after there is a huge FB whistleblower story on 60 Minutes last night, all of FB goes down today.
I really hope it's just some internal technical error and not a "see, despite the bad things of FB, you really need us" move.
It's probably trivial, the timing just seems weird to me.
Is this related to the outages from the Let's Encrypt root cert expiring? Probably not, since this looks like a DNS issue, but it's still a crazy coincidence that two major internet-breaking events happen in the same week.
There is zero reason to believe it's related at all. It's perfectly reasonable to have multiple large and unrelated failures in the same week.
I also wouldn't classify the loss of 1 company, and the expiration of some TLS certificates, as the interconnected network of networks being broken. The Internet has continued to function even if some larger players were unreachable or having issues.
DNS configuration is becoming a single point of failure. A few weeks ago, many services running out of AWS West 2 failed because the within-the-datacenter DNS system broke down somehow.
Unfortunately, we have literally dozens of comments that amount to nothing more than schadenfreude, and another handful of non-FANG employees speculating how one of the largest internet operations in existence could improve their game (lol)
After changing the screen resolution, all operating systems will prompt the user to confirm that the applied settings are correct; otherwise the change times out and resets to the last known good setting.
Maybe time for the core internet infrastructure to implement something similar? :)
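Router software actually has this pattern already - "commit confirmed" style commands that auto-revert unless the operator re-confirms within a window. A toy sketch of the idea (the concept only, not any vendor's implementation):
```python
import threading

class ConfirmableConfig:
    """Toy 'commit confirmed' pattern: revert unless confirmed in time."""

    def __init__(self, current: dict):
        self.current = current
        self._timer = None

    def commit_confirmed(self, new: dict, window_s: float = 300.0) -> None:
        previous = dict(self.current)
        self.current = new

        def rollback() -> None:
            print("No confirmation received; rolling back.")
            self.current = previous

        self._timer = threading.Timer(window_s, rollback)
        self._timer.start()

    def confirm(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            print("Change confirmed; keeping the new config.")

cfg = ConfirmableConfig({"bgp_announce": True})
cfg.commit_confirmed({"bgp_announce": False}, window_s=5.0)
# If connectivity survives, the operator calls cfg.confirm(); otherwise the
# timer fires and the old config comes back after 5 seconds.
```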
Wow. I can’t remember the last time WhatsApp was down. I pretty much use Messenger/Instagram/WhatsApp to talk to most of my friends and family. I’m happy that I do use other platforms, otherwise I would be completely cut off from my parents right now.
I keep trying to submit to HN but I keep getting an error.
What's wrong with the internet?
FaceBook is down.
My friend from Slovenia is having trouble with discord. It eats his messages.
I can't load photos from my friend in telegram and the messages take a relatively long time - multiple seconds! - to get received.
TrackMania players have talked about having input lag.
ycombinator is really slow and reports an error after submitting. "We're having some trouble serving your request. Sorry!" (lost count of the times i've tried submitting this)
ycombinator turned out to be giving only errors, but now seems to be working occasionally. I can not submit anything, though.
Some sites I've found via google results seem to report that they are suffering from slow connections.
With Facebook down, some large DNS servers seem to be struggling with the extra load of failing requests to look up "facebook.com". Cloudflare reports overload with their DNS server at 1.1.1.1, although that's working for me.
Billions of things worldwide are trying to connect to Facebook. The lookup which normally returns the IP address for facebook.com on the first try now requires trying a.ns.facebook.com, b.ns.facebook.com, etc. several times each before giving up. Probably several times a minute for everyone who has a Facebook app in their phone turned on. That may be using a big fraction of world DNS resources.
Vodafone Ireland seems to be struggling with a DNS overload right now, per the Irish Independent. Also, their status page can't find "Dublin" as a city.
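Rough, purely illustrative math on that amplification: a normal lookup of facebook.com is a single cache hit for a resolver, while a failing one can mean trying four authoritative servers (a–d.ns.facebook.com) two or three times each with multi-second timeouts, and impatient clients re-ask as soon as the negative result expires. The per-query work easily grows by an order of magnitude exactly when query volume spikes.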
What I think is interesting is the effects of this type of thing across peripheral news sites, like HN. I wonder how much spike HN gets with people rushing here to find out what's going on and to read the (articulate) related discussions.
Okay, let me tell you the difference between Facebook and everyone else, we don't crash EVER! If those servers are down for even a day, our entire reputation is irreversibly destroyed! Users are fickle, Friendster has proved that. Even a few people leaving would reverberate through the entire userbase. The users are interconnected, that is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go, don't you get that?
Facebook is just too big and pervasive that such an outage would be treated by its users like an internet outage or a power outage. Once it's back online, everyone will forget.
Yes, it's from "The Social Network". It's a scene where Mark Z. is explaining to Eduardo how important it is that the servers stay up all the time.
Of course it was, as far as I know, fictionalized in the first place, although it rings true (in context) to some extent. What I wonder is, how much is that true now? That is, how much downtime would FB have to experience for enough users to start leaving, to the point that it might prompt a serious exodus?
Is this in some way connected to the Facebook data leak of 1.5 billion users? The timing seems quite odd that both these things happen around the same time.
https://heyfocus.com/ worked for me, maybe it’ll help (if you’re on Mac). Addiction to social media is a real problem; thousands of engineers are paid to make sure that these products ensnare your attention. It wouldn’t be odd if it takes a few bucks of your own to rescue yourself. Don’t hesitate to seek help, no one will laugh at you.
What I find weird is that there is no indication in the app that nothing is working. I just get a cached view of everything I've seen the last few days.
Which is a feature I hate, since it does that all the time even when I have a connection. It says there are 3 comments on a post when I know there are more; opening them doesn't show them, and there's no way to refresh. But going to the web page I can see them.
% ping whatsapp.com
ping: whatsapp.com: Name or service not known
% ping web.whatsapp.com
ping: web.whatsapp.com: Name or service not known
% ping facebook.com
ping: facebook.com: Name or service not known
% ping instagram.com
PING instagram.com (31.13.65.174) 56(84) bytes of data.
64 bytes from 31.13.65.174 (31.13.65.174): icmp_seq=1 ttl=53 time=110 ms
Seems unrelated to their infrastructure; the DNS records for facebook.com, instagram.com, whatsapp.com and all derivative domains appear to be wiped clean.
edit: though saying that, they do run their own registrar... Might've fucked something up over there.
WhatsApp and Instagram are both in FB infra. As I understand it, Instagram is fairly integrated with FB services; when I left in 2019, WhatsApp was less so, it was mostly WhatsApp specific containers running with FB's container orchestration on FB machines dedicated to WhatsApp (there was and probably is some dependence on FB systems for some parts of the app, for example the server side of multimedia is mostly a FB system with some small tweaks and specific settings, but chat should be relatively isolated). Inbound connection loadbalancing is shared though.
FWIW, WhatsApp (on phones) should be resilient to a DNS-only outage; the clients contain fallback IPs to use when DNS doesn't work, and internal services don't use DNS as far as I remember.
At one time, WhatsApp had actually separate infrastructure at SoftLayer (IBM Cloud now), but that hasn't been in place for quite some time now. When I left, it was mostly just HAProxy to catch older clients with SoftLayer IPs as their DNS fallback.
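For flavour, a client-side "DNS first, then baked-in fallback IPs" strategy might look roughly like the sketch below; the hostname and addresses are placeholders, not WhatsApp's real endpoints or fallback list:
```python
import socket

# Placeholder addresses from the documentation range, not a real fallback list.
FALLBACK_IPS = ["203.0.113.10", "203.0.113.11"]

def connect(host: str, port: int = 443) -> socket.socket:
    candidates = []
    try:
        candidates.append(socket.gethostbyname(host))
    except socket.gaierror:
        pass  # DNS is down; rely on the baked-in fallback list
    candidates.extend(FALLBACK_IPS)

    last_err: OSError = OSError("no candidates")
    for ip in candidates:
        try:
            return socket.create_connection((ip, port), timeout=5)
        except OSError as err:
            last_err = err
    raise last_err

# sock = connect("chat.example.net")  # hypothetical hostname, shown for shape only
```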
I don’t think the stock dip is related to downtime; anecdotally, I’ve never seen a company’s stock affected by downtime (unless that downtime destroys the business).
You may be right, but there's a Reuters article about the downtime; this is making the news today. I would say Facebook is different because of their scale.
Looks like there are a few problems with fb in the news today ...
Indeed. People seem to forget that when Facebook goes down, it's not just your feed of depressing posts, photos and messages that go away, but also the entire Oculus VR platform, since they demanded a FB account to use Quest headsets.
One real potential cost to FB here is breaking people's addictions to FB and IG. This might just be the little finger-snap that wakes up a sizable chunk of the user base to the fact that life is just a little better during the outage.
Outage is top story on CNN and Fox. Facebook is not returning their calls. Sheera Frenkel at the New York Times has been able to get a little more info, but not much.
Now Twitter is starting to have problems with overload.
The onion site is just a reverse-proxy to the main web-site.
So if the main site is down (due to internal DNS or BGP issues) onion reverse-proxy can’t get to it as well.
DNS servers of a major internet provider in the Czech Republic are down now. Probably not a coincidence (other DNS servers' stats show increased traffic, so my guess is that Vodafone's DNS servers were unable to cope with the increased traffic and crashed: https://twitter.com/BlazejKrajnak/status/1445063232486531099).
It's crazy that half the country doesn't have internet because Facebook stopped working.
Tracing route to a.ns.facebook.com [129.134.30.12]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms eehub.home [192.168.1.254]
2 3 ms 3 ms 3 ms 172.16.14.63
3 * 5 ms 3 ms 213.121.98.145
4 5 ms 3 ms 4 ms 213.121.98.144
5 17 ms 8 ms 18 ms 87.237.20.142
6 8 ms 6 ms 7 ms lag-107.ear3.London2.Level3.net [212.187.166.149]
7 * * * Request timed out.
8 * * * Request timed out.
9 7 ms 7 ms 6 ms be2871.ccr42.lon13.atlas.cogentco.com [154.54.58.185]
10 70 ms 69 ms 70 ms be2101.ccr32.bos01.atlas.cogentco.com [154.54.82.38]
11 73 ms 73 ms 74 ms be3600.ccr22.alb02.atlas.cogentco.com [154.54.0.221]
12 84 ms 85 ms 84 ms be2879.ccr22.cle04.atlas.cogentco.com [154.54.29.173]
13 90 ms 90 ms 90 ms be2718.ccr42.ord01.atlas.cogentco.com [154.54.7.129]
14 143 ms 142 ms 143 ms po111.asw02.sjc1.tfbnw.net [173.252.64.102]
15 114 ms 119 ms 114 ms be3036.ccr22.den01.atlas.cogentco.com [154.54.31.89]
16 125 ms 126 ms 124 ms be3038.ccr32.slc01.atlas.cogentco.com [154.54.42.97]
17 91 ms 92 ms 91 ms po734.psw03.ord2.tfbnw.net [129.134.35.143]
18 91 ms 93 ms 90 ms 157.240.36.97
19 74 ms 74 ms 73 ms a.ns.facebook.com [129.134.30.12]
In the post-mortem, we'll find out that Facebook's alerting and comms systems all run on Facebook. As a result, they can't even coordinate the restart to roll back changes.
I'm genuinely not sure if the reports I heard of employees being locked out of the systems they need to fix it because their network is down are jokes or true.
The timing of this is so rich in irony I can't help but wonder if there is an element of internal sabotage. How many FB employees hate FB right now? The latest exposé of FB is both effective and truly awful. I can't imagine feeling good about an FB job. And it's gotten worse! Now they look like they can't even keep their websites up.
Perhaps we'll find out. As fun as internal sabotage would be, schadenfreude-wise, I think it much more likely this will turn out to be a time when Hanlon's Razor applies.
I feel for the sysadmins who are fighting ulcers and migraines at the moment, but I can't shake feeling that the world is just a little bit better for this small window of time.
The video is one that Tom Scott published in June 2020 about the worst typo he ever made in one of his prior jobs, and while the Facebook mistake is almost certainly not going to be anything irrecoverable like this one, you can bet that Facebook pride themselves on being available all the time.
I would have thought that these companies that are richer than $GOD would have (virtual) instances of at least the previous stable version available for situations such as this. It would at least keep their damn doors open and internal communications systems going... Maybe they'll NOW think of such things? What's the cliche, penny wise and pound foolish? Or is it, no need to listen to experienced Network Designers? I can never remember...
most of what they do, they do with in house tools, and custom-everything, including hardware. as a consequence, for some classes of problems there are no experts - not at facebook, not anywhere.
i feel for their netops people. uncharted territory with the whole world watching and, no doubt, a lot of morons from management trying to be "helpful" in getting this nice crisis resolved. for any crisis there is always a bunch of clowns with MBAs who consider it their golden opportunity to shine (nearly always at someone else's expense)
My father (https://bit.ly/3acZAAI), who is a certified CCSP Ethical Hacker and formerly worked @ZScaler/Checkpoint/Palo Alto Networks, would say that there are basically two scenarios: someone like him did it intentionally or someone like him did it by mistake.
Any other scenario of outsiders, code updates, etc - basically misses the point of how modern DNS infrastructure works.
Did I just read that the Facebook IRC fallback went down too?!? I was about to say what’s wrong with freenode ( but yeah on 2nd thoughts let’s not talk about freenode )
If you mean because the name servers are in the same zone, this is very common. When an NS is returned for a zone, you also get an “additional” A and AAAA to resolve the NS name. It’s called glue.
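You can see the glue yourself by asking a .com TLD server directly and printing the additional section. A small sketch, assuming the dnspython package is installed:
```python
import dns.message
import dns.query
import dns.resolver

# Find an address for one of the .com TLD servers instead of hardcoding it.
tld_server_ip = dns.resolver.resolve("a.gtld-servers.net", "A")[0].to_text()

# Ask that TLD server for facebook.com's NS records; the referral comes back
# with the NS names in AUTHORITY and their glue A/AAAA records in ADDITIONAL.
query = dns.message.make_query("facebook.com", "NS")
response = dns.query.udp(query, tld_server_ip, timeout=5)

print("AUTHORITY:", [rrset.to_text() for rrset in response.authority])
print("ADDITIONAL (glue):", [rrset.to_text() for rrset in response.additional])
```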
In this context, I remember the YouTube+Pakistan issue[1]. I also wonder how an AS/BGP manager does his/her job... I imagine someone changing a text file in an old console. Does anyone know?
Suspecting it might be related to the recent Let's Encrypt root certificate expiry?
Was just debugging an issue earlier today and just couldn't help wondering how much of the internet is secured by letsencrypt.
All of the static hosts providing free SSL: vercel, netlify, render, firebase hosting, github pages, heroku etc. ...
It does work on modern browsers and devices but breaks terribly on a lot of old devices.
Obviously not possible to check right now to provide proof, but I feel quite confident in saying that Facebook does not use Let's Encrypt. It's also clearly not an SSL issue.
When I worked there they were all about open source projects to build it themselves and control the service. Well, when your whole company is run on one DNS service this is going to bite you in the butt.
I only know of a handful of SaaS apps they didn’t build internally. Sadly none of those will help them get out of this situation.
Reminds me of a story Jack Fresco used to tell, where financial workers were unable to get to work because a bridge was not usable. People were worried about terrible consequences if all these important people were unable to do their work. To their surprise life just continued as if nothing changed.
Reading the thread, I'm surprised at the number of nearly identical "How much do we have to pay to keep it down? xD" posts I'm seeing, often from throwaway accounts. Some accounts with multiple near-identical posts within the same minute.
This is all left-brain implementation with looping and classic complexity coming home to roost. As we move through time, we build off of solutions of the past which solved a problem, but complexity keeps adding up, and this is a classic programming/computer science dilemma.
I unblocked Facebook in my hosts file just now so I could message someone, and couldn't figure out why Facebook still failed to load. I tested HN and voila, I see that the entire world has sent Facebook requests to 0.0.0.0 lol
I didn't receive expected WhatsApp messages and am only now realizing there's no indication within the app that there is even a problem. It only becomes (somewhat) apparent when sending a message never gets a single check mark. Not a graceful failure for the user view.
Perhaps allowing Facebook, WhatsApp and Instagram to merge was efficient after all - now that they have synchronized outages, people finally have a chance to get on with their lives, free of clickbait news and misinformation.
Can productivity (or emotional stability) for the overall US economy be tracked on a daily basis? I wonder if a wholesale Facebook outage would show up on that graph as a brief blip in the positive direction.
I find myself a little bit happy that it's down. I use Facebook quite often, but mostly because everyone else I know uses it. If everyone is forced to find an alternative, that'd be fine by me.
The only person I've ever heard of being fired for an operational error was a principal networking engineer at Amazon who end-ran DNS policies and hand-edited a zone file. Somehow, the file got truncated. It brought down everything including the soft phones so people couldn't even spin up a phone-based conference call to deal with it. I think Amazon was down for several hours, with 8 digit losses. That was in the mid 2000's. Heard that person was fired but don't know for sure.
If a single person can cause the failure during the course of their normal tasks, it's not the fault of that person it's the fault of designers of the systems and processes used by that person.
This question doesn't deserve downvotes. While the answer is quite clearly in the negative (this will be a process failure, not a human failure), it looks as though it was asked in good faith, and might not be so obvious to those outside the industry.
Vote buttons are not a substitute for proper responses to legitimate enquiry.
How could anyone answer that question? We don't even know that an engineer made a mistake in the first place, much less what the mistake was and what led up to it.
Not only Facebook, but also Google, Zoom, Telegram, YouTube and many more internet services/products/providers have been affected since 8:00 AM today. This is more like an internet outage.
Glad to report that Facebook's DNS in China is not affected.
You can dig facebook.com and the depths of the internet will happily reply with a random IP address, as usual.
I'm pretty sure this has been building up since the morning (Germany). I've had odd connectivity problems to a number of sites including slack for a moment.
Getting everything back up again will probably be a nightmare. Imagine all the internal services trying to reach a consistent state after such a long outage.
So they managed to remove facebook.com from 1.1.1.1 and 8.8.8.8. That is impressive. Not something anyone could achieve in such a short time even by trying.
This is very hard to get exactly right, because traffic isn't constant at all times, and you don't know if people won't just make up for lost time using facebook at another time of the day, etc. So you can't really know.
But, a good rule of thumb right now is about $10,000,000 per hour.
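As a sanity check on that ballpark: Facebook reported roughly $86B of revenue for 2020, and $86B / 8,760 hours ≈ $9.8M per hour, so ~$10M/hour of ad revenue at risk is at least the right order of magnitude (setting aside whether the spend simply shifts to later).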
The joy that people are getting from this is quite shitty. I hate social media but there are people earning a living working for these companies. Like others have pointed out, businesses and neighborhood watches rely on tech like this.
At some point we've all had sites/apps go down; in a situation like that, the last thing you want is people enjoying it.
The lack of empathy in this thread is telling.
Maybe. There are reports (i.e. unverified tweets) that employees cannot access sites due to the security systems also being down. I imagine email and messaging for employees would be down too.
It may be very hard for employees to get to the physical boxes, and/or bypass any physical or software security systems.
Likely $0. Ad views lost now will likely be made up for later. And even if there is a reduction in views, it just makes other views more valuable. Facebook doesn't have real competitors, so the money isn't going anywhere else.
This is a great argument for the antitrust authorities to break up Facebook. Allowing the big social media companies to buy each other creates a single point of failure. If Instagram and WhatsApp were separate companies, a technical disaster at one would not take out the other two.
I think the mods throttled logged in users to discourage over-discussion and thread creation. There’s no rhyme or reason for read-only to be snappy and logged-in to be crawling.
My home connection with my ISP, Vodafone Ireland, is down, so I guess there is such a big churn in Vodafone from the FB BGP routes that it blew up Vodafone's network. Is it a DNS or a routing issue?
It's not hyperbole to say that this is going to literally save lives.
Cutting off Facebook's firehose of hate and misinformation for just a couple hours is going to have an obvious positive effect on millions of people. At this scale, at least one person will get vaccinated today because they didn't see the wall of ignorance that is FB's news feed.
Maybe we should introduce "digital blue laws", where one day a week, social media is shut down for the overall good of society.
Of all the big tech companies Facebook is the only one where it can completely disappear overnight and my life would be completely unaffected (or possibly improved by not having to explain to people I don't use facebook, please email or text me your invitations rather than use messenger). If Google, Amazon, Netflix, Apple disappeared the story would be completely different.
Facebook is an unparalleled titan in the realm of advertising and WhatsApp is basically a utility-level communication system for a big chunk of the globe. Instagram is a key cultural driver of the Western world. You may not feel any direct firsthand consequences, but the overall impact would transform the world around you.
I find this kind of comment fascinating because it's illustrative of how humans can form intentional blindspots as to the utility of a person or institution when all they care about are the negative aspects of that person's or institution's existence.
op: "I don't care about thing X disappearing"
re: "While you may not care about it because of Y, X also provides benefit Z to other people"
op: "But would there be any drawbacks?"
Yeah, there would be drawbacks: other people would lose Z, which may matter a lot to them even if it doesn't matter to you. Someone just told you about Z, and you responded as if you hadn't just been told about Z.
These days I find it incredibly frustrating to deal with people who have conclusively decided they don't like something and that renders them incapable of acknowledging other benefits that said thing provides even if those benefits aren't relevant to them or are less relevant than the things they vocalize caring about.
I can agree with the "intentional blindspots" argument but turn it right around.
I'd like to explicitly note that the parent post did not say "X also provides benefit Z to other people" - it asserted "Facebook is an unparalleled titan in the realm of advertising", which is a substantially different thing. That's not something some people simply don't care about and a benefit to others; treating those statements as equivalent is a (very large) intentional blindspot. The current way advertising is done (driven, in part, by FB) is also a harm to many people and society at large, so publicly making the implicit assumption that "advertising" is at most neutral is not okay - it's something that should be called out.
This very "unparalleled titan in the realm of advertising" aspect is a major cost on society, a net harm that perhaps should be tolerated if it's outweighed by some other benefits FB provides (such as the "utility-level communication system for a big chunk of the globe"), but as itself it's definitely not something that should be treated as benign just because some people get paid for it.
If FB advertising disappeared with no other drawbacks, that would be a great thing. Of course, there are some actual drawbacks, but even so it's quite reasonable to motivate people to ask about the actual drawbacks of FB being down, because "oh but ads" (with which the grandparent post started) is not one.
Thank you, I agree with everything you said here. But I'd also like to address the other things I was answering with the drawbacks quip...
> WhatsApp is basically a utility-level communication system for a big chunk of the globe.
Unfortunately, it's not an actual utility though, which is precisely my point. It's pure folly to build your business around a pseudo utility owned by a private company.
> Instagram is a key cultural driver of the Western world.
I honestly have no idea how this is being presented as a good thing. A "key cultural driver of the western world" is an app whose entire purpose is to harvest your data and sell it to dodgy partners who will use it to usurp democracy.
There are several people earning their living through Facebook/Instagram and there is a whole marketplace that would impact lots of people. Don't get me wrong, I don't use or like FB in any way but FB disappearing overnight would definitely have drawbacks for lots of people.
Replace Facebook in your post with human trafficking :)
Obviously I'm not serious, and it's popular sentiment here that "Fuck Facebook... oh, but I use Instagram and WhatsApp of course!", but the point was that "some people are making a living on x" isn't really a counter-argument to "x is harmful and we might be better off without it".
My time on Facebook made it abundantly clear how racist, misogynist and otherwise vile a large portion of the people I grew up with are. I was much happier having a superficial contact with them once every ten years at a high school reunion. I'm no longer on Facebook (or Twitter).
Occasionally, I'll see/hear/do something and think that it would have made a good status update/tweet, but then I remember that these things happened to me for decades before social media was a thing and life was fine. Some I'll share with my wife or a friend, most just disappear, and that's fine too.
I did. Facebook also spent a lot of time dumping stuff in my newsfeed from people I wasn't friends with (Twitter also liked to do this). It was a lot easier to just not have all that crap in my life.
Why? It is not as efficient. I can buy everything from stores but I use amazon, same thing. I don't actually use facebook though because I don't care about anyone really but for people that care, it is a solid platform.
There is a gap between "I want to know what people I know are up to" and "I want to meet with those people one by one to see what they are up to". Some people just want to passively watch and that is ok.
"I felt a great disturbance in the DNS. As if millions of influencers suddenly cried out in terror and were suddenly silenced. I feel something terrible has happened. But you better get on with your content curation."
Facebook is implicated in genocide in multiple countries, and Instagram is nothing but a psychotic lie factory designed to induce depression and self loathing in young women.
But your friend groups would probably be able to migrate to Signal/Discord/Hangouts/etc quite quickly if WhatsApp were to disappear, no? WhatsApp has the network effect on its side by way of existing, but that could change quickly if given a push.
Sure. But you might not get everyone back - you'd have to have an alternate method of talking to the folks to get them to switch and meet up in the same place. You'd have this if the service just slowly died (like landlines), but not if something breaks instantly - forever. I'm guessing we've all had this when games died (especially old text-based MMORPG's, for example. So many people gone).
After using Telegram, WhatsApp is a complete piece of garbage; if it disappeared from the face of the earth it would surely be for the best, as people would move on to alternative messengers.
IIRC, it's e2e by default for audio/video; for text chats, it can be enabled by marking a chat as 'secret'. Is it true E2E? Probably not (i.e. Telegram has keys that can be turned over to any government; no one argues with that).
Does WhatsApp have a true E2E either? Ask hundreds of moderators employed by Facebook who review WhatsApp messages flagged as improper and the chat history around them...
However, accepting the fact that neither of the services is truly secure, Telegram experience as a service is much better for an average user.
> for text chats, can be enabled by marking chat as 'secret'. Is it true E2E? Probably not (i.e. Telegram has keys that can be turned over to any government, noone argues with that)
That was my problem, and your confirmation means it's still as good as nothing.
> Does WhatsApp have a true E2E either? Ask hundreds of moderators employed by Facebook who review WhatsApp messages flagged as improper and the chat history around them...
If one of the ends decides to share a message, it's still E2E. That is the big difference.
> If one of the ends decides to share a message, it's still E2E. That is the big difference.
True. But you can't prove that "one of the ends" is necessarily a human and not logic in the app code, or an intended backdoor - e.g., automated logic scanning for 'malicious' messages on-device.
Maybe Google, because of the search engine. Android: somebody will fill the void.
Messaging: people switched in hordes to every new free messaging system in the 90s and early 2000s; we will adapt to something else.
Netflix and video in general: same thing without the 90s/early 2000s.
Amazon: very convenient store, we'll spend a little less and somebody will fill the void.
Apple: can't say, never bought anything from them.
By the way, when I couldn't message on WA today I thought they had finally cut me off because I still haven't accepted their new privacy policy from months ago :-) I resolved to wait and see for a couple of days.
I dunno. If AWS went away suddenly, or if Google Search/the G-Suite suddenly stopped existing, the internet as we know it would need some time to recover.
> Messaging: people have been switching on hordes to every new free messaging system in the 90s and early 2000s, we will adapt to something else.
Back then the IM population was a lot smaller. Also, with "Free Basics" and other things in some regions of the world, Facebook plays a game which makes it impossible to switch. (Using WhatsApp is free; for the others one has to buy mobile data credits.)
Man, out of those, only Netflix going down wouldn't cause a gigantic, billions-of-dollars clusterfuck for people, businesses and companies. It's nice that you don't use them, but just about everyone around you does, and mostly for at least some important things.
Yes, it's very possible; on the other hand, it would give an opportunity to companies closer to the people, ones that pay users for their data. After all, users are their asset for generating very significant revenue - it would be great if they shared their profits!
It didn't help that telecoms used to use SMS as an extreme profit center. I don't think WhatsApp would have taken over the way it did if SMS was always included in all plans for free. This is similar to the way "local" long distance used to be such a racket.
Most UK plans included unlimited SMS for a long time, but whatsapp still took over.
The group chat functions don't really exist in SMS (maybe in MMS but they never work properly), photos (same), whatsapp desktop, you can text when you have WiFi but no 4G (or using a different sim card when travelling), etc.
The problem for me is Oculus. I really love their headset and I appreciate the investment Facebook has made in that.
I hate the stupid strategy tax that makes me have an FB account to use their headset, and has it go down when they have an outage. I hope they can learn from MSFT that "Facebook Everywhere" is ultimately a self defeating strategy.
Not OP. They _can_ but good luck trying to convince parents of that. They're not tech savvy enough to install apps themselves. They have simple questions about why Whatsapp cannot be installed in a basic Nokia phone for instance. It's not easy to convince them to use Signal or Telegram or anything else.
Because combined with the abysmal state of education in most places, and a general lack of government action, Facebook is an actual threat to our civilization.
People unfortunately love the upsides of misinformation, or perhaps it's the format that makes it easy to build community around shared (misinformed) values, to rally in battles that rage for hours or days for a cause you deeply believe in and can follow by digesting 30-second soundbites on social media and 30-minute videos on YouTube.
People will do this wherever they can talk in a group online, not just Facebook properties. It's... pretty bad actually, I think the only tool that exists right now is censorship, because the bullshit gets created, spread, and wholeheartedly received way faster than debunking will.
And censorship is a power that can't be safely entrusted to anybody.
I don't necessarily disagree, but often I hear FB or other tech companies like Twitter singled out re: misinformation. News media contributes to misinformation and contributes to a warped, partisan, permanently-in-catastrophe-mode population just as much as FB, Twitter, and other mediums.
I doubt, if FB goes away, that any of the issues you're implying will go away or even get much better. In fact, the lack of a real look into the negative effects of consumer news products reinforces this idea that only the elite can know the truth, and the masses just have to get in line and shut up.
News media proliferated nonsense from fed sources to justify the Iraq war, they gave Trump 24/7 airtime for a while because it increased ratings. They constantly forgo any real accountability for their actions, and pretend that they aren't just another addictive consumer product that warps peoples' brains.
> It's a parent's job to educate your children. There are much worse things than Facebook out there.
I'm guessing that either you're not a parent, or your kids aren't teens.
But most parents of teens realize that kids, and especially teens, are often much more influenced by things like social media and peers (and peers via social media) than by the influence their parents have on them.
It actively uses its algorithm to radicalize racists and conspiracy theorists, and when it discovered that's what it was doing decided to keep doing it because it was good for the bottom line:
An alternate explanation is that the algorithm tries to promote engagement and user retention. Presumably, people susceptible to radicalization engage with the content discussed in the article. It would be unreasonable to expect Facebook to not act in its own self-interest.
> An alternate explanation is that the algorithm tries to promote engagement and user retention. Presumably, people susceptible to radicalization engage with the content discussed in the article. It would be unreasonable to expect Facebook to not act in its own self-interest.
That's the whole point. Oh they're just trying to make a buck like everyone else is exactly the problem.
They are running a paperclip maximizer that turns passive consumers of misinformation into "engaged" radicals, and the system that is Facebook has no incentive to correct this.
To recap, you seem to be concerned that all social media are allowing posts to become popular, and those posts sometimes promote hatred towards conservatives or liberals.
Two questions:
- What do you think should be done about the legacy media that is doing the same?
- Should social media promote boring posts, or actively censor political content in favour of a certain viewpoint, or anything else? Perhaps a real-life name registration for anyone with over 1000 followers, like in China?
> those posts sometimes promote hatred towards conservatives or liberals.
Incorrect assertion. Those posts promote hatred and/or violence toward humans for traits those humans did not choose. e.g. race, sexual orientation, etc.
Legacy media aren't actively amplifying the voices and recruiting efforts of white supremacists.
Facebook is. They acknowledge that they are. They chose to actively allow and encourage it for profit.
Maybe "intentional" in quotes. My money is on a major security breach and they've shut everything down until they can deal with it. Even if you go to Instagram by the IP address [1], you get a 400 error. So it looks like things are off line because they want them off line for now.
In parts of South America it's used for all sorts of things. Want to know when your bus is arriving? The bus company likely only knows because the driver is WhatsApp'ing them status updates.
hrm, bgp and dns. It's weird when decades old technology somehow fails like this. The main reason distributed systems is hard is because of the time component. Whenever you add timeouts to an algorithm, everything becomes orders of magnitude more difficult to reason about, as the number of states grows without bound. In any case, this is an epic outage and sad.
Perhaps tomorrow, the brave man or woman responsible for this beautiful screw up will step forward in HN for an outstanding ovation. Whoever did this, thank you! As a souvenir I took a screenshot on my phone.
This would be a golden opportunity to launch your 'Facebook Killer' app. Preferably a social network where people don't pay with their data, but with, you know, a thing called Money.
I guess the "prophets" at Victory Channel / Flashpoint called down Holy Fire on the Facebook infrastructure in retribution ... https://youtu.be/FbSkFuvqFdA?t=1127 . (I'm an Evangelical Christian but those folks are nuts ... Mario Murillo, Lance Wallnau, Hank Kunneman, Gene Bailey, etc.)