How Facebook deals with PCIe faults to keep its data centers running reliably (fb.com)
120 points by rbanffy on June 3, 2021 | 48 comments



Is anybody else frustrated that this somewhat lengthy article was relentlessly vague about the actual numbers? It was so consistent that it felt like something you would present to a person allergic to technical detail.


This is likely very deliberate. Publishing error rates of vendor equipment could violate NDAs or just damage vendor relations. The number of units purchased can be used by stock market analysts to predict FB growth or vendor sales, so the PR/legal team might prevent the publishing of details which could be used to that effect. Those are my guesses as to why actual numbers would be left out.


One of the many reasons that I love Backblaze as a cloud backup solution is that they publish these numbers regularly. If there are bad drives in the marketplace, we as consumers have a much better view of it than Amazon reviews et al.

https://www.backblaze.com/b2/hard-drive-test-data.html


I believe this is called "technical marketing"

Does Facebook still want to launch its own AWS clone?


Not sure, if they did, who would end up being the bigger asshole. My opinion of AWS is already at an all-time low, but my opinion of Facebook has always been low too. At least it would introduce some competition for AWS.


AWS at an all-time low? What planet are you getting your insights from? Because down here on Earth they are making billions of dollars. AWS is an absolute juggernaut.

And we as users get an insanely well-made product. I just delivered an analytics solution built with AWS components and it's frigging great to work with. Maybe with one exception: IAM.


Meanwhile I'm sitting here after four visits to every dashboard I can find, deleting and turning things off, and I still can't figure out what the fuckers are charging me for, because they don't care to improve that.


I'm talking about their moral/ethical compass.


Do they lie, cheat, steal, or could you elaborate on problems with their moral/ethical compass?


Oh my...so much to learn...


Some competition for AWS? Seems like Azure and GCP are competing pretty hard already, don't you think?


Until there's something like CDK for Azure and GCP, I am hesitant to try them. Being able to leverage a mature language like TypeScript, with its completions and type system, to code up infrastructure is absolutely wonderful. The docs are lacking a lot of examples, mind.
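For the curious, the appeal is roughly this (a minimal sketch using CDK's Python bindings rather than the TypeScript ones the commenter prefers; the stack and bucket names are made up):

    from aws_cdk import App, Stack
    from aws_cdk.aws_s3 import Bucket
    from constructs import Construct

    class StorageStack(Stack):
        def __init__(self, scope: Construct, id: str) -> None:
            super().__init__(scope, id)
            # Typed constructs instead of raw YAML/JSON templates: the editor
            # can complete and type-check every property here.
            Bucket(self, "Artifacts", versioned=True)

    app = App()
    StorageStack(app, "StorageStack")
    app.synth()  # emits CloudFormation, same as `cdk synth`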


CloudFormation is a better tool for infra.


Pulumi?


I know many people who won't dare touch GCP with a ten foot pole simply because they're afraid of Google's random banning AI wiping their digital lives.

Azure is a minefield that's only worth it when on a .NET stack.


For any companies in retail, using AWS can be a hard sell because of Amazon being a competitor. GCP/Azure are much easier for them.

As for Google banning accounts, that is not a thing past a certain size; Google would be in breach of contract. At that level it's not a personal GSuite account paid for by credit card, it's a business account with a more substantial contract, SLAs, etc.

Turning off a whole service for a company violating terms of service (like what happened with AWS+Parler) is another case.


Re Azure, what makes you say that? Can you be specific?


I had to spend literal weeks going back and forth with Microsoft to get my MPN account running. The end result was to go somewhere deep in Power BI to unlock some random DNS setting. No, I'm not kidding; I have kept all that stuff in my Twitter DMs with their support. Utter nightmare.


What is MPN? That's not a core product or service I recognize.


Recently I tried to set up some pretty basic Office 365 stuff, and in the end I had to pay someone from Upwork to run a bunch of PowerShell commands because the admin UI simply ignored my changes or timed out when I tried to set things. Between that, the absolutely dreadful, confusing product URLs, and so on, the chances of me ever trusting MS enough to run servers for me are slim.


I recently built a bunch of stuff on Azure, and the product limitations are absolutely insane. I came up with a new term in the aftermath of this project: "Almost Minimal Viable Product" (AMVP). It's like an MVP, but not quite.

Just in the last few weeks I hit these fun "broken by design" issues:

Availability Sets decrease your availability because they force big-bang changes for the member VMs. They flat out prevent one-VM-at-a-time changes for large categories of settings, such as SKU family, Accelerated Networking, and Proximity Placement Groups. VMware had similar features, yet no such limits, over a decade ago.

Speaking of availability sets, you can create one with the number of fault domains set to "1", which makes sure that your critical servers are all plugged into the same power rail and will fail together, ensuring disaster. You can't change this parameter.

Oh don't worry, their doco helpfully tells you to work around these glaring issues by deleting the VMs and recreating them. Except that this wipes out a bunch of settings and data that can't be recreated. Data loss is their official solution!

Speaking of data loss: You can't move a VM from one Recovery Vault to another without permanently deleting its backups first.

Other than that, Recovery Vault is a great product with only a few small feature gaps, such as the inability to back up Ultra SSD disks. You know: the type used for the most important VMs!

They NAT IPv6. I still can't get over that. You can't do anything if you enable IPv6 anywhere. For example, they just released Virtual WAN, but it has exactly zero support for IPv6. It just flat refuses to work with it. Ditto for NAT Gateway, which will refuse to NAT IPv4 if you have IPv6 enabled.

Speaking of IPv6: They generously hand them out in blocks as large as 16 addresses at a time. You get a whole /124 range all to yourself!

Stopping a VM can take up to half an hour, sometimes 2-3 hours. I hope you weren't making those aforementioned big-bang changes!

They have Gen 2 images for Windows, but not Windows + SQL Server. In fact, SQL Server has a random subset of the images you'd expect it to have, with gaps all over the place.

You can enable OS-level ("Guest") metrics, but you can only see them one VM at a time, not in any multi-VM view. You cannot imagine how fiddly this is to enable through any kind of automation.

Recently, Log Analytics randomly stopped collecting IIS logs worldwide. The fix is to restart the service manually. This went on for like a week.

Some of their managed certificates are validated based on the "TLD" name, not the DNS zone name. So if you have "dev.myapp.dept.org.megacorp.com", then you have to figure out who receives these emails at the head office. In a different time zone. PS: They've never heard of you, and this looks 100% like a phishing attempt. PPS: This is totally broken for some domains, it goes to the wrong one by design.

Look, I could go on, but listing all of the showstopper issues I encountered while doing rather trivial stuff in just the last few weeks would require several hours of typing, and I'm tired because I was up until 9:30pm waiting for Azure VMs to take their sweet time to reboot.


That sounds painful indeed. I've never had to use any of these services or features on any other cloud so I can't compare. I've certainly heard that Azure Networking isn't great, but then again I am not someone who ever has needs that can't be met by what is being offered.

It sounds like you mostly deal with the Infrastructure level services - VMs, availability sets and networking.

What are your thoughts on the PaaS offerings (and there are many - too many to the point it gets confusing)? The Log Analytics issue seems very surprising - definitely something I'd expect to recover quickly without the need for intervention.


I used or touched most of their flagship services, including a bunch of PaaS stuff: DNS, App Service, Service Fabric, AKS, Front Door, etc...

Microsoft's Azure team simply doesn't have quality in its vocabulary. Everything they do misses the mark, and PaaS is significantly worse than IaaS, especially for performance. Just barely good enough? Ship it! Not good enough? Ship it anyway!

The first thing I noticed about App Service is that if you use ARM templates, there's a different schema for the "Primary" slot and the other named slots. This is an insanely bad design, and should have been caught very early on and never seen the light of day. That team is just beyond lazy: instead of updating the ARM schema when they introduce new features, they just shove them into barely-documented (or undocumented) environment variables that the platform picks up. So in other words, the "bag of app settings" isn't just the settings used by your app, it is also the system configuration! This makes it nigh impossible to factor out reusable chunks of ARM templates, because they blend wildly unrelated things into the same flat list of variables. Things like: Regional options, network routing(!), and App Insights monitoring settings are side-by-side with your app settings. It's nuts.

But the performance issues just blew my mind. App Service is shockingly slow. Microsoft runs it in their VNets, and then tunnels the traffic through basically a VPN gateway running on virtual machines. So if you need private VNet integration, your latencies go from merely disappointing to fantastically bad. Think up to 10 ms for a "ping" HTTPS REST call, or 3-7 ms for a SQL "print 1" statement to their "Business Critical" tier! It's absurd.

For comparison, in the IaaS space, they're catching up with the network performance that AWS or GCP have been providing for a while. The combination of Proximity Placement Groups and Accelerated Networking reduces latency to about 50 microseconds, which is very good, and very noticeably speeds up practically all applications. Combined with the new AMD EPYC VM SKUs, I have yet to see a speed-up of less than 2x compared to the older Intel SKUs without those networking features.

Unfortunately, 100% of their PaaS components run without the aforementioned features. All of it. The Private Endpoints? Reedy little VMs running on old Intel CPUs with software emulated NICs. Azure SQL Database? Ditto. App Service? No acceleration, and can't be placed in a proximity placement group to be close to the Azure SQL Database! In some locations you have a mere 20% chance of your web server being put into the same data centre as your database! Crazy.

My impression is that their insistence on using IPv4 for everything is the source of most of their woes. Everything has to be NAT-ed multiple times in a typical PaaS app, or even tunneled or proxied, which is madness. If they had just embraced IPv6 early on and used it for all of their PaaS services, they could have eliminated an awful lot of complexity while boosting performance very dramatically. For example, there are at least four or five different, unrelated ways of connecting App Service to a VNet, none of which would be required at all if they just used IPv6: https://docs.microsoft.com/en-us/azure/app-service/web-sites...

Someone really needs to bang some heads together at that place and explain to them that scalability is not the only concern, and that latency also matters.

And availability.

App Service has no Zone Redundant option! Azure SQL Database does, but not the matching App Service. So a typical 2-tier PaaS application has mismatched high availability capabilities at the various tiers. Again, how did nobody notice this and fix it years ago? Boggles the mind.

I could seriously go on and on for hours, like how Azure DNS only collects metrics every 2 hours, which means if you make an administrative change you don't get to see the impact for at least an hour. At which point you see a ZERO in the graph (that only shows values in 1 hour intervals) and you have a panic attack.

Or how Azure Front Door supports none of the technologies you'd expect and actually slows down most web applications. It's missing all of the following: Brotli, 0-RTT, HTTP/3, ECC certificates, TLS v1.3, OCSP stapling, and probably more that I've forgotten. The competition, like Cloudflare, typically supports all of these and more. Don't worry, you can now enable HSTS headers, but they charge you $50/mo for the privilege.


I'm calling BS on this. Total lack of any specifics, and I don't need to know what brand of hardware is giving them fits. I've been responsible for tens of millions of dollars in hardware over the years of my career and PCIe has never been the source of my headaches. Disk drives, memory, and firmware bugs are the usual suspects. They can get right out of here with their holier-than-thou attitude on this.


I worked at a large cluster computing company and we did occasionally, very occasionally, see PCIe problems. Note that a lot of people are now exporting PCIe over a cable, not just plugging into the mainboard, and that can be a source of problems ('oops, the PCIe cable was routed in a location that made it experience more EMF, vibration, and physical damage, and then it started to show more errors').

These sorts of problems mainly show up if you're running your own fleet, designing your own servers (poorly/aggressively) and have a budget of $10B.


I don't mean to be rude, but did you ever look? Swapping parts because something is wrong would fix weird PCIe errors as well.

Also, PCIe usage has probably grown exponentially over the past 10 years. It's possible Facebook has an order of magnitude (or two) more PCIe "links" in their datacenters today than 10 years ago.


> also important to rate-limit remediations and repairs as a safety net to prevent bugs in the code from mass draining and unprovisioning, which can result in service outages if not handled properly.

This is the most important bit for someone reimplementing this...

Never let automation 'run wild' - always have a maximum number of machines per second it can act on, and a maximum percentage of the fleet that can be unhealthy before it stops taking things out of service.
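In Python, the whole safety net is tiny (hypothetical names and thresholds, not anyone's production code):

    import time

    class RemediationLimiter:
        """Gate automated repairs: a rate cap plus a fleet-wide circuit breaker."""

        def __init__(self, max_per_minute=5, max_unhealthy_fraction=0.02):
            self.max_per_minute = max_per_minute
            self.max_unhealthy_fraction = max_unhealthy_fraction
            self._recent = []  # timestamps of recent remediations

        def may_remediate(self, fleet_size, unhealthy_count):
            now = time.time()
            # Keep only actions inside the 60-second window
            self._recent = [t for t in self._recent if now - t < 60.0]
            if len(self._recent) >= self.max_per_minute:
                return False  # acting too fast; a bug may be mass-draining machines
            if unhealthy_count / fleet_size > self.max_unhealthy_fraction:
                return False  # too much of the fleet is down: halt and page a human
            self._recent.append(now)
            return True

The fleet-wide check is the important one: it fails closed when the automation itself is what's breaking the fleet.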


I agree wholeheartedly. Computers can do the wrong thing at incredible speed sometimes. This reminds me of the anecdote that nuclear submarine designers knowingly don't automate some things, because if the automation failed, it could cause the loss of the entire sub quickly enough that the situation is beyond salvage by the time humans can react. Instead, they have an extensive checklist and do all procedures with two people, plus a third over an audio line. Even if it does go wrong, it will probably go wrong slowly enough that a concerted damage control effort from the rest of the crew can contain the problem.


> Bad…link speed…and bad…link width…were other concerning PCIe faults. These faults can be difficult to detect without some sort of automated tool…

Most modern PCIe PHYs now have the ability to raise interrupts when down-training occurs, much like they would for AER errors or hard link failures. Does FB use their own silicon in these data centers? Having this feature enabled is crucial when you get up to Gen 4 speeds. Weirdly, I don’t see any detail about the generation used here, though.


FB doesn't want hardware to run at lower than rated speeds. Their tool allows them to detect when it happens and remediate the issue.
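The detection itself is simple on Linux; a rough sketch of the general technique (not FB's actual tool) using the standard sysfs link attributes:

    from pathlib import Path

    def downtrained_links():
        """Yield PCIe devices whose negotiated link is below capability."""
        for dev in Path("/sys/bus/pci/devices").iterdir():
            for attr in ("link_speed", "link_width"):
                cur, cap = dev / f"current_{attr}", dev / f"max_{attr}"
                if not (cur.exists() and cap.exists()):
                    continue  # not a PCIe device, or no link attributes
                try:
                    cur_v, cap_v = cur.read_text().strip(), cap.read_text().strip()
                except OSError:
                    continue  # some devices can't report link state
                # e.g. current "2.5 GT/s PCIe" vs max "8.0 GT/s PCIe"
                if cur_v != cap_v:
                    yield dev.name, attr, cur_v, cap_v

    for bdf, attr, cur_v, cap_v in downtrained_links():
        print(f"{bdf}: {attr} negotiated {cur_v}, rated {cap_v}")

Caveat: a mismatch isn't always a fault (some devices legitimately train down to save power), so a real tool would compare against per-model expectations rather than raw capability.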


OP claims it shouldn’t be “difficult to detect (...)”, because the hardware is working: most commercially sold host controller chips will generate interrupts and report errors, unless Facebook is using something nonstandard that doesn’t.


The hardware is reporting the errors to the kernel but not crashing the system. It's "difficult to detect" because unless you are specifically monitoring for those stats, the only issue you'll see is degraded performance on an occasional machine (assuming you are watching carefully enough to even discern the performance delta). Some of the error counters are even predictive of an issue rather than something that is actively impacting performance. The FB software is basically scraping those messages and bus stats into JSON that can be consumed by their monitoring infrastructure.
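A toy version of that scraping step, using the per-device AER counter files the Linux kernel exposes (the JSON shape is invented, and this is not FB's code):

    import json
    from pathlib import Path

    # Per-device AER counter files Linux exposes (kernel 4.17+, AER enabled)
    AER_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")

    def scrape_aer_counters():
        records = []
        for dev in Path("/sys/bus/pci/devices").iterdir():
            counters = {}
            for name in AER_FILES:
                path = dev / name
                if not path.exists():
                    continue  # device/port without AER support
                # Each file holds "ErrName count" lines, e.g. "BadTLP 3"
                for line in path.read_text().splitlines():
                    err, _, count = line.rpartition(" ")
                    if err:
                        counters[f"{name}.{err}"] = int(count)
            if counters:
                records.append({"bdf": dev.name, "counters": counters})
        return records

    print(json.dumps(scrape_aer_counters(), indent=2))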


pcicrawler looks interesting, but it's too bad that it uses sysfs, making it tied to Linux. I wonder how hard it would be to make it use pciutils, which would make it portable to FreeBSD, macOS, Windows, etc.

EDIT: There is a libpci for Python that uses pciutils...


Aren’t pciutils dependent on hardware support, though? I have a couple of Asus mobos that have weird mouse movement depending on how slightly angled the GPU is in the slot. I tried to use pciutils to diagnose the issue, but the “gamer”-grade Core i7 I have in there wasn’t on the support list, because Intel only enables PCIe diags on i9/X and Xeon chips.

...and I know a few DCs that loaded up on cheap low-grade i3/i5 chips for “near-line” computing (slow but cheap and cheerful) so they’d be SoL too.


...Why would sysfs expose something that your hardware doesn't support while pciutils doesn't?


I assumed sysfs exposes a subset of available information: that pciutils can expose more data given hardware support. I know some of the more basic pciutils ran on my computer, though.


FWIW, pciutils can use a variety of different access methods. Use `lspci -A help` to see the available access methods, and then `lspci -A $method` to use a different one. Note that one of the access methods on Linux is `linux-sysfs`, so pciutils can use sysfs under the hood.


Even though it seems to only officially support CentOS, I gave it a shot on a GCP Ubuntu instance. `pcicrawler` output some info, but `pcicrawler -t` was just blank.

Seems similar to `lstopo`


It sounds like a description of how an IBM mainframe works, where each hardware unit has fault detection and isolation.


Has Facebook ever publicly written about the major Intel chip SSE failures they ran into a few years ago?


No, because they use processor bugs as leverage to get a discount on the next batch they order.


Go on...


Facebook discovered that in their data centers heavy usage of SSE instructions caused overheating that broke the chips. It was happening frequently. Source: I worked at Facebook.


Sucks that I have to worry while reading their technical blog post that it's pretty likely that Facebook is tracking the fact that I'm reading their article.


At least they don't use a misleading/lying cookie wall. “WE CaRe AbOuT yOuR PrIvaCy, please click accept or open this huge modal with 100 checkboxes to disable all trackers one by one.”


Ah yes, because no other company puts analytics on their blog


Facebook's a disgusting company filled with icky people, but they have produced the occasional good bit of research.


I’ve been very impressed with fastText and React



