BPF at Facebook and beyond (lwn.net)
185 points by Tomte on Oct 10, 2019 | 61 comments



Excuse me while I shill for my employer, but we're indeed big fans of BPF at Facebook.

Our L4 load balancer is implemented entirely in C++ that emits BPF bytecode, and relies on XDP for "blazing fast" (comms-approved, totally scientific replacement for Gbps and pps figures...) packet forwarding. It's open source and was discussed here on HN before: https://news.ycombinator.com/item?id=17199921
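For readers who haven't worked with XDP before, here is a minimal sketch of what an XDP program looks like (illustrative only, not the load balancer's actual code; all names are made up). It parses the Ethernet header, counts IPv4 packets in a map, and passes everything on to the kernel stack:

    // Minimal XDP sketch: count IPv4 packets and let everything through.
    // Built in the usual libbpf style, e.g. clang -O2 -target bpf -c xdp_count.c
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } pkt_count SEC(".maps");

    SEC("xdp")
    int xdp_count_ipv4(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* The verifier insists on an explicit bounds check before any read. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;

        if (eth->h_proto == bpf_htons(ETH_P_IP)) {
            __u32 key = 0;
            __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);
            if (val)
                __sync_fetch_and_add(val, 1);
        }

        /* A load balancer would rewrite headers and return XDP_TX instead. */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";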

We discussed how we use eBPF for traffic shaping in our internal networks at the Linux Plumbers Conference: http://vger.kernel.org/lpc-bpf2018.html#session-9

We presented how we enforce network traffic encryption and catch and terminate cleartext communication (again, you guessed it, with BPF) at Networking@Scale: https://atscaleconference.com/events/networking-scale-3/ (video coming soon, I think.)

Firewalls with BPF? Sure, we have 'em. http://vger.kernel.org/lpc_net2018_talks/ebpf-firewall-LPC.p...

In addition to all these nice applications, we heavily rely on fleet-wide tooling constructed with eBPF to monitor:

  - performance (why is it slow? why does it allocate this much?)
  - correctness (collect evidence it's doing its job like counters and logs. this should never happen, catch if it does!)
...in our systems.


> - performance (why is it slow? why does it allocate this much?)

One of the pieces of fleet-wide tooling that heavily uses eBPF is PyPerf, which we talked about publicly at Systems@Scale in September ("Service Efficiency at Instagram Scale" - https://atscaleconference.com/events/systems-scale-2/ - video also coming soon, I think).
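PyPerf itself has to understand Python interpreter frames, which is a lot more involved, but the general pattern such BPF profilers build on is roughly the following sketch (made-up names, not PyPerf's actual code): attach a program to a perf sampling event, record a stack ID per sample, and count hits per stack in a map that user space reads out later.

    // Sketch of a sampling profiler's BPF side: count samples per stack ID.
    #include <linux/bpf.h>
    #include <linux/bpf_perf_event.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_STACK_TRACE);
        __uint(max_entries, 16384);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, 127 * sizeof(__u64)); /* max stack depth */
    } stacks SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u32);   /* stack id */
        __type(value, __u64); /* sample count */
    } counts SEC(".maps");

    SEC("perf_event")
    int on_sample(struct bpf_perf_event_data *ctx)
    {
        long id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
        __u64 one = 1, *val;
        __u32 key;

        if (id < 0)          /* stack could not be captured */
            return 0;
        key = id;
        val = bpf_map_lookup_elem(&counts, &key);
        if (val)
            __sync_fetch_and_add(val, 1);
        else
            bpf_map_update_elem(&counts, &key, &one, BPF_ANY);
        return 0;
    }

    char _license[] SEC("license") = "GPL";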


Is there any public code for these perf tools?



Woah, I hadn't seen that... sounds like py-spy but implemented in BPF. That's crazy and cool: https://github.com/benfred/py-spy


Has there been any comparison of whether XDP/eBPF packet processing is faster or slower than pure user-space packet processors like Cisco's VPP, especially the ones with support for DPDK and zero-copy? I ask because user-space packet processors are extensively used as virtual switches in container deployments. Assuming XDP/eBPF is faster, I wonder if, in combination with namespaces, there could be an efficient "virtual switch" implemented completely in eBPF.


Personal experience spanning my time at Twitter and Facebook: DPDK is definitely faster, but it is (/was) a pain in the butt to program and to share the NIC with the host. XDP lets you reuse a great many packet parsing / handling capabilities in the kernel, while in the DPDK world you're shipping a tiny stack with your app.

Re: a virtual switch implemented in BPF: see Cilium's work to connect containers with their BPF-based connectors.


That depends on the NIC. Mellanox's bifurcated driver is very easy to share with the kernel. Intel's, not so much. Do you mind mentioning which NICs are used at FB?

Also, for GP, is anyone using VPP in production?


There is not even a need for a bifurcated driver. The upstream kernel has AF_XDP, which Intel and Mellanox NICs support in their drivers, and DPDK has official integration for it as well (https://doc.dpdk.org/guides/nics/af_xdp.html). This will make deploying DPDK significantly easier for those that need/want to use it, and it allows sharing the same driver for pushing packets up to DPDK and into the normal kernel stack with very close to "native" (as in user-space driver) DPDK performance (the target is ~90-95%).


I'm aware of the AF_XDP support, but it's brand new, and certainly nowhere near as mature as the normal mode. It doesn't even work on most NICs yet.

But I agree that once it's more mature, it will be better overall.


From the latest DPDK docs in your link:

"Current implementation only supports single queue, multi-queues feature will be added later.

Note that MTU of AF_XDP PMD is limited due to XDP lacks support for fragmentation."

That's a non-starter for many use cases, since multi-queue support is a fundamental feature of DPDK.


Predominantly, some internal teams at Cisco. VPP and its CNI (Ligato Contiv) seem to be picking up steam of late, going by activity on the mailing lists. I know Yahoo Japan uses it.

http://events19.linuxfoundation.org/wp-content/uploads/2018/...


Sort of a 'virtual switch', but much more beyond that, and all with a BPF data plane. Cilium provides connectivity but also security policy, load balancing, and introspection for container pods. It's a preferred choice for Kubernetes-based workloads.


The article mentions that an individual server can run up to 100 different BPF-related programs. I get the firewall and traffic shaping stuff, but are there any unusual use cases or hacks you could share?


The number increases because there are various monitoring tools: stuff that extracts more data from TCP retransmits so we have a better understanding of congestion and errors in the network path, or things that simply collect certain system events for security event detection. Keep in mind that, for each such event, what gets injected is counted as a separate program.

On top of these, which are common to every machine, service owners can deploy their own BPF programs for specific use cases. In fact, our self-service tracing tooling is also a BPF program. We talked about it back in 2014, when it did not yet use BPF: https://tracingsummit.org/w/images/6/6f/TracingSummit2014-Tr...
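For a flavor, a retransmit counter of the kind mentioned above could look roughly like this (a minimal sketch with made-up names, not Facebook's actual tooling): hang a kprobe off tcp_retransmit_skb and count retransmits per process.

    // Sketch: count TCP retransmits per PID via a kprobe.
    // Note: retransmits often fire from softirq context, so PID attribution
    // is approximate; real tooling reads socket details from the skb/sk.
    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 10240);
        __type(key, __u32);   /* PID */
        __type(value, __u64); /* retransmit count */
    } retrans SEC(".maps");

    SEC("kprobe/tcp_retransmit_skb")
    int count_retransmit(struct pt_regs *ctx)
    {
        __u32 pid = bpf_get_current_pid_tgid() >> 32;
        __u64 one = 1, *val;

        val = bpf_map_lookup_elem(&retrans, &pid);
        if (val)
            __sync_fetch_and_add(val, 1);
        else
            bpf_map_update_elem(&retrans, &pid, &one, BPF_ANY);
        return 0;
    }

    char _license[] SEC("license") = "GPL";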


> Facebook, he began, has an upstream-first philosophy, taken to an extreme; the company tries not to carry any out-of-tree patches at all. All work done at Facebook is meant to go upstream as soon as it practically can. The company also runs recent kernels, upgrading whenever possible.

I was chatting with a Facebook engineer about their use of BPF this summer and heard the same thing, which was surprising to me. There seem to be a number of companies that take advantage of Linux being licensed under the GPL and keep their own forks/patches of the kernel that they use internally (anecdotally, I've heard Google does this), and they stay on some old version, which apparently Facebook doesn't do.


Upstreaming means the 'cost of maintenance' is shared. If you have out-of-tree patches and upstream changes the API, either you have to update your patches or you have to keep running an older version, possibly missing out on other improvements.

Getting things upstream can be more time consuming, but it's a worthwhile investment IMO.


Right. A lot of companies seem to do this calculus and come up with “my proprietary add ons are worth more than upstreaming this and having to update the patches whenever we want to upgrade”, unfortunately :(


Either that or the code you developed for the Linux kernel is so special no one else wants it and you can't get it merged.

I've seen this happen with Linux kernel stuff as well as other OSS projects.


> There seem to be a number of companies that take advantage of Linux being licensed under the GPL and keep their own forks/patches of the kernel that they use internally (anecdotally, I've heard Google does this), and they stay on some old version

:::cough::: Amazon :::cough:::


Well, Facebook is just a user of compute, not a seller of it. Google and Amazon sell compute as a business. The incentives are different. Facebook, despite its size, will never be a large fraction of Linux usage, so they are much wiser to benefit from the improvements of the community and to give to and take from master. Amazon sells compute, so their contributions would be shared by only the three major cloud vendors.


I can't wait until this is the norm. Facebook (and presumably Google, if they wanted to) can do this with an army of engineers and some excellent continuous deployment systems. I'm hoping these kinds of systems become commoditized over time so regular companies can stay close to the latest kernel version at all times.


Take a look at the list of major contributors to the kernel, covering just the top 50 between v5.0 and v5.3:

   $ git log v5.3...v5.0 | grep "^Author:" | cut -d"@" -f2 | sed 's/>//' | sort | uniq -c | sort -n | tail -50
        167 samsung.com
        188 sang-engineering.com
        190 acm.org
        192 broadcom.com
        196 collabora.com
        206 ingics.com
        207 infradead.org
        207 lixom.net
        219 pengutronix.de
        223 glider.be
        232 c-s.fr
        232 microchip.com
        247 mediatek.com
        251 roeck-us.net
        329 st.com
        341 fb.com
        341 socionext.com
        346 netronome.com
        360 ti.com
        380 canonical.com
        384 linuxfoundation.org
        387 nvidia.com
        388 codeaurora.org
        398 arndb.de
        424 embeddedor.com
        460 renesas.com
        469 chris-wilson.co.uk
        498 baylibre.com
        504 suse.com
        516 chromium.org
        539 suse.de
        540 lst.de
        554 oracle.com
        652 nxp.com
        682 bootlin.com
        690 davemloft.net
        697 linutronix.de
        774 arm.com
        781 linux.ibm.com
        948 google.com
       1052 linaro.org
       1230 huawei.com
       1315 mellanox.com
       1333 linux-foundation.org
       1477 linux.intel.com
       1501 kernel.org
       1851 amd.com
       2229 redhat.com
       2548 intel.com
       4373 gmail.com

The usual suspects are in there. (After a quick double check, it turns out microsoft.com just missed the cut-off with 140 commits, plus 2 from linux.microsoft.com; Amazon has authored 71, and Alibaba 139.)

No one wants to maintain out-of-tree patches; it's a complete pain. Google was doing it extensively with the Android project for the longest time, but they've been working hard at getting those all upstreamed to drastically reduce the work involved in updating the Android kernel.


Keep in mind most of us don't use our @fb.com addresses. LWN tracks contributions by company, keeping track of who works where, so our numbers are much higher than this list shows.


I believe the same applies to Amazon to an extent, too.


Applies across a bunch of companies (I use my IBM email but I know quite a few kernel devs here who prefer to keep their open source contributions under their personal email).


A lot of them have it tied to PGP and other stuff so it makes sense. Identity matters in OSS.


Google can't necessarily upstream everything because of social problems in the kernel process. For example their datacenter TCP improvements have never been accepted by the gatekeeper of the net subsystem, which was a significant motivation to develop QUIC.


I'm not sure where you heard this. Their DCTCP extensions have never been posted to a public list as of today. Pretty much all of the core TCP developers for the (upstream) kernel's networking subsystem are employed by Google and doing an excellent job. That said, I would love to see their extensions integrated into the upstream tcp_dctcp module.


Is Facebook running DCTCP in production these days?


They did get BBR into the kernel, though, and many moons ago BQL too, which was a prerequisite.


Isn't DCTCP generalized by TCP Prague and L4S? If those get the IETF stamp of approval and the potential patent issues around L4S get sorted out, I'd guess they would be implemented in the upstream Linux kernel pretty quickly.


Social problems, a.k.a. Linux must work for everyone and not just Google.


Reinventing TCP over UDP is sort of silly; I hope they have a better reason than "they don't want to upstream our changes", lol.


I think the inability to upstream changes into Windows and (ironically) old versions of Android are bigger motivations for using UDP.


Isn't it a pretty good reason? gRPC is terrible in a datacenter context without Google's internal TCP fixes that Linux won't adopt (and which have been advocated for in numerous conference papers since at least 2009). If they are steadfast cavemen, what other workaround exists?


Apparently Microsoft is considering gRPC as a future replacement for WCF, so that might change. https://news.ycombinator.com/item?id=21055487


The standard workaround is to send short messages using UDP and long ones using TCP.


What parts of gRPC are fixed by using it over QUIC vs. TCP (presuming intra-DC traffic and equally long-lived flows)?


Latency caused by packet loss. TCP needs microsecond timestamps and the ability to tune RTOmin down to 1ms before it is suitable for use in a datacenter. With the mainline kernel TCP stack you are looking at, at a minimum, a 20ms penalty whenever a packet is dropped.


TCP over UDP seems rather silly to me, but congestion control and segmentation in userland is pretty useful. Especially so, since Google and partners have made an ecosystem where kernel updates on deployed devices don't happen.


BPF is awesome. We build a full Rust toolchain that targets it: https://github.com/solana-labs/rust


That looks cool! Any chance it gets upstreamed into Rust proper? And if so, is there a place to keep track of that progress?


We would love to, maybe in a few months as we stabilize the changes.

One challenge is that bpf upstreaming is much harder. We had to add support for relocations, spilling, multiple return values, and a few other things that might not be needed by the C bpf folks.


> We had to add support for relocations, spilling, multiple return values, and a few other things that might not be needed by the C bpf folks.

I'm curious why C BPF programs wouldn't need this.


I'm a novice at it, but I have written a few small eBPF programs in assembler.

To elaborate on what aey said in the sister comment: since eBPF allows user applications to generate code that runs in the kernel, the language is kept pretty simple so that it's easier for the kernel to verify that the generated code isn't doing something bad. If you look at the documentation, the register set is deliberately small (classic BPF has just two registers; eBPF has ten general-purpose registers plus a read-only frame pointer):

https://www.kernel.org/doc/Documentation/networking/filter.t...

There are also limitations on the size of eBPF programs the kernel will allow, and on looping and such, so that a single user is less likely to DoS the entire system with a bad program (whether purposefully or accidentally). It's not really that similar, but I had the amusing thought that writing eBPF assembly vaguely reminded me a little of writing TIS-100 code.

http://www.zachtronics.com/tis-100/


There is no Linux use case asking for it. We happen to be using BPF outside of the kernel.


Ah, that explains it. To be clearer, I should have probably instead asked why you needed those things ;)


Bounded loops and concurrency management sounds pretty awesome. Can't wait till Cloudflare's next write up on BPF with these new features.


Is there a difference between BPF and eBPF?


Short answer: no, same thing.

The original BPF is a much simpler bytecode. eBPF extended it and made it semantically close to x86_64 assembly. Then we all decided to call eBPF just BPF, for reasons.


Isn't it the case that you still need original BPF for seccomp?


The old classic BPF is still used by seccomp, but it is transparently converted into eBPF in the kernel. In the kernel we dropped the notion of eBPF and just call everything BPF (the old classic BPF is pretty much a thing of the past and is not being extended / developed any further).
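For context, seccomp filters are one of the few places where classic BPF still gets written by hand. A minimal, hypothetical example (not tied to any particular project) that kills the process on ptrace(2) and allows everything else:

    /* Classic-BPF seccomp filter: the kernel converts this to eBPF internally.
     * Real filters should also validate seccomp_data.arch before trusting nr. */
    #include <stddef.h>
    #include <stdio.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    int main(void)
    {
        struct sock_filter filter[] = {
            /* Load the syscall number from seccomp_data. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* If it's ptrace, fall through to KILL; otherwise skip to ALLOW. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* Required so an unprivileged process may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
            perror("seccomp");
            return 1;
        }
        printf("filter installed\n");
        return 0;
    }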


Is there an easy-to-use Golang library for eBPF? I know that Cilium / Cloudflare are attempting to build one, but I gather that the project is not yet ready.


Can you define ready? The functionality of the library is solid; it's based on code we're using in production. We've not committed to a stable API, however.

(I'm one of the maintainers of said library.)


Cool that the term isn't defined once in the post.


It's hard to decide what we need to define and what we don't. In this case, I kind of opted for assuming that the readers know what BPF is, given that a fairly high percentage of our articles seem to be about BPF these days. Still, I'll try to include a link next time, sorry.

Meanwhile, the LWN kernel index (https://lwn.net/Kernel/Index/#Berkeley_Packet_Filter) will lead you to more information about BPF than you ever wanted.


This is what the HTML <abbr> tag is for.


The target audience for LWN doesn't need eBPF defined for them.


This other article seems to explain it https://lwn.net/Articles/740157/


It's a pretty technical post; I'm guessing its intended audience knows what BPF is?



