Excuse me while I shill for my employer, but we're indeed big fans of BPF at Facebook.
Our L4 load balancer is implemented entirely in C++ that emits BPF byte code, and it relies on XDP for "blazing fast" (a comms-approved, totally scientific replacement for Gbps and pps figures...) packet forwarding. It's open source and was discussed here on HN before: https://news.ycombinator.com/item?id=17199921.
We presented how we enforce network traffic encryption and catch and terminate cleartext communication, again, you guessed it, with BPF, at Networking@Scale https://atscaleconference.com/events/networking-scale-3/ (video coming soon, I think).
In addition to all these nice applications, we heavily rely on fleet-wide tooling built with eBPF to monitor:
- performance (why is it slow? why does it allocate this much?)
- correctness (collect evidence that it's doing its job, like counters and logs; "this should never happen," so catch it if it does!)
> - performance (why is it slow? why does it allocate this much?)
One of the pieces of fleet-wide tooling that heavily uses eBPF is PyPerf, which we talked about publicly at Systems@Scale in September ("Service Efficiency at Instagram Scale" - https://atscaleconference.com/events/systems-scale-2/ - video also coming soon, I think).
Has there been any comparison of whether XDP/eBPF packet processing is faster or slower than pure user-space packet processors like Cisco's VPP, especially the ones with support for DPDK and zero-copy? I ask because user-space packet processors are extensively used as virtual switches in container deployments. Assuming XDP/eBPF is faster, I wonder if, in combination with namespaces, there could be an efficient "virtual switch" implemented completely in eBPF.
From personal experience spanning my time at Twitter and Facebook: DPDK is definitely faster, but it is (/was) a pain in the butt to program and to share the NIC with the host. XDP lets you reuse many of the packet parsing / handling capabilities in the kernel, while in the DPDK world you're shipping a tiny network stack with your app.
Re: Virtual Switch implemented in BPF: see Cilium’s work to connect containers with their BPF based connectors.
That depends on the NIC. Mellanox's bifurcated driver is very easy to share with the kernel; Intel's, not so much. Do you mind mentioning which NICs are used at FB?
There isn't even a need for a bifurcated driver. The upstream kernel has AF_XDP, which Intel and Mellanox NICs support in their drivers, and DPDK has official integration for it as well: https://doc.dpdk.org/guides/nics/af_xdp.html This will make deploying DPDK significantly easier for those that need/want to use it, and it lets you share the same driver for pushing packets up to DPDK and into the normal kernel stack, with very close to "native" (as in user-space driver) DPDK performance (the target is ~90-95%).
Predominantly, some internal teams at Cisco. VPP and its CNI (Ligato Contiv) seem to be picking up steam of late, going by activity on the mailing lists. I know Yahoo Japan uses it.
It's sort of a 'virtual switch', but much more beyond that, and all with a BPF data plane. Cilium provides connectivity but also security policy, load balancing, and introspection for container pods. It's a preferred choice for Kubernetes-based workloads.
The article mentions that an individual server can run up to 100 different BPF-related programs. I get the firewall and traffic-shaping stuff, but are there any unusual use cases or hacks you could share?
The number increases because there are various monitoring tools: stuff like something that extracts more data from TCP retransmits so we have a better understanding of congestion and errors in the network path, or things that simply collect certain system events for security event detection. Note that what is injected for each such event is counted as a separate program.
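For a flavor of the kind of counter such tooling refines (a stdlib-only illustration, not any of Facebook's actual tools): the kernel already exposes an aggregate TCP retransmit counter in /proc/net/snmp, and BPF-based tools go further by attributing retransmits to individual flows and network paths.

```python
# Illustrative sketch only: parse the kernel's aggregate TCP retransmit
# counter from /proc/net/snmp text. BPF tooling improves on this by
# attaching to retransmit events and attributing them per flow.

def parse_retrans_segs(snmp_text: str) -> int:
    """Return the RetransSegs value from /proc/net/snmp contents."""
    header = values = None
    for line in snmp_text.splitlines():
        if line.startswith("Tcp:"):
            if header is None:
                header = line.split()   # first Tcp: line names the fields
            else:
                values = line.split()   # second Tcp: line holds the values
                break
    fields = dict(zip(header[1:], values[1:]))
    return int(fields["RetransSegs"])

# A shortened sample of the /proc/net/snmp format:
sample = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens RetransSegs\n"
    "Tcp: 1 200 120000 -1 4027 1523\n"
)
print(parse_retrans_segs(sample))  # -> 1523
```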
On top of these, which are common to every machine, service owners can deploy their own BPF programs for specific use cases. In fact, our self-service tracing tooling is also a BPF program. We talked about it back in 2014, when it did not yet use BPF: https://tracingsummit.org/w/images/6/6f/TracingSummit2014-Tr...
> Facebook, he began, has an upstream-first philosophy, taken to an extreme; the company tries not to carry any out-of-tree patches at all. All work done at Facebook is meant to go upstream as soon as it practically can. The company also runs recent kernels, upgrading whenever possible.
I was chatting with a Facebook engineer about their use of BPF this summer and heard the same thing, which was surprising to me. There seem to be a number of companies that take advantage of Linux being licensed under the GPL and keep their own forks/patches of the kernel that they use internally (anecdotally, I've heard Google does this), and then they stay on some old version, which apparently Facebook doesn't do.
Upstreaming means the cost of maintenance is shared. If you have out-of-tree patches and upstream changes the API, either you have to update your patches or you have to keep running an older version, possibly missing out on other improvements.
Getting things upstream can be more time consuming, but it's a worthwhile investment IMO.
Right. A lot of companies seem to do this calculus and come up with "my proprietary add-ons are worth more than upstreaming this and having to update the patches whenever we want to upgrade", unfortunately :(
> There seem to be a number of companies that take advantage of Linux being licensed under the GPL and keep their own forks/patches of the kernel that they use internally (anecdotally, I've heard Google does this), and then they stay on some old version
Well, Facebook is just a user of compute, not a seller of it. Google and Amazon sell compute as a business, so the incentives are different. Facebook, despite its size, will never be a large fraction of Linux usage, so it is much wiser for them to benefit from the community's improvements and give and take from the master branch. Amazon sells compute, so its contributions would be shared by only the three major cloud vendors.
I can't wait until this is the norm. Facebook (and presumably Google, if they wanted to) can do this with an army of engineers and some excellent continuous-deployment systems. I'm hoping these kinds of systems become commoditized over time so that regular companies can stay close to the latest kernel version at all times.
There are the usual suspects in there (after a quick double-check, it turns out microsoft.com just missed the cut-off with 140 commits, plus 2 from linux.microsoft.com; Amazon has authored 71, and Alibaba 139).
No one wants to maintain out-of-tree patches; it's a complete pain. Google was doing it extensively with the Android project for the longest time, but they've been working hard at getting those all into upstream to drastically reduce the work involved in updating the Android kernel.
Keep in mind that most of us don't use our @fb.com addresses. LWN tracks companies by keeping track of who works where, so our numbers are much higher than this list shows.
This applies across a bunch of companies (I use my IBM email, but I know quite a few kernel devs here who prefer to keep their open-source contributions under their personal email).
Google can't necessarily upstream everything because of social problems in the kernel process. For example their datacenter TCP improvements have never been accepted by the gatekeeper of the net subsystem, which was a significant motivation to develop QUIC.
I'm not sure where you heard this? Their DCTCP extensions have never even been posted to a public list as of today. Pretty much all of the core TCP developers for the (upstream) kernel's networking subsystem are employed by Google and are doing an excellent job. That said, I would love to see their extensions integrated into the upstream tcp_dctcp module.
Isn't DCTCP generalized by TCP Prague and L4S? If those get the IETF stamp of approval and the potential patent issues around L4S get sorted out, I'd guess they would be implemented in the upstream Linux kernel pretty quickly.
Isn't that a pretty good reason? gRPC is terrible in a datacenter context without Google's internal TCP fixes that Linux won't adopt (and which have been advocated for in numerous conference papers since at least 2009). If they are steadfast cavemen, what other workaround exists?
Latency caused by packet loss. TCP needs microsecond timestamps and the ability to tune RTOmin down to 1 ms before it is suitable for use in a datacenter. With the mainline kernel's TCP stack you are looking at, at a minimum, a 20 ms penalty whenever a packet is dropped.
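To put rough numbers on that, using the figures cited above plus an assumed intra-datacenter RTT of ~100 µs (purely back-of-envelope, not a measurement):

```python
# Back-of-envelope: how many round trips one RTO-recovered loss costs.
# The 100 us RTT is an assumption; the 20 ms / 1 ms figures come from
# the comment above.
rtt_us = 100               # assumed intra-datacenter round-trip time
penalty_mainline_ms = 20   # minimum stall per drop on a mainline kernel
penalty_tuned_ms = 1       # with RTOmin tuned down to 1 ms

print(penalty_mainline_ms * 1000 // rtt_us)  # -> 200 RTTs lost per drop
print(penalty_tuned_ms * 1000 // rtt_us)     # -> 10 RTTs lost per drop
```

Two hundred wasted round trips per drop is why a tail-latency-sensitive RPC system feels every loss so acutely.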
TCP over UDP seems rather silly to me, but congestion control and segmentation in userland are pretty useful, especially since Google and its partners have made an ecosystem where kernel updates on deployed devices don't happen.
We would love to, maybe in a few months as we stabilize the changes.
One challenge is that BPF upstreaming is much harder. We had to add support for relocations, spilling, multiple return values, and a few other things that might not be needed by the C BPF folks.
I'm a novice at it, but I have written a few small eBPF programs in assembler.
To elaborate on what aey said in the sister comment: since eBPF allows user applications to generate code that runs in the kernel, the language is kept pretty simple so that it's easier for the kernel to verify that the generated code isn't doing something bad. If you look at the documentation, classic BPF has only two registers (A and X), and even eBPF has just eleven (r0 through r10).
There are also limitations on the size of eBPF programs the kernel will allow, and on looping and such, so that a single user is less likely to DoS the entire system with a bad program (whether purposefully or accidentally). It's not really that similar, but I had the amusing thought that writing eBPF assembly vaguely reminded me a little of writing TIS-100 code.
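A toy model of why those restrictions keep verification tractable (an illustration in Python, nothing like the real eBPF ISA or verifier): straight-line code over a fixed register file, with a hard cap on program size, so every possible execution can be checked up front.

```python
# Toy model, not the real eBPF instruction set or verifier.
MAX_INSNS = 4096  # the kernel also caps program size (the real limit has grown over time)

def run(program, regs=None):
    """Execute a straight-line 'program' of (op, dst, src) tuples."""
    if len(program) > MAX_INSNS:
        raise ValueError("program too large; a verifier would reject it")
    regs = regs or [0] * 11          # eBPF has registers r0..r10
    for op, dst, src in program:     # no backward jumps here, so no loops
        if op == "mov":              # mov rD, imm
            regs[dst] = src
        elif op == "add":            # add rD, rS
            regs[dst] += regs[src]
    return regs[0]                   # r0 holds the return value, as in eBPF

prog = [("mov", 1, 2), ("mov", 2, 3), ("add", 1, 2), ("mov", 0, 0), ("add", 0, 1)]
print(run(prog))  # -> 5
```

Because the program is bounded and loop-free, a checker can reason about every instruction exactly once, which is the core trick that makes in-kernel verification feasible.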
The original BPF is a much simpler bytecode. eBPF extended it and made it essentially x86_64 assembly (semantically). Then we all decided to call eBPF just BPF, for reasons.
The old classic BPF is still used by seccomp, but in the kernel it is transparently converted into eBPF. Within the kernel we dropped the notion of eBPF and just call everything BPF (as the old classic BPF is pretty much a thing of the past and is not being extended / developed any further).
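Classic BPF's simplicity is visible in its wire format: every instruction is the same 8-byte `struct sock_filter` (u16 code, u8 jt, u8 jf, u32 k), and seccomp still consumes programs in exactly this shape. A small Python sketch that just assembles the bytes of a trivial allow-everything filter (opcode constants are from the kernel UAPI headers; this only builds the bytes, it does not install a filter):

```python
import struct

# Classic BPF opcode constants (from linux/bpf_common.h)
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_RET, BPF_K = 0x06, 0x00
SECCOMP_RET_ALLOW = 0x7FFF0000  # from linux/seccomp.h

def sock_filter(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("HBBI", code, jt, jf, k)

# "Load the syscall number, then allow everything": the smallest
# useful seccomp filter, two classic-BPF instructions long.
prog = (
    sock_filter(BPF_LD | BPF_W | BPF_ABS, 0, 0, 0)    # A = seccomp_data.nr
    + sock_filter(BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW)
)
print(len(prog))  # -> 16 bytes: two 8-byte instructions
```

A real filter would add conditional jumps comparing A against syscall numbers before the return, but the entire instruction vocabulary stays this small, which is why the in-kernel conversion to eBPF is straightforward.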
Is there an easy-to-use Golang library for eBPF? I know that Cilium / Cloudflare are attempting to build one, but I gather the project is not yet ready.
Can you define "ready"? The functionality of the library is solid; it's based on code we're using in production. We've not committed to a stable API, however.
It's hard to decide what we need to define and what we don't. In this case, I kind of opted for assuming that the readers know what BPF is, given that a fairly high percentage of our articles seem to be about BPF these days. Still, I'll try to include a link next time, sorry.
We discussed how we use eBPF for traffic shaping in our internal networks at Linux Plumber's Conference http://vger.kernel.org/lpc-bpf2018.html#session-9
Firewalls with BPF? Sure we have 'em. http://vger.kernel.org/lpc_net2018_talks/ebpf-firewall-LPC.p...