Cilium 1.0: Bringing the BPF Revolution to Kubernetes Networking and Security (cilium.io)
151 points by eatonphil on May 7, 2018 | 32 comments



I can’t be the only person thinking “What the hell is BPF?”. I know that probably means I’m not the audience, but it wouldn’t hurt them to just state that in the very first line.

Updated with an article about BPF: https://lwn.net/Articles/747551/

Really good talk by the Cilium folks that explains these concepts: https://m.youtube.com/watch?v=ilKlmTDdFgk


My first thought is that it’s “Berkeley Packet Filter”. See here for some context on what else it’s used for: http://blog.memsql.com/bpf-linux-performance/


I believe it started out as an extension of BPF. The real term they mean to use (and most people mean to use, I think) is eBPF. Calling it eBPF makes a whole lot more sense to me because it's evolved a lot since it was just doing packet filtering. Also, it's Linux-only so again differentiating eBPF from BPF (which is not Linux-only) makes sense to me. But I didn't write the post.


I believe the proof of concept for the Spectre attack used the Linux eBPF JIT.

https://googleprojectzero.blogspot.com/2018/01/reading-privi...


Or Band Pass Filter :P


Or it means people want you to think you are stupid for not knowing something that in reality doesn't exist, or doesn't have the credit and success they want you to believe it has.


The networking community is pretty insular and tends to assume that the wider world is keeping up with the hundreds of new niche networking projects, protocols, abstractions, and foundations created and left to rot every day.

BPF is hot today. Tomorrow it will be sooooo openflow.


If eBPF manages to be an exception to that, it could be very interesting.

If I understand this correctly, eBPF is a fairly general-purpose bytecode format that can be executed inside the kernel. It's safe, and there's a JIT, so it's pretty fast (is there really a JIT compiler running in kernel space?). It was originally used for packet filtering, but it's now used at various decision points in networking, and is somehow involved with tracing as well.
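
To make that concrete, here's a rough, untested sketch of what loading a tiny eBPF program through the raw bpf(2) syscall looks like (real code would normally go through a helper library, but this shows the moving parts). The two-instruction program just drops every packet; the interesting bit is that the in-kernel verifier checks the bytecode before it's allowed to run and explains itself in the log buffer:

    /* Minimal sketch: load a two-instruction eBPF socket-filter program
     * ("r0 = 0; return", i.e. drop every packet) with the raw bpf(2)
     * syscall. May need root depending on the unprivileged-BPF sysctl. */
    #include <linux/bpf.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct bpf_insn insns[] = {
            { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = 0, .imm = 0 }, /* r0 = 0    */
            { .code = BPF_JMP | BPF_EXIT },                                  /* return r0 */
        };
        char log[4096] = "";

        union bpf_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
        attr.insns     = (unsigned long)insns;
        attr.insn_cnt  = sizeof(insns) / sizeof(insns[0]);
        attr.license   = (unsigned long)"GPL";
        attr.log_buf   = (unsigned long)log;
        attr.log_size  = sizeof(log);
        attr.log_level = 1;   /* ask the verifier to narrate its checks */

        int fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
        if (fd < 0) {
            fprintf(stderr, "verifier rejected the program:\n%s\n", log);
            return 1;
        }
        printf("program loaded, fd=%d\n", fd);
        /* The fd could now be attached to a socket with
         * setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &fd, sizeof(fd)). */
        return 0;
    }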

But it could potentially go much further. Anywhere the kernel currently gets configured with data-like configuration could be replaced or augmented with an eBPF program, right? For example, instead of setting an ACL on a directory, you could set an eBPF program which would run for each attempted access and decide whether to allow it, as well as log or do other things. eBPF programs could guard the interfaces between a container and its host, allowing more flexible isolation. An eBPF program could respond to every system call a process makes, allowing behaviour like OpenBSD's pledge, only much more sophisticated.

With the right access control model (implemented in eBPF!), normal userland programs could install eBPF programs for resources they control (sockets, files, etc), potentially shifting a significant fraction of their processing into kernel mode, improving performance, reducing system call overhead, and allowing safe access to kernel facilities that are currently inaccessible. Imagine implementing a garbage collector in userspace, but being able to configure your slice of the virtual memory system in the kernel using an eBPF program.

I don't know if this will happen. But a pervasively eBPF world would be very different, and very interesting. We'll have all sorts of fun. We'll get tools we never imagined. And we'll get pwned by black hats harder than ever before.


>An eBPF program could respond to every system call a process makes, allowing behaviour like OpenBSD's pledge, only much more sophisticated.

That is actually one of the oldest and most widespread uses of BPF :) https://www.kernel.org/doc/Documentation/prctl/seccomp_filte...
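
For anyone who hasn't run into it, here's a rough sketch of what that looks like (error handling and the architecture check a real filter needs are omitted): a classic-BPF seccomp filter that kills the process the first time it calls write(2):

    /* Minimal seccomp-BPF sketch: allow every syscall except write(2),
     * which kills the process. Real filters should also verify
     * seccomp_data.arch before trusting the syscall number. */
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct sock_filter filter[] = {
            /* Load the syscall number out of the seccomp_data argument. */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* write(2)? fall through to KILL; anything else skips to ALLOW. */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        fprintf(stderr, "installing filter; the next write(2) is fatal\n");

        /* Required so an unprivileged process may install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
            perror("prctl(PR_SET_SECCOMP)");
            return 1;
        }

        write(1, "never printed\n", 14);  /* the process gets SIGSYS here */
        return 0;
    }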

If you're interested in this notion you might be interested in: https://en.wikipedia.org/wiki/Language-based_system https://en.wikipedia.org/wiki/Exokernel


I love language-based systems! The one that really got me interested was SPIN:

http://www-spin.cs.washington.edu/

And, to an extent, Taos:

https://news.ycombinator.com/item?id=9806607


FreeBSD’s firewall?


It's really exciting to see technology being moved into the mainline kernel (see the lwn/mailing list posts) and being so quickly useful to many entities doing serious work (tm) with it.

KubeCon Copenhagen just wrapped up, and I'm still working through all the talks, but here's a video on eBPF applied to tracing:

https://www.youtube.com/watch?v=ug3lYZdN0Bk&index=5&list=PLj...

RIP to people who were working on nftables.


The OP states that BPF is replacing nftables as if it were a foregone conclusion, but it's not at all certain - it's still far from happening.


It was actually the lwn post (which is basically a summary of the mailing list) that made me think that nftables was doomed.

https://lwn.net/Articles/747551/

It seems that BPF is adopting the nftables API, which seemed to be the core value-add (a reimagined iptables API), and actually delivering on the performance benefits as well.

I hold people who work on the kernel in pretty high regard, and I expect them to be pragmatic about it (the whole "strong opinions, loosely held" thing). If BPF doesn't introduce too many possible security vulnerabilities (that's about the only issue with it I can see), it might represent the best of both worlds -- the new API plus improved performance.


All this "service mesh" layer technology seems very complex. Does anyone have a link to a write-up that covers the motivations?

It seems all this could just be done with traditional networking tech, like microservice endpoints having real IP addresses and using normal application-level auth/load-balancing methods when talking to internal services.


This is more or less what Calico does:

https://docs.projectcalico.org/v3.1/introduction/

http://leebriggs.co.uk/blog/2017/02/18/kubernetes-networking...

In a Calico setup, every container (or VM - they use the term "workload" to abstract over the two) has an IP address, and talks to other containers in the usual way. AIUI, Calico does a couple of things to make that work at Cloud Scale (tm): it pushes firewall rules into the kernel, to make sure each container can only communicate with the other containers it's supposed to, and it propagates routes around, so each host knows where to send packets destined for containers on other hosts.

The routing bit is important because Calico is designed to run as a flat IP space on top of a non-flat ethernet space, ie one where the access switches are connected by routers rather than more switches. That's useful because scaling a flat ethernet network up to a huge size is apparently hard (network engineers start telling horror stories about Spanning Tree Protocol etc).

Calico still has more moving parts than I'm personally comfortable with, but it seems broadly sensible.


Agreed re: complexity. The motivation behind the heavier service meshes (not necessarily Cilium per se) is that it's a bit like AOP (aspect-oriented programming) applied at the service level.

Different orgs work at different scales and in different styles; some orgs are producing monoliths, others are producing "fat services" or "microservices" (without going too much into what that might mean).

Some orgs have template repos or base libraries (big difference!) that they use to produce services. Others just have standards and you can do it how you like but plz conform to the standard (have /healthz, use statsd or export for prometheus, etc etc).

Also, how do all the things auth to each other? Do you TLS all the things or do you have api keys and secrets n-way between all the things? Does stuff trust each other based on IP? Etc.

There are lots of "illities", but particularly various kinds of monitoring, metrics, circuit breakers, access control and so on that you can either bake in to each service independently or implement via shared code somehow or other.

Notice that the above generally implies some degree of language homogenisation (usually a sane thing to have when you take into account other illities like artifact repos, dependency analysis, coding style guides, static analysis tooling etc - adopting a new language is not "easy" at scale) or else you are rewriting all these things a lot.

Anyway one option in this whole rainbow of possibilities is that you pull some of this out of the service itself and push it into a network layer wrapper somehow.

And that is how you end up with a service mesh...

Broadly speaking, my estimation is that if your company doesn't have multiple buildings with lots of people who have never met each other, you probably don't need a service mesh. And maybe not even then.


Great explanation: AOP at a macro level. You made an interesting point about a homogeneous language across services. It’s a pressing problem on my team. I’m not alone in thinking that shared binary code for the “ilities” produces more problems in the long term than it solves.

At this point the solutions seem to be either introducing a service mesh or copy-pasting.


One of the linked blog posts has a decent overview of the what & why of BPF: https://cilium.io/blog/2018/04/17/why-is-the-kernel-communit...

The issue with microservices at scale, vis-a-vis traditional networking, is the linear scaling of routing tables, very short-lived IP connections, and long update times for massive service maps.

I think networking is complex in general, but the promise of this kind of integration (from my limited understanding) would be a clear reflection of services at the routing level, instead of a hodgepodge of ports and IP addresses that provide little context or meaning. Having lost a few half-days learning iptables nuances in our Kubernetes cluster, I can see the benefits in having an integrated stack with less impedance between architecture and routing.

Generally we don't eliminate complexity, we just move it from one place to another. I imagine this is a case where a more complex service implementation could provide a simpler user experience.


A large part of this problem is Kubernetes' decision to do some really fucked up things with iptables to avoid implementing a load balancing service (be it a proxy or intelligent DNS).

From an ops point of view, this creates a nearly opaque wall you get to run up against when troubleshooting issues; a completely unnecessary wall.


How do you compare some of the features/goals of Cilium to istio's network policy? https://istio.io/blog/2017/0.1-using-network-policy.html

Edit: Just came across this https://cilium.io/blog/istio/ :)


Hi, a Cilium contributor here.

There is a native integration with Istio; you can read the tested getting-started guide at the following URL: http://docs.cilium.io/en/doc-1.0/gettingstarted/istio/

Last week at KubeCon Europe, Thomas Graf, the founder of the project, presented some new improvements to the Envoy and TLS integration. Here are the slides:

https://schd.ws/hosted_files/kccnceu18/d9/2018%20KubeCon%20E...

Regards


Cilium lives below userspace which makes it perform better than istio. This article has more information about the differences from the cilium developer's point of view.

https://cilium.io/blog/istio/


>Cilium lives below userspace which makes it perform better than istio.

There are a lot of fast userspace networking projects that bypass the kernel precisely to be faster. Which approach is better is up for debate but the kernel is definitely not faster in all cases.


How can you bypass the kernel?


One major difference between the kernel and any other piece of standard software is access control: the kernel has privileged access to most hardware peripherals. The software for the driver is nothing particularly special; it is simply run in a context in which it has control of the hardware.

More concretely, most hardware devices are relatively easy to interface with: for example, you may simply set up a region of DMA memory, poke the hardware device with the address of this memory, and write into it, then read results back out. A NIC is a good example of such a model. This can be done with any block of memory, except normally the kernel is the only thing that can talk to the NIC (to tell it where to write to/read from).

So the main thing you need to do is pass control of the hardware to a userspace process. For the NIC/DMA example, the easiest way is to just allocate some memory, make sure it's non-swappable, and then get its physical address. You then just need a small driver to connect userspace with the hardware -- it must give you a way to tell the hardware where to read/write. Maybe it exposes a sysfs-based file with normal unix permissions (a common method). Writing an address into this file is equivalent to telling the hardware to "read here, and write there". Now you can write to the memory you allocated (in userspace) to control the NIC.

At this point, the kernel is more-or-less out of the loop completely. Of course, this is the easy part, since now you must write the rest of the hardware driver. :)
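
To make the hand-off slightly more concrete, here's a very rough sketch of the userspace side of that flow. The locked allocation and the /proc/self/pagemap lookup are real interfaces (pagemap only reveals physical frame numbers to root), but the sysfs path at the end is entirely made up -- it stands in for whatever file that small driver would expose:

    /* Rough sketch: pin a buffer, find its physical address, and hand
     * that address to a hypothetical driver so the NIC can DMA into it. */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);

        /* 1. Allocate one page and lock it so it can't be swapped out. */
        void *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        memset(buf, 0, page);  /* fault it in so a physical page is assigned */

        /* 2. Look up its physical frame number in /proc/self/pagemap
         *    (bits 0-54 of each 64-bit entry; zero unless you're root). */
        int pm = open("/proc/self/pagemap", O_RDONLY);
        if (pm < 0) { perror("open pagemap"); return 1; }
        uint64_t entry = 0;
        off_t off = ((uintptr_t)buf / page) * sizeof(entry);
        if (pread(pm, &entry, sizeof(entry), off) != sizeof(entry)) {
            perror("pread pagemap");
            return 1;
        }
        uint64_t phys = (entry & ((1ULL << 55) - 1)) * (uint64_t)page;
        printf("virtual %p -> physical 0x%" PRIx64 "\n", buf, phys);

        /* 3. Tell the device where to DMA. This sysfs path is made up;
         *    it stands in for whatever the small driver exposes. */
        FILE *f = fopen("/sys/kernel/my_nic/rx_dma_addr", "w");
        if (f) {
            fprintf(f, "0x%" PRIx64 "\n", phys);
            fclose(f);
        }

        /* From here on, the device writes packets straight into buf and
         * the kernel stays out of the data path. */
        return 0;
    }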


There is a proposal, recently merged into the kernel, to implement AF_XDP (LWN intro: https://lwn.net/Articles/750845/), which is a long-term, kernel built-in solution in combination with XDP/BPF to achieve the speed of user-space bypasses for use cases that require it. The first API pieces were recently merged, and more work, such as the zero-copy bits, will appear soon as another milestone.


The basic idea is that you get the NIC to write packets directly into your userspace applications' memory.

There are plenty of projects out there to help get the kernel out of the critical path of a networking application. DPDK and PF_RING are the two I hear about most but here is a blog from cloudflare outlining a number of others: https://blog.cloudflare.com/kernel-bypass/


You can map the ingress/egress channels of a network device directly into a process's memory in userspace. These are just memory pages in what's known as the DMA region, which the device can write to without interacting with the CPU.


Cilium performs better because the instruction cache is used efficiently, as opposed to netfilter, which needs to gather data from all over in order to do its job, leading to low cache utilization.

Kernel code doesn't run faster than userspace code.


Isn’t kernel code considered “faster” when used in networking components, because of reduced context switching? The idea being that packets need to travel through kernel codepaths anyway, so any user space filtering will slow down the packets simply due to context switching.


Has anyone found any articles comparing and contrasting Cilium with other popular k8s networking impls, e.g. Flannel, Calico, etc.?




