Learning containers from the bottom up

_ktx2 · on Nov 19, 2021

This is a great article.

I disagree with this:

> Now, when you have a decent understanding of containers - from both the implementation and usage standpoints - it's time to tell you the truth. Containers aren't Linux processes!

This is a bit of wordplay, I'm assuming, in absence of a word that defines the operating system features that power the concept of containers. To Linux, there is no (to my knowledge) concept of a "container". The container runtime runs your process(es) as the parent and uses the operating systems features to isolate it and restrict it/them. A virtual machine would just be a full emulated version of this, rather than using the operating system to virtualize the network stack. The author is right in that there is no such thing as a container, but only as much as containing is a thing you do, imo. What users think of containers are still just processes though, and I don't think that's an entirely useless abstraction to be cognizant of.

otterley · on Nov 19, 2021

I would go even further - containers are process trees. They just happen to be process trees with the following attributes: (a) they (usually) have separate namespaces (network/pid/uts/cgroups/mount); (b) they (usually) have dropped capabilities; and (c) they (usually) are in cgroups that have resource reservations and/or limits.

Under the hood, that's all containers are!

stevebmark · on Nov 20, 2021

I think it’s important to understand that containers aren’t Linux processes. Containers can run more than one process. Containers can be stopped and restarted even though the initial process is gone forever. And containers have their own isolated writable layer.

jandrewrogers · on Nov 20, 2021

The container runtime intercepts some syscalls, altering the observable behaviors of the kernel in ways that can adversely impact software that is otherwise perfectly designed to operate outside the container runtime. Normal processes don’t have their syscalls intercepted and this is material difference to the extent it is not transparent.

If running the same properly designed software exhibits material differences in behavior between a bare metal process and a containerized process, then they aren’t the same as a matter of practical semantics. Ironically, virtualized processes have much closer equivalence to a bare metal process than containerized processes in practice.

Saying a container is “just a process” is like saying a virtual machine is “just a process”, both are true in some sense depending on how you define “process”. But as a matter of practical engineering, they are different kinds of things.

_ktx2 · on Nov 20, 2021

Are you talking about SecComp and namespacing?

jandrewrogers · on Nov 20, 2021

The root cause is likely SecComp. The notoriously poor I/O performance of containerized code, regardless of configuration, is largely a side effect of syscall interception.

In particular it breaks software that does I/O scheduling in user space, which is idiomatic and explicitly supported by the Linux kernel, even on virtual machines, but this use case conflicts with the container abstraction so runtimes offer an ersatz version that allows the code to run albeit poorly.

yencabulator · on Nov 22, 2021

More likely you're running your containers using overlay or some FUSE thing, and that's causing the I/O slowdown.

What syscalls do you think are intercepted, how? Speaking as someone who can write kernel code, I'm not aware of any such thing specific to containers. (As far as the linux kernel is concerned, there's no such thing as a container.)

If you're talking about BPF, that can be used outside of containers, e.g. systemd can limit any unit, and using it is not part of a definition of what a container is.

jandrewrogers · on Nov 26, 2021

I have decades of experience writing this type of low-level high-performance data infrastructure code directly against the Linux kernel, deployed in diverse environments. I’ve seen almost every rare edge case in practice.

You can find many examples in the wild of reputable software that loses significant performance once containerized no matter how configured. Literally no one has demonstrated state-of-the-art data infrastructure software that works around this phenomenon, and at this point you’d think someone would be able to if it was trivially possible. I test database kernels in a diverse set of environments and currently popular containers aren’t remotely competitive with VMs, never mind bare metal. The reasons for the performance loss are actually pretty well understood at a technical level, albeit esoteric.

Every popular container system has runtimes that intercept syscalls. Whether or not Linux requires containers to intercept syscalls is immaterial because in practice they all do in a manner destructive to I/O performance.

There used to be a similar phenomenon with virtual machines for many years, such that no one deployed databases on them. Then clever people learned how to trick the VM into letting them punch a hole through the hypervisor, and we’ve been using that trick ever since. It isn’t as fast as bare metal, but it is usually within 10%. No such trick exists for containers and as a consequence performance in containers is quite poor.

_ktx2 · on Nov 26, 2021

How can an unprivileged runtime intercept syscalls of an application talking directly to a kernel? I'll go browse through the containerd code to see if I can find such a thing because I know Go pretty well, but I have never heard of a runtime intercepting syscalls. That's why application kernels like gvisor exist.

rajmann · on Nov 28, 2021

>Then clever people learned how to trick the VM into letting them punch a hole through the hypervisor, and we’ve been using that trick ever since

Can you say more on this trick. Is it available by default on the cloud VMs, or is it something that has to done in the user's code. I have seen tweets from DirectIO developers saying AWS Nitro machines are better for kernel-bypass-IO compared the Google cloud VMs. but my understanding of it was something done by Amazon Nitro card engineers/developers, and Google was working on something similar to improve the performance.

yencabulator · on Nov 27, 2021

Lots of words, very little said. How do you think e.g. docker "intercepts syscalls"?

jandrewrogers · on Nov 27, 2021

No idea, don’t really care. That these performance disparities exist isn’t controversial, several well-known companies have written about it e.g. nginx[0]. I deploy on Kubernetes, but all of our I/O intensive infrastructure is deployed on virtual machines for performance reasons — we still lose ~10% on a virtual machine compared to bare metal but a container on bare metal is significantly worse.

All high-performance data infrastructure bypasses the operating system for most things, taking total control of the physical hardware it uses. Cores, memory, storage, network. Linux happily grants that control, with caveats. It doesn’t work the same inside Kubernetes and Docker, and no one can figure out how to turn it off. Maybe you don’t work on I/O intensive applications, but for people that do this is a well-documented phenomenon. My teams have wasted far too many hours trying to coax good performance out of code in containers that works just fine on bare metal and virtual machines.

[0] https://www.nginx.com/blog/comparing-nginx-performance-bare-...

pm90 · on Nov 20, 2021

This really got me at first. Since I had seen Windows virtual vm on Linux (and vice versa), my mental model was still “full virtualization”. But the processes running in a Linux container are still _linux_ processes, they’re just isolated (fairly) well.

spenrose · on Nov 19, 2021

> The author is right in that there is no such thing as a container, but only as much as containing is a thing you do, imo. What users think of containers are still just processes though, and I don't think that's an entirely useless abstraction to be cognizant of.

Fantastic distillation. Thank you!

jjtheblunt · on Nov 19, 2021

why not think of them as process (group) spawned with particular parent process setup, in particular the cgroups etc configuration effecting isolation.

musicale · on Nov 20, 2021

In the bad old days before setns() it was more of a pain to add processes to a container since they had to be children of an existing process in that container.

kuizu · on Nov 19, 2021

A nice blog series explaining in detail each Linux kernel mechanism making up containers: https://www.schutzwerk.com/en/43/posts/linux_container_intro...

otterley · on Nov 19, 2021

Agreed - this is a far more comprehensive, logical, and technically correct explanation of how containers work under the hood.

musicale · on Nov 20, 2021

Docker and Kubernetes embody a number of design decisions that might be a good fit for some users (and for Google) but add more complexity and overhead than I usually need or want for my typical use case of basic isolation and resource limits.

Fortunately the container architecture is flexible so that you can use as much or as little of it as you like.

I also tend to think that if you want stronger isolation for security purposes then you will want a lightweight VM rather than a container (and if you are worried about side channels, probably hardware partitioning - good luck.)

porker · on Nov 20, 2021

For a quick overview of containers I found https://wizardzines.com/zines/containers/ super helpful.

ashater · on Nov 20, 2021

Good article, steps one level below container managers like Docker or k8s. Obviously not the indepth of how Linux kernel manages container processes but a good write-up.

yencabulator · on Nov 22, 2021

> ... but containers are needed to build images

Incorrect. The images are mere files(/subtrees), and you can write one however you wish.