1 and 7 need SYS_ADMIN, not available by default in any container runtime.
2 needs Docker socket, it's explicitly meant for running other workloads
3 needs shared PID namespace in addition to SYS_PTRACE, neither is granted by default by any container runtime
4 needs SYS_MODULE, again no one has that
5 and 6 need DAC_READ_SEARCH, no one grants that, no one uses that
None of those seem like vulnerabilities or things that would be available without the admin taking explicit steps to specifically want to allow escaping. Being root in the container would not be enough to get any of those capabilities.
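For the curious, you can check from inside any container which of these capabilities you actually have by decoding the CapEff bitmask in /proc/self/status. A minimal sketch (capability numbers are from linux/capability.h; the example mask is Docker's well-known default bounding set):

```shell
#!/bin/sh
# Decode a capability bitmask like the CapEff field of /proc/self/status.
# Capability numbers from linux/capability.h:
#   CAP_DAC_READ_SEARCH=2, CAP_SYS_MODULE=16, CAP_SYS_PTRACE=19, CAP_SYS_ADMIN=21
has_cap() {  # has_cap <hex-mask> <cap-number>; succeeds if the bit is set
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Docker's default capability set corresponds to the mask 00000000a80425fb.
# Inside a real container you would read it with:
#   mask=$(awk '/CapEff/ {print $2}' /proc/self/status)
mask=00000000a80425fb
for cap in 2 16 19 21; do
  if has_cap "$mask" "$cap"; then
    echo "cap $cap: present"
  else
    echo "cap $cap: absent"
  fi
done
# -> all four print "absent": none of the escape-enabling caps are in the default set
```

If the libcap tools happen to be installed, `capsh --decode=<mask>` does the same decoding with names instead of numbers.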
I've seen people who use testcontainers and run their CI workloads in containers abusing the docker.sock mounting so they can spin up the tests.
The anti-pattern of using docker.sock has always been a threat because when Docker gained popularity in CI/CD systems it was the easiest way to get a platform-independent means of spinning up isolated environments. In my perception, this was a very common pattern in Jenkins a few years ago.
A workaround for this is to run Docker Engine inside a (non-Docker) container and use that container's socket. The Docker containers become processes within the Docker Engine container.
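A configuration sketch of that setup, assuming the official docker:dind and docker:cli images (TLS omitted for brevity; you would want it in practice):

```shell
# Run a disposable Docker Engine in its own container.
docker network create ci-net
docker run -d --name ci-dockerd --privileged --network ci-net \
  -e DOCKER_TLS_CERTDIR="" docker:dind

# CI jobs point DOCKER_HOST at the inner engine instead of mounting the
# host's /var/run/docker.sock:
docker run --rm --network ci-net -e DOCKER_HOST=tcp://ci-dockerd:2375 \
  docker:cli docker run --rm alpine echo "spawned inside the inner engine"
```

Note that dind itself runs with --privileged, so this moves the trust boundary rather than eliminating it; the point is that individual jobs can no longer create containers directly on the host.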
Exactly! This is a huge pet peeve of mine. I feel like I'm going crazy when I read articles like this one (that is, any of the similar "how to escape containers" blog posts). With the exception of a rare few that actually describe 0-days, the overwhelming majority seem to be exactly like this.
It confuses the heck out of me because red-team-type people will shout from the mountaintops about how insecure containers are and how absolutely trivial it is to break out of them, but when I go look for an example of that trivialness, all I find is stuff like this. Not that these aren't at all useful techniques - there are plenty of containers with --privileged, or a Docker socket mount, etc. But surely this doesn't apply to >90% of containers out there, especially ones that are exposed to the Internet. Your average Redis or Nginx container, or some container running a Python or Node webapp, is not going to have Docker mounted or some weird capability added. Sure, misconfigs happen, us sysadmins get lazy, but this is really common-sense stuff. It feels almost unfair to "blame" it on the container.
Of course, as mentioned, there are 0-days that allow for container breakout, and those are the truly scary stuff. But they seem to be few and far between, and they get publicized semi-widely and patched pretty quickly when they are found.
So to this day I don't really understand all the security folks who act like they (as in any decent attacker, not just a nation-state with 0days up the wazoo) could break out of a container with their eyes closed, while the only material I can find on the open Internet is stuff like this. Am I just looking in the wrong places?
But these caps aren't extraordinary; in particular, I often find that profilers and similar tools require them. E.g. the manual for Nvidia Nsight specifically tells you to add SYS_ADMIN, which as we see here breaks the seal. Ergo we can _either_ run those tools _or_ have a safe containerized workflow: that dichotomy alone is useful enough to justify this article IMO.
That's fair! That's a good point, and I don't mean to invalidate those situations.
A part of me still feels like that's somewhat separate from a "typical" scenario where you want to harden a container that's exposed to the outside world in some way; like I said, your average container running a webapp or whatever. Nvidia Nsight seems to be a performance analysis tool, so I would hope nobody is running it with untrusted input or exposing it to the Internet. (Yes, I know someone out there totally is.)
Similarly, it feels near-obvious to me that adding a privilege called "SYS_ADMIN" to a container will make it more or less equivalent to root on the host. It's not like Docker hides this info from people, last I checked it's explained pretty prominently.
You're totally right overall, it still matters, it's just something about this framing that rubs me the wrong way.
Not really unfortunately. A novice sysadmin granting someone's request to add some permission called SYS_MODULE may not know that it could be equivalent to full root access. That's why posts like this are important for education.
> A novice sysadmin granting someone's request to [be able to run some tool/software that says it needs this, but is in fact secretly malicious]
or perhaps more common
> A novice sysadmin granting someone's request to [be able to run some legitimate tool or software, then later a virus in the container taking advantage and breaking out]
for a real world example, the Nvidia Nsight docs specifically direct you to add SYS_ADMIN; do you think your everyday ML engineer would know that doing so poses a security risk if they're doing this on Docker?
As someone who works in security (both pre-fail and post-fail), 95% of incidents happen because things are not configured correctly. Actual vulnerabilities in Linux are expensive, and any attacker competent enough to have one, also will not burn it needlessly, if they can get in because a SWE with a deadline gave the container SYS_PTRACE.
I think your response betrays a broken line of communication in our industry. People tend to assume most production environments are configured by people who knew what they were doing, were deploying a well-behaved application and had time and management support to do a good job. That's almost never the case, even in well-funded, well-regarded tech companies.
I find it relatively comforting that every single one of these requires disabling security features, though it is a good reminder of why certain things are dangerous to give containers and should be avoided.
In practice, though, containers are given these capabilities all the time, and in infosec people do refer to these as escapes.
Generally, it’s not the stuff the container is intended to run that’s doing the escape, though. This is usually the second step, after getting local execution through a vulnerability in the application running in the container.
I posted another comment with some examples, but I think the general answer, IME, is that it's usually easier to do the wrong thing than to do the right thing. Security boundaries in Cloud-style Linux are typically hard to configure and not well documented. Engineers have deadlines and they're interested in building things, not debugging permissions. They'll do whatever is the easiest to get the system working.
This, by the way, is why I think Google had it right by splitting most products into SWE and SRE groups, so you had a group of people who could focus full-time on the production environment. SWEs are rewarded for building and deploying stuff, so they're going to build and deploy stuff.
In my experience profilers often need SYS_ADMIN or other capabilities. This is true even outside of a container. If I'm forced to work inside a container without the ability to add that cap, performance optimization work becomes very difficult.
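For what it's worth, the grant can sometimes be narrower than SYS_ADMIN. Since Linux 5.8 there is a dedicated CAP_PERFMON capability covering perf-event access; a sketch, assuming a recent enough kernel and container runtime (the image name is a placeholder):

```shell
# The broad, escape-enabling grant that docs often suggest:
docker run --cap-add SYS_ADMIN myimage perf record -g ./app

# Narrower alternative on kernels >= 5.8: CAP_PERFMON is scoped to
# perf_event_open and friends. Depending on the runtime's seccomp profile
# and kernel.perf_event_paranoid, further tweaks may still be needed.
docker run --cap-add PERFMON myimage perf record -g ./app
```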
That's not true, those capabilities can be granted to normal users. If the container is badly composed enough, sys admins might grant a user those caps so they can avoid security alerts about running containers as root.
Actually, no. Root in the container isn't actually uid 0 on the host system if you set things up correctly. A Linux user namespace can offset uids in a container by a certain number, effectively making uid 0 actually uid 60000, uid 1000 actually uid 61000, and so on.
And in this case, if you escape the container, you are literally nobody. All you're allowed to read are public files that any user can read.
By the way, if you use a rootless container, these are the default settings you get, since you don't have access to uid 0 in the first place.
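That offsetting is literally what /proc/&lt;pid&gt;/uid_map encodes: lines of `<inside-uid> <outside-uid> <count>`. A toy sketch of the translation, using the example offset from the comment above:

```shell
#!/bin/sh
# Translate a container uid to a host uid the way a user-namespace mapping
# of "0 60000 65536" (a /proc/<pid>/uid_map line) would.
map_uid() {  # map_uid <inside-uid>
  inside_start=0
  outside_start=60000
  count=65536
  if [ "$1" -ge "$inside_start" ] && [ "$1" -lt "$((inside_start + count))" ]; then
    echo $((outside_start + $1 - inside_start))
  else
    echo "unmapped (appears as the overflow uid, usually 65534/nobody)"
  fi
}
map_uid 0      # -> 60000: "root" in the container is an unprivileged host uid
map_uid 1000   # -> 61000
```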
You're correct in that uid 0 in the container is faked through userns, but you can easily bypass that with a privileged container or by adding the CAP_SYS_ADMIN capability. Yes, you can disable creating privileged containers within Kubernetes, but that isn't something you can rely on. If the runtime is running as root, which it often is, you can also become real root within the container under these conditions.
Judging by the number of HOWTOs I see on GitHub and YouTube that tell you to check the Privileged box, this assumption isn't worth what you think it is.
This post seems confused and misleading. It only lists ways to escape containers with various non-default capabilities added, but doesn't address the other more realistic and seen in the wild ways (eg from uid=0 confusion due to disuse of user namespaces, kernel privilege escalation bugs, etc).
I don’t think it’s misleading - working in security, I can tell you that containers are misconfigured with too many privileges everywhere you look. SWEs like focusing on cool stuff, like kernel exploits, to the detriment of basic production security. From that point of view, this is an article I think many SWEs could benefit from reading, especially on teams that do their own deployments.
That's interesting. Do you mean you see containers that aren't running with "least possible privileges" narrower than the defaults, or that you often see misconfigurations such as the ones described in this article? Are people still thinking the container works as a security boundary after this, or are they knowingly turning off security to get to some functionality that otherwise doesn't work within the normal confines of a container?
Yeah, lots. The reasons are many, and pretty diverse, otherwise I think it'd be easier to fix. Some examples:
- You need to mount a filesystem, use a raw socket, etc. and Stack Overflow / ChatGPT tells you to enable a capability, so you do. It works, so you check it in.
- You're deploying something legacy, that assumes it has more control of the OS than it does inside a container.
- You're a data engineer. All of your tools assume they run as root, because that's just how data science is.
- You're building a "sidecar" for monitoring, control over other containers or something similar. The extra privileges are needed to do your job.
- You're trying to access something you can't, and it's generally easier to overprovision access than to do least privilege.
These problems aren't unique to containers and the cloud, mind you. I saw the same problems, e.g. when working on mobile device security. In general, it's a lot easier to just turn off half of SELinux than to learn how to configure it to do what you want, especially if you have a deadline.
For example, it isn't that weird to give a container access to the ptrace cap. SYS_PTRACE is an important capability if you need to profile a process for whatever reason. SYS_ADMIN is required to mount a tmpfs filesystem, even though that doesn't actually modify any file. There are cases where these are genuinely required. And unless you've read up on it, you don't know what else may cause a security issue, or why it isn't granted by default.
For interactive debugging, maybe? A container can be used as a dev environment to develop/test things that would otherwise affect the host irreversibly, for example something like an installer.
For many, many services, requiring an entire separate exploit against the kernel is a huge win. So many services can just be dropped into a container with roughly 0 effort. If you want a higher level of security it's going to take significantly more effort.
I think there's some version of the Dunning-Kruger effect going on in these comments - assuming that no one would introduce this many security flaws except intentionally. Perhaps it's that this forum tends to attract people more engaged in the CS space who wouldn't do this - but I've seen enough brute forcing in the wild to know that this ABSOLUTELY exists wherever a "just make it work" mentality is present.
You're totally right, but the part that annoys me is that articles like this one (and this sounds overly hostile, I don't intend that, but I'm not sure how else to phrase it) kind of pollute the topic of container security. I described it above, but I have this huge pet peeve where I hear "containers are insecure and trivial to break out of" and then when I go to look up examples of container breakouts, all I find is stuff like this; how to break through a wall that had a gaping, intentional hole left in it.
It feels like "breaking out of vanilla containers" and "breaking out of misconfigured containers" are two different topics, two different threat models. And while the second absolutely matters in the real world, the really scary stuff is obviously the first (and usually involves 0-days, kernel exploits, etc?). But people seem to talk less about the first.
I genuinely thought this link was about escaping a standardized steel shipping container, something I recently had to seriously consider. In that regard a disappointing click.
I also wrote my own Docker-like containerization code for educational purposes a while ago, so "container" has both these meanings for me. Yet my brain was expecting a physical escape story. Brains are funny!
They would still be vulnerable to the sort of attack described in the article, though: If the host deliberately hands the guest a socket it can use to execute commands as root on the host, there's nothing that can be done to make it secure.
Kata Containers and gVisor come to mind. Kata Containers really spins up VMs with various optimizations. gVisor uses a reimplementation of the kernel syscall interface in Go, which is also a pretty interesting idea.
Do those lightweight VMs startup (cold start) in milliseconds? Because I think that's a key thing that people are looking for from containers and VMs that might be used in their stead.
It's managing them at a bigger scale that is the challenge. Besides that, the container ecosystem gives you APIs to do all this management with established tools, in an automatic way.
All these escapes seem to involve Docker. It looks to me as if at least some of them are strictly Docker-dependent, but I've never used Docker, so I'm no expert.
You are being downvoted right now, but this is a great point. If you have execution control in someone’s container, use the container's existing secrets to achieve your goals. I don’t need to escape your web app's container to steal all of the contents of the backend database.
While there are various models of shipping containers, most of them are not airtight by design to allow circulation of air and avoid condensation and temperature build-up. They have seals to prevent water though.
Really?! I know that these things are built to be very stable, but they never gave me the impression to be air-tight. Not bad. I'm sure the world of shipping containers is a huge rabbit-hole to read into.
I think the only answer is you fire up the sawzall your captor neglected to remove before they locked you inside. Also hope there are some earplugs, because you are going to be deaf before you can cut an escape hatch.
All of these escapes rely on some obvious explicit reduction of the isolation guarantees. If you know how to escape a simple docker container invoked with default parameters such as `docker run --rm -it ubuntu /bin/bash` I'm sure many people would be interested.
"Escapes" is a vaguely defined term, but considering basic escalation of privilege:
- Networking is the obvious one in this scenario. By default (on Docker/LXC/others) all containers are on the same virtual bridge and can communicate with each other. Even with some additional configuration and isolation it is possible to MITM attack other containers on the same host.
- It is very easy to DDoS adjacent containers e.g. by spamming signals, forking new processes, creating files. There is again no default safeguard against this.
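Hedged examples of the defaults you would have to change to close those two holes (these are all real dockerd/docker flags; the image name is a placeholder):

```shell
# 1) Inter-container traffic: the default bridge allows it. Either disable
#    it daemon-wide ...
dockerd --icc=false
#    ... or put workloads on separate user-defined networks (an --internal
#    network additionally gets no outbound route):
docker network create --internal backend-net

# 2) Resource exhaustion against neighbours: pid and memory caps are
#    opt-in, not defaults:
docker run --network backend-net --pids-limit 256 --memory 512m myimage
```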
I think we must be careful not to use a no-true-Scotsman here. IMO, an attack should be considered valid if it works against a typical config - one you would arrive at by following a popular tutorial - rather than requiring that it work against an ideal config.
The definition of escaping a container (namespace) is being able to interact with resources not in the namespaces of the process.
If a process's network namespace contains a network device then it is not an escape to use it by definition. If the network namespace for a process contains no devices then being able to use a network device would be an escape.
The standard way to escape a simple Docker container with default privileges is to use a kernel LPE vulnerability. Kernel LPEs are common enough that they're found and patched without major news stories written about them.
Is putting processes in separate memory spaces not a security boundary to you? Isolating separate app responsibilities in separate UIDs? Filesystem permission bits? All those are "weaker" boundaries than containers. Do you really claim that these have ZERO security value?
That's silly. Of course containers are a security boundary. They have advantages and disadvantages. Treat them as tools and not slogans.
See Google's gvisor as an attempt at reducing the attack-surface of a container to make things more secure.
I think the general advice is that a single container can never be a robust security boundary because the OS surface area it involves is so large that the isolation layer is ripe for possible vulnerabilities. You also really have to avoid screwing up: there are a lot of fiddly little security mistakes you can make when attempting to use a container to run untrusted code.
Typically you might use something like gvisor, or a VM. Systems where isolation is simpler to reason about and the attack surface is smaller.
In any case a single isolation boundary can have a vulnerability and my understanding is that more advanced systems typically involve multiple layers of isolation to sandbox untrusted code.
> Everywhere I read treats them as a security boundary for say, untrusted code.
Who's everybody? There are special kinds of VM hosts for that. Containers are like your kitchen jars - if someone is vomiting with Ebola in your kitchen, your jars will not help you.
I think that's a great analogy, because yes, if you have a live sample of Ebola in a sealed glass jar then that will very much help you. (I would not recommend leaving the lid open by giving it SYS_ADMIN, but that doesn't mean that glass isn't a fine material for containing pathogens.)
Thinking further on the analogy, yeah it's better than nothing but I would _not_ recommend people leave Ebola in a glass jar in their kitchen. What if someone accidentally knocks it over, cutting themselves on a shard while doing so? What if someone, looking for cookies, fumbles around inside and gets it on their hands? Sure these are not "best practices" but the point is that it should be difficult to do the dangerous things, not easy and certainly not recommended by tutorials everywhere.
Do "level 4 biosafety facility" not use glass vials? I imagine the security isn't provided by the choice of containers itself (plastic?), but rather the entire lab design.
Yes, my point was vials is not all they use. There are many layers of protection: biosuits, negative air pressure, decontamination procedures at the exits, etc.
Containers should never run untrusted code, at least not without other layers of protection added on top. Otherwise stuff like https://github.com/google/gvisor would not need to exist. And even then as long as processes are sharing a host kernel they are always vulnerable. A full VM is really the minimum acceptable boundary.
>"Everywhere I read treats them as a security boundary"
The people writing those articles are wrong. Containers are insufficient for untrusted code and should not be treated as a security boundary. A virtual machine or something similar (Firecracker) can be treated as a security boundary, but not a container.
People just keep repeating the same assertion, over and over, without elaborating on exactly why containers shouldn't be used to run untrusted code. I think that's what the GP is complaining about.
The distinction is getting more and more fuzzy, so this is almost a meaningless point (as is GGP). It's very vendor-specific, let alone that I'm pretty sure Spectre attacks can work across VMs.
I wouldn’t consider the distinction “fuzzy”. Assuming we’re talking Linux (I don’t know about the Mac and Windows world) containers are implemented using namespaces and cgroups, and always have been. Whether you are talking docker, containerd, some more minimalistic thing built on runc, it’s all Linux namespaces and cgroups. And those things were explicitly not designed to act as security boundaries when running untrusted code.
Does anyone actually use Kata containers? I've tried recently to run them on a current Ubuntu platform and couldn't get it working at all after a few days of work.
If it's Ubuntu, it's possible you unintentionally had Docker installed via snap and ran into issues because of that? I had a bit of trouble getting it integrated with certain versions of Podman, but that aside, setting up Kata was pretty straightforward.
I have so far only used it for hosting some gameservers which I don't trust, i.e. some simple containers, but I really want to try it in a new k3s cluster once I get it set up and move some services there. I like the idea of putting internet-facing ones into it as an additional layer of separation and could imagine it being useful in production.
The distinction is in fact pretty clear. Conventional container runtimes are shared-kernel isolation. Virtual machines aren't: every tenant has their own running kernel.
Sure, anything that is misconfigured could eliminate a security boundary. That doesn’t mean that containers are even in the same ballpark as VMs in terms of providing a security boundary.
Why? Both are supposed to keep whatever is inside trapped unless you poke holes in that protection (say, using virtio or even just 9P to hand it real storage)
I'm willing to believe that Linux containers were not initially designed to be a security boundary, but I struggle to see why that means they aren't now; it's been over a decade and they have an awful lot of security features for something that doesn't care about security.
EDIT: For that matter, they're clearly being used for security; the features in Linux that are used by runc et al. are the same features used by eg. Chrome to isolate components in order to contain vulnerabilities.
"On Linux, Docker manipulates iptables rules to provide network isolation. While this is an implementation detail and you should not modify the rules Docker inserts into your iptables policies, it does have some implications on what you need to do if you want to have your own policies in addition to those managed by Docker.
If you're running Docker on a host that is exposed to the Internet, you will probably want to have iptables policies in place that prevent unauthorized access to containers or other services running on your host. This page describes how to achieve that, and what caveats you need to be aware of."
I mean, there was this story[0] ("How a Docker footgun led to a vandal deleting NewsBlur's MongoDB database") about how the Docker rules allowed a hacker to delete someone's database.
>Turns out the ufw firewall I enabled and diligently kept on a strict allowlist with only my internal servers didn’t work on a new server because of Docker. When I containerized MongoDB, Docker helpfully inserted an allow rule into iptables, opening up MongoDB to the world. So while my firewall was “active”, doing a sudo iptables -L | grep 27017 showed that MongoDB was open the world. This has been a Docker footgun since 2014.
Story was previously discussed on HN[1]. Sure, you could argue the author should have done more to secure the endpoint, but this was 100% a failure mode due to how Docker prioritizes convenience over security.
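For reference, the two commonly recommended mitigations for this footgun, sketched below (both come from Docker's own packet-filtering documentation): publish ports on loopback only, or filter in the DOCKER-USER chain, which Docker evaluates before its own forwarding rules.

```shell
# 1) Bind the published port to loopback, so the iptables rule Docker
#    inserts only exposes it locally:
docker run -d -p 127.0.0.1:27017:27017 mongo

# 2) Or drop unwanted traffic in DOCKER-USER, the chain Docker leaves for
#    user rules (the allowed subnet here is an example placeholder):
iptables -I DOCKER-USER -p tcp --dport 27017 ! -s 10.0.0.0/8 -j DROP
```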
Most containers do not switch the CPU VM context when the CPU switches between containers. VMs do. This is necessary to prevent attacks which can leak data through the CPU cache.
I agree that they have a bunch of security features, because they've been playing security whack-a-mole for a decade. Retrofitting a security boundary onto an existing system is very difficult. Personally Linux containers are still not at a level where I'd trust them for something that was meant to be a hard security boundary rather than a damage mitigation exercise, though obviously that's a subjective judgement.
Linux was designed as a multi-user operating system from the get go. By definition, programs from different users are supposed to run without access to data that they shouldn’t have access to.
I wouldn't say "designed from the start", I would just say virtual machines are naturally more isolated than containers. Since virtual machines are simulating hardware, compared to containers which are isolated user space instances.
But I guess that doesn't mean virtual machines aren't easily escapable without extra work, same as containers.
This doesn’t answer the question. How were VMs designed with security in mind and not with host emulation? Untrusted code? I’m confused, people are talking about vulnerabilities.
VMs don’t share the kernel with the host. Any host escape would need to happen through a device driver exposed to the guest (virtio, etc.) Containers use the same kernel, obviously a much larger set of code. More code means a greater chance of vulnerabilities.
How do you think the VM itself is spawned? Something on the host instantiates it.
But even if we ignore that, I would argue that concentrating on one location only (e.g., some exposed driver) to escape is also easier, since an attacker would need to spend time finding only one vulnerability rather than several.
I think the hardware can help bridge the gap between containers and VMs by letting userspace processes behave as VMs, which is more or less what QEMU+KVM try to do, except that it still comes with some overhead and less flexibility.
I have to disagree. With a VM, there is less code shared with the host to review and audit for vulnerabilities. The developers can go over those device drivers, system calls for VM management, etc. with a fine-tooth comb. With containers, you have essentially the entire kernel: much more surface area to potentially exploit.
Maybe I am wrong. We can wait for a security professional to comment.
When you switch between VMs and the host, the CPU executes an instruction to isolate the data of each VM entirely (flushing caches). This doesn't happen with containers.
Although this feature does improve security, it comes from the CPU and the VM design, and the main motivation was to reduce performance overheads. I do not see adding layers and layers of abstraction as the same as being more secure. In fact, Meltdown and Spectre serve as counter-examples.
Also, one can achieve similar effects with containers as well, just think AppArmor, capabilities, permissions, etc., all layers of administrative privileges between some untrusted code and the host.
There is nothing you can do in the OS that replaces the hardware VM-entry instruction (VMLAUNCH/VMRESUME on Intel). You need this precisely because it mitigates Spectre and Meltdown, assuming your microcode is up to date.
This is true, but the exploits listed in this article are poor evidence of this. If it was as simple as not giving Linux containers any capabilities or host sockets they would be a decent security boundary.
I find it super frustrating that we're stuck with kernels with inherent weaknesses to their security approach that we have to re-implement them in userspace in one way or another (gVisor, Firecracker, etc.) just to get the hardware-provided userspace/kernel boundary to work properly.
A container is a security boundary. However, no security boundary is perfect, and more boundaries must be put in place as well, depending on your threat model.
Containers are absolutely a security boundary, and an excellent one at that - at least, Docker containers are.
1. You get file, process, and network namespaces, which are a security boundary
2. You get a seccomp filter, which is a security boundary
The "containers are not a security boundary" meme needs to die.
Elsewhere you mention that "containers are not sufficient for untrusted code" but that's a very specific and very niche threat model. Most people don't say "send me a binary and I'll execute it", or have arbitrary RCE + multitenancy concerns.
Containers aren't sufficient for multi-tenant RCE because the RCE is by design so 100% of your security pressure is on the container at that point. In the vast majority of cases you're dealing with servers that don't intend to allow arbitrary code execution, and containers are an extremely easy way to drive up the cost of an attack given that the attacker has already spent a lot of time and money on the RCE.
SELinux is also not sufficient for the "RCE by design" threat model - is SELinux not a security boundary?
Further, containers can limit the impact of remote vulnerabilities like path traversal attacks, since they have file isolation by default.
edit: I see elsewhere that there's a real lack of clarity here.
First off, a security boundary can be meaningfully defined as a limitation on an attacker that does not have a way around it without additional exploitation.
So the main reason why people have said "containers are not a security boundary" is because:
a) Very, very early on, escaping a container was trivial - like you could just ask to leave and you'd be out.
b) There were some blog posts basically saying "containers aren't sufficient for multi-tenancy" where arbitrary users can run arbitrary code on the same host. This is still the case today - but it's also an extremely rare threat model.
Why would containers not be sufficient for (b)? Because the majority of the Linux kernel is still exposed to the attacker within a container - the vast majority of system call interfaces are exposed (but seccomp removes a number of these, which is nice). The Linux kernel is not at all sufficiently hardened against attackers who can make arbitrary system calls, therefore containers are not sufficient against those attackers. If you give the attacker RCE by default (ie: your service is "Send me a binary and I'll run it") then an attacker can spend all of their time and money just on a local privesc, which isn't crazy difficult.
Since the cost of RCE in an RCE-aaS is 0 the consensus is that containers aren't strong enough for RCE-aaS threat models. In that case use Firecracker or gVisor or a dedicated host.
Otherwise, RCE costs tend to be pretty high and having to develop an additional LPE on top of one is, at minimum, quite a pain for many attackers.
Containers are extremely easy to deploy software into, something like a Firecracker VM is not. Containers are basically just processes, so you can monitor them and manage them trivially. Monitoring and managing VMs with processes inside of them is obviously harder. So I think the 'bang for your buck' with containers is extremely solid.