Hacker News
How to Escape a Container (panoptica.app)
198 points by chuckhend on Dec 20, 2023 | 135 comments



1 and 7 need SYS_ADMIN, not available by default in any container runtime.

2 needs Docker socket, it's explicitly meant for running other workloads

3 needs shared PID namespace in addition to SYS_PTRACE, neither is granted by default by any container runtime

4 needs SYS_MODULE, again no one has that

5 and 6 need DAC_READ_SEARCH, no one grants that, no one uses that

None of those seem like vulnerabilities or things that would be available without the admin taking explicit steps to specifically want to allow escaping. Being root in the container would not be enough to get any of those capabilities.
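To make the "not available by default" claim concrete: a process's effective capabilities show up as a hex bitmask on the CapEff line of /proc/&lt;pid&gt;/status, and the bit numbers below are the real constants from linux/capability.h. A minimal sketch (the helper name is mine) that checks whether any of the capabilities listed above are present:

```python
# Capability bit numbers from linux/capability.h.
DANGEROUS = {
    2:  "DAC_READ_SEARCH",  # techniques 5 and 6
    16: "SYS_MODULE",       # technique 4
    19: "SYS_PTRACE",       # technique 3 (also needs a shared PID namespace)
    21: "SYS_ADMIN",        # techniques 1 and 7
}

def dangerous_caps(cap_eff_hex: str) -> list[str]:
    """Decode a CapEff hex mask (from /proc/<pid>/status) into cap names."""
    mask = int(cap_eff_hex, 16)
    return [name for bit, name in sorted(DANGEROUS.items()) if mask & (1 << bit)]

# Docker's default capability set: none of the dangerous bits are set.
print(dangerous_caps("00000000a80425fb"))  # -> []
# A --privileged container holds them all.
print(dangerous_caps("0000003fffffffff"))  # -> all four names
```

Running `grep CapEff /proc/self/status` inside a container and feeding the value to this function shows at a glance whether an admin has opted in to any of these escapes.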


I've seen people who use Testcontainers and run their CI workloads in containers abuse docker.sock mounting so they can spin up test containers. Mounting docker.sock has always been an anti-pattern and a threat: when Docker gained popularity in CI/CD systems, it was the easiest platform-independent way to spin up isolated environments. In my perception, this was a very common pattern in Jenkins a few years ago.
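To spell out why a mounted docker.sock is game over: anything that can reach the socket can ask the daemon to create a fresh container that bind-mounts the host's root filesystem. A sketch of the request body an attacker would POST to the Engine API's /containers/create endpoint — the field names (Image, Cmd, HostConfig.Binds) are the real Engine API ones, but this only builds the payload; it is not a working exploit client:

```python
import json

def breakout_container_spec(image: str = "alpine") -> dict:
    """Container spec that bind-mounts the host's / and chroots into it."""
    return {
        "Image": image,
        "Cmd": ["chroot", "/host", "sh"],      # shell on the host's rootfs
        "HostConfig": {"Binds": ["/:/host"]},  # mount host / at /host
    }

# This JSON, POSTed over the mounted socket, yields root on the host.
payload = json.dumps(breakout_container_spec())
print(payload)
```

Nothing here is a vulnerability in Docker: the socket is explicitly an admin-level control channel, which is exactly why handing it to CI jobs is so dangerous.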


A workaround for this is to run Docker Engine inside a (non-Docker) container and use that container's socket. The Docker containers become processes within the Docker Engine container.


AFAIK you can't do that without giving the Docker-in-Docker container privileged permissions, which allows escaping.


You can do what Docker Desktop does- put the whole thing in a VM :^)

You still can't run untrusted code but at least you can isolate stuff like Compose environments


Exactly! This is a huge pet peeve of mine. I feel like I'm going crazy when I read articles like this one (that is, any of the similar "how to escape containers" blog posts). With the exception of a rare few that actually describe 0-days, the overwhelming majority seem to be exactly like this.

It confuses the heck out of me because red-team-type people will shout from the mountaintops about how insecure containers are and how absolutely trivial it is to break out of them, but when I go look for an example of that trivialness, all I find is stuff like this. Not that these aren't at all useful techniques - there are plenty of containers with --privileged, or a Docker socket mount, etc. But surely this doesn't apply to >90% of containers out there, especially ones that are exposed to the Internet. Your average Redis or Nginx container, or some container running a Python or Node webapp, is not going to have Docker mounted or some weird capability added. Sure, misconfigs happen, us sysadmins get lazy, but this is really common-sense stuff. It feels almost unfair to "blame" it on the container.

Of course, as mentioned, there are 0-days that allow for container breakout, and those are the truly scary stuff. But they seem to be few and far between, and they get publicized semi-widely and patched pretty quickly when they are found.

So to this day I don't really understand all the security folks who act like they (as in any decent attacker, not just a nation-state with 0days up the wazoo) could break out of a container with their eyes closed, while the only material I can find on the open Internet is stuff like this. Am I just looking in the wrong places?


But these caps aren't that extraordinary: in particular, I often find that profilers and the like require them. E.g. the manual for Nvidia Nsight specifically tells you to add SYS_ADMIN, which as we see here breaks the seal. So we can _either_ run those tools _or_ have a safe containerized workflow: that dichotomy alone is useful enough to justify this article IMO


That's fair! That's a good point, and I don't mean to invalidate those situations.

A part of me still feels like that's somewhat separate from a "typical" scenario where you want to harden a container that's exposed to the outside world in some way; like I said, your average container running a webapp or whatever. Nvidia Nsight seems to be a performance analysis tool, so I would hope nobody is running it with untrusted input or exposing it to the Internet. (Yes, I know someone out there totally is.)

Similarly, it feels near-obvious to me that adding a privilege called "SYS_ADMIN" to a container will make it more or less equivalent to root on the host. It's not like Docker hides this info from people, last I checked it's explained pretty prominently.

You're totally right overall, it still matters, it's just something about this framing that rubs me the wrong way.


> to specifically want to allow escaping

Not really unfortunately. A novice sysadmin granting someone's request to add some permission called SYS_MODULE may not know that it could be equivalent to full root access. That's why posts like this are important for education.


> A novice sysadmin granting someone's request to [give them root in an obscure, sneaky manner]

...is a social engineering exploit, not a technical one.


To pose an alternative scenario:

> A novice sysadmin granting someone's request to [be able to run some tool/software that says it needs this, but is in fact secretly malicious]

or perhaps more common

> A novice sysadmin granting someone's request to [be able to run some legitimate tool or software, then later a virus in the container taking advantage and breaking out]

for a real world example, the Nvidia Nsight docs specifically direct you to add SYS_ADMIN; do you think your everyday ML engineer would know that doing so poses a security risk if they're doing this on Docker?


As someone who works in security (both pre-fail and post-fail), 95% of incidents happen because things are not configured correctly. Actual vulnerabilities in Linux are expensive, and any attacker competent enough to have one, also will not burn it needlessly, if they can get in because a SWE with a deadline gave the container SYS_PTRACE.

I think your response betrays a broken line of communication in our industry. People tend to assume most production environments are configured by people who knew what they were doing, were deploying a well-behaved application and had time and management support to do a good job. That's almost never the case, even in well-funded, well-regarded tech companies.


I find it relatively comforting that every single one of these requires disabling security features, though it is a good reminder of why certain things are dangerous to give containers and should be avoided.


It doesn’t mention zero day exploits, but those still occur.


Step 1: already be root.


Every capability needed by these exploits is, on Linux, available only to root users — and is here being granted to a container.

So these "escapes" are basically giving a container full access to the host and then using it. None of them are enabled by default.


Raymond Chen had a blog post complaining about bug reports that started with "assuming the attacker has root access..."

Unfortunately I cannot find it anymore.

On edit: maybe it was write permissions to the file system, but it's largely the same level of "once you have this, it is game over".


“It rather involved being on the other side of this airtight hatchway” - there’s a couple posts in the series.

https://devblogs.microsoft.com/oldnewthing/20060508-22/?p=31...


In practice, though, containers are given these capabilities all the time, and in infosec people do refer to these as escapes.

Generally, it’s not the stuff the container is intended to run that’s doing the escape, though. This is usually the second step, after getting local execution through a vulnerability in the application running in the container.


Why are these capabilities usually granted, in your experience?


I posted another comment with some examples, but I think the general answer, IME, is that it's usually easier to do the wrong thing than to do the right thing. Security boundaries in Cloud-style Linux are typically hard to configure and not well documented. Engineers have deadlines and they're interested in building things, not debugging permissions. They'll do whatever is the easiest to get the system working.

This, by the way, is why I think Google had it right by splitting most products into SWE and SRE groups, so you had a group of people who could focus full-time on the production environment. SWEs are rewarded for building and deploying stuff, so they're going to build and deploy stuff.


In my experience profilers often need SYS_ADMIN or other capabilities. This is true even outside of a container. If I'm forced to work inside a container without the ability to add that cap, performance optimization work becomes very difficult.


User error is the source of many vulnerabilities that shouldn't exist.


That's not true, those capabilities can be granted to normal users. If the container is badly composed enough, sys admins might grant a user those caps so they can avoid security alerts about running containers as root.


Which is relevant because containerd most commonly runs as root, even in Kubernetes.


Actually, no. Root in the container isn't actually uid 0 on the host if you set things up correctly. Linux user namespaces can offset the uids in a container by a fixed amount, effectively making uid 0 in the container uid 60000 on the host, uid 1000 uid 61000, and so on.

And in this case, if you escape the container, you are literally nobody. All you are allowed to read is public files that any user can read.

By the way, if you use rootless containers, this is the default setup you get, since you don't have access to uid 0 in the first place.
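The offset described above is exactly the /proc/&lt;pid&gt;/uid_map translation: each line of that file maps a range as (uid-inside, uid-outside, length). A small sketch of the arithmetic, using the 60000 offset from the comment above:

```python
def to_host_uid(container_uid, uid_map):
    """Translate a container uid to a host uid via uid_map-style ranges.

    uid_map entries are (inside, outside, length) tuples, mirroring the
    three columns of /proc/<pid>/uid_map.
    """
    for inside, outside, length in uid_map:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped uids appear as the overflow uid (65534)

mapping = [(0, 60000, 65536)]
print(to_host_uid(0, mapping))     # container root -> host uid 60000
print(to_host_uid(1000, mapping))  # -> 61000
```

So a process that "escapes" as container root lands on the host as the unprivileged uid 60000, with no special access to anything.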


You're correct that uid 0 in the container is faked through userns, but you can easily bypass that with a privileged container or by adding the CAP_SYS_ADMIN capability. Yes, you can disable creating privileged containers within Kubernetes, but that isn't something you can rely on. If the runtime is running as root, which it often is, you can also become real root within the container under these conditions.


>if you setup things correctly

Judging by the number of HOWTOs I see on GitHub and YouTube that tell you to check the Privileged box, this assumption isn't worth what you think it is.


It doesn't matter, the "root" user inside the container is still jailed and none of these tricks will allow that user to escape.


This post seems confused and misleading. It only lists ways to escape containers with various non-default capabilities added, but doesn't address the other more realistic and seen in the wild ways (eg from uid=0 confusion due to disuse of user namespaces, kernel privilege escalation bugs, etc).


I don’t think it’s misleading - working in security, I can tell you that containers are misconfigured with too many privileges everywhere you look. SWEs like focusing on cool stuff, like kernel exploits, to the detriment of basic production security. From that point of view, this is an article I think many SWEs could benefit from reading, especially on teams that do their own deployments.


That's interesting. Do you mean you see containers that aren't running with "least possible privileges" narrower than the defaults, or that you often see misconfigurations such as the ones described in this article? Are the people still thinking the container works as a security boundary after this, or are they knowingly turning off security to get to some functionality that otherwise doesn't work within the normal confines of a container?


Yeah, lots. The reasons are many, and pretty diverse, otherwise I think it'd be easier to fix. Some examples:

- You need to mount a filesystem, use a raw socket, etc. and Stack Overflow / ChatGPT tells you to enable a capability, so you do. It works, so you check it in.

- You're deploying something legacy, that assumes it has more control of the OS than it does inside a container.

- You're a data engineer. All of your tools assume they run as root, because that's just how data science is.

- You're building a "sidecar" for monitoring, control over other containers or something similar. The extra privileges are needed to do your job.

- You're trying to access something you can't, and it's generally easier to overprovision access than to do least privilege.

These problems aren't unique to containers and the cloud, mind you. I saw the same problems, e.g. when working on mobile device security. In general, it's a lot easier to just turn off half of SELinux than to learn how to configure it to do what you want, especially if you have a deadline.


For example, it isn't that weird to give a container the ptrace cap. SYS_PTRACE is an important capability if you need to profile a process for whatever reason. SYS_ADMIN is required just to mount a tmpfs filesystem, even though that doesn't actually modify any file. There are cases where these are genuinely required. And unless you've read about it, you won't know what else a capability enables, or why it isn't granted by default.


The "escape" in the article also requires "--pid=host", i.e. permission to debug processes on the host. That seems like a very obviously bad idea.


For interactive debugging, maybe? A container can be used as a dev environment to dev/test things that would otherwise affect the host irreversibly. For example: something like an installer.


Not mentioned: use a kernel N-day, of which there are many. Patch your hosts folks


For many, many services, requiring an entire separate exploit against the kernel is a huge win. So many services can just be dropped into a container with roughly 0 effort. If you want a higher level of security it's going to take significantly more effort.


I think there's some version of the Dunning-Kruger effect going on in these comments - the assumption that no one would introduce this number of security flaws except intentionally. Perhaps it's that this forum tends to attract people more engaged in the CS space who wouldn't do this - but I've seen enough brute forcing in the wild to know that this ABSOLUTELY exists wherever a "just make it work" mentality is present.


You're totally right, but the part that annoys me is that articles like this one (and this sounds overly hostile, I don't intend that, but I'm not sure how else to phrase it) kind of pollute the topic of container security. I described it above, but I have this huge pet peeve where I hear "containers are insecure and trivial to break out of" and then when I go to look up examples of container breakouts, all I find is stuff like this; how to break through a wall that had a gaping, intentional hole left in it.

It feels like "breaking out of vanilla containers" and "breaking out of misconfigured containers" are two different topics, two different threat models. And while the second absolutely matters in the real world, the really scary stuff is obviously the first (and usually involves 0-days, kernel exploits, etc?). But people seem to talk less about the first.


I genuinely thought this link was about escaping a standardized steel shipping container, something I recently had to seriously consider. In that regard a disappointing click.

I also wrote my own Docker-like containerization code for educational purposes a while ago, so "container" has both of these meanings for me. Yet my brain was expecting a physical escape story. Brains are funny!


Thought the same thing.

I, however, have not written any containerization code.


Using rootless Podman limits the blast radius of a container escape.

Also, many of the capabilities described in this article aren't compatible with a rootless user deployment scenario.


What's the state of lightweight VMs? Can they replace containers yet?

Not sharing the kernel with the host os (or other containers) is a huge security boundary.


They would still be vulnerable to the sort of attack described in the article, though: If the host deliberately hands the guest a socket it can use to execute commands as root on the host, there's nothing that can be done to make it secure.


Kata Containers and gVisor come to mind. Kata Containers really spins up VMs with various optimizations. gVisor uses a reimplementation of the kernel syscall interface in Go, which is also a pretty interesting idea.


To me, a guy that's been doing this for decades, this is a weird thing to say.

A bit of debootstrap, a few apt-get commands, and copying in config files, and you have a lightweight VM, minimal image.

Something people have been doing for 20 years.

There are also all sorts of tricks, such as having two images, one for the app layer and one for the OS, which makes deploys for app updates faster.

I'm not even sure why people care about image size all that much. You copy it to your local cluster, then deploy from there.


Do those lightweight VMs startup (cold start) in milliseconds? Because I think that's a key thing that people are looking for from containers and VMs that might be used in their stead.


Firecracker, yes


It's managing them at a bigger scale that is the challenge. Besides that, the container ecosystem gives you APIs to do all this management with established tools, in an automatic way.


AWS Lambda and fly.io use Firecracker VMs internally. So I think they can replace containers to some extent.


All these escapes seem to involve Docker. It looks to me as if at least some of them are strictly Docker-dependent, but I've never used Docker, so I'm no expert.


Why escape?


You are being downvoted right now, but this is a great point. If you have execution control in someone’s container, use the container's existing secrets to achieve your goals. I don’t need to escape your web app's container to steal all of the contents of the backend database.


But if you do escape the container, then you have plenty other containers to peek into. Which is juicier than "just one."


Maybe. It’s also increasing the likelihood that security will get a detection though, which might or might not be worth it depending on my goals.


All those processes living in our containers are going to want the red pill eventually. Haven’t you seen the matrix?

But seriously though, it’s so you can write exploits or satisfy that curious itch when working with a cloud service.



I found the content to be of interest and gave me new things to think about.

But honestly, I was hoping before I clicked it that this was going to be about how to escape from the inside of a shipping container.


Apparently the answer to escaping from a shipping container is...

You don't, they cannot be opened from inside once locked. Also they're airtight, so bang on the walls and hope help arrives before you suffocate.

That's more nightmare inducing than I was hoping.


While there are various models of shipping containers, most of them are not airtight by design, to allow circulation of air and avoid condensation and temperature build-up. They do have seals to keep water out, though.


> Also they're airtight

Really?! I know that these things are built to be very stable, but they never gave me the impression to be air-tight. Not bad. I'm sure the world of shipping containers is a huge rabbit-hole to read into.


carry a pocket angle grinder.


And hope your container is not buried deep between hundreds of other containers. You'll need a lot of battery packs.


Me too, that skill could come in handy one day.


Likewise. Damn. I guess I’ll stay stuck in here for a while longer.


I think the only answer is you fire up the sawzall your captor neglected to remove before they locked you inside. Also hope there are some earplugs, because you are going to be deaf before you can cut an escape hatch.



I clicked on this thinking I would learn how to escape a shipping container if I ever got trapped in one ;)


I had the exact same thought! That would make a very interesting article.

Related video :) https://www.youtube.com/watch?v=-trd_f6j3eI


Containers are not a security boundary.

Containers are not a security boundary.

Containers are not a security boundary.

Any system that treats them as such is inherently compromised.


All of these escapes rely on some obvious explicit reduction of the isolation guarantees. If you know how to escape a simple docker container invoked with default parameters such as `docker run --rm -it ubuntu /bin/bash` I'm sure many people would be interested.


"Escapes" is a vaguely defined term, but considering basic escalation of privilege:

- Networking is the obvious one in this scenario. By default (on Docker/LXC/others) all containers are on the same virtual bridge and can communicate with each other. Even with some additional configuration and isolation it is possible to MITM attack other containers on the same host.

- It is very easy to DoS adjacent containers, e.g. by spamming signals, forking new processes, or creating files. There is again no default safeguard against this.


If you give a container access to the network, then yes it will have access to the network. That isn't an escape.

cgroups can prevent containers from using too much memory or cpu


If one container is OOM’ing incessantly, it can disrupt other containers as the system tries to page out read only pages furiously.


I think we must be careful not to use no-true-scotsman here. Imo, an attack should be considered valid if it works against a typical config - one you would arrive at following a popular tutorial, rather than requiring that it must work against an ideal config.


The definition of escaping a container (namespace) is being able to interact with resources not in the namespaces of the process.

If a process's network namespace contains a network device then it is not an escape to use it by definition. If the network namespace for a process contains no devices then being able to use a network device would be an escape.


Specifying networks is one of the main docker compose options and seems pretty normal:

  services:
    your_service:
      image: your_image
      networks:
        - isolated_network

    another_service:
      image: another_image
      # This service is on the default network

  networks:
    isolated_network:

I'd be curious how a vanilla (actual) networking setup is full of holes...


Can you provide more details on how a container can MITM other containers?

Does that also apply to Kubernetes workloads? And does that then require an encrypted service mesh (e.g. Linkerd) or TLS between services?


MITM attacks between containers need the NET_RAW capability — which is in Docker's default capability set, so drop it if you don't need it.

Still a good idea to use TLS.


The standard way to escape a simple Docker container with default privileges is to use a kernel LPE vulnerability. Kernel LPEs are common enough that they're found and patched without major news stories written about them.


Is putting processes in separate memory spaces not a security boundary to you? Isolating separate app responsibilities in separate UIDs? Filesystem permission bits? All those are "weaker" boundaries than containers. Do you really claim that these have ZERO security value?

That's silly. Of course containers are a security boundary. They have advantages and disadvantages. Treat them as tools and not slogans.


What is the security boundary then? Everywhere I read treats them as a security boundary for say, untrusted code.


See Google's gvisor as an attempt at reducing the attack-surface of a container to make things more secure.

I think the general advice is that a single container can never be a robust security boundary because the OS surface area they involve is so large that the isolation layer is ripe for possible vulnerabilities. You also really have to avoid screwing up, there are lot of fiddly little security mistakes you can make when attempting to use a container to run untrusted code.

Typically you might use something like gvisor, or a VM. Systems where isolation is simpler to reason about and the attack surface is smaller.

In any case a single isolation boundary can have a vulnerability and my understanding is that more advanced systems typically involve multiple layers of isolation to sandbox untrusted code.


> Everywhere I read treats them as a security boundary for say, untrusted code.

Who's "everywhere"? There are special kinds of VM hosts for that. Containers are like your kitchen jars: if someone is vomiting Ebola in your kitchen, your jars will not help you.


I think that's a great analogy, because yes, if you have a live sample of Ebola in a sealed glass jar then that will very much help you. (I would not recommend leaving the lid open by giving it SYS_ADMIN, but that doesn't mean that glass isn't a fine material for containing pathogens.)


Thinking further on the analogy, yeah it's better than nothing but I would _not_ recommend people leave Ebola in a glass jar in their kitchen. What if someone accidentally knocks it over, cutting themselves on a shard while doing so? What if someone, looking for cookies, fumbles around inside and gets it on their hands? Sure these are not "best practices" but the point is that it should be difficult to do the dangerous things, not easy and certainly not recommended by tutorials everywhere.


Perhaps not a good analogy, given Ebola is generally handled in a level 4 biosafety facility. Do you want to risk carrying it around in a glass jar?


Do "level 4 biosafety facility" not use glass vials? I imagine the security isn't provided by the choice of containers itself (plastic?), but rather the entire lab design.


Yes, my point was vials is not all they use. There are many layers of protection: biosuits, negative air pressure, decontamination procedures at the exits, etc.


Containers should never run untrusted code, at least not without other layers of protection added on top. Otherwise stuff like https://github.com/google/gvisor would not need to exist. And even then as long as processes are sharing a host kernel they are always vulnerable. A full VM is really the minimum acceptable boundary.


[flagged]


>"Everywhere I read treats them as a security boundary"

The people writing those articles are wrong. Containers are insufficient for untrusted code and should not be treated as a security boundary. A virtual machine or something similar (Firecracker) can be treated as a security boundary, but not a container.


People just keep repeating the same assertion, over and over, without elaborating on exactly why containers shouldn't be used to run untrusted code. I think that's what the GP is complaining about.


Virtual machines can be considered a security boundary. Containers are absolutely not.


The distinction is getting more and more fuzzy, so this is almost a meaningless point (as is GGP). It's very vendor-specific, let alone that I'm pretty sure Spectre attacks can work across VMs.


I wouldn’t consider the distinction “fuzzy”. Assuming we’re talking Linux (I don’t know about the Mac and Windows world) containers are implemented using namespaces and cgroups, and always have been. Whether you are talking docker, containerd, some more minimalistic thing built on runc, it’s all Linux namespaces and cgroups. And those things were explicitly not designed to act as security boundaries when running untrusted code.


Kata containers would like a word


Now in fairness, that is very specifically using a VM to add an even stronger boundary; it's not really the same thing.


Does anyone actually use Kata containers? I've tried recently to run them on a current Ubuntu platform and couldn't get it working at all after a few days of work.


If it's Ubuntu, it's possible you unintentionally had Docker installed via snap and ran into issues because of that? I had a bit of trouble getting it integrated with certain versions of Podman, but that aside, setting up Kata was pretty straightforward.

I have so far only used it for hosting some game servers which I don't trust, i.e. some simple containers, but I really want to try it in a new k3s cluster once I get it set up and move some services there. I like the idea of putting internet-facing ones into it as an additional layer of separation and could imagine it being useful in production.



Yes, I worked for a very large tech company that used Kata.


The distinction is in fact pretty clear. Conventional container runtimes are shared-kernel isolation. Virtual machines aren't: every tenant has their own running kernel.


Misconfiguring or granting something unnecessary privileges is enough to eliminate anything as a security boundary.

VMs easily give a false sense of security especially with any kind of network-based trust.


Sure, anything that is misconfigured could eliminate a security boundary. That doesn’t mean that containers are even in the same ballpark as VMs in terms of providing a security boundary.


Why not?


Everything has the "network-based trust" problem, including isolated hardware.


Why? Both are supposed to keep whatever is inside trapped unless you poke holes in that protection (say, using virtio or even just 9P to hand it real storage)


Because virtual machines were designed from the start to run untrusted code, and containers were not?

> supposed to keep whatever is inside trapped unless you poke holes in that protection

As far as I know that was never a design decision for containers on Linux; certainly not in the early days.


I'm willing to believe that Linux containers were not initially designed to be a security boundary, but I struggle to see why that means they aren't now; it's been over a decade and they have an awful lot of security features for something that doesn't care about security.

EDIT: For that matter, they're clearly being used for security; the features in Linux that are used by runc et al. are the same features used by eg. Chrome to isolate components in order to contain vulnerabilities.


As far as I know, Docker still punches through your firewall by default. I consider that a pretty big negative against assuming secure-by-design.


What attack are you envisioning that is aided by docker bypassing firewall rules?


Weak password + open ports -> malware attack (Kinsing, etc.)

so use "-p 127.0.0.1:5432:5432"

- https://github.com/docker-library/postgres/issues/770

- https://sysdig.com/blog/zoom-into-kinsing-kdevtmpfsi

- https://sysdig.com/blog/cloud-defense-in-depth/

- https://thenewstack.io/kinsing-malware-targets-kubernetes/

- https://stackoverflow.com/search?q=kinsing

- https://github.com/search?q=repo%3Adocker-library%2Fpostgres...

-----------

https://docs.docker.com/network/packet-filtering-firewalls/

"On Linux, Docker manipulates iptables rules to provide network isolation. While this is an implementation detail and you should not modify the rules Docker inserts into your iptables policies, it does have some implications on what you need to do if you want to have your own policies in addition to those managed by Docker.

If you're running Docker on a host that is exposed to the Internet, you will probably want to have iptables policies in place that prevent unauthorized access to containers or other services running on your host. This page describes how to achieve that, and what caveats you need to be aware of."
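The "-p 127.0.0.1:5432:5432" advice above can be checked mechanically. A small sketch (the helper name is mine, and it only handles the IPv4 forms of the publish spec) that flags `-p` specs Docker would bind on all interfaces, which is its default whenever no host IP is given:

```python
def is_world_exposed(publish_spec: str) -> bool:
    """True if a docker -p spec binds to all interfaces.

    Handled forms: "PORT", "HOSTPORT:CONTAINERPORT",
    "IP:HOSTPORT:CONTAINERPORT" (IPv4 only in this sketch).
    """
    parts = publish_spec.split(":")
    if len(parts) < 3:
        return True  # no host IP given, so Docker binds 0.0.0.0
    return parts[0] in ("0.0.0.0", "")

print(is_world_exposed("5432:5432"))            # True: punched through the firewall
print(is_world_exposed("127.0.0.1:5432:5432"))  # False: loopback only
```

Binding to 127.0.0.1 sidesteps the iptables footgun entirely, because the port never appears on an external interface in the first place.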


I mean, there was this story[0] ("How a Docker footgun led to a vandal deleting NewsBlur's MongoDB database") about how the Docker rules allowed a hacker to delete someone's database.

>Turns out the ufw firewall I enabled and diligently kept on a strict allowlist with only my internal servers didn’t work on a new server because of Docker. When I containerized MongoDB, Docker helpfully inserted an allow rule into iptables, opening up MongoDB to the world. So while my firewall was “active”, doing a sudo iptables -L | grep 27017 showed that MongoDB was open the world. This has been a Docker footgun since 2014.

Story was previously discussed on HN[1]. Sure, you could argue the author should have done more to secure the endpoint, but this was 100% a failure mode due to how Docker prioritizes convenience over security.

[0] https://blog.newsblur.com/2021/06/28/story-of-a-hacking/

[1] https://news.ycombinator.com/item?id=27670058
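The check the author describes, plus the fix Docker's packet-filtering docs point at, looks roughly like this (sketch; needs root on a Docker host, and the 10.0.0.0/8 subnet is a placeholder for your own allowlist):

```shell
# The check from the post: is MongoDB's port published to the world?
sudo iptables -L -n | grep 27017

# The supported fix: rules in the DOCKER-USER chain are evaluated
# before Docker's own rules and are not flushed by the daemon.
# Drop traffic to 27017 unless it comes from the internal subnet:
sudo iptables -I DOCKER-USER -p tcp --dport 27017 ! -s 10.0.0.0/8 -j DROP
```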


Most container runtimes do not switch the CPU's VM context when the CPU switches between containers; VMs do. That context switch is necessary to prevent attacks that leak data through the CPU cache.


I agree that they have a bunch of security features, because they've been playing security whack-a-mole for a decade. Retrofitting a security boundary onto an existing system is very difficult. Personally Linux containers are still not at a level where I'd trust them for something that was meant to be a hard security boundary rather than a damage mitigation exercise, though obviously that's a subjective judgement.


Linux was designed as a multi-user operating system from the get go. By definition, programs from different users are supposed to run without access to data that they shouldn’t have access to.

That said, it’s hard to get right at all times.


I wouldn't say "designed from the start", I would just say virtual machines are naturally more isolated than containers. Since virtual machines are simulating hardware, compared to containers which are isolated user space instances.

But I guess that doesn't mean virtual machines can't be escaped without extra hardening work, same as containers.


This doesn’t answer the question. How were VMs designed with security in mind and not with host emulation? Untrusted code? I’m confused, people are talking about vulnerabilities.


VMs don’t share the kernel with the host. Any host escape would need to happen through a device driver exposed to the guest (virtio, etc.). Containers use the same kernel, obviously a much larger set of code. More code means a greater chance of vulnerabilities.


How do you think the VM itself is spawned? Something on the host instantiates it. But even if we ignore that, I would argue that concentrating on only one location (e.g., some exposed driver) to escape is also easier, since an attacker only needs to spend time finding one vulnerability rather than several.

I think hardware can help bridge the gap between containers and VMs by letting userspace processes behave as VMs, which is more or less what QEMU+KVM try to do, except that it still comes with some overhead and less flexibility.


I have to disagree. With a VM, there is less code shared with the host to review and audit for vulnerabilities. The developers can go over those device drivers, system calls for VM management, etc. with a fine-tooth comb. With containers, you have essentially the entire kernel, which is much more surface area to potentially exploit.

Maybe I am wrong. We can wait for a security professional to comment.


When you switch between VMs and the host, the CPU executes an instruction that isolates each VM's data entirely (flushing caches). This doesn't happen with containers.


Although this feature does improve security, it is something the CPU and the VMM provide, and the main motivation was reducing performance overhead. I don't see adding layers and layers of abstraction as automatically more secure. In fact, Meltdown and Spectre serve as counter-examples.

Also, one can achieve similar effects with containers as well, just think AppArmor, capabilities, permissions, etc., all layers of administrative privileges between some untrusted code and the host.


There is nothing you can do in the OS that replaces the hardware VM-entry instruction (VMLAUNCH/VMRESUME on Intel, VMRUN on AMD). You need it precisely because it mitigates Spectre and Meltdown, assuming your microcode is up to date.


What disqualifies containers as a security boundary?


This is true, but the exploits listed in this article are poor evidence of this. If it was as simple as not giving Linux containers any capabilities or host sockets they would be a decent security boundary.

I find it super frustrating that we're stuck with kernels with inherent weaknesses to their security approach that we have to re-implement them in userspace in one way or another (gVisor, Firecracker, etc.) just to get the hardware-provided userspace/kernel boundary to work properly.


A container is a security boundary. However, no security boundary is perfect, and more boundaries must be put in place as well, depending on your threat model.

edit: typo


Containers are absolutely a security boundary, and an excellent one at that - at least, Docker containers are.

1. You get file, process, and network namespaces, which are a security boundary

2. You get a seccomp filter, which is a security boundary

The "containers are not a security boundary" meme needs to die.

Elsewhere you mention that "containers are not sufficient for untrusted code" but that's a very specific and very niche threat model. Most people don't say "send me a binary and I'll execute it", or have arbitrary RCE + multitenancy concerns.

Containers aren't sufficient for multi-tenant RCE because the RCE is by design, so 100% of your security pressure is on the container at that point. In the vast majority of cases you're dealing with servers that don't intend to allow arbitrary code execution, and containers are an extremely easy way to drive up the cost of an attack, given that the attacker has already spent a lot of time and money on the RCE.

SELinux is also not sufficient for the "RCE by design" threat model - is SELinux not a security boundary?

Further, containers can limit the impact of remote vulnerabilities like path traversal attacks, since they have file isolation by default.
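The namespace and seccomp points are easy to see directly (sketch; assumes a Docker host running the default profiles):

```shell
# PID namespace: the container sees only its own processes; here
# `ps` itself is about the only thing in the process listing.
docker run --rm alpine ps aux

# Seccomp: the default filter is attached to the container's PID 1
# (a Seccomp mode of 2 means filter mode).
docker run --rm alpine grep Seccomp /proc/1/status
```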

edit: I see elsewhere that there's a real lack of clarity here.

First off, a security boundary can be meaningfully defined as a limitation on an attacker that cannot be bypassed without additional exploitation.

So the main reason why people have said "containers are not a security boundary" is because:

a) Very, very early on, escaping a container was trivial - like you could just ask to leave and you'd be out.

b) There were some blog posts basically saying "containers aren't sufficient for multi-tenancy" where arbitrary users can run arbitrary code on the same host. This is still the case today - but it's also an extremely rare threat model.

Why would containers not be sufficient for (b)? Because the majority of the Linux kernel is still exposed to the attacker within a container - the vast majority of system call interfaces are exposed (though seccomp removes a number of these, which is nice). The Linux kernel is not at all sufficiently hardened against attackers who can make arbitrary system calls, therefore containers are not sufficient against those attackers. If you give the attacker RCE by default (i.e. your service is "Send me a binary and I'll run it") then an attacker can spend all of their time and money just on a local privesc, which isn't crazy difficult.

Since the cost of RCE in an RCE-aaS is 0 the consensus is that containers aren't strong enough for RCE-aaS threat models. In that case use Firecracker or gVisor or a dedicated host.

Otherwise, RCE costs tend to be pretty high and having to develop an additional LPE on top of one is, at minimum, quite a pain for many attackers.

Containers are extremely easy to deploy software into, something like a Firecracker VM is not. Containers are basically just processes, so you can monitor them and manage them trivially. Monitoring and managing VMs with processes inside of them is obviously harder. So I think the 'bang for your buck' with containers is extremely solid.


They were on Solaris, no?


> Containers are not a security boundary.

The name is at least misleading if not wrong, then? What do they "contain"?


They contain the transitive closure of software and libraries to run a binary.

Under the hood, they're a fancy wrapper around a pile of tar files.

Tar files certainly contain other files and are also not a security boundary.
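The "pile of tar files" point can be demonstrated with tar alone: an image layer is essentially an archive of a filesystem subtree (sketch; the paths here are made up):

```shell
# Build a toy "layer": a tarball of a rootfs directory, which is
# roughly what `docker save` emits per layer.
mkdir -p rootfs/etc
echo "toy layer" > rootfs/etc/motd
tar -C rootfs -cf layer.tar .

# Listing it shows plain files -- nothing in the format enforces isolation.
tar -tf layer.tar
```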


This kind of reductive dogma is meaningless FUD. A container is as secure as its underlying implementation.


Is gvisor considered sufficient when running 3rd-party code? Are there other measures that should be taken in addition to using gvisor?


Of course they are. There is debate over how good of a boundary they are but even caution tape is a security boundary.



