Execute Docker Containers as QEMU MicroVMs

riobard · on June 16, 2021

A few years ago I invested in a small startup called `hyper.sh`. It open sourced a container runtime called `runV` which provided exactly this: the security of virtual machines plus the convenience of containers.

The project later merged with Intel Clear Container to become what's now called Kata Containers (https://katacontainers.io/) and is now widely used by several Internet giants like Alibaba and Baidu.

The startup was acquired by Ant Finance a couple of years ago.

(I recorded a podcast with one of hyper.sh's engineers, if you understand Mandarin: https://pan.icu/25)

temp_praneshp · on June 16, 2021

Probably off topic: Back in 2014-15 at my first job, when I was working on openstack, they used to show up at the summits. They were super smart and very generous with their time when I had questions. I wondered sometime in 2020 what happened to them, I'm happy they had a decent exit.

XorNot · on June 17, 2021

I used runV with drone.io (on top of Media) to run distributed on-demand VM builders for GitHub enterprise (we were building physical machine images to deploy so needed VM isolation).

It actually worked great, and I've struggled to get quite as flexible a CI system at other jobs since then (the big advantage was it looked like Docker, so with compose you could either spin up a metal-like nested VM or just pull in some DB containers in your build instance).

cptnapalm · on June 16, 2021

I was looking at Kata containers a few days ago. I'm pretty new to trying to use VMs/containers for services; purely hobby level. Couldn't figure out how to use them, but that's not necessarily a knock on them as I also can't get OpenBSD wireguard to work either.

polskibus · on June 16, 2021

How does it differ from Firecracker?

riobard · on June 16, 2021

I'm not familiar with later development, but AFAIK Firecracker came much later and now you can actually use Firecracker as Kata Container's hypervisor in addition to QEMU.

lifty · on June 16, 2021

I worked with their tech, testing it, and I loved the product. It was definitely ahead of its time. Similar in some ways to what Fly is doing these days, without the edge.

eatonphil · on June 16, 2021

There are a few existing projects out there like this (running Docker images as virtual machines, specifically) if folks are interested. Slim [0] is the one I can remember off the top of my head. I think there are a couple more.

Still, neat to have the walkthrough here in this post.

[0] https://github.com/ottomatica/slim

hardwaresofton · on June 16, 2021

A couple more:

https://github.com/containers/krunvm

https://github.com/weaveworks/ignite

tptacek · on June 16, 2021

As I understand the landscape here, the big enabling win of microvms is faster boot time; there's a cool qemu-lite slide deck that goes into detail about how they cut down boot time:

https://www.linux-kvm.org/images/d/d2/03x05B-Chao_Peng-Light...

The big win was slashing away the BIOS stuff.

We use AWS's Firecracker to turn our customers' Docker containers into Firecracker microvms (Firecracker is Amazon's Rust VMM, the engine for Fargate and Lambda). Anecdotally: in my dev environment, the difference between Firecracker boot times and native Docker container startup is imperceptible; the logging we do swamps the VM boot stuff. It's very fast.

rwmj · on June 16, 2021

https://katacontainers.io/ ?

bonzini · on June 16, 2021

Yes, indeed. However it's nice to see directly the mechanisms that let Kata do its magic.

ashishbijlani · on June 16, 2021

> Can we somehow combine the advantages of the docker ecosystem with VMs?

Shameless plug: this is exactly what our goal is with https://kwarantine.xyz We are creating a new hypervisor (from scratch) that can run strongly isolated Docker/LXC containers.

amscanne · on June 16, 2021

The "fork" sounds like you blue pill the OS for each container? I'm assuming the concept is like Cappsule [1] or Bromium [2]?

[1] https://cappsule.github.io/

[2] https://en.wikipedia.org/wiki/Bromium#/media/File:Bromium-en...

ashishbijlani · on June 16, 2021

fork here is COW on the host kernel (i.e., copying EPT entries). We will post detailed technical documentation soon.

mikepurvis · on June 16, 2021

Is this what gvisor is? https://github.com/google/gvisor

ashishbijlani · on June 16, 2021

No, gVisor is from Google. They emulate system calls in user-space and use VMs, which increases runtime performance overhead. We use hardware virtualization to directly run containers -- no I/O emulation, no expensive VM exits, scale as needed. Initial comparison with FC/GVisor/Xen here: https://github.com/ashishbijlani/kwarantine

tptacek · on June 16, 2021

It sounds like you just said "yes, but what we're building is faster". The userland Linux emulation is a security benefit, not a liability.

monocasa · on June 16, 2021

I'm not sure gvisor requires vm exits. Their first backend used ptrace very similarly to how user mode Linux worked.

Minor quibble though, since ptrace might even be slower than vm exits; your core point stands.

rkeene2 · on June 16, 2021

User Mode Linux is still around and works well. I use it when I need a "fakeroot" without any special privileges on the host.

https://rkeene.org/viewer/tmp/fakeroot.sh.htm

mikepurvis · on June 22, 2021

Thanks for the poke on this. I had looked briefly and become frustrated that many of the instructions I found assumed you were a kernel dev and started with compiling everything from source— the Debian-supplied UML binaries seem to work well for my needs though, and do indeed allow doing basic stuff like mounting a disk image so you can run install-grub on it.

rkeene2 · on June 22, 2021

Hmm, building UML from source is really easy. Here [0] is my process for doing it, as a Makefile. The actual compile step is just one line (line 32):

    $(MAKE) -C linux-$(KERNEL_VERSION) ARCH=um linux

The rest of it sets up the configuration how I want and compiles other dependencies (like slirp) or is for maintenance, like cleaning up, or downloading.

This is a rather old version -- newer versions check the checksum and use my HashCache system.

[0] https://rkeene.org/viewer/tmp/uml.Makefile.htm

stefanha · on June 17, 2021

For an even more lightweight approach to running containers in VMs see: https://github.com/containers/krunvm

It's powered by https://github.com/containers/libkrun.

forty · on June 16, 2021

Isn't firecracker an AWS tech?

bhawks · on June 17, 2021

If you're splitting hairs, Firecracker (AWS) is an offshoot of crosvm from Chrome/Google, which actually was a greenfield VMM :) Anyway, memory-safe virtualization for the win.

cpach · on June 16, 2021

That’s correct.

https://github.com/firecracker-microvm/firecracker

remram · on June 16, 2021

The article wrongly states that Fly.io created firecracker.

jjacobson93 · on June 16, 2021

Yeah, the author is incorrect. Fly.io uses Firecracker but they didn’t create it.

tptacek · on June 17, 2021

Whoah, missed that.

thekevjames · on June 16, 2021

I had fun exploring Docker->VM conversion a while back [1], though the larger goal in my case was to be able to make the build path to custom GCP VM Images a bit simpler. Exciting to see other cases where folks are finding this sort of flow useful!

1: https://thekev.in/blog/2019-08-05-dockerfile-bootable-vm/ind...

dzonga · on June 17, 2021

I understand it's cool to do content marketing, but folks, proofread your articles. Firecracker was created by AWS, and its page rightly says so.

OldGoodNewBad · on June 16, 2021

I think a lot of folks are going out of their way to misunderstand what happened. Yes there are other similar projects and containers. No, none come from a long established COMMUNITY RUN PROJECT. This is something akin to the difference between VirtualBox and OpenBSD’s vmd. One's a product with a “free” tier, the other is a community project.

gravypod · on June 16, 2021

Something I'd be very interested in: building a PXE image from something declarative like Dockerfiles.

justincormack · on June 16, 2021

Try LinuxKit https://github.com/linuxkit/linuxkit

laurencerowe · on June 16, 2021

Google Container Optimized OS is basically this I think. It's what's used when you start a GCE instance with a docker image.

https://cloud.google.com/container-optimized-os/

jonjonsonjr · on June 17, 2021

I don't think I'd ever call a Dockerfile declarative.

encryptluks2 · on June 16, 2021

Why not run containers in VMs in containers in VMs? :)

Seriously, VMs are hardly as secure as many people want to believe unless you're utilizing enclaves and even that has vulnerabilities. I think a better approach is Seccomp and whatever other filtering makes sense.

handrous · on June 16, 2021

A while back I did some looking at FreeBSD jails to try to figure out why they don't have more mindshare (especially when paired with the nigh-superpower-granting ZFS).

I came away baffled that they weren't more widely-promoted, compared with Docker and friends. After thinking about it for a while, all I can figure is they're so straightforward to use and well-documented that there's no room to make one's name, or to make a buck, re-packaging them or wrapping them in complex tools, so there's little money or glory (= personal marketing via open-source project leadership/contributions) in promoting them.

[EDIT] that is: what would be a blog post in LXC/Docker land... doesn't exist, because it's covered perfectly well in the docs. What would be a simple open-source tool... becomes a blog post, because it's short, simple, and clear enough not to merit special software, but just a quick guide to existing tools. What would be a business, becomes a simple open-source tool without enough of a difficulty/convenience "moat" to support a business.

boardwaalk · on June 16, 2021

I suspect the answer includes it not being Linux, even with the compatibility layer available.

handrous · on June 16, 2021

I'm sure that's some of it, but the trend seems to be moving away from leveraging OS-level tools anyway. As long as your containers (or jails) and the single important binary in each one start up OK and your network tuning on the parent OS isn't completely screwed up, the rest barely matters anymore.

coder543 · on June 16, 2021

It seems like you're missing a lot of things.

As a developer, how do I run FreeBSD Jails on my MacBook during development? With Docker for Mac, it is trivial for me to do everything on my Mac, and the fact that there is a virtual machine is completely invisible to me. Everything "Just Works". With FreeBSD Jails, I would have to actually interact with a VM constantly, including the pain of shipping files back and forth.

As a developer, are popular databases and applications pre-packaged as FreeBSD Jails so that I can spin one up on my laptop with a single command? Where is the Docker Hub equivalent?

As a developer, how do I orchestrate a collection of FreeBSD Jails for each project? With Docker, I define a single `docker-compose.yml` file for each project. With a single `docker-compose up`, the entire project is running including dependencies such as databases and other related projects in a completely reproducible fashion. This makes it trivial for coworkers to spin up a project on their machine and immediately be productive without spending an hour trying to get all the right versions of everything installed and up and running.

As someone responsible for deploying an application to production, what is the story around FreeBSD Jails for deploying across a cluster? Is there a Kubernetes-equivalent that can manage the allocation of resources, blue-green deployments, and manage the lifecycle of my FreeBSD Jails?

As someone responsible for deploying an application to production, do any of the major clouds support FreeBSD Jails? With Docker images, I can deploy those straight to ECS Fargate, Google Cloud Run, and half a dozen other services. Then I don't even have to think about my own infrastructure unless I need some really specialized hardware for a specific application.

> the rest barely matters anymore.

Everything else matters so much.

As to your earlier point about ZFS, most Linux distros these days seem to trivially support ZFS. Even TrueNAS is working on switching to Linux with their TrueNAS Scale offering.

It's not that I'm opposed to FreeBSD... FreeBSD is just a hard sell. It's hard to pin down exactly what you're gaining by throwing out all the collective Linux knowledge of an organization and switching to FreeBSD. FreeBSD is an N-th tier platform for pretty much every programming language except C, so good luck when you run into random subtle problems. Also, good luck doing hardware accelerated machine learning inference or training on FreeBSD... it's probably possible?

> the single important binary

This is also such a weird thing to throw out there. I like a good Go program myself, but most companies are not only deploying single-binary statically linked applications. Most companies are also deploying some kind of Ruby, Python, or Java application... none of which are likely to be a single file in practice. Most of them will have a variety of shared libraries, and I don't know if I've ever seen a Ruby application shipped in a `FROM scratch` container before. Technically possible, but that's just not common reality as far as I've seen. It sounds like you're proposing that everyone is already running in `FROM scratch` containers, so a FreeBSD Jail is just a drop-in replacement.

Linux containers are far from perfect, but as a developer... I have played with FreeBSD Jails before, and come away frustrated by all the work you have to do yourself.

vermaden · on June 16, 2021

> As a developer, are popular databases and applications pre-packaged as FreeBSD Jails so that I can spin one up on my laptop with a single command?

The closest you can get is BastilleBSD (framework for FreeBSD Jails) and their templates - available here:

https://github.com/BastilleBSD/templates

https://bastillebsd.org/templates/

handrous · on June 16, 2021

> > the single important binary

> This is also such a weird thing to throw out there. I like a good Go program myself, but most companies are not only deploying single-binary statically linked applications. Most companies are also deploying some kind of Ruby, Python, or Java application... none of which are likely to be a single file in practice.

Sure, but usual practice with containers is to put each thing in its own, unless they are very tightly coupled. Web-app with a SQL database and a memory cache? Three containers. You can do otherwise, but that's typical. Usually each container ends up with one main, important running process, and not much else.
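To make the "three containers" layout concrete, here is a minimal sketch, not anyone's actual setup. It builds the service definition in Python and prints it as JSON, which docker-compose can read because JSON is valid YAML. The image name `example/web-app`, the ports, the password, and the volume name are all placeholders invented for illustration; `postgres` and `redis` merely stand in for "a SQL database and a memory cache". Older docker-compose releases may additionally want a top-level `version` key.

```python
import json

# Hypothetical three-service layout: one container per concern.
# "example/web-app" is a placeholder image name, not a real project.
compose = {
    "services": {
        "web": {
            "image": "example/web-app:latest",
            "ports": ["8080:8080"],
            # Start the database and cache before the web app.
            "depends_on": ["db", "cache"],
        },
        "db": {
            "image": "postgres:13",
            "environment": {"POSTGRES_PASSWORD": "change-me"},
            # A named volume keeps the data across container restarts.
            "volumes": ["db-data:/var/lib/postgresql/data"],
        },
        "cache": {"image": "redis:6"},
    },
    "volumes": {"db-data": {}},
}


def render(spec):
    """Serialize the layout as JSON (a subset of YAML)."""
    return json.dumps(spec, indent=2)


if __name__ == "__main__":
    print(render(compose))
```

Each service ends up with one main process, which is the "one main, important running process" pattern described above; only the `db` service needs persistent storage in this sketch.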

[EDIT]

> As someone responsible for deploying an application to production, what is the story around FreeBSD Jails for deploying across a cluster? Is there a Kubernetes-equivalent that can manage the allocation of resources, blue-green deployments, and manage the lifecycle of my FreeBSD Jails?

> As someone responsible for deploying an application to production, do any of the major clouds support FreeBSD Jails? With Docker images, I can deploy those straight to ECS Fargate, Google Cloud Run, and half a dozen other services. Then I don't even have to think about my own infrastructure unless I need some really specialized hardware for a specific application.

These are exactly the kinds of things I was thinking of when I noted that the OS itself has been seriously diminished in importance, for modern workflows. I agree that most commercial or high-profile open-source "cloud" tools and platforms are built around LXC/Docker.

coder543 · on June 16, 2021

> Sure, but usual practice with containers is to put each thing in its own, unless they are very tightly coupled. Web-app with a SQL database and a memory cache? Three containers. You can do otherwise, but that's typical. Usually each container ends up with one main, important running process, and not much else.

I agree, but... getting all the application dependencies in there is more than just getting a single binary in there. If it's just a single-binary Go program, then a Jail works just fine, but it's not that simple for a Ruby application. I'm definitely not talking about databases running in the same container as the application. That's where Kubernetes and docker-compose come in for multi-container orchestration, which are things that FreeBSD Jails don't have as far as I know.

> These are exactly the kinds of things I was thinking of when I noted that the OS itself has been seriously diminished in importance

Yes, but... these are all the things that FreeBSD doesn't offer. These are the real reasons that people don't talk about FreeBSD Jails in the same breath as Docker. The Docker container itself (or the FreeBSD Jail) as a unit of isolation is the least interesting part of the ecosystem. All of the developer tools, orchestration tools, and prebuilt images are what make the Docker universe so interesting, and make FreeBSD Jails... less interesting.

You said you were confused why Jails don't have more mindshare. It has absolutely nothing to do with people being able to invent useless tools and write blog posts about them, and it has absolutely nothing to do with FreeBSD Jails being too well documented. You kind of implied those were the best explanations you could come up with. Those are not the problems at all, and it seems disingenuous to me to say you think those are the problems unless you really didn't know the things I mentioned in my first reply.

handrous · on June 16, 2021

My personal favorite thing about Docker, and the part I'd most miss if I switched to Jails (which I'm fairly confident could meet my needs with some fairly simple scripts and aliases that wouldn't take me long to arrive at, which is why I think there's so much less of an "ecosystem" there, even a nascent and under-developed one) is the way it forces projects to un-fuck their configuration.

500-line config, much of which few people ever care about, with all kinds of ill-conceived nesting? Better put the ~20 options that 99% of users ever touch in environment variables, and document them. Weird state garbage that's not captured in your config-on-disk? Better figure it out and get it into env vars, and have your startup script use those to transparently manage whatever bad decisions you made re: state in the past. Shit files all over the system? Better get that sorted out so people can handle persistence with at the very most three total mounts—and oh, gee, look, now your simple example docker-compose also serves to document where exactly you store files. And so on.

(my second-favorite thing is that it's a de-facto cross-distro package manager with very up-to-date packages that are trivial to completely and cleanly uninstall)
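The "put the handful of options people actually touch into environment variables" pattern can be sketched roughly like this. The option names (`APP_PORT`, `APP_DATA_DIR`, `APP_LOG_LEVEL`) and their defaults are hypothetical, picked only to show the shape; a real project would document its own.

```python
import os

# Hypothetical option names and defaults, chosen only for illustration.
DEFAULTS = {
    "APP_PORT": "8080",
    "APP_DATA_DIR": "/data",
    "APP_LOG_LEVEL": "info",
}


def load_config(environ=None):
    """Return the documented options, taking each one from the
    environment when it is set and from DEFAULTS otherwise."""
    env = os.environ if environ is None else environ
    config = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Convert early so a malformed port fails at startup.
    config["APP_PORT"] = int(config["APP_PORT"])
    return config


if __name__ == "__main__":
    for name, value in load_config().items():
        print(f"{name}={value}")
```

The `DEFAULTS` table doubles as documentation of the supported knobs, and a single data directory means persistence is one mount instead of files scattered across the system.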

oarsinsync · on June 16, 2021

FreeBSD introduced Jails in 1999.

I used my first Jail in 2001.

Docker was started over a decade later in 2013.

It’s reasonable to be confused why Jails lacks the mindshare. “Because it lacks all these other over-the-top features that we need” might be reasonable in response, except that Docker didn’t have any of these things on day 0 either.

Jails had a 14-year head start; Docker reinvented the wheel, and not particularly well at first. Why did it succeed more than Jails did? It wasn’t because of the piss-poor native Mac support.

tptacek · on June 16, 2021

It seems pretty obvious that the big thing here is that most people ship apps on Linux, not on FreeBSD.

tyingq · on June 16, 2021

If technically best in the container space mattered, Illumos would be everywhere...

tptacek · on June 16, 2021

People say this a lot too, but Illumos also uses shared-kernel isolation. Linux + gVisor is probably (significantly) superior to it as far as security goes.

yjftsjthsd-h · on June 17, 2021

90%+ of Docker users aren't using gVisor; I don't disagree that it's good, but it feels like an aside.

cestith · on June 16, 2021

Or z/OS

tptacek · on June 16, 2021

Jails are still shared-kernel isolation. Docker's reputation is mired in its earlier implementations, when it wasn't really even intended for multitenant isolation. Modern Docker, running with unprivileged containers (which is the norm), is substantially hardened. The real win over Docker is losing the shared kernel, which is what lots of people are doing, so the win to Jails is marginal.

nicolaslem · on June 16, 2021

TrueNAS exposed me to FreeBSD jails but what put me off is that there does not seem to be an equivalent of "docker build".

Jails seem to be treated like OpenVZ containers in the Linux world: a lighter alternative to virtual machines, not a way to build and distribute applications like Docker.

This is just my take after playing a few hours with jails, I would happily be proven wrong.

LargoLasskhyfv · on June 19, 2021

Heretics! Victimizing all the Fashionistas! Where would be the fun of endless shiny new things? The thrill of employing l33t google skillz to find just another solution to cut&paste in haste, with no wasted time reading boring old style manuals and documentation. Attention deficit is the hottest shit! Deal with it!

tptacek · on June 16, 2021

I don't know what people generally believe.

But the attack surface of a Linux kernel is very large, is pretty unpredictable, and can't be coherently masked out with rules (my favorite example is Jann Horn's VM reference count bug, which was a simple concurrency flaw in the core virtual memory system). By comparison, a Linux KVM hypervisor is not just a subset of the kernel by definition, but also a much smaller codebase, a tiny fraction of the whole kernel.

Replacing shared-kernel isolation like seccomp-filtered containers with VMs is, architecturally, simply the replacement of a large trusted computing base with a smaller one. If the overhead is acceptable, it's hard to argue with from a security perspective.

gorkish · on June 16, 2021

OK; https://github.com/harvester/harvester

Security and performance aren't the only driving forces; there are a lot of technical and operational benefits to the abstraction and standard interfaces that you get when running stacks that might otherwise look like someone took an Xzibit meme too far.

Also remember on a modern system, there are often at least 2 additional layers at work abstracting interfaces to the "bare metal" OS already.

encryptluks2 · on June 16, 2021

I'm not disagreeing that abstraction can be useful, but the overhead of a VM is unnecessary if utilizing the full potential of containers. After all, the Linux kernel is acting as the hypervisor already, so might as well trust it enough to properly sandbox containers too and use the right functionality to do so. I also think that running a virtualization layer adds quite a bit of complexity, so while it is cool that projects and companies have made it work and integrated it with a container solution, eliminating the VM layer altogether seems more ideal IMO.

riobard · on June 16, 2021

That's the approach taken by Google's gVisor (at the cost of I/O and network performance).

tptacek · on June 16, 2021

No, that's really not at all what gVisor is. gVisor is best thought of as user-mode Linux --- a complete reimplementation of most of the OS kernel. It's not a system call filter; it's something much closer to a VM than to seccomp.

gVisor is a very cool codebase. As an illustration of the approach: it includes its own TCP/IP stack; we use it in our command-line dev tool to allow people to SSH to their VMs over WireGuard without having to install WireGuard or obtain privileges to manage WireGuard.

fsociety · on June 16, 2021

gVisor, for better or for worse, does a whole lot of other things than just seccomp filtering, and it shows in performance tests.

encryptluks2 · on June 16, 2021

gVisor does more than filtering, they basically reimplemented the syscalls in an application kernel. At least with seccomp the performance overhead is minimal.

remram · on June 18, 2021

How does gVisor fare against KVM and other hardware-accelerated VM solutions (firecracker)?

dboreham · on June 16, 2021

Machine Turducken.