AFAIK, this is how docker CoW works. Your read-only parts i.e. base OS / startin...

psi-squared · on Jan 8, 2017

Okay, so on reading through that it looks like the answer to my question is "it depends":

* On-disk, the layered approach always saves space, as expected

* In memory, it depends on which storage backend you use: apparently btrfs can't share page cache entries between containers, while aufs/overlayfs/zfs can. I'm not sure whether that limitation comes from btrfs itself or from Docker's btrfs backend.

From looking at the relevant sources, it looks like (but I could be wrong if I looked in the wrong places) both exec() and dlopen() end up mmap-ing the executable/libraries into the calling process's address space, which should mean they just reuse the page cache entries.
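One way to sanity-check the mmap claim on Linux is to look at a process's own `/proc/self/maps`, which lists the executable and each loaded library as a file-backed mapping. Below is a minimal sketch of that; the helper name `file_backed_mappings` and the `SAMPLE` text (with made-up addresses and inodes) are my own for illustration, and it assumes the usual "address perms offset dev inode pathname" line layout.

```python
from pathlib import Path


def file_backed_mappings(maps_text):
    """Return (perms, path) for each mapping whose pathname is a real file path."""
    result = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        # Anonymous mappings have no pathname; pseudo-entries look like "[stack]".
        if len(fields) == 6 and fields[5].startswith("/"):
            result.append((fields[1], fields[5]))
    return result


# Illustrative lines in /proc/<pid>/maps format (addresses and inodes are made up).
SAMPLE = """\
55d0c8a00000-55d0c8a2c000 r-xp 00000000 08:01 131 /usr/bin/python3
7f1c2a400000-7f1c2a5b5000 r-xp 00000000 08:01 1234 /lib/x86_64-linux-gnu/libc.so.6
7f1c2a7c0000-7f1c2a7c4000 rw-p 00000000 00:00 0
7ffd1b2f0000-7ffd1b311000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    maps = Path("/proc/self/maps")
    # Fall back to the sample text on systems without procfs.
    text = maps.read_text() if maps.exists() else SAMPLE
    for perms, path in file_backed_mappings(text):
        print(perms, path)
```

Mappings that are executable and not writable (`r-xp`) and point at a `.so` file or the binary itself are the ones that can be served straight from the page cache.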

So, if I understand correctly, as long as you pick a filesystem which shares page cache entries between containers, then you do indeed only end up with one copy of (the read-only sections of) executables/libraries in memory, no matter how many containers are running them at once. That's good to know!
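To make the "one copy in memory" idea concrete, here is a rough sketch (not how exec() or dlopen() are implemented) that maps the same file twice and checks that both mappings see the same bytes. It uses writable shared mappings only so the sharing is observable; executables and libraries are mapped read-only or privately, where the sharing holds until a page is written. The function name `shared_mapping_demo` is made up, and this assumes a Unix-like system.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Map one file twice; a write through one mapping shows up in the other."""
    fd, path = tempfile.mkstemp()
    try:
        # Give the file one page of zero bytes so there is something to map.
        os.write(fd, b"\0" * mmap.PAGESIZE)
        first = mmap.mmap(fd, mmap.PAGESIZE)
        second = mmap.mmap(fd, mmap.PAGESIZE)
        try:
            # Both mappings are backed by the same cached page of the file.
            first[:5] = b"hello"
            return bytes(second[:5])
        finally:
            first.close()
            second.close()
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(shared_mapping_demo())
```

Whether two containers end up on the same cached pages for "the same" library still depends on the storage backend resolving both paths to the same underlying file, as discussed above.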

akiselev · on Jan 8, 2017

Yes, as long as the backend for the container supports it, read-only sections of shared libraries will be shared and pulled from the same cache when available. The functionality that enables shared memory (and L* cache access in general) is implemented in silicon in the MMU, so as long as the backend properly updates the page tables, you can share pages across any container or VM (except when prohibited by other virtualization hardware).

It's not something that happens automatically, though, because each kernel is responsible for telling the MMU how to map memory for its own child processes only. Any cross-container page sharing has to be implemented at the host level, where the kernel has unrestricted access to all guest memory.

bitwiseand · on Jan 8, 2017

This is essentially the same as processes sharing one .so file (via page table mappings), a.k.a. dynamic linking :).