Julia writes: "I think 'violates a lot of normal Unix assumptions about what normally happens to normal processes' is basically the whole story about containers."
This is a key point. Lots and lots of standard Unix invariants are violated in the name of abstraction and simplification, the list of those violations is not well publicized, and most of the current systems have different lists.
For example, in Kubernetes (my current love affair), the current idea of PetSets (basically, containers that you want to be carefully pampered, like Paxos members, database masters, etc. -- stuff that needs care) /still/ has the notion that a netsplit can cause the orchestrator to create (1 .. #-of-nodes) exact doppelgangers of your container, all of which believe they are the one true master. You can imagine what this means for database masters and Paxos members, and that is going to be, as the kids say, surprising af to the first enterprise Oracle DB admin who encounters this situation.
If you believe in containers, then one thing you really do have to accept is that most of your existing apps should not be in them yet, and that if your app is not (a) stateless, (b) strongly 12-factor, (c) designed for your orchestrator, and (d) written not to do things like fork() or keep strong references to IP addresses, then you should probably wait 3-4 years and use VMs in the meantime.
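To make the "strong references to IP addresses" part of (d) concrete, here is a minimal Python sketch, not from any particular app. It assumes the service is reachable by a stable name and that the orchestrator may move it to a new address at any time. The helper name resolve_fresh is made up for illustration.

```python
import socket


def resolve_fresh(host: str, port: int) -> list[tuple[str, int]]:
    """Look the name up on every call instead of caching an IP at startup.

    Hypothetical helper: the point is only that the address is re-resolved
    each time a connection is about to be made.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the first two fields of sockaddr are the address and the port.
    return [(info[4][0], info[4][1]) for info in infos]


def connect(host: str, port: int, timeout: float = 2.0) -> socket.socket:
    """Open a TCP connection, resolving the name at connect time."""
    # create_connection resolves the host itself and tries each address in turn.
    return socket.create_connection((host, port), timeout=timeout)
```

An app that resolves once at boot and stores the result keeps talking to wherever the container used to be. Resolving at connect time costs a lookup but follows the container when it is rescheduled.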
Oracle has had multi-homed master-master RDBMS setups for > 10 years. I'm pretty sure a half-competent Oracle administrator wouldn't be really 'surprised af' at functionality that's been in Oracle for at least a decade.
For things that need 'care', this has been a solved problem for decades. Banks[0] homed in the WTC on Sept 11 kept on running because OpenVMS has had NUMA clusters and multi-node replication since the DEC Alpha days. This is with 100% transactional integrity maintained and DC failovers measured on the order of 500ms to 5s. (Obviously banks don't all run on VMS.)
Platforms like IBM z Systems let you live-upgrade z/OS in a test environment hosted within the mainframe to see if anything breaks (in complete isolation from production, of course), revert snapshots, and do basically everything the whole ESX suite does (from things like VMotion live migrations to newer stuff like growing RAID arrays transparently, or virtual storage solutions where you can add FC storage dynamically and transparently to the end user). Their stock systems let you live-upgrade entire mainframes without a blip. They're built to withstand total system failure (i.e. literally processors, RAM, NICs, and PSUs could all fail on one z13 and you'd fail over to a hot backup without losing any clients attached to the server). HP's NonStop, with which I have no experience, offers a similarly comprehensive set of solutions.
[0] On Sept 11, a bunch of servers went down with those buildings.
* “Because of the intense heat in our data
center, all systems crashed except for our
AlphaServer GS160... OpenVMS wide-area
clustering and volume-shadowing technology
kept our primary system running off the
drives at our remote site 30 miles away.”
--Werner Boensch, Executive Vice President
Commerzbank, North America*
http://ttk.mirrors.pdp-11.ru/_vax/ftp.hp.com/openvms/integri...
I'm saying that an arbitrary number of exact replicas of a master can magically appear on the network believing they are the one true master, identifying themselves as such, and expecting to act that way. Additionally, an arbitrary number of database masters expecting to participate in the cluster may show up or leave at any time. That is somewhat nontrivial for even modern databases to deal with.
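One common defence against this is a fencing token, which is a different technique from anything Kubernetes itself provides and is named here only as an illustration. The sketch below assumes every new master instance gets a strictly increasing token from some external coordinator at startup, and that the storage layer checks it. FencedStore and its methods are hypothetical names, not a real API.

```python
class FencedStore:
    """Toy storage layer that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0  # largest token seen so far
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        """Apply the write only if the token is not older than one already seen."""
        if token < self.highest_token:
            # An older "master" woke up after a newer one took over: refuse it.
            return False
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
store.write(1, "balance", "100")  # original master, token 1
store.write(2, "balance", "90")   # replacement master, token 2
store.write(1, "balance", "500")  # doppelganger with the old token is rejected
```

This only helps if the doppelgangers cannot obtain a fresh token while partitioned, which is exactly the part that needs a real coordinator.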
Why run your database inside Kubernetes though? We've always white-gloved our database (and a few other special services). You don't have to put 100% of your infrastructure in Docker/Kubernetes.
If you're running multiple copies of anything that cares about the concept of a master, it had better have its own consensus algorithm. Luckily such things exist and are open source.
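This is not a consensus algorithm, just the majority rule that such algorithms lean on, sketched in Python. It assumes a fixed, known cluster size, and that each node can count the peers it can currently reach. Both function names are made up for illustration.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus the peers it can reach form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2


def may_act_as_master(believes_master: bool, reachable_peers: int,
                      cluster_size: int) -> bool:
    """A node that thinks it is master still steps back without a majority."""
    return believes_master and has_quorum(reachable_peers, cluster_size)


# 5-node cluster split 3/2: only the majority side may keep a master.
majority_side = may_act_as_master(True, reachable_peers=2, cluster_size=5)
minority_side = may_act_as_master(True, reachable_peers=1, cluster_size=5)
```

At most one side of a partition can hold a strict majority, so at most one side keeps a master. With an even split (say 2/2 in a 4-node cluster) neither side does, and you trade availability for safety.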
I think Kubernetes does a good job creating a normal "Unix process environment".
The Pod concept allows for:
- Container processes sharing localhost, mount points, etc
- Providing a "normal" IP address that is routable
- Ensuring a PID1 can monitor the group of processes (as done by rkt integration)
- Allowing for normal POSIX IPC (signals, etc)
As for PetSets, I do agree that they need more work to support things that are replicated but not cluster-aware. They don't magically solve the issues of distributed systems. Also, natively cluster-aware things might be better served by controllers. See this demo of an etcd controller:
It definitely does better than many of the rest, in my experience, and for sure it has better defaults and chooses its violations carefully and generally wisely. In fact, I wrote the first draft of a paper on this specific topic:
Having been inside Google when Docker started to get big, there's a really simple explanation for all of this:
Kubernetes is a well-designed descendant of a well-designed API with pretty specific tradeoffs for distributed systems (that mostly still work at the small scale).
Docker is a reverse-engineered mishmash of experiments attempting to replicate the same ancestor. Take the horrible network abstraction layer: Google had the advantage of being able to move all their apps to a well-understood naming scheme, rather than treating IP addresses as immutable. Any app that treats IP addresses as immutable is carrying technical debt, but it worked for a long time. Now it doesn't.
Docker has tried to fix these things by wrapping them, not fixing the underlying debt. That only ever accumulates more debt, and rarely even provides the stopgap solution that is required. It's an admirable effort, and they've done a fantastic job - but a fantastic job at a fool's errand is still not behavior to emulate.