Docker not ready for primetime (goodstuff.im)
490 points by erlend_sh on Aug 28, 2016 | 281 comments



I have run Docker in production at past employers, and am getting ready to do so again at my current employer. However, I don't run it as a native install on bare metal. I prefer to let a cloud provider such as Google deal with the infrastructure issues, and use a more mature orchestration platform (Kubernetes). The author's complaints are valid, and the Docker team needs to do a better job on these issues. Personally I am going to be taking a close look at rkt and other technologies as they come along. Docker blew this technology open by making it more approachable, but there is no reason to think they are going to own it. It's more like databases than operating systems.


I'm trying to get started with rkt right now, but its (understandable) lack of maturity is a bit daunting; I think some usability issues need to be handled or offloaded to other tools. And acbuild severely needs caching built in.

Disclaimer: I'm new to containers, so if it sounds like I'm doing something wrong, let me know; it certainly feels like I'm missing something right now.


You'll see maturity in this space after the Open Container Initiative standardizes the container image format. Then, developers can focus on UX improvements rather than worrying about what the build tool should even be producing.


Reliance on initiatives and standards bodies to provide guidance for maturity, at least in recent years, is a fool's errand. They usually end up rubber-stamping what's been the norm and working back from there.


That's true, OCF is more or less the same as the Docker container format.


If you are looking at rkt and have the time, take a look at Kurma (open-source too). Kurma was built using the same specification rkt was built from. Here is the getting started guide: http://kurma.io/documentation/kurmad-quick-start/ What I like about Kurma is that I can simply run Docker images from the hub: no need to download, convert, etc.

Disclaimer: I am an Apcera employee and Kurma is an open-source project sponsored by Apcera


We use it in production.

It generally works if:

* you don't use it to store data

* don't use 'ambassador', 'buddy', or 'data' container patterns.

* use tooling available to quickly and easily nuke and rebuild docker hosts on a daily or more frequent basis.

* use tooling available to 'orchestrate' what gets run where - if you're manually running containers you're doing it wrong.

* wrap docker pulls with 'flock' so they don't deadlock

* don't use swarm - use mesos, kube, or fleet (simpler, smaller clusters)


woah, wait.....

It "geneerally works if" you " rebuild docker hosts on a daily or more frequent basis."

Perhaps I'm misunderstanding, but needing to rebuild my prod env several times a day seems pretty "not ready for prime time" to me.

That's like when we'd say that Rails ran great in production in 2005, as long as you had a cron task to bounce fastCGI processes every hour or so.

So, can you elaborate on why rebuilding the containers is good advice?


The security exec at Pivotal, where I work, has been talking about "repaving" servers as a security tactic (along with rotating keys and repairing vulnerabilities).[0]

The theory runs that attackers need time to accrue and compound their incomplete positions into a successful compromise.

But if you keep patching continuously, attackers have fewer vulnerabilities to work with. If you keep rotating keys frequently, the keys they do capture become useless in short order. And if you rebuild the servers frequently, any system they've taken control of simply vanishes and they have to start from scratch.

I'm not completely sold on the difference between repair and repave, myself. And I expect that sophisticated attackers will begin to rely more on identifying local holes and quickly encoding those in automated tools so that they can re-establish their positions after a repaving happens.

But it raises the cost for casual attackers, which is still worthy.

[0] https://medium.com/built-to-adapt/the-three-r-s-of-enterpris...


Having everything patched as soon as patches are available (or within, say, 6 hours of availability, for "routine" patches, with better responsiveness for critical patches) is a win.

The rest: not so much.

Rebuilding continuously for security is not something I would recommend.


> Rebuilding continuously for security is not something I would recommend.

So that I understand, could you elaborate?

Particularly, do you mean "not recommend" as in "recommend against" or "not worth the bother"?


It's not worth the bother. Apart from keeping patches up to date --- which is a good idea --- it's probably not really buying you anything.

It's not crazy to periodically rotate keys, but attackers don't acquire keys by, you know, stumbling over them on the street or picking them up when you've accidentally left them on the bar. They get them because you have a vulnerability --- usually in your own code or configuration. Rebuilding will regenerate those kinds of vulnerabilities. Attackers will reinfect in seconds.


A lot of companies do lose their keys that way: www roots, gists, hardcoded in products, GitHub history, etc.

The win to rotating them is not so much because you'll be regularly evicting attackers you didn't know had your keys, but because when you do have a fire, you won't be finding out for the first time that you can't actually rotate them.

It also forces you to design things much more reliably, which helps continuity in non-security scenarios.

After redeploying and realizing that Todd has to ssh in and hand edit that one hostname and fix a symlink that was supposed to be temporary so the new version of A can talk to B, that's going to get rolled in pretty quickly. Large operations not doing this tend to quickly end up in the "nobody is allowed to touch this pile of technical debt because we don't know how to re-create it anymore" problem.


It seems like it's good to be able to rebuild everything at a moment's notice after patching against a major exploit, though. You should have a fast way to rebuild secrets and servers after the next heartbleed-scale vulnerability.


Being able to rebuild critical infrastructure from source, and know that you'll be able to reliably deploy it, is a _huge_ win for security.

After a bunch of harrowing experiences with clients, I'm pretty close to believing "using packages for critical infrastructure is a bad idea".


> Being able to rebuild critical infrastructure from source, and know that you'll be able to reliably deploy it, is a _huge_ win for security.

In that case, you might be interested in bosh: http://bosh.io/docs/problems.html (the tool that enables the workflow jacques_chester was describing). It embraces the idea of reliably building from source for the exact reasons you've mentioned.


I'm confused now, earlier you recommended patches over rebuilding continuously from source, but this seems like the opposite?


What does "packages" mean here? Sorry.


My guess is that "packages" is shorthand for "binary packages", as opposed to being able to redeploy from source.


Nod.


I'm guessing they meant to write "patches".


That was my hunch too. Thanks. I'll ask more about whether I missed something on the other side of the argument.


Depends. Usually you have to be able to re-build your prod infra within minutes, or at most hours; otherwise you are doing devops wrong. The whole point of automation is reproducible infrastructure that you can stand up quickly. With a stateless approach you can just do this. Why would you do that? Imagine an outage in one of the 3 datacenters you are running your infra in within the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters. This is not too different from re-building it.


Aaah, the classic "you're doing <it> wrong" argument. I can come up with dozens of different environments where it is simply not feasible to rebuild an environment within two hours.

- Any infrastructure with lots of data. Data just takes time to move; backups take time to restore.

- You're on bare metal because running node on VMs isn't fast enough.

- You're in a secure environment, where the plain old bureaucracy will get in the way of a full rebuild.

- Anytime you have to change DNS. That's going to take days to get everything failed over.

- Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs.

- Amazon gives you the dreaded "we don't have capacity to start your requested instance; give us a few hours to spin up more capacity"

> Imagine an outage in one of the 3 datacenters you are running your infra in the same region. You need to move 1/3 of the capacity to the remaining 2 datacenters.

Oh, this is very different. If your provider loses a datacenter, and your existing infrastructure can't handle it, you're already SOL - the APIs for spinning up instances and networking are going to be DDoSed to death by all of the various users.

Basic HA dictates that you provision enough spare capacity that a DC (AZ) can go down and you can still serve all of your customers.


I mostly disagree with your points, with the exception of the last one.

I used to work in the team that runs Amazon.com. All of the systems serving the site can be re-built within hours, and nothing can serve the site that cannot be rebuilt within a very tight SLA. However, I understand that not all companies have this requirement. This capability is only relevant when site downtime hurts the company too much to be allowed.

Responding to your points:

- Lots of data -> use S3 with de-normalized data, or something similar

- Running a VM has 3% overhead in 2016; scalability is much more important than single-node performance

- High security environments are usually payment processing systems, downtime there can be a bit more tolerated, delaying transactions is ok

- Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes

- This is a networking challenge, using something like EIP (where the public facing IP can be attached to different nodes) makes this a non-issue

- Amazon has an SLA, they extremely rarely have a full region outage, so you can juggle capacity around

Losing a dc out of 3 requires work not because you can't handle the load, but because you want to restore the same properties (the same extra capacity, for example) as before. Spinning up instances should not DDoS anything; it should put a constant load on the supporting infrastructure.

The last point I agree with.


First, two important assumptions I'm making when I say this (and I feel they are reasonable assumptions): I'm not just talking about bringing a production environment back up in the same or an adjacent AZ; I'm talking about true DR, where you're moving regions. I'm also not limiting my discussion to AWS' infrastructure - not with Google, Rackspace, Cloudflare and others in the space as well.

> Lots of data -> use S3 with de-normalized data, or something similar

S3's use case does not match up with many different computing models (hadoop clusters, database tables, state overflowing memory), and moving data within S3 between regions is painful. Also, not all cloud providers have S3.

> Running a VM has 3% overhead in 2016, scalability is much more important than a single node performance

Not when you have a requirement to respond to _all_ requests in under 50ms (such as with an ad broker).

> High security environments are usually payment processing systems

Or HIPAA, or government.

> delaying transactions is ok

Not really. When I worked for Amazon, they were still valuing one second of downtime at around $13k in lost sales. I can't imagine this has gone down.

> Amazon uses DNS for everything, even for datacenter moves. It is usually done within 5 minutes

Amazon also implements their own DNS servers, with some dynamic lookup logic; they are an outlier. Fighting against TTL across the world is a real problem for DR type scenarios.

> EIP (where the public facing IP can be attached to different nodes) makes this a non-issue

EIPs are not only AWS-specific, but they cannot traverse regions, and they rely on AWS' API being up, which has not historically always been the case.

> they extremely rarely have a full region outage, so you can juggle capacity around

Not always. Sometimes you can, but not always. Some good examples from the past: anytime EBS had issues in us-east-1, the AWS API would be unavailable. When an AZ in us-east-1 went down, the API was overwhelmed and unresponsive for hours afterwards.

> Spinning up instances should not DDOS anything, it is with constant load on the supporting infrastructure.

See above. There's nothing constant about the load when there is an AWS outage; everyone is scrambling to use the APIs to get their sites back up. There's even advice to not depend on ASGs for DR, for the very same reason.

AWS is constantly getting better about this, but they are not the only VPS provider, nor are they themselves immune to outages and downtime which requires DR plans.


"Any infrastructure with lots of data. Data just takes time to move; backups take time to restore."

Exactly. Don't put data in Docker. Files go in an object store, databases need to go somewhere else.


> Any infrastructure with lots of data.

OP's first point is 'don't put data in docker'. Docker is not for your data. But more to the point, if you're rebuilding your data store a couple of times every day, a couple of hours downtime isn't going to be feasible.

> You're on bare metal because running node on VMs isn't fast enough

In such a situation, you should be able to image bare metal in less than 2 hours. dd a base image, run a config manager over it, and you should be done. Small shops that rarely bring up new infra wouldn't need this, but anyone running 'bare metal at scale' should.

> bureaucracy

Isn't part of the infra rebuild per se.

> Anytime you have to change DNS. That's going to take days

Depends on your DNS TTLs, but this is config, not infra. Even if it is infra, 48-hour DNS entries aren't a best practice anymore (and if you're on AWS, most things default to a 5-minute TTL).

> Clients (or vendors) whitelist IPs, and you have to work through with them to fix the IPs

I'd file this under 'bureaucracy' - it's part of your config, not part of your prod infra (which the GP was talking about).

> Amazon gives you the dreaded...

Well, yes, but this is on the same order as "what if there's a power outage at the datacentre". Every single deploy plan out there has an unknown-length outage if the 'upstream' dependencies aren't working. "What if there's a hostage event at our NOC?" blah blah.

The point is that with upstream working as normal, you should be able to cover the common SPOFs and get your prod components up in a relatively short time.


> OP's first point is 'don't put data in docker'. Docker is not for your data.

I agree, but I (and the GP, from my reading) was not speaking about only Docker infrastructure.

> Isn't part of the infra rebuild per se.

I can see your point, and perhaps these points don't belong in a discussion purely about rebuilding instances. That said, I have a very hard time focusing just on the time it takes to rebuild capacity when discussing a DC going down; there are just too many other considerations that someone in Operations must weigh.

When I have my operations hat on, I consider a DC going down to be a disaster. Even if the company has followed my advice and the customers do not notice anything, we're now at a point where any other single failure will take the site down. It's imperative to get everything that went down with that DC back up, and that's going to take more than an hour or two.


I know you probably aren't trying to address all cases, but just because you can't re-build your prod infrastructure in minutes or hours doesn't mean you aren't doing devops right.

Many larger companies can't do this; my company has 70+ datacenters with tens of thousands of servers. We can't re-build our prod infra in minutes or hours. We are still doing devops right :D

Like I said, I know you aren't talking about my situation when you made your statement... I just get frustrated when people act like there are hard and fast rules for everyone.


Well, I am not talking about a 1M-node outage. That you cannot fix with anything. I am talking about at most a datacenter-wide outage, which actually happens pretty often. Amazon has game days, Netflix has chaos monkeys for the same reason. Make sure that you can rebuild parts of your infra pretty quickly.


As long as your VMs are prebaked and your cloud provider supports ASG-esque primitives (i.e. "I need X instances running at a time, instantiate with such and such metadata"), anyone can rebuild their prod infra quickly. You don't need Docker or containers to do that.
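For concreteness, here is a minimal sketch of that kind of primitive using the AWS CLI; the group and launch-configuration names, the AMI ID, the instance type, and the subnets are all placeholder assumptions, not anything from this thread:

    # Bake an AMI with the app on it, then ask the ASG to keep N copies alive.
    # Every identifier below is a placeholder.
    aws autoscaling create-launch-configuration \
        --launch-configuration-name app-lc-v42 \
        --image-id ami-12345678 \
        --instance-type m4.large

    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name app-asg \
        --launch-configuration-name app-lc-v42 \
        --min-size 3 --max-size 3 --desired-capacity 3 \
        --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"

The ASG then replaces any instance that dies, which is the "rebuild quickly" property being described here, no containers required.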


Some of us work for those cloud providers :)


I have no doubt ;) it's why I always strive for accuracy here!


Oh, of course. Our datacenters often go offline (both planned and unplanned), and we are always ready to handle that. We are pretty much constantly re-provisioning servers... with so many physical machines, hard drive and other hardware failures are a daily occurrence.


I'm going to have to call bullshit. Where do you work, or what company do you own?

I work at MindGeek; depending on the time of year, it would be fair to say we rank within the top 25 bandwidth users in the world. We are not even close to that number of servers, and we deal with some of the largest traffic in the world. What company is running in 70+ datacenters!? The world's largest VPN provider? A security company providing all the data to the NSA?

Maybe it is just my broad assumptions, but I would hope that the big 10 that come to mind, such as Google, Amazon, Microsoft, etc., would be able to rebuild their production regions in hours.


Not the OP, but the first thing that comes to mind as to why you'd need a lot of datacenters is CDNs like Cloudflare or Akamai. For stuff like that, you need lots of servers, lots of storage, and low latency. You'd also need a good number of configurations, because things like video streaming would require different server settings than, say, protecting a site that's being DDoSed.


Ding ding ding :D

I don't work for one of those two, but I do work for a very large CDN.


> I work at MindGeek, depending on the time of the year it would be fair to say we rank within the top 25 bandwidth users in the world.

"I'd have thought I've had heard of them..."

One Wiki search later... yup. I've heard of them.


Ha! I went to their website... Hmm... never heard of them. Went to Wikipedia... Oh, now I know who they are! Yeah. Lots of bandwidth.


>What company is running in 70+ datacenters!?

People running edge networks, who therefore need servers local to everywhere in the world to keep latencies down. Maybe it's not as much of an issue for MindGeek (the parent company of a lot of video streaming sites). I would guess you guys need a lot of throughput, but latency isn't so much of a problem. Or you simply don't need to serve some parts of the world where it might be illegal to distribute some types of content.

FWIW, Cloudflare has 86 data centers: https://www.cloudflare.com/network-map/


Or they are customers of CDNs.


[Note: MindGeek = the pornhub network]

Short version: That's a video streaming website, which is rather simple, yet bandwidth-intensive.

Outsourcing the caching and video delivery means MindGeek can get by with few servers and a few locations.

Nonetheless, the CDN you're outsourcing to does need a lot of servers at many edge locations.

Actually, if we think in terms of "top bandwidth users in the world", it's possible that your company is far from being on the list. It's likely dominated by content delivery, ISP, and other providers, most of which are unknown to the public.


Would you say Netflix & YouTube are rather simple? Handling 100+ million users is never rather simple...


YouTube had the challenge of being one of the first streaming services, and it's operating at an unprecedented scale. I am actually wondering how many orders of magnitude more traffic YouTube has than Pornhub.

I am in distributed systems and try to work exclusively on hard problems. So when I say "simple", that is biased toward the high end of the spectrum.

If you go to pornhub.com and look at "popular keywords", you'll only find thousands or tens of thousands of videos. In a way, there is not that much content on Pornhub.

All major websites have challenges. Pornhub is a single-purpose website, and a lot of the challenge is in video delivery, which can be outsourced to a CDN nowadays.

"Simple" is maybe too strong a word. I am trying to convey the idea that it has limited scope and [some of] the problems it's facing are understood by now and have [decent] solutions.

That's not to say it's easy ;)


I work for a large CDN.


I think the point of the comment you're replying to is not really what you're addressing. Having the ability to rebuild quickly and frequently is great, and something you should aim for, but actually being forced to do it regularly whether you need it or not is pretty bad.

Edit: didn't notice someone else had already said this, opened this tab like 15 minutes ago


I see your point. The force that makes you do it should not come from Docker, I agree. :) We are just lucky that we can do it when it happens.


"have to be able" is very different from "have to"

Sure, you want to be able to deploy quickly. But if there's no reason to, then don't.

And I would be very scared if Docker images had a 1 day uptime max


Can != should


We see a couple of different bugs that are best solved by simply rebuilding the container host. To Docker's credit, these tend to decrease with newer versions.

We also see them mostly in non-prod environments where we have greater container/image churn. We use AWS autoscaling and Fleet, so containers just get moved to other hosts when we terminate them. We have actually thought about scheduling a Logan's Run-type job that kills older hosts automatically - it's in the backlog.


Bugs that can be solved by a rebuild are not restricted to Docker. We had an interesting week when the build was red all the time for various reasons, and then prod started failing. Usually we deploy once a day, and not deploying for a week caused several small memory leaks to turn into big ones.


> So, can you elaborate on why rebuilding the containers is good advice?

While I sincerely hope I'm wrong, I assume it's because you reset the clock on the probability something goes very wrong.


The "have you tried turning it off and on again" of DevOps. It makes a surprising amount of sense though, as long as your service is truly stateless, the restart can be easily orchestrated, and it results in no difference in operational costs.


If it's stateless, then why does rebuilding it change anything about the frequency of bugs popping up?


Ooh, I can answer this one: ask people if their container root is writable, and be amused at the blank stares you get back.

I am currently fighting an ongoing battle at work to point out that the plans for our Mesos cluster have not factored in that the first outage we have will be when someone fills up the 100GB OS SSD, because no one's given any thought to where the ephemeral container data goes.


I am a layman to devops. By "ephemeral container data" do you mean temporary files created by the service, temporary files created by the OS / other applications, or something else?


if both the code and the infra it's running on is stateless, then yeah.


So we're replacing somewhat not fully stable VMs running on somewhat not fully stable virtualization infrastructure with theoretically stable containers running on violently unstable container infrastructure?


Pretty much, yes.


Mutability is the root of all computing evils


And the source of all computing value.


The host isn't the container itself. They want to re-provision the host likely not because of something wrong with the application, but because Docker is in some state which is non-recoverable, or at least not recoverable by automatic means.


Because then you know you can always rebuild automatically, and that's being tested constantly while developers work, a bit like how Netflix randomly crashes things all the time to ensure it can always automatically recover from every dependency.

It also naturally rewards optimizing around time-to-redeploy, probably a lot of benefits there.


" use tooling available to quickly and easily nuke and rebuild docker hosts on a daily or more frequent basis."

That's like "my car works fine as long as you spray WD-40 into the engine block every 48 hours..."


Of course, since we're developers, we will automate this task with a 3D-printed WD-40 injector plus filling level indication for the dashboard.


Where can I read more about the container patterns you mentioned?


These are some good curated books: https://msdn.microsoft.com/en-us/library/dn568099.aspx http://shop.oreilly.com/product/0636920023777.do

These are not necessarily container centric.


http://static.googleusercontent.com/media/research.google.co... is a collection of more advanced container patterns you might be interested in.


google :)


That's a lot of ifs that could go away if Docker put some more emphasis on quality.


You shouldn't need to wrap pulls with flock anymore; the pull code got completely rewritten and should be fine now.

Also, hosts shouldn't need to be rebuilt THAT often. Of course, your infrastructure should automatically nuke and replace failing hosts (and there are various ways to establish what's "failing"), and like any good infrastructure team you should be keeping your hosts up to date with security patches, etc., which is best done by replacing them with new versions (as then you can properly test and apply a CI lifecycle), but docker itself is stable enough these days that you don't need to be that aggressive with nuking.

Agree (especially with the comments about orchestrating with kube/mesos) with all the other points, though.


Can you elaborate on:

"you don't use it to store data" - What is wrong with Docker volumes specifically? What issues did you run into?

"wrap docker pulls with 'flock' so they don't deadlock" - I have never had a problem with docker pull, can you elaborate on this?


I've been running Docker in production for a year and both of these issues were very real.

I got bit hard by this issue:

https://github.com/docker/docker/issues/20079

when CoreOS stable channel updated Docker. All my data volume containers broke during the migration and migration could not be reverted.

As for the second issue: when two systemd services that pulled containers with the same base images started simultaneously, they would deadlock pretty reliably. Had to flock every pull as a result. This might be fixed by now though.
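For anyone wondering what that workaround looks like in practice, here's a minimal sketch of serializing pulls with flock(1); the lock file path and the image name are placeholders of my own, not anything from the thread:

    # Serialize concurrent pulls on one host: flock holds an advisory lock on
    # the (arbitrary) lock file for the duration of the docker pull, so two
    # units starting at the same time queue up instead of racing the daemon.
    flock /var/lock/docker-pull.lock docker pull registry.example.com/myapp:latest

    # In a systemd unit, the same idea as an ExecStartPre line (hypothetical):
    # ExecStartPre=/usr/bin/flock /var/lock/docker-pull.lock /usr/bin/docker pull registry.example.com/myapp:latest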


Yikes that issue has been open since February?!


I'm echoing /u/bogomipz' request for elaboration. I rely heavily on docker-managed data volumes for data. The advice in these threads of comments is alarming. Should I conclude that hackernews devs don't understand docker-managed volumes, or is there some glaringly obvious downside that I'm failing to see?


I'm curious about the "buddy" and "data" container patterns -- I didn't see those in the documents you linked to. Why do you mention these as being best avoided?


May I ask: what dragons did you find that make you want to rebuild docker hosts daily? Do you do this just to properly clean up unused containers/images and such?


I spent close to 12 hours yesterday trying to get a fairly simple node app to run on my mac. Turned out I had to wipe out docker completely and reinstall. Keep in mind this is their stable version that's no longer in beta. I've just run into too many documented bugs for me to consider it stable. I wouldn't even say it should be out of beta.

The issues here are the real telling story: https://github.com/docker/for-mac/issues

I love Docker; it's amazing when it works. It's just really not there yet. I get that their focus is on making money right now, but they need to nail their core product first. I honestly don't care about whatever cloud platform they're building if their core app doesn't even work reliably.


I'm a fan of Docker in general but I'm amazed by some of the poor choices they've made with Docker for Mac.

- Stable? That's honestly laughable. It is nowhere near stable. As an example, I was trying to upload images to a third-party image registry, but the upload speed was ridiculously slow. It took me forever to figure out, but it turned out I needed to completely reinstall Docker for Mac.

- They had a command-line tool called pinata for managing daemon settings in Docker for Mac. They chose to get rid of it. Not only did we lose a way to declaratively define and set configuration, but the preferences window has nowhere near all of the daemon settings that are available.

- The CPU usage is still crazy. I regularly get 100% CPU usage on 3 of my 4 CPU cores while starting up just a few containers. Even after the containers have started, it will idle at 100% on 1 of 4 cores.

- It needs to be reinstalled regularly if you are using it on a daily basis. Otherwise it will get slower and slower over time. See my first complaint.

- The GUI (kitematic) will randomly disconnect from the daemon forcing me to restart the GUI repeatedly.

- They really need some sort of garbage collector with adjustable settings. With the default settings the app will just keep building and building images and eventually fill up, crash, slow down, etc. How is that acceptable? What other apps do that?

Like I said, I like Docker in general. I think they are tackling some very hard problems and definitely experiencing some growing pains from such crazy growth. However, at some point they need to take a step back, focus on the core of what they offer, and make it as simple and rock-solid as possible. As another example, they still haven't added a way to compress and/or flatten Docker images. No wonder Docker for Mac slows down after regular use when it's building 1GB+ images for simple things.
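In the meantime, the usual stand-in for that missing garbage collector is a periodic cleanup job along these lines; a sketch using 1.12-era CLI flags, run from cron or by hand (each command just errors harmlessly if there is nothing to remove):

    # Remove stopped containers, then dangling (untagged) image layers.
    docker rm $(docker ps -a -q -f status=exited)
    docker rmi $(docker images -q -f dangling=true)

    # Orphaned volumes pile up too (the volume subcommand exists since 1.9).
    docker volume rm $(docker volume ls -q -f dangling=true)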


Not 100% sure, but I think there's a memory leak in hyperkit. Eventually the memory usage will grow to fill up the allocated space, then Docker will crash. It might be something else causing it, but that's just what I've observed.

There's also the Docker.qcow2 file ballooning in size. The only way to fix it is to do a "factory reset" or run a couple of commands to clear out old images.


Not sure where the leak is, but I can confirm there is definitely one there. For me it happens whenever I restart Docker for Mac: 500+ MB of usable RAM gone. Combined with the fact that we don't reboot our Macs very often and that Docker for Mac needs frequent restarts because of hogging CPU otherwise, that's a bit of a problem.


Yes, agreed 100%.

Also, the CoW disk volume they use basically grows unbounded, even if you purge layers and old containers.


Heads up -- I think you messed up your copy/paste.


That, or the issues here are the real telling story.


I'd like to point out that Docker or no Docker, Node is one of the worst languages to host applications for: custom SELinux rules to get around it executing memory on the stack, memory- and CPU-hungry applications, npm is an absolute nightmare, and you have people that don't understand programming writing 'code' they think is production-worthy.


This is because Docker wasn't written for Apple's sad excuse for BSD; it was written for Linux.

It's cute that they're trying to port it other places, but if you're trying to run with stability, don't use a Mac, use a Linux box.

Why they did this, I have no idea, but it's a horrible idea.

If nothing else use VirtualBox, install Linux, and use docker that way.


Only, Docker has the same issues on Linux -- in fact, most of the people complaining in this thread use it on Linux in the first place.


One of Docker's biggest problems is that internally they have fomented a culture of "users are stupid" which is immediately apparent if you interact with their developers on GitHub.


Docker founder here.

It makes me sad that you believe that. We're not perfect but we try very hard to keep our users happy.

Are there specific issues that you could point out, so that I can get a sense of what you saw that you didn't like?

Keep in mind that anyone can participate in github issues, not just Docker employees, and although we have a pretty strict code of conduct, being dismissive about another participant's use case is not grounds for moderation.

EDIT: sorry if this came across as dismissive, that wasn't the intention. We regularly get called out for comments made by non-employees, it's a common problem on the github repo.


Look dude, this is hard. Most of the people here are rooting for you, even the ones complaining. I'm on your side.

But this isn't about you, or your feels, or what you think is going on.

All that's happening is people are trying to communicate with you and you're not listening. You're doing your best, I don't doubt that, but you've got to step back, regroup, and come at the problem from another angle.

Don't get defensive, don't make excuses. "When you make a mistake, take off your pants and roll around in it." ;-) Give people the benefit of the doubt that they mostly know what they're complaining about (even if they don't.)

It's a pain-in-the-ass, but it's the only way to deal with this kind of systemic (mis-?)perception in your community.


I have said one form of this or another to Solomon regarding Docker the company and his own personal defensiveness on two or three separate occasions on this account alone[0]. So while I laud your effort, I don't think much is going to change on this front. Honestly, that's too bad and I'm not being snarky or shitty here; I genuinely wish Docker and Solomon would work on this stuff because it'd go a long way. I'm discouraged to even bring stuff up any more.

I don't even think it's intentional, really, but I have never once seen a response by Solomon to criticism of Docker that did not dismiss the content and messenger in some way. These things are hard to get right and I'm not perfect, so I don't even really know what advice to offer and I'm far from qualified. I am very frequently dismissive of criticism as well and have had to put a lot of work into actively accepting it, so I can at least understand how hard it is.

I outright told him his initial reaction to Rocket, to use an example, directly caused me to plan for a future without Docker. Companies are defined by their executives, and a lot of Docker's behaviors become clearer when you consider some of the context around Solomon's personal style.

[0]: the most productive example being https://news.ycombinator.com/item?id=8789181


I feel a little badly about the exchange now, having more of the context available.

I told my sister last night, jokingly, "I think I made a co-founder of Docker cry on HN today." (I also explained what Docker and Hacker News are a little. She likes me so she didn't call me a nerd to my face.)

Honestly, I was just amused at the reply evidencing (or seeming to) the very attitude the OP was complaining of, it was incredible.


It is your right to ignore nonpaying users, but do so at your own peril. Red Hat is a perfect example of this. They started losing their lead in the Linux market as soon as they quit directly supporting the desktop version of Red Hat Linux. Fedora was a poor replacement, and it opened the door for Ubuntu. What Red Hat didn't realize was that even though these nonpaying users seemed like a drain on resources, they were the next generation of hackers who would have recommended, or even insisted upon, Red Hat Enterprise at their future jobs and would have brought with them the expertise to roll it out.

Docker may be on top now but if you don't cater to the needs of your nonpaying users, it will be short lived. This includes taking seriously feature requests which don't necessarily serve enterprise.

You can find me on GitHub.


Why don't you just ignore people like this? There's a whole thread about technical issues upthread, and you respond to some random guy's emotional problems?

If there is a problem with the way the Docker open source project is run, it is with the amount of small but serious bugs that affect subsets of your users. This is probably due to you guys overextending.

Responding that you are saddened by this guy's issue with his perception of the 'culture' in your team is just craziness. No one will be helped by this. The only thing that will help is showing that you guys are running a tight ship. Maybe hire an extra person or two to maintain your GH issue garden, and proceed with making Docker the successful business we all believe it can be.


FWIW, I read this comment and didn't find it dismissive at all. shykes indicated this isn't the experience the Docker community is striving for and asked for specific instances so that he could gain more context and potentially address the problem.

I have seen many instances of individuals with no affiliation to an open-source project being negative or rude and therefore giving a negative impression of the community or parent company. As a project owner or maintainer this is a really difficult thing to see happening, and can be difficult to address and prevent. So I personally think shykes makes a valid point. I wouldn't take this as dismissive. Give him the benefit of the doubt.

I also think we should cut maintainers a bit of slack. Imagine you are managing an open source project with over 10,000 logged issues and even more in the community forums. It is draining to spend all day dealing with complaints and issues, and it can be a thankless job. Maintainers often try to put their best foot forward, but it isn't easy; they are human, and they make mistakes and say things they regret. I'm not saying it is justified, just try to put yourself in their shoes.

I've personally logged a few issues with the Docker team and have also found the interactions to be respectful. There are times when I've asked for features and they have certainly played devil's advocate, but that is to be expected with any project that is trying to prioritize its work and constantly fighting feature creep.


Do you realize you've just provided an example?

The third paragraph particularly is spectacular, as you manage to be dismissive about being dismissive.

Incroyable.


Sorry, that wasn't the intention. I tried to clarify with an edit in the original post.

It looks like now is not a good time for me to participate in this discussion. Instead I'll make a list of things we can improve so that the next HN post about Docker is a more positive one.


naaaaaaw! Don't squander this golden opportunity to connect with your peeps! You can turn this around, but you've got to "walk the talk" here.

You should just be like, "Hey folks, we're sorry as hell we let you down. I'm here now and I'm listening. What has made you sad, and what can we improve?"

People love that sh--stuff.

That's just my advice. I don't even use Docker. I just like to see warm-fuzzy success, eh?


I'm not the original commenter, but the "--insecure-registry should be on docker pull" issue is one that impacted me personally. It was maddening reading all the comments from frustrated users who were running into the same problems I was: https://github.com/docker/docker/issues/8887


I think this is one of many instances where Github's issue system doesn't enable good communication between developers and end users.

- You shouldn't have to read through pages of posts to find potential workarounds.

- You shouldn't have to count comments to guess if there are a lot of people affected by an issue.

- You shouldn't have to read comments, then wonder which posts were an official developer response or a comment from another end user.

- There should be some way to communicate easily that "this issue has our attention, but dealing with it is de-prioritized because ..."

In this instance, you can see it's a security issue and it will be tough to convince Docker to change it, but at this point it's two years later and people are still griping about it, so maybe it's time to put some attention on it. There are a few suggestions for seemingly reasonable updates, but nobody is championing any of them, presumably because there's no indication other than comment activity that Docker will consider any updates on this issue whatsoever.

I'm not sure myself what the right thing to do from Docker's perspective would be, but this is clearly one of those issues that has worn through the attention span of the development team. Someone either needs to stand up and say "we're not changing anything here" or "we're looking for a solution" - no bug report should be open that long.


"Like @ewindisch already said, we do not want to encourage this client-side behavior. The pain induced by requiring the flag as a daemon flag, is so that people actually set up TLS on their registry. Otherwise there's no incentive. Thanks for your understanding."

This was a reply by @tiborvass in https://github.com/docker/docker/issues/8887

Can you possibly get any more dismissive?


[flagged]


Dude, you literally had the founder of Docker asking you where the opinion came from that devs don't care about users. He or she just pointed out that oftentimes non-employees talk on GitHub issues, and to just remember Docker can only be responsible for employees' conduct, not that of others. You've read into this that

1) you have been called a mouth breathing moron

2) it is confirmed that docker is run by assholes

3) the founder does not care about users (by asking for examples of poor conduct by Docker employees)

4) docker is run by a lousy leader who cannot demonstrate care, again by asking where people get a sense of docker not caring.

Internet comments lack a lot of tone so its easy to misread, but when there's this kind of fanatic backlash when there's a direct connection to the founder of a pretty big service, the luster of any of these sites dims considerably. You have given nothing actionable to the person who could have done something positive for docker, and just made it that more difficult for the organization to take legitimate criticisms at face value.


Well, that's an incredibly cynical and negative way to read shykes' words. FWIW, I thought his post was fine.

A more constructive response than yours would be to post examples of where Docker has failed in community engagement, rather than just re-stating his post with the most negative spin possible.


Hi,

Can you please add Facebook connectivity to Docker?

Thanks.


They more or less ignored the file sync issue of Docker for Mac, until there was a push asking the Docker devs for more transparency. There were other issues with the release, and where people did not push for it, issues would languish with no one from the dev team saying anything. It sometimes feels like a culture of coverups and shame. (Let's hide this under here and maybe no one will notice).

If people are feeling the "users are stupid" attitude, combined with what looks like a coverup culture, I find Docker's blog posts less and less credible.

For example, there was a recent blog post about an independent security review comparing Docker to other similar technologies, including rkt. The conclusion of the post was that Docker is secure by default. (I'll leave it to the reader's own opinion whether that is true or not; this is not the issue I am pointing out.)

What is so weird is that, nearly two years after the fact, the tone in that blog post was still as if CoreOS had betrayed them. And while I get there are hurt feelings involved, this is not a high school popularity contest. When combined with arrogance and coverup culture, I can see Docker moving in a direction that drives them further and further from relevancy.

Based on what I'm hearing here about Docker Swarm, Swarm is basically great advertising for K8S or Mesos. No one there is pretending that multi-node orchestration is an easy problem. If the divergence from the community continues, I can see lots of people getting fed up and going over to rkt, or something else that works just as well.


> They more or less ignored the file sync issue of Docker for Mac

Which particular problem? The crappy performance/CPU usage one(s), or something else? Having used various different approaches (docker in vmware linux/virtualbox), the d4m (osxfs?) one seemed to be the least broken for general dev stuff.

Hopefully not more issues I need to keep a lookout for.


Crappy performance related to osxfs -- latency issues for writes to host volumes. It's getting closer to the latency of Docker for Linux. (I hear, though, that workarounds like using unison and fswatch work all right.)

There is also the bit about host networking. Since Docker for Mac runs a transparent Linux VM, the host networking goes there instead. That thread ended up in a big, roaring silence.


I've also had such an experience. Same with Realm.io. The developers get very dismissive with just about anybody who comes in saying they have a use case which doesn't fit the scope of what already exists. Meanwhile features/fixes sit on the backlog for years.

I'm not saying this will necessarily stop me from using a project, but it definitely does not create any loyalty.

Compare this to say Rails, where I've had really positive experiences with the maintainers.


This is a pretty dangerous path to failure. If you look at the most successful companies, it is obvious that they respect the user base using their service. Software is no exception to that; you have to make it user-friendly, even if your user base is mostly software and systems engineers.


I don't think it's so cut-and-dry. I mean, does Oracle respect its users and is it user-friendly? What about Epic, the dominant software system in the healthcare industry? Or Microsoft SharePoint?


"If you look at the most successful companies it is obvious that they respect the user base using their service" - I think a better way of putting it would be that "successful companies respect their customers". In the case of software tools like Docker, the customers generally are the users. In the case of Epic, an electronic health record, the buyers are senior management, the users are the clinical staff such as the nurses, doctors and techs. So Epic does give the senior management what they want (buzzword compliance, a safe choice that it's hard to get fired for), but doesn't give clinical staff what they want..because they are not the ones making the buying decision.


Most, not all. I think you are still better off listening to your customers.


Won't validate whether this happens with Docker, but I've seen it other places. I won't name names, but I bought a commercial add-on to an open source project very similar to Docker, and when I had a support issue, I was pretty much told "tough luck, not my problem" by the project founder.


Was the "project founder" related to the company that sold you the commercial add-on? If not, why should he support it, unless of course there is some commercial relationship between the open source project and the company selling the add-on?


Yes, his company is the one that made the add-on; it was their first commercial product atop the open source project. Today they're a nice fat funded company, but then, they had little revenue, which really put me off.


I am using Kubernetes instead of Docker Swarm to orchestrate Docker images, and none of the points mentioned in this article apply. My cluster is small - I have <100 machines at peak - but so far it feels ready for prime time.

There are parts of docker that are relatively stable, have many other companies involved, and have been around for a while. There are also "got VC money, gotta monetise" parts that damage the reputation of stable parts.


> My cluster is small - I have <100 machines at peak

Let me ask: WTF are people doing such that <100 machines is a "small cluster"? I ran a Top 100 (as measured by Quantcast) website with 23 machines, and that included kit and caboodle -- Dev, Staging and Production environments. And quite a few of those were just for HA purposes, not because we needed that much. Stack Exchange also runs on about two dozen servers. Yes, yes, Google and Facebook run datacenters, but there's a power-law kind of distribution here, and the iron needs fall very, very fast as you move from, say, the Top 1 website to the Top 30.


The number of machines you need to run a service is not really a linear function of your traffic. If you have a mostly static website that can be heavily cached/cdn'd, you can easily scale to thousands of requests a second with a small server footprint. I expect that's true of many of the top 100 sites as measured by visitors (like Quantcast does).

But if you need to store a lot of data, or need to look up data with very low latency, or do CPU-intensive work for every request, you will end up with a lot more servers. (The other thing to consider is that SaaS companies can easily deal with more traffic than even the largest web sites, because they tend to aggregate traffic from many websites; Quantcast, for example, where I used to work, got hundreds of thousands of requests per second to its measurement endpoint.)


Note: the site I mentioned did hit the database quite a few times for each page. It was a nice challenge.


Not everyone can afford to scale vertically. Those 24 servers of Stack Exchange's together cost more than the average 100-machine 'small' cluster. At DigitalOcean the performance sweet spot probably lies at the $20/mo machine, so that's $2,000/mo for a 100-machine cluster. You think Stack Exchange pays less than that for its hardware?

Also, some sites are simply larger than StackExchange, and you never heard of them. There's a huge spectrum between StackExchange and Google.


There's some truth in that; the site I mentioned was, back in 2010, running DB servers with 144GB of RAM -- unlike SE we rented servers, and the colo numbers didn't look good.


Not all computing use cases are running a website.


I have a processing pipeline that does crawling, indexing and some analytics on the crawled data - I am building a vertical search engine catering to some specific kinds of companies.

Usually my cluster is very small, unless the pipeline is in progress. At this very moment I am not doing any processing, so I only have 3 machines up.


Heard this elsewhere. Why do people use Docker swarm with Kubernetes around? Does it require less setup or something?


I see two reasons for this:

- If you don't use Google Container Engine (hosted Kubernetes, also known as GKE), Kubernetes has a reputation for being hard to set up. I am running on GKE, so I can't comment too much.

- It's hard to see how all of the orchestration/docker/etc. things play together when entering the area. I expect many people hear "docker" and want to just try "docker", not being aware that there exist alternatives for some parts, that some parts are more reliable than others, etc. E.g. the article we are discussing seems to be doing this.


Kubernetes definitely runs best on GKE, but projects like kops (https://github.com/kubernetes/kops) are making setups on AWS etc. a lot easier.

You still have to deal with the other drawbacks of those platforms (slow networks, disks etc.) but that's not really a k8s issue.


> kubernetes have a reputation of being hard to set up

Yeah, but there's no shortage of things wrapping Kuber (e.g. OpenShift).


Kubernetes seems very unwieldy and hard to wrap my head around. Swarm seems straightforward and easy to figure out how to configure.

Let's give it a try and see if I can find a good tutorial for kubernetes to do what I want. (I haven't tried this experiment in a couple months, since before nomad and swarm got my interest.)

-----

Ok, I'm back. I went to kubernetes.io. Their "give it a try" has me creating a Google account and getting set up on Google Container Engine. Due to standard-issue Google account hassles, I quickly got mired in quicksand having nothing to do with Kubernetes.

I have no interest in google container engine. Let me set it up using vagrant, or my own VPSes, or as a demo locally, whatever, I'll set up virtual boxes.

They lost me there.


Check out minikube: github.com/kubernetes/minikube

It's a local setup to let you give Kubernetes a try. We haven't made it the default in the "give it a try" dialog yet, but we're considering it.

Disclosure: I'm an engineer at Google and I work on Minikube.
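If it helps, the local flow is roughly the following; a sketch that assumes minikube and kubectl are already installed, with the echoserver image standing in as a placeholder app:

    # Start a single-node cluster in a local VM (VirtualBox by default).
    minikube start

    # Run a throwaway app and expose it outside the cluster.
    kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
    kubectl expose deployment hello-minikube --type=NodePort

    # Print the service URL / open it in a browser.
    minikube service hello-minikube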


I found the Minikube getting started to be a good experience about a month ago. If you did make Minikube the default for "give it a try" that would instantly gain k8s more credibility in my eyes, as it calls attention to the excellent tooling around k8s, diminishes the perception of GKE as the only first-class k8s environment (ahem, lock-in!), and promotes the notion of an economical and fast k8s development environment.


It isn't directly linked from the front page, but they have a getting started section [0] that covers non-GKE options. I run k8s on AWS using the Kops tool.

Kubernetes definitely requires more learning up front than Swarm, but that's mostly because it has different and more powerful primitives and more features built in.

[0]: http://kubernetes.io/docs/getting-started-guides/


What are standard Google account hassles? The k8s getting started guide is one of the single best cloud setup guides I've ever read. You've deployed your app, and can easily leverage the concepts listed there to deploy more complex apps.

K8s might seem more unwieldy than swarm, but from that feature set you can expect things to work the way they are explained.

Swarm on the other hand has made my entire team question whether 1.12 is even worth upgrading to.


Completely, 100% disagree that the getting started guide is easy. Let's go through this by the numbers. First off, specifically comparing the guides, the Docker 1.12 stuff works ANYWHERE you have root access. When I go here, I'm totally confused:

http://kubernetes.io/docs/getting-started-guides/

Ok, let's see: not only do I immediately get diverted to another page, but I feel like not every OS+cloud combination is represented. I guess the CLOSEST thing to working is Ubuntu+AWS. Click.

http://kubernetes.io/docs/getting-started-guides/juju/

YAY! Juju, a new technology I need to learn. Hey, guess what: this only works with Ubuntu. Closes browser. I spent weeks trying to map this out in my head. I can't understand why Kub doesn't just "install" like Docker does.

Ok, back to the deployment guide. Let's see there's a GIANT TABLE OF LINKS based on cloud+OS+whatever. So I think it's a massive understatement to say that Kub is more unwieldy than Swarm 1.12.


The thing is, the docker stuff might be simple, but it doesn't actually work. Or at least the networking portion doesn't work as advertised.

http://kubernetes.io/docs/hellonode/ is the guide that I was talking about. From your post it was not clear that you weren't using GCE.


The docs are more of an encyclopedia: a lot of facts, not a lot of editorializing. Unless you enjoy learning from encyclopedias, I would highly recommend talking to other members of the community on the kubernetes slack (kubernetes-novice, kubernetes-user or for AWS sig-aws). That way you can quickly find what has worked well for people and what hasn't, and hopefully save yourself a ton of time.

I would love for our docs to be better - good people are working on it, though documentation is always hard. In the meantime, the community is a wonderful resource!


You can have a swarm with only a handful of commands now, with TLS (for the Docker API). Eventually, stable docker-host-to-docker-host container networking will have encryption out of the box too. The number of steps for a production Docker setup keeps getting shorter and shorter; Kubernetes seems to be focused on this aspect now, so it might catch up. This is all, of course, ignoring the IaaS solutions, and obviously, for the reasons stated in the article, Swarm doesn't see those benefits with 1.12, because major features which are supposedly stable do not work yet.
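For reference, the "handful of commands" in 1.12 is roughly this; a sketch in which the addresses, the join token, and the nginx service are placeholders:

    # On the first manager node (address is a placeholder):
    docker swarm init --advertise-addr 10.0.0.1

    # On each worker, run the join command that init printed (token elided here):
    docker swarm join --token <worker-token> 10.0.0.1:2377

    # Back on a manager, schedule a replicated service:
    docker service create --name web --replicas 3 -p 80:80 nginx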


Yeah setup time is great. However, it seems that the number of commands required for stability with docker swarm -> ∞


I used it because I didn't understand what I needed and wanted to stay in the Docker ecosystem to try to limit tooling issues. Kubernetes sounded like more than I needed given what Swarm offered. Turned out alright because I was able to learn a lot, but Docker tools aren't as great as they make them sound (nothing is production ready and devs are too constrained to improve anything). I've moved up to kube and feel like I would have saved a month if I'd gone with it in the first place.


I've used K8S before, and those things are smooth once installation happens. We have a project which is looking at using Docker Swarm on a two-node setup. The idea is that this will give us a baseline, and since it is looking more and more like we'll move to GKE, it sets us up for that later move. I thought Docker Swarm would be mature enough to handle two nodes for a short time (maybe half a year) until we make the leap to GKE. But based on the reports coming in from the wild, that's looking less and less likely.

We hired a contractor to do this part. My team is so resource-constrained that I don't have time for this, so we farmed it out. But now I'm thinking the risk of project failure is much higher than I thought, made worse by the fact that the contractor is also showing signs of having poor communication skills. (I would rather be updated on things going wrong than have someone try to be the hero or cowboy and figure it all out.)


we are exactly in your boat. However, we are signing up for kubernetes since we want to keep the GKE option open.

The only reason why we didn't go ahead is broken logging in k8s - https://github.com/kubernetes/kubernetes/issues/24677


I went through a few iterations for logging, but now I've settled on the built-in GKE logging. Stdout logs from my containers are picked up by Kubernetes and forwarded to Stackdriver. Since it's just stdout, I don't create too much lock-in. I use the Stackdriver dashboard for investigating recent logs and the BigQuery exporter for complex analysis. My stdout logs are JSON, so I can export extra metadata without relying on regexps for analysis - I use https://pypi.python.org/pypi/python-json-logger.


Just as a hint in case you're not aware of this: If you log errors in the format[1] expected by Stackdriver Error Reporting, your errors will automatically be picked up and grouped in that service as well.

1: https://cloud.google.com/error-reporting/docs/formatting-err...


Since we are not on GKE already, I want to be able to use k8s to forward to my host machine's journald, and that's broken. I think Google is doing a lot of hand-holding to get it working with Stackdriver.

This is the blocker for me. I can't switch to GKE yet because I use AWS PostgreSQL. But I want to use k8s :(


The K8S project I was involved in last year used AWS PostgreSQL too. At that time, figuring out how to have persistent data was too much. Further, the AWS EBS driver for K8S storage wasn't there, and PetSets had not come out. With PetSets out, I think figuring out how to run a datastore on K8S or GKE will be easier. (I had mentioned to my CTO that I didn't know how to do a persistent store on GKE; he pointed out the gcloud datastore; I told him that was not what I meant ;-)


This is interesting. Were you on GKE or AWS? GKE is obvious, but if you were hosting on AWS, what was your logging setup?

If you were not using AWS EBS, what were you doing?


The owner of the company ran out of money before I could add logging, but my plan was to get it out to something like papertrail.

I was on AWS. I sidestepped the issue by using AWS RDS (postgresql).

I had tried to get the nascent EBS stuff working, but when I realized that I'd have to get a script to check if an EBS volume was formatted with a filesystem before mounting it in K8S, I stopped. This might have been improved by now.


I probably wrote that support (or at least maintain it), and you shouldn't ever have needed to add a script to format your disk: you declare the filesystem you want and it comes up formatted. If you didn't open an issue before please do so and I'll make double-sure it is now fixed (or point me to the issue if you already opened it!)

On the logging front, kube-up comes up with automatic logging via fluentd to an ElasticSearch cluster hosted in k8s itself. You can relatively easily replace that ES cluster with an AWS ES cluster (using a proxy to do the AWS authentication), or you can reconfigure fluentd to send to AWS ES. Or you can pretty easily set up something yourself using DaemonSets if you'd rather use something like Splunk, but I don't know if anyone has shared a config for this!

A big shortcoming of the current fluentd/ES setup is that it also predates PetSets, and so it still doesn't use persistent storage in kube-up. I'm trying to fix this in time for 1.4 though!

If you don't know about it, the sig-aws channel on the kubernetes slack is where the AWS folk tend to hang out and work through these snafus together - come join us :-)


@justinsb - based on the bug link I posted above, what do you think is the direction that k8s logging is going to take?

From what you wrote, it seems that lots of people consider logging in k8s to be a solved issue. I'm wondering, then, why there is a detailed spec for all the journald stuff, etc.

From my perspective, it would be amazing if k8s could manage and aggregate logs on the host machine. It's also a way of reducing complexity to get started: people with 1-2 node setups start with local logs before tackling the complexity of fluentd, etc.

Is that the reason for this bug?


I'm not particularly familiar with that github issue. A lot of people in k8s are building some amazing things, but that doesn't mean that the base functionality isn't there today.

If you want logs to go into ElasticSearch, k8s does that today - you just write to stdout / stderr and it works. I don't love the way multi-line logs are not combined (the stack trace problem), but it works fine, and that's more an ElasticSearch/fluentd issue really. You'll likely want to replace the default ES configuration with either one backed by a PersistentVolume or an AWS ES cluster.
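
For example (a minimal sketch; kubectl is assumed to be pointed at your cluster, and "my-app-pod" is a made-up pod name):

  # the app just writes to stdout/stderr; no log files to manage in the container
  kubectl logs my-app-pod        # recent stdout/stderr from the pod
  kubectl logs -f my-app-pod     # follow the stream, like tail -f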

Could it be more efficient and more flexible? Very much so! Maybe in the future you'll be able to log to journald, or more probably be able to log to local files. I can't see a world in which you _won't_ be able to log to stdout/stderr. Maybe those streams are redirected to a local file in the "logs" area, but it should still just work.

If anything I'd say this issue has suffered from being too general, though some very specific plans are coming out of it. If writing to stdout/stderr and having it go to ElasticSearch via fluentd doesn't meet your requirements today, then you should open a more specific issue I think - it'll likely help the "big picture" issue along!


So far I'm feeling pretty good about the decision to skip the first generation containerization infrastructure.

At the outset it had the look of something that wasn't an advance over standard issue virtualization, in that it just shuffled the complexity around a bit. It doesn't do enough to abstract away the ops complexity of setting up environments.

I'm still of the mind, a few years later, that the time to move on from whatever virtualization approach you're currently using for infrastructure and development (cloud instances, virtualbox, etc), is when the second generation of serverless/aws-lambda-like platforms arrive. The first generation is a nice adjunct to virtual servers for wrapping small tools, but it is too limited and clunky in its surrounding ops-essential infrastructure to build real, entire applications easily.

So the real leap I see ahead is the move from cloud servers to a server-free abstraction in which your codebase, from your perspective, is deployed to run as a matter of functions and compute time and you see nothing of what is under that layer, and need to do no meaningful ops management at all.


This sounds like I wrote it.


Two days ago there was a fairly long discussion about a similar argument ("The sad state of Docker")

https://news.ycombinator.com/item?id=12364123 (217 comments)


The title should say: Docker Swarm is not ready for primetime. We have used Docker in production for more than two years and there have been very few issues overall.


This article should really be about Docker Swarm not being ready for production. It's much newer technology than Docker and is predictably brittle.

The only points made against Docker proper are rather laughable. You shouldn't be remotely administering Docker clusters from the CLI (use a proper cluster tool like Kubernetes), and copying entire credentials files from machine to machine is extremely unlikely/esoteric.

Docker, with Kubernetes or ECS, is totally suitable for production at this point. Lots and lots of companies are successfully running production workloads using it.


> You shouldn't be remotely administering Docker clusters from the CLI (use a proper cluster tool like Kubernetes)

Docker Swarm is advertised as a stable, production-ready cluster management solution. Then, if you actually try to use it, it is very much NOT. Kubernetes is great, but it feels like adding another layer to the system, and that is not always a good thing (especially if you are on AWS and have to work with _their_ infrastructure management too).


I would say that my experience with Docker has been fantastic. I run over 10 Ubuntu Trusty instances on EC2 as 8G instances, mounted with NFS4 to EFS. This makes it super simple to manage data across multiple hosts. From that you can run as many containers as you like, and either mount them to the EFS folder, or just spawn them with data-containers, then export backups regularly with something like duplicity.
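
For reference, the mount itself is roughly this (a hedged sketch; the filesystem ID, region, paths and image name are placeholders, and the exact options come from the EFS docs):

  # mount the EFS filesystem on the host over NFSv4, then bind it into containers
  sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
  docker run -d -v /mnt/efs/myapp:/data myorg/myapp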

I use Rancher with it, and it's dead simple using rancher-compose/docker-compose.

For a quick run-down see: https://github.com/forktheweb/amazon-docker-devops More advanced run-down of where I'm going with my setup: https://labs.stackfork.com:2003/dockistry-devexp/exp-stacker...


Just want to add that I also use it on Windows with barely any issues (Win 10 x64). I'm not sure how stable it is on Mac OS X, but Kitematic is pretty sweet.

The only problems I've had with Docker containers are those where processes get stuck inside the container and the --restart=always flag is set. When this happens, if you can't force the container to stop, the defunct container will restart after a reboot anyway and cause you the same issue...
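
One hedged workaround for that case (assuming a Docker recent enough to have `docker update`; `<container>` is whatever ID or name is stuck):

  # clear the restart policy so a reboot doesn't resurrect the stuck container
  docker update --restart=no <container>
  # then try to remove it by force
  docker rm -f <container>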

My solution to this has been to just create a clean AMI image with ubuntu/rancher/docker and then nuke the old host when it gives me problems. This is made even easier if you use EFS, because it's literally already 100% set up once you launch a replacement instance.

Also, you can do automatic memory- and CPU-limiting of your nodes using rancher-compose, plus health checks that re-route traffic with L7 & HAProxy: http://docs.rancher.com/rancher/v1.1/zh/cattle/health-checks...

The only thing even comparable to that in my mind would be Consul health checks with auto-discovery: https://www.consul.io/docs/guides/index.html


Most of the complaints I've seen recently about using Docker are about the immaturity of Docker Swarm. Can this be mitigated by using Docker with Kubernetes / Mesos / Yarn?

If it's truly a problem with the containerization format / stability with the core product, I'm not sure what a good alternative would be. I see a lot of praise for rkt but the ecosystem and tooling around it are so much smaller than that for Docker.


Using it with Kubernetes definitely helps. One reason is that if you have enough surplus nodes, then Docker misbehaving on one of them shouldn't screw up anything; Kubernetes is really good at shuffling things around, and you can "cordon off" problem nodes to prevent them from being used for scheduling.
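
The cordoning itself is just a couple of commands (a quick sketch; "bad-node" is a placeholder node name):

  kubectl cordon bad-node      # mark the node unschedulable for new pods
  kubectl drain bad-node       # evict the pods already on it so they reschedule elsewhere
  kubectl uncordon bad-node    # put it back once Docker is behaving again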


> Each version of the CLI is incompatible with the last version of the CLI.

Run the previous version of the cli in a container on your local machine. https://hub.docker.com/_/docker/

  $ docker run -it --rm docker:1.9 version
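
If you also point the containerised client at the engine you're targeting, it can manage that host too (a sketch; the host name and port are made up, and a TLS-secured daemon would additionally need its certs mounted in):

  # run the 1.9 client against an older remote engine, leaving the local install alone
  $ docker run -it --rm -e DOCKER_HOST=tcp://some-old-host:2375 docker:1.9 ps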


>Run the previous version of the cli in a container on your local machine.

I'd rather get out of IT and go into farming if this is considered a valid recourse.


May I join you?


This is both very clever and horrifying. It obviously works for development, but doing a rolling update of your fleet is stressful enough without the additional wrinkle of progressively updating which tool you need to inspect or control that fleet.


Or, you know, just set `DOCKER_API_VERSION` to the version of the engine you're interacting with.


If it has the capability to communicate cross-version, it should just do that, instead of displaying an error message that /doesn't/ point you to `DOCKER_API_VERSION`.


Not doing so is a perfectly defensible decision. Not informing the user of the environment variable, however, is indeed a fairly awful mistake.


Are we the only two people who know this??


I certainly hope not. It's literally the first environment variable listed in the Docker CLI reference doc: https://docs.docker.com/engine/reference/commandline/cli/


You literally just proved his point.


I don't think he was arguing against it.


It's so hard to tell. I hope they were joking.


I'm not sure that I did. Let's expand out my point to see if I'm on the same page as the replies here.

1. TFA complains that versions of servers require corresponding client tools to be installed.

2. TFA says he loves docker in dev, but not in prod.

3. I state (omitting many steps): install the latest docker in your dev environment (or wherever you're connecting to prod), spin up a small docker image to run the appropriate version cli. This is cheap, easy, easy to understand why it works, consistent with points 1 and 2.

What am I missing?


I have several past clients running docker in production just fine.

At my current job we run nearly all our services in docker.

I've replied to this type of comment on here at least a dozen times: it has nothing to do with Docker, it is a lack of understanding of how it all works.

Understand the history, understand the underlying concepts and this is no more complex than an advanced chroot.

Now, on the tooling side, I personally stay away from any plug-ins and tools created by the Docker team. They do Docker best; let other tools manage Docker's externalities.

I've used weave since it came out, and it's perfect for network management and service discovery.

I prefer to use mesos to manage container deploys.

There is an entirely usable workflow with docker but I like to let the specialists specialize, and so I just use docker (.10.1 even), because all the extra stuff is just making it bloated.

I'm testing newer versions on a case by case basis, but nothing new has come out that makes me want to upgrade yet.

And I'll probably keep using docker as long as it stays separate from all the cruft being added to the ecosystem.


The Docker team is doing more than anyone to move container technology forward, but orchestration is a much harder problem to solve than wrapping OS APIs. I wish they would stick to the core, and let others like the Kubernetes team handle the orchestration pieces. Swarm is hard to take seriously right now. I'm not sure bundling it into the core was the best way to handle it.


I disagree. Kube, Mesos, etc. are all great, I'm sure. I had Kube documentation open in the background for about a week or two while doing other things, trying to get a handle on installation, deployment, and how it would fit my use cases... I kept wanting to dip in, but kept getting pounded by tons of heavy dependencies I would have had to learn. After Swarm Mode was released I had a cluster deploying things locally within 2 hours of reading the documentation. From there I was able to map all of our use cases on top (rolling updates, load balancing, etc.). I totally get the "hard to take seriously" perspective, given some of the early glitching, but Kube, Mesos, etc. needed this egg on their faces to realize how easy installation and getting things up and running should be. If nothing else, that alone will make those other products better.


I think we should stop falling for Docker's marketing push to manage containers at large scale. They want to get a grip on the corporate market and catch up with Kubernetes, at the cost of their containerization quality. Unfortunately, with their current strategy they are losing credibility as a containerization tool AND as the container orchestration tool they are trying to become. Using it solely as a containerization tool, with k8s/ECS for the heavy lifting, is the sensible way to go as of today, IMO.


Here are some alternatives to Docker:

One can use Ubuntu LXD, which provides Linux containers built on top of LXC but with ZFS as the storage backend. LXD can also run Docker containers. http://www.ubuntu.com/cloud/lxd

One can also use Linux containers via Kubernetes by Google. http://kubernetes.io/


I think the underlying issue here is that no two people agree on what "Docker" is. Is it the CLI? Is it Docker Machine? Is it Docker Swarm?

The container part of Docker works well. And they've ridden that hype wave to try to run a lot of other pieces of your infrastructure by writing another app and calling it Docker Something. Now everybody means a different subset when they say "Docker".


I would say that my experience with Docker has been fantastic. I run over 10 Ubuntu Trusty instances on EC2 as 8G instances, mounted with NFS4 to EFS. This makes it super simple to manage data across multiple hosts. From that you can run as many containers as you like, and either mount them to the EFS folder, or just spawn them with data-containers, then export backups regularly with something like duplicity.

I use Rancher with it, and it's dead simple using rancher-compose/docker-compose.

For a quick run-down see: https://github.com/forktheweb/amazon-docker-devops


> Each version of the CLI is incompatible with the last version of the CLI.

I'm pretty sure that as long as the CLI version is >= the server version, you can set the DOCKER_API_VERSION env var to the server's API version and everything works.

I haven't used this extensively, so maybe there are edge cases or some minimal supported version of backwards compatibility?


I wonder how much suffering would be alleviated in most mid-level IT organizations if they just used Joyent/SmartDataCenter, and I say this as a FreeBSD developer with no affiliation.


A lot.

zones in SmartOS provide full-blown UNIX servers running at the speed of the bare metal, but in complete isolation (no need for hacks like "runC", containers, or any other such nonsense).

Packaging one's software into OS packages provides for repeatability: after the packages are installed in a zone, a ZFS image with a little bit of metadata can be created, and imported into a local or remote image repository with imgadm(1M).

https://wiki.smartos.org/display/DOC/Managing+Images
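
The workflow is roughly the following (a loose sketch from memory - the exact imgadm arguments are in the wiki linked above, and the UUIDs and image name here are placeholders):

  # on the build host: turn a prepared, stopped zone into an image
  imgadm create <zone-uuid> name=myapp version=1.0.0
  # on any other host: pull the image from the configured source and verify it
  imgadm import <image-uuid>
  imgadm list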

That's it. It really is that simple.


Docker Swarm is definitely not production ready. Try to run any service that requires communication among nodes and you will agree. It works fine for web servers, but that is about it.

DC/OS is emerging as the go-to way to deploy docker containers at scale in complex service combinations. It 'just works' with one simple config per service.


Network partitions really do happen! They are often short, but if you can't recover from them, then you shouldn't call yourself a distributed system.

I am shocked at how fragile etcd is in this way. I was hoping docker swarm was better, but I'm not surprised (alas) to find out that it has the same problem.

I'm about ready to build my own solution, because I know a way to do it that will be really robust in the face of partitions (and it doesn't use Raft; you probably should not be using Raft, and I've seen lots of complaints about ZooKeeper too). I've done this before in other contexts, so I know how to make it work - but so have others, so why are people who don't know how to make it work reinventing the wheel all the time?


I'd love to hear more about your solution. Are you saying that you've created an algorithm distinct from paxos/raft/zab that's more robust?


Check out weave, it's great for service discovery. Zookeeper is much better than etcd imo.


amazon efs + duplicity p.o.t. snapshots -> s3 + docker = win


If you're truly running Docker in production, you're probably not affected by either of the issues taken with Docker here. No one would dare interact with a production cluster via the basic Docker CLI; instead you should be interacting with the orchestration technology, like ECS, Mesos or Kubernetes. We are running ECS, and we only interact with the Docker CLI to query specific containers or shut down specific containers that ECS has weirdly forgotten about.

It definitely sounds like Swarm is not ready but I wouldn't say this is representative of running Docker in production: instead you should be running one of the many battle tested cluster tools like ECS, Mesos or Kubernetes (or GCE).


Agreed. Wouldn't use Docker CLI for production purposes.

We use Cloud66 (disclaimer - not associated with them) to help with the deployment issues if any arise.

Also we don't store DB in containers.


Just read through most of the thread; there seems to be a very large disconnect between people who are happy with Docker and those who are not. Personally, I've been extremely happy with it. We have one product in production using pre-1.12 swarm (will be upgrading in the next couple months) and most of our dev -> uat environments are now fully Docker. It's been stable. On my personal projects I used Docker 1.12 and yes, after a few days things kerploded, but after upgrading to 1.12.1 things have been incredibly stable. For Node.js apps I have been able to use Docker service replicas instead of Node.js clustering, and have been very happy with the results.


it seems that none of the container frameworks are generally ready.

Take for example k8s - I just started exploring it as something we could move to. https://github.com/kubernetes/kubernetes/issues/24677 - logging of application logs is an unsolved problem.

And most of the proposals talk about creating yet another logger...rather than patching journald or whatever exists out there.

For those of you who are running k8s in production - how are you doing logging? Does everyone roll their own?


Logging is definitely not an unsolved problem with K8s. It's trivial to set up Fluentd to snarf all container logs into some destination, such as Graylog, ElasticSearch or a simple centralized file system location.

The Github issue talks a lot about "out of the box" setup via kube-up. If you're not using kube-up (and I wouldn't recommend using it for production setups), the problem is rather simple.

The logging story could be better, but it's not "unsolved".


Thanks for that clarification. We are trying to get our feet wet by building a small 2 node setup.

We don't want to stream our logs or setup fluent, etc. I just want to make sure my logs are captured and periodically rotated. Now, the newer docker allows me to use journald as logger (which means all logs are sent to journald on the HOST machine)... But I can't seem to figure out how to do this in k8s.
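
With plain Docker, the journald part is just the logging driver (a sketch; the image and container name are placeholders - the open question is exactly how to get k8s to set this up for you):

  # send this container's stdout/stderr to the host's journald
  docker run -d --log-driver=journald --name myapp my-image
  # read it back on the host
  journalctl CONTAINER_NAME=myapp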

Also, as an aside: for production deployment on AWS, what would you suggest? I was thinking that kube-up is what I should use.


Yes, journald would definitely be superior, conceptually, to just writing to plain files. I can't help you there, though, as I've never tried setting it up.

(As an aside, I don't understand why more Unix software doesn't consistently use syslog() — we already have a good, standard, abstract logging interface which can be implemented however you want on the host. The syslog protocol is awful, but the syslog() call should be reliable.)
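
(The same interface is even reachable from the shell - a trivial sketch, with a made-up tag:)

  logger -t myapp -p user.info "something happened"   # lands wherever the host's syslog/journald puts it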

As for kube-up: It's a nice way to bootstrap, but it's an opaque setup that doesn't give you full control over your boxes, or of upgrades. I believe it's idempotent, but I wouldn't trust it to my production environment. Personally, I set up K8s from scratch on AWS via Salt + Debian/Ubuntu packages from Kismatic, and I recommend this approach. I have a Github repo that I'm planning to make semi-generic and public. Email me if you're interested.


Check out kube-aws, a tool from CoreOS that will get you a proper CloudFormation/VPC set up in just a few minutes: https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws....

If you're interested to see more about what it does behind the scenes, it is very similar to the full manual guide for setting up Kubernetes: https://coreos.com/kubernetes/docs/latest/getting-started.ht...


What's the k8s limitation here? You can do it however you like - e.g. make a fluentd DaemonSet, deploy it on both nodes, and have them slurp logs up to S3. There are tons of ways to handle the logs, and k8s can either help or stay out of the way.

https://github.com/fabric8io/fluent-plugin-kubernetes_metada...


> it seems that none of the container frameworks are generally ready.

None of the container frameworks on Linux are ready, but that doesn't mean that's the case in general.


Logging of application logs is, generally, an unsolved problem everywhere, but I feel like k8s does better than most: fluentd is the default, and then logs are collected by whatever end system you like best that's appropriate for your needs (e.g. GKE -> fluentd -> Cloud Logging, and I've heard of K8S -> fluentd -> ELK).


Since most people use k8s in the context of GKE, there seems to be some magic that Google has put in.

But if you read the bug, this doesn't work quite so well in the wild. In fact, an entire spec is being written on it, and right now the debate is whether to forward to journald and have it proxy further, or to reimplement a new logging format.


Well, there are definitely systemd people pushing for journald, that's for sure. The first comment talks about how it doesn't fit, though. I wonder what in particular about the solution doesn't work for those people.


We've tried to use Docker a couple of times, but it was always much more trouble than it was worth. There was always some edge case that caused it to not work as expected on one developer machine or another.

After about 2 years of giving it another shot on and off, we just gave up. And it's not like we were doing something crazy, just a typical documented "run this rails app" type thing. I would definitely not use this in production for anything based on my experience.


Were you using Linux developer machines? Docker, Inc. provide ready-made installers for Windows and Mac - the Docker ToolBox makes setting up on those OSes very easy these days. I haven't tried the new products that replace ToolBox yet.

I can imagine it being harder on some Linux distributions because your host system will have to have all of the kernel bits etc. Ubuntu 14.04 and above or current versions of Fedora should be good - the distributions explicitly support Docker, and Docker, Inc. provide packages so that you can run either the distribution packages or the Docker, Inc. ones.


We are all on Macs.


I have the same experience. I am trying to set up a new node where I accidentally installed 1.10, and the CLI does not work. When I look on the internet for how to do X with Docker, the only articles available are for previous versions. I mean, seriously, command-line development is not supposed to be this hard: pick a set of switches and stick to them unless something major forces you to change. If you shuffle CLI switches around between minor releases, nobody is going to be happy.


Would love to hear thoughts from people who use it in production.


We're using Docker pretty extensively at Segment with over 1,200 containers running in production with ECS (total count, not uniqueness).

We haven't touched Swarm and are staying a couple versions behind. With that said, we're pretty happy with the setup right now.


Airtime is also using ECS to run docker containers in production. We ran into a few issues early on due to early adopter status, but the AWS team was very quick to add many features we needed.

I have no complaints about ECS now, and in fact I'd say that ECS solves a lot of the problems that the author complains about here. I'd say that Swarm is not production ready, but Docker itself as an engine for running a container is fine. You just need to build your own orchestration layer, or use something like ECS.

More details on Airtime's docker setup here: https://techblog.airtime.com/microservice-continuous-integra...


Maybe things have changed since Apr 2015, but I found ECS pretty limited, and K8S solved those limitations. One big issue was container-to-container addressing. It was not available last year when I put an ECS cluster into production. Has it been added since?


Direct container to container communication isn't the recommended pattern when using ECS. Instead you let ECS associate with either a legacy Elastic Load Balancer, or one of the new Application Load Balancers, which can direct requests to different classes of containers based on the request path, allowing you to take a set of containers of different types and meld their http resources together into one API on one domain.

If one container needs to talk to another container of a certain type it just uses the load balancer, and the load balancer takes care of directing the request to the other container which may be on the same machine, or another machine, or maybe even in a different cluster.

Obviously this forces you into a fully stateless design, because the load balancer round-robins all the requests across random containers, but it's very failure-resistant and gives you high resiliency.


That doesn't necessarily give you resiliency or failure-proofing. It's recommended that way because there is no other way with ECS. The reason is that, unlike K8S, ECS does not inherently have service discovery. K8S solved it well with its notion of services and has a built-in load balancer. The design is very solid and doesn't require you to pay for a set of ELB instances every time you want to do that. That is why I asked if ECS has added this. It sounds like it has not.


Yeah, ECS is pretty much just a dumb layer on top of Docker that handles two things: placing your containers onto instances (and restarting them or replacing them if needed), and adjusting the settings on an ELB or ALB so that the load balancer is pointed at an instance running the container.

The ECS approach is a pretty minimal way to glue together a cluster of instances running Docker and ELB, while Kubernetes has a lot more baked into the core implementation.


I've also used ECS for a production system, and have found it to be an excellent runtime for docker containers.


I used it in production with CoreOS and now Kubernetes. I almost never touch the Docker CLI and haven't tried Swarm. I think a lot of developers mistakenly think they can get to a production environment in a few Docker commands which is definitely not possible today.


From an ops side, that's worrying.

Yes, in an absolutely ideal world no one should ever touch a production environment, and ideally every time you find yourself having to touch a production system you should create at the very least a backlog item to figure out how to avoid doing it in the future. But the reality is that some kinds of troubleshooting are hard to do without having access. Generally the kinds of situations where you need that access are the ones where you want it the quickest and easiest, i.e. the world is burning somehow.


OP wasn't talking about access, but about the effort needed to set up ("get to") a production environment. Of course Kubernetes/Docker gives you full access to your production environment.


I've posted about this before, but my work (non IT Department at a State Govt agency) has been using Docker in production for about a year. We are small by HN standards (~12 apps on 5 hosts that are a mix of APIs, front ends, and several more complex GIS apps). Moving to Docker was a massive improvement for us in terms of deployment and development. Deploying with Docker is ridiculously easy (no more taking a node out of HAProxy for an update, rollbacks are painless, and we love that what you deploy is what you get - there is no question over what the container is actually doing since you define it in the Dockerfile). I can't remember the last time we had a bad deployment or even cared about making a change to production. It's that reliable. It's also been a massive improvement for QA and development because of the ability to standardize the container using the Dockerfile.

We've been using CentOS for the production hosts and CentOS/OSX for developing on. There were a couple of minor issues when we started but the benefits outweighed the negatives then. It's been completely painless since Docker came out of beta. We have some containers that are deployed daily and others that run for months without being touched.

I've never quite understood the negative attitude towards Docker. Perhaps we are too small to see many of the issues that people often complain about, but moving to Docker was a great decision in our office.


Are you installing docker out of the CentOS repositories, or from Docker Inc's repository? The CentOS version is based on the work Red Hat does to backport patches and create more production ready versions of Docker, so that might be why you are having a better experience.


We are indeed using the CentOS repositories. The current version on our production servers is 1.10.3.


Unless you're using it to compartmentalize CI jobs on Travis or Jenkins, don't use it in production (do use it in development for standing up local dev environments quickly). We do, and it's a wasted layer of abstraction.

I predict the Docker fad will end in 2-3 years, fingers crossed.


> I predict the Docker fad will end in 2-3 years, fingers crossed.

I work for what is, ostensibly, a competitor and I don't agree.

Containers are here to stay. Docker will pretty much have a brand lock on that for a long time.


> do use it in development for standing up local dev environments quickly

You can do this, for sure, but, tbh, I'd still rather use Vagrant. I'm already provisioning things like my datastores using Chef (and not deploying those inside of Docker in production); I might as well repurpose the same code to build a dev environment.

I do agree regarding using containers (I use rkt rather than Docker for a few reasons) for CI being a really good jumping-off point, though.


Vagrant is pretty heavy, but if you're already using Chef, makes sense. If you're deploying using Docker, then mirroring your environment in dev makes more sense.

I've used both Vagrant and Docker for dev, and find Docker to be super fast and light. (though I've definitely run into some headaches with Docker Compose that I didn't see in using just Docker)


I think it's probably not worth worrying about "fast" when "slow" is "ten minutes once a week, maybe, if you're really destructive" and similarly worrying about "light" when you have 16GB of RAM and a quad-core processor; the heaviest dev environment/IDE I can find (IntelliJ and a couple hundred Chrome tabs) doesn't use eight of it. It comes off as super-premature optimization to me, and those optimizations that you would otherwise do in Docker (which are just building base images, whatever, Packer does this trivially) exist in a virtualized environment too.

Running datastores in Docker is bonkers in the first place, which still leaves a significant place for an actual CM tool--so unless you are a particular strain of masochist, when you are replicating your local environment, you'll probably need it anyway.


> ten minutes once a week, maybe, if you're really destructive

If you do a lot of active machine development, where you're iterating on that machine setup, it's an extremely slow process. Perhaps it would make sense to iterate in Docker, and then convert that process to Vagrant once you finish.

> similarly worrying about "light" when you have 16GB of RAM and a quad-core processor

It's a concern when you're launching many machines at once to develop against. Docker tends to only eat what it needs. Also, while I think everyone should have a 16GB machine, there's quite a few developers out there running 8 or 4GB machines.


> I predict the Docker fad will end in 2-3 years, fingers crossed.

Containers are definitely here to stay. Google, Heroku, and others have been using them for years.

Obviously Docker is just a subset, and it might fade, but I think it's pretty well established as a defacto standard.


Containers have been around for more than a decade as lxc. I'm saying Docker itself. It's hard to make money when you're running on an open standard that anyone can build tooling around.


I think it's a pretty excellent layer of abstraction, as it was in the form of jails/zones. Even if we stuck to single-node computing I think it'd be a powerful thing, but viewing a whole cluster of computers as a single compute resource is incredible.

I think docker itself may fade but I view its popularity as jails finally catching on rather than a new bit of hype that's going to die down soon.


maybe not end, but scale down to realistic usage scenarios.


I use it in production by way of kubernetes. It's been great so far.


We use it via GKE in prod. Zero complaints so far.


Docker not being ready for primetime and Swarm not being ready for primetime are two different things no? As for the cli compatibility issues, don't most people use an orchestration tool like ansible/chef/puppet etc to manage their fleet? I'm not sure I think the title of the post is accurate.


> don't most people use an orchestration tool like ansible/chef/puppet etc to manage their fleet?

I don't. All my configuration management is done with OS packages. And once you start making OS packages to do mass-scale configuration, the whole premise of Docker becomes pointless. If I have OS packages, and I can provision them using KickStart, or JumpStart, or imgadm(1M) + vmadm(1M), without using or needing any code, then what exactly do I need Docker for?


We have had Docker in production for about a year for our ENTIRE INFRASTRUCTURE, handling about 500 million requests per month across 20 services, including JVMs and distributed systems. So far it has helped us save so much time and money that I wouldn't consider going back for a minute...


How is that relevant? The author didn't complain about the performance - the complaint was about API stability. Which I agree with. Docker is really terrible at supporting their own APIs.


Fair enough! There is also a performance penalty, and annoyingly complicated setups in certain cloud environments.


Can anyone using Swarm seriously in production post any account of their experiences here? Thanks.


I doubt you'll get any replies. We tried using swarm in production, but could not ever get a reliable, working component. Failures of the worst kind would happen - networking would work 1/10 times meaning that a build could've conceivably gotten through CI.


"Just imagine if someone emailed you a PDF or an Excel file and you didn't have the exact verion of the PDF reader or Excel that would open that file. Your head would blow clean off."

Obviously this person has never spent a good deal of time dealing with AutoCAD...


Or more to the point early versions of the MS Office suite, where major updates could and did create backwards-incompatible saves. Users had to be careful that files intended for broader distribution were saved in a sufficiently old/compatible format version.

Which is to say: this is a sign that the involved technologies are still rapidly maturing.


or excel!


We've been running in production across thousands of containers for long over a year now and it's been fantastic, not only a life saver of a deployment method but it's allowed for reliable, repeatable application builds.


Creating and maintaining a service cluster is hard. I don't think you should just take it back to the store if your magic wand has a hiccup.


When I worked for my ex-employer, we used Docker-based Elastic Beanstalk to serve millions of requests per second across three services.


You can set the CLI via environment variables to use newer clients with older machines:

  export DOCKER_API_VERSION=1.23

You can export and import machines with this handy node js tool: https://www.npmjs.com/package/machine-share


I agree in general, and find Docker one of those technologies that does not (yet) live up to the hype.


Docker's far from ready for primetime. I'm sure everyone likes taking down their site to upgrade Docker to the latest release.

Their mindset is very tool driven; if there's a problem let me just write a new tool to do that.

Ease of use or KISS isn't a part of their philosophy.


For the versions issue, check out dvm: https://github.com/getcarina/dvm

While versions are an issue, it's at least a reasonable way to work around it.


Aye, there are a lot of important bugs that are just sitting there.


I watched another explanation of what Docker really is; it just seems to be this awesome thing that solves the very hard problem of inter-compatibility between systems. I always tend to question how and why a developer had to use Docker instead of making choices that would avoid it.

It's not surprising Docker can't always work, but it's nice to see that programmers are winning. I guess future OS designers and developers will try to encourage more inter-compatibility if possible. That really touches a big nerve.


Is lxc any better? Are any of the issues in OP solved in lxc vs Docker?


Do they speak English, or is the caller expected to know French?



yes, thanks


Make no mistake: Docker Inc has a lot of excellent engineers.

But there's also a landrush going on. Everyone has worked out that owning building blocks isn't where the money is. The money is in the platform. Businesses don't want to pay people to assemble a snowflake. They want a turnkey.

CoreOS, Docker and Red Hat are in the mix. So too my employers, Pivotal, who are (disclosure) the majority donors of engineering for Cloud Foundry. IBM is also betting on Cloud Foundry with BlueMix, GE with Predix, HPE with Helion and SAP with HANA Cloud Platform.

You're probably sick of me turning up in threads like these, resembling one of the beardier desert prophets, constantly muttering "Cloud Foundry, Cloud Foundry".

It's because we've already built the platform. I feel like Mugatu pointing this out over and over. We're there! No need to wait!

A distributed, in-place upgradeable, 12-factor oriented, containerising, log-draining, routing platform. The intellectual lovechild of Google and Heroku. Basically, it's like someone installed all the cool things (Docker, Kubernetes, some sort of distributed log system, a router, a service injection framework) for you. Done. Dusted. You can just focus on the apps and services you're building, because that's usually what end users actually care about.

And we know it works really well. We know of stable 10k app instance scale production instances that are running right now. That's real app instances, by the way: fully loaded, fully staged, fully logging, fully routed, fully service injected, fully distributed across execution cells. Real, critical-path business workloads. Front-page-of-the-WSJ-if-it-dies workloads.

Our next stretch goal is to benchmark 250k real app instances. If you need more than 250,000 copies of your apps and services running, then you probably have more engineers than we do. Though I guess you could probably stretch to running two copies of Cloud Foundry, if you really had to.

OK, a big downside: BOSH is the deployment and upgrade system. It's not very approachable, in the way that an Abrams main battle tank is less approachable than a Honda Civic (and for a similar reason). We're working on that.

The other downside: it's not sexy front-page-of-HN tech. We didn't use Docker, it didn't exist. Or Kubernetes, it didn't exist. We didn't use Terraform or CloudFormation ... they didn't exist.

Docker will get all this stuff right. I mean that sincerely. They've got too many smart engineers to fail to hammer it down. More to the point, Docker have an unassailable brand position. Not to be discounted. Microsoft regularly pulled off this kind of thing for decades and made mad, crazy cash money the whole way along.


@jacques_chester - after your last comment, I did check CF out. It is orders of magnitude harder than k8s for example.

To deploy CF on AWS, I need 20 instances; k8s can happen with a master and a slave. I know that you probably do much, much more... but it would be nice to gradually scale out the components.

Second... well, BOSH. I see that there are 13K commits to the BOSH repo, so it's probably set in stone anyway... but it would be nice to have something more standard (like Chef, Salt or Ansible; Ansible would be the closest thing to BOSH). These are extremely flexible tools... but far more accessible.

Third - the big one: Docker. I know that you have the Garden repository to run Docker... but I don't think it's production-ready.

Cloud Foundry is an incredible piece of technology. But the problem is not that it's unsexy... it's that it's inaccessible.


@sandGordon - Hey, I work for Pivotal and am the anchor of the Garden team.

Is there anything in particular that has you concerned about Garden? I see jacques mentioned we're moving (very nearly have moved!) everything over to a runC backend, which is certainly a big step in the right direction, but I'd love to get a feel for what else you feel we need to get right before you would consider it "production ready".

Feel free to reach me here, in the #garden channel on the Cloud Foundry slack, or any other medium that works for you. Thanks.


Hi @wlamartin - thanks for replying. It would be great if you could follow the (rather long) parallel thread on this topic; it should give you a fairly good idea of how I personally look at this.

I wouldn't generalize that this is the problem a small startup would always see... but I think it is fairly close.

I would say this comment (and its reply) captures the gist very accurately - https://news.ycombinator.com/item?id=12378554


I did read it, and believe I understood, I just wasn't sure whether to apply your criticisms of CF as a whole to Garden individually, or whether there were additional distinct technical issues that you had concerns about.

The good news for a small startup/lone developer/tinkerer is that we have some work coming up in our backlog to make that much easier, including a single downloadable binary with sensible pre-configured defaults, so people who don't want to wrestle with BOSH don't have to. (https://www.pivotaltracker.com/story/show/119174859)

While we grew to meet the containerization needs of CF first and foremost, I'd definitely love to see more projects like Concourse (https://concourse.ci) make use of Garden. Seeing first hand the dance they do to consume it, we definitely need to up our game if this is to be an attainable goal.

Thanks for your feedback!


Thanks for your frankness.

> I know that you probably do much much more.... but it will be nice to gradually scale out the components.

Agreed. There are intermediate options.

To try a full Cloud Foundry that fits into a laptop, you can install PCFDev[0].

As an intermediate position, you can install BOSH-lite in the cloud. You lose a major advantage of Cloud Foundry, though, which is that it's a distributed system which BOSH can keep alive.

Similarly, the default installation for Cloud Foundry uses 20 VMs. You can, if you wish, consolidate these to fewer. It's a matter of copying and pasting some YAML. But not recommended in the interests of robustness.

> but it would be nice to have something more standard (like Chef, Salt or Ansible. Ansible would be the closest thing to BOSH). These are extremely flexible tools... but far more accessible.

I have not used Ansible or Salt. I did spend time with cfengine, Puppet and Chef a few years back.

BOSH's concept of operations is a complete inversion of those tools. cfengine, Puppet or Chef take a system as it is and transform it, through declared goals, into the system you want it to be.

This actually means they have to deal with a great deal of incidental complexity: different distros, different versions of distros, different packaging tools and so on and so forth.

The BOSH paradigm is: wait, why are you trying to automate the construction of pets? You want cattle. So it starts with a fixed OS base image and installs packages in a fixed, non-negotiable way. At the low level, very inflexible. But it allows BOSH to focus on the business of installing, upgrading, repairing and reconfiguring distributed systems as single systems. The single distributed system is where BOSH starts, not where it was extended to. It's an important distinction.

> Docker. I know that you have the Garden repository to run Docker.... but I dont think its production ready.

Strictly, Garden can run Docker images. It's an API with driver backends. The original Garden-Linux backend is being replaced by one which uses runC instead. No point replicating engineering effort if we can share it with Docker.

> But the problem is not that it's unsexy... it's that it's inaccessible.

You're right. We're in the classic position of other heavy, robust tech. Hard to tinker with. I've been jawboning anyone inside Pivotal about this for years. We're making headway -- PCFDev is a start, bosh-bootloader is under active development and BOSH itself is going to get major love.

But compared to docker swarm or k8s, it's harder to set up an experimental instance.

[0] https://docs.pivotal.io/pcf-dev/


You consistently respond to this style of CF critique by enumerating alternatives and workarounds to the individual points. But you're missing the forest for the trees. CF is by its nature too complex! No amount of justification of the complexity will fix that; no number of afterthought scripts will make up for it. The future of this space is scaling 1-dev, 1-node systems up to production, not in asking devs to wrap their minds and laptops around intrinsically huge platforms like CF or OpenStack.

I'm not saying CF is dead, because I'm sure you've got lots of paying customers, and lots of interest in larger enterprises. But I think it's pretty obvious the way the winds are blowing, and I think the smart money is on Kubernetes.


By design, developers are not meant to have to care about how Cloud Foundry works. It's meant to be a magic carpet. You send your code and pow, it works.

You mention afterthought scripts -- how many blog posts do you read about people who've rolled their own 1/3rd of a platform on top of Docker + various other components? It's fun, it's easy to start, and pretty quickly you're married to a snowflake for which you carry all the engineering load.

It's not that I don't see the forest for the trees -- I do, internally I am constantly saying that developer mindshare predicts the 5-10 year performance of a technology and we suck at it.

But, at some level, we're talking about different forests.


> By design, developers are not meant to have to care about how Cloud Foundry works.

Yes, that's exactly the problem. At small scale -- which is where the vast majority of the market is, to be clear -- there is no difference between "the person who is responsible for the platform" and "the person who deploys code onto the platform". Platforms need to optimize for making both of these tasks easy. Docker understood this from day 1 and are backfilling technical competence. Kubernetes figured this out a bit late but are now rushing to make the adminstrata easier. CF and OS seem stuck in maintaining the separation, which keeps them out of the running.

> how many blog posts do you read about people who've rolled their own 1/3rd of a platform on top of Docker + various other components? It's fun, it's easy to start, and pretty quickly you're married to a snowflake for which you carry all the engineering load.

And doing so is still easier than bearing the cognitive burden of learning CF!


I agree that being swallowed up from the bottom is our main risk. Which is why, above, I said I talk about this a lot internally.

> And doing so is still easier than bearing the cognitive burden of learning CF!

Right, the incremental cost seems small. A bit here, a bit there. And actually, it's a lot of fun to roll your own. You know it, you can push it in any direction and so on.

The flipside, and this is where Pivotal and other CF distributors are making money, is that it turns into a nightmare pretty quickly. A lot of big companies are perfectly happy to pay someone else to build their platform.

Before I transferred to Cloud R&D, I was in Pivotal Labs. I got to see several hand-rolled platforms. Some of them were terrible. Some were ingenious. And all of them were millstones. That experience is pretty much what turned me into the annoying pest I am now.

We're at an interesting moment. Right now, writing your own platform seems reasonable -- even attractive -- to lots of people. In maybe 5-10 years, it won't be a thing that many people do any more. These days, for most projects, most engineers don't consider first writing an OS, or a database, or a web server, or a programming language, or an ORM, or a graphics library and on and on. Those are all well-served. Once upon a time it was a competitive advantage to roll your own; these days it just doesn't enter anyone's thinking.

There will be between 1 and 3 platforms that everyone has settled on, with a handful of outliers that people tinker with. My expectation is that Kubernetes will slowly morph from an orchestrator and be progressively extended into a full platform. Docker are clearly doing likewise. I think Cloud Foundry will be one of the three, simply because it's the first and most mature full platform already available for big companies to deploy.

Doesn't mean we can't have some of our lunch eaten.


> The BOSH paradigm is: wait, why are you trying to automate the construction of pets? You want cattle. So it starts with a fixed OS base image and installs packages in a fixed, non-negotiable way. At the low level, very inflexible. But it allows BOSH to focus on the business of installing, upgrading, repairing and reconfiguring distributed systems as single systems. The single distributed system is where BOSH starts, not where it was extended to. It's an important distinction.

This is a problem compared to how we are reasoning about systems in the Docker world.

You either use an orchestration tool - which doesn't care what the base system is, and only cares whether it is discoverable/alive/dead/etc. - or you use a construction tool. The construction tool could be a Dockerfile, which you massage into shape for the particular OS/config you have, or you use Ansible/Salt to "transform" the system. For example, k8s uses Salt for orchestration and Docker for construction.

BOSH is a merger of both... and is therefore extremely, extremely opinionated and incestuous. If I may overstep the limits of my intelligence, I would say it is a product of the time before Docker was invented.

It is also very hard to work with as a consequence. Just wanted to let you know that.


> If I may overstep the limits of my intelligence - I would say, it is a product of times when Docker was not invented yet.

Well, it was, so that makes sense.

I actually said... a year ago now, I think?... that tools like Puppet and Chef would morph from "discipline my pets" into straight-up image builders. They're so much nicer than Dockerfiles -- the latter is severely constrained by the nature of stackable filesystems.

> It is also very hard to work with as a consequence. Just wanted to let you know that.

Out of curiosity, where do you see the seams to cut it open?


> tools like Puppet and Chef would morph from "discipline my pets" into straight up image builders

Maybe, but if you're running something like Kubernetes, which I genuinely think is the future, "images" stop having any sort of relevance to this problem space.

If you've ever written a dockerfile, as I'm sure you have, you'll know that the recipes for individual applications' setup become so simple that problems that Puppet, Chef etc. exist to solve just melt away. Most of my dockerfiles are just a RUN line or two. A tool like Puppet, much of which is architected around the issue of mutable state, would just get in the way when your build is immutable.

As for the host, containers don't care what kind of host environment they're being scheduled in, and so the recipes for building Kubernetes hosts also become a lot simpler.


I think Docker or runC is pretty much mandatory. Not because it is great tech or anything, but because of https://hub.docker.com/explore/

So BOSH needs to work with my Dockerfile. That's the first step.

Second, build a minimal cluster (1 node?) for me with my Dockerfile. All the cluster has to do is restart my application if it dies and help me deploy. And let me scale out gradually from there.
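
On a single node, the "restart it if it dies" half is already close to a one-liner with Docker's restart policy; a sketch with a made-up image name:

    # keeps the container running across crashes and daemon restarts;
    # a "deploy" is then: pull the new tag, stop the old container, run the new one
    docker run -d --restart=always --name myapp -p 8080:8080 example/myapp:latest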

Keep BOSH if you want ;)


FWIW, I totally bailed on writing a provisioning system once you clued me into BOSH. It's the right answer, as near as I can tell.

I don't currently use it, because I am in a single-cloud environment and CloudFormation/Chef make more sense (plus, as you note, it's not currently super friendly), but I fully expect, should the progress that I've been seeing continue, I'll be using it in the next couple years.


One of the things I like about K8S is that it serves as the building block for creating a PaaS. I built a small tool on top of K8S for a production run. Deis is doing something similar (and far more ambitious), though they are also creating packages of abstractions called Helm. K8S is flexible and extensible because it doesn't try to be a Heroku, more of a framework for building your own Heroku.

Is there value to be added in someone providing the higher-level pieces? I bet there is. The flip side is: why did Kubernetes gain so much traction so fast, in a way that Cloud Foundry hasn't?

And what if Deis and Helm overtake Cloud Foundry? (They dropped what they had in order to build everything back up from K8S.) Maybe there is something the CF folks are not seeing here.


> K8S is flexible and extensible because it doesn't try to be a Heroku, more of a framework for building your own Heroku.

Agreed. Cloud Foundry is not a "competitor" to Kubernetes, in that sense. Diego is the closer analogue to Kubernetes. The Cloud Foundry internals have become progressively more finely abstracted and teased apart to make it possible to swap components, so you never know.

That said, for those who want Docker + Kubernetes bundled into a turnkey PaaS, Red Hat went all-in on those two techs for OpenShift 3.


CF is a complete PaaS, isn't it? So it's not directly comparable to either Docker or Kubernetes.

From my perspective, CF seems far too large, monolithic (as in all-or-nothing) and opinionated, and honestly a bit heavy-handed and antiquated (the last two partly due to BOSH).

What's attractive about K8s in particular is the elegant, layered, modular, unopinionated design. K8s itself isn't simple, but it's quite small and lightweight, and its core principles are easy to understand; more importantly, it offers primitives without imposing a particular framework on anything (for example, it doesn't care about how the underlying host is configured or provisioned), which makes it a superb foundation for a higher-level PaaS, once someone designs that (maybe Deis).
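
Concretely, the primitives are small and composable - a Deployment keeps N replicas of a container running, a Service gives them a stable address - and you can get a feel for both in two commands (made-up image name):

    # creates a Deployment that keeps 3 replicas running
    kubectl run myapp --image=example/myapp:1.0 --replicas=3
    # creates a Service load-balancing across those replicas
    kubectl expose deployment myapp --port=80 --target-port=8080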


We've already released a PaaS built on top of Kubernetes at Red Hat, with the upstream named OpenShift Origin and the supported version named OpenShift Container Platform (https://www.openshift.com/container-platform/).

Our developers contribute a ton of work to the Kubernetes project, and pretty much everything that works on Kubernetes will also work on OpenShift. We do ship some more restrictive security settings by default (like not allowing containers to run as root), but those can easily be relaxed if you need to be more permissive.
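
For example, if an image genuinely needs to run as root, the usual knob (assuming the oc CLI and a made-up project name) is to grant the anyuid SCC to that project's service account:

    # relax the default "restricted" SCC for one project's default service account
    oc adm policy add-scc-to-user anyuid -z default -n myproject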


Docker is the new MongoDB. Let's just use FreeBSD jails + PostgreSQL.


+1 I got sick of Docker and switched to FreeBSD. Far better experience.


Jails and Tredly.com or SmartOS.

Anything else is busted by design. Enjoy your poop sandwich.


Docker: containers for hipsters who don't know how to deploy a simple Linux virtual server.


> Each version of the CLI is incompatible with the last version of the CLI.

Sure, but I don't think this is a show stopper. You can and should only carefully upgrade between versions of Docker (and other mission-critical software). The process is functionally identical to the process you'd use to perform a zero-downtime migration between versions of the Linux kernel -- bring up a new server with the version of Docker you want to use, start your services on that new server, stop them on the old server, shut down the old server, done.
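
Roughly, with made-up hostnames and image name:

    # new-host is already running the Docker version you're upgrading to
    ssh new-host 'docker run -d --name myapp -p 80:8080 example/myapp:1.4'
    # repoint the load balancer / DNS at new-host and let traffic drain, then:
    ssh old-host 'docker stop myapp && docker rm myapp'
    # finally decommission old-host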

I don't mean to suggest this is trivial, only to suggest that it is no more complicated than tasks that you/we are already expected to perform.


>Sure, but I don't think this is a show stopper. You can and should only carefully upgrade between versions of Docker (and other mission-critical software).

Whether I upgrade carefully or not, the Docker CLI should work with all versions of Docker.


I don't get why this is a sticking point. There's no reason to update the Docker CLI independent of the Docker server, and it would be silly to run two versions of Docker on the same host, so why would it matter?


I think you are missing the point. It is not about upgrading the production cluster; it's about erasing muscle memory and re-learning the docker CLI over and over. How often do you have to re-learn grep, awk, ls, ps, or other command-line tools? How mad would you be if those were developed the same way the Docker CLI is? I would definitely not be happy.



