
> the docker daemon was riddled with locks which resulted in frequent dead-locking scenarios under load as well as poor performance

I second this. We use Docker in a similar scenario for a distributed CI, so we spawn between 70k and 90k containers every day. Until very recently we were running 1.9 and saw a staggering 9% failure rate due to various Docker bugs.

It's getting better though: since we upgraded to 1.12 a few days ago we're down to a more manageable 4%, but I'd still consider this very unreliable for an infrastructure tool.

edit: my metrics were slightly flawed, we're down to 4% not 0.5%




You were likely seeing the bug that kept us from deploying 1.9, which was related to corruption of the bitmask that managed IP address allocation. We saw failure rates very similar to yours with that issue.


how is this acceptable?


You have to design for those failures. In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.

In the end, at this scale even with three or four nines of reliability you'd still have to deal with 80 or 8 failures every day. So we would have to be resilient to those crashes anyway.
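To put rough numbers on it (a quick Python sketch, assuming the ~80k containers a day mentioned above):

    # Expected daily failures at a given reliability level, assuming
    # roughly 80k container runs per day (figure from the comment above).
    containers_per_day = 80_000
    for nines in (2, 3, 4, 5):
        failure_rate = 10 ** -nines
        print(f"{nines} nines: ~{containers_per_day * failure_rate:g} failures/day")
    # 2 nines -> 800, 3 nines -> 80, 4 nines -> 8, 5 nines -> 0.8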

However, it's a lot of wasted compute that we'd love to get back. But even with those drawbacks, our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.

Now maybe another container technology is more reliable, but at this point our entire infrastructure works with Docker, and besides those warts it gives us other advantages that make the whole thing worth it. So we stick with the devil we know ¯\_(ツ)_/¯.


> In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.

You spawn 200 containers for one build‽ Egad, we really are at the end of days.

> But even with those drawbacks, our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.

Since containers are just isolated processes, wouldn't plain processes be just as fast (if not slightly faster), without requiring 200 containers for a single build?


> wouldn't plain processes be just as fast

The applications we test with this system have dependencies, both system packages and datastores. Containers allow us to isolate the test process together with all the dependent datastores (MySQL, Redis, Elasticsearch, etc.).

If we were to use regular processes we'd have to both ensure the environment is properly set up before running the tests, and fiddle with tons of port configuration so we can run 16 MySQLs and 16 Redises on the same host.
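To illustrate why this part is trivial with containers, here's a minimal sketch (hypothetical helper, not our actual code, assuming the stock mysql and redis images and the plain docker CLI): each job gets its own Docker network, so every MySQL and Redis sits on its default port without clashing with the 15 others on the host.

    import subprocess

    def start_job_datastores(job_id):
        # One network per job: containers inside it reach MySQL on 3306 and
        # Redis on 6379 by container name, with no host port juggling at all.
        network = f"ci-job-{job_id}"
        subprocess.run(["docker", "network", "create", network], check=True)
        subprocess.run(["docker", "run", "-d", "--net", network,
                        "--name", f"mysql-{job_id}",
                        "-e", "MYSQL_ALLOW_EMPTY_PASSWORD=yes", "mysql:5.7"],
                       check=True)
        subprocess.run(["docker", "run", "-d", "--net", network,
                        "--name", f"redis-{job_id}", "redis:3.2"],
                       check=True)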

See my other comment for more details: https://news.ycombinator.com/item?id=12366824


CI can just recover from these errors by retrying/restarting containers.
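Roughly like this, as a sketch (hypothetical helper, not the parent's actual code):

    import subprocess

    def run_with_retries(image, command, attempts=3):
        # Retry the whole container run; most transient daemon errors
        # disappear on the next attempt. --rm asks the daemon to remove
        # the container once it exits.
        for _ in range(attempts):
            result = subprocess.run(["docker", "run", "--rm", image] + command)
            if result.returncode == 0:
                return True
        return False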


Not from Dead containers (the ones that failed their post-shutdown cleanup).


"move fast and do'break shit" philosophy.


Where do you run the CI containers? AWS?


Yes, on a pool of c4.8xlarge EC2 instances with up to 16 containers per instance.

But very few of our failures are attributable to AWS; restarting the Docker daemon "fixes" most of them.


For a newbie, what is the reason you didn't use a hosted CI like Travis CI?


Initially we were using a hosted CI (which I won't name), but it had tons of problems we couldn't fix, and we were up against the wall in terms of performance.

To put it simply, when you run a distributed CI your total build time is:

    setup_time + (test_run_time / parallelism)
So when you have a very large test suite, you can speed up the `test_run_time` part by increasing the parallelism, but the `setup_time` is a fixed cost you can't parallelize.

By `setup_time` I mean installing dependencies, preparing the DB schema, and similar things. On our old hosted CI, we would easily end up with jobs spending 6 or 7 minutes setting up, and then 8 or 9 minutes actually running tests.
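To make the formula concrete, here's a quick sketch with purely illustrative numbers in that same ballpark:

    # setup_time + (test_run_time / parallelism), with illustrative numbers:
    setup_time = 7        # minutes of setup paid by every job
    test_run_time = 120   # total minutes of tests to distribute
    for parallelism in (1, 8, 16, 64):
        total = setup_time + test_run_time / parallelism
        print(f"parallelism={parallelism:>2}: {total:.1f} min")
    # 1 -> 127.0, 8 -> 22.0, 16 -> 14.5, 64 -> 8.9: past a certain point,
    # nearly all of the remaining time is the setup you can't parallelize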

Now with our own system, we are able to build and push a Docker image with the entirety of the CI environment in under 2 minutes, then all the jobs can pull and boot that image in 10-30 seconds and start running tests. So we were able both to make the setup faster and to centralize it, so that our workers can actually spend their time running tests and not pointlessly installing the same packages over and over again.

In the end, for pretty much the same price, we made our CI 2 to 3 times faster (there is a lot of variance) than the hosted one we were using before.

But all this is for our biggest applications; our small ones still use a hosted CI for now, as it's much lower maintenance for us, and I wouldn't recommend going through this unless CI speed becomes a bottleneck for your organization.


You didn't include the maintenance cost of managing your infrastructure and container platform, which you don't need to worry about with a hosted service.


Even with those costs it was still worth it. A couple of people maintaining the CI is nothing if you can make the builds of the 350 other developers twice as fast.

Also, it's not like hosted CI is maintenance-free: if you don't want it to be totally sluggish, you have to use some quite complex scripts and caching strategies that need to be maintained themselves.



