> The scenario we want to avoid is that a faulty commit makes it to the main branch.
Close. The scenario we want to minimize is faulty code on the main branch. As your team grows and the number of commits goes up, it becomes a game of chance. Sooner or later something will get through. The more new teammates you have, the more often it will happen.
This is an inescapable cost of growth. The cost of promoting people to management. The cost of starting new projects. Occasionally you can avoid it as a cost of turnover, but you will have turnover at some point.
What matters most is how long the code is "broken" (including false positives) before it is identified, mitigated, and fully corrected. The amount of work you can do to keep these numbers relatively stable in the face of change is profound.
If you insist on no errors on master ever you will kill throughput. You will create situations where the only failures are big, which is neck deep in the philosophy that CI rejects: that problems are to be avoided instead of embraced and conquered.
> If you insist on no errors on master ever you will kill throughput.
There are a large number of automated tools which will help you prevent merging code that could break master: https://github.com/chdsbd/kodiak#prior-art--alternatives. The basic approach is to make a new branch from master, apply one or more commits on top of that branch, run the tests, and if the tests pass, merge those commits (with fast-forward) back onto master. This makes it very difficult to get broken commits onto master, as they have to pass the tests first. It is possible if you have a flaky test suite, but in my experience it happens very rarely, and is usually very easy to fix if something creeps in. In my experience, these tools speed up throughput, not slow it down, especially when you account for the disruption that merging broken code to master can cause.
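A minimal sketch of that queue step, assuming plain git plus a hypothetical `./run_tests.sh` entry point (this is not Kodiak's actual code, and a real tool would also serialize pushes and handle races):

    import subprocess

    def sh(*cmd):
        # Run a command, raising if it exits non-zero.
        subprocess.run(cmd, check=True)

    def try_to_land(feature_branch):
        sh("git", "fetch", "origin")
        # Re-apply the feature commits on top of the current master tip.
        sh("git", "checkout", "-B", "candidate", feature_branch)
        sh("git", "rebase", "origin/master")
        # Only a green candidate may become the new master.
        sh("./run_tests.sh")   # hypothetical test entry point
        # Fast-forward origin/master to the tested commits
        # (assumes this runner is the only thing pushing to master).
        sh("git", "push", "origin", "candidate:master")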
With merge trains, a merge request with a passing feature branch is placed in a queue, and tests are run against the combination of that branch and all the branches ahead of it in the queue. Since the combined run passes 95%+ of the time when the feature branch itself passes, this can speed up the number of merges you can get into master by 10x or more.
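A self-contained toy of the train idea (lists stand in for merged states and the pass rate is faked; real merge trains do the merging with git and run each candidate's pipeline in parallel):

    import random

    def build_train(master_state, queue):
        # Each entry is tested against master plus every branch ahead of it,
        # so the whole train can be tested in parallel (optimistically).
        train, base = [], master_state
        for branch in queue:
            candidate = base + [branch]   # toy stand-in for a real merge
            train.append((branch, candidate))
            base = candidate
        return train

    def land(train, tests_pass):
        landed = []
        for branch, candidate in train:
            if tests_pass(candidate):
                landed.append(branch)     # fast-forward master to this point
            else:
                # Failure: drop this branch; everything behind it has to be
                # re-tested against the new, shorter head of the train.
                break
        return landed

    # Toy usage: 95% of combined runs pass.
    train = build_train(["master"], ["mr1", "mr2", "mr3"])
    print(land(train, lambda candidate: random.random() < 0.95))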
> If you insist on no errors on master ever you will kill throughput.
Unless you solve this engineering problem with tooling. At Uber, the full-blown mobile CI test suite takes over 30 minutes to run on a development machine (linting, unit tests, UI tests - most of this time being the long-running UI tests, specific to native mobile). So we only do incremental runs locally and have a submit queue, which parallelises this work and merges only the changes that don't break master. And we have one repository that hundreds of engineers work on.
It's not an easy problem and the solution is also rather complex, but it keeps master green - with the trade-off of having to build and maintain this system. See it discussed on HN a while ago: https://news.ycombinator.com/item?id=19692820
How do you handle situations like this: multiple developers add merge requests to the queue, and the changes they made conflict with each other (automatic rebase won't work). What happens when the first branch gets merged to master and the next 10 are still in the queue? How do you mitigate that to keep the development cycle short?
Let's just say in my company it also takes 30m to run tests and 4h to run them on the merge pipeline with FATs and CORE tests. It's way too long and severely cripples productivity.
A lot of the comments below touch on things we do (verifying that changesets are independent, breaking tests into smaller pieces, prioritising changes that are likely to succeed). They add up, and the approach does become more complex. We wrote an ACM white paper with more of the details[1]. It's the many edge cases and several optimisation problems that turn this into an interesting theoretical and practical problem.
Well, the first step is to optimize, parallelize, and refactor so you do not have a single process that takes hours, but many separate ones you can run at once on a cluster.
If those get too expensive to run, or you cannot speed them up, then you have to do what Chromium does: run them post-commit, then bisect and revert any changes that break the tests. If things are truly broken, you close the tree for a bit while you get the break reverted or fixed.
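In miniature, that post-commit flow is just a bisect plus a revert. A sketch assuming a local checkout and a hypothetical `./run_slow_tests.sh` script (Chromium's real sheriffing tooling is far more elaborate, and `git bisect run` can do the search part for you):

    import subprocess

    def tests_pass(sha):
        # Hypothetical: check out the commit and run the expensive suite.
        subprocess.run(["git", "checkout", "--quiet", sha], check=True)
        return subprocess.run(["./run_slow_tests.sh"]).returncode == 0

    def find_culprit(commits):
        # `commits` is ordered oldest-to-newest since the last known-good run.
        # Binary search for the first commit whose tree fails the tests
        # (assumes the breakage persists once introduced, as bisect does).
        lo, hi = 0, len(commits) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if tests_pass(commits[mid]):
                lo = mid + 1
            else:
                hi = mid
        return commits[lo]

    def revert_culprit(sha):
        subprocess.run(["git", "revert", "--no-edit", sha], check=True)
        subprocess.run(["git", "push", "origin", "HEAD:master"], check=True)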
Also, the system that is landing changes tests them optimistically in parallel, assuming they will all succeed, so it can land a change in only 30 minutes, for example.
What you describe is typically an architecture problem: if you have a good architecture in place, the problem won't happen, because you have already broken your system up so that the places 10 completely different developers all need to touch do not exist in the first place. You need to hire more senior developers to think about this problem and fix it. You should be able to assign every area of code to a small team of developers who work together and coordinate their changes to that area. (Even with common code ownership you quickly specialize, simply because on a large project you cannot understand everything.)
There are exceptions. Sometimes there is a management problem: management has been told that some things cannot be done in parallel because the problem couldn't be mitigated in the architecture, and they failed to apply project management practices to ensure the developers worked serially.
Sometimes there is a team problem: the 10 developers have been placed on the same team to work on the same thing, and despite all that they still failed to coordinate among themselves to ensure that the changes happened in order.
The robot won’t merge a change in the queue if it can’t be merged or tests fail. The changeset would be left open and the developer notified to fix it.
The whole process assumes that multiple changes in the queue don't depend on each other; if they did, they should all be in the same changeset.
It assumes most do not, but it's entirely possible for someone to change a common library which makes several downstream changes wait. Even if there are no merge conflicts, if they affect the same tests, changes will have to wait.
I don't work at Uber but have similar problems at my job, and I'm quite convinced the problems you ask about are part of the 'not easy' in the OP comment. Maybe they can queue whole branches instead of single check-ins?
Tests are there to prove everything you thought of works correctly. They do nothing to find things you didn't think of. However with effort ($$$) you can become very creative in thinking about failure cases that you then test.
If you want to find problems you didn't think of, formal proofs are the only thing I have heard of. However, formal proofs only work if you can think of the right constraints (I forget the correct term), which isn't easy.
Note that the two are not substitutes for each other. While there is overlap, there are classes of errors that one alone will not catch. For most projects, though, it is more cost-effective to live with some bugs for a long time than to spend enough money on either of the above to find them ahead of time. Different projects have different needs (games vs medical devices...)
An alternative system includes two optimizations: 1) don't use a monorepo, and 2) don't run tests that have nothing to do with the code that changed. Both require a redesign of code structure, testing, and execution, but both remove the inherent limits of integration.
Nobody seems to talk about this and I don't know why. It would remove integration complexity and speed up testing. We do the same thing for CD and nobody seems to have a problem with it...
The queue-and-test systems Uber and Google use for their monorepos essentially do both of those. The restructuring you mention was to use a build system such as Bazel or Buck universally.
1) Two changes which don't affect intersecting parts of the repo are landed separately. Similar to having infinitely many separate repos.
2) Only the tests that your code affects are run.
This is all possible because Bazel lets you look at a commit and determine with certainty which tests need to run and which targets are affected.
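For example, the selection step can be expressed with `bazel query`'s reverse-dependency operator; the target labels below are placeholders, and this is a sketch of the idea rather than either company's production tooling:

    import subprocess

    def affected_tests(changed_targets):
        # rdeps() walks the build graph backwards from the changed targets;
        # kind(test, ...) keeps only test rules (cc_test, py_test, ...).
        expr = "kind(test, rdeps(//..., set({})))".format(" ".join(changed_targets))
        out = subprocess.run(
            ["bazel", "query", expr],
            check=True, capture_output=True, text=True,
        ).stdout
        return out.splitlines()

    # e.g. affected_tests(["//libs/logging:logging"]) returns the labels
    # to hand to `bazel test`.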
That is good to hear, but I'm interested in finding the patterns that makes this feasible without a build tool designed for massively parallel building, testing, and integration of a single codebase. A lot of the historic reasons for Google's build system come down to "we just like the monorepo but it needs complex tooling to work".
For a complex project you need complex tooling. A mono-repo and a multi-repo system have different needs, but both need complex tooling to work. Neither is inherently better than the other; there are pros and cons, and sometimes those are compelling (which is why a few projects at Google are not part of the mono-repo).
Personally, I prefer the pros and cons of multi-repo. However, sometimes I wish I could do the large cross-project refactoring that a mono-repo would make easy.
It's possible (and not that hard) to define an integration process that prevents faulty commits from being integrated to the main branch.
> If you insist on no errors on master ever you will kill throughput.
Not sure why you believe this. It hasn't been my experience; just the opposite, in fact. By using CI in conjunction with a process that prevents errors on master, everything goes more smoothly, because people don't get stalled by the broken master.
The healthy mentality is to realize mistakes will happen. This creates a healthier culture when things do break.
However, you should take every step to ensure it doesn't happen. You should act as though you want to prevent all faults from hitting your master branch.
I have been using Azure DevOps for the past couple of years and they have kind of nailed it. You have Builds that do most of the CI, and Releases, which can be fine-tuned to do complex deployments and complete the CD story. Then you can set them as a requirement for pull request approval to the main branch, which helps to guarantee a healthy trunk.
I don't agree that CI is a team problem and CD is an engineering problem. If you are following infrastructure-as-code principles, it is everyone's problem, because if you don't add how your new feature should be deployed, it will break the CI and CD pipelines and you won't be able to merge it.
Also using Azure DevOps and it is indeed very well structured.
As for CI/CD differences: how many commits can actually affect code and infrastructure? I think this is part of the engineering problem at the end of the day.
We recently moved from Jenkins to CircleCI, and while the PR experience has improved dramatically (no queueing, faster builds), the _release_ process is far worse.
The reason seems to be that CircleCI just treats CD as CI. In reality doing CD requires high correctness, care, and nuance.
For example... with CircleCI there's no way to ensure that you release your code in the correct order other than to manually wait to merge your code until the previous code has gone out. That's not _continuous_. This is a very basic requirement.
So perhaps they are not the CD service they pitch themselves as? Then deploys are manually triggered instead? Nope, there is no way to manually trigger a build.
I wish this were an isolated example, but I've yet to see a CI/CD service that is easy to build fast, correct deployments with. Jenkins is correct but not fast or easy, Circle is fast but not correct, and most others I've used are none of these at all.
Few things are as aggravating to me as people who say they understand CD but then manage to avoid practicing any of the tenets of CI.
Automated builds are the smallest part of CI. Necessary, but drastically insufficient. If that's all you're doing you've missed the forest for the trees.
> Few things are as aggravating to me as people who say they understand CD but then manage to avoid practicing any of the tenets of CI.
It is completely reasonable to utilise one without the other; not all projects are giant multi-author efforts trying to wrangle commits.
For instance, if you have lots of small projects being worked on independently in parallel, with no more than one or two authors on a repo at a time, CI is not going to be worth the investment... but CD still has its uses.
Small-scale CI is trivial to set up. A build script and an integration VM are all you need. If it's difficult, there are most likely hygiene factors in your codebase that are worth resolving.
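For example, at the small end the whole build script can be something like this (assuming a Python project with `ruff` and `pytest`; substitute whatever linter and test runner the codebase actually uses):

    import subprocess, sys

    STEPS = [
        ["git", "pull", "--ff-only"],   # integrate the latest trunk
        ["ruff", "check", "."],         # lint
        ["pytest", "-q"],               # unit/integration tests
    ]

    def main():
        for step in STEPS:
            print("::", " ".join(step))
            if subprocess.run(step).returncode != 0:
                sys.exit(1)             # any failing step marks the build red
        print("build green")

    if __name__ == "__main__":
        main()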
The title reminds me a lot of all those fatuous articles about the difference between statistics and machine learning. This one is alright though - I wonder how we got to lumping CI and CD together as is commonplace now.
In SRE and DevOps land we've mostly had arguments over continuous deployment vs continuous delivery, and have largely let feature engineers decide how they want to use the possible approaches and options available.
On our newer services there is a CircleCI pipeline which parallelises work and takes ~1-2 minutes on a branch, and maybe an extra minute at most on master - where it automatically deploys to production if the build is green.
If you make the choice to prioritise this from the start, it isn’t all that difficult.
Maybe for software it's reasonable (tests should be parallelized more as you go above a 5m build), but for infrastructure it's ridiculous. No IaaS API responds remotely quick enough to bring up whole environments from scratch that quickly.
Our entire test suite takes about six minutes to run; we run all our tests in CI with Capybara. Our CD pipeline runs the same tests but against Chrome, Firefox, Safari and Edge. It takes more than an hour.
I would say for me that's a pretty reasonable estimate for microservice-architecture applications and services; of course, large legacy monoliths take longer, but not more than, say, 15-20 minutes at most.
3m seems aggressive to do builds and spin up infrastructure for anything non-trivial.
Reading a bit closer, I see the author describes CI as a sanity check, "ensur[ing] the bare minimum" and doesn't consider deploying on every commit. Maybe 3-7m is more realistic then.
However, I'm slightly surprised by this definition of CI. According to Fowler [0], "Continuous Delivery is a software development discipline where you build software in such a way that the software can be released to production at any time. ... The key test is that a business sponsor could request that the current development version of the software can be deployed into production at a moment's notice." So having CI gates on the development version that are weaker than the release tests would not seem to be continuous delivery according to his definition.
We're currently releasing on every commit and our CI build (which implements continuous delivery) takes about 15m.
Which is nonsense, since CI is the practice of merging to master frequently, in a state that can be released if need be.
We won't understand it unless we distinguish the practice from the supporting tools that help us do it safely:
in this case, the practice is frequent merge to shared trunk, and the supporting tools are as many automated checks before and after that merge as can be done quickly.
A "CI build" of a branch is a tool to help you do CI, but unless you merge that branch when it's green, you're not _doing_ CI.
Misunderstanding this and doing "CI on a branch" means that you are mistaking the tool for the practice, and not doing the practice: by delaying integration, you will be accomplishing the opposite of CI.
Yeah, I totally agree with Fowler’s definition of CI more than I do this article.
In my case it wasn't a need to spin up infrastructure as much as it was just pulling a few container images and starting them. The longest CI builds were when you were, say, loading and indexing test data from a database (container) into ElasticSearch, etc., but overall moving images around and starting containers to build and test some Ruby / Python was usually around 1-3 minutes or thereabouts.
We have a large Java monolith application. Builds ran for 30 minutes. Then we said let's only run the unit tests and critical smoke tests. The build time went down to 7 minutes ... on 12 CPU and 32 GB of RAM build slaves :) There's always a way.
I've worked on C++ codebases where linking modules took over 10 minutes with a trivial change. Per config. Granted, that was with BFD, and I sped it up a good bit by switching to gold where we could... which meant our Windows MinGW builds still spent 10+ minutes per config, since those were stuck with BFD. But at least our Android and Linux builds were a bit faster!
But I like to touch common headers - say to document logging macros, and printf-format annotations to catch bugs, or maybe to optimize their codegen - and those logging macros are in our PCHes for good reason - so that still means rebuilding everything frequently. Which ties up a lot of build time (30-60 minutes per config per platform, and I typically have a lot of configs and a lot of platforms!)
The process of Continuous Integration is independent of any tool.
This is one of my pet peeves, people using the term CI referring to the tooling. For me this alone invalidates anything they have to say about the subject.