
It's interesting that their entire release architecture seems to be focused on never pushing bad things out to production. Given their traffic, they could probably push things out much sooner (minutes after they're committed) to small parts of their overall traffic, and slowly increase the traffic on those pieces of code as they prove themselves to be stable, or quickly revert them if they're not.

That would mean having a lot of versions of Facebook live at any one point, but as those parts prove themselves stable they'd gradually be rolled out to all of their traffic.
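
For what it's worth, the routing half of that isn't the hard part. Here's a minimal sketch of deterministic user bucketing - the build names and percentages are made up, not anything Facebook actually runs:

    import hashlib

    # Hypothetical rollout table: build id -> fraction of traffic it should get.
    # Names and numbers are illustrative only.
    ROLLOUTS = {"build-0803-r2": 0.05}  # start the new build at 5% of users

    def build_for_user(user_id, default_build="stable"):
        """Deterministically bucket users so the same user keeps seeing the
        same version while the rollout fraction is ramped up or pulled back."""
        for build, fraction in ROLLOUTS.items():
            digest = hashlib.sha256(f"{build}:{user_id}".encode()).hexdigest()
            bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
            if bucket < fraction:
                return build
        return default_build

    if __name__ == "__main__":
        print(build_for_user("user-12345"))

Because the bucketing is a hash rather than a random draw, raising the fraction only ever adds users to the new build instead of reshuffling everyone.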

One point that also wasn't covered is that as they're pushing things out, they only cherry-pick parts of their codebase depending on which engineers they have around. I wonder if they get a lot of hairy merge conflicts around release time because of that, and bugs in production resulting purely from those merge conflicts. Or worse, subtle bugs where change A goes out but was written against a function that was changed in change B, which isn't going out because the author of change B isn't in today.




"they could probably push things out much sooner (minutes after they're committed) to small parts of their overall traffic, and slowly increase the traffic on those pieces of code as they prove themselves to be stable, or quickly revert them if they're not."

The risk to user data is way too high.

This could have serious consequences. You could push client bugs that make erroneous API calls, or server-side bugs that cause data loss. Rolling back the code isn't enough to fix the damage: the user's data has reached a bad state that they never intended to reach. You could roll back the data of every person who hit the change, but that would undo all of their work. Or you could analyze the data and try to repair it, which might work, or might just move the user data into a different bad state.

Plus, bugs in the view of the site might not cause errors that pop up in your error console, since it's hard to write tests for "looks wrong." Obvious errors - "when I click on my profile picture my name disappears" - are caught by external people instead of internal people, which adds a level of indirection between a problem appearing and a fix being written.

That being said, there are great uses for gradual rollouts. The video mentions that they do this for mature features with Gatekeeper - the developer can conditionally enable a feature in production and see what it does.
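
For anyone who hasn't seen a system like that: the check usually boils down to a per-feature rule evaluated at render time. This is a toy sketch of the general idea - the feature name, fields, and function are invented for illustration, not the real Gatekeeper API:

    # Toy gatekeeper-style feature check; everything here is hypothetical.
    FEATURE_RULES = {
        "timeline_redesign": {
            "allowed_user_ids": {42},  # explicit whitelist (e.g. the developer)
            "all_employees": True,     # dogfood internally before the public
            "percent_of_users": 0,     # public ramp-up percentage, 0-100
        },
    }

    def feature_enabled(feature, user_id, is_employee):
        rule = FEATURE_RULES.get(feature)
        if rule is None:
            return False
        if user_id in rule["allowed_user_ids"]:
            return True
        if is_employee and rule["all_employees"]:
            return True
        return (user_id % 100) < rule["percent_of_users"]

    if __name__ == "__main__":
        # the new code path is guarded at render time instead of shipped separately
        print(feature_enabled("timeline_redesign", user_id=42, is_employee=False))  # True (whitelisted)

The nice property is that the risky code is already deployed everywhere; turning it off is a config change, not another push.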


This is correct, especially for a company under as much government privacy scrutiny as Facebook. An erroneous push that exposes private user data could lead to a very heavy fine.


Cherry-picking based on which engineers are around is mostly about our daily releases. Everything checked into trunk on Sunday will go out on Tuesday. But if I've requested a diff be merged for the Wednesday release, it won't happen unless I've told request_bot that I'm around to support my changes. This also means that if there are merge conflicts, the engineers who wrote the patches will be there to help resolve them.


I haven't watched the video and I've been up all night, so forgive me if I'm contradicting the video. I'm probably wrong and the video's probably right. At least as of 2009, you are correct and that is how things were pushed. Code would have to be reviewed before it was pushed, but the push happened in stages, and chuckr and others would monitor its progress and revert commits that were found to be broken as they went out. Errors were monitored and correlated to sets of patches, and would be investigated in real time. There was the usual weekly push for typical changes, a daily push for important changes, and unscheduled pushes for critical/very urgent changes.
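
In other words, something closer to a staged push with monitoring than continuous deployment. A toy version of that loop - the tier names, thresholds, and helper functions are hypothetical stand-ins, not Facebook's actual tooling:

    import time

    TIERS = ["employees", "2% of traffic", "half of traffic", "everyone"]
    ERROR_BUDGET = 1.5  # tolerate up to 1.5x the pre-push error rate

    def deploy_to(tier, build):
        print(f"pushing {build} to {tier}")  # stand-in for the real deploy step

    def current_error_rate(tier):
        return 1.0  # stand-in for reading error dashboards/logs

    def staged_push(build):
        baseline = current_error_rate("everyone")
        for tier in TIERS:
            deploy_to(tier, build)
            time.sleep(1)  # in reality: soak long enough for errors to surface
            if current_error_rate(tier) > baseline * ERROR_BUDGET:
                print(f"error rate spiked at {tier}; reverting the suspect commits")
                return False
        return True

    if __name__ == "__main__":
        staged_push("r1234")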

Unless you mean to suggest continuously integrating developer commits to trunk into the live branch, in which case, no, that's a horrible idea. Not every bug manifests itself that quickly.

As for merging, my memory is pretty hazy, but I believe pushing and merging went hand in hand, and conflicts usually meant someone was working on the same code as someone else without communicating. Code was often documented with its owner. The code review utility at the time would (I think) take that✝, and (I think) run a blame and automatically CC those people on the code review, so it could be caught before it went out.

✝ Unless that was just done by convention so you know who to ask about a bit of code you might need to revise. Sorry my memory is unreliable on that bit.
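
The blame-and-CC part is easy enough to approximate with today's tools. A rough git-based sketch of the idea - not whatever Facebook's internal review tool actually did, and the file path here is hypothetical:

    import subprocess

    def reviewers_to_cc(path, start_line, end_line):
        """Blame the lines a diff touches and CC whoever last edited them."""
        out = subprocess.run(
            ["git", "blame", "--line-porcelain", f"-L{start_line},{end_line}", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return {
            line[len("author-mail "):].strip("<>")
            for line in out.splitlines()
            if line.startswith("author-mail ")
        }

    if __name__ == "__main__":
        # hypothetical file and line range for a patch under review
        print(reviewers_to_cc("www/profile.php", 10, 40))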


You have to remember that Facebook is a compiled binary (HipHop). They ended up with the release cycle they did, rather than true continuous integration, because the original compilation times for this binary were extreme - pretty sure Chuck mentioned something in the range of a 1 GB binary. They've since cut the compile time and can push within 15 minutes of a trunk merge, but it's still not on the scale of a few lines here, a few lines there.



