This is technical recruiting done right: a lucid walkthrough of a hard problem, complete with links to the implementation. It must have taken at least a few weeks of engineer-time to write; GitHub is awesome for making this public.
I wish they had talked a little more about the tradeoff they made. They mentioned that splitting packfiles by fork was space-prohibitive, but they ended up with a solution that must take more space than the original layout. (If the new heuristic refuses to use some objects as delta bases, some options that would have provided the best compression are no longer available to git.)
The performance win is incredible; how much space did they give up in the process?
The fork-aware packfiles are bigger, but not by as much as you might think. We give up the "best" delta that git can find, but usually there is another one in the same group that is almost as good.
I didn't have any numbers on hand, so I just repacked our git/git network.git with and without the delta-aware code. A stock repack is about 300MB, and the delta-aware one is 400MB.
That sounds like a lot, but the real alternative is not sharing objects at all, which is more like 350GB.
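To put those figures in perspective, here is a quick back-of-the-envelope check using only the numbers quoted above (the variable names are mine):

```python
# Rough comparison of the three options mentioned above, using the
# approximate sizes quoted: 300MB stock repack, 400MB fork-aware repack,
# 350GB if forks shared no objects at all.

stock_repack_mb = 300        # shared objects, best deltas git can find
fork_aware_mb = 400          # fork-aware (delta-restricted) packing
no_sharing_gb = 350          # every fork storing its own objects

overhead = fork_aware_mb / stock_repack_mb            # ~1.33x bigger
blowup = (no_sharing_gb * 1024) / stock_repack_mb     # ~1195x bigger

print(f"fork-aware overhead vs stock: {overhead:.2f}x")
print(f"no-sharing blowup vs stock:  {blowup:.0f}x")
```

So the fork-aware packing costs roughly a third more disk, while the "real alternative" of not sharing objects would cost three orders of magnitude more.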
This is an actually interesting problem that takes skill and knowledge to solve, and it produces immense business value too (at least for GitHub). This is the kind of stuff more software engineers should be working on. I'm dying over here working on a small React JS component and some CSS layout tinkering; my brain's got graph traversal algorithms and bitmap indexes on its mind and can't do anything about it.
Every large problem is a series of small problems. I've worked on large complex systems and on small toy problems. It really isn't much different except for one thing. With toy problems you can get away with making a lot of mistakes because the interactions are simple. As the system becomes more and more complex, your execution as a programmer becomes more and more important. You have to think very deeply about the "why" and you have to be able to test your assumptions well.
If you have ever played the game of go, when you first start out the board is empty. You have to place your stone somewhere, but actually it doesn't really matter where. Over time, as you place more and more stones, your choices become more and more important (and ironically, you have more and more constraints). The weird thing is that, even though it doesn't matter where you place those first stones, it becomes very important where the stones were originally placed as the situation unfolds.
Programming is similar. When you first start a project, it really doesn't matter what you do. Almost everything will work to one degree or another. But the original decisions gain more and more weight as the project becomes more and more complex (and you are faced with more and more compromises). Eventually those original decisions can make or break your project, even though it didn't matter at first what they were (of course, this is why refactoring is so important --- but that's a different discussion).
My point (finally) is, that even though you may be making simple changes on simple systems, it doesn't have to stop you from understanding the implications of your work should the project become larger. Take the opportunity to polish your skills and make your "opening game" as perfect as you can make it.
I agree that at some point every programmer must start working on complex systems in order to grow. If you are at that point and your employer does not offer complex problems, then maybe it is time to move. However, don't neglect your "opening game". It is very, very important because, as I said at the beginning, every large problem is a series of small problems.
> we're sending few objects, all from the tip of the repository, and these objects will usually be delta'ed against older objects that won't be sent. Therefore, Git tries to find new delta bases for these objects.
Why is this the case? Git can send thin packs if the receiver already has the objects, so why does it still need to find a full base to diff against? (Not counting when the initial base objects are from another fork -- I don't know if that's often the case.)
On top of that, as far as I understood from the discussion about heuristics (https://git.kernel.org/cgit/git/git.git/tree/Documentation/t...), it seems like the latest objects are stored in full and the earlier objects are diffed against them (double benefit: you usually want access to the latest object, which is already full, and earlier objects tend to only remove stuff, not add it, because "stuff grows over time"). So if objects are already stored as packs, things should already be in pretty good shape to be sent as-is... or not?
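The chain direction described there can be illustrated with a toy model (this is only a sketch of the idea; real git uses a binary xdelta-like encoding, not this made-up cost function):

```python
# Toy model of the heuristic described above: the newest object is stored
# whole, and each older object is stored as a delta against the next-newer
# one. Since "stuff grows over time", older versions are often near-prefixes
# of newer ones, so reverse deltas are tiny.

def delta_size(base: bytes, target: bytes) -> int:
    """Crude stand-in for delta encoding cost: length of the literal
    tail after the common prefix, plus a small fixed header."""
    common = 0
    for a, b in zip(base, target):
        if a != b:
            break
        common += 1
    return len(target) - common + 8  # copy-prefix op + literal tail

history = [b"hello",                # oldest
           b"hello world",
           b"hello world again"]    # newest

cost_full = len(history[-1])        # newest stored in full
# each older version delta'ed against the next-newer one
cost_deltas = sum(delta_size(history[i + 1], history[i])
                  for i in range(len(history) - 1))

print(cost_full, cost_deltas)       # full copy of the tip, cheap history
```

Note how the frequently accessed tip needs no delta resolution at all, while each step back in history only pays a small header in this append-only case.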
Their initial test case was a full clone, in which case you can't really send a thin pack (the receiver has no objects yet).
The problem, as I understand it, was that when you requested a full clone of fork1/repo.git, they'd find all objects reachable from refs in that repo, but by default git would have generated deltas for those objects against objects from other forks. When it noticed those base objects were not going to be sent to the client, nor did the client already have them, git recovered by doing expensive delta matching across the objects it was sending, and without having traversed the graph beforehand its heuristics weren't working properly, so this took forever.
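That base-selection failure can be sketched roughly as follows (hypothetical object IDs and a plain dict standing in for a packfile; real git works on delta chains inside packs):

```python
# Sketch of the situation described above. Each packed object may be
# stored as a delta against a base object. When serving a clone of one
# fork, a stored delta is only reusable if its base is also being sent
# (or is already on the client); otherwise git must find a new base,
# which is the expensive recovery step.

pack = {
    # object_id: (size_if_sent_whole, delta_base or None)
    "a1": (1200, None),
    "b2": (40,  "a1"),
    "c3": (35,  "x9"),   # base "x9" belongs to a different fork
}
fork_objects = {"a1", "b2", "c3"}    # reachable from this fork's refs

reusable, must_redelta = [], []
for oid, (_, base) in pack.items():
    if oid not in fork_objects:
        continue                     # not part of this clone at all
    if base is None or base in fork_objects:
        reusable.append(oid)         # stored form reused verbatim
    else:
        must_redelta.append(oid)     # base won't be sent: find a new base

print(reusable)       # objects whose packed form can be streamed as-is
print(must_redelta)   # objects that trigger the expensive matching
```

The fix described in the article amounts to never creating deltas like "c3" against "x9" in the first place, so the re-deltification set stays empty when serving a single fork.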
There are second-order effects as well. Having a good tech blog that talks about this stuff gains the company mindshare amongst developers. If, as in the case of GitHub, the developers (etc.) are the customer, then the translation to marketshare is pretty straightforward.
When developers (etc.) are not the target audience, they still get warm fuzzies about the company, and it gains word of mouth: non-technical people who seek out a recommendation from their friends will hear about the company with the good blog. Usually in the form of "Oh yeah, there are a couple of alternatives, but the tech at $X is top notch, look hard at them."
Hm yeah, that makes sense. So it's not the case that "[there are] no real reasons" - but I still think it's awesome, and I learn a lot / see applications of stuff learned elsewhere in these posts.
Perhaps I'm a bit ignorant of git's storage and protocols, but what's the purpose of this initial count? Seems to me it's traversing the tree twice - once to count and once to send the objects across the network. So why not traverse the tree, sending objects that need to be sent and ignoring the ones that don't instead of counting?
You need to know which objects are reachable before you can delta compress them (or reuse a packed delta chain). I guess you could restrict compression antecedents to graph ancestors in your pack, but at the cost of not compressing similar siblings (and blobs are leaves).
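The counting phase the parent comment asks about amounts to a graph walk; a minimal sketch, using a toy object graph rather than real git objects:

```python
# Minimal sketch of the "Counting objects" phase discussed above: before
# git can build a pack, it walks the object graph from the requested refs
# to learn the full set of objects to send. Only then can it choose delta
# bases among them. Toy graph; IDs and shape are made up.

from collections import deque

# object_id -> ids it references (commits point at trees and parents,
# trees point at blobs and subtrees)
graph = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2":   ["blob_a", "blob_b"],
    "tree1":   ["blob_a"],
    "blob_a":  [],
    "blob_b":  [],
}

def count_objects(tips):
    """Breadth-first walk; returns every object reachable from the tips."""
    seen = set()
    queue = deque(tips)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue                 # shared objects counted once
        seen.add(oid)
        queue.extend(graph[oid])
    return seen

reachable = count_objects(["commit2"])
print(len(reachable))   # 6: two commits, two trees, two blobs
```

Only once this set is known can git decide which stored deltas are safe to reuse, which is why streaming objects "as you find them" doesn't work.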
The Azure team responsible for the Local Git implementation needs to read this fantastic article.
I've been putting up with 10-minute deploys due to precisely this issue of counting objects. It's slow because we don't use Local Git as our source-of-record repository (because commits initiate a deployment step), so every deploy involves a clean fetch into a new tmpdir.
At least now I know why our deploys are getting slower and slower.
I wonder why GitHub has a separate domain githubengineering.com for this blog instead of a subdomain like engineering.github.com.
I notice that there is an inactive user account called engineering. If a User Page were ever created by that account, it would be available at engineering.github.io.
GitHub Pages used to be served under github.com subdomains (they were switched after GitHub realized this was a same-origin security hole), and said username.github.com subdomains still redirect to github.io for backward compatibility.
It's not unheard of for GitHub to rename inactive accounts, though: most likely, they gave this its own domain for something like SEO purposes (as it's content marketing).
Interesting. I've heard that thrown around before, but I've never heard someone say the same thing about Python, even though the main Python implementation is C.
From my own experience with Ruby and MRI, there's quite a strong culture of "transparency of code". It's not unusual if you're really down into some "interesting" problem to delve first into your gems/frameworks, and eventually even into MRI itself. Being a PL polyglot pays off in spades in these cases, since you get a solid understanding of the underpinnings of everything.
Note that there's very rarely any need to go there, but the fact that it's both possible and culturally encouraged makes a huge difference for those of us who've needed it.
I gave a talk at a Ruby conference last week and even put a small amount of assembly on the screen. While many Rubyists aren't expert C programmers, they know enough to look at Ruby's source every once in a while to figure out what's going on.
"When you fork a repository on GitHub, we create a shallow copy of it. This copy has no objects of its own, but it has access to all the objects of an alternate ..."
So does this mean one could attack a GitHub repository by having a lot of shill accounts clone it and add random objects (possibly having a performance impact on the original)? I understand the engineering need for the use of alternates, but I wonder about the lowered degree of isolation.