Counting Objects (githubengineering.com)
267 points by jssjr on Sept 22, 2015 | 37 comments



This is technical recruiting done right: a lucid walkthrough of a hard problem, complete with links to the implementation. It must have taken at least a few weeks of engineer-time to write; GitHub is awesome for making this public.

I wish they had talked a little more about the tradeoff they made. They mentioned that splitting packfiles by fork was space-prohibitive, but they ended up with a solution that must still take more space than before. (If the new heuristic refuses to use some objects as delta bases, some choices that would have provided the best compression are no longer available to git.)

The performance win is incredible; how much space did they give up in the process?


The fork-aware packfiles are bigger, but not by as much as you might think. We give up the "best" delta that git can find, but usually there is another one in the same group that is almost as good.

I didn't have any numbers on hand, so I just repacked our git/git network.git with and without the delta-aware code. A stock repack is about 300MB, and the delta-aware one is 400MB.

That sounds like a lot, but the real alternative is not sharing objects at all, which is more like 350GB.
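
To make the trade-off concrete, here's a toy sketch (made-up object names and delta costs, nothing like git's real sliding-window delta search): restricting delta bases to the same fork group can miss the globally cheapest delta, but there is usually a nearby version inside the group that still beats storing the object whole by a wide margin.

    #include <stdio.h>

    /* Toy model of fork-aware delta selection. delta_cost[i][j] is a made-up
       "bytes to store object i as a delta against object j"; git's real delta
       search scores candidates in a sliding window instead. */
    #define N 4
    #define FULL_SIZE 1000            /* pretend each object is ~1000 bytes stored whole */

    struct obj {
        const char *name;
        int group;                    /* which fork's group the object belongs to */
    };

    static const struct obj objs[N] = {
        { "file v2 (fork1)", 1 },
        { "file v2 (fork2)", 2 },
        { "file v1 (fork1)", 1 },
        { "file v1 (fork2)", 2 },
    };

    static const int delta_cost[N][N] = {
        {  0,  2, 15, 16 },           /* v2@fork1: its twin in fork2 is nearly free */
        {  2,  0, 16, 15 },
        { 15, 16,  0,  2 },
        { 16, 15,  2,  0 },
    };

    /* Cheapest delta base for object i, optionally restricted to its own group. */
    static int best_base(int i, int same_group_only)
    {
        int best = -1;
        for (int j = 0; j < N; j++) {
            if (j == i)
                continue;
            if (same_group_only && objs[j].group != objs[i].group)
                continue;
            if (best < 0 || delta_cost[i][j] < delta_cost[i][best])
                best = j;
        }
        return best;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            int g = best_base(i, 0);  /* any object may serve as base    */
            int f = best_base(i, 1);  /* only objects from the same fork */
            printf("%-16s  global: %s (%d)  fork-local: %s (%d)  whole: %d\n",
                   objs[i].name, objs[g].name, delta_cost[i][g],
                   objs[f].name, delta_cost[i][f], FULL_SIZE);
        }
        return 0;
    }

In this toy, the fork-local base costs 15 bytes instead of 2, but both are far cheaper than the ~1000 bytes of storing the file whole -- which is roughly why the fork-aware pack grows from 300MB to 400MB rather than toward the 350GB of not sharing at all.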


This is an actually interesting problem that takes skill and knowledge to solve, and it produces immense business value too (at least for GitHub). This is the kind of stuff more software engineers should be working on. I'm dying over here working on a small React JS component and some CSS layout tinkering; my brain's got graph traversal algorithms and bitmap indexes on its mind and can't do anything about it.


Every large problem is a series of small problems. I've worked on large complex systems and on small toy problems. It really isn't much different except for one thing. With toy problems you can get away with making a lot of mistakes because the interactions are simple. As the system becomes more and more complex, your execution as a programmer becomes more and more important. You have to think very deeply about the "why" and you have to be able to test your assumptions well.

If you have ever played the game of go, when you first start out the board is empty. You have to place your stone somewhere, but actually it doesn't really matter where. Over time, as you place more and more stones, your choices become more and more important (and ironically, you have more and more constraints). The weird thing is that, even though it doesn't matter where you place those first stones, it becomes very important where the stones were originally placed as the situation unfolds.

Programming is similar. When you first start a project, it really doesn't matter what you do. Almost everything will work to one degree or another. But the original decisions gain more and more weight as the project becomes more and more complex (and you are faced with more and more compromises). Eventually those original decisions can make or break your project, even though it didn't matter at first what they were (of course, this is why refactoring is so important --- but that's a different discussion).

My point (finally) is that even though you may be making simple changes on simple systems, it doesn't have to stop you from understanding the implications of your work should the project become larger. Take the opportunity to polish your skills and make your "opening game" as perfect as you can make it.

I agree that at some point every programmer must start working on complex systems in order to grow. If you are at that point and your employer does not offer complex problems, then maybe it is time to move. However, don't neglect your "opening game". It is very, very important because, as I said at the beginning, every large problem is a series of small problems.


I only recently started playing Go, and the similarity that you point out is fantastic.

Take the opportunity to polish your skills and make your "opening game" as perfect as you can make it

Yep, exactly what I've been doing.

If you are at that point and your employer does not offer complex problems, then maybe it is time to move.

Yep ;-)


Not specific to the post, but:

> we're sending few objects, all from the tip of the repository, and these objects will usually be delta'ed against older objects that won't be sent. Therefore, Git tries to find new delta bases for these objects.

Why is this the case? Git can send thin packs if the receiver already has the objects, so why does it still need to find a full base to diff against? (Not counting when the initial base objects are from another fork -- I don't know if that's often the case.)

On top of that, as far as I understood from the discussion about heuristics (https://git.kernel.org/cgit/git/git.git/tree/Documentation/t...), it seems like the latest objects are stored full and the earlier objects are diffed against them (double benefit: you usually want access to the latest object, which is already full, and earlier objects tend to only remove stuff, not add, because "stuff grows over time"). So if objects are already stored as packs, things should be in pretty good shape to be sent as-is... or not?


Their initial test case was a full clone, in which case you can't really send a thin pack.

The problem, as I understand it, was that when you requested a full clone of fork1/repo.git, they'd find all objects reachable from refs in that repo, but git would by default have deltas for those objects that referred to objects from other forks. When it noticed those base objects were not going to be sent to the client, nor did the client know about them, git recovered by doing expensive matching across the objects that it was sending, and since it hadn't traversed the graph beforehand, its heuristics weren't working properly, so this took forever.
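
In rough pseudo-git terms (hypothetical struct and field names, not git's actual code), the per-object decision while writing the pack looks something like this -- the last branch is the expensive recovery path described above:

    #include <stdio.h>

    /* Hypothetical view of writing a pack. Each object may already be stored
       on disk as a delta against some base; we can reuse that delta only if
       the base will also be available on the receiving end. */
    #define N 5

    struct packed {
        const char *name;
        int on_disk_base;  /* index of the on-disk delta base, -1 if stored whole */
        int being_sent;    /* part of the pack we're building?                    */
        int client_has;    /* receiver already has it (the thin-pack case)?       */
    };

    int main(void)
    {
        struct packed objs[N] = {
            /* name                base  sent  client_has */
            { "tree A",              -1,    1,    0 },
            { "blob B",               0,    1,    0 },  /* base also sent: cheap delta reuse */
            { "blob C",               4,    1,    0 },  /* base lives only in another fork   */
            { "blob D",               4,    0,    0 },  /* not requested, never considered   */
            { "blob E (fork2)",      -1,    0,    0 },  /* reachable only from another fork  */
        };

        for (int i = 0; i < N; i++) {
            if (!objs[i].being_sent)
                continue;
            int base = objs[i].on_disk_base;
            if (base < 0) {
                printf("%-16s: send whole (or search for a brand new delta base)\n",
                       objs[i].name);
            } else if (objs[base].being_sent || objs[base].client_has) {
                printf("%-16s: reuse the on-disk delta against %s\n",
                       objs[i].name, objs[base].name);
            } else {
                /* The slow path: throw the delta away and go hunting for a new
                   base among the objects actually being sent. */
                printf("%-16s: base %s unavailable, recompute the delta\n",
                       objs[i].name, objs[base].name);
            }
        }
        return 0;
    }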


This is awesome. I absolutely love <company> engineering blogs like this.

There's no real reason for companies to educate external devs/hobbyists/students like this, but some do, and it's really awesome.


The reason is recruiting, and it's effective.


Because people read it and think "hey that sounds cool, I want to work there"?


Pretty much.

There are second order effects as well. Having a good tech blog that talks about this stuff gains the company mindshare amongst developers. If, as in the case of GitHub, the developers (etc.) are the customer, then the translation to market share is pretty straightforward.

When developers (etc.) are not the target audience, they still get warm fuzzies about the company and it gains word of mouth; non-tech people who seek out a recommendation from their friends will hear about the company with the good blog. Usually in the form of "Oh yeah, there are a couple of alternatives, but the tech at $X is top notch, look hard at them".


Hm yeah, that makes sense. So it's not the case that "[there are] no real reasons" - but I still think it's awesome, and I learn a lot / see the applications of stuff learned elsewhere in them.


Or, perhaps even more importantly, “they really seem to know what they're doing. I'm going to put my company's projects there”.

If you sell developer tools, this is _way_ more effective than an ad campaign because it's completely genuine.


Perhaps I'm a bit ignorant of git's storage and protocols, but what's the purpose of this initial count? Seems to me it's traversing the tree twice - once to count and once to send the objects across the network. So why not traverse the tree, sending objects that need to be sent and ignoring the ones that don't instead of counting?


You need to know which objects are reachable before you can delta compress them (or reuse a packed delta chain). I guess you could restrict compression antecedents to graph ancestors in your pack, but at the cost of not compressing similar siblings (and blobs are leaves).
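
As a minimal sketch of what that first "Counting objects" walk amounts to (a made-up object graph, not git's real data structures): it's a traversal from the requested tips that marks each object in a bitmap so it gets visited and counted exactly once -- roughly the work that the reachability bitmaps from the post let git avoid recomputing.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Made-up object graph: each object lists the ids it references (a commit
       pointing at its tree and parents, a tree at its blobs and subtrees).
       Not git's actual structs. */
    #define MAX_OBJECTS 64
    #define MAX_REFS 4

    struct object {
        int refs[MAX_REFS];           /* referenced object ids, -1 terminated */
    };

    /* One bit per object: already visited? */
    static uint8_t visited[(MAX_OBJECTS + 7) / 8];

    static int test_and_set(int id)
    {
        int was = visited[id / 8] & (1 << (id % 8));
        visited[id / 8] |= (1 << (id % 8));
        return was;
    }

    /* Count every object reachable from `tip`, touching each exactly once. */
    static int count_reachable(const struct object *objs, int tip)
    {
        int stack[MAX_OBJECTS * MAX_REFS], top = 0, count = 0;

        stack[top++] = tip;
        while (top > 0) {
            int id = stack[--top];
            if (test_and_set(id))
                continue;             /* already counted via another path */
            count++;
            for (int i = 0; i < MAX_REFS && objs[id].refs[i] >= 0; i++)
                stack[top++] = objs[id].refs[i];
        }
        return count;
    }

    int main(void)
    {
        /* Tiny graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}, 3 -> {} */
        struct object objs[MAX_OBJECTS];
        memset(objs, -1, sizeof(objs));
        objs[0].refs[0] = 1;
        objs[0].refs[1] = 2;
        objs[1].refs[0] = 3;
        objs[2].refs[0] = 3;

        printf("Counting objects: %d\n", count_reachable(objs, 0));
        return 0;
    }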


Hopefully this isn't too off topic, does anyone know what software was used to make the charts in this post?


Judging by the water colors I'm guessing Paper by FiftyThree.


The Azure team responsible for the Local Git implementation needs to read this fantastic article.

I've been putting up with 10-minute deploys due to precisely this issue of counting objects. It's slow because we don't use Local Git as our source-of-record repository (because commits initiate a deployment step), so every deploy involves a clean fetch into a new tmpdir.

At least now I know why our deploys are getting slower and slower.


I wonder why GitHub has a separate domain githubengineering.com for this blog instead of a subdomain like engineering.github.com.

I notice that there is an inactive user account called engineering. If a User Page were ever created by that account, it would be available at engineering.github.io.


GitHub Pages used to be served under github.com subdomains (they were switched after GitHub realized this was a same-origin security hole), and said username.github.com subdomains still redirect to github.io for backward compatibility.

It's not unheard of for GitHub to rename inactive accounts, though: most likely, they gave this its own domain for something like SEO purposes (as it's content marketing).


So does GitHub hire C engineers? Git is implemented in C.


Yes, much of GitHub's core infrastructure is written in C.


That's awesome. You don't find many people that enjoy writing in C. I assumed it was Ruby.


Ruby MRI and YARV are both written in C, so to be a true Ruby hacker you need some C chops. :-)


Interesting. I've heard that thrown around before, but I've never heard someone say the same thing about Python, even though the main Python implementation is written in C.


Really? Lots of the most famous Python modules are partially implemented in C(++) for performance reasons too; it's not exclusive to Ruby.

For example, Numpy is 56% C. https://github.com/numpy/numpy


From my own experience with Ruby and MRI, there's quite a strong culture of "transparency of code". It's not unusual if you're really down into some "interesting" problem to delve first into your gems/frameworks, and eventually even into MRI itself. Being a PL polyglot pays off in spades in these cases, since you get a solid understanding of the underpinnings of everything.

Note that there's very rarely any need to go there, but the fact that it's both possible and culturally encouraged makes a huge difference for those of us who've needed it.


From the Python docs:

https://docs.python.org/3/

Extending and Embedding tutorial for C/C++ programmers

Python/C API reference for C/C++ programmers

... right between "Distributing Python Modules" and "FAQs"
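
And a module written against that C API is small enough to fit in a comment. This is a toy sketch (the module name `demo` and everything in it are placeholders), but it's roughly the shape that the Extending and Embedding tutorial walks you through:

    #include <Python.h>

    /* square(n) -> n * n */
    static PyObject *square(PyObject *self, PyObject *args)
    {
        long n;
        (void)self;
        if (!PyArg_ParseTuple(args, "l", &n))
            return NULL;
        return PyLong_FromLong(n * n);
    }

    static PyMethodDef demo_methods[] = {
        { "square", square, METH_VARARGS, "Square an integer." },
        { NULL, NULL, 0, NULL }
    };

    static struct PyModuleDef demo_module = {
        PyModuleDef_HEAD_INIT, "demo", "Toy extension module.", -1, demo_methods
    };

    PyMODINIT_FUNC PyInit_demo(void)
    {
        return PyModule_Create(&demo_module);
    }

Build it as an extension module and `import demo; demo.square(7)` comes back with 49.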


I gave a talk at a Ruby conference last week and even put a small amount of assembly on the screen... While many Rubyists aren't expert C programmers, they know enough to look at Ruby's source every once in a while to figure out what's going on.


Yes. We have open positions for C engineers.


Look at the bottom of the blog post for commits involved in merging these changes upstream for Git 2.0. They are all from vmg.


"so we ported his C++ implementation to C to be backwards compatible"


The illustrations look like they were done in Paper. Does anyone know if that's right or not?



Could this technique be used to optimise graph traversal in a postgresql table? Or are recursive queries still the way to go?


Not only is the tech achievement amazing and something that will save countless man-hours, it's incredibly well written to boot.

Amazing work.


"When you fork a repository on GitHub, we create a shallow copy of it. This copy has no objects of its own, but it has access to all the objects of an alternate ..."

So does this mean one could attack a GitHub repository by having a lot of shill accounts clone it and add random objects (possibly having a performance impact on the original)? I understand the engineering need for the use of alternates, but I wonder about the lowered degree of isolation.



