Some Git internals

shadykiller · on Dec 25, 2020

Git is really simple at its core and fun to poke around in. I gave a talk at RubyConf India about poking at Git internals using Ruby.

It was fun. Link for anyone interested (not very high quality, but watchable):

https://www.youtube.com/watch?v=lPlwkxrG2NM

thaliaarchi · on Dec 26, 2020

A FUSE filesystem that fetches objects on demand seems like it would be great when working in a large repo such as the Linux Kernel or Chromium, especially when only a small subset of files are needed. It would also be useful to monitor the fs for file changes since `git status` scans the entire tree. Does anything like this exist?

nine_k · on Dec 27, 2020

Fetching files one by one on demand could be slow.

OTOH, Git natively supports sparse checkouts.

leosarev · on Dec 26, 2020

GVFS?

srathi · on Dec 26, 2020

Good intro.

Shameless plug: I recently picked up Golang and used it to implement a mini version of Git CLI using these internals [1].

[1] https://github.com/ssrathi/gogit

ufo · on Dec 25, 2020

A dumb question: if I make a commit changing just a single line in a large file, does this create a completely new object/blob with the entire contents of the file? Or does it store it in a more space-efficient manner?

glandium · on Dec 25, 2020

The former, until you run git gc, in which case the latter happens.
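
To make "the former" concrete: a blob's ID is the SHA-1 of a short header plus the full file contents, and the loose object on disk is that same data run through zlib. A minimal Python sketch, assuming a repository using the default SHA-1 object format; `blob_id` and `loose_object_bytes` are names made up for this illustration, not part of git:

```python
import hashlib
import zlib


def blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """zlib-compressed bytes as stored for a loose blob."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)


# A "large" file, and the same file with a single line changed.
old = b"".join(b"line %d\n" % i for i in range(10_000))
new = old.replace(b"line 5000\n", b"line 5000 (edited)\n")

print(blob_id(old) != blob_id(new))  # True: two distinct objects
# Each loose object holds the whole (compressed) file, not a diff.
print(len(loose_object_bytes(old)), len(loose_object_bytes(new)))
```

Both loose objects carry the complete file; the sharing between versions only appears once the objects are packed.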

ufo · on Dec 25, 2020

Wow! Running git gc just shrunk my .git folder from 18MB to only 3MB.

glandium · on Dec 25, 2020

git gc runs automatically when some threshold is met in the number of "loose" objects (objects that were created by e.g. git add and that haven't been packed yet). I don't remember what the threshold is, though.
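
For reference, the threshold is the `gc.auto` setting, which is documented as defaulting to 6700 loose objects. A rough way to count loose objects yourself, sketched in Python; it assumes the usual `.git/objects/<2 hex chars>/<38 hex chars>` layout of a SHA-1 repository, and `count_loose_objects` is just an illustrative name:

```python
import os
import re

# Loose objects live at objects/<2 hex chars>/<38 hex chars> in SHA-1 repos.
_FANOUT = re.compile(r"^[0-9a-f]{2}$")
_REST = re.compile(r"^[0-9a-f]{38}$")


def count_loose_objects(git_dir: str) -> int:
    """Count loose (unpacked) objects under <git_dir>/objects."""
    objects_dir = os.path.join(git_dir, "objects")
    total = 0
    for entry in os.listdir(objects_dir):
        if not _FANOUT.match(entry):
            continue  # skips "pack" and "info"
        subdir = os.path.join(objects_dir, entry)
        if not os.path.isdir(subdir):
            continue
        total += sum(1 for name in os.listdir(subdir) if _REST.match(name))
    return total


if __name__ == "__main__" and os.path.isdir(os.path.join(".git", "objects")):
    print(count_loose_objects(".git"))
```

`git count-objects -v` reports the same kind of figure without any scripting.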

u801e · on Dec 26, 2020

I believe loose objects are also packed when pushing to a remote or fetching from it. I wonder if the remote's .git folder is also about 3 MB in size.

glandium · on Dec 26, 2020

Indeed, the git protocol exchanges packs.

sitzkrieg · on Dec 25, 2020

Did it have binary data? If so, try moving it to Git LFS and running BFG Repo-Cleaner, and watch it shrink even more.

ufo · on Dec 26, 2020

No binary data. Just an ordinary git repo with a bunch of source code files (mostly C and Lua).

sitzkrieg · on Dec 26, 2020

That's even more impressive, really. I reckon it has a decent amount of history and was never pruned before.

gwelps · on Dec 26, 2020

Do you know if there's an easy way to do this without running git gc? Say by just supplying a diff or something?

In particular I'm thinking of the case where there's an append-only file keeping a log, to commit the changes every so often without having to make a blob copy of the file first (which may be relatively large)?

glandium · on Dec 26, 2020

There isn't one at the moment.

The closest would be to create a pack manually that contains a diff against the previous version but that'd require manual work.

It would be possible, for example, to modify git-fast-import to a) accept diffs as input and b) store those diffs (these are separate problems to deal with). The downside is that the more packs there are, the slower object lookup is, which can make everything much slower. Newer versions of git have cross-pack indexes (the multi-pack-index) to deal with that, but I don't think that's enabled by default.

Another option would be to add a new format for loose objects that allows storing diffs, but that has backwards compatibility implications.
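
For the append-only log case, the idea of storing "a diff against the previous version" can be sketched as a list of copy and insert instructions, which is conceptually how deltas in packs work. This Python sketch is only an illustration: it uses `difflib` to find matching ranges and plain tuples as the instruction format, neither of which is what git actually does, and `make_delta` / `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))  # (offset in base, length)
        elif j2 > j1:
            # "replace" or "insert": carry the new bytes literally
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are just not copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


log_v1 = b"boot ok\nuser login\n"
log_v2 = log_v1 + b"user logout\n"

delta = make_delta(log_v1, log_v2)
assert apply_delta(log_v1, delta) == log_v2
print(delta)
```

For a file that only grows at the end, the delta stays about as small as the appended text, which is why packing handles this case well even though each loose blob is a full copy.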

fiddlerwoaroof · on Dec 26, 2020

This is sort of a worst-case scenario for git: there are patch-based DVCSes that would handle this scenario better (darcs and pijul), but those have their own set of trade-offs (I think they end up being slower for large histories).

pmeunier · on Dec 26, 2020

Applying a patch in Darcs used to be in O(2^n), and is now apparently in O(n^2), where n is the size of history.

Applying a patch in Pijul is in O(p c log n), where p is the size of the patch and c the size of the largest "deletion-insertion conflict" p is involved in, where a "deletion-insertion conflict" is a situation where Alice deletes a block of text while Bob adds stuff in that same block.

Note that this is a rough bound, since all non-conflicting operations in a patch are in O(log n), except those involved in a "deletion-insertion conflict", which are in O(c log n).

So, Pijul is in fact faster than Git for merging (and rebasing). The only tradeoff at the moment is that going arbitrarily far back in history isn't as fast as it could be (this will be fixed very soon).

fiddlerwoaroof · on Dec 26, 2020

Interesting, I’ve been following these systems for a while, but I didn’t realize that pijul had solved the performance issues.

Is pijul’s on-disk format stable yet?

pmeunier · on Dec 26, 2020

> Is pijul’s on-disk format stable yet?

Probably. The patch format is very unlikely to change. The repository format may change a little bit still.

I'd say it's probably ok to try and learn it now, but you should maybe wait for a few weeks before using it for something serious. On the other hand, we use it for itself, and I use it personally for most of my projects.

globular-toast · on Dec 26, 2020

Eh? Why would git gc get rid of the old blob? It will still be referenced by the old commits.

glandium · on Dec 26, 2020

It doesn't. It packs it.

globular-toast · on Dec 26, 2020

Is git gc the only thing that can trigger packing?

auscompgeek · on Dec 26, 2020

Answered above: https://news.ycombinator.com/item?id=25540122

glandium · on Dec 26, 2020

There is also git pack-objects and git repack.

strogonoff · on Dec 25, 2020

Nice walkthrough! I did not know about FETCH_HEAD.

Another (complementary) way of getting an idea of how Git works is by reading the source of Isomorphic Git[0]. Being in JS and with fewer (as of now) features, it’s a bit more accessible than reference Git.

[0] E.g. https://github.com/isomorphic-git/isomorphic-git/blob/main/s...

duffmancd · on Dec 26, 2020

I really enjoyed running through Write Yourself A Git[0], which guides you through implementing a basic Git replacement in Python.

[0] https://wyag.thb.lt/

Waterluvian · on Dec 25, 2020

This was really really educational. Git internals seem rather simple.

Does anyone have a really good visual explanation of what's being done to the tree and such when commits and merges and rebases are done? I still don't quite grok it intimately enough.

divbzero · on Dec 25, 2020

The Git Internals section of the Pro Git book covers commits [1] and branches [2] well, though it doesn’t go into detail about merges and rebases.

[1]: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

[2]: https://git-scm.com/book/en/v2/Git-Internals-Git-References

edrobap · on Dec 26, 2020

Branching is more of an art. With the kind of flexibility git provides, one can get lost choosing a convention to follow. I have found this useful for a small team working on app development:

https://nvie.com/posts/a-successful-git-branching-model/

aequitas · on Dec 25, 2020

Ungit[0] shows what will happen to the tree before you execute a merge or rebase. It really helped me get the hang of Git.

[0] https://github.com/FredrikNoren/ungit

sransara · on Dec 26, 2020

I wrote myself a note about this last year to demystify the internal data structure [0], if you want to check it out. The TL;DR, with pointers straight to the rabbit holes: commits form a directed acyclic graph (DAG) [1], and each file tree snapshot is a fully persistent trie [2]. Each commit in the DAG points to a snapshot, and branches are named shortcuts to commits.

[0] https://sransara.com/notes/2019/build-yourself-a-git/

[1] https://en.wikipedia.org/wiki/Directed_acyclic_graph

[2] https://en.wikipedia.org/wiki/Persistent_data_structure
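
A toy version of that structure, as a hedged Python sketch: a dict stands in for `.git/objects`, trees use a simplified text encoding rather than git's real binary tree format, commits omit author and timestamp fields, and every helper name (`put`, `put_tree`, `put_commit`) is invented for the example.

```python
import hashlib

store = {}  # object ID -> bytes; stands in for .git/objects


def put(kind: str, body: bytes) -> str:
    """Store an object under the hash of its header plus body."""
    data = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = data
    return oid


def put_tree(entries: dict) -> str:
    """entries maps name -> (kind, oid). Simplified encoding, not git's."""
    lines = [f"{kind} {oid} {name}" for name, (kind, oid) in sorted(entries.items())]
    return put("tree", "\n".join(lines).encode())


def put_commit(tree: str, parents: list, message: str) -> str:
    """A commit names one root tree and zero or more parent commits."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", message]
    return put("commit", "\n".join(lines).encode())


# First snapshot: README at the top level, main.c under src/.
readme = put("blob", b"hello\n")
main_c = put("blob", b"int main(void) { return 0; }\n")
src1 = put_tree({"main.c": ("blob", main_c)})
root1 = put_tree({"README": ("blob", readme), "src": ("tree", src1)})
c1 = put_commit(root1, [], "initial")

# Second snapshot: only README changes, so the src/ tree is reused as-is.
readme2 = put("blob", b"hello, world\n")
root2 = put_tree({"README": ("blob", readme2), "src": ("tree", src1)})
c2 = put_commit(root2, [c1], "edit README")

branches = {"main": c2}  # a branch is just a name for a commit ID
print(len(store))  # 8 objects: 3 blobs, 3 trees, 2 commits
```

The second commit adds only a new blob, a new root tree, and the commit itself; the unchanged `src` subtree is shared between both snapshots, which is the "persistent" part.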

rustybolt · on Dec 25, 2020

learngitbranching.js.org is quite good

arunc · on Dec 26, 2020

The website looks terrible on Opera on Android.