Git’s database internals I: packed object store (github.blog)
216 points by todsacerdoti on Aug 29, 2022 | 64 comments



This article gives me whiplash: it starts at a high level, dives, stops just before it touches on the implementation details (and all the weird stuff which makes you realise the developers were human after all), then shoots back up again.

> Git does not currently have the capability to update a packfile in real time without shutting down concurrent reads from that file. Such a change could be possible, but it would require updating Git’s storage significantly. I think this is one area where a database expert could contribute to the Git project in really interesting ways.

I'm not sure it would be super useful, because of the delta-ification process. Adding "literal" (non-delta) files to a pack isn't much of a gain (file contents are zlib-compressed either way).


Also, presumably "shutting down concurrent reads from that file" is not much of a problem, because the file can just be unlinked and it will be GCed when the last reader finishes naturally. Other than hung processes, the only downsides are extra disk space and cache usage.


I have written a small python library to read info directly from the .git folder (without using the git executable): https://github.com/spapas/python-git-info/

Decoding the pack files was notoriously difficult, and there are probably still cases that I don't handle properly. The packfile is a really complex/messy format and on top of that it lacks any proper documentation!

This article explains a lot of the concepts very well but doesn't go into implementation details (where the big snakes are). Another great resource for unpacking packfiles is this post: https://codewords.recurse.com/issues/three/unpacking-git-pac... and of course reading the source code of other git clients.

Unfortunately it seems that the official packfile format documentation is the git source code :|


> The packfile is a really complex/messy format and on top of that it lacks any proper documentation!

Maybe it’s changed since you checked it, but I found the pack files to be documented, though nowhere near as well as index files (now that was a pleasure: the index file format is straightforward, pretty logical, and very well documented).

The problem I had with the pack file documentation is that it’s non-linear, so if you read it linearly as you implement it, you hit areas which turn out to be documented a few screens later. Furthermore it doesn’t necessarily define or spell out what it’s talking about, so it can take a while to realise that it has 3 (.5?) different formats of varints, or that the size it provides is for decompressed content and that it relies on zlib to discover the end of the compressed stream (and good luck to you if your implementation doesn’t expose that).
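
For the curious, two of those varint flavours look roughly like this (a Python sketch based on my reading of the format docs, not git's code; `buf` is the raw pack data and `pos` an index into it):

  def object_header(buf, pos):
      # Entry header varint: object type in bits 6-4 of the first byte,
      # size (of the *decompressed* content) in little-endian 7-bit groups,
      # with only 4 size bits in the first byte.
      byte = buf[pos]; pos += 1
      obj_type = (byte >> 4) & 0x7
      size, shift = byte & 0x0F, 4
      while byte & 0x80:
          byte = buf[pos]; pos += 1
          size |= (byte & 0x7F) << shift
          shift += 7
      return obj_type, size, pos

  def ofs_delta_offset(buf, pos):
      # OFS_DELTA base-offset varint: big-endian 7-bit groups, with an
      # extra +1 bias applied on every continuation byte.
      byte = buf[pos]; pos += 1
      offset = byte & 0x7F
      while byte & 0x80:
          byte = buf[pos]; pos += 1
          offset = ((offset + 1) << 7) | (byte & 0x7F)
      return offset, pos

The third flavour, used inside deltas for the base and result sizes, is the plain little-endian 7-bit one without the type bits.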

But in my experience, it’s really nothing compared to the documentation of the wire format. That leaves even more details out, some of the explanations are outright misleading (I spent hours convinced the v2 protocol was using an HTTP connection as a half-duplex socket and wondering how I would manage), and with TODOs covering half the protocol.


Yes. The wire format docs, especially for HTTP, are atrocious. The pack format is well documented enough. (https://github.com/git/git/tree/maint/Documentation/technica...)

I wrote the 9front implementation (https://git.9front.org/plan9front/plan9front/HEAD/sys/src/cm...) more or less entirely from the official documentation in the git repo. For debugging, GIT_TRACE_PACKET=1 is very useful for getting valid packet traces.

Repacking efficiently for serving is the most annoying part, and I still haven't implemented delta reuse.


FWIW git has recently reorganised its docs a lot, and moved some things to man pages.

The pack format docs were one of the documents affected, it now lives at https://github.com/git/git/blob/master/Documentation/gitform....


Yes, I saw this just yesterday; it seems like a great resource! I don't think this was available some years ago (at least I wasn't able to find it)...


The first commit containing the packfile docs is from 2006.


Can you please share the documentation you used to understand the pack file implementation? It would be very useful to me!


Git’s own: https://github.com/git/git/blob/master/Documentation/gitform...

I also used gitoxide’s source when I got unsure / lost about the nomenclature / varints.

I’d suggest giving it a pass unless you have to or are really interested though.


I stand corrected, these are much better than what I was able to find! Thank you


One of the things that surprises me about Git is that it doesn’t store diffs. It just stores the full content of every file every time it’s modified and committed.

I work on concurrent editing systems (using CRDTs). People always ask about how we deal with the problem that if we store changes forever, the filesize grows indefinitely. Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories. I think git supports shallow clones but I don’t know anyone who bothers with them.


It's a reminder that sometimes it's OK to ignore hypothetical problems.

"How does git deal with the problem that if we store changes forever, the file size grows indefinitely?"

Even though Git uses delta compression to reduce space usage, it still technically grows indefinitely, same as CRDTs. So the answer is: it doesn't, and it doesn't matter.

If the size of something actually becomes a problem, -then- it can be handled (e.g. Git LFS).

With CRDTs, depending on what you're doing, it might never become a problem. Keeping all versions of a document is entirely inconsequential, as an example. The size might grow indefinitely but how many edits is a document going to have, realistically?


I know what you mean, but — as the article explains [0] — Git’s packfile format uses delta compression, which is essentially “storing diffs”.

[0] https://github.blog/2022-08-29-gits-database-internals-i-pac...


> Git’s packfile format uses delta compression, which is essentially “storing diffs”.

FWIW it's closer to lossless compression, e.g. LZ77's behaviour: git's delta compression has two commands, LITERAL and COPY, and technically it doesn't need to copy linearly from the base object. It would be unlikely[0], but a delta could start by COPY-ing the end of the base object, then add literal data, then COPY the start of the base object (or COPY the end again, or whatever, you get the point).

[0] outside of handcrafting, as the packing process would need to find such a situation which it would estimate useful
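
To make that concrete, applying such a delta looks roughly like this (a Python sketch of the documented COPY/INSERT instruction stream, error handling omitted; not git's own code):

  def apply_delta(base, delta):
      pos = 0
      def varint():
          # plain little-endian 7-bit varint (base and target sizes)
          nonlocal pos
          value, shift = 0, 0
          while True:
              byte = delta[pos]; pos += 1
              value |= (byte & 0x7F) << shift
              shift += 7
              if not byte & 0x80:
                  return value
      src_size, dst_size = varint(), varint()
      assert src_size == len(base)
      out = bytearray()
      while pos < len(delta):
          opcode = delta[pos]; pos += 1
          if opcode & 0x80:                     # COPY from anywhere in the base
              offset = size = 0
              for i in range(4):                # up to 4 offset bytes, flag-selected
                  if opcode & (1 << i):
                      offset |= delta[pos] << (8 * i); pos += 1
              for i in range(3):                # up to 3 size bytes, flag-selected
                  if opcode & (1 << (4 + i)):
                      size |= delta[pos] << (8 * i); pos += 1
              out += base[offset:offset + (size or 0x10000)]  # size 0 means 64 KiB
          elif opcode:                          # INSERT the next `opcode` literal bytes
              out += delta[pos:pos + opcode]; pos += opcode
          else:
              raise ValueError("opcode 0 is reserved")
      assert len(out) == dst_size
      return bytes(out)

COPY takes an arbitrary offset into the base, which is why the out-of-order deltas described above are perfectly legal.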


Many years ago, I wrote some code that made pack files with out-of-order copy commands. (Just messing around in Go trying to learn how pack files worked.)

I was too lazy to implement the diffing algorithm, so I used the suffixarray from the go standard library and greedily grabbed the longest match at each point in the input. It's been a while, but I think I got a 12% improvement on my test case. (Probably wouldn't do that well in general.)

I assumed the git code just runs off of a diff (longest common substring), but I never checked.
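
The greedy approach is simple enough to sketch (toy Python, with a naive scan standing in for the suffix array; the real thing would also coalesce adjacent INSERTs):

  def greedy_delta(base, target, min_match=4):
      ops, i = [], 0
      while i < len(target):
          best_off, best_len = 0, 0
          # naive longest-match scan; the suffix array made this step fast
          for off in range(len(base)):
              n = 0
              while (off + n < len(base) and i + n < len(target)
                     and base[off + n] == target[i + n]):
                  n += 1
              if n > best_len:
                  best_off, best_len = off, n
          if best_len >= min_match:
              ops.append(("COPY", best_off, best_len))
              i += best_len
          else:
              ops.append(("INSERT", target[i:i + 1]))
              i += 1
      return ops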


In my mental model of git internals, I envision that the initial commit stores a full copy in the object store, and thereafter the states are captured as diffs. If a new file is added at a later commit, of course the "blob" will be stored in full, but subsequent commits will add deltas to it, not a full copy.


That's not the case.

The actual model of git internals is that every commit always stores full copies (of modified entries, unmodified entries are dedup'd through the magic of content addressing).

Then, as an optimisation, the packing process will perform delta compression between objects based on a bunch of heuristics. Until you pack you get full copies, and depending on how "hard" you pack you may or may not get huge savings.

Furthermore, Git's pack files are ordered, so depending on the selection it makes it can diff later files from earlier ones (your model) or the reverse, or even both at the same time.

You can see some of the early packing heuristics at https://github.com/git/git/blob/master/Documentation/technic...
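
The content-addressing part is simple enough to sketch: an object's ID is just a hash over a short header plus the full content, so identical blobs dedup to the same object for free. Roughly (Python, SHA-1 repos):

  import hashlib

  def blob_object_id(content):
      # ID of a blob = SHA-1 over "blob <size>\0" + content
      header = b"blob %d\x00" % len(content)
      return hashlib.sha1(header + content).hexdigest()

  # The loose object on disk is just zlib.compress(header + content),
  # stored under .git/objects/<first 2 hex chars>/<remaining 38>.
  print(blob_object_id(b"hello world\n"))
  # -> 3b18e512dba79e4c8300dd08aeb37f8e728b8dad, same as `git hash-object`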


> That's not the case.

As far as the actual internals are concerned, yes. But as a model it's not wrong. It's like epicycles aren't wrong as a model either. GP's mental model of Git is mine as well, and it works very well for me.


> But as a model it's not wrong.

As a model it's wrong, and useless, and possibly harmful.

Because the "states are captured as diffs" model does not tell you anything true that's useful. And at worst it gives you incorrect ideas of the performance characteristics of the system e.g. that small changes are bad, because lots of changes means lots of diffs to apply which is slow.

By forgetting that model, you're strictly less wrong, and no worse off when reasoning about git.

> It's like epicycles aren't wrong as a model either.

But they were. "Adding epicycles" has literally become a saying about trying to pile on more crap to try and (fail to) make an incorrect model work.

And epicycles were at least trying to keep a simple model working, the "snapshot model" is more complicated than git's actual behaviour to start with.


> As a model it's wrong, and useless, and possibly harmful.

It's no more wrong than epicycles. It's not useless either: it works very well for me. For users (as opposed to Git developers), it's hard to see how it could be harmful.


The harmful part comes from assumptions based on it.

Systems like CVS or Subversion indeed store diffs of files, both conceptually and storage-wise. This has notable consequences: operations spanning a large range of revisions all take long, as a bunch of diffs have to be applied in sequence. This then leads to a reluctance to make small commits, for the wrong reasons.

Wrong assumptions about the workings leading to wrong decisions is harmful.


> This has notable consequences: operations spanning a large range of revisions all take long, as a bunch of diffs have to be applied in sequence. This then leads to a reluctance to make small commits, for the wrong reasons.

This happens in Git as well though. If you have to rebase your commits across a few thousand upstream commits, you'll be in for some pain. (For example, I've a commit to PostgreSQL to add ALWAYS DEFERRED constraints, and I've had to rebase it across thousands of upstream commits because I've lacked the time to see that contribution across the finish line.)

You seem to be indirectly arguing for merge-based workflows. (At least that's what I'm reading between the lines. Perhaps my reading is wrong, but I'll run with it until you tell me otherwise.) However, a) merging is not always accepted by upstream, and b) merge-based workflows suck.

I grant that merging is very simple with Git's internals: Git does the three-way merge thing, you manually merge conflicts, you create a commit for the conflict-resolved checked out state, recording the two parents in that commit, and you're done. This works because Git simply stores the objects as they are, internally using deltas -not quite diffs- to compress the object store. This is very simple, yes, but it yields a workflow that doesn't work for many projects. Rebasing is much better, and it always works, especially with [0].

So now you might wonder how to deal with the rebase-across-thousands-of-commits problem. Well, I've a script[0] (not originally mine, but I've hacked on it) that does something like bisection to find upstream commits causing conflicts so as to greatly speed up the process.

[0] https://gist.github.com/nicowilliams/ea2fa2b445c2db50d2ee650...


If you have to rebase over many revisions you are likely in trouble anyway. There, small commits can be helpful to ease dealing with merge conflicts.

However, rebase is a single special operation. The diff-based storage affects everything: updating a local repository from upstream, diffing across some distance, ...


> The diff-based storage affects everything.

No one is saying Git does or should do that.

> If you have to rebase over many revisions you are likely in trouble anyway.

That script I linked made it easy.

> There, small commits can be helpful to ease dealing with merge conflicts.

They were indeed mostly small commits. Mine wasn't that small -- it's fairly invasive and not possible to split up into more than two commits.


Calling things harmful considered harmful.


Especially without an explanation.

  them: harmful
  me: how?
  them: really harmful
  me: sure, but, can you explain it?
  them: like a bajillion harmful


When you rename a file in git, you can get a glimpse of its internal model -- renames are not explicitly tracked. Since git only stores full copies of each object, it has to employ heuristics to figure out whether a (deleted file, created file) pair should be considered a rename. This is why you sometimes see a percentage next to a rename in "git log" -- this is git's internal metric for how similar/dissimilar the files are.
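
A crude illustration of the idea (Python; git actually samples hashed chunks rather than running a full diff like this, and the rename threshold defaults to 50%, tunable with -M):

  import difflib

  def similarity_percent(old, new):
      # toy stand-in for git's similarity index: how much content survived?
      return round(100 * difflib.SequenceMatcher(None, old, new).ratio())

  # A (deleted, added) pair scoring above the threshold gets reported
  # as a rename with that percentage attached.
  print(similarity_percent(b"foo: 1\nbar: 2\n", b"foo: 1\nbar: 3\n"))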


Is there any value in first-class renames other than some performance boost?


I'd argue yes. I remember times when I was puzzled that some config file had been deleted (in a commit touching lots of files), only to find out after some analysis that it was simply moved/renamed.


This is a good mental model. The internals are the internals. But the model users see is that a repository is a bag of commits (deltas) that organize into a tree, with branches and tags being names for specific commits. With this model in mind anyone can become a Git power user.


From a user point of view, you should not see commits as diffs or deltas, but as full snapshots of the contents of the repository.

Delta compression is just an implementation detail for storage optimization. Conceptually each commit is an immutable snapshot in its own right.


> From a user point of view, you should not see commits as diffs or deltas, but as full snapshots of the contents of the repository.

That tends to lead to a merge-centric workflow.

The alternative makes it easy to understand rebase. Ergo it's good.


I don't see any particular reason that'd be the case. Even with rebase, the diff model causes some of git's behavior to no longer make sense. For instance, the fact that rebase creates an entirely new series of commits: this is because rebase is just a layer over cherry-picking, and cherry-picking a commit calculates the diff between two commits, then applies that diff as its own, new commit.

If commits were diffs, then that process doesn't make any sense. It only makes sense when you think of commits as snapshots, and cherry-picking as taking a commit, calculating the diff with the current commit, and then applying that diff to make a brand new commit.


Yes, rebase is a series of cherry-picks. Each commit picked is treated as a set of diffs that are applied with conflict resolution, then committed, then off to the next. The mental model of commits as diffs works, obviously.

> If commits were diffs, then that process doesn't make any sense. It only makes sense when you think of commits as snapshots, and cherry-picking as taking a commit, calculating the diff with the current commit, and then applying that diff to make a brand new commit.

Sure it does. It works for me. You can use whatever mental model of Git works for you.


There are two ways commits can be treated in Git. Most of the time, a commit is a specific tree and some metadata. There are, however, some rebase-like operations that use the commit metadata to treat a commit as a diff against one or more parents which it points to. You're wrong in this discussion because it's about storage, but you do need to flip into this dual view of the commit graph to use rebase-like operations correctly.


Exactly. Thinking of commits as diffs helps think about rebasing.


> But the model users see is that a repository is a bag of commits (deltas)

Commits in your .git directory aren’t stored using deltas from a previous commit. Each different version of every file is stored in full.


From https://git-scm.com/book/en/v2/Git-Internals-Packfiles :

> You have two nearly identical 22K objects on your disk (each compressed to approximately 7K). Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

> It turns out that it can. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient.

...

> What is also interesting is that the second version of the file is the one that is stored intact, whereas the original version is stored as a delta — this is because you’re most likely to need faster access to the most recent version of the file.


That is what's covered by the article, but as various commenters have noted it's not part of git's model, it's an optimisation of git's storage.

So thinking in those terms for modelling / visualising git's behaviour is nonsensical.


It's not hard to make correct statements about the git user interface without being misleading about its internals.

josephg said:

> Each different version of every file is stored in full.

which is an untrue statement about how git internally manages its data. It would be fine had they instead said:

> From the user's point of view, it's as if each different version of every file is stored in full. Internally, git may perform compression, including differential compression.

This isn't the first time I've seen this confusion. This 2014 StackOverflow thread [0] does the same, as does an old HN thread. [1] The solution is clear communication.

[0] https://stackoverflow.com/q/8198105/

[1] https://news.ycombinator.com/item?id=25459757


Surely you must know what the word "model" means.


It doesn't mean "get out of jail free card for cryptonector being wrong", no.

The simplified model that leaves out tricky implementation details is just "every file that's different is a different blob object, stored in full on its own, under the hash of its contents". If you work with this model, you're basically correct in all your reasoning, except that in reality a bit less storage is used because some objects get packed. If you work with the wrong (wrong! not simplified, wrong!) model "every blob object is stored as base-and-deltas", you won't reason correctly about either how the object store presents itself in plumbing commands or how it's stored.


> It doesn't mean "get out of jail free card for cryptonector being wrong", no.

I never said that Git commits are diffs. Keep it civil.

> The simplified model that leaves out tricky implementation details [...]

Models do that.

> If you work with this model, you're basically correct in all your reasoning, except that in reality a bit less storage is used because some objects get packed.

That's it? That's the one detail that should stop me thinking correctly based on this model?


Git actually arranges its deltas starting from the current version and working back to older versions, so that it has less work to do to show you recent versions, and does more work as it goes further back in time.


Ah, thanks. You've reminded me about the 2016 debacle [0] wherein github rate limited the cocoapods repository because their practice of using a shallow clone and later fetches impacted github's performance. I'm going to warc that now, so I don't forget again.

[0] https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

[+] https://github.com/orgs/Homebrew/discussions/225


> Modern CRDTs are massively more space efficient than git but nobody seems to notice or care about the size of our git repositories.

The packing process explained in the article is why: rather than make compression an intrinsic part of the system's model (which most VCSes do / have done), git moves it to a later "optimisation" phase. This is an additional and somewhat odd cost, but on the other hand it can be extremely efficient (at $DAYJOB we've got files whose compression ratio is 99%).

Another interesting bit (which makes it very annoying to inspect the raw contents of a git repo) is that git DEFLATEs object contents when it stores them on-disk. But while that's clearly documented for packfiles (although the lack of an end-of-data marker is implicit and extremely annoying), it's not really spelled out for loose objects; you get to find out when you can't make heads or tails of loose objects, and git barfs at your hand-crafted loose objects.
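
Once you know about the DEFLATE step, though, loose objects are trivial to read back, something like this (a Python sketch, SHA-1 repos, no error handling):

  import zlib
  from pathlib import Path

  def read_loose_object(repo, oid):
      # loose object = zlib-deflated "<type> <size>\0<content>", no deltas
      path = Path(repo) / ".git" / "objects" / oid[:2] / oid[2:]
      raw = zlib.decompress(path.read_bytes())
      header, _, content = raw.partition(b"\x00")
      obj_type, size = header.decode().split()
      assert int(size) == len(content)
      return obj_type, content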


It's easy to forget that code is... so small, but I guess if your codebase on its own was hundreds of megabytes big, then that would be noticed more quickly. I never realized it stored a copy every time, I assumed it used diffs. So git uses the same solution some of us use when we're about to do something in git we're unfamiliar with... you copy the entire directory housing your code + git repo to a new location, just in case it fails miserably.


> I never realized it stored a copy every time, I assumed it used diffs

git does store diffs. People are confusing the logical representation with the on-disk layout.


I have used shallow clones to speed up and shrink what gets pulled into a docker image when building it. Of course it was a hack to make up for past mistakes but I was glad the option was available.


Did you need to update the docker image in-place afterwards? Unless you did, why not just export?


I still don't grok how CRDTs deal with conflicts. Or is it just "this happens so rarely I'll just use last update"?


That's a LWW (last write wins) resolution strategy. Here's a pretty good overview of a variety of CRDT methods: https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...
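
For the LWW case specifically, the whole trick fits in a few lines (a toy sketch, not tied to any particular library): every write is tagged with a (timestamp, replica_id) pair, and merge keeps the greatest one, so all replicas converge deterministically even though one of two concurrent edits gets silently dropped.

  class LWWRegister:
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.state = (0, replica_id, None)   # (timestamp, replica_id, value)

      def set(self, value, timestamp):
          # ties on timestamp are broken by replica_id, so replicas agree
          if (timestamp, self.replica_id) > self.state[:2]:
              self.state = (timestamp, self.replica_id, value)

      def merge(self, other):
          if other.state[:2] > self.state[:2]:
              self.state = other.state

      @property
      def value(self):
          return self.state[2]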


It kinda stores diffs in `rr-cache/`: the preimage | postimage | thisimage combined are a kind of diff.


> It just stores the full content of every file every time it’s modified and committed.

That's not true; otherwise each git branch folder would be huge, but it's not. It stores the snapshot of changes in branch folders.


> It stores the snapshot of changes in branch folders

No, GP is right. There is no snapshot of changes. Each unique version of a file is an object in the object store. That does mean each commit only introduces new objects for files that have changed (these are already added to the object store when you use `git add`).

But the pack files, as the article explains (which are created regularly during gc/maintenance runs), are why the repo size can often be smaller than the working tree.

> Not only is each object compressed individually, they can also be compressed against each other to take advantage of common data.


Branches are just pointers to commits in Git. How those commits are stored doesn't affect their size.


Maybe OP stores binary files?


Binary files are also delta'd in git.


I also have my own implementation of packfiles [1], which I implemented as part of a toy Git implementation in Haskell [2]. I personally found the packfile format underdocumented, e.g. the official reference for the packfile generation strategy is an IRC transcript [3].

1: https://github.com/vaibhavsagar/duffer/blob/bad9c6c6cf09f717...

2: https://vaibhavsagar.com/blog/2017/08/13/i-haskell-a-git/

3: https://git-scm.com/docs/pack-heuristics/


Shameless plug: I gave a tech talk on this at my last company, and implemented some of these in Golang as a learning exercise [0]. While it is not mandatory to know the internals, doing so helps a lot when you occasionally encounter a git-gotcha!

[0] https://github.com/ssrathi/gogit


> What if Git had a long-running daemon that could satisfy queries on-demand, but also keep that in-memory representation of data instead of needing to parse objects from disk every time

I guess that would be, performance-wise, the same as switching from running the compiler in one-off mode to an LSP analyzer.


Packfiles give me PTSD. They are so notoriously hard to maintain on the large scale server side.

I really wish git wasn't using them.


Could you elaborate on "maintain" please?



