
An author here, I agree! The packfile format is heavily inspired by git, and git may also do quite well at this.

We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.

On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.




As far as I was able to learn (don't remember the details, sorry), git does not do well with large binary files. I believe it ends up with a lot of duplication. It is the major thing I am missing from git; currently we store assets (like big PSDs that change often) outside of version control, and it is suboptimal.


Performing poorly with non-textual data happens for a number of reasons. Binary data, when changed, often has a lot of 'non-local' changes in it. For example, a PSD file might well have a compression algorithm already applied to it. An insertion/deletion is going to result in a very different compressed representation for which there is no good way to have an efficient delta. elfshaker will suffer the same problem here.
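
To make the 'non-local changes' point concrete, here is a tiny self-contained Python illustration (mine, not from elfshaker): change one byte in some repetitive raw data, compress both versions, and compare how many bytes still line up.

    # Rough illustration: a one-byte local edit in the raw data turns into
    # widespread byte differences once the data is compressed, which is why
    # byte-level deltas on already-compressed containers work poorly.
    import zlib

    raw_a = b"layer data " * 10000
    raw_b = bytearray(raw_a)
    raw_b[50] = ord("X")          # single local edit near the start

    comp_a = zlib.compress(bytes(raw_a), 9)
    comp_b = zlib.compress(bytes(raw_b), 9)

    same_raw = sum(x == y for x, y in zip(raw_a, raw_b))
    same_comp = sum(x == y for x, y in zip(comp_a, comp_b))

    print(f"raw bytes identical:        {same_raw}/{len(raw_a)}")
    print(f"compressed bytes identical: {same_comp}/{min(len(comp_a), len(comp_b))}")

The raw versions differ in exactly one byte, but the compressed streams share little beyond the header, so a byte-level delta between them buys almost nothing.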


One could, in theory, write a git-clean filter (like the one used for git-lfs), that teaches git various heuristic approaches to "take apart" well-known binary container formats into trees of binary object leaf-nodes.

Then, when you committed a large binary that git could understand, what git would really be committing in its place would be a directory tree — sort of like the "resource tree" you see if you edit an MKV file, PNG file, etc., but realized as files in directories. Git would generate it, then commit it.

On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.

Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])

I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
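
For anyone curious what the plumbing might look like, here is a deliberately naive sketch (the `binsplit` name and everything in it are made up, and fixed-size chunking stands in for real format-aware splitting). You'd register it with `*.psd filter=binsplit` in .gitattributes and point `filter.binsplit.clean` / `filter.binsplit.smudge` at the script. Unlike git-lfs, this toy only keeps chunks in the local .git directory, so it illustrates the dedup idea rather than being something you'd actually deploy.

    # binsplit.py -- hypothetical clean/smudge filter sketch, not a real tool.
    # clean:  read the binary on stdin, chunk it, store chunks by hash in
    #         .git/binsplit/, and emit a small text manifest for git to commit.
    # smudge: read the manifest on stdin and reassemble the original bytes.
    import hashlib, os, sys

    STORE = os.path.join(".git", "binsplit")
    CHUNK = 64 * 1024  # naive fixed-size chunks; a real filter would split
                       # along the container format's own structure

    def clean():
        os.makedirs(STORE, exist_ok=True)
        data = sys.stdin.buffer.read()
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            digest = hashlib.sha1(chunk).hexdigest()
            path = os.path.join(STORE, digest)
            if not os.path.exists(path):     # dedup: identical chunks stored once
                with open(path, "wb") as f:
                    f.write(chunk)
            sys.stdout.write(digest + "\n")  # one manifest line per chunk

    def smudge():
        for digest in sys.stdin.read().split():
            with open(os.path.join(STORE, digest), "rb") as f:
                sys.stdout.buffer.write(f.read())

    if __name__ == "__main__":
        clean() if sys.argv[1] == "clean" else smudge()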


As I understand it, this is essentially what the Google Chrome updater does. It disassembles the binary and recalculates jump labels. Then it generates a diff based on the assembly code. When it applies that diff on people’s computers, your computer again pulls the Chrome binary apart and reconstructs it. The code for this is complex, but it’s all open source.

I remember reading about this technique years ago, thinking “cool, when this catches on, software updates in all my software will be tiny”. But no, for some reason macOS updates are still gigabytes in size. I have no idea why.


I think you're referring to Courgette, which does the disassemble + patch + reassemble thing. FWIW, Chrome now uses a simpler but competitive format called Zucchini; see my recent comment: https://news.ycombinator.com/item?id=29028534


It does!

The major distinction (besides this being generalized to pluggable binary container formats) is that what Courgette does happens in the context of delta comparison between two known binaries, where the goal is to create a minimal delta patch that can move you from one known binary to another known binary (given a sufficiently-smart updater, which actually "contains" a lot of the information in a Kolmogorov-complexity sense.)

There are efficiencies you can gain in encoding, that only work in a one-to-one, known-to-known comparison context. (Think "interstitial frames" in video; or RLE'd 1-bit delta-sigma audio encoding.) In these contexts, you need known1 on-hand already, plus a patch specific to the pair of {known1, known2}, to be able to recover known2. (This is why, in streaming video where non-delta-compressed "keyframes" are rare, you can't seek to arbitrary locations in the stream without the decoded video turning into a garbled mess for a while: you're receiving patches, but you don't have the [correct, clean] known1s to apply them to.)

But these efficiencies don't apply in the context of a system like git, where you might be checking out any arbitrary version of the data at any time, with any [or no!] other arbitrary version(s) already cached or checked out. To enable people to efficiently grab the delta they need to do the particular, arbitrary version-to-version jumps they need to do, you'd either need to generate the power-set of all updates; or at least enough entries to populate a skip-list (as Microsoft apparently does for Windows updates! [1]).

The powerset of updates is impractical to generate+store, but gives clients the ability to perform arbitrary jumps in O(1) time. With a skip-list of updates, on the other hand, either the server or the client needs to compute its way through all the skips to combine them into one final update — so each jump would take O(log N) "patch application" computation steps to resolve. (This is why it takes the Microsoft servers a while to figure out which Windows updates your computer needs!) Neither approach really makes sense for a system like Git, where Git repos are entirely mirrored to every client (so the clients would need to store all those update patches), and where git checkouts are something you're supposed to be able to do often and quickly, in the middle of your daily workflow.
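
A quick back-of-the-envelope sketch of the skip-list arithmetic (my own toy model, not how Windows Update actually stores things): if each version publishes patches spanning 1, 2, 4, 8, ... versions forward, any jump can be covered greedily by a handful of stored patches, versus storing a patch for every possible pair of versions.

    # Toy model: spans of 1, 2, 4, ..., 512 let any forward jump among N=1000
    # versions be covered by O(log N) patch applications, while covering every
    # possible direct jump would need O(N^2) stored patches.
    N = 1000
    SPANS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

    def patches_for_jump(src, dst):
        """Greedily cover the jump src -> dst with the largest spans that fit."""
        used = []
        while src < dst:
            span = max(s for s in SPANS if src + s <= dst)
            used.append((src, src + span))
            src += span
        return used

    print(len(patches_for_jump(3, 997)))              # 6 patches, not ~994
    print(N * (N - 1) // 2, "patches to cover every possible direct jump")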

Meanwhile, the sort of approach I outlined would be optimized instead for storing currently known binaries, in a way most amenable to inter-compression against future, unknown versions of the binary, without the need to store any explicit known-to-known patches, or to compute the composition of such patches.

Rather than seeking to compress together existing available things as well as possible, a known-to-unknown compression approach seeks to segment and normalize the data in such a way that future revisions of the same data, put through the same process, would generate lots of content-identical duplicate component objects, which would end up being dedup'ed away when stored in a Content-Addressable Store.

It's the difference between "meeting in the middle" and canonicalization. Levenshtein distance vs. document fingerprinting. RAID5 vs. a fountain-codec-encoded dataset. Or, to be very literal, it's streaming-window Huffman coding, vs. merkle-trie encoding.

This "normalize, to later deduplicate" known-to-unknown compression approach, is the approach that Git itself is built on — but it only works well when the files Git works with are decomposed into mostly-independent leaf-node chunks. My proposed thought here, is just that it wouldn't be impossible to translate other types of files, into "mostly-independent leaf-node chunks" at commit time, such that Git's approach would then be applicable to them.

[1] https://devblogs.microsoft.com/oldnewthing/20200212-00/?p=10...


Can you talk a bit more about what ELF-specific heuristics elfshaker uses? What kind of preprocessing do you do before zstd? Do you handle offsets changing in instructions, like the BCJ/BCJ2 filter? Do you do anything to detect insertions/deletions?


We've just added an applicability section, which explains a bit more about what we do. We don't have any ELF-specific heuristics [0].

https://github.com/elfshaker/elfshaker#applicability

In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section-relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit-identical object files, even though the build system often wants to rebuild them because some input has changed.

elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression with zstandard.
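
If I've read that description right, the heuristic is roughly the following (a sketch of my understanding, not elfshaker's actual code): deduplicate identical objects, sort the unique ones by size so similar objects tend to land near each other, concatenate, and compress the whole thing as a single zstandard stream. A real pack also needs an index of per-snapshot offsets, which this omits.

    # Sketch of the size-sort packing heuristic described above.
    import hashlib
    import zstandard  # pip install zstandard

    def pack(object_blobs, level=19):
        unique = {}
        for blob in object_blobs:
            unique[hashlib.sha256(blob).digest()] = blob   # dedup identical objects
        ordered = sorted(unique.values(), key=len)         # size-sort heuristic
        return zstandard.ZstdCompressor(level=level).compress(b"".join(ordered))

    # usage: pack([open(p, "rb").read() for p in object_file_paths])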

[0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)


Ah, I see! Makes sense that you can do much better if you get to compile the programs with your choice of options.


> we store assets (like big PSDs that change often) outside of version control and it is suboptimal.

Perforce is still used by game developers and other creatives because it handles large binaries quite well.

In fact, I'm not sure if they still do it, but one of the game engines (I think, maybe, Unreal) used to have a free tier that also included a free Perforce install.


It was my recollection, and I confirmed it, that they've almost always had a "the first hit is free" model for small teams, and they also explicitly call out indie game studios as getting free stuff too: https://www.perforce.com/how-buy


Do you think it would be feasible to do a git-lfs replacement based on elfshaker?

Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)



