Yeah. I suspect the answer is 'store all binary data in BAM', which then uses some different encoding for the binary stuff -- but that then makes my gittish soul wonder why not just use that encoding for everything. (It works for git packfiles... though 'git gc' on large repos is a total memory and CPU hog; one presumes that whatever delta encoding BAM uses is not.)



We support the uuencode horror for compat (and for smaller binaries that don't change), but the answer for binaries is BAM; there is no data in the weave for BAM files.

I don't agree that the weave is horrible; it's fantastic for text. Try git blame on a file in a repo with a lot of history, then try the same thing in BK. It's orders and orders of magnitude faster.

And go understand smerge.c; the weave lightbulb will come on.
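
Roughly, for anyone following along (a toy sketch of the idea, not the actual SCCS/BK file format): a weave keeps every line that has ever existed in the file, tagged with the serial of the delta that inserted it and, if it was later deleted, the serial that deleted it. Checking out or annotating any revision is then a single linear pass, and the annotation comes along for free:

    # Toy weave: each entry is (line_text, inserted_by, deleted_by_or_None).
    # Serials number the deltas; "ancestors" is the set of serials that make
    # up the revision being extracted.
    weave = [
        ("int main(void)",  1, None),
        ("{",               1, None),
        ("    return 0;",   1, 2),      # replaced in delta 2
        ("    return err;", 2, None),
        ("}",               1, None),
    ]

    def checkout(weave, ancestors):
        # One pass over the whole weave: keep lines inserted by an ancestor
        # delta and not deleted by one.
        return [text for text, ins, dele in weave
                if ins in ancestors and dele not in ancestors]

    def blame(weave, ancestors):
        # The same single pass, except the inserting serial is already
        # attached to every line.
        return [(ins, text) for text, ins, dele in weave
                if ins in ancestors and dele not in ancestors]

    print(checkout(weave, {1}))     # revision 1, including the old 'return 0;'
    print(blame(weave, {1, 2}))     # tip: each surviving line pre-labelled 1 or 2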


Yeah, that's the problem; it's optimizing for the wrong thing. It speeds up blame at the expense of absolutely every other operation you ever need to carry out; the only thing which avoids reading (or, for checkins, writing) the whole file is a simple log. Blame is a relatively rare operation: its needs should not dominate the representation.

The fact that the largest file you mention is frankly tiny shows why your performance was good. We had ~50,000-line text files (yeah, I know, damn copy-and-paste coders) with a thousand-odd revisions and a resulting SCCS file exceeding three million lines, and every one of those lines had to be read on every checkout: dozens to hundreds of megabytes. Of course the cache would hardly ever be hot where that much data was concerned, so it all had to come off the disk and/or across NFS, taking tens of seconds or more in many cases. RCS could have avoided reading all but 50,000 of those lines in the common case of checking out the most recent revision. (git would have reduced read volume even more, because although it is deltified the chains are of finite length, unlike the weave, and all the data is compressed.)
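
Back-of-the-envelope, using the numbers above plus a couple of guesses (the average line length and the per-delta object size are made up; 50 is git's default delta-chain depth limit), the read volume per checkout looks roughly like this:

    # Rough read-volume model.  live_lines and weave_lines come from the case
    # described above; line_bytes and delta_bytes are guessed.
    line_bytes  = 60            # assumed average line length
    live_lines  = 50_000        # lines in the checked-out file
    weave_lines = 3_000_000     # every line ever inserted, kept forever
    max_chain   = 50            # git's default cap on delta chain depth
    delta_bytes = 2_000         # assumed size of one delta in a packfile

    weave_read = weave_lines * line_bytes      # SCCS/BK: read the whole weave, every time
    rcs_read   = live_lines * line_bytes       # RCS tip checkout: just the tip text
    git_read   = live_lines * line_bytes + max_chain * delta_bytes  # bounded chain

    for name, n in [("weave", weave_read), ("RCS tip", rcs_read), ("git-ish", git_read)]:
        print(f"{name:8s} ~{n / 1e6:5.1f} MB read per checkout")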


Give me a file that was slow and let's see how it is in BitKeeper. I bet you'll be impressed.

50K lines is not even 3x bigger than the file I mentioned, which we check out in 20 milliseconds.

As for optimizing blame, you are missing the point: it's not blame, it's merge. It's copy by reference rather than copy by value.
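
To sketch what copy by reference means here (a toy illustration only, nothing like the real smerge.c): a weave merge decides, per line already sitting in the weave, which insert/delete serials the merged revision includes. The line text itself is never copied or re-diffed.

    # Toy weave, as in the sketch above: (line_text, inserted_by, deleted_by_or_None).
    weave = [
        ("shared line",    1, None),
        ("left's change",  2, None),
        ("old line",       1, 3),
        ("right's change", 3, None),
    ]

    left_serials  = {1, 2}      # deltas reachable from the left tip
    right_serials = {1, 3}      # deltas reachable from the right tip

    # The simplest possible merge: include the union of both sides' serials.
    # Real merges also resolve conflicts, but the result is still recorded as
    # serial decisions against existing weave lines, not as copied text.
    merged = left_serials | right_serials

    print([text for text, ins, dele in weave
           if ins in merged and dele not in merged])
    # ['shared line', "left's change", "right's change"]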


I'd do that if I were still working there. I can probably still get hold of a horror case, but it'll take negotiation :)

(And yes, optimizing merge matters too; indeed, it was a huge part of git's raison d'être -- but, again, one usually merges with the stuff at the tip of the tree: merging against something you did five years ago is rare, even if it's at a branch tip, and even rarer otherwise. Having to rewrite all the unmodified ancient stuff in the weave merely because of a merge at the tip seems wrong.)

(Now I'm tempted to go and import the Linux kernel or all of the GCC SVN repo into SCCS just to see how big the largest weave is. I have clearly gone insane from the summer heat. Stop me before I ci again!)


Our busiest file is 400K checked out and about 1MB for the history file, lz4 compressed. Uncompressed it is 2.2M, and the weave is 1.7M of that.

Doesn't seem bad to me. The weave is big for binaries; we imported 20 years of Solaris stuff once and the history was 1.1x the size of the checked-out files.



