Yeah. I suspect the answer is 'store all binary data in BAM', which then uses some different encoding for the binary stuff -- but that then makes my gittish soul wonder why not just use that encoding for everything. (It works for git packfiles... though 'git gc' on large repos is a total memory and CPU hog; one presumes that whatever delta encoding BAM uses is not.)



We support the uuencode horror for compat (and for smaller binaries that don't change), but the answer for binaries is BAM; there is no data in the weave for BAM files.

I don't agree that the weave is horrible; it's fantastic for text. Try git blame on a file in a repo with a lot of history, then try the same thing in BK. It's orders and orders of magnitude faster.

And go understand smerge.c; the weave lightbulb will come on.
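
Roughly, for anyone following along (a toy sketch of the idea, not the actual SCCS/BK file format): a weave keeps every line that has ever existed in the file, tagged with the serial of the delta that inserted it and, if it was later deleted, the serial that deleted it. Checking out or annotating any revision is then a single linear pass, and the annotation comes along for free:

    # Toy weave: each entry is (line_text, inserted_by, deleted_by_or_None).
    # Serials number the deltas; "ancestors" is the set of serials that make
    # up the revision being extracted.
    weave = [
        ("int main(void)",  1, None),
        ("{",               1, None),
        ("    return 0;",   1, 2),      # replaced in delta 2
        ("    return err;", 2, None),
        ("}",               1, None),
    ]

    def checkout(weave, ancestors):
        # One pass over the whole weave: keep lines inserted by an ancestor
        # delta and not deleted by one.
        return [text for text, ins, dele in weave
                if ins in ancestors and dele not in ancestors]

    def blame(weave, ancestors):
        # The same single pass, except the inserting serial is already
        # attached to every line.
        return [(ins, text) for text, ins, dele in weave
                if ins in ancestors and dele not in ancestors]

    print(checkout(weave, {1}))     # revision 1, including the old 'return 0;'
    print(blame(weave, {1, 2}))     # tip: each surviving line pre-labelled 1 or 2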


Yeah, that's the problem; it's optimizing for the wrong thing. It speeds up blame at the expense of absolutely every other operation you ever need to carry out; the only thing which avoids reading (or, for checkins, writing) the whole file is a simple log. Blame is a relatively rare operation: its needs should not dominate the representation.

The fact that the largest file you mention is frankly tiny shows why your performance was good. We had ~50,000-line text files (yeah, I know, damn copy-and-paste coders) with a thousand-odd revisions and a resulting SCCS file exceeding three million lines, and every one of those lines had to be read on every checkout: dozens to hundreds of megabytes. Of course the cache would hardly ever be hot where that much data was concerned, so it all had to come off the disk and/or across NFS, taking tens of seconds or more in many cases. RCS could have avoided reading all but 50,000 of those lines in the common case of checking out the most recent revision. (git would have reduced read volume even more, because although it is deltified the chains are of finite length, unlike the weave, and all the data is compressed.)
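
Back-of-the-envelope, using the numbers above plus a couple of guesses (the average line length and the per-delta object size are made up; 50 is git's default delta-chain depth limit), the read volume per checkout looks roughly like this:

    # Rough read-volume model.  live_lines and weave_lines come from the case
    # described above; line_bytes and delta_bytes are guessed.
    line_bytes  = 60            # assumed average line length
    live_lines  = 50_000        # lines in the checked-out file
    weave_lines = 3_000_000     # every line ever inserted, kept forever
    max_chain   = 50            # git's default cap on delta chain depth
    delta_bytes = 2_000         # assumed size of one delta in a packfile

    weave_read = weave_lines * line_bytes      # SCCS/BK: read the whole weave, every time
    rcs_read   = live_lines * line_bytes       # RCS tip checkout: just the tip text
    git_read   = live_lines * line_bytes + max_chain * delta_bytes  # bounded chain

    for name, n in [("weave", weave_read), ("RCS tip", rcs_read), ("git-ish", git_read)]:
        print(f"{name:8s} ~{n / 1e6:5.1f} MB read per checkout")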


Give me a file that was slow and let's see how it is in BitKeeper. I bet you'll be impressed.

50K lines is not even 3x bigger than the file I mentioned, which we check out in 20 milliseconds.

As for optimizing blame, you are missing the point: it's not blame, it's merge. It's copy by reference rather than copy by value.
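
To sketch what copy by reference means here (a toy illustration only, nothing like the real smerge.c): a weave merge decides, per line already sitting in the weave, which insert/delete serials the merged revision includes. The line text itself is never copied or re-diffed.

    # Toy weave, as in the sketch above: (line_text, inserted_by, deleted_by_or_None).
    weave = [
        ("shared line",    1, None),
        ("left's change",  2, None),
        ("old line",       1, 3),
        ("right's change", 3, None),
    ]

    left_serials  = {1, 2}      # deltas reachable from the left tip
    right_serials = {1, 3}      # deltas reachable from the right tip

    # The simplest possible merge: include the union of both sides' serials.
    # Real merges also resolve conflicts, but the result is still recorded as
    # serial decisions against existing weave lines, not as copied text.
    merged = left_serials | right_serials

    print([text for text, ins, dele in weave
           if ins in merged and dele not in merged])
    # ['shared line', "left's change", "right's change"]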


I'd do that if I were still working there. I can probably still get hold of a horror case, but it'll take negotiation :)

(And yes, optimizing merge matters too; indeed, it was a huge part of git's raison d'être -- but, again, one usually merges with the stuff at the tip of the tree: merging against something you did five years ago is rare, even if it's at a branch tip, and even rarer otherwise. Having to rewrite all the unmodified ancient stuff in the weave merely because of a merge at the tip seems wrong.)

(Now I'm tempted to go and import the Linux kernel or all of the GCC SVN repo into SCCS just to see how big the largest weave is. I have clearly gone insane from the summer heat. Stop me before I ci again!)


Our busiest file is 400K checked out and about 1MB for the history file, lz4 compressed. Uncompressed it is 2.2M, and the weave is 1.7M of that.

Doesn't seem bad to me. The weave is big for binaries; we imported 20 years of Solaris stuff once and the history was 1.1x the size of the checked-out files.



