
The last section is really interesting. The author presents the following algorithm as the "obvious" fast way of doing diffing:

1. Split the input into lines.

2. Hash each line to facilitate fast line equivalence testing (comparing a u32 or u64 checksum is a ton faster than memcmp() or strcmp()).

3. Identify and exclude common prefix and suffix lines.

4. Feed remaining lines into diffing algorithm.
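
For concreteness, step 2 might look something like this (a minimal sketch; the article doesn't say which hash it uses, so FNV-1a is assumed here):

    #include <stddef.h>
    #include <stdint.h>

    /* Hash one line with FNV-1a (64-bit) -- an assumed choice; any fast
       non-cryptographic hash would do. Note that it still has to read
       every byte of the line. */
    uint64_t hash_line(const char *line, size_t len) {
        uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)line[i];
            h *= 0x100000001b3ULL;           /* FNV prime */
        }
        return h;
    }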

This seems like a terrible way of finding the common prefix/suffix! Hashing each line isn't magically fast: you have to scan through each line to compute the hash. And unless you use a cryptographic hash (which would be slow as anything), you can get false positives, so you still have to compare the lines anyway. A hash will tell you for sure that two lines are different, but not that they are the same: different strings can have the same hash. And in a diff situation, the assumption is that 99% of the time the lines will be the same; only small parts of the file will have changed.

So, in reality, the hashing solution does this:

1. Split the files into lines

2. Scan through each line of both files, generating the hashes

3. For each pair of lines, compare the hashes. For the ~99% of pairs where the hashes match, scan through the lines again to make sure they actually match
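
In code, that step-3 comparison ends up looking something like this (a sketch; the struct and names are invented for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct line { const char *text; size_t len; uint64_t hash; };

    /* Different hashes => definitely different lines. Equal hashes =>
       probably equal, but this must be verified with a real compare,
       because two different lines can share a hash. */
    bool lines_equal(const struct line *a, const struct line *b) {
        if (a->hash != b->hash || a->len != b->len)
            return false;
        return memcmp(a->text, b->text, a->len) == 0;
    }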

You're essentially replacing a strcmp() with a hash() + strcmp(). Compared to the naive way of just doing this:

1. Split the files into lines

2. For each pair of lines, strcmp() them once. Start from the beginning for the prefix and from the end for the suffix; in each case, stop at the first mismatch

That's so much faster! Generating hashes is not free!
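
As a rough sketch of that naive version (assuming the files are already split into line arrays; line_eq is a hypothetical length-check-plus-memcmp helper):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct line { const char *text; size_t len; };

    static bool line_eq(const struct line *x, const struct line *y) {
        return x->len == y->len && memcmp(x->text, y->text, x->len) == 0;
    }

    /* Trim the common prefix and suffix by direct comparison, no hashing.
       Afterwards only a[*prefix .. na-*suffix) and b[*prefix .. nb-*suffix)
       need to be fed to the diff algorithm. */
    void trim_common(const struct line *a, size_t na,
                     const struct line *b, size_t nb,
                     size_t *prefix, size_t *suffix) {
        size_t p = 0, s = 0;
        while (p < na && p < nb && line_eq(&a[p], &b[p]))
            p++;
        while (s < na - p && s < nb - p &&
               line_eq(&a[na - 1 - s], &b[nb - 1 - s]))
            s++;
        *prefix = p;
        *suffix = s;
    }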

The hashes might be useful for the actual diffing algorithm (between the prefix/suffix) because it presumably has to do a lot more line comparing. But for finding common prefix/suffix, it seems like an awful way of doing it.




Isn't this exactly what the author says? From the posted article:

> Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix. Use of memcmp() (or even better: hand-rolled assembly to give you the offset of the first divergence) is more efficient, as again, your C runtime library probably has assembly implementations of memcmp() which can compare input at near native memory speed.

So I think you're agreeing with him: it's a useful optimization to first remove the matching prefix and suffix using memcmp() to avoid having to do line splitting and hashing across the entire file. Especially since it's not uncommon for the files being compared to be mostly identical, with only a few changes in the middle, or some content added at the end.
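
As a rough sketch of that idea (a hypothetical helper; memcmp() doesn't report *where* the buffers diverge, so one common approach is chunked memcmp() followed by a byte scan inside the first differing chunk):

    #include <stddef.h>
    #include <string.h>

    /* Length of the common byte prefix of a and b (each at least n bytes).
       The chunked memcmp() lets the libc's optimized compare do the bulk
       of the work at near memory speed. */
    size_t common_prefix(const char *a, const char *b, size_t n) {
        const size_t CHUNK = 4096;
        size_t i = 0;
        while (i + CHUNK <= n && memcmp(a + i, b + i, CHUNK) == 0)
            i += CHUNK;
        while (i < n && a[i] == b[i])
            i++;
        return i;
    }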


Yeah, I misread that section: he talked about how the program spent a long time in the prefix/suffix part, and how picking a better hashing algorithm would improve things, and I was going "no! don't hash at all for that part! just compare the lines!". I missed the paragraph you quoted there.

Still, though, this is wrong: "Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix." You have to compute the lines for the prefix at least, since diffs use line numbers to indicate where the change is.


> diffs use line numbers to indicate where the change is.

The optimal solution would be a minor variation on the AVX-optimised memcmp that also counts the number of times it sees the newline character as it is comparing. You still do a single pass through the data, but you get the prefix comparison and the line number in the same go.

For modern CPUs, this is likely optimal.
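
A scalar sketch of the idea (hypothetical function; an AVX version would do the same with vector compares plus a popcount over a '\n' match mask):

    #include <stddef.h>

    /* One pass over the data: returns the length of the common byte
       prefix and, via *line, the number of newlines before the first
       difference -- i.e. the 0-based line number of the change. */
    size_t common_prefix_lines(const char *a, const char *b,
                               size_t n, size_t *line) {
        size_t i = 0, nl = 0;
        while (i < n && a[i] == b[i]) {
            if (a[i] == '\n')
                nl++;
            i++;
        }
        *line = nl;
        return i;
    }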


You assumed the diff algorithm only compares each line against one other line. That's not true.

You look at each line many times in these algorithms. The running time is O(n log n) or O(n^2), not O(n).

So you generate N hashes and compare each hash against log N or N other hashes.

So, for big enough data it should be faster.
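
To put rough, purely illustrative numbers on it: for n = 10,000 lines of 64 bytes each, hashing is one pass over ~640 KB. A diff doing ~n log n ≈ 130,000 line comparisons would touch up to 64 bytes per strcmp(), potentially several MB in total, versus 130,000 single-word hash compares (plus the occasional verifying compare on a match). Past some input size, the one-time hashing cost is amortized.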


No, you misunderstand: he mentions a common optimization where, before you run your diffing algorithm, you find the common line prefix/suffix of the files, which improves performance (if you have a compact 5-line diff in a 10,000-line file, it's unnecessary to run the diffing algorithm over the whole thing). His point was that this prefix/suffix finding was surprisingly slow, entirely apart from the actual diffing.

I was talking about that part, how hashing there is unnecessary. As I mentioned at the end of my comment, for the actual diffing algorithm, it's fine to hash away.


"That's so much faster!"

Do you have benchmarks for that?


would love to see an experiment on this



