
The last section is really interesting. The author presents the following algorithm as the "obvious" fast way of doing diffing:

1. Split the input into lines.

2. Hash each line to facilitate fast line equivalence testing (comparing a u32 or u64 checksum is a ton faster than memcmp() or strcmp()).

3. Identify and exclude common prefix and suffix lines.

4. Feed remaining lines into diffing algorithm.
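
For concreteness, step 2 might look something like this (a minimal sketch; the article doesn't say which hash it uses, so FNV-1a is assumed here):

    #include <stddef.h>
    #include <stdint.h>

    /* Hash one line with FNV-1a (64-bit) -- an assumed choice; any fast
       non-cryptographic hash would do. Note that it still has to read
       every byte of the line. */
    uint64_t hash_line(const char *line, size_t len) {
        uint64_t h = 0xcbf29ce484222325ULL;  /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= (unsigned char)line[i];
            h *= 0x100000001b3ULL;           /* FNV prime */
        }
        return h;
    }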

This seems like a terrible way of finding the common prefix/suffix! Hashing each line isn't magically fast: you have to scan through each line to compute the hash. And unless you use a cryptographic hash (which would be slow as anything), you can get false positives, so you still have to compare the lines anyway. A hash will tell you for sure that two lines are different, but not that they are the same: different strings can have the same hash. And in a diff situation, the assumption is that 99% of the time the lines will be the same; only small parts of the file will have changed.

So, in reality, the hashing solution does this:

1. Split the files into lines

2. Scan through each line of both files, generating the hashes

3. For each pair of lines, compare the hashes. For the ~99% of pairs where the hashes match, scan through the lines again to make sure they actually match
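
In code, that step-3 comparison ends up looking something like this (a sketch; the struct and names are invented for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct line { const char *text; size_t len; uint64_t hash; };

    /* Different hashes => definitely different lines. Equal hashes =>
       probably equal, but this must be verified with a real compare,
       because two different lines can share a hash. */
    bool lines_equal(const struct line *a, const struct line *b) {
        if (a->hash != b->hash || a->len != b->len)
            return false;
        return memcmp(a->text, b->text, a->len) == 0;
    }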

You're essentially replacing a strcmp() with a hash() + strcmp(). Compared to the naive way of just doing this:

1. Split the files into lines

2. For each pair of lines, strcmp() them once. Start from the beginning for the prefix and from the end for the suffix; in each case, stop at the first mismatch

That's so much faster! Generating hashes is not free!
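
As a rough sketch of that naive version (assuming the files are already split into line arrays; line_eq is a hypothetical length-check-plus-memcmp helper):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct line { const char *text; size_t len; };

    static bool line_eq(const struct line *x, const struct line *y) {
        return x->len == y->len && memcmp(x->text, y->text, x->len) == 0;
    }

    /* Trim the common prefix and suffix by direct comparison, no hashing.
       Afterwards only a[*prefix .. na-*suffix) and b[*prefix .. nb-*suffix)
       need to be fed to the diff algorithm. */
    void trim_common(const struct line *a, size_t na,
                     const struct line *b, size_t nb,
                     size_t *prefix, size_t *suffix) {
        size_t p = 0, s = 0;
        while (p < na && p < nb && line_eq(&a[p], &b[p]))
            p++;
        while (s < na - p && s < nb - p &&
               line_eq(&a[na - 1 - s], &b[nb - 1 - s]))
            s++;
        *prefix = p;
        *suffix = s;
    }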

The hashes might be useful for the actual diffing algorithm (between the prefix/suffix) because it presumably has to do a lot more line comparing. But for finding common prefix/suffix, it seems like an awful way of doing it.




Isn't this exactly what the author says? From the posted article:

> Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix. Use of memcmp() (or even better: hand-rolled assembly to give you the offset of the first divergence) is more efficient, as again, your C runtime library probably has assembly implementations of memcmp() which can compare input at near native memory speed.

So I think you're agreeing with him: it's a useful optimization to first remove the matching prefix and suffix using memcmp() to avoid having to do line splitting and hashing across the entire file. Especially since it's not uncommon for the files being compared to be mostly identical, with only a few changes in the middle, or some content added at the end.
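
As a rough sketch of that idea (a hypothetical helper; memcmp() doesn't report *where* the buffers diverge, so one common approach is chunked memcmp() followed by a byte scan inside the first differing chunk):

    #include <stddef.h>
    #include <string.h>

    /* Length of the common byte prefix of a and b (each at least n bytes).
       The chunked memcmp() lets the libc's optimized compare do the bulk
       of the work at near memory speed. */
    size_t common_prefix(const char *a, const char *b, size_t n) {
        const size_t CHUNK = 4096;
        size_t i = 0;
        while (i + CHUNK <= n && memcmp(a + i, b + i, CHUNK) == 0)
            i += CHUNK;
        while (i < n && a[i] == b[i])
            i++;
        return i;
    }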


Yeah, I misread that section: he talked about how the program spent a long time in the prefix/suffix part, and how picking a better hashing algorithm would improve things, and I was going "no! don't hash at all for that part! just compare the lines!". I missed the paragraph you quoted there.

Still, though, this is wrong: "Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix." You have to compute the lines for the prefix at least, since diffs use line numbers to indicate where the change is.


> diffs use line numbers to indicate where the change is.

The optimal solution would be a minor variation on the AVX-optimised memcmp that also counts the number of times it sees the newline character as it is comparing. You still do a single pass through the data, but you get the prefix comparison and the line number in the same go.

For modern CPUs, this is likely optimal.
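
A scalar sketch of the idea (hypothetical function; an AVX version would do the same with vector compares plus a popcount over a '\n' match mask):

    #include <stddef.h>

    /* One pass over the data: returns the length of the common byte
       prefix and, via *line, the number of newlines before the first
       difference -- i.e. the 0-based line number of the change. */
    size_t common_prefix_lines(const char *a, const char *b,
                               size_t n, size_t *line) {
        size_t i = 0, nl = 0;
        while (i < n && a[i] == b[i]) {
            if (a[i] == '\n')
                nl++;
            i++;
        }
        *line = nl;
        return i;
    }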


You assumed the diff algorithm only compares each line against one other line. That's not true.

You look at each line many times in these algorithms. The running time is O(n log n) or O(n^2), not O(n).

So you generate N hashes and compare each hash against log N or N other hashes.

So, for big enough data it should be faster.
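
To put rough, purely illustrative numbers on it: for n = 10,000 lines of 64 bytes each, hashing is one pass over ~640 KB. A diff doing ~n log n ≈ 130,000 line comparisons would touch up to 64 bytes per strcmp(), potentially several MB in total, versus 130,000 single-word hash compares (plus the occasional verifying compare on a match). Past some input size, the one-time hashing cost is amortized.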


No, you misunderstand: he mentions a common optimization where, before you run your diffing algorithm, you find the common line prefix/suffix of the files, which improves performance (if you have a compact 5-line diff in a 10,000-line file, it's unnecessary to run the diffing algorithm over the whole thing). His point was that this prefix/suffix finding was surprisingly slow, entirely apart from the actual diffing.

I was talking about that part, how hashing there is unnecessary. As I mentioned at the end of my comment, for the actual diffing algorithm, it's fine to hash away.


"That's so much faster!"

Do you have benchmarks for that?


would love to see an experiment on this



