I've always wanted to explore the topic of diff in depth, but so far haven't been able to. I have some links that I've stashed; hopefully they will be useful.
For people just wanting to get their feet wet and understand one of the simpler algorithms, I implemented a simple diff algorithm in Python a decade ago with code readability in mind and it's since been translated to a handful of languages.
Quick question: is this the sort of technique that is used for general text diffing? I see you effectively tokenize the string using whitespace to split and then pass in 2 lists of tokens. Is that the way it's generally done, and you just choose a suitable tokenizer for your specific use-case?
Does anyone know much about prose diff? Very few diff apps work well with prose, and I wonder to what extent this is driven at the level of the algorithm.
in the same way that one-sentence-per-line
gives you much more fine-grained diffs than
one-paragraph-per-line, notching it down to
one-phrase-per-line gives the best results.
a simple javascript routine can do that split,
and then rejoin the lines after you do a diff.
i've written this up extensively, to no notice.
http://zenmagiclove.com/simple/breaker.html
https://github.com/bbirdiman/breakerbreaker
***
i've since found that a dozen changes suffice:
replace ". " -- with -- ". \n"
replace ", " -- with -- ", \n"
replace "? " -- with -- "? \n"
replace "! " -- with -- "! \n"
replace ": " -- with -- ": \n"
replace "; " -- with -- "; \n"
replace ") " -- with -- ") \n"
replace "] " -- with -- "] \n"
replace "} " -- with -- "} \n"
replace "-- " -- with -- "-- \n"
replace "' " -- with -- "' \n"
replace '" ' -- with -- '" \n'
***
and, of course, to revert those changes,
you simply change " \n" to " " and boom.
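here's a minimal sketch of that routine in javascript (my own illustrative version, not the actual breakerbreaker code):

    const PHRASE_BREAKS = [". ", ", ", "? ", "! ", ": ", "; ", ") ", "] ", "} ", "-- ", "' ", '" '];

    // break prose into one phrase per line by adding a newline after each break mark
    function breakIntoPhrases(text) {
        let out = text;
        for (const mark of PHRASE_BREAKS) {
            out = out.split(mark).join(mark + "\n");   // e.g. ". " becomes ". \n"
        }
        return out;
    }

    // revert: " \n" back to " ", as described above
    function rejoinPhrases(text) {
        return text.split(" \n").join(" ");
    }

diff the broken-up text with any line differ, then rejoin.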
> I wonder to what extent this is driven at the level of the algorithm.
Almost entirely.
Most programming diff algorithms operate on a line-by-line basis. That means that a single character change in a line marks the whole line as changed. For prose, a single 'line' is usually a paragraph, so it's pretty obvious why they don't work well. You might check out gnu `wdiff` for what a word-by-word diff could look like. I haven't really looked into the area deeply, so I don't know what the state of the art is.
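If you want to experiment with the idea, the cheapest hack is to reuse whatever line differ you already have on word tokens instead of lines. A rough sketch (`diffLines` here is a stand-in for whatever line-based diff function you have on hand, and the splitter is deliberately naive):

    // split prose into word tokens and feed them to a line-based differ,
    // treating each word as if it were a "line"
    function wordDiff(oldText, newText, diffLines) {
        const tokenize = (text) => text.split(/\s+/).filter(Boolean);
        return diffLines(tokenize(oldText), tokenize(newText));
    }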
This makes me wonder if anyone has implemented diff by comparing the Abstract Syntax Trees of the two code-bases. I guess it's probably easier than regular diff (assuming you've already got the ASTs from somewhere).
I would think diffing trees would be necessarily more complex than diffing lines - you can think of the line-by-line description of a file as a tree with at most one child per node, so any algorithm for diffing trees should also automatically be able to diff lines.
Tree diff algorithms definitely exist (I know React.js uses diffs on a virtual DOM to minimize operations), but I'm not sure what the state of the art is for those.
Tree diff varies between O(N^2) and O(N^4) for simple solutions, depending on the complexity of matches allowed, with complex, fully general algorithms coming in at O(N^2 log^2 N).
Tree diff is harder because the range of operations is bigger.
React.js cheats by insisting on keys for arrays so matching is much easier.
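Roughly the kind of keyed matching that makes tree diffing cheap (an illustrative sketch, not React's actual reconciler):

    // match children by key: same key => compare/update, otherwise insert or remove
    function diffKeyedChildren(oldChildren, newChildren) {
        const oldByKey = new Map(oldChildren.map((child) => [child.key, child]));
        const ops = [];
        const seen = new Set();
        for (const child of newChildren) {
            if (oldByKey.has(child.key)) {
                ops.push({ type: "update", key: child.key, from: oldByKey.get(child.key), to: child });
                seen.add(child.key);
            } else {
                ops.push({ type: "insert", key: child.key, node: child });
            }
        }
        for (const child of oldChildren) {
            if (!seen.has(child.key)) {
                ops.push({ type: "remove", key: child.key });
            }
        }
        return ops;
    }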
I've maintained prose in git with the default diff algorithm using one sentence per line, and it's worked pretty well.
My specific use case was a document tweaked for two different audiences, and I've been able to make changes in one version and rebase the other version on top of it relatively easily.
From the simple perspective, using a fixed width line in a prose markup language that is mostly whitespace agnostic like Markdown creates okay diffs in a line-based diff tool.
That particular tool/experiment uses the Pygments tokenizer (the one used for syntax highlighting) and produces interesting, somewhat semantically meaningful code diffs. I think the same principles would apply if you used something like a part-of-speech tagger on prose.
From a different approach, I put some effort into better line-based and file-based diffs of Inform 7, which is a prose format that is less whitespace agnostic than Markdown and also built as a single monolithic file, by converting it to an intermediate format.
That works by splitting the file at things that resemble headers, converting newlines to pilcrows (the paragraph symbol), and essentially reformatting to more of a fixed-width format. (All of which is trivially reversible.)
You can see the commits in that repository as an example of what the intermediate format looks and diffs like.
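The conversion itself is only a few lines. A rough, illustrative sketch (the header regex and function names are my guesses, not the actual tool's code):

    // newline -> pilcrow within each section, one section per line in the output
    function toIntermediate(source) {
        return source
            .split(/\n(?=Volume |Book |Part |Chapter |Section )/)  // split at header-ish lines
            .map((chunk) => chunk.replace(/\n/g, " \u00B6 "))       // newline -> " ¶ "
            .join("\n");
    }

    // reverse the transformation: pilcrows back to newlines
    function fromIntermediate(intermediate) {
        return intermediate
            .split("\n")
            .map((chunk) => chunk.replace(/ \u00B6 /g, "\n"))
            .join("\n");
    }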
I wrote that to better source-control zip files as their unzipped contents rather than as a binary blob, by letting the zip/unzip operations be automated as part of source control operations (pre- and post-commit hooks). I've used this for some of the modern file formats like .docx, which are built as zip files of XML files and other assets. For instance, you write the document in Word, interacting with the .docx, and source-control its XML contents. That too gives more useful diffs than source-controlling the .docx on its own.
Courgette used to only work on Windows (PE/COFF) executables. It looks like it now also works on ELF x86 and ELF ARM executables. But it doesn't seem to support 64-bit ELF, nor does it support Mach-O (Mac) executables.
In addition to the algorithms you mentioned, git also has --diff-algorithm=minimal. This is just the XDF_NEED_MINIMAL flag to xdiff, so I'm unsure if it counts as a separate algorithm.
For most reasonable quantities of data, diff speed is Good Enough as it is, and it's certainly not worth trading off accuracy for speed. In fact, if anything we should be going the other direction - Patience Diff is not quite as fast as regular diff, but the output is much higher quality in my experience, particularly with non-trivial amounts of divergence between codebases.
What if you have unreasonable quantities of data? I've as yet not come across a really good program that lets me do `bigdiff <(xzcat bigresult-old.xz) <(xzcat bigresult-new.xz)|less` (where the files are gigabytes of text with fairly few differences) in a reasonable amount of time/memory. I've used hacks that only work on a line-by-line basis (or use some hardcoded marker in the input) to try to read both files in parallel and run a real diff on a subsection when seeing a difference between the markers, but it's far from trivial getting it to work well (and I unfortunately don't have time to shave that yak :/)
It was a very significant improvement in speed a few years ago — though with time I've gotten more RAM faster than bigger files to run diff on, and I haven't had any difficulty with the regular Linux diff for a long time.
As far as diff performance goes, concatenating the two large files using the diff replace-all syntax is faster: it uses O(1) memory and O(n) time, and it's only slightly space inefficient.
I'd also say it might beat the OP's algorithm in performance under certain assumptions (i.e. large writes vs. scan/read performance).
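For what it's worth, the degenerate replace-everything patch is trivial to emit. A sketch in unified format (a streaming version would keep memory at O(1); strings are used here for brevity, and the function name is mine):

    // emit a single hunk that deletes every old line and adds every new line
    function replaceAllDiff(oldLines, newLines, oldName, newName) {
        const header = "--- " + oldName + "\n+++ " + newName + "\n" +
                       "@@ -1," + oldLines.length + " +1," + newLines.length + " @@\n";
        return header +
            oldLines.map((line) => "-" + line).join("\n") + "\n" +
            newLines.map((line) => "+" + line).join("\n") + "\n";
    }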
I might be wildly misreading the code, but it seems like [1,2,3,a,b,c] diffed with [a,b,c,1,2,3] is going to appear as six lines of replacement.
The table will be {a :(1,1), b :(1,1), ...} because each line appears in both files exactly once.
do {
    c = a;
    d = b;
    // first pass a=0, b=0
    // the lines aren't equal, they're "1" and "a" so we pass this case.
    if (one[a] === two[b]) {
        equality();
    } else {
        // one[0] is "1", matching 1:(1,1)
        // two[0] is "a", matching a:(1,1)
        if (table[one[a]].two < 1 && table[two[b]].one < 1) {
            // can't get here
            replaceUniques();
        } else if (table[one[a]].two < 1 && one[a + 1] !== two[b + 2]) {
            // 1:(1,1) cdr not less than 1, so this is out
            deletion();
        } else if (table[two[b]].one < 1 && one[a + 2] !== two[b + 1]) {
            // a:(1,1) car not less than 1, so this is out
            insertion();
        } else if (table[one[a]].one - table[one[a]].two === 1 && one[a + 1] !== two[b + 2]) {
            // 0 === one[a+1] !== two[b+2]
            // === "2" !== "c"
            // === true
            deletionStatic();
        } else if (table[two[b]].two - table[two[b]].one === 1 && one[a + 2] !== two[b + 1]) {
            // 0 === one[a+2] !== two[b+1]
            // === "3" !== "b"
            // === true
            insertionStatic();
        } else {
            // so we're stuck replacing.
            replacement();
        }
    }
    a += 1;
    b += 1;
} while (a < lena && b < lenb);
it's very pretty, it just seems like it's not doing enough lookahead to find those long distance relationships.
(although i didn't run it, i could have screwed up my interpretation)
in my limited experience, diff is very hard.
edit
fixed a couple typos in the comments.
yeah, three deletes and three inserts showing at least 3 lines of equality is preferable. ideally it'd show a single edit, a move of 3 lines, but that's really hard to do.
I tested this using the online version at http://prettydiff.com/ , and you're right; if you feed it those two six-line files, it produces a six-line replacement without finding any common lines. Good catch, and good test case.
I had trouble reproducing 9 differences. I tried these samples in a couple different ways (on the same line or broken onto different lines) in different browsers and always came up with 6 differences instead of 9 differences.
The weird thing is, that website highlights 6 changes for that example (first and last three lines) but says at the top: Number of differences: 9 differences from 9 lines of code.
And the little folding button does indeed fold all nine lines.
When I instead try [a,b,c,1,2,3,x,y,z] against [a,b,c,1,2,3,a,b,c] it only shows "Number of differences: 3 differences from 3 lines of code." and folds the first 6 and the last 3 lines.
I still cannot reproduce the problem of 9 differences. I have tried in Edge, Chrome, and Firefox on Windows 10. I will try on OSX in just a second.
The reason it changes output after using square brackets is that the tool is language aware. Before, it could not guess at a programming language, so the input is text parsed into lines. After, the language detection guesses this is JSON format and formally parses the input, beautifies it, and then compares the beautified output.
I already realize that if I am going to the trouble of parsing languages, it would be far more efficient to simply compare the parsed token lists instead of beautified text, but I haven't gotten there yet.
Depends on what you want to achieve. When diffing source code, comparing lines, without caring too much about the broader context, may be a good-enough proxy for what you ideally want to compare, which is the program’s logic. But for the latter, you would first have to parse both programs and diff their parse trees.
Likewise – and perhaps most notoriously – simple line-based diffing does not suffice for diffing prose (natural language text), where knowledge of the broader context, markup (semantics in punctuation and document structure), and paragraph order are crucially important.
programs are self-contained entities, whereas prose often drags in multitudes of real-world context involving assumptions, motives, and other considerations.
a diff might inform you of an edit in chapter 2 where you changed a man's status from "divorced" to "widowed" (to make him appear more sympathetic), but it won't also remind you that you need to rewrite that section in chapter 25 where the man gets a phone-call from his former wife.
this type of thing happens a lot, in ways that are much more subtle than that silly example. and it's usually because the arc of the story is greater than (and different from) a simple sum of its pieces.
> I might be wildly misreading the code, but it seems like [1,2,3,a,b,c] diffed with [a,b,c,1,2,3] is going to appear as six lines of replacement.
A bit of a pickle.
The challenge I experienced with this is that either there are 6 changes (a line-for-line comparison) or there are 3 lines the same and 3 lines moved; but 3 lines moved means a deletion of 3 lines from the first sample and an insertion of 3 lines later in the second sample, which is still 6 differences. The output reads very differently, but the number of differences is identical, so there is no change in precision.
Unless my understanding of things is wrong, I'm pretty sure that unless the strong exponential time hypothesis (essentially, that n-variable CNF-SAT takes O(2^n) time in the worst case) is false, no diff algorithm can run in subquadratic time. If you could, then you would be able to solve longest common subsequence in subquadratic time, and that in turn implies SETH is false by https://arxiv.org/abs/1501.07053 .
I'm pretty sure exact diff in subquadratic time is therefore impossible. It's still a nice heuristic though.
Minor remark: If the edit distance between two n-length strings is (relatively) small, it is possible to find it in subquadratic time. More precisely, if the edit distance is at most d, you can solve the problem in O(nd) time. The upper bound d does not have to be known in advance. The conditional lower bound you mentioned holds only for large values of edit distance.
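For the curious, here's a minimal sketch in the spirit of that output-sensitive approach (a Myers-style greedy search; it counts inserts plus deletes, no substitutions, and is illustrative rather than a transcription of any particular paper):

    // returns the minimal number of insertions + deletions to turn a into b,
    // in time roughly proportional to (length of a + length of b) * distance
    function editDistance(a, b) {
        const n = a.length;
        const m = b.length;
        const max = n + m;
        if (max === 0) {
            return 0;
        }
        const v = new Array(2 * max + 1).fill(0);   // furthest x reached on each diagonal k
        for (let d = 0; d <= max; d += 1) {
            for (let k = -d; k <= d; k += 2) {
                let x;
                if (k === -d || (k !== d && v[max + k - 1] < v[max + k + 1])) {
                    x = v[max + k + 1];              // step down: an insertion
                } else {
                    x = v[max + k - 1] + 1;          // step right: a deletion
                }
                let y = x - k;
                while (x < n && y < m && a[x] === b[y]) {  // free diagonal moves while equal
                    x += 1;
                    y += 1;
                }
                v[max + k] = x;
                if (x >= n && y >= m) {
                    return d;                        // reached the end after d edits
                }
            }
        }
        return max;
    }

It works on strings or on arrays of lines, since it only indexes and compares elements.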
This "problem" may be more of a conceptual one rather than a technical one; for example, a file with "years = [1989,1870,2016]" diffed with "years = [1920,1889,1670]" according to you should appear as the move of 6 lines, it is technically true but doesn't help the user one bit, which is usually the purpose of a diff.
Hmm, interesting. The classic dynamic programming edit distance algorithm uses linear space (with a far better constant factor than this algorithm) and quadratic time. You build a table c where c[i][j] is the edit distance from the first i characters of file 1 to the first j characters of file 2. (You don't need to store the whole table.) There's a cute trick to get the actual diff out without using more space (by rerunning the algorithm a few times on portions of the input). To get below quadratic time on sensible input, IIRC, the standard approach is to think of the problem a bit like Dijkstra's algorithm and fill in the table in order by value, smallest edits first.
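For concreteness, here's a minimal two-row version of that table (plain Levenshtein over characters; illustrative only, and it returns just the distance, not the diff):

    // classic DP: prev[j] and curr[j] hold edit distances for prefixes of b,
    // so only two rows of the table are ever kept in memory
    function levenshtein(a, b) {
        let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
        for (let i = 1; i <= a.length; i += 1) {
            const curr = [i];
            for (let j = 1; j <= b.length; j += 1) {
                const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
                curr[j] = Math.min(prev[j] + 1,      // deletion
                                   curr[j - 1] + 1,  // insertion
                                   substitution);
            }
            prev = curr;
        }
        return prev[b.length];
    }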
I imagine you get to linear time without much loss by heuristically throwing out chunks of the table where i is far from j.
The article raises an interesting question: can you improve this by taking advantage of the fact that lines of code have a decent chance of being unique or nearly unique within a file?
> I imagine you get to linear time without much loss by heuristically throwing out chunks of the table where i is far from j.
I originally tried a delete-as-you-go approach to the hash map, where you remove entries that are no longer needed, such as when both values (counts) for a given key reach 0. My thinking was that this would improve execution speed over the space of a very large table by gradually reducing the key space and thus reducing lookup times. This thinking proved false against two 1.7 MB code samples. Removing this set of instructions reduced execution time from an average of 0.95 seconds to an average of 0.71 seconds (on my machine and browser). It is simply more efficient to read from a large hash map than to modify that hash map to improve reading efficiency.
A better approach, which I have not tried, would be to reduce writes to the hash map while running through the second code sample, because at that point you have already gone through the first code sample but not yet performed the formal analysis. To be a valid performance improvement, though, the analysis at this point would have to be tiny and yet somehow benefit the formal analysis coming in the next step. The underlying assumption is that having a smaller hash map means fewer writes to memory. To be helpful to the next step of formal analysis, the loop would have to achieve fewer iterations.
> The article raises an interesting question: can you improve this by taking advantage of the fact that lines of code have a decent chance of being unique or nearly unique within a file?
As a code author I cannot predict user behavior and do not try. Instead I make assumptions about that behavior and execute on those assumptions, whether or not they are valid. My assumption is that if a user wishes to compare unrelated things, the number of differences skyrockets, which means a bit more computation and temporary data. I suspect that if a user does this it is probably for some exploratory reason, and thus they are better prepared for the trivial increase in processing.
Yes, you can. There's an algorithm running in O((r+l) log l) time, where l is the number of lines and r is the total number of ordered pairs of line numbers at which the two files match [1]. So if all lines are unique, this runs in O(l log l).
[1] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977.
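The core idea can be sketched as computing the LCS via a longest increasing subsequence over the r matching (i, j) pairs. This is an illustrative reading of that approach (length only, no traceback), not the paper's exact algorithm:

    // LCS length via LIS over match positions: roughly O((r + l) log l)
    function lcsLength(aLines, bLines) {
        // record, for every distinct line, the positions where it occurs in b
        const positionsInB = new Map();
        bLines.forEach((line, j) => {
            if (!positionsInB.has(line)) {
                positionsInB.set(line, []);
            }
            positionsInB.get(line).push(j);
        });
        // tails[k] = smallest b-index ending a common subsequence of length k + 1
        const tails = [];
        for (const line of aLines) {
            const js = positionsInB.get(line);
            if (!js) {
                continue;
            }
            // walk j descending so two matches on the same a-line can't chain together
            for (let idx = js.length - 1; idx >= 0; idx -= 1) {
                const j = js[idx];
                let lo = 0;
                let hi = tails.length;
                while (lo < hi) {                    // binary search: first tail >= j
                    const mid = (lo + hi) >> 1;
                    if (tails[mid] < j) {
                        lo = mid + 1;
                    } else {
                        hi = mid;
                    }
                }
                tails[lo] = j;
            }
        }
        return tails.length;
    }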
As the paper itself says, this is still N^2 log N worst case :)
Note that these are all variants of the general algorithm I posted, and more generally, variants of tricks used in Boyer-Moore (though Hunt's work predates Boyer-Moore, that's the easiest way to describe it), which means they try to skip parts of the text they can prove can't match.
Because they can't always do so, they don't change the worst case time bound, only various other time bounds.
"The article raises an interesting question: can you improve this by using advantage of the fact that lines of code have a decent chance of being unique or nearly unique within a file?
"
No.
Well, in theoretical time bound, no.
In practice, also no :)
On hackernews, maybe!
You can't just throw out unique lines; they may be in the middle of a replacement.
IE
a
e
c
a
d
c
Note: edit distance is zero until unique line occurs :(
But what if you just start at the unique lines and compute the edit distance from there, somehow ignoring the equal lines?
This is basically just transforming it into another problem with the same time bound, where that problem is "where is the next/previous point in the file where the streams misalign, if i assume it was aligned before point x".
Sadly, this problem is not any faster to solve. For simple proof, think of it this way: The unique line info just tells you, for sure, that the next alignment point is at least N characters away, where N is the line length, when you are staring at a given character. This is, at best, a constant factor faster, where the constant factor is the average line length * number of unique lines or something like that.
It's actually just an application of the boyer-moore trick.
But it's still not theoretically faster, because edit distance is really just a stream alignment algorithm where, instead of stopping, you just keep counting how misaligned they are.
IE edit distance is already a measure of stream alignment, and it's already a very fast one.
Now, you could make a diff algorithm out of the above if you wanted.
Find unique lines, or also lines with count mismatches between a and b (since they are at least adds or deletes). Starting from there, segment the file into pieces by stream misalignment points. This is equivalent to splitting the file into segments, where each segment has the same edit distance.
While computing, hash each aligned segment to map to where it appears on each side (we need to know where it moved to, and this is a single O(N) pass anyway).
Now you know, for everything that moved, where it moved to, for everything that's the same, where they start and end being the same, etc.
From that you can output add, remove, changed as you will.
If this sounds familiar, it's because it's what the algorithms do already for the most part, you are just starting the diff in a different place.
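A rough sketch of just the anchor-finding step described above (unique lines only; the count-mismatch cases, the segmentation, and the move bookkeeping are left out, and the names are mine):

    // lines that occur exactly once in both files, with their positions;
    // segments between consecutive anchors can then be hashed or diffed separately
    function uniqueAnchors(aLines, bLines) {
        const counts = new Map();
        const bump = (line, side, indexKey, index) => {
            const entry = counts.get(line) || { a: 0, b: 0, ai: -1, bi: -1 };
            entry[side] += 1;
            entry[indexKey] = index;
            counts.set(line, entry);
        };
        aLines.forEach((line, i) => bump(line, "a", "ai", i));
        bLines.forEach((line, i) => bump(line, "b", "bi", i));

        const anchors = [];
        for (const [line, entry] of counts) {
            if (entry.a === 1 && entry.b === 1) {
                anchors.push({ line, ai: entry.ai, bi: entry.bi });
            }
        }
        return anchors.sort((x, y) => x.ai - y.ai);
    }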
Edit distance can already be done in O(s * min(m,n)) time, where s is the maximum edit distance (which is still O(mn) if the file has completely changed). That's going to be hard to beat.
Truthfully, the real good stream alignment algorithms you would want to look at here are the various algorithms used to do gene sequence alignment :)
But it's not going to be faster unless you are diffing gigabytes of text. Only better in the sense that they use heuristics to do better than edit distance when there are multiple possible points of alignment for a given piece of text.
(IE diff falls down, among other places, where a piece of code could validly have been said to be duplicated to two places. Usually it will notice one of them, and consider the other a replacement or an add of brand new text.
Using those algorithms, or the above would let you augment this by saying "it's not an add of really new text, I duplicated thing X to two new places")
It's nice to look at, but it doesn't actually work that well, because it has no real way of stream alignment other than simple equality (stream alignment == figuring out the points where two streams become equal. This is what the edit distance calculation gives most diff algorithms)
So various forms of lines that have been moved will always be shown as replacement, for example
(I see someone else discovered this)
Most of the cost and complexity of existing diff algorithms is essentially stream alignment (that's what the dynamic programming problem they solve tells them), so yes, removing stream alignment will in fact make your diff algorithm "simple" :)
The fix function is also just a hack for no real way to align streams on a per-character basis, so it has no way of, for example, maximally extending matches until it's done.
You could avoid a lot of what it does by not using lines as your basis, but characters instead.
The direct translation here would be:
1. build segments by starting at the beginning of both files, incrementing one file until they match, and then incrementing the other until they stop matching (this gives you one or two segments: the mismatch segment, which may be empty but represents the part where they don't match, and the matched segment, which will be maximal instead of line split).
This is O(min of both file sizes), just like the current hash building.
You do have to keep track of the line number, but it doesn't change time bounds.
2. hash/index the segments the same way you are hashing/indexing lines.
3. do rest of described algorithm.
the only upside/downside is you end up with partially changed lines (IE there may be more than one segment per line), but here you can detect that immediately (you can even just mark where this has happened while you build segments) and transform to replacements if that's what you want instead of trying to notice later that this is what happened.
Split file2 into x-sized blocks.
Use them as patterns you search for against file1.
Simple, and same worst case as naive edit distance.
Note that if you had a good enough rolling hash, you could do the same thing in O(N) time, by not doing the equality check in the hashtable, and instead just issuing replacement if it turned out you were wrong :)
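A rough sketch of that block-matching idea with a polynomial rolling hash (the names are illustrative; hash collisions would need either a verification pass or the issue-a-replacement-when-wrong trick mentioned above):

    // index fixed-size blocks of file2 by hash, then roll a window over file1 in O(N)
    function blockMatches(file1, file2, blockSize) {
        const B = 257;                 // hash base
        const M = 1000000007;          // modulus
        let pow = 1;                   // B^(blockSize - 1) mod M, for removing the leading char
        for (let k = 1; k < blockSize; k += 1) {
            pow = (pow * B) % M;
        }
        const hashOf = (s, start) => {
            let h = 0;
            for (let k = 0; k < blockSize; k += 1) {
                h = (h * B + s.charCodeAt(start + k)) % M;
            }
            return h;
        };

        // hash every non-overlapping block of file2
        const index = new Map();
        for (let i = 0; i + blockSize <= file2.length; i += blockSize) {
            index.set(hashOf(file2, i), i);
        }

        // slide a window over file1, updating the hash in O(1) per step
        const matches = [];
        if (file1.length < blockSize) {
            return matches;
        }
        let h = hashOf(file1, 0);
        for (let i = 0; ; i += 1) {
            if (index.has(h)) {
                matches.push({ file1Offset: i, file2Offset: index.get(h) });
            }
            if (i + blockSize >= file1.length) {
                break;
            }
            h = (h - (file1.charCodeAt(i) * pow) % M + M) % M;   // drop the leading char
            h = (h * B + file1.charCodeAt(i + blockSize)) % M;   // add the trailing char
        }
        return matches;
    }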
I don't think it is -- it's full of the kind of features (complicated condition expressions, long chains of elses, hard-coded numbers) that, when I find myself writing them, suggest that I'm approaching the problem in the wrong way, because previous experiences make me associate that kind of code with lack of generality and corner-case bugs.
which is what most diff tools use. It is a simple and clear algorithm with no special cases once you get your head around it. It works in O(N*M) time, which seems like a reasonable lower bound (you need to compare everything to everything else to have a chance of getting the best alignment), although there are ways to do better with constraints.
(I remember one for gene alignment which broke N and M in two, recursed 4 times on each pairing, and then had a quick-ish way to put those together again. Can't remember the details though!)
The standard diff only requires one pass through the data. Am I missing something? I can't understand the value/utility of this, given this is likely slower and more ad-hoc (likely incorrect) than the standard diff and its cousins.
The minimum number of data passes is two, one for each submitted sample. The Pretty Diff algorithm, the subject of this thread, uses three complete passes through the data without any repetition. Therefore the number of iterations is simply: the total number of lines of the first sample, plus the total number of lines of the second sample, plus the total number of lines of the smaller sample.
A more optimized approach is 2 complete passes and a partial third pass without repetition. Achieving that optimization requires more logic up front to populate a smaller central analysis store and a different means of iterating through that container.
Most approaches, as I have seen them, don't achieve a fully optimized approach. While they may not have complete passes through the data (after the required initial two), they tend to have numerous smaller passes in order to derive edit distances. This could be more efficient if these smaller passes never pass through the same data indexes/keys more than once and achieve a reduced total number of iterations. These approaches seem less straightforward to me and are misleading in terms of total statements of execution/iterations.
The only way to guarantee greater execution efficiency is to run through a checklist like this and compare clock times in similar execution contexts:
Another useful quality of a diff algorithm is being able to understand and detect moves. This is not commonly implemented but one real-world example is https://www.semanticmerge.com/.
On the subject of diff, can anyone recommend a resource to visualize the difference between two ordered lists? This website does visualize things nicely, but I would love to see a similar article written solely on the visual aspect.
For a long-term solution you may be better served with a domain-specific tool that fits your needs directly. In the meantime, the Pretty Diff tool, http://prettydiff.com/, is language aware and offers a fair number of features that may provide what you are looking for. To see the options GUI just scroll down. You can also enter options and values as URI parameters. Just see the documentation for the options and their requirements.
If you are looking for something that isn't available please open a Github issue and I will triage it.
Thanks for the reply. I was thinking more generally about how to display differences, not so much for code, although that definitely does help. More of a Tufte treatise on difference.
I'm working on a tool to calculate search relevancy, and it would be nice to visually display to users the difference between two queries.
github needs to read this article. I've seen some pretty awful diffs on github and have had to resort to using CLI diffs, which tend to be superior. For a company whose bread and butter is code, vcs, commits, PRs, etc., you would think their diffs wouldn't be scraping the bottom of the barrel.
I believe the classic for typical textual diff is this article by Myers, whose algorithm is still the default in git: http://link.springer.com/article/10.1007/BF01840446
Git has two other diff algorithms, patience and histogram: http://alfedenzo.livejournal.com/170301.html https://github.com/git/git/commit/8c912eea94a2138e8bc608f7c3...
For binary/executable code, I believe Colin Percival's bsdiff is the best: http://www.daemonology.net/bsdiff/ although he hints that his thesis contains a better algorithm.
For just executables, however, I think Google Chrome uses Courgette, which actually performs disassembly first: https://www.chromium.org/developers/design-documents/softwar...
Also useful is libxdiff, which is a C library offering various diff utilities: http://www.xmailserver.org/xdiff-lib.html