It would be awesome to visualize Wikipedia edits over time. I don't really care about what the text says, just how blocks of it change over time. I am after the aesthetics present in the ever-flowing change of data. I think your script might be a good starting point. Think of the videos of flowers growing that compress months into a few seconds. Do something similar with Wikipedia edits.
I've been doing something similar with my blog, except it is currently for people who do care what the text says. The diff algo is tricky; I should have built up a larger corpus of material before designing it.
It looks at paragraphs, sentences, sub-sentence structures, words. It even draws little sparkgraph-ish diagrams. It is not really that long (250 lines by wc) but it has been a huge time sink for tweaking.
I was hoping to see a compact implementation of a diff algorithm. However, the script seems to rely on the `diff` utility already present on the system. Not a bad thing, but I was expecting to see something else.
I went with a don't-reinvent-the-wheel approach. Anything I did would have at least doubled the time to write the script and probably yielded a diff half as good.
I think it would be useful. Consider this example.
The original sentence is: "He went to the stoar." Person A sends this to Person B for review. While Person B is reviewing it, Person A changes the sentence to "Bill went to the stoar." Then Person B sends back the spelling correction: "He went to the store."
If the sentence were a single line for `diff` to work on, this would be a merge conflict. If each word were its own line, this would merge cleanly.
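To make the example concrete, here is a rough sketch in Python. It uses the standard library's `difflib` rather than the `diff` utility, and the helper names `edits` and `merge_tokens` are made up for illustration. It is a naive three-way merge under simple assumptions, not how any real merge tool works: edits whose base ranges overlap count as a conflict, and two different insertions at the same point are both kept rather than flagged.

```python
from difflib import SequenceMatcher

def edits(base, other):
    """Non-equal opcodes from base to other as (start, end, replacement)."""
    sm = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2, tuple(other[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def merge_tokens(base, ours, theirs):
    """Three-way merge of token lists; raises ValueError on overlapping edits."""
    merged, pos = [], 0
    # set() drops edits that both sides made identically.
    for start, end, repl in sorted(set(edits(base, ours) + edits(base, theirs))):
        if start < pos:
            raise ValueError("conflict: both sides changed the same tokens")
        merged.extend(base[pos:start])  # untouched tokens before this edit
        merged.extend(repl)             # the edited tokens
        pos = end
    merged.extend(base[pos:])
    return merged

base = "He went to the stoar."
ours = "Bill went to the stoar."
theirs = "He went to the store."

# One token per word: the two edits touch different tokens and merge cleanly.
print(" ".join(merge_tokens(base.split(), ours.split(), theirs.split())))

# One token per sentence (like one line for diff): the edits collide.
try:
    merge_tokens([base], [ours], [theirs])
except ValueError as err:
    print(err)
```

With word tokens the result is "Bill went to the store."; with the whole sentence as one token, both sides rewrite the same token and the merge is rejected.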
You've just described dwdiff. Not the smartest diff algo, though its strong point is producing diffs of prose that are human readable. I used it as the core of a wiki with 'perfect' collision resolution. Two people could edit at the same time, and the 2nd to hit save would have their work merged in with dwdiff instead of having their work automatically discarded.
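For a feel of what a human-readable word diff of prose can look like, here is a small sketch. It is not dwdiff's algorithm, just Python's `difflib` run over whitespace-separated words, and `inline_word_diff` is a hypothetical name. The `[-...-]` and `{+...+}` markers follow the wdiff-style convention for deleted and inserted text.

```python
from difflib import SequenceMatcher

def inline_word_diff(old, new):
    """Return one line of text with deletions and insertions marked inline."""
    a, b = old.split(), new.split()
    out = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
            continue
        if i2 > i1:  # words removed from the old text
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # words added in the new text
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)

print(inline_word_diff("He went to the stoar.", "He went to the store."))
# prints: He went to the [-stoar.-] {+store.+}
```

Because the changes stay inline, a fully rewritten sentence still takes up about one line of output instead of one line per word.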
It might sort of drown you: a 15-word sentence totally changed to a different 15-word sentence would take up 30 lines, and my default terminal size holds 36 lines.
But really because I never thought of it. Might give it a try and see if it's better.
I could use this as part of git itself for comparing LaTeX document revisions. The current line-oriented diff has all the problems that sentdiff tries to solve.