I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)
So I looked at the paper and it seems interesting. Basic idea: instead of the operations to consider being only "insert", "delete", and "copy", one adds "reorder", "contract subtree", and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though the extra ops increase the search space, they actually let you search more effectively. I can buy that argument.
The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!
Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it into the first or the second position; the other one would be classified as an "insertion" by any ins/del/copy algorithm. If you instead opt NOT to make that choice, you can say: pick the source `a` and duplicate it instead.
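A toy way to see the bias (my own encoding, just for illustration, not the thesis's representation):

# Toy edit scripts over flat lists. Diffing t = [a] against u = [a, a]
# with only insert/delete/copy gives two equally good scripts, and an
# algorithm must pick one arbitrarily:
script1 = [("copy", "a"), ("insert", "a")]  # source 'a' becomes position 0
script2 = [("insert", "a"), ("copy", "a")]  # source 'a' becomes position 1

# With a duplicate operation there is a single unbiased script: the source
# 'a' is simply used twice, and no position is singled out as "the new one".
script3 = [("duplicate", "a")]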
Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:
I’d imagine there’s some challenging judgement calls that such a tool would have to make. Like, in Go, you can reorder the members of a struct definition. In many cases this is just diff noise to reviewers. HOWEVER, it does impact the layout of the struct in memory, so it can be semantically meaningful in performance work.
A wild nitpicker appears. I understand where you're coming from & why this matters. But Go, the language spec, doesn't make any guarantees about struct layout at all. A layout difference may be meaningful, practically, but it's potentially unreliable.
PHP long stated that associative array sorting order was unstable and not guaranteed (especially when the union (+) operator or array_merge function were involved) - that doesn't mean ten bazillion websites wouldn't instantly break if they ever actually changed the ordering to be unpredictable.
Language designers need to contend with the fact that the ultimate say in whether a behavior is part of the language or not is whether that behavior is observed and relied upon.
Didn't Ruby actually do exactly this, though? And it broke a million websites, and they changed it back in the next version and have made it explicit ever since? To me that is much stronger evidence than what we think would happen if PHP did it.
I don't know about Ruby, but one example I can think of where a language made the instability explicit is that early on in the language Go changed the behavior of the select statement:
> If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.
In an early implementation it would pick in lexical order, IIRC (and the specification did not mention how a communication should be picked). Not only could this lead to bugs, apparently some people were relying on it and they didn't want that.
The tl;dr is that there's an almost infinite number of ways to atomize/conceptualize code into meaningful "units" (to "register" it, in my supervisor's words), and the most appropriate way to do that is largely perspectival — it depends on what you care about after the fact, and there is no single maximal way to do it up front.
I mean, to have an improvement over the status quo, we simply need to find a conception that works better than lines as units of code. Let's not let the perfect be the enemy of the good.
love to tell someone who literally wrote a master's thesis on a topic what we need to "simply do" to solve it lol. I almost want to admire the confidence, but...
>>I’d imagine there’s some challenging judgement calls that such a tool would have to make
Just thinking about it makes my head spin. I spend a lot of time working out font/color hierarchies, supplementary to coding and data viz. Arguably what you're bringing up is a case for a carefully colored diff that visually cues whether something is a true semantic change or indicative of a lower level issue. I'm comfortable with reading a plain ol' diff that just shows me what changed, superficially, and interpreting it. While I think OP's idea is awesome, it also might create more confusion than it resolves; and resolving confusion is the point of a diff.
Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).
The real difficult part is in how you represent AST-level changes, which will limit what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:
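(Illustrative values; say p moves the 2 into the second list, while q moves it to the end of the first:)

([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])
([1,2,3], [4,5]) -- q --> ([1,3,2], [4,5])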
Both p and q move the same thing to different places, so they need a human to make a decision about what the correct merge is. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's without even adding insertions or deletions yet. Because now, say p was:
([1,2,3], [4,5]) -- p --> ([1,3], [2,5])
One could argue that we can now merge automatically, because '4' was also deleted, hence the position at which we insert '2' into the second list is irrelevant.
If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)
One thing I do find interesting (and wish were different) is that only programming languages are supported, rather than data formats as well.
For example, two JSON documents may be valid but formatted slightly differently, or a common task for me is comparing two YAML files.
Comparing config files that have a well-defined syntax and/or can be abstracted into a tree (JSON, YAML, TOML, etc.) would be absolutely lovely, even and including (if possible) Markdown and its ilk.
JSON and CSS are supported today, and I'm interested in adding more structured text formats.
If a format has a tree-sitter parser, it can be added to difftastic. The TOML tree-sitter parser looks good, but there isn't a mature markdown parser for tree-sitter. There are other markdown parsers available, so in principle difftastic could support markdown that way.
The display logic might need a little tuning for prose-heavy formats like markdown though. I'm not happy with how difftastic handles block comments yet either.
I'm not sure about formats that contain more prose, such as markdown or HTML.
I think supporting XML would be something a lot of people would appreciate. That XML is difficult to diff comes up again and again... However, one would need to decide whether one wants to compare by syntax or by meaning. The latter may be preferable, but would require the XML to be canonicalized on both sides first.
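FWIW, Python's standard library has had W3C C14N support since 3.8, so a "canonicalize both sides, then compare" sketch is only a few lines (strip_text is an assumption about which whitespace you consider insignificant):

# Canonicalize two XML snippets, then compare; attribute order and
# quoting style stop mattering.
import xml.etree.ElementTree as ET

a = ET.canonicalize('<root b="2" a="1"><x/></root>', strip_text=True)
b = ET.canonicalize("<root a='1' b='2'>\n  <x></x>\n</root>", strip_text=True)

print(a)       # <root a="1" b="2"><x></x></root>
print(a == b)  # True: the two documents are the same, syntax aside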
Indeed. One could just do `diff $(jq . $fileOne) $(jq . $fileTwo)` and you'll end up with a "nice enough" diff even if $fileOne and $fileTwo were very differently formatted.
The problem is when a file also needs to be normalized - e.g. object keys in a different order, YAML syntax expansion. It can be very useful to indicate when a JSON file is identical to another JSON file but some of the properties or array items are out of order, and that requires more in-depth knowledge of the data format. Not to mention that you could UTF-8 encode characters or write out the same character using backslash notation, numeric or boolean data that might be wrapped in a string in one file but not in another, etc. There can still be a lot of modelling and interpretation to consider when comparing data files rather than code files.
I'm not too familiar with YAML, so can't answer to that.
But re JSON:
> object keys in a different order
They can't be "in a different order" as JSON keys are not ordered. They can be in whatever order, and would still be considered the same.
> array items are out of order
Then it's different, as JSON arrays are ordered. ["a", "b"] is not the same as ["b", "a"], while {"a": 1, "b": 1} and {"b": 1, "a": 1} are the same.
> you could UTF-8 encode characters or write out the same character using backslash notation, numeric or boolean data that might be wrapped in a string in one file but not in another
Then again, they are different. If the data inside is different, it's different.
I understand that logically they are the same, but not syntax-wise, which is why I included the "differently formatted" disclaimer. It obviously wouldn't understand that "one" and "1" are the same, but then again, should it? Depends on the use case, I'd say; hard to generalize.
> They can't be "in a different order" as JSON keys are not ordered. They can be in whatever order, and would still be considered the same.
This is what GP is saying, I'm pretty sure. Object member order is non-semantic in json, so in order to do a semantic diff (one that understands structure), you need to canonicalize the order of the two sides. Simply diffing the output of jq doesn't do that, because (afaik) jq doesn't alter the order.
Basically, if you want this to come up the same:
{"a":"b","c":"d"}
{"c":"d","a":"b"}
you need more than just `diff $(jq) $(jq)`.
You can argue about whether a tool like difftastic should do that, I guess, but I would personally lean towards it being smart enough to see this, because it's precisely the sort of thing that both humans and line-based diff can be awful at seeing.
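A quick sketch of that extra step (jq -S, which sorts keys, gets you most of this too; the Python is just to show the idea):

# Canonicalize JSON before diffing: parsing normalizes \uXXXX escapes and
# number spelling, sort_keys canonicalizes member order (non-semantic per
# the spec), while arrays - which ARE ordered - are left alone.
import json

def canonical(text: str) -> str:
    return json.dumps(json.loads(text), sort_keys=True, indent=2)

left  = '{"a":"b","c":"d"}'
right = '{"c":"d","a":"\\u0062"}'  # same data, reordered and escaped

print(canonical(left) == canonical(right))  # True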
Nitpick: diff takes filenames as arguments, so comparing the output of two commands would need the `<()` expansion. So the command would be `diff <(jq . $fileOne) <(jq . $fileTwo)`
This is kind of like the problem of programmatically analyzing AWS IAM roles and policies to understand impact of changes. Very difficult to do in JSON format but worth tons of money to CISOs if it can be solved.
Similarly, I would love it if Pandoc’s AST were supported. Or, if this could be extended to compare any documents taking formatting into account, or document-to-document conversions.
This isn't going to add anything to existing diff tools for JSON or YAML though. Those formats barely have any syntax highlighting or complex structures.
Same; I don't know how many times I've done a diff and wished there were a smarter solution that could take formatting and whitespace into account. This is it. I wish git diff would incorporate this; it would be a real treat.
Funny side note: I had a flat mate once who was on a working holiday from Japan.
He was in love with and endlessly curious about English slang; it's basically all we talked about.
I remember explaining to him why my uni friends and I referred to things as being “craptastic”, starting with American marketing’s love affair with the portmanteau.
He got it pretty quickly and enjoyed using it in conversation.
The saying that was harder for him to understand was “fuck all”. He always wanted fuck to be the verb, rather than using “fuck all” as the adjective, so he would say things like “I fuck all my money last night at the pub”.
This is written by the same guy who wrote Helpful, an enhancement package for the Emacs Help buffer. I highly recommend checking out Helpful if you haven’t seen it. https://github.com/Wilfred/helpful
EDIT: Wilfred IS the original author [3]; my apologies.
Not to discredit Wilfred (it looks like he's taken over the project as the maintainer), but, based on the historical contributions [1], it looks like it was originally developed by Max Brunsfeld, who also created Tree-sitter. [2]
I think the contributor graph is misleading, and that he's using git-subtree to vendor tree-sitter, which makes it look like others have contributed more to the project.
Agreed. It’s so good it feels like it should have been that way all along. For example, when you view the help for a function Emacs has always given you a link to the source code where that function is defined. Helpful shows you the source code right in the Help buffer, and shows you a list of callers, and gives you buttons that enable tracing or debugging for the function.
Once I discovered Helpful, all of those things seemed so obviously useful that I can’t understand why nobody else thought to put them there, including myself.
The best part is the forget function, for when functions are incompatible. As an example, lsp won't work for me unless I forget the project-root function from ess-r (I have no idea why this hasn't been fixed) and helpful makes this a two or three key activity.
This looks really cool and I can't wait to try it, tho... a bit of a PITA to get running. ;) Took a while to figure out how to build, and had to install 400MB of dependencies first....
Edit: And after installing cargo, watching it fail to build, then determining I must need a newer version of cargo, so I built that from source... it fails. Apparently I need to install `rustc-mozilla` and not `rustc`. "obviously".
This is all a testament to how much I want to try this tool...
MOAR EDIT: even with rustc-mozilla cargo fails to build. running `cargo install difftastic` gives me an error about my version of cargo being too old ;.;
Ah, well, if you're willing to accept having a frankensystem with a mix of packaged and unpackaged software, sure. ;) I used to do that, back in Slackware days.
It's considered really sloppy and unmaintainable to admin a system like that. Things quickly get out of hand.
That strategy _does_ work if you isolate it to a chroot or a container, but littering /usr/local with all sorts of locally compiled upstream is just asking for future pain. Security updates, library incompatibilities, &c.
Prebuilt binaries might be nice, but I don't expect them for random projects. (and I wouldn't have used them if offered) I do think it's a reasonable expectation to be able to build software w/o essentially setting up a new userland just for that tool though. :)
I've also had requests from Alpine Linux packagers to allow dynamic linking to parsers. This is something I want to support in future, once I'm happy with the basic diffing logic.
Not sure about Go, but Rust still links against glibc, so I sometimes have to recompile things to make them work on my Debian systems if they're built against newer glibc.
Using vim has nothing to do with one's ability to troubleshoot compiler/Ubuntu issues. Plus, both compiler and Ubuntu issues can be a massive PITA to solve even if you're familiar with them. Personally, if I'm trying to install something on a whim to try it out and I start getting "no such file or directory" errors, I'd be upset that something is going wrong.
The GCC version in Ubuntu 18.04 is too old. I had the same problem; I just installed clang, updated the default C++ compiler, and it worked. There is an issue in the repo about that.
How did you do it? When I tried to rebuild cargo I got build errors. I'm starting to suspect the only way to run this tool is make a chroot tracking sid or something....
I feel the same way, I am just not willing to pipe curl into a shell blindly.
Even if this specific instance of curl'ing into sh is safe, or if I download and then run it, it's still extremely poor practice and gives me serious doubts about the developers and their security practices in general.
I also do not like it when every project decides to poorly reimplement the package manager. If every piece of software used its own package manager, my system would be a complete mess with dozens of different package managers fighting each other, and it would be a total nightmare to update the system or manage non-trivial dependency chains when installing something new.
Rust is one of my favorite languages but this is definitely my least favorite aspect of it all. It really feels like the developers "optimized" for systems with no package manager.
Sure, but first I had to figure out wtf "cargo" is. :P
Also, `cargo install difftastic` AIUI pulls it from a central location, if I'm gonna poke at software for the first time, I enjoy building it myself first, so I can get my hands dirty in the source. :)
Honest question: how did you arrive at the conclusion that you needed rustc-mozilla? I would love to make sure whatever flow led you to that is made clearer for other newcomers, because that is definitely not something anyone who isn't working on Firefox should even try.
My favorite dev tool is diff2html - a CLI that opens up your browser with a rich diff. Pro tip: alias `diff` to the command so you can launch it quickly ;)
I would love it if version control stored an AST that also includes comments and dividers (where right now we would leave an empty line) and dev machines rendered it out however they wanted. They could even change the language of keywords in addition to normal formatting.
To do this requires some standard way of encoding an AST which includes comments and dividers.
That standard format is commonly known as source code - although it lacks a normal form.
Tools like prettier, gofmt and black can be thought of as a way to produce a normal form of source code.
This is (IMO) a reasonable incremental approach towards exactly what you describe - if a project checks in only source code that's formatted using a standardised format, then you're free to work on it using whatever equivalent representation you like - as long as it's converted back at commit-time.
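As a rough sketch of the commit-time half (assuming black as the normal form; the hook wiring itself is hand-waved):

# pre-commit-style normalizer: rewrite the given Python files into black's
# normal form so the repository only ever contains one representation.
import pathlib
import sys

import black  # the formatter itself; pip install black

changed = False
for path in map(pathlib.Path, sys.argv[1:]):
    src = path.read_text()
    normal = black.format_str(src, mode=black.Mode())
    if normal != src:
        path.write_text(normal)
        changed = True

# Exit nonzero so git aborts the commit and the reformatted files can be
# restaged; a real hook would handle this more gracefully.
sys.exit(1 if changed else 0)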
The challenge for a tool like difftastic is that I can't guarantee that syntax is well-formed. You might be using new syntax that my parser doesn't support, you might have merge conflicts, or you might have a plain syntax error in your code.
Tree-sitter handles parse errors gracefully, so difftastic handles syntax errors pretty well in my experience.
Yep, I posted this idea on Reddit recently and people said they need a formatted syntax because of diff and version control. We do not: keep the AST, reformat it in the editor as the particular user fancies, and generate diffs and version-control artefacts as a particular user sees fit, too. Our computers are very fast, so you could build a lot more different views of your code than we have now by using the AST instead of text and regexps.
Well, the comment I was responding to was about storing the AST, not diffing it. If that's what you meant, then the one follows naturally from the other. Once the file format is the AST instead of its visual representation, it makes sense to implement lots of operations as DSL extensions instead of library features, because the language is the library, in a sense. MPS is marketed as a DSL tool, but what it really is is a projectional language tool.
SemanticMerge sounded interesting enough so I wanted to check it out, but to my surprise there is no Buy or Download link anywhere on the site. The only thing that might do it is a Login link, but I don't want to create an account just to see how much the thing costs. Is it only sold in bulk to companies? I find it bizarre that there isn't even a "contact sales" button.
That's incredibly annoying! They must have changed something about their pricing and sales model since the time that I had purchased it. I don't understand why companies think that's a good idea. I guess I can't recommend it anymore.
I was interested in SemanticMerge/XMerge, but when I looked they didn't have a Mac client, and now it looks like they don't have a personal edition. I just want to buy a private license and use it locally. https://semanticmerge.com
Personally I long for a syntactic merge-tool. Every time Syncthing hiccups for some reason, I'm up for a merge session with my Org-mode files, in the vein of: ‘These properties look just like those ones, only with a different timestamp... Oh lookie, and the heading is totally changed. Let me merge this new heading all over the old one, and then pop in the old one after it.’ Dammit, it's just a whole new heading added with properties. This happens with every language heavy on markup.
However, I'm not sure if Org markup lends itself to structuring that would allow proper diffing—even with just the headings.
It would be good to have different git merge strategies per file type.
e.g. a merge that knows properties files allow the same property to be added in different places, but only once (something like the sketch below).
And another strategy for when order is significant.
It would be cool to have an HTML merge that recognises the tree structure and supports merging tags, with the indentation following some rules.
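Git's per-path merge drivers can express the properties case; here's a rough sketch (the driver name and wiring are made up for illustration):

#!/usr/bin/env python3
# Sketch of a git merge driver for key=value .properties files.
# Hypothetical wiring (names are illustrative):
#   .gitattributes:  *.properties merge=props
#   .git/config:     [merge "props"]
#                        driver = merge-props.py %O %A %B
# Git passes ancestor/ours/theirs paths; the result must be written back
# to "ours" (%A); exit 0 means cleanly merged, nonzero means conflict.
import sys

def read_props(path):
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                props[key.strip()] = value.strip()
    return props

ancestor, ours, theirs = (read_props(p) for p in sys.argv[1:4])
merged, conflicts = dict(ours), []
for key, their_val in theirs.items():
    if key not in merged:
        merged[key] = their_val               # they added it; position is
    elif merged[key] != their_val:            # irrelevant, duplicates merge
        if merged[key] == ancestor.get(key):
            merged[key] = their_val           # only their side changed it
        elif their_val != ancestor.get(key):
            conflicts.append(key)             # both changed it differently
# (Deletions on their side are ignored here; a real driver would handle them.)

with open(sys.argv[2], "w") as f:             # write the result back to "ours"
    for key in sorted(merged):
        f.write(f"{key}={merged[key]}\n")
sys.exit(1 if conflicts else 0)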
I believe git supports custom merge strategies; it's been on my todo list forever.
Looks really cool, but there were no instructions on how to install it.
I would recommend putting an installation guide in your readme - a full installation guide.
I followed the link to your manual, and then it told me to install your tool using a tool called "cargo", with no reference on how to install cargo. At this point I gave up. Lazy, maybe, but for a convenience tool like this I want a convenient installation.
Cargo is Rust's build tool/package manager and can be installed easily using rustup. But I would probably suggest the difftastic maintainers add some prebuilt binaries to the releases.
I think it's wonderful that there's an explosion of new exciting languages, it can only improve the quality of all our tools. I for one am looking forward to replacing my eons of MATLAB experience with Julia.
But I wish there were more of a convention in the F/OSS community that if your software isn't written in something universal (C, C++, shell, and maybe Python), then it also comes with a container of all that's necessary to run it.
It's frustrating to pollute my nicely package-managed system with hundreds of locally installed Python modules just to run one tool. Or, in this case, to backport and rebuild a language-specific build tool simply to compile. :)
Have you used pipx? I really like it for installing Python tools because it automatically creates a virtual environment for them so that their dependencies don't affect anything else.
I used to straddle the two worlds, maintained and supported a multi-site AD domain with AFS integration for user $HOME and some sort of unholy LDAP/kerberos bridge for login. About once every year or two I'll miss something about the way Windows does things, compared to normal (meaning "linux"). Like the NTFS permissions model, that's cool.
But it's just once a year :) And the last time I was deep in windows was win7, whenever that was. I tried to use a win10 machine and gave up.
Besides, I thought the big new feature in modern windows was that WSL improved to the point you can run unix tools! ;)
> About once every year or two I'll miss something about the way Windows does things, compared to normal (meaning "linux"). Like the NTFS permissions model, that's cool.
FreeBSD would be up your alley. Its native ACLs are NFSv4 format, a superset of NTFS ACLs. You need to enable it explicitly on UFS2, but it's default on ZFS.
> and some sort of unholy LDAP/kerberos bridge for login
It's really not that bad; the AD-IPA cross-forest trust is really solid, as is the native sssd-ad integration if IPA is too much. Honestly, I can't really imagine it any other way now; so much work has been put into AD support that it's actually the best login experience on Linux at the moment. OpenLDAP is definitely showing its age -- don't get me wrong, I use it for all my personal infra because it's free and my use-cases are dead simple, but we got to delete so much bespoke code after migrating off it at work.
I'm not sure, and you undoubtedly know more and are more up to date than I, but I don't believe any of these things existed in 2005, when I was on the aforementioned team. Or, maybe they did exist but management decided an internal implementation was better.
Getting Windows to accept the user profile in an AFS path I recall being particularly vexing.
Only diff is I got to the point where it said I needed "cargo". On a whim, I typed "aptitude install cargo", and it did something. Now waiting for the >1GB source repo to clone to see if it works.... ;)
Looks like you need to install the Rust programming language and compile it. It worked for me. Not sure if I like the installation method. It seems the executable is portable though.
Because it is much easier: you don't have to build and maintain parsers for hundreds of languages. And you don't need just any parser, you need very robust ones that can deal with malformed files well. Or, if you only pick a small set of supported languages, your diff tool will not work on most files, or will have to fall back to a structure-agnostic algorithm. Also, not all text files even follow any useful grammar at all.
Finally, even if you have a syntax tree, that is just part of the solution, and probably the smaller one. Detecting three lines of code wrapped in a new if statement is easy, but also doesn't benefit much from a syntax-aware algorithm. But once you change names and signatures, extract methods, introduce constants, and so on, it becomes progressively harder to match subtrees, and one is probably quickly approaching the territory of NP-hard and undecidable problems.
> And you don't need just any parser, you need very robust ones that can deal with malformed files well.
I very much agree. I feel there has been a trend recently where people (re)discovered how cool and useful ASTs are and now expect everything to be using them. I suspect old-school computer scientists might be secretly laughing at this while programming with some Lisp-like languages they invented for themselves.
Jokes aside, I do wonder how modern IDEs manage to parse broken source code into usable ASTs --- is this trivial (CS-theory-wise) or is there a lot of engineering secret sauce involved to make it work?
With only basic knowledge of the domain, I would assume it is hard and ugly. If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there cannot be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file, and that is probably not easy. Also, if you do this online in an IDE, the problem probably becomes easier [1] - if you have a valid file and then make it invalid by deleting an operator in the middle of some expression, you can still essentially use the syntax tree from just before the deletion. If, on the other hand, you get handed a malformed file from the start, you might have a harder time.
[1] And also harder, because if you want to parse the file after each keystroke, you have to be fast. This probably also makes incremental updates to the syntax tree the preferred solution, and that might align well with using the prior result for error recovery.
"If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there can not be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file and that is probably not easy."
I don't understand that question. Given the following source file that does not parse
var foo = bar baz
there are many ways to change it and make it parse including the following reasonable ones
var foo = barbaz
var foo = "bar baz"
var foo = { bar, baz }
var foo = bar // baz
var foo = bar
//var foo = bar baz
var foo = bar * baz
var foo = bar + baz
var foo = bar.baz
var foo = bar(baz)
but also unreasonable ones like
var abc = 123
and therefore a parser that can handle malformed inputs has to make educated guesses about what the input was actually supposed to look like. And don't be fooled by this simple example: imagine a long source file with deeply nested code in a language with curly braces, and randomly delete some of the braces. Now try to figure out where classes, methods, if or try statements begin and end in order to produce a [partial] syntax tree better than just giving up at the position of the first error.
My point was that test suites should give you a heuristic on what corrections are good and which are bad. A source code change that turns a test fail into a test pass should be considered an improvement.
I am still lost. A test suite for what? We have a parser - binary, source code, and maybe a test suite if the parser developers decided to write tests - and a random text file that we throw at the parser, for which the parser hopefully generates a useful syntax tree if the content is a well-formed or not-too-badly-malformed program in a language the parser understands.
So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool? And how would you want to make use of the test suite? Before you can run the test suite, the source file must already parse and compile, which is already more than a diff tool based on a syntax tree requires - it must be able to parse the source code, but it doesn't have to compile. Passing the test suite requires even more: not only being able to parse and compile, but also yielding the correct behavior, which the diff tool doesn't care about.
And you actually jumped over the hard part that requires the heuristics: how to modify the input in order to make it parse. Take a 10 kB source file and delete 10 random characters - how will you figure out which characters to put back where? With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications. You are certainly not going to try them one after another, each time checking whether the modified source file parses, compiles, and passes the test suite.
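(For what it's worth, the arithmetic checks out:)

# 10 insertions, each independently choosing one of ~100 characters at one
# of ~10,000 positions: (100 * 10_000) ** 10 candidate repairs.
print(f"{(100 * 10_000) ** 10:.0e}")  # 1e+60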
> So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool?
Not sure what this whole straw man is about. I definitely didn't suggest anything like that. Of course you can only compare two compiling versions of a source file using a test-suite-based heuristic. I thought this whole thing was about the "heuristics that identify reasonable changes which fix the file" mentioned above? Reasonable changes that DON'T fix the file are clearly recognizable by NOT passing the test suite, just as if it were a human trying to make those changes and finding out that the change they just made didn't in fact yield the desired results after running the test suite.
> With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications.
If you're working with an AST, you're almost certainly not working with characters. That would be immensely wasteful. In fact, working with an AST is pretty much the only way in which the set of changes is sufficiently reduced for almost any change to NOT be rejected outright. With character-level modifications, you're facing the problem that almost every edit will be rejected outright as early as the parsing stage.
We have obviously been talking past each other. My point was that a parser for a syntax-tree-based diff tool should probably be able to deal well with files with syntax errors, i.e. it must be able to fix syntax errors. And by fixing syntax errors I did not mean actually fixing the file, but being able to construct a reasonable syntax tree even if some subtrees do not adhere to the grammar. Given an input like
class foo
{
    function bar() {
    function baz() { }
}
it should be able to parse the file as if bar() was not missing the closing curly brace. If the parser just gave up or inserted the closing curly brace at the end
class foo
{
    function bar() {
        function baz() { }
    }
}
making baz() a nested function inside of bar(), the result would be worse than using a character-based diff algorithm. But I never intended to say anything about making code functionally correct; that is none of the business of a parser or diff algorithm.
What you do is produce an AST where some nodes indicate syntax errors. This works best in languages where it is easy to resynchronize after an error, of course.
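In toy form (nothing like tree-sitter's real error recovery, just the shape of the idea):

# A recursive-descent parser that records ERROR nodes instead of giving up,
# resynchronizing on '}' so the surrounding structure survives.
def parse_block(tokens, i):
    """Parse '{ stmt* }' from a token list; returns (node, next_index)."""
    children = []
    i += 1                                         # consume the opening '{'
    while i < len(tokens) and tokens[i] != "}":
        if tokens[i] == "{":
            node, i = parse_block(tokens, i)       # nested block
            children.append(node)
        elif tokens[i] in ("let", "call"):         # the only statements we know
            children.append(("stmt", tokens[i], tokens[i + 1]))
            i += 2
        else:                                      # unknown token: wrap it in an
            children.append(("ERROR", tokens[i]))  # ERROR node and move on
            i += 1
    return ("block", children), i + 1              # consume the closing '}'

tree, _ = parse_block(["{", "let", "x", "???", "call", "f", "}"], 0)
print(tree)
# ('block', [('stmt', 'let', 'x'), ('ERROR', '???'), ('stmt', 'call', 'f')])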
I've thought before this is how diffing should be done, and speculated that tree-sitter would make it more feasible.
At this point, whenever I think some language-aware tool ought to exist, my first thought is "Does the language server protocol or tree-sitter make this more feasible?"
Someone still has to build and maintain the parsers; you are just outsourcing this. And I added a bit to my comment: I tend to believe that parsing is the easy part, but that is admittedly more a gut feeling than something based on any real knowledge of that problem space.
Languages usually change slowly, though, so once a good baseline grammar is in place, maintenance is unlikely to be a huge load.
Furthermore, with tools like tree-sitter and the language server protocol, multiple communities benefit from their continued existence, so there's a bigger pool of contributors to the parser.
> And you don't need just any parser, you need very robust ones that can deal with malformed files well.
But why? Shouldn't the code you push into a repository be at least syntactically correct? And even if it is not, one can simply fall back to a textual diff.
> Or, if you only pick a small set of supported languages, your diff tool will not work on most files or have to fall back to a structure-agnostic algorithm.
(1) Parsing an arbitrary language is hard. Without tree-sitter, difftastic would probably be a lisp-only tool. You also want a parser that preserves comments.
(2) Inputs may not be syntactically well formed.
(3) Efficiently comparing trees is extremely difficult (difftastic is O(N^2) in time and memory).
(4) Displaying tree diffs is equally difficult. Alignment is particularly challenging when your 'unchanged before' and 'unchanged after' are syntactically the same, but textually different.
Understanding syntax would be really amazing for merges (not sure if it's even possible), but for diffs I don't immediately see why I should use that over simpler syntax-unaware tools. Highlighting the actual change in a string is important, and so is ignoring whitespace, but diff-so-fancy with -w does it just fine. What else would I need? (Well, I guess the only use-case I can see from the demo is 2 compact changes in a single line, but… meh.)
On the other hand, even though my diffs are usually not that huge, sometimes they might be, and I don't want to switch tools every time that happens (I just have a git alias, and I don't even remember my exact config, nor should I care). So being slow is not great.
It might be useful for reviewing merge/pull requests. But is there a way to display the diff "interleaved" instead of 2-columns side-by-side? (when executing `GIT_EXTERNAL_DIFF=difft git log -p --ext-diff` for example)
We are working on a code review tool which supports unified diffs with semantic diffing. If that sounds interesting for you, take a look at https://mergeboard.com
Minimum system requirements? Nope. But if you check out Cargo.toml, you'll see it says it needs Rust 1.56.
My system has 1.48.0. And it's the latest Debian release! I don't see how a diff tool can expect you to have a bleeding-edge development environment. I mean, OK, you chose a new language - I can understand that; I won't demand that it build with just a C compiler and Make. But come on, this is not supposed to be just a toy for new systems.
Anyway, I still cloned it, tried to build with "cargo build", and got stuck with:
I wonder if it would be possible to do this in a one-column format. That would make it more useful in a lot of contexts where a super wide view isn't practical.
I use meld too. But AFAICS, meld's 'syntax aware' is very different from difftastic's.
Meld takes a diff, and applies syntax highlighting over the diffed files. It additionally highlights the changed characters in a line. Git diff, vimdiff and probably others, do this as well.
From the demo, I understand that difftastic first parses the syntax and then builds the diff over that, being aware of line wrapping, changes in nesting, moving code blocks into functions, and so on.
I just spent a few minutes on that site and I can't even figure out how to try it out, or their pricing, or anything other than some very superficial docs, really.
Is this just a pretty website, or is the software actually available anywhere?
Looks like there's no locally-run-binary/non-SaaS version. I was hoping it'd have a SublimeText-like model. I have no interest in trying to get my team to switch, nor in having to deal with the security team when it turns out I was using a free cloud account.
Does a `magit` plugin exist for Emacs users? The author of this package is also the author of a couple of popular Emacs packages but I did not see any mention of Emacs.
To be more precise, Magit could easily display the output of difft, but Magit wants to be able to do more than that. You can navigate diffs by hunk and by file, you can collapse hunks and files, you can even select individual lines of the diff to stage or unstage them, apply or unapply them, etc. Strictly speaking that’s probably not impossible with difft, but because difft has the explicit goal of displaying diffs to humans rather than producing machine–parsable output, it won’t be easy. I still want it to happen though.
BTW, IIRC the tree version of Levenshtein distance (tree edit distance) has proven terrible complexity. But so does LCS, and diff itself performs great in practice, so maybe...
For Jupyter Notebooks I highly recommend trying out jupytext, which converts notebooks on the fly to a number of formats. It really has been a game changer for working with git and notebooks for me. I essentially never want to preserve the state of the notebooks anyway, so converting just makes sense. The best thing is that it is completely transparent, i.e. it generates a notebook file when you open the other file and saves to that file every time the notebook is saved. If you want to keep the state of the notebook, you can always keep that file around as well.
> Non-goals
> Patching. Difftastic output is intended for human consumption, and it does
> not generate patches that you can apply later. Use diff if you need a patch.
because it's an automated piece of software making decisions about what is an "equal" diff and what is a "different" diff; a diff no longer means just any change, it now has to be a meaningful enough change. If you removed something like `if (true)` or whatever, that's still a diff that could have some importance and/or unknown consequences. I appreciate the value, but the fact that it allows a refactoring to be a non-diff would worry me in the long run, I think.
Difftastic is only ignoring whitespace that isn't significant. If you remove `if (true)`, it will get highlighted.
With a textual diff today, your only choices are 'highlight all whitespace changes' (e.g. the git default) or 'ignore all whitespace' (e.g. diff --word-diff).
If difftastic says there are no changes, then both files have the same parse tree and the same comments.
If you have consistent code style and formatting this tool is unnecessary. I think that solution is better: you get a more consistent code base that is easier for humans to read. (Also, diffs will be faster to compute.)
> If you have consistent code style and formatting this tool is unnecessary
I disagree. I'm struggling to replicate it right now using a simple test, but I've seen the following rather infuriating and counterintuitive behaviour from Git/GNU diff. If you have a simple if statement such as:
if (bla) {
    // do something
}
And you were to add another statement at the end, after the closing curly brace, e.g.:
if (bla) {
    // do something
}
if (bla2) {
    // do something else
}
Git/GNU diff will sometimes show the following diff:
diff --git 1/left 2/right
index c2ea6f1..dc0e1c2 100644
--- 1/left
+++ 2/right
@@ -1,3 +1,6 @@
 if (bla) {
     // do something
+}
+if (bla2) {
+    // do something else
 }
This is a basic example, but there are other, similar things. For a simple change like the above, this isn't a huge issue, but for bigger patch sets it can take a minute to understand what is really going on.
Right, I frequently get angry at just how dumb diff really is - how it's greedy and can't recognize the best seams between blocks of code. But then, when I think of simple rules that would improve the results, I see how they would lead to other problems in other places. So using syntax seems necessary.
Even if you are consistent, having text whose only change is indentation show up as unchanged is very clever. I often end up reviewing a diff that moves a basic block into a conditional branch and have to scan each line to see if it changed.
I run all python through `black` and `isort`; this is still a huge step up in my book in terms of readability and ergonomics compared to the standard `git diff` or gnu `diff`.