Hacker News
Difftastic: A diff that understands syntax (github.com/wilfred)
983 points by tempodox on March 29, 2022 | 219 comments



I really like the idea of focusing on producing patches for human consumption. I studied the problem of merging AST-level patches during my PhD (https://github.com/VictorCMiraldo/hdiff) and can confirm: not simple! :)


Please tell me the final output of your PhD was a differtation.


omg!! I really should have left that typo somewhere in there! What a missed opportunity! xD


Should've named that repo "phdiff".


I'll vote for "diphph"


It's tangential but it reminded me of "lighght" poem by Aram Saroyan. https://en.wikipedia.org/wiki/Aram_Saroyan#Minimalism_and_co...


Best pun I've heard in a long time. Well done. <3


To be pronounced "Doctor-iff" in speech?


"Doctor if and only if <this works>"


So I looked at the paper and it seems interesting. Basic idea: Instead of the operations to consider being "insert", "delete" and "copy", one adds "reorder" "contract subtree" and "duplicate" (although I didn't quite get the subtlety of copy vs duplicate on a short skim); and even though extra ops increase the search space, they actually let you search more effectively. I can buy that argument.

The practical problem, though, is that the Haskell compiler is limited/buggy, so you couldn't implement this for C, and you settled on a small language like Lua. If you _do_ extend this to other languages (perhaps port your implementation from Haskell to something else?), please post it on HN and elsewhere!


Some of the GHC performance bugs that we ran into during the research have been fixed as far as I know! Though I'd have to double-check


Indeed, we also designed a brand new generics library to work around that. Performance was really not an issue! :)


Copy just copies once. The need for duplicate is clear if you're trying to diff something like `t = [a]` and `u = [a, a]`. You could copy `a`, but you'd have to decide whether to copy it on the first or second position; the second one would be classified an "insertion" by any ins/del/cpy-algorithm. If you instead opt to NOT make that choice, you can say: pick the source `a` and duplicate it instead


Can you give a little color on where the difficulties lie? Is it an efficiency question, or is determining "which changes" hard in the first place?


Early in the linked thesis there is a one-page argument about the shortcomings of traditional approaches, which technically isn't what you asked but might still answer the side of the question that deals with human usage at least:

https://victorcmiraldo.github.io/data/MiraldoPhD.pdf#page=24


Not OP, but the docs call out some "Tricky Cases" [1].

[1] https://difftastic.wilfred.me.uk/tricky_cases.html


I’d imagine there’s some challenging judgement calls that such a tool would have to make. Like, in Go, you can reorder the members of a struct definition. In many cases this is just diff noise to reviewers. HOWEVER, it does impact the layout of the struct in memory, so it can be semantically meaningful in performance work.
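
A rough illustration of why field order can matter, sketched in Python with `ctypes`. This is an assumption-laden stand-in: `ctypes` follows the platform's C layout rules, not those of any Go compiler, and the struct and field names are made up for the example. It only shows the general effect of alignment padding when the same fields are declared in a different order.

```python
import ctypes


# Hypothetical structs for illustration only. ctypes applies the platform's
# C alignment rules, used here as a stand-in for a compiler's layout choices.
class Interleaved(ctypes.Structure):
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("counter", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Grouped(ctypes.Structure):
    _fields_ = [
        ("counter", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


# Same fields, different order: padding makes the interleaved version larger.
interleaved_size = ctypes.sizeof(Interleaved)
grouped_size = ctypes.sizeof(Grouped)
print(interleaved_size, grouped_size)
```

A syntax-aware diff would see these two definitions as a pure reordering, which is exactly the kind of change a reviewer doing performance work might still want highlighted.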


A wild nitpicker appears. I understand where you're coming from & why this matters. But Go, the language spec, doesn't make any guarantees about struct layout at all. A layout difference may be meaningful, practically, but it's potentially unreliable.

e.g. see https://groups.google.com/g/golang-nuts/c/1BlZDNBLiAM

Having said that: if a Go compiler for a given architecture decided to change its layout algorithm, I'm pretty sure it would earn a changelog entry.


PHP long stated that associative array sorting order was unstable and not guaranteed (especially when the union (+) operator or array_merge function were involved) - that doesn't mean ten bazillion websites wouldn't instantly break if they ever actually changed the ordering to be unpredictable.

Language designers need to contend with the fact that the ultimate final say in whether a thing is or not is whether that behavior is observed.


Didn't ruby actually do exactly this though? And it broke a million websites and they changed it back in the next version and have made it explicit ever since? To me that is much stronger evidence than what we think would happen if php did it.


I don't know about Ruby, but one example I can think of where a language made the instability explicit is that early on in the language Go changed the behavior of the select statement:

> If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection.

https://go.dev/ref/spec#Select_statements

In an early implementation it would pick in lexical order, IIRC (and the specification did not mention how a communication should be picked). Not only could this lead to bugs, apparently some people were relying on it and they didn't want that.


I wrote a masters thesis about the more general problem here (https://tspace.library.utoronto.ca/bitstream/1807/65616/11/Z...).

The tl;dr is that there's an almost infinite number of ways to atomize/conceptualize code into meaningful "units" (to "register" it, in my supervisor's words), and the most appropriate way to do that is largely perspectival — it depends on what you care about after the fact, and there is no single maximal way to do it up front.


I mean to have an improvement over the status quo we need to simply find a conception that works better than lines as units of code. Let’s not let perfect be the enemy of the good.


love to tell someone who literally wrote a masters thesis on a topic what we need to "simply do" to solve it lol. I almost want to admire the confidence but


>>I’d imagine there’s some challenging judgement calls that such a tool would have to make

Just thinking about it makes my head spin. I spend a lot of time working out font/color hierarchies, supplementary to coding and data viz. Arguably what you're bringing up is a case for a carefully colored diff that visually cues whether something is a true semantic change or indicative of a lower level issue. I'm comfortable with reading a plain ol' diff that just shows me what changed, superficially, and interpreting it. While I think OP's idea is awesome, it also might create more confusion than it resolves; and resolving confusion is the point of a diff.


Efficiency is not the issue at this point. My prototype diffing algorithm was linear and there have been improvements on it already (I think something called "truediff" is linear but an order of magnitude better! I could be misremembering the name, don't quote me :) ).

The real difficult part is in how you represent AST-level changes, which will limit what your merging algorithm can do. In particular, working around moving "the same subtree" into different places is difficult. Imagine the following conflict:

([1,3], [4,2,5]) <-- q -- ([1,2,3], [4,5]) -- p --> ([1,3], [2,4,5])

Both p and q move the same thing to different places so they need a human to make a decision about what's the correct merge. Depending on your choice of "what is a change", even detecting this type of conflict will be difficult. And that's because we didn't add insertions nor deletions. Because now, say p was:

([1,2,3], [4,5]) -- p --> ([1,3], [2,5])

One could argue that we can now merge, because '4' was also deleted hence the position in which we insert '2' in the second list is irrelevant.

If we extrapolate from lists of integers to arbitrary ASTs the difficulties become even worse :)
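
To make the move conflict above concrete, here is a toy Python sketch. It is not how hdiff represents changes; it assumes unique, hashable elements, ignores insertions and deletions entirely, and describes an element's place as "which list, and which element sits to its left". The function names are hypothetical.

```python
def locations(state):
    """Map each element to (list index, left neighbour) within `state`."""
    where = {}
    for list_index, items in enumerate(state):
        previous = None
        for item in items:
            where[item] = (list_index, previous)
            previous = item
    return where


def move_conflicts(origin, left, right):
    """Return elements that both sides relocated to different places."""
    base, a, b = locations(origin), locations(left), locations(right)
    conflicts = []
    for item, start in base.items():
        if item not in a or item not in b:
            continue  # deletions are out of scope for this sketch
        moved_left = a[item] != start
        moved_right = b[item] != start
        if moved_left and moved_right and a[item] != b[item]:
            conflicts.append(item)
    return conflicts


origin = ([1, 2, 3], [4, 5])
q_result = ([1, 3], [4, 2, 5])
p_result = ([1, 3], [2, 4, 5])
conflicts = move_conflicts(origin, q_result, p_result)
print(conflicts)
```

Even this tiny version has to pick a definition of "place" before it can say anything, which hints at why the choice of change representation dominates the design.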


How does your work relate to tree-sitter, which also manages patches (which it describes as "incremental parsing") as well as error states?


This looks absolutely amazing.

One thing I do find interesting (and I wish were different) is that only programming languages are supported, rather than data formats as well.

For example, two JSON documents may be valid but formatted slightly differently, or a common task for me is comparing two YAML files.

Comparing config files that have a well-defined syntax and/or can be abstracted into a tree (JSON, YAML, TOML, etc.) would be absolutely lovely, even including (if possible) Markdown and its ilk.


JSON and CSS are supported today, and I'm interested in adding more structured text formats.

If a format has a tree-sitter parser, it can be added to difftastic. The TOML tree-sitter parser looks good, but there isn't a mature markdown parser for tree-sitter. There are other markdown parsers available, so in principle difftastic could support markdown that way.

The display logic might need a little tuning for prose-heavy formats like markdown though. I'm not happy with how difftastic handles block comments yet either.

I'm not sure about formats that contain more prose, such as markdown or HTML.


I think supporting XML would be something a lot of people would appreciate. That XML is difficult to diff comes up again and again... However, one would need to decide whether one wants to compare by syntax or by meaning. The latter may be preferable, but would require the XML to be canonicalized on both sides first.


I would naively expect that this problem is easiest to solve for languages like JSON that have an unambiguous way to be pretty printed.


Indeed. One could just do `diff $(jq . $fileOne) $(jq . $fileTwo)` and you'll end up with a "nice enough" diff even if $fileOne and $fileTwo were very differently formatted.


The problem is when a file also needs to be normalized - e.g. object keys in a different order, YAML syntax expansion. It can be very useful to indicate when a JSON file is identical to another JSON file but some of the properties or array items are out of order and that requires more in-depth knowledge of the data format. Let's not mention that you could UTF-8 encode characters or write out the same character using backslash notation, numeric or boolean data that might be wrapped in a string in one file but not in another, etc. There can still be a lot of modelling and interpretation to consider when comparing data files rather than code files.


I wrote a tool that tidies JSON and can do things like re-order keys into a fixed order - https://github.com/ActiveState/json-ordered-tidy


I'm not too familiar with YAML, so can't answer to that.

But re JSON:

> object keys in a different order

They can't be "in a different order" as JSON keys are not ordered. They can be whatever order, and would still be considered the same.

> array items are out of order

Then it's different, as JSON arrays are ordered. ["a", "b"] is not the same as ["b", "a"] while {a: 1, b: 1} and {b: 1, a: 1} is the same.

> you could UTF-8 encode characters or write out the same character using backslash notation, numeric or boolean data that might be wrapped in a string in one file but not in another

Then again, they are different. If the data inside is different, it's different.

I understand that logically they are the same, but not syntax-wise, which is why I included the "differently formatted" "disclaimer". It obviously wouldn't understand that "one" and "1" are the same, but then again, should it? Depends on the use case, I'd say; hard to generalize.
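
The distinctions above can be checked on parsed values rather than on text. A minimal sketch using Python's standard `json` module (the variable names are only for illustration); note that it compares decoded data, so it says nothing about how a text-based diff would treat the same inputs.

```python
import json

# Objects: member order does not affect equality of the parsed values.
same_object = json.loads('{"a": 1, "b": 1}') == json.loads('{"b": 1, "a": 1}')

# Arrays: order is significant.
same_array = json.loads('["a", "b"]') == json.loads('["b", "a"]')

# Escapes: a \u escape and the literal character decode to the same string.
same_escape = json.loads('"\\u00e9"') == json.loads('"é"')

# Types: a number wrapped in a string is a different value.
same_number = json.loads('"1"') == json.loads('1')

print(same_object, same_array, same_escape, same_number)
```

Whether a diff tool should treat the escaped and literal forms as equal is the use-case question raised above; the parser simply reports that the decoded strings match.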


> They can't be "in a different order" as JSON keys are not ordered. They can be whatever order, and would still be considered the same.

This is what GP is saying, I'm pretty sure. Object member order is non-semantic in json, so in order to do a semantic diff (one that understands structure), you need to canonicalize the order of the two sides. Simply diffing the output of jq doesn't do that, because (afaik) jq doesn't alter the order.

Basically, if you want this to come up the same:

    {"a":"b","c":"d"}
    {"c":"d","a":"b"}
you need more than just `diff $(jq) $(jq)`.

Can argue about whether a tool like difftastic should do that, I guess, but I would personally lean towards that it should be smart enough to see this because it's precisely the sort of thing that both humans and line-based diff can be awful at seeing.
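
One way to get the two documents above to compare equal is to canonicalize both sides before running a line diff. A minimal sketch with Python's standard library, assuming that sorted keys plus fixed indentation is an acceptable canonical form; `canonical_lines` and `json_diff` are hypothetical helper names, and this is not what difftastic does internally.

```python
import difflib
import json


def canonical_lines(text):
    """Parse JSON and re-serialise it with sorted keys and fixed indentation."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2).splitlines()


def json_diff(left_text, right_text):
    """Return unified-diff lines between the canonical forms of two documents."""
    return list(
        difflib.unified_diff(
            canonical_lines(left_text),
            canonical_lines(right_text),
            fromfile="left",
            tofile="right",
            lineterm="",
        )
    )


reordered = json_diff('{"a":"b","c":"d"}', '{"c":"d","a":"b"}')
changed = json_diff('{"a":"b","c":"d"}', '{"c":"x","a":"b"}')
print(reordered)
print("\n".join(changed))
```

The reordered pair produces no diff lines at all, while a real value change still shows up. Array order is left untouched, since arrays are ordered in JSON.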


Just an FYI, jq has a flag to sort keys by name; I believe it's -S (--sort-keys).


Fair enough! I should just never assume jq doesn't have a feature.


Nitpick: diff takes filenames as arguments, so comparing the output of two commands would need the `<()` expansion. So the command would be `diff <(jq . $fileOne) <(jq . $fileTwo)`


https://github.com/andreyvit/json-diff works really well for JSON diffing in my experience.

It's more simplistic than difftastic though: it considers `1` and `[1]` to have nothing in common.


JSON is supported.

HTML and XML are missing, too.


You're right. I missed JSON.

Sadly YAML, TOML and the others I mentioned are not there (yet?)


There’s always room for contributions!


This is kind of like the problem of programmatically analyzing AWS IAM roles and policies to understand impact of changes. Very difficult to do in JSON format but worth tons of money to CISOs if it can be solved.


Similarly, I would love it if Pandoc’s AST were supported. Or, if this could be extended to compare any documents taking formatting into account, or document-to-document conversions.


This isn't going to add anything to existing diff tools for JSON or YAML though. Those formats barely have any syntax highlighting or complex structures.


I would love a great XML diff tool, and after seeing the demo of this I was sad to see XML not in there. Would pay for.


Same. I don't know how many times I've done a diff and wished there was a smarter solution that could take formatting and whitespace into account. This is it. I wish git diff would incorporate this; it would be a real treat.


Funny side note: I had a flat mate once who was on a working holiday from Japan.

He was in love with and endlessly curious about English slang, it’s basically all we talked about.

I remember explaining to him why my uni friends and I referred to things as being “craptastic”, starting with American marketing’s love affair with the portmanteau.

He got it pretty quickly and enjoyed using it in conversation.

The saying that was harder for him to understand was “fuck all”. He always wanted fuck to be the verb, rather than using “fuck all” as the adjective, so he would say things like “I fuck all my money last night at the pub”.


I know native English-speakers who would say that s/pub/bar/.

Profanity is just delightful in general, and non-native English speakers come up with some of the best profane idioms in English.

I wonder if it’s the same in other languages?


Perhaps this book would have helped.

https://www.amazon.com/gp/aw/d/486256139X


This is written by the same guy who wrote Helpful, an enhancement package for the Emacs Help buffer. I highly recommend checking out Helpful if you haven’t seen it. https://github.com/Wilfred/helpful


EDIT: Wilfred IS the original author [3]; my apologies.

Not to discredit Wilfred (it looks like he's taken over the project as the maintainer), but, based on the historical contributions [1], it looks like it was originally developed by Max Brunsfeld, who also created Tree-sitter. [2]

[1]: https://github.com/Wilfred/difftastic/graphs/contributors

[2]: https://github.com/tree-sitter/tree-sitter

[3]: https://github.com/Wilfred/difftastic/commit/958033924a2dea7...


I think the contributor graph is misleading, and that he's using git-subtree to vendor tree-sitter, which makes it look like others have contributed more to the project.


Oops, I think you're right! Thank you for pointing that out.

My apologies to Wilfred.


He wrote https://github.com/Wilfred/deadgrep too. It's awesome and I don't know how I lived without it for so long.


Helpful is (pun fully intended) so very, very helpful.

Honestly, I cannot imagine going back to the standard emacs help.


Agreed. It’s so good it feels like it should have been that way all along. For example, when you view the help for a function Emacs has always given you a link to the source code where that function is defined. Helpful shows you the source code right in the Help buffer, and shows you a list of callers, and gives you buttons that enable tracing or debugging for the function.

Once I discovered Helpful, all of those things seemed so obviously useful that I can’t understand why nobody else thought to put them there, including myself.


The best part is the forget function, for when functions are incompatible. As an example, lsp won't work for me unless I forget the project-root function from ess-r (I have no idea why this hasn't been fixed) and helpful makes this a two or three key activity.


For everyone wondering, it looks like this will work with git diff: https://difftastic.wilfred.me.uk/git.html.


Exactly what I was looking for. Thanks!


A previous discussion from 8 months ago, with some comments by the author and authors of other diff tools:

https://news.ycombinator.com/item?id=27768861


This looks really cool and I can't wait to try it, tho... a bit of a PITA to get running. ;) Took a while to figure out how to build, and had to install 400MB of dependencies first....

Edit: And after installing cargo, watching it fail to build, then determining I must need a newer version of cargo, so I built that from source... it fails. Apparently I need to install `rustc-mozilla` and not `rustc`. "obviously".

This is all a testament to how much I want to try this tool...

MOAR EDIT: even with rustc-mozilla cargo fails to build. running `cargo install difftastic` gives me an error about my version of cargo being too old ;.;

Dear author: Let us run your tool.


Using ubuntu 20.04, I first installed cargo:

  curl https://sh.rustup.rs -sSf | sh
Restart shell to get $HOME/.cargo/bin in PATH, then did:

  cargo install difftastic
And ~4 minutes later, difft executable is ready.

Agree though that some pre-built binaries would be fantastic!


Ah, well, if you're willing to accept having a frankensystem with a mix of packaged and unpackaged software, sure. ;) I used to do that, back in Slackware days.

It's considered really sloppy and unmaintainable to admin a system like that. Things quickly get out of hand.

That strategy _does_ work if you isolate it to a chroot or a container, but littering /usr/local with all sorts of locally compiled upstream is just asking for future pain. Security updates, library incompatibilities, &c.

Prebuilt binaries might be nice, but I don't expect them for random projects. (and I wouldn't have used them if offered) I do think it's a reasonable expectation to be able to build software w/o essentially setting up a new userland just for that tool though. :)


The method I posted above doesn't write anything to /usr/local. Root isn't required. Everything is written under ~.


Whoa really?

I'm sorry, and retract my ignorant assumption! Going to try it out now.


There are a few packages available, e.g. https://aur.archlinux.org/packages/difftastic and https://pkgsrc.se/wip/difftastic.

I've also had requests from Alpine Linux packagers to allow dynamic linking to parsers. This is something I want to support in future, once I'm happy with the basic diffing logic.


I agree it leads to problems but isn't the entire purpose of `/usr/local` to be a dumping ground for locally administered (unpackaged) programs?


A huge part of the appeal of Rust and Go tools is that you can just ship a binary, it's frustrating that it's not available here.


Not sure about Go, but Rust still links against glibc, so I sometimes have to recompile things to make them work on my Debian systems if they're built against newer glibc.



Same here. Looked into repo -> no binary in release or Github actions

spun up an Ubuntu 18.04 instance -> git clone, git checkout 0.24.0

installed rust using curl | sh method

build fails:

https://termbin.com/29xy

removed the instance and gonna check it again 6 months later


In another comment you're asking about vim support. So let me get this straight: You're using vim, yet you're unable to resolve the error message

    = note: /usr/bin/ld: cannot find Scrt1.o: No such file or directory
            /usr/bin/ld: cannot find crti.o: No such file or directory
Have you tried googling for "ubuntu crti.o: No such file or directory" ?


Using vim has nothing to do with one's ability to troubleshoot compiler/Ubuntu issues. Plus, both compiler and Ubuntu issues can be a massive PITA to solve even if you're familiar with them. Personally, if I'm trying to install something on a whim to try it out and I start getting "no such file or directory" errors, I'd be upset that something is going wrong.


>Have you tried googling for "ubuntu crti.o: No such file or directory" ?

Depending on the project, there is a certain threshold of trying-to-make-something-work which I'm willing to undertake in order to test an app.

But you are right. I'm sorry if my OG comment came across as arrogant to the devs who do stuff for free. (♥ to the devs)

[edit]: ok, I tried again, `sudo apt update && sudo apt install build-essential` before installing rust and `cargo install`ing.

Error again:

https://dpaste.com/FTG7FSRQF


The GCC version in Ubuntu 18.04 is too old. I had the same problem, I just installed clang, updated the default c++ and it worked. There is an issue in the repo about that.


Funnily enough, the error is in a C++ dependency providing Haskell support.

    vendor/tree-sitter-haskell-src/scanner.cc


If you have nix (package manager) installed, it takes like half a second. For tools I want to install through nixpkgs I make a starter like this:

    $ cat /usr/local/bin/difftastic
    #!/bin/sh
    . "$HOME/.nix-profile/etc/profile.d/nix.sh"
    nix run nixpkgs.difftastic -c difftastic "$@"
and then it'll install on first run:

    $ difftastic
    these paths will be fetched (1.17 MiB download, 9.38 MiB unpacked):
      /nix/store/wn74xn0w60xcwsly6nqaibn205hh2qms-difftastic-0.8
    copying path '/nix/store/wn74xn0w60xcwsly6nqaibn205hh2qms-difftastic-0.8' from 'https://cache.nixos.org'...
    Difftastic 0.8.0
    Wilfred Hughes
    A syntax aware diff.
    
    USAGE:
    [etc.]


Used `cargo install difftastic`? Finished in a minute for me.


Build errors for me. Apparently I'm on some nightly build of cargo, but I need 2021 version. The pain begins...

Edit: Reinstalling Cargo worked!


With rustup, it's pretty easy to update/change your cargo version.


How did you do it? When I tried to rebuild cargo I got build errors. I'm starting to suspect the only way to run this tool is make a chroot tracking sid or something....


I just followed the installation instructions here: https://doc.rust-lang.org/cargo/getting-started/installation...

It'll confirm that you want to install it, because it's already installed I think, and I just selected 1. for Yes.


> curl https://sh.rustup.rs -sSf | sh

hard pass :)


> hard pass

Why? You're willing to run some random open source project, but you're not willing to run the official Rust installation script?


I feel the same way, I am just not willing to pipe curl into a shell blindly.

Even if this specific instance of curl'ing into sh is safe, or if I download and then run it, it's still extremely poor practice and gives me serious doubts about the developers and their security practices in general.

I also do not like when every project decides to poorly reimplement the package manager. If every piece of software used its own package manager, my system would be a complete mess with dozens of different package managers fighting each other, and it would be a total nightmare to update the system or manage non-trivial dependency chains when installing something new.

Rust is one of my favorite languages but this is definitely my least favorite aspect of it all. It really feels like the developers "optimized" for systems with no package manager.


Out of curiosity, what would be an acceptable way for the developers to provide a quick way for users to get up and running?

A get started guide with all the required commands easily copy-pastable? (A popular option these days) Something else?

I don’t mean to be critical, I’m simply curious.


You could always download it first and eyeball it before running it.


Sure, but first I had to figure out wtf "cargo" is. :P

Also, `cargo install difftastic` AIUI pulls it from a central location, if I'm gonna poke at software for the first time, I enjoy building it myself first, so I can get my hands dirty in the source. :)

EDIT: Also, the build fails. :(

    error: unexpected token: `include_str`
     --> /home/loxias/.cargo/registry/src/github.com-1ecc6299db9ec823/radix-heap-0.4.2/src/lib.rs:2:10
      |
    2 | #![doc = include_str!("../README.md")]
      |          ^^^^^^^^^^^

    error: aborting due to previous error

    error: could not compile `radix-heap`.

sad trombone


This looks like you're using a version of Rust older than the minimum required (1.56).


The getting started section of the manual should help: https://difftastic.wilfred.me.uk/getting_started.html

I've documented the minimum rust version required today, although I'm looking at lowering the minimum version.


Honest question: how did you arrive at the conclusion that you needed rustc-mozilla? I would love to make sure whatever flow led you to that is made clearer for other newcomers, because that is definitely not something anyone who isn't working on Firefox should even try.


I imagine it's a misunderstanding with the rustc-hash dependency used in difftastic for faster hashing.


My favorite dev tool is diff2html - a CLI that opens up your browser with a rich diff. Pro tip: alias `diff` to the command so you can launch it quickly ;)

https://diff2html.xyz/


A related thing is cregit, which does diffs of tokens:

https://github.com/cregit/cregit https://lwn.net/Articles/698425/
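
For a feel of what diffing tokens instead of lines buys you, here is a small Python sketch. It is not cregit's implementation: the tokenizer is a crude regular expression chosen for the example, and `tokens` and `token_changes` are hypothetical names. The comparison itself uses `difflib.SequenceMatcher` from the standard library.

```python
import difflib
import re


def tokens(source):
    """Split text into words and single punctuation characters, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", source)


def token_changes(old, new):
    """Return (operation, old tokens, new tokens) for each non-equal region."""
    old_tokens, new_tokens = tokens(old), tokens(new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "int total = count(items);"
new = "int total =\n    count(items, limit);"
changes = token_changes(old, new)
print(changes)
```

A line diff would report both lines of the new version as changed; at the token level the only change is the added `, limit` argument, because the reflowed whitespace never becomes a token.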


Ooh, I'd not seen this and I've seen a bunch of diff tools at this point! Thanks for sharing.


I would love it if version control stored an AST that also includes comments and dividers (where right now we would leave an empty line) and dev machines rendered it out however they wanted. They could even change the language of keywords in addition to normal formatting.


To do this requires some standard way of encoding an AST which includes comments and dividers.

That standard format is commonly known as source code - although it lacks a normal form.

Tools like prettier, gofmt and black can be thought of as a way to produce a normal form of source code.

This is (IMO) a reasonable incremental approach towards exactly what you describe - if a project checks in only source code that's formatted using a standardised format, then you're free to work on it using whatever equivalent representation you like - as long as it's converted back at commit-time.
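
As a small sketch of the "normal form" idea, Python's own `ast` module can parse source and print it back in one fixed style (this needs Python 3.9+ for `ast.unparse`). It is only an illustration under a big caveat: the `ast` module drops comments and blank-line dividers, so unlike prettier, gofmt or black it is not a usable formatter. `normal_form` is a hypothetical helper name.

```python
import ast


def normal_form(source):
    """Parse Python source and print it back in the ast module's own style."""
    return ast.unparse(ast.parse(source))


compact = "def add(a,b): return a+b"
spread_out = "def add(\n    a,\n    b,\n):\n    return a + b\n"
different = "def add(a, b):\n    return a - b\n"

# Two layouts of the same program collapse to one normal form...
same_after_normalising = normal_form(compact) == normal_form(spread_out)
# ...while a real change survives normalisation.
changed_after_normalising = normal_form(compact) == normal_form(different)
print(same_after_normalising, changed_after_normalising)
```

If every commit is run through the same normalizer, a plain line diff of the stored text only ever shows changes that survive normalization.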


FWIW VCS for Smalltalk basically does this.

The challenge for a tool like difftastic is that I can't guarantee that syntax is well-formed. You might be using new syntax that my parser doesn't support, you might have merge conflicts, or you might have a plain syntax error in your code.

Tree-sitter handles parse errors gracefully, so difftastic handles syntax errors pretty well in my experience.


Yep, I posted this idea on Reddit recently and people said they need a formatted syntax because of diff and version control; we do not; get the ast, reformat in the editor as the particular user fancies and generate diff and version control artefacts also as a particular user sees fit. Our computers are very fast so you can make a lot more different views on your code than we have now by using the ast instead of text and regexps.


This exact project is called JetBrains MPS.


MPS seems to be a DSL authoring tool. How would this be used to make an AST diff tool?

https://www.jetbrains.com/mps/

https://en.wikipedia.org/wiki/Abstract_syntax_tree


Well, the comment I was responding to was about storing the AST, not diffing it. If that's what you meant, the one follows naturally from the other. Once the file format is the AST instead of its visual representation, it makes sense to implement lots of operations as DSL extensions instead of library features, because the language is the library in a sense. MPS is marketed as a DSL tool, but what it really is is a projectional language tool.


I paid and used SemanticMerge quite successfully when we had a complex Git workflow with lots of conflicts.

https://semanticmerge.com/

Since moving to short lived feature branches it is less useful to me.


SemanticMerge sounded interesting enough so I wanted to check it out, but to my surprise there is no Buy or Download link anywhere on the site. The only thing that might do it is a Login link, but I don't want to create an account just to see how much the thing costs. Is it only sold in bulk to companies? I find it bizarre that there isn't even a "contact sales" button.


That's incredibly annoying! They must have changed something about their pricing and sales model since the time that I had purchased it. I don't understand why companies think that's a good idea. I guess I can't recommend it anymore.


There is a 'sales' button at the bottom, but it's just a link to an email. I'm really not sure how they're even trying to sell this thing.

Maybe they don't want to any more? And this is just their subtle way of pushing everyone interested in using it away?


I don't need SemanticMerge often, but when I do I'm incredibly thankful that I have it.


For easy git usage I created these two scripts in my PATH instead of using git config:

git-difft:

  #!/bin/sh
  GIT_EXTERNAL_DIFF=difft git diff "$@"
git-showt:

  #!/bin/sh
  GIT_EXTERNAL_DIFF=difft git show --ext-diff "$@"
Then you can run "git difft …" or "git showt …" if you want to use it.


This is really nice, thank you!


I was interested in SemanticMerge/XMerge but when I looked they didn't have a Mac client, and now it looks like they don't have a personal edition. I just want to buy a private license and use it locally. https://semanticmerge.com


They are requesting feedback on the pricing model for the latest revision of the technology, maybe HN could change their minds:

https://www.gmaster.io/pricing


OS X and Linux are "wait & see" again. That describes half of our dev team and most of the seniors.


Nice tool! I've used icdiff for this in the terminal, but I'll see how this performs in my workflow.

Since I use VSCode as my editor, I created this oneliner in my .bash_profile:

# VS Code Diff

diffcb () { "/usr/local/bin/code" -n --diff "$@" > /dev/null 2>&1 ; }

With it, I can "diffcb filename1.json filename2.json" to get a visual editor with contextual awareness based on installed lint modules.


Personally I long for a syntactic merge-tool. Every time Syncthing hiccups for some reason, I'm up for a merge session with my Org-mode files, in the vein of: ‘These properties look just like those ones, only with a different timestamp... Oh lookie, and the heading is totally changed. Let me merge this new heading all over the old one, and then pop in the old one after it.’ Dammit, it's just a whole new heading added with properties. This happens with every language heavy on markup.

However, I'm not sure if Org markup lends itself to structuring that would allow proper diffing—even with just the headings.


Be good to have different git merge strategies per file type.

e.g. A merge that knows properties files support the same property added in different places but only once is needed. And another strategy if order is significant.

Cool to have an HTML merge that recognises the tree structure and supports merging tags and having the indentation follow some rules.

I believe git supports merge strategies; it's been on my todo list forever.
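
A sketch of what the properties-file case could look like, in Python. It assumes simple `key=value` lines where order does not matter, and `parse_properties` and `merge_properties` are hypothetical names; hooking something like this into git would go through a custom merge driver, which is not shown here.

```python
def parse_properties(text):
    """Read simple `key=value` lines; comments and blank lines are ignored."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            entries[key.strip()] = value.strip()
    return entries


def merge_properties(base, ours, theirs):
    """Three-way merge by key; returns (merged dict, list of conflicting keys)."""
    base, ours, theirs = map(parse_properties, (base, ours, theirs))
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both sides agree (including both deleting the key)
        elif o == b:
            value = t  # only theirs changed this key
        elif t == b:
            value = o  # only ours changed this key
        else:
            conflicts.append(key)  # both changed it, differently
            continue
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = "a=1"
clean = merge_properties(base, "a=1\ntimeout=30", "timeout=30\na=1")
clash = merge_properties(base, "a=1\ntimeout=30", "a=1\ntimeout=60")
print(clean)
print(clash)
```

The same property added on both sides in different positions merges cleanly, and only a genuine disagreement about the value is reported as a conflict. An order-sensitive format would need a different strategy.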


Today in generation Z rediscovers things: semantic patching.

https://en.wikipedia.org/wiki/Coccinelle_(software)


I LOL'ed at the first page of the manual: "When it works, it's fantastic."


Looks really cool, but there were no instructions on how to install it.

I would recommend putting an installation guide in your readme, and it being a full installation guide.

I followed the link to your manual and then it told me to install your tool using a tool called "cargo" with no reference on how to install cargo. At this point I gave up. Lazy, maybe, but for a convenience tool like this I want a convenient installation.


Cargo is Rust's build tool/package manager and can be installed easily using rustup. But I would probably suggest the difftastic maintainers add some prebuilt binaries to the releases

(I have an example workflow here if anyone from there is interested https://github.com/conradludgate/wordle/blob/main/.github/wo...)


What's rustup and how do I install it?



I think it's wonderful that there's an explosion of new exciting languages, it can only improve the quality of all our tools. I for one am looking forward to replacing my eons of MATLAB experience with Julia.

But I wish there was more of a convention in the F/OSS community that if your software isn't written in something universal (C, C++, shell and maybe python), then it also comes with a container of all that's necessary to run it.

It's frustrating to pollute my nicely packaged managed system with hundreds of locally installed python modules just to run one tool. Or, in this case, backport and rebuild a language specific build tool simply to compile. :)


Have you used pipx? I really like it for installing Python tools because it automatically creates a virtual environment for them so that their dependencies don't affect anything else.

https://pypa.github.io/pipx/


>shell

>universal

* laughs in Windows, then cries *


I used to straddle the two worlds, maintained and supported a multi-site AD domain with AFS integration for user $HOME and some sort of unholy LDAP/kerberos bridge for login. About once every year or two I'll miss something about the way Windows does things, compared to normal (meaning "linux"). Like the NTFS permissions model, that's cool.

But it's just once a year :) And the last time I was deep in windows was win7, whenever that was. I tried to use a win10 machine and gave up.

Besides, I thought the big new feature in modern windows was that WSL improved to the point you can run unix tools! ;)


> About once every year or two I'll miss something about the way Windows does things, compared to normal (meaning "linux"). Like the NTFS permissions model, that's cool.

FreeBSD would be up your alley. Its native ACLs are NFSv4 format, a superset of NTFS ACLs. You need to enable it explicitly on UFS2, but it's default on ZFS.


> and some sort of unholy LDAP/kerberos bridge for login

It's really not that bad, the AD-IPA cross-forest trust is really solid as is the native sssd-ad integration if IPA is too much. Honestly I can't really imagine it any other way now, so much work has been put into AD support that it's actually the best login experience on Linux at the moment. OpenLDAP is definitely showing its age -- dgmr I use it for all my personal infra because it's free and my use-cases are dead simple but we got to delete so much bespoke code after migrating off it at work.


> AD-IPA

I'm not sure, and you undoubtedly know more and are more up to date than I, but I don't believe any of these things existed in 2005, when I was on the aforementioned team. Or, maybe they did exist but management decided an internal implementation was better.

Getting Windows to accept the user profile in an AFS path I recall being particularly vexing.


FWIW I've had reports of people using difftastic on Windows successfully.


I agree with all your points.

Only diff is I got to the point where it said I needed "cargo". On a whim, I typed "aptitude install cargo", and it did something. Now waiting for the >1GB source repo to clone to see if it works.... ;)


This method worked for me. No root required. https://news.ycombinator.com/item?id=30842720


Looks like you need to install the Rust programming language and compile it. It worked for me. Not sure if I like the installation method. It seems the executable is portable though.


Is there a good reason why diff tools generally don’t use AST?


Because it is much easier: you don't have to build and maintain parsers for hundreds of languages. And you don't need just any parser, you need very robust ones that can deal with malformed files well. Or, if you only pick a small set of supported languages, your diff tool will not work on most files or have to fall back to a structure-agnostic algorithm. Also, not all text files even follow any useful grammar at all.

Finally, even if you have a syntax tree, that is just part of the solution, probably the smaller one. Detecting three lines of code wrapped in a new if statement is easy but also doesn't benefit much from a syntax-aware algorithm. But once you change names and signatures, extract methods, introduce constants, and so on, it becomes progressively harder to match subtrees and one is probably quickly approaching the territory of NP-hard and undecidable problems.


> And you don't need just any parser, you need very robust ones that can deal with malformed files well.

I very much agree. I feel there has been a trend recently where people (re)discovered how cool and useful ASTs are and now expect everything be using them. I suspect old-school computer scientists might be secretly laughing at this while programming with some Lisp-like languages they invented for themselves.

Jokes aside, I do wonder how modern IDEs manage to parse broken source code into usable ASTs --- is this trivial (CS theory-wise) or is there a lot of engineering secret sauce involved to make it work?


With only basic knowledge in the domain I would assume it is hard and ugly. If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there can not be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file and that is probably not easy. Also, if you do this online in an IDE, the problem becomes probably easier [1] - if you have a valid file and then make it invalid by deleting an operator in the middle of some expression, you can still essentially use the syntax tree from just before the deletion. If, on the other hand, you get a malformed file, you might have a harder time.

[1] And also harder because if you want to parse the file after each keystroke, you have to be fast. This probably also makes incremental updates to the syntax tree the preferred solution and that might align well with using prior results for error recovery.


"If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there can not be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file and that is probably not easy."

Don't we call such heuristics "test suites"?


I don't understand that question. Given the following source file that does not parse

  var foo = bar baz
there are many ways to change it and make it parse including the following reasonable ones

  var foo = barbaz
  var foo = "bar baz"
  var foo = { bar, baz }
  var foo = bar // baz
  var foo = bar
  //var foo = bar baz
  var foo = bar * baz
  var foo = bar + baz
  var foo = bar.baz
  var foo = bar(baz)
but also unreasonable ones like

  var abc = 123
and therefore a parser that can handle malformed inputs has to make educated guesses what the input was actually supposed to look like. And don't be fooled by this simple example, imagine a long source file with deeply nested code in a language with curly braces and randomly deleting some of the braces. Now try to figure out where classes, methods, if or try statements begin and end in order to produce a [partial] syntax tree better than just giving up at the position of the first error.


My point was that test suites should give you a heuristic on what corrections are good and which are bad. A source code change that turns a test fail into a test pass should be considered an improvement.


I am still lost. Test suite for what? We have a parser - binary, source code and maybe a test suite if the parser developers decided to write tests - and a random text file that we throw at the parser and for which the parser hopefully generates a useful syntax tree if the content is a well-formed or not too badly malformed program in a language the parser understands.


What "test suite for the parser"? Of course a test suite for the faulty program you're trying to correct into a working one.


So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool? And how would you want to make use of the test suite? Before you can run the test suite, the source file must already parse and compile which is already more than a diff tool based on a syntax tree requires - it must be able to parse the source code but it doesn't have to compile. Passing the test suite requires even more, not only being able to parse and compile but also yield the correct behavior which the diff tool doesn't care about.

And you actually jumped over the hard part that requires the heuristics, how to modify the input in order to make it parse. Take a 10 kB source file and delete 10 random characters - how will you figure out which characters to put back where? With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications. You are certainly not going to try them one after another, each time checking if the modified source file parses, compiles, and passes the test suite.
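For what it's worth, the 10^60 figure holds up as a back-of-the-envelope estimate. A rough sketch, assuming each of the 10 insertions is an independent choice of one of 100 characters at one of 10,000 positions (it ignores ordering and the fact that positions shift as characters go back in):

```python
# Rough size of the "put 10 deleted characters back" search space.
# Assumption: each insertion independently picks a character and a position.
ALPHABET_SIZE = 100      # possible characters
POSITIONS = 10_000       # places to insert into a ~10 kB file
MISSING_CHARS = 10       # characters that were deleted

choices_per_insertion = ALPHABET_SIZE * POSITIONS   # 10^6
search_space = choices_per_insertion ** MISSING_CHARS

print(f"{search_space:.0e}")  # 1e+60
```

Even at a billion candidates per second that is far beyond brute force, which is why some heuristic is needed.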


> So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool?

Not sure what this whole straw man is about. I definitely didn't suggest anything like that. Of course you can only compare two compiling versions of a source file using a test-suite-based heuristic. I thought this whole thing was about "heuristics that identify reasonable changes which fix the file" mentioned above? "Reasonable changes that DON'T fix the file" are clearly recognizable by NOT passing the test suite, just as if it was a human trying to make those changes and finding out, after running the test suite, that the change he just made didn't in fact yield the desired results.

> With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications.

If you're working with an AST, you're almost certainly not working with characters. That would be immensely wasteful. In fact working with an AST is pretty much the only way in which the set of changes is sufficiently reduced for almost any change to NOT be rejected outright. With character-level modifications, you're facing the problem that almost every edit will be outright rejected as early as at the stage of parsing.


We have obviously been talking past each other. My point was that a parser for a syntax tree based diff tool should probably be able to deal well with files with syntax errors, i.e. it must be able to fix syntax errors. And with fixing syntax errors I did not mean actually fixing the file but being able to construct a reasonable syntax tree even if some subtrees do not adhere to the grammar. Given an input like

  class foo
  {
    function bar() {
    function baz() { }
  }
it should be able to parse the file as if bar() was not missing the closing curly brace. If the parser just gave up or inserted the closing curly brace at the end

  class foo
  {
    function bar() {
    function baz() { }
  }
  }
making baz() a nested function inside of bar() the result would be worse than using a character-based diff algorithm. But I never intended to say anything about making code functionally correct, that is none of the business of a parser or diff algorithm.


What you do is produce an AST where some nodes indicate syntax errors. This works best in languages where it is easy to resynchronize after an error, of course.
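A minimal sketch of that idea, using a made-up toy grammar (statements of the form `name = number;`) rather than any real language or parser library. The `;` acts as the resynchronization point, so one bad statement becomes a single ERROR node and the rest of the file still parses:

```python
import re

# Toy grammar (an assumption for illustration): a program is a sequence of
# statements of the form `name = number;`. Anything else becomes an ERROR node.
STATEMENT = re.compile(r"\s*([A-Za-z_]\w*)\s*=\s*(\d+)\s*")

def parse(source):
    """Return a list of nodes; malformed statements become ("ERROR", text)."""
    nodes = []
    # Resynchronize on ';': an error never swallows more than one statement.
    for chunk in source.split(";"):
        if not chunk.strip():
            continue  # empty statement or nothing after the last ';'
        match = STATEMENT.fullmatch(chunk)
        if match:
            nodes.append(("assign", match.group(1), int(match.group(2))))
        else:
            nodes.append(("ERROR", chunk.strip()))
    return nodes

tree = parse("a = 1; b = = 2; c = 3;")
print(tree)
# [('assign', 'a', 1), ('ERROR', 'b = = 2'), ('assign', 'c', 3)]
```

Languages with an obvious statement terminator make this easy; the missing-brace case discussed above is harder because there is no equally cheap place to resynchronize.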


This tool is built on tree-sitter (https://tree-sitter.github.io/tree-sitter/), so presumably it doesn't need to maintain parsers at all.

I've thought before this is how diffing should be done, and speculated that tree-sitter would make it more feasible.

At this point, whenever I think some language-aware tool ought to exist, my first thought is "Does the language server protocol or tree-sitter make this more feasible?"


Someone still has to build and maintain the parsers, you are just outsourcing this. And I added a bit to my comment, I tend to believe that parsing is the easy part, but that is admittedly more a gut feeling and not based on any real knowledge of that problem space.


That's certainly a good point.

Languages usually change slowly, though, so once a good baseline grammar is in place, maintenance is unlikely to be a huge load.

Furthermore, with tools like tree-sitter and the language server protocol, multiple communities benefit from their continued existence, so there's a bigger pool of contributors to the parser.


> And you don't need just any parser, you need very robust ones that can deal with malformed files well.

But why? Shouldn’t the code you push into a repository be at least syntactically correct? And even if it is not, one can simply fallback to textual diff.

> Or, if you only pick a small set of supported languages, your diff tool will not work on most files or have to fall back to a structure-agnostic algorithm.

I don’t see how it is a blocker.


> Because it is much easier, you don't have to build and maintain parsers for hundreds of languages.

Seems there's a good open market for such a lazy reason.


It's really hard! :)

(1) Parsing an arbitrary language is hard. Without tree-sitter, difftastic would probably be a lisp-only tool. You also want a parser that preserves comments.

(2) Inputs may not be syntactically well formed.

(3) Efficiently comparing trees is extremely difficult (difftastic is O(N^2) in time and memory).

(4) Displaying tree diffs is equally difficult. Alignment is particularly challenging when your 'unchanged before' and 'unchanged after' are syntactically the same, but textually different.


Performance is one, I guess.


Also there are a lot of languages out there, each with their own special and unique syntaxes.


What happens when you have an invalid file/AST?


tree-sitter inserts error nodes and gives you an AST for all inputs. It seems to work well in practice.


Understanding syntax would be really amazing for merges (not sure if it's even possible), but for diffs I don't immediately see why I should use that over simpler syntax-unaware tools. Highlighting the actual change in a string is important, and so is ignoring whitespace, but diffsofancy -w does it just fine. What else would I need? (Well, I guess the only use-case I can see from the demo is 2 compact changes in a single line, but… meh.)

On the other hand, even though my diffs are usually not that huge, sometimes they might be, and I don't want to switch tools every time that happens (I have just git alias and I don't even remember my exact config, nor should I care). So being slow is not great.


Delta is a pager that does syntax highlighting, better diff highlighting and improved output for existing git commands. Highly recommended.

https://github.com/dandavison/delta


I find ydiff more useful, especially for the side-by-side output: https://github.com/ymattw/ydiff

I use it via a "git-ydiff-s" script in my PATH, which lets me run "git ydiff-s":

    #!/bin/sh
    git diff "$@" | ydiff -s --wrap --width=0

Installation is "sudo dnf install ydiff" or:

    curl -fsSL https://raw.github.com/ymattw/ydiff/master/ydiff.py > ~/bin/ydiff
    chmod +x ~/bin/ydiff  # (and change to python3)


It might be useful for reviewing merge/pull requests. But is there a way to display the diff "interleaved" instead of 2-columns side-by-side? (when executing `GIT_EXTERNAL_DIFF=difft git log -p --ext-diff` for example)


There's a basic single-column 'inline' display available if you do `INLINE=y`, but it's not as mature as the side-by-side display yet.


We are working on a code review tool which supports unified diffs with semantic diffing. If that sounds interesting for you, take a look at https://mergeboard.com


Checked out the repository.

Build instructions? Nope.

Minimum system requirements? Nope. But if you check out Cargo.toml, you'll see it says it needs Rust 1.56.

My system has 1.48.0. And it's the latest Debian release! I don't see how a diff tool can expect you to have a bleeding-edge development environment. I mean, ok, you chose a new language - I can understand that; I won't demand that it build with just a C compiler and Make. But come on, this is not supposed to be just a toy for new systems.

Anyway, I still cloned it, tried to build with "cargo build", and got stuck with:

error: unexpected token: `include_str`

it couldn't even tell me "get Rust 1.56" :-(


I wonder if it would be possible to do this in a one-column format. That would make it more useful in a lot of contexts where a super wide view isn't practical.


I use meld and it seems syntax aware, plus it can do a merge with a click. How will difftastic differ in that regard?


I use meld too. But afaics, meld's 'syntax aware' is very different from difftastic's.

Meld takes a diff, and applies syntax highlighting over the diffed files. It additionally highlights the changed characters in a line. Git diff, vimdiff and probably others, do this as well.

From the demo, I understand that Difftastic first applies syntax and then rebuilds the patch over that. Being aware of line wrapping, changes in nesting, moving codeblocks into functions and so on.


The quick examples seem like they could all be solved using git diff's --ignore-all-space option.


Unfortunately it's closed source, but https://www.semanticmerge.com/ has been around for a few years and works similarly, but can also merge.


I just spent a few minutes on that site and I can't even figure out how to try it out, or their pricing, or anything other than some very superficial docs, really.

Is this just a pretty website, or is the software actually available anywhere?


That page is just the technology primer. The tools are XDiff & XMerge:

https://www.plasticscm.com/pricing

Looks like there's no locally-run-binary/non-SaaS version. I was hoping it'd have a Sublime Text-like model. I have no interest in trying to get my team to switch, nor in having to deal with the security team when it turns out I was using a free cloud account.


What does it do for unsupported languages? Just fall back to "regular" diff?


Yep! It does a conventional textual diff: run Myers' diff algorithm on lines, then word highlighting on changed lines.
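A rough sketch of that two-pass shape (lines first, then words on the changed lines). This is not difftastic's code: it uses Python's `difflib.SequenceMatcher` as a stand-in, which is not Myers' algorithm, and the `[bracket]` marking and function names are made up for illustration:

```python
import difflib

def line_then_word_diff(before, after):
    """Diff by lines, then mark changed words inside replaced lines."""
    old_lines, new_lines = before.splitlines(), after.splitlines()
    out = []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out += ["  " + line for line in old_lines[i1:i2]]
            continue
        removed, added = old_lines[i1:i2], new_lines[j1:j2]
        if tag == "replace" and len(removed) == len(added):
            # Second pass: word-level diff on each pair of changed lines.
            for old, new in zip(removed, added):
                out.append("- " + mark_words(old, new))
                out.append("+ " + mark_words(new, old))
        else:
            out += ["- " + line for line in removed]
            out += ["+ " + line for line in added]
    return out

def mark_words(line, other):
    """Wrap words of `line` with no counterpart in `other` in [brackets].

    Splitting on whitespace means runs of spaces are collapsed in the output.
    """
    words, other_words = line.split(), other.split()
    matcher = difflib.SequenceMatcher(a=words, b=other_words, autojunk=False)
    marked = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        for word in words[i1:i2]:
            marked.append(word if tag == "equal" else f"[{word}]")
    return " ".join(marked)

for row in line_then_word_diff("x = 1\nprint(x)", "x = 2\nprint(x)"):
    print(row)
# - x = [1]
# + x = [2]
#   print(x)
```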


Does a `magit` plugin exist for Emacs users? The author of this package is also the author of a couple of popular Emacs packages but I did not see any mention of Emacs.


It won't be able to form the basis of a magit plugin because it does not target traditional diff format.


To be more precise, Magit could easily display the output of difft, but Magit wants to be able to do more than that. You can navigate diffs by hunk and by file, you can collapse hunks and files, you can even select individual lines of the diff to stage or unstage them, apply or unapply them, etc. Strictly speaking that’s probably not impossible with difft, but because difft has the explicit goal of displaying diffs to humans rather than producing machine–parsable output, it won’t be easy. I still want it to happen though.


The documentation says:

> Difftastic output is intended for human consumption

Why not separate the human-consumption part and the underlying parsing part? Or at least provide both in the same utility?


The underlying parser is just tree-sitter, which is a reusable (and excellent) parsing library.

Difftastic then converts the tree-sitter parse tree to a simpler s-expression style format (see https://difftastic.wilfred.me.uk/parsing.html#simplified-syn...), and computes differences on that.

I'm just trying to clarify that I'm not generating conventional 'unified diff' patches, so I can provide a nicer interface (e.g. line numbers).


BTW IIRC the tree version of Levenshtein distance has (proven) terrible complexity. But so does LCS, and diff itself performs great in practice, so maybe...


This is great. I previously used Code Compare by Devart for this purpose, but it has been abandoned without support for modern IDEs.


Any plan for the Scheme programming language ?


I'd like to add it, but I haven't found any good tree-sitter parsers for Scheme.


A diff tool for binary files:

https://diffoscope.org/


Now let's get a WASM build into GitHub. :)


What would be also cool is a syntax aware custom merge driver for git, but that's probably even harder.


It supports Elixir and C#! Too bad it doesn't do Erlang and F#

It looks very handy though. I still do a lot of C and C++


Is there a VSCode extension for this?


Yes! Why did it take so long for this to be invented? So obviously and amazingly useful.


First thing that came to mind is diffing python notebooks.


For Jupyter Notebooks I highly recommend trying out jupytext, which converts Notebooks on the fly to a number of formats. It really has been a game changer for working with git and Notebooks for me. I essentially never want to preserve state of the notebooks anyway so converting just makes sense. The best thing is it is completely transparent, i.e. it generates a notebook file when you open the other file and saves to the file every time the notebook is saved. If you want to keep the state of the notebook you can always keep that file around as well.


Don't think this tool supports that, but there is https://nbdime.readthedocs.io/en/latest/


You hit us up when we can build & install it.


Looks nice! Now I only need patchtastic :-)


Actually, the README addresses that!

  > Non-goals
  > Patching. Difftastic output is intended for human consumption, and it does
  > not generate patches that you can apply later. Use diff if you need a patch.


Now I want a 3 way merge version :-)


how do i install on macbook to try? Can you give some instructions in the getting started?


    brew install rust
    cargo install difftastic
Worked for me without any problems.


From https://difftastic.wilfred.me.uk/getting_started.html, it's installed via Cargo, so if you already have Cargo installed it's straightforward, otherwise you can install it via https://doc.rust-lang.org/cargo/getting-started/installation...


So how can one use this in vim?


The nine year old inside me can’t unsee the unfortunate choice of names used in the basic example :)


No support for Zig :(


Difftastic has support for ~20 languages, and I'm happy to add more if there's a decent tree-sitter parser available :)


I like the name.


Ok I really gotta try this.


good idea, but so dangerous though


Why?


because it's an automated piece of software making decisions about what is an "equal diff" and what is a "difference diff" because a diff no longer means just a change, it now has to be a meaningful enough change.. If you removed something like `if (true)` or whatever, that's still a diff that could have some importance and/or unknown consequences. I appreciate the value, but the fact that it allows refactoring to be a non-diff would worry me in the long run I think.


Difftastic is only ignoring whitespace that isn't significant. If you remove `if (true)`, it will get highlighted.

With a textual diff today, your only choices are 'highlight all whitespace changes' (e.g the git default) or 'ignore all whitespace' (e.g. diff --word-diff).

If difftastic says there are no changes, then both files have the same parse tree and the same comments.
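The "same parse tree" criterion can be illustrated with Python's own `ast` module. This is only an analogy, not how difftastic works (it uses tree-sitter), and unlike the description above Python's `ast` throws comments away:

```python
import ast

def same_parse_tree(source_a, source_b):
    """True if two Python sources parse to the same AST (positions ignored)."""
    # ast.dump() omits line/column attributes by default, so layout-only
    # differences disappear. Note: Python's ast also drops comments.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))

compact = "total = add(1, 2)\n"
reflowed = "total = add(\n    1,\n    2,\n)\n"
changed = "total = add(1, 3)\n"

print(same_parse_tree(compact, reflowed))  # True
print(same_parse_tree(compact, changed))   # False
```

Reflowing the call changes every line textually but nothing structurally, while the one-character edit to an argument is a real change.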


Finally.


Yes.


If you have consistent code style and formatting this tool is unnecessary. I think that solution is better, you get a more consistent code base that is easier to read for humans. (Also diffs will be faster to compute)


> If you have consistent code style and formatting this tool is unnecessary

I disagree. I struggle to replicate it right now using a simple test, but I've seen the following rather infuriating and counterintuitive behaviour from Git/GNU diff. If you have a simple if statement such as:

    if (bla) {
      // do something
    }
And you were to add another statement at the end, after the closing curly brace, e.g.:

    if (bla) {
      // do something
    }

    if (bla2) {
      // do something else
    }
Git/GNU diff will sometimes show the following diff:

    diff --git 1/left 2/right
    index c2ea6f1..dc0e1c2 100644
    --- 1/left
    +++ 2/right
    @@ -1,3 +1,6 @@
     if (bla) {
       // do something
    +}
    +if (bla2) {
    +  // do something else
     }
This is a basic example, but there are other similar cases. For a simple change like the above, this isn't a huge issue, but for bigger patch sets, it can take a minute to understand what is really going on.
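One way to see why a line-based diff can land on that output: both alignments are equally cheap. A small sketch, assuming the "after" file has no blank line between the blocks (as in the hunk above), that enumerates every way of keeping the three original lines inside the new file:

```python
from itertools import combinations

left = ["if (bla) {", "  // do something", "}"]
right = left + ["if (bla2) {", "  // do something else", "}"]

def alignments(old, new):
    """Yield every way to keep all `old` lines, in order, inside `new`."""
    for kept in combinations(range(len(new)), len(old)):
        if [new[i] for i in kept] == old:
            yield kept

for kept in alignments(left, right):
    added = [i for i in range(len(right)) if i not in kept]
    print("kept:", kept, "added:", added)
# kept: (0, 1, 2) added: [3, 4, 5]
# kept: (0, 1, 5) added: [2, 3, 4]
```

Both options add exactly three lines, so nothing in the text alone says which closing brace is the "old" one; that takes knowledge of the syntax.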


Right, I frequently get angry at just how dumb diff really is. How it's greedy and can't recognize the best seams between blocks of code. But then when I think of simple rules that would improve the results, I see how they would lead to other problems in other places. So using syntax seems necessary.


There is an option [0] to use non-default but still built-in git diff algorithms that might yield better results.

[0] https://git-scm.com/docs/git-diff#Documentation/git-diff.txt...


I've used a few of the different git diff algorithms and still have had problems like these.


Even if you are consistent, having unchanged indented text show up differently is very clever. I often end up reviewing a diff that moves a basic block into a conditional branch and have to scan each line to see if it changed.


If you're using a language that doesn't depend on indentation (C, Java, Go, Rust etc), try "diff -b" or "git diff -b".

The indented basic block won't show as a difference, only the start and end of the block.


interesting. Is -b equivalent to -Xignore-all-space in git?


I run all Python through `black` and `isort`; this is still a huge step up in my book in terms of readability and ergonomics compared to the standard `git diff` or GNU `diff`.



