I would maybe be interested in Git allowing you to plug in your own diff generators for different file types.
But I would not want Git itself trying to understand the contents of files. That seems to me to be an idea that lives on a misconception of the "things programmers believe about names" variety. Not every file in source control is source code. Not every programming language's grammar maps to an abstract syntax tree. In some files, such as makefiles, the difference between tabs and spaces is semantically significant. Some languages (such as Fortran and Racket) have variable syntax. And so on and so forth.
So I think that we really don't want the source control system itself trying to get too smart about the contents of files. That will inevitably make the source control system less compatible with the various kinds of things you might want to put into source control. And it will also make the source control system a lot more complicated than it would otherwise be, in return for a largely theoretical payoff.
But if we want to delegate the work of generating diffs off to other people, so that Git can allow for syntax or semantics-aware diffing without having to personally wade into that quagmire (and perhaps also allowing language communities to support multiple source control systems, a bit like how it works with LSP), that might be an interesting thing to experiment with.
I looked this up and for anyone wondering, it's called "diff/merge drivers", but there are only a handful of them out there. Some highlights from a few minutes of searching:
One big caveat of this is that since git doesn't really store just a stack of diffs, despite the fact it presents itself as such to the user, a custom merge driver will not make your .git grow any less than it would normally.
>One big caveat of this is that since git doesn't really store just a stack of diffs, despite the fact it presents itself as such to the user,
>a custom merge driver will not make your .git grow any less than it would normally.
You can go a long way with simple homespun methods. SQL dumps delta-compress in Git very well - especially if you tweak the dump options a bit so that the row order is mostly stable.
The same most probably goes for any other textual dump - if the (unchanged) content is largely ordered in stable fashion, it will delta-compress very well, even without any explicit, specialized support in Git itself.
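For example, with MySQL something along these lines keeps unchanged rows on unchanged lines, which is exactly what Git's delta compression likes (mysqldump shown here; other dump tools have equivalent switches, and the path is just illustrative):

    # one INSERT per row, rows in primary-key order, so small data changes
    # become small textual changes in the dump
    mysqldump --skip-extended-insert --order-by-primary mydb > backups/mydb.sql
    git add backups/mydb.sql && git commit -m "nightly dump"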
> One big caveat of this is that since git doesn't really store just a stack of diffs, despite the fact it presents itself as such to the user, a custom merge driver will not make your .git grow any less than it would normally.
Note that git does support using deltas for storage. But according to the docs, custom diff drivers aren't used for those; instead it's an instruction-based format.
This never went anywhere near production, but it was very easy to put together something basic.
One complication with using a custom merge driver, as discussed by https://github.com/Praqma/git-merge-driver , is that it needs to be configured inside the `.git/config` of the repo, which itself is not version controlled. So there's additional config-management overhead in rolling that out to every machine. Additionally, if you outsource hosting of your git repos, the hosting platform may not support installing and configuring a custom merge driver for the merges it conducts (e.g. merges created by the github.com pull request workflow).
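For reference, the wiring is split in exactly that awkward way: the attribute can live in a versioned .gitattributes, but the driver definition itself has to be set up in config on each clone. A sketch, with a made-up driver name and command:

    # .gitattributes (versioned with the repo)
    *.json merge=jsonmerge

    # per-clone configuration, e.g. via:
    git config merge.jsonmerge.name "structural JSON merge"
    git config merge.jsonmerge.driver "json-structural-merge %O %A %B"

Git invokes the driver with temp files for the ancestor (%O), ours (%A) and theirs (%B) versions, and expects the merged result to be written over the %A file, exiting 0 on a clean merge.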
One idea I had at the time was using external schema files (e.g. JSON schema for JSON files) to help guide/constrain the result of the merge. I never implemented it, but it should be possible. If the schemas were also version controlled in the same git repo that stores the data, you'd need to figure out which one (and which version) to load when resolving a merge conflict of a data file. There doesn't seem to be a well-supported robust way for a merge driver script to discover the source and destination branches, but there are some potentially fragile ways of doing it that work some of the time.
That you can't properly parse Perl without also running Perl doesn't mean it doesn't have an AST. I don't know Perl well, so it could very well be that it does not have an AST, but your argument doesn't show that it doesn't.
Perl code changes how the parsing will be done: by adding keywords, by changing how many arguments the parser will look for, and even by arbitrarily changing the language entirely with a source filter. It's largely that last bit that makes creating a consistent or coherent AST so impractical it may as well be impossible.
I disagree. Many engineers want to refactor across a sequence of small PRs, for example. Small PRs are a good thing, because they’re easier to understand. But today, Git makes this painful. Also, understanding how the meaning of code changes over time can help reduce bugs.
The solution will have to be pluggable. But I think it is possible, and there are sane things to do (e.g. fall back to vanilla git) when there are missing plugs.
Not only that, but imagine you realize there is a bug in the parsing tool. Now you have to go back and re-parse the code, or otherwise just deal with a bad history forever. Suddenly you’re storing text again.
I do kind of love the idea of Git using ASTs instead of source code. It makes a ton of sense.
Even just in the immediate term I wish I could make Git(hub) tabs/2 spaces/4 spaces/whatever agnostic. Seems crazy to me that in 2021 we still have to make opinionated choices across orgs about what to use... why can't we pull the code down, view it in whatever setup we want, then commit a normalized version?
[whispers] this is actually something tabs allow you to do natively by setting custom tab widths in text editors but I've given up trying to sell people on tabs at this point and just want to be able to do my own thing
It's not that you're going too far, it's that you're not going far enough!
It's not a Git question, it's a programming language question. There's no reason source code needs to be stored as plain text[1]! Editors show it as text, we edit it as text, but why wouldn't it be _stored_ as an AST? Not only does formatting become an editor concern, but code could even be edited as a tree, as a graph, as whatever you want.
[1] - well, actually there's plenty of reasons: chiefly because plaintext is very interoperable
It profoundly is. You can't store "an AST". You can only store a serialization of it. The official language grammar is a serialization of the AST custom crafted for that language. It is as much an "AST" as any other serialization would be; all such alternative representations would produce isomorphic memory representations if parsed from a proper library.
At a high level it may sound useful to try to then provide a cross-language AST representation, but it's one of those things that sounds great at a high level but as soon as you actually tried to implement it for, say, Python and C++, you'd rapidly discover that in practice there's not as much opportunity for "generic AST operations" as you may think.
The problem isn't that it isn't "stored as an AST" but that $YOUR_LANGUAGE apparently doesn't have good libraries or mechanisms for getting at it. Go, for instance, ships with the relevant bits of the compiler exposed, and as a result there are tons of tools that operate on Go code as ASTs and not textually, because it's readily available and supported by the core language team. I use this only as an example I know personally, there are other languages that have similar sorts of support as well.
> It profoundly is. You can't store "an AST". You can only store a serialization of it. The official language grammar is a serialization of the AST custom crafted for that language. It is as much an "AST" as any other serialization would be; all such alternative representations would produce isomorphic memory representations if parsed from a proper library.
But a code file isn't coming out of a serializer when you hit save. It has all kinds of idiosyncrasies beyond the AST it turns into, and the user expects them all to be preserved. That makes it pretty profoundly not an AST.
What do you call “an AST plus the additional information (such as whitespace info) needed to recover the original text the AST was parsed from” (if that has a name)? I think I saw it mentioned somewhere as having a name.
I feel like you're picking a strawman here. The AST serialization everyone is implying is one where you don't need to tokenize/lex but can just load it directly & manipulate it (i.e. the on-disk version is a valid AST, or one whose validity can be trivially checked without needing the entire language syntax & grammar). First, that makes the compiler faster, because tokenization/lexing is moved to the "save" phase, which happens infrequently at human scale, versus the compile/processing phase, which happens in an automated fashion where the overhead can be noticeable. Additionally, if you mmap the AST from disk into memory, you can use finer-grained caching to memoize expensive analysis, giving faster compiles of code that has only changed slightly (e.g. changing whitespace/comments wouldn't recompile anything).
More importantly for advocates, it avoids needing to ship the deserialization library and makes tooling simpler. That's really why the idea of a simple AST format is so attractive. Compiler frontends are typically very tightly coupled to the underlying middle and back end. There's some work in some languages to decouple this (e.g. LSPs & IDEA's fallible parsing approach), but the efforts are still very immature & it's still not clear to me that it's worth it (see the last paragraph).
The main underlying challenge with making sure the on-disk contents are well-formed according to the syntax rules is that you frequently want to pause work at an intermediate stage. This means you either have to make sure that whatever state the user saves is a valid AST via editor tricks (although I think this also typically means you have to design the language around it), or you reject saves, or every tooling library has to be capable of parsing malformed ASTs, or you save a dirty transformation to apply on top of the last known saved version, so the user can resume editing while other tooling uses the "last known good" version. That's the real challenge with having a serialized version that's amenable to 3p tooling for interop.
Finally, all the "serialize the AST" solutions ignore the problem of wanting to grep the codebase. This means you need to change out several decades of line-oriented manipulation tools in favor of new ones that are AST-based & likely more complicated to write/maintain as compared with one-line regular expressions. At least I've yet to see any AST manipulation libraries that aren't drastically different from existing text manipulation tools if clang-tidy and Rust macros are any indication about what good solutions to the problem look like today.
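To make that contrast concrete, here is a rough sketch of a line-oriented search next to an AST-oriented one, using Python's ast module (the src/ directory and the parse_config name are just illustrative):

    import ast
    import pathlib

    # line-oriented: any line mentioning the name, including comments and strings
    for path in pathlib.Path("src").rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "parse_config" in line:
                print(f"{path}:{lineno}: {line.strip()}")

    # AST-oriented: only actual calls to parse_config()
    for path in pathlib.Path("src").rglob("*.py"):
        for node in ast.walk(ast.parse(path.read_text())):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == "parse_config"):
                print(f"{path}:{node.lineno}: call to parse_config")

The second version is more precise, but it is per-language, it is more code than a one-line grep, and it falls over on files that don't currently parse.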
I think eventually we'll get AST serialization, but I think it will be packaged into an entirely new language (like Rust did with ownership) that also considers the tooling aspect end-to-end rather than as a retrofit into existing languages. Once that's successful, then I think we'll see retrofits because the space will have been better explored & other languages will benefit from the R&D into what a successful path would look like.
> that makes the compiler much faster because tokenization/lexing is moved to the "save" phase which happens infrequently at human scale
For dynamic languages like Ruby or Python, storing a pre-parsed representation makes it a little faster. But for compiled languages, lexing and parsing tend to be swamped by the codegen step.
If you think about it more broadly where you can memoize expensive results of AST -> code gen transformation or AST -> AST simplification, then this will help significantly for codegen, especially for incremental builds but also clean builds if you have your CI cluster sharing build cache information with your local devs.
Also, for a language like Rust, I'm not sure that there isn't a significant amount of time spent validating ownership & doing type inference. These are the kinds of analysis you could save into the AST & thus save a significant amount of build time when talking about large projects. I agree for smaller projects a lot of these optimizations are probably unimportant.
For Rust in particular, there is just such an intermediate representation, called MIR, which has the desugared AST, control flow graph, type inference and lifetime information baked into it, so it can be reused to speed up incremental builds.
But that's pretty far from what the programmer wants to interact with, in the context of a thread where we're discussing whether the source code is just a serialized version of the AST.
(Also, with Rust, type checking and borrow checking are expensive, but codegen still dominates)
Yup. I’m not claiming anyone has successfully applied the general concept of memoizing expensive compiler operations at the AST level. There are all sorts of practical complexities that come up with such a design, but conceptually I think it should be doable. Using your “the text is the serialized AST” argument, we already see primitive tools built around this concept that operate just on the text contents and the codegen end-result, with ccache/sccache/bazel/etc.
However, using a richer AST representation and knowledge about what expensive operations the compiler performs should let you retain cache acceleration even when the high-level text -> compiled-code cache misses. Heck, one could imagine that some AST processing could end up being agnostic to the codebase, so that the compiler ends up memoizing certain results across all code bases, since a lot of code is very similar and has similar constructs that should codegen very similarly.
> This means you need to change out several decades of line-oriented manipulation tools in favor of new ones that are AST-based
I wonder if a generic binary->text tool/library could solve this. Grep could check the file mime type then call the tool to convert from the binary format to the text format if available. I could see this being useful for a lot of binary formats.
One could imagine a scheme where opening and modifying "module-1.txt" would cause "module-1.ast" to be updated, and vice-versa. I actually think I've seen this approach used in some experimental language or another. I'm sure there would be a LOT of tricky edge cases and/or synchronization bugs to address, but it seems like it's not very far outside the realm of feasibility.
And this is why I love Python. It forces people to use the same coding standard in some regards, and it forces people to indent properly.
I really don't care anymore what that standard might be (well, ok, I do prefer tabs) but I do care that it be consistent. And I do DEMAND that proper nested indentation be respected. Source code is meant to be human readable.
Python is not unique nor innovative in that respect. Even FORTRAN and COBOL programs from the early days had very strict rules about indentation and blocks.
The thing is, there's no reason we have to store the code as text. Even the punch card got this right: the program wasn't stored as text, it was stored as physical holes in paper. A very experienced programmer could often look at a card with just the holes and have a rough idea what it encoded, provided they knew if it was EBCDIC or ASCII or whatever, but the computer didn't care. The printed representation across the top line of the card was just that: a representation.
I didn't claim that Python is unique nor innovative. However, FORTRAN and COBOL are not modern languages in the sense that one can reasonably expect a large selection of first- and third- party libraries for most common situations, and their availability on e.g. servers is far more limited, thus learning them just for scripting is not as good a choice as is Python.
We store them as serialized text. We could store them as nodes in an AST; we could store them as OLE/CFBF structures like older versions of Microsoft Word, or do what Ted Nelson suggested decades ago: T. H. Nelson, “Complex information processing: a file structure for the complex, the changing and the indeterminate,” in Proceedings of the 1965 20th national conference, New York, NY, USA, Aug. 1965, pp. 84–100. doi: 10.1145/800197.806036.
In the context of this thread, where we are discussing Git as a medium for storing the form of the programs in a format that is meant to be read and maintained by humans, we do store the programs in text.
No, in the context of this thread we are storing programs as punch cards, which are not meant to be read or maintained by humans. They are the executable form of the software, that's all.
> There's no reason source code need to be stored as plain text
The same can be said for lots of different documents, and it's been true for programs like Word for a long time. See also [1] T. H. Nelson, “Complex information processing: a file structure for the complex, the changing and the indeterminate,” in Proceedings of the 1965 20th national conference, New York, NY, USA, Aug. 1965, pp. 84–100. doi: 10.1145/800197.806036.
Serializing structured data into ASCII streams makes it very hard to then deserialize and re-structure.
Plain text might be the lowest common denominator for Unix/shell tools, but we can do far far better in how structured data is exchanged, which would make it much easier to programmatically manipulate & process.
Don't forget bugs. Bugs in git are a nightmare scenario. There are so many complexities with just plaintext; there's no way AST diffs get broad support. Also, ASTs needn't always be simpler than plaintext, e.g. C++ macros.
Three shall be the number of the counting and the number of the counting shall be three. Four shalt thou not count, neither shalt thou count two, excepting that thou then proceedeth to three. Five is right out.
> Tabs are 8 characters, and thus indentations are also 8 characters. There are heretic movements that try to make indentations 4 (or even 2!) characters deep, and that is akin to trying to define the value of PI to be 3.
Math trivia: there are cases on which it is sensible, in sufficiently advanced mathematics, to define pi as 3 (or whatever other number).
I don't use tabs, but I'd say that the biggest advantage of using tabs is that everybody can configure their own editor to make them as wide as they wish.
Tabs do work as long as they aren't fixed width (I don't know what you mean by "custom").
For instance, in many languages, one will sometimes have to split a function call to many lines, and in most languages function names aren't of fixed length, thus in order to get a correct alignment for parameters, the tab width at that point will have to match the function name length.
I agree with your idea of storing a normalized version of the code in the repo: it wouldn't then matter whether that version contains characters to align the code properly, it would just be inserted by the editor/linter as needed. The difficulty is that sometimes linting isn't enough, and some manual formatting is needed. Or perhaps the formatting rules are under specified?
Another issue with AST diffing is when languages allow some form of syntactic sugar as preprocessing: the compiler might just see the simplified tree, not the one with the "sugary" forms. A tool capable of parsing such languages should also be able to handle these extensions.
Obviously readability is subjective, but personally I find alignment is never valuable outside of tables of data, and I'd argue generally having tables of data embedded in your code isn't ideal.
Yes, that was more or less the argument I was trying to make by the end of my comment: diffing is meant to be space agnostic (the -B option of diff(1)), but since code is committed with formatting, formatting will interfere.
That is the kind of solution that ends up giving you two problems. The new problems: having both tabs and spaces, having to think about when to use which, having to train/document everyone on the usage, having to debate that usage, and having to correct code and chastise people who get the usage wrong.
Use an automatic code formatter with minimal options. Automate either running the formatter on commit, or denying commits that change when the formatter is run on them.
Absolutely, I wouldn't dream of doing any kind of fancy alignment by hand, only with an auto-formatter.
If I had to break arguments onto multiple lines without an auto-formatter I would just keep it simple and use another level of indentation instead of aligning them with the function name.
This will work with C (for now), but it will fail with a language like C++ where you can have lambdas. When a lambda is a function parameter, it will need to be aligned with the other parameters, but the lambda body will require indentation.
Of course, there are ways of avoiding the problem, eg. by using a variable for the lambda, yet isn't this sweeping things under the rug?
There won't be any difference. OP will run the script when they checkout, work with their tabs, then run the script when they commit. Spaces in, spaces out.
Oh ok, sure. But then that's just a weak version of what's being requested - an entirely neutral, formatting-agnostic, abstract format that stores the meaning without any formatting at all.
fwiw, this is what we do in Dark [1]. We store (serialized) ASTs, then we pretty-print them in the editor. This converts the AST into the tokens that you see on your screen, complete with configurable* indentation, line length, etc. Code is displayed according to your config*, and the same code is displayed differently to a different developer looking at it.
One of the practical issues here is, if your code fails to compile in CI with an error like

    /home/ci/src/foo.c:123:45: error: use of undeclared identifier 'a'

or

    /home/ci/src/bar.py:50: syntax error in type comment

or crashes in production with an error like

    java.lang.NullPointerException
        at com.example.Baz.doThings(Baz.java:1337)
you really want to be able to find line 123 column 45, line 50, or line 1337 in your editor, and have that be the same line as what your CI compiled and deployed.
On its own, tabs vs. spaces only affects columns, and you can probably figure things out without columns (although it's a shame to lose it). But different tab sizes affect how long your lines are, and line wrapping is a thing that people care about at least as much as tabs vs. spaces (people with different size monitor or fonts will easily see too-long or too-short lines on their display; if your spaces are equivalent to the tab stop, the distinction is literally invisible). And once you start rewrapping lines, everyone's line numbers are different.
I think it's possible to solve this by using some sort of AST-based index into the file and teaching IDEs to let you seek based on that, but it's suddenly a more complex problem.
No, I don't think that's the same problem / the same solution. A source map translates between a layout checked into the code and a format generated at build time. I'm talking about translating between a layout in a developer's local workspace and the layout checked into the code.
Since the developer can choose whatever formatting options they want, there isn't a single source map that can be referenced in the compiled version of the code, so backtraces etc. So the transformation cannot be done at the point the error is displayed (compiler output or backtrace output), it has to be done in the context of the developer's local workspace.
I think source maps could probably be inspiration for solving this problem, but I don't think they would work directly - and even if they did, the real problem here is not designing a solution, it's getting everyone's IDEs to work properly with it. Source maps work largely because the major browsers know how to deal with source maps in JS. You'd have to extend this to all the other ecosystems, at the very least.
We don't ship source maps with production apps. This is for JS apps, but it's similar with executables and their debug files.
All the logging is done through some service like Sentry, but exactly where the error happened is opaque on the client side. Sentry is responsible for doing the source map translation.
The only difference is that with "personalized" formatting, the source mapping would be done based on the user that is logged in to Sentry.
That's it.
Of course there are other workflows such as support getting errors displayed directly to customers via copy-paste, or reading server logs to look for those errors, but for apps where you have a logging solution you don't need to rely on those.
> admit it, tabs are fragile and a pretty weak implementation
Could you elaborate? I don't have a personal opinion here and have only worked in orgs that require spaces, but I'm not familiar with the criticisms of tabs.
For me the problem happens as soon as tabs are used for alignment, instead of just indent. The benefit of tabs is custom tabstop. If anyone does anything that undermines that benefit, you might as well use spaces to avoid all the problems caused.
Consider the following code:
    if (x)
    {
        SomeMethod(parameter1,
                   parameter2,
                   parameter3);
    }
If done "properly", it is:
    if (x)
    {
    <tab>SomeMethod(parameter1,
    <tab><spaces...>parameter2,
    <tab><spaces...>parameter3);
    }
What I often see, that totally breaks the entire point of tabs:
    if (x)
    {
    <tab>SomeMethod(parameter1,
    <tab><tab><tab><space><space>parameter2,
    <tab><tab><tab><space><space>parameter3);
    }
The same thing happens if you are trying to align table-style code:
    var badMixedTypeArrayExample = [
        [ "some",         true,  128,      x ],
        [ "long strings", true,  8,        someLongVariable ],
        [ "and",          false, 16384,    x ],
        [ "short",        true,  12345678, anotherVariable ],
    ];
If tabs are used between fields, it will look like a hot mess to anyone with a different tabstop than the author.
Which is easy to say, but hard to make everyone do correctly. First you need to ensure that everyone uses an editor with a "visible whitespace" option, and turns it on, so they can see whether they have the right whitespace. Then you get to spend precious programming time turning one kind of whitespace into another since most editors will get it wrong when they auto-indent.
Either use spaces everywhere so you have total control over the layout or forego alignment (other than block indentation). Mixing tabs and spaces is a path to madness.
This is part of the reason why editors for programmers and editors for general text editing are not the same thing.
I have F11 in emacs bound to whitespace-cleanup, which takes care of it all for me. And supertabs mode in general works just the way it should with tabs-indent/spaces-align.
Then there's also clang-format, possibly used as a post-receive hook in git (and some other VCSes), which mostly makes it irrelevant what the programmer's editor did.
Can whitespace-cleanup really differentiate between places where you want a tab for indentation, and places where you want spaces for alignment, where the number of spaces may be greater than the tab width? After all, the only way to differentiate is to guess what you might be trying to align to by looking at surrounding lines… but from a quick search, whitespace.el looks less complex than that.
I totally agree; I personally hate this style of code! However, people still write it (in the same way they screw up tabs+spaces), and in some code bases it's "the style" they use.
I've also seen a lot of SQL and LINQ (C#) written in this way, as well as things like:
The main supposed advantage of tabs is that everyone can set their own custom preferred tab-width and be done with it, but this advantage doesn't actually play out in practice:
* There's usually a maximum line length restriction as well, so you need to know what the tab-width is to figure out if a line needs to be broken into multiple lines.
* There are also cases where you need exact-column alignment, even across multiple indent widths - for example, a continuation line that has to line up with an open paren on a tab-indented line.
So in practice, tab width for a project is actually fixed to a particular value. And then you discover that wrong-tab-width code becomes quite annoying to read. I hate reading GNU style guide code, which uses 8-space-tabs but indent-width of 4, because the indenting is unreadable unless I mess with the tab spacing for an individual file I'm reading.
Alignment can just be done with spaces. This can then be enforced by a style checker.
But the maximum line length problem is real. I would be 100% for tabs if it wasn't for this issue and imo it's the only real criticism you can make that doesn't have a good solution.
The good solution to the line length problem is to not be strict about them. My line length rule is usually "stay roughly within 100 spaces, 120 is too long." If you are seriously undermined by lines being too long, then your text editor choice/setup might be worth revisiting.
That sounds a lot like a hard limit. The difference between 2-space and 8-space tabs is huge, so you still have to specify what tab width you're using and someone who prefers a narrow tab width could still accidentally blow past the hard limit if they're not careful.
The alignment in the comment above doesn't work with tabs: your initial line is going to be tab-indented, which means if you want those next lines to align with it you don't have any options for it to work.
Now, I tend to find it's better to just avoid that kind of alignment in your code style completely (just push the first arg to a new line so you're not space-ing everything out a mile to match the function call open paren) but if that's your style then you can't really do it with actually variable tab widths.
My solution to the maximum line length problem is simple: if a piece of code has more than 3 level of indentation, I start to think about refactoring, and never let it have more than 5.
I find that any reasonably complex code with 5 levels of indentation becomes difficult to read, and with more levels the difficulty grows exponentially.
At least with languages I have significant experience, keeping the levels of indentation low never was a problem. But I never did serious amount of coding in LISP or Python, for example, so I don't know if this is practical in such languages.
It only works for initial indentation, so people who like columnar layouts are kinda screwed. Auto-tabbing tools will take n spaces and turn them into a tab, which screws things up.

Let's just take the whole idea one step further and either use tools that reformat based on agreed-upon styles (meaning a developer could reasonably take the source, project it into their preferred style, and project it back out again),

or store the canonical version as structured data in a database and always project it into some text for viewing.

Broader adoption of formatters has drastically reduced the number of pointless and emotional formatting arguments I've gotten into. Let's push that further.
Reading this article, I feel as though the author doesn't deeply understand git.
git works on blobs of data, not files, and not lines of text. It doesn't just happen to also work on binary files- that's all it works on.
Now, if the author is suggesting that git-diff ought to have a language specific mode that parses changed files as ASTs to compare, now I'm interested. Let's do that. I'll help!
But git does not need to change how it works for that to happen. Git does not even need git-diff to exist to serve its main purpose.
Rebases and cherry-picks work by applying diffs, not by copying blobs. Auto-merging also needs to look at file content as text; you can't auto-merge a binary file with git.
It's an often repeated fact that if you look inside Git, it doesn't work with diffs, it works with blobs. But if you look closer, it's often diffs again!
With cherry-picks (and thus rebase), you ask git to turn a commit into a patch, so it does just that.
I would mostly consider auto merges (which I guess are bolted on) as the main area where git itself uses diffs during resolution and even then only as a suggested resolution (you get warned and need to confirm it when validating the merge).
So no, it's blobs all the way down. Darcs and Pijul are patch based though.
It's true that git is blob based, as opposed to patch based, but it's not the full picture! In practice, git stores a lot more diffs (or rather, deltas) than it stores loose blobs. (And you probably know this already, but I feel it's still worth making explicit)
This is necessary, because when a repo accumulates commits, it becomes a lot more efficient to store most of the objects as deltas instead of separate blobs. If Git didn't do this, it would have a lot of copies, and they would take a lot of space.
So the fundamental model of git is truly based on blobs in theory, but in practice many or most git commands will operate on packfiles, and if you look in your .git object store, most likely you will have a few big packfiles containing most objects, and then a much smaller collection of loose blobs.
All those diffs are what the "resolving deltas" progress indicator is about - the one people see when they do a big clone, fetch, or checkout =)
> In practice, git stores a lot more diffs (or rather, deltas) than it stores loose blobs.
The diffs it stores are not the diffs you see in git diff.
They're rolling checksum based chunks. The data that the delta is computed against is picked with a heuristic ("sort by name and date, try the top 10, and use the smallest result"). And, in practice, the heuristic diffs the older files against the newer ones, rather than diffing in chronological order, so that getting recent data doesn't involve a lot of delta application.
The git deltification is better thought of as a compression method than as diffing.
Packfiles and deltas are a storage and transfer optimization for blobs. Any access to them stores and yields blobs.
It is for all intents and purposes just an internal serialization format, akin to how a filesystem is just a serialization format that makes all your data one large stream. One generally talks about the provided interface (files for a filesystem, blobs for git) rather than where the bits actually go.
Compression algorithms are also to some extent diffs, as they serialize to a sequence of "repeat previous segment and add this new data" commands, but it is not useful to consider them as such.
A merge is a commit with two parent commits, pointing to a new tree that contains the blobs from both parents. It does not modify any blobs, nor does it modify the parent commits. The full history of all activity is retained.
A merge conflict is a case where both trees changed the same blob since their common ancestor. In this case, you have to make a new blob yourself (the "resolution") for use in the merge commit's tree, instead of using one of the parent blobs.
Squash is "Remove all commits from C_newest to C_oldest, and create a new commit using C_newest tree". Rebases just run another git action for every commit in a sequence, e.g. cherry-pick.
> A merge is a commit with two parent commits, pointing to a new tree that contains the blobs from both parents.
The second parent is metadata. The way a merge works is essentially to compute the commits you need to cherry-pick, then cherry-pick them without committing, resolving conflicts in pretty much the same way git cherry-pick does, THEN commit with two parents.
> It does not modify any blobs, nor does it modify the parent commits. The full history of all activity is retained.
A new commit containing merged content is created, as well as a merge commit with the second parent that documents that a merge happened and what was merged.
> A merge conflict is a case where both trees changed the same blob since their common ancestor. In this case, you have to make a new blob yourself (the "resolution") for use in the merge commit's tree, instead of using one of the parent blobs.
git merge does the same thing for automatic (and manual) conflict resolution as git cherry-pick. So does git-rebase.
> Squash is "Remove all commits from C_newest to C_oldest, and create a new commit using C_newest tree". Rebases just run another git action for every commit in a sequence, e.g. cherry-pick.
Rebasing is constructing a set of operations:
- construct a set of commits to pick as the commits between (the merge-base of HEAD and the selected commit) and the selected commit
- git checkout the --onto HEAD
- cherry-pick the selected commits
An interactive rebase lets you drop commits, add commits, edit, reword, or fixup/squash commits.
Squashing a commit is essentially doing `git cherry-pick --no-commit` of the to-be-squashed commit and then `git commit --amend` to replace the HEAD commit with a new commit that includes the changes staged by `git cherry-pick --no-commit`.
Yes, it really is this simple. I aver that it is easier to understand the above than to think of merging and rebasing and cherry-picking as fundamentally different operations.
> The second parent is metadata. The way a merge works is essentially to compute the commits you need to cherry-pick, then cherry-pick them without committing, resolving conflicts in pretty much the same way git cherry-pick does, THEN commit with two parents.
Good luck explaining an N-way merge with this approach, such as the 66-way "cthulhu merge" that is 2cde51fbd0f3 in the linux tree.
All parents are metadata, they do not contribute to the content of the commit other than their "parent" line in the commit object after the merge finished.
> A new commit containing merged content is created, as well as a merge commit with the second parent that documents that a merge happened and what was merged.
A merge only produces one commit: The merge commit, pointing to the tree of the merged content. It is a completely normal commit, having multiple parents like any commit can.
The tree of the merge commit may contain new blobs not present in any of the parents if conflict resolution was required. Otherwise, the new tree is simply a combination of the parents' trees.
> Rebasing is constructing a set of operations: <snip>. An interactive rebase lets you drop commits, add commits, edit, reword, or fixup/squash commits.
Yup, that's what I wrote.
> Squashing a commit is essentially doing `git cherry-pick --no-commit` of the to-be-squashed commit and then `git commit --amend` to replace the HEAD commit with a new commit that includes the changes staged by `git cherry-pick --no-commit`.
I think most associate squashing with the act of reducing a foreign branch into a single new commit as a merge strategy (as opposed to fast-forward or merge).
What I described was squashing commits on the current branch, while you're describing squashing a single foreign commit into HEAD. Technically neither is what `git merge --squash` does, as that doesn't produce a commit at all.
> Yes, it really is this simple.
Well, I find your description complex (and it introduces inconsistencies), as it tries to describe plumbing in terms of porcelain, which is backwards and honestly one of the main reasons I think people are confused about git.
> The tree of the merge commit may contain new blobs not present in any of the parents if conflict resolution was required. Otherwise, the new tree is simply a combination of the parents' trees.
Sure, but for me this is the common case.
> I think most associate squashing with the act of reducing a foreign branch into a single new commit as a merge strategy (as opposed to fast-forward or merge).
I don't. I git rebase -i often to squash commits.
> But each to their own I guess.
Whatever works. However, I find that when people focus on the semantics of merging, they then don't care to understand cherry-picking or rebasing, and they miss out on those very useful concepts. Whereas understanding what the process looks like helps one (me anyways) unify an understanding of all three concepts. I much prefer understanding one thing from which to derive three others than to understand those three things independently.
There's also a historical angle here that's important to inspect - Git was designed to specifically be content agnostic. There are some predecessors in the SCM space (like VSS) that are specifically language aware and allow the checking out of line ranges (pinning them so that no one else will make a conflicting change specifically) and even entire functions - these systems can cause a lot of grief while failing to protect the logic they're specifically trying to protect. As the warts on SVN got more and more visible I think the general assumption was that the replacement SCM would come out of this code aware space - but it didn't and in retrospect we all dodged a huge bullet when that happened.
I absolutely adore tooling around git that makes diffs more visible - one thing I absolutely gush over is anything that can detect and highlight function reordering... however, the core process of merging and rebasing and all that jazz - I don't think we're going to find anything automated that I'll ever trust when I'm not working on a ridiculously clean codebase - minor changes can have echo effects and when two people are coding in the same general area they need to be aware of what the other person is trying to do.
I dunno I feel like you're focusing on a detail that's not particularly relevant. The author's main thrust is precisely what you described about parsing changed files as ASTs.
It isn't relevant to the author's vision of content-aware diffing, but it is relevant to the author's complaints about how Git's (alleged) text-based-ness makes Git awkward to use with Jupyter notebooks. Has the author tried searching the web for "git diff jupyter"?
The git extension on VSCode is already pretty good at doing diffs on jupyter notebooks.
I distinctly remember this not being a core feature of stock git and needing Jupytext to enable version control on notebooks. So, I feel like this sort of language specific stuff is already happening, but not in any unified product.
I'm in the process of building a programming language for UI designers, and realized that diffing the AST (or some other kind of object notation) would be far more useful and understandable. I'll probably be digging into this exact problem within the next year or so.
Storing AST instead of source code is one of the goals of the very interesting Unison programming language: https://www.unisonweb.org/
Part of what's nice about Git (and plain text in general) is that it's the lowest common denominator for a lot of things. This is why traditional Unix tools are built oriented around streams of bytes. Text is a low level carrier protocol; you can encode almost anything in it, but you need to agree on some kind of format.
The good part is that you can use very very generic tools on almost arbitrary pieces of data. The bad part is that you might have to do a lot of parsing and re-parsing of the same data, and you have to contend with the dangers of underspecified formats.
Git follows the Unix tradition in this regard. As a result, it is nearly universal in what it can store. You can use it to store pretty much anything, but you are now at the lowest common denominator of support for any particular data format.
Git-for-ASTs will no longer have this universality property, but will gain a lot more power in the covered domain. This is a design tradeoff.
One thing that's nice about Git is that you can specify arbitrary diff drivers with the "attributes" system. So even if the Git database is storing plain text, your diff driver can parse your source code into ASTs and present AST diffs to you when you run `git diff`. Perhaps more impressive, you can configure custom merge drivers, so you can (theoretically) implement semantic merging of ASTs right inside Git.
There are probably some fundamental limitations of this system, because the underlying data is still stored as blobs of bytes. But you can get pretty far as long as you don't mind parsing and re-parsing the same text over and over.
I don't see how this could ever work for evolving languages: different Git versions would produce different commits and read commits differently based on, say, the latest C++ standard. This could lead to version control bugs where different Git versions create different results from the same commit, and that is horrible; version control needs to be 100% bug free in that regard.
The only reasonable application would be to use a language AST parser to better identify relevant text diffs, but the commits still need to be stored as text.
This doesn't really make sense, because in order to have those code changes compile correctly, there must be a corresponding commit to the CI config that changes the compiler version or compiler switches for the new language version. The "semantic-diff-er" can also be driven by that commit such that it uses the correct language version.
`git` generally doesn't work with lines of text. Mostly it works with opaque file blobs and directory trees.
`git diff` and `git merge` work with lines of text by default - but they don't have to. You can supply your own `diff` and `merge` tools with the `difftool.*` and `mergetool.*` config options, try them out with the `git-difftool` and `git-mergetool` commands, and set the defaults with the `diff.tool` and `merge.tool` config options.
If someone wanted to create AST-based diff and merge tools for a given language, they could be plugged right into the existing `git` infrastructure and it would work with them absolutely fine.
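A minimal sketch of that wiring, assuming a hypothetical ast-diff/ast-merge pair of tools on your PATH:

    git config difftool.astdiff.cmd 'ast-diff "$LOCAL" "$REMOTE"'
    git config diff.tool astdiff

    git config mergetool.astmerge.cmd 'ast-merge "$BASE" "$LOCAL" "$REMOTE" -o "$MERGED"'
    git config merge.tool astmerge

    git difftool HEAD~1    # view the last commit's changes through ast-diff
    git mergetool          # resolve current conflicts through ast-merge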
This feature is useful in so many different places. I use it to diff small encrypted files in my repo - just add `gpg -d` as a diff configuration and now I can use git log, diff etc in a meaningful way with binary files.
I've heard of people using it with pdfs as well - a pdf to html converter lets you get a good idea of what changed in the document.
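For anyone wanting to replicate that, it's just the textconv mechanism; the file pattern and decryption command are whatever suits your repo:

    # .gitattributes
    *.gpg diff=gpg

    # per-clone config; git passes the file path to the command and reads stdout
    git config diff.gpg.textconv "gpg --quiet --decrypt"

After that, git diff and git log -p show the decrypted contents instead of a binary-files-differ notice.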
What if generating a diff is nontrivial? Say you rename an identifier. That might be a single command in an IDE. A sufficiently high-level "diff" format could easily capture that intent. But working backwards from hundreds of touched lines across many files to deduce that single semantic edit is not trivial. Git assumes that arbitrary diffs can be deduced from "before" and "after" files, but this isn't the case - it may be that you'd rather generate the new file from the diff!
> If someone wanted to create AST-based diff and merge tools for a given language, they could be plugged right into the existing `git` infrastructure and it would work with them absolutely fine.
There's a lot of tooling in the Eclipse modelling ecosystem which could easily be used for this. Storing XML-based models in git is no problem, and there's tooling for diffing and merging models via a GUI or programmatically. Combined with the fact that xtext DSLs use EMF models to represent ASTs, it wouldn't be too hard to glue together an AST-based diff/merge tool for an xtext DSL.
> `git` generally doesn't work with lines of text. Mostly it works with opaque file blobs and directory trees.
I am not sure this is true.
In the past it gave me problems with line ending normalization between windows/mac/linux, in and out. In those cases it definitely had a lines of text view of things.
It is generally true, but yes; automatic line ending conversion is an exception. You can turn it off with `git config --global core.autocrlf false`, though be aware that can cause issues if you have developers on different operating systems creating and committing files with different line endings.
Our tool uses git as the foundation of its functionality. It superimposes git diffs on top of ASTs.
It is insanely powerful.
For example, we use it to power semantic code search and currently support Python, Javascript, and Java. We generate a JSON object describing the AST differences between the initial and terminal commits on GitHub PRs. A full text search on the JSON objects performs surprisingly well when we want to answer questions like, "When did we add dateutils as a dependency?" or "When did we last change the /journals handler on the API?"
The Python integration currently sees the most use but if you are interested in other languages, we would be happy to support it.
Do drop me a DM if you want help getting started with Locust.
Whenever I do Clojure, something that can get difficult when working with multiple people is how the parentheses/brackets/braces stack up, especially when everyone seems to have different opinions on how that works. As a result, if you're not careful, when there's a merge conflict you can have a ton of extra parentheses, which can be irritating to debug.
Obviously this is at some level an issue inherent to Lisps (and to be clear, I love Lisps, and these small headaches are worth it), but I think problems like that could be reduced if our source controls were aware of the ASTs.
Git can use an external tool for merging, so there could be eg a Clojure merge plugin even now. Apparently there are some commercial merge tools that Java programmers use for this like SemanticMerge. After all you hit the similar curly braces merge problems with other languages.
Yeah, I've long thought a diff tool that works on s-exprs instead of lines would be invaluable for Lisp programming. It doesn't seem like it would be too hard to write, either, although getting GitHub etc to use it seems like it would be its own challenge...
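A first cut really is small. Something along these lines parses both files into nested tuples and reports top-level forms that were added or removed, completely ignoring line breaks and indentation (strings, comments and reader macros are not handled here, so treat it as a sketch):

    import sys

    def parse(text):
        # split into tokens, treating parens as their own tokens
        tokens = text.replace("(", " ( ").replace(")", " ) ").split()

        def read(i):
            forms = []
            while i < len(tokens):
                if tokens[i] == "(":
                    sub, i = read(i + 1)
                    forms.append(tuple(sub))
                elif tokens[i] == ")":
                    return forms, i + 1
                else:
                    forms.append(tokens[i])
                    i += 1
            return forms, i

        return read(0)[0]  # list of top-level forms

    old_forms = {repr(f) for f in parse(open(sys.argv[1]).read())}
    new_forms = {repr(f) for f in parse(open(sys.argv[2]).read())}
    for f in sorted(old_forms - new_forms):
        print("-", f)
    for f in sorted(new_forms - old_forms):
        print("+", f)

Diffing within a form, and getting GitHub to render any of it, is of course the hard part.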
Git is designed to require human oversight. This is usually a feature, but in recent years has become a bug with things like GitOps.
It's important to remember that Git is a terrible database because of its lack of semantic structure. All conflicts require a human who has the context. This is why almost no one builds a system that uses Git as a two-way interface. And when they do, it's via GitHub Pull Requests (which go to humans) and not Git itself.
In all, this makes it a wonderful general purpose shared filesystem. And that's about it.
The output could be a lot more compact, it could do better at adding context (in the same way https://github.com/romgrk/nvim-treesitter-context does, etc), but if you're interested in this it's really within reach, go help out.
> The fact that git works on lines of text [...] we could be looking at the alterations to the abstract syntax tree.
Fundamentally git does not operate on text, it operates on files (a content-addressed SCM, not a ledger of text diffs); diffs are generated upon request between arbitrary Merkle trees. So there is no need to implicate git in such a tool, it can be independent:
    GIT_EXTERNAL_DIFF
        When the environment variable GIT_EXTERNAL_DIFF is set, the program
        named by it is called to generate diffs, and Git does not use its
        builtin diff machinery. For a path that is added, removed, or
        modified, GIT_EXTERNAL_DIFF is called with 7 parameters:

            path old-file old-hex old-mode new-file new-hex new-mode
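So an AST-aware differ can be bolted on entirely from the outside. A minimal sketch of such a program, where the structural diff tool it shells out to is a placeholder for whatever you prefer:

    #!/usr/bin/env python3
    # A GIT_EXTERNAL_DIFF program. For added/removed/modified paths git passes
    # the 7 arguments above; old-file/new-file are temp files (or /dev/null)
    # containing the two versions of the path.
    import subprocess
    import sys

    path, old_file, old_hex, old_mode, new_file, new_hex, new_mode = sys.argv[1:8]
    print(f"=== {path} ===", flush=True)
    subprocess.run(["my-structural-diff", old_file, new_file])  # placeholder tool

Mark it executable and run GIT_EXTERNAL_DIFF=/path/to/that-script git diff, and git itself never needs to know anything about your language.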
There's a good blog post about auto-merging JSON/XML structured data files (for game content) on the bitsquid blog from 2010:
> having content conflicts is no fun either. A level designer wants to work in the level editor, not manage strange content conflicts in barely understandable XML-files. The level designer should never have to mess with WinMerging the engine's file formats.
> And conflicts shouldn't be necessary. Most content conflicts are not actual conflicts. It is not that often that two people have moved the exact same object or changed the exact same settings parameter. Rather, the conflicts occur because a line-based merge tool tries to merge hierarchical data (XML or JSON) and messes up the structure.
> In those rare cases when there is an actual conflict, the content people don't want to resolve it in WinMerge. If two level designers have moved the same object, we don't really help them address the issue by bringing up a dialog box with a ton of XML mumbo-jumbo. Instead, it is much better to just pick one of the two locations and go ahead with merging the file. Then, the level designers can fix any problems that might have occurred in the level editor -- the right tool for the job.
I don't understand why GitHub hasn't solved the issue of diffs starting with a '}' (or ')' or 'end'). Just slide the diff over while it starts with a closing token! I suppose it's an artifact of the diffing algorithm, but aren't there better diffing algorithms, even built-in within git?
This is by far the most obvious example of "git doesn't understand programming languages", but it also seems like the most straightforward to fix.
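For local git diff at least, there are a couple of built-in knobs aimed at exactly this, whether or not web UIs honour them:

    git config diff.algorithm histogram    # or patience, instead of the default Myers
    git config diff.indentHeuristic true   # shift hunk boundaries to more natural lines

Neither makes git syntax-aware, but they can noticeably reduce the "hunk starts on a lone closing brace" effect.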
It is because diff is syntax agnostic. You might be able to get away with this hack in some cases, but it complicates the algorithm and will break in other cases (what about nested brackets? multiple brackets on one line?). Once you want to handle this properly you need a syntax-aware diff algorithm, and some resources are linked elsewhere in this discussion.
I’ve done quite a lot of work on version management of structured data (in my case this was for a version-managed GIS database) and it’s not an easy problem, and it is likely even harder with something like an AST that is generated from a text file and so does not preserve the identity of nodes. I’m not saying that it’s impossible, but it is more work and requires more tooling around it than people think, and it keeps coming up here and in other places as a “really good idea.”
I'm trying to remember the citation, but I remember seeing a presentation once from someone who studied this and they said that the thing that worked best was a hybrid approach: use structured diff at the top level of the program (modules / methods) but use line-based for statements and expressions. According to them, the structured diff can give unintuitive results if applied at the lowest syntactic levels.
I’d give anything just to get a few basic merge modes. For example, “this file can treat two one-line additions as unordered”.
So any shared append-only file (a change log, an enumeration,…) doesn’t automatically conflict.
Syntax aware diffing would be great too, but I’d take something much simpler. For syntax aware stuff I’d love something that could tell semantic changes from noise.
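Part of that first wish actually exists: git ships a built-in union merge driver that keeps the lines from both sides instead of conflicting, which works tolerably for append-only files (it makes no promises about ordering or duplicates, so don't point it at code):

    # .gitattributes
    CHANGELOG    merge=union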
The blogger does not understand Git, fundamentally.
Git does not work with text. It stores snapshots of artifacts.
The diffs that you see when you use the various commands like git log -p are recovered from the snapshots, when those artifacts happen to be text files.
Git absolutely works with texts when you connect it with external representations and tooling, such as when you "git format-patch" and then "git am" to import that; and the rebasing workflows obviously have textual merging with conflict resolution. Still, that seems like something that could be externalized. A language-specific three-way-diff tool can handle a merge by parsing all three pieces and working with ASTs. It's something that could be developed later, yet still work with your old commits.
Interesting they mentioned Jupyter Notebooks but not NBDime https://github.com/jupyter/nbdime which is a Jupyter plugin specifically to address this problem. Without it, diffing notebooks is not feasible.
I'm surprised they didn't mention Unison (https://www.unisonweb.org/), whose big idea is an immutable content-addressable store of ASTs. I really hope it changes everything.
Except that Unison created its own language, which pretty much guarantees that they are doomed to fail.
I don't know if there is a technical reason for the new language or if it's NIH syndrome.
It's normal to experiment with new concepts in a new language built around that concept. Then eventually, if it's a good idea, other languages get the hint.
Back in the early 2000's Visual Age for Java allowed you to version individual methods. Since Visual Age for Java was derived from Visual Age for Smalltalk (and was actually written in Smalltalk) I suppose it inherited the capability from there.
I've wanted this for a while, but I will say there are some caveats. Sometimes I want to commit just as a "it's the end of the day, I want to leave, here's a code dump". I suppose you could have multiple tiers of code saving.
I've also wondered about whether you could do code analysis with time as a dimension. If you can analyze the evolution of the code and pull old implementations, what can you do? Autocomplete is a good example, as it can pull previous patterns you've used. Maybe some way to tell the programmer "hold up, you've made this mistake before, don't do it again"? I'm not sure.
Yes, but if we're talking about some hypothetical tool that requires a valid AST, there might be a situation where I don't have a valid AST and want to save the code. Similarly, I had a job where we used pre-commit hooks that ran a linting script. I had to override the hook to commit which was slightly annoying at times.
Theoretically it might work, but I don't think I am too fond of the idea. I once had git pull completely trash my source, and it would have been nice for Git to have more intelligence there, but in the end I think some of its success lies in its simplicity.
SVN isn't too bad, and there's not much of a difference from Git if you use a central repository anyway. The main neat thing was having just one hidden folder, rather than one in every subdirectory.
Git would also need the ability to transform from AST back to source for every language. That's a bit unrealistic, and there is no benefit to it. You could also do that with assemblies and some metadata for the decompiler.
I think the only useful way to implement AST-level diff/merge for non-trivial codebases would require the compiler to provide the parsed AST, since per-file ASTs would lack a lot of context. You could also ask the user to provide a separate file or files that describe the code topology, but why bother when the compiler can spit out an AST itself? A diff tool which targets a few of the bigger build systems (CMake, Maven, Gradle) and compilers might work, and could worry about small build environments after gaining momentum.
Didn't the VisualAge IDEs do this with their built-in version control? This was 20 years ago, and I seem to remember that the version control was at the method level, not file level.
I feel this is just an example of "worse is better": the whole proposition is interesting but totally impractical, and I would not like Git to go anywhere near that idea.
Ever since I learned about Git merge strategies and wrote a very basic one myself, I've been wanting to write one that syntactically understands a bit of the test framework code we use at work. It is super annoying when you copy a test because you want to vary a very specific case and Git gets all confused about what code is and isn't the same.
(yeah I know I should break out the copied part but who always has time for that)
The problem with a tool that depends on structured data is that it only works with structured data.
Of course the problem with a tool built for unstructured data is that it's dumber than it need be, and when you do treat the data as structured, it's ad-hoc and often buggy.
When talking about how "powerful" a tool is, there's always this tension between structured and unstructured.
If you’re interested in this sort of thing you might want to look at Dolt (for sharing databases in a git-like way) and Pijul, which records diffs explicitly, rather than calculating them on the fly.
I wonder if there might be a clever way to encode source code in a Dolt database? Maybe each function should be a record?
I'm just a bit more "generally" curious. Is `git` being the _only_ DVCS a good thing? Not to say that `hg` or `darcs` don't exist, just that the hub on top of git has pushed us in a singular direction.
I would like to see, at least academically, something more.
The choice of DVCS tooling is ancillary to the success of GitHub. People learned Git so they could use GitHub, not the other way around. At least, this is the way I remember it.
If someone comes along and builds the best forge software ever, but uses Mercurial instead of Git, I'll bet a lot of people would switch technologies at some point. Until then, I'd say most people use GitHub because GitHub works for them, and they use Git because that's how you interact with GitHub. They don't care about the ivory tower benefits of their particular DVCS tooling, all they care about is easily collaborating with their teammates.
It would definitely be great if you could have a GitHub-like experience using Mercurial or Darcs, but so far I haven't seen anything close to that.
It sounds inspirational and revolutionary, but the longer I think about it, the less utopian it feels: you'd lose track of pieces of code and of how things are linked.
Files also provide an opportunity to purposefully communicate organizational intent.
I'm fine with the line-oriented nature of Git. I wouldn't want to be confronted with problems due to the constantly shifting nature of languages versus whatever plugin exists for them in Git. Even C is constantly changing.
I'm actually working on a VCS based on this idea and on tracking changes to binary files based on their structure as well. (It turns out that the same techniques work for both.)
I’ve had loosely similar ideas before. The basic idea is to make the compiler toolchain aware of diffs, and have it help scrutinize and implement them. Refactoring suggestions could be included with the diff.
For example, say you’re depending on a module and it renamed a class/method/trait/macro/constant/whatever, or a synchronous method has become async or vice versa.
The diff could include programmatic instructions for consumers to apply to their code bases, switching them over to the new method. This could be as simple as semantically changing the name used, or, in the case of changing sync to async, it could add `await` in the appropriate spot.
There’s no limit to how complex the rewrite rules could be. You could totally reorganize the parameters to a function and ship that refactoring, or even add a parameter along with the default code necessary to provide it.
Unlike the author, I don’t think code in Git is likely to ever move beyond source, plus the refactoring instructions needed to adopt a change from a dependency (perhaps in a macro-like syntax).
Source control requires too much text manipulation for me to conceive of it storing anything but human-readable program text in any future I can imagine. Machines can already parse it; there doesn’t seem to be a compelling reason to store some other kind of structure.
Refactoring wouldn’t need to be any special Git extension, just a file accompanying the commit with instructions for the language tool chain.
Your IDE or CLI could interactively walk you through everywhere it’s being applied, or you could apply it all and review the result in the app that consumes the module.
This would also open up security risks from accepting diffs from dependencies and applying their refactorings, but unless modules are sandboxed quite well, that’s a risk you take with updating dependencies anyway. And you can always scrutinize the refactoring diff manually after it runs, before accepting it.
The industry would probably standardize on the notion that a change that requires running automated refactoring from a dependency across your codebase is a major version change; in other words, a breaking or backwards-incompatible change, just one that’s much easier to upgrade to.
Languages with macros or other programmatic transformers would be well suited to this concept I think.
Maybe Rust macros could be enhanced for this purpose to pattern-match over an existing codebase somehow: not just at the macro invocation point, but anywhere, e.g., a given trait or function is used; and then the output of the macro would not feed into the next stage of the compiler but would instead rewrite code on disk, producing a diff that you examine and apply to your code.
A capability like this would make it much easier to manage large aggregate codebases consisting of many dependencies. OSS package maintainers or infrastructure providers at companies could ship nominally backwards-incompatible changes that are still actually compatible once you run the macro transformer that updates the code using them.
For a simple example, imagine that `foo()` was previously a function and the implementer decides to add some optional parameters (or parameters with default values) and turn it into a macro `foo!()`. The accompanying transformer would semantically identify references to `foo()` and make the necessary updates. You could rename global constants or traits or other code elements this way.
Consider the way in which Google Guava has had to evolve over time. A number of its features have become part of the Java language, and the corresponding classes were gradually deprecated and removed. With a compiler facility like what I am describing, users of Guava could run the transformer to migrate older codebases that use methods like Guava’s `Preconditions.checkNotNull(Object, message)` over to Java’s now-standard `Objects.requireNonNull(T obj, String message)`. Because the maintainers of Guava wanted to keep it modern and current, avoid redundancy, and design it in the best way they knew how, they made a number of breaking changes for which the project lead later apologized [1]. Most changes wouldn’t have been painful if accompanied by automated refactoring.
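To make that concrete, here is a deliberately dumb, purely textual stand-in for the kind of rewrite rule I mean, using the Guava/JDK methods mentioned above (a real tool would match on the AST and also fix up imports; this Python sketch only shows the shape of “a rule shipped alongside the release”):

    import re
    from pathlib import Path

    # One hypothetical rewrite rule shipped alongside a library release.
    # A real transformer would be semantic (AST-based); regexes are only a sketch,
    # and a real rule would also add the java.util.Objects import where needed.
    RULES = [
        (re.compile(r"\bPreconditions\.checkNotNull\("), "Objects.requireNonNull("),
    ]

    def apply_rules(root):
        """Apply every rule to each .java file under root, leaving edits for review."""
        for path in Path(root).rglob("*.java"):
            text = path.read_text()
            new_text = text
            for pattern, replacement in RULES:
                new_text = pattern.sub(replacement, new_text)
            if new_text != text:
                path.write_text(new_text)   # the resulting diff still gets reviewed
                print(f"rewrote {path}")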
You could allow the transformer to produce code that still needs work from humans to finalize and compile. In that case it could change as much as it’s able and leave instructions at each call site.
At my company we have bots that submit proposed code changes to our codebase that need to be taken over as author by a human, reviewed, sometimes lightly edited, and shipped, and they work quite well. One bot finds unused code and submits diffs to remove it if it’s been in the code repository long enough. Another detects when launch experiments have been at 100% on one treatment for a long period of time (meaning the feature has launched) and submits code removing the experiment check. The latter sometimes also removes surrounding code that would become unused once the experiment check is removed.
These have provided meaningful value in helping keep the codebase tidy and I look forward to more automation like this in the future, including diff-aware compilers and refactoring tools.