Git can't be made consistent (bramcohen.livejournal.com)
166 points by mark_h on April 18, 2011 | 36 comments



Another interesting case: http://www.kernel.org/pub/software/scm/git/docs/howto/revert...

I'm not sure if the "forget whatever happened" metaphor works for me. In the "revert a revert" article above, the problem is that merging a topic branch doesn't cause the first few commits on it to be applied if those commits were already merged but then reverted -- the revert has no effect on the merge. This is precisely because every commit object, since it includes its parent's sha1, uniquely determines a history of changes, and commits that are in both ancestries don't get re-applied.
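
For anyone who hasn't hit it, the sequence the howto warns about looks roughly like this (a hedged sketch; the branch names are made up):

    git checkout main
    git merge topic        # topic's commits are now in main's ancestry
    git revert -m 1 HEAD   # undoes the merge's *content*, but the merge commit
                           # (and topic's commits) stay in the history
    # ...later, after more work lands on topic...
    git merge topic        # only topic's new commits take effect; the earlier,
                           # reverted changes do not come back automatically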

In Bram's example, you have the opposite problem -- two commits are semantically the same but were made independently and have different sha1s. If Linus were drawing this diagram he would label them B and B' (and so on... there's a lot). To git, B' is totally different so a merge applies the change "again". If the other person had noticed this and reset their branch to the first B, the merge would be a fast-forward.
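
To make the B / B' distinction concrete, here is a rough sketch of how such duplicates usually get created (the branch name and the <sha-of-B> placeholder are illustrative):

    git checkout -b mine main
    git cherry-pick <sha-of-B>   # creates B': same diff, same message, but a
                                 # different parent and therefore a different sha1
    git log --oneline            # B' shows up with an id unrelated to B's
    # To git the two are distinct commits, so a later merge between the branches
    # treats the change as having been made twice.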

IMHO, the Don't Do That should apply to creating those commits (by cherry-picking, or not rebasing duplicated work) rather than merging. Not because such commits are morally wrong or something like that, but because git intentionally ("the stupid content tracker") doesn't handle them well. That's the tradeoff of the nice object model.

Our git workflow at my job is pretty messy and does run into this sort of stuff. I'd love something just a little more darcs-y, like say grafting together the two branches in the second example that arrive at the same content (without having to manage a local grafts file separate from the repository), but that opens many other cans of worms that I'm sure I'm not intelligent enough to deal with.


Thanks for mentioning B'; at first I did not understand what the problem was. Merging B' will turn A into B', which is equivalent to B. This causes confusion if the author of the left branch intended to get rid of B.


>I have a little secret for you: Git can't be made to have eventual consistency

David Roundy, the initial author of darcs, seems to disagree on this. From https://github.com/droundy/iolaus :

> I realized that the semantics of git are actually not nearly so far from those of darcs as I had previously thought. In particular, if we view each commit as describing a patch in its "primitive context" (to use darcs-speak), then there is basically a one-to-one mapping from darcs' semantics to a git repository.


'actually not nearly so far', 'if we view each commit', 'there is basically' - many differences and gotchas can lurk in such qualifiers.


I am surprised to see this post from Bram Cohen, as he himself had a heated argument with Linus Torvalds about git's design.

http://www.gelato.unsw.edu.au/archives/git/0504/2153.html http://news.ycombinator.com/item?id=505876


Which he lost... i.e. Linus paid no attention to what Bram said ;)


For some context: Git follows the same architecture as Codeville, so Linus didn't invent the idea (although he ripped it off from Monotone, not Codeville). The argument was essentially about whether a simple three-way merge can be used in all cases; the answer is no, because of criss-cross cases, and solutions for that have since been put into Git. It is the case that semantics which more closely resemble three-way merge are preferred, though, as explained in my post which this thread is about, but for reasons which nobody in the flame war you link to appreciated at the time.


In short: Don't be a dummy and expect git to be some kind of advanced artificial intelligence.


Someone who knows this stuff please tell me if my analysis is correct:

I see smart people pointing out "flaws" in software, not realizing the solution requires strong AI.

(e.g. yesterday's post: http://news.ycombinator.com/item?id=2455793)


Merging associativity doesn't require magic AI.


Sorry - I'm not sure what "merging associativity" is - can you give an example?

The above article gives us a simple example (A vs B) of a situation where doing the right thing requires a human (aka strong AI), because you need to know the "intent" of the commit.

Is there a simple solution - or even a complex one - which would not require a human to verify?

For a similar analysis of yesterday's post, see this comment: http://news.ycombinator.com/item?id=2455970


Merge associativity would be where taking an initial state and merging commit A then merging commit B (where A and B are commits created independently but from a common start point) always creates exactly the same result as merging in commit B followed by commit A. The word "associativity" in this instance is being used in the same sense as it is used in basic arithmetic: (1+A)+B === 1+(A+B) === (1+B)+A and (1xA)xB === 1x(AxB) === (1xB)xA.

The merge processes used by Git and other common source control systems are associative for most circumstances where the two (or more) merges affect different parts of the code (including different parts of the same source file). The issue tends to raise its ugly head when the two merges affect the same lines. For instance:

    Original:     Commit A:     Commit B:
    line 1        line 1        line 1
    line 2        line 3        line 2
    line 3        line 4        line 3 updated
    line 4                      line 4
If you merge in that order, line 2 will get put back, since to simple inspection that looks like what is intended (merging in A removes line 2; merging in B inserts line 2, which to the merge algorithm is now a new line, and updates line 3). If you merge B first then line 2 is gone from the result (merging in B updates line 3, with line 2 not needing to be touched as it is the same, and merging in A after that removes line 2).

    Merge A then B:     Merge B then A:
    line 1              line 1   
    line 2              line 3 updated
    line 3 updated      line 4
    line 4       
It isn't just deletes/inserts that are affected: changes to the same lines can produce similarly inconsistent results depending on merge order. The trouble is that for a DVCS it is impossible to consistently deal with these situations without a manual merge (or AI better than we currently have). Either output could be the intention and without context other than the original state and the two commits you can't tell one way or the other.

A centralised source control system doesn't have this problem because, as far as the repository is concerned, there is one and only one timeline: commits happen in one order, so the second will always override the first where there is a question. This doesn't mean that the centralised VCS would be correct, though, just that it would be consistent.

With either a centralised VCS or a DVCS, where a three-way merge can be used (the start point of each commit is known, so the comparison is done between the commit, the original state and the current state), a merge conflict could be flagged for these issues, but a human still needs to make the final decision, as no algorithm can be consistent (or correct) 100% of the time without a universe of extra context.
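
You can poke at git's file-level three-way merge directly using the example above; a hedged sketch (file names invented, and exactly how the hunks get grouped can vary between versions):

    printf 'line 1\nline 2\nline 3\nline 4\n'         > base.txt   # original
    printf 'line 1\nline 3\nline 4\n'                 > a.txt      # commit A
    printf 'line 1\nline 2\nline 3 updated\nline 4\n' > b.txt      # commit B
    git merge-file -p a.txt base.txt b.txt
    # Because the delete and the edit touch adjacent lines, this typically comes
    # back as a conflict for a human to resolve rather than being auto-merged.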

If you were presented with the commits above, would you know what should be done with line 2? Does the change in line 3 depend upon it existing, so you must keep it, or is it irrelevant, so you should delete it (A says delete, B doesn't care either way)? Even if you knew that commit B was done later than commit A, that wouldn't mean that it is necessarily the one to trust, and in any case there might be a more complex set of commits with a mix of conflicts where A is right in some cases and B in others.

People expecting Git to be associative in these instances are (by my understanding) asking for the impossible. Perhaps the merge algorithm could be made a little more intelligent, but I doubt it could ever be 100% correct or consistent (where consistent implies the associativity of merges). Remember that what we are dealing with here are edge cases (unless you have lots of people working on the same areas of the source tree at any one time, in which case you should probably consider a more hierarchical distributed repository arrangement), and changing the behaviour will likely create other, similar, edge cases, so it is probably not worth spending many man-hours tweaking the merge algorithms instead of introducing a little human intervention into the potentially inconsistent situations. Any changes that get "lost" due to the wrong decision being made by the automatic merge algorithm or the human will still be present in a good source control system (unless you have explicitly told it to purge them), so they are not lost forever.

Caveat: I've not used Git (or any DVCS) in anger yet, but I have been reading around the area with the intention of starting to use it to track my personal projects and perhaps recommending it (or something similar) for consideration at work. This is an issue that I thought about a while ago, and I'm thankful for this recent discussion as it has reaffirmed what I decided back then: these are edge cases that are safe to ignore until the rare occasion when they happen, at which point nothing is lost (I may just have to make some decisions manually and/or raise a new commit to revert changes that were "made in error" due to the inconsistency). Of course I lack the experience needed to confidently suggest I can't be proven completely wrong on the matter!


Thank you for the explanation.


It does if you want to obtain the 'expected' results while retaining merge associativity.


Merging can't be totally automated. Merging requires "intelligence" whatever that means.


More like: semantically correct merge requires AI. Technical correctness, whatever the definition is, may be achieved algorithmically.


The problem is not the intelligence of the merge algorithm. The problem is that in this case there is no right answer.


Note that darcs implements the "expected" or "naive" semantics, at the cost of edge cases that have exponential time (rather than going ahead with unflagged inconsistent merges).


The really big insight Linus had, which Bram apparently still doesn't want to recognize, is that if you design what is essentially a snapshotted filesystem, the merge algorithm is just a convenience. Any better merge algorithm can be added to git without touching the format. In fact any individual user can pick and choose their merge algorithm; the repository just cares about the recorded content history (which commits are parent to which commits).
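
That pluggability is visible right at the command line; a rough illustration (exact strategy names and defaults vary between git versions):

    git merge -s resolve topic     # plain three-way merge against one merge base
    git merge -s recursive topic   # the default: merges multiple merge bases
                                   # together first to handle criss-cross history
    git merge -X ours topic        # same strategy, different conflict preference
    # None of these change what gets stored: the result is still just a snapshot
    # (tree) plus pointers to the parent commits.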


Git also stores diffs from time to time: http://book.git-scm.com/7_how_git_stores_objects.html


I'm guessing you're referring to packed objects. If I understand them correctly, they are just there for space efficiency of the filesystem that is git. They're not first-order concepts that git the DVCS builds upon, just an implementation detail.


My thinking when I posted my comment above was that any diff format used in the repository could be treated as an "internal" format, and any actual merges that you perform could use any merge strategy that they like, as long as the commit code converted it into the repository's format on the way in. Which is why I pointed out that git also uses an internal diff format. However, if your point is that hg uses an internal format which cannot store particular changes to files correctly, or requires excessive engineering, then yes, that would be a problem and I see where you are coming from. I do very much like the conceptual simplicity of git.


That's just an optimization. The semantics are still stored snapshots, and the diffs can be from any version of any file (blob) to any other.
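
One way to see that the deltas are invisible at the semantic level (a hedged sketch; run it inside any git repository):

    git cat-file -p HEAD    # a commit object: a tree (full snapshot) plus parent
                            # pointers -- no diff stored in the object itself
    git count-objects -v    # how many objects are loose vs. in packs
    git gc                  # repack: the on-disk deltas may change entirely,
                            # but every object keeps exactly the same id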


Was the Camp project expected to create a non-exponential algorithm at the edges?


Nice to see Bram Cohen coming to the same conclusion I did. Having two branches constantly cross-merging is a bad idea, no matter what SCM you use.


I'm pretty sure the example in this article wouldn't confuse git: weirdness like this is the reason git has the "recursive" merge algorithm instead of just doing a plain three-way merge. A recursive merge basically merges the multiple merge bases (common ancestors) together first and uses the result as the base for the final merge, which resolves this sort of case.

I do criss-cross merges between git branches all the time with no ill effects. Maybe non-git VCSes can't handle this sort of thing.
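
For anyone who hasn't seen the term, a criss-cross history can be built by hand roughly like this (a sketch with invented branch names; empty commits just to keep it short):

    git checkout -b alice main && git commit --allow-empty -m A1
    git tag a1                                # remember alice's pre-merge commit
    git checkout -b bob main   && git commit --allow-empty -m B1
    git checkout alice && git merge bob       # MA: has A1 and B1 as parents
    git checkout bob   && git merge a1        # MB: merges A1 (not MA!)
    git checkout alice && git merge bob       # both A1 and B1 are now "best"
                                              # common ancestors: a criss-cross,
                                              # which "recursive" handles by
                                              # merging the bases together first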


You need to do criss-cross merges that revert previous commits on one or both sides of the merges. If you're not reverting you're not hitting Bram's corners.

This is a tough corner case, and I'm pretty sure you can confuse any source control system currently in production with cases like this. BitKeeper has some theoretical solutions, but we haven't gotten around to actually testing them in production.


I stopped reading after the first sentence. The author takes some liberties with the definition of "eventual consistency". Either he doesn't know what it means, or he likes to demolish terms which used to be defined precisely.


Or maybe you are expecting one context and the author is using a different one. This is quite likely given the author (a well-known hacker who has, in particular, worked on this exact problem through his project Codeville).


On a related note, to establish the author's credentials: he is the creator of the BitTorrent protocol.

http://en.wikipedia.org/wiki/Bram_Cohen


And has worked quite a bit on the revision control diff/merge problem, e.g. http://bramcohen.livejournal.com/37690.html


Why do you think he's using the term incorrectly? He just means that, with git, the order in which you apply patches matters. In a scenario where people are distributing patches and applying them as they receive them, this implies a lack of eventual consistency.


That's the lack of the associative property. Eventual consistency means that updates will eventually be propagated to all replicas in a distributed system and that all replicas will be consistent. Not the same thing, at all.


Commutative, not associative. And the commutative property is exactly what's required to guarantee eventual consistency when patches are being applied in the order they are received since it ensures that order of application doesn't matter. Of course, if you have all of the updates, you can achieve eventual consistency without commutativity by periodically reapplying all of the patches from scratch in a deterministic order. But IMO his intended meaning was both clear and correct.
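
Spelled out, the property being asked for is that, for independently created patches p and q applied to a common starting state,

    merge(merge(state, p), q)  ==  merge(merge(state, q), p)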


Mr. Joy is a really ironic name.


Holy crap... someone still uses LiveJournal as their blog.



