Thanks for sharing. I too agree on this: "... Git is complex is, in my opinion, a misconception... But maybe what makes Git the most confusing is the extreme simplicity and power of its core model. The combination of core simplicity and powerful applications often makes thing really hard to grasp..."
If I may do a self plug, I recently wrote a note on "Build yourself a DVCS (just like Git)"[0]. The note tries to discuss the reasoning behind the design decisions of Git's internals, while conceptually building a Git step by step.
While this is nice, I think it should be emphasised that the blob-tree-commit-ref data structure of git is not essential to a DVCS. One of the disadvantages of everything being git is that everyone can only think in terms of git. This makes things like Pijul's patch system, Mercurial's revlogs, or Fossil's sqlite-based data structures more obscure than they should be. People not knowing about them and considering their relative merits has resulted in a bit of a stagnation in the VCS domain.
What makes it hard is that it’s taught wrong. The usual introduction is all pull/checkout/commit/push, whereas it took me a long time to discover that fetch/rebase/show-branch/reset/checkout/commit --amend, and especially the interactive -p variants, are the core tools that really make it a pleasure to use. They give you flexibility and let you write and rewrite your story, whereas the commands you’re introduced to provide no control to the user. It’s remarkable how many users think you can’t rewrite a Git branch.
The rebase command is only safe if you never share a branch. Most people use a distributed revision control system to work with others and if you do work with others then rebase is dangerous and should not be used.
Almost every rebase user I've spoken with has no idea what the danger is despite it being clearly discussed in the manual page for rebase and despite rebase being listed as dangerous every time it is mentioned in any manual page.
For the sake of new users everywhere, please stop recommending rebase.
This is overly broad dissuasion. `rebase` should not be used if the branch is already shared and built upon by others, but is completely harmless to use on a feature branch.
In fact, it produces cleaner feature branches for review. Tracking the trunk branch with merges into your feature branch makes for a lot of noisy commits and difficult history to read through when the time comes to diagnose a bug. Rebasing, on the other hand, allows you to neatly put all the changes from your branch (and only those changes) into one or a few neatly-packaged commits.
Not rebasing does not affect reviewing a branch in the least, unless your diff software is seriously broken.
Comparing a branch to trunk should only show the actual difference. That you merged trunk multiple times should have zero bearing.
The only way it could ever confuse anyone is if they review every commit and somehow fail to pass over merge commits.
The single most aggravating thing in git is its self-appointed super-users who /almost always/ properly use its power until one day they don't. Then they make life miserable for everyone else while we all "just wait, I'm fixing it".
Feature branches really should be rebased to have clean self-contained logical commits prior to merging, and this goes beyond just making it easier to review (although that is an important point). It's good git hygiene!
No individual commit should break the build; one reason is to keep git-bisect working well for future users bug-hunting, without getting stopped because someone didn't keep the commits on their dev branch clean prior to merging (N.B., a maintainer should also reject such PRs). And keeping commits clean usually means needing to rebase occasionally to organize the commits.
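To illustrate why bisecting depends on every commit being testable, here is a minimal sketch of the binary search that git-bisect performs over a linear history. The function name `first_bad` and the `is_good` callback are hypothetical and this is not git's own code; it assumes the first commit is known good and the last is known bad.

```python
def first_bad(commits, is_good):
    """Binary-search a linear history for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    Returns the first bad commit and the list of commits that were tested.
    """
    lo, hi = 0, len(commits) - 1   # lo: last known good, hi: first known bad
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(commits[mid])
        if is_good(commits[mid]):
            lo = mid               # the bug was introduced after mid
        else:
            hi = mid               # the bug was introduced at or before mid
    return commits[hi], tested
```

Every probe needs a clear good/bad answer. A commit that does not even build cannot give one, which is the situation the comment above warns about.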
And each commit should be reviewed individually, in addition to the whole of the branch / PR.
Not to mention that each commit should be logically laid out, with well-defined changes and well-written commit messages. This usually means needing to rebase a branch when developing non-trivial features or bug fixes, to fold in review feedback.
But as mentioned elsewhere, generally on feature / dev branches, the expectation is that the commits are unstable, subject to change, and should not be built upon (without prior coordination, at least).
Master and stable release branches, on the other hand, should never change or be rebased.
Absolutely. Personally, I like to spam commits while I'm developing, just to make sure I keep track of what I've done while I'm iterating, but none of this is relevant to the final merge. Any individual commit on my feature branch may not even compile, but when that feature is applied to master, I want it to be a clean atomic commit that can easily be reverted or applied to other branches.
1. I didn't say this affects branch diffing, but rather trawling through history on a single branch.
2. "self-appointed super-users [...] [who break everything]" is a strawman and borderline ad-hominem. If you follow the guidelines I put forth, there won't be any issues collaborating with others.
Also, as a general note, it's actually very difficult to completely destroy information that's been committed at some point. If you're really running into issues with this, don't let fear direct you away from enjoying the greatest features of git. Experiment! Keep trying. Read a good git book (https://git-scm.com/book/en/v2). And learn to use the reflog. Everything you've committed is backed up for a long time even if you've removed all named references to those commits.
> I didn't say this affects branch diffing, but rather trawling through history on a single branch.
Exactly. What's the point of all the "oops, a typo" or "applying code review remarks, part III" commits? Just rewrite. This is the workflow you get e.g. with Gerrit.
I strongly disagree. Git rebase is a very useful command, but of course you need to know what you are doing. Git becomes very weird if you try to avoid its core concepts... as far as I'm concerned, use rebase, get yourself cut (or better yet, learn about it before using it) and try not to repeat mistakes.
That said, git porcelain is simply awful. It is inconsistent and full of dangers - there is no way I would dare try new commands without reading up on them, because the names are often misleading. Sometimes I really wish GitHub / GitLab used Mercurial as their foundation. I think the world of programming would be much easier....
This is exactly what I'm talking about! Thank you!
There is /nothing/ "dangerous" about rebasing. You just don't rebase branches that are publicly shared without coordinating with the other users, so for some cases (like "master" of an open source project) you don't rebase.
But for your internal workflow, rebase is a KEY TOOL. It's how you write your story of commits. You can't just perfectly nail your commit history the first time you code, unless you are a genius. And what if you're working on a feature, but then you want to commit a certain chunk of changes to master, so that other features can use that change? Rebase is how you do anything like this. It's core to using and enjoying the beauty of git.
That's true, but it's an obscure feature, not a convenient "undo" button; to the average insecure git user you might as well tell them that their work has been eaten by a dragon and they can recruit another dragon to get it back.
Reflog may not be known by the average git user, but I think it is pretty convenient - just call `git reflog`, copy the hash before the rebase, and checkout/reset to that hash. Doesn't get much easier than that.
I rebase pushed branches all the time if I’m the only one who is working on them, even for repos with multiple team members working on them. In almost a decade I can’t think of a single time this has caused problems.
Yes, don’t rebase branches with multiple authors doing parallel work. But I can’t think of many times I’ve even had to work on a feature branch with multiple authors.
In a gerrit flow, you have to use it quite often, but only on the detached micro-branches that gerrit forces you to use. And the UI provides a nice convenient rebase button.
In a more traditional "trunk" flow, where you're pretending it's svn, you probably want "pull.rebase=true", otherwise you generate a lot of entirely spurious merge commits. This really confused me the first time I used git, long ago.
The "dangerous" case is, as you say, using it to rewrite history that has already been pushed (heresy!). Generally the system will warn you that this requires "--force", at which point you need to stop and think about what you've done.
(It took me a long time to overcome my feeling that history rewrites were inherently wrong - defeating the point of a VCS in some way. I've adapted to it somewhat with the "neatening things up before submitting" view, which took a while to learn. In a traditional VCS there's only one view of the code and everyone shares it.)
In a lot of workflows I've used, we _always_ rebase against <parent branch> before pushing a PR, the virtue being that the history is cleaner. I definitely don't think it deserves the huge "NEVER USE" sticker you're slapping on it. I use it multiple times a day.
From my limited knowledge (mostly based on jneem's article series[1]), I think Pijul is more powerful, but for the same reason also considerably more difficult to understand than Git.
In particular, Pijul supports (and depends on) working with repository states that are, in Git terms, not fully resolved. In addition, those states are potentially very difficult to even represent as flat files (see e.g. [2]). Git is simpler in that it mandates that each commit represents a fully valid filesystem state.
That said, I still think Pijul might have a place, if it turns out that it supports superior workflows that aren't possible in Git. But the "VCS elitism" would probably become worse than it is today.
I really like the way Pijul "thinks". Unfortunately, I can't see myself using it (or Fossil, for that matter) for anything except toy code, because I have contract requirements to store everything in GitLab.
Having written my own git client, I can tell you that "the most complicated part will be the command-line arguments parsing logic" doesn't go away. I wouldn't be surprised to wake up one day and find someone published a proof that NP != P, and the proof involved trying to parse the git command line.
Doesn't seem that realistic. I reloaded a bunch of times, and didn't get a single one where the same command can mean 3 different context-sensitive things that each take different arguments but are all named the same command.
Those familiar with Mercurial will surely notice it is, in fact, a port of Mercurial’s interactive commit functionality, previously a separate extension called crecord.
It is indeed a great (and time-consuming) approach. Other VCSs have gone a little further than Git though, simplifying their workflows or their GUIs and CLI commands to make them more intuitive, like https://www.plasticscm.com/ and others.
Just be aware that it is a good start but isn’t as complete as the command list would indicate - runs up to about the point of index files, so not quite at ‘git add’.
I just wrote a simple git dumper tool (https://github.com/owenchia/githack) a few days ago. Learning by doing is a very good way to learn, and I really enjoy it.
This is such a great tutorial to learn Git from the bottom up. I always thought the "back end" part of the Git is pretty complex but this tutorial makes it look so easy.
This is simply excellent. I already know some basics of the internals of git at a conceptual level but this tutorial makes the knowledge so much more concrete. Wonderful.
Doing a search on HN for "write your own" returns a lot of answers, including 'Ask HN: “Write your own” or “Build your own” software projects' https://news.ycombinator.com/item?id=16591918 from a year ago.
I have an admission to make: I don't understand git. By this I mean I have a few simple commands I use (status/add/commit/push/pull) and if I try to do anything more complicated it always ends up with lots of complex error messages that I don't understand and me nuking the repository and starting again.
So I think: there must be a better way.
I have often thought about implementing a VCS. The idea behind one doesn't seem particularly complex to me (certainly it's simpler than programming languages). If I did I would quite probably use WYAG as a starting point. My first step would be to define the user's mental model -- i.e. what concepts they need to understand such that they can predict what the system will do. Then I would build a web-based UI that presents the status of the system to the user in terms of that model.
Yeah, I too don't really understand git. It seems that it was developed without any concern for affording a good mental model of its operation to its users, and thus it is just a complex black box you chant arcane rituals at and hope it doesn't decide to burn your world down. I know I could build a mental model of it if I put enough time into it, but who wants to do that when there's actually useful things to do? So instead when I have to use it to contribute to open source projects I have a sheet of notes with incantations to cover the specific things I've had to do with it in the past.
This is a valid point. Git is sometimes incomprehensible. The best way to build a mental model for me has been to read about directed acyclic graphs. It’s all just trees, objects, and labels for trees and objects from there.
Also to learn that `git reflog` exists, and gives you pointers to all the old states of your repo, even mid-merge or mid-rebase. If you get stuck with a bad rebase, you’re only a `git reflog; git checkout HEAD@{3}` away from being back where you started.
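As a rough mental model of what the reflog gives you, here is a toy sketch: a reference that remembers every value it has ever pointed at. The class `ToyRef` and its methods are hypothetical names for illustration only; the real reflog is stored and expired differently.

```python
class ToyRef:
    """A reference (like HEAD) that records every value it has pointed at."""

    def __init__(self, commit_id):
        self.value = commit_id
        self.log = [(commit_id, "initial")]   # oldest entry first

    def update(self, commit_id, reason):
        """Move the reference and remember why it moved."""
        self.value = commit_id
        self.log.append((commit_id, reason))

    def at(self, n):
        """Value the reference had n moves ago, in the spirit of HEAD@{n}."""
        return self.log[-1 - n][0]


head = ToyRef("a1")
head.update("b2", "commit")
head.update("c3", "rebase")            # the rebase we now regret
head.update(head.at(1), "reset")       # go back to where we were before it
```

Note that the recovery is itself recorded as one more move, so the regretted state stays reachable through the log as well.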
Tutorials like this one and others mentioned throughout the thread do a good job of breaking down the viscera from there.
>> developed without any concern for affording a good mental model of its operation to its users, and thus it is just a complex black box you chant arcane rituals at
That's not git, that's the command line - the interface with no inline visualization, no discoverability and no affordances.
git is not its command line interface. I use the git CLI like I use curl: a powerful tool for occasional surgical or automated operations directly on underlying protocol. Most of the time I prefer something that better fits my workflow: for git, it's Git Extensions; for HTTP, it's Chrome.
I have the exact opposite problem. I have a mental model of git, but not a very good one for most of its competitors; I don't really get mercurial for example, but git is just:
1) uncommitted stuff in workdir. Potentially can be lost, so commit often.
2) blobs in repo representing snapshots at commit time. Can never be lost.
3) symbolic references to the blobs. Can always recover from reflog.
4) tools to sync the above two things between repositories. (fetch and push)
5) tools to merge, diff and otherwise manipulate the changes between snapshots and files.
I'm confident that git will never lose my data, so long as I commit it. This makes experimentation stress-free.
Technically, you can lose data by explicitly deleting your refs, expiring the reflog, and running gc, but if you go that far you might as well rm -r .git
The mental model helps me with git because it maps pretty closely to what data actually exists. The index is just a useful thing to help me put stuff into a repository in a controlled manner. I've never attained a similar transparent understanding of mercurial.
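To make the "snapshots in the repo" part of this model concrete, here is a small sketch of how a file's content becomes an object ID. It assumes the default SHA-1 object format, and the function name `git_blob_id` is made up for the example.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the ID git assigns to a blob holding `data` (SHA-1 format)."""
    # A blob is hashed as: the word "blob", a space, the size in bytes,
    # a NUL byte, and then the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


print(git_blob_id(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The same content always yields the same ID, so it should match what `git hash-object` reports for a file with those bytes.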
I think I've asked this before, but what exactly are mercurial's branches?
In git, they are a "physical" feature of the repository as it represents a set of lineages, not an actual repository object.
As such, any reference to a commit uniquely identifies a branch, so the concept of a "named line of development" is simply implemented as a reference that gets updated as you make more commits. When you "delete" a branch in git, it goes nowhere. Only its name is removed.
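A toy sketch of the idea above, with a branch as nothing more than a name pointing at a commit. `ToyRepo` and its methods are hypothetical and only model the concept; real git may eventually garbage-collect commits that nothing references.

```python
class ToyRepo:
    """Commits form a parent chain; branches are just names for commits."""

    def __init__(self):
        self.commits = {}   # commit id -> parent id (None for the root)
        self.refs = {}      # branch name -> commit id

    def commit(self, branch, commit_id):
        """Add a commit on top of the branch and move the branch name to it."""
        self.commits[commit_id] = self.refs.get(branch)
        self.refs[branch] = commit_id

    def branch(self, name, start):
        """Create a branch: one new name pointing at an existing commit."""
        self.refs[name] = start

    def delete_branch(self, name):
        """Delete a branch: only the name goes away."""
        del self.refs[name]

    def history(self, commit_id):
        """Walk parent links from a commit back to the root."""
        out = []
        while commit_id is not None:
            out.append(commit_id)
            commit_id = self.commits[commit_id]
        return out


repo = ToyRepo()
repo.commit("main", "a")
repo.commit("main", "b")
repo.branch("feature", "b")
repo.commit("feature", "c")
repo.delete_branch("feature")
```

After the delete, commit `c` and its lineage are still in `repo.commits`; only the label `feature` is gone.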
What sort of structure does mercurial use to represent its branches? I know they are not just an emergent thing like in git.
Mercurial uses a DAG just like git. It has facilities for embedding a branch name in each commit, or for not doing that and having anonymous branches. It also has a feature similar to git's "branches"
Two articles that might help if you really want to dig into it:
Hmm, so the branch names are part of the commit's metadata. I'm not sure I agree with that, but I guess it's a valid choice.
The first article also states that unnamed branches are useful for small, temporary diversions, and notes that git has to name branches, but I think that's somewhat misrepresenting git since you can throw away names as soon as they are no longer useful. To me it seems kind of silly to have unnamed branches, given that names are free and much easier to remember than commit hashes.
No, his design goal was to create a vcs for the Linux kernel. Frankly most people use git like svn version 2, completely skipping the distributed parts of it.
Git actually doesn’t scale to really huge monorepos.
That's because most users don't actually need the distributed part of git. For most projects the sanest solution is to have a single master repository that all users pull/push from/to.
IMHO the only real advantages over SVN for most users are the better branch/merge functions.
Have you tried looking into any other contemporary DVC systems?
I get Git, more or less. Having tried to make sense of Bazaar or Mercurial on several occasions (to understand the internal data model), I eventually gave up.
Personally I just rely on daily backup of my workstation and manually copied and compressed snapshots of the project when I feel significant progress has been made. I tried SVN once, but it wasn't very transparent what it was doing and didn't really enhance my workflow any.
All my personal projects are just me. When I contribute to open source projects I use whatever they use, which is usually git. At work, even though I am usually the only programmer on my programming projects, I use TFS because that's the company's process.
To provide an alternate viewpoint, I have never had trouble with Git. I’m a bottom-up how-does-this-thing-work sort of person so when I first started using Git, I sought to understand how it worked. That part of Git is pretty easy to understand. Knowing that made its CLI a lot easier to grok. Of course, at the time I was having to use ClearCase at work and Subversion on the side so Git, IMO, was a vast improvement to either of those tools.
git is to version control systems as vim is to text editing or dwarf fortress is to god sims.
(dear everyone here and elsewhere recommending git incantations "but of course you have to know what you're doing": if you regularly have to take a backup of your working area before interacting with the vcs, because the interaction may do things you did not intend from which the simplest way back is to reset hard and start over, I humbly suggest that the vcs has failed in its primary purpose)
I'm not sure if I'm reading your comment right, but if you're going to do something that might result in a mess (such as a merge or a rebase) the sane thing to do is not to have anything uncommitted in your workdir; that way you can always retry, undo and experiment, and the VCS is doing exactly what it's supposed to be doing.
It's something subversion can't do last I checked, so whenever I need to do a complicated operation with SVN, I import the local state into a git repo first :P
> The last company I worked at had 30 good developers, 1 of which I think really deeply understood git.
I think this is really damning. Developers understand complex languages/compilers like C++, Python, Java, etc, all of which have a good deal more intrinsic complexity than a VCS. So if a VCS isn't understood, it is badly designed.
It's because it's so opaque and inconsistent. If git was a program like Microsoft Word, where every available action is easily discoverable and the current state is obvious, then it wouldn't be so bad. Something like what Nikita Tonsky wrote in an article called "Reinventing Git Interface"[0], which has a lot of good ideas.
I'm in the same boat.
I could teach subversion in 1 hour and people would get it.
I can't teach git in a whole week. So in the end my students use the same 5 commands.
I'm like you. I use SourceTree to get a 'visual grasp' on what I find to be the noise of git commands.
However, if you're into command line, you can try fossil: it's got lots going for it.
Your idea of a "user's mental model" might land you into trouble though, because all of us come from different backgrounds (subversion, SSafe, git, HG...) and they all maddeningly redefine terms in different ways (eg branch, forks, commits, checkout).
> Your idea of a "user's mental model" might land you into trouble though
If I do do this, I will explicitly lay out the user's mental model in the documentation at the start. Then it will be the user's fault if they can't be bothered to read it.
While Git's documentation might not lay out the model at the start, it certainly shows how everything works, complete with diagrams and such. As such...
> Then it will be the user's fault if they can't be bothered to read it.
can be applied to you about git technically. I think that's just me being a "little" pedantic though.
It sounds like the issue you actually have is that the documentation isn't easily readable in one or two sittings, and you don't have the time (or can't be bothered) to go through it and learn it. Which I totally understand, everyone has different things they need to spend time on, most of the time learning Git isn't one of them.
I have read large parts of the git docs. I don't like them. This is not just the text: the low-contrast colours and hard-to-read fonts are also factors.
> It sounds like the issue you actually have is that the documentation isn't easily readable in one or two sittings, and you don't have the time (or can't be bothered) to go through it and learn it.
So how long should it take to learn a VCS? And how long does git take to learn?
Look for a video on YouTube called Git Happens. I've found it fairly effective with my coworkers. It doesn't go over the command syntax, but instead dives into a logical overview of the underlying data structures.
I'm not much better. I knew SVN inside out. I've read up on git internals, poked around with them, but eventually forget it all because 99% of the time I only use the same five commands. It's a real iceberg, Pareto principle piece of software for me.
I have nothing to directly comment on the tutorial. Just a tangential mention regarding the tedious argument parsing boilerplate in Python, I have found Python Fire to be much more convenient: https://github.com/google/python-fire
It would have shaved off another 15-20 lines from the 503 line example ;-)
I largely find argparse to be OK apart from a couple of issues.
First, it allows the user to abbreviate flags. They can pass --fl and it will be interpreted as --flag, assuming no other flag shares the same prefix.
This sucks for maintainability: add a new flag and any abbreviation for a previously existing flag that shares the same prefix will now stop working, breaking user workflows.
Since Python 3.5 there's the allow_abbrev parameter that allows disabling this behaviour, but then you also lose the ability to combine multiple single-character flags (so you can't pass e.g. '-Ev' any more, and would have to pass '-E -v' instead)[1].
The other issue is that it's tedious to keep all the .add_argument calls readable, while maintaining a reasonable maximum line length.
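A small sketch of the abbreviation behaviour described above. The helper names `build_parser` and `try_parse` and the flags `--flag` and `--flush` are invented for the example; `allow_abbrev` is the real argparse parameter.

```python
import argparse


def build_parser(allow_abbrev=True, extra_flag=False):
    """Build a tiny parser; extra_flag simulates adding a new option later."""
    parser = argparse.ArgumentParser(prog="demo", allow_abbrev=allow_abbrev)
    parser.add_argument("--flag", action="store_true")
    if extra_flag:
        parser.add_argument("--flush", action="store_true")
    return parser


def try_parse(parser, argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return parser.parse_args(argv)
    except SystemExit:   # argparse reports the error on stderr, then exits
        return None


# "--fl" is silently accepted as "--flag" ...
accepted = try_parse(build_parser(), ["--fl"])
# ... until a second option shares the prefix and makes it ambiguous.
broken = try_parse(build_parser(extra_flag=True), ["--fl"])
# With abbreviations disabled, only the full spelling is accepted.
strict = try_parse(build_parser(allow_abbrev=False), ["--fl"])
```

Here `accepted.flag` is true, while `broken` and `strict` are both `None`: the same user input stops working once `--flush` exists or abbreviations are turned off.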
Magical is truly the word to describe Docopt. I use it for all Python code I write and have also used the C and C++ variants. I would highly recommend Docopt to anyone writing command-line utilities.
Yeah, I couldn't help but notice the first piece of code is an ugly "switch case". There's a Python idiom for this: put your functions in a dictionary and do something like `cmds_dict.get(args.command, default)(args)`. I guess we all have our religious habits for argument parsing (more of a docopt-er myself).
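For readers who have not seen the idiom, a minimal sketch of dictionary dispatch might look like this. The command names and handler functions are invented for illustration and are not taken from the tutorial's code.

```python
import argparse


def cmd_init(args):
    return f"init {args.path}"


def cmd_add(args):
    return f"add {args.path}"


def cmd_unknown(args):
    return f"unknown command: {args.command}"


# Map each subcommand name to the function that handles it.
COMMANDS = {"init": cmd_init, "add": cmd_add}


def main(argv):
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("command")
    parser.add_argument("path", nargs="?", default=".")
    args = parser.parse_args(argv)
    # Look up the handler, falling back to cmd_unknown, then call it.
    return COMMANDS.get(args.command, cmd_unknown)(args)
```

Adding a command then means adding one function and one dictionary entry, rather than another branch in a chain of if/elif tests.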
I guess the author wanted the code to also look friendly to people who are not that familiar with Python - everybody understands a switch, while what you are describing would probably puzzle some people...
Reminds me of CherryPy. An object oriented app can become a web server with a few function decorators (to make them public endpoints). Coincidentally still my favorite Python web framework.
Installed by default I cannot find any. But pretty much any language with a package manager has sane and easy to use libraries to bang out a small CLI app with subcommands and option parsing in minutes. For example PicoCli in Java, Thor in Ruby and Clap in Rust.
[0] https://s.ransara.xyz/notes/2019/build-yourself-a-distribute...