Initial revision of "git", the information manager from hell (github.com/git)
120 points by Sevein on Aug 17, 2012 | 67 comments



Actually this is just the first self-hosted revision of git, but impressively that was apparently only a few days after the actual project start.

http://www.spinics.net/lists/git/msg24141.html


Tridgell's "reverse engineering" of BitKeeper is one of the funniest, and at the same time one of the saddest, stories I've read about software.

http://lwn.net/Articles/132938

It's companies like BitKeeper, who think using a command line is "not allowed", that are the reason so much software sucks.

The interface Tridgell used was the only one I'd be interested in. It's something you could build on top of. You could add abstraction to your heart's content.

As for git, it's not nearly as simple as people portray it to be. You need to have a scripting language (e.g. Perl) and an HTTP client (e.g. curl) already installed or you cannot compile, let alone use, git. Now, this would not be so bad if git were just a little glue for some external programs. But try to compile git statically and you will end up with over 230MB of "small, simple utilities". git is not so simple.

Its command syntax is appealing to many. It makes git seem "simple". But the program itself is not simple in the sense of being robust.

I can compile a static copy of the rcs or cvs programs, or even svn, and take them with me anywhere, all in the space of a few MB.

git has a lot of dependencies. It's easy to break.


You do realize that the git-* binaries by default hardlink to the "git" binary, right? Git is built as a single binary, and uses the program name to pick the right command.

My half-done AROS port of git (which admittedly does exclude some stuff) currently stands at 1.8MB, with the only external dependency so far being the C library. But the "git" binary has 104 links.

EDIT: Slight correction: There are certainly a number of additional binaries, e.g. for things like "git-instaweb", but the core functionality is held in the main "git" binary.
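For the curious, the multi-call trick looks roughly like this (a toy sketch of argv[0] dispatch in C, not git's actual source):

  #include <stdio.h>
  #include <string.h>

  /* toy commands standing in for git's builtins */
  static int cmd_commit(void) { puts("committing..."); return 0; }
  static int cmd_log(void)    { puts("showing log..."); return 0; }

  int main(int argc, char **argv)
  {
      /* strip any leading directory from argv[0] */
      const char *name = strrchr(argv[0], '/');
      name = name ? name + 1 : argv[0];

      /* invoked via a "git-commit" or "git-log" hard link? */
      if (!strcmp(name, "git-commit")) return cmd_commit();
      if (!strcmp(name, "git-log"))    return cmd_log();

      /* invoked as plain "git": dispatch on argv[1] instead */
      if (argc > 1 && !strcmp(argv[1], "commit")) return cmd_commit();
      if (argc > 1 && !strcmp(argv[1], "log"))    return cmd_log();

      fprintf(stderr, "usage: %s <command>\n", name);
      return 1;
  }

One binary on disk, many names hard linked to it; busybox works the same way.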


Uh, is "being easy to statically compile" some benchmark for simplicity that I've never heard of?

Also, what do you care if git is 230MB? Do they even make thumb drives that small any more?


> Uh, is "being easy to statically compile" some benchmark for simplicity that I've never heard of?

Yes. It's not the benchmark, but it's a benchmark.


Thank you Maro. I've never understood why static linking and easy compilation, not to mention being concerned with file sizes, upsets certain people when mentioned on mailing lists and forums. But it always does.


A 230MB executable is no fun: it needs to be read from disk to be executed, and bigger code means worse cache use (for instructions) when running.


It's 115 links to a 2 MB executable. The actual on-disk size is 2 MB.


Right.

I feel stupid that I did not realise this. I am a big fan of crunched binaries, actually. That guy at U of Maryland who introduced the technique to BSD in the early '90s is a software hero in my book. For Linux fans, I guess your hero would be Bruce Perens or whoever was behind BusyBox.

Anyway I've learned something more about git from admitting my error. Thank you HN!


You know what's funny? I made the same mistake a year or two ago. :)


I'd bet that if you built git as a single statically linked multi-call binary a la busybox, it would be far less than 230MB. Statically linking dozens of separate binaries with large amounts of shared code and then measuring the resulting disk usage doesn't tell you anything meaningful except how much disk space dynamic linking would save you.


git is built as a multi-call binary. I wonder if he's perhaps not realizing that all those other "git-*" binaries are hard linked to "git". Depending on which box I check, my git binary has in the region of 80-110 hard links (EDIT: admittedly not a statically linked version, but none of its dependencies are big enough that they should add up to anywhere remotely near 230MB).


So ls -i should show they all share the same inode number?

Thanks for this. I was not aware of that. Perhaps I will give it another try.


> So ls -i should show they all share the same inode number?

It would, yes. Another useful tool here is du, which by default will screen out files with duplicate inode numbers. So, for an example where I have two 100M files, each with multiple hard links:

  me@swann:/tmp/tmp$ ls -lhi
  total 701M
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link1
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link2
  180278 -rw-r--r-- 4 me us 100M Aug 17 13:45 zero.file.2.link3
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file.link
  180277 -rw-r--r-- 3 me us 100M Aug 17 13:44 zero.file.link2
  me@swann:/tmp/tmp$ du -shc *
  101M    zero.file
  101M    zero.file.2
  201M    total
du does this duplicate-ignoring trick across whole trees, so the links do not have to be in the same directory; you can have it scan a whole tree and it will show how much space is really taken, not how much is nominally taken. Like so:

  me@swann:/tmp/tmp$ cd ..
  me@swann:/tmp$ du -shc tmp
  201M    tmp
  201M    total
The reason I'm getting 101M instead of 100M (and 701M in total in ls) is that each link is counted as taking a small amount of space, so "100M-plus-a-bit" is rounded up to 101M (and 700-and-a-fraction rounds up to 701M).

Also the number in the 3rd column of the ls output above is the number of links to the object, which can be helpful in understanding this sort of situation too.


I am no expert with du and all its options and behaviours, but it's funny you mention the h, c and s ones, because I did bother to learn and commit those three to memory long ago and I routinely use that combination.

I also routinely use dd to get "exact" file sizes (yes, it's crude, but dd is on almost every UNIX-like system and it works), unless I have access to a good stat utility.
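For reference, the dd trick is roughly this ("some.file" is just a placeholder name):

  # dd reports the exact byte count on stderr when it finishes
  dd if=some.file of=/dev/null
  # with GNU coreutils, stat can print the size directly
  stat -c %s some.file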


For a large chunk of the main binaries. There are certainly some things that are split out in separate binaries and scripts.

On a Debian system, take a look at /usr/lib/git-core/ - it contains a number of additional binaries, but it's still reasonably small. And a lot of what's in there is optional functionality and stuff you can delete if you don't want it. E.g. "git-imap-send", "git-instaweb" and a bunch of other things that you may or may not care about at all.

The main stuff like "git-commit" etc. is all linked to the main binary (or not necessarily present at all, depending on your build/distro).

EDIT: I just compiled a statically linked "git" binary. Stripped, it is 2.5MB. That obviously excludes the few things that are in separate binaries; git-daemon, for example, weighs in at 1.7MB statically linked.

Some things, like git-imap-send, seem to be a bit tricky to build statically (git-imap-send barfs errors about libdl all over my screen, and I'm not motivated to figure out why).


> As for git, it's not nearly as simple as people portray it to be.

git's famous simplicity lies in its well-designed, stable, and well-documented underlying data structures; it has nothing to do with the size or runtime dependencies of a particular implementation.


I do understand about the simplicity of the design, though I haven't tried to figure out exactly how git works.

When I see people commenting about git's simplicity it is not about data structures. It is about commands. And that really tells me little about simplicity. Anyone can manipulate the argument structure for a function and positional parameters in a command line interface. The real question is what does the function do, and how does it accomplish it?

The first thing that struck me about git is the apparent use of SHA1 hashing as a basic foundation for the whole system. Maybe that's not even true and no doubt there is much more to it. I'm not out to become an expert in version control nor to understand git completely because I only use it out of necessity. Older systems work just as well for my purposes.

I do not need many advanced features in version control; I'm using version control on a personal basis, not as a contributor to some highly dynamic project with many other contributors. Plain old rcs is still my main tool when I need the ability to move between versions. And diff still seems to work for detecting and printing differences after so many years.

But to me, as a user, the compilation process of any program is also part of any purported "simplicity of design". Programs that compile easily and quickly and are easy to modify score very highly in my book. I am constantly looking for more programs that fit this description.

I was not aware that there were many implementations of git.

I'll now be looking for some other implementations. On GitHub, of course.


> The real question is what does the function do, and how does it accomplish it?

Precisely, and this is where git's simplicity shines compared to other systems. Once you've taken some time to learn git's data structures, it's trivial to understand exactly what effect each command has on your repository. No need to mentally model your source control system in leaky abstractions; the reality itself is simple enough to handle directly.

> The first thing that struck me about git is the apparent use of SHA1 hashing as a basic foundation for the whole system. Maybe that's not even true and no doubt there is much more to it.

Yep, it is true, and that really is all there is to it. For a quick overview, see: http://gitready.com/beginner/2009/02/17/how-git-stores-your-...
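You can see this for yourself with git's plumbing commands; a minimal sketch, run inside any repository:

  # store a blob in the object database; git prints its SHA1
  sha=$(echo 'hello' | git hash-object -w --stdin)

  # ask git what kind of object that SHA1 names, and for its contents
  git cat-file -t "$sha"   # prints: blob
  git cat-file -p "$sha"   # prints: hello

Trees and commits are stored and addressed the same way, which is why the whole system hangs off SHA1.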

> I was not aware that there were many implementations of git. I'll now be looking for some other implementations. On GitHub, of course.

GitHub itself runs on a proprietary Erlang implementation of Git.


Edit: As an example, I just compiled subversion 1.6.17. They actually have an --enable-all-static option in the configure script, which is a nice convenience, as libtool can be a real PITA sometimes when trying to link statically.

Total size: 28M


If you don't need http/https/svn/gtk support, then git can be built without perl, curl, git-svn, python, etc. installed.


Are there instructions anywhere on how to do this? I could have sworn Perl, absent some other scripting language, was an absolute requirement. If it is possible I will have another go.


From the Makefile (which is exceedingly well commented):

  # Define NO_PERL if you do not want Perl scripts or libraries at all.

I don't know what functionality you lose that way, and you might very well still need Perl to build it (I don't know, I haven't tried), but the core functionality should all work.
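For reference, a minimal build would be something along these lines (knob names from that same Makefile; the exact set you need may vary by version):

  # build a git without the Perl-, Python-, curl- and Tcl/Tk-dependent bits
  make NO_PERL=1 NO_PYTHON=1 NO_CURL=1 NO_TCLTK=1 prefix=/opt/slim-git
  make NO_PERL=1 NO_PYTHON=1 NO_CURL=1 NO_TCLTK=1 prefix=/opt/slim-git install

Note that NO_CURL also drops http/https fetching, so you're left with the git, ssh and local protocols.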


That's only one implementation of git. There are other implementations that are less kludgey, like libgit2: http://libgit2.github.com/


You should try using an OS with a proper package manager that can take care of these things.


svn is quite over-engineered and has significantly more dependencies than git. There's no way a statically linked svn would come out smaller than git, so either you're trolling, lying, or stupid.

Besides, file sizes of statically linked version control binaries is utterly uninteresting for anyone with an ounce of sanity.


> Besides, file sizes of statically linked version control binaries is utterly uninteresting for anyone with an ounce of sanity.

Unless you, say, want to be able to use it on an embedded device for some reason, or to ship it as part of some project where you have little control over what environment users might want to run it in, say an IDE running on Android.

Or for ports to far more constrained platforms. (E.g. I have a semi-working AROS port. In addition to running on more modern x86, PPC and ARM hardware, AROS can run on original Amigas, where finding a machine that even has enough memory to load git is a challenge. A bizarre edge case? Sure, but that doesn't mean there aren't plenty of people with edge cases like this.)

There are any number of reasons why one would care about size. I wish more developers did. While my mobile devices, for example (which do have git), have decent storage space, I've already filled most of their 16GB and 32GB respectively, and I'd rather not waste large amounts of space dragging in all kinds of dependencies on stuff that isn't strictly necessary.

That said, in this case, the core functionality of git does in fact not take all that much space.


If you cared about binary size, you'd be using dynamic linking to avoid duplication in the first place.


Maybe he cares first about portability, e.g., easily moving a binary from one BSD-based device to another. Not all devices have the same space limitations.

There might be other reasons, too. Static binaries fork faster, but this works best if they are also small enough to remain entirely in the OS's cache.

There's nothing wrong with dynamic linking per se. Nor is there anything wrong with static linking per se. ("Per se" as used here is intended to mean "in all circumstances".) The use of one or the other is simply a choice. There are advantages and disadvantages to each method, based on the circumstances and whatever the desired result(s) is/are.


Why would/do you run git on a mobile device?


You could use it to automatically get firmware and software updates. One likely way of doing it: use git to list out tags and branches, and find the most recent one with appropriate text in the name. Then check out the data from that revision, and use it. Maybe delete the local git stuff afterwards (since you don't need it any more). Or, better yet, archive the data from that revision.

You might not necessarily supply this solution as part of a product - but if you work somewhere that has a lot of devices to manage, you might want to do this kind of thing yourself, internally. And if you're going to then take it seriously, you'd want to keep previous revisions of all your stuff around, making it easy for you to roll back to previous versions. Files with history, and easy rolling back to previous revisions... git isn't the worst possible way of doing that.
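A rough sketch of that flow (the URL and tag naming scheme here are made up):

  # list release tags on the server without cloning anything
  git ls-remote --tags https://example.com/firmware.git 'refs/tags/release-*'

  # fetch just the chosen tag, shallowly, then export the tree
  git clone --depth 1 --branch release-42 https://example.com/firmware.git fw
  (cd fw && git archive release-42) | tar -x -C /opt/firmware

  # optionally throw away the repo metadata afterwards
  rm -rf fw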

(I've seen this sort of thing done with perforce, pretty much exactly as I describe, to deliver updates of internal tools and manage test builds of products. Daily builds of tools and products get checked in to perforce each day; most days, QA test that day's build results; if a given day's build proves not to be a big pile of crap, they tag the corresponding revision. Then you can use perforce to find half-decent historical builds and retrieve them. The place I saw this done had a little tool that somebody had written to put a friendly GUI face on the process.)


Isn't this a problem that should be solved with a package management solution like yum or apt-get plus something for configuration management like puppet, chef or ansible? Git would still be useful, but only on the server side.


Possibly. I'm thinking more of retrieving an entire image, so you could also use FTP. Anyway, this isn't really my field of expertise; I'm just foolishly throwing out a random suggestion...


I am running git on a jailbroken 3G iPad 1 during my commute. :)


Yeah, but what's the purpose? Are you editing text files (source code) on the iPad and you need to keep track of the changes?


Some people do! Textastic, for example, seems like a nice editor, and you can always plug a keyboard into an iPad.


Push to GitHub :-)


What's most notable about this isn't so much that all the core ideas of git are already in place in the first version. After all, it is, fundamentally, a pretty simple piece of software.

What shocks me is that that README was written literally days after the kernel project's license to BitKeeper got revoked.


Isn't it funny to think how, if Tridge hadn't done what he did, if all of Linus's public shaming of Tridge had done the trick and the Linux kernel had stayed on BitKeeper, we wouldn't have had git, or GitHub?

Linus did a good job of coming up with a replacement tool in a hurry, but I'm also grateful to Tridge for poking a stick into the hornets' nest.


I'd say we would still have had Mercurial, as it was apparently developed for similar reasons:

> Mackall first announced Mercurial on 19 April 2005. The impetus for this was the announcement earlier that month by Bitmover that they were withdrawing the free version of BitKeeper. [Wikipedia]


Yes. I was subscribed to LKML around the time the BitKeeper saga was in progress.

One thing that I do remember was just how arrogant Larry McVoy was about whether a replacement SCM could be developed with feature parity with BitKeeper. He kept talking about how he/BitKeeper was at least a decade ahead of the rest of the community and how it would be very difficult to come up with an alternative.

Thankfully, the open source community proved him wrong!


Any particular threads worth digging up?


I ran into McVoy on reddit a few years ago. He seemed totally different from the "scary evil public persona" that has been bandied about...

http://www.reddit.com/r/programming/comments/9zdlf/git_and_m...

...humble, concerned, and proud of how they've advanced the "state of the art" and his company's contributions to modern version control.

"""Imagine how I'd feel if Linus used BK for years and when moving off he made a centralized VCS."""


Well, to be fair to the people pushing the "scary evil public persona", I don't come across well in email. Never have.

I'm better in person; had a meeting with a guy from Germany a couple of years back who worked up the courage to invite himself to our offices in the south bay and after a couple of beers he made some comment like "wow, you're nothing like what I expected, you're actually a nice guy".

I get that a lot. It's a "gift" :)


Good to see you again! Hopefully business is still going well for you, although I imagine GitHub Enterprise is becoming a strong competitor (and, noticeably, Git is left off the comparisons page?).

From what I can tell, maybe age and experience have mellowed out your email tone a bit... pretty soon you'll turn into Ned from The Simpsons. ;-)


Well, our website sucks eggs. I used to maintain it; then we hired a supposed sales/marketing guy, and he stripped out all the screenshots, removed anything that actually gave you information, and replaced it with a bunch of gobbledygook that is supposed to resonate with the fully buzzword-enabled manager types. And then he left "to spend time with his family" :)

I've been too disgusted with the result to actually fix it, but it clearly needs some lovin'. I don't suppose anybody wants to tackle marketing BitKeeper? Not a fun job, given that we annoyed the open source crowd, but we do have some neat technology.

GitHub is in a different space; we don't see them much. We do see git of course, but lucky for us some of the design decisions in git left us some advantages (see the Facebook thread about 6 months ago).

I dunno about Ned, isn't he the religious guy? I'm more the grumpy old man :)


We would still have had Darcs [1]. And without its competitors Git and Mercurial, maybe Darcs would have gotten more enthusiasts, pushing it to where Git is today.

[1] http://darcs.net/


Darcs was pretty neat at the time, especially since it was the only open DVCS. And the theory behind it was interesting to read.

The only problem is that it felt non-robust in use, and because of that I was tempted to keep multiple copies of my repos, just in case of corruption. And that kinda defeats the purpose of a VCS.

Then git came along. Although a bit strange in the beginning, especially on Windows, it felt robust enough from the start. And it supported multiple workflows. I know that, whatever stupid thing I do in my repo, git will never lose any data, and I'll be able to google an answer for how to recover from my mistakes.

That being said, I'm glad darcs is still alive and kicking. The guys behind it are really smart, and who knows what new good ideas can come from it.


> Darcs was pretty neat at the time, especially since it was the only open DVCS.

You are forgetting Tom Lord's arch, which had a really bizarre interface (even more so than Darcs), but was conceptually very nice.


> You are forgetting Tom Lord's arch, which had a really bizarre interface (even more so than Darcs)

Darcs has a terrific, consistent interface, and Git has taken liberally from it: "git log -p" and "git add -p" are straight from darcs. AFAIK, nothing before darcs could look at a repo as a stream of diffs and comments.


Has Linus ever said why he disregarded Darcs and the other OSS options (wasn't Mercurial already starting to get somewhere at that time?).

Other projects not mature enough yet (and/or moving fast enough in that direction) and he wanted something now?

Or some technical points that he disagreed on, so wrote his own solution that worked the way he preferred instead of trying to change the established workings of other projects?


> Has Linus ever said why he disregarded Darcs and the other OSS options (wasn't Mercurial already starting to get somewhere at that time?).

Here are Linus's comments about Darcs: http://markmail.org/message/vk3gf7ap5auxcxnb

As for why not Mercurial: Mercurial was announced on April 19th, while git was announced on April 9th and was self-hosting by April 7th. There were benchmarks comparing git and Mercurial shortly after Mercurial was announced, and git was faster, although at least originally it used much more disk space (this was before git had implemented compression and pack support).


> Has Linus ever said why he disregarded Darcs and the other OSS options (wasn't Mercurial already starting to get somewhere at that time?).

I think you might be thinking of Monotone here. Mercurial was started some days after git.

EDIT: Wikipedia has some info about what Linus thought of Monotone. The key problem with it was performance. I have no idea to what degree that has been fixed today.

https://en.wikipedia.org/wiki/Monotone_%28software%29#Monoto...


You are probably right there, I think I have mixed up Monotone and Mercurial in my chronology.


> wasn't Mercurial already starting to get somewhere at that time?

Git and Mercurial were started within days of one another, for the same reason (loss of BitKeeper license for the kernel devs).

Mercurial was actually announced two weeks after Git (2005-04-19 vs 2005-04-06).


At the time darcs had some serious problems: you could quite reliably get into a "merge of death" situation where the tool would take ages (hours, days) to do a merge on a relatively small codebase. I don't actually know if that played into his thinking, or if he just didn't trust a tool written in a language he wasn't as happy hacking in.


I tried using darcs for a while before I ever touched git. What made me initially switch was that darcs is non-trivial to install if you can't use a package manager (say, because policy doesn't allow you to). After that, what really sold me on git was that it had much nicer integration with svn. I no longer use svn, but it was important at the jobs I had while I was getting started with dvcs in general.


There would also be Monotone, which had its first release in 2003. So we would have had at least two distributed version control systems.


I was not familiar with this story and had to look it up. Thanks for the pointer. Seems like the CEO of BitMover shot himself in the foot by being kind of unreasonable and triggering the development of various open source DVCSs, which surely are hurting his bottom line by now.

Clickable:

http://en.wikipedia.org/wiki/BitKeeper


>+ GIT - the stupid content tracker

Hmm… I think Linus thought it was stupid.

>+"git" can mean anything, depending on your mood.

Or... he was just in an everything-is-stupid mood.

>+ - random three-letter combination that is pronounceable, and not actually used by any common UNIX command. The fact that it is a mispronounciation of "get" may or may not be relevant.

I think everyone just goes with that. :)

>+ - stupid. contemptible and despicable. simple. Take your pick from the dictionary of slang.

Here! He said it again!

>+ - "global information tracker": you're in a good mood, and it actually works for you. Angels sing, and a light suddenly fills the room.

GitHub surely took it seriously and made it happen. Thank you!

>+ - "goddamn idiotic truckload of sh*t": when it breaks

Maybe I'll take it. Mainly because I don't want to learn how to think like Linus.

>+This is a stupid (but extremely fast) directory content manager. It doesn't do a whole lot, but what it _does_ do is track directory contents efficiently.

And here's Linus saying git is stupid, again. I think he really means it. Well, at least that day he did.

Anyway, thanks for fossil, Dr. Hipp! :)

edit: formatting.


He means stupid as in simple, unaware of things it does not need to know. Stupidity and ignorance in software are high praise, and they're very difficult to achieve.

See: http://en.wikipedia.org/wiki/Information_hiding

That style is just Linus's way. It comes across as a lot more aggressive on screen than in person. To get a better sense of him, I'd recommend watching the talk he gave evangelising git at Google back in 2007. It's pretty entertaining, and very educational on the issues facing VCS designers.

http://www.youtube.com/watch?v=4XpnKHJAok8

To a certain extent, if you can knock git together in a couple of weeks you get a free pass to talk however you want, as long as you don't hurt anyone.


Ouch! No need to be harsh. I certainly appreciate Linus's work. Kudos to him! I understood his point on the stupidity thing. It even got some laughs out of me, and that's the reason I posted it, so other people could laugh too.

I don't mean to start a flamewar on SCMs, really. But I think git is overly complicated, and I believe I'm not alone in that.

Anyway, don't take things too seriously and have a good one!

PS: You gave me a great idea. I should email Linus and offer to buy him some beers. That would surely be a great talk. Thanks, mate!


Sorry, really didn't mean to be mean whatsoever. It's hard to convey tone in internet comments :)


I used to keep messing up git until I studied the internal model. As long as you picture the DAG and where the refs point every time you execute a git command, you are good. Not that I am complaining; it is very powerful and tends to get easier with use.
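A couple of commands that help build exactly that mental picture:

  # draw the commit DAG and show what each ref points at
  git log --graph --oneline --decorate --all

  # and if a command does something unexpected, the reflog
  # records where your refs used to point
  git reflog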


Gotta love this password handling bit ...

  pw = getpwuid(getuid());
  if (!pw)
      usage("You don't exist. Go away!");
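For anyone puzzled: getpwuid(3) returns the passwd entry for a uid, and it can legitimately return NULL, e.g. for a uid with no /etc/passwd entry inside a minimal chroot. A standalone sketch of the same check:

  #include <pwd.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* look up the passwd entry for our own user id */
      struct passwd *pw = getpwuid(getuid());
      if (!pw) {
          /* no entry for this uid, e.g. in a bare chroot */
          fprintf(stderr, "You don't exist. Go away!\n");
          return 1;
      }
      printf("hello, %s\n", pw->pw_name);
      return 0;
  }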


A suggestion for HN: vote up articles you learnt stuff from, rather than articles in which you made some vapid political argument.




