Here's why I'm excited about this: Data Versioning
Github is awesome for science. But my code and workflows are often just as reliant on the data I have as on the code I've written. With Github, I always have to treat my data separately: a different workflow, a different storage location, different (and often manual) versioning.
Now, I can start integrating my data into the same workflow. No size limits. Just drop it in and version it like everything else. 1GB? No problem. 500GB? Just pay the money.
This is especially awesome because, as a scientist/dev, I do not want to stand up my own infrastructure and servers/VMs. I don't want to do Linux updates. I don't want to worry about things going down. It just needs to be there when I need it. When I don't need it anymore, I'll take it down. Done and done.
We've built a Git extension (git-bigstore) that helps manage large files in Git. It uses a combination of smudge/clean filters and git-notes to store upload/download history, and it integrates nicely with S3, Google Cloud Storage, and Rackspace Cloud. Just list the file types you'd rather not store in your repo in .gitattributes and you're good to go. Our team has been using it for a while now to keep track of large image assets for our web development projects. Cheers!
If you're looking for an open source system for managing fat assets in git, we made a tool that integrates with S3: https://github.com/dailymuse/git-fit
Git-fat uses smudge/clean filters. We used git-media which employs the same technique, and it didn't work well for us. See the section about git-media in the readme.
I maintain a similar tool to manage large files in Git [1], but went the clean/smudge filter route. I think git-media gets into states where it can process files twice even though it shouldn't. From the Git docs:
> For best results, clean should not alter its output further if it is run twice ("clean→clean" should be equivalent to "clean"), and multiple smudge commands should not alter clean's output ("smudge→smudge→clean" should be equivalent to "clean").
So this isn't the fault of clean/smudge filters, just the way they were used with git-media.
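For reference, the plumbing behind these filter-based tools is only a few lines of config. A rough sketch (the filter name and patterns are made up, and 'cat' just stands in for whatever clean/smudge commands the tool you install actually wires up):

    # .gitattributes: route matching files through a filter driver named "bigfiles"
    *.psd filter=bigfiles
    *.zip filter=bigfiles

    # wire up the driver; real tools replace 'cat' with commands that swap the
    # working-tree file for a small pointer and push/pull the real bytes to S3 etc.
    git config filter.bigfiles.clean  cat
    git config filter.bigfiles.smudge cat

The point being debated above is what those clean/smudge commands do, and whether they behave idempotently when git runs them repeatedly.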
We experimented with smudge/clean filters with our own implementation and they just didn't seem like the right solution for fat asset management.
The most frustrating problem was that filters are executed pretty frequently throughout git workflows, e.g. on `git diff`, even though assets rarely ever change. The added time (though individually small) created a jarring experience.
I'd also be curious how git-bigstore addresses conflicts. It seems like a lot of the filter-based tools out there don't handle them well for some reason.
I would ramp down your excitement; this doesn't solve any of the problems that Git has around large files. Dropping a 1GB data file into your Git repo will cause problems for you, even if the server doesn't block it like GitHub does.
Oh sure, but at least my future limitation will be Git's ability to process and handle large files (and maybe the cost of storage), not Github's file and repo limits.
It's reasonable to assume that Github's limits are caused by Git's limits. When Git begins supporting large files, most likely Github will be able to raise their limits as well.
You sure about that? Github is an SCM company, not a bulk storage company. When your business is not storage, arbitrarily large storage requests aren't interesting...
Github has gone to great lengths to begin supporting diffs on all sorts of files: 3D models, images, and other assets. I'd be willing to bet that some sort of bundled asset management is someone's dream project at Github.
I'm not sure that a server-side rewrite addresses the core issue. Git is designed so that a git fetch pulls the entire repo down. The reason large binary files are a problem is because you are going to pull down the entire history of those files in order to clone the repo. Unless they're also shipping a custom git client (unlikely) then I don't see how they can solve this problem on their end.
edit: Come to think of it, the feature they could potentially implement on their side is somehow having the history of binary files purged. I'm not a git expert, but if they are able to trick git into thinking the files are new, with no history, each time they change, then I guess that could work.
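For what it's worth, a stock git client can already skip history at clone time, which gets you part of the way there (URL is just an example):

    # fetch only the tip commit; none of the old revisions of those big files come down
    git clone --depth 1 https://example.com/some/repo.git
    # later, from inside the clone, pull the full history if you end up needing it
    git fetch --unshallow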
The core issue is not fetches. The last time I tried to use git to manage scientific data, commits sometimes took ~10 hours. And this was with all text files, not binaries. This was ~100k files and 20 gigs of data.
Have you ever looked into using an S3-backed git repo via JGit? Not sure how many files/revisions you have, but it may be a cost-effective alternative that you can start using today.
We use it for auto deploys from autoscaled instances because GitHub has poor uptime.
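Roughly what that looks like with the JGit command-line tool, if memory serves (bucket and repo names are placeholders):

    # ~/.jgit holds the credentials for the S3 transport:
    #   accesskey: AKIA...
    #   secretkey: ...
    git remote add s3 amazon-s3://.jgit@my-git-bucket/myrepo.git
    jgit push s3 master
    jgit fetch s3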
No, but I don't understand what this is. Java Git? Bound to Eclipse? I don't really understand how this would be different from me storing a .git repo locally and replicating it to S3.
I've run into problems with large-ish files in git repos, binaries accidentally committed etc. I'm genuinely curious, is there a good way to use git for the size repositories that you mention?
Git is great for smaller binaries. Ideal in fact, given that it stores differences between revisions as binary deltas. For large >1GB files, I believe the diffing algorithm is the limiting factor (I would be interested in getting confirmation of that, though). For those files something like git-annex is useful (http://git-annex.branchable.com/)
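A minimal git-annex flow, in case it helps (the S3 remote settings are just an example; credentials come from the usual AWS environment variables):

    git annex init
    git annex add bigdata.h5            # checks in a symlink/pointer; the content is tracked separately
    git commit -m "add dataset"
    # park the actual bytes in S3 and drop the local copy until you need it again
    git annex initremote mys3 type=S3 encryption=none bucket=my-annex-bucket
    git annex copy bigdata.h5 --to mys3
    git annex drop bigdata.h5
    git annex get bigdata.h5            # pulls it back later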
I've used git to push around a lot of binary application packages and it's very nice. Previously I was copying around 250-300MB of binaries for every deployment--after switching to a git workflow (via Elita) the binary changesets were typically around 12MB or so.
I've had no trouble with non-github git repos handling text files well into the hundreds of megabytes. I haven't pushed the limit on this yet, but for me, personally, my datasets are often many hundreds to thousands of individual files that are medium sized. So it tends to work fairly well. Singular large files may not scale well with git.
I don't know of a great solution for Git, but I've heard that Perforce is more suited to handling large binaries - I believe it's used in many game development studios, where binary assets can number in the high gigabytes.
Domino has a nice solution for version control and reproducibility in the context of analytical / data science work: they keep a snapshot of your code, data, and results every time you run your code. So it’s like version control plus continuous integration. Supports large data files, R, Python, etc. http://www.dominodatalab.com/
Yeah, but their system is super simplistic and doesn't support half of the diverse and robust operations that Git and Github do. Can I see version history? Are changes compressed into diffs for text files? etc, etc.
You may not be aware that data versioning is a built-in feature of S3 [1]. No VMs, no management; just put all your stuff in one place and it's versioned automatically.
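For example, with the AWS CLI (bucket and file names made up):

    # turn versioning on for a bucket once...
    aws s3api put-bucket-versioning --bucket my-research-data \
        --versioning-configuration Status=Enabled
    # ...then every overwrite keeps the previous object around
    aws s3 cp results.csv s3://my-research-data/results.csv
    aws s3api list-object-versions --bucket my-research-data --prefix results.csv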
As others have pointed out, Git is not a great way to do big-data versioning. Git is almost exactly the wrong tool for the job; almost any other revision control system handles large-file versioning better, because they don't expect every client to clone the entire history of the archive.
If you use Amazon's AMIs, they do a lot of updates for you. Recently, they updated their AMI for the bash and SSL security exploits. If you have a permanent instance, you would have to run some updates yourself, but if you use auto-scaling, you can just update the group and then, when the instances cycle, the new updates will be applied.
Git's storage model is such that every commit is physically a snapshot of the entire work tree. Git makes this efficient with delta compression (essentially, deduplication) which is extremely efficient for text (in fact an entire git repository can be smaller than a single SVN checkout) but is less effective with large binary files since changes don't produce compressible deltas.
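You can poke at this in any repo; the delta-compressed history all lives in the packfiles:

    git gc                   # repack loose objects into a packfile
    git count-objects -vH    # size-pack = on-disk size of the packed history
    # per-object sizes and delta-chain depth, largest objects first
    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k3 -n -r | head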
> in fact an entire git repository can be smaller than a single SVN checkout
Only because svn downloads the plain text of the server's version of every single file from the revision, so that "svn diff" doesn't have to hit the server.
I don't think it's branching. From what I understand, it's changing the file. Rather than storing a diff for non-text files, it just keeps the old versions around. So if you store a lot of binary files that change very often, it makes your repo huge.
We had this problem before we moved to a dependency manager. Our repo was almost 2GB, even though the checked-out code was well under 1GB.
The way I understand it, branches in git are essentially just a pointer to a revision, which automatically gets updated as you make changes. Branches don't have any overhead.
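You can see it on disk: a branch is literally a 41-byte file containing a commit hash (at least until git packs the refs):

    git branch demo
    cat .git/refs/heads/demo    # just the SHA-1 of the commit the branch points at
    git rev-parse demo          # same answer via plumbing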
I've thought about building GitHub / BitBucket tools (issue management, pull requests etc) that work with data stored in git repositories, as opposed to one big tool that owns the git repository and manages the data in a locked-up SQL database. We already have this for code browsers (e.g. cgit), but that alone can't compete with the "full suite" that others offer.
It would have several advantages for the user:
1) You could pick your issue management separately from your code browser
2) You don't have to worry that the best tooling (Github) has some of the least reliable storage.
3) If it stored the data in git, it would work offline!
The big downside is that having to "assemble your own" is more complicated for the user, and that it isn't Github/BitBucket's business model...
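Purely as a hypothetical sketch (nothing standard, just to show no magic is needed), issues could be plain files on a dedicated branch that any of those tools read and write:

    git checkout -b issues
    mkdir -p .issues/42-data-versioning/comments
    echo "title: data versioning broken for large files" > .issues/42-data-versioning/issue.md
    echo "works for me with git-annex" > .issues/42-data-versioning/comments/001-alice.md
    git add .issues && git commit -m "file issue #42"

Since it's all just commits, it syncs, branches, and works offline like everything else.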
I like the idea of some good, open source tools to do the different bits, preferably focused on flexibility and ease of data access (as you mentioned).
To me this would mean an issue tracker and some kind of merge request/review tool. These could be paired with the likes of Gollum, providing a wiki.
One thing though: I would really love to see them support more than just git. Mercurial shouldn't be hard to support, as it has largely similar concepts, and even though SVN is not as popular as it was at its peak, it's still widely used and a very good choice for some workflows/use cases.
In the same vein, I would love to see a decent stand-alone code browser that supports different VCSes.
Thanks - great input (and the Gollum suggestion is a good starting point!)
On multiple VCS: On the one hand, git is so dominant (for better or worse) that I don't know if the complexity is worth it. On the other, perhaps git is so dominant because github is so dominant. The answer in open source land is probably to set up a reasonable abstraction layer and let interested people provide their own implementations of the VCS-interface.
My key focus for this would be on a more unix-philosophy approach (separate tools that do one thing well).
I like the idea of using a VCS repo (again, ideally any of those supported, not just git) to store related information such as issues/tickets. However, I don't know what the practical constraints of that would be; my first thought is that issues/tickets stored primarily in a (version-controlled) filesystem, and presumably indexed on a server for web views, searching, etc., would not be as intuitive as a wiki like Gollum.
If the issues/tickets tool used a flexible SQL model (e.g. a choice of MySQL, PostgreSQL, or SQLite), I would be happy with that, as the data is still quite open (to the system owner, admittedly not as much to the individual developers).
Customers fear vendor lock-in, so an open platform/ecosystem for git could flourish, like Java. Hopefully they'll have a better business model for it than Sun did. Maybe Amazon's?
Figuring out a good business model (or a way to survive without one) is indeed the challenge here! I'm not sure that AWS's margins will support this (and historically they haven't been great at supporting open-source anyway)
The fact that so much of the world's private source code is on a single public website (GitHub), which has repeatedly been hacked via relatively easy exploits, is pretty frightening to me.
Not sure it helps, but an observation is that Oracle's relational database occupied a somewhat similar role, in that it tied together data from different sources and enabled cross-platform interoperation. This was very important at the time, because there were several different hardware vendors, and companies didn't want to get locked in. And of course, a whole ecosystem of third-party tools grew up around Oracle - especially reporting tools, BI, and warehousing - and everything had bindings for relational databases.
Not sure how that translates as a business model, as no one owns git (like Oracle owned Oracle). However, although Sun didn't make money from Java, everyone else did - so there is a business model, just as a user of git rather than an owner: e.g. reporting tools based on git, CI tools, code quality tools.
So maybe the answer is just to treat git as infrastructure for the app you sell, as opposed to trying to make money from the infrastructure itself. Linux is a similar case.
No, it would not be "great" to have things like that built directly into your version control system.
Built on top of it (like how Gollum is a wiki using git for storing documents) is not necessarily bad, but a built-in system is the worst possible solution IMO.
To me, it looks like the biggest advantage over existing solutions is point 4: "Faster Development Lifecycle." If you're running all your infrastructure out of an AWS datacenter anyway, moving your source control servers into the same datacenter will make checkouts in automated deploys marginally faster.
Actually, things are running pretty smooth for months now - see for yourself at: https://status.github.com/graphs/past_month. If that's not your experience, you can reach out to support@github.com and we'll look into it.
We love github, don't get me wrong (and we pay decent $ for it). But I'm looking at track record / experience with it doing production auto deploys for years.
S3 has 11 9s of durability and is "closer" to where our servers run, so there are fewer possible points of failure. We have had 0 issues since migrating to an S3-backed git repo 2 years ago.
We still use github for day-to-day SCM management, as you guys have tons of extra value-add.
Not only faster, but better. I love AWS and love how they keep rolling out new features to make deployment easier. My current development lifecycle isn't yet close to perfect because of all the gaps with piecing together different providers and tool sets. To the degree that things can be streamlined and automated, all the better. In the end, I'm hoping AWS comes out with a 1 click continuous integration solution - from dev to deploy.
Bezos to kids being born 3 years from now: "You kids are just features. Keep an eye on our announcements page."
It's interesting how they seem to be adding all the parts of the workflow where we're okay with things just being good enough and don't benefit from bells and whistles. Github is very cool for managing an open source project, but for my own stuff, just about anything that makes managing my repos relatively simple and saves me a bit of time is fine. I don't need much in the way of interface or features. Just being able to push a button and have a hosted repo ready for other users, with instructions for less technical contributors, is awesome.
If I'm already on Github, I'm probably staying there. Competition is good though. Competition will push Github to continue doing more to differentiate themselves as something much greater than simply hosting Git repos.
Strange use of "git-based" wording. Do I understand correctly that they implemented their own git command-line binaries that you install to replace the original git binaries? Possibly a fork of the git client and server?
Probably based on JGit/Gerrit. It is much easier to build "cloud" git (i.e. reliable, redundant git) that way than it is to make your disk reliable. Github went the reliable-disk route with DRBD, AFAIK, which is why Github is always going down ;-)
Intriguing post! For those who hadn't heard about DRBD, it's http://en.wikipedia.org/wiki/Distributed_Replicated_Block_De... - it makes failover incredibly simple to integrate by pushing a lot of distributed-systems responsibility down to the filesystem level, but in doing so it forces you to have a single-master setup, and you can't take advantage of parts of your domain model that are well-suited to eventual consistency. Thinking about it, now that Riak supports strong (quorum-based) consistency on a per-bucket basis, Riak-backed scalable Git hosting would probably be relatively easy to implement. Looks like you were building your own system - did you ever release the code?
The git object data itself goes into a blob-store (like S3); it can be stored without strong consistency. It turns out you only need to keep track of a very small amount of metadata consistently (the refs). Riak, etcd, DynamoDB or the Google Cloud DataStore would all be good choices, I think.
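A ref update is just a compare-and-swap on ~40 bytes, so the consistent store barely has to do anything. A sketch of the idea against DynamoDB (table and attribute names are made up):

    # atomically move refs/heads/master from the old sha to the new one;
    # the write fails if someone else updated the ref first
    aws dynamodb put-item \
        --table-name git-refs \
        --item '{"repo":{"S":"myrepo"},"ref":{"S":"refs/heads/master"},"target":{"S":"d670460b4b4aece5915caf5c68d12f560a9fe3e4"}}' \
        --condition-expression "#t = :old" \
        --expression-attribute-names '{"#t":"target"}' \
        --expression-attribute-values '{":old":{"S":"e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"}}'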
I was working on an open-source implementation of Raft as part of this (called barge), but it isn't as reliable as the alternatives above - yet!
Very interesting. I hope that's not the real rationale inside Github, as it totally ignores the fact that the blobs are immutable and thus trivially cacheable.
The traditional git server is made up of primarily two programs: 'git-upload-pack' and 'git-receive-pack'. These tools work over stdin/stdout in extremely well-documented ways. It is not inconceivable to implement your own git server that adheres to the existing protocol.
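And a stock client will happily talk to anything that speaks the pack protocol, e.g. (paths made up):

    # serve every repo under /srv/repos read-only with git's bundled daemon
    git daemon --export-all --base-path=/srv/repos --reuseaddr
    # the client can even be pointed at an explicit upload-pack program on the remote
    git clone --upload-pack=/usr/bin/git-upload-pack user@host:/srv/repos/project.git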
Though many won't see any sort of advantage to switching, it would be very interesting to know what kind of effect this has on both Github and Bitbucket. There doesn't seem to be any major drawcard that I can see that will have people chomping at the bit to change over. I'm happy to be enlightened though.
On another note, is it just me or has Amazon really been ramping up with their releases lately? It feels like there is something new each week at the moment.
That said, doesn't seem to offer a compelling alternative to Github, based on what they're currently saying. There could be value in tight integration with other Amazon tools, but that seems like it'll come more from intentional lock-in than added value.
Github is also very vulnerable on price if you have the need for a lot of small, private repos. If Amazon charges on storage (like they do with S3) instead of the number of repos, it will make this a compelling alternative for a lot of businesses.
Bitbucket is excellent for that use case. In fact, github often isn't even an option since even the largest plan is limited to 125 private repos. Bitbucket, meanwhile, has no limits on the number of repos (it's a fee per user instead)
> even the largest plan is limited to 125 private repos
Leaving the conversation about small plans aside, we can set up a plan for you that has as many private repositories as you need. Just email sales@github.com and we can get it set up for you.
As far as I can tell, VS Online also allows "unlimited" projects and repos per project. And TFS's non-SCM features are pretty nice. Plus it's free for the first 5 users.
I don't know. I used to work for AWS, and this sounds very much like an internal tool that is used to manage the entire Amazon retail infrastructure. That being the case, this is going to be a lot more powerful and useful for online infrastructure (in AWS, obviously) than anything else out there. Even if it's not all that initially, I suspect the extra functionality will come sooner rather than later.
It's somewhat annoying I know, but I'm not going to go into more detail as I'm not sure about the legality of my position should I do so.
Edit: On review of their website, I had missed the announcements for Amazon CodePipeline and Amazon CodeDeploy, which between them provide all the missing functionality I hinted at above. And so yes, it looks like this is the internal tool I was referring to.
CodeDeploy is essentially Apollo (lite, so to speak).
CodePipeline is Amazon Pipelines.
CodeCommit is presumably Amazon GitFarm (according to our former AWS engineer here).
We've got an amazing Builder Tools team here at Amazon, I must say.
> That said, doesn't seem to offer a compelling alternative to Github
I have limited (but at least some) experience working with GitHub Enterprise, and for the longest time, their answer to backup was to take the entire system offline while you performed the backup. I believe they have since improved in this area, but it was clear that while GitHub itself offers an extremely available service, GHE is severely lacking in this respect.
CodeCommit on the other hand promotes high availability from the start.
> I believe they have since improved in this area, but it was clear that while GitHub itself offers an extremely available service, GHE is severely lacking in this respect.
That is correct. On earlier versions of GitHub Enterprise, backing up data was quite a hassle. For a consistent (repository) backup taken at the VM level you didn't have to shut it down, but you had to switch the appliance into maintenance mode, effectively preventing people from getting things done. This isn't the case anymore, though, as we shipped new backup utilities [1] and support for HA setups with the 2.0.0 major release [2] some time ago.
I'm not sure how the overwhelmingly dominant player in retail and hosting moving to become the overwhelmingly dominant player in source hosting can count as "competition", except in a Gatesian sense.
Off topic, but the sign-up process is very bad. Why do they need full name, email, company name, role, address, phone number, captcha, etc., just to sign up for more information?
It actually isn't out till 2015... so this is just asking if you are interested. Perhaps it's an MVP -- with a signup form -- and if there is enough interest they will pursue it.
Very few organizations use git the way the linux kernel project does. Most organizations use github, gitlab, etc, which definitely have had scaling issues.
I recommend Gitlab for anyone looking for something Github-like. On a $10/month DigitalOcean instance you can be up and running quickly [1]. There's a rake task for backups and it's easy to get it to sync to S3.
Their omnibus installer is ridiculously easy to use if you don't mind seeing Chef messages scroll past for 20 minutes. I had it tied into Active Directory and everything in less than an hour after downloading it. Compare that to using apache-svn and those goddamn LDAP sync scripts and apache conf folder definitions and it feels like you're using some shit from the future.
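The backup bit really is about two lines (paths are the omnibus defaults, bucket name made up):

    # nightly: dump repos + database, then ship the tarball off-box
    sudo gitlab-rake gitlab:backup:create
    aws s3 sync /var/opt/gitlab/backups s3://my-gitlab-backups/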
Interesting. I recently started mixing GitHub and S3 for blog content and it's a pain in the ass. Asking the internet about pushing GitHub content to S3 results in just a bunch of mediocre scripts. I like it.
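For the simple case it mostly boils down to one sync after a pull or build (bucket and paths are examples):

    git pull origin master
    # push the generated site to the bucket, removing files that no longer exist locally
    aws s3 sync ./_site s3://my-blog-bucket --delete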
I also look forward to Perforce's response. For versioned binary files Perforce is still the best. Competition, hooray!
You mean to tell me I just spent a day last week setting up Jenkins, GitHub and a VM when I could've waited for my second week on the job and moved it all to Amazon?
That aside... this looks great. If I'm not fully satisfied with my setup (right now I'm not) I might make the move to this. Lower friction is always welcome.
I switched from Jenkins to CircleCI (they offer a free plan now). Took me only 20 minutes to get everything running, and with Docker integration to boot. Really pleased with it.
Personally I'd hold off committing any source code (outside of personal projects) to something this new anyway, especially if it's only your first week on the job.
> CodeCommit integrates with AWS CodePipeline and AWS CodeDeploy to streamline your development and release process. CodeCommit keeps your repositories close to your build, staging, and production environments in the AWS Cloud.
There will also be a number of companies where a "small" DevOps team leverages a large number of cheaper systems (owned, leased or whatever) for lower cost.
If I were a DevOps person I would learn and master all of the AWS tools (as well as competitor services). It could be a great opportunity rather than a threat.
Not to be a negative Nelly or anything, but with the speed at which Amazon is releasing this stuff, they are eventually going to have to pull a Google and yank the rug out from under our feet by discontinuing several products.
AWS SimpleDB is an example of a product that Google would've likely killed years ago but Amazon maintains. It hasn't gotten any significant update since 2010 (http://aws.amazon.com/blogs/aws/amazon-simpledb-consistency-...) and, judging from the fact that it only gets a couple posts a month in its AWS Forums, it's very sparsely used.
Uggh no way. The AWS Web Console is a beastly enough UI to get around, I have no interest in migrating away from the very comfortable, friendly Github interface.
I still have trouble trusting a company which I primarily know as "that all-round webshop" to get involved with my tech stack. It feels as if McDonald's started offering server infrastructure overnight and got away with it.
McDonald's doesn't make money from their website, which is the problem with your comparison. If we were comparing McDonald's, we would be comparing their fast food infrastructure.
If you were an up-and-coming fast food burger joint going from 1 shop to 2 shops, and possibly more, you sure as hell would want McDonald's strategic advice on how to run your burger shop as well as possible. They've been there and done it; not that you need to follow their guidelines, but you definitely want their knowledge in your back pocket. They've got about 70 years of experience.
Same with Amazon, who has to keep Amazon.com online or they lose massive amounts of money.
Geez, oh yes. They built their platform and decided to start reselling it. That's why Netflix, Pinterest, Airbnb, Expedia, Foursquare, NASA, and MLB all use AWS.