
Lots of people are saying that having large files in a repo is wrong, bad, bad design, incorrect usage.

Forget that you know git, github, git-lfs, even software engineering for a moment. All you know is that you're developing a general project on a computer, you are using files, and you want version history on everything. What's wrong with that?

The major issue with big files is resources: storage and network bandwidth. But for both of these it is the sum of all object sizes in a repo that matters, not any particular file, so it's odd to keep harping on big files being bad design or evil.




I did just over a decade in chip design. Versioning large files in that domain is commonplace and quite sane. It can take wall-clock days of processing to produce a layout file that's hundreds of MBs. Keeping that asset in your SCC system alongside all the block assets it was built from is very desirable.

Perforce handled it all like a champ.

People who think large files don't belong in SCC are...wrong.


That's why Perforce is still the SCM of choice for a lot of creatives.

I don't know if they still do it, but Unreal used to ship a Perforce license with their SDK.


That's also why perforce is slow as heck unless you throw massive resources at it. I also work in the chip industry BTW.


I occasionally used to start a sync, go get coffee, chat with colleagues, read and answer my morning email, browse the arXiv, and then wait a few more minutes before I could touch the repo. In retrospect, I should have set up a cron job for it all, but it wasn’t always that slow and I liked the coffee routine. We switched to git. Git is just fast. Even cloning huge repos barely leaves enough time to grab a coffee from down the hall.


I mean "massive resources" is just de rigeur across the chip industry now. The hard in hardware is really no longer about it being a physical product in the end.


I've only used Perforce for two years and it didn't feel slow at all. The company wasn't exactly throwing money at hardware.


I don't like it (but used it for many years).

I love Git, but, then, I don't have a workflow that would benefit from Perforce.


> Lots of people are saying that having large files in a repo is wrong, bad, bad design, incorrect usage.

I don't think that is true. You do see people warn that putting large files in Git, or in any repository that wasn't designed with large files in mind, is "wrong", in the sense that there are real drawbacks to using a system that was not designed to handle them.

Here's a historical post from Linus Torvalds commenting on Git's support for large files (or lack thereof):

https://marc.info/?l=git&m=124121401124923&w=2


> Forget that you know git, github, git-lfs, even software engineering for a moment. All you know is that you're developing a general project on a computer, you are using files, and you want version history on everything. What's wrong with that?

THANK YOU. Fucking prescriptivists ruin everything.


How is it not bad design? Let's say you are working on a team. Would you really want your colleagues spending a significant amount of time cloning your artifacts? Your own framing asks us to forget we're developers, and the argument doesn't hold up even then: even my grandma isn't going to want to wait an hour to download a giant file from version control, assuming she even knows what version control is. Large blobs can go into versioned object storage like GCS or S3 instead.
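To sketch how that could be wired up (the bucket name, file name, and hash here are made up and the hash is truncated): hash the blob, upload it under its hash, and commit only a tiny pointer file.

    # compute a content hash and use it as the object key
    $ sha256sum cutscene_intro.bin
    9f2c1a...  cutscene_intro.bin

    # upload the blob once; keyed by hash, so old versions are never overwritten
    $ aws s3 cp cutscene_intro.bin s3://my-asset-bucket/blobs/9f2c1a...

    # commit only the small pointer to version control
    $ echo 9f2c1a... > cutscene_intro.bin.sha256
    $ git add cutscene_intro.bin.sha256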


In Subversion at least, you'd do a partial checkout. If you don't need a particular directory you just don't check it out. If you lay out your repo structure well there's no problem. It was incredibly convenient.
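For anyone who never used it, a sparse checkout looked roughly like this (repo URL and paths invented for the example):

    # check out only the top level, then deepen just the parts you need
    $ svn checkout --depth immediates https://svn.example.com/repo/trunk project
    $ cd project
    $ svn update --set-depth infinity src
    $ svn update --set-depth empty big-assets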

I've tried many different SCMs over the years and I was happy when git took root, but its poor handling of large files was problematic from the beginning. Git being bad at large files turned into the best practice of not storing large files in git, which was shortened to "don't store large files in SCM." I think that's a huge source of our availability and supply-chain headaches.

I have projects from 20 years ago that I can build because all of the dependencies (minus the compiler -- I'm counting on it being backwards compatible) are stored right in the source tree. Meanwhile, I can't do that with Ruby projects from several years ago because gems have been removed. I've seen deployments come to a halt because no startup runs its own package server mirror and those servers go offline or a package gets deleted mid-deploy. The infamous left-pad incident broke a good chunk of the web, and that wouldn't have happened if the package had been fetched once and then added to an appropriate SCM. Every time we fetch the same package repeatedly from a package server we're counting on it not having changed, because no one does any sort of verification any longer.


SCC systems that handle big files don't suffer from the "you have to clone all the history and the entire repo all the time" problem that git suffers from. At least Perforce doesn't...

git has its place, but it has really warped how people think about SCC. There are other ways to approach it that aren't the ways git approaches it.


When you make a video game you want version control for your graphics assets, audio, compiled binaries of various libraries, etc. You might even want to check in compiler binaries and other things you need to get a reproducible build. Being able to chuck everything in source control is actually good. And being able to partially check out repositories is also good. There is no good technical reason why you shouldn't be able to put a TB of data under version control, and there are many reasons why having that option is great.


The versioned object storage solves nothing. If your colleagues need the files, they're going to have to get them, and it's going to be no quicker getting them from somewhere else. Putting them outside the VCS won't help. (For generated files, you may have options, and the tradeoffs of putting them in the VCS may not be worth it. But for hand-edited files, you're stuck.)

If the files are particularly large, they can be excluded from the clone, depending on discipline and/or department. There are various options here. Most projects I've worked on recently have per-discipline streams, but in the past a custom workspace mapping was common.
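For the non-Perforce folks: a custom workspace mapping is just an exclusion line in the client view, something like this (depot and workspace names are made up):

    View:
        //depot/game/...             //alice-ws/game/...
        -//depot/game/cinematics/... //alice-ws/game/cinematics/...

Anyone with that mapping simply never syncs the excluded directory.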


> Would you really want your colleagues spending a significant amount of time cloning your artifacts?

Not just the artifacts, but their entire history. That is a problem Git has out of the box, but there is no reason it needs to work that way by default. Large-file support should be a first-class citizen of a VCS, not an afterthought.
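To be fair, modern Git has grown some of this, though it's opt-in rather than the default and the server has to support partial clone. For example (the URL is just a placeholder):

    # don't download any blobs until a checkout or command actually needs them
    $ git clone --filter=blob:none https://example.com/big-repo.git

    # or skip only blobs above a size threshold
    $ git clone --filter=blob:limit=10m https://example.com/big-repo.git

The history of the big blobs stays on the server until you actually touch them.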


So how would you version a game that needs assets? These files must be versioned but can be very big, for example long cutscene videos.

Some projects need the ability to version big files; there is a good reason why Perforce exists and is widely used in the gaming industry.


I am not saying that it is a better UX, but hashed/versioned blobs on S3 would mostly work depending on tooling integration.


That's building a custom version control on top of the version control you're already using.


Not really; it is more like building a custom storage layer for your VCS.

You are still relying only on git as the source of truth for which artefacts belong to which version.


Isn’t that essentially what git lfs is?


I believe so, but with a different UX. In almost every case I'd expect git lfs to be better, but I can see reasons to use more custom flows.
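For reference, the pointer file git-lfs checks into the repo in place of the blob looks like this (the oid and size below are placeholders); a hand-rolled S3 scheme is basically the same idea with your own storage backend and tooling:

    version https://git-lfs.github.com/spec/v1
    oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    size 157286400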


Git is designed with a strong emphasis on text source and patches. It simply isn't designed for projects with large assets like 3D animation, game dev, etc. Having said that, solutions like LFS, Annex and DVC (not git-specific) work really well (IMO). If you don't like that, there are solutions like Restic that can version large files reasonably well (though it's a backup program).
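Typical usage of two of those, for reference (paths and patterns invented):

    # git-lfs: files matching the pattern are stored as pointers from now on
    $ git lfs track "*.mp4"
    $ git add .gitattributes assets/cutscene.mp4

    # DVC: the blob is tracked via a small .dvc metafile, data goes to a remote you configure
    $ dvc add assets/cutscene.mp4
    $ git add assets/cutscene.mp4.dvc assets/.gitignore
    $ dvc push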


This is an example of a more general problem. We adopt some principle or practice for rational reasons, and then, as a mental shortcut, conflate it with taste, aesthetics, cleanliness. But no software or data is inherently 'dirty' or 'ugly'; we feel that way because of mental associations, and that intuition is unreliable: the original reasons may no longer apply, or may matter less than they once did.



