It took me a while to find out what Canva actually does, but from https://www.canva.com it appears they are an online design / collab tool.
To be fair, that could mean a _lot_ of functionality and code providing a rich, SPA, JS heavy experience. Modern JS frameworks aren't exactly known for being concise.
But still, 60 million lines of code and half a million files is a sure sign that someone said at one point "sure, we can throw all these generated files into git!". Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"
Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.
> Didn't someone on the team say "hey, it takes 10 seconds to run git status, can we move this junk out and do this another way??"
Why do you assume they didn't?
Just because they arrived at a different conclusion than you doesn't mean they didn't think about it. It might very well mean you didn't consider the tradeoffs they had to take into account, mainly because you're out of the loop.
One reason for doing this is that someone might be using a very small set of the monorepo, so having everything prebuilt means that in general things are very fast, without constantly having to rebuild chunks of your system.
This is, I think, very common with stuff like browsers, where you have artifact checkouts that basically include built stuff since otherwise you're sitting there compiling forever all the time.
Stuff like Bazel in theory helps with this, but such tools are either super idiosyncratic about how they work (meaning hard to adopt) or outright don't work.
I mean personally I would find that pretty annoying and unclean but I like DAGs.
The post itself covers in the intro that they made the conscious decision to go with a monorepo, accepting the downsides of it. Much more than this article though, I'd like to read one discussing that decision and why they went that direction.
Canva gives users without design tools expertise the ability to make fairly polished looking graphics with a super easy and intuitive interface. (As a designer, I can assure you that polished looking is not the same thing as designed.) It’s a very popular service, so they’re dealing with huge scale. Intuitive interfaces often come with complex mechanisms and lots of assets, and they have clients on every major mobile and desktop platform, and the web. They also do a ton of heavy graphics processing that is likely done in lower-level languages than the interface.
The likelihood of a 2000 employee software company simply not considering that they could streamline their build process is pretty slim.
They have obviously invested a lot over time into streamlining their build process: so much so that they're publishing an article about it.
All of the problems they are having are basically due to their use of a monorepo: they do explain that they made the decision early, but I wonder what advantages they are seeing over multiple repos that made all this trouble worth it?
We can talk about the advantages of monorepos, but your question is phrased in a way that makes me think that you don't see any "trouble" in multiple repositories.
I would encourage you to do some research and keep an open mind.
They seem to have purposely left that out of the scope of this article. There are myriad articles about the benefits and downsides of monorepos vs multiple repos.
In another sibling comment I mentioned how I looked through their engineering blog and didn't see any post where they have talked about any of the benefits they are enjoying due to their use of monorepo.
The question is specifically about their use case, since not everybody would hit the same bottlenecks as they did with git monorepos.
In other words, have they stopped and thought about whether it's still worth it (e.g. how often do their engineers make use of monorepo benefits like cross-project refactorings)?
Don't rely on this anecdotal heuristic. Have a look at some enterprises. My experience: "How to solve scaling issues?" - "Automation? No, another team." ;)
This is not a 15k employee enterprise riveting features onto a codebase from the 90s; it's a <10yo company that makes one product. Big difference in approach.
Hey hey, author here. The xlf files are translations that are coupled with the texts we set in the code, so they're not really generated; I admit that was misleading. What I wanted to get across is that they're not touched directly by engineers, but they're still created through our translation pipeline, where real humans translate them.
How do you split your XLIFF files? Does each project get one big one and the proliferation is simply due to number of languages, or do you have a more granular split (eg. if you've got one component, it will have dozens of XLIFF files for every language, instead of one per language)?
By the numbers you mention, 70% of the files being translations puts the ratio of code files to translation files at roughly 1:3, so unless you only support 3-5 languages it's definitely not one XLIFF file per source file. So at what granularity is it split?
(My experience is mostly with localizations using GNU gettext tools, and you usually do a small-finite-number of PO files per project per language, where that small-finite number is exactly one for like 99% of projects)
One of my biggest pet peeves: tracking generated files in version control.
The only exception is our generated OpenAPI spec, because we want people to be explicit about modifying the API, and have a CI task that verifies that the API and OpenAPI spec match.
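For what it's worth, that kind of check can be a pretty small script. Here's a minimal sketch in Python; the "make generate-openapi" target and the docs/openapi.json path are made-up names, so substitute whatever your project actually uses:

    #!/usr/bin/env python3
    # CI check: fail if the committed OpenAPI spec no longer matches the code.
    # Sketch only -- the generator command and spec path are hypothetical.
    import json
    import subprocess
    import sys
    from pathlib import Path

    COMMITTED_SPEC = Path("docs/openapi.json")     # hypothetical committed location
    REGENERATED_SPEC = Path("/tmp/openapi.json")   # scratch output

    # Regenerate the spec from the current code (hypothetical make target).
    subprocess.run(["make", "generate-openapi", f"OUT={REGENERATED_SPEC}"], check=True)

    committed = json.loads(COMMITTED_SPEC.read_text())
    regenerated = json.loads(REGENERATED_SPEC.read_text())

    if committed != regenerated:
        print("OpenAPI spec is out of date: regenerate it and commit the result.")
        sys.exit(1)
    print("OpenAPI spec matches the code.")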
One alternative that enables you to keep generated files out but still feel like there's an explicit human check in place is to add a gated confirmation step in CI to confirm that the changes to the generated spec match expectations.
Something like: "This change will result in the following new API endpoints: ... do you wish to continue?"
What does it compare against, though? Do we need to add more state to the CI? We kinda like having the interface be part of version control, with an audit chain that's part of the code.
Yeah, either you'd have to maintain the generated spec as a versioned artifact hosted wherever is most convenient for you, or the CI could actually generate the before and after specs based on the PR diff. If calculating the spec is computationally expensive (it shouldn't be), then the latter approach could be a problem.
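And the "here are the new endpoints" part is basically a set difference over the two specs' paths. A rough sketch in Python, assuming earlier CI steps have already produced the base and head specs as JSON files (how you generate those is up to your pipeline):

    #!/usr/bin/env python3
    # Print API endpoints added between two OpenAPI specs (base vs. head).
    # Sketch only: both spec files are assumed to be produced by earlier CI steps.
    import json
    import sys

    HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head"}

    def endpoints(spec):
        # Collect (METHOD, path) pairs from the OpenAPI "paths" object.
        return {
            (method.upper(), path)
            for path, ops in spec.get("paths", {}).items()
            for method in ops
            if method.lower() in HTTP_METHODS
        }

    with open(sys.argv[1]) as f:   # spec generated from the PR's base commit
        base = json.load(f)
    with open(sys.argv[2]) as f:   # spec generated from the PR's head commit
        head = json.load(f)

    added = sorted(endpoints(head) - endpoints(base))
    if added:
        print("This change will result in the following new API endpoints:")
        for method, path in added:
            print(f"  {method} {path}")
        sys.exit(1)   # non-zero exit lets CI gate here for manual approval
    print("No new API endpoints.")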
Lock files protect you from the version changing out from under you, but modules disappearing from NPM is a thing that happens. Yes, you can use artifactory or similar as a proxy but that requires infrastructure that you may not want to run. That is all to say: there are situations where committing node_modules is the least evil.
Well... unless some devs have M1 Macs and some of the Docker layers are not available for arm, or the other way around, not available for amd64. That gives you some interesting issues.
Docker is a congregation of technologies held together with duct tape and glue.
Eg. permissions handling is completely different on Macs with Docker Desktop from the Linux dockerd stuff: on Macs, it automatically translates user ownership for any "mounted" local storage (like your code repository), whereas on Linux user IDs and host system permissions are preserved. Have some developers use Macs and others use Linux, and attempt to do proper permissions setup (otherwise known as "don't run as root"), and you are looking for some fun debugging sessions.
At companies that don't check in node_modules or build folders and are using standard packaging tooling like Maven or Yarn or npm or what-have-you. Yes, I haven't experienced that in like 15 years.
npm didn't support lockfiles until version 5, released in 2017; Yarn had them at launch in 2016. Before that, committing node_modules was often used as a form of vendoring, to get reproducible builds.
If a new project these days commits node_modules to git, it's likely a mistake, but for legacy projects started before 2017 it was the lesser of two evils.
Hm, this project was started in 2017. The node_modules directory was for Serverless (a tool written in Javascript), not the website itself (which was written in AngularJS - probably not the best choice in 2017 either).
Prior to lock files (and potentially after, as checked-in files are beyond trivial to modify and review and that can be worthwhile) committing dependencies in some form was basically the only reasonable way to have reproducible builds, unless you wanted to build your own package manager / lock file implementation.
Based on how brittle GitHub Actions is, I'd be ready to commit node_modules, except that I'm building cross-platform software with native dependencies.
Not everyone has the skills to build the toolset and use it. My brother called last night for help changing some Sass variables in a Bootstrap theme. He's a data scientist and had no idea how to build Bootstrap's JS and apply the new variables. If Bootstrap came from npm fully built, over half of the problems he called me about (15 times!) would have been avoided.
People coming from the SVN world do not think that this is unusual or problematic. And unfortunately even recently I've seen SVN still in use at large legacy companies.
I don't think it's unfortunate. We use subversion for development in our team and it does everything we need it to.
We looked into git and didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot
For many processes I think SVN is (and has been for many, many years) an absolutely fine method of version control.
> didn't find it offered any features that would significantly improve our process, but found 1000 more ways to shoot ourselves in the foot
You're not wrong about this.
I really like git for the cheap branching, which encourages branching and merging often. But SVN might have cheap branching now, as another commenter implies.
My experience is that anything dealing with a branch, especially but not exclusively creating branches, is very slow in SVN for a repo of any real size, basically anything with a framework.
I do not remember if "stat" was particularly slow, but SVN in general is slow.
Huh. 10 gigabyte svn repo at work spanning about 40 projects. Creating branches is virtually instantaneous. It's just a copy, which is a free operation (just a link). Curious as to why it would be slow for you.
svn cp https://server/svn/trunk/project/ https://server/svn/branches/project/ticket -m "making a branch here"
svn status, even for an entire repo checkout (which is not common) is also fast.
And yeah, it has the virtue of simplicity as well, doing very well at narrow and shallow checkouts, even though I'd love to have mercurial's feature set.
It's also rather good in the "wiki" situation since people can operate on their single files without needing to update, sync and merge.
> Creating branches is virtually instantaneous. It's just a copy which is a free operation (just a link).
Copy is not a "free" operation, but a symlink is close to "free" if you're measuring disk space.
What version of SVN are you using? I'm certain that older SVN versions would actually copy the entire project's files, not symlinks but real copies. That would take forever, and running out of disk space was a real concern.
An svn copy is just a link. It has always been a "free" operation (and yes, the analogy would be to a symlink).
I'm not aware of any version of svn that behaved the way you described and I've used it for a couple of decades.
I can perhaps imagine a large repo plus a broken svn client requiring checking out unneeded portions of trees to do a copy, but no client I've used works like that.
Hm. Another theory. Perhaps someone who knew nothing about svn and was using TortoiseSVN's Windows file manager integration was doing a Windows file manager copy, then checking that in as a "branch", with the only link being the commit message, instead of using svn's copy, which is free and properly links content. That would indeed be an expensive operation, and the wrong thing to do.
These days mercurial has a cached blame/annotate called "fast annotate", which I love because of one particular awesome feature, --deleted, which must be seen to be appreciated, I think.
I have this alias in my .hgrc file: fad=fastannotate -u -n -wbB --deleted
It's by the Facebook engineer Jun Wu, who also made the even more awesome "absorb".
> Given that 70% of their repo is generated files, that discussion and the tradeoffs involved don't get nearly enough attention from OP.
It's perfectly fine to use Git to track things other than source code. In fact, right on the manpage, Git calls itself "the stupid content tracker".
I've been using Git with git-annex to track archival files with their associated metadata.
We keep our data separate from our source code, and segment our data into individual Git repositories for each collection.
Git gives us many features that we would have had to build into our app in other ways (data integrity, fixity, etc), though this came with costs.
To my eye it probably would have been better for Canva to use multiple separate repositories instead of a monorepo, but I'm not them and their use-case is not mine.
I think it is a crowd-dumbing effect. Since hundreds of engineers share the monorepo, no one can or cares to make the decision, or is able to push for an alternative. Even when everyone is complaining, that is still far from everyone agreeing on an alternative. The crowd settles at the lowest common denominator.