
From what I've seen, this is usually caused by CI pipelines being overly cautious: starting from a pristine state and trying to cover every possible edge case.

Spin up a new node, congrats, now you have to clone the whole repo from scratch. Oh, "npm install" wasn't cached, so let's download half of the internet again, and pull ":latest" of some multi-GB docker images while you're at it. For good measure, "make clean", because... you never know, and nobody bothered to architect our build system around a saner tool than Makefiles. Btw, remember to only use make -j3, because rumor has it the build fails intermittently with more cores! Finally, run every test, even the ones that couldn't possibly be affected by the commit in question, preferably the big e2e suite. As someone else mentioned above, all of this running on some dog-slow cloud network-attached disk.
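For contrast, here's a minimal sketch of what a non-wasteful GitHub Actions job could look like (assuming an npm + Make project; the image tag, paths and keys are placeholders, not a drop-in config):

    # shallow clone, warm npm cache, pinned image, parallel make
    on: push
    jobs:
      build:
        runs-on: ubuntu-latest
        container:
          image: node:20-bookworm          # pinned tag instead of ":latest"
        steps:
          - uses: actions/checkout@v4
            with:
              fetch-depth: 1               # shallow clone, not the full history
          - uses: actions/cache@v4
            with:
              path: ~/.npm
              key: npm-${{ hashFiles('package-lock.json') }}
          - run: npm ci --prefer-offline   # reuses the cache instead of re-downloading the internet
          - run: make -j"$(nproc)"         # assuming the parallel-build flakiness is actually fixed

None of this is exotic; it just has to be someone's job to set it up and keep the caches warm.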

There you have it. Spinning up more compute is not the solution; it's the cause. Keep one machine on and improve your build tooling. I guarantee it will be orders of magnitude faster than any ephemeral pipeline.




This is usually a result of following 'cloud best practices' instead of being pragmatic.

For example, using Kubernetes with Docker where each pod is some virtual 4-core ECS-whatever instance, assuming that scaling the workload over 1,000 instances will be fast. In your case that leads to each pod individually spending 90% of its time running 'npm install', while being way slower than your desktop PC.

Different tests have different needs and bottlenecks. For example, if you're running unit tests with no external dependencies, you'd want to separate out the build process and distribute the artifacts to multiple test machines that run the tests in parallel.
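Roughly like this, as a hypothetical GitHub Actions sketch (the --shard flag is runner-specific, Jest and Playwright both support something like it, and "dist" is a placeholder path):

    on: push
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci && npm run build
          - uses: actions/upload-artifact@v4
            with:
              name: dist
              path: dist/
      test:
        needs: build                # build once...
        runs-on: ubuntu-latest
        strategy:
          matrix:
            shard: [1, 2, 3, 4]     # ...then fan the tests out
        steps:
          - uses: actions/checkout@v4
          - uses: actions/download-artifact@v4
            with:
              name: dist
              path: dist/
          - run: npm ci && npm test -- --shard=${{ matrix.shard }}/4

Same idea works with a few big dedicated machines: one build, N workers pulling the same artifact.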

The choice between a few big machines and many small ones is usually up to you, and is a wash cost-wise. However, I do need to point out that AWS sells 192-core machines, which I'd wager are WAY faster than whatever you're sitting in front of.

And Amdahl's law is a thing as well. If there's a 10-minute non-parallelizable build process, doing a 10-minute test run on a 192-core machine vs an ~0-minute test run on infinite machines ends up being only a 2x speedup.
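Concretely, with the numbers above:

    time on one 192-core machine = 10 min build + 10 min tests = 20 min
    time on an infinite fleet    = 10 min build + ~0 min tests ≈ 10 min
    speedup                      = 20 / 10 = 2x

The serial 10 minutes dominate no matter how many machines you throw at the test half.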

And there's such a thing as setup costs. Spinning up an AWS machine from an image, setting up network interfaces, and configuring the instance from scratch all have a cost. Managing a cluster of 1,000 computers also has its own set of unique challenges. And if you ask Amazon for 1,000 machines at the drop of a hat, planning to use each for only a minute or two, AWS will throttle the fuck out of you.

All I'm trying to say with this incoherent rambling is KISS - know your requirements and build the simplest infra that can satisfy them. Having a few large stopped instances in a pool might be all the complexity you need, and while it flies in the face of cloud best practices, it's probably going to be the fastest to start up.


I would argue the problem is that all the CI systems out there at the moment are "stupid": i.e. none of them try to predictively spin up an environment for a dev who is committing on a branch, none of them have any notion of promoting different "grades" of CI worker (i.e. an incremental versus a pristine) and none have any support for doing something nice like "lightweight test on push".

All of this should be possible, but the innovation is just not there.


> notion of promoting different "grades" of CI worker (i.e. an incremental versus a pristine) and none have any support for doing something nice like "lightweight test on push".

Which ones don't support that? Anything with a Docker cache, for example, can build layers efficiently, reusing previously built layers. And build triggers let you choose when to start a job, so GitHub Actions, for example, can trigger a full test suite on any PR and a light test on any commit to a branch.
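For example, a single hypothetical workflow can do both (test:unit and test:e2e are placeholder script names):

    on:
      push:
        branches-ignore: [main]
      pull_request:

    jobs:
      light:                                  # quick feedback on every push
        if: github.event_name == 'push'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci && npm run test:unit
      full:                                   # the heavy suite only on PRs
        if: github.event_name == 'pull_request'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: npm ci && npm run test:e2e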


> And if you ask Amazon for 1,000 machines at the drop of a hat, planning to use each for only a minute or two, AWS will throttle the fuck out of you.

I haven’t scaled that far out, but I thought that was the whole point of cloud platforms like AWS, GCP, and Azure.


But then what will you contribute to the hour-long weekly meeting where 13 managers present the status of their poorly defined CI metrics and other "action items"?


Not my experience. The CI systems I dealt with most recently, for example, were very much compute-bound: way too weak VMs for the runners, and way too few of them. With CI builds each taking upward of 2h, and capacity to run maybe 10 of them in parallel, this pretty much destroys any ability to work with small commits - you have to squash them before submitting for review anyway, otherwise you destroy everyone else's ability to get their changeset through CI for the rest of the day.

This is entirely solvable by getting more resources: at least 2x more runners, each at least 2x as beefy. You could go 10x and I doubt it would be anywhere near as expensive as the money the company wastes on devs waiting for CI. Alas, good luck convincing the people managing the infra of this.


Builds taking 2h of compute? Let me guess, you are doing clean builds?

Incremental builds are several orders of magnitude faster and cheaper. You just need to invest in your build tooling.
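If it's a compiled build under Make, like the Makefile setup described upthread, even ephemeral runners get most of the way there by carrying a compiler cache between runs. A rough GitHub Actions sketch, assuming gcc and ccache (cache keys and paths are placeholders):

    on: push
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/cache@v4
            with:
              path: ~/.ccache
              key: ccache-${{ github.ref_name }}-${{ github.sha }}
              restore-keys: ccache-${{ github.ref_name }}-   # reuse the newest cache from this branch
          - run: |
              sudo apt-get update && sudo apt-get install -y ccache
              export CCACHE_DIR=~/.ccache
              make -j"$(nproc)" CC="ccache gcc" CXX="ccache g++"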


There was caching of third-party dependencies, but at least 2/3 of the time was spent on various tests, which for domain-specific reasons weren't trivial. Sure, everything there could be optimized further, but this is a textbook case of a problem which can be solved by adding more compute for a fraction of the cost of dev-hours spent trying to optimize compilation and test times.


Our runners autoscale, and you can choose whatever instance type you want. It doesn’t seem like a very hard problem to solve.


> I guarantee it will be orders of magnitude faster than any ephemeral pipeline.

Only if you have a single job to run. In practice it doesn’t really matter to the dev whether CI takes one minute or five. It's when it goes over five minutes that it gets annoying.


Your summary is the reason people quit being developers at the earliest possible convenience.



