
This has been the bane of my entire career. I have never, not once, encountered a CI system that could run the test suite faster than my development machine.

Which is absurd. Cloud resources on servers with more CPUs than I have, and yet the difference in timing is measured in tens of minutes.




From what I've seen this is usually caused by CI pipelines being overly cautious, starting from a pristine state and trying to cover every possible edge case.

Spin up a new node: congrats, now you have to clone the whole repo from scratch. Oh, "npm install" wasn't cached, let's download half of the internet again, and pull ":latest" of some multi-GB docker images while you're at it. For good measure, "make clean", because... you never know, and nobody bothered to architect our build system around a saner tool than Makefiles. Btw, remember to only use make -j3, because rumor has it that the build fails intermittently when using more cores! Finally, run every test that couldn't possibly be affected by the commit in question, preferably the big e2e suite. As someone else mentioned above, all of this running on some dog-slow cloud network-attached disk.

There you have it. Spinning up more compute is not the solution; it's the cause. Keep one machine on and improve your build tooling. I guarantee it will be orders of magnitude faster than any ephemeral pipeline.
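
Most of that list is fixable by keying caches on what actually changed. A rough sketch of the lockfile-keyed dependency cache idea (the paths and cache directory are placeholders, not any particular CI product's API):

    # Only run "npm ci" when the lockfile hash changes; otherwise restore
    # node_modules from a tarball kept on the persistent build machine.
    import hashlib, os, subprocess, tarfile

    def lockfile_key(path="package-lock.json"):
        return hashlib.sha256(open(path, "rb").read()).hexdigest()

    def restore_or_install(cache_dir="/ci-cache"):           # placeholder path
        os.makedirs(cache_dir, exist_ok=True)
        cached = os.path.join(cache_dir, lockfile_key() + ".tar")
        if os.path.exists(cached):
            with tarfile.open(cached) as tar:
                tar.extractall(".")                           # restore node_modules
        else:
            subprocess.run(["npm", "ci"], check=True)
            with tarfile.open(cached, "w") as tar:
                tar.add("node_modules")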


This is usually a result of following 'cloud best practices' instead of being pragmatic.

For example, using Kubernetes with Docker where each pod is some virtual ECS 4-core whatever instance, assuming that scaling the workload over 1,000 instances will be fast. Which in your case will lead to said pods individually spending 90% of their time running 'npm install', with each pod way slower than your desktop PC.

Different tests have different needs and bottlenecks. For example, if you're running unit tests, with no external dependencies, you'd want to separate out the build process, and distribute the artifacts to multiple test machines that run tests in parallel.
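
To make the "distribute the artifacts to multiple test machines" part concrete, a minimal sketch of a deterministic shard split (the helper and the test names are made up for illustration):

    # Each worker receives the prebuilt artifact plus its index, and runs
    # only its disjoint slice of the suite.
    def shard(tests, worker_index, worker_count):
        return [t for i, t in enumerate(sorted(tests)) if i % worker_count == worker_index]

    tests = [f"test_{i}" for i in range(10)]
    print(shard(tests, worker_index=0, worker_count=4))   # this worker's share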

The choice between a few big machines and many small ones is usually up to you, and is a wash cost-wise. However, I do need to point out that AWS sells 192-core machines, which I'd wager are WAY faster than what you're sitting in front of.

And Amdahl's law is a thing as well. If there's a 10-minute non-parallelizable build step, then a 10-minute test run on a 192-core machine means 20 minutes end to end, while an ~0-minute test run on infinitely many machines still takes 10: only a 2x speedup.
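
Spelling out that arithmetic with the same numbers:

    # Amdahl's law sanity check (numbers from the paragraph above).
    build_min = 10          # serial build step, can't be parallelized
    test_192_min = 10       # test run on one 192-core machine
    test_inf_min = 0        # idealized run on "infinite" machines

    total_192 = build_min + test_192_min    # 20 minutes end to end
    total_inf = build_min + test_inf_min    # 10 minutes end to end
    print(total_192 / total_inf)            # 2.0 -> only a 2x speedup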

And there's such a thing as setup costs. Spinning up an AWS machine from an image, setting up network interfaces, and configuring the instance from scratch all have a cost as well. Managing a cluster of 1,000 computers also has its own set of unique challenges. And if you ask Amazon for 1,000 machines at the drop of a hat, planning to use each for a minute or two, AWS will throttle the fuck out of you.

All I'm trying to say with this incoherent rambling is KISS: know your requirements and build the simplest infra that can satisfy them. Having a few large stopped instances in a pool might be all the complexity you need, and while it flies in the face of cloud best practices, it's probably going to be the fastest to start up.
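
As a rough illustration of the stopped-instance pool, a boto3 sketch (the instance IDs are placeholders and error handling is omitted; this is not a full autoscaler):

    # Start a pre-provisioned runner per job, then stop it afterwards so its
    # disk (checkouts, docker layers, node_modules) survives for the next run.
    import boto3

    ec2 = boto3.client("ec2")
    POOL = ["i-0123456789abcdef0", "i-0fedcba9876543210"]   # placeholder IDs

    def checkout_runner():
        instance_id = POOL[0]   # real code would track which IDs are free
        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        return instance_id

    def release_runner(instance_id):
        # Stop, don't terminate: stopped instances cost only their EBS storage.
        ec2.stop_instances(InstanceIds=[instance_id])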


I would argue the problem is that all the CI systems out there at the moment are "stupid": i.e. none of them try to predictively spin up an environment for a dev who is committing on a branch, none of them have any notion of promoting different "grades" of CI worker (i.e. an incremental versus a pristine) and none have any support for doing something nice like "lightweight test on push".

All of this should be possible, but the innovation is just not there.


> notion of promoting different "grades" of CI worker (i.e. an incremental versus a pristine) and none have any support for doing something nice like "lightweight test on push".

Which ones don't support that? Anything with a Docker cache, for example, can build layers efficiently, reusing previously built layers. Build triggers let you choose when to start a job, so GH Actions, for example, can trigger a full test on any PR and a light test on any commit to a branch.


> And if you ask Amazon for 1,000 machines at the drop of a hat, planning to use each for a minute or two, AWS will throttle the fuck out of you.

I haven't scaled that far out, but I thought that was the whole point of cloud platforms like AWS, GCP and Azure.


But then what will you contribute to the hour-long weekly meeting where 13 managers present the status of their poorly defined CI metrics and other "action items"?


Not my experience. The CI systems I dealt with most recently, for example, were very much compute-bound. Way too weak VMs for the runners, and way too few of them. With CI builds each taking upwards of 2h, and capacity to run maybe 10 of them in parallel, this pretty much destroys any ability to work with small commits - you have to squash them before submitting for review anyway, otherwise you destroy anyone else's ability to get their changeset through CI for the rest of the day.

This is entirely solvable by getting more resources. At least 2x more runners, each at least 2x as beefy. You could go 10x and I doubt it would be anywhere near as expensive as the amount of money the company wastes on devs waiting for CI. Alas, good luck convincing the people managing the infra of this.
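
For anyone who has to make that pitch to whoever owns the infra budget, a back-of-envelope comparison; every figure below is an assumption to swap for your own numbers:

    # Purely illustrative assumptions, not measurements.
    devs = 20
    ci_waits_per_dev_per_day = 3
    minutes_saved_per_wait = 60          # e.g. a 2h pipeline cut to 1h
    working_days_per_month = 22
    dev_cost_per_hour = 100              # assumed fully loaded rate

    extra_compute_per_month = 5_000      # assumed cost of 2x more, 2x beefier runners

    hours_saved = (devs * ci_waits_per_dev_per_day * working_days_per_month
                   * minutes_saved_per_wait / 60)
    print(hours_saved * dev_cost_per_hour, "vs", extra_compute_per_month)
    # 132000.0 vs 5000 -- the extra compute is small by comparison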


Builds taking 2h of compute? Let me guess, you are doing clean builds?

Incremental is several orders of magnitude faster and cheaper. Just need to invest in your build tooling.
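
The whole trick behind incremental builds is just "skip work whose inputs haven't changed"; a toy illustration of the mtime check make does (file names are placeholders):

    import os

    def needs_rebuild(source, output):
        # Rebuild only if the output is missing or older than its input.
        return (not os.path.exists(output)
                or os.path.getmtime(source) > os.path.getmtime(output))

    # e.g. needs_rebuild("main.c", "main.o") on a persistent build machine
    # is False for the vast majority of files on a typical commit.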


There was caching of third-party dependencies, but at least 2/3 of the time was spent on various tests, which for domain-specific reasons weren't trivial. Sure, everything there could be optimized further, but this is a textbook case of a problem which can be solved by adding more compute for a fraction of the cost of dev-hours spent trying to optimize compilation and test times.


Our runners autoscale, and you can choose whatever instance type you want. It doesn’t seem like a very hard problem to solve.


> I guarantee it will be orders of magnitude faster than any ephemeral pipeline.

Only if you have a single job to run. In practice it doesn't really matter to the dev whether CI takes one minute or five. It's when it goes over five minutes that it gets annoying.


Your summary is the reason people quit being developers at the earliest possible convenience.


In most cases I have experienced, this phenomenon is caused by the fact that cloud block devices are garbage compared to your machine's.


That, combined with the scheduling layers... on the CI service the company I'm working for uses, it takes like 2 minutes just to start the job.


Even creating whole new AWS instances each time, it shouldn't take more than a minute.

You can apparently bring it down to 5s by suspending instead of creating instances.
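
If that's EC2 hibernation, the fast resume comes from stopping with the hibernate flag; a minimal boto3 sketch (the instance ID is a placeholder, and hibernation has to be enabled at launch):

    import boto3

    ec2 = boto3.client("ec2")
    # Hibernate writes RAM to the root volume, so a later start_instances
    # resumes the OS instead of cold-booting it. The instance must have been
    # launched with HibernationOptions={"Configured": True}.
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"], Hibernate=True)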


It depends on how much policy your organization has. If your organization requires AV scanners and other security crapware to be on all instances, how long does that take to install? And perhaps they require that you only use their AMIs, which then must be loaded up with extra software to be usable. And perhaps they require all auth use their special auth service, which takes 5m to sync with. And on and on. I've worked in places where it takes 20m to spin up a usable instance, so we ended up just keeping them on most of the time.

In an ideal world, the policy groups imposing these inefficiencies would get billed for the cost of their policies, but I've only ever seen situations where everyone else pays the cost of their incompetence.


Ah, yeah, we kinda have to work around the fact that the antivirus and scanner stuff keeps installing long after the instance is done processing the job.

On the plus side, I actually get actionable notifications from the stuff, so it’s not purely bad for me.


And yet the data is ephemeral, so in many cases it could be handled faster without the "real" block storage guarantees.

GitHub Actions is pay-per-minute, unfortunately, so if anything they have an incentive not to speed things up. Unless people become frustrated enough to switch to a non-bundled provider.


Not all the data is ephemeral - stuff like node_modules needs to be cached, and if you're suggesting tmpfs, that's like 50x the cost of a fully tricked-out cloud SSD, which is already ridiculously expensive.


And often CPU as well if your machine has a reasonable number of cores.


You should just add an admission webhook that kubeadm’s your dev box onto the CI cluster with a taint that only your pods tolerate whenever you need to run the test suite. Call it edge cloud and people will love it.


There was a tech talk many years ago where one of the GitHub engineers talked about how fast the GitHub CI tests were (this was way before the Microsoft acquisition). Literally seconds. I wonder if they're still that fast.


They absolutely are not. Hundreds of thousands of Ruby tests end up being pretty slow to execute. The CI machines have all dependencies pre-cached and different tests are parallelized across multiple VMs, yet still a full suite takes ~12 minutes.


> Cloud resources on servers with more CPUs than I have

The server the resources run on may have more CPUs than your dev machine, but are you allocating more CPU to the specific test job than your desktop has?
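
One quick way to check from inside the job itself (Linux-only; note this shows the affinity mask, so cgroup CPU quotas, the usual Docker/Kubernetes limit, won't appear here):

    import os

    print("CPUs on the host:", os.cpu_count())
    print("CPUs this process may run on:", len(os.sched_getaffinity(0)))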


My experience is the opposite due to the insane overhead of the spyware^H^H^H antivirus and security services running on our personal devices. Some colleagues successfully applied for MacBooks since these are reasonably fast even while running the security services.


This is just a symptom of bad management.


DHH has talked about this over the last few months. He now recommends running CI locally on your own computer instead of on a server. CPUs are so fast now.

https://world.hey.com/dhh/we-re-moving-continuous-integratio...


Unless internal IT and security teams force running security spyware on personal devices.



