
CPU rate is only quasi-continuous. The fact of a thread running or not running is binary. You cannot be 63% on a CPU; that figure is an artifact of sampling and integration.

But the real trouble with CPU rate is that it represents the past, not the future. It tells you what happened to the last request and nothing about what will happen to the next request. A signal with actual predictive power is how many other requests are running or waiting to run.
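As a concrete sketch of that signal (Python; the limit of 64 and every name here are invented for illustration, not taken from any real system): admit a request only while the in-flight count is below a fixed limit, and have the caller send a 503 otherwise.

    import threading

    MAX_IN_FLIGHT = 64  # invented capacity limit for the example

    _in_flight = 0
    _lock = threading.Lock()

    def try_admit():
        # Admit only while concurrency is under the limit; the count of
        # requests already running or waiting is the predictive signal.
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                return False  # caller responds 503 instead of queueing more
            _in_flight += 1
            return True

    def release():
        # Call when the request finishes, success or failure.
        global _in_flight
        with _lock:
            _in_flight -= 1

The counter itself is what has predictive power: it says how much work is already committed before the next request arrives.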

Like everything, the devil's in the details.

In our case, non-idle cumulative thread CPU time (including system activity) is a nice metric. If it exceeds 70%, start sending some 503s (or 500s, depending on the client). A single core could easily service 100B requests per day, a few times more than we needed, so if we exceeded that by enough margin to saturate every core to 70%, the right course of action was some combination of (1) scaling up, and (2) notifying engineering that a moderately important design assumption had been violated. We could still service 100B requests per day per core, but the excess resulted in a few 5xx errors for a minute while autoscaling kicked in. That's not good enough for every service, but it was exactly what we needed and leaps and bounds better than what we had before, with the side benefit of being "obviously" correct (simple code causes fewer unexpected bugs).
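A minimal sketch of that check (Python; the one-second sampling interval, the class shape, and the names are mine, while the 70% threshold and the thread CPU time metric, user plus system, come from the comment above):

    import time

    class CpuShedder:
        def __init__(self, threshold=0.70, interval=1.0):
            self.threshold = threshold
            self.interval = interval
            self._last_wall = time.monotonic()
            self._last_cpu = time.thread_time()  # user + system CPU of this thread
            self._utilization = 0.0

        def should_shed(self):
            # Recompute utilization at most once per interval; between
            # samples, reuse the last value so the hot path stays cheap.
            now = time.monotonic()
            elapsed = now - self._last_wall
            if elapsed >= self.interval:
                cpu = time.thread_time()
                self._utilization = (cpu - self._last_cpu) / elapsed
                self._last_wall, self._last_cpu = now, cpu
            return self._utilization > self.threshold

A request handler would check should_shed() and return a 503 (or 500, depending on the client) while autoscaling catches up.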

The core idea (tweaking the numerator in the exponential smoothing) isn't tied to any one easily refuted definition of CPU time, though. It's that you can tailor the algorithm to your individual needs. A related metric is the wall-clock time from when a request is first queued to when it is serviced (depending on your needs, perhaps compared against the actual processing time of that request). If you ever reach total_time >= 2 * processing_time, that likely indicates a queue about to exceed a critical threshold.
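A rough sketch of that heuristic with exponential smoothing (Python; assumes the caller records monotonic timestamps at enqueue, start, and finish, and ALPHA plus all names are illustrative):

    ALPHA = 0.2  # smoothing factor, picked arbitrarily for the example

    _smoothed_ratio = 1.0

    def record(enqueued_at, started_at, finished_at):
        # Track total_time / processing_time; a smoothed ratio at or
        # above 2.0 is the comment's signal that the queue is nearing
        # a critical threshold.
        global _smoothed_ratio
        total = finished_at - enqueued_at      # queue wait + processing
        processing = finished_at - started_at  # processing alone
        ratio = total / max(processing, 1e-9)  # guard zero-length requests
        _smoothed_ratio = ALPHA * ratio + (1 - ALPHA) * _smoothed_ratio
        return _smoothed_ratio >= 2.0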

> how many other requests are running or waiting to run

I referenced that idea pretty explicitly: "If you have more control over the low-level networking then queue depth or something might be more applicable, but you work with what you have."

Yes, that matters. That doesn't make other approaches invalid. Even then, you usually want a smoothed version of that stat for automated decision-making.
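For instance, a smoothed version of the running/waiting count might look like this (Python; ALPHA and the names are placeholders, and a periodic sampler is assumed to supply the raw depth):

    ALPHA = 0.1  # smoothing factor, illustrative

    _smoothed_depth = 0.0

    def observe_queue_depth(depth):
        # Compare the smoothed value, not the raw sample, against limits
        # so one bursty reading doesn't trigger an automated decision.
        global _smoothed_depth
        _smoothed_depth = ALPHA * depth + (1 - ALPHA) * _smoothed_depth
        return _smoothed_depth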

