> Provided their clients don't figure out what's going on

I'm always pleasantly surprised when I see power management correctly configured at a client precisely because it is so rarely done correctly.

Even if cloud vendors do fix this one thing, there's about a dozen more such issues where the suppliers' interests don't line up with the downstream consumers' interests.

> Azure really don't offer any never-throttled instances? What about their dedicated hosts?

I don't believe so. For example, their highest performing single node is the HBv3, with 2x AMD EPYC 7V13 processors for a total of 120 cores / 240 threads.

On this page: https://docs.microsoft.com/en-us/azure/virtual-machines/work...

There is a box that says:

    Nodes per Socket (NPS) = 2
    L3 as NUMA = Disabled
    NUMA domains within VM OS = 4
    C-states = Enabled

The technical term for CPU power management is "C-states". In other words, they've left it at the default (enabled) on a machine they charge $40K/year for!
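
For what it's worth, you can at least see what the hypervisor exposes from inside a Linux guest. A minimal Python sketch, assuming the standard cpuidle sysfs layout (what a VM actually reports depends entirely on the host configuration):

    import glob, os

    # Enumerate the idle (C-) states the kernel exposes for CPU 0.
    # Anything beyond "POLL" means C-states are available/enabled.
    for state in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*")):
        name = open(os.path.join(state, "name")).read().strip()
        latency = open(os.path.join(state, "latency")).read().strip()
        disabled = open(os.path.join(state, "disable")).read().strip()
        print(f"{os.path.basename(state)}: {name:10s} exit latency {latency:>6s} us, disabled={disabled}")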



> I'm always pleasantly surprised when I see power management correctly configured at a client precisely because it is so rarely done correctly.

Let's say I have Linux instances in the cloud. What would be the steps to take for proper power management?


As far as I know, there are no steps you can take if you're running your instance in a virtual machine. Only the hypervisor host and/or the firmware can control the processor power and sleep states.
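
On bare metal, though (or on the rare instance types where the guest kernel does see cpuidle), the usual knobs are BIOS settings, kernel boot parameters like intel_idle.max_cstate= / processor.max_cstate= / idle=poll, or the PM QoS interface. A minimal Python sketch of the PM QoS route (needs root; the request only lasts while the file stays open):

    import struct

    # /dev/cpu_dma_latency is the kernel's PM QoS interface: writing a latency
    # target (in microseconds) tells cpuidle to avoid idle states whose exit
    # latency exceeds it. Writing 0 effectively disallows deep C-states.
    # The request is dropped as soon as the file descriptor is closed.
    f = open("/dev/cpu_dma_latency", "wb", buffering=0)
    f.write(struct.pack("i", 0))

    input("PM QoS request active; press Enter to release...")
    f.close()

This is essentially what tuned's latency-performance profile does under the hood.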


TL;DR: C-States improve performance.

I'm not sure calling C-States throttling is completely accurate, at least not in the way that you imply.

Modern CPUs have multiple mechanisms that can be called throttling in some way, but C-States are the least offensive and in modern CPUs actually increase performance.

There's a good basic explanation here [1], but in short:

- C-States are about idle cores and powering them down. You can configure how deeply they are allowed to power down. Waking up does have a tiny latency cost, yes, but it's mostly irrelevant (your core spends most of its time waiting for data anyway).

- P-States are what I would call "ok" throttling. They apply to non-idle cores: modern cores scale their frequency up and down based on how much work they have. Again, in theory this is purely about power saving, with close to zero impact on performance. (Both knobs are visible from Linux via sysfs; see the sketch right after this list.)
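
A minimal Python peek at the P-state side (the C-state side was shown further up), assuming the standard Linux cpufreq sysfs layout:

    from pathlib import Path

    # The cpufreq (P-state) knobs for CPU 0: which driver/governor picks the
    # frequency, and where the current frequency sits between min and max.
    cpufreq = Path("/sys/devices/system/cpu/cpu0/cpufreq")
    for name in ("scaling_driver", "scaling_governor",
                 "scaling_min_freq", "scaling_cur_freq", "scaling_max_freq"):
        print(f"{name}: {(cpufreq / name).read_text().strip()}")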

This was all simple, and used to be purely about power saving, so disabling it was fine. But things changed massively 10+ years ago with the introduction of the various Turbo mechanisms.

Turbo is about looking at your CPU as a whole (and not per core), and maximising performance of the loaded cores by pushing their frequency, using the power budget that used to be allocated to the idle cores (that are now powered off).

The way it's done is by adding a few extra P-states that push the frequency up by a few hundred MHz. The more cores that are parked in a C-state, the higher the frequency the working cores are pushed to.

Extreme example: you have a 28-core CPU running a single-threaded workload on one core, with the other 27 cores idle. With C-States disabled, you lose Turbo and that single core is stuck at the non-Turbo base frequency. With C-States enabled, because the unused cores are idled, the working core gets higher clock speeds and higher performance.

(Things get more complex with AVX2, because those units consume a lot of power, and that's where you get into the really bad throttling.)
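
You can watch this happen on a Linux box. A rough Python sketch (assuming the cpufreq sysfs files report per-core frequency, which depends on the driver): pin a busy loop to core 0, then compare per-core clocks.

    import os, time
    from multiprocessing import Event, Process

    def spin(stop):
        # Pin the busy loop to core 0 so exactly one core is fully loaded.
        os.sched_setaffinity(0, {0})
        while not stop.is_set():
            pass

    def freq_mhz(cpu):
        path = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_cur_freq"
        with open(path) as f:
            return int(f.read()) / 1000

    if __name__ == "__main__":
        stop = Event()
        worker = Process(target=spin, args=(stop,))
        worker.start()
        time.sleep(2)  # give the governor / turbo logic time to react

        # With C-states enabled, core 0 should report a boosted clock while the
        # idle cores sit near their minimum frequency; with C-states disabled,
        # core 0 typically ends up much closer to the base frequency.
        for cpu in sorted(os.sched_getaffinity(0)):
            print(f"cpu{cpu}: {freq_mhz(cpu):.0f} MHz")

        stop.set()
        worker.join()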

[1] : https://software.intel.com/content/www/us/en/develop/blogs/c...


This might be the case for some artificial benchmark, but I have never seen such a case in practice.

I have run many, many tests on real-world workloads, not artificial 100% loads of either a single core or all-cores, as you typically get in benchmarks. In all cases, disabling C-states made overall performance at least 20% better, and often 200% better (3x faster)!

There are lots of reasons for this, and I could write pages and pages on how the jitter introduced by the sleeps is murder on TCP throughput, or how the frequency ramp-up time is glacially slow, or how most servers are idle most of the time and only the cold-start latency matters to end users. Then I could keep ranting about the waste of it all, and go on and on about how Californian greenie hippie morons tried to save the planet one database server at a time. Then finish up by showing that 10 Gbps Ethernet's latencies interact with the core sleep timings just so, to make things as bad as possible.

But instead of taking my word for it, you could just do your own tests on your own software, ideally in a production system involving n-tier networking instead of some isolated, synthetic benchmark.

PS: By design, power management has no impact on anything running at 100% load for extended periods, so no synthetic benchmark will ever show any impact. That's precisely why very few people are aware of the issue...
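
If you want a quick first look before testing real workloads, here's a tiny Python sketch of the wakeup-jitter side of this (a poor man's cyclictest): run it with C-states enabled and then capped, and compare the tails. Numbers depend heavily on the kernel, governor and hardware, so treat it as a smoke test, not a benchmark.

    import statistics, time

    # Request a short sleep and measure how late we actually wake up.
    # Deep C-states tend to show up as a larger, more variable overshoot.
    REQUEST_US = 200
    overshoots = []
    for _ in range(10_000):
        t0 = time.perf_counter_ns()
        time.sleep(REQUEST_US / 1e6)
        elapsed_us = (time.perf_counter_ns() - t0) / 1000
        overshoots.append(elapsed_us - REQUEST_US)

    overshoots.sort()
    print(f"median overshoot: {statistics.median(overshoots):.1f} us")
    print(f"p99 overshoot:    {overshoots[int(len(overshoots) * 0.99)]:.1f} us")
    print(f"max overshoot:    {overshoots[-1]:.1f} us")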


Can confirm. I've seen C-states absolutely wreck performance on certain low latency workloads. In particular, lots of small random writes on a storage cluster.


Unfortunately it turns out that "low latency workloads" are not some esoteric thing only high-frequency traders care about, but an accurate description of most enterprise software. Anything n-tier, anything with an API tier or an external cache, anything vaguely K8s-like, or even just a database and a chatty ORM.



