Great investigation. Does anyone know or can anyone intelligently speculate what...

dragontamer · on June 18, 2018

Well, first you have to understand exactly what "Pause" is.

"Pause" is a hint to the CPU that the current thread is spinning in a spinlock. You've ALREADY tested the lock, but it was being held by some other processor. So why do you care about latency? In fact, you probably want to free up as many processor resources as possible.

Indeed, there aren't actually any resources wasted when you do a "pause" instruction. In a highly-threaded environment, the hyperthread-brother of the thread picks up the slack (you give all your resources to that thread, so it executes quicker).
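
To make the pattern above concrete (test the lock, find it held, then pause while waiting), here is a minimal sketch in Rust. It is only an illustration, not how any particular runtime does it: the `SpinLock` type and its method names are hypothetical. `std::hint::spin_loop()` is the standard-library spin hint, which emits `pause` on x86/x86-64.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical minimal spinlock, for illustration only.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Try once to take the lock; returns true if we now own it.
    pub fn try_lock(&self) -> bool {
        // `swap` returns the previous value: `false` means the lock was free.
        !self.locked.swap(true, Ordering::Acquire)
    }

    pub fn lock(&self) {
        loop {
            if self.try_lock() {
                return;
            }
            // The lock was already tested and is held by someone else.
            // Spin on a plain load and hint the CPU on every iteration.
            while self.locked.load(Ordering::Relaxed) {
                spin_loop(); // `pause` on x86/x86-64
            }
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The hint only runs on the contended path, after the lock has been seen held, so an uncontended `lock()` never pays for it.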

10-cycles is probably a poor choice for modern processors. At ~3GHz, 10-cycles is 3.3 nanoseconds, which is way shorter than even L3 cache latency. So by the time a single pause instruction is done on older architectures, the L3 cache hasn't updated yet and everything is still locked!!

140-cycles is a bit on the long side, but we're also looking at a server-chip which might be dual socket. So if the processor is waiting for main-memory to update, then 140-cycles is reasonable (but really, it should be ~40 cycles so that it can coordinate over L3 cache if possible).
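
For reference, the cycle-to-time arithmetic behind those numbers, as a quick Rust sketch. The 3 GHz clock is an assumption (it is what makes 10 cycles come out to 3.3 ns), and `cycles_to_ns` is just a hypothetical helper name.

```rust
/// Convert a cycle count to nanoseconds at a given clock frequency.
/// One cycle at `clock_ghz` GHz lasts 1 / clock_ghz nanoseconds.
pub fn cycles_to_ns(cycles: u32, clock_ghz: f64) -> f64 {
    cycles as f64 / clock_ghz
}
```

At an assumed 3 GHz, `cycles_to_ns(10, 3.0)` is about 3.3 ns, `cycles_to_ns(40, 3.0)` is about 13.3 ns, and `cycles_to_ns(140, 3.0)` is about 46.7 ns.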

So I can see why Intel would increase the pause time above 10-cycles. But I'm unsure why Intel increased the timing beyond that. I'm guessing it has something to do with pipelining?

wildmusings · on June 19, 2018

Thank you! That’s very informative.

titzer · on June 18, 2018

The pause instruction is supposed to be used as a "yield hint" for programs to signal that they are waiting for a concurrent action in another CPU to unblock, e.g. the contended case in a spinlock implementation, where the current thread needs another thread to release the lock before it can make progress. In Skylake, they have changed the microarchitecture so that pause is a much stronger hint; executing a pause will "yield" the current hyperthread, allowing the other hyperthread on the same core to use more of the CPU's resources (ROB slots). This improves performance for well-tuned synchronization, but hurts patterns where pause is executed too often. Unrolling the spinlock loop would help in this case, probably.
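
One way to picture "executed too often": a spin count tuned for a ~10-cycle pause waits roughly 14x longer when each pause costs ~140 cycles. The Rust sketch below sizes the spin count from a cycle budget instead of hard-coding it. The 10 and 140 cycle figures are the ones quoted in this thread; the function names and the budget are hypothetical, and real code would have to detect or measure the pause cost rather than assume it.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};

/// How many pause iterations fit in a cycle budget (always at least one).
/// `pause_cycles` is an assumed per-pause cost, not a measured value.
pub fn pauses_for_budget(budget_cycles: u32, pause_cycles: u32) -> u32 {
    (budget_cycles / pause_cycles.max(1)).max(1)
}

/// Spin while `busy` is set, for at most `max_pauses` pause hints.
/// Returns true if `busy` was observed clear, false if the budget ran out
/// (the caller would then fall back to yielding or blocking).
pub fn spin_wait(busy: &AtomicBool, max_pauses: u32) -> bool {
    for _ in 0..max_pauses {
        if !busy.load(Ordering::Acquire) {
            return true;
        }
        spin_loop(); // `pause` on x86/x86-64
    }
    !busy.load(Ordering::Acquire)
}
```

With a 1400-cycle budget, `pauses_for_budget(1400, 10)` gives 140 iterations and `pauses_for_budget(1400, 140)` gives 10, so the total wait stays about the same on both microarchitectures.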

wildmusings · on June 19, 2018

Thanks! Very informative.

walshemj · on June 18, 2018

It's more a question of why their code was written that way.