The title is misleading — no code is broken, though it is true that .NET programs that exhibit lock contention will be slower.
The root of the problem is that .NET implemented a rather baroque spin loop with several assumptions about the CPU and OS architecture. Unlike any other spin loop I have seen or written, they execute multiple pause instructions per loop. Arguably this was a poor choice.
It would be interesting to see the history of this design. It was probably done years ago when .NET was closed source.
Much of the context is in dotnet/coreclr #13388 [1].
At the risk of repeating the content of the blog post ... The issue was reported in August, 2017 and fixed later in the same month. The fix shipped in a .NET Core patch update in September. We attempted to get the fix into .NET Framework 4.7.2 but missed that scheduling. We're now looking at making the fix in a .NET Framework patch release.
My experience is to the contrary. Every spin lock I have seen uses more than one pause instruction. Some may start with one and then execute more. They all involve a loop over pause instruction(s). I don't think that's the problem per se. The problem is more likely the back-off logic.
At least the Linux kernel spinlocks do a single pause instruction (which it calls cpu_relax()) before checking again to see whether the lock is free; it's a loop over the pause instruction, but the loop condition checks the lock value. I'd expect most spinlock implementations to do the same.
They basically all do that. I've seen some with multiple pause instructions in the loop body, and others with a switch or if/else if/else block where it starts off with one pause instruction, then goes to multiple, then moves to more drastic back-off techniques like giving up the thread time slice or micro sleeping.
And your claims are still completely unsupported, at least here in this discussion. You don't give even a single example of some other publicly known code that does what .NET does. And the way you formulated your previous claim appeared to claim that the Linux kernel contains such a kind of code. I'd still appreciate even a single verifiable example of all that "some" that you've apparently "seen".
Yes, but the Linux kernel is the one I'm more familiar with. So I looked for another example: glibc's pthread_mutex_lock. The code is at nptl/pthread_mutex_lock.c and the "pause" instruction here is called atomic_spin_nop() (or BUSY_WAIT_NOP on older versions), defined by sysdeps/x86_64/atomic-machine.h. Yet again, it does a single "pause" instruction in the loop, and the loop condition tests the lock.
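In rough shape, the pattern is this (an illustrative sketch, not the actual glibc or kernel source):

    #include <stdatomic.h>
    #include <immintrin.h>   /* _mm_pause() */

    /* Sketch only: one pause per iteration, and the loop re-checks the
     * lock itself, so the spin ends as soon as the holder releases it. */
    typedef struct { atomic_int locked; } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire)) {
            /* Lock was held: wait on a plain read, one pause per check. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                _mm_pause();
        }
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }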
That is, it checks the external condition after each pause. Which can't produce the problems that .NET produced.
And that's why I had expected eloff to give even one example that supports his claim, since what I can find is only what cesarb describes: "it's a loop over the pause instruction, but the loop condition checks the lock value".
The whole thread is about whether a loop containing pause, but without a check of the external condition (instead just a counter, like in the offending .NET example from the original article), exists in the Linux kernel. If somebody claims that the .NET implementation is not unique in its carelessness of using just a big counter loop over pause (which is what it appears to me to be: that .NET does something unexpected), (s)he should support that.
(That I ask for support for the "extraordinary claim" in a thread where only one side makes it should be understood, but it seems some here vote "down" without understanding what is being discussed at all -- it's absurd having to summarize the whole discussion in the reply to the "extraordinary claim" in which I ask for the evidence.)
The stuff he quoted seems to make a strong case that the behaviour they implement was the intended behaviour prior to Skylake; why else specifically call it out in the newer manuals?
Would be interested to know why you think the code was always broken. Perhaps Intel define a preferred algorithm they failed to implement?
Not the parent, but I do not see that Intel is wrong here.
New architecture makes new choices and, according to the redlined doc, they expect 1-2% performance improvements on threaded apps, which is a measurable improvement. They also explicitly call for possible degradation cases so folks who care can adjust.
.NET apparently hits such a degradation case, but parent says this is due to .NET implementing spin loops in a way (almost) no one else does. If so, in this particular case Intel is doing the greatest good for the greatest number, which is reasonable. My 2c.
I think the issue is not only the .NET spin loop implementation, but also that .NET in general is heavy on shared memory concurrency and required locking.
Examples: All the Concurrency primitives like Task<T> are thread-safe by default, and that has some associated cost. Other runtimes like javascript or other singlethreaded event loops don't have to pay for that cost, and might still be able to scale horizontally. Also the IO subsystem on .NET requires a lot of synchronization: Most IO callbacks (e.g. for sockets) will be made from an arbitrary Threadpool thread, which requires synchronization or marshaling to get back to another context. Again that will be cheaper on a runtime where the whole event sources and consumers are using the same thread.
I certainly can't quantify it in numbers, but I guess .NET has some of the highest requirements regarding synchronization performance. At least in the standard libs; there are obviously some efforts underway to improve that situation.
The Intel docs mention that PAUSE isn't the right choice for waits of "thousands of cycles or more" - even on the old architecture, 10ms is waaaaaay longer than recommended.
I also have to quibble with this conclusion:
> This is not a .NET issue. It affects all Spinlock implementations which use the pause instruction.
The instruction isn't the issue, the exponential backoff that .NET includes is. At the extreme (last iteration, on a 64-core CPU) it will spin for 1.2 million pause instructions...
The whole point of spinning is to avoid the expense of a context switch, where the thread is put to sleep and the OS schedules another. If you're going to spin for 10ms, that's longer than the timeslice most threads run for. That is going to be well above the overhead of any context switch. It makes no sense.
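For scale, a backoff of the shape described above looks roughly like this (illustrative only; the constants and structure are my sketch, not the actual .NET implementation):

    #include <stdatomic.h>
    #include <immintrin.h>

    /* Illustrative only -- not the actual .NET code or constants. The pause
     * count grows geometrically with each failed attempt, so the final
     * iterations burn huge numbers of pauses without ever re-checking the
     * lock in between. */
    #define MAX_BACKOFF 13          /* 100 << 13 is roughly a million pauses */

    static void backoff_lock(atomic_int *lock)
    {
        int iteration = 0;
        while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
            int spins = 100 << iteration;          /* 100, 200, 400, ... */
            for (int i = 0; i < spins; i++)
                _mm_pause();                       /* no lock check in here */
            if (iteration < MAX_BACKOFF)
                iteration++;
        }
    }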
> That is going to be well above the overhead of any context switch.
Is it though?
The "pause" instruction gives all resources to the hyperthread sibling. So under high-utilization (ie: your core has a hyperthread-sibling ready and willing), a "pause" instruction is basically a super-low latency task-switch.
I wouldn't be surprised that a "spinlock with pause" would be better, even in 5ms to 10ms chunks of time (Windows has 15ms quanta, and may task switch as late as 30ms to 75ms depending on the situation).
Because it's better to do the hardware-assisted "pause" task switch rather than actually cause the OS-scheduler to get involved. In fact, that's what Microsoft AND Intel have concluded!! To get better real-world metrics, both Microsoft AND Intel felt like the "pause" instruction should wait longer! Skylake-X increases the latency, while .NET created this exponential-backoff.
I bet it was good in isolation, but when combined, the overall effect was too much pausing.
The .NET approach only makes sense if you have exactly the same number of active threads as CPU "cores" and if your CPU always supports hyperthreading. If either of those is not true then what .NET is doing is a severe CPU bottleneck when the lock is actually contended.
And it is incredibly unlikely that that exact situation happens in the wild. You almost always have more than <cpu core count> threads/processes active in heavy multithreading workloads, and you definitely can't assume everyone has hyperthreading as that's a feature commonly cut from mid & low range CPUs.
For context here a thread context switch via something like futex is in the sub-10us range. So a 5ms spinlock robs you of ~4.9ms of usable CPU throughput that a different thread/process could have been doing instead. The timeout levels here are just shy of an eternity. 5ms is a really, really long amount of time.
By way of comparison, pthread_mutex_lock in glibc will do one pause per attempt for at most 100 attempts. So under Skylake it has a max pause duration of around 3-5us, depending on clock speed, which is pretty reasonable. That roughly aligns with the cost of a context switch, so once spinning would last longer than a context switch, it falls back to an actual context switch instead.
Giving resources to the sibling hyperthread via pause is not the kind of context switch I'm talking about. I'm talking about the expensive context switch when the OS scheduler must swap out a running thread for another. That carries a steep obvious penalty switching between user space and the kernel and back, but actually can involve much heavier penalties where the cache and TLB are concerned - depending what thread runs next.
Pausing for as much as 10ms could be better under some circumstances, but that would seem to me to be heavier than a context-switch. Whether the system actually achieves better throughput or not in that case would be workload dependent. Worst case the hyperthread is also running the spin lock loop and no useful work at all is accomplished on the core.
> Giving resources to the sibling hyperthread via pause is not the kind of context switch I'm talking about.
Isn't this entire thread about "pause"?? Anyway, I think we both understand the difference between "context switch" and "pause" here.
> Pausing for as much as 10ms could be better under some circumstances, but that would seem to me to be heavier than a context-switch. Whether the system actually achieves better throughput or not in that case would be workload dependent.
Seems reasonable. Ideally, threads / cores shouldn't be blocked that long, so 10ms+ blocked periods are definitely an anomaly.
But whether or not a context switch is actually worth it? Ehhh, it's hard to say for sure.
Right. The whole time I was thinking the hardware change probably exacerbated an existing software issue. Then I saw this:
"The issue has been fixed with .NET Core 2.1 and .NET Framework 4.8 Preview contains also the fixes for it."
And by the end of the article I was thinking their code might just run faster as a single thread on a ~4GHz processor.
Another thought was why do they have so much resource contention in the first place? That seems fishy.
Another thought: Perhaps they should write it in Rust with all of its "fearless concurrency" claims. While I'm skeptical of those claims, the exercise would probably highlight some of the real issues.
Any way I look at it, I keep thinking it's a software design problem.
Rust doesn't magically fix resource contention. It just provides safe concurrency primitives (and the surrounding language semantics) that make it difficult-to-impossible to misuse in such a way as to cause a deadlock or race condition.
> difficult-to-impossible to misuse in such a way as to cause a deadlock or race condition
Rust doesn't solve deadlocks or race conditions. Rust only makes it hard to accidentally share a memory location, but that's only a very small part of the problem; race conditions are a fundamental part of computer science, baked into any language that supports expressing and joining asynchronous processes.
I would be generous in agreeing with a statement about Rust making it a little more difficult to end up with a deadlock or unwanted race condition, except that's not true when compared with many other high level languages.
Also nothing is impossible until you prove it via formal methods (e.g. TLA+).
> Rust only makes it hard to accidentally share a memory location,
Not just a memory location, any value. You can use the tools Rust gives you on any piece of state you care to manage - a database connection, a remote server, the office coffee machine - represent it as a value and Rust will let you check that it's owned where you want it to be owned, borrowed where you want it to be borrowed and so on.
> Also nothing is impossible until you prove it via formal methods (e.g. TLA+).
What's the distinction you're drawing between something like TLA+ and something like the Rust borrow checker?
> I would be generous in agreeing with a statement about Rust making it a little more difficult to end up with a deadlock or unwanted race condition, except that's not true when compared with many other high level languages.
It is easy to specifically describe what Rust prevents: unsynchronized memory access from multiple threads. What other high-level languages allow read-write shared-memory multithreading and guarantee that?
> Also nothing is impossible until you prove it via formal methods (e.g. TLA+).
The fact that unsynchronized memory access is impossible in safe Rust (modulo compiler bugs, CPU bugs, etc.) is something that nobody seriously doubts. It's obvious. I'm glad people are working on proving it, but the proof is an academic exercise.
> It is easy to specifically describe what Rust prevents: unsynchronized memory access from multiple threads. What other high-level languages allow read-write shared-memory multithreading and guarantee that?
No, it's something that Pony relatively explicitly doesn't allow: AIUI, iso (non-aliasable, read/write), val (aliasable, read-only, no write) and tag (aliasable, no read or write) are the only things that can be shared between actors/threads.
You're right about deadlocks and general race conditions. I was confused. Rust only (statically) prevents data races (arguably the only form of race condition that is always undesirable). Rust does nothing to prevent deadlocks.
The advantage of Rust over, e.g. Java, then, is that Rust can prevent data races at compile time?
If I remember correctly, Rust doesn't help with deadlocks - you still have to be careful of them. Though it does have primitives to help significantly with race conditions.
The primitives help with taking the same lock around every access to a location. They don't help in finding locations that should use the same lock; you can easily write Rust code that uses too many separate locks and as such can still have race conditions.
No, it looks like people exaggerating to me. I don't doubt there is some merit to the safety claims made about Rust, but many people have a tendency toward unwarranted idealization. I admitted that I have not learned Rust yet which should add doubt to my own skepticism.
The project I'm thinking of trying with Rust looks like it will be predominantly Unsafe Rust - using a lot of structures that are multiply referenced by necessity (or not?). I'm doing a lot of reading and thinking prior to starting because I really want the benefit from Rust as well as changing my thinking if there are good changes to be made there. Using an array of items and having them reference each other by index should satisfy the Rust compiler, but that doesn't actually solve the potential problems of trying to traverse (update) a complex data structure with parallel threads. The challenge of doing this suggests to me that the claims are overblown. A modestly complex data structure seems to require disabling all that safety but I am still learning and contemplating different approaches. Questioning basic assumptions seems prudent. I want to believe, but at this point I'm not convinced.
You might have missed that I was responding to a (flagged) response to you talking about "astroturfers" and not to you directly. Or I'm misreading your response?
You're suggesting, then, that the spinlock implementation is not merely slower on some CPUs but actually _incorrect_? It seems more likely that your software had some combination of a deadlock and a race condition exposed by different lock timing -- inserting delays in execution is one way to expand race windows to reproduce these kinds of defects.
Yes I am suggesting the implementation is incorrect. We had a standalone test case for this which could reproduce the condition after a random wait of 1-45 minutes. I will dig it out and raise it against .Net Core if it's still valid there.
The problem was GC-initiated, i.e. the entry into GC caused the deadlock in the first place, and the process never recovered from it because it was spinlocking all threads and no resources were available to complete the GC.
The object code in question uses an instruction that's basically a microarchitecture-specific hack. IMO the real issue here is that the object code is optimized for an earlier microarchitecture and needs an update.
Disclaimer: I work for Intel, but not on their CPUs, and I'm not a fanboi.
What amazes me is that they do not test this extremely specific uarch code when a new uarch is released.
On the other hand, I think the hack is not in using the pause instruction itself, but that crazy backoff algo. For example Linux uses pause too (via rep; nop) in lots of places. But not in an exponential loop (that I'm aware of).
Unsure what I'm more astonished about: that the correct case is that it would spin 29ms, or that a CPU instruction has changed its expected latency an order of magnitude upwards.
I would expect any spinning lock to spin at most some slight fraction of the scheduling granularity, but almost always far less than the context switch time except maybe in some specific case of latency sensitive code.
Maybe the long spin time is for some reason trying to account for the very long scheduling quanta of Windows Server?
Superficially it seems misdirected in anything but real-time code, since most schedulers would allow the thread to keep its quantum until the next scheduling if the thread wait time was low. The only case where I can see this not holding is a system that is overloaded and all CPUs have multiple threads that are all starved.
"Pause" spins give all their resources to the hyperthread-sibling. So unless the hyperthread-sibling is also stalled out for some reason, the "pause" instruction is definitely preferred.
So its not like resources are lost (at least, under high CPU utilization). But still, you're right in that a 29ms spin-lock sounds outrageous to me. That's definitely long.
Pause can be the best choice even without hyperthreading because it can lower power utilization. And even on desktop that can mean less heat and more time for other stuff to run at a higher-clocked 'Turbo-boost'.
There is another architectural change made in Skylake which is related to this article. In Skylake the L3 cache is now non-inclusive, which means that a core trying to acquire the lock will have to read the cache line through a cross-core cache read. Making things worse, when contention is high the cache line will frequently be dirty. The latency for a cross-core dirty hit of the L1/L2 cache is documented to be 60 cycles, which is much higher than the L3 cache latency of past architectures. This will increase overall contention, which will magnify the frequency the pause instruction is hit. I'm wondering if the pause instruction cycle count was increased to reduce the impact of cross-core cache reads.
But wouldn't that use a non-spinning OS lock? I.e. the scheduler knows which threads are waiting on the accept and can randomly pick one to directly wake up?
If your design has multiple threads waiting for accept on the same socket, then leaving it to the OS to schedule said threads on accept should at least be the first thing to try. If nothing else because it's the least surprising thing to do.
The parent context asked what code could have 40 cores spinning, I picked the first example I could think of. You said it wasn't a spinlock, it was. Sure there is a followup lock that sleeps for the empty-queue case, but that doesn't relate to OP's question
You can find similar behaviour in the filesystem APIs, anything touching the VM (mmap_sem IIRC is also a spinlock), pretty much any OS API where threads are all banging at the same shared resource that isn't expected to require a long wait. struct file also contains a spinlock, but doesn't look like it's used in normal operation
> The parent context asked what code could have 40 cores spinning, I picked the first example I could think of.
Hmmm... okay. I think I see what you're going for. I think it's a bit of a muddy example though because of the mutex.
But pre-mutex, there's definitely a spinlock, and the 40-cores would definitely hit that first. Mutexes themselves are likely a spinlock as well, so there's a chance (a low-chance... but a chance nonetheless) that everything is fast-pathed and never sleeps a thread.
So yeah, there are better examples you could use. But upon further analysis, it does seem like your example does work with the right frame of mind.
Does textbook equate to real-world, though? Don't real world applications either use a thread per socket (i.e. per TCP stream) or something like epoll?
Real world implementations would use OS-specific socketopts to create multiple listen sockets sharing the same port and then use one socket per thread.
You're among the 1% of engineers who both care enough to know and actually know about these kinds of facilities. For the rest of them, whatever they got at school or off Stack Overflow is as far as it goes, unless some business priority demands something different - and even in that case "throw more hardware at it" is often the route taken
So actually, you're about right. It's "wrong" to spinlock above 40 cores. But at 40 cores or less, you probably should prefer spinlocks over atomics or other primitives on the x86 platform.
As for what? Well, you need a synchronization primitive for a producer-consumer queue. A spinlock would be a good choice for that, even on 40+ core machines (since most of those cores wouldn't be spinning at the same time, even if they're all locking the same lock).
Remember, the x86 primitive for synchronization is the "LOCK" prefix. When an x86 core "LOCK"s an instruction, no other core is allowed to touch the data it's touching. (In the MESI protocol: the cache line goes into the "exclusive" state). As such, spinlocks match the fundamental nature of the x86 instruction set very very well.
This seems extremely suspicious. Of course it depends on the nature of the data structure, but in the presentation he compares an atomic increment with a non-atomic increment guarded by a spinlock and comes to the conclusion that the latter is faster, at least on one micro-architecture.
Can anybody actually explain this result? The spinlock needs at least one atomic operation and a write (for lock + unlock) and one non-atomic read-modify-write per iteration while the atomic increment needs only a single atomic operation per iteration and no other memory access. How on Earth can the latter variant be slower?
The only thing I can think of right now is that due to some micro-architectural idiosyncrasies, the relevant cache lines end up migrating between cores more often in the atomic increment case, while in the spinlock case, the same core ends up winning the lock back-to-back more often (and so cache lines migrate less frequently, which reduces the overall running time).
That seems like it would most likely only ever be relevant to this kind of extremely micro benchmark, but I'd be interested to hear if there are other explanations (I haven't watched the whole talk, but at least the slides don't seem to make an attempt to explain this).
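For concreteness, the two variants under discussion are roughly the following (my sketch using GCC builtins, not the presenter's actual benchmark code):

    /* Sketch of the two variants being compared, not the presenter's code. */
    long counter;
    int  lock_word;      /* 0 = free, 1 = held */

    /* Variant A: a single atomic RMW per iteration ("lock xadd" / "lock inc"). */
    static void inc_atomic(void)
    {
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    }

    /* Variant B: spinlock around a plain increment -- one atomic RMW (xchg)
     * to take the lock, the ordinary increment, then a release store. */
    static void inc_spinlocked(void)
    {
        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE))
            ;                        /* spin until the lock looks free */
        counter++;                   /* plain, non-atomic increment */
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);
    }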
In x86, ALL synchronization primitives are "LOCK" prefixed. That is: for the FULL read-modify-write, the cache line is held in exclusive mode (or equivalent). It's just how the assembly code works.
Furthermore, ALL loads / stores are almost strictly ordered on x86.
> Intel 64 memory ordering obeys the following principles:
> 1. Loads are not reordered with other loads.
> 2. Stores are not reordered with other stores.
> 3. Stores are not reordered with older loads.
> 4. Loads may be reordered with older stores to different locations but not with older stores to the same location.
> 5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).
> 6. In a multiprocessor system, stores to the same location have a total order.
> 7. In a multiprocessor system, locked instructions have a total order.
> 8. Loads and stores are not reordered with locked instructions.
As such, "relaxed atomics" DO NOT EXIST on x86. Almost everything on x86 innately has acquire / release semantics (even non-locked stuff).
------------
As such, the optimal spinlock-unlock implementation on x86 doesn't use the LOCK prefix at all! You rely upon the memory-ordering consistently guaranteed by the x86 processor, and let the cache-coherence mechanism handle everything for you.
You'll still need the "lock" to obtain the lock (writing 1 into the memory location AND reading the old value). But unlocking the lock (writing 0 into the memory location) can be done x86-LOCK-prefix free.
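In compiler terms, that's an acquire exchange to take the lock and a plain release store to drop it; a sketch with GCC builtins:

    #include <immintrin.h>   /* _mm_pause() */

    /* Sketch: the acquire side uses xchg (implicitly LOCKed on x86); the
     * release side is an ordinary store, which compiles to a plain "mov"
     * with no LOCK prefix, because x86 stores already have release semantics. */
    static int lock_word;    /* 0 = free, 1 = held */

    static void take(void)
    {
        while (__atomic_exchange_n(&lock_word, 1, __ATOMIC_ACQUIRE))
            _mm_pause();     /* held by someone else; hint the spin-wait */
    }

    static void drop(void)
    {
        __atomic_store_n(&lock_word, 0, __ATOMIC_RELEASE);   /* plain mov */
    }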
In contrast, a CAS-atomic on x86 needs to "LOCK" the memory bus, which under x86 is specified as:
> Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
So when you do a CAS-atomic on x86, in REALITY, the LOCK-prefix will be asserted, locking out all other cores from interacting with your cache line.
------------
So the cache-coherence mechanism on x86 seems sufficiently robust and well optimized by Intel / AMD, at least with our current 28-core Xeons and 32-core EPYCS. Future scaling may be an issue for Intel/AMD, and ARM / Power9 (which have true relaxed memory) may scale better eventually.
> So when you do a CAS-atomic on x86, in REALITY, the LOCK-prefix will be asserted, locking out all other cores from interacting with your cache line.
Atomic RMWs on Intel (and really the vast majority of high performance architectures) have no effect outside of the core issuing them except in some very exotic circumstances. The LOCK signal is a relic of the pre-P6 past when everything communicated via a single bus.
What prevents a cache line from being 'interacted' with, is being held in exclusive mode. This is a property of any write, atomic or not. An RMW won't hold a cacheline significantly longer than a normal write would.
> As such, the optimal spinlock implementation on x86 doesn't use the LOCK prefix at all! You rely upon the memory-ordering consistently guaranteed by the x86 processor, and let the cache-coherence mechanism handle everything for you.
That's just not true. Spinlocks are not magic; they need to be implemented, and all the implementations I know of use a (atomic and therefore locked!) cmpxchg.
Which makes sense. Spinlocks are a synchronization primitive, and you wrote yourself that ALL synchronization primitives on x86 are "LOCK" prefixed.
So let's rephrase the original question: why should a locked cmpxchg + additional memory accesses be more efficient than a single locked increment?
You're right and you responded before I realized my herp-derp and made an edit. Sorry for not catching it in time!
> So let's rephrase the original question: why should a locked cmpxchg + additional memory accesses be more efficient than a single locked increment?
The ideal spinlock on x86 isn't a cmpxchg btw. It's a pure locked-swap.
And I think there-in lies the answer. A locked-swap on x86 can theoretically execute faster (it's a pure write and a pure read to the cache), while the uOps generated for a cmpxchg would include a comparison.
From a cache-coherency perspective: the pure-write of a locked-swap can be implemented by invalidating the caches of every other core. So there's a "race" between cores to send out the "invalidate cache" message to everybody. Everyone who loses to the invalidate-message needs to stall and re-read the new value.
But otherwise, the "winner" of locked-swap executes incredibly efficiently. The winner instantly transitions into the "Owner" state and continues to execute.
But a cmpxchg is a far more complicated order of events. I'd imagine that the underlying, undocumented, cache-coherence mechanisms struggle under a LOCKed cmpxchg (which is far more common with atomic-based lockless programming).
Remember, the "LOCK" prefix is implemented by the MESI protocol (or something more complicated, like MESIF on Intel or MOESI on AMD) in reality. Even then, modern processors probably implement something far more complicated, and it's unfortunate that it's undocumented.
Still, LOCK swap is clearly more efficient under MESI than LOCK cmpxchg.
---------
A locked-increment would probably be roughly the same speed as a cmpxchg btw. Both a locked-increment and locked-cmpxchg require a full read/modify/write cycle.
While a locked-swap is just a read / invalidate+write instead.
IIRC the reason that xchg is faster than a CAS is that the idiomatic CAS code would read the current value of the cell and then execute a CAS. This causes the cacheline to first be brought in shared mode, and then to be upgraded to exclusive, requiring an additional coherency transaction. A plain exchange would simply bring the cacheline in exclusive mode directly. If the lock is uncontended this can be noticeably faster. But note that this is not a property of the operation itself but simply of the surrounding code: issuing a blind cas with a speculative initial value can have similar performance characteristics to an xchg.
On the other hand, if the spin lock is contended, avoiding unconditionally acquiring the cacheline in exclusive mode is beneficial as it will mitigate cacheline ping-pong. Thus the test-and-test-and-set idiom.
On the third hand, if the lock is contended, a spin lock is probably inappropriate...
Also I want to mention that CAS can fail, while xchg always succeeds, but the performance implications are not clear cut: for example a spin lock will issue xchg in a loop (failure is encoded at a higher level), while wait-free algos can issue cas outside of loops, as a failed cas implies that another cas succeeded and thus carries information.
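To put the two acquisition idioms side by side (a sketch with GCC builtins, not any particular library's code):

    /* Sketch: the "read first" idiom pulls the cacheline in shared state and
     * then has to upgrade it to exclusive; the blind xchg requests it in
     * exclusive state right away. */
    static int try_lock_read_then_cas(int *l)
    {
        int cur = __atomic_load_n(l, __ATOMIC_RELAXED);    /* line arrives shared */
        return cur == 0 &&
               __atomic_compare_exchange_n(l, &cur, 1, 0,  /* strong CAS */
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
    }

    static int try_lock_blind_xchg(int *l)
    {
        return __atomic_exchange_n(l, 1, __ATOMIC_ACQUIRE) == 0;  /* arrives exclusive */
    }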
> while wait-free algos can issue cas outside of loops, as a failed cas implies that another cas succeeded and thus carries information.
You almost had a perfect post IMO :-) But this last line is incorrect on some systems.
x86 CAS seems to be strong as you imply. But apparently, Power9 or ARM CAS can have spurious failures. Or more appropriately, Power9 and ARM implement "Load Linked / Store Conditional", which can have spurious failures, and thus compare-and-swap (at a higher-level, like in C++ or C) will look like a CAS failed spuriously.
Which is why atomic_compare_exchange_weak and atomic_compare_exchange_strong exists. Weak doesn't "hold information" due to the potential of spurious failures, but can be a more efficient implementation in the case of LL/SC instructions.
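In C11 terms the weak/strong distinction looks like this (a sketch):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Sketch: the weak form may fail spuriously (it maps well onto LL/SC),
     * so it only makes sense inside a retry loop; the strong form fails only
     * if the value really differed, so a single failed call carries
     * information. */
    static void add_one(atomic_int *v)
    {
        int old = atomic_load_explicit(v, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   v, &old, old + 1,
                   memory_order_acq_rel, memory_order_relaxed)) {
            /* 'old' was reloaded with the current value; just retry. */
        }
    }

    static bool try_claim(atomic_int *flag)
    {
        int expected = 0;
        /* On failure we know another thread genuinely set the flag first. */
        return atomic_compare_exchange_strong_explicit(
                   flag, &expected, 1,
                   memory_order_acq_rel, memory_order_relaxed);
    }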
Yes I was specifically talking about x86 CAS which is wait free.
Still (but don't quote me on this because I'm no Power expert), IIRC Power actually does guarantee, via idiom recognition, that specific uses of LL/SC are wait free by downgrading it internally to an actual CAS after a number of failures, so a compare_exchange_strong which loops a bounded number of times can be implemented on this architecture.
Okay, it's a fair point that '(lock) xchg' can potentially be faster than 'lock cmpxchg'. I wouldn't say it's really an ideal spinlock, because under real loads there are other considerations. That's why the Linux kernel uses queued spinlocks, for example.
For various reasons I still suspect that some other factor is at play, but I also don't really know how to tease that out. Maybe there are some performance counters one could look at, but I'm just speculating.
> In x86, ALL synchronization primitives are "LOCK" prefixed. That is: for the FULL read-modify-write, the cache line is held in exclusive mode (or equivalent). It's just how the assembly code works.
I'm okay with your conclusion, but your first sentence is incorrect. The XCHG instruction is fully atomic and sequentially consistent even without a LOCK prefix. (This isn't to say x86 has any particular magic -- XCHG with a memory operand is just automatically LOCKed.) Also, ordinary stores are releases, and are thus quite useful as synchronization primitives, without any LOCK prefix at all.
On the contrary: Spinlocks are faster than any other lock. In fact, they're faster than many atomics-based lock-free datastructures (!!!) in many cases.
Atomics-based lock-free data-structures scale beyond 40+ core counts. But note that x86's "lock" instruction is fundamental to the system. x86 doesn't have relaxed atomics or barriers under most circumstances. As such, the spinlock best represents what the x86 instruction set offers.
TL;DR: There exists a "pause" CPU instruction, which signals to the CPU that the code is in a spinlock wait loop. In Skylake this instruction sleeps longer than before. If your code heavily uses this, then it'll wait longer for things like locks to be released.
I agree that's also interesting, but in a discussion with such a clickbaity title it's important to correctly identify the source of the so-called problem right upfront, just to avoid wild speculations and conspiracies.
From the screenshot of the Intel document linked:
> The increased latency ... has a small positive performance impact of 1-2% on highly threaded applications.
Making an instruction more than 10x slower overall, for a 1-2% gain (who wants to bet it's some "industry standard" benchmark...?) in a very specific situation sounds like a textbook example of premature microoptimisation. Of course they don't want to give the impression that they severely slowed things down, but the wording of the last sentence is quite amusing: "some performance loss"? More like "a lot". It reminds me of what they said about the Spectre/Meltdown workarounds.
> a textbook example of premature microoptimisation
No, this is textbook optimization
Benchmarks showed a positive gain -> that's a real optimization, not premature at all.
Premature optimization is when you guess than a change will make things better without measuring. As soon as you measure it becomes real optimization work, not premature at all. As in, this is how you're supposed to make things better.
Feel free to debate the merits of their benchmark, but this is a textbook example of Doing Performance Correctly otherwise.
But this gain of 1-2% is balanced against the author's 50% performance loss. I guess it means that the benchmark results here depend on the choice of tested applications. If they had used more multithreaded .NET applications, the tests could have shown a performance degradation as well.
No, because the only use of the instruction is to fix a latency "bug" on spin loop exit. PAUSE (which is actually encoded as "REP NOP" -- the instruction itself is sort of a backronym) exists as a hint to the processor that the loop exit will be speculated wrong, so predict it the opposite way.
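For the curious, the two spellings really are the same bytes (F3 90), which is why a cpu_relax-style helper can be written either way:

    /* PAUSE is encoded as a REP prefix on NOP (F3 90), so "rep; nop" and
     * "pause" assemble to the same instruction. */
    static inline void cpu_relax(void)
    {
        __asm__ __volatile__("rep; nop" ::: "memory");   /* same as "pause" */
    }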
If you are spending even a tiny fraction of 1% of your time spinning, your app has terrible performance bugs already.
> If you are spending even a tiny fraction of 1% of your time spinning, your app has terrible performance bugs already.
Not necessarily. Very latency-sensitive applications may pin a thread to a CPU core and mark the core as reserved so the kernel doesn't schedule any other task or interrupt on it. Busy-waiting is used instead of yielding to the scheduler, that way context switches can be avoided almost entirely.
I guess. Normal RTOS applications with needs like that just do the work in an ISR. But sure, granted. Those aren't applications that are going to be running on Skylake though.
The bottom line is that this is an architecture optimized for compute throughput, and they figured out that if they spent a few cycles on spin exit they could more than make it back via more effective hyperthread dispatch. And for everyone but .NET, that seems to be the case.
Why would latency sensitive applications (let's say a millisecond of jitter) be using .NET and its runtime?
I thought the VM class of languages were generally inappropriate for real-time work. Heck, even in C, you need to do specific work on the memory allocator to make real-time viable.
Highly threaded applications are not “a very specific situation”. This is a huge class of important applications. This describes most popular services.
It’s not “premature optimization” to gain 1-2% for likely trillions of application executions.
On the other hand, if 99% of highly threaded applications run 1% faster and 1% of applications run 10 times slower, maybe it's worth it?
Maybe??
Tbh I think this is more of a sign that Intel's architecture, the Core i generation of CPUs, has been optimized about as much as it reasonably can be, and Intel is squeezing out a few more CPU cycles here and there.
However, as mentioned in one of the SO replies, you still need to make sure that you don't use locks that use loops on calls like this when those locks will have to wait on anything long-running - you're just going to burn CPU/battery needlessly. So, only use such locks when you know that the protected code doesn't involve any long wait states, or modify the code to back out even further into a proper kernel wait after looping X times. Personally, I agree with second SO answer and don't like such constructs, and would personally say just go with a straight-up kernel wait if there's any question about whether such conditions could exist now or in the future.
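On Linux, that "loop X times, then do a proper kernel wait" shape looks roughly like the classic futex lock from Drepper's "Futexes Are Tricky" (a sketch; the futex() wrapper, SPIN_LIMIT, and the exact state protocol here are my illustration, not any particular library's code):

    #include <stdatomic.h>
    #include <immintrin.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* States: 0 = unlocked, 1 = locked (no waiters), 2 = locked (maybe waiters). */
    static long futex(atomic_int *addr, int op, int val)
    {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    #define SPIN_LIMIT 100

    static void lock(atomic_int *f)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(f, &expected, 1))
            return;                               /* fast path: uncontended */

        for (int i = 0; i < SPIN_LIMIT; i++) {    /* brief spin, one pause each */
            expected = 0;
            if (atomic_compare_exchange_strong(f, &expected, 1))
                return;
            _mm_pause();
        }

        /* Slow path: mark "maybe waiters" and sleep in the kernel until woken. */
        while (atomic_exchange(f, 2) != 0)
            futex(f, FUTEX_WAIT, 2);
    }

    static void unlock(atomic_int *f)
    {
        if (atomic_exchange(f, 0) == 2)           /* a waiter may be sleeping */
            futex(f, FUTEX_WAKE, 1);
    }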
Well, first you have to understand exactly what "Pause" is.
"Pause" is a hint to the CPU that the current thread is spinning in a spinlock. You've ALREADY have tested the lock, but it was being held by some other processor. So why do you care about latency? In fact, you probably want to free up more processor resources as much as possible.
Indeed, there aren't actually any resources wasted when you do a "pause" instruction. In a highly-threaded environment, the hyperthread-brother of the thread picks up the slack (you give all your resources to that thread, so it executes quicker).
10-cycles is probably a poor choice for modern processors. 10-cycles is 3.3 nanoseconds, which is way faster than even the L3 cache. So by the time a single pause instruction is done on older architectures, the L3 cache hasn't updated yet and everything is still locked!!
140-cycles is a bit on the long side, but we're also looking at a server-chip which might be dual socket. So if the processor is waiting for main-memory to update, then 140-cycles is reasonable (but really, it should be ~40 cycles so that it can coordinate over L3 cache if possible).
So I can see why Intel would increase the pause time above 10-cycles. But I'm unsure why Intel increased the timing beyond that. I'm guessing it has something to do with pipelining?
The pause instruction is supposed to be used as a "yield hint" for programs to signal that they are waiting for a concurrent action in another CPU to unblock, e.g. the contended case in a spinlock implementation, where the current thread needs another thread to release the lock before it can make progress. In Skylake, they have changed the microarchitecture so that pause is a much stronger hint; executing a pause will "yield" the current hyperthread, allowing the other hyperthread on the same core to use more of the CPU's resources (ROB slots). This improves performance for well-tuned synchronization, but hurts patterns where pause is executed too often. Unrolling the spinlock loop would help in this case, probably.
2 elements are fuzzy:
- clock speed is way different: 3.5GHz vs 2.6GHz. This can lead to huge differences in some computations, and no new instruction set can compete against brute force.
- changing between those 2 CPUs is not just a socket swap but a full platform change: can we get the other details of the 2 platforms?
This is a kind of bug that can be difficult to reproduce. Imagine if someone reports this bug to you but your own tests show that nothing has changed (because you use an older CPU).
AMD is not trying to "disrupt Intel's model", it's trying to outcompete them on mostly the same model.
I have no idea what "disrupting Intel's model" would actually mean. But I would be fine if Intel is just outcompeted at their own game, and this already seems to be happening.
Although I think Intel derives substantial value from x86 lockin, I don't think "doing things like Intel but with a different CPU architecture" really qualifies as a disruption of Intel's model.
I'd say disruption of Intel's model would mean different ways of selling CPUs, fast storage or other silicon. All I see is Intel's competitors trying to mimic Intel to some extent because the enterprise market is the most lucrative. AMD, Qualcomm, Cavium and IBM all consider Intel a main and direct competitor. So I'd be interested if anyone has an idea on disrupting Intel's way of doing business.
To that point, cloud is probably the biggest disruption.
It may run on Intel now, but once "runs on AWS/GCP/Azure" becomes more important than running on an ISA, there's no reason compute couldn't transition much faster.
Note that Power9 and ARM both have support for relaxed atomics, which would be a substantial difference from x86's stronger cache coherence policies.
That would likely make "properly coded" Power9 or ARM synchronization more scalable. Possibly a major advantage as 32-cores and higher are beginning to hit mainstream (or at least, high-end mainstream).
AMD has had better chips at a lower price point since always.
You can get an 8-year-old octocore that still destroys anything Intel has put out recently.
Intel relies on inertia. AMD consistently beats them on actual performance at a lower price. I remember my 386 DX2 had better performance than an Intel 486. And let's not forget the P4 issues with their garbage RAM. Plus they backdoor their chips with stupid insecure crap
I'm not sure where you get this impression, but Intel held the performance crown over AMD since releasing the Core architecture. In fact, things were looking pretty dire for AMD during the Phenom days when their highest-tier offering could only match mid-tier Intel chips in performance, so they had to compete solely on price. [1] AMD's stock price during this era was the lowest it's been since like 1980.
It's only with Ryzen that AMD reached performance parity with Intel once again.
The Core 2 Duo was crap, the celeries were mediocre, the P4s in combo with RDRAM were junk and a total failure with shitty performance. AMD really should have held the crown during those years, so I don't know why you think Intel was a consistent winner. Most of the chips they made from 2000-2010 were less than great. Intel outright didn't have a good chip for a very long time. Like I previously stated, you can buy an octocore and that still beats most Intel offerings for far less. If anything Intel is mid tier with a huge markup on price.
Stock price doesn't determine CPU quality. AMD stock is a whipping boy for shorting..and an easy way to profit on when those professional investors are constantly wrong. (I've invested in AMD over the last 20 years and there is an extremely predictable cycle in stock price. though just a disclaimer, I currently do not hold any AMD stock)
> so I don't know why you think Intel was a consistent winner.
Because I read the benchmarks. Intel was winning by such a huge margin that AMD nearly bankrupted themselves trying to compete on price. Intel didn't even need to dip into their margins at all because AMD was several generations behind.
Intel beat AMD to 32nm fabrications by over a year, with the gap widening over the next five. Intel got to 14nm in 2015 while AMD just got there last year with Ryzen.
> Like I previously stated you can buy an octocore and that's still beats most Intel offerings for far less
BS. Nothing pre-Zen was competitive with contemporary Intel offerings, much less the latest offerings. As I pointed out, AMD was a generation or two behind Intel for this entire decade.
Yeah, you can still buy these six+ year old CPUs for $90, but, as that review shows, they were barely competitive with 2012 Intel processors. Much less, competitive with CPUs that are two generations newer.
> Stock price doesn't determine CPU quality.
It's a measure of how well a company is doing.
The original Phenom had a major CPU bug that hit once they finally began to gain market-share in the server market. This cost AMD pretty heavily and forced them to cut R&D budgets, which, as stated, put AMD generations behind Intel in tech. This led to a slashing of margins to compete, which put them even further behind Intel.
Oh yeah, and somewhere during all that, they were the target of class action lawsuits for over-stated performance.
This doesn't even get into the Radeon lineup, which totally missed the AI/ML craze that sent nVidia profits to the sky.
I agree with your post, but I did want to point out...
> This doesn't even get into the Radeon lineup, which totally missed the AI/ML craze that sent nVidia profits to the sky.
On the other hand, they scored well on the cryptomining craze. If it wasn't for the fact that an R9 290 was significantly faster at mining than the GeForce 980 GTX, I'm not sure if AMD would still be making GPUs.
On performance per watt and cost versus performance, AMD has not been consistently beating them. There is a reason multiple companies continue to buy Intel after reviewing AMD's current offerings every few years (I'm in one of them, which shall remain nameless).
I absolutely don't believe an 8 year old AMD octacore will "destroy" anything intel has. Sure I imagine a single highly specific and biased test could give AMD the advantage in that one instance, but overall performance AMD has been behind since the pentium M/core2 with only a few products that initially compete or win, only to fall behind again.
Around the year 2000, Intel decided that the future is Itanium. It didn't matter that the market hated the idea - there was no alternative to Intel, so Intel completely ignored the market's plea for an x86 compatible 64-bit processor that would be able to run old x86 code at native speed.
Intel's 32-bit offering at the time wasn't any good either; the Pentium 4 had such a deep pipeline (and not-so-smart prefetch/speculation) that the chips ran at a high frequency and ate a lot of power, but pipeline stages were idling or working on useless stuff a large amount of the time.
Luckily for the world, AMD managed to produce that alternative, the AMD64 architecture, which kicked intel's ass so hard that eventually they had to adopt it.
If it weren't for AMD's AMD64, and their cross-licensing deal with Intel (that allowed them to make an x86 compatible chip without specifically asking for Intel's permission), we'd be in an Itanium dominated world right now (or .. ARM would have stepped up, it's hard to predict an alternate reality) but surely with less efficient and less effective processors.
History does not repeat, but it rhymes. Intel has no Itanic this time, but it did become numb enough to let AMD outmaneuver them again.
It's not that Intel is doing a bad job, as much as they have too little incentive to do a good job.
> If it weren't for AMD's AMD64 … we'd be in an Itanium dominated world right now
This is far from certain: Itanium shipped years later and much, much below the advertised performance thresholds. In the meantime even Intel’s own x86 had closed the raw performance gap, and since IA-64 was such a huge bet on brilliant compilers, the early benchmarks for most business code favored the P-III even before you factored in cost. Similarly, the original claim that it’d offer competitive x86 performance turned out to be half a decade behind by the time it shipped, so it wasn’t even remotely competitive for legacy code, and businesses back then had even more weight of code which was difficult to recompile.
It’s equally plausible to me that Intel would have continued to see strong x86 sales and weak demand for Itanium and ended up shipping their own 64-bit extension a few years later with less pressure from AMD simply because businesses weren’t keen on the “recompile everything, make a bunch of expensive optimizations, and it might be faster than the system it’s replacing”
> It’s equally plausible to me that Intel would have continued to see strong x86 sales and weak demand for Itanium and ended up shipping their own 64-bit extension a few years later with less pressure from AMD simply because businesses weren’t keen on the “recompile everything, make a bunch of expensive optimizations, and it might be faster than the system it’s replacing”
I agree it is plausible, but the "few" in "few years later" makes a huge difference. Itanium was not requested or driven by the market, not even wanted; it was dictated by Intel, who put billions into it. Without any market pressure, they would have likely taken their sweet time - possibly 10 or more years - to let Itanium mature. They actually took the writeoff for Itanium as late as they could - I would guess no one wants to own such a colossal failure, and that would have been true even if AMD wasn't around.
While IA-64 originated in HP, as another poster noted, it was in Intel's interest that HP would be out of the CPU market, and in that sense Itanium was a success -- it left Intel in a much stronger position in the long run, and if it wasn't for you meddling AMD kids, they would have gotten away with it in the short run as well.
It's an interesting what-if scenario: AMD put a ton of pressure on Intel but if they hadn't I wonder how Intel's internal politics would have gone as Itanium continued to fail while x86 continued to print money or whether that would have emboldened an outside competitor.
HP had ceded the CPU market but Power was competitive on both the desktop (Apple was shipping a lot of PowerPC) and high-end server (IBM, Bull, and Hitachi) markets, delivering quite competitive performance and mature 64-bit support, and there was still some chance left for SPARC, MIPS, or Alpha — and 3 of those 4 CPUs had native support in Windows. AMD publicly announced AMD 64 in 1999 and I'd be surprised if they hadn't had some early news before that time since IBM, Sun, and SGI all had Opteron servers on the market fairly early on. If that hadn't been looking likely, I wonder whether someone at IBM or Sun might have decided to invest more heavily since Intel was voluntarily giving up their biggest advantage by sacrificing backwards compatibility.
Agreed. I think people generally took the wrong lessons from IA-64 (which originated at HP).
A) Smart compilers are hard, expensive, and slow to develop
B) In 2001, source wasn't open enough for a new ISA to succeed. Not sure it would be now either (chicken / egg for big iron workloads: need marketshare for optimization, need optimization for marketshare)
C) Intel absolutely missed the "desktop memory needs are going to expand" trend
What do you think are the wrong lessons that people take from IA-64? This list seems to be what you think are the "right" lessons, two of which I agree with and one I do not:
A) That's still true. The best compilers for Itanium (and other VLIW architectures) are still bad and cannot make good use of the hardware. It's likely to stay that way a long time, unless programming languages and programmers adopt vector semantics; they haven't yet.
B) It's never a good time for a new ISA to succeed, except if it achieves something really novel (like GPUs for parallel processing, or ARM for power consumption). Things can succeed, but unseating an incumbent is exceptionally hard when the improvement is marginal if any; Itanium, if a super smart compiler had existed when it came out (they still don't ...), would have been a small improvement. It only made sense for Intel to bet on it the way they did because they were a monopoly (or so they believed).
C) I disagree; Intel didn't miss anything. They just thought they were able to dictate where the market is going. And they would have, except for AMD's guts to do AMD64. For a long time, people bought AMD64 to run 64-bit XP, so they could run with 4GB or even 8GB of RAM - even though every single process was 32-bit. But that feature was more-or-less supported also on 32-bit with bank switching and other memory extensions - the actual use of 64 bit programs (with more than 3GB per process) came much, much later.
The wrong lesson seems to be "Intel tried to foist something on the market that it didn't want, and AMD listened to the public."
Usually by the desktop gaming crowd, who can't define VLIW or discuss the relative merits of different microarchitecture choices.
On (A) and (B), I'll forgive Intel (and HP). Circa-1990, I can see it being much more debatable if making hardware (ILP) or software (VLIW) smarter were a more reliable path to performance. Especially with the market being smaller & more centralized (commercial compilers galore!).
I mean, hell, AMD made the same wrong call circa-2011 in their GPU arch, no? Thus one of the reasons they ceded ML to the simpler SIMD design Nvidia used, which was more amenable to creating, e.g., something like CUDA? (correct me if I'm wrong)
On (C), my impression was that bank switching and PAE were poorly supported by Windows at the desktop level. Which essentially made it a non-starter for the primary computer market at the time.
I don't think that lesson is wrong. I was involved with the semiconductor industry at the time, and that's how it looked from the inside (and it still does in retrospect).
AMD/ATI in 2011 was still focused on graphics, ignoring the GPGPU revolution (as it was still called back then). In many ways they still haven't managed to turn around.
> On (C), my impression was that bank switching and PAE were poorly supported by Windows at the desktop level. Which essentially made it a non-starter for the primary computer market at the time.
I remember that while PAE was clunky, it did enable (say) running multiple processes on 8-core machines and giving each its own 2GB of memory; It was definitely poor compared to the real thing (AMD64 at the time), but like EMS, XMS and similar "extended memory" systems of the late '80s before it, it was good enough to stop people from moving to a decent architecture (e.g. Motorola 680x0) if it didn't provide backward compatibility.
I think smart compilers for Itanium is a myth. There is nothing that special about compilers for that architecture. The architecture just wasn't good enough.
Compilers for VLIW architectures have to implement instruction scheduling at compile time but the information that is required for the best optimizations is only available at runtime.
It's not really a problem for compiled languages, like C, as they provide enough context for static speculation. But the architecture itself has to be well suited for statically speculated general purpose applications, which Itanium wasn't.
Today even Russia has managed to produce a VLIW architecture and a good enough compiler [1].
> Today even Russia have managed to produce a VLIW architecture and a good enough compiler
Do you have a citation for the “good enough” claim? In particular the listed peak performance appears to be below a modern phone's and, far more importantly, we'd need robust benchmark data to know that it's not following the Itanium cycle where a few very favorable, possibly hand-tuned, benchmarks manage to approach the theoretical maximum but most real-world code is substantially lower.
It looks like Intel's manufacturing advantage will slowly disappear over the next couple of years. So it seems likely that alternative CPUs will become more competitive, ARM for example.