There are separate micro-op caches per core; however, they are typically shared among hyperthreads. I wonder if this could be another good reason for cloud vendors to move away from 1 vCPU = 1 hyperthread to 1 vCPU = 1 core for x86 when sharing machines (not that there weren't enough good reasons already).
One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.
For Microsoft, this means they've effectively doubled their software licensing revenue relative to the hardware it runs on.
This kind of perverse incentive worries me a lot, because while I like the technical concepts the public cloud enables, like infrastructure-as-code, I feel like greed will eventually destroy what they've built and we'll all be back to square one.
Ask your cloud sales representative these questions next time you have coffee with them:
- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?
- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?
- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: why does the load balancer team refuse to implement same-zone preference as a default?
Etc...
Once you start looking at the cloud like this, you suddenly realise why there are so many UserVoice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.
Cloud vendors probably use a hypervisor that schedules VM time slices in such a way that hyperthread siblings are only ever co-occupied by the same guest.
ARM vendors must be feeling pretty good about themselves, yeah, but take AMD's cores: SMT might not be a huge win in every benchmark, but you just can't keep that wide a backend fed from a single hyperthread (at least I can't!).
So turning SMT off is, at the very least, wasted potential for those cores, given the way they've been designed.
Probably because they have an 8-wide decoder and a massive reorder buffer, so they can actually keep the backend fed.
The problem with x86 is that decoding is hell and requires increasingly large transistor counts to parallelize, so you end up with a bottleneck there. ARM doesn't have that problem.
Variable-length, overlapping instructions have made x86 instruction decoding intractable to parallelize. The obvious answer is to make it tractable; the unobvious part is how to do that while hopefully remaining backward compatible.
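To make that concrete, here's a toy C sketch with an invented mini-ISA (the low two bits of the first byte give the instruction length, 1-4 bytes; nothing here is real x86). With a fixed 4-byte encoding, every decoder lane knows its start offset up front; with variable-length instructions, you only know where instruction i+1 starts after you've worked out the length of instruction i, so naive decode is a serial chain. Real x86 decoders get around this by speculatively length-decoding at many byte offsets in parallel, which is roughly where the transistor cost comes from.

```c
/* Toy illustration only: the "ISA" here is invented, not real x86. */
#include <stddef.h>
#include <stdint.h>

/* Fixed 4-byte encoding: lane i can decode code[4*i] independently,
 * so widening the decoder is just "add more lanes". */
size_t count_fixed(const uint8_t *code, size_t len)
{
    (void)code;   /* instruction boundaries are known up front */
    return len / 4;
}

/* Variable-length encoding: each instruction's length determines where
 * the next one starts, so naive decode is a serial dependency chain. */
size_t count_variable(const uint8_t *code, size_t len)
{
    size_t n = 0, pos = 0;
    while (pos < len) {
        size_t instr_len = (code[pos] & 0x3) + 1;  /* 1..4 bytes */
        pos += instr_len;  /* can't start the next decode until this is known */
        n++;
    }
    return n;
}
```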
The performance claims are true for all the worst reasons.
Let's say you can queue up 100 instructions. This yields the following:
- 1 port 100% of the time
- 2 ports 60% of the time
- 3 ports 30% of the time
- 4 ports 10% of the time
- 5 ports 2% of the time

Increasing the buffer to 200 instructions yields the following:
- 2 ports 80% of the time
- 3 ports 40% of the time
- 4 ports 15% of the time
- 5 ports 4% of the time
As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percentage points of IPC uptick, but the cost is too high, so you keep increasing the window size until the extra ports become viable. As an aside, AMD's Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical peak, but using that space for more VLIW4 units (along with other changes) meant that overall performance went up for most workloads.
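If you want to poke at the shape of those numbers yourself, here's a toy C model (everything in it is made up: the dependency mix, the 30-cycle "cache miss" latency, the five ports; it won't reproduce the percentages above, just the diminishing-returns shape, reading them as "at least N ports busy"). It walks a random instruction stream, issues up to NUM_PORTS ready instructions per cycle from a fixed-size window, and reports how often each port count gets used:

```c
/*
 * Toy issue-window model. Every "instruction" optionally depends on one
 * recent earlier instruction, ~10% are long-latency "cache misses", and
 * each cycle up to NUM_PORTS ready instructions may start from a window
 * of fixed size. All constants are invented for illustration.
 */
#include <stdio.h>
#include <stdlib.h>

#define N_INSTR   200000  /* length of the simulated instruction stream */
#define NUM_PORTS 5       /* max instructions started per cycle         */
#define DEP_RANGE 20      /* how far back a dependency may reach        */

static void simulate(int window)
{
    int  *dep    = malloc(N_INSTR * sizeof *dep);    /* producer index, -1 if none          */
    int  *lat    = malloc(N_INSTR * sizeof *lat);    /* latency in cycles                   */
    int  *finish = calloc(N_INSTR, sizeof *finish);  /* cycle result is ready, 0 = not issued */
    long busy[NUM_PORTS + 1] = {0};                  /* cycles with at least k issues       */
    long cycle = 0, issued_total = 0;
    int  head = 0;

    srand(42);
    for (int i = 0; i < N_INSTR; i++) {
        dep[i] = (i > 0 && rand() % 100 < 75)
                   ? i - 1 - rand() % (i < DEP_RANGE ? i : DEP_RANGE)
                   : -1;
        lat[i] = (rand() % 100 < 10) ? 30 : 1;       /* occasional "cache miss" */
    }

    while (head < N_INSTR) {
        int limit = head + window < N_INSTR ? head + window : N_INSTR;
        int issued = 0;

        /* start up to NUM_PORTS instructions whose producer has finished */
        for (int i = head; i < limit && issued < NUM_PORTS; i++) {
            if (finish[i] == 0 &&
                (dep[i] < 0 || (finish[dep[i]] != 0 && finish[dep[i]] <= cycle))) {
                finish[i] = (int)cycle + lat[i];
                issued++;
            }
        }
        issued_total += issued;
        for (int k = 1; k <= issued; k++)
            busy[k]++;
        cycle++;

        /* retire in order once results are ready, freeing window slots */
        while (head < N_INSTR && finish[head] != 0 && finish[head] <= cycle)
            head++;
    }

    printf("window=%-4d IPC=%.2f\n", window, (double)issued_total / cycle);
    for (int k = 1; k <= NUM_PORTS; k++)
        printf("  %d+ ports busy in %5.1f%% of cycles\n",
               k, 100.0 * busy[k] / cycle);

    free(dep);
    free(lat);
    free(finish);
}

int main(void)
{
    simulate(100);
    simulate(200);
    return 0;
}
```

Compile with something like `cc -O2 sim.c && ./a.out` and play with the window, port count, and latencies; the exact numbers move around with the parameters, but the pattern of diminishing returns stays.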
Now comes the x86 fly in the ointment -- decoder width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer you can keep filled. Since you have a maximum instruction window size, you have a hard port limit.
So you add a second thread. Sure, it requires duplicating the architectural register state and some frontend resources, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs in the complexity and extra units required for SMT, but that's beyond our scope.
As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread can use the core's resources. In less interdependent code, SMT gains are smaller, because it's easier to find uses for those extra ports from the main thread alone.
Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86, and Apple's M1 (and even the upcoming V1 or N2) have wider decode. That in turn feeds a much larger reorder buffer, which can extract more parallelism from a single thread (this seems to take about as many transistors as the extra frontend hardware needed for SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.
IBM POWER does show a different side of SMT, though. They go with 8-way SMT, not because they have that many ports, but so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. It helps ensure that even when some threads are waiting for data, there's still another thread ready to execute.
The memory latency hiding also works with 2-way SMT. I worked on networking software doing per-packet session lookups in large hash tables. SMT on a Sandy Bridge core in this application gave 40% better performance, which is higher than the figures usually mentioned. So for memory-bound (as in cache-missing) applications, SMT is a boon.
The CPU in this case is a Threadripper 3970X: 32 cores, 64 SMT threads.
My experience is this: when the L3 cache is effective, memory latency hiding via prefetch works well across SMT threads. If the hash-table lookup requires a chain walk, the SMT latency hiding is less effective, because the calculated prefetch location isn't where the actual hit lands. I couldn't get prefetching multiple slots (as the load increased) to be as effective as prefetching a single slot.
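For anyone curious what the single-slot prefetch pattern looks like, here's a minimal C sketch, not the commenter's actual code; the table layout, hash function, and prefetch distance are all invented. The idea is to kick off a prefetch for the bucket you'll need a few packets from now, then do the real lookup once that cache line has hopefully arrived; SMT attacks the same latency by running the sibling thread while one thread waits on the miss.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE     (1u << 22)  /* 4M buckets, far larger than L3 */
#define PREFETCH_AHEAD 8           /* how many lookups to run ahead  */

struct bucket {
    uint64_t key;
    uint64_t session_id;
};

static struct bucket table[TABLE_SIZE];

/* toy hash; a real implementation would use something stronger */
static inline size_t slot_of(uint64_t key)
{
    return (key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
}

/* look up a batch of flow keys, overlapping memory latency via prefetch;
 * 0 is used as a "no session found" sentinel for simplicity */
void lookup_batch(const uint64_t *keys, uint64_t *sessions, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* start pulling in the bucket we'll need PREFETCH_AHEAD packets from now */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&table[slot_of(keys[i + PREFETCH_AHEAD])], 0, 1);

        /* the actual lookup; with luck this cache line is already here */
        struct bucket *b = &table[slot_of(keys[i])];
        sessions[i] = (b->key == keys[i]) ? b->session_id : 0;
    }
}
```

As the comment notes, this only works when the prefetch address can be computed exactly from the key up front; once a lookup degrades into a chain walk, the prefetched slot often isn't where the hit lands and the benefit shrinks.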
I tested this some years ago on a raytracer, and got a tad over 50% more speed when enabling HT compared to disabling it.
As you say, the ray tracer missed the cache a lot, interspersed with a fair bit of calculation. I'm guessing this is close to the ideal workload, as far as non-synthetic benchmarks go.
4-way and 8-way SMT are about latency hiding (like MIMT in GPUs, but more flexible). They increase the probability that at least one thread has data it can be crunching.
Because the cloud is designed around other people uploading binaries to the same machines as yours -- it's a basic principle of how services are allocated. When you go to AWS and spin up an EC2 instance, you don't get a machine to yourself; you get a VM running alongside many other people's VMs on some arbitrary server in one of their data centers.
Doesn't that make it even harder to carry out any sort of targeted attack? From what I understand, these side-channel attacks depend on being able to predict the addresses you'll read from, having an idea of what you're after, and a stable environment in which enough timing information can be collected. Any small change in the environment means you could start reading something completely different without even knowing. A CPU that could be running literally who-knows-what at any time seems like it wouldn't let you collect much coherent data, and of course the VM you're attacking from could itself be moving uncontrollably across CPUs.