> As an outsider to 'enterprise-grade' computing, I'm curious about situations where a high number of cores in a single processor would be superior to multiple processors with the same total energy draw sitting on a single motherboard?
Databases are the big one I'm aware of.
Intel's L3 cache is truly unified. On Intel's 28-core Skylake, a database TRULY gets 38.5MB of L3. When any core requests data, it goes into the giant distributed L3 cache that all cores can access efficiently.
AMD's L3 cache, however, is a network of 8MB chunks, one per 4-core CCX. Sure, there's 64MB of cache in its 32-core system (8 chunks of 8MB), but any one core can only use 8MB of it effectively.
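To put rough numbers on that, here is a minimal sketch of the arithmetic. The `CacheLayout` class and its field names are made up for illustration, and it assumes each 8MB chunk is shared by 4 cores, so a 32-core part has 8 chunks (64MB in total across the package).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLayout:
    """Hypothetical description of how a CPU's L3 is carved up."""
    name: str
    cores: int
    l3_domains: int          # number of separate L3 pools on the package
    l3_per_domain_mb: float  # size of each pool in MB

    @property
    def total_l3_mb(self) -> float:
        # Headline figure: every pool added together.
        return self.l3_domains * self.l3_per_domain_mb

    @property
    def local_l3_mb(self) -> float:
        # What one core can reach without leaving its own pool.
        return self.l3_per_domain_mb

    @property
    def local_fraction(self) -> float:
        # Share of the headline L3 that is "local" to any single core.
        return self.local_l3_mb / self.total_l3_mb


# Assumed figures: one unified 38.5MB pool vs. eight 8MB pools.
xeon = CacheLayout("28-core Skylake Xeon", cores=28, l3_domains=1, l3_per_domain_mb=38.5)
amd = CacheLayout("32-core Threadripper / EPYC", cores=32, l3_domains=8, l3_per_domain_mb=8.0)

for cpu in (xeon, amd):
    print(f"{cpu.name}: total {cpu.total_l3_mb}MB, "
          f"local to one core {cpu.local_l3_mb}MB "
          f"({cpu.local_fraction:.1%} of total)")
```

The point of the comparison is the "local" column: the headline total looks bigger on the AMD side, but the part a single core can use cheaply is much smaller.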
In fact, pulling memory off of a "remote L3 cache" is slower (higher latency) than pulling it from RAM on the Threadripper / EPYC platform. (A remote L3 pull has to coordinate over Infinity Fabric and remain coherent! That means "invalidating" the other copies and waiting to become the "exclusive owner" before a core can start writing to an L3 cache line, at least according to the MESI cache-coherence protocol. I know AMD uses something more complex and efficient... but my point is that cache coherence has a cost that becomes clear in this case.) That doesn't bode well for any HPC application... or for databases (which will effectively be locked to 8MB per thread, with "poor sharing", at least compared to Xeon).
Of course, "databases" might just be the most common HPC application in the enterprise that needs communication and coordination between threads.
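For anyone who hasn't seen the "invalidate, then become exclusive owner" step spelled out, here is a toy sketch of a write under textbook MESI. It tracks a single cache line across a few caches and counts the invalidation messages a write needs. The names (`State`, `write_line`, `ccx0`...) are made up for illustration, and this is plain MESI, not whatever AMD or Intel actually implement.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # only copy, dirty
    EXCLUSIVE = "E"  # only copy, clean
    SHARED = "S"     # clean copy, others may hold it too
    INVALID = "I"    # no usable copy


def write_line(caches: dict, writer: str) -> int:
    """Simulate `writer` storing to one cache line under textbook MESI.

    `caches` maps a cache name to the State it holds for that one line.
    Returns how many other caches had to be invalidated first.
    """
    if caches[writer] in (State.MODIFIED, State.EXCLUSIVE):
        # Already the sole owner: no traffic, just mark the line dirty.
        caches[writer] = State.MODIFIED
        return 0

    invalidations = 0
    for name, state in caches.items():
        if name != writer and state is not State.INVALID:
            # Every other valid copy must be invalidated (a MODIFIED one
            # would also be written back first) before the write proceeds.
            caches[name] = State.INVALID
            invalidations += 1

    caches[writer] = State.MODIFIED
    return invalidations


# One line shared by two L3 chunks, absent from a third.
line = {"ccx0": State.SHARED, "ccx1": State.SHARED, "ccx2": State.INVALID}
first = write_line(line, "ccx0")   # has to invalidate ccx1's copy
second = write_line(line, "ccx0")  # already the owner, no messages
print(first, second, {name: state.value for name, state in line.items()})
```

Each of those invalidations is a round trip to another cache, and when that cache sits on another die the round trip goes over the interconnect. That round trip is the cost being described above.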
> Intel's L3 cache is truly unified. On Intel's 28-core Skylake, a database TRULY gets 38.5MB of L3. When any core requests data, it goes into the giant distributed L3 cache that all cores can access efficiently.
This is less true now. Intel's L3 cache is still all on one monolithic piece of silicon, unlike the separate caches spread across the 4 dies of a 32-core TR. But each core's L3 slice is now physically placed right next to that core, and the other slices are reached over the ring bus or, on Skylake server parts and later, the mesh. That's still faster than leaving the die and going over AMD's Infinity Fabric, and a lot less complicated than wiring up every core for direct L3 access.