Does anyone have a good answer to his "Why" at the end?
What’s a total mystery to me is why Intel chose to build an AGU that cannot handle all kinds of addresses. In 2017, it was indicated to me that there “was not enough space on the die.” I find this hard to believe, especially because the problem prevailed in (at least) three further generations of Intel CPUs after Haswell.
Is die space really a plausible answer as to why Intel would bother to put in a second AGU, but cripple it so that it can only work on "simple" addresses that even their own compiler is not "smart" enough to generate?
Following the link to SO, from there the link to RWT, and from there the link to the previous page in the review, it mentions "The new port 6 on Haswell is a scalar integer port. It only accesses the GPRs and the integer bypass network." which makes me think that the "not enough space on the die" might not refer to the space for the Port 7 Store AGU itself, but to the resources it uses. The address modes with the index register need to read two registers, while allowing only "base+offset" means it only has to read one register. I don't know how expensive an extra read connection to the integer bypass network would be, but I have read that read ports for the register file are expensive, to the point where having two identical copies of the register file (written in parallel) to reduce the number of read ports per copy can be considered a valid optimization. So the "not enough space" might also be for the register file (AFAIK register files use a lot of area, which is why RISC-V has a variant with half the number of registers for some embedded use cases) or its connections.
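To make the register-read argument concrete, here is a minimal sketch of how the two address forms differ. The `MemOperand` type and both helper functions are hypothetical names made up for illustration. The port assignment (ports 2 and 3 take any address form, port 7 takes only non-indexed store addresses) is an assumption based on how Haswell is commonly described, not something measured here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemOperand:
    """An x86-style memory operand: [base + index*scale + disp]."""
    base: Optional[str] = None
    index: Optional[str] = None
    scale: int = 1
    disp: int = 0


def register_reads(op: MemOperand) -> int:
    """Count the source registers that must be read to form the address."""
    return sum(reg is not None for reg in (op.base, op.index))


def store_agu_ports(op: MemOperand) -> tuple:
    """Ports assumed able to compute this store address on Haswell.

    Assumption: port 7 only accepts addresses without an index register,
    while ports 2 and 3 accept every form.
    """
    return (2, 3) if op.index is not None else (2, 3, 7)


simple = MemOperand(base="rdi", disp=32)                          # [rdi + 32]
indexed = MemOperand(base="rdi", index="rcx", scale=4, disp=32)   # [rdi + rcx*4 + 32]

print(register_reads(simple), store_agu_ports(simple))
print(register_reads(indexed), store_agu_ports(indexed))
```

The simple form needs one register read and the indexed form needs two, which is the extra read connection the comment above is speculating about.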
> Is die space really a plausible answer as to why Intel would bother to put in a second AGU, but cripple it so that it can only work on "simple" addresses that even their own compiler is not "smart" enough to generate?
Maybe. It's possible that whoever was responsible for this part of the core was given a strict area budget, and adding the full set of addressing modes pushed them over it. In retrospect, increasing that area budget could have been the smart move, but not all decisions in CPU microarchitecture are necessarily well informed! Perhaps in this case they did investigate and, with the benchmark set they were using, decided that supporting only simple addressing modes didn't have a large performance impact.
Another possibility is that they wanted to add the full set of addressing modes and did so, then found this caused a timing issue (as in, it produced a critical path that would reduce maximum frequency too much) or a power issue. The way around the timing/power issue may have involved a large amount of work they didn't have time to do, or the issue was discovered late in the day and the fix would have introduced some interesting new corner cases they felt increased verification risk too much. Given that it's persisted over a few generations this is maybe less likely, but perhaps the timing issue was so severe they couldn't work around it over multiple microarchitectures (or it simply cannot be fixed). Intel may also simply care less about the restriction than the author does (i.e. no one who spends lots of money with them has kicked up a fuss about it, nor do they think the performance restriction is a large enough issue to cost them sales).
(I used to work at Arm designing A-class CPUs. A common part of the job was finding that your new microarchitecture, which could execute X things at once, had some annoying timing issue when doing so, often related to exception handling or something else rare, that required odd restrictions to work around. You would then try to understand how much those restrictions impacted performance and whether you should go a different way to avoid them.)
Edit: Oh, and another reason you may point to die space is that, whilst the feature itself may not add much extra area, it can push other things apart. That makes for longer wires, which need larger buffers to drive them to make timing (or they're simply too far apart to meet timing), so more area and power for those buffers.
Having had conversations with some hardware designers who work on similar, but not identical, problems: often it's not a matter of space on the die directly. There are a few things that can lead to these results. Often you have existing functionality on the die for other purposes, whether that be some accumulators or arithmetic functions, that you can minimally tweak to get a new function at practically no extra cost, maybe just some control logic. It would make a lot of sense to me that the reason for the reduced functionality in this case is that they didn't design an AGU at all; they found a way to serve that purpose with almost entirely existing hardware. Obviously there are other possible reasons too, such as not having the routing resources, or deciding an integer multiplier was too expensive whereas what appears to be just an adder is actually quite cheap.
"Not enough space on the die" is so obviously wrong if taken literally that the Intel source obviously didn't mean it that way. "Space on the die" is a global constraint, but hw design is full of local constraints. E.g., if enough pieces of logic and wiring are closely connected, then in that area of the die there may not be room to fit any more logic or wiring without violating the gate delay targets that keep the chip running at the correct speed, even though the addition would occupy a trivial amount of space in terms of the whole die.
Others have pointed out limitations on register read bandwidth and other possible hardware level issues, but I'll also point out that avoiding indexed addressing modes in the generated assembly doesn't look to be that difficult - if you're doing a big unrolled SIMD kernel it's probably not a stretch to have the compiler emit some LEA instructions as well to avoid indexed addressing modes in the loop. Apparently the compiler authors simply haven't decided that it's important enough to do so yet, but compilers are much easier to change than hardware...
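As a rough sketch of the LEA idea in the comment above: fold `base + index*scale` into a scratch register once per unrolled iteration, then address every store as `scratch + displacement`. The generator functions, the `vmovups`/`ymm` choice, and the `r8` scratch register are all hypothetical and only there to show the shape of the output. This is not what any particular compiler emits.

```python
def unrolled_stores_indexed(base, index, scale, n, stride):
    """Naive form: every store in the unrolled body uses an indexed address."""
    return [
        f"vmovups [{base} + {index}*{scale} + {i * stride}], ymm{i}"
        for i in range(n)
    ]


def unrolled_stores_with_lea(base, index, scale, n, stride, tmp="r8"):
    """One LEA computes base + index*scale; the stores then use tmp + disp."""
    lines = [f"lea {tmp}, [{base} + {index}*{scale}]"]
    lines += [f"vmovups [{tmp} + {i * stride}], ymm{i}" for i in range(n)]
    return lines


# Four 32-byte stores per iteration, indexed by rcx in units of 4 bytes.
for line in unrolled_stores_with_lea("rdi", "rcx", 4, n=4, stride=32):
    print(line)
```

The trade-off is one extra instruction per iteration in exchange for store addresses that contain no index register, which is the form the restricted AGU is said to handle.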
Or they didn't want to end up producing code with noticeable fluctuations in efficiency, as it is described later in the article: sometimes, having consistent (read predictable) computation times is better.
Simple + plausible explanation: approximately nobody uses fancy instructions with fancy addressing, so they butcher-chopped some Verilog (does Intel use Verilog?) in one generation to meet a target and the chopped code outlived its original purpose.
Does OP really think legacy code dynamics don't apply to Intel?
Wasn't there a bug not long ago where you could thermally kill a CPU by pulling clever tricks to get vector unit utilization higher than was supposed to be possible (presumably because nobody cared enough to precisely figure out "possible")?
I don't know about "kill" but power viruses have been a research topic for a long time. A decade ago researchers were using automatic techniques for maximizing power consumption on real hardware (https://lca.ece.utexas.edu/pubs/ganesan_pact10.pdf)
Odd to not see a mention of LINPACK in that paper, when it has reliably produced temperatures exceeding all other stress-testing programs I've tried, as well as uncovered instabilities that others couldn't find. It also turns out to be a very realistic type of benchmark (solving dense systems of linear equations). The paper doesn't predate its use, so I guess the authors just haven't spent much time on overclocking forums...
Sort of. I did find this article when searching for evidence of whether the AGU ever assists in LEA, but I'm mostly interested in the topic because I've been working on-and-off with Daniel Lemire and 'BeeOnRope' on a research paper that involves getting intimately familiar with the innards of the AGUs on recent Intel. I have no doubt that the AGU exists!
Related, you might be interested in this StackOverflow, where among other things, Peter Cordes (is he here?) and BeeOnRope conclude that the Port 7 AGU cannot be acting as an assist to the 3-cycle "complex" LEA: https://stackoverflow.com/questions/50557636/what-type-of-ad... (expand the comments on the answer).
Thanks for pointing out that link (I didn't click on it when I originally read the article). Deep in the comments, there's a hint of the reason for this port 7 AGU weakness: not because of the extra gates for another adder, but due to the wiring of the scheduler/bypass network, which is presumably much simpler when port 7 only handles latency-1 instructions (at least if I'm understanding that correctly).