
Ah yes, Dr Nicely caused quite a bit of excitement at Intel. I was on the p6 architecture team at the time. (p6 == Pentium Pro) Our FPU was formally verified and didn't have the same bug.

To be nice to Dr Nicely we sent him a pre-release p6 development system to test with his program, to demonstrate that his bug was fixed. He was working on a prime number sieve program at the same time and came back reporting that the p6 ran at 1/2 the speed of a Pentium for his code. Wow, another black eye/firestorm caused by Dr. Nicely. He had too big an audience for us to let him report to the world that this new processor was slower.

So I got to spend a lot of time learning how the sieve works and what was happening. For the most part it allocates a huge array in memory, with each byte representing a number. You walk the array with a stride of each known prime, setting bytes, and whatever is left unset must be prime. i.e. every multiple of 3 is not prime, every multiple of 5 is not prime, every multiple of 7 ....

So in the steady state you are writing a single byte to a cache line without reading anything. And every write hits a different cache line.
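To make it concrete, here is a minimal sketch of that kind of sieve in C (my names and details, not Nicely's actual code):

     #include <stdlib.h>

     /* One byte per number: mark every multiple of each known prime,
        whatever is left unset is prime. */
     unsigned char *sieve(size_t limit)
     {
         unsigned char *composite = calloc(limit + 1, 1);
         if (!composite)
             return NULL;
         for (size_t i = 2; i <= limit; i++) {
             if (composite[i])
                 continue;                  /* i is prime, use it as a stride */
             for (size_t n = i + i; n <= limit; n += i)
                 composite[n] = 1;          /* blind 1-byte write, no read;
                                               once i > 32 every write lands
                                               on a different cache line */
         }
         return composite;                  /* composite[n] == 0  =>  n is prime */
     }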

Now the p6 had a write-allocate cache, but the Pentium would only allocate on a read, so on the Pentium a write that missed the cache simply became a write to memory. On the p6 that write had to load the cache line from memory into the cache, and then the line in the cache was marked modified. And since every other line in the cache was also modified, we had to flush one of them back to memory first to make room. So every 1-byte write became a 32-byte write to memory followed by a 32-byte read from memory.

Normally write allocate is a good thing, but in this case it was a killer. We were stumped.

Then the magic observation: 99% of these writes were marking a space that was already marked. When you get up to walking by large strides most of those were already covered by one of the smaller factors.

So if you changed the code from:

     array[N] = 1
to:

     if (!array[N]) array[N] = 1

Now suddenly we are doing a read first, and after that read we skip the write so the data in the cache doesn't become modified and can be discarded in the future.

Also the p6 was a superscalar machine that ran multiple iterations of this loop in parallel and could have multiple reads going to memory at the same time. With that small tweak the program got 4X faster and we went from 1/2 the speed of a Pentium to twice the speed. And this was at the same clock frequency! The test hardware ran at 100 MHz, we released at 200 MHz and went up from there.
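In code the tweak is just this change to the inner loop of the sketch above (again mine, not the original source):

     /* Read before write: when the byte is already set (the common case)
        the cache line stays clean and can later be evicted without a
        32-byte write-back to memory. */
     for (size_t n = i + i; n <= limit; n += i) {
         if (!composite[n])         /* read first: usually already marked */
             composite[n] = 1;      /* write only the rare unmarked bytes */
     }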




You can also reduce re-marking spaces by starting the marking pass for each prime i at i*i (as all multiples of i less than i*i have a smaller prime factor and would have been marked by a previous iteration) instead of at 2*i. So instead of, e.g., marking off multiples of 3 starting at 6 (which would have already been marked off in the 2 case), you would start at 9.
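In the sketch above that is just a change to where the inner loop starts (and the outer loop can then stop once i*i > limit):

     /* Start at i*i instead of 2*i: every smaller multiple of i also has
        a prime factor less than i and was marked in an earlier pass. */
     for (size_t n = i * i; n <= limit; n += i)
         if (!composite[n])
             composite[n] = 1;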


Of course, and you also store only the odd numbers, and you can do all the smaller factors in a single pass. Nicely had already done some of these tweaks and I tried some others, but it didn't change the overall problem:

In the steady state you are still striding a huge array in memory and missing the cache with every access.
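The odd-only storage is just an indexing change, something like this (my sketch): it halves the array but leaves the same striding pattern.

     /* Byte k stands for the odd number 2*k + 3, so the array is half
        the size, but large strides still miss the cache on every access. */
     size_t bytes = (limit - 1) / 2;                /* odds 3, 5, ..., <= limit */
     unsigned char *composite = calloc(bytes, 1);
     for (size_t i = 3; i * i <= limit; i += 2) {
         if (composite[(i - 3) / 2])
             continue;
         for (size_t n = i * i; n <= limit; n += 2 * i)   /* odd multiples only */
             if (!composite[(n - 3) / 2])
                 composite[(n - 3) / 2] = 1;
     }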


One thing I always wondered about this: why was he using the FPU? These all seem like integer operations.


Look here: http://www.mersenne.org/various/math.php at the Lucas-Lehmer test, which uses floating point FFTs to square large numbers. I am not sure if that is what he was doing originally, but I suspect it was.
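The Lucas-Lehmer recurrence itself is tiny (s = s*s - 2 mod 2^p - 1, repeated p-2 times); the FFT only comes in because the numbers being squared are millions of bits long. Here is a toy version for small exponents, nothing like prime95's actual implementation:

     #include <stdint.h>

     /* Lucas-Lehmer test for odd prime exponents p up to 63, using plain
        integers and GCC/Clang's unsigned __int128 for the squaring step. */
     int mersenne_prime(unsigned p)
     {
         uint64_t m = ((uint64_t)1 << p) - 1;       /* M_p = 2^p - 1 */
         uint64_t s = 4;
         for (unsigned k = 0; k < p - 2; k++)
             s = (uint64_t)(((unsigned __int128)s * s + m - 2) % m);
         return s == 0;                             /* M_p is prime iff s == 0 */
     }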

We used that prime95 program from mersenne.org quite a bit in testing because it was very close to our best max-power test for processors. It would keep both the FP and integer ALUs saturated and validated all the results so if anything was wrong it would start complaining.


Explained in the second question of his FAQ: http://www.trnicely.net/pentbug/pentbug.html


And then you get an optimizing compiler and the slow behavior is back.


Or you get an optimizing compiler that's aware of this difference and automatically emits a read before a blind write.


...Which makes things worse for any CPU that doesn't have this bug.


It doesn't sound like a bug, just a legitimate difference in how things are done. In any case, CPU-specific optimizations are hardly rare.



