
Ah yes, Dr Nicely caused quite a bit of excitement at Intel. I was on the p6 architecture team at the time. (p6 == Pentium Pro) Our FPU was formally verified and didn't have the same bug.

To be nice to Dr Nicely we sent him a pre-release p6 development system to test with his program, to demonstrate that his bug was fixed. He was working on a prime number sieve program at the same time and came back reporting that the p6 ran at 1/2 the speed of a Pentium for his code. Wow, another black eye/firestorm caused by Dr. Nicely. He had too big an audience for us to let him report to the world that this new processor was slower.

So I got to spend a lot of time learning how the sieve works and what was happening. For the most part it allocates a huge array in memory, with each byte representing a number. You walk the array with a stride of each known prime, setting bytes, and whatever is left unset must be prime. i.e. every multiple of 3 is not prime, every multiple of 5 is not prime, every multiple of 7 ....

So in the steady state you are writing a single byte to a cache line without reading anything. And every write hits a different cache line.
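To make it concrete, here is a minimal sketch of that kind of sieve in C (my names and details, not Nicely's actual code):

     #include <stdlib.h>

     /* One byte per number: mark every multiple of each known prime,
        whatever is left unset is prime. */
     unsigned char *sieve(size_t limit)
     {
         unsigned char *composite = calloc(limit + 1, 1);
         if (!composite)
             return NULL;
         for (size_t i = 2; i <= limit; i++) {
             if (composite[i])
                 continue;                  /* i is prime, use it as a stride */
             for (size_t n = i + i; n <= limit; n += i)
                 composite[n] = 1;          /* blind 1-byte write, no read;
                                               once i > 32 every write lands
                                               on a different cache line */
         }
         return composite;                  /* composite[n] == 0  =>  n is prime */
     }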

Now the p6 had a write-allocate cache, but the Pentium would only allocate on a read, so on the Pentium a write that missed the cache simply became a write to memory. On the p6 that write had to load the cache line from memory into the cache, and then the line in the cache was marked modified. And since every other line in the cache was also modified, we had to flush one of them back to memory first to make room. So every 1-byte write became a 32-byte write to memory followed by a 32-byte read from memory.

Normally write allocate is a good thing, but in this case it was a killer. We were stumped.

Then the magic observation: 99% of these writes were marking a space that was already marked. When you get up to walking by large strides most of those were already covered by one of the smaller factors.

So if you changed the code from:

     array[N] = 1
to:

     if (!array[N]) array[N] = 1

Now suddenly we are doing a read first, and after that read we skip the write so the data in the cache doesn't become modified and can be discarded in the future.

Also the p6 was a superscalar machine that ran multiple iterations of this loop in parallel and could have multiple reads going to memory at the same time. With that small tweak the program got 4X faster and we went from 1/2 the speed of a Pentium to twice the speed. And this was at the same clock frequency! The test hardware ran at 100 MHz, we released at 200 MHz and went up from there.
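In code the tweak is just this change to the inner loop of the sketch above (again mine, not the original source):

     /* Read before write: when the byte is already set (the common case)
        the cache line stays clean and can later be evicted without a
        32-byte write-back to memory. */
     for (size_t n = i + i; n <= limit; n += i) {
         if (!composite[n])         /* read first: usually already marked */
             composite[n] = 1;      /* write only the rare unmarked bytes */
     }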




You can also reduce re-marking spaces by starting the marking pass for each prime i at i*i (as all multiples of i less than i*i have a smaller prime factor and would have been marked by a previous iteration) instead of at 2*i. So instead of, e.g., marking off multiples of 3 starting at 6 (which would have already been marked off in the 2 case), you would start at 9.
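In the sketch above that is just a change to where the inner loop starts (and the outer loop can then stop once i*i > limit):

     /* Start at i*i instead of 2*i: every smaller multiple of i also has
        a prime factor less than i and was marked in an earlier pass. */
     for (size_t n = i * i; n <= limit; n += i)
         if (!composite[n])
             composite[n] = 1;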


Of course, and you also store only the odd numbers, and you can do all the smaller factors in a single pass. Nicely had already done some of these tweaks and I tried some others, but it didn't change the overall problem:

In the steady state you are still striding a huge array in memory and missing the cache with every access.
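The odd-only storage is just an indexing change, something like this (my sketch): it halves the array but leaves the same striding pattern.

     /* Byte k stands for the odd number 2*k + 3, so the array is half
        the size, but large strides still miss the cache on every access. */
     size_t bytes = (limit - 1) / 2;                /* odds 3, 5, ..., <= limit */
     unsigned char *composite = calloc(bytes, 1);
     for (size_t i = 3; i * i <= limit; i += 2) {
         if (composite[(i - 3) / 2])
             continue;
         for (size_t n = i * i; n <= limit; n += 2 * i)   /* odd multiples only */
             if (!composite[(n - 3) / 2])
                 composite[(n - 3) / 2] = 1;
     }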


One thing I always wondered about this: why was he using the FPU? These all seem like integer operations.


Look here: http://www.mersenne.org/various/math.php at the Lucas-Lehmer test, which uses floating point FFTs to square large numbers. I am not sure if that is what he was doing originally, but I suspect it was.
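The Lucas-Lehmer recurrence itself is tiny (s = s*s - 2 mod 2^p - 1, repeated p-2 times); the FFT only comes in because the numbers being squared are millions of bits long. Here is a toy version for small exponents, nothing like prime95's actual implementation:

     #include <stdint.h>

     /* Lucas-Lehmer test for odd prime exponents p up to 63, using plain
        integers and GCC/Clang's unsigned __int128 for the squaring step. */
     int mersenne_prime(unsigned p)
     {
         uint64_t m = ((uint64_t)1 << p) - 1;       /* M_p = 2^p - 1 */
         uint64_t s = 4;
         for (unsigned k = 0; k < p - 2; k++)
             s = (uint64_t)(((unsigned __int128)s * s + m - 2) % m);
         return s == 0;                             /* M_p is prime iff s == 0 */
     }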

We used that prime95 program from mersenne.org quite a bit in testing because it was very close to our best max-power test for processors. It would keep both the FP and integer ALUs saturated and validated all the results so if anything was wrong it would start complaining.


Explained in the second question of his FAQ: http://www.trnicely.net/pentbug/pentbug.html


And then you get an optimizing compiler and the slow behavior is back.


Or you get an optimizing compiler that's aware of this difference and automatically emits a read before a blind write.


...Which makes things worse for any CPU that doesn't have this bug.


It doesn't sound like a bug, just a legitimate difference in how things are done. In any case, CPU-specific optimizations are hardly rare.



