The space is there, but it's difficult to see how to make use of it for this pro...

sitkack · on Sept 29, 2014

During my morning boot (awakening) I was thinking about the self modifying code solution. I don't think it is needed and that it would eat up lots of memory bandwidth. I think a 256 entry jump table is sufficient which could be inlined so

    R8 <- load 64 bits from ram
    JUMP A1
    @A0
    SHIFT R8
    @A1
    R9 load low 8 bits to register
    pc += R9 * SIZEOF_AVX_INCR_ROUTINE

Each AVX_INCR_ROUTINE would be the same size and have an absolute jump back to A0. There would need to be a DEC counter so the whole thing was only done 8 times. Also memory reads could be interleaved during the count or use the AVX register for a destination directly from memory. But it seems like a waste if there are already 64 bit registers sitting there.

If one used U16 counters, every 65k elements read would need to sweep through and write the high buckets out to 32 bit counters to prevent overflow.

I don't yet have any solid evidence but I think using the AVX registers for histogram storage could enable histogramming at the read bandwidth of main memory.

https://software.intel.com/sites/landingpage/IntrinsicsGuide...

nkurz · on Sept 29, 2014

I don't think any approach based around a jump table is going to perform well. The issue is that modern processors use speculative execution, and have a deep pipeline that needs to be refilled whenever the wrong path is chosen. A jump table and an cmp/jc are predicted in the same way, but the jump table is usually harder to predict. Every time the prediction is wrong, you lose ~15 cycles. The 4-table approach describes takes about 2 cycles per byte, so there isn't much room for error.

The whole idea of "the" program counter is almost gone. You can still set it to affect program behaviour, but in reality there is a "reorder buffer" filled with instructions from an "execution window" that are issued by the backend as soon as their dependencies are available, and the front end has a mostly independent "speculative pc" that is fetching and decoding instructions 10-100 cycles ahead of the current execution window. The guess as to which execution path will be taken at the jump table will probably be made even before the load is issued to get the address.

Per your earlier recommendation, I did read through "Inner Loops". It's a really good book, and offers a great overall approach to programming, but this is one of the places where the details of modern processors have changed so much that the older approach is counterproductive. To be competitive for this task, you probably need to be branch free, so that you have speculative execution working with you rather than against you.

sitkack · on Sept 29, 2014

I see your point. I am going to try and implement the approach I outlined above as well as the code gen. Glad you checked out "Inner Loops", I a guess my take away from that book is understand the machine, your code and most of all do some science. The hardware used the book is very dated, and I'd love to see a second edition using SkyLake or newer.

Maybe ... instead of doing the JMP back to the top of the loop, you combine the two approaches to like a direct threaded interpreter. In each AVX_INCR_ROUTINE modify the code by over writing the next JMP address, don't jump back to A0. The issue of consecutive duplicates arrises again.

Or ask the processor for some L1 cache that never gets flushed to main memory.

Or use R8-R15 as an 8x8 array of U8 counters, do 4 passes [0-63] [64-127] [128-191] [192-255] or a 4x8 array of U16 counters, would have to test.

    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:           11569.0     0.145255     0.138301     0.154563
    Scale:          11383.5     0.145140     0.140555     0.148503
    Add:            12544.6     0.196920     0.191318     0.204782
    Triad:          12464.8     0.198199     0.192542     0.207492

from http://www.cs.virginia.edu/stream/FTP/Code/stream.c

I still think the histogramming can get above 6GB/s

sitkack · on Sept 30, 2014

The entropy in those counter writes is a single bit, so most of the traffic is zeros. INCR and CAS needs to be built into the ram.