The space is there, but it's difficult to see how to make use of it for this problem. You'd need to be able to use an 8-bit input value to choose which position in which register to update. Incrementing the position (pos = input % 32) is possible, though not as straightforward as you might hope.
But choosing which of the vectors to update (vecX where X = input / 32) doesn't have a good answer without requiring unpredictable branching. There's a thread elsewhere on this page where we muse about the possibility of using self-modifying code, but otherwise I don't know of any answer.
During my morning boot (awakening) I was thinking about the self modifying code solution. I don't think it is needed and that it would eat up lots of memory bandwidth. I think a 256 entry jump table is sufficient which could be inlined so
R8 <- load 64 bits from ram
JUMP A1
@A0
SHIFT R8
@A1
R9 load low 8 bits to register
pc += R9 * SIZEOF_AVX_INCR_ROUTINE
Each AVX_INCR_ROUTINE would be the same size and have an absolute jump back to A0. There would need to be a DEC counter so the whole thing was only done 8 times. Also memory reads could be interleaved during the count or use the AVX register for a destination directly from memory. But it seems like a waste if there are already 64 bit registers sitting there.
If one used U16 counters, every 65k elements read would need to sweep through and write the high buckets out to 32 bit counters to prevent overflow.
I don't yet have any solid evidence but I think using the AVX registers for histogram storage could enable histogramming at the read bandwidth of main memory.
I don't think any approach based around a jump table is going to perform well. The issue is that modern processors use speculative execution, and have a deep pipeline that needs to be refilled whenever the wrong path is chosen. A jump table and an cmp/jc are predicted in the same way, but the jump table is usually harder to predict. Every time the prediction is wrong, you lose ~15 cycles. The 4-table approach describes takes about 2 cycles per byte, so there isn't much room for error.
The whole idea of "the" program counter is almost gone. You can still set it to affect program behaviour, but in reality there is a "reorder buffer" filled with instructions from an "execution window" that are issued by the backend as soon as their dependencies are available, and the front end has a mostly independent "speculative pc" that is fetching and decoding instructions 10-100 cycles ahead of the current execution window. The guess as to which execution path will be taken at the jump table will probably be made even before the load is issued to get the address.
Per your earlier recommendation, I did read through "Inner Loops". It's a really good book, and offers a great overall approach to programming, but this is one of the places where the details of modern processors have changed so much that the older approach is counterproductive. To be competitive for this task, you probably need to be branch free, so that you have speculative execution working with you rather than against you.
I see your point. I am going to try and implement the approach I outlined above as well as the code gen. Glad you checked out "Inner Loops", I a guess my take away from that book is understand the machine, your code and most of all do some science. The hardware used the book is very dated, and I'd love to see a second edition using SkyLake or newer.
Maybe ... instead of doing the JMP back to the top of the loop, you combine the two approaches to like a direct threaded interpreter. In each AVX_INCR_ROUTINE modify the code by over writing the next JMP address, don't jump back to A0. The issue of consecutive duplicates arrises again.
Or ask the processor for some L1 cache that never gets flushed to main memory.
Or use R8-R15 as an 8x8 array of U8 counters, do 4 passes [0-63] [64-127] [128-191] [192-255] or a 4x8 array of U16 counters, would have to test.
Function Best Rate MB/s Avg time Min time Max time
Copy: 11569.0 0.145255 0.138301 0.154563
Scale: 11383.5 0.145140 0.140555 0.148503
Add: 12544.6 0.196920 0.191318 0.204782
Triad: 12464.8 0.198199 0.192542 0.207492
But choosing which of the vectors to update (vecX where X = input / 32) doesn't have a good answer without requiring unpredictable branching. There's a thread elsewhere on this page where we muse about the possibility of using self-modifying code, but otherwise I don't know of any answer.