I wish the author had been more specific about the processor being tested. This is the sort of thing that is extremely dependent on the microarchitecture of the processor used. I wonder if he's testing on AMD or an older Intel?
Modern Intel chips usually do a very good job of "store-to-load forwarding" when the load is the same size or smaller than the store. Although it's testing a slightly different effect, this is a great recent article on the topic: http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation...
How could packed addition help for this task? The project description on GitHub says benchmarking happens on an »Intel Core i5-3340M (oc'ed to 3.0GHz)«, so maybe the article is based on numbers from the same machine.
The space is there, but it's difficult to see how to make use of it for this problem. You'd need to be able to use an 8-bit input value to choose which position in which register to update. Incrementing the position (pos = input % 32) is possible, though not as straightforward as you might hope.
But choosing which of the vectors to update (vecX where X = input / 32) doesn't have a good answer that avoids unpredictable branching. There's a thread elsewhere on this page where we muse about the possibility of using self-modifying code, but otherwise I don't know of any answer.
During my morning boot (awakening) I was thinking about the self-modifying code solution. I don't think it is needed, and it would eat up lots of memory bandwidth. I think a 256-entry jump table is sufficient, which could be inlined, so:
R8 <- load 64 bits from RAM
JUMP A1
@A0
SHIFT R8 right by 8                   ; expose the next byte in the low 8 bits
@A1
R9 <- low 8 bits of R8                ; the byte value to count
PC += R9 * SIZEOF_AVX_INCR_ROUTINE    ; computed jump into the 256-entry table
Each AVX_INCR_ROUTINE would be the same size and have an absolute jump back to A0. There would need to be a DEC counter so the whole thing was only done 8 times. Also, memory reads could be interleaved during the count, or an AVX register could be used as the destination directly from memory. But it seems like a waste if there are already 64-bit registers sitting there.
If one used U16 counters, then every 65k elements read you'd need to sweep through and write the high buckets out to 32-bit counters to prevent overflow.
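A minimal sketch of that flush in plain C, assuming U16 buckets swept into U32 buckets; the names (counts16, counts32, flush16) and the 65,535-element chunk size are mine, chosen so no U16 bucket can wrap:

#include <stddef.h>
#include <stdint.h>

static uint16_t counts16[256];
static uint32_t counts32[256];

static void flush16(void) {
    for (int i = 0; i < 256; i++) {
        counts32[i] += counts16[i];   /* promote into the wide counters */
        counts16[i] = 0;
    }
}

void histogram(const uint8_t *buf, size_t len) {
    size_t done = 0;
    while (done < len) {
        /* at most 65535 bytes between flushes, so a U16 bucket cannot overflow */
        size_t chunk = (len - done < 65535) ? len - done : 65535;
        for (size_t i = 0; i < chunk; i++)
            counts16[buf[done + i]]++;
        flush16();
        done += chunk;
    }
}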
I don't yet have any solid evidence but I think using the AVX registers for histogram storage could enable histogramming at the read bandwidth of main memory.
I don't think any approach based around a jump table is going to perform well. The issue is that modern processors use speculative execution and have a deep pipeline that needs to be refilled whenever the wrong path is chosen. A jump table and a cmp/jcc are predicted in much the same way, but the jump table is usually harder to predict. Every time the prediction is wrong, you lose ~15 cycles. The 4-table approach described takes about 2 cycles per byte, so there isn't much room for error.
The whole idea of "the" program counter is almost gone. You can still set it to affect program behaviour, but in reality there is a "reorder buffer" filled with instructions from an "execution window" that are issued by the backend as soon as their dependencies are available, and the front end has a mostly independent "speculative pc" that is fetching and decoding instructions 10-100 cycles ahead of the current execution window. The guess as to which execution path will be taken at the jump table will probably be made even before the load is issued to get the address.
Per your earlier recommendation, I did read through "Inner Loops". It's a really good book, and offers a great overall approach to programming, but this is one of the places where the details of modern processors have changed so much that the older approach is counterproductive. To be competitive for this task, you probably need to be branch free, so that you have speculative execution working with you rather than against you.
I see your point. I am going to try to implement the approach I outlined above, as well as the code gen. Glad you checked out "Inner Loops"; I guess my takeaway from that book is: understand the machine, understand your code, and most of all do some science. The hardware used in the book is very dated, and I'd love to see a second edition using Skylake or newer.
Maybe ... instead of doing the JMP back to the top of the loop, you combine the two approaches into something like a direct-threaded interpreter. In each AVX_INCR_ROUTINE, modify the code by overwriting the next JMP address instead of jumping back to A0. The issue of consecutive duplicates arises again, though.
Or ask the processor for some L1 cache that never gets flushed to main memory.
Or use R8-R15 as an 8x8 array of U8 counters and do 4 passes ([0-63], [64-127], [128-191], [192-255]), or as a 4x8 array of U16 counters; one would have to test.
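A rough sketch of the 4-pass variant in plain C, with eight uint64_t accumulators standing in for R8-R15 (whether the compiler actually keeps them in registers is another matter); the branch-free masking and the absence of any U8 overflow handling are my own simplifications:

#include <stddef.h>
#include <stdint.h>

void histogram_4pass(const uint8_t *buf, size_t len, uint32_t out[256]) {
    for (int pass = 0; pass < 4; pass++) {
        uint64_t acc[8] = {0};            /* 8x8 array of U8 counters */
        unsigned base = pass * 64;        /* this pass counts values base..base+63 */
        for (size_t i = 0; i < len; i++) {
            unsigned v = (unsigned)buf[i] - base;      /* wraps huge for values below base */
            uint64_t in_pass = -(uint64_t)(v < 64);    /* all-ones mask if v belongs to this pass */
            acc[(v & 63) >> 3] += in_pass & ((uint64_t)1 << ((v & 7) * 8));
        }
        for (unsigned v = 0; v < 64; v++)
            out[base + v] += (uint32_t)((acc[v >> 3] >> ((v & 7) * 8)) & 0xff);
    }
}

A U8 lane that reaches 256 carries into its neighbour, so a real version would need to flush well before that, like the U16 sweep discussed above.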
Function    Best Rate MB/s   Avg time    Min time    Max time
Copy:              11569.0   0.145255    0.138301    0.154563
Scale:             11383.5   0.145140    0.140555    0.148503
Add:               12544.6   0.196920    0.191318    0.204782
Triad:             12464.8   0.198199    0.192542    0.207492
Thanks for this post! I've known about the massive effects of optimizing for cache lines but this is the first time I've heard of the write commit delay. Great example as well.
If I remember correctly, the Mill wouldn't have this issue, since all stores go through the cache hierarchy (by default), so they finish as soon as the data is in L1 (~3 cycles typically). It's then evicted down through the cache and into main memory as usual, while retaining consistent aliasing semantics automatically.
I may have misunderstood, but this is covered in more detail in the memory talk I believe: http://youtu.be/bjRDaaGlER8
I did a quick test on an i5 (Ivy Bridge) processor using C, and surprisingly, with non-random input the distributed version retained the speed it had on random input, while the single-table version became 2.5x slower.
So the problem is when the same byte occurs 2 or more times in rapid succession, right? Couldn't you use a temporary variable to count the number of consecutive occurrences of the same byte, and then once you hit a different byte, add the total consecutive count to the appropriate slot? That way you only write to the count table once for each run of identical bytes, and you never write to the same slot of the table twice in a row. Assuming that the compiler can fit all the necessary temporary variables into registers, wouldn't this eliminate the problem entirely? Or would this be so much extra work that the processor becomes the bottleneck instead of writes to memory?
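For what it's worth, here is a minimal sketch of that run-counting idea in plain C (the names are mine); the count table is written only once per run of identical bytes, never twice in a row to the same slot:

#include <stddef.h>
#include <stdint.h>

void histogram_runs(const uint8_t *buf, size_t len, uint64_t counts[256]) {
    if (len == 0) return;
    uint8_t cur = buf[0];
    uint64_t run = 1;
    for (size_t i = 1; i < len; i++) {
        if (buf[i] == cur) {
            run++;                 /* same byte again: no table write */
        } else {
            counts[cur] += run;    /* one write per run */
            cur = buf[i];
            run = 1;
        }
    }
    counts[cur] += run;            /* flush the final run */
}

Of course the per-byte compare is itself a branch, which ties into the prediction discussion elsewhere in the thread.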
I did a test on this as well. It approaches the speed of the distributed version only if there are a lot of consecutive bytes. The distributed version is fast in any case.
This code is generating tons of D-cache misses, at the cost of hundreds of cycles (in a modern x86 memory system), so you can afford a branch miss now and then.
I don't think this is correct. The input is sequential, so the hardware prefetcher should be very successful: the only misses will be 1/4096 for the first entry into a new 4KB page, and even this could be avoided with a single judicious software prefetch.
He's counting 8-bit bytes, and thus has 256 entries per table. Using 64-bit counters, this is 8 * 256 = 2048 contiguous bytes per table. 4 tables gets him up to 8KB. Modern L1D is 32KB.
Alternatively viewed, 1700MB/s is 1.7 billion table additions per second. The processor is running at about 3.5 GHz, which is 3.5 billion cycles per second. Thus his current algorithm takes ~2 cycles per byte. A branch prediction error costs ~15 cycles, and thus would be relatively very expensive.
You'd probably be right for the 16-bit U16 version, though.
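To make the "single judicious software prefetch" mentioned above concrete, a sketch in plain C with the GCC/Clang builtin (the 4KB lookahead is an illustrative choice, not a measured one):

#include <stddef.h>
#include <stdint.h>

void histogram_prefetch(const uint8_t *buf, size_t len, uint64_t counts[256]) {
    for (size_t i = 0; i < len; i++) {
        if ((i & 4095) == 0 && i + 4096 < len)
            __builtin_prefetch(buf + i + 4096);   /* touch the next page early */
        counts[buf[i]]++;
    }
}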
The U16 version can probably benefit from a test for the first 256 values, leaving the remainder of the range as vanilla counters. You might get a win at less than 128 on the first test, in fact.
If the sampling is deterministic, it sounds like you'd risk malicious input being able to cause pathological results? Not unlike the various hashmap DoS'es from a couple of years ago.
I wonder what would happen if you used AVX-512 registers to store the byte distribution. You could use 16 of the 32 512-bit SIMD registers instead of relying on memory. With 256-bit AVX you would need 32 SIMD registers, but only 16 are available.
With the newer extensions like AVX it looks like there are so many registers that it makes sense to have some sort of addressing mode that treats them like RAM.
I mean, there are 2048 bytes of storage available there. It would be great if they could be accessed as 2048 bytes, 1024 words, 512 dwords, or 256 qwords. Something like this...
inc dword ptr AVX:[eax] ; low 9 bits of eax are used as index into the register set
I agree that this would occasionally be a very useful feature. Any idea if this would be architecturally feasible? Might it be similar to a modern GPU's "register file"? Maxwell (current generation NVidia) has 64KB of registers.
How did you come up with 2048 bytes for x64, though? I get 16 * 32B = 512B if I count logical registers, or 168 * 32B = 5376B if I count physical (for Haswell).
Architecturally, it would merely be a matter of taking the register selection bits not from the instruction itself but from some other source, like another register; there will certainly be a bit more complexity due to dependencies etc.
2048 bytes since AVX-512 has 32 registers, each 512 bits (64 bytes) wide.
> Architecturally, it would merely be a matter of taking the register selection bits
Except these do not exist for individual SIMD lanes; that's why you have permutation instructions. Treating individual lanes as registers would increase the needed number of addressing bits, which I'm sure would complicate stuff much more than "just" changing the possible source.
The debate is essentially whether AVX-512 is making a mistake by giving up 8-bit and 16-bit operations. Agner feels it's a mistake, but other smart people argue that it's not a problem.
There are reasonable branch free ways to increment one of the low 16 bytes of an XMM register (shift and add), and slower ways to increment any of the low 32 bytes in a single YMM register (permute and add). Interestingly, a lookup table that loads the appropriate addend vector is surprisingly fast as well. But there is no branch free way to select which of several AVX/AVX2 registers to work with.
But I'd be happy to be wrong -- maybe there is a way to address them using the older stack oriented FPU instructions?
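For the lookup-table variant mentioned above, here's a small sketch of what I mean with SSE2 intrinsics, assuming 16 one-byte buckets held in a single XMM register (the table name and helpers are mine):

#include <emmintrin.h>
#include <stdint.h>

/* addend_table[b] is all zeros except a 1 in byte lane b % 16 */
static uint8_t addend_table[256][16];

static void init_addends(void) {
    for (int b = 0; b < 256; b++)
        addend_table[b][b % 16] = 1;
}

/* branch-free: load the addend vector for byte b and add it lane-wise */
static inline __m128i bump(__m128i counters, uint8_t b) {
    __m128i addend = _mm_loadu_si128((const __m128i *)addend_table[b]);
    return _mm_add_epi8(counters, addend);
}

Selecting which of several such registers to bump is exactly the part this doesn't solve.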
If one is willing to write self-modifying code, then it is possible to overwrite the VEX.vvvv field of the instruction encoding with the register select, so for example to select XMM0/YMM0 you'd have VEX.1111 (one's complement), or for XMM15/YMM15, VEX.0000.
My first thought was that this was an interesting approach, but that there was no speed advantage because of the cache flush and pipeline reset. But maybe it could be made to work...
What if instead of overwriting an existing instruction, you generated the instructions you needed and appended them to a new writable and executable buffer. Then every 100/1000/1000000 characters, you terminate it with a 'ret' and 'call' into it. It branchlessly updates the set of XMM/YMM/ZMM registers you are using without touching memory. Repeat until you've read to the end of your input, then store the vector register results by writing their contents out to RAM in a known order. Perhaps you cycle between a few buffers so that the address you are calling into is never cached.
Overkill, but it seems like this might actually be pretty fast, presuming you can generate the instructions fast enough. I've considered this technique before for fast integer decompression when trying to avoid branch errors, but never so far as to actually test it.
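To show the mechanics only (not the AVX-updating payload, which would need real instruction encoding), here's a minimal sketch on POSIX/x86-64: map a writable and executable buffer, append raw instruction bytes, terminate with RET, and call into it. Some systems enforce W^X and will refuse the mapping.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* trivial generated payload: mov eax, 42 ; ret */
    const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    memcpy(buf, code, sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());   /* prints 42 */
    return 0;
}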
Unfortunately although SMC could be shorter, it's even slower since the pipeline gets flushed every time a write occurs to locations within the current fetch window (not sure exactly how long that is, but it's not small.)
Yes, and the instruction cache becomes stale as well. I guess one way to avoid this is to have 16 code blocks back-to-back and then do a jmp into the section that contains the right register. JMPs are pretty cheap, and the end point is likely to be in cache anyway.
I am not an assembly programmer, and there is probably a better way, but here is something neat that thinking about this led to:
If instead of 16 512-bit registers we simply want to increment the appropriate section of one big register, we can add 2^(byte * n), where byte is the value of the byte in question and n is the number of bits we're using to count the results.
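In plain C, for one 64-bit "register" holding eight 8-bit counters (so n = 8), that looks like the following; packed, lane, and bump_lane are just illustrative names:

#include <stdint.h>

/* bump the counter for lane 0..7 by adding 2^(lane * 8); no handling of a lane
   overflowing into its neighbour, so each count must stay below 256 */
uint64_t bump_lane(uint64_t packed, unsigned lane) {
    return packed + ((uint64_t)1 << (lane * 8));
}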
Actually, you can add one 512-bit register to another 512-bit register in a single instruction. You won't get a carry from one segment into the next, but nialo's general idea can be implemented.
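With AVX-512BW intrinsics (assuming they're available) the segment-wise add is a single instruction, and the lane-wise form is why no carry crosses from one segment into the next:

#include <immintrin.h>

/* counters: 64 one-byte buckets in one zmm register; addend has a 1 in the lane to bump */
__m512i bump(__m512i counters, __m512i addend) {
    return _mm512_add_epi8(counters, addend);
}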