I wish the author had been more specific about the processor being tested. This is the sort of thing that is extremely dependent on the microarchitecture of the processor used. I wonder if he's testing on AMD or an older Intel?
Modern Intel chips usually do a very good job of "store-to-load forwarding" when the load is the same size or smaller than the store. Although it's testing a slightly different effect, this is a great recent article on the topic: http://blog.stuffedcow.net/2014/01/x86-memory-disambiguation...
How could packed addition help for this task? The project description on GitHub says benchmarking happens on an »Intel Core i5-3340M (oc'ed to 3.0GHz)«, so maybe the article is based on numbers from the same machine.
The space is there, but it's difficult to see how to make use of it for this problem. You'd need to be able to use an 8-bit input value to choose which position in which register to update. Incrementing the position (pos = input % 32) is possible, though not as straightforward as you might hope.
But choosing which of the vectors to update (vecX where X = input / 32) doesn't have a good answer that avoids unpredictable branching. There's a thread elsewhere on this page where we muse about the possibility of using self-modifying code, but otherwise I don't know of any answer.
During my morning boot (awakening) I was thinking about the self-modifying code solution. I don't think it is needed, and it would eat up lots of memory bandwidth. I think a 256-entry jump table is sufficient, which could be inlined, so:
R8 <- load 64 bits from RAM
JUMP A1
@A0
SHIFT R8 right by 8                   ; expose the next byte in the low 8 bits
@A1
R9 <- low 8 bits of R8                ; the byte value to count
PC += R9 * SIZEOF_AVX_INCR_ROUTINE    ; computed jump into the 256-entry table
Each AVX_INCR_ROUTINE would be the same size and have an absolute jump back to A0. There would need to be a DEC counter so the whole thing was only done 8 times. Also, memory reads could be interleaved during the count, or an AVX register could be used as the destination directly from memory. But it seems like a waste if there are already 64-bit registers sitting there.
If one used U16 counters, then every 65k elements read you'd need to sweep through and write the high buckets out to 32-bit counters to prevent overflow.
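A minimal sketch of that flush in plain C, assuming U16 buckets swept into U32 buckets; the names (counts16, counts32, flush16) and the 65,535-element chunk size are mine, chosen so no U16 bucket can wrap:

#include <stddef.h>
#include <stdint.h>

static uint16_t counts16[256];
static uint32_t counts32[256];

static void flush16(void) {
    for (int i = 0; i < 256; i++) {
        counts32[i] += counts16[i];   /* promote into the wide counters */
        counts16[i] = 0;
    }
}

void histogram(const uint8_t *buf, size_t len) {
    size_t done = 0;
    while (done < len) {
        /* at most 65535 bytes between flushes, so a U16 bucket cannot overflow */
        size_t chunk = (len - done < 65535) ? len - done : 65535;
        for (size_t i = 0; i < chunk; i++)
            counts16[buf[done + i]]++;
        flush16();
        done += chunk;
    }
}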
I don't yet have any solid evidence but I think using the AVX registers for histogram storage could enable histogramming at the read bandwidth of main memory.
I don't think any approach based around a jump table is going to perform well. The issue is that modern processors use speculative execution and have a deep pipeline that needs to be refilled whenever the wrong path is chosen. A jump table and a cmp/jcc are predicted in much the same way, but the jump table is usually harder to predict. Every time the prediction is wrong, you lose ~15 cycles. The 4-table approach described takes about 2 cycles per byte, so there isn't much room for error.
The whole idea of "the" program counter is almost gone. You can still set it to affect program behaviour, but in reality there is a "reorder buffer" filled with instructions from an "execution window" that are issued by the backend as soon as their dependencies are available, and the front end has a mostly independent "speculative pc" that is fetching and decoding instructions 10-100 cycles ahead of the current execution window. The guess as to which execution path will be taken at the jump table will probably be made even before the load is issued to get the address.
Per your earlier recommendation, I did read through "Inner Loops". It's a really good book, and offers a great overall approach to programming, but this is one of the places where the details of modern processors have changed so much that the older approach is counterproductive. To be competitive for this task, you probably need to be branch free, so that you have speculative execution working with you rather than against you.
I see your point. I am going to try to implement the approach I outlined above, as well as the code gen. Glad you checked out "Inner Loops"; I guess my takeaway from that book is: understand the machine, understand your code, and most of all do some science. The hardware used in the book is very dated, and I'd love to see a second edition using Skylake or newer.
Maybe ... instead of doing the JMP back to the top of the loop, you combine the two approaches into something like a direct-threaded interpreter. In each AVX_INCR_ROUTINE, modify the code by overwriting the next JMP address instead of jumping back to A0. The issue of consecutive duplicates arises again, though.
Or ask the processor for some L1 cache that never gets flushed to main memory.
Or use R8-R15 as an 8x8 array of U8 counters and do 4 passes ([0-63], [64-127], [128-191], [192-255]), or as a 4x8 array of U16 counters; one would have to test.
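A rough sketch of the 4-pass variant in plain C, with eight uint64_t accumulators standing in for R8-R15 (whether the compiler actually keeps them in registers is another matter); the branch-free masking and the absence of any U8 overflow handling are my own simplifications:

#include <stddef.h>
#include <stdint.h>

void histogram_4pass(const uint8_t *buf, size_t len, uint32_t out[256]) {
    for (int pass = 0; pass < 4; pass++) {
        uint64_t acc[8] = {0};            /* 8x8 array of U8 counters */
        unsigned base = pass * 64;        /* this pass counts values base..base+63 */
        for (size_t i = 0; i < len; i++) {
            unsigned v = (unsigned)buf[i] - base;      /* wraps huge for values below base */
            uint64_t in_pass = -(uint64_t)(v < 64);    /* all-ones mask if v belongs to this pass */
            acc[(v & 63) >> 3] += in_pass & ((uint64_t)1 << ((v & 7) * 8));
        }
        for (unsigned v = 0; v < 64; v++)
            out[base + v] += (uint32_t)((acc[v >> 3] >> ((v & 7) * 8)) & 0xff);
    }
}

A U8 lane that reaches 256 carries into its neighbour, so a real version would need to flush well before that, like the U16 sweep discussed above.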
Function    Best Rate MB/s   Avg time    Min time    Max time
Copy:              11569.0   0.145255    0.138301    0.154563
Scale:             11383.5   0.145140    0.140555    0.148503
Add:               12544.6   0.196920    0.191318    0.204782
Triad:             12464.8   0.198199    0.192542    0.207492
Thanks for this post! I've known about the massive effects of optimizing for cache lines but this is the first time I've heard of the write commit delay. Great example as well.
If I remember correctly, the Mill wouldn't have this issue, since all stores go through the cache hierarchy (by default), so they finish as soon as the data is in L1 (~3 cycles typically). It's then evicted down through the cache and into main memory as usual, while retaining consistent aliasing semantics automatically.
I may have misunderstood, but this is covered in more detail in the memory talk I believe: http://youtu.be/bjRDaaGlER8
I did a quick test on an i5 (Ivy Bridge) processor using C, and surprisingly, with non-random input the distributed version retained the speed it had on random input, while the single-table version became 2.5x slower.
So the problem is when the same byte occurs 2 or more times in rapid succession, right? Couldn't you use a temporary variable to count the number of consecutive occurrences of the same byte, and then once you hit a different byte, add the total consecutive count to the appropriate slot? That way you only write to the count table once for each run of identical bytes, and you never write to the same slot of the table twice in a row. Assuming that the compiler can fit all the necessary temporary variables into registers, wouldn't this eliminate the problem entirely? Or would this be so much extra work that the processor becomes the bottleneck instead of writes to memory?
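For what it's worth, here is a minimal sketch of that run-counting idea in plain C (the names are mine); the count table is written only once per run of identical bytes, never twice in a row to the same slot:

#include <stddef.h>
#include <stdint.h>

void histogram_runs(const uint8_t *buf, size_t len, uint64_t counts[256]) {
    if (len == 0) return;
    uint8_t cur = buf[0];
    uint64_t run = 1;
    for (size_t i = 1; i < len; i++) {
        if (buf[i] == cur) {
            run++;                 /* same byte again: no table write */
        } else {
            counts[cur] += run;    /* one write per run */
            cur = buf[i];
            run = 1;
        }
    }
    counts[cur] += run;            /* flush the final run */
}

Of course the per-byte compare is itself a branch, which ties into the prediction discussion elsewhere in the thread.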
I did a test on this as well. It approaches the speed of the distributed version only if there are a lot of consecutive bytes. The distributed version is fast in any case.
This code is generating tons of D-cache misses, at the cost of hundreds of cycles (in a modern x86 memory system), so you can afford a branch miss now and then.
I don't think this is correct. The input is sequential, so the hardware prefetcher should be very successful: the only misses will be 1/4096 for the first entry into a new 4KB page, and even this could be avoided with a single judicious software prefetch.
He's counting 8-bit bytes, and thus has 256 entries per table. Using 64-bit counters, this is 8 * 256 = 2048 contiguous bytes per table. 4 tables gets him up to 8KB. Modern L1D is 32KB.
Alternatively viewed, 1700MB/s is 1.7 billion table additions per second. The processor is running at about 3.5 GHz, which is 3.5 billion cycles per second. Thus his current algorithm takes ~2 cycles per byte. A branch prediction error costs ~15 cycles, and thus would be relatively very expensive.
You'd probably be right for the 16-bit U16 version, though.
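To make the "single judicious software prefetch" mentioned above concrete, a sketch in plain C with the GCC/Clang builtin (the 4KB lookahead is an illustrative choice, not a measured one):

#include <stddef.h>
#include <stdint.h>

void histogram_prefetch(const uint8_t *buf, size_t len, uint64_t counts[256]) {
    for (size_t i = 0; i < len; i++) {
        if ((i & 4095) == 0 && i + 4096 < len)
            __builtin_prefetch(buf + i + 4096);   /* touch the next page early */
        counts[buf[i]]++;
    }
}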
The U16 version can probably benefit from a test for the first 256 values, leaving the remainder of the range as vanilla counters. You might get a win at less than 128 on the first test, in fact.
If the sampling is deterministic, it sounds like you'd risk malicious input being able to cause pathological results? Not unlike the various hashmap DoS'es from a couple of years ago.
I wonder what would happen if you used AVX-512 registers to store the byte distribution. You could use 16 of the 32 512-bit SIMD registers instead of relying on memory. With 256-bit AVX you would need 32 SIMD registers, but only 16 are available.
With the newer extensions like AVX it looks like there are so many registers that it makes sense to have some sort of addressing mode that treats them like RAM.
I mean, there are 2048 bytes of storage available there. It would be great if they could be accessed as 2048 bytes, 1024 words, 512 dwords, or 256 qwords. Something like this...
inc dword ptr AVX:[eax] ; low 9 bits of eax are used as index into the register set
I agree that this would occasionally be a very useful feature. Any idea if this would be architecturally feasible? Might it be similar to a modern GPU's "register file"? Maxwell (current generation NVidia) has 64KB of registers.
How did you come up with 2048 bytes for x64, though? I get 16 * 32B = 512B if I count logical registers, or 168 * 32B = 5376B if I count physical (for Haswell).
Architecturally, it would merely be a matter of taking the register selection bits not from the instruction itself but from some other source, like another register; there will certainly be a bit more complexity due to dependencies etc.
2048 bytes since AVX-512 has 32 registers, each 512 bits (64 bytes) wide.
> Architecturally, it would merely be a matter of taking the register selection bits
Except these do not exist for individual SIMD lanes; that's why you have permutation instructions. Treating individual lanes as registers would increase the needed number of addressing bits, which I'm sure would complicate stuff much more than "just" changing the possible source.
The debate is essentially whether AVX-512 is making a mistake by giving up 8-bit and 16-bit operations. Agner feels it's a mistake, but other smart people argue that it's not a problem.
There are reasonable branch free ways to increment one of the low 16 bytes of an XMM register (shift and add), and slower ways to increment any of the low 32 bytes in a single YMM register (permute and add). Interestingly, a lookup table that loads the appropriate addend vector is surprisingly fast as well. But there is no branch free way to select which of several AVX/AVX2 registers to work with.
But I'd be happy to be wrong -- maybe there is a way to address them using the older stack oriented FPU instructions?
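For the lookup-table variant mentioned above, here's a small sketch of what I mean with SSE2 intrinsics, assuming 16 one-byte buckets held in a single XMM register (the table name and helpers are mine):

#include <emmintrin.h>
#include <stdint.h>

/* addend_table[b] is all zeros except a 1 in byte lane b % 16 */
static uint8_t addend_table[256][16];

static void init_addends(void) {
    for (int b = 0; b < 256; b++)
        addend_table[b][b % 16] = 1;
}

/* branch-free: load the addend vector for byte b and add it lane-wise */
static inline __m128i bump(__m128i counters, uint8_t b) {
    __m128i addend = _mm_loadu_si128((const __m128i *)addend_table[b]);
    return _mm_add_epi8(counters, addend);
}

Selecting which of several such registers to bump is exactly the part this doesn't solve.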
If one is willing to write self-modifying code, then it is possible to overwrite the VEX.vvvv field of the instruction encoding with the register select, so for example to select XMM0/YMM0 you'd have VEX.1111 (one's complement), or for XMM15/YMM15, VEX.0000.
My first thought was that this was an interesting approach, but that there was no speed advantage because of the cache flush and pipeline reset. But maybe it could be made to work...
What if instead of overwriting an existing instruction, you generated the instructions you needed and appended them to a new writable and executable buffer. Then every 100/1000/1000000 characters, you terminate it with a 'ret' and 'call' into it. It branchlessly updates the set of XMM/YMM/ZMM registers you are using without touching memory. Repeat until you've read to the end of your input, then store the vector register results by writing their contents out to RAM in a known order. Perhaps you cycle between a few buffers so that the address you are calling into is never cached.
Overkill, but it seems like this might actually be pretty fast, presuming you can generate the instructions fast enough. I've considered this technique before for fast integer decompression when trying to avoid branch errors, but never so far as to actually test it.
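To show the mechanics only (not the AVX-updating payload, which would need real instruction encoding), here's a minimal sketch on POSIX/x86-64: map a writable and executable buffer, append raw instruction bytes, terminate with RET, and call into it. Some systems enforce W^X and will refuse the mapping.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    uint8_t *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* trivial generated payload: mov eax, 42 ; ret */
    const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    memcpy(buf, code, sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    printf("%d\n", fn());   /* prints 42 */
    return 0;
}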
Unfortunately although SMC could be shorter, it's even slower since the pipeline gets flushed every time a write occurs to locations within the current fetch window (not sure exactly how long that is, but it's not small.)
Yes, and the instruction cache becomes stale as well. I guess one way to avoid this is to have 16 code blocks back-to-back and then do a jmp into the section that contains the right register. JMPs are pretty cheap, and the end point is likely to be in cache anyway.
I am not an assembly programmer, and there is probably a better way, but here is something neat that thinking about this led to:
If instead of 16 512-bit registers we simply want to increment the appropriate section of one big register, we can add 2^(byte * n), where byte is the value of the byte in question and n is the number of bits we're using to count the results.
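In plain C, for one 64-bit "register" holding eight 8-bit counters (so n = 8), that looks like the following; packed, lane, and bump_lane are just illustrative names:

#include <stdint.h>

/* bump the counter for lane 0..7 by adding 2^(lane * 8); no handling of a lane
   overflowing into its neighbour, so each count must stay below 256 */
uint64_t bump_lane(uint64_t packed, unsigned lane) {
    return packed + ((uint64_t)1 << (lane * 8));
}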
Actually, you can add one 512-bit register to another 512-bit register in a single instruction. You won't get a carry from one segment into the next, but nialo's general idea can be implemented.
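With AVX-512BW intrinsics (assuming they're available) the segment-wise add is a single instruction, and the lane-wise form is why no carry crosses from one segment into the next:

#include <immintrin.h>

/* counters: 64 one-byte buckets in one zmm register; addend has a 1 in the lane to bump */
__m512i bump(__m512i counters, __m512i addend) {
    return _mm512_add_epi8(counters, addend);
}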