Dynamic bit shuffle using AVX-512 (lemire.me)
90 points by ibobev on June 30, 2023 | 52 comments



I have this weird architecture I'm emulating that I call a bit-grid.[1] The basic idea is an FPGA with NO routing fabric at all... just a 2D grid of look-up tables that take a single bit of input from each neighbor and output a single bit to each, which means each look-up table holds 64 bits of "program" (4 output bits × 16 possible input states). Clocking is in two phases (like the colors of a chess board) to prevent all manner of grief and make it easier to think about. Each phase is NOT Turing complete, but the grid is.
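
Roughly, a single cell update looks like this in C++ (the bit layout of the 64-bit program word here is just one possible convention, an assumption on my part):

  #include <cstdint>

  // One bitgrid cell: the 4 neighbor input bits (N,E,S,W) form a state 0..15;
  // the 64-bit program stores a 4-bit output nibble (one bit per neighbor)
  // for each of the 16 states.  Assumed layout: nibble for state s at bits
  // [4*s, 4*s+3].
  inline uint8_t cell_step(uint64_t program, unsigned n, unsigned e, unsigned s, unsigned w) {
      unsigned state = (n & 1) | ((e & 1) << 1) | ((s & 1) << 2) | ((w & 1) << 3);
      return (program >> (state * 4)) & 0xF;   // 4 output bits, one per neighbor
  }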

So, I can emulate this thing on my desktop PC at about 28 nSec/cell using some Pascal code[2]. I'm thinking that if I upgrade to a machine with AVX512 instructions, it might get radically faster. What I can't figure out is how this instruction actually works, and what gains I would actually get.

The Intel documentation on this instruction is as clear as mud. There's no example with all the bits shown and worked through, leading the reader to have no ledge on which to make some intellectual purchase towards understanding.

Questions:

  If I were to fork over the cash for a machine with AVX512 instructions, how many of these instructions can it actually execute per second?

  Wouldn't moving a bunch of values to/from memory basically empty out all the caches and make this thing really slow anyway?

  Does anyone have a worked out example with bits shown for all the sources and destinations before/after the instruction, so I can see what it does?


[1] https://esolangs.org/wiki/Bitgrid

[2] https://github.com/mikewarot/Bitgrid


You might already be able to get good acceleration with SSSE3 or AVX2 or NEON, which also have a 4-bit-input permutation instruction. The problem is that you're doing parallel lookups into many different tables, whereas NEON/SSSE3's lookups are 16x in parallel into the same table (and AVX2 is two copies of the SSSE3 one in parallel, I think). So it's not as useful unless you're simulating the same grid on several different inputs for bulk testing. It might still be faster than scalar, but I'm not sure.
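
For concreteness, the SSSE3 building block is roughly this (a sketch, not adapted to the bitgrid): 16 lookups at once, but all into one shared 16-entry table.

  #include <tmmintrin.h>  // SSSE3

  // PSHUFB: each of the 16 input bytes uses its low 4 bits to index a single
  // shared 16-entry byte table (bytes with the high bit set produce zero).
  // This is the "same table, many inputs" shape described above.
  __m128i lut16(__m128i table, __m128i nibble_indices) {
      return _mm_shuffle_epi8(table, nibble_indices);
  }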


You could rent a VPS with AVX512 support and tinker around with it. I believe GCP has a few instance types that support it, AWS surely has some too.


That's a great idea! It would only cost a few bucks to find out.

The interesting idea about the bitgrid is that you can spread the emulation across cores, as none of the cells are Turing complete. Do all of phase A, then all of phase B, repeat.


It's quite variable even within instance types. I've emailed some VPS providers asking, and the response was always that it's not guaranteed; you'll have to test and see.

I'm guessing there are some out there, but they're not the providers I have accounts with.


About micro-optimizations: I don't know AVX512, but with AVX2 you could compute 4 nodes in parallel. An example in C++, untested: https://gist.github.com/Const-me/90a52f291c1fcb06142307facdb...

Here's another idea for how to optimize. Instead of a single 2D array, I would rework the memory layout. Specifically, make six 2D arrays: two with uint64_t values for the even/odd cells' lookup tables, two with bits for even/odd cells in the old grid state, and two with bits for even/odd cells in the new grid state. This improves RAM access patterns because either half of the cycle loads/stores half as many cache lines. After the complete cycle, swap the old state with the new state.

Gathering inputs from neighbors and scattering outputs to them could be a bit tricky this way, but you can simplify if you can limit the grid size to an even number in both the X and Y directions. At least the even/odd halves will be of equal size this way.
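
Roughly something like this (field names are mine, and the packed-bit state representation is an assumption):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Six arrays as described above; W and H assumed even so the
  // even/odd halves are the same size.
  struct Grid {
      std::size_t half;                          // (W * H) / 2 cells per color
      std::vector<uint64_t> lutEven, lutOdd;     // 64-bit program per cell
      std::vector<uint64_t> stateOldEven, stateOldOdd;  // packed output bits
      std::vector<uint64_t> stateNewEven, stateNewOdd;
  };
  // After a full cycle:
  //   std::swap(g.stateOldEven, g.stateNewEven);
  //   std::swap(g.stateOldOdd,  g.stateNewOdd);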


Disclaimer: I'm not an expert.

For your first question, the instructions are documented on the Intel website[0]. Many instructions have Latency and Throughput figures, which indicate how many cycles they take to execute. These are not straightforward to interpret due to instruction pipelining.

As for cache exhaustion, that depends on how quickly you consume more data from memory. It's worth noting that the registers are an entire cache line in width, and that Latency figures are given for the instructions that load from memory into the AVX registers.
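
For the third question: I believe the instruction the article builds on is VPSHUFBITQMB (AVX512_BITALG), exposed as _mm512_bitshuffle_epi64_mask. Each control byte's low 6 bits pick one bit out of the corresponding 64-bit lane, and the picked bits land in a 64-bit mask register, so broadcasting the input word to all 8 lanes gives an arbitrary 64-bit shuffle. A rough, untested sketch:

  #include <immintrin.h>
  #include <stdint.h>

  // Permute the bits of 'word': output bit i = input bit idx[i] (idx[i] < 64).
  // Compile with -mavx512bitalg.
  uint64_t bit_shuffle(uint64_t word, const uint8_t idx[64]) {
      __m512i data = _mm512_set1_epi64((long long)word);     // same word in all 8 lanes
      __m512i ctrl = _mm512_loadu_si512((const void *)idx);  // one selector byte per output bit
      return _mm512_bitshuffle_epi64_mask(data, ctrl);       // 64-bit mask = shuffled bits
  }

  // Tiny worked example: word = 0b0110 (bits 1 and 2 set).
  // With idx = {2, 1, 0, 3, 4, ..., 63} the low nibble of the result is 0b0011:
  // output bit 0 takes input bit 2 (=1), output bit 1 takes input bit 1 (=1),
  // output bit 2 takes input bit 0 (=0), output bit 3 takes input bit 3 (=0).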

[0] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...


> Wouldn't moving a bunch of values to/from memory basically empty out all the caches and make this thing really slow anyway?

It's probably the same issues as with neural networks: they have to work on a whole batch of inputs to be efficient. They load only a few weights at a time into registers, then use those on the whole batch before loading the next weights.

So my guess would be: If you only run a single grid, and the grid is large, you may end up memory-bound and you can afford running inefficient instructions.

(Btw. sounds like bitgrid would map 1:1 into the LUTs of an FPGA. Might be a fun project to generate VHDL for it and then let it actually run in parallel. The less straight-forward part being the logic to get the data out of the FPGA again.)


Does the shape of the grid matter? If not, I think you'd benefit from constraining the width such that the data for three rows fits within L1, so that iterating over one row leaves the data from the start of the next row in L1 by the time you get to it.


Edit:

Actually... what if you broke the grid up into N-by-N blocks, and simulated 2N half time steps (each decreasing the side length by 2) of that block? You'd need to do a similar sort of chess-board pattern over the blocks, and emit "partial outputs" from each block, then run a pass going the reverse direction, consuming the partial outputs and initial inputs. This way, you can fit a block entirely within a core's L1 for N time steps, and naturally get parallelism by simulating multiple blocks at once.

You should be able to simulate N time steps like this with just 2 loads of the data (program and state) instead of 2 * N.


I sometimes wonder what CPU designers are thinking when they build such weird instructions. They must have some programs in mind that could run faster with them, or else the additional transistors are just dead weight. But then compilers and language runtimes might or might not use those instructions. Add to that the fact that modern CPUs are basically their own compilers (going from "machine code" to microcode) and you have weirdness atop more weirdness. But perhaps this is just for business's sake: adding more instruction sets to provide a barrier to the competition, because programs using them run faster but are no longer portable to competitors' CPUs.


I once thought "bit shuffling" was a special-purpose operation you'd only ever need to deinterleave RGBA channels or something. Later I wanted to implement a small lookup table (surely something more common) and realized that it's just a different name for the same operation. (I think "bit shuffling" instructions have evolved to be generic enough that they're just programmable LUTs now.)


Yeah, those vector permute instructions are super useful for both patterns. There are dedicated instructions for some specific permutations (shifting over by a constant number of bytes, and some interleavings) but you can easily end up needing the general case. And of course parallel LUTs are also very useful. Depending on what you're doing you could easily end up with both in the same algorithm.


Good for software defined radio too because real and imaginary are interleaved.


> because real and imaginary are interleaved.

And so it is in life


Yeah, I often talk about the "phase" of a schedule being close to 90. Managers are thinking % but I'm thinking degrees.


Took me a while to comprehend that. NOYCE!


I would have never thought to phrase bit-shuffling as deinterleaving, which makes so much more sense to my not-professionally-computer-related eyes and mind.

Realistically though, how likely would GCC/Clang be to emit these instructions when I'm working on some lookup tables, assuming I permit it to use them (e.g. via `-march=native` on a machine that supports the extension)? My gut feeling is that unless I specifically structure my code to be as close to the semantics of the instructions as possible, these instructions would never be emitted. Or has the world of compiler optimizers advanced enough that this kind of rewriting is commonplace now?


This may surprise you but large CPU buyers go to Intel and ask for ISA extensions, and Intel implements them. This is how BMI came to exist. The fact that you do not see these instructions exploited by your vanilla Debian box says nothing about warehouse-scale datacenter operators.


Your vanilla Debian box uses BMI2 whenever you do a string comparison, unless you are on a decade old CPU [^0].

The "strange" instructions are actually not that niche, it's just that usage tends to be "indirect" and therefore people don't notice.

[^0] E.g. https://xoranth.net/memcmp-avx2


Yes. AVX-512 support is also implemented by SIMD-JSON. JSON string parsing, definitely not a thing anybody does. /s

Also super great for emulation, and anyone else who does a lot of bulk bit-twiddling.

The whole discourse has become super weird (up to and including Linux himself ;) because of Intel 10nm delays. With the only AVX-512 products being 14nm-based intel server chips for 5 years, and then only coming to laptop for another couple years, and then only a single terrible generation of desktop parts that nobody bought, and with AMD launching super competitive (usually leading) products in those segments, obviously there wasn't a whole lot of real consistent adoption in software. And what adoption there was, was complicated by the fact that the largest adopter (server market) had to drop clocks massively and even pause processing to allow voltage to swing up enough, because they were 14nm products on a feature that was really aimed at 10nm and beyond. And then Intel yanked it out of all the desktop and laptop chips and seems poised to just ignore it for another 5 years.

Everyone just decided that because it wasn't getting adopted that it was inherently useless, up to and including Linus himself. But it wasn't getting adopted because it was a complete mess on the Intel side and AMD didn't even support it, so why bother?

The AVX-512 story is inextricably bound up in the 10nm delays and the organizational problems that have plagued Intel ever since. It's such a great thing that AMD didn't buy into the naysaying.


Famously, popcount was a common request from intelligence agencies, at least as far back as the DEC Alpha.


In Intel's case they were thinking about turning the CPU into a GPU, but failed at that several times, AVX is what is left from Larrabee.


AVX and AVX2 are pretty awful because of lane-crossing limitations, but AVX512 is actually really nice and feels like a real programmer designed it rather than an electrical engineer.


FWIW, Michael Abrash [1] was at Intel when Larrabee (the AVX512 predecessor) was being developed and apparently [2] he contributed to the ISA design.

[1] https://en.wikipedia.org/wiki/Michael_Abrash [2] https://www.anandtech.com/show/2580/9


Yeah — my favorite instructions he added were `fmad233` and `faddsets`; the former instruction essentially bootstraps the line-equation for the mask-generation for rasterization, and the latter lets you 'step' the intersection. You could plumb the valid mask through and get the logical intersection "for free". This let us compute the covering mask in 9 + N instructions for N+1 4x4 tiles. We optimized tile load-store to work in 16x16 chunks, so valid mask generation came to just 24 cycles. It was my argument that using Boustrophedon order and just blasting the tile (rather than quad-tree descent like he designed) is what convinced him to let me work with RAD & do the non-polygon path for LRB.


This is not just in your head.

Most Intel ISA extensions come from either customers asking for specific instructions, or from Intel engineers (from the hardware side) proposing reasonable extensions to what already exists.

LRBni, which eventually morphed into AVX-512, was developed by a team mostly consisting of programmers without long ties to Intel hw side, as a greenfield project to make an entirely new vector ISA that should be good from the standpoint of a programmer. I strongly feel that they have succeeded, and AVX-512 is transformative when compared to all previous Intel vector extensions.

The downside is that as they had much less input and restraint from the hw side, it's kind of expensive to implement, especially in small cores. Which directly led to its current market position.


See also: https://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4... which describes some of that history.


I think back to the PlayStation 2 and 3, or the N64, which all took years for developers to fully utilize the capabilities of the hardware. Yet the hardware engineers must have known how to do it long before the software side totally figured it out. After years of SW development it's still just magic sand to me.


It's also difficult for both SW teams' and HW teams' visions to converge, even under the same company, such that the product can be put to use to maximize performance and programmability WRT another, already-established product.

Different constraints and challenges on both sides of the aisle give rise to compromises which end up with lowered performance or lowered ease of use. This is one area where great authority over the entire stack lends you lots of leeway, e.g. Apple designing the Metal API and the HW for it.


I don't know. If the HW engineers knew something, wouldn't Sony or Nintendo have had them lend a hand on 1st-party titles?


I can tell you that for 3rd party titles back in the PS2/N64 days, the HW engineers handed you a spec manual explaining such useful things as "Bit 7 at address 0x70002048 toggle RFTAG mode on the MDEC". This was great when I went to write a VU emulator because no VU debugger was available. But, not so great when trying to figure out how to use the beast effectively. If you google "ps2 sdk docs" you can still find them after a while. If it's a doc with examples of how to do anything, it's from the software team at Sony Europe.

Sony Japan's documentation for how to use a mouse & keyboard on the PS2 was literally just the URL "https://www.usb.org/document-library/usb-20-specification". Eventually, they provided a binary-only keyboard library that everyone complained was buggy, but actually just had documentation that was so brief it was easily misunderstood. After black-box testing it for an hour it was clear it worked fine, just not how anyone would expect it to.

Many years ago I made a tiny stir online by writing a stream-of-consciousness report of the experience of dealing with stuff like this for a decade. https://venturebeat.com/games/what-is-making-games-like-for-...


Once you have these tools at your disposal and can see them, it's actually hard not to see them and find places to use them. String processing, bitset indices in data structures, etc. all have places where they naturally fit.

I have a data structure library (in Rust) where I would love to have these. The problem is that AVX-512 just isn't common enough to rely on it yet, and I don't even have it on my workstation CPU (Radeon 6850, from just last year).

As to whether they had something particular in mind, I suspect Intel was thinking about video codecs and containers for a lot of these. If you read through the specs for them, you will find all sorts of places which call for things like this.

But yes, whether compiler developers can make good use of these is questionable. They are really for specialized optimization workflows.


I think the "shuffle" instructions are a generalization of "s-box", from various cryptographic algorithms?

> They must have some programs in mind that could be run faster with them

Yeah, all new instructions are built with some workload in mind. This may or may not be specified in the architecture manual or you have to reverse-engineer it from the press releases.


The vector permute was a highlight of the PowerPC vector extensions (AltiVec) in the late 1990s. It made a real difference for many media processing algorithms where the data is often packed and you need to pick apart the components for vector execution.

I’ve got the impression that AVX-512 is finally a good and comprehensive vector ISA from Intel, not just a tolerable one. Sadly it seems to be so comprehensive that the x86-64 vendors can’t afford the die area to ship it widely, or they treat it as an enterprise feature.


At least AMD kept AVX-512 in their small Zen 4c cores and just sacrificed some cache instead. I have to wonder if it was Intel marketing that killed off working AVX-512 in consumer P-cores after they were released, because the E-cores just become dead weight with AVX-512 enabled.


Nobody really knows what is going on with AVX-512 inside Intel.

Linus has commented that it's trivial to trap the first time an AVX instruction is used and pin it to a P-core (they used to do this as an optimization to avoid saving/restoring AVX registers in non-AVX code) and he doesn't know why that patch isn't on his desk.

The other problem is presumably that CPUID depends on what core it's executed on but... it seems straightforward for code to just run a CPUID on every single core (using affinity) and analyze the results. OK, 16 AVX-512 threads and 8 AVX2 threads, that's fine! It is obviously not the way code is currently written but code isn't written for AVX-512 right now anyway, and it should be literally an hour of work for a C dev.
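
Something like this would be the rough shape of it on Linux (a sketch; it just pins to each core in turn and checks the AVX-512F foundation bit from CPUID leaf 7):

  #define _GNU_SOURCE
  #include <cpuid.h>
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
      for (long i = 0; i < ncpu; i++) {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET((int)i, &set);
          if (sched_setaffinity(0, sizeof(set), &set) != 0)
              continue;                                   // couldn't pin to this core
          unsigned a, b, c, d;
          __get_cpuid_count(7, 0, &a, &b, &c, &d);
          printf("cpu %ld: avx512f=%u\n", i, (b >> 16) & 1u);  // CPUID.(7,0):EBX[16]
      }
      return 0;
  }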

I guess maybe they just didn't want the long tail of support but with their future architectures being heterogeneous too, they don't seem to have any plan either, and they're still shipping the AVX-512 units in silicon, meaning they are paying millions of dollars for a feature that isn't enabled. Very, very weird.

Fun fact: Alder Lake can actually be run with AVX-512 enabled even with E-cores active. There is an undocumented MSR flag that seems to allow this. It has to be a stepping/BIOS revision from before it was locked out, though.


> he doesn't know why that patch isn't on his desk

Because then anytime a program links to a vector-supporting library such as glibc, there’s a high chance they’ll be pinned to a P-core when an E-core and AVX2 would’ve sufficed.


I really love playing with vector instructions, especially AVX2, which is the most limited in my opinion. You get a whole bunch of oddly shaped legos, and it’s very challenging to build something beautiful from them. But also very fun and rewarding.



By coincidence I just started playing with AVX-512 this week. It's a lot of fun, but I have to say there is a dearth of resources on how to use it. For example I would love a simple visualisation for each operation as to how the various vectors and/or memory are transformed. Can anyone recommend some good books/learning resources?



Amazing, thanks!


Short intro with some visuals http://const.me/articles/simd/simd.pdf


Answering my own question a little: on further exploration, the Intel optimisation manual actually has a good chapter on AVX-512: https://www.intel.com/content/www/us/en/content-details/6714...


There are only a handful of instructions that do interesting things beyond parallel versions of basic arithmetic and bitwise operations. https://branchfree.org/2019/05/29/why-ice-lake-is-important-... provides a good overview of them.


Not many people know this, but Alder Lake's implementation of AVX-512 supports the FP16 format, and the FP16 version of AVX-512 is so fast that it almost touches GPU performance.


> Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word.

What's the application of this? There's a link to a Twitter article in the next sentence, but it looks like I need to sign up to see it.


Static reorderings can be used for some data layouts, eg interleaving the bits of an x and y coordinate, but you should hope they’d be cheaper than dynamic reordering.
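
For example, the x/y bit interleave (a Morton code) can be done with two BMI2 PDEP instructions, no general shuffle needed (a sketch; note that PDEP is microcoded and slow on AMD chips before Zen 3):

  #include <immintrin.h>
  #include <cstdint>

  // Interleave the bits of x and y: x goes to the even bit positions,
  // y to the odd ones.  Requires BMI2 (compile with -mbmi2).
  uint64_t morton64(uint32_t x, uint32_t y) {
      return _pdep_u64(x, 0x5555555555555555ull)
           | _pdep_u64(y, 0xAAAAAAAAAAAAAAAAull);
  }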


Cryptography


This goes way back at least to the late 80's when the Cray C-90 introduced its "bit matrix multiplication" instructions.


I knew about the instruction that Daniel Lemire missed too. It seems useless to me.




