
You're mostly right. It's often not that hard to completely saturate the memory bus.

However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.

Also, the data you need to process is often already hot in the CPU caches. Disk/network I/O can be DMA'd directly into the L3 cache.

So I think this decoder is a useful contribution, though it's also somewhat of a no-brainer once you have those AVX-512 instructions available.




> However, base64 has weird mappings that take some processing to undo in SIMD – can't use lookup tables and need to regroup bits from 8 bit to 6 bit width. That does take a lot of cycles without specialized bit manipulation instructions.

Base64 uses lookup tables, and the bit manipulations required are standard shifts and ANDs, which are basic, fast instructions on any CPU. That seems to be exactly what they do here, with an efficient use of AVX-512 to make it fast(er).
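
To make the "shifts and ANDs" part concrete, here is a minimal scalar sketch in Python of the 6-bit/8-bit regrouping. The function names `pack_sextets` and `unpack_bytes` are made up for illustration, and this is not the paper's vectorized code, only the same arithmetic done one group at a time.

```python
def pack_sextets(sextets):
    """Decode direction: pack groups of four 6-bit values into three bytes."""
    if len(sextets) % 4 != 0:
        raise ValueError("expected a multiple of four 6-bit values")
    out = bytearray()
    for i in range(0, len(sextets), 4):
        a, b, c, d = sextets[i:i + 4]
        # Merge 4 x 6 bits into one 24-bit word, most significant group first.
        word = (a << 18) | (b << 12) | (c << 6) | d
        out.append((word >> 16) & 0xFF)
        out.append((word >> 8) & 0xFF)
        out.append(word & 0xFF)
    return bytes(out)


def unpack_bytes(data):
    """Encode direction: split groups of three bytes into four 6-bit values."""
    if len(data) % 3 != 0:
        raise ValueError("expected a multiple of three bytes")
    out = []
    for i in range(0, len(data), 3):
        word = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2]
        out.extend([
            (word >> 18) & 0x3F,
            (word >> 12) & 0x3F,
            (word >> 6) & 0x3F,
            word & 0x3F,
        ])
    return out
```

For example, the 6-bit values 19, 22, 5, 46 (the characters "TWFu") pack into the bytes of "Man". A SIMD version does the same regrouping on many groups at once, which is where the choice of instructions starts to matter.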


> SIMD – can't use lookup tables

This is doing exactly that with `vpermb` and `vpermi2b`.
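
Roughly, `vpermi2b` treats two 512-bit registers as one 128-entry byte table: in each byte lane, bit 6 of the index picks the register and bits 0-5 pick the entry. Below is a Python model of that lane behaviour used as a base64 decode table. The helper names and the `0x80` "invalid" marker are my own assumptions for the sketch, not necessarily what the paper does.

```python
import string

ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/").encode("ascii")

INVALID = 0x80  # assumed marker: high bit set means "not a base64 character"

# 128-entry table indexed by ASCII code, split into two 64-byte halves,
# standing in for the two 512-bit table registers.
_table = bytearray([INVALID] * 128)
for value, ch in enumerate(ALPHABET):
    _table[ch] = value
TABLE_LO, TABLE_HI = bytes(_table[:64]), bytes(_table[64:])


def vpermi2b_model(indices, table_lo, table_hi):
    """Per byte lane: bit 6 selects the table half, bits 0-5 select the entry.

    Bit 7 of the index is ignored, as in the real instruction.
    """
    return bytes(
        (table_hi if idx & 0x40 else table_lo)[idx & 0x3F]
        for idx in indices
    )


def decode_lanes(chunk):
    """Translate ASCII bytes to 6-bit values; report whether all were valid."""
    values = vpermi2b_model(chunk, TABLE_LO, TABLE_HI)
    # The lookup ignores bit 7 of the input, so OR the input back in to
    # catch non-ASCII bytes as well as table entries marked INVALID.
    ok = all(((v | c) & 0x80) == 0 for v, c in zip(values, chunk))
    return values, ok
```

The real instruction does this for 64 lanes at once; the loop here only models what a single lane computes.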


Yeah, I missed that you can now fit 64 8-bit entries in a single 512-bit register!

Generally, my point stands that LUTs are not vectorizable, except for small special cases like this.


Why can't you use lookup tables?


Part of the point of this paper is that the lookup tables for base64 fit into a 512-bit vector register.
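
The arithmetic is simple: the base64 alphabet has 64 one-byte entries, and 64 x 8 = 512 bits. A rough Python model of a `vpermb`-style lookup for the encoding direction follows. `vpermb_model` and `encode_lanes` are illustrative names only, and the loop stands in for what the hardware does across all lanes in one instruction.

```python
import string

# The standard base64 alphabet: 64 bytes, i.e. exactly 512 bits.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/").encode("ascii")
assert len(ALPHABET) * 8 == 512


def vpermb_model(indices, table):
    """Per byte lane: use the low 6 bits of the index to pick a table byte."""
    if len(table) != 64:
        raise ValueError("table must hold exactly 64 bytes")
    return bytes(table[idx & 0x3F] for idx in indices)


def encode_lanes(sextets):
    """Map 6-bit values to their ASCII base64 characters."""
    return vpermb_model(sextets, ALPHABET)
```

So `encode_lanes(bytes([19, 22, 5, 46]))` gives `b"TWFu"`. A table with more than 64 byte entries no longer fits in one register, which is the limitation raised elsewhere in this thread.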


Notably this technique only works in this special case (or other small LUTs). Generic (larger) lookup tables can't be vectorized yet.


Because table lookups generally don't vectorize. You could try a vectorized gather (VPGATHERDD [0]), but in my experience so far it's been just as slow as scalar code. (Then again, gather instructions only operate on 32-bit or 64-bit elements, so they wouldn't help with byte-sized lookups anyway.)

So to perform multiple operations in parallel per core (SIMD), you'll have to write some code to perform the "lookup" transformation instead.
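
As a sketch of what such a table-free "lookup" can look like for base64 decoding: each character class gets a range check and a fixed offset, which maps onto SIMD compares and adds. This is Python for readability, `ascii_to_sextet` is a made-up name, and the offsets simply follow from the standard alphabet's layout.

```python
def ascii_to_sextet(c):
    """Map one ASCII code to its 6-bit base64 value, or -1 if invalid.

    Uses range checks plus per-range offsets instead of a table, which is
    the shape of computation that vectorizes without a byte permute.
    """
    is_upper = int(65 <= c <= 90)    # 'A'..'Z' -> 0..25
    is_lower = int(97 <= c <= 122)   # 'a'..'z' -> 26..51
    is_digit = int(48 <= c <= 57)    # '0'..'9' -> 52..61
    is_plus = int(c == 43)           # '+'      -> 62
    is_slash = int(c == 47)          # '/'      -> 63

    # Exactly one flag is set for a valid character, so the sum selects
    # a single offset, like a masked blend would in SIMD code.
    offset = (is_upper * -65 + is_lower * -71 + is_digit * 4
              + is_plus * 19 + is_slash * 16)
    valid = is_upper | is_lower | is_digit | is_plus | is_slash
    return c + offset if valid else -1
```

For instance, `[ascii_to_sextet(c) for c in b"TWFu"]` yields `[19, 22, 5, 46]`. That is five compares and a few adds per lane, versus a single permute when the table fits in a register.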

[0]: https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd



