Hacker News new | past | comments | ask | show | jobs | submit login
tolower() with AVX-512 (dotat.at)
275 points by fanf2 5 months ago | hide | past | favorite | 153 comments



Note that the “unsafe read beyond of death” trick is considered undefined behavior in the Rust and LLVM memory model, even if it’s allowed by the underlying hardware. Like any undefined behavior, compilers are allowed to assume it doesn’t happen for the purpose of optimization, leading to results you don’t expect. The only way around this is to use inline assembly.

https://github.com/ogxd/gxhash/issues/82


It would be neat to have non-assembly options for things like this. A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support, even if just as an alias for the respective assembly invocations.

Additional neatness would be being able to request a guarantee that all allocations - malloc, stack, constants - have at least, say, 64 bytes of non-faulting addresses after them, though that is significantly more complex, requiring cooperation between a bunch of parts.

Annoying thing is that this is trivial with a custom allocator (as long as the compiler isn't told to consider the custom sub-allocations as separate), but then you're stuck not being able to use your SIMD stuff on anything outside your custom heap due to the very tiny chance of segfaulting.

Sanitizers/valgrind don't necessarily become pointless with this even - the past-the-end values are still undefined, can be tracked as such, and error on use.


The sanctioned way to this would be masked aligned load intrinsics, alignment avoids page faults, masking avoids reading undef bits, being an intrinsic conveys the intent to the compiler so it'll know that this is not an OOB read.

The other option that I've seen discussed is adding a freezing load to LLVM that turns the undef bits into some unspecified but valid bit patterns.


> A "load with unspecified elements for any values past the end of the allocation, UB only if the hardware doesn't like it" thing shouldn't be hard to support

Not an expert, but to me this sounds like you want an alternative where behaviour for a read beyond the end of an allocation is merely implementation-defined, not undefined. That means the implementation (e.g. LLVM) has to document what they do — which may be platform-dependent — and the choice of whether it becomes undefined is up to the implementation.

The natural thing to do here for the implementation is of course to say "I'm just going to emit the load instruction, it may crash your program, better be prepared".


Here it'd be perfectly fine to define it as "completely arbitrary bits past the end, potentially even differing between back-to-back calls of loading the same memory"; specific backends will end up refining that of course. In LLVM those bytes would behave as freeze(poison).


Not every platform in existence will return data when asked to access stuff out of bounds, even when sufficiently aligned. So you wouldn't want to bake into the standard that valid bits must be returned; you'd want to allow crashing, in the standard. An implementation might then define that for suitably aligned addresses, data will be returned (just not necessarily sensible data).


It should still be with "UB only if the hardware doesn't like it", of course. If weird funky hardware not following usual memory paging is of worry, providing a "memory_protection_granularity" constant is trivial, to be used instead of the page size for the check (and said funky hardware could set it to 1, thus always failing).

Alternatively, a different API would be returning an optional of the loaded data, having the stdlib/language/backend convert that to the appropriate boundary check (or always returning a None if impossible).

Ideally there'd be languages that can be at least configured into providing more "unsafe" useful things, even if at the expense of not having the code be compilable targeting funky hardware that noone would run the software in question on anyway.


What about tools like ASAN? I want it to be able to tell me if I read a single character out of bounds. Tools like ASAN can't do this if the language gets rid of undefined behavior. The reason why undefined behavior is undefined is because it's such a degenerate state for a program to exist in that any attempt by a language to imbue it with a particular blessed meaning is, to put it politely, crazy; like trying to prove a theorem that's allowed to have some contradictions.


Indeed, if you an want immediate error on every out-of-bounds read, this won't be suitable. I do think one should always have the option to not opt into this. But there still exist use-cases where the benefit of being able to do partially-past-the-end loads would significantly outweigh this downside.

That said, clang's MemorySanitizer, and, similarly, valgrind, could still produce errors via tracking which bytes are undefined within registers; might be somewhat delayed between load and error, but still shouldn't allow such out-of-bound values to be used for much.

And, anyway, as this load would be a separate instruction/builtin (if so decided), UB of regular operations is unaffected. If the sanitizer in question doesn't track (partial) register definedness, it could just accept all of these explicitly-potentially-OoB loads; indeed not ideal, but the alternative is not being to write such performant code at all.

And there are already people doing this, just limited to doing so with data within a custom allocator. It would just be nice to have a mechanism to not be fully ruled out of using standard tooling at least for testing.


What's wrong with assembly? What's wrong with aligning a pointer and turning the sanitizer off if need be? If you're making machine specific assumptions then you should be programming against the machine rather than the language.


Assembly is a good fallback, but it requires that individual programmers special-case individual architectures (thus each likely ending up with an incomplete list, and definitely incomplete as soon as it stops being updated) instead of having a standard thing everyone can rely on and the compiler/sanitizers could actually understand, and which automatically extends to more architectures.

Not using sanitizers is, of course, an option, but, as should be obvious, is very much not optimal. Having a single "questionable" operation in one place does not mean that the programmer intends for the entire function or program to be all "programmed against the machine"; the rest of the code, and, to an extent, even the fancy operation in question, could still very much benefit from regular language tooling.

Such operations needn't even be machine-specific.

"try loading N bytes, accepting garbage out-of-bounds; return None if cannot be done without potentially faulting" says and requires nothing about the architecture, and is implementable everywhere (even if as always returning None).

Aligning pointers is another option, but is fundamentally in no way different from the memory-protection-boundary-based version in how much it relies on hardware specifics, and compiler/language builtins could still be made that allow for sanitizer-friendly usage. It might be more or less efficient depending on use-case, both are useful to have.

Of course, the best option would be that malloc, the linker, etc work together to guarantee at least N bytes of addressable memory past all user-accessible pointers, at which point architecture specifics completely stop mattering. This needn't change any behavior around sanitizers or regular loads; all it'd mean is that the "load N bytes with trailing garbage" operation can always succeed. Sanitizers could error on said op reading outside of the guaranteed readability size, and regular loads of course continue erroring on any out-of-bounds read. Compilers could even use this guarantee themselves to emit unmasked loads for loop tails.


Simplest solution and the one I use is all SIMD related buffers use a custom allocator(actually everything uses it) and it always rounds the allocation size up to the SIMD width.

Masked loads kinda suck, they are a tiny bit slower and you now need a mask and you need to compute the mask..


This is what I do too (in my case I don't round up the allocation size and just let loads & stores potentially see the next object (doing tail stores via load+blend+store where needed; only works if multithreaded heap mutation isn't required though)).

The one case it can be annoying is passing pointers to constant data to custom-heap-assuming functions - e.g. to get a pointer to [n,n-1,n-2,...,2,1,0] for, say, any n≤64, make a global of [64,63,...,2,1,0] and offset its pointer; but you end up needing to add padding to the global, and this materializes as avoidable binary size increase as the "padding" could just be other constants from anywhere else. Copying the constant to the custom heap would be extra startup time and more memory usage (not sharable between processes).


"UB only if the hardware doesn't like it" sounds like you want to shift the complexity from the developers who know the problem domain best to the packagers.

As soon as the thing is packaged to run on an raspberry or something else that doesn't like it, it will start to generate CVEs and be a major pain.


This shouldn't ever be a security vulnerability, outside of perhaps denial of service from segfaults (though I'm pretty sure you'd find hardware with no page faults before finding one with pages less than 4KB; and of course, if you wanted to not be hard-coding 4KB, a compiler providing a "minimum page size" constant for the target architecture should be possible, and could return 1 on page-less hardware). But, yes, as with many optimizations, getting them wrong could end up badly.


For the case of specific vector extensions that imply specific cache line sizes, and loads that do not span multiple cache lines, I don't think you could run into issues.


Is that even true at a hardware level? What if you read into an unmapped page or into protected memory? (I haven't read the code, maybe it has alignment guarantees that avoid this?)


You make sure you don't do that.

A trick to avoid reading beyond the end of the buffer is to make sure the end of the buffer lies on the same page. Typically, the OS will allocate memory in pages of 4KB, thus we can make a function that checks whether it is okay to read beyond or if we should fallback to the copy version.

-- https://ogxd.github.io/articles/unsafe-read-beyond-of-death/


That's not a guarantee. On some systems memory protection can be sub-page (not sure about x86).

But it sounds like the masking feature mentioned in a sibling comment takes care of it anyway.


Masking is nice, but not available everywhere (i.e. intel is still making new generations of CPUs without AVX-512, and apple silicon doesn't have any masked loads/stores either).

It might not be the nicest thing to assume to be the case on all hardware, but it shouldn't be too unreasonable to put it under an "if (arch_has_a_minimum_page_size)". So many things already assume at least 4KB pages, Intel/AMD aren't gonna break like half the world. If anything, they'd want to make larger pages to make larger L1 caches more feasible.


There's a debate on how unsafe/unsound this technique actually is. https://github.com/ogxd/gxhash/issues/82

I definitely see the conundrum since the dangerous code is such a huge performance gain.


The code uses unaligned load and store instructions, so it should be possible to trigger memory access to unmapped addresses.


Isn't the point of the "masked load" instruction discussed in the article to avoid that? https://stackoverflow.com/a/54530225


Unfortunately, AMD's masked AVX2 instructions reserve the right to fault even for masked-off elements :(


> Like any undefined behavior, compilers are allowed to assume it doesn’t happen for the purpose of optimization, leading to results you don’t expect

No. First, undefined behavior is a term of art in the C standard, so the idea of generalizing it is nonsensical. Second, ANSI C explicitly does not allow this assumption, and ISO C—while more open ended—doesn't specifically justify this assumption. The entire "UB = assume it cannot happen" thing is grossly dishonest FUD.


Both (unsafe) Rust and LLVM have their own concepts of undefined behavior (https://doc.rust-lang.org/reference/behavior-considered-unde..., https://llvm.org/docs/LangRef.html#undefined-values), and while some of the details vary by language, compilers for all of these languages do in fact optimize based on the assumption that undefined behavior is not invoked in the abstract execution model. This is a real thing (https://blog.llvm.org/2011/05/what-every-c-programmer-should..., https://highassurance.rs/chp3/undef.html), and any debate about whether it’s “justified” is many decades late and entirely academic (but we don’t have any other methodology for building optimizers of the quality that programmers expect from compiled languages).


> any debate about whether it’s “justified” is many decades late

"It's always been this way so it's impossible to address." Forgive me if I'm not convinced.

From https://highassurance.rs/chp3/undef.html:

> In other words, should a developer inadvertently trigger UB, the program can do absolutely anything.

Well, no. It is "behavior . . . for which this International Standard imposes no requirements." There are restraints and constraints beyond the ISO standard.

realloc(p, 0) is now undefined in C23. However, every mainstream OS and compiler specifies the correct behavior for that environment. It is simply not. true. that the program can do anything. What is true is that the range of behavior is not restricted by the ISO standard.


At least 15 years ago, the team responsible for developing both our ASICs and core routing/switching code for the firmware on our networking devices worked under the consensus understanding that "Undefined Behavior" (in C, in their case) meant precisely that - could trigger a Nuclear Launch, blow up the planet, etc.... There was absolutely no behavior (within the confines of the limitations of C compiler(s) for the various hardware platforms we built on) that was not restricted.

A very significant amount of their effort, time, focus went into understanding precisely what was, and was not "Undefined Behavior" - the instant you did anything that was "undefined" anything that happened after that was fair game.

They also did zero dynamic memory allocation after starting an application. All memory was allocated on startup based on initial config settings.

My sense in watching that extraordinarily skilled team was that the logic and features they were building (on a very complex FHSS MAC) were secondary to convincing the compiler and hardware to do something that the specifications and definitions should happen. The great firmware developers were also pretty solid language lawyers.


I'm not saying developers who are careful about UB aren't doing the right thing. They are absolutely doing the right thing.

What I am saying is that a compiler that sees

    int8_t x;
    float x;
and does anything other than "terminating a translation or execution (with the issuance of a diagnostic measure)" is doing the wrong thing.

I am also saying that a compiler that offers -fwrapv and formats your hard drive on int x = INT_MAX; x++; rather than "behaving during translation or program execution in a documented manner characteristic of the environment" is pathological, violates the spirit of the ANSI and ISO standards, and violates the letter of the ANSI standard.


You may not like it, and it’s within your rights not to like it, but the reality is that compilers do treat UB this way, and it’s not “grossly dishonest FUD” to point out that this is the reality. Here’s a test case demonstrating that UB can actually format your hard drive: https://bugs.llvm.org/show_bug.cgi?id=49599.

Note that one of the differences between C and Rust is that integer overflow is not UB in Rust (it panics in debug mode and wraps in release mode: https://doc.rust-lang.org/book/ch03-02-data-types.html#integ...). But there are other sources of UB in unsafe Rust, such reads through a pointer not allowed by the memory model.


> the reality is that compilers do treat UB this way

No, they really don't. https://godbolt.org/z/M6hcTx3aY

Clang will not blow your computer up just because you use #pragma STDC FENV_ROUND <direction> in accordance with its documentation.

I don't know why it became popular to make UB into a bogeyman and act like it's an intractable problem, but it isn't. Compilers handle most UB just fine.* It's better for everyone if we can focus on the actual problems rather than blowing it out of proportion with sweeping generalizations.

* All undefined behavior is undefined, but some undefined behavior is more undefined than others.


It seems like you think I’m saying that all compilers are required to recognize all forms of UB and format your hard drive. Of course not; some UB in one specification may be defined in another more restrictive spec, some UB may be defined by compiler flags like -fwrapv, some UB might be detected and converted to a compile-time or runtime error, and some UB might happen to behave like you expect because your compiler didn’t implement certain optimizations yet or you just got lucky.

It sounds like you agree that programmers should avoid relying on luck for UB safety. If you have a way to prove that this UB is actually safe and not just lucky, feel free to present it. Until then, I stand by everything I’ve said.


No. I am saying that the tropes about the danger of undefined behavior apply only to a subset of undefined behavior.

It isn't true to say that "any undefined behavior" can result in a parade of horribles. That is a sweeping generalization. Huge amounts of undefined behavior are in fact well-defined by the implementation and/or environment.

Indeed, the formal definition of "implementation-defined" is so narrow that most of what you'd think should be "implementation-defined" is actually "undefined." The example I gave of realloc(ptr, 0) is one such case. "Classifying a call to realloc with a size of 0 as undefined behavior would allow POSIX to define the otherwise undefined behavior however they please." WG14 n2464, available at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2464.pdf


If a program’s behavior is undefined in one specification and defined in a second, then insofar as both specifications apply to it, that is defined behavior, not undefined behavior. That’s not an obscure property of terms of art and specific formal definitions about which I need to be educated, it’s just basic logic about which we obviously agree. The conclusion would be the same if one specification explicitly said “this may return an empty allocation or NULL or format your hard drive” and a second specification said “this will return an empty allocation or NULL”; the conjunction of those two specifications entails that it will not format your hard drive.

There’s nothing wrong or misleading in what I wrote. If, hypothetically, there were a second specification that constrained the Rust compiler’s translation of programs that over-read memory, then it would have been wrong to write that over-reading memory in Rust is undefined behavior, and misleading to suggest that a statement about “any undefined behavior” is applicable to it. But there is no such second specification (as confirmed by comments on the issue I linked from RalfJung, whose formal specification hat is much pointier than either of ours), and you aren’t even disputing that. Instead, you have deliberately misapplied a statement about “any undefined behavior” to other unrelated behavior that is in fact defined, in order to construct a pretense for calling someone else dishonest. Find better hobbies.


No. Undefined behavior means one thing and only one thing, at least when it comes to C: "behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements"

There's no such thing as "insofar as both specifications apply to it, that is defined behavior, not undefined behavior." Undefined behavior in C that is standardized by POSIX is still undefined behavior. Indeed, that is why "Possible undefined behavior [includes] behaving during translation or program execution in a documented manner characteristic of the environment." Behaving according to the POSIX specification on a POSIX system (or according to the specification of any system on that system) is explicitly accounted for in the definition of "undefined behavior."

These things are objectively, inarguably true, and recorded in black and white. I do not understand why you are so relentless in trying to gaslight HN about them. Just stop.


When the term “undefined behavior” appears within the C standard, yes, that’s what it means. There is no way for another document to alter the meaning of “undefined behavior” as it appears within the C standard.

When the term “undefined behavior” appears in different context, we use our writing and reading comprehension skills to agree on what it means. If we’re programming in standard C, it refers to one class of behavior. If we’re programming in POSIX C, it refers to a smaller class of behavior (even according to the C standards committee: see the usage of “otherwise undefined behavior” in N2464). If we’re programming in Rust, it refers to a completely different class of behavior.

From my very first message, I did my part by providing the context explicitly: “undefined behavior in the Rust and LLVM memory model”. You complained that it’s nonsense to talk about “undefined behavior” in any context other than the C standard, but I provided references showing that it’s not. Yet you persist in intentionally misinterpreting what I wrote in order to accuse me of gaslighting. Does that make you feel superior?


Neat and performant code like the article makes me very curious how the competition will shake out between AMD's AVX512 implementation and Intel's upcoming AVX10. The entire point of AVX10 seems to be to resolve Intel's P vs E core situation, while AMD seems to have taken a better approach of using either full width (Zen5) or double-pumped 256bit (Zen4, Zen5 mobile) as appropriate to the situation, while making the API seamless.

The big gains delivered in the article are all on a double-pumped Zen4 core! AVX512 brings a lot to the table so its quite frustrating that Intel market-segmented support for it so heavily as to completely inhibit its adoption in broad-based client code.


If Intel actually implements AVX10/256 on every CPU they ship going forwards, it will eventually win simply by availability. The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator. And since AVX10.1/256 and AVX512VL have a shared subset, that will be what people will eventually target once enough time has passed and nearly everyone has a CPU that can support it.

AVX512 will continue to give AMD some easy wins on the few benchmarking apps that were actually updated to support it, but if Intel sticks with the AVX10 plan I expect that AMD will eventually just use the double-pumped SIMD pipes for everything, just because they are the more efficient way to support AVX10/256 while retaining AVX512 compatibility.

Intel did a lot of bad choices in the past decade, but segmenting the market based on instruction set has to be one of the worst. They just chose to kill all the momentum and interest in their newest and best innovations. Hopefully they actually add AVX10/256 support to the whole lineup, because the width is the least interesting part about AVX512, the masked operations especially are a lot more important.


Dispatching is actually heavily used. Image/video codecs, cryptography, and ML libraries routinely use it, because the lowest common denominator is very low indeed.


The things you listed are probably less than 1% of all client loads. Video codecs and ML mostly run on accelerators and cryptography is a tiny proportion of all loads.


Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access. Add to that gzip or brotli compression (common in HTTP). Hardware decoders for newer video codecs (such as AV1) are not that common (less than 50% of devices) so most youtube watch time on desktops/laptops is software decoding. It's a lot more than 1%.


> Cryptography is every download you ever do in a browser, bitlocker or equivalent encrypted disk access

For IO bandwidth-bound heavy lifting these things typically use AES algorithm. The hardware support for that algorithm is widely available inside CPU cores for more than a decade: https://en.wikipedia.org/wiki/AES_instruction_set#x86_archit... That hardware support is precisely what enabled widespread use of HTTPS or full disk encryption. Before AES-NI it was too slow, or it required specialized accelerator chips found in web servers in 2000-s who needed to encrypt/decrypt HTTPS traffic.

I don’t think people use AVX2 or AVX512 for AES because AES-NI is way faster. The runtime dispatch needs just a few implementations: hardware-based AES to use on 99% of the computers, and couple legacy SSE-only versions.


The original AES-NI (which were SSE instructions) and also their initial correspondent in AVX performed 128-bit operations.

Later, 256-bit AES instructions were introduced, but significantly later than AVX2. Such 256-bit AES instructions, which double the AES throughput, are available in more recent CPUs, like AMD Zen 3 and Intel Alder Lake (the so-called VAES instructions).

Some of the more recent CPUs with AVX-512 support have added 512-bit AES instructions, for an extra doubling of the AES throughput.

Zen 5 (desktop and server) doubles the AES throughput in comparison with Zen 4, similarly with the double throughput for other vector operations.

In conclusion, on x86 CPUs there are many alternatives for AES, which have different throughputs: 128-bit SSE instructions since Westmere, 128-bit AVX instructions since Sandy Bridge, 256-bit VAES AVX instructions since Zen 3 and Alder Lake and 512-bit AVX-512 instructions since Ice Lake, but only in the CPUs with AVX-512 support.


Surely h264 encode and decode is substantial, given the large amount of video consumed?


I don’t believe many people encode or especially decode video with CPU-running codes.

Modern GPUs include hardware accelerators for popular video codecs, these are typically way more power efficient than even AVX-512 software implementations.


> I don’t believe many people encode or especially decode video with CPU-running codes.

Believe what you want but as soon as realtime is not a concern and quality matters you'll be using CPU-based encoders unless you have special hardware for your use case.

> Modern GPUs include hardware accelerators for popular video codecs

... with shit quality encoders because they are designed for speed first.

Decoding is a different matter but even there older hardware can easily end up not supporting codecs (or profiles of them) that you come accross.


> The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU, so the only SIMD implementation that actually matters is the lowest common denominator.

RHEL just moved up to x86_64-v2, equivalent to 2009 level CPUs. And they’re an early mover, Fedora/Ubuntu/Arch have not done the same.

Glibc has used CPU dispatching for str* and mem* functions for over a decade.


FreeBSD added AVX512 dispatch into it C library AFAIK.


> The market has repeatedly and thoroughly rejected dispatching to different code paths based on CPU

What do you mean? At least numpy and pytorch (the only numeric libraries I'm familiar with) both use runtime dispatching.


I agree. I still hesitate to ship code that requires even AVX1, even though it was first introduced in 2011(!).

AVX512 really improves the instruction set. Not just from masking but from some really big holes filled in terms of instructions available that AVX2 doesn't have a good solution for.

At the same time I also DO have plenty of code that could definitely use the compute throughput improvement of 512 bit vectors. But it's definitely a more niche usage. You have to at least nominally satisfy that you: 1) Benefit from 2x the ALU throughput 2) Live mostly in the cache 3) Are not worth running on the GPU instead.


It depends on the product and the market.

I’ve been shipping software that requires AVX2 and FMA3 because it is a professional CAM/CAE application. Our users typically don’t use sub-$100 Intel processors like Pentium and Celeron. The key is communication, you need to make sure users are aware of the hardware requirements before they buy.

Another example, in the server market, you can almost always count on having at least AVX2 support because most servers today are cloud-based. For people running these cloud services, power efficiency is crucial due to the high cost of electricity, so they tend to replace old hardware regularly.

On the other hand, for desktop software aimed at a broad audience, there are valid reasons to hesitate before shipping code that requires AVX1.


I don't get saying mask operations are more important than width?

Mask operations can be trivially emulated with vblend, it is one extra instruction..

Width can't be emulated, you just are stuck running half speed.

This take keeps getting repeated, but doesn't appear to be backed up by reality.

Intel hasn't even put AVX10 on their upcoming chips(skymont), so it appears to be going nowhere.


> Mask operations can be trivially emulated with vblend, it is one extra instruction..

For unaligned loads where you can't guarantee that the entire vector is on a mapped page?


The important feature of AVX-512 demonstrated in my blog post is masked loads and stores, which can't be emulated with vblend.


Zen 4 AVX512 implementation is not double-pumped and tech journos need to stop calling it that, because it has specific meaning that does not match what takes place.

It simply decodes operations on ZMM registers into multiple uOPS and schedules them to free 256b units. In addition, Zen 4 has special handling of 512b full-width shuffles, with dedicated hardware to avoid doing very expensive emulation. As a result, Zen 4 with its 4 256b SIMD units still acts like a very strong 2x512b core. There is nothing cheap about this implementation and it is probably the best rendition for consumer hardware so far.


I don't understand why intel doesn't make their E-cores use double pumped AVX512 to solve this issue (or just make P-core only CPUs for desktops like it should). They have had years to fix this by now. It's annoying that despite AMD's support, market share makes the adoption of this not happen anyway, and the AVX10 thing will unfortunately allow them to hold back the world even longer.

What I like to see (for desktop) is: better cores, more cores, well standardized instruction sets that unlock useful things (wide SIMD, float16, gather/scatter, ...). AMD is doing pretty well at this. What Intel is doing instead: put weak cores alongside decent cores, cripple the decent cores to keep up with the weak cores, release CPUs with the same amount of cores as before for many generations in a row, use the weak cores to make it sound like they have more cores than they have, bring out way too many variants of instructions sets to ever have a useful common set, drop support for their own promising sounding instructions

I just really dislike anything Intel has come out with or plans to come out with lately :p

My cycle of preference of manufacturer (for desktop computers) has been: 90s: Intel. Early 2000s: AMD (pentium 4 was so meh). late 2000's+2010s: Intel. Now: AMD again. What will Intel do to gain foothold again (that isn't sabotaging the other)? We need to keep the competition going, or the other side may get too comfortable.


A credible but unconfirmed rumor I've read is that Intel didn't want to do it because of the 512-bit shuffles. The E-cores (including those using the upcoming Skymont microarchitecture) are natively 128-bit and already double-pump AVX2, so to quad-pump AVX-512 would result in many-uops 512-bit shuffle instructions made out of 128-bit uops.

This would render the shuffles unusable, because you'd unpredictably have them costing either 1 uop or cycle to taking 10-30 uops/cycles depending on which core you are on at the moment. A situation similar to PEXT/PDEP, which cost almost nothing on Intel and dozens of cycles on AMD until a couple generations ago.

Why does Zen 4 not have this problem? First, they're only double-pumping instead of quad-pumping. Secondly, while most of their AVX-512 implementation is double-pumped, there seems to be a full-length shuffle unit in there.


Interesting stuff! If a shuffle is just crossed wires without any logic (as far as I know shuffle is a completely predetermined bit permutation), wouldn't it fit on the E-cores easily enough?


AVX-512 allows arbitrary shuffles, e.g., shuffle the 64 bytes in zmm0 with indices from zmm1 into zmm2. Simple shuffles like unpacks etc aren't really an issue.


Worse yet (for wiring complexity or required uops, anyway), AVX-512 also has shuffles with two data inputs, i.e. each of the 64 bytes of result can come from any of 128 different input bytes, selected by another 64-byte register.


Which is also why it's so attractive. :)

Those large shuffles are really powerful for things like lookup tables. Large tables are suddenly way more feasible in-register, letting you replace a costly gather with an in-register permute.


What they did, is actually even worse: they briefly introduced in Alder Lake AVX-512 FP16 = excellent turbo fast instructions for AI, to immediately drop it.



While we're having fun with that:

    # Capitalising an eszett changes the string length.
    >>> "straße".upper()
    'STRASSE'
    
    # If you don't specify the locale, round-trip upper/lower case
    # messes up the dotless i used in turkic languages.
    >>> 'ı'.upper().lower()
    'i'


Up until a few years ago, the first conversion would have been the official way to write a German word that contains the ligature ß (a lower-case letter) in all caps, because there was no corresponding upper-case letter. However, in 2017, an upper-case variant [1] was introduced into the official German orthography, so the conversion cited here should no longer be necessary.

[1] https://en.wikipedia.org/wiki/Capital_%E1%BA%9E


The upper-case ẞ remains very unconventional, and the Unicode casing algorithm continues to specify upper-case conversion to SS.


Correct. This raises the question what should be the basis of the Unicode casing algorithm: what is commonly practiced by users of a specific language (how to measure this reliably in less clear cases than this one?) or what an official "specification" of the language defines (if such a thing exists, and is widely accepted, especially when the language is spoken in more than one country)?


IMO: the official specification. If it doesn't match common usage after some time, the spec should be revised.

Pretty sure the actual question would be: what if there are multiple conflicting official specs?


> If it doesn't match common usage after some time, the spec should be revised

Well, the problem with any kind of language spec change is that it can take decades until it gets accepted widely.

Young people are the first adopters as they get it force-fed by schools, but getting the age bracket 40+ to adopt is a real challenge. Germany's 1996 project led to years-long battles and partial reversals in 2004/06, and really old people to this day haven't accepted it.


It is extremely unusual, and the addition to Unicode was very controversial.

Basically, some type designers liked to play with capital ß and thought it cool to have it included into Unicode. There was a whole campaign for inclusion, and it was a big mistake.

Because even though existing use must be shown to merit inclusion, they only managed to find a single book (a dictionary) printed in more than a few copies. From East Germany. And that book only used the capital ß for a single edition (out of dozens of editions) and reverted straight back to double s. Somehow, that still was enough for Unicode.

Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

It has no real historic lineage (like the long s, for example, that pretty much nobody under 80 knows, either), it is a faux-retro modern design, an idle plaything.


Unicode attempts to render all human-written documents. That's why U+A66E (ꙮ), a character found once in one single document, is still considered acceptable.

ẞ is not a good capital letter of ß, but if that one single book ever gets transcribed into digital text, a unicode character code is necessary for that to happen.

I doubt systems that can't deal with the ß -> SS transcription will be able to deal with ẞ in the first place.


It‘s not as simple as you make it to be. Klingon script was rejected, partly because of "Lack of evidence of usage in published literature", despite many web sites and more than a handful of published books using it.


ꙮ is a farce like many other Unicode inclusions. Good for a laugh but best forgotten. If you want to encode that level of stylistic choice then you really need to start encoding the specific font used too.

Next you are going to ask Unicode to include all fancy stylized initials which are of course specific to the paragraph in a single book they are used in? At that point just embed SVGs directly in Unicode because the fight for any semantic meaning has been lost.


Well, what can you do... amtlich is amtlich.

And the behavior of a function that converts ß to double upper-case S can certainly be discussed too, if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.


> if only for the fact that a round-trip toUpper().toLower() will not give you back the original word.

It is an inherently destructive round trip, especially in a language that makes excessive use of upper case when comapred to english. When you have the word "wagen" did it originally refer to a "car" or did it refer to someone "trying"?


I'm not a German speaker but this is definitely not exclusive to German nor is it caused by 'excessive' capitalization. The use of capital letters to denote nouns only helps to disambiguate it in German. While in English, this distinction is never clear just from the spelling of the isolated word alone. E.g. 'contract' can either mean 'an agreement' as a noun or 'to decrease in size' as a verb. There are plenty of other examples.

I agree though that this makes the round-trip inherently destructive.


Note that the addition to Unicode came first, and the addition to the official German orthography only came nine years later (and might not have happened at all without the Unicode inclusion). In addition, it remains only a variant in the latter, alongside the existing SS.


> Capital ß is a shit show, and it only confuses native speakers, because close to none of them have ever seen that glyph before.

I don't think it's particularly confusing, but it is a great addition. As of 2017 the rules allow both "SS" and "ẞ".

Moreover, the current version of DIN 5008 (Rules for Writing and Design in Text and Information Processing) actually recommends preferring "ẞ" over "SS".


What's confusing for me is that the German government seems to change its mind radically around the question every few years. If I counted properly, there was already two orthographic changes since I was schooled (German as a second language).


I'm not sure what radical changes you are referring to. There was a orthography reform in 1996 (with a transitional period until 2005). But other than that only minor changes occurred.


The government isn't involved in orthography. There is an official orthography that public officials must use, but it follows the rules of the relevant non-governmental organizations.


No, the government set the rules in 1996 and modified them in 2006. that was new, before that, orthography was set by convention and common usage.

Ostensibly, the reform only applied to officials in government agencies, but that also included both teachers and students in schools, and today using the old orthography in university exams is marked as errors (although many lecturers don‘t care and won‘t mark it up). Just went through it a few months ago.

Today, new orthography is not optional, at workplaces you will be corrected (and sometimes chastised) for using old orthography. Just recently a colleague went through a document I wrote and changed every "daß" to "dass". Nothing else was changed.

The ship has sailed, though, since many cohorts of students have gone through it now. I would just like people to be tolerant of what we older people learned in school. I don‘t want to re-learn everything. Just leave me in peace.


As a conlanger, I much appreciated the addition of a capital ß to Unicode. I use this letter in my conlang, and several words start with it, so it'd be natural to have a capital version of it to start sentences with. I was relieved to learn there is a defined spot for this letter in Unicode.


As a non native German speaker I really don't understand all the fuß around the capitalneSS of ß


Heh, thank goodness I don’t have to deal with all that! This code is ascii-only because it arose from working on the DNS. There are other protocols that are ascii-case-insensitive so it turns up a lot in the hot path of many servers.


This is presumably Rust's u8::to_ascii_lowercase rather than C's tolower since tolower is locale sensitive (which you don't care about) and has Undefined Behaviour (because of course it does it's a C function and who cares about correctness)

Or possibly u8::make_ascii_lowercase which is the same function but with in-place mutation.


There is a difference between strings used internally, usually as IDs, and text entered by a human. For the former you'd always use straight ASCII in 8-bit encoding, for the latter ... things get difficult. A straightforward example are DNS addresses - they can technically contain almost any Unicode, but that is always converted to a very limited subset of ASCII for actual DNS resolution, which in turn is case-insensitive.

There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.


On the wire DNS questions and answers are case preserving but not case sensitive which is important for correctness. DNS was designed a long time ago and doesn't have enough places to hide randomness (~30 bits at most are available) to protect it against a realistic attacker, so, for most names we do a terrible trick in the real world - we randomize the case during querying. This is called DNS-0x20 for obvious reasons.

Suppose a web browser wants to know news.ycombinator.com AAAA? but bad guys know it's about to ask this (e.g. they use JS to force that lookup when they wanted), they can shove a billion bogus answers (one for every possible random value) onto the wire and have a great chance to trick the browser into accepting one of these answers which seemingly is to the question it just asked. But, if we instead pick random cases we're asking about, say, NeWS.yCOmbinAtOR.cOM and we can ignore answers for nEWS.yCOMBINATOR.cOM or news.ycombinator.com or NEWS.YCOMBINATOR.COM or any other spelling. Bad guys now need to do many orders of magnitude more expensive work for the same results.


> For the former you'd always use straight ASCII in 8-bit encoding

why is that?

IMO internal strings should be treated as black-box byte-arrays, i.e. the specific content does not matter to the software, except for checking equality between to strings. In this case it should not matter to the software if it is unicode or whatever.


> There are of course things like programming languages with case-insensitive identifiers that support all human writing systems in Unicode. If that's what you're dealing with, you have my condolences.

Fun times when an upgrade of the Unicode library used by your compiler changes the semantics of you program.


For what it’s worth, with IDNs you’re still going to do a kind of case folding using stringprep before doing the query, and that isn’t really better than the table GP linked. ASCII-only case-insensitive matching is indeed a thing, but it’s usually mutually exclusive with (non-programmer) user-entered data.


Note for the first example which transforms the German maße to MASSE that the German language has an uppercase Eszett as well: ẞ

This is as of now not widely deployed and few fonts support it, but in theory it is there now.


> Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_uppper is false, and when is_upper is true the result is c + to_upper.

Is that an error in the post ? Shouldnt it do the addition when is_upper is false and copy the same when it is true ?


D'oh, I must have seen your comment 10 times before I realised the to_upper variable is named backwards and should be called to_lower. Thanks for pointing out that it was confusing! I've fixed the post and the code.


The operation is `tolower`

Capital a is 0x40, lowercase is 0x60.

The addition of 0x20 needs to happen when is_upper is true.


Unfortunately those SWAR optimizations are only useful for strings that are aligned on 8 bytes address.

If your SWAR algorithm is applied on a non-aligned string, it is often slower than the original algorithm.

And splitting the algorith in 3 parts (handling the beginning up to an aligned address, then the aligned part, and then the less-than-8-bytes tail) takes even more instructions.

Here is a similar case on a false claim of a faster utf8.IsValid in Go, with benchmarks: https://github.com/sugawarayuuta/charcoal/pull/1


Masked SIMD operations, which are in AVX-512 and ARM SVE were intended to solve that problem. Then memory operations could still be aligned and of full vectors all the time, but masked to only those elements that are valid.

Even if a masked vector-memory operation is unaligned and crosses into an unmapped or protected page, that will not cause a fault if those lanes are masked off. There are even special load instructions that will reduce the vector length to end at the first element that would have caused a fault, for operations such as strlen() where the length is not known beforehand.


Mask add looks neat! I wish there was a way to directly manipulate AVX512's mask registers in .NET intrinsics but for now we have to live with "recognized idioms".

For anyone interested, the author's core loop in ASM is as compiled by GCC

    .L3:
        vmovdqu8        zmm0, ZMMWORD PTR [rcx+rax]
        vmovdqa64       zmm1, zmm0
        vpcmpb  k1, zmm0, zmm4, 5
        vpcmpb  k0, zmm0, zmm3, 2
        kandq   k1, k1, k0
        vpaddb  zmm1{k1}, zmm0, zmm2
        vmovdqu8        ZMMWORD PTR [rdi+rax], zmm1
        add     rax, 64
        cmp     rax, r8
        jne     .L3
uiCA (CQA/MAQAO) (https://uica.uops.info/, make sure to pick CQA + Ice Lake) says it achieves nice 32B/cycle on Ice Lake. If you multiply by say 3 to match 3 GHz, this gives us almost 96 GiB/s assuming memory access is not a bottleneck (it always is in such algorithms).

But this seems not as close to optimal utilization as it could be. Using Clang instead yields much better, nicely unrolled result with better instruction selection.

    .LBB0_9:
        vmovdqu64       zmm3, zmmword ptr [rsi]
        vmovdqu64       zmm5, zmmword ptr [rsi + 64]
        vmovdqu64       zmm6, zmmword ptr [rsi + 128]
        add     rdx, -512
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm5, zmm0
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx], zmm3
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 64], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 192]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 128], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 256]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 192], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 320]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rcx + 256], zmm6
        vmovdqu64       zmm6, zmmword ptr [rsi + 384]
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm4, zmm6, zmm0
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vpcmpltub       k1, zmm4, zmm1
        vmovdqu64       zmmword ptr [rcx + 320], zmm5
        vmovdqu64       zmm5, zmmword ptr [rsi + 448]
        vpaddb  zmm6 {k1}, zmm6, zmm2
        add     rsi, 512
        vmovdqu64       zmmword ptr [rcx + 384], zmm6
        vpaddb  zmm4, zmm5, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm5 {k1}, zmm5, zmm2
        vmovdqu64       zmmword ptr [rcx + 448], zmm5
        add     rcx, 512
        cmp     rdx, 63
        ja      .LBB0_9
This extracts more impressive 42.67B/c, I don't think even L2 cache can sustain such a throughput, but it's nice to know that medium length strings get up/downcased in about the same time it takes light from your screen to reach your cornea.

The core for short lengths there is one instruction less:

    .LBB0_5:
        vmovdqu64       zmm3, zmmword ptr [rsi + rcx]
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vmovdqu64       zmmword ptr [rax + rcx], zmm3
        add     rcx, 64
        cmp     r8, rcx
        jne     .LBB0_5
Some months ago I wrote a similar ASCII in UTF-8 upcase/downcase implementation in C#: https://github.com/U8String/U8String/blob/main/Sources/U8Str...

(the unrolled conversion for below vectorization lengths is required as short strings dominate most codebases so handling it fast is important - the switch compiles to jump table and then branchless fall-through to return)

For now it goes as wide as 256b as it already saturates e.g. Zen 3 or 4 which have only 256x4 SIMD units (even though Zen 4 can do fancy 512b shuffles natively and has very good 512b implementation). The core loop compiles to compact

       G_M48884_IG05:
              vmovups  ymm3, ymmword ptr [rdi+rax]
              vpaddb   ymm4, ymm3, ymm1
              vpcmpgtb ymm4, ymm2, ymm4
              vpand    ymm4, ymm4, ymm0
              vpor     ymm3, ymm4, ymm3
              vmovups  ymmword ptr [rsi+rax], ymm3
              add      rax, 32
              cmp      rax, rdx
              jbe      SHORT G_M48884_IG05
Side by side with C ones: https://godbolt.org/z/eTGYhTPan

I believe you can also achieve similar with 3-instruction conversion with AVX512 with vpternlogd, as when I had access to AVX512 hardware, this is what .NET optimized it to for 256b width + AVX512VL, but strangely enough I can't make it do so for 512b width right now.

You may notice failed SWAR attempt for switch dispatch case and I was wondering what kind of license your posts are distributed under? (gave up on it back then because per-element fall-through was already fast enough, but if yours passes the test suite, I'd love to use it haha)


Clang and GCC have a differing approach to Intrinsics, and Clang is more likely than GCC to deviate from the Intel guide's specified opcodes & algorithms, and this is particularly noticeable with AVX-512 instructions. It's understandable given their respective architectures. Sometimes the result is an improvement, sometimes it is a detriment.

A couple of years ago I worked on a heavily vectorized project that was intended to compile with either, and wound up maintaining inline asm and .S in the repository for specific targets alongside the C reference version. That made for some ugly Makefile shenanigans, and also meant including benchmarking as part of the test suite. It adds up to considerable maintenance burden, so the takeaway for me was that using Intrinsics as a low-level means to improve on the autovectorizer should be only very sparingly considered.

Edit to add: quick example, from my notes during that project, https://godbolt.org/z/T4Pjhrz5d ; the GCC output is what was expected, the Clang output was a surprise, and noticeably slower in practice, even when inlined. When looped (or similarly if unrolled), uiCA clocks it at 7 cycles to GCC's 4, and this was borne out by their benchmark performance in application code, in which this function was performed a few billion times in the course of a brute-forcing algorithm (which is to say, it mattered). I recall finding other issues where a dive into the LLVM codebase suggested that Clang 16 might be entirely unable to issue some masked AVX-512 instructions due to internal refactorings.


I've run into the same behavior with clang and intrinsics. Well, I appreciate the fact that they're trying to optimize the intrinsics usage, there really does need to be a flag or pragma you can pass that says more along the lines of "no really, give me what I asked for." In some cases I have found that the code it produces is a significant pessimization from what I had selected.


Did you file bugs with testcases?


Sadly didn’t have time (this was not a funded project and I am far from sufficiently up to speed on LLVM internals). I still hope to get around to writing something up during the next break.


Thanks for that analysis, very informative!

I have not tried to get the best possible performance: at first I wanted to see if it would work at all, and the fact that my first attempt performed really well was a bonus! My main point of interest is strings less than the size of a vector register, and getting rid of the troughs in the throughput chart.

You can click through the link to the code at the end of the blog post, which has all the licence details. It is 0BSD or MIT-0 except for the parts written originally for BIND which are MPL-2.0.


Just to clarify - this was about SWAR version as my interest, similar to what you mentioned in a sibling comment, is in mainly below-vector-width strings, because these are most frequently passed to these kinds of functions by developers. I was going to try to see if it can further "amortize" the jump table with fallthrough for variable lengths below 16 in my C# UTF-8 string implementation that is going to use it.

In any case, thanks for writing the article, for some reason there is quite a bit of negativity among users and developers towards AVX512 - it is seen as "AVX2 but twice as wide", which it of course isn't, but most do not know that.

Also, you may want to simply do the overlapping conversion for the tail instead of masking out the elements, which is usually faster on the benchmarks (at least give it a try), and also align the source and destination for the loop to avoid split loads/stores.


One thing that's not all that obvious from this large chunk of assembly is that Clang manages to rewrite

  (x >= 'a' && x <= 'z')
into

  (x - 'a') < <some constant>
which saves one instruction (and sometimes, a register load due to weird opcode encoding thingies).


This is a common transformation for range-check style comparisons. It is always nice to see that LLVM reasons about it in a generalized fashion and applies it to vector operations, and it is what I ultimately ended up on when writing a similar to author's implementation.


I sometimes write intrinsic comparisons (e.g., something like "x < 'a'" instead of "_mm_cmplt_epi8(x, _mm_set1_epi8('a'))") just to make sure the compiler understands what's going on and can do that sort of transformations. (It also sometimes makes for more code sharing between e.g. x86 and NEON.)


I'm lost, what is swar?


"SIMD Within A Register"

I think the implication is that you can pack multiple items into an ordinary register and effectively get SIMD even if you aren't using explicit SIMD instructions. E.g. if you pack a 31 and 32 bit number into a 64 bit register (you need 1 spare for a carry bit), you can do 2 adds with a single 64-bit add.

Games have used these tricks for graphics to pack RGB(A) values into 32 bit integers. E.g. this code from scummvm interpolates 2 16-bit RGB pixels (6 total components) packed into a 32-bit value. https://github.com/scummvm/scummvm/blob/master/graphics/scal...


SIMD within a register


With the advent of Unicode, the concept of upper and lowercase is a quagmire. You need a whole bunch of data to do it right.

If your work involves some process whose timely completion depends on how fast ASCII tolower executes, you better shake things up somehow and change the ground rules.


In the past I have added a black border around an image to be able to completely avoid the read beyond buffer SIMD problem. It worked very well and allowed us to beat some opencv implementation in terms of speed. But you don't always have full control over the input like that.


Did you try something like this? The autovectorizer looks pretty clean to me.

https://godbolt.org/z/1c5joKK5n


That’s basically the same as `tolower1` - see the bullet points below the graph


Well, the conundrum sits with the fact that this is the disassembly of the main loop:

        vmovdqu64       zmm3, zmmword ptr [rdi + rcx]
        vmovdqu64       zmm4, zmmword ptr [rdi + rcx + 64]
        vmovdqu64       zmm5, zmmword ptr [rdi + rcx + 128]
        vmovdqu64       zmm6, zmmword ptr [rdi + rcx + 192]
        vpaddb  zmm7, zmm3, zmm0
        vpaddb  zmm8, zmm4, zmm0
        vpaddb  zmm9, zmm5, zmm0
        vpaddb  zmm10, zmm6, zmm0
        vpcmpltub       k1, zmm7, zmm1
        vpcmpltub       k2, zmm8, zmm1
        vpcmpltub       k3, zmm9, zmm1
        vpcmpltub       k4, zmm10, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vpaddb  zmm4 {k2}, zmm4, zmm2
        vpaddb  zmm5 {k3}, zmm5, zmm2
        vpaddb  zmm6 {k4}, zmm6, zmm2
        vmovdqu64       zmmword ptr [rdi + rcx], zmm3
        vmovdqu64       zmmword ptr [rdi + rcx + 64], zmm4
        vmovdqu64       zmmword ptr [rdi + rcx + 128], zmm5
        vmovdqu64       zmmword ptr [rdi + rcx + 192], zmm6
...which is, upon first glance, is similar to yet better than the intrinsics version you wrote. Additionally it has cleaner tail handling.


The tail handling is the main point of the post. The tolower1 line on the chart is very spiky because the autovectorizer doesn’t use masked loads and stores for tails, and instead does something slower that tanks performance. The tolower64 line is smoother and rises faster because masked loads and stores make it easier to handle strings shorter than the vector size.


There is something to be said for using masking load/stores instead of downgrading the SIMD register types. But to my eye, something is clearly wrong with your code in such a way that the "naive" autovectorized version should be blowing yours out of the water at 256B. Your code leaves a sparse instruction pipeline, and thus needs to run through the loop 4 times for every one of the autovectorized version. So, something is wrong here, as far as I know. I was just trying to make sure you'd covered the basics, like ensuring you were targeting the right architecture with your build and whatnot.

This is yours. Basically the same instructions, but taking up way more space:

        vmovdqu64       zmm3, zmmword ptr [rsi]
        vpaddb  zmm4, zmm3, zmm0
        vpcmpltub       k1, zmm4, zmm1
        vpaddb  zmm3 {k1}, zmm3, zmm2
        vmovdqu64       zmmword ptr [rdi], zmm3


Unrolling on AVX-512, especially on Zen 4 with its double-pumped almost-everything, isn't particularly significant; the 512-bit store alone has a reciprocal throughput of 2 cycles/zmm, which gives a pretty comfortable best possible target of 2 cycles/iteration. With Zen 4 being 6-wide out of the op cache, that's fine for up to 12 instructions in the loop (and with separate ports for scalar & SIMD, the per-iteration scalar op overhead is irrelevant).

Interestingly enough, clang 16 doesn't unroll the intrinsics version, but trunk does, which'd make the entire point moot.

The benchmark in question, as per the article, tests over 1MB in blocks, so it'll be at L2 or L3 speeds anyway.

Clang downgrades to ymm, which'll handle a 32-byte tail, but after that it does a plain scalar loop, up to a massive 31 iterations. Whereas masking appears to be approximately free on Zen 4, so I'd be surprised if the scalar tail would be better at even, like, 4 bytes (perhaps there could be some minor throughput gained by splitting up into optional unmasked ymm and then a masked ymm, but even that's probably questionable just from the extra logic/decode overhead).

Also worth considering that in real-world usage the masking version could be significantly better just from being branchless for up to 64-byte inputs.


L2 speeds are ~180GB/s on Zen 4. That's also a part of my confusion. This should be line speed.

I do not have the same experience as you with unrolling AVX512 loops on Zen 4. I recall even with double pumping, you can do 2.5 per cycle. As you noted with the stores, it takes two cycles, so you can put 4 into a 3.25 cycle pipeline, instead of 8 cycles. With 5 dependent ops covering 4.5 cycles, this should be a significant win.

I'm not defending the 31 length loop to clean up the mod32 leftover section. That is bad. But it doesn't answer why 256B is 4x slower than line speed and not significantly faster than the unrolled intrinsic version.

In my experience, I never used the masked load store because some platforms did an actual read over the masked-away parts, and could segfault. I recall hearing from a reliable source that Zen 4 doesn't do that, but didn't see official documentation for it. Clang may actually be avoiding the masked cleanup for that reason.

To top it off, I also always found it faster to just stagger the index back instead of a masked load/store whenever it's longer than 64 on calculations like this. That is, if it's size=80, do 0-63, and then 15-79 (which is an optimization Clang doesn't do either for some reason).

Finally, what really really confuses me is that whenever I write a benchmark like this:

         for(size_t i = 0; i < sizeof(src); i++) {
                 src[i] = (uint8_t)(i & 63) + 32;
         }
-O3 will do something absurd like just return the answer, since the inputs are constexpr. I can understand why the intrinsic version might confuse the compiler, but the clearly written one should totally have broken the benchmark and overwritten it with a constexpr answer.


The benchmark measures the time to copy about 1 MiByte, in chunks of various lengths from 1 byte to 1 kilobyte. I wanted to take into account differences in alignment in the source and destination strings, so there are a few bytes between each source and destination string, which are not counted as part of the megabyte.

On my Zen 4 CPU the L2 cache is 1 MiB per core, so because the working set is over 2 MB I think the relevant speed limit is the L3 cache.

To be sure I was measuring what I thought I was, I compiled each function separately to avoid interference from inlining, code motion, loop fusion, etc. Of course in real code it's more likely that you would want to encourage inlining, not prevent it!


According to Intel document https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=g..., AVX-512 should suppress faults during masked access or otherwise it is not AVX-512.


Also of concern is that the input, and thus the loads & stores here, are intentionally bumped to cycle through all possible alignments, thus ending up unaligned most of the time, in which case it should be 3 cycles per store (I think?).

I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

That said, yeah, ~44GB/s is actually still pretty slow here, even for L3.

Masked load/store faulting was problematic on AVX2 (in addition to being pretty slow on AMD (which actually continues into Zen 4, despite it having fast AVX-512 versions)); AVX-512's should always be fine, and compilers already output them: https://godbolt.org/z/98sY57TE1

Intrinsics shouldn't "confuse" clang - clang lowers them, where possible, to the same LLVM instructions that the autovectorizer would generate. Both clang and gcc can even convert an intrinsics-based memcpy/memset impl to a libc call (as annoying may that be)!

If you want a compiler to not optimize out computation, you can add something like `__asm__ volatile(""::"r"(src):"memory");` after the loop to make it operate as if the contents of src were modified/read.


> I don't understand your point about pipelining - OoO should mean that, as long as there's enough decode bandwidth and per-iteration scalar overhead doesn't overwhelm scalar execution resources, all SIMD ops can run at full force up to the most contended resource (store here), no?

You are reaching the limits of my understanding, but my level of knowledge is that store may have reciprocal throughput of 2, but it only occupies two ops (from double pumping a single one) over those two cycles, while the CPU pipeline can handle doing 10. For store in particular, nothing is dependent on it completing, so it can be "thrown into the wind" so to speak. But here's my approximation of the pipeline of a single thread, where dashes separate ops

LOADU.0 - LOADU.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - CMP.0 - CMP.1 - _ - _ - _ - _ - ADD.0 - ADD.1 - _ - _ - STORE.0 - STORE.1 - [start again, because nothing is dependent on STORE completing]

So, that's 10 ops and 12 empty spots that can be filled by simultaneously doing 1.2 more loops simultaneously.

I do want to know why clang isn't using the masked load/store. If it's willing to do it on a dot-product, it should do it here as well. It makes me want to figure out what is blocking it (usually some guarantee that 99.9% of developers don't know they're making).


The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

With out-of-order execution, the layout of instructions in the source just doesn't matter at all - the CPU will hold multiple iterations of the loop in the reorder buffer, and assign execution units from multiple iterations.

e.g. see: https://uica.uops.info/?code=vmovdqu64%20zmm3%2C%20zmmword%2...

(click run, then Open Trace); That's Tiger Lake, not Zen 4, but still displays how instructions from multiple iterations execute in parallel. Zen 4's double-pumping doesn't change the big picture, only essentially meaning that each zmm instr is split into two ymm ones (they might not even need to be on the same port, i.e. double-pumping is really the wrong term, but whatever).


> The store will still take up throughput even if nothing depends on it right now - there is limited hardware available for copying data from the register file to the cache, and its limit is two 32-byte stores per cycle, which you'll have to pay one way or another at some point.

Sure. But that limit is one cycle--not two. This is getting pretty above my pay grade.

That tool is nifty, but I couldn't really figure out why it supports your assertion. I plugged in the two loop bodies and got these predicted throughput results:

Unrolled 4x: uiCA 6.00 llvm-mca 6.20

Regular: uiCA 2.00 llvm-mca 2.40

LLVM pretty strongly supports my experience that unrolling/reordering should be a substantial gain here, no? uiCA still has a meaningful gain as well.


Sorry, mixed things up - store has a limit of one 32-byte store per cycle (load is what has two 32-byte/cycle). Of additional note is Zen 4's tiny store queue of 64 32-byte entries, which could be at play (esp. with it pointing at L3; or it might not be important, I don't know how exactly it interacts with things).

The uiCA link was just to show how out-of-order execution works; the exact timings don't apply as they're for Tiger Lake, not Zen 4. My assertion being that your diagram of spaces between instructions is completely meaningless with OoO execution, as those empty spaces will be filled with instructions from another iteration of the loop or other surrounding code.

Clang is extremely unroll-happy in general; from its perspective, it's ~free to do, has basically zero downsides (other than code size, but noone's benchmarking that too much) and can indeed maybe sometimes improve things slightly.


Is there a limit to how much the OoO execution buffer can handle? I don't want to dig up the benchmarks again, but I remember for very long calculations (like computing a bunch of exp(x[i]), which boil down to about 40 AVX512 instructions), the unrolling+reordering was very effective.


Zen 4 has two 32-entry schedulers for SIMD, which I think is the relevant buffer here (many of my numbers in this thread, including these, are coming from https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-fron...), which would get mostly exhausted by one iteration of a 40-instruction loop. Whereas the 5-instruction body here should easily manage many iterations.

I think, at least. Now I'm at the limit of my knowledge, it took me a bit to figure out whether the reorder buffer or scheduler is the relevant thing here, and I'm still not too sure. Though, either way, 32 is the smallest number of all for Zen 4, and still fits 6 iterations of the 5-instruction loop and should happily reorder them. (and the per-iteration-overhead scalar code goes to separate schedulers)


On clang masked load/store tail - there are some flags to change its behavior, but seems like nothing does just a masked tail, and what does exist still has work to do, e.g. here's a magic incantation that I found looking through LLVM's source code to make it do a single body for the entire loop: https://godbolt.org/z/1Kb6qTx31. It's hilariously inefficient though, and likely meant for SVE which is intentionally designed for things like this.


Clang rewrites my tolower64() into something rather more clever, and it adds a 2x unrolled version. (gcc keeps the object code much closer to what I wrote.) But from my point of view the important thing is that Clang retains the masked load and store for short string fragments. Its autovectorizer is not able to generate masked loads and stores.

I used Clang 11 for copybytes64() because it is unable to recognize the memcpy() idiom, whereas Clang 16 does turn it into memcpy() - which is slower!


Never heard about reading the masked area. Source?


AMD64 manual volume 4 (https://www.amd.com/content/dam/amd/en/documents/processor-t...):

> Exception and trap behavior for elements not selected for loading or storing from/to memory is implementation dependent. For instance, a given implementation may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written.


I wonder if it ever happens IRL, as this would severely restrict the usefulness.

EDIT: does not seem to be applicable to AVX-512, only to AVX 1.

According to this PDF by intel, AVX-512 suppresses faults for masked access.

https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=g...


Yeah the spikes in the “tolower1” line illustrate the 32 byte tail pretty nicely.

I should maybe draw a version of the chart covering just small strings, but it’s an SVG so you can zoom right in. The “tolower1” line shows relatively poor performance compared to “tolower64”, tho it is hard to see for strings less than 8 bytes.


The autovectorizer can, at times, when you check it produce reasonable simd code with a given compiler based on simple, clear C code. Is gcc as good in this case? Are previous versions of clang (that may be employed by users) going to work out as well. How do you check in your build that the compiler did the right thing? If you touch the C code in any way will it silently do something you don't want?

Autovectorization is great and improving.


See also, https://github.com/Daniel-Liu-c0deb0t/uwu

Using SIMD for making text into … uwu.


Now account for Unicode characters (ie, Â -> â) and I’ll be impressed.

Shame most programmers only care about ASCII. There is a whole world that exists outside of the standard [a-z,A-Z,0-9] character set


He wrote about tolower(), not towlower()/wcslwr().


C tolower() is locale aware and handles utf-8 as well as many other encodings


It can only handle single-byte character sets, not UTF-8


...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to got rid of complex instructions.

Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?


Complicated vector instructions like these are not really antithetical to RISC.

The core of modern RISC thought is basically: "The laws of physics mean that no matter how much hardware you throw at it, only some kinds of instructions can be implemented in a performant way. We should only include these kinds of instructions in the instruction set." Then you build more complex operations out of these simple building blocks, but the fact that every instruction provided can be reasonably implemented to run really fast, the CPU itself can be fast.

Masked vector adds belong in the set of instructions that can be implemented to be fast, and that's why they are included in the RVV RISC-V extension. An example of an instruction that cannot be implemented to be fast would be the humble x86 load+add, where you first look up a value in memory, and then add it to a register. The only reasonable way to implement this to be fast is to just split it into two separate operations which are also dispatched separately, and that is precisely what modern x86 does.


RISC-V does have RVV, which similarly can do SIMD, has masking, but also has a vector length separate from masks: https://godbolt.org/z/rrEW85snh. Complete with its own set of ~40000 C intrinsics (https://dzaima.github.io/intrinsics-viewer).

Though, granted, RVV is significantly more uniform than AVX-512 (albeit at the cost of not having some useful goodies).


RISC-V has SIMD extension as well. Even when there is no SIMD, prefetching or instruction selection/scheduling will have a big impact on the performance, so it is unlikely one can easily write a few lines of assembly and get to a similar level of performance.


I don't think RISC-V's SIMD extension is very popular. At least I can't think of any available core implementing it. The vector extension is much more common.


The "P" (Packed SIMD) extension is still under development. It uses GPRs and is intended for smaller cores for which V would be too heavyweight.

The proposal originates with Andes, and one of their own ISAs. They have several RISC-V cores with an early draft of it.


There seem to be a few Raspberry Pi style boards available with it. Bruce Hoult wrote about his RISC-V SIMD tolower() at https://lobste.rs/s/bfgsh6/tolower_with_avx_512#c_8wmpce


That's the vector extension (V) rather than packed SIMD (P).


Do the RISC-V vector instructions cover the whole gamut that x86 does? (or at least the modern AVX-512 / AVX-10 coding style)


RVV has: masking for everything (though for things like loop tail handling (or even main body) using VL is better and much nicer); all the usual int & FP widths; indexed (gather & scatter) & strided & segmented loads & stores (all potentially masked); all operations support all types where at all possible - including integer division of all widths, and three sign variations for the high half of the 128-bit result of a 64-bit int multiply; And (of course) has 8-bit shifts, which AVX-512 somehow doesn't have.

All while being scalable, i.e. minimum vector register width (VLEN) is 128-bit for the 'v' extension, but hardware can implement up to 65536-bit vectors (and software can choose to either pretend they're 128-bit, or can be written such that it portably scales automatically); and if you want more than 128 bits portably there's LMUL, allowing grouping registers up to 8 in a group, giving up to at-least-1024-bit registers.

For shuffles it has vrgather, which supports all element width lookups and can move any element to any other element (yes, including at LMUL=8, though as you can imagine it can be expected to slow down quadratically with LMUL; and could even become a problem at LMUL=1 for hardware with large VLEN, whenever that becomes a thing).


Thanks for those details, it sounds like it should be very nice for short strings, and more like SVE than AVX


Considering all x86 procwssors I know about use a risc architecture internally I am not sure what actual benefits you get from a cisc.


toLower() with RVV[0] has been implemented (by brucehoult).

0. https://lobste.rs/s/bfgsh6/tolower_with_avx_512#c_wqhwtp




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: