> Latin 1 standard is still in widespread use inside some systems (such as browsers)
That doesn't seem to be correct. UTF-8 is used by 98% of all the websites. I am not sure if it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.
One place I know where latin1 is still used is as an internal optimization in javascript engines. JS strings are composed of 16-bit values, but the vast majority of strings are ascii. So there's a motivation to store simpler strings using 1 byte per char.
However, once that optimization has been decided, there's no point in leaving the high bit unused, so the engines keep optimized "1-byte char" strings as Latin1.
> What advantage would this have over UTF-7, especially since the upper 128 characters wouldn't match their Unicode values?
(I'm going to assume you mean UTF-8 here rather than UTF-7, since UTF-7 is not really useful for anything; it's just a way to pack Unicode into only 7-bit ascii characters.)
Fixed width string encodings like Latin-1 let you directly index to a particular character (code point) within a string without having to iterate from the beginning of the string.
JavaScript was originally specified in terms of UCS-2, a 16-bit fixed-width encoding, as this was commonly used at the time in both Windows and Java. However, there are more than 64k characters across all the world's languages, so it eventually evolved to UTF-16, which allows for wide characters.
However, because of this history, indexing into a JavaScript string gives you a 16-bit code unit, which may be only part of a wide character. A string's length is defined in terms of 16-bit code units, but iterating over a string gives you full characters.
Using Latin-1 as an optimisation allows JavaScript to preserve the same semantics around indexing and length. While it does require translating 8 bit Latin-1 character codes to 16 bit code points, this can be done very quickly through a lookup table. This would not be possible with UTF-8 since it is not fixed width.
EDIT: A lookup table may not be required. I was confused by new TextDecoder('latin1') actually using windows-1252.
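Indeed, for genuine Latin-1 no table is needed at all: byte values 0x00-0xFF are exactly code points U+0000-U+00FF, so widening to a 16-bit code unit is a plain zero-extension (a table only becomes necessary if the 1-byte storage is really windows-1252, where e.g. 0x80 maps to U+20AC). A minimal sketch of the 1-byte-string idea, not any engine's actual layout:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of a "1-byte char" string: indexing and length behave exactly
    // as they would for 2-byte storage, because widening a Latin-1 byte to
    // a UTF-16 code unit is just a zero-extension.
    struct OneByteString {
        std::vector<uint8_t> bytes;  // Latin-1 storage, one byte per character

        char16_t code_unit_at(std::size_t i) const {
            return static_cast<char16_t>(bytes[i]);  // 0x00..0xFF -> U+0000..U+00FF
        }
        std::size_t length() const { return bytes.size(); }
    };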
More modern languages just use UTF-8 everywhere because it uses less space on average and UTF-16 doesn't save you from having to deal with wide characters.
And yet HTTP/1.1 headers should be sent in Latin1 (is this fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has special handling for Latin1 strings in JS, for performance reasons I assume.
> Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets.
In practice and by spec, HTTP headers should be ASCII encoded.
> Newly defined header fields SHOULD limit their field values to US-ASCII octets
ASCII octets! That means you SHOULD NOT send Latin1-encoded headers, the opposite of what pzmarzly was saying. I don't disagree that Latin-1 is a superset of ASCII or that backward compatibility was in mind, but that's not relevant to my response.
SHOULD is a recommendation, not a requirement, and it refers only to newly-defined header fields, not existing ones. The text implies that 8-bit characters in existing fields are to be interpreted as ISO-8859-1.
There is an RFC (2119) that specifies what SHOULD means in RFCs:
> SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
Web servers need to be able to receive and decode latin1 into utf-8 regardless of what the RFC recommends people send. The fact that it's going to become rarer over time to have the 8th bit set in headers means you can write a simpler algorithm than what Lemire did, one that assumes an ASCII average case. https://github.com/jart/cosmopolitan/blob/755ae64e73ef5ef7d1... That runs at 23 GB/s on my machine using just SSE2 (rather than AVX512). However it goes much slower if the text is full of European diacritics. Lemire's algorithm is better at decoding those.
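For anyone who doesn't want to click through, the general shape of such an ASCII-fast-path converter is roughly the following (a simplified sketch, not the linked cosmopolitan code, which is considerably more tuned):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <emmintrin.h>  // SSE2

    // Copy 16 ASCII bytes at a time; any block containing a high byte is
    // expanded with the plain scalar loop. dst needs room for 2*len bytes.
    static std::size_t latin1_to_utf8(const uint8_t* src, std::size_t len, uint8_t* dst) {
        std::size_t i = 0, out = 0;
        while (i + 16 <= len) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
            if (_mm_movemask_epi8(v) == 0) {          // no byte has the high bit set
                std::memcpy(dst + out, src + i, 16);  // ASCII is already valid UTF-8
                out += 16;
            } else {
                for (int k = 0; k < 16; k++) {        // expand this block byte by byte
                    uint8_t c = src[i + k];
                    if (c < 0x80) { dst[out++] = c; }
                    else { dst[out++] = 0xC0 | (c >> 6); dst[out++] = 0x80 | (c & 0x3F); }
                }
            }
            i += 16;
        }
        for (; i < len; i++) {                        // scalar tail (fewer than 16 bytes)
            uint8_t c = src[i];
            if (c < 0x80) { dst[out++] = c; }
            else { dst[out++] = 0xC0 | (c >> 6); dst[out++] = 0x80 | (c & 0x3F); }
        }
        return out;
    }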
Otherwise known as "Making other people's incompetence and inability to implement a specification your problem." Just because it's a widely quoted maxim doesn't make it good advice.
The spec may disagree, but webservers do sometimes send bytes outside the ASCII range, and the most sensible way to deal with that on the receiving side is still by treating them as latin1 to match (last I checked) what browsers do with it.
I do agree that latin1 headers shouldn't be _sent_ out though.
Only because those websites include `<meta charset="utf-8">`. Browsers don't use utf-8 unless you tell them to, so we tell them to. But there's an entire internet archive's worth of pages that don't tell them to.
Not including charset="utf-8" doesn't mean that the website is not UTF-8. Do you have a source on a significant percentage of websites being Latin-1 while omitting the charset encoding? I don't believe that's the case.
> Browsers don't use utf-8 unless you tell them to
This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text while omitting the charset. It will render correctly.
Answering your "do you have a source" question, yeah: "the entire history of the web prior to HTML5's release", which the internet has already forgotten is a rather recent thing (2008). And even then, it took a while for HTML5 to become the de facto format, because it took years before the majority of the web had changed its tooling over from HTML 4.01 to HTML5.
> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text
No, but I will create an HTML file with latin-1 text, because that's what we're discussing: HTML files that don't use UTF-8 (and so by definition don't contain UTF-8 either).
While modern browsers will guess the encoding by examining the content, if you make an html file that just has plain text, then it won't magically convert it to UTF-8: create a file with `<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>` in it. Load it in your browser through an http server (e.g. `python -m http.server`), and then hit up the dev tools console and look at `document.characterSet`.
Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."
That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.
<html>ä
<!DOCTYPE html><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä
Type data:text/html,<html> into your url bar and inspect that. This avoids the server messing with encoding values. And yes, here on my linux machine in firefox it is windows-1252 too.
(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)
But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.
Well, I'm on Linux - system encoding set to UTF-8 which is pretty much standard there.
But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.
The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.
I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252
edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.
Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today are using UTF-8 encoding, and Latin-1 is being phased out.
Btw I appreciate your edited response, but still you were factually incorrect about:
> Browsers don't use utf-8 unless you tell them to
Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.
> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8
You are incorrect here as well; try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252
> You are incorrect here as well; try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252
I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.
To begin, we have three '.html' files: one with ASCII-only characters, a second with two separate characters that are specifically latin1-encoded, and a third with those same latin1 characters but encoded using UTF-8. Those two characters are:
Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB - UTF-8 encoding: 0xC3 0x8B - https://www.compart.com/en/unicode/U+00CB
¥ - "Yen Sign" - Latin1 encoding: 0xA5 - UTF-8 encoding: 0xC2 0xA5 - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.
The full contents of my current folder are as follows:
$ ls -a .
. .. ascii.html latinone.html utf8.html
Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that all responses from the HTTP server do not contain a header implying the content type; we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check to make sure it's not being well intentioned and guessing the content type:
Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.
Firefox (v116.0.3):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
Chromium (v115.0.5790.170):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "macintosh"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
So in my testing, neither browser EVER guesses that any of these pages are UTF-8; both browsers seem to mostly default to assuming that if no content-type is set in the document or in the headers then the encoding is "windows-1252" (bar Chromium and the Latin1 characters, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".
> Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
While technically latin-1/iso-8859-1 is a different encoding than windows-1252, the html5 spec says browsers are supposed to treat latin1 as windows-1252.
The following .html file encoded in UTF-8, when loaded from disk in Google Chrome (so no server headers hinting anything), yields document.characterSet == "UTF-8". If you make it "a" instead of "ä" it becomes "windows-1252".
<html>ä
This renders correctly in Chrome and does not show mojibake as you might have expected from old browsers. Explicitly specifying a character set just ensures you're not relying on the browser's heuristics.
There may be a difference here between local and network, as well as if the multi-byte utf-8 character appears in the first 1024 bytes or how much network delay there is before that character appears.
The original claim was that browsers don't ever use UTF-8 unless you specify it. Then ko27 provided a counterexample that clearly shows that a browser can choose UTF-8 without you specifying it. You then said "I'm pretty sure this is incorrect"--which part? ko27's counterexample is correct; I tried it and it renders correctly as ko27 said. If you do it, the browser does choose UTF-8. I'm not sure where you're going with this now. This was a minimal counterexample for a narrow claim.
I think when most people say "web browsers do x" they mean when browsing the world wide web.
My (intended) claim is that in practice the statement is almost always untrue. There may be weird edge cases when loading from local disk where it is true sometimes, but not in a way that web developers will usually ever encounter, since you don't put websites on local disk.
This part of the html5 spec isn't binding, so who knows what different browsers do, but it is a recommendation of the spec that browsers should handle the charset of documents differently depending on whether they are on local disk or from the internet.
To quote: "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content." https://html.spec.whatwg.org/multipage/parsing.html#determin...
Fair enough. I intended only to test the specific narrow claim OP made that you had quoted, which seemed to be about a local file test. This shows it is technically true that browsers are capable of detecting UTF-8, but only in one narrow situation and not the one that's most interesting.
Be careful, since at least Chrome may choose a different charset if loading a file from disk versus from an HTTP URL (yes, this has tripped me up more than once).
I've observed Chrome to usually default to windows-1252 (latin1) for UTF-8 documents loaded from the network.
Be aware that the WHATWG Encoding specification [1] says that latin1, ISO-8859-1, etc. are aliases of the windows-1252 encoding, not of the proper latin1 encoding. As a result, browsers and operating systems will display those files differently! It also aliases the ASCII encoding to windows-1252.
That's not what your linked spec says. You can try it yourself, in any browser. If you omit the encoding the browser uses heuristics to guess, but it will always work if you write UTF-8 even without meta charset or encoding header.
I don't doubt browsers use heuristics. But spec-wise I think it's your turn to provide a reference in favour of a utf-8-is-default interpretation :)
The WHATWG HTML spec [1] has various heuristics it uses/specifies for detecting the character encoding.
In point 8, it says an implementation may use heuristics to detect the encoding. It has a note which states:
> The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective.
In point 9, the implementation can return an implementation or user-defined encoding. Here, it suggests a locale-based default encoding, including windows-1252 for "en".
As such, implementations may be capable of detecting/defaulting to UTF-8, but are equally likely to default to windows-1252, Shift_JIS, or another encoding.
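To make the "highly detectable bit pattern" remark concrete, here is a minimal check of the lead/continuation byte structure (my own sketch, not something from the spec or any browser); content containing bytes >= 0x80 that passes it is very likely UTF-8:

    #include <cstddef>
    #include <cstdint>

    // Checks only lead/continuation structure; a real validator would also
    // reject overlong forms and surrogate code points.
    static bool looks_like_utf8(const uint8_t* s, std::size_t n) {
        std::size_t i = 0;
        while (i < n) {
            uint8_t b = s[i];
            std::size_t extra;
            if (b < 0x80)              { i++; continue; }  // ASCII
            else if ((b >> 5) == 0x06) extra = 1;          // 110xxxxx
            else if ((b >> 4) == 0x0E) extra = 2;          // 1110xxxx
            else if ((b >> 3) == 0x1E) extra = 3;          // 11110xxx
            else return false;                             // stray continuation / invalid lead
            if (i + extra >= n) return false;              // truncated sequence
            for (std::size_t k = 1; k <= extra; k++)
                if ((s[i + k] >> 6) != 0x02) return false; // must be 10xxxxxx
            i += extra + 1;
        }
        return true;
    }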
No it isn't. My original point is that Latin-1 is used very rarely on the Internet and is being phased out. Now it's your turn to provide some references that a significant percentage of websites are omitting the encoding declaration (which is required by spec!) and using Latin-1.
> UTF-8 is the default character encoding for HTML5. However, it was used to be different. ASCII was the character set before it. And the ISO-8859-1 was the default character set from HTML 2.0 till HTML 4.01.
> My original point is that Latin-1 is used very rarely on the Internet and is being phased out.
Nobody disagrees with this, but this is a very different statement from what you said originally in regards to what the default is. Things can be phased out but still have the old default with no plan to change the default.
Re other sources - how about citing the actual spec instead of sketchy websites that seem likely to have incorrect information.
In countries communicating in non-English languages which are written in the latin script, there is a very large use of Latin-1. Even when Latin-1 is "phased out", there are tons and tons of documents and databases encoded in Latin-1, not to mention millions of ill-configured terminals.
Since values 0-127 are used far more frequently than 128-255 in latin-1, it might make more sense to have a fast path which simply loads 512 bits at a time (i.e. 64 bytes), detects whether any byte is 0x80 or above, and if not just outputs them verbatim.
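Roughly like this, I imagine (a sketch assuming AVX-512BW; the conversion of blocks that do contain high bytes, and the tail, are left to a general-purpose fallback):

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>  // requires AVX-512BW (e.g. compile with -mavx512bw)

    // Copy 64 ASCII bytes per iteration and return how many bytes were handled;
    // the caller passes the remainder to a general latin1 -> utf8 converter.
    static std::size_t copy_ascii_prefix(const uint8_t* src, std::size_t len, uint8_t* dst) {
        std::size_t i = 0;
        while (i + 64 <= len) {
            __m512i v = _mm512_loadu_si512(src + i);
            if (_mm512_movepi8_mask(v) != 0)  // some byte is 0x80 or above
                break;
            _mm512_storeu_si512(dst + i, v);  // pure ASCII: output verbatim
            i += 64;
        }
        return i;
    }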
I don’t know if the article has been updated since your comment, but this approach is discussed & benchmarked. For the benchmarked data set it’s a winner.
> A dual channel DDR4 system memory bandwidth is ~40GB/s, and DDR5 ~80GB/s.
It's impossible to saturate the memory bandwidth on a modern CPU with a single thread, even if all you do is reads with absolutely no processing. The bottleneck is how fast outstanding cache misses can be satisfied.
Is this useful? Most Latin 1 text is really Windows 1252, which has additional characters that don't have the same regular mapping to unicode. So this conversion will mangle curly quotes and the Euro sign, among others.
I'd say that the vast majority of Latin-1 that I've encountered is just ASCII. Where have you seen Windows-1252 presented with a Latin-1 header or other encoding declaration that declared it as Latin-1?
Windows 1252 served as Latin 1 used to be common enough that browsers interpret a Latin 1 declaration as Windows 1252. Nowadays it seems moderately common for such text to be served with a utf-8 declaration, so it gets mangled in other ways. Or it gets imported into a CMS with no conversion or the wrong conversion, which has a similar result.
You're right, ASCII is more common, but single-byte encoded prose that goes beyond ASCII is usually Windows 1252 in my experience.
> Windows 1252 served as Latin 1 used to be common enough that browsers interpret a Latin 1 declaration as Windows 1252.
Thank you, I had not encountered this and I was dealing a lot with improperly-encoded text when I was running gibberish.co.il (over a decade ago). What systems were serving this? IIS would be my first guess, an Intuit product would be my second.
Interesting to see how a non-AVX, non-branching version would do; you'd need a prefilled array of extra pointer advances (0/1) and seemingly two more for the bit-banging.
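Something like this, perhaps (just a sketch of the prefilled-table idea; it assumes the output buffer has 2*len bytes of room, since two bytes are stored unconditionally for every input byte):

    #include <cstddef>
    #include <cstdint>

    static uint8_t advance_tbl[256];  // 1 for 0x00..0x7F, 2 for 0x80..0xFF
    static uint8_t first_tbl[256];    // byte to store at out[0]
    static uint8_t second_tbl[256];   // byte to store at out[1] (dead store for ASCII)

    static void init_tables() {
        for (int c = 0; c < 256; c++) {
            bool hi = c >= 0x80;
            advance_tbl[c] = hi ? 2 : 1;
            first_tbl[c]   = hi ? (uint8_t)(0xC0 | (c >> 6)) : (uint8_t)c;
            second_tbl[c]  = (uint8_t)(0x80 | (c & 0x3F));
        }
    }

    // No branch on the high bit inside the loop: always store two bytes,
    // then advance the output pointer by 1 or 2.
    static std::size_t latin1_to_utf8_tables(const uint8_t* src, std::size_t len, uint8_t* dst) {
        uint8_t* out = dst;
        for (std::size_t i = 0; i < len; i++) {
            uint8_t c = src[i];
            out[0] = first_tbl[c];
            out[1] = second_tbl[c];
            out += advance_tbl[c];
        }
        return static_cast<std::size_t>(out - dst);
    }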
I had the same question, wondering what sort of workflow would have this task in the critical path. Maybe if the Library of Congress needs to change their default text encoding it'll save a minute or two?
The benchmark result is cool, but I'm curious how well it works with smaller outputs. When I've played around with SIMD stuff in the past, you can't necessarily go off of metrics like "bytes generated per cycle", because of how much CPU frequency can vary when using SIMD instructions, context-switching costs, and different thermal properties (e.g. maybe the work per cycle is higher with SIMD, but the CPU generates heat much more quickly and downclocks itself).
But where do you find it? Almost the entirety of internet is UTF-8. You can always transcode to Latin 1 for testing purposes, but that raises the question of practical benefits of this algorithm.
It's not necessarily about sustained throughput spent only in this routine. It can be small bursts of processing text segments that are then handed off to other parts of the program.
Once a program is optimized to the point where no leaf method / hot loop takes up more than a few percent of runtime, and algorithmic improvements aren't available or are extremely hard to implement, the speed of all the basic routines (memcpy, allocations, string processing, data structures) starts to matter. The constant factors elided by Big-O notation start to matter.
Another proof that Linus is not always right. There were many folks who just blindly regurgitated that AVX-512 is evil without actually knowing a thing about it.
No, this is just a case of the right answer changing over time, as good AVX-512 implementations became available, long after its introduction. And nothing in this article even comes close to addressing the main concern with the early AVX-512 implementation: the significant performance penalty it imposes on other code due to Skylake's slow power state transitions. Microbenchmarks like this made AVX-512 look good even on Skylake, because they ignore the broader effects.
So is your point that he is indeed always right? Or that he was right in that particular case (he was not)?
If you remember, Linus complained not about the particular implementation of AVX-512, but about the concept itself. It also looks kinda ignorant of him (and anyone else who thinks the same way) to believe that AVX-512 is only about 512-bit width or that it has no potential, when it is simply a better SIMD ISA compared to AVX1/2. What he did was just express himself in his trademark silly, edgy, maximalist way. It is an absolute pleasure to work with and gives a great performance boost, and he should have been more careful with his statements.
To add to your point, this benchmark would not have run on Skylake at all. It uses the _mm512_maskz_compress_epi8 instruction, which wasn't introduced until Ice Lake.
It kinda depends. I wouldn't be surprised if properly optimized AVX2 could get the same performance, since it looks like the operation is memory-bottlenecked.
Nah, AVX-512 is a more performant design due to its support for masking. It does not in fact depend on anything. Those who compare AVX2 favorably or equally with AVX-512 have never used either of them.
Which is I’m really looking forward to Intel’s upcoming AVX10 / APX extensions, created to alleviate the P/E core issues they had in Alder Lake (really, what the fuck was Intel thinking back then?) It’s still inferior compared to full AVX-512 since you don’t have 512-bit wide registers, but you still have 32 256-bit registers to play with as well as the rich instruction set of AVX-512 (with all that masking and gather/scatter ops and that jazz…) so much better than AVX2.
Now hearing that the upcoming Ryzen is also going with the hybrid P/E core approach I wonder if AMD is also preparing to adopt these instructions as well…
The thing about Alder Lake AVX-512 is (actually was, but I still occasionally run it with AVX-512 enabled for some specific tasks) that Alder Lake is the only consumer-grade CPU that supports AVX-512 FP16. It is very performant for ML tasks; not like a real GPU, obviously, but much, much easier to use.
Besides, Intel does not guarantee compatibility of AVX10 with AVX-512/256.
Every time someone writes some really carefully micro-optimized piece of code like this, I worry that the implementation won't be shared with the whole world.
This code only makes people's lives better if many languages and frameworks that translate latin-1 to utf8 are updated to have this new faster implementation.
If this took 3 days to write and benchmark, then to save 3 days of human time, we probably need to get this into the hands of hundreds of millions of people, saving each person a few hundred microseconds.
The author is a French Canadian academic at Université du Québec à Montréal. He is one of the more famous figures in computer science in all of Canada, with over 5000 citations (which is stretching the meaning of famous, but still.) This is not closed source work optimizing for some company product, this is research for publication on his blog or in computer science journals.
He’s one of the most famous computer scientists in general!
The audience for wicked-clever, low/no branch, cache aware, SIMD sorcery is admittedly not everyone, but if you end up with that kind of problem, this is a go to!
> I worry that the implementation won't be shared with the whole world.
Considering the author also created https://github.com/simdutf/simdutf it's likely used or will be used in NodeJs amongst other things. Is that good enough?
> This code only makes people's lives better if many languages and frameworks that translates latin-1 to utf8 are updated to have this new faster implementation.
Except CPUs evolve, and what was once a fast way of doing things may no longer be very fast. And with ASM you get no compiler to generate better-targeted instructions.
I've seen many instances where significant performance was gained by swapping out an old hand-written ASM routine for a plain language version.
If you ever add some optimized ASM to your code, do a performance check at startup or similar, and have the plain language version as a fallback.
I'm talking about which instructions and idioms are optimal. AFAIK, with intrinsics the compiler won't completely change what you've written.
Back in the day, REP MOVSB was the fastest way to copy bytes; then the Pentium came and rolling your own loop was better. Then CPUs improved and REP MOVSB was suddenly better again[1], for those CPUs. And then it changed again...
Similar story for other idioms where implementation details on CPUs change. Compilers can respond and target your exact CPU.
In terms of being broadly available, most of AVX-512 (ER, PF, 4FMAPS, and 4VNNIW haven't been available on any new hardware since 2017) is available on basically any Intel cpu manufactured since 2020 as well as on all AMD Zen4 (2022 and on) cpus.
I can't speak to being error free or other issues but it should at the very least be present on any modern desktop, laptop, or server x86 CPU you could buy today.
Edit: I forgot to mention but Intel's Alder lake CPUs only have partial support presumably due to some issue with E cores. I'd guess Intel will get their shit together eventually wrt this now that AMD is shipping all their hardware with this instruction set.
Intel seems to be going for market segmentation, with AVX-512 only available on their server CPUs. The option to enable AVX-512 has been removed from Alder Lake CPUs since 2022, and there is no AVX-512 on Raptor Lake.
AMD also keeps making and selling Zen 3 and Zen 2 chips as lower-cost products, and those do not have AVX-512.
With AVX10 intel will make the instructions available again on all segments. SIMD register width will vary between cores but the instructions will be there.
I don't think it was intentional market segmentation, just poor planning: the whole heterogeneous cores strategy seems to have been thrown together in a hurry and they didn't have time to add AVX-512 to their Atom cores in an area-efficient way (so as not to negate the point of having E-cores).
>most of AVX-512 is available on basically any Intel cpu manufactured since 2020
That's incorrect. On the consumer CPU side, Intel introduced AVX-512 for one generation in 2021 (Rocket Lake), but then removed AVX-512 from the subsequent Alder Lake using BIOS updates, and fused it off in later revisions. It's also absent from the current Raptor Lake. So actually it's only available on Intel's server-grade CPUs.
>Edit: I forgot to mention but Intel's Alder lake CPUs only have partial support presumably due to some issue with E cores.
The latest Intel architecture (Sapphire Rapids) support it without downclocking. AMD Zen 4 also supports it, although their implementation is double pumped, not sure what the real world performance impact of that is.
There is a huge confusion about this "double pumped" thing.
All that this means is that Zen 4 uses the same execution units both for 256-bit operations and for 512-bit operations. This means that the throughput in instructions per cycle for 512-bit operations is half of that for 256-bit operations, but the throughput in bytes per cycle is the same.
However the 512-bit operations need fewer resources for instruction fetching and decoding and for micro-operation storing and dispatching, so in most cases using 512-bit instructions on Zen 4 provides a big speed-up.
Even if Zen 4 is "double pumped", its 256-bit throughput is higher than that of Sapphire Rapids, so after dividing by two, for most instructions it has exactly the same 512-bit throughput as Sapphire Rapids, i.e. two 512-bit register-register instructions per cycle.
The only exceptions are that Sapphire Rapids (with the exception of the cheap SKUs) can do 2 FMA instructions per cycle, while Zen 4 can do only 1 FMA + 1 FADD instructions per cycle, and that Sapphire Rapids has a double throughput for loads and stores from the L1 cache memory. There are also a few 512-bit instructions where Zen 4 has better throughput or latency than Sapphire Rapids, e.g. some of the shuffles.
It's unlikely that this makes anyone's life better. It is more a curiosity and maybe a teachable thing on how to do SIMD. I would venture the guess that there are very few workloads that require this conversion for more than a few KB, and over time as the world migrates to Unicode it will be less and less.
That doesn't seem to be correct. UTF-8 is used by 98% of all the websites. I am not sure if it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.
https://w3techs.com/technologies/details/en-utf8