> Latin 1 standard is still in widespread use inside some systems (such as browsers)
That doesn't seem to be correct. UTF-8 is used by 98% of all the websites. I am not sure if it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.
One place I know where latin1 is still used is as an internal optimization in javascript engines. JS strings are composed of 16-bit values, but the vast majority of strings are ascii. So there's a motivation to store simpler strings using 1 byte per char.
However, once that optimization has been decided, there's no point in leaving the high bit unused, so the engines keep optimized "1-byte char" strings as Latin1.
> What advantage would this have over UTF-7, especially since the upper 128 characters wouldn't match their Unicode values?
(I'm going to assume you mean UTF-8 here rather than UTF-7, since UTF-7 is not really useful for anything; it's just a way to pack Unicode into only 7-bit ascii characters.)
Fixed width string encodings like Latin-1 let you directly index to a particular character (code point) within a string without having to iterate from the beginning of the string.
JavaScript was originally specified in terms of UCS-2, a 16-bit fixed-width encoding, as this was commonly used at the time in both Windows and Java. However, there are more than 64k characters across all the world's languages, so it eventually evolved to UTF-16, which allows for wide characters.
However, because of this history, indexing into a JavaScript string gives you a 16-bit code unit, which may be only part of a wide character. A string's length is defined in terms of 16-bit code units, but iterating over a string gives you full characters.
Using Latin-1 as an optimisation allows JavaScript to preserve the same semantics around indexing and length. While it does require translating 8 bit Latin-1 character codes to 16 bit code points, this can be done very quickly through a lookup table. This would not be possible with UTF-8 since it is not fixed width.
EDIT: A lookup table may not be required. I was confused by new TextDecoder('latin1') actually using windows-1252.
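Indeed, for genuine Latin-1 no table is needed at all: byte values 0x00-0xFF are exactly code points U+0000-U+00FF, so widening to a 16-bit code unit is a plain zero-extension (a table only becomes necessary if the 1-byte storage is really windows-1252, where e.g. 0x80 maps to U+20AC). A minimal sketch of the 1-byte-string idea, not any engine's actual layout:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch of a "1-byte char" string: indexing and length behave exactly
    // as they would for 2-byte storage, because widening a Latin-1 byte to
    // a UTF-16 code unit is just a zero-extension.
    struct OneByteString {
        std::vector<uint8_t> bytes;  // Latin-1 storage, one byte per character

        char16_t code_unit_at(std::size_t i) const {
            return static_cast<char16_t>(bytes[i]);  // 0x00..0xFF -> U+0000..U+00FF
        }
        std::size_t length() const { return bytes.size(); }
    };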
More modern languages just use UTF-8 everywhere because it uses less space on average and UTF-16 doesn't save you from having to deal with wide characters.
And yet HTTP/1.1 headers should be sent in Latin1 (is this fixed in HTTP/2 or HTTP/3?). And WebKit's JavaScriptCore has special handling for Latin1 strings in JS, for performance reasons I assume.
> Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets.
In practice and by spec, HTTP headers should be ASCII encoded.
> Newly defined header fields SHOULD limit their field values to US-ASCII octets
ASCII octets! That means you SHOULD NOT send Latin1-encoded headers, the opposite of what pzmarzly was saying. I don't disagree that Latin-1 is a superset of ASCII or that backward compatibility was in mind, but that's not relevant to my response.
SHOULD is a recommendation, not a requirement, and it refers only to newly-defined header fields, not existing ones. The text implies that 8-bit characters in existing fields are to be interpreted as ISO-8859-1.
There is an RFC (2119) that specifies what SHOULD means in RFCs:
> SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
Web servers need to be able to receive and decode latin1 into utf-8 regardless of what the RFC recommends people send. The fact that it's going to become rarer over time to have the 8th bit set in headers means you can write a simpler algorithm than what Lemire did, one that assumes an ASCII average case. https://github.com/jart/cosmopolitan/blob/755ae64e73ef5ef7d1... That runs at 23 GB/s on my machine using just SSE2 (rather than AVX512). However it goes much slower if the text is full of European diacritics. Lemire's algorithm is better at decoding those.
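For anyone who doesn't want to click through, the general shape of such an ASCII-fast-path converter is roughly the following (a simplified sketch, not the linked cosmopolitan code, which is considerably more tuned):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <emmintrin.h>  // SSE2

    // Copy 16 ASCII bytes at a time; any block containing a high byte is
    // expanded with the plain scalar loop. dst needs room for 2*len bytes.
    static std::size_t latin1_to_utf8(const uint8_t* src, std::size_t len, uint8_t* dst) {
        std::size_t i = 0, out = 0;
        while (i + 16 <= len) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
            if (_mm_movemask_epi8(v) == 0) {          // no byte has the high bit set
                std::memcpy(dst + out, src + i, 16);  // ASCII is already valid UTF-8
                out += 16;
            } else {
                for (int k = 0; k < 16; k++) {        // expand this block byte by byte
                    uint8_t c = src[i + k];
                    if (c < 0x80) { dst[out++] = c; }
                    else { dst[out++] = 0xC0 | (c >> 6); dst[out++] = 0x80 | (c & 0x3F); }
                }
            }
            i += 16;
        }
        for (; i < len; i++) {                        // scalar tail (fewer than 16 bytes)
            uint8_t c = src[i];
            if (c < 0x80) { dst[out++] = c; }
            else { dst[out++] = 0xC0 | (c >> 6); dst[out++] = 0x80 | (c & 0x3F); }
        }
        return out;
    }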
Otherwise known as "Making other people's incompetence and inability to implement a specification your problem." Just because it's a widely quoted maxim doesn't make it good advice.
The spec may disagree, but webservers do sometimes send bytes outside the ASCII range, and the most sensible way to deal with that on the receiving side is still by treating them as latin1 to match (last I checked) what browsers do with it.
I do agree that latin1 headers shouldn't be _sent_ out though.
Only because those websites include `<meta charset="utf-8">`. Browsers don't use utf-8 unless you tell them to, so we tell them to. But there's an entire internet archive's worth of pages that don't tell them to.
Not including charset="utf-8" doesn't mean that the website is not UTF-8. Do you have a source on a significant percentage of websites being Latin-1 while omitting the charset encoding? I don't believe that's the case.
> Browsers don't use utf-8 unless you tell them to
This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text while omitting the charset. It will render correctly.
Answering your "do you have a source" question, yeah: "the entire history of the web prior to HTML5's release", which the internet has already forgotten is a rather recent thing (2008). And even then, it took a while for HTML5 to become the de facto format, because it took years before the majority of the web had changed its tooling over from HTML 4.01 to HTML5.
> This is wrong. You can prove this very easily by creating a HTML file with UTF-8 text
No, but I will create an HTML file with latin-1 text, because that's what we're discussing: HTML files that don't use UTF-8 (and so by definition don't contain UTF-8 either).
While modern browsers will guess the encoding by examining the content, if you make an html file that just has plain text, then it won't magically convert it to UTF-8: create a file with `<html><head><title>encoding check</title></head><body><h1>Not much here, just plain text</h1><p>More text that's not special</p></body></html>` in it. Load it in your browser through an http server (e.g. `python -m http.server`), and then hit up the dev tools console and look at `document.characterSet`.
Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
Chromium (and I'm sure other browsers, but I didn't test) will sniff character set heuristically regardless of the HTML version or quirks mode. It's happy to choose UTF-8 if it sees something UTF-8-like in there. I don't know how to square this with your earlier claim of "Browsers don't use utf-8 unless you tell them to."
That is, the following UTF-8 encoded .html files all produce document.characterSet == "UTF-8" and render as expected without mojibake, despite not saying anything about UTF-8. Change "ä" to "a" to get windows-1252 again.
<html>ä
<!DOCTYPE html><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html>ä
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html>ä
Type data:text/html,<html> into your url bar and inspect that. This avoids the server messing with encoding values. And yes, here on my linux machine in firefox it is windows-1252 too.
(You can type the complete document, but <html> is sufficient. Browsers autocomplete a valid document. BTW, data:text/html,<html contenteditable> is something I use quite a lot)
But yeah, I think windows-1252 is standard for quirks mode, for historical reasons.
Well, I'm on Linux - system encoding set to UTF-8 which is pretty much standard there.
But I think the "windows-1252 for quirks" is just driven by what was dominant back when the majority of quirky HTML was generated decades ago.
The historical (and present?) default is to use the local character set, which on US Windows is Windows-1252, but for example on Japanese Windows is Shift-JIS. The expectation is that users will tend to view web pages from their region.
I'm in Japan on a Mac with the OS language set to Japanese. Safari gives me Shift_JIS, but Chrome and Firefox give me windows-1252
edit: Trying data:text/html,<html>日本語 makes Chrome also use Shift_JIS, resulting in mojibake as it's actually UTF-8. Firefox shows a warning about it guessing the character set, and then it chooses windows-1252 and displays more garbage.
Okay, it's good that we agree then on my original premise: the vast majority of websites (by quantity and popularity) on the Internet today are using UTF-8 encoding, and Latin-1 is being phased out.
Btw I appreciate your edited response, but still you were factually incorrect about:
> Browsers don't use utf-8 unless you tell them to
Browsers can use UTF-8 even if we don't tell them. I am already aware of the extra heuristics you wrote about.
> HTML file with latin-1 ... which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8
You are incorrect here as well; try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252
> You are incorrect here as well; try using some latin-1 special character like "ä" and you will see that browsers default to a document.characterSet of UTF-8, not windows-1252
I decided to try this experimentally. In my findings, if neither the server nor the page contents indicate that a file is UTF-8, then the browser NEVER defaults to setting document.characterSet to UTF-8, instead basically always assuming that it's "windows-1252" a.k.a. "latin1". Read on for my methodology, an exact copy of my test data, and some particular oddities at the end.
To begin, we have three '.html' files: one with ASCII-only characters, a second with two separate characters that are specifically latin1-encoded, and a third with those same latin1 characters but encoded using UTF-8. Those two characters are:
Ë - "Latin Capital Letter E with Diaeresis" - Latin1 encoding: 0xCB - UTF-8 encoding: 0xC3 0x8B - https://www.compart.com/en/unicode/U+00CB
¥ - "Yen Sign" - Latin1 encoding: 0xA5 - UTF-8 encoding: 0xC2 0xA5 - https://www.compart.com/en/unicode/U+00A5
To avoid copy-paste errors around encoding, I've dumped the contents of each file as "hexdumps", which you can transform back into their binary form by feeding the hexdump form into the command 'xxd -r -p -'.
The full contents of my current folder are as follows:
$ ls -a .
. .. ascii.html latinone.html utf8.html
Now that we have our test files, we can serve them via a very basic HTTP server. But first, we must verify that all responses from the HTTP server do not contain a header implying the content type; we want the browser to have to make a guess based on nothing but the contents of the file. So, we run the server and check to make sure it's not being well intentioned and guessing the content type:
Now we've verified that we won't have our observations muddled by the server doing its own detection, so our results from the browser should be able to tell us conclusively if the presence of a latin1 character causes the browser to use UTF-8 encoding. To test, I loaded each web page in Firefox and Chromium and checked what `document.characterSet` said.
Firefox (v116.0.3):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
Chromium (v115.0.5790.170):
http://127.0.0.1:8000/ascii.html result of `document.characterSet`: "windows-1252"
http://127.0.0.1:8000/latinone.html result of `document.characterSet`: "macintosh"
http://127.0.0.1:8000/utf8.html result of `document.characterSet`: "windows-1252"
So in my testing, neither browser EVER guesses that any of these pages are UTF-8; both browsers seem to mostly default to assuming that if no content-type is set in the document or in the headers then the encoding is "windows-1252" (bar Chromium and the Latin1 characters, which bizarrely caused Chromium to guess that it's "macintosh" encoded?). Also note that if I add the exact character you proposed (ä) to the text body, it still doesn't cause the browser to start assuming everything is UTF-8; the only change is that Chromium starts to think the latinone.html file is also "windows-1252" instead of "macintosh".
> Both firefox and chrome give me "windows-1252" on Windows, for which the "windows" part in the name is of course irrelevant; what matters is what it's not, which is that it's not UTF-8, because the content has nothing in it to warrant UTF-8.
While technically latin-1/iso-8859-1 is a different encoding than windows-1252, the html5 spec says browsers are supposed to treat latin1 as windows-1252.
The following .html file encoded in UTF-8, when loaded from disk in Google Chrome (so no server headers hinting anything), yields document.characterSet == "UTF-8". If you make it "a" instead of "ä" it becomes "windows-1252".
<html>ä
This renders correctly in Chrome and does not show mojibake as you might have expected from old browsers. Explicitly specifying a character set just ensures you're not relying on the browser's heuristics.
There may be a difference here between local and network, as well as if the multi-byte utf-8 character appears in the first 1024 bytes or how much network delay there is before that character appears.
The original claim was that browsers don't ever use UTF-8 unless you specify it. Then ko27 provided a counterexample that clearly shows that a browser can choose UTF-8 without you specifying it. You then said "I'm pretty sure this is incorrect"--which part? ko27's counterexample is correct; I tried it and it renders correctly as ko27 said. If you do it, the browser does choose UTF-8. I'm not sure where you're going with this now. This was a minimal counterexample for a narrow claim.
I think when most people say "web browsers do x" they mean when browsing the world wide web.
My (intended) claim is that in practice the statement is almost always untrue. There may be weird edge cases when loading from local disk where it is true sometimes, but not in a way that web developers will usually ever encounter, since you don't put websites on local disk.
This part of the html5 spec isn't binding, so who knows what different browsers do, but it is a recommendation of the spec that browsers should handle the charset of documents differently depending on whether they are on local disk or from the internet.
To quote: "User agents are generally discouraged from attempting to autodetect encodings for resources obtained over the network, since doing so involves inherently non-interoperable heuristics. Attempting to detect encodings based on an HTML document's preamble is especially tricky since HTML markup typically uses only ASCII characters, and HTML documents tend to begin with a lot of markup rather than with text content." https://html.spec.whatwg.org/multipage/parsing.html#determin...
Fair enough. I intended only to test the specific narrow claim OP made that you had quoted, which seemed to be about a local file test. This shows it is technically true that browsers are capable of detecting UTF-8, but only in one narrow situation and not the one that's most interesting.
Be careful, since at least Chrome may choose a different charset if loading a file from disk versus from an HTTP URL (yes, this has tripped me up more than once).
I've observed Chrome to usually default to windows-1252 (latin1) for UTF-8 documents loaded from the network.
Be aware that the WHATWG Encoding specification [1] says that latin1, ISO-8859-1, etc. are aliases of the windows-1252 encoding, not of the proper latin1 encoding. As a result, browsers and operating systems will display those files differently! It also aliases the ASCII encoding to windows-1252.
That's not what your linked spec says. You can try it yourself, in any browser. If you omit the encoding the browser uses heuristics to guess, but it will always work if you write UTF-8 even without meta charset or encoding header.
I don't doubt browsers use heuristics. But spec-wise I think it's your turn to provide a reference in favour of a utf-8-is-default interpretation :)
The WHATWG HTML spec [1] has various heuristics it uses/specifies for detecting the character encoding.
In point 8, it says an implementation may use heuristics to detect the encoding. It has a note which states:
> The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. When a user agent can examine the whole file, rather than just the preamble, detecting for UTF-8 specifically can be especially effective.
In point 9, the implementation can return an implementation or user-defined encoding. Here, it suggests a locale-based default encoding, including windows-1252 for "en".
As such, implementations may be capable of detecting/defaulting to UTF-8, but are equally likely to default to windows-1252, Shift_JIS, or another encoding.
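To make the "highly detectable bit pattern" remark concrete, here is a minimal check of the lead/continuation byte structure (my own sketch, not something from the spec or any browser); content containing bytes >= 0x80 that passes it is very likely UTF-8:

    #include <cstddef>
    #include <cstdint>

    // Checks only lead/continuation structure; a real validator would also
    // reject overlong forms and surrogate code points.
    static bool looks_like_utf8(const uint8_t* s, std::size_t n) {
        std::size_t i = 0;
        while (i < n) {
            uint8_t b = s[i];
            std::size_t extra;
            if (b < 0x80)              { i++; continue; }  // ASCII
            else if ((b >> 5) == 0x06) extra = 1;          // 110xxxxx
            else if ((b >> 4) == 0x0E) extra = 2;          // 1110xxxx
            else if ((b >> 3) == 0x1E) extra = 3;          // 11110xxx
            else return false;                             // stray continuation / invalid lead
            if (i + extra >= n) return false;              // truncated sequence
            for (std::size_t k = 1; k <= extra; k++)
                if ((s[i + k] >> 6) != 0x02) return false; // must be 10xxxxxx
            i += extra + 1;
        }
        return true;
    }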
No it isn't. My original point is that Latin-1 is used very rarely on the Internet and is being phased out. Now it's your turn to provide some references that a significant percentage of websites are omitting the encoding declaration (which is required by spec!) and using Latin-1.
> UTF-8 is the default character encoding for HTML5. However, it was used to be different. ASCII was the character set before it. And the ISO-8859-1 was the default character set from HTML 2.0 till HTML 4.01.
> My original point is that Latin-1 is used very rarely on the Internet and is being phased out.
Nobody disagrees with this, but this is a very different statement from what you said originally in regards to what the default is. Things can be phased out but still have the old default with no plan to change the default.
Re other sources - how about citing the actual spec instead of sketchy websites that seem likely to have incorrect information.
In countries communicating in non-English languages which are written in the latin script, there is a very large use of Latin-1. Even when Latin-1 is "phased out", there are tons and tons of documents and databases encoded in Latin-1, not to mention millions of ill-configured terminals.
Since values 0-127 are used far more frequently than 128-255 in latin-1, it might make more sense to have a fast path which simply loads 512 bits at a time (i.e. 64 bytes), detects whether any byte is 0x80 or above, and if not just outputs them verbatim.
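Roughly like this, I imagine (a sketch assuming AVX-512BW; the conversion of blocks that do contain high bytes, and the tail, are left to a general-purpose fallback):

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>  // requires AVX-512BW (e.g. compile with -mavx512bw)

    // Copy 64 ASCII bytes per iteration and return how many bytes were handled;
    // the caller passes the remainder to a general latin1 -> utf8 converter.
    static std::size_t copy_ascii_prefix(const uint8_t* src, std::size_t len, uint8_t* dst) {
        std::size_t i = 0;
        while (i + 64 <= len) {
            __m512i v = _mm512_loadu_si512(src + i);
            if (_mm512_movepi8_mask(v) != 0)  // some byte is 0x80 or above
                break;
            _mm512_storeu_si512(dst + i, v);  // pure ASCII: output verbatim
            i += 64;
        }
        return i;
    }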
I don’t know if the article has been updated since your comment, but this approach is discussed & benchmarked. For the benchmarked data set it’s a winner.
> A dual channel DDR4 system memory bandwidth is ~40GB/s, and DDR5 ~80GB/s.
It's impossible to saturate the memory bandwidth on a modern CPU with a single thread, even if all you do is reads with absolutely no processing. The bottleneck is how fast outstanding cache misses can be satisfied.
Is this useful? Most Latin 1 text is really Windows 1252, which has additional characters that don't have the same regular mapping to unicode. So this conversion will mangle curly quotes and the Euro sign, among others.
I'd say that the vast majority of Latin-1 that I've encountered is just ASCII. Where have you seen Windows-1252 presented with a Latin-1 header or other encoding declaration that declared it as Latin-1?
Windows 1252 served as Latin 1 used to be common enough that browsers interpret a Latin 1 declaration as Windows 1252. Nowadays it seems moderately common for such text to be served with a utf-8 declaration, so it gets mangled in other ways. Or it gets imported into a CMS with no conversion or the wrong conversion, which has a similar result.
You're right, ASCII is more common, but single-byte encoded prose that goes beyond ASCII is usually Windows 1252 in my experience.
> Windows 1252 served as Latin 1 used to be common enough that browsers interpret a Latin 1 declaration as Windows 1252.
Thank you, I had not encountered this and I was dealing a lot with improperly-encoded text when I was running gibberish.co.il (over a decade ago). What systems were serving this? IIS would be my first guess, an Intuit product would be my second.
Interesting to see how a non-AVX, non-branching version would do; you'd need a prefilled array of extra pointer advances (0/1) and seemingly two more for the bit-banging.
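Something like this, perhaps (just a sketch of the prefilled-table idea; it assumes the output buffer has 2*len bytes of room, since two bytes are stored unconditionally for every input byte):

    #include <cstddef>
    #include <cstdint>

    static uint8_t advance_tbl[256];  // 1 for 0x00..0x7F, 2 for 0x80..0xFF
    static uint8_t first_tbl[256];    // byte to store at out[0]
    static uint8_t second_tbl[256];   // byte to store at out[1] (dead store for ASCII)

    static void init_tables() {
        for (int c = 0; c < 256; c++) {
            bool hi = c >= 0x80;
            advance_tbl[c] = hi ? 2 : 1;
            first_tbl[c]   = hi ? (uint8_t)(0xC0 | (c >> 6)) : (uint8_t)c;
            second_tbl[c]  = (uint8_t)(0x80 | (c & 0x3F));
        }
    }

    // No branch on the high bit inside the loop: always store two bytes,
    // then advance the output pointer by 1 or 2.
    static std::size_t latin1_to_utf8_tables(const uint8_t* src, std::size_t len, uint8_t* dst) {
        uint8_t* out = dst;
        for (std::size_t i = 0; i < len; i++) {
            uint8_t c = src[i];
            out[0] = first_tbl[c];
            out[1] = second_tbl[c];
            out += advance_tbl[c];
        }
        return static_cast<std::size_t>(out - dst);
    }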
I had the same question, wondering what sort of workflow would have this task in the critical path. Maybe if the Library of Congress needs to change their default text encoding it'll save a minute or two?
The benchmark result is cool, but I'm curious how well it works with smaller outputs. When I've played around with SIMD stuff in the past, you can't necessarily go off of metrics like "bytes generated per cycle", because of how much CPU frequency can vary when using SIMD instructions, context-switching costs, and different thermal properties (e.g. maybe the work per cycle is higher with SIMD, but the CPU generates heat much more quickly and downclocks itself).
But where do you find it? Almost the entirety of internet is UTF-8. You can always transcode to Latin 1 for testing purposes, but that raises the question of practical benefits of this algorithm.
It's not necessarily about sustained throughput spent only in this routine. It can be small bursts of processing text segments that are then handed off to other parts of the program.
Once a program is optimized to the point where no leaf method / hot loop takes up more than a few percent of runtime, and algorithmic improvements aren't available or are extremely hard to implement, the speed of all the basic routines (memcpy, allocations, string processing, data structures) starts to matter. The constant factors elided by Big-O notation start to matter.
Another proof that Linus is not always right. There were many folks who just blindly regurgitated that AVX-512 is evil without actually knowing a thing about it.
No, this is just a case of the right answer changing over time, as good AVX-512 implementations became available, long after its introduction. And nothing in this article even comes close to addressing the main concern with the early AVX-512 implementation: the significant performance penalty it imposes on other code due to Skylake's slow power state transitions. Microbenchmarks like this made AVX-512 look good even on Skylake, because they ignore the broader effects.
So is your point that he is indeed always right? Or that he was right in that particular case (he was not)?
If you remember, Linus complained not about the particular implementation of AVX-512, but about the concept itself. It also looks kinda ignorant of him (and anyone else who thinks the same way) to believe that AVX-512 is only about 512-bit width or that it has no potential, when it is simply a better SIMD ISA compared to AVX1/2. What he did was just express himself in his trademark silly, edgy, maximalist way. It is an absolute pleasure to work with and gives a great performance boost, and he should have been more careful with his statements.
To add to your point, this benchmark would not have run on Skylake at all. It uses the _mm512_maskz_compress_epi8 instruction, which wasn't introduced until Ice Lake.
It kinda depends. I wouldn't be surprised if properly optimized AVX2 could get the same performance, since it looks like the operation is memory-bottlenecked.
Nah, AVX-512 is a more performant design due to its support for masking. It does not in fact depend on anything. Those who compare AVX2 favorably or equally with AVX-512 have never used either of them.
Which is I’m really looking forward to Intel’s upcoming AVX10 / APX extensions, created to alleviate the P/E core issues they had in Alder Lake (really, what the fuck was Intel thinking back then?) It’s still inferior compared to full AVX-512 since you don’t have 512-bit wide registers, but you still have 32 256-bit registers to play with as well as the rich instruction set of AVX-512 (with all that masking and gather/scatter ops and that jazz…) so much better than AVX2.
Now hearing that the upcoming Ryzen is also going with the hybrid P/E core approach I wonder if AMD is also preparing to adopt these instructions as well…
The thing about Alder Lake AVX-512 is (actually was, but I still occasionally run it with AVX-512 enabled for some specific tasks) that Alder Lake is the only consumer-grade CPU that supports AVX-512 FP16. It is very performant for ML tasks; not like a real GPU, obviously, but much, much easier to use.
Besides, Intel does not guarantee compatibility of AVX10 with AVX-512/256.
Every time someone writes some really carefully micro-optimized piece of code like this, I worry that the implementation won't be shared with the whole world.
This code only makes people's lives better if many languages and frameworks that translate latin-1 to utf8 are updated to have this new faster implementation.
If this took 3 days to write and benchmark, then to save 3 days of human time, we probably need to get this into the hands of hundreds of millions of people, saving each person a few hundred microseconds.
The author is a French Canadian academic at Université du Québec à Montréal. He is one of the more famous figures in computer science in all of Canada, with over 5000 citations (which is stretching the meaning of famous, but still.) This is not closed source work optimizing for some company product, this is research for publication on his blog or in computer science journals.
He’s one of the most famous computer scientists in general!
The audience for wicked-clever, low/no branch, cache aware, SIMD sorcery is admittedly not everyone, but if you end up with that kind of problem, this is a go to!
> I worry that the implementation won't be shared with the whole world.
Considering the author also created https://github.com/simdutf/simdutf it's likely used or will be used in NodeJs amongst other things. Is that good enough?
> This code only makes people's lives better if many languages and frameworks that translates latin-1 to utf8 are updated to have this new faster implementation.
Except CPUs evolve, and what was once a fast way of doing things may no longer be very fast. And with ASM you get no compiler to generate better-targeted instructions.
I've seen many instances where significant performance was gained by swapping out an old hand-written ASM routine for a plain language version.
If you ever add some optimized ASM to your code, do a performance check at startup or similar, and have the plain language version as a fallback.
I'm talking about which instructions and idioms are optimal. AFAIK, with intrinsics the compiler won't completely change what you've written.
Back in the day, REP MOVSB was the fastest way to copy bytes; then the Pentium came and rolling your own loop was better. Then CPUs improved and REP MOVSB was suddenly better again[1], for those CPUs. And then it changed again...
Similar story for other idioms where implementation details on CPUs change. Compilers can respond and target your exact CPU.
In terms of being broadly available, most of AVX-512 (ER, PF, 4FMAPS, and 4VNNIW haven't been available on any new hardware since 2017) is available on basically any Intel cpu manufactured since 2020 as well as on all AMD Zen4 (2022 and on) cpus.
I can't speak to being error free or other issues but it should at the very least be present on any modern desktop, laptop, or server x86 CPU you could buy today.
Edit: I forgot to mention but Intel's Alder lake CPUs only have partial support presumably due to some issue with E cores. I'd guess Intel will get their shit together eventually wrt this now that AMD is shipping all their hardware with this instruction set.
Intel seems to be going for market segmentation, with AVX-512 only available on their server CPUs. The option to enable AVX-512 has been removed from Alder Lake CPUs since 2022, and there is no AVX-512 on Raptor Lake.
AMD also keeps making and selling Zen 3 and Zen 2 chips as lower-cost products, and those do not have AVX-512.
With AVX10 intel will make the instructions available again on all segments. SIMD register width will vary between cores but the instructions will be there.
I don't think it was intentional market segmentation, just poor planning: the whole heterogeneous cores strategy seems to have been thrown together in a hurry and they didn't have time to add AVX-512 to their Atom cores in an area-efficient way (so as not to negate the point of having E-cores).
>most of AVX-512 is available on basically any Intel cpu manufactured since 2020
That's incorrect. On the consumer CPU side, Intel introduced AVX-512 for one generation in 2021 (Rocket Lake), but then removed AVX-512 from the subsequent Alder Lake using BIOS updates, and fused it off in later revisions. It's also absent from the current Raptor Lake. So actually it's only available on Intel's server-grade CPUs.
>Edit: I forgot to mention but Intel's Alder lake CPUs only have partial support presumably due to some issue with E cores.
The latest Intel architecture (Sapphire Rapids) support it without downclocking. AMD Zen 4 also supports it, although their implementation is double pumped, not sure what the real world performance impact of that is.
There is a huge confusion about this "double pumped" thing.
All that this means is that Zen 4 uses the same execution units both for 256-bit operations and for 512-bit operations. This means that the throughput in instructions per cycle for 512-bit operations is half of that for 256-bit operations, but the throughput in bytes per cycle is the same.
However the 512-bit operations need fewer resources for instruction fetching and decoding and for micro-operation storing and dispatching, so in most cases using 512-bit instructions on Zen 4 provides a big speed-up.
Even if Zen 4 is "double pumped", its 256-bit throughput is higher than that of Sapphire Rapids, so after dividing by two, for most instructions it has exactly the same 512-bit throughput as Sapphire Rapids, i.e. two 512-bit register-register instructions per cycle.
The only exceptions are that Sapphire Rapids (with the exception of the cheap SKUs) can do 2 FMA instructions per cycle, while Zen 4 can do only 1 FMA + 1 FADD instructions per cycle, and that Sapphire Rapids has a double throughput for loads and stores from the L1 cache memory. There are also a few 512-bit instructions where Zen 4 has better throughput or latency than Sapphire Rapids, e.g. some of the shuffles.
It's unlikely that this makes anyone's life better. It is more a curiosity and maybe a teachable thing on how to do SIMD. I would venture the guess that there are very few workloads that require this conversion for more than a few KB, and over time as the world migrates to Unicode it will be less and less.
That doesn't seem to be correct. UTF-8 is used by 98% of all the websites. I am not sure if it's even worth the trouble for libraries to implement this algorithm, since Latin-1 encoding is being phased out.
https://w3techs.com/technologies/details/en-utf8