UTF-8 history (2003)

Camillo · on Oct 10, 2019

With all due respect, it seems to me that the documents contradict the claim in the initial email.

Rob's initial claim is that Ken came up with UTF-8 entirely from scratch, without even looking at IBM's proposal.

But Ken's oldest document from Sep 2, 1992 has his changes simply appended after the original FSS-UTF document from IBM (starting at "We define 7 byte types", as mentioned in the email), and notes the changes from original spec.

Then the final document that he sent out (Sep 8) is basically the FSS-UTF document with his changes applied (this is also mentioned in the email!).

There are two changes, basically:

1. Use 10 instead of 1 as the prefix for continuation bytes, so you can synchronize from an arbitrary location.

2. Once the bits are reassembled, use the value as-is instead of adding a bias constant, which simplifies the code at the cost of a tiny bit of packing efficiency.

Given the documentation provided, I would say that the fairest description of the development of UTF-8 is that IBM came up with the initial design, and Plan 9 made two improvements on it to produce the final UTF-8 design.

leshow · on Oct 10, 2019

One thing I was confused about. The document says there are 7 byte types, but I thought UTF-8 was variable width up to only 4 bytes. Did I misunderstand something?

hyperman1 · on Oct 10, 2019

Both are correct: This original UTF-8 encoding can encode values up to 2^32. But because UTF-16 encoding limits possible values to 16 planes of 64K values, unicode has a hard limit of 2^20 codepoints.

This means UTF-8 encoded values of more than 4 bytes can never represent a valid unicode codepoint even if they produce a valid 32 bit numerical value.

kstenerud · on Oct 10, 2019

"Why didn't we just use their FSS/UTF? As I remember, it was because in that first phone call I sang out a list of desiderata for any such encoding, and FSS/UTF was lacking at least one - the ability to synchronize a byte stream picked up mid-run, with less that one character being consumed before synchronization."

I'm of two minds about this. On the one hand, such an ability is pretty useless in modern systems filled with checksums at every stage, and reduces the bit efficiency of UTF-8. On the other hand, this was the only valid-y reason at the time to rewrite a UTF implementation from scratch.

If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.

Edit: Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.

boomlinde · on Oct 10, 2019

> Actually, now that I think about it, VLQ already does satisfy the synchronization requirement. Just scan for the next cleared high bit. At most 1 character consumed, and far less bit wastage.

There are other useful properties of UTF-8 encoding. For one, it is easy to identify invalid UTF-8 sequences, and valid UTF-8 sequences are unlikely to appear in written text encoded as e.g. ISO 8859-1. This must have been more useful in the past when fixed 8-bit encodings were more common. With a simple VLQ your guess whether a stream complies to your encoding won't be nearly as informed, and the means to provide a fallback is diminished. Allowing graceful transition from ASCII and extended ASCII encodings was a very important property for the sake of adoption.

It's also useful that any non-ASCII sequence has the msbit set through the whole sequence. You can easily discard all sequences that can not be displayed as ASCII by throwing bytes >= 0x80, or for example replace every contiguous sequence of >= 0x80 bytes as question marks in an ASCII-only display system and still display all ASCII compatible sequences perfectly.

EDIT: It's also very much not the case that the days of starting read mid-stream are over.

jstimpfle · on Oct 10, 2019

The ability to pick up byte streams mid-run is not useless. It's required for example to jump to arbitrary locations in a file and make sense of the data you find there.

Imagine a text editor displaying a large CSV file. Wouldn't you mind having to read everything betseen two locations if you jump forward from the one to the other? Or read everything from the start if you jumping backwards? Even if the text editor stores its own synchronization points, it has to read the file completely at least once, which can be annyoing for very large files.

Also, many of the simpler text tools that only look for ASCII bytes wouldn't work. For example, printing the last lines.

Also, I don't know that other sytem, but what about robustness if there's one bad byte somewhere in the file?

kstenerud · on Oct 10, 2019

You can do the same thing using VLQ, with less wastage. Just scan for the next cleared high bit.

mirekrusin · on Oct 10, 2019

You can also construct malicious payloads which will DoS many implementations with architecture like that.

kstenerud · on Oct 10, 2019

You can also trivially defend against that by having a max travel distance of, say, 4.

mirekrusin · on Oct 13, 2019

Exactly, so you arrived at the same conclusion - fixed distance so you can safely pick up stream at arbitrary position.

kstenerud · on Oct 13, 2019

The difference is that the distance limits would be inherent to the codec itself, rather than taking up extra space in the data stream.

ludamad · on Oct 10, 2019

Is that guaranteed to catch all valid cases?

zajio1am · on Oct 10, 2019

> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.

There is a requirement that 7-bit ASCII character codes would not appear as a part of non-ASCII character encoding, so NULL-terminated strings, slash path separation and other similar issues could be handled by existing charset-independent code. That would not work with regular VLQs.

kstenerud · on Oct 10, 2019

Actually, that's right, forgot about that :P

The only remaining mystery then would be why they bothered encoding it like this:

    0xxxxxxx
    110xxxxx  10xxxxxx        
    1110xxxx  10xxxxxx  10xxxxxx    
    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

... when it could have been done much more simply by putting the continuation bit in position 6 and keeping bit 7 set across the entire multibyte sequence:

    0xxxxxxx
    11xxxxxx  11xxxxxx  ...  10xxxxxx

Or even like this for maximum compactness:

    0xxxxxxx
    100xxxxx 1xxxxxxx
    101xxxxx 1xxxxxxx 1xxxxxxx
    110xxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx
    111xxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx

fefe23 · on Oct 10, 2019

One of the requirements is:

4) The first byte should indicate the number of bytes to follow in a multibyte sequence.

This is actually a pretty smart requirement for efficiency. This way the parser can know if the input bytes it has constitute a UTF-8 sequence by only looking at the first byte. That saves a lot of unnecessary processing.

Unfortunately the wording here breaks that advantage: https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc16

if the next n bytes constitute an incomplete, but so far valid, multibyte character. Nothing is written to

The "so far valid" forces implementations to still look at the other bytes, which undoes the optimization in UTF-8. I suspect somebody was asleep at the wheel there.

Regarding your question about the wasted bits: I don't think it matters much. It certainly does not for English text like we exchange here, where the important case is the 0-0x7f case which UTF-8 handles optimally.

Your maximum compactness variant means if you start in the middle of a sequence, you can't tell you are in the middle and not at the start. UTF-8 is doing this well in my opinion. You can seek anywhere in a file and then move forwards or backwards till the beginning of a sequence without fear of misparsing the middle of a correct sequence as a different sequence.

kstenerud · on Oct 10, 2019

Yes, the ultra-compact encoding will not be self-sequencing. But the bit-6-continuation variant yields more bits per byte, which would give better compression in many languages. Regarding efficiency, you still have to read every character, with the difference being one check every 1, 2, 3, or 4 characters in a more complex algorithm vs a check on every character in a simpler algorithm (I haven't checked to see which beats the other in performance, but they look pretty similar).

This feels a lot like mixing transport layer metadata into the data format, potentially giving a small processing performance benefit at the cost of huge data wastage when certain languages are encoded.

marcosdumay · on Oct 10, 2019

> huge data wastage when certain languages are encoded

Is there a language that consistently uses codepoints with more than 2 bytes?

It bothers me a bit that UTF-8 is not an infinitely extendable encoding. But that also isn't an important objection, because it is finite, but huge.

XaspR8d · on Oct 10, 2019

> Is there a language that consistently uses codepoints with more than 2 bytes?

There are definitely (small) communities using scripts that lie entirely in the SMP. For example, Mru, Adlam, Takri, Pracalit, Miao, Wancho, etc. Most of these are either historic scripts that have mostly been supplanted by unified ones (esp. Devanagari) but retain usage in some areas, or languages that did not have a pre-colonial writing system that are attempting to reclaim cultural identity with a new script.

But yes, I don't think there are major communities that consistently do so. My anecdata from a few Mandarin- and Japanese-speaking friends is that SIP characters rarely occur.

Really if anything, emoji obsessives, mathematicians using bold/fraktur characters, and historical linguists/anthropologists would have the biggest savings.

https://en.wikipedia.org/wiki/Mru_language#Alphabet https://en.wikipedia.org/wiki/Adlam_script https://en.wikipedia.org/wiki/Takri_script https://en.wikipedia.org/wiki/Pracalit_script https://en.wikipedia.org/wiki/Pollard_script

kstenerud · on Oct 10, 2019

Yeah, it's fine that the encoding is not infinitely extendable. But if it didn't encode the length into the first byte, you'd have 1 extra bit for 2-byte sequences, 2 extra bits for 3-byte sequences, etc. That means that you can double the number of possible glyphs per 2-byte sequence (for an extra 2048 glyphs in the 2-byte range). For 3 bytes, it's 2 bits for an extra almost 200,000 glyphs before having to jump to 4 bytes. This would be a huge boon to East Asian languages like Chinese and Japanese.

boomlinde · on Oct 10, 2019

> Is there a language that consistently uses codepoints with more than 2 bytes?

Kanji, hiragana, katakana etc. At least 3 bytes with some 4-byte sequences. This compares unfavorably to for example Shift-JIS.

devit · on Oct 10, 2019

None of these encodings work.

If there is no length encoding, start, middle, end and ASCII byte encodings must not overlap to support correct decoding of any subslice of a document.

If there is a length encoding on the start, then start, middle and ASCII must not overlap.

avereveard · on Oct 10, 2019

because then you don't know when you jump in the middle of a stream and see a 11xxxxx whether it's the beginning of a valid multibyte character and you have to keep it or part of a multibyte you have to discard it

same with the second encoding, if so happens that one 1xxxxxxx takes the value of 11010101 how would you know it's a multibyte start or a continuation?

basically in the current solution if you read the head of a multibyte character you know it's a valid head, if you read the head of a multibyte in the proposed encoding you can't know if it's valid.

kstenerud · on Oct 10, 2019

Yes, the second one is not self-synchronizing, so it's out. However, I don't see much utility in getting first-character detection from the data format itself. You're very unlikely to miss the beginning of a stream of characters in any modern system, and in the event that your medium has no error detection, it would be trivial to add a zero synchronization byte to the beginning of the field.

My point is that embedding transport level metadata into the data format seems like a poor tradeoff because of the sheer inflation potential of the data encoding (potentially 10%), when a single guard byte per field would solve the problem of first character truncation detection.

labawi · on Oct 11, 2019

> I don't see much utility in getting first-character detection ..

> .. in the event that your medium has no error detection ... add a zero synchronization byte

How would such encoding deal with non-utf8-safe editors, copy-pasting, programs truncating, then inserting previously broken sequences, etc?

Encoding obviously can't fix all errors, but it is quite useful if broken sequences are obviously broken and non-broken sequences remain valid when handling text in non-aware/non-safe applications.

I think in UTF8 two splices can generate a random character, but in a characters + splice combination, the character remains recognizable in any order and combination and a lone splice is also recognizable as an error.

zajio1am · on Oct 10, 2019

Your second encoding (with binary coded length) would not be self-synchronizing.

DagAgren · on Oct 10, 2019

UTF-8 lets you skip entire characters without scanning the individual bytes, which VLQ does not.

baybal2 · on Oct 10, 2019

There are now encoding even more efficient than VLQ, but they blue the line in between encodings and compression algorithms. Most propose to trade efficiency of encoding 7 bit chars for ability to squeeze few thousands common Chinese characters into 16 bits.

The idea I heard was that to always code in 4 bytes blocks, and use some form of delta encoding. Some variations allow for less than OlogN character position search. And given that you can feed 32 wide data into NEON/SSE, and block are always 32 bit aligned, you can have that working faster than UTF-8

SamReidHughes · on Oct 11, 2019

Do you have a link to this 32-bit-aligned proposal?

SamReidHughes · on Oct 10, 2019

It's still useful. If you have a string or text file and want to parallelize the processing of it, this lets you cut it in half and do that.

lifthrasiir · on Oct 10, 2019

> If not for that requirement, we would have just had UTF-8 implemented as regular VLQs.

While not technically equivalent, synchronized byte stream allows a backward scanning that is beneficial for many Unicode-related algorithms. In fact synchronization is the simplest way to do that.

kstenerud · on Oct 10, 2019

VLQ can also be backward scanned, since the high bit is the continuation bit. Just scan for the next cleared high bit, which marks the end of a glyph.

SamReidHughes · on Oct 11, 2019

Provided that you know where the string boundary is, so you don't scan off the front into differently encoded data. So it can't be backward scanned for some miscellaneous encoded number.

lifthrasiir · on Oct 10, 2019

Ah, good point. I'm still confusing which one is which. I was specifically thinking of UTF-1 that is definitely not synchronized nor backward scannable.

acqq · on Oct 10, 2019

> now that I think about it, VLQ already does satisfy the synchronization requirement

What was then the advantage of UTF8 over VLQ?

SamReidHughes · on Oct 10, 2019

You can determine whether the leading character is complete or not.

acqq · on Oct 10, 2019

Other comments know, see:

https://news.ycombinator.com/item?id=21212851

twsted · on Oct 10, 2019

"So we went to dinner, Ken figured out the bit-packing..."

I cannot find the words to describe how much I admire these guys (ken, rob, drm, etc).

Pulletwee12549 · on Oct 10, 2019

Nice post, quite interesting.

I think Rob Pike may have one of the coolest email addresses in the world (r@google.com).

_4gzn · on Oct 10, 2019

Back when I worked at Google, a friend of mine got an unpleasant email from r@ because his Python script had a bug: it sent alerts to the individual characters of my-friend@google.com

oh_sigh · on Oct 10, 2019

I've heard his email causes issues for tons of internal systems at Google because a lot are coded with the expectation of a minimum of 3 characters for the left part of the email. May just be urban legend though.

kevin_thibedeau · on Oct 10, 2019

Do "+" suffixes work for google.com addresses? Is r+pike@google.com the same account?

anticensor · on Oct 10, 2019

Yes and no. Yes, they belong to the same employee; no, each google.com mailbox needs to be approved, they are not automatically redirected unlike regular GMail addresses.

fanf2 · on Oct 10, 2019

I beg to differ

jquast · on Oct 10, 2019

Markus Kuhn did a great deal of work for unicode for unix, see the rest of the same folder, https://www.cl.cam.ac.uk/~mgk25/ucs/

such as examples/UTF-8-demo.txt a beautiful demo file for testing terminal emulators

or the wcwidth.c file I discovered was the origin of the same c function found in all OS's, i forked & maintain in python form as the public "wcwidth" module,

anyway just wanted to point out the treasure trove of the parent folder!