From the comments on the article:
"UTF-\d+ IS Unicode like a Ford Focus is a car. Not every car is a Ford Focus. Thus UTF-8 IS Unicode, not vice-versa."
I think that's a better explanation than the whole rest of the article/rant. That author is a little too nuts to follow.
That's not a great metaphor though, because there are many types of cars, but only one type of Unicode. 'Unicode' is not a generalization for character encodings the way 'car' is a generalization for car models. I think this is exactly the point the article author is trying to make.
A better analogy would be color pixels. Consider a pure-red pixel; there is only one particular shade of red a pixel can have and be pure-red. However, there are multiple ways to represent that color: RGB, HSV, HSL, YUV, CMYK, etc. These are all encodings of the same color. None of them /are/ pure-red, but they all /represent/ pure-red.
Similarly, the 1-4 byte sequences within a UTF-# encoded string aren't Unicode characters themselves; they represent individual Unicode characters. There is only one A in Unicode, but there are multiple ways to encode that A in a stream of bytes.
Well there are at least a half-dozen valid unaccented capital A characters in Unicode, but there is only one LATIN CAPITAL LETTER A, which sits at U+0041.
You are correct that there are many ways to encode U+0041 in a stream of bytes.
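For what it's worth, here's a quick Python 3 sketch of exactly that point (Python is just a convenient way to poke at the bytes): the same code point, U+0041, turns into different byte sequences under different encodings.

    # One code point, several valid byte representations.
    for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
        print(enc, "A".encode(enc))
    # utf-8     b'A'
    # utf-16-le b'A\x00'
    # utf-16-be b'\x00A'
    # utf-32-le b'A\x00\x00\x00'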
The analogy I prefer is that Unicode-encoding is like network protocol layers. In a (very simplified) network, the goal is to present an interface of a pipeline for sending a continuous stream of bytes. To achieve this, you have a protocol like TCP, which consists of routines for transforming a series of packets running through a wire into a stream of bytes. To achieve a series of packets running through a wire, you have a protocol like Ethernet, which consists of routines for transforming electrical pulses into a series of data packets. You need both Ethernet and TCP to turn electrical pulses into a stream of continuous bytes.
Similarly, the goal of Unicode is to present an interface of a stream of abstract characters (the letter "A", the number 3, the nonbreaking space, etc.). To achieve this, you use Unicode, which consists of routines to convert a stream of arbitrarily-long numbers (packets) into abstract characters. To achieve a stream of arbitrarily-long numbers, you use an encoding such as UTF-8, which consists of routines to convert a series of bytes into a stream of arbitrarily-long numbers. Now, using an encoding and Unicode, you can convert a stream of bytes (i.e., an arbitrary file) into a series of abstract characters.
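If it helps make the layering concrete, here is a rough Python 3 sketch of the two steps (the byte string is an arbitrary example): the encoding layer turns bytes into code points, and the Unicode layer is what gives those numbers meaning as abstract characters.

    import unicodedata

    raw = b"A 3 \xc2\xa0 \xe2\x82\xac"   # arbitrary bytes from a file or the wire
    text = raw.decode("utf-8")           # encoding layer: bytes -> code points

    for ch in text:                      # Unicode layer: code points -> abstract characters
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # e.g. U+0041 LATIN CAPITAL LETTER A ... U+00A0 NO-BREAK SPACE ... U+20AC EURO SIGN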
Any article that attempts to lay down the truth about Unicode that does not mention UCS-2 is seriously deficient.
There are way too many morons out there treating UTF-16 input as UCS-2, or writing UCS-2 and calling it UTF-16 (or "Unicode", as the article nicely addresses). Both Windows and Java have fucked this up pervasively in the past.
"when transforming from (byte) strings to Unicode, you are decoding your data"
Oh, so in memory they are not bytes anymore, but "code sequences"? Fair enough to attempt to clarify a point, but please don't make it even more confusing than it actually is.
I guess (this is what I take away from the article, even though it is not written in it) the actual "transforming" stage only applies to single letters then - "Unicode" would be the mapping of a number to a letter, and the encodings (UTF-8 and so on) are different ways to represent that number?
Also, is it true that UTF-16 can represent all of Unicode? Because I was under the impression that it can't?
You're probably thinking of UCS-2, which is very similar to UTF-16, but doesn't have "surrogate pairs" to represent codepoints beyond the 16-bit limit. http://en.wikipedia.org/wiki/UTF-16/UCS-2
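If anyone wants to see what a surrogate pair actually looks like, here is a small Python 3 sketch (U+1F600 is just an arbitrary codepoint beyond the 16-bit limit):

    import struct

    ch = "\U0001F600"                # arbitrary codepoint above U+FFFF
    data = ch.encode("utf-16-be")    # 4 bytes: two 16-bit code units
    hi, lo = struct.unpack(">HH", data)
    print(hex(hi), hex(lo))          # 0xd83d 0xde00, the surrogate pair
    # UCS-2 simply has no way to represent this codepoint.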
I wasn't, but thanks anyway ;-) I think Java uses UTF-16 and all those years as a Java developer I was under the impression that it can only use two bytes per "letter". Thanks for the clarification.
It's significant because it messes up the length (or size) property of strings, doesn't it?
You were probably under that impression because it used to be true. Java started supporting surrogate pairs -- i.e. it made the switch from UCS-2 to UTF-16 -- in J2SE 5.0
How is it easier in UTF-16? Most "normal" Unicode characters fit in two bytes, sure, but you can't just count bytes and divide by two if you want the right answer. It is just as difficult as UTF-8 to implement.
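A rough sketch of what counting actually involves in UTF-16 (Python 3 used here just to get at the raw bytes): you have to skip the second code unit of every surrogate pair, much like you skip continuation bytes in UTF-8.

    import struct

    def utf16_char_count(data: bytes) -> int:
        # Assumes little-endian input; skip low surrogates (0xDC00-0xDFFF),
        # the second unit of each pair, so each pair counts once.
        units = struct.unpack(f"<{len(data) // 2}H", data)
        return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

    s = "na\u00efve \U0001F600"
    data = s.encode("utf-16-le")
    print(len(data) // 2, utf16_char_count(data), len(s))   # 8 7 7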
UTF-8 uses less memory if your string happens to contain mostly characters that UTF-8 can represent in a single byte (like English text). For CJK text UTF-8 actually uses more memory (3 bytes per character versus 2 in UTF-16), while scripts like Arabic come out at 2 bytes per character either way.
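A rough comparison in Python 3, for the curious (the sample strings are arbitrary):

    english = "Hello, world"
    japanese = "こんにちは世界"

    for s in (english, japanese):
        print(len(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")))
    # 12 12 24   English: UTF-8 is half the size
    #  7 21 14   Japanese: UTF-8 is about 1.5x the size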
I've found that it matters pretty quickly. I've seen the Python conversion error alluded to in the article quite frequently, and if you search for it you'll find no shortage of people asking about it.
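For reference, assuming it is the usual ascii-codec error, the Python 3 version is trivial to reproduce:

    data = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
    data.decode("ascii")
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3:
    # ordinal not in range(128)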
Most programs have to deal with non-ASCII now. Even if your target audience and users are all English-speaking Americans, people are getting used to things like em dashes and smart quotes, and it's pretty bush league if your program doesn't support them or displays them incorrectly.
The number of programmers who can just blithely ignore Unicode is dwindling by the day.
For web developers, it matters as soon as a user pastes text with curly quotes from a Word document into a web form. For other developers, it matters as soon as you have to parse a file format or use a protocol with non-ASCII data. I'm in the U.S., and I've eventually had to deal with character encoding at five of the six companies I've worked for.
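To make the curly-quote case concrete, here's a small Python 3 sketch (the string is a made-up example, but the mojibake pattern will look familiar):

    original = "it\u2019s pasted from Word"   # curly apostrophe, U+2019
    data = original.encode("utf-8")

    print(data.decode("cp1252"))   # itâ€™s pasted from Word  <- wrong codec, mojibake
    print(data.decode("utf-8"))    # round-trips correctly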
.. and then you rewrite all your string manipulations?
Unlike other new technologies, you're already using it from the beginning (and in your approach, incorrectly). Spend the two hours to figure out how to do the basic work right at the beginning.
This. At my workplace we had legacy code written by people who either didn't understand or didn't care about this stuff, and now our databases are full of crap that we literally have to guess how to correctly convert and render.
To clarify, I meant the finer distinctions between UTF-8 and Unicode aren't as important. I wasn't advocating just using ASCII. UTF-8 just works for most things. I wasn't saying that knowing about character encodings and how they interact isn't worthwhile. For the average western programmer, UTF-8 is generally the answer.
The author of the article wants everyone to be fully aware of the significance of alternate encodings as well as how they relate to codepoints. That knowledge, while useful (and not really hard to learn in a broad sense), isn't useful for many software developers.
There are platforms which use UCS-4 or UTF-16 (or even UCS-2) by default. You don't get UTF-8 unless you ask for it, which involves knowing not only that it exists but also that it's different from what you already had.
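Concretely, asking for it is usually just one explicit parameter; in Python 3 it looks something like this ("notes.txt" is a made-up file name):

    # Be explicit rather than hoping the platform default happens to be UTF-8.
    with open("notes.txt", "w", encoding="utf-8") as f:
        f.write("na\u00efve caf\u00e9")

    with open("notes.txt", encoding="utf-8") as f:
        print(f.read())   # naïve café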
On the other hand, the number of cases where this attitude actually pans out without broken character handling is vanishingly small. All it takes is one person with an accented name, or someone pasting in smart quotes, to break your application and make you look like an amateur. The idea that Americans only need ASCII is silly.
...because Unicode is a Universal set of Code Points that uniquely refer to Characters.
Even if some asshole decreed that there'd be no variable-width encodings, you'd still have endianness issues and combining characters to trip over. Shit ain't easy.
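Combining characters in particular trip people up even with a fixed-width encoding; a quick Python 3 illustration:

    import unicodedata

    precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
    combining   = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    print(precomposed == combining)                                 # False
    print(len(precomposed), len(combining))                         # 1 2
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True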