Unicode should be considered harmful, possibly even text. Never think you unders...

derefr · on Aug 19, 2011

> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and compressed text (using any standard stream compression algorithm) on disk/in the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the ability to random-seek it anyway; and (safely performing) any complex text manipulation, e.g. upper-casing, requires temporarily converting the text to UCS4 anyway. At that point, you may as well go all the way and serialize it as efficiently as possible, if you're just going to spit it out somewhere else. I guess the only difference is that string-append operations would require un-compressing compressed strings and then re-compressing the result, but you could defer that as long as necessary using a rope[1].

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)
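A minimal Python sketch of that split, assuming UTF-32 little-endian bytes stand in for "raw" UCS4 in memory and zlib stands in for "any standard stream compression algorithm". The helper names are hypothetical, not from any library:

```python
import zlib

def to_memory(text):
    # Fixed-width in-memory form: exactly 4 bytes per code point, no BOM.
    return text.encode("utf-32-le")

def code_point_at(buf, i):
    # Constant-time lookup of the i-th code point (not the i-th "character").
    return int.from_bytes(buf[4 * i:4 * i + 4], "little")

def to_wire(buf):
    # Compressed form for disk, DB, or network.
    return zlib.compress(buf)

def from_wire(blob):
    return zlib.decompress(blob)

text = "hello, w\u00f6rld " * 100
buf = to_memory(text)
blob = to_wire(buf)
```

Here `from_wire(blob)` gives back `buf` unchanged, and `code_point_at(buf, 8)` is the code point of "ö". How much the compressed form saves depends entirely on the text.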

klodolph · on Aug 19, 2011

The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But random character access is almost entirely useless for real text processing tasks. (You can still do random byte access for UTF-8 text, and if your regexp engine spits out byte offsets, you're fine.)
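A rough Python illustration of the byte-offset point (the sample string is made up, and Python's `re` module stands in for whatever regexp engine you use). Searching the UTF-8 bytes directly yields byte offsets that are safe to slice at:

```python
import re

# "naïve café, 日本語 text", written with escapes to keep the code points unambiguous.
text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e text"
needle = "\u65e5\u672c\u8a9e"

data = text.encode("utf-8")

# Match on the encoded bytes; the pattern is UTF-8 as well.
match = re.search(re.escape(needle.encode("utf-8")), data)
start, end = match.span()          # byte offsets into data

# Slicing at match boundaries gives back valid UTF-8.
found = data[start:end].decode("utf-8")

# The character offset differs because two earlier characters take 2 bytes each.
char_index = text.index(needle)
```

The byte offset (`start`) and the character offset (`char_index`) differ, but nothing here ever needed the character offset.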

Even when you're doing something "simple" like upcasing/downcasing, the advantages of UTF-32 are not great. You are still converting variable length sequences to other variable length sequences -- e.g., eszett upcases to SS.
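The eszett case can be checked in a couple of lines of Python (used here only as an illustration; `str.upper()` applies Unicode's full case mapping):

```python
word = "stra\u00dfe"               # "straße"
upper = word.upper()

# Code point counts before and after: upper-casing made the string longer.
before = len(word)
after = len(upper)

# The same holds in a fixed-width encoding: more 4-byte units after the change.
units_before = len(word.encode("utf-32-le")) // 4
units_after = len(upper.encode("utf-32-le")) // 4
```

So even with a fixed-width encoding, an in-place upper-casing loop over an array of code points is not enough; the output buffer has to be able to grow.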

Now the final piece to this is that for some language implementations, compilation times are dominated by lexical analysis. Sometimes, significant speed gains can be had by dealing with UTF-8 directly rather than UTF-32 because memory and disk representation are identical, and memory bandwidth affects parsing performance. This doesn't matter for most people, but it matters to the Clang developers, for example. Additional system speed gains are had from reducing memory pressure.

Sure, we have plenty of memory and processor power these days. But simpler code isn't always worth 3-4x memory usage.
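To put a number on the memory cost, here is a small Python sketch (the sample strings are arbitrary) comparing the encoded size of the same text in UTF-8 and UTF-32:

```python
def size_ratio(text):
    # Bytes needed in UTF-32 divided by bytes needed in UTF-8.
    # "utf-32-le" is used so that no byte-order mark is counted.
    utf8 = len(text.encode("utf-8"))
    utf32 = len(text.encode("utf-32-le"))
    return utf32 / utf8

# Mostly-ASCII input, such as typical source code: 1 byte vs 4 bytes per character.
ascii_ratio = size_ratio("int main(void) { return 0; }")

# CJK text: 3 bytes vs 4 bytes per character, so the gap is much smaller.
cjk_ratio = size_ratio("\u65e5\u672c\u8a9e")
```

For pure ASCII the ratio is exactly 4; for text dominated by characters that need three bytes in UTF-8 it drops to about 1.33.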

Text is not simple.

pornel · on Aug 19, 2011

> I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle

We have 3 levels of caching and hyperthreading cores because memory access is so ridiculously slow compared to the CPU. Quadrupling the amount of data that goes through this bottleneck isn't going to help.

> Anything that's not UCS4 is already variable-length-encoded

You can't access the n-th character in UCS4 anyway, because Unicode has combining characters (e.g. ü may be u + combining ¨).
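A short Python sketch of the combining-character problem, using the standard `unicodedata` module. The same visible "ü" can be one code point or two, so indexing by code point is not indexing by character:

```python
import unicodedata

composed = "\u00fc"                                   # precomposed u with diaeresis
decomposed = unicodedata.normalize("NFD", composed)   # u + U+0308 combining diaeresis

# Same rendered character, different number of code points.
composed_len = len(composed)
decomposed_len = len(decomposed)

# Indexing the decomposed form splits the character: position 0 is a bare "u".
first = decomposed[0]

# NFC normalization puts the two code points back together.
recomposed = unicodedata.normalize("NFC", decomposed)
```

In UCS4 both forms are fixed-width, yet `decomposed[0]` is still only part of what a reader would call the first character.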

pyre · on Aug 19, 2011

tchrist seems pretty knowledgeable on Unicode issues. He even did 3 talks at OSCON on the topic. My takeaway was that the best language for dealing with Unicode was Perl, and Ruby was the second best.

compay · on Aug 19, 2011

I gave a presentation on this topic at Rubyconf Brazil last year, and would be hard-pressed to describe Ruby as "good" at dealing with Unicode, unless by "good" you mean "avoids making almost any decisions at all" (which might actually be a good thing, but it's debatable).

Ruby 1.9 doesn't even offer Unicode case folding, so from a practical standpoint working with Unicode text is a PITA with Ruby unless you use third party libraries.
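For anyone unsure what case folding buys you over plain lower-casing, here is a sketch in Python rather than Ruby (Python 3's `str.casefold()` is used purely to show the operation; it says nothing about what a Ruby library would look like):

```python
a = "Stra\u00dfe"      # "Straße"
b = "STRASSE"

# Lower-casing is not enough for caseless comparison: the eszett survives.
lower_equal = a.lower() == b.lower()

# Case folding maps the eszett to "ss", so the two spellings compare equal.
folded_equal = a.casefold() == b.casefold()
folded = a.casefold()
```

With lower-casing alone the two spellings compare unequal; with case folding they match.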

Ruby source can include constants, variables, etc. with Unicode (or other character set) symbols, which is very cool, but for Unicode text processing I've found Ruby to be frustratingly lacking.

pyre · on Aug 19, 2011

Here are the slides for his "Unicode Shootout: The Good, The Bad, and the Ugly":

http://code.activestate.com/lists/perl5-porters/166738/

Hmmm... According to that email he says that Java is second, but in his first talk he did say that Ruby (1.9+) was second in his mind (and he seemed to be visibly frustrated with Java's Unicode support).

yuhong · on Aug 18, 2011

ICU is one library.