
Unicode should be considered harmful, possibly even text itself. Never think you understand text: it is a very complex medium, and every time this topic is brought up, you learn about some odd quality of some language that you might never have heard of. Yes, UTF-16 is variable length, and yes, it does make many European scripts larger than UTF-8 would. Size is always a trade-off and there won't be one standard for encoding.

Text is hard, do not approach it with a C library you built in an afternoon, leave it to the professionals. I just wish I knew any...




> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and compressed text (using any standard stream compression algorithm) on disk/in the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the ability to random-seek it anyway; and (safely performing) any complex text manipulation, e.g. upper-casing, requires temporarily converting the text to UCS4 anyway. At that point, you may as well go all the way and serialize it as efficiently as possible, if you're just going to spit it out somewhere else. I guess the only difference is that string-append operations would require un-compressing compressed strings and then re-compressing the result, but you could defer that as long as necessary using a rope [1].

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)


The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But random character access is almost entirely useless for real text processing tasks. (You can still do random byte access for UTF-8 text, and if your regexp engine spits out byte offsets, you're fine.)
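To make the byte-offset point concrete, here is a minimal Python sketch. The sample string is made up for illustration, and Python's `re` on `bytes` stands in for whatever regexp engine you actually use:

```python
import re

# Hypothetical sample text mixing ASCII with multi-byte characters.
text = "naïve café, 日本語 text"
data = text.encode("utf-8")

# Searching the encoded bytes yields byte offsets, not character indices.
match = re.search("café".encode("utf-8"), data)
start, end = match.span()

# The match begins and ends on code point boundaries, so the slice decodes cleanly.
found = data[start:end].decode("utf-8")

# The character index differs from the byte offset because "ï" takes two bytes.
char_index = text.index("café")
```

Nothing here ever needs "the n-th character"; the byte offsets are enough to slice out the match.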

Even when you're doing something "simple" like upcasing/downcasing, the advantages of UTF-32 are not great. You are still converting variable length sequences to other variable length sequences -- e.g., eszett upcases to SS.
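A quick way to see this, sketched in Python (whose `str.upper()` applies the full Unicode case mappings; the word is just an example):

```python
lower = "straße"
upper = lower.upper()

# "ß" upcases to the two letters "SS", so the string grows by one code point.
lower_units = len(lower.encode("utf-32-le")) // 4
upper_units = len(upper.encode("utf-32-le")) // 4
```

Even with fixed-width UTF-32 code units, the output has a different length than the input, so an in-place, index-for-index transformation is not possible.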

Now the final piece to this is that for some language implementations, compilation times are dominated by lexical analysis. Sometimes, significant speed gains can be had by dealing with UTF-8 directly rather than UTF-32 because memory and disk representation are identical, and memory bandwidth affects parsing performance. This doesn't matter for most people, but it matters to the Clang developers, for example. Additional system speed gains are had from reducing memory pressure.

Sure, we have plenty of memory and processor power these days. But simpler code isn't always worth 3-4x memory usage.
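For a rough sense of where the 3-4x figure comes from, here is a small Python sketch. The source line is an invented, mostly-ASCII example of the kind a lexer would see; real ratios depend on the text:

```python
# Hypothetical, mostly-ASCII input such as a line of source code.
source = "int main(void) { return 0; } // naïve café\n"

utf8_size = len(source.encode("utf-8"))       # 1 byte per ASCII char, 2 for "ï" and "é"
utf32_size = len(source.encode("utf-32-le"))  # always 4 bytes per code point

ratio = utf32_size / utf8_size
```

For text dominated by CJK or other three-byte characters the gap is much smaller, which is part of why the trade-off is not the same for everyone.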

Text is not simple.


> I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle

We have 3 levels of caching and hyperthreading cores because memory access is so ridiculously slow compared to the CPU. Quadrupling the amount of data that goes through this bottleneck isn't going to help.

> Anything that's not UCS4 is already variable-length-encoded

You can't access the n-th character in UCS4 anyway, because Unicode has combining characters (e.g. ü may be u + ¨, the base letter followed by a combining diaeresis).
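A small Python sketch of the combining-character problem, using the standard `unicodedata` module (the variable names are mine):

```python
import unicodedata

composed = "\u00fc"     # ü as a single precomposed code point
decomposed = "u\u0308"  # u followed by U+0308 COMBINING DIAERESIS

# Both render as the same character, but they are different code point sequences.
same_raw = composed == decomposed

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Indexing by code point returns only the base letter of the decomposed form.
first = decomposed[0]
```

So even with one fixed-width unit per code point, index 0 of the decomposed string is a bare "u", not the character the user sees.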


tchrist seems pretty knowledgeable on Unicode issues. He even did 3 talks at OSCON on the topic. My takeaway was that the best language for dealing with Unicode was Perl, and Ruby was the second best.


I gave a presentation on this topic at Rubyconf Brazil last year, and would be hard pressed to describe Ruby as "good" at dealing with Unicode, unless by "good" you mean "avoids making almost any decisions at all" (which might actually be a good thing but it's debatable).

Ruby 1.9 doesn't even offer Unicode case folding, so from a practical standpoint working with Unicode text is a PITA with Ruby unless you use third party libraries.
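For anyone unsure what case folding buys you, here is a sketch in Python rather than Ruby (Python 3's `str.casefold()` is used purely as an illustration; it says nothing about how a Ruby library would expose this):

```python
a = "Straße"
b = "STRASSE"

# Plain lowercasing leaves "ß" alone, so the two spellings do not compare equal.
equal_lower = a.lower() == b.lower()

# Case folding maps "ß" to "ss", giving a caseless match.
equal_folded = a.casefold() == b.casefold()
```

Without something like this built in, caseless comparison of non-ASCII text has to come from a third-party library.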

Ruby source can include constants, variables, etc. with Unicode (or other character set) symbols, which is very cool but for Unicode text processing I've found Ruby to be frustratingly lacking.


Here are the slides for his "Unicode Shootout: The Good, The Bad, and the Ugly":

http://code.activestate.com/lists/perl5-porters/166738/

Hmmm... In that email he says that Java is second, but in his first talk he did say that Ruby (1.9+) was second in his mind (and he seemed to be visibly frustrated with Java's Unicode support).


ICU is one library.



