What Every Software Developer Must Know About Unicode and Character Sets (2003)

pmjordan · on March 25, 2010

I'm just reaching the end of a pretty soul-destroying consulting project. The client side is C++ and uses a lot of strings. To my horror, there still doesn't seem to be a de facto standard way of dealing with the various Unicode encodings in C++, even after my multi-year C++ hiatus. I ended up using the WideCharToMultiByte() and MultiByteToWideChar() Win32 functions, which are rather yucky. I'd fully expected Boost to have an answer to this problem by now, but it only offers UCS-4 <-> UTF-8 conversion.

What libraries do C and C++ programmers use to hold Unicode strings and convert between encodings these days?

ximeng · on March 25, 2010

Not quite answering your question, but it's worth noting that C++0x supports various Unicode encodings for string literals:

http://en.wikipedia.org/wiki/C%2B%2B0x#New_string_literals

There's also a link to a proposed Boost solution here:

http://stackoverflow.com/questions/511280/is-there-stl-and-u...

Doesn't quite sound like the standard way that you're looking for, but moving closer.

Edit:

"ICU today is de-facto standard Unicode/localization library" from a mailing discussion of the boost solution. And http://art-blog.no-ip.info/cppcms/blog/post/43 has an interesting comparison of a few libraries, but not too comprehensive.

pistoriusp · on March 25, 2010

My favorite author on the subject of Unicode is Mark Pilgrim, in his book Dive Into Python 3 (it's relevant for any language):

http://diveintopython3.org/strings.html

"Some Boring Stuff You Need To Understand Before You Can Dive In."

wizard_2 · on March 25, 2010

It's worth bringing this up from time to time, if for nothing else than to educate new developers.

Emore · on March 26, 2010

What happens when I copy and paste?

Do I copy the code points, or the encoded characters--the bytes--along with what encoding is used? Similarly, when I paste, is it the code points I paste which are instantaneously encoded using the application's encoding scheme?

ptarjan · on March 26, 2010

Just yesterday I ran into this exact problem.

In PHP, curl_exec returns data in the raw encoding of the source. Fine, some people will want that. But I want to do things with the data, so I want it in UTF-8.

So, I ended up writing my own curl_exec_utf8 function, which I'm sure is wrong for many edge cases, but it is 2010! Why are there no decent ways to deal with charsets?

Here is the function, in case any of you need it or want to point out how it is hopelessly broken: http://stackoverflow.com/questions/2510868/php-convert-web-p...
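
For comparison, the general shape of the approach (read the charset the server declares, decode with it, re-encode as UTF-8) can be sketched in Python. This is not the PHP function linked above; `to_utf8` and its `fallback` parameter are names made up for illustration, and it only looks at the Content-Type header, ignoring `<meta>` tags and byte order marks.

```python
import re


def to_utf8(raw: bytes, content_type: str, fallback: str = "iso-8859-1") -> bytes:
    """Re-encode an HTTP body as UTF-8 using the charset named in Content-Type."""
    # Pull "charset=..." out of a header such as "text/html; charset=Shift_JIS".
    match = re.search(r'charset="?([\w.:-]+)', content_type, re.IGNORECASE)
    charset = match.group(1) if match else fallback
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name, or bytes that don't match the declared charset.
        # ISO-8859-1 maps every byte to a code point, so this decode cannot fail.
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


body = "café".encode("iso-8859-1")
print(to_utf8(body, "text/html; charset=ISO-8859-1"))
```

The last-resort ISO-8859-1 decode never raises, but it can silently produce mojibake when the server lies about its charset, which is exactly the kind of edge case that is hard to get right.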

thmz · on March 26, 2010

If we are lucky, they will fix it before 2011. http://www.php.net/~scoates/unicode/render_func_data.php?x=0 (Complete: 70.70)

jamiecobbett · on April 22, 2010

For Ruby, I can highly recommend this series of articles: http://blog.grayproductions.net/articles/understanding_m17n

gjm11 · on March 25, 2010

From 2003; it might be worth putting that in the title.

akirk · on March 25, 2010

Hm, but it's only really related to the recent "Joel isn't blogging anymore" situation.

What he says is still very much valid. It's a nice intro to Unicode. Quite refreshing to read.

akirk · on March 25, 2010

Actually, the one thing I miss is any mention of the downsides of UTF-8. Like that you need to go through the string to determine how many characters it has, due to their (potentially) variable length.
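
A minimal Python sketch of what "going through the string" means for UTF-8 (the function name is made up for illustration): every byte has to be looked at, and the ones that are not continuation bytes (bit pattern 10xxxxxx) are counted.

```python
def utf8_code_point_count(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a character.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


encoded = "naïve €5".encode("utf-8")
print(len(encoded))                    # 11 bytes
print(utf8_code_point_count(encoded))  # 8 code points
```

The count is linear in the byte length, and this sketch assumes the input is already valid UTF-8.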

pmjordan · on March 25, 2010

I love how people then tend to bring up Win32 "wide strings", Java and .NET as alternatives, all of which use UTF-16, which is also a variable width encoding.
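
A quick Python check of the variable-width claim (the helper name is made up for illustration): anything outside the Basic Multilingual Plane takes two 16-bit code units in UTF-16, a surrogate pair.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(text.encode("utf-16-le")) // 2


for ch in ("A", "\u20ac", "\U0001d11e"):  # A, EURO SIGN, MUSICAL SYMBOL G CLEF
    print(f"U+{ord(ch):04X}: {utf16_code_units(ch)} code unit(s)")
```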

dfox · on March 25, 2010

I'm still trying to find one valid use for the length of a string in Unicode characters. What one usually needs to know is the length of the string as it's rendered by some output device, which is not related to the count of Unicode characters in any useful way. Even for fixed-width fonts you can have glyphs that are composed of multiple Unicode characters, or characters whose glyphs occupy two consecutive positions.
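
Both fixed-width cases can be illustrated with a rough Python sketch. `approx_columns` is a hypothetical helper, and it is only an approximation: it ignores zero-width characters, tabs, emoji sequences, and whatever the actual terminal decides to do.

```python
import unicodedata


def approx_columns(text: str) -> int:
    """Rough estimate of how many fixed-width cells text occupies."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on top of the previous character
        # East Asian Wide and Fullwidth characters take two cells.
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns


print(len("e\u0301"), approx_columns("e\u0301"))  # 2 code points, 1 column
print(len("漢字"), approx_columns("漢字"))          # 2 code points, 4 columns
```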

anamax · on March 25, 2010

Twitter has a limit of 140 "codepoints". Not bytes. Not glyphs.
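
To make the distinction concrete, here is a small Python sketch (the `measure` helper is made up for illustration, and it says nothing about how Twitter itself counts) showing how far apart code points and bytes can be for the same 140-character message:

```python
def measure(text: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),                      # Python's len(str) counts code points
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # little-endian, no byte order mark
    )


print(measure("a" * 140))       # (140, 140, 280)
print(measure("\u6f22" * 140))  # (140, 420, 280): 140 copies of one kanji
```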

prodigal_erik · on March 26, 2010

That's weird, I thought its limit was deliberately low enough to fit into an SMS message, which has a limit of 140 octets (160 characters in some 7-bit encoding GSM uses). Do they actually allow, say, 140 kanji?

anamax · on March 26, 2010

http://groups.google.com/group/twitter-development-talk/brow...

ricree · on March 26, 2010

That post basically just says go look at this wiki page: https://twitterapi.pbworks.com/Counting-Characters

Why not link to that in the first place?

JeremyStein · on March 25, 2010

More importantly, you need to go through the string to determine where the nth character is. You can't jump to a character by index.
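
A minimal Python sketch of that scan, assuming well-formed UTF-8 (the function name is made up for illustration): finding where character n starts means walking the bytes from the beginning.

```python
def utf8_byte_offset(data: bytes, n: int) -> int:
    """Byte offset at which the n-th (0-based) code point of UTF-8 data starts."""
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a character starts here
            if seen == n:
                return offset
            seen += 1
    raise IndexError("string has fewer than n + 1 code points")


data = "a€b".encode("utf-8")  # 1 byte + 3 bytes + 1 byte
print([utf8_byte_offset(data, i) for i in range(3)])  # [0, 1, 4]
```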

epochwolf · on March 25, 2010

Which you can fix by storing an unsigned long at the front of the string which holds the size.

If it would overflow, you just set the long to the max size, bite the bullet and read the entire string. If you are using a dynamic language, just throw whatever InfinitelyLargeNumber class it has in the size column and you're good.

If you're worried about RAM that much, you can just use plain strings when you need to.
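
A rough Python sketch of the stored-size idea (the class is invented for illustration; in C it would be a length field in front of the byte buffer). The count is computed once and kept up to date, so asking for the length no longer scans the bytes. Note that it only answers "how many", not where the nth character starts.

```python
class CountedUtf8:
    """UTF-8 bytes plus a character count that is computed once and cached."""

    def __init__(self, text: str) -> None:
        self.data = text.encode("utf-8")
        self.count = len(text)  # code points; Python's len(str) counts these

    def __len__(self) -> int:
        return self.count  # O(1): no scan of self.data needed

    def append(self, text: str) -> None:
        # Keep the cached count in step with the bytes.
        self.data += text.encode("utf-8")
        self.count += len(text)


s = CountedUtf8("a€")
s.append("b")
print(len(s), len(s.data))  # 3 characters, 5 bytes
```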

Daniel_Newby · on March 26, 2010

"Like that you need to go through the string to determine how many characters it has, due to their (potentially) variable length."

You have to do that with all Unicode encodings, since a semantic symbol can be composed of multiple combining characters.
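
A small Python illustration using the standard unicodedata module: the same visible "é" can be one code point or two, and even in a fixed-width encoding such as UTF-32 the two forms have different lengths.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))         # 1 2
print(len(decomposed.encode("utf-32-le")) // 4)  # 2 code units even in UTF-32
print(composed == decomposed)                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalizing to NFC before comparing or counting helps with cases like this one, though not every combining sequence has a precomposed form.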

hairsupply · on March 25, 2010

And of course, all developers should be familiar with U+F8D0 through U+F8FF.