As a Finnish software dev, I can say that these days native Unicode support is a must. We have a few special characters in our alphabet (åäö), and if using them requires a lot of manual work, the language is pretty much unusable for real-world tasks that involve any Finnish text.
Ruby took so long to get real Unicode support largely because Japanese developers resisted Unicode in favor of their own encodings, EUC-JP and Shift JIS, much as Finnish systems clung to ISO-8859-1(5) for so long.
Is Unicode something that needs to be in the core of a language, or is it sufficient to leave it to libraries, if the language's design doesn't prevent it?
On this computer (OpenBSD/i386), ICU has over 1 MB of libraries and a 15 MB data file. The whole Lua distribution fits in one 200k library. Bloating the core language with that seems impractical.
Imagine you needed to go through a library every time your string included, or might include, a letter as common as "s" or "v"; for Finnish text, that's what å, ä, and ö are. Basically you'd need to use this library for all your strings. But then you lose compatibility with the "normal" string type and have to stay constantly aware of the difference; you might also want to use some other library that doesn't support this Unicode library at all, and so on. It quickly becomes a very painful world to live in.
As you might imagine, just having support for Unicode baked into the language is very nice. Defaulting to Unicode for all text is even better.
Given how small Lua is, I can understand why it in particular doesn't come with Unicode support out of the box. My comment was written in response to the more general claim that Unicode support is a strange thing to consider essential in a programming language.
I think we have slightly different ideas about the language core vs. library distinction. In C, for example, printf and strlen are library functions (stdio.h and string.h), while structs are part of the core language.
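The same split exists in Lua, for what it's worth. A tiny illustration (the `#` length operator is core syntax, while `string.len` comes from the standard string library):

```lua
local s = "hello"

-- Core syntax: the length operator is part of the language grammar.
print(#s)             --> 5

-- Library function: string.len lives in the standard 'string' library,
-- which is loaded like any other library.
print(string.len(s))  --> 5
```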
All the language core needs for Unicode is reasonable support for tagging string literals (e.g., U"blah") and a binary-safe string type. It's best if there's a standard library (or a de facto community standard) for Unicode string operations, but that no more needs core support than the Linux kernel needs to know how to parse HTTP.
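Lua ended up as a concrete example of exactly this split: strings are binary-safe byte sequences, literals can embed code points with `\u{...}` escapes, and code-point awareness lives in a small `utf8` library rather than the core. A minimal sketch, assuming Lua 5.3 or later:

```lua
-- A \u{...} escape (Lua 5.3+) embeds the UTF-8 encoding of a code point
-- in an ordinary string literal.
local s = "\u{E5}\u{E4}\u{F6}"   -- the bytes of "åäö"

-- The core only guarantees a binary-safe byte string...
print(#s)            --> 6 (bytes)

-- ...while code-point awareness comes from the utf8 library.
print(utf8.len(s))   --> 3 (code points)
```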
> Is Unicode something that needs to be in the core of a language, or is it sufficient to leave it to libraries, if the language's design doesn't prevent it?
At the very least, there should be a first-class type that maps 1-to-1 to a Unicode code point. Then there should be easy ways to do common operations on strings in terms of code points (string comparison, substring matching, concatenation, and so on). Furthermore, encodings should be handled in a transparent way.
If the goal is to keep Lua to a 200k core, then there should be a mechanism to add such functionality as if it's built in.
EDIT: "transparent" meaning, it looks like core functionality.
> If the goal is to keep Lua to a 200k core, then there should be a mechanism to add such functionality as if it's built in.
There is. See "metatables" - the behavior of tables (Lua dicts) and userdata (handles to C pointers, or raw C pointers) is intentionally left minimal but extensible, so that new first-class ("transparent") types can be added.
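To make that concrete, here's a minimal sketch of a "transparent" Unicode string type built with metatables. The `ustring` and `U` names are made up for illustration, and it leans on the `utf8` library from Lua 5.3+:

```lua
local ustring = {}
ustring.__index = ustring

local function U(s)                  -- wrap a UTF-8 byte string
  return setmetatable({ bytes = s }, ustring)
end

function ustring.__len(u)            -- # counts code points, not bytes
  return utf8.len(u.bytes)
end

function ustring.__concat(a, b)      -- .. works with plain strings too
  local sa = type(a) == "string" and a or a.bytes
  local sb = type(b) == "string" and b or b.bytes
  return U(sa .. sb)
end

function ustring.__eq(a, b)          -- == compares the underlying bytes
  return a.bytes == b.bytes
end

function ustring.__tostring(u)
  return u.bytes
end

local s = U("päivää")
print(#s)                  --> 6 (code points; the byte length is 9)
print(tostring(s .. "!"))  --> päivää!
```

Code that just uses #s, s1 .. s2, or == never has to know it isn't dealing with a plain string, which is what "as if it's built in" buys you.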
For example, Lua doesn't have full regexp support (its built-in patterns are deliberately simpler), but there's LPeg (http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html), a library that adds a PEG-based matching/parsing engine (a superset of what REs can express). It's no less usable for being in a library rather than in the core.
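For a taste of what LPeg code looks like, here's a trivial but real example using its documented API:

```lua
local lpeg = require "lpeg"

-- Grammar: one or more decimal digits, captured as a string.
local digit  = lpeg.R("09")          -- character range 0-9
local number = lpeg.C(digit^1)       -- ^1 = one or more; C = capture

print(number:match("123abc"))        --> 123
print(number:match("abc"))           --> nil (no match at the start)
```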