Libgrapheme: A simple freestanding C99 library for Unicode (suckless.org)
108 points by harporoeder on Nov 15, 2022 | 45 comments



I've been trying to get into C this past week, and it's a great coincidence to see this today! I was just thinking how convenient it is to type emojis right into strings in Python and print them. I assumed C didn't have much Unicode compatibility, though I didn't research it.

I gave libgrapheme a try, and it compiled just as the instructions said it would. The hello-world program also mostly worked, but several things rendered incorrectly in my terminal. For example, the American flag emoji rendered as [U][S], and the family emoji rendered as three distinct emoji faces (side by side) rather than one grouped emoji.

I went to a website that lets me copy emojis to my clipboard, and I directly copy-pasted the American flag into my terminal, and I still got [U][S], so I think the problem is just with the terminal and not the library.

edit: Indeed, this is a problem in GNOME Terminal. I found an issue in the GNOME tracker[0] that is still open. The official name for the grouped emoji type is "ZWJ sequence"[1] (ZWJ is short for Zero-Width Joiner), and it appears not a lot of terminals support them. If anyone knows of a good one for Linux, please let me know!

Great stuff, thank you for sharing!

References:

[0]: https://gitlab.gnome.org/GNOME/vte/-/issues/2317

[1]: https://emojipedia.org/emoji-zwj-sequence/


> I was just thinking how convenient it is to type emojis right into strings in Python and print them. I assumed C didn't have much unicode compatibility, though I didn't research it.

Libgrapheme is a nice library, but it doesn't really have anything to do with this.

Almost all modern terminal emulators use the UTF-8 character encoding. In order to successfully output Unicode characters, your programming language doesn't actually need much "Unicode support"; it just needs to be able to send UTF-8-encoded bytes to stdout. (That's why many modern programming languages like Go and Zig define strings as a simple array of bytes.) Modern C compilers allow you to printf("é") and get the appropriate behavior.
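For example, assuming the source file is saved as UTF-8 and the terminal expects UTF-8, this is all it takes:

    #include <stdio.h>

    int main(void) {
        /* To the compiler this literal is just a sequence of UTF-8 bytes;
         * printf() forwards them to stdout unchanged, and a UTF-8 terminal
         * renders the accented letter and the emoji. */
        printf("héllo \xF0\x9F\x98\x80\n"); /* U+1F600 as explicit UTF-8 bytes */
        return 0;
    }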

As you mention, the terminal emulator also needs to be able to decode and display those UTF-8 bytes correctly, and a lot of terminals don't get it right in some situations. Off the top of my head, I don't know of a terminal that actually implements the entire (very complex) set of Unicode text rendering behaviors; maybe one of the web-based ones that run in Electron? macOS's Terminal.app is also pretty good IIRC.

Where libgrapheme comes in is if you want to analyze or manipulate a UTF-8-encoded string. It provides operations like "split into words" and "convert to uppercase". A surprising number of programs never need to do that stuff, but if you do, libgrapheme will give you a Unicode-compatible implementation. (Many more basic operations, like concatenating two strings, will work just fine without libgrapheme.)
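As a rough sketch of where that comes in (modeled on libgrapheme's own hello-world example; check the grapheme_next_character_break_utf8(3) man page for the exact interface, since I'm quoting it from memory), iterating over user-perceived characters rather than bytes looks roughly like this:

    #include <grapheme.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const char *s = "héllo 🇺🇸"; /* UTF-8-encoded input */
        size_t off, len;

        /* Walk the string one grapheme cluster at a time; SIZE_MAX tells
         * libgrapheme to treat the input as NUL-terminated. */
        for (off = 0; s[off] != '\0'; off += len) {
            len = grapheme_next_character_break_utf8(s + off, SIZE_MAX);
            printf("%zu-byte cluster: %.*s\n", len, (int)len, s + off);
        }
        return 0;
    }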


> That's why many modern programming languages like Go and Zig define strings as a simple array of bytes.

I'd argue this is not in fact a modern approach but rather a pretty dated one.

Modern language string API design instead leans into 'views' of the underlying data, depending on what you are trying to do with it:

- Transmission and storage: UTF-8 bytes.

- Parsing: code points.

- Editing: grapheme clusters.

- Rendering: pixels.

It doesn't make sense to consider a random assortment of bytes a 'string'; that's just an array of random bytes, and the language almost certainly already has a better type for that, like a vector of 8-bit unsigned values or a slice type. To be sensible, and to be differentiated from a slice type, strings should be opaque but always-valid encoded data.

Zig seems to be even worse than Go, requiring null-termination instead of storing them as Pascal strings. This means that a valid substring requires a copy so that the substring can itself be null-terminated. Inner pointers are only valid if they run to the end of the parent, and therefore substrings are, relatively speaking, very expensive.


> Zig seems to be even worse than Go, requiring null-termination instead of storing them as Pascal strings. This means that a valid substring requires a copy so that the substring can itself be null-terminated. Inner pointers are only valid if they run to the end of the parent, and therefore substrings are, relatively speaking, very expensive.

No, Zig uses slices normally (ie ptr + len). Terminators are part of the type system for the (relatively rare, mostly related to C interop) cases where you really want a sentinel value at the end of your array.


I see, I misread the docs. However, I stand by everything else; after all, it allows you to embed invalid UTF-8 content.


(Not a language or Unicode expert, the following likely has important mistakes.)

> Off the top of my head, I don't know of a terminal that actually implements the entire (very complex) set of Unicode text rendering behaviors

There are at least two reasons for this:

First, nobody actually seems to know how bidirectional text should interact with terminal control sequences, or indeed how it should be typeset on a terminal in the first place (what is the primary direction? where are the reordering boundaries?). There is the pre-Unicode bidirectional support mode (BDSM, I kid you not) in ECMA-48[1] and TR/53[2], which AFAIK nobody implements nor cares about; there are terminal emulators endorsed by bidi-language users[3], which AFAIK nobody has written down the behaviour of; there is the Freedesktop bidi terminal spec[4], which is a draft and AFAIK nobody implements yet either but at least some people care about; finally, there are bidi-language users who say that spec is a mistake[5].

Second, aside from bidi and a smattering of other things such as emoji, there is no detailed “Unicode text rendering behaviour”, only standards specific to font formats—the most recent among them being OpenType, which is dubiously compatible across implementations, decently documented only through painstaking reverse engineering (sometimes in words[6], sometimes only in Freetype library code), and generally full of snakes[7]. And it has no notion of a monospace font—only of a (proportional) font where all Lat/Cyr/Grk characters just happen to have the same advance.

AFAICT that is not negligence or an oversight, but rather a concession to the fact that there are scripts which don’t really have a notion of monospace in the typographic tradition and in fact are written such that it’s extremely unclear what monospace would even mean—certainly not one or two cells per codepoint (e.g. Burmese or Tibetan; apparently there are Arabic monospace fonts[8] but I’ve no idea how the hell they work). Not coincidentally, those are the scripts where you really, really need that shaper, otherwise nothing looks anywhere close to correct.

[This post could have been titled “Contra Muratori on Unicode in terminal emulators”.]

[1] https://www.ecma-international.org/publications-and-standard...

[2] https://www.ecma-international.org/publications-and-standard...

[3] https://news.ycombinator.com/item?id=8086417

[4] https://terminal-wg.pages.freedesktop.org/bidi/

[5] http://litcave.rudi.ir/

[6] https://github.com/n8willis/opentype-shaping-documents

[7] https://litherum.blogspot.com/2019/03/addition-font.html

[8] https://news.ycombinator.com/item?id=10395464


> First, nobody actually seems to know how bidirectional text should interact with terminal control sequences...

This goes beyond just bidirectional text. The traditional behavior of text in a terminal is based around two key assumptions, both of which break down catastrophically when dealing with non-ASCII text:

1) The state of a terminal can be represented as a set of cells, each of which has exactly one glyph in it and can be drawn independently from any other cell.

2) Printing a character will write a glyph to the cell the cursor is in and move the cursor to the right by one cell (or down to the next line).

The first assumption breaks down when dealing with full-width characters and ligatures/complex scripts, but can at least be papered over to handle full-width. The second assumption breaks down when exposed to virtually any interesting typographical feature (RTL, combining characters and ZWJ, shaped characters, etc). And I'm not sure it's possible to fix without some pretty substantial changes to how terminals operate -- standard terminal control sequences, and the code that uses them, are all built around these assumptions; introducing new behaviors like "the cursor doesn't always move from left to right" or "erasing the middle of a string might change how the rest of it displays" will break existing applications.

The ECMA standards are of absolutely no help in the matter. They were written in the early 1990s, before Unicode came onto the scene. Their idea of "international language support" was supporting both French and German.


Combining characters don't usually break those rules. They just mean that a single glyph is no longer 1-4 bytes; it's 1-128 bytes, or unlimited.


Consider a program that prints "A", waits a second, then prints a combining mark. Before the combining mark shows up, the terminal has no way of knowing to expect one, so it'll render an unaccented "A"; it then has to go back and modify that glyph (breaking the second rule) when the combining mark appears.
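A toy reproduction of that scenario, in plain C plus POSIX sleep() for the delay, with the combining acute accent U+0301 written out as explicit UTF-8 bytes:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        printf("A");
        fflush(stdout);       /* the terminal has already drawn a bare "A" */
        sleep(1);
        printf("\xCC\x81\n"); /* U+0301 COMBINING ACUTE ACCENT, as UTF-8 bytes */
        /* A correct terminal now has to go back and redraw the previous
         * cell as "Á"; many instead drop a stray accent into the next cell. */
        return 0;
    }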

A terminal can't defer rendering that "A" until another character shows up -- that'd break common use cases like echoing user input -- and it can't refuse to apply the combining mark just because it arrived later, because the input to a terminal is a stream of bytes, not characters, and that stream can be delayed or interrupted in any number of ways.

And that's without getting into other interactions with terminal features, like: what happens if you move the cursor, or change its attributes, between the base character and its combining mark? How do you apply a combining mark to the last character on a line, since printing that character moves the cursor to the next line? How do combining marks interact with backspace? And so on.


I wouldn't have said that modifying the last-printed glyph breaks rule 2. Either way, I think modifying the glyph is the better answer.

If you send a console command to move the cursor or do anything else to it, the simple answer is to make that a barrier between glyphs. If you output a combining character after doing that, the combining character goes into the next cell all by itself.

Backspace sent to a terminal should move the cursor left by one cell as usual. A backspace sent by a user would probably delete one glyph if no IMEs are involved. If an IME is involved that's a different problem than plain old combining characters.

Being the last character on a line isn't really any different from being in the middle of a line. The cursor auto-moves, but if the next bytes the terminal gets are combining characters then you can pretty easily shove them into the last-printed glyph.


> If you send a console command to move the cursor or do anything else to it, the simple answer is to make that a barrier between glyphs. If you output a combining character after doing that, the combining character goes into the next cell all by itself.

Maybe. I'd be concerned about what effects that might have on applications which need to be able to display multiple streams of output at once, like tmux.

> Being the last character on a line isn't really any different from being in the middle of a line

You might be surprised. After printing the last character on a line, the cursor goes into a weird "wrapnext" state, where the cursor stays at the end of the line, but will wrap to the next line (potentially triggering scrolling) when the next character is output. This state is highly ephemeral; it can't be entered through control sequences, even by DECSC/DECRC.


Here's another fun Unicode pitfall: does any terminal provide a way to display Chinese and Japanese text simultaneously, using the appropriate versions of the glyphs for each language's characters?


As far as existing terminals are concerned, I don’t know. FWIW, there are similar problems (though only to the point of looking wrong, not of misunderstanding) in other scripts: Cyrillic as used in Bulgarian and a number of other languages[1] and even Latin as used in Polish[2].

Even the Han version of the problem, though, does not seem to me to be the sort of “what does it even mean?” problem like those I listed above; it is more a question of what you want the input to be. You can make your terminal keep language state, e.g. using the deprecated language tags. Pro: some form of this likely already needs to happen for bidi support; similar to what HTML does. Con: no text file or program ever did this; your nice UTF-8-only terminal is now stateful and goes mad after `head /dev/urandom`. Alternatively, you can require the driving program to emit variation selectors for each Han character. Pro: the state and the ensuing madness are now limited; you can still pretend you’re looking at a stream of characters. Con: no text file or program ever did this; neither does HTML, although it theoretically could.

[1] https://commons.wikimedia.org/wiki/File:Cyrillic_alternates....

[2] https://www.twardoch.com/download/polishhowto/kreska.html


> does any terminal provide a way to display Chinese and Japanese text simultaneously, using the appropriate versions of the glyphs

Does any terminal provide a way to display (say) two consecutive lines in different fonts?

Because according to Unicode (and for once I don't particularly disagree), Chinese versus Japanese ideographs are the same characters, drawn in two different fonts.


https://www.facebook.com/appliedphilosophy/photos/a.82854007...

Many years ago I thought the highlighted text was a light-hearted joke. Now I see that it is actually quite serious. (And now I understand why, of all people, a Japanese philosopher brought up the issue.)

I personally think Unicode didn't go essentialist enough, though. There are lots of obscure CJK characters in Unicode that, when you look them up in a dictionary, are just described as another form of a common character. What I heard from people in the know was that the CJK committee basically made an exception for the Japanese because they were threatening to withdraw from the Unicode effort unless their demands were met, and for the most part they got what they wanted.

The "Chinese versus Japanese ideographs are the same characters, drawn in two different fonts" is actually a minority of those cases, which I personally interpret as clerical errors instead of a philosophical disagreement.

For example if Unicode were actually thorough with the essentialist philosophy, 關 and 関 would have the same code point.


No, because it's not actually possible without extra information. The codepoints in Unicode are the same if the languages use the same character (Han unification). With no extra information, if a string is composed only of unified codepoints, it isn't inherently any one language, so you'd have to resort to a dumb external setting that lets users pick one font in that situation. Next best, you'd have to scan the entire string, find at least one non-unified codepoint, use that to guess what language it is, and then map the correct font for that language. Or, if you have external data such as an encoding string, file metadata, or the text's source, you can simply obey what the metadata says.

Unification was done originally because early versions of Unicode were bit-constrained and there simply was not enough space to have separate codepoints for each language. If this were to be undone in some future version of Unicode, text encoders would need to make the backward-incompatible change (in terms of rendering in unknown downstream software) of using new codepoints. Then the "proper" font would be deterministic.


Oh, cool! Some more links for my "Unicode is a tire fire" collection. (And no, I don't want ASCII back, but finally something sane).


Honestly, to me it looks less like Unicode is a tire fire and more like human writing is much more complex than typewriter-tyranny Latin would lead you to believe. (OpenType is a tire fire, though, befitting its history as a final peace treaty between Adobe, Apple, and Microsoft after a decade or two of shifting two-against-one alliances. Good luck convincing the half-dozen or so main font foundries to switch to something else.)

I mean, look at Gujarati[1] and tell me how that thing is ever supposed to fit into character cells. Or think how Hebrew (RTL) text with (LTR) numbers should handle arrow keys in selections[2]—the problem was born in cursed depths where unspeakable ancient monsters breed, you lost (possibly your mind) the moment you decided to solve it. Watch a Japanese user type a romanization of their sentence on a QWERTY keyboard, displayed using the syllabary, and then have the computer guess the word boundaries and offer the appropriate ideograms in a little systemwide overlay[3]—watch and despair. And none of these have anything to do with the way you’ve encoded the strings into bytes. (Unless you tell the Hebrew users to type or process text in reverse, which is also utter insanity.)

[1] https://r12a.github.io/scripts/gujarati/gu.html

[2] https://ltr.wtf/explained/bidiintro.html#selections

[3] https://en.wikipedia.org/wiki/File:IME_demonstratie_-_Matsuo...


I've never seen a Unicode editor that handled Japanese top-to-bottom right-to-left oriented text. Setting up a terminal to switch between Japanese, Farsi, and German on the same screen would be quite a challenge. That's hard even on paper.


The bit encoding is only a very small part of Unicode. That alone would likely be bearable.

It's the whole thing that is problematic.

Unicode mixes way too many concepts in an unholy way. The result is insanity.

The original sin is of course that content and appearance / layout aren't strictly separated.

The whole idea that a text format should be "typewriter compatible" is also a huge part of the problem.

Human script is far too complex for anyone to pretend that a typewriter abstraction would suffice. But Unicode tries hard to keep this leaky abstraction (because of "backwards compatibility", even though it's actually not compatible in all kinds of "corner cases", BIDI being only one). The result is the dumpster fire we've got (and that we will likely never get rid of, as it's "good enough" for the majority; I mean, as long as you're using European scripts, and no scientific notation, math, and such…).


> And no, I don't want ASCII back, but finally something sane

ASCII was sane (give or take minor annoyances like hexadecimal digits or DEL); the problem was it was too narrowly scoped for general-purpose use. In the context it was actually developed for (exclusively English text on machines that struggled (or didn't bother) to even render 256 different characters, nevermind thousands), ASCII was a broadly-reasonable character encoding that even had half the encoding space left over for something like UTF-8 to build a sensible, more general-purpose encoding in.

Unfortunately, we got specifically UTF-8, with the dumpster fire that is Unicode tied around its neck like a string of albatrosses.
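(For reference, the scheme UTF-8 built into that leftover half of the byte space is at least tidy: the top bits of each byte say whether it is ASCII, the start of a multi-byte sequence, or a continuation byte. A from-memory sketch, worth checking against RFC 3629:)

    /* Leading-byte patterns in UTF-8:
     *   0xxxxxxx  ASCII, 1 byte        (U+0000 .. U+007F)
     *   110xxxxx  starts 2-byte seq    (U+0080 .. U+07FF)
     *   1110xxxx  starts 3-byte seq    (U+0800 .. U+FFFF)
     *   11110xxx  starts 4-byte seq    (U+10000 .. U+10FFFF)
     *   10xxxxxx  continuation byte
     */
    int utf8_sequence_length(unsigned char b) {
        if (b < 0x80)           return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        if ((b & 0xF8) == 0xF0) return 4;
        return 0; /* continuation or invalid leading byte */
    }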


Speaking of "getting" UTF8, I really enjoy this story on it's inception: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


C11 has full support for Unicode in source code for string and char literals and comments. Any runtime Unicode processing, however, has to be provided by outside code.
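A short illustration of what the language itself gives you; anything beyond literal syntax (casing, segmentation, normalization, ...) has to come from a library:

    #include <stdio.h>
    #include <uchar.h>

    int main(void) {
        const char     *s8  = u8"héllo";  /* UTF-8-encoded string literal */
        const char16_t *s16 = u"héllo";   /* UTF-16-encoded string literal */
        char32_t        c   = U'\u00E9';  /* one code point, as UTF-32 */

        printf("%s starts with U+%04X; lone code point U+%04lX\n",
               s8, (unsigned)s16[0], (unsigned long)c);
        return 0;
    }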


> If anyone knows of a good one for Linux, please let me know!

kitty has excellent support for this stuff, and is much more performant than anything based on vte (like gnome-terminal).

https://github.com/kovidgoyal/kitty


The claims about "word segmentation" caught my eye, because it's not even well defined in my native language. (one example among the difficulties: there's a lot of constructs like "rid-fucking-iculous". Should it be "rid/fucking/iculous" or "fucking ridiculous"?)

After checking the source to see how they perform this literally impossible feat, it turns out they were implementing a Unicode standard that basically tries to do something useful (for some families of languages) but has a whole wall of caveats: https://unicode.org/reports/tr29/#Word_Boundary_Rules

I can see users of the word segmentation function shooting their feet really badly with at least CJK languages, especially with a deceptively simple API like that. In general "word segmentation" doesn't make sense on a language-neutral level. libgrapheme even admits to this somewhat:

""" For some languages, for instance, it is necessary to have a dictionary on hand to always accurately determine when a word begins and ends. The defaults provided by the standard, though, already do a great job respecting the language's boundaries in the general case """

I disagree about the "great job" part. It's probably the best job a Unicode spec can do, but it's not going to be great for like at least 20% of the world's population... (Also a minor nit is that no dictionary is accurate, so even with a dictionary the results are not really "always accurate", just mostly accurate, unless you hit edge cases [for CJK at least], upon which advanced NLP techniques are needed.)

IMHO they should have a big disclaimer in their GRAPHEME_NEXT_WORD_BREAK(3) manual page warning users about the caveats (like, "if you're using this for anything other than $(these families of languages) make sure you really know what you're doing before using this function").
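For reference, the kind of loop we are talking about; the signature is taken from the GRAPHEME_NEXT_WORD_BREAK(3) page mentioned above, so treat it as approximate. Under the default UAX #29 rules the spaces and punctuation come back as their own segments, and each Han ideograph is its own "word":

    #include <grapheme.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const char *s = "rid-fucking-iculous 日本語";
        size_t off, len;

        for (off = 0; s[off] != '\0'; off += len) {
            len = grapheme_next_word_break_utf8(s + off, SIZE_MAX);
            printf("[%.*s]\n", (int)len, s + off);
        }
        return 0;
    }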


This is interesting, particularly for implementing Intl in JS engines without the mega-heavy ICU. But I wonder how portable it really is.

Sometimes I have to dig very deep to find that what folks call "portable C" is actually POSIX-dependent.

After going through the code for a bit, that doesn't appear to be the case here, so that's promising.


You can also refer to the Unicode routines of other small JS engines[1,2]; those don’t use ICU either, although the implementations are mercilessly size-optimized (to put it politely) and restricted to what the target JS version requires (e.g. Duktape does casemapping but no normalization). Still, Bellard’s in particular look like he had a small Unicode processing library lying around and just copied it into the tree, not like he was forced to write the absolute minimum to do a JS interpreter, so they can even be compared with dedicated libraries like libgrapheme, libutf8proc or libutf.

[1] https://github.com/bellard/quickjs/blob/master/libunicode.c

[2] https://github.com/svaarala/duktape/blob/master/src-input/du...


> This is interesting, particularly for implementing Intl in JS engines without the mega-heavy ICU.

On the contrary, every single programming language and platform should by default include ICU and provide easy bindings to its functionality. Because even normalization and equality comparisons are often done poorly or not at all in these "simple" libraries.


On the contrary, I think ICU is better handled as a "third party" library.

It gets updated quite often, and changes between versions can be substantial. If the developer cares about correctness, they need to have explicit control over the actual version of ICU they're bundling, instead of having the language/platform dealing with it as a black box.

It can be argued that in such cases they can always bundle their own custom version, but then the argument boils down to this: is it better to assure the developer that it works automagically even though it's only correct 99% of the time, or to not provide a default implementation and risk them choosing to implement a poor one themselves?


The functionality they’re implementing is purely string/byte processing; I don’t see what would introduce use of POSIX-only features.


Maybe a comparison to ICU4X is more interesting.


Does anyone know offhand whether this does comparisons? And normalization?


It looks like "no", just browsing its docs.


Yeah, I didn't see anything in the man pages.

For my purposes, a Unicode library that doesn't do string comparisons is, well ... incomplete. It's a fairly fundamental thing that you need to do: I would have thought you'd want it well before, e.g., up/down-casing.
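To make the gap concrete: without normalization, two canonically equivalent spellings of the same text compare unequal byte for byte. A minimal, library-free illustration:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Both render as "é": one is precomposed (U+00E9), the other is
         * "e" followed by a combining acute accent (U+0301). */
        const char *precomposed = "\xC3\xA9";
        const char *decomposed  = "e\xCC\x81";

        /* Byte-wise comparison says they differ; a Unicode-aware comparison
         * would normalize (e.g. to NFC) first and call them equal. */
        printf("%d\n", strcmp(precomposed, decomposed) != 0); /* prints 1 */
        return 0;
    }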



Is there any similar C library that deals with Normalisation and Collation?



I was hoping for some alternative since the linked article mentions utf8proc as such:

> Some libraries, like libutf8proc and libunistring, are incorrect by basing their API on assumptions that haven't been true for years (e.g. offering stateless grapheme cluster segmentation even though the underlying algorithm is not stateless). As an additional factor, libutf8proc's UTF-8-decoder is unsafe, as it allows overlong encodings that can be easily used for exploits.
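For context on the overlong-encoding point: an overlong encoding expresses a code point with more bytes than the shortest valid form, and a conforming decoder must reject it, because accepting it lets hostile input smuggle bytes past naive filters. Two classic examples (byte values per RFC 3629):

    /* U+0000 (NUL) has exactly one valid UTF-8 encoding: the single byte 0x00.
     * The two-byte sequence below decodes to the same code point but is
     * overlong, so a strict decoder must treat it as invalid input. */
    const unsigned char overlong_nul[]   = { 0xC0, 0x80 };

    /* Likewise, 0xC0 0xAF is an overlong encoding of '/' (U+002F); decoders
     * that accepted it have historically enabled path-traversal exploits. */
    const unsigned char overlong_slash[] = { 0xC0, 0xAF };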


There is a stateful API too, and I'm surprised they mention its decoder is unsafe; utf8proc is fuzzed by OSS-Fuzz, which I assume would catch something like that.


[flagged]



You can walk at night holding a torch without being a Nazi.


[flagged]


[flagged]


Personal attacks will get you banned here, regardless of what you're replying to. If you'd please review https://news.ycombinator.com/newsguidelines.html and not post like this to HN, we'd appreciate it.


Oh my. Don't roll your own crypto or Unicode library. There are too many edge cases for "simplicity" to scale well. Use what works and what's already been done rather than reinventing the wheel without a clear and necessary purpose.


This is a very general argument that can be used whenever anybody tries to write new software that does the same thing as existing software.


I think it specifically applies to implementing standards that are known to be difficult to implement and easy to get wrong.



