
"Character count" (i.e. grapheme cluster count) only exists once you apply a font—because of ligature rules and some other stuff.

Do you want your programming-language stdlib String class to require a font metrics library?

(This isn't a rhetorical question; you can go either way. Objective-C cares about grapheme clusters, and so needs to know about fonts.)




I disagree. A grapheme cluster and a glyph are different things. The former is not dependent on the font.

'fi' is two grapheme clusters even if the font renders it as a single glyph. 'é' is a single grapheme cluster even if the font renders it as two glyphs: 'e ́'
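A rough way to see this with no font involved at all is the Python sketch below. It uses only the standard `unicodedata` module. The helper name `count_clusters_simplified` is made up for illustration, and the rule it applies (a combining mark attaches to the preceding character) is a deliberate simplification of the real segmentation rules in UAX #29, which also cover Hangul, emoji sequences, and more.

```python
import unicodedata


def count_clusters_simplified(text: str) -> int:
    """Approximate grapheme cluster count using code point properties only.

    Simplification (an assumption of this sketch): a code point whose
    general category starts with "M" (Mn, Mc, Me) extends the previous
    cluster; everything else starts a new one. Full segmentation is
    defined by UAX #29 and has many more rules.
    """
    count = 0
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark extends the previous cluster; a leading mark stands alone.
        if not is_mark or i == 0:
            count += 1
    return count


samples = {
    "fi": "fi",                 # often drawn as one ligature glyph
    "e + U+0301": "e\u0301",    # may be drawn as two glyphs
    "file": "file",
}

for label, s in samples.items():
    # Columns: label, code point count, approximate cluster count
    print(label, len(s), count_clusters_simplified(s))
```

Nothing here loads a font or asks how the text is drawn; the counts come entirely from Unicode character properties.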


Consider the Regional Indicator Symbols (https://en.wikipedia.org/wiki/Regional_Indicator_Symbol). In every way that matters, these "act as" one 'character': you can't set your cursor position to be between them; backspacing one should delete both; the flag takes up one terminal column (even if it is rendered as two) such that "\033[1Dx" (cursor-left 1, print "x") will overwrite the whole flag, etc.

But it's the font that controls those semantics—because it's the font that knows what flags do or do not exist. For any pair of RIS codepoints that doesn't form a flag (in the opinion of a given font), they behave like two separate characters.

Thus: one grapheme cluster, or two, depending on the font.

And another example—this one much less "idiomatic", but not specifically decried by the Unicode committee: http://kudakurage.com/ligature_symbols/

That's a font, making entirely-arbitrary clusters out of codepoints. As far as I am aware, it's fully within its rights as a font to do so. There's nothing in the Unicode standard saying that the code-points ['f', 'i', 'l', 'e'], put in a row, can't combine to form a single grapheme cluster. They don't have combining behavior themselves, but—unlike, say, things in the Unicode "Separator" class—they don't have any property that says they don't combine with anything.


Cursor position is based on glyphs though, not grapheme clusters. Grapheme clusters are a well-specified Unicode concept.

>Thus: one grapheme cluster, or two, depending on the font.

One glyph or two depending on the font. Two grapheme clusters, always.

>That's a font, making entirely-arbitrary clusters out of codepoints. As far as I am aware, it's fully within its rights as a font to do so. There's nothing in the Unicode standard saying that the code-points ['f', 'i', 'l', 'e'], put in a row, can't combine to form a single grapheme cluster. They don't have combining behavior themselves, but—unlike, say, things in the Unicode "Separator" class—they don't have any property that says they don't combine with anything.

No, I think those are glyphs. 'file' is always four grapheme clusters.



