
What's the solution? Only use ASCII? Don't use padStart/padEnd? Does anyone know anything about monospace fonts and any guarantees they make wrt. unicode?

Well.

The description given of "multiple bytes of unicode" is terribly misleading.

There are multiple reasons you might run into trouble in a JavaScript string. One is that JavaScript uses UTF-16 for its string type, which represents Unicode code points as 16-bit code units. Code points that fit in 16 bits take one unit, and code points that don't take two, so it's a variable-width encoding (Unicode code points top out at U+10FFFF, which needs only 21 bits, so two units per code point is the most you'll see in a UTF-16 sequence). This is done via a mechanism which derives two "surrogate" code points from the original, and the resulting two code points are called a "surrogate pair".
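For the curious, here's a rough sketch of how that derivation works. The helper name toSurrogatePair is made up for illustration (it isn't a built-in), and U+1F600 GRINNING FACE is just a convenient code point that doesn't fit in 16 bits:

```javascript
// Hypothetical helper: split a code point above U+FFFF into the two
// UTF-16 surrogate code units that encode it.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// The engine produces the same two units for this code point.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1F600)); // true
console.log("\u{1F600}".length); // 2
```

You'd never need to write this yourself, since String.fromCodePoint already does it; it's only here to show where the two units come from.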

Unfortunately, JavaScript leaks this implementation detail to the programmer, which means many string operations can "cut" a surrogate pair in two and leave you with code points that don't actually represent any character, because they're from the surrogate range.
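A small illustration of that "cut", using a string I made up for the purpose (the letter "a", U+1F600 GRINNING FACE, and the letter "b"):

```javascript
const text = "a\u{1F600}b";

// length and slice count UTF-16 code units, not code points.
console.log(text.length); // 4

// Slicing after two units keeps "a" plus only the high surrogate.
const cut = text.slice(0, 2);
console.log(cut.length);                      // 2
console.log(cut.charCodeAt(1).toString(16));  // d83d

// Iterating the string (spread, for...of) walks code points instead.
console.log([...text].length); // 3
```

The lone d83d at the end of cut is a surrogate with no partner, so it doesn't represent any character on its own.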

But that's not what's happening in the given example.

Emoji are complicated. Some emoji use only a single Unicode code point, while others are composed from multiple code points, potentially with a joiner code point in between. Here's a comment I just posted in another thread with an example:

https://news.ycombinator.com/item?id=16757317

In the example in the linked article, the "heart" emoji is actually two code points: U+2764 HEAVY BLACK HEART and U+FE0F VARIATION SELECTOR-16. The first is a heart-shaped character that's been around for years; the second is a "variation selector" character which tells whatever's rendering this that it should use a variant emoji-style presentation.

But since that's two code points, again, operations on it can "cut" it in half and cause havoc.
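Here's a sketch of how that can play out with padEnd. I'm not reproducing the article's exact code, only assuming the heart is passed as the fill string:

```javascript
// U+2764 HEAVY BLACK HEART followed by U+FE0F VARIATION SELECTOR-16.
const heart = "\u2764\uFE0F";

// Both code points are in the BMP, so each takes one UTF-16 unit.
console.log(heart.length); // 2
console.log(Array.from(heart, (c) => c.codePointAt(0).toString(16))); // [ '2764', 'fe0f' ]

// padEnd truncates the fill string to fit the target length, which is
// measured in code units. One unit of room means only U+2764 survives.
const padded = "abc".padEnd(4, heart);
console.log(padded === "abc\u2764"); // true
console.log(padded.endsWith("\uFE0F")); // false
```

With the variation selector dropped, whatever renders the result is free to fall back to the plain text-style heart.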

The "workaround" is not to use string indexing or slicing operations in JavaScript if you think you'll be handed emoji, anything from outside the Basic Multilingual Plane of Unicode, or anything else built from more than one code point, or to be prepared to handle them manually.
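One way to "manually handle them" is to count user-perceived characters rather than code units. This is only a sketch: it assumes a runtime that ships Intl.Segmenter (recent browsers and Node releases do), and graphemes and padEndGraphemes are hypothetical helper names, not built-ins:

```javascript
// Hypothetical helper: split text into grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function graphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// Hypothetical padEnd replacement: counts graphemes and never splits the
// fill string. Assumes fill is a single grapheme cluster.
function padEndGraphemes(text, targetLength, fill = " ") {
  const missing = targetLength - graphemes(text).length;
  return missing > 0 ? text + fill.repeat(missing) : text;
}

const fill = "\u2764\uFE0F"; // heart plus emoji variation selector
console.log(graphemes(fill).length); // 1

const result = padEndGraphemes("abc", 5, fill);
console.log(graphemes(result).length); // 5
console.log(result.length);            // 7 code units
```

This keeps the sequences intact, but it says nothing about how wide each grapheme ends up on screen, so it won't fix column alignment in a monospace font.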

As to monospace fonts, this article has a great rundown of how Unicode actually works, and an exploration of how various monospace environments try, and often fail, to handle it:

https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/



