String length functions for single emoji characters evaluate to greater than 1 (hsivonen.fi)
116 points by kevincox on March 26, 2021 | 127 comments



Raku seems to be more correct (DWIM) in this regard than all the examples given in the post...

  my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";

  #one character
  say emoji.chars; # 1 
  #Five code points
  say emoji.codes; # 5

  #If I want to know how many bytes that takes up in various encodings...
  say emoji.encode('UTF8').bytes; # 17 bytes 
  say emoji.encode('UTF16').bytes; # 14 bytes 

Edit: Updated to use the names of each code point since HN cannot display the emoji


And if you try to say emoji.length, you'll get an error:

No such method 'length' for invocant of type 'Str'. Did you mean any of these: 'codes', 'chars'?

Because as the article points out, the "length" of a string is an ambiguous concept these days.


You can represent it as a sequence of escapes. If Raku handles this the same way as Perl5, it should be:

    $a = "\N{FACE PALM}\N{EMOJI MODIFIER FITZPATRICK TYPE-3}\N{ZERO WIDTH JOINER}\N{MALE SIGN}\N{VARIATION SELECTOR-16}";


now do it in YAML


In summary, the two useful measures of Unicode length are:

* Number of native (preferably UTF-8) code units

* Number of extended grapheme clusters

The first is useful for basic string operations. The second is good for telling you what a user would consider a "character".
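
As a rough illustration of both measures (Python here, with the facepalm sequence spelled out in escapes; the grapheme-cluster count assumes the third-party regex module, since the standard library only exposes codepoints):

    import regex  # third-party module; \X matches an extended grapheme cluster

    # U+1F926 FACE PALM, U+1F3FC FITZPATRICK TYPE-3, U+200D ZWJ,
    # U+2642 MALE SIGN, U+FE0F VARIATION SELECTOR-16
    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

    print(len(s.encode("utf-8")))           # 17 UTF-8 code units (bytes)
    print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units
    print(len(s))                           # 5 codepoints
    print(len(regex.findall(r"\X", s)))     # 1 extended grapheme cluster (with a recent Unicode database)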


> The first is useful for basic string operations.

Reversing a string is what I would consider a basic string operation, but I also expect it not to break emoji and other grapheme clusters.

Nothing is easy.


Personally I'd consider reversing a string to be a pretty niche use case and, as you say, it's a complex operation. Especially in some languages.

Tbh I can't think of a time when I've actually needed to do that.


This always comes up as a thing that people use as an interview question or an algorithm problem. But for what conceivable non-toy reason would you actually want to reverse a string?

I have never once wanted to actually do this in real code, with real strings.

Furthermore, the few times I've tried to do something like this by being cute and encoding data in a string, I never get outside the ASCII character set.


To index from the end of the string. If you want the last 3 characters from a string, it's often easier to reverse the string and take the first 3.


Most languages that are good enough at doing Unicode are also modern enough to give you a “suffix” function.


And even ASCII can't do a completely dumb string reverse, because of at least \r\n.


Trivia/history question: There's a very good reason it's \r\n and not \n\r.


The commands were used to drive a teletype machine. Carriage return and newline were independent operations, with carriage return physically taking longer. Sending \r first gave the carriage time to travel back while the line feed was processed, so the machine could print the next letter sooner.


Generating a list through recursion often involves reversing it at the end, so I’ve reversed a lot of Erlang strings in real code.


Outside of an interview have you ever needed to reverse a string?


Sorting by reversed string is not totally uncommon.


What is a use case for this?


Another use case is finding words with a common ending. Can be used for finding similar inflections or rhymes (in which case you'd use some form of phonetic alphabet, which fits in normal ASCII, so the problem wouldn't arise there). I also once used it for a (very) quick&dirty spell correction algorithm, which had to run in less than 1s and had to read all data from CD-ROM.


I did this yesterday.

    find ... | rev | sort | rev > sorted-filelist
I had several directories and I wanted to pull out the unique list of certain files (TrueType fonts) across all of them, regardless of which subdirs they were in. (I'm omitting the find CLI args for clarity; the command just finds all the *.ttf (case-insensitive) files in the dirs.)

By reversing the lines before sorting them (and un-reversing them) they come out sorted and grouped by filename.


That's clever. But in a programming language (instead of a composition of coreutils), I would use a filename() function or something.


I have my personal library of string functions

Recently I rewrote the codepoint reverse function to make it faster. The trick is to not write codepoints, but copy the bytes.

On the first attempt I introduced two bugs.


I agree. Also, truncating a string won't work. The first 3 characters of "Österreich" are not "Ös" in my opinion.


> Reversing a string

I wonder: several comments are saying it hardly makes any sense to reverse a string, but surely there are useful algorithms out there which do work by, at some point, reversing strings, no? I mean: not just for the sake of reversing it, but for lookup/parsing or I don't know what.


Some parsing algorithms may want to reverse the bytes but that's different to reversing the order of user-perceived characters.


Reversing a string does not make sense in most languages.


Sorting filenames by extension


If you reverse a filename string, you'd be sorting by a reversed extension. Most mature programming language infrastructures have a FilenameExtension() function or similar which in this case would be the best to use.


While I agree with this assessment, it means that these are the basic string operations:

* Passing strings around

* Reading and writing strings from files / sockets

* Concatenation

Anything else should reckon with extended grapheme clusters, whether it does so or not. Even proper upcasing is impossible without knowing, for one example, whether or not the string is in Turkish.


One difficulty in using extended grapheme clusters is that which codepoints get merged into a cluster changes depending on the Unicode version, and sometimes the platform and library. For collaborative editing, the basic unit of measure is codepoints rather than grapheme clusters, because you don’t want any ambiguity about where an insert is happening on different peers, or a library bump to change how historical data is interpreted.


> For collaborative editing, the basic unit of measure is codepoints

I'd quibble it's not the basic unit of measure so much as how changesets are represented. The user edits based on grapheme clusters. The final edit is then encoded using codepoints, which makes sense because a changeset amounts to a collection of basic string operations (splitting, concatenating, etc). As you note, it would be undesirable for changesets to be aware of higher level string representation details.


For that matter, as long as the format is restricted to one encoding, I don't see why the unit of a changeset can't just be a byte array.

I can see why it would happen to be a codepoint, since that might be ergonomic for the language, but it seems to me that, like clustering codepoints together into graphemes, clustering bytes into codepoints is something the runtime takes care of, such that a changeset will be a valid example of all three.


The reason you mentioned (requiring every system to use the same string encoding) matters. Interpreting a UCS-2 byte offset in Rust (which uses UTF-8 internally) isn’t easy. Or, symmetrically, patching a JavaScript string based on a UTF-8 byte offset. It’s especially hard if you want to do better than an O(n) linear scan of the entire document’s contents.

Using byte offsets also makes it possible to express a change which corrupts the encoding - like inserting in the middle of a multi-byte codepoint. That goes against the principle of “make invalid data unrepresentable”. Your code is simpler if you don’t have to guard against this sort of thing, and you don’t have to worry about it at all if these invalid changes are impossible to represent in the patch format.


> Number of native (preferably UTF-8) code units

> The first is useful for basic string operations

Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory. Basic string operations, such as converting to uppercase, would operate on codepoints, regardless of how many code units are used to encode that codepoint.

Converting 'Á' to 'á', for example, is an operation on one codepoint but multiple code units.


> Can you expand on this? I don't see why knowing the number of code units would be useful except when calculating the total size of the string to allocate memory

I’ve used this for collaborative editing. If you want to send a change saying “insert A at position 10”, the question is: what units should you use for “position 10”?

- If you use byte offsets then you have to enforce an encoding on all machines, even when that doesn’t make sense. And you’re allowing the encoding to become corrupted by edits in invalid locations. (Which goes against the principle of making invalid state impossible to represent).

- If you use grapheme clusters, the positions aren’t portable between systems or library versions. What today is position 10 in a string might tomorrow be position 9 due to new additions to the Unicode spec.

The cleanest answer I’ve found is to count using Unicode codepoints. This approach is encoding-agnostic, portable, simple, well defined and stable across time and between platforms.
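
A minimal sketch of what that looks like in practice (Python, with hypothetical helper names): the codepoint index travels over the wire, and each peer converts it to whatever its local representation needs.

    def insert_at_codepoint(doc: str, pos: int, text: str) -> str:
        # 'pos' counts Unicode codepoints, so it means the same thing on
        # every peer regardless of how the document is stored locally.
        return doc[:pos] + text + doc[pos:]

    def codepoint_to_utf8_offset(doc: str, pos: int) -> int:
        # A peer that stores the document as UTF-8 bytes converts the
        # codepoint index into a byte offset (O(n) here; a rope or index
        # structure can make this faster).
        return len(doc[:pos].encode("utf-8"))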


>calculating the total size of the string to allocate memory

In some languages this is 90% of everything you do with strings. In other languages it's still 90% of everything done to strings, but done automatically.


Neither code points nor code units help with uppercasing ß.


Contemplate 2 methods of writing a 'dz' digraph

  dz - \u0064\u007a, 2 basic latin block codepoints
  DZ - \u0044\u005a
  Dz - \u0044\u007a
  
  dz - \u01f3, lowercase, single codepoint
  DZ - \u01f1, uppercase
  Dz - \u01f2, TITLECASE!
What happens if you try to express dż or dź from Polish orthography?

You can use

  dż - \u0064\u017c - d followed by 'LATIN SMALL LETTER Z WITH DOT ABOVE'
  dż - \u0064\u007a\u0307 - d followed by z, followed by combining diacritical dot above
  dż - \u01f3\u0307 - dz with combining diacritical dot above

  multiplied by uppercase and titlecase forms
In Polish orthography the dz digraph is considered 2 letters, despite being only one sound (głoska). I'm not so sure about Macedonian orthography; they might count it as one thing.

Medieval ß is a letter/ligature that was created from ſʒ - that is, a long s and a tailed z. In other words it is a form of the 'sz' digraph. Contemporarily it is used only in German orthography.

How long is ß?

By some rules uppercasing ß yields SS or SZ. Should uppercasing or titlecasing operations change length of a string?
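
A quick way to poke at those mappings is Python, whose str casing methods follow the Unicode case mappings (a small sketch, not an endorsement of any particular answer):

    import unicodedata

    print("ß".upper())         # 'SS' - uppercasing changes the length from 1 to 2
    print("\u01F3".upper())    # U+01F1, the uppercase DZ
    print("\u01F3".title())    # U+01F2, the titlecase Dz
    # z + combining dot above composes to the precomposed ż (U+017C) under NFC,
    # but there is no single codepoint for the whole dż digraph
    print(unicodedata.normalize("NFC", "dz\u0307"))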


I mean, if you remove the concept of a computer, and ask the question "how many letters are in this word?", you are likely going to end up in some highly-contextual conversations--taking into account all of language, culture, geography, and time--with respect to some of these examples... the concept of "uppercasing or titlecasing" has absolutely no reason to somehow have some logical basis like "the number of characters (which I will note is itself poorly-defined even on a computer) remains constant".


> The first is useful for basic string operations.

The only thing it's useful for is sizing up storage. It does nothing for "basic string operations" unless "basic string operations" are solely 7-bit ascii manipulations.


It's useful for any and all operations that involve indexing a range.

Yes you locate the specific indexes by using extended grapheme clusters, but you use the retrieved byte indexes to actually perform the basic operations. These indexes can also be cached so you don't have to recalculate their byte position every time (so long as the string isn't modified).


> It's useful for any and all operations that involve indexing a range.

That has little to do with the string length, those indices can be completely opaque for all you care.

> Yes you locate the specific indexes by using extended grapheme clusters

Do you? I'd expect that the indices are mostly located using some sort of pattern matching.


And how does that pattern matching work?


By traversing the string?


Yes. Any pattern matching has to match on extended grapheme clusters or else we're back to the old "treating a string like it's 7-bit ascii" problem.


Why do people prefer UTF-8 coordinates? While for storage I think we should use UTF-8, when working with strings live it’s just so much easier to use UTF-16 because it’s predictable: 1 unit for the basic plane and 2 for everything else (the multi-character emoji and modifier stuff aside). I am probably biased because I mostly use and think about DOMStrings which are always UTF-16 but I’m not sure why people who use languages which are more flexible about string representations than JavaScript would not also appreciate this kind of regularity.


I struggle to see why you'd ever want UTF-16. If you're using a variable length encoding, might as well stick to UTF-8. If you want predictable sizes, there's UTF-32 instead.

~Also, DOM Strings are not UTF-16, they're UCS-16.~

EDIT: UCS-2, not UCS-16. Also, I'm confusing the DOM with EcmaScript, and even that hasn't been true in a while.


> Also, DOM Strings are not UTF-16, they're UCS-16.

Hm, according to the spec they should be interpreted as UTF-16 but this isn't enforced by the language so it can contain unpaired surrogates:

From https://heycam.github.io/webidl/#idl-DOMString

> Such sequences are commonly interpreted as UTF-16 encoded strings [RFC2781] although this is not required... Nothing in this specification requires a DOMString value to be a valid UTF-16 string.

From https://262.ecma-international.org/11.0/#sec-ecmascript-lang...

> The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value... Operations that do interpret String values treat each element as a single UTF-16 code unit. However, ECMAScript does not restrict the value of or relationships between these code units, so operations that further interpret String contents as sequences of Unicode code points encoded in UTF-16 must account for ill-formed subsequences.


Yes, UTF-16 / UCS-2 is a silly encoding and in retrospect it was a mistake. It matters because for a while it was believed that 2 bytes would be enough to encode any Unicode character. During this time, lots of important languages appeared which figured a 2-byte fixed-size encoding was better than a variable-length encoding. So Java, JavaScript, C# and others all use UCS-2. Now they suffer the downsides of both using a variable-length encoding and being memory inefficient. The string.length property in all these languages is almost totally meaningless.

UTF-8 is the encoding you should generally always reach for when designing new systems, or when implementing a network protocol. Rust, Go and other newer languages all use UTF-8 internally because it’s better. Well, and in Go’s case because its author, Rob Pike, also had a hand in inventing UTF-8.

Ironically, C and UNIX, which (mostly) stubbornly stuck with single-byte character encodings, generally work better with UTF-8 than a lot of newer languages.


UTF-32 is a bit like giving up and saying code units equal code points right? I’m more interested in the comparison between UTF-8 and UTF-16, where UTF-8 requires 1 to 3 bytes just in the BMP, with 3 bytes for CJK characters. I’m saying that as a quick measure of the actual length of a string, UTF-16 is much more predictable and provides a nice intuitive estimation of how long text actually is, as well as providing a fair encoding for most commonly used languages.

RE the UTF-16 vs UCS-2 stuff, that’s probably a distinction which has technical meaning but will collapse at some point because no one actually cares, much like the distinction between URI and URL.


> I’m saying that as a quick measure of the actual length of a string, UTF-16 is much more predictable

Meaning the software will deal much less well when it's wrong.

> and provides a nice intuitive estimation of how long text actually is

Not really due to combining codepoints, which make it not useful.

> as well as providing a fair encoding for most commonly used languages.

Which we know effectively doesn't matter: it's essentially only a gain for pure CJK text being stored at rest, because otherwise the waste on ASCII will more than compensate for the gain.


The distinction between the basic plane and everything else is not particularly useful, imho. So I'm not sure I understand what the advantage of UTF-16 is here? It's an arbitrary division either way.


The article explains this.

To summarize: the "codepoint" is a broken metric for what a grapheme "is" in basically any context. The edge case would be truncating with a guarantee that the substring is still a valid encoding? But really you want to truncate at extended grapheme cluster boundaries. Truncating a country flag between two of the regional indicator symbols might not throw errors in your code, but no user is going to consider that a valid string. The same is true of all manner of composed characters, and there are a lot of them.

So the only advantage of using the very-often-longer UTF-16 encoding is that it's an attractive nuisance! This makes it easier to write code which will do the wrong thing, constantly, but at a low enough rate that developers will put off fixing it.

Unicode is variable width, and what width you need is application-specific. That's the whole point of the article! UTF-8 doesn't try to hide any of this from you.
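
To make the flag example concrete (Python, with the flag spelled out in escapes since HN strips emoji): truncating between the two regional indicators never raises an error, it just silently yields a different, meaningless string.

    flag = "\U0001F1E8\U0001F1E6"      # REGIONAL INDICATOR C + A, i.e. the Canadian flag
    print(len(flag))                   # 2 codepoints
    print(len(flag.encode("utf-8")))   # 8 bytes
    print(flag[:1])                    # a lone regional indicator - still valid Unicode, wrong result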


Performance is one reason. UTF-8 is up to twice as compact as UTF-16, which allows much better cache locality. And for Latin-like text, you’re probably frequently hitting the 2x better limit.


After having read (quite interesting) the article, I still don't quite get the subtitle:

> 'But It’s Better that "[emoji]".len() == 17 and Rather Useless that len("[emoji]") == 5'

It sounds like it's just whether you're counting UTF-8/-16/-32 units. Does the article explain why one is worse and one is "rather useless?"


I agree. The author talks a little bit about which UTF encoding makes sense in which situation, but they never make an argument about which result from len is correct.

My two cents is that string length should always be the number of Unicode codepoints in the string, regardless of encoding. If you want the byte length, I'm sure there is a sizeof equivalent for your language.

When we call len() on an array, we want the number of objects in the array. When we iterate over an array, we want to deal with one object from the array at a time. We don't care how big an object is. A large object shouldn't count for more than 1 when calculating the array length.

Similarly, a unicode codepoint is the fundamental object in a string. The byte-size of a codepoint does not affect the length of the string. It makes no sense to iterate over each byte in a unicode string, because a byte on its own is completely meaningless. len() should be the number of objects we can iterate over, just like in arrays.


What is a string an array of?

If you ask the user, it's an array of characters (aka extended grapheme clusters in Unicode speak).

If you ask the machine it's an array of integers (how many bytes make up an integer depends on the encoding used).

Nothing really considers them an array of code points. Code points are only useful as intermediary values when converting between encodings or interpreting an encoded string as grapheme clusters.


You could do it the way Raku does. It's implementation defined. (Rakudo on MoarVM)

The way MoarVM does it is that it does NFG, which is sort of like NFC except that it stores grapheme clusters as if they were negative codepoints.

If a string is ASCII it uses an 8-bit storage format, otherwise it uses a 32-bit one.

It also creates a tree of immutable string objects.

If you do a substring operation it creates a substring object that points at an existing string object.

If you combine two strings it creates a string concatenation object. Which is useful for combining an 8-bit string with a 32-bit one.

All of that is completely opaque at the Raku level of course.

    my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]";

    say $str.chars;        # 1
    say $str.codes;        # 5
    say $str.encode('utf16').elems; # 7
    say $str.encode('utf16').bytes; # 14
    say $str.encode.elems; # 17
    say $str.encode.bytes; # 17
    say $str.codes * 4;    # 20
    #(utf32 encode/decode isn't implemented in MoarVM yet)


    .say for $str.uninames;
    # FACE PALM
    # EMOJI MODIFIER FITZPATRICK TYPE-3
    # ZERO WIDTH JOINER
    # MALE SIGN
    # VARIATION SELECTOR-16
The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode. (I have 4 files all named rèsumè in the same folder on my computer.) utf8-c8 uses the same synthetic codepoint system as grapheme clusters.


> If you ask the machine it's an array of integers

Not sure what you mean by this. A string is an array of bytes, in the way that literally every array is an array of bytes, but it's not "implemented" with integers. It's a UTF-encoded array of bytes.

And what is the information that is encoded in those bytes? Codepoints. That's what UTF does, it lets us store unicode codepoints as bytes. There is a reasonable argument that the machine, or at least the developer, considers a string as an array of codepoints.


UTF-8 is an array of bytes (8 bit integers).

UTF-16 is an array of 16 bit integers.

UTF-32 is an array of 32 bit integers.

The machine doesn't know anything about code points. If you want to index into the array you'll need to know the integer offset.


> The machine doesn't know anything about code points. If you want to index into the array you'll need to know the integer offset.

The machine doesn't know anything about Colors either. But if I defined a Color object, I would be able to put Color objects into an array and count how many Color objects I had. You're being needlessly reductive.

> UTF-8 is an array of bytes (8 bit integers)

UTF-8 encodes a codepoint with 1-4 single-byte code units. The reason UTF-8 exists is to provide a way for machines and developers to interact with unicode codepoints.

Is a huffman code an array of bits? Or is it a list of symbols encoded using bits?


You seem to be thinking of the abstraction as a concrete thing. A code point is like LLVM IR; an intermediary language for describing and converting between encodings. It is not a concrete thing in itself.

The concrete thing we're encoding is human readable text. The atomic unit of which is the user perceived character.

I'm curious, what use is knowing the number of code points in a string? It doesn't tell the user anything. It doesn't even tell the programmer anything actionable.


But what about combining characters? https://en.wikipedia.org/wiki/Combining_Diacritical_Marks

Should the letter plus a combining character count as one (I think so), or two characters? Should you normalize before counting length? And so on.
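
For instance (Python, stdlib unicodedata), the same user-perceived character can be one codepoint or two depending on normalization:

    import unicodedata

    decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT, renders as 'é'
    composed = unicodedata.normalize("NFC", decomposed)   # single codepoint U+00E9

    print(len(decomposed))   # 2
    print(len(composed))     # 1 - same user-perceived character, different codepoint count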


Combining characters are their own unicode codepoint, so they count towards length. The beauty of this approach is that it's simple and objective.

If you had a list of 5 DOM Element objects, and one DOM Attr object, the length of that list is 6. It's nonsensical to say "The Attr object modifies an Element object, so it's not really in the list".


Going by bytes is also simple and objective. And also totally arbitrary, just like going by codepoints.

Which is the most useful for dealing with strings in practice though? Are either interpretations useful at all?


Going byte by byte is useless. You can't do anything with a single byte of a unicode codepoint (unless, by luck, the codepoint is encoded in a single byte).

Codepoint is the smallest useful unit of a unicode string. It is a character, and you can do all the character things with it.

If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.


> If you wanted to implement a toUpper() function for example, you would want to iterate over all the codepoints.

Nope. In order to deal with special casings you will have to span multiple codepoints, at which point it's no more work to operate on whatever the code units are than on codepoints.


> Combining characters are their own unicode codepoint, so they count towards length.

This is incredibly arbitrary - it depends entirely on what "length" means for a particular usecase. From the user's perspective there might only be a single character on the screen.

Any nontrivial string operation must be based around grapheme clusters, otherwise it is fundamentally broken. Codepoints are a useful encoding agnostic way to handle basic (split & concatenate) operations, but the method by which those offsets are determined needs to be grapheme cluster aware. Raw byte offsets are encoding specific and only really useful for allocating the underlying storage.


You don't count the length. Length specifies the size of the internal encoding.

You count the width. And there are Unicode rules for how you count the width, which do change every year.


perhaps the issue is that there's a canonical "length" at all. It would make more sense to me to have different types of length depending on which measure you're after, like Swift apparently has but without the canonical `.count`. Because when there's multiple interpretations of a thing's length, when you ask for "length" you're leaving the developers to resolve the ambiguity and I'm of the firm belief that developers shouldn't consider themselves psychic.


The main reason, I think, that Swift strings have `count` is that they conform to the `Collection` protocol. Swift's stdlib has a pervasive "generic programming" philosophy, enabled by various protocol hierarchies.

So, given that the property is required to be present, some semantic or the other had to be chosen. I am sure there were debates when they were writing `String` about which one was proper.


In Swift 3 (and probably previous versions as well), `String.count` defaulted to the count of the Unicode scalar representation. In this version, iterating over a string would operate on each Unicode scalar, which often doesn't make sense due to the expected behaviour of extended grapheme clusters. So, this is my best guess why `String` in Swift 4 and later ended up with the current default behaviour.


that's very cringe of them.

> So, given that the property is required to be present, some semantic or the other had to be chosen.

Sounds like an invented solution to an invented problem. The programmer's speciality.


On the contrary it is based!

This conforms exactly to our intuition about what a "collection" is. Some (exact) number of items which share some meaningful property in common such that we can consider them "the same" for purposes of enumeration and iteration.

In the real world, we also have to decide what our collection is a collection of! Let's say we have a pallet of candy bars, each of which is in a display box. If we want to ask "how many X" are on the pallet, we have to decide whether we're asking about the candy bars or the boxes. Clearly we should be able to answer questions about both; just as clearly, operations on the pallet should work with boxes. Usually we don't want to open them, and even when we do want to open them, we have to actually open them; we can't just ignore their existence!

I assert that the extended grapheme cluster is the "box" around bytes in a string. Even if you do care about the contents (very often you do not!) you have to know where the boundaries are to do it! Because a Fitzpatrick skin tone modifier all on its own has different semantics from one found within an emoji modifier sequence.

So it makes perfect sense for Swift to provide one blessed way to iterate over strings, and provide other options for when you're interested in some other level of aggregation. Which is what Swift does.


I think the problem is that strings are ambiguous enough a collection to warrant extra semantics.

The alternative to a collection would be an iterator or other public method returning a collection-like accessor, which would be a good compromise.

Though if you were to choose a canonical collection quantum, then it'd probably be the grapheme cluster, yeah.

Unfortunately OOP can never be based; only functional or procedural programming can attain such standards.


Strings were done well, because they are just BidirectionalCollection and not RandomAccessCollection on graphemes, which is usually what you would want (especially as an app developer writing user-facing code). The other views are collections in their own right. By conforming to Collection a string can do things like be sliced and prefixed and searched over “for free”, which are extremely common operations to define generically.


OOP using a meta-object protocol is very based indeed.

Unfortunately, only Common Lisp and Lua do it that way.

Actor models are pretty based as well, and have a better historical claim to the title "object oriented programming" than class ontologies do, but that ship has sailed.


Yes agreed; I'm guessing by meta object that is the module pattern from fp with syntax sugar? I use the module pattern with typescript interfaces plus namespaces and it's pretty great.

100% on the actor model. My visual programming platform is basically based on actors, but the core data model is cybernetic (persistence is done via self referentiality). Alan Kay got shafted by the creation of C++, OO in the original conception was very based.


If you're working with an image, you might have an Image class that has an Image.width and an Image.height in pixels, regardless of how these pixels are laid out in memory (depends on encoding, colorspace, etc). Most if not all methods operate on these pixels, e.g. cropping, scaling, color filtering, etc. Then, there might be an Image.memory property that provides access to the underlying, actual bytes.

I don't understand why the same is not the obvious solution for strings. len("🇨🇦") should be 1, because we as humans read one emoji as one character, regardless of the memory layout behind. Most if not all methods operate on characters (reversing, concatenating, indexing).

And then, if you need access to the low level data, the String.memory would contain the actual bytes... which would be different depending on the actual text encoding.


The number of bytes necessary is incredibly important for security reasons. It's arguably better to make the number of bytes be the primary value and have a secondary lookup be the number of glyphs.

To be fair, some systems distinguish between size and length (with size expected to be O(1) and length allowed to be up to O(n)). For those systems, proceed as the parent comment suggests.


Python 3's approach is clearly the best, because it focuses on the problem at hand: Unicode codepoints. A string in Python is a sequence of Unicode codepoints; its length should be the number of codepoints in the sequence, and it has nothing to do with bytes.

To draw an absurd parallel, "[emoji]".len() == 17 is equivalent to [1,2].len() == 8 (two 32-bit integers).

In my opinion the most useful result in the case the article describes is 5. There should of course be a way to get 1 (the number of extended grapheme clusters), but it should not be the string's "length".
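
For illustration, that's how Python 3 behaves (a quick sketch):

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # the facepalm sequence from the article

    print(len(s))           # 5 - Python counts codepoints
    print(s[2])             # indexing yields a one-codepoint string (the ZWJ), never a raw byte
    print(len(list(s)))     # iterating yields 5 one-codepoint strings
    print(len(s.encode()))  # 17 - bytes only appear once you pick an encoding (UTF-8 here)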


Don't Swift and Go support iterating over graphemes? Edit: yes, Swift is mentioned at the bottom of the article.

It'd be great to have a function for that in other scripting languages like Python, Ruby, etc.

There was an interesting sub-thread here on HN a while ago about how a string-of-codepoints is just as bad as a string-of-bytes (subject to truncation, incorrect iteration, incorrect length calculation) and that we should just have string-of-bytes if we can't have string-of-graphemes. I don't agree, but some people felt very strongly about it.


Defining string as a sequence of unicode codepoints is the mistake.

Nobody ever cares about unicode codepoints. You either want the number of bytes or the width of the string on screen.

UTF-32 codepoints waste space and give you neither.


The width on the screen is in pixels. Yes, I find monospace fonts increasingly pointless.


But what do you do when you're processing a string with codepoints that compose into one user-visible glyph?

    >>> len("🇨🇦")
    2


yeah, and what do you do when you got a nonexistent country: "🇩🇧".


Your example changes length depending on the font. (If a new country gets the code DB, fonts will gradually be updated to include the flag.)

I'll bet there are some systems in the PRC that don't include 🇹🇼.


Gah, I wish I was done with my crate so I could point to lovely formatted documentation...

But I believe this can be handled explicitly and well and I'm trying to do that in my fuzzy string matching library based on fuzzywuzzy.

https://github.com/logannc/fuzzywuzzy-rs/pull/26/files


This may be the best tech writeup I've ever seen. Super in-depth, easily readable. Really well done.


With clickbaity misrepresentations of correctness, unfortunately.


If you do a formula in Google Sheets and it contains an emoji, this comes into play. For example, if the first character is an emoji and you want to reference it you need to do =LEFT(A1, 2) - and not =LEFT(A1, 1)


We need to deal with mountains of user input from mobile devices and the worst we have run into is "smart" quotes from iOS devices hosing older business systems. Our end users are pretty good about not typing pirate flags into customer name fields.

I still haven't run into a situation where I need to count the logical number of glyphs in a string. Any system that we need to push a string into will be limiting things on the basis of byte counts, so a .Length check is still exactly what we need.

Does this cause trouble for anyone else?


I have used glyph counting a handful of times, mostly for width computing before I learned there were better ways. I'm 100% sure my logic was just waiting to fail on any input that didn't use the Latin alphabet.


Relevant related writeup that was just on HN the other day (I think... or I saw it somewhere else):

https://tonsky.me/blog/emoji/

Edit: confirmed, it was on HN just half a day prior. Probably why this article arrived too. Just surprised nobody referenced this other one in this thread :)


I like how Go handles this case by providing the utf8.DecodeRune/utf8.DecodeRuneInString functions, which return each individual "rune" (code point) as well as its size in bytes.

Coming from python2, for me it was the first time I saw a language handle unicode so gracefully.



Rust's graphemes.count sounds nice.

In C land it is called ucwidth (libunistring). "Length" is too arbitrary. Bytes? Unicode code points? UTF-8 length?


5 in Perl, 17 in PHP.


17 in PHP with strlen, which is defined as counting bytes. 1 when you use grapheme_strlen. mb_strlen and iconv_strlen return 5, and 5 is rather useless, as the article says.


I was wondering what the title meant. Turns out HN’s emoji stripper screwed with the title.

It’s asking why a skin toned (not yellow) facepalm emoji’s length is 7 when the user perceives it as a single character.

Tangent: Emojis are an interesting topic in regards to programming. They challenged the “rule” held by programmers that every character is a single codepoint (of 8 or 16 bits). So, str[1234] would get me the 1235th character, but it’s actually the 1235th byte. UTF-8 threw a wrench in that, but many programmers went along ignoring reality.

Sadly, preexisting languages such as Arabic weren’t warning enough in regards to line breaking. As in: an Arabic “character”[a] can change its width depending on if there’s a “character” before or after it (which gives it its cursive-like look). So, a naive line breaking routine could cause bugs if it tried to break in the middle of a word. Tom Scott has a nice video on it that was filmed when the “effective power” crash was going around.[0]

[0]: https://youtu.be/hJLMSllzoLA

[a]: Arabic script isn’t technically an alphabet like Latin characters are. It’s either an abugida or abjad (depending on who you ask). See Wikipedia: https://en.wikipedia.org/wiki/Arabic_script


Interesting. Also, many of us use fonts with ligatures, which render as a single character (for example: tt, ti, ff, Th, ffi)

Of course, we're taught to parse that as multiple discrete letters from an early age, so we don't get confused :)


I think this title needs an exemption from HN's no-emoji rule


I’m curious: is the no-emoji rule a rule that happens to block emoji or a hardcoded rule? What I mean is: emojis (in UTF-16) have to use surrogate pairs because all their code points are in the U+1xxxx plane. Is the software just disallowing any characters needing two UTF-16 code units (a surrogate pair) to encode (which would include emoji)? Or is it specifically singling out the emoji blocks?
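
(For reference, a quick Python check of which characters need a surrogate pair in UTF-16 - everything above U+FFFF does:)

    def utf16_units(ch: str) -> int:
        # number of UTF-16 code units needed for a single codepoint
        return len(ch.encode("utf-16-le")) // 2

    print(utf16_units("\u2672"))      # 1 - recycling symbol, BMP
    print(utf16_units("\U0001F605"))  # 2 - emoji above U+FFFF need a surrogate pair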


It seems to have changed recently. I recall a thread about plastics a few months ago where the plastic type symbols (♳ through ♹, i.e. U+2673 through U+2679) disappeared, but I see them now.

Edits:

⭕ that post was https://news.ycombinator.com/item?id=25237688

⭕ The rules seem pretty arbitrary. Recycling symbol U+2672 ♲ is allowed but recycled paper symbol U+267C is not. Chess kings ♔ are allowed but checkers kings aren't.

⭕ is allowed (for now).

⭕ I think the right thing to do would be to strip anything with emoji presentation http://unicode.org/reports/tr51/#Presentation_Style


It’s not that simple, as many OSes render characters that aren’t strictly emoji “as emoji”, and there is no standard way to check for this.


It's a very deliberate and precise filter. I can write in old persian: "𐎠𐎡𐎢" (codepoints 0x103xx) or CJK "𠀀𠀁𠀂" (codepoints 0x2000x), but can't write "" (emoji with codepoint 0x1F605)


The latter.


Why is there such a rule?


Emojis aren’t professional (depending on who you ask) and can be overused (look at a few popular GitHub projects’ READMEs - emojis on every bullet point of the feature list, for example)


What a weird reasoning. In most contexts I've seen, "professionalism" doesn't really do anything except strip away all the human factor that goes into every interaction. Personally, I don't care whether the person I'm talking to is being "professional" or not. What I care is that they're respectful and can properly communicate their thoughts.

With that mindset, emojis (and emoticon) can actually add context to interactions, considering how much context is lost when communicating over text. A simple smiley face at the end of a sentence can go a long way in my experience :)


I agree that they aren't the most professional thing to use, but I'm not sure why I agree

Maybe it's some kind of bias?


[flagged]


I didn’t say that they weren’t. I’m simply saying that some do think that. Whether I do or don’t is irrelevant as I’m only speculating on the reason why.


Emoji draw attention to themselves in a way that plain text does not. I assume this is why Hacker News does not allow bolding text and strongly discourages use of ALL CAPS.


This is a forum sponsored by opinionated people who find them annoying. I find them annoying, too, so I think it's a good convention. Somehow ":)" became an industry, and, as industries do, it then eliminated ":)" (which gets replaced with a smiling yellow head in most places).

It isn't the only arbitrary convention enforced here, and the sum of those conventions are what attracts the audience.


Surprisingly many people try to abuse Unicode when posting submissions and comments. This includes not just emojis, but also other symbols, or glyphs that look like bold, underlined or otherwise stylized letters.


> Surprisingly many people try to abuse Unicode when posting submissions and comments.

˙ǝsnqɐ ƃuıpıoʌɐ ʇɐ qoɾ pooƃ ɐ ɥɔns sǝop sɹǝʇɔɐɹɐɥɔ ɟo ɥɔunq ɐ ƃuıuuɐq ʎʃıɹɐɹʇıqɹɐ ǝsnɐɔǝ𐐒


𝕴 𝖉𝖔𝖓'𝖙 𝖙𝖍𝖎𝖓𝖐 𝖙𝖍𝖊𝖞'𝖛𝖊 𝖇𝖆𝖓𝖓𝖊𝖉 𝖎𝖙 𝖊𝖓𝖙𝖎𝖗𝖊𝖑𝖞.


HN's character stripping is completely arbitrary.

You can play mahjong, domino or cards (🀕, 🁓, 🃅) but not chess, you can show some random glyphs like ⏱ or ⤪ but not others, you can use box-drawing characters (┐) but not block elements, you can use byzantine musical notation (𝈙) but only western musical symbols (𝅗𝅥) and the notes are banned, you can Z̸̠̽a̷͍̟̱͔͛͘̚ĺ̸͎̌̄̌g̷͓͈̗̓͌̏̉o̴̢̺̹̕ just fine.


Also, looking at the article, they are quite complex! It looks like handling emoji properly requires a large investment, and the payoff is somewhat small.


For HN I would think almost all the complexity is in rendering. That’s a job for your browser.

What’s left is things like the max length for a title (not too problematic to count that in code points or bytes, I think)

The big risk, I think, is in users (mis)using rendering direction, multiple levels of accents, zero-width characters and weird code points to mess up the resulting page. Some Unicode symbols and emojis typically look a lot larger and heavier than ‘normal’ characters, switching writing direction may change visual indentation of a comment, etc.

Worse, Unicode rendering being hard, text rendering bugs that crash browsers or even entire devices are discovered every now and then. If somebody entered one in a HN message, that could crash millions of devices.


Emoji don't introduce anything that isn't used for various other languages. Emojis are just the most visible breakage for American users when you screw up your unicode handling.

HN however handles all of unicode just fine, it just chose to specifically exclude emojis (and a bunch of other symbols)


I think emoji have been wonderful for driving widespread Unicode support. They combat ossification, and they are also a nice carrot to incite users to install updates.

However, I don't like the rabbit hole it started to go into with gendered and colored emojis. There's never going to be enough. I wish we had stuck to color-neutral and gender-neutral, like original smileys :)

I find it also conveys too much meaning. I am generally not interested in knowing a person's gender or ethnic group when discussing over text... but I digress.


I've said this before, but not here:

Eventually, Unicode will allow you to combine flag modifiers with human emojis. So you can have a black South Korean man facepalming.

This will trigger a war which ends industrial civilization, and is the leading candidate for the Great Filter.


Stripping some arbitrary subset of characters is a lot more work than just letting them through, which is what HN would otherwise be doing: the hard work is done by the browser, HN doesn't render emoji or lay text out.


In my curmudgeonly way, I suggest zero is the correct value, to reflect the information content of almost all emoji use.


Counterpoint: Appending U+1F609, U+1F61C, or U+1F643 to the end of your comment would have added immense value in communicating your point because it would have lightened the otherwise harsh tone of your words :)


I almost did, but I was concerned that doing so might be flirting with dissembling, were the inherent irony not recognized! (BTW: Is ‘!’ the original emoji?)


With that metric the length of your comment should be zero as well. Probably mine also :)



