If you ask the user, it's an array of characters (aka extended grapheme clusters in Unicode speak).
If you ask the machine, it's an array of integers (how many bytes make up each integer depends on the encoding used).
Nothing really considers them an array of code points. Code points are only useful as intermediary values when converting between encodings or interpreting an encoded string as grapheme clusters.
You could do it the way Raku does.
Strictly speaking it's implementation-defined; what follows describes Rakudo on MoarVM.
What MoarVM does is NFG (Normal Form Grapheme), which is sort of like NFC except that multi-codepoint grapheme clusters are stored as synthetic negative codepoints.
If a string is ASCII it uses an 8-bit storage format, otherwise it uses a 32-bit one.
It also creates a tree of immutable string objects.
If you do a substring operation it creates a substring object that points at an existing string object.
If you combine two strings it creates a string concatenation object, which is useful for combining an 8-bit string with a 32-bit one.
All of that is completely opaque at the Raku level of course.
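A minimal sketch of that rope-style representation, in illustrative Python rather than MoarVM's actual code: substrings and concatenations are nodes that reference existing buffers instead of copying them.

```python
class Flat:
    """Leaf node owning an immutable buffer of characters."""
    def __init__(self, chars):
        self.chars = tuple(chars)
    def __len__(self):
        return len(self.chars)
    def at(self, i):
        return self.chars[i]

class Substring:
    """Points into an existing node; no copy is made."""
    def __init__(self, node, start, length):
        self.node, self.start, self.length = node, start, length
    def __len__(self):
        return self.length
    def at(self, i):
        return self.node.at(self.start + i)

class Concat:
    """Joins two nodes, which may use different storage widths."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __len__(self):
        return len(self.left) + len(self.right)
    def at(self, i):
        if i < len(self.left):
            return self.left.at(i)
        return self.right.at(i - len(self.left))

def to_str(node):
    """Flatten a node tree back into an ordinary string."""
    return ''.join(node.at(i) for i in range(len(node)))

hello = Flat("hello world")
sub = Substring(hello, 0, 5)       # "hello", shares hello's buffer
combined = Concat(sub, Flat("!"))  # "hello!", no flattening needed
```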
my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]";
say $str.chars; # 1
say $str.codes; # 5
say $str.encode('utf16').elems; # 7
say $str.encode('utf16').bytes; # 14
say $str.encode.elems; # 17
say $str.encode.bytes; # 17
say $str.codes * 4; # 20
#(utf32 encode/decode isn't implemented in MoarVM yet)
.say for $str.uninames;
# FACE PALM
# EMOJI MODIFIER FITZPATRICK TYPE-3
# ZERO WIDTH JOINER
# MALE SIGN
# VARIATION SELECTOR-16
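For comparison, the same cluster in Python, which indexes strings by code point; the byte counts line up with the Raku output above. (Python's built-in `len` has no grapheme awareness, so it reports 5 where Raku's `.chars` reports 1.)

```python
# FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS-16
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                      # code points, not graphemes
print(len(s.encode("utf-16-le")))  # bytes (2 per UTF-16 code unit)
print(len(s.encode("utf-8")))      # bytes
print(len(s.encode("utf-32-le")))  # bytes (4 per code point)
```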
The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode.
(I have 4 files all named rèsumè in the same folder on my computer.)
utf8-c8 uses the same synthetic codepoint system as grapheme clusters.
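Python's closest analogue to utf8-c8 is the `surrogateescape` error handler (what `os.fsdecode` uses for filenames): bytes that aren't valid UTF-8 round-trip through `str` as lone surrogate code points, much as utf8-c8 round-trips them as synthetic codepoints.

```python
raw = b"r\xe8sum\xe8"  # Latin-1 "rèsumè" -- not valid UTF-8

# Decode smuggles each bad byte 0xNN through as U+DCNN...
name = raw.decode("utf-8", errors="surrogateescape")

# ...and encode recovers the original bytes exactly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```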
> If you ask the machine it's an array of integers
Not sure what you mean by this. A string is an array of bytes, in the way that literally every array is an array of bytes, but it's not "implemented" with integers. It's a UTF-encoded array of bytes.
And what is the information that is encoded in those bytes? Codepoints. That's what UTF does: it lets us store Unicode codepoints as bytes. There is a reasonable argument that the machine, or at least the developer, considers a string an array of codepoints.
> The machine doesn't know anything about code points. If you want to index into the array you'll need to know the integer offset.
The machine doesn't know anything about Colors either. But if I defined a Color object, I would be able to put Color objects into an array and count how many Color objects I had. You're being needlessly reductive.
> UTF-8 is an array of bytes (8 bit integers)
UTF-8 encodes a codepoint with 1-4 single-byte code units. The reason UTF-8 exists is to provide a way for machines and developers to interact with Unicode codepoints.
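To make the 1-4 code unit range concrete, a quick check in Python (using the stdlib `str.encode`): one byte for ASCII, up through four for supplementary-plane characters like the face-palm emoji.

```python
# Code point value vs. UTF-8 encoded length in bytes.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F926"]:  # A, é, €, 🤦
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```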
Is a huffman code an array of bits? Or is it a list of symbols encoded using bits?
You seem to be thinking of the abstraction as a concrete thing. A code point is like LLVM IR; an intermediary language for describing and converting between encodings. It is not a concrete thing in itself.
The concrete thing we're encoding is human readable text. The atomic unit of which is the user perceived character.
I'm curious, what use is knowing the number of code points in a string? It doesn't tell the user anything. It doesn't even tell the programmer anything actionable.