If you ask the user, it's an array of characters (aka extended grapheme clusters in Unicode speak).
If you ask the machine, it's an array of integers (how many bytes make up each integer depends on the encoding used).
Nothing really considers them an array of code points. Code points are only useful as intermediary values when converting between encodings or interpreting an encoded string as grapheme clusters.
You could do it the way Raku does.
Strictly speaking it's implementation-defined; what follows describes Rakudo on MoarVM.
What MoarVM does is NFG (Normal Form Grapheme), which is sort of like NFC except that multi-codepoint grapheme clusters are stored as synthetic negative codepoints.
If a string is ASCII it uses an 8-bit storage format, otherwise it uses a 32-bit one.
It also creates a tree of immutable string objects.
If you do a substring operation it creates a substring object that points at an existing string object.
If you combine two strings it creates a string concatenation object, which is useful for combining an 8-bit string with a 32-bit one.
All of that is completely opaque at the Raku level of course.
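A minimal sketch of that rope-style representation, in illustrative Python rather than MoarVM's actual code: substrings and concatenations are nodes that reference existing buffers instead of copying them.

```python
class Flat:
    """Leaf node owning an immutable buffer of characters."""
    def __init__(self, chars):
        self.chars = tuple(chars)
    def __len__(self):
        return len(self.chars)
    def at(self, i):
        return self.chars[i]

class Substring:
    """Points into an existing node; no copy is made."""
    def __init__(self, node, start, length):
        self.node, self.start, self.length = node, start, length
    def __len__(self):
        return self.length
    def at(self, i):
        return self.node.at(self.start + i)

class Concat:
    """Joins two nodes, which may use different storage widths."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __len__(self):
        return len(self.left) + len(self.right)
    def at(self, i):
        if i < len(self.left):
            return self.left.at(i)
        return self.right.at(i - len(self.left))

def to_str(node):
    """Flatten a node tree back into an ordinary string."""
    return ''.join(node.at(i) for i in range(len(node)))

hello = Flat("hello world")
sub = Substring(hello, 0, 5)       # "hello", shares hello's buffer
combined = Concat(sub, Flat("!"))  # "hello!", no flattening needed
```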
my $str = "\c[FACE PALM, EMOJI MODIFIER FITZPATRICK TYPE-3, ZWJ, MALE SIGN, VARIATION SELECTOR-16]";
say $str.chars; # 1
say $str.codes; # 5
say $str.encode('utf16').elems; # 7
say $str.encode('utf16').bytes; # 14
say $str.encode.elems; # 17
say $str.encode.bytes; # 17
say $str.codes * 4; # 20
#(utf32 encode/decode isn't implemented in MoarVM yet)
.say for $str.uninames;
# FACE PALM
# EMOJI MODIFIER FITZPATRICK TYPE-3
# ZERO WIDTH JOINER
# MALE SIGN
# VARIATION SELECTOR-16
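For comparison, the same cluster in Python, which indexes strings by code point; the byte counts line up with the Raku output above. (Python's built-in `len` has no grapheme awareness, so it reports 5 where Raku's `.chars` reports 1.)

```python
# FACE PALM + FITZPATRICK TYPE-3 + ZWJ + MALE SIGN + VS-16
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                      # code points, not graphemes
print(len(s.encode("utf-16-le")))  # bytes (2 per UTF-16 code unit)
print(len(s.encode("utf-8")))      # bytes
print(len(s.encode("utf-32-le")))  # bytes (4 per code point)
```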
The reason we have utf8-c8 encode/decode is because filenames, usernames, and passwords are not actually Unicode.
(I have 4 files all named rèsumè in the same folder on my computer.)
utf8-c8 uses the same synthetic codepoint system as grapheme clusters.
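Python's closest analogue to utf8-c8 is the `surrogateescape` error handler (what `os.fsdecode` uses for filenames): bytes that aren't valid UTF-8 round-trip through `str` as lone surrogate code points, much as utf8-c8 round-trips them as synthetic codepoints.

```python
raw = b"r\xe8sum\xe8"  # Latin-1 "rèsumè" -- not valid UTF-8

# Decode smuggles each bad byte 0xNN through as U+DCNN...
name = raw.decode("utf-8", errors="surrogateescape")

# ...and encode recovers the original bytes exactly.
assert name.encode("utf-8", errors="surrogateescape") == raw
```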
> If you ask the machine it's an array of integers
Not sure what you mean by this. A string is an array of bytes, in the way that literally every array is an array of bytes, but it's not "implemented" with integers. It's a UTF-encoded array of bytes.
And what is the information that is encoded in those bytes? Codepoints. That's what UTF does: it lets us store Unicode codepoints as bytes. There is a reasonable argument that the machine, or at least the developer, considers a string an array of codepoints.
> The machine doesn't know anything about code points. If you want to index into the array you'll need to know the integer offset.
The machine doesn't know anything about Colors either. But if I defined a Color object, I would be able to put Color objects into an array and count how many Color objects I had. You're being needlessly reductive.
> UTF-8 is an array of bytes (8 bit integers)
UTF-8 encodes a codepoint with 1-4 single-byte code units. The reason UTF-8 exists is to provide a way for machines and developers to interact with Unicode codepoints.
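To make the 1-4 code unit range concrete, a quick check in Python (using the stdlib `str.encode`): one byte for ASCII, up through four for supplementary-plane characters like the face-palm emoji.

```python
# Code point value vs. UTF-8 encoded length in bytes.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F926"]:  # A, é, €, 🤦
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")
```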
Is a huffman code an array of bits? Or is it a list of symbols encoded using bits?
You seem to be thinking of the abstraction as a concrete thing. A code point is like LLVM IR; an intermediary language for describing and converting between encodings. It is not a concrete thing in itself.
The concrete thing we're encoding is human readable text. The atomic unit of which is the user perceived character.
I'm curious, what use is knowing the number of code points in a string? It doesn't tell the user anything. It doesn't even tell the programmer anything actionable.