Objects of type `string` are internally encoded as UTF-16. Byte arrays or spans ...

tialaramex · on May 22, 2022

> if a string literal contains only UTF-8 characters and you assign it to a byte array or span, it gets encoded as UTF-8.

I write a bunch of C# for my job, but am far from an expert in the language. As I read it, this statement is redundant, which makes me feel sure it's trying to communicate something the authors thought was "obvious" but is not.

* A string literal - so, realistically some Unicode text, right? All the other encodings anybody was actually using can transliterate to Unicode, so, they are just Unicode (with a different encoding)

* contains only UTF-8 characters - UTF-8 is an encoding of Unicode, so, this just means Unicode again

I'm guessing you can actually write something that's not Unicode in a C# string for some reason? But what that might be is unexplained:

Can you... emit arbitrary bytes? But how when your native encoding (UTF-16) isn't even byte oriented? What does that mean?

Maybe you can emit the rare Unicode "non-characters" like U+FFFF? But you can express those just fine in UTF-8, so who cares?

Or perhaps it's as simple as C# letting you write literals that are sequences of 16-bit code units but aren't valid UTF-16?

bzxcvbn · on May 22, 2022

The proposal is linked right there in the blog post. You could read it and save some time. https://github.com/dotnet/csharplang/blob/main/proposals/utf...

> The language will allow conversions between string constants and byte sequences where the text is converted into the equivalent UTF8 byte representation. Specifically the compiler will allow string_constant_to_UTF8_byte_representation_conversion - implicit conversions from string constants to byte[], Span<byte>, and ReadOnlySpan<byte>. A new bullet point will be added to the implicit conversions §10.2 section. This conversion is not a standard conversion §10.4.

    byte[] array = "hello";             // new byte[] { 0x68, 0x65, 0x6c, 0x6c, 0x6f }
    Span<byte> span = "dog";            // new byte[] { 0x64, 0x6f, 0x67 }
    ReadOnlySpan<byte> span = "cat";    // new byte[] { 0x63, 0x61, 0x74 }
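
The byte values in those comments can be checked at run time. The implicit conversion quoted above is proposal syntax and may not compile with a given compiler, so this minimal sketch uses the long-standing `Encoding.UTF8.GetBytes` instead. The class name `Utf8Check` and the helper `Hex` are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: encode the text as UTF-8 and format the bytes as hex.
    public static string Hex(string text) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(text));

    public static void Main()
    {
        Console.WriteLine(Hex("hello")); // 68-65-6C-6C-6F
        Console.WriteLine(Hex("dog"));   // 64-6F-67
        Console.WriteLine(Hex("cat"));   // 63-61-74
    }
}
```

For pure ASCII text like this, each UTF-16 code unit in the `string` becomes exactly one UTF-8 byte.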

> When the input text for the conversion is a malformed UTF16 string then the language will emit an error:

    const string text = "hello \uD801\uD802";
    byte[] bytes = text; // Error: the input string is not valid UTF16
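
The "malformed UTF16" case appears to be the one where a `string` holds 16-bit code units that do not form valid UTF-16: in the example above, `\uD801` and `\uD802` are both high surrogates, so neither has a low surrogate to pair with. A rough run-time analogue of that compile-time error, sketched with a strict `UTF8Encoding` (the class `StrictUtf8` and helper `TryEncode` are hypothetical names, not part of the proposal):

```csharp
using System;
using System.Text;

public static class StrictUtf8
{
    // throwOnInvalidBytes: true makes the encoder throw on unpaired surrogates
    // instead of silently substituting U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns false when the text is not well-formed UTF-16.
    public static bool TryEncode(string text, out byte[] bytes)
    {
        try
        {
            bytes = Strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }

    public static void Main()
    {
        // Two high surrogates in a row, as in the proposal's example.
        Console.WriteLine(TryEncode("hello \uD801\uD802", out _)); // False
        // A proper high + low surrogate pair (U+10437).
        Console.WriteLine(TryEncode("\uD801\uDC37", out var ok));  // True
        Console.WriteLine(BitConverter.ToString(ok));              // F0-90-90-B7
    }
}
```

The default `Encoding.UTF8` would not throw here. It replaces each unpaired surrogate with U+FFFD, which is presumably the kind of silent data change the proposal avoids by making this a compile-time error.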