
That's not wrong.

    In [1]: import json
    
    In [2]: json.dumps("\N{PILE OF POO}")
    Out[2]: '"\\ud83d\\udca9"'
Quality detective work there. JSON, being JavaScript [Object Notation], defines its \u escape sequences in terms of UTF-16 code units. What you see here is "PILE OF POO" in UTF-16.
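
The round trip does come back out intact, at least in CPython 3 (a quick check, nothing authoritative):

    In [3]: json.loads('"\\ud83d\\udca9"')
    Out[3]: '💩'

    In [4]: json.loads(json.dumps("\N{PILE OF POO}")) == "\N{PILE OF POO}"
    Out[4]: True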

YAML specifies that the recommended encoding is UTF-8, but parsers should also be able to handle UTF-16 and UTF-32. And if you want that, you need to tell YAML to expect UTF-16, which you can do by including a BOM (Byte Order Mark).
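
For example, something like this should work (just a sketch; I believe PyYAML sniffs the BOM when handed bytes, and Python's 'utf-16' codec writes a BOM for you):

    import yaml

    doc = 'key: value'
    # The 'utf-16' codec prepends a BOM automatically, so PyYAML can detect the encoding.
    print(yaml.safe_load(doc.encode('utf-16')))   # {'key': 'value'}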

PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.

I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python). If this were Ruby, I would use str.force_encoding(), which doesn't change the bytes at all. If I find a solution, I'll update this comment (or reply).
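
Edit: the closest thing to a workaround I've found is to push the surrogates back through UTF-16 with the 'surrogatepass' error handler (CPython 3; treat this as a sketch rather than a blessed approach):

    s = '\ud83d\udca9'   # the kind of string PyYAML hands back
    fixed = s.encode('utf-16', 'surrogatepass').decode('utf-16')
    print(fixed)         # 💩 (U+1F4A9)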

http://pyyaml.org/wiki/PyYAMLDocumentation
http://yaml.org/spec/1.2/spec.html#id2771184

> What you see here is "PILE OF POO" in UTF-16.

Well, yes, but you missed my point: we're not looking at an encoded string object. A `str` (Python's string type) is supposed to represent a Unicode string ("Strings are immutable sequences of Unicode code points"), and the underlying encoding is supposed to be transparent. In fact, it's possible to construct an (invalid, IMO, like the above) example of YAML that PyYAML will decode into a string whose in-memory storage is UTF-32 but which still contains those surrogates.¹

In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:

> A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.

And that's what's happening here.
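
You can see both problems from a REPL (illustrative only; the exact error message may differ between versions):

  s = '\ud83d\udca9'
  [hex(ord(c)) for c in s]   # ['0xd83d', '0xdca9'] -- surrogate code points, not one astral character
  s.encode('utf-8')          # raises UnicodeEncodeError: ... surrogates not allowed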

> YAML specifies that the recommended encoding is UTF-8, but parsers should also be able to handle UTF-16 and UTF-32. And if you want that, you need to tell YAML to expect UTF-16, which you can do by including a BOM (Byte Order Mark).

I'm passing PyYAML a string, but I can pass it encoded text as well; the output is the same. Note that the input consists entirely of ASCII characters, so input encoding really isn't at play here. I realize now it wasn't explicit in my original comment, but here's the input being given to PyYAML (JSON that we're claiming is also YAML, because the claim was that JSON is a subset of YAML):

  "\ud83d\udca9"
Note that this is the raw JSON/YAML, not a Python repr of it. Those backslashes are literal backslashes.

> PyYAML is documented as only supporting UTF-8 and UTF-16, but not UTF-32.

PyYAML is documented to contain a bug, then. (Though admittedly that's better than it being undocumented.) It is not developer friendly to correctly decode only some text and emit erroneous output for the rest.
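
To make the contrast concrete, compare the two parsers on the same document (a sketch; newer PyYAML versions want an explicit Loader argument):

  import json, yaml
  doc = '"\\ud83d\\udca9"'
  json.loads(doc)                         # '💩' -- the pair is combined into U+1F4A9
  yaml.load(doc, Loader=yaml.SafeLoader)  # '\ud83d\udca9' -- lone surrogates leak through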

> I'm scratching my head on how to get that "UTF-16 encoded" UTF-8 string to a proper representation (in Python).

It's hard because it shouldn't be possible. Unicode forbids it. The encoded UTF-32 version of that string (if Python permitted it) would be:

  0x0000d83d  0x0000dca9
unless you explicitly handled the surrogate code points separately by decoding them first, and then re-encoding them into the target encoding, but that's crazy.

(This is another example of why the above is not valid or expected output from PyYAML.)
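
(If you force the encoder's hand with 'surrogatepass', purely to illustrate, you get exactly those units:)

  s = '\ud83d\udca9'
  s.encode('utf-32-le', 'surrogatepass')
  # b'=\xd8\x00\x00\xa9\xdc\x00\x00' -- 0x0000d83d, 0x0000dca9 as little-endian words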

¹this:

  In [19]: yaml.load('"\\ud83d\\udca9, \\U0001f4a9"')
  Out[19]: '\ud83d\udca9, '
is, in CPython, a string that is stored in memory as UTF-32, but contains two surrogate code points.

(Note that I'm using a recent version of Python 3. Early Python 3 and Python 2 handle Unicode poorly, but this example should work equally strangely there too. While PyYAML's output here is clearly buggy, the spec is equally vague and underspecified about what should happen.)


> In Unicode's lexicon, this string contains surrogate code points, not surrogate code units. This is wrong, and as I stated, it is surprising that Python permits it. Unicode explicitly warns against this behavior:

> Note that this is the raw JSON/YAML, not a Python repr of it. Those backslashes are literal backslashes.

I just came to this realization over dinner. Apparently, this odd behavior is defined by JSON itself. So it's not Python's json module at fault; it is, actually, implemented correctly.

https://en.wikipedia.org/wiki/JSON#Data_portability_issues
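
It's the ensure_ascii default that triggers the surrogate-pair escapes; with it turned off, json.dumps emits the character directly (a quick check on Python 3):

  import json
  json.dumps("\N{PILE OF POO}")                      # '"\\ud83d\\udca9"'
  json.dumps("\N{PILE OF POO}", ensure_ascii=False)  # '"💩"'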

This whole time I thought the json module had a bug, but now I am wondering if it's another PyYAML bug.

Maybe not. I'll have to read this section a few more times. http://yaml.org/spec/1.2/spec.html#id2770814

Edit: Going back a bit here.

> Specifically, the spec around escaped unicode characters lacks any mention of surrogates being encoded in two \u sequences,

http://yaml.org/spec/1.2/spec.html#id2771184 makes the statement, "All characters mentioned in this specification are Unicode code points. Each such code point is written as one or more bytes depending on the character encoding used. Note that in UTF-16, characters above #xFFFF are written as four bytes, using a surrogate pair." And there are numerous mentions of JSON compatibility. So, I suppose, this is an issue in PyYAML (and in Ruby's YAML; I checked).

I had never seen this issue before. Ruby's JSON module doesn't break up UTF-8 characters into surrogate pairs, nor does any online JSON parser that I could find via Google.
