
I wonder what the pros and cons weighed in the discussion were.

Clearly not supporting Unicode text in non-UTF-8 locales (except through some kind of compatibility function like recode or iconv) is the Right Thing. One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception, if you feed them certain data, or worse, they silently corrupt it.

Markus Kuhn suggested "UTF-8B" as a solution to this problem some years ago. Quoting Eric Tiedemann's libutf8b blurb, "utf-8b is a mapping from byte streams to unicode codepoint streams that provides an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes that are not part of a utf-8 encoding) in the input stream. They are mapped to 256 different, guaranteed undefined, unicode codepoints." Eric's dead, but you can still get libutf8b from http://hyperreal.org/~est/libutf8b/.




I'm willing to bet a large amount that non-UTF-8 encodings were broken and nobody cared enough to bother fixing them.

OpenBSD does not hesitate to nuke legacy stuff that gets broken. Which I feel is ultimately for the best, because half-assed support that barely functions is often worse than no support at all.


It was in fact intentionally broken to find out where removing single-byte locales hurts our users most.

We have a hackathon coming up with devs committed to making UTF-8 work in more base utilities. If that works out, and the most sore points of latin1/koi-8/etc users have been adequately addressed, 5.9 will ship with only the UTF-8 locale (and of course the default "C" locale -- ASCII).

If this approach turns out to be wrong because we cannot get regressions fixed, 5.9 will ship like 5.7 and 5.8 (with UTF-8 and single byte locales).


My first thought was, what about the "C" locale? So it's good to see that question already answered.

I really wish there were some sort of standard "U" locale that would be the same as "C" but with UTF-8, and ISO rather than US date formats.


That locale pseudo-exists. It's called "don't call the evil setlocale function, write in C90 as much as possible, do your own UTF-8 encoding and decoding, and implement the exact default date format you want with your own strftime string or whatever."
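A minimal sketch of the "bring your own format string" approach, in Python rather than C90 just to keep it short (the ISO-style format string is my own choice, not anything standardized):

    import time

    # No setlocale() anywhere: numeric strftime fields are locale-independent,
    # so an explicit format string gives ISO 8601 dates regardless of locale.
    print(time.strftime('%Y-%m-%dT%H:%M:%S'))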


That doesn't exactly help me as a user, and possibly makes things worse as some things respect locale and some don't.


There has been some talk both in glibc and musl of shipping such a "C-but-UTF-8" locale.


Oh, I didn't realize you weren't removing "C"! Thank you for explaining!


If I had to guess, using my mental model of OpenBSD:

(a) most non-UTF-8-or-UTF-16 locales will choke (crash or corrupt data) in the rare case that they try to encode text outside their encoding range (the mirror image of the problem UTF-8B fixes in UTF-8);

(b) codecs have to be fast and handle untrusted strings of somewhat unpredictable lengths, making them a likely source of security holes;

(c) possible subtle bugs in a codec enable "cloaking attacks" where different parts of a system parse the same string differently; these have existed in the past with UTF-8, but would have to be rooted out of every codec;

(d) encoding text with one codec and decoding it with another also corrupts it.

So there are lots of good reasons to require the system to default to UTF-8 and use other codecs only in special cases involving backwards compatibility.

I hope you can still get reasonable performance and sensible ordering by setting LC_COLLATE=C.


Having sat in on a BUG meeting where this was discussed by one of the devs responsible, I believe it was basically "UTF-8 won, it's time to not pretend otherwise, we're going to move forward with this."


For the benefit of others (the link is nonobvious), here's Markus Kuhn's presentation of UTF-8B:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043...

The tl;dr is to map an invalid UTF-8 byte n to code point U+DC00 + n, which puts it in the code point range reserved for the second part of a surrogate pair. (In UTF-16, a 16-bit value between D800 and DBFF followed by a 16-bit value between DC00 and DFFF is used to encode a code point that cannot fit in 16 bits. Since these "surrogate pairs" happen only in that order, there is room to extend UTF-16 by assigning a meaning to a DC00-DFFF value seen without a D800-DBFF before it.) Since the surrogate code points are defined as not "Unicode scalar values" and cannot exist in well-formed "Unicode text", and therefore cannot be decoded from well-formed UTF-8, there's no risk of confusion.
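To make the mapping concrete, here is a rough sketch in Python; utf8b_decode and utf8b_encode are made-up names for illustration, not libutf8b's actual API:

    def utf8b_decode(data):
        # Decode bytes as UTF-8, mapping each byte that is not part of a valid
        # UTF-8 sequence to the otherwise-unused code point U+DC00 + byte.
        out, i = [], 0
        while i < len(data):
            # Take the longest chunk starting at i that is valid UTF-8
            # (a UTF-8 character is at most 4 bytes long) ...
            for j in range(min(i + 4, len(data)), i, -1):
                try:
                    out.append(data[i:j].decode('utf-8'))
                    i = j
                    break
                except UnicodeDecodeError:
                    pass
            else:
                # ... otherwise smuggle one raw byte through as U+DC00 + n.
                out.append(chr(0xDC00 + data[i]))
                i += 1
        return ''.join(out)

    def utf8b_encode(text):
        # Invert the mapping: smuggled code points become raw bytes again.
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xDC00 <= cp <= 0xDCFF:
                out.append(cp - 0xDC00)
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    raw = b'ok \xe2\x82\xac, garbage \xff\x80'
    assert utf8b_encode(utf8b_decode(raw)) == raw   # arbitrary bytes round-trip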

There are some similarities with the extension of UTF-8 encoding that is sometimes called "WTF-8" https://simonsapin.github.io/wtf-8/. WTF-8 lets unchecked purportedly-UTF-16 data be parsed as a sequence of code points, encoded into an extension of UTF-8, and round-tripped back into the original array of uint16s. UTF-8B lets unchecked purportedly-UTF-8 data be parsed as a sequence of code points, encoded into an extension of UTF-16, and round-tripped back into the original array of uint8s. They're not quite compatible, because WTF-8 would encode U+DC80 as a three-byte sequence (ED B2 80), and UTF-8B would decode that into three code points (U+DCED U+DCB2 U+DC80) since U+DC80 isn't a Unicode scalar value. But if a system wanted to support both of these robust encodings simultaneously, I think you could handle this fairly clear special case.


Agreed, except that UTF-8B also lets you round trip UTF-8B → UCS-4 → UTF-8B safely, not just via UTF-16.


Kuhn's idea is also used in Python 3, so that garbage bytes can (optionally!) be decoded to Unicode strings and later losslessly turned back into the same bytes, which ensures (e.g.) that filenames that can't be decoded can still be used: https://www.python.org/dev/peps/pep-0383/
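For example (the byte string here is just an illustration):

    # PEP 383: the 'surrogateescape' error handler smuggles each undecodable
    # byte through as U+DC00 + byte, so the original bytes can be recovered.
    name = b'caf\xe9.txt'                          # Latin-1 "café.txt", not valid UTF-8
    text = name.decode('utf-8', 'surrogateescape')
    print(ascii(text))                             # 'caf\udce9.txt'
    assert text.encode('utf-8', 'surrogateescape') == name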


Interesting; I implemented exactly the same thing in the TXR language. I can read an arbitrary file in /bin/ as UTF-8 to a string, and when that string is converted to UTF-8, it reproduces that file exactly. All invalid bytes go to DCXX, including the null character. The code U+DC00 is called "pnul" (pseudo-null) and can even be written like #\pnul in the language as a character constant. Thanks to pnul, you can easily manipulate data that contains nulls, like /proc/<pid>/environ, or null-delimited strings from GNU "xargs -0". The underlying C strings are nicely null-terminated with the real U+0000 NUL, and everything is cool.


You and the Python guys should get together and make your hack compatible, and then pressure everyone else to standardize on it, instead of the horrible nightmare where every string input and output operation potentially corrupts data or crashes.


Same in Haskell (GHC Haskell, at least).


It was sort of darkly funny to be reading along as you're quoting the guy, then all of a sudden hit, so matter-of-factly, "He's dead, but you can still get the thing from...." A real splash of cold water.


What would be great would be if someone would take up UTF-8B again. I mentioned his death because otherwise you might think he lost interest in the project, but no, he lost interest in living.


> he lost interest in living

I get what you're trying to accomplish with the parallel construction, but that's a pretty callous way to describe it :/


I'm sorry I upset you. I didn't mean to.


>I'm sorry (if) I upset you. I didn't mean to.

The poster didn't express that he was upset. He expressed an opinion that your description was callous.


I wasn't upset, and I certainly wasn't trying to imply I was. But that doesn't change that it makes me a little sad to see people refer to suicide with such a lack of empathy.


I am surprised that you read my remark as lacking empathy, but I have deleted the remarks I had written here about my actual feelings about the event, because this doesn't seem like a very promising conversation.


He died from a massive drug overdose right?


Yes, although we didn't find out for sure until several months later. It was too massive to be an accident.


In a roundabout way, this is because I wasn't able to push through an isprint() workaround diff to ls. http://marc.info/?l=openbsd-misc&m=142540203528315&w=2


Reminds me of Go strings: they usually store UTF-8 but they're actually 8-bit clean:

"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."

https://blog.golang.org/strings


It's only 8-bit clean if you don't poke it very hard. Try either of the last two loop examples on that page after adding "\xff\x80" to the string; you get two (indistinguishable) U+FFFD REPLACEMENT CHARACTERs in the iteration. So the loop destroys data, which UTF-8B specifically does not.
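Here is the same contrast spelled out in Python rather than Go, since Python's surrogateescape handler is the UTF-8B-style behaviour discussed upthread:

    bad = b'ab\xff\x80cd'
    lossy = bad.decode('utf-8', 'replace')            # 'ab\ufffd\ufffdcd': two identical U+FFFDs
    safe = bad.decode('utf-8', 'surrogateescape')     # 'ab\udcff\udc80cd'
    assert lossy.encode('utf-8') != bad               # the original bytes are gone
    assert safe.encode('utf-8', 'surrogateescape') == bad   # these round-trip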

Also, it's a little disappointing that Go doesn't have a type-level way to say that a string is in fact UTF-8, not Latin-1 or something, and preferably that all values that inhabit that type are guaranteed to be valid and well-formed UTF-8. This is the cause of plenty of subtle bugs in Python 2, C, etc., which are all technically the result of programmer error, but in this decade, type systems should be helping us avoid common, subtle programmer errors.


If you're juggling a bunch of different types of strings and want to keep them straight, Go does support defining separate types for them. An example is the HTML type from the template library [1].

The question is which sanitized string types are worth defining in the standard library. Presumably UTF-8 sanitized strings didn't make the cut.

Not sure about UTF-8B. Suppose the input is already UTF-8B? Do you double-escape it somehow?

It looks like DecodeRuneInString returns RuneError if it can't decode something and RuneError is defined as U+FFFD. The example uses a hard-coded string where it can't happen, so technically it's not a bug that it doesn't check for the error. But a linter might want to flag it.

[1] http://golang.org/pkg/html/template/#HTML


> One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception

Crashing on invalid data sounds like a great idea. Letting garbage through doesn't.


Crashing in the "/* oops! */ exit(1)" sense is great. Crashing in the buffer overflow sense is not. OpenBSD treats all of the latter as potential security vulnerabilities.


No objection there; "crashing" in my comment was meant as "stop processing and report an input data error to the user", the Erlang sense of crashing, if you will.


Aborting might be the better term.


Is it really garbage? If we want to be true to UNIX's (questionable) "Write programs to handle text streams, because that is a universal interface" ethos, our definition of "text" has to admit all possible byte strings to be "universal". And the so-called C locale historically did.


> Is it really garbage?

Invalid UTF-8 when valid UTF-8 was expected? Yes.

> "Write programs to handle text streams, because that is a universal interface" ethos, our definition of "text" has to admit all possible byte strings to be "universal"

Random bytes are not text. The Unix ethos is "communicate via arbitrary binary streams", but programs which only understand text understand text, not random-bytes-which-are-not-text. It seems sensible for programs to have more restrictions on their input than the general-purpose communication protocol does: would you expect jq to try to process input which is not JSON in any way, shape or form, despite it being billed as a JSON processor? Because that's not what it's going to do, at least not by default.


Although it was pretty common up to the mid-1990s to run into 8-bit-cleanness problems and arbitrary buffer sizes, IMHO, the Unix ethos is not for `wc`, `tr`, `dd`, `sort`, `uniq`, `read`, `diff`, `patch`, or `split` to crash with certain input data, to silently corrupt that data, or to spew warning messages about its contents. They are building blocks for your programs; it is not their business to impose unnecessary expectations on your data. They can and should correctly handle arbitrary data. When they don't do that, they limit the programs you can write with them, with no compensating increase in any other virtue.


You're the reason I can't use an "'" in my password on Citibank's web site, aren't you? I finally caught you, you bastard.



