
I wonder what the pros and cons weighed in the discussion were.

Clearly not supporting Unicode text in non-UTF-8 locales (except through some kind of compatibility function like recode or iconv) is the Right Thing. One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception, if you feed them certain data, or worse, they silently corrupt it.

Markus Kuhn suggested "UTF-8B" as a solution to this problem some years ago. Quoting Eric Tiedemann's libutf8b blurb, "utf-8b is a mapping from byte streams to unicode codepoint streams that provides an exceptionally clean handling of garbage (i.e., non-utf-8) bytes (i.e., bytes that are not part of a utf-8 encoding) in the input stream. They are mapped to 256 different, guaranteed undefined, unicode codepoints." Eric's dead, but you can still get libutf8b from http://hyperreal.org/~est/libutf8b/.




I'm willing to bet a large amount that non-UTF-8 encodings were broken and nobody cared enough to bother fixing them.

OpenBSD does not hesitate to nuke legacy stuff that gets broken. Which I feel is ultimately for the best, because half-assed support that barely functions is often worse than no support at all.


It was in fact intentionally broken to find out where removing single-byte locales hurts our users most.

We have a hackathon coming up with devs committed to making UTF-8 work in more base utilities. If that works out, and the most sore points of latin1/koi-8/etc users have been adequately addressed, 5.9 will ship with only the UTF-8 locale (and of course the default "C" locale -- ASCII).

If this approach turns out to be wrong because we cannot get regressions fixed, 5.9 will ship like 5.7 and 5.8 (with UTF-8 and single byte locales).


My first thought was, what about the "C" locale? So it's good to see that question already answered.

I really wish there were some sort of standard "U" locale that would be the same as "C" but with UTF-8, and ISO rather than US date formats.


That locale pseudo-exists. It's called "don't call the evil setlocale function, write in C90 as much as possible, do your own UTF-8 encoding and decoding, and implement the exact default date format you want with your own strftime string or whatever."
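A minimal sketch of the "bring your own format string" approach, in Python rather than C90 just to keep it short (the ISO-style format string is my own choice, not anything standardized):

    import time

    # No setlocale() anywhere: numeric strftime fields are locale-independent,
    # so an explicit format string gives ISO 8601 dates regardless of locale.
    print(time.strftime('%Y-%m-%dT%H:%M:%S'))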


That doesn't exactly help me as a user, and possibly makes things worse as some things respect locale and some don't.


There has been some talk both in glibc and musl of shipping such a "C-but-UTF-8" locale.


Oh, I didn't realize you weren't removing "C"! Thank you for explaining!


If I had to guess, using my mental model of OpenBSD:

(a) most non-UTF-8-or-UTF-16 locales will choke (crash or corrupt data) in the rare case that they try to encode text outside their encoding range (the mirror image of the problem UTF-8B fixes in UTF-8);

(b) codecs have to be fast and handle untrusted strings of somewhat unpredictable lengths, making them a likely source of security holes;

(c) possible subtle bugs in a codec enable "cloaking attacks" where different parts of a system parse the same string differently; these have existed in the past with UTF-8, but would have to be rooted out of every codec;

(d) encoding text with one codec and decoding it with another also corrupts it.

So there are lots of good reasons to require the system to default to UTF-8 and use other codecs only in special cases involving backwards compatibility.

I hope you can still get reasonable performance and sensible ordering by setting LC_COLLATE=C.


Having sat in on a BUG meeting where this was discussed by one of the devs responsible, I believe it was basically "UTF-8 won, it's time to not pretend otherwise, we're going to move forward with this."


For the benefit of others (the link is nonobvious), here's Markus Kuhn's presentation of UTF-8B:

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043...

The tl;dr is to map an invalid UTF-8 byte n to code point U+DC00 + n, which puts it in the code point range reserved for the second part of a surrogate pair. (In UTF-16, a 16-bit value between D800 and DBFF followed by a 16-bit value between DC00 and DFFF is used to encode a code point that cannot fit in 16 bits. Since these "surrogate pairs" happen only in that order, there is room to extend UTF-16 by assigning a meaning to a DC00-DFFF value seen without a D800-DBFF before it.) Since the surrogate code points are defined as not "Unicode scalar values" and cannot exist in well-formed "Unicode text", and therefore cannot be decoded from well-formed UTF-8, there's no risk of confusion.
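To make the mapping concrete, here is a rough sketch in Python; utf8b_decode and utf8b_encode are made-up names for illustration, not libutf8b's actual API:

    def utf8b_decode(data):
        # Decode bytes as UTF-8, mapping each byte that is not part of a valid
        # UTF-8 sequence to the otherwise-unused code point U+DC00 + byte.
        out, i = [], 0
        while i < len(data):
            # Take the longest chunk starting at i that is valid UTF-8
            # (a UTF-8 character is at most 4 bytes long) ...
            for j in range(min(i + 4, len(data)), i, -1):
                try:
                    out.append(data[i:j].decode('utf-8'))
                    i = j
                    break
                except UnicodeDecodeError:
                    pass
            else:
                # ... otherwise smuggle one raw byte through as U+DC00 + n.
                out.append(chr(0xDC00 + data[i]))
                i += 1
        return ''.join(out)

    def utf8b_encode(text):
        # Invert the mapping: smuggled code points become raw bytes again.
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xDC00 <= cp <= 0xDCFF:
                out.append(cp - 0xDC00)
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    raw = b'ok \xe2\x82\xac, garbage \xff\x80'
    assert utf8b_encode(utf8b_decode(raw)) == raw   # arbitrary bytes round-trip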

There are some similarities with the extension of UTF-8 encoding that is sometimes called "WTF-8" https://simonsapin.github.io/wtf-8/. WTF-8 lets unchecked purportedly-UTF-16 data be parsed as a sequence of code points, encoded into an extension of UTF-8, and round-tripped back into the original array of uint16s. UTF-8B lets unchecked purportedly-UTF-8 data be parsed as a sequence of code points, encoded into an extension of UTF-16, and round-tripped back into the original array of uint8s. They're not quite compatible, because WTF-8 would encode U+DC80 as a three-byte sequence (ED B2 80), and UTF-8B would decode that into three code points (U+DCED U+DCB2 U+DC80) since U+DC80 isn't a Unicode scalar value. But if a system wanted to support both of these robust encodings simultaneously, I think you could handle this fairly clear special case.


Agreed, except that UTF-8B also lets you round trip UTF-8B → UCS-4 → UTF-8B safely, not just via UTF-16.


Kuhn's idea is also used in Python 3, so that garbage bytes can (optionally!) be decoded to Unicode strings and later losslessly turned back into the same bytes, which ensures (e.g.) that filenames that can't be decoded can still be used: https://www.python.org/dev/peps/pep-0383/
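For example (the byte string here is just an illustration):

    # PEP 383: the 'surrogateescape' error handler smuggles each undecodable
    # byte through as U+DC00 + byte, so the original bytes can be recovered.
    name = b'caf\xe9.txt'                          # Latin-1 "café.txt", not valid UTF-8
    text = name.decode('utf-8', 'surrogateescape')
    print(ascii(text))                             # 'caf\udce9.txt'
    assert text.encode('utf-8', 'surrogateescape') == name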


Interesting; I implemented exactly the same thing in the TXR language. I can read an arbitrary file in /bin/ as UTF-8 to a string, and when that string is converted to UTF-8, it reproduces that file exactly. All invalid bytes go to DCXX, including the null character. The code U+DC00 is called "pnul" (pseudo-null) and can even be written like #\pnul in the language as a character constant. Thanks to pnul, you can easily manipulate data that contains nulls, like /proc/<pid>/environ, or null-delimited strings from GNU "xargs -0". The underlying C strings are nicely null-terminated with the real U+0000 NUL, and everything is cool.


You and the Python guys should get together and make your hack compatible, and then pressure everyone else to standardize on it, instead of the horrible nightmare where every string input and output operation potentially corrupts data or crashes.


Same in Haskell (GHC Haskell, at least).


It was sort of darkly funny to be reading along as you're quoting the guy, then all of a sudden hit, so matter-of-factly, "He's dead, but you can still get the thing from...." A real splash of cold water.


What would be great would be if someone would take up UTF-8B again. I mentioned his death because otherwise you might think he lost interest in the project, but no, he lost interest in living.


> he lost interest in living

I get what you're trying to accomplish with the parallel construction, but that's a pretty callous way to describe it :/


I'm sorry I upset you. I didn't mean to.


>I'm sorry (if) I upset you. I didn't mean to.

The poster didn't express that he was upset. He expressed an opinion that your description was callous.


I wasn't upset, and I certainly wasn't trying to imply I was. But that doesn't change that it makes me a little sad to see people refer to suicide with such a lack of empathy.


I am surprised that you read my remark as lacking empathy, but I have deleted the remarks I had written here about my actual feelings about the event, because this doesn't seem like a very promising conversation.


He died from a massive drug overdose right?


Yes, although we didn't find out for sure until several months later. It was too massive to be an accident.


In a roundabout way, this is because I wasn't able to push through an isprint() workaround diff to ls. http://marc.info/?l=openbsd-misc&m=142540203528315&w=2


Reminds me of Go strings: they usually store UTF-8 but they're actually 8-bit clean:

"It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes."

https://blog.golang.org/strings


It's only 8-bit clean if you don't poke it very hard. Try either of the last two loop examples on that page after adding "\xff\x80" to the string; you get two (indistinguishable) U+FFFD REPLACEMENT CHARACTERs in the iteration. So the loop destroys data, which UTF-8B specifically does not.
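Here is the same contrast spelled out in Python rather than Go, since Python's surrogateescape handler is the UTF-8B-style behaviour discussed upthread:

    bad = b'ab\xff\x80cd'
    lossy = bad.decode('utf-8', 'replace')            # 'ab\ufffd\ufffdcd': two identical U+FFFDs
    safe = bad.decode('utf-8', 'surrogateescape')     # 'ab\udcff\udc80cd'
    assert lossy.encode('utf-8') != bad               # the original bytes are gone
    assert safe.encode('utf-8', 'surrogateescape') == bad   # these round-trip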

Also, it's a little disappointing that Go doesn't have a type-level way to say that a string is in fact UTF-8, not Latin-1 or something, and preferably that all values that inhabit that type are guaranteed to be valid and well-formed UTF-8. This is the cause of plenty of subtle bugs in Python 2, C, etc., which are all technically the result of programmer error, but in this decade, type systems should be helping us avoid common, subtle programmer errors.


If you're juggling a bunch of different types of strings and want to keep them straight, Go does support defining separate types for them. An example is the HTML type from the template library [1].

The question is which sanitized string types are worth defining in the standard library. Presumably UTF-8 sanitized strings didn't make the cut.

Not sure about UTF-8B. Suppose the input is already UTF-8B? Do you double-escape it somehow?

It looks like DecodeRuneInString returns RuneError if it can't decode something and RuneError is defined as U+FFFD. The example uses a hard-coded string where it can't happen, so technically it's not a bug that it doesn't check for the error. But a linter might want to flag it.

[1] http://golang.org/pkg/html/template/#HTML


> One problem that I have is that current UTF-8 implementations typically are not "8 bit clean", in the sense that GNU and modern Unix tools typically attempt to be; they crash, usually by throwing an exception

Crashing on invalid data sounds like a great idea. Letting garbage through doesn't.


Crashing in the "/* oops! */ exit(1)" sense is great. Crashing in the buffer overflow sense is not. OpenBSD treats all of the latter as potential security vulnerabilities.


No objection there; "crashing" in my comment was meant as "stop processing and report an input data error to the user", the Erlang sense of crashing, if you will.


Aborting might be the better term.


Is it really garbage? If we want to be true to UNIX's (questionable) "Write programs to handle text streams, because that is a universal interface" ethos, our definition of "text" has to admit all possible byte strings to be "universal". And the so-called C locale historically did.


> Is it really garbage?

Invalid UTF-8 when valid UTF-8 was expected? Yes.

> "Write programs to handle text streams, because that is a universal interface" ethos, our definition of "text" has to admit all possible byte strings to be "universal"

Random bytes are not text. The Unix ethos is "communicate via arbitrary binary streams", but programs which only understand text understand text, not random-bytes-which-are-not-text. It seems sensible for programs to have more restrictions on their input than the general-purpose communication protocol does: would you expect jq to try to process input which is not JSON in any way, shape or form, despite it being billed as a JSON processor? Because that's not what it's going to do, at least not by default.


Although it was pretty common up to the mid-1990s to run into 8-bit-cleanness problems and arbitrary buffer sizes, IMHO, the Unix ethos is not for `wc`, `tr`, `dd`, `sort`, `uniq`, `read`, `diff`, `patch`, or `split` to crash with certain input data, to silently corrupt that data, or to spew warning messages about its contents. They are building blocks for your programs; it is not their business to impose unnecessary expectations on your data. They can and should correctly handle arbitrary data. When they don't do that, they limit the programs you can write with them, with no compensating increase in any other virtue.


You're the reason I can't use an "'" in my password on Citibank's web site, aren't you? I finally caught you, you bastard.



