Hacker News
A byte string library for Rust (burntsushi.net)
8 points by burntsushi on Sept 7, 2022 | hide | past | favorite | 6 comments



This discussion helps quantify the difference between Zig strings and Rust strings. In light of this, and of the conclusion that a tool like ripgrep couldn't exist without this string type, I'd say Zig's choice seems well balanced.


Is Zig's string API documented anywhere? Or is there a design document outlining their intended API?


ripgrep is in a rather special situation of processing files that can contain literally anything without declaring any structure whatsoever, from binary data to text in any encoding, including mojibake.

It's good that this library exists as an option, but it isn't necessarily the right default for all strings.

The "it's UTF-8 or maybe not" string encoding approach of Go and Zig adds overhead from having to handle broken UTF-8 sequences (Rust can safely assume they don't exist), weakens safeguards against encoding errors, and makes lossless roundtrip of string -> codepoints -> string impossible.
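The roundtrip point is easy to demonstrate with a stdlib-only Rust sketch (the byte values are illustrative, not from the comment):

```rust
fn main() {
    // One invalid byte: 0xFF can never appear in well-formed UTF-8.
    let original: &[u8] = b"abc\xFFdef";

    // Decoding to codepoints must do something with 0xFF; lossy
    // decoding substitutes U+FFFD REPLACEMENT CHARACTER.
    let decoded = String::from_utf8_lossy(original);
    assert_eq!(decoded, "abc\u{FFFD}def");

    // Re-encoding produces U+FFFD's bytes, not the original 0xFF,
    // so bytes -> codepoints -> bytes is not lossless here.
    assert_ne!(decoded.as_bytes(), original);
}
```

Rust's guaranteed-valid `String` avoids this entirely by refusing to hold such bytes in the first place.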


I don't think ripgrep is that special. I tried to give other examples in my blog. csv is another crate I maintain that works on arbitrary bytes, so long as they're ASCII-compatible. There are also significant consumers of bstr that are not me, such as Artichoke and gitoxide.

On balance, I personally prefer the Go approach to strings. (I don't say Zig only because I don't know precisely what their strategy is in detail. I don't follow the project closely enough.) I am of course absurdly biased, because pretty much all of my work really wants byte strings.

But even in the simple case of running globs or regexes on file paths... That case is simply quite annoying in Rust today. bstr makes it a little easier if you're willing to sacrifice correctness in a rare corner case on Windows, but bringing in a dependency on bstr just for those (very simple) convenience routines is a hard pill to swallow. (I think this gets at the root of why some folks---probably me included---are keen on exposing OsStr's representation on all platforms.)
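For instance, getting at a path's raw bytes in std today is platform-specific. A sketch (the file name is made up):

```rust
use std::path::Path;

fn main() {
    let path = Path::new("logs/example.txt");

    // On Unix, OsStr really is just bytes, and std exposes them:
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        let bytes: &[u8] = path.as_os_str().as_bytes();
        // A regex or glob can now run over `bytes` directly.
        assert!(bytes.ends_with(b".txt"));
    }

    // On Windows, OsStr is WTF-8 internally but that representation
    // isn't exposed, so byte-oriented matching falls back to a lossy
    // conversion -- the rare corner case where correctness is
    // sacrificed.
    let lossy = path.to_string_lossy();
    assert!(lossy.ends_with(".txt"));
}
```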

To be clear, I don't think where Rust landed is bad. There are very nice advantages to its approach as you outline. Rust strings are quite good in the vast majority of use cases, but I also actually think bstr's strings are too.

I don't think there is a clear and obvious right answer in this design space to be honest. And in particular, if you do take Rust's approach, then things like bstr can still exist and exist conveniently. What I mean by that is that even if you expose an API that only accepts byte strings, you can pass Rust strings to that without any cost whatsoever. That's a really key interoperability property that does maybe argue in favor of taking Rust's approach in general.
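That interop property can be shown in a few lines (`count_lines` is a hypothetical API, not something from bstr):

```rust
// A hypothetical API that accepts only byte strings.
fn count_lines(data: &[u8]) -> usize {
    data.iter().filter(|&&b| b == b'\n').count()
}

fn main() {
    // A guaranteed-UTF-8 Rust string passes in for free via as_bytes():
    let s = String::from("one\ntwo\nthree\n");
    assert_eq!(count_lines(s.as_bytes()), 3);

    // ...and so does raw, possibly non-UTF-8 data. The reverse
    // direction (feeding a &[u8] into a &str-only API) would require
    // validation first.
    let raw: &[u8] = b"a\xFFb\n";
    assert_eq!(count_lines(raw), 1);
}
```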

But if you're going to write "standard" Unix tooling---and I don't think that's too special of a use case---then byte strings are probably superior given the environment that we live in. That is, that the fundamental interface with which you interact with streams is just bytes without any solid guarantees. In that world, requiring valid UTF-8 ends up being pretty annoying.

I did mention overhead from dealing with broken UTF-8 sequences and encoding errors at least in my blog. The lossless roundtrip aspect is something I didn't mention because I'm not sure how common the need for that property is. But I grant it is another advantage to Rust's approach.

So in summary... I don't really disagree with you, but I don't really agree with you either. :-)


I think my perspective on the "standard unix" approach of "it's bytes with ASCII or whatever" is different, because my native language uses the "whatever" part.

In pre-Unicode times I couldn't reliably grep for "żółw", or consume a CSV file with it, or share a file with anyone on a different platform, because it may have been in ISO-8859-2 encoding, or Windows-1252 which was subtly incompatible, or in MacCE, or in one of three AmigaOS encodings. I've had printers that had "it's just bytes" encoding, so I had to convert files to a character encoding used by a particular font file, which varied between font sources and printer brands. I've had to deal with files from a popular MS-DOS editor which invented its own character encoding. I've had to convert files to encoding of "Reverend John Pikul", because he was a prolific author of printer drivers, and created his own encoding.

The "strings are just bytes" systems were a shitshow for me. It's a broken approach that is impossible to fix. Nobody cares what those "garbage" bytes actually mean. So when strings are UTF-8-or-garbage, these non-UTF-8 bytes are someone's "żółw" becoming "???w", or a `String + bstr` is "¿ó³w".
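Both failure modes named above can be reproduced in stdlib-only Rust (the byte values are the ISO-8859-2 encoding of "żółw"):

```rust
fn main() {
    // "żółw" in ISO-8859-2: ż=0xBF, ó=0xF3, ł=0xB3, w=0x77.
    let bytes: &[u8] = &[0xBF, 0xF3, 0xB3, 0x77];

    // Misread as Latin-1 (which agrees with Windows-1252 for these
    // bytes), each byte maps straight to a codepoint: mojibake.
    let as_latin1: String = bytes.iter().map(|&b| b as char).collect();
    assert_eq!(as_latin1, "¿ó³w");

    // Treated as "UTF-8 or garbage", lossy decoding discards the
    // information entirely; only the ASCII 'w' survives.
    let lossy = String::from_utf8_lossy(bytes);
    assert!(lossy.contains('\u{FFFD}'));
    assert!(lossy.ends_with('w'));
}
```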


Yeah I think that's a different time and a different place. At least the GNU Unix tooling can all be configured in the UTF-8 locale, which I think is pretty standard these days.

So I guess from my view, Unix tooling isn't quite "it's all just bytes" these days, but rather, "everything is UTF-8, except for some things and we still want to handle those things in some sensible way."

I don't think the byte-string-mostly-UTF-8 approach leads to the sadness you describe. What you describe was, I think, the result of a very different time, before the ubiquity of UTF-8. I might have a very different opinion if we were transported back in time, absolutely.

Go has the same byte-string-mostly-UTF-8 approach as bstr (and indeed, bstr was inspired by it), and I don't think they're having the kinds of problems you describe. Go strings might not be valid UTF-8, but the language itself pushes you toward UTF-8 and they otherwise have very good Unicode support throughout their standard library and supporting libraries.



