
Is JSON just bytes, though? Doesn’t it allow for variable-width encodings?



No matter what encoding your JSON file is, gzip will output a compressed bag of bytes that, when unzipped, will result in the same file coming out the other end. This is true of movie codecs, Word 97 files, or anything, and none of the maintainers of those formats had to be consulted about this in order to make it work. That's what is meant by "thin waist" here.
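A quick sketch of that property in Python (the payload string here is made up; any bytes work the same way):

    import gzip

    # gzip never looks inside the payload; it round-trips arbitrary bytes.
    payload = '{"emoji": "🎉"}'.encode('utf-8')
    assert gzip.decompress(gzip.compress(payload)) == payload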


I know, but it’s not “just bytes” as the parent comment claims. You cannot infer the length of the content without decoding it. By definition it is variable-width character data. I think it’s fair to be pedantic in the face of a fairly dramatic oversimplification.
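For example (a minimal Python illustration; the sample string is mine):

    raw = '{"name": "naïve"}'.encode('utf-8')
    text = raw.decode('utf-8')
    # 18 bytes vs 17 characters: 'ï' occupies two bytes in UTF-8,
    # so you cannot get the character count without decoding.
    print(len(raw), len(text))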


Less specific interfaces let you do less interesting things, but are more resilient. It's an engineering tradeoff. Purpose-built interfaces that fully expose and understand domain-level semantics are great in certain circumstances, but other times you want a certain minimum abstraction (IP packets and 'bags-of-bytes' POSIX file semantics are good examples) that can be used to build better ones.

If the rollout of HTTP had required that all the IP routers on the internet be updated to account for it, we likely would not have it. Likewise, if we required that all the classic Unix text utilities like wc, sort, paste, etc. did meaningful things to JSON before we could standardize JSON, adoption would likely have suffered.


The basic Unix tools do account for variable width, though. Variable-width characters are baked into most OSes; when you use these commands the decode is implicit.
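Roughly what that implicit decode looks like, sketched in Python (assuming a UTF-8 locale):

    import io

    data = 'héllo\n'.encode('utf-8')  # 7 bytes, what wc -c would count
    text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8').read()
    print(len(data), len(text))       # 7 bytes vs 6 characters (wc -m)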


You can transport it to an architecture of a different endianness without loss of information or metadata, and without a transformation at the destination.

There are important ways in which it is, in fact, "just bytes".


Endianness etc. is a feature of the encoding. Most JSON implementations I’ve used require the raw bytes to be decoded first.


No, endianness is not a feature of UTF-8. There isn't a UTF-8LE and a UTF-8BE, because the code unit is a single byte.

Forget "decoding", you have to parse JSON. But you don't have to figure out how it's encoded first. Because it's a byte format. You already know.


There isn’t a UTF-8LE/BE because it is implicitly big-endian for multi-byte characters. Any byte in a multi-byte sequence cannot meaningfully be interpreted (beyond its character class, code page, etc.) without its companions, so it’s not just bytes. There is an element of presentation that must happen before “mere bytes” are eligible to be JSON.
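The “companions” point is easy to demonstrate in Python (0xA9 is the continuation byte from the UTF-8 encoding of 'é'):

    >>> b'\xc3\xa9'.decode('utf-8')
    'é'
    >>> b'\xa9'.decode('utf-8')     # the trailing byte alone is meaningless
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte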


By spec, all JSON must be UTF-8. Anyone adding a charset parameter to application/json is, at best, being redundant.


UTF-8 is variable width, my friend.


The fact that JSON is UTF-8 doesn't contradict the fact that it's bytes!

That's a feature, not a bug.

i.e. "exterior designs are layered" - https://www.oilshell.org/blog/2023/06/ysh-design.html#exteri...

This is not a trivial point -- there are plenty of specs which are sloppy about text, try to abstract over code points, and result in non-working software.

A main example is the shitshow with libc and Unicode - https://thephd.dev/cuneicode-and-the-future-of-text-in-c#sta...

It suffers from what I call "interior fictions", like wchar_t.


Of course it’s all bytes. It’s all bytes. That doesn’t change the fact that you need some awareness of the encoding before those bytes are fully sensible.


Encoding decides how the bytes are interpreted.
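For instance (a Python sketch; the byte pair is the UTF-8 encoding of 'é'):

    >>> b'\xc3\xa9'.decode('utf-8')
    'é'
    >>> b'\xc3\xa9'.decode('latin-1')  # same bytes, different interpretation
    'Ã©'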


Yes. The encoded content is “just bytes”; once decoded it’s logically something else (variable-width character data structured as JSON) that transcends bytes at the data level.
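A minimal Python sketch of that layering (the sample document is mine):

    import json

    raw = '{"name": "café"}'.encode('utf-8')  # on the wire: just bytes
    obj = json.loads(raw.decode('utf-8'))     # decode, then parse
    print(obj['name'])                        # 'café': structured data, not bytes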



