
Is JSON just bytes, though? Doesn’t it allow for variable-width encodings?



No matter what encoding your JSON file is, gzip will output a compressed bag of bytes that, when unzipped, will result in the same file coming out the other end. This is true of movie codecs, Word 97 files, or anything, and none of the maintainers of those formats had to be consulted about this in order to make it work. That's what is meant by "thin waist" here.
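A quick sketch of that property in Python (the payload string here is made up; any bytes work the same way):

    import gzip

    # gzip never looks inside the payload; it round-trips arbitrary bytes.
    payload = '{"emoji": "🎉"}'.encode('utf-8')
    assert gzip.decompress(gzip.compress(payload)) == payload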


I know, but it’s not “just bytes” as the parent comment claims. You cannot infer the length of the content without decoding it. By definition it is variable-width character data. I think it’s fair to be pedantic in the face of a fairly dramatic oversimplification.
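For example (a minimal Python illustration; the sample string is mine):

    raw = '{"name": "naïve"}'.encode('utf-8')
    text = raw.decode('utf-8')
    # 18 bytes vs 17 characters: 'ï' occupies two bytes in UTF-8,
    # so you cannot get the character count without decoding.
    print(len(raw), len(text))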


Less specific interfaces let you do less interesting things, but are more resilient. It's an engineering tradeoff. Purpose-built interfaces that fully expose and understand domain-level semantics are great in certain circumstances, but other times you want a certain minimum abstraction (IP packets and 'bags-of-bytes' POSIX file semantics are good examples) that can be used to build better ones.

If the rollout of HTTP had required that all the IP routers on the internet be updated to account for it, we likely would not have it. Likewise, if we required that all the classic Unix text utilities like wc, sort, paste, etc. did meaningful things to JSON before we could standardize JSON, adoption would likely have suffered.


The basic Unix tools do account for variable width, though. Variable-width characters are baked into most OSes; when you use these commands the decode is implicit.
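Roughly what that implicit decode looks like, sketched in Python (assuming a UTF-8 locale):

    import io

    data = 'héllo\n'.encode('utf-8')  # 7 bytes, what wc -c would count
    text = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8').read()
    print(len(data), len(text))       # 7 bytes vs 6 characters (wc -m)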


You can transport it to an architecture of a different endianness without loss of information or metadata, and without a transformation at the destination.

There are important ways in which it is, in fact, "just bytes".


Endianness etc. is a feature of the encoding. Most JSON implementations I’ve used require the raw bytes to be decoded first.


No, endianness is not a feature of UTF-8. There isn't a UTF-8LE and a UTF-8BE, because the code unit is a single byte.

Forget "decoding", you have to parse JSON. But you don't have to figure out how it's encoded first. Because it's a byte format. You already know.


There isn’t a UTF-8LE/BE because it is implicitly big-endian for multi-byte characters. Any byte in a multi-byte sequence cannot meaningfully be interpreted (beyond its character class, code page, etc.) without its companions, so it’s not just bytes. There is an element of presentation that must happen before “mere bytes” are eligible to be JSON.
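The “companions” point is easy to demonstrate in Python (0xA9 is the continuation byte from the UTF-8 encoding of 'é'):

    >>> b'\xc3\xa9'.decode('utf-8')
    'é'
    >>> b'\xa9'.decode('utf-8')     # the trailing byte alone is meaningless
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 0: invalid start byte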


By spec, all JSON must be UTF-8. Anyone adding a charset parameter to application/json is, at best, being redundant.


UTF-8 is variable width, my friend.


The fact that JSON is UTF-8 doesn't contradict the fact that it's bytes!

That's a feature, not a bug.

i.e. "exterior designs are layered" - https://www.oilshell.org/blog/2023/06/ysh-design.html#exteri...

This is not a trivial point -- there are plenty of specs which are sloppy about text, try to abstract over code points, and result in non-working software.

A main example is the shitshow with libc and Unicode - https://thephd.dev/cuneicode-and-the-future-of-text-in-c#sta...

It suffers from what I call "interior fictions", like wchar_t.


Of course it’s all bytes. It’s all bytes. That doesn’t change the fact that you need some awareness of the encoding before those bytes are fully sensible.


Encoding decides how the bytes are interpreted.
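For instance (a Python sketch; the byte pair is the UTF-8 encoding of 'é'):

    >>> b'\xc3\xa9'.decode('utf-8')
    'é'
    >>> b'\xc3\xa9'.decode('latin-1')  # same bytes, different interpretation
    'Ã©'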


Yes. The encoded content is “just bytes”; once decoded it’s logically something else (variable-width character data structured as JSON) that transcends bytes at the data level.
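A minimal Python sketch of that layering (the sample document is mine):

    import json

    raw = '{"name": "café"}'.encode('utf-8')  # on the wire: just bytes
    obj = json.loads(raw.decode('utf-8'))     # decode, then parse
    print(obj['name'])                        # 'café': structured data, not bytes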



