The answer to this question is complicated. JavaScript's string encoding is roughly UTF-16, which uses 2-byte code units, but the code unit you read may be part of a surrogate pair, so depending on the first one, you must read the next 2 bytes to complete your character.
And of course, this basic explanation doesn’t really do justice to your question, because depending on your definition of what a “character” is, you may need to take ligatures etc. into account.
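A quick sketch of both points that you can paste into a console (the particular emoji and the combining accent are just illustrative picks):

```js
const s = '💩';                  // U+1F4A9, outside the Basic Multilingual Plane
console.log(s.length);           // 2 — two UTF-16 code units
console.log(s.charCodeAt(0));    // 55357 (0xD83D, high surrogate)
console.log(s.charCodeAt(1));    // 56489 (0xDCA9, low surrogate)
console.log(s.codePointAt(0));   // 128169 (0x1F4A9, the full code point)

// And "character" gets fuzzier still: a combining accent is its own code point.
console.log('e\u0301');          // renders as "é"
console.log('e\u0301'.length);   // 2
```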
The 2-byte part applies, if not the rest, and it’s not really “on disk” but rather “memory for the data type,” right?
Given the nature of the questions, I presumed they were interested in knowing “how does JavaScript load strings into memory, anyway?”
And to answer that question, your rough heuristic should be “2 bytes per character,” not 1, even for the ASCII range. That just leads to additional questions, though, because it seems odd at first.
In order to support all of Unicode, there’s a reserved set of values within those 2 bytes that lets the encoding be extended to reach the full Unicode range.
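Concretely, the reserved values are the surrogate range 0xD800–0xDFFF, and a pair of them combines into one code point via the standard UTF-16 arithmetic (a sketch of the decoding step, not any engine’s actual code):

```js
// Standard UTF-16 surrogate-pair decoding, sketched out.
function decodeSurrogatePair(high, low) {
  // Expects high in 0xD800–0xDBFF and low in 0xDC00–0xDFFF.
  return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
}

decodeSurrogatePair(0xD83D, 0xDCA9); // 0x1F4A9, i.e. '💩'.codePointAt(0)
```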
Back to the original measurement: for the string “hello world”, I believe a JavaScript `sizeof`, if it existed, would report 24 bytes (22 for the characters, and 2, give or take, for either a NULL terminator or a length header).
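You can get at that 22-byte character-data figure empirically, e.g. in Node (using the Buffer API purely as a convenient byte counter; it says nothing about the engine’s internal layout):

```js
// Bytes of the encoded character data only — no engine metadata included.
Buffer.byteLength('hello world', 'utf16le'); // 22 — 11 code units × 2 bytes
'hello world'.length * 2;                    // 22 — the same rough heuristic
```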
The thread was originally about CRA vs Vite size on disk (or, implicitly, if we're applying it to real-world applications, the network cost in CI job startup times). And like I said, surrogate pairs don't apply to ASCII.
See this[0] for reference. Note how the first code unit must fall within a certain range in order to signal a surrogate pair. This range quite deliberately falls outside the ASCII range. JS parsers take advantage of this to parse ASCII substrings faster by special-casing that range, since checking for a valid character across the entire Unicode range is quite a bit more expensive[1].
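The check itself is just a range comparison on the 16-bit code unit, which is why the ASCII fast path is cheap (a sketch of the idea, not how any particular parser implements it):

```js
// Code units 0xD800–0xDBFF are high (leading) surrogates; ASCII is 0x00–0x7F,
// so the two ranges can never overlap.
const isHighSurrogate = (u) => u >= 0xD800 && u <= 0xDBFF;
const isAscii = (u) => u <= 0x7F;

isHighSurrogate('💩'.charCodeAt(0)); // true
isAscii('h'.charCodeAt(0));          // true
```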
IMHO nitpicking about memory consumption of the underlying data structure is a bit meaningless, since the spec doesn't actually enforce any guarantees about memory layout. An implementation can take more memory for a pointer to the prototype, a cached hash code or length, etc., and there are also considerations such as whether the underlying data structure is polymorphic or monomorphic due to the JIT, whether the string is boxed or unboxed, whether it's implemented in terms of C strings vs. slices, etc.
Regardless, it doesn't change the fact that the octet sequence "hello world" takes 11 bytes in ASCII/UTF-8 encoding (disregarding implementation metadata).
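Which is easy to verify (TextEncoder always encodes to UTF-8 and is available in browsers and modern Node):

```js
new TextEncoder().encode('hello world').length; // 11 — one byte per ASCII character
```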