Pointer Compression in V8

svat · on Feb 3, 2020

The immediate context for this is this post https://v8.dev/blog/v8-release-80 on the v8 blog, where they announced that in v8 version 8, they saved an average of 40% memory, and (unlike usual memory-time tradeoffs) also got good performance improvements. (Design doc: https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...) So pointer compression is clearly a good thing. I don't know much about the history of pointer compression, but I know the following.

In 2008, Don Knuth posted on his then "news" page (https://cs.stanford.edu/~knuth/news08.html):

> A Flame About 64-bit Pointers

> It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.

> The gcc manpage advertises an option "-mlong32" that sounds like what I want. Namely, I think it would compile code for my x86-64 architecture, taking advantage of the extra registers etc., but it would also know that my program is going to live inside a 32-bit virtual address space.

> Unfortunately, the -mlong32 option was introduced only for MIPS computers, years ago. Nobody has yet adopted such conventions for today's most popular architecture. Probably that happens because programs compiled with this convention will need to be loaded with a special version of libc.

> Please, somebody, make that possible.

Presumably Knuth was not the only person asking for it, and in 2011 there was work on this: see "Making Knuth's wish come true: the x32 ABI" (http://blog.reverberate.org/2011/09/making-knuth-wish-come-t...) and Wikipedia/LWN coverage (https://en.wikipedia.org/w/index.php?title=X32_ABI&oldid=921... https://lwn.net/Articles/456731/)

Unfortunately, no one was using it (not sure why, maybe not many people who care about performance write the kinds of programs that would hugely benefit from this?), and the x32 ABI got sort of deprecated by late 2018 (https://www.phoronix.com/scan.php?page=news_item&px=Linux-Po... etc).

Now, a personal story. Recently, while searching for something on Stack Exchange, I came across a question related to Bentley's June 1986 Programming Pearls column that featured an invited literate program by Knuth and a review by Doug McIlroy, about which there is a lot of misinformation and misunderstanding on the internet (e.g. calling it an "interview question" and what not!). Anyway, this question on codegolf.SE (https://codegolf.stackexchange.com/questions/188133/bentleys...) was about implementing a fast solution to the same problem, and the "winner" was an elegant Rust program. I was curious about Knuth's original Pascal (WEB) program from 1986, so I studied it, translated it to C++, and found to my surprise that it ran faster than the fastest program that had been posted on the site! Looking closer into why, experimenting with this and that, it turned out AFAICT that the probable reason, ultimately, was that where the Rust program used (64-bit) pointers, the translation of Knuth's program (which had been targeting "common denominator" Pascal, without pointer types) used (32-bit) array indices, so it was able to fit twice as many struct values in the cache.

In fact, taking just this one idea (cache-friendliness) and using a regular trie data structure (as we're no longer operating under similar memory or language constraints as Knuth was) gives something even faster. (https://codegolf.stackexchange.com/a/197870) I'd been planning to write a blog post explaining all this -- the clever data structures used (tries, trie-packing, and hash tries), how they're used in the TeX program for hyphenation, the context in 1986 and misunderstandings today, and my experiments with the programs -- but got distracted by other things, but this post has reminded me to try again. :-)

MisterAV · on Feb 3, 2020

Having 32-bit pointers doesn't mean that ASLR becomes much less effective? I can suppose that for V8 is not a big problem, because they use compressed pointers where necessary, but if it was a compiler directive (like Knuth wanted) it would affect the whole program. I would not use that option for any program that have to process untrusted input.

verwaest · on Feb 3, 2020

It does: pointers in the V8 heap all lie within the same 4gb region. But then again, Spectre also makes ASLR much less effective.

correct_horse · on Feb 3, 2020

This is really interesting, and I think 32-bit pointers make sense for a significant class of problems. At the very least, having the option could be useful. But...

One time in node, building a react-native project, the build failed with `FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory`. (see issue here: https://github.com/expo/expo-cli/issues/94) I ended up fixing it with `export NODE_OPTIONS=--max_old_space_size=8000` But it took a while to find that. In an era where developer experience is king, maybe the default should be 64-bit pointers with an option for 32-bit.

Also if a 32-bit pointer type option existed in say C, I worry that programmers would abuse it, chasing performance at the expense of bugs.

flohofwoe · on Feb 3, 2020

IMHO the best alternative to pointers is "tagged index handles" as described here:

https://floooh.github.io/2018/06/17/handles-vs-pointers.html

This approach drastically reduces the risk of memory corruption in usafe languages, and at the same time keeps data structures compact because handles rarely need to be 64-bits (it's a bit similar to the compressed pointers described in the original post, basically split pointers into a "private" base pointer, and a public offset/index, and use some bits in the public value for dangling protection).

tybit · on Feb 3, 2020

“ A downside to this is that the V8 heap can not be any greater than 4 GB as that is the maximum limit of a 32-bit address space. This is fine for browsers, as the heap doesn’t need to be greater than 4 GB anyway. It becomes a problem with things like node.js that require larger heaps. Because of this, pointer compression is disabled for node.js until a better solution can be figured out.”

It will be interesting to see how this evolves. In the Hotspot JVM(not using fancy GCs at least), heaps up to ~32GB can use pointer compression, as they compress pointers to object offsets instead of memory addresses.

https://stackoverflow.com/questions/25120546/trick-behind-jv...

verwaest · on Feb 3, 2020

We're definitely interested in exploring this. Unfortunately it's likely a little slower than 4gb PC: Compression right now is basically a no-op, decompression simply being a single add instruction. And it'll fragment memory a little because of the alignment requirements. But surely worth it for where larger heaps are necessary.

the8472 · on Feb 3, 2020

Don't fancy x86 addressing modes provide most of those multiplications and offsets with very little IPC penalty?

cfallin · on Feb 3, 2020

Yeah, this should be roughly the same overhead as an ADD:

    LEA rDest, [rBase + 8*rPtr]

(The "load effective address" instruction computes an effective address like a load or store would, but just gives the address without doing a memory access.)

the8472 · on Feb 3, 2020

AIUI mov supports these things directly[0] and if I read the instruction tables correctly then at least on skylake the latency/throughput is the same for all addressing modes[1]

[0] http://www.c-jump.com/CIS77/ASM/Addressing/lecture.html#R77_... [1] https://www.agner.org/optimize/instruction_tables.pdf (page 238)

verwaest · on Feb 7, 2020

Decompression isn't the problem, compression is. Compression is just a mov. Now we need additional shifts.

verwaest · on Feb 7, 2020

Also we'll probably lose some cache benefits from compression due to larger alignment.

akvadrako · on Feb 3, 2020

In general there doesn’t seem to be much use for pointers to specific bytes. Almost any field should be at least 32-bit aligned and when you do need an exact byte, storing an offset from the beginning of the object probably makes more sense.

barrkel · on Feb 3, 2020

Any C code which does any kind of parsing or work with buffers almost certainly uses char* as a pointer to bytes.

Interesting portability effects then arise when you have different representations for void / char vs other pointer types.

IvanK_net · on Feb 3, 2020

"This is fine for browsers, as the heap doesn’t need to be greater than 4 GB anyway."

I develop a photo editor www.Photopea.com , where people often edit e.g. 100-megapixel photos. Then, Chrome may crash (because of 4GB limit) and they lose their work. I have to recommend users to use Firefox for such cases.

ygra · on Feb 3, 2020

Is it really necessary to keep the whole uncompressed image in memory (as well as all undo steps) at all time? I guess the browser environment makes trading RAM for disk space difficult, but Photoshop ran fine with moderate amounts of RAM and plenty of scratch space on disk a few decades ago. It's probably easier to just keep stuff in memory, but perhaps not strictly necessary as not everything needs the same latency.

MrPatan · on Feb 3, 2020

I have 32 GB of RAM, why shouldn't software use it to make things faster?

ygra · on Feb 3, 2020

It definitely should, but keeping your working set smaller can also make things faster (as seen by Chrome here). This depends very much on the workload and what's being done to the data in memory. My example was just that a raster image editing program probably doesn't require a huge memory footprint just to be able to edit images well (as a lot of the memory use typically is not the image you're seeing, but history and undo state, which is neither latency-critical, not frequently accessed).

cormacrelf · on Feb 3, 2020

Chrome's 32-bit address space/4GB limit is different from having 64-bit machine/48-bit address space/4GB of RAM. In the latter, you can keep allocating after 4GB, it will just get slower as the pager will start swapping pages to and from the disk. But in V8 with pointer compression, you will just hit a brick wall.

To swap memory to disk, you still need address space to map it to. Say the editor has allocated 3.4GB, and then makes another 600MB layer, allocated at roughly 0xD000000. With no address space left, it asks for another layer, and the allocator returns NULL. It can't give you a pointer to a 600MB region, because there's no address space left. If you paged out the layer at 0x20000000, that would not help, because it wouldn't magically free up the addresses 0x20000000-0x40000000. They would just refer to pages that are currently on disk, and still be 'occupied' address space. You still need an address for this new allocation, and there is no room to put it.

No slow degradation -- it won't page fault at all unless you otherwise fill the RAM on the machine. So the image editor just falls over, with an uncatchable OOM exception I presume, with no perceptible warning from page fault slowdown just prior. It will go full speed into the brick wall. For your account to be accurate, V8 would have had to implement their own virtual address space, which they have not. VA basically requires a hardware TLB to be fast, and V8's "TLB" here is just `mov eax, [whatever]; and rax, r13`. Anything other than that would have completely defeated the speed gains from locality.

This doesn't account for nuances like whether ArrayBuffers would be allocated elsewhere and have no pointer compression applied, but it's definitely true of general objects. For a regular JS program to fill 4GB with normal web app things would be a miracle, and the image editors of the world can probably still work if they make the big-allocation APIs use full-size pointers.

eru · on Feb 13, 2020

Keeping your working set smaller doesn't mean that you have to not keep everything 'in memory': you can carefully craft your memory access patterns to work well with OS virtual memory management.

im3w1l · on Feb 3, 2020

Because you are using 29 of those to do something else like compilation or rendering. Hypothetically.

IvanK_net · on Feb 3, 2020

It is true, that I could use mimpaps and show only the preview of a scaled-down version of the image.

But the rendering pipeline of PSD documents is extremely complex. There are not just layers, but also layer styles, raster masks, clipping masks, adjustment layers. Folders of layers can have their own layer styles and masks. You need to allocate separate buffers to render "sub-trees". And everything is GPU accelerated (over WebGL).

I am afraid that remaking it to a mip-mapped system would take me like 1 000 hours of work, so I think it is easier to remake V8 (which I guess could be like 100 hours of work). As the usual capacity of RAM will keep growing, they would have to do it at some point anyway.

twic · on Feb 3, 2020

The object heap can be up to 4 GB. The contents of ArrayBuffers and the like could live in a separate, much larger, space. I don't know if V8 is doing this, though.

DougBTX · on Feb 3, 2020

FTA:

> The reason this works is because the backing stores of array buffers are allocated using PartitionAlloc (I’m not entirely sure if this is still the case, but this was the case approximately 3-4 years ago, and I haven’t seen anything to suggest that it has changed). All PartitionAlloc allocations go on a separate memory region that is not within the V8 heap. This means that the backing store pointer needs to be stored as an uncompressed 64-bit pointer, since its upper 32 bits are not the same as the isolate root and thus have to be stored with the pointer.

tuyiown · on Feb 3, 2020

If you need help on how handling images larger than your allocatable memory, I seriously recommend to take a look to GDAL lib drivers interface, it's designed to enable best perf from either memory or disk read/writes, possibly on compressed format for the few that allows partial read/writes.

londons_explore · on Feb 3, 2020

I'd be interested to see how far this can be taken...

Why not make pointers just 4 bits, giving them only 15 possible memory locations they can point to? Then try to allocate the thing they'll be pointing at in one of those 15 locations. Reserve the 16th location for some kind of backup data structure which can point anywhere.

Clearly it isn't branchless, but I would guess that most codepaths would either always be able to make use of one of the 15 locations, or would never be able to, making branches predictable.

verwaest · on Feb 3, 2020

Sounds like a fun project in the context of a moving GC :)

unwind · on Feb 3, 2020

Four bits is probably too low, since you would need to fit other data into the renaining bits in order not to waste them.

The x86 is (merely) byte-addressable so a byte is the smallest piece of data that makes sense for an "object", as far as I know.

londons_explore · on Feb 3, 2020

Nearly all objects are full of pointers. You can pack loads in if they're 4 bit.

joe_the_user · on Feb 3, 2020

Pointer compression seems like an amazing method for bytecode optimization in general.

Bytecode optimization is one of my hobbies and I wish I could find a list of all such methods that VM like V8 use.

unlinked_dll · on Feb 3, 2020

I think pointer compression is worthy of a masters/PhD thesis in compiler design if it hasn't been done already. Should be ubiquitous, if you don't have a program that will ever require the amount of memory needed for more than a 2^N bit number the pointers should probably be compressed and arithmetic optimized. There's a lot of edge cases to look for there however.

chubot · on Feb 3, 2020

There's been a bunch of work on it in the last 10-20 years... Here are some (disorganized) links, including the one by Chris Lattner mentioned in a sibling comment:

https://github.com/oilshell/oil/wiki/Compact-AST-Representat...

joe_the_user · on Feb 3, 2020

This is a really cool reference that actually appears to be just what I'm seeking, cheers.

bdash · on Feb 3, 2020

A student from University of Illinois at Urbana-Champaign wrote a paper on transparent pointer compression in their custom GCC backend back in 2005: https://llvm.org/pubs/2005-06-12-MSP-PointerComp.pdf. Their compiler backend went places, but I'm not sure if the optimization described in the paper ever made it into production.

Taniwha · on Feb 3, 2020

It's pretty trivial, from a compiler point of view - not being able to take real pointers is going to make code slower in some cases - the hard part is things like partially computed pointers that get spilled onto the stack (esp if you are doing GC)

unlinked_dll · on Feb 3, 2020

GC is interesting, because often times the concept of a reference is a concept of identity, whereas a pointer is a concept of location in the heap. Things like reallocating GC that shuffle around location on the fly make this a bit easier since your references only pertain to identity, and the compression can be applied to estimating or validating the total number of identities won't exceed some maximum for a valid set of inputs to the program.

chrisseaton · on Feb 3, 2020

> worthy of a masters/PhD thesis in compiler design if it hasn't been done already

You can’t just write a thesis about someone else’s existing idea! You’re supposed to come up with a new idea!

ndesaulniers · on Feb 3, 2020

Or if vtables had relative offsets, rather than absolute addresses? Fuchsia is designing their ABI with that change to the Itanium ABI.

vidarh · on Feb 3, 2020

It's unclear what you mean here. Lots of uses of vtables will use offsets to look up a specific slot in a vtable, but the total space used by vtables in most systems will be a tiny proportion of the heap, so storing the pointers in the vtables as offsets doesn't seem likely to make much difference.

ndesaulniers · on Feb 4, 2020

What's in those slots? (Full pointers)

https://itanium-cxx-abi.github.io/cxx-abi/cxx-vtable-ex.html

https://bugs.llvm.org/show_bug.cgi?id=26723

vidarh · on Feb 7, 2020

Yes, but there's only 1 vtable per class. While the latter shows a shockingly high reduction in code size, which seems to imply a ludicrous number of tiny classes with few call sites per method in use in the Chromium code base, relative to what I've seen when doing compiler development, it shouldn't translate into much of a change in heap usage.

From reading those, my main takeaway is that the Chromium codebase sounds awful.

The_rationalist · on Feb 3, 2020

Why are you talking about the itanium API instead of the x86-64 API?

ndesaulniers · on Feb 4, 2020

The Itanium ABI lives on in x86_64.

https://stackoverflow.com/q/53455855

iainmerrick · on Feb 3, 2020

Is Itanium still around??

ndesaulniers · on Feb 4, 2020

Its ABI is.

https://stackoverflow.com/q/53455855

andreareina · on Feb 3, 2020

Getting a 40% heap reduction by halving pointer size implies that pointers were 80% of the heap! What am I missing?

eru · on Feb 13, 2020

Don't think you are missing anything. There are a lot of pointers.

boardwaalk · on Feb 3, 2020

Is there a way to do this in a systems language? I know there was the X32 ABI that’s essentially deprecated at this point. I suppose you could use your own allocator and pointer wrapper types, but it’d be nice if there was just a switch you could flip if you know you’re just not going to use >4GiB.

anarazel · on Feb 3, 2020

Depends on what you define as "do this". It's pretty common for pointer heavy datastructures to replace the pointers with a narrower offset, and then a single base pointer which the offsets are added/subtracted to.

If the compiler can be convinced to keep the base pointer in a register through dereference heavy code, the cost is often negligible. And more than won back by better cache efficiency.

Taniwha · on Feb 3, 2020

sure you can - 35 years or so back I ported Unix (v6/v7) to run in a virtual system (under VMS) with relative pointers (so you could move swapped images around without playing too many MMU games). I did this by hacking the C compiler to use a relative pointer much the same as these people are doing

winrid · on Feb 3, 2020

Wow, how long did it take to Port Unix to run under VMS?

Taniwha · on Feb 3, 2020

It was a long time ago, probably ~ 6 months part time - bringing up the kernel at the same time as a compiler is never easy - it ran in supervisor/user modes in place of VMS's shell.

Porting V6 was harder, it was very pdp-11ish, lots of stuff (especially context switch) depended a lot on knowledge of the structure pdp11 stack frames

codedokode · on Feb 3, 2020

Would not it be easier just to use 32-bit x86 arch instead of 64-bit? Pointer compression looks like a solution for a problem that should not exist in the first place. I saw the arguments against this in [1] but they look very weak ("We cannot use 32-bit arch because Chrome has switched to 64-bit and because there is an OS nobody is using that doesn't allow this").

Why not make 64-bit Chromium for those who has over 16 Gb of RAM and 32-bit for normal people?

[1] https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...

anarazel · on Feb 3, 2020

Realistically the limit for 32bit apps isn't 4GB, but considerably lower. The OS mapping takes out like 1gb at least. Then you have shared libraries mapped in. You want space for aslr. Mmapping files. Etc.

Leaving available memory aside, for a lot of optimizations it's useful to have plenty virtual memory space - which is definitely not the case in 32bit.

Basically, just because js bytecode doesn't need more than 4gb, doesn't mean no other part of chrome needs more.

beagle3 · on Feb 3, 2020

That's correct, although 32-bit processes on a 64-bit operating system (as are still supported on Windows and Linux, and were supported on macOS until Catalina) can effectively have 4GB.

Back when I was still using Windows, you could boot the 32-bit OS with "/3GB", which would make the kernel/user split 1GB/3GB instead of the original 2GB/2GB default - but it was optional and explicit, because quite a bit of software failed; I would guess that changed with time, but likely a similar "/4GB" switch for 32-bit apps on 64-bit OS would also expose assumptions about the address space layout ...

anarazel · on Feb 3, 2020

It'd also make syscalls a bit more expensive...

desdiv · on Feb 3, 2020

x64 has 16 general purpose registers, while x86 only has 8 general purpose registers. So even with the extra overhead of 64-bit pointers, x64 code still ends up being faster.

In terms of speed, it's:

1. x64 with compressed pointers

2. x64

3. x86

In terms of actual production usage, it's:

1. x64

2. x64 with compressed pointers

3. x86

So x86 is the slowest and the least used among the three. And frankly it would be very difficult to hire talent in 2020 if you're targeting the x86-64 platform but chose to use the legacy x86 mode for whatever reason.

imtringued · on Feb 3, 2020

That would mean the entire process would be limited to 4GB of ram which is pretty awful considering multiple Javascript VMs run in the same process which could have 4GB of memory each. That way the RAM limit is significantly below 4GB per application.

verwaest · on Feb 3, 2020

+1. In fact even the JavaScript heap isn't limited to 4gb, just the space that pointers can point to. For example, we allocate large typed arrays outside of the V8 heap.

beagle3 · on Feb 3, 2020

Recent macOS Catalina threw away 32-bit support; Ubuntu LTS as of 20.04 is removing almost all of the 32-bit libraries (with a few exceptions that would let Wine/Proton keep running, IIRC). It will eventually happen in Windows as well; 16-bit code was exceptionally fast and compact when it was suffcient.....

londons_explore · on Feb 3, 2020

How about another approach...

Design a "pointer predictor", which for a given pointer predicts where it will lead to. I would guess there are many arrays of identical structures, so predicting any given pointer value ought to be doable. The predictor could be as simple as "This object has patterns of pointers very similar to this other object, so use those instead"

Then replace each pointer with a single bit saying "the predictor is right" or "the predictor is wrong, use an alternative pointer stored in an external table".

lifthrasiir · on Feb 3, 2020

Similar ideas were proposed for the memory compression, exploiting a fact that most allocations in typical applications are object-like. See for example [1] (HN discussion: [2]).

[1] https://blog.acolyer.org/2019/05/24/zippads/

[2] https://news.ycombinator.com/item?id=19998645

kilowatt · on Feb 3, 2020

At some point (not sure if it is still true) Jai had or was going to have language-level implementation of relative pointers, i.e. a 16 or 32 bit relative offset from “this field’s memory address.”

Santosh83 · on Feb 3, 2020

When is this rolling out for stable Chromium/Chrome? Any timeline?

woofwoofwoof · on Feb 3, 2020

> This post provides a preview of some of the highlights in anticipation of the release in coordination with Chrome 80 Stable in several weeks.

https://v8.dev/blog/v8-release-80

Waterluvian · on Feb 3, 2020

Can someone give some context on why pointer compression is worthwhile? Are there actually so many pointers in use that indeed you save a significant amount of memory?

imtringued · on Feb 3, 2020

>Are there actually so many pointers in use that indeed you save a significant amount of memory?

Every time you create a Javascript object you need a pointer. Everything except small integers (including booleans) is an object in Javascript.

NikolaeVarius · on Feb 3, 2020

The article links to a design doc https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...

tybit · on Feb 3, 2020

“We expect about ~35% of V8 heap reduction on 64-bit platforms on real-world web sites” amazing

pdpi · on Feb 3, 2020

Without some optimisation, anything that’s not a Boolean, a number, a null or an undefined will involve some pointers.

rapsey · on Feb 3, 2020