Fast Shadow Stacks for Go (felixge.de)
121 points by ingve 4 months ago | 31 comments



In an ideal world, we'd be using an ABI that keeps the code (return addresses) and data (spilled function params/locals) in two separate stacks. Then the profiling would be as straightforward as a memory copy of the code stack.


That's what hardware shadow stacks in modern intel/arm CPUs can do! It just needs to be exposed to user space and become widely available.


Fedora 40 and later have shadow stack support in userspace. It's currently opt-in with glibc (`export GLIBC_TUNABLES=glibc.cpu.x86_shstk=on` is one way to switch it on I believe). The plan is to make this self-tuning eventually in glibc upstream, once the quirks have been ironed out.

It will not work with Go as-is, because the Go scheduler will have to be taught to switch the shadow stack along with the regular stack, and panic/recover will need to walk the shadow stack. But for binaries that do not use CGO, it should be possible to enable this fairly quickly. Hardware support is already widely available. The SHSTK-specific code paths are well-isolated. You would not lose compatibility with older CPUs or kernels.


Thanks for the reply!

What does the API for accessing the shadow stack from user space look like? I didn't see anything for it in the kernel docs [1].

I agree about the need for switching the shadow stacks in the Go scheduler. But this would probably require an API that is a bit at odds with the security goals of the kernel feature.

I'm not sure I follow your thoughts on CGO and how this would work on older CPUs and kernels.

[1] https://docs.kernel.org/next/x86/shstk.html


You can get the shadow stack pointer using the RDSSPQ instruction. The kernel documentation shows how the shadow stack size is obtained for the main thread. Threads created explicitly using clone3 have the specified shadow stack size. I think this is sufficient to determine the shadow stack boundaries.

Regarding older CPUs, what I wanted to point out is that the code to enable and maintain shadow stacks will not be smeared across the instruction stream (unlike using APX instructions, or direct use of LSE atomics on AArch64). It's possible to execute the shadow stack code only conditionally.


Thank you so much, this is very helpful and interesting. I'll try to experiment with this at some point.


Glad to be of help.

Note that it's actually GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK to enable it with Fedora 40 glibc (and the program needs to be built with -fcf-protection).


Would that not complicate function prologues & epilogues?

I suppose hardware support (e.g. dedicated call/ret instructions accessing a second stack pointer) is also desirable if you want to do this sort of thing. I recall that Itanium had something like this.


How would it complicate the prologue/epilogue? It would still be

    push    RBP
    mov     RBP, RDSP  ; DSP stands for "data-stack pointer"
    sub     RDSP, <frame-size>
    ...                ; locals are [RBP-n], args are [RBP+n]
    add     RDSP, <frame-size>
    pop     RBP
    ret                ; pops address from [RCSP], where CSP stands for "call-stack pointer"
or even, if you don't want a specific base pointer, like this:

    sub     RDSP, <frame-size>
    ...                ; locals are [RDSP+n], args are [RDSP+frame_size+n]
    add     RDSP, <frame-size>
    ret
In fact, you can do this even now on x86-64: make a second stack segment, point RBP there, and use RBP as if it were that hypothetical RDSP register. The return addresses naturally go into the RSP-pointed area, as they always did. The only tricky part is when you need to call code that uses the SysV x86-64 ABI: you'll need to conditionally re-align that RSP-pointed area to be 16-byte aligned, which is a bit annoying (in your own code, keeping RBP aligned at 16 bytes is trivial, since function calls don't break that alignment).

There are two main benefits of this layout, as I see it: first, smashing the return address is now way harder than it normally is; second, "the act of traversing all call frames on the stack and collecting the return addresses that are stored inside of them" is now trivial, since the call stack at RSP is just a dense array of return addresses, no pointer chasing needed.
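To make the "dense array" point concrete, here is a minimal Go sketch (the frame struct and the addresses are invented for illustration, not real unwinding code) contrasting frame-pointer unwinding with a shadow-stack copy:

```go
package main

import "fmt"

// frame simulates a stack frame linked via its saved base pointer.
type frame struct {
	retAddr uintptr
	caller  *frame // the saved-RBP chain a frame-pointer unwinder follows
}

// walkFramePointers collects return addresses by pointer chasing,
// one dependent load per frame.
func walkFramePointers(top *frame) []uintptr {
	var pcs []uintptr
	for f := top; f != nil; f = f.caller {
		pcs = append(pcs, f.retAddr)
	}
	return pcs
}

// copyShadowStack models the two-stack layout: the return addresses
// already form a dense array, so capturing them is a single copy.
func copyShadowStack(shadow []uintptr) []uintptr {
	pcs := make([]uintptr, len(shadow))
	copy(pcs, shadow)
	return pcs
}

func main() {
	// An artificial 3-frame call chain and its shadow-stack equivalent.
	f3 := &frame{retAddr: 0x300}
	f2 := &frame{retAddr: 0x200, caller: f3}
	f1 := &frame{retAddr: 0x100, caller: f2}
	shadow := []uintptr{0x100, 0x200, 0x300}

	fmt.Println(walkFramePointers(f1))
	fmt.Println(copyShadowStack(shadow))
}
```

Both produce the same trace; the difference is that the second version needs no dependent loads, which is why capturing a shadow stack degenerates to a memory copy.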


And Go is already using a non-standard ABI and has to do real work to convert back to standard ABIs when calling external code. That's part of why cgo calls are expensive; the other major part is having to interact with OS threads differently than native Go code does.


Isn't that what mainly differentiates Harvard from Von Neumann?


No. They store code and data in separate address spaces. That isn't the same as storing code addresses and data addresses in separate parts of the data space.


Yes. And all four combinations of Harvard / von Neumann and single stack / double stack are potentially viable options.

Btw, I think Forth has a separate data stack and return address stack?


Yes. You can also freely use the return stack, IIRC, so long as you don't mess up a return itself.


You can totally access these stacks from userspace - https://github.com/gabriellandau/ShadowStackWalk


That seems to be Windows-only? My main target OS is Linux.


Ah, sorry, now I understand. Also TIL: this might not be immediately accessible on Linux unless certain configuration is set up. Thanks!


How common is it for stack frames to be deeper than 32 calls, where these techniques cross in the benchmark? I don't program in Go, but it doesn't seem that common in the languages I've used, so I'm skeptical it's worth it.


I'm skeptical that it's worth it myself; this was just a fun research project for me. But once hardware shadow stacks are available, I think this could be great.

To answer your first question: For most Go applications, the average stack trace depth for profiling/execution tracing is below 32 frames. But some applications use heavy middleware layers that can push the average above this limit.
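As an aside, the depth of a given call site in a Go program can be checked with runtime.Callers; a toy sketch (the function names are invented for illustration):

```go
package main

import (
	"fmt"
	"runtime"
)

// depth reports how many frames runtime.Callers captures at the call site.
func depth() int {
	pcs := make([]uintptr, 128)
	return runtime.Callers(0, pcs)
}

// nest adds n extra stack frames before measuring the depth.
// Go does not do tail-call elimination, so each call is a real frame.
func nest(n int) int {
	if n == 0 {
		return depth()
	}
	return nest(n - 1)
}

func main() {
	// The difference isolates the frames added by the recursion itself.
	fmt.Println(nest(32) - nest(0))
}
```

Running something like this at hot call sites gives a rough feel for whether a 32-frame fixed-size buffer would cover an application's typical traces.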

That being said, I think this technique will amortize much earlier when the fixed cost per frame walk is higher, e.g. when using DWARF or gopclntab unwinding. For Go that doesn't really matter while the compiler emits frame pointers. But it's always good to have options when it comes to evolving the compiler and runtime ...
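The amortization argument can be sketched with a back-of-the-envelope model (the costs below are made-up units, not measurements):

```go
package main

import "fmt"

// crossoverDepth returns the stack depth at which copying a dense shadow
// stack (fixed setup cost plus a cheap per-frame copy) becomes cheaper
// than unwinding (an expensive per-frame walk). It assumes
// walkPerFrame > copyPerFrame; otherwise there is no crossover.
func crossoverDepth(setup, copyPerFrame, walkPerFrame float64) int {
	for d := 1; ; d++ {
		if setup+copyPerFrame*float64(d) < walkPerFrame*float64(d) {
			return d
		}
	}
}

func main() {
	// With a cheap walk (frame pointers), the crossover is deep...
	fmt.Println(crossoverDepth(100, 1, 4))
	// ...with an expensive walk (e.g. DWARF unwinding), it is much shallower.
	fmt.Println(crossoverDepth(100, 1, 50))
}
```

The shape matches the intuition above: the higher the fixed per-frame cost of walking, the shallower the stack at which the shadow-stack copy starts to win.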


On Windows, I tend to often see callstack frames going down a lot, especially in C++ debug builds where inlines are no longer such.

On top of that, if you have a work-stealing task scheduler that reassigns your current thread to some other task while it waits on I/O, the stack may balloon even more.


Can you clarify what you mean by "going down a lot"? I interpret that to mean that the stacks are flatter, but then debug builds without inlining would grow the stack, not flatten it, so I'm not clear on what you're saying.

Scheduling tasks on a work-stealing queue tends to flatten the stack first, e.g. a halted task has a nearly empty stack if it's an async coroutine. The scheduler itself may add 3-4 calls of depth before resuming the task, so that doesn't seem too bad. Unless you're talking about full-blown threads.


Sorry, what I meant to say is that inlined calls are now individual functions, so the number of call stack entries is going to grow.

But then, if you capture the stack (using sampling) and catch, for example, a thread context switch while CreateFileA is executing, you can expect many call stack entries going all the way down into the kernel.


OP here, happy to answer any question.


Did you get any response from the Go core team? It might make sense to open an issue on GitHub https://github.com/golang/go/issues/ and start a thread in https://groups.google.com/g/golang-dev


I know that at least two engineers from the runtime team have seen the post in the #darkarts channel of gopher slack. One of them left a fire emoji :).

I'll probably bring it up in the bi-weekly Go runtime diagnostics sync [1] next Thursday, but my guess is that they'll reach the same conclusion as me: neat trick, but not a good idea for the runtime until hardware shadow stacks become widely available and accessible.

[1] https://github.com/golang/go/issues/57175


Very interesting article, thank you.

Do you see this speeding up real world Go programs?


Thanks! And to answer your question: no, it won't speed up Go programs for now. This was mostly a fun research project for me.

The low-hanging fruit for speeding up stack unwinding in the Go runtime is to switch to frame pointer unwinding in more places. In go1.21 we contributed patches to do this for the execution tracer. For the upcoming go1.23 release, my colleague Nick contributed patches to upgrade the block and mutex profilers. Once the go1.24 tree opens, we're hoping to tackle the memory profiler as well as copystack. The latter would benefit all Go programs, even those not using profiling. But it's likely going to be a relatively small win (<= 1%).

Once all of this is done, shadow stacks have the potential to make things even faster. But the problem is that we'll be deeply in diminishing returns territory at that point. Speeding up stack capturing is great when it makes up 80-90% of your overhead (this was the case for the execution tracer before frame pointers). But once we're down to 1-2% (the current situation for the execution tracer), another 8x speedup is not going to buy us much, especially when it has downsides.

The only future in which shadow stacks could speed up real Go programs is one where we decide to drop frame pointer support in the compiler, which could provide 1-2% speedup for all Go programs. Once hardware shadow stacks become widely available and accessible, I think that would be worth considering. But that's likely to be a few years down the road from now.


Do you think there are, or know of, any areas in the Go codebase that would enable a performance jump bigger than, e.g., 10%? I'm very grateful for any work done on the Go codebase; for me this language is plenty fast. I'm just curious about the state of Go internals: are there any techniques left to speed it up significantly, or are some parts of the codebase/old architectural decisions holding it back? And thank you for your work!


I don't think any obvious 10%+ opportunities have been overlooked. Go is optimizing for fast and simple builds, which is a bit at odds with optimal code gen. So I think the biggest opportunity is to use Go implementations that are based on aggressively optimizing compilers such as LLVM and GCC. But those implementations tend to be a few major versions behind and are likely to be less stable than the official compiler.

That being said, I'm sure there are a lot of remaining incremental optimization opportunities that could add up to 10% over time. For example a faster map implementation [1]. I'm sure there is more.

Another recent perf opportunity is using pgo [2] which can get you 10% in some cases. Shameless plug: We recently GA'ed our support for it at Datadog [3].

[1] https://github.com/golang/go/issues/54766 [2] https://go.dev/doc/pgo [3] https://www.datadoghq.com/blog/datadog-pgo-go/


Go's limitation is that it's a high-level language with a very simple compiler, where providing true zero-cost abstractions (full monomorphization rather than GC shape stenciling) and advanced optimizations is a bridge it won't cross, because it would mean much greater engineering effort spent on the compiler and increasing its LOC by a factor of 5, especially if the compiler wants to preserve its throughput.

Though I find it unfortunate that the industry considers Go a choice for performance-sensitive scenarios when C# exists, which went the above route and does not sacrifice performance or the ability to offer performance-specific APIs (like cross-platform SIMD), at the price of a higher-effort, more complex compiler implementation. It also does in-runtime PGO (DynamicPGO): since long-running server workloads usually run on a JIT where it's available, you don't need to carefully craft a sample workload and hope it matches production behavior; the JIT does it for you, yielding anywhere from 10% to 35% depending on how abstraction-heavy the codebase is.


Reminder: the Go team considers optimizations in code generation only as long as the compiler is kept fast. That's why the Go compiler doesn't have as many optimization passes as C/C++/Rust compilers.

As a developer I like that approach, as it keeps a great developer experience, helps me stay focused, and gives me great productivity.



