Marcan's attitude is great; I know of a ton of people (myself included) who would've written that article with far more complaining interleaved. Super informative as well, I learned a ton from this article (GRUB 2 feature for marking off bad RAM? Wow!). Very well written, informative, humorous, etc. Love it
Agreed. Marcan's a chill dude. I saw him a few times in the Dolphin Emulator IRC and he always had interesting things to say.
From my perspective this approach was pretty unique; going all the way down to debugging the hardware first may seem obvious to some, but it's a totally opposite approach to how I'd go about it. My mind would jump directly to producing a minimal test case. Would've never thought to mark off bad RAM with an obscure(ish?) GRUB 2 feature. Would've never thought to selectively flip a kernel flag for some parts of the code.
It's great to get these perspectives from people who really know how to dig down and debug deep.
That's not to say I spend a whole lot of time looking at the lower levels, but my quick mental checklist starts off down at physical points, and I try to quickly eliminate possibilities. In a lot of cases it's obvious it's a code / logic bug, and you can completely skip the lower layer stuff, but making it a conscious step pays off.
Man I loved the approach to narrow down the offending object file using the regex on the SHA hash of the binaries. That would've saved me lots of time hunting and guessing bugs with cscope+kdb back in my kernel hacker days!
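For anyone who wants to try the hash trick, here is a minimal sketch of the idea as I understand it, not the article's actual script. It assumes the hash is taken over each file's path (hashing the contents would work the same way), and the names `selected`, `partition` and the `kernel/obj*.o` paths are made up for illustration. A build wrapper would then add the suspect compiler flag only for the selected files.

```python
import hashlib
import re

def selected(path: str, pattern: str) -> bool:
    """True if the SHA-1 hex digest of the path matches the regex."""
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
    return re.match(pattern, digest) is not None

def partition(paths, pattern):
    """Split paths into (build with the flag, build without the flag)."""
    on = [p for p in paths if selected(p, pattern)]
    off = [p for p in paths if not selected(p, pattern)]
    return on, off

# Hypothetical object list; a real build would supply its own file names.
objects = [f"kernel/obj{i}.o" for i in range(64)]

# Round 1: roughly half of the files (first hex digit 0-7) get the flag.
with_flag, without_flag = partition(objects, r"^[0-7]")

# Round 2: if the crash followed the flagged half, narrow the pattern.
narrower, _ = partition(objects, r"^[0-3]")
```

Because the selection depends only on the hash, every rebuild picks the same files, and tightening the regex always yields a subset of the previous round, which is what makes it behave like a bisection.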
Wow, these are some serious debugging skills. I also admire the tenacity and the will to investigate the root cause.
> I tried setting GOMAXPROCS=1, which tells Go to only use a single OS-level thread to run Go code. This also stopped the crashes, again pointing strongly to a concurrency issue.
This is a great story; it probably wins the year for the "best bug you've ever encountered?" question.
Having implemented some weird runtimes for weird languages, I am sympathetic to the Go team here -- these odd tradeoffs of pushing the envelope on OS <-> your_own_compiler interactions can trigger some wild experiences.
Possibly, although hotter fundamentally means more thermal noise, which might actually reduce correlations / ability to communicate effectively between adjacent circuits.
Think of it as SNR (signal-to-noise ratio) -- increasing temperature increases thermal noise (there are other kinds), and with the same signal, it should actually reduce the efficiency of the side channel.
But it brings up a good question, I wonder if anyone has studied this...
Ninja level debugging and diagnostic skills. A fascinating read from start to finish. Bonus points for the GRUB 2 feature for masking out bad RAM blocks – still dreaming of owning a laptop with ECC memory :/
The redzone is only non-explicit stack and only really matters if you violate it. If vDSO allocates stack properly, which it should considering it's an exported function, there is no problem.
How does being an exported function change the rules? I thought the x86-64 ABI mandated an implicit safe 128 bytes below rsp at all times? Also, how can a vDSO function "allocate" stack? It would have to know about the current stack space as configured by the go runtime, and somehow dig into this go-runtime-specific record of the current stack limit? Isn't the only available option for any exported function just to /use/ pre-allocated stack space (by subtracting from rsp) - I don't see how it could possibly extend the pre-allocated stack.
The Red Zone is only a thing you should be doing within a function. Once you call another function you need to update the RSP regardless.
This "at all times" is most relevant when you have interrupts in your execution that you can't predict. When an interrupt runs, you can't know how much of the redzone has been used, so you assume 128 bytes below the current RSP value.
> Also, how can a vDSO function "allocate" stack?
By updating RSP and using the stack properly; that is what I mean by allocating the stack. It's normal code.
> It would have to know about the current stack space as configured by the go runtime, and somehow dig into this go-runtime-specific record of the current stack limit?
No, you simply use push and pop as usual. The go routine sets up the stack, the bug in the blogpost is that the go runtime assumed the vDSO would only use a couple bytes on the stack, but it used more because it did a stack probe about 4 kilobytes into the stack.
>Isn't the only available option for any exported function just to /use/ pre-allocated stack space (by subtracting from rsp) - I don't see how it could possibly extend the pre-allocated stack.
Not at all, as previously mentioned, stack is simply the value in RSP, you can update that as you want. PUSH and POP the values you need and the OS will implicitly allocate memory for the stack as needed.
Agreed on all those points (except - I think interrupts use the kernel stack? Otherwise triggering an interrupt would clobber the red zone, which should be preserved, since it pushes the return address and flags IIRC? So interrupts don't "use" the red zone)
I was just wondering, with the Red Zone in the ABI specification requiring 128 bytes below RSP, wouldn't that also mean any function can assume the 128 bytes below RSP are actually mapped? If Go sets up a stack with only 104 bytes to go, then couldn't accessing rsp-105 to rsp-128 from the vDSO (which should be safely within the red zone) risk causing a segfault even if the function didn't do the race-y 4k stack probe?
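To put numbers on that question, a small sketch of the arithmetic. The 128-byte red zone is from the x86-64 System V ABI; the 104-byte figure is simply the one used in this thread, and the helper names are made up for illustration.

```python
RED_ZONE = 128  # bytes below RSP reserved by the x86-64 System V ABI

def uncovered_red_zone(reserved):
    """Offsets below RSP (inclusive) that are in the red zone but not reserved.

    Returns None when the reservation covers the whole red zone.
    """
    if reserved >= RED_ZONE:
        return None
    return (reserved + 1, RED_ZONE)

def overshoot(access_offset, reserved):
    """Bytes past the reserved region for an access at RSP - access_offset."""
    return max(0, access_offset - reserved)

print(uncovered_red_zone(104))  # (105, 128)
print(overshoot(128, 104))      # 24
print(overshoot(4096, 104))     # 3992
```

So with 104 bytes reserved, a callee that used its full red zone could reach 24 bytes past the reservation, while a probe at roughly a 4K offset lands thousands of bytes past it.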
> (except - I think interrupts use the kernel stack? Otherwise triggering an interrupt would clobber the red zone, which should be preserved, since it pushes the return address and flags IIRC? So interrupts don't "use" the red zone)
Interrupts may use the kernelstack, you can use the userstack but this is trouble if they don't uphold the redzone, which they may not.
The redzone is a mere suggestion the compiler may apply when it exposes functions elsewhere. The stack is mostly free game.
When you are writing a kernel, the red zone is off; I may have worded this wrong previously. Userspace should use the redzone, the kernel can't, and interrupts must not assume a redzone since they don't have the time to add to or subtract from RSP.
>wouldn't that also mean any function can assume the 128 bytes below RSP are actually mapped?
The stack is always sort of mapped and not at all. Normally the stack region is unmapped in the page table. Should a process run the stack into an unmapped region, the OS will generally map this region if memory is available. And because programmers are silly, the OS will also happily map memory way above the last mapped page if accessed.
Go setting up a 104 byte stack only means that the runtime assumes the vDSO will not use more than 104 bytes of stack, and another goroutine's stack may be allocated just after those 104 bytes. So if the vDSO does its stack probe 4k into a foreign stack just under the current thread's stack, it will cause a race condition on the memory write.
The stack probe is there to prevent overflowing the stack into the heap, there is always a guard page between user memory and user stack which will kill the task if it is accessed. So the probe will jump into this guard page and kill the program if any funny business is going on.
But shouldn't Go really assume that the vDSO (or any other external function) will use at least 128 bytes of stack then? If go only reserves 104 bytes, then even without all this 4k probe business, there would be a risk that the vDSO could overwrite up to 128 bytes (which could be another go-routine's stack, or an unmapped page (which would probably cause a segfault if Go's trap handler for segfaults doesn't expect an access there, instead of the OS or the trap handler silently mapping more memory))? I mean, isn't the assumption flawed from the beginning by reserving less than 128 bytes?
Also, regarding:
> Interrupts may use the kernelstack, you can use the userstack but this is trouble if they don't uphold the redzone, which they may not.
I don't see how this is possible without corrupting the redzone. The CPU pushes the return RIP and the flags onto the stack at RSP when an interrupt occurs, which would overwrite parts of the redzone, so the damage would be done on the userspace stack even before the first instruction of the interrupt handler is executed.
> I mean, isn't the assumption flawed from the beginning by reserving less than 128 bytes?
Not technically since the vDSO does not use more than 104 bytes of stack. The problem was the compiler assuming wrongly there might be more to have and writing data above this limit.
The actual stack usage of a vDSO is not documented afaik.
My best guess would be to prevent the 'useless instruction' from being optimized out, but an or with 0 is still useless, and I don't see what optimizer lies between this GCC feature and the final binary.
Why is it shorter? Both MOV and OR have one byte encodings, and with the OR you either have to use an immediate zero (which burns a byte) or materialize zero in some other way. As that email points out, the entire sequence would be shorter using a different addressing mode anyways. And a read-modify-write is definitely slower at runtime.
I wonder if it's because it's safer in that it doesn't change anything there if you've gone over the stack limit and into the heap? I know that -fstack-check was designed a long time ago, possibly before guard pages and before 64-bit addressing.
> This code is so wrong I don't even no where to start. ... I suppose we could try to make the kernel fail to build at all on a broken configuration like this.
Very good point. Key phrase is "an offset > 4096 is just bogus. That's big enough to skip right over the guard page" - if allocating more than a page-size-worth (4K in most cases), it should try to write to it before returning.
It isn't. -fstack-check consistently checks the stack at a 4K offset. It wasn't designed for Stack Clash protection, but in practice, it works as long as all the code was compiled with it (which on Gentoo Hardened, it would be, at least for typical C/C++ apps). The whole stack is still being probed (except perhaps the first page), it's just being probed ahead of actually being used. The caller was responsible for probing the space up to that 4K offset. It's counterintuitive, but in practice it works as an effective mitigation (and it exists already, so Gentoo might as well use it until real Stack Clash specific protection lands in GCC).
Edit: to give an example: if a function uses 6K of stack space, then -fstack-check will probe pages from [SP+4K] to [SP+10K]. [SP] through [SP+4K] are assumed already probed by the call chain, and the [SP+4K] through [SP+6K] region is being probed explicitly. In the end, nothing gets skipped over.
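The same example as a quick sketch, in case the page arithmetic is easier to follow in code. This is a simplified model, not what GCC literally emits: offsets are measured in the direction the stack grows (matching the [SP+4K] notation above), SP is treated as page-aligned, and the function names are made up.

```python
PAGE = 4096
PROBE_OFFSET = 4096  # assumption from the comment above: probing starts 4K ahead

def probed_range(frame_size):
    """(start, end) offsets from the entry SP covered by this function's probes."""
    return PROBE_OFFSET, PROBE_OFFSET + frame_size

def pages(start, end):
    """Page indices (offset // PAGE) overlapping the offsets [start, end)."""
    return set(range(start // PAGE, (end - 1) // PAGE + 1))

frame = 6 * 1024
start, end = probed_range(frame)

used = pages(0, frame)                      # pages the 6K frame occupies
probed_here = pages(start, end)             # pages this function probes
probed_by_callers = pages(0, PROBE_OFFSET)  # assumed probed up the call chain

print(start, end)                               # 4096 10240
print(used <= probed_by_callers | probed_here)  # True
```

Every page the frame occupies is either probed by this function or was already probed by a caller, which is the "nothing gets skipped over" claim, provided the callers really did probe.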
This only works if literally every function has a probe inserted, which is a really silly dependency and is slow. With a more sensible design, functions with small stack frames don't need probes.
It's also not at all clear that this design works if signals are involved unless there's explicit runtime support.
Sure. -fstack-check wasn't designed for stack clash protection. It's old and it makes little sense in context, but it works and it exists, which is why it's being used until proper Stack Clash prevention lands in GCC 8.
I don't think signals change anything. You can think of them as just function calls that skip over the redzone, then keep probing. The redzone is < 1 page so you should still wind up touching every page.
Oh, I was imagining that referred to a situation where you have a stack (say 8K to keep it aligned), then a 4K guard page (which faults when you grow past it), and then someone probes past it, at 16K. Which may be mapped to all kinds of other things, no?
Sure, but the thing is something would've already probed into the 4K guard page before you had a chance to probe past it. If your stack guard page is the next thing up on the stack and you call that vDSO function, then sure, it'll probe past the guard page. But if all your code is built with -fstack-check then that guarantees you should have already crashed by having some prior function in the call stack probe into that guard page.
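Here is a toy simulation of that argument, under loud assumptions: a downward-growing stack, one 4K guard page directly below it, and each probing function touching one address per page out to 4K past its own frame. The real instruction sequences differ, `run_chain` and the addresses are invented for illustration, and non-probing frames are not checked against the guard page at all.

```python
PAGE = 4096
PROBE_OFFSET = 4096  # how far past its own frame each probing function reaches

def probe_addrs(sp, frame):
    """Addresses one function probes on entry (simplified model)."""
    far = PROBE_OFFSET + frame
    offsets = list(range(PROBE_OFFSET + PAGE, far, PAGE)) + [far]
    return [sp - off for off in offsets]

def run_chain(frames, stack_top, guard_hi):
    """Walk a call chain of (frame_size, probes) pairs.

    Returns ("guard", depth) if a probe faults in the guard page,
    ("skipped", depth) if a probe lands below the guard page instead,
    or ("ok", sp) if the whole chain runs.
    """
    guard_lo = guard_hi - PAGE
    sp = stack_top
    for depth, (frame, probes) in enumerate(frames):
        if probes:
            for addr in probe_addrs(sp, frame):
                if guard_lo <= addr < guard_hi:
                    return ("guard", depth)
                if addr < guard_lo:
                    return ("skipped", depth)
        sp -= frame
    return ("ok", sp)

TOP = 0x100000               # hypothetical top of a 16K stack
GUARD_HI = TOP - 4 * PAGE    # guard page sits right below the stack

# Every function probes: the chain faults in the guard page.
print(run_chain([(1024, True)] * 20, TOP, GUARD_HI))               # ('guard', 12)

# Callers never probed, then one probing callee near the stack's end.
print(run_chain([(4 * PAGE - 16, False), (32, True)], TOP, GUARD_HI))  # ('skipped', 1)
```

In the first run consecutive probes are never more than a page apart, so one of them has to land in the guard page. In the second, the lone probe at an offset over 4096 jumps straight past it, which is the vDSO situation.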