Stack-checking a program... that will execute in orbit

mstevens · on Sept 19, 2011

These arguments never quite make sense to me.

The reasoning appears to be "since technique X can cause problems in some circumstances, it can never be used in software for domain Y (usually avionics, space, or similar)".

However, obviously every technique could cause problems in some circumstances. Therefore I conclude it's impossible to ship software into space.

Since software clearly is shipped into space, it seems likely there is some problem with the above argument. My suspect is the "could cause problems? Can't use it!" line of reasoning is oversimplified for the purposes of explanation, and hopefully something better justified is actually occurring.

Someone · on Sept 19, 2011

I would think the reasoning would be "pick the low-hanging fruit to improve reliability".

Checking a programmer's claim that a program will not run out of memory is difficult (sometimes impossible) if you allow heap allocations and arbitrary stack usage. You could allow either, if the programmer supplies a proof that things are OK. But with such a proof in hand, writing the thong so that it can be mechanically checked is "just work" (tedious work, possibly, but for this application, that is not that important). Because of that, why risk that such a proof is flawed (or becomes flawed after a bug fix), if one can just as well prevent the problem?

Of course, the checking code will be buggy, too. That is OK; it will be written by a separate team, and will improve over time, just as the code it checks does.

I guess that, even with these checks, the program will still have some watchdog timer, a way to remotely reset it, and a way to upload new software, just in case it runs wild.

ttsiodras · on Sept 20, 2011

> I guess that, even with these checks, the program will still have some watchdog timer, a way to remotely reset it, and a way to upload new software, just in case it runs wild.

Indeed.

As for the reasoning: couldn't have said it better myself.

jmaygarden · on Sept 19, 2011

I would note that determining maximum stack usage is similar to the halting problem in some cases. One must carefully restrict how the program is structured for this analysis to be valid. As is stated, any sort of recursion or mutually dependent functions are out, and the use of function pointers could also cause issues.

This approach provides a good baseline. However, I'd highly recommend empirical testing as well. For example, create a sufficiently large stack, initialize the stack with a known bit pattern and then check the high water mark after fully exercising the program.

eru · on Sept 19, 2011

I agree with your first two sentences.

Why should recursion be out? You just need to be careful with it. Not having an infinite recursion is basically the same problem as not having an infinite loop.

wlievens · on Sept 20, 2011

If you want to prove the maximum stack usage of a recursive evaluation, you need to prove how deep it goes. Since that would depend on the inputs, you need to statically determine the limits on the inputs. It's all possible, but cases where it applies would be rare.

rgoddard · on Sept 19, 2011

Since with recursion you can have a large but finite number of calls that could blow the stack, but not have the same problem with a for loop.

eru · on Sept 19, 2011

Yes, you have to account for that. Also, tail recursion should be fine, if your compiler is smart enough. (I guess since the program is looking at assembly, it couldn't tell optimized tail recursion from a loop.)

wlievens · on Sept 20, 2011

I worked on a compiler that did not support recursion, dynamic memory allocation or dynamic function calls (function pointers). We could statically guarantee the stack depth and thus the total memory usage of the program.

You'll probably be wondering "what use could that crippled compiler probably have?". Well, most of the people here are carrying code emitted by it in their pocket.

jimrandomh · on Sept 19, 2011

Static analysis to prove that code doesn't run out of memory is cool. However, completely refusing to use heap allocation, because of the possibility that some heap-using code will run out of memory, is completely insane. The ridiculous things you'll have to do with data structures to avoid using the heap will introduce more bugs than this prevents.

ttsiodras · on Sept 20, 2011

You can try convincing space companies that they are in error; good luck with that :-)

ajross · on Sept 19, 2011

Not to burst any bubbles, but the static analysis technique shown is capable of detecting recursion only in static call chains. It doesn't seem to even have any handling (even to say "warning! unhandled call!") for something like function pointers. Feed this a C++ program with recursive virtual functions and it will, apparently, return "OK!"

ttsiodras · on Sept 19, 2011

Actually, function pointers are... not allowed in space SW.

And remember, this is just a hack I did - I wanted to show people the things one can do with objdump. From the article:

"This kind of functionality is usually part of far more mature tools than... quickly-hacked Python scripts. Don't use this to send your spaceship in orbit :-)"

ajross · on Sept 19, 2011

Still misses the point. A quickly hacked script can be really useful if it shows real bugs (where a recursive call in this case is a "bug"). There's nothing wrong with the development methodology: who cares how ugly the code is that found the bug as long as it's fixed?

The problem is that the mathematical problem being attacked is undecidable. This script only approximates a solution in the case where function pointers don't exist. Even that would be OK if you knew about the limitation and the script flagged indirect jumps as errors too. It looked like you didn't realize that though, so I thought it had to be mentioned.

eru · on Sept 19, 2011

You could send in a patch.

ajross · on Sept 19, 2011

... that solves the halting problem?

wlievens · on Sept 20, 2011

You're using the Turing Tarpit here. You do not need to "solve the Halting Problem" in order to make useful static analysis software.

When the software can prove X, return True.

When the software can prove !X, return False.

When the software can't do either, return Sorry.

eru · on Sept 19, 2011

If you can. Honestly I meant something along the lines of "Even that would be OK if you knew about the limitation and the script flagged indirect jumps as errors too."

zwieback · on Sept 19, 2011

I generally also trace stack usage by filling the stack with a fixed pattern and watching it in my trace probe.

wlievens · on Sept 20, 2011

Yes, but that's a form of dynamic analysis. To prove its applicability, you need to run it for every possible input (or a reduction that comes down the same thing).

sausagefeet · on Sept 19, 2011

Writing code to execute in orbit seems interesting. Do you have any further suggested reading?

ttsiodras · on Sept 20, 2011

MISRA-C, for starters.

Developing safety-critical code is governed by extensive guidelines (like DO178-B) which dictate lots of things, not just about code (process of development, testing, etc)

zandorg · on Sept 19, 2011

I just rewrite the stack function to use heap (eg, a queue) instead of stack.

ajross · on Sept 19, 2011

The application in question disallows heap usage. In "no runtime errors allowed" environment like this, everything is pre-allocated. This is easy for data structures (though it gets hairy once the usages start overlapping). This including the stack, which is hard to prove bounds for with traditional tools. So the author wrote some scripts to parse the disassembly.

icefox · on Sept 19, 2011

What about having the heap allocation actually use the stack?

stygianguest · on Sept 19, 2011

In safety critical applications everything is very conservative by necessity: they are complicated systems consisting of many parts, ensuring correctness requires a rigorous workflow and complete toolchain. They involve not just multiple people but whole industries (think automotive).

Most innovations fail because they do not integrate in the existing chain. Radical change can be accepted if, and only if it has extreme potential (orders of magnitude), because the existing methods are continously improved by small steps as well.

rcfox · on Sept 19, 2011

That kind of defeats the purpose. Heap allocation is meant for objects that persist for the long-term. Stack allocation is meant for short-term storage, and anything that is allocated on the stack is lost when the function returns.

apaprocki · on Sept 19, 2011

You can always allocate stack space (static or dynamic with alloca()) in main() before kicking off the application.

ajross · on Sept 19, 2011

In which case it's isomorphic to a static memory block (albeit one with non-zero initialization costs), so what would the point be?

icefox · on Sept 20, 2011

Well the whole article seemed silly after he said this:

"First of all, you can't use heap - not when you're in orbit. Why? Because quite simply, if you do use heap, and malloc returns null... what do you do? You can't ask the user for help"

This is the OOM problem and is highly dependent upon the kernel you are using. Not bringing the kernel into the discussion means that the discussion is just silly. There is not much of a point to putting the heap in the stack as you said, but hey like the author we are just getting some ideas out there without bringing up the kernel.

ttsiodras · on Sept 20, 2011

I think you misinterpreted the article: the target is not Linux/x86, its a custom board with the SPARC/Leon processor running a custom embedded OS (RTEMS), launched in space :-)

The article is just an example of the notions used, and how objdump can be used to enforce them. Linux is used simply for demonstration purposes.

In your average PC with Linux/x86, it is clear that you fear neither stack nor heap error - you just restart the app :-)

icefox · on Sept 21, 2011

I think you misinterpreted my comment. I didn't mention Linux anywhere in it, but specifically talk about how when dealing with OOM you must first discuss what the kernel provides. On RTEMS can malloc actually return 0 or does it never return 0 and there is a signal or some other event that is sent out from the kernel? These are the most important bits of information in this problem.

jmaygarden · on Sept 19, 2011

That sounds very inefficient given that every processor of which I am familiar has built-in support for a stack. Could you elaborate a bit more on your approach?

Someone · on Sept 19, 2011

For the curious: typically, pure RISC CPUs will have 'support for a stack' in the sense that it is not too hard to implement Algol-like languages on them, but not a stack proper. For example, neither PowerPC nor MIPS have designated stack pointer registers. As a consequence, they do not have a 'return from subroutine' instruction, they do not have pop or push, etc. ABIs will define a stack pointer, but if you control the hardware, you can use any register you want, none at all, have separate data and return address stacks, etc.

Also, for a truly stackless CPU, look at system 370 (and, I guess, most other machines predating Algol)