
Could abstract interpretation (http://www.astree.ens.fr/) or some other formal method have prevented it?



Someone wrote something like:

  int32_t ticks; // 100ths of a second
which overflows in 248 days, a particularly unfortunate amount of time because it doesn't show up during testing.
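
A quick back-of-the-envelope check of that 248-day figure (a minimal sketch, assuming 100 ticks per second and a signed 32-bit counter):

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    // INT32_MAX ticks at 100 Hz, converted to days
    double seconds = (double)INT32_MAX / 100.0;
    printf("int32_t overflows after %.1f days\n", seconds / 86400.0);  // ~248.6 days
    return 0;
  }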

Although it would be a good engineering choice, a formal verifier would say that:

  int64_t ticks; // 100ths of a second
is also incorrect, since it also overflows (after 10^9 years).

In a hard real time system,

  mpz_t ticks; // 100ths of a second, infinite precision libgmp type
is still formally incorrect, since as the number grows in size it will eventually exceed a time limit or run out of memory (after 10^10^9 years).

The overall lesson from formal methods is that it's impossible to write formally correct useful programs. So programmers just muddle through.


As you have written, a formal test/analysis will always detect that a monotonically increasing tick counter is not bounded by an upper limit. The obvious solution is not to rely on such a thing, but to define your API such that the (preferably unsigned, though it doesn't really matter) tick counter rolls over in a well-defined way.

If the algorithms really depend on an always monotonically increasing tick counter (which I doubt), the solution is quite easy: after 2^30 ticks, set a flag which raises the "service needed" light in the cockpit, taking the plane out of service until it's power cycled. By this you explicitly state that your device cannot be used for longer than about 120 days continuously.
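
A minimal sketch of that latch (the tick-handler hook and names here are made up for illustration, assuming 100 Hz ticks):

  #include <stdbool.h>
  #include <stdint.h>

  #define SERVICE_LIMIT_TICKS (UINT32_C(1) << 30)  // ~124 days at 100 Hz

  static uint32_t ticks;
  static bool service_needed;  // latched until the unit is power cycled

  void on_tick(void)  // hypothetical 100 Hz timer callback
  {
    ticks++;
    if (ticks >= SERVICE_LIMIT_TICKS) {
      service_needed = true;  // raise the "service needed" indication
    }
  }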


Agree with the first paragraph, but in the second I don't see how requiring a periodic reboot is a solution. Your "service needed" light is a "Case closed, WON'T FIX" message made real.


Airplanes already have an elaborate schedule for mandatory periodic service. Pressing a "reset" button once every 30 days is pretty trivial compared to dismantling the entire engine every couple of years.

What made this bug dangerous is that nobody knew about it, that's the main problem that needs to be solved.


Assume there's a complicated control algorithm which, unfortunately, does arithmetic on (now - then) tick values everywhere... but this algorithm has been extensively validated to be correct in the non-overflowing case, and it will take a while to find out how it handles discontinuities in its time scale.

Then the simple "raise service-needed signal" would be a valid workaround and an easily testable local change for the next two years, until the properly fixed algorithm goes through testing and certification.


A general solution to the overflowing-clock problem is to deal with the clock in modular arithmetic. When wanting to know if t2 comes before t1, check the MSB of the modular difference.

  uint32_t t1 = ...;
  uint32_t t2 = ...;
  if ((uint32_t)(t2 - t1) >= UINT32_C(0x80000000)) {
    // t2 is before t1
  } else {
    // t2 is after or equal to t1
  }
What this gives us is that if the difference of the actual times (not these uint32 representations, which are ambiguous modulo 2^32) is less than 2^31 units (plus or minus one, maybe), this check will give the expected result. This does allow a correct system that never fails, provided the timing/duration of the events is suitably limited.

For example, you can time events at a fixed interval, and it will keep going forever in spite of clock roll-over.

  uint32_t next_time = now();
  while (1) {
    // busy-wait while now() is still "before" next_time (modular comparison)
    while ((uint32_t)(now() - next_time) >= UINT32_C(0x80000000));
    printf("Shoot\n");
    next_time += interval;  // wraps along with the clock, which is fine
  }
The timing events also need to be processed quickly enough of course (that printf shouldn't block for longer than about 2^31 ticks).


Using this technique to compare the current time to engine start time would cause the exact problem described.


It would. Which would fail the condition that I mentioned "if the timing/duration of the events is suitably limited". So you should just not do what you suggest :)


Isn't the actual problem using raw integers to represent time, instead of a proper date/time data type and supporting (tested) library functions?


I disagree. Space usage only grows logarithmically with the time range. A few tens of bytes is enough to store milliseconds until the heat death of the universe.
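
Rough arithmetic behind that (assuming the commonly quoted heat-death timescale of roughly 10^100 years):

  10^100 years ≈ 3.2×10^107 s ≈ 3.2×10^110 ms
  log2(3.2×10^110) ≈ 367 bits ≈ 46 bytes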


Just drag the complexity of bignum arithmetic into a hard-realtime embedded system... What could possibly go wrong?


I was not implying that.

Nothing. Built-in arithmetic will do the job nicely.


> int64_t ticks; // 100ths of a second

I would go with uint64_t

as it documents "ticks" as a variable that cannot hold negative values, and also doubles its range of positive values.


I prefer signed time types, because you frequently subtract them. If you use unsigned, then you have to cast the result to signed every time:

  int64_t elapsed = (int64_t)(t1 - t0);
And it's very easy to cause disaster with:

  if ((t1 - t0) > 50) ...
which also succeeds if t1<t0.
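
A minimal sketch of that failure mode, with made-up values (the cast behavior assumes a typical two's-complement platform):

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
    uint64_t t0 = 1000, t1 = 990;          // t1 is actually *before* t0
    if ((t1 - t0) > 50)                    // wraps to 2^64 - 10, so this is true
      printf("looks like more than 50 ticks elapsed -- wrong\n");
    int64_t elapsed = (int64_t)(t1 - t0);  // cast recovers -10 on two's-complement targets
    printf("elapsed = %lld\n", (long long)elapsed);
    return 0;
  }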

While it's theoretically possible to use all 64 bits of range, doing so is tricky and very hard to test.


The "doubles its range of positive values" is weak argument because you should never be reaching values more than a few decades, never mind 10-100x the age of the universe. Such a state is a bug.

The "can not hold negative values" argument is also weak because a uint does not prevent generating negative values - it only prevents you from knowing that you've generated negative values. Such a state is a bug.

Using a uint only serves to make it harder to test when your system is in an invalid state.


I would go with

uint64_t ticksOfDuration10ms; // No comment necessary


The concept of timer "ticks" is well established as a unit of time in embedded programming. It's almost universally included in your embedded (realtime) OS and might increase at any conceivable rate, limited either by hardware constraints (e.g. a fixed, simple 16-bit ripple counter clocked by an 8 MHz main CPU clock will overflow at 122.07 Hz) or by your application requirements (you let a slightly more configurable timer count only to 40000 at half the CPU clock to get exactly 100 Hz). Hence you shouldn't explicitly inscribe the tick rate in your symbol name, as it can change when requirements change.

You'll almost always have a global variable, preprocessor define... or something similar to get the frequency (or time increase per tick), which you should use whenever you have to convert "ticks" to actual physical units. If the actual effective tick rate is visible in many places in your code, either as a symbol name or as a comment, you are most certainly doing something wrong.
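
Something along these lines, i.e. one place that defines the rate, plus helpers that use it (names are made up for illustration):

  #include <stdint.h>

  #define TICK_HZ 100u  /* the single place where the tick rate lives */

  /* Convert between ticks and milliseconds only through TICK_HZ. */
  static inline uint64_t ms_to_ticks(uint64_t ms)    { return ms * TICK_HZ / 1000u; }
  static inline uint64_t ticks_to_ms(uint64_t ticks) { return ticks * 1000u / TICK_HZ; }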


I think you kind of missed the point of my post (which was a bit tongue-in-cheek). The original code fragment had the tick duration embedded in a comment, so changing a global variable which defines it as something other than 10 ms is going to cause all sorts of problems in maintaining that code (possibly leading to the very problem Boeing had).


...well, then my irony-detector is broken ;-).


Why not uint64_t thisVariableIncrementsByOneEvery10ms?


uint64_t thisVariableIncrementsByOneEvery10msSoItWontOverflowForAReallyLongTime

(then you'd know it was safe)


Another good practice is to initialize the time counters to something close to the overflow point, rather than zero. This encourages overflow bugs to show up at timescales where they will be noticed during testing, rather than after 248 days of service.
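
A minimal sketch of that practice (the margin and 100 Hz rate are arbitrary choices for illustration):

  #include <stdint.h>

  /* Start ~5 minutes before wrap (5 * 60 * 100 ticks at 100 Hz) so that
     roll-over bugs surface within minutes of boot during testing. */
  static uint32_t ticks = UINT32_MAX - (5u * 60u * 100u) + 1u;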

This is a scary-ass bug in a codebase that was supposed to be authored to strict professional standards.


The Linux kernel does that. The 'jiffies' counter's initial value is chosen so that it rolls over 5 minutes after boot:

http://lxr.free-electrons.com/source/include/linux/jiffies.h...
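
For reference, the initializer in that header looks roughly like this (chosen so the counter wraps 300*HZ ticks, i.e. five minutes, after boot):

  #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))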


http://embedded.fm/episodes/2015/4/28/99-you-can-say-a-boat

but I would suspect Boeing used something better than C.



I honestly can't decide whether this is serious, satire, or conspiracy theory, but it's awesome nonetheless. My first project working on the 787 was converting an Ada codebase to C.


Isn't Ada precisely suited for this sort of application? What was the motivation for switching to C?


> What was the motivation for switching to C?

Invariably, cost. SPARK Ada is demonstrably superior to C for safety-critical development (I can't cite the sources for this, but a major company developing safety-critical software has shown this to be the case).

But, SPARK Ada requires a lot of highly skilled manpower and it's slow to develop. C gets the job done, albeit with lots of bugs.


If the industry is unwilling to invest in the training or tooling for a safe language like SPARK Ada, is there research into "easier" safe languages, something between C and Ada? Or do companies like Boeing still expect to be writing new safety-critical avionics software in C in 2030 or 2040?


Yes, Real-Time Java being an example.

Realistically, it seems to me that avionics etc. will be written in C for a very long time to come. It all comes down to the cost and availability of programmers.


C is incredibly portable and tons of programmers know it. Ada is great in a lot of ways but just never got the traction it needed to be #1.


Sure, but shouldn't safety be the number one concern here? Programmers can always be trained to learn it, as long as they demonstrate competence. It seems like an unfortunate case of cutting costs at the expense of safety.



