This is a valid point. If booting just hangs after the bootloader but before the...

xyzzy_plugh · on March 1, 2017

Hey! Love the product, but I'm out of the embedded game. Thought I'd give my $0.02:

The first step of our boot process was to enable the watchdog. We extend the timeout periodically during the boot process, but generally if userspace isn't reached within 30 seconds or so we reset. Once in userspace, the daemon validates that things look good (this includes things beyond just application of the update -- did services start up correctly? Is the hardware operating as we expect?) before disabling the watchdog and marking the update as a success, at which point rollback isn't possible. At this point we might consider applying new updates, etc.

We also modified our first stage bootloader to be resilient to bootloader update issues, and chainloaded our second stage bootloader from a stub which could rollback.

We also niced the update process to avoid resource contention, allowed the updates to be delayed until the network was quiet, and paused them when it became noisy to make for a good user experience. There was a server-side flag to force updates to apply regardless, with higher priority, as well as one to basically disable all other functionality in the case of a unforeseen serious, perhaps security related, issue.

We actually had a discrete watchdog service which was responsible for petting an always-on watchdog, to rescue the system if it locked up or became unresponsive (if certain processes were not running, or responding, the watchdog would not be pet).

All of this led to effectively 0 failures in the field, a seamless user experience (except for the 30-second reboot when inactive). I wish everything I owned worked this way.

I could talk ad nauseum about this stuff. It's very cool to see the designs of others. I feel this is an under appreciated and under explored problem space.