This is along the lines of how one of the wireless telecom products I really liked worked.
Each running process had a backup on another blade in the chassis. All internal state was replicated. And the process was written in a crash only fashion, anything unexpected happened and the process would just minicore and exit.
One day I think I noticed that we had over a hundred thousand crashes in the previous 24 hours, but no one complained and we just sent over the minicores to the devs and got them fixed. In theory some users would be impacted that were triggering the crashes, their devices might have a glitch and need to re-associate with the network, but the crashes caused no widespread impacts in that case.
To this day I'm a fan of crash only software as a philosophy, even though I haven't had the opportunity to implement it in the software I work on.
Each running process had a backup on another blade in the chassis. All internal state was replicated. And the process was written in a crash only fashion, anything unexpected happened and the process would just minicore and exit.
One day I think I noticed that we had over a hundred thousand crashes in the previous 24 hours, but no one complained and we just sent over the minicores to the devs and got them fixed. In theory some users would be impacted that were triggering the crashes, their devices might have a glitch and need to re-associate with the network, but the crashes caused no widespread impacts in that case.
To this day I'm a fan of crash only software as a philosophy, even though I haven't had the opportunity to implement it in the software I work on.