I see this a lot with people who are experts in real-time operating system environments, particularly in aviation/space stuff (maybe because that’s where I worked for a while).
They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
Anything involving transaction logs, rollbacks, and plain old backups takes a back seat to live hardware-redundant environments. “It’s OK though because we follow the NASA software development process which has a rigorous set of validation steps that prevent bugs.”
> They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.
I always feel like making single components redundant is a fairly well-defined process -- generally speaking, the mechanisms are the same (1+ redundant components, failover, STONITH, etc), where making things resilient on a higher level is not as well-defined, and often requires bespoke solutions to each unique situation.
BFT state machine replication is well-understood and well-defined: use N of M agreement for inputs and run them through a deterministic state machine. Optionally, do N of M signature of outputs.
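To make the "N of M agreement, then deterministic state machine" shape concrete, here is a toy sketch. Everything in it (`agree`, `CounterMachine`, `run_round`, the command format) is made up for illustration. It skips everything that makes real BFT hard: message authentication, ordering, view changes, and the usual requirement of M >= 3f+1 replicas with N = 2f+1 to tolerate f Byzantine faults. It only shows the vote-then-apply step.

```python
from collections import Counter

def agree(proposals, n):
    """Return the value proposed by at least n replicas, or None if no quorum."""
    if not proposals:
        return None
    value, count = Counter(proposals).most_common(1)[0]
    return value if count >= n else None

class CounterMachine:
    """Deterministic: the same commands in the same order give the same state."""
    def __init__(self):
        self.state = 0

    def apply(self, command):
        op, arg = command
        if op == "add":
            self.state += arg
        elif op == "set":
            self.state = arg
        else:
            raise ValueError(f"unknown op: {op!r}")
        return self.state

def run_round(machine, proposals, n):
    """Apply a command only if at least n of the proposals match exactly."""
    command = agree(proposals, n)
    if command is None:
        return None  # no quorum: the input is dropped, state is untouched
    return machine.apply(command)
```

The same `agree` function could be pointed at the replicas' outputs for the optional N of M output step, assuming outputs are hashable values.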
OTOH what are the properties of failover? "Failover" seems like an attempt to cheat on the Byzantine generals' problem: generals send mail and then confirm results in a Zoom call. But what if Zoom doesn't work? What are the assumptions for 1+ redundant components/failover/STONITH?
Formally verifying the absence of such fatal bugs would let a real-time system get by without relying on backups, though you would of course still want logs/rollbacks for (human) input actions to cope with mistakes.
It's just that production software essentially never uses formal verification to a sufficient extent.
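For what "logs/rollbacks for input actions" can look like at its very simplest, here is a hedged sketch of an undo log over a key-value store. `LoggedStore` and its methods are hypothetical names, and a real system would persist the log and handle concurrency, which this does not.

```python
class LoggedStore:
    """In-memory key-value store that records enough to undo each write."""
    def __init__(self):
        self.data = {}
        self.log = []  # entries: (key, key_existed_before, previous_value)

    def set(self, key, value):
        # Record the prior state before overwriting it.
        self.log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def rollback(self, steps=1):
        """Undo the most recent writes, newest first."""
        for _ in range(min(steps, len(self.log))):
            key, existed, old = self.log.pop()
            if existed:
                self.data[key] = old
            else:
                self.data.pop(key, None)
```

The point is that an operator's mistaken input (say, a mistyped setpoint) can be reversed, which hardware redundancy alone does not give you: every redundant replica would faithfully apply the same wrong value.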