I see this a lot with people who are experts in real-time operating systems environm...

stingraycharles · on Jan 12, 2023

> They have excellent intuition around making things redundant to single pieces of hardware failing but don’t really grok making stuff resilient to wider failures.

I always feel like making single components redundant is a fairly well-defined process -- generally speaking, the mechanisms are the same (1+ redundant components, failover, STONITH, etc), whereas making things resilient at a higher level is not as well-defined, and often requires bespoke solutions for each unique situation.
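To make the component-level mechanism concrete, here is a minimal sketch of heartbeat-based failover with fencing. Everything in it is hypothetical and for illustration only: `Node`, `fence`, and `check_failover` are made-up names, the timeout is arbitrary, and `fence` just flips a flag where a real STONITH device would cut power (and could itself fail).

```python
class Node:
    """Hypothetical cluster member; the fields are illustrative only."""

    def __init__(self, name, role="standby"):
        self.name = name
        self.role = role
        self.powered = True
        self.last_heartbeat = 0.0


def fence(node):
    """Stand-in for STONITH: power the node off so it cannot keep acting as primary."""
    node.powered = False
    node.role = "fenced"
    return True  # assumption: fencing always succeeds in this sketch


def check_failover(primary, standby, now, timeout):
    """Return the node that should be primary at time `now`."""
    if now - primary.last_heartbeat <= timeout:
        return primary  # heartbeat is recent enough, nothing to do
    if not fence(primary):
        return primary  # never promote while the old primary might still be running
    standby.role = "primary"
    return standby


a = Node("a", role="primary")
b = Node("b")
a.last_heartbeat = 10.0
still_primary = check_failover(a, b, now=12.0, timeout=5.0)  # heartbeat 2s old
new_primary = check_failover(a, b, now=20.0, timeout=5.0)    # heartbeat 10s old
```

The ordering is the point: the standby is promoted only after the silent primary has been fenced, so two nodes never both believe they are primary. The sketch simply assumes the heartbeat channel and the fencing device are reliable.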

killerstorm · on Jan 13, 2023

Hmm?

BFT state machine replication is well-understood and well-defined: use N of M agreement for inputs and run them through a deterministic state machine. Optionally, do N of M signature of outputs.
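As a rough illustration of that recipe, the sketch below shows only the threshold check and the deterministic apply step. It is not a BFT consensus protocol: there is no message exchange, no signatures, and no leader change. `agreed_input`, `Ledger`, and `run_round` are hypothetical names, and the choice of N = 3 of M = 4 (the usual 2f+1 of 3f+1 with f = 1) is an assumption for the example.

```python
from collections import Counter


def agreed_input(proposals, n):
    """Return the value that at least n of the M proposals agree on, else None."""
    if not proposals:
        return None
    value, votes = Counter(proposals).most_common(1)[0]
    return value if votes >= n else None


class Ledger:
    """Deterministic state machine: same inputs in the same order give the same state."""

    def __init__(self):
        self.balance = 0

    def apply(self, command):
        op, amount = command
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw" and amount <= self.balance:
            self.balance -= amount
        return self.balance


def run_round(replicas, proposals, n):
    """Apply the agreed input on every replica, then require n matching outputs."""
    command = agreed_input(proposals, n)
    if command is None:
        return None  # no quorum on the input, so nothing is applied
    outputs = [replica.apply(command) for replica in replicas]
    return agreed_input(outputs, n)


replicas = [Ledger() for _ in range(4)]  # M = 4
# Three replicas propose the same input; one faulty replica proposes something else.
proposals = [("deposit", 10)] * 3 + [("withdraw", 999)]
result = run_round(replicas, proposals, n=3)

# A 2-2 split does not reach N = 3, so the round is rejected and state is unchanged.
split = [("deposit", 5)] * 2 + [("withdraw", 5)] * 2
rejected = run_round(replicas, split, n=3)
```

Here `result` is the agreed output and `rejected` is `None`. The hard part a real protocol solves is getting honest replicas to see the same set of proposals in the first place, which this sketch takes for granted.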

OTOH what are the properties of failover? "Failover" seems like an attempt to cheat on the Byzantine generals' problem: generals send mail and then confirm the results in a Zoom call. But what if Zoom doesn't work? What are the assumptions for 1+ redundant components/failover/STONITH?

namibj · on Jan 12, 2023

Formally verifying the absence of such fatal bugs would let a real-time system avoid screwing up without relying on backups, while of course still keeping logs/rollbacks for (human) input actions to cope with mistakes.

It's just that production software essentially never uses formal verification to a sufficient extent.

ilyt · on Jan 12, 2023

Well, it is hideously expensive to do on bigger pieces of code. And I'd imagine you can still get the spec you verify against wrong.