That one time Keygen went down for 5 hours (twice)

i-dont-remember · 2024-02-22T23:50:28 1708645828

I see resolutions for protecting against the now-known error modes discussed, and better alerting to get the on-call engineer (aka always Zeke :D) looking into things quicker, but I'm curious how they might approach preparing for the "unknown unknowns" that will come in the future.

Are there good ways for a small team to proactively stress test a system without mucking up customers? Open question.