I see resolutions for protecting against the now-known error modes discussed, and better alerting to get the on-call engineer (aka always Zeke :D) looking into things quicker, but I'm curious how they might approach preparing for the "unknown unknowns" that will come up in the future.
Are there good ways for a small team to proactively stress test a system without mucking things up for customers? Open question.