Chaos Monkey is great. I'm suggesting something slightly different. Imagine you have a bunch of CentOS 6 machines that have been running great for years. Upgrades go smoothly and deployment is a breeze. Now you need to upgrade to CentOS 7. Stuff that used to happen in init.d now happens in systemd, for example.
Reviewing changes by hand every few months keeps you aware of what needs to happen. It's not that the automation gets brittle; it's that people forget where the systems are flexible and where they aren't. Spaced repetition would give people a chance to keep up with the current design, so when that crazy security vulnerability happens, you can jump in and change stuff knowing how it works, rather than having to figure it out on the fly.
I don't think it would take much. Configure a box by hand today, tomorrow, next week, next month, in 3 months, then perhaps every 6 months.
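To make that schedule concrete, here's a rough sketch in Python. The names (`INITIAL_OFFSETS`, `REPEAT_EVERY`, `review_dates`) are made up for illustration, and I'm assuming each interval is counted from the first day and that a month is about 30 days. Treat it as one way to generate reminders, not a recommendation for specific tooling.

```python
from datetime import date, timedelta
from itertools import islice

# Days after the first hands-on build: today, tomorrow, next week,
# next month, three months. Months are approximated as 30 days (assumption).
INITIAL_OFFSETS = [0, 1, 7, 30, 90]

# "Perhaps every 6 months" after that, approximated as 180 days (assumption).
REPEAT_EVERY = 180


def review_dates(start):
    """Yield the dates on which someone should configure a box by hand."""
    for offset in INITIAL_OFFSETS:
        yield start + timedelta(days=offset)
    # After the initial ramp-up, keep repeating at a fixed interval.
    offset = INITIAL_OFFSETS[-1]
    while True:
        offset += REPEAT_EVERY
        yield start + timedelta(days=offset)


if __name__ == "__main__":
    # Print the first seven review dates for a hypothetical start date.
    for d in islice(review_dates(date(2024, 1, 1)), 7):
        print(d.isoformat())
```

The generator never ends on purpose: the point is that the hand-built review keeps happening for as long as the machines exist.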
There are very smart people who sort of intuitively keep up with those kinds of changes. But those people change jobs. How do you get a mere mortal up to speed on 20-30 machine configurations? Config management can configure hundreds of machines a day, no problem. But it's easy to forget what's really going into each of those boxes.
This is more about preserving organizational awareness than about the robustness of a running system.