
In my experience with outages, the problem usually lies in some human error, in not following the process: someone didn't do something, checks weren't performed, code reviews were skipped, someone got lazy.

In this post mortem there are a lot of words, but not one of them actually explains what the problem was, which is: what was the process in place, and why did it fail?

They also mention a "bug in the content validation". What kind of bug, exactly? Could it have been prevented with proper testing or code review?




> In my experience with outages, usually the problem lies in some human error not following the process

Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

> what kind of bug? Could it have been prevented with proper testing or code review?

It doesn't matter what the exact details of the bug are. A validator and the thing it is meant to protect being imperfectly matched is a failure mode in itself. They happened to trip that failure mode spectacularly.
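
To illustrate (a hypothetical sketch, nothing to do with the actual sensor code): the validator and the consumer implement slightly different notions of what a well-formed channel file is, so a file can pass validation and still crash the consumer.

    def validate(channel_file: str) -> bool:
        # The validator only checks that every line has at least two fields.
        return all(len(line.split(",")) >= 2 for line in channel_file.splitlines())

    def apply(channel_file: str) -> None:
        # The consumer assumes a third field exists: a stricter contract
        # that the validator never checks.
        for line in channel_file.splitlines():
            fields = line.split(",")
            print("rule:", fields[0], "action:", fields[2])

    content = "rule-1,block,quarantine\nrule-2,allow"
    assert validate(content)           # passes validation...
    try:
        apply(content)
    except IndexError as err:
        print("sensor crashed:", err)  # ...and still crashes the consumer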

Also, saying "proper testing and code review" in a post-mortem is useless maybe 95% of the time. Short of a culture of rubber-stamping and yolo-merging, where there genuinely is something to fix, it's a truism that any bug could have been detected by a test or caught by a diligent reviewer in code review. But it could also have been (and was) missed. "Git gud" is not an incident-prevention strategy; it's wishful thinking, or blaming whichever devs were unlucky enough to break things.

More useful follow-ups are along the lines of "this type of failure mode feels very dangerous; we can do something to make such failures impossible, or much more likely to be caught".
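
For instance, a property-based test (here with the Hypothesis library, over the hypothetical validate()/apply() pair sketched above) asserting that anything the validator accepts, the consumer can process without crashing:

    from hypothesis import given, strategies as st

    @given(st.text())
    def test_validated_input_never_crashes_consumer(channel_file):
        # Property: anything the validator accepts, the consumer must
        # process without raising.
        if validate(channel_file):
            apply(channel_file)

    # Calling the test makes Hypothesis search for a counterexample; here it
    # would quickly find something like "a,b" and surface the
    # validator/consumer mismatch before it ships.
    test_validated_input_never_crashes_consumer()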


> Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

You can't reliably fix problems you don't understand.


> ...what was the process in place and why did it fail?

It appears the process was:

1. Channel files are considered trusted, so there is no need to sanity-check inputs in the sensor, and no need to fuzz the sensor itself to make sure it handles corrupted channel files gracefully.

2. Channel files are trusted if they pass a Content Validator. No additional testing is needed; in particular, the channel files don't even need to be smoke-tested on a real system.

3. A Content Validator is considered 100% effective if it has been run on three previous batches of channel files without incident.

Now it's possible that there were prescribed steps in the process which were not followed; but such lapses too are to be expected if there is no automation in place. A proper process requires some sort of explicit override to skip any part of it.
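
As a purely illustrative sketch of point 1 (not the vendor's code, and the JSON format is made up): treat every channel file as untrusted input, and fall back to a last-known-good configuration instead of crashing.

    import json

    LAST_KNOWN_GOOD = {"rules": []}   # hypothetical fallback configuration

    def load_channel_file(raw: bytes) -> dict:
        try:
            config = json.loads(raw)
            # Reject structurally unexpected content instead of trusting it.
            if not isinstance(config, dict) or "rules" not in config:
                raise ValueError("missing 'rules' section")
            return config
        except (ValueError, UnicodeDecodeError):
            # Corrupt or unparseable update: keep running on the previous config.
            return LAST_KNOWN_GOOD

    print(load_channel_file(b'{"rules": ["rule-1"]}'))   # uses the new file
    print(load_channel_file(b"garbled \x00 bytes"))      # falls back gracefully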





