> Does anyone have One Weird Trick™ to fix it? My one weird trick is to have a z...

mceachen · on July 20, 2022

This is the correct answer.

For extra credit: if you have a weekly or monthly team meeting, include an agenda item for the people that were on call so they can debrief the team on what alerts fired and what the resolutions were. As a team, you can then decide which alerts need to be deleted or need adjustments, and if there are additions or edits that need to be made to the runbook.

A big thing to avoid in this whole process is "naming and shaming." The Google SRE book calls this a "blameless postmortem culture," and it helps you avoid perverse incentives for people to hide or obscure latent production issues.

coffeefirst · on July 20, 2022

Yeah. Also, unless you are a genuinely essential application—like air traffic control, a hospital, or a nuclear power plant—you can live with a few hours of downtime.

AWS goes on the fritz for a few days out of every year and breaks half the internet. Your business will be okay.

Tao3300 · on July 20, 2022

99% of the time your shit just isn't that essential. That's the One Weird Trick: don't get suckered into thinking your corporate vision is so important that it can't have an issue wait until morning.

hyperman1 · on July 20, 2022

I tend to agree with you, but found an exception a while ago.

A certain file has to appear before a specific time, or else some people don't get money they deserve and rightfully get very angry.

Except 1 or 2 times per year there is nobody in that situation. No payments have to be made. No file appears, as other alerts would signal an empty file.

As the relevant time was in business hours and the thing was important, I decided to swallow my pride and accept that invalid alert.

  Resolution procedure is documented as: Call team X, and ask if this is correct. If yes, black out that alert for 24 hours.
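
A minimal sketch of what "black out that alert for 24 hours" could look like, assuming the alerting layer lets you record a per-alert silence with an expiry. `AlertSilencer` and the alert name `payment_file_missing` are hypothetical and not taken from any real tool:

```python
from datetime import datetime, timedelta


class AlertSilencer:
    """Tracks temporary blackouts keyed by alert name (hypothetical helper)."""

    def __init__(self):
        self._silenced_until = {}

    def blackout(self, alert_name, now, hours=24):
        # Record when the silence ends; it expires on its own, so nobody
        # has to remember to re-enable the alert.
        self._silenced_until[alert_name] = now + timedelta(hours=hours)

    def should_page(self, alert_name, now):
        # Page unless a blackout for this alert is still in effect.
        until = self._silenced_until.get(alert_name)
        return until is None or now >= until


silencer = AlertSilencer()
confirmed_at = datetime(2022, 7, 20, 9, 0)  # team X confirmed: no payments due
silencer.blackout("payment_file_missing", confirmed_at)
```

The expiry is the important design choice here: the alert comes back the next day without anyone having to act.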

Aeolun · on July 21, 2022

No, but, the point is there are never alerts that aren’t alerts.

This is arguably worse, since it’s otherwise a very important alert so anyone that sees it will freak out. Since it happens only 2 times a year, anyone seeing the alert for the first time (depending on churn, this may happen quite often) is going to think it’s really important.

Hopefully you don’t regularly get these alerts because something actually went wrong (say once a year), but that means roughly 66% of all alerts you get are false alarms.

tharkun__ · on July 20, 2022

Unacceptable. Make sure that the file is there but empty if nobody needs to get money. If empty could be a failure case, have a 'this page intentionally left blank' type arrangement for the file contents. Done, no monitor exceptions.
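
A rough sketch of that arrangement, assuming the producer is allowed to write a marker into the file. The marker text `NO_PAYMENTS_TODAY` and the function names are made up for illustration:

```python
from pathlib import Path

# Hypothetical "this page intentionally left blank" marker.
SENTINEL = "NO_PAYMENTS_TODAY"


def classify_contents(text):
    """Classify file contents; None means the file does not exist."""
    if text is None:
        return "missing"
    stripped = text.strip()
    if stripped == "":
        return "empty"  # ambiguous: could be a failed export
    if stripped == SENTINEL:
        return "intentionally_blank"
    return "has_payments"


def check_payment_file(path):
    """Read the file at `path`, if it exists, and classify it."""
    p = Path(path)
    text = p.read_text(encoding="utf-8") if p.is_file() else None
    return classify_contents(text)


def should_alert(status):
    # Only a missing or truly empty file pages someone.
    return status in {"missing", "empty"}
```

With this shape the monitor has no exceptions: a day without payments produces a file that says so explicitly.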

hyperman1 · on July 20, 2022

Sorry, been there done that. Layout of the file is dictated by an external party. Empty or dummy file is major bad news that blocks other payments. If the file exists, it should have at least 1 valid payment. Payment of 1 cent to a dummy account is illegal. File is produced and consumed by 2 different software packages from 2 different vendors. If the business knowingly creates an invalid file, they commit fraud.

To be honest, calling a human and asking if they are really sure in this case is probably a good idea.

tharkun__ · on July 21, 2022

I never said 1 cent payment. I didn't know diddly squat about what your domain was or any specifics. You just said that that was a good case of 'bad monitoring is OK'. It's not. Maybe your hands are tied, I'll give you that. It won't make it good though.

Also I said 'this space intentionally left blank'. Whatever actual form that would take in your example. Just because multiple entities use the same interface does not mean a bad interface is suddenly a good interface.

To take an analogy and stay in payments (not sure what your specifics are): if your live system for CC processing OKs a payment from 4111 1111 1111 1111, you're toast. Your system better only 'process' that in a test environment. Perfect for a dummy row that gets ignored but ensures your monitoring does not trigger.
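
As a sketch of the dummy-row idea, assuming each row is a dict with a `card` field (the row shape and function names are hypothetical): monitoring counts every row, while processing drops the well-known test card number.

```python
# Well-known test card number; must never be charged in production.
TEST_CARD = "4111111111111111"


def normalize(card_number):
    """Strip the spaces and dashes people use to group card digits."""
    return card_number.replace(" ", "").replace("-", "")


def split_rows(rows):
    """Separate real payment rows from dummy rows that carry the test card."""
    real, dummy = [], []
    for row in rows:
        if normalize(row["card"]) == TEST_CARD:
            dummy.append(row)
        else:
            real.append(row)
    return real, dummy


def file_is_healthy(rows):
    # Monitoring view: any row, real or dummy, proves the producer ran.
    return len(rows) > 0


rows = [
    {"card": "4111 1111 1111 1111", "amount_cents": 0},     # dummy row
    {"card": "4000 1234 5678 9010", "amount_cents": 1250},  # made-up real row
]
to_process, ignored = split_rows(rows)
```

On a day with no real payments the file would hold only the dummy row, so `file_is_healthy` stays true and `to_process` is empty.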

And yes, been there, done that too with these file based interfaces.

(it's funny how some systems that don't actually process your payment right away actually let you book with a known dummy CC - in Prod. Technically OK because you'll pay for real later - think hotel. Still funny)

hyperman1 · on July 21, 2022

To be clear: not a good case of 'bad monitoring is OK'. More an unresolvable case given the constraints. I've had to tame plenty of cases of bad interfaces, and that was the main one that evaded any decent resolution.