Lessons Learned from Reading Post Mortems (danluu.com)
156 points by ingve on Aug 20, 2015 | 25 comments



If you would like to read postmortems, I maintain a list of them here (over 250 so far): https://pinboard.in/u:peakscale/t:postmortem/


This is great - thank you!

In a similar vein is the Google+ Postmortems community: https://plus.google.com/u/0/communities/11513614020301839179...


Wow, this is a great list.

It would be fantastic to take this set and visualise the causes. It would be really interesting to see whether the causes are that different for large corporations vs smaller startups. I suspect that, as in the article, configuration, error handling and human error remain the most common causes regardless of whether you have vast quantities of process or money for tooling.


Thanks for this - I've been working my way through it, and it's fascinating, far more so than I would have imagined.


> Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages, and nothing else even seems close.

The funny thing about this conclusion is that configuration and code are not dramatically different concepts when you think about it. One of them is "data" and the other "code", but both affect the global behavior of the system. Config variables are often played up as being simpler to manage, but they're actually more complicated from an engineering standpoint, since there is code required to support said configuration.

The process is what's dramatically different. "Write a story with acceptance criteria, get it estimated by engineers, get it prioritized by management, wait two weeks for the sprint to be over, wait for QA acceptance, deploy in the middle of the night," vs. "Just change this field located right here in the YAML file..."
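
To make the "code required to support said configuration" point concrete, here is a minimal Go sketch (the field names, file name, and the gopkg.in/yaml.v2 dependency are assumptions, not anything from the article): even a one-line YAML change is only as safe as the loading and validation code behind it.

    // Hypothetical sketch: even a "simple" YAML knob needs code to load it,
    // type-check it, and range-check it before it can change behaviour safely.
    package main

    import (
        "fmt"
        "os"

        "gopkg.in/yaml.v2" // assumed dependency for YAML parsing
    )

    type Config struct {
        MaxConnections int    `yaml:"max_connections"` // hypothetical field
        AuthServer     string `yaml:"auth_server"`     // hypothetical field
    }

    func loadConfig(path string) (*Config, error) {
        raw, err := os.ReadFile(path)
        if err != nil {
            return nil, fmt.Errorf("read %s: %w", path, err)
        }
        var cfg Config
        if err := yaml.Unmarshal(raw, &cfg); err != nil {
            return nil, fmt.Errorf("parse %s: %w", path, err)
        }
        // Validation is the part that tends to get skipped for "just a config change".
        if cfg.MaxConnections <= 0 {
            return nil, fmt.Errorf("max_connections must be positive, got %d", cfg.MaxConnections)
        }
        if cfg.AuthServer == "" {
            return nil, fmt.Errorf("auth_server must be set")
        }
        return &cfg, nil
    }

    func main() {
        cfg, err := loadConfig("app.yaml") // hypothetical file name
        if err != nil {
            fmt.Fprintln(os.Stderr, "refusing to start:", err)
            os.Exit(1)
        }
        fmt.Printf("running with %+v\n", *cfg)
    }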


Also, I can't speak for all companies, but where I work configuration is how we define the differences between our test and production environments.

If your config files are intentionally different, because in test you should use authentication server testauth.example.com and in production you should use auth.example.com, then how can you avoid violating test-what-you-fly-and-fly-what-you-test?

Obviously, you could add an extra layer of abstraction (make the DNS config different between test and production and both environments could use auth.example.com) but that's just moving the configuration problem somewhere else :)
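
One way to keep "test-what-you-fly" honest without pretending the files are identical is to check the difference mechanically. A hedged Go sketch (filenames, keys, and the gopkg.in/yaml.v2 dependency are made up): a CI step that fails if the test and prod configs drift anywhere outside an explicit allowlist.

    // Hypothetical CI check: the test and prod configs may differ only in an
    // allowlisted set of keys (endpoints, credentials); any other drift fails.
    package main

    import (
        "fmt"
        "os"

        "gopkg.in/yaml.v2" // assumed dependency for YAML parsing
    )

    // Keys that are expected to differ between environments.
    var allowedDiffs = map[string]bool{
        "auth_server": true, // testauth.example.com vs auth.example.com
        "db_password": true,
    }

    func load(path string) (map[string]interface{}, error) {
        raw, err := os.ReadFile(path)
        if err != nil {
            return nil, err
        }
        cfg := map[string]interface{}{}
        return cfg, yaml.Unmarshal(raw, &cfg)
    }

    func main() {
        test, err1 := load("config.test.yaml") // hypothetical filenames
        prod, err2 := load("config.prod.yaml")
        if err1 != nil || err2 != nil {
            fmt.Fprintln(os.Stderr, "load failed:", err1, err2)
            os.Exit(1)
        }
        bad := false
        // Flat, one-directional comparison only; nested sections or keys that
        // exist only in prod would need a recursive diff.
        for key, tv := range test {
            pv, ok := prod[key]
            if (!ok || fmt.Sprint(tv) != fmt.Sprint(pv)) && !allowedDiffs[key] {
                fmt.Printf("unexpected drift in %q: test=%v prod=%v\n", key, tv, pv)
                bad = true
            }
        }
        if bad {
            os.Exit(1)
        }
    }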


This is how we do it:
a) Have a regression test suite running continuously and fire off alerts when tests fail. Include a minimal set of config values in the regression suite so config breakage triggers alerts too.
b) Set up monitoring for your components and trigger alerts based on thresholds.
c) With (a) and (b) in place, roll out your bits to a canary environment and, if all looks good, trigger a rolling deployment to your prod environment.


You automate the deployment and that automation runs checks. If things don't work out, it refuses to deploy.
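
A minimal sketch of what such a gate might look like, in Go (the health endpoint, attempt count, and thresholds are all made up): the pipeline runs it after pushing to the canary from the previous comment and only continues the rollout if it exits zero.

    // Hypothetical canary gate: the deploy pipeline runs this after pushing to
    // the canary environment and only continues the rolling deploy on exit 0.
    package main

    import (
        "fmt"
        "net/http"
        "os"
        "time"
    )

    const (
        healthURL = "https://canary.example.com/healthz" // hypothetical endpoint
        attempts  = 10
        maxFails  = 1 // tolerate a single transient failure
    )

    func main() {
        client := &http.Client{Timeout: 5 * time.Second}
        fails := 0
        for i := 0; i < attempts; i++ {
            resp, err := client.Get(healthURL)
            if err != nil || resp.StatusCode != http.StatusOK {
                fails++
                fmt.Fprintf(os.Stderr, "check %d failed: err=%v\n", i+1, err)
            }
            if resp != nil {
                resp.Body.Close()
            }
            time.Sleep(30 * time.Second)
        }
        if fails > maxFails {
            fmt.Fprintf(os.Stderr, "canary unhealthy (%d/%d checks failed), refusing to deploy\n", fails, attempts)
            os.Exit(1) // non-zero exit stops the rollout
        }
        fmt.Println("canary healthy, proceeding with rolling deploy")
    }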


"The lack of proper monitor is never the sole cause of a problem, but it’s often a serious contributing factor."

I am continuously amazed that downtime issues go undetected until a) a customer notifies you or b) things go downhill and alarms are blazing. Our central principle is that monitoring and alerting have to be part of your deployment. The way we apply this at my workplace is that every design doc has a monitoring section which has to be filled out.


Well, the thing about human mistakes is that they are easy to blame on some part of the hardware or software. If I configure an incorrect data element and the system dies, I won't write down "I configured something wrong so the system died"; I will file a bug that says "the software didn't catch that kind of misconfiguration". It's also not reasonable to admit too many mistakes publicly. People who haven't thought about all the mistakes they themselves make will start to think that you are not able to perform well in your job.


Technically all outages are "human mistakes". Humans build hardware, write software, configure and maintain systems, and manage other humans. Which is why explaining an outage as "human error" is not constructive.

There are known methods to create systems that are resistant to human error: automation, checklists, testing, etc. Humans will make mistakes; that is a certainty. The solution to an outage is to employ these techniques, not to tell your team members not to make mistakes.


A lightning strike or earthquake is not a human mistake.


Human mistakes probably led to the lightning strike not being grounded properly or the earthquake causing structural damage.


As a recent convert to functional programming, I'd like to point out that more functional styles of error handling help one address errors properly rather than sweep them under the rug.

This ties into the article's first point about how poor error handling is a common source of bugs.


How exactly does functional programming help here?


Sum types allow a function to return either a result or an error. Not accounting for both possibilities is a compile time type error.

Go mimics this behavior, so it's not only a functional thing: http://blog.golang.org/error-handling-and-go
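
A small Go example of that convention (the function and values are illustrative). Note that, unlike a true sum type, Go's compiler only rejects unused variables; it won't force you to inspect the error, so the guarantee comes from convention rather than the type system.

    // The Go idiom described in the linked post: every fallible call returns
    // (value, error) and the caller decides what to do with the error.
    package main

    import (
        "errors"
        "fmt"
        "os"
        "strconv"
    )

    // parsePort returns either a usable port or an error, never both.
    func parsePort(s string) (int, error) {
        p, err := strconv.Atoi(s)
        if err != nil {
            return 0, fmt.Errorf("not a number: %w", err)
        }
        if p < 1 || p > 65535 {
            return 0, errors.New("port out of range")
        }
        return p, nil
    }

    func main() {
        // "80443" is deliberately invalid to exercise the error path.
        port, err := parsePort("80443")
        if err != nil {
            fmt.Fprintln(os.Stderr, "bad config:", err)
            os.Exit(1)
        }
        fmt.Println("listening on", port)
    }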


I guess he's referring to things like the Maybe monad.


Daily load tests against both staging and production (yes, really) can help catch a lot of the issues described in the article. You still have to have solid monitoring & alerting set up, though.


Curious, how do you perform load tests in production for an e-commerce site?


Well, the gist of it is that it's not that hard technically, but it can be harder to implement organisationally. Production load-testing is something that needs to be agreed with various teams/people across the organisation (easier if you're a small startup): think ops, marketing, analytics, etc.

The basic thing is to make sure you can separate real and synthetic requests (with a special header for example). This will allow you to mock/no-op certain operations like attempting to charge the user's card or reducing the quantity of stock you have. It'll also allow you to remove synthetic requests from your analytics data, so that marketing does not get excited by the sudden influx of new users. If you have user accounts on your system, make all fake users register with @somenonexistentdomain.com so you can filter for that too etc.
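
A hedged sketch of that header mechanism in Go (the header name and handler are made up, not from the comment): middleware tags synthetic requests so downstream code can skip real side effects while still exercising the full request path.

    // Hypothetical middleware: a special header marks synthetic load-test
    // traffic so side effects (charging a card, decrementing stock) become
    // no-ops and analytics can filter the requests out.
    package main

    import (
        "context"
        "log"
        "net/http"
    )

    type ctxKey string

    const syntheticKey ctxKey = "synthetic"

    // markSynthetic tags requests carrying the (hypothetical) X-Load-Test header.
    func markSynthetic(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            isFake := r.Header.Get("X-Load-Test") == "1"
            next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), syntheticKey, isFake)))
        })
    }

    func checkout(w http.ResponseWriter, r *http.Request) {
        if fake, _ := r.Context().Value(syntheticKey).(bool); fake {
            // Exercise the full request path but skip the real charge.
            w.Write([]byte("synthetic order accepted, no card charged\n"))
            return
        }
        // chargeCard(...) would run here for real traffic.
        w.Write([]byte("order placed\n"))
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/checkout", checkout)
        log.Fatal(http.ListenAndServe(":8080", markSynthetic(mux)))
    }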

Obviously start slow and ramp up over time as you iron out issues.

JustEat.co.uk run daily load-tests in production at +20-25% of their peak traffic. As in: extra 20% simulated load during their peak hours, which happen to be between 6-9pm every day. They process a very respectable number of real-money transactions every second, a number that a lot of ecommerce sites would be very happy with. (Source: a presentation at ScaleSummit in London this year)

Feel free to @message me if you want to talk more about this.


One could learn a lot from observing the methods and practices applied to legacy systems.


Legacy systems tend to ossify and not change much. How is that a good lesson?


I think OP means that legacy systems were very hard to change, so changes went through a much more rigorous review process.

The problem with that theory is that because current systems are so easy to change, the cost of a failure is much lower (you can fix it quickly), so the upfront cost of avoiding failure no longer has as good an ROI.


Define legacy system.


I have a tendency to overlook a valuable lesson, perhaps the most valuable lesson, when I redefine "post mortem" in this way.



