The 4 Minute Bug

krallja · on Sept 29, 2022

Love it. This reminds me somewhat of a similar issue I saw in 2012, where Cheezburger.com went down over Thanksgiving weekend because we had no deploys for more than 80 hours, and an accidental per-server cache filled up in about that much time. https://jacob.jkrall.net/turkey-day-down-time

kinduff · on Sept 29, 2022

Very good story, it's one of those edge cases that are only found when something unusual happens.

pachico · on Sept 29, 2022

It looks to me like a badly architected solution...

If you have a background job to update data, there is no need for a TTL at all. If we're speaking about currency exchange, you might prefer to store a transactional history of exchange rates, together with the current one, and also populate a fast read model (Redis) with a failover against the OLTP version...
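
A rough sketch of that idea, under stated assumptions: `RateStore` and `FastReadModel` are hypothetical in-memory stand-ins for the OLTP database and Redis, and `refresh_rate` / `read_rate` are made-up names, not anything from the original post. The point is only that the background job overwrites the key with no TTL, and that reads fall back to the durable store when the fast model is missing or down.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RateStore:
    """Hypothetical stand-in for the OLTP database: keeps the full rate history."""
    history: list = field(default_factory=list)

    def record(self, pair: str, rate: float) -> None:
        self.history.append((datetime.now(timezone.utc), pair, rate))

    def current(self, pair: str) -> Optional[float]:
        # The most recent entry for the pair is the current rate.
        for _, stored_pair, rate in reversed(self.history):
            if stored_pair == pair:
                return rate
        return None


class FastReadModel:
    """Hypothetical stand-in for Redis: plain keys, no expiry."""

    def __init__(self) -> None:
        self.data: dict = {}
        self.available = True  # flip to False to simulate an outage

    def set(self, key: str, value: float) -> None:
        if not self.available:
            raise ConnectionError("read model unavailable")
        self.data[key] = value

    def get(self, key: str) -> Optional[float]:
        if not self.available:
            raise ConnectionError("read model unavailable")
        return self.data.get(key)


def refresh_rate(store: RateStore, cache: FastReadModel, pair: str, rate: float) -> None:
    """Background job step: write the history first, then overwrite the fast copy."""
    store.record(pair, rate)
    try:
        cache.set(pair, rate)  # no TTL: the key never silently disappears
    except ConnectionError:
        pass  # the durable store still holds the new rate


def read_rate(store: RateStore, cache: FastReadModel, pair: str) -> Optional[float]:
    """Read path: fast model first, durable store as the failover."""
    try:
        cached = cache.get(pair)
        if cached is not None:
            return cached
    except ConnectionError:
        pass
    return store.current(pair)
```

One trade-off this sketch does not handle: if the fast model misses an update during an outage, it serves the older value once it comes back, until the next refresh overwrites it.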

Maybe it's just me.

kinduff · on Sept 29, 2022

I agree with you, it is bad architecture.

I didn't mention it in the blog post, but the TTL was introduced by a dependency that allowed us to integrate the exchange rate service.

javier_e06 · on Sept 29, 2022

The story falls into the category of "it won't happen until it happens". Testing is expensive, and even with all the testing in advance we observe emergent behaviors that make us run to the reams of logs to ask the very simple question: "What happened?" My takeaway: if you have to fail, because you will, fail gracefully. Good post.

kinduff · on Sept 29, 2022

Very nice way to put the takeaway, thanks for reading!