These follow-ups aren't super compelling IMO.

> To do this, the first and smallest step will be to phase out the BinaryPack library and make sure we run a more extensive testing on any third-party libraries we work with in the future.

Sure. Not exactly a structural fix. But maybe worth doing. Another view would be that you've just "paid" a ton to find issues in the BinaryPack library, and maybe should continue to invest in it.

Also, "do more tests" isn't a follow-up. What's your process for testing these external libs, if you're making this a core part of your reliability effort?
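
To make that concrete, the kind of testing I'd expect for a serialization dependency is a round-trip fuzz/property test. BinaryPack is a .NET library, so the Python below is only a sketch of the approach; serialize and deserialize stand in for whatever wrapper you actually call:

    import random
    import string

    def random_record(rng):
        # build a small arbitrary record of the kinds of values the format has to handle
        return {
            "host": "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 63))),
            "ttl": rng.randint(0, 2**31 - 1),
            "tags": [rng.choice(["edge", "origin", ""]) for _ in range(rng.randint(0, 5))],
        }

    def test_round_trip(serialize, deserialize, iterations=10_000, seed=0):
        # property test: deserialize(serialize(x)) must return x for many generated inputs
        rng = random.Random(seed)
        for i in range(iterations):
            record = random_record(rng)
            blob = serialize(record)
            assert deserialize(blob) == record, f"round-trip mismatch on iteration {i}: {record!r}"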

> We are currently planning a complete migration of our internal APIs to a third-party independent service. This means if their system goes down, we lose the ability to do updates, but if our system goes down, we will have the ability to react quickly and reliably without being caught in a loop of collapsing infrastructure.

Ok, now tell me how you're going to test it. Changing architectures is fine, but until you're running drills of core services going down, you don't actually know you've mitigated the "loop of collapsing infrastructure" issue.
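
Concretely, a drill can be as crude as cutting the dependency off in a controlled window and checking that everything else keeps serving. A rough sketch; block_dependency, unblock_dependency and check_health are placeholders for whatever tooling you'd actually use (firewall rule, feature flag, synthetic probe):

    import time

    def run_failover_drill(block_dependency, unblock_dependency, check_health,
                           duration_s=300, interval_s=10):
        # deliberately take a core dependency away and record any health failures
        failures = []
        block_dependency()                  # simulate the dependency outage
        try:
            deadline = time.time() + duration_s
            while time.time() < deadline:
                ok, detail = check_health()
                if not ok:
                    failures.append(detail)
                time.sleep(interval_s)
        finally:
            unblock_dependency()            # always restore, even if a check raises
        return failures                     # an empty list means the drill passed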

> Finally, we are making the DNS system itself run a local copy of all backup data with automatic failure detection. This way we can add yet another layer of redundancy and make sure that no matter what happens, systems within bunny.net remain as independent from each other as possible and prevent a ripple effect when something goes wrong.

Additional redundancy isn't a great way of mitigating issues caused by a change being deployed. Being 10x redundant usually adds quite a lot of complexity, provides less safety than it seems (again, do you have a plan to regularly test that this failover mode is working?) and can be less effective than preventing issues getting to prod.

What would be nice to see is a full review of the detection, escalation, remediation, and prevention for this incident.

More specifically, the triggering event here, the release of a new version of software, isn't super novel. More discussion of follow-ups that are systematic improvements to the release process would be useful. Some options (the last one is sketched out after the list):

- Replay tests to detect issues before landing changes

- Canaries to detect issues before pushing to prod

- Gradual deployments to detect issues before they hit 100%

- Even better, isolated gradual deployments (i.e. deploy region by region, zone by zone) to mitigate the risk of issues spreading between regions.
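
To make the last option concrete, here's roughly what an isolated, gradual rollout driver looks like. The region names, wave sizes and the deploy_to/healthy/rollback hooks are all made up; the point is the shape: one region at a time, one wave at a time, with a health gate in between:

    import time

    REGIONS = ["eu-central", "us-east", "us-west", "ap-southeast"]   # made-up names
    WAVES = [0.05, 0.25, 1.0]            # share of a region's nodes touched per wave

    def rollout(version, deploy_to, healthy, rollback, soak_s=600):
        # push one region at a time, one wave at a time; stop and roll back
        # the region the moment its health gate fails
        for region in REGIONS:
            for fraction in WAVES:
                deploy_to(region, version, fraction)
                time.sleep(soak_s)       # let errors surface before widening the blast radius
                if not healthy(region):
                    rollback(region, version)
                    raise RuntimeError(f"halted rollout of {version} in {region} at {fraction:.0%}")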

Beyond that, start thinking about all the changing components of your product and their lifecycle. It sounds like some data file got screwed up here as it was changed. Do you stage those changes to your data files? Can you isolate regional deployments entirely, and control the rollout of new versions of this data file on a regional basis? Can you do the same for all other changes in your system?
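
Staging a data file can look a lot like staging code: validate it against the previous version before any region sees it, then push it through the same region-by-region rollout as above. A minimal sketch, where load_records stands in for whatever parses the real format:

    def validate_data_file(new_path, old_path, load_records, max_shrink=0.2):
        # refuse to ship a file that doesn't parse, is empty, or lost a suspicious
        # share of its records compared to the previous version
        new_records = load_records(new_path)    # should raise if the file is corrupt
        old_records = load_records(old_path)
        if not new_records:
            raise ValueError(f"{new_path} parsed to zero records")
        if len(new_records) < len(old_records) * (1 - max_shrink):
            raise ValueError(f"{new_path} shrank from {len(old_records)} to "
                             f"{len(new_records)} records; refusing to roll it out")
        return new_records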

Reading this, I am not at all reassured that it won't happen again. Next week, perhaps.

Their DNS also broke last month, but I guess we won't mention that, as it would invalidate 2 years of stellar reliability.