Router config changes are simultaneously very commonplace and incredibly risky.

I've seen a single bad route advertisement cause global crashes when route poisoning interacted with a vendor bug. I've seen RPKI enforcement cause massive congestion on transit links. Route leaks have DoSed entire countries (https://www.internetsociety.org/blog/2017/08/google-leaked-p...). Even something as simple as a peer removing the rules that cleared ToS bits meant 20+ engineers spent a month figuring out why an engineering director was sporadically being throttled to ~200kbps when accessing Google properties.
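
A lot of that protection ultimately boils down to filters that are trivially easy to get wrong. As a toy illustration (in Python rather than real router policy config, and with a made-up prefix list), an import-filter check is conceptually just this:

    import ipaddress

    # Hypothetical allow-list of prefixes this peer is expected to announce.
    ALLOWED = [ipaddress.ip_network("192.0.2.0/24"),
               ipaddress.ip_network("198.51.100.0/24")]

    def accept_announcement(prefix: str) -> bool:
        """Reject anything outside the peer's expected address space."""
        net = ipaddress.ip_network(prefix)
        return any(net.version == a.version and net.subnet_of(a)
                   for a in ALLOWED)

    # accept_announcement("198.51.100.128/25")  -> True
    # accept_announcement("203.0.113.0/24")     -> False

One missing or overly broad entry is the difference between normal peering and accidentally attracting someone else's traffic.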

Running a large-scale production network is hard.

edit: in case it is not obvious: I agree entirely with you -- the routine config changes that do risk the enterprise are often very hard to identify ahead of time.




The report stated (https://crtc.gc.ca/eng/publications/reports/xona2024.htm):

> "this configuration change was the sixth phase of a seven-phase network upgrade process that had begun weeks earlier. Before this sixth phase configuration update, the previous configuration updates were completed successfully without any issue. Rogers had initially assessed the risk of this seven-phased process as “High.”

> However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk"

> Downgrading the risk assessment to “Low” for changing the Access Control List filter in a routing policy contravenes industry norms, which require high scrutiny for such configuration changes, including laboratory testing before deploying in the production network.
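
The report doesn't say how that risk assessment algorithm actually works, but the failure mode is easy to picture. A minimal sketch of the kind of heuristic that would behave this way (the function, levels, and thresholds are entirely my guess, not anything from the report):

    def assess_phase_risk(initial_risk, prior_phase_results):
        # Naive heuristic: treat every successfully completed prior phase
        # as evidence that the remaining phases are safer, regardless of
        # what the next phase actually touches (e.g. an ACL in a routing policy).
        levels = ["Low", "Medium", "High"]
        idx = levels.index(initial_risk)
        clean_phases = sum(1 for ok in prior_phase_results if ok)
        return levels[max(idx - clean_phases // 2, 0)]

    # assess_phase_risk("High", [True] * 5)  ->  "Low" by phase 6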

Overall, the lack of detail in the (regulator-forced) post-mortem makes it impossible for the public to decide.

It's a Canadian telecom: they'll release details when they make them look good, and hide them when they make them look bad.


Certainly, if it was only downgraded by the time it reached phase 6, I would expect it to have gone through that higher scrutiny in earlier phases (including lab testing). My guess is that the existing lab testing was inadequate for surfacing issues that only appear at production scale.

If each of the phases was a distinct set of config changes, then they really shouldn't have been bundled as part of the same network upgrade with the same risk assessment. But, charitably, I assumed this was a progressive rollout in some form (my guess was different device roles, e.g. peering devices vs. backbone routers). Should those device roles have been qualified separately via lab testing and more? Certainly. Were they? I have no idea.
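
If it was a per-role progressive rollout, the gate I'd hope for looks roughly like this (pure speculation about their process; the role names and the lab-qualification state are mine):

    # Hypothetical gate: every device role touched by a phase must have
    # been lab-qualified for *this* change, independent of how earlier
    # phases on other roles went.
    LAB_QUALIFIED = {"backbone": True, "peering": False}  # made-up state

    def may_proceed(phase_roles):
        missing = [r for r in phase_roles if not LAB_QUALIFIED.get(r, False)]
        if missing:
            raise RuntimeError(f"lab qualification missing for: {missing}")
        return True

    # may_proceed(["backbone"])             -> True
    # may_proceed(["peering", "backbone"])  -> RuntimeError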

Do I think there are systemic issues with how Rogers runs their network? Almost certainly. But from my perspective, the report (which was created by an external third party) places too much blame on the risk-assessment downgrade as opposed to other underlying issues.

(As you can see, there is a lot of guesswork on my behalf, precisely because, as you mention, there isn't enough information in the executive summary to fill in these gaps.)


> Overall, the lack of detail in the (regulator-forced) post-mortem makes it impossible for the public to decide.

Note: it's the publicly available portion of the post-mortem (just the executive summary) that makes it difficult for the public to decide.

Per a news article:

> Xona Partners' findings were contained in the executive summary of the review report,[0] released this month. The CRTC says the full report contains sensitive information and will be released in redacted form at a later, unspecified, date.

[0] https://www.cbc.ca/news/politics/rogers-outage-human-error-s...



