In the entire history of the Bell System, no electromechanical exchange was ever down for more than 30 minutes for any reason other than a natural disaster, with one exception: a major fire in New York City that took 170,000 phones out of service for three weeks.[1] The Bell System pulled in resources and people from all over the system to replace and rewire several floors of equipment and cabling.
That record has not been maintained in the digital era.
The long distance system did not originally need the Bedminster, NJ network control center to operate. Bedminster sent routing updates periodically to the regional centers, but they could fall back to static routing if necessary. There was, by design, no single point of failure. Not even close. That was a basic design criterion in telecom prior to electronic switching. The system was designed to have less capacity but still keep running if parts of it went down.
That electromechanical system also switched significantly fewer calls than its digital counterparts!
Most modern-day telcos that I have seen still have multiple power supplies, line cards, and uplinks in place, designed for redundancy. However, the new systems can also do so much more and are so much more flexible that they can be configured out of existence just as easily!
Some of this is also just poor software. On the big carrier-grade routers you can configure many things, but particular combinations of settings may simply not work correctly, or, even worse, pull down the entire chassis. I don't have first-hand experience with how good the early-2000s software was, but I would guess that configurability/flexibility has come at a serious cost to the reliability of the network.
I think that's an oversimplification and not quite the core value proposition. Implementing a solution is just the beginning. Systems evolve, so software development is not only about delivering solutions but also about enabling their maintenance and evolution.
I view programming languages and software dev tools as fundamentally about the management of complexity.
The evolution of software development tools reflects a continuous effort to manage the inherent complexity of building and maintaining software systems. From the earliest programming languages and punch cards to simple text editors, and now to sophisticated IDEs, version control systems, and automated testing frameworks, each advancement enables better ways to manage and simplify the overall development process. These tools help developers handle dependencies, abstract away low-level details, and facilitate collaboration, ultimately enabling the creation of more robust, scalable, and maintainable software.
Notable figures in software development emphasize this focus on complexity management. Fred Brooks, in "The Mythical Man-Month," highlights the inherent complexity in software development and the necessity of tools and methodologies to manage it effectively. Eric S. Raymond, in "The Cathedral and the Bazaar," discusses how collaborative tools and practices, especially in open-source projects, help manage complexity. Grady Booch's "Object-Oriented Analysis and Design" underscores how object-oriented principles and tools promote modularity and reuse, aiding complexity management.
Martin Fowler, known for advocating the use of design patterns in his book "Patterns of Enterprise Application Architecture," emphasizes how patterns provide proven solutions to recurring problems, thereby managing complexity more effectively. By using patterns, developers can reuse solutions, improve communication through a common vocabulary, and enhance system flexibility. Fowler also advocates for continuous improvement and refactoring in his book "Refactoring" as essential practices for managing and reducing software complexity. Similarly, Robert C. Martin, in "Clean Code," stresses the importance of writing clean, readable, and maintainable code to manage complexity and ensure long-term software health.
While speed and efficiency are important, the core value of software development lies in the ability to manage complexity, ensuring that systems are robust, scalable, and maintainable over time.
The expectation should be that, as a system switches more and more calls, so that the cost of a 30-minute pause gets higher and higher, the situation would improve, and a more modern system might have been expected to boast that it never had a break lasting more than, say, 30 seconds outside of a natural disaster.
I don't know where you get that expectation from. These are arbitrary engineering constraints informed by business decisions. If they decided that people could deal with up to 30 minutes of service interruption and set that as a goal, they would engineer with that in mind, regardless of how many people are affected. If they used total combined user-hours of interrupted service, then they would engineer around reducing possible outage times for a system as it handled more people (or scale differently, with more systems).
I don't think there's any sort of expectation that it would definitely go one way though, as you say. It's all business and legal constraints providing engineering constraints to build against.
A relatively famous example of the extent to which Indiana Bell went to avoid disrupting telephone service: rotating and relocating its headquarters over a few weeks.
Surely that downtime quote is apocryphal. The early decades were plagued by poor service due to maintenance, mechanical failure, and human error. Manual switchboards were the primary method of connecting calls, and the intense workload and physical demands on operators often led to service disruptions. In 1907, over 400 operators went on strike in Toronto, severely impacting phone service; the strike was driven by wage disputes, increased working hours, and poor working conditions.
They didn't have downtime logs, but that doesn't mean that the rapid growth of telephone demand didn't outpace the Bell System's capacity to provide adequate service. The company struggled to balance expansion with maintaining service quality, leading to intermittent service issues.
The Bell System faced significant public dissatisfaction due to poor service quality. This was compounded by internal issues such as poor employee morale and fierce competition.
Bell Canada had a major outage on July 17th, 1999, when a tool was dropped on the bus bar for the main battery power, igniting the hydrogen from the batteries in one of the exchanges in downtown Toronto. The fire department insisted that all power in the area be shut down, which led to the main switch that handled long distance call routing for all 1-800 numbers being offline for the better part of a day.
One thing that was fascinating about the Rogers outage was on the wireless side: because "just" the core was down, the towers were still up.
So mobile phones would connect to the tower just enough to associate, but couldn't actually do anything, not even call 9-1-1, and they wouldn't fail over to other mobile networks. Devices showed zero bars, but field test mode would show some handshake succeeding.
(The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?), or disable that eSIM, but you'd have to know that you can do that. Unsure if you'd still be at the mercy of Rogers being the most powerful signal and still failing to get your 9-1-1 call through.
Rogers claimed to have no ability to power down the towers without a truck-roll (which is another aspect where widespread OOB could have come in handy).
Various stories of radio stations (of which Rogers also owns a lot) not being able to connect the studio to the transmitter, so some techs went out with an mp3 player to play pre-recorded "evergreen" content. Others just went off-air.
> Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?)
In sane handsets (ones where the battery is still removable), that tool was and still is a fingernail, which most have on their person.
I believe the innovation of the need for a special SIM eject tool was bestowed upon us by the same fruit company that gave us floppy and optical drives without manual eject buttons over 30 years ago.
I have fond memories of the fruit company taking the next step: removing the floppy drive bezel from the drive and instead having the floppy drive slot be part of the overall chassis front panel. Of course, their mechanical tolerances were nothing like they are today, so if you looked at the computer crosseyed, the front panel would fail to align to the actual internal disk path, and ejecting the disk would cause it to get stuck behind the front chassis panel. One could rescue it by careful wiggling with a tool to guide the disk through the slot or by removing the entire front panel.
Meanwhile “PCs” had a functional but ugly rectangular opening the size of the entire drive, and the drive had its own bezel, and imperfect alignment between drive and case looked a bit ugly but had no effect on function.
(I admit I’m suspicious that Apple’s approach was a cost optimization, not an aesthetic optimization.)
You could operate the ejection mechanism by hand on both optical and floppy disk drives with an uncurled paperclip (or a SIM card ejection tool, had they existed at that point in time). But I wouldn't ascribe the introduction of the motorized tray to the fruit company; it was the wordmark company: https://youtu.be/bujOWWTfzWQ
Sounds like a problem that should be (rather easily) fixable in the Operating System, no?
If the emergency call doesn’t go through, try the call over a different network.
This would also mitigate problems we see from time to time where emergency calls don’t work because the uplink to the emergency call center was impacted either physically or by a bad software update.
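A rough sketch of the failover logic I have in mind, assuming the OS exposes some way to place a call on a specific network; the network names and the place_call stub below are made up purely for illustration, not any real telephony API:

    # Hypothetical sketch: try the emergency number on the home network first,
    # then fall back to any other network the radio can see. place_call() is a
    # stand-in for the OS telephony layer, not a real API.
    def place_call(network: str, number: str) -> bool:
        raise NotImplementedError  # provided by the OS / baseband in reality

    def emergency_call(number: str, networks: list[str]) -> str | None:
        for network in networks:
            try:
                if place_call(network, number):
                    return network      # call connected on this network
            except Exception:
                pass                    # any failure: move on to the next network
        return None                     # nothing worked

    # e.g. emergency_call("911", ["home-carrier", "roaming-partner-1", "roaming-partner-2"])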
> (The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
How does one know ahead of time if any particular change is "enterprise-risking"? It appeared to be a fairly routine set of changes that were going just fine:
> The report summary says that in the weeks leading up to the outage, Rogers was undergoing a seven-phase process to upgrade its network. The outage occurred during the sixth phase of the upgrade.
It turns out that they self-DoSed certain components:
> Staff at Rogers caused the shutdown, the report says, by removing a control filter that directed information to its appropriate destination.
> Without the filter in place, a flood of information was sent into Rogers' core network, overloading and crashing the system within minutes of the control filter being removed.
* Ibid
> In a letter to the CRTC, Rogers stated that the deletion of a routing filter on its distribution routers caused all possible routes to the internet to pass through the routers, exceeding the capacity of the routers on its core network.
> Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The core network routers allow Rogers wireline and wireless customers to access services such as voice and data. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration.
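To make the mechanism concrete, here is a toy model of that failure mode: with the filter, the core only sees a curated subset of routes; without it, the full table lands on routers that can't hold it. All numbers are invented for illustration, not Rogers' actual figures.

    # Toy model: removing the distribution-layer policy filter floods the core
    # routers with more routes than they can process. Numbers are made up.
    CORE_CAPACITY = 1_000_000      # routes a core router can hold
    FULL_TABLE    = 5_000_000      # everything the distribution layer knows

    def routes_exported(filter_in_place: bool) -> int:
        return 50_000 if filter_in_place else FULL_TABLE

    def core_state(received: int) -> str:
        return "ok" if received <= CORE_CAPACITY else "crashed (resources exhausted)"

    for filtered in (True, False):
        n = routes_exported(filtered)
        print(f"filter in place={filtered}: core receives {n:,} routes -> {core_state(n)}")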
> In October, Facebook suffered a historic outage when their automation software mistakenly withdrew the anycasted BGP routes handling its authoritative DNS rendering its services unusable. Last month, Cloudflare suffered a 30-minute outage when they pushed a configuration mistake in their automation software which also caused BGP routes to be withdrawn.
BTW: anyone who wants to really experience how complex internet routing is, go and join DN42 (https://dn42.dev). This is a fake internet built on a network of VPN tunnels and using the same routing systems. As long as you're just acting as a leaf node, it's pretty straightforward. If you want to attach to the network at multiple points and not just VPN them all to the same place, now you have to design a network just like an ISP would, with IGP and so on.
Router config changes are simultaneously very commonplace and incredibly risky.
I've seen outages caused by a single bad router advertisement that caused global crashes due to route poisoning interacting with a vendor bug. RPKI enforcement caused massive congestion on transit links. Route leaks have DoSed entire countries (https://www.internetsociety.org/blog/2017/08/google-leaked-p...). Even something as simple as a peer removing rules for clearing ToS bits resulted in a month of 20+ engineers trying to figure out why an engineering director was sporadically being throttled to ~200kbps when trying to access Google properties.
Running a large-scale production network is hard.
edit: in case it is not obvious: I agree entirely with you -- the routine config changes that do risk the enterprise are often very hard to identify ahead of time.
> "this configuration change was the sixth phase of a seven-phase network upgrade process that had begun weeks earlier. Before this sixth phase configuration update, the previous configuration updates were completed successfully without any issue. Rogers had initially assessed the risk of this seven-phased process as “High.”
> However, as changes in prior phases were completed successfully, the risk assessment algorithm downgraded the risk level for the sixth phase of the configuration change to “Low” risk"
> Downgrading the risk assessment to “Low” for changing the Access Control List filter in a routing policy contravenes industry norms, which require high scrutiny for such configuration changes, including laboratory testing before deploying in the production network.
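Reading between the lines, the "risk assessment algorithm" sounds like a heuristic roughly along these lines. This is purely my guess, sketched only to show why it is dangerous: the success of earlier phases says nothing about the blast radius of the next one.

    # Hypothetical reconstruction of the described heuristic: downgrade the
    # assessed risk of later phases as earlier phases complete cleanly.
    # A guess for illustration, not Rogers' actual algorithm.
    LEVELS = ["Low", "Medium", "High"]

    def assessed_risk(initial: str, successful_prior_phases: int) -> str:
        idx = LEVELS.index(initial)
        return LEVELS[max(0, idx - successful_prior_phases)]

    print(assessed_risk("High", 0))  # phase 1: High
    print(assessed_risk("High", 5))  # phase 6: Low -- even though it touches an ACL filter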
Overall, the lack of detail of the (regulator forced) post-mortem makes it impossible for the public to decide.
It's a Canadian telecom: They'll release detail when it makes them look good, and hide it if it makes them look bad.
Certainly if it was downgraded only by the time it reached phase 6, then I would expect it to have gone through that higher scrutiny in earlier phases (including lab testing). My guess is that the existing lab testing was inadequate for surfacing issues that would only appear at production-scale.
If each of the six phases was a distinct set of config changes, then they really shouldn't have been bundled as part of the same network upgrade with the same risk assessment. But, charitably, I assumed that this was a progressive rollout in some form (my guess was different device roles, e.g. peering devices vs backbone routers). Should these device roles have been qualified separately via lab testing and more? Certainly. Were they? I have no idea.
Do I think there are systemic issues with how Rogers runs their network? Almost certainly. But from my perspective, the report (which was created by an external third-party) places too much blame on the downgrade of risk assessment as opposed to other underlying issues.
(As you can see, there is a lot of guesswork on my behalf, precisely because, as you mention, there isn't enough information in the executive summary to fill in these gaps.)
> Overall, the lack of detail of the (regulator forced) post-mortem makes it impossible for the public to decide.
Note: the publicly available post-mortem makes it difficult for the public to decide.
Per a news article:
> Xona Partners' findings were contained in the executive summary of the review report,[0] released this month. The CRTC says the full report contains sensitive information and will be released in redacted form at a later, unspecified, date.
I’m reminded of when an old AT&T building went on sale as a house, and one of its selling points was that you could get power from two different power companies if you wanted. This highlighted to me the level of redundancy required to take such things seriously. It probably cost the company a lot to hook up the wires, and I doubt the second power company paid anything for the hookup. Big Bell did it there, and I’m sure they did it everywhere else too.
Edit: I bet it had diesel generators when it was in service with AT&T to boot.
> I bet it had diesel generators when it was in service with AT&T to boot.
20 to 25 years ago I visited a telecom switch center in Paris, the one under the Tuileries garden next to the Louvre. They had a huge and empty diesel generator room. The generators had all been replaced by a small turbine (not sure that's the right English term), just the same as what's used to power a helicopter. It was in a relatively small soundproof box, with a special vent for the exhaust, kind of lost on the side of a huge underground room.
As the guy in charge explained to us, it was much more compact and convenient. The big risk was in getting it started; that was the tricky part. Once started, it was extremely reliable.
> Edit: I bet it had diesel generators when it was in service with AT&T to boot.
That's where AT&T screwed up in Nashville when their DC got bombed. They relied on natural gas generators for their electrical backup. No diesel tank farm. Big fire = fire department shuts down natural gas as widely as deemed necessary, and everything slowly dies as the UPS batteries drain.
They also didn't have roll-up generator electrical feed points, so they had to figure out how to wire those up once they could get access again, delaying recovery.
I've seen some power outages in California, and noticed that Comcast/Xfinity had these generator trailers rolled up next to telephone poles, probably powering the low-voltage network infrastructure below the power lines.
It’s trivial when you have the resources that come from being one of Canada’s 3 telecom oligopoly members.
Unfortunately the CRTC is run by former execs/management of Bell, Telus, and Rogers, and our anti-competition bureau doesn't seem to understand its purpose when it consistently allows these 3 to buy up any and all small competitors that gain even a regional market share.
Meanwhile their service is mediocre and overpriced, which they’ll chalk up to geographical challenges of operating in Canada while all offering the exact same plans at the exact same prices, buying sports teams, and paying a reliable dividend.
It's worse than that: 2 of the 3 telecom oligopoly members share (most) of their entire wireless network, with one providing most towers in the West, and the other in the East.
I'm sure those 2 compete very hard with each other with that level of co-dependency.
There is OOB for carriers and OOB for non-carriers. OOB for carriers is significantly more complex and resource intensive than OOB for non-carriers. This topic (OOB or forgo it) has been beaten to death over the last 20 years in operator circles; the responsible consensus is that trying to shave a percent off operating expenses by cheaping out on your OOB is wrong. That said, it does shock me that one of the tier-1 carriers in Canada was this... ignorant? Did they never expect it to rain or something? Wild.
When I see out of band management at remote locations (usually for a dedicated doctors network run by the health authority that gets deployed at offices and clinics) it's generally analog phone line -> modem -> console port. Dialup is more than enough if all you need to do is reset a router config.
Not 100% out of band for a telco though, unless they made sure to use a competitor's lines.
Here in Australia, POTS lines have been completely decommissioned, the UK's will be switched off by the end of 2025, and I'm assuming there are similar timelines in lots of other countries.
They're on the way out in France, too. New buildings don't get copper anymore, only fiber.
However, as I understand it, at least for commercial use, the phone company provides some kind of box that has battery-backing so it can provide phone service for a certain duration in case of emergency.
The tricky part with that is that, at least in Canada, the RJ11 ports on the ONT are generally VoIP. They provide the appropriate voltages for a conventional POTS phone to work but digitize & compress the audio and send it along to the Telco as SIP or whatever. That works fine for voice but you're probably going to have a hard time using a conventional POTS modem over that connection. I've never tested it and am honestly pretty curious to see how well/poorly it would work.
However, I’m only familiar with the emergency phone call use case, for which voice is enough. I’m not familiar with any legal obligation to provide data service, so I guess that if you need that, it’s up to you to negotiate SLAs or have multiple providers.
Reminds me of a data center that said they had a backup connection, and I pointed out that only one fiber was coming into the data center. They said, "Oh, it's on a different lambda[1]" :-)
[1] Wave division multiplexing sends multiple signals over the same fiber by using different wavelengths for different channels. Each wavelength is sometimes referred to as a lambda.
1. The risk, when you use a competitor's service, of your competitor cutting off service, especially at an inopportune time (like your service undergoing a major disruption, where cutting off your OOBM would be kicking you while you are down, but such is business).
2. The risk that you and your competitor unknowingly share a common dependency, like utility lines; if the common dependency fails then both you and your OOBM are offline.
The whole point of paying for and maintaining an OOBM is to manage and compensate for the risks of disruption to your main infrastructure. Why would you knowingly add risks you can't control for on top of a framework meant to help you manage risk? It misses the point of why you have the OOBM in the first place.
Maybe 10-15 years ago there was a local Rogers outage that would have had the #2 failure you're describing. From what I recall, SaskTel had a big bundle of about 3,000 twisted pairs running under a park. Some of those went to a SaskTel tower, some to SaskTel residential wireline customers and some of those went to a Rogers facility. Along comes a backhoe and slices through the entire bundle.
> If your OOB network is your only way of managing things, you not only have to build a separate network, you have to make sure it is fully redundant, because otherwise you've created a single point of failure for (some) management.
I'm not sure I necessarily agree with that. You can set up the network in such a way that you can route over the main network as a backup if your OOB network was down but the main network was up. Obviously, it's not quite as simple as sticking a patch cable between the two networks, but it can be close - you have a machine that's always on your OOB network, and it has an additional port that either configures itself over DHCP or has a hard-coded IP for the main net. But the important thing is that you never have that patched in, except for emergencies like your OOB network cable being severed but you still have access to the main network. If that does happen, you plug it in temporarily and use that machine as a proxy. There's no real reason for extra redundancy in the OOB, because if your main uplink is also severed, there's not really much you're going to be usefully configuring anyway!
In a lot of environments, you can at least choose to restrict what networks can be used to manage equipment; sometimes this is forced on you because the equipment only has a single port it will use for management or must be set to be managed over a single VLAN. Even when it's not forced, you may want to restrict management access as a security measure. If you can't reach a piece of equipment with restricted management access over your management-enabled network or networks, for instance because a fiber link in the middle has failed, you can't manage it (well, remotely, you can usually go there physically to reset or reconfigure it).
You can cross-connect your out of band network to an in-band version of it (give it a VLAN tag, carry it across your regular infrastructure as a backup to its dedicated OOB links, have each location connect the VLAN to the dedicated OOB switches), but this gets increasingly complex as your OOB network itself gets complex (and you still need redundant OOB switches). As part of the complexity, this increases the chances an in-band failure affects your OOB network. For instance, if your OOB network is routed (because it's large), and you use your in-band routers as backup routing to the dedicated OOB routers, and you have an issue where the in-band routers start exporting a zillion routes to everyone they talk to (hi Rogers), you could crash your OOB network routers from the route flood. Oops. You can also do things like mis-configure switches and cross over VLANs, so that the VLAN'd version of your OOB network is suddenly being flooded with another VLAN's traffic.
We might be talking at cross-purposes a bit, but it also seems that you're considering a much larger scale than me, and I hadn't really considered that some people might want to do data-intensive transfers on the management network, e.g. VM snapshots and backups.
Because of how I use it, I was only considering the management port as being for management, and it's separated for security. In the example in the article, there was a management network that was entirely separate from the main network, with a different provider etc. I guess you may have a direct premises-to-premises connection, but I was assuming it'd just be a backup internet connection with a VPN on top of that, so in theory any management network can connect to any other management network, unless its own uplink is severed. Of course, you need ISPs that ultimately have different upstreams.
In the situation that your management network uplink is down, I'd presume that was because of a temporary fault with that ISP, which is different from the provider for your main network uplink. You'd have to be pretty unlucky for that to be down too. Sure, I can foresee a hypothetical situation where you completely trash the routes of your main network and then by some freak incident your management uplink is also severed. But I think the odds are low, because your aim should be to always have the main network working correctly anyway. If you maintain 99.9% uptime on your main network and your management uplink from another provider is also 99.9%, the likelihood of both being down is 0.0001%.
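For what it's worth, the arithmetic (assuming the two failures really are independent, which is exactly what the fate-sharing discussion elsewhere in this thread is about):

    # Chance both uplinks are down at once, assuming independent failures.
    main_down = 1 - 0.999            # 99.9% uptime -> 0.1% downtime
    mgmt_down = 1 - 0.999
    both_down = main_down * mgmt_down
    print(f"{both_down:.4%}")        # 0.0001%, roughly half a minute per year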
I'd also never, ever, ever want a VLAN-based management network, unless that VLAN only exists on your internal routers and is separated out again into individual nets before it goes outside the server rooms. Otherwise, you've completely lost any security benefit of using an isolated network. OTOH, maintaining a parallel backup network on a VLAN that's completely independent of the management network, but which can be easily patched in by someone at that site if you need them to, isn't necessarily a bad thing.
But anyway, these are just my opinions, and it's been a long time since I was last responsible for maintaining a properly large network, so your experience is almost definitely going to be more useful and current than mine.
Because of our (work) situation, I was thinking of an OOB network with its own dedicated connections between sites, instead of the situation where you can plug each site into a 'management' Internet link with protection for your management traffic. However, once your management network gets into each site, the physical management network at that site needs to worry about redundancy if it's the only way to manage critical things there. You don't want to be locked out of a site's router or firewall or the like because a cheap switch on the management network had its power supply fail (and they're likely to be inexpensive because the management network is usually low usage and low port count).
Apart from Rogers and the like, the main OOB/LOM issue is that it's mostly very old iron that very few people know; finding people who know it, and finding solutions that aren't hacky, homegrown, and barely tested, is damn hard.
With launch costs dropping, I wonder if there’s a market for a low bandwidth “ssh via satellite” service. Could use AWS Ground Station to connect to your VPC.
If this is for use during outages, I want to know exactly what network path is used, ideally with as few hops as possible. Starlink can’t guarantee that.
why? from my pov, once i’ve bought the service from the provider, their job is to deliver however they can; not my business, not my problem. my problem is making sure my redundancy (if required) isnt fate sharing.
Because in networking, if you buy two uplinks and don't check the paths they're taking, fate demands that the fiber-seeking backhoe will take out the one duct that, it turns out, both of your "redundant" lines run through.
even with KMZs supplied, this still happens. complications in some cases. but for an IP product (like starlink), i don't see the same equivalence. at what point does fate sharing analysis end in such a scenario?
That's the point. If you want a reliable separate path, you must test it, and you must be prepared to spend time and money on fixing it. The tests include calling up the engineering manager for the separate path and verifying that it has not been "re-groomed" into sharing a path with your primary -- monthly or quarterly, depending on your risk tolerance.
Operations work does not end because the world keeps changing.
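One cheap sanity check, if you can get path data out of both providers at all (KMZs, hop lists, duct IDs), is to diff the two paths for shared segments. A toy version, assuming each path can be reduced to a list of duct/span identifiers; the identifiers here are invented:

    # Toy fate-sharing check: flag any duct/span the two "redundant" paths share.
    def shared_segments(path_a: list[str], path_b: list[str]) -> set[str]:
        return set(path_a) & set(path_b)

    primary   = ["CO-A", "duct-117", "river-crossing-9", "pop-east"]   # invented IDs
    secondary = ["CO-B", "duct-204", "river-crossing-9", "pop-west"]

    overlap = shared_segments(primary, secondary)
    print(f"fate sharing: {sorted(overlap)}" if overlap else "no shared segments on record")

And then re-run it every time either provider re-grooms, which is the part that actually costs money.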
it certainly ends somewhere resembling cost-effective. "reliable" has its meanings in context, and backhoe issues aren't so much of a problem architecturally for starlink.
they have incentive and capability to get that traffic off the shared fate should it occur (even if that extends up to starlink serving one of their IP transit providers for OOB). that's why i question the wisdom of being overly concerned with starlink's particular paths.
I see you haven't met Google's production backbone network(s)... We intentionally didn't connect the Middle East and India (due to a combination of geopolitics and concerns around routing instability), so any traffic between the two would go the long way around the world, incurring a 200+ ms RTT penalty.
Agreed entirely on your point that if you're buying multiple redundant links, you're responsible for making sure that they're actually relying on different underlying fiber spans.
Because it turns out your Starlink connection goes to a ground station in your city which is connected to your network which is the one that is broken. So you can't manage it through Starlink when it's broken.
I'm pretty sure I saw it mentioned that if the source and destination are Starlink dishes then packets will be routed by the satellites directly to the destination dish without going through any ground stations.
That means Starlink can, in fact, guarantee communications during outages (so long as the Starlink network itself isn't down). You just need to have Starlink service at both the send and receive sides and the communication effectively acts as a direct link.
I can only assume it’s based on VLAN for security (and probably dedicated ports assigned to VLANs so regular ports are never able to access the VLAN), but other than that, I have a hard time envisioning in-band management that doesn’t lock you out when the network goes down.
It would protect you against things like DDoS attacks, and you can even assign dedicated (prioritized) access for these management ports.
The blog post is weird. "Rogers didn't even try, so OOB is hard."
Also this sentence makes me question his IQ:
"Some people have gone so far as to suggest that out of band network management is an obvious thing that everyone should have"
Yes Chris, Rogers, the monopoly telco company of Canada, should have an OOB network! They can afford it.
Talking about the challenges of OOB is great, but the point of the blog post is wrong and dumb.
The report says "Rogers had a management network that relied on the Rogers IP core network". They had no OOB network. They didn't even try.
This is a symptom of Rogers' status as a monopoly, negligence on the part of Rogers, and negligence on the part of the government, which should have regulated OOB into existence. This is some serious clown car shit.
One of the advantages that competitor networks provide is redundancy. Canada doesn't have that, so their networks will remain weak. This will probably happen again some day.
Yes OOB is hard, but not even trying and then throwing up your hands and defending the negligent is stupid.
I disagree; Out of Band Network Management (OOBM) is extremely trivial to implement. Most companies, however, don't see the value of OOBM until they have a major fault. The setup costs can be high, and the ongoing operational costs of OOBM infrastructure and links are also significant.
I've built dozens of OOBM networks using fibre and 4G with the likes of Opengear, in many instances deploying OOBM ahead of infrastructure rollouts so hardware can be delivered to site directly from the factory, rather than going through a staging environment, which adds time, cost, and complexity.
OOB for carriers is significantly more complex; especially when you may be the only realistic access option in certain locations. However, given the rise of Starlink I think it becomes closer to "trivial" when the math becomes $100/mo/location + some minimal always-on OOB infrastructure on prem + cloud. Even in heavy-monopoly situations, you can usually guarantee the Starlink to Internet path due to the traffic bypassing the transport carriers on the ground (bent pipe to LEO sat) and landing at IXPs/near telco houses which egress direct to transit carriers.
We had a major incident last month wherein our firewall was totally down. The director at the end suggested that we need to have an RS232 cable for out-of-band communication for such eventualities in the future.
Makes one realize the reliability of RS232 in today’s day and age.
I worry that this misses the point a little. All the OOB in the world will not help you if you cannot reach the management entity (eg IP-enabled PSU, terminal server, etc). It is also insufficient to protect against second order thundering-herd-type problems (e.g.: you log in, stop a worker process, and upstream, traffic is directed away from the node to the others, and starts causing new problems).
In telco operations, every MoP should have: an unambiguous linear sequence of steps, a procedure to verify that the desired result has been achieved, and a backout plan if things do go bad. This is drilled into you at every telco I ever worked at. Rogers' cardinal sin on the day of the outage was that they didn't have a backout plan at each step of the MoP.
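As a rough illustration of that structure (a sketch only, not any telco's actual tooling): every step carries its own verification and backout, and the procedure stops and unwinds the moment verification fails.

    # Sketch of a MoP runner: each step has an apply, a verify, and a backout.
    # If verification fails, everything applied so far is backed out in reverse.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Step:
        name: str
        apply: Callable[[], None]
        verify: Callable[[], bool]
        backout: Callable[[], None]

    def run_mop(steps: list[Step]) -> bool:
        done: list[Step] = []
        for step in steps:
            step.apply()
            done.append(step)
            if not step.verify():
                print(f"step '{step.name}' failed verification; backing out")
                for prior in reversed(done):
                    prior.backout()
                return False
        return True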
More structurally, networks have a dependency graph that you ignore at your peril. X depends on Y, which depends on Z, and so on. And yes, loops are quite possible! OOB management is an attempt to add new links to the graph that only get used in a crisis. These kinds of pull-it-out-when-you-need-it solutions are fine, but have a tendency to fail just when you need them. For one, they don't get exercised enough, and two, they may have their own dependencies on the graph that are not realized until too late.
So, what would this Internet rando prescribe? First order of business is to enumerate the dependency graph. I would wager that BGP, DNS, and the identity system are at or near the very top. Notice the deadly embrace of DNS and ID: if DNS is down, ID fails.
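Making that concrete is not much work: write the dependency graph down as data and let a script find the cycles, because the cycles are the deadly embraces you cannot bootstrap out of. The service names below are just the examples above, not any real topology.

    # Sketch: enumerate the dependency graph and detect cycles such as the
    # DNS <-> identity deadly embrace. Edges mean "depends on".
    def find_cycles(deps: dict[str, set[str]]) -> list[list[str]]:
        cycles, state = [], {}              # state: 1 = visiting, 2 = done

        def visit(node, path):
            state[node] = 1
            for nxt in deps.get(node, ()):
                if state.get(nxt) == 1:                     # back edge -> cycle
                    cycles.append(path[path.index(nxt):] + [nxt])
                elif state.get(nxt) is None:
                    visit(nxt, path + [nxt])
            state[node] = 2

        for n in deps:
            if state.get(n) is None:
                visit(n, [n])
        return cycles

    deps = {
        "BGP":      set(),
        "DNS":      {"identity"},    # admins authenticate to manage DNS
        "identity": {"DNS"},         # the identity service is found via DNS
        "OOB-mgmt": {"identity"},    # even the crisis tool sits on the graph
    }
    print(find_cycles(deps))         # [['DNS', 'identity', 'DNS']]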
Next, study the failure modes of the elements. In the Rogers outage, a lack of route filters crashed a core router. That's a vague word, "crashed". Are we talking core dumps and SEGVs? Are we talking response times that skyrocketed, leading to peers timing out? Rogers really need to understand that. Typically in telco networks when nodes get "congested" like this there are escape valves built into the control plane protocol, eg a response that says "please back off and retry in rand(300)". They need to have a conversation with Cisco/Juniper etc and their router gurus about this.
Finally, the telco industry (or what's left of it) needs to do some introspection about the direction it is pulling vendors. For the last 15 years, telcos have been convinced that if only they can ingest some of that sweet, sweet cloud juice, their software costs will drop, they can slash operations costs, and watch the share price go brrr. Problem is, replacing legacy systems with ones cobbled together by vendors from a patchwork of kubernetes and prayers is guaranteed not to lead to the level of reliability that telcos and their regulators expect. If I'm a Rogers operations manager and my network dies, I don't want to hear that some dude in India has to spend the next week picking through a service mesh and experimenting with multus to decide if turning it off and on again is gonna work.
All great points- it sounds like you have a similar cultural awareness of the telco space. I'll reply to a few things that caught my brain's attention:
> All the OOB in the world will not help you if you cannot reach the management entity (eg IP-enabled PSU, terminal server, etc).
In _healthy_ OOB situations, all of the adjacent OOB infrastructure should be reachable, even if the entire core IP network is completely tanked. The only scenario where this would not apply in my eyes would be a power outage that whacks an entire site including the OOB gear. But in that scenario OOB doesn't help you.
> Next, study the failure modes of the elements. In the Rogers outage, a lack of route filters crashed a core router. That's a vague word, "crashed". Are we talking core dumps and SEGVs? Are we talking response times that skyrocketed, leading to peers timing out? Rogers really need to understand that. Typically in telco networks when nodes get "congested" like this there are escape valves built into the control plane protocol, eg a response that says "please back off and retry in rand(300)". They need to have a conversation with Cisco/Juniper etc and their router gurus about this.
Typically the "crash" is memory exhaustion due to incorrectly configured filtering between either routing protocols, or someone blasting a BGP peer with a large number of unexpected routes. As a former support engineer for BIGCO-ROUTER-COMPANY (either C.. or J..), I can't tell the number of times I've seen people melt down a large sized network due to either exceeding a defined prefix limit (limiting number of routes allowed), or accidentally nuking an ACL controlling route-redistribution, and either cratering all connectivity (no routes), or dump all routes unrestrictedly (no filter), with the latter resulting in memory exhaustion. Luckily, everyone these days working with big routers are culturally conditioned to do change-commit confirmation - if you make a change that blows the box up and isolates it, it will automatically revert the change after a defined period of time.
> Finally, the telco industry (or what's left of it) needs to do some introspection about the direction it is pulling vendors. For the last 15 years, telcos have been convinced that if only they can ingest some of that sweet, sweet cloud juice, their software costs will drop, they can slash operations costs, and watch the share price go brrr. Problem is, replacing legacy systems with ones cobbled together by vendors from a patchwork of kubernetes and prayers is guaranteed not to lead to the level of reliability that telcos and their regulators expect. If I'm a Rogers operations manager and my network dies, I don't want to hear that some dude in India has to spend the next week picking through a service mesh and experimenting with multus to decide if turning it off and on again is gonna work.
I think your perception of the quality of a K8 telco stack is a bit off to be candid. They are not cobbling together random stacks from unvetted vendors/sources. Nearly every telco K8 stack these days is using an off the shelf K8 vendor, and off the shelf K8-compatible services on top, again from (reputable) vendors.
At the end of the day this was a failure of culture and management. The technology is a side conversation.
Who said it is trivial?...
Edit: The article takes the title and describes some straightforward technical and business investments needed to make an OOB management network work.
[1] https://www.youtube.com/watch?v=f_AWAmGi-g8