Optus outage: We can’t afford to have a single point of failure in our telecoms system
Nov 9, 2023
The recent Optus outage cannot be considered a ‘rare occasion.’ Over the last few years, we have witnessed several major outages across the telecoms networks, making it imperative for us to prepare ourselves for such events. We must address the vulnerabilities in our telecoms systems to prevent widespread outages.
Today, over 99% of telecoms traffic comprises data. Virtually every organisation and nearly all Australians rely on data services through their phones and fixed line connections. As we’ve observed, an outage of this magnitude can cause significant disruptions in the economy and people’s private lives. In this case, even the 000 emergency service on landlines was disconnected.
These outages are of national interest, and thus, we require national solutions to mitigate the considerable fallout from such events.
What occurred at Optus was likely a software problem. Such issues occur quite frequently, but most systems recover in seconds or minutes, resulting in minimal disruption. However, in some cases, as appears to have happened this time, a critical fault during a software update can cascade through the computer systems that underpin the network’s operation. Unravelling, fixing, and bringing all these different systems back online can take hours, and sometimes even days. Moreover, the systems cannot all be brought back online simultaneously; they need to be restarted one by one, further extending the recovery time.
In the end, this is an infrastructure problem.
There are essentially two long-term solutions for such events.
The first one pertains to the individual networks of the operators. It is unacceptable for there to be a single point of failure in a network that can bring down an entire country, or as seen before, the entire East Coast. With over 100 years of telecoms experience and a wealth of engineering knowledge and skills, networks can be designed to eliminate single points of failure. In the event of a disruption, traffic should be rerouted through other network systems. In other words, there should be duplicated, unconnected systems where one can take over from the other in emergencies.
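To make the idea of duplicated systems taking over from one another concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how Optus or any other operator builds its network; the names `NetworkPath`, `route_traffic`, `primary-core` and `standby-core` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NetworkPath:
    """A hypothetical, independently operated routing path."""
    name: str
    healthy: bool = True


def route_traffic(paths):
    """Return the name of the first healthy path, in priority order."""
    for path in paths:
        if path.healthy:
            return path.name
    # Only reached when every path is down, i.e. no redundancy is left.
    raise RuntimeError("total outage: no healthy path available")


primary = NetworkPath("primary-core")
standby = NetworkPath("standby-core")

print(route_traffic([primary, standby]))  # primary-core

# Simulate a fault that takes the primary system offline.
primary.healthy = False
print(route_traffic([primary, standby]))  # standby-core
```

The point of the sketch is that the standby only helps if it does not share the fault that took down the primary, which is why the duplicated systems need to be unconnected.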
The other solution involves the combined telecoms infrastructure in Australia. In case of an emergency, there should be a ‘gateway’ facility connecting the networks, allowing them to take over traffic from one another. In the case of mobile networks, I have advocated for this for over 20 years; the solution is called roaming. After government pressure, an announcement was finally made last week that roaming via mobile networks is now possible in emergency situations such as bushfires or floods. It’s technically feasible, and we should explore its use in other emergency scenarios like the one we experienced today.
The reason for the delay in implementing this in Australia is the resistance from telecoms companies. They view the size of their networks as a competitive advantage and question why they should allow others to use their network.
The issue is that these networks aren’t just commercial operations; they are vital infrastructure for our society and economy. Protecting the national interest in the face of serious network failures is paramount. Implementing such solutions requires the government’s commitment and the regulatory authority’s influence.
However, there is also a responsibility on the part of users, both organisations and individuals, to acknowledge that such events will happen and assess their vulnerability. For example, if an outage means that a company’s sales system goes down, financial systems shut down, transport systems don’t work, or emergency operations fail, these organisations need to consider their own redundancy solutions.
For individuals, it’s important to be prepared. Are people familiar with communication methods such as WhatsApp, Skype, FaceTime, etc.? In today’s emergency, these systems still function, provided you can reach the internet through Wi-Fi or another network. Mobile phones are increasingly software-based, using eSIMs (no physical SIM cards needed), which allow you to switch between operators, like switching from Optus to another operator in a situation like this.
Solutions need to encompass all these aspects. Networks must be more resilient, and users must explore their options in such situations. One thing is certain: more outages will occur, so preparedness is crucial.
First published by Paul Budde November 8, 2023