Optus has given its fullest account of what it thinks caused the November 8 outage: default settings in its Cisco provider edge (PE) routers that led to around 90 shutting down nationwide.
The attribution is an evolution of its previous explanation that an “international peering network” had fed it bad data.
News reports this week identified that peer as the Singtel internet exchange (STiX), and partially identified the cause as a software upgrade on Singtel’s end.
Singtel disputed that account on Thursday, instead identifying “preset failsafe” mechanisms in Optus’ routers as the cause, which appears to be the more accurate explanation. Optus confirmed that account in a submission filed late on Thursday, ahead of a senate appearance on Friday.
“It is now understood that the outage occurred due to approximately 90 PE [provider edge] routers automatically self-isolating in order to protect themselves from an overload of IP routing information,” Optus said. [pdf]
“These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).”
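The submission does not detail how the limits work or what values were set. As a rough illustration only, a generic prefix-limit failsafe can be sketched as below, where `PeerSession`, `max_prefixes` and `warn_ratio` are hypothetical names and values rather than anything taken from Optus or Cisco.

```python
from dataclasses import dataclass, field


@dataclass
class PeerSession:
    """Toy model of a routing session with a prefix-limit failsafe (hypothetical)."""
    max_prefixes: int              # assumed hard limit; real values are not public
    warn_ratio: float = 0.75       # assumed warning level as a share of the limit
    prefixes: set = field(default_factory=set)
    state: str = "established"
    alerts: list = field(default_factory=list)
    warned: bool = False

    def receive(self, prefix: str) -> None:
        """Accept one advertised prefix, isolating the session if the limit is passed."""
        if self.state != "established":
            return  # an isolated session ignores further updates
        self.prefixes.add(prefix)
        count = len(self.prefixes)
        if count > self.max_prefixes:
            # Failsafe: discard learned routes and cut the session off from the peer.
            self.prefixes.clear()
            self.state = "isolated"
            self.alerts.append(f"limit exceeded at {count} prefixes; session isolated")
        elif not self.warned and count >= self.warn_ratio * self.max_prefixes:
            self.warned = True
            self.alerts.append(f"warning: {count} of {self.max_prefixes} prefixes")


# Feed a session more prefixes than its limit allows.
session = PeerSession(max_prefixes=4)
for i in range(6):
    session.receive(f"2001:db8:{i}::/48")
print(session.state, session.alerts)
```

In this sketch the session stays down until something resets it, which is consistent with Optus' account that routers had to be reset or rebooted, though the actual recovery behaviour of its equipment is not described.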
Optus said the “unexpected overload” of routing information came via “an alternate Singtel peering router”, because the primary or usual router hardware that Optus took route information from was under planned maintenance.
The telco said an unspecified software upgrade was being performed at one STiX location in North America - which Singtel confirms [pdf].
Optus suggests the upgrade led to the bad route information being propagated, though it is unclear why, but now says this “was not the cause of the incident” in Australia.
Instead, it puts the blame on the edge router “safety” defaults. It does not say why the default settings were used, to what extent it had the ability to tweak the settings, or how long the routers had operated with these defaults in place.
Optus said a team of 150 engineers and technicians were directly involved in the investigation and restoration, supported by another 250 staff and five vendors.
Six theories
For the first six hours or so, the engineers pursued six different possible explanations for the large-scale outage.
These included whether works overnight by Optus itself were the cause; it rolled back those changes but found no resolution.
Other options simultaneously explored included whether it was a DDoS attack, a network authentication issue, or problems with other vendors such as its content delivery network provider.
One explanation, however, became the “leading hypothesis for network restoration”: equipment logs and alerts that “showed multiple Border Gateway Protocol (BGP) IPv6 prefixes exceeding threshold alerts.”
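Optus has not published the logs or the thresholds involved. A minimal sketch of the kind of check that could raise such an alert is shown below, assuming a hypothetical feed of advertised prefixes and hypothetical per-family thresholds; it relies only on Python's standard `ipaddress` module.

```python
import ipaddress
from collections import Counter


def count_by_family(prefixes):
    """Count distinct prefixes per IP version (4 or 6)."""
    networks = {ipaddress.ip_network(p, strict=False) for p in prefixes}
    return Counter(net.version for net in networks)


def threshold_alerts(prefixes, thresholds):
    """Return one alert string per address family whose count exceeds its threshold."""
    counts = count_by_family(prefixes)
    return [
        f"IPv{version} prefixes {counts[version]} exceed threshold {limit}"
        for version, limit in sorted(thresholds.items())
        if counts[version] > limit
    ]


# Hypothetical feed and thresholds, using documentation address ranges.
feed = ["2001:db8::/32", "2001:db8:1::/48", "2001:db8:2::/48", "192.0.2.0/24"]
print(threshold_alerts(feed, {4: 10, 6: 2}))
```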
“We identified that resetting routing connectivity addressed the loss of network services. This occurred at 10:21am,” Optus said.
Engineers then set about “resetting and clearing routing connectivity on network elements which had disconnected themselves from the network, physically rebooting and reconnecting some network elements to restore connectivity, [and] carefully and methodically re-introducing traffic onto the mobile data and voice core to avoid a signalling surge on the network,” it said.
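Optus did not say how the traffic was staged. One common way to avoid a surge is to readmit load in batches that grow gradually, sketched below under that assumption; `ramp_schedule` and all of its parameters are hypothetical.

```python
def ramp_schedule(total_sessions, first_batch, growth=2.0, max_batch=None):
    """Yield batch sizes that grow step by step until all sessions are readmitted."""
    if first_batch < 1 or growth < 1:
        raise ValueError("first_batch and growth must both be at least 1")
    remaining = total_sessions
    batch = float(first_batch)
    while remaining > 0:
        size = min(int(batch), remaining)
        if max_batch is not None:
            size = min(size, max_batch)  # cap any single step
        yield size
        remaining -= size
        batch *= growth


# Readmit 1000 sessions, starting with 100 and never more than 400 in one step.
print(list(ramp_schedule(1000, 100, growth=2.0, max_batch=400)))
```

Capping each step limits how many devices attempt to re-register at once, at the cost of a longer overall restoration.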
Engineers performed unspecified “resiliency” works on the network between resolution on November 8 and the following Monday, November 13.
Optus foreshadowed more work to come.
“We are committed to learning from this event and continue to invest heavily, working with our international vendors and partners, to increase the resilience of our network,” it said.
“We will also support and will fully cooperate with the reviews being undertaken by the government and the senate.”
Defends customer comms
Optus used other parts of its submission to defend its customer communications on the outage day.
Its position is that as consumer and some enterprise services were out, media - traditional and social - was considered the best way to get the word out.
That is likely to be challenged in the senate inquiry.
The other issue the senate is likely to raise is financial compensation for customers.
So far, Optus has offered users extra data quota, which has been criticised in some circles.
While there is an argument that businesses, in particular, lost money while the network was down, there is a counterargument that businesses should have their own backup connectivity in the event their primary service is down.
To what extent the senate can resolve that is unclear.
Financial compensation unprecedented
In its submission, however, Optus argues that making a telco pay financial compensation for “consequential losses” isn’t a precedent that should be set.
“There is no precedent for compensation being paid by telecommunications providers to all business customers who suffer a loss of business as a result of an outage of the kind that occurred on November 8, either here or overseas,” Optus said.
“We understand that this would create a new precedent that would extend far beyond Optus and apply to all other telecommunications providers, as well as other providers of essential services, critical infrastructure and public services.
“This makes it a much broader policy question for government that would have far reaching implications across many sectors of the economy and the cost of these services for Australian consumers.”
Optus said that it isn’t the first to suffer a sizeable outage in Australia, nor would the November 8 outage be the last incident of its type.
“It is an unfortunate reality in our reliant digital age that no communications network can completely protect against, nor prevent, these types of occurrences from ever happening – despite the investments made or resiliency efforts undertaken,” it said.
“Reflecting this, communications services are not provided with a guarantee of continuous service.
“Given continuity of service is not guaranteed, consumers are not given an automatic right of compensation whenever an outage occurs.”