Microsoft had three staff at Australian data centre campus when Azure went out

Cascading failures and root causes revealed.

Microsoft had “insufficient” staff levels at its Australian data centre campus last week when a power sag knocked the chiller plant serving two data halls offline, cooking portions of its storage hardware.

The company has released a preliminary post-incident report (PIR) for the large-scale failure, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.

The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down - or had components fried - in the incident that data, and all replicas of it, were offline.
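
Why every replica mattered is easier to see with a toy model. The sketch below assumes a simple three-replica layout with made-up node names; it is illustrative only and is not Azure Storage's actual internals.

def data_available(replica_nodes, online_nodes):
    # Data stays reachable while at least one replica sits on an online node.
    return any(node in online_nodes for node in replica_nodes)

replicas = {"node-12", "node-47", "node-88"}   # hypothetical replica placement
online = {"node-03", "node-15"}                # none of the replica hosts are up

print(data_available(replicas, online))        # False: data, and all replicas of it, offline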

In addition, after storage nodes were finally recovered, a "tenant ring" hosting over 250,000 databases failed - albeit with uneven impact on customers.

Chillers offline

Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.

A power sag - a voltage dip - caused the five operating chillers to fault, and only one of the two standby units came online in their place.
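
As a rough illustration of that N+2 arithmetic (the unit counts come from the PIR, everything else is assumed), the sketch below shows why losing all five operating chillers and one standby left the plant short:

total_chillers = 7          # seven chillers served the two data halls
needed_for_load = 5         # five in operation (N), two on standby (+2)

running_after_sag = 0       # the voltage sag faulted all five operating chillers
standby_started = 1         # only one of the two standby units came online

available = running_after_sag + standby_started
print(available >= needed_for_load)   # False: cooling capacity lost for both halls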

Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”

The company appeared to be caught out by the scale of the incident, with too few staff onsite and emergency procedures that did not cater for an event of this size.

“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.

“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”

On its EOP, Microsoft said: “The EOP for restarting chillers is slow to execute for an event with such a significant blast radius.”

“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.”

While there weren’t enough staff to execute the documented procedures quickly, more staff would likely only have reached the same outcome faster, because the chillers themselves had issues that prevented a restart.

Preliminary investigations showed the chiller plant did not automatically restart “because the corresponding pumps did not get the run signal from the chillers.”

“This is important as it is integral to the successful restarting of the chiller units,” Microsoft said.

“We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start.”

Microsoft said the faulted chillers could not be manually restarted “as the chilled water loop temperature had exceeded the threshold.”

With temperatures rising and infrastructure issuing thermal warnings, Microsoft had no choice but to shut down servers.

“This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity,” it said.
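
A minimal sketch of that restart precondition, with an assumed threshold value since the PIR does not state one:

CHILLED_WATER_RESTART_THRESHOLD_C = 25.0   # assumed figure; the actual threshold is not disclosed

def can_manually_restart(loop_temp_c):
    # Chillers could only be restarted once the loop temperature fell back below the threshold.
    return loop_temp_c < CHILLED_WATER_RESTART_THRESHOLD_C

print(can_manually_restart(31.0))   # False: too hot to restart, so server load had to be shed first
print(can_manually_restart(23.5))   # True: threshold cleared after servers were shut down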

Storage, SQL database recovery

Still, not everything recovered smoothly.

The incident impacted seven storage tenants - five “standard”, two “premium”.

Some storage hardware was “damaged by the data hall temperatures”, Microsoft said. 

Diagnostics weren’t available for troubleshooting because the storage nodes were offline.

“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” Microsoft said.

“Several components needed to be replaced for successful data recovery and to restore impacted nodes. 

“In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers.”

An infrastructure-as-code automation also failed, “incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”
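
Microsoft has not detailed how the automation mishandled stale requests, but a generic guard of the kind that prevents this class of failure might look like the following sketch (the names and the freshness window are assumptions, not Microsoft's tooling):

from datetime import datetime, timedelta, timezone

MAX_REQUEST_AGE = timedelta(minutes=10)   # assumed freshness window

def should_approve(request_created_at, node_probe_is_healthy):
    # Only act on requests that are fresh and backed by a recent health probe.
    age = datetime.now(timezone.utc) - request_created_at
    if age > MAX_REQUEST_AGE:
        return False   # stale request: re-evaluate rather than approving blindly
    return node_probe_is_healthy

stale = datetime.now(timezone.utc) - timedelta(hours=6)
print(should_approve(stale, True))   # False: the request predates the current state of the incident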

The failure of a tenant ring hosting over 250,000 SQL databases further slowed recovery, Microsoft said.

“As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in [a] degraded health scenario,” the company said.

“Soon this became our largest impediment to mitigating impact.”

A final PIR is expected to be completed in a few weeks.
