
Microsoft had three staff at Australian data centre campus when Azure went out

By Ry Crozier
Sep 4 2023 6:55AM

Cascading failures and root causes revealed.

Microsoft had “insufficient” staff levels at its data centre campus last week when a power sag knocked the chiller plant serving two data halls offline, cooking portions of its storage hardware.


The company has released a preliminary post-incident report (PIR) for the large-scale failure, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.

The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down - or had components fried - in the incident that data, and all replicas of it, were offline.
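Azure storage keeps multiple copies of each piece of data, so customers only lose access when every node holding a replica is down at once. A minimal Python sketch makes the point - the node names and three-way replication here are illustrative assumptions, not details from the PIR:

    # Data stays readable while at least one node holding a replica is up.
    # Node names and the three-way replication factor are assumptions.

    def data_available(replica_nodes: set, up_nodes: set) -> bool:
        return bool(replica_nodes & up_nodes)

    replicas = {"node-a", "node-b", "node-c"}   # hypothetical placement

    # An ordinary failure leaves other replicas serving the data.
    assert data_available(replicas, up_nodes={"node-b", "node-c", "node-d"})

    # In this incident, so many nodes were shut down or damaged that
    # some data had no surviving replica at all.
    assert not data_available(replicas, up_nodes={"node-d", "node-e"})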

In addition, after storage nodes were finally recovered, a “tenant ring” hosting over 250,000 databases failed - albeit with uneven impact on customers.

Chillers offline

Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.

A power sag - a voltage dip - caused the five operating chillers to fault, and only one of the two standby units worked.
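N+2 simply means the plant carries two more chillers than the load requires, so it can ride out any two failures. A short sketch of that arithmetic - only the unit counts come from the PIR; everything else is illustrative:

    # N+2 chiller redundancy arithmetic. Only the unit counts (five
    # running, two standby) come from Microsoft's PIR.

    REQUIRED = 5   # N: chillers needed to carry the two halls' heat load
    TOTAL = 7      # N+2: five operating plus two standby units

    def cooling_ok(faulted_running: int, failed_standby: int) -> bool:
        healthy = TOTAL - faulted_running - failed_standby
        return healthy >= REQUIRED

    # Redundancy absorbs any two failures...
    assert cooling_ok(faulted_running=2, failed_standby=0)

    # ...but last week all five operating chillers faulted and one
    # standby would not start, leaving one healthy unit of seven.
    assert not cooling_ok(faulted_running=5, failed_standby=1)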

Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”

The company appeared to have been caught out by the scale of the incident: it had too few staff onsite, and its emergency procedures did not cater for an issue of this size.

“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.

“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”

On its EOP, Microsoft said: “The EOP for restarting chillers is slow to execute for an event with such a significant blast radius.”

“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.”

While there weren’t enough staff to execute the documented procedures, more staff would likely only have reached the same result sooner, because the chillers themselves had faults.

Preliminary investigations showed the chiller plant did not automatically restart “because the corresponding pumps did not get the run signal from the chillers.”

“This is important as it is integral to the successful restarting of the chiller units,” Microsoft said.

“We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start.”
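Microsoft’s description implies an interlock: a chiller’s restart sequence cannot complete unless the chiller commands its chilled-water pump to run. A hypothetical sketch of that control logic - the structure and names are assumptions, not the OEM’s actual controls:

    # Hypothetical sketch of the chiller/pump interlock the PIR describes:
    # no pump run signal means no water flow, so the restart aborts.

    from dataclasses import dataclass

    @dataclass
    class Chiller:
        name: str
        sends_pump_run_signal: bool   # the behaviour that failed here

    def try_auto_restart(chiller: Chiller) -> bool:
        if not chiller.sends_pump_run_signal:
            print(f"{chiller.name}: no pump run signal, restart aborted")
            return False
        print(f"{chiller.name}: pump running, restart sequence continues")
        return True

    try_auto_restart(Chiller("CH-1", sends_pump_run_signal=False))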

Microsoft said the faulted chillers could not be manually restarted “as the chilled water loop temperature had exceeded the threshold.”

With temperatures rising and infrastructure issuing thermal warnings, Microsoft had no choice but to shut down servers.

“This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity,” it said.
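The lock-out and the load shedding fit a simple threshold model: chillers refuse a manual restart while the chilled water loop is too hot, and with no cooling running, the only way to lower the loop temperature is to remove heat sources. A sketch with invented numbers:

    # Sketch of the restart lock-out: manual restart is refused while
    # the chilled water loop exceeds a threshold. Temperatures invented.

    LOOP_RESTART_THRESHOLD_C = 25.0   # hypothetical cut-off

    def manual_restart_allowed(loop_temp_c: float) -> bool:
        return loop_temp_c <= LOOP_RESTART_THRESHOLD_C

    loop_temp = 31.0                  # climbing while servers stay on
    assert not manual_restart_allowed(loop_temp)

    # Shutting servers down removes the heat load; the loop cools
    # until the threshold is met and the chillers can be restarted.
    loop_temp -= 7.0
    assert manual_restart_allowed(loop_temp)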

Storage, SQL database recovery

Still, not everything recovered smoothly.

The incident impacted seven storage tenants - five “standard” and two “premium”.

Some storage hardware was “damaged by the data hall temperatures”, Microsoft said. 

Diagnostics weren’t available for troubleshooting because the storage nodes were offline.

“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” Microsoft said.

“Several components needed to be replaced for successful data recovery and to restore impacted nodes. 

“In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers.”

An infrastructure-as-code automation also failed, “incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”
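Microsoft gives no detail on the automation, but “approving stale requests” suggests requests were acted on after the node state they described had changed. One common guard, sketched here as a hypothetical - not Microsoft’s actual fix - is to reject any request older than a freshness window:

    # Hypothetical guard against the failure mode Microsoft describes:
    # drop requests whose snapshot of node health is too old to trust.

    import time

    MAX_REQUEST_AGE_S = 300.0   # assumed freshness window

    def should_approve(request_ts: float, now: float) -> bool:
        return (now - request_ts) <= MAX_REQUEST_AGE_S

    now = time.time()
    assert should_approve(request_ts=now - 60, now=now)    # fresh request

    # A verdict recorded an hour earlier describes pre-recovery state
    # and is dropped rather than marking a recovered node unhealthy.
    assert not should_approve(request_ts=now - 3600, now=now)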

The failure of a tenant ring hosting over 250,000 SQL databases further slowed recovery, Microsoft said.

“As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in [a] degraded health scenario,” the company said.

“Soon this became our largest impediment to mitigating impact.”
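Draining a degraded ring is harder than draining a healthy one because the source can time out or fail individual moves. A toy sketch - this is not Microsoft’s tooling, just an illustration of what degraded-source-aware migration has to handle:

    # Toy illustration (not Microsoft's tooling): draining databases off
    # a degraded ring needs per-database retries and failure isolation.

    import random

    def move_database(db: str) -> bool:
        """Stand-in for a real migration call; fails intermittently."""
        return random.random() > 0.4

    def drain_degraded_ring(databases, max_attempts: int = 5):
        stuck = []
        for db in databases:
            if not any(move_database(db) for _ in range(max_attempts)):
                stuck.append(db)   # isolate failures, keep draining the rest
        return stuck

    remaining = drain_degraded_ring([f"db-{i}" for i in range(10)])
    print(f"{len(remaining)} databases could not be moved")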

A final PIR is expected to be completed in a few weeks.

Tags: azure, cloud, microsoft, outage, storage
