Service Incident - 30 August 2023 - LMS365 is unavailable in the Australia East region


9/4/2023 7:06 AM

Information from Microsoft Azure Service Health page:



What happened?

Between approximately 08:41 UTC on 30 August 2023 and 06:40 UTC on 1 September 2023 customers may have experienced issues accessing or using Azure, Microsoft 365 and Power Platform services. This event was triggered by a utility power sag in the Australia East region which tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, in an attempt to avoid damage to hardware.

Multiple downstream Azure services with dependencies on this infrastructure were also impacted – including Activity Logs & Alerts, API Management, App Service, Application Insights, Arc enabled Kubernetes, Azure API for FHIR, Backup, Batch, Chaos Studio, Container Apps, Container Registry, Cosmos DB, Databricks, Data Explorer, Data Factory, Database for MySQL flexible servers, Database for PostgreSQL flexible servers, Digital Twins, Device Update for IoT Hub, Event Hubs, ExpressRoute, Health Data Services, HDInsight, IoT Central, IoT Hub, Kubernetes Service (AKS), Logic Apps, Log Analytics, Log Search Alerts, Microsoft Sentinel, NetApp Files, Notification Hubs, Purview, Redis Cache, Relay, Search, Service Bus, Service Fabric, SQL Database, Storage, Stream Analytics, Virtual Machines. A small number of these services experienced prolonged impact, predominantly as a result of dependencies in recovering subsets of Storage, SQL, and/or Cosmos DB services.

What went wrong and why?



Starting at approximately 08:41 UTC on 30 August 2023, a utility power sag in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. We performed our documented Emergency Operational Procedures (EOP) to attempt to bring the chillers back online, but were not successful. The cooling capacity was reduced in two data halls for a prolonged time, so temperatures continued to rise. At 11:34 UTC, infrastructure thermal warnings from components in the affected data halls directed a shutdown of selected compute, network and storage infrastructure – by design, to protect data durability and infrastructure health. This resulted in a loss of service availability for a subset of this Availability Zone.

The cooling capacity for the two affected data halls consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2 redundancy: five units required to carry the load, plus two spares) before the voltage sag event. When the event occurred, all five chillers in operation faulted and did not restart, because the corresponding pumps did not receive the run signal from the chillers; this run signal is integral to successfully restarting the chiller units. In addition, the faulted chillers could not be restarted manually, as the chilled water loop temperature had exceeded the threshold. The two standby chillers attempted to restart automatically; one managed to restart and came back online, while the other restarted but was tripped offline again within minutes. Subsequently, thermal loads had to be reduced by shutting down servers. This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity. When the data hall temperatures were within operational thresholds, we began to restore power to the affected infrastructure and started a phased process to bring it back online. Once all networking and storage infrastructure had power restored, dependent compute scale units were then also restored to operation. As the underlying compute and storage scale units came online, dependent Azure services started to recover, but some services experienced issues coming back online.

From a Storage perspective, seven tenants were impacted – five standard storage tenants, and two premium storage tenants. While all storage data is replicated across multiple storage servers, there were cases in which all of the copies were unavailable due to failures on multiple impacted storage servers. After power restoration, storage nodes started coming back online from 15:25 UTC. Generally speaking, there were three main factors that contributed to delays in bringing storage infrastructure back to full functionality. Firstly, the hardware damaged by the data hall temperatures required extensive troubleshooting. Diagnostics were not able to identify the faults, because the storage nodes themselves were not online – as a result, our onsite datacenter team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting. Secondly, several components needed to be replaced for successful data recovery and to restore impacted nodes. In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers. Thirdly, we identified that our automation was incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.
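
To illustrate the replication point above (data remains readable as long as at least one copy sits on a healthy server, so an extent only becomes unavailable when every server holding a copy is offline), here is a conceptual Python sketch. It is not Azure Storage's actual implementation, and all names in it are hypothetical.

```python
# Conceptual sketch only, not Azure Storage internals; all names are hypothetical.
from dataclasses import dataclass


@dataclass
class Replica:
    server: str
    healthy: bool


def extent_available(replicas: list[Replica]) -> bool:
    """Readable if any copy is hosted on a server that is still online."""
    return any(r.healthy for r in replicas)


# Normal case: copies are spread across servers, so a single failure is tolerated.
print(extent_available([Replica("srv-01", True),
                        Replica("srv-07", False),
                        Replica("srv-12", True)]))   # True

# Incident case: every server hosting a copy sat in the powered-down halls,
# so the data was intact on disk but unavailable until hardware was repaired.
print(extent_available([Replica("srv-01", False),
                        Replica("srv-07", False),
                        Replica("srv-12", False)]))  # False
```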

From a SQL perspective, once power had fully returned to the impacted halls, service restoration was initially impacted by the slow recovery of dependent services, those being primarily Azure Standard Storage offerings. Many general purpose databases remained unavailable until those Azure Storage services had recovered. After storage services were >99% recovered, a single tenant ring that hosted databases failed to recover completely. This ring, hosting approximately 250K databases, had a mix of failure modes: some databases were completely unavailable, some experienced intermittent connectivity issues, and some were fully available. This uneven impact profile for databases in the degraded ring made it difficult to summarize which customers were still impacted, which continued to present a challenge throughout the incident. As we attempted to migrate databases out of the degraded ring, SQL did not have well-tested tools on hand that were built to move databases when the source ring was in a degraded state, and this soon became our largest impediment to mitigating impact. Our standard DB migration workflows are designed to contain many safety and health checks to make sure that the DB being migrated does not experience downtime during the migration. Each of these contact points from the SQL control plane to the underlying Service Fabric (which hosts the DB's compute and coordinates replicas) for the ring was another opportunity for the operation to fail or get stuck. Because every DB move required manual mitigation via scripts, our ability to move quickly was seriously undermined, even once impacted DBs had been identified and moves had been scheduled.
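
To picture why each of those control-plane contact points matters, the following is a purely illustrative Python sketch. It is not the actual SQL migration workflow; every step name and failure probability in it is made up, and it only shows how a workflow gated on source-ring health checks stalls more often when that ring is degraded.

```python
# Purely illustrative sketch of a gated migration workflow; hypothetical names,
# not the actual SQL control plane. Each step must consult the source ring, so
# a degraded ring turns every checkpoint into a chance for the move to stall.
import random

MIGRATION_STEPS = ("validate-source", "seed-target-replica", "catch-up", "swap-primary")


def control_plane_call(step: str, ring_degraded: bool) -> bool:
    """Model a health/safety check against the source ring. When the ring is
    degraded, assume (arbitrarily) a 30% chance that any given call fails."""
    return not ring_degraded or random.random() > 0.3


def migrate_database(db_name: str, ring_degraded: bool) -> bool:
    for step in MIGRATION_STEPS:
        if not control_plane_call(step, ring_degraded):
            print(f"{db_name}: stuck at '{step}', manual script required")
            return False
    print(f"{db_name}: migrated without downtime")
    return True


migrate_database("db-0001", ring_degraded=False)  # healthy ring: all gates pass
migrate_database("db-0002", ring_degraded=True)   # degraded ring: likely to stall
```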

From a Cosmos DB perspective, the service experienced a loss of compute underpinning three clusters, and a loss of the operating system for 11 clusters due to dependencies on Azure Storage. In total, this meant that approximately half of the Cosmos DB clusters in the Australia East region were either down or heavily degraded. Eligible accounts (multi-region with Service Managed Failover enabled) were failed over to their alternate regions to restore availability. A set of internal accounts were not originally configured for failover, so the Cosmos DB team worked with these internal service teams to configure and then fail over those accounts upon request. Accounts that were not eligible had service restored to their partitions once the dependent storage and compute were restored and the Cosmos DB clusters recovered.
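
For readers wondering what makes an account "eligible": Service Managed Failover is a customer-controlled setting on multi-region Cosmos DB accounts. Below is a minimal sketch of how a customer could inspect that setting and, if it is not enabled, manually promote a secondary region instead. It assumes the azure-identity and azure-mgmt-cosmosdb Python packages; the subscription, resource group, account, and region names are placeholders, not anything from this incident.

```python
# Minimal sketch (assumes the azure-mgmt-cosmosdb SDK); all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cosmosdb import CosmosDBManagementClient
from azure.mgmt.cosmosdb.models import FailoverPolicies, FailoverPolicy

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
ACCOUNT_NAME = "<cosmos-account>"       # placeholder

client = CosmosDBManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
account = client.database_accounts.get(RESOURCE_GROUP, ACCOUNT_NAME)

multi_region = len(account.locations) > 1
service_managed = bool(account.enable_automatic_failover)
print(f"multi-region={multi_region}, service-managed failover={service_managed}")

if multi_region and not service_managed:
    # Manually promote a secondary region (example names only) to priority 0,
    # demoting the impaired primary. This is a long-running operation.
    poller = client.database_accounts.begin_failover_priority_change(
        RESOURCE_GROUP,
        ACCOUNT_NAME,
        FailoverPolicies(failover_policies=[
            FailoverPolicy(location_name="Australia Southeast", failover_priority=0),
            FailoverPolicy(location_name="Australia East", failover_priority=1),
        ]),
    )
    poller.result()  # block until the failover completes
```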





How did we respond?



* 30 August 2023 @ 08:41 – Voltage sag occurred on utility power line
* 30 August 2023 @ 08:43 – Five chillers failed to restart
* 30 August 2023 @ 08:45 – Two standby chillers started automatically
* 30 August 2023 @ 08:47 – One standby chiller tripped and went offline
* 30 August 2023 @ 09:41 – Onsite team arrived at the rooftop chiller area
* 30 August 2023 @ 09:42 – Onsite team attempted to manually restart the five chillers, as per the EOP
* 30 August 2023 @ 10:30 – Storage and SQL alerted by monitors about failure rates
* 30 August 2023 @ 10:57 – Initial Cosmos DB impact detected via monitoring
* 30 August 2023 @ 11:15 – Attempts to stabilize the five chillers were unsuccessful after multiple chiller restarts
* 30 August 2023 @ 11:20 – Chiller OEM support team arrives onsite
* 30 August 2023 @ 11:34 – Decision made to shut down infrastructure in the two affected data halls
* 30 August 2023 @ 12:07 – Failover initiated for eligible Cosmos DB accounts
* 30 August 2023 @ 12:12 – Five chillers manually restarted successfully
* 30 August 2023 @ 13:30 – Data hall temperature normalized
* 30 August 2023 @ 14:10 – Safety walkthrough completed for both data halls
* 30 August 2023 @ 14:25 – Decision made to start powering up hardware in the two affected data halls
* 30 August 2023 @ 15:10 – Power restored to all hardware
* 30 August 2023 @ 15:25 – Storage infrastructure started coming back online after power restoration
* 30 August 2023 @ 15:30 – Identified three specific storage tenants still experiencing fault codes
* 30 August 2023 @ 16:00 – Began manual recovery efforts for these three storage tenants
* 30 August 2023 @ 16:13 – Account failover completed for all Cosmos DB accounts
* 30 August 2023 @ 19:29 – Successfully recovered all premium storage tenants
* 30 August 2023 @ 20:29 – All but two SQL nodes recovered
* 30 August 2023 @ 22:35 – Standard storage tenants were recovered, except for one scale unit
* 31 August 2023 @ 04:04 – Restoration of Cosmos DB accounts to Australia East initiated
* 31 August 2023 @ 04:43 – Final Cosmos DB cluster recovered, restoring all traffic for accounts that were not failed over
* 31 August 2023 @ 08:45 – All external customer accounts back online and operating from Australia East
* 1 September 2023 @ 06:40 – Successfully recovered all standard storage tenants





How are we making incidents like this less likely or less impactful?



Based on our initial assessment, we have already identified the following learnings from a datacenter power/cooling perspective. The Final PIR will include additional learnings and repairs based on the service-specific extended recovery timelines.





* Due to the size of the datacenter campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner. We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.
* The EOP for restarting chillers is slow to execute for an event with such a significant blast radius. We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.
* Moving forward, we are evaluating ways to ensure that the load profiles of the various chiller subsets can be prioritized so that chiller restarts will be performed for the highest load profiles first.
* With better insights, the playbook used to sequence workload failovers and equipment shutdowns could have been prioritized differently. We are working to improve reporting on chilled water temperature, to enable more timely failover/shutdown decisions based on thresholds.
* The five chillers did not manage to restart because the corresponding pumps did not receive the run signal from the chillers; this run signal is integral to successfully restarting the chiller units. We are partnering with our OEM vendor to investigate why the chillers did not command their respective pumps to start.
* One standby chiller did not automatically restart due to an unknown error. Our OEM vendor is running diagnostics to understand what caused this specific issue.



How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/VVTQ-J98

8/31/2023 7:52 AM

We can confirm that the issue is now resolved. If you notice any problems going forward, please reach out to us.

8/30/2023 11:11 PM

We’re happy to report that the issues in the Australia East region are now almost fully resolved.

As per Microsoft:

Current Status: With 99% of storage services and 99% of impacted Virtual Machines back online and healthy, we are actively investigating remaining issues with individual downstream services to confirm their recovery status. Our Storage team are making progress on one specific storage scale unit that is still experiencing isolated issues. Our SQL team are investigating a potential issue with an underlying Service Fabric dependency. Our Cosmos DB team are investigating why some services have not fully recovered. Despite these remaining investigations, the majority of customers and services should already be recovered. Further updates will be provided in 60 minutes, or as events warrant.

8/30/2023 7:12 PM

As per Microsoft:

Current Status: Mitigation efforts are continuing; we have made significant progress in restoring core services, and we expect that the vast majority of remaining services will be back online in the next 2-3 hours. After restoring power and stabilizing temperatures, all network infrastructure and 95% of storage services are back online. All premium disk storage has fully recovered, and we continue to work towards mitigating the final remaining storage devices. The majority of underlying compute services are back online, with more than 85% of the Virtual Machines that were impacted now back online and healthy. As a result, many customers of these services have already recovered, but we continue to work with downstream impacted services to ensure that they come back online in the next 2-3 hours as expected. Further updates will be provided in 60 minutes, or as events warrant.

8/30/2023 5:40 PM

Based on our logs, tenants in the Australia East region can now be accessed; however, Microsoft is still reporting that service recovery is in progress.

We will continue monitoring, and we will keep providing updates as soon as we have further information.

Do not hesitate to contact us if you have any issues with loading the system.

8/30/2023 2:32 PM

We are currently aware of an issue with accessing LMS365 in the Australia East region. The root cause is a problem with the cooling units in the datacenter.

As per Microsoft:
Storage and Compute - Australia East - Applying Mitigation

Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one of the Availability Zones. While working to restore the cooling units, temperatures in the datacenter increased so we have proactively powered down a small subset of selected compute and storage scale units to avoid damage to hardware and reduce cooling system load. All impacted storage and compute scale units are in the same datacenter, within one of the region’s three Availability Zones (AZs). Multiple downstream services have been identified as impacted.

Current Status: We do not have an exact ETA at this time, but temperatures in the impacted datacenter have been stabilized. The Azure service recovery process has commenced, and services are expected to progressively return over a number of hours. Due to the nature of this issue, our storage scale units are expected to require additional recovery efforts to ensure all resources return in a consistent state. Note that any new allocations for resources will automatically avoid the impacted scale units. If your workloads are protected by Azure Site Recovery or Azure Backup, we recommend either initiating a failover to the recovery region or recovering using Cross Region Restore. Further updates will be provided in an hour or as events warrant.

FOR MORE INFORMATION
For current system status information about LMS365, check out our system status page. During an incident, you can also receive status updates by subscribing to updates available on our status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.