10/30/2025 7:59 PM
On October 29, 2025, our platform experienced intermittent latencies, timeouts, and errors due to a widespread disruption in Azure Front Door (AFD), a critical component of Azure-based infrastructure. This affected our services reliant on AFD for global content delivery, impacting customer access and performance. As a consumer of Azure services, we were directly affected by this Microsoft-managed incident. Below is a summary of the root cause, resolution efforts, and planned remediation based on Microsoft's preliminary incident report (PIR).
Root Cause Identification and Analysis
The outage was triggered by an inadvertent tenant configuration change by Microsoft within Azure Front Door, which introduced an invalid or inconsistent configuration state. This caused a significant number of AFD nodes to fail to load properly, resulting in increased latencies, timeouts, and connection errors across dependent services, including our platform.
As unhealthy nodes were removed from the global pool, traffic became imbalanced, exacerbating the issue and leading to intermittent availability even in partially healthy regions. The root trigger was a faulty tenant configuration deployment process, where Microsoft's protection mechanisms—designed to validate and block erroneous deployments—failed due to a software defect. This allowed the invalid configuration to bypass safety checks and propagate globally.
Our internal monitoring confirmed the impact aligned with Microsoft's timeline, starting around 15:45 UTC on October 29, with services like Azure App Service and Azure SQL Database (which underpin our SaaS offerings) experiencing degraded performance.
Resolution
Microsoft's response began shortly after the incident onset. Key steps included:
* At 16:04 UTC on October 29, investigation commenced following monitoring alerts.
* By 16:15 UTC, teams examined recent configuration changes in AFD.
* Initial communications were posted to the public status page at 16:18 UTC, with targeted notifications to impacted customers (including us) via Azure Service Health at 16:20 UTC.
* To contain the issue, all new customer configuration changes to AFD were blocked at 17:30 UTC.
* A "last known good" configuration was deployed starting at 17:40 UTC, with global pushes beginning at 18:30 UTC.
* Manual node recovery and gradual traffic rebalancing to healthy nodes started at 18:45 UTC.
* By 00:05 UTC on October 30, the impact was fully mitigated for most customers, though a small subset (not affecting our platform) experienced lingering issues.
From our side, we monitored the situation via Azure Service Health alerts. Our platform returned to normal operation in line with Microsoft's recovery, with no residual issues reported by our users.
Remediation Items
As this incident originated from Microsoft's internal processes, there are limited direct actions we can take to prevent similar Azure-managed outages in the future. However, Microsoft has already implemented immediate safeguards, including a review of their protection mechanisms and the addition of enhanced validation and rollback controls to block faulty deployments.
From Microsoft's updates:
An unintended configuration change in Azure Front Door (AFD) caused a global service disruption by introducing an invalid state that prevented numerous nodes from loading properly. This resulted in latency, timeouts, and uneven traffic distribution. The root cause was a deployment process flaw combined with a software defect that allowed the invalid configuration to bypass validation. Microsoft restored stability by rolling back to the last known good configuration and has since implemented additional safeguards to prevent recurrence.
For more details, please refer to the Microsoft status page: https://azure.status.microsoft/en-us/status/history/
If you experience any other issues, please don’t hesitate to contact our technical support by submitting a ticket: https://helpcenter.zensai.com/hc/en-us/articles/360019112178-Get-help-from-the-Zensai-Product-Support-team
We are pleased to report that functionality is beginning to return across all regions. Access to all aspects of Learn365 and Perform & Engage 365 should now start to be restored for users.
Please be aware that Microsoft is continuing efforts to fully resolve the issue, so you may still experience intermittent access or degraded performance during this time. We appreciate your patience and will provide further updates as the situation evolves.
Microsoft has identified the services affected by the outage:
Affected Azure services include, but are not limited to: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Container Registry, Media Services, Microsoft Defender External Attack Surface Management, Microsoft Entra ID, Microsoft Purview, Microsoft Sentinel, Video Indexer, and Virtual Desktop.
Per Microsoft:
As recovery progresses, some requests may still land on unhealthy nodes, resulting in intermittent failures or reduced availability until more nodes are fully restored. This recovery effort involves reloading configurations and rebalancing traffic across a large volume of nodes to restore full operational scale. The process is gradual by design, ensuring stability and preventing overload as dependent services recover. We expect continued improvement across affected regions. This means we expect recovery to happen by 23:20 UTC on 29 October 2025.
Current status from Microsoft:
We initiated the deployment of our ‘last known good’ configuration, which has now successfully been completed. Customers may have begun to see initial signs of recovery. We are currently recovering nodes and routing traffic through healthy nodes, and as we make progress in this workstream, customers will continue to see improvement.
Current status from Microsoft:
We have confirmed an inadvertent configuration change as the trigger event for this issue.
We have initiated the deployment of our 'last known good' configuration. This is expected to be fully deployed in about 30 minutes from which point customers will start to see initial signs of recovery. Once this is completed, the next stage is to start to recover nodes while we route traffic through these healthy nodes.
Customer configuration changes will remain blocked during this time as we work towards mitigation. We will communicate to customers when this block is reverted.
Microsoft has reported the following:
"We began experiencing Azure Front Door issues resulting in a loss of availability of some services. In addition, customers may experience issues accessing the Azure Portal."
We are continuing to assess the situation.
Starting at approximately 16:00 UTC, Microsoft reported DNS issues that are affecting access to the Azure Portal and potentially other services. Microsoft has initiated mitigation steps expected to restore portal access shortly and is actively investigating the root cause.
10/29/2025 4:26 PM
We are continuing to investigate this issue.
10/29/2025 4:11 PM
We are aware of an Azure service outage and are actively investigating the issue.
FOR MORE INFORMATION
For current system status information about Learn365, check out our system status page. During an incident, you can also receive status updates by subscribing to updates available on our status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.