12/11/2024 1:37 PM
Incident Summary
On December 9th, we identified an issue in which a bulk enrollment of large AD groups triggered inefficient processing in the User Service, which degraded the performance of other services hosted on the same App Service Plan.
The root cause was an error in our code that allowed the User Service to operate inefficiently under certain conditions. We identified the issue swiftly and rolled out a fix the same day to all regions to prevent further occurrences.
Impact
* Region Affected: UK
* Performance Degradation Periods:
  * The most significant impact occurred on Monday, December 9th, between 10:59 and 11:26 UTC. During this period the system remained accessible but loading times were higher than normal.
Resolution
* Immediate Fix:
  * Updated the User Service code to handle enrollment operations more efficiently.
  * Deployed the fix to all regions to mitigate potential impact elsewhere.
* Monitoring:
  * Our existing monitoring systems quickly pinpointed the source of the problem, enabling us to act swiftly.
Mitigation Steps
To reduce the likelihood of similar incidents in the future, we are taking the following steps:
* Improved QA Protocols:
  * Enhance testing for high-load scenarios, particularly for bulk operations.
* Monitoring System Enhancements:
  * Implement additional resource usage alerts to detect and isolate high-impact operations earlier.
We sincerely apologize for the inconvenience caused to our users, particularly in the UK region. We deeply regret any disruption this may have caused and are committed to learning from this incident to serve you better in the future.
This incident has been resolved.
12/10/2024 9:42 AM
We have implemented a fix for the issue and are actively monitoring the situation.
We sincerely apologize for any inconvenience caused and remain committed to delivering reliable services.
We are aware of a potential issue that could affect customers in the UK region. We are actively analyzing the situation and will provide a status update as our investigation continues.
FOR MORE INFORMATION
For current system status information about LMS365, check out our system status page. During an incident, you can also receive status updates by subscribing to updates available on our status page. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.