


Preliminary Post Incident Review Report For Microsoft 365

Report Date: December 12, 2019
Report By: ICC

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

The descriptions of other companies’ products in this document, if any, are provided only as a convenience to you. Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult their respective manufacturers.

© 2019 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express authorization of Microsoft Corp. is strictly prohibited.

Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Microsoft 365 Customer Ready Post Incident Review

Incident Information

Important Note
This is a preliminary Post Incident Report (PIR) that is being delivered prior to full incident resolution to provide early insight into details of the issue. The information in this PIR is preliminary and subject to change. A final PIR will be provided within five (5) business days from full event resolution and will supersede this document upon publication.

Incident ID: SP197263, OD197264

Incident Title: Can't access sites

Service(s) Impacted: SharePoint Online, OneDrive for Business

User Impact: Users would have primarily experienced latency when opening OneDrive for Business content, or been shown “503 Service Unavailable” or “Server Busy” failure messages when accessing SharePoint Online.
Additionally, users may also have seen impact to ancillary services with dependencies on SharePoint Online, such as PowerApps, Power Automate, OneDrive Sync, Word, Excel, OneNote, Teams, and background applications.
During the incident mitigation process, service throttles were applied to specific workloads to reduce availability impact.
As a result, while the throttles were in place, users could have experienced impact to migrations, large file uploads, and OneDrive for Business or OneNote syncs.

Scope of Impact: This issue affected approximately 35 percent of SharePoint Online and OneDrive for Business capacity in Europe and North America, causing intermittent service access issues for users in these regions.

Incident Start Date and Time: Monday, December 2, 2019, at 2:30 PM UTC

Incident End Date and Time: Tuesday, December 10, 2019, at 9:00 PM UTC

Root Cause: A configuration change to update SharePoint Online caching infrastructure from IPv4 to IPv6 led to increased failure rates for user request traffic. As a result, users would have experienced multiple performance issues, such as latency and timeouts, when accessing SharePoint Online and OneDrive for Business, or when utilizing other dependent services.
Once the configuration was reverted and the service throttles lifted, the SharePoint Online service experienced a secondary layer of impact from the previously throttled request traffic. Additional throttles were applied to limit the ingestion speed of this traffic, which resulted in further latency until the backlog was fully processed.

Specific Questions About the Issue

Q. Why did it take so long for Microsoft to mitigate impact?
Unfortunately, the issue was complex. We identified and investigated several recent changes that had been designed to improve the responsiveness of the service. It wasn't until Friday that the root cause of the issue was fully understood and the offending change was reverted.

Q. Why didn’t Microsoft fail over to alternative infrastructure?
In certain SharePoint farms, a failover was completed in an attempt to mitigate impact. However, because the alternative infrastructure had received the configuration changes in parallel with the primary infrastructure, this action did not fully resolve the issue.

Q. How was the underlying change tested?
After initial validation, the underlying change was further tested during a slow implementation phase across portions of the service environment. The implementation took place over the course of the past several months, during which any identified issues were quickly addressed. Unfortunately, the prior testing did not find or anticipate the impact we experienced once this change reached a certain deployment threshold.

Q. Why wasn’t the issue caught in testing?
This issue had not been observed during previous changes to, or testing of, the move of our caching infrastructure from IPv4 to IPv6.

Q. Can Microsoft provide the specifics of which features in SharePoint were throttled?
We implemented a dynamic throttle on non-critical SharePoint and OneDrive tasks to systematically lower overall resource utilization. In addition, we throttled a subset of system tasks. These throttles would have also limited migrations, large file uploads, and background sync actions for OneNote and OneDrive.

Q. How did Microsoft determine which features to throttle?
The SharePoint service chooses to throttle operations that are not initiated by a customer. We prioritize the health of operations created by a customer taking action on a site, file, or other information stored in SharePoint and OneDrive.
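As a purely illustrative sketch of the prioritization approach described above (not Microsoft's internal implementation; the operation classes, thresholds, and names are hypothetical), customer-initiated work can be protected by shedding background and system tasks first as resource utilization climbs:

```python
from dataclasses import dataclass

# Hypothetical operation classes, ordered from most to least critical.
# Customer-initiated work is protected; background and system work sheds first.
PRIORITY = {
    "user_interactive": 0,   # a user opening a site or file
    "user_background": 1,    # OneDrive/OneNote sync on behalf of a user
    "system_task": 2,        # timer jobs, migrations, large background uploads
}

@dataclass
class Operation:
    name: str
    kind: str  # one of the PRIORITY keys

def admit(op: Operation, utilization: float) -> bool:
    """Decide whether to run an operation given current resource utilization (0.0-1.0).

    The busier the service, the lower the priority class that is still admitted.
    Thresholds are illustrative only.
    """
    if utilization < 0.70:
        return True                     # healthy: admit everything
    if utilization < 0.85:
        return PRIORITY[op.kind] <= 1   # under pressure: shed system tasks
    return PRIORITY[op.kind] == 0       # severe: only customer-initiated requests

# Example: at 90% utilization a migration batch is deferred, a page load is not.
print(admit(Operation("open document", "user_interactive"), 0.90))      # True
print(admit(Operation("tenant migration batch", "system_task"), 0.90))  # False
```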
Q. Why were updates not being provided more frequently?
During any high-impact event, our goal is generally to provide updates hourly when possible. We strive to ensure that each communication contains specifics on the work that has taken place, rather than repeating the same information over the course of several updates.

Q. Why were the initial communications regarding this issue lacking in detail?
Our goal with any incident is to provide customer communications as quickly as possible. We recognize that this initial communication is often, by necessity, light on details during our initial investigation. We strive to provide significantly more detailed communications as we identify the scope and user impact of each issue. In the case of this incident, we were pursuing multiple parallel triage steps, but the root cause of the issue had not yet been identified. We have identified some areas of improvement here that will be reviewed during our internal post-mortem of the issue, and we will call out specific actions in the Next Steps section of the final PIR.

Q. Why did we still see impact early this week if Microsoft identified and reverted the change that caused the issue on Friday?
Due to the length of the issue, a number of transactions were queued up as a result of the throttles we had in place. After we reverted the offending change, these backlogged transactions all needed to complete. We applied further optimizations to ensure that the backlogged work completed as quickly as possible while keeping the service available to users (a simplified sketch of this approach follows this Q&A section). As we saw the service returning to normal levels, we gradually removed the throttles.

Q. Was Microsoft treating this issue with the right level of priority and urgency?
With any high-impact issue, we have a senior member of the engineering leadership team managing the incident. We also provide ongoing status updates to the Microsoft 365 leadership team. In the case of this issue, engineering was fully engaged, and the Microsoft 365 and SharePoint leadership teams were actively ensuring progress was being made on the issue.

Q. What is Microsoft doing to prevent this type of issue from occurring again?
We have identified several key areas of improvement to ensure issues of this nature do not reoccur. These will be contained in the Next Steps section of the final Post Incident Report.
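The backlog handling described in the Q&A above can be sketched, for illustration only, as rate-limited re-ingestion: work that queued up while throttles were in place is released at a capped rate so interactive traffic stays healthy while the backlog drains. The queue contents, rate, and function names below are hypothetical, not Microsoft's actual mechanism.

```python
import time
from collections import deque

def drain_backlog(backlog: deque, max_per_second: int, process) -> None:
    """Release backlogged items at a capped rate instead of all at once.

    backlog: items queued while throttles were in place (hypothetical).
    max_per_second: ingestion cap chosen to protect interactive traffic.
    process: callable that performs the deferred work for one item.
    """
    while backlog:
        window_start = time.monotonic()
        # Release at most max_per_second items in this one-second window.
        for _ in range(min(max_per_second, len(backlog))):
            process(backlog.popleft())
        # Sleep out the remainder of the window before the next batch.
        elapsed = time.monotonic() - window_start
        if backlog and elapsed < 1.0:
            time.sleep(1.0 - elapsed)

# Example: drain 25 queued sync requests at 10 per second.
queued = deque(f"sync-request-{i}" for i in range(25))
drain_backlog(queued, max_per_second=10, process=lambda item: None)
```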
Actions Taken (All times UTC)

Timeline of the change that caused the impact

Friday, November 8 – Thursday, December 5, 2019
An IPv6 change was deployed throughout NOAM and EMEA capacity. In total, 73 farms received this change (the targeted environment).

November 8 – The change was deployed to 5.48% of the targeted environment.
November 11 – The change reached 6.85% of the targeted environment.
November 12 – 8.22%
November 13 – 12.33%
November 22 – 41.1%
November 23 – 42.47%
November 25 – 56.16%
December 4 – 82.19%
December 5 – 100% of the targeted environment received the change.

Timeline of engineering actions

Monday, December 2, 2019
2:30 PM – Microsoft anomaly detection systems generated monitoring alerts indicating an issue.
3:45 PM – Service Incidents SP197124 and OD197125 were proactively published to the admin center for a subset of US and European customers.
3:46 PM – Microsoft received the first customer reports of impact.
4:17 PM – Engineers initiated a high-priority investigation.
4:17 PM – An Incident Manager was engaged automatically after the high-priority alert had been open for 30 minutes.
5:28 PM – An initial failover test on a portion of infrastructure proved ineffective.
8:11 PM – Additional engineering resources were engaged for further investigation.
9:55 PM – Latency was observed across significant portions of the North American service environment.
11:00 PM – The service stabilized in North America due to the reduced load at the end of business hours.

Tuesday, December 3, 2019
12:47 AM – Engineers suspected that User Front End servers were running out of CPU to handle user request traffic.
1:57 AM – Service Incidents SP197124/OD197125 were updated with final details and the status was changed to Service Restored.
12:17 PM – Engineers reported that multiple alerts were coming from Europe. Initial throttles were applied to reduce load and prioritize user service access.
4:09 PM – Engineers reported that US capacity alerts were firing as load increased in the region. Customers affected the day before began reopening support cases, confirming renewed impact.
4:33 PM – Service Incident communications were published under SP197263/OD197264.
6:25 PM – Engineers isolated and started examining all patches implemented within the previous seven days.
6:39 PM – Engineers identified a large increase in garbage collection and suspected this was related to a patch deployed the week prior.
6:46 PM – Engineers identified a spike in Distributed Cache Host (DCH) CPU utilization.
6:49 PM – Engineers flipped the killswitch for the patch that was causing increased garbage collection. Unfortunately, this proved ineffective.
9:37 PM – Engineers identified that calls to DCH were accumulating large buffers.

Wednesday, December 4, 2019
12:49 AM – Engineers identified a logging flight which may have been contributing to impact.
12:51 AM – Engineers observed that all US capacity returned to normal function at the end of standard business hours.
2:31 AM – Engineers identified an authentication flight that had been disabled prior to the logging change.
4:46 AM – The authentication flight was restored to 100 percent of the production environment. The logging flight was disabled.
9:36 AM – Engineers confirmed a resurgence of impact in Europe.
9:51 AM – Service Incidents SP197263/OD197264 were expanded to all of the European region.
2:44 PM – Engineers identified that increases in sync traffic were strongly correlated with high User Front End CPU resource utilization.
3:18 PM – Engineers proactively applied throttles to the US region for META and sync operations.
3:27 PM – Service Incidents SP197263/OD197264 were expanded to all of the US region.
3:41 PM – Engineers identified that an asynchronous Common Intermediate Language (CIL) API was using an outsized amount of resources and began working on disabling the flight for this API.
5:23 PM – A OneNote conversion flight was disabled on a subset of affected infrastructure.
5:53 PM – Engineers disabled all recent flights and enabled all recent killswitches on a subset of US infrastructure, which proved ineffective.
6:12 PM – A high-impact event was declared.
6:24 PM - 2:48 AM – An additional subset of infrastructure received the IPv6 caching infrastructure configuration change.
6:27 PM – Engineers disabled all recent flights and enabled all recent killswitches on a different subset of US infrastructure, which also proved ineffective.
9:50 PM – Throttles in the US and Europe were updated to expire after three days.
11:51 PM – Engineers began to consider rolling back the entire service environment to a previous build as a final action, should no other course prove effective.

Thursday, December 5, 2019
12:01 AM – Engineers compiled a list of all changes made in the previous build.
1:15 AM – Engineers identified a large number of exceptions related to WCF-DCH connections.
3:45 AM – Engineers identified a US infrastructure hot zone where users were more likely to experience impact.
4:09 AM – Network metrics for the hot zone showed that load balancers were not operating near their maximum capacity, ruling out a networking issue.
4:26 AM – Engineers first suspected the root cause to be the IPv6 caching infrastructure change, as User Front End servers were observed connecting to DCH over IPv6 with static routes missing.
10:17 AM – Engineers started investigating Unified Logging Service (ULS) tag pairs with the largest latency between them in a given request (a simplified sketch of this analysis follows the timeline).
2:25 PM – Engineers confirmed, as with prior nights, that no alerts had fired for either the US or Europe outside of their standard business hours.
2:54 PM – Customers started reporting impact related to the implemented service throttles.
3:26 PM – Engineers enabled the garbage collection killswitch on a separate portion of US infrastructure.
4:55 PM – Service Incident SP197263/OD197264 was updated with more detail based on customer feedback.
5:46 PM - 3:36 AM – An additional subset of infrastructure received the IPv6 caching infrastructure configuration change.
5:51 PM – Engineers began preparing to potentially roll the service environment back to a previous build.

Friday, December 6, 2019
1:41 AM – All killswitches included in the previous build were activated.
2:30 AM – An additional subset of flights was disabled.
2:45 AM – Engineers reported that the server-side DCH looked healthy.
2:58 AM – Engineers established the relationship between SharePoint Online Directory Services (SPODS), DCH, the Security Token Service (STS), and the content app pool, and started to investigate whether increased STS load was a side effect or a factor of the root cause.
3:59 AM – Engineers prepared a rollback patch for a recent FileStore update.
4:08 AM – Engineers began planning to relocate a subset of customers out of the hot zone infrastructure to better distribute service load and decrease customer impact.
5:51 AM – Engineers determined that DCH connections were timing out after receiving an increased number of queued requests.
1:00 PM – The FileStore rollback patch was applied to a subset of European infrastructure.
6:28 PM – Engineers confirmed that a previously applied full sync throttle test was providing 100 percent relief to service availability for the target infrastructure.
7:00 PM – The initial rollback of IPv6 on DCH completed on a targeted subset of US infrastructure.
7:45 PM – The IPv6 rollback completed for the target infrastructure.
8:00 PM – The target infrastructure remained stable after the rollback.
9:00 PM – After observing continued stability of the target infrastructure in the US, the IPv6 rollback was expanded to an additional subset of US infrastructure within the hot zone.
9:45 PM – The IPv6 rollback completed for the target US infrastructure inside the hot zone.
9:55 PM – All targeted infrastructure remained stable.
10:10 PM – DCH response latency dropped after the IPv6 rollback.
10:40 PM – Engineers performed additional validation testing on portions of the target US infrastructure to confirm that the IPv6 rollback was effective.
11:00 PM - 12:00 AM – A Joblet using a HOSTS file override to force IPv4 for DCH was deployed to the hot zone and then worldwide (a minimal illustration of forcing IPv4 resolution follows the timeline).

Sunday, December 8, 2019
11:30 PM – While the issue was considered mitigated and the throttles lifted, engineers discussed whether the IPv6 change was the root cause, as only 50 percent of the affected farms had received the change.
11:45 PM – Engineers compiled a list of the IPv6 status of the impacted infrastructure.

Monday, December 9, 2019
11:26 AM – Several alerts fired within the European region.
11:42 AM – Engineers observed a spike in Background Intelligent Transfer Service (large file) uploads from the sync client and suspected that this was due to these actions having been throttled for the past week.
3:39 PM – 12:01 AM – Engineers began work to perform service optimizations and to enable additional service throttles to manage the ingress of the backlog of previously throttled traffic.
3:50 PM – Engineers confirmed that the DCH call count had not increased, indicating secondary impact beyond the IPv6 issue.

Tuesday, December 10, 2019
1:45 AM – Engineers disabled SharePoint Trial mode on all infrastructure to further increase overall availability.
5:40 PM – Engineers confirmed that they observed no discernible impact during European business hours.
10:00 PM – Engineers began removing the additional throttles in the European region.
11:50 PM – SP197263/OD197264 was resolved with a customer impact end time of Tuesday, December 10, 2019, at 9:00 PM UTC.

Wednesday, December 11, 2019
4:00 AM – Engineers began removing the additional throttles in the US region.
6:15 PM – The incident was confirmed as resolved.
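For illustration of the ULS analysis referenced at 10:17 AM on Thursday, December 5: finding the tag pairs with the largest latency amounts to grouping log events by correlation ID and measuring the gap between a start tag and an end tag. The event format and tag names below are simplified and hypothetical, not the actual ULS schema.

```python
from collections import defaultdict

# Hypothetical, simplified log events: (correlation_id, tag, timestamp_ms).
# Real ULS entries carry many more fields; only these three matter for the gap.
events = [
    ("req-001", "cache_call_start", 100), ("req-001", "cache_call_end", 950),
    ("req-002", "cache_call_start", 120), ("req-002", "cache_call_end", 180),
]

def tag_pair_latency(events, start_tag, end_tag):
    """Return per-request latencies (ms) between a start tag and an end tag."""
    starts, latencies = {}, defaultdict(list)
    for corr_id, tag, ts in sorted(events, key=lambda e: e[2]):
        if tag == start_tag:
            starts[corr_id] = ts
        elif tag == end_tag and corr_id in starts:
            latencies[corr_id].append(ts - starts.pop(corr_id))
    return latencies

# Requests with the largest gap point at the slow subsystem.
gaps = tag_pair_latency(events, "cache_call_start", "cache_call_end")
print(max(gaps.items(), key=lambda kv: max(kv[1])))  # ('req-001', [850])
```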
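For illustration of the HOSTS file override referenced at 11:00 PM on Friday, December 6: the override pinned the Distributed Cache Host name to an IPv4 address so connections stopped using IPv6. A minimal sketch of the same effect, assuming a hypothetical cache endpoint and port (this is not the Joblet Microsoft deployed), restricts name resolution to IPv4 records:

```python
import socket

def resolve_ipv4_only(hostname: str, port: int) -> str:
    """Return the first IPv4 address for hostname, ignoring any AAAA (IPv6) records.

    Passing socket.AF_INET restricts getaddrinfo to IPv4 results, which has the
    same net effect as a HOSTS file entry pinning the name to an IPv4 address.
    """
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return infos[0][4][0]  # sockaddr for AF_INET is (address, port)

# Hypothetical cache host name and port; a real deployment would use its own endpoint.
print(resolve_ipv4_only("localhost", 22233))
```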
Next Steps

As this is a preliminary Post Incident Report, the list below does not constitute the full scope of our insights, planned follow-up actions for this event, or completion targets for those actions. This list will be expanded for the publication of the final Post Incident Report. These actions will relate to our change management, monitoring, incident management, communication, and investigative steps taken to triage the issue.

Finding: It took too long to understand which change led to the incident.
Action: Improve tracking of infrastructure changes within the production system.
Completion Date: TBD

Finding: Internal monitoring was insufficient to pinpoint the subsystem impacted by the incident.
Action: Improve our telemetry for each incoming and outgoing call within the service to make root cause easier to find (illustrated in the sketch below).
Completion Date: TBD

Finding: Residual impact from the backed-up load was too large.
Action: Automate management of this load and better understand the extent of backed-up traffic due to throttling.
Completion Date: TBD

Finding: Initial communications did not provide the level of detail needed by customers.
Action: Identify ways to improve the detail obtained during the initial timeframe of an incident so that communications can be more specific.
Completion Date: TBD
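As a hypothetical illustration of the telemetry improvement listed above (not a description of Microsoft's internal tooling), each incoming and outgoing call can be wrapped so its duration is recorded against a correlation ID, which makes the slow dependency easier to pinpoint during an incident:

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def traced_call(operation: str, correlation_id: str, sink: list):
    """Record the duration of one incoming or outgoing call against a correlation ID."""
    start = time.monotonic()
    try:
        yield
    finally:
        sink.append({
            "correlation_id": correlation_id,
            "operation": operation,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        })

# Example: trace an outgoing cache lookup made while serving one user request.
telemetry: list = []
corr_id = str(uuid.uuid4())
with traced_call("distributed_cache.get", corr_id, telemetry):
    time.sleep(0.01)  # stand-in for the real outgoing call
print(telemetry)
```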