Microsoft Entra Blog

Leading the way in resilience at scale

nadimabdo
Jun 03, 2025

Microsoft Entra’s security and resilience are and have always been our top priorities. We understand how mission critical Microsoft Entra is to our customers. For many years we’ve invested in a range of differentiated resilience measures that are a superset of the overall Microsoft cloud-wide resilience best practices. This update is a continuation of a series of posts we’ve made sharing our progress and giving a peek under the hood of how we build and operate Microsoft Entra to meet your needs.

Microsoft Entra maintains a 99.99% availability service level agreement (SLA). Beyond that, we engineer layers of resilience in front of it, ranging from hardening the core system itself, to a parallel backup authentication system, all the way to resilience built into our software development kits (SDKs) and applications. All these techniques come together to ensure customer scenarios can operate uninterrupted even if backend services or dependencies encounter challenges.

Below we’ll examine Microsoft Entra’s transparent resilience strategy, providing illustrative examples.

SLA attainment

As of May 2025, Microsoft Entra has met or exceeded its global SLA of 99.99% availability for the last 40 months and publishes these figures publicly on Microsoft Learn.
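To put these percentages in perspective, the short sketch below converts an availability target into a downtime budget. It treats availability as a simple fraction of wall-clock time over a 30-day month, which is a simplifying assumption and not the formal SLA definition; the function name is illustrative only.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of unavailability allowed over `days` days at a given availability.

    Assumption: availability is modeled as a plain fraction of elapsed time.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


four_nines = downtime_budget_minutes(99.99)
five_nines = downtime_budget_minutes(99.999)

print(f"99.99%  allows about {four_nines:.2f} minutes per 30 days")
print(f"99.999% allows about {five_nines * 60:.1f} seconds per 30 days")
```

Under this simplified model, a 99.99% target leaves roughly four minutes of downtime per month, and each additional nine shrinks that budget by a factor of ten.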

We invest heavily to meet this SLA. While we’re proud of its performance on the core system itself, we’ve gone further, developing resilience measures to ensure that even if there’s an SLA dip, customer scenarios are not impacted. In April 2025, we revised our SLA performance calculations to better reflect the user experience with authentication availability. The new method includes all successful authentications from Microsoft Entra's resilient infrastructure, including those from backup systems on retry. Previously, these were not included in the SLA. This change more accurately reflects users’ rate of undisrupted sign-in experiences.
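As a rough illustration of the difference between the two calculation methods, the sketch below compares availability counted from the primary system alone with availability that also counts authentications that succeeded through the backup system on retry. The class, field names, and request counts are hypothetical and chosen only to show the arithmetic; they are not published figures.

```python
from dataclasses import dataclass


@dataclass
class DailyAuthCounts:
    """Hypothetical daily totals, for illustration only."""
    total_requests: int      # all authentication attempts
    primary_successes: int   # succeeded on the primary system
    backup_successes: int    # failed on primary, then succeeded via backup on retry


def availability_primary_only(c: DailyAuthCounts) -> float:
    """Earlier method: successes served by the backup system are not counted."""
    return 100.0 * c.primary_successes / c.total_requests


def availability_with_backup(c: DailyAuthCounts) -> float:
    """Revised method: every successful authentication counts, including retries."""
    return 100.0 * (c.primary_successes + c.backup_successes) / c.total_requests


day = DailyAuthCounts(total_requests=1_000_000,
                      primary_successes=999_700,
                      backup_successes=295)

print(f"primary only: {availability_primary_only(day):.4f}%")
print(f"with backup:  {availability_with_backup(day):.4f}%")
```

In this made-up example, 300 requests fail on the primary system but 295 of them succeed on retry, so only 5 users in a million see a failed sign-in.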

To illustrate the benefit that end users see from these layers of resilience, the two charts below highlight the impact of our resilience measures. They show global daily availability rates across a recent month when an incident occurred. Near the end of the month, the incident could have reduced sign-in uptime, illustrated by the gray line in the bottom chart. Instead, as a result of resilience investments, there was limited impact to end users, and we maintained five-nines (99.999%) uptime throughout the entire month. The gray-shaded area in the top chart represents the net resilience gains, with the green line of the top chart representing the end user experience each day.

Figure 1: The chart above illustrates a recent month of uptime at a daily level. Global SLA attainment is published at an aggregated monthly level on Microsoft Learn (aka.ms/EntraIDSLA). Tenant-level monthly SLA attainment is reported to individual customers in the Microsoft Entra admin center.

Resilience strategy: backup auth system

A key feature of Microsoft Entra's resilience strategy is our backup authentication system. This system operates in an active-active configuration, consistently handling a portion of traffic and providing backup services where necessary. This backup system is a significant investment in resilience, ensuring that authentication services remain operational even during outages. We prioritize security and quality above all else, given the critical role of identity services in keeping operations always running safely and smoothly. The key expectation is that all dependencies and subunits may fail.
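The retry path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual implementation: the exception type, function names, and the two backend callables are hypothetical, and the sketch models only fallback on failure, not the steady share of traffic an active-active backup also handles.

```python
from typing import Callable, Tuple


class AuthUnavailable(Exception):
    """Raised by a (hypothetical) auth backend that cannot serve a request."""


def authenticate_with_fallback(
    user: str,
    primary: Callable[[str], str],
    backup: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary backend; on failure, retry against the backup.

    Returns the issued token and the name of the backend that served it.
    """
    try:
        return primary(user), "primary"
    except AuthUnavailable:
        # Assume any dependency may fail: the caller still gets a token.
        return backup(user), "backup"


def failing_primary(user: str) -> str:
    raise AuthUnavailable("primary is down")


def healthy_backup(user: str) -> str:
    return f"token-for-{user}"


result = authenticate_with_fallback("alice", failing_primary, healthy_backup)
print(result)
```

Returning the serving backend alongside the token is a design choice for the sketch: it lets the caller record how often the backup path is exercised.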

Our resilience strategy has also led us to a cell-based architecture, where each part of the system works independently and is protected from failures in other parts. We use multiple data centers to support these parts (the cells), and the cells in turn each span multiple Azure regions, avoiding a single point of failure. Some clients can access services across up to 13 regions, adding layers of protection at different levels, such as racks, zones, and data centers.

Another area essential for developing resilience and addressing failure is our discovery and mitigation processes, which focus on anticipating potential issues and maintaining smooth operations. An important aspect of this approach is the use of "pre-mortems," which help identify and resolve vulnerabilities before they become problems. Following incidents, thorough post-mortem analyses contribute to ongoing enhancements. By integrating lessons learned into the processes proactively and retroactively, balancing loads, incorporating failovers, and improving monitoring capabilities, Microsoft Entra maintains its services' resilience and reliability. The ultimate success of our discovery and mitigation approach depends on consistently challenging assumptions, questioning established practices, and fostering a mindset of continuous improvement.

A good example of how our continuous improvement processes and tools work in combination with one another is in how we identify and systematically eliminate single points of failure (SPOFs), shown in the illustration below.

Figure 2: This diagram illustrates the critical components essential for maintaining the robust and reliable operation of the Microsoft Entra identity and access management system.

Real-world example: Managed Service Identity infrastructure

A practical example of the resilience offered by Microsoft Entra can be observed in the Managed Service Identity (MSI) infrastructure, which allows applications to remain accessible even when the primary authentication service experiences a disruption. The system incorporates multiple layers of caching, including the token cache, enabling Azure to maintain operational continuity without customer impact. Token caching saves authentication tokens (the digital credentials that verify a user’s identity) in temporary storage (the cache) so they can be retrieved quickly later. Instead of requesting a new token for every call, the system reuses the saved tokens, which saves time and resources, especially when the same resource is accessed multiple times within a short period.

This design also allows users with a valid cached token to continue accessing applications uninterrupted in the event of authentication service disruptions. A recent configuration error put token caching to the test. This error caused a temporary loss of access to an internal database without any degradation in authentication availability for users. This example demonstrates the value of designing architecture with an assume-failure mindset.
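The token caching behavior described above can be illustrated with a small in-memory cache. This is a sketch under stated assumptions and not the MSI implementation: the `TokenCache` class, the `fake_issuer` function, and the one-hour token lifetime are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Minimal in-memory token cache keyed by resource (illustrative only)."""

    def __init__(self,
                 fetch_token: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_token  # returns (token, lifetime_in_seconds)
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, resource: str) -> str:
        now = self._clock()
        cached = self._entries.get(resource)
        if cached is not None and cached[1] > now:
            return cached[0]  # still valid: no call to the issuer
        token, lifetime = self._fetch(resource)
        self._entries[resource] = (token, now + lifetime)
        return token


# Hypothetical issuer that can be switched off to simulate a disruption.
state = {"issuer_up": True, "calls": 0}


def fake_issuer(resource: str) -> Tuple[str, float]:
    if not state["issuer_up"]:
        raise ConnectionError("issuer unavailable")
    state["calls"] += 1
    return f"token-for-{resource}", 3600.0


cache = TokenCache(fake_issuer)
first = cache.get("storage")    # fetched from the issuer
state["issuer_up"] = False      # simulate an authentication service disruption
second = cache.get("storage")   # served from the cache despite the outage
print(first == second, state["calls"])
```

While the cached token remains valid, the caller never contacts the issuer, so a disruption that is shorter than the remaining token lifetime goes unnoticed by the application.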

Resilience strategy: Supporting resilient tenant management

In addition to building resilience into Microsoft Entra’s services, we also recognize the need to support resilience at the tenant level. Microsoft Entra enables tenant management best practices that ensure data integrity and prevent business disruptions, a key pillar of compliance in the Digital Operational Resilience Act (DORA). Microsoft Entra is built to support digital operational resilience efforts in many ways, including soft deletion recovery, regular snapshots of directory state, and monitoring of configuration changes. (Explore Microsoft Entra customer considerations under DORA to learn more.)

Regulated entities can incorporate Microsoft Entra capabilities into their frameworks, policies, and plans to align with specific requirements under DORA, offering several benefits for organizations aiming to minimize operational disruptions and comply with the regulation. But regardless of whether your business falls within the scope of DORA, you may benefit from putting many of these best practices and features into action. One such example is Microsoft Entra’s capability to support recovery of deleted data for directory objects such as users, groups, and applications via self-service soft deletion protection. This protection is a crucial aspect of ensuring data integrity and preventing business disruptions.

Transparency strategy: Building trust through observability

Microsoft Entra's transparency strategy is comprehensive, emphasizing automated communications and health monitoring. Our auto-comms system speeds up incident notifications by emailing customers directly and posting alerts to Azure and M365 communication channels as soon as an incident is detected, rather than waiting for human intervention. This approach cuts notification time by at least 30 minutes and often achieves even larger reductions. We’re continuously pushing to increase the coverage of auto-comms, with nearly two-thirds of our quality-critical service incidents now benefiting from faster, automated notification.

In addition to reactive incident messaging, Microsoft Entra also provides proactive observability of the health of critical sign-in scenarios such as multifactor authentication (MFA). By calculating health metrics at very low latency at the tenant level and monitoring them with anomaly detection, the Microsoft Entra health monitoring feature provides real-time updates on any sign-in issues your users may be facing. You can check these alerts through the Microsoft Entra admin center or Microsoft Graph and subscribe to email notifications whenever an alert occurs. This way, you’re kept in the loop on potential problems, and you’ll get tips on root causes and how to fix them.
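To make the idea of anomaly detection over a health metric concrete, the sketch below flags sudden drops in a sign-in success rate relative to a trailing baseline. The post does not describe the detection method actually used, so the z-score approach, the function name, the thresholds, and the sample data are all assumptions for illustration.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(success_rates: List[float],
                   window: int = 5,
                   threshold: float = 3.0,
                   min_sigma: float = 0.05) -> List[int]:
    """Return indices where the rate drops far below its trailing baseline.

    A point is flagged when it sits more than `threshold` standard deviations
    below the mean of the previous `window` points. `min_sigma` is a floor on
    the standard deviation so a perfectly flat baseline does not flag noise.
    """
    anomalies = []
    for i in range(window, len(success_rates)):
        baseline = success_rates[i - window:i]
        mu = mean(baseline)
        sigma = max(stdev(baseline), min_sigma)
        if (mu - success_rates[i]) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-interval MFA sign-in success rates (percent) for one tenant.
rates = [99.9, 99.8, 99.9, 99.9, 99.8, 99.9, 97.0, 99.9]
flagged = find_anomalies(rates)
print(flagged)
```

Only drops are flagged, since a success rate that rises above its baseline is not a sign-in problem.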

Conclusion

Microsoft Entra’s approach to resilience is centered on ensuring that identity services remain operational and secure, even in the face of internal disruptions or failures in our dependencies. This capability is not merely about preventing downtime; it extends to building systems capable of adapting to unforeseen challenges and maintaining trustworthiness. By providing tailored identity solutions that address unique needs, we ensure that businesses can thrive in a secure and resilient environment. Microsoft Entra’s blend of resilience and health transparency serves as a model for how technology can adapt to dynamic landscapes while maintaining user trust.

Microsoft Entra’s scalable and intelligent design means it can support organizations of all types, whether you’re running a startup or managing a global corporation. We will continue to prioritize and invest heavily in security and resilience and thank you for your trust and partnership.

 

Nadim Abdo, CVP, Engineering and Igor Sakhnov, CVP, Engineering

 



Updated Jun 02, 2025
Version 1.0