Blog Post

Azure PaaS Blog
3 MIN READ

Introducing Azure SRE Agent

dchelupati's avatar
dchelupati
Icon for Microsoft rankMicrosoft
May 19, 2025

Today we’re thrilled to introduce Azure SRE Agent, an AI-powered tool that makes it easier to sustain production cloud environments. SRE Agent helps respond to incidents quickly and effectively, alleviating the toil of managing production environments. Overall, it results in better service uptime and reduced operational costs. SRE Agent leverages the reasoning capabilities of LLMs to identify the logs and metrics necessary for rapid root cause analysis and issue mitigation. Its advanced AI capabilities transform incident and infrastructure management in Azure, freeing engineers to focus on more meaningful work. 

Sign up for the SRE Agent preview click here

As more companies move their services online, site reliability engineering (SRE) has become crucial to keeping critical systems reliable, scalable, and cost-efficient. But SRE isn't just about fixing problems—it’s about bridging the gap between business goals and developer needs. With growing infrastructure complexity, it’s harder than ever to keep everything running smoothly while anticipating future scalability and reliability needs.

We’ve heard from SREs that they experience significant toil from repetitive live-site incident handling and log analysis tasks, with adhoc administrative tasks disrupting their workflow. Responding to incidents is stressful, as seconds matter and there’s little room for error. 

SRE Agent brings the years of experience Microsoft teams gathered running the Azure cloud to your team. 

SRE Agent is a new Azure service that gives site reliability engineers (SREs) and developers the tools they need to increase the speed and efficiency of incident responses, diagnostics and collaboration to resolve problems rapidly. It is seamlessly integrated with other observability and incident management tools, as well as the new coding agent in GitHub Copilot. It runs in the background 24x7, learning and monitoring the health and performance of Azure resources, handling production alerts, and partnering in incident investigations and root cause analysis (RCA) to mitigate issues faster.

Key Capabilities

SRE Agent can help make your infrastructure more secure, resilient and scalable and can help detect and respond to incidents more quickly.

Evaluating usage and performance trends 

SRE Agent continuously learns about your Azure resources to build relevant context about them without the hassle of multiple tools. You can ask questions to learn about their properties, configuration, and recent changes. And you can learn about their health and performance by visualizing relevant metrics. This allows developers to quickly identify anomalies or trends that need attention.

Example Prompts:

  • What changed to my app in last day?
  • When was the last slot swap performed on my app?
  • What alerts should I set up for my web app?
  • Can you give me the overall AKS cluster usage?
  • What are best practices that I should setup for my app?
  • Visualize requests and 500 errors for last week for my app

Proactive detection and remediation of security vulnerabilities

SRE Agent continuously audits Azure resources to ensure compliance with security best practices. Currently, it checks for the use of supported TLS versions and verifies if resources have Managed Identity enabled. SRE Agent not only identifies the potential vulnerabilities but can also perform the necessary operations to update resources with your approval, bringing them into compliance.

 

Automated incident response and Faster root cause analysis

SRE Agent can respond to Azure Monitor alerts out of the box. You can also integrate with Incident Management tools such as PagerDuty to extend its alert-handling capabilities. With this integration, SRE Agent can:

  • Start investigation upon alert detection.
  • Access metrics, activity logs, dependencies, and dashboards to form hypotheses and determine the root cause.

Traditional RCA methods can take hours, while the SRE Agent completes them in minutes, minimizing impact and driving faster resolutions.

Incident mitigation 

To mitigate incidents and bring an application back to operational state, SRE Agent can perform operations on behalf of and with approval of the user. These operations can include scaling up resources, restarting applications, and performing rollbacks to a previously working version of the app.

Close the loop with developers

Once an investigation is complete, SRE Agent creates a GitHub issue including all the details from the investigation helping developers to fix source code and prevent subsequent recurrences of an incident.

Share your thoughts 

We would love to hear your feedback. Signup for preview access to SRE Agent is here. Use the “Thumbs Up” or “Thumbs Down” buttons to share your thoughts on Agent’s responses. You can also share feedback on the product via the “give us feedback” button on the top right corner inside the SRE Agent. Your input is invaluable to us as we strive to improve and support you on your Azure journey.

Additional resources

Updated May 17, 2025
Version 1.0

4 Comments