As organizations increasingly integrate AI into their applications, managing model usage, ensuring governance, and optimizing performance across diverse APIs has become critical. Azure API Management’s AI Gateway is evolving rapidly to meet these needs, introducing powerful new capabilities that simplify integration, improve observability, and enhance control over AI workloads. In this update, we’re excited to share several key enhancements, including expanded support for the Responses API and AWS Bedrock APIs, advanced token tracking and logging, session-aware load balancing, and streamlined onboarding for custom models. Let’s dive into what’s new and how you can take advantage of these features today.
Model logging and token tracking dashboard
Understanding how your AI models are being used is critical for governance, cost management, and performance tuning. AI Gateway now enables comprehensive model logging and token tracking, giving you visibility into:
- Prompts and completions
- Token usage
You can configure diagnostic settings to export this data to destinations such as Azure Monitor, Azure Storage, or Azure Event Hubs for long-term retention and custom analysis.
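For a concrete starting point, here’s a minimal sketch of wiring up that export programmatically with the azure-mgmt-monitor SDK. The resource IDs are placeholders, and the GatewayLlmLogs category name is an assumption; check the diagnostic settings available on your instance for the exact category.

```python
# Minimal sketch: route AI Gateway model logs to a Log Analytics workspace.
# Assumes azure-identity and azure-mgmt-monitor are installed. The resource
# IDs are placeholders and "GatewayLlmLogs" is an assumed category name --
# verify against the categories listed on your API Management instance.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import DiagnosticSettingsResource, LogSettings

subscription_id = "<subscription-id>"
apim_resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/<rg>"
    "/providers/Microsoft.ApiManagement/service/<apim-name>"
)
workspace_resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/<rg>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)
client.diagnostic_settings.create_or_update(
    resource_uri=apim_resource_id,
    name="ai-gateway-llm-logs",
    parameters=DiagnosticSettingsResource(
        workspace_id=workspace_resource_id,
        logs=[LogSettings(category="GatewayLlmLogs", enabled=True)],
    ),
)
```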
Importantly, this logging feature is fully compatible with streaming responses, allowing you to capture detailed insights without compromising the real-time experience for users.
A built-in dashboard in the Azure portal provides an at-a-glance view of token usage trends, model performance across teams, and cost drivers, empowering organizations to make data-driven decisions around AI consumption and policy.
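If you’ve exported the logs to a Log Analytics workspace, you can run the same kind of analysis yourself. Below is a rough sketch using the azure-monitor-query library; the ApiManagementGatewayLlmLog table and its column names are assumptions, so adjust the query to match your workspace schema.

```python
# Rough sketch: summarize token usage per model from exported gateway logs.
# The table and column names below are assumptions -- inspect your Log
# Analytics workspace for the actual schema before running this.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
query = """
ApiManagementGatewayLlmLog
| summarize PromptTokens = sum(PromptTokens),
            CompletionTokens = sum(CompletionTokens),
            TotalTokens = sum(TotalTokens) by ModelName
| order by TotalTokens desc
"""
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=7),
)
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```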
Learn more about model logging.
Responses API Support (Preview)
The Responses API is a new stateful API from Azure OpenAI that unifies the capabilities of the Chat Completions API and the Assistants API into a single, streamlined interface. This makes it easier to build multi-turn conversational experiences, maintain session context, and handle tool calling, all within one API.
With AI Gateway support for the Responses API, you now get:
- Token limiting to manage usage quotas
- Token and request tracking for auditing and monitoring
- Semantic caching to reduce latency and optimize compute
- Content filtering and safety controls
This support enables organizations to confidently use the Responses API at scale with built-in observability and governance.
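To give a feel for the client side, here’s a hedged sketch of a multi-turn Responses API exchange routed through the gateway using the openai Python SDK (1.66 or later, which includes client.responses). The gateway URL, route, and subscription-key header are illustrative; they depend on how the API is imported into your instance.

```python
# Sketch: two chained Responses API calls through an APIM-fronted endpoint.
# The base_url, route, and Ocp-Apim-Subscription-Key header are assumptions
# about your gateway configuration, not fixed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://<apim-name>.azure-api.net/openai/v1",  # assumed route
    api_key="unused",  # authentication happens via the APIM key below
    default_headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
)

# First turn: the Responses API is stateful, so the response id can be chained.
first = client.responses.create(model="gpt-4o", input="What is an AI gateway?")

# Second turn: previous_response_id carries the session context forward.
follow_up = client.responses.create(
    model="gpt-4o",
    input="How does token limiting fit into that picture?",
    previous_response_id=first.id,
)
print(follow_up.output_text)
```

Because the gateway sits in the middle, token limits, tracking, semantic caching, and content filtering apply to both turns without any client-side changes.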
AWS Bedrock API Support
In our continued effort to support multi-cloud AI strategies, we’re thrilled to announce native support for the AWS Bedrock API in AI Gateway.
This means you can now:
- Apply token limits to Bedrock-based models
- Use semantic caching to minimize redundant requests
- Enforce content safety and responsible AI policies
- Log prompts and completions just as you would with Azure-hosted models
Whether you’re running Anthropic Claude or other Bedrock-hosted models, you can bring them under the same centralized AI Gateway, streamlining operations, compliance, and user experience.
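As a rough illustration, a Converse call to a Bedrock model through the gateway could look like the sketch below, using plain HTTP in Python. The /bedrock route prefix and the subscription-key header are assumptions about how you expose the API in APIM; the request and response shapes follow the Bedrock Converse API.

```python
# Sketch: invoke a Bedrock-hosted Claude model via the AI Gateway.
# The gateway route and subscription-key header are assumptions; the
# payload and response shapes follow the AWS Bedrock Converse API.
import requests

url = (
    "https://<apim-name>.azure-api.net/bedrock"  # assumed route prefix
    "/model/anthropic.claude-3-sonnet-20240229-v1:0/converse"
)
payload = {
    "messages": [
        {"role": "user", "content": [{"text": "Summarize our usage policy."}]}
    ]
}
resp = requests.post(
    url,
    json=payload,
    headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["output"]["message"]["content"][0]["text"])
```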
Simplified Onboarding: AI Foundry and OpenAI-Compatible APIs
With the introduction of LLM policies that support Azure AI Model Inference and third-party OpenAI-compatible APIs, we wanted to simplify the process of onboarding those APIs to Azure API Management.
We’re happy to announce two new experiences in the Azure API Management portal: Import from Azure AI Foundry and Create OpenAI API. These new gestures allow you to easily expose your model endpoints via AI Gateway and to configure token limiting, token tracking, semantic caching, and content safety policies directly from the Azure portal.
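Once imported, clients don’t need anything APIM-specific: any OpenAI-compatible SDK can simply point at the gateway. Here’s a minimal sketch; the route suffix and the way the subscription key is passed are assumptions that depend on your import settings.

```python
# Minimal sketch: consume an imported AI Foundry or OpenAI-compatible API
# through the gateway with the standard openai SDK. The route suffix and
# key handling are assumptions tied to how the API was imported.
from openai import OpenAI

client = OpenAI(
    base_url="https://<apim-name>.azure-api.net/<api-suffix>",  # assumed route
    api_key="unused",
    default_headers={"Ocp-Apim-Subscription-Key": "<apim-subscription-key>"},
)
chat = client.chat.completions.create(
    model="<deployment-or-model-name>",
    messages=[{"role": "user", "content": "Hello from behind the AI Gateway!"}],
)
print(chat.choices[0].message.content)
```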
Session-aware load balancing
Modern LLM applications (especially chatbots, agents, and batch inference workloads) often require stateful processing, where a user’s requests must consistently hit the same backend to preserve context.
We’re introducing session-aware load balancing in Azure API Management to meet this need.
With this feature, you can:
- Enable cookie-based session affinity for load-balanced backends
- Ensure that requests from the same session are routed to the same Azure OpenAI or third-party endpoint
- Support APIs like Assistants or the new Responses API that rely on consistent backend state
Session-aware load balancing ensures your multi-turn conversations or batched tool-calling experiences remain consistent, reliable, and scalable while still benefiting from Azure API Management’s AI Gateway capabilities.
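Because the affinity is cookie-based, clients only need to echo the cookie the gateway sets on the first response, which most HTTP clients do automatically. Here’s a small sketch in Python; the route and payloads are illustrative, and requests.Session handles the cookie round-trip for you.

```python
# Sketch: cookie-based session affinity from the client's perspective.
# requests.Session stores the affinity cookie set by the gateway on the
# first response and replays it, so follow-up turns reach the same backend.
# The route and payload shapes are illustrative assumptions.
import requests

session = requests.Session()  # persists cookies across requests
session.headers["Ocp-Apim-Subscription-Key"] = "<apim-subscription-key>"

url = "https://<apim-name>.azure-api.net/openai/v1/responses"  # assumed route

first = session.post(url, json={"model": "gpt-4o", "input": "Start a session."})
first.raise_for_status()

# This request carries the affinity cookie, so the load balancer routes it
# to the same backend and the server-side response id resolves correctly.
second = session.post(
    url,
    json={
        "model": "gpt-4o",
        "input": "Continue where we left off.",
        "previous_response_id": first.json()["id"],
    },
)
print(second.json()["output"])
```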
Learn more about session-aware load balancing.
Get started
These new capabilities are being gradually rolled out across all Azure regions where API Management is available.
Want early access to the latest AI Gateway features? You can now configure your Azure API Management instance to join the AI Gateway Early (GenAI Release) update group. This gives you access to new features before they are made generally available to all customers.
To configure this, navigate to the Service Update Settings blade in the Azure portal and select the appropriate update track.
Learn more about update groups.
Published May 19, 2025