Scenario 3: Azure API Management - Generative AI resources as backend

This reference implementation demonstrates how to provision and interact with Generative AI resources through API Management. The implementation is on top of the APIM baseline and additionally includes private deployments of Azure OpenAI endpoints, and the policies for the following capabilities that are specifically tailored for GenAI use cases.

By the end of this deployment guide, you would have deployed private Azure OpenAI endpoints and an opinionated set of policies in APIM to manage traffic to these endpoints. You can then test the policies by sending requests to the APIM gateway, and can modify either to include the policy fragments listed here or to include your own custom policies.

Architecture

Core components

Azure OpenAI endpoints
Azure Event Hub
Azure Private Endpoint
Azure Private DNS Zones

GenAI Gateway capabilities

Deploy the reference implementation

This reference implementation is provided with the following infrastructure as code options. Select the deployment guide you are interested in. They both deploy the same implementation.

▶️ Bicep-based deployment guide ▶️ Terraform-based deployment guide

GenAI Gateway

A "GenAI Gateway" serves as an intelligent interface/middleware that dynamically balances incoming traffic across backend resources to achieve optimizing resource utilization. In addition to load balancing, GenAI Gateway can be equipped with extra capabilities to address the challenges around billing, monitoring etc.

To read more about considerations when implementing a GenAI Gateway, see this article.

This accelerator contains APIM policies showing how to implement different GenAI Gateway capabilities in APIM, along with code to enable you to deploy the policies and see them in action.

Scenarios handled by this accelerator

This repo currently contains the policies showing how to implement these GenAI Gateway capabilities:

Capability	Description
Load balancing (round-robin)	Load balance traffic across PAYG endpoints using simple and weighted round-robin algorithm.
Managing spikes with PAYG	Manage spikes in traffic by routing traffic to PAYG endpoints when a PTU is out of capacity.
Adaptive rate limiting	Dynamically adjust rate-limits applied to different workloads
Tracking token usage	Record the token consumption for usage tracking and attribution

Test/Demo setup

If you are looking for a quick way to test or demo these capabilities with a minimalistic non production like APIM setup against a Azure OpenAI simulator, check out this repository.

▶️ APIM GenAI Gateway Toolkit

AI Hub Gateway capabilities

Looking for comprehensive reference implementation to provision your AI Hub Gateway? Check out AI Hub Gateway scenario.

▶️ AI Hub Gateway

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Scenario 3: Azure API Management - Generative AI resources as backend

Architecture

Core components

GenAI Gateway capabilities

Deploy the reference implementation

GenAI Gateway

Scenarios handled by this accelerator

Test/Demo setup

AI Hub Gateway capabilities

Files

README.md

Latest commit

History

README.md

File metadata and controls

Scenario 3: Azure API Management - Generative AI resources as backend

Architecture

Core components

GenAI Gateway capabilities

Deploy the reference implementation

GenAI Gateway

Scenarios handled by this accelerator

Test/Demo setup

AI Hub Gateway capabilities