An OpenAI-compatible LLM gateway powered by LiteLLM, running on Azure Container Apps. Unifies Azure AI Foundry model deployments behind a single, standardized API.
AzureLIT provides a lightweight, cost-conscious HTTPS gateway that exposes Azure AI Foundry models through an OpenAI-compatible interface.
- OpenAI-Compatible API: Drop-in replacement for OpenAI SDK clients
- Multi-Model Support: Declarative var.models map; add a model with one Terraform map entry
- Authentication: Custom auth handler validates distributed client API keys and the master key
- Usage Tracking: Per-key analytics with Azure Log Analytics — track tokens, cache usage, cost, and failures
- Infrastructure as Code: Fully automated deployment via Terraform
- Observability: Azure Monitor integration with metadata-only logging (no prompt/response content)
- Hardened Deployment: Pinned LiteLLM image, HTTPS-only ingress, disabled UI/key routes, and constrained scale settings
- Prompt Caching: Automatic cost reduction for Azure OpenAI models with 1024+ token prompts
- Azure subscription
- Terraform >= 1.0
- Azure CLI (for authentication)
- direnv (recommended for secret injection)
- Copy the example environment file and configure your secrets:
cd infra
cp example.env .env
- Edit .env with your values:
# Required - Your Azure subscription ID
TF_VAR_subscription_id=your-subscription-id
# Required - Master key for admin/operator access (must start with 'sk-')
TF_VAR_litellm_master_key=sk-your-secure-master-key
# Required - Comma-separated client API keys distributed to consumers
TF_VAR_api_keys=sk-clientA,sk-clientB
# Optional - Override defaults
TF_VAR_location=germanywestcentral
TF_VAR_resource_group_name=AzureLIT-POC
TF_VAR_scale_cooldown_seconds=900
- Load the env vars. With direnv, run direnv allow; without direnv, run:
export $(grep -v '^#' .env | grep -v '^$' | xargs)
- (Development Only) Install pre-commit hooks to ensure local commits pass CI formatting checks:
pip install pre-commit
pre-commit install
cd infra
terraform init
terraform plan -out=tfplan
terraform apply tfplan
After deployment, Terraform outputs the container app URL:
container_app_fqdn = "litellm-proxy.<env>.<region>.azurecontainerapps.io"
container_app_url = "https://litellm-proxy.<env>.<region>.azurecontainerapps.io"
# Set your deployed URL and a client API key
ENDPOINT="https://<your-container-app-fqdn>"
API_KEY="sk-clientA"
# List available models
curl -sS \
-H "Authorization: Bearer $API_KEY" \
"$ENDPOINT/v1/models"
# Replace model names below with models you actually deployed.
# Test chat completion
curl -sS \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4.1",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}' \
"$ENDPOINT/v1/chat/completions"
# Test with streaming
curl -sS \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "grok-4-20-reasoning",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}' \
"$ENDPOINT/v1/chat/completions"
To avoid guessing model name/version/SKU combinations, use:
cd infra
./list-deployable-models.sh --name codex
Useful filters:
# Only models that support the Responses API
./list-deployable-models.sh --capability responses
# Search by family + capability
./list-deployable-models.sh --name gpt-5.1 --capability responses
# Check models supporting a specific SKU
./list-deployable-models.sh --sku DataZoneStandard
Requirements: az (logged in) and jq installed locally.
Recommended workflow before editing infra/openai.tf:
# 1) Discover what this account can actually deploy
./list-deployable-models.sh --name gpt-5 --capability responses
# 2) Pick exact name + version + SKU from output
# 3) Add/update the entry in var.models
# 4) Deploy with terraform plan/apply
If responses=true and chatCompletion=false, set responses_only = true.
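The rule above can be sketched as a small helper. This is a minimal illustration, assuming the capability flags are available as a mapping with the keys responses and chatCompletion, whose values are booleans or the strings "true"/"false". The function name and the input shape are hypothetical and are not taken from the repo's script.

```python
def is_responses_only(capabilities: dict) -> bool:
    """Return True when a model supports Responses but not Chat Completions."""

    def flag(name: str) -> bool:
        # Accept real booleans as well as string values such as "true".
        return str(capabilities.get(name, "false")).strip().lower() == "true"

    return flag("responses") and not flag("chatCompletion")


# Hypothetical capability sets, for illustration only.
print(is_responses_only({"responses": "true", "chatCompletion": "false"}))
print(is_responses_only({"responses": True, "chatCompletion": True}))
```

A model for which this returns True would be the kind of entry that sets responses_only = true in var.models.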
from openai import OpenAI
client = OpenAI(
api_key="sk-clientA",
base_url="https://<your-container-app-fqdn>"
)
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Hello!"}],
stream=False
)
print(response.choices[0].message.content)
graph LR
Client["Client / SDK<br/>(OpenAI format)"] -->|Bearer sk-...| LiteLLM
subgraph Azure Container Apps
LiteLLM["LiteLLM Proxy<br/>:4000"]
end
subgraph Azure AIServices Account
ChatModel["Chat Model<br/>(e.g. gpt-4.1)"]
ResponsesModel["Responses Model<br/>(e.g. gpt-5.1-codex)"]
end
LiteLLM -->|azure/<model>| ChatModel
LiteLLM -->|azure/responses/<model>| ResponsesModel
subgraph Supporting Services
LA["Log Analytics<br/>Metadata Logging"]
end
LiteLLM -.-> LA
- Azure Container Apps: Hosts LiteLLM Proxy with external HTTPS ingress
- Azure AIServices Cognitive Account (kind = "AIServices"): Unified Foundry resource hosting all model deployments
- Regional AIServices Accounts: Created automatically when var.models targets a non-primary region
- Azure Foundry Project (azurerm_cognitive_account_project): Created automatically; used by models requiring project-scoped deployment (project = true)
- Log Analytics: Metadata-only logging (no prompt/response content)
- Log Analytics: Per-key usage tracking (tokens, cache usage, cost, failures)
The table below shows example model configurations from this repo. Actual deployability varies by subscription, region, quota, and Azure rollout stage. Use infra/list-deployable-models.sh to discover what you can deploy, then add entries to var.models in infra/openai.tf.
| Model (example) | Format | SKU | Region | API Surface |
|---|---|---|---|---|
| gpt-4.1 | OpenAI | DataZoneStandard | germanywestcentral | Chat Completions |
| gpt-5.1-codex | OpenAI | GlobalStandard | germanywestcentral | Responses API only |
Responses-only models (e.g., codex variants) set responses_only = true and use LiteLLM's azure/responses/ prefix with api_version=preview.
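As a rough sketch of the routing rule described above, the helper below builds the LiteLLM model string for a deployment. The function name and the returned dictionary shape are assumptions made for illustration; the repo's actual proxy configuration is generated by Terraform and may differ.

```python
def litellm_params(deployment: str, responses_only: bool = False) -> dict:
    """Build illustrative LiteLLM parameters for an Azure deployment."""
    if responses_only:
        # Responses-only models use the azure/responses/ prefix
        # together with api_version=preview.
        return {"model": f"azure/responses/{deployment}", "api_version": "preview"}
    # Chat models use the plain azure/ prefix; api_version is left unset here.
    return {"model": f"azure/{deployment}"}


print(litellm_params("gpt-4.1"))
print(litellm_params("gpt-5.1-codex", responses_only=True))
```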
The deployment uses a custom auth handler in infra/custom_auth.py:
- Set TF_VAR_api_keys to a comma-separated list of distributed client keys
- Set TF_VAR_litellm_master_key with a value starting with sk- for operator/admin use
- Clients authenticate with Authorization: Bearer <api_key>
- The custom auth handler also accepts the master key so admin operations still work
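A minimal sketch of the key check such a handler might perform, assuming the keys arrive as the comma-separated string and master key described above. This is not the contents of infra/custom_auth.py; the function names are hypothetical, and the constant-time comparison via hmac.compare_digest is a design choice of this sketch.

```python
import hmac


def parse_api_keys(raw: str) -> list[str]:
    """Split a comma-separated key list, dropping blanks and whitespace."""
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_authorized(presented: str, client_keys: list[str], master_key: str) -> bool:
    """Accept any distributed client key or the master key."""
    candidates = [*client_keys, master_key]
    # Evaluate every candidate with a constant-time comparison, so the
    # check does not exit early on the first mismatch.
    matches = [
        hmac.compare_digest(presented.encode(), candidate.encode())
        for candidate in candidates
    ]
    return any(matches)


# Placeholder keys, for illustration only.
keys = parse_api_keys("sk-clientA, sk-clientB")
print(is_authorized("sk-clientA", keys, "sk-master"))
print(is_authorized("sk-unknown", keys, "sk-master"))
```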
Per-key usage analytics are tracked in Azure Log Analytics:
# Last 7 days (default)
python scripts/usage-report.py
# Specific date range
python scripts/usage-report.py --from 2026-04-01 --to 2026-04-15
# Single day
python scripts/usage-report.py --date 2026-04-15
# Export to CSV
python scripts/usage-report.py --from 2026-04-01 --to 2026-04-15 --format csv > usage.csv
Output example:
Usage Report: last 7 days to now
| Key Hash | Requests | Failures | Tokens In | Tokens Out | Cost | Models |
|-------------|----------|----------|-----------|------------|----------|----------------------|
| 308e39b0... | 114 | 110 | 143541 | 2385 | $0.00004 | Kimi-K2.5, Kimi-K2.6, gpt-4.1 |
See docs/USAGE_ANALYSIS.md for full documentation.
- Defender for AI Services: Configured as Free tier (disabled) — see docs/DEFENDER_AI_SERVICES.md for details on enabling Standard tier for production workloads
See the "Next Steps" sections in docs/ARCHITECTURE.md and docs/DEPLOYMENT_SUMMARY.md.
Prompt caching is the primary cost reduction lever for Azure OpenAI models:
- Automatically reduces input token costs by up to 90% for prompts with 1024+ tokens
- No configuration required - works automatically with proper prompt structure
- Verified working for gpt-4.1, gpt-5.4, and gpt-5.1-codex
- See docs/PROMPT_CACHING.md for implementation guide
Typical savings for workloads with repeated context:
- Code generation with shared codebases: 60-80% reduction
- Document analysis with templates: 70-90% reduction
- Multi-turn conversations: 50-70% reduction
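To make the arithmetic concrete, the sketch below estimates the input-cost reduction from a cache hit ratio. It assumes cached input tokens are billed at a fixed discount; the 90% default is the upper bound quoted above, and actual discounts vary by model and deployment type. The function name and the example token counts are hypothetical.

```python
def input_cost_reduction(prompt_tokens: int, cached_tokens: int,
                         cached_discount: float = 0.90) -> float:
    """Return the fractional reduction in input cost from cached tokens."""
    if prompt_tokens <= 0:
        return 0.0
    # Only the cached share of the prompt receives the discount.
    cached_share = min(cached_tokens, prompt_tokens) / prompt_tokens
    return cached_share * cached_discount


# Hypothetical request: 8,000 prompt tokens, 6,400 of them served from cache.
reduction = input_cost_reduction(8000, 6400)
print(f"{reduction:.0%}")
```

In OpenAI-style responses the cached token count is typically reported under usage.prompt_tokens_details.cached_tokens, which could feed the cached_tokens argument here.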
- Secrets: Never commit .env or *.tfvars files (both are gitignored)
- Logging: No prompt/response content is logged; only metadata (timestamps, latency, token counts)
- HTTPS Only: Container Apps enforces TLS on external ingress
- Proxy Hardening: disable_admin_ui: true, disable_key_management: true, drop_params: true, drop_unknown_params: true
- Runtime Hardening: LiteLLM image pinned to ghcr.io/berriai/litellm:main-v1.83.14-stable.patch.3, min_replicas = 0, max_replicas = 2, cooldown_period_in_seconds = var.scale_cooldown_seconds (default 900 / 15 minutes)
- Least Privilege: Managed identities used where possible
- ARCHITECTURE - Architecture and deployment behavior
- PROMPT_CACHING - Cost optimization via prompt caching (recommended)
- DEPLOYMENT_SUMMARY - Operational summary
- MASTER_KEY_MANAGEMENT - Master/client key operations
- CUSTOM_AUTH - Custom auth implementation
- USAGE_ANALYSIS - Per-key usage tracking and reporting
- USAGE_TRACKING_IMPLEMENTATION - Implementation details
- DEFENDER_AI_SERVICES - Defender for AI Services security documentation
- LINKS - External references
TBD