AzureLIT

An OpenAI-compatible LLM gateway powered by LiteLLM, running on Azure Container Apps. Unifies Azure AI Foundry model deployments behind a single, standardized API.

Overview

AzureLIT provides a lightweight, cost-conscious HTTPS gateway that exposes Azure AI Foundry models through an OpenAI-compatible interface.

Features

OpenAI-Compatible API: Drop-in replacement for OpenAI SDK clients
Multi-Model Support: Declarative var.models map — add a model with one Terraform map entry
Authentication: Custom auth handler validates distributed client API keys and the master key
Usage Tracking: Per-key analytics with Azure Log Analytics — track tokens, cache usage, cost, and failures
Infrastructure as Code: Fully automated deployment via Terraform
Observability: Azure Monitor integration with metadata-only logging (no prompt/response content)
Hardened Deployment: Pinned LiteLLM image, HTTPS-only ingress, disabled UI/key routes, and constrained scale settings
Prompt Caching: Automatic cost reduction for Azure OpenAI models with 1024+ token prompts

Quick Start

Prerequisites

Azure subscription
Terraform >= 1.0
Azure CLI (for authentication)
direnv (recommended for secret injection)

Configuration

Copy the example environment file and configure your secrets:

cd infra
cp example.env .env

Edit .env with your values:

# Required - Your Azure subscription ID
TF_VAR_subscription_id=your-subscription-id

# Required - Master key for admin/operator access (must start with 'sk-')
TF_VAR_litellm_master_key=sk-your-secure-master-key

# Required - Comma-separated client API keys distributed to consumers
TF_VAR_api_keys=sk-clientA,sk-clientB

# Optional - Override defaults
TF_VAR_location=germanywestcentral
TF_VAR_resource_group_name=AzureLIT-POC
TF_VAR_scale_cooldown_seconds=900

Load the environment variables. With direnv, run direnv allow; without it, export them manually:

export $(grep -v '^#' .env | grep -v '^$' | xargs)
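
The constraints noted in the .env comments (a master key prefixed with sk-, client keys as a comma-separated list) can be checked before Terraform runs. The following is a minimal sketch and not part of the repository; validate_settings is a hypothetical helper that only checks the rules stated in this README.

```python
import os


def validate_settings(env: dict) -> list:
    """Return a list of human-readable problems; an empty list means the values look sane."""
    problems = []

    if not env.get("TF_VAR_subscription_id", "").strip():
        problems.append("TF_VAR_subscription_id is missing")

    # The README requires the master key to start with 'sk-'.
    if not env.get("TF_VAR_litellm_master_key", "").startswith("sk-"):
        problems.append("TF_VAR_litellm_master_key must start with 'sk-'")

    # Client keys are passed as one comma-separated string.
    client_keys = [k.strip() for k in env.get("TF_VAR_api_keys", "").split(",") if k.strip()]
    if not client_keys:
        problems.append("TF_VAR_api_keys must contain at least one key")

    return problems


if __name__ == "__main__":
    for problem in validate_settings(dict(os.environ)):
        print(f"config problem: {problem}")
```

Running it in the same shell that exported the variables prints one line per problem and nothing when the values pass.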

(Development Only) Install pre-commit hooks to ensure local commits pass CI formatting checks:

pip install pre-commit
pre-commit install

Deploy

cd infra
terraform init
terraform plan -out=tfplan
terraform apply tfplan

After deployment, Terraform outputs the container app URL:

container_app_fqdn = "litellm-proxy.<env>.<region>.azurecontainerapps.io"
container_app_url  = "https://litellm-proxy.<env>.<region>.azurecontainerapps.io"
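
To avoid copying the URL by hand, the outputs can also be read programmatically. This sketch assumes the output name container_app_url shown above and relies on the JSON layout printed by terraform output -json; extract_output and read_gateway_url are hypothetical helper names, not part of the repository.

```python
import json
import subprocess


def extract_output(outputs_json: str, name: str) -> str:
    """Pull one value out of the JSON document printed by `terraform output -json`."""
    outputs = json.loads(outputs_json)
    return outputs[name]["value"]


def read_gateway_url(workdir: str = "infra") -> str:
    """Run Terraform in `workdir` and return the container_app_url output."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=workdir,
        check=True,
        capture_output=True,
        text=True,
    )
    return extract_output(result.stdout, "container_app_url")
```

Keeping the JSON parsing separate from the subprocess call makes the parsing step easy to check without a deployed stack.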

Test the Deployment

# Set your deployed URL and a client API key
ENDPOINT="https://<your-container-app-fqdn>"
API_KEY="sk-clientA"

# List available models
curl -sS \
  -H "Authorization: Bearer $API_KEY" \
  "$ENDPOINT/v1/models"

# Replace model names below with models you actually deployed.

# Test chat completion
curl -sS \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }' \
  "$ENDPOINT/v1/chat/completions"

# Test with streaming
curl -sS \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-4-20-reasoning",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }' \
  "$ENDPOINT/v1/chat/completions"

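With "stream": true the gateway answers in the OpenAI streaming format: server-sent event lines of the form data: {json}, ending with data: [DONE]. The sketch below shows one way to turn those lines into text; iter_stream_text is a hypothetical helper, and it assumes each chunk carries its text in choices[].delta.content as in the OpenAI Chat Completions streaming format.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # Some chunks carry no choices (for example metadata-only chunks).
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

The function accepts any iterable of lines, so it can be fed from a saved curl -N capture or from an HTTP response iterated line by line.
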
Inspect Deployable Models (Azure CLI Helper)

To avoid guessing model name/version/SKU combinations, use:

cd infra
./list-deployable-models.sh --name codex

Useful filters:

# Only models that support the Responses API
./list-deployable-models.sh --capability responses

# Search by family + capability
./list-deployable-models.sh --name gpt-5.1 --capability responses

# Check models supporting a specific SKU
./list-deployable-models.sh --sku DataZoneStandard

Requirements: az (logged in) and jq installed locally.

Recommended workflow before editing infra/openai.tf:

# 1) Discover what this account can actually deploy
./list-deployable-models.sh --name gpt-5 --capability responses

# 2) Pick exact name + version + SKU from output
# 3) Add/update the entry in var.models
# 4) Deploy with terraform plan/apply

If responses=true and chatCompletion=false, set responses_only = true.
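
The rule above can be written as a small predicate. This is only an illustration: the exact shape of the helper script's output is not shown in this README, so the capabilities dictionary and the function names here are assumptions. Values are accepted either as booleans or as "true"/"false" strings.

```python
def is_enabled(value) -> bool:
    """Treat True or the string "true" (any case) as enabled; everything else as disabled."""
    return str(value).strip().lower() == "true"


def needs_responses_only(capabilities: dict) -> bool:
    """True when the model supports the Responses API but not Chat Completions."""
    return is_enabled(capabilities.get("responses")) and not is_enabled(
        capabilities.get("chatCompletion")
    )
```

A model that reports both capabilities keeps the default Chat Completions route, so the predicate returns False for it.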

Using with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="sk-clientA",
    base_url="https://<your-container-app-fqdn>"
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False
)

print(response.choices[0].message.content)
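
Where installing the SDK is not an option, the same call can be made with the Python standard library. This is a sketch mirroring the curl example from Test the Deployment; build_chat_request and send are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the gateway's chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling send(build_chat_request(endpoint, key, "gpt-4.1", "Hello!")) returns the decoded response, with the reply text under choices[0]["message"]["content"].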

Architecture

graph LR
    Client["Client / SDK<br/>(OpenAI format)"] -->|Bearer sk-...| LiteLLM

    subgraph Azure Container Apps
        LiteLLM["LiteLLM Proxy<br/>:4000"]
    end

    subgraph Azure AIServices Account
        ChatModel["Chat Model<br/>(e.g. gpt-4.1)"]
        ResponsesModel["Responses Model<br/>(e.g. gpt-5.1-codex)"]
    end

    LiteLLM -->|azure/<model>| ChatModel
    LiteLLM -->|azure/responses/<model>| ResponsesModel

    subgraph Supporting Services
        LA["Log Analytics<br/>Metadata Logging"]
    end

    LiteLLM -.-> LA

Components

Azure Container Apps: Hosts LiteLLM Proxy with external HTTPS ingress
Azure AIServices Cognitive Account (kind = "AIServices"): Unified Foundry resource hosting all model deployments
Regional AIServices Accounts: Created automatically when var.models targets a non-primary region
Azure Foundry Project (azurerm_cognitive_account_project): Created automatically; used by models requiring project-scoped deployment (project = true)
Log Analytics: Metadata-only logging (no prompt/response content) and per-key usage tracking (tokens, cache usage, cost, failures)

Example Model Configurations

The table below shows example model configurations from this repo. Actual deployability varies by subscription, region, quota, and Azure rollout stage. Use infra/list-deployable-models.sh to discover what you can deploy, then add entries to var.models in infra/openai.tf.

| Model (example) | Format | SKU | Region | API Surface |
|-----------------|--------|-----|--------|-------------|
| `gpt-4.1` | `OpenAI` | DataZoneStandard | `germanywestcentral` | Chat Completions |
| `gpt-5.1-codex` | `OpenAI` | GlobalStandard | `germanywestcentral` | Responses API only |

Responses-only models (e.g., codex variants) set responses_only = true and use LiteLLM's azure/responses/ prefix with api_version=preview.
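
The routing rule can be sketched as a function that derives the LiteLLM parameters for a deployment. This illustrates the prefixes and the api_version value named above; it is not the repository's Terraform logic, litellm_params is a hypothetical helper name, and other parameters a real entry needs (such as the API base) are left out.

```python
def litellm_params(deployment_name: str, responses_only: bool = False) -> dict:
    """Return the model string (and api_version, when needed) for one deployment."""
    if responses_only:
        # Responses-only models use the azure/responses/ prefix with api_version=preview.
        return {
            "model": "azure/responses/" + deployment_name,
            "api_version": "preview",
        }
    # Chat models use the plain azure/ prefix.
    return {"model": "azure/" + deployment_name}
```

For the two example rows in the table, this gives azure/gpt-4.1 for the chat model and azure/responses/gpt-5.1-codex for the codex model.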

Authentication

The deployment uses a custom auth handler in infra/custom_auth.py:

Set TF_VAR_api_keys to a comma-separated list of distributed client keys
Set TF_VAR_litellm_master_key with a value starting with sk- for operator/admin use
Clients authenticate with Authorization: Bearer <api_key>
The custom auth handler also accepts the master key so admin operations still work
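
The key check described above can be sketched in a few lines. This is not the contents of infra/custom_auth.py; parse_client_keys and is_authorized are hypothetical names, and the constant-time comparison is a design choice of this sketch rather than a documented property of the handler.

```python
import hmac


def parse_client_keys(csv_value: str) -> list:
    """Split the comma-separated TF_VAR_api_keys value into individual keys."""
    return [key.strip() for key in csv_value.split(",") if key.strip()]


def is_authorized(presented_key: str, master_key: str, client_keys: list) -> bool:
    """Accept the master key or any distributed client key."""
    matched = False
    for candidate in [master_key, *client_keys]:
        # compare_digest avoids leaking how many leading characters matched.
        if candidate and hmac.compare_digest(presented_key.encode(), candidate.encode()):
            matched = True
    return matched
```

The loop deliberately checks every candidate instead of returning early, so the time taken does not reveal which key matched.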

Usage Tracking

Per-key usage analytics are tracked in Azure Log Analytics:

# Last 7 days (default)
python scripts/usage-report.py

# Specific date range
python scripts/usage-report.py --from 2026-04-01 --to 2026-04-15

# Single day
python scripts/usage-report.py --date 2026-04-15

# Export to CSV
python scripts/usage-report.py --from 2026-04-01 --to 2026-04-15 --format csv > usage.csv

Output example:

Usage Report: last 7 days to now

| Key Hash    | Requests | Failures | Tokens In | Tokens Out | Cost     | Models                |
|-------------|----------|----------|-----------|------------|----------|----------------------|
| 308e39b0... | 114      | 110      | 143541    | 2385       | $0.00004 | Kimi-K2.5, Kimi-K2.6, gpt-4.1 |

See docs/USAGE_ANALYSIS.md for full documentation.
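
The report columns amount to a per-key aggregation over request metadata. The sketch below shows that aggregation under assumed field names (key_hash, model, success, prompt_tokens, completion_tokens, cost); the actual Log Analytics schema and the logic in scripts/usage-report.py may differ.

```python
from collections import defaultdict


def summarize(records: list) -> dict:
    """Group request records by key hash and total the columns shown in the report."""
    totals = defaultdict(
        lambda: {
            "requests": 0,
            "failures": 0,
            "tokens_in": 0,
            "tokens_out": 0,
            "cost": 0.0,
            "models": set(),
        }
    )
    for record in records:
        row = totals[record["key_hash"]]
        row["requests"] += 1
        row["failures"] += 0 if record["success"] else 1
        # Failed requests may carry no token or cost fields, so default to zero.
        row["tokens_in"] += record.get("prompt_tokens", 0)
        row["tokens_out"] += record.get("completion_tokens", 0)
        row["cost"] += record.get("cost", 0.0)
        row["models"].add(record["model"])
    # Convert the model sets to sorted lists for stable output.
    return {key: {**row, "models": sorted(row["models"])} for key, row in totals.items()}
```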

Security

Defender for AI Services: Configured as Free tier (disabled) — see docs/DEFENDER_AI_SERVICES.md for details on enabling Standard tier for production workloads

Roadmap

See the "Next Steps" sections in docs/ARCHITECTURE.md and docs/DEPLOYMENT_SUMMARY.md.

Cost Optimization

Prompt caching is the primary cost reduction lever for Azure OpenAI models:

Automatically reduces input token costs by up to 90% for prompts with 1024+ tokens
No configuration required - works automatically with proper prompt structure
Verified working for gpt-4.1, gpt-5.4, and gpt-5.1-codex
See docs/PROMPT_CACHING.md for implementation guide

Typical savings for workloads with repeated context:

Code generation with shared codebases: 60-80% reduction
Document analysis with templates: 70-90% reduction
Multi-turn conversations: 50-70% reduction
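
To see how the cached share of a prompt drives these figures, the input cost can be estimated with simple arithmetic. The sketch below assumes the maximum 90% discount on cached tokens and the 1024-token threshold mentioned above; the price used in the usage note is a placeholder, not an actual Azure rate, and estimate_input_cost is a hypothetical helper.

```python
def estimate_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.90,
    min_prompt_tokens: int = 1024,
) -> float:
    """Estimate the input cost of one request, given how many prompt tokens hit the cache."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    if prompt_tokens < min_prompt_tokens:
        cached_tokens = 0  # prompts below the threshold are billed in full
    uncached_tokens = prompt_tokens - cached_tokens
    cached_price = price_per_million * (1.0 - cached_discount)
    total = uncached_tokens * price_per_million + cached_tokens * cached_price
    return total / 1_000_000
```

With a placeholder price of 2.00 per million input tokens, a 10,000-token prompt with 8,000 cached tokens costs about 0.0056 instead of 0.02, a reduction of roughly 72%.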

Security Notes

Secrets: Never commit .env or *.tfvars files (both are gitignored)
Logging: No prompt/response content is logged; only metadata (timestamps, latency, token counts)
HTTPS Only: Container Apps enforces TLS on external ingress
Proxy Hardening: disable_admin_ui: true, disable_key_management: true, drop_params: true, drop_unknown_params: true
Runtime Hardening: LiteLLM image pinned to ghcr.io/berriai/litellm:main-v1.83.14-stable.patch.3, min_replicas = 0, max_replicas = 2, cooldown_period_in_seconds = var.scale_cooldown_seconds (default 900 / 15 minutes)
Least Privilege: Managed identities used where possible

Documentation

ARCHITECTURE - Architecture and deployment behavior
PROMPT_CACHING - Cost optimization via prompt caching (recommended)
DEPLOYMENT_SUMMARY - Operational summary
MASTER_KEY_MANAGEMENT - Master/client key operations
CUSTOM_AUTH - Custom auth implementation
USAGE_ANALYSIS - Per-key usage tracking and reporting
USAGE_TRACKING_IMPLEMENTATION - Implementation details
DEFENDER_AI_SERVICES - Defender for AI Services security documentation
LINKS - External references

License

TBD
