Merged
3 changes: 1 addition & 2 deletions AGENTS.md
@@ -106,7 +106,6 @@ Full setup in `docs/DEPLOYMENT_SUMMARY.md`.

- `config.yaml.tpl` or `custom_auth.py` changes only take effect on redeploy — no hot reload.
- `litellm_settings.drop_params: true` — prevents clients from overriding provider credentials at request time.
-- `litellm_settings.drop_unknown_params: true` — strips unsupported request fields before they reach upstream providers.
- `custom_auth.py` caches valid keys in memory on first request. Key changes require redeploy to take effect.
- `custom_auth` replaces LiteLLM's built-in master key check entirely — the handler explicitly also accepts `LITELLM_MASTER_KEY` so admin operations keep working.
- No content logging (prompts/responses); metadata-only with 30-day retention in Log Analytics.
@@ -118,4 +117,4 @@ Full setup in `docs/DEPLOYMENT_SUMMARY.md`.
- Per-key model access restrictions (extend `custom_auth.py` to map keys → allowed models)
- Spend tracking / rate limiting without DB (e.g. Azure Table Storage counters)
- Telemetry to Azure Monitor (latency, errors, token counts)
-- **Verify whether LiteLLM still exposes any residual `/ui` surface despite `disable_admin_ui: true`**. If needed, block it completely via an nginx sidecar that proxies traffic to LiteLLM on `localhost:4000` and returns `404` on `/ui*`. Change ingress `target_port` from `4000` to `80`. Alternative (paid): Azure Front Door WAF with a path-based custom rule.
+- **Verify whether LiteLLM still exposes any residual `/ui` surface despite `DISABLE_ADMIN_UI=True`**. If needed, block it completely via an nginx sidecar that proxies traffic to LiteLLM on `localhost:4000` and returns `404` on `/ui*`. Change ingress `target_port` from `4000` to `80`. Alternative (paid): Azure Front Door WAF with a path-based custom rule.
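
The `/ui` verification item above could be checked with a small probe like the following. This is only a sketch: the function names `ui_status` and `ui_is_blocked` are hypothetical, and treating `404` as the sole "blocked" outcome is an assumption taken from the sidecar behaviour described in the item, not something LiteLLM documents.

```python
import urllib.error
import urllib.request


def ui_status(base_url: str, path: str = "/ui", timeout: float = 10.0) -> int:
    """Return the final HTTP status code the proxy sends for `path`.

    Note: urlopen follows redirects, so a 3xx that lands on a login page
    is reported as the status of that final page.
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still the answer we want.
        return err.code


def ui_is_blocked(status: int) -> bool:
    """Assumption: only a 404 counts as 'no residual UI surface'."""
    return status == 404
```

Usage would be along the lines of `ui_is_blocked(ui_status("https://<your-proxy-host>"))`, run once against `/ui` and once against a deeper path such as `/ui/login` (a hypothetical path) to cover the `/ui*` prefix.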
2 changes: 1 addition & 1 deletion README.md
@@ -286,7 +286,7 @@ Typical savings for workloads with repeated context:
- **Secrets**: Never commit `.env` or `*.tfvars` files (both are gitignored)
- **Logging**: No prompt/response content is logged; only metadata (timestamps, latency, token counts)
- **HTTPS Only**: Container Apps enforces TLS on external ingress
-- **Proxy Hardening**: `disable_admin_ui: true`, `disable_key_management: true`, `drop_params: true`, `drop_unknown_params: true`
+- **Proxy Hardening**: `DISABLE_ADMIN_UI=True`, `drop_params: true`
- **Runtime Hardening**: LiteLLM image pinned to `ghcr.io/berriai/litellm:main-v1.82.3`, `min_replicas = 0`, `max_replicas = 1`, `cooldown_period_in_seconds = 600`
- **Least Privilege**: Managed identities used where possible

8 changes: 3 additions & 5 deletions docs/ARCHITECTURE.md
@@ -143,9 +143,7 @@ curl -sS \
- Store all secrets as Container Apps secrets; never commit to git
- HTTPS only enforced (`allow_insecure_connections = false`)
- `litellm_settings.drop_params: true` prevents clients overriding provider credentials
-- `litellm_settings.drop_unknown_params: true` drops unsupported request fields before they reach upstream APIs
-- Admin UI disabled (`disable_admin_ui: true`)
-- Key management routes disabled (`disable_key_management: true`)
+- Admin UI disabled via `DISABLE_ADMIN_UI` env var
- Container image pinned to `main-v1.82.3` — no floating tag surprises
- Scale-to-zero (`min_replicas = 0`, `max_replicas = 1`) limits blast radius of abuse
- `cooldown_period_in_seconds = 600` slows repeated cold-start churn after bursts
@@ -176,15 +174,15 @@ See `docs/USAGE_ANALYSIS.md` for schema, KQL examples, and cost tracking roadmap
- Per-key model access restrictions (extend `custom_auth.py`)
- Key alias mapping (human-readable labels for keys)
- Budget alerts / rate limiting
-- **Verify whether LiteLLM still exposes any residual `/ui` surface despite `disable_admin_ui: true`**. If needed, block it completely via an nginx sidecar that proxies traffic to LiteLLM on `localhost:4000` and returns `404` on `/ui*`. Change ingress `target_port` from `4000` to `80`. Alternative (paid): Azure Front Door WAF with a path-based custom rule.
+- **Verify whether LiteLLM still exposes any residual `/ui` surface despite `DISABLE_ADMIN_UI=True`**. If needed, block it completely via an nginx sidecar that proxies traffic to LiteLLM on `localhost:4000` and returns `404` on `/ui*`. Change ingress `target_port` from `4000` to `80`. Alternative (paid): Azure Front Door WAF with a path-based custom rule.
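
The "per-key model access restrictions" roadmap item in the list above might start from a lookup like the one below. This is a rough sketch, not the contents of `custom_auth.py`: the mapping `KEY_MODEL_ALLOWLIST`, the example keys, and both function names are hypothetical, and wiring the check into the auth handler is left out.

```python
from __future__ import annotations

import hmac

# Hypothetical mapping of API key -> models that key may call.
# In a real deployment this would be loaded from secrets, not hard-coded.
KEY_MODEL_ALLOWLIST: dict[str, frozenset[str]] = {
    "sk-example-team-a": frozenset({"gpt-4.1"}),
    "sk-example-team-b": frozenset({"gpt-4.1", "gpt-5.1-codex"}),
}


def allowed_models_for(
    api_key: str,
    allowlist: dict[str, frozenset[str]] = KEY_MODEL_ALLOWLIST,
) -> frozenset[str]:
    """Return the models permitted for `api_key`, or an empty set if unknown."""
    for known_key, models in allowlist.items():
        # Constant-time comparison avoids leaking key contents via timing.
        if hmac.compare_digest(known_key.encode(), api_key.encode()):
            return models
    return frozenset()


def is_model_allowed(
    api_key: str,
    model: str,
    allowlist: dict[str, frozenset[str]] = KEY_MODEL_ALLOWLIST,
) -> bool:
    """True only when the key is known and `model` is in its allowlist."""
    return model in allowed_models_for(api_key, allowlist)
```

Unknown keys resolve to an empty set, so the check fails closed rather than granting access by default.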

## Prompt Caching

Azure OpenAI models (`gpt-4.1`, `gpt-5.4`, `gpt-5.1-codex`) support automatic prompt caching for prompts with 1024+ tokens. The LiteLLM proxy preserves native OpenAI caching semantics:

- **No configuration required**: Caching activates automatically for eligible prompts
- **Prompt structure matters**: Place static content at the beginning, variable content at the end
-- **Use `prompt_cache_key`**: Improves hit rates for workloads with shared prefixes (parameter survives `drop_unknown_params: true`)
+- **Use `prompt_cache_key`**: Improves hit rates for workloads with shared prefixes (parameter survives `drop_params: true`)
- **Extended retention**: `prompt_cache_retention: "24h"` available for recurring tasks on `gpt-4.1` and newer models
- **Visibility**: Cached token counts logged in `UsageMetrics` table (`CachedTokensIn_d` field)

5 changes: 2 additions & 3 deletions docs/DEPLOYMENT_SUMMARY.md
@@ -122,17 +122,16 @@ Authorization: Bearer <api_key>
#### Additional Hardening

- `litellm_settings.drop_params: true` — prevents clients from overriding provider credentials.
-- `litellm_settings.drop_unknown_params: true` — strips unknown request fields before proxying upstream.
- DB features disabled (`store_model_in_db: false`, `disable_spend_logs: true`, etc.) — no database in use.
-- Admin UI and key-management routes disabled (`disable_admin_ui: true`, `disable_key_management: true`).
+- Admin UI disabled via `DISABLE_ADMIN_UI` env var.
- Container image pinned to `ghcr.io/berriai/litellm:main-v1.82.3`, HTTPS-only ingress, `min_replicas = 0`, `max_replicas = 1`, and `cooldown_period_in_seconds = 600`.

#### Prompt Caching

Azure OpenAI models (`gpt-4.1`, `gpt-5.4`, `gpt-5.1-codex`) support automatic prompt caching. Key points:

- **Automatic activation**: No configuration required; works for prompts with 1024+ tokens
-- **Parameter passthrough**: `prompt_cache_key` and `prompt_cache_retention` survive `drop_unknown_params: true` filtering
+- **Parameter passthrough**: `prompt_cache_key` and `prompt_cache_retention` survive `drop_params: true` filtering
- **Cost impact**: Cached tokens billed at ~10-20% of standard input pricing
- **Verification**: Check `usage.prompt_tokens_details.cached_tokens` in responses; monitor via Log Analytics `CachedTokensIn_d` field
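
The verification step above (reading `usage.prompt_tokens_details.cached_tokens` from a response) can be sketched as follows. The helper name `cached_token_stats` and the sample payload are illustrative assumptions; only the field path comes from the text, and the sketch assumes the response has already been parsed into a plain dictionary.

```python
from __future__ import annotations

from typing import Any, Mapping


def cached_token_stats(response: Mapping[str, Any]) -> tuple[int, int, float]:
    """Return (prompt_tokens, cached_tokens, cache_hit_ratio) for one response.

    Missing or null fields are treated as zero so that responses from
    models without caching support do not raise.
    """
    usage = response.get("usage") or {}
    prompt_tokens = int(usage.get("prompt_tokens") or 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = int(details.get("cached_tokens") or 0)
    ratio = cached_tokens / prompt_tokens if prompt_tokens else 0.0
    return prompt_tokens, cached_tokens, ratio


# Illustrative payload only; the numbers are made up for the example.
sample = {
    "usage": {
        "prompt_tokens": 2048,
        "completion_tokens": 50,
        "prompt_tokens_details": {"cached_tokens": 1024},
    }
}
print(cached_token_stats(sample))  # (2048, 1024, 0.5)
```

A ratio that stays at zero across repeated requests with an identical prefix of 1024+ tokens would suggest the cache is not being hit.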

1 change: 0 additions & 1 deletion docs/PROMPT_CACHING.md
@@ -271,7 +271,6 @@ If `prompt_cache_key` or `prompt_cache_retention` are not working:

1. Verify model supports caching via `/v1/model/info`
2. Check LiteLLM version compatibility
-3. Review `drop_unknown_params` setting (currently enabled for security)

The current deployment has validated that these parameters survive filtering for `gpt-4.1`.
