4 changes: 2 additions & 2 deletions .github/workflows/ci.yml
@@ -27,7 +27,7 @@ jobs:
strategy:
fail-fast: false
matrix:
environment: [dev, staging, prod]
environment: [dev, staging, production]
steps:
- name: Checkout
uses: actions/checkout@v6
@@ -59,7 +59,7 @@ jobs:
uses: actions/github-script@v9
with:
script: |
const environments = ['dev', 'staging', 'prod'];
const environments = ['dev', 'staging', 'production'];
const rows = environments.map(env => {
const job = context.payload.workflow_run?.jobs?.find(j => j.name.includes(env));
return `| ${env} | :white_check_mark: |`;
2 changes: 1 addition & 1 deletion .github/workflows/diff.yml
@@ -10,7 +10,7 @@ on:
options:
- dev
- staging
- prod
- production

jobs:
diff:
28 changes: 15 additions & 13 deletions AGENTS.md
@@ -6,31 +6,33 @@ You're an AI client (or the author of one) about to add a cluster-level addon fo

ArgoCD App-of-Apps catalog for AKS clusters. The Azure-specific cousin of [`eks-gitops`](../eks-gitops/) — same App-of-Apps pattern, same ApplicationSet generators, AKS-specific addon set.

Categories under `addons/`:
Six categories under `addons/` (listed in deploy order):

- **`argo-platform/`** — Argo CD, Argo Workflows, Argo Rollouts, Argo Events (same as EKS)
- **`bootstrap/`** — cluster bootstrap: cert-manager, external-secrets-operator, metrics-server, azure-workload-identity (instead of IRSA), azure-disk-csi-driver, cluster-autoscaler with AKS-specific config
- **`networking/`** — ingress-nginx, cilium (where supported by Azure CNI), network-policies
- **`observability/`** — kube-prometheus-stack, loki, tempo, azure-monitor-opentelemetry-collector
- **`operations/`** — keda (native AKS integration), descheduler, reloader, vpa
- **`security/`** — kyverno, falco, trivy-operator, gatekeeper
- **`bootstrap/`** — cert-manager, external-secrets, external-secrets-stores, metrics-server, prometheus-operator-crds, reloader, priority-classes, storage-classes
- **`networking/`** — cilium, external-dns, ingress-nginx
- **`security/`** — kyverno, falco, trivy-operator
- **`observability/`** — grafana-agent, grafana-operator, loki, tempo, opencost
- **`operations/`** — descheduler, goldilocks, karpenter-resources, keda, velero, vpa
- **`argo-platform/`** — Argo Workflows, Argo Rollouts, Argo Events

Karpenter on AKS is Node Auto Provisioning (the operator is managed by the AKS control plane), so this repo ships only `karpenter-resources` (the NodePool / AKSNodeClass CRs), not a karpenter operator chart.

Plus:

- **`applicationsets/`** — ApplicationSet generators that fan addons + tenant workloads out across clusters by label
- **`catalog/`** — per-addon catalog metadata
- **`environments/`** — per-cluster overlays
- **`dashboards/`** — Grafana dashboard JSON
- **`policies/`** — Kyverno + Gatekeeper policies enforced cluster-wide
- **`catalog/`** — platform-specific tenant workloads (currently Druid)
- **`environments/`** — per-cluster overlays (dev / staging / production)
- **`dashboards/`** — `GrafanaDashboard` CRs that grafana-operator reconciles into Azure Managed Grafana
- **`policies/`** — Kyverno policies (best-practices, pod-security-standards) enforced cluster-wide

## Contract surface

Identical to `eks-gitops` — same addon shape, same ApplicationSet pattern, same sync-wave ordering, same per-env values structure. Differences:

- **Identity**: Azure Workload Identity (federated credentials) instead of AWS IRSA. Components in `landing-zone/components/azure/` provision the federated credentials; tenant ServiceAccounts annotate with `azure.workload.identity/client-id`.
- **Storage classes**: AKS uses `managed-csi-premium` / `azure-disk` by default. EKS uses `gp3` / `ebs-csi`.
- **Ingress**: usually ingress-nginx (same as EKS). Application Gateway Ingress Controller is also supported via `addons/networking/agic/` when an AppGW is provisioned by landing-zone.
- **Observability collector**: `azure-monitor-opentelemetry-collector` ships logs to Azure Monitor in addition to Grafana Cloud.
- **Ingress**: ingress-nginx (same as EKS).
- **Observability**: `grafana-agent` ships metrics to an Azure Monitor Workspace; `grafana-operator` reconciles dashboards into Azure Managed Grafana (the EKS cousin targets the same external-Grafana pattern).
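
The Identity difference above can be illustrated with a minimal sketch of a tenant ServiceAccount and a pod that uses it. This is not taken from the repo: the resource names, namespace, image, and client ID are hypothetical placeholders. Only the `azure.workload.identity/client-id` annotation and the `azure.workload.identity/use` pod label come from the text of this change.

```yaml
# Hypothetical tenant ServiceAccount; name, namespace, and client ID are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-tenant-sa
  namespace: example-tenant
  annotations:
    # Client ID of the identity whose federated credential trusts this ServiceAccount.
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
# Hypothetical pod; the label opts it in to the workload-identity webhook.
apiVersion: v1
kind: Pod
metadata:
  name: example-tenant-pod
  namespace: example-tenant
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: example-tenant-sa
  containers:
    - name: app
      image: example.invalid/app:latest  # placeholder image
```

The annotation alone is presumably not enough: as the external-dns overrides in this change note, the `azure.workload.identity/use` pod label is what triggers the webhook.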

## Add a new addon

4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -14,10 +14,10 @@ addons/ → Addon configurations
values.yaml → Base Helm values (all environments)
values-dev.yaml → Dev delta overrides
values-staging.yaml → Staging delta overrides
values-prod.yaml → Production delta overrides
values-production.yaml → Production delta overrides
# Kustomize addons (storage-classes, priority-classes, karpenter-resources):
base/ → Kustomization + resource manifests
overlays/{dev,staging,prod}/
overlays/{dev,staging,production}/
→ Environment-specific kustomization.yaml
policies/ → Kyverno ClusterPolicy manifests (pure Kustomize, base/overlays)
environments/ → Cluster-config ConfigMaps per environment (includes provider field)
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Argo Events - prod env overrides
# Argo Events - production env overrides

controller:
serviceAccount:
2 changes: 0 additions & 2 deletions addons/argo-platform/argo-rollouts/values-prod.yaml

This file was deleted.

2 changes: 2 additions & 0 deletions addons/argo-platform/argo-rollouts/values-production.yaml
@@ -0,0 +1,2 @@
# Argo Rollouts - production env overrides
# No environment-specific overrides
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Argo Workflows - prod env overrides
# Argo Workflows - production env overrides

server:
serviceAccount:
@@ -9,5 +9,5 @@ server:

artifactRepository:
azure:
endpoint: https://prodaksartifacts8391ac17.blob.core.windows.net
endpoint: https://productionaksartifacts83.blob.core.windows.net
container: argo-artifacts
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# cert-manager - prod env overrides
# cert-manager - production env overrides

replicaCount: 3

Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@ patches:
patch: |
- op: replace
path: /spec/provider/azurekv/vaultUrl
value: https://prodsecrets8391ac1728dd4.vault.azure.net/
value: https://productionsecrets8391ac1.vault.azure.net/
- op: replace
path: /spec/provider/azurekv/tenantId
value: 00000000-0000-0000-0000-000000000000
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# External Secrets - prod env overrides
# External Secrets - production env overrides

serviceAccount:
labels:
3 changes: 0 additions & 3 deletions addons/bootstrap/metrics-server/values-prod.yaml

This file was deleted.

3 changes: 3 additions & 0 deletions addons/bootstrap/metrics-server/values-production.yaml
@@ -0,0 +1,3 @@
# Metrics Server - production env overrides

replicas: 2
1 change: 0 additions & 1 deletion addons/bootstrap/prometheus-operator-crds/values-prod.yaml

This file was deleted.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Prometheus Operator CRDs - production env overrides
1 change: 0 additions & 1 deletion addons/bootstrap/reloader/values-prod.yaml

This file was deleted.

1 change: 1 addition & 0 deletions addons/bootstrap/reloader/values-production.yaml
@@ -0,0 +1 @@
# Reloader - production env overrides
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# External DNS - prod env overrides
# External DNS - production env overrides
#
# Azure Workload Identity setup for external-dns:
# - The kubernetes-sigs/external-dns chart doesn't have a first-class Azure
@@ -10,7 +10,7 @@
# by the workload-identity webhook.
# - podLabels.azure.workload.identity/use=true triggers the webhook.
# - SA annotation maps to the UAMI's client_id.
# - `--txt-owner-id=prod-aks` namespaces the TXT registry records so this
# - `--txt-owner-id=production-aks` namespaces the TXT registry records so this
# cluster only owns the records it created.

podLabels:
@@ -30,9 +30,9 @@ secretConfiguration:
{
"tenantId": "00000000-0000-0000-0000-000000000000",
"subscriptionId": "00000000-0000-0000-0000-000000000000",
"resourceGroup": "prod",
"resourceGroup": "production",
"useWorkloadIdentityExtension": true
}

extraArgs:
- --txt-owner-id=prod-aks
- --txt-owner-id=production-aks
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Ingress NGINX - prod env overrides
# Ingress NGINX - production env overrides

controller:
replicaCount: 3
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Grafana Agent - prod env overrides
# Grafana Agent - production env overrides
#
# The chart's value-path layout is `agent.*` for the running agent
# container (extraEnv, resources, config, etc.) and `controller.*` for
@@ -28,9 +28,9 @@ controller:
agent:
extraEnv:
- name: AMW_REMOTE_WRITE_URL
value: "https://prod-aks-amw-dce-ed54.westus2-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-051a89947a17412a90e6f56a908913f5/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
value: "https://production-aks-amw-dce-ed54.westus2-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-051a89947a17412a90e6f56a908913f5/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24"
- name: CLUSTER_NAME
value: "prod-aks"
value: "production-aks"
resources:
requests:
cpu: 100m
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Grafana Operator - prod env overrides
# Grafana Operator - production env overrides

replicas: 2

Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Loki - prod env overrides
# Loki - production env overrides
#
# Workload Identity is wired up (the UAMI has Storage Blob Data Contributor
# on the loki storage account), but the chart's storage backend stays on
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# OpenCost - prod env overrides
# OpenCost - production env overrides

serviceAccount:
labels:
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Tempo - prod env overrides
# Tempo - production env overrides
#
# Workload Identity is wired up (the UAMI has Storage Blob Data Contributor
# on the tempo storage account), but the chart's trace backend stays on the
2 changes: 0 additions & 2 deletions addons/operations/descheduler/values-prod.yaml

This file was deleted.

2 changes: 2 additions & 0 deletions addons/operations/descheduler/values-production.yaml
@@ -0,0 +1,2 @@
# Descheduler - production env overrides
# No environment-specific overrides
Original file line number Diff line number Diff line change
@@ -51,7 +51,7 @@ patches:
name: default
spec:
tags:
environment: prod
environment: production
managed-by: karpenter
target:
kind: AKSNodeClass
@@ -60,5 +60,5 @@ patches:
configMapGenerator:
- name: karpenter-config
literals:
- CLUSTER_NAME=prod-aks
- ENVIRONMENT=prod
- CLUSTER_NAME=production-aks
- ENVIRONMENT=production
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# KEDA - prod env overrides
# KEDA - production env overrides

serviceAccount:
operator:
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Velero - prod env overrides
# Velero - production env overrides

serviceAccount:
server:
@@ -22,6 +22,6 @@ configuration:
provider: azure
bucket: velero
config:
resourceGroup: prod
storageAccount: prodaksbackups8391ac1728
resourceGroup: production
storageAccount: productionaksbackups8391
subscriptionId: 00000000-0000-0000-0000-000000000000
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# VPA - prod env overrides
# VPA - production env overrides
# No environment-specific overrides
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Falco - prod env overrides
# Falco - production env overrides

resources:
requests:
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Trivy Operator - prod env overrides
# Trivy Operator - production env overrides
#
# nodeCollector pins each pod to a specific node via `kubernetes.io/hostname`
# nodeSelector. On a cluster with tainted system-pool nodes (the
Original file line number Diff line number Diff line change
@@ -8,14 +8,14 @@ patches:
- patch: |-
- op: replace
path: /spec/external/url
value: https://prod-aks-grafana-azghhmakabeuhaa2.wus2.grafana.azure.com
value: https://production-aks-grafana-azghhmakabeuhaa2.wus2.grafana.azure.com
target:
kind: Grafana
name: external
- patch: |-
- op: replace
path: /spec/datasource/url
value: https://prod-aks-amw-gveeghapeqh6dre2.westus2.prometheus.monitor.azure.com
value: https://production-aks-amw-gveeghapeqh6dre2.westus2.prometheus.monitor.azure.com
target:
kind: GrafanaDatasource
name: managed-prometheus
Original file line number Diff line number Diff line change
@@ -4,12 +4,12 @@ metadata:
name: cluster-config
namespace: argocd
labels:
environment: prod
environment: production
data:
# Cluster identification
environment: "prod"
environment: "production"
provider: "azure"
cluster_name: "prod-aks"
cluster_name: "production-aks"

# Azure configuration
region: "westus2"
2 changes: 1 addition & 1 deletion scripts/smoke.sh
@@ -8,7 +8,7 @@
#
# Usage:
# ./scripts/smoke.sh # run against current kube-context
# ENV=prod ./scripts/smoke.sh # tag output with an env label
# ENV=production ./scripts/smoke.sh # tag output with an env label
# SKIP_DESTRUCTIVE=1 ./scripts/smoke.sh # skip checks that create resources
#
# Required tools: kubectl, az (for Azure-side checks), jq.
4 changes: 2 additions & 2 deletions scripts/wire-env.sh
@@ -15,7 +15,7 @@
# Usage:
# ./scripts/wire-env.sh <cloud> <account> <region> <env>
# Example:
# ./scripts/wire-env.sh azure workload-prod westus2 prod
# ./scripts/wire-env.sh azure workload-prod westus2 production
#
# Required tools: terragrunt, jq, sed. Run from the aks-gitops repo root;
# expects landing-zone at ../landing-zone.
@@ -25,7 +25,7 @@ set -euo pipefail
CLOUD="${1:?missing CLOUD (e.g. azure)}"
ACCOUNT="${2:?missing ACCOUNT (e.g. workload-prod)}"
REGION="${3:?missing REGION (e.g. westus2)}"
ENV="${4:?missing ENV (e.g. prod)}"
ENV="${4:?missing ENV (e.g. production)}"

LZ="${LANDING_ZONE_PATH:-../landing-zone}/live/$CLOUD/$ACCOUNT/$REGION/$ENV"
if [[ ! -d "$LZ" ]]; then