Operations

Deployment

Deployment Type Selection

The platform supports two deployment types with different infrastructure requirements:

Kind (Development):

  • Local Kubernetes cluster via Kind
  • Requires: Docker, kubectl, kind CLI
  • Simulated GPU support (RuntimeClass only)
  • nginx-ingress-controller
  • Suitable for development and testing
  • Bootstrap: ./bootstrap.sh --deployment kind

K3s (Production):

  • Lightweight Kubernetes for production
  • Requires: Ubuntu 22.04+, NVIDIA drivers, nvidia-container-toolkit
  • Real GPU support via NVIDIA GPU Operator
  • Gateway API with external DNS
  • Optimized for bare-metal and edge deployments
  • Bootstrap: ./bootstrap.sh --deployment k3s
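The choice can be scripted as a preflight heuristic; a minimal sketch, assuming only that a working nvidia-smi on the PATH indicates real GPU hardware:

```shell
#!/bin/sh
# Suggest a deployment type: k3s when a working NVIDIA GPU is visible,
# kind otherwise. Heuristic only -- adjust for your environment.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  echo "k3s"
else
  echo "kind"
fi
```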

Prerequisites

Common Prerequisites (both deployments):

  1. kubectl: Configured with cluster access and admin permissions
  2. DNS Configuration: For *.home.local (or your chosen domain)
  3. Network Access: Ability to reach ingress controller from your devices
  4. Cloudflare API Token: For organization webhook tunnels (set as CLOUDFLARE_API_TOKEN environment variable)
  5. Basic tools: curl, openssl, base64, jq
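These checks can be automated before running bootstrap; a small preflight sketch (the tool list is taken from the prerequisites above):

```shell
#!/bin/sh
# Preflight: verify the common prerequisite tools are on the PATH.
missing=0
for tool in kubectl curl openssl base64 jq; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool"
  else
    echo "missing: $tool"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "all prerequisites present" \
  || echo "install the missing tools before bootstrapping"
```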

Kind-Specific Prerequisites:

  1. Docker: Docker Desktop or Docker Engine
  2. Kind CLI: brew install kind or download from releases
  3. Kubernetes: Version 1.24+ (created by Kind)

K3s-Specific Prerequisites:

  1. Ubuntu 22.04+: With root/sudo access
  2. NVIDIA Drivers: Installed and working (nvidia-smi succeeds)
  3. NVIDIA Container Toolkit: nvidia-container-runtime installed
  4. K3s: Will be installed by bootstrap if not present

DNS Setup for Home Network Access

The platform requires DNS configuration for browser-based authentication. Choose one option:

Option A: Router/Pi-hole DNS (Recommended)

# Find your ingress controller IP
kubectl get svc -n ingress-nginx ingress-nginx-controller

# Add A records in your router or Pi-hole:
auth.home.local     → <INGRESS_IP>
dex.home.local      → <INGRESS_IP>
argocd.home.local   → <INGRESS_IP>
tekton.home.local   → <INGRESS_IP>

Option B: Hosts File (Per-device)

# Add to /etc/hosts on each device:
<INGRESS_IP> auth.home.local
<INGRESS_IP> dex.home.local
<INGRESS_IP> argocd.home.local
<INGRESS_IP> tekton.home.local

Zero-Touch Bootstrap Process

The bootstrap script brings the platform to complete convergence automatically, with no manual steps required.

Run Bootstrap

git clone https://github.com/bdchatham/AphexPlatformInfrastructure.git
cd AphexPlatformInfrastructure

# Set Cloudflare API token for organization webhooks
export CLOUDFLARE_API_TOKEN="your-cloudflare-api-token"

# Choose deployment type
./bootstrap.sh --deployment kind    # For development
# OR
./bootstrap.sh --deployment k3s     # For production with GPU support

Kind Bootstrap Options:

./bootstrap.sh --deployment kind [OPTIONS]

Options:
  --cluster-name NAME    Name for Kind cluster (default: platform-cluster)
  --use-existing         Use existing kubecontext instead of creating cluster
  --show-secrets         Display generated secrets (WARNING: not for production)

K3s Bootstrap Options:

sudo ./bootstrap.sh --deployment k3s [OPTIONS]

Options:
  --show-secrets         Display generated secrets (WARNING: not for production)

Note: Must run as root/sudo for K3s installation

Bootstrap Actions (Automatic):

  1. Routes to deployment-specific bootstrap script
  2. Creates or configures Kubernetes cluster (Kind or K3s)
  3. Generates ALL secrets automatically (PostgreSQL, Authentik, Dex, API tokens)
  4. Creates platform namespaces (argocd, auth-system, tekton-pipelines, platform-system, external-secrets)
  5. Stores all secrets in Kubernetes (never prints to stdout)
  6. Installs ArgoCD with proper OIDC configuration
  7. Creates deployment-specific platform-root ArgoCD Application
  8. Waits for ArgoCD to deploy all platform components (GitOps-managed)
  9. Waits for Authentik and creates API token via Authentik API
  10. Achieves complete platform convergence automatically
  11. Displays access instructions and secret retrieval commands
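The secrets in step 3 are generated and stored without ever being printed; conceptually the pattern looks like the following sketch (the use of openssl and the secret name here are assumptions about the script's internals, not a transcript of it):

```shell
#!/bin/sh
# Sketch: generate a random credential and store it only in Kubernetes.
# The real bootstrap script's internals may differ; this shows the pattern.
PG_PASSWORD="$(openssl rand -base64 32)"

# Store without echoing the value to stdout (requires cluster access):
#   kubectl create secret generic authentik-postgresql -n auth-system \
#     --from-literal=password="$PG_PASSWORD" \
#     --dry-run=client -o yaml | kubectl apply -f -
echo "generated a ${#PG_PASSWORD}-character credential (value not shown)"
```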

Deployment-Specific Paths:

  • Kind: Uses platform/base/argocd/apps → references platform/deployments/kind/*
  • K3s: Uses platform/deployments/k3s/argocd/apps → references platform/deployments/k3s/*

For detailed architecture, see architecture.md.

Layered cert-manager Deployment

The platform implements a layered cert-manager architecture that eliminates webhook timing issues and manual intervention.

Deployment Waves (Automatic via ArgoCD):

Wave 10: cert-manager Installation

# ArgoCD deploys cert-manager with PostSync validation
kubectl get pods -n cert-manager
kubectl get job cert-manager-webhook-readiness -n cert-manager

Wave 20: Certificate Foundation

# Only proceeds after webhook validation passes
kubectl get clusterissuer selfsigned-issuer
kubectl get certificates -A

Wave 30: Ingress Resources

# Only proceeds after certificates are ready
kubectl get ingress -A
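The wave ordering above is driven by sync-wave annotations on the ArgoCD Applications; an illustrative fragment (the annotation name is the standard ArgoCD one, the wave numbers are taken from the list above):

```yaml
# Illustrative: ordering is expressed via the standard ArgoCD annotation.
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "10"   # cert-manager install; "20" and "30" follow
```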

Source: platform/cert-manager/, platform/cert-manager/webhook-readiness-hook.yaml, platform/argocd/apps/platform-cert-manager.yaml

For detailed cert-manager architecture, see architecture.md.

Verification Steps

Step 1: Check Bootstrap Completion

# All ArgoCD applications should be Synced/Healthy
kubectl get applications -n argocd

# cert-manager components should be running
kubectl get pods -n cert-manager

# External Secrets Operator should be running
kubectl get pods -n external-secrets

# Certificates should be Ready
kubectl get certificates -A

Step 2: Verify Authentication System

# Check auth-system components
kubectl get pods -n auth-system

# Verify Authentik is accessible
curl -k https://auth.home.local/if/flow/initial-setup/

# Verify Dex OIDC discovery
curl -k https://dex.home.local/.well-known/openid-configuration

Source: platform/auth/authentik/, platform/auth/dex/

Step 3: Verify External Secrets Operator

# Check External Secrets Operator pods
kubectl get pods -n external-secrets

# Verify ClusterSecretStore CRD is installed
kubectl get crd clustersecretstores.external-secrets.io

# List any ClusterSecretStores (created when organizations are provisioned)
kubectl get clustersecretstores

Source: platform/base/external-secrets/

Step 4: Access Platform Services

Authentik UI (User Management):

# Get admin password
kubectl get secret authentik-secrets -n auth-system \
  -o jsonpath='{.data.admin-password}' | base64 -d

# Access: https://auth.home.local
# Username: admin
# Password: (from above command)

ArgoCD UI (GitOps Management):

# Access: https://argocd.home.local
# Click "Login via Dex"
# Authenticate with Authentik credentials

Tekton Dashboard (Pipeline Monitoring):

# Access: https://tekton.home.local
# Authenticate via Dex/Authentik

Source: platform/auth/secrets/README.md, platform/integrations/argocd-oidc-config.yaml

Expected Final State

After successful bootstrap and convergence:

ArgoCD Applications (Kind deployment):

NAME                          SYNC STATUS   HEALTH STATUS
platform-auth                 Synced        Healthy
platform-catalog              Synced        Healthy
platform-cert-foundation      Synced        Healthy
platform-cert-manager         Synced        Healthy
platform-controllers          Synced        Healthy
platform-crds                 Synced        Healthy
platform-external-secrets     Synced        Healthy
platform-gpu                  Synced        Healthy
platform-ingress              Synced        Healthy
platform-ingress-controller   Synced        Healthy
platform-rbac                 Synced        Healthy
platform-root                 Synced        Healthy
platform-tekton               Synced        Healthy

ArgoCD Applications (K3s deployment - additional apps):

platform-external-dns         Synced        Healthy
platform-gateway              Synced        Healthy
platform-gpu-operator         Synced        Healthy

Certificates:

NAMESPACE     NAME            READY   SECRET          AGE
auth-system   argocd-tls      True    argocd-tls      5m
auth-system   authentik-tls   True    authentik-tls   5m
auth-system   dex-tls         True    dex-tls         5m
auth-system   tekton-tls      True    tekton-tls      5m

Platform Services:

  • All pods Running in cert-manager, external-secrets, auth-system, argocd, tekton-pipelines namespaces
  • All Ingress resources configured with TLS certificates
  • Authentication flow working end-to-end
  • External Secrets Operator ready for ClusterSecretStore provisioning
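The "all pods Running" condition can be checked with a single jq filter; a sketch, demonstrated on inline sample data so only the live query needs kubectl access:

```shell
#!/bin/sh
# jq filter: pods whose phase is neither Running nor Succeeded.
filter='.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'

# Live check (assumes kubectl access):
#   kubectl get pods -A -o json | jq -r "$filter"

# Demo on sample data:
printf '%s' '{"items":[
  {"metadata":{"namespace":"argocd","name":"ok-pod"},"status":{"phase":"Running"}},
  {"metadata":{"namespace":"auth-system","name":"bad-pod"},"status":{"phase":"Pending"}}
]}' | jq -r "$filter"
# → auth-system/bad-pod: Pending
```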

Source

  • platform/bootstrap/bootstrap.sh - Bootstrap implementation
  • platform/cert-manager/ - Layered cert-manager architecture
  • platform/cert-manager/webhook-readiness-hook.yaml - PostSync validation
  • platform/auth/ - Authentication system components

Validate RBAC Authorization

# Run RBAC validation script
platform/scripts/validate-rbac.sh

# Manual RBAC checks
kubectl auth can-i create pipelines.platform.dev --as=admin@platform.local --as-group=platform-admins
kubectl auth can-i create pipelines.platform.dev --as=alice@platform.local --as-group=platform-engineering -n user-alice
kubectl auth can-i create pipelines.platform.dev --as=alice@platform.local --as-group=platform-engineering -n auth-system

Test Break-Glass Access

# Verify certificate-based admin access works
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes

# Test when OIDC is unavailable
kubectl scale deployment/dex --replicas=0 -n auth-system
kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -n auth-system
kubectl scale deployment/dex --replicas=1 -n auth-system

Access Authentication Services

Authentik UI (User Management):

# Get admin password
kubectl get secret authentik-secrets -n auth-system -o jsonpath='{.data.admin-password}' | base64 -d

# Access at https://auth.home.local
# Username: admin
# Password: (from above command)

ArgoCD with OIDC:

# Access at https://argocd.home.local
# Click "Login via Dex"
# Authenticate with Authentik credentials

Tekton Dashboard with OIDC:

# Access at https://tekton.home.local
# Authenticate with Authentik credentials via Dex

Source

  • platform/scripts/validate-oidc-discovery.sh
  • platform/scripts/validate-rbac.sh
  • platform/auth/ingress/

Organization and Repository Registration

Important: Dual-Domain Strategy
The platform uses two domain strategies:

  • arbiter-dev.com (public): Organization webhook endpoints accessible from the internet via Cloudflare tunnels
  • home.local (local): Authentication and platform services accessible only within the home network

This separation ensures webhook endpoints are publicly accessible while keeping platform administration services private. See architecture.md for detailed networking architecture.

Prerequisites for Organization Bootstrap

Before creating organizations, ensure:

  1. Cloudflare Account Setup:

    • Domain arbiter-dev.com added to Cloudflare account
    • Nameservers updated at domain registrar to point to Cloudflare
    • API token created with permissions: Zone.DNS (Edit), Account.Cloudflare Tunnel (Edit)
    • API token stored in cluster secret: cloudflare-api-token in platform-system namespace
  2. Platform Bootstrap Complete:

    • ArgoCD and all platform components deployed
    • Onboarding controller running in platform-system namespace

Bootstrap Organization

Organizations provide multi-tenant isolation with dedicated namespaces (org-{name}), EventListeners, and public webhook endpoints.

Using AphexCLI (Recommended):

aphex organization bootstrap --admin-email admin@acme-corp.com acme-corp

Manual YAML Application:

kubectl apply -f - <<EOF
apiVersion: aphex.io/v1alpha1
kind: Organization
metadata:
  name: acme-corp
  namespace: platform-system
spec:
  displayName: "ACME Corporation"
  adminUsers:
    - admin@acme-corp.com
  webhookSecret: ""  # Auto-generated if empty
EOF

What Gets Created:

  1. Organization namespace: org-acme-corp
  2. Cloudflare tunnel via API
  3. DNS CNAME record: acme-corp.arbiter-dev.com → {tunnel-id}.cfargotunnel.com
  4. Cloudflared tunnel deployment
  5. EventListener with dedicated ServiceAccount and ClusterRoleBinding
  6. Organization admin RBAC
  7. Webhook secret for GitHub integration

Source: platform/crds/organization-crd.yaml, platform/platform-controller/controller/controllers/organization_controller.go

Verify Organization Bootstrap

# Check Organization status
kubectl get organization acme-corp -n platform-system
kubectl describe organization acme-corp -n platform-system

# Verify organization namespace
kubectl get namespace org-acme-corp

# Verify webhook secret
kubectl get secret github-webhook-secret -n org-acme-corp

# Verify Cloudflared tunnel
kubectl get deployment -n org-acme-corp | grep cloudflared
kubectl get configmap cloudflared-config -n org-acme-corp

# Verify EventListener
kubectl get eventlisteners -n org-acme-corp
kubectl get clusterrolebinding eventlistener-acme-corp

# Verify admin RBAC
kubectl get role,rolebinding -n org-acme-corp

# Verify External Secrets infrastructure
kubectl get serviceaccount eso-secrets-reader -n org-acme-corp
kubectl get role eso-secrets-reader -n org-acme-corp
kubectl get rolebinding eso-secrets-reader -n org-acme-corp
kubectl get clustersecretstore org-acme-corp-store

Test Webhook Endpoint

Verify the public webhook endpoint is accessible:

# Test DNS resolution
nslookup acme-corp.arbiter-dev.com

# Test HTTPS connectivity (expect 404 for GET request)
curl -k https://acme-corp.arbiter-dev.com

# Get webhook secret for GitHub configuration
kubectl get secret github-webhook-secret -n org-acme-corp -o jsonpath='{.data.secret}' | base64 -d

Configure GitHub Webhook:

  1. Go to repository Settings → Webhooks → Add webhook
  2. Payload URL: https://acme-corp.arbiter-dev.com
  3. Content type: application/json
  4. Secret: (use secret from command above)
  5. Events: Select "Push events"
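Delivery can also be simulated by hand. GitHub signs the payload with HMAC-SHA256 of the shared secret, so a hedged sketch of a manual test delivery (the secret and payload values here are placeholders):

```shell
#!/bin/sh
# Simulate a GitHub push-event delivery with a valid signature.
SECRET="example-webhook-secret"   # placeholder: use the secret retrieved above
PAYLOAD='{"ref":"refs/heads/main","repository":{"full_name":"acme/my-application"}}'

# GitHub's X-Hub-Signature-256 header carries: sha256=<hmac-hex>
SIG="sha256=$(printf '%s' "$PAYLOAD" | openssl dgst -sha256 -hmac "$SECRET" | awk '{print $NF}')"
echo "$SIG"

# Send it (assumes the endpoint above; expect 2xx from the EventListener):
#   curl -k -X POST https://acme-corp.arbiter-dev.com \
#     -H 'Content-Type: application/json' \
#     -H 'X-GitHub-Event: push' \
#     -H "X-Hub-Signature-256: $SIG" \
#     -d "$PAYLOAD"
```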

Verify Webhook Delivery:

# Watch EventListener logs
kubectl logs -n org-acme-corp -l eventlistener=github-listener -f

# Check for PipelineRuns after push
kubectl get pipelineruns -n org-acme-corp

Source: platform/platform-controller/controller/controllers/organization_controller.go

Create RepoBinding

After organization bootstrap, repositories can be onboarded to create webhook integration with existing pipelines.

# Create RepoBinding
kubectl apply -f - <<EOF
apiVersion: aphex.io/v1alpha1
kind: RepoBinding
metadata:
  name: my-repo-binding
  namespace: platform-system
spec:
  aphexOrg: "acme-corp"
  repoOrg: "acme"
  repoName: "my-application"
  pipelineName: "my-application-pipeline"
  templateRef: "run-pipeline-v1"
  ingressHost: ""  # Optional: defaults to cluster ingress
EOF

Field Descriptions:

  • aphexOrg: Organization name (maps to org-{aphexOrg} namespace)
  • repoOrg: GitHub organization name
  • repoName: Repository name
  • pipelineName: Name of Tekton Pipeline to trigger
  • templateRef: Dispatcher template name (e.g., run-pipeline-v1)
  • ingressHost: Optional webhook hostname (defaults to cluster ingress)

Source: platform/crds/repobinding-crd.yaml, platform/platform-controller/controller/controllers/repobinding_controller.go

Verify Onboarding

# Check RepoBinding status
kubectl get repobinding my-repo-binding -n platform-system
kubectl describe repobinding my-repo-binding -n platform-system

# Verify organization namespace
kubectl get namespace org-acme-corp

# Verify service account in pipeline namespace
kubectl get serviceaccount pipeline-runner -n my-pipeline

# Verify RBAC in pipeline namespace
kubectl get role,rolebinding,clusterrole,clusterrolebinding -n my-pipeline | grep my-pipeline

# Verify ArgoCD AppProject
kubectl get appproject my-pipeline -n argocd
kubectl describe appproject my-pipeline -n argocd

# Verify resource limits in pipeline namespace
kubectl get resourcequota,limitrange -n my-pipeline

# Verify network policy in pipeline namespace
kubectl get networkpolicy -n my-pipeline

# Verify Tekton resources in organization namespace
kubectl get trigger,triggertemplate -n org-acme-corp | grep my-pipeline

Configure GitHub Webhook

After onboarding, configure the webhook in GitHub:

# Get webhook URL and secret from RepoBinding status
kubectl get repobinding my-repo-binding -n platform-system -o yaml

# Look for status.webhookURL and status.webhookSecret

Configure in GitHub:

  1. Go to repository Settings → Webhooks → Add webhook
  2. Payload URL: (from RepoBinding status.webhookURL)
  3. Content type: application/json
  4. Secret: (from RepoBinding status.webhookSecret)
  5. Events: Push events
  6. Active: ✓
  7. Click "Add webhook"

Source

  • platform/crds/repobinding-crd.yaml
  • platform/platform-controller/controller/

Authentication System Operations

The authentication system provides centralized identity management through Authentik with Dex as an OIDC connector layer. All authentication components are managed by ArgoCD via GitOps.

Accessing Authentik UI

After bootstrap completes and ArgoCD syncs the auth system:

Step 1: Ensure DNS is configured

Configure DNS so that *.home.local resolves to your Ingress controller's IP address.

# Find Ingress controller IP
kubectl get svc -n ingress-nginx ingress-nginx-controller

# Add DNS records (router/Pi-hole) or /etc/hosts entries:
# 192.168.1.100 auth.home.local
# 192.168.1.100 dex.home.local
# 192.168.1.100 argocd.home.local
# 192.168.1.100 tekton.home.local

Step 2: Retrieve admin password

kubectl get secret authentik-secrets -n auth-system \
  -o jsonpath='{.data.admin-password}' | base64 -d

Step 3: Access Authentik UI

  • Open https://auth.home.local in browser
  • Accept certificate warning (if using self-signed certificates)
  • Login with username admin and password from Step 2

User Management

Creating Users

  1. Navigate to Directory → Users
  2. Click Create
  3. Fill in user details:
    • Username (required, unique)
    • Email (required, unique)
    • Name (display name)
    • Password (or send password reset email)
  4. Assign user to groups:
    • admins: Full access to all platform services
    • engineering: Read-only access to platform services
  5. Click Create

Users can authenticate immediately; no pod restarts or configuration changes are required.

Managing Groups

  1. Navigate to Directory → Groups
  2. View existing groups (admins, engineering)
  3. Create new groups as needed
  4. Assign users to groups
  5. Groups are automatically included in OIDC tokens
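Whether the groups claim actually lands in issued tokens can be checked by decoding a token's payload segment; a sketch (the sample token below is constructed inline for demonstration, real tokens come from Dex after login):

```shell
#!/bin/sh
# Decode the payload of a JWT and print its "groups" claim.
jwt_groups() {
  # Extract the middle segment and convert base64url back to base64.
  p=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore stripped '=' padding.
  pad=$(( (4 - ${#p} % 4) % 4 ))
  i=0
  while [ "$i" -lt "$pad" ]; do p="$p="; i=$((i + 1)); done
  printf '%s' "$p" | base64 -d | jq -c '.groups'
}

# Demo on a synthetic token:
payload=$(printf '%s' '{"groups":["admins","engineering"]}' | base64 | tr '+/' '-_' | tr -d '=')
jwt_groups "header.$payload.signature"
# → ["admins","engineering"]
```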

Changing Admin Password

  1. Navigate to Directory → Users
  2. Click on admin user
  3. Click Set password
  4. Enter new password
  5. Click Update

DNS Configuration

DNS configuration is required so that user browsers can reach services by hostname; OIDC authentication depends on redirect URIs that browsers can resolve.

Option 1: Router/Pi-hole DNS (Recommended)

Add A records in your home router or Pi-hole:

auth.home.local     → 192.168.1.100
dex.home.local      → 192.168.1.100
argocd.home.local   → 192.168.1.100
tekton.home.local   → 192.168.1.100

Replace 192.168.1.100 with your Ingress controller's LoadBalancer IP or NodePort IP.

Option 2: Hosts File

Add entries to /etc/hosts on each device:

# Linux/macOS
sudo nano /etc/hosts

# Add these lines:
192.168.1.100 auth.home.local
192.168.1.100 dex.home.local
192.168.1.100 argocd.home.local
192.168.1.100 tekton.home.local

Verify DNS configuration:

nslookup auth.home.local
nslookup dex.home.local
curl -k https://auth.home.local

TLS Certificate Setup

All services use HTTPS with TLS certificates. Choose between self-signed (simplest) or Let's Encrypt (trusted certificates).

Self-Signed Certificates (Simplest)

  1. Install cert-manager:

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.4/cert-manager.yaml
    
    kubectl wait --for=condition=ready pod \
      -l app.kubernetes.io/instance=cert-manager \
      -n cert-manager \
      --timeout=90s
  2. Create self-signed ClusterIssuer:

    cat <<EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: selfsigned-issuer
    spec:
      selfSigned: {}
    EOF
  3. Accept certificate warnings in browser:

    • Chrome: Click "Advanced" → "Proceed to auth.home.local (unsafe)"
    • Firefox: Click "Advanced" → "Accept the Risk and Continue"

Let's Encrypt with DNS-01 Challenge

Provides trusted certificates without browser warnings. Requires DNS provider API access.

  1. Install cert-manager (same as above)

  2. Create DNS provider API token secret (Cloudflare example):

    kubectl create secret generic cloudflare-api-token \
      -n cert-manager \
      --from-literal=api-token=YOUR_TOKEN_HERE
  3. Create Let's Encrypt ClusterIssuer:

    cat <<EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-dns
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: your-email@example.com
        privateKeySecretRef:
          name: letsencrypt-dns-key
        solvers:
          - dns01:
              cloudflare:
                apiTokenSecretRef:
                  name: cloudflare-api-token
                  key: api-token
    EOF
  4. Update Ingress resources to use Let's Encrypt issuer and your domain

  5. Commit and push to Git - ArgoCD syncs changes automatically

Verify certificates:

kubectl get certificate -n auth-system
kubectl describe certificate authentik-tls -n auth-system

Secret Rotation

Secrets should be rotated periodically for security. The authentication system supports secret rotation without downtime.

Rotating Authentik Admin Password

Via Authentik UI (Recommended):

  1. Login to Authentik UI
  2. Navigate to Directory → Users → admin
  3. Click Set password
  4. Enter new password
  5. Click Update
  6. Update Kubernetes Secret:
    kubectl create secret generic authentik-secrets \
      -n auth-system \
      --from-literal=secret-key="$(kubectl get secret authentik-secrets -n auth-system -o jsonpath='{.data.secret-key}' | base64 -d)" \
      --from-literal=admin-password="NEW_PASSWORD" \
      --dry-run=client -o yaml | kubectl apply -f -

Rotating Dex Client Secret

  1. Generate new secret:

    NEW_SECRET=$(openssl rand -base64 32)
  2. Update Kubernetes Secret:

    kubectl create secret generic dex-secrets \
      -n auth-system \
      --from-literal=client-secret="$NEW_SECRET" \
      --dry-run=client -o yaml | kubectl apply -f -
  3. Update Authentik OIDC provider:

    • Login to Authentik UI
    • Navigate to Applications → Providers → Dex OIDC Provider
    • Update Client Secret field
    • Click Update
  4. Restart Dex:

    kubectl rollout restart deployment/dex -n auth-system

Rotating Authentik API Token

  1. Create new token via Authentik UI:

    • Navigate to Directory → Tokens → Create
    • Set identifier: "config-sync-job-new"
    • Set intent: "API"
    • Copy token value
  2. Update Kubernetes Secret:

    kubectl create secret generic authentik-api-token \
      -n auth-system \
      --from-literal=token="NEW_TOKEN_HERE" \
      --dry-run=client -o yaml | kubectl apply -f -
  3. Revoke old token:

    • Navigate to Directory → Tokens
    • Find old token → Delete

Config Sync Job Management

The Config Sync Job orchestrates Authentik-Dex integration. It runs automatically during bootstrap but can be manually triggered.

Check Job Status

# Check Job status
kubectl get job auth-config-sync -n auth-system

# Check Job pod status
kubectl get pods -n auth-system -l app=auth-config-sync

# View Job logs
kubectl logs -n auth-system -l app=auth-config-sync

Manually Trigger Job

To re-run the Job (e.g., after fixing a configuration issue):

# Delete existing Job
kubectl delete job auth-config-sync -n auth-system

# ArgoCD will recreate the Job automatically
# Or manually apply:
kubectl apply -f platform/auth/config-sync/job.yaml

# Watch Job progress
kubectl logs -n auth-system -l app=auth-config-sync -f

Verify Job Completion

# Check if Job completed successfully
kubectl get job auth-config-sync -n auth-system -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}'
# Should output: True

# Get detailed Job status
kubectl describe job auth-config-sync -n auth-system

GitOps Workflow for Authentication Changes

All authentication system components are managed by ArgoCD. To make changes:

Step 1: Update manifests in Git

# Clone repository
git clone https://github.com/bdchatham/AphexPlatformInfrastructure.git
cd AphexPlatformInfrastructure

# Update authentication manifests
# Example: Update Authentik image version
vi platform/auth/authentik/server-deployment.yaml

# Commit changes
git add .
git commit -m "Update Authentik to v2024.2.2"
git push

Step 2: ArgoCD detects and syncs changes

ArgoCD polls Git every 3 minutes and detects changes automatically.

# Watch ArgoCD sync status
kubectl get application platform-auth -n argocd -w

# Or view in ArgoCD UI
# https://argocd.home.local

Step 3: Verify changes

# Check pod status
kubectl get pods -n auth-system

# Check ArgoCD sync status
kubectl get application platform-auth -n argocd

# View sync details
kubectl describe application platform-auth -n argocd

No manual kubectl apply is required; ArgoCD handles all deployments and updates.

Authentication Troubleshooting

Authentication Fails

Symptom: User cannot log in to ArgoCD or Tekton Dashboard

Diagnosis:

# Check DNS resolution
nslookup auth.home.local
nslookup dex.home.local

# Check Authentik OIDC discovery
curl -k https://auth.home.local/application/o/dex/.well-known/openid-configuration

# Check Dex OIDC discovery
curl -k https://dex.home.local/.well-known/openid-configuration

# Check user groups in Authentik UI
# Login → Directory → Users → Select user → Groups tab

Resolution:

  • Configure DNS if resolution fails
  • Verify redirect URI configuration in Dex and Authentik
  • Verify user is in correct group (admins or engineering)

Dex Pod CrashLoopBackOff

Symptom: Dex pod fails to start repeatedly

Diagnosis:

# Check Dex logs
kubectl logs -n auth-system deployment/dex

# Verify Dex starts with replicas=0
kubectl get deployment dex -n auth-system -o jsonpath='{.spec.replicas}'

# Check Config Sync Job status
kubectl get job auth-config-sync -n auth-system
kubectl logs -n auth-system job/auth-config-sync

Resolution:

  • Verify Authentik is running and healthy
  • Check Dex ConfigMap for syntax errors
  • Verify Config Sync Job completed successfully
  • Re-run Config Sync Job if needed

ArgoCD Not Syncing Auth Changes

Symptom: Changes to Git are not applied to cluster

Diagnosis:

# Check Application status
kubectl get application platform-auth -n argocd

# Check Application details
kubectl describe application platform-auth -n argocd

# Check for sync errors
kubectl get application platform-auth -n argocd -o json | \
  jq '.status.conditions[] | select(.type=="SyncError")'

Resolution:

  • Verify ArgoCD auto-sync is enabled
  • Check for invalid YAML syntax in manifests
  • Manually trigger sync:
    kubectl patch application platform-auth -n argocd \
      --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{}}}'

Secrets Not Found

Symptom: Pods fail to start with "secret not found" errors

Diagnosis:

# Check if secrets exist
kubectl get secrets -n auth-system

# Expected secrets:
# - authentik-postgresql
# - authentik-secrets
# - dex-secrets
# - authentik-api-token

Resolution:

  • Re-run bootstrap to regenerate secrets:
    ./platform/bootstrap/bootstrap.sh
  • Or manually create missing secrets (see platform/auth/secrets/README.md)

Source

  • platform/auth/README.md
  • platform/auth/secrets/README.md
  • platform/auth/ingress/README.md
  • platform/auth/config-sync/README.md
  • platform/bootstrap/bootstrap.sh

ArgoCD-Based Upgrade Workflow

The platform upgrades itself via ArgoCD when manifests change in Git. No manual kubectl apply or custom upgrade scripts are needed.

Update Platform Components

Step 1: Update Manifests in Git

# Clone platform repository
git clone https://github.com/bdchatham/AphexPlatformInfrastructure.git
cd AphexPlatformInfrastructure

# Update component manifests
# Example: Update controller image
vi platform/platform-controller/controller-deployment.yaml  # Update image tag

# Commit changes
git add .
git commit -m "Update onboarding controller to v1.1.0"
git push

Step 2: ArgoCD Detects Changes

ArgoCD polls Git every 3 minutes (default) and detects changes automatically.

# Watch ArgoCD sync status
kubectl get application -n argocd -w

# Or view in ArgoCD UI
# https://argocd.home.local

Step 3: ArgoCD Syncs Changes

ArgoCD automatically syncs changes based on sync policy:

  • Automated sync: Changes applied automatically
  • Self-heal: Drift corrected automatically
  • Prune: Removed resources deleted automatically
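In manifest form, these behaviors correspond to a syncPolicy stanza like the following (field names per the ArgoCD Application spec; the values shown are illustrative, not copied from the platform's manifests):

```yaml
# Excerpt of an ArgoCD Application enabling the behaviors above.
syncPolicy:
  automated:
    prune: true      # delete resources removed from Git
    selfHeal: true   # revert manual drift in the cluster
  syncOptions:
    - CreateNamespace=true   # illustrative; optional
```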
# Check sync status
kubectl get application platform-root -n argocd

# View sync details
kubectl describe application platform-root -n argocd

# Check child Applications
kubectl get application -n argocd

Step 4: Verify Upgrade

# Check component versions
kubectl get deployment -n platform-system -o wide

# Check pod status
kubectl get pods -n platform-system

# Check ArgoCD sync status
kubectl get application -n argocd

Manual Sync (If Needed)

If automatic sync is disabled or you want to sync immediately:

# Sync via kubectl
kubectl patch application platform-root -n argocd --type merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD"}}}'

# Or sync via ArgoCD CLI
argocd app sync platform-root

# Or sync via ArgoCD UI
# Click "Sync" button in UI

Source

  • platform/argocd/apps/platform-root.yaml

Monitoring

Platform Health Checks

# Check ArgoCD
kubectl get pods -n argocd

# Check Tekton controllers
kubectl get pods -n tekton-pipelines

# Check platform controller
kubectl get pods -n platform-system -l app=platform-controller

# Check ArgoCD Applications
kubectl get application -n argocd

ArgoCD Sync Status

# List all Applications
kubectl get application -n argocd

# Get Application sync status
kubectl get application platform-root -n argocd -o jsonpath='{.status.sync.status}'

# View sync details
kubectl describe application platform-root -n argocd

# Check for sync errors
kubectl get application -n argocd -o json | jq '.items[] | select(.status.sync.status != "Synced") | {name: .metadata.name, status: .status.sync.status, message: .status.conditions[0].message}'

Pipeline Execution Monitoring

# List all PipelineRuns
kubectl get pipelineruns --all-namespaces

# Get PipelineRun details
kubectl describe pipelinerun <name> -n <pipeline-namespace>

# View PipelineRun logs
kubectl logs -n <pipeline-namespace> -l tekton.dev/pipelineRun=<name>

# Watch PipelineRun status
kubectl get pipelinerun <name> -n <pipeline-namespace> -w

EventListener Logs

# View EventListener logs for an organization
kubectl logs -n org-<organization-name> -l eventlistener=github-listener --tail=100

# Stream EventListener logs
kubectl logs -n org-<organization-name> -l eventlistener=github-listener -f

# Search for specific webhook events
kubectl logs -n org-<organization-name> -l eventlistener=github-listener | grep "webhook"

Onboarding Controller Logs

# View controller logs
kubectl logs -n platform-system -l app=platform-controller --tail=100

# Stream controller logs
kubectl logs -n platform-system -l app=platform-controller -f

# Search for specific RepoBinding
kubectl logs -n platform-system -l app=platform-controller | grep "repobinding-name"

Source

  • platform/argocd/apps/
  • platform/platform-controller/controller-deployment.yaml

Troubleshooting

EventListener CrashLoopBackOff

Symptom: EventListener pod crashes with "empty caBundle in clusterInterceptor spec" error.

Diagnosis:

# Check EventListener pod status in organization namespace
kubectl get pods -n org-<organization-name> -l eventlistener=github-listener

# Check EventListener logs
kubectl logs -n org-<organization-name> -l eventlistener=github-listener --tail=50

# Check if ClusterInterceptors exist
kubectl get clusterinterceptors

# Check if Core Interceptors deployment exists
kubectl get deployment tekton-triggers-core-interceptors -n tekton-pipelines

Resolution:

This error occurs when Tekton Triggers Core Interceptors are not installed. The Core Interceptors provide ClusterInterceptor resources (github, gitlab, cel, etc.) that EventListeners need.

# Install Core Interceptors
kubectl apply -f https://github.com/tektoncd/triggers/releases/download/v0.34.0/interceptors.yaml

# Verify ClusterInterceptors are created
kubectl get clusterinterceptors

# Delete EventListener pod to restart
kubectl delete pod -n org-<organization-name> -l eventlistener=github-listener

# Verify EventListener is running
kubectl get pods -n org-<organization-name> -l eventlistener=github-listener

Prevention: Ensure the bootstrap script installs Core Interceptors, or that the platform-tekton ArgoCD Application includes interceptors.yaml.

EventListener RBAC Permission Errors

Symptoms: EventListener pod logs show "cannot list resource clusterinterceptors" or "cannot list resource clustertriggerbindings" errors.

Diagnosis:

# Check EventListener logs
kubectl logs -n org-<organization-name> -l eventlistener=github-listener --tail=50

# Check if ClusterRole exists for pipeline
kubectl get clusterrole pipeline-runner-<pipeline-name>

# Check if ClusterRoleBinding exists for pipeline
kubectl get clusterrolebinding pipeline-runner-<pipeline-name>

Resolution:

EventListener pods need cluster-scoped read permissions for ClusterInterceptor and ClusterTriggerBinding resources. The onboarding controller should provision these automatically.

# Check if controller provisioned cluster-scoped RBAC
kubectl describe clusterrole pipeline-runner-<pipeline-name>
kubectl describe clusterrolebinding pipeline-runner-<pipeline-name>

# If missing, delete and recreate RepoBinding to trigger reprovisioning
kubectl delete repobinding <name> -n platform-system
kubectl apply -f repobinding.yaml

# Verify cluster-scoped RBAC was created
kubectl get clusterrole pipeline-runner-<pipeline-name>
kubectl get clusterrolebinding pipeline-runner-<pipeline-name>

# Delete EventListener pod to restart with new permissions
kubectl delete pod -n org-<organization-name> -l eventlistener=github-listener

Prevention: Ensure platform controller has permissions to create ClusterRoles and ClusterRoleBindings (check platform/platform-controller/controller-rbac.yaml).

Repository Not Triggering Pipelines

Diagnosis:

# Check if EventListener exists in organization namespace
kubectl get eventlistener -n org-<organization-name>

# Check EventListener logs for webhook events
kubectl logs -n org-<organization-name> -l eventlistener=github-listener | grep "webhook"

# Check if Ingress exists
kubectl get ingress -n org-<organization-name>

# Check if webhook secret exists in organization namespace
kubectl get secret github-webhook-secret -n org-<organization-name>

Resolution:

  1. Verify EventListener is running
  2. Verify Ingress is configured correctly
  3. Verify GitHub webhook is configured with correct URL and secret
  4. Check EventListener logs for error messages

PipelineRun Failures

Diagnosis:

# Get PipelineRun status
kubectl get pipelinerun <name> -n <pipeline-namespace>

# Get detailed status
kubectl describe pipelinerun <name> -n <pipeline-namespace>

# Get pod logs
kubectl logs -n <pipeline-namespace> -l tekton.dev/pipelineRun=<name>

# Check pod events
kubectl get events -n <pipeline-namespace> --sort-by='.lastTimestamp'

Common Issues:

  1. Git clone failure: Check repository access
  2. CDKTF synth failure: Check Node.js dependencies and syntax errors
  3. CDKTF deploy failure: Check Terraform state and permissions
  4. RBAC denial: Check service account permissions
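A failed run's terminal condition can be pulled out with jq. The status document below is a hypothetical sample in the Tekton PipelineRun status shape:

```shell
# Hypothetical PipelineRun status; real input comes from
# `kubectl get pipelinerun <name> -n <ns> -o json`.
cat <<'EOF' > /tmp/pr.json
{"status":{"conditions":[
  {"type":"Succeeded","status":"False","reason":"Failed",
   "message":"Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0"}]}}
EOF

# Print "<reason>: <message>" for the terminal Succeeded condition.
jq -r '.status.conditions[] | select(.type == "Succeeded")
       | "\(.reason): \(.message)"' /tmp/pr.json
# → Failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0
```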

Onboarding Failures

Diagnosis:

# Check RepoBinding status
kubectl get repobinding <name> -n platform-system
kubectl describe repobinding <name> -n platform-system

# Check controller logs
kubectl logs -n platform-system -l app=platform-controller | grep "<name>"

Common Issues:

  1. Invalid namespace pattern: Namespace name doesn't match pattern
  2. RBAC failure: Controller lacks permissions to create resources
  3. EventListener creation failed: Check Tekton Triggers installation
  4. Ingress creation failed: Check Ingress controller installation
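For issue 1, namespace names must at minimum satisfy the Kubernetes RFC 1123 label rule. A quick offline check follows; the controller's own pattern may be stricter, so treat this as a baseline only:

```shell
# RFC 1123 label check: lowercase alphanumerics and '-', must start and
# end with an alphanumeric, at most 63 characters. The platform controller
# may enforce a stricter pattern; this covers the Kubernetes baseline.
valid_ns() {
  [ ${#1} -le 63 ] &&
    printf '%s' "$1" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

for ns in org-my-org My_Org org-; do
  valid_ns "$ns" && echo "$ns: valid" || echo "$ns: invalid"
done
# → org-my-org: valid, My_Org: invalid, org-: invalid
```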

ArgoCD Sync Failures

Diagnosis:

# Check Application sync status
kubectl get application -n argocd

# Get sync error details
kubectl describe application <name> -n argocd

# View Application events
kubectl get events -n argocd --field-selector involvedObject.name=<name>

# Check ArgoCD controller logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller

Common Issues:

  1. Invalid manifest: YAML syntax errors in Git
  2. Resource conflicts: Resource already exists with different configuration
  3. RBAC denial: ArgoCD lacks permissions to create resources
  4. Git connection failure: ArgoCD cannot access Git repository

Source

  • platform/argocd/apps/
  • platform/platform-controller/controller/
  • platform/catalog/

Disaster Recovery

Backup Platform Configuration

# Backup RepoBindings
kubectl get repobindings -n platform-system -o yaml > repobindings-backup.yaml

# Backup ArgoCD Applications
kubectl get applications -n argocd -o yaml > applications-backup.yaml

# Backup platform manifests (already in Git)
# No backup needed - Git is the source of truth

Restore Platform

Step 1: Re-run Bootstrap

# Run bootstrap on new cluster
cd platform/bootstrap
./bootstrap.sh --cluster-name aphex-platform --repo-url https://github.com/bdchatham/AphexPlatformInfrastructure

Step 2: Wait for ArgoCD to Sync

ArgoCD will automatically sync all platform components from Git.

# Watch ArgoCD sync
kubectl get application -n argocd -w

# Verify all Applications are synced
kubectl get application -n argocd

Step 3: Restore RepoBindings

# Restore RepoBindings
kubectl apply -f repobindings-backup.yaml

# Verify onboarding
kubectl get repobindings -n platform-system
kubectl get namespaces -l aphex/managed-by=platform-controller

Step 4: Verify Platform

# Check all components
kubectl get pods -n argocd
kubectl get pods -n tekton-pipelines
kubectl get pods -n platform-system

# Check organization namespaces
kubectl get namespaces -l aphex/managed-by=platform-controller

# Check ArgoCD sync status
kubectl get application -n argocd

Source

  • platform/bootstrap/bootstrap.sh
  • .kiro/specs/argocd-tekton-platform/requirements.md (Requirement 12.1, 12.4)

Common Troubleshooting Scenarios

Webhook Not Delivered

Symptoms: No PipelineRun created after merge to main.

Diagnosis:

# Check EventListener logs in organization namespace
kubectl logs -n org-<organization-name> -l eventlistener=github-listener --tail=100

# Check Ingress configuration
kubectl get ingress -n org-<organization-name> -o yaml

# Check webhook secret in organization namespace
kubectl get secret github-webhook-secret -n org-<organization-name>

# Check GitHub webhook delivery logs
# Go to GitHub repository Settings → Webhooks → Recent Deliveries

Resolution:

  1. Verify Ingress is accessible from GitHub
  2. Verify webhook secret matches GitHub configuration
  3. Verify EventListener is running
  4. Check GitHub webhook delivery logs for errors
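For point 2, the secret can be checked end-to-end by recomputing the X-Hub-Signature-256 header GitHub attaches to each delivery. The secret and payload below are placeholders:

```shell
# Recompute the signature GitHub puts in the X-Hub-Signature-256 header
# and compare it with the value shown under Recent Deliveries.
# Both values below are placeholders.
secret='replace-with-webhook-secret'
payload='{"ref":"refs/heads/main"}'   # must be the raw request body, byte-for-byte

sig="sha256=$(printf '%s' "$payload" \
  | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')"
echo "$sig"   # prints sha256=<64 hex chars>
```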

ArgoCD Not Syncing

Symptoms: Changes committed to Git but ArgoCD not syncing.

Diagnosis:

# Check Application sync status
kubectl get application platform-root -n argocd

# Check ArgoCD controller logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=100

# Check Git repository connectivity
kubectl exec -n argocd -it <argocd-repo-server-pod> -- git ls-remote https://github.com/bdchatham/AphexPlatformInfrastructure

Resolution:

  1. Verify ArgoCD can access Git repository
  2. Verify sync policy is configured (automated sync enabled)
  3. Manually trigger sync if needed
  4. Check for manifest errors in Git

Onboarding Controller Not Reconciling

Symptoms: RepoBinding created but status remains "Pending".

Diagnosis:

# Check controller logs
kubectl logs -n platform-system -l app=platform-controller --tail=100

# Check controller pod status
kubectl get pods -n platform-system -l app=platform-controller

# Check RepoBinding status
kubectl describe repobinding <name> -n platform-system

Resolution:

  1. Verify controller is running
  2. Check controller logs for errors
  3. Verify controller has RBAC permissions
  4. Restart controller if needed: kubectl rollout restart deployment platform-controller -n platform-system

Source

  • platform/platform-controller/controller/
  • platform/argocd/apps/
  • platform/tenancy/templates/

Maintenance

Update Tekton

Tekton is managed by ArgoCD after bootstrap. To update Tekton versions:

# Update Tekton versions in kustomization
vi platform/tekton/kustomization.yaml

# Update resource URLs to new versions
# Example: Update to Tekton Pipelines v0.66.0
resources:
  - https://github.com/tektoncd/pipeline/releases/download/v0.66.0/release.yaml
  - https://github.com/tektoncd/triggers/releases/download/v0.30.0/release.yaml
  - https://github.com/tektoncd/triggers/releases/download/v0.30.0/interceptors.yaml

# Commit changes
git add platform/tekton/kustomization.yaml
git commit -m "Update Tekton to v0.66.0"
git push

# ArgoCD will automatically sync and update Tekton
# Watch sync status
kubectl get application platform-tekton -n argocd -w

# Verify updates
kubectl get pods -n tekton-pipelines

Note: The bootstrap script installs Tekton initially, but ArgoCD manages updates from Git. This enables GitOps-based Tekton upgrades.
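The version bump itself is a one-line edit. A sketch using GNU sed against a scratch copy (pin names and version numbers follow the example above and are illustrative):

```shell
# Scratch copy of the kustomization with the old pin (versions here are
# illustrative, not the real file contents).
cat <<'EOF' > /tmp/kustomization.yaml
resources:
  - https://github.com/tektoncd/pipeline/releases/download/v0.65.0/release.yaml
EOF

# Rewrite the Pipelines pin to v0.66.0 in place (GNU sed; on macOS use -i '').
sed -i 's#pipeline/releases/download/v[0-9.]*#pipeline/releases/download/v0.66.0#' \
  /tmp/kustomization.yaml
grep pipeline /tmp/kustomization.yaml
# →   - https://github.com/tektoncd/pipeline/releases/download/v0.66.0/release.yaml
```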

Update ArgoCD

# Update ArgoCD version in bootstrap script
vi platform/bootstrap/bootstrap.sh

# Update version URL
# ArgoCD: https://raw.githubusercontent.com/argoproj/argo-cd/v2.9.3/manifests/install.yaml

# Commit changes
git add platform/bootstrap/bootstrap.sh
git commit -m "Update ArgoCD to v2.9.3"
git push

# ArgoCD will NOT automatically update itself (installed by bootstrap)
# To update ArgoCD, manually apply new manifests:
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.9.3/manifests/install.yaml

Update Onboarding Controller

# Build new version
cd platform/platform-controller/controller
docker build -t your-registry/platform-controller:v1.1.0 .
docker push your-registry/platform-controller:v1.1.0

# Update deployment manifest
vi platform/platform-controller/controller-deployment.yaml
# Update image tag to v1.1.0

# Commit changes
git add platform/platform-controller/controller-deployment.yaml
git commit -m "Update platform controller to v1.1.0"
git push

# ArgoCD will automatically sync and update the controller
# Watch sync status
kubectl get application platform-controllers -n argocd -w

Update Pipeline Catalog

# Update catalog resources
vi platform/catalog/tasks/cdktf-deploy.yaml
# Make changes

# Commit changes
git add platform/catalog/
git commit -m "Update CDKTF deploy task"
git push

# ArgoCD will automatically sync and update the catalog
# Watch sync status
kubectl get application platform-catalog -n argocd -w

# Verify updates
kubectl get tasks -n platform-system

Clean Up Old PipelineRuns

# List old PipelineRuns
kubectl get pipelineruns --all-namespaces --sort-by=.metadata.creationTimestamp

# Delete PipelineRuns older than 30 days
kubectl get pipelineruns --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.creationTimestamp < "'$(date -d '30 days ago' -Iseconds)'") | "\(.metadata.namespace) \(.metadata.name)"' | \
  xargs -n2 kubectl delete pipelinerun -n
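The age filter can be dry-run first, printing what would be deleted before anything is piped into kubectl. The PipelineRun list below is a made-up sample (note that `date -d` is GNU date; on macOS use `date -v-30d -Iseconds`):

```shell
# Hypothetical PipelineRun list; real input comes from
# `kubectl get pipelineruns --all-namespaces -o json`.
cat <<'EOF' > /tmp/prs.json
{"items":[
  {"metadata":{"namespace":"my-pipeline","name":"run-old",
               "creationTimestamp":"2020-01-01T00:00:00Z"}},
  {"metadata":{"namespace":"my-pipeline","name":"run-new",
               "creationTimestamp":"2099-01-01T00:00:00Z"}}
]}
EOF

cutoff=$(date -d '30 days ago' -Iseconds)   # GNU date
# Dry run: print "<namespace> <name>" for runs older than the cutoff
# instead of deleting them.
jq -r --arg cutoff "$cutoff" \
  '.items[] | select(.metadata.creationTimestamp < $cutoff)
   | "\(.metadata.namespace) \(.metadata.name)"' /tmp/prs.json
# → my-pipeline run-old
```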

Source

  • platform/bootstrap/bootstrap.sh
  • platform/platform-controller/controller-deployment.yaml
  • platform/catalog/

Security

Rotate Webhook Secrets

Webhook secrets are generated by the Onboarding Controller and stored in organization namespaces. To rotate:

# Delete existing secret in organization namespace
kubectl delete secret github-webhook-secret -n org-<organization-name>

# Delete and recreate Organization to regenerate secret
kubectl delete organization <organization-name> -n platform-system
kubectl apply -f organization.yaml

# Get new webhook secret from Organization status
kubectl get organization <organization-name> -n platform-system -o yaml

# Update GitHub webhook with new secret

Audit RBAC Permissions

# List all Roles in pipeline namespaces
kubectl get roles --all-namespaces | grep -v "kube-"

# Review specific Role in pipeline namespace
kubectl get role pipeline-runner -n <pipeline-namespace> -o yaml

# Check what a service account can do
kubectl auth can-i --list --as=system:serviceaccount:<pipeline-namespace>:pipeline-runner -n <pipeline-namespace>

Review Network Policies

# List all NetworkPolicies
kubectl get networkpolicies --all-namespaces

# Review specific NetworkPolicy in pipeline namespace
kubectl get networkpolicy pipeline-isolation -n <pipeline-namespace> -o yaml

# Test network connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -n <pipeline-namespace> -- wget -O- http://<service>.<other-namespace>.svc.cluster.local

Source

  • platform/platform-controller/controller/
  • platform/tenancy/templates/

Runbooks

Bootstrap Runbook

Prerequisites:

  • Kubernetes cluster (1.24+) with RBAC enabled
  • kubectl configured with cluster admin access
  • GitHub organization with admin access

Bootstrap Steps:

  1. Clone Repository:

    git clone https://github.com/bdchatham/AphexPlatformInfrastructure.git
    cd AphexPlatformInfrastructure
  2. Run Bootstrap:

    cd platform/bootstrap
    ./bootstrap.sh --cluster-name aphex-platform --repo-url https://github.com/bdchatham/AphexPlatformInfrastructure
  3. Verify Bootstrap:

    # Check all components
    kubectl get pods -n argocd
    kubectl get pods -n tekton-pipelines
    kubectl get pods -n platform-system
    
    # Check ArgoCD Applications
    kubectl get application -n argocd
  4. Access ArgoCD UI:

    # Get admin password
    kubectl get secret argocd-initial-admin-secret -n argocd -o jsonpath='{.data.password}' | base64 -d
    
    # Port-forward
    kubectl port-forward svc/argocd-server -n argocd 8080:443
    
    # Access at https://localhost:8080 (accept the self-signed certificate)
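The jsonpath-plus-base64 decode in step 4 can be sanity-checked offline with a made-up value:

```shell
# Round-trip a made-up password through base64, mirroring how the admin
# password is stored in the argocd-initial-admin-secret Secret.
encoded=$(printf '%s' 'hunter2' | base64)
printf '%s\n' "$encoded" | base64 -d
# → hunter2
```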

Source

  • platform/bootstrap/bootstrap.sh

Onboarding Runbook

Prerequisites:

  • Platform bootstrapped and running
  • kubectl configured with cluster access
  • GitHub repository ready for onboarding

Onboarding Steps:

  1. Create RepoBinding:

    kubectl apply -f - <<EOF
    apiVersion: aphex.io/v1alpha1
    kind: RepoBinding
    metadata:
      name: my-repo-binding
      namespace: platform-system
    spec:
      aphexOrg: "my-org"
      repoOrg: "your-github-org"
      repoName: "your-repo"
      pipelineName: "my-pipeline"
      templateRef: "run-pipeline-v1"
    EOF
  2. Verify Onboarding:

    # Check RepoBinding status
    kubectl get repobinding my-repo-binding -n platform-system
    
    # Verify organization namespace
    kubectl get namespace org-my-org
    
    # Verify pipeline namespace
    kubectl get namespace my-pipeline
    
    # Verify all resources in organization namespace
    kubectl get all -n org-my-org
  3. Configure GitHub Webhook:

    # Get webhook URL from Organization status
    kubectl get organization my-org -n platform-system -o jsonpath='{.status.webhookURL}'
    
    # Get webhook secret from organization namespace
    kubectl get secret github-webhook-secret -n org-my-org -o jsonpath='{.data.secret}' | base64 -d
    
    # Configure in GitHub repository Settings → Webhooks
  4. Test Pipeline:

    # Merge a commit to main branch
    # Watch for PipelineRun creation in pipeline namespace
    kubectl get pipelineruns -n my-pipeline -w
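Step 1's manifest can be sanity-checked offline before applying, confirming each spec field the example uses is present. Whether the controller requires all of these fields is an assumption:

```shell
# Write the manifest from step 1 to a scratch file, then confirm each
# spec field from the example is present before applying.
cat <<'EOF' > /tmp/repobinding.yaml
apiVersion: aphex.io/v1alpha1
kind: RepoBinding
metadata:
  name: my-repo-binding
  namespace: platform-system
spec:
  aphexOrg: "my-org"
  repoOrg: "your-github-org"
  repoName: "your-repo"
  pipelineName: "my-pipeline"
  templateRef: "run-pipeline-v1"
EOF

for field in aphexOrg repoOrg repoName pipelineName templateRef; do
  grep -q "^  $field:" /tmp/repobinding.yaml \
    && echo "$field: present" || echo "$field: MISSING"
done
```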

Source

  • platform/crds/repobinding-crd.yaml
  • platform/platform-controller/controller/

Source

  • .kiro/specs/argocd-tekton-platform/design.md
  • .kiro/specs/argocd-tekton-platform/requirements.md
  • platform/bootstrap/bootstrap.sh
  • platform/argocd/apps/
  • platform/platform-controller/controller/
  • platform/catalog/
  • platform/tenancy/templates/