Email rackctl@gmail.com with subject [security][incident-response]. Do not open public issues for security reports.
Acknowledgement target: within 72 hours. Triage target: within 5 business days.
incident-response is an incident-commander assistant: a Grafana OnCall webhook fans a P1 into a
Slack war room, and the IC drives the response through /incident-response slash commands. It handles
incident metadata, responder identities, and the IC's conversation with the model, and it can
publish customer-facing Statuspage updates — so its defining controls are that the webhook
ingress only trusts requests it can cryptographically verify, no customer-facing Statuspage
publish happens without a recorded human approval, and the IC↔AI conversation never leaks to
inference logs or third parties.
- Every Grafana OnCall webhook is verified with HMAC-SHA256 over the raw request body before
anything is parsed or persisted (`src/handlers/webhook-ingress.ts`). The comparison uses
`crypto.timingSafeEqual`, so signature checking is constant-time and doesn't leak the expected
digest byte-by-byte.
- The signing secret is read from AWS Secrets Manager and cached, keyed on the Secrets Manager
`VersionId`, with a 5-minute TTL. On a verification failure the cache force-refreshes once and
retries the check, so a secret rotation mid-flight recovers on the next request instead of
failing every webhook until the TTL expires, and rotating the HMAC secret never needs a pod
redeploy.
- A request that fails verification is rejected at the boundary (`401`); it is never written to
DynamoDB and never enqueued to SQS.
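
The verification flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/handlers/webhook-ingress.ts`: the `VersionedSecret` shape, `createVerifier`, `signatureMatches`, the hex signature encoding, and the injected `fetchSecret` function (standing in for a Secrets Manager read) are all assumptions made for the example, and the cache holds only the most recently fetched version.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical shape: stands in for a Secrets Manager read result.
interface VersionedSecret {
  versionId: string;
  value: string;
}
type SecretFetcher = () => Promise<VersionedSecret>;

const TTL_MS = 5 * 60 * 1000; // 5-minute TTL, as stated in the list above

// Constant-time comparison of the provided signature against the expected
// HMAC-SHA256 digest of the raw body. Hex encoding is an assumption.
function signatureMatches(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const provided = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws if lengths differ, so reject those up front.
  if (provided.length !== expected.length) return false;
  return timingSafeEqual(provided, expected);
}

function createVerifier(fetchSecret: SecretFetcher, now: () => number = Date.now) {
  let cached: { secret: VersionedSecret; fetchedAt: number } | undefined;

  async function getSecret(forceRefresh: boolean): Promise<VersionedSecret> {
    if (!forceRefresh && cached !== undefined && now() - cached.fetchedAt < TTL_MS) {
      return cached.secret;
    }
    const secret = await fetchSecret();
    cached = { secret, fetchedAt: now() };
    return secret;
  }

  // Returns true only for a verified request; the caller would answer 401
  // on false, before parsing or persisting anything.
  return async function verify(rawBody: Buffer, signatureHex: string): Promise<boolean> {
    const current = await getSecret(false);
    if (signatureMatches(rawBody, signatureHex, current.value)) return true;
    // One forced refresh and one retry: covers a rotation that happened
    // while the previous secret version was still cached.
    const refreshed = await getSecret(true);
    return signatureMatches(rawBody, signatureHex, refreshed.value);
  };
}
```

In this sketch the handler would call `verify(rawBody, signatureHeader)` on the unparsed request bytes and return `401` on `false`, so nothing reaches DynamoDB or SQS for an unverified request.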
- A customer-facing Statuspage incident is only ever created through
`src/services/statuspage-approval-gate.ts`, which is the single call site of
`StatuspageClient.createIncident()` anywhere in the codebase.
- The gate is a two-phase commit: it writes a `STATUSPAGE_DRAFT_APPROVED` audit event, then
re-reads the audit log with `ConsistentRead: true`, and only on a confirmed read does it call
`createIncident()`. If the audit write or the consistent re-read fails, the publish never happens
and the gate throws `AutoPublishNotPermittedError`. There is no auto-publish path.
- This is enforced two ways so it can't silently regress: a CI grep-gate fails the build if
`createIncident()` appears anywhere outside the gate file, and the gate carries 100% branch
coverage (alongside `src/utils/audit.ts`); CI goes red if a branch drops.
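
The write, consistent re-read, then publish ordering can be illustrated with a small sketch. It is written under assumptions and is not the code in `src/services/statuspage-approval-gate.ts`: the `AuditLog` and `StatuspagePublisher` interfaces, the `publishWithApproval` function, and the event fields are hypothetical stand-ins for the DynamoDB audit store and `StatuspageClient`. Only the event name, the `ConsistentRead` semantics, and the error name come from the description above.

```typescript
// Hypothetical minimal interfaces for the example.
interface AuditEvent {
  incidentId: string;
  type: string;
  actor: string;
}

interface AuditLog {
  write(event: AuditEvent): Promise<void>;
  // consistentRead mirrors DynamoDB's ConsistentRead flag.
  read(incidentId: string, opts: { consistentRead: boolean }): Promise<AuditEvent[]>;
}

interface StatuspagePublisher {
  createIncident(incidentId: string, body: string): Promise<string>;
}

class AutoPublishNotPermittedError extends Error {
  constructor(reason: string) {
    super(`Statuspage publish blocked: ${reason}`);
    this.name = "AutoPublishNotPermittedError";
  }
}

async function publishWithApproval(
  audit: AuditLog,
  statuspage: StatuspagePublisher,
  incidentId: string,
  approver: string,
  body: string,
): Promise<string> {
  let confirmed = false;
  try {
    // Phase 1: record the approval.
    await audit.write({ incidentId, type: "STATUSPAGE_DRAFT_APPROVED", actor: approver });
    // Phase 2: confirm the record is durably visible before publishing.
    const events = await audit.read(incidentId, { consistentRead: true });
    confirmed = events.some(
      (e) => e.type === "STATUSPAGE_DRAFT_APPROVED" && e.actor === approver,
    );
  } catch (err) {
    throw new AutoPublishNotPermittedError(`audit step failed: ${String(err)}`);
  }
  if (!confirmed) {
    throw new AutoPublishNotPermittedError("approval not found on consistent re-read");
  }
  // Reached only after a confirmed approval record.
  return statuspage.createIncident(incidentId, body);
}
```

The design point the sketch tries to show is that `createIncident()` sits behind both audit steps, so a failed write or a failed re-read surfaces as an error rather than as an unrecorded publish.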
- `stripPII` runs before every Bedrock call (`src/ai/incident-response-ai.ts`), so responder
names, contact details, and other sensitive strings in incident context are scrubbed out of the
prompt before drafts or postmortem sections are generated.
- Bedrock invocation logging is set to NONE for the account, so the IC↔AI conversation (the
model request and response bodies) never lands in CloudWatch. This is an account-level control
owned by the `landing-zone` substrate, not app code. The app relies on it being in place; it does
not (and should not) try to set it from the tenant.
- Inference runs on-account via Amazon Bedrock; incident content is not sent to third parties.
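
As a rough picture of what scrubbing a prompt before inference can look like, here is a minimal sketch. The source does not describe how `stripPII` is implemented, so everything below is an assumption: the signature, the regular expressions, the placeholder tokens, and the idea of passing known responder names in explicitly are illustrative only.

```typescript
// Illustrative patterns only; the real stripPII rules are not documented here.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

// Escape regex metacharacters so a responder name is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical signature: replaces emails, phone-like numbers, and any known
// responder names with placeholder tokens before text is sent to the model.
function stripPII(text: string, responderNames: string[] = []): string {
  let out = text.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
  for (const name of responderNames) {
    if (name.trim() === "") continue; // skip empty names
    const pattern = new RegExp(`\\b${escapeRegExp(name)}\\b`, "gi");
    out = out.replace(pattern, "[RESPONDER]");
  }
  return out;
}

const scrubbed = stripPII(
  "Page Dana Lee at dana.lee@example.com or +1 (555) 010-9999 about db-7",
  ["Dana Lee"],
);
// scrubbed === "Page [RESPONDER] at [EMAIL] or [PHONE] about db-7"
```

Pattern-based scrubbing of this kind only catches what its patterns anticipate, which is one reason the account-level logging control described above matters as a separate layer.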
- No long-lived credentials in the app. Pods get AWS access via IRSA (Workload Identity); there
are no static keys anywhere in the repo or image. DynamoDB, SQS, Bedrock, EventBridge Scheduler,
and Secrets Manager calls all go through `AssumeRoleWithWebIdentity` into the landing-zone
`incident-response-platform` IRSA role.
- App-level secrets are projected at deploy time by External Secrets Operator from AWS Secrets
Manager (`incident-response/<env>/*`: the Grafana OnCall HMAC secret, app secrets, and Grafana
Cloud credentials) into a Kubernetes Secret consumed via `envFrom`. They are never committed.
- Default-deny `NetworkPolicy`: ingress is limited to ingress-nginx reaching the webhook
Deployment; egress is DNS plus HTTPS to AWS APIs and the Slack / Grafana / Linear / WorkOS /
Statuspage endpoints. IMDS is blocked.
- Public surface is limited to `/health` and the signed Grafana OnCall webhook POST behind
ingress-nginx + cert-manager TLS.
- Webhook authenticity is bounded by the secrecy of the HMAC signing secret. Anyone who can read
the `incident-response/<env>` HMAC secret can forge a P1; protection of the secret rests on
Secrets Manager access control and the IRSA-only posture.
- The Bedrock-logging-NONE guarantee is a substrate control. If the `landing-zone` account
configuration drifts (someone re-enables invocation logging out of band), IC↔AI conversations
could reach CloudWatch, and the app cannot detect or correct that on its own. Verifying it stays
NONE is a landing-zone responsibility.
- The approval gate trusts the actor that clicked approve in Slack. The gate proves that an
approval was recorded before a publish, not that the approver was authorized for that specific
incident; authorization is upstream in the Slack action bindings.
incident-response exposes the controls needed for SOC 2 Type II — IRSA-only access with no
static credentials, secrets sourced from AWS Secrets Manager (never committed), a constant-time
HMAC check at the only ingress, PII scrubbing before inference, inference logging disabled at the
account level, and a recorded human-approval gate as the sole path to any customer-facing
publish, backed by a complete per-incident audit trail in DynamoDB. Substrate-level controls
(CIS EKS baseline, Pod Security Standards, image signing, and the account-level Bedrock
invocation-logging=NONE setting) are enforced upstream by landing-zone and eks-gitops.