Skip to content

Sync impl with main (scores-diff #356, Okta kid #354, /login CloudFront #351)#361

Merged
danielbowne merged 23 commits into
implfrom
feature/impl-sync-20260623
Jun 23, 2026
Merged

Sync impl with main (scores-diff #356, Okta kid #354, /login CloudFront #351)#361
danielbowne merged 23 commits into
implfrom
feature/impl-sync-20260623

Conversation

@danielbowne

Copy link
Copy Markdown
Collaborator

Summary

Routine refresh of the impl branch from main so the experimental environment does not drift from production code. Merge (not rebase), per the impl-sync policy, because impl is a shared, deployed branch.

What main brought in

The bulk of main's recent history (OpDiv-scoped RBAC, dual-IdP scaffolding, pgxpool, OpenAPI autogeneration, the N+1 users-list fix) was already present on impl from the previous sync, so the net new content here is small:

Conflict resolution

  • Makefile - kept impl's per-environment port scaffolding (ZTMF_ENV detection, parametrized TEST_DB_PORT). Reverting to main's hardcoded ports would break side-by-side worktree runs.
  • fismasystems.go / scores.go / openapi.yaml - took main's canonical versions (main's FindFismaSystemByUUID adds an explicit LIMIT 1; scores-diff is additive). OpenAPI regenerated from annotations afterward, no drift.

Environment safety

Testing

Full local suite on the impl port set (unit, integration, E2E/Emberfall).

danielbowne and others added 23 commits May 27, 2026 11:27
## Summary

Adds a generic `public.system_enrichment` table that ztmf core owns and
a separate enrichment pipeline (private `ztmf-insights` stack) fills
with an opaque JSON payload, plus a read-only endpoint that returns it.
This is the core-side, **additive** half of decoupling the CMS-specific
enrichment pipeline from the open-source data-call app.

The two sides meet at one stable contract: the table. Adding or changing
enrichment fields becomes a payload-key change in the pipeline with no
ztmf migration. The core app works with or without an enrichment
pipeline attached (an unfilled table simply returns 404s).

## Changes

- **Migration 0038** creates `public.system_enrichment (fisma_uuid
VARCHAR(255) PK, payload jsonb NOT NULL, synced_at timestamptz)`.
`fisma_uuid` has no FK, matching the existing `cfacts_systems` precedent
(`fismasystems.fismauid` is `VARCHAR(255)` and not unique); the table is
a disposable cache the pipeline rewrites via TRUNCATE+INSERT.
- **`GET /api/v1/systemenrichment/{fisma_uuid}`** returns the payload
verbatim. Access control reuses the existing FISMA-system assignment
check: admins / read-only admins see any system, ISSOs only assigned
ones, missing rows return 404.
- `model.SystemEnrichment.Payload` is `json.RawMessage` so the payload
serializes back out as raw JSON, not base64.
- OpenAPI path + `SystemEnrichment` schema, Emberfall E2E coverage, and
one empire seed row.

## Scope / deferred

Additive only — **no cfacts files are touched**. Retiring the
enrichment-specific (CFACTS) read surface and dropping `cfacts_systems`
is deferred until the UI repoints to this endpoint and the old sync is
decommissioned, so this is safe to run alongside the current stack.
Tracked under the enrichment-extraction epic.

## Deploy note

On merge/deploy to dev, migration 0038 applies on API startup and
creates the table. This is intentionally sequenced ahead of the
enrichment pipeline's first run against dev.

## Test plan

- [x] `go build` / `go vet` / `staticcheck` clean
- [x] model unit tests pass (incl. payload-serializes-as-JSON guard)
- [x] `make dev-setup` applies 0038; live curl returns 200 (admin), 404
(no row), 200 (assigned ISSO), 403 (unassigned ISSO)
- [x] `make test-full` passes (unit + Emberfall E2E), Failed: 0

Refs #211
## Summary

Three small, independent fixes that together make `make full-stack-up`
work on Apple Silicon macOS out of the box. Each one addresses a
Mac-specific quirk; none change behavior on Linux.

- **`backend/Dockerfile`** — `FROM scratch` → `FROM
gcr.io/distroless/base-debian12`, and drop the hardcoded
`x86_64-linux-gnu` libc/ld COPY lines. Distroless ships the right
libc/ld/ca-certs for whichever arch the build runs on, so the build is
now arch-agnostic and runs native-fast on both arm64 and amd64.
- **`Makefile` (compose-dev.yml + dev.compose.env generators)** —
replace `network_mode: host` with explicit `ports:` mappings
(`54321:54321` for postgres, `$(API_PORT):$(API_PORT)` for api) and
change `DB_ENDPOINT=localhost` → `DB_ENDPOINT=postgre`. Docker Desktop
on macOS runs containers inside a Linux VM, where `network_mode: host`
binds to the VM, not the Mac, unless the experimental host-networking
toggle is enabled. Port mappings behave the same on every platform.
- **`Makefile` (top)** — add `SHELL := /bin/bash`. The default `/bin/sh`
on macOS is bash invoked in POSIX mode, where the `echo` builtin prints
`-n` literally instead of suppressing the trailing newline. That
corrupted every base64-encoded segment in `generate-jwt` and
`frontend-env`, producing tokens the API rejected with 401.

## Symptom before this PR (Apple Silicon)

```
=> ERROR [stage-1 4/7] COPY --from=builder /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 ...
=> ERROR [stage-1 5/7] COPY --from=builder /usr/lib/x86_64-linux-gnu/libc.so.6 ...
failed to compute cache key: ... "/usr/lib/x86_64-linux-gnu/libc.so.6": not found
```

After patching the Dockerfile, the build succeeds but `curl
localhost:8080` returns connection-refused (host networking issue), and
once port mappings are in place, the JWT from `make frontend-env`
decodes to `-n {"alg":"HS256"}\n...` and the API returns 401.

## Test plan

- [x] `make dev-down && rm backend/compose-dev.yml
backend/dev.compose.env`
- [x] `make full-stack-up` on Apple Silicon (arm64) — backend container
builds natively, both containers come up healthy
- [x] `curl -s -o /dev/null -w "%{http_code}\n" -H "Authorization:
$VITE_AUTH_TOKEN3" http://localhost:8080/api/v1/users/current` → `200`
with the seeded admin user
- [x] Frontend (`vite`) loads in browser and can hit the API
- [x] Verify on Linux that the same flow still works (echo behavior on
dash differs but `SHELL := /bin/bash` is widely available; port-mapping
change is platform-neutral)
…isc#213) (#324)

## Overview

This PR removes the cfacts enrichment and PostgreSQL-to-Snowflake
data-sync pipelines from the ztmf account. The same code and
infrastructure now lives in the private `CMS-Enterprise/ztmf-insights`
repo and has been running side-by-side in dev, impl, and prod for the
last few weeks. With the insights stack verified end to end, the
original copies in ztmf are no longer needed.

## Why the diff is large

This is the third and biggest commit of a multi-repo decoupling effort
tracked under epic ztmf-misc#190. The earlier steps were additive: the
insights repo was created and stood up, the new generic
`public.system_enrichment` seam was added to ztmf core (#323), and the
UI was repointed at the new endpoint (CMS-Enterprise/ztmf-ui#367). Those
landed first so nothing live depended on the old code at the moment this
PR removes it.

What ships here is the cleanup. Three Lambda functions, their IAM roles
and policies, their CloudWatch monitoring, their dead-letter queues, the
CFACTS CSV ingest S3 bucket, and the supporting Go packages all come out
in one PR. The diff reads as "minus four thousand lines" because that is
exactly what it is: an entire feature path being retired from this repo.

For context, the goal of the decoupling was to keep ztmf core generic
enough that other HHS OpDivs can adopt it without inheriting
CMS-specific plumbing. The cfacts pipeline reflects the CMS Security
Data Lake mapping, which is not portable, so it belongs in a CMS-only
repo. Splitting also lets enrichment schema iterate independently of
ztmf migrations: adding a new enrichment field is now a Snowflake view
payload-key change on the insights side with zero ztmf-side work.

## What this PR ships

**Removed from `backend/`:**

- `cmd/lambda/` (ztmf-data-sync: PostgreSQL to Snowflake export)
- `cmd/lambda-cfacts-snowflake/` (Snowflake CFACTS view to PostgreSQL
ingest)
- `cmd/lambda-cfacts-s3/` (CSV upload ingest path; never reached prod
usage)
- `internal/cfacts/` and `internal/export/` (model, CSV parser,
Snowflake client, Postgres exporter)
- Orphaned sync notification helpers in
`internal/notifications/slack.go` (`SyncResult`, `SendSyncNotification`,
`buildSyncMessage`, supporting formatters). The rotation and
cert-rotation notifications used by the surviving Lambdas are untouched.

**Removed from `infrastructure/`:**

- `lambda.tf`, `lambda-cfacts.tf` (the three Lambda function blocks plus
EventBridge schedules and permissions)
- `iam.tf`, `iam-cfacts.tf` (per-Lambda execution roles and policies)
- `monitoring.tf`, `monitoring-cfacts.tf` (log groups, error alarms,
duration alarms, dead-letter queues)
- `aws_s3_bucket.cfacts_sync` block from `s3.tf` (CFACTS CSV upload
bucket plus lifecycle and SSE configuration)
- `aws_secretsmanager_secret.ztmf_snowflake_dev` and
`.ztmf_snowflake_prod` from `secrets.tf` (the insights stack now owns
its own `ztmf-insights-snowflake` secret in each account, seeded and
verified per ztmf-misc#223)
- `enable_snowflake_sync` local in `locals.tf`
- `cfacts_snowflake_view` and `snowflake_table_prefix` variables in
`variables.tf`, plus the matching settings in
`tfvars/{dev,impl,prod}.tfvars`
- Data-sync-specific outputs from `outputs.tf` (`lambda_function_name`,
`lambda_function_arn`) and the SSM test-event parameter resource. The
NAT egress output is retained and its description updated for the
surviving source-system allowlist use cases (Kion).

**Updated docs:**

- `infrastructure/README.md`: "Data Synchronization" section rewritten
to point at the insights repo; "Infrastructure Organization" file list
updated.
- `backend/cmd/lambda-kion-key-rotate/README.md`: prior-art reference
repointed at `lambda-cert-rotation` (the only remaining Go Lambda using
the same pattern).

**Updated CI:**

- `.github/workflows/backend.yml`: the three deleted Lambda build,
upload, and update-function-code steps are removed. The workflow now
builds and deploys only `lambda-kion-key-rotate` and
`lambda-cert-rotation`.
- `Makefile`: `test-build` and `test-full` targets updated to match.

**Dependency cleanup:**

- `go mod tidy` removed the Snowflake driver and CSV-related transitive
dependencies that only the deleted Lambdas pulled in (44 lines from
`go.mod`, 120 from `go.sum`).

## What this PR deliberately leaves in place

- **`aws_security_group.ztmf_sync_lambda`** in `vpc.tf`. The name is now
historical, but the security group is still attached to
`lambda-kion-key-rotate` and `lambda-cert-rotation`. Renaming it
requires a `moved` block and is a follow-up so this PR can stay focused.
- **`cfacts_systems` table, the `/api/v1/cfactssystems` controller and
routes, the cfacts emberfall block, and the read-side model**. These are
the read surface that ztmf-ui#367 just stopped consuming. They get
retired in a separate PR tracked under ztmf-misc#214, which is gated on
this one merging and the insights stack writing the new
`system_enrichment` table cleanly in prod (already verified, 394 rows
present).
- **`backend/internal/notifications/slack.go`** core (the
`SlackNotifier` plumbing, the `RotationResult` and `CertRotationResult`
paths). Used by both surviving Lambdas. Only the sync-result code paths
are removed.

## Terraform blast radius

Verified via `terraform plan` against dev state: **0 to add, 0 to
change, 51 to destroy.** The destroys group as follows (dev numbers;
prod is the same shape minus the dev-only SSM test-event parameters):

| Resource type | Count | Notes |
|---|---:|---|
| `aws_lambda_function` | 3 | `ztmf-data-sync`,
`ztmf-cfacts-snowflake-sync`, `ztmf-cfacts-s3-sync` |
| `aws_lambda_permission` | 3 | EventBridge and S3 invoke permissions |
| `aws_cloudwatch_event_rule` | 2 | data-sync + cfacts-snowflake
schedules |
| `aws_cloudwatch_event_target` | 2 | matched targets |
| `aws_cloudwatch_log_group` | 3 | one per Lambda |
| `aws_cloudwatch_metric_alarm` | 4 | data-sync errors + duration,
cfacts-snowflake errors, cfacts-s3 errors |
| `aws_sqs_queue` | 2 | dead-letter queues |
| `aws_iam_role` | 2 | data-sync and cfacts execution roles |
| `aws_iam_policy` | 10 | logging, secrets, vpc, sqs, xray, s3 split
across both roles |
| `aws_iam_role_policy_attachment` | 10 | matched attachments |
| `aws_s3_bucket` and 4 supporting | 5 | `ztmf-cfacts-sync-${env}` plus
its public-access-block, versioning, SSE, lifecycle, and the S3
notification on it |
| `aws_ssm_parameter` (dev only) | 3 | `/ztmf/dev/lambda/test-events/*`
|
| `aws_secretsmanager_secret` | 1 per env | `ztmf_snowflake_dev` in dev,
`ztmf_snowflake_prod` in prod |

Plus two Terraform outputs (`lambda_function_name`,
`lambda_function_arn`) come out of the state.

Every destroyed resource maps to enrichment compute or its supporting
plumbing. No surviving resources (API, ALB, Aurora, CloudFront, ECS,
Kion rotator, cert-rotation, ALB OIDC trust provider, TLS or SMTP
secrets, Aurora master user secret, Slack webhook secret) appear in the
destroy list.

The unrelated pre-existing `managed_policy_arns` deprecation warning on
`module.github_actions[0].aws_iam_role.role` is shown by the plan but is
unaffected by this PR. It will be addressed when the GitHub Actions
module is updated.

The S3 buckets were pre-cleared before this PR opened.
`ztmf-cfacts-sync-dev` had nine versioned objects and delete markers
from February 2026 test runs; all were purged via `aws s3api
delete-objects`. `ztmf-cfacts-sync-prod` was already empty. The bucket
destroy will not be blocked.

## Validation done locally

- `terraform -chdir=infrastructure fmt -recursive -check`: clean.
- `tflint --chdir=infrastructure -f compact`: clean.
- `cd backend && go build ./...`: clean.
- `cd backend && go vet ./...`: clean.
- `cd backend && go test -short ./internal/notifications/...`: passes.
The remaining tests cover the cert-rotation Slack path.
- `aws s3api list-object-versions` on both `ztmf-cfacts-sync-*` buckets:
empty, including delete markers.
- `terraform plan -var-file=tfvars/dev.tfvars` against dev state: `0 to
add, 0 to change, 51 to destroy`. Per-resource list matches the
blast-radius table above with no surprises.

## Cross-repo context for reviewers

If you want to see how this fits together, the merge order across the
three repos was:

1. **insights repo** CMS-Enterprise/ztmf-insights#1: relocate the
Lambdas into the private repo. Merged, deployed to dev, impl, prod.
Side-by-side with the originals.
2. **ztmf repo** #323 (ztmf-misc#211): add migration
`0038systemenrichment.go` plus `GET
/api/v1/systemenrichment/{fisma_uuid}`. Merged, in prod.
3. **ztmf-ui repo** CMS-Enterprise/ztmf-ui#367 (ztmf-misc#212): repoint
the card at the new endpoint. Merged, deployed.
4. **insights repo** CMS-Enterprise/ztmf-insights#7 (ztmf-misc#223):
move the Snowflake credential into an insights-owned secret. Merged,
secret seeded in both accounts, lambda verified reading from the new
path.
5. **ztmf repo** this PR (ztmf-misc#213): remove the original compute,
IAM, monitoring, S3 ingest bucket, and Snowflake secret from the ztmf
account.

Follow-up still tracked under the epic:

- ztmf-misc#214: retire the `cfacts_systems` table and the
`/api/v1/cfactssystems` read surface. Gated on this PR merging.
- ztmf-misc#220: rename `CfactsRecordCard` to `SystemEnrichmentCard` in
ztmf-ui. Gated on ztmf-misc#214.
- ztmf-misc#224: migrate the Kion plumbing (`lambda-kion-key-rotate` +
Kion secret) into ztmf-insights when the planned Kion-logs-to-SDL ingest
scope lands.

## Reviewer checklist

- [x] The `terraform plan` against dev shows the destroys listed under
"Terraform blast radius" and nothing else.
- [x] No remaining references to deleted Go packages anywhere in
`backend/cmd/api/`.
- [x] CI workflow runs through Lambda build and deploy steps for kion
and cert-rotation only.
- [x] The retained CFACTS read surface still serves the ztmf-ui card for
any system whose `sdl_sync_enabled` flag is off (those systems still
read the legacy path until ztmf-misc#214 lands).
…14) (#326)

## Summary

Final step of the ZTMF Insights extraction tracked under ztmf-misc#190.
The legacy CFACTS read surface in the ztmf core API is removed and the
underlying `cfacts_systems` table is dropped. ztmf-ui#367 has been the
sole consumer of this surface and was repointed at the new
`system_enrichment` endpoint two weeks ago, with no fallback to the old
path. #324 removed the write side (lambdas + S3
ingest + Snowflake secrets) last week. This PR closes the loop on the
read side.

## Changes

- Add migration `0039cfactsdrop.go` that drops `public.cfacts_systems`.
The down migration recreates the table with the cumulative shape from
`0015` + `0021` + `0022` (table + two ALTER ADDs) plus the original
`COMMENT ON TABLE`, so a rollback restores the table faithfully.
- Remove the `GET /api/v1/cfactssystems` and `GET
/api/v1/cfactssystems/{fisma_uuid}` routes from `router.go`, delete
`controller/cfacts.go`, delete the `CfactsSystem` model and unit tests.
- Move the per-system access check out of the deleted model file and
into `internal/model/fismasystems.go` as
`UserCanAccessFismaSystemByUUID`. The query was always against
`users_fismasystems INNER JOIN fismasystems` and never touched
`cfacts_systems`, so the old name `UserCanAccessCfactsSystem` was
misleading; the SQL is byte-identical. The new `systemenrichment`
controller is the only caller and is updated to use the renamed
function.
- Strip the six `cfacts_systems` INSERTs and stale CFACTS comments from
`_test_data_empire.sql`, and delete the unused standalone
`_test_data_cfacts_empire.sql` fixture (it was not loaded by any compose
file or Makefile target).
- Drop the `cfactsSystemData` anchor and all CFACTS test cases from
`emberfall_tests.yml` (eight cases across admin, read-only admin, and
ISSO sections). The retained `system_enrichment` cases cover the same
access matrix and a new READONLY_ADMIN case was added so coverage of
that role's controller branch is preserved.
- Remove the two `/cfactssystems` paths and the `CfactsSystem` schema
from `backend/openapi.yaml`.

## Why this is safe to merge

- **Backend not in use.** Both `ztmf-ui` and `ztmf-insights` return zero
hits for `cfactssystems` or `cfacts_systems`. The card was repointed at
`/api/v1/systemenrichment/{fisma_uuid}` in ztmf-ui#367 and the section
is gated on `sdl_sync_enabled`, so systems without insights data hide
the card entirely rather than falling back.
- **No referential cascades.** `cfacts_systems` had no foreign keys in
or out by design (see the comment in migration
`0038systemenrichment.go`), so dropping the table is a clean operation.
- **Real-data migration check.** Verified locally against a postgres
volume that holds a real-shape dataset: the drop ran cleanly and `make
test-full` passed (91 emberfall cases, 0 failures, 0 skipped) against
the resulting schema.

## Test plan

- [x] `make test-unit` clean
- [x] `make test-full` clean against the local postgres volume
(real-data shape) — 91/0 emberfall, all integration paths green
- [x] `grep -rn cfacts backend/` shows only the immutable historical
migration files (`0015`, `0021`, `0022`, `0028` and `0038` comments
referencing the prior pipeline) and the opaque `cfacts` JSON key inside
one `system_enrichment` payload (owned by the Snowflake view, not this
table)
- [x] OpenAPI spec has no remaining `CfactsSystem` references or orphan
`$ref`s
- [x] External repos checked: `gh search code --repo
CMS-Enterprise/ztmf-ui cfactssystems` and `--repo
CMS-Enterprise/ztmf-insights cfactssystems` both return empty

## Follow-ups

Once this lands, ztmf-misc#220 (rename `CfactsRecordCard` to
`SystemEnrichmentCard` in ztmf-ui) is unblocked. After that, the only
remaining open sub-issue of the extraction epic is ztmf-misc#224
(migrate the Kion plumbing into the insights stack), which is explicitly
out of scope for this PR.

Refs ztmf-misc#214.
…ching (#327)

Closes #309

> **Note:** Rebased onto latest `main` after #326 landed (which retired
the legacy CFACTS read surface and deleted `cfactssystems.go`). The
original `cfactssystems.go` hunks were dropped as part of that rebase —
the case-insensitive matching now lives in
`UserCanAccessFismaSystemByUUID`, where the old
`UserCanAccessCfactsSystem` access check was relocated.

## Summary

- Relaxed `fismauid` validation in `FismaSystem.validate()` to require
only a non-empty string instead of a valid UUID. Some HHS OpDiv system
IDs (e.g. `CDC8767221`) are not GUIDs and were being rejected with a
400. This is the core #309 fix — the gate that blocks creating a
CDC-style system.
- Made the access-check `WHERE` clause in
`UserCanAccessFismaSystemByUUID` (`fismasystems.go`) case-insensitive
using `LOWER()`. Defensive future-proofing so non-GUID OpDiv IDs match
regardless of case once system enrichment opens beyond CMS.
- Updated `fismasystems_test.go`: replaced the `InvalidUUID` test case
with `ValidSystemNonUUID` (CDC-style ID should pass) and `EmptyFismaUID`
(empty string should still fail).

## Testing

Go unit tests pass (run in a `golang:1.25` container — `go test -short
./...`):
```
--- PASS: TestFismaSystem_Validate/ValidSystemUUID
--- PASS: TestFismaSystem_Validate/ValidSystemNonUUID
--- PASS: TestFismaSystem_Validate/InvalidEmail
--- PASS: TestFismaSystem_Validate/EmptyFismaUID
ok  github.com/CMS-Enterprise/ztmf/backend/internal/model  0.003s
```
Full `go test -short ./...` suite is green; `go build ./...` and `go vet
./internal/model/` clean.

> Snyk / secret-dependent CI checks show red because this branch is from
a fork (no repo secrets) — expected, maintainers re-run on their side.
## Summary

`make dev-setup` was not loading the Empire test data
(`_test_data_empire.sql`) into a fresh Postgres volume. A new clone, a
new machine, or any `docker compose down -v` came up with an empty
database (0 FISMA systems, 0 questions), which is what surfaced when
setting up the backend on a freshly migrated OS.

## Root cause

The seed step was gated on the database host string:

```go
if cfg.Db.PopulateSql != nil && cfg.Db.Host == "localhost" {
```

The dev API container reaches Postgres over the `postgre` Docker Compose
service name (`compose-dev.yml` overrides `DB_ENDPOINT=postgre` so the
container can talk to Postgres over the internal Docker network). The
`localhost` value in `dev.compose.env` only applies to host-side
tooling. So inside the container the check evaluated `postgre ==
"localhost"`, was false, and seeding was silently skipped.

This was masked on existing machines two ways, which is why it did not
show up for other developers:
- `backend/compose-dev.yml` was only regenerated when missing, so
existing checkouts kept an older on-disk compose file and never picked
up the `postgre` override.
- The `backend_postgres_data` volume persists across restarts, so
databases seeded under the older configuration stayed seeded.

A fresh volume on current code has neither of those, so the seed never
ran.

## Fix

Two changes, addressing both halves of the root cause:

1. **Gate seeding on `ENVIRONMENT`, not the database host.**
`ENVIRONMENT` is host- and platform-agnostic: `local` for dev
(`dev.compose.env`), `test` for the Emberfall E2E stack
(`compose-test.yml`), and it defaults to `production` everywhere else.

   ```go
   if cfg.Db.PopulateSql != nil && cfg.IsLocalOrTest() {
   ```

2. **Make the `backend/compose-dev.yml` generation target PHONY** so it
regenerates on every `make dev-setup` / `dev-up`. This stops developers
from running a stale on-disk compose file that predates the
`DB_ENDPOINT=postgre` override, which was the second reason fresh-volume
seeding silently broke.

### Environment-gate helpers

The environment checks are now centralized as two `config` helpers,
replacing scattered string comparisons:

- `config.IsLocal()` -> `ENVIRONMENT == "local"`
- `config.IsLocalOrTest()` -> `ENVIRONMENT == "local"` or `"test"`

These are deliberately distinct. Seeding uses `IsLocalOrTest()` (safe in
local dev and the E2E test stack). Just-in-time ADMIN user creation in
the auth middleware uses `IsLocal()` (local dev only, never the E2E test
stack). Keeping them separate preserves that asymmetry, and a new unit
test (`TestEnvironmentGates`) locks the contract across
local/test/production/dev/impl/prod/empty.

## Why this is safe for deployed environments

Seeding requires two independent conditions, and deployed environments
fail both:
- Deployed dev, impl, and prod resolve `ENVIRONMENT` to `dev`, `impl`,
and `prod` respectively (see `infrastructure/ecs.tf` and the tfvars),
none of which match `local` or `test`.
- Deployed task definitions never set `DB_POPULATE`, so
`cfg.Db.PopulateSql` is nil and the populate call cannot run regardless.

The allowlist of environment names is deliberate: an unrecognized
environment fails closed (no seeding) rather than open.

## Why this does not break macOS developers

The fix does not touch the compose networking, port mappings, or the
`DB_ENDPOINT=postgre` override introduced for Apple Silicon. macOS dev
uses the same `ENVIRONMENT=local`, so it seeds a fresh volume the same
way Linux does. The PHONY change only ever regenerates `compose-dev.yml`
to its current correct form (the file is gitignored and marked "DO NOT
EDIT - Managed by Makefile"), and existing seeded volumes are
unaffected. The auth-middleware refactor is behavior-preserving: JIT
ADMIN creation still fires in local dev only.

## Dependency security fix

The PR build's Snyk container scan failed the pipeline on a critical
vulnerability in a transitive dependency, fixed here to keep the branch
green:

- **golang.org/x/net** bumped from v0.47.0 to v0.55.0. v0.47.0 carries a
critical Improper Authentication vulnerability in
`golang.org/x/net/idna` (SNYK-GOLANG-GOLANGORGXNETIDNA-17116876), first
fixed in 0.54.0.
- `go mod tidy` also advanced the related modules it pulls:
`golang.org/x/crypto` to v0.51.0 and `golang.org/x/text` to v0.37.0.

`x/net` is an indirect dependency, so this is a `go.mod` / `go.sum`
change only with no source edits.

## Testing

- Fresh volume on Linux: `docker compose down -v` then `make dev-setup`
now logs `Populating database with /app/_test_data_empire.sql` and the
database comes up with 9 users, 4 FISMA systems, 18 questions, 30
scores. No manual psql load needed.
- Confirmed `compose-dev.yml` now regenerates on every run (PHONY),
overwriting a deliberately corrupted copy.
- New `TestEnvironmentGates` unit test passes.
- `make test-full` (unit + Emberfall E2E on the isolated test stack with
`ENVIRONMENT=test`) passes, confirming the E2E seed path is preserved
and the middleware refactor is behavior-preserving.
## Summary

Snyk started failing the prod orchestration on two high-severity
vulnerabilities in the Go standard library at 1.25.10:

- `std/mime`
- `std/crypto/x509`

Both are fixed in Go 1.25.11. The advisories surfaced after #329 had
already passed its PR checks, so the same code that was green on the PR
now blocks the prod pipeline at the analysis stage. Until this lands,
every merge to main fails Snyk and nothing deploys.

## Change

- `backend/go.mod`: `go 1.25.10` -> `1.25.11`
- `backend/Dockerfile`: builder image `golang:1.25.10-bookworm` ->
`golang:1.25.11-bookworm`

No code changes. Builds clean locally and the host unit tests pass.

## Note

The prod run failed at the analysis stage before the deploy step, so
prod was not affected and is still running the prior image. This
unblocks the pipeline so #329 (and subsequent work) can deploy.
Re-homes #330 (originally authored by @voidspooks) onto an org branch so
the full CI pipeline runs. Fork PRs do not get the org secrets Snyk and
the build/deploy steps need, so the change could not be validated or
shipped from the fork.

Same content as #330, plus a merge of current main, which brings in the
seeding fix (#329) and the Go 1.25.11 bump (#331). The one overlap with
main was in `auth/middleware.go` (the JIT user-creation gate); resolved
to keep main's `cfg.IsLocal()` helper and this branch's `OWNER` role.

## What this does

Stage D of the role cleanup: removes the legacy `ADMIN` and
`READONLY_ADMIN` role values. A prior migration (`0036`) already
converted existing rows (`ADMIN -> OWNER`, `READONLY_ADMIN ->
HHS_READONLY_ADMIN`); this stops the app from accepting or emitting the
legacy values and adds a DB guard (`0040`, a CHECK constraint rejecting
the legacy strings).

## Validation

- Migration tested against a copy of prod data: zero legacy rows,
constraint applies clean, API starts, and a legacy write is rejected as
expected.
- Authorization routes through the role helpers; every legacy tier maps
to its replacement, so no lockout or access gap.
- Allowed role set is consistent across `validations.go`, the DB
constraint, and both OpenAPI enums; both seed files use only the new
values.
- Emberfall E2E passes locally (91/91).

Closes #330.

Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…#333)

## Summary

Adds authorization enforcement for the multi-OpDiv role model on top of
the
schema that already shipped (opdivs reference table,
fismasystems.opdiv_id,
users_opdivs junction, the new role enum). Before this branch, reads
were only
partially scoped and write paths were not scoped at all: an OPDIV_ADMIN
could
edit or delete systems in any OpDiv, score any system, and manage any
user, and
there was no endpoint to create an OpDiv or grant a user an OpDiv. This
branch
closes those gaps so a scoped admin manages only their own OpDiv, while
OWNER/HHS tiers retain full cross-OpDiv access.

## What changed

**OpDiv administration (OWNER-only)**
- `POST /api/v1/opdivs`, `PUT /api/v1/opdivs/{opdiv_id}` to create /
update / deactivate an OpDiv.

**OpDiv grants (user membership)**
- `GET/POST/DELETE /api/v1/users/{userid}/assignedopdivs[/{opdiv_id}]`
to list, grant, and revoke OpDiv membership. An OPDIV_ADMIN may only
grant/revoke an OpDiv they themselves hold, and only onto a user within
their assignable tier.
- A user's `identity_provider` is derived from their OpDiv on
grant/revoke (CMS to okta, otherwise entra) and is only overridable by
an OWNER.

**Write-path scope**
- Editing, decommissioning, reactivating, and assigning a user to a
system, scoring a system, and marking a data call complete are all gated
so an OPDIV_ADMIN is limited to systems in an OpDiv they hold.
OWNER/HHS_ADMIN are unrestricted; ISSO/ISSM keep their per-system
assignment path.
- Role-escalation guard on user create/update: an OPDIV_ADMIN cannot
mint or modify a user above the OpDiv tier, and cannot manage a user
outside their OpDiv.

**Read-path scope (fail-closed)**
- User, score, and data-call-completion lists are scoped to the caller's
OpDivs for OPDIV tiers. A scoped admin with zero grants matches nothing
rather than falling through to an unscoped read.
- The audit trail (`GET /events`) and mass email (`POST /massemails`)
are restricted to the unscoped admin tiers, since neither has a
per-OpDiv scope to filter on.

**Migration**
- `0041` adds an idempotent index on `fismasystems.opdiv_id` to back the
scope predicates. No data change.

**Docs and tests**
- OpenAPI updated for the new endpoints and the scoped / restricted
behavior.
- Unit tests for the role helpers, the fail-closed SQL predicate, and
the controller authorization gates; Emberfall E2E negative matrix
asserting cross-OpDiv writes return 403 and scoped reads exclude
out-of-OpDiv rows.

## Related issues

- Closes #313 (Backend: surface OpDiv /
identity_provider fields and add scope predicates), and additionally
delivers three items that issue had deferred (OpDiv scope on
scores/events/datacalls, grant-management endpoints, and write-path
scope beyond create) because the authorization layer must be complete
before the dual-IdP work admits OpDiv-scoped users.
- Part of epic CMS-Enterprise/ztmf-misc#198 (HHS multi-OpDiv expansion).
- Builds on the database work in #311 / #312 and the role-value cleanup
in #314, all already on `main`.

## Safety and rollout

A no-op for current production: no OPDIV_ADMIN users exist yet, and the
OWNER / HHS tiers short-circuit every new check. The only migration is
an index
(idempotent up / reversible down). Rollback is a redeploy of the prior
image.
Per team policy this branch is not deployed to a shared environment
before merge.

## Testing

`make test-full` green: unit + isolated integration + 111 Emberfall E2E.

### How to verify

- The OpDiv negative matrix lives at the bottom of
`backend/emberfall_tests.yml`: cross-OpDiv writes return 403, scoped
reads exclude out-of-OpDiv rows, and `/events` + mass email are gated to
unscoped admins.
- To poke it by hand, the seed adds an `OPDIV_ADMIN` (EMPIRE grant only)
plus REBELLION systems 1005/1006 (token anchor `opDivAdminHeaders` in
the test file). `PUT /fismasystems/1005` or `POST /scores` for 1006
returns 403; EMPIRE systems 1001-1004 succeed.
- The scope logic is centralized in `backend/internal/model/users.go`
(`CanManageFismaSystem`, `CanManageUser`, `EffectiveOpDivScope`) and
exercised in
`backend/cmd/api/internal/controller/rbac_enforcement_test.go`, so the
role/scope matrix is reviewable in one place.

## Sequencing

Lands before the dual-IdP branch, which rebases on top once this merges.
A
paired ztmf-ui PR adds the matching UI gates (hide actions and views a
scoped
admin would be 403'd on); the two land together so no admin loses UI in
between.
…ret rotation (#337)

Re-homes #335 (originally authored by @voidspooks) onto an org branch so
the full CI pipeline runs, and merges current main. Fork PRs do not get
the org secrets Snyk and the build/deploy steps need.

## What this fixes (ztmf#328)

The API caches the RDS-managed DB secret at startup. RDS rotates it on a
~7-day schedule; afterward the cached password is stale, the next
connection fails SASL auth (28P01), and the old code `log.Fatal`'d on
28P01 - so every rotation took the process down until ECS restarted it.

Cameron's change:
- `db/conn.go`: `Conn()` retries once on a 28P01 auth error when a
secret is configured, re-fetching the secret first (`refreshDbCreds` ->
`Secret.Refresh`), so a routine rotation is ridden out in place. Env-var
(local) credentials are not retried since there is nothing to re-fetch.
- `model/errors.go`: the 28P01 case returns `ErrDbConnection` instead of
`log.Fatal`, so a genuinely-wrong password surfaces as a normal
connection error rather than killing the process.
- Adds unit tests for the error classification (`isAuthError`,
`trapError`), including wrapped-error cases.

## Added during re-home: synchronize the cached secret

Making the refresh reactive on the per-request path means a rotation can
have many goroutines reading (`Value`/`Unmarshal`) and writing
(`Refresh`) the cached `Secret` at once, and those fields had no
synchronization (a read race already existed in `Value`'s
proactive-refresh path; the reactive trigger widens it). Added a mutex
on `Secret` guarding `secret`/`metadata` across the fetch RPC, so a
refresh storm serializes instead of racing the pointers.
`Value`/`Refresh` lock and delegate to an unexported `refreshLocked` to
avoid re-locking.

## Testing

- `go build`, `go vet`, and the touched-package unit tests pass under
`-race`.
- Error-classification regression is covered; the live refresh-and-retry
flow and the concurrent-refresh path are not yet unit-tested (they need
an injectable Secrets Manager / connect seam) - tracked as a follow-up
under ztmf#328.

## Note

A single retry rides out the modeled failure. Right at a rotation
boundary a refreshed secret can still briefly mismatch the cluster, so
expect the occasional single failed request at rotation time - the
process stays up, which is the goal.

Closes #335.

---------

Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…rail (#338)

Gives a fresh local dev environment realistic sample data and makes the
OpDiv and audit features exercisable without pulling production data.

## What's added

- **Multi-OpDiv org.** Six Imperial-branch OpDivs (Navy, Army,
Starfighter Corps, ISB, Intelligence, Sienar) with 10 new systems and 12
officers scoped across them as ISSO / OPDIV_ADMIN /
OPDIV_READONLY_ADMIN, so OpDiv-scoped RBAC isolation can be seen in the
UI (an OWNER sees all 16 systems; a branch OPDIV_ADMIN sees only their
branch).
- **Three-year history.** Data calls FY2022-FY2025, with each new system
scored every cycle and maturity climbing year over year (Traditional
toward Advanced).
- **Richer questionnaire.** Three new `datacenterenvironment`s
(Ground-Assault, Surveillance-Net, Shipyard-RnD), each a full six-pillar
function set with four maturity options, plus per-pillar questions.
- **Audit trail.** `last_edited_at` / `last_edited_by` are derived from
the events table, which raw SQL seed data bypasses, so seeded scores
previously showed a blank "last edited by" in the dashboard footer. This
seeds one `updated` event per new score, attributed to the system's
assigned officer, so the audit display and features work on a fresh
local DB.

## Tests

- `TestFindScoresResolvesSeededAuditFieldsIntegration` covers the read
of audit fields from a pre-existing (seed) event. The existing audit
tests all create the event in-test, so the historical read path that the
seed and the UI footer depend on was untested.
- `make test-e2e` passes 111/111 and `make test-integration` passes. All
data uses new IDs and new OpDivs only; existing systems (1001-1006),
data calls (1-3), and scores (9001-9030) are untouched, so the
OpDiv-scoped and aggregate E2E assertions are unaffected.

## Note

The score notes carry some lightly buried thematic flavor; they remain
plausible as assessment text and contain no real data.
…led (#341)

## Summary

Adds the backend and infrastructure for dual-IdP authentication: HHS
OpDiv users authenticate through Microsoft Entra ID alongside the
existing CMS Okta. The entire feature ships **dormant behind the
`entra_enabled` Terraform flag (default `false` in dev and prod)**, so
merging and deploying this changes nothing about today's Okta-only
behavior. It only takes effect after an operator loads the Entra secret
and flips the flag, per the runbook below.

Implements the backend/infra slice of the dual-IdP epic
(CMS-Enterprise/ztmf-misc#170). The frontend landing page is tracked
separately (CMS-Enterprise/ztmf-ui#328, #329).

## What this delivers

- **Multi-issuer token validation.** The API validates tokens from both
issuers: Okta (ES256) and Entra (RS256 via JWKS), with the Entra tenant
pinned and a per-IdP issuer allowlist. Local dev and the E2E suite
continue to use HS256.
- **Pre-auth IdP lookup.** `GET /api/v1/auth/lookup?email=`
(unauthenticated, rate-limited) returns `{ "data": { "idp": "okta" |
"entra" | null } }` so the browser can route each user to the correct
login path. Unknown and unprovisioned emails return the same shape (no
enumeration), and the email is not logged.
- **Application session layer.** After OIDC login the backend mints its
own signed session cookie and gates `/api/*` by that session or an IdP
bearer, with an origin (CSRF) check on state-changing requests.
- **Per-IdP ALB listener rules.** A new `/login/entra*` rule with a
distinct session cookie name, a forward-only rule for the lookup
endpoint, and the `/api/*` switch from ALB OIDC to backend-gated. The
existing Okta `/login*` and `/api/*` rules are unchanged while the flag
is off.
- **Token-validation hardening** (from a security review of this work):
- Optional per-IdP audience pinning, enforced only when configured, so a
validly signed token minted for a different application in the same
issuer or tenant is rejected. The Entra audience is wired to the ZTMF
app client id; the Okta audience is intentionally left unset for now
since Okta traffic stays ALB-gated until cutover.
- HS256 tokens are rejected on the IdP path outside local/test, so a
deployed environment is asymmetric-only (ES256/RS256) regardless of
environment configuration.

## Why this is safe to merge now

With `entra_enabled = false` (the committed value in both dev and prod):

- Every new ALB rule, the lookup rule, and the `/api/*` switch are
`count = 0` and are not created. The existing Okta `/login*` and
Okta-OIDC `/api/*` rules stay exactly as they are today.
- No new container environment or secrets are injected; the data sources
that read the Entra and session secrets are gated off.
- The new middleware is additive: with no session cookie present,
requests fall through to the existing IdP-bearer path (the ALB-injected
Okta token), resolved by email as today.
- The new audience and issuer checks are inert when unconfigured, and
the HS256 restriction does not touch the Okta ES256 path.

Net: the live authentication path stays 100% CMS/Okta-via-ALB. Rollback
is flipping the flag back to `false`.

## What must happen after merge, before Entra works (per environment)

Dev first (the validation environment), then prod before the data-call
cutover. Dev and prod are separate AWS accounts with separate Secrets
Manager, so each step runs in each account:

1. **Deploy this branch (flag stays `false`).** Terraform creates two
empty secret containers, `ztmf_entra_oidc` and
`ztmf_session_signing_key`. Everything else stays inert.
2. **Load the Entra secret.** Run `scripts/bootstrap-entra-secrets.sh`
against the account to populate `ztmf_entra_oidc` (tenant id, issuer,
JWKS URI, client id, client secret) and generate the session signing
key. Safe to do while the flag is off; nothing reads these values until
the flip.
3. **Confirm HHS-side prerequisites** for that environment: the redirect
URI is registered on the Entra app, and the Conditional Access allowlist
is in place for the pilot identities.
4. **Flip `entra_enabled = true` and apply.** The data sources now read
the populated secret and the ALB rules come up. Flipping before step 2
fails the apply, by design.

The frontend (ztmf-ui#328/#329) must ship its matching email landing
page, gated by its own flag, before end-to-end login through Entra is
usable.

## Test plan

- [x] `make test-e2e` (isolated ephemeral DB, Emberfall): 117 ran, 0
failed
- [x] Auth package unit tests pass, including the
multi-issuer/tenant/audience matrix and the HS256 environment gate
- [x] `go build ./...` and `go vet` clean
- [x] `terraform fmt` clean on changed files
- [x] Confirmed the Okta-only path is unchanged with `entra_enabled =
false`

Closes #265
Closes #334
Refs #264 (ALB listener rules ship dormant; activated and validated at
cutover)
Refs CMS-Enterprise/ztmf-misc#170
Refs CMS-Enterprise/ztmf-ui#328, CMS-Enterprise/ztmf-ui#329
)

## Summary

Fixes a database connection leak that exhausted the Aurora connection
limit in prod today. db.Conn() opens a dedicated pgx connection on every
call (there is no shared pool), and the central model helpers plus a few
raw call sites never closed the connection, so every query leaked one.
Two active users were enough to pile up around 197 open connections, peg
CPU, and time out requests. In the UI that showed as intermittently
missing or slow data, like the OpDiv column blanking on first load.

## Root cause

The leak has been in the data layer since the open-per-call, no-close
pattern was first written (mid-2024). It stayed hidden because Aurora
reaped idle connections faster than they leaked. The OpDiv RBAC work
added more DB round trips per authenticated request (auth middleware,
then the user lookup, then the OpDiv grant queries), which pushed the
leak rate past the reap rate, so connections piled up until the cluster
hit its limit. The recent multi-OpDiv change triggered it but was not
the root.

## Fix

Added a deferred connection close at every runtime site that opens one:

- model.go query() and queryRow(), which back most request-path queries
- scores.go copyPreviousScores, an Exec site that leaked
- fismasystems.go and users.go transaction paths, which rolled back the
tx but never closed the connection. The close is declared before the tx
rollback defer so it runs after the transaction resolves (defers are
LIFO)
- populate.go, the local seed path

The tern migrator connection in migrations.go is left open on purpose.
It is a startup-only connection the migrator owns for the life of the
process.

## Not included

A connection pool (pgxpool) is the real fix for the per-query connect
cost, which is the remaining slowness and the high DB CPU, and it will
matter at HHS scale. That change also touches the credential rotation
path, so it is a separate follow-up. This PR only stops the leak.

## Testing

- make test-integration (isolated DB) passes
- make test-e2e (Emberfall) passes
- go build, staticcheck, and go vet are clean
- Confirmed every runtime connection site closes once, the tx ordering
is correct, there is no double close or use after close, and closing on
an already-canceled context still releases the socket in our pinned pgx
version

## Context

Replaces the earlier partial fix that only closed connections in two
scores functions (#322, refs #321). This closes the leak across the
whole data layer.
…1) (#344)

## Summary

The users table was slow to load its OpDiv column. `FindUsers` (the `GET
/api/v1/users` list query) omitted the user's OpDiv and FISMA grants, so
the frontend backfilled them by calling `/users/{id}` once per user.
With ~600 users that is hundreds of round-trips trickling through the
browser's per-host connection limit, which is the delay. The list query
itself is a few milliseconds.

This loads the grants inline on the list, using the same correlated
subqueries `findUser` already uses for the detail endpoint, so the table
renders the OpDiv column straight from the list response with no per-row
fetch.

## Why this is cheap

Measured against real prod-scale data locally (595 non-deleted users):
the list query with both subqueries runs in a few milliseconds. Each
subquery is a composite-PK index-only scan on the junction table (one
row group per user), so it does not add a meaningful cost and there is
no N*M join blowup. The endpoint returns in about 11ms.

## Coordination

The frontend (ztmf-ui) drops its per-user `fetchUserOpDivs` bulk loop
and reads `assignedopdivids` from each list row, mapping ids to codes
with the opdiv map it already builds once. The frontend change is
written to fall back gracefully if the field is absent, so the two can
deploy in either order.

## Scope

This fixes the page-load read N+1 only. The OpDivGrantModal per-item
write loop is lower severity (user-triggered write, small N, now
pool-bounded) and is tracked separately; it needs a backend batch-grant
endpoint.

## Validation

- Verified against real prod-scale local data: list returns
`assignedopdivids` for all 595 users, endpoint ~11ms
- `make test-integration` and `make test-e2e` pass
- `go build`, staticcheck, `go vet` clean
- Emberfall `/users` list assertions are status/header only, so the
additive fields do not break them

Closes #336
…under normal use) (#343)

## Summary

The data layer had no connection pool. db.Conn() opened a brand-new
connection (TCP, TLS, SCRAM auth) on every query. Once the OpDiv RBAC
and dual-IdP work made every authenticated request run a DB user lookup,
normal use opened a burst of concurrent connections that hit the Aurora
connection limit. New connections then died mid-handshake (failed SASL
auth: context canceled), which showed in the UI as the OpDiv column
going empty and edits failing to save. One active user was enough to
trigger it, so this is a design problem, not load.

This replaces the per-call connection model with a shared, bounded
pgxpool.

## What changed

- db.Conn acquires a connection from a shared pool and returns it;
callers release it back to the pool when done instead of closing it. The
pool caps connections at 16, well under the cluster limit, and reuses
them, so a burst of concurrent requests no longer opens a connection
each.
- Credentials are supplied per new connection, so the 7-day RDS secret
rotation is picked up as connections recycle (MaxConnLifetime 50m). An
auth failure on acquire refreshes the secret and retries once,
preserving the rotation handling from the earlier credential-refresh
fix.
- The tern migrator keeps a dedicated single connection (MigrationConn),
since it needs a plain connection and is not pool-managed.
- Pool is built once with a background context so a canceled request
cannot poison the one-time init.

## Relationship to the earlier fix

The previous change (#342) added a close to every connection site to
stop a slow leak. That was necessary but did not stop the concurrent
burst, because closing a connection after use does nothing about many
connections being opened at the same time. The pool is what actually
bounds and reuses connections. With the pool, connections are released
back rather than closed, which is the correct pattern.

## Validation

- make test-integration (isolated DB) passes
- make test-e2e (Emberfall) passes
- go build, staticcheck, and go vet are clean
- Reviewed for pgx pool correctness: every acquire has a matching
release on all paths including the transaction error paths, the
transaction sites release exactly once, rows are fully read before
release, MaxConns is a hard cap, and the rotation path recovers across
the 7-day cycle

## Sizing notes

MaxConns 16 against the observed cluster connection limit leaves
headroom for the migrator connection, the ops task, and pgAdmin
sessions. MinConns 2 keeps a warm pair so the first requests after idle
do not each pay a full handshake. These are conservative for a
low-traffic app and can be tuned later if needed.
Flips `entra_enabled = true` for dev, activating the dormant dual-IdP
infra from #341: the per-IdP ALB listener rules (`/api/v1/auth/lookup`
forward, `/login/entra` Entra OIDC, `/login` Okta), `/api/*` moved off
ALB OIDC to backend session validation, and the Entra + session env
injected into the API task. Okta login is unchanged.

This is the in-repo version of #347 by @voidspooks, moved onto a
CMS-Enterprise branch so the infrastructure deploy (terraform apply,
which needs the org OIDC role) runs; that does not run on a fork PR. The
single commit and authorship are unchanged.

## Why now

The frontend landing page (ztmf-ui#404) is deployed to dev and calls
`GET /api/v1/auth/lookup` on submit, but that ALB rule only exists when
`entra_enabled = true`. With the flag still false, the lookup falls
through to the OIDC-gated `/api/*` rule, fails for a logged-out visitor,
and the page shows "We can't determine an identity provider" for
everyone, including Okta users. This flip routes the lookup and unblocks
dev testing.

## Scope

- dev only. `infrastructure/tfvars/dev.tfvars`, one line plus a comment.
prod (`prod.tfvars`) stays `false`.
- Preconditions met: `ztmf_entra_oidc` and `ztmf_session_signing_key`
are seeded and verified in the dev account.

## Deploy behavior

Once this is non-draft, orchestration-dev applies it to dev. It only
persists across later dev applies once merged to main, so merge after
the dev login is validated end to end.

Refs #347

Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…349)

First slice of #317, scoped to the backend workflow. GitHub is forcing
Actions runners off Node.js 20, so every run currently emits the
deprecation warning. This moves the backend workflow's Node-based
actions to their latest Node.js 24 releases, which clears the warnings,
and SHA-pins the rest as a security hardening pass while we are in the
file.

## Node.js 24 bumps (the actual warning fix)

These five were running on Node 20; each is now on its latest Node 24
release:

- `actions/checkout` v4 -> v6.0.3
- `docker/setup-buildx-action` v3 -> v4.1.0
- `hoverkraft-tech/compose-action` v2.0.2 -> v3.0.0
- `aws-actions/configure-aws-credentials` v4 -> v6.2.0
- `actions/setup-go` v5 -> v6.4.0

`orchestration-dev.yml` also had `actions/checkout@v4` in the
change-detection job; bumped to v6.0.3 so that job stops warning too.

## SHA pinning (security bonus)

Every third-party action in `backend.yml` is now pinned to a commit SHA
with a `# vX.Y.Z` comment. This drops the two floating refs that were
the worst case:

- `snyk/actions/docker@master` -> pinned SHA (no tagged releases
upstream)
- `aquia-inc/emberfall@main` -> pinned SHA (composite action; the
emberfall binary version still comes from the `version` input)

## Also

- `actions/setup-go` now sets `cache-dependency-path: backend/go.sum`,
so the dependency cache restores instead of missing on every run (go.sum
lives under `backend/`, not repo root).
- `orchestration-dev.yml` now runs the backend smoke test when
`backend.yml` itself changes, not only when `backend/**` code changes,
so workflow edits like this one are exercised by CI instead of shipping
unvalidated. That is what lets this PR test its own bumps.
- Added `.github/dependabot.yml` with a grouped `github-actions` entry
so the SHA pins receive automated update PRs and do not rot.

## Scope

Backend workflow only, as the pilot. The remaining workflows in #317
(`analysis.yml`, `infrastructure.yml`, `orchestration-impl.yml`,
`ops-image.yml`) are follow-ups once this pattern is proven green.

## Test plan

- [ ] Backend smoke-test job runs on this PR (via the new
workflow-change trigger) and passes
- [ ] No Node.js 20 deprecation warning in the backend job logs
- [ ] `compose-action` v3 still starts the API + Postgres e2e stack;
emberfall smoke tests pass
- [ ] `configure-aws-credentials` v6 still assumes the role and the
image pushes to ECR
- [ ] setup-go cache restores from `backend/go.sum` (no cache-miss
warning)

Refs #317
…/swag (#348)

Auto-generates `backend/openapi.yaml` from Go annotations using
swaggo/swag, so the spec stays in lockstep with the routes instead of
being hand-maintained and drifting. A CI drift check fails the build if
the committed spec does not match a fresh generation, and `redocly lint`
enforces that every route declares its security (the one intentional
public route is acknowledged explicitly).

This is the in-repo version of #340 by @voidspooks, moved onto a
CMS-Enterprise branch so the Snyk and other org-secret CI gates run
(they do not run on fork PRs). The commits and authorship are unchanged
from his branch, which is already rebased on current main and addresses
the full review.

## What it does

- swaggo/swag annotations on every handler are the source of truth;
`make generate-openapi` produces `openapi.yaml`.
- CI regenerates and diffs the spec; any drift fails the build.
- `redocly lint` (pinned `redocly/cli:2.32.2`, matching local) runs in
CI with `security-defined` as a hard error, so a new handler that
forgets `@Security` fails. The single public route (`GET
/api/v1/auth/lookup`) is the only entry in
`backend/.redocly.lint-ignore.yaml`.

## Review items addressed (all on this branch)

- Rebased past the OpDiv (#333) and dual-IdP (#341) merges; only
generated/lock-file conflicts, resolved cleanly.
- `GET /api/v1/auth/lookup` annotated with no `@Security`, so it renders
as the one unauthenticated route.
- `SaveFunction` / `SaveOpDiv` update-path id set to optional to match
the `SaveDataCall` / `SaveUser` shared-handler pattern.
- `GetEvents` query params corrected to what the schema decoder actually
binds (`action`, `resource`, and nested `payload.*`).

## Test plan

- [x] `make generate-openapi` is deterministic (regenerates
byte-identical; drift check green)
- [x] `go build ./...` and `go test -short ./...` pass
- [x] `redocly lint` clean (0 errors)
- [x] All 48 routes carry matching `@Router` annotations; the public
route carries no `@Security`

Closes #260
Refs #340

---------

Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…gin (#351)

Fixes the Entra login path being unreachable from the public CloudFront
URL.

## Problem

CloudFront routes only the exact path `/login` to the ALB origin
(`cloudfront.tf`); there is no `/login/*` behavior. So `/login/entra`
does not match, falls through to the S3 default origin, and S3 returns
an `AccessDenied` XML page. Entra login is unreachable for everyone, not
specific to any user or email. Okta is unaffected because it lives at
exactly `/login`.

The dual-IdP work added the `/login/entra*` rule on the ALB but never
added the matching CloudFront behavior. The two layers are owned
separately (the CloudFront login behavior was set up under #166 as an
exact match), and the end-to-end "user lands at /login/entra" check was
deferred to runtime, so the gap only surfaced once dev was flipped to
`entra_enabled = true` and Entra login was first exercised.

## Fix

Add a `/login/*` ordered cache behavior pointing at the `ztmf_api` (ALB)
origin, mirroring the existing `/login` behavior (same forwarding, no
caching). It is gated with `dynamic ... for_each = var.entra_enabled ?
[1] : []` so it exists only when Entra is enabled, matching the per-IdP
ALB listener rules that are created under the same flag. The exact
`/login` behavior is left untouched, so Okta keeps working exactly as
before.

On dev (`entra_enabled = true`) this creates the behavior and
`/login/entra` reaches the ALB. On prod (`entra_enabled = false`)
nothing changes.

## Validation

- `terraform fmt` clean, `terraform validate` passes, `tflint` clean.
- After this applies to dev: `/login/entra` routes to the ALB (Entra
OIDC) instead of returning S3 `AccessDenied`; `/login` (Okta) unchanged.

Refs #264
Hardens JWT verification key fetching against an unsanitized-input flaw
in the Okta path.

## Problem

`oktaKey()` took the `kid` from the (necessarily unverified) token
header and concatenated it directly into the key-fetch URL, then fetched
it with `http.DefaultClient` (no timeout, follows redirects) and used
the returned PEM as the verification key. Because `kid` selects the key,
it is read before the signature is checked - so it is untrusted input
being used to shape an outbound request and choose a trust anchor. The
Entra (RS256) path was already safe (fixed JWKS URL, `kid` as a map
key); this brings the Okta path to the same standard.

## Change

- Validate `kid` against `^[A-Za-z0-9_-]{1,200}$` before any cache,
config, or network access; reject otherwise. The allowlist (URL-safe
characters only, matching real Okta/ALB key ids) is what makes the
existing `TokenKeyUrl + kid` concatenation safe - no path traversal,
host/userinfo injection, scheme, or encoded tricks can reshape the
request.
- Route both the Okta per-`kid` fetch and the Entra JWKS fetch through a
dedicated `http.Client` with a 10s timeout that refuses to follow
redirects, so a slow/hung endpoint cannot pin a request goroutine and a
3xx cannot bounce the fetch to an unexpected host.
- Capture the previously-ignored `http.NewRequest` error.

Behavior is unchanged for legitimate tokens: a valid `kid` still fetches
exactly `TokenKeyUrl + kid` and parses the ECDSA key.

## Test plan

- [x] `make test-full` (unit + Emberfall E2E) passes 117/0
- [x] New unit tests: `kid` allowlist (valid vs
traversal/host-injection/scheme/oversized), rejection before any
outbound fetch, and a positive path (valid `kid` -> correct URL -> ECDSA
parse)
- [x] `go build`, `go vet`, `gofmt` clean
- [x] Independent review

Refs CMS-Enterprise/ztmf-misc#240
## Summary

Gates the existing system enrichment read endpoint (`GET
/api/v1/systemenrichment/{fisma_uuid}`) on a new per-OpDiv capability
flag, so ZTMF Insights enrichment is served only for OpDivs that have an
insights pipeline feeding them. Today that is CMS only; enabling another
OpDiv later is a single database update, no code change.

## Why

Enrichment data (system metadata from the ZTMF Insights pipeline)
currently exists only for CMS, but the read endpoint returned a payload
for any system that had a row, with no notion of which OpDivs are
entitled to the feature. This adds that boundary and makes it
data-driven so the feature can extend to other OpDivs as an add-on.

## Changes

- **Migration 0042**: adds `opdivs.insights_enabled` (boolean, default
false), backfills it true for the active CMS OpDiv. Down migration drops
the column.
- **Model**: `OpDiv.InsightsEnabled *bool` (pointer so an update that
omits it does not reset it), conditional set on save.
`FindSystemEnrichment` returns a row only when the system's `fismauid`
maps to an OpDiv with `insights_enabled = TRUE` (an `EXISTS` predicate,
which avoids fanning out on the non-unique `fismauid`).
- **Controller**: resolves the system first, then authorizes against its
OpDiv (`CanAccessFismaSystem`), replacing the prior per-system-only
check. A system in a non-enabled OpDiv returns 404, identical to a
system with no enrichment row, so the gate never reveals which systems
exist.
- **Tests**: model integration test for the gate, plus an Emberfall E2E
case proving a system with an enrichment row in a disabled OpDiv returns
404.
- **OpenAPI** regenerated from annotations.

## Behavior and compatibility

- Backwards compatible. Additive schema change; deploys to dev on PR,
prod on merge.
- Enrichment is only hidden for systems in OpDivs that are not
insights-enabled. The pipeline only feeds CMS, and the migration enables
CMS, so existing enrichment continues to be served unchanged.
- Minor authorization tightening: OpDiv-scoped admins now read
enrichment only for systems in their granted OpDivs (previously any
admin tier could read any system's enrichment). Unscoped admins (OWNER,
HHS) and assigned ISSOs are unaffected.

## Testing

Full local suite passes: unit, integration (including the new gate
test), and E2E (Emberfall). Validated end to end against real enrichment
data in the impl environment.

## Issues

Implements CMS-Enterprise/ztmf-misc#85 (API endpoints for SDL enrichment
data). The read surface landed in #211; this PR adds the
OpDiv-conditional gate that completes the consumption API. The issue's
original per-function response shape is superseded by the generic jsonb
payload (the pipeline owns structure). It will be closed on merge
(cross-repo keywords do not auto-close from this repository).
## Overview

This is the in-repo version of #355, originally submitted by @voidspooks
(cameron testerman) from a fork. The branch has been pushed to this repo
unchanged so that org-secret CI gates — Snyk and the
Infrastructure/Deploy job — can run. All commits are authored by
@voidspooks; authorship is fully preserved.

Refs #355
Closes #133

---

## What this PR does

Adds a new read-only API endpoint:

```
GET /api/v1/scores/diff?from={datacallid}&to={datacallid}[&fismasystemid={id}]
```

This endpoint compares the questionnaire answers between two data calls
and returns **only the functions whose answer changed**, with
attribution for who made the change and when. It answers the "show me
everything that changed during this period" ask from #133 — the two data
call IDs bound the time window, and the response is the delta plus audit
info.

---

## Response shape

One object per changed function:

```jsonc
{
  "fismasystemid": 1001,
  "functionid": 7007,
  "function": "Battle Station Device Management",
  "question": "Does your system track all assets with comprehensive inventory?",
  "from": { "scoreid": 9001, "functionoptionid": 25, "optionname": "Defined",  "score": 2, "notes": "..." },
  "to":   { "scoreid": 9025, "functionoptionid": 26, "optionname": "Managed",  "score": 3, "notes": "..." },
  "changed_at": "2026-06-18T16:20:18Z",
  "changed_by": { "userid": "...", "name": "Grand Moff Tarkin", "email": "...", "role": "OWNER" }
}
```

- `from` / `to` are the answer in each cycle. Either is `null` when a
function was answered in only one cycle (i.e., it was added or removed
between data calls).
- `changed_by` / `changed_at` attribute the write that produced the `to`
answer, resolved from the `events` audit trail. Absent when no write
event exists for the `to` score (seed data, or a function answered only
in the `from` cycle).

---

## What counts as a change

A function is included in the diff when:
- The selected `functionoption` changed, **or**
- The notes changed, **or**
- The function was answered in exactly one of the two cycles (added or
removed)

`nil` and empty string notes compare as equal, matching the no-op
normalization already used on the write path. Functions with no change
are dropped from the response.

---

## Scope and access control

Tier scoping mirrors `ListScores` and is enforced after query decode so
a caller cannot widen their own scope:

| Role | Sees |
|---|---|
| Unscoped admin | All systems |
| OpDiv-scoped admin | Their OpDiv's systems (fail-closed) |
| ISSO / ISSM | Their assigned systems only |

`from` and `to` are both required and must differ — the endpoint returns
`400` otherwise.

---

## Implementation notes

- **New file:** `internal/model/scorediff.go` — contains `FindScoreDiff`
and `buildScoreDiffSQL`. Uses parameterized raw SQL via the existing
`rawQuery` read-path helper. A FULL OUTER JOIN of two cycle CTEs on
`(system, function)`, plus catalog joins for labels and an events
lateral for attribution, was beyond squirrel's ergonomics — this follows
the precedent set by `buildPillarScoresSQL`. SELECT-only; never touches
the write path or event recording.
- **New controller:** `GetScoresDiff` handler wired to
`/api/v1/scores/diff`.
- **OpenAPI:** `openapi.yaml` regenerated via swag; redocly lint clean
(pre-existing warnings only, no new issues).

---

## Testing

- **Unit tests:** input validation; SQL builder filter and scope
shaping; arg ordering across single-system, ISSO user scope, and OpDiv
grants with fail-closed behavior.
- **Integration tests (real Postgres, empire seed):** changed answer
included; unchanged answer dropped; one-sided entry via the outer join;
`to`-side attribution resolved correctly.
- **Local smoke tests:** run against a freshly wiped and reseeded dev
DB. `go build/vet/test ./...` all green (full suite including
integration). Live HTTP tests covered: 400 validation cases, unchanged →
empty response, a rich non-empty diff, and an edit-then-diff that
surfaced the change attributed to the editor.

---------

Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…60623

# Conflicts:
#	Makefile
#	backend/cmd/api/internal/controller/scores.go
#	backend/internal/model/fismasystems.go
#	backend/openapi.yaml
@danielbowne danielbowne added impl Targets the impl branch; reviewers should diff against impl, not main zt-insights ZTMF Insights / SDL Snowflake enrichment pipeline work labels Jun 23, 2026
@danielbowne danielbowne self-assigned this Jun 23, 2026
@danielbowne danielbowne merged commit 2d00a14 into impl Jun 23, 2026
9 checks passed
@danielbowne danielbowne deleted the feature/impl-sync-20260623 branch June 23, 2026 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

impl Targets the impl branch; reviewers should diff against impl, not main zt-insights ZTMF Insights / SDL Snowflake enrichment pipeline work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants