Sync impl with main (scores-diff #356, Okta kid #354, /login CloudFront #351)#361
Merged
Conversation
## Summary
Adds a generic `public.system_enrichment` table that ztmf core owns and
a separate enrichment pipeline (private `ztmf-insights` stack) fills
with an opaque JSON payload, plus a read-only endpoint that returns it.
This is the core-side, **additive** half of decoupling the CMS-specific
enrichment pipeline from the open-source data-call app.
The two sides meet at one stable contract: the table. Adding or changing
enrichment fields becomes a payload-key change in the pipeline with no
ztmf migration. The core app works with or without an enrichment
pipeline attached (an unfilled table simply returns 404s).
## Changes
- **Migration 0038** creates `public.system_enrichment (fisma_uuid
VARCHAR(255) PK, payload jsonb NOT NULL, synced_at timestamptz)`.
`fisma_uuid` has no FK, matching the existing `cfacts_systems` precedent
(`fismasystems.fismauid` is `VARCHAR(255)` and not unique); the table is
a disposable cache the pipeline rewrites via TRUNCATE+INSERT.
- **`GET /api/v1/systemenrichment/{fisma_uuid}`** returns the payload
verbatim. Access control reuses the existing FISMA-system assignment
check: admins / read-only admins see any system, ISSOs only assigned
ones, missing rows return 404.
- `model.SystemEnrichment.Payload` is `json.RawMessage` so the payload
serializes back out as raw JSON, not base64.
- OpenAPI path + `SystemEnrichment` schema, Emberfall E2E coverage, and
one empire seed row.
## Scope / deferred
Additive only — **no cfacts files are touched**. Retiring the
enrichment-specific (CFACTS) read surface and dropping `cfacts_systems`
is deferred until the UI repoints to this endpoint and the old sync is
decommissioned, so this is safe to run alongside the current stack.
Tracked under the enrichment-extraction epic.
## Deploy note
On merge/deploy to dev, migration 0038 applies on API startup and
creates the table. This is intentionally sequenced ahead of the
enrichment pipeline's first run against dev.
## Test plan
- [x] `go build` / `go vet` / `staticcheck` clean
- [x] model unit tests pass (incl. payload-serializes-as-JSON guard)
- [x] `make dev-setup` applies 0038; live curl returns 200 (admin), 404
(no row), 200 (assigned ISSO), 403 (unassigned ISSO)
- [x] `make test-full` passes (unit + Emberfall E2E), Failed: 0
Refs #211
## Summary
Three small, independent fixes that together make `make full-stack-up`
work on Apple Silicon macOS out of the box. Each one addresses a
Mac-specific quirk; none change behavior on Linux.
- **`backend/Dockerfile`** — `FROM scratch` → `FROM
gcr.io/distroless/base-debian12`, and drop the hardcoded
`x86_64-linux-gnu` libc/ld COPY lines. Distroless ships the right
libc/ld/ca-certs for whichever arch the build runs on, so the build is
now arch-agnostic and runs native-fast on both arm64 and amd64.
- **`Makefile` (compose-dev.yml + dev.compose.env generators)** —
replace `network_mode: host` with explicit `ports:` mappings
(`54321:54321` for postgres, `$(API_PORT):$(API_PORT)` for api) and
change `DB_ENDPOINT=localhost` → `DB_ENDPOINT=postgre`. Docker Desktop
on macOS runs containers inside a Linux VM, where `network_mode: host`
binds to the VM, not the Mac, unless the experimental host-networking
toggle is enabled. Port mappings behave the same on every platform.
- **`Makefile` (top)** — add `SHELL := /bin/bash`. The default `/bin/sh`
on macOS is bash invoked in POSIX mode, where the `echo` builtin prints
`-n` literally instead of suppressing the trailing newline. That
corrupted every base64-encoded segment in `generate-jwt` and
`frontend-env`, producing tokens the API rejected with 401.
## Symptom before this PR (Apple Silicon)
```
=> ERROR [stage-1 4/7] COPY --from=builder /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 ...
=> ERROR [stage-1 5/7] COPY --from=builder /usr/lib/x86_64-linux-gnu/libc.so.6 ...
failed to compute cache key: ... "/usr/lib/x86_64-linux-gnu/libc.so.6": not found
```
After patching the Dockerfile, the build succeeds but `curl
localhost:8080` returns connection-refused (host networking issue), and
once port mappings are in place, the JWT from `make frontend-env`
decodes to `-n {"alg":"HS256"}\n...` and the API returns 401.
## Test plan
- [x] `make dev-down && rm backend/compose-dev.yml
backend/dev.compose.env`
- [x] `make full-stack-up` on Apple Silicon (arm64) — backend container
builds natively, both containers come up healthy
- [x] `curl -s -o /dev/null -w "%{http_code}\n" -H "Authorization:
$VITE_AUTH_TOKEN3" http://localhost:8080/api/v1/users/current` → `200`
with the seeded admin user
- [x] Frontend (`vite`) loads in browser and can hit the API
- [x] Verify on Linux that the same flow still works (echo behavior on
dash differs but `SHELL := /bin/bash` is widely available; port-mapping
change is platform-neutral)
…isc#213) (#324) ## Overview This PR removes the cfacts enrichment and PostgreSQL-to-Snowflake data-sync pipelines from the ztmf account. The same code and infrastructure now lives in the private `CMS-Enterprise/ztmf-insights` repo and has been running side-by-side in dev, impl, and prod for the last few weeks. With the insights stack verified end to end, the original copies in ztmf are no longer needed. ## Why the diff is large This is the third and biggest commit of a multi-repo decoupling effort tracked under epic ztmf-misc#190. The earlier steps were additive: the insights repo was created and stood up, the new generic `public.system_enrichment` seam was added to ztmf core (#323), and the UI was repointed at the new endpoint (CMS-Enterprise/ztmf-ui#367). Those landed first so nothing live depended on the old code at the moment this PR removes it. What ships here is the cleanup. Three Lambda functions, their IAM roles and policies, their CloudWatch monitoring, their dead-letter queues, the CFACTS CSV ingest S3 bucket, and the supporting Go packages all come out in one PR. The diff reads as "minus four thousand lines" because that is exactly what it is: an entire feature path being retired from this repo. For context, the goal of the decoupling was to keep ztmf core generic enough that other HHS OpDivs can adopt it without inheriting CMS-specific plumbing. The cfacts pipeline reflects the CMS Security Data Lake mapping, which is not portable, so it belongs in a CMS-only repo. Splitting also lets enrichment schema iterate independently of ztmf migrations: adding a new enrichment field is now a Snowflake view payload-key change on the insights side with zero ztmf-side work. ## What this PR ships **Removed from `backend/`:** - `cmd/lambda/` (ztmf-data-sync: PostgreSQL to Snowflake export) - `cmd/lambda-cfacts-snowflake/` (Snowflake CFACTS view to PostgreSQL ingest) - `cmd/lambda-cfacts-s3/` (CSV upload ingest path; never reached prod usage) - `internal/cfacts/` and `internal/export/` (model, CSV parser, Snowflake client, Postgres exporter) - Orphaned sync notification helpers in `internal/notifications/slack.go` (`SyncResult`, `SendSyncNotification`, `buildSyncMessage`, supporting formatters). The rotation and cert-rotation notifications used by the surviving Lambdas are untouched. **Removed from `infrastructure/`:** - `lambda.tf`, `lambda-cfacts.tf` (the three Lambda function blocks plus EventBridge schedules and permissions) - `iam.tf`, `iam-cfacts.tf` (per-Lambda execution roles and policies) - `monitoring.tf`, `monitoring-cfacts.tf` (log groups, error alarms, duration alarms, dead-letter queues) - `aws_s3_bucket.cfacts_sync` block from `s3.tf` (CFACTS CSV upload bucket plus lifecycle and SSE configuration) - `aws_secretsmanager_secret.ztmf_snowflake_dev` and `.ztmf_snowflake_prod` from `secrets.tf` (the insights stack now owns its own `ztmf-insights-snowflake` secret in each account, seeded and verified per ztmf-misc#223) - `enable_snowflake_sync` local in `locals.tf` - `cfacts_snowflake_view` and `snowflake_table_prefix` variables in `variables.tf`, plus the matching settings in `tfvars/{dev,impl,prod}.tfvars` - Data-sync-specific outputs from `outputs.tf` (`lambda_function_name`, `lambda_function_arn`) and the SSM test-event parameter resource. The NAT egress output is retained and its description updated for the surviving source-system allowlist use cases (Kion). **Updated docs:** - `infrastructure/README.md`: "Data Synchronization" section rewritten to point at the insights repo; "Infrastructure Organization" file list updated. - `backend/cmd/lambda-kion-key-rotate/README.md`: prior-art reference repointed at `lambda-cert-rotation` (the only remaining Go Lambda using the same pattern). **Updated CI:** - `.github/workflows/backend.yml`: the three deleted Lambda build, upload, and update-function-code steps are removed. The workflow now builds and deploys only `lambda-kion-key-rotate` and `lambda-cert-rotation`. - `Makefile`: `test-build` and `test-full` targets updated to match. **Dependency cleanup:** - `go mod tidy` removed the Snowflake driver and CSV-related transitive dependencies that only the deleted Lambdas pulled in (44 lines from `go.mod`, 120 from `go.sum`). ## What this PR deliberately leaves in place - **`aws_security_group.ztmf_sync_lambda`** in `vpc.tf`. The name is now historical, but the security group is still attached to `lambda-kion-key-rotate` and `lambda-cert-rotation`. Renaming it requires a `moved` block and is a follow-up so this PR can stay focused. - **`cfacts_systems` table, the `/api/v1/cfactssystems` controller and routes, the cfacts emberfall block, and the read-side model**. These are the read surface that ztmf-ui#367 just stopped consuming. They get retired in a separate PR tracked under ztmf-misc#214, which is gated on this one merging and the insights stack writing the new `system_enrichment` table cleanly in prod (already verified, 394 rows present). - **`backend/internal/notifications/slack.go`** core (the `SlackNotifier` plumbing, the `RotationResult` and `CertRotationResult` paths). Used by both surviving Lambdas. Only the sync-result code paths are removed. ## Terraform blast radius Verified via `terraform plan` against dev state: **0 to add, 0 to change, 51 to destroy.** The destroys group as follows (dev numbers; prod is the same shape minus the dev-only SSM test-event parameters): | Resource type | Count | Notes | |---|---:|---| | `aws_lambda_function` | 3 | `ztmf-data-sync`, `ztmf-cfacts-snowflake-sync`, `ztmf-cfacts-s3-sync` | | `aws_lambda_permission` | 3 | EventBridge and S3 invoke permissions | | `aws_cloudwatch_event_rule` | 2 | data-sync + cfacts-snowflake schedules | | `aws_cloudwatch_event_target` | 2 | matched targets | | `aws_cloudwatch_log_group` | 3 | one per Lambda | | `aws_cloudwatch_metric_alarm` | 4 | data-sync errors + duration, cfacts-snowflake errors, cfacts-s3 errors | | `aws_sqs_queue` | 2 | dead-letter queues | | `aws_iam_role` | 2 | data-sync and cfacts execution roles | | `aws_iam_policy` | 10 | logging, secrets, vpc, sqs, xray, s3 split across both roles | | `aws_iam_role_policy_attachment` | 10 | matched attachments | | `aws_s3_bucket` and 4 supporting | 5 | `ztmf-cfacts-sync-${env}` plus its public-access-block, versioning, SSE, lifecycle, and the S3 notification on it | | `aws_ssm_parameter` (dev only) | 3 | `/ztmf/dev/lambda/test-events/*` | | `aws_secretsmanager_secret` | 1 per env | `ztmf_snowflake_dev` in dev, `ztmf_snowflake_prod` in prod | Plus two Terraform outputs (`lambda_function_name`, `lambda_function_arn`) come out of the state. Every destroyed resource maps to enrichment compute or its supporting plumbing. No surviving resources (API, ALB, Aurora, CloudFront, ECS, Kion rotator, cert-rotation, ALB OIDC trust provider, TLS or SMTP secrets, Aurora master user secret, Slack webhook secret) appear in the destroy list. The unrelated pre-existing `managed_policy_arns` deprecation warning on `module.github_actions[0].aws_iam_role.role` is shown by the plan but is unaffected by this PR. It will be addressed when the GitHub Actions module is updated. The S3 buckets were pre-cleared before this PR opened. `ztmf-cfacts-sync-dev` had nine versioned objects and delete markers from February 2026 test runs; all were purged via `aws s3api delete-objects`. `ztmf-cfacts-sync-prod` was already empty. The bucket destroy will not be blocked. ## Validation done locally - `terraform -chdir=infrastructure fmt -recursive -check`: clean. - `tflint --chdir=infrastructure -f compact`: clean. - `cd backend && go build ./...`: clean. - `cd backend && go vet ./...`: clean. - `cd backend && go test -short ./internal/notifications/...`: passes. The remaining tests cover the cert-rotation Slack path. - `aws s3api list-object-versions` on both `ztmf-cfacts-sync-*` buckets: empty, including delete markers. - `terraform plan -var-file=tfvars/dev.tfvars` against dev state: `0 to add, 0 to change, 51 to destroy`. Per-resource list matches the blast-radius table above with no surprises. ## Cross-repo context for reviewers If you want to see how this fits together, the merge order across the three repos was: 1. **insights repo** CMS-Enterprise/ztmf-insights#1: relocate the Lambdas into the private repo. Merged, deployed to dev, impl, prod. Side-by-side with the originals. 2. **ztmf repo** #323 (ztmf-misc#211): add migration `0038systemenrichment.go` plus `GET /api/v1/systemenrichment/{fisma_uuid}`. Merged, in prod. 3. **ztmf-ui repo** CMS-Enterprise/ztmf-ui#367 (ztmf-misc#212): repoint the card at the new endpoint. Merged, deployed. 4. **insights repo** CMS-Enterprise/ztmf-insights#7 (ztmf-misc#223): move the Snowflake credential into an insights-owned secret. Merged, secret seeded in both accounts, lambda verified reading from the new path. 5. **ztmf repo** this PR (ztmf-misc#213): remove the original compute, IAM, monitoring, S3 ingest bucket, and Snowflake secret from the ztmf account. Follow-up still tracked under the epic: - ztmf-misc#214: retire the `cfacts_systems` table and the `/api/v1/cfactssystems` read surface. Gated on this PR merging. - ztmf-misc#220: rename `CfactsRecordCard` to `SystemEnrichmentCard` in ztmf-ui. Gated on ztmf-misc#214. - ztmf-misc#224: migrate the Kion plumbing (`lambda-kion-key-rotate` + Kion secret) into ztmf-insights when the planned Kion-logs-to-SDL ingest scope lands. ## Reviewer checklist - [x] The `terraform plan` against dev shows the destroys listed under "Terraform blast radius" and nothing else. - [x] No remaining references to deleted Go packages anywhere in `backend/cmd/api/`. - [x] CI workflow runs through Lambda build and deploy steps for kion and cert-rotation only. - [x] The retained CFACTS read surface still serves the ztmf-ui card for any system whose `sdl_sync_enabled` flag is off (those systems still read the legacy path until ztmf-misc#214 lands).
…14) (#326) ## Summary Final step of the ZTMF Insights extraction tracked under ztmf-misc#190. The legacy CFACTS read surface in the ztmf core API is removed and the underlying `cfacts_systems` table is dropped. ztmf-ui#367 has been the sole consumer of this surface and was repointed at the new `system_enrichment` endpoint two weeks ago, with no fallback to the old path. #324 removed the write side (lambdas + S3 ingest + Snowflake secrets) last week. This PR closes the loop on the read side. ## Changes - Add migration `0039cfactsdrop.go` that drops `public.cfacts_systems`. The down migration recreates the table with the cumulative shape from `0015` + `0021` + `0022` (table + two ALTER ADDs) plus the original `COMMENT ON TABLE`, so a rollback restores the table faithfully. - Remove the `GET /api/v1/cfactssystems` and `GET /api/v1/cfactssystems/{fisma_uuid}` routes from `router.go`, delete `controller/cfacts.go`, delete the `CfactsSystem` model and unit tests. - Move the per-system access check out of the deleted model file and into `internal/model/fismasystems.go` as `UserCanAccessFismaSystemByUUID`. The query was always against `users_fismasystems INNER JOIN fismasystems` and never touched `cfacts_systems`, so the old name `UserCanAccessCfactsSystem` was misleading; the SQL is byte-identical. The new `systemenrichment` controller is the only caller and is updated to use the renamed function. - Strip the six `cfacts_systems` INSERTs and stale CFACTS comments from `_test_data_empire.sql`, and delete the unused standalone `_test_data_cfacts_empire.sql` fixture (it was not loaded by any compose file or Makefile target). - Drop the `cfactsSystemData` anchor and all CFACTS test cases from `emberfall_tests.yml` (eight cases across admin, read-only admin, and ISSO sections). The retained `system_enrichment` cases cover the same access matrix and a new READONLY_ADMIN case was added so coverage of that role's controller branch is preserved. - Remove the two `/cfactssystems` paths and the `CfactsSystem` schema from `backend/openapi.yaml`. ## Why this is safe to merge - **Backend not in use.** Both `ztmf-ui` and `ztmf-insights` return zero hits for `cfactssystems` or `cfacts_systems`. The card was repointed at `/api/v1/systemenrichment/{fisma_uuid}` in ztmf-ui#367 and the section is gated on `sdl_sync_enabled`, so systems without insights data hide the card entirely rather than falling back. - **No referential cascades.** `cfacts_systems` had no foreign keys in or out by design (see the comment in migration `0038systemenrichment.go`), so dropping the table is a clean operation. - **Real-data migration check.** Verified locally against a postgres volume that holds a real-shape dataset: the drop ran cleanly and `make test-full` passed (91 emberfall cases, 0 failures, 0 skipped) against the resulting schema. ## Test plan - [x] `make test-unit` clean - [x] `make test-full` clean against the local postgres volume (real-data shape) — 91/0 emberfall, all integration paths green - [x] `grep -rn cfacts backend/` shows only the immutable historical migration files (`0015`, `0021`, `0022`, `0028` and `0038` comments referencing the prior pipeline) and the opaque `cfacts` JSON key inside one `system_enrichment` payload (owned by the Snowflake view, not this table) - [x] OpenAPI spec has no remaining `CfactsSystem` references or orphan `$ref`s - [x] External repos checked: `gh search code --repo CMS-Enterprise/ztmf-ui cfactssystems` and `--repo CMS-Enterprise/ztmf-insights cfactssystems` both return empty ## Follow-ups Once this lands, ztmf-misc#220 (rename `CfactsRecordCard` to `SystemEnrichmentCard` in ztmf-ui) is unblocked. After that, the only remaining open sub-issue of the extraction epic is ztmf-misc#224 (migrate the Kion plumbing into the insights stack), which is explicitly out of scope for this PR. Refs ztmf-misc#214.
…ching (#327) Closes #309 > **Note:** Rebased onto latest `main` after #326 landed (which retired the legacy CFACTS read surface and deleted `cfactssystems.go`). The original `cfactssystems.go` hunks were dropped as part of that rebase — the case-insensitive matching now lives in `UserCanAccessFismaSystemByUUID`, where the old `UserCanAccessCfactsSystem` access check was relocated. ## Summary - Relaxed `fismauid` validation in `FismaSystem.validate()` to require only a non-empty string instead of a valid UUID. Some HHS OpDiv system IDs (e.g. `CDC8767221`) are not GUIDs and were being rejected with a 400. This is the core #309 fix — the gate that blocks creating a CDC-style system. - Made the access-check `WHERE` clause in `UserCanAccessFismaSystemByUUID` (`fismasystems.go`) case-insensitive using `LOWER()`. Defensive future-proofing so non-GUID OpDiv IDs match regardless of case once system enrichment opens beyond CMS. - Updated `fismasystems_test.go`: replaced the `InvalidUUID` test case with `ValidSystemNonUUID` (CDC-style ID should pass) and `EmptyFismaUID` (empty string should still fail). ## Testing Go unit tests pass (run in a `golang:1.25` container — `go test -short ./...`): ``` --- PASS: TestFismaSystem_Validate/ValidSystemUUID --- PASS: TestFismaSystem_Validate/ValidSystemNonUUID --- PASS: TestFismaSystem_Validate/InvalidEmail --- PASS: TestFismaSystem_Validate/EmptyFismaUID ok github.com/CMS-Enterprise/ztmf/backend/internal/model 0.003s ``` Full `go test -short ./...` suite is green; `go build ./...` and `go vet ./internal/model/` clean. > Snyk / secret-dependent CI checks show red because this branch is from a fork (no repo secrets) — expected, maintainers re-run on their side.
## Summary
`make dev-setup` was not loading the Empire test data
(`_test_data_empire.sql`) into a fresh Postgres volume. A new clone, a
new machine, or any `docker compose down -v` came up with an empty
database (0 FISMA systems, 0 questions), which is what surfaced when
setting up the backend on a freshly migrated OS.
## Root cause
The seed step was gated on the database host string:
```go
if cfg.Db.PopulateSql != nil && cfg.Db.Host == "localhost" {
```
The dev API container reaches Postgres over the `postgre` Docker Compose
service name (`compose-dev.yml` overrides `DB_ENDPOINT=postgre` so the
container can talk to Postgres over the internal Docker network). The
`localhost` value in `dev.compose.env` only applies to host-side
tooling. So inside the container the check evaluated `postgre ==
"localhost"`, was false, and seeding was silently skipped.
This was masked on existing machines two ways, which is why it did not
show up for other developers:
- `backend/compose-dev.yml` was only regenerated when missing, so
existing checkouts kept an older on-disk compose file and never picked
up the `postgre` override.
- The `backend_postgres_data` volume persists across restarts, so
databases seeded under the older configuration stayed seeded.
A fresh volume on current code has neither of those, so the seed never
ran.
## Fix
Two changes, addressing both halves of the root cause:
1. **Gate seeding on `ENVIRONMENT`, not the database host.**
`ENVIRONMENT` is host- and platform-agnostic: `local` for dev
(`dev.compose.env`), `test` for the Emberfall E2E stack
(`compose-test.yml`), and it defaults to `production` everywhere else.
```go
if cfg.Db.PopulateSql != nil && cfg.IsLocalOrTest() {
```
2. **Make the `backend/compose-dev.yml` generation target PHONY** so it
regenerates on every `make dev-setup` / `dev-up`. This stops developers
from running a stale on-disk compose file that predates the
`DB_ENDPOINT=postgre` override, which was the second reason fresh-volume
seeding silently broke.
### Environment-gate helpers
The environment checks are now centralized as two `config` helpers,
replacing scattered string comparisons:
- `config.IsLocal()` -> `ENVIRONMENT == "local"`
- `config.IsLocalOrTest()` -> `ENVIRONMENT == "local"` or `"test"`
These are deliberately distinct. Seeding uses `IsLocalOrTest()` (safe in
local dev and the E2E test stack). Just-in-time ADMIN user creation in
the auth middleware uses `IsLocal()` (local dev only, never the E2E test
stack). Keeping them separate preserves that asymmetry, and a new unit
test (`TestEnvironmentGates`) locks the contract across
local/test/production/dev/impl/prod/empty.
## Why this is safe for deployed environments
Seeding requires two independent conditions, and deployed environments
fail both:
- Deployed dev, impl, and prod resolve `ENVIRONMENT` to `dev`, `impl`,
and `prod` respectively (see `infrastructure/ecs.tf` and the tfvars),
none of which match `local` or `test`.
- Deployed task definitions never set `DB_POPULATE`, so
`cfg.Db.PopulateSql` is nil and the populate call cannot run regardless.
The allowlist of environment names is deliberate: an unrecognized
environment fails closed (no seeding) rather than open.
## Why this does not break macOS developers
The fix does not touch the compose networking, port mappings, or the
`DB_ENDPOINT=postgre` override introduced for Apple Silicon. macOS dev
uses the same `ENVIRONMENT=local`, so it seeds a fresh volume the same
way Linux does. The PHONY change only ever regenerates `compose-dev.yml`
to its current correct form (the file is gitignored and marked "DO NOT
EDIT - Managed by Makefile"), and existing seeded volumes are
unaffected. The auth-middleware refactor is behavior-preserving: JIT
ADMIN creation still fires in local dev only.
## Dependency security fix
The PR build's Snyk container scan failed the pipeline on a critical
vulnerability in a transitive dependency, fixed here to keep the branch
green:
- **golang.org/x/net** bumped from v0.47.0 to v0.55.0. v0.47.0 carries a
critical Improper Authentication vulnerability in
`golang.org/x/net/idna` (SNYK-GOLANG-GOLANGORGXNETIDNA-17116876), first
fixed in 0.54.0.
- `go mod tidy` also advanced the related modules it pulls:
`golang.org/x/crypto` to v0.51.0 and `golang.org/x/text` to v0.37.0.
`x/net` is an indirect dependency, so this is a `go.mod` / `go.sum`
change only with no source edits.
## Testing
- Fresh volume on Linux: `docker compose down -v` then `make dev-setup`
now logs `Populating database with /app/_test_data_empire.sql` and the
database comes up with 9 users, 4 FISMA systems, 18 questions, 30
scores. No manual psql load needed.
- Confirmed `compose-dev.yml` now regenerates on every run (PHONY),
overwriting a deliberately corrupted copy.
- New `TestEnvironmentGates` unit test passes.
- `make test-full` (unit + Emberfall E2E on the isolated test stack with
`ENVIRONMENT=test`) passes, confirming the E2E seed path is preserved
and the middleware refactor is behavior-preserving.
## Summary Snyk started failing the prod orchestration on two high-severity vulnerabilities in the Go standard library at 1.25.10: - `std/mime` - `std/crypto/x509` Both are fixed in Go 1.25.11. The advisories surfaced after #329 had already passed its PR checks, so the same code that was green on the PR now blocks the prod pipeline at the analysis stage. Until this lands, every merge to main fails Snyk and nothing deploys. ## Change - `backend/go.mod`: `go 1.25.10` -> `1.25.11` - `backend/Dockerfile`: builder image `golang:1.25.10-bookworm` -> `golang:1.25.11-bookworm` No code changes. Builds clean locally and the host unit tests pass. ## Note The prod run failed at the analysis stage before the deploy step, so prod was not affected and is still running the prior image. This unblocks the pipeline so #329 (and subsequent work) can deploy.
Re-homes #330 (originally authored by @voidspooks) onto an org branch so the full CI pipeline runs. Fork PRs do not get the org secrets Snyk and the build/deploy steps need, so the change could not be validated or shipped from the fork. Same content as #330, plus a merge of current main, which brings in the seeding fix (#329) and the Go 1.25.11 bump (#331). The one overlap with main was in `auth/middleware.go` (the JIT user-creation gate); resolved to keep main's `cfg.IsLocal()` helper and this branch's `OWNER` role. ## What this does Stage D of the role cleanup: removes the legacy `ADMIN` and `READONLY_ADMIN` role values. A prior migration (`0036`) already converted existing rows (`ADMIN -> OWNER`, `READONLY_ADMIN -> HHS_READONLY_ADMIN`); this stops the app from accepting or emitting the legacy values and adds a DB guard (`0040`, a CHECK constraint rejecting the legacy strings). ## Validation - Migration tested against a copy of prod data: zero legacy rows, constraint applies clean, API starts, and a legacy write is rejected as expected. - Authorization routes through the role helpers; every legacy tier maps to its replacement, so no lockout or access gap. - Allowed role set is consistent across `validations.go`, the DB constraint, and both OpenAPI enums; both seed files use only the new values. - Emberfall E2E passes locally (91/91). Closes #330. Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…#333) ## Summary Adds authorization enforcement for the multi-OpDiv role model on top of the schema that already shipped (opdivs reference table, fismasystems.opdiv_id, users_opdivs junction, the new role enum). Before this branch, reads were only partially scoped and write paths were not scoped at all: an OPDIV_ADMIN could edit or delete systems in any OpDiv, score any system, and manage any user, and there was no endpoint to create an OpDiv or grant a user an OpDiv. This branch closes those gaps so a scoped admin manages only their own OpDiv, while OWNER/HHS tiers retain full cross-OpDiv access. ## What changed **OpDiv administration (OWNER-only)** - `POST /api/v1/opdivs`, `PUT /api/v1/opdivs/{opdiv_id}` to create / update / deactivate an OpDiv. **OpDiv grants (user membership)** - `GET/POST/DELETE /api/v1/users/{userid}/assignedopdivs[/{opdiv_id}]` to list, grant, and revoke OpDiv membership. An OPDIV_ADMIN may only grant/revoke an OpDiv they themselves hold, and only onto a user within their assignable tier. - A user's `identity_provider` is derived from their OpDiv on grant/revoke (CMS to okta, otherwise entra) and is only overridable by an OWNER. **Write-path scope** - Editing, decommissioning, reactivating, and assigning a user to a system, scoring a system, and marking a data call complete are all gated so an OPDIV_ADMIN is limited to systems in an OpDiv they hold. OWNER/HHS_ADMIN are unrestricted; ISSO/ISSM keep their per-system assignment path. - Role-escalation guard on user create/update: an OPDIV_ADMIN cannot mint or modify a user above the OpDiv tier, and cannot manage a user outside their OpDiv. **Read-path scope (fail-closed)** - User, score, and data-call-completion lists are scoped to the caller's OpDivs for OPDIV tiers. A scoped admin with zero grants matches nothing rather than falling through to an unscoped read. - The audit trail (`GET /events`) and mass email (`POST /massemails`) are restricted to the unscoped admin tiers, since neither has a per-OpDiv scope to filter on. **Migration** - `0041` adds an idempotent index on `fismasystems.opdiv_id` to back the scope predicates. No data change. **Docs and tests** - OpenAPI updated for the new endpoints and the scoped / restricted behavior. - Unit tests for the role helpers, the fail-closed SQL predicate, and the controller authorization gates; Emberfall E2E negative matrix asserting cross-OpDiv writes return 403 and scoped reads exclude out-of-OpDiv rows. ## Related issues - Closes #313 (Backend: surface OpDiv / identity_provider fields and add scope predicates), and additionally delivers three items that issue had deferred (OpDiv scope on scores/events/datacalls, grant-management endpoints, and write-path scope beyond create) because the authorization layer must be complete before the dual-IdP work admits OpDiv-scoped users. - Part of epic CMS-Enterprise/ztmf-misc#198 (HHS multi-OpDiv expansion). - Builds on the database work in #311 / #312 and the role-value cleanup in #314, all already on `main`. ## Safety and rollout A no-op for current production: no OPDIV_ADMIN users exist yet, and the OWNER / HHS tiers short-circuit every new check. The only migration is an index (idempotent up / reversible down). Rollback is a redeploy of the prior image. Per team policy this branch is not deployed to a shared environment before merge. ## Testing `make test-full` green: unit + isolated integration + 111 Emberfall E2E. ### How to verify - The OpDiv negative matrix lives at the bottom of `backend/emberfall_tests.yml`: cross-OpDiv writes return 403, scoped reads exclude out-of-OpDiv rows, and `/events` + mass email are gated to unscoped admins. - To poke it by hand, the seed adds an `OPDIV_ADMIN` (EMPIRE grant only) plus REBELLION systems 1005/1006 (token anchor `opDivAdminHeaders` in the test file). `PUT /fismasystems/1005` or `POST /scores` for 1006 returns 403; EMPIRE systems 1001-1004 succeed. - The scope logic is centralized in `backend/internal/model/users.go` (`CanManageFismaSystem`, `CanManageUser`, `EffectiveOpDivScope`) and exercised in `backend/cmd/api/internal/controller/rbac_enforcement_test.go`, so the role/scope matrix is reviewable in one place. ## Sequencing Lands before the dual-IdP branch, which rebases on top once this merges. A paired ztmf-ui PR adds the matching UI gates (hide actions and views a scoped admin would be 403'd on); the two land together so no admin loses UI in between.
…ret rotation (#337) Re-homes #335 (originally authored by @voidspooks) onto an org branch so the full CI pipeline runs, and merges current main. Fork PRs do not get the org secrets Snyk and the build/deploy steps need. ## What this fixes (ztmf#328) The API caches the RDS-managed DB secret at startup. RDS rotates it on a ~7-day schedule; afterward the cached password is stale, the next connection fails SASL auth (28P01), and the old code `log.Fatal`'d on 28P01 - so every rotation took the process down until ECS restarted it. Cameron's change: - `db/conn.go`: `Conn()` retries once on a 28P01 auth error when a secret is configured, re-fetching the secret first (`refreshDbCreds` -> `Secret.Refresh`), so a routine rotation is ridden out in place. Env-var (local) credentials are not retried since there is nothing to re-fetch. - `model/errors.go`: the 28P01 case returns `ErrDbConnection` instead of `log.Fatal`, so a genuinely-wrong password surfaces as a normal connection error rather than killing the process. - Adds unit tests for the error classification (`isAuthError`, `trapError`), including wrapped-error cases. ## Added during re-home: synchronize the cached secret Making the refresh reactive on the per-request path means a rotation can have many goroutines reading (`Value`/`Unmarshal`) and writing (`Refresh`) the cached `Secret` at once, and those fields had no synchronization (a read race already existed in `Value`'s proactive-refresh path; the reactive trigger widens it). Added a mutex on `Secret` guarding `secret`/`metadata` across the fetch RPC, so a refresh storm serializes instead of racing the pointers. `Value`/`Refresh` lock and delegate to an unexported `refreshLocked` to avoid re-locking. ## Testing - `go build`, `go vet`, and the touched-package unit tests pass under `-race`. - Error-classification regression is covered; the live refresh-and-retry flow and the concurrent-refresh path are not yet unit-tested (they need an injectable Secrets Manager / connect seam) - tracked as a follow-up under ztmf#328. ## Note A single retry rides out the modeled failure. Right at a rotation boundary a refreshed secret can still briefly mismatch the cluster, so expect the occasional single failed request at rotation time - the process stays up, which is the goal. Closes #335. --------- Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…rail (#338) Gives a fresh local dev environment realistic sample data and makes the OpDiv and audit features exercisable without pulling production data. ## What's added - **Multi-OpDiv org.** Six Imperial-branch OpDivs (Navy, Army, Starfighter Corps, ISB, Intelligence, Sienar) with 10 new systems and 12 officers scoped across them as ISSO / OPDIV_ADMIN / OPDIV_READONLY_ADMIN, so OpDiv-scoped RBAC isolation can be seen in the UI (an OWNER sees all 16 systems; a branch OPDIV_ADMIN sees only their branch). - **Three-year history.** Data calls FY2022-FY2025, with each new system scored every cycle and maturity climbing year over year (Traditional toward Advanced). - **Richer questionnaire.** Three new `datacenterenvironment`s (Ground-Assault, Surveillance-Net, Shipyard-RnD), each a full six-pillar function set with four maturity options, plus per-pillar questions. - **Audit trail.** `last_edited_at` / `last_edited_by` are derived from the events table, which raw SQL seed data bypasses, so seeded scores previously showed a blank "last edited by" in the dashboard footer. This seeds one `updated` event per new score, attributed to the system's assigned officer, so the audit display and features work on a fresh local DB. ## Tests - `TestFindScoresResolvesSeededAuditFieldsIntegration` covers the read of audit fields from a pre-existing (seed) event. The existing audit tests all create the event in-test, so the historical read path that the seed and the UI footer depend on was untested. - `make test-e2e` passes 111/111 and `make test-integration` passes. All data uses new IDs and new OpDivs only; existing systems (1001-1006), data calls (1-3), and scores (9001-9030) are untouched, so the OpDiv-scoped and aggregate E2E assertions are unaffected. ## Note The score notes carry some lightly buried thematic flavor; they remain plausible as assessment text and contain no real data.
…led (#341) ## Summary Adds the backend and infrastructure for dual-IdP authentication: HHS OpDiv users authenticate through Microsoft Entra ID alongside the existing CMS Okta. The entire feature ships **dormant behind the `entra_enabled` Terraform flag (default `false` in dev and prod)**, so merging and deploying this changes nothing about today's Okta-only behavior. It only takes effect after an operator loads the Entra secret and flips the flag, per the runbook below. Implements the backend/infra slice of the dual-IdP epic (CMS-Enterprise/ztmf-misc#170). The frontend landing page is tracked separately (CMS-Enterprise/ztmf-ui#328, #329). ## What this delivers - **Multi-issuer token validation.** The API validates tokens from both issuers: Okta (ES256) and Entra (RS256 via JWKS), with the Entra tenant pinned and a per-IdP issuer allowlist. Local dev and the E2E suite continue to use HS256. - **Pre-auth IdP lookup.** `GET /api/v1/auth/lookup?email=` (unauthenticated, rate-limited) returns `{ "data": { "idp": "okta" | "entra" | null } }` so the browser can route each user to the correct login path. Unknown and unprovisioned emails return the same shape (no enumeration), and the email is not logged. - **Application session layer.** After OIDC login the backend mints its own signed session cookie and gates `/api/*` by that session or an IdP bearer, with an origin (CSRF) check on state-changing requests. - **Per-IdP ALB listener rules.** A new `/login/entra*` rule with a distinct session cookie name, a forward-only rule for the lookup endpoint, and the `/api/*` switch from ALB OIDC to backend-gated. The existing Okta `/login*` and `/api/*` rules are unchanged while the flag is off. - **Token-validation hardening** (from a security review of this work): - Optional per-IdP audience pinning, enforced only when configured, so a validly signed token minted for a different application in the same issuer or tenant is rejected. The Entra audience is wired to the ZTMF app client id; the Okta audience is intentionally left unset for now since Okta traffic stays ALB-gated until cutover. - HS256 tokens are rejected on the IdP path outside local/test, so a deployed environment is asymmetric-only (ES256/RS256) regardless of environment configuration. ## Why this is safe to merge now With `entra_enabled = false` (the committed value in both dev and prod): - Every new ALB rule, the lookup rule, and the `/api/*` switch are `count = 0` and are not created. The existing Okta `/login*` and Okta-OIDC `/api/*` rules stay exactly as they are today. - No new container environment or secrets are injected; the data sources that read the Entra and session secrets are gated off. - The new middleware is additive: with no session cookie present, requests fall through to the existing IdP-bearer path (the ALB-injected Okta token), resolved by email as today. - The new audience and issuer checks are inert when unconfigured, and the HS256 restriction does not touch the Okta ES256 path. Net: the live authentication path stays 100% CMS/Okta-via-ALB. Rollback is flipping the flag back to `false`. ## What must happen after merge, before Entra works (per environment) Dev first (the validation environment), then prod before the data-call cutover. Dev and prod are separate AWS accounts with separate Secrets Manager, so each step runs in each account: 1. **Deploy this branch (flag stays `false`).** Terraform creates two empty secret containers, `ztmf_entra_oidc` and `ztmf_session_signing_key`. Everything else stays inert. 2. **Load the Entra secret.** Run `scripts/bootstrap-entra-secrets.sh` against the account to populate `ztmf_entra_oidc` (tenant id, issuer, JWKS URI, client id, client secret) and generate the session signing key. Safe to do while the flag is off; nothing reads these values until the flip. 3. **Confirm HHS-side prerequisites** for that environment: the redirect URI is registered on the Entra app, and the Conditional Access allowlist is in place for the pilot identities. 4. **Flip `entra_enabled = true` and apply.** The data sources now read the populated secret and the ALB rules come up. Flipping before step 2 fails the apply, by design. The frontend (ztmf-ui#328/#329) must ship its matching email landing page, gated by its own flag, before end-to-end login through Entra is usable. ## Test plan - [x] `make test-e2e` (isolated ephemeral DB, Emberfall): 117 ran, 0 failed - [x] Auth package unit tests pass, including the multi-issuer/tenant/audience matrix and the HS256 environment gate - [x] `go build ./...` and `go vet` clean - [x] `terraform fmt` clean on changed files - [x] Confirmed the Okta-only path is unchanged with `entra_enabled = false` Closes #265 Closes #334 Refs #264 (ALB listener rules ship dormant; activated and validated at cutover) Refs CMS-Enterprise/ztmf-misc#170 Refs CMS-Enterprise/ztmf-ui#328, CMS-Enterprise/ztmf-ui#329
) ## Summary Fixes a database connection leak that exhausted the Aurora connection limit in prod today. db.Conn() opens a dedicated pgx connection on every call (there is no shared pool), and the central model helpers plus a few raw call sites never closed the connection, so every query leaked one. Two active users were enough to pile up around 197 open connections, peg CPU, and time out requests. In the UI that showed as intermittently missing or slow data, like the OpDiv column blanking on first load. ## Root cause The leak has been in the data layer since the open-per-call, no-close pattern was first written (mid-2024). It stayed hidden because Aurora reaped idle connections faster than they leaked. The OpDiv RBAC work added more DB round trips per authenticated request (auth middleware, then the user lookup, then the OpDiv grant queries), which pushed the leak rate past the reap rate, so connections piled up until the cluster hit its limit. The recent multi-OpDiv change triggered it but was not the root. ## Fix Added a deferred connection close at every runtime site that opens one: - model.go query() and queryRow(), which back most request-path queries - scores.go copyPreviousScores, an Exec site that leaked - fismasystems.go and users.go transaction paths, which rolled back the tx but never closed the connection. The close is declared before the tx rollback defer so it runs after the transaction resolves (defers are LIFO) - populate.go, the local seed path The tern migrator connection in migrations.go is left open on purpose. It is a startup-only connection the migrator owns for the life of the process. ## Not included A connection pool (pgxpool) is the real fix for the per-query connect cost, which is the remaining slowness and the high DB CPU, and it will matter at HHS scale. That change also touches the credential rotation path, so it is a separate follow-up. This PR only stops the leak. ## Testing - make test-integration (isolated DB) passes - make test-e2e (Emberfall) passes - go build, staticcheck, and go vet are clean - Confirmed every runtime connection site closes once, the tx ordering is correct, there is no double close or use after close, and closing on an already-canceled context still releases the socket in our pinned pgx version ## Context Replaces the earlier partial fix that only closed connections in two scores functions (#322, refs #321). This closes the leak across the whole data layer.
…1) (#344) ## Summary The users table was slow to load its OpDiv column. `FindUsers` (the `GET /api/v1/users` list query) omitted the user's OpDiv and FISMA grants, so the frontend backfilled them by calling `/users/{id}` once per user. With ~600 users that is hundreds of round-trips trickling through the browser's per-host connection limit, which is the delay. The list query itself is a few milliseconds. This loads the grants inline on the list, using the same correlated subqueries `findUser` already uses for the detail endpoint, so the table renders the OpDiv column straight from the list response with no per-row fetch. ## Why this is cheap Measured against real prod-scale data locally (595 non-deleted users): the list query with both subqueries runs in a few milliseconds. Each subquery is a composite-PK index-only scan on the junction table (one row group per user), so it does not add a meaningful cost and there is no N*M join blowup. The endpoint returns in about 11ms. ## Coordination The frontend (ztmf-ui) drops its per-user `fetchUserOpDivs` bulk loop and reads `assignedopdivids` from each list row, mapping ids to codes with the opdiv map it already builds once. The frontend change is written to fall back gracefully if the field is absent, so the two can deploy in either order. ## Scope This fixes the page-load read N+1 only. The OpDivGrantModal per-item write loop is lower severity (user-triggered write, small N, now pool-bounded) and is tracked separately; it needs a backend batch-grant endpoint. ## Validation - Verified against real prod-scale local data: list returns `assignedopdivids` for all 595 users, endpoint ~11ms - `make test-integration` and `make test-e2e` pass - `go build`, staticcheck, `go vet` clean - Emberfall `/users` list assertions are status/header only, so the additive fields do not break them Closes #336
…under normal use) (#343) ## Summary The data layer had no connection pool. db.Conn() opened a brand-new connection (TCP, TLS, SCRAM auth) on every query. Once the OpDiv RBAC and dual-IdP work made every authenticated request run a DB user lookup, normal use opened a burst of concurrent connections that hit the Aurora connection limit. New connections then died mid-handshake (failed SASL auth: context canceled), which showed in the UI as the OpDiv column going empty and edits failing to save. One active user was enough to trigger it, so this is a design problem, not load. This replaces the per-call connection model with a shared, bounded pgxpool. ## What changed - db.Conn acquires a connection from a shared pool and returns it; callers release it back to the pool when done instead of closing it. The pool caps connections at 16, well under the cluster limit, and reuses them, so a burst of concurrent requests no longer opens a connection each. - Credentials are supplied per new connection, so the 7-day RDS secret rotation is picked up as connections recycle (MaxConnLifetime 50m). An auth failure on acquire refreshes the secret and retries once, preserving the rotation handling from the earlier credential-refresh fix. - The tern migrator keeps a dedicated single connection (MigrationConn), since it needs a plain connection and is not pool-managed. - Pool is built once with a background context so a canceled request cannot poison the one-time init. ## Relationship to the earlier fix The previous change (#342) added a close to every connection site to stop a slow leak. That was necessary but did not stop the concurrent burst, because closing a connection after use does nothing about many connections being opened at the same time. The pool is what actually bounds and reuses connections. With the pool, connections are released back rather than closed, which is the correct pattern. ## Validation - make test-integration (isolated DB) passes - make test-e2e (Emberfall) passes - go build, staticcheck, and go vet are clean - Reviewed for pgx pool correctness: every acquire has a matching release on all paths including the transaction error paths, the transaction sites release exactly once, rows are fully read before release, MaxConns is a hard cap, and the rotation path recovers across the 7-day cycle ## Sizing notes MaxConns 16 against the observed cluster connection limit leaves headroom for the migrator connection, the ops task, and pgAdmin sessions. MinConns 2 keeps a warm pair so the first requests after idle do not each pay a full handshake. These are conservative for a low-traffic app and can be tuned later if needed.
Flips `entra_enabled = true` for dev, activating the dormant dual-IdP infra from #341: the per-IdP ALB listener rules (`/api/v1/auth/lookup` forward, `/login/entra` Entra OIDC, `/login` Okta), `/api/*` moved off ALB OIDC to backend session validation, and the Entra + session env injected into the API task. Okta login is unchanged. This is the in-repo version of #347 by @voidspooks, moved onto a CMS-Enterprise branch so the infrastructure deploy (terraform apply, which needs the org OIDC role) runs; that does not run on a fork PR. The single commit and authorship are unchanged. ## Why now The frontend landing page (ztmf-ui#404) is deployed to dev and calls `GET /api/v1/auth/lookup` on submit, but that ALB rule only exists when `entra_enabled = true`. With the flag still false, the lookup falls through to the OIDC-gated `/api/*` rule, fails for a logged-out visitor, and the page shows "We can't determine an identity provider" for everyone, including Okta users. This flip routes the lookup and unblocks dev testing. ## Scope - dev only. `infrastructure/tfvars/dev.tfvars`, one line plus a comment. prod (`prod.tfvars`) stays `false`. - Preconditions met: `ztmf_entra_oidc` and `ztmf_session_signing_key` are seeded and verified in the dev account. ## Deploy behavior Once this is non-draft, orchestration-dev applies it to dev. It only persists across later dev applies once merged to main, so merge after the dev login is validated end to end. Refs #347 Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…349) First slice of #317, scoped to the backend workflow. GitHub is forcing Actions runners off Node.js 20, so every run currently emits the deprecation warning. This moves the backend workflow's Node-based actions to their latest Node.js 24 releases, which clears the warnings, and SHA-pins the rest as a security hardening pass while we are in the file. ## Node.js 24 bumps (the actual warning fix) These five were running on Node 20; each is now on its latest Node 24 release: - `actions/checkout` v4 -> v6.0.3 - `docker/setup-buildx-action` v3 -> v4.1.0 - `hoverkraft-tech/compose-action` v2.0.2 -> v3.0.0 - `aws-actions/configure-aws-credentials` v4 -> v6.2.0 - `actions/setup-go` v5 -> v6.4.0 `orchestration-dev.yml` also had `actions/checkout@v4` in the change-detection job; bumped to v6.0.3 so that job stops warning too. ## SHA pinning (security bonus) Every third-party action in `backend.yml` is now pinned to a commit SHA with a `# vX.Y.Z` comment. This drops the two floating refs that were the worst case: - `snyk/actions/docker@master` -> pinned SHA (no tagged releases upstream) - `aquia-inc/emberfall@main` -> pinned SHA (composite action; the emberfall binary version still comes from the `version` input) ## Also - `actions/setup-go` now sets `cache-dependency-path: backend/go.sum`, so the dependency cache restores instead of missing on every run (go.sum lives under `backend/`, not repo root). - `orchestration-dev.yml` now runs the backend smoke test when `backend.yml` itself changes, not only when `backend/**` code changes, so workflow edits like this one are exercised by CI instead of shipping unvalidated. That is what lets this PR test its own bumps. - Added `.github/dependabot.yml` with a grouped `github-actions` entry so the SHA pins receive automated update PRs and do not rot. ## Scope Backend workflow only, as the pilot. The remaining workflows in #317 (`analysis.yml`, `infrastructure.yml`, `orchestration-impl.yml`, `ops-image.yml`) are follow-ups once this pattern is proven green. ## Test plan - [ ] Backend smoke-test job runs on this PR (via the new workflow-change trigger) and passes - [ ] No Node.js 20 deprecation warning in the backend job logs - [ ] `compose-action` v3 still starts the API + Postgres e2e stack; emberfall smoke tests pass - [ ] `configure-aws-credentials` v6 still assumes the role and the image pushes to ECR - [ ] setup-go cache restores from `backend/go.sum` (no cache-miss warning) Refs #317
…/swag (#348) Auto-generates `backend/openapi.yaml` from Go annotations using swaggo/swag, so the spec stays in lockstep with the routes instead of being hand-maintained and drifting. A CI drift check fails the build if the committed spec does not match a fresh generation, and `redocly lint` enforces that every route declares its security (the one intentional public route is acknowledged explicitly). This is the in-repo version of #340 by @voidspooks, moved onto a CMS-Enterprise branch so the Snyk and other org-secret CI gates run (they do not run on fork PRs). The commits and authorship are unchanged from his branch, which is already rebased on current main and addresses the full review. ## What it does - swaggo/swag annotations on every handler are the source of truth; `make generate-openapi` produces `openapi.yaml`. - CI regenerates and diffs the spec; any drift fails the build. - `redocly lint` (pinned `redocly/cli:2.32.2`, matching local) runs in CI with `security-defined` as a hard error, so a new handler that forgets `@Security` fails. The single public route (`GET /api/v1/auth/lookup`) is the only entry in `backend/.redocly.lint-ignore.yaml`. ## Review items addressed (all on this branch) - Rebased past the OpDiv (#333) and dual-IdP (#341) merges; only generated/lock-file conflicts, resolved cleanly. - `GET /api/v1/auth/lookup` annotated with no `@Security`, so it renders as the one unauthenticated route. - `SaveFunction` / `SaveOpDiv` update-path id set to optional to match the `SaveDataCall` / `SaveUser` shared-handler pattern. - `GetEvents` query params corrected to what the schema decoder actually binds (`action`, `resource`, and nested `payload.*`). ## Test plan - [x] `make generate-openapi` is deterministic (regenerates byte-identical; drift check green) - [x] `go build ./...` and `go test -short ./...` pass - [x] `redocly lint` clean (0 errors) - [x] All 48 routes carry matching `@Router` annotations; the public route carries no `@Security` Closes #260 Refs #340 --------- Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…gin (#351) Fixes the Entra login path being unreachable from the public CloudFront URL. ## Problem CloudFront routes only the exact path `/login` to the ALB origin (`cloudfront.tf`); there is no `/login/*` behavior. So `/login/entra` does not match, falls through to the S3 default origin, and S3 returns an `AccessDenied` XML page. Entra login is unreachable for everyone, not specific to any user or email. Okta is unaffected because it lives at exactly `/login`. The dual-IdP work added the `/login/entra*` rule on the ALB but never added the matching CloudFront behavior. The two layers are owned separately (the CloudFront login behavior was set up under #166 as an exact match), and the end-to-end "user lands at /login/entra" check was deferred to runtime, so the gap only surfaced once dev was flipped to `entra_enabled = true` and Entra login was first exercised. ## Fix Add a `/login/*` ordered cache behavior pointing at the `ztmf_api` (ALB) origin, mirroring the existing `/login` behavior (same forwarding, no caching). It is gated with `dynamic ... for_each = var.entra_enabled ? [1] : []` so it exists only when Entra is enabled, matching the per-IdP ALB listener rules that are created under the same flag. The exact `/login` behavior is left untouched, so Okta keeps working exactly as before. On dev (`entra_enabled = true`) this creates the behavior and `/login/entra` reaches the ALB. On prod (`entra_enabled = false`) nothing changes. ## Validation - `terraform fmt` clean, `terraform validate` passes, `tflint` clean. - After this applies to dev: `/login/entra` routes to the ALB (Entra OIDC) instead of returning S3 `AccessDenied`; `/login` (Okta) unchanged. Refs #264
Hardens JWT verification key fetching against an unsanitized-input flaw
in the Okta path.
## Problem
`oktaKey()` took the `kid` from the (necessarily unverified) token
header and concatenated it directly into the key-fetch URL, then fetched
it with `http.DefaultClient` (no timeout, follows redirects) and used
the returned PEM as the verification key. Because `kid` selects the key,
it is read before the signature is checked - so it is untrusted input
being used to shape an outbound request and choose a trust anchor. The
Entra (RS256) path was already safe (fixed JWKS URL, `kid` as a map
key); this brings the Okta path to the same standard.
## Change
- Validate `kid` against `^[A-Za-z0-9_-]{1,200}$` before any cache,
config, or network access; reject otherwise. The allowlist (URL-safe
characters only, matching real Okta/ALB key ids) is what makes the
existing `TokenKeyUrl + kid` concatenation safe - no path traversal,
host/userinfo injection, scheme, or encoded tricks can reshape the
request.
- Route both the Okta per-`kid` fetch and the Entra JWKS fetch through a
dedicated `http.Client` with a 10s timeout that refuses to follow
redirects, so a slow/hung endpoint cannot pin a request goroutine and a
3xx cannot bounce the fetch to an unexpected host.
- Capture the previously-ignored `http.NewRequest` error.
Behavior is unchanged for legitimate tokens: a valid `kid` still fetches
exactly `TokenKeyUrl + kid` and parses the ECDSA key.
## Test plan
- [x] `make test-full` (unit + Emberfall E2E) passes 117/0
- [x] New unit tests: `kid` allowlist (valid vs
traversal/host-injection/scheme/oversized), rejection before any
outbound fetch, and a positive path (valid `kid` -> correct URL -> ECDSA
parse)
- [x] `go build`, `go vet`, `gofmt` clean
- [x] Independent review
Refs CMS-Enterprise/ztmf-misc#240
## Summary
Gates the existing system enrichment read endpoint (`GET
/api/v1/systemenrichment/{fisma_uuid}`) on a new per-OpDiv capability
flag, so ZTMF Insights enrichment is served only for OpDivs that have an
insights pipeline feeding them. Today that is CMS only; enabling another
OpDiv later is a single database update, no code change.
## Why
Enrichment data (system metadata from the ZTMF Insights pipeline)
currently exists only for CMS, but the read endpoint returned a payload
for any system that had a row, with no notion of which OpDivs are
entitled to the feature. This adds that boundary and makes it
data-driven so the feature can extend to other OpDivs as an add-on.
## Changes
- **Migration 0042**: adds `opdivs.insights_enabled` (boolean, default
false), backfills it true for the active CMS OpDiv. Down migration drops
the column.
- **Model**: `OpDiv.InsightsEnabled *bool` (pointer so an update that
omits it does not reset it), conditional set on save.
`FindSystemEnrichment` returns a row only when the system's `fismauid`
maps to an OpDiv with `insights_enabled = TRUE` (an `EXISTS` predicate,
which avoids fanning out on the non-unique `fismauid`).
- **Controller**: resolves the system first, then authorizes against its
OpDiv (`CanAccessFismaSystem`), replacing the prior per-system-only
check. A system in a non-enabled OpDiv returns 404, identical to a
system with no enrichment row, so the gate never reveals which systems
exist.
- **Tests**: model integration test for the gate, plus an Emberfall E2E
case proving a system with an enrichment row in a disabled OpDiv returns
404.
- **OpenAPI** regenerated from annotations.
## Behavior and compatibility
- Backwards compatible. Additive schema change; deploys to dev on PR,
prod on merge.
- Enrichment is only hidden for systems in OpDivs that are not
insights-enabled. The pipeline only feeds CMS, and the migration enables
CMS, so existing enrichment continues to be served unchanged.
- Minor authorization tightening: OpDiv-scoped admins now read
enrichment only for systems in their granted OpDivs (previously any
admin tier could read any system's enrichment). Unscoped admins (OWNER,
HHS) and assigned ISSOs are unaffected.
## Testing
Full local suite passes: unit, integration (including the new gate
test), and E2E (Emberfall). Validated end to end against real enrichment
data in the impl environment.
## Issues
Implements CMS-Enterprise/ztmf-misc#85 (API endpoints for SDL enrichment
data). The read surface landed in #211; this PR adds the
OpDiv-conditional gate that completes the consumption API. The issue's
original per-function response shape is superseded by the generic jsonb
payload (the pipeline owns structure). It will be closed on merge
(cross-repo keywords do not auto-close from this repository).
## Overview This is the in-repo version of #355, originally submitted by @voidspooks (cameron testerman) from a fork. The branch has been pushed to this repo unchanged so that org-secret CI gates — Snyk and the Infrastructure/Deploy job — can run. All commits are authored by @voidspooks; authorship is fully preserved. Refs #355 Closes #133 --- ## What this PR does Adds a new read-only API endpoint: ``` GET /api/v1/scores/diff?from={datacallid}&to={datacallid}[&fismasystemid={id}] ``` This endpoint compares the questionnaire answers between two data calls and returns **only the functions whose answer changed**, with attribution for who made the change and when. It answers the "show me everything that changed during this period" ask from #133 — the two data call IDs bound the time window, and the response is the delta plus audit info. --- ## Response shape One object per changed function: ```jsonc { "fismasystemid": 1001, "functionid": 7007, "function": "Battle Station Device Management", "question": "Does your system track all assets with comprehensive inventory?", "from": { "scoreid": 9001, "functionoptionid": 25, "optionname": "Defined", "score": 2, "notes": "..." }, "to": { "scoreid": 9025, "functionoptionid": 26, "optionname": "Managed", "score": 3, "notes": "..." }, "changed_at": "2026-06-18T16:20:18Z", "changed_by": { "userid": "...", "name": "Grand Moff Tarkin", "email": "...", "role": "OWNER" } } ``` - `from` / `to` are the answer in each cycle. Either is `null` when a function was answered in only one cycle (i.e., it was added or removed between data calls). - `changed_by` / `changed_at` attribute the write that produced the `to` answer, resolved from the `events` audit trail. Absent when no write event exists for the `to` score (seed data, or a function answered only in the `from` cycle). --- ## What counts as a change A function is included in the diff when: - The selected `functionoption` changed, **or** - The notes changed, **or** - The function was answered in exactly one of the two cycles (added or removed) `nil` and empty string notes compare as equal, matching the no-op normalization already used on the write path. Functions with no change are dropped from the response. --- ## Scope and access control Tier scoping mirrors `ListScores` and is enforced after query decode so a caller cannot widen their own scope: | Role | Sees | |---|---| | Unscoped admin | All systems | | OpDiv-scoped admin | Their OpDiv's systems (fail-closed) | | ISSO / ISSM | Their assigned systems only | `from` and `to` are both required and must differ — the endpoint returns `400` otherwise. --- ## Implementation notes - **New file:** `internal/model/scorediff.go` — contains `FindScoreDiff` and `buildScoreDiffSQL`. Uses parameterized raw SQL via the existing `rawQuery` read-path helper. A FULL OUTER JOIN of two cycle CTEs on `(system, function)`, plus catalog joins for labels and an events lateral for attribution, was beyond squirrel's ergonomics — this follows the precedent set by `buildPillarScoresSQL`. SELECT-only; never touches the write path or event recording. - **New controller:** `GetScoresDiff` handler wired to `/api/v1/scores/diff`. - **OpenAPI:** `openapi.yaml` regenerated via swag; redocly lint clean (pre-existing warnings only, no new issues). --- ## Testing - **Unit tests:** input validation; SQL builder filter and scope shaping; arg ordering across single-system, ISSO user scope, and OpDiv grants with fail-closed behavior. - **Integration tests (real Postgres, empire seed):** changed answer included; unchanged answer dropped; one-sided entry via the outer join; `to`-side attribution resolved correctly. - **Local smoke tests:** run against a freshly wiped and reseeded dev DB. `go build/vet/test ./...` all green (full suite including integration). Live HTTP tests covered: 400 validation cases, unchanged → empty response, a rich non-empty diff, and an edit-then-diff that surfaced the change attributed to the editor. --------- Co-authored-by: Cameron Testerman <11036339+voidspooks@users.noreply.github.com>
…60623 # Conflicts: # Makefile # backend/cmd/api/internal/controller/scores.go # backend/internal/model/fismasystems.go # backend/openapi.yaml
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Routine refresh of the
implbranch frommainso the experimental environment does not drift from production code. Merge (not rebase), per the impl-sync policy, becauseimplis a shared, deployed branch.What main brought in
The bulk of main's recent history (OpDiv-scoped RBAC, dual-IdP scaffolding, pgxpool, OpenAPI autogeneration, the N+1 users-list fix) was already present on impl from the previous sync, so the net new content here is small:
kidvalidation and hardened key-fetch client (fix(auth): validate Okta JWT kid and harden key-fetch client #354)/login/*through CloudFront to the ALB for Entra login (fix(infra): route /login/* through CloudFront to the ALB for Entra login #351)Conflict resolution
ZTMF_ENVdetection, parametrizedTEST_DB_PORT). Reverting to main's hardcoded ports would break side-by-side worktree runs.FindFismaSystemByUUIDadds an explicitLIMIT 1; scores-diff is additive). OpenAPI regenerated from annotations afterward, no drift.Environment safety
entra_enabledstays at its default of false. Main's chore(infra): enable Entra dual-IdP in dev (entra_enabled = true) #350 only enabled it in dev.Testing
Full local suite on the impl port set (unit, integration, E2E/Emberfall).