
Add db migrate job for Gateway upgrades #99

Open
jchanbcbc wants to merge 11 commits into develop from feature/add_db_migrate_job

@jchanbcbc jchanbcbc commented Jun 30, 2026

Database Migration Job Support for Layer7 Gateway Operator

Overview

This PR introduces a pre-upgrade database migration job for the Layer7 Gateway Operator. It allows Liquibase schema migrations to run as a dedicated Kubernetes Job before Gateway pods start during an upgrade, reducing downtime risk and preventing schema conflicts. This feature requires Gateway 11.2.2 or higher.


What's New

Gateway Entrypoint — Schema Update Modes

The Gateway container (entrypoint.sh) now supports four explicit schema update modes via EXTRA_JAVA_ARGS:

| Mode | Description |
| --- | --- |
| default | Gateway runs Liquibase migrations on startup (existing behavior) |
| skip | Gateway skips schema update entirely; assumes migrations were run externally |
| liquibase-only | Run Liquibase migrations and exit (used by the migration Job) |
| liquibase-only-with-unlock | Release the Liquibase DATABASECHANGELOGLOCK before migrating, then exit |

Operator — Migration Job (spec.app.management.database.migrationJob)

A new opt-in migrationJob section in the Gateway CR spec:

management:
  database:
    migrationJob:
      enabled: true                    # opt-in; default false
      activeDeadlineSeconds: 300       # job timeout; default 5 minutes
      clearLocks: false                # set true to release stale Liquibase locks before migrating
      jdbcUrl: ""                      # optional override to bypass DB proxies; defaults to database.jdbcUrl

Job lifecycle and sequencing:

  • When migrationJob.enabled: true, the operator creates a Kubernetes Job that runs Liquibase in liquibase-only mode (requires Gateway 11.2.2 or higher) before the Gateway Deployment is allowed to proceed.
  • The Gateway Deployment is blocked until the migration job succeeds. If the job fails, an error is logged with instructions and the Gateway remains in its current state; pods are never rolled out against an unmigrated schema.
  • If the job's spec changes between kubectl apply calls (e.g. a corrected jdbcUrl, a new image, or a changed clearLocks flag), the operator automatically deletes and recreates the job rather than getting stuck on a stale run.
  • Failed jobs require manual intervention: kubectl delete job <name> -n <namespace> to retry.

Automatic skip mode for Gateway pods:

  • When migrationJob.enabled: true, the operator automatically injects -Dgateway.db.schema-update.mode=skip into EXTRA_JAVA_ARGS for the main Gateway pods. This prevents them from attempting their own Liquibase run, since the Job already handles it. Users do not need to set this manually.
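
The injection described above could look roughly like the following sketch. The helper name withSkipMode is hypothetical, and two behaviors are assumptions rather than anything this PR states: EXTRA_JAVA_ARGS is treated as a whitespace-separated list, and any mode flag the user already set is replaced instead of duplicated.

```go
package main

import (
	"fmt"
	"strings"
)

// schemaUpdateModeProp is the JVM system property named in this PR.
const schemaUpdateModeProp = "-Dgateway.db.schema-update.mode="

// withSkipMode is a hypothetical helper. It returns extraJavaArgs with the
// schema update mode set to "skip". Any mode flag already present is dropped
// first (an assumption), so the result carries exactly one mode setting.
func withSkipMode(extraJavaArgs string) string {
	kept := make([]string, 0)
	for _, arg := range strings.Fields(extraJavaArgs) {
		if strings.HasPrefix(arg, schemaUpdateModeProp) {
			continue // drop the existing mode flag; "skip" is appended below
		}
		kept = append(kept, arg)
	}
	kept = append(kept, schemaUpdateModeProp+"skip")
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(withSkipMode("-Xmx2g -Dgateway.db.schema-update.mode=default"))
}
```

Because the function removes an existing mode flag before appending, calling it again on its own output leaves the string unchanged, which suits a reconcile loop that runs repeatedly.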

JDBC URL handling:

  • If migrationJob.jdbcUrl is explicitly set, the operator appends ?createDatabaseIfNotExist=false (using ? or & correctly based on whether query params already exist). This ensures that a misconfigured or mistyped URL fails fast rather than silently creating an unintended database.
  • If migrationJob.jdbcUrl is not set, the main database.jdbcUrl from the ConfigMap is used as-is, preserving the existing createDatabaseIfNotExist=true behavior for fresh installs.
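
As a minimal sketch of the URL rules above: the function names appendNoCreate and effectiveJDBCURL are hypothetical, and the handling of a URL that already ends in ? or & is an assumption; the actual logic lives in pkg/gateway/migration_job.go and may differ.

```go
package main

import (
	"fmt"
	"strings"
)

const noCreateParam = "createDatabaseIfNotExist=false"

// appendNoCreate is a hypothetical helper that appends the no-create
// parameter, choosing "?" or "&" depending on whether the URL already
// carries query parameters.
func appendNoCreate(jdbcURL string) string {
	sep := "?"
	switch {
	case strings.HasSuffix(jdbcURL, "?"), strings.HasSuffix(jdbcURL, "&"):
		sep = "" // a separator is already in place (assumed edge case)
	case strings.Contains(jdbcURL, "?"):
		sep = "&"
	}
	return jdbcURL + sep + noCreateParam
}

// effectiveJDBCURL returns the URL the migration Job would use: the
// override with the no-create parameter if set, else the main URL as-is.
func effectiveJDBCURL(mainURL, override string) string {
	if override != "" {
		return appendNoCreate(override)
	}
	return mainURL
}

func main() {
	mainURL := "jdbc:mysql://proxy:3306/ssg?createDatabaseIfNotExist=true"
	fmt.Println(effectiveJDBCURL(mainURL, ""))
	fmt.Println(effectiveJDBCURL(mainURL, "jdbc:mysql://db:3306/ssg"))
	fmt.Println(effectiveJDBCURL(mainURL, "jdbc:mysql://db:3306/ssg?useSSL=true"))
}
```

Plain string handling is used here instead of net/url because JDBC URLs carry a jdbc: prefix in front of the driver scheme; that is a design choice of this sketch, not something the PR specifies.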

Diskless vs non-diskless config parity:

  • In diskless mode (default): the Secret is exposed as env vars; migrationJob.jdbcUrl overrides the main JDBC URL via Kubernetes env precedence.
  • In non-diskless mode (disklessConfig.disabled: true): the Secret is mounted as a node.properties file (matching the main Gateway deployment behavior); migrationJob.jdbcUrl is not used — node.properties wins, consistent with Helm behavior.

clearLocks field:

  • When clearLocks: true, the job runs in liquibase-only-with-unlock mode, which releases any stale DATABASECHANGELOGLOCK before applying migrations. This is useful for recovering from a previously aborted migration. Changing this field triggers automatic job replacement; changing it with kubectl patch --type=merge is recommended.
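
The mapping from clearLocks to the Job's mode can be sketched as below. The function names are hypothetical, and it is an assumption that the Job receives its mode through the same -Dgateway.db.schema-update.mode property that the operator injects for skip mode.

```go
package main

import "fmt"

// migrationMode picks the entrypoint mode for the migration Job.
func migrationMode(clearLocks bool) string {
	if clearLocks {
		return "liquibase-only-with-unlock"
	}
	return "liquibase-only"
}

// migrationJavaArg builds the JVM argument for the Job container.
// Assumption: the Job uses the same system property as skip mode.
func migrationJavaArg(clearLocks bool) string {
	return "-Dgateway.db.schema-update.mode=" + migrationMode(clearLocks)
}

func main() {
	fmt.Println(migrationJavaArg(false))
	fmt.Println(migrationJavaArg(true))
}
```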

Optional liquibaseLogLevel:

  • spec.app.management.database.liquibaseLogLevel sets the LIQUIBASE_LOG_LEVEL env var on both the migration job and Gateway pods for debugging.

Files Changed

| Component | File | Change |
| --- | --- | --- |
| Gateway container | entrypoint.sh | 4-mode schema update logic |
| Helm chart | values.yaml, db-migration-job.yaml, README.md, release-notes.md | Migration job template and documentation |
| Operator CRD | api/v1/gateway_types.go | MigrationJob.ClearLocks, Database.LiquibaseLogLevel |
| Operator job builder | pkg/gateway/migration_job.go | Job spec with diskless/non-diskless handling, URL normalization |
| Operator reconciler | pkg/gateway/reconcile/migration_job.go | Job lifecycle, deployment blocking, spec-change detection |
| Operator ConfigMap | pkg/gateway/configmap.go | Auto-inject skip mode, LIQUIBASE_LOG_LEVEL |
| Operator controller | internal/controller/gateway/controller.go | Owns(&batchv1.Job{}) to watch Job status changes |

jchanbcbc added 2 commits June 29, 2026 22:25
Introduces an opt-in Kubernetes Job that runs Liquibase schema migrations
before Gateway pods roll out during an upgrade. Gateway pods are blocked
until the job succeeds, preventing pods from starting against an unmigrated
schema.
Key behaviors:
- Works with Gateway 11.2.2 and later, which provide four schema update modes in entrypoint.sh: default, skip,
  liquibase-only, liquibase-only-with-unlock
- Migration job enabled via spec.app.management.database.migrationJob.enabled
- clearLocks field releases stale Liquibase locks before migrating
- Gateway pods automatically use skip mode when migration job is enabled
- Job is auto-replaced when spec changes (image, jdbcUrl, clearLocks)
- Failed jobs block the deployment and require manual deletion to retry
- jdbcUrl override only applies in diskless mode; node.properties wins
  in non-diskless mode, consistent with Helm behavior
@jchanbcbc jchanbcbc self-assigned this Jun 30, 2026
@jchanbcbc jchanbcbc changed the title Feature/add db migrate job Add db migrate job for Gateway upgrades Jun 30, 2026
jchanbcbc added 7 commits June 30, 2026 16:13
Gateway.Status.MigrationStatus. This addresses an issue where the previous
implementation would re-run migrations on every 12-hour reconcile if the
completed Job had been deleted.
Changes:
- Add MigrationStatus struct (SpecHash, Complete) to GatewayStatus in
  gateway_types.go. Once Complete is true, GatewayMigrationJob returns
  immediately on all future reconciles regardless of Job existence.
- Add DeepCopyInto/DeepCopy for MigrationStatus in zz_generated.deepcopy.go.
- Update CRD schema (security.brcmlabs.com_gateways.yaml) to include the
  new migrationStatus status fields.
- Rewrite GatewayMigrationJob in reconcile/migration_job.go:
  - Compute a 16-char spec hash from image, effective jdbcUrl, clearLocks,
    and activeDeadlineSeconds. A hash change (e.g. image upgrade) resets
    status and triggers a fresh migration automatically.
  - On first enable: write hash to status, create Job, wait.
  - On Job success: write Complete=true to status, unblock Deployment.
  - On Job failure (both pod attempts exhausted): log error with exact
    kubectl delete command, block Deployment. User deletes Job to retry.
  - On disabled: clean up any orphaned Job.
  - Remove migrationJobSpecChanged — spec change detection is now fully
    handled by the hash comparison.
Recovery after failure: fix the root cause (run kubectl patch with clearLocks: true, or
restore the DB to its pre-upgrade state if it was partially migrated), then:
  kubectl delete job <name>-db-migration -n <namespace>
The operator automatically creates a new Job. No kubectl apply needed.
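
One way the 16-char spec hash described in this commit could be computed is sketched below. The commit names the inputs and the length but not the algorithm, so the SHA-256 digest, hex encoding, NUL separator, and the migrationInputs type are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// migrationInputs is a hypothetical holder for the four hashed fields.
type migrationInputs struct {
	Image                 string
	JDBCURL               string // effective URL, after any override
	ClearLocks            bool
	ActiveDeadlineSeconds int64
}

// specHash returns the first 16 hex characters of a SHA-256 digest over
// the inputs. Fields are joined with a NUL byte so that adjacent values
// cannot run together and collide.
func specHash(in migrationInputs) string {
	parts := []string{
		in.Image,
		in.JDBCURL,
		strconv.FormatBool(in.ClearLocks),
		strconv.FormatInt(in.ActiveDeadlineSeconds, 10),
	}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	before := migrationInputs{Image: "gateway:11.2.2", JDBCURL: "jdbc:mysql://db:3306/ssg", ActiveDeadlineSeconds: 300}
	after := before
	after.Image = "gateway:11.2.3" // an image upgrade changes the hash
	fmt.Println(specHash(before) != specHash(after))
}
```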
…pec parity

- Reconcile loop no longer short-circuits while the migration job is pending — it runs every op to completion each pass and only stops on a genuine error. The Deployment step alone gates on Gateway.Status.MigrationStatus.
- GatewayMigrationJob returns nil for all "not done yet" states instead of a synthetic ErrMigrationPending; progress is driven by the Job's own watch events (create/update/delete) rather than a fixed poll interval.
- Fixed an edge case where a spec change (e.g. image upgrade) with no existing Job to delete could stall migration Job recreation for up to 12 hours — it now creates the replacement Job in the same reconcile pass when there's nothing to wait on.
- Failure log message no longer embeds a kubectl command.
- The migration Job's pod/container spec now inherits the same settings as the main Gateway Deployment (security contexts, resources, node selector, affinity, tolerations, topology spread constraints, pod annotations/labels) instead of a bare-minimum hand-built spec, and uses the same app.kubernetes.io/* labeling convention as other operator-managed resources.
- Added a regression test for the "spec changed, no existing Job" case; existing migration job tests updated for the new nil-return contract.
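
The reconcile behavior described across these commits can be summarized as a small decision function. This is a sketch under stated assumptions, not the operator's code: every type, constant, and function name here is hypothetical, and the real reconciler works against Kubernetes Job objects instead of a simple enum.

```go
package main

import "fmt"

type jobState int

const (
	jobAbsent jobState = iota
	jobRunning
	jobSucceeded
	jobFailed
)

// migrationStatus mirrors the SpecHash and Complete fields named in the PR.
type migrationStatus struct {
	SpecHash string
	Complete bool
}

type action string

const (
	actNone         action = "none"
	actCleanup      action = "delete-orphaned-job"
	actCreate       action = "create-job"
	actReplace      action = "delete-stale-job"
	actWait         action = "wait-for-job-event"
	actMarkComplete action = "mark-complete"
	actBlock        action = "block-deployment"
)

// decide returns the next action and the updated status for one pass.
func decide(enabled bool, st migrationStatus, wantHash string, job jobState) (action, migrationStatus) {
	if !enabled {
		if job == jobAbsent {
			return actNone, st
		}
		return actCleanup, st
	}
	if st.SpecHash != wantHash { // first enable or spec change: reset status
		st = migrationStatus{SpecHash: wantHash}
		if job == jobAbsent {
			return actCreate, st // nothing to wait on, create in this pass
		}
		return actReplace, st
	}
	if st.Complete {
		return actNone, st // done, regardless of whether the Job still exists
	}
	switch job {
	case jobAbsent:
		return actCreate, st // e.g. the user deleted a failed Job to retry
	case jobSucceeded:
		st.Complete = true
		return actMarkComplete, st
	case jobFailed:
		return actBlock, st
	default:
		return actWait, st
	}
}

// deploymentAllowed is the gate applied by the Deployment step alone.
func deploymentAllowed(enabled bool, st migrationStatus, wantHash string) bool {
	return !enabled || (st.Complete && st.SpecHash == wantHash)
}

func main() {
	st := migrationStatus{}
	act, st := decide(true, st, "abc", jobAbsent)
	fmt.Println(act, deploymentAllowed(true, st, "abc"))
	act, st = decide(true, st, "abc", jobSucceeded)
	fmt.Println(act, deploymentAllowed(true, st, "abc"))
}
```

Keeping the gate in a separate function reflects the change in this commit: the migration step never stops the reconcile loop for a pending Job, and only the Deployment step consults the status.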