Add db migrate job for Gateway upgrades #99
Open
jchanbcbc wants to merge 11 commits into
added 2 commits
June 29, 2026 22:25
Introduces an opt-in Kubernetes Job that runs Liquibase schema migrations before Gateway pods roll out during an upgrade. Gateway pods are blocked until the job succeeds, preventing pods from starting against an unmigrated schema.
Key behaviors:
- Works with Gateway 11.2.2 and later which provides four schema update modes in entrypoint.sh: default, skip, liquibase-only, liquibase-only-with-unlock
- Migration job enabled via spec.app.management.database.migrationJob.enabled
- clearLocks field releases stale Liquibase locks before migrating
- Gateway pods automatically use skip mode when migration job is enabled
- Job is auto-replaced when spec changes (image, jdbcUrl, clearLocks)
- Failed jobs block the deployment and require manual deletion to retry
- jdbcUrl override only applies in diskless mode; node.properties wins in non-diskless mode, consistent with Helm behavior
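The mode selection in the key behaviors above can be sketched in Go as follows. This is a minimal illustration, not the operator's code: the helper names (jobMode, podMode, withMode) are hypothetical, and it assumes that the Job receives its mode through the same -Dgateway.db.schema-update.mode property that the PR description mentions for Gateway pods, and that pods fall back to default when the job is disabled.

```go
package main

import (
	"fmt"
	"strings"
)

// Property prefix quoted in the PR description; all helper names are hypothetical.
const modeFlag = "-Dgateway.db.schema-update.mode="

// jobMode returns the mode the migration Job would run in.
// clearLocks selects the variant that releases stale Liquibase locks first.
func jobMode(clearLocks bool) string {
	if clearLocks {
		return "liquibase-only-with-unlock"
	}
	return "liquibase-only"
}

// podMode returns the mode for regular Gateway pods.
// Assumption: "default" is used when the migration job is disabled.
func podMode(migrationJobEnabled bool) string {
	if migrationJobEnabled {
		return "skip"
	}
	return "default"
}

// withMode drops any existing mode flag from an EXTRA_JAVA_ARGS value
// and appends the requested one, so the flag appears exactly once.
func withMode(extraJavaArgs, mode string) string {
	var kept []string
	for _, arg := range strings.Fields(extraJavaArgs) {
		if !strings.HasPrefix(arg, modeFlag) {
			kept = append(kept, arg)
		}
	}
	return strings.Join(append(kept, modeFlag+mode), " ")
}

func main() {
	fmt.Println(withMode("-Xmx2g", podMode(true)))
	fmt.Println(withMode("", jobMode(true)))
}
```

Replacing an existing flag rather than appending a second one avoids relying on which duplicate the JVM honors.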
added 7 commits
June 30, 2026 16:13
Gateway.Status.MigrationStatus. This addresses an issue where the previous
implementation would re-run migrations on every 12-hour reconcile if the
completed Job had been deleted.
Changes:
- Add MigrationStatus struct (SpecHash, Complete) to GatewayStatus in
  gateway_types.go. Once Complete is true, GatewayMigrationJob returns
  immediately on all future reconciles regardless of Job existence.
- Add DeepCopyInto/DeepCopy for MigrationStatus in zz_generated.deepcopy.go.
- Update CRD schema (security.brcmlabs.com_gateways.yaml) to include the
  new migrationStatus status fields.
- Rewrite GatewayMigrationJob in reconcile/migration_job.go:
  - Compute a 16-char spec hash from image, effective jdbcUrl, clearLocks,
    and activeDeadlineSeconds. A hash change (e.g. image upgrade) resets
    status and triggers a fresh migration automatically.
  - On first enable: write hash to status, create Job, wait.
  - On Job success: write Complete=true to status, unblock Deployment.
  - On Job failure (both pod attempts exhausted): log error with exact
    kubectl delete command, block Deployment. User deletes Job to retry.
  - On disabled: clean up any orphaned Job.
- Remove migrationJobSpecChanged; spec change detection is now fully
  handled by the hash comparison.
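One way the 16-char spec hash could be computed is sketched below. The commit message names the four inputs but not the hash function, so SHA-256 truncated to 16 hex characters is an assumption, as are the migrationInputs type, the specHash name, and the example image and URL.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// migrationInputs is a hypothetical container for the four fields
// the commit message lists as hash inputs.
type migrationInputs struct {
	Image                 string
	JDBCURL               string // the effective URL, after any override
	ClearLocks            bool
	ActiveDeadlineSeconds int64
}

// specHash returns the first 16 hex characters of a SHA-256 digest
// over the inputs. Fields are NUL-separated so adjacent values cannot
// run together and collide.
func specHash(in migrationInputs) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%s\x00%t\x00%d",
		in.Image, in.JDBCURL, in.ClearLocks, in.ActiveDeadlineSeconds)
	return hex.EncodeToString(h.Sum(nil))[:16]
}

func main() {
	before := migrationInputs{
		Image:                 "registry.example.com/gateway:11.2.2", // hypothetical
		JDBCURL:               "jdbc:mysql://db.example.com:3306/ssg", // hypothetical
		ActiveDeadlineSeconds: 600,
	}
	after := before
	after.Image = "registry.example.com/gateway:11.2.3" // simulated image upgrade

	fmt.Println(specHash(before))
	fmt.Println(specHash(after))
	fmt.Println("changed:", specHash(before) != specHash(after))
}
```

Comparing this value against the SpecHash stored in status is enough to detect an image upgrade or a corrected jdbcUrl without a field-by-field comparison.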
Recovery after failure: fix the root cause (run kubectl patch with clearLocks:true, or
restore the DB to its pre-upgrade state if it was partially migrated), then:
kubectl delete job <name>-db-migration -n <namespace>
The operator automatically creates a new Job. No kubectl apply needed.
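The lifecycle and recovery flow described above can be read as a small decision function. The sketch below is an illustration under stated assumptions, not the actual GatewayMigrationJob: observedJob, action, and decide are hypothetical names, and it assumes the Job records the spec hash it was built from so that a stale Job can be told apart from a current one.

```go
package main

import "fmt"

// MigrationStatus mirrors the two status fields named in the commit message.
type MigrationStatus struct {
	SpecHash string
	Complete bool
}

// observedJob is a hypothetical summary of the Job found in the cluster.
type observedJob struct {
	Exists    bool
	SpecHash  string // assumption: the Job carries the hash it was built from
	Succeeded bool
	Failed    bool
}

type action string

const (
	noop      action = "noop"
	createJob action = "create-job"
	deleteJob action = "delete-job"
	wait      action = "wait"
	block     action = "block-deployment"
)

// decide returns the status to persist and the next action for one reconcile.
func decide(enabled bool, st MigrationStatus, want string, job observedJob) (MigrationStatus, action) {
	if !enabled {
		if job.Exists {
			return st, deleteJob // clean up an orphaned Job
		}
		return st, noop
	}
	if st.Complete && st.SpecHash == want {
		return st, noop // already migrated; Job existence no longer matters
	}
	if st.SpecHash != want {
		st = MigrationStatus{SpecHash: want} // first enable or spec change: reset
	}
	switch {
	case !job.Exists:
		return st, createJob // also covers the retry after a manual delete
	case job.SpecHash != want:
		return st, deleteJob // stale Job from an older spec
	case job.Succeeded:
		st.Complete = true
		return st, noop
	case job.Failed:
		return st, block // stays blocked until the user deletes the Job
	default:
		return st, wait
	}
}

func main() {
	st, act := decide(true, MigrationStatus{}, "abc", observedJob{})
	fmt.Println(st, act)
}
```

Because a missing Job with an incomplete status maps to create-job, deleting the failed Job is the only step the user needs to take to retry.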
…pec parity
- Reconcile loop no longer short-circuits while the migration job is pending; it runs every op to completion each pass and only stops on a genuine error. The Deployment step alone gates on Gateway.Status.MigrationStatus.
- GatewayMigrationJob returns nil for all "not done yet" states instead of a synthetic ErrMigrationPending; progress is driven by the Job's own watch events (create/update/delete) rather than a fixed poll interval.
- Fixed an edge case where a spec change (e.g. image upgrade) with no existing Job to delete could stall migration Job recreation for up to 12 hours; it now creates the replacement Job in the same reconcile pass when there's nothing to wait on.
- Failure log message no longer embeds a kubectl command.
- The migration Job's pod/container spec now inherits the same settings as the main Gateway Deployment (security contexts, resources, node selector, affinity, tolerations, topology spread constraints, pod annotations/labels) instead of a bare-minimum hand-built spec, and uses the same app.kubernetes.io/* labeling convention as other operator-managed resources.
- Added a regression test for the "spec changed, no existing Job" case; existing migration job tests updated for the new nil-return contract.
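The nil-return contract and the Deployment-only gate described above might look roughly like this. It is a simplified sketch using only the standard library: the gateway and step types and the function names are hypothetical stand-ins for the operator's real reconcile ops.

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a hypothetical, flattened view of the fields the gate needs.
type gateway struct {
	migrationEnabled  bool
	migrationComplete bool
	deployed          bool
}

// step is one reconcile op.
type step struct {
	name string
	run  func(*gateway) error
}

// errConfig stands in for a genuine failure in some op.
var errConfig = errors.New("configmap update failed")

// migrationStep returns nil while the Job is still pending; "not done yet"
// is not an error, and Job watch events trigger the next pass.
func migrationStep(g *gateway) error { return nil }

// deploymentStep is the only op that consults the migration status.
func deploymentStep(g *gateway) error {
	if g.migrationEnabled && !g.migrationComplete {
		return nil // gated: leave the Deployment untouched this pass
	}
	g.deployed = true
	return nil
}

// reconcile runs every step in order and stops only on a real error.
func reconcile(g *gateway, steps []step) ([]string, error) {
	var ran []string
	for _, s := range steps {
		if err := s.run(g); err != nil {
			return ran, fmt.Errorf("%s: %w", s.name, err)
		}
		ran = append(ran, s.name)
	}
	return ran, nil
}

func main() {
	g := &gateway{migrationEnabled: true}
	steps := []step{{"migration", migrationStep}, {"deployment", deploymentStep}}
	ran, err := reconcile(g, steps)
	fmt.Println(ran, err, "deployed:", g.deployed)

	_, err = reconcile(g, []step{{"configmap", func(*gateway) error { return errConfig }}})
	fmt.Println(err)
}
```

Keeping the gate inside the Deployment step means ConfigMaps, Secrets, and other resources are still reconciled while the migration is in flight.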
Database Migration Job Support for Layer7 Gateway Operator
Overview
This PR introduces a pre-upgrade database migration job for the Layer7 Gateway Operator. It allows Liquibase schema migrations to run as a dedicated Kubernetes Job before Gateway pods start during an upgrade, reducing downtime risk and preventing schema conflicts. This feature requires Gateway 11.2.2 or higher.
What's New
Gateway Entrypoint: Schema Update Modes
The Gateway container (`entrypoint.sh`) now supports four explicit schema update modes via `EXTRA_JAVA_ARGS`:
- `default`
- `skip`
- `liquibase-only`
- `liquibase-only-with-unlock`: … `DATABASECHANGELOGLOCK` before migrating, then exit
Operator: Migration Job (`spec.app.management.database.migrationJob`)
A new opt-in `migrationJob` section is added to the Gateway CR spec.
Job lifecycle and sequencing:
- When `migrationJob.enabled: true`, the operator creates a Kubernetes Job that runs Liquibase in `liquibase-only` mode (requires Gateway 11.2.2 or higher) before the Gateway Deployment is allowed to proceed.
- If the spec changes between `kubectl apply` calls (e.g. a corrected `jdbcUrl`, a new image, or a changed `clearLocks` flag), the operator automatically deletes and recreates the job rather than getting stuck on a stale run.
- A failed job blocks the deployment; run `kubectl delete job <name> -n <namespace>` to retry.
Automatic `skip` mode for Gateway pods:
- When `migrationJob.enabled: true`, the operator automatically injects `-Dgateway.db.schema-update.mode=skip` into `EXTRA_JAVA_ARGS` for the main Gateway pods. This prevents them from attempting their own Liquibase run, since the Job already handles it. Users do not need to set this manually.
JDBC URL handling:
- When `migrationJob.jdbcUrl` is explicitly set, the operator appends `?createDatabaseIfNotExist=false` (using `?` or `&` depending on whether query parameters already exist). This ensures that a misconfigured or mistyped URL fails fast rather than silently creating an unintended database.
- When `migrationJob.jdbcUrl` is not set, the main `database.jdbcUrl` from the ConfigMap is used as-is, preserving the existing `createDatabaseIfNotExist=true` behavior for fresh installs.
Diskless vs non-diskless config parity:
- Diskless mode: `migrationJob.jdbcUrl` overrides the main JDBC URL via Kubernetes env precedence.
- Non-diskless mode (`disklessConfig.disabled: true`): the Secret is mounted as a `node.properties` file (matching the main Gateway deployment behavior); `migrationJob.jdbcUrl` is not used. `node.properties` wins, consistent with Helm behavior.
`clearLocks` field:
- When `clearLocks: true`, the job runs in `liquibase-only-with-unlock` mode, which releases any stale `DATABASECHANGELOGLOCK` before applying migrations. This is useful for recovering from a previously aborted migration. Changing this field triggers automatic job replacement. Using `kubectl patch` with `--type=merge` is recommended.
Optional `liquibaseLogLevel`:
- `spec.app.management.database.liquibaseLogLevel` sets the `LIQUIBASE_LOG_LEVEL` env var on both the migration job and Gateway pods for debugging.
Files Changed
- `entrypoint.sh`
- `values.yaml`, `db-migration-job.yaml`, `README.md`, `release-notes.md`
- `api/v1/gateway_types.go`: … `MigrationJob.ClearLocks`, `Database.LiquibaseLogLevel`
- `pkg/gateway/migration_job.go`
- `pkg/gateway/reconcile/migration_job.go`
- `pkg/gateway/configmap.go`: … `skip` mode, `LIQUIBASE_LOG_LEVEL`
- `internal/controller/gateway/controller.go`: … `Owns(&batchv1.Job{})` to watch Job status changes
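The JDBC URL handling rules in this description can be illustrated with a short Go sketch. The function names (withNoCreate, jobJDBCURL) and the example URLs are hypothetical, and the sketch covers diskless mode only; it does not try to rewrite a URL that already sets createDatabaseIfNotExist.

```go
package main

import (
	"fmt"
	"strings"
)

const noCreateParam = "createDatabaseIfNotExist=false"

// withNoCreate appends createDatabaseIfNotExist=false, choosing "?" when the
// URL has no query string and "&" when it already has parameters.
func withNoCreate(jdbcURL string) string {
	switch {
	case !strings.Contains(jdbcURL, "?"):
		return jdbcURL + "?" + noCreateParam
	case strings.HasSuffix(jdbcURL, "?"), strings.HasSuffix(jdbcURL, "&"):
		return jdbcURL + noCreateParam // separator already present
	default:
		return jdbcURL + "&" + noCreateParam
	}
}

// jobJDBCURL picks the URL for the migration Job in diskless mode:
// an explicit override gets the fail-fast parameter, while the main
// URL is passed through unchanged.
func jobJDBCURL(mainURL, overrideURL string) string {
	if overrideURL == "" {
		return mainURL
	}
	return withNoCreate(overrideURL)
}

func main() {
	mainURL := "jdbc:mysql://db.example.com:3306/ssg?createDatabaseIfNotExist=true" // hypothetical
	fmt.Println(jobJDBCURL(mainURL, ""))
	fmt.Println(jobJDBCURL(mainURL, "jdbc:mysql://db.example.com:3306/ssg"))
	fmt.Println(jobJDBCURL(mainURL, "jdbc:mysql://db.example.com:3306/ssg?useSSL=true"))
}
```

Leaving the main URL untouched keeps fresh installs working, while the appended parameter makes a mistyped override fail instead of creating an empty database.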