OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss by joshbranham · Pull Request #3226 · openshift/ovn-kubernetes

joshbranham · 2026-06-04T20:46:37Z

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.
The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

My code requires changes to the documentation
if so, I have updated the documentation as required
My code requires tests
if so, I have added and/or updated the tests as required
All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

Bug Fixes
- Pod sync now surfaces sync errors so failed syncs are retried.
- Port group updates now preserve existing ports when ports are unspecified; explicitly provided ports replace existing ones; creating with unspecified ports yields an empty ports list. ACLs and external IDs are consistently applied.
Tests
- Added tests covering port group create/update behaviors for nil, explicit, and create-with-nil ports.

openshift-ci · 2026-06-04T20:46:41Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-04T20:46:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: d1d268df-8302-4bd2-b005-8e7ecb709dd5

📥 Commits

Reviewing files that changed from the base of the PR and between 4c17252 and 6773d38.

📒 Files selected for processing (3)

go-controller/pkg/libovsdb/ops/model.go
go-controller/pkg/libovsdb/ops/portgroup_test.go
go-controller/pkg/ovn/base_network_controller_policy.go

🚧 Files skipped from review as they are similar to previous changes (2)

go-controller/pkg/ovn/base_network_controller_policy.go
go-controller/pkg/libovsdb/ops/portgroup_test.go

Walkthrough

This PR makes PortGroup updatable-field selection include ACLs and ExternalIDs always while appending Ports only when non-nil, adds three tests validating CreateOrUpdatePortGroupsOps behavior for Ports and ExternalIDs, and changes pod local-pod sync to propagate handler errors so WatchFactory can retry failed syncs.

Changes

PortGroup Selective Field Updates

Layer / File(s)	Summary
Conditional update field inclusion `go-controller/pkg/libovsdb/ops/model.go`	`getAllUpdatableFields` for `*nbdb.PortGroup` now returns pointers for `ACLs` and `ExternalIDs` unconditionally and appends the `Ports` pointer only when `t.Ports` is non-nil.
PortGroup operation test coverage `go-controller/pkg/libovsdb/ops/portgroup_test.go`	Adds three tests: `TestCreateOrUpdatePortGroupsOps_NilPortsPreservesExisting` (nil Ports preserves existing Ports while updating ExternalIDs), `TestCreateOrUpdatePortGroupsOps_ExplicitPortsReplaceExisting` (explicit Ports replace existing list), and `TestCreateOrUpdatePortGroupsOps_CreateWithNilPorts` (creating with nil Ports yields an empty Ports slice).

Pod Sync Error Propagation

Layer / File(s)	Summary
Error propagation in local pod reconciliation `go-controller/pkg/ovn/base_network_controller_policy.go`	The sync function passed to WatchFactory in `addLocalPodHandler` now returns the error from `handleLocalPodSelectorAddFunc` instead of discarding it, allowing retries on sync failure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Ports and ACLs now know their place,
Nil leaves Ports alone, explicit ones replace,
Three tests hop in to check each cue—
And pod-sync errors finally bubble through.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly reflects the primary fix: addressing non-atomic NorthDB port-group writes causing NodePort traffic loss, which aligns with both the PortGroup field update and local pod sync error handling changes in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-04T20:46:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshbranham
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-06-04T20:49:16Z

@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.

The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

My code requires changes to the documentation

if so, I have updated the documentation as required

My code requires tests

if so, I have added and/or updated the tests as required

All the tests have passed in the CI

How to verify it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

joshbranham · 2026-06-04T20:49:44Z

/jira refresh
@coderabbitai review

openshift-ci-robot · 2026-06-04T20:49:50Z

@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh
@coderabbitai review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-04T20:49:57Z

@joshbranham: I'll review the changes in this PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-06-04T20:50:06Z

@joshbranham: I'll review the changes in this PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

joshbranham · 2026-06-04T20:57:20Z

/retest

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss: 1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply, dropping all traffic. 2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync, ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely. Fix getAllUpdatableFields for PortGroup to skip nil fields. nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates). Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds. Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21. Fixes: https://issues.redhat.com/browse/OCPBUGS-87020 Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Josh Branham <jbranham@redhat.com>

joshbranham · 2026-06-04T21:44:32Z

/retest

joshbranham · 2026-06-04T21:44:49Z

@coderabbitai review

coderabbitai · 2026-06-04T21:44:55Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

openshift-ci · 2026-06-05T03:41:36Z

@joshbranham: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l2	`6773d38`	link	false	`/test qe-perfscale-aws-ovn-small-udn-density-l2`
ci/prow/5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade	`6773d38`	link	true	`/test 5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade`
ci/prow/e2e-aws-ovn-edge-zones	`6773d38`	link	true	`/test e2e-aws-ovn-edge-zones`
ci/prow/e2e-aws-ovn-upgrade	`6773d38`	link	true	`/test e2e-aws-ovn-upgrade`
ci/prow/e2e-aws-ovn-rhcos10-techpreview	`6773d38`	link	false	`/test e2e-aws-ovn-rhcos10-techpreview`
ci/prow/e2e-aws-ovn	`6773d38`	link	true	`/test e2e-aws-ovn`
ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration	`6773d38`	link	true	`/test e2e-aws-ovn-shared-to-local-gateway-mode-migration`
ci/prow/e2e-aws-ovn-upgrade-local-gateway	`6773d38`	link	true	`/test e2e-aws-ovn-upgrade-local-gateway`
ci/prow/unit	`6773d38`	link	true	`/test unit`
ci/prow/security	`6773d38`	link	false	`/test security`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-23T18:38:21Z

@joshbranham: This pull request references Jira Issue OCPBUGS-87020. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

Details

In response to this:

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.

The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

My code requires changes to the documentation

if so, I have updated the documentation as required

My code requires tests

if so, I have added and/or updated the tests as required

All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

Bug Fixes

Pod sync now surfaces sync errors so failed syncs are retried.

Port group updates now preserve existing ports when ports are unspecified; explicitly provided ports replace existing ones; creating with unspecified ports yields an empty ports list. ACLs and external IDs are consistently applied.

Tests

Added tests covering port group create/update behaviors for nil, explicit, and create-with-nil ports.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2026

joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from bb5a23a to 4c17252 Compare June 4, 2026 20:48

joshbranham changed the title ~~Bug 87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss~~ OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss Jun 4, 2026

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 4, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 4, 2026

joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from 4c17252 to 087e2d1 Compare June 4, 2026 21:37

joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from 087e2d1 to 6773d38 Compare June 4, 2026 21:39

joshbranham closed this Jun 23, 2026

Uh oh!

Conversation

joshbranham commented Jun 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📑 Description

Additional Information for reviewers

✅ Checks

How to verify it

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented Jun 4, 2026

Uh oh!

openshift-ci-robot commented Jun 4, 2026

📑 Description

Additional Information for reviewers

✅ Checks

How to verify it

Uh oh!

joshbranham commented Jun 4, 2026

Uh oh!

openshift-ci-robot commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joshbranham commented Jun 4, 2026

Uh oh!

joshbranham commented Jun 4, 2026

Uh oh!

joshbranham commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

openshift-ci-robot commented Jun 23, 2026

📑 Description

Additional Information for reviewers

✅ Checks

How to verify it

Summary by CodeRabbit

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshbranham commented Jun 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading