Skip to content

OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss#3226

Closed
joshbranham wants to merge 1 commit into
openshift:mainfrom
joshbranham:fix/OCPBUGS-87020-netpol-port-group-atomicity
Closed

OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss#3226
joshbranham wants to merge 1 commit into
openshift:mainfrom
joshbranham:fix/OCPBUGS-87020-netpol-port-group-atomicity

Conversation

@joshbranham

@joshbranham joshbranham commented Jun 4, 2026

Copy link
Copy Markdown

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

  1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.

  2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

  • Bug Fixes

    • Pod sync now surfaces sync errors so failed syncs are retried.
    • Port group updates now preserve existing ports when ports are unspecified; explicitly provided ports replace existing ones; creating with unspecified ports yields an empty ports list. ACLs and external IDs are consistently applied.
  • Tests

    • Added tests covering port group create/update behaviors for nil, explicit, and create-with-nil ports.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 4, 2026
@openshift-ci

openshift-ci Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: d1d268df-8302-4bd2-b005-8e7ecb709dd5

📥 Commits

Reviewing files that changed from the base of the PR and between 4c17252 and 6773d38.

📒 Files selected for processing (3)
  • go-controller/pkg/libovsdb/ops/model.go
  • go-controller/pkg/libovsdb/ops/portgroup_test.go
  • go-controller/pkg/ovn/base_network_controller_policy.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • go-controller/pkg/ovn/base_network_controller_policy.go
  • go-controller/pkg/libovsdb/ops/portgroup_test.go

Walkthrough

This PR makes PortGroup updatable-field selection include ACLs and ExternalIDs always while appending Ports only when non-nil, adds three tests validating CreateOrUpdatePortGroupsOps behavior for Ports and ExternalIDs, and changes pod local-pod sync to propagate handler errors so WatchFactory can retry failed syncs.

Changes

PortGroup Selective Field Updates

Layer / File(s) Summary
Conditional update field inclusion
go-controller/pkg/libovsdb/ops/model.go
getAllUpdatableFields for *nbdb.PortGroup now returns pointers for ACLs and ExternalIDs unconditionally and appends the Ports pointer only when t.Ports is non-nil.
PortGroup operation test coverage
go-controller/pkg/libovsdb/ops/portgroup_test.go
Adds three tests: TestCreateOrUpdatePortGroupsOps_NilPortsPreservesExisting (nil Ports preserves existing Ports while updating ExternalIDs), TestCreateOrUpdatePortGroupsOps_ExplicitPortsReplaceExisting (explicit Ports replace existing list), and TestCreateOrUpdatePortGroupsOps_CreateWithNilPorts (creating with nil Ports yields an empty Ports slice).

Pod Sync Error Propagation

Layer / File(s) Summary
Error propagation in local pod reconciliation
go-controller/pkg/ovn/base_network_controller_policy.go
The sync function passed to WatchFactory in addLocalPodHandler now returns the error from handleLocalPodSelectorAddFunc instead of discarding it, allowing retries on sync failure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Ports and ACLs now know their place,
Nil leaves Ports alone, explicit ones replace,
Three tests hop in to check each cue—
And pod-sync errors finally bubble through.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly reflects the primary fix: addressing non-atomic NorthDB port-group writes causing NodePort traffic loss, which aligns with both the PortGroup field update and local pod sync error handling changes in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshbranham
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joshbranham joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from bb5a23a to 4c17252 Compare June 4, 2026 20:48
@joshbranham joshbranham changed the title Bug 87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss Jun 4, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 4, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

  1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.

  2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@joshbranham

Copy link
Copy Markdown
Author

/jira refresh
@coderabbitai review

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 4, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh
@coderabbitai review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

@joshbranham: I'll review the changes in this PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

@joshbranham: I'll review the changes in this PR now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@joshbranham

Copy link
Copy Markdown
Author

/retest

@joshbranham joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from 4c17252 to 087e2d1 Compare June 4, 2026 21:37
When ovnkube-controller reconnects to kube-apiserver after a
transient outage (e.g. KAS revision rollout) and re-syncs
NetworkPolicy state, two bugs combine to cause persistent
NodePort/LB traffic loss:

1. CreateOrUpdatePortGroupsOps clears existing port membership
   during Update. getAllUpdatableFields() unconditionally returns
   Ports in the OnModelUpdates list. When BuildPortGroup() is
   called with nil ports (the normal path in createNetworkPolicy
   and createDefaultDenyPGAndACLs), and the port group already
   exists in NorthDB (during retry/re-sync), the OVSDB Update
   replaces existing ports with an empty set. The allow ACLs on
   the now-empty port group match no pods, while default-deny
   ACLs still apply, dropping all traffic.

2. The local pod sync function silently discards errors. In
   addLocalPodHandler, the syncFunc ignores the return value of
   handleLocalPodSelectorAddFunc. If the NorthDB transaction
   fails during sync, ports are never added to the port group
   and no retry is triggered. Since resyncInterval=0, the stale
   state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields. nil
means "not specified" (preserve existing value), while non-nil
means "set to this value". This is safe for all callers:
NetworkPolicy code passes nil ports (preserved on update), ANP
code passes explicit ports (replaced as before), and the Create
path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from
handleLocalPodSelectorAddFunc. The function is idempotent (pods
already in np.localPods are skipped), and the WatchFactory
retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and
c973d61) and are present in all release branches from 4.14
through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Josh Branham <jbranham@redhat.com>
@joshbranham joshbranham force-pushed the fix/OCPBUGS-87020-netpol-port-group-atomicity branch from 087e2d1 to 6773d38 Compare June 4, 2026 21:39
@joshbranham

Copy link
Copy Markdown
Author

/retest

@joshbranham

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@joshbranham: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/qe-perfscale-aws-ovn-small-udn-density-l2 6773d38 link false /test qe-perfscale-aws-ovn-small-udn-density-l2
ci/prow/5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade 6773d38 link true /test 5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-edge-zones 6773d38 link true /test e2e-aws-ovn-edge-zones
ci/prow/e2e-aws-ovn-upgrade 6773d38 link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-rhcos10-techpreview 6773d38 link false /test e2e-aws-ovn-rhcos10-techpreview
ci/prow/e2e-aws-ovn 6773d38 link true /test e2e-aws-ovn
ci/prow/e2e-aws-ovn-shared-to-local-gateway-mode-migration 6773d38 link true /test e2e-aws-ovn-shared-to-local-gateway-mode-migration
ci/prow/e2e-aws-ovn-upgrade-local-gateway 6773d38 link true /test e2e-aws-ovn-upgrade-local-gateway
ci/prow/unit 6773d38 link true /test unit
ci/prow/security 6773d38 link false /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@joshbranham: This pull request references Jira Issue OCPBUGS-87020. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

Details

In response to this:

📑 Description

When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:

  1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply — dropping all traffic.

  2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.

Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).

Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.

Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.

Fixes: https://issues.redhat.com/browse/OCPBUGS-87020

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

Summary by CodeRabbit

  • Bug Fixes

  • Pod sync now surfaces sync errors so failed syncs are retried.

  • Port group updates now preserve existing ports when ports are unspecified; explicitly provided ports replace existing ones; creating with unspecified ports yields an empty ports list. ACLs and external IDs are consistently applied.

  • Tests

  • Added tests covering port group create/update behaviors for nil, explicit, and create-with-nil ports.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants