OCPBUGS-87020: fix non-atomic NorthDB port-group writes causing persistent NodePort traffic loss #3226
Walkthrough
This PR makes PortGroup updatable-field selection always include ACLs and ExternalIDs while appending Ports only when non-nil, adds three tests validating CreateOrUpdatePortGroupsOps behavior for Ports and ExternalIDs, and changes the local pod sync to propagate handler errors so WatchFactory can retry failed syncs.
Changes
PortGroup Selective Field Updates
Pod Sync Error Propagation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: joshbranham.
bb5a23a to 4c17252
@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@joshbranham: This pull request references Jira Issue OCPBUGS-87020, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
/retest
4c17252 to 087e2d1
When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:
1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply, dropping all traffic.
2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync, ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.
Fix getAllUpdatableFields for PortGroup to skip nil fields. nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).
Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.
Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.
Fixes: https://issues.redhat.com/browse/OCPBUGS-87020
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Josh Branham <jbranham@redhat.com>
087e2d1 to 6773d38
/retest
@coderabbitai review
@joshbranham: The following tests failed, say
@joshbranham: This pull request references Jira Issue OCPBUGS-87020. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
📑 Description
When ovnkube-controller reconnects to kube-apiserver after a transient outage (e.g. KAS revision rollout) and re-syncs NetworkPolicy state, two bugs combine to cause persistent NodePort/LB traffic loss:
1. CreateOrUpdatePortGroupsOps clears existing port membership during Update. getAllUpdatableFields() unconditionally returns Ports in the OnModelUpdates list. When BuildPortGroup() is called with nil ports (the normal path in createNetworkPolicy and createDefaultDenyPGAndACLs), and the port group already exists in NorthDB (during retry/re-sync), the OVSDB Update operation replaces existing ports with an empty set. The allow ACLs on the now-empty port group match no pods, while default-deny ACLs still apply, dropping all traffic.
2. The local pod sync function silently discards errors. In addLocalPodHandler, the syncFunc ignores the return value of handleLocalPodSelectorAddFunc. If the NorthDB transaction fails during sync (e.g. connection instability during API server recovery), ports are never added to the port group and no retry is triggered. Since resyncInterval=0, the stale state persists indefinitely.
Fix getAllUpdatableFields for PortGroup to skip nil fields — nil means "not specified" (preserve existing value), while non-nil means "set to this value". This is safe for all callers: NetworkPolicy code passes nil ports (preserved on update), ANP code passes explicit ports (replaced as before), and the Create path is unaffected (uses the full model, not OnModelUpdates).
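The nil-versus-non-nil rule above can be illustrated with a minimal, self-contained Go sketch. It is not the code from this PR: the `PortGroup` struct, `updatableFields`, and `applyUpdate` are simplified, hypothetical stand-ins for the real NorthDB model and `getAllUpdatableFields`, and the choice to always include ACLs and ExternalIDs follows the walkthrough summary rather than the actual diff.

```go
package main

import "fmt"

// PortGroup is a simplified, hypothetical stand-in for the NorthDB model.
type PortGroup struct {
	Name        string
	ACLs        []string
	ExternalIDs map[string]string
	// Ports: nil means "not specified" (preserve the existing value);
	// non-nil, even when empty, means "set to this value".
	Ports []string
}

// updatableFields lists the fields an update should overwrite.
// ACLs and ExternalIDs are always listed; Ports only when the caller set it.
func updatableFields(pg *PortGroup) []string {
	fields := []string{"ACLs", "ExternalIDs"}
	if pg.Ports != nil {
		fields = append(fields, "Ports")
	}
	return fields
}

// applyUpdate copies only the selected fields from desired onto existing,
// imitating an update that is restricted to a field list.
func applyUpdate(existing, desired *PortGroup) {
	for _, f := range updatableFields(desired) {
		switch f {
		case "ACLs":
			existing.ACLs = desired.ACLs
		case "ExternalIDs":
			existing.ExternalIDs = desired.ExternalIDs
		case "Ports":
			existing.Ports = desired.Ports
		}
	}
}

func main() {
	existing := &PortGroup{Name: "pg1", Ports: []string{"port-a", "port-b"}}

	// NetworkPolicy-style caller: Ports left nil, so membership is preserved.
	applyUpdate(existing, &PortGroup{Name: "pg1", ACLs: []string{"allow"}})
	fmt.Println(len(existing.Ports)) // prints 2

	// ANP-style caller: explicit Ports replace the existing set.
	applyUpdate(existing, &PortGroup{Name: "pg1", Ports: []string{"port-c"}})
	fmt.Println(existing.Ports) // prints [port-c]
}
```

One consequence of this convention is that a caller who really wants to empty a port group must pass an empty, non-nil slice (`[]string{}`) instead of nil.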
Fix the syncFunc to return errors from handleLocalPodSelectorAddFunc. The function is idempotent (pods already in np.localPods are skipped), and the WatchFactory retries the sync for up to 60 seconds.
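Why returning the error is enough, given an idempotent handler, can be sketched as follows. This is an illustration under assumptions, not the PR's code: `policy`, `addPod`, `syncPods`, and `retrySync` are hypothetical names, `addPod` stands in for `handleLocalPodSelectorAddFunc`, and the retry loop counts attempts instead of running for up to 60 seconds as the WatchFactory does.

```go
package main

import (
	"errors"
	"fmt"
)

// policy is a hypothetical stand-in for the network policy state.
type policy struct {
	localPods map[string]bool
	failures  int // simulated transaction failures still to come
}

// addPod is idempotent: pods that were already added are skipped.
func (np *policy) addPod(name string) error {
	if np.localPods[name] {
		return nil
	}
	if np.failures > 0 {
		np.failures--
		return errors.New("simulated NorthDB transaction failure")
	}
	np.localPods[name] = true
	return nil
}

// syncPods returns the first handler error instead of discarding it,
// so the caller can see that the sync did not complete.
func (np *policy) syncPods(pods []string) error {
	for _, p := range pods {
		if err := np.addPod(p); err != nil {
			return fmt.Errorf("sync pod %s: %w", p, err)
		}
	}
	return nil
}

// retrySync re-runs sync until it succeeds or attempts run out.
// (Attempt-based for brevity; a time budget would work the same way.)
func retrySync(attempts int, sync func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = sync(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	np := &policy{localPods: map[string]bool{}, failures: 1}
	pods := []string{"pod-a", "pod-b"}

	err := retrySync(3, func() error { return np.syncPods(pods) })
	fmt.Println(err, len(np.localPods)) // prints <nil> 2
}
```

Because `addPod` skips pods that are already recorded, re-running the whole sync after a partial failure does not duplicate work; it only completes what was missed.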
Both bugs were introduced in 2022 (commits 5d84deb and c973d61) and are present in all release branches from 4.14 through 4.21.
Fixes: https://issues.redhat.com/browse/OCPBUGS-87020