OCPBUGS-74711: [release-4.20] EgressIP annotation corruption causes stale IPs on br-ex after failover #3257
Code was modifying the annotations of the informer cache node object. If this was happening while another goroutine was reading the annotation map, it would trigger ovnkube to crash!

Fixes: #5950
Signed-off-by: Tim Rozet <trozet@nvidia.com>
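The crash described above follows from a general Go rule: the runtime aborts the process when it detects one goroutine writing a map while another reads it. A minimal sketch of the copy-before-mutate pattern follows. The `Node` struct and the `withAnnotation` helper are hypothetical stand-ins for illustration, not the code from this PR; with real Kubernetes objects the generated `DeepCopy()` method serves the same purpose.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for the node object held in the informer
// cache. It is not the Kubernetes type used by the real code.
type Node struct {
	Name        string
	Annotations map[string]string
}

// withAnnotation returns a copy of node with key set to value. The input node
// and its annotation map are never written to, so other goroutines reading
// the shared object are unaffected.
func withAnnotation(node *Node, key, value string) *Node {
	annotations := make(map[string]string, len(node.Annotations)+1)
	for k, v := range node.Annotations {
		annotations[k] = v
	}
	annotations[key] = value
	return &Node{Name: node.Name, Annotations: annotations}
}

func main() {
	// cached plays the role of the shared, read-only informer object.
	cached := &Node{Name: "node-a", Annotations: map[string]string{"existing": "value"}}

	// The annotation key and value here are made up for the example.
	updated := withAnnotation(cached, "example.test/ips", `["10.0.0.1"]`)

	fmt.Println("cached annotations:", len(cached.Annotations))
	fmt.Println("updated annotations:", len(updated.Annotations))
}
```

The copy is what gets sent to the API server; the shared object stays untouched, so it is safe for concurrent readers.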
Gateway egress IP adds IPs to an annotation on the node. The code assumed the informer object had the latest data and overwrote the IPs using that information. That isn't reliable, as the informer could have stale data compared to recent kubeclient updates. This would trigger egress IP logic to corrupt the IPs in the node annotation, and cause further drift/corruption in subsequent updates. This fixes it by creating a local cache of IPs for the controller, initialized on start up from the node object, and using that as the source of truth. Updates are then driven by what is in the cache rather than what is in the informer. Also fixes places where tests should have been using Eventually.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
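The local-cache approach described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the commit: the `ipCache` type, its methods, and the JSON-list encoding of the annotation value are all hypothetical. The point it shows is that each update is computed from the controller's own record of what it has written, so a stale informer read can no longer drop IPs added by an earlier call.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"sync"
)

// ipCache is a hypothetical controller-local record of the IPs that belong in
// the node annotation. It is the source of truth for every update.
type ipCache struct {
	mu  sync.Mutex
	ips map[string]struct{}
}

// newIPCache seeds the cache once from the annotation value read at start-up.
// The JSON list encoding is an assumption made for this example.
func newIPCache(annotation string) (*ipCache, error) {
	c := &ipCache{ips: map[string]struct{}{}}
	if annotation == "" {
		return c, nil
	}
	var initial []string
	if err := json.Unmarshal([]byte(annotation), &initial); err != nil {
		return nil, err
	}
	for _, ip := range initial {
		c.ips[ip] = struct{}{}
	}
	return c, nil
}

// add records ip and returns the complete annotation value to write.
func (c *ipCache) add(ip string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ips[ip] = struct{}{}
	return c.encodeLocked()
}

// remove drops the given IPs and returns the complete annotation value.
func (c *ipCache) remove(ips ...string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, ip := range ips {
		delete(c.ips, ip)
	}
	return c.encodeLocked()
}

// encodeLocked renders the cache as a sorted JSON list. Callers hold c.mu.
func (c *ipCache) encodeLocked() (string, error) {
	out := make([]string, 0, len(c.ips))
	for ip := range c.ips {
		out = append(out, ip)
	}
	sort.Strings(out)
	b, err := json.Marshal(out)
	return string(b), err
}

func main() {
	cache, err := newIPCache(`["10.0.0.1"]`)
	if err != nil {
		panic(err)
	}
	// Two rapid additions: the second builds on the first because both read
	// the local cache, not a possibly stale informer copy.
	if _, err := cache.add("10.0.0.2"); err != nil {
		panic(err)
	}
	value, err := cache.add("10.0.0.3")
	if err != nil {
		panic(err)
	}
	fmt.Println(value)
}
```

A production version would also need to decide what happens when the write to the API server fails, for example by updating the cache only after a successful patch; the sketch leaves that out.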
@bpickard22: This pull request references Jira Issue OCPBUGS-74711, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bpickard22
/jira refresh
/retest-required
Summary
Backport of upstream ovn-kubernetes/ovn-kubernetes#5951 to release-4.20.
Cherry-picked commits:
0e44890df - EgressIP: Fix crash from mutating node informer object
b95fc8081 - Fixes gateway egress IP node update logic

Problem
addIPToAnnotation and deleteIPsFromAnnotation in gateway_egressip.go read the node annotation from the informer cache and directly mutate the cached node object. With many EgressIPs (customer has 270), the rapid read-modify-write calls during SyncEgressIP overwrite each other due to stale informer reads. The annotation ends up tracking only ~7% of assigned IPs, preventing SyncEgressIP from cleaning up stale IPs on br-ex after failover. This causes duplicate IPs across nodes and service outages.

Fix
annotationIPs cache initialized at startup, eliminating the stale-read race

Test plan
Fixes: https://redhat.atlassian.net/browse/OCPBUGS-74711
🤖 Generated with Claude Code