Skip to content

OCPBUGS-74711: [release-4.20] EgressIP annotation corruption causes stale IPs on br-ex after failover#3257

Open
bpickard22 wants to merge 2 commits into
openshift:release-4.20from
bpickard22:4.20-ocpbugs74711
Open

OCPBUGS-74711: [release-4.20] EgressIP annotation corruption causes stale IPs on br-ex after failover#3257
bpickard22 wants to merge 2 commits into
openshift:release-4.20from
bpickard22:4.20-ocpbugs74711

Conversation

@bpickard22

@bpickard22 bpickard22 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Backport of upstream ovn-kubernetes/ovn-kubernetes#5951 to release-4.20.

Cherry-picked commits:

  • 0e44890df — EgressIP: Fix crash from mutating node informer object
  • b95fc8081 — Fixes gateway egress IP node update logic

Problem

addIPToAnnotation and deleteIPsFromAnnotation in gateway_egressip.go read the node annotation from the informer cache and directly mutate the cached node object. With many EgressIPs (customer has 270), the rapid read-modify-write calls during SyncEgressIP overwrite each other due to stale informer reads. The annotation ends up tracking only ~7% of assigned IPs, preventing SyncEgressIP from cleaning up stale IPs on br-ex after failover. This causes duplicate IPs across nodes and service outages.

Fix

  1. Deep-copy the node object before mutation (stops corrupting the informer cache)
  2. Replace informer reads with a local annotationIPs cache initialized at startup, eliminating the stale-read race

Test plan

  • Existing EgressIP unit tests pass
  • CI green
  • Verified fix is in 4.21+ and working

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-74711

🤖 Generated with Claude Code

trozet added 2 commits June 17, 2026 14:11
Code was modifying the annotations of the informer cache node object. If
this was happening while another goroutine was reading the annotation
map, it would trigger ovnkube to crash!

Fixes: #5950

Signed-off-by: Tim Rozet <trozet@nvidia.com>
Gateway egress IP adds IPs to an annotation on the node. The code was
assuming the informer object should have the latest data, then
overwriting the IPs using that information. That isn't reliable as the
informer could have stale data compared to recent kubeclient updates.
This would trigger egress IP logic to corrupt the IPs in the node
annotation, and cause further drift/corruption in subsequent updates.

This fixes it by creating a local cache of IPs for the controller, and
using that as the source of truth, initialized on start up from the node
object. Then updates are driven by what is in the cache, versus what is
in the informer.

Also fixes places where tests should have been using Eventually.

Signed-off-by: Tim Rozet <trozet@nvidia.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 17, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bpickard22: This pull request references Jira Issue OCPBUGS-74711, which is invalid:

  • expected the bug to target the "4.20.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-74711 to depend on a bug targeting a version in 4.21.0, 4.21.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Backport of upstream ovn-kubernetes/ovn-kubernetes#5951 to release-4.20.

Cherry-picked commits:

  • 0e44890df — EgressIP: Fix crash from mutating node informer object
  • b95fc8081 — Fixes gateway egress IP node update logic

Problem

addIPToAnnotation and deleteIPsFromAnnotation in gateway_egressip.go read the node annotation from the informer cache and directly mutate the cached node object. With many EgressIPs (customer has 270), the rapid read-modify-write calls during SyncEgressIP overwrite each other due to stale informer reads. The annotation ends up tracking only ~7% of assigned IPs, preventing SyncEgressIP from cleaning up stale IPs on br-ex after failover. This causes duplicate IPs across nodes and service outages.

Fix

  1. Deep-copy the node object before mutation (stops corrupting the informer cache)
  2. Replace informer reads with a local annotationIPs cache initialized at startup, eliminating the stale-read race

Test plan

  • Existing EgressIP unit tests pass
  • CI green
  • Verified fix is in 4.21+ and working

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-74711

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: d36ad5fe-0e9a-4300-a4da-cc37b71a81b3

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

.coderabbit.yml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key: "paths_ignore"
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from martinkennelly and tssurya June 17, 2026 18:13
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bpickard22
Once this PR has been reviewed and has the lgtm label, please assign kyrtapz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bpickard22

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bpickard22: This pull request references Jira Issue OCPBUGS-74711, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-74711 to depend on a bug targeting a version in 4.21.0, 4.21.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bpickard22

Copy link
Copy Markdown
Contributor Author

backport tracking is a little wonky here, this fix from @trozet made it downstream via #3011 which is a no jira, so we will have to create a chain up to main to satisfy the bot, this is also why I am pointed to the upstream pr instead of something downstream as I think it provides better clarity

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@bpickard22: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security 5055ffd link false /test security

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bpickard22

Copy link
Copy Markdown
Contributor Author

/retest-required

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants