Skip to content

Ovnkube objretry#3215

Open
afcollins wants to merge 10 commits into
openshift:mainfrom
afcollins:ovnkube-objretry
Open

Ovnkube objretry#3215
afcollins wants to merge 10 commits into
openshift:mainfrom
afcollins:ovnkube-objretry

Conversation

@afcollins

@afcollins afcollins commented Jun 1, 2026

Copy link
Copy Markdown

📑 Description

Adds a test that proves the obj_retry storm and code that fixes it.

Mainly opening this PR downstream to run rehearse tests.

Full RCA from log analysis at: https://gist.github.com/afcollins/1b7f47b1396268c614ff606adddcdaba

Additional Information for reviewers

✅ Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

go test ./...

Summary by CodeRabbit

  • Bug Fixes

    • Preserve retry failure counters when retry entries are re-initialized; failedAttempts persist across re-add/update events.
    • Reduce log noise by gating many informational messages (ignored deletes, port/node/namespace/events, policy/pod progress) behind verbose logging.
    • Make network policy deletion idempotent and skip unmanaged link handling cleanly.
    • Retry sweeps now process entries sequentially to avoid concurrent retry runs.
  • Tests

    • Added unit tests verifying preserved failure counts and correct removal once max failures are reached.
  • Chores

    • Added a package test runner script to simplify running retry tests.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: b6f045c5-a331-4845-b465-20ecb4e5df76

📥 Commits

Reviewing files that changed from the base of the PR and between 95c2017 and ee6a708.

📒 Files selected for processing (3)
  • go-controller/pkg/controller/controller.go
  • go-controller/pkg/ovn/controller/services/services_controller.go
  • go-controller/pkg/retry/obj_retry.go
✅ Files skipped from review due to trivial changes (1)
  • go-controller/pkg/ovn/controller/services/services_controller.go

Walkthrough

This PR preserves a retry entry's failedAttempts when re-initializing add/update/delete backoff entries, conditions log output to show failedAttempts only when >0, adds unit tests and a podman-based test runner for the retry package, and applies small logging and error-wrapping fixes across linkmanager, node, and OVN controllers.

Changes

Preserve failedAttempts on retry re-initialization

Layer / File(s) Summary
Retry initialization behavior
go-controller/pkg/retry/obj_retry.go
initRetryObjWithAddBackoff, initRetryObjWithUpdate, and initRetryObjWithDeleteBackoff now use the loaded result from LoadOrStore to reset failedAttempts only when creating new entries; runtime logs include (failed attempts: N) only when failedAttempts > 0, iterateRetryResources processes keys sequentially, and a terminal-state delete-event log was changed to klog.V(5).Infof.

|Unit tests for failedAttempts preservation
go-controller/pkg/retry/obj_retry_storm_test.go|Adds alwaysFailHandler, test helpers, and two tests (TestFailedAttemptsPreservedOnReAdd, TestFailedAttemptsNotResetReachesMax) that seed failing retry entries, simulate update/re-add events, and assert failure-count preservation and drop-at-max behavior.|

|Container-based test runner script
go-controller/pkg/retry/run-tests.sh|Bash script that runs go test ./pkg/retry/ inside a golang:1.25 podman container, accepting an optional -run filter and mounting the repository into /workspace.|

Misc runtime and logging fixes

Layer / File(s) Summary
Link manager lookup and ENODEV handling
go-controller/pkg/node/linkmanager/link_network_manager.go
syncLink now checks for a managed link entry before running the handler and, on unix.ENODEV from GetFilteredInterfaceAddrs, deletes the store entry and returns nil instead of erroring.
Controller suppressed-error logging
go-controller/pkg/controller/controller.go
Adds import for types and logs suppressed reconciliation errors at verbose level instead of logging them as errors for retryable items.
OVN/node logging and network policy guard
go-controller/pkg/ovn/*, go-controller/pkg/node/default_node_network_controller.go
Multiple informational log statements were lowered to klog.V(5).Infof; deleteNetworkPolicy now returns early if the policy is absent from the controller cache.
Net util error wrapping
go-controller/pkg/util/net_linux.go
GetFilteredInterfaceAddrs now returns a wrapped error using %w to preserve the original error for callers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I am a rabbit who counts each fall,
Failed attempts preserved, I heed each call.
Re-added, unchanged — the counter stays true,
Tests hop along till the max says adieu.
🐇✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Ovnkube objretry' is too vague and generic, using an abbreviated term that lacks specificity about the actual changes made. Use a more descriptive title that captures the main objective, such as 'Reduce log verbosity and serialize retry resource processing' or 'Fix obj_retry goroutine storm by serializing retry processing'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from jcaamano and martinkennelly June 1, 2026 16:43
@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: afcollins
Once this PR has been reviewed and has the lgtm label, please assign tssurya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afcollins

Copy link
Copy Markdown
Author

/pj-rehearse pull-ci-openshift-ovn-kubernetes-main-qe-perfscale-payload-control-plane-6nodes

@afcollins

Copy link
Copy Markdown
Author

/test pull-ci-openshift-ovn-kubernetes-main-qe-perfscale-payload-control-plane-6nodes

@afcollins

Copy link
Copy Markdown
Author

/test qe-perfscale-payload-control-plane-6nodes

1 similar comment
@afcollins

Copy link
Copy Markdown
Author

/test qe-perfscale-payload-control-plane-6nodes

if entry.newObj != nil {
klog.Infof("%s: adding new object: %s %s", r.name, r.ResourceHandler.ObjType, objKey)
if entry.failedAttempts > 0 {
klog.Infof("%s: adding new object: %s %s (failed attempts: %d)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this message is down substantially, and it allows us to track how many retries are actually happening and how frequently.

This one in particular had 7 retries, ~1s apart. It's curious the retries were on 'adding new object' when the objects were being deleted.

I0604 00:43:34.469982 4293 obj_retry.go:435] nodeGateway: adding new object: *factory.serviceForGateway udn-density-pods-68/udn-density-2 (failed attempts: 7)

time="2026-06-04 00:41:07" level=info msg="Deleting 72 namespaces with label: kube-burner.io/job=create-udn-l2" file="namespaces.go:65"

afcollins and others added 7 commits June 4, 2026 13:18
Demonstrates that informer update events reset failedAttempts to 0
via initRetryObjWithAdd, preventing MaxFailedAttempts (15) from ever
being reached. Services whose namespace network state is deleted
retry indefinitely instead of being dropped.

Two test cases:
- TestFailedAttemptsResetOnReAdd: proves the bug — entry survives
  20 retry cycles past MaxFailedAttempts because each simulated
  update event resets the counter
- TestFailedAttemptsNotResetReachesMax: control case — without
  update events, entry is correctly dropped at MaxFailedAttempts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
initRetryObjWithAdd, initRetryObjWithUpdate, and
initRetryObjWithDeleteBackoff unconditionally reset failedAttempts
to 0 on every call, even when the entry already exists in the retry
cache. This allows informer update events to reset the failure
counter indefinitely, preventing MaxFailedAttempts (15) from ever
being reached.

During UDN namespace teardown, services whose network state is
deleted keep retrying because AreResourcesEqual returns false for
all service updates, triggering initRetryObjWithAdd which resets
the counter. In production this produced 27,817 retry attempts for
422 services in a single log window.

Fix: capture the `loaded` boolean from LoadOrStore and only reset
failedAttempts for genuinely new entries, not re-adds of
already-failing objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Collins <ancollin@redhat.com>
The "retry object setup" and "adding new object" log lines fire for
every entry in the retry cache on every iteration, even when objects
succeed on first attempt. With ~72 network controllers each re-queuing
all pods via addAllPodsOnNode, this produces ~650k log lines per run
for objects that never actually failed.

Gate these messages behind failedAttempts > 0 so they only appear for
objects that previously failed, and include the attempt count for
debugging. This preserves log level but eliminates noise from
successful first-try retries.

Co-Authored-By: Claude Opus 4.6
Each UDN controller watches the shared MultiNetworkPolicy informer
and processes every delete event regardless of network ownership,
producing ~69k redundant "Deleting network policy" log lines. Short-
circuit with an early return when the policy is not in networkPolicies.

Co-Authored-By: Claude Opus 4.6
When a netlink event fires for a link that has already been removed
(e.g. pod veth during CNI DEL), syncLink would error with "no such
device" producing ~15k error-level log lines during density tests.

Fix: move the store membership check before any I/O so unmanaged
links are skipped entirely, and handle ENODEV from GetFilteredInterfaceAddrs
by cleaning up the store entry instead of returning an error.

Also fix error wrapping in GetFilteredInterfaceAddrs (%v -> %w) so
errors.Is can detect ENODEV through the chain.

Co-Authored-By: Claude Opus 4.6
During UDN density tests, these log lines produce ~400k+ lines at
default Info level despite being operational traces, not actionable
signals. Demote to V(5) so they only appear at elevated verbosity.

Changed across 5 files:
- Network policy lifecycle: add/delete/cleanup/peer-address-set traces
- UDN namespace: add/update/unknown-namespace traces
- UDN pod: addLogicalPort timing, pod deletion traces
- Pod port: "creating logical port" trace
- PMTUD: "Adding remote node to PMTUD blocking rules" trace
- namespace.go: "updating namespace" trace

Co-Authored-By: Claude Opus 4.6
afcollins added 3 commits June 4, 2026 13:54
Reconcile functions already wrap transient/expected errors with
SuppressedError (e.g. "no pod IPs found", "chassis-id annotation
not found"). The generic controller was logging these at Error
level on every retry attempt, producing ~600+ error lines for
conditions that self-resolve. Check IsSuppressedError and log
at V(5) instead, preserving Error level for real failures.

Co-Authored-By: Claude Opus 4.6
"Configuring UDN enabled service route" fires for every service
across every UDN network during setup/teardown. It's a trace,
not an actionable event.

Co-Authored-By: Claude Opus 4.6
iterateRetryResources previously spawned a goroutine for every
retry entry key, creating a goroutine storm every 30s across all
controllers. With 72 UDN controllers each having their own retry
frameworks, this produced thousands of concurrent goroutines all
contending on the same OVN NB DB client.

Process keys serially within each controller instead. Each
controller still runs its own periodicallyRetryResources loop,
so inter-controller parallelism is preserved — but intra-controller
work is now sequential, eliminating scheduling overhead and lock
contention.

Co-Authored-By: Claude Opus 4.6
@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@afcollins: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade d1b6e3f link true /test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
ci/prow/4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade d1b6e3f link true /test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw ee6a708 link true /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
ci/prow/e2e-aws-ovn-edge-zones ee6a708 link true /test e2e-aws-ovn-edge-zones
ci/prow/security ee6a708 link false /test security
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp ee6a708 link true /test e2e-metal-ipi-ovn-dualstack-bgp
ci/prow/unit ee6a708 link true /test unit
ci/prow/5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade ee6a708 link true /test 5.0-upgrade-from-stable-4.22-e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-fdp-qe ee6a708 link true /test e2e-aws-ovn-fdp-qe
ci/prow/5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade ee6a708 link true /test 5.0-upgrade-from-stable-4.22-e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-local-gateway ee6a708 link true /test e2e-aws-ovn-local-gateway

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant