Ovnkube objretry#3215
Walkthrough
This PR preserves a retry entry's failedAttempts when re-initializing add/update/delete backoff entries, makes log output show failedAttempts only when it is greater than 0, adds unit tests and a podman-based test runner for the retry package, and applies small logging and error-wrapping fixes across linkmanager, node, and OVN controllers.
/pj-rehearse pull-ci-openshift-ovn-kubernetes-main-qe-perfscale-payload-control-plane-6nodes
/test pull-ci-openshift-ovn-kubernetes-main-qe-perfscale-payload-control-plane-6nodes
/test qe-perfscale-payload-control-plane-6nodes
if entry.newObj != nil {
    klog.Infof("%s: adding new object: %s %s", r.name, r.ResourceHandler.ObjType, objKey)
    if entry.failedAttempts > 0 {
        klog.Infof("%s: adding new object: %s %s (failed attempts: %d)", r.name, r.ResourceHandler.ObjType, objKey, entry.failedAttempts)
The volume of this message is now down substantially, and it lets us track how many retries are actually happening and how frequently.
This one in particular had 7 retries, ~1s apart. Curiously, the retries were on 'adding new object' even though the objects were being deleted.
I0604 00:43:34.469982 4293 obj_retry.go:435] nodeGateway: adding new object: *factory.serviceForGateway udn-density-pods-68/udn-density-2 (failed attempts: 7)
time="2026-06-04 00:41:07" level=info msg="Deleting 72 namespaces with label: kube-burner.io/job=create-udn-l2" file="namespaces.go:65"
Demonstrates that informer update events reset failedAttempts to 0 via initRetryObjWithAdd, preventing MaxFailedAttempts (15) from ever being reached. Services whose namespace network state is deleted retry indefinitely instead of being dropped.

Two test cases:
- TestFailedAttemptsResetOnReAdd: proves the bug. The entry survives 20 retry cycles past MaxFailedAttempts because each simulated update event resets the counter.
- TestFailedAttemptsNotResetReachesMax: control case. Without update events, the entry is correctly dropped at MaxFailedAttempts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
initRetryObjWithAdd, initRetryObjWithUpdate, and initRetryObjWithDeleteBackoff unconditionally reset failedAttempts to 0 on every call, even when the entry already exists in the retry cache. This allows informer update events to reset the failure counter indefinitely, preventing MaxFailedAttempts (15) from ever being reached.

During UDN namespace teardown, services whose network state is deleted keep retrying because AreResourcesEqual returns false for all service updates, triggering initRetryObjWithAdd, which resets the counter. In production this produced 27,817 retry attempts for 422 services in a single log window.

Fix: capture the `loaded` boolean from LoadOrStore and only reset failedAttempts for genuinely new entries, not re-adds of already-failing objects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Andrew Collins <ancollin@redhat.com>
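The `loaded`-based fix described in the commit message above can be illustrated with a minimal sketch. It assumes a standard-library `sync.Map` as the cache; `retryEntry`, `retryCache`, and `initWithAdd` are hypothetical, pared-down stand-ins, not the actual types or functions in obj_retry.go, and the sketch omits the per-entry locking a real implementation would need.

```go
package main

import (
	"fmt"
	"sync"
)

// retryEntry is a hypothetical stand-in for the real retry entry.
type retryEntry struct {
	newObj         any
	failedAttempts uint8
}

// retryCache is a hypothetical cache keyed by object key (assumed: sync.Map).
type retryCache struct {
	entries sync.Map // string -> *retryEntry
}

// initWithAdd records obj under key. A genuinely new entry starts with
// failedAttempts == 0; an entry that already exists keeps its counter and
// only has its object refreshed.
func (c *retryCache) initWithAdd(key string, obj any) *retryEntry {
	fresh := &retryEntry{newObj: obj}
	v, loaded := c.entries.LoadOrStore(key, fresh)
	entry := v.(*retryEntry)
	if loaded {
		// Existing entry: update the object, leave failedAttempts alone.
		entry.newObj = obj
	}
	return entry
}

func main() {
	c := &retryCache{}
	e := c.initWithAdd("ns/svc", "v1")
	e.failedAttempts = 7 // simulate seven failed retries

	// A later informer update re-adds the same key.
	e = c.initWithAdd("ns/svc", "v2")
	fmt.Println(e.failedAttempts, e.newObj)
}
```

With this shape, a counter that has started climbing can still reach the maximum even if update events keep arriving for the same key.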
The "retry object setup" and "adding new object" log lines fire for every entry in the retry cache on every iteration, even when objects succeed on first attempt. With ~72 network controllers each re-queuing all pods via addAllPodsOnNode, this produces ~650k log lines per run for objects that never actually failed.

Gate these messages behind failedAttempts > 0 so they only appear for objects that previously failed, and include the attempt count for debugging. This preserves log level but eliminates noise from successful first-try retries.

Co-Authored-By: Claude Opus 4.6
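A minimal sketch of the gating described above, assuming the message format seen in the review comment's log line. `retryLogLine` is a hypothetical helper introduced for illustration; the real code calls klog directly.

```go
package main

import "fmt"

// retryLogLine returns the message to log and whether it should be logged
// at all. Entries that never failed produce no log line.
func retryLogLine(controller, objType, key string, failedAttempts uint8) (string, bool) {
	if failedAttempts == 0 {
		return "", false
	}
	msg := fmt.Sprintf("%s: adding new object: %s %s (failed attempts: %d)",
		controller, objType, key, failedAttempts)
	return msg, true
}

func main() {
	for _, attempts := range []uint8{0, 7} {
		if msg, ok := retryLogLine("nodeGateway", "*factory.serviceForGateway",
			"udn-density-pods-68/udn-density-2", attempts); ok {
			fmt.Println(msg)
		}
	}
}
```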
Each UDN controller watches the shared MultiNetworkPolicy informer and processes every delete event regardless of network ownership, producing ~69k redundant "Deleting network policy" log lines. Short-circuit with an early return when the policy is not in networkPolicies.

Co-Authored-By: Claude Opus 4.6
When a netlink event fires for a link that has already been removed (e.g. pod veth during CNI DEL), syncLink would error with "no such device", producing ~15k error-level log lines during density tests.

Fix: move the store membership check before any I/O so unmanaged links are skipped entirely, and handle ENODEV from GetFilteredInterfaceAddrs by cleaning up the store entry instead of returning an error.

Also fix error wrapping in GetFilteredInterfaceAddrs (%v -> %w) so errors.Is can detect ENODEV through the chain.

Co-Authored-By: Claude Opus 4.6
During UDN density tests, these log lines produce ~400k+ lines at default Info level despite being operational traces, not actionable signals. Demote to V(5) so they only appear at elevated verbosity.

Changed across 5 files:
- Network policy lifecycle: add/delete/cleanup/peer-address-set traces
- UDN namespace: add/update/unknown-namespace traces
- UDN pod: addLogicalPort timing, pod deletion traces
- Pod port: "creating logical port" trace
- PMTUD: "Adding remote node to PMTUD blocking rules" trace
- namespace.go: "updating namespace" trace

Co-Authored-By: Claude Opus 4.6
fe329d3 to 95c2017
Reconcile functions already wrap transient/expected errors with SuppressedError (e.g. "no pod IPs found", "chassis-id annotation not found"). The generic controller was logging these at Error level on every retry attempt, producing ~600+ error lines for conditions that self-resolve.

Check IsSuppressedError and log at V(5) instead, preserving Error level for real failures.

Co-Authored-By: Claude Opus 4.6
"Configuring UDN enabled service route" fires for every service across every UDN network during setup/teardown. It's a trace, not an actionable event.

Co-Authored-By: Claude Opus 4.6
iterateRetryResources previously spawned a goroutine for every retry entry key, creating a goroutine storm every 30s across all controllers. With 72 UDN controllers each having their own retry frameworks, this produced thousands of concurrent goroutines all contending on the same OVN NB DB client.

Process keys serially within each controller instead. Each controller still runs its own periodicallyRetryResources loop, so inter-controller parallelism is preserved, but intra-controller work is now sequential, eliminating scheduling overhead and lock contention.

Co-Authored-By: Claude Opus 4.6
📑 Description
Adds a test that reproduces the obj_retry retry storm, along with the code that fixes it.
Mainly opening this PR downstream to run rehearse tests.
Full RCA from log analysis at: https://gist.github.com/afcollins/1b7f47b1396268c614ff606adddcdaba
Additional Information for reviewers
✅ Checks
How to verify it
go test ./...