tests/ocp/sriov: fix reconcile race and ICMP check in metricsExporter#1433
zhiqiangf wants to merge 1 commit into
📝 Walkthrough
This PR refines the metrics exporter test in the SR-IOV test suite. It adds device-type-aware ICMP connectivity assertions and reorders test resource creation to improve synchronization of SR-IOV policy processing by the daemon.
Changes
Metrics Exporter Test Updates
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🧹 Nitpick comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)
517-525: 💤 Low value
Consider clarifying the comment to reflect defensive verification.
The comment "Wait for NAD Creation" doesn't distinguish this from the earlier NAD wait at lines 507-510. This second wait is a defensive check after cluster stabilization to ensure NADs still exist (since the MCP wait may cause node disruptions). Consider updating the comment to something like "Verify NAD existence after cluster stabilization" for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ocp/sriov/tests/metricsExporter.go` around lines 517 - 525, Update the test comment that currently reads By("Wait for NAD Creation") to clarify this is a defensive verification after cluster stabilization; replace it with a more descriptive message such as By("Verify NAD existence after cluster stabilization") immediately above the loop that calls nad.Pull(APIClient, res.network.Object.Name, tsparams.TestNamespaceName) for metricsTestResource entries (cRes, sRes) so the intent is clear and distinguished from the earlier NAD wait.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: dd66108e-7495-4c21-94d4-9bcaafe5c888
📒 Files selected for processing (1)
tests/ocp/sriov/tests/metricsExporter.go
d01d316 to efc33a4
Two fixes for the Netdevice-to-Vfiopci test (OCP-75931):

1. Split reconcile race: createMetricsTestResources was creating policies and networks in an interleaved loop (policy1 → network1 → NAD wait → policy2). The NAD wait gap caused the SR-IOV daemon to process the two policies in separate reconcile generations. The first generation reported "Succeeded" with only the client device plugin resource registered, and WaitForSriovStable returned prematurely. The server DPDK pod then sat Pending for the full 2-minute timeout with "Insufficient openshift.io/servervfiopci".
   Fix: create all policies first, then all networks, so both policies land in the same reconcile generation.

2. ICMP expectation: the test unconditionally expected ICMP to fail, but on Mellanox NICs defineMetricsPolicy() configures netdevice+RDMA instead of vfio-pci, so the kernel network stack remains active and ICMP succeeds.
   Fix: branch on the server policy's DeviceType to choose the correct assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
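The ordering change in fix 1 can be sketched as follows. This is an illustration only, not the code from `createMetricsTestResources`: the `metricsResource` type, the step strings, and both helper functions are hypothetical, and the placement of the NAD wait in the real helper is an assumption. The sketch only shows how moving network creation into a second pass removes the gap between the two policy creates.

```go
package main

import "fmt"

// metricsResource is a hypothetical stand-in for the per-pod test resources:
// one SR-IOV policy plus the network that depends on it.
type metricsResource struct {
	policyName  string
	networkName string
}

// interleavedOrder models the old ordering: each policy is followed by its
// network and a NAD wait before the next policy is created.
func interleavedOrder(resources []metricsResource) []string {
	var steps []string
	for _, res := range resources {
		steps = append(steps, "create policy "+res.policyName)
		steps = append(steps, "create network "+res.networkName)
		steps = append(steps, "wait for NAD "+res.networkName)
	}
	return steps
}

// twoPhaseOrder models the fixed ordering: all policies are created
// back to back, and only then are the networks created and waited on.
func twoPhaseOrder(resources []metricsResource) []string {
	var steps []string
	for _, res := range resources {
		steps = append(steps, "create policy "+res.policyName)
	}
	for _, res := range resources {
		steps = append(steps, "create network "+res.networkName)
		steps = append(steps, "wait for NAD "+res.networkName)
	}
	return steps
}

func main() {
	resources := []metricsResource{
		{policyName: "client", networkName: "client-net"},
		{policyName: "server", networkName: "server-net"},
	}
	fmt.Println("old ordering:")
	for _, step := range interleavedOrder(resources) {
		fmt.Println("  " + step)
	}
	fmt.Println("new ordering:")
	for _, step := range twoPhaseOrder(resources) {
		fmt.Println("  " + step)
	}
}
```

In the old ordering two slower steps sit between the policy creates, which is the window in which the daemon can finish a reconcile for the first policy alone. In the two-phase ordering the policy creates are adjacent.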
efc33a4 to 11bdaca
Summary
Fixes two bugs causing OCP-75931 (SriovMetricsExporter Netdevice to Vfiopci Different PF) to fail intermittently with `context deadline exceeded` when creating the DPDK server pod.

1. Split reconcile race in `createMetricsTestResources`
The previous loop interleaved policy and network creation:
`policy1 → network1 → WaitForNADCreation → policy2 → ...`
The `WaitForNADCreation` delay between the two `SriovNetworkNodePolicy` creates caused the SR-IOV daemon to process them in separate reconcile generations:
`clientnetdevice` policy → device plugin registers only that resource → reports "Succeeded"
`WaitForSriovStable` sees gen N's "Succeeded" (before gen N+1 sets "InProgress") and returns early. The server DPDK pod sits Pending with `Insufficient openshift.io/servervfiopci` for the full 2-minute timeout.
Fix: Create all `SriovNetworkNodePolicy` resources first, then all `SriovNetwork` resources, so both policies land in the same reconcile generation.

2. Wrong ICMP expectation for Mellanox NICs
The test unconditionally expected ICMP to fail between client and server pods. This is correct for Intel NICs where `vfio-pci` gives DPDK exclusive VF ownership. However, `defineMetricsPolicy()` configures Mellanox NICs with `netdevice+RDMA` instead, so the kernel network stack remains active and ICMP succeeds.
Fix: Branch on the server policy's `DeviceType` to assert the correct outcome per NIC type.

Test plan
`--focus="Netdevice to Vfiopci"`: all three subtests (Same PF, Different PF, Different Worker) passed

🤖 Generated with Claude Code
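The device-type branch from fix 2 might look roughly like the sketch below. It is not the test's actual code: `expectICMPSuccess`, `checkICMP`, and `errPingFailed` are hypothetical names, and the device type is passed as a plain string rather than read from the real policy object. The only facts taken from the description are that ICMP is expected to fail when the server policy uses `vfio-pci` and to succeed otherwise.

```go
package main

import (
	"errors"
	"fmt"
)

// errPingFailed is a hypothetical placeholder for whatever error the
// test's ping helper returns when the server pod is unreachable.
var errPingFailed = errors.New("ping failed")

// expectICMPSuccess reports whether a kernel-level ping to the server pod
// should succeed. With vfio-pci the VF is owned by DPDK and the kernel
// stack is bypassed; with any other device type the kernel still answers.
func expectICMPSuccess(deviceType string) bool {
	return deviceType != "vfio-pci"
}

// checkICMP compares the observed ping result with the expectation for the
// server policy's device type and returns an error on a mismatch.
func checkICMP(deviceType string, pingErr error) error {
	if expectICMPSuccess(deviceType) {
		if pingErr != nil {
			return fmt.Errorf("ICMP should succeed with device type %q: %w", deviceType, pingErr)
		}
		return nil
	}
	if pingErr == nil {
		return fmt.Errorf("ICMP should fail with device type %q but succeeded", deviceType)
	}
	return nil
}

func main() {
	// vfio-pci server: a failed ping is the expected outcome.
	fmt.Println(checkICMP("vfio-pci", errPingFailed))
	// netdevice server (the Mellanox case in the description): ping must succeed.
	fmt.Println(checkICMP("netdevice", nil))
	// A successful ping against a vfio-pci server is reported as a mismatch.
	fmt.Println(checkICMP("vfio-pci", nil))
}
```

Keeping the expectation in one small predicate means the assertion follows whatever device type the policy helper chose for the NIC, instead of hard-coding the Intel behaviour.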
Summary by CodeRabbit
vfio-pci, succeed otherwise).