Skip to content

tests/ocp/sriov: fix reconcile race and ICMP check in metricsExporter#1433

Open
zhiqiangf wants to merge 1 commit into
rh-ecosystem-edge:mainfrom
zhiqiangf:fix-metrics-exporter-reconcile-race
Open

tests/ocp/sriov: fix reconcile race and ICMP check in metricsExporter#1433
zhiqiangf wants to merge 1 commit into
rh-ecosystem-edge:mainfrom
zhiqiangf:fix-metrics-exporter-reconcile-race

Conversation

@zhiqiangf

@zhiqiangf zhiqiangf commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes two bugs causing OCP-75931 (SriovMetricsExporter Netdevice to Vfiopci Different PF) to fail intermittently with context deadline exceeded when creating the DPDK server pod.

1. Split reconcile race in createMetricsTestResources

The previous loop interleaved policy and network creation:
policy1 → network1 → WaitForNADCreation → policy2 → ...

The WaitForNADCreation delay between the two SriovNetworkNodePolicy creates caused the SR-IOV daemon to process them in separate reconcile generations:

  • Gen N: only clientnetdevice policy → device plugin registers only that resource → reports "Succeeded"
  • Gen N+1: both policies → configures Port 2 VFs → restarts device plugin with both resources

WaitForSriovStable sees gen N's "Succeeded" (before gen N+1 sets "InProgress") and returns early. The server DPDK pod sits Pending with Insufficient openshift.io/servervfiopci for the full 2-minute timeout.

Fix: Create all SriovNetworkNodePolicy resources first, then all SriovNetwork resources, so both policies land in the same reconcile generation.

2. Wrong ICMP expectation for Mellanox NICs

The test unconditionally expected ICMP to fail between client and server pods. This is correct for Intel NICs where vfio-pci gives DPDK exclusive VF ownership. However, defineMetricsPolicy() configures Mellanox NICs with netdevice+RDMA instead, so the kernel network stack remains active and ICMP succeeds.

Fix: Branch on the server policy's DeviceType to assert the correct outcome per NIC type.

Test plan

  • Previously verified on wsfd-advnetlab244 (Mellanox CX6-DX) with --focus="Netdevice to Vfiopci" — all three subtests (Same PF, Different PF, Different Worker) passed

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Updated metrics export tests to conditionally validate ICMP connectivity based on the server policy’s device type (expected to fail for vfio-pci, succeed otherwise).
    • Improved test reliability by reordering test resource setup: create SR-IOV node policies for both client and server first, then provision networks afterward.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cad14a73-aa85-4059-8ea0-5dd1378adddc

📥 Commits

Reviewing files that changed from the base of the PR and between efc33a4 and 11bdaca.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/metricsExporter.go
✅ Files skipped from review due to trivial changes (1)
  • tests/ocp/sriov/tests/metricsExporter.go

📝 Walkthrough

Walkthrough

This PR refines the metrics exporter test in the SR-IOV test suite. It adds device-type-aware ICMP connectivity assertions and reorders test resource creation to improve synchronization of SR-IOV policy processing by the daemon.

Changes

Metrics Exporter Test Updates

Layer / File(s) Summary
Device-type-aware ICMP assertion
tests/ocp/sriov/tests/metricsExporter.go
ICMP connectivity check now conditionally expects failure only when the server device type is vfio-pci, and expects success for other device types.
Resource creation synchronization ordering
tests/ocp/sriov/tests/metricsExporter.go
SR-IOV node policies for both client and server are created in a first loop, then networks are created in a second loop, ensuring the SR-IOV daemon processes both policies in the same reconcile generation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • evgenLevin
  • ajaggapa
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and concisely describes the two main fixes: addressing a reconcile race condition and correcting ICMP connectivity checks in metricsExporter, matching the core changes in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=warning msg="The linter 'gomodguard' is deprecated (since v2.12.0) due to: new major version. Replaced by gomodguard_v2."
level=warning msg="Suggested new configuration:\nlinters:\n enable:\n - gomodguard_v2\n"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)

517-525: 💤 Low value

Consider clarifying the comment to reflect defensive verification.

The comment "Wait for NAD Creation" doesn't distinguish this from the earlier NAD wait at lines 507-510. This second wait is a defensive check after cluster stabilization to ensure NADs still exist (since the MCP wait may cause node disruptions). Consider updating the comment to something like "Verify NAD existence after cluster stabilization" for clarity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ocp/sriov/tests/metricsExporter.go` around lines 517 - 525, Update the
test comment that currently reads By("Wait for NAD Creation") to clarify this is
a defensive verification after cluster stabilization; replace it with a more
descriptive message such as By("Verify NAD existence after cluster
stabilization") immediately above the loop that calls nad.Pull(APIClient,
res.network.Object.Name, tsparams.TestNamespaceName) for metricsTestResource
entries (cRes, sRes) so the intent is clear and distinguished from the earlier
NAD wait.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/ocp/sriov/tests/metricsExporter.go`:
- Around line 517-525: Update the test comment that currently reads By("Wait for
NAD Creation") to clarify this is a defensive verification after cluster
stabilization; replace it with a more descriptive message such as By("Verify NAD
existence after cluster stabilization") immediately above the loop that calls
nad.Pull(APIClient, res.network.Object.Name, tsparams.TestNamespaceName) for
metricsTestResource entries (cRes, sRes) so the intent is clear and
distinguished from the earlier NAD wait.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd66108e-7495-4c21-94d4-9bcaafe5c888

📥 Commits

Reviewing files that changed from the base of the PR and between 716879a and d01d316.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/metricsExporter.go

@zhiqiangf zhiqiangf force-pushed the fix-metrics-exporter-reconcile-race branch from d01d316 to efc33a4 Compare June 15, 2026 11:06
Two fixes for the Netdevice-to-Vfiopci test (OCP-75931):

1. Split reconcile race: createMetricsTestResources was creating policies
   and networks in an interleaved loop (policy1 → network1 → NAD wait →
   policy2). The NAD wait gap caused the SR-IOV daemon to process the two
   policies in separate reconcile generations. The first generation reported
   "Succeeded" with only the client device plugin resource registered, and
   WaitForSriovStable returned prematurely. The server DPDK pod then sat
   Pending for the full 2-minute timeout with "Insufficient
   openshift.io/servervfiopci". Fix: create all policies first, then all
   networks, so both policies land in the same reconcile generation.

2. ICMP expectation: the test unconditionally expected ICMP to fail, but
   on Mellanox NICs defineMetricsPolicy() configures netdevice+RDMA instead
   of vfio-pci, so the kernel network stack remains active and ICMP succeeds.
   Fix: branch on the server policy's DeviceType to choose the correct
   assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zhiqiangf zhiqiangf force-pushed the fix-metrics-exporter-reconcile-race branch from efc33a4 to 11bdaca Compare June 24, 2026 04:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant