Skip to content

Bug 86073: validate duplicate vSphere failureDomain topology#10563

Open
chdeshpa-hue wants to merge 3 commits into
openshift:mainfrom
chdeshpa-hue:fix/OCPBUGS-86074-vsphere-duplicate-failuredomain-topology
Open

Bug 86073: validate duplicate vSphere failureDomain topology#10563
chdeshpa-hue wants to merge 3 commits into
openshift:mainfrom
chdeshpa-hue:fix/OCPBUGS-86074-vsphere-duplicate-failuredomain-topology

Conversation

@chdeshpa-hue
Copy link
Copy Markdown

@chdeshpa-hue chdeshpa-hue commented May 19, 2026

Summary

  • Adds validation in validateFailureDomains to detect when two or more vSphere failure domains have identical topology (same server, datacenter, computeCluster, datastore, networks, and resourcePool). Copy-pasted failure domains that differ only in name/region/zone labels provide no additional fault tolerance and can cause subtle scheduling issues.
  • Inventory paths (computeCluster, datastore, resourcePool) are canonicalized with filepath.Clean before comparison so that syntactically different but semantically equivalent paths (e.g. trailing slashes, double separators) are normalized.

Split from #10561 per reviewer feedback — this PR contains the vSphere portion only.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86073

Manual Test Results

Tested with a custom-built openshift-install binary against a live vSphere environment. When two failure domains share identical topology but different names, the installer now correctly rejects:

ERROR failed to load asset "Install Config": invalid "install-config.yaml" file:
  platform.vsphere.failureDomains[1]: Invalid value: "ocp3-h6c9h-fd-2":
    failure domain "ocp3-h6c9h-fd-2" has identical topology (same server, datacenter,
    computeCluster, datastore, networks, resourcePool) as "ocp3-h6c9h-fd-1";
    this provides no additional fault tolerance

Test Plan

  • Unit tests pass: go test ./pkg/types/vsphere/validation/
  • New test case: Multi-zone platform failureDomain duplicate topology
  • Existing test Resources in customized folders updated to use distinct ResourcePool paths
  • Manual testing against live vSphere environment

/cc @jcpowermac

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • vSphere validation now rejects failure domains that share the same normalized topology (including inventory path normalization and exact network lists/order), preventing invalid multi-zone cluster configurations.
  • Tests
    • Added negative and variant tests for duplicate topology detection, network ordering, path-normalization, hostgroup cases, and unique resource-path scenarios.

Adds a check in validateFailureDomains that detects when two or more
failure domains have identical topology (same server, datacenter,
computeCluster, datastore, networks, and resourcePool). Copy-pasted
failure domains that differ only in name/region/zone labels provide
no additional fault tolerance and can cause subtle scheduling issues.

Inventory paths (computeCluster, datastore, resourcePool) are
canonicalized with filepath.Clean before comparison so that
syntactically different but semantically equivalent paths (e.g.
trailing slashes, double separators) are normalized.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86074
Co-authored-by: Cursor <cursoragent@cursor.com>
@openshift-ci openshift-ci Bot requested a review from jcpowermac May 19, 2026 11:57
@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

Hi @chdeshpa-hue. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key: "tools"
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Canonicalizes vSphere inventory paths, adds helpers to compute a normalized topology key, and rejects failure domains whose canonicalized topology and ordered networks duplicate an earlier failure domain; tests updated for unique resource-pools and duplicate-topology cases.

Changes

Failure domain duplicate topology detection

Layer / File(s) Summary
Path import and normalization helpers
pkg/types/vsphere/validation/platform.go
Import path and add vsphereTopologyKey, normalizedTopologyKey(fd), and related helpers to canonicalize inventory paths and produce a comparable key.
Duplicate topology detection in validator
pkg/types/vsphere/validation/platform.go
Add topologyEntry and fdTopologies map in validateFailureDomains; compute normalized keys per failure domain and emit field.Invalid for duplicates with identical ordered Topology.Networks, preserving first-seen entry.
Canonicalization replacements
pkg/types/vsphere/validation/platform.go
Replace filepath.Clean with path.Clean for Topology.Datastore, Topology.ComputeCluster, Topology.ResourcePool, and Topology.Template.
Tests: unique resource pools and duplicate-topology cases
pkg/types/vsphere/validation/platform_test.go
Assign unique Topology.ResourcePool per failure-domain index; add/extend multi-zone tests covering duplicate topology rejection, network ordering/subset cases, path-normalization equivalence, hostgroup-specific duplication, and valid non-duplicate scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding validation to detect and reject duplicate vSphere failure domains with identical topology.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Test file uses standard Go testing (not Ginkgo), with 87 test cases all having static string names. No dynamic values in test titles; dynamic setup values correctly placed in test bodies.
Test Structure And Quality ✅ Passed Test file uses standard Go testing package with table-driven tests and testify assertions, not Ginkgo. The custom check is specifically for Ginkgo test code, so it does not apply to this PR.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All changes are in unit tests (pkg/types/vsphere/validation/platform_test.go) using standard Go testing package, not e2e framework.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are added. Changes are unit tests in pkg/types/vsphere/validation/ using Go's testing package, not Ginkgo, so SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only vSphere installer configuration validation code (not deployment manifests, operators, or scheduling constraints). Check applies to deployment/operator code only.
Ote Binary Stdout Contract ✅ Passed No process-level stdout writes found. The fmt.Printf in platform_test.go is inside t.Run() subtest, which is test-level stdout, not OTE binary contract.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR contains only validation code and standard Go unit tests with no IPv6 or connectivity issues.
No-Weak-Crypto ✅ Passed No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found. PR adds vSphere topology validation only.
Container-Privileges ✅ Passed PR contains only Go source and test files (platform.go, platform_test.go) with vSphere validation logic. No K8s manifests or container definitions present; check is not applicable.
No-Sensitive-Data-In-Logs ✅ Passed PR validation adds safe error messages with only non-sensitive config fields. No passwords, tokens, API keys, PII, or credentials are logged or exposed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign vr4manta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jcpowermac
Copy link
Copy Markdown
Contributor

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/types/vsphere/validation/platform.go (1)

361-366: ⚖️ Poor tradeoff

Consider using path.Clean instead of filepath.Clean for vSphere paths.

vSphere inventory paths always use forward slashes (URL-style), but filepath.Clean uses OS-specific path separators (backslashes on Windows). The path.Clean function from the "path" package is designed for slash-separated paths and would be more semantically correct.

That said, this pattern already exists throughout this file (lines 262, 317, 336, 346), so changing it here alone wouldn't be consistent. This is noted for potential future refactoring across the file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/vsphere/validation/platform.go` around lines 361 - 366, The
normalizePath closure uses filepath.Clean which applies OS-specific separators;
replace filepath.Clean with path.Clean (from the "path" package) so vSphere
inventory paths (slash-separated) are normalized correctly; update imports to
include "path" if missing and consider aligning the same change for the other
occurrences of filepath.Clean in this file (e.g., the similar closures/uses
around the functions referenced at lines near the existing normalizePath
instances).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/types/vsphere/validation/platform_test.go`:
- Around line 555-564: Update the duplicate-topology detection so HostGroup
failure domains are allowed to share all topology fields except
Topology.HostGroup: modify the topology-key construction function
vsphereFailureDomainTopologyKey (and any code that uses it to detect duplicates)
to include Topology.HostGroup when the failure domain ZoneType is
HostGroupFailureDomain (so keys differ for different HostGroup values), and add
the new test "Valid HostGroup failure domains with same topology but different
HostGroups" to pkg/types/vsphere/validation/platform_test.go to assert no error
is produced.

In `@pkg/types/vsphere/validation/platform.go`:
- Around line 353-375: The topology key builder vsphereFailureDomainTopologyKey
omits Topology.HostGroup so HostGroup-based failure domains can be treated as
duplicates; update vsphereFailureDomainTopologyKey to include
fd.Topology.HostGroup (normalized like other path-like fields, e.g., via
normalizePath or directly) into the fmt.Sprintf key (e.g., add a hostGroup=%s
segment) and ensure networks sorting/copying remains unchanged; also update the
related validation/error message that lists topology fields (the one referencing
hostGroup currently missing) to include hostGroup so error text correctly
reflects the fields compared for duplicate detection.

---

Nitpick comments:
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 361-366: The normalizePath closure uses filepath.Clean which
applies OS-specific separators; replace filepath.Clean with path.Clean (from the
"path" package) so vSphere inventory paths (slash-separated) are normalized
correctly; update imports to include "path" if missing and consider aligning the
same change for the other occurrences of filepath.Clean in this file (e.g., the
similar closures/uses around the functions referenced at lines near the existing
normalizePath instances).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54606e31-2d37-4340-adb4-6bb0b7b87799

📥 Commits

Reviewing files that changed from the base of the PR and between 0d798da and 0f0778c.

📒 Files selected for processing (2)
  • pkg/types/vsphere/validation/platform.go
  • pkg/types/vsphere/validation/platform_test.go

Comment thread pkg/types/vsphere/validation/platform_test.go
Comment thread pkg/types/vsphere/validation/platform.go Outdated
// vsphereFailureDomainTopologyKey builds a comparable key from the infrastructure-defining
// fields of a failure domain topology. Inventory paths are canonicalized with filepath.Clean
// so that syntactically different but equivalent paths (e.g. trailing slashes) are normalized.
func vsphereFailureDomainTopologyKey(fd vsphere.FailureDomain) string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will absolutely break vm-host zonal

I am not entirely excited with this function

if p == "" {
return ""
}
return filepath.Clean(p)
Copy link
Copy Markdown
Contributor

@jcpowermac jcpowermac May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong, paths within govc/govmomi are not file paths, use path.Clean()

@chdeshpa-hue chdeshpa-hue changed the title Bug 86074: validate duplicate vSphere failureDomain topology Bug 86073: validate duplicate vSphere failureDomain topology May 19, 2026
return filepath.Clean(p)
}

return fmt.Sprintf("server=%s;dc=%s;cluster=%s;ds=%s;nets=%s;rp=%s",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has to be a better way to determine if failure domain / topology is colliding.

Copy link
Copy Markdown
Contributor

@jcpowermac jcpowermac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was claude's review

Feedback on vsphereFailureDomainTopologyKey

The string-key approach with fmt.Sprintf("server=%s;dc=%s;cluster=%s;...") is fragile:

  1. Relies on field values never containing the delimiters (;, =, ,)
  2. Must be manually kept in sync if Topology fields change
  3. Ambiguous network join — "net-a,b" + "net-c" vs "net-a" + "b,net-c" produce the same comma-joined string

Suggested approach: use a comparable struct as the map key

Go supports struct map keys as long as all fields are comparable. Define a small struct with only the fields that define physical topology, and use it directly:

type failureDomainTopology struct {
	server         string
	datacenter     string
	computeCluster string
	datastore      string
	networks       string
	resourcePool   string
}

func normalizedTopology(fd vsphere.FailureDomain) failureDomainTopology {
	networks := make([]string, len(fd.Topology.Networks))
	copy(networks, fd.Topology.Networks)
	sort.Strings(networks)

	return failureDomainTopology{
		server:         fd.Server,
		datacenter:     fd.Topology.Datacenter,
		computeCluster: filepath.Clean(fd.Topology.ComputeCluster),
		datastore:      filepath.Clean(fd.Topology.Datastore),
		networks:       strings.Join(networks, "\x00"),
		resourcePool:   filepath.Clean(fd.Topology.ResourcePool),
	}
}

Then the validation simplifies to:

fdTopologies := make(map[failureDomainTopology]string)

// in the loop:
topo := normalizedTopology(failureDomain)
if prevName, exists := fdTopologies[topo]; exists {
    allErrs = append(allErrs, field.Invalid(...))
} else {
    fdTopologies[topo] = failureDomain.Name
}

Why not reuse FailureDomain or Topology directly? Both contain fields we want to exclude from comparison. FailureDomain has identity/label fields (Name, Region, Zone, RegionType, ZoneType). Topology has fields like Folder, Template, TagIDs, HostGroup that aren't relevant to the duplicate-infrastructure check. A dedicated struct makes the "these are the fields that define physical topology" decision explicit and reviewable — and it won't silently include new fields added later.

Update: I think Claude is wrong here, if we are installing vm-host zonal, hostgroup would need to be checked.

Benefits over the string key:

  • Type-safe — no delimiter injection risk
  • Compiler-checked — field additions are obvious at the struct definition, not buried in a format string
  • Uses \x00 as network separator — can't appear in vSphere inventory paths, eliminating join ambiguity

@chdeshpa-hue
Copy link
Copy Markdown
Author

Thanks for the review @jcpowermac — all valid points. Here's how I'll address them:

CI failures: Investigated both — they're infra flakes unrelated to this change:

  • devpreview: machine-config operator timeout after 69min (single legacy-topology FD, duplicate check never triggered)
  • hybrid-env: 4/1554 e2e test failures after successful install (cluster health flakes)

Will fix in V2:

  1. HostGroup: You're right, this will break vm-host zonal. Will include Topology.HostGroup in the comparison so FDs sharing the same cluster/datastore/network/resourcePool but pointing to different HostGroups are correctly treated as distinct fault zones.

  2. path.Clean vs filepath.Clean: Agreed — vSphere inventory paths are URL-style (always /), not OS paths. Will switch to path.Clean.

  3. Struct key instead of string key: Will refactor to use a comparable struct as the map key per your suggestion. Benefits: type-safe, no delimiter injection risk, compiler-checked, and uses \x00 as network separator to eliminate join ambiguity.

Will push V2 shortly.

- Refactor from fmt.Sprintf string key to comparable struct map key
  (type-safe, no delimiter injection risk, compiler-checked)
- Include Topology.HostGroup in comparison so vm-host zonal failure
  domains with different HostGroups are not falsely rejected
- Switch from filepath.Clean to path.Clean for vSphere inventory paths
  (URL-style, always forward slashes — not OS paths)
- Use \x00 as network separator to eliminate join ambiguity
- Add positive test: HostGroup FDs with same topology but different
  HostGroups must pass validation
- Update error message to include hostGroup in the list of compared fields

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86073
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 389-397: The current vsphereTopologyKey construction always sets
hostGroup from fd.Topology.HostGroup which lets non-HostGroup failure domains
evade duplicate-topology detection; modify the return so hostGroup is populated
only when the failure domain is a HostGroup zone (e.g., check fd.Type or
equivalent type field for the HostGroup enum/value) and otherwise set hostGroup
to an empty string (or nil-equivalent). Update the code around the
vsphereTopologyKey return (referencing vsphereTopologyKey, hostGroup,
fd.Topology.HostGroup, and fd.Type) so only HostGroup-type failure domains
contribute their HostGroup into the dedupe key.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5cb61b55-abb2-4163-9e76-e90b0be207b1

📥 Commits

Reviewing files that changed from the base of the PR and between 0f0778c and 7da08ac.

📒 Files selected for processing (2)
  • pkg/types/vsphere/validation/platform.go
  • pkg/types/vsphere/validation/platform_test.go

Comment thread pkg/types/vsphere/validation/platform.go
@chdeshpa-hue
Copy link
Copy Markdown
Author

/retest

datastore: normalizePath(fd.Topology.Datastore),
networks: strings.Join(networks, "\x00"),
resourcePool: normalizePath(fd.Topology.ResourcePool),
hostGroup: fd.Topology.HostGroup,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you won't be able to use hostGroup when the topology is for vm-host zonal.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcpowermac Could you clarify your comment? I want to make sure I understand correctly.

Currently, hostGroup is unconditionally included in the topology key. Are you suggesting:

It should only be included when ZoneType == HostGroup (for vm-host zonal), OR
There's a different issue with the current implementation?
My understanding from the enhancement doc is that vm-host zonal does use hostGroup to distinguish zones within a stretched cluster.

@jcpowermac
Copy link
Copy Markdown
Contributor

/test ?

@jcpowermac
Copy link
Copy Markdown
Contributor

/test e2e-vsphere-host-groups-ovn-techpreview

@chdeshpa-hue
Copy link
Copy Markdown
Author

/retest e2e-vsphere-host-groups-ovn-techpreview

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

@chdeshpa-hue: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test artifacts-images
/test e2e-agent-compact-ipv4
/test e2e-aws-ovn
/test e2e-aws-ovn-edge-zones-manifest-validation
/test e2e-aws-ovn-upi
/test e2e-azure-nat-gateway-single-zone
/test e2e-azure-ovn
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upi
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack-ovn
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi
/test gofmt
/test golint
/test govet
/test images
/test integration-tests
/test integration-tests-nodejoiner
/test okd-scos-images
/test openstack-manifests
/test shellcheck
/test unit
/test verify-codegen
/test verify-deps
/test verify-vendor
/test yaml-lint

The following commands are available to trigger optional jobs:

/test aws-private
/test azure-ovn-marketplace-images
/test azure-private
/test e2e-agent-4control-ipv4
/test e2e-agent-5control-ipv4
/test e2e-agent-compact-ipv4-appliance-diskimage
/test e2e-agent-compact-ipv4-iso-no-registry
/test e2e-agent-compact-ipv4-none-platform
/test e2e-agent-compact-ipv4-rhel10-techpreview
/test e2e-agent-compact-ipv6-minimaliso
/test e2e-agent-ha-dualstack
/test e2e-agent-ha-dualstack-rhel10-techpreview
/test e2e-agent-sno-ipv4-pxe
/test e2e-agent-sno-ipv6
/test e2e-agent-sno-ipv6-rhel10-techpreview
/test e2e-agent-two-node-fencing-ipv4
/test e2e-aws-byo-subnet-role-security-groups
/test e2e-aws-custom-dns-techpreview
/test e2e-aws-default-config
/test e2e-aws-eusc-techpreview
/test e2e-aws-overlay-mtu-ovn-1200
/test e2e-aws-ovn-custom-iam-profile
/test e2e-aws-ovn-devpreview
/test e2e-aws-ovn-dualstack-ipv4-primary-techpreview
/test e2e-aws-ovn-dualstack-ipv6-primary-techpreview
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-heterogeneous
/test e2e-aws-ovn-imdsv2
/test e2e-aws-ovn-pki-default-techpreview
/test e2e-aws-ovn-pki-rsa-techpreview
/test e2e-aws-ovn-proxy
/test e2e-aws-ovn-public-ipv4-pool
/test e2e-aws-ovn-public-ipv4-pool-disabled
/test e2e-aws-ovn-public-subnets
/test e2e-aws-ovn-rhcos10-techpreview
/test e2e-aws-ovn-shared-vpc-custom-security-groups
/test e2e-aws-ovn-shared-vpc-edge-zones
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-techpreview
/test e2e-aws-ovn-upgrade
/test e2e-aws-upi-proxy
/test e2e-azure-confidential-trustedlaunch
/test e2e-azure-custom-dns-techpreview
/test e2e-azure-default-config
/test e2e-azure-ovn-devpreview
/test e2e-azure-ovn-dualstack-ipv4-primary-techpreview
/test e2e-azure-ovn-dualstack-ipv6-primary-techpreview
/test e2e-azure-ovn-multidisk-techpreview
/test e2e-azure-ovn-resourcegroup
/test e2e-azure-ovn-shared-vpc
/test e2e-azure-ovn-techpreview
/test e2e-azure-ovn-upi
/test e2e-azurestack
/test e2e-azurestack-upi
/test e2e-crc
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-gcp-custom-dns
/test e2e-gcp-custom-endpoints
/test e2e-gcp-default-config
/test e2e-gcp-ovn-byo-vpc
/test e2e-gcp-ovn-devpreview
/test e2e-gcp-ovn-heterogeneous
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-xpn
/test e2e-gcp-secureboot
/test e2e-gcp-upgrade
/test e2e-gcp-upi-xpn
/test e2e-gcp-xpn-custom-dns
/test e2e-gcp-xpn-dedicated-dns-project
/test e2e-ibmcloud-ovn
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-swapped-hosts
/test e2e-metal-ipi-ovn-virtualmedia
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-single-node-live-iso
/test e2e-nutanix-ovn
/test e2e-openstack-ccpmso
/test e2e-openstack-dualstack
/test e2e-openstack-dualstack-upi
/test e2e-openstack-externallb
/test e2e-openstack-nfv-intel
/test e2e-openstack-proxy
/test e2e-openstack-singlestackv6
/test e2e-powervs-capi-ovn
/test e2e-vsphere-externallb-ovn
/test e2e-vsphere-host-groups-ovn-techpreview
/test e2e-vsphere-multi-vcenter-ovn
/test e2e-vsphere-ovn-devpreview
/test e2e-vsphere-ovn-disk-setup-techpreview
/test e2e-vsphere-ovn-hybrid-env
/test e2e-vsphere-ovn-multi-disk
/test e2e-vsphere-ovn-multi-network
/test e2e-vsphere-ovn-techpreview
/test e2e-vsphere-ovn-upi-multi-vcenter
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-ovn-zones-techpreview
/test e2e-vsphere-static-ovn
/test gcp-custom-endpoints-proxy-wif
/test gcp-private
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-installer-main-artifacts-images
pull-ci-openshift-installer-main-e2e-aws-ovn
pull-ci-openshift-installer-main-e2e-vsphere-multi-vcenter-ovn
pull-ci-openshift-installer-main-e2e-vsphere-ovn
pull-ci-openshift-installer-main-e2e-vsphere-ovn-devpreview
pull-ci-openshift-installer-main-e2e-vsphere-ovn-disk-setup-techpreview
pull-ci-openshift-installer-main-e2e-vsphere-ovn-hybrid-env
pull-ci-openshift-installer-main-e2e-vsphere-ovn-techpreview
pull-ci-openshift-installer-main-e2e-vsphere-ovn-zones
pull-ci-openshift-installer-main-gofmt
pull-ci-openshift-installer-main-golint
pull-ci-openshift-installer-main-govet
pull-ci-openshift-installer-main-images
pull-ci-openshift-installer-main-okd-scos-images
pull-ci-openshift-installer-main-shellcheck
pull-ci-openshift-installer-main-unit
pull-ci-openshift-installer-main-verify-codegen
pull-ci-openshift-installer-main-verify-deps
pull-ci-openshift-installer-main-verify-vendor
pull-ci-openshift-installer-main-yaml-lint
Details

In response to this:

/retest e2e-vsphere-host-groups-ovn-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@chdeshpa-hue
Copy link
Copy Markdown
Author

@jcpowermac can this PR now be approved to merge?

datacenter: fd.Topology.Datacenter,
computeCluster: normalizePath(fd.Topology.ComputeCluster),
datastore: normalizePath(fd.Topology.Datastore),
networks: strings.Join(networks, "\x00"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chdeshpa-hue instead of string concat, you can create an slice of strings with network names in them. The slice should be comparable and check order of each network in it to make sure it all matches. can you have a test for this scenario to validate it marks them as the same? We do want to maintain network order since that does matter. Also add some tests for different network configs.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest push. Here's what changed:

  1. Order-sensitive slice comparison: Replaced the hand-rolled sort.Strings() + strings.Join("\x00") approach with slices.Equal (from the already-imported k8s.io/utils/strings/slices). Two failure domains with [net-a, net-b] vs [net-b, net-a] are now correctly treated as distinct topologies since network adapter order matters.

  2. New tests added covering network scenarios:

    • Same networks, same order → flagged as duplicate ✓
    • Same networks, different order → NOT duplicate (order matters) ✓
    • Different networks → NOT duplicate ✓
    • Subset networks (different length) → NOT duplicate ✓
    • Path normalization (trailing slash, double slash) → correctly normalized ✓
    • HostGroup zonal (same vs different hostGroup) ✓
    • Differentiation by datastore, server, resourcePool ✓
  3. Approach: Since Go map keys cannot contain slices, vsphereTopologyKey holds only scalar fields. After a scalar key match, slices.Equal does the positional network comparison.

Also addressed @jcpowermac's feedback on path.Clean in the same commit.

Addresses reviewer feedback from @vr4manta and @jcpowermac:

- Networks are now compared positionally (order matters per
  +listType=atomic) instead of sorting and joining into a string key.
  Two failure domains with the same networks in different order are
  now correctly treated as distinct topologies.

- Replaced filepath.Clean with path.Clean for vSphere inventory paths,
  which use forward slashes regardless of OS (govmomi convention).

- The topology key struct no longer contains a networks field; instead
  a separate networksEqual() helper performs the slice comparison after
  the scalar key matches.

Test matrix covers: multi-net same/different order, subset lengths,
path normalization (trailing slash, double slash), HostGroup zonal
same/different hostGroup, and differentiation by all scalar fields.

Co-authored-by: Cursor <cursoragent@cursor.com>
@chdeshpa-hue chdeshpa-hue force-pushed the fix/OCPBUGS-86074-vsphere-duplicate-failuredomain-topology branch from b815c73 to 8657c8c Compare June 3, 2026 12:09
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/types/vsphere/validation/platform_test.go`:
- Around line 582-693: The duplicate-topology check currently includes
Topology.HostGroup unconditionally, allowing an evasion when two non-HostGroup
failure domains differ only by HostGroup; update the topology comparison in the
duplicate-detection logic (the function in platform.go that normalizes/compares
FailureDomain.Topology used to validate failure domains) so that
Topology.HostGroup is only considered when both FailureDomain.ZoneType ==
vsphere.HostGroupFailureDomain (otherwise ignore HostGroup for comparison), and
ensure this change respects ZoneType values like
vsphere.ComputeClusterFailureDomain and vsphere.HostGroupFailureDomain when
building the topology key for duplicate detection.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ad1cc252-6c72-4b5e-9e00-5ad04ac17133

📥 Commits

Reviewing files that changed from the base of the PR and between 7da08ac and b815c73.

📒 Files selected for processing (2)
  • pkg/types/vsphere/validation/platform.go
  • pkg/types/vsphere/validation/platform_test.go

Comment on lines +582 to +693
{
name: "Duplicate topology with same networks same order (multi-net)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[0].Topology.Networks = []string{"net-a", "net-b"}
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Server = p.FailureDomains[0].Server
return p
}(),
expectedError: `test-path\.failureDomains\[1\]: Invalid value: "test-east-2a": failure domain "test-east-2a" has identical topology .* as "test-east-1a"; this provides no additional fault tolerance`,
},
{
name: "Valid: same networks different order (multi-net, order matters)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[0].Topology.Networks = []string{"net-a", "net-b"}
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.Networks = []string{"net-b", "net-a"}
return p
}(),
},
{
name: "Valid: different networks (multi-net)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[0].Topology.Networks = []string{"net-a", "net-b"}
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.Networks = []string{"net-a", "net-c"}
return p
}(),
},
{
name: "Valid: subset networks (different length)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[0].Topology.Networks = []string{"net-a"}
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.Networks = []string{"net-a", "net-b"}
return p
}(),
},
{
name: "Duplicate topology with path normalization (trailing slash)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.ComputeCluster = "/test-datacenter/host/test-cluster/"
return p
}(),
expectedError: `test-path\.failureDomains\[1\]: Invalid value: "test-east-2a": failure domain "test-east-2a" has identical topology .* as "test-east-1a"; this provides no additional fault tolerance`,
},
{
name: "Duplicate topology with path normalization (double slash)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.Datastore = "/test-datacenter/datastore//test-datastore"
return p
}(),
expectedError: `test-path\.failureDomains\[1\]: Invalid value: "test-east-2a": failure domain "test-east-2a" has identical topology .* as "test-east-1a"; this provides no additional fault tolerance`,
},
{
name: "Valid: different datastore (not duplicate)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.Datastore = "/test-datacenter/datastore/other-datastore"
return p
}(),
},
{
name: "Valid: different server (not duplicate)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Server = "other-vcenter"
return p
}(),
},
{
name: "Duplicate topology HostGroup zonal same hostGroup",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[0].RegionType = vsphere.ComputeClusterFailureDomain
p.FailureDomains[0].ZoneType = vsphere.HostGroupFailureDomain
p.FailureDomains[0].Topology.HostGroup = "host-group-a"
p.FailureDomains[0].Topology.ResourcePool = "/test-datacenter/host/test-cluster/Resources/test-resourcepool"

p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].RegionType = vsphere.ComputeClusterFailureDomain
p.FailureDomains[1].ZoneType = vsphere.HostGroupFailureDomain
return p
}(),
expectedError: `test-path\.failureDomains\[1\]: Invalid value: "test-east-2a": failure domain "test-east-2a" has identical topology .* as "test-east-1a"; this provides no additional fault tolerance`,
},
{
name: "Valid: different resourcePool (not duplicate)",
platform: func() *vsphere.Platform {
p := validPlatform()
p.FailureDomains[1].Server = p.FailureDomains[0].Server
p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
p.FailureDomains[1].Topology.ResourcePool = "/test-datacenter/host/test-cluster/Resources/other-pool"
return p
}(),
},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add test coverage for HostGroup evasion on non-HostGroup zones.

The test suite comprehensively covers network ordering, path normalization, and HostGroup-zone scenarios. However, there's no test case for two non-HostGroup failure domains (e.g., ZoneType is empty/default or ComputeCluster) with identical topology but different Topology.HostGroup values. This scenario should be detected as a duplicate (since HostGroup is not topology-defining for non-HostGroup zones), but the current implementation at line 400 of platform.go would treat them as distinct.

🧪 Proposed test case (should expect error after fixing platform.go line 400)
{
	name: "Duplicate topology: non-HostGroup zones with different HostGroup values (evasion attempt)",
	platform: func() *vsphere.Platform {
		p := validPlatform()
		// Both are default ComputeCluster zones (not HostGroup zones)
		p.FailureDomains[0].Topology.HostGroup = "should-be-ignored-a"
		p.FailureDomains[1].Server = p.FailureDomains[0].Server
		p.FailureDomains[1].Topology = p.FailureDomains[0].Topology
		p.FailureDomains[1].Topology.HostGroup = "should-be-ignored-b"
		return p
	}(),
	// Should produce an error because HostGroup should not participate in dedup for non-HostGroup zones
	expectedError: `test-path\.failureDomains\[1\]: Invalid value: "test-east-2a": failure domain "test-east-2a" has identical topology .* as "test-east-1a"; this provides no additional fault tolerance`,
},

Note: This test will fail with the current implementation until the fix suggested in platform.go (conditional HostGroup inclusion) is applied.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/vsphere/validation/platform_test.go` around lines 582 - 693, The
duplicate-topology check currently includes Topology.HostGroup unconditionally,
allowing an evasion when two non-HostGroup failure domains differ only by
HostGroup; update the topology comparison in the duplicate-detection logic (the
function in platform.go that normalizes/compares FailureDomain.Topology used to
validate failure domains) so that Topology.HostGroup is only considered when
both FailureDomain.ZoneType == vsphere.HostGroupFailureDomain (otherwise ignore
HostGroup for comparison), and ensure this change respects ZoneType values like
vsphere.ComputeClusterFailureDomain and vsphere.HostGroupFailureDomain when
building the topology key for duplicate detection.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/types/vsphere/validation/platform.go (1)

394-401: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate hostGroup in dedupe key by zone type.

At Line 400, hostGroup is always part of vsphereTopologyKey. For non-HostGroup failure domains, that lets duplicate placement evade detection by setting different Topology.HostGroup values.

Suggested fix
 func normalizedTopologyKey(fd vsphere.FailureDomain) vsphereTopologyKey {
 	normalizePath := func(p string) string {
 		if p == "" {
 			return ""
 		}
 		return path.Clean(p)
 	}
 
+	hostGroup := ""
+	if fd.ZoneType == vsphere.HostGroupFailureDomain {
+		hostGroup = fd.Topology.HostGroup
+	}
+
 	return vsphereTopologyKey{
 		server:         fd.Server,
 		datacenter:     fd.Topology.Datacenter,
 		computeCluster: normalizePath(fd.Topology.ComputeCluster),
 		datastore:      normalizePath(fd.Topology.Datastore),
 		resourcePool:   normalizePath(fd.Topology.ResourcePool),
-		hostGroup:      fd.Topology.HostGroup,
+		hostGroup:      hostGroup,
 	}
 }

As per coding guidelines: “pkg/types/**/validation/**: Ensure validation errors are clear and actionable for the end user. Check that new validations do not break existing valid configs (backward compatibility).”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/vsphere/validation/platform.go` around lines 394 - 401, The dedupe
key currently always includes hostGroup (vsphereTopologyKey.hostGroup =
fd.Topology.HostGroup), which causes non-HostGroup failure domains to be treated
as distinct when only their HostGroup differs; change the key construction so
hostGroup is included only when the failure domain is a HostGroup-type domain
(e.g., check fd.Type or fd.Topology.Type == "HostGroup" depending on the
surrounding code) — otherwise set hostGroup to an empty string (or
nil-equivalent). Update the vsphereTopologyKey creation site to use
normalizePath(fd.Topology.HostGroup) only inside that conditional so duplicate
detection works correctly for other zone types.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/types/vsphere/validation/platform.go`:
- Around line 394-401: The dedupe key currently always includes hostGroup
(vsphereTopologyKey.hostGroup = fd.Topology.HostGroup), which causes
non-HostGroup failure domains to be treated as distinct when only their
HostGroup differs; change the key construction so hostGroup is included only
when the failure domain is a HostGroup-type domain (e.g., check fd.Type or
fd.Topology.Type == "HostGroup" depending on the surrounding code) — otherwise
set hostGroup to an empty string (or nil-equivalent). Update the
vsphereTopologyKey creation site to use normalizePath(fd.Topology.HostGroup)
only inside that conditional so duplicate detection works correctly for other
zone types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6e3d7c06-6a31-4f68-a772-6438209122af

📥 Commits

Reviewing files that changed from the base of the PR and between b815c73 and 8657c8c.

📒 Files selected for processing (2)
  • pkg/types/vsphere/validation/platform.go
  • pkg/types/vsphere/validation/platform_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/types/vsphere/validation/platform_test.go

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 3, 2026

@chdeshpa-hue: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-host-groups-ovn-techpreview 7da08ac link false /test e2e-vsphere-host-groups-ovn-techpreview
ci/prow/e2e-vsphere-ovn-devpreview 8657c8c link false /test e2e-vsphere-ovn-devpreview

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants