OCPBUGS-88531: Remove CPO-side restart logic for CNO operands by bryan-cox · Pull Request #8751 · openshift/hypershift

bryan-cox · 2026-06-17T10:11:30Z

Summary

- Remove redundant CPO-side restart-date annotation propagation for multus-admission-controller, network-node-identity, and ovnkube-control-plane
- CNO handles restart-date propagation to all its operands directly, making the CPO-side logic redundant

The cleanupClusterNetworkOperatorResources function previously contained logic to read hypershift.openshift.io/restart-date from the HCP and patch it onto CNO-managed deployments. CNO reads the annotation from HCP itself and sets it as a pod template annotation on all rendered workloads, so the CPO-side logic is unnecessary duplication.
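The propagation described above can be pictured with a small sketch. It is not the CNO implementation: `propagateRestartDate` is a hypothetical helper and plain maps stand in for the HCP and pod template annotations. Only the annotation key comes from the description. Because a Deployment rolls out new pods whenever its pod template changes, writing the value into the template annotations is enough to trigger a restart.

```go
package main

import "fmt"

// restartDateAnnotation is the annotation key named in the PR description.
const restartDateAnnotation = "hypershift.openshift.io/restart-date"

// propagateRestartDate copies the restart-date annotation from the control
// plane's annotations onto a workload's pod template annotations and reports
// whether the template changed. Hypothetical helper, for illustration only.
func propagateRestartDate(hcpAnnotations, podTemplateAnnotations map[string]string) (map[string]string, bool) {
	value, ok := hcpAnnotations[restartDateAnnotation]
	if !ok {
		// Nothing to propagate; leave the template untouched.
		return podTemplateAnnotations, false
	}
	if current, exists := podTemplateAnnotations[restartDateAnnotation]; exists && current == value {
		// Already up to date, so no new rollout is needed.
		return podTemplateAnnotations, false
	}
	if podTemplateAnnotations == nil {
		podTemplateAnnotations = map[string]string{}
	}
	podTemplateAnnotations[restartDateAnnotation] = value
	return podTemplateAnnotations, true
}

func main() {
	hcp := map[string]string{restartDateAnnotation: "2026-06-17T10:00:00Z"}

	tmpl, changed := propagateRestartDate(hcp, nil)
	fmt.Println(changed, tmpl[restartDateAnnotation])

	// A second pass with the same value is a no-op.
	_, changed = propagateRestartDate(hcp, tmpl)
	fmt.Println(changed)
}
```

With a single owner writing this annotation, a second writer on the CPO side adds nothing, which is the duplication this PR removes.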

Test plan

Tested and verified. Test verification report: https://bryan-cox.github.io/architectural-artifact-sharing/test-verification-report-ocpbugs-84239/index.html

Summary by CodeRabbit

Refactor
- Simplified cluster network operator resource cleanup by removing restart-annotation handling for CNO-managed components.
- Cleanup now focuses directly on removing the ovnkube-sbdb Route when applicable, and deleting the ovnkube-master-external and ovnkube-master-internal Services.

openshift-merge-bot · 2026-06-17T10:11:34Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-06-17T10:11:39Z

@bryan-cox: This pull request references Jira Issue OCPBUGS-88531, which is valid.

3 validation(s) were run on this bug

- bug is open, matching expected state (open)
- bug target version (5.0.0) matches configured target version for branch (5.0.0)
- bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Remove redundant CPO-side restart-date annotation propagation for multus-admission-controller, network-node-identity, and ovnkube-control-plane

CNO now handles restart-date propagation to all its operands directly via OCPBUGS-88531: Propagate restart-date annotation to CNO operand pod templates cluster-network-operator#3030

The cleanupClusterNetworkOperatorResources function previously contained logic to read hypershift.openshift.io/restart-date from the HCP and patch it onto CNO-managed deployments. With the CNO fix, CNO reads the annotation from HCP itself and sets it as a pod template annotation on all rendered workloads, making the CPO-side logic redundant.

Test plan

Verify make build passes (confirmed locally)

Deploy with CNO PR openshift/cluster-network-operator#3030 and confirm all CNO operands restart when the restart-date annotation is set on the HostedCluster

CI: e2e-aws-ovn-hypershift-conformance, hypershift-e2e-aks

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-17T10:11:47Z

📝 Walkthrough

Walkthrough

In cleanupClusterNetworkOperatorResources on HostedControlPlaneReconciler, the 24-line block that read hyperv1.RestartDateAnnotation from the HostedControlPlane annotations and patched it onto the multus-admission-controller, network-node-identity, and ovnkube-control-plane deployments has been deleted. The function now proceeds directly to its remaining tasks: conditionally deleting the ovnkube-sbdb Route when hasRouteCap is true, and unconditionally deleting the ovnkube-master-external and ovnkube-master-internal Services.
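The remaining control flow described in the walkthrough can be sketched as follows. This is a simplified model, not the controller code: the `deleter` interface and `recordingDeleter` type are hypothetical stand-ins for the Kubernetes client, and only the resource names and the `hasRouteCap` condition are taken from the walkthrough.

```go
package main

import (
	"errors"
	"fmt"
)

// deleter abstracts "delete this named object". Hypothetical interface that
// stands in for the real Kubernetes client in this sketch.
type deleter interface {
	Delete(kind, name string) error
}

// recordingDeleter records what was deleted and can simulate one failure.
type recordingDeleter struct {
	deleted []string
	failOn  string
}

func (r *recordingDeleter) Delete(kind, name string) error {
	key := kind + "/" + name
	if key == r.failOn {
		return errors.New("delete failed: " + key)
	}
	r.deleted = append(r.deleted, key)
	return nil
}

// cleanupCNOResources models the function after the PR: the Route is removed
// only when the management cluster has the Route capability, and the two
// Services are always removed.
func cleanupCNOResources(d deleter, hasRouteCap bool) error {
	if hasRouteCap {
		if err := d.Delete("Route", "ovnkube-sbdb"); err != nil {
			return fmt.Errorf("failed to clean up ovnkube-sbdb route: %w", err)
		}
	}
	for _, svc := range []string{"ovnkube-master-external", "ovnkube-master-internal"} {
		if err := d.Delete("Service", svc); err != nil {
			return fmt.Errorf("failed to clean up service %s: %w", svc, err)
		}
	}
	return nil
}

func main() {
	d := &recordingDeleter{}
	if err := cleanupCNOResources(d, false); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(d.deleted)
}
```

Note that no restart-date handling appears anywhere in this flow, which matches the deletion the walkthrough describes.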

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: removing CPO-side restart logic for CNO operands, which aligns directly with the changeset that removes restart-date annotation propagation logic.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies only controller code (hostedcontrolplane_controller.go), not test files. No Ginkgo tests to verify, so check is not applicable.
Test Structure And Quality	✅ Passed	The PR does not contain Ginkgo test code. The test file added uses standard Go unit testing with Gomega assertions, not Ginkgo BDD patterns (Describe, It, BeforeEach, Eventually, etc.), so the Gink...
Topology-Aware Scheduling Compatibility	✅ Passed	PR removes annotation propagation code from cleanupClusterNetworkOperatorResources; no new deployment manifests, scheduling constraints, affinity rules, topology spread constraints, nodeSelectors,...
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No new Ginkgo e2e tests are added in this PR. The changes only remove 24 lines of annotation propagation logic from hostedcontrolplane_controller.go, making this check not applicable.
No-Weak-Crypto	✅ Passed	No weak cryptography detected. The PR removes annotation propagation logic, not crypto code. The file uses only standard Go crypto/rand for secure randomness.
Container-Privileges	✅ Passed	PR modifies Go controller code only (hostedcontrolplane_controller.go), removing restart annotation logic. No Kubernetes manifests, container specs, or security configurations present.
No-Sensitive-Data-In-Logs	✅ Passed	The modified cleanupClusterNetworkOperatorResources function contains no logging statements that expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data. Logging...
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

openshift-ci · 2026-06-17T10:11:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2026-06-17T10:19:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.86%. Comparing base (392fd5a) to head (407530a).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8751      +/-   ##
==========================================
+ Coverage   41.75%   41.86%   +0.11%     
==========================================
  Files         758      759       +1     
  Lines       93981    94019      +38     
==========================================
+ Hits        39240    39362     +122     
+ Misses      51988    51902      -86     
- Partials     2753     2755       +2

Files with missing lines	Coverage Δ
...ostedcontrolplane/hostedcontrolplane_controller.go	`46.04% <ø> (+0.33%)`	⬆️
...or/controllers/hostedcontrolplane/manifests/cno.go	`0.00% <ø> (ø)`
...controllers/hostedcontrolplane/v2/cno/component.go	`5.94% <ø> (+0.85%)`	⬆️

... and 13 files with indirect coverage changes

Flag	Coverage Δ
cmd-support	`35.13% <ø> (+0.10%)`	⬆️
cpo-hostedcontrolplane	`44.25% <ø> (+0.15%)`	⬆️
cpo-other	`43.45% <ø> (ø)`
hypershift-operator	`52.02% <ø> (+0.19%)`	⬆️
other	`31.56% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

jparrill · 2026-06-17T16:02:27Z

/hold

Dropped some comments. Thanks!

Holding until the companion PR, cluster-network-operator#3030, is merged. Feel free to remove the hold when that happens.

jparrill · 2026-06-17T16:03:48Z

 }

 func (r *HostedControlPlaneReconciler) cleanupClusterNetworkOperatorResources(ctx context.Context, hcp *hyperv1.HostedControlPlane, hasRouteCap bool) error {
-	if restartAnnotation, ok := hcp.Annotations[hyperv1.RestartDateAnnotation]; ok {


This function has zero callers after your PR — the three call sites you removed were its only consumers. Worth deleting it here to avoid leaving dead exported code in the package. Same for MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment and their constants in manifests/cno.go — all orphaned now.

Happy to help with a follow-up if you prefer to keep this PR minimal, but since it's all in the same ownership boundary it fits naturally here.

Isn't it still called in

hypershift/control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go

Line 1234 in 4755e9c

if err := r.cleanupClusterNetworkOperatorResources(ctx, hcp, r.ManagementClusterCapabilities.Has(capabilities.CapabilityRoute)); err != nil {

?

I was going to comment the same - let's remove the function now because it's unused.

> Isn't it still called in

The function SetRestartAnnotationAndPatch was only called in this cleanupClusterNetworkOperatorResources.
See:

ᐅ grep -Ri SetRestartAnnotationAndPatch
control-plane-operator/controllers/hostedcontrolplane/v2/cno/component.go:func SetRestartAnnotationAndPatch(ctx context.Context, crclient client.Client, dep *appsv1.Deployment, restartAnnotation string) error {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go: if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, multusDeployment, restartAnnotation); err != nil {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go: if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, networkNodeIdentityDeployment, restartAnnotation); err != nil {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go: // CNO manages overall ovnkube-control-plane deployment. CPO manages restarts. Note that cnov2.SetRestartAnnotationAndPatch just returns err == nil if the deployment isn't found (so if OVN isn't being used)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go: if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, ovnKubeControlPlaneDeployment, restartAnnotation); err != nil {

Done — removed SetRestartAnnotationAndPatch and the orphaned manifest helpers (MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment) along with their unused constants and imports.

CNO now handles restart-date annotation propagation to all its operands (multus-admission-controller, network-node-identity, ovnkube-control-plane, cloud-network-config-controller) directly via the fix in CNO PR openshift#3030. The CPO-side restart logic for these deployments is now redundant and can be removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mgencur · 2026-06-18T10:46:05Z

/lgtm

openshift-merge-bot · 2026-06-18T10:46:20Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

hypershift-jira-solve-ci · 2026-06-18T12:40:03Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2067559611954630656 | Cost: $3.23479775 | Failed step: hypershift-azure-run-e2e

View full analysis report

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

cwbotbot · 2026-06-18T12:55:03Z

Test Results

e2e-aws

Status: ❌ FAIL
Started: 2026-06-18T10:46:50Z
View Job
View Job History

e2e-aks

Status: ❌ FAIL
Started: 2026-06-18T10:46:50Z
View Job
View Job History

Failed Tests

Total failed tests: 3

TestNodePool
TestNodePool/HostedCluster0
TestNodePool/HostedCluster0/EnsureHostedCluster

hypershift-jira-solve-ci · 2026-06-18T13:15:22Z

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
Build ID: 2067559611992379392
Target: e2e-aws
PR: #8751 (OCPBUGS-88531: Remove CPO-side restart logic for CNO operands)

Test Failure Analysis

Error

failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

DaemonSet global-pull-secret-syncer not ready: 2/3 pods ready
(repeated for ~20 minutes until timeout)

Secondary failure: daemonsets.apps "kubelet-config-verifier" already exists
(cascade from first timeout not cleaning up resources)

Summary

All 5 test failures originate from a single root cause: the global-pull-secret-syncer DaemonSet in the hosted cluster was stuck at 2/3 pods ready for the entire 20-minute timeout window. One pod on one of the 3 guest-cluster nodes (3 NodePools across us-east-1a/b/c, 1 replica each) could not reach Ready state. This caused TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout to time out after 1205 seconds. The subsequent subtest Check_if_the_config.json_is_correct_in_all_of_the_nodes then failed immediately because the kubelet-config-verifier DaemonSet created by the first subtest was never cleaned up due to the timeout. The remaining 3 failures are parent-test cascades. This failure is unrelated to the PR's changes, which only remove redundant CPO-side restart-date annotation propagation for CNO operands (multus-admission-controller, network-node-identity, ovnkube-control-plane) — none of which interact with the global-pull-secret-syncer DaemonSet or the EnsureGlobalPullSecret test.

Root Cause

The global-pull-secret-syncer DaemonSet (managed by HCCO — the Hosted Cluster Config Operator) had 3 desired pods (one per guest-cluster node) but only 2 reached Ready state. The third pod on one of the three nodes was unable to become ready within the 20-minute polling timeout (10-second intervals). The test calls waitForDaemonSetReady() which polls the DaemonSet every 10 seconds for up to 20 minutes, checking that numberReady == desiredNumberScheduled. With 2/3 ready throughout the entire window, the context deadline was exceeded.
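The polling behavior described above can be sketched in a few lines. This is an illustration, not the test's actual `waitForDaemonSetReady()`: `waitForReady`, `daemonSetStatus`, and `pollFixedStatus` are hypothetical names, the status comes from a callback instead of the API server, and the timings are in milliseconds so the example finishes quickly (the analysis reports 10-second intervals and a 20-minute timeout).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// daemonSetStatus holds the two status counters the analysis mentions.
type daemonSetStatus struct {
	NumberReady            int32
	DesiredNumberScheduled int32
}

// waitForReady polls get every interval until the counters match or the
// context ends, whichever comes first.
func waitForReady(ctx context.Context, interval time.Duration, get func() daemonSetStatus) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		st := get()
		if st.NumberReady == st.DesiredNumberScheduled {
			return nil
		}
		select {
		case <-ctx.Done():
			// Wrap the context error so callers can still detect the deadline.
			return fmt.Errorf("DaemonSet not ready: %d/%d pods ready: %w",
				st.NumberReady, st.DesiredNumberScheduled, ctx.Err())
		case <-ticker.C:
		}
	}
}

// pollFixedStatus runs waitForReady against a status that never changes,
// which is the situation the failed job was in (2/3 for the whole window).
func pollFixedStatus(ready, desired int32) error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return waitForReady(ctx, 10*time.Millisecond, func() daemonSetStatus {
		return daemonSetStatus{NumberReady: ready, DesiredNumberScheduled: desired}
	})
}

func main() {
	fmt.Println(pollFixedStatus(3, 3))
	fmt.Println(pollFixedStatus(2, 3))
}
```

A status stuck at 2/3 can never satisfy the equality check, so the only possible exit is the deadline, which is why the job log shows the same "2/3 pods ready" message repeated until the timeout.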

The exact reason why the one pod could not become Ready is not determinable from the available log artifacts alone (the hypershift-analyze-e2e-failure step timed out waiting for artifacts, and no must-gather was analyzed in --fast mode). Possible causes include:

- A node-level issue on the third guest-cluster node (resource pressure, kubelet issue, or storage problem preventing the pod's HostPath volume mount)
- Image pull delay or failure for the global-pull-secret-syncer container on that specific node
- A scheduling issue specific to that node (the build log shows numerous FailedScheduling events for other pods during the test run, indicating cluster resource pressure)

This is a pre-existing flaky test issue, not caused by PR #8751. The PR modifies only:

- hostedcontrolplane_controller.go: removes cleanupClusterNetworkOperatorResources restart-date annotation logic
- manifests/cno.go: removes MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment manifest functions
- v2/cno/component.go: removes SetRestartAnnotationAndPatch function

None of these changes affect the global-pull-secret-syncer DaemonSet, the HCCO reconciliation, or the EnsureGlobalPullSecret test infrastructure.

Recommendations

1. Re-trigger the e2e-aws job with /test e2e-aws. This failure is a transient infrastructure/scheduling issue unrelated to the PR's changes.
2. No code changes are needed in PR #8751 (OCPBUGS-88531: Remove CPO-side restart logic for CNO operands). The PR correctly removes redundant CPO-side restart-date annotation propagation logic, and the test failure has no connection to the changed code paths.
3. Consider filing a test robustness issue for TestCreateCluster/Main/EnsureGlobalPullSecret:
   - The test uses the first NodePool's Spec.Replicas (=1) as minExpected but the hosted cluster has 3 NodePools × 1 replica = 3 nodes, creating a mismatch between the expected and actual node count semantics.
   - The kubelet-config-verifier DaemonSet cleanup fails when the previous subtest times out, causing a cascade failure; the test should use CreateOrUpdate instead of Create to handle leftover resources.
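The difference between Create and CreateOrUpdate in the last recommendation can be illustrated with an in-memory stand-in. Everything here is hypothetical (`store`, `createOrUpdate`, `errAlreadyExists`); it only models why an idempotent write tolerates a leftover object and is not the controller-runtime helper or the e2e test code.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists models the "already exists" error from the job log.
var errAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for the cluster API.
type store struct{ objects map[string]string }

func newStore() *store { return &store{objects: map[string]string{}} }

// Create fails when an object with the same name is already present.
func (s *store) Create(name, spec string) error {
	if _, ok := s.objects[name]; ok {
		return fmt.Errorf("daemonsets.apps %q %w", name, errAlreadyExists)
	}
	s.objects[name] = spec
	return nil
}

// Update overwrites an existing object and fails if it is missing.
func (s *store) Update(name, spec string) error {
	if _, ok := s.objects[name]; !ok {
		return fmt.Errorf("daemonsets.apps %q not found", name)
	}
	s.objects[name] = spec
	return nil
}

// createOrUpdate creates the object, falling back to an update when a
// leftover copy from an earlier run is still present.
func createOrUpdate(s *store, name, spec string) error {
	err := s.Create(name, spec)
	if errors.Is(err, errAlreadyExists) {
		return s.Update(name, spec)
	}
	return err
}

func main() {
	s := newStore()
	// Leftover from a subtest that timed out before cleaning up.
	_ = s.Create("kubelet-config-verifier", "run-1")

	fmt.Println(s.Create("kubelet-config-verifier", "run-2"))
	fmt.Println(createOrUpdate(s, "kubelet-config-verifier", "run-2"))
}
```

In this model the plain Create reproduces the secondary failure from the job, while the idempotent variant succeeds and replaces the stale spec, so a timeout in one subtest would no longer cascade into the next.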

Evidence

Evidence	Detail
Primary failure	`global-pull-secret-syncer` DaemonSet stuck at 2/3 pods ready for 20 min → context deadline exceeded
Secondary failure	`kubelet-config-verifier` DaemonSet already exists (cleanup not run due to timeout in prior subtest)
Cascade failures	3 parent tests (`EnsureGlobalPullSecret`, `Main`, `TestCreateCluster`) fail because subtests failed
Total failures	5 failures, 618 passes, 30 skipped out of 623 tests
Test duration	1205s for primary timeout subtest; 3890s total for TestCreateCluster
Hosted cluster	`e2e-clusters-xk62x/create-cluster-kjcg8` with 3 NodePools (us-east-1a/b/c) × 1 replica = 3 nodes
PR files changed	`hostedcontrolplane_controller.go`, `manifests/cno.go`, `v2/cno/component.go` (CNO restart-date logic removal)
PR relevance	None — PR removes CPO restart-date annotation for CNO operands; failure is in HCCO-managed `global-pull-secret-syncer` DaemonSet
Failed step	`e2e-aws-hypershift-aws-run-e2e-nested` (test phase) — exited with code 1 after 1h21m44s
Cluster resource pressure	Build log shows numerous `FailedScheduling` events: `Too many pods`, node anti-affinity conflicts

openshift-ci · 2026-06-18T14:17:27Z

@bryan-cox: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aks	`407530a`	link	true	`/test e2e-aks`
ci/prow/e2e-aws	`407530a`	link	true	`/test e2e-aws`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/needs-area label Jun 17, 2026

openshift-ci Bot added the approved and area/control-plane-operator labels and removed the do-not-merge/needs-area label Jun 17, 2026

openshift-ci Bot requested review from devguyio and jparrill June 17, 2026 10:12

bryan-cox force-pushed the OCPBUGS-88531-remove-cpo-restart branch from 923dc8e to 5682e1f Compare June 17, 2026 10:57

openshift-ci Bot added the do-not-merge/hold label Jun 17, 2026

jparrill reviewed Jun 17, 2026

mgencur mentioned this pull request Jun 18, 2026

CNTRLPLANE-3576: add RestartDateAnnotation propagation tests #8733

Open

4 tasks

bryan-cox force-pushed the OCPBUGS-88531-remove-cpo-restart branch from 5682e1f to 407530a Compare June 18, 2026 10:42

openshift-ci Bot assigned mgencur Jun 18, 2026

openshift-ci Bot added the lgtm label Jun 18, 2026

bryan-cox mentioned this pull request Jun 22, 2026

OCPBUGS-88531: Propagate restart-date annotation to CNO operand pod templates openshift/cluster-network-operator#3030

Open

6 tasks
