Skip to content

OCPBUGS-88531: Remove CPO-side restart logic for CNO operands#8751

Open
bryan-cox wants to merge 1 commit into
openshift:mainfrom
bryan-cox:OCPBUGS-88531-remove-cpo-restart
Open

OCPBUGS-88531: Remove CPO-side restart logic for CNO operands#8751
bryan-cox wants to merge 1 commit into
openshift:mainfrom
bryan-cox:OCPBUGS-88531-remove-cpo-restart

Conversation

@bryan-cox

@bryan-cox bryan-cox commented Jun 17, 2026

Copy link
Copy Markdown
Member

Summary

  • Remove redundant CPO-side restart-date annotation propagation for multus-admission-controller, network-node-identity, and ovnkube-control-plane
  • CNO handles restart-date propagation to all its operands directly, making the CPO-side logic redundant

The cleanupClusterNetworkOperatorResources function previously contained logic to read hypershift.openshift.io/restart-date from the HCP and patch it onto CNO-managed deployments. CNO reads the annotation from HCP itself and sets it as a pod template annotation on all rendered workloads, so the CPO-side logic is unnecessary duplication.

Test plan

Tested and verified. Test verification report here - https://bryan-cox.github.io/architectural-artifact-sharing/test-verification-report-ocpbugs-84239/index.html.

Summary by CodeRabbit

  • Refactor
    • Simplified cluster network operator resource cleanup by removing restart-annotation handling for CNO-managed components.
    • Cleanup now focuses directly on removing the ovnkube-sbdb Route when applicable, and deleting the ovnkube-master-external and ovnkube-master-internal Services.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 17, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@bryan-cox: This pull request references Jira Issue OCPBUGS-88531, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

The cleanupClusterNetworkOperatorResources function previously contained logic to read hypershift.openshift.io/restart-date from the HCP and patch it onto CNO-managed deployments. With the CNO fix, CNO reads the annotation from HCP itself and sets it as a pod template annotation on all rendered workloads, making the CPO-side logic redundant.

Test plan

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

In cleanupClusterNetworkOperatorResources on HostedControlPlaneReconciler, the 24-line block that read hyperv1.RestartDateAnnotation from the HostedControlPlane annotations and patched it onto the multus-admission-controller, network-node-identity, and ovnkube-control-plane deployments has been deleted. The function now proceeds directly to its remaining tasks: conditionally deleting the ovnkube-sbdb Route when hasRouteCap is true, and unconditionally deleting the ovnkube-master-external and ovnkube-master-internal Services.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: removing CPO-side restart logic for CNO operands, which aligns directly with the changeset that removes restart-date annotation propagation logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only controller code (hostedcontrolplane_controller.go), not test files. No Ginkgo tests to verify, so check is not applicable.
Test Structure And Quality ✅ Passed The PR does not contain Ginkgo test code. The test file added uses standard Go unit testing with Gomega assertions, not Ginkgo BDD patterns (Describe, It, BeforeEach, Eventually, etc.), so the Gink...
Topology-Aware Scheduling Compatibility ✅ Passed PR removes annotation propagation code from cleanupClusterNetworkOperatorResources; no new deployment manifests, scheduling constraints, affinity rules, topology spread constraints, nodeSelectors,...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. The changes only remove 24 lines of annotation propagation logic from hostedcontrolplane_controller.go, making this check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography detected. The PR removes annotation propagation logic, not crypto code. The file uses only standard Go crypto/rand for secure randomness.
Container-Privileges ✅ Passed PR modifies Go controller code only (hostedcontrolplane_controller.go), removing restart annotation logic. No Kubernetes manifests, container specs, or security configurations present.
No-Sensitive-Data-In-Logs ✅ Passed The modified cleanupClusterNetworkOperatorResources function contains no logging statements that expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data. Logging...
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Jun 17, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and jparrill June 17, 2026 10:12
@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.86%. Comparing base (392fd5a) to head (407530a).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8751      +/-   ##
==========================================
+ Coverage   41.75%   41.86%   +0.11%     
==========================================
  Files         758      759       +1     
  Lines       93981    94019      +38     
==========================================
+ Hits        39240    39362     +122     
+ Misses      51988    51902      -86     
- Partials     2753     2755       +2     
Files with missing lines Coverage Δ
...ostedcontrolplane/hostedcontrolplane_controller.go 46.04% <ø> (+0.33%) ⬆️
...or/controllers/hostedcontrolplane/manifests/cno.go 0.00% <ø> (ø)
...controllers/hostedcontrolplane/v2/cno/component.go 5.94% <ø> (+0.85%) ⬆️

... and 13 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.13% <ø> (+0.10%) ⬆️
cpo-hostedcontrolplane 44.25% <ø> (+0.15%) ⬆️
cpo-other 43.45% <ø> (ø)
hypershift-operator 52.02% <ø> (+0.19%) ⬆️
other 31.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bryan-cox bryan-cox force-pushed the OCPBUGS-88531-remove-cpo-restart branch from 923dc8e to 5682e1f Compare June 17, 2026 10:57
@jparrill

Copy link
Copy Markdown
Contributor

/hold

Dropped some comments. Thanks!

Holding until companion PR from cluster-network-operator#3030 got merged. Feel free to remove it when that happens.

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2026
jparrill

This comment was marked as duplicate.

}

func (r *HostedControlPlaneReconciler) cleanupClusterNetworkOperatorResources(ctx context.Context, hcp *hyperv1.HostedControlPlane, hasRouteCap bool) error {
if restartAnnotation, ok := hcp.Annotations[hyperv1.RestartDateAnnotation]; ok {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has zero callers after your PR — the three call sites you removed were its only consumers. Worth deleting it here to avoid leaving dead exported code in the package. Same for MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment and their constants in manifests/cno.go — all orphaned now.

Happy to help with a follow-up if you prefer to keep this PR minimal, but since it's all in the same ownership boundary it fits naturally here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it still called in

if err := r.cleanupClusterNetworkOperatorResources(ctx, hcp, r.ManagementClusterCapabilities.Has(capabilities.CapabilityRoute)); err != nil {
?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to comment the same - let's remove the function now because it's unused.

Isn't it still called in

The function SetRestartAnnotationAndPatch was only called in this cleanupClusterNetworkOperatorResources.
See:

ᐅ grep -Ri SetRestartAnnotationAndPatch                            
control-plane-operator/controllers/hostedcontrolplane/v2/cno/component.go:func SetRestartAnnotationAndPatch(ctx context.Context, crclient client.Client, dep *appsv1.Deployment, restartAnnotation string) error {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:			if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, multusDeployment, restartAnnotation); err != nil {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:		if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, networkNodeIdentityDeployment, restartAnnotation); err != nil {
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:		// CNO manages overall ovnkube-control-plane deployment. CPO manages restarts.  Note that cnov2.SetRestartAnnotationAndPatch just returns err == nil if the deployment isn't found (so if OVN isn't being used)
control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go:		if err := cnov2.SetRestartAnnotationAndPatch(ctx, r.Client, ovnKubeControlPlaneDeployment, restartAnnotation); err != nil {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed SetRestartAnnotationAndPatch and the orphaned manifest helpers (MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment) along with their unused constants and imports.

CNO now handles restart-date annotation propagation to all its operands
(multus-admission-controller, network-node-identity, ovnkube-control-plane,
cloud-network-config-controller) directly via the fix in CNO PR openshift#3030.

The CPO-side restart logic for these deployments is now redundant and
can be removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bryan-cox bryan-cox force-pushed the OCPBUGS-88531-remove-cpo-restart branch from 5682e1f to 407530a Compare June 18, 2026 10:42
@mgencur

mgencur commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2067559611954630656 | Cost: $3.23479775 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented Jun 18, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

Failed Tests

Total failed tests: 3

  • TestNodePool
  • TestNodePool/HostedCluster0
  • TestNodePool/HostedCluster0/EnsureHostedCluster

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Now I have all the evidence I need. Let me produce the final report.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded

DaemonSet global-pull-secret-syncer not ready: 2/3 pods ready
(repeated for ~20 minutes until timeout)

Secondary failure: daemonsets.apps "kubelet-config-verifier" already exists
(cascade from first timeout not cleaning up resources)

Summary

All 5 test failures originate from a single root cause: the global-pull-secret-syncer DaemonSet in the hosted cluster was stuck at 2/3 pods ready for the entire 20-minute timeout window. One pod on one of the 3 guest-cluster nodes (3 NodePools across us-east-1a/b/c, 1 replica each) could not reach Ready state. This caused TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout to time out after 1205 seconds. The subsequent subtest Check_if_the_config.json_is_correct_in_all_of_the_nodes then failed immediately because the kubelet-config-verifier DaemonSet created by the first subtest was never cleaned up due to the timeout. The remaining 3 failures are parent-test cascades. This failure is unrelated to the PR's changes, which only remove redundant CPO-side restart-date annotation propagation for CNO operands (multus-admission-controller, network-node-identity, ovnkube-control-plane) — none of which interact with the global-pull-secret-syncer DaemonSet or the EnsureGlobalPullSecret test.

Root Cause

The global-pull-secret-syncer DaemonSet (managed by HCCO — the Hosted Cluster Config Operator) had 3 desired pods (one per guest-cluster node) but only 2 reached Ready state. The third pod on one of the three nodes was unable to become ready within the 20-minute polling timeout (10-second intervals). The test calls waitForDaemonSetReady() which polls the DaemonSet every 10 seconds for up to 20 minutes, checking that numberReady == desiredNumberScheduled. With 2/3 ready throughout the entire window, the context deadline was exceeded.

The exact reason why the one pod could not become Ready is not determinable from the available log artifacts alone (the hypershift-analyze-e2e-failure step timed out waiting for artifacts, and no must-gather was analyzed in --fast mode). Possible causes include:

  • A node-level issue on the third guest-cluster node (resource pressure, kubelet issue, or storage problem preventing the pod's HostPath volume mount)
  • Image pull delay or failure for the global-pull-secret-syncer container on that specific node
  • A scheduling issue specific to that node (the build log shows numerous FailedScheduling events for other pods during the test run, indicating cluster resource pressure)

This is a pre-existing flaky test issue, not caused by PR #8751. The PR modifies only:

  1. hostedcontrolplane_controller.go — removes cleanupClusterNetworkOperatorResources restart-date annotation logic
  2. manifests/cno.go — removes MultusAdmissionControllerDeployment, NetworkNodeIdentityDeployment, OVNKubeControlPlaneDeployment manifest functions
  3. v2/cno/component.go — removes SetRestartAnnotationAndPatch function

None of these changes affect the global-pull-secret-syncer DaemonSet, the HCCO reconciliation, or the EnsureGlobalPullSecret test infrastructure.

Recommendations
  1. Re-trigger the e2e-aws job with /test e2e-aws — this failure is a transient infrastructure/scheduling issue unrelated to the PR's changes.

  2. No code changes needed in PR OCPBUGS-88531: Remove CPO-side restart logic for CNO operands #8751 — the PR correctly removes redundant CPO-side restart-date annotation propagation logic, and the test failure has no connection to the changed code paths.

  3. Consider filing a test robustness issue for TestCreateCluster/Main/EnsureGlobalPullSecret:

    • The test uses the first NodePool's Spec.Replicas (=1) as minExpected but the hosted cluster has 3 NodePools × 1 replica = 3 nodes, creating a mismatch between the expected and actual node count semantics.
    • The kubelet-config-verifier DaemonSet cleanup fails when the previous subtest times out, causing a cascade failure — the cleanup should use CreateOrUpdate instead of Create to handle leftover resources.
Evidence
Evidence Detail
Primary failure global-pull-secret-syncer DaemonSet stuck at 2/3 pods ready for 20 min → context deadline exceeded
Secondary failure kubelet-config-verifier DaemonSet already exists (cleanup not run due to timeout in prior subtest)
Cascade failures 3 parent tests (EnsureGlobalPullSecret, Main, TestCreateCluster) fail because subtests failed
Total failures 5 failures, 618 passes, 30 skipped out of 623 tests
Test duration 1205s for primary timeout subtest; 3890s total for TestCreateCluster
Hosted cluster e2e-clusters-xk62x/create-cluster-kjcg8 with 3 NodePools (us-east-1a/b/c) × 1 replica = 3 nodes
PR files changed hostedcontrolplane_controller.go, manifests/cno.go, v2/cno/component.go (CNO restart-date logic removal)
PR relevance None — PR removes CPO restart-date annotation for CNO operands; failure is in HCCO-managed global-pull-secret-syncer DaemonSet
Failed step e2e-aws-hypershift-aws-run-e2e-nested (test phase) — exited with code 1 after 1h21m44s
Cluster resource pressure Build log shows numerous FailedScheduling events: Too many pods, node anti-affinity conflicts

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@bryan-cox: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aks 407530a link true /test e2e-aks
ci/prow/e2e-aws 407530a link true /test e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants