OCPBUGS-91650: bump consolidateAfter to 60s in karpenter e2e base NodePool by maxcao13 · Pull Request #8820 · openshift/hypershift

maxcao13 · 2026-06-23T18:07:18Z

With consolidateAfter set to 0s, drift replacement nodes become consolidation-eligible the instant they initialize. The emptiness controller can delete them before the drift command drains the source, causing the upgrade test to fail (OCPBUGS-91650).

60s gives the drift command time to begin eviction after the replacement is ready.
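
The race above can be illustrated with a simplified timing model. This is a sketch under stated assumptions, not Karpenter's implementation: the `consolidatable` function is hypothetical, it assumes an empty node becomes eligible once `consolidateAfter` has elapsed since initialization, and the 10s delay before drift starts evicting is made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// consolidatable is a hypothetical, simplified eligibility check: an empty
// node may be consolidated once consolidateAfter has elapsed since it
// initialized. Karpenter's real logic considers more signals than this.
func consolidatable(sinceInitialized, consolidateAfter time.Duration) bool {
	return sinceInitialized >= consolidateAfter
}

func main() {
	// Assumed for illustration only: the drift command starts evicting pods
	// from the source node 10s after the replacement node initializes.
	driftStartsEviction := 10 * time.Second

	for _, after := range []time.Duration{0, 60 * time.Second} {
		// The replacement node is still empty at both of these moments, so
		// "eligible" here means the emptiness path could delete it.
		atInit := consolidatable(0, after)
		atEviction := consolidatable(driftStartsEviction, after)
		fmt.Printf("consolidateAfter=%v eligibleAtInit=%v eligibleWhenEvictionStarts=%v\n",
			after, atInit, atEviction)
	}
}
```

In this model, 0s leaves the empty replacement eligible for the whole window before eviction begins, while 60s closes the window only if eviction starts within 60 seconds.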

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

Summary by CodeRabbit

Tests
- Updated node pool disruption consolidation configuration to use a 60-second delay instead of immediate consolidation.

openshift-merge-bot · 2026-06-23T18:07:21Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-06-23T18:07:25Z

@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-06-23T18:07:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

maxcao13 · 2026-06-23T18:07:41Z

/test e2e-aws
/test e2e-aws-autonode

coderabbitai · 2026-06-23T18:08:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 526fca60-d76d-49fd-b92d-716dc4797c15

📥 Commits

Reviewing files that changed from the base of the PR and between 937dcd5 and e20ac54.

📒 Files selected for processing (1)

test/e2e/karpenter_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/karpenter_test.go

📝 Walkthrough

Walkthrough

In test/e2e/karpenter_test.go, the baseNodePool helper function updates the karpenterv1.Disruption.ConsolidateAfter field from 0s to 60s. An explanatory comment is added noting that the delay prevents consolidation from racing with drift draining.
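
A minimal sketch of the shape of this change follows. The `Disruption` and `NodePoolSpec` types and the `baseNodePoolSpec` function are simplified stand-ins written for illustration; the real helper uses the Karpenter API types, which have more fields and may represent the duration differently.

```go
package main

import (
	"fmt"
	"time"
)

// Disruption and NodePoolSpec are hypothetical stand-ins for the Karpenter
// API types referenced in the walkthrough; they are not the real definitions.
type Disruption struct {
	ConsolidateAfter time.Duration
}

type NodePoolSpec struct {
	Disruption Disruption
}

// baseNodePoolSpec mirrors the role of the e2e fixture helper: every test
// that builds on it inherits the same disruption settings.
func baseNodePoolSpec() NodePoolSpec {
	return NodePoolSpec{
		Disruption: Disruption{
			// 60s keeps consolidation from racing with drift draining.
			// The previous value, 0s, made replacement nodes eligible
			// as soon as they initialized.
			ConsolidateAfter: 60 * time.Second,
		},
	}
}

func main() {
	fmt.Println(baseNodePoolSpec().Disruption.ConsolidateAfter)
}
```

Keeping the value and its rationale in one shared helper means a later adjustment, for example if 60s proves too short, touches a single place.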

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Container-Privileges	❌ Error	PR adds a new test pod with 'privileged: true' in test/e2e/karpenter_kubelet_checker_pod.yaml, flagging container privilege escalation per the check requirements.	While the privileged pod has justification for diagnostic/testing purposes (reading kubelet config), the custom check requires flagging any privileged: true in container manifests regardless of rationale.

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly reflects the main change: bumping consolidateAfter from 0s to 60s in the karpenter e2e base NodePool configuration.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies non-Ginkgo test file (test/e2e/karpenter_test.go uses t.Run, not Ginkgo's It/Describe). Check is not applicable since no Ginkgo test names were changed.
Test Structure And Quality	✅ Passed	PR modifies only baseNodePool helper (fixture factory), not Ginkgo test code. The custom check on test structure/quality is not applicable since this helper creates test data structures, not test l...
Topology-Aware Scheduling Compatibility	✅ Passed	Change is in test/e2e/karpenter_test.go, adjusting consolidateAfter timing parameter only. No scheduling constraints, pod affinity, topology spread, or replica count logic introduced.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR does not add new Ginkgo e2e tests. It only modifies the baseNodePool() helper function to change ConsolidateAfter from "0s" to "60s", which is a configuration change unrelated to IPv4 assum...
No-Weak-Crypto	✅ Passed	PR modifies only a Karpenter test configuration file, changing consolidateAfter timing parameter from 0s to 60s. No cryptographic code, weak crypto algorithms, custom crypto implementations, or sec...
No-Sensitive-Data-In-Logs	✅ Passed	The PR modifies test/e2e/karpenter_test.go baseNodePool function, changing ConsolidateAfter to 60s and adding comments. No logging of passwords, tokens, API keys, PII, or sensitive data was added.

Comment @coderabbitai help to get the list of available commands.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Line 1741: In the baseNodePool helper function, add a comment above the
ConsolidateAfter field assignment explaining the rationale for the 60-second
value. The comment should indicate that 60 seconds provides a reasonable buffer
for drift operations to initiate and helps prevent unnecessary consolidation
churn, making it clear for future maintainers whether this timing is critical
for test correctness or can be adjusted if performance needs optimization.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 39bbd76d-db5d-481b-870c-074fa6c33d95

📥 Commits

Reviewing files that changed from the base of the PR and between 9a16fd2 and 59b3104.

📒 Files selected for processing (1)

test/e2e/karpenter_test.go

codecov · 2026-06-23T18:17:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.50%. Comparing base (7eea4bd) to head (e20ac54).
⚠️ Report is 16 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #8820   +/-   ##
=======================================
  Coverage   42.50%   42.50%           
=======================================
  Files         768      768           
  Lines       95272    95272           
=======================================
  Hits        40498    40498           
  Misses      51971    51971           
  Partials     2803     2803

Flag	Coverage Δ
cmd-support	`35.46% <ø> (ø)`
cpo-hostedcontrolplane	`44.84% <ø> (ø)`
cpo-other	`44.32% <ø> (ø)`
hypershift-operator	`53.05% <ø> (ø)`
other	`31.69% <ø> (ø)`

Flags with carried forward coverage won't be shown.

cwbotbot · 2026-06-23T20:25:02Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-06-24T18:26:01Z

e2e-aks

Status: ✅ PASS
Started: 2026-06-24T18:26:02Z

maxcao13 · 2026-06-23T20:40:15Z

/test e2e-aws
/test e2e-aws-autonode

With consolidateAfter set to 0s, drift replacement nodes become consolidation-eligible the instant they initialize. The emptiness controller can delete them before the drift command drains the source, causing the upgrade test to fail (OCPBUGS-91650).

60s gives the drift command time to begin eviction after the replacement is ready.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>

maxcao13 · 2026-06-23T23:17:42Z

/test e2e-aws
/test e2e-aws-autonode

hypershift-jira-solve-ci · 2026-06-24T01:20:45Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2069560580762505216 | Cost: $3.90 | Failed step: hypershift-aws-run-e2e-nested

Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

hypershift-jira-solve-ci · 2026-06-24T07:37:24Z

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-okd-scos-images
Build ID: 2069675119105544192
Target: [images] (variant: okd-scos)
State: error
Pod: 4693f32e-7837-432b-956d-e41480534b8b (namespace: ci)
Duration: 30 minutes (06:52:55 → 07:22:55 UTC, 2026-06-24)

Test Failure Analysis

Error

Pod scheduling timeout. Pod remained in Pending phase for 30 minutes and was terminated.
FailedScheduling: 0/81 nodes are available: 62 node(s) had untolerated taint(s), 15 node(s) didn't match Pod's node affinity/selector, 3 Insufficient memory, 1 node(s) didn't satisfy existing pods anti-affinity rules. Preemption was not helpful.

Summary

This is a CI infrastructure failure, not a test or code failure. The Prow job pod (okd-scos-images variant) was never scheduled onto a node in the CI cluster. It remained in Pending state for the full 30-minute scheduling timeout and was then killed by Prow. No build artifacts, build logs, or JUnit results were produced because the ci-operator container never started. The PR's code changes (bumping consolidateAfter to 60s in karpenter e2e base NodePool) are entirely unrelated to this failure.

Root Cause

The CI cluster was experiencing resource pressure at the time this job was submitted (2026-06-24 ~06:52 UTC). The pod could not be scheduled because no node in the 75–82 node cluster satisfied all scheduling constraints simultaneously:

Untolerated taints (dominant blocker — ~75% of nodes): 58–63 nodes had taints that the pod did not tolerate. The pod only tolerates node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/memory-pressure, and node-role.kubernetes.io/ci-prowjobs-worker. All other tainted nodes (e.g., those reserved for other workload types) were excluded.
Node affinity/selector mismatch (15 nodes): The pod has a nodeSelector: {ci-workload: prowjobs} and a required node affinity for kubernetes.io/arch In [amd64, arm64] plus kubernetes.io/hostname NotIn [ip-10-0-162-214.ec2.internal]. 15 nodes did not match these selector/affinity rules.
Insufficient memory (3 nodes): 3 nodes that otherwise matched had insufficient memory to accommodate the pod's resource requests (~63 MiB for the main container plus sidecar/init container requests).
Anti-affinity conflict (1 node): 1 node was excluded due to existing pod anti-affinity rules.
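
The dominant blocker, untolerated taints, can be sketched as a simple key check. This is a simplified illustration, not the scheduler's code: the `untolerated` function is hypothetical and matches on taint key only, whereas real Kubernetes matching also considers operator, value, and effect. The `example.com/reserved` taint key is made up.

```go
package main

import "fmt"

// untolerated returns the node taint keys that the pod does not tolerate.
// A node with any untolerated taint is excluded from scheduling.
func untolerated(tolerated map[string]bool, nodeTaints []string) []string {
	var blocked []string
	for _, key := range nodeTaints {
		if !tolerated[key] {
			blocked = append(blocked, key)
		}
	}
	return blocked
}

func main() {
	// Toleration keys taken from the analysis above.
	tolerated := map[string]bool{
		"node.kubernetes.io/not-ready":               true,
		"node.kubernetes.io/unreachable":             true,
		"node.kubernetes.io/memory-pressure":         true,
		"node-role.kubernetes.io/ci-prowjobs-worker": true,
	}

	// A node carrying only a tolerated taint remains a candidate.
	fmt.Println(untolerated(tolerated, []string{"node-role.kubernetes.io/ci-prowjobs-worker"}))

	// A node carrying any other taint (hypothetical key) is excluded.
	fmt.Println(untolerated(tolerated, []string{"example.com/reserved"}))
}
```

Preemption evicts lower-priority pods to free resources, so it cannot change the result of a check like this one.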

The remaining eligible nodes were fully consumed by other CI workloads. Preemption could not help because the 77+ nodes where preemption was evaluated either had untolerated taints or didn't match affinity rules — constraints that preemption cannot resolve. The pod was repeatedly evaluated (25+ FailedScheduling events across the 30-minute window) as the cluster node count fluctuated between 75 and 82, but no scheduling slot opened up before the timeout.

This is a transient CI infrastructure capacity issue. No code was built or tested — the failure is entirely pre-execution.

Recommendations

Retry the job — This is a transient infrastructure issue. Simply re-triggering the okd-scos-images job with /retest or /test okd-scos-images should succeed when cluster load is lower.
No code changes needed — The PR's changes (bumping consolidateAfter to 60s) are completely unrelated to this scheduling failure. The ci-operator container never started, so no code was compiled or tested.
If retries consistently fail, escalate to the CI infrastructure team (OpenShift DPTP / Test Platform) as it may indicate sustained cluster capacity issues for the ci-workload: prowjobs node pool, particularly for multi-arch (amd64/arm64) workloads.

Evidence

Evidence	Detail
Job result	`error` — "Pod scheduling timeout." (prowjob.json `.status.description`)
Pod phase	`Pending` — never reached `Running`; no container ever started
Pod condition	`PodScheduled: False`, reason: `Unschedulable`
Node name	`(unscheduled)` — pod was never assigned to a node
Scheduling events	25+ `FailedScheduling` events from 06:52 to 07:22 UTC
Cluster size	75–82 nodes observed during scheduling attempts
Dominant blocker	58–63 nodes with untolerated taints (pod only tolerates `ci-prowjobs-worker` + standard taints)
Secondary blocker	15 nodes didn't match `nodeSelector: {ci-workload: prowjobs}` or arch affinity
Memory pressure	3 nodes had insufficient memory
Anti-affinity	1 node excluded by existing pod anti-affinity rules
Preemption	Not helpful — 77+ nodes blocked by taints/affinity (preemption cannot fix these)
Timeout	30 minutes (created 06:52:55Z → deleted 07:22:55Z)
Artifacts produced	Only `prowjob.json`, `podinfo.json`, `started.json`, `finished.json` — no build-log.txt, no JUnit, no test artifacts
Relation to PR	None — code was never built or tested

bryan-cox

/approve

fishereskew · 2026-06-24T18:20:08Z

/lgtm

openshift-merge-bot · 2026-06-24T18:20:24Z

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

openshift-ci · 2026-06-24T18:21:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, fishereskew, maxcao13

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

maxcao13 · 2026-06-24T18:25:28Z

/pipeline required

openshift-merge-bot · 2026-06-24T18:25:32Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

maxcao13 · 2026-06-24T18:49:54Z

/test okd-scos-images

maxcao13 · 2026-06-24T22:33:25Z

/test e2e-azure-v2-self-managed

maxcao13 · 2026-06-25T01:38:26Z

/verified by e2e-aws,e2e-aws-autonode

openshift-ci-robot · 2026-06-25T01:38:40Z

@maxcao13: This PR has been marked as verified by e2e-aws,e2e-aws-autonode.

maxcao13 · 2026-06-25T01:41:56Z

/jira refresh

openshift-ci-robot · 2026-06-25T01:42:01Z

@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

openshift-ci · 2026-06-25T02:06:54Z

@maxcao13: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-25T02:11:27Z

@maxcao13: Jira Issue Verification Checks: Jira Issue OCPBUGS-91650
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-91650 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

openshift-merge-robot · 2026-06-25T19:42:14Z

Fix included in release 5.0.0-0.nightly-2026-06-25-122140

mgencur · 2026-06-29T08:56:19Z

I don't know if the 60 seconds is enough. Still hitting this issue in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8733/pull-ci-openshift-hypershift-main-e2e-aws/2071467605327089664, which is for this PR: #8733

openshift-ci-robot added the jira/severity-important and jira/valid-reference labels Jun 23, 2026

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 23, 2026

openshift-ci-robot added the jira/invalid-bug label Jun 23, 2026

openshift-ci Bot added the do-not-merge/needs-area label Jun 23, 2026

openshift-ci Bot added the area/testing label and removed the do-not-merge/needs-area label Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026 (comment thread: test/e2e/karpenter_test.go)

hypershift-jira-solve-ci Bot mentioned this pull request Jun 23, 2026: CNTRLPLANE-3383: HO: Add HostedClusterDeleting condition to track deletion progress #8427 (Open, 6 tasks)

maxcao13 force-pushed the fix-karpenter-upgrade branch from 59b3104 to 937dcd5 Compare June 23, 2026 20:39

maxcao13 force-pushed the fix-karpenter-upgrade branch from 937dcd5 to e20ac54 Compare June 23, 2026 23:17

maxcao13 marked this pull request as ready for review June 24, 2026 06:52

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 24, 2026

openshift-ci Bot requested review from devguyio and enxebre June 24, 2026 06:53

This was referenced Jun 24, 2026:
OCPBUGS-91656: fix(test): add retry logic to GetLogs in Karpenter kubelet propagatio… #8805 (Merged)
CNTRLPLANE-3526: Add spec.monitoring API for metrics forwarding #8626 (Merged)

bryan-cox reviewed Jun 24, 2026

openshift-ci Bot added the approved label Jun 24, 2026

openshift-ci Bot assigned fishereskew Jun 24, 2026

openshift-ci Bot added the lgtm label Jun 24, 2026

fishereskew approved these changes Jun 24, 2026

openshift-ci-robot added the verified label Jun 25, 2026

openshift-ci-robot added the jira/valid-bug label Jun 25, 2026

openshift-ci-robot removed the jira/invalid-bug label Jun 25, 2026

openshift-merge-bot Bot merged commit f46af75 into openshift:main Jun 25, 2026
42 checks passed
