OCPBUGS-91650: bump consolidateAfter to 60s in karpenter e2e base NodePool #8820

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from maxcao13:fix-karpenter-upgrade
Jun 25, 2026

Conversation

@maxcao13

@maxcao13 maxcao13 commented Jun 23, 2026

Member

With consolidateAfter set to 0s, drift replacement nodes become consolidation-eligible the instant they initialize. The emptiness controller can delete them before the drift command drains the source, causing the upgrade test to fail (OCPBUGS-91650).

60s gives the drift command time to begin eviction after the replacement is ready.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Updated node pool disruption consolidation configuration to use a 60-second delay instead of immediate consolidation.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 23, 2026
@openshift-ci-robot

@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 23, 2026
@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 23, 2026
@maxcao13

Member Author

/test e2e-aws
/test e2e-aws-autonode

@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 526fca60-d76d-49fd-b92d-716dc4797c15

📥 Commits

Reviewing files that changed from the base of the PR and between 937dcd5 and e20ac54.

📒 Files selected for processing (1)
  • test/e2e/karpenter_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/karpenter_test.go

📝 Walkthrough

Walkthrough

In test/e2e/karpenter_test.go, the baseNodePool helper function updates the karpenterv1.Disruption.ConsolidateAfter field from 0s to 60s. An explanatory comment is added noting that the delay prevents consolidation from racing with drift draining.


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Container-Privileges | ❌ Error | PR adds a new test pod with 'privileged: true' in test/e2e/karpenter_kubelet_checker_pod.yaml, flagging container privilege escalation per the check requirements. | While the privileged pod has justification for diagnostic/testing purposes (reading kubelet config), the custom check requires flagging any privileged: true in container manifests regardless of rationale. |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly reflects the main change: bumping consolidateAfter from 0s to 60s in the karpenter e2e base NodePool configuration. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies non-Ginkgo test file (test/e2e/karpenter_test.go uses t.Run, not Ginkgo's It/Describe). Check is not applicable since no Ginkgo test names were changed. |
| Test Structure And Quality | ✅ Passed | PR modifies only baseNodePool helper (fixture factory), not Ginkgo test code. The custom check on test structure/quality is not applicable since this helper creates test data structures, not test l... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Change is in test/e2e/karpenter_test.go, adjusting consolidateAfter timing parameter only. No scheduling constraints, pod affinity, topology spread, or replica count logic introduced. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add new Ginkgo e2e tests. It only modifies the baseNodePool() helper function to change ConsolidateAfter from "0s" to "60s", which is a configuration change unrelated to IPv4 assum... |
| No-Weak-Crypto | ✅ Passed | PR modifies only a Karpenter test configuration file, changing consolidateAfter timing parameter from 0s to 60s. No cryptographic code, weak crypto algorithms, custom crypto implementations, or sec... |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR modifies test/e2e/karpenter_test.go baseNodePool function, changing ConsolidateAfter to 60s and adding comments. No logging of passwords, tokens, API keys, PII, or sensitive data was added. |


Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/karpenter_test.go`:
- Line 1741: In the baseNodePool helper function, add a comment above the
ConsolidateAfter field assignment explaining the rationale for the 60-second
value. The comment should indicate that 60 seconds provides a reasonable buffer
for drift operations to initiate and helps prevent unnecessary consolidation
churn, making it clear for future maintainers whether this timing is critical
for test correctness or can be adjusted if performance needs optimization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 39bbd76d-db5d-481b-870c-074fa6c33d95

📥 Commits

Reviewing files that changed from the base of the PR and between 9a16fd2 and 59b3104.

📒 Files selected for processing (1)
  • test/e2e/karpenter_test.go

Comment thread test/e2e/karpenter_test.go
@codecov

codecov Bot commented Jun 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.50%. Comparing base (7eea4bd) to head (e20ac54).
⚠️ Report is 16 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8820   +/-   ##
=======================================
  Coverage   42.50%   42.50%           
=======================================
  Files         768      768           
  Lines       95272    95272           
=======================================
  Hits        40498    40498           
  Misses      51971    51971           
  Partials     2803     2803           
| Flag | Coverage Δ |
|---|---|
| cmd-support | 35.46% <ø> (ø) |
| cpo-hostedcontrolplane | 44.84% <ø> (ø) |
| cpo-other | 44.32% <ø> (ø) |
| hypershift-operator | 53.05% <ø> (ø) |
| other | 31.69% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@cwbotbot

cwbotbot commented Jun 23, 2026

Test Results

e2e-aws

e2e-aks

@maxcao13 maxcao13 force-pushed the fix-karpenter-upgrade branch from 59b3104 to 937dcd5 Compare June 23, 2026 20:39
@maxcao13

Member Author

/test e2e-aws
/test e2e-aws-autonode

With consolidateAfter set to 0s, drift replacement nodes become
consolidation-eligible the instant they initialize. The emptiness
controller can delete them before the drift command drains the source,
causing the upgrade test to fail (OCPBUGS-91650).

60s gives the drift command time to begin eviction after the
replacement is ready.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
@maxcao13 maxcao13 force-pushed the fix-karpenter-upgrade branch from 937dcd5 to e20ac54 Compare June 23, 2026 23:17
@maxcao13

Member Author

/test e2e-aws
/test e2e-aws-autonode

@hypershift-jira-solve-ci

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2069560580762505216 | Cost: $3.8974346499999992 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@maxcao13 maxcao13 marked this pull request as ready for review June 24, 2026 06:52
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre June 24, 2026 06:53
@hypershift-jira-solve-ci

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-okd-scos-images
  • Build ID: 2069675119105544192
  • Target: [images] (variant: okd-scos)
  • State: error
  • Pod: 4693f32e-7837-432b-956d-e41480534b8b (namespace: ci)
  • Duration: 30 minutes (06:52:55 → 07:22:55 UTC, 2026-06-24)

Test Failure Analysis

Error

Pod scheduling timeout. Pod remained in Pending phase for 30 minutes and was terminated.
FailedScheduling: 0/81 nodes are available: 62 node(s) had untolerated taint(s), 15 node(s) didn't match Pod's node affinity/selector, 3 Insufficient memory, 1 node(s) didn't satisfy existing pods anti-affinity rules. Preemption was not helpful.

Summary

This is a CI infrastructure failure, not a test or code failure. The Prow job pod (okd-scos-images variant) was never scheduled onto a node in the CI cluster. It remained in Pending state for the full 30-minute scheduling timeout and was then killed by Prow. No build artifacts, build logs, or JUnit results were produced because the ci-operator container never started. The PR's code changes (bumping consolidateAfter to 60s in karpenter e2e base NodePool) are entirely unrelated to this failure.

Root Cause

The CI cluster was experiencing resource pressure at the time this job was submitted (2026-06-24 ~06:52 UTC). The pod could not be scheduled because no node in the 75–82 node cluster satisfied all scheduling constraints simultaneously:

  1. Untolerated taints (dominant blocker — ~75% of nodes): 58–63 nodes had taints that the pod did not tolerate. The pod only tolerates node.kubernetes.io/not-ready, node.kubernetes.io/unreachable, node.kubernetes.io/memory-pressure, and node-role.kubernetes.io/ci-prowjobs-worker. All other tainted nodes (e.g., those reserved for other workload types) were excluded.

  2. Node affinity/selector mismatch (15 nodes): The pod has a nodeSelector: {ci-workload: prowjobs} and a required node affinity for kubernetes.io/arch In [amd64, arm64] plus kubernetes.io/hostname NotIn [ip-10-0-162-214.ec2.internal]. 15 nodes did not match these selector/affinity rules.

  3. Insufficient memory (3 nodes): 3 nodes that otherwise matched had insufficient memory to accommodate the pod's resource requests (~63 MiB for the main container plus sidecar/init container requests).

  4. Anti-affinity conflict (1 node): 1 node was excluded due to existing pod anti-affinity rules.

The remaining eligible nodes were fully consumed by other CI workloads. Preemption could not help because the 77+ nodes where preemption was evaluated either had untolerated taints or didn't match affinity rules — constraints that preemption cannot resolve. The pod was repeatedly evaluated (25+ FailedScheduling events across the 30-minute window) as the cluster node count fluctuated between 75 and 82, but no scheduling slot opened up before the timeout.

This is a transient CI infrastructure capacity issue. No code was built or tested — the failure is entirely pre-execution.
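
The two dominant blockers in the root cause above can be illustrated with a small model. This is only a sketch under stated assumptions, not the Kubernetes scheduler: the `Node` and `Pod` types, the `rejectReason` and `tally` helpers, the `example.com/reserved` taint, and the `builds` label value are hypothetical, and taint effects, operators, and resource checks are ignored.

```go
package main

import "fmt"

// Node is a hypothetical, heavily simplified stand-in for a cluster node.
type Node struct {
	Name   string
	Taints []string // taint keys only; effects and values are ignored here
	Labels map[string]string
}

// Pod is a hypothetical stand-in holding only the constraints discussed above.
type Pod struct {
	Tolerations  map[string]bool // set of tolerated taint keys
	NodeSelector map[string]string
}

// rejectReason returns why the node cannot host the pod,
// or "" if the node passes both filters.
func rejectReason(p Pod, n Node) string {
	for _, t := range n.Taints {
		if !p.Tolerations[t] {
			return "untolerated taint"
		}
	}
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return "selector mismatch"
		}
	}
	return ""
}

// tally counts rejection reasons across nodes, similar in spirit to the
// per-reason totals reported in a FailedScheduling event.
func tally(p Pod, nodes []Node) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		if r := rejectReason(p, n); r != "" {
			counts[r]++
		}
	}
	return counts
}

func main() {
	pod := Pod{
		Tolerations:  map[string]bool{"node-role.kubernetes.io/ci-prowjobs-worker": true},
		NodeSelector: map[string]string{"ci-workload": "prowjobs"},
	}
	nodes := []Node{
		{Name: "a", Taints: []string{"example.com/reserved"}, Labels: map[string]string{"ci-workload": "prowjobs"}},
		{Name: "b", Labels: map[string]string{"ci-workload": "builds"}},
		{Name: "c", Taints: []string{"node-role.kubernetes.io/ci-prowjobs-worker"}, Labels: map[string]string{"ci-workload": "prowjobs"}},
	}
	fmt.Println(tally(pod, nodes))
}
```

Neither check in this model looks at what is already running on a node, which is consistent with the observation that preemption cannot resolve taint or selector rejections.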

Recommendations
  1. Retry the job — This is a transient infrastructure issue. Simply re-triggering the okd-scos-images job with /retest or /test okd-scos-images should succeed when cluster load is lower.

  2. No code changes needed — The PR's changes (bumping consolidateAfter to 60s) are completely unrelated to this scheduling failure. The ci-operator container never started, so no code was compiled or tested.

  3. If retries consistently fail, escalate to the CI infrastructure team (OpenShift DPTP / Test Platform) as it may indicate sustained cluster capacity issues for the ci-workload: prowjobs node pool, particularly for multi-arch (amd64/arm64) workloads.

Evidence

| Evidence | Detail |
|---|---|
| Job result | error: "Pod scheduling timeout." (prowjob.json .status.description) |
| Pod phase | Pending; never reached Running, and no container ever started |
| Pod condition | PodScheduled: False, reason: Unschedulable |
| Node name | (unscheduled); pod was never assigned to a node |
| Scheduling events | 25+ FailedScheduling events from 06:52 to 07:22 UTC |
| Cluster size | 75–82 nodes observed during scheduling attempts |
| Dominant blocker | 58–63 nodes with untolerated taints (pod only tolerates ci-prowjobs-worker + standard taints) |
| Secondary blocker | 15 nodes didn't match nodeSelector: {ci-workload: prowjobs} or arch affinity |
| Memory pressure | 3 nodes had insufficient memory |
| Anti-affinity | 1 node excluded by existing pod anti-affinity rules |
| Preemption | Not helpful: 77+ nodes blocked by taints/affinity (preemption cannot fix these) |
| Timeout | 30 minutes (created 06:52:55Z → deleted 07:22:55Z) |
| Artifacts produced | Only prowjob.json, podinfo.json, started.json, finished.json; no build-log.txt, no JUnit, no test artifacts |
| Relation to PR | None: code was never built or tested |

@bryan-cox bryan-cox left a comment

Member


/approve

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2026
@fishereskew

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2026
@openshift-merge-bot

Contributor

Second-stage tests were triggered manually. The pipeline can be controlled only manually until HEAD changes. Use command to trigger second stage.

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, fishereskew, maxcao13

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@maxcao13

Member Author

/pipeline required

@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@maxcao13

Member Author

/test okd-scos-images

@maxcao13

Member Author

/test e2e-azure-v2-self-managed

@maxcao13

Member Author

/verified by e2e-aws,e2e-aws-autonode

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 25, 2026
@openshift-ci-robot

@maxcao13: This PR has been marked as verified by e2e-aws,e2e-aws-autonode.

@maxcao13

Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 25, 2026
@openshift-ci-robot

@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 25, 2026
@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

@maxcao13: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit f46af75 into openshift:main Jun 25, 2026
42 checks passed
@openshift-ci-robot

@maxcao13: Jira Issue Verification Checks: Jira Issue OCPBUGS-91650
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-91650 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-merge-robot

Contributor

Fix included in release 5.0.0-0.nightly-2026-06-25-122140

@mgencur

mgencur commented Jun 29, 2026

Contributor

I don't know if the 60 seconds is enough. Still hitting this issue in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8733/pull-ci-openshift-hypershift-main-e2e-aws/2071467605327089664, which is for this PR: #8733

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing
  • jira/severity-important: Referenced Jira bug's severity is important for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

7 participants