OCPBUGS-91650: bump consolidateAfter to 60s in karpenter e2e base NodePool #8820
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws |
|
No actionable comments were generated in the recent review. 🎉
Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (10 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/karpenter_test.go`:
- Line 1741: In the baseNodePool helper function, add a comment above the
ConsolidateAfter field assignment explaining the rationale for the 60-second
value. The comment should indicate that 60 seconds provides a reasonable buffer
for drift operations to initiate and helps prevent unnecessary consolidation
churn, making it clear for future maintainers whether this timing is critical
for test correctness or can be adjusted if performance needs optimization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 39bbd76d-db5d-481b-870c-074fa6c33d95
📒 Files selected for processing (1)
test/e2e/karpenter_test.go
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8820   +/-   ##
=======================================
  Coverage   42.50%   42.50%
=======================================
  Files         768      768
  Lines       95272    95272
=======================================
  Hits        40498    40498
  Misses      51971    51971
  Partials     2803     2803
|
Test Results
e2e-aws
e2e-aks
|
59b3104 to 937dcd5
|
/test e2e-aws |
With consolidateAfter set to 0s, drift replacement nodes become consolidation-eligible the instant they initialize. The emptiness controller can delete them before the drift command drains the source, causing the upgrade test to fail (OCPBUGS-91650).
60s gives the drift command time to begin eviction after the replacement is ready.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Max Cao <macao@redhat.com>
937dcd5 to e20ac54
|
/test e2e-aws |
AI Test Failure Analysis
Job:
Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
Test Failure Analysis Complete
Job Information
Test Failure Analysis
Error
Summary
This is a CI infrastructure failure, not a test or code failure. The Prow job pod (
Root Cause
The CI cluster was experiencing resource pressure at the time this job was submitted (2026-06-24 ~06:52 UTC). The pod could not be scheduled because no node in the 75–82 node cluster satisfied all scheduling constraints simultaneously:
The remaining eligible nodes were fully consumed by other CI workloads. Preemption could not help because the 77+ nodes where preemption was evaluated either had untolerated taints or didn't match affinity rules, and preemption cannot resolve those constraints. The pod was repeatedly evaluated (25+ FailedScheduling events across the 30-minute window) as the cluster node count fluctuated between 75 and 82, but no scheduling slot opened up before the timeout.
This is a transient CI infrastructure capacity issue. No code was built or tested; the failure is entirely pre-execution.
Recommendations
Evidence
|
|
/lgtm |
|
Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bryan-cox, fishereskew, maxcao13
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/pipeline required |
|
Scheduling tests matching the |
|
/test okd-scos-images |
|
/test e2e-azure-v2-self-managed |
|
/verified by e2e-aws,e2e-aws-autonode |
|
@maxcao13: This PR has been marked as verified by
|
/jira refresh |
|
@maxcao13: This pull request references Jira Issue OCPBUGS-91650, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
|
@maxcao13: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@maxcao13: Jira Issue Verification Checks: Jira Issue OCPBUGS-91650
Jira Issue OCPBUGS-91650 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
|
Fix included in release 5.0.0-0.nightly-2026-06-25-122140 |
|
I don't know if the 60 seconds is enough. Still hitting this issue in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_hypershift/8733/pull-ci-openshift-hypershift-main-e2e-aws/2071467605327089664, which is for this PR: #8733 |
With consolidateAfter set to 0s, drift replacement nodes become consolidation-eligible the instant they initialize. The emptiness controller can delete them before the drift command drains the source, causing the upgrade test to fail (OCPBUGS-91650).
60s gives the drift command time to begin eviction after the replacement is ready.