CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation) #8730

Open
sdminonne wants to merge 1 commit into openshift:main from sdminonne:CNTRLPLANE-3553-osImageStream-controller

Conversation

@sdminonne

@sdminonne sdminonne commented Jun 12, 2026


Summary

Wire spec.osImageStream and status.osImageStream into the NodePool
reconciliation loop: validation, config hash (triggering rollouts on
stream change), token secret propagation, observation-based status
reporting, and boot image function parameter threading.

Aligns with the dual-stream RHEL NodePool enhancement
Phase 2 controller plumbing.

Dependencies

Changes

Stream resolution (stream.go, osstream.go)

  • GetRHELStream() returns:
    • The explicit spec.osImageStream.name when set (with validation)
    • "rhel-10" when unset and release >= 5.0
    • "rhel-9" when unset and release < 5.0 (always returns a concrete
      stream name so that downstream consumers like StreamForName() work
      correctly when legacy StreamMetadata is removed)
  • getRHELStream() wrapper in osstream.go extracts spec and version
    from NodePool/ReleaseImage and delegates to GetRHELStream()
  • validateOSImageStream() delegates to GetRHELStream() for
    consistent validation (no duplicate resolution logic)
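
The resolution rules listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolveStream` is a hypothetical stand-in for GetRHELStream(), the two-entry allowlist is an assumption, and the major version is parsed with the standard library instead of the semver parsing the controller uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Stream names as they appear in the PR description.
const (
	streamRHEL9  = "rhel-9"
	streamRHEL10 = "rhel-10"
)

// resolveStream is a hypothetical stand-in for GetRHELStream(). It returns the
// explicit stream when one is set and valid, otherwise a concrete default
// derived from the release major version.
func resolveStream(explicit, releaseVersion string) (string, error) {
	if explicit != "" {
		// Assumed allowlist: only the two streams named in the PR.
		if explicit != streamRHEL9 && explicit != streamRHEL10 {
			return "", fmt.Errorf("unsupported osImageStream %q", explicit)
		}
		return explicit, nil
	}
	// Simplified parsing: take everything before the first dot as the major.
	majorPart, _, _ := strings.Cut(releaseVersion, ".")
	major, err := strconv.Atoi(majorPart)
	if err != nil {
		return "", fmt.Errorf("failed to parse release version %q: %w", releaseVersion, err)
	}
	if major >= 5 {
		return streamRHEL10, nil
	}
	return streamRHEL9, nil
}

func main() {
	for _, v := range []string{"4.18.3", "5.0.0"} {
		stream, err := resolveStream("", v)
		fmt.Println(v, stream, err)
	}
}
```

Returning a concrete name even for the unset case is what lets downstream consumers treat the result uniformly.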

Validation (conditions.go)

  • osImageStream validation runs before NewConfigGenerator inside
    validMachineConfigCondition for fail-fast on invalid stream names
    before expensive MCO config generation

Config hash integration (config.go)

  • Add rhelStream to rolloutConfig, Hash(), and HashWithoutVersion()
  • Normalize rhelStream in NewConfigGenerator: when the explicit
    spec.osImageStream.name matches the version-derived default (e.g.
    "rhel-9" on a 4.x release), it is kept as "" so the hash doesn't
    change and no spurious rollout is triggered. Only a non-default stream
    produces a different hash
  • Keep resolvedRHELStream on ConfigGenerator (not rolloutConfig)
    for downstream consumers that need a concrete stream name (GCP, AWS
    AMI, token secret)
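
To make the normalization concrete, here is a small sketch under stated assumptions. `normalizeStream` and `rolloutHash` are hypothetical names, the hash inputs are reduced to three strings, and FNV-1a is used because the pre-merge checks on this PR mention it; the real rolloutConfig hashes more fields.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// normalizeStream returns "" when the resolved stream equals the
// version-derived default, so defaulted NodePools keep their previous hash.
func normalizeStream(resolved, versionDefault string) string {
	if resolved == versionDefault {
		return ""
	}
	return resolved
}

// rolloutHash is a hypothetical stand-in for Hash(): FNV-1a over the rollout
// inputs, mixing in the stream only when it is non-empty (an assumption about
// how "backward compatible when empty" is achieved).
func rolloutHash(mcoConfig, version, rhelStream string) string {
	h := fnv.New32a()
	h.Write([]byte(mcoConfig + version))
	if rhelStream != "" {
		h.Write([]byte(rhelStream))
	}
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	def := "rhel-9" // version-derived default on a 4.x release
	unset := rolloutHash("cfg", "4.18.3", "")
	explicitDefault := rolloutHash("cfg", "4.18.3", normalizeStream("rhel-9", def))
	nonDefault := rolloutHash("cfg", "4.18.3", normalizeStream("rhel-10", def))
	fmt.Println(unset == explicitDefault, unset == nonDefault)
}
```

Keeping the concrete name separately (resolvedRHELStream in the PR) avoids losing it to this normalization.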

Token secret (token.go)

  • Write os-stream key into the token secret for future ignition-server
    consumption (TODO: not yet read downstream)
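
As a rough illustration of the token secret write, the sketch below models the secret's data as a plain map. `setOSStream` is a hypothetical helper; the key name comes from the bullet above, and writing the key on every call, whether or not the map already exists, is an assumption based on the review feedback later in this conversation.

```go
package main

import "fmt"

// tokenSecretOSStreamKey mirrors the os-stream key named in the PR.
const tokenSecretOSStreamKey = "os-stream"

// setOSStream writes the resolved stream into the secret data. The map is
// allocated when nil, and the key is written unconditionally so that secrets
// created before this change also receive it on the next reconcile.
func setOSStream(data map[string][]byte, rhelStream string) map[string][]byte {
	if data == nil {
		data = map[string][]byte{}
	}
	data[tokenSecretOSStreamKey] = []byte(rhelStream)
	return data
}

func main() {
	existing := map[string][]byte{"token": []byte("abc")}
	updated := setOSStream(existing, "rhel-10")
	fmt.Println(string(updated[tokenSecretOSStreamKey]), len(updated))
}
```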

Status observation (version.go)

  • rhcosStreamFromOSImage() parses Machine NodeInfo.OSImage to
    determine RHEL generation (RHCOS 4xx → rhel-9, 5xx → rhel-10)
  • osImageStreamFromMachines() determines the observed stream using
    strict majority consensus (count > total/2) across the pool
  • setOSImageStreamStatus() sets status.osImageStream when a
    majority of machines report a consistent stream, aligning with the
    enhancement's observation-based status design
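
The observation logic above could look roughly like this. It is a sketch under assumptions: `streamFromOSImage` and `majorityStream` are hypothetical names, the OSImage string format (for example "Red Hat Enterprise Linux CoreOS 418.94.202501221327-0") and the 5xx sample are assumed, and the total counts every machine, including those with no recognisable OS image.

```go
package main

import (
	"fmt"
	"regexp"
)

// rhcosVersionRE captures the leading digit of a three-digit RHCOS version
// such as "418.94.202501221327-0". The OSImage format is an assumption.
var rhcosVersionRE = regexp.MustCompile(`CoreOS (\d)\d{2}\.`)

// streamFromOSImage maps RHCOS 4xx to rhel-9 and 5xx to rhel-10, returning ""
// for anything it does not recognise.
func streamFromOSImage(osImage string) string {
	m := rhcosVersionRE.FindStringSubmatch(osImage)
	if m == nil {
		return ""
	}
	switch m[1] {
	case "4":
		return "rhel-9"
	case "5":
		return "rhel-10"
	}
	return ""
}

// majorityStream returns the stream reported by a strict majority
// (count > total/2) of machines, or "" when no stream has one.
func majorityStream(osImages []string) string {
	counts := map[string]int{}
	for _, img := range osImages {
		if s := streamFromOSImage(img); s != "" {
			counts[s]++
		}
	}
	total := len(osImages)
	// At most one stream can exceed total/2, so map order does not matter.
	for s, n := range counts {
		if n > total/2 {
			return s
		}
	}
	return ""
}

func main() {
	rhel9 := "Red Hat Enterprise Linux CoreOS 418.94.202501221327-0"
	rhel10 := "Red Hat Enterprise Linux CoreOS 500.10.202601010000-0" // hypothetical
	fmt.Println(majorityStream([]string{rhel9, rhel10, rhel10}))
	fmt.Println(majorityStream([]string{rhel9, rhel10}) == "")
}
```

With an even split no stream is reported, which matches the "preserve previous status" case in the test plan.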

Boot image threading (aws.go, gcp.go)

  • Thread resolved stream through defaultNodePoolAMI,
    defaultNodePoolGCPImage, setAWSConditions, and their callers via
    getRHELStream() for consistent boot image resolution

What's not yet implemented (future work)

  • usesRunc guard (explicit rhel-10 + runc → error, implicit >=5.0 + runc → fallback to rhel-9)
  • Multi-stream boot image metadata parsing (CNTRLPLANE-3553)
  • Ignition server plumbing (OSImageStream CR generation in GetPayload())
  • StreamMetadata / OSStreams naming consolidation in ReleaseImage struct

Test plan

  • TestGetRHELStream — covers explicit stream, 4.x/5.x/6.x defaults, unparsable version
  • TestValidateOSImageStream — covers empty, valid, and invalid stream names
  • TestOSImageStreamHashStability — covers default stream producing same hash, non-default producing different hash
  • TestHash / TestHashWithoutVersion — covers non-default stream changing the hash
  • TestRhcosStreamFromOSImage — covers RHCOS 4xx/5xx parsing, unknown versions, unrecognised OS
  • TestOsImageStreamFromMachines — covers majority consensus: single replica, all same, majority during upgrade, even split, no NodeInfo, unrecognised OS
  • TestSetOSImageStreamStatus — covers integration with fake client: no machines, all RHEL 9, majority RHEL 10, preserve previous status
  • TestReconcileMachineDeploymentStatus — covers config annotation update
  • make verify passes clean

🤖 Generated with Claude Code

@openshift-merge-bot


Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 12, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 12, 2026


@sdminonne: This pull request references CNTRLPLANE-3553 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Wire spec.osImageStream into the NodePool reconciliation loop: config hash
(triggering rollouts on stream change), token secret propagation, status
reporting, and boot image function parameter threading.

DRAFT — depends on #8669 and #8719 merging first.
This PR will be rebased and conflict-resolved after those land.
Opening early for design review on the config hash, token secret,
and boot image threading approach.

Dependencies

This PR overlaps with both dependencies (#8669 and #8719) on condition constants and stream resolution logic.
After they merge, the overlapping pieces (osstream.go condition constants,
validOSImageStreamCondition allowlist) will be dropped and replaced with
the implementations from those PRs.

What this PR adds (unique to this PR)

  • Config hash integration (config.go): add rhelStream to rolloutConfig,
    Hash(), and HashWithoutVersion() — backward compatible when empty
  • Token secret (token.go): write os-stream key for downstream ignition
  • Status propagation (capi.go): set status.osImageStream on rollout
    completion in both Replace (MachineDeployment) and InPlace (MachineSet) paths
  • Boot image threading: add rhelStream parameter to defaultNodePoolAMI,
    defaultNodePoolGCPImage, and all callers (aws.go, gcp.go, token.go).
    The parameter is accepted but not yet used; it is marked with
    TODO(CNTRLPLANE-3553) (https://redhat.atlassian.net/browse/CNTRLPLANE-3553)
    until StreamForName() from #8669 (CNTRLPLANE-3552: Multi-stream CoreOS
    metadata parsing and stream resolution) is wired in

What overlaps (will be resolved on rebase)

Test plan

  • go test ./hypershift-operator/controllers/nodepool/... -count=1 — all pass
  • make lint-fix — 0 issues
  • go build ./... — full repo builds
  • CI will fail until dependencies merge — expected for draft

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 12, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR propagates a resolved RHEL OS image stream (rhelStream) through the full NodePool reconciliation stack. Two new API constants (OSImageStreamRHEL9, OSImageStreamRHEL10) are added alongside a CEL immutability rule for osImageStream. The defaultRHELStream, getRHELStream, and validateOSImageStream helpers are rewritten to delegate to the shared GetRHELStream helper. The resolved stream is threaded into AWS and GCP image resolution functions, written into token secrets under a new TokenSecretOSStreamKey constant, included in rollout hash computation (normalized to avoid spurious changes when the default is explicitly set), and used to set NodePool.Status.OSImageStream upon MachineDeployment/MachineSet rollout completion. An early validateOSImageStream check is added to validMachineConfigCondition.

Possibly related PRs

  • openshift/hypershift#8669: Introduced the GetRHELStream shared helper and constants that this PR now delegates to across defaultRHELStream, getRHELStream, and validateOSImageStream.
  • openshift/hypershift#8699: Reworks AWS NodePool AMI resolution flow in defaultNodePoolAMI/resolveAWSAMI/setAWSConditions, directly overlapping with this PR's changes to those same call paths.

Suggested reviewers

  • sjenning
  • devguyio
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | Assertions in osstream_test.go lack meaningful failure messages (e.g., g.Expect(err).ToNot(HaveOccurred()) with no message), inconsistent with quality patterns seen in config_test.go where assert... | Add failure messages to all Expect() assertions in osstream_test.go, e.g., g.Expect(err).ToNot(HaveOccurred(), "failed to get RHEL stream for test case: %s", tc.name). |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and specifically describes the main change: wiring osImageStream into the NodePool controller across hash, token, status, and validation components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the PR are stable and deterministic. No dynamic values (UUIDs, timestamps, pod names, namespaces, etc.) appear in test titles. Tests use static string fields from table-driven tes... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no scheduling constraints. Changes are limited to OS image stream threading through boot image resolution, config hashing, status propagation, and validation, not affecting Pod specs,... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR adds only standard Go unit tests (osstream_test.go with table-driven tests), not Ginkgo e2e tests. Custom check only applies to Ginkgo e2e tests; therefore not applicable. |
| No-Weak-Crypto | ✅ Passed | PR uses FNV-1a hashing for config tracking and contains no weak crypto (MD5, SHA1, DES, RC4, etc.), custom crypto implementations, or insecure secret comparisons. |
| Container-Privileges | ✅ Passed | PR contains no Kubernetes or container manifests (YAML/JSON/Dockerfile). All 14+ changed files are Go source code implementing controller logic; no privileged container configurations present. |
| No-Sensitive-Data-In-Logs | ✅ Passed | PR adds logging for osImageStream validation errors and writes stream names to token secrets. All logged/stored data are public OCP version identifiers (rhel-9, rhel-10) and public release versions... |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot added the following labels and removed the do-not-merge/needs-area label Jun 12, 2026:
  • area/api: Indicates the PR includes changes for the API
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • area/platform/aws: PR/issue for AWS (AWSPlatform) platform
  • area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 74.81481% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.31%. Comparing base (73da66e) to head (6806f1e).
⚠️ Report is 23 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...rshift-operator/controllers/nodepool/conditions.go | 0.00% | 13 Missing ⚠️ |
| hypershift-operator/controllers/nodepool/config.go | 64.00% | 6 Missing and 3 partials ⚠️ |
| ...erator/controllers/nodepool/nodepool_controller.go | 33.33% | 3 Missing and 1 partial ⚠️ |
| ...pershift-operator/controllers/nodepool/osstream.go | 82.35% | 2 Missing and 1 partial ⚠️ |
| ...ypershift-operator/controllers/nodepool/version.go | 92.68% | 2 Missing and 1 partial ⚠️ |
| hypershift-operator/controllers/nodepool/gcp.go | 75.00% | 1 Missing ⚠️ |
| hypershift-operator/controllers/nodepool/token.go | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8730      +/-   ##
==========================================
+ Coverage   43.19%   43.31%   +0.12%     
==========================================
  Files         767      772       +5     
  Lines       94914    95611     +697     
==========================================
+ Hits        41001    41417     +416     
- Misses      51050    51305     +255     
- Partials     2863     2889      +26     
| Files with missing lines | Coverage Δ |
| --- | --- |
| hypershift-operator/controllers/nodepool/aws.go | 80.93% <100.00%> (+0.76%) ⬆️ |
| hypershift-operator/controllers/nodepool/stream.go | 100.00% <100.00%> (ø) |
| hypershift-operator/controllers/nodepool/gcp.go | 66.21% <75.00%> (-0.30%) ⬇️ |
| hypershift-operator/controllers/nodepool/token.go | 82.70% <83.33%> (+0.10%) ⬆️ |
| ...pershift-operator/controllers/nodepool/osstream.go | 82.35% <82.35%> (ø) |
| ...ypershift-operator/controllers/nodepool/version.go | 94.11% <92.68%> (-0.97%) ⬇️ |
| ...erator/controllers/nodepool/nodepool_controller.go | 42.69% <33.33%> (-0.42%) ⬇️ |
| hypershift-operator/controllers/nodepool/config.go | 82.91% <64.00%> (-2.61%) ⬇️ |
| ...rshift-operator/controllers/nodepool/conditions.go | 53.40% <0.00%> (-0.53%) ⬇️ |

... and 14 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 36.67% <ø> (+0.23%) ⬆️ |
| cpo-hostedcontrolplane | 45.31% <ø> (ø) |
| cpo-other | 45.10% <ø> (ø) |
| hypershift-operator | 53.68% <74.81%> (+0.14%) ⬆️ |
| other | 31.69% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hypershift-operator/controllers/nodepool/config.go (1)

88-94: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread the resolved OS stream, not the raw spec field.

getRHELStream() is the contract for this feature: when spec.osImageStream is unset, the controller derives rhel-9/rhel-10 from the release. These sites currently cache/forward nodePool.Spec.OSImageStream.Name directly, so the value collapses to "" for defaulted NodePools. That leaves the hash path out of sync with status.osImageStream today, and it will feed the wrong stream into AWS/GCP image selection as soon as the dependent multi-stream metadata work starts using rhelStream after rebase.

Compute the effective stream once from getRHELStream(nodePool, releaseImage) and thread that value through rolloutConfig, CAPI, and the platform validation helpers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/config.go` around lines 88 - 94,
Compute the effective RHEL stream once by calling getRHELStream(nodePool,
releaseImage) and pass that computed value into rolloutConfig.rhelStream instead
of using nodePool.Spec.OSImageStream.Name; also propagate the same computed
value into any CAPI-related code paths and platform validation helper calls that
previously read spec.osImageStream so all consumers use the resolved stream
(e.g., update usages in rolloutConfig initialization, CAPI helpers, and platform
validation functions to accept/consume the new resolved rhelStream variable).
🧹 Nitpick comments (2)
hypershift-operator/controllers/nodepool/osstream_test.go (1)

17-80: ⚡ Quick win

Add test coverage for getRHELStream error path.

TestGetRHELStream only covers success cases. Add a test case for when releaseImage.Version() returns an invalid semver string, to verify that getRHELStream returns an error as expected (lines 26-29 in osstream.go handle this path, but it is untested).

📋 Proposed test case for invalid semver
 		expectedStream: "rhel-10",
 	},
+	{
+		name: "When spec.osImageStream.Name is empty and version is invalid, it should return error",
+		nodePool: &hyperv1.NodePool{
+			Spec: hyperv1.NodePoolSpec{},
+		},
+		releaseImage: &releaseinfo.ReleaseImage{
+			ImageStream: &imageapi.ImageStream{ObjectMeta: metav1.ObjectMeta{Name: "invalid-version"}},
+		},
+		expectErr: true,
+	},
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/osstream_test.go` around lines 17 -
80, TestGetRHELStream lacks coverage for the error path in getRHELStream when
releaseImage.Version() yields an invalid semver; add a new test case in
TestGetRHELStream that supplies a releaseinfo.ReleaseImage whose
ImageStream.ObjectMeta.Name is an invalid version string (e.g., "not-a-semver"),
call getRHELStream(nodePool, releaseImage) and assert that an error is returned
(Expect(err).To(HaveOccurred())); reference the existing test harness and types
(TestGetRHELStream, getRHELStream, releaseinfo.ReleaseImage,
imageapi.ImageStream/ObjectMeta.Name) so the new case is consistent with the
other table-driven cases.
hypershift-operator/controllers/nodepool/nodepool_controller.go (1)

407-409: ⚡ Quick win

Log or surface getRHELStream errors in all three status-update paths.

In nodepool_controller.go (lines 407-409), capi.go reconcileMachineDeploymentStatus (lines 617-619), and capi.go reconcileMachineSet (lines 1018-1020), errors from getRHELStream are silently ignored behind if err == nil guards. The shared root cause is missing error visibility: when OS stream resolution fails (e.g., invalid semver in release version), Status.OSImageStream remains unset or stale and no log or condition surfaces the issue to the user. As per coding guidelines, "Always check errors in Go — don't ignore them." Add logging or set a condition in each path to surface failures.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/nodepool_controller.go` around lines
407 - 409, getRHELStream errors are being ignored in three places
(nodepool_controller.go when setting nodePool.Status.OSImageStream, capi.go
reconcileMachineDeploymentStatus, and capi.go reconcileMachineSet); update each
location to handle the error path: when getRHELStream returns an error, log the
error with the controller logger (including err.Error()) and set a clear status
condition (e.g., "OSImageStreamResolutionFailed") on the owning object with the
error message so users see the failure, and when getRHELStream succeeds clear
that condition and set Status.OSImageStream as before; reference the
getRHELStream call sites and the Status.OSImageStream update points and ensure
status updates are persisted.

Source: Coding guidelines
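
One possible shape for the error handling requested above, as a minimal sketch. The `status` struct, `applyStream`, and `errBadVersion` are hypothetical and stand in for the NodePool status, the three call sites, and a getRHELStream failure; only the condition name OSImageStreamResolutionFailed is taken from the review comment.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadVersion simulates a stream-resolution failure (hypothetical).
var errBadVersion = errors.New("invalid release version")

// status is a simplified stand-in for the relevant NodePool status fields.
type status struct {
	OSImageStream string
	FailureReason string // e.g. "OSImageStreamResolutionFailed"
	FailureDetail string
}

// applyStream records the resolved stream on success. On failure it surfaces
// the error in the status and returns it instead of dropping it, leaving the
// previously observed stream untouched.
func applyStream(st *status, resolve func() (string, error)) error {
	stream, err := resolve()
	if err != nil {
		st.FailureReason = "OSImageStreamResolutionFailed"
		st.FailureDetail = err.Error()
		return fmt.Errorf("failed to resolve OS image stream: %w", err)
	}
	st.FailureReason, st.FailureDetail = "", ""
	st.OSImageStream = stream
	return nil
}

func main() {
	st := &status{OSImageStream: "rhel-9"}
	err := applyStream(st, func() (string, error) { return "", errBadVersion })
	fmt.Println(err, st.FailureReason, st.OSImageStream)
}
```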

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/osstream.go`:
- Around line 21-35: The getRHELStream function dereferences nodePool
(nodePool.Spec.OSImageStream.Name) and releaseImage (releaseImage.Version())
without nil checks; add defensive nil guards at the top of getRHELStream to
return a clear error if nodePool or releaseImage is nil, and also guard
nodePool.Spec or nodePool.Spec.OSImageStream as needed before accessing Name to
avoid panics, then proceed with semver.Parse on releaseImage.Version() only when
releaseImage is non-nil.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e87df744-44ee-4acb-9a91-4746ce49edd8

📥 Commits

Reviewing files that changed from the base of the PR and between 39f04eb and afe4b10.

⛔ Files ignored due to path filters (1)
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_conditions.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (13)
  • api/hypershift/v1beta1/nodepool_conditions.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go

Comment on lines +21 to +35
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
	if nodePool.Spec.OSImageStream.Name != "" {
		return nodePool.Spec.OSImageStream.Name, nil
	}

	version, err := semver.Parse(releaseImage.Version())
	if err != nil {
		return "", fmt.Errorf("failed to parse release image version %q: %w", releaseImage.Version(), err)
	}

	if version.Major >= 5 {
		return "rhel-10", nil
	}
	return "rhel-9", nil
}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add nil-safety guards to prevent panics.

getRHELStream dereferences nodePool (Line 22) and releaseImage (Line 26) without nil checks. If either parameter is nil, the function will panic. Current callers appear to pass valid pointers, but defensive validation would prevent runtime panics if the contract changes or a new caller is added.

🛡️ Proposed fix to add nil guards
 func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
+	if nodePool == nil {
+		return "", fmt.Errorf("nodePool cannot be nil")
+	}
+	if releaseImage == nil {
+		return "", fmt.Errorf("releaseImage cannot be nil")
+	}
 	if nodePool.Spec.OSImageStream.Name != "" {
 		return nodePool.Spec.OSImageStream.Name, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/osstream.go` around lines 21 - 35,
The getRHELStream function dereferences nodePool
(nodePool.Spec.OSImageStream.Name) and releaseImage (releaseImage.Version())
without nil checks; add defensive nil guards at the top of getRHELStream to
return a clear error if nodePool or releaseImage is nil, and also guard
nodePool.Spec or nodePool.Spec.OSImageStream as needed before accessing Name to
avoid panics, then proceed with semver.Parse on releaseImage.Version() only when
releaseImage is non-nil.

@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from 03a5cef to 8e027da Compare June 17, 2026 07:40
@sdminonne sdminonne changed the title CNTRLPLANE-3553: Wire osImageStream into NodePool controller (config hash, token, boot image) CNTRLPLANE-3553: Wire osImageStream into NodePool controller (hash, token, status, validation) Jun 17, 2026
@openshift-ci openshift-ci Bot added the area/cli Indicates the PR includes changes for CLI label Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
api/hypershift/v1beta1/nodepool_types.go (1)

51-60: 💤 Low value

Consider nil check for defensive safety.

The function dereferences nodePool at Line 52 without a nil check. While current callers in context snippets (conditions.go:390) pass valid pointers, adding a guard would prevent panics if a new caller is added or the contract changes.

🛡️ Optional defensive nil guard
 func validateOSImageStream(nodePool *hyperv1.NodePool) error {
+	if nodePool == nil {
+		return fmt.Errorf("nodePool cannot be nil")
+	}
 	name := nodePool.Spec.OSImageStream.Name
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/hypershift/v1beta1/nodepool_types.go` around lines 51 - 60, Locate the
function that dereferences the nodePool parameter (likely in the
nodepool_types.go file around the specified line range) and add a nil check
guard before the dereference operation. The check should verify that nodePool is
not nil before attempting to access its fields or methods, returning an
appropriate error or early exit if the pointer is nil. This defensive check
prevents potential panics if the function contract changes or new callers pass
nil values in the future.
hypershift-operator/controllers/nodepool/capi_test.go (2)

2771-2773: ⚡ Quick win

Add nil-safety to OSImageStream assertion.

The assertion at line 2772 directly accesses nodePool.Status.OSImageStream.Name without first checking if nodePool.Status.OSImageStream is nil. If the status field is unexpectedly nil when expectedOSImageStream is non-empty, this will panic rather than provide a clear test failure message.

Safer assertion pattern
 if tc.expectedOSImageStream != "" {
+    g.Expect(nodePool.Status.OSImageStream).ToNot(BeNil(), "OSImageStream should be set when stream is expected")
     g.Expect(nodePool.Status.OSImageStream.Name).To(Equal(tc.expectedOSImageStream))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2771 -
2773, In the test assertion block where tc.expectedOSImageStream is checked, add
a nil-safety check before accessing the Name field. When
tc.expectedOSImageStream is non-empty, first verify that
nodePool.Status.OSImageStream is not nil before attempting to access
nodePool.Status.OSImageStream.Name in the Expect call. This will prevent a panic
and provide a clear test failure message if the OSImageStream field is
unexpectedly nil.

2636-2674: ⚡ Quick win

Add test case to verify OSImageStream status population on rollout completion.

The test infrastructure includes an expectedOSImageStream field and conditional assertion (lines 2771-2773), but no test case actually verifies that nodePool.Status.OSImageStream is populated when the MachineDeployment completes. The test case at line 2652 sets expectedOSImageStream: "", which skips the assertion entirely.

According to the layer description, reconcileMachineDeploymentStatus should call getRHELStream and populate nodePool.Status.OSImageStream when the deployment reaches completion. Add a test case that sets a non-empty expectedOSImageStream value and verifies the status field is correctly populated.

Consider also updating the test name at line 2652 to reflect what is actually being verified, or add a separate test case specifically for OSImageStream status population.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@hypershift-operator/controllers/nodepool/capi_test.go` around lines 2636 -
2674, Add a new test case to the testCases slice in the
TestReconcileMachineDeploymentStatus function that verifies OSImageStream status
population when a MachineDeployment completes. This test case should have a
non-empty expectedOSImageStream value (such as a valid RHEL stream identifier)
and match the completion conditions of the existing test case (all replica
counts at 3, ObservedGeneration matching Generation). This will ensure the
conditional assertion at lines 2771-2773 that checks
nodePool.Status.OSImageStream is actually exercised with a meaningful
verification rather than being skipped due to an empty expected value.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 406-408: In the nodepool_controller.go file, the call to
getRHELStream is silently ignoring errors by only handling the success case with
if err == nil. Instead of dropping the error, properly handle failure cases by
either logging the error with context and returning it to trigger
reconciliation, or updating the NodePool status to reflect the error condition.
This ensures that stream-resolution failures are visible and the status remains
synchronized with the actual state rather than remaining stale and unset.

In `@hypershift-operator/controllers/nodepool/token.go`:
- Around line 358-359: The assignment of TokenSecretOSStreamKey to
tokenSecret.Data only executes within a conditional block that checks if
tokenSecret.Data is nil, meaning existing token secrets never get the os-stream
field populated on reconciliation. Move the line
tokenSecret.Data[TokenSecretOSStreamKey] = []byte(t.rhelStream) outside of the
nil-check conditional so it runs on every reconciliation, ensuring all token
secrets have the os-stream field persisted consistently. Additionally, add unit
tests to verify that the os-stream field is set during both initial creation and
subsequent reconciliations, and add e2e tests to validate this behavior impacts
consumer behavior correctly.
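
The token.go finding above can be illustrated with a minimal sketch. It assumes a simplified `secret` type in place of `corev1.Secret` and a hypothetical `reconcileTokenSecret` helper; only the `os-stream` key name comes from this thread. The map is initialized only when nil, but the key is written on every call, so existing secrets are backfilled.

```go
package main

import "fmt"

// osStreamKey is a hypothetical stand-in for TokenSecretOSStreamKey.
const osStreamKey = "os-stream"

// secret is a simplified stand-in for corev1.Secret (assumption).
type secret struct {
	Data map[string][]byte
}

// reconcileTokenSecret initializes Data only when it is nil, but writes the
// os-stream key unconditionally so pre-existing secrets are updated too.
func reconcileTokenSecret(s *secret, rhelStream string) {
	if s.Data == nil {
		s.Data = map[string][]byte{}
	}
	s.Data[osStreamKey] = []byte(rhelStream)
}

func main() {
	// An existing secret whose Data map is already populated.
	existing := &secret{Data: map[string][]byte{"token": []byte("abc")}}
	reconcileTokenSecret(existing, "rhel-9")
	fmt.Printf("%s\n", existing.Data[osStreamKey])
}
```

A unit test for the real function would presumably cover the same two paths: a secret with nil `Data` and a secret that already carries other keys.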

---

Nitpick comments:
In `@api/hypershift/v1beta1/nodepool_types.go`:
- Around line 51-60: Locate the function that dereferences the nodePool
parameter (likely in the nodepool_types.go file around the specified line range)
and add a nil check guard before the dereference operation. The check should
verify that nodePool is not nil before attempting to access its fields or
methods, returning an appropriate error or early exit if the pointer is nil.
This defensive check prevents potential panics if the function contract changes
or new callers pass nil values in the future.

In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2771-2773: In the test assertion block where
tc.expectedOSImageStream is checked, add a nil-safety check before accessing the
Name field. When tc.expectedOSImageStream is non-empty, first verify that
nodePool.Status.OSImageStream is not nil before attempting to access
nodePool.Status.OSImageStream.Name in the Expect call. This will prevent a panic
and provide a clear test failure message if the OSImageStream field is
unexpectedly nil.
- Around line 2636-2674: Add a new test case to the testCases slice in the
TestReconcileMachineDeploymentStatus function that verifies OSImageStream status
population when a MachineDeployment completes. This test case should have a
non-empty expectedOSImageStream value (such as a valid RHEL stream identifier)
and match the completion conditions of the existing test case (all replica
counts at 3, ObservedGeneration matching Generation). This will ensure the
conditional assertion at lines 2771-2773 that checks
nodePool.Status.OSImageStream is actually exercised with a meaningful
verification rather than being skipped due to an empty expected value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8ebdba94-e5e5-4bf0-8e2a-0fa112fff63f

📥 Commits

Reviewing files that changed from the base of the PR and between 03a5cef and 8e027da.

⛔ Files ignored due to path filters (5)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (16)
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go
✅ Files skipped from review due to trivial changes (2)
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/aws_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/config_test.go

Comment thread hypershift-operator/controllers/nodepool/nodepool_controller.go Outdated
Comment thread hypershift-operator/controllers/nodepool/token.go Outdated
@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from 49e64cb to ebde395 Compare June 17, 2026 08:31
@sdminonne

Contributor Author

I'll rebase this once #8699 is merged.

@sdminonne sdminonne marked this pull request as ready for review June 17, 2026 13:10
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@openshift-ci openshift-ci Bot requested review from Nirshal and cblecker June 17, 2026 13:11

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/capi_test.go`:
- Around line 2771-2773: The assertion for nodePool.Status.OSImageStream.Name is
conditionally guarded, which skips verification when tc.expectedOSImageStream is
empty. This allows incorrect non-empty status values to pass undetected in test
cases expecting empty values. Remove the conditional guard `if
tc.expectedOSImageStream != ""` and assert nodePool.Status.OSImageStream.Name
unconditionally against tc.expectedOSImageStream to ensure all test cases
properly verify the expected OS image stream value regardless of whether it is
empty or not.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b312f697-6295-426e-b155-a62af1d22c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 8e027da and ebde395.

⛔ Files ignored due to path filters (5)
  • api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests/nodepools.hypershift.openshift.io/OSStreams.yaml is excluded by !**/zz_generated.featuregated-crd-manifests/**
  • cmd/install/assets/crds/hypershift-operator/tests/nodepools.hypershift.openshift.io/featuregated.nodepools.osimagestream.testsuite.yaml is excluded by !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-CustomNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • cmd/install/assets/crds/hypershift-operator/zz_generated.crd-manifests/nodepools-TechPreviewNoUpgrade.crd.yaml is excluded by !**/zz_generated.crd-manifests/**, !cmd/install/assets/**/*.yaml
  • vendor/github.com/openshift/hypershift/api/hypershift/v1beta1/nodepool_types.go is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (16)
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/conditions.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/osstream.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/token.go
  • hypershift-operator/controllers/nodepool/token_test.go
🚧 Files skipped from review as they are similar to previous changes (13)
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • hypershift-operator/controllers/nodepool/token_test.go
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/aws.go
  • hypershift-operator/controllers/nodepool/gcp_test.go
  • hypershift-operator/controllers/nodepool/config.go
  • hypershift-operator/controllers/nodepool/osstream_test.go
  • hypershift-operator/controllers/nodepool/aws_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/gcp.go
  • hypershift-operator/controllers/nodepool/config_test.go
  • api/hypershift/v1beta1/nodepool_types.go
  • hypershift-operator/controllers/nodepool/token.go

Comment thread hypershift-operator/controllers/nodepool/capi_test.go Outdated

@jparrill jparrill left a comment

Contributor

Dropped some comments. Thanks!

Note: #8699 (CNTRLPLANE-3026) is still pending merge. Once it lands, this PR will need a rebase — defaultNodePoolAMI signature and behavior change significantly there (streamName becomes functional via StreamForName). The //nolint:unparam and TODOs referencing CNTRLPLANE-3553 for that function will be obsolete after rebase.

We'll do multiple review rounds as the dependency chain settles.

// behavior: no OSImageStream CR is generated and the MCC uses
// BaseOSContainerImage from ControllerConfig as-is).
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {
if nodePool.Spec.OSImageStream.Name != "" {

Contributor

This reimplements GetRHELStream() from stream.go in the same package, but with divergent semantics: defaultRHELStream says major >= 5 → rhel-10 unconditionally, while GetRHELStream handles major >= 5 && usesRunc → rhel-9. These two functions will return different answers for 5.x + runc.

Same with validateOSImageStream: GetRHELStream already validates unknown stream names and rejects invalid combinations (rhel-10 on 4.x, rhel-10 with runc).

Can these delegate to GetRHELStream instead of reimplementing? Having two resolution paths for the same question in the same package is a maintenance hazard — when a new stream is added, both must be updated.
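
As an illustration of the single resolution path being asked for, here is a minimal sketch. The `resolveStream` name, its signature, and the error messages are hypothetical and are not the real `GetRHELStream` API; the rules it encodes (version-derived default, runc forcing rhel-9, rejection of rhel-10 on 4.x or with runc, rejection of unknown names) are the ones described in this thread. Accepting an explicit rhel-9 on any version is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// resolveStream is a hypothetical, simplified single source of truth for
// stream resolution. It takes the release major version directly instead of
// parsing a semver string.
func resolveStream(name string, major uint64, usesRunc bool) (string, error) {
	switch name {
	case "":
		// Unset: derive the default from the release version and runtime.
		if major >= 5 && !usesRunc {
			return "rhel-10", nil
		}
		return "rhel-9", nil
	case "rhel-9":
		// Assumption: an explicit rhel-9 is accepted on every version.
		return "rhel-9", nil
	case "rhel-10":
		if major < 5 {
			return "", errors.New("rhel-10 is not available before 5.0")
		}
		if usesRunc {
			return "", errors.New("rhel-10 cannot be combined with runc")
		}
		return "rhel-10", nil
	default:
		return "", fmt.Errorf("unknown stream %q", name)
	}
}

func main() {
	// The 5.x + runc case is where the two implementations diverged.
	stream, err := resolveStream("", 5, true)
	fmt.Println(stream, err)
}
```

With one function like this, validation and default resolution cannot drift apart, because callers that only need validation can discard the returned name and inspect the error.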

Contributor Author

Done. defaultRHELStream, getRHELStream, and validateOSImageStream now all delegate to GetRHELStream from stream.go. The duplicate resolution logic is removed; see 63532a2.

if rhelStream != "" {
defaultStream, err := defaultRHELStream(releaseImage)
if err == nil && rhelStream == defaultStream {
rhelStream = ""

Contributor

This normalization is the most load-bearing part of the PR — it's what prevents spurious fleet-wide rollouts when a user explicitly sets the default stream. But `TestNewConfigGenerator` has zero test cases that set `nodePool.Spec.OSImageStream`.

Since the plan is to consolidate on `GetRHELStream()` from `stream.go` (see osstream.go comment), the normalization internals will change. But the observable contract should be tested now:

  • Setting `osImageStream.Name` to the version-derived default (e.g. `"rhel-9"` on 4.x) must produce the same hash as not setting it at all (no spurious rollout).
  • Setting a non-default stream (e.g. `"rhel-10"` on 4.x) must produce a different hash (rollout triggered).

These tests verify the hash stability guarantee regardless of which internal function does the resolution.
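
The contract described above can be sketched independently of the real `ConfigGenerator`. In this sketch `normalize` and `hashConfig` are hypothetical helpers, and the FNV-32a hash over a plain concatenation is an assumption chosen only to produce an 8-character hex string; the real `Hash()` inputs and algorithm may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// normalize drops an explicit stream that equals the version-derived default,
// so that "explicitly set to the default" and "unset" look identical.
func normalize(explicit, defaultStream string) string {
	if explicit == defaultStream {
		return ""
	}
	return explicit
}

// hashConfig is a hypothetical stand-in for the rollout hash: it hashes the
// concatenation of config, version, and the normalized stream.
func hashConfig(config, version, rhelStream string) string {
	h := fnv.New32a()
	h.Write([]byte(config + version + rhelStream))
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	const cfg, version, def = "mco-config", "4.18.0", "rhel-9"

	unset := hashConfig(cfg, version, normalize("", def))
	explicitDefault := hashConfig(cfg, version, normalize("rhel-9", def))
	nonDefault := hashConfig(cfg, version, normalize("rhel-10", def))

	fmt.Println("explicit default matches unset:", unset == explicitDefault)
	fmt.Println("non-default differs from unset:", unset != nonDefault)
}
```

Testing at this level checks the observable guarantee (same hash or different hash) rather than which internal function performs the resolution.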

Contributor Author

Done. Added TestOSImageStreamHashStability in config_test.go covering both cases:

  • Setting osImageStream.Name to the version-derived default (e.g. "rhel-9" on 4.x) produces the same hash as not setting it (no spurious rollout).
  • Setting a non-default stream (e.g. "rhel-10" on 4.x) produces a different hash (rollout triggered).

The TestConfigHash and TestConfigHashWithoutVersion tables also include rhelStream cases — see 0b8280f and 257394d.

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2026
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Contributor

New changes are detected. LGTM label has been removed.

@sdminonne

Contributor Author

/test e2e-aws-upgrade-hypershift-operator

@sdminonne

Contributor Author

/test e2e-aks

@sdminonne

Contributor Author

/test e2e-aws

@sdminonne

Contributor Author

/test e2e-kubevirt-aws-ovn-reduced

@sdminonne

Contributor Author

/test e2e-aws-4-22

@sdminonne

Contributor Author

/test e2e-aks-4-22

@jparrill jparrill left a comment

Contributor

Dropped some comments. Thanks!

// Default behavior for Linux/RHCOS AMIs.
// Use the resolved stream (via GetRHELStream) rather than the raw spec field,
// so that the AMI lookup is consistent with the CAPI path.
rhelStream, err := getRHELStream(nodePool, releaseImage)

Contributor

This new error path (getRHELStream failure -> condition False + return error) is untested. TestSetAWSConditions doesn't have a case with an invalid osImageStream.Name — e.g. rhel-10 on a 4.x release. Worth adding one to make sure the condition message surfaces correctly.

Contributor Author

Done

Comment thread hypershift-operator/controllers/nodepool/conditions.go
tokenSecret.Data[TokenSecretHCConfigurationHashKey] = t.globalConfigHash
// TODO(CNTRLPLANE-3553): consumed by the ignition-server's TokenSecretReconciler once
// multi-stream ignition support lands. Until then this key is written but not read downstream.
tokenSecret.Data[TokenSecretOSStreamKey] = []byte(t.resolvedRHELStream)

Contributor

This key is written now but consumed downstream later (per the TODO). If someone refactors the tokenSecret.Data == nil init block, the key could be dropped silently — nothing would catch it until the ignition-server starts reading it in a separate PR, likely weeks later. A test asserting tokenSecret.Data["os-stream"] exists with the expected value would catch that at the source. Same applies to the other keys in this block if they're not already tested.

// changes during HyperShift Operator upgrades.
// When the field is explicitly set, it delegates to GetRHELStream for
// version-aware validation and runc constraint checking.
func getRHELStream(nodePool *hyperv1.NodePool, releaseImage *releaseinfo.ReleaseImage) (string, error) {

Contributor

Minor test gap: there's no case for a set osImageStream.Name with an unparsable version (e.g. Name: "rhel-9" + version "not-a-version"). The empty-name + unparsable case is covered (returns "" via short-circuit), but the set-name path through semver.Parse at line 24 would fail and return an error — worth one test case.
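
The missing case can be sketched as a small table of inputs. Here `streamForSpec` and `majorOf` are hypothetical simplifications: `majorOf` stands in for `semver.Parse`, and the function only mirrors the behavior described in this comment (an empty name short-circuits to "" without parsing, while a set name requires a parseable version).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorOf extracts the major component from an "X.Y.Z" version string.
// It is a simplified stand-in for semver.Parse.
func majorOf(version string) (uint64, error) {
	major, _, _ := strings.Cut(version, ".")
	return strconv.ParseUint(major, 10, 64)
}

// streamForSpec returns "" without touching the version when name is empty,
// and returns an error when name is set but the version cannot be parsed.
func streamForSpec(name, version string) (string, error) {
	if name == "" {
		return "", nil
	}
	if _, err := majorOf(version); err != nil {
		return "", fmt.Errorf("parsing release version %q: %w", version, err)
	}
	return name, nil
}

func main() {
	cases := []struct{ name, version string }{
		{"", "not-a-version"},       // already covered: short-circuit
		{"rhel-9", "4.18.0"},        // happy path
		{"rhel-9", "not-a-version"}, // the gap: set name, unparsable version
	}
	for _, c := range cases {
		got, err := streamForSpec(c.name, c.version)
		fmt.Printf("name=%q version=%q -> %q, err=%v\n", c.name, c.version, got, err)
	}
}
```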

Contributor Author

Done

}{
{
-name: "When MachineDeployment is complete, it should update nodePool version and annotations",
+name: "When MachineDeployment is complete, it should update nodePool version",

Contributor

Is this rename intentional? The test behavior didn't change, just the name. If there's no reason for it, I'd revert to avoid noise in git blame.

Contributor Author

Right. Sorry for this, TY!

Contributor Author

Amending it

Comment thread hypershift-operator/controllers/nodepool/config.go Outdated
Comment thread hypershift-operator/controllers/nodepool/osstream.go Outdated
Comment thread hypershift-operator/controllers/nodepool/aws_test.go
@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from e5e4f1d to a748041 Compare June 30, 2026 15:45
@sdminonne

Contributor Author

/test e2e-aws-upgrade-hypershift-operator

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func Test_getRHELStreamForBootImage(t *testing.T) {

Member

TestGetRHELStreamForBootImage

// resolvedRHELStreamForBootImage is the RHEL stream name used for boot
// image resolution via StreamForName. It is set by
// getRHELStreamForBootImage: empty when spec.osImageStream.Name is unset
// (so StreamForName("") follows the legacy StreamMetadata path), or a

Member

// getRHELStreamForBootImage: empty when spec.osImageStream.Name is unset
// (so StreamForName("") follows the legacy StreamMetadata path

With that ^^, you cannot get a rhel-10 boot image unless you specify it explicitly in the NodePool. Is that intentional? Am I missing anything?
This comment was resolved with no answer #8730 (comment)

The way I think of this, resolvedRHELStreamForBootImage always needs to be either 9 or 10; empty has no meaning. StreamForName() will then just fall back to i.StreamMetadata if we are on a version where the ReleaseImage doesn't have i.OSStreams[9].

// Fallback to legacy StreamMetadata for single-stream payloads (OCP < 5.0)
// where OSStreams is nil but StreamMetadata carries the data.
if i.StreamMetadata != nil {
return i.StreamMetadata, nil
}
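
The lookup-then-fallback behavior discussed here can be sketched as follows. The `releaseImage` and `streamData` types and the lowercase `streamForName` method are hypothetical simplifications; only the `OSStreams` and `StreamMetadata` field names and the fallback itself come from this thread.

```go
package main

import "fmt"

// streamData is a simplified stand-in for the per-stream boot image metadata.
type streamData struct{ Stream string }

// releaseImage is a hypothetical reduction of the release image type:
// dual-stream payloads populate OSStreams, single-stream payloads only
// carry StreamMetadata.
type releaseImage struct {
	OSStreams      map[string]*streamData
	StreamMetadata *streamData
}

// streamForName prefers the named entry in OSStreams and falls back to the
// legacy StreamMetadata when the payload has no entry for that name.
func (i *releaseImage) streamForName(name string) (*streamData, error) {
	if s, ok := i.OSStreams[name]; ok && s != nil {
		return s, nil
	}
	if i.StreamMetadata != nil {
		return i.StreamMetadata, nil
	}
	return nil, fmt.Errorf("no stream metadata found for %q", name)
}

func main() {
	// Single-stream payload: OSStreams is nil, StreamMetadata carries the data.
	legacy := &releaseImage{StreamMetadata: &streamData{Stream: "legacy"}}
	s, err := legacy.streamForName("rhel-9")
	fmt.Println(s.Stream, err)

	// Dual-stream payload: the named entry wins.
	dual := &releaseImage{OSStreams: map[string]*streamData{
		"rhel-9":  {Stream: "rhel-9"},
		"rhel-10": {Stream: "rhel-10"},
	}}
	s, err = dual.streamForName("rhel-10")
	fmt.Println(s.Stream, err)
}
```

Under this shape the caller can always pass a concrete stream name, and older payloads still resolve through the fallback.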

Contributor Author

You're right @enxebre. This has been fixed, and there are now two ways to get rhel-10: using OCP 5.0+ without any specific setting, or explicitly setting the stream name.

@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch 2 times, most recently from 19a7e19 to 0296982 Compare July 1, 2026 15:31
@sdminonne

Contributor Author

/test e2e-aws-upgrade-hypershift-operator

{
name: "When nodepool has configs it should populate mcoRawConfig ",
-expectedHash: "801aff6a",
+expectedHash: "af67f27c",

Member

why this change?

Contributor Author

@enxebre
In this test case, the releaseImage field previously used Name: "latest" but was changed to Name: "4.18.0". Since Hash() includes cg.releaseImage.Version() in its concatenation (line 169 of config.go), changing the version string from "latest" to "4.18.0" directly changes the hash output. The test also needed a parseable semver version because NewConfigGenerator now calls getRHELStreamForBootImage(), which calls semver.Parse(); an unparseable version like "latest" would now cause the constructor to fail.

The hash didn't change because of the rhelStream addition to the hash (that is "" in this case). It changed because the test fixture's release image version was updated from "latest" to "4.18.0" to make it valid semver. This is necessary because NewConfigGenerator now parses the version during RHEL stream resolution.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jul 1, 2026


Now I have the full picture. Let me verify one more detail — the test examined pods from the test view (i.e., it ran against the management cluster using the HCP client), and the pod names in the test output (e.g., capi-provider-7577649cd-xdsjt with restartCount 5) are from an earlier revision of the deployment. The pod restarted because readiness probes failed during cluster bootstrap (context deadline exceeded on readyz endpoint).

Let me now write the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

--- FAIL: TestUpgradeHyperShiftOperator/ValidateHostedCluster/EnsureNoCrashingPods (0.06s)
    util.go:829: Container aws-ebs-csi-driver-operator in pod aws-ebs-csi-driver-operator-cb88bcf5d-mtm2t has a restartCount > 0 (2)
    util.go:829: Container manager in pod capi-provider-7577649cd-xdsjt has a restartCount > 0 (5)
    util.go:829: Container manager in pod cluster-api-7447f9b959-6mfwv has a restartCount > 0 (3)
    util.go:829: Container cluster-storage-operator in pod cluster-storage-operator-6d54b795f-bczd6 has a restartCount > 0 (2)
    util.go:829: Container control-plane-operator in pod control-plane-operator-5bdfc699d-k4nd4 has a restartCount > 0 (4)
    util.go:829: Container csi-snapshot-controller-operator in pod csi-snapshot-controller-operator-b7648fcc5-c7f6l has a restartCount > 0 (2)
    util.go:829: Container hosted-cluster-config-operator in pod hosted-cluster-config-operator-678dfc6db6-wn2js has a restartCount > 0 (4)

Summary

The EnsureNoCrashingPods subtest failed because 7 hosted control plane pods accumulated container restarts (2–5 each) during the initial HostedCluster creation and rollout phase of the TestUpgradeHyperShiftOperator test. The test has a zero-tolerance policy for restarts on AWS (crashToleration = 0 for these components), but the restarts are caused by transient readiness/liveness probe failures and leader election timing issues during cluster bootstrap — not by the PR's code changes. The PR (CNTRLPLANE-3553) modifies NodePool controller logic to wire osImageStream into hash computation, token generation, status, and validation — none of which affects control plane pod lifecycle or deployment. This is a pre-existing flaky failure pattern in the HyperShift upgrade e2e test.

Root Cause

The failure is a pre-existing flaky test issue, not caused by PR #8730's changes.

Mechanism: During initial HostedCluster creation in the upgrade test, the hosted control plane pods start before all their dependencies (API server, etcd, webhook servers, leader election leases) are fully available. This causes:

  1. Readiness probe failures: capi-provider pod logs show Readiness probe failed: Get "http://....:9440/readyz": context deadline exceeded (9 occurrences), leading to pod restarts by kubelet.
  2. Leader election failures: cluster-api pod logs show repeated failed to acquire lease and client rate limiter Wait returned an error: context deadline exceeded errors, causing the controller-runtime process to exit and be restarted.
  3. Webhook server not ready: capi-provider previous logs show healthz check failed: webhook server has not been started yet during startup.
  4. Status update conflicts: control-plane-operator logs show continuous Operation cannot be fulfilled on hostedcontrolplanes...the object has been modified errors due to optimistic locking contention during rapid reconciliation at startup.

All these are transient bootstrap conditions. The cluster eventually reaches a healthy state (nodes become ready in ~10 min, rollout completes in ~2.5 min), but the accumulated restartCount on the pods persists and is caught by the zero-tolerance EnsureNoCrashingPods check.

Why the test is flaky: The EnsureNoCrashingPods test uses crashToleration = 0 for capi-provider, cluster-api, control-plane-operator, hosted-cluster-config-operator, aws-ebs-csi-driver-operator, csi-snapshot-controller-operator, and cluster-storage-operator on the AWS platform. These components have no explicit toleration override in the podCrashTolerations map, so any restart during the entire cluster lifetime (including bootstrap) causes the test to fail. This is overly strict given that the upgrade test creates a full HostedCluster from scratch, during which transient restarts are expected.

Why this is unrelated to PR #8730: The PR modifies nodepool_controller.go to wire osImageStream into the NodePool controller's hash computation, token generation, status reporting, and validation. None of these changes affect:

  • Control plane pod deployments or their container images
  • Pod readiness/liveness probes
  • Leader election configuration
  • Webhook server startup
  • The HyperShift operator deployment that manages control plane pods
Recommendations
  1. Retrigger the job — This failure is not related to the PR's changes. A /retest should pass if the bootstrap timing is more favorable.

  2. Consider adding crash tolerations for bootstrap-sensitive components — The following pods consistently restart during HostedCluster creation on AWS and should have explicit tolerations in podCrashTolerations in test/e2e/util/util.go:

    • capi-provider — toleration of 2-3 (webhook server startup timing)
    • cluster-api — toleration of 2-3 (leader election + CRD migration RBAC errors)
    • control-plane-operator — toleration of 2-3 (status update contention)
    • hosted-cluster-config-operator — toleration of 2-3 (dependency readiness)
    • aws-ebs-csi-driver-operator — toleration of 1-2
    • csi-snapshot-controller-operator — toleration of 1-2
    • cluster-storage-operator — toleration of 1-2
  3. File a tracking issue for the flaky EnsureNoCrashingPods in the upgrade test context, as the zero-tolerance policy is incompatible with the reality of HyperShift hosted control plane bootstrap behavior.
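
The per-component toleration idea in recommendation 2 could look roughly like the sketch below. The `tolerations` map and `exceedsToleration` helper are hypothetical; the actual shape of `podCrashTolerations` in test/e2e/util/util.go is not shown in this thread, and the values are picked from the suggested ranges purely for illustration.

```go
package main

import "fmt"

// tolerations is a hypothetical per-component restart budget. Components that
// are missing from the map get the zero value, which matches a zero-tolerance
// default.
var tolerations = map[string]int32{
	"capi-provider":                    3,
	"cluster-api":                      3,
	"control-plane-operator":           3,
	"hosted-cluster-config-operator":   3,
	"aws-ebs-csi-driver-operator":      2,
	"csi-snapshot-controller-operator": 2,
	"cluster-storage-operator":         2,
}

// exceedsToleration reports whether a container restarted more often than its
// component's budget allows.
func exceedsToleration(component string, restartCount int32) bool {
	return restartCount > tolerations[component]
}

func main() {
	// Restart counts taken from the failure output in this report.
	observed := []struct {
		component string
		restarts  int32
	}{
		{"capi-provider", 5},
		{"cluster-api", 3},
		{"aws-ebs-csi-driver-operator", 2},
	}
	for _, o := range observed {
		fmt.Printf("%s restarts=%d exceeds=%v\n",
			o.component, o.restarts, exceedsToleration(o.component, o.restarts))
	}
}
```

Note that with a budget of 3, the 5 restarts observed for capi-provider in this run would still fail the check, so the chosen values would need to be validated against more runs.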

Evidence
| Evidence | Detail |
| --- | --- |
| Failed test | TestUpgradeHyperShiftOperator/ValidateHostedCluster/EnsureNoCrashingPods |
| Failure phase | ValidateHostedCluster (initial validation before HO upgrade) |
| Crash toleration | 0 for all 7 failing components on AWS platform (no overrides in podCrashTolerations) |
| capi-provider restarts | 5, caused by readiness probe timeout (context deadline exceeded on /readyz) and webhook server not started |
| cluster-api restarts | 3, caused by leader election failures and CRD migrator RBAC errors (cannot patch resource "customresourcedefinitions") |
| control-plane-operator restarts | 4, caused by status update conflicts (the object has been modified; please apply your changes) |
| hosted-cluster-config-operator restarts | 4, same pattern as control-plane-operator |
| Cluster health | Healthy: 3 nodes ready in 9m42s, rollout completed in 2m24s, all conditions valid |
| PR changes scope | NodePool controller: osImageStream hash, token, status, validation; no control plane pod changes |
| Test source | test/e2e/upgrade_hypershift_operator_test.go: creates HostedCluster, validates, then upgrades HO |
| Crash toleration source | test/e2e/util/util.go:119-150: podCrashTolerations map, default 0 for AWS |

// a different machine template name and triggering CAPI to roll out new
// nodes. This is the intended behavior per the enhancement:
// implicit-stream NodePools automatically adopt the new default.
func TestGetRHELStreamForBootImage(t *testing.T) {

Member

This is named TestGetRHELStreamForBootImage but tests other things.
Please remove this test completely.
Add any needed test case to testStreamForName, e.g.
When it's a < 4 release image (OSStreams empty and has StreamMetadata) it should return that stream
When it's a > 4 release image (Has OSStreams and has empty StreamMetadata) it should return that stream

Feel free to add a different test for awsMachineTemplateSpec in the appropriate file if you find it valuable

Contributor Author

  • Deleted the old integration TestGetRHELStreamForBootImage from osstream_test.go
  • Renamed Test_getRHELStreamForBootImage → TestGetRHELStreamForBootImage
  • Added two dual-stream test cases to TestStreamForName in releaseinfo_test.go
  • Added TestAWSMachineTemplateSpec_StreamSelection in aws_test.go
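
The two payload shapes discussed in this thread (legacy StreamMetadata only, versus populated OSStreams) can be illustrated with a minimal sketch. The `payload` type, its fields, and this `streamForName` signature are hypothetical simplifications for illustration; they are not the actual releaseinfo API, and the real lookup may behave differently.

```go
package main

import "fmt"

// payload is a hypothetical, simplified stand-in for release image metadata.
// It is NOT the real releaseinfo.ReleaseImage type.
type payload struct {
	OSStreams      map[string]string // stream name -> metadata identifier (assumed shape)
	StreamMetadata string            // legacy single-stream metadata, "" when absent
}

// streamForName prefers the named entry in OSStreams and falls back to the
// legacy StreamMetadata only when OSStreams is empty.
func streamForName(p payload, name string) (string, error) {
	if len(p.OSStreams) > 0 {
		if s, ok := p.OSStreams[name]; ok {
			return s, nil
		}
		return "", fmt.Errorf("stream %q not found in payload", name)
	}
	if p.StreamMetadata != "" {
		return p.StreamMetadata, nil
	}
	return "", fmt.Errorf("payload carries no stream metadata")
}

func main() {
	legacy := payload{StreamMetadata: "legacy-metadata"}
	dual := payload{OSStreams: map[string]string{"rhel-9": "meta-9", "rhel-10": "meta-10"}}

	fmt.Println(streamForName(legacy, "rhel-9"))
	fmt.Println(streamForName(dual, "rhel-10"))
}
```

A table-driven test over these two payload shapes would cover the cases requested in the review.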

Wire spec.osImageStream into the NodePool reconciliation loop:
validation, config hash, boot image resolution, token secret
propagation, and observation-based status reporting.

Behavior matches the dual-stream RHEL NodePool enhancement:
https://github.com/openshift/enhancements/blob/master/enhancements/hypershift/dual-stream-rhel-nodepool.md

Stream resolution (osstream.go, stream.go):
- getRHELStreamForBootImage delegates to GetRHELStream for
  version-aware default resolution: rhel-9 for OCP < 5.0, rhel-10
  for OCP >= 5.0 when spec.osImageStream is unset. On upgrade to
  OCP 5.0+, existing NodePools with unset spec.osImageStream will
  implicitly transition to rhel-10 boot images as intended by the
  enhancement.
- GetRHELStream always returns a concrete stream name ("rhel-9" or
  "rhel-10") and is used for validation and default resolution.
- validateOSImageStream delegates to GetRHELStream; runs before
  NewConfigGenerator in validMachineConfigCondition for fail-fast.
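
The version-aware default described above can be sketched roughly as follows. This is a minimal illustration, not the PR's GetRHELStream: the function name `resolveRHELStream`, its parameters, and the assumption that only "rhel-9" and "rhel-10" are valid explicit names are all hypothetical, and the real validation may be stricter (for example, checking the payload).

```go
package main

import "fmt"

// resolveRHELStream is a hypothetical stand-in for GetRHELStream.
// explicit is the value of spec.osImageStream.name ("" when unset);
// releaseMajor is the major version of the release (4 or 5).
func resolveRHELStream(explicit string, releaseMajor int) (string, error) {
	// Version-derived default: rhel-9 before 5.0, rhel-10 from 5.0 on.
	def := "rhel-9"
	if releaseMajor >= 5 {
		def = "rhel-10"
	}
	if explicit == "" {
		return def, nil
	}
	// Assumption: only these two names are accepted in this sketch.
	switch explicit {
	case "rhel-9", "rhel-10":
		return explicit, nil
	default:
		return "", fmt.Errorf("unsupported osImageStream name %q", explicit)
	}
}

func main() {
	for _, major := range []int{4, 5} {
		s, err := resolveRHELStream("", major)
		fmt.Println(major, s, err)
	}
}
```

Because an invalid name returns an error before anything else runs, a caller can use the same function for fail-fast validation.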

Config hash (config.go):
- Add rhelStream to rolloutConfig and Hash()/HashWithoutVersion().
  Normalized so that setting the version-derived default (e.g. "rhel-9"
  on 4.x) keeps rhelStream empty — no spurious rollout.
- resolvedRHELStreamForBootImage lives on ConfigGenerator outside
  rolloutConfig since it does not participate in the hash.
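
The normalization can be pictured with a small sketch. The helper names `normalizeStream` and `rolloutHash`, the string-based config, and the choice of FNV-1a are assumptions made for illustration only; they are not taken from config.go.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// normalizeStream returns "" when the explicit stream equals the
// version-derived default, so that spelling out the default does not
// change the hash input. Hypothetical helper.
func normalizeStream(explicit, versionDefault string) string {
	if explicit == versionDefault {
		return ""
	}
	return explicit
}

// rolloutHash hashes the config together with the normalized stream.
// FNV-1a is an arbitrary choice for this sketch.
func rolloutHash(config, rhelStream string) string {
	h := fnv.New32a()
	h.Write([]byte(config))
	h.Write([]byte(rhelStream))
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	unset := rolloutHash("cfg", normalizeStream("", "rhel-9"))
	sameAsDefault := rolloutHash("cfg", normalizeStream("rhel-9", "rhel-9"))
	changed := rolloutHash("cfg", normalizeStream("rhel-10", "rhel-9"))

	fmt.Println("unset == explicit default:", unset == sameAsDefault)
	fmt.Println("unset == non-default:", unset == changed)
}
```

In this sketch, only a stream that differs from the default contributes bytes to the hash, which is what avoids a rollout when a user merely makes the default explicit.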

Boot image resolution (aws.go, gcp.go):
- AWS and GCP machine template paths use resolvedRHELStreamForBootImage
  via StreamForName for consistent AMI/image lookup.
- setAWSConditions uses getRHELStreamForBootImage for consistency with
  the CAPI path.

Token secret (token.go):
- Write os-stream key with resolved stream for future ignition-server
  consumption (CNTRLPLANE-3553).

Status reporting (version.go):
- setOSImageStreamStatus infers the observed RHEL stream from Machine
  NodeInfo.OSImage (RHCOS 4xx → rhel-9, 5xx → rhel-10) using majority
  vote across pool machines.
- setNodesInfoStatus aggregates node version and health from CAPI
  Machines into status.nodesInfo.

Tests (osstream_test.go):
- Test_getRHELStreamForBootImage covers all combinations of
  osImageStream.Name × release version per the enhancement behavior
  table.
- TestGetRHELStreamForBootImage proves that on OCP 5.0+ multi-stream
  payloads, the resolved rhel-10 default produces the correct AMI
  and machine template hash, confirming the intended boot image
  transition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne sdminonne force-pushed the CNTRLPLANE-3553-osImageStream-controller branch from 0296982 to 6806f1e Compare July 1, 2026 19:06
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Jul 1, 2026
// with the CAPI path: version-derived default (rhel-9 for OCP < 5.0,
// rhel-10 for OCP >= 5.0) for unset osImageStream, or a validated
// stream name for explicit osImageStream.
rhelStream, err := getRHELStreamForBootImage(nodePool, releaseImage)

Member

Why is this AWS-specific? What about the other platforms?

expectError: true,
},
{
name: "When rhelStream is rhel-9 with single-stream payload (OSStreams nil), it should fall back to StreamMetadata",

Member

nice test cases!


// defaultNodePoolGCPImage returns the default GCP image for a given architecture from release metadata.
-func defaultNodePoolGCPImage(specifiedArch string, releaseImage *releaseinfo.ReleaseImage) (string, error) {
+func defaultNodePoolGCPImage(specifiedArch string, releaseImage *releaseinfo.ReleaseImage, rhelStream string) (string, error) {

Member

What about platforms other than AWS and GCP?

@enxebre

enxebre commented Jul 1, 2026

Member

lgtm
/test e2e-aws

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Contributor

@sdminonne: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-4-22 | bbde575 | link | true | /test e2e-aws-4-22 |
| ci/prow/e2e-aws-upgrade-hypershift-operator | 0296982 | link | true | /test e2e-aws-upgrade-hypershift-operator |



Labels

- approved: Indicates a PR has been approved by an approver from all required OWNERS files.
- area/api: Indicates the PR includes changes for the API
- area/cli: Indicates the PR includes changes for CLI
- area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
- area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
- area/platform/aws: PR/issue for AWS (AWSPlatform) platform
- area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.


6 participants