CORS-4236: aws: support worker machine pool management with ClusterAPI#10577
CORS-4236: aws: support worker machine pool management with ClusterAPI#10577tthvo wants to merge 8 commits into
Conversation
Defaults the machine pool management to CAPI when the appropriate feature gate is enabled.
Edge compute pools require MachineTaintPropagation, which is only available in CAPI v1.12+ (currently vendored at v1.11.8). Block the combination at install-config validation to surface the error early rather than producing incomplete manifests.
|
@tthvo: This pull request references CORS-4236 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR adds feature-gated ClusterAPI management for machine pools, implements CAPI AWSMachineSpec/Template/MachineSet generation and worker asset persistence, normalizes quota inputs across MAPI and CAPI, removes AWS tfvars generation, and adds related tests and utilities. ChangesCluster API Worker Migration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/cc @patrickdillon |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/asset/machines/aws/clusterapi_machinesets.go`:
- Around line 24-54: The code divides by numOfAZs computed from mpool.Zones
which can be zero and cause a panic; before computing numOfAZs or doing the
replicas calculation (the loop that uses replicas := int32(total / numOfAZs) and
the idx < total%numOfAZs branch), add a defensive check for len(mpool.Zones) ==
0 and return a clear error (or handle it by setting a sane default) so the
function exits safely instead of dividing by zero.
In `@pkg/asset/quota/aws/aws.go`:
- Around line 166-178: The function MachineInfoFromMAPIMachineSets currently
dereferences ms.Spec.Replicas directly; update it to guard against nil by
checking if ms.Spec.Replicas == nil before dereferencing and use a sensible
default (e.g. int64(1)) when nil, otherwise set Replicas =
int64(*ms.Spec.Replicas); keep everything else the same so the change only
affects how Replicas is computed.
- Around line 152-164: The MachineInfoFromMAPIMachines function does an
unchecked type assertion on m.Spec.ProviderSpec.Value.Object which can panic;
guard against nil ProviderSpec/Value/Object and use a safe type assertion for
*machinev1beta1.AWSMachineProviderConfig (e.g., cfg, ok :=
m.Spec.ProviderSpec.Value.Object.(*machinev1beta1.AWSMachineProviderConfig)),
and if nil or !ok simply skip/continue that machine (or handle it gracefully)
instead of asserting; update the code paths that reference
providerConfig.InstanceType and providerConfig.Placement to only run after the
checks succeed.
- Around line 180-194: MachineInfoFromCAPIMachineSets is not setting
MachineInfo.AvailabilityZone, causing undercounting in network() EIP
calculations; update MachineInfoFromCAPIMachineSets to populate AvailabilityZone
for each MachineInfo by extracting the AZ from the matching
capa.AWSMachineTemplate (inspect the template's Spec.Template.Spec.Subnet
filter/availability zone field if present) and, if not present, fall back to
parsing the MachineSet name (ms.Name) using the "<clusterID>-<pool>-<az>"
pattern to derive the zone; ensure you use the same map key lookup
(templateInstanceTypes currently using
ms.Spec.Template.Spec.InfrastructureRef.Name) to locate the template, handle nil
pointers/empty values safely, and set MachineInfo.AvailabilityZone alongside
InstanceType and Replicas before returning infos.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 675a3ba6-21a8-4533-8299-4cba39b57f25
📒 Files selected for processing (17)
pkg/asset/cluster/tfvars/tfvars.gopkg/asset/installconfig/installconfig.gopkg/asset/machines/aws/awsmachines.gopkg/asset/machines/aws/clusterapi_machinesets.gopkg/asset/machines/userdata.gopkg/asset/machines/worker.gopkg/asset/quota/aws/aws.gopkg/asset/quota/quota.gopkg/asset/quota/types/types.gopkg/tfvars/aws/OWNERSpkg/tfvars/aws/aws.gopkg/types/defaults/installconfig.gopkg/types/defaults/machinepools.gopkg/types/defaults/machinepools_test.gopkg/types/validation/installconfig.gopkg/types/validation/installconfig_test.gopkg/utils/utils.go
💤 Files with no reviewable changes (2)
- pkg/tfvars/aws/OWNERS
- pkg/tfvars/aws/aws.go
AWS uses CAPI for infrastructure provisioning and no longer consumes Terraform variables. Remove the entire AWS case from TerraformVariables and delete the pkg/tfvars/aws/ package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract machine info conversion into dedicated functions with platform-agnostic MachineInfo struct. This decouples quota constraint generation from specific machine API types, allowing both MAPI and CAPI managed pools to be checked. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
We set the AZ to launch machines for a CAPI machineset its failure domain spec. This also allows us to easily extract the AZ for quota calculating.
|
/test e2e-aws-ovn-devpreview |
|
Since /label platform/aws |
The cluster-capi-operator deploys a ValidatingAdmissionPolicy that forbids spec.uncompressedUserData on AWSMachines and AWSMachineTemplates. This field only affects cloud-init based setups. In OpenShift, ignition is used instead, so it's always ignored. Thus, there is no need to set it. Notes: If set, the VAP will reject the worker machine creation. In some rare cases, VAP may be deployed late and workers are still provisioned. With this change, we don't have to worry it that.
|
Local testing showed promising results (see PR description for a sample install-config snippet):
ATM, control plane nodes are still managed by MAPI in the cluster. Support will added in follow-ups. |
|
/test e2e-aws-ovn-edge-zones |
|
/test e2e-aws-ovn-imdsv2 |
|
/test e2e-aws-ovn-public-ipv4-pool |
|
/test e2e-aws-ovn-public-subnets |
|
@tthvo: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/test e2e-aws-ovn-devpreview |
|
The devpreview job was successful! Exploring the artifacts... It looks like camgi wont pick up the capi manifests, and we (openshift) probably want to extend that It looks like the correct manifests are being produced in the build logs I was surprised to see only the one manifests, but then I realized the job is setting the install to a single zone: So that is actually the correct behavior, but I wonder why the job is limited to a single zone?? |
I think it's for saving cloud costs in presubmits according to the if [[ "${ZONES_COUNT}" == "auto" ]]; then
if [[ "${JOB_NAME}" == pull-ci-* || "${JOB_NAME}" == rehearse-*-pull-ci-* ]]; then
# For presubmits, limit cloud costs by using only one AZ when in "auto".
ZONES_COUNT="1"
else
# For periodics (which inform component readiness), ensure multiple AZ
# usage in "auto" mode.
ZONES_COUNT="2"
fi
fi |
Huh. I wonder how that proposes to save money... or what "auto" is. |
|
/approve This looks really good. I will read over it more carefully for lgtm but tbh I didn't notice anything to change. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Oh, I see, looking at git blame, azs correlate to nat gws, which are $$$. |
Oh yes, exactly! I remember it was one of the resources that cost the most in CI 😅 That's probably why we introduced public-only install. |
Claude and I attempted elmiko/camgi.rs#59 :D it seems to render pretty nicely
|
Looks awesome! One thing I'm confused about and wanted to clarify: in openshift, we're not creating an |
Oh, not the installer. But cluster-capi-operator creates it in aws.go#L51-L61 to represent the current running OCP cluster, I believe. The AWSCluster is not managed by in-cluster CAPA due to this annotation here so its, ATM, informational only 🤔
Yes, there is, but in gather-extra though. You can see it here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_installer/10577/pull-ci-openshift-installer-main-e2e-aws-ovn-devpreview/2060263970471677952/artifacts/e2e-aws-ovn-devpreview/gather-extra/artifacts/awsclusters.infrastructure.cluster.x-k8s.io.json |
|
/lgtm |
|
/test e2e-aws-ovn-dualstack-ipv4-primary-techpreview e2e-aws-ovn-dualstack-ipv6-primary-techpreview |
|
/test e2e-aws-ovn-custom-iam-profile e2e-aws-ovn-techpreview |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-dedicated-serial-techpreview-1of2 |
|
@tthvo: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/40218360-5f95-11f1-9332-42a30348c59d-0 |
|
/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-amd64-nightly-aws-ipi-confidential-fips-mini-perm-f28-destructive |
|
@tthvo: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d708f3d0-5f95-11f1-9edb-95b145ce2f5d-0 |

This PR adds support for generating CAPI machinesets for worker machine pool in AWS. The change also cleans up some dependent but unused terraform code path (AWS only).
Edge (local/wavelength zones) pool does not support
management: ClusterAPIdue to missing node taints configuration in CAPI machineset. Node taints support is available in CAPI v1.12+ (currently vendored at v1.11.8). Control plane machine pool ClusterAPI management will be added in follow-ups.For example, use the below install-config snippet to install:
Important
At the moment, the bootstrap KAS cannot reach conversion webhooks, so installer-generated CAPI machinesets must use the storage version (as of 5.0, it is
v1beta2). If using a non-storage version, KAS will try to reach a conversion webhook, which will fail, and the object will not be admitted.Summary by CodeRabbit
New Features
Refactor
Validation
Chores