
OCPCRT-450: Improve ClusterBot's cloud platform disbursement #604

Open
thiagoalessio wants to merge 5 commits into openshift:main from thiagoalessio:OCPCRT-450

Conversation

@thiagoalessio
Member

@thiagoalessio thiagoalessio commented Mar 17, 2026

Previously, selecting which cloud account to use for a platform (e.g. aws vs aws-2) required maintaining two nearly identical switch blocks: one in manager.go to query Boskos metrics and pick the best account, and another in prow.go to apply the corresponding cluster profile to the ProwJob. Adding or removing a cloud account meant editing both switches and adding/removing wrapper functions.

Now, all account information lives in a single platformQuotaSlices map. A generic selectCloudAccountProfile function loops over entries, queries Boskos, and picks the one with the most free resources. The result is passed directly to applyClusterProfile.
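To make the data flow concrete, here is a minimal sketch of the map-plus-loop approach described above. It is not the PR's code: the CloudAccountProfile field names, the leaseClient interface and its FreeResources method (the CodeRabbit summary says the real code calls lClient.Metrics), and the quota-slice names are all hypothetical. The sketch also returns the winning entry even when it is the primary account, whereas the CodeRabbit summary says the real function returns nil in that case.

```go
package main

import "fmt"

// CloudAccountProfile describes one cloud account that can back a platform.
// Field names are illustrative assumptions, not the PR's actual struct.
type CloudAccountProfile struct {
	QuotaSlice     string // Boskos resource type tracking this account's quota
	ClusterProfile string // cluster profile to apply to the ProwJob
	BaseDomain     string // optional BASE_DOMAIN override; "" keeps the default
}

// leaseClient is a hypothetical stand-in for the bot's Boskos client that
// only reports how many resources of a quota slice are currently free.
type leaseClient interface {
	FreeResources(quotaSlice string) (int, error)
}

// platformQuotaSlices keeps every account for every platform in one place.
// The quota-slice and profile names below are hypothetical examples.
var platformQuotaSlices = map[string][]CloudAccountProfile{
	"aws": {
		{QuotaSlice: "aws-quota-slice", ClusterProfile: "aws"},
		{QuotaSlice: "aws-2-quota-slice", ClusterProfile: "aws-2", BaseDomain: "aws-2.example.com"},
	},
}

// selectCloudAccountProfile queries each account's free count and returns a
// copy of the account with the most free resources. It returns nil when the
// platform has no configured accounts; ties keep the earlier entry.
func selectCloudAccountProfile(platform string, c leaseClient) (*CloudAccountProfile, error) {
	var best *CloudAccountProfile
	bestFree := -1
	for _, account := range platformQuotaSlices[platform] {
		free, err := c.FreeResources(account.QuotaSlice)
		if err != nil {
			return nil, fmt.Errorf("getting metrics for %s: %w", account.QuotaSlice, err)
		}
		if free > bestFree {
			chosen := account // fresh copy so the pointer never aliases the loop variable or the shared map
			best, bestFree = &chosen, free
		}
	}
	return best, nil
}

// staticClient is a fake client that returns fixed free counts.
type staticClient map[string]int

func (s staticClient) FreeResources(quotaSlice string) (int, error) {
	return s[quotaSlice], nil
}

func main() {
	profile, err := selectCloudAccountProfile("aws", staticClient{"aws-quota-slice": 3, "aws-2-quota-slice": 10})
	if err != nil {
		panic(err)
	}
	fmt.Println(profile.ClusterProfile) // aws-2
}
```

Because the comparison uses a strict greater-than, ties keep the earlier entry, so listing the primary account first makes it the default when free counts are equal.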

Why the quota-slice names are still listed explicitly

Ideally, we would discover the available quota-slice resource types dynamically from Boskos, but neither Claude nor I could find anything in https://github.com/kubernetes-sigs/boskos/blob/master/client/client.go that fetches a list of resources.

This could be refactored even further, for example by extracting these values to a config file, but it is already an improvement: after these changes, adding a third AWS account only requires appending a new entry to platformQuotaSlices.
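If the entries were eventually moved to a config file, adding an account could become a pure data change. A minimal sketch, assuming a JSON file keyed by platform; the JSON keys, the loadPlatformQuotaSlices helper, and the aws-3 entry are hypothetical and not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CloudAccountProfile mirrors one entry of a hypothetical config file.
type CloudAccountProfile struct {
	QuotaSlice     string `json:"quotaSlice"`
	ClusterProfile string `json:"clusterProfile"`
	BaseDomain     string `json:"baseDomain,omitempty"`
}

// exampleConfig shows a third AWS account added as one more array element.
const exampleConfig = `{
  "aws": [
    {"quotaSlice": "aws-quota-slice", "clusterProfile": "aws"},
    {"quotaSlice": "aws-2-quota-slice", "clusterProfile": "aws-2"},
    {"quotaSlice": "aws-3-quota-slice", "clusterProfile": "aws-3"}
  ]
}`

// loadPlatformQuotaSlices parses the config and rejects incomplete entries.
func loadPlatformQuotaSlices(data []byte) (map[string][]CloudAccountProfile, error) {
	var m map[string][]CloudAccountProfile
	if err := json.Unmarshal(data, &m); err != nil {
		return nil, fmt.Errorf("parsing quota-slice config: %w", err)
	}
	for platform, accounts := range m {
		for _, a := range accounts {
			if a.QuotaSlice == "" || a.ClusterProfile == "" {
				return nil, fmt.Errorf("platform %q has an account missing quotaSlice or clusterProfile", platform)
			}
		}
	}
	return m, nil
}

func main() {
	m, err := loadPlatformQuotaSlices([]byte(exampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m["aws"])) // 3
}
```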

Suggestion on how to review

I recommend reviewing the individual commits. I tried to make each one small and self-contained, with each advancing one refactoring step towards the end goal.

Summary by CodeRabbit

  • New Features

    • Added support for multiple cloud accounts per platform with automatic selection based on available quota resources.
    • Jobs are now routed to the cloud account whose Boskos quota slice reports the most free resources.
  • Refactor

    • Consolidated cloud account configuration logic and removed platform-specific account conversion logic.

@openshift-ci-robot

openshift-ci-robot commented Mar 17, 2026

@thiagoalessio: This pull request references OCPCRT-450 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 17, 2026
@coderabbitai

coderabbitai Bot commented Mar 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1de752ac-edec-48f1-ac7b-25e91a4ee767

📥 Commits

Reviewing files that changed from the base of the PR and between 0be8b07 and d411c79.

📒 Files selected for processing (4)
  • pkg/manager/manager.go
  • pkg/manager/manager_test.go
  • pkg/manager/prow.go
  • pkg/manager/types.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/manager/types.go
  • pkg/manager/manager_test.go
  • pkg/manager/manager.go
  • pkg/manager/prow.go

📝 Walkthrough


This PR adds a CloudAccountProfile type and platformQuotaSlices, implements selectCloudAccountProfile that chooses an alternate account by Boskos metrics, sets job.CloudAccountProfile in LaunchJobForUser, applies cluster profiles in newJob via applyClusterProfile, and adds tests for selection logic.

Changes

Quota-slice selection and application

  • Types and Job field update (pkg/manager/types.go): Adds an exported CloudAccountProfile struct and replaces the Job.UseSecondaryAccount boolean with Job.CloudAccountProfile *CloudAccountProfile.
  • Quota-slice map and selection logic (pkg/manager/manager.go): Introduces platformQuotaSlices and selectCloudAccountProfile(platform string, lClient LeaseClient), which queries LeaseClient metrics and picks the alternate profile with the most Free resources (or returns nil to use the primary/fallback account).
  • LaunchJobForUser integration (pkg/manager/manager.go): Replaces the previous per-platform metric checks with a call to selectCloudAccountProfile for amd64 jobs and assigns the returned profile to job.CloudAccountProfile; selection errors are propagated.
  • applyClusterProfile and newJob wiring (pkg/manager/prow.go): Adds applyClusterProfile(...), which mounts the cluster-profile secret and patches the launch test step (ClusterProfile and optional BASE_DOMAIN); newJob calls it when job.CloudAccountProfile is set. Removes the per-cloud conversion helpers.
  • Test coverage (pkg/manager/manager_test.go): Adds mockLeaseClient and a table-driven Test_selectCloudAccountProfile validating selection outcomes, nil cases, and error propagation; updates imports.
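The Job field change and the newJob wiring can be sketched as follows. This is a simplified illustration, not the PR's code: launchStep and newLaunchStep are hypothetical stand-ins for the ProwJob test step and newJob, using the platform name as the default profile is an assumption, and the real applyClusterProfile also mounts the cluster-profile secret and may set BASE_DOMAIN.

```go
package main

import "fmt"

// CloudAccountProfile holds the fields needed to target one cloud account.
type CloudAccountProfile struct {
	QuotaSlice     string
	ClusterProfile string
}

// Job carries an optional profile instead of a UseSecondaryAccount bool.
type Job struct {
	Platform            string
	CloudAccountProfile *CloudAccountProfile // nil: keep the platform's default account
}

// launchStep is a hypothetical, heavily simplified launch test step.
type launchStep struct {
	ClusterProfile string
}

// applyClusterProfile copies the selected profile onto the step. Secret
// mounting and the BASE_DOMAIN override are omitted from this sketch.
func applyClusterProfile(step *launchStep, profile *CloudAccountProfile) {
	step.ClusterProfile = profile.ClusterProfile
}

// newLaunchStep builds the step; one nil check replaces a per-cloud switch.
func newLaunchStep(job Job) launchStep {
	step := launchStep{ClusterProfile: job.Platform} // assumed default profile
	if job.CloudAccountProfile != nil {
		applyClusterProfile(&step, job.CloudAccountProfile)
	}
	return step
}

func main() {
	primary := newLaunchStep(Job{Platform: "aws"})
	alt := newLaunchStep(Job{
		Platform:            "aws",
		CloudAccountProfile: &CloudAccountProfile{QuotaSlice: "aws-2-quota-slice", ClusterProfile: "aws-2"},
	})
	fmt.Println(primary.ClusterProfile, alt.ClusterProfile) // aws aws-2
}
```

Using a pointer rather than a bool lets the job carry everything applyClusterProfile needs, so adding an account does not require a new code path.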

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I nibble on code and carrots bright,
Quota-slices hopping into sight,
Profiles lined in tidy rows,
Metrics tell me where to go,
A rabbit cheers: launch jobs light! 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'OCPCRT-450: Improve ClusterBot's cloud platform disbursement' accurately describes the main change: refactoring cloud account selection logic and introducing a centralized quota-slice mechanism for cloud platform account management.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Command failed



@openshift-ci openshift-ci Bot requested review from AlexNPavel and jupierce March 17, 2026 13:24

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/manager/prow.go (1)

1732-1734: ⚠️ Potential issue | 🟠 Major

Initialize Environment before setting BASE_DOMAIN.

Many launch configs have no environment: block. When that happens, lines 1732-1734 silently drop the alternate-domain override, so aws-2 / azure-2 jobs still use the default base domain.

Suggested fix
-	if accountDomain != "" && matchedTarget.MultiStageTestConfiguration != nil && matchedTarget.MultiStageTestConfiguration.Environment != nil {
-		matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain
-	}
+	if accountDomain != "" && matchedTarget.MultiStageTestConfiguration != nil {
+		if matchedTarget.MultiStageTestConfiguration.Environment == nil {
+			matchedTarget.MultiStageTestConfiguration.Environment = make(citools.TestEnvironment)
+		}
+		matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain
+	}
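Why the override must allocate Environment rather than skip it: in Go, reading from a nil map is fine, but assigning to one panics, which is presumably why the original code guarded on Environment != nil and ended up dropping the override. A minimal sketch, using simplified stand-in types named after those in the suggested fix (the real ci-tools definitions may differ); setBaseDomain and writeToNilMap are hypothetical helpers.

```go
package main

import "fmt"

// Simplified stand-ins for the ci-tools types touched by the fix.
type TestEnvironment map[string]string

type MultiStageTestConfiguration struct {
	Environment TestEnvironment
}

type TestStepConfiguration struct {
	MultiStageTestConfiguration *MultiStageTestConfiguration
}

// setBaseDomain applies the override, allocating Environment when the launch
// config had no environment: block. It keeps the nil check on
// MultiStageTestConfiguration so such targets are skipped, not dereferenced.
func setBaseDomain(target *TestStepConfiguration, accountDomain string) {
	if accountDomain == "" || target.MultiStageTestConfiguration == nil {
		return
	}
	if target.MultiStageTestConfiguration.Environment == nil {
		target.MultiStageTestConfiguration.Environment = make(TestEnvironment)
	}
	target.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain
}

// writeToNilMap reports whether assigning into a nil map panics.
func writeToNilMap() (panicked bool) {
	defer func() { panicked = recover() != nil }()
	var env TestEnvironment
	env["BASE_DOMAIN"] = "x" // runtime panic: assignment to entry in nil map
	return false
}

func main() {
	target := &TestStepConfiguration{MultiStageTestConfiguration: &MultiStageTestConfiguration{}}
	setBaseDomain(target, "aws-2.example.com")
	fmt.Println(target.MultiStageTestConfiguration.Environment["BASE_DOMAIN"]) // aws-2.example.com
	fmt.Println(writeToNilMap())                                              // true
}
```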
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/manager/prow.go` around lines 1732 - 1734, The code sets
matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] without
ensuring Environment is initialized; modify the block that checks accountDomain
and matchedTarget.MultiStageTestConfiguration (the lines manipulating
matchedTarget.MultiStageTestConfiguration.Environment) to allocate a new map if
Environment is nil (e.g., set
matchedTarget.MultiStageTestConfiguration.Environment = make(map[string]string))
before assigning the "BASE_DOMAIN" key so alternate-domain overrides are applied
even when no environment: block exists in the launch config.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/manager/manager.go`:
- Around line 154-156: The current code returns an error when
lClient.Metrics(accounts[i].QuotaSlice) fails, which aborts scheduling; change
this to log the error and fall back to using the primary/default account instead
of returning an error. Specifically, in the block that calls lClient.Metrics for
accounts[i].QuotaSlice, catch the error, emit a warning (including the error)
and set metrics to the primary account's metrics (or mark this candidate as
using the primary quota) so the caller can continue; do this change around the
lClient.Metrics call and its error handling to avoid returning fmt.Errorf and
instead continue with a fallback metrics value.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0abbf168-63b8-4399-859d-1a757db3749a

📥 Commits

Reviewing files that changed from the base of the PR and between 9504e68 and 0be8b07.

📒 Files selected for processing (4)
  • pkg/manager/manager.go
  • pkg/manager/manager_test.go
  • pkg/manager/prow.go
  • pkg/manager/types.go

Comment thread pkg/manager/manager.go Outdated
@bradmwilliams
Contributor

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 1, 2026
Contributor

@bradmwilliams bradmwilliams left a comment


/approve

Comment thread pkg/manager/manager.go Outdated
@openshift-ci
Contributor

openshift-ci Bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bradmwilliams, thiagoalessio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 1, 2026
thiagoalessio and others added 5 commits June 2, 2026 13:23
Renaming this function to better reflect its purpose,
especially if at some point we have more than two accounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce the data structures that will replace the hardcoded
switch block for quota-slice selection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract the quota-slice selection logic into a standalone function that
loops over platformQuotaSlices entries, queries Boskos metrics for each,
and returns the profile with the most free resources. The old switch
block still exists for now.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Got rid of the switch block in favor of the newly introduced `selectCloudAccountProfile`.
Also replaced the other switch block in `prow.go`, along with its wrapper functions, with a single
nil check that calls applyClusterProfile directly with the profile struct fields.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ailable

Instead of failing the cluster launch when Boskos metrics cannot be
retrieved, log a warning and proceed with the default (primary) account.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

openshift-ci-robot commented Jun 2, 2026

@thiagoalessio: This pull request references OCPCRT-450 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.


@openshift-ci
Contributor

openshift-ci Bot commented Jun 2, 2026

@thiagoalessio: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@thiagoalessio
Member Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Jun 2, 2026

@thiagoalessio: This pull request references OCPCRT-450 which is a valid jira issue.



Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • tide/merge-method-squash: Denotes a PR that should be squashed by tide when it merges.


3 participants