Skip to content

Approve pending upgrade InstallPlans during MCH wait#467

Open
eliorerz wants to merge 1 commit into
mainfrom
fix/approve-upgrade-installplans-during-mch-wait
Open

Approve pending upgrade InstallPlans during MCH wait#467
eliorerz wants to merge 1 commit into
mainfrom
fix/approve-upgrade-installplans-during-mch-wait

Conversation

@eliorerz

@eliorerz eliorerz commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • When MCE is installed with Manual approval, a newer version can appear in the catalog while MCH is reconciling. This creates a second InstallPlan in RequiresApproval phase that nobody approves, causing MCH to get stuck in phase: Installing forever.
  • Replace the simple k8s_info MCH wait with a shell task that directly approves any unapproved InstallPlans in the MCE and ACM namespaces on every retry iteration before checking MCH phase. This bypasses version gating since the version-gated reconcile tool would skip InstallPlans for versions newer than pinned.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

Walkthrough

The PR modifies the advanced cluster management task sequence to integrate operator version reconciliation into the MultiClusterHub readiness check. The task now derives the MCE operator reference, runs reconciliation for both operators, then polls the MultiClusterHub phase via shell commands until it reaches "Running" status.

Changes

RHACM MultiClusterHub Readiness and Operator Reconciliation

Layer / File(s) Summary
MCE operator reference derivation
operators/advanced-cluster-management/tasks.yaml
A new task step selects the multicluster-engine operator from the operators list and stores it in mce_operator for downstream use.
MultiClusterHub readiness with operator reconciliation
operators/advanced-cluster-management/tasks.yaml
The prior Kubernetes API polling is replaced with a shell-based loop that attempts to reconcile operator versions for MCE and the current operator (ignoring reconciliation failures via || true), queries the MultiClusterHub phase via oc get and jsonpath extraction, and retries until the phase equals "Running".

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Risk Assessment: The shift from native Kubernetes API polling to shell-based phase queries introduces operational risk. The \|\| true pattern masks reconciliation failures, which could obscure upgrade or operator-state issues. The polling loop relies on oc CLI availability and correct jsonpath syntax; failures would surface only as timeout exhaustion rather than explicit errors. Verify that the retry logic and timeout configuration provide adequate safeguards for production reconciliation workflows.


🔧 Shell commands replace the K8s dance,
Operators now reconcile, given their chance,
MCE and kin align with care,
Till the Hub sings "Running" in the air.

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ai-Attribution ⚠️ Warning CodeRabbit AI tool generated the summary per CONTRIBUTING.md, but commit f31504d lacks required Assisted-by trailer for AI-assisted work. Add "Assisted-by: CodeRabbit noreply@coderabbit.ai" trailer to commit message per CONTRIBUTING.md guidelines for AI-assisted work.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: replacing MCH polling with a shell-based wait that approves pending upgrade InstallPlans during the wait process.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets ✅ Passed File contains no hardcoded secrets, API keys, tokens, passwords, private keys, base64 strings >32 chars, embedded credentials in URLs, or sensitive variable assignments.
No-Weak-Crypto ✅ Passed No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto, or non-constant-time secret comparisons found in the PR or repository.
No-Injection-Vectors ✅ Passed No injection vectors found. Variables sourced from trusted defaults/operators.yaml, properly quoted in shell commands, and contain only safe characters.
Container-Privileges ✅ Passed PR modifies only Ansible orchestration tasks with no container/K8s manifest changes; no privileged settings, privilege escalation, or elevated capability flags introduced.
No-Sensitive-Data-In-Logs ✅ Passed Task logs only non-sensitive data: operator names, versions, namespaces, and cluster phase status. No passwords, tokens, API keys, PII, or customer data are exposed in the code.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/approve-upgrade-installplans-during-mch-wait

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the operators Operator installation/config label Jun 10, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@operators/advanced-cluster-management/tasks.yaml`:
- Around line 16-18: After setting mce_operator via the set_fact that selects
the 'multicluster-engine' entry, add an ansible.builtin.assert task immediately
after to fail fast if mce_operator is undefined or missing required keys; assert
that mce_operator is defined and that mce_operator.name, mce_operator.version
and mce_operator.namespace are present (with a clear message like
"multicluster-engine operator entry missing or incomplete") so the later shell
task that dereferences mce_operator.name/.version/.namespace cannot run with
nil/invalid values.
- Around line 21-39: The shell task registering mch_phase_result can hang
because the two commands (enclave reconcile operator-versions and oc get
multiclusterhub ...) have no per-attempt timeout; update the
ansible.builtin.shell invocation to wrap each of those commands with a
per-command timeout (e.g., via the GNU timeout utility such as timeout 300s
enclave reconcile ... and timeout 30s oc get ...) so each retry attempt will
fail fast instead of hanging indefinitely; keep set -euo pipefail and the same
register variable (mch_phase_result) unchanged.
- Around line 24-35: The task is polluting stdout by redirecting stderr into it;
remove the "2>&1" (or otherwise stop merging stderr into stdout) from both
enclave reconcile invocations so that mce_phase_result.stdout can match
"Running" reliably (i.e., change the two lines that include "enclave reconcile
operator-versions ... --no-dry-run 2>&1 || true" for the mce_operator and
operator blocks to not redirect stderr into stdout, keeping "|| true" only if
you still want a zero-failure exit behavior).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: adffd63c-612d-465a-90a3-e6ca2acacd81

📥 Commits

Reviewing files that changed from the base of the PR and between b69cc60 and f31504d.

📒 Files selected for processing (1)
  • operators/advanced-cluster-management/tasks.yaml

Comment on lines +16 to +18
- name: Set MCE operator reference for upgrade approval during MCH wait
ansible.builtin.set_fact:
mce_operator: "{{ operators | selectattr('name', 'equalto', 'multicluster-engine') | first }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and inspect the referenced tasks file and the usages of mce_operator.
FILE="operators/advanced-cluster-management/tasks.yaml"
echo "=== File: $FILE (lines 1-120) ==="
nl -ba "$FILE" | sed -n '1,120p'

echo
echo "=== Search for mce_operator usage in operators/advanced-cluster-management ==="
rg -n "mce_operator" operators/advanced-cluster-management || true

echo
echo "=== Search for operators variable definition/shape (operators list) in this area ==="
rg -n "operators" operators/advanced-cluster-management || true

Repository: rh-ecosystem-edge/enclave

Length of output: 207


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# List files under operators/advanced-cluster-management to understand context.
find operators/advanced-cluster-management -maxdepth 3 -type f -print

echo
echo "=== Inspect tasks.yaml around the reported lines ==="
FILE="operators/advanced-cluster-management/tasks.yaml"
nl -ba "$FILE" | sed -n '1,80p'
nl -ba "$FILE" | sed -n '80,160p' || true

echo
echo "=== Find mce_operator usage anywhere in repo ==="
rg -n "mce_operator" . || true

Repository: rh-ecosystem-edge/enclave

Length of output: 342


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="operators/advanced-cluster-management/tasks.yaml"

echo "=== File: $FILE (head 120) ==="
nl -ba "$FILE" | sed -n '1,120p'

echo
echo "=== File: $FILE (tail 80) ==="
nl -ba "$FILE" | tail -n 80 || true

echo
echo "=== mce_operator usage in $FILE ==="
rg -n "mce_operator" "$FILE" || true

echo
echo "=== mce_operator usage in repo ==="
rg -n "mce_operator" . || true

echo
echo "=== operators variable usage in $FILE ==="
rg -n "operators" "$FILE" || true

Repository: rh-ecosystem-edge/enclave

Length of output: 204


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="operators/advanced-cluster-management/tasks.yaml"

echo "=== File: $FILE (head 120) ==="
nl -ba "$FILE" | sed -n '1,120p'

echo
echo "=== File: $FILE (tail 80) ==="
nl -ba "$FILE" | tail -n 80 || true

echo
echo "=== mce_operator usage in $FILE ==="
rg -n "mce_operator" "$FILE" || true

echo
echo "=== mce_operator usage in repo ==="
rg -n "mce_operator" . || true

echo
echo "=== operators variable usage in $FILE ==="
rg -n "operators" "$FILE" || true

Repository: rh-ecosystem-edge/enclave

Length of output: 204


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="operators/advanced-cluster-management/tasks.yaml"

echo "=== Print $FILE with line numbers (1-120) ==="
python3 - <<'PY'
import itertools
file_path="operators/advanced-cluster-management/tasks.yaml"
start, end = 1, 120
with open(file_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if i < start: 
            continue
        if i > end:
            break
        print(f"{i:4d}\t{line.rstrip()}")
PY

echo
echo "=== Print $FILE with line numbers (120-end, up to 260) ==="
python3 - <<'PY'
file_path="operators/advanced-cluster-management/tasks.yaml"
start, end = 121, 260
with open(file_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if i < start: 
            continue
        if i > end:
            break
        print(f"{i:4d}\t{line.rstrip()}")
PY

echo
echo "=== rg: mce_operator usage ==="
rg -n "mce_operator" operators/advanced-cluster-management || true

echo
echo "=== rg: operators usage near tasks.yaml ==="
rg -n "operators" operators/advanced-cluster-management/tasks.yaml || true

Repository: rh-ecosystem-edge/enclave

Length of output: 4866


Major: Assert mce_operator is present before dereferencing .name/.version/.namespace.

In operators/advanced-cluster-management/tasks.yaml, mce_operator is set to operators | selectattr('name','equalto','multicluster-engine') | first (line 18). If operators lacks a multicluster-engine entry, the subsequent shell task dereferences mce_operator.name, .version, and .namespace (lines 25-27), leading to a broken reconciliation run (template failure or empty/invalid CLI args). Add an ansible.builtin.assert right after the set_fact to fail fast with a clear message when the required operator entry/fields are missing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operators/advanced-cluster-management/tasks.yaml` around lines 16 - 18, After
setting mce_operator via the set_fact that selects the 'multicluster-engine'
entry, add an ansible.builtin.assert task immediately after to fail fast if
mce_operator is undefined or missing required keys; assert that mce_operator is
defined and that mce_operator.name, mce_operator.version and
mce_operator.namespace are present (with a clear message like
"multicluster-engine operator entry missing or incomplete") so the later shell
task that dereferences mce_operator.name/.version/.namespace cannot run with
nil/invalid values.

Comment on lines +21 to +39
ansible.builtin.shell: |
set -euo pipefail
# Approve any pending upgrade InstallPlans for MCE and ACM
enclave reconcile operator-versions \
--name "{{ mce_operator.name }}" \
--version "{{ mce_operator.version }}" \
--namespace "{{ mce_operator.namespace }}" \
--csv-name "{{ mce_operator.name }}" \
--no-dry-run 2>&1 || true
enclave reconcile operator-versions \
--name "{{ operator.name }}" \
--version "{{ operator.version }}" \
--namespace "{{ operator.namespace }}" \
--csv-name "{{ operator.name }}" \
--no-dry-run 2>&1 || true
# Output MCH phase for the until condition
oc get multiclusterhub multiclusterhub -n open-cluster-management \
-o jsonpath='{.status.phase}'
register: mch_phase_result

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Major: Add per-attempt timeouts to the retrying ansible.builtin.shell in operators/advanced-cluster-management/tasks.yaml (lines 21-39).
The task retries on mch_phase_result, but the enclave reconcile operator-versions ... and oc get multiclusterhub ... commands have no per-attempt timeout and can hang indefinitely—blocking retries and stalling the upgrade-approval flow (availability risk).

Suggested fix
-    enclave reconcile operator-versions \
+    timeout 60s enclave reconcile operator-versions \
       --name "{{ mce_operator.name }}" \
       --version "{{ mce_operator.version }}" \
       --namespace "{{ mce_operator.namespace }}" \
       --csv-name "{{ mce_operator.name }}" \
       --no-dry-run 2>&1 || true
-    enclave reconcile operator-versions \
+    timeout 60s enclave reconcile operator-versions \
       --name "{{ operator.name }}" \
       --version "{{ operator.version }}" \
       --namespace "{{ operator.namespace }}" \
       --csv-name "{{ operator.name }}" \
       --no-dry-run 2>&1 || true
     # Output MCH phase for the until condition
-    oc get multiclusterhub multiclusterhub -n open-cluster-management \
+    timeout 60s oc get multiclusterhub multiclusterhub -n open-cluster-management \
       -o jsonpath='{.status.phase}'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operators/advanced-cluster-management/tasks.yaml` around lines 21 - 39, The
shell task registering mch_phase_result can hang because the two commands
(enclave reconcile operator-versions and oc get multiclusterhub ...) have no
per-attempt timeout; update the ansible.builtin.shell invocation to wrap each of
those commands with a per-command timeout (e.g., via the GNU timeout utility
such as timeout 300s enclave reconcile ... and timeout 30s oc get ...) so each
retry attempt will fail fast instead of hanging indefinitely; keep set -euo
pipefail and the same register variable (mch_phase_result) unchanged.

Comment thread operators/advanced-cluster-management/tasks.yaml Outdated
When MCE is installed with Manual approval, a newer version can appear
in the catalog while MCH is still reconciling. This creates a second
InstallPlan in RequiresApproval phase that nobody approves, causing MCH
to get stuck in phase: Installing forever.

Replace the simple k8s_info MCH wait with a shell task that directly
approves any unapproved InstallPlans in the MCE and ACM namespaces on
every retry iteration before checking MCH phase.
@eliorerz eliorerz force-pushed the fix/approve-upgrade-installplans-during-mch-wait branch from f31504d to 6766d0e Compare June 10, 2026 00:47
@maorfr

maorfr commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

this PR effectively cancelles version pinning, not sure we can merge it?

@maorfr

maorfr commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Claude Code:

I've created an alternative proposal in PR #471 that introduces a more flexible property for operators.

Instead of hardcoding logic for multicluster-engine, the new approach:

  • Adds a property to the operator definition
  • Uses automatic install plan approval when and
  • Is extensible to any operator in the future

This is proposed as a workaround. We should still pursue a bug with the ACM team for a proper fix, but this provides a configurable solution in the meantime.

See: #471

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

operators Operator installation/config

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants