[WIP] OSAC-1384: Fix operator-versions to approve InstallPlans on every poll iteration #470

Open
rporres wants to merge 1 commit into main from support-intermediate-install-plans

Conversation

@rporres rporres commented Jun 10, 2026

Contributor

When upgrading across multiple versions, OLM may not create all InstallPlans upfront — depending on the channel's skips graph, intermediate InstallPlans can appear only after earlier versions finish installing. The previous implementation called approve_install_plans once upfront then blocked in wait_for_resource_status, so any InstallPlan that did not exist at that moment was never approved.

Replace wait_for_resource_status with a new approve_and_wait_for_csv function that calls approve_install_plans on every poll iteration before checking the target CSV phase, ensuring no InstallPlan is missed regardless of when OLM creates it.
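
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `poll_until_succeeded`, `approve`, and `get_phase` are hypothetical names, with the two callables standing in for `approve_install_plans` and the CSV phase query. The simulated cluster only creates the second plan after the first one has been approved, which is the situation a single upfront approval would miss.

```python
import time
from typing import Callable


def poll_until_succeeded(
    approve: Callable[[], None],
    get_phase: Callable[[], str],
    timeout_seconds: float,
    sleep_interval: float = 0.0,
) -> int:
    """Approve on every iteration, then check the phase.

    `approve` and `get_phase` are hypothetical stand-ins for the real helpers.
    Returns the number of iterations needed to observe "Succeeded".
    """
    deadline = time.monotonic() + timeout_seconds
    iterations = 0
    while True:
        iterations += 1
        approve()  # runs every time, so late-created plans are picked up
        phase = get_phase()
        if phase == "Succeeded":
            return iterations
        if time.monotonic() >= deadline:
            raise TimeoutError(f"last observed phase: {phase!r}")
        time.sleep(sleep_interval)


# Simulated cluster: the second plan only appears after the first is approved.
pending = ["plan-v1.1"]
approved: list[str] = []


def fake_approve() -> None:
    while pending:
        approved.append(pending.pop())
    if approved == ["plan-v1.1"]:
        pending.append("plan-v1.2")  # intermediate plan shows up later


def fake_phase() -> str:
    return "Succeeded" if "plan-v1.2" in approved else "Installing"


iterations = poll_until_succeeded(fake_approve, fake_phase, timeout_seconds=5)
```

With a single upfront call to `fake_approve`, `plan-v1.2` would stay pending and the phase would never reach `Succeeded`; approving inside the loop picks it up on the second iteration.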

Summary by CodeRabbit

  • Tests

    • Updated test coverage for operator version reconciliation, including new assertions for approval workflow and timeout scenarios.
  • Bug Fixes

    • Enhanced operator approval process with improved handling of install plan approval and timeout detection during status transitions.

@coderabbitai coderabbitai Bot commented Jun 10, 2026

Walkthrough

This PR refactors CSV operator installation reconciliation by introducing approve_and_wait_for_csv, a new polling function that repeatedly approves InstallPlans and queries each CSV's terminal state instead of relying on external wait utilities. The reconcile function now delegates per-CSV approval+waiting to this helper while maintaining dry-run visibility limitations.

Changes

CSV Approval Polling Refactor

| Layer / File(s) | Summary |
|---|---|
| Import updates: src/enclave/reconcile/operator_versions.py, src/tests/test_operator_versions.py | Add time module import and remove wait_for_resource_status dependency; test imports updated to reference the new approve_and_wait_for_csv helper. |
| approve_and_wait_for_csv polling loop: src/enclave/reconcile/operator_versions.py, src/tests/test_operator_versions.py | New approve_and_wait_for_csv function implements deadline-based polling: re-invokes approve_install_plans on each iteration, queries CSV status.phase via oc get ... jsonpath, returns on Succeeded, raises TimeoutError on deadline. Tests cover intermediate approvals across polling cycles and timeout behavior with time mocking. |
| Reconcile delegation and test updates: src/enclave/reconcile/operator_versions.py, src/tests/test_operator_versions.py | reconcile constructs version map, immediately approves and exits in dry-run mode (skipping CSV waits), and delegates per-CSV sequencing to approve_and_wait_for_csv in non-dry-run mode with normalized version strings (+ replaced with -). Tests verify per-CSV delegation, dry-run short-circuit, version normalization, and consistent version map across all calls. |
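
The version normalization mentioned in the last row could look roughly like the sketch below. It assumes the normalization is a plain replacement of `+` with `-` (inferred from the summary and from the test name test_reconcile_replaces_plus_in_version); `normalize_csv_version` and `csv_resource_name` are hypothetical helper names, not functions from the PR. The `{csv_name}.v{csv_version}` naming pattern is taken from the reviewed code.

```python
def normalize_csv_version(version: str) -> str:
    """Replace '+' with '-' in a version string.

    Assumption: this is the whole normalization; the real code may do more.
    """
    return version.replace("+", "-")


def csv_resource_name(csv_name: str, version: str) -> str:
    """Build the CSV resource name in the `<name>.v<version>` form."""
    return f"{csv_name}.v{normalize_csv_version(version)}"


# Hypothetical operator name and version, for illustration only.
name = csv_resource_name("example-operator", "1.2.3+build4")
```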

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • rh-ecosystem-edge/enclave#375: Earlier refactoring of operator-versions reconciliation flow; this PR builds on that structure by introducing interval-based CSV polling with explicit retry and timeout semantics.

Suggested labels

operators, validation

Suggested reviewers

  • maorfr

InstallPlans now dance the polling tango,

Re-approving, re-querying with each go,

'Til CSV phase shines Succeeded—no more limbo!

Dry-runs glimpse once, non-dry waits for the glow. 🔄


Security & Risk Assessment

Risk Severity: Low to Moderate

Specific Risks Identified:

  1. Polling Timeout Edge Case (Low): If the timeout deadline passes precisely between an oc query failure (treated as empty phase) and the next iteration, a TimeoutError will be raised with the last observed phase. Ensure calling code handles this timeout gracefully; incomplete operator installations could be left in a partially-updated state.

  2. Intermediate InstallPlan Visibility (Low): In dry-run mode, the function approves InstallPlans once then exits. Any intermediate InstallPlans created by the operator after the dry-run approval will not be visible—this is a known limitation documented in the updated docstring but should be communicated to users running dry-runs for audit purposes.

  3. CSV Query Failure Handling (Low): When oc get clusterserviceversion fails, the phase is silently treated as empty and polling continues. If the failure is caused by an RBAC denial or a network issue (rather than a CSV that does not exist yet), the loop keeps retrying until the timeout and adds log noise on every iteration. Consider whether failed queries should increment a separate retry counter or otherwise be distinguished from legitimate missing/not-ready states.

  4. Repeated approve_install_plans Invocation (Low): The function re-approves InstallPlans on every polling iteration. This is idempotent but increases cluster API load and log volume during extended waits; verify the cluster can sustain this polling frequency without degrading other operations.

Impact: Risks are contained to operator installation workflows and do not affect core system stability or data integrity.
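
One way to act on risk 3 is to count consecutive query failures separately from "not ready yet" phases. The sketch below is only an illustration under stated assumptions: `QueryResult`, `TooManyQueryFailures`, `first_succeeded_attempt`, and the threshold of three are all hypothetical and do not come from the PR.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class QueryResult:
    """Hypothetical stand-in for the result of one `oc get` call."""

    returncode: int
    stdout: str = ""


class TooManyQueryFailures(RuntimeError):
    """Hypothetical error: the query itself keeps failing."""


def first_succeeded_attempt(
    results: Iterable[QueryResult], max_consecutive_failures: int = 3
) -> Optional[int]:
    """Return the 1-based attempt that saw "Succeeded", or None if none did."""
    consecutive_failures = 0
    for attempt, result in enumerate(results, start=1):
        if result.returncode != 0:
            # A failed query is tracked apart from a not-ready phase.
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                raise TooManyQueryFailures(
                    f"{consecutive_failures} failed queries in a row"
                )
            continue
        consecutive_failures = 0  # any successful query resets the counter
        if result.stdout.strip() == "Succeeded":
            return attempt
    return None


failed = QueryResult(returncode=1)
attempt = first_succeeded_attempt(
    [failed, QueryResult(0, "Installing"), failed, QueryResult(0, "Succeeded")]
)
```

Isolated failures are tolerated because the counter resets after every successful query, while a persistent failure such as an RBAC denial surfaces after a few attempts instead of waiting for the full timeout.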

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the core change: fixing operator-versions to approve InstallPlans on every poll iteration rather than once up front, which directly addresses the multi-step upgrade problem described in objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| No-Hardcoded-Secrets | ✅ Passed | No hardcoded secrets found: no API keys, tokens, passwords, private keys, base64 strings >32 chars, embedded credentials, or sensitive variable assignments detected in the modified files. |
| No-Weak-Crypto | ✅ Passed | No weak cryptographic algorithms, custom crypto implementations, or non-constant-time secret comparisons detected. PR contains only operational changes to operator version reconciliation logic. |
| No-Injection-Vectors | ✅ Passed | No injection vectors detected. All subprocess calls use list-based arguments (subprocess.run(args, ...)) without shell=True, preventing command injection. No SQL concatenation, eval/exec, pickle, u... |
| Container-Privileges | ✅ Passed | PR contains only Python code and test fixtures; no container/K8s manifests with privilege escalation settings are present or modified. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The PR only logs stderr from failed oc commands and operator/version names - no passwords, tokens, API keys, PII, or credentials are exposed in logs. |
| Ai-Attribution | ✅ Passed | AI tool usage (Claude Code) was properly attributed using "Assisted-by:" trailer in commit; no improper "Co-Authored-By" used. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/reconcile/operator_versions.py`:
- Around line 239-245: The dry_run boolean is redundant when calling
approve_and_wait_for_csv because the caller returns early when dry_run is true;
remove the dry_run parameter from the approve_and_wait_for_csv function
signature and all calls (update the definition of approve_and_wait_for_csv and
its invocation in the loop that currently passes dry_run), leaving
approve_install_plans(dry_run, namespace, op_version_map) as the only place that
consumes dry_run; also search for other callers of approve_and_wait_for_csv to
update their call sites and related tests/docstrings to reflect the new
signature.
- Around line 193-217: The loop calling run_oc_command to get the CSV phase
silently swallows oc failures; update the block in the wait/poll function (where
run_oc_command, csv_name, csv_version, namespace are used) to log subprocess
failures when result.returncode != 0 instead of treating them as an empty
phase—use the existing log_subprocess_output helper or logger.warning to emit
result.stderr/result.stdout and the return code (include {csv_name}.v{csv_version}
and namespace in the message for context); keep the existing timeout/raise
TimeoutError behavior but ensure failures are logged each iteration so
persistent oc failures are observable.

In `@src/tests/test_operator_versions.py`:
- Around line 252-296: Add a unit test that asserts approve_and_wait_for_csv
raises TimeoutError when the CSV never reaches "Succeeded": patch
reconcile.operator_versions.run_oc_command to repeatedly return a plan (use
_plan) then "patched" then a non-"Succeeded" phase (e.g., "Installing")
sufficient times to exceed the timeout; patch
reconcile.operator_versions.time.time to simulate wall-clock progression past
the timeout and patch time.sleep to no-op; call approve_and_wait_for_csv with a
small timeout_minutes and wrap in pytest.raises(TimeoutError, match="did not
reach phase=Succeeded") to verify the timeout path in approve_and_wait_for_csv.
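
The time-mocking approach suggested for the timeout test can be sketched in isolation. The example below uses only the standard library so that it runs standalone; `wait_for_phase` is a hypothetical, simplified poller rather than the function under review, and `FakeClock` is an illustrative helper. In the real suite, `pytest.raises(TimeoutError, match=...)` would presumably replace the try/except.

```python
import time
from unittest import mock


def wait_for_phase(get_phase, timeout_minutes, sleep_interval=10):
    """Hypothetical simplified poller, not the function under review."""
    deadline = time.time() + timeout_minutes * 60
    while True:
        phase = get_phase()
        if phase == "Succeeded":
            return
        if time.time() >= deadline:
            raise TimeoutError(
                f"did not reach phase=Succeeded (last observed: {phase!r})"
            )
        time.sleep(sleep_interval)


class FakeClock:
    """Returns 0, 31, 62, ... seconds on successive calls."""

    def __init__(self, step=31.0):
        self.now = 0.0
        self.step = step

    def __call__(self):
        value = self.now
        self.now += self.step
        return value


def run_timeout_scenario():
    """Drive the poller past its deadline without waiting in real time."""
    with mock.patch("time.time", side_effect=FakeClock()):
        with mock.patch("time.sleep") as fake_sleep:  # sleep becomes a no-op
            try:
                wait_for_phase(lambda: "Installing", timeout_minutes=1)
            except TimeoutError as exc:
                return str(exc), fake_sleep.call_count
    return None, 0


message, sleeps = run_timeout_scenario()
```

The fake clock reads 0 s when the deadline is computed, 31 s on the first check, and 62 s on the second, so the one-minute timeout trips after exactly one mocked sleep.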

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9df49dc2-1950-4376-a771-34b7f5d6a41a

📥 Commits

Reviewing files that changed from the base of the PR and between 91f6d95 and f51a48b.

📒 Files selected for processing (2)
  • src/reconcile/operator_versions.py
  • src/tests/test_operator_versions.py

Comment thread src/enclave/reconcile/operator_versions.py
Comment thread src/enclave/reconcile/operator_versions.py
Comment thread src/tests/test_operator_versions.py
@maorfr added the rc-ok label (indicates a PR is OK to reach the main branch while a release candidate is being prepared) on Jun 10, 2026
…l iteration

When upgrading across multiple versions, OLM may not create all
InstallPlans upfront — depending on the channel's skips graph,
intermediate InstallPlans can appear only after earlier versions
finish installing. The previous implementation called
approve_install_plans once upfront then blocked in
wait_for_resource_status, so any InstallPlan that did not exist
at that moment was never approved.

Replace wait_for_resource_status with a new approve_and_wait_for_csv
function that calls approve_install_plans on every poll iteration
before checking the target CSV phase, ensuring no InstallPlan is
missed regardless of when OLM creates it.

Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
Assisted-by: Claude Code <noreply@anthropic.com>
@rporres rporres force-pushed the support-intermediate-install-plans branch from 7abb924 to 36b164d Compare June 11, 2026 07:38

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/tests/test_operator_versions.py (1)

213-337: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Minor: this test block is still failing the Python formatting gate.

The PR validation job already reports that ruff format would rewrite this file, so these test additions cannot pass make python-linter-test as-is. Re-run make python-format and commit the normalized output before merge.

As per coding guidelines, Python code must use ruff with an 88-character line limit and be validated with make python-linter-test / make python-format.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/tests/test_operator_versions.py` around lines 213 - 337, The file fails
ruff formatting; run the project's formatter and commit the changes so the tests
pass. Reformat this test file (which contains functions like
test_reconcile_delegates_to_approve_and_wait_per_csv,
test_reconcile_dry_run_skips_wait, test_reconcile_replaces_plus_in_version,
etc.) using the repo tooling (run make python-format or ruff format with the
project's 88-char config) and commit the rewritten file before merging.

Sources: Coding guidelines, Pipeline failures

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/enclave/reconcile/operator_versions.py`:
- Around line 191-225: The loop around approve_install_plans() and the
run_oc_command(...) call currently lets a per-call TimeoutError escape and abort
the whole reconcile; wrap calls that can raise run_oc_command's TimeoutError
(both approve_install_plans(...) if it invokes run_oc_command and the explicit
run_oc_command(...) that reads CSV phase) in a try/except that catches
TimeoutError, treats it as a transient error (log a warning with context
including csv_name, csv_version, namespace and the exception), and continues
retrying until the global deadline is reached (only re-raise TimeoutError if
time.time() >= deadline). Update the loop logic around approve_install_plans and
the CSV phase check to implement this transient-retry behavior.

---

Outside diff comments:
In `@src/tests/test_operator_versions.py`:
- Around line 213-337: The file fails ruff formatting; run the project's
formatter and commit the changes so the tests pass. Reformat this test file
(which contains functions like
test_reconcile_delegates_to_approve_and_wait_per_csv,
test_reconcile_dry_run_skips_wait, test_reconcile_replaces_plus_in_version,
etc.) using the repo tooling (run make python-format or ruff format with the
project's 88-char config) and commit the rewritten file before merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4e9837ea-fb1c-4eaf-83ef-4ddc87142c47

📥 Commits

Reviewing files that changed from the base of the PR and between 7abb924 and 36b164d.

📒 Files selected for processing (2)
  • src/enclave/reconcile/operator_versions.py
  • src/tests/test_operator_versions.py

Comment on lines +191 to +225
    while True:
        approve_install_plans(dry_run, namespace, op_version_map)
        result = run_oc_command([
            "oc",
            "get",
            "clusterserviceversion.operators.coreos.com",
            f"{csv_name}.v{csv_version}",
            "-n",
            namespace,
            "-o",
            "jsonpath={.status.phase}",
        ])
        if result.returncode != 0:
            log_subprocess_output(
                f"oc get clusterserviceversion/{csv_name}.v{csv_version} failed"
                f" (exit {result.returncode})",
                result.stderr or "",
            )
            phase = ""
        else:
            phase = (result.stdout or "").strip()
        if phase == "Succeeded":
            logger.info(
                "CSV %s.v%s in namespace %s reached status.phase=Succeeded.",
                csv_name,
                csv_version,
                namespace,
            )
            return
        if time.time() >= deadline:
            raise TimeoutError(
                f"CSV {csv_name}.v{csv_version} did not reach phase=Succeeded"
                f" within {timeout_minutes} minutes (last observed: {phase!r})"
            )
        time.sleep(sleep_interval)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major: a transient oc timeout now aborts the upgrade loop.

run_oc_command() raises TimeoutError after 60 seconds, but this loop never catches it. The old wait_for_resource_status() behavior retried per-call oc timeouts until the global deadline; now one slow API response during either approve_install_plans(...) or the CSV phase read aborts reconciliation immediately, which reintroduces upgrade fragility under API slowness.

Suggested fix
     while True:
-        approve_install_plans(dry_run, namespace, op_version_map)
-        result = run_oc_command([
-            "oc",
-            "get",
-            "clusterserviceversion.operators.coreos.com",
-            f"{csv_name}.v{csv_version}",
-            "-n",
-            namespace,
-            "-o",
-            "jsonpath={.status.phase}",
-        ])
+        try:
+            approve_install_plans(dry_run, namespace, op_version_map)
+            result = run_oc_command([
+                "oc",
+                "get",
+                "clusterserviceversion.operators.coreos.com",
+                f"{csv_name}.v{csv_version}",
+                "-n",
+                namespace,
+                "-o",
+                "jsonpath={.status.phase}",
+            ])
+        except TimeoutError as exc:
+            if time.time() >= deadline:
+                raise TimeoutError(
+                    f"CSV {csv_name}.v{csv_version} did not reach phase=Succeeded "
+                    f"within {timeout_minutes} minutes (last observed: 'oc timeout')"
+                ) from exc
+            logger.warning(
+                "Timed out while polling CSV %s.v%s in namespace %s. Retrying...",
+                csv_name,
+                csv_version,
+                namespace,
+            )
+            time.sleep(sleep_interval)
+            continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/reconcile/operator_versions.py` around lines 191 - 225, The loop
around approve_install_plans() and the run_oc_command(...) call currently lets a
per-call TimeoutError escape and abort the whole reconcile; wrap calls that can
raise run_oc_command's TimeoutError (both approve_install_plans(...) if it
invokes run_oc_command and the explicit run_oc_command(...) that reads CSV
phase) in a try/except that catches TimeoutError, treats it as a transient error
(log a warning with context including csv_name, csv_version, namespace and the
exception), and continues retrying until the global deadline is reached (only
re-raise TimeoutError if time.time() >= deadline). Update the loop logic around
approve_install_plans and the CSV phase check to implement this transient-retry
behavior.
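
The transient-retry behavior requested here can also be shown as a small standalone sketch. `poll_with_transient_retry`, `step`, and `flaky_step` are hypothetical names; `step` bundles approval and the phase check into one callable that returns True once the target phase is reached. This is an illustration of the pattern under those assumptions, not the fix that was applied.

```python
import time


def poll_with_transient_retry(step, deadline, sleep_interval=0.0):
    """Call `step` until it returns True, retrying per-call TimeoutError.

    `step` is a hypothetical callable bundling approval and the phase check.
    Returns how many transient timeouts were absorbed along the way.
    """
    retries = 0
    while True:
        try:
            if step():
                return retries
        except TimeoutError:
            if time.monotonic() >= deadline:
                raise  # global deadline exhausted: surface the failure
            retries += 1  # transient: count it and try again
        else:
            if time.monotonic() >= deadline:
                raise TimeoutError("global deadline reached")
        time.sleep(sleep_interval)


# Simulated sequence: one slow API call, one not-ready poll, then success.
outcomes = iter([TimeoutError("slow API response"), False, True])


def flaky_step():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


retries = poll_with_transient_retry(flaky_step, deadline=time.monotonic() + 5)
```

A single slow call is absorbed and counted, and only a timeout that coincides with the exhausted global deadline is re-raised to the caller.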

@maorfr removed the rc-ok label (indicates a PR is OK to reach the main branch while a release candidate is being prepared) on Jun 21, 2026