[WIP] OSAC-1384: Fix operator-versions to approve InstallPlans on every poll iteration#470
[WIP] OSAC-1384: Fix operator-versions to approve InstallPlans on every poll iteration#470rporres wants to merge 1 commit into
Conversation
WalkthroughThis PR refactors CSV operator installation reconciliation by introducing ChangesCSV Approval Polling Refactor
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Security & Risk AssessmentRisk Severity: Low to Moderate Specific Risks Identified:
Impact: Risks are contained to operator installation workflows and do not affect core system stability or data integrity. 🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/reconcile/operator_versions.py`:
- Around line 239-245: The dry_run boolean is redundant when calling
approve_and_wait_for_csv because the caller returns early when dry_run is true;
remove the dry_run parameter from the approve_and_wait_for_csv function
signature and all calls (update the definition of approve_and_wait_for_csv and
its invocation in the loop that currently passes dry_run), leaving
approve_install_plans(dry_run, namespace, op_version_map) as the only place that
consumes dry_run; also search for other callers of approve_and_wait_for_csv to
update their call sites and related tests/docstrings to reflect the new
signature.
- Around line 193-217: The loop calling run_oc_command to get the CSV phase
silently swallows oc failures; update the block in the wait/poll function (where
run_oc_command, csv_name, csv_version, namespace are used) to log subprocess
failures when result.returncode != 0 instead of treating them as an empty
phase—use the existing log_subprocess_output helper or logger.warning to emit
result.stderr/result.stdout and the return code (include csv_name.v{csv_version}
and namespace in the message for context); keep the existing timeout/raise
TimeoutError behavior but ensure failures are logged each iteration so
persistent oc failures are observable.
In `@src/tests/test_operator_versions.py`:
- Around line 252-296: Add a unit test that asserts approve_and_wait_for_csv
raises TimeoutError when the CSV never reaches "Succeeded": patch
reconcile.operator_versions.run_oc_command to repeatedly return a plan (use
_plan) then "patched" then a non-"Succeeded" phase (e.g., "Installing")
sufficient times to exceed the timeout; patch
reconcile.operator_versions.time.time to simulate wall-clock progression past
the timeout and patch time.sleep to no-op; call approve_and_wait_for_csv with a
small timeout_minutes and wrap in pytest.raises(TimeoutError, match="did not
reach phase=Succeeded") to verify the timeout path in approve_and_wait_for_csv.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9df49dc2-1950-4376-a771-34b7f5d6a41a
📒 Files selected for processing (2)
src/reconcile/operator_versions.pysrc/tests/test_operator_versions.py
…l iteration When upgrading across multiple versions, OLM may not create all InstallPlans upfront — depending on the channel's skips graph, intermediate InstallPlans can appear only after earlier versions finish installing. The previous implementation called approve_install_plans once upfront then blocked in wait_for_resource_status, so any InstallPlan that did not exist at that moment was never approved. Replace wait_for_resource_status with a new approve_and_wait_for_csv function that calls approve_install_plans on every poll iteration before checking the target CSV phase, ensuring no InstallPlan is missed regardless of when OLM creates it. Signed-off-by: Rafa Porres Molina <rporresm@redhat.com> Assisted-by: Claude Code <noreply@anthropic.com>
7abb924 to
36b164d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tests/test_operator_versions.py (1)
213-337:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMinor: this test block is still failing the Python formatting gate.
The PR validation job already reports that
ruff formatwould rewrite this file, so these test additions cannot passmake python-linter-testas-is. Re-runmake python-formatand commit the normalized output before merge.As per coding guidelines, Python code must use ruff with an 88-character line limit and be validated with
make python-linter-test/make python-format.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tests/test_operator_versions.py` around lines 213 - 337, The file fails ruff formatting; run the project's formatter and commit the changes so the tests pass. Reformat this test file (which contains functions like test_reconcile_delegates_to_approve_and_wait_per_csv, test_reconcile_dry_run_skips_wait, test_reconcile_replaces_plus_in_version, etc.) using the repo tooling (run make python-format or ruff format with the project's 88-char config) and commit the rewritten file before merging.Sources: Coding guidelines, Pipeline failures
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/enclave/reconcile/operator_versions.py`:
- Around line 191-225: The loop around approve_install_plans() and the
run_oc_command(...) call currently lets a per-call TimeoutError escape and abort
the whole reconcile; wrap calls that can raise run_oc_command's TimeoutError
(both approve_install_plans(...) if it invokes run_oc_command and the explicit
run_oc_command(...) that reads CSV phase) in a try/except that catches
TimeoutError, treats it as a transient error (log a warning with context
including csv_name, csv_version, namespace and the exception), and continues
retrying until the global deadline is reached (only re-raise TimeoutError if
time.time() >= deadline). Update the loop logic around approve_install_plans and
the CSV phase check to implement this transient-retry behavior.
---
Outside diff comments:
In `@src/tests/test_operator_versions.py`:
- Around line 213-337: The file fails ruff formatting; run the project's
formatter and commit the changes so the tests pass. Reformat this test file
(which contains functions like
test_reconcile_delegates_to_approve_and_wait_per_csv,
test_reconcile_dry_run_skips_wait, test_reconcile_replaces_plus_in_version,
etc.) using the repo tooling (run make python-format or ruff format with the
project's 88-char config) and commit the rewritten file before merging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4e9837ea-fb1c-4eaf-83ef-4ddc87142c47
📒 Files selected for processing (2)
src/enclave/reconcile/operator_versions.pysrc/tests/test_operator_versions.py
| while True: | ||
| approve_install_plans(dry_run, namespace, op_version_map) | ||
| result = run_oc_command([ | ||
| "oc", | ||
| "get", | ||
| "clusterserviceversion.operators.coreos.com", | ||
| f"{csv_name}.v{csv_version}", | ||
| "-n", | ||
| namespace, | ||
| "-o", | ||
| "jsonpath={.status.phase}", | ||
| ]) | ||
| if result.returncode != 0: | ||
| log_subprocess_output( | ||
| f"oc get clusterserviceversion/{csv_name}.v{csv_version} failed" | ||
| f" (exit {result.returncode})", | ||
| result.stderr or "", | ||
| ) | ||
| phase = "" | ||
| else: | ||
| phase = (result.stdout or "").strip() | ||
| if phase == "Succeeded": | ||
| logger.info( | ||
| "CSV %s.v%s in namespace %s reached status.phase=Succeeded.", | ||
| csv_name, | ||
| csv_version, | ||
| namespace, | ||
| ) | ||
| return | ||
| if time.time() >= deadline: | ||
| raise TimeoutError( | ||
| f"CSV {csv_name}.v{csv_version} did not reach phase=Succeeded" | ||
| f" within {timeout_minutes} minutes (last observed: {phase!r})" | ||
| ) | ||
| time.sleep(sleep_interval) |
There was a problem hiding this comment.
Major: a transient oc timeout now aborts the upgrade loop.
run_oc_command() raises TimeoutError after 60 seconds, but this loop never catches it. The old wait_for_resource_status() behavior retried per-call oc timeouts until the global deadline; now one slow API response during either approve_install_plans(...) or the CSV phase read aborts reconciliation immediately, which reintroduces upgrade fragility under API slowness.
Suggested fix
while True:
- approve_install_plans(dry_run, namespace, op_version_map)
- result = run_oc_command([
- "oc",
- "get",
- "clusterserviceversion.operators.coreos.com",
- f"{csv_name}.v{csv_version}",
- "-n",
- namespace,
- "-o",
- "jsonpath={.status.phase}",
- ])
+ try:
+ approve_install_plans(dry_run, namespace, op_version_map)
+ result = run_oc_command([
+ "oc",
+ "get",
+ "clusterserviceversion.operators.coreos.com",
+ f"{csv_name}.v{csv_version}",
+ "-n",
+ namespace,
+ "-o",
+ "jsonpath={.status.phase}",
+ ])
+ except TimeoutError as exc:
+ if time.time() >= deadline:
+ raise TimeoutError(
+ f"CSV {csv_name}.v{csv_version} did not reach phase=Succeeded "
+ f"within {timeout_minutes} minutes (last observed: 'oc timeout')"
+ ) from exc
+ logger.warning(
+ "Timed out while polling CSV %s.v%s in namespace %s. Retrying...",
+ csv_name,
+ csv_version,
+ namespace,
+ )
+ time.sleep(sleep_interval)
+ continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/enclave/reconcile/operator_versions.py` around lines 191 - 225, The loop
around approve_install_plans() and the run_oc_command(...) call currently lets a
per-call TimeoutError escape and abort the whole reconcile; wrap calls that can
raise run_oc_command's TimeoutError (both approve_install_plans(...) if it
invokes run_oc_command and the explicit run_oc_command(...) that reads CSV
phase) in a try/except that catches TimeoutError, treats it as a transient error
(log a warning with context including csv_name, csv_version, namespace and the
exception), and continues retrying until the global deadline is reached (only
re-raise TimeoutError if time.time() >= deadline). Update the loop logic around
approve_install_plans and the CSV phase check to implement this transient-retry
behavior.
When upgrading across multiple versions, OLM may not create all InstallPlans upfront — depending on the channel's skips graph, intermediate InstallPlans can appear only after earlier versions finish installing. The previous implementation called approve_install_plans once upfront then blocked in wait_for_resource_status, so any InstallPlan that did not exist at that moment was never approved.
Replace wait_for_resource_status with a new approve_and_wait_for_csv function that calls approve_install_plans on every poll iteration before checking the target CSV phase, ensuring no InstallPlan is missed regardless of when OLM creates it.
Summary by CodeRabbit
Tests
Bug Fixes