
[WIP] Add pre-bootstrap leftover check via enclave environment subcommand #496

Open: rporres wants to merge 7 commits into main from pre-bootstrap-cleanup

@rporres rporres commented Jun 15, 2026

Contributor

Introduces enclave environment check-leftovers and enclave environment cleanup commands so customers can detect and remove Metal3/Quay residue before re-running bootstrap. The logic lives in the Python package (included in the customer tarball) rather than scripts/ (which is excluded).

  • Add src/enclave/environment/ with check_leftovers.py, cli.py, cleanup.sh
  • Register enclave environment subcommand in src/enclave/cli.py
  • Add step_check_leftovers() to bootstrap.sh (advise-only, does not auto-clean)
  • Extend shellcheck coverage to src/ in scripts/verification/validate.sh
  • Add deploy-cluster-check-leftovers Makefile.ci target
  • Run leftover check in e2e-deployment and disconnected-dry-run GH Actions workflows after installing Enclave Lab

Summary by CodeRabbit

  • New Features
    • Added an enclave environment command group with check-leftovers (detect stale Metal3/Quay resources) and cleanup (remediate and reset the environment).
    • Introduced a pre-bootstrap leftover check for e2e connected/disconnected flows with step summary output.
  • Bug Fixes
    • Improved failure reporting to correctly attribute failures to the leftover-check step.
  • Tests
    • Added coverage for leftover detection, command exit-code behavior, and cleanup handling.
  • Chores
    • Added CI/deployment make target and a bootstrap step for the leftover check; expanded shell linting to cover src/.

coderabbitai Bot commented Jun 15, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

A new pre-bootstrap leftover-resource detection and cleanup capability is introduced to scan for and remove stale Metal3/Quay infrastructure before bootstrapping. A Python module (check_leftovers.py) detects leftover systemd units, podman resources, and working directory contents. A Bash script (cleanup.sh) removes them idempotently. A Click CLI subcommand group (enclave environment) exposes both actions with working-directory resolution. bootstrap.sh, Makefile.ci, and three CI workflow jobs are wired to invoke the check before bootstrapping.

Changes

Leftover Resource Detection and Cleanup

Layer / File(s) Summary
check_leftovers.py detection logic and module tests
src/enclave/environment/check_leftovers.py, src/tests/test_check_leftovers.py
Adds check_systemd, check_podman (root/user variants), check_working_dir, and main() functions that scan for Metal3/Quay leftover resources across systemd units, podman pods/containers/volumes, and filesystem; a full pytest suite mocks subprocess calls to cover all detection paths and asserts that main() always runs all checks even if earlier checks find leftovers.
cleanup.sh pre-bootstrap removal script
src/enclave/environment/cleanup.sh
New Bash script that idempotently stops and removes Metal3 systemd services, Quadlet unit files, root and user Podman pods/containers/volumes/secrets, Quay registry resources (via mirror-registry uninstall or fallback Podman removal), and working directory contents with a safety case guard to refuse deletion of critical system paths (/, /etc, etc.).
environment Click CLI subcommand group
src/enclave/environment/cli.py, src/enclave/cli.py
Adds Click commands check-leftovers (exits 0 if clean, 1 if cleanup needed) and cleanup (resolves working directory from --working-dir option, env var, or config/global.yaml, invokes cleanup.sh via subprocess with 600s timeout); registers the group as enclave environment subcommand.
bootstrap.sh and deploy_bootstrap_step.sh step integration
bootstrap.sh, scripts/deployment/deploy_bootstrap_step.sh
Adds step_check_leftovers() that invokes enclave environment check-leftovers and aborts with cleanup instructions if leftovers are detected; registers check-leftovers in valid step lists and help output; inserts the call before step_setup in full-run mode; updates deploy_bootstrap_step.sh to accept the new step name.
Makefile.ci target and CI workflow integration
Makefile.ci, .github/workflows/disconnected-dry-run.yml, .github/workflows/e2e-deployment.yml, scripts/verification/validate.sh, scripts/infrastructure/generate_enclave_vars.sh
Adds deploy-cluster-check-leftovers Makefile target; inserts the leftover check step into three CI workflow jobs (after Landing Zone setup and Install Enclave); expands shellcheck discovery from scripts/ to both scripts/ and src/; changes default working directory from /home/cloud-user to per-cluster session subdirectory /home/cloud-user/sessions/${CLUSTER_NAME}.
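
The exit-code contract summarized above for check-leftovers (0 if clean, 1 if cleanup is needed) can be consumed by a caller roughly as follows. This is a minimal sketch, not code from the PR: `LeftoverStatus`, `classify_exit_code`, and `run_check` are hypothetical names, and treating every other exit code as a failure of the check itself is an assumption taken from the review discussion.

```python
import subprocess
from enum import Enum


class LeftoverStatus(Enum):
    """Hypothetical result type for the pre-bootstrap check."""

    CLEAN = "clean"
    LEFTOVERS = "leftovers"
    FAILED = "failed"


def classify_exit_code(returncode: int) -> LeftoverStatus:
    """Map a check-leftovers exit code to a status (0 clean, 1 leftovers)."""
    if returncode == 0:
        return LeftoverStatus.CLEAN
    if returncode == 1:
        return LeftoverStatus.LEFTOVERS
    # Assumption: any other code means the check itself could not run.
    return LeftoverStatus.FAILED


def run_check(cmd: list[str], timeout: float = 60.0) -> LeftoverStatus:
    """Run the check command and classify its exit code."""
    result = subprocess.run(cmd, check=False, timeout=timeout)
    return classify_exit_code(result.returncode)
```

A caller would pass something like `["enclave", "environment", "check-leftovers"]` to `run_check` and abort bootstrap on anything other than `LeftoverStatus.CLEAN`, printing cleanup instructions only for `LEFTOVERS`.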

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • rh-ecosystem-edge/enclave#412: The new deploy-cluster-check-leftovers Makefile.ci target depends on setup-dev-scripts, which was introduced in that PR for the dev-scripts clone lifecycle.

Suggested labels

infrastructure

Suggested reviewers

  • maorfr

Poem

🧹 Before the bootstrap boots its boots,
We scan for stale and leftover roots —
Metal3 pods and systemd ghosts,
Podman volumes haunting our hosts.
"Is the Landing Zone clean?" we cry —
Now we know, no phantom supply! 🛡️

Security Considerations

Risk Level: Moderate

  • Privileged execution: The cleanup.sh script requires sudo access for root-level podman resource removal (pods, volumes, secrets). Ensure script invocation is restricted to trusted CI/deployment contexts and that the WORKING_DIR environment variable is validated upstream before invocation.

  • Destructive operations: The cleanup script wipes the entire working directory, and its only safeguard is a hardcoded denylist (/, /etc, /home, /root, /var, /sys, /proc, /dev, /boot, /tmp) enforced by a case statement. A misconfigured WORKING_DIR or symlink traversal could theoretically lead to unintended data loss. The denylist is robust for exact matches, but auditing path resolution and symlink handling is recommended.

  • Subprocess timeout: The 600-second timeout for cleanup.sh may be insufficient for large infrastructure cleanup operations. When the timeout expires, subprocess.run kills the script and raises TimeoutExpired, which surfaces as a click.ClickException; commands the script had already started may keep running. Ensure cleanup idempotency is reliable and consider documenting timeout tuning requirements for large deployments.

  • YAML config parsing: The _read_working_dir_from_config() helper reads config/global.yaml and coerces the YAML-parsed value to string. While yaml.safe_load() prevents arbitrary code injection, ensure the config file source is validated and restricted to trusted administrators before bootstrap execution.

  • Regex matching for resource detection: Metal3/Quay resource detection relies on compiled regex patterns (METAL3_PATTERN, QUAY_PATTERN). Confirm that pattern coverage is exhaustive for all possible resource names (pods, containers, volumes) to avoid missed cleanup targets in future Metal3/Quay versions.
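
The upstream validation recommended in the privileged-execution and destructive-operations items could be done in the Python CLI before cleanup.sh is invoked. The following is a sketch under stated assumptions: `validate_working_dir` and `CRITICAL_PATHS` are hypothetical names, and the denylist simply mirrors the paths listed above rather than the script's actual case statement.

```python
import os

# Hypothetical denylist, copied from the paths named in the review note.
CRITICAL_PATHS = frozenset(
    {"/", "/etc", "/home", "/root", "/var", "/sys", "/proc", "/dev", "/boot", "/tmp"}
)


def validate_working_dir(raw: str) -> str:
    """Return the canonical working directory or raise ValueError."""
    if not raw:
        raise ValueError("working directory is empty")
    if not os.path.isabs(raw):
        raise ValueError(f"working directory must be absolute: {raw!r}")
    # realpath resolves symlinks, '..' components, and trailing slashes,
    # so the denylist comparison runs against the real target.
    canonical = os.path.realpath(raw)
    if canonical in CRITICAL_PATHS:
        raise ValueError(f"refusing critical system path: {canonical}")
    return canonical
```

Passing the returned canonical path to the script (instead of the raw value) keeps the Python check and the shell guard looking at the same string. It does not replace the guard inside cleanup.sh, since the script can also be run directly.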


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 1 warning)

Check name | Status | Explanation | Resolution
No-Injection-Vectors | ❌ Error | Critical arbitrary code execution vulnerability: cleanup.sh executes mirror-registry at line 53 using uncanonicalized WORKING_DIR before validation (which occurs at line 62), enabling path traversa... | Canonicalize WORKING_DIR with readlink -f immediately after assignment (line 14), before any command execution at line 53; validate against critical paths before operations.
No-Sensitive-Data-In-Logs | ❌ Error | Lines 104-108 in check_leftovers.py log working directory item names (e.g., 'kubeconfig', 'secrets/', filenames) which may expose customer data or sensitive configuration file names. | Remove the itemized file/directory names from the log message at lines 104-108. Log only the count: "Working directory %s is non-empty (%d items)" instead of including the list of names.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
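
The resolution for the No-Sensitive-Data-In-Logs check (log the item count, not the names) might look like the sketch below. The function name `working_dir_is_dirty` and the logger name are hypothetical; only the log message format comes from the check's resolution text.

```python
import logging
from pathlib import Path

logger = logging.getLogger("leftovers")  # hypothetical logger name


def working_dir_is_dirty(working_dir: Path) -> bool:
    """Return True if the directory has contents, logging only the item count."""
    if not working_dir.is_dir():
        return False
    # Count entries without keeping their names, so nothing
    # customer-specific can end up in the log line.
    count = sum(1 for _ in working_dir.iterdir())
    if count == 0:
        return False
    logger.warning("Working directory %s is non-empty (%d items)", working_dir, count)
    return True
```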
✅ Passed checks (8 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately reflects the main change: adding a pre-bootstrap leftover check via the enclave environment subcommand.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets | ✅ Passed | No hardcoded secrets found. PR only references credential names (e.g., metal3-ironic-password) for cleanup, not actual secret values.
No-Weak-Crypto | ✅ Passed | No weak cryptographic practices found. PR adds resource detection/cleanup without any MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB, custom crypto, or unsafe secret comparisons.
Container-Privileges | ✅ Passed | PR introduces no container/K8s manifests with privilege escalation; cleanup.sh's sudo usage is justified for system-level resource cleanup operations.
Ai-Attribution | ✅ Passed | PR does not claim AI tool usage; commit message lacks AI attribution trailers. "AI-generated summary" in review metadata refers to summary generation only, not PR creation.

@github-actions github-actions Bot added the labels deployment (Deployment-related changes), ci-cd (CI/CD infrastructure), and validation (Validation and testing) on Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bootstrap.sh`:
- Around line 172-179: The step_check_leftovers function treats every non-zero
exit code the same way, but the CLI contract specifies exit code 0 means clean
and exit code 1 means cleanup needed, with any other non-zero code indicating an
actual failure. Capture the exit code from the enclave environment
check-leftovers command (run it separately and store the exit code in a
variable) and then explicitly branch on that code: if it equals 0, continue with
the "Environment is clean" message; if it equals 1, show the cleanup error
message and exit; if it equals anything else, log an error about the failed
precheck and exit. This prevents real command failures from being silently
reported as one of the two expected outcomes.

In `@src/enclave/environment/check_leftovers.py`:
- Around line 16-18: The _run function silently ignores command failures by
using check=False without validating the result, causing failed
podman/systemctl/sudo commands to be treated as successful "no leftovers"
results. Check the return code of the subprocess result and raise an exception
if the command exits with a non-zero status code before returning the stdout
lines. This ensures that actual command failures and timeouts are properly
surfaced rather than masking dirty environment states as clean.

In `@src/enclave/environment/cleanup.sh`:
- Around line 61-63: The rm command at line 62 that wipes the working directory
is unsafe because it misses dotfiles (files starting with a dot) and lacks a
guard against dangerous paths like root. Add a safety check before the rm
command to verify that the workingDir variable is not empty and is not a
critical system path like / or /etc, then modify the rm command to properly
delete both regular files and dotfiles by using a pattern that explicitly
includes dotfiles or by enabling the dotglob shell option before executing the
removal.
- Line 7: The cleanup.sh script at line 7 only includes set -u which catches
unset variables but allows command failures to be silently ignored, causing the
CLI to trust a false-success exit code. Add set -e (and set -o pipefail) after
set -u to make the script fail on any command error. Then wrap only the
explicitly-idempotent removal commands with a run_ignore wrapper function so
those specific operations can fail gracefully without terminating the script,
while ensuring all other critical cleanup commands will properly fail the script
if they encounter errors.

In `@src/enclave/environment/cli.py`:
- Around line 54-55: The subprocess.run call for the bash script execution in
the cleanup command lacks a timeout parameter, which can cause the CLI/CI step
to hang indefinitely if the subprocess blocks (such as waiting for interactive
input). Add a timeout parameter to the subprocess.run function call to ensure
the process terminates within a reasonable timeframe if it becomes unresponsive,
and handle the resulting TimeoutExpired exception appropriately.
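
The last item above (a bounded cleanup subprocess with explicit TimeoutExpired handling) can be sketched as follows. `CleanupError` and `run_cleanup` are hypothetical names; the real CLI is described as raising click.ClickException, which is left out here to keep the example free of third-party imports. The 600-second default matches the value mentioned in the walkthrough, and closing stdin is an extra assumption aimed at the "waiting for interactive input" case.

```python
import subprocess


class CleanupError(RuntimeError):
    """Hypothetical error type; the real CLI may raise click.ClickException."""


def run_cleanup(cmd: list[str], timeout: float = 600.0) -> None:
    """Run the cleanup command, failing on timeout or non-zero exit."""
    try:
        result = subprocess.run(
            cmd,
            check=False,
            stdin=subprocess.DEVNULL,  # never block on interactive input
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed the child at this point.
        raise CleanupError(f"cleanup timed out after {exc.timeout}s") from exc
    if result.returncode != 0:
        raise CleanupError(f"cleanup exited with code {result.returncode}")
```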

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2b00ed72-d137-4868-966c-c0d426e8c3c7

📥 Commits

Reviewing files that changed from the base of the PR and between 45d0061 and c619766.

📒 Files selected for processing (12)
  • .github/workflows/disconnected-dry-run.yml
  • .github/workflows/e2e-deployment.yml
  • Makefile.ci
  • bootstrap.sh
  • scripts/deployment/deploy_bootstrap_step.sh
  • scripts/verification/validate.sh
  • src/enclave/cli.py
  • src/enclave/environment/__init__.py
  • src/enclave/environment/check_leftovers.py
  • src/enclave/environment/cleanup.sh
  • src/enclave/environment/cli.py
  • src/tests/test_check_leftovers.py

Comment thread bootstrap.sh
Comment thread src/enclave/environment/check_leftovers.py
Comment thread src/enclave/environment/cleanup.sh
Comment thread src/enclave/environment/cleanup.sh
Comment thread src/enclave/environment/cli.py Outdated
@rporres rporres changed the title from "Add pre-bootstrap leftover check via enclave environment subcommand" to "[WIP] Add pre-bootstrap leftover check via enclave environment subcommand" Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/enclave/environment/cli.py (1)

76-80: ⚠️ Potential issue | 🔴 Critical

Critical: Type safety violation in _read_working_dir_from_config() causes CLI crash on malformed YAML.

Line 77-79: yaml.safe_load(f) can return a list, scalar, or None—not always a dict. The code assumes a dict and calls .get(), which raises AttributeError if the YAML root is a list or scalar. The exception handler at line 80 only catches OSError and yaml.YAMLError, not AttributeError, so the CLI crashes unexpectedly.

Risk: If config/global.yaml contains valid YAML that isn't a mapping (e.g., [1, 2, 3] or a string), the cleanup command terminates with an unhandled exception instead of gracefully returning None.

Suggested fix
     try:
         with config_path.open(encoding="utf-8") as f:
-            config: dict[str, object] = yaml.safe_load(f) or {}
-        value = config.get("workingDir")
+            loaded = yaml.safe_load(f)
+        if not isinstance(loaded, dict):
+            return None
+        value = loaded.get("workingDir")
         return str(value) if value is not None else None
     except (OSError, yaml.YAMLError):
         return None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/cli.py` around lines 76 - 80, The
_read_working_dir_from_config() function assumes yaml.safe_load() always returns
a dict, but it can return a list, scalar, or None. Calling .get() on a non-dict
raises AttributeError, which is not caught by the current exception handler
(which only catches OSError and yaml.YAMLError). Fix this by adding a type check
to verify that the loaded config is a dict instance before calling .get() on it,
returning None if it is not a dict. Additionally, add AttributeError to the
except clause at line 80 to catch any attribute-related errors as a defensive
measure.
♻️ Duplicate comments (3)
src/enclave/environment/check_leftovers.py (1)

16-21: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major: _run() can hang indefinitely on blocked host commands.

At Line 17, subprocess.run(...) has no timeout. A stuck systemctl/podman call can stall CI/bootstrap and block safe failure handling.

Suggested fix
 def _run(cmd: list[str]) -> list[str]:
-    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
+    result = subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        check=False,
+        timeout=30,
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/check_leftovers.py` around lines 16 - 21, The _run()
function in src/enclave/environment/check_leftovers.py lacks a timeout parameter
when calling subprocess.run() at line 17, which can cause the function to hang
indefinitely if a host command (like systemctl or podman) gets stuck. Add a
timeout parameter to the subprocess.run() call to ensure commands are terminated
if they exceed a reasonable execution time, preventing indefinite hangs that
could block CI/bootstrap.
src/enclave/environment/cleanup.sh (2)

62-69: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: critical-path guard is bypassable with trailing slashes.

At Lines 62-69, the guard matches exact literals (/tmp, /etc, etc.), but values like /tmp/ or /etc/ bypass the case and proceed to rm -fr .../*, creating catastrophic deletion risk.

Suggested fix
-case "${workingDir}" in
+safe_working_dir="${workingDir%/}"
+[ -z "${safe_working_dir}" ] && safe_working_dir="/"
+
+case "${safe_working_dir}" in
+    .|..)
+        error "Refusing to wipe relative path: ${workingDir}"
+        exit 1
+        ;;
     /|/etc|/usr|/bin|/sbin|/lib|/lib64|/boot|/proc|/sys|/dev|/var|/home|/root|/tmp)
         error "Refusing to wipe critical system path: ${workingDir}"
         exit 1
         ;;
 esac
 shopt -s dotglob
-rm -fr "${workingDir:?}"/*
+rm -fr -- "${safe_working_dir:?}/"*
 shopt -u dotglob
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/cleanup.sh` around lines 62 - 69, The critical-path
guard in the case statement can be bypassed by appending trailing slashes to the
protected paths (e.g., `/tmp/` instead of `/tmp`). Before the case statement
that evaluates `workingDir`, normalize the variable by removing any trailing
slashes using bash parameter expansion (e.g., `${workingDir%/}`) so that paths
like `/tmp/`, `/etc/`, etc. are properly matched against the guard patterns and
rejected. This ensures the safety check cannot be circumvented through path
manipulation.

7-7: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Major: cleanup can report success after partial failure.

At Line 7, set -u alone does not fail on command errors. This can leave leftovers while the script still exits “success,” which is risky before bootstrap.

Suggested fix pattern
-set -u
+set -euo pipefail
+
+run_ignore() {
+  "$@" || warning "Ignoring expected cleanup failure: $*"
+}
-sudo podman pod rm -f metal3-ironic
+run_ignore sudo podman pod rm -f metal3-ironic

As per coding guidelines, shell scripts should use fail-fast settings (set -euo pipefail) to avoid silent failure paths.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/cleanup.sh` at line 7, The cleanup.sh script uses
only `set -u` for error handling, which does not cause the script to fail on
command errors and can result in partial cleanup while still reporting success.
Replace the `set -u` directive with `set -euo pipefail` to enable fail-fast
behavior that exits immediately on any command error, unset variable access, or
pipe failure. This ensures the cleanup script fails visibly and completely
rather than leaving leftovers behind.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/enclave/environment/check_leftovers.py`:
- Around line 18-21: Define a custom LeftoverCheckError exception class in
check_leftovers.py to replace the raw subprocess.CalledProcessError being raised
in the _run() helper function. Modify the _run() function to catch
subprocess.CalledProcessError internally and raise LeftoverCheckError instead,
providing appropriate error context. Update the exception handler in cli.py that
currently catches subprocess.CalledProcessError to catch the new
LeftoverCheckError type instead. Additionally, update all test assertions that
currently expect subprocess.CalledProcessError to expect LeftoverCheckError
instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: fe4add4e-dc64-4fa8-b114-90f8996be044

📥 Commits

Reviewing files that changed from the base of the PR and between c619766 and f3fbf4c.

📒 Files selected for processing (5)
  • bootstrap.sh
  • src/enclave/environment/check_leftovers.py
  • src/enclave/environment/cleanup.sh
  • src/enclave/environment/cli.py
  • src/tests/test_check_leftovers.py

Comment on lines +18 to +21
if result.returncode != 0:
    raise subprocess.CalledProcessError(
        result.returncode, cmd, result.stdout, result.stderr
    )


🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and examine its structure
fd -t f check_leftovers.py

Repository: rh-ecosystem-edge/enclave

Length of output: 146


🏁 Script executed:

# Look for the cli.py file and understand exception handling
fd -t f cli.py | grep environment

Repository: rh-ecosystem-edge/enclave

Length of output: 100


🏁 Script executed:

# Search for existing custom exception classes in the environment module
find . -path "*/environment/*.py" -type f | head -20

Repository: rh-ecosystem-edge/enclave

Length of output: 185


🏁 Script executed:

cat -n src/enclave/environment/check_leftovers.py

Repository: rh-ecosystem-edge/enclave

Length of output: 3144


🏁 Script executed:

cat -n src/enclave/environment/cli.py

Repository: rh-ecosystem-edge/enclave

Length of output: 3161


🏁 Script executed:

cat -n src/enclave/environment/__init__.py

Repository: rh-ecosystem-edge/enclave

Length of output: 51


🏁 Script executed:

# Check test file for exception handling
cat -n src/tests/test_check_leftovers.py

Repository: rh-ecosystem-edge/enclave

Length of output: 7394


🏁 Script executed:

# Search for other usages of CalledProcessError in relation to check_leftovers
rg "CalledProcessError" --type py

Repository: rh-ecosystem-edge/enclave

Length of output: 326


🏁 Script executed:

# Check if there are any other imports of check_leftovers in the codebase
rg "from enclave.environment.check_leftovers|import.*check_leftovers" --type py

Repository: rh-ecosystem-edge/enclave

Length of output: 261


Define custom exception for leftover check failures.

The _run() helper at lines 18-21 raises raw subprocess.CalledProcessError, leaking implementation details into the module's public contract. This couples callers (cli.py, tests) to subprocess internals and reduces clarity around failure modes.

Define a custom LeftoverCheckError exception (and specific subclasses if needed) in check_leftovers.py and raise that from _run(). Update the exception handler in cli.py to catch the domain-specific type instead. This aligns with the project's custom exception hierarchy requirement for Python code.

Note: Tests currently assert subprocess.CalledProcessError; they'll need updating to expect the new exception type instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/check_leftovers.py` around lines 18 - 21, Define a
custom LeftoverCheckError exception class in check_leftovers.py to replace the
raw subprocess.CalledProcessError being raised in the _run() helper function.
Modify the _run() function to catch subprocess.CalledProcessError internally and
raise LeftoverCheckError instead, providing appropriate error context. Update
the exception handler in cli.py that currently catches
subprocess.CalledProcessError to catch the new LeftoverCheckError type instead.
Additionally, update all test assertions that currently expect
subprocess.CalledProcessError to expect LeftoverCheckError instead.

Source: Coding guidelines

@rporres rporres force-pushed the pre-bootstrap-cleanup branch from f3fbf4c to a57296e Compare June 22, 2026 11:32
Introduces `enclave environment check-leftovers` and `enclave environment
cleanup` commands so customers can detect and remove Metal3/Quay residue
before re-running bootstrap. The logic lives in the Python package (included
in the customer tarball) rather than scripts/ (which is excluded).

- Add src/enclave/environment/ with check_leftovers.py, cli.py, cleanup.sh
- Register `enclave environment` subcommand in src/enclave/cli.py
- Add step_check_leftovers() to bootstrap.sh (advise-only, does not auto-clean)
- Extend shellcheck coverage to src/ in scripts/verification/validate.sh
- Add deploy-cluster-check-leftovers Makefile.ci target
- Run leftover check in e2e-deployment and disconnected-dry-run GH Actions
  workflows after installing Enclave Lab

Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
Assisted-by: Claude Code <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/disconnected-dry-run.yml:
- Around line 113-118: The "Check for leftover resources on Landing Zone" step
is missing a step ID, which causes failure attribution to be unclear when the
step fails. Add an id field to this step (for example, id: check-leftovers or
similar) so that downstream failure detection and reporting can properly
identify which phase failed, allowing better root cause analysis in Slack
notifications and summaries.

In @.github/workflows/e2e-deployment.yml:
- Around line 257-262: Add an `id` attribute to the "Check for leftover
resources on Landing Zone" step (currently missing an id) and also to the
corresponding step mentioned at lines 666-671. Use `check_leftovers` as the id
value for both steps. Then add `check_leftovers` to the respective job's
`step_names` and `step_order` lists to ensure the existing failure reporting
logic in the "Determine failed step" step can properly capture and surface any
failures from these leftover-check steps.

In `@src/enclave/environment/cleanup.sh`:
- Around line 62-69: The case statement guard at line 62 compares raw string
values of workingDir, which can be bypassed using path traversal tricks like
/tmp/.. or symlinks, allowing line 69 to delete unintended directories. Before
the case statement, canonicalize workingDir using realpath or readlink to
resolve all symlinks and relative path components, then also add an explicit
check to reject any paths containing .. sequences. Update the case statement
comparison to validate against the canonicalized path instead of the raw input
to ensure the deletion guard cannot be bypassed through path manipulation.
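
The cleanup.sh finding above describes an order of checks: reject `..` components, canonicalize the path, then compare the canonical result against the guard list. A minimal sketch of that order follows, written in Python purely for illustration, since cleanup.sh itself is a shell script. The `PROTECTED_DIRS` set and the `resolve_wipe_target` name are assumptions; the entries of the real case statement are not shown in this review.

```python
import os

# Assumed deny-list for illustration only; the real guard's entries are not shown.
PROTECTED_DIRS = {"/", "/home", "/root", "/tmp", "/usr", "/var", "/etc"}


def resolve_wipe_target(working_dir: str) -> str:
    """Return the canonical path that may be wiped, or raise ValueError."""
    if not working_dir:
        raise ValueError("working directory is empty")
    # Reject '..' components in the raw input before any resolution.
    if ".." in working_dir.split(os.sep):
        raise ValueError(f"path contains '..': {working_dir}")
    # realpath resolves symlinks and collapses redundant separators.
    resolved = os.path.realpath(working_dir)
    # Compare the canonical path, not the raw string, against the guard list.
    if resolved in PROTECTED_DIRS:
        raise ValueError(f"refusing to wipe protected directory: {resolved}")
    return resolved
```

The point of the ordering is that a symlink pointing at a protected directory is caught by the final comparison, which a raw string match would miss.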

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 35d0d940-c6c9-447b-93ee-33aeb05fbd72

📥 Commits

Reviewing files that changed from the base of the PR and between f3fbf4c and a57296e.

📒 Files selected for processing (12)
  • .github/workflows/disconnected-dry-run.yml
  • .github/workflows/e2e-deployment.yml
  • Makefile.ci
  • bootstrap.sh
  • scripts/deployment/deploy_bootstrap_step.sh
  • scripts/verification/validate.sh
  • src/enclave/cli.py
  • src/enclave/environment/__init__.py
  • src/enclave/environment/check_leftovers.py
  • src/enclave/environment/cleanup.sh
  • src/enclave/environment/cli.py
  • src/tests/test_check_leftovers.py

Comment thread .github/workflows/disconnected-dry-run.yml
Comment thread .github/workflows/e2e-deployment.yml
Comment thread src/enclave/environment/cleanup.sh Outdated
Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
@rporres rporres force-pushed the pre-bootstrap-cleanup branch from a57296e to 0d9b8f4 Compare June 22, 2026 11:56
rporres added 2 commits June 22, 2026 14:03
Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/enclave/environment/check_leftovers.py (2)

26-36: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Essential refactor: use a domain exception instead of raw CalledProcessError.

Line 26 and Line 35 expose subprocess internals as part of this module contract. Introduce a module-specific exception (for example, LeftoverCheckError) and catch that in CLI/tests.

As per coding guidelines, "Implement custom exception hierarchy with descriptive error messages in Python code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/check_leftovers.py` around lines 26 - 36, Define a
custom module-specific exception class (such as LeftoverCheckError) at the
module level in check_leftovers.py. In the _run function, wrap the
subprocess.CalledProcessError that is raised on line 26 by catching it and
raising the custom exception instead, preserving the error details. Update the
exception handling in the check_systemd function to catch the new custom
exception instead of subprocess.CalledProcessError, and ensure all other
locations in the codebase that interact with this module (CLI and tests) are
updated to handle the new exception type.

Source: Coding guidelines


16-17: ⚠️ Potential issue | 🟠 Major

Major: add a timeout to _run() to prevent bootstrap/preflight hangs.

Line 17 runs external commands (systemctl, podman) without timeout. If these commands block due to system issues, check-leftovers can stall the entire bootstrap or CI pipeline indefinitely. Established pattern in the codebase (see run_oc_command(), cleanup_cmd()) uses timeouts for subprocess calls.

Suggested patch
 def _run(cmd: list[str]) -> list[str]:
-    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
+    try:
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            check=False,
+            timeout=30,
+        )
+    except subprocess.TimeoutExpired as exc:
+        raise subprocess.CalledProcessError(
+            124, cmd, exc.stdout or "", exc.stderr or ""
+        ) from exc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/enclave/environment/check_leftovers.py` around lines 16 - 17, Add a
timeout parameter to the subprocess.run() call in the _run() function to prevent
external commands from blocking indefinitely. Review the established timeout
patterns used in similar functions like run_oc_command() and cleanup_cmd()
elsewhere in the codebase, then apply a reasonable timeout value to the
subprocess.run() call to ensure check-leftovers does not stall the bootstrap or
CI pipeline.
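
As an alternative to re-raising `CalledProcessError` with exit code 124, the timeout could surface as its own error type. The sketch below assumes a 30-second default, matching the suggested patch; `run_with_timeout` and `CommandTimedOut` are hypothetical names, not part of the PR.

```python
import subprocess

DEFAULT_TIMEOUT_SECONDS = 30.0  # assumed value, taken from the suggested patch


class CommandTimedOut(RuntimeError):
    """Hypothetical error for a probe command that exceeded its time budget."""


def run_with_timeout(
    cmd: list[str], timeout: float = DEFAULT_TIMEOUT_SECONDS
) -> list[str]:
    """Run cmd, returning stdout lines; raise CommandTimedOut if it hangs."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise CommandTimedOut(
            f"{cmd[0]} did not finish within {timeout}s"
        ) from exc
    return result.stdout.splitlines()
```

A separate type lets callers distinguish "the command hung" from "the command failed", which the exit-code-124 convention only encodes implicitly.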
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/enclave/environment/cleanup.sh`:
- Around line 63-66: The cleanup script currently treats a missing or invalid
working directory as a fatal error by calling error and exiting with status 1 in
the resolvedWorkingDir validation block. This should instead be treated as an
already-clean state since there is nothing to clean up. Modify the condition
check to exit successfully with status 0 (or remove the exit entirely and let
execution continue) when resolvedWorkingDir is empty or the directory does not
exist, so that repeat cleanup runs do not fail on absent directories.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: cab94113-c19c-4eae-8c36-192e754fbebd

📥 Commits

Reviewing files that changed from the base of the PR and between a57296e and 077b50f.

📒 Files selected for processing (12)
  • .github/workflows/disconnected-dry-run.yml
  • .github/workflows/e2e-deployment.yml
  • Makefile.ci
  • bootstrap.sh
  • scripts/deployment/deploy_bootstrap_step.sh
  • scripts/verification/validate.sh
  • src/enclave/cli.py
  • src/enclave/environment/__init__.py
  • src/enclave/environment/check_leftovers.py
  • src/enclave/environment/cleanup.sh
  • src/enclave/environment/cli.py
  • src/tests/test_check_leftovers.py

Comment on lines +63 to +66
if [ -z "${resolvedWorkingDir}" ] || [ ! -d "${resolvedWorkingDir}" ]; then
    error "Invalid WORKING_DIR: ${workingDir}"
    exit 1
fi

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major: missing working directory should be treated as already-clean, not fatal.

Lines 63-66 exit with a failure when the directory is absent. That turns repeat cleanup runs into false failures and can break automation even after the resources have been removed.

Suggested patch
 if [ -z "${resolvedWorkingDir}" ] || [ ! -d "${resolvedWorkingDir}" ]; then
-    error "Invalid WORKING_DIR: ${workingDir}"
-    exit 1
+    if [ -e "${resolvedWorkingDir}" ]; then
+        error "Invalid WORKING_DIR (not a directory): ${workingDir}"
+        exit 1
+    fi
+    warning "Working directory does not exist, skipping wipe: ${workingDir}"
+    info "Pre-bootstrap cleanup complete"
+    exit 0
 fi

rporres added 2 commits June 22, 2026 16:17
Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
@github-actions github-actions Bot added the infrastructure Infrastructure setup (VMs, networks) label Jun 22, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/infrastructure/generate_enclave_vars.sh`:
- Line 105: The CLUSTER_NAME variable (sourced from the ENCLAVE_CLUSTER_NAME
environment variable at line 26) is used directly in the workingDir path at line
105 without any validation, creating a path traversal vulnerability. Add
validation logic after the CLUSTER_NAME variable is set to reject any values
containing "../" sequences or absolute paths (starting with "/"), and fail
gracefully if invalid input is detected. This ensures the working directory
cannot be redirected outside the intended /home/cloud-user/sessions/ directory
tree.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ace271ff-03e9-4cb4-91b5-eba3a3cf880f

📥 Commits

Reviewing files that changed from the base of the PR and between 077b50f and ce3c41a.

📒 Files selected for processing (3)
  • scripts/infrastructure/generate_enclave_vars.sh
  • src/enclave/environment/check_leftovers.py
  • src/tests/test_check_leftovers.py

# Base Configuration
# ============================================================================
-workingDir: "/home/cloud-user"
+workingDir: "/home/cloud-user/sessions/${CLUSTER_NAME}"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Major: Path traversal risk in CLUSTER_NAME without validation.

CLUSTER_NAME (sourced from environment variable ENCLAVE_CLUSTER_NAME at line 26) is interpolated into the workingDir path without canonicalization or validation. An attacker-controlled value containing ../ sequences or an absolute path could redirect the working directory to an unintended filesystem location, potentially overwriting sensitive files or leaking configuration data outside the intended directory tree.

Security impact: Violates the coding guideline requirement to "canonicalize paths, reject ../".

🛡️ Proposed validation fix

Add validation after line 33 to reject path traversal attempts:

 CLUSTER_NAME="${ENCLAVE_CLUSTER_NAME:-enclave-test}"
+
+# Validate CLUSTER_NAME to prevent path traversal
+if [[ "$CLUSTER_NAME" == *".."* ]] || [[ "$CLUSTER_NAME" == /* ]]; then
+    error "CLUSTER_NAME must not contain '..' or absolute paths: $CLUSTER_NAME"
+    exit 1
+fi
+
+# Further restrict to safe characters (alphanumeric, dash, underscore)
+if [[ ! "$CLUSTER_NAME" =~ ^[a-zA-Z0-9_-]+$ ]]; then
+    error "CLUSTER_NAME contains invalid characters: $CLUSTER_NAME"
+    exit 1
+fi

Source: Coding guidelines
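
The same allow-list idea can be expressed in Python, shown here only to make the accepted and rejected inputs concrete; the script under review is shell. `SESSIONS_ROOT` mirrors the path in the diff, while `session_dir` and `NAME_PATTERN` are hypothetical names.

```python
import re
from pathlib import PurePosixPath

SESSIONS_ROOT = PurePosixPath("/home/cloud-user/sessions")
# Same character set as the proposed shell check: letters, digits, '_' and '-'.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def session_dir(cluster_name: str) -> PurePosixPath:
    """Build the per-cluster working directory, rejecting unsafe names."""
    # fullmatch requires the whole string to match, so '/', '..' and
    # whitespace (including a trailing newline) are all rejected.
    if not NAME_PATTERN.fullmatch(cluster_name):
        raise ValueError(f"invalid cluster name: {cluster_name!r}")
    return SESSIONS_ROOT / cluster_name
```

Because the allow-list excludes both `/` and `.`, a name that passes cannot contain a path separator or a `..` component, so the result always sits directly under `SESSIONS_ROOT`.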

Signed-off-by: Rafa Porres Molina <rporresm@redhat.com>
Labels

  • ci-cd: CI/CD infrastructure
  • deployment: Deployment-related changes
  • infrastructure: Infrastructure setup (VMs, networks)
  • validation: Validation and testing
