Skip to content

Add validation to detect persistent Metal3 from previous deployments#483

Closed
maorfr wants to merge 1 commit into
mainfrom
fix/validate-no-persistent-metal3
Closed

Add validation to detect persistent Metal3 from previous deployments#483
maorfr wants to merge 1 commit into
mainfrom
fix/validate-no-persistent-metal3

Conversation

@maorfr

@maorfr maorfr commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds early validation to prevent confusing 401 Unauthorized errors when persistent Metal3 components from a previous deployment are still running.

Problem

The real root cause of GoRI issue #915 was not the htpasswd generation, but leftover Metal3 infrastructure from a previous deployment:

  1. Previous deployment used metal3_persistent: true → creates systemd quadlet services running as root
  2. Cleanup script only removed user-level podman resources
  3. New deployment generates new credentials in user secret store
  4. New user-level Ironic container fails to start (port already bound by root Ironic)
  5. API calls hit the old root-level Ironic with old credentials → 401 Unauthorized

This caused hours of debugging because:

  • Password verification showed credentials were correct ✓
  • Container saw correct hash ✓
  • But authentication failed ✗ (hitting wrong Ironic instance)

Solution

Added validate_no_persistent_metal3.yaml that runs before Metal3 deployment and checks for:

  • Root-level Metal3 pods (sudo podman pod ps)
  • Metal3 systemd services (quadlets)
  • Quadlet unit files in /etc/containers/systemd/

Fails fast with clear error message including:

  • Explanation of the problem
  • List of what was detected
  • Complete cleanup commands (from Nick's solution)
  • Reference to the GitHub issue

Changes

  1. New file: playbooks/tasks/validate_no_persistent_metal3.yaml

    • Validation task that checks for persistent Metal3
    • Comprehensive error message with cleanup instructions
  2. Modified: playbooks/tasks/configure_hardware_ironic_setup.yaml

    • Added validation as first step before deploying Metal3

Testing

Validation will detect the conditions that caused the issue on Cluster 10 and Cluster 17, failing early with actionable instructions instead of later with confusing 401 errors.

Related Issues

Checklist

  • Validation runs before Metal3 deployment
  • Error message includes complete cleanup commands
  • References the GitHub issue for context
  • Commit includes Assisted-by trailer

Summary by CodeRabbit

  • Bug Fixes
    • Improved deployment reliability by adding validation checks to detect and prevent conflicts from previous Metal3 installations. When legacy artifacts are detected, the deployment process now fails with clear error messages and specific cleanup instructions, ensuring successful redeployment.

@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@maorfr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 15 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65bfa3f1-5389-4eac-8cf6-3a4c372986a8

📥 Commits

Reviewing files that changed from the base of the PR and between ae6b936 and 2d82d32.

📒 Files selected for processing (2)
  • playbooks/tasks/configure_hardware_ironic_setup.yaml
  • playbooks/tasks/validate_no_persistent_metal3.yaml

Walkthrough

A new Ansible task file (validate_no_persistent_metal3.yaml) is added that checks for three categories of leftover Metal3 artifacts — running podman pods, loaded systemd services, and quadlet unit files — and fails with cleanup instructions if any are found. This task is included as a pre-deployment step in configure_hardware_ironic_setup.yaml.

Changes

Metal3 Persistent Artifact Guard

Layer / File(s) Summary
Metal3 leftover detection and conditional failure
playbooks/tasks/validate_no_persistent_metal3.yaml, playbooks/tasks/configure_hardware_ironic_setup.yaml
New task file performs three host-state checks (podman pod, systemd units, quadlet files) and conditionally fails with diagnostic output and enumerated cleanup commands. The task file is included as a pre-deployment guard in the Ironic setup playbook before Metal3 resources are deployed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🛡️ Before the Metal3 pod dares to rise,
A sentinel scans with security eyes —
Old quadlets, dead pods, and services wide,
Must all be purged before you proceed inside.
Clean slate or fail fast — no ghosts shall hide! 👻

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately summarizes the main change: adding validation to detect persistent Metal3 artifacts from previous deployments, which is precisely what the PR implements.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets ✅ Passed No hardcoded secrets detected in the PR. Files contain only task definitions, service references, and cleanup instructions without exposing actual credentials, API keys, tokens, passwords, or base6...
No-Weak-Crypto ✅ Passed PR introduces only infrastructure validation tasks with no cryptographic operations, weak algorithms, custom crypto, or insecure secret comparisons.
No-Injection-Vectors ✅ Passed No injection vectors detected. Code uses hardcoded values in commands/paths, safe Ansible modules (command, systemd, stat), and only displays registered outputs in error messages without executing...
Container-Privileges ✅ Passed PR adds Ansible validation task using 'become: true' to detect (not deploy) privileged Metal3 containers. No privileged container settings, hostPID/Network/IPC, SYS_ADMIN, or allowPrivilegeEscalati...
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, PII, session IDs, kubeconfigs) is logged or exposed. Only pod/service/file names and static cleanup instructions with secret resource names are output.
Ai-Attribution ✅ Passed AI tool usage (Claude Code) is properly attributed with "Assisted-by" trailer in correct Red Hat format. No problematic Co-Authored-By trailers found.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/validate-no-persistent-metal3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added infrastructure Infrastructure setup (VMs, networks) validation Validation and testing labels Jun 14, 2026
Prevents confusing 401 Unauthorized errors when root-level Metal3 components
from a previous deployment (using metal3_persistent: true) are still running.

The issue occurs when:
1. Previous deployment created systemd quadlet services running as root
2. Cleanup script only removed user-level podman resources
3. New deployment generates new credentials in user secret store
4. User-level Ironic fails to start (port already bound by root Ironic)
5. API calls hit the old root-level Ironic with old credentials → 401 error

This validation runs before Metal3 deployment and fails with clear
instructions if persistent components are detected:
- Root-level Metal3 pods
- Metal3 systemd services
- Quadlet unit files in /etc/containers/systemd/

The error message includes complete cleanup commands from Nick Carboni's
solution that properly removes root-level Metal3 stack.

Fixes: gori-project/GoRI#915

Assisted-by: Claude Code <noreply@anthropic.com>
@maorfr maorfr force-pushed the fix/validate-no-persistent-metal3 branch from ae6b936 to 2d82d32 Compare June 14, 2026 06:11

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@playbooks/tasks/configure_hardware_ironic_setup.yaml`:
- Around line 4-5: The task that includes `validate_no_persistent_metal3.yaml`
executes unconditionally and breaks re-runnable deployments in persistent mode
where Metal3 root-level artifacts are legitimately expected to exist. Add a
conditional guard (when clause) to the include_tasks directive for the Validate
no persistent Metal3 from previous deployment task so it only executes when
appropriate for the deployment scenario (e.g., when not in persistent mode),
ensuring the playbook remains idempotent and can be safely re-executed without
failing on expected artifacts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9e720788-b238-4102-915c-90e8636506c5

📥 Commits

Reviewing files that changed from the base of the PR and between b078c63 and ae6b936.

📒 Files selected for processing (2)
  • playbooks/tasks/configure_hardware_ironic_setup.yaml
  • playbooks/tasks/validate_no_persistent_metal3.yaml

Comment on lines +4 to +5
- name: Validate no persistent Metal3 from previous deployment
ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unconditional guard can break re-runnable deployments in persistent mode (major risk).

This include always executes and can fail valid reruns where Metal3 root-level artifacts are expected (for example persistent flows), causing a deterministic pre-deploy block.

Suggested fix
 - name: Validate no persistent Metal3 from previous deployment
   ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml
+  when: not (metal3_persistent | default(false) | bool)

As per coding guidelines, “Ansible tasks must be idempotent and re-runnable.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@playbooks/tasks/configure_hardware_ironic_setup.yaml` around lines 4 - 5, The
task that includes `validate_no_persistent_metal3.yaml` executes unconditionally
and breaks re-runnable deployments in persistent mode where Metal3 root-level
artifacts are legitimately expected to exist. Add a conditional guard (when
clause) to the include_tasks directive for the Validate no persistent Metal3
from previous deployment task so it only executes when appropriate for the
deployment scenario (e.g., when not in persistent mode), ensuring the playbook
remains idempotent and can be safely re-executed without failing on expected
artifacts.

Source: Coding guidelines

@rporres

rporres commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

I would make this more general and include previous registry installations in the LZ which also need to be cleaned-up. Something like using what we already have in RHDP env, the cleanup script.

FWIW, this alone is not going to solve the current issues there as I suspect the metal3 containers are restored by a different method there.

@carbonin

carbonin commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Instead of a validation could we implement some kind of forced clean slate variable that would actually do the wipe before we run?

On the same topic what are we expecting if we rerun bootstrap generally? Will the cluster redeploy or do we check if it's installed already? Do we fully re-mirror or do we check what's already present (maybe this is just a feature of oc mirror)?

I ask because we should be consistent. If we're expecting a rerun of bootstrap to redeploy everything then we should be cleaning out these pods ourselves unconditionally. If we want a rerun to do more of a "desired state" thing, then we definitely shouldn't clean anything out if it already exists.

Does that make sense?

@rporres

rporres commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

closing this, it will be taken care differently in #496 (still a WIP, though)

@rporres rporres closed this Jun 15, 2026
@rporres

rporres commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@carbonin The idea is to check if a wipe is needed in bootstrap, using the same philosophy as the rest of the validations, asking the user to perform the change before actually doing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure Infrastructure setup (VMs, networks) validation Validation and testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants