Add validation to detect persistent Metal3 from previous deployments by maorfr · Pull Request #483 · rh-ecosystem-edge/enclave

maorfr · 2026-06-14T06:08:51Z

Summary

Adds early validation to prevent confusing 401 Unauthorized errors when persistent Metal3 components from a previous deployment are still running.

Problem

The real root cause of GoRI issue #915 was not the htpasswd generation, but leftover Metal3 infrastructure from a previous deployment:

Previous deployment used metal3_persistent: true → creates systemd quadlet services running as root
Cleanup script only removed user-level podman resources
New deployment generates new credentials in user secret store
New user-level Ironic container fails to start (port already bound by root Ironic)
API calls hit the old root-level Ironic with old credentials → 401 Unauthorized

This caused hours of debugging because:

Password verification showed credentials were correct ✓
Container saw correct hash ✓
But authentication failed ✗ (hitting wrong Ironic instance)

Solution

Added validate_no_persistent_metal3.yaml that runs before Metal3 deployment and checks for:

Root-level Metal3 pods (sudo podman pod ps)
Metal3 systemd services (quadlets)
Quadlet unit files in /etc/containers/systemd/

Fails fast with clear error message including:

Explanation of the problem
List of what was detected
Complete cleanup commands (from Nick's solution)
Reference to the GitHub issue

Changes

New file: playbooks/tasks/validate_no_persistent_metal3.yaml
- Validation task that checks for persistent Metal3
- Comprehensive error message with cleanup instructions
Modified: playbooks/tasks/configure_hardware_ironic_setup.yaml
- Added validation as first step before deploying Metal3

Testing

Validation will detect the conditions that caused the issue on Cluster 10 and Cluster 17, failing early with actionable instructions instead of later with confusing 401 errors.

Related Issues

GoRI: #915
Root cause identified by @carbonin in this comment

Checklist

Validation runs before Metal3 deployment
Error message includes complete cleanup commands
References the GitHub issue for context
Commit includes Assisted-by trailer

Summary by CodeRabbit

Bug Fixes
- Improved deployment reliability by adding validation checks to detect and prevent conflicts from previous Metal3 installations. When legacy artifacts are detected, the deployment process now fails with clear error messages and specific cleanup instructions, ensuring successful redeployment.

coderabbitai · 2026-06-14T06:09:01Z

Warning

Review limit reached

@maorfr, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 56 minutes and 15 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65bfa3f1-5389-4eac-8cf6-3a4c372986a8

📥 Commits

Reviewing files that changed from the base of the PR and between ae6b936 and 2d82d32.

📒 Files selected for processing (2)

playbooks/tasks/configure_hardware_ironic_setup.yaml
playbooks/tasks/validate_no_persistent_metal3.yaml

Walkthrough

A new Ansible task file (validate_no_persistent_metal3.yaml) is added that checks for three categories of leftover Metal3 artifacts — running podman pods, loaded systemd services, and quadlet unit files — and fails with cleanup instructions if any are found. This task is included as a pre-deployment step in configure_hardware_ironic_setup.yaml.

Changes

Metal3 Persistent Artifact Guard

Layer / File(s)	Summary
Metal3 leftover detection and conditional failure `playbooks/tasks/validate_no_persistent_metal3.yaml`, `playbooks/tasks/configure_hardware_ironic_setup.yaml`	New task file performs three host-state checks (podman pod, systemd units, quadlet files) and conditionally fails with diagnostic output and enumerated cleanup commands. The task file is included as a pre-deployment guard in the Ironic setup playbook before Metal3 resources are deployed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🛡️ Before the Metal3 pod dares to rise,
A sentinel scans with security eyes —
Old quadlets, dead pods, and services wide,
Must all be purged before you proceed inside.
Clean slate or fail fast — no ghosts shall hide! 👻

🚥 Pre-merge checks | ✅ 11

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and accurately summarizes the main change: adding validation to detect persistent Metal3 artifacts from previous deployments, which is precisely what the PR implements.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
No-Hardcoded-Secrets	✅ Passed	No hardcoded secrets detected in the PR. Files contain only task definitions, service references, and cleanup instructions without exposing actual credentials, API keys, tokens, passwords, or base6...
No-Weak-Crypto	✅ Passed	PR introduces only infrastructure validation tasks with no cryptographic operations, weak algorithms, custom crypto, or insecure secret comparisons.
No-Injection-Vectors	✅ Passed	No injection vectors detected. Code uses hardcoded values in commands/paths, safe Ansible modules (command, systemd, stat), and only displays registered outputs in error messages without executing...
Container-Privileges	✅ Passed	PR adds Ansible validation task using 'become: true' to detect (not deploy) privileged Metal3 containers. No privileged container settings, hostPID/Network/IPC, SYS_ADMIN, or allowPrivilegeEscalati...
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data (passwords, tokens, API keys, PII, session IDs, kubeconfigs) is logged or exposed. Only pod/service/file names and static cleanup instructions with secret resource names are output.
Ai-Attribution	✅ Passed	AI tool usage (Claude Code) is properly attributed with "Assisted-by" trailer in correct Red Hat format. No problematic Co-Authored-By trailers found.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/validate-no-persistent-metal3

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Prevents confusing 401 Unauthorized errors when root-level Metal3 components from a previous deployment (using metal3_persistent: true) are still running. The issue occurs when: 1. Previous deployment created systemd quadlet services running as root 2. Cleanup script only removed user-level podman resources 3. New deployment generates new credentials in user secret store 4. User-level Ironic fails to start (port already bound by root Ironic) 5. API calls hit the old root-level Ironic with old credentials → 401 error This validation runs before Metal3 deployment and fails with clear instructions if persistent components are detected: - Root-level Metal3 pods - Metal3 systemd services - Quadlet unit files in /etc/containers/systemd/ The error message includes complete cleanup commands from Nick Carboni's solution that properly removes root-level Metal3 stack. Fixes: gori-project/GoRI#915 Assisted-by: Claude Code <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@playbooks/tasks/configure_hardware_ironic_setup.yaml`:
- Around line 4-5: The task that includes `validate_no_persistent_metal3.yaml`
executes unconditionally and breaks re-runnable deployments in persistent mode
where Metal3 root-level artifacts are legitimately expected to exist. Add a
conditional guard (when clause) to the include_tasks directive for the Validate
no persistent Metal3 from previous deployment task so it only executes when
appropriate for the deployment scenario (e.g., when not in persistent mode),
ensuring the playbook remains idempotent and can be safely re-executed without
failing on expected artifacts.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9e720788-b238-4102-915c-90e8636506c5

📥 Commits

Reviewing files that changed from the base of the PR and between b078c63 and ae6b936.

📒 Files selected for processing (2)

playbooks/tasks/configure_hardware_ironic_setup.yaml
playbooks/tasks/validate_no_persistent_metal3.yaml

coderabbitai · 2026-06-14T06:12:31Z

+- name: Validate no persistent Metal3 from previous deployment
+  ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Unconditional guard can break re-runnable deployments in persistent mode (major risk).

This include always executes and can fail valid reruns where Metal3 root-level artifacts are expected (for example persistent flows), causing a deterministic pre-deploy block.

Suggested fix

- name: Validate no persistent Metal3 from previous deployment ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml + when: not (metal3_persistent | default(false) | bool)

As per coding guidelines, “Ansible tasks must be idempotent and re-runnable.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@playbooks/tasks/configure_hardware_ironic_setup.yaml` around lines 4 - 5, The task that includes `validate_no_persistent_metal3.yaml` executes unconditionally and breaks re-runnable deployments in persistent mode where Metal3 root-level artifacts are legitimately expected to exist. Add a conditional guard (when clause) to the include_tasks directive for the Validate no persistent Metal3 from previous deployment task so it only executes when appropriate for the deployment scenario (e.g., when not in persistent mode), ensuring the playbook remains idempotent and can be safely re-executed without failing on expected artifacts.

Source: Coding guidelines

rporres · 2026-06-15T08:33:49Z

I would make this more general and include previous registry installations in the LZ which also need to be cleaned-up. Something like using what we already have in RHDP env, the cleanup script.

FWIW, this alone is not going to solve the current issues there as I suspect the metal3 containers are restored by a different method there.

carbonin · 2026-06-15T13:25:10Z

Instead of a validation could we implement some kind of forced clean slate variable that would actually do the wipe before we run?

On the same topic what are we expecting if we rerun bootstrap generally? Will the cluster redeploy or do we check if it's installed already? Do we fully re-mirror or do we check what's already present (maybe this is just a feature of oc mirror)?

I ask because we should be consistent. If we're expecting a rerun of bootstrap to redeploy everything then we should be cleaning out these pods ourselves unconditionally. If we want a rerun to do more of a "desired state" thing, then we definitely shouldn't clean anything out if it already exists.

Does that make sense?

rporres · 2026-06-15T15:08:56Z

closing this, it will be taken care differently in #496 (still a WIP, though)

rporres · 2026-06-15T15:10:57Z

@carbonin The idea is to check if a wipe is needed in bootstrap, using the same philosophy as the rest of the validations, asking the user to perform the change before actually doing it.

github-actions Bot added infrastructure Infrastructure setup (VMs, networks) validation Validation and testing labels Jun 14, 2026

maorfr force-pushed the fix/validate-no-persistent-metal3 branch from ae6b936 to 2d82d32 Compare June 14, 2026 06:11

coderabbitai Bot requested changes Jun 14, 2026

View reviewed changes

rporres closed this Jun 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add validation to detect persistent Metal3 from previous deployments#483

Add validation to detect persistent Metal3 from previous deployments#483
maorfr wants to merge 1 commit into
mainfrom
fix/validate-no-persistent-metal3

maorfr commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

Review limit reached

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Uh oh!

rporres commented Jun 15, 2026 •

edited

Loading

Uh oh!

carbonin commented Jun 15, 2026 •

edited

Loading

Uh oh!

rporres commented Jun 15, 2026

Uh oh!

rporres commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		- name: Validate no persistent Metal3 from previous deployment
		ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml

Uh oh!

Conversation

maorfr commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Testing

Related Issues

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

rporres commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carbonin commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rporres commented Jun 15, 2026

Uh oh!

rporres commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

maorfr commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading

rporres commented Jun 15, 2026 •

edited

Loading

carbonin commented Jun 15, 2026 •

edited

Loading