Add validation to detect persistent Metal3 from previous deployments #483
maorfr wants to merge 1 commit into
Walkthrough
A new Ansible task file (…
Changes: Metal3 Persistent Artifact Guard
Estimated code review effort: 2 (Simple) | ~10 minutes
Prevents confusing 401 Unauthorized errors when root-level Metal3 components from a previous deployment (using metal3_persistent: true) are still running.

The issue occurs when:
1. Previous deployment created systemd quadlet services running as root
2. Cleanup script only removed user-level podman resources
3. New deployment generates new credentials in user secret store
4. User-level Ironic fails to start (port already bound by root Ironic)
5. API calls hit the old root-level Ironic with old credentials → 401 error

This validation runs before Metal3 deployment and fails with clear instructions if persistent components are detected:
- Root-level Metal3 pods
- Metal3 systemd services
- Quadlet unit files in /etc/containers/systemd/

The error message includes complete cleanup commands from Nick Carboni's solution that properly removes root-level Metal3 stack.

Fixes: gori-project/GoRI#915
Assisted-by: Claude Code <noreply@anthropic.com>
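The three checks listed in the commit message could look roughly like the tasks below. This is a minimal sketch, not the contents of `validate_no_persistent_metal3.yaml`: the `metal3` name filter, the registered variable names, and the wording of the failure message are all assumptions made for illustration.

```yaml
# Hypothetical sketch of the pre-deployment checks.
# Assumption: leftover pods, units, and quadlet files all contain "metal3" in their names.
- name: Look for root-level pods whose name contains "metal3"
  ansible.builtin.command: podman pod ps --quiet --filter name=metal3
  become: true
  register: root_metal3_pods
  changed_when: false

- name: Look for systemd units whose name contains "metal3"
  ansible.builtin.command: systemctl list-units --all --no-legend --plain "*metal3*"
  become: true
  register: metal3_units
  changed_when: false

- name: Look for quadlet unit files in /etc/containers/systemd/
  ansible.builtin.find:
    paths: /etc/containers/systemd/
    patterns: "*metal3*"
  become: true
  register: metal3_quadlets

- name: Fail early if any leftover artifact was found
  ansible.builtin.fail:
    msg: >-
      Root-level Metal3 components from a previous deployment were detected.
      Remove them before re-running the deployment.
  when: >-
    root_metal3_pods.stdout | trim | length > 0
    or metal3_units.stdout | trim | length > 0
    or metal3_quadlets.matched > 0
```

The detection tasks only read state and set `changed_when: false`, so they report no changes on a clean host and are safe to re-run.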
ae6b936 to 2d82d32
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@playbooks/tasks/configure_hardware_ironic_setup.yaml`:
- Around line 4-5: The task that includes `validate_no_persistent_metal3.yaml`
executes unconditionally and breaks re-runnable deployments in persistent mode
where Metal3 root-level artifacts are legitimately expected to exist. Add a
conditional guard (when clause) to the include_tasks directive for the Validate
no persistent Metal3 from previous deployment task so it only executes when
appropriate for the deployment scenario (e.g., when not in persistent mode),
ensuring the playbook remains idempotent and can be safely re-executed without
failing on expected artifacts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9e720788-b238-4102-915c-90e8636506c5
📒 Files selected for processing (2)
- playbooks/tasks/configure_hardware_ironic_setup.yaml
- playbooks/tasks/validate_no_persistent_metal3.yaml
- name: Validate no persistent Metal3 from previous deployment
  ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml
Unconditional guard can break re-runnable deployments in persistent mode (major risk).
This include always executes and can fail valid reruns where Metal3 root-level artifacts are expected (for example persistent flows), causing a deterministic pre-deploy block.
Suggested fix:
 - name: Validate no persistent Metal3 from previous deployment
   ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml
+  when: not (metal3_persistent | default(false) | bool)

As per coding guidelines, “Ansible tasks must be idempotent and re-runnable.”
Source: Coding guidelines
I would make this more general and include previous registry installations in the LZ, which also need to be cleaned up. Something like using what we already have in the RHDP env: the cleanup script. FWIW, this alone is not going to solve the current issues there, as I suspect the metal3 containers are restored by a different method there.
Instead of a validation, could we implement some kind of forced clean slate variable that would actually do the wipe before we run? On the same topic, what are we expecting if we rerun bootstrap generally? Will the cluster redeploy, or do we check if it's installed already? Do we fully re-mirror, or do we check what's already present (maybe this is just a feature of …)?

I ask because we should be consistent. If we're expecting a rerun of bootstrap to redeploy everything, then we should be cleaning out these pods ourselves unconditionally. If we want a rerun to do more of a "desired state" thing, then we definitely shouldn't clean anything out if it already exists. Does that make sense?
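The "forced clean slate variable" idea could be wired up along these lines. This is only a sketch of the proposal: `metal3_force_clean` and `cleanup_persistent_metal3.yaml` are hypothetical names that do not appear in this PR, and the cleanup tasks themselves are left out.

```yaml
# Hypothetical: metal3_force_clean and cleanup_persistent_metal3.yaml do not exist in the PR.
- name: Wipe leftover Metal3 artifacts when a clean slate is requested
  ansible.builtin.include_tasks: cleanup_persistent_metal3.yaml
  when: metal3_force_clean | default(false) | bool

- name: Otherwise validate and ask the user to clean up
  ansible.builtin.include_tasks: validate_no_persistent_metal3.yaml
  when: not (metal3_force_clean | default(false) | bool)
```

With the variable defaulting to false, a rerun never deletes anything unless the user opts in, which matches the "desired state" reading; setting it to true gives the "redeploy everything" behaviour.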
Closing this; it will be taken care of differently in #496 (still a WIP, though).
@carbonin The idea is to check if a wipe is needed in bootstrap, using the same philosophy as the rest of the validations, asking the user to perform the change before actually doing it. |
Summary
Adds early validation to prevent confusing 401 Unauthorized errors when persistent Metal3 components from a previous deployment are still running.
Problem
The real root cause of GoRI issue #915 was not the htpasswd generation, but leftover Metal3 infrastructure from a previous deployment:
- `metal3_persistent: true` → creates systemd quadlet services running as root
This caused hours of debugging because:
Solution
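Added `validate_no_persistent_metal3.yaml` that runs before Metal3 deployment and checks for:
- Root-level Metal3 pods (`sudo podman pod ps`)
- Metal3 systemd services
- Quadlet unit files in `/etc/containers/systemd/`
Fails fast with a clear error message that includes the cleanup commands.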
Changes
New file: `playbooks/tasks/validate_no_persistent_metal3.yaml`
Modified: `playbooks/tasks/configure_hardware_ironic_setup.yaml`
Testing
Validation will detect the conditions that caused the issue on Cluster 10 and Cluster 17, failing early with actionable instructions instead of later with confusing 401 errors.