fix(dashmate): prevent orphaned verification container blocking SSL renewal #3162

Open
ktechmidas wants to merge 1 commit into v3.1-dev from fix/zerossl-renewal-cleanup

ktechmidas (Collaborator) commented Feb 26, 2026

Issue being fixed or feature implemented

When ZeroSSL experiences an API issue during the certificate renewal window, the verification server container (bound to port 80) is left running indefinitely. This blocks all subsequent renewal attempts, causing certificates to expire. In December this led to widespread certificate expiration across many mainnet nodes simultaneously.

Root cause: the "Stop verification server" step is a regular Listr task at the end of the pipeline — if any earlier step throws, Listr aborts and the cleanup never runs. Compounding this, the ZeroSSL scheduler had no try/catch (unlike the Let's Encrypt scheduler), so errors propagated unhandled with no retry backoff, and the helper process had no graceful shutdown or startup cleanup.
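
The failure mode can be reproduced without Listr. This is a minimal sketch, assuming a simplified sequential runner that stops at the first throwing step; `runSequentially`, `state`, and the step bodies are hypothetical stand-ins, not dashmate code.

```javascript
// Hypothetical stand-in for a sequential task runner: it stops at the first
// step that throws, so later steps never execute.
async function runSequentially(steps) {
  for (const step of steps) {
    await step.task();
  }
}

// Tracks whether the simulated verification server still holds port 80.
const state = { serverRunning: false };

const steps = [
  {
    title: 'Start verification server',
    task: async () => { state.serverRunning = true; },
  },
  {
    title: 'Verify domain',
    task: async () => { throw new Error('ZeroSSL API unavailable'); },
  },
  {
    // Cleanup written as an ordinary last step: skipped after a throw.
    title: 'Stop verification server',
    task: async () => { state.serverRunning = false; },
  },
];

const demo = runSequentially(steps).catch((e) => {
  console.log(`pipeline failed: ${e.message}; server still running: ${state.serverRunning}`);
});
```

Because the cleanup is just another step, the simulated server is still marked as running after the failure, which mirrors the stuck port 80 described above.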

What was done?

Three layered fixes to ensure port 80 is never left stuck:

  1. obtainZeroSSLCertificateTaskFactory.js — Wrapped the Listr run() method with a catch block that ensures verificationServer.stop() and destroy() are called on any failure. The existing "Stop verification server" task still handles the happy path; the wrapper only fires on error.

  2. scheduleRenewZeroSslCertificateFactory.js — Added try/catch with 1-hour retry backoff, mirroring the existing Let's Encrypt scheduler pattern. Previously, unhandled errors could crash silently or cause tight failure loops. Now logs success/failure and waits before retrying.

  3. scripts/helper.js — Two additions:

    • Startup cleanup: On boot, force-removes any orphaned dashmate-zerossl-validation or dashmate-letsencrypt-lego containers from previous failed runs (the key recovery mechanism).
    • Graceful shutdown: Registers signal handlers via node-graceful (same library used by BaseCommand) to clean up tracked containers when the helper exits.
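
The first fix follows a catch, clean up, rethrow pattern. The sketch below illustrates that pattern only: `runWithCleanup` and `createStubServer` are hypothetical names, and swallowing cleanup errors so they cannot mask the original failure is an assumption rather than something the PR text states.

```javascript
// Hypothetical helper: run `runTasks`; if it rejects, stop and destroy the
// verification server, then rethrow the original error.
async function runWithCleanup(runTasks, verificationServer) {
  try {
    return await runTasks();
  } catch (e) {
    try {
      await verificationServer.stop();
      await verificationServer.destroy();
    } catch (cleanupError) {
      // Assumption: cleanup problems are logged but do not replace `e`.
      console.warn(`cleanup failed: ${cleanupError.message}`);
    }
    throw e;
  }
}

// Stub server used only for this illustration; it records which methods ran.
function createStubServer() {
  const calls = [];
  return {
    calls,
    stop: async () => { calls.push('stop'); },
    destroy: async () => { calls.push('destroy'); },
  };
}
```

On the happy path the wrapper does nothing extra, so the existing "Stop verification server" task remains responsible for normal shutdown.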

How Has This Been Tested?

  • Code review tracing all error paths through the renewal pipeline
  • Verified verificationServer.stop() is idempotent (returns early if no container)
  • Verified node-graceful is already a dashmate dependency
  • Verified docker, stopAllContainers, and startedContainers are available in the DI container
  • Confirmed the Let's Encrypt scheduler already uses the same try/catch + backoff pattern being added here
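
Idempotency matters here because cleanup may be requested when no container was ever started, or after one has already been stopped. A minimal sketch of an early-return `stop()`, assuming a hypothetical `StubVerificationServer` class that is not the real dashmate implementation:

```javascript
// Hypothetical stub: `container` is null whenever nothing is running.
class StubVerificationServer {
  constructor() {
    this.container = null;
    this.stopCount = 0; // counts only stops that actually did work
  }

  async start() {
    this.container = { id: 'stub-container' };
  }

  async stop() {
    if (!this.container) {
      return; // nothing to stop, so repeated calls are harmless
    }
    this.stopCount += 1;
    this.container = null;
  }
}
```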

Breaking Changes

None. All changes are additive error handling and cleanup logic.

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section if my code contains any
  • I have made corresponding changes to the documentation if needed

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • SSL certificate renewal now automatically retries failed operations after a 1-hour delay
    • Improved cleanup of ephemeral containers on system shutdown and SSL renewal operations
    • Enhanced error handling for SSL certificate verification with safe resource cleanup on failure

…enewal

When ZeroSSL certificate renewal fails mid-pipeline (e.g. during domain
verification or certificate download), the verification server container
bound to port 80 is never cleaned up. This blocks all subsequent renewal
attempts, causing certificates to expire across many nodes simultaneously
if ZeroSSL experiences an API issue during the renewal window.

Three fixes:

1. Wrap the obtain task's run() to ensure the verification server
   container is always stopped on failure, not just on success.

2. Add try/catch with 1-hour retry backoff to the ZeroSSL scheduler,
   matching the existing Let's Encrypt scheduler pattern. Previously,
   unhandled errors could crash silently or cause tight failure loops.

3. Add graceful shutdown and startup orphan cleanup to the helper
   process. On boot, any leftover verification containers from previous
   failed runs are force-removed before scheduling renewals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot added this to the v3.1.0 milestone Feb 26, 2026

coderabbitai bot (Contributor) commented Feb 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91eddae and c614762.

📒 Files selected for processing (3)
  • packages/dashmate/scripts/helper.js
  • packages/dashmate/src/helper/scheduleRenewZeroSslCertificateFactory.js
  • packages/dashmate/src/listr/tasks/ssl/zerossl/obtainZeroSSLCertificateTaskFactory.js

📝 Walkthrough

These changes enhance the dashmate helper script with graceful shutdown handling, improved error recovery for SSL certificate renewal, and cleanup mechanisms for verification servers. The updates implement try-catch error handling for renewal operations, conditional retry logic based on renewal success, and automatic cleanup of orphaned containers and resources on exit.
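
The conditional retry logic can be sketched as follows. This is an illustration under assumptions, not the scheduler's actual code: `renewAndReschedule`, `renew`, `scheduleNext`, and the injectable `timers` object are hypothetical, introduced so the sketch runs without a real CronJob or ZeroSSL call. The 1-hour delay comes from the PR description.

```javascript
const RETRY_DELAY_MS = 60 * 60 * 1000; // 1 hour, per the PR description

// Thin wrappers so tests can substitute fake timers.
const defaultTimers = {
  nextTick: (fn) => process.nextTick(fn),
  setTimeout: (fn, ms) => setTimeout(fn, ms),
};

// Hypothetical scheduler step: attempt a renewal, never let the error escape,
// then decide how soon to schedule the next check.
async function renewAndReschedule(renew, scheduleNext, timers = defaultTimers) {
  let renewalSucceeded = false;

  try {
    await renew();
    renewalSucceeded = true;
    console.log('ZeroSSL certificate renewed');
  } catch (e) {
    console.error(`ZeroSSL renewal failed: ${e.message}`);
  }

  if (renewalSucceeded) {
    // Success: re-check right away so the next expiry gets scheduled.
    timers.nextTick(scheduleNext);
  } else {
    // Failure: back off instead of looping tightly.
    timers.setTimeout(scheduleNext, RETRY_DELAY_MS);
  }

  return renewalSucceeded;
}
```

Keeping the rescheduling decision outside the `try` block means a failure inside `renew` can neither skip rescheduling nor trigger an immediate retry.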

Changes

| Cohort / File(s) | Summary |
|---|---|
| Graceful Shutdown & Container Cleanup: packages/dashmate/scripts/helper.js | Adds graceful shutdown handler that stops and removes started containers on exit. Attempts to clean up orphaned SSL-related containers (dashmate-zerossl-validation, dashmate-letsencrypt-lego) on startup, with error logging and 404 error suppression. |
| SSL Renewal Error Handling & Retry Logic: packages/dashmate/src/helper/scheduleRenewZeroSslCertificateFactory.js | Wraps certificate renewal in try/catch block with renewalSucceeded flag. Implements conditional rescheduling: immediate re-check on success via process.nextTick, 1-hour retry delay on failure via setTimeout. Removes previous inline rescheduling logic. |
| Verification Server Cleanup on Failure: packages/dashmate/src/listr/tasks/ssl/zerossl/obtainZeroSSLCertificateTaskFactory.js | Refactors Listr construction into named const and wraps tasks.run() execution with cleanup handler that stops and destroys verification server on any failure, ensuring resource cleanup before rethrowing errors. |
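
The startup cleanup with 404 suppression described for helper.js might look roughly like this. It is a sketch under assumptions: the fake client mimics a dockerode-style `getContainer(name).remove({ force: true })` call and a `statusCode` field on errors, and whether helper.js uses exactly these calls is not confirmed by the PR text. `removeOrphanedContainers` and `createFakeDocker` are hypothetical names.

```javascript
// Container names taken from the PR description.
const ORPHAN_CANDIDATES = [
  'dashmate-zerossl-validation',
  'dashmate-letsencrypt-lego',
];

// Assumption: `docker` exposes a dockerode-style getContainer(name).remove(opts).
async function removeOrphanedContainers(docker, names = ORPHAN_CANDIDATES) {
  const removed = [];
  for (const name of names) {
    try {
      await docker.getContainer(name).remove({ force: true });
      removed.push(name);
    } catch (e) {
      // 404 means the container does not exist, which is the normal case.
      if (e.statusCode !== 404) {
        console.warn(`could not remove ${name}: ${e.message}`);
      }
    }
  }
  return removed;
}

// In-memory stand-in for a Docker client, used only for this illustration.
function createFakeDocker(existingNames) {
  const present = new Set(existingNames);
  return {
    getContainer: (name) => ({
      remove: async () => {
        if (!present.has(name)) {
          const err = new Error(`No such container: ${name}`);
          err.statusCode = 404;
          throw err;
        }
        present.delete(name);
      },
    }),
  };
}
```

Each name is handled in its own `try` block, so a missing or stubborn container does not prevent the other from being removed.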

Sequence Diagram

sequenceDiagram
    participant Helper as Helper Script
    participant Scheduler as Scheduler
    participant ZeroSSL as ZeroSSL<br/>Renewal
    participant Listr as Listr Tasks
    participant Verification as Verification<br/>Server
    participant Gateway as Gateway<br/>Process
    participant Containers as Container<br/>Cleanup

    Helper->>Containers: Remove orphaned SSL containers<br/>(startup cleanup)
    Note over Containers: dashmate-zerossl-validation<br/>dashmate-letsencrypt-lego
    
    Scheduler->>ZeroSSL: Trigger renewal (CronJob)
    
    alt Renewal Success
        ZeroSSL->>Verification: Start verification server
        Verification-->>ZeroSSL: Server ready
        ZeroSSL->>ZeroSSL: Renew certificate
        ZeroSSL->>Gateway: Restart gateway process
        ZeroSSL->>Scheduler: Set renewalSucceeded = true
        Scheduler->>Scheduler: Schedule immediate re-check<br/>(process.nextTick)
    else Renewal Failure
        ZeroSSL->>Verification: Start verification server
        Verification-->>ZeroSSL: Server ready
        ZeroSSL->>ZeroSSL: Attempt renewal
        ZeroSSL->>ZeroSSL: Error caught
        Listr->>Verification: Cleanup server on error
        Verification-->>Listr: Server destroyed
        ZeroSSL->>Scheduler: Set renewalSucceeded = false
        Scheduler->>Scheduler: Schedule retry<br/>(1 hour delay)
    end
    
    Helper->>Containers: Graceful shutdown handler
    Containers->>Containers: Stop & remove<br/>started containers

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A certificate renewed with grace,
When failures come, they find their place,
With cleanup sure and retry wise,
The dashmate helper optimizes!
No orphans left, no leaks in sight,
Our SSL renewals done just right. 🔒✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main fix: preventing orphaned verification containers from blocking SSL renewal, which aligns with the core objective of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


ktechmidas added and removed the "ready for final review" label Feb 26, 2026