fix(dashmate): prevent orphaned verification container blocking SSL renewal #3162
ktechmidas wants to merge 1 commit into v3.1-dev
When ZeroSSL certificate renewal fails mid-pipeline (e.g. during domain verification or certificate download), the verification server container bound to port 80 is never cleaned up. This blocks all subsequent renewal attempts, causing certificates to expire across many nodes simultaneously if ZeroSSL experiences an API issue during the renewal window.

Three fixes:

1. Wrap the obtain task's run() to ensure the verification server container is always stopped on failure, not just on success.
2. Add try/catch with 1-hour retry backoff to the ZeroSSL scheduler, matching the existing Let's Encrypt scheduler pattern. Previously, unhandled errors could crash silently or cause tight failure loops.
3. Add graceful shutdown and startup orphan cleanup to the helper process. On boot, any leftover verification containers from previous failed runs are force-removed before scheduling renewals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Walkthrough
These changes enhance the dashmate helper script with graceful shutdown handling, improved error recovery for SSL certificate renewal, and cleanup mechanisms for verification servers. The updates implement try-catch error handling for renewal operations, conditional retry logic based on renewal success, and automatic cleanup of orphaned containers and resources on exit.
Sequence Diagram
sequenceDiagram
participant Helper as Helper Script
participant Scheduler as Scheduler
participant ZeroSSL as ZeroSSL<br/>Renewal
participant Listr as Listr Tasks
participant Verification as Verification<br/>Server
participant Gateway as Gateway<br/>Process
participant Containers as Container<br/>Cleanup
Helper->>Containers: Remove orphaned SSL containers<br/>(startup cleanup)
Note over Containers: dashmate-zerossl-validation<br/>dashmate-letsencrypt-lego
Scheduler->>ZeroSSL: Trigger renewal (CronJob)
alt Renewal Success
ZeroSSL->>Verification: Start verification server
Verification-->>ZeroSSL: Server ready
ZeroSSL->>ZeroSSL: Renew certificate
ZeroSSL->>Gateway: Restart gateway process
ZeroSSL->>Scheduler: Set renewalSucceeded = true
Scheduler->>Scheduler: Schedule immediate re-check<br/>(process.nextTick)
else Renewal Failure
ZeroSSL->>Verification: Start verification server
Verification-->>ZeroSSL: Server ready
ZeroSSL->>ZeroSSL: Attempt renewal
ZeroSSL->>ZeroSSL: Error caught
Listr->>Verification: Cleanup server on error
Verification-->>Listr: Server destroyed
ZeroSSL->>Scheduler: Set renewalSucceeded = false
Scheduler->>Scheduler: Schedule retry<br/>(1 hour delay)
end
Helper->>Containers: Graceful shutdown handler
Containers->>Containers: Stop & remove<br/>started containers
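The success and failure branches of the diagram can be sketched in plain JavaScript. This is a minimal illustration, not the code from the PR: `attemptRenewal`, `scheduleNext`, and the injected `renew`, `recheck`, and `logger` arguments are hypothetical names; only the `renewalSucceeded` flag, the `process.nextTick` re-check, and the 1-hour delay come from the diagram.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Runs one renewal attempt and reports how long to wait before the next check.
// `renew` is a hypothetical async function standing in for the renewal task.
async function attemptRenewal(renew, logger = console) {
  let renewalSucceeded = false;
  try {
    await renew();
    renewalSucceeded = true;
  } catch (e) {
    // Swallow the error here so it cannot propagate unhandled out of the scheduler.
    logger.error(`Certificate renewal failed: ${e.message}`);
  }
  return { renewalSucceeded, retryDelayMs: renewalSucceeded ? 0 : ONE_HOUR_MS };
}

// Success: re-check on the next tick. Failure: back off for one hour.
// Returns the timer handle on failure so callers can cancel it, otherwise null.
function scheduleNext(result, recheck) {
  if (result.renewalSucceeded) {
    process.nextTick(recheck);
    return null;
  }
  return setTimeout(recheck, result.retryDelayMs);
}
```

Returning the delay as data, separate from the timer call, keeps the backoff decision easy to test without waiting for real timers.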
Estimated Code Review Effort: 3 (Moderate), ~25 minutes
Issue being fixed or feature implemented
When ZeroSSL experiences an API issue during the certificate renewal window, the verification server container (bound to port 80) is left running indefinitely. This blocks all subsequent renewal attempts, causing certificates to expire. In December this led to widespread certificate expiration across many mainnet nodes simultaneously.
Root cause: the "Stop verification server" step is a regular Listr task at the end of the pipeline — if any earlier step throws, Listr aborts and the cleanup never runs. Compounding this, the ZeroSSL scheduler had no try/catch (unlike the Let's Encrypt scheduler), so errors propagated unhandled with no retry backoff, and the helper process had no graceful shutdown or startup cleanup.
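The root cause can be reproduced in a few lines without Listr. The sketch below is illustrative only: `FakeVerificationServer`, `runPipeline`, `runWithCleanup`, and `buildSteps` are hypothetical stand-ins, and the sequential runner merely imitates a task list that aborts on the first error.

```javascript
// Hypothetical stand-in for the verification server; not dashmate's real class.
class FakeVerificationServer {
  constructor() {
    this.running = false;
  }

  async start() {
    this.running = true; // pretend port 80 is now bound
  }

  async stop() {
    if (!this.running) {
      return; // idempotent: nothing to stop
    }
    this.running = false;
  }
}

// Runs steps in order and aborts on the first throw, like a sequential task list.
async function runPipeline(steps) {
  for (const step of steps) {
    await step();
  }
}

// Wrapper: stop the server on any failure, then rethrow so callers still see the error.
async function runWithCleanup(steps, server) {
  try {
    await runPipeline(steps);
  } catch (e) {
    await server.stop();
    throw e;
  }
}

// A pipeline whose cleanup is only the last regular step.
function buildSteps(server) {
  return [
    () => server.start(),
    async () => { throw new Error('ZeroSSL API unavailable'); }, // simulated mid-pipeline failure
    () => server.stop(), // never reached when the previous step throws
  ];
}
```

With `runPipeline` alone the fake server is still running after the failure; with `runWithCleanup` it is stopped and the original error is still reported.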
What was done?
Three layered fixes to ensure port 80 is never left stuck:
1. obtainZeroSSLCertificateTaskFactory.js: Wrapped the Listr run() method with a catch block that ensures verificationServer.stop() and destroy() are called on any failure. The existing "Stop verification server" task still handles the happy path; the wrapper only fires on error.
2. scheduleRenewZeroSslCertificateFactory.js: Added try/catch with 1-hour retry backoff, mirroring the existing Let's Encrypt scheduler pattern. Previously, unhandled errors could crash silently or cause tight failure loops. It now logs success/failure and waits before retrying.
3. scripts/helper.js: Two additions:
   - Startup cleanup: force-removes any orphaned dashmate-zerossl-validation or dashmate-letsencrypt-lego containers from previous failed runs (the key recovery mechanism).
   - Graceful shutdown: uses node-graceful (same library used by BaseCommand) to clean up tracked containers when the helper exits.
How Has This Been Tested?
- verificationServer.stop() is idempotent (returns early if no container)
- node-graceful is already a dashmate dependency
- docker, stopAllContainers, and startedContainers are available in the DI container
Breaking Changes
None. All changes are additive error handling and cleanup logic.
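As an illustration of how additive this cleanup logic is, the startup orphan removal could look roughly like the sketch below. It is not the PR's code. It assumes a dockerode-like client where `getContainer(name).remove({ force: true })` returns a promise and a missing container rejects with `statusCode` 404, and it assumes the two container names are exact names, not prefixes. `removeOrphanedContainers` and `ORPHAN_CANDIDATES` are hypothetical names.

```javascript
// Container names taken from the PR description; treated here as exact names (assumption).
const ORPHAN_CANDIDATES = ['dashmate-zerossl-validation', 'dashmate-letsencrypt-lego'];

// `docker` is assumed to be a dockerode-like client (assumption, see lead-in).
// Returns the names of the containers that were actually removed.
async function removeOrphanedContainers(docker, names = ORPHAN_CANDIDATES) {
  const removed = [];
  for (const name of names) {
    try {
      await docker.getContainer(name).remove({ force: true });
      removed.push(name);
    } catch (e) {
      // 404 means there was no leftover container, which is the normal case.
      if (e.statusCode !== 404) {
        throw e;
      }
    }
  }
  return removed;
}
```

On a clean boot every lookup is a 404 and the function returns an empty list, so it is safe to run before scheduling renewals.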
Checklist:
🤖 Generated with Claude Code