
fix: database connection auto-healing and controller fail-fast monitoring #448

Open
TrezorHannes wants to merge 1 commit into cryptosharks131:v1.11.0 from TrezorHannes:fix/daemon-zombie-loop-clean


@TrezorHannes

Contributor

Hey @cryptosharks131

Here's the fix for the daemon zombie loop and controller hanging issues.

I've been running this patch locally, and it completely resolved the random controller/daemon hangs I used to see. Before this, I had to manually restart the lndg controller systemd service every 2-3 weeks because connections dropped or processes got stuck. Since applying the patch, it has been running with zero issues.

Summary of Changes:

  1. Controller Fail-Fast: Modified controller.py to monitor the main tasks (jobs, rebalancer, htlc_stream, p2p) in a loop. If any of these helper processes dies or crashes, the controller terminates the remaining processes and exits with status 1. This lets a service manager such as systemd cleanly restart the entire group.
  2. DB Connection Auto-Healing: Added connections.close_all() calls to the exception blocks in jobs.py, rebalancer.py, p2p.py, and htlc_stream.py. When a database connection is dropped or gets stuck, the worker now closes the bad connections, and Django opens clean ones on the next retry loop. This prevents the processes from getting stuck in an infinite database connection error loop.
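
For reference, the auto-healing pattern from item 2 looks roughly like the sketch below. It is not the actual diff: `run_with_auto_heal`, `reset_connections`, `max_iterations`, and `retry_delay` are hypothetical names for illustration, and the reset step is passed in as a parameter so the sketch runs without a Django project. In the workers, that step would be `connections.close_all()` from `django.db`.

```python
import time
from typing import Callable


def run_with_auto_heal(
    task: Callable[[], None],
    reset_connections: Callable[[], None],
    max_iterations: int,
    retry_delay: float = 0.0,
) -> int:
    """Run `task` repeatedly; after any exception, reset DB connections and retry.

    Hypothetical helper for illustration. In a Django worker,
    `reset_connections` would be `django.db.connections.close_all`.
    Returns the number of iterations that raised.
    """
    failures = 0
    for _ in range(max_iterations):
        try:
            task()
        except Exception as exc:
            failures += 1
            print(f"worker error: {exc!r}; closing DB connections before retry")
            # Drop possibly broken connections so the next iteration
            # starts with fresh ones instead of reusing a dead handle.
            reset_connections()
            time.sleep(retry_delay)
    return failures
```

The real workers loop indefinitely; the sketch is bounded by `max_iterations` only so that it terminates when run on its own.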

…ring

- Implement Django connection auto-healing in background workers (jobs.py, rebalancer.py, p2p.py, htlc_stream.py) to prevent unrecoverable zombie loops on DB drop.
- Implement process monitoring in controller.py to terminate the process group and exit if any worker process dies, allowing systemd restart.
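
The process monitoring can be sketched as follows. This is a minimal illustration, not the code in controller.py: it uses `subprocess.Popen` children as stand-ins for the workers, and `monitor_fail_fast`, `spawn_sleeper`, and `poll_interval` are hypothetical names. The actual controller may start and track its workers differently.

```python
import subprocess
import sys
import time
from typing import Dict


def spawn_sleeper(seconds: float) -> subprocess.Popen:
    """Hypothetical stand-in for a worker: a child that sleeps, then exits."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({seconds})"]
    )


def monitor_fail_fast(procs: Dict[str, subprocess.Popen],
                      poll_interval: float = 0.5) -> str:
    """Block until any process exits, stop the others, return the dead one's name."""
    while True:
        for name, proc in procs.items():
            if proc.poll() is not None:  # this worker has exited
                for other in procs.values():
                    if other.poll() is None:
                        other.terminate()
                for other in procs.values():
                    try:
                        other.wait(timeout=5)
                    except subprocess.TimeoutExpired:
                        # Escalate if the worker ignores the terminate request.
                        other.kill()
                        other.wait()
                return name
        time.sleep(poll_interval)
```

A controller built this way would call `sys.exit(1)` once `monitor_fail_fast` returns, so that a systemd unit with a restart policy (for example `Restart=on-failure`) brings the whole group back up.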