Skip to content

smite: Clean up managed subprocess groups on shutdown#32

Open
Ashish-Kumar-Dash wants to merge 5 commits intomorehouse:masterfrom
Ashish-Kumar-Dash:process-group-cleanup
Open

smite: Clean up managed subprocess groups on shutdown#32
Ashish-Kumar-Dash wants to merge 5 commits intomorehouse:masterfrom
Ashish-Kumar-Dash:process-group-cleanup

Conversation

@Ashish-Kumar-Dash
Copy link
Copy Markdown

@Ashish-Kumar-Dash Ashish-Kumar-Dash commented Apr 4, 2026

Summary
Teach ManagedProcess to place spawned children in their own process groups and terminate the full group during shutdown. This makes cleanup more reliable for subprocess trees instead of only signaling the direct child.

Changes
create a dedicated process group for each managed subprocess at spawn time
send SIGTERM and, if needed, SIGKILL to the process group during shutdown
keep a direct-child signal fallback if process-group signaling fails
add a regression test covering child + grandchild cleanup
add a small CLN note documenting graceful RPC stop first, then managed cleanup fallback
add tempfile as a test-only dependency for the new regression test
Why
This improves teardown semantics for multi-process workloads and is directly relevant to future campaign management work where reliable stop/cleanup behavior matters.

Testing
cargo test -p smite ✅
cargo test ✅
Kindly review @morehouse & let me know if any changes are needed

Copy link
Copy Markdown
Owner

@morehouse morehouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change makes sense. But we'll need to manually test it against each coverage workload to make sure it doesn't cause coverage files to be dropped. In the past we've had to do several workarounds, especially for CLN, to make sure there's a "clean exit" that allows coverage profiles to be saved to disk.

Also please make sure CI passes before requesting the next review. You should be able to run all the CI commands locally (e.g., cargo test, cargo fmt, cargo clippy --all-targets --all-features -- -D warnings).

Ashish-Kumar-Dash and others added 3 commits April 7, 2026 11:52
Added a SAFETY comment here explaining that the pre_exec closure only calls async-signal-safe libc::setpgid and uses io::Error::last_os_error() for errno capture.

Co-authored-by: Matt Morehouse <mattmorehouse@gmail.com>
@Ashish-Kumar-Dash
Copy link
Copy Markdown
Author

Hey @morehouse, thanks for the review. I’ve updated the patch to address the feedback, switched the pre_exec path to use libc::setpgid, removed the stored process_group field, dropped the fallback direct-child kills, and added the requested safety comment. I also reran the local CI equivalent checks and fixed the issues they surfaced.

One question as I think about follow-up validation: for the manual coverage testing you mentioned, is there a target/workload you consider the most sensitive canary for shutdown-related coverage loss, or should I treat CLN as the primary one given its existing clean-exit handling?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants