Crash handler is not async-signal-safe (SA_ONSTACK set, sigaltstack never called) — truncates under corrupted-stack crashes

Follow-up to #404 / #405 (VLA stack-overflow). The original production bug report attached to #404 was **truncated mid-stream** at "CLIENT LIST OUTPUT" — the crash handler in `memtier_benchmark.cpp` started writing the report, got past `STACK TRACE` and `INFO OUTPUT`, then died while iterating clients because the worker thread's stack was already corrupted by the VLA overflow and the handler itself re-faulted inside `_dl_fixup` (ld-linux lazy PLT bind) when it tried to call an stdio function for the first time. PR #405 patches the trigger; the handler remains fragile against any future memory corruption.

## Async-signal-safety violations in current `crash_handler` (~lines 111–208)

| Site | Call | Issue |
|---|---|---|
| ~30 sites throughout the handler | `fprintf` / `vfprintf` | take FILE* lock; may `malloc`; **lazy-bound on first call** (the bug that truncated #404) |
| lines 2040, 2044 | `backtrace_symbols` + `free` | explicitly documented as not signal-safe; allocates from the (possibly corrupt) heap arena |
| lines 120–121 | `localtime` + `strftime` | not signal-safe; locale/TZ locks |
| lines 1976–2024 | `print_client_list` traverses `std::vector<cg_thread*>` and derefs virtual `shard_connection` getters | UB under torn heap |
| line 216 | `SA_ONSTACK` is set in `sa_flags` | **but `sigaltstack(2)` is never called anywhere in the codebase** — `grep -r sigaltstack` returns zero. The flag is silently inert; stack-overflow crashes re-crash on the exhausted stack and produce *no* report at all. |
| (everywhere) | No symbol pre-warm at startup | any function called for the first time in the handler triggers `_dl_fixup`, which is **not signal-safe** and is the suspected cause of the #404 truncation |

## Proposed remediation

**Short term (low risk, high return):**

1. **Install `sigaltstack` in `setup_crash_handlers`** with at least `SIGSTKSZ * 4` (or 64 KB). Without this, stack-overflow crashes are unrecoverable and `SA_ONSTACK` is misleading dead code.
2. **Pre-warm symbols at startup.** Call `backtrace()` with a 1-frame buffer and discard, call `backtrace_symbols_fd(...,/dev/null fd)` once, touch `event_get_version`, run one `fprintf(stderr, "")` so glibc's stdio lazy init is done. Eliminates `_dl_fixup` from the signal path.
3. **Replace `fprintf` with `write(2)` + a tiny reentrant formatter.** Build into a stack `char buf[512]` with a hand-rolled `safe_itoa`/`safe_strcat`, then `write(STDERR_FILENO, buf, len)`.
4. **Replace `backtrace_symbols` + `free`** with `backtrace_symbols_fd(trace, trace_size, STDERR_FILENO)` — the documented signal-safe variant.
5. **Drop `localtime` + `strftime`.** Capture / format the timestamp into a static buffer at startup, or compose it manually from `time(NULL)`.
6. **Gate `print_client_list` behind a "handler running" flag + an `alarm(5)`** so a hung handler self-terminates instead of leaving a half-written report. Defer it to *after* the simple sections so truncation lands at the *end* of the report.
7. **Cache `getpid()` in a static at startup**; remove the repeated calls.

**Longer term:** spawn an external helper at startup with a pre-opened pipe; on crash, the handler `write()`s only the signal context to the pipe and the helper does the formatting from outside the crashed address space (Crashpad/Breakpad model).

## New tests for `tests/test_crash_handler_integration.py`

The existing tests only send `SIGSEGV` to a healthy process and check *section headers are present*. That assertion silently passes on the truncated-report mode that #404 reported (because the headers *are* there, just no content after them). Add:

```python
def test_crash_handler_stack_overflow_main(env):  # validates sigaltstack
def test_crash_handler_heap_double_free(env):    # validates no malloc/free
def test_crash_handler_use_after_free_worker(env):# validates print_client_list survives torn vectors
def test_crash_handler_null_deref_main(env):     # baseline sanity
def test_crash_handler_abort_main(env):          # SIGABRT direct
def test_crash_handler_worker_thread_abort(env): # pthread_kill on worker
def test_crash_handler_recursive_signal(env):    # SA_RESETHAND second-fault
```

Each must assert **both** `BUG REPORT START` and `BUG REPORT END` are present (the missing-end case is the actual production regression). Requires a new `--crash-test=<mode>` debug flag in the binary that triggers the requested fault on the chosen thread once the benchmark is warm.

## Why this is urgent

Production users have already hit one truncated bug report (the report attached to #404 has no `BUG REPORT END` marker). The next corruption-class bug will be much harder to triage if the handler keeps dying mid-stream.

Related: #404, #405.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash handler is not async-signal-safe (SA_ONSTACK set, sigaltstack never called) — truncates under corrupted-stack crashes #412

Async-signal-safety violations in current `crash_handler` (~lines 111–208)

Proposed remediation

New tests for `tests/test_crash_handler_integration.py`

Why this is urgent

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Site	Call	Issue
~30 sites throughout the handler	`fprintf` / `vfprintf`	take FILE* lock; may `malloc`; lazy-bound on first call (the bug that truncated #404)
lines 2040, 2044	`backtrace_symbols` + `free`	explicitly documented as not signal-safe; allocates from the (possibly corrupt) heap arena
lines 120–121	`localtime` + `strftime`	not signal-safe; locale/TZ locks
lines 1976–2024	`print_client_list` traverses `std::vector<cg_thread*>` and derefs virtual `shard_connection` getters	UB under torn heap
line 216	`SA_ONSTACK` is set in `sa_flags`	but `sigaltstack(2)` is never called anywhere in the codebase — `grep -r sigaltstack` returns zero. The flag is silently inert; stack-overflow crashes re-crash on the exhausted stack and produce no report at all.
(everywhere)	No symbol pre-warm at startup	any function called for the first time in the handler triggers `_dl_fixup`, which is not signal-safe and is the suspected cause of the #404 truncation

Crash handler is not async-signal-safe (SA_ONSTACK set, sigaltstack never called) — truncates under corrupted-stack crashes #412

Description

Async-signal-safety violations in current crash_handler (~lines 111–208)

Proposed remediation

New tests for tests/test_crash_handler_integration.py

Why this is urgent

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Async-signal-safety violations in current `crash_handler` (~lines 111–208)

New tests for `tests/test_crash_handler_integration.py`