Skip to content

Crash handler is not async-signal-safe (SA_ONSTACK set, sigaltstack never called) — truncates under corrupted-stack crashes #412

@fcostaoliveira

Description

@fcostaoliveira

Follow-up to #404 / #405 (VLA stack-overflow). The original production bug report attached to #404 was truncated mid-stream at "CLIENT LIST OUTPUT" — the crash handler in memtier_benchmark.cpp started writing the report, got past STACK TRACE and INFO OUTPUT, then died while iterating clients because the worker thread's stack was already corrupted by the VLA overflow and the handler itself re-faulted inside _dl_fixup (ld-linux lazy PLT bind) when it tried to call an stdio function for the first time. PR #405 patches the trigger; the handler remains fragile against any future memory corruption.

Async-signal-safety violations in current crash_handler (~lines 111–208)

Site Call Issue
~30 sites throughout the handler fprintf / vfprintf take FILE* lock; may malloc; lazy-bound on first call (the bug that truncated #404)
lines 2040, 2044 backtrace_symbols + free explicitly documented as not signal-safe; allocates from the (possibly corrupt) heap arena
lines 120–121 localtime + strftime not signal-safe; locale/TZ locks
lines 1976–2024 print_client_list traverses std::vector<cg_thread*> and derefs virtual shard_connection getters UB under torn heap
line 216 SA_ONSTACK is set in sa_flags but sigaltstack(2) is never called anywhere in the codebasegrep -r sigaltstack returns zero. The flag is silently inert; stack-overflow crashes re-crash on the exhausted stack and produce no report at all.
(everywhere) No symbol pre-warm at startup any function called for the first time in the handler triggers _dl_fixup, which is not signal-safe and is the suspected cause of the #404 truncation

Proposed remediation

Short term (low risk, high return):

  1. Install sigaltstack in setup_crash_handlers with at least SIGSTKSZ * 4 (or 64 KB). Without this, stack-overflow crashes are unrecoverable and SA_ONSTACK is misleading dead code.
  2. Pre-warm symbols at startup. Call backtrace() with a 1-frame buffer and discard, call backtrace_symbols_fd(...,/dev/null fd) once, touch event_get_version, run one fprintf(stderr, "") so glibc's stdio lazy init is done. Eliminates _dl_fixup from the signal path.
  3. Replace fprintf with write(2) + a tiny reentrant formatter. Build into a stack char buf[512] with a hand-rolled safe_itoa/safe_strcat, then write(STDERR_FILENO, buf, len).
  4. Replace backtrace_symbols + free with backtrace_symbols_fd(trace, trace_size, STDERR_FILENO) — the documented signal-safe variant.
  5. Drop localtime + strftime. Capture / format the timestamp into a static buffer at startup, or compose it manually from time(NULL).
  6. Gate print_client_list behind a "handler running" flag + an alarm(5) so a hung handler self-terminates instead of leaving a half-written report. Defer it to after the simple sections so truncation lands at the end of the report.
  7. Cache getpid() in a static at startup; remove the repeated calls.

Longer term: spawn an external helper at startup with a pre-opened pipe; on crash, the handler write()s only the signal context to the pipe and the helper does the formatting from outside the crashed address space (Crashpad/Breakpad model).

New tests for tests/test_crash_handler_integration.py

The existing tests only send SIGSEGV to a healthy process and check section headers are present. That assertion silently passes on the truncated-report mode that #404 reported (because the headers are there, just no content after them). Add:

def test_crash_handler_stack_overflow_main(env):  # validates sigaltstack
def test_crash_handler_heap_double_free(env):    # validates no malloc/free
def test_crash_handler_use_after_free_worker(env):# validates print_client_list survives torn vectors
def test_crash_handler_null_deref_main(env):     # baseline sanity
def test_crash_handler_abort_main(env):          # SIGABRT direct
def test_crash_handler_worker_thread_abort(env): # pthread_kill on worker
def test_crash_handler_recursive_signal(env):    # SA_RESETHAND second-fault

Each must assert both BUG REPORT START and BUG REPORT END are present (the missing-end case is the actual production regression). Requires a new --crash-test=<mode> debug flag in the binary that triggers the requested fault on the chosen thread once the benchmark is warm.

Why this is urgent

Production users have already hit one truncated bug report (the report attached to #404 has no BUG REPORT END marker). The next corruption-class bug will be much harder to triage if the handler keeps dying mid-stream.

Related: #404, #405.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions