Follow-up to #404 / #405 (VLA stack-overflow). The original production bug report attached to #404 was truncated mid-stream at "CLIENT LIST OUTPUT" — the crash handler in memtier_benchmark.cpp started writing the report, got past STACK TRACE and INFO OUTPUT, then died while iterating clients because the worker thread's stack was already corrupted by the VLA overflow and the handler itself re-faulted inside _dl_fixup (ld-linux lazy PLT bind) when it tried to call an stdio function for the first time. PR #405 patches the trigger; the handler remains fragile against any future memory corruption.
Async-signal-safety violations in current crash_handler (~lines 111–208)
| Site |
Call |
Issue |
| ~30 sites throughout the handler |
fprintf / vfprintf |
take FILE* lock; may malloc; lazy-bound on first call (the bug that truncated #404) |
| lines 2040, 2044 |
backtrace_symbols + free |
explicitly documented as not signal-safe; allocates from the (possibly corrupt) heap arena |
| lines 120–121 |
localtime + strftime |
not signal-safe; locale/TZ locks |
| lines 1976–2024 |
print_client_list traverses std::vector<cg_thread*> and derefs virtual shard_connection getters |
UB under torn heap |
| line 216 |
SA_ONSTACK is set in sa_flags |
but sigaltstack(2) is never called anywhere in the codebase — grep -r sigaltstack returns zero. The flag is silently inert; stack-overflow crashes re-crash on the exhausted stack and produce no report at all. |
| (everywhere) |
No symbol pre-warm at startup |
any function called for the first time in the handler triggers _dl_fixup, which is not signal-safe and is the suspected cause of the #404 truncation |
Proposed remediation
Short term (low risk, high return):
- Install
sigaltstack in setup_crash_handlers with at least SIGSTKSZ * 4 (or 64 KB). Without this, stack-overflow crashes are unrecoverable and SA_ONSTACK is misleading dead code.
- Pre-warm symbols at startup. Call
backtrace() with a 1-frame buffer and discard, call backtrace_symbols_fd(...,/dev/null fd) once, touch event_get_version, run one fprintf(stderr, "") so glibc's stdio lazy init is done. Eliminates _dl_fixup from the signal path.
- Replace
fprintf with write(2) + a tiny reentrant formatter. Build into a stack char buf[512] with a hand-rolled safe_itoa/safe_strcat, then write(STDERR_FILENO, buf, len).
- Replace
backtrace_symbols + free with backtrace_symbols_fd(trace, trace_size, STDERR_FILENO) — the documented signal-safe variant.
- Drop
localtime + strftime. Capture / format the timestamp into a static buffer at startup, or compose it manually from time(NULL).
- Gate
print_client_list behind a "handler running" flag + an alarm(5) so a hung handler self-terminates instead of leaving a half-written report. Defer it to after the simple sections so truncation lands at the end of the report.
- Cache
getpid() in a static at startup; remove the repeated calls.
Longer term: spawn an external helper at startup with a pre-opened pipe; on crash, the handler write()s only the signal context to the pipe and the helper does the formatting from outside the crashed address space (Crashpad/Breakpad model).
New tests for tests/test_crash_handler_integration.py
The existing tests only send SIGSEGV to a healthy process and check section headers are present. That assertion silently passes on the truncated-report mode that #404 reported (because the headers are there, just no content after them). Add:
def test_crash_handler_stack_overflow_main(env): # validates sigaltstack
def test_crash_handler_heap_double_free(env): # validates no malloc/free
def test_crash_handler_use_after_free_worker(env):# validates print_client_list survives torn vectors
def test_crash_handler_null_deref_main(env): # baseline sanity
def test_crash_handler_abort_main(env): # SIGABRT direct
def test_crash_handler_worker_thread_abort(env): # pthread_kill on worker
def test_crash_handler_recursive_signal(env): # SA_RESETHAND second-fault
Each must assert both BUG REPORT START and BUG REPORT END are present (the missing-end case is the actual production regression). Requires a new --crash-test=<mode> debug flag in the binary that triggers the requested fault on the chosen thread once the benchmark is warm.
Why this is urgent
Production users have already hit one truncated bug report (the report attached to #404 has no BUG REPORT END marker). The next corruption-class bug will be much harder to triage if the handler keeps dying mid-stream.
Related: #404, #405.
Follow-up to #404 / #405 (VLA stack-overflow). The original production bug report attached to #404 was truncated mid-stream at "CLIENT LIST OUTPUT" — the crash handler in
memtier_benchmark.cppstarted writing the report, got pastSTACK TRACEandINFO OUTPUT, then died while iterating clients because the worker thread's stack was already corrupted by the VLA overflow and the handler itself re-faulted inside_dl_fixup(ld-linux lazy PLT bind) when it tried to call an stdio function for the first time. PR #405 patches the trigger; the handler remains fragile against any future memory corruption.Async-signal-safety violations in current
crash_handler(~lines 111–208)fprintf/vfprintfmalloc; lazy-bound on first call (the bug that truncated #404)backtrace_symbols+freelocaltime+strftimeprint_client_listtraversesstd::vector<cg_thread*>and derefs virtualshard_connectiongettersSA_ONSTACKis set insa_flagssigaltstack(2)is never called anywhere in the codebase —grep -r sigaltstackreturns zero. The flag is silently inert; stack-overflow crashes re-crash on the exhausted stack and produce no report at all._dl_fixup, which is not signal-safe and is the suspected cause of the #404 truncationProposed remediation
Short term (low risk, high return):
sigaltstackinsetup_crash_handlerswith at leastSIGSTKSZ * 4(or 64 KB). Without this, stack-overflow crashes are unrecoverable andSA_ONSTACKis misleading dead code.backtrace()with a 1-frame buffer and discard, callbacktrace_symbols_fd(...,/dev/null fd)once, touchevent_get_version, run onefprintf(stderr, "")so glibc's stdio lazy init is done. Eliminates_dl_fixupfrom the signal path.fprintfwithwrite(2)+ a tiny reentrant formatter. Build into a stackchar buf[512]with a hand-rolledsafe_itoa/safe_strcat, thenwrite(STDERR_FILENO, buf, len).backtrace_symbols+freewithbacktrace_symbols_fd(trace, trace_size, STDERR_FILENO)— the documented signal-safe variant.localtime+strftime. Capture / format the timestamp into a static buffer at startup, or compose it manually fromtime(NULL).print_client_listbehind a "handler running" flag + analarm(5)so a hung handler self-terminates instead of leaving a half-written report. Defer it to after the simple sections so truncation lands at the end of the report.getpid()in a static at startup; remove the repeated calls.Longer term: spawn an external helper at startup with a pre-opened pipe; on crash, the handler
write()s only the signal context to the pipe and the helper does the formatting from outside the crashed address space (Crashpad/Breakpad model).New tests for
tests/test_crash_handler_integration.pyThe existing tests only send
SIGSEGVto a healthy process and check section headers are present. That assertion silently passes on the truncated-report mode that #404 reported (because the headers are there, just no content after them). Add:Each must assert both
BUG REPORT STARTandBUG REPORT ENDare present (the missing-end case is the actual production regression). Requires a new--crash-test=<mode>debug flag in the binary that triggers the requested fault on the chosen thread once the benchmark is warm.Why this is urgent
Production users have already hit one truncated bug report (the report attached to #404 has no
BUG REPORT ENDmarker). The next corruption-class bug will be much harder to triage if the handler keeps dying mid-stream.Related: #404, #405.