Follow-up to #410 / PR #424. While narrowing the Hypothesis strategies to keep the new CLI fuzzer (tests/cli_fuzz.py) green on master, the agent surfaced 17 pre-existing crash/hang classes in memtier_benchmark's option parsing + connection-stage retry loops. The fuzzer's strategy was trimmed so the regression net is stable; each of the bugs below remains live in master and is filed here for triage.
Most actionable pattern
Eleven of the 17 classes are hangs, and most of those share one root cause: once memtier enters a reconnect / retry loop after a connection-stage failure (AUTH, HELLO, SELECT, CLUSTER SLOTS, protocol negotiation), it ignores `--test-time` and runs until it is SIGKILLed. A single bounded "give up after N seconds of consecutive connection-stage failures" policy closes most of them.
| # | Repro (minimal argv) | Symptom |
|---|---|---|
| 1 | `--authenticate ''` against a server with no auth configured | hangs (reconnect loop, ignores `--test-time`) |
| 2 | `--cluster-mode` against standalone Redis | hangs (`cluster slot failed.` loop, ignores `--test-time`) |
| 3 | `--protocol memcache_text` (or `memcache_binary`) against Redis | hangs (`response parsing failed.` loop) |
| 4 | `--run-count -1` | SIGABRT, `std::bad_alloc` (negative passed to vector resize) |
| 5 | `--run-count 2147483647` (INT_MAX) | SIGABRT, `std::bad_alloc` (allocator OOM, no bounds check) |
| 6 | `--pipeline -1` | hangs (negative size, underflow loop) |
| 7 | `--data-size -1` | SIGABRT, `protocol.cpp:309: assert value_len > 0 failed` |
| 8 | `--data-size-range 1-9999999999` | hangs after `-ERR Protocol error: invalid bulk length` (reconnect loop) |
| 9 | `--data-size-list 0:50` | SIGABRT, value_len=0 assert |
| 10 | `--data-size-list 8:0` | hangs (zero weight, sampler picks nothing) |
| 11 | `--select-db 100` against default 16-DB Redis | hangs (`database selection failed.` loop) |
| 12 | `--command __key__` (bare placeholder, no command name) | SIGABRT, `protocol.cpp:774: first arg is not command name?` |
| 13 | `--command ''` | hangs |
| 14 | `--command-ratio 0` (or empty / whitespace, both parse as 0) | hangs |
| 15 | `--key-pattern G:G --key-stddev inf` | hangs (sampler never produces a key) |
| 16 | `--key-pattern G:G --key-maximum 1` | SIGABRT (Gaussian path on a 1-key range) |
| 17 | `--wait-ratio 0:1 --num-slaves 1-10` against standalone | hangs (WAIT never satisfied, no timeout exit) |
Proposed remediation plan
Phase 1 — connection-stage timeout (closes 1, 2, 3, 8, 11, 17):
Track consecutive connection-stage failures (AUTH-fail / HELLO-fail / SELECT-fail / CLUSTER-SLOTS-fail / -ERR from server during initial probe) on the main thread. After N seconds of unbroken connection-stage failures (default 30 s, configurable via a new --connection-stage-timeout), call event_base_loopbreak() on all loops and exit with code 2 + a clear diagnostic. --test-time continues to bound the steady-state run; this new knob bounds the startup phase.
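The bookkeeping behind this policy is small. Below is a minimal sketch of it in Python, purely to illustrate the logic; the real change would live in memtier's C++ event-loop code. The class name `ConnectStageWatchdog` and its methods are hypothetical, and the 30 s default is the value proposed above.

```python
import time


class ConnectStageWatchdog:
    """Tracks how long connection-stage failures have continued unbroken.

    Hypothetical helper, not part of memtier_benchmark.
    """

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._first_failure_at = None  # start of the current failure streak

    def record_failure(self):
        """Call on each AUTH/HELLO/SELECT/CLUSTER SLOTS failure.

        Returns True once the streak has lasted timeout_s or longer,
        which is the point where the caller should stop and exit.
        """
        now = self._clock()
        if self._first_failure_at is None:
            self._first_failure_at = now
        return now - self._first_failure_at >= self.timeout_s

    def record_success(self):
        """Any successful connection-stage step ends the streak."""
        self._first_failure_at = None
```

Measuring from the first failure of a streak, rather than counting failures, keeps the bound in seconds regardless of how quickly the retry loop spins.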
Phase 2 — argument range validation (closes 4, 5, 6, 7, 9, 12, 13, 14, 16):
Validate at parse time and reject with a non-zero exit + readable error:
- `--run-count` must be > 0 and ≤ a sane cap (e.g. 65535)
- `--pipeline` must be > 0
- `--data-size` must be > 0
- `--data-size-list` entries must have a non-zero size and a non-zero weight
- `--command-ratio` must be > 0
- `--command` argv[0] must be non-empty and not a placeholder
- `--key-pattern=G:G` + `--key-maximum=1` rejected (Gaussian distribution is degenerate on a 1-key range)
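As a rough illustration of these rules, here is a Python sketch that checks an already-parsed option dict and collects readable errors. It is not memtier's parser: the dict keys, the function name `validate_options`, and the "any `__name__` token is a placeholder" test are assumptions made for the example, and the cap is the example value from the list.

```python
RUN_COUNT_CAP = 65535  # example cap from the list above, not a decided value


def validate_options(opts):
    """Return a list of error strings for a dict of parsed options.

    Hypothetical helper; keys such as "run_count" are illustrative names.
    """
    errors = []
    run_count = opts.get("run_count", 1)
    if not 0 < run_count <= RUN_COUNT_CAP:
        errors.append(f"--run-count must be in 1..{RUN_COUNT_CAP}, got {run_count}")
    if opts.get("pipeline", 1) <= 0:
        errors.append("--pipeline must be > 0")
    if "data_size" in opts and opts["data_size"] <= 0:
        errors.append("--data-size must be > 0")
    for size, weight in opts.get("data_size_list", []):
        if size <= 0 or weight <= 0:
            errors.append(f"--data-size-list entry {size}:{weight} needs size > 0 and weight > 0")
    if "command_ratio" in opts and opts["command_ratio"] <= 0:
        errors.append("--command-ratio must be > 0")
    if "command" in opts:
        tokens = opts["command"].split()
        if not tokens:
            errors.append("--command must not be empty")
        # Assumption: placeholders look like __key__, i.e. wrapped in double underscores.
        elif tokens[0].startswith("__") and tokens[0].endswith("__"):
            errors.append("--command must start with a command name, not a placeholder")
    if opts.get("key_pattern") == "G:G" and opts.get("key_maximum") == 1:
        errors.append("--key-pattern G:G is degenerate with --key-maximum 1")
    return errors
```

Collecting every error before exiting, instead of stopping at the first, lets a single run report all bad arguments at once.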
Phase 3 — sampler safety (closes 10, 15):
Detect samplers that cannot produce a value (zero-weight list, infinite stddev) and reject at sampler-construction time rather than hanging the worker loop.
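A construction-time check along these lines could look like the following Python sketch. Both function names are hypothetical, and the checks cover only the two cases named above (a weight list that sums to zero, a non-finite stddev); the real samplers are C++.

```python
import math


def check_weighted_sampler(weights):
    """Raise ValueError if a weighted sampler could never pick an entry."""
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weighted sampler needs at least one positive weight")


def check_gaussian_sampler(stddev):
    """Raise ValueError if the stddev is inf or NaN."""
    if not math.isfinite(stddev):
        raise ValueError(f"Gaussian sampler needs a finite stddev, got {stddev}")
```

Raising at construction turns cases 10 and 15 into an immediate, diagnosable failure instead of a worker loop that never yields a value.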
Acceptance per phase
- Each phase lands as its own PR with a regression test that reproduces the affected fuzz case on the un-fixed binary and passes on the fixed one.
- The corresponding entries in `tests/cli_fuzz.py`'s strategy exclusion list are then removed, expanding the fuzzer's coverage net.
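A regression test for any of these cases needs to tell a hang from a crash from a clean rejection. One possible shape for that helper, sketched in Python with only the standard library, is shown below. The name `classify_run` and the four labels are assumptions for illustration, not something taken from `tests/cli_fuzz.py`, and the signal check relies on POSIX return-code semantics.

```python
import subprocess


def classify_run(argv, timeout_s):
    """Run argv and report 'hang', 'crash', 'error' or 'ok'.

    Hypothetical test helper; assumes a POSIX host.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return "hang"
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal
        # (for example SIGABRT from a failed assert).
        return "crash"
    return "ok" if proc.returncode == 0 else "error"
```

With this, a Phase 2 test for case 4 would expect `classify_run([memtier_path, "--run-count", "-1"], timeout_s)` to return `"crash"` on the un-fixed binary and `"error"` on the fixed one, where `memtier_path` and the timeout are chosen by the test.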
Out of scope here
The other CLI-fuzz follow-ups deferred from #410: --tls*, --uri, --data-import, --failed-keys-file, --monitor-input (need dedicated setup harnesses; queued for after Phases 1-3).