CLI-fuzz follow-ups: 17 crash/hang classes (11 share connection-stage retry-loop root cause) #426

@fcostaoliveira

Follow-up to #410 / PR #424. While narrowing the Hypothesis strategies to keep the new CLI fuzzer (tests/cli_fuzz.py) green on master, the agent surfaced 17 pre-existing crash/hang classes in memtier_benchmark's option parsing and connection-stage retry loops. The fuzzer's strategies were trimmed so the regression net stays stable; every bug below is still live on master and is filed here for triage.

Most actionable pattern

11 of the 17 classes share one root cause: once memtier enters a reconnect / retry loop after a connection-stage failure (AUTH, HELLO, SELECT, CLUSTER SLOTS, protocol negotiation), it ignores --test-time and runs until it is SIGKILLed. A single bounded "give up after N seconds of consecutive connect-stage failures" policy closes most of them.

| # | Repro (minimal argv) | Symptom |
|---|---|---|
| 1 | `--authenticate ''` against a server with no auth configured | hangs (reconnect loop, ignores `--test-time`) |
| 2 | `--cluster-mode` against standalone Redis | hangs (`cluster slot failed.` loop, ignores `--test-time`) |
| 3 | `--protocol memcache_text` (or `memcache_binary`) against Redis | hangs (`response parsing failed.` loop) |
| 4 | `--run-count -1` | SIGABRT, `std::bad_alloc` (negative passed to vector resize) |
| 5 | `--run-count 2147483647` (INT_MAX) | SIGABRT, `std::bad_alloc` (allocator OOM, no bounds check) |
| 6 | `--pipeline -1` | hangs (negative size, underflow loop) |
| 7 | `--data-size -1` | SIGABRT, `protocol.cpp:309: assert value_len > 0 failed` |
| 8 | `--data-size-range 1-9999999999` | hangs after `-ERR Protocol error: invalid bulk length` (reconnect loop) |
| 9 | `--data-size-list 0:50` | SIGABRT, `value_len=0` assert |
| 10 | `--data-size-list 8:0` | hangs (zero weight, sampler picks nothing) |
| 11 | `--select-db 100` against default 16-DB Redis | hangs (`database selection failed.` loop) |
| 12 | `--command __key__` (bare placeholder, no command name) | SIGABRT, `protocol.cpp:774: first arg is not command name?` |
| 13 | `--command ''` | hangs |
| 14 | `--command-ratio 0` (or empty / whitespace, both parse as 0) | hangs |
| 15 | `--key-pattern G:G --key-stddev inf` | hangs (sampler never produces a key) |
| 16 | `--key-pattern G:G --key-maximum 1` | SIGABRT (Gaussian path on a 1-key range) |
| 17 | `--wait-ratio 0:1 --num-slaves 1-10` against standalone | hangs (WAIT never satisfied, no timeout exit) |
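For triage it may help to pin down how "hangs" and "SIGABRT" are told apart mechanically. The sketch below is one way a Python harness could do it, assuming a POSIX host where a child killed by a signal reports a negative return code. `classify_run` and the 10-second budget are illustrative names and values, not taken from tests/cli_fuzz.py.

```python
import signal
import subprocess


def classify_run(argv, budget_s=10.0):
    """Run argv and return 'hang', 'crash:<SIGNAL>' or 'exit:<code>'."""
    try:
        # subprocess.run kills the child itself when the timeout expires.
        proc = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=budget_s,
        )
    except subprocess.TimeoutExpired:
        return "hang"

    if proc.returncode < 0:
        # On POSIX a negative return code is the number of the fatal signal.
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = str(-proc.returncode)
        return "crash:" + name
    return "exit:%d" % proc.returncode
```

A hang is only ever "did not finish within the budget", so the budget has to sit comfortably above whatever --test-time the repro passes.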

Proposed remediation plan

Phase 1 — connection-stage timeout (closes 1, 2, 3, 8, 11, 17):

Track consecutive connection-stage failures (AUTH-fail / HELLO-fail / SELECT-fail / CLUSTER-SLOTS-fail / -ERR from server during initial probe) on the main thread. After N seconds of unbroken connection-stage failures (default 30 s, configurable via a new --connection-stage-timeout), call event_base_loopbreak() on all event loops and exit with code 2 and a clear diagnostic. --test-time continues to bound the steady-state run; the new option bounds the startup phase.
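The bookkeeping behind this policy is small. Below is a language-neutral model of it, written in Python to match the test suite; the real change would live in memtier's C++ event-loop code. `ConnectStageWatchdog` and its method names are hypothetical, and the injectable clock exists only so the logic can be tested without sleeping.

```python
import time


class ConnectStageWatchdog:
    """Tracks how long connection-stage failures have continued unbroken."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._first_failure = None  # start of the current failure streak

    def record_failure(self):
        # Only the first failure of a streak starts the timer.
        if self._first_failure is None:
            self._first_failure = self._clock()

    def record_success(self):
        # Any successful connection-stage step breaks the streak.
        self._first_failure = None

    def expired(self):
        if self._first_failure is None:
            return False
        return self._clock() - self._first_failure >= self.timeout_s
```

In this model the caller would poll `expired()` from a periodic timer and, once it returns true, break the event loops and exit with the diagnostic described above.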

Phase 2 — argument range validation (closes 4, 5, 6, 7, 9, 12, 14, 16):

Validate at parse time and reject with a non-zero exit code and a readable error:

  • --run-count must be > 0 and ≤ a sane cap (e.g. 65535)
  • --pipeline must be > 0
  • --data-size must be > 0
  • --data-size-list entries must have a non-zero size and a non-zero weight
  • --command-ratio must be > 0
  • --command argv[0] must be non-empty and must not be a bare placeholder such as __key__ (the non-empty rule also rejects the --command '' hang, 13)
  • --key-pattern=G:G + --key-maximum=1 rejected (Gaussian distribution is degenerate on a 1-key range)
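The rules above can be written down as a single predicate, which could also serve the fuzzer as an oracle for "this argv must be rejected". This is a sketch in Python under several assumptions: options arrive as a dict keyed by flag name, --data-size-list is a comma-separated list of size:weight pairs, and a placeholder is any token of the form `__name__`. `validate`, `RUN_COUNT_CAP` and `PLACEHOLDER` are illustrative names; the real checks would sit in memtier's C++ option parser.

```python
import re

RUN_COUNT_CAP = 65535  # example cap quoted in the issue, not a decided value
PLACEHOLDER = re.compile(r"__\w+__")  # assumption: placeholders look like __key__


def validate(opts):
    """Return a list of readable errors for a dict of already-parsed options."""
    errors = []
    if "run-count" in opts and not 0 < opts["run-count"] <= RUN_COUNT_CAP:
        errors.append("--run-count must be in 1..%d" % RUN_COUNT_CAP)
    for flag in ("pipeline", "data-size", "command-ratio"):
        if flag in opts and opts[flag] <= 0:
            errors.append("--%s must be > 0" % flag)
    for entry in opts.get("data-size-list", "").split(","):
        if not entry:
            continue
        size, weight = (int(part) for part in entry.split(":"))
        if size <= 0 or weight <= 0:
            errors.append("--data-size-list entry %r needs a non-zero size and weight" % entry)
    if "command" in opts:
        tokens = opts["command"].split()
        # Rejects both the empty command and a bare placeholder in argv[0].
        if not tokens or PLACEHOLDER.fullmatch(tokens[0]):
            errors.append("--command must start with a command name")
    if opts.get("key-pattern") == "G:G" and opts.get("key-maximum") == 1:
        errors.append("--key-pattern G:G is degenerate with --key-maximum 1")
    return errors
```

Returning every error rather than stopping at the first keeps the diagnostic useful when a fuzzed argv violates several rules at once.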

Phase 3 — sampler safety (closes 10, 15):

Detect samplers that cannot produce a value (zero-weight list, infinite stddev) and reject at sampler-construction time rather than hanging the worker loop.
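Both sampler cases reduce to a precondition that can be checked once, before any worker starts drawing values. A minimal sketch of those preconditions in Python follows; `check_weights` and `check_stddev` are hypothetical helpers, and treating NaN the same as infinity is an extension beyond the two cases the fuzzer actually hit.

```python
import math


def check_weights(weights):
    """Reject a weighted list from which nothing can ever be drawn."""
    if not weights:
        raise ValueError("weighted list is empty")
    if any(w < 0 for w in weights):
        raise ValueError("weights must not be negative")
    if sum(weights) == 0:
        raise ValueError("all weights are zero, sampler cannot pick an entry")


def check_stddev(stddev):
    """Reject a Gaussian standard deviation that is not a finite number."""
    # math.isfinite is False for inf, -inf and nan.
    if not math.isfinite(stddev):
        raise ValueError("stddev must be finite, got %r" % stddev)
```

Failing at construction time turns a silent worker-loop hang into an immediate, attributable error, which is the behaviour the regression tests can then assert on.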

Acceptance per phase

  • Each phase lands as its own PR with a regression test that reproduces the affected fuzz case on the un-fixed binary and passes on the fixed one.
  • The corresponding entries in tests/cli_fuzz.py's strategy exclusion list are then removed, expanding the fuzzer's coverage net.

Out of scope here

The other CLI-fuzz follow-ups deferred from #410: --tls*, --uri, --data-import, --failed-keys-file, --monitor-input (need dedicated setup harnesses; queued for after Phases 1-3).