CLI-fuzz follow-ups: 17 crash/hang classes (11 share connection-stage retry-loop root cause)

Follow-up to #410 / PR #424. While narrowing the Hypothesis strategies to keep the new CLI fuzzer (`tests/cli_fuzz.py`) green on master, the agent surfaced **17 pre-existing crash/hang classes** in memtier_benchmark's option parsing and connection-stage retry loops. The fuzzer's strategy was trimmed so the regression net is stable; each of the bugs below remains live on master and is filed here for triage.

## Most actionable pattern

**11 of the 17 classes are hangs, and they share one root cause**: once memtier enters a reconnect / retry loop after a *connection-stage* failure (AUTH, HELLO, SELECT, CLUSTER SLOTS, protocol negotiation), it ignores `--test-time` and runs until SIGKILLed. A single bounded *"give up after N seconds of consecutive connect-stage failures"* policy closes most of them.

| # | Repro (minimal argv) | Symptom |
|---|---|---|
| 1 | `--authenticate ''` against a server with no auth configured | hangs (reconnect loop, ignores `--test-time`) |
| 2 | `--cluster-mode` against standalone Redis | hangs (`cluster slot failed.` loop, ignores `--test-time`) |
| 3 | `--protocol memcache_text` (or `memcache_binary`) against Redis | hangs (`response parsing failed.` loop) |
| 4 | `--run-count -1` | SIGABRT, `std::bad_alloc` (negative passed to vector resize) |
| 5 | `--run-count 2147483647` (INT_MAX) | SIGABRT, `std::bad_alloc` (allocator OOM, no bounds check) |
| 6 | `--pipeline -1` | hangs (negative size, underflow loop) |
| 7 | `--data-size -1` | SIGABRT, `protocol.cpp:309: assert value_len > 0 failed` |
| 8 | `--data-size-range 1-9999999999` | hangs after `-ERR Protocol error: invalid bulk length` (reconnect loop) |
| 9 | `--data-size-list 0:50` | SIGABRT, value_len=0 assert |
| 10 | `--data-size-list 8:0` | hangs (zero weight, sampler picks nothing) |
| 11 | `--select-db 100` against default 16-DB Redis | hangs (`database selection failed.` loop) |
| 12 | `--command __key__` (bare placeholder, no command name) | SIGABRT, `protocol.cpp:774: first arg is not command name?` |
| 13 | `--command ''` | hangs |
| 14 | `--command-ratio 0` (or empty / whitespace, both parse as 0) | hangs |
| 15 | `--key-pattern G:G --key-stddev inf` | hangs (sampler never produces a key) |
| 16 | `--key-pattern G:G --key-maximum 1` | SIGABRT (Gaussian path on a 1-key range) |
| 17 | `--wait-ratio 0:1 --num-slaves 1-10` against standalone | hangs (WAIT never satisfied, no timeout exit) |

## Proposed remediation plan

**Phase 1 — connection-stage timeout (closes 1, 2, 3, 8, 11, 17):**

Track consecutive connection-stage failures (AUTH-fail / HELLO-fail / SELECT-fail / CLUSTER-SLOTS-fail / `-ERR` from server during initial probe) on the main thread. After N seconds of unbroken connection-stage failures (default 30 s, configurable via a new `--connection-stage-timeout`), call `event_base_loopbreak()` on all loops and exit with code 2 + a clear diagnostic. `--test-time` continues to bound the *steady-state* run; this new knob bounds the *startup* phase.
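The streak-tracking part of this policy can be sketched in isolation. What follows is a minimal C++ sketch under stated assumptions, not memtier_benchmark code: `ConnectStageWatchdog`, `on_failure`, and `on_success` are hypothetical names, and the wiring to the event loops (the `event_base_loopbreak()` calls and the exit code) is deliberately left out.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: tracks how long connection-stage failures have been
// unbroken. The caller decides what to do once the timeout is reached
// (per the plan above: break the event loops and exit with a diagnostic).
class ConnectStageWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    explicit ConnectStageWatchdog(std::chrono::seconds timeout)
        : timeout_(timeout), in_streak_(false) {
        assert(timeout.count() > 0);
    }

    // Record one connection-stage failure (AUTH, HELLO, SELECT, CLUSTER SLOTS, ...).
    // Returns true once the current failure streak has lasted at least `timeout`.
    bool on_failure(Clock::time_point now) {
        if (!in_streak_) {
            in_streak_ = true;
            streak_start_ = now;  // first failure of a new streak
        }
        return now - streak_start_ >= timeout_;
    }

    // Any successful connection-stage step ends the streak.
    void on_success() { in_streak_ = false; }

private:
    std::chrono::seconds timeout_;
    bool in_streak_;
    Clock::time_point streak_start_;
};
```

Passing the time point in, rather than reading the clock inside the class, is a design choice made for this sketch: it lets a regression test drive the 30 s default without sleeping.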

**Phase 2 — argument range validation (closes 4, 5, 6, 7, 9, 12, 13, 14, 16):**

Validate at parse time and reject with a non-zero exit + readable error:
- `--run-count` must be > 0 and ≤ a sane cap (e.g. 65535)
- `--pipeline` must be > 0
- `--data-size` must be > 0
- `--data-size-list` entries must have a non-zero size and a non-zero weight
- `--command-ratio` must be > 0
- `--command` argv[0] must be non-empty and not a placeholder
- `--key-pattern=G:G` + `--key-maximum=1` rejected (Gaussian distribution is degenerate on a 1-key range)
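The rules above could be collected into one parse-time check that reports every violation at once. This is only a sketch: `ParsedArgs`, its field names and default values, `validate`, and `looks_like_placeholder` are hypothetical and do not mirror memtier_benchmark's real configuration struct. The double-underscore placeholder test is a heuristic assumption generalized from `__key__`.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirror of the parsed options (names and defaults are illustrative).
struct ParsedArgs {
    long long run_count = 1;
    long long pipeline = 1;
    long long data_size = 32;
    long long command_ratio = 1;
    std::string command_name = "SET";                              // argv[0] of --command
    std::vector<std::pair<long long, long long> > data_size_list;  // (size, weight)
    bool gaussian_keys = false;                                    // --key-pattern=G:G
    long long key_maximum = 10000000;
};

const long long kMaxRunCount = 65535;  // the "sane cap" suggested in the list above

static bool in_range(long long v, long long lo, long long hi) {
    assert(lo <= hi);
    return v >= lo && v <= hi;
}

// Assumption: placeholders look like __name__ (the issue only names __key__).
static bool looks_like_placeholder(const std::string& s) {
    return s.size() > 4 && s.compare(0, 2, "__") == 0 &&
           s.compare(s.size() - 2, 2, "__") == 0;
}

// Returns one readable message per violated rule; empty means the args are acceptable.
std::vector<std::string> validate(const ParsedArgs& a) {
    std::vector<std::string> errors;
    if (!in_range(a.run_count, 1, kMaxRunCount))
        errors.push_back("--run-count must be between 1 and 65535");
    if (a.pipeline <= 0) errors.push_back("--pipeline must be > 0");
    if (a.data_size <= 0) errors.push_back("--data-size must be > 0");
    if (a.command_ratio <= 0) errors.push_back("--command-ratio must be > 0");
    if (a.command_name.empty() || looks_like_placeholder(a.command_name))
        errors.push_back("--command must start with a command name");
    for (const auto& entry : a.data_size_list) {
        // Negative values are rejected together with zero.
        if (entry.first <= 0 || entry.second <= 0)
            errors.push_back("--data-size-list entries need size > 0 and weight > 0");
    }
    // The issue names --key-maximum=1; treating smaller values alike is an assumption.
    if (a.gaussian_keys && a.key_maximum <= 1)
        errors.push_back("--key-pattern=G:G needs a key range wider than one key");
    return errors;
}
```

Collecting messages instead of failing on the first one means a single bad invocation surfaces all of its problems in one run, which also keeps the fuzzer's failure output compact.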

**Phase 3 — sampler safety (closes 10, 15):**

Detect samplers that cannot produce a value (zero-weight list, infinite stddev) and reject at sampler-construction time rather than hanging the worker loop.
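A construction-time check of this kind might look as follows. It is a sketch only: `WeightedSize`, `can_sample`, `stddev_is_usable`, and `WeightedSizeSampler` are hypothetical names, and memtier_benchmark's actual samplers are structured differently. Only non-finite standard deviations are rejected here, since that is the one case the issue names.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <vector>

// Hypothetical (size, weight) entry, as in `--data-size-list 8:0`.
struct WeightedSize {
    unsigned size;
    unsigned weight;
};

// A weighted list can only produce a value if some entry has a non-zero weight.
bool can_sample(const std::vector<WeightedSize>& items) {
    unsigned long long total = 0;
    for (const WeightedSize& it : items) total += it.weight;
    return total > 0;
}

// Rejects inf and NaN, the inputs that keep a Gaussian sampler from producing a key.
bool stddev_is_usable(double stddev) { return std::isfinite(stddev); }

class WeightedSizeSampler {
public:
    // Throws at construction instead of letting the worker loop spin forever.
    explicit WeightedSizeSampler(const std::vector<WeightedSize>& items)
        : items_(items), total_weight_(0) {
        if (!can_sample(items))
            throw std::invalid_argument("weighted list has no entry with weight > 0");
        for (const WeightedSize& it : items_) total_weight_ += it.weight;
        assert(total_weight_ > 0);
    }

    // Maps any draw onto an entry; terminates because total_weight_ > 0.
    unsigned pick(unsigned long long draw) const {
        draw %= total_weight_;
        for (const WeightedSize& it : items_) {
            if (draw < it.weight) return it.size;
            draw -= it.weight;
        }
        return items_.back().size;  // not reached when the invariant holds
    }

private:
    std::vector<WeightedSize> items_;
    unsigned long long total_weight_;
};
```

Keeping `can_sample` and `stddev_is_usable` as plain predicates lets the argument parser call them too, so the same rule yields a readable error at parse time and a hard failure if a sampler is ever built from unchecked input.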

## Acceptance per phase

- Each phase lands as its own PR with a regression test that reproduces the affected fuzz case on the un-fixed binary and passes on the fixed one.
- The corresponding entries in `tests/cli_fuzz.py`'s strategy exclusion list are then **removed**, expanding the fuzzer's coverage net.

## Out of scope here

The other CLI-fuzz follow-ups deferred from #410 are out of scope: `--tls*`, `--uri`, `--data-import`, `--failed-keys-file`, `--monitor-input`. They need dedicated setup harnesses and are queued for after Phases 1-3.

