Proposal: benchmark utility for MinHashLSH storage backends #319

@rohanpoudel2

Summary

This follows up on discussion #252, where I volunteered for "Better benchmark utility to measure performance of various storage backends, and propose optimization" and @ekzhu asked for a design proposal first.

This issue proposes a v1 benchmark harness that compares the storage backends available to MinHashLSH. Optimization proposals are explicitly deferred — each will be a separate follow-up issue backed by numbers from this harness.

Gap

datasketch/storage.py exposes three sync backends (dict, redis, cassandra) through the ordered_storage / unordered_storage factories (storage.py:58-65). datasketch/aio/storage.py adds two async backends (aiomongo, aioredis), recently promoted from experimental in v1.10.0 (commit 8489173). All are swappable via storage_config.
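
For context, a minimal sketch of how a backend is swapped in today via `storage_config` (the Redis connection parameters below are placeholders, not recommended values):

```python
from datasketch import MinHash, MinHashLSH

# Default: in-memory dict storage.
lsh_dict = MinHashLSH(threshold=0.5, num_perm=128)

# Same index backed by Redis, selected via storage_config.
lsh_redis = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},  # placeholder connection info
    },
)

m = MinHash(num_perm=128)
for token in ["a", "b", "c"]:
    m.update(token.encode("utf8"))
lsh_redis.insert("key1", m)
print(lsh_redis.query(m))
```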

Today there is no benchmark that compares them head-to-head. The existing benchmark/indexes/jaccard/ suite compares algorithms (LSH vs LSHForest vs HNSW vs exact) on in-memory storage; it does not vary the backend. Users picking a backend for production MinHashLSH deployments have no data to ground the choice.

Scope

In: MinHashLSH across all five backends — dict, redis, cassandra, aiomongo, aioredis.

Out: LSHBloom, LSHEnsemble, algorithm-level benchmarks (already covered), mixed read/write workloads, server-side resource attribution.

Design

Workloads (four operations × three concurrency levels of 1 / 4 / 16; see the runner sketch after this list):

  • Bulk index build
  • Single-op query latency
  • Batch query throughput
  • Delete + update
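
A rough sketch of how a runner could drive these operations against a configured MinHashLSH. The helper names, the buffer size, and the ThreadPoolExecutor-based concurrency model are assumptions of this proposal, not existing code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bulk_build(lsh, keyed_minhashes, buffer_size=1000):
    """Bulk index build through the buffered insertion session; returns wall-clock seconds."""
    start = time.perf_counter()
    with lsh.insertion_session(buffer_size=buffer_size) as session:
        for key, mh in keyed_minhashes:
            session.insert(key, mh)
    return time.perf_counter() - start

def timed_queries(lsh, query_minhashes, concurrency=4):
    """Batch query throughput at a fixed concurrency level; returns per-op latencies in seconds."""
    def one_query(mh):
        t0 = time.perf_counter()
        lsh.query(mh)
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one_query, query_minhashes))

def delete_update(lsh, key, new_minhash):
    """Delete + update: remove a key, then re-insert it with a fresh MinHash."""
    lsh.remove(key)
    lsh.insert(key, new_minhash)
```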

Recall is not a metric — same algorithm across backends means identical recall by construction. This is purely a storage-layer characterization.

Metrics (all client-side; see the measurement sketch after this list):

  • Throughput (ops/sec)
  • Latency p50 / p95 / p99 — mean hides the tail, which is where network backends differ most
  • Wall-clock index build time
  • Client process RSS peak (via psutil)
  • Backend roundtrip count where cheaply instrumentable — direct signal for whether buffering works
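
A minimal sketch of how the client-side numbers could be reduced from the per-op latencies collected above, using only numpy and psutil. The roundtrip counter is deliberately omitted here since it would be a backend-specific hook:

```python
import numpy as np
import psutil

def summarize(latencies_s, wall_time_s):
    """Throughput and tail latencies from per-op durations plus the workload's wall time."""
    lat = np.asarray(latencies_s)
    return {
        "throughput_ops_per_s": len(lat) / wall_time_s,
        "p50_ms": float(np.percentile(lat, 50) * 1000),
        "p95_ms": float(np.percentile(lat, 95) * 1000),
        "p99_ms": float(np.percentile(lat, 99) * 1000),
    }

def client_rss_bytes():
    """Current RSS of the benchmark process; the harness would poll this and keep the max."""
    return psutil.Process().memory_info().rss
```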

Server-side memory, disk I/O, and network bytes are v2 candidates. Each is backend-specific and would balloon the scope.

Datasets: Synthetic parametric as primary — seeded RNG, sweep N ∈ {1M, 10M} and num_perm ∈ {128, 256}. Reproducible, zero download friction, CI-friendly. Storage backends don't care about Jaccard distribution, so parametric sweeps answer operational questions better than fixed real datasets. One real-dataset run (Orkut, reusing benchmark/indexes/jaccard/utils.py:read_sets_from_file) as a validation pass to confirm synthetic trends hold.
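
A sketch of the synthetic generator. The token universe, fixed set size, and key naming are illustrative assumptions; only the seeded-RNG reproducibility requirement comes from the design above:

```python
import numpy as np
from datasketch import MinHash

def synthetic_sets(n_sets, num_perm=128, universe=1_000_000, set_size=50, seed=42):
    """Yield reproducible (key, MinHash) pairs from a seeded RNG."""
    rng = np.random.default_rng(seed)
    for i in range(n_sets):
        tokens = rng.integers(0, universe, size=set_size)
        mh = MinHash(num_perm=num_perm)
        mh.update_batch([int(t).to_bytes(8, "little") for t in tokens])
        yield f"set-{i}", mh
```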

Provisioning: A benchmark/storage/docker-compose.yml with pinned Redis / Cassandra / MongoDB on non-default ports. Backend selection via env vars following the CI convention in .github/workflows/test-redis.yml et al. — RUN_REDIS=true RUN_CASSANDRA=true .... dict is always run as baseline. Missing backends are skipped with a warning, not an error.
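
A sketch of the backend-selection logic. RUN_REDIS and RUN_CASSANDRA follow the existing CI convention; the env var names for the async backends and the skip-with-warning behavior are this proposal's assumptions:

```python
import os
import warnings

# Candidate backends and the env var that opts each one in; dict always runs as baseline.
BACKENDS = {
    "dict": None,
    "redis": "RUN_REDIS",
    "cassandra": "RUN_CASSANDRA",
    "aioredis": "RUN_AIOREDIS",   # placeholder name
    "aiomongo": "RUN_MONGO",      # placeholder name
}

def selected_backends():
    chosen = []
    for name, env_var in BACKENDS.items():
        if env_var is None or os.environ.get(env_var, "").lower() == "true":
            chosen.append(name)
        else:
            warnings.warn(f"{name} backend skipped: set {env_var}=true to enable")
    return chosen
```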

Output: SQLite for results persistence (new schema — recall-oriented columns from the existing benchmark utils don't apply). Schema: backend, operation, concurrency, N, num_perm, run_id, throughput, p50_ms, p95_ms, p99_ms, rss_peak_bytes, roundtrips, wall_time, git_sha, timestamp. A plot_storage.py reads the SQLite and emits throughput-vs-concurrency curves, latency CDFs per backend, and a markdown summary table for PR review. Mirrors the init_results_db / save_results pattern at benchmark/indexes/jaccard/utils.py:137-188.
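
A sketch of the results table built from the columns listed above, using sqlite3 from the standard library; the table and file names are placeholders:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS storage_results (
    run_id TEXT,
    backend TEXT,
    operation TEXT,
    concurrency INTEGER,
    n INTEGER,
    num_perm INTEGER,
    throughput REAL,
    p50_ms REAL,
    p95_ms REAL,
    p99_ms REAL,
    rss_peak_bytes INTEGER,
    roundtrips INTEGER,
    wall_time REAL,
    git_sha TEXT,
    timestamp TEXT
);
"""

def init_results_db(path="storage_benchmark.db"):
    """Create the results database if it does not exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```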

Preliminary optimization hypotheses

Things the benchmark is looking for — not promises of PRs:

  • Redis buffer sizing sensitivity under bulk insert
  • Cassandra batch vs individual insert crossover point
  • Key encoding consistency costs (bytes vs str) across backends
  • AsyncMongoBuffer flush cadence
  • Async Redis pipeline depth tuning

Proposed layout

benchmark/storage/
  README.md               # quickstart, mirrors benchmark/indexes/jaccard/README.md
  docker-compose.yml
  storage_benchmark.py    # argparse entry point, runner-dict pattern from topk_benchmark.py
  runners/
    sync_lsh.py
    async_lsh.py
  utils.py                # SQLite schema + synthetic generator
  plot_storage.py

Rollout

To keep PRs reviewable:

  1. Harness + synthetic workloads + dict + redis as proof of pattern.
  2. Remaining sync backends + Orkut validation run + plotting.
  3. Async backends + final README.

No optimizations in any of these PRs.

Open questions

  1. Concurrency levels — 1 / 4 / 16 reasonable, or prefer different values?
  2. Orkut vs Flickr for the real-dataset run?
  3. Check benchmark result SQLite into the repo, or harness only?
  4. MongoDB is async-only today — should a sync Mongo backend influence the harness API now, or is that permanently out of scope?
