Summary
Following up on discussion #252, where I volunteered for "Better benchmark utility to measure performance of various storage backends, and propose optimization" and @ekzhu asked for a design proposal first.
This issue proposes a v1 benchmark harness that compares the storage backends available to MinHashLSH. Optimization proposals are explicitly deferred — each will be a separate follow-up issue backed by numbers from this harness.
Gap
datasketch/storage.py exposes three sync backends (dict, redis, cassandra) through the ordered_storage / unordered_storage factories (storage.py:58-65). datasketch/aio/storage.py adds two async backends (aiomongo, aioredis), recently promoted from experimental in v1.10.0 (commit 8489173). All are swappable via storage_config.
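For context, switching backends is just a storage_config change on the MinHashLSH constructor — the harness will exercise exactly this surface. A minimal example (the Redis host/port values are placeholders):

```python
from datasketch import MinHash, MinHashLSH

# Redis-backed index; swapping "type" (and the matching config key)
# selects a different backend with no other code changes.
lsh = MinHashLSH(
    threshold=0.8,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},
    },
)

m = MinHash(num_perm=128)
for token in ["a", "b", "c"]:
    m.update(token.encode("utf8"))
lsh.insert("doc-1", m)
```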
Today there is no benchmark that compares them head-to-head. The existing benchmark/indexes/jaccard/ suite compares algorithms (LSH vs LSHForest vs HNSW vs exact) on in-memory storage; it does not vary the backend. Users picking a backend for production MinHashLSH deployments have no data to ground the choice.
Scope
In: MinHashLSH across all five backends — dict, redis, cassandra, aiomongo, aioredis.
Out: LSHBloom, LSHEnsemble, algorithm-level benchmarks (already covered), mixed read/write workloads, server-side resource attribution.
Design
Workloads (four operations × three concurrency levels: 1, 4, 16):
- Bulk index build
- Single-op query latency
- Batch query throughput
- Delete + update
Recall is not a metric — same algorithm across backends means identical recall by construction. This is purely a storage-layer characterization.
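To make the workload definitions concrete, here is a rough sketch of the single-op query latency workload; the function and parameter names are illustrative, not a commitment to the final runner API:

```python
import time

def run_query_latency(lsh, queries):
    """Query an already-built MinHashLSH one MinHash at a time.

    `queries` is a list of MinHash objects. Returns per-op latencies in
    milliseconds, ready for percentile aggregation.
    """
    latencies_ms = []
    for minhash in queries:
        start = time.perf_counter()
        lsh.query(minhash)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```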
Metrics (all client-side):
- Throughput (ops/sec)
- Latency p50 / p95 / p99 — mean hides the tail, which is where network backends differ most
- Wall-clock index build time
- Client process RSS peak (via psutil)
- Backend roundtrip count where cheaply instrumentable — direct signal for whether buffering works
Server-side memory, disk I/O, and network bytes are v2 candidates. Each is backend-specific and balloons scope.
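A sketch of the client-side aggregation, assuming per-op latencies are collected as in the workload sketch above; numpy and psutil are the only dependencies, and the field names mirror the proposed SQLite schema:

```python
import os

import numpy as np
import psutil

def summarize(latencies_ms, wall_time_s):
    """Reduce raw per-op latencies to the metrics row stored per run."""
    arr = np.asarray(latencies_ms)
    return {
        "throughput": len(arr) / wall_time_s,
        "p50_ms": float(np.percentile(arr, 50)),
        "p95_ms": float(np.percentile(arr, 95)),
        "p99_ms": float(np.percentile(arr, 99)),
        # Approximated by sampling RSS at the end of the run; a background
        # sampler would capture the true peak more accurately.
        "rss_peak_bytes": psutil.Process(os.getpid()).memory_info().rss,
    }
```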
Datasets: Synthetic parametric as primary — seeded RNG, sweep N ∈ {1M, 10M} and num_perm ∈ {128, 256}. Reproducible, zero download friction, CI-friendly. Storage backends don't care about Jaccard distribution, so parametric sweeps answer operational questions better than fixed real datasets. One real-dataset run (Orkut, reusing benchmark/indexes/jaccard/utils.py:read_sets_from_file) as a validation pass to confirm synthetic trends hold.
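A minimal sketch of the synthetic generator, assuming sets of random integer tokens drawn from a seeded RNG; the vocabulary size and set-length bounds are placeholder choices:

```python
import numpy as np

def generate_sets(n_sets, vocab_size=1_000_000, min_len=10, max_len=100, seed=42):
    """Yield reproducible random token sets for index build and queries."""
    rng = np.random.default_rng(seed)
    for _ in range(n_sets):
        size = int(rng.integers(min_len, max_len + 1))
        tokens = rng.choice(vocab_size, size=size, replace=False)
        yield [str(t).encode("utf8") for t in tokens]
```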
Provisioning: A benchmark/storage/docker-compose.yml with pinned Redis / Cassandra / MongoDB on non-default ports. Backend selection via env vars following the CI convention in .github/workflows/test-redis.yml et al. — RUN_REDIS=true RUN_CASSANDRA=true .... dict is always run as baseline. Missing backends are skipped with a warning, not an error.
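Backend selection could be as simple as the following sketch; RUN_REDIS and RUN_CASSANDRA follow the existing CI convention, while the remaining variable names and the helper itself are illustrative:

```python
import os
import warnings

BACKENDS = {
    "dict": None,            # always run as baseline
    "redis": "RUN_REDIS",
    "cassandra": "RUN_CASSANDRA",
    "aioredis": "RUN_AIOREDIS",
    "aiomongo": "RUN_MONGO",
}

def selected_backends():
    """Return the backends enabled for this run; skip the rest with a warning."""
    enabled = []
    for name, env_var in BACKENDS.items():
        if env_var is None or os.environ.get(env_var, "").lower() == "true":
            enabled.append(name)
        else:
            warnings.warn(f"{name} backend skipped: set {env_var}=true to enable")
    return enabled
```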
Output: SQLite for results persistence (new schema — recall-oriented columns from the existing benchmark utils don't apply). Schema: backend, operation, concurrency, N, num_perm, run_id, throughput, p50_ms, p95_ms, p99_ms, rss_peak_bytes, roundtrips, wall_time, git_sha, timestamp. A plot_storage.py reads the SQLite and emits throughput-vs-concurrency curves, latency CDFs per backend, and a markdown summary table for PR review. Mirrors the init_results_db / save_results pattern at benchmark/indexes/jaccard/utils.py:137-188.
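A sketch of the results-table initialization, following the init_results_db pattern referenced above; the table name and column types are assumptions, the columns mirror the schema listed in this section:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS storage_runs (
    run_id TEXT, backend TEXT, operation TEXT, concurrency INTEGER,
    N INTEGER, num_perm INTEGER,
    throughput REAL, p50_ms REAL, p95_ms REAL, p99_ms REAL,
    rss_peak_bytes INTEGER, roundtrips INTEGER, wall_time REAL,
    git_sha TEXT, timestamp TEXT
)
"""

def init_results_db(path="storage_results.sqlite"):
    """Create the results table if needed and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```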
Preliminary optimization hypotheses
Things the benchmark is looking for — not promises of PRs:
- Redis buffer sizing sensitivity under bulk insert
- Cassandra batch vs individual insert crossover point
- Key encoding consistency costs (bytes vs str) across backends
- AsyncMongoBuffer flush cadence
- Async Redis pipeline depth tuning
Proposed layout
benchmark/storage/
  README.md              # quickstart, mirrors benchmark/indexes/jaccard/README.md
  docker-compose.yml
  storage_benchmark.py   # argparse entry point, runner-dict pattern from topk_benchmark.py
  runners/
    sync_lsh.py
    async_lsh.py
  utils.py               # SQLite schema + synthetic generator
  plot_storage.py
Rollout
To keep PRs reviewable:
- Harness + synthetic workloads + dict + redis as proof of pattern.
- Remaining sync backends + Orkut validation run + plotting.
- Async backends + final README.
No optimizations in any of these PRs.
Open questions
- Concurrency levels — 1 / 4 / 16 reasonable, or prefer different values?
- Orkut vs Flickr for the real-dataset run?
- Check benchmark result SQLite into the repo, or harness only?
- MongoDB is async-only today — should a sync Mongo backend influence the harness API now, or is that permanently out of scope?