Proposal: benchmark utility for MinHashLSH storage backends #319

@rohanpoudel2

Summary

This follows up on discussion #252, where I volunteered for "Better benchmark utility to measure performance of various storage backends, and propose optimization" and @ekzhu asked for a design proposal first.

This issue proposes a v1 benchmark harness that compares the storage backends available to MinHashLSH. Optimization proposals are explicitly deferred — each will be a separate follow-up issue backed by numbers from this harness.

Gap

datasketch/storage.py exposes three sync backends (dict, redis, cassandra) through the ordered_storage / unordered_storage factories (storage.py:58-65). datasketch/aio/storage.py adds two async backends (aiomongo, aioredis), recently promoted from experimental in v1.10.0 (commit 8489173). All are swappable via storage_config.
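
For context, a minimal sketch of how a backend is swapped in today via `storage_config` (the Redis connection parameters below are placeholders, not recommended values):

```python
from datasketch import MinHash, MinHashLSH

# Default: in-memory dict storage.
lsh_dict = MinHashLSH(threshold=0.5, num_perm=128)

# Same index backed by Redis, selected via storage_config.
lsh_redis = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        "type": "redis",
        "redis": {"host": "localhost", "port": 6379},  # placeholder connection info
    },
)

m = MinHash(num_perm=128)
for token in ["a", "b", "c"]:
    m.update(token.encode("utf8"))
lsh_redis.insert("key1", m)
print(lsh_redis.query(m))
```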

Today there is no benchmark that compares them head-to-head. The existing benchmark/indexes/jaccard/ suite compares algorithms (LSH vs LSHForest vs HNSW vs exact) on in-memory storage; it does not vary the backend. Users picking a backend for production MinHashLSH deployments have no data to ground the choice.

Scope

In: MinHashLSH across all five backends — dict, redis, cassandra, aiomongo, aioredis.

Out: LSHBloom, LSHEnsemble, algorithm-level benchmarks (already covered), mixed read/write workloads, server-side resource attribution.

Design

Workloads (four operations × three concurrency levels of 1 / 4 / 16; see the runner sketch after this list):

  • Bulk index build
  • Single-op query latency
  • Batch query throughput
  • Delete + update
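
A rough sketch of how a runner could drive these operations against a configured MinHashLSH. The helper names, the buffer size, and the ThreadPoolExecutor-based concurrency model are assumptions of this proposal, not existing code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bulk_build(lsh, keyed_minhashes, buffer_size=1000):
    """Bulk index build through the buffered insertion session; returns wall-clock seconds."""
    start = time.perf_counter()
    with lsh.insertion_session(buffer_size=buffer_size) as session:
        for key, mh in keyed_minhashes:
            session.insert(key, mh)
    return time.perf_counter() - start

def timed_queries(lsh, query_minhashes, concurrency=4):
    """Batch query throughput at a fixed concurrency level; returns per-op latencies in seconds."""
    def one_query(mh):
        t0 = time.perf_counter()
        lsh.query(mh)
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one_query, query_minhashes))

def delete_update(lsh, key, new_minhash):
    """Delete + update: remove a key, then re-insert it with a fresh MinHash."""
    lsh.remove(key)
    lsh.insert(key, new_minhash)
```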

Recall is not a metric — same algorithm across backends means identical recall by construction. This is purely a storage-layer characterization.

Metrics (all client-side; see the measurement sketch after this list):

  • Throughput (ops/sec)
  • Latency p50 / p95 / p99 — mean hides the tail, which is where network backends differ most
  • Wall-clock index build time
  • Client process RSS peak (via psutil)
  • Backend roundtrip count where cheaply instrumentable — direct signal for whether buffering works
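
A minimal sketch of how the client-side numbers could be reduced from the per-op latencies collected above, using only numpy and psutil. The roundtrip counter is deliberately omitted here since it would be a backend-specific hook:

```python
import numpy as np
import psutil

def summarize(latencies_s, wall_time_s):
    """Throughput and tail latencies from per-op durations plus the workload's wall time."""
    lat = np.asarray(latencies_s)
    return {
        "throughput_ops_per_s": len(lat) / wall_time_s,
        "p50_ms": float(np.percentile(lat, 50) * 1000),
        "p95_ms": float(np.percentile(lat, 95) * 1000),
        "p99_ms": float(np.percentile(lat, 99) * 1000),
    }

def client_rss_bytes():
    """Current RSS of the benchmark process; the harness would poll this and keep the max."""
    return psutil.Process().memory_info().rss
```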

Server-side memory, disk I/O, and network bytes are v2 candidates. Each is backend-specific and would balloon the scope.

Datasets: Synthetic parametric as primary — seeded RNG, sweep N ∈ {1M, 10M} and num_perm ∈ {128, 256}. Reproducible, zero download friction, CI-friendly. Storage backends don't care about Jaccard distribution, so parametric sweeps answer operational questions better than fixed real datasets. One real-dataset run (Orkut, reusing benchmark/indexes/jaccard/utils.py:read_sets_from_file) as a validation pass to confirm synthetic trends hold.
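
A sketch of the synthetic generator. The token universe, fixed set size, and key naming are illustrative assumptions; only the seeded-RNG reproducibility requirement comes from the design above:

```python
import numpy as np
from datasketch import MinHash

def synthetic_sets(n_sets, num_perm=128, universe=1_000_000, set_size=50, seed=42):
    """Yield reproducible (key, MinHash) pairs from a seeded RNG."""
    rng = np.random.default_rng(seed)
    for i in range(n_sets):
        tokens = rng.integers(0, universe, size=set_size)
        mh = MinHash(num_perm=num_perm)
        mh.update_batch([int(t).to_bytes(8, "little") for t in tokens])
        yield f"set-{i}", mh
```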

Provisioning: A benchmark/storage/docker-compose.yml with pinned Redis / Cassandra / MongoDB on non-default ports. Backend selection via env vars following the CI convention in .github/workflows/test-redis.yml et al. — RUN_REDIS=true RUN_CASSANDRA=true .... dict is always run as baseline. Missing backends are skipped with a warning, not an error.
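
A sketch of the backend-selection logic. RUN_REDIS and RUN_CASSANDRA follow the existing CI convention; the env var names for the async backends and the skip-with-warning behavior are this proposal's assumptions:

```python
import os
import warnings

# Candidate backends and the env var that opts each one in; dict always runs as baseline.
BACKENDS = {
    "dict": None,
    "redis": "RUN_REDIS",
    "cassandra": "RUN_CASSANDRA",
    "aioredis": "RUN_AIOREDIS",   # placeholder name
    "aiomongo": "RUN_MONGO",      # placeholder name
}

def selected_backends():
    chosen = []
    for name, env_var in BACKENDS.items():
        if env_var is None or os.environ.get(env_var, "").lower() == "true":
            chosen.append(name)
        else:
            warnings.warn(f"{name} backend skipped: set {env_var}=true to enable")
    return chosen
```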

Output: SQLite for results persistence (new schema — recall-oriented columns from the existing benchmark utils don't apply). Schema: backend, operation, concurrency, N, num_perm, run_id, throughput, p50_ms, p95_ms, p99_ms, rss_peak_bytes, roundtrips, wall_time, git_sha, timestamp. A plot_storage.py reads the SQLite and emits throughput-vs-concurrency curves, latency CDFs per backend, and a markdown summary table for PR review. Mirrors the init_results_db / save_results pattern at benchmark/indexes/jaccard/utils.py:137-188.
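
A sketch of the results table built from the columns listed above, using sqlite3 from the standard library; the table and file names are placeholders:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS storage_results (
    run_id TEXT,
    backend TEXT,
    operation TEXT,
    concurrency INTEGER,
    n INTEGER,
    num_perm INTEGER,
    throughput REAL,
    p50_ms REAL,
    p95_ms REAL,
    p99_ms REAL,
    rss_peak_bytes INTEGER,
    roundtrips INTEGER,
    wall_time REAL,
    git_sha TEXT,
    timestamp TEXT
);
"""

def init_results_db(path="storage_benchmark.db"):
    """Create the results database if it does not exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```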

Preliminary optimization hypotheses

Things the benchmark is looking for — not promises of PRs:

  • Redis buffer sizing sensitivity under bulk insert
  • Cassandra batch vs individual insert crossover point
  • Key encoding consistency costs (bytes vs str) across backends
  • AsyncMongoBuffer flush cadence
  • Async Redis pipeline depth tuning

Proposed layout

benchmark/storage/
  README.md               # quickstart, mirrors benchmark/indexes/jaccard/README.md
  docker-compose.yml
  storage_benchmark.py    # argparse entry point, runner-dict pattern from topk_benchmark.py
  runners/
    sync_lsh.py
    async_lsh.py
  utils.py                # SQLite schema + synthetic generator
  plot_storage.py

Rollout

To keep PRs reviewable:

  1. Harness + synthetic workloads + dict + redis as proof of pattern.
  2. Remaining sync backends + Orkut validation run + plotting.
  3. Async backends + final README.

No optimizations in any of these PRs.

Open questions

  1. Concurrency levels — 1 / 4 / 16 reasonable, or prefer different values?
  2. Orkut vs Flickr for the real-dataset run?
  3. Check benchmark result SQLite into the repo, or harness only?
  4. MongoDB is async-only today — should a sync Mongo backend influence the harness API now, or is that permanently out of scope?
