Lightweight, config-driven lexical + semantic search in one container. Inspired by “Building Knowledge Search Without Vector-DB Theater,” this stack keeps everything local: Polars ETL pulls JSONL drops, Tantivy (a Rust, Lucene-style engine) handles BM25, FastEmbed + HNSW deliver semantic recall, and a FastAPI endpoint fuses results and serves autocomplete and RAG-style answers.
- One config (`config/search_config.yaml`) captures schemas, analyzers, the embedding model, and fusion knobs; treat search like code, not tribal memory.
- Pull-based indexing: drop CSVs/JSONL in `data/raw`, run the Polars ETL to produce `data/processed/docs.jsonl`, rebuild indexes (see the ETL sketch after this list).
- Single container: local embeddings (ONNX via FastEmbed), BM25 via Tantivy, HNSW vectors, and RAG answer synthesis on the same API.
- Resilient storage: disk-backed indexes in `data/index`; swap folders or mount S3 for zero-downtime rollbacks.
- Deterministic hybrid: lexical + semantic in one request using reciprocal-rank fusion; autocomplete reuses the same index.
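To make the pull-based flow concrete, here is a minimal ETL sketch. The column names (`id`, `title`, `body`) are assumptions for illustration; the real raw files carry richer, source-specific schemas.

```python
# etl_sketch.py - illustrative only; column names are assumed, not the
# repo's actual schema.
from pathlib import Path

import polars as pl

RAW = Path("data/raw")
OUT = Path("data/processed/docs.jsonl")

frames = [
    pl.read_csv(path).select(
        pl.col("id").cast(pl.Utf8),
        pl.col("title"),
        pl.col("body"),
        pl.lit(path.stem).alias("source"),  # e.g. "confluence", "slack"
    )
    for path in sorted(RAW.glob("*.csv"))
]

OUT.parent.mkdir(parents=True, exist_ok=True)
pl.concat(frames).write_ndjson(OUT)  # JSONL: one normalized doc per line
```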
Repository layout:

- `config/search_config.yaml`: field types, analyzers, embedding model, fusion weights, autocomplete fields.
- `data/raw/`: sample Confluence, GitHub, and Slack CSV inputs.
- `data/processed/docs.jsonl`: Polars-normalized corpus; the indexer reads from here.
- `data/index/`: Tantivy schema + HNSW vectors + doc store.
- `src/app/`: FastAPI app, ETL, indexer, hybrid search, and RAG helper.
- `scripts/refresh.sh`: run ETL + indexing in one command during development.
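To illustrate the config-driven schema idea, a sketch of how an indexer might turn YAML field entries into a Tantivy schema. The YAML keys (`fields`, `name`, `type`, `stored`) are assumptions about the layout, not the actual format of `search_config.yaml`.

```python
# schema_sketch.py - config-driven Tantivy schema; the config keys are
# assumed for illustration.
from pathlib import Path

import tantivy
import yaml

cfg = yaml.safe_load(Path("config/search_config.yaml").read_text())

builder = tantivy.SchemaBuilder()
for field in cfg["fields"]:  # e.g. {name: title, type: text, stored: true}
    if field["type"] == "text":
        builder.add_text_field(field["name"], stored=field.get("stored", False))

index_dir = Path("data/index/lexical")
index_dir.mkdir(parents=True, exist_ok=True)
index = tantivy.Index(builder.build(), path=str(index_dir))
```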
Quickstart:

```bash
cd knowledge-search-hybrid
export PYTHONPATH=src
uv run python -m app.etl      # Polars ETL → data/processed/docs.jsonl
uv run python -m app.indexer  # Build lexical + vector indexes
uv run python -m app.main     # Start API at http://localhost:8000
```

Endpoints:
- `GET /health`: status + index readiness.
- `GET /search?q=payment`: hybrid BM25 + semantic with rank fusion.
- `GET /autocomplete?prefix=pay`: search-as-you-type driven by the same index.
- `GET /rag?q=...`: pulls top docs and streams them through a local LLM (Ollama if set) or a deterministic extractive summary.
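A quick smoke test from Python, assuming the server is up on port 8000; the JSON response shape shown here is a guess, so adjust to what the API actually returns.

```python
# query_sketch.py - exercising the /search endpoint; the response shape
# is an assumption.
import requests

resp = requests.get(
    "http://localhost:8000/search",
    params={"q": "payment"},
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json():  # assumed: a JSON list of scored documents
    print(hit)
```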
With Docker:

```bash
docker compose up --build
# Inside the container, seed data and indexes:
docker compose exec knowledge-search uv run python -m app.etl
docker compose exec knowledge-search uv run python -m app.indexer
```

Port 8000 is exposed; mount `./data` to keep indexes between restarts. Set `OLLAMA_MODEL` if you want local LLM answers.
Why this works:

- Hybrid in one node: Tantivy BM25 + HNSW vectors live together; handles ~100k docs comfortably on a laptop.
- Local embeddings: FastEmbed runs ONNX models locally, so there are no per-token costs or external latency (see the embedding sketch after this list).
- Strong schema: field types, stored flags, and analyzers live in config; the indexer enforces them on every rebuild.
- Boring ingestion: the Polars ETL reads CSV drops and writes JSONL; the indexer consumes files, not POST storms.
- Rank fusion: reciprocal-rank fusion with adjustable lexical/semantic weights gives deterministic ordering (see the fusion sketch after this list).
- Autocomplete & RAG: autocomplete reuses the same text fields; RAG happens on the search API, no extra service.
- Stateless node: all state lives under `data/index`; point another container at it or swap directories for rollbacks.
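For context, roughly what the embedding + HNSW path looks like, assuming FastEmbed's `BAAI/bge-small-en-v1.5` model (384-dimensional) and the `hnswlib` library; the repo's actual model and parameters may differ.

```python
# embed_sketch.py - illustrative; the model name and hnswlib parameters
# are assumptions, not necessarily what app.indexer uses.
import hnswlib
import numpy as np
from fastembed import TextEmbedding

docs = ["How payment retries work", "Slack thread on refunds"]

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # ONNX, runs locally
vectors = np.array(list(model.embed(docs)))  # shape: (n_docs, 384)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(docs), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(docs)))

query = np.array(list(model.embed(["payment failures"])))
labels, distances = index.knn_query(query, k=2)  # doc ids + cosine distances
print(labels, distances)
```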
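And a minimal sketch of reciprocal-rank fusion itself; the function name, the default `k=60`, and the weight split are illustrative stand-ins for the repo's `config.hybrid` knobs.

```python
# rrf_sketch.py - reciprocal-rank fusion over two id rankings.
def rrf_fuse(lexical_ids, semantic_ids, k=60, w_lex=0.5, w_sem=0.5):
    """score(d) = w_lex / (k + rank_lex(d)) + w_sem / (k + rank_sem(d))."""
    scores = {}
    for weight, ranking in ((w_lex, lexical_ids), (w_sem, semantic_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Tie-break on doc id so the ordering is fully deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(rrf_fuse(["a", "b", "c"], ["b", "d"]))  # ['b', 'a', 'd', 'c']
```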
Day-to-day operation:

- Add new CSV/JSONL files under `data/raw/`.
- Update `config/search_config.yaml` if you introduce new fields or tags.
- Run `uv run python -m app.etl && uv run python -m app.indexer` to refresh.
- Keep `./data/index` on disk or S3; swap the folder to roll forward/back.
Notes:

- The FastEmbed model downloads on first run; pre-warm it or bake it into your image for air-gapped environments (see the snippet below).
- Use the `config.hybrid` knobs to tune the BM25/semantic balance, or increase `k` for broader recall.
- For production, mount `./data` to durable storage and add a cron/CI job that runs `scripts/refresh.sh`.
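One way to pre-warm at image build time; the model name is an assumption, so match whatever your config specifies.

```python
# prewarm.py - run during the image build so the ONNX weights are cached;
# the model name here is assumed.
from fastembed import TextEmbedding

TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # constructor triggers the download
```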