Litmuse is a local-first research workflow for checking whether a new research idea overlaps with papers in a local PDF corpus.
It has two phases:
- Offline pipeline: build a PDF corpus, ingest papers into PostgreSQL, extract idea cards, and build embeddings.
- Online novelty checking: run a small web server, submit one research idea, and get a ranked list of related papers from the already-built database.
The web interface does not build the database. Run the pipeline first, then deploy the web server against that database.
Litmuse separates durable paper storage from searchable research metadata.
- A `.litmusecorpus` file is a portable SQLite archive that stores PDF bytes and basic corpus metadata.
- PostgreSQL stores parsed paper records, extracted idea cards, embedding vectors, and novelty-check reports.
- Idea cards summarize each paper into fields such as `core_idea`, `method`, and `contribution`.
- Embeddings are built for those idea-card fields and stored with their provider, model, dimensions, source field, and content hash.
- At query time, Litmuse embeds the user’s new idea with the configured embedding provider and searches matching stored embeddings with pgvector.
- Retrieved field-level hits are aggregated into paper-level candidates.
- The novelty service returns a conservative report. It does not claim that an idea is unpublished; it only reports overlap found in the configured local corpus.
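The aggregation step in the list above can be pictured with a minimal sketch. The `FieldHit` type, the `aggregate_hits` function, and the choice to score a paper by its best field similarity are all assumptions made for illustration; they are not taken from Litmuse's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldHit:
    """One retrieved idea-card field (hypothetical shape)."""

    paper_id: int
    source_field: str  # e.g. "core_idea", "method", "contribution"
    similarity: float  # higher means closer to the submitted idea


def aggregate_hits(hits: list[FieldHit]) -> list[tuple[int, float, list[str]]]:
    """Group field-level hits by paper and rank papers by their best hit.

    Scoring by the maximum field similarity is an assumption of this sketch;
    a real scorer could weight fields differently.
    """
    by_paper: dict[int, list[FieldHit]] = defaultdict(list)
    for hit in hits:
        by_paper[hit.paper_id].append(hit)

    candidates = []
    for paper_id, paper_hits in by_paper.items():
        best = max(h.similarity for h in paper_hits)
        fields = sorted({h.source_field for h in paper_hits})
        candidates.append((paper_id, best, fields))

    # Highest score first; ties are broken by paper_id for stable output.
    return sorted(candidates, key=lambda c: (-c[1], c[0]))


hits = [
    FieldHit(1, "core_idea", 0.82),
    FieldHit(2, "method", 0.91),
    FieldHit(1, "method", 0.40),
]
ranked = aggregate_hits(hits)
```

Here paper 2 ranks first because its single hit is the strongest, while paper 1 keeps both matched fields so a report could show where the overlap was found.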
The embedding provider and model used by the web server must match embeddings that already exist in the database. If the pipeline built fake/deterministic-test embeddings, start the web server with the same defaults. If the pipeline built FastEmbed or OpenAI embeddings, set the same provider and model before starting the server.
- Python 3.11 or newer
- PostgreSQL database with pgvector available
- A directory of PDF files
- Optional: a local Codex or Claude Code SDK/backend for `local-llm` idea extraction
- Optional: `OPENAI_API_KEY` if using OpenAI embeddings
Install for local development or deployment:
python -m venv .venv
. .venv/bin/activate
python -m pip install -e ".[dev]"
For a production-like deployment, installing without dev tools is enough:
python -m venv .venv
. .venv/bin/activate
python -m pip install -e .
Environment variables:
| Variable | Default | Purpose |
|---|---|---|
| `LITMUSE_DATABASE_URL` | `postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse` | PostgreSQL app database. Must have pgvector. |
| `LITMUSE_CORPUS_PATH` | unset | Reserved for future default corpus path support. |
| `LITMUSE_IDEA_EXTRACTOR_PROVIDER` | `fake` | Default idea extraction provider. Use `fake` or `local-llm`. |
| `LITMUSE_EMBEDDING_PROVIDER` | `fake` | Default embedding provider for CLI and web checks. Use `fake`, `fastembed`, or `openai`. |
| `LITMUSE_EMBEDDING_MODEL` | `deterministic-test` | Default embedding model. Must match the selected provider. |
| `OPENAI_API_KEY` | unset | Required by the OpenAI Python client for OpenAI embeddings. |
Recommended shell setup:
export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
Provider and model rules:
- `fake` only supports `deterministic-test`. It is deterministic and useful for smoke tests.
- `fastembed` supports `BAAI/bge-small-en-v1.5` and `BAAI/bge-base-en-v1.5`.
- `openai` supports `text-embedding-3-small` and `text-embedding-3-large`.
- `openai` defaults to `text-embedding-3-small` if no model is passed and the configured model is still `deterministic-test`.
- `fastembed` defaults to `BAAI/bge-small-en-v1.5` if no model is passed and the configured model is still `deterministic-test`.
Litmuse needs a running PostgreSQL server with pgvector. SQLite is only used for tests and for the portable .litmusecorpus archive.
If PostgreSQL is installed as a system service:
sudo systemctl start postgresql
sudo systemctl stop postgresql
sudo systemctl status postgresql
If you use a local PostgreSQL data directory instead:
pg_ctl -D /path/to/pgdata -l /path/to/postgres.log -o "-p 5432" start
pg_ctl -D /path/to/pgdata status
pg_ctl -D /path/to/pgdata stop
If pgvector is not installed in PostgreSQL’s standard extension directory, start PostgreSQL with extension paths:
pg_ctl -D /path/to/pgdata \
-l /path/to/postgres.log \
-o "-p 5432 -c extension_control_path=/path/to/pgvector-share: -c dynamic_library_path=/path/to/pgvector-lib:" \
start
Verify the database is reachable:
pg_isready -h 127.0.0.1 -p 5432
Initialize or upgrade the Litmuse schema:
litmuse init-db
`litmuse init-db` runs packaged Alembic migrations and creates the pgvector extension.
Run these steps whenever you want to build or refresh the searchable paper database.
Initialize a corpus archive:
litmuse-corpus init tmp/papers.litmusecorpus --name "My Research Papers"
Add PDFs from a file or directory:
litmuse-corpus add tmp/papers.litmusecorpus /path/to/papers
Verify and inspect the archive:
litmuse-corpus verify tmp/papers.litmusecorpus
litmuse-corpus stats tmp/papers.litmusecorpus
litmuse-corpus list tmp/papers.litmusecorpus
The corpus archive stores valid PDFs as BLOBs. Duplicate PDFs are skipped by SHA-256. Corrupted PDFs are rejected before storage.
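To make the storage rules concrete, here is a minimal sketch of a SQLite archive that keys PDFs by SHA-256 and skips duplicates. The table layout, the function names, and the `%PDF-` header check are assumptions for illustration; the real archive schema and its corruption check are not described here and are likely more thorough.

```python
import hashlib
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a toy archive with one table of PDF blobs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdfs ("
        " sha256 TEXT PRIMARY KEY,"
        " filename TEXT NOT NULL,"
        " data BLOB NOT NULL)"
    )
    return conn


def add_pdf(conn: sqlite3.Connection, filename: str, data: bytes) -> str:
    """Store one PDF and return "added", "duplicate", or "rejected"."""
    # Crude validity check (assumption): only the file header is inspected.
    if not data.startswith(b"%PDF-"):
        return "rejected"

    digest = hashlib.sha256(data).hexdigest()
    # The primary key on sha256 makes a second insert of the same bytes a no-op.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO pdfs (sha256, filename, data) VALUES (?, ?, ?)",
        (digest, filename, data),
    )
    conn.commit()
    return "added" if cursor.rowcount == 1 else "duplicate"


conn = open_archive()
results = [
    add_pdf(conn, "paper.pdf", b"%PDF-1.7 example bytes"),
    add_pdf(conn, "paper-copy.pdf", b"%PDF-1.7 example bytes"),
    add_pdf(conn, "broken.pdf", b"not a pdf"),
]
stored = conn.execute("SELECT COUNT(*) FROM pdfs").fetchone()[0]
```

Keying on the content hash rather than the filename means a renamed copy of the same paper is still detected as a duplicate.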
litmuse ingest tmp/papers.litmusecorpus
This registers each archived PDF in PostgreSQL and extracts text sections with the local PDF parser.
Use deterministic fake extraction for a repeatable smoke test:
litmuse extract-ideas --provider fake
Use the local LLM extractor when a supported backend is available:
litmuse extract-ideas --provider local-llm
Limit extraction for a quick run:
litmuse extract-ideas --provider local-llm --limit 14
The local LLM extractor uses a local SDK/backend such as Codex or Claude Code when detected. It validates JSON output before storing idea cards.
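The validation step might look roughly like the sketch below. The `parse_idea_card` function is hypothetical, and the required field set is assumed to be the three fields named earlier (`core_idea`, `method`, `contribution`); the real idea-card schema may contain more fields or different rules.

```python
import json

# Assumed minimal schema: the three idea-card fields named in this README.
REQUIRED_FIELDS = ("core_idea", "method", "contribution")


def parse_idea_card(raw: str) -> dict[str, str]:
    """Parse extractor output and return a cleaned idea card, or raise ValueError."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"extractor output is not valid JSON: {exc}") from exc

    if not isinstance(payload, dict):
        raise ValueError("extractor output must be a JSON object")

    card: dict[str, str] = {}
    for field in REQUIRED_FIELDS:
        value = payload.get(field)
        # Reject missing, non-string, and whitespace-only values.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
        card[field] = value.strip()
    return card


raw_output = json.dumps(
    {
        "core_idea": "Use an LLM to propose protocol-aware fuzzing inputs.",
        "method": " Prompt the model with message grammars. ",
        "contribution": "Higher coverage on stateful targets.",
    }
)
card = parse_idea_card(raw_output)
```

Rejecting malformed output before it reaches the database keeps a single bad model response from producing an idea card that later embeds and ranks as if it were real content.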
Build deterministic local embeddings:
litmuse embed --provider fake --model deterministic-test
Build FastEmbed embeddings:
litmuse embed --provider fastembed --model BAAI/bge-small-en-v1.5
Build OpenAI embeddings:
export OPENAI_API_KEY="..."
litmuse embed --provider openai --model text-embedding-3-small
Embeddings are keyed by provider, model, dimensions, source table, source field, and content hash. You can build embeddings with multiple providers or models without overwriting older rows.
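A small sketch shows why this composite key lets several providers coexist. The `EmbeddingKey` class, the `make_key` helper, the use of SHA-256 for the content hash, and the 8-dimensional size used for the fake provider are assumptions for illustration; an in-memory dict stands in for the embeddings table.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingKey:
    """Composite identity of one stored vector (hypothetical shape)."""

    provider: str
    model: str
    dimensions: int
    source_table: str
    source_field: str
    content_hash: str


def make_key(
    provider: str,
    model: str,
    dimensions: int,
    source_field: str,
    text: str,
    source_table: str = "idea_cards",
) -> EmbeddingKey:
    """Build the key for one field of one idea card."""
    # Hashing the text means edited content gets a new key (SHA-256 is assumed).
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return EmbeddingKey(provider, model, dimensions, source_table, source_field, content_hash)


store: dict[EmbeddingKey, list[float]] = {}
text = "LLM-guided protocol fuzzing"

store[make_key("fake", "deterministic-test", 8, "core_idea", text)] = [0.0] * 8
store[make_key("fastembed", "BAAI/bge-small-en-v1.5", 384, "core_idea", text)] = [0.0] * 384
# Rebuilding with the same provider, model, field, and text maps to the same key.
store[make_key("fake", "deterministic-test", 8, "core_idea", text)] = [1.0] * 8
```

After these three writes the store holds two entries: the FastEmbed vector sits beside the fake one, and only the repeated fake build touched an existing row.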
Use the same embedding provider/model that you built in step 4:
litmuse check "Can LLMs guide fuzzing by generating protocol-aware inputs?" \
  --provider fake \
  --model deterministic-test
The command embeds the idea, searches related idea-card fields, builds a conservative novelty report, stores the report in `novelty_checks`, and prints JSON.
The web server serves a focused novelty-checking page at /.
The page has one textarea and one button. It submits POST /novelty-checks to the FastAPI app. The server embeds the submitted idea, searches the existing database, stores the novelty check, and returns related papers for display.
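For scripted checks against a running server, a client could look like the sketch below. It assumes the endpoint accepts a JSON body with an `input_text` field and returns JSON; the exact request and response shapes are not specified here, so treat both as assumptions and confirm them against the running app. The function names are hypothetical.

```python
import json
import urllib.request


def build_novelty_request(base_url: str, idea: str) -> urllib.request.Request:
    """Build a POST /novelty-checks request (JSON body shape is assumed)."""
    body = json.dumps({"input_text": idea}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/novelty-checks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_novelty_check(base_url: str, idea: str, timeout: float = 30.0) -> dict:
    """Send the request to a running server and decode the JSON response."""
    request = build_novelty_request(base_url, idea)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not contact the server.
example_request = build_novelty_request(
    "http://127.0.0.1:8000/",
    "Can LLMs guide fuzzing by generating protocol-aware inputs?",
)
```

Calling `run_novelty_check("http://127.0.0.1:8000", "...")` requires the server to be running against a database that already contains embeddings for the configured provider and model.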
Use this when testing on the same machine:
. .venv/bin/activate
export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
litmuse serve --host 127.0.0.1 --port 8000
Open:
http://127.0.0.1:8000/
Stop it with Ctrl-C in the terminal running the server.
Use this when accessing the server from another machine or from a Windows host browser into a Linux VM:
. .venv/bin/activate
export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
litmuse serve --host 0.0.0.0 --port 8000
Find the VM or server IP:
ip -4 addr show scope global
Open:
http://<server-ip>:8000/
Ports below 1024 usually require elevated privileges on Linux. Either run through sudo or put a reverse proxy in front of Litmuse.
Direct sudo start:
sudo -E .venv/bin/litmuse serve --host 0.0.0.0 --port 80
Use `sudo -E` if the database and embedding environment variables are already set in your shell and need to be preserved. Without them, the server falls back to the defaults, such as a database at localhost:5432 and the fake embedding provider.
Open:
http://<server-ip>/
Stop it with Ctrl-C in the terminal running the server.
For a simple non-systemd deployment:
mkdir -p tmp
nohup .venv/bin/litmuse serve --host 0.0.0.0 --port 8000 > tmp/litmuse-web.log 2>&1 &
echo $! > tmp/litmuse-web.pid
Stop it:
kill "$(cat tmp/litmuse-web.pid)"
rm tmp/litmuse-web.pid
Check whether it is listening:
ss -ltnp 'sport = :8000'
curl -fsS http://127.0.0.1:8000/health
Create /etc/systemd/system/litmuse-web.service:
[Unit]
Description=Litmuse web server
After=network.target postgresql.service
[Service]
WorkingDirectory=/home/audit/litmuse
Environment=LITMUSE_DATABASE_URL=postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse
Environment=LITMUSE_EMBEDDING_PROVIDER=fake
Environment=LITMUSE_EMBEDDING_MODEL=deterministic-test
ExecStart=/home/audit/litmuse/.venv/bin/litmuse serve --host 0.0.0.0 --port 8000
Restart=on-failure
User=audit
[Install]
WantedBy=multi-user.target
Start, stop, and inspect logs:
sudo systemctl daemon-reload
sudo systemctl start litmuse-web
sudo systemctl status litmuse-web
sudo journalctl -u litmuse-web -f
sudo systemctl stop litmuse-web
If you need public HTTP on port 80, run Litmuse on port 8000 and use Nginx, Caddy, or another reverse proxy to forward port 80 to 127.0.0.1:8000.
If the page loads but submitting an idea fails with a PostgreSQL connection error:
- Confirm PostgreSQL is running:
  pg_isready -h 127.0.0.1 -p 5432
- Confirm `LITMUSE_DATABASE_URL` points at the database that contains the pipeline output:
  printenv LITMUSE_DATABASE_URL
- Confirm the database has papers, idea cards, and embeddings:
  PGPASSWORD=litmuse psql -h 127.0.0.1 -U litmuse -d litmuse -c \
    "select count(*) from papers; select count(*) from idea_cards; select provider, model_name, dimensions, count(*) from embeddings group by provider, model_name, dimensions;"
- Confirm the web server uses the same embedding provider and model as the stored embeddings:
  printenv LITMUSE_EMBEDDING_PROVIDER
  printenv LITMUSE_EMBEDDING_MODEL
If the server cannot bind to port 80, use port 8000 or run with appropriate privileges.
This repository includes seed evaluation cases for the local paper set under /home/audit/libfuzz/raw/papers.
litmuse-corpus init tmp/libfuzz.litmusecorpus --name "LibFuzz Papers"
litmuse-corpus add tmp/libfuzz.litmusecorpus /home/audit/libfuzz/raw/papers
litmuse-corpus verify tmp/libfuzz.litmusecorpus
litmuse init-db
litmuse ingest tmp/libfuzz.litmusecorpus
litmuse extract-ideas --provider local-llm --limit 14
litmuse embed --provider fake --model deterministic-test
litmuse check "Can LLMs guide fuzzing by generating protocol-aware inputs?" --provider fake --model deterministic-test
litmuse serve --host 0.0.0.0 --port 8000
Evaluation seeds are in:
eval/novelty_cases.jsonl
litmuse-corpus init <corpus_path> [--name NAME]
litmuse-corpus add <corpus_path> <pdf_or_directory>
litmuse-corpus list <corpus_path>
litmuse-corpus verify <corpus_path>
litmuse-corpus stats <corpus_path>
litmuse-corpus extract <corpus_path> <sha256> --out <output_pdf>
litmuse init-db
litmuse ingest <corpus_path>
litmuse extract-ideas [--limit N] [--provider fake|local-llm] [--model MODEL]
litmuse embed [--provider fake|fastembed|openai] [--model MODEL]
litmuse check <idea_text> [--provider fake|fastembed|openai] [--model MODEL]
litmuse serve [--host HOST] [--port PORT]
GET / Web novelty-checking page
GET /health Server health check
POST /novelty-checks Run a novelty check for input_text
GET /novelty-checks/{check_id} Placeholder route
POST /ingest Placeholder route
POST /extract-ideas Placeholder route
POST /embeddings/build Placeholder route
GET /papers/{paper_id} Placeholder route
GET /papers/{paper_id}/pdf Placeholder route
The web page and POST /novelty-checks are wired to the current novelty-checking service. Pipeline operations should still be run through the CLI.
The schema is intentionally small:
- `papers`: one row per archived PDF
- `paper_texts`: parsed text blocks and section extracts
- `idea_cards`: one structured idea card per paper
- `embeddings`: provider/model/dimension-specific vectors for idea-card fields
- `novelty_checks`: stored JSON reports for user idea checks
Migrations are packaged under litmuse.migrations and run through:
litmuse init-db
Migration `0002_unique_idea_card_per_paper` fails with a clear error if duplicate idea cards already exist, rather than deleting data.
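The fail-rather-than-delete behavior can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `paper_id` column, the index name, and both function names are assumptions; this is not the code of the actual migration.

```python
import sqlite3


def find_duplicate_idea_cards(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    """Return (paper_id, card_count) for every paper with more than one card."""
    return conn.execute(
        "SELECT paper_id, COUNT(*) FROM idea_cards"
        " GROUP BY paper_id HAVING COUNT(*) > 1 ORDER BY paper_id"
    ).fetchall()


def ensure_unique_idea_cards(conn: sqlite3.Connection) -> None:
    """Add the uniqueness constraint, refusing to touch data if duplicates exist."""
    duplicates = find_duplicate_idea_cards(conn)
    if duplicates:
        ids = ", ".join(str(paper_id) for paper_id, _ in duplicates)
        raise RuntimeError(
            f"duplicate idea cards for paper_id(s): {ids}; "
            "resolve them manually before migrating"
        )
    conn.execute("CREATE UNIQUE INDEX uq_idea_cards_paper_id ON idea_cards (paper_id)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idea_cards (id INTEGER PRIMARY KEY, paper_id INTEGER NOT NULL)")
conn.executemany("INSERT INTO idea_cards (paper_id) VALUES (?)", [(1,), (2,), (2,)])

duplicates = find_duplicate_idea_cards(conn)
try:
    ensure_unique_idea_cards(conn)
    message = ""
except RuntimeError as exc:
    message = str(exc)

# The failed check leaves every row in place.
remaining = conn.execute("SELECT COUNT(*) FROM idea_cards").fetchone()[0]
```

Checking first and raising keeps the choice of which duplicate to keep with the operator instead of the migration.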
Run tests and lint:
python -m pytest -v
python -m ruff check .
Check migration discovery:
alembic -c alembic.ini history
- No crawler yet; the corpus is built from local PDF files.
- The current novelty judge is conservative and deterministic.
- Retrieval database search expects PostgreSQL with pgvector.
- The web UI is intentionally focused on novelty checking only. It does not manage ingestion, extraction, embedding builds, or corpus status.
- Static Alembic SQL generation is not part of the supported workflow; use online migrations through `litmuse init-db`.