COCOP1l0t/litmuse

Litmuse

Litmuse is a local-first research workflow for checking whether a new research idea overlaps with papers in a local PDF corpus.

It has two phases:

  1. Offline pipeline: build a PDF corpus, ingest papers into PostgreSQL, extract idea cards, and build embeddings.
  2. Online novelty checking: run a small web server, submit one research idea, and get a ranked list of related papers from the already-built database.

The web interface does not build the database. Run the pipeline first, then deploy the web server against that database.

How It Works

Litmuse separates durable paper storage from searchable research metadata.

  • A .litmusecorpus file is a portable SQLite archive that stores PDF bytes and basic corpus metadata.
  • PostgreSQL stores parsed paper records, extracted idea cards, embedding vectors, and novelty-check reports.
  • Idea cards summarize each paper into fields such as core_idea, method, and contribution.
  • Embeddings are built for those idea-card fields and stored with their provider, model, dimensions, source field, and content hash.
  • At query time, Litmuse embeds the user’s new idea with the configured embedding provider and searches matching stored embeddings with pgvector.
  • Retrieved field-level hits are aggregated into paper-level candidates.
  • The novelty service returns a conservative report. It does not claim that an idea is unpublished; it only reports overlap found in the configured local corpus.
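The aggregation step can be illustrated with a minimal sketch. It assumes each field-level hit carries a paper identifier, a field name, and a similarity score, and that a paper's score is the maximum over its matched fields. The FieldHit class, the aggregate_hits function, and the max-over-fields rule are hypothetical choices for illustration, not Litmuse's actual scoring.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldHit:
    paper_id: int      # hypothetical identifier of the paper the field belongs to
    field: str         # e.g. "core_idea", "method", "contribution"
    similarity: float  # higher means closer to the submitted idea


def aggregate_hits(hits: list[FieldHit]) -> list[dict]:
    """Collapse field-level hits into one candidate per paper, best match first."""
    by_paper: dict[int, list[FieldHit]] = defaultdict(list)
    for hit in hits:
        by_paper[hit.paper_id].append(hit)

    candidates = []
    for paper_id, paper_hits in by_paper.items():
        best = max(paper_hits, key=lambda h: h.similarity)
        candidates.append({
            "paper_id": paper_id,
            "score": best.similarity,  # assumption: paper score = best field score
            "best_field": best.field,
            "matched_fields": sorted({h.field for h in paper_hits}),
        })
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


hits = [
    FieldHit(1, "method", 0.62),
    FieldHit(2, "core_idea", 0.91),
    FieldHit(1, "core_idea", 0.80),
]
ranked = aggregate_hits(hits)
print([c["paper_id"] for c in ranked])  # [2, 1]
```

Taking the maximum keeps a paper with one strongly overlapping field near the top of the list. A sum or weighted mean over fields would instead favor papers that overlap broadly.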

The embedding provider and model used by the web server must match embeddings that already exist in the database. If the pipeline built fake/deterministic-test embeddings, start the web server with the same defaults. If the pipeline built FastEmbed or OpenAI embeddings, set the same provider and model before starting the server.

Requirements

  • Python 3.11 or newer
  • PostgreSQL database with pgvector available
  • A directory of PDF files
  • Optional: a local Codex or Claude Code SDK/backend for local-llm idea extraction
  • Optional: OPENAI_API_KEY if using OpenAI embeddings

Install for local development (includes the dev extras):

python -m venv .venv
. .venv/bin/activate
python -m pip install -e ".[dev]"

For a production-like deployment, installing without dev tools is enough:

python -m venv .venv
. .venv/bin/activate
python -m pip install -e .

Configuration

Environment variables:

Variable                         Default                                                      Purpose
LITMUSE_DATABASE_URL             postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse  PostgreSQL app database. Must have pgvector.
LITMUSE_CORPUS_PATH              unset                                                        Reserved for future default corpus path support.
LITMUSE_IDEA_EXTRACTOR_PROVIDER  fake                                                         Default idea extraction provider. Use fake or local-llm.
LITMUSE_EMBEDDING_PROVIDER       fake                                                         Default embedding provider for CLI and web checks. Use fake, fastembed, or openai.
LITMUSE_EMBEDDING_MODEL          deterministic-test                                           Default embedding model. Must match the selected provider.
OPENAI_API_KEY                   unset                                                        Required by the OpenAI Python client for OpenAI embeddings.

Recommended shell setup:

export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
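To see which values a process would end up with when some variables are unset, the documented defaults can be mirrored in a few lines of Python. This is a sketch for inspection only: the effective_settings helper is hypothetical and is not how Litmuse itself loads configuration.

```python
import os

# Defaults copied from the environment variable list above.
DEFAULTS = {
    "LITMUSE_DATABASE_URL": "postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse",
    "LITMUSE_IDEA_EXTRACTOR_PROVIDER": "fake",
    "LITMUSE_EMBEDDING_PROVIDER": "fake",
    "LITMUSE_EMBEDDING_MODEL": "deterministic-test",
}


def effective_settings(env=None) -> dict:
    """Return each documented variable with its default applied when unset."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in effective_settings().items():
        print(f"{name}={value}")
```

Running it in the same shell that will start the server shows whether the exports were picked up.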

Provider and model rules:

  • fake only supports deterministic-test. It is deterministic and useful for smoke tests.
  • fastembed supports BAAI/bge-small-en-v1.5 and BAAI/bge-base-en-v1.5.
  • openai supports text-embedding-3-small and text-embedding-3-large.
  • openai defaults to text-embedding-3-small if no model is passed and the configured model is still deterministic-test.
  • fastembed defaults to BAAI/bge-small-en-v1.5 if no model is passed and the configured model is still deterministic-test.
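The rules above can be written down as a small resolution function. This is a sketch of the documented behavior, not the project's code: the resolve_model name and its signature are hypothetical, and raising ValueError on an unsupported pairing is an assumption about how a mismatch would be reported.

```python
# Supported models per provider; the first entry doubles as the provider default.
SUPPORTED_MODELS = {
    "fake": ("deterministic-test",),
    "fastembed": ("BAAI/bge-small-en-v1.5", "BAAI/bge-base-en-v1.5"),
    "openai": ("text-embedding-3-small", "text-embedding-3-large"),
}
PLACEHOLDER_MODEL = "deterministic-test"  # default value of LITMUSE_EMBEDDING_MODEL


def resolve_model(provider, cli_model=None, configured_model=PLACEHOLDER_MODEL):
    """Pick the embedding model for a provider following the documented rules."""
    if provider not in SUPPORTED_MODELS:
        raise ValueError(f"unknown embedding provider: {provider!r}")
    supported = SUPPORTED_MODELS[provider]

    if cli_model is not None:
        model = cli_model            # an explicit --model always wins
    elif configured_model == PLACEHOLDER_MODEL:
        model = supported[0]         # config untouched: use the provider default
    else:
        model = configured_model     # config was changed on purpose

    if model not in supported:
        # Assumption: an unsupported pairing is treated as an error.
        raise ValueError(f"{provider!r} does not support model {model!r}")
    return model


print(resolve_model("openai"))     # text-embedding-3-small
print(resolve_model("fastembed"))  # BAAI/bge-small-en-v1.5
```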

PostgreSQL

Litmuse needs a running PostgreSQL server with pgvector. SQLite is only used for tests and for the portable .litmusecorpus archive.

If PostgreSQL is installed as a system service:

sudo systemctl start postgresql
sudo systemctl stop postgresql
sudo systemctl status postgresql

If you use a local PostgreSQL data directory instead:

pg_ctl -D /path/to/pgdata -l /path/to/postgres.log -o "-p 5432" start
pg_ctl -D /path/to/pgdata status
pg_ctl -D /path/to/pgdata stop

If pgvector is not installed in PostgreSQL’s standard extension directory, start PostgreSQL with explicit extension paths. The extension_control_path setting requires PostgreSQL 18 or newer:

pg_ctl -D /path/to/pgdata \
  -l /path/to/postgres.log \
  -o "-p 5432 -c extension_control_path=/path/to/pgvector-share: -c dynamic_library_path=/path/to/pgvector-lib:" \
  start

Verify the database is reachable:

pg_isready -h 127.0.0.1 -p 5432

Initialize or upgrade the Litmuse schema:

litmuse init-db

litmuse init-db runs packaged Alembic migrations and creates the pgvector extension.

Offline Pipeline

Run these steps whenever you want to build or refresh the searchable paper database.

1. Build A Portable PDF Corpus

Initialize a corpus archive:

litmuse-corpus init tmp/papers.litmusecorpus --name "My Research Papers"

Add PDFs from a file or directory:

litmuse-corpus add tmp/papers.litmusecorpus /path/to/papers

Verify and inspect the archive:

litmuse-corpus verify tmp/papers.litmusecorpus
litmuse-corpus stats tmp/papers.litmusecorpus
litmuse-corpus list tmp/papers.litmusecorpus

The corpus archive stores valid PDFs as BLOBs. Duplicate PDFs are skipped by SHA-256. Corrupted PDFs are rejected before storage.
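The storage behavior described above (BLOB storage, SHA-256 deduplication, rejection of bad files) can be sketched with the standard library. The pdfs table, its columns, and the add_pdf helper are hypothetical and do not reflect the real .litmusecorpus schema. The header check is a crude stand-in for whatever validation litmuse-corpus actually performs.

```python
import hashlib
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a toy archive with one BLOB table keyed by SHA-256."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdfs (sha256 TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def add_pdf(conn, data: bytes) -> str:
    """Return 'added', 'duplicate', or 'rejected' for one candidate PDF."""
    # Assumption: a real validator would parse the file, not just the header.
    if not data.startswith(b"%PDF-"):
        return "rejected"
    digest = hashlib.sha256(data).hexdigest()
    # The primary key on sha256 makes a second insert of the same bytes a no-op.
    cur = conn.execute(
        "INSERT OR IGNORE INTO pdfs (sha256, data) VALUES (?, ?)", (digest, data)
    )
    conn.commit()
    return "added" if cur.rowcount == 1 else "duplicate"


conn = open_archive()
print(add_pdf(conn, b"%PDF-1.7 example bytes"))  # added
print(add_pdf(conn, b"%PDF-1.7 example bytes"))  # duplicate
print(add_pdf(conn, b"not a pdf"))               # rejected
```

Keying on the content hash rather than the file name means renamed copies of the same paper are still detected as duplicates.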

2. Ingest Papers Into PostgreSQL

litmuse ingest tmp/papers.litmusecorpus

This registers each archived PDF in PostgreSQL and extracts text sections with the local PDF parser.

3. Extract Idea Cards

Use deterministic fake extraction for a repeatable smoke test:

litmuse extract-ideas --provider fake

Use the local LLM extractor when a supported backend is available:

litmuse extract-ideas --provider local-llm

Limit extraction for a quick run:

litmuse extract-ideas --provider local-llm --limit 14

The local LLM extractor uses a local SDK/backend such as Codex or Claude Code when detected. It validates JSON output before storing idea cards.
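A validation step of this kind might look like the sketch below. It assumes an idea card is a JSON object with non-empty string values for core_idea, method, and contribution. The parse_idea_card function is hypothetical, and the real extractor may accept more fields or apply different rules.

```python
import json

# Assumption: these three fields are required; the real card may have more.
REQUIRED_FIELDS = ("core_idea", "method", "contribution")


def parse_idea_card(raw: str) -> dict:
    """Parse extractor output and reject anything that is not a complete card."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"extractor output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("extractor output must be a JSON object")

    card = {}
    for name in REQUIRED_FIELDS:
        value = data.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {name}")
        card[name] = value.strip()
    return card


raw = '{"core_idea": "LLM-guided fuzzing", "method": "prompted input synthesis", "contribution": "protocol-aware seeds"}'
print(parse_idea_card(raw)["method"])  # prompted input synthesis
```

Rejecting malformed output before the insert keeps partial or free-text model responses out of the idea_cards table.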

4. Build Embeddings

Build deterministic local embeddings:

litmuse embed --provider fake --model deterministic-test

Build FastEmbed embeddings:

litmuse embed --provider fastembed --model BAAI/bge-small-en-v1.5

Build OpenAI embeddings:

export OPENAI_API_KEY="..."
litmuse embed --provider openai --model text-embedding-3-small

Embeddings are keyed by provider, model, dimensions, source table, source field, and content hash. You can build embeddings with multiple providers or models without overwriting older rows.
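The sketch below shows why a composite key lets several providers coexist. It uses an in-memory dict in place of the embeddings table. The EmbeddingKey tuple, the make_key helper, the use of SHA-256 over the UTF-8 text as the content hash, and the 8-dimension fake vector are assumptions for illustration.

```python
import hashlib
from typing import NamedTuple


class EmbeddingKey(NamedTuple):
    provider: str
    model: str
    dimensions: int
    source_table: str
    source_field: str
    content_hash: str


def make_key(provider, model, dimensions, source_table, source_field, text):
    # Assumption: the content hash is SHA-256 over the UTF-8 encoded field text.
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return EmbeddingKey(provider, model, dimensions, source_table, source_field, content_hash)


store = {}  # stand-in for the embeddings table
text = "LLM-guided protocol fuzzing"

# Same text, two providers: the keys differ, so neither row replaces the other.
store[make_key("fake", "deterministic-test", 8, "idea_cards", "core_idea", text)] = [0.0] * 8
store[make_key("fastembed", "BAAI/bge-small-en-v1.5", 384, "idea_cards", "core_idea", text)] = [0.0] * 384

# Re-embedding unchanged text with the same provider maps to the existing key.
store[make_key("fake", "deterministic-test", 8, "idea_cards", "core_idea", text)] = [0.0] * 8

print(len(store))  # 2
```

Because the hash is part of the key, editing an idea-card field produces a new key, which gives a rebuild a simple way to tell stale vectors from current ones.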

5. Smoke-Test Novelty Checking From The CLI

Use the same embedding provider/model that you built in step 4:

litmuse check "Can LLMs guide fuzzing by generating protocol-aware inputs?" \
  --provider fake \
  --model deterministic-test

The command embeds the idea, searches related idea-card fields, builds a conservative novelty report, stores the report in novelty_checks, and prints JSON.

Web Server Deployment

The web server serves a focused novelty-checking page at /.

The page has one textarea and one button. It submits POST /novelty-checks to the FastAPI app. The server embeds the submitted idea, searches the existing database, stores the novelty check, and returns related papers for display.

Start On A Development Port

Use this when testing on the same machine:

. .venv/bin/activate
export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
litmuse serve --host 127.0.0.1 --port 8000

Open:

http://127.0.0.1:8000/

Stop it with Ctrl-C in the terminal running the server.

Start For LAN Or VM Access

Use this when accessing the server from another machine or from a Windows host browser into a Linux VM:

. .venv/bin/activate
export LITMUSE_DATABASE_URL="postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse"
export LITMUSE_EMBEDDING_PROVIDER="fake"
export LITMUSE_EMBEDDING_MODEL="deterministic-test"
litmuse serve --host 0.0.0.0 --port 8000

Find the VM or server IP:

ip -4 addr show scope global

Open:

http://<server-ip>:8000/

Start On Port 80

Ports below 1024 usually require elevated privileges on Linux. Either run through sudo or put a reverse proxy in front of Litmuse.

Direct sudo start:

sudo -E .venv/bin/litmuse serve --host 0.0.0.0 --port 80

Use sudo -E if the database and embedding environment variables are already set in your shell and need to be preserved. Without the right environment, the server will fall back to defaults such as localhost:5432.

Open:

http://<server-ip>/

Stop it with Ctrl-C in the terminal running the server.

Run In The Background

For a simple non-systemd deployment:

mkdir -p tmp
nohup .venv/bin/litmuse serve --host 0.0.0.0 --port 8000 > tmp/litmuse-web.log 2>&1 &
echo $! > tmp/litmuse-web.pid

Stop it:

kill "$(cat tmp/litmuse-web.pid)"
rm tmp/litmuse-web.pid

Check whether it is listening:

ss -ltnp 'sport = :8000'
curl -fsS http://127.0.0.1:8000/health
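The same health probe can be done from Python, for example in a deployment script that waits for the server to come up. This sketch uses only the standard library. The is_healthy helper is hypothetical, and it treats any 2xx answer from /health as healthy, mirroring what curl -f accepts.

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True when GET <base_url>/health answers with a 2xx status."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, refused connections, and timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    print(is_healthy("http://127.0.0.1:8000"))
```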

Example systemd Service

Create /etc/systemd/system/litmuse-web.service:

[Unit]
Description=Litmuse web server
After=network.target postgresql.service

[Service]
WorkingDirectory=/home/audit/litmuse
Environment=LITMUSE_DATABASE_URL=postgresql+psycopg://litmuse:litmuse@localhost:5432/litmuse
Environment=LITMUSE_EMBEDDING_PROVIDER=fake
Environment=LITMUSE_EMBEDDING_MODEL=deterministic-test
ExecStart=/home/audit/litmuse/.venv/bin/litmuse serve --host 0.0.0.0 --port 8000
Restart=on-failure
User=audit

[Install]
WantedBy=multi-user.target

Start, stop, and inspect logs:

sudo systemctl daemon-reload
sudo systemctl start litmuse-web
sudo systemctl status litmuse-web
sudo journalctl -u litmuse-web -f
sudo systemctl stop litmuse-web

If you need public HTTP on port 80, run Litmuse on port 8000 and use Nginx, Caddy, or another reverse proxy to forward port 80 to 127.0.0.1:8000.

Troubleshooting The Web Server

If the page loads but submitting an idea fails, for example with a PostgreSQL connection error, work through these checks in order:

  1. Confirm PostgreSQL is running:

    pg_isready -h 127.0.0.1 -p 5432
  2. Confirm LITMUSE_DATABASE_URL points at the database that contains the pipeline output:

    printenv LITMUSE_DATABASE_URL
  3. Confirm the database has papers, idea cards, and embeddings:

    PGPASSWORD=litmuse psql -h 127.0.0.1 -U litmuse -d litmuse -c \
      "select count(*) from papers; select count(*) from idea_cards; select provider, model_name, dimensions, count(*) from embeddings group by provider, model_name, dimensions;"
  4. Confirm the web server uses the same embedding provider and model as the stored embeddings:

    printenv LITMUSE_EMBEDDING_PROVIDER
    printenv LITMUSE_EMBEDDING_MODEL
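Steps 3 and 4 can be combined into one comparison. The sketch below takes rows shaped like the output of the group-by query in step 3 and reports how many stored embeddings match the configured provider and model. The matching_embedding_count helper and the sample rows are hypothetical.

```python
def matching_embedding_count(rows, provider, model):
    """Count stored embeddings for one provider/model pair.

    rows: iterable of (provider, model_name, dimensions, count) tuples,
    shaped like the result of the group-by query in step 3.
    """
    return sum(
        count
        for row_provider, row_model, _dimensions, count in rows
        if row_provider == provider and row_model == model
    )


# Hypothetical query result: only FastEmbed embeddings were built.
rows = [("fastembed", "BAAI/bge-small-en-v1.5", 384, 120)]

print(matching_embedding_count(rows, "fastembed", "BAAI/bge-small-en-v1.5"))  # 120
print(matching_embedding_count(rows, "fake", "deterministic-test"))           # 0
```

A count of zero for the configured pair means the server is searching for vectors that were never built, even though the database is reachable and populated.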

If the server cannot bind to port 80, use port 8000 or run with appropriate privileges.

Local LibFuzz Smoke Flow

This repository includes seed evaluation cases for the local paper set under /home/audit/libfuzz/raw/papers.

litmuse-corpus init tmp/libfuzz.litmusecorpus --name "LibFuzz Papers"
litmuse-corpus add tmp/libfuzz.litmusecorpus /home/audit/libfuzz/raw/papers
litmuse-corpus verify tmp/libfuzz.litmusecorpus
litmuse init-db
litmuse ingest tmp/libfuzz.litmusecorpus
litmuse extract-ideas --provider local-llm --limit 14
litmuse embed --provider fake --model deterministic-test
litmuse check "Can LLMs guide fuzzing by generating protocol-aware inputs?" --provider fake --model deterministic-test
litmuse serve --host 0.0.0.0 --port 8000

Evaluation seeds are in:

eval/novelty_cases.jsonl

CLI Reference

litmuse-corpus

litmuse-corpus init <corpus_path> [--name NAME]
litmuse-corpus add <corpus_path> <pdf_or_directory>
litmuse-corpus list <corpus_path>
litmuse-corpus verify <corpus_path>
litmuse-corpus stats <corpus_path>
litmuse-corpus extract <corpus_path> <sha256> --out <output_pdf>

litmuse

litmuse init-db
litmuse ingest <corpus_path>
litmuse extract-ideas [--limit N] [--provider fake|local-llm] [--model MODEL]
litmuse embed [--provider fake|fastembed|openai] [--model MODEL]
litmuse check <idea_text> [--provider fake|fastembed|openai] [--model MODEL]
litmuse serve [--host HOST] [--port PORT]

HTTP Routes

GET  /                         Web novelty-checking page
GET  /health                   Server health check
POST /novelty-checks           Run a novelty check for input_text
GET  /novelty-checks/{check_id} Placeholder route
POST /ingest                   Placeholder route
POST /extract-ideas            Placeholder route
POST /embeddings/build         Placeholder route
GET  /papers/{paper_id}        Placeholder route
GET  /papers/{paper_id}/pdf    Placeholder route

The web page and POST /novelty-checks are wired to the current novelty-checking service. Pipeline operations should still be run through the CLI.
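For scripted checks, POST /novelty-checks can be called without the web page. The sketch below assumes the route accepts a JSON body with a single input_text key and returns JSON. The body shape is inferred from the route description above and is not confirmed, and both helper functions are hypothetical.

```python
import json
import urllib.request


def build_novelty_request(base_url: str, idea: str) -> urllib.request.Request:
    """Build the POST request; assumes a JSON body of the form {"input_text": ...}."""
    body = json.dumps({"input_text": idea}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/novelty-checks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_idea(base_url: str, idea: str, timeout: float = 60.0) -> dict:
    """Send the idea to a running server and decode the JSON report."""
    request = build_novelty_request(base_url, idea)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    report = submit_idea(
        "http://127.0.0.1:8000",
        "Can LLMs guide fuzzing by generating protocol-aware inputs?",
    )
    print(json.dumps(report, indent=2))
```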

Data Model

The schema is intentionally small:

  • papers: one row per archived PDF
  • paper_texts: parsed text blocks and section extracts
  • idea_cards: one structured idea card per paper
  • embeddings: provider/model/dimension-specific vectors for idea-card fields
  • novelty_checks: stored JSON reports for user idea checks

Migrations are packaged under litmuse.migrations and run through:

litmuse init-db

Migration 0002_unique_idea_card_per_paper fails with a clear error if duplicate idea cards already exist, rather than deleting data.
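The fail-instead-of-delete pattern can be sketched as a pre-check that runs before the unique constraint is added. The example uses an in-memory SQLite database as a stand-in for PostgreSQL. The paper_id column name, the index name, and the assert_one_card_per_paper function are assumptions, not the contents of the actual migration.

```python
import sqlite3


def assert_one_card_per_paper(conn) -> None:
    """Refuse to add the unique index while duplicate idea cards exist."""
    rows = conn.execute(
        "SELECT paper_id, COUNT(*) FROM idea_cards "
        "GROUP BY paper_id HAVING COUNT(*) > 1 ORDER BY paper_id"
    ).fetchall()
    if rows:
        ids = ", ".join(str(paper_id) for paper_id, _count in rows)
        # Stop with a readable message; no rows are deleted.
        raise RuntimeError(
            f"duplicate idea cards for paper_id(s) {ids}; resolve them before migrating"
        )
    conn.execute("CREATE UNIQUE INDEX uq_idea_cards_paper ON idea_cards (paper_id)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idea_cards (id INTEGER PRIMARY KEY, paper_id INTEGER, core_idea TEXT)")
conn.executemany(
    "INSERT INTO idea_cards (paper_id, core_idea) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "b again")],
)

try:
    assert_one_card_per_paper(conn)
except RuntimeError as exc:
    print(exc)
```

Checking first leaves the decision about which duplicate to keep with the operator, at the cost of a manual cleanup step before the upgrade can proceed.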

Development

Run tests and lint:

python -m pytest -v
python -m ruff check .

Check migration discovery:

alembic -c alembic.ini history

MVP Limitations

  • No crawler yet; the corpus is built from local PDF files.
  • The current novelty judge is conservative and deterministic.
  • Retrieval requires PostgreSQL with pgvector; no other search backend is supported.
  • The web UI is intentionally focused on novelty checking only. It does not manage ingestion, extraction, embedding builds, or corpus status.
  • Static Alembic SQL generation is not part of the supported workflow; use online migrations through litmuse init-db.
