docling-pgvector

Building a RAG pipeline usually means wiring together a document parser, an embedding model, and a vector database yourself, and spending time learning each tool along the way. docling-pgvector removes that overhead and gives you a simple interface so you can focus on what matters:

Provide a PDF as input
Run a query
Get similarity-ranked text chunks back, ready to pass to any LLM of your choice

Support for additional input formats including CSV, web pages, and other document types is currently under development.

The default embedding model is BAAI/bge-base-en-v1.5, but any HuggingFace SentenceTransformer model can be used by passing a different model_name to EmbeddingsConfig.

How It Works

PDF File
   │
   ▼
Docling (PDF parser)          ← GPU auto-detected
   │  page-batched conversion
   ▼
HybridChunker + TableItem     ← semantic text chunks + Markdown tables
   │  unique content
   ▼
SentenceTransformer            ← BAAI/bge-base-en-v1.5 (768-dim)
   │  vector embeddings
   ▼
PostgreSQL + pgvector          ← similarity search (L2 distance)
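
The "unique content" step in the diagram suggests that duplicate chunks are dropped before embedding. How the project implements this is not shown in this README; the following is only a minimal sketch, assuming chunks are plain strings and that first-seen order should be kept. The function name unique_chunks is hypothetical.

```python
def unique_chunks(chunks):
    """Drop empty and exactly repeated chunks (after trimming), keeping first-seen order."""
    seen = set()
    result = []
    for chunk in chunks:
        key = chunk.strip()
        if key and key not in seen:
            seen.add(key)
            result.append(key)
    return result


chunks = ["Attention is all you need.", "Table 1", "Attention is all you need. ", ""]
print(unique_chunks(chunks))
```

Deduplicating before encoding avoids spending embedding time and table rows on repeated headers or captions.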

Requirements

Before you begin: Make sure Docker Desktop (or an equivalent Docker runtime) is installed and running on your machine. All setup options below rely on Docker.

Option	Tools needed
A — Pre-built Docker Image (recommended)	Docker Desktop
B — Dev Container	VS Code + Dev Containers extension
C — Local	Python 3.12+ and uv

Option A — Pre-built Docker Image (Recommended)

The fastest way to get started. Pull the pre-built image and run — no Python, no dependency installs, no setup scripts required.

Terminal: Use Git Bash or WSL on Windows. If you prefer PowerShell, replace $(pwd) with ${PWD} in step 5. Command Prompt is not recommended as some commands will not work correctly.

1. Pull the image

docker pull ghcr.io/sunishbharat/docling-pgvector:cpu-dev

2. Clone the repository

git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector

3. Start PostgreSQL + pgvector

Create a shared network so both containers can talk to each other, then start the database.

docker network create devnet || true

docker rm -f pgvector-container 2>/dev/null || true

docker run --name pgvector-container \
  --network devnet \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  -d pgvector/pgvector:pg17

4. Create the database extension

Wait until the database is ready, then enable the pgvector extension.

docker exec pgvector-container bash -c "until pg_isready -U postgres; do sleep 1; done"

docker exec pgvector-container psql -U postgres -d vectordb \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"

5. Run the container

On Windows, replace $(pwd) with ${PWD} in PowerShell or %cd% in Command Prompt.

docker run --rm -it \
  --network devnet \
  -e DATABASE_URL=postgresql://postgres:postgres@pgvector-container:5432/vectordb \
  -v $(pwd):/workspace/docling-pgvector \
  -w /workspace/docling-pgvector \
  ghcr.io/sunishbharat/docling-pgvector:cpu-dev \
  bash

6. Inside the container — install the project and run tests

/opt/venv/bin/pip install -e .
pytest test/pgpytest.py -v -s
python -m test.docling_test
python -m test.document_processor_test

Option B — Dev Container

Everything is pre-configured. All dependencies, the app container, and PostgreSQL+pgvector are set up automatically — no manual configuration needed.

1. Clone the repository

git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector

2. Open in VS Code

code .

3. Reopen in Dev Container

When VS Code prompts "Reopen in Container", click it. Or open the Command Palette (Ctrl+Shift+P) and run:

Dev Containers: Reopen in Container

4. Wait for setup to complete

The first time takes a few minutes. The setup script automatically:

Installs the project and all Python dependencies
Creates the vectordb database and enables the vector extension
Downloads ./data/test.pdf (the "Attention Is All You Need" paper for testing)

5. Run the tests

uv run pytest test/pgpytest.py -v -s
uv run python -m test.docling_test
uv run python -m test.document_processor_test

Option C — Local Setup

1. Clone the repository

git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector

2. Start PostgreSQL + pgvector

docker run --name pgvector-container \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  -d pgvector/pgvector:pg17

3. Create the database and enable the extension

PGPASSWORD=postgres psql -h localhost -p 5432 -U postgres \
  -c "CREATE DATABASE vectordb;"

PGPASSWORD=postgres psql -h localhost -p 5432 -U postgres -d vectordb \
  -c "CREATE EXTENSION IF NOT EXISTS vector;"

4. Install Python dependencies

uv sync
uv pip install -e .

5. Download the test PDF

mkdir -p ./data
curl -L https://arxiv.org/pdf/1706.03762 -o ./data/test.pdf

6. Set the database connection

export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/vectordb

Alternatively, set individual env vars:

export PG_HOST=localhost
export PG_PORT=5432
export PG_DATABASE=vectordb
export PG_USER=postgres
export PG_PASSWORD=postgres
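
The two forms above carry the same information. As an illustration only, a URL like the one in DATABASE_URL can be split into the individual fields with the standard library, falling back to the PG_* variables when it is unset. The helper name resolve_pg_settings and the precedence (URL first) are assumptions; the project's own resolution logic is not shown in this README.

```python
import os
from urllib.parse import urlsplit


def resolve_pg_settings(env=None):
    """Return connection fields from DATABASE_URL, else from PG_* variables."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL")
    if url:
        parts = urlsplit(url)
        return {
            "host": parts.hostname or "localhost",
            "port": parts.port or 5432,
            "database": parts.path.lstrip("/") or "vectordb",
            "user": parts.username or "postgres",
            "password": parts.password or "postgres",
        }
    # Fallback: individual variables, with the defaults listed under Configuration
    return {
        "host": env.get("PG_HOST", "localhost"),
        "port": int(env.get("PG_PORT", "5432")),
        "database": env.get("PG_DATABASE", "vectordb"),
        "user": env.get("PG_USER", "postgres"),
        "password": env.get("PG_PASSWORD", "postgres"),
    }


settings = resolve_pg_settings(
    {"DATABASE_URL": "postgresql://postgres:postgres@localhost:5432/vectordb"}
)
print(settings["host"], settings["port"], settings["database"])
```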

7. Run the tests

uv run pytest test/pgpytest.py -v -s
uv run python -m test.docling_test
uv run python -m test.document_processor_test

Usage

1. Parse PDF and generate embeddings

from document_processor import DocumentProcessor
from dconfig import EmbeddingsConfig

config = EmbeddingsConfig(model_name="BAAI/bge-base-en-v1.5")
processor = DocumentProcessor(embedconfig=config)

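# Parse the PDF into text chunks and get the loaded embedding model
content_list, model = processor.embeddings_generate(path="./data/test.pdf")
# Normalize so the stored vectors match the normalized query vector used in step 3
embeddings = model.encode(content_list, normalize_embeddings=True)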

2. Store embeddings in PostgreSQL

from pgvector_client import PGVectorClient, PGVectorConfig

pg_config = PGVectorConfig(host="localhost", database="vectordb")
with PGVectorClient(pg_config) as client:
    with client.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute(f"CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, text TEXT, embedding vector({model.get_sentence_embedding_dimension()}))")

    for chunk, embed in zip(content_list, embeddings):
        with client.cursor() as cur:
            cur.execute("INSERT INTO items (text, embedding) VALUES (%s, %s)", (chunk, embed))
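
The insert above hands each embedding straight to the driver, which works only if the client knows how to adapt it to the vector type; whether PGVectorClient registers such an adapter is not stated here. As a fallback sketch, pgvector also accepts a vector in its text form, [x1,x2,...], which can be cast with ::vector. The helper name to_vector_literal is hypothetical.

```python
def to_vector_literal(values):
    """Format a sequence of numbers as pgvector's text representation, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


literal = to_vector_literal([0.25, -1, 3.5])
print(literal)  # [0.25,-1.0,3.5]

# With a DB-API cursor, the literal can be sent as an ordinary string parameter:
# cur.execute(
#     "INSERT INTO items (text, embedding) VALUES (%s, %s::vector)",
#     (chunk, to_vector_literal(embed)),
# )
```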

3. Similarity search

query_vec = model.encode("your search query", normalize_embeddings=True)
with PGVectorClient(pg_config) as client:
    with client.cursor() as cur:
        cur.execute("""
            SELECT id, text, embedding <-> %s AS distance
            FROM items ORDER BY distance LIMIT 2
        """, (query_vec,))
        results = cur.fetchall()

for id_, text, dist in results:
    print(f"[{dist:.4f}] {text[:100]}")
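
The <-> operator in the query is pgvector's Euclidean (L2) distance, and ORDER BY distance LIMIT 2 keeps the two nearest rows. The same ranking can be reproduced in plain Python on toy vectors to see what the database computes. The data and the function name top_k_l2 below are made up for illustration.

```python
import math


def top_k_l2(query, rows, k=2):
    """Rank (id, text, vector) rows by L2 distance to query, nearest first."""
    scored = [(row_id, text, math.dist(query, vec)) for row_id, text, vec in rows]
    scored.sort(key=lambda item: item[2])
    return scored[:k]


rows = [
    (1, "intro", [1.0, 0.0]),
    (2, "table", [0.0, 1.0]),
    (3, "appendix", [-1.0, 0.0]),
]
results = top_k_l2([0.6, 0.8], rows)
for row_id, text, dist in results:
    print(f"[{dist:.4f}] {text}")
```

Here "table" ranks first because its vector lies closest to the query, followed by "intro"; "appendix" is cut off by k=2.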

Docling detects tables in the PDF and exports them as Markdown, so they are stored and retrieved as structured text alongside regular chunks.

Sample Output

Pass your query string directly to model.encode() — no other configuration needed:

query = "Maximum path lengths, per-layer complexity and minimum number of sequential operations Table"
query_vec = model.encode(query, normalize_embeddings=True)

Results are ranked by distance — lower means more relevant. Real output from running document_processor_test.py against the "Attention Is All You Need" paper:

INFO:root:id_=43, dist=0.550322916862681, ->
 Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
 for different layer types. n is the sequence length, d is the representation dimension,
 k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.

 | Layer Type     | Complexity per Layer | Sequential Ops | Max Path Length |
 |----------------|----------------------|----------------|-----------------|
 | Self-Attention | O(n² · d)            | O(1)           | O(1)            |
 | Recurrent      | O(n · d²)            | O(n)           | O(n)            |
 | Convolutional  | O(k · n · d²)        | O(1)           | O(log_k(n))     |

INFO:root:id_=16, dist=0.672102512607041, ->
 4 Why Self-Attention
 In this section we compare various aspects of self-attention layers to the recurrent and convolutional
 layers commonly used for mapping one variable-length sequence of symbol representations (x1,...,xn)
 to another sequence of equal length (z1,...,zn).

The top result (id=43) is a table extracted directly from the PDF by Docling and stored as structured Markdown alongside regular text chunks.
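
To put the distances in context: if both the query and the stored embeddings are unit-length, L2 distance d and cosine similarity are related by d^2 = 2 - 2*cos, so cos = 1 - d^2/2. Unit length is an assumption here; the query is encoded with normalize_embeddings=True, but whether the stored vectors are normalized depends on the model and on how they were encoded. A short sketch applying that identity to the two distances reported above:

```python
def cosine_from_l2(distance):
    """Cosine similarity implied by an L2 distance between two unit-length vectors."""
    return 1.0 - (distance ** 2) / 2.0


for dist in (0.550322916862681, 0.672102512607041):
    print(f"L2 {dist:.4f} -> cosine {cosine_from_l2(dist):.4f}")
```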

Configuration

Embedding Model (`EmbeddingsConfig`)

Parameter	Default	Description
`model_name`	`BAAI/bge-base-en-v1.5`	Any HuggingFace SentenceTransformer model
`dims`	`768`	Auto-resolved from the loaded model
`batch_size`	`32`	Encoding batch size

The model is validated against HuggingFace Hub before downloading. An InvalidModelError is raised if the model ID does not exist.
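
As a rough picture of what batch_size in the table above controls: the chunk list is encoded in groups of that size rather than all at once, which bounds memory use. SentenceTransformer.encode does this internally through its own batch_size argument; the helper below only illustrates the grouping, and its name is hypothetical.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


chunks = [f"chunk {i}" for i in range(70)]
print([len(batch) for batch in batched(chunks)])  # [32, 32, 6]
```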

Database (`PGVectorConfig`)

Parameter	Env Var	Default
`host`	`PG_HOST`	`localhost`
`port`	`PG_PORT`	`5432`
`database`	`PG_DATABASE`	`vectordb`
`user`	`PG_USER`	`postgres`
`password`	`PG_PASSWORD`	`postgres`
