Building a RAG pipeline requires wiring together a document parser, an embedding model, and a vector database yourself, investing time learning each tool along the way. docling-pgvector frees from that overhead and gives user a simple interface to focus on what matters:
- Provide a PDF as input
- Run a query
- Get similarity-ranked text chunks back, ready to pass to any LLM of your choice
Support for additional input formats including CSV, web pages, and other document types is currently under development.
The default embedding model is BAAI/bge-base-en-v1.5, but any HuggingFace SentenceTransformer model can be used by passing a different model_name to EmbeddingsConfig.
PDF File
│
▼
Docling (PDF parser) ← GPU auto-detected
│ page-batched conversion
▼
HybridChunker + TableItem ← semantic text chunks + Markdown tables
│ unique content
▼
SentenceTransformer ← BAAI/bge-base-en-v1.5 (768-dim)
│ vector embeddings
▼
PostgreSQL + pgvector ← similarity search (L2 distance)
Before you begin: Make sure Docker Desktop (or an equivalent Docker runtime) is installed and running on your machine. All setup options below rely on Docker.
| Option | Tools needed |
|---|---|
| A — Pre-built Docker Image (recommended) | Docker Desktop |
| B — Dev Container | VS Code + Dev Containers extension |
| C — Local | Python 3.12+ and uv |
The fastest way to get started. Pull the pre-built image and run — no Python, no dependency installs, no setup scripts required.
Terminal: Use Git Bash or WSL on Windows. If you prefer PowerShell, replace
$(pwd)with${PWD}in step 5. Command Prompt is not recommended as some commands will not work correctly.
1. Pull the image
docker pull ghcr.io/sunishbharat/docling-pgvector:cpu-dev2. Clone the repository
git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector3. Start PostgreSQL + pgvector
Create a shared network so both containers can talk to each other, then start the database.
docker network create devnet || true
docker rm -f pgvector-container 2>/dev/null || true
docker run --name pgvector-container \
--network devnet \
-e POSTGRES_USER=postgres \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=vectordb \
-p 5432:5432 \
-d pgvector/pgvector:pg174. Create the database extension
Wait until the database is ready, then enable the pgvector extension.
docker exec pgvector-container bash -c "until pg_isready -U postgres; do sleep 1; done"
docker exec pgvector-container psql -U postgres -d vectordb \
-c "CREATE EXTENSION IF NOT EXISTS vector;"5. Run the container
On Windows, replace
$(pwd)with${PWD}in PowerShell or%cd%in Command Prompt.
docker run --rm -it \
--network devnet \
-e DATABASE_URL=postgresql://postgres:postgres@pgvector-container:5432/vectordb \
-v $(pwd):/workspace/docling-pgvector \
-w /workspace/docling-pgvector \
ghcr.io/sunishbharat/docling-pgvector:cpu-dev \
bash6. Inside the container — install the project and run tests
/opt/venv/bin/pip install -e .
pytest test/pgpytest.py -v -s
python -m test.docling_test
python -m test.document_processor_testEverything is pre-configured. All dependencies, the app container, and PostgreSQL+pgvector are set up automatically — no manual configuration needed.
1. Clone the repository
git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector2. Open in VS Code
code .3. Reopen in Dev Container
When VS Code prompts "Reopen in Container", click it. Or open the Command Palette (Ctrl+Shift+P) and run:
Dev Containers: Reopen in Container
4. Wait for setup to complete
The first time takes a few minutes. The setup script automatically:
- Installs the project and all Python dependencies
- Creates the
vectordbdatabase and enables thevectorextension - Downloads
./data/test.pdf(the "Attention Is All You Need" paper for testing)
5. Run the tests
uv run pytest test/pgpytest.py -v -s
uv run python -m test.docling_test
uv run python -m test.document_processor_test1. Clone the repository
git clone https://github.com/sunishbharat/docling-pgvector.git
cd docling-pgvector2. Start PostgreSQL + pgvector
docker run --name pgvector-container \
-e POSTGRES_PASSWORD=postgres \
-p 5432:5432 \
-d pgvector/pgvector:pg173. Create the database and enable the extension
PGPASSWORD=postgres psql -h localhost -p 5432 -U postgres \
-c "CREATE DATABASE vectordb;"
PGPASSWORD=postgres psql -h localhost -p 5432 -U postgres -d vectordb \
-c "CREATE EXTENSION IF NOT EXISTS vector;"4. Install Python dependencies
uv sync
uv pip install -e .5. Download the test PDF
mkdir -p ./data
curl -L https://arxiv.org/pdf/1706.03762 -o ./data/test.pdf6. Set the database connection
export DATABASE_URL=postgresql://postgres:postgres@localhost:5432/vectordbAlternatively, set individual env vars:
export PG_HOST=localhost export PG_PORT=5432 export PG_DATABASE=vectordb export PG_USER=postgres export PG_PASSWORD=postgres
7. Run the tests
uv run pytest test/pgpytest.py -v -s
uv run python -m test.docling_test
uv run python -m test.document_processor_testfrom document_processor import DocumentProcessor
from dconfig import EmbeddingsConfig
config = EmbeddingsConfig(model_name="BAAI/bge-base-en-v1.5")
processor = DocumentProcessor(embedconfig=config)
content_list, model = processor.embeddings_generate(path="./data/test.pdf")
embeddings = model.encode(content_list)from pgvector_client import PGVectorClient, PGVectorConfig
pg_config = PGVectorConfig(host="localhost", database="vectordb")
with PGVectorClient(pg_config) as client:
with client.cursor() as cur:
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute(f"CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, text TEXT, embedding vector({model.get_sentence_embedding_dimension()}))")
for chunk, embed in zip(content_list, embeddings):
with client.cursor() as cur:
cur.execute("INSERT INTO items (text, embedding) VALUES (%s, %s)", (chunk, embed))query_vec = model.encode("your search query", normalize_embeddings=True)
with PGVectorClient(pg_config) as client:
with client.cursor() as cur:
cur.execute("""
SELECT id, text, embedding <-> %s AS distance
FROM items ORDER BY distance LIMIT 2
""", (query_vec, query_vec))
results = cur.fetchall()
for id_, text, dist in results:
print(f"[{dist:.4f}] {text[:100]}")Docling detects tables in the PDF and exports them as Markdown, so they are stored and retrieved as structured text alongside regular chunks.
Pass your query string directly to model.encode() — no other configuration needed:
query = "Maximum path lengths, per-layer complexity and minimum number of sequential operations Table"
query_vec = model.encode(query, normalize_embeddings=True)Results are ranked by distance — lower means more relevant. Real output from running document_processor_test.py against the "Attention Is All You Need" paper:
INFO:root:id_=43, dist=0.550322916862681, ->
Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. n is the sequence length, d is the representation dimension,
k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.
| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
|----------------|----------------------|----------------|-----------------|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
INFO:root:id_=16, dist=0.672102512607041, ->
4 Why Self-Attention
In this section we compare various aspects of self-attention layers to the recurrent and convolutional
layers commonly used for mapping one variable-length sequence of symbol representations (x1,...,xn)
to another sequence of equal length (z1,...,zn).
The top result (id=43) is a table extracted directly from the PDF by Docling and stored as structured Markdown alongside regular text chunks.
| Parameter | Default | Description |
|---|---|---|
model_name |
BAAI/bge-base-en-v1.5 |
Any HuggingFace SentenceTransformer model |
dims |
768 |
Auto-resolved from the loaded model |
batch_size |
32 |
Encoding batch size |
The model is validated against HuggingFace Hub before downloading. An
InvalidModelErroris raised if the model ID does not exist.
| Parameter | Env Var | Default |
|---|---|---|
host |
PG_HOST |
localhost |
port |
PG_PORT |
5432 |
database |
PG_DATABASE |
vectordb |
user |
PG_USER |
postgres |
password |
PG_PASSWORD |
postgres |