Mann10/Multi-Tenant-Agentic-Rag-as-a-Service

Multi-Tenant Agentic RAG as a Service

A Python/FastAPI implementation of a multi-tenant Retrieval-Augmented Generation (RAG) system. The project is split into three services:

  • Ingestion Service: reads each tenant's raw folder when an ingestion job starts, parses the documents, extracts OCR text from images, chunks the content, and sends the chunks for embedding.
  • Embedding Service: creates embeddings with Voyage AI and stores vectors in Pinecone, using one Pinecone namespace per tenant.
  • Retrieval Service: validates user questions, retrieves relevant chunks, and generates grounded answers with source citations.

Architecture

[Architecture diagram]

Ingestion Flow

sequenceDiagram
    autonumber
    actor Client
    participant Raw as data/{tenant_id}/raw
    participant Ingestion as Ingestion Service :8000
    participant DB as ingestion.db
    participant Parser as Parser
    participant OCR as Ollama OCR
    participant Chunker as Chunker
    participant Embedding as Embedding Service :8001
    participant Voyage as Voyage AI
    participant Pinecone as Pinecone
    participant Processed as data/{tenant_id}/processed

    Client->>Raw: Add document files
    Client->>Ingestion: POST /ingest/{tenant_id}?strategy=recursive
    Ingestion->>DB: Create ingestion job
    Ingestion-->>Client: Return started status
    Ingestion->>Raw: List tenant raw files
    loop For each file
        Ingestion->>Parser: Detect type and parse content
        Parser-->>Ingestion: Pages, text, extracted image paths
        opt Page or document contains images
            Ingestion->>OCR: Extract image text
            OCR-->>Ingestion: OCR text
        end
        Ingestion->>Chunker: Split page text
        Chunker-->>Ingestion: ChunkPayload list with metadata
        Ingestion->>Processed: Move parsed file
    end
    Ingestion->>Embedding: POST /embed with tenant chunks
    Embedding->>Voyage: Create document embeddings
    Voyage-->>Embedding: Embedding vectors
    Embedding->>Pinecone: Upsert vectors into tenant namespace
    Pinecone-->>Embedding: Upsert result
    Embedding-->>Ingestion: Stored vector count
    Ingestion->>DB: Mark job completed or failed
    Client->>Ingestion: GET /ingest/{tenant_id}/status
    Ingestion-->>Client: Job status, file count, chunk count

Retrieval Flow

sequenceDiagram
    autonumber
    actor Client
    participant Retrieval as Retrieval Service :8002
    participant DB as retrieval.db
    participant Validator as Validation LLM
    participant Embedding as Embedding Service :8001
    participant Voyage as Voyage AI
    participant Pinecone as Pinecone
    participant Answer as Answer LLM

    Client->>Retrieval: POST /query/{tenant_id}
    Retrieval->>DB: Create retrieval session
    Retrieval-->>Client: query_id with pending_validation
    Retrieval->>Validator: Validate query intent and specificity

    alt Query needs clarification
        Validator-->>Retrieval: clarification_question
        Retrieval->>DB: Save needs_clarification status
        Client->>Retrieval: GET /query/{query_id}
        Retrieval-->>Client: clarification_question
        Client->>Retrieval: POST /query/{query_id}/clarify
        Retrieval->>DB: Update session with clarification
        Retrieval->>Validator: Validate clarified query
    end

    Validator-->>Retrieval: validated
    Retrieval->>Embedding: POST /embed-texts with query
    Embedding->>Voyage: Create query embedding
    Voyage-->>Embedding: Query vector
    Embedding-->>Retrieval: Query vector
    Retrieval->>Embedding: POST /query with tenant_id and vector
    Embedding->>Pinecone: Query tenant namespace
    Pinecone-->>Embedding: Top matching chunks
    Embedding-->>Retrieval: Matches with metadata

    alt Matching chunks found
        Retrieval->>Answer: Generate grounded answer from chunks
        Answer-->>Retrieval: Markdown answer with citations
        Retrieval->>DB: Save completed answer and sources
    else No chunks found
        Retrieval->>DB: Save completed no-results answer
    end

    Client->>Retrieval: GET /query/{query_id}
    Retrieval-->>Client: Final answer, sources, or error

Repository Layout

.
|-- data/                    # Tenant document folders
|   `-- {tenant_id}/
|       |-- raw/             # Place files here before ingestion
|       `-- processed/       # Ingested files are moved here
|-- embedding-service/       # Embedding and Pinecone vector service
|-- ingestion-service/       # Document ingestion, parsing, OCR, chunking
|-- retrieval-service/       # Query validation, retrieval, answer generation
|-- shared/                  # Shared Pydantic models
|-- ingestion.db             # SQLite ingestion job database
|-- retrieval.db             # SQLite retrieval session database
`-- test.py                  # Small local Ollama embedding test

Features

  • Multi-tenant ingestion and retrieval through tenant IDs.
  • Tenant isolation through Pinecone namespaces.
  • Supported input types: PDF, DOCX, TXT, PNG, JPG, JPEG, GIF, BMP, TIFF, and WEBP.
  • PDF and DOCX image extraction.
  • OCR through an Ollama vision model.
  • Recursive chunking by default, with optional semantic chunking.
  • Background ingestion jobs with status tracking.
  • Query clarification flow for vague or underspecified questions.
  • Source-aware answers that cite filenames and page numbers.
  • LangSmith tracing hooks in embedding and retrieval flows.

Prerequisites

  • Python 3.12, or a compatible Python 3.x runtime.
  • Pinecone account and API key.
  • Voyage AI API key.
  • A LiteLLM/OpenAI-compatible chat endpoint for retrieval generation.
  • Ollama if you want local OCR or semantic chunking.

The code expects these Ollama models when those paths are used:

ollama pull glm-ocr:latest
ollama pull nomic-embed-text:latest

Environment Variables

Create a .env file in the project root. Do not commit real secrets.

# Retrieval LLM endpoint
LITELLM_API_BASE=http://127.0.0.1:11434/v1
LITELLM_API_KEY=ollama
VALIDATION_MODEL=ministral-3:latest
ANSWER_MODEL=ministral-3:latest

# Optional LangSmith tracing
LANGCHAIN_TRACING_V2=false
LANGCHAIN_API_KEY=

# Embeddings and vector storage
VOYAGE_API_KEY=your-voyage-api-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX=multi-tenant-rag
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1

# Service URLs
EMBEDDING_SERVICE_URL=http://localhost:8001

# Local storage
DATA_DIR=C:\Multi-tenant-Rag-as-a-Service\data
OLLAMA_HOST=http://localhost:11434

Notes:

  • VALIDATION_MODEL and ANSWER_MODEL are used by the retrieval service. If they are not set, both default to ministral-3:latest.
  • DATA_DIR defaults to /data in the code, so set it explicitly for local Windows development.
  • Both the ingestion and retrieval database modules read DB_PATH when it is set. Setting it globally makes both services share one SQLite file; leaving it unset uses ./ingestion.db for ingestion and ./retrieval.db for retrieval.

Installation

From the project root:

python -m venv venv
.\venv\Scripts\Activate.ps1
python -m pip install --upgrade pip

Install the service dependencies:

pip install -r ingestion-service\requirements.txt
pip install -r embedding-service\requirements.txt
pip install -r retrieval-service\requirements.txt

Running the Services

Run each service in a separate terminal from the project root.

# Terminal 1: Embedding Service
.\venv\Scripts\Activate.ps1
python -m uvicorn embedding-service.main:app --host 0.0.0.0 --port 8001

# Terminal 2: Ingestion Service
.\venv\Scripts\Activate.ps1
python -m uvicorn ingestion-service.main:app --host 0.0.0.0 --port 8000

# Terminal 3: Retrieval Service
.\venv\Scripts\Activate.ps1
python -m uvicorn retrieval-service.main:app --host 0.0.0.0 --port 8002

Health checks:

curl http://localhost:8000/health
curl http://localhost:8001/health
curl http://localhost:8002/health
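The same checks can be scripted. The sketch below uses only the standard library and assumes that a healthy service answers GET /health with HTTP 200; the response body is not inspected because its shape is not documented here. `check_health` and its `opener` parameter are hypothetical names introduced for illustration.

```python
import urllib.request

# Ports taken from the curl commands above.
SERVICES = {
    "ingestion": "http://localhost:8000/health",
    "embedding": "http://localhost:8001/health",
    "retrieval": "http://localhost:8002/health",
}


def check_health(urls=SERVICES, opener=urllib.request.urlopen, timeout=3.0):
    """Return {service_name: True/False} for each health endpoint."""
    results = {}
    for name, url in urls.items():
        try:
            with opener(url, timeout=timeout) as response:
                # Assumption: HTTP 200 means the service is up.
                results[name] = response.status == 200
        except OSError:
            # Covers connection errors, timeouts, and urllib's URLError/HTTPError.
            results[name] = False
    return results
```

Running `print(check_health())` with all three services up should report each one as `True`; a service that is not listening is reported as `False` rather than raising.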

Data Layout

Each tenant gets its own folder under DATA_DIR:

data/
`-- acme-corp/
    |-- raw/
    `-- processed/

Place documents into data/{tenant_id}/raw/, then start an ingestion job. After a file is parsed successfully, it is moved to data/{tenant_id}/processed/.

API Usage

1. Start Ingestion

curl -X POST "http://localhost:8000/ingest/acme-corp?strategy=recursive"

Optional query parameters:

  • strategy=recursive: character-based recursive chunking.
  • strategy=semantic: semantic chunking with Ollama embeddings.
  • custom_tags=tag-value: attaches a simple custom tag to chunk metadata.
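To give a feel for what character-based recursive chunking does, here is a simplified, self-contained sketch. It is not the chunker used by the ingestion service, whose implementation and parameters are not shown in this README; the function name, the separator list, and the `max_chars` default are all assumptions made for illustration.

```python
def recursive_split(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

The idea is that paragraph breaks are preferred split points, then line breaks, then spaces, so chunks tend to follow the document's own structure before any hard cut is made.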

2. Check Ingestion Status

curl http://localhost:8000/ingest/acme-corp/status

Example response:

{
  "tenant_id": "acme-corp",
  "status": "completed",
  "started_at": "2026-05-27 08:10:00",
  "completed_at": "2026-05-27 08:10:20",
  "error_message": null,
  "file_count": 2,
  "chunk_count": 14
}
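Because ingestion runs in the background, a client usually polls this endpoint until the job finishes. The sketch below is one way to do that with the standard library. It assumes the terminal status values are "completed" and "failed", as suggested by the ingestion flow; `fetch_status` and `wait_for_ingestion` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_status(tenant_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /ingest/{tenant_id}/status and decode the JSON body."""
    url = f"{base_url}/ingest/{tenant_id}/status"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def wait_for_ingestion(
    tenant_id: str,
    fetch=fetch_status,
    interval: float = 2.0,
    max_polls: int = 60,
    sleep=time.sleep,
) -> dict:
    """Poll until the job reports a terminal status or max_polls is reached."""
    for _ in range(max_polls):
        job = fetch(tenant_id)
        # Assumption: these two values are the only terminal states.
        if job.get("status") in {"completed", "failed"}:
            return job
        sleep(interval)
    raise TimeoutError(f"ingestion for {tenant_id!r} did not finish in time")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running service, for example by passing a function that returns canned responses.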

3. Submit a Query

curl -X POST http://localhost:8002/query/acme-corp `
  -H "Content-Type: application/json" `
  -d "{\"query\":\"What does the resume say about Python experience?\"}"

The response includes a query_id. The retrieval pipeline runs in the background.

{
  "query_id": "generated-query-id",
  "status": "pending_validation",
  "answer": null,
  "clarification_question": null,
  "sources": null,
  "error_message": null
}

4. Poll Query Status

curl http://localhost:8002/query/generated-query-id

Completed responses include an answer and source filenames:

{
  "query_id": "generated-query-id",
  "status": "completed",
  "answer": "The answer in markdown...",
  "clarification_question": null,
  "sources": ["Mannresumetailored.pdf"],
  "error_message": null
}

5. Answer a Clarification Request

If a query is vague, the retrieval service can return status=needs_clarification with a clarification question.

curl -X POST http://localhost:8002/query/generated-query-id/clarify `
  -H "Content-Type: application/json" `
  -d "{\"clarification\":\"Focus on the candidate's backend Python projects.\"}"
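The submit, poll, and clarify steps above can be driven from Python as well. The sketch below is a minimal client under stated assumptions: failures are detected through a non-null error_message rather than a specific status name, since no failure status is documented here, and `post_json` and `next_action` are hypothetical helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # Retrieval Service port from this README


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body to the retrieval service and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def next_action(response: dict) -> str:
    """Map a query response to what the client should do next."""
    if response.get("error_message"):
        return "error"
    status = response.get("status")
    if status == "completed":
        return "done"
    if status == "needs_clarification":
        return "clarify"
    # pending_validation and any other in-progress status: keep polling.
    return "wait"
```

A client would call `post_json("/query/acme-corp", {"query": "..."})`, poll GET /query/{query_id} while `next_action` returns "wait", and send `post_json(f"/query/{query_id}/clarify", {"clarification": "..."})` when it returns "clarify".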

Service Endpoints

Ingestion Service :8000

  • POST /ingest/{tenant_id}: start background ingestion.
  • GET /ingest/{tenant_id}/status: read latest ingestion job status.
  • GET /health: health check.

Embedding Service :8001

  • POST /embed: embed document chunks and upsert to Pinecone.
  • POST /embed-texts: embed raw text only, used for query vectors.
  • POST /query: query a tenant namespace in Pinecone.
  • GET /health: health check.

Retrieval Service :8002

  • POST /query/{tenant_id}: submit a user query.
  • GET /query/{query_id}: poll query status or result.
  • POST /query/{query_id}/clarify: continue a query that needs clarification.
  • GET /health: health check.

How Tenant Isolation Works

The embedding service stores all vectors in the configured Pinecone index, but uses the tenant ID as the Pinecone namespace:

Pinecone index: multi-tenant-rag
namespace: acme-corp
namespace: tenant_2
namespace: ...

Retrieval queries always include the tenant_id, so they search only that tenant's namespace.
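The effect of namespacing can be illustrated with a small in-memory stand-in. This is not the Pinecone client and does not reproduce its API; `NamespacedVectorStore` is a hypothetical class that only models the idea that upserts and queries are scoped to one namespace.

```python
import math
from collections import defaultdict


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedVectorStore:
    """Toy store: one dictionary of vectors per tenant namespace."""

    def __init__(self) -> None:
        self._namespaces: dict[str, dict] = defaultdict(dict)

    def upsert(self, namespace: str, vector_id: str, vector: list[float], metadata: dict) -> None:
        self._namespaces[namespace][vector_id] = (vector, metadata)

    def query(self, namespace: str, vector: list[float], top_k: int = 3) -> list[tuple]:
        # Only the requested namespace is searched; other tenants are never touched.
        candidates = self._namespaces.get(namespace, {})
        scored = [
            (_cosine(vector, stored), vector_id, metadata)
            for vector_id, (stored, metadata) in candidates.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]
```

Even if two tenants store identical vectors, a query for acme-corp can only return entries that were upserted under acme-corp, and a query for an unknown tenant returns nothing.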

Implementation Notes

  • shared/models.py contains the shared Pydantic request, response, metadata, ingestion, and query status models.
  • Ingestion jobs are tracked in SQLite through ingestion.db.
  • Retrieval sessions are tracked in SQLite through retrieval.db.
  • Pinecone vectors use Voyage voyage-3.5 embeddings with dimension 1024.
  • Retrieved chunk metadata includes filename, page number, document type, chunk index, timestamp, optional custom tags, and optional source image paths.
  • The retrieval pipeline validates the query first. Valid queries proceed to vector retrieval and answer generation; vague queries stop with a clarification prompt.
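The chunk metadata listed above can be pictured as a small record, from which filename and page citations are derived. The sketch uses a standard-library dataclass as a stand-in for the Pydantic models in shared/models.py; the field names, the `ChunkMetadata` class, and the citation format are assumptions for illustration and may not match the real models.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkMetadata:
    """Hypothetical stand-in for the shared chunk metadata model."""
    filename: str
    page_number: int
    document_type: str
    chunk_index: int
    timestamp: str
    custom_tags: list[str] = field(default_factory=list)
    source_image_paths: list[str] = field(default_factory=list)


def format_citations(chunks: list[ChunkMetadata]) -> list[str]:
    """Return one citation per distinct (filename, page), in first-seen order."""
    seen: set[tuple[str, int]] = set()
    citations: list[str] = []
    for chunk in chunks:
        key = (chunk.filename, chunk.page_number)
        if key not in seen:
            seen.add(key)
            citations.append(f"{chunk.filename} (page {chunk.page_number})")
    return citations
```

Deduplicating by filename and page keeps the source list short when several retrieved chunks come from the same page.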

Troubleshooting

  • Directory not found: .../raw: create data/{tenant_id}/raw before calling ingestion.
  • Embedding failures: check VOYAGE_API_KEY and that the embedding service is running on EMBEDDING_SERVICE_URL.
  • Pinecone upsert/query failures: check PINECONE_API_KEY, region, cloud, and index name.
  • OCR failures: make sure Ollama is running and glm-ocr:latest is available.
  • Semantic chunking failures: make sure Ollama is running and nomic-embed-text:latest is available.
  • Retrieval LLM failures: check LITELLM_API_BASE, LITELLM_API_KEY, VALIDATION_MODEL, and ANSWER_MODEL.

Development Notes

The existing .gitignore excludes:

venv
__pycache__
.env

Consider also excluding local SQLite databases and tenant data if they should not be committed:

*.db
data/
