lucidRAG

The Open-Source Agentic RAG Platform

Multimodal retrieval + synthesis with domain specialists, model routing, GraphRAG, and local-first deployment

IN DEVELOPMENT — This project is under active development. The hosted SaaS version at lucidrag.com will be available shortly — watch this repo for updates!


DoomSummarizer — Console-First Research Assistant

DoomSummarizer distills lucidRAG's principles (hybrid search, entity extraction, knowledge graph construction, evidence-grounded synthesis) into a standalone single-binary CLI. It is fully open source and includes the web crawler, with HTTP ETag/Last-Modified cache-aware incremental crawling, that also powers the indexing pipeline of the commercial lucidRAG SaaS platform. It fetches, ranks, and synthesizes news and research with a built-in local knowledge base, NER entity extraction, and long-form article generation. No API keys are required for the default sources.

If you've used NotebookLM, the workflow will feel familiar, but DoomSummarizer is intentionally a different system built around local control, open components, and composable workflows.

# Daily digest from default sources
doomsummarizer scroll

# Topic query with tone
doomsummarizer scroll "AI regulation news" -v snarky

# Build a knowledge base from any website (incremental with ETag caching, entities by default)
doomsummarizer crawl https://docs.example.com -n mydocs
doomsummarizer ask -s crawl:mydocs "how does authentication work?"

# Generate a long-form article with evidence grounding
doomsummarizer scroll "history of LLMs" -t blog-article -o llm-history.md

Key features: hybrid BM25 + semantic ranking, ONNX embeddings (offline), HTTP ETag/Last-Modified cache-aware web crawling, NER entity extraction with knowledge graph, dual-model local defaults (gemma3:4b + qwen3:0.6b), optional budget-controlled cloud LLM fallback (Anthropic/OpenAI), 15+ output templates, email delivery.
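
Cache-aware crawling of this kind relies on HTTP conditional requests. The sketch below is a minimal illustration under stated assumptions, not DoomSummarizer's actual crawler: the `CacheEntry` type and both function names are hypothetical. It shows how stored `ETag` and `Last-Modified` validators become `If-None-Match` and `If-Modified-Since` request headers, and how a `304 Not Modified` response lets a crawler reuse the cached body instead of re-indexing the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """Body and validators remembered from the previous crawl of one URL (hypothetical type)."""
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CacheEntry]) -> dict:
    """Build request headers that let the server skip unchanged content."""
    headers = {}
    if entry is None:
        return headers  # first visit: plain GET
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def apply_response(entry, status, headers, body):
    """Return (cache_entry, changed) for a response.

    304 means the cached copy is still valid; 200 replaces it.
    `headers` is assumed to be a dict keyed by canonical header names.
    """
    if status == 304 and entry is not None:
        return entry, False
    if status == 200:
        fresh = CacheEntry(
            body=body,
            etag=headers.get("ETag"),
            last_modified=headers.get("Last-Modified"),
        )
        return fresh, True
    raise ValueError(f"unhandled status {status}")
```

Only pages that come back with `changed` set to true need to be re-chunked and re-embedded, which is what makes repeated crawls incremental.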

Full documentation, all CLI options, and examples are in the DoomSummarizer documentation.

Download: Grab pre-built binaries for Windows, Linux, and macOS from Releases.


Why lucidRAG?

Most RAG systems are basic document-to-vector pipelines. lucidRAG is different:

| Feature | Basic RAG | lucidRAG |
|---|---|---|
| Search | Semantic only | Hybrid BM25 + Semantic with RRF fusion (Typesense / Qdrant + Lucene) |
| Query Processing | Direct embedding | Agentic decomposition (Sentinel) |
| Knowledge | Flat chunks | GraphRAG with entity extraction & communities |
| Images | Not supported | 22-wave ML pipeline (OCR, faces, motion, scenes) |
| Data Files | Not supported | CSV, Excel, Parquet profiling with DuckDB |
| Video | Not supported | Scene detection, transcript extraction |
| Deployment | Cloud-dependent | Zero API keys; runs fully local |
| Multi-tenancy | Not supported | Schema-per-tenant with automatic provisioning |

lucidRAG SaaS (Coming Soon)

lucidrag.com — the fully managed, hosted version of lucidRAG is on the way.

The SaaS edition is powered by Typesense as its unified search engine — replacing separate Qdrant and Lucene.NET deployments with a single, high-performance C++ engine that handles both BM25 keyword search and semantic vector search in one query. One API call performs hybrid retrieval with configurable rank fusion, delivering sub-50ms latency at millions of documents.

Why Typesense for production SaaS?

  • Unified hybrid search — BM25 + HNSW vector search + rank fusion in a single engine. No more synchronizing two separate search indices.
  • Built-in semantic search — Auto-generates embeddings from document fields using built-in ONNX models (all-MiniLM-L12-v2) or remote APIs (OpenAI, etc.). No external embedding pipeline needed.
  • Built-in conversational RAG — Send a natural language question, get a grounded answer with streaming. Multi-collection context aggregation out of the box.
  • Sub-50ms search latency — In-memory C++ architecture with HNSW indexing. 28M records on 4 vCPUs at 28ms average.
  • Multi-tenant ready — Scoped API keys with collection-level isolation map directly to SaaS multi-tenancy patterns.
  • Single binary, zero dependencies — No JVM, no runtime overhead, no garbage collection pauses. Raft-based HA clustering built in.
  • Natural language query understanding — LLM-powered intent detection converts free-form queries into structured filters (similar to LucidRAG's Sentinel, but at the search engine level).

The entire OSS core — including the lucidRAG CLI (with all non-Typesense search engines built in and working), DoomSummarizer (the web crawler and research assistant), and all pipeline infrastructure — remains fully open source under The Unlicense. The SaaS platform uses lucidRAG's plugin architecture to build something that works at scale: adding Typesense as a unified search backend, multi-tenant SaaS isolation, API key management, analytics dashboards, and the embeddable widget CDN.


What's New

  • Domain specialists in the ingestion pipeline: financial, technical/academic, and narrative plugins now classify content and enrich chunks with domain-specific entities/signals.
  • Model specialists by task tier: named provider routing for triage, general, synthesis, and vision, with local-first defaults and cloud fallback.
  • Expanded multimodal stack: unified registry now routes documents, images, data, video, and audio through dedicated pipelines.
  • Operational hardening: resilient LLM backends (retry/circuit breaker), LFU caching, per-tenant cache layers, and OpenTelemetry-ready instrumentation.

Getting Started (Development)

Note: lucidRAG is in active development. These instructions are for contributors and early testers.

Prerequisites

  • .NET 10.0 SDK
  • PostgreSQL 16+ with pgvector extension
  • Node.js 18+ (for CSS build)
  • Optional: Ollama for local LLM inference

Running Locally

# Clone the repository
git clone https://github.com/scottgal/lucidrag.git
cd lucidrag

# Set up the database connection in user secrets
dotnet user-secrets -p src/LucidRAG/LucidRAG.csproj set "ConnectionStrings:DefaultConnection" "Host=localhost;Database=LucidRAG;Username=postgres;Password=yourpassword"

# Build and run
dotnet run --project src/LucidRAG/LucidRAG.csproj

Standalone Mode (Quick Testing)

For quick testing without PostgreSQL:

dotnet run --project src/LucidRAG/LucidRAG.csproj -- --standalone

Standalone mode uses SQLite and in-memory vectors. Note: embeddings are not persisted between restarts in this mode.


Core Components

lucidRAG is built from specialized processing engines, each designed for a specific content type:

LucidRAG Platform
       │
       ├── Web Application (ASP.NET Core 10 + Razor + Alpine.js + Tailwind)
       │      ├── Chat Interface with streaming responses
       │      ├── File Explorer with natural language search
       │      ├── Knowledge Graph visualization (D3.js)
       │      └── Multi-tenant admin dashboard
       │
       └── Unified Pipeline Registry
              ├── DocumentPipeline  →  PDF, DOCX, Markdown, HTML, TXT
              ├── ImagePipeline     →  PNG, JPG, GIF, WebP (22-wave analysis)
              ├── DataPipeline      →  CSV, Excel, Parquet, JSON (DuckDB)
              ├── VideoPipeline     →  MP4, MKV, MOV (scene detection)
              ├── AudioPipeline     →  MP3, WAV, FLAC, M4A (transcription + fingerprinting)
              └── Domain Specialists → financial, technical, narrative enrichment
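
As a rough illustration of how a unified registry can route files by extension, here is a minimal sketch. The extension table is transcribed from the diagram above; the dictionary layout, the `route` function, and the concrete extensions chosen for "Markdown" and "Excel" are assumptions, not the project's actual registry code.

```python
from pathlib import Path

# Extension-to-pipeline table based on the diagram above.
# ".md" and ".xlsx" are assumed spellings for "Markdown" and "Excel".
PIPELINES = {
    "DocumentPipeline": {".pdf", ".docx", ".md", ".html", ".txt"},
    "ImagePipeline": {".png", ".jpg", ".gif", ".webp"},
    "DataPipeline": {".csv", ".xlsx", ".parquet", ".json"},
    "VideoPipeline": {".mp4", ".mkv", ".mov"},
    "AudioPipeline": {".mp3", ".wav", ".flac", ".m4a"},
}


def route(path: str) -> str:
    """Return the name of the pipeline registered for the file's extension."""
    ext = Path(path).suffix.lower()
    for name, extensions in PIPELINES.items():
        if ext in extensions:
            return name
    raise ValueError(f"no pipeline registered for {ext!r}")
```

With this table, `route("animation.gif")` resolves to `ImagePipeline` and `route("data.csv")` to `DataPipeline`; unknown extensions fail loudly rather than being silently skipped.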

DocumentPipeline (Mostlylucid.DocSummarizer.Core)

Handles traditional documents with intelligent chunking and hybrid search:

  • PDF: Native extraction via PdfPig + table detection
  • DOCX: OpenXML parsing with structure preservation
  • Markdown/HTML: AST parsing with code block handling
  • Chunking: Semantic boundaries with configurable overlap
  • Search: BM25 lexical + BERT semantic with RRF fusion
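
Reciprocal Rank Fusion (RRF) merges the lexical and semantic result lists using ranks only, so the two scoring scales never need to be normalized against each other. The sketch below is a generic RRF implementation, not code from this repository; the function name, the sample document ids, and the constant `k=60` (a commonly used value) are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]      # lexical ranking (made-up ids)
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # embedding ranking (made-up ids)
fused = rrf_fuse([bm25_hits, semantic_hits])
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above documents found by only one retriever, which is the behavior hybrid search is after.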

ImagePipeline (ImageSummarizer.Core)

A 22-wave modular ML pipeline for comprehensive image understanding:

| Wave Category | Waves | Purpose |
|---|---|---|
| OCR | AdvancedOcr, MlOcr, OcrQuality | Multi-engine text extraction with confidence |
| Vision AI | Florence2, VisionLlm, ClipEmbedding | Foundation models for understanding |
| Detection | Face, Scene, TextRegion, QRCode | Object and pattern detection |
| Analysis | Color, Motion, Edge, Composition | Visual feature extraction |
| Forensics | Exif, Contradiction, AutoRouting | Metadata and validation |

Special Capabilities:

  • Animated GIF/WebP: Frame deduplication (SSIM), temporal voting, filmstrip generation
  • Faces: Detection with bounding boxes for privacy redaction
  • Motion: Optical flow analysis for animation classification
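
To make the SSIM-based frame deduplication concrete, here is a simplified sketch. It computes a single-window (global) SSIM over flat grayscale pixel lists, whereas production implementations usually average SSIM over local windows. The function names and the `0.95` threshold are assumptions, not values taken from ImageSummarizer.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("frames must be non-empty and the same size")
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def dedupe_frames(frames, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or global_ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

Comparing each frame against the last kept frame, rather than against its immediate predecessor, prevents slow drift from slipping through as a chain of near-duplicates.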

DataPipeline (DataSummarizer.Core)

Structured data profiling powered by DuckDB:

  • Column Profiling: Type inference, cardinality, null rates
  • Statistical Analysis: Percentiles, distributions, outliers
  • Constraint Validation: Unique keys, foreign key relationships
  • Query Generation: Auto-generated SQL for common questions
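
The column-profiling step can be pictured with a small pure-Python stand-in. The real pipeline is described as DuckDB-powered; this sketch deliberately swaps in the standard library `csv` module to stay self-contained, and the function names and output keys are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Return 'integer', 'float', or 'text' for a list of non-empty strings."""
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "text"


def profile_csv(text):
    """Compute type, null rate, and cardinality for every column of a CSV string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for column in (rows[0].keys() if rows else []):
        raw = [row[column] for row in rows]
        present = [v for v in raw if v not in ("", None)]
        profile[column] = {
            "type": infer_type(present) if present else "unknown",
            "null_rate": 1 - len(present) / len(raw),
            "cardinality": len(set(present)),
        }
    return profile


sample = "id,price,city\n1,9.5,Oslo\n2,,Oslo\n3,12,Bergen\n"
report = profile_csv(sample)
```

In this sample, `price` is inferred as a float column with one missing value out of three, and `city` has a cardinality of two.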

VideoPipeline (VideoSummarizer.Core)

Video content extraction and analysis:

  • Scene Detection: ML-based shot boundary detection
  • Keyframe Extraction: Representative frames per scene
  • Audio Transcription: Whisper integration for speech-to-text
  • Frame Sampling: Configurable intervals for analysis

Intelligent Features

Agentic Query Decomposition (Sentinel)

The Sentinel service transforms user queries into optimized search plans:

User: "Compare the authentication approaches in the 2023 and 2024 security audits"
       │
       └── Sentinel Analysis
              ├── Query Type: Comparison
              ├── Sub-queries:
              │     ├── "authentication approach 2023 security audit"
              │     └── "authentication approach 2024 security audit"
              └── Fusion Strategy: Side-by-side comparison

Features:

  • Query classification (keyword, semantic, comparison, aggregation)
  • Automatic sub-query generation
  • Clarification requests for ambiguous queries
  • 15-minute query plan caching
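
A time-limited plan cache could look roughly like the sketch below. `QueryPlan` mirrors the fields in the Sentinel example above, but both classes, the whitespace and case normalization of the cache key, and the injectable clock are assumptions made for illustration; only the 15-minute lifetime comes from the feature list.

```python
import time
from dataclasses import dataclass


@dataclass
class QueryPlan:
    """Hypothetical shape of a search plan, modeled on the example above."""
    query_type: str
    sub_queries: list
    fusion_strategy: str


class PlanCache:
    """Time-limited cache keyed by the normalized query text."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    @staticmethod
    def _key(query):
        # Assumed normalization: case-insensitive, collapsed whitespace.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        hit = self._entries.get(key)
        if hit is None:
            return None
        stored_at, plan = hit
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: force a fresh decomposition
            return None
        return plan

    def put(self, query, plan):
        self._entries[self._key(query)] = (self._clock(), plan)
```

Caching the plan rather than the search results means repeated questions skip the decomposition step while still running against the current index.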

Domain Specialists (Content Intelligence)

During ingestion, lucidRAG runs plugin-based domain enrichment over chunked content:

| Specialist | Focus | Example Signals |
|---|---|---|
| financial | Earnings/markets/filings | financial.tickers_mentioned, financial.sentiment, document subtype |
| technical | Papers/docs/methodology | citation counts, bibliography/DOI presence, methodology terms |
| narrative | Fiction/literary text | character graph, dialogue density, genre/perspective, narrative subtype |

These specialists are auto-registered and selected by confidence threshold, or forced via explicit domain hints when you already know the corpus type.
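
The selection rule described here (confidence threshold, with an explicit hint as an override) can be sketched as follows. The function name, the `0.6` threshold, and the sample scores are assumptions; the project's real default threshold is not stated in this README.

```python
def select_specialists(scores, threshold=0.6, domain_hint=None):
    """Pick which domain specialists should enrich a chunk.

    scores: mapping of specialist name -> classifier confidence in [0, 1].
    threshold: assumed cut-off, not the project's documented default.
    domain_hint: when given, bypasses the threshold and forces that specialist.
    """
    if domain_hint is not None:
        if domain_hint not in scores:
            raise KeyError(f"unknown specialist {domain_hint!r}")
        return [domain_hint]
    chosen = [name for name, confidence in scores.items() if confidence >= threshold]
    # Most confident specialist first.
    return sorted(chosen, key=lambda name: -scores[name])


scores = {"financial": 0.82, "technical": 0.64, "narrative": 0.10}
```

With these sample scores, the default call selects `financial` and `technical`, while passing `domain_hint="narrative"` forces the narrative specialist even though its confidence is low.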

Model Specialists (LLM Routing)

lucidRAG also routes LLM calls by task tier using named providers:

| Tier | Default Provider | Primary Model | Typical Use |
|---|---|---|---|
| triage | fast-local | tinyllama | query classification, decomposition, routing |
| general | general | claude-sonnet (fallback gpt-4o-mini) | standard RAG answers/summaries |
| synthesis | smart | claude-opus (fallback gpt-4o) | complex synthesis, multi-hop reasoning |
| vision | vision | minicpm-v:8b (fallback gpt-4o) | image captioning and vision tasks |

For DoomSummarizer defaults, synthesis uses gemma3:4b and sentinel/triage uses qwen3:0.6b.
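
Tier-based routing with a fallback model might be wired up along these lines. The tier, provider, and model names are transcribed from the tier table above; the `TIERS` layout, the `complete` function, and the caller-supplied `call_model` callback are hypothetical and stand in for the real provider abstraction.

```python
# Tier table transcribed from the README; the data layout is an assumption.
TIERS = {
    "triage": {"provider": "fast-local", "models": ["tinyllama"]},
    "general": {"provider": "general", "models": ["claude-sonnet", "gpt-4o-mini"]},
    "synthesis": {"provider": "smart", "models": ["claude-opus", "gpt-4o"]},
    "vision": {"provider": "vision", "models": ["minicpm-v:8b", "gpt-4o"]},
}


def complete(tier, prompt, call_model):
    """Try the tier's primary model, then its fallback, in order.

    call_model(model, prompt) is a caller-supplied function (hypothetical)
    that returns text or raises an exception on failure.
    """
    errors = []
    for model in TIERS[tier]["models"]:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # broad on purpose: any backend failure triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the text makes it easy to log which tier member actually answered, which helps when tracking how often the fallback is used.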

GraphRAG Knowledge Graph

Entity extraction with community detection for connected knowledge:

Documents → Entity Extraction → Relationship Building → Community Detection
                │                      │                       │
                ├── Person            ├── works_at           ├── Louvain clustering
                ├── Organization      ├── located_in         ├── LLM summarization
                ├── Location          ├── related_to         └── Visual exploration
                └── Concept           └── mentions

Interactive Visualization: D3.js force-directed graph with:

  • Node sizing by connection count
  • Edge weights showing relationship strength
  • Community coloring for clusters
  • Click-through to source documents
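
As a small illustration of "node sizing by connection count", the sketch below counts how many relationships touch each entity and maps that degree to a display radius. The triples are made-up examples, and the function names and radius formula are arbitrary choices, not the visualization's actual code (which is described as D3.js).

```python
from collections import Counter

# (source, relation, target) triples; the entities are made-up examples.
triples = [
    ("Ada", "works_at", "Acme"),
    ("Bob", "works_at", "Acme"),
    ("Acme", "located_in", "Oslo"),
    ("Ada", "mentions", "GraphRAG"),
]


def connection_counts(edges):
    """Count how many relationships touch each entity (its degree)."""
    degree = Counter()
    for source, _relation, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def node_radius(degree, base=4.0, step=2.0):
    """Map a degree to a display radius; base and step are arbitrary choices."""
    return base + step * degree


degrees = connection_counts(triples)
```

Here `Acme` participates in three relationships, so it would be drawn larger than `Bob` or `Oslo`, which each have one.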

Evidence Artifacts

Structured storage for all extracted intelligence:

| Artifact Type | Content |
|---|---|
| ocr_text | Extracted text with per-character confidence |
| ocr_word_boxes | Bounding box coordinates for each word |
| llm_summary | AI-generated content summaries |
| filmstrip | Compressed frame sequences for GIFs/videos |
| key_frame | Representative frames from videos |
| table_csv | Extracted tables as CSV |
| table_json | Table metadata and structure |
| transcript | Audio transcriptions with timestamps |

File Explorer

The new File Explorer provides a full-width document browser with:

  • Natural Language Search: Query documents using conversational language
  • Signal Filters: Filter by hasImages, hasTables, hasCode, dateRange
  • Entity Filters: Filter by extracted entities (Person, Organization, etc.)
  • Community Filters: Filter by GraphRAG community clusters
  • Folder Organization: Virtual folders for document organization
  • Bulk Operations: Select multiple documents for batch actions

Multi-Tenancy

Enterprise-ready tenant isolation:

┌─────────────────────────────────────────────────────┐
│                   LucidRAG Instance                 │
├─────────────────────────────────────────────────────┤
│  tenant_acme (schema)    │  tenant_globex (schema)  │
│  ├── collections         │  ├── collections         │
│  ├── documents           │  ├── documents           │
│  ├── entities            │  ├── entities            │
│  └── tenant-local indexes│  └── tenant-local indexes│
└─────────────────────────────────────────────────────┘

Features:

  • PostgreSQL schema-per-tenant isolation
  • Automatic schema provisioning on first access
  • Domain-based routing (subdomain or path)
  • Per-tenant Qdrant collections (self-hosted) or Typesense collections (SaaS)
  • Role-based access control per tenant
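
Schema-per-tenant routing can be sketched as: resolve the tenant from the request host, turn it into a validated schema name, and provision the schema idempotently. Everything below is a hedged illustration. The `tenant_` prefix follows the diagram above, but the function names, the `lucidrag.com` base domain default, the slug rules, and the exact SQL statements are assumptions rather than the project's implementation.

```python
import re

# Conservative slug rule (assumed): lowercase letter first, then letters, digits, underscores.
_SAFE = re.compile(r"^[a-z][a-z0-9_]{0,50}$")


def tenant_from_host(host, base_domain="lucidrag.com"):
    """Extract the tenant slug from a subdomain such as acme.lucidrag.com."""
    host = host.lower().split(":")[0]  # drop any port
    suffix = "." + base_domain
    if not host.endswith(suffix):
        raise ValueError(f"{host!r} is not under {base_domain!r}")
    return host[: -len(suffix)]


def schema_name(tenant):
    """Build the schema name, rejecting anything unsafe to embed in SQL."""
    slug = tenant.replace("-", "_")
    if not _SAFE.match(slug):
        raise ValueError(f"invalid tenant slug {tenant!r}")
    return f"tenant_{slug}"


def provisioning_sql(tenant):
    """Statements a first request might run; CREATE ... IF NOT EXISTS is idempotent."""
    schema = schema_name(tenant)
    return [
        f'CREATE SCHEMA IF NOT EXISTS "{schema}"',
        f'SET search_path TO "{schema}"',
    ]
```

Validating the slug before it is interpolated matters because schema names cannot be passed as bound parameters, so the allow-list is the only guard against injection through the host header.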

API Reference

| Endpoint | Methods | Description |
|---|---|---|
| /api/chat | POST, GET | Conversational AI with memory |
| /api/search | POST | Stateless semantic search |
| /api/documents | GET, POST, DELETE | Document CRUD |
| /api/explorer | GET | File browser with filters |
| /api/collections | CRUD | Collection management |
| /api/folders | CRUD | Virtual folder organization |
| /api/graph | GET | Knowledge graph data |
| /api/communities | GET, POST | Community detection |
| /api/evidence | GET | Artifact retrieval |
| /api/tenants | CRUD | Multi-tenant management |
| /api/ingestion | CRUD | Source management (GitHub, S3, FTP) |
| /api/crawl | POST, GET | Web crawling |

OpenAPI Documentation: /scalar/v1


Configuration

Embedding Backend

{
  "DocSummarizer": {
    "EmbeddingBackend": "Onnx",  // Onnx (local), Ollama, OpenAI, Anthropic
    "BertRag": {
      "VectorStore": "Qdrant",   // Qdrant (self-hosted), DuckDB
      "CollectionName": "ragdocs"
    }
  }
}

Typesense (SaaS / Production):

{
  "Typesense": {
    "Host": "localhost",
    "Port": "8108",
    "ApiKey": "your-api-key",
    "CollectionName": "lucidrag_evidence",
    "DefaultAlpha": 0.3
  }
}
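
Assuming `DefaultAlpha` is passed through as the `alpha` option of a Typesense hybrid query, it sets the weight of the semantic (vector) rank, with the remaining `1 - alpha` going to the keyword rank. The sketch below reproduces that weighted reciprocal-rank arithmetic as we understand it; the function name is hypothetical, and the exact scoring inside Typesense should be checked against its documentation.

```python
def hybrid_score(keyword_rank, semantic_rank, alpha=0.3):
    """Weighted reciprocal-rank fusion of one document's two ranks.

    Ranks start at 1; pass None when the document is missing from a result list.
    alpha weights the semantic rank and (1 - alpha) the keyword rank.
    """
    keyword_part = (1 - alpha) / keyword_rank if keyword_rank else 0.0
    semantic_part = alpha / semantic_rank if semantic_rank else 0.0
    return keyword_part + semantic_part
```

With the default of 0.3, a document ranked first by keyword search alone scores about 0.7, while one ranked first by vector search alone scores about 0.3, so keyword matches dominate unless `alpha` is raised.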

LLM Provider

Local (Ollama):

{
  "DocSummarizer": {
    "LlmBackend": "Ollama",
    "Ollama": {
      "BaseUrl": "http://localhost:11434",
      "Model": "qwen2.5:3b"
    }
  }
}

Cloud (Anthropic/OpenAI):

{
  "DocSummarizer": {
    "LlmBackend": "Anthropic",
    "Anthropic": { "Model": "claude-3-5-sonnet-latest" }
  }
}

Unified LLM Providers (Advanced)

For multi-provider setups with named instances and resilience, use the YAML-based configuration:

# Config/llm-providers.yaml
backends:
  anthropic:
    type: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    max_retries: 3

providers:
  fast-local:
    model: tinyllama
  general:
    model: claude-sonnet
    fallback: gpt-4o-mini
  smart:
    model: claude-opus

Features:

  • Named providers with tier-based selection (triage, general, synthesis, vision)
  • Polly resilience (retry with exponential backoff, circuit breaker)
  • OpenTelemetry observability (tracing, metrics)
  • Named prompt library with provider-specific overrides
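
The resilience features listed above are attributed to Polly, a .NET library. To illustrate the two patterns themselves, here is a language-neutral sketch in Python of retry with exponential backoff and a simple circuit breaker. The class and function names, thresholds, and delays are assumptions, apart from `max_retries=3`, which echoes the YAML example; this is not Polly's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("backend temporarily disabled")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # success closes the circuit
        return result


def retry(func, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying up to max_retries times with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # do not hammer an open circuit
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The two compose naturally: wrapping `breaker.call(...)` inside `retry(...)` retries transient failures, while the breaker stops all traffic once a backend is clearly down.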

See docs/UNIFIED_LLM_PROVIDERS.md for complete documentation.

DoomSummarizer Model Roles (Default)

{
  "ollama": {
    "model": "gemma3:4b",
    "sentinelModel": "qwen3:0.6b"
  }
}

Cloud providers are used only when explicitly enabled; local Ollama remains the primary backend by default.


CLI Tools

DoomSummarizer CLI (Standalone Research Assistant)

A distillation of lucidRAG into a single-binary CLI: a console-first research assistant and personal knowledge base that fetches, ranks, and synthesizes content from 30+ sources with local ONNX embeddings. No API keys are required.

doomsummarizer scroll "AI security news" -v snarky     # Digest with tone
doomsummarizer crawl https://docs.example.com -n docs    # Build a KB named "docs" (entities by default)
doomsummarizer ask -s crawl:docs "how does auth work?"   # Query your KB
doomsummarizer scroll "Rust vs Go" -t deep-dive -o comparison.md  # Long-form article

Full documentation and downloads: see the DoomSummarizer documentation and the Releases page.

lucidRAG CLI (Fully OSS)

The CLI ships with all search engines built in — ONNX embeddings, BM25 (Lucene.NET), Qdrant vector search, and RRF fusion all work out of the box. No commercial plugins or API keys required.

# Process files (auto-routes by extension)
lucidrag-cli process document.pdf image.gif data.csv --collection mydata

# Search
lucidrag-cli search "authentication best practices" --collection mydata

# Interactive chat
lucidrag-cli chat --collection mydata

# Run web server
lucidrag-cli serve --port 5080

ImageSummarizer CLI (Standalone)

A powerful image analysis tool with MCP server support:

# Install globally
dotnet tool install -g Mostlylucid.ImageSummarizer.Cli

# Analyze image
imagesummarizer screenshot.png

# Process animated GIF
imagesummarizer animation.gif --pipeline advancedocr

# Run as MCP server for Claude Desktop
imagesummarizer --mcp

MCP Tools (9 available): extract_text_from_image, analyze_image_quality, list_ocr_pipelines, batch_extract_text, summarize_animated_gif, generate_caption, generate_detailed_description, analyze_with_template, list_output_templates


NuGet Packages

This repo publishes a set of reusable NuGet packages (pipelines, storage, and integrations) that power lucidRAG and DoomSummarizer.

  • Package list + install guide: docs/NUGET_PACKAGES.md
  • Canonical docs index: docs/DOCS-INDEX.md
  • lucidRAG retrieval deep dive: docs/LUCIDRAG_RETRIEVAL_SYSTEM.md

Development

# Build solution
dotnet build LucidRAG.sln

# Run with hot reload
dotnet watch run --project src/LucidRAG/LucidRAG.csproj

# Build CSS (Tailwind + DaisyUI)
cd src/LucidRAG && npm install && npm run build:css

# Run tests
dotnet test LucidRAG.sln -c Release --filter "Category!=Browser&Category!=Integration"

# Integration tests (requires PostgreSQL + migrations / test DB)
dotnet test src/LucidRAG.Tests/LucidRAG.Tests.csproj -c Release --filter "Category=Integration"

# Browser tests (requires a running app + downloads Chromium via Puppeteer)
dotnet test src/LucidRAG.Tests/LucidRAG.Tests.csproj -c Release --filter "Category=Browser"

Requirements

| Component | Version | Notes |
|---|---|---|
| .NET SDK | 10.0+ | Required |
| PostgreSQL | 16+ | Or SQLite for standalone |
| Node.js | 18+ | For CSS build only |

Optional Services:

  • Ollama — Local LLM inference (DoomSummarizer defaults: gemma3:4b + qwen3:0.6b)
  • Qdrant — Self-hosted vector storage
  • Typesense — Unified hybrid search engine (BM25 + semantic vectors in one engine) — used in SaaS production
  • Docling — Enhanced PDF/DOCX parsing

Architecture

src/
├── LucidRAG/                          # Web application
├── LucidRAG.Cli/                      # Command-line tool
├── LucidRAG.Core/                     # Business logic & entities
├── LucidRAG.Tests/                    # Integration tests
│
├── Mostlylucid.Summarizer.Core/       # Pipeline interfaces
├── Mostlylucid.DocSummarizer.Core/    # Document processing
├── ImageSummarizer.Core/              # Image analysis (22 waves)
├── DataSummarizer.Core/               # Structured data profiling
├── VideoSummarizer.Core/              # Video processing
├── AudioSummarizer.Core/              # Audio analysis + transcription
├── DomainClassifier.Core/             # Plugin registry + enrichment orchestration
├── DomainClassifier.Financial/        # Financial specialist plugin
├── DomainClassifier.Technical/        # Technical/academic specialist plugin
├── DomainClassifier.Narrative/        # Narrative specialist plugin
│
├── Mostlylucid.DocSummarizer.Anthropic/  # Claude integration
├── Mostlylucid.DocSummarizer.OpenAI/     # OpenAI integration
├── LucidRAG.LLM/                         # Unified LLM providers + prompt library
│
├── Mostlylucid.GraphRag/              # Entity extraction & graphs
├── Mostlylucid.RAG/                   # Vector store abstraction
│
└── DoomSummarizer/                     # Console-first local research assistant

CI/CD

| Workflow | Trigger | Output |
|---|---|---|
| build.yml | PR/Push | Tests with PostgreSQL containers |
| release-lucidrag.yml | lucidrag-v* tag | Docker multi-arch (amd64/arm64) |
| release-lucidrag-cli.yml | cli-v* tag | CLI binaries |
| release-imagesummarizer.yml | img-v* tag | ImageSummarizer releases |
| publish-docsummarizer-nuget.yml | Manual | NuGet packages |

License

The Unlicense: see LICENSE


Contributing

lucidRAG is in active development and we welcome contributions! Please check the Issues for areas where help is needed.

