🚀 SA-Doc-Generator

An AI-powered documentation agent that ingests GitHub repositories and supporting documents, builds a semantic knowledge graph, and generates structured technical documentation using RAG (Retrieval-Augmented Generation).

Overview

SA-Doc-Generator combines vector embeddings (ChromaDB) with a graph knowledge base (Neo4j), so queries can draw on both semantic similarity and structural relationships in the code. It can:

Ingest and index GitHub repositories with full AST parsing
Build relationship graphs (imports, calls, contains) for code navigation
Answer questions about codebases with context-aware responses
Generate technical specifications from templates
Visualize code architecture and flows as Mermaid diagrams

The application runs as an interactive chat interface powered by Chainlit.

Architecture

flowchart TB
    subgraph UI["Chainlit UI :8090"]
        Commands["/ingest, /generate, /ask..."]
        Flows["Multi-step Flows"]
        Handlers["File Upload & Actions"]
        Intent["Intent Detection"]
    end

    subgraph Core["Core Services"]
        LLM["LLM (OpenAI)"]
        RAG["RAG Engine"]
        Templates["Template Manager"]
        Steering["Steering Manager"]
    end

    subgraph GraphKB["Graph Knowledge Base"]
        Ingestion["Ingestion<br/>(AST Parsing)"]
        Retrieval["Retrieval<br/>(Graph Traversal)"]
        Viz["Visualization<br/>(Mermaid)"]
    end

    subgraph Storage["Storage Layer"]
        ChromaDB[("ChromaDB :8091<br/>Vector Embeddings")]
        Neo4j[("Neo4j :7474/:7687<br/>Graph Relationships")]
    end

    UI --> Core
    Core --> GraphKB
    RAG --> ChromaDB
    Retrieval --> Neo4j
    Ingestion --> ChromaDB
    Ingestion --> Neo4j
    Viz --> Neo4j

Quick Start

Prerequisites

Python 3.11+ (required for proper type hint support)
Docker and Docker Compose
OpenAI API key
(Optional) GitHub token for private repositories

1. Clone and Configure

git clone <repo-url>
cd sa-doc-generator

# Set up environment variables
cp .env.example .env

Edit .env with your credentials:

OPENAI_API_KEY=sk-your-key-here
GITHUB_TOKEN=ghp_your-token-here  # Optional, for private repos
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

2. Restore Pre-indexed Data (Optional)

If backups are available, restore the pre-indexed REPO-NAME codebase:

./scripts/restore_databases.sh

This loads vector embeddings and graph relationships so you can start querying immediately.

3. Start the Application

docker compose up -d

4. Access the UI

Chat Interface: http://localhost:8090
Neo4j Browser: http://localhost:7474 (user: neo4j, password: password)

Services

Service	Port	Description
agent	8090	Chainlit chat application
chromadb	8091	Vector database for embeddings
neo4j	7474	Graph database browser (HTTP)
neo4j	7687	Graph database (Bolt protocol)

Chat Commands

Repository Ingestion

Command	Description
`/ingest [url]`	Index a GitHub repository (dual-write to ChromaDB + Neo4j)
`/status [url]`	Check ingestion progress
`/diff [url]`	Check for repository updates
`/list_repos`	List all indexed repositories

Code Q&A

Command	Description
`/ask [question]`	Ask questions about indexed code
`/search [query]`	Search repository contents
`/architecture [repo]`	Generate architecture overview

Graph Visualization

Command	Description
`/visualize [repo] [type] [options]`	Generate interactive graph visualizations
`/graph_stats [repo]`	Display graph statistics (node/edge counts, depths)

Visualization Types:

architecture — Directory structure and file organization
calls — Function/method call relationships
dependencies — File import/dependency relationships
full — Combined view of architecture, calls, and dependencies
comprehensive — Single unified graph with all nodes and edges
call_chain — Trace calls from/to a specific symbol
hotspots — Most connected symbols (potential refactoring targets)
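
As a rough illustration of what the calls and hotspots views compute, the sketch below turns a list of call edges into Mermaid flowchart text and ranks symbols by how many edges touch them. The function names `to_mermaid` and `hotspots` are hypothetical, the edge list is made up, and symbol names are assumed to be plain identifiers that are safe to use as Mermaid node IDs; the project's own visualization code may work differently.

```python
from collections import Counter


def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render (caller, callee) pairs as a top-to-bottom Mermaid flowchart."""
    lines = ["flowchart TB"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


def hotspots(edges: list[tuple[str, str]], limit: int = 3) -> list[tuple[str, int]]:
    """Return the most connected symbols, counting incoming and outgoing edges."""
    degree: Counter[str] = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Sort by degree (descending), then by name so ties are deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:limit]


# Hypothetical call edges for illustration only.
example_edges = [("main", "load"), ("main", "save"), ("load", "parse")]
print(to_mermaid(example_edges))
print(hotspots(example_edges, limit=2))
```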

Options:

--symbol=<name> — Symbol name for call_chain (required)
--direction=outgoing|incoming — Call direction (default: outgoing)
--depth=<n> — Max traversal depth (default: 30)
--limit=<n> — Max results
--kinds=function,method,class — Filter by symbol kinds

Examples:

/visualize my_repo calls
/visualize my_repo hotspots
/visualize my_repo call_chain --symbol=process_request
/visualize my_repo call_chain --symbol=save --direction=incoming
/visualize my_repo calls --kinds=function,method --depth=20
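
The command syntax above can be read as two positional arguments followed by `--name=value` options. The following is a minimal parsing sketch under that reading; `parse_visualize` and `VisualizeRequest` are hypothetical names, not the project's API, and only the documented rules (known types, `--symbol` required for call_chain, `outgoing` as the default direction) are enforced.

```python
import shlex
from dataclasses import dataclass, field

VALID_TYPES = {
    "architecture", "calls", "dependencies", "full",
    "comprehensive", "call_chain", "hotspots",
}


@dataclass
class VisualizeRequest:
    repo: str
    viz_type: str
    options: dict[str, str] = field(default_factory=dict)


def parse_visualize(command: str) -> VisualizeRequest:
    """Parse '/visualize <repo> <type> [--name=value ...]' into a request."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "/visualize":
        raise ValueError("not a /visualize command")
    positional = [t for t in tokens[1:] if not t.startswith("--")]
    flags = [t for t in tokens[1:] if t.startswith("--")]
    if len(positional) != 2:
        raise ValueError("expected: /visualize <repo> <type> [options]")
    repo, viz_type = positional
    if viz_type not in VALID_TYPES:
        raise ValueError(f"unknown visualization type: {viz_type}")
    options: dict[str, str] = {}
    for flag in flags:
        name, separator, value = flag[2:].partition("=")
        if not separator or not value:
            raise ValueError(f"option must look like --name=value: {flag}")
        options[name] = value
    # Documented rule: call_chain needs a symbol to start from.
    if viz_type == "call_chain" and "symbol" not in options:
        raise ValueError("call_chain requires --symbol=<name>")
    # Documented default for the call direction.
    options.setdefault("direction", "outgoing")
    return VisualizeRequest(repo, viz_type, options)
```

For example, `parse_visualize("/visualize my_repo call_chain --symbol=save --direction=incoming")` yields a request for `my_repo` with the `symbol` and `direction` options set.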

Document Management

Command	Description
`/upload`	Upload supporting documents (PDFs, etc.)
`/list_docs [carrier]`	List uploaded documents
`/view_doc [filename]`	View document contents
`/delete_doc [filename]`	Delete a document

Generation

Command	Description
`/generate [type] [topic]`	Generate documentation from templates
`/prompts`	List available generation prompts
`/add_template`	Upload a new template

Steering Documents

Command	Description
`/add_steering`	Add a steering document
`/list_steering`	List steering documents
`/edit_steering [name]`	Edit a steering document
`/remove_steering [name]`	Remove a steering document

Utility

Command	Description
`/help`	Show help information
`/menu`	Show interactive command menu

You can also use natural language — the intent detector will map your request to the appropriate command.

Project Structure

sa-doc-generator/
├── src/
│   ├── app.py                 # Chainlit entry point
│   ├── context.py             # Application context (services)
│   ├── commands/              # Slash command implementations
│   │   ├── ingest.py          # /ingest, /status, /diff, /list_repos
│   │   ├── ask_code.py        # /ask, /search, /architecture
│   │   ├── generate.py        # /generate, /prompts
│   │   ├── visualize.py       # /visualize
│   │   ├── docs.py            # /upload, /list_docs, /view_doc
│   │   ├── steering.py        # Steering document commands
│   │   └── menu.py            # /menu
│   ├── core/
│   │   ├── llm.py             # OpenAI LLM wrapper
│   │   ├── rag.py             # RAG engine (retrieval + generation)
│   │   ├── embedding.py       # Embedding service
│   │   ├── template_manager.py # Jinja2 template handling
│   │   ├── steering_manager.py # Steering document management
│   │   ├── intent_detector.py  # Natural language → command mapping
│   │   └── prompt_manager.py   # Prompt template management
│   ├── graph_kb/              # Graph Knowledge Base
│   │   ├── ingestion/         # Repository ingestion (AST parsing)
│   │   ├── retrieval/         # Graph traversal and search
│   │   ├── storage/           # Neo4j and metadata stores
│   │   ├── visualization/     # Mermaid diagram generation
│   │   ├── analysis/          # Code analysis utilities
│   │   └── models/            # Data models and enums
│   ├── flows/                 # Multi-step conversation flows
│   ├── handlers/              # File upload and action handlers
│   ├── ingestion/             # Document ingestion (PDF, etc.)
│   ├── storage/               # Vector store abstractions
│   ├── steering/              # Default steering documents
│   ├── templates/             # Generation templates
│   └── utils/                 # Logging, Mermaid rendering
├── scripts/
│   ├── backup_databases.sh    # Backup ChromaDB + Neo4j
│   └── restore_databases.sh   # Restore from backups
├── config/
│   └── settings.yaml          # Application configuration
├── docker-compose.yml
├── Dockerfile.api
└── requirements.txt

Configuration

config/settings.yaml

# Vector database
vector_db_path: "./chroma_db"
storage_path: "./output_docs"

# LLM
openai_model: "gpt-4o"

# Embeddings (local model)
embedding_model: "jinaai/jina-embeddings-v3"
dimensions: 1024
embedding_device: "cpu"

# Chunking
chunk_size: 1000
chunk_overlap: 200

# Graph traversal depth
max_depth: 10
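
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which the settings file does not state, and it ignores the semantic boundaries the real chunker is described as using; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.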

Environment Variables

Variable	Required	Description
`OPENAI_API_KEY`	Yes	OpenAI API key for LLM
`GITHUB_TOKEN`	No	GitHub token for private repos
`NEO4J_URI`	No	Neo4j connection URI (default: bolt://neo4j:7687)
`NEO4J_USER`	No	Neo4j username (default: neo4j)
`NEO4J_PASSWORD`	No	Neo4j password (default: password)
`CHROMA_SERVER_HOST`	No	ChromaDB host (default: chromadb)
`CHROMA_SERVER_PORT`	No	ChromaDB port (default: 8000)
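
The table above maps naturally onto a small settings loader. The sketch below is one way to read these variables with the documented defaults; `Settings` and `load_settings` are hypothetical names, and the application's real configuration code may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    github_token: Optional[str]
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    chroma_host: str
    chroma_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # The only variable the table marks as required.
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        github_token=env.get("GITHUB_TOKEN"),
        neo4j_uri=env.get("NEO4J_URI", "bolt://neo4j:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", "password"),
        chroma_host=env.get("CHROMA_SERVER_HOST", "chromadb"),
        chroma_port=int(env.get("CHROMA_SERVER_PORT", "8000")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.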

Database Backup & Restore

The application uses ChromaDB (vector embeddings) and Neo4j (graph knowledge base) for persistent storage. Scripts are provided to capture and restore database state.

Creating a Backup

# Stop containers for consistent backup
docker compose down

# Create backup
./scripts/backup_databases.sh

# Restart
docker compose up -d

The script creates timestamped archives in ./backups/:

neo4j_backup_YYYYMMDD_HHMMSS.tar.gz
chromadb_backup_YYYYMMDD_HHMMSS.tar.gz
backup_manifest_YYYYMMDD_HHMMSS.json
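
The naming scheme above uses a fixed-width timestamp, so names sort chronologically as plain strings. The sketch below shows how such names could be produced and how the newest backup could be found from the manifest files; `backup_names` and `latest_timestamp` are hypothetical helpers, and how the shell scripts actually select the latest backup is an assumption.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"
MANIFEST_PATTERN = re.compile(r"^backup_manifest_(\d{8}_\d{6})\.json$")


def backup_names(when: datetime) -> list[str]:
    """Return the three archive names for a backup taken at the given time."""
    stamp = when.strftime(TIMESTAMP_FORMAT)
    return [
        f"neo4j_backup_{stamp}.tar.gz",
        f"chromadb_backup_{stamp}.tar.gz",
        f"backup_manifest_{stamp}.json",
    ]


def latest_timestamp(filenames: Iterable[str]) -> Optional[str]:
    """Pick the newest timestamp among manifest files, or None if there are none."""
    stamps = []
    for name in filenames:
        match = MANIFEST_PATTERN.match(name)
        if match:
            stamps.append(match.group(1))
    # Fixed-width timestamps compare correctly as strings.
    return max(stamps, default=None)
```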

Restoring from Backup

# Restore latest backup
./scripts/restore_databases.sh

# Restore specific backup
./scripts/restore_databases.sh --timestamp 20241226_143000

# List available backups
./scripts/restore_databases.sh --list

Options:

--latest — Restore from latest backup (default)
--timestamp YYYYMMDD_HHMMSS — Restore specific backup
--neo4j-only — Only restore Neo4j
--chromadb-only — Only restore ChromaDB

Quick Start with Pre-loaded Data

# 1. Clone and configure
git clone <repo-url> && cd <repo-name>
cp .env.example .env  # Edit with your OPENAI_API_KEY

# 2. Restore databases
./scripts/restore_databases.sh

# 3. Start application
docker compose up -d

Development

Local Development (without Docker)

# Install dependencies
pip install -r requirements.txt

# Start Neo4j and ChromaDB separately, then:
chainlit run src/app.py -w

Rebuild After Code Changes

docker compose down
docker compose up --build -d

View Logs

docker compose logs -f agent

Data Storage

Local SQLite Databases

The application uses two local SQLite databases for metadata and state management:

File	Purpose
`graph_kb_metadata.db`	Graph KB ingestion state (repos, files, chunks, progress)
`metadata.db`	Document ingestion state

graph_kb_metadata.db tracks:

Repository status — which repos are indexed, their status (pending/indexing/ready/error/paused), last commit
File-level checkpoints — which files within a repo have been processed (enables pause/resume)
Pending chunks — chunks parsed but not yet embedded (resume from embedding phase)
Failed chunks — chunks that failed embedding (can retry without re-indexing)
Live progress — real-time indexing progress that persists across page refreshes

This enables:

Pause and resume long-running ingestion jobs
Checkpoint recovery if the process crashes mid-ingestion
Accurate progress reporting via /status
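
File-level checkpoints of this kind can be sketched with the standard `sqlite3` module. The table name, columns, and helper functions below are hypothetical and much simpler than whatever schema graph_kb_metadata.db really uses; the point is only to show how recording finished files makes resume possible.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_checkpoints (
    repo TEXT NOT NULL,
    path TEXT NOT NULL,
    PRIMARY KEY (repo, path)
);
"""


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint database and ensure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def mark_processed(conn: sqlite3.Connection, repo: str, path: str) -> None:
    """Record that a file has been fully processed; safe to call twice."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO file_checkpoints (repo, path) VALUES (?, ?)",
            (repo, path),
        )


def pending_files(conn: sqlite3.Connection, repo: str, all_files: list[str]) -> list[str]:
    """Return the files that still need processing, in their original order."""
    rows = conn.execute(
        "SELECT path FROM file_checkpoints WHERE repo = ?", (repo,)
    )
    done = {row[0] for row in rows}
    return [path for path in all_files if path not in done]
```

After a crash or a pause, calling `pending_files` with the full file list gives the work that remains, so files already checkpointed are not parsed again.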

Docker Volumes

When running with Docker Compose, the main data is stored in named volumes:

Volume	Contents
`chromadb_data`	Vector embeddings
`neo4j_data`	Graph database
`neo4j_logs`	Neo4j logs
`repo_data`	Cloned repositories
`hf_cache`	HuggingFace model cache

Use the backup/restore scripts to capture and share the ChromaDB and Neo4j volumes.

How It Works

Ingestion Pipeline

Clone: Repository is cloned to local storage
AST Parsing: Python files are parsed to extract functions, classes, imports
Chunking: Code is split into semantic chunks
Embedding: Chunks are embedded using the configured model
Dual Write:
- ChromaDB: Vector embeddings for semantic search
- Neo4j: Graph nodes and relationships (CALLS, IMPORTS, CONTAINS)
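
The AST parsing step can be illustrated with Python's built-in `ast` module. This sketch extracts function names, class names, imported modules, and direct calls to plain names; `extract_symbols` is a hypothetical helper, and the project's own parser presumably handles more cases (method calls through attributes, nested scopes, other languages).

```python
import ast


def extract_symbols(source: str) -> dict[str, list]:
    """Collect functions, classes, imports, and simple calls from Python source."""
    tree = ast.parse(source)
    result: dict[str, list] = {"functions": [], "classes": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result["functions"].append(node.name)
            # Record calls to plain names made anywhere inside this function.
            # Calls in nested functions are attributed to the outer one as well.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    result["calls"].append((node.name, inner.func.id))
        elif isinstance(node, ast.ClassDef):
            result["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            result["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Keep leading dots so relative imports stay distinguishable.
            result["imports"].append("." * node.level + (node.module or ""))
    return result
```

Each `(caller, callee)` pair would correspond to a CALLS edge, each import to an IMPORTS edge, and each function or class to a node contained in its file.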

Retrieval Pipeline

Vector Search: Query is embedded and matched against ChromaDB
Graph Expansion: Initial matches are expanded via Neo4j relationships (up to N hops)
Ranking: Results are ranked by vector similarity + graph distance + file bonuses
Token Pruning: Context is limited to fit the LLM context window (~8000 tokens)
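
The expansion, ranking, and pruning steps can be sketched on an in-memory graph. Everything here is illustrative: the adjacency dictionary stands in for Neo4j, the `hop_penalty` weight is an invented parameter, file bonuses are left out, and token counts are supplied by the caller rather than measured with a tokenizer.

```python
from collections import deque


def expand(graph: dict[str, list[str]], seeds: dict[str, float],
           max_hops: int) -> dict[str, tuple[float, int]]:
    """Breadth-first expansion from vector hits.

    Returns node -> (similarity inherited from a seed, hop distance).
    """
    reached = {node: (score, 0) for node, score in seeds.items()}
    queue = deque((node, score, 0) for node, score in seeds.items())
    while queue:
        node, score, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not travel further than the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in reached:
                reached[neighbor] = (score, hops + 1)
                queue.append((neighbor, score, hops + 1))
    return reached


def rank(reached: dict[str, tuple[float, int]], hop_penalty: float = 0.2) -> list[str]:
    """Order nodes by similarity minus a penalty per hop (best first)."""
    scored = {node: score - hop_penalty * hops for node, (score, hops) in reached.items()}
    return sorted(scored, key=lambda node: (-scored[node], node))


def prune(ranked: list[str], token_counts: dict[str, int], budget: int = 8000) -> list[str]:
    """Keep the best-ranked nodes whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for node in ranked:
        cost = token_counts[node]
        if used + cost > budget:
            continue  # skip this chunk but still try smaller ones further down
        kept.append(node)
        used += cost
    return kept
```

Skipping an oversized chunk instead of stopping at it is a design choice: it lets smaller, lower-ranked chunks fill the remaining budget.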

Generation Pipeline

Template Selection: User selects a generation template
Context Retrieval: Relevant code/docs are retrieved via RAG
Steering: Steering documents provide additional instructions
LLM Generation: The configured OpenAI model (gpt-4o in the default settings.yaml) generates the document
Output: Markdown document saved to output_docs/
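
One way to picture how the pieces of this pipeline meet is a single prompt that places steering instructions first, then the template, then the retrieved context. The sketch below uses the standard library's `string.Template` as a stand-in for the project's Jinja2 templates; the prompt wording and the `build_prompt` helper are assumptions, not the actual prompt used by the application.

```python
from string import Template

# Hypothetical prompt layout: steering first, then the template, then context.
PROMPT = Template(
    "$steering\n\n"
    "Fill in the following template using only the context provided.\n\n"
    "Template:\n$template\n\n"
    "Context:\n$context\n"
)


def build_prompt(template_text: str, context_chunks: list[str],
                 steering_docs: list[str]) -> str:
    """Combine steering documents, a template, and retrieved chunks into one prompt."""
    return PROMPT.substitute(
        steering="\n\n".join(steering_docs),
        template=template_text,
        context="\n---\n".join(context_chunks),
    )
```

The resulting string would be sent to the LLM, and the response saved as a Markdown file.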
