---
title: Architecture (v3)
layout: default
parent: Architecture
nav_order: 1
---
Version: v3.0.0. For the v1 architecture, see v1 Architecture (legacy).
This document describes the technical architecture of codetect v3.0.0.
- Overview
- Core Components
- Data Flow
- Database Schema
- Configuration System
- Performance Optimizations
- Future Enhancements
codetect is an MCP (Model Context Protocol) server that provides fast codebase search capabilities for Claude Code and other LLM tools.
Architecture Principles:
- Hybrid Search: Combine keyword (ripgrep), symbol (ctags), and semantic (embeddings) search
- Local-First: All processing happens locally (no cloud dependencies)
- Database-Agnostic: Support both SQLite (default) and PostgreSQL (production)
- Multi-Repo Isolation: Dimension-grouped tables isolate repos using different embedding models
```
┌───────────────────────────────────────────┐
│            MCP Server (stdio)             │
│           cmd/codetect/main.go            │
└───────────────┬───────────────────────────┘
                │
                ├──▶ Keyword Search (ripgrep)
                │    internal/search/keyword.go
                │
                ├──▶ Symbol Search (ctags + SQLite/PostgreSQL)
                │    internal/search/symbols/
                │    internal/db/
                │
                └──▶ Semantic Search (Ollama/LiteLLM + Embeddings)
                     internal/embedding/
                     internal/search/semantic.go
```
Key Files:
- `internal/mcp/server.go` - MCP protocol implementation
- `internal/tools/tools.go` - Tool registration (search_keyword, get_file, symbols, hybrid_search_v2)
- `internal/search/keyword.go` - Ripgrep integration
- `internal/search/symbols/index.go` - Symbol indexing with ctags
- `internal/embedding/searcher.go` - Semantic search implementation
```
Source Code
    │
    ├──▶ Ctags Extraction
    │    (symbols: functions, classes, types)
    │
    ├──▶ AST Chunking
    │    (split files into semantic chunks)
    │
    └──▶ Embedding Generation
         (Ollama/LiteLLM + vector storage)
```
Indexing Modes:
- Incremental: Only index changed files (default)
- Full: Force re-index all files (`--force` flag)
Chunking Strategy:
- AST-based for supported languages (Go, Python, JavaScript, etc.)
- Line-based fallback for unsupported languages (see the sketch below)
- Configurable chunk size (default: 512 lines)
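A minimal sketch of that line-based fallback, assuming a hypothetical `Chunk` type and the 512-line default from the list above; the real chunker lives in internal code and may differ.

```go
package chunking

import "strings"

// Chunk is an illustrative stand-in for codetect's internal chunk type.
type Chunk struct {
	FilePath  string
	StartLine int // 1-based, inclusive
	EndLine   int // 1-based, inclusive
	Content   string
}

// ChunkByLines splits a file into fixed-size line windows. It is the
// fallback used when no AST-based chunker exists for the language.
func ChunkByLines(path, src string, chunkSize int) []Chunk {
	if chunkSize <= 0 {
		chunkSize = 512 // default from the configuration above
	}
	lines := strings.Split(src, "\n")
	var chunks []Chunk
	for start := 0; start < len(lines); start += chunkSize {
		end := start + chunkSize
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, Chunk{
			FilePath:  path,
			StartLine: start + 1,
			EndLine:   end,
			Content:   strings.Join(lines[start:end], "\n"),
		})
	}
	return chunks
}
```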
v2.0.0 introduces dimension-grouped tables for multi-repo support:
```
┌─────────────────────────────────────────────┐
│               Embedding Store               │
│         internal/embedding/store.go         │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │ repo_embeddings_768                   │  │ ← nomic-embed-text
│  │ (repos using 768-dim embeddings)      │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │ repo_embeddings_1024                  │  │ ← bge-m3
│  │ (repos using 1024-dim embeddings)     │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │ repo_configs                          │  │ ← Model tracking
│  │ (tracks model + dimensions)           │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
```
Why dimension groups?
- Isolation: Different repos can use different models without conflicts
- Performance: Smaller dimension-specific indexes are faster to query
- Flexibility: Easy to experiment with new models per-repo
- Migration: Automatic migration when switching models
Supported Providers:
- Ollama (default): Local embedding server (recommended: bge-m3); a request sketch follows this list
- LiteLLM: OpenAI-compatible API gateway
- Off: Disable semantic search
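For the Ollama path, a request might look like the sketch below. The `/api/embeddings` endpoint and its request/response shape follow the public Ollama API; the `EmbedOllama` wrapper itself is illustrative, not codetect's actual client in `internal/embedding/`.

```go
package embedding

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type embedRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

type embedResponse struct {
	Embedding []float32 `json:"embedding"`
}

// EmbedOllama returns the embedding vector for one chunk of text.
func EmbedOllama(baseURL, model, text string) ([]float32, error) {
	body, err := json.Marshal(embedRequest{Model: model, Prompt: text})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/api/embeddings", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("ollama: unexpected status %s", resp.Status)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embedding, nil
}
```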
v2.0.0 supports two database backends:
| Feature | SQLite | PostgreSQL |
|---|---|---|
| Setup | Zero config | Requires setup |
| Performance (small) | Fast (< 1ms) | Slower (initial overhead) |
| Performance (large) | Linear scan (100ms+) | HNSW index (< 1ms) |
| Multi-repo | Separate DB per repo | Centralized database |
| Deployment | Single-user | Organization-scale |
Database Abstraction:
// internal/db/adapter.go
type DBAdapter interface {
Exec(query string, args ...interface{}) error
Query(query string, args ...interface{}) (*sql.Rows, error)
Dialect() Dialect
}
type Dialect string
const (
DialectSQLite Dialect = "sqlite"
DialectPostgreSQL Dialect = "postgres"
)Why abstraction?
- Swap backends without code changes
- Dialect-specific SQL generation (e.g.,
?vs$1placeholders) - Easy to add new backends (MySQL, DuckDB, etc.)
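As an example of dialect-specific generation, a placeholder rewriter might look like this. `Rebind` is a hypothetical helper built on the `Dialect` type from the snippet above; note that this naive version does not skip `?` inside quoted string literals, which a real implementation must handle.

```go
package db

import (
	"fmt"
	"strings"
)

// Rebind converts SQLite-style `?` placeholders to PostgreSQL's $1, $2, ...
func Rebind(dialect Dialect, query string) string {
	if dialect != DialectPostgreSQL {
		return query // SQLite consumes `?` directly
	}
	var b strings.Builder
	n := 0
	for _, r := range query {
		if r == '?' {
			n++
			fmt.Fprintf(&b, "$%d", n)
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}
```

For example, `Rebind(DialectPostgreSQL, "SELECT name FROM symbols WHERE kind = ?")` yields `SELECT name FROM symbols WHERE kind = $1`.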
```
1. User runs: codetect index

2. Scan directory for files
   ├─ Skip .git/, node_modules/, .codetect/
   ├─ Respect .gitignore patterns
   └─ Filter by extension (code files only)

3. Run ctags on each file
   ├─ Extract symbols (functions, classes, types)
   ├─ Parse ctags output (JSON format)
   └─ Store in database (symbols table)

4. User runs: codetect embed (optional)

5. Chunk files for embedding
   ├─ AST-based chunking (tree-sitter)
   ├─ Fallback to line-based chunking
   └─ Metadata: file path, line range, language

6. Generate embeddings
   ├─ Batch chunks (default: 10 parallel workers)
   ├─ Call embedding provider (Ollama/LiteLLM)
   └─ Store vectors in dimension-grouped table

7. Index complete
   └─ Print stats (symbols, chunks, time)
```
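Step 3 above can be sketched as a small universal-ctags wrapper: `--output-format=json` emits one JSON object per tag, and `--fields=+n` adds line numbers. How codetect invokes ctags internally may differ; this is a minimal standalone version.

```go
package symbols

import (
	"bufio"
	"encoding/json"
	"os/exec"
)

// ctagsTag mirrors the fields universal-ctags emits in JSON mode.
type ctagsTag struct {
	Type string `json:"_type"` // "tag" for real tags, "ptag" for pseudo-tags
	Name string `json:"name"`
	Path string `json:"path"`
	Line int    `json:"line"`
	Kind string `json:"kind"`
}

// ExtractSymbols runs ctags on one file and parses its JSON output,
// one object per line, writing tags to stdout via `-f -`.
func ExtractSymbols(file string) ([]ctagsTag, error) {
	cmd := exec.Command("ctags", "--output-format=json", "--fields=+n", "-f", "-", file)
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	var tags []ctagsTag
	sc := bufio.NewScanner(out)
	for sc.Scan() {
		var t ctagsTag
		if err := json.Unmarshal(sc.Bytes(), &t); err != nil {
			continue // skip malformed lines
		}
		if t.Type == "tag" {
			tags = append(tags, t)
		}
	}
	return tags, cmd.Wait()
}
```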
```
1. Claude Code sends MCP request
   └─ Tool: search_keyword, symbols, or hybrid_search_v2

2. Route to appropriate handler
   ├─ search_keyword   → ripgrep
   ├─ symbols          → SQL query on symbols table
   └─ hybrid_search_v2 → keyword + vector similarity search

3. Execute search
   ├─ Keyword: spawn ripgrep subprocess
   ├─ Symbol: SQL SELECT with LIKE
   └─ Semantic: cosine similarity via SQL

4. Rank and filter results
   ├─ Limit to top_k (default: 20-50)
   ├─ Deduplicate by file path
   └─ Sort by relevance score

5. Return to Claude Code
   └─ JSON response with file paths, line numbers, snippets
```
```
1. User query: "authentication middleware"

2. Parallel execution:
   ├─ Keyword search: ripgrep "authentication.*middleware"
   └─ Semantic search: embedding similarity to "authentication middleware"

3. Reciprocal Rank Fusion (RRF, k = 60)
   ├─ Rank keyword results:  [A:1, B:2, C:3]
   ├─ Rank semantic results: [C:1, A:2, D:3]
   └─ Fuse scores: rrf_score = 1/(k + rank)

4. Combined ranking:
   ├─ A: 1/61 + 1/62 = 0.0325
   ├─ C: 1/63 + 1/61 = 0.0323
   ├─ B: 1/62 + 0    = 0.0161
   └─ D: 0    + 1/63 = 0.0159

5. Return top results
   └─ [A, C, B, D]
```
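The fusion step is small enough to show in full. Below is a compact sketch with the conventional k = 60; input slices are result IDs in rank order, best first. The `FuseRRF` name is illustrative, not the actual internal function.

```go
package search

import "sort"

// FuseRRF combines any number of rankings with Reciprocal Rank Fusion:
// each result accumulates 1/(k + rank) per list it appears in.
func FuseRRF(k int, rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / float64(k+i+1) // ranks are 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool { return scores[ids[a]] > scores[ids[b]] })
	return ids
}

// FuseRRF(60, []string{"A", "B", "C"}, []string{"C", "A", "D"})
// returns [A C B D], matching the worked example above.
```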
Symbols Table:

```sql
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,       -- function, class, type, variable, etc.
    file_path TEXT NOT NULL,
    line INTEGER NOT NULL,
    pattern TEXT,             -- ctags pattern (for verification)
    language TEXT,            -- go, python, javascript, etc.
    repo_root TEXT NOT NULL,  -- /path/to/repo (for multi-repo isolation)
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_kind ON symbols(kind);
CREATE INDEX idx_symbols_repo ON symbols(repo_root);
```

Dimension-Grouped Embedding Tables:
```sql
-- Separate table for each dimension size
CREATE TABLE repo_embeddings_768 (
    id INTEGER PRIMARY KEY,
    repo_root TEXT NOT NULL,
    file_path TEXT NOT NULL,
    chunk_hash TEXT NOT NULL UNIQUE,  -- Content-addressed (SHA-256)
    content TEXT NOT NULL,
    start_line INTEGER,
    end_line INTEGER,
    embedding BLOB NOT NULL,          -- SQLite: raw bytes; PostgreSQL: vector(768)
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_embeddings_768_repo ON repo_embeddings_768(repo_root);
CREATE INDEX idx_embeddings_768_hash ON repo_embeddings_768(chunk_hash);

-- PostgreSQL-specific: HNSW index for fast ANN search
CREATE INDEX idx_embeddings_768_vector ON repo_embeddings_768 USING hnsw (embedding vector_cosine_ops);
```

Repo Config Table:
```sql
CREATE TABLE repo_configs (
    repo_root TEXT PRIMARY KEY,
    model TEXT NOT NULL,          -- nomic-embed-text, bge-m3, etc.
    dimensions INTEGER NOT NULL,  -- 768, 1024, etc.
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

v1 used a single `code_embeddings` table; v2 uses dimension-grouped tables.
Migration Logic:
- Detect schema version (query `sqlite_master` / `information_schema`)
- If the v1 schema is detected:
  - Create dimension-grouped tables
  - Leave old tables intact (backward compatibility)
- On first embed:
  - Detect model dimensions
  - Insert into the correct dimension group
- Old embeddings remain in `code_embeddings` until re-embedding
Backward Compatibility:
v2 can read from both the v1 (`code_embeddings`) and v2 (`repo_embeddings_*`) tables, preferring v2.
See Migration Guide for detailed upgrade instructions.
```bash
# Database
CODETECT_DB_TYPE=sqlite            # sqlite (default) or postgres
CODETECT_DB_DSN=postgres://...     # PostgreSQL connection string
CODETECT_DB_PATH=/custom/path      # SQLite database path override
CODETECT_VECTOR_DIMENSIONS=768     # Vector dimensions (auto-detected if not set)

# Embedding
CODETECT_EMBEDDING_PROVIDER=ollama # ollama (default), litellm, off
CODETECT_OLLAMA_URL=http://...     # Ollama URL (default: http://localhost:11434)
CODETECT_LITELLM_URL=http://...    # LiteLLM URL (default: http://localhost:4000)
CODETECT_LITELLM_API_KEY=sk-...    # LiteLLM API key
CODETECT_EMBEDDING_MODEL=bge-m3    # Model override (provider-specific)

# Logging
CODETECT_LOG_LEVEL=info            # debug, info, warn, error
CODETECT_LOG_FORMAT=text           # text (default), json

# Privacy
CODETECT_HASH_PATHS=false          # SHA-256 hash file paths at rest (default: false)
```

Configuration files are planned for future releases; currently, all configuration is via environment variables. Precedence (highest first; a resolution sketch follows the list):
1. Environment variables (highest priority)
2. Project config (`.codetect.yaml`, planned)
3. Global config (`~/.config/codetect/config.json`, partial support)
4. Defaults (lowest priority)
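A minimal sketch of that precedence chain, assuming a hypothetical `resolve` helper; only the environment layer is fully implemented today, so the `projectCfg` and `globalCfg` parameters stand in for the planned config files.

```go
package config

import "os"

// resolve walks the precedence chain from highest to lowest priority.
func resolve(key, projectCfg, globalCfg, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v // 1. environment variable wins
	}
	if projectCfg != "" {
		return projectCfg // 2. project config (planned)
	}
	if globalCfg != "" {
		return globalCfg // 3. global config (partial support)
	}
	return def // 4. built-in default
}

// resolve("CODETECT_DB_TYPE", "", "", "sqlite") returns "sqlite"
// unless the environment overrides it.
```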
v2.0.0 adds parallel embedding with the `-j` flag:

```bash
# Default: 10 parallel workers
codetect embed -j 10

# Benchmark: 1000 files
# Sequential:       7m 30s
# Parallel (-j 10): 2m 15s
# Speedup:          3.3x
```

Implementation:
```go
// internal/embedding/searcher.go
func (s *Searcher) IndexChunksParallel(ctx context.Context, chunks []Chunk, workers int, progressFn func(int, int)) error {
	workCh := make(chan Chunk, workers)
	resultCh := make(chan EmbeddingResult, workers)

	// Spawn workers
	for i := 0; i < workers; i++ {
		go s.worker(ctx, workCh, resultCh)
	}

	// Feed work
	go func() {
		for _, chunk := range chunks {
			workCh <- chunk
		}
		close(workCh)
	}()

	// Collect results; exactly one result arrives per submitted chunk
	for i := 0; i < len(chunks); i++ {
		result := <-resultCh
		s.store.Insert(result)
		if progressFn != nil {
			progressFn(i+1, len(chunks))
		}
	}
	return nil
}
```

Embeddings are keyed by `chunk_hash`, the SHA-256 of the chunk content (a hashing sketch follows the benefits list):

```sql
SELECT embedding FROM repo_embeddings_768 WHERE chunk_hash = ?
```

Benefits:
- Skip re-embedding unchanged chunks (95%+ cache hit rate on incremental updates)
- Deduplication (identical chunks across files)
- Integrity verification (detect corruption)
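A sketch of the content-addressed key: hashing only the chunk text means identical chunks collapse to one row and unchanged chunks are skipped on re-index. The `ChunkHash` helper name is illustrative.

```go
package embedding

import (
	"crypto/sha256"
	"encoding/hex"
)

// ChunkHash derives the content-addressed key for one chunk.
func ChunkHash(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// Before embedding a chunk, look its hash up; on a hit, reuse the
// stored vector instead of calling the embedding provider:
//   SELECT embedding FROM repo_embeddings_768 WHERE chunk_hash = ?
```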
Separate tables per dimension size (see the Embedding Store diagram above).
Why it's faster:
- Smaller indexes (fewer rows to scan)
- Type safety (no dimension mismatch bugs)
- HNSW optimization (PostgreSQL can build better indexes on fixed dimensions)
Example:
# 10,000 embeddings across 3 repos
# v1 (single table)
code_embeddings: 10,000 rows
Query: scan all 10,000 rows β 100ms
# v2 (dimension groups)
repo_embeddings_768: 7,000 rows (Repo A, B)
repo_embeddings_1024: 3,000 rows (Repo C)
Query: scan only 7,000 rows β 70ms
PostgreSQL + pgvector supports HNSW (Hierarchical Navigable Small World) indexing:
```sql
CREATE INDEX idx_embeddings_768_vector
ON repo_embeddings_768
USING hnsw (embedding vector_cosine_ops);
```

Performance:
| Dataset Size | SQLite (linear scan) | PostgreSQL + HNSW |
|---|---|---|
| 100 vectors | 77 µs | 603 µs (slower) |
| 1,000 vectors | 1.19 ms | 745 µs (1.6x faster) |
| 10,000 vectors | 58.1 ms | 963 µs (60x faster) |
Trade-offs:
- Setup: PostgreSQL requires installation; SQLite is zero-config
- Small datasets: SQLite is faster (no index overhead)
- Large datasets: PostgreSQL is massively faster (sub-linear ANN search); a sketch of the SQLite-side linear scan follows
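To see why the SQLite side scales linearly, here is a minimal sketch of the brute-force scan's inner loop: decode each stored BLOB into float32s and score it by cosine similarity, touching every row. The little-endian float32 encoding is an assumption consistent with the schema comment above, not a documented guarantee.

```go
package search

import (
	"encoding/binary"
	"math"
)

// decodeVector reinterprets a BLOB column as a float32 slice.
func decodeVector(blob []byte) []float32 {
	v := make([]float32, len(blob)/4)
	for i := range v {
		bits := binary.LittleEndian.Uint32(blob[i*4:])
		v[i] = math.Float32frombits(bits)
	}
	return v
}

// cosine computes cosine similarity; the scan calls it once per stored
// row, which is the O(rows × dims) cost HNSW avoids.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```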
All indexes are stored centrally under `~/.codetect/projects/`:
```
~/.codetect/projects/
├── index.json                    # Reverse lookup: dir name → repo path
├── myproject-a1b2c3d4/           # <basename>-<hash> naming
│   ├── index.db                  # SQLite database containing:
│   │   ├── symbols               #   Symbol definitions
│   │   ├── repo_embeddings_768   #   768-dim embeddings
│   │   ├── repo_embeddings_1024  #   1024-dim embeddings
│   │   └── repo_configs          #   Model tracking
│   └── merkle-tree.json          # Change detection tree
└── other-repo-e5f6g7h8/
    └── ...
```
The directory name hash is derived from the git remote origin URL (for repos with a remote) or the absolute path (for non-git directories). This means git repos survive directory moves: the same data directory is used regardless of where the repo is cloned.
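A hedged sketch of the `<basename>-<hash>` naming just described: identity comes from the git remote origin URL when present, otherwise the absolute path. The SHA-256 choice and the 8-hex-character truncation are assumptions matching the example names above.

```go
package storage

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// dataDirName builds the per-repo directory name under ~/.codetect/projects/.
func dataDirName(repoPath, remoteURL string) string {
	identity := remoteURL
	if identity == "" {
		identity = repoPath // non-git directory: fall back to absolute path
	}
	sum := sha256.Sum256([]byte(identity))
	return fmt.Sprintf("%s-%s", filepath.Base(repoPath), hex.EncodeToString(sum[:])[:8])
}

// dataDirName("/home/me/myproject", "git@github.com:me/myproject.git")
// yields the same directory name wherever the repo is cloned.
```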
Migration: Existing `.codetect/` directories at project roots are auto-migrated to centralized storage on first use.
v1 Note: v1 used `.repo_search/` (early) and later `.codetect/symbols.db`. v2 uses a centralized `index.db` with a different schema.
codetect is designed to work with partial dependencies:
| Dependency | If Missing |
|---|---|
| ripgrep | `search_keyword` fails (ripgrep is a hard requirement) |
| Ollama/LiteLLM | `hybrid_search_v2` reports `available: false` |
The MCP server always starts; tools report availability in their responses.
The daemon provides automatic re-indexing when files change:
```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  fsnotify   │ ───▶ │   Daemon    │ ───▶ │  Re-index   │
│   Watcher   │      │   Process   │      │   Changed   │
└─────────────┘      └─────────────┘      └─────────────┘
```
Features:
- File system watching via fsnotify
- Debounced re-indexing to avoid excessive updates (see the sketch after this list)
- IPC for daemon control (start/stop/status)
- Respects `.gitignore` patterns
- PID file and Unix socket for process management
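The debounced re-indexing can be pictured with a minimal timer-reset sketch; the `Debouncer` type and the idea of wiring `Trigger` to fsnotify events are illustrative assumptions, not the daemon's actual internals. Each event resets the timer, so a burst of saves triggers a single re-index once the burst quiets down.

```go
package daemon

import (
	"sync"
	"time"
)

type Debouncer struct {
	mu    sync.Mutex
	timer *time.Timer
	delay time.Duration
	fn    func() // e.g. a re-index callback
}

func NewDebouncer(delay time.Duration, fn func()) *Debouncer {
	return &Debouncer{delay: delay, fn: fn}
}

// Trigger is called on every file-change event; fn only fires after
// `delay` has elapsed with no further triggers.
func (d *Debouncer) Trigger() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop()
	}
	d.timer = time.AfterFunc(d.delay, d.fn)
}
```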
Commands:
- `codetect-daemon start` - Start the daemon
- `codetect-daemon stop` - Stop the daemon
- `codetect-daemon status` - Show daemon status
Central tracking of all indexed projects:
```
~/.config/codetect/
└── registry.json    # Global project registry, containing:
    ├── projects     #   Registered project paths
    ├── settings     #   Auto-watch configuration
    └── stats        #   Index statistics per project
```
Features:
- JSON-based storage for portability (a possible shape is sketched after this list)
- Per-project index statistics (symbol count, embedding count, DB size)
- Watch enabled/disabled flags
- Last-indexed timestamp tracking
- Global settings for auto-watch and debounce
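Based on the feature list above, `registry.json` might deserialize into something like the following; all field names are illustrative guesses, not the actual on-disk schema.

```go
package registry

import "time"

type Registry struct {
	Projects map[string]ProjectEntry `json:"projects"` // keyed by repo path
	Settings GlobalSettings          `json:"settings"`
}

type ProjectEntry struct {
	WatchEnabled   bool      `json:"watch_enabled"`
	LastIndexedAt  time.Time `json:"last_indexed_at"`
	SymbolCount    int       `json:"symbol_count"`
	EmbeddingCount int       `json:"embedding_count"`
	DBSizeBytes    int64     `json:"db_size_bytes"`
}

type GlobalSettings struct {
	AutoWatch  bool `json:"auto_watch"`
	DebounceMS int  `json:"debounce_ms"`
}
```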
See Registry Guide for detailed usage.
Testing framework for comparing MCP vs non-MCP performance:
```
Test Cases → Runner → [MCP Search, Direct Search] → Validator → Report
```
Features:
- JSONL-based test case format (see the sketch after this list)
- Categories: search, navigate, understand
- Per-repo test case storage in the centralized data directory
- Automated validation of results
- Performance comparison reports
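A hedged sketch of what one JSONL test case might look like, inferred from the features above; the field names and the example line are illustrative, not the actual format.

```go
package eval

// TestCase is one line of a JSONL test file.
type TestCase struct {
	ID       string   `json:"id"`
	Category string   `json:"category"` // search, navigate, or understand
	Query    string   `json:"query"`
	Expected []string `json:"expected"` // file paths that should appear in results
}

// Example line:
// {"id":"auth-001","category":"search","query":"authentication middleware","expected":["internal/auth/middleware.go"]}
```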
Commands:
- `codetect-eval run` - Run evaluation tests
- `codetect-eval report` - Display saved reports
- `codetect-eval list` - List available test cases
- Merkle trees - Sub-second change detection for large repos
- AST-aware indexing - Parse syntax trees directly (no ctags)
- Hybrid ranking - Reciprocal rank fusion of keyword + semantic scores
- Reranking models - Post-filter results with embedding similarity
- Connection pooling - Shared DB/embedding connections via `ResourcePool`
- Token-efficient design - `detail` parameter, response budgeting, compressed descriptions
- Multi-language AST chunking - Expand beyond Go/Python/JavaScript
- Query expansion - Automatic synonym expansion for semantic search
- Configuration file - Project-level `.codetect.yaml`
- HTTP API - Alternative to MCP for non-MCP tools
- CLI query mode - `codetect search "query"` for terminal use
- Graph-based navigation - Call graphs, type hierarchies, dependency trees
- LSP integration - Real-time indexing via Language Server Protocol
- Distributed indexing - Index large monorepos across multiple machines
- MCP Specification
- pgvector Documentation
- Reciprocal Rank Fusion Paper
- HNSW Algorithm
- v1 Architecture (legacy)
- Migration Guide
Document Version: 3.0
Last Updated: 2026-02-16
codetect Version: 3.0.0