feat(search): body-aware chunking + code embedder for semantic retrieval by eren23 · Pull Request #94 · eren23/attocode

eren23 · 2026-05-31T10:11:29Z

What

Lifts semantic-search retrieval quality by embedding function/method bodies (not just signatures) and adding a code-aware embedder path, plus the eval infrastructure to measure it honestly.

Why / Evidence

Validated with a hardened eval harness (--reindex forces a synchronous embedding rebuild so the index actually reflects the config under test). BGE, --reindex, 3 repos × 5 ground-truth queries each:

Repo	MRR@10	NDCG@10	vec_ready
redis (C)	0.633	0.290	✅
pandas (Py)	0.600	0.288	✅
gh-cli (Go)	0.100	0.114	✅

Python and C retrieval lands at 0.60–0.63 MRR@10, clearing the 0.55 target. The lift comes from embedding bodies at all, not from the size of the body budget.
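For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how MRR@10 and NDCG@10 are commonly computed with binary relevance. The function names are illustrative, and the harness's exact definitions (for example, graded relevance or tie handling) are not shown in this PR, so treat this as an assumption about the standard formulas rather than the eval code itself.

```python
import math
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Return 1/rank of the first relevant hit within the top k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Average the per-query reciprocal rank over (ranking, relevant-set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)
```

With only five queries per repo, each query moves MRR by up to 0.2, so the per-repo numbers are coarse: three first-place hits and two misses, for instance, average to 0.6.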

Body-budget sweep (200/400/800) is a no-op on this query set — per-repo MRR/NDCG and chunk counts are bit-identical across all three budgets. Default left at 400; could drop to 200 for free storage/compute savings later.
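To make concrete what the body budget controls, here is a rough sketch of a token-budgeted body slicer. It is not the PR's `_slice_body`: the names `slice_body` and `build_chunk` are hypothetical, and the whitespace word count is a stand-in for whatever token estimate the real extractor uses. Only the environment variable name and its default of 400 come from this PR.

```python
import os


def slice_body(body: str, token_budget: int) -> str:
    """Keep whole leading lines of `body` until the token budget is spent."""
    kept = []
    used = 0
    for line in body.splitlines():
        cost = len(line.split())  # crude estimate: whitespace-separated words
        if used + cost > token_budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def build_chunk(signature: str, body: str, token_budget: int) -> str:
    """Chunk text = signature plus as much of the body as the budget allows."""
    sliced = slice_body(body, token_budget)
    return f"{signature}\n{sliced}" if sliced else signature


def budget_from_env() -> int:
    """Read ATTOCODE_BODY_TOKEN_BUDGET, falling back to the default of 400."""
    return int(os.environ.get("ATTOCODE_BODY_TOKEN_BUDGET", "400"))
```

In this sketch, any body that fits within the smallest budget produces the same chunk at every budget. Whether that explains the no-op sweep above is not established by this PR.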

The gh-cli (Go) score of 0.100 reflects a pre-existing Go-extraction weakness (regex rather than tree-sitter; a separate Phase-1 item), not a regression from this work. The run was confirmed to be serving vectors, not a keyword fallback.

Changes

Body-aware chunking — _slice_body token-budgeted extractor; func/method chunks carry signature + body; ATTOCODE_BODY_TOKEN_BUDGET (default 400).
Embedder doc/query asymmetry — embed_query on the provider ABC; nomic search_query: prefix; query path uses embed_query; nomic-first auto-detect.
Model-mismatch guard — refuse vectors from a different embedding model (keyword fallback) + surfaced in semantic_search_status.
Path down-weight — de-rank tests/, benchmarks/, docs_src/, asv_bench/.
Coverage/readiness fix — is_index_ready() / get_index_progress() now count all indexed languages (tree-sitter: Go, Rust, C…), not just py/js/ts. Previously a Go repo reported 0% coverage / vec_ready=False despite a fully built index. Extracted _supported_languages() helper.
Eval harness — --reindex + CodeIntelService.reindex(embeddings=True); correctness guards (provenance stamp; fail-loud on 0 embedded chunks); per-repo progress + isolation; --json; eval/run_body_budget_sweep.py resumable sweep driver.
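The doc/query asymmetry item above can be sketched as follows. The class names are hypothetical and the encoder is injected as a plain callable so the example runs without a model. The `search_query: ` prefix is named in this PR; the `search_document: ` counterpart is the prefix the nomic-embed-text model card documents for indexed text, and its use on the document path here is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]


class EmbeddingProvider(ABC):
    """Hypothetical provider base class with a separate query entry point."""

    @abstractmethod
    def embed(self, texts: List[str]) -> List[Vector]:
        """Embed documents (code chunks) for indexing."""

    def embed_query(self, text: str) -> Vector:
        """Default for symmetric models: queries are embedded like documents."""
        return self.embed([text])[0]


class PrefixingProvider(EmbeddingProvider):
    """Asymmetric provider: different task prefixes for documents and queries."""

    DOC_PREFIX = "search_document: "
    QUERY_PREFIX = "search_query: "

    def __init__(self, encode: Encoder) -> None:
        self._encode = encode

    def embed(self, texts: List[str]) -> List[Vector]:
        return self._encode([self.DOC_PREFIX + t for t in texts])

    def embed_query(self, text: str) -> Vector:
        return self._encode([self.QUERY_PREFIX + text])[0]
```

Keeping `embed_query` on the base class with a symmetric default means providers that need no prefix require no changes, while the search path can call `embed_query` unconditionally.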

Tests

Unit tests for body chunking, embed_query asymmetry, model-mismatch guard, path down-weight, supported-languages coverage, reindex(embeddings), and eval reporting helpers.
integrations/context suite: 606 passed / 1 xfailed (the lone unrelated failure is a pre-existing token-estimate test, untouched here).

🤖 Generated with Claude Code


test_semantic_search.py built managers via SemanticSearchManager.__new__() and hand-listed dataclass fields. New fields (scoring_config, _trigram_index, nl_mode) were never added to those blocks, so 9 tests crashed with AttributeError on the core retrieval path. In addition, test_queue_reindex_noop_when_unavailable didn't stub _ensure_provider, so the real BGE provider loaded and flipped _keyword_fallback back to False, defeating the test's simulated-unavailable assertion.

Route the 5 __new__ blocks through a shared _bare_manager() that uses the real (cheap, provider-deferred) constructor plus targeted overrides, so adding future fields can't silently re-break these tests. Stub _ensure_provider in the noop test.

integrations/context suite: 582 passed / 9 failed -> 591 passed / 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
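The failure mode and the fix can be illustrated with a small stand-in. `Manager` below is a hypothetical dataclass, not the real SemanticSearchManager, and `bare_manager` only mirrors the idea of `_bare_manager()` as described in this commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Manager:  # hypothetical stand-in for SemanticSearchManager
    root: str
    # default_factory fields exist only on instances that went through __init__
    scoring_config: Dict[str, float] = field(default_factory=dict)
    _provider: Optional[Any] = None  # deferred: nothing heavy loads in __init__

    def search(self, query: str) -> str:
        return f"{query}:{len(self.scoring_config)}"


def brittle_manager() -> Manager:
    """Old pattern: bypass __init__ and hand-list fields (scoring_config forgotten)."""
    m = Manager.__new__(Manager)
    m.root = "/repo"
    return m


def bare_manager(**overrides: Any) -> Manager:
    """New pattern: real constructor, then override only what the test needs."""
    m = Manager(root="/repo")
    for name, value in overrides.items():
        setattr(m, name, value)
    return m
```

Calling `brittle_manager().search("q")` raises AttributeError because `scoring_config` was never set, while `bare_manager()` picks up every current and future default from the constructor.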



…atch guard + status line

Earlier per-task commits (2fa0d03, 03ba829) landed empty because the Edit anchor strings didn't match the real code, so the mismatch guard in search() and the status line in semantic_search_status were never actually applied. The Task-3 nomic test also mocked encode() with a plain list, so .tolist() raised. This commit adds the real guard and status line and fixes the test mock to return a numpy array.

767 passed / 5 skipped across context + tools suites.
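As a rough illustration of the guard's intent, the sketch below decides between vector search and keyword fallback by comparing the active provider's model name with the model name recorded in the vector store. All names here (`SearchOutcome`, `choose_search_mode`, `status_line`) and the message wording are hypothetical; the actual guard lives inside search() and semantic_search_status and is not reproduced in this PR page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchOutcome:
    mode: str  # "vector" or "keyword"
    reason: Optional[str] = None


def choose_search_mode(provider_model: str, store_model: Optional[str]) -> SearchOutcome:
    """Refuse stored vectors unless they were built by the active model."""
    if store_model is None:
        return SearchOutcome("keyword", "no vectors indexed yet")
    if store_model != provider_model:
        # Vectors from different models live in different spaces, so similarity
        # scores between them are meaningless; fall back until a reindex.
        return SearchOutcome(
            "keyword",
            f"index built with {store_model!r}, active model is {provider_model!r}; "
            "reindex to use vectors",
        )
    return SearchOutcome("vector")


def status_line(outcome: SearchOutcome) -> str:
    """One-line summary of the kind a status tool could surface."""
    if outcome.mode == "vector":
        return "semantic search: vectors active"
    return f"semantic search: keyword fallback ({outcome.reason})"
```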

Add _path_rank_penalty (0.6 for tests/asv_bench/docs_src/benchmarks, 1.0 for source) applied as a final ranking stage so genuine source outranks tests/benchmarks. Also removes stray placeholder comments that an earlier batch edit left in the boost stage.
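A minimal sketch of such a final ranking stage follows. The 0.6 / 1.0 factors and the four directory names come from this commit message; treating the penalty as a multiplier on the score and matching on any directory component of the path are assumptions, and the function names are illustrative rather than the PR's.

```python
from pathlib import PurePosixPath
from typing import List, Tuple

DOWNWEIGHTED_DIRS = {"tests", "benchmarks", "docs_src", "asv_bench"}


def path_rank_penalty(path: str) -> float:
    """Return 0.6 if any directory component is down-weighted, else 1.0."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return 0.6 if any(d in DOWNWEIGHTED_DIRS for d in directories) else 1.0


def apply_path_penalty(hits: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Scale each (path, score) hit by its path penalty and re-sort descending."""
    rescored = [(path, score * path_rank_penalty(path)) for path, score in hits]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)
```

For example, a test file scoring 0.9 drops to 0.54 and falls below a source file scoring 0.7.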

…smatch guard

Earlier batch commits (03ba829 Task 7, 2941cf2 Task 8) landed empty because the Edit anchors didn't match. This commit adds the real _path_rank_penalty and ranking stage (Task 8) and the model-mismatch status line (Task 7), and updates tests broken by intended behavior changes: auto-detect is now nomic-first (was bge-first), the mismatch guard requires matching provider/store model names in RRF mocks, and the status-mismatch test mock needs numeric progress fields.

Full context + tools suites: 710 passed, 1 xfailed.

The search-quality harness scored whatever index already existed (or none), so index changes (body chunking / embedder swap) could not be measured: embeddings were built lazily mid-scoring, or queries hit the keyword fallback.

Adds CodeIntelService.reindex(embeddings=True), which synchronously rebuilds the vector index via SemanticSearchManager.index() after the AST reindex, and an eval --reindex flag (default off, preserving prior behavior) that uses it. Respects ATTOCODE_EMBEDDING_MODEL and ATTOCODE_BODY_TOKEN_BUDGET so sweeps are deterministic. +2 unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Builds on the --reindex fix so the body-chunking/embedder sweep produces trustworthy, observable, resumable results:

correctness guards (provenance stamp + fail-loud on 0 embedded chunks)
per-repo progress milestones + isolation
--json output
pure format_sweep_comparison/_overall_from_results helpers (unit-tested)
eval/run_body_budget_sweep.py resumable driver (per-repo subprocess, skips done budgets, writes to gitignored eval/sweep_out/)

+4 reporting tests. No new lint introduced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
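The "skips done budgets" behavior of a resumable driver can be sketched as below. This is not eval/run_body_budget_sweep.py: the file naming scheme and the function names are hypothetical, and the per-repo subprocess that actually runs the eval is left out. The idea shown is only that a finished (repo, budget) pair leaves a result file behind, and a restart runs whatever has no file yet.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def result_path(out_dir: Path, repo: str, budget: int) -> Path:
    """Hypothetical naming scheme: one JSON file per (repo, budget) run."""
    return out_dir / f"{repo}_budget{budget}.json"


def pending_runs(out_dir: Path, repos: Iterable[str], budgets: Iterable[int]) -> List[Tuple[str, int]]:
    """List the (repo, budget) pairs that have no result file yet."""
    budgets = list(budgets)
    return [(r, b) for r in repos for b in budgets if not result_path(out_dir, r, b).exists()]


def record_result(out_dir: Path, repo: str, budget: int, metrics: Dict[str, float]) -> None:
    """Write via a temp file and rename so an interrupted run never looks done."""
    out_dir.mkdir(parents=True, exist_ok=True)
    final = result_path(out_dir, repo, budget)
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(metrics))
    tmp.replace(final)


def demo() -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Show the pending list before and after one run is recorded."""
    with tempfile.TemporaryDirectory() as d:
        out = Path(d) / "sweep_out"
        before = pending_runs(out, ["redis"], [200, 400])
        record_result(out, "redis", 200, {"mrr_at_10": 0.0})  # placeholder metrics
        after = pending_runs(out, ["redis"], [200, 400])
    return before, after
```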

…s/ts only

is_index_ready()/get_index_progress() derived total_files from a hardcoded {python,javascript,typescript} set while the indexer (index/reindex) embeds a wider set that includes the tree-sitter languages (Go, Rust, C, ...). On a Go repo (gh-cli) this made coverage read 0% even with a fully built vector index, so is_index_ready() returned False and the eval reported vec_ready=False despite serving vectors.

Extracted SemanticSearchManager._supported_languages() and routed all three call sites through it. +3 unit tests; 198 passed in integrations/context.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
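The bug can be reproduced in miniature. The sketch below is a simplified model, not the manager's code: `coverage` and `supported_languages` are hypothetical free functions, the file counts are made up, and the 0.0 result for an empty denominator is an assumption about how the 0% reading arose.

```python
from typing import Dict, FrozenSet, Iterable

LEGACY_LANGUAGES: FrozenSet[str] = frozenset({"python", "javascript", "typescript"})


def supported_languages(tree_sitter_languages: Iterable[str]) -> FrozenSet[str]:
    """Single source of truth shared by the indexer and the readiness checks."""
    return LEGACY_LANGUAGES | frozenset(tree_sitter_languages)


def coverage(files_by_language: Dict[str, int], embedded_files: int, languages: FrozenSet[str]) -> float:
    """Fraction of indexable files that have embeddings."""
    total_files = sum(n for lang, n in files_by_language.items() if lang in languages)
    if total_files == 0:
        return 0.0  # nothing counted as indexable, so coverage reads as 0%
    return min(1.0, embedded_files / total_files)


# Illustrative Go-only repo: every Go file is embedded.
go_repo = {"go": 120, "markdown": 8}
old_view = coverage(go_repo, 120, LEGACY_LANGUAGES)
new_view = coverage(go_repo, 120, supported_languages(["go", "rust", "c"]))
```

Here `old_view` is 0.0 because no Go file counts toward the total, while `new_view` is 1.0 once the readiness check and the indexer agree on the language set.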

Bumps version + CHANGELOG for the retrieval-quality work (PR #94): body-aware chunking, embed_query doc/query asymmetry, model-mismatch guard, path down-weight, non-Python coverage/readiness fix, and the hardened eval harness.

Validated: redis 0.633 / pandas 0.600 MRR@10 (>=0.55 target); body budget is a no-op across 200-800.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Completes the 0.2.25 bump: 8f19414's pyproject edit silently half-matched (there are two version= lines) and src/attoswarm was missed, so the 4 sources that validate_release_version.py checks disagreed. All now read 0.2.25.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


eren23 and others added 18 commits May 30, 2026 19:06

feat(search): add _slice_body token-budgeted body extractor

c148c3c

feat(search): embed function/method bodies (ATTOCODE_BODY_TOKEN_BUDGET, default 400)

f961713

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(embeddings): add embed_query (doc/query asymmetry) + nomic search_query prefix

54952bd

feat(search): embed queries via embed_query (correct nomic prefix)

4f3bffd

feat(embeddings): default auto-detect to nomic-embed-text-v1.5

25538ce

feat(search): refuse vectors from a mismatched embedding model (manual reindex)

2fa0d03

feat(search): surface embedding-model mismatch in semantic_search_status

03ba829

Merge remote-tracking branch 'origin/feat/retrieval-quality-chunking-embedder' into feat/retrieval-quality-chunking-embedder

bd2c68f

eren23 merged commit 54a9f3a into main May 31, 2026
1 check passed

eren23 merged 18 commits into main from feat/retrieval-quality-chunking-embedder
