feat(search): body-aware chunking + code embedder for semantic retrieval #94

Merged: eren23 merged 18 commits into main from feat/retrieval-quality-chunking-embedder on May 31, 2026
eren23 (Owner) commented May 31, 2026

What

Lifts semantic-search retrieval quality by embedding function/method bodies (not just signatures) and adding a code-aware embedder path, plus the eval infrastructure to measure it honestly.

Why / Evidence

Validated with a hardened eval harness (--reindex forces a synchronous embedding rebuild so the index actually reflects the config under test). BGE, --reindex, 3 repos × 5 ground-truth queries each:

| Repo | MRR@10 | NDCG@10 | vec_ready |
|---|---|---|---|
| redis (C) | 0.633 | 0.290 | |
| pandas (Py) | 0.600 | 0.288 | |
| gh-cli (Go) | 0.100 | 0.114 | |

Python/C-family retrieval lands at 0.60–0.63 MRR@10, clearing the 0.55 target. The lift comes from embedding bodies in the first place, not from tuning how much of each body is embedded.
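For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how MRR@10 and NDCG@10 are commonly computed for one query. It assumes binary relevance (a result is either in the ground-truth set or not); the harness's actual definitions are not shown in this PR, and the function names below are illustrative only.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant hit in the top k, or 0.0 if none."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # The ideal ranking places every relevant item (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values: Sequence[float]) -> float:
    """Average per-query scores into a per-repo score."""
    return sum(values) / len(values) if values else 0.0
```

A per-repo figure is then the mean of the per-query scores over that repo's ground-truth queries.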

Body-budget sweep (200/400/800) is a no-op on this query set — per-repo MRR/NDCG and chunk counts are bit-identical across all three budgets. Default left at 400; could drop to 200 for free storage/compute savings later.
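To make the budget concrete, here is a rough sketch of what a token-budgeted body slice might look like. It is not the PR's `_slice_body`: the whitespace-based token count, the whole-line cutoff, and the names `slice_body_sketch` and `build_chunk_text` are all assumptions made for illustration. Only the environment variable name and its default of 400 come from the PR.

```python
import os

DEFAULT_BODY_TOKEN_BUDGET = 400  # default stated in the PR description


def body_token_budget() -> int:
    """Read the budget from ATTOCODE_BODY_TOKEN_BUDGET, falling back to 400."""
    raw = os.environ.get("ATTOCODE_BODY_TOKEN_BUDGET", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_BODY_TOKEN_BUDGET
    return value if value > 0 else DEFAULT_BODY_TOKEN_BUDGET


def slice_body_sketch(body: str, budget: int) -> str:
    """Keep whole lines of the body until the token budget would be exceeded.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    kept = []
    used = 0
    for line in body.splitlines():
        cost = len(line.split())
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


def build_chunk_text(signature: str, body: str, budget: int) -> str:
    """Chunk text = signature followed by as much of the body as fits."""
    sliced = slice_body_sketch(body, budget)
    return f"{signature}\n{sliced}" if sliced else signature
```

With a slicer of this shape, any body that already fits under the smallest budget produces the same chunk text at 200, 400, and 800. Whether that accounts for the identical sweep results here is not established by the PR.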

The gh-cli (Go) score of 0.100 reflects a pre-existing Go-extraction weakness (regex rather than tree-sitter; a separate Phase-1 item), not a regression from this work. The run was confirmed to be serving vectors, not a keyword fallback.

Changes

  • Body-aware chunking: _slice_body token-budgeted extractor; func/method chunks carry signature + body; ATTOCODE_BODY_TOKEN_BUDGET (default 400).
  • Embedder doc/query asymmetry: embed_query on the provider ABC; nomic search_query: prefix; query path uses embed_query; nomic-first auto-detect.
  • Model-mismatch guard: refuse vectors from a different embedding model (keyword fallback); surfaced in semantic_search_status.
  • Path down-weight: de-rank tests/, benchmarks/, docs_src/, asv_bench/.
  • Coverage/readiness fix: is_index_ready() / get_index_progress() now count all indexed languages (tree-sitter: Go, Rust, C…), not just py/js/ts. Previously a Go repo reported 0% coverage / vec_ready=False despite a fully built index. Extracted _supported_languages() helper.
  • Eval harness: --reindex + CodeIntelService.reindex(embeddings=True); correctness guards (provenance stamp; fail-loud on 0 embedded chunks); per-repo progress + isolation; --json; eval/run_body_budget_sweep.py resumable sweep driver.
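The doc/query asymmetry item in the list above can be sketched as follows. This is an illustration under assumptions, not the project's provider classes: the class names and the injected `encode` callable are hypothetical. The `search_query: ` prefix is the one named in this PR; the `search_document: ` prefix is the document-side counterpart from nomic's model documentation and is not mentioned here.

```python
from abc import ABC, abstractmethod
from typing import Callable, List

Vector = List[float]


class EmbeddingProviderSketch(ABC):
    """Hypothetical provider interface with a separate query entry point."""

    @abstractmethod
    def embed(self, texts: List[str]) -> List[Vector]:
        """Embed documents (code chunks) for indexing."""

    def embed_query(self, query: str) -> Vector:
        """Default for symmetric models: embed a query like a document."""
        return self.embed([query])[0]


class PrefixingProviderSketch(EmbeddingProviderSketch):
    """Asymmetric provider: documents and queries get different task prefixes."""

    def __init__(self, encode: Callable[[List[str]], List[Vector]]) -> None:
        # `encode` stands in for the real model call (assumption).
        self._encode = encode

    def embed(self, texts: List[str]) -> List[Vector]:
        return self._encode([f"search_document: {t}" for t in texts])

    def embed_query(self, query: str) -> Vector:
        return self._encode([f"search_query: {query}"])[0]
```

Putting a default `embed_query` on the base class means symmetric models need no changes, while prefix-sensitive models override just the one method.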

Tests

  • Unit tests for body chunking, embed_query asymmetry, model-mismatch guard, path down-weight, supported-languages coverage, reindex(embeddings), and eval reporting helpers.
  • integrations/context suite: 606 passed / 1 xfailed (the lone unrelated failure is a pre-existing token-estimate test, untouched here).

🤖 Generated with Claude Code

eren23 and others added 18 commits May 30, 2026 19:06
…T, default 400)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_semantic_search.py built managers via SemanticSearchManager.__new__() and hand-listed dataclass fields; new fields (scoring_config, _trigram_index, nl_mode) were never added to those blocks, so 9 tests crashed with AttributeError on the core retrieval path. Plus test_queue_reindex_noop_when_unavailable didn't stub _ensure_provider, so the real BGE provider loaded and flipped _keyword_fallback back to False, defeating its simulated-unavailable assertion.

Route the 5 __new__ blocks through a shared _bare_manager() that uses the real (cheap, provider-deferred) constructor + targeted overrides, so adding future fields can't silently re-break these. Stub _ensure_provider in the noop test. integrations/context suite: 582p/9f -> 591p/0f.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atch guard + status line

Earlier per-task commits (2fa0d03, 03ba829) landed empty because the Edit anchor strings didn't match the real code; the mismatch guard in search() and the status line in semantic_search_status were never actually applied. The Task-3 nomic test also mocked encode() with a plain list, so .tolist() raised. This adds the real guard + status line and fixes the test mock to return a numpy array. 767 passed / 5 skipped across context + tools suites.
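A guard of the kind this commit describes might look roughly like the sketch below. The names `SearchRoute` and `choose_route` are hypothetical, and treating a store with no recorded model as unusable is an assumption; the PR only states that vectors from a different embedding model are refused in favor of keyword fallback and that the condition is surfaced in the status output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchRoute:
    use_vectors: bool
    reason: str  # human-readable line suitable for a status report


def choose_route(provider_model: Optional[str], store_model: Optional[str]) -> SearchRoute:
    """Decide whether stored vectors may be compared with query vectors."""
    if provider_model is None:
        return SearchRoute(False, "no embedding provider available")
    if store_model is None:
        # Assumption: an index with no recorded model cannot be trusted.
        return SearchRoute(False, "vector store has no recorded model")
    if provider_model != store_model:
        return SearchRoute(
            False,
            f"model mismatch: index built with {store_model!r}, "
            f"provider is {provider_model!r}",
        )
    return SearchRoute(True, "models match")
```

Vectors from different models live in unrelated spaces, so similarity scores between them are meaningless; refusing them outright is safer than returning plausible-looking but arbitrary rankings.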
Add _path_rank_penalty (0.6 for tests/asv_bench/docs_src/benchmarks, 1.0 for source) applied as a final ranking stage so genuine source outranks tests/benchmarks. Also removes stray placeholder comments that an earlier batch edit left in the boost stage.
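As a sketch of the final ranking stage described in this commit: the 0.6 and 1.0 multipliers and the four directory names come from the commit message, while matching on any directory component of the path and the function names are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import List, Tuple

DOWNWEIGHTED_DIRS = frozenset({"tests", "asv_bench", "docs_src", "benchmarks"})
PENALTY = 0.6  # multiplier for down-weighted paths; source files keep 1.0


def path_rank_penalty_sketch(path: str) -> float:
    """Return 0.6 if any directory component is down-weighted, else 1.0."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    if any(part in DOWNWEIGHTED_DIRS for part in directories):
        return PENALTY
    return 1.0


def rerank_sketch(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Apply the penalty to (path, score) pairs and sort best-first."""
    adjusted = [(path, score * path_rank_penalty_sketch(path)) for path, score in results]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

For example, a test file scoring 0.9 is scaled to roughly 0.54 and drops below a source file scoring 0.7.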
…smatch guard

Earlier batch commits (03ba829 Task7, 2941cf2 Task8) landed empty because Edit anchors didn't match. This adds the real _path_rank_penalty + ranking stage (Task 8) and the model-mismatch status line (Task 7), and updates tests broken by intended behavior changes: auto-detect now nomic-first (was bge-first), the mismatch guard requires matching provider/store model names in RRF mocks, and the status-mismatch test mock needs numeric progress fields. Full context+tools suites: 710 passed, 1 xfailed.
The search-quality harness scored whatever index already existed (or none), so index changes (body chunking / embedder swap) could not be measured — embeddings built lazily mid-scoring or queries hit keyword fallback. Adds CodeIntelService.reindex(embeddings=True) which synchronously rebuilds the vector index via SemanticSearchManager.index() after the AST reindex, and an eval --reindex flag (default off, preserves prior behavior) that uses it. Respects ATTOCODE_EMBEDDING_MODEL + ATTOCODE_BODY_TOKEN_BUDGET so sweeps are deterministic. +2 unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the --reindex fix so the body-chunking/embedder sweep produces trustworthy, observable, resumable results: correctness guards (provenance stamp + fail-loud on 0 embedded chunks), per-repo progress milestones + isolation, --json output, pure format_sweep_comparison/_overall_from_results helpers (unit-tested), and eval/run_body_budget_sweep.py resumable driver (per-repo subprocess, skips done budgets, writes to gitignored eval/sweep_out/). +4 reporting tests. No new lint introduced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s/ts only

is_index_ready()/get_index_progress() derived total_files from a hardcoded
{python,javascript,typescript} set while the indexer (index/reindex) embeds a
wider set — tree-sitter langs (Go, Rust, C, ...). On a Go repo (gh-cli) this
made coverage read 0% even with a fully built vector index, so is_index_ready()
returned False and the eval reported vec_ready=False despite serving vectors.
Extracted SemanticSearchManager._supported_languages() and routed all three
call sites through it. +3 unit tests; 198 passed in integrations/context.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
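The coverage bug in the commit above can be reproduced in miniature. This sketch assumes coverage is indexed files divided by the count of files in supported languages; the function names, the language subset, and the file counts are illustrative rather than taken from the codebase.

```python
from typing import Dict, FrozenSet

# The hardcoded set the old readiness check used, per the commit message.
LEGACY_LANGUAGES: FrozenSet[str] = frozenset({"python", "javascript", "typescript"})


def supported_languages_sketch() -> FrozenSet[str]:
    """Single source of truth shared by indexer and readiness checks.

    The extra languages listed are an illustrative subset, not the real list.
    """
    return LEGACY_LANGUAGES | frozenset({"go", "rust", "c"})


def coverage_sketch(
    files_by_language: Dict[str, int],
    indexed_files: int,
    languages: FrozenSet[str],
) -> float:
    """Fraction of countable files that are indexed, capped at 1.0."""
    total = sum(n for lang, n in files_by_language.items() if lang in languages)
    if total == 0:
        return 0.0  # nothing counted, so coverage reads 0%
    return min(1.0, indexed_files / total)


# Hypothetical Go-only repo with every file indexed.
go_repo = {"go": 500}
old = coverage_sketch(go_repo, indexed_files=500, languages=LEGACY_LANGUAGES)
new = coverage_sketch(go_repo, indexed_files=500, languages=supported_languages_sketch())
```

Under the legacy set the denominator is zero, so a fully indexed Go repo still reads as 0% covered; routing both the indexer and the readiness checks through one helper keeps the numerator and denominator consistent.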
Bumps version + CHANGELOG for the retrieval-quality work (PR #94): body-aware
chunking, embed_query doc/query asymmetry, model-mismatch guard, path
down-weight, non-Python coverage/readiness fix, and the hardened eval harness.
Validated: redis 0.633 / pandas 0.600 MRR@10 (>=0.55 target); body budget a
no-op 200-800.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the 0.2.25 bump: 8f19414's pyproject edit silently half-matched
(two version= lines) and src/attoswarm was missed, so the 4 sources
validate_release_version.py checks disagreed. All now read 0.2.25.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…embedder' into feat/retrieval-quality-chunking-embedder
eren23 merged commit 54a9f3a into main on May 31, 2026
1 check passed