feat(search): body-aware chunking + code embedder for semantic retrieval #94
Merged
…T, default 400) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_semantic_search.py built managers via SemanticSearchManager.__new__() and hand-listed dataclass fields; new fields (scoring_config, _trigram_index, nl_mode) were never added to those blocks, so 9 tests crashed with AttributeError on the core retrieval path. Plus test_queue_reindex_noop_when_unavailable didn't stub _ensure_provider, so the real BGE provider loaded and flipped _keyword_fallback back to False, defeating its simulated-unavailable assertion. Route the 5 __new__ blocks through a shared _bare_manager() that uses the real (cheap, provider-deferred) constructor + targeted overrides, so adding future fields can't silently re-break these. Stub _ensure_provider in the noop test. integrations/context suite: 582p/9f -> 591p/0f. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
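A toy illustration of why the `__new__()` pattern broke and why a shared builder is more durable. `Manager` and its fields are hypothetical stand-ins (a plain class, not the project's `SemanticSearchManager`); only the shape of `_bare_manager()` follows the commit message.

```python
class Manager:
    """Hypothetical stand-in for a manager with a cheap constructor."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.scoring_config = {"boost": 1.0}
        self.nl_mode = "auto"  # imagine this field was added later
        self._keyword_fallback = False

    def describe(self) -> str:
        return f"{self.root}:{self.nl_mode}"


def _bare_manager(**overrides) -> Manager:
    """Build via the real constructor, then apply targeted overrides."""
    mgr = Manager(root="/tmp/repo")
    for name, value in overrides.items():
        if not hasattr(mgr, name):
            raise AttributeError(f"unknown field: {name}")
        setattr(mgr, name, value)
    return mgr


# Brittle: __new__ skips __init__, so any field the test did not
# hand-assign is simply absent.
brittle = Manager.__new__(Manager)
brittle.root = "/tmp/repo"
try:
    brittle.describe()
    brittle_ok = True
except AttributeError:
    brittle_ok = False

# Robust: new fields get their defaults from the constructor.
robust = _bare_manager(_keyword_fallback=True)
```

With this shape, adding a field to the constructor needs no change in the tests, and a typo in an override name fails immediately.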
…atch guard + status line Earlier per-task commits (2fa0d03, 03ba829) landed empty because the Edit anchor strings didn't match the real code; the mismatch guard in search() and the status line in semantic_search_status were never actually applied. The Task-3 nomic test also mocked encode() with a plain list, so .tolist() raised. This adds the real guard + status line and fixes the test mock to return a numpy array. 767 passed / 5 skipped across context + tools suites.
Add _path_rank_penalty (0.6 for tests/asv_bench/docs_src/benchmarks, 1.0 for source) applied as a final ranking stage so genuine source outranks tests/benchmarks. Also removes stray placeholder comments that an earlier batch edit left in the boost stage.
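A minimal sketch of a path penalty applied as a final ranking stage. The 0.6/1.0 factors and the directory names come from the commit message; the matching rule (any directory component equals one of the names) and the `rerank` helper are assumptions, not the project's code.

```python
from pathlib import PurePosixPath

# Directory names taken from the commit message.
_DOWNWEIGHTED_DIRS = frozenset({"tests", "asv_bench", "docs_src", "benchmarks"})


def path_rank_penalty(file_path: str) -> float:
    """Return 0.6 if any directory component is down-weighted, else 1.0."""
    parts = PurePosixPath(file_path).parts[:-1]  # drop the file name
    return 0.6 if any(p in _DOWNWEIGHTED_DIRS for p in parts) else 1.0


def rerank(hits):
    """Multiply each (path, score) by its penalty, then sort descending."""
    adjusted = [(path, score * path_rank_penalty(path)) for path, score in hits]
    return sorted(adjusted, key=lambda hit: hit[1], reverse=True)


hits = [("tests/test_cache.py", 0.90), ("src/cache.py", 0.80)]
ranked = rerank(hits)
```

Here the test file starts ahead (0.90 vs 0.80) but drops to 0.54 after the penalty, so the source file ranks first.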
…smatch guard Earlier batch commits (03ba829 Task7, 2941cf2 Task8) landed empty because Edit anchors didn't match. This adds the real _path_rank_penalty + ranking stage (Task 8) and the model-mismatch status line (Task 7), and updates tests broken by intended behavior changes: auto-detect now nomic-first (was bge-first), the mismatch guard requires matching provider/store model names in RRF mocks, and the status-mismatch test mock needs numeric progress fields. Full context+tools suites: 710 passed, 1 xfailed.
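For context on the guard: vectors produced by different embedders live in different spaces, so comparing a query vector from one model against an index built by another gives meaningless similarities. A rough sketch of such a check follows; the function names, the message wording, and the `model-a`/`model-b` placeholders are all hypothetical, and what the real guard does on a mismatch is not shown in this thread.

```python
from typing import Optional


def models_match(provider_model: str, store_model: Optional[str]) -> bool:
    """True only when the stored vectors were built by the active embedder."""
    return store_model is not None and provider_model == store_model


def mismatch_status(provider_model: str, store_model: Optional[str]) -> Optional[str]:
    """Return a one-line warning on mismatch, or None when the models agree."""
    if models_match(provider_model, store_model):
        return None
    return (
        f"model mismatch: index built with {store_model!r}, "
        f"active provider is {provider_model!r}"
    )


line = mismatch_status("model-a", "model-b")
```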
The search-quality harness scored whatever index already existed (or none), so index changes (body chunking / embedder swap) could not be measured — embeddings built lazily mid-scoring or queries hit keyword fallback. Adds CodeIntelService.reindex(embeddings=True) which synchronously rebuilds the vector index via SemanticSearchManager.index() after the AST reindex, and an eval --reindex flag (default off, preserves prior behavior) that uses it. Respects ATTOCODE_EMBEDDING_MODEL + ATTOCODE_BODY_TOKEN_BUDGET so sweeps are deterministic. +2 unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the --reindex fix so the body-chunking/embedder sweep produces trustworthy, observable, resumable results: correctness guards (provenance stamp + fail-loud on 0 embedded chunks), per-repo progress milestones + isolation, --json output, pure format_sweep_comparison/_overall_from_results helpers (unit-tested), and eval/run_body_budget_sweep.py resumable driver (per-repo subprocess, skips done budgets, writes to gitignored eval/sweep_out/). +4 reporting tests. No new lint introduced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
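The "skips done budgets" behaviour can be sketched as a check for existing result files. The file-naming scheme and both helper names below are assumptions for illustration; the actual layout of eval/sweep_out/ is not described here.

```python
import json
import tempfile
from pathlib import Path


def result_path(out_dir: Path, repo: str, budget: int) -> Path:
    """Hypothetical naming scheme: one JSON file per (repo, budget)."""
    return out_dir / f"{repo}_budget{budget}.json"


def record_result(out_dir: Path, repo: str, budget: int, result: dict) -> None:
    """Persist one finished run so a later invocation can skip it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    result_path(out_dir, repo, budget).write_text(json.dumps(result))


def pending_budgets(out_dir: Path, repo: str, budgets):
    """Return the budgets that have no result file yet for this repo."""
    return [b for b in budgets if not result_path(out_dir, repo, b).exists()]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sweep_out"
    record_result(out, "redis", 200, {"status": "done"})
    remaining = pending_budgets(out, "redis", [200, 400, 800])
```

After an interrupted run, only the budgets in `remaining` would need to be launched again.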
…s/ts only
is_index_ready()/get_index_progress() derived total_files from a hardcoded
{python,javascript,typescript} set while the indexer (index/reindex) embeds a
wider set — tree-sitter langs (Go, Rust, C, ...). On a Go repo (gh-cli) this
made coverage read 0% even with a fully built vector index, so is_index_ready()
returned False and the eval reported vec_ready=False despite serving vectors.
Extracted SemanticSearchManager._supported_languages() and routed all three
call sites through it. +3 unit tests; 198 passed in integrations/context.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
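A small reproduction of the coverage bug under stated assumptions: the language sets and the `coverage` helper are illustrative, not the project's actual `_supported_languages()` contents or readiness logic.

```python
# The old hardcoded set named in the commit message.
LEGACY_LANGUAGES = frozenset({"python", "javascript", "typescript"})


def supported_languages() -> frozenset:
    """Single source of truth shared by indexing and readiness checks.

    The exact members are an assumption for this sketch.
    """
    return LEGACY_LANGUAGES | {"go", "rust", "c"}


def coverage(files_by_language: dict, embedded_files: int, languages) -> float:
    """Fraction of indexable files that have embeddings."""
    total = sum(n for lang, n in files_by_language.items() if lang in languages)
    return embedded_files / total if total else 0.0


go_repo = {"go": 120, "markdown": 8}
legacy = coverage(go_repo, 120, LEGACY_LANGUAGES)      # no Go files counted
fixed = coverage(go_repo, 120, supported_languages())  # Go files counted
```

With the legacy set the denominator is zero for a Go-only repo, so coverage reads 0% no matter how many files were embedded.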
Bumps version + CHANGELOG for the retrieval-quality work (PR #94): body-aware chunking, embed_query doc/query asymmetry, model-mismatch guard, path down-weight, non-Python coverage/readiness fix, and the hardened eval harness. Validated: redis 0.633 / pandas 0.600 MRR@10 (>=0.55 target); body budget a no-op 200-800. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
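A sketch of the doc/query asymmetry: a base class whose `embed_query` defaults to the document path, and a subclass that adds the `search_query: ` task prefix used by nomic embedding models. The class names and the fake length-based embedding are hypothetical; whether documents also receive a prefix in this project is not stated, so that side is left untouched.

```python
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    @abstractmethod
    def embed(self, texts):
        """Embed documents; returns one vector per text."""

    def embed_query(self, text):
        """Default for symmetric models: embed a query like a document."""
        return self.embed([text])[0]


class FakeNomicProvider(EmbeddingProvider):
    """Test double that records its inputs instead of loading a model."""

    def __init__(self):
        self.seen = []

    def embed(self, texts):
        self.seen.extend(texts)
        return [[float(len(t))] for t in texts]  # fake 1-d "embedding"

    def embed_query(self, text):
        # Asymmetric model: queries carry a task prefix.
        return self.embed([f"search_query: {text}"])[0]


provider = FakeNomicProvider()
vec = provider.embed_query("parse config")
```

Routing the query path through `embed_query` keeps the prefix logic inside the provider, so callers do not need to know which model is active.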
Completes the 0.2.25 bump: 8f19414's pyproject edit silently half-matched (two version= lines) and src/attoswarm was missed, so the 4 sources validate_release_version.py checks disagreed. All now read 0.2.25. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…embedder' into feat/retrieval-quality-chunking-embedder
What
Lifts semantic-search retrieval quality by embedding function/method bodies (not just signatures) and adding a code-aware embedder path, plus the eval infrastructure to measure it honestly.
Why / Evidence
Validated with a hardened eval harness (--reindex forces a synchronous embedding rebuild so the index actually reflects the config under test). Setup: BGE, --reindex, 3 repos × 5 ground-truth queries each.
Python/C-family retrieval lands at 0.60–0.63 MRR@10 (redis 0.633, pandas 0.600), clearing the 0.55 target. The lift comes from embedding bodies at all.
Body-budget sweep (200/400/800) is a no-op on this query set: per-repo MRR/NDCG and chunk counts are bit-identical across all three budgets. Default left at 400; could drop to 200 for free storage/compute savings later.
gh-cli (Go) at 0.100 is a pre-existing Go-extraction weakness (regex, not tree-sitter; a separate Phase-1 item), not a regression from this work, and it was confirmed to be serving vectors (not a keyword fallback).
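For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first relevant hit within the top 10, counting 0 when there is none. A minimal sketch with made-up rankings (the data below is illustrative and is not the harness's query set):

```python
def reciprocal_rank(ranked, relevant, k=10):
    """1/rank of the first relevant item in the top k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs, k=10):
    """Mean reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel, k) for r, rel in runs) / len(runs)


def _ranking(hit_rank):
    """Build a 12-item ranking with 'target' at hit_rank (None = absent)."""
    return ["target" if i == hit_rank else f"doc{i}" for i in range(1, 13)]


# Five queries whose first relevant hit is at rank 1, 2, 2, absent, 10.
runs = [(_ranking(r), {"target"}) for r in (1, 2, 2, None, 10)]
score = mrr_at_k(runs)  # (1 + 0.5 + 0.5 + 0 + 0.1) / 5
```

With five queries per repo, each query contributes at most 0.2 to the mean, so a single query changing rank moves the score visibly.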
Changes
- _slice_body token-budgeted extractor; func/method chunks carry signature + body; ATTOCODE_BODY_TOKEN_BUDGET (default 400).
- embed_query on the provider ABC; nomic search_query: prefix; query path uses embed_query; nomic-first auto-detect.
- Model-mismatch guard in search(), plus a status line in semantic_search_status.
- Path down-weight (_path_rank_penalty, 0.6) for tests/, benchmarks/, docs_src/, asv_bench/.
- is_index_ready()/get_index_progress() now count all indexed languages (tree-sitter: Go, Rust, C…), not just py/js/ts. Previously a Go repo reported 0% coverage / vec_ready=False despite a fully built index. Extracted _supported_languages() helper.
- --reindex + CodeIntelService.reindex(embeddings=True); correctness guards (provenance stamp; fail-loud on 0 embedded chunks); per-repo progress + isolation; --json; eval/run_body_budget_sweep.py resumable sweep driver.
Tests
integrations/context suite: 606 passed / 1 xfailed (the lone unrelated failure is a pre-existing token-estimate test, untouched here).
🤖 Generated with Claude Code
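A rough sketch of what a token-budgeted body slice could look like. It assumes whitespace-separated words as a crude token proxy and line-granular truncation; the real _slice_body may count tokens and cut differently. Only the environment variable name and the default of 400 come from this PR.

```python
import os


def default_budget() -> int:
    """Read the budget from the environment, falling back to 400."""
    return int(os.environ.get("ATTOCODE_BODY_TOKEN_BUDGET", "400"))


def slice_body(source: str, token_budget: int) -> str:
    """Keep leading lines until the whitespace-token budget is spent."""
    kept, used = [], 0
    for line in source.splitlines():
        cost = len(line.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


func = "def add(a, b):\n    total = a + b\n    return total\n"
chunk = slice_body(func, token_budget=8)   # signature + first body line
full = slice_body(func, token_budget=400)  # whole function fits
```

Under this sketch, any function that fits in the smallest budget yields the same chunk at every budget. That would be one possible explanation for the bit-identical sweep results, though the PR does not state the cause.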