feat: use Nextcloud filename for indexed file title + reconcile on rename#857
Conversation
|
Claude finished @cbcoutinho's task in 2m 32s —— View job PR Review: feat/file-title-from-filename
Overall: ready to merge. The design is sound, the implementation is clean, and the two rename skip-paths are both covered. Findings below are all nit/observation level — nothing blocking.
|
| Finding | |
|---|---|
| 🔴 Blocking | none |
| 🟡 Notable | Scanner reconcile elif has no unit test (acknowledged follow-up) |
| 🟢 Nit | Legacy-point backfill will log at INFO on every dedup hit until backfilled |
| 🟢 Nit | file_metadata now used only for page_boundaries (narrowed use, not a problem) |
…name The vector-sync pipeline derived an indexed file's display title from the document's embedded metadata (e.g. a PDF's /Title), falling back to the filename only when absent. That embedded title frequently disagrees with how the user named the file in Nextcloud and is confusing in the astrolabe vector-viz UI (a passive consumer of the `title` payload field). For files, always derive the title from the Nextcloud filename via a shared `file_title_from_path` helper. Notes/deck/news keep their metadata titles. A rename/move in Nextcloud keeps the fileid (doc_id) and content (etag/mtime) but changes the path, so both the dedup claim and the scanner freshness gate skip re-embedding and the stored file_path/title go stale. Add `reconcile_document_path`: a metadata-only set_payload that refreshes file_path + title on the existing real chunks without re-fetch/re-embed. Wire it into both skip paths: - dedup hit (etag unchanged on rename) via claim_existing_index(current_path=...) - scanner incremental skip (etag changed, mtime stable) Both reuse already-fetched payloads, so steady-state scans add no extra round-trip (reconcile is a no-op when the path is unchanged). Refs: Deck #204 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path Round-1 review follow-ups (PR #857): - scanner.py: skip the rename-reconcile when the existing metadata point is a placeholder. reconcile_document_path only touches real chunks, so a not-yet- indexed file would just incur a 0-point set_payload; the real index writes the current path anyway. - test_sharing_state.py: add a dedup-hit case where the file was renamed AND the user is new to the ACL, asserting both set_payload writes fire (file_path/title and acl_principals). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e62a055 to
a03bc0a
Compare
|
Round-1 findings addressed in
Left as-is:
|
…mment Round-2 review follow-ups (PR #857), both documentation-only: - sharing_state.py: reconcile_document_path docstring no longer claims it returns False when no real points exist — it returns True and the set_payload is a Qdrant-side no-op (callers discard the return value). - scanner.py: reword the rename-reconcile comment to state the precise reason (modified_at stable so not re-queued; path may be stale from a rename) rather than the loose "dedup miss / etag changed" phrasing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Round-2 findings addressed in
|
|
|
Round 3: thanks — taking the "ready to merge" verdict. No code changes this round; the remaining items are nits and I want to avoid churning the review/CI cycle for them:
Holding for the integration matrix to go green before this is merge-ready. |
…with posture preservation Selective merge of 196 upstream commits past pi0n00r/master (76a4756) to upstream tip (457c115). Catches up across 13 minor versions: v0.99.0 -> v0.100.x -> ... -> v0.110.x -> v0.111.0. Conflict resolution (6): - nextcloud_mcp_server/auth/astrolabe_client.py (modify/delete) -- git rm (Class-2 keep-deleted; upstream re-modified the file we strip) - third_party/astrolabe (modify/delete) -- git rm (multiple submodule pointer bumps absorbed via same disposition) - nextcloud_mcp_server/client/contacts.py -- took ours (byte-preserving substrate; PR cbcoutinho#876 contacts-delete-object-href fix is in upstream line range we did not need) - nextcloud_mcp_server/client/webdav.py -- took ours (A.2 chunked-upload at CHUNK_THRESHOLD load-bearing) - nextcloud_mcp_server/models/contacts.py -- took ours (5 typed fields preserved: urls, note, photo, categories, addresses) - nextcloud_mcp_server/server/oauth_tools.py -- surgical merge-then- patch-then-strip per Cat F Lesson #2. Upstream added a 30-line block in get_provisioning_status() that instantiates AstrolabeClient directly to check background-sync provisioning status via Astrolabe. Rejected the entire block (took HEAD; AstrolabeClient is not in our tree by Class-2 strip). The OAuth refresh-token fallback path immediately below the block remains as the sole provisioning check. Class-2 keep-deleted extension (per Gary directive 2026-06-04): - tests/contract/test_astrolabe_credentials_consumer.py - tests/integration/test_astrolabe_multi_user_background_sync.py - tests/unit/test_astrolabe_client.py Three new Astrolabe-coupled test files added by upstream. Per the standing principle that tests of stripped code have nothing to verify in our tree, force-rm'd; added to Class-2 list. bridgette_* metric rename extended (per Gary directive 2026-06-04): - observability/metrics.py +7 new astrolabe_<lower> identifiers renamed (text_quality, parse_failed_total, classifier_flag_total, plus 4 more from the embedding-tokens + per-tenant-usage work) - config.py +1 (1 astrolabe_<lower> reference in a comment/identifier) - document_processors/classifier.py +1 - document_processors/_isolation.py +1 Total cumulative bridgette_ count: 25 across 5 files (was 15 in 2 files at Cat N). Pre-existing Tier-1 elicitation landmines unchanged. Auto-merge handled cleanly: README.md, SECURITY.md, LICENSE.md, AGENTS.md (no upstream changes to those files in this range); 90+ other files including the contacts byte-preserving tests, the I/F-Match webdav tests, and the size-limit tests. Upstream did not touch our just-customized README/SECURITY from this session's 76a4756 commit. Substantive upstream additions absorbed: Potentially substantive on our runtime: - PR cbcoutinho#885: feat(deck): add deck_move_card_to_board tool for cross-board moves. We exercise Deck; this is net-new MCP functionality. - PR cbcoutinho#876: fix(contacts): contacts delete object href -- examined for relevance to our 9-op surface; our byte-preserving substrate handles delete via patch_contact differently; not load-bearing for us. Maintenance / bug fixes: - 44 fix commits across the range (auth, search, vector, embedding, document processing); most are bug fixes to dormant code in our env. - PR cbcoutinho#854: fix/astrolabe-multi-user-integration-tests -- absorbed dormant (multi_user_basic + Astrolabe coupled; both off in our env) Dormant absorption (no Qdrant / no semantic search / no document processing target in our env): - PR cbcoutinho#855: tier-0 document classifier in shadow mode - PR cbcoutinho#858: tiered PDF processor with pypdfium2 fast path - PR cbcoutinho#862, cbcoutinho#863: pypdfium2 page close + OCR quality escalation trigger - PR cbcoutinho#868: page-aware PDF chunking - PR cbcoutinho#865: webhook tag indexing (real-time vector indexing) - PR cbcoutinho#871: per-tenant usage metering data plane (bridgette_-renamed) - PR cbcoutinho#875: meter embedding tokens (bridgette_-renamed) - PR cbcoutinho#879: pages_embedded real parsed page count - PR cbcoutinho#857: file title from filename + reconcile on rename - PR cbcoutinho#884: nongating readiness lifespan refactor - ADR-029: Pact consumer contract for Astrolabe credentials (test artifacts force-rm'd; the ADR itself is documentation, harmless) CI / dependency churn (no posture impact): - 64 renovate dep bumps + chore commits across the range - 4+ astrolabe submodule pointer bumps -- all absorbed via git rm Posture verification (paragraph 6 gauntlet, ALL PASS): - oauth_tools.py: 0 Astrolabe refs, 4 get_shared_storage, 0 Optional[ - models/contacts.py: 5 typed fields present - vcard_parser.py: 441 lines (byte-preserving substrate) - webdav.py: CHUNK_THRESHOLD line 382 preserved - Class-2 keep-deleted (9 files incl. 3 new tests): all absent - .gitmodules: 0 astrolabe entries - README.md / SECURITY.md fork-customizations: 0 astrolabe, 0 graphics, 0 cbcoutinho upstream-pointing URLs - astrolabe_<lower> outside elicitation Tier-1 landmines: 0 - bridgette_ source-wide: 25 across 5 files - pyproject.toml version: 0.111.0 - py_compile: 15/15 critical files clean Multi-user gate (paragraph 2): the auth-fix and astrolabe-multi-user integration test fixes are maintenance class on already-in-tree code; no new multi-user provisioning, cross-user tokens, or per-user OAuth scope expansion. Gate does not fire. Stats: 116 files changed, +10,558 / -1,447 lines. Safety tag: pre-upstream-merge-196-76a47560.



Summary
The vector-sync pipeline derived an indexed file's display title from the document's embedded metadata (e.g. a PDF's
/Title), falling back to the Nextcloud filename only when absent. That embedded title frequently disagrees with how the user actually named the file, which is confusing in the astrolabe vector-viz UI. The UI is a passive consumer of thetitlepayload field, so this is a backend change.This PR:
doc_type == "file", always derives the display title from the Nextcloud filename (sharedfile_title_from_pathhelper). Notes / deck cards / news keep their existing metadata-derived titles.fileid(ourdoc_id) and content (etag/mtime) but changes the path, so both the dedup claim and the scanner freshness gate skip re-embedding and the storedfile_path/titlego stale. Newreconcile_document_pathrefreshesfile_path+titleon the existing real chunks via a metadata-onlyset_payload— no re-fetch, no re-embed.Both rename skip-paths are covered, regardless of Nextcloud's etag-on-rename behavior:
claim_existing_index(current_path=...)Both reuse payloads already fetched, so steady-state scans (no rename) add no extra Qdrant round-trip (reconcile no-ops when the path is unchanged).
Changes
vector/sharing_state.py—file_title_from_path(),reconcile_document_path(), optionalcurrent_pathonclaim_existing_index().vector/processor.py— file title from filename; threadscurrent_pathinto the dedup claim.vector/scanner.py— threadscurrent_pathinto the dedup claim; reconciles in the incremental-skip branch.tests/unit/vector/test_sharing_state.py— +90 lines: title helper, reconcile (rename / no-op / legacy backfill), dedup-hit reconcile.Testing
ruff check,ruff format --check,ty check— clean.pytest tests/unit/— 1360 passed, 1 skipped.Known follow-ups (out of scope)
file_path/titlebut not the storedetag; an etag-regenerating rename leaves a mild perpetual-rescan inefficiency (correctness unaffected).Refs: Deck #204
This PR was generated with the help of AI, and reviewed by a Human