Skip to content

feat: use Nextcloud filename for indexed file title + reconcile on rename#857

Merged
cbcoutinho merged 3 commits into
masterfrom
feat/file-title-from-filename
Jun 4, 2026
Merged

feat: use Nextcloud filename for indexed file title + reconcile on rename#857
cbcoutinho merged 3 commits into
masterfrom
feat/file-title-from-filename

Conversation

@cbcoutinho

Copy link
Copy Markdown
Owner

Summary

The vector-sync pipeline derived an indexed file's display title from the document's embedded metadata (e.g. a PDF's /Title), falling back to the Nextcloud filename only when absent. That embedded title frequently disagrees with how the user actually named the file, which is confusing in the astrolabe vector-viz UI. The UI is a passive consumer of the title payload field, so this is a backend change.

This PR:

  • For doc_type == "file", always derives the display title from the Nextcloud filename (shared file_title_from_path helper). Notes / deck cards / news keep their existing metadata-derived titles.
  • Adds rename/move reconciliation: a rename in Nextcloud keeps the fileid (our doc_id) and content (etag/mtime) but changes the path, so both the dedup claim and the scanner freshness gate skip re-embedding and the stored file_path/title go stale. New reconcile_document_path refreshes file_path + title on the existing real chunks via a metadata-only set_payload — no re-fetch, no re-embed.

Both rename skip-paths are covered, regardless of Nextcloud's etag-on-rename behavior:

  • dedup hit (etag unchanged on rename) → claim_existing_index(current_path=...)
  • scanner incremental skip (etag changed, mtime stable) → reconcile in the scanner loop

Both reuse payloads already fetched, so steady-state scans (no rename) add no extra Qdrant round-trip (reconcile no-ops when the path is unchanged).

Changes

  • vector/sharing_state.pyfile_title_from_path(), reconcile_document_path(), optional current_path on claim_existing_index().
  • vector/processor.py — file title from filename; threads current_path into the dedup claim.
  • vector/scanner.py — threads current_path into the dedup claim; reconciles in the incremental-skip branch.
  • tests/unit/vector/test_sharing_state.py — +90 lines: title helper, reconcile (rename / no-op / legacy backfill), dedup-hit reconcile.

Testing

  • ruff check, ruff format --check, ty check — clean.
  • pytest tests/unit/ — 1360 passed, 1 skipped.

Known follow-ups (out of scope)

  • The path-2 reconcile refreshes file_path/title but not the stored etag; an etag-regenerating rename leaves a mild perpetual-rescan inefficiency (correctness unaffected).
  • No end-to-end integration test for rename → rescan → updated payload yet (the reconcile primitive is unit-tested).

Refs: Deck #204


This PR was generated with the help of AI, and reviewed by a Human

@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude finished @cbcoutinho's task in 2m 32s —— View job


PR Review: feat/file-title-from-filename

  • Load tools and set up
  • Read changed files and git diff
  • Analyze sharing_state.py changes
  • Analyze processor.py changes
  • Analyze scanner.py changes
  • Review test coverage
  • Post final review

Overall: ready to merge. The design is sound, the implementation is clean, and the two rename skip-paths are both covered. Findings below are all nit/observation level — nothing blocking.


vector/sharing_state.py

file_title_from_path — Clean and correct. The or file_path fallback for the degenerate "/" case is the right call; returning "" as a title would be confusing. Worth tracing for confidence:

"/".rstrip("/") → ""
"".rsplit("/", 1)[-1] → ""
"" or "/" → "/"  ✓

reconcile_document_path — The real_only=True filter is the right choice: placeholders already carry the current file_path from write_placeholder_point, and the reconcile is only meaningful for real embedded chunks. No extra Qdrant round-trip on unchanged paths since the early return False fires. The logger.info using %r (quoted output) makes None vs empty string visually distinct in logs — good defensive choice.

claim_existing_index ordering — Reconcile fires before add_principal. Both are independently non-fatal, so the order doesn't affect correctness. Worth noting that a reconcile failure doesn't suppress the principal grant (and vice versa), which is the right fail-open behaviour.

🟢 Nitexisting_metadata.get("file_path") passes None to stored_path when the key is absent. This triggers the "backfill" path in reconcile_document_path (None != current_path → write fires). That's the intended legacy-backfill behaviour, and it works, but it means every dedup hit for a legacy point without a file_path key will issue a set_payload even on repeat scans. Since reconcile returns True and logs at INFO in that case, it could be noisy until all legacy points are backfilled. Not a correctness issue — just worth noting if you see unexpected INFO spam post-deploy.


vector/processor.py

Title derivation change (line 626) — The old file_metadata.get("title") or file_path.split("/")[-1] used embedded document metadata as the primary source. The new file_title_from_path(file_path) always uses the Nextcloud filename, which is the stated goal. One minor downstream effect: file_metadata (line 622) is now only consumed for page_boundaries — it's still used, just no longer for the title. No dead-code issue, just a narrowed use.

🟢 Nit — The comment block added at lines 623–625 is well-written and explains the why clearly.


vector/scanner.py

New elif reconcile branch (lines 590–617) — The guard condition is correctly structured:

elif existing_metadata is not None and not existing_metadata.get("is_placeholder", False):

The existing_metadata is not None half is technically unreachable (if it were None, needs_indexing would be True and the if branch would have fired). Keeping it there is fine — it makes the condition self-documenting and safe against future refactors.

The comment above the elif clearly states the three conditions that must hold to reach this branch (fresh mtime → not needs_indexing; etag dedup hit → already continued; not placeholder). This is exactly the rename-with-stable-mtime case and the comment is accurate after the round-2 fix.

🟡 Observation — The scanner's reconcile elif has no direct unit test. The reconcile primitive is well tested in TestReconcileDocumentPath, and the dedup-hit reconcile path is tested in TestClaimExistingIndex, but the scanner-layer invocation (the incremental-skip path where etag changed but mtime is stable) isn't exercised at the unit level. The PR description acknowledges this as a known follow-up. Since the primitive is tested and the wiring is straightforward, this is acceptable short-term — but worth adding a scanner unit test before this code path gets more complex.


tests/unit/vector/test_sharing_state.py

The test additions are solid:

  • TestFileTitleFromPath — 4 cases cover normal, no-directory, trailing-slash, and the root-only edge case.
  • TestReconcileDocumentPath — 4 cases cover no-op (unchanged), no-op (empty current), rename, and legacy-backfill. The filter-key assertion (_must_keys(kwargs["points"]) == ["doc_id", "doc_type", "is_placeholder"]) confirms real_only=True is wired through.
  • TestClaimExistingIndex additions — the dual-write case (test_dedup_hit_reconciles_and_grants_new_principal) is the most important new test; it verifies both set_payload calls fire independently, which was the round-1 finding. test_dedup_hit_without_current_path_skips_reconcile closes the regression case for non-file callers.

🟢 Nittest_hit_for_already_listed_user_writes_nothing (the pre-existing test at line 255) was written without current_path. It now also implicitly tests that claim_existing_index without current_path skips reconcile even when file_path is present in the payload — this is covered by the new test_dedup_hit_without_current_path_skips_reconcile, which makes the intent explicit. Good.


Summary

Finding
🔴 Blocking none
🟡 Notable Scanner reconcile elif has no unit test (acknowledged follow-up)
🟢 Nit Legacy-point backfill will log at INFO on every dedup hit until backfilled
🟢 Nit file_metadata now used only for page_boundaries (narrowed use, not a problem)

cbcoutinho and others added 2 commits June 5, 2026 01:16
…name

The vector-sync pipeline derived an indexed file's display title from the
document's embedded metadata (e.g. a PDF's /Title), falling back to the
filename only when absent. That embedded title frequently disagrees with how
the user named the file in Nextcloud and is confusing in the astrolabe
vector-viz UI (a passive consumer of the `title` payload field).

For files, always derive the title from the Nextcloud filename via a shared
`file_title_from_path` helper. Notes/deck/news keep their metadata titles.

A rename/move in Nextcloud keeps the fileid (doc_id) and content (etag/mtime)
but changes the path, so both the dedup claim and the scanner freshness gate
skip re-embedding and the stored file_path/title go stale. Add
`reconcile_document_path`: a metadata-only set_payload that refreshes
file_path + title on the existing real chunks without re-fetch/re-embed.
Wire it into both skip paths:
  - dedup hit (etag unchanged on rename) via claim_existing_index(current_path=...)
  - scanner incremental skip (etag changed, mtime stable)
Both reuse already-fetched payloads, so steady-state scans add no extra
round-trip (reconcile is a no-op when the path is unchanged).

Refs: Deck #204

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path

Round-1 review follow-ups (PR #857):
- scanner.py: skip the rename-reconcile when the existing metadata point is a
  placeholder. reconcile_document_path only touches real chunks, so a not-yet-
  indexed file would just incur a 0-point set_payload; the real index writes the
  current path anyway.
- test_sharing_state.py: add a dedup-hit case where the file was renamed AND the
  user is new to the ACL, asserting both set_payload writes fire (file_path/title
  and acl_principals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cbcoutinho cbcoutinho force-pushed the feat/file-title-from-filename branch from e62a055 to a03bc0a Compare June 4, 2026 23:19
@cbcoutinho

Copy link
Copy Markdown
Owner Author

Round-1 findings addressed in a03bc0af:

  • _shadow_classify "removal" (🟡): false alarm from a stale base — the branch was cut from dd335275 before feat: tier-0 document classifier (shadow mode) #855 landed the tier-0 classifier, so GitHub diffed against the newer master and showed those lines as removed. Rebased onto current origin/master (now 0.102.0); _shadow_classify and its imports are intact, my changes apply cleanly with no conflict.
  • Placeholder round-trip (🟢, scanner.py): added the not existing_metadata.get("is_placeholder", False) guard so the scanner-skip reconcile no longer issues a 0-point set_payload for a not-yet-indexed file.
  • Dual-write test gap (🟢): added test_dedup_hit_reconciles_and_grants_new_principal — renamed file + new ACL user, asserting both set_payload calls fire (file_path/title and acl_principals).

Left as-is:

  • Merging the two set_payload calls (🟢): keeping them separate, per your own note that merging adds complexity for little gain — the dual-write case is now covered by a test.

ruff/ty green; tests/unit/vector/ + classifier tests pass (116).

…mment

Round-2 review follow-ups (PR #857), both documentation-only:
- sharing_state.py: reconcile_document_path docstring no longer claims it
  returns False when no real points exist — it returns True and the set_payload
  is a Qdrant-side no-op (callers discard the return value).
- scanner.py: reword the rename-reconcile comment to state the precise reason
  (modified_at stable so not re-queued; path may be stale from a rename) rather
  than the loose "dedup miss / etag changed" phrasing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cbcoutinho

Copy link
Copy Markdown
Owner Author

Round-2 findings addressed in b7479b0d (both documentation-only, no behavior change):

  • reconcile_document_path docstring (🟡): corrected — it returns True once the path differs and the set_payload is a Qdrant-side no-op when no real chunks exist; it does not return False in that case. Took the lighter docstring fix (no extra pre-check scroll), per your suggestion.
  • Scanner comment imprecision (🟡): reworded to state the precise trigger — "modified_at stable so not re-queued; path may be stale from a rename" — instead of the loose "dedup miss / etag changed" phrasing.

ruff/ty green; tests/unit/vector/ pass (29).

@sonarqubecloud

sonarqubecloud Bot commented Jun 4, 2026

Copy link
Copy Markdown

@cbcoutinho

Copy link
Copy Markdown
Owner Author

Round 3: thanks — taking the "ready to merge" verdict. No code changes this round; the remaining items are nits and I want to avoid churning the review/CI cycle for them:

  • Scanner elif has no unit test (🟡): acknowledged follow-up — adding it means standing up the first scanner-level unit harness (WebDAV tag-REPORT + Qdrant + send-stream mocks). The reconcile primitive and the dedup-hit path are unit-tested; deferring the scanner-wiring test as noted in the PR.
  • Legacy-point backfill INFO noise (🟢): it self-heals — the first reconcile writes file_path onto the real chunks, so the next scan sees stored_path == current_path and no-ops. One INFO line per legacy doc on the first post-deploy scan, then quiet.
  • file_metadata narrowed use (🟢): intentional — still consumed for page_boundaries.

Holding for the integration matrix to go green before this is merge-ready.

@cbcoutinho cbcoutinho merged commit bfe3190 into master Jun 4, 2026
11 of 13 checks passed
pi0n00r added a commit to pi0n00r/nextcloud-mcp-server that referenced this pull request Jun 11, 2026
…with posture preservation

Selective merge of 196 upstream commits past pi0n00r/master (76a4756) to
upstream tip (457c115). Catches up across 13 minor versions: v0.99.0 ->
v0.100.x -> ... -> v0.110.x -> v0.111.0.

Conflict resolution (6):
  - nextcloud_mcp_server/auth/astrolabe_client.py (modify/delete) -- git rm
    (Class-2 keep-deleted; upstream re-modified the file we strip)
  - third_party/astrolabe (modify/delete) -- git rm (multiple submodule
    pointer bumps absorbed via same disposition)
  - nextcloud_mcp_server/client/contacts.py -- took ours (byte-preserving
    substrate; PR cbcoutinho#876 contacts-delete-object-href fix is in upstream
    line range we did not need)
  - nextcloud_mcp_server/client/webdav.py -- took ours (A.2 chunked-upload
    at CHUNK_THRESHOLD load-bearing)
  - nextcloud_mcp_server/models/contacts.py -- took ours (5 typed fields
    preserved: urls, note, photo, categories, addresses)
  - nextcloud_mcp_server/server/oauth_tools.py -- surgical merge-then-
    patch-then-strip per Cat F Lesson #2. Upstream added a 30-line block
    in get_provisioning_status() that instantiates AstrolabeClient
    directly to check background-sync provisioning status via Astrolabe.
    Rejected the entire block (took HEAD; AstrolabeClient is not in our
    tree by Class-2 strip). The OAuth refresh-token fallback path
    immediately below the block remains as the sole provisioning check.

Class-2 keep-deleted extension (per Gary directive 2026-06-04):
  - tests/contract/test_astrolabe_credentials_consumer.py
  - tests/integration/test_astrolabe_multi_user_background_sync.py
  - tests/unit/test_astrolabe_client.py
    Three new Astrolabe-coupled test files added by upstream. Per the
    standing principle that tests of stripped code have nothing to
    verify in our tree, force-rm'd; added to Class-2 list.

bridgette_* metric rename extended (per Gary directive 2026-06-04):
  - observability/metrics.py +7 new astrolabe_<lower> identifiers renamed
    (text_quality, parse_failed_total, classifier_flag_total, plus 4 more
    from the embedding-tokens + per-tenant-usage work)
  - config.py +1 (1 astrolabe_<lower> reference in a comment/identifier)
  - document_processors/classifier.py +1
  - document_processors/_isolation.py +1
  Total cumulative bridgette_ count: 25 across 5 files (was 15 in 2
  files at Cat N). Pre-existing Tier-1 elicitation landmines unchanged.

Auto-merge handled cleanly: README.md, SECURITY.md, LICENSE.md, AGENTS.md
(no upstream changes to those files in this range); 90+ other files
including the contacts byte-preserving tests, the I/F-Match webdav tests,
and the size-limit tests. Upstream did not touch our just-customized
README/SECURITY from this session's 76a4756 commit.

Substantive upstream additions absorbed:

Potentially substantive on our runtime:
  - PR cbcoutinho#885: feat(deck): add deck_move_card_to_board tool for cross-board
    moves. We exercise Deck; this is net-new MCP functionality.
  - PR cbcoutinho#876: fix(contacts): contacts delete object href -- examined for
    relevance to our 9-op surface; our byte-preserving substrate handles
    delete via patch_contact differently; not load-bearing for us.

Maintenance / bug fixes:
  - 44 fix commits across the range (auth, search, vector, embedding,
    document processing); most are bug fixes to dormant code in our env.
  - PR cbcoutinho#854: fix/astrolabe-multi-user-integration-tests -- absorbed
    dormant (multi_user_basic + Astrolabe coupled; both off in our env)

Dormant absorption (no Qdrant / no semantic search / no document
processing target in our env):
  - PR cbcoutinho#855: tier-0 document classifier in shadow mode
  - PR cbcoutinho#858: tiered PDF processor with pypdfium2 fast path
  - PR cbcoutinho#862, cbcoutinho#863: pypdfium2 page close + OCR quality escalation trigger
  - PR cbcoutinho#868: page-aware PDF chunking
  - PR cbcoutinho#865: webhook tag indexing (real-time vector indexing)
  - PR cbcoutinho#871: per-tenant usage metering data plane (bridgette_-renamed)
  - PR cbcoutinho#875: meter embedding tokens (bridgette_-renamed)
  - PR cbcoutinho#879: pages_embedded real parsed page count
  - PR cbcoutinho#857: file title from filename + reconcile on rename
  - PR cbcoutinho#884: nongating readiness lifespan refactor
  - ADR-029: Pact consumer contract for Astrolabe credentials (test
    artifacts force-rm'd; the ADR itself is documentation, harmless)

CI / dependency churn (no posture impact):
  - 64 renovate dep bumps + chore commits across the range
  - 4+ astrolabe submodule pointer bumps -- all absorbed via git rm

Posture verification (paragraph 6 gauntlet, ALL PASS):
  - oauth_tools.py: 0 Astrolabe refs, 4 get_shared_storage, 0 Optional[
  - models/contacts.py: 5 typed fields present
  - vcard_parser.py: 441 lines (byte-preserving substrate)
  - webdav.py: CHUNK_THRESHOLD line 382 preserved
  - Class-2 keep-deleted (9 files incl. 3 new tests): all absent
  - .gitmodules: 0 astrolabe entries
  - README.md / SECURITY.md fork-customizations: 0 astrolabe, 0 graphics,
    0 cbcoutinho upstream-pointing URLs
  - astrolabe_<lower> outside elicitation Tier-1 landmines: 0
  - bridgette_ source-wide: 25 across 5 files
  - pyproject.toml version: 0.111.0
  - py_compile: 15/15 critical files clean

Multi-user gate (paragraph 2): the auth-fix and astrolabe-multi-user
integration test fixes are maintenance class on already-in-tree code; no
new multi-user provisioning, cross-user tokens, or per-user OAuth scope
expansion. Gate does not fire.

Stats: 116 files changed, +10,558 / -1,447 lines.

Safety tag: pre-upstream-merge-196-76a47560.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant