Providers/SemanticStore: hosted OpenAI vector stores behind a unified SemanticStore protocol #29

Merged: odrobnik merged 3 commits into main from claude/admiring-joliot-150cd6 on Jun 11, 2026
@odrobnik

Contributor

What this does

Lets the OpenAI hosted vector stores be used exactly like the local SQLite one — one protocol, two backends:

let store: any SemanticStore = try await OpenAIVectorStore.openOrCreate(named: "knowledge", client: client)
// …or: try SQLiteVectorStore(storage: .file(path))   — same indexText / indexFile / sync / search / count

Fixes

  • The vector-store client was accidentally internal. Everything except createVectorStoreFile lacked public — usable only via @testable. All CRUD/file/batch methods are now public; ExpirationPolicy/FileCounts/VectorStoreFilesBatch became constructible/readable; stray debug print removed; Vectore filename typo fixed.
  • FileStatus couldn't decode failures: cancelled/failed were missing, so retrieving a failed file threw a decoding error instead of reporting it.
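
To make the FileStatus fix concrete, here is a minimal sketch of why a missing enum case turns into a thrown decoding error. All type names below are hypothetical stand-ins rather than the client's actual declarations; only the `cancelled`/`failed` raw values come from the description above.

```swift
import Foundation

// Hypothetical stand-ins for illustration; the real client types may differ.
enum LegacyFileStatus: String, Decodable {
    case inProgress = "in_progress"
    case completed
}

enum FileStatusSketch: String, Decodable {
    case inProgress = "in_progress"
    case completed
    case cancelled
    case failed
}

struct LegacyFile: Decodable { let id: String; let status: LegacyFileStatus }
struct FileSketch: Decodable { let id: String; let status: FileStatusSketch }

// A payload describing a file whose processing failed (the id is made up).
func failedFilePayload() -> Data {
    Data(#"{"id":"file_123","status":"failed"}"#.utf8)
}

// With the two cases missing, the raw value "failed" has no match,
// so decoding throws and this returns false.
func legacyDecodeSucceeds(_ data: Data) -> Bool {
    (try? JSONDecoder().decode(LegacyFile.self, from: data)) != nil
}

// With the cases present, the failure is reported as a value.
func decodedStatus(_ data: Data) throws -> FileStatusSketch {
    try JSONDecoder().decode(FileSketch.self, from: data).status
}
```

A caller can then branch on the decoded status instead of handling a decoding error.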

Features

  • Search endpoint (POST /v1/vector_stores/{id}/search) with VectorStoreAttribute/VectorStoreFilter wire types, tolerant search_query decoding, and attributes on file attach.
  • SemanticStore protocol — the store-agnostic core; MemoryMatch & co. un-gated and Sendable; span-less (0/0) citations render source:path.
  • OpenAIVectorStore — hosted SemanticStore with local-parity identity via path/source/hash attributes (replace on re-index, hash-skip, sync-prune), storage-leak-free deletes, and server-side query rewriting (rewritesQueries, observable via lastSearchQueries). No trait needed.
  • LocalVectorStore made public (the guide always advertised it).
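
The span-less citation convention mentioned above can be sketched roughly as follows. The `MatchSketch` type and the format used for matches that do have a line span are assumptions for illustration; only the `source:path` rendering for 0/0 comes from this PR.

```swift
// Hypothetical stand-in for a search match; not the library's MemoryMatch.
struct MatchSketch {
    let source: String
    let path: String
    let startLine: Int
    let endLine: Int

    // 0/0 means "whole artifact": render just source:path.
    var citation: String {
        if startLine == 0 && endLine == 0 {
            return "\(source):\(path)"
        }
        // Assumed format for spanned matches; the real one may differ.
        return "\(source):\(path)#L\(startLine)-L\(endLine)"
    }
}
```

Hosted results carry no line span because chunking happens server-side, which is presumably why the whole-artifact form matters here.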

Breaking: product rename VectorStore → SemanticStore

The target now hosts three stores plus FTS5/RRF/expansion/reranking, and the old name collided with both the Providers.VectorStore wire DTO and OpenAI's product term. Migration: .product(name: "SemanticStore", …) + import SemanticStore. The SQLiteVectorStore trait and concrete class names are unchanged. Known trade-off: module and protocol share a name (XCTest-style), so module-qualifying other symbols needs scoped imports.

Verification

  • Both build configs clean (default + --traits SQLiteVectorStore), swiftlint --strict clean.
  • 47 tests in the three store suites, including a live hosted round-trip (index → search → rewrite → incremental skip → sync prune → delete) gated on OPENAI_API_KEY.
  • rewrite_query validated against the live API (a bogus-param control confirms strict body validation).
  • QMDKit compiled and tested against this branch via swift package edit (7/7), both before and after the rename.

Follow-up

  • QMDKit needs the two-line adaptation (product name + imports) once this merges.
  • Future: media/image indexing (indexMedia, multimodal embeddings) per the openclaw mechanism — the 0/0 whole-artifact citation convention is already in place for it.

🤖 Generated with Claude Code

odrobnik and others added 2 commits June 11, 2026 12:53
…ttributes

The vector-store CRUD, file, and batch methods were internal by accident
(only createVectorStoreFile carried public) — reachable solely via
@testable. All of them are now public, with the supporting models made
usable: ExpirationPolicy gets a public init, FileCounts public counts,
VectorStoreFilesBatch goes public, LastError becomes Sendable. The
stray progress print in waitUntilVectorStoreIsReady is gone and the
'Vectore' filename typo is fixed.

Two functional additions bring the client up to the current API:

- FileStatus gains cancelled/failed — decoding a failed vector-store
  file previously threw instead of reporting the failure.
- searchVectorStore(id:query:maxNumResults:filters:rewriteQuery:) — the
  2025 search endpoint the 2024-era client predated — with its wire
  types (VectorStoreAttribute, VectorStoreFilter, the result page with
  tolerant search_query decoding) and attributes on file attach, the
  hooks the SemanticStore unification builds on. The filter operator
  case names (eq/ne/gt/lt/or) mirror the documented wire names and are
  excluded from the identifier_name lint rule like the other protocol
  names.
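
As a rough illustration of the filter wire shape, the sketch below encodes a comparison and a compound filter. It assumes the documented JSON layout ("type" plus "key"/"value" for comparisons, "type" plus "filters" for compounds), restricts values to strings, and uses a hypothetical `FilterSketch` type rather than the actual VectorStoreFilter.

```swift
import Foundation

// Hypothetical, simplified stand-in for VectorStoreFilter (string values only).
enum FilterSketch: Encodable {
    case eq(key: String, value: String)
    case ne(key: String, value: String)
    case or([FilterSketch])

    private enum CodingKeys: String, CodingKey { case type, key, value, filters }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case let .eq(key, value):
            try container.encode("eq", forKey: .type)
            try container.encode(key, forKey: .key)
            try container.encode(value, forKey: .value)
        case let .ne(key, value):
            try container.encode("ne", forKey: .type)
            try container.encode(key, forKey: .key)
            try container.encode(value, forKey: .value)
        case let .or(filters):
            // Compound filters nest other filters under "filters".
            try container.encode("or", forKey: .type)
            try container.encode(filters, forKey: .filters)
        }
    }
}

// Sorted keys make the output stable enough to compare in tests.
func wireJSON(_ filter: FilterSketch) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(filter), as: UTF8.self)
}
```

Keeping the case names identical to the wire names, as the commit does, means the encoder needs no separate mapping table.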

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…protocol

A store-agnostic core — indexText / indexFile / sync / search / count —
as the new SemanticStore protocol, with MemoryMatch, IndexOutcome, and
SyncSummary moved out of the SQLiteVectorStore trait gate (Sendable,
public inits) so they exist in every build. Citations render
source:path when a match has no line span (0/0 = whole artifact).
SQLiteVectorStore conforms as-is; keyword/hybrid/fused/expanded search
and reranking remain its extras. The FNV-1a hash and relativePath
helpers move to shared StoreSupport (fingerprints stay byte-identical;
the existing suite proves it).
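
For reference, FNV-1a itself is small enough to sketch. The 64-bit width and the function name are assumptions; the commit does not say which variant StoreSupport uses.

```swift
// FNV-1a, 64-bit variant (width assumed): XOR each byte in, then multiply
// by the FNV prime with wrapping arithmetic.
func fnv1a64<S: Sequence>(_ bytes: S) -> UInt64 where S.Element == UInt8 {
    var hash: UInt64 = 0xcbf2_9ce4_8422_2325    // 64-bit offset basis
    for byte in bytes {
        hash ^= UInt64(byte)
        hash = hash &* 0x0000_0100_0000_01b3    // 64-bit FNV prime
    }
    return hash
}
```

Because the result depends only on the byte sequence, moving the helper between files leaves existing fingerprints unchanged as long as the constants and byte order stay the same.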

New: OpenAIVectorStore, a drop-in SemanticStore over OpenAI's hosted
vector stores. Identity mirrors the local store via path/source/hash
file attributes — re-indexing replaces, unchanged content is
hash-skipped without re-uploading, sync prunes — and search maps the
hosted results onto MemoryMatch (sources filter via attribute filter,
topN capped at the endpoint's 50). delete() and pruning also delete
the underlying File uploads so account storage doesn't leak.
rewritesQueries opts into server-side query rewriting per search;
lastSearchQueries exposes what the rewriter executed (the local
analogue remains expandedSearch + QueryExpander). count() reports
completed documents — chunking is server-side. Not trait-gated: the
hosted store needs no SQLite engine.
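
The identity rules above (replace on change, hash-skip, sync prune) amount to a comparison between local fingerprints and the remote inventory. A minimal sketch, with hypothetical names and with file processing status ignored for brevity:

```swift
// Hypothetical planning result; not the library's SyncSummary.
struct SyncPlanSketch: Equatable {
    var upload: [String] = []   // new or changed paths
    var skip: [String] = []     // same path, same hash: no re-upload
    var prune: [String] = []    // remote paths that no longer exist locally
}

// `local` and `remote` both map path -> content hash.
func planSync(local: [String: String], remote: [String: String]) -> SyncPlanSketch {
    var plan = SyncPlanSketch()
    for (path, hash) in local.sorted(by: { $0.key < $1.key }) {
        if remote[path] == hash {
            plan.skip.append(path)
        } else {
            plan.upload.append(path)
        }
    }
    plan.prune = remote.keys.filter { local[$0] == nil }.sorted()
    return plan
}
```

In the hosted store the path and hash would come from file attributes, and pruning would also delete the underlying File upload.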

LocalVectorStore and TextFragment go public — the guide always
documented them as the zero-setup store, but they were internal.

Because the target now hosts three stores, FTS5 keyword search, RRF,
query expansion, and reranking — and its old name collided with both
the Providers VectorStore wire DTO and OpenAI's product term — the
product/target is renamed VectorStore -> SemanticStore, after the
protocol that is now its center. The SQLiteVectorStore trait and the
concrete store class names are unchanged. Migration:
.product(name: "SemanticStore", …) + import SemanticStore.

Tests: 9 offline (filter/page wire coverage, match mapping, citations,
filenames, hash stability) plus a live hosted round-trip — index,
search, rewrite, incremental skip, sync prune, delete — gated on
OPENAI_API_KEY. Docs/SemanticStore.md covers the three stores, the
protocol, hosted parity notes, and query rewriting on both engines.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a9faf5867

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread Sources/SemanticStore/OpenAIVectorStore.swift Outdated
Comment thread Sources/SemanticStore/OpenAIVectorStore.swift Outdated
…d one

Review follow-ups (PR #29):

- Replacement order: upload, attach, and fully process the new document
  BEFORE removing the old one, so a transient failure leaves the prior
  content searchable instead of losing the identity. A failed or
  half-attached replacement is detached and deleted before the error
  propagates.

- Unchanged fast path: the inventory now carries each remote file's
  status, and a hash match only counts as 'unchanged' when it is a
  single, completed document (isCurrent). Previously a failed upload
  kept its hash attribute, so retrying the same content skipped
  re-indexing while the document stayed unsearchable. Stale duplicates
  from an interrupted replacement also fail the gate and are swept up
  by the next upsert.
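
The unchanged gate can be sketched as a small predicate. The types and the exact signature are hypothetical; only the rule (a single, completed document with a matching hash) is taken from the description above.

```swift
// Hypothetical inventory entry for one remote file under a given path.
enum RemoteStatusSketch { case inProgress, completed, cancelled, failed }

struct RemoteFileSketch {
    let id: String
    let hash: String
    let status: RemoteStatusSketch
}

// `files` holds every remote file sharing the document's identity (path/source).
// The content counts as unchanged only when exactly one file exists,
// it finished processing, and its hash matches.
func isCurrent(_ files: [RemoteFileSketch], hash: String) -> Bool {
    guard files.count == 1, let only = files.first else { return false }
    return only.status == .completed && only.hash == hash
}
```

A failed upload or a leftover duplicate therefore falls through to a normal re-index instead of being skipped.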

Offline test covers the gate; the live round-trip exercises the new
replacement ordering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@odrobnik odrobnik merged commit b671dae into main Jun 11, 2026
6 checks passed