bug: ChromaDB collection uses L2 distance instead of cosine — degrades semantic search quality #24

@RutgerBos

Problem

All ChromaDB collections are created without specifying a distance metric, so they default to L2 (squared Euclidean). Sentence embedding models (including the all-MiniLM-L6-v2 used here) are trained to produce vectors where cosine similarity is the meaningful distance measure. Using L2 treats vector magnitude as significant when it isn't, producing lower-quality nearest-neighbour retrieval.
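To make the magnitude effect concrete, here is a small standalone sketch that does not use ChromaDB at all. The two-dimensional vectors and the helper names (`squared_l2`, `cosine_distance`, `rank`) are made up for illustration; they only show how the two metrics can order the same candidates differently when vector norms differ.

```python
import math


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity; depends only on the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank(query, candidates, distance):
    """Return candidate names ordered from nearest to farthest."""
    return sorted(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical toy vectors: one candidate points the same way as the query
# but is three times longer, the other has unit length but a different angle.
query = [1.0, 0.0]
candidates = {
    "same_direction": [3.0, 0.0],
    "different_direction": [0.6, 0.8],
}

l2_order = rank(query, candidates, squared_l2)
cosine_order = rank(query, candidates, cosine_distance)

print("squared L2:", l2_order)
print("cosine:    ", cosine_order)
```

Under squared L2 the longer vector is pushed to second place even though it has the same direction as the query, while cosine distance ranks it first. The two orderings only diverge when norms differ: for unit-length vectors, squared L2 equals 2 × (1 − cosine similarity), so both metrics produce the same ranking.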

Where it happens

retrievers.py — both ChromaRetriever (line 59) and PersistentChromaRetriever (line 203):

self.collection = self.client.get_or_create_collection(
    name=collection_name, embedding_function=self.embedding_function
    # ← no metadata={"hnsw:space": "cosine"}
)

Fix

self.collection = self.client.get_or_create_collection(
    name=collection_name,
    embedding_function=self.embedding_function,
    metadata={"hnsw:space": "cosine"},
)

Apply the same change to all three get_or_create_collection calls in retrievers.py, not only the two line numbers listed above.

Migration note

The hnsw:space setting is locked at collection creation time and cannot be changed on an existing collection. Any existing persistent ChromaDB collection will need to be recreated (export documents → delete collection → recreate with cosine → re-import). The in-memory ChromaRetriever used by the MCP server is unaffected since it starts fresh each session.
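The export → delete → recreate → re-import steps could look roughly like the sketch below. `recreate_with_cosine` is a hypothetical helper, not something that exists in retrievers.py; it assumes `client` is a ChromaDB client (for example a persistent one) and that the whole collection fits in memory. It reuses the stored embeddings instead of re-embedding, on the assumption that the embedding model has not changed.

```python
def recreate_with_cosine(client, name, embedding_function=None):
    """Rebuild collection `name` with hnsw:space set to cosine.

    Hypothetical helper: `client` is expected to be a ChromaDB client.
    Back up the data directory first; if the re-import fails after the
    delete, the original collection is gone.
    """
    old = client.get_collection(name=name, embedding_function=embedding_function)

    # Export: ids are always returned, the other fields must be requested.
    data = old.get(include=["documents", "metadatas", "embeddings"])

    # Keep any existing collection metadata and override only the metric.
    metadata = dict(old.metadata or {})
    metadata["hnsw:space"] = "cosine"

    client.delete_collection(name=name)
    new = client.create_collection(
        name=name,
        embedding_function=embedding_function,
        metadata=metadata,
    )

    # Re-import; skipped for an empty collection because there is nothing to add.
    if len(data["ids"]) > 0:
        new.add(
            ids=data["ids"],
            documents=data["documents"],
            metadatas=data["metadatas"],
            embeddings=data["embeddings"],
        )
    return new
```

For large collections the single `get` / `add` pair would likely need to be split into batches, and records stored without metadata may need special handling depending on the ChromaDB version in use.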
