@kevin-montrose
Contributor

kevin-montrose commented Aug 14, 2025

Implement a small subset of Vector Set functionality, as a base for future full support.

This PR:

  • Uses DiskANN to do the actual vector operations
    • While not available yet, the latest DiskANN (and the diskann-garnet integration package) should be open-sourced soon
    • The expectation is, even if reviewed and approved, this PR will not be merged until diskann-garnet is available in nuget.org (and ideally the source is also available on GitHub)
  • Introduces a notion of namespaces to Tsavorite, which is used to "hide" data from other commands
  • Adds a VectorSet bit to RecordInfo to allow distinguishing Vector Sets from other keys
  • Implements a subset of VADD, VREM, VEMB, and VSIM
  • Adds two extensions to these Redis commands:
    • XB8 - a vector format that passes byte values through without conversion; it sits alongside FP32 (which is used for little-endian floats)
    • XPREQ8 - a quantization option that accepts pre-quantized vectors, meant for use with XB8
  • Recovery, replication, and migration for Vector Sets
  • Hides all this functionality behind EnableVectorSetPreview/--enable-vector-set-preview
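
To make the difference between the two vector formats concrete, here is a minimal Python sketch of how a client might build the payloads. It assumes the argument order of Redis's VADD (key, format, vector blob, element, options) and assumes XB8 takes the slot FP32 normally occupies; the helper names (`encode_fp32`, `encode_xb8`, `vadd_args`) and the element ids are hypothetical. Variable-length ids are listed below as not yet supported, so real ids may need a specific length.

```python
import struct
from typing import List, Sequence


def encode_fp32(values: Sequence[float]) -> bytes:
    """Pack floats as consecutive little-endian IEEE 754 single-precision values."""
    return struct.pack(f"<{len(values)}f", *values)


def encode_xb8(values: Sequence[int]) -> bytes:
    """Pass 8-bit components through unchanged, one byte per dimension."""
    # bytes() raises ValueError if any component is outside 0..255.
    return bytes(values)


def vadd_args(key: bytes, fmt: bytes, blob: bytes, element: bytes,
              *options: bytes) -> List[bytes]:
    """Build a VADD argument list (assumed order: key, format, blob, element, options)."""
    return [b"VADD", key, fmt, blob, element, *options]


# FP32: every dimension costs four bytes on the wire.
fp32_cmd = vadd_args(b"vs", b"FP32", encode_fp32([1.0, -2.0]), b"elem-1")

# XB8 with XPREQ8: the caller has already quantized to one byte per dimension.
xb8_cmd = vadd_args(b"vs", b"XB8", encode_xb8([0, 127, 255]), b"elem-2", b"XPREQ8")
```

The practical difference is payload size and who does the quantization: FP32 sends four bytes per dimension, while XB8 paired with XPREQ8 sends one byte per dimension that the caller quantized ahead of time.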

A more complete write-up of the design, justifications, and remaining work is in vector-set.md.


There is still a lot of work to be done on Vector Sets, but this PR is at a point where it's functional enough to be played with - and there's merit to merging it so other work (Store V2, multi-threaded replication, etc.) isn't complicated.

The "big" things (besides commands and options) that are missing are:

  • Non-XPREQ8 quantizers - the implementation work here is on the DiskANN side
  • Variable length vector ids - likewise, support is coming in DiskANN, though some Garnet work will also be needed

@badrishc
Collaborator

badrishc commented Aug 18, 2025

This all makes sense. In storage-v2, we have a LogRecord record format that already includes optional fields such as ETag and Expiration. Adding a namespace would be analogous to this, with a couple of differences:

  • We use one RecordInfo bit to indicate whether the record has a namespace field
  • If true, then the namespace byte (or a larger field if needed) exists as part of the record, roughly: <RecordInfo, key, value, etag?, expiration?, namespace?, ...>
  • Tsavorite APIs are adjusted to accept the optional namespace (in addition to key/expiration/etag).
  • All hash codes and key equalities in Tsavorite would need to incorporate the namespace in their computations.
  • Certain namespaces (e.g., namespace numbers starting with "11") would be reserved for vector-set usage. Within that sub-namespace, the vector-sets can partition bits as they want.
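
The hashing and equality points above can be illustrated with a small Python sketch. This is not Tsavorite code: `NamespacedKey`, the one-byte namespace, the BLAKE2b hash, and reading "starting with 11" as "the top two bits are set" are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NamespacedKey:
    """Hypothetical key type: equality covers both the key and the optional namespace."""
    key: bytes
    namespace: Optional[int] = None  # a single byte (0..255) when present

    def hash_code(self) -> int:
        # Fold the namespace into the hash so the same key bytes in
        # different namespaces are hashed as different keys.
        h = hashlib.blake2b(digest_size=8)
        if self.namespace is None:
            h.update(b"\x00")                     # marker: no namespace field
        else:
            h.update(bytes([1, self.namespace]))  # marker plus namespace byte
        h.update(self.key)
        return int.from_bytes(h.digest(), "little")


def is_vector_set_namespace(namespace: int) -> bool:
    """Assumed reading of the reserved range: the top two bits are 11."""
    return (namespace & 0b1100_0000) == 0b1100_0000


plain = NamespacedKey(b"foo")
hidden = NamespacedKey(b"foo", 0b1100_0001)
```

Here `plain` and `hidden` share the same key bytes but compare unequal and hash differently, which is the property that lets namespaced records sit in the same store without colliding with ordinary keys.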

TBD: how to handle larger namespace names in the same framework (e.g., "/user/foo/"), and whether that is useful or necessary to make a first-class citizen versus users incorporating the namespace directly into the actual key.

@prvyk
Contributor

prvyk commented Aug 26, 2025

> Adding a namespace would be analogous to this, with a couple of differences:
>
> ...
>
> TBD: how to handle larger namespace names in the same framework (e.g., "/user/foo/"), and whether that is useful or necessary to make a first-class citizen versus users incorporating the namespace directly into the actual key.

Namespaces can be used for a lot of things, like replacing the current numbered database implementation with something that is supported cluster-wide, as in Valkey (in that case, I guess they would end up in the same AOF file?). Also, quite a few RESP-accepting databases have a namespace implementation of some sort, e.g. Kvrocks. I had an idea of combining ACLs and database numbers (IMHO, this would usually be better than Redis's prefix ACLs, because it doesn't require the client to cooperate), and namespaces would be just as natural here.

@vazois
Collaborator

vazois commented Sep 11, 2025

> Once Main & Object stores are merged, we should rework this like so:
>
>   • Use a bit in RecordInfo to indicate "has namespace"
>   • Map Vector Set contexts to namespaces, using a similar byte sequence idea to group Vector Sets into their own class of namespaces
>     • Something akin to "all namespaces that start 0b10xx_xxxx are Vector Sets"
>   • All elements of a Vector Set go into the same namespace
>     • In practice this will be multiple namespaces per Vector Set since each one needs at least quantized vectors and neighbor lists

This sounds more like a data type than a namespace. I propose treating it as such, with the option to cover Bitmaps and HyperLogLog the same way. In fact, we might need to think about the possibility of adding other specially encoded complex data types as bulk strings.
Namespaces can exist separately.

@kevin-montrose
Contributor Author

> > Once Main & Object stores are merged, we should rework this like so:
> >
> >   • Use a bit in RecordInfo to indicate "has namespace"
> >   • Map Vector Set contexts to namespaces, using a similar byte sequence idea to group Vector Sets into their own class of namespaces
> >     • Something akin to "all namespaces that start 0b10xx_xxxx are Vector Sets"
> >   • All elements of a Vector Set go into the same namespace
> >     • In practice this will be multiple namespaces per Vector Set since each one needs at least quantized vectors and neighbor lists
>
> This sounds more like a data type than a namespace. I propose treating it as such, with the option to cover Bitmaps and HyperLogLog the same way. In fact, we might need to think about the possibility of adding other specially encoded complex data types as bulk strings. Namespaces can exist separately.

You're right here - there are two orthogonal concerns:

  1. Vector Set top-level key, which is visible to other RESP commands (like TYPE, DEL, etc.) but needs to be distinguishable from other non-Vector Set data
  2. Vector element data in a Vector Set, which needs to be in the main store but inaccessible to other commands

This branch currently uses a namespace byte in SpanByte for concern 2, and a bit in RecordInfo for concern 1.
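
As a rough illustration of how the two mechanisms separate those concerns, here is a toy in-memory model in Python. It is only a sketch: `ToyStore`, `DEFAULT_NS`, the namespace value `0x80`, and the type strings are hypothetical, and a real server would typically return a wrong-type error where this model returns `None`.

```python
from typing import Dict, Optional, Tuple

DEFAULT_NS = 0  # hypothetical namespace that ordinary commands operate in


class ToyStore:
    """Toy model: a per-record type flag (concern 1) plus a namespace byte (concern 2)."""

    def __init__(self) -> None:
        # (namespace, key) -> (is_vector_set flag, value)
        self._records: Dict[Tuple[int, bytes], Tuple[bool, bytes]] = {}

    def set(self, key: bytes, value: bytes) -> None:
        self._records[(DEFAULT_NS, key)] = (False, value)

    def create_vector_set(self, key: bytes, element_ns: int) -> None:
        # The top-level key lives in the default namespace, flagged as a Vector Set.
        self._records[(DEFAULT_NS, key)] = (True, bytes([element_ns]))

    def put_element(self, element_ns: int, element_id: bytes, data: bytes) -> None:
        # Element data shares the store but is keyed under its own namespace.
        self._records[(element_ns, element_id)] = (False, data)

    def read_element(self, element_ns: int, element_id: bytes) -> Optional[bytes]:
        rec = self._records.get((element_ns, element_id))
        return None if rec is None else rec[1]

    def type_of(self, key: bytes) -> Optional[str]:
        rec = self._records.get((DEFAULT_NS, key))
        if rec is None:
            return None
        return "vectorset" if rec[0] else "string"

    def get(self, key: bytes) -> Optional[bytes]:
        # Ordinary reads only look in the default namespace and skip flagged records.
        rec = self._records.get((DEFAULT_NS, key))
        if rec is None or rec[0]:
            return None
        return rec[1]


store = ToyStore()
store.create_vector_set(b"vs", element_ns=0x80)
store.put_element(0x80, b"e1", b"\x01\x02")
```

In this model `store.type_of(b"vs")` sees the top-level key and reports it as a Vector Set, while `store.get(b"e1")` finds nothing because the element record is only reachable through its namespace.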

@TalZaccai TalZaccai added this to the Garnet-v2 milestone Dec 20, 2025

7 participants