Author: Varshith
Team: Valori
Contact: varshith.gudur17@gmail.com
A high-performance vector database library for Python that provides efficient storage, indexing, and search capabilities for high-dimensional vectors.
Important: Valori is in early development, so bugs and breaking changes are expected. Please use the issues page to report bugs or request features.
- High Performance: Optimized for speed with multiple indexing algorithms
- Document Parsing: Support for PDF, Office, text, and advanced parsing with Docling
- Processing Pipeline: Complete document processing with cleaning, chunking, and embedding
- Multiple Storage Backends: Memory, disk, and hybrid storage options
- Advanced Indexing: Flat, HNSW, and IVF indices for different use cases
- Vector Quantization: Scalar and product quantization for memory efficiency
- Persistence: Tensor-based and incremental persistence strategies
- Production Ready: Comprehensive logging, monitoring, and error handling
- Python Native: Pure Python implementation with NumPy integration
- Extensible: Plugin architecture for custom components
Install Valori using pip:

```bash
pip install valori
```

Or install from source:

```bash
git clone https://github.com/varshith-Git/valori.git
cd valori
pip install -e .
```

```python
import numpy as np

from valori import VectorDBClient
from valori.storage import MemoryStorage
from valori.indices import FlatIndex
from valori.processors import ProcessingPipeline

# Create components
storage = MemoryStorage({})
index = FlatIndex({"metric": "cosine"})

# Create client
client = VectorDBClient(storage, index)
client.initialize()

# Process documents
pipeline_config = {
    "parsers": {"text": {"chunk_size": 1000}},
    "processors": {
        "cleaning": {"normalize_whitespace": True},
        "chunking": {"strategy": "semantic"},
        "embedding": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"}
    }
}
pipeline = ProcessingPipeline(pipeline_config)
pipeline.initialize()

# Process a document
result = pipeline.process_document("document.pdf")
embedding = np.array(result["embedding"]).reshape(1, -1)

# Store in vector database
inserted_ids = client.insert(embedding, [result["metadata"]])

# Search for similar documents
query_text = "machine learning"
query_result = pipeline.process_text(query_text)
query_embedding = np.array(query_result["embedding"])
results = client.search(query_embedding, k=5)
for i, result in enumerate(results):
    print(f"{i + 1}. Document: {result['metadata']['file_name']}")

# Clean up
client.close()
pipeline.close()
```

Memory Storage: Fast but not persistent
```python
from valori.storage import MemoryStorage

storage = MemoryStorage({})
```

Disk Storage: Persistent but slower

```python
from valori.storage import DiskStorage

storage = DiskStorage({"data_dir": "./my_vectordb"})
```

Hybrid Storage: Combines memory and disk for optimal performance

```python
from valori.storage import HybridStorage

storage = HybridStorage({
    "memory": {},
    "disk": {"data_dir": "./my_vectordb"},
    "memory_limit": 10000
})
```
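The memory_limit option above caps how many vectors the hybrid backend keeps in RAM before overflowing to disk. A minimal sketch of that idea (a toy two-tier store for illustration only; the class and method names below are made up and are not Valori's actual implementation):

```python
from collections import OrderedDict

class TinyHybridStore:
    """Toy two-tier store: hot entries in memory, overflow in a 'disk' tier.

    Illustrative only -- Valori's HybridStorage takes a memory_limit,
    but its actual eviction policy is internal to the library.
    """

    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.memory = OrderedDict()  # hot tier (insertion-ordered)
        self.disk = {}               # cold tier (stands in for disk files)

    def put(self, key, vector):
        self.memory[key] = vector
        self.memory.move_to_end(key)
        # Spill the oldest entries once the hot tier exceeds the limit
        while len(self.memory) > self.memory_limit:
            old_key, old_vec = self.memory.popitem(last=False)
            self.disk[old_key] = old_vec

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        return self.disk.get(key)  # slower path in a real hybrid store

store = TinyHybridStore(memory_limit=2)
for i in range(4):
    store.put(i, [float(i)])
# Keys 0 and 1 were spilled to the cold tier; 2 and 3 remain hot.
```

The same trade-off applies to the real backend: reads that hit the memory tier are fast, while spilled vectors pay disk latency.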
Flat Index: Exhaustive search, accurate but slower for large datasets

```python
from valori.indices import FlatIndex

index = FlatIndex({"metric": "cosine"})  # or "euclidean"
```

HNSW Index: Fast approximate search for large datasets

```python
from valori.indices import HNSWIndex

index = HNSWIndex({
    "metric": "cosine",
    "m": 16,
    "ef_construction": 200,
    "ef_search": 50
})
```

IVF Index: Clustering-based index for large datasets

```python
from valori.indices import IVFIndex

index = IVFIndex({
    "metric": "cosine",
    "n_clusters": 100,
    "n_probes": 10
})
```

LSH Index: Locality-sensitive hashing for high-dimensional data

```python
from valori.indices import LSHIndex

index = LSHIndex({
    "metric": "cosine",
    "num_hash_tables": 10,
    "hash_size": 16,
    "num_projections": 64,
    "threshold": 0.3
})
```

Annoy Index: Approximate nearest neighbors with random projection trees

```python
from valori.indices import AnnoyIndex

index = AnnoyIndex({
    "metric": "angular",
    "num_trees": 10,
    "search_k": -1
})

# Add vectors, then build
index.add(vectors, metadata)
index.build()  # Required for Annoy
```
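The flat index is simply exhaustive search: score the query against every stored vector and keep the best k, which is exactly why HNSW, IVF, LSH, and Annoy exist as approximate alternatives for large datasets. A NumPy sketch of the exhaustive approach (illustrative only, not Valori's internals):

```python
import numpy as np

def flat_cosine_search(vectors, query, k=5):
    """Exhaustive cosine search: O(n * d) per query, exact results."""
    # Normalize rows so a dot product equals cosine similarity
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query)
    scores = vn @ qn
    top = np.argsort(-scores)[:k]  # indices of the best matches
    return list(zip(top.tolist(), scores[top].tolist()))

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 64)).astype(np.float32)
hits = flat_cosine_search(data, data[42], k=3)
# The best match for a stored vector is the vector itself (similarity ~1.0)
```

Every query touches all n vectors, so latency grows linearly with dataset size; the approximate indices above trade a little recall for sublinear search.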
Parse various document formats:

Text and PDF Parsing:

```python
from valori.parsers import TextParser, PDFParser

# Parse text files
text_parser = TextParser({"encoding": "auto", "chunk_size": 1000})
result = text_parser.parse("document.txt")

# Parse PDF files
pdf_parser = PDFParser({"extract_tables": True, "chunk_size": 1000})
result = pdf_parser.parse("document.pdf")
```

Advanced Parsing with Docling:

```python
from valori.parsers import DoclingParser

# Docling for advanced document parsing
docling_parser = DoclingParser({"extract_tables": True, "preserve_layout": True})
```

Complete Processing Pipeline:
```python
from valori.processors import ProcessingPipeline

pipeline_config = {
    "parsers": {"text": {"chunk_size": 1000}},
    "processors": {
        "cleaning": {"normalize_whitespace": True, "remove_html": True},
        "chunking": {"strategy": "semantic", "chunk_size": 1000},
        "embedding": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"}
    }
}
pipeline = ProcessingPipeline(pipeline_config)
pipeline.initialize()

# Process document end-to-end
result = pipeline.process_document("document.pdf")
```
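Chunking splits each parsed document into pieces small enough to embed. Valori's "semantic" strategy is internal to the library, but a fixed-size character chunker with overlap shows the basic mechanics (the function below is a hypothetical illustration, not part of the Valori API):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size chunks with some overlap between them.

    Simplified sketch -- a "semantic" strategy splits on meaning
    (sentence and topic boundaries), not raw character counts.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # overlap keeps context across boundaries
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap=100)
# 2500 chars -> chunks starting at 0, 900, and 1800 (the last one is shorter)
```

Overlap matters for retrieval quality: a sentence cut at a hard boundary would otherwise be unmatchable by a query that spans the cut.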
Reduce memory usage with vector quantization:

Scalar Quantization:

```python
from valori.quantization import ScalarQuantizer

quantizer = ScalarQuantizer({"bits": 8})
```

Product Quantization:

```python
from valori.quantization import ProductQuantizer

quantizer = ProductQuantizer({"m": 8, "k": 256})
```

SAQ Quantization:

```python
from valori.quantization import SAQQuantizer

quantizer = SAQQuantizer({
    "total_bits": 128,
    "n_segments": 8,
    "adjustment_iters": 3,
    "rescore_top_k": 50
})
```
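To see what scalar quantization does to the data, here is a round-trip sketch: each float32 component is mapped to an 8-bit code and back, trading a small reconstruction error for a 4x memory saving (illustrative NumPy code, not Valori's implementation):

```python
import numpy as np

def scalar_quantize(vectors, bits=8):
    """Map float32 components to unsigned integer codes, per dimension."""
    levels = 2 ** bits - 1
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    # Guard against constant dimensions (hi == lo would divide by zero)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def scalar_dequantize(codes, lo, scale):
    """Approximate reconstruction of the original float32 vectors."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
vecs = rng.random((100, 32)).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)
approx = scalar_dequantize(codes, lo, scale)
# uint8 codes take 1 byte per component instead of 4, with small error
```

The worst-case per-component error is half a quantization step, which is why 8-bit scalar quantization usually has little effect on search quality.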
A complete setup combining hybrid storage, HNSW indexing, product quantization, and tensor persistence:

```python
from valori import VectorDBClient
from valori.storage import HybridStorage
from valori.indices import HNSWIndex
from valori.quantization import ProductQuantizer
from valori.persistence import TensorPersistence

# Create all components
storage = HybridStorage({
    "memory": {},
    "disk": {"data_dir": "./vectordb_data"},
    "memory_limit": 10000
})
index = HNSWIndex({
    "metric": "cosine",
    "m": 32,
    "ef_construction": 400,
    "ef_search": 100
})
quantizer = ProductQuantizer({
    "m": 16,
    "k": 256
})
# Alternatively, use SAQ quantization:
# from valori.quantization import SAQQuantizer
# quantizer = SAQQuantizer({
#     "total_bits": 128,
#     "n_segments": 8,
#     "adjustment_iters": 3,
#     "rescore_top_k": 50
# })
persistence = TensorPersistence({
    "data_dir": "./vectordb_persistence",
    "compression": True
})

# Create client
client = VectorDBClient(storage, index, quantizer, persistence)
client.initialize()

# Your vector operations here...
client.close()
```

For production deployments, enable logging and load settings from a configuration file:

```python
import json

from valori import VectorDBClient
from valori.utils.logging import setup_logging

# Setup logging
setup_logging({
    "level": "INFO",
    "log_to_file": True,
    "log_file": "Valori.log"
})

# Load configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Initialize with production config
client = VectorDBClient.from_config(config)
client.initialize()

# Your production code here...
client.close()
```

Check out the examples/ directory for comprehensive examples:
- basic_usage.py - Basic operations and concepts
- document_processing.py - Complete document parsing and processing workflow
- advanced_indexing.py - LSH and Annoy indexing algorithms comparison
- advanced_quantization.py - Quantization techniques and performance
- production_setup.py - Production deployment and monitoring
Full documentation is included in the docs/ folder of this repository. Key entry points:

- Getting started (tutorial): docs/getting_started.rst
- Quickstart guide: docs/quickstart.rst
- API reference: docs/api.rst

If a documentation site is published for this project, it will be linked from the project landing page. To build the docs locally:

```bash
cd docs
make html
# Output will be in docs/_build/html
```

You can also open the source .rst files directly in the repo if you prefer to read them without building HTML.
```bash
# Clone the repository
git clone https://github.com/varshith-Git/valori.git
cd valori

# Setup development environment
bash scripts/install_dev.sh

# Activate virtual environment
source venv/bin/activate
```

```bash
# Run all tests
bash scripts/run_tests.sh

# Run with coverage
bash scripts/run_tests.sh --coverage

# Run specific tests
bash scripts/run_tests.sh tests/test_storage.py
```

```bash
# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

# Security checks
safety check
bandit -r src/
```

```bash
# Build documentation
cd docs
make html
```

```bash
# Run benchmarks
python scripts/benchmark.py

# Quick benchmarks
python scripts/benchmark.py --quick
```

Valori is designed for high performance:
- Memory Efficiency: Up to 75% memory reduction with quantization
- Search Speed: Sub-millisecond search times for small datasets
- Scalability: Handles millions of vectors with appropriate indexing
- Flexibility: Choose the right components for your use case
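The memory figure is straightforward arithmetic: 8-bit scalar codes use 1 byte per component instead of float32's 4 bytes, a 75% reduction (the helper below is illustrative; 384 is the dimensionality of all-MiniLM-L6-v2 embeddings):

```python
def vector_memory_bytes(n_vectors, dim, bytes_per_component):
    """Raw storage cost of a dense vector collection, ignoring metadata."""
    return n_vectors * dim * bytes_per_component

n, d = 1_000_000, 384  # e.g. one million all-MiniLM-L6-v2 embeddings
float32_mb = vector_memory_bytes(n, d, 4) / 1e6
int8_mb = vector_memory_bytes(n, d, 1) / 1e6
reduction = 1 - int8_mb / float32_mb
# 1M float32 vectors: 1536 MB; 8-bit scalar codes: 384 MB -> 75% smaller
# Product quantization compresses further (e.g. m=8 -> 8 bytes per vector)
```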
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation
- Issue Tracker
- Discussions
- Email Support
- GPU acceleration support
- Distributed deployment
- More indexing algorithms
- REST API server
- Web UI for database management
- Integration with popular ML frameworks
If you use Valori in your research, please cite:

```bibtex
@software{valori2025,
  title={Valori: A High-Performance Vector Database for Python},
  author={Varshith},
  year={2025},
  url={https://github.com/varshith-Git/valori}
}
```

Made with ❤️ by the Valori Team