BacTaxID is a universal, genome-based bacterial typing system that overcomes limitations of traditional species-specific frameworks. Developed in Rust, it uses Binwise Densified MinHash (BinDash) combined with ntHash for rapid k-mer sketching, achieving O(log S) complexity while maintaining strict proportionality to Average Nucleotide Identity (ANI).
Applied to 2.3 million genomes across 67 genera, BacTaxID demonstrates:
- Universal concordance with species/sub-species classifications
- Epidemiologically relevant resolution (L₃ ≈ 99% ANI ≈ MLST)
- Sub-clonal outbreak detection (L₅ ≈ 99.99% ANI ≈ 5-27 SNPs/Mb)
- Scalability to millions of genomes with hierarchical search complexity
- BinDash Algorithm: Direct binning eliminates heap operations (O(log S) vs O(S log S))
- ntHash Streaming: Rolling-hash k-mer decomposition without materialization
- Parallel Processing: Rayon-powered multi-threaded distance computations
- Signature-based Indexing: xxHash64 u64 signatures for unique genome identification
- UBIGINT[] Sketch Storage: Native SQL arrays to universal usage
- Automatic Duplicate Detection:
duplicatestable links identical sketches - DuckDB Backend: Self-contained portable database with SQL/Python/R/CLI APIs
- Pseudo-Clique Clustering: Prevents chaining artifacts via click_threshold criterion
- Hierarchical Resolution: L₀ (96% ANI) → L₅ (99.99% ANI) configurable levels
- Classifier/Satellite States: Robust reference management avoiding chain effects
- Monotonicity Constraints: Ensures hierarchical consistency across levels
- Maximal Clique Detection: PetGraph-based complete subgraph identification
- ANI-based Thresholds: Direct quantitative link to Average Nucleotide Identity
- Reference Community Detection: Size-based optimization (reference_size parameter)
- Dynamic Scheme Evolution: Incremental updates without full re-clustering
-
Sketching Engine (
src/sketch/sketching.rs)- BinDash sketch generation: partition hash space into bins, retain minimum per bin
- Jaccard similarity → ANI conversion:
ANI = 1 - (-(1/k)*ln(2J/(1+J))) - xxHash64 signature generation for unique genome identification
-
Database Layer (
src/db/db.rs)sketchestable: signature (UBIGINT PRIMARY KEY), UBIGINT[] sketch arrayduplicatestable: maps duplicate samples to canonical signaturescodetable: hierarchical classifications with L_i_int, L_i_full, L_i_state columns
-
Graph Analysis (
src/graph/graph.rs)- Maximal clique detection using PetGraph's
maximal_cliquesalgorithm - Edge filtering by ANI threshold and parent-level group consistency
- Strategic edge pruning post-clique formation for next-level searches
- Maximal clique detection using PetGraph's
-
Command Interface (
src/commands/)init: Initialize database with TOML configurationupdate: Add reference genomes, detect duplicates, build hierarchyclassify: Read-only classification of query genomesdistance: Pairwise ANI distance matrix generation
genus = "Escherichia"
acronym = "EC"
levels = "[0.96, 0.98, 0.99, 0.995, 0.999, 0.9999]" # ANI thresholds
kmer_size = 31
sketch_size = 4000
click_size = 5 # Minimum clique size for new cluster
click_threshold = 0.8 # Fraction of cluster members query must match
reference_size = 100 # Maximum classifiers per clusterInitialize database with taxonomic levels and metadata.
bactaxid init --config config.toml --db bacteria.dbbactaxid update --db bacteria.db --files references.txt [--cpus 8] [--debug]- Processes FASTA files sequentially, generates sketches, computes signatures
- Detects duplicates: identical signatures →
duplicatestable entry - Hierarchical classification: best-hit search → clique detection if no assignment
- Classifier/Satellite assignment: based on click_threshold and reference_size
- Debug mode: saves all pairwise distances to
debugtable
bactaxid classify --db bacteria.db --queries queries.txt --output results.tsv [--verbose]- Read-only classification against pre-built reference scheme
- Hierarchical search: compares query only to classifiers sharing parent-level code
- Early stopping: terminates at first level without valid best-hit
- Output: TSV with query_id, signature, best_hit, similarity, levels_reached, taxonomic codes
bactaxid distance --db bacteria.db --ids samples.txt --output distances.phylip \
--format phylip [--threshold 0.95] [--threads 8]- Pairwise ANI distances: computes all-vs-all similarity matrix
- Formats:
tsv/csv: Long format (sample1, sample2, ANI, distance)phylip: Square matrix for phylogenetic analysis
- Duplicate handling: transparently resolves samples to canonical signatures
- Parallel computation: Rayon-powered multi-threaded distance calculations
- Threshold filtering (TSV/CSV only): reports only pairs ≥ threshold
- Hash Space Partitioning: Divide 2⁶⁴ hash space into
sketch_sizebins - Streaming k-mer Processing: ntHash generates canonical k-mer hashes on-the-fly
- Bin Assignment:
bin_index = hash / bin_size, retain minimum hash per bin - Sketch Comparison: Count matching bin values → Jaccard estimate
J = matches / S - ANI Conversion:
ANI = 1 - (-(1/k)*ln(2J/(1+J))), accurate for ANI ≥ 85%
Advantages:
- O(1) per k-mer (vs O(log S) for bottom-k MinHash heap operations)
- Memory-efficient: Fixed-size vector regardless of genome length
- ANI proportionality: Direct quantitative link to evolutionary distance
For level i:
1. Retrieve classifiers C_i sharing parent code (level i-1)
2. Compute distances: query → all c ∈ C_i
3. If best_hit ∈ C_i satisfies threshold AND
click_criterion(query, C_i, best_hit.code):
→ Assign query to best_hit.code
→ Promote to Classifier if reference_size not exceeded
Click Criterion (prevents chaining):
def click_criterion(query, classifiers, code):
cluster = [c for c in classifiers if c.code == code]
matches = sum(1 for c in cluster if distance(query, c) >= threshold)
return matches / len(cluster) >= click_thresholdIf no valid best-hit:
1. Retrieve ALL genomes G sharing parent code (classifiers + satellites)
2. Construct distance graph: nodes = G, edges = pairs meeting threshold
3. Find maximal cliques using PetGraph::maximal_cliques()
4. If clique size >= click_size:
→ Create new code (next integer for parent group)
→ Assign ALL clique members as Classifiers
→ Assign non-clique graph members as Satellites
→ Prune edges for next-level search
Monotonicity Constraint: code_full[i] must start with code_full[i-1]
- Unique Identification: Each sketch assigned 64-bit signature based on content hash
- Duplicate Detection: Identical sketches share signature →
duplicatestable - Primary Key:
sketches.signaturereplacessketches.samplefor scalability - Metadata Tracking:
duplicatesmaps sample names → canonical signatures
| Feature | Description | Benefit |
|---|---|---|
| Signature Architecture | xxHash64 u64 identifiers, duplicates table |
Efficient duplicate handling, robust indexing |
| UBIGINT[] Storage | Native SQL sketch arrays | Direct SQL queries, no deserialization |
classify Command |
Read-only hierarchical classification | Fast prospective typing, no DB modifications |
distance Command |
Pairwise ANI matrices (TSV/CSV/PHYLIP) | Phylogenetic analysis, outbreak investigation |
| Parallel Distance | Rayon multi-threading | Minutes for millions of comparisons |
| English Documentation | Full code comments translation | Improved accessibility |
# 1. Clone and build
git clone https://github.com/irycisBioinfo/BacTaxID
cd BacTaxID
cargo build --release
# 2. Initialize database
./target/release/bactaxid init --config config.toml --db escherichia.db
# 3. Build reference scheme
./target/release/bactaxid update --db escherichia.db --files references.txt --cpus 8
# 4. Classify new samples
./target/release/bactaxid classify --db escherichia.db --queries queries.txt \
--output classifications.tsv --verbose
# 5. Generate distance matrix
./target/release/bactaxid distance --db escherichia.db --ids samples.txt \
--output distances.phylip --format phylip --threads 8- Sketch generation: ~1-2s per genome (5 Mb, k=31, S=4000)
- Classification: O(log N) per level due to hierarchical search
- All-vs-all distances: O(N²S) parallelized with Rayon
- Memory: ~2GB minimum, scales with database size
- Database size: ~50MB per 1000 genomes (k=31, S=4000)
- Species concordance: L₀-L₁ (96-98% ANI) ≈ 95% ANI species threshold
- MLST equivalence: L₃ (99% ANI) universal peak across 11 genera
- Outbreak detection: L₅ (99.99% ANI) → 5.92-27 SNPs/Mb (genus-dependent)
- Completeness: 98% at L₀, variable at L₅ (population structure-dependent)
- Rust: 2021 edition or later
- System Memory: 2GB minimum (scales with dataset)
- Storage: SSD recommended for I/O performance
- CPU: Multi-core beneficial for parallel operations
git clone https://github.com/irycisBioinfo/BacTaxID
cd BacTaxID
cargo build --releasePre-compiled binaries and 67 pre-computed genus schemes available at:
- www.bactaxid.org (interactive tools, KronaPlots)
- zenodo.org/doi/10.5281/zenodo.17226627 (classification data)
If you use BacTaxID in your research, please cite:
Díez Fernández de Bobadilla M, Fernández Lanza V (2025). BacTaxID: A universal framework for standardized bacterial classification. [Manuscript in preparation]
GPL-3.0 • Maintainers: Val F. Lanza, Miguel Diez Fernandez de Bobadilla
Issues: GitHub Issues
BacTaxID - Empowering bacterial taxonomy identification through advanced computational methods, hierarchical resolution, and universal standardization.
