GitHub - MuteJester/LZGraphs: LZ76 Graphs and Applications in Immunology

LZGraphs

LZ76 Graphs and Applications in Immunology

🧬 New to LZGraphs? Head over to the full documentation and tutorials for comprehensive guides, API reference, and worked examples covering every feature of the library.

About The Project

LZGraphs 🧬 is a Python library for immune receptor repertoire analysis based on the Lempel-Ziv 76 (LZ-76) compression algorithm. It builds directed graph models from TCR and BCR CDR3 sequences, capturing the sequential structure of repertoires without relying on alignment.

The methodology is presented in the research paper "A Novel Approach to T-Cell Receptor Beta Chain (TCRB) Repertoire Encoding Using Lossless String Compression".

Background

The diversity of T-cells and B-cells is crucial for producing receptors that recognize the wide range of pathogens encountered throughout life. V(D)J recombination generates this diversity through a stochastic process, making repertoire analysis challenging. LZGraphs addresses this by decomposing sequences into LZ-76 subpatterns and encoding them as graph transitions, providing a compact, information-rich representation of an entire repertoire.
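
To make the decomposition step concrete, here is a minimal sketch. It assumes a simplified dictionary-style Lempel-Ziv parse in which each new subpattern is the shortest prefix of the remaining sequence that has not yet been emitted as a subpattern. The names `lz_decompose` and `transitions` are hypothetical helpers written for this illustration, not part of the LZGraphs API, and the library's exact parsing rules may differ.

```python
def lz_decompose(sequence):
    """Split a sequence into subpatterns, each the shortest prefix of the
    remaining text that has not yet been emitted as a subpattern."""
    phrases, seen = [], set()
    start, length = 0, 1
    while start + length <= len(sequence):
        candidate = sequence[start:start + length]
        at_end = start + length == len(sequence)
        if candidate in seen and not at_end:
            length += 1  # already seen: extend the candidate by one character
        else:
            phrases.append(candidate)  # new subpattern (or the final remainder)
            seen.add(candidate)
            start += length
            length = 1
    return phrases


def transitions(phrases):
    """Pair consecutive subpatterns; each pair is one directed graph edge."""
    return list(zip(phrases, phrases[1:]))


phrases = lz_decompose("CASSLAPGATNEKLFF")
print(phrases)
print(transitions(phrases)[:3])
```

Counting how often each pair occurs across a whole repertoire is what turns these per-sequence walks into a weighted directed graph.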

Key Features

Alignment-free analysis -- no error-prone sequence alignment required
Generation probability inference -- compute P(sequence) under the learned graph model
Sequence simulation -- generate realistic synthetic sequences from the graph
Diversity estimation -- LZ-based diversity indices (K-diversity family)
Information-theoretic metrics -- entropy, perplexity, Jensen-Shannon divergence, mutual information, and more
Repertoire comparison -- compare two repertoires via graph-level statistics
Analytical probability distributions -- exact moments and scipy-like distribution objects for generation probabilities
Gene annotation support -- optional V/J gene tracking on edges for gene usage analysis
Abundance weighting -- weight sequences by clonal abundance for more realistic models
Serialization -- save and load graphs in JSON format

Installation

Install from PyPI:

pip install LZGraphs

LZGraphs requires Python 3.9 or later. To verify the installation:

import LZGraphs
print(LZGraphs.__version__)

Quick Start

Build an amino acid positional graph from CDR3 sequences and compute sequence probabilities:

import pandas as pd
from LZGraphs import AAPLZGraph

# Prepare data as a DataFrame with a 'cdr3_amino_acid' column
data = pd.DataFrame({
    'cdr3_amino_acid': [
        'CASSLAPGATNEKLFF',
        'CASSLGQAYEQYF',
        'CASSFSTCSANYGYTF',
        'CASSQEGTEAFF',
        'CASSLGQGNIQYF',
        # ... your CDR3 amino acid sequences
    ]
})

# Construct the graph
graph = AAPLZGraph(data, verbose=True)

# Compute the log-probability of a sequence under the model
log_prob = graph.walk_log_probability('CASSLAPGATNEKLFF')
print(f"Log P(seq): {log_prob:.4f}")

# Simulate 100 new sequences from the graph
generated = graph.simulate(100, seed=42)
print(f"Generated {len(generated)} sequences")

# Access graph properties
print(f"Nodes: {graph.num_subpatterns}, Edges: {graph.num_transitions}")
print(f"Length distribution: {graph.length_probabilities}")

Graph Types

LZGraphs provides three graph variants, each suited to different sequence types and analysis goals:

AAPLZGraph -- Amino Acid Positional

Best for CDR3 amino acid sequences. Each LZ-76 subpattern is annotated with its position in the sequence, creating a directed acyclic graph (DAG). This enables exact analytical computations including lzpgen_moments() and lzpgen_analytical_distribution().

from LZGraphs import AAPLZGraph

graph = AAPLZGraph(data, verbose=True)  # data has 'cdr3_amino_acid' column
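
Why positional annotation yields a DAG can be sketched in a few lines. The sketch assumes each node is labeled with its subpattern plus the position at which that subpattern ends in the sequence. The label format (`SL_5`) and the helper name `positional_nodes` are illustrative assumptions, not confirmed library internals.

```python
def positional_nodes(phrases):
    """Attach to each subpattern the 1-based position where it ends."""
    nodes, position = [], 0
    for phrase in phrases:
        position += len(phrase)
        nodes.append((phrase, position))
    return nodes


# Subpatterns of a short CDR3 prefix, already decomposed for illustration.
phrases = ["C", "A", "S", "SL", "AP", "G"]
nodes = positional_nodes(phrases)
labels = [f"{phrase}_{pos}" for phrase, pos in nodes]
print(labels)  # ['C_1', 'A_2', 'S_3', 'SL_5', 'AP_7', 'G_8']

# Every edge goes from a lower position to a strictly higher one,
# so no walk can return to a node it has already visited.
edges = list(zip(nodes, nodes[1:]))
assert all(src[1] < dst[1] for src, dst in edges)
```

Because positions only increase along a walk, the same argument holds for any number of sequences merged into one graph.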

NDPLZGraph -- Nucleotide Double Positional

Best for CDR3 nucleotide sequences where reading frame matters. Encodes both the subpattern and a double positional index derived from nucleotide positions. Also a DAG, supporting exact analytical methods.

from LZGraphs import NDPLZGraph

graph = NDPLZGraph(data, verbose=True)  # data has 'cdr3_rearrangement' column

NaiveLZGraph -- Basic Nucleotide

A simpler model for nucleotide sequences that uses raw LZ-76 subpatterns without positional annotation. The resulting graph may contain cycles, so the exact analytical methods, which require a DAG, do not apply. Use the Monte Carlo method lzpgen_distribution() for this graph type.

from LZGraphs import NaiveLZGraph
from LZGraphs import generate_kmer_dictionary

cdr3_list = ['TGTGCCAGCAGC...', 'TGTGCCAGCAGT...', ...]
dictionary = generate_kmer_dictionary(cdr3_list)
graph = NaiveLZGraph(cdr3_list, dictionary, verbose=True)
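
The following sketch illustrates how cycles can arise once positions are dropped. It uses a hypothetical `has_cycle` helper (a plain depth-first search, not a LZGraphs function) and a hand-picked list of subpatterns. The index-based positions are a simplification chosen only for this example.

```python
def has_cycle(edges):
    """Return True if the directed graph given by `edges` contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    state = {node: "new" for node in graph}

    def visit(node):
        state[node] = "active"  # node is on the current DFS path
        for nxt in graph[node]:
            if state[nxt] == "active":
                return True  # back edge: we returned to the current path
            if state[nxt] == "new" and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(state[node] == "new" and visit(node) for node in graph)


phrases = ["T", "G", "T"]  # illustrative subpatterns of the sequence TGT

# Without positions, both T's collapse into one node: T -> G -> T.
naive_edges = list(zip(phrases, phrases[1:]))
print(has_cycle(naive_edges))  # True

# With (subpattern, position) nodes the two T's stay distinct.
positioned = [(p, i) for i, p in enumerate(phrases, start=1)]
positional_edges = list(zip(positioned, positioned[1:]))
print(has_cycle(positional_edges))  # False
```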

Gene Annotation

All three graph types support optional V and J gene annotation. Include V and J columns in your DataFrame (or pass them separately for NaiveLZGraph) to track gene usage on graph edges:

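import pandas as pd
from LZGraphs import AAPLZGraph

# sequences, v_genes, and j_genes are equal-length lists
data = pd.DataFrame({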
    'cdr3_amino_acid': sequences,
    'V': v_genes,
    'J': j_genes,
})
graph = AAPLZGraph(data, verbose=True)

# Gene data is now available
print(graph.has_gene_data)           # True
print(graph.marginal_v_genes)        # V gene usage distribution
print(graph.marginal_j_genes)        # J gene usage distribution

Sequence Abundance Weighting

Immune repertoire datasets often include clonal abundance information -- the number of times each unique clonotype was observed. LZGraphs supports abundance-weighted graph construction, where each sequence contributes proportionally to its observed count rather than being treated as a single observation.

This is particularly important for:

More accurate probability estimates -- highly expanded clones exert greater influence on transition probabilities, reflecting the observed clone-size distribution of the repertoire
Better representation of clonal expansion -- dominant clones shape the graph structure proportionally to their prevalence
More realistic sequence generation -- simulated sequences reflect the abundance-weighted landscape, not just the unique sequence set

To use abundance weighting, include an abundance column in your DataFrame:

data = pd.DataFrame({
    'cdr3_amino_acid': ['CASSLAPGATNEKLFF', 'CASSLGQAYEQYF', 'CASSFSTCSANYGYTF'],
    'abundance': [150, 42, 7],
})

# Each sequence is weighted by its abundance during graph construction
graph = AAPLZGraph(data, verbose=True)

For NaiveLZGraph, pass abundances as a separate parameter:

graph = NaiveLZGraph(
    cdr3_list,
    dictionary,
    verbose=True,
    abundances=[150, 42, 7, ...],
)

When no abundance information is provided, every sequence is implicitly weighted as 1.
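
As a rough illustration of what abundance weighting changes, the sketch below estimates transition probabilities from a handful of walks, adding each walk's abundance (rather than 1) to every edge it traverses. The function `transition_probabilities` and the toy walks are assumptions made for this example. The sketch shows the general idea, not the library's implementation.

```python
from collections import defaultdict


def transition_probabilities(walks, abundances=None):
    """Estimate edge probabilities from walks, each weighted by abundance."""
    if abundances is None:
        abundances = [1] * len(walks)  # unweighted: every sequence counts once
    counts = defaultdict(lambda: defaultdict(float))
    for walk, weight in zip(walks, abundances):
        for src, dst in zip(walk, walk[1:]):
            counts[src][dst] += weight
    # Normalize outgoing counts of each node so they sum to 1.
    return {
        src: {dst: c / sum(out.values()) for dst, c in out.items()}
        for src, out in counts.items()
    }


walks = [["C", "A", "S"], ["C", "S", "A"]]
unweighted = transition_probabilities(walks)
weighted = transition_probabilities(walks, [150, 50])
print(unweighted["C"])  # {'A': 0.5, 'S': 0.5}
print(weighted["C"])    # {'A': 0.75, 'S': 0.25}
```

With equal weights both continuations of `C` are equally likely. With abundances 150 and 50 the expanded clone dominates the estimate.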

Core Capabilities

Probability Inference

Compute the probability of a sequence under the learned Markov model:

prob = graph.walk_probability('CASSLAPGATNEKLFF')
log_prob = graph.walk_log_probability('CASSLAPGATNEKLFF')
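
Conceptually, the probability of a walk under a first-order Markov model is the probability of its first node multiplied by the probability of each transition it takes. The sketch below shows that calculation on a toy model. `toy_walk_log_probability`, `INITIAL`, and `TRANSITION` are hypothetical names, and the library's own computation may include additional terms (for example, terminal-state probabilities) that are not modeled here.

```python
import math

INITIAL = {"C": 1.0}
TRANSITION = {
    "C": {"A": 0.75, "S": 0.25},
    "A": {"S": 1.0},
}


def toy_walk_log_probability(walk, initial, transition):
    """Sum the log-probabilities of the first node and of every transition."""
    probs = [initial.get(walk[0], 0.0)]
    probs += [transition.get(s, {}).get(d, 0.0) for s, d in zip(walk, walk[1:])]
    if min(probs) == 0.0:
        return float("-inf")  # the walk uses a node or edge the model never saw
    return sum(math.log(p) for p in probs)


log_p = toy_walk_log_probability(["C", "A", "S"], INITIAL, TRANSITION)
print(math.exp(log_p))  # probability of the walk C -> A -> S
print(toy_walk_log_probability(["C", "A", "F"], INITIAL, TRANSITION))  # -inf
```

Working in log space avoids numerical underflow, since generation probabilities of full-length CDR3 sequences can be very small.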

Sequence Simulation

Generate new sequences by sampling random walks through the graph:

sequences = graph.simulate(1000, seed=42)
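
A random walk of this kind can be sketched with the standard library alone. The toy model, the node labels, and `simulate_walk` are assumptions for illustration. In particular, this sketch stops when it reaches a node with no outgoing edges, which may not match the stopping rule the library uses.

```python
import random

INITIAL = {"C_1": 1.0}
TRANSITION = {
    "C_1": {"A_2": 0.75, "S_2": 0.25},
    "A_2": {"S_3": 1.0},
    "S_2": {"F_3": 1.0},
}


def simulate_walk(initial, transition, rng):
    """Sample one walk: pick a start node, then follow edges in proportion
    to their probabilities until a node without outgoing edges is reached."""
    nodes, weights = zip(*initial.items())
    node = rng.choices(nodes, weights=weights)[0]
    walk = [node]
    while transition.get(node):
        targets, weights = zip(*transition[node].items())
        node = rng.choices(targets, weights=weights)[0]
        walk.append(node)
    return walk


rng = random.Random(42)  # a seeded generator makes the walks reproducible
for _ in range(3):
    walk = simulate_walk(INITIAL, TRANSITION, rng)
    # Drop the positional suffix and join the subpatterns into a sequence.
    print("".join(label.split("_")[0] for label in walk))
```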

Generation Probability Distributions

Characterize the distribution of generation probabilities across the repertoire:

# Monte Carlo empirical distribution (works on all graph types)
log_probs = graph.lzpgen_distribution(n=10000, seed=42)

# Exact analytical moments (DAG-based graph types only: AAPLZGraph, NDPLZGraph)
moments = graph.lzpgen_moments()
print(moments['mean'], moments['std'])

# Full scipy-like distribution object (DAG-based graph types only)
dist = graph.lzpgen_analytical_distribution()
print(dist.mean(), dist.std())
x = dist.ppf(0.05)  # 5th percentile
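
One way to see why exact moments are available only on acyclic graphs: on a DAG, the expected log-probability of a walk can be computed by a recursion over successors that is guaranteed to terminate. The sketch below does this for the mean on a toy model and checks it against brute-force enumeration of all walks. It is an illustration under these assumptions and is not claimed to be the algorithm behind `lzpgen_moments()`. All names in it are hypothetical.

```python
import math
from functools import lru_cache

TRANSITION = {
    "C_1": {"A_2": 0.75, "S_2": 0.25},
    "A_2": {"S_3": 1.0},
    "S_2": {"F_3": 0.5, "Y_3": 0.5},
}


@lru_cache(maxsize=None)
def expected_log_prob(node):
    """Expected log-probability of the remainder of a walk leaving `node`.
    Sink nodes contribute 0; the recursion terminates because the graph
    has no cycles."""
    return sum(
        p * (math.log(p) + expected_log_prob(nxt))
        for nxt, p in TRANSITION.get(node, {}).items()
    )


def enumerate_walks(node, log_p=0.0):
    """Yield (probability, log-probability) for every complete walk."""
    out = TRANSITION.get(node, {})
    if not out:
        yield math.exp(log_p), log_p
    for nxt, p in out.items():
        yield from enumerate_walks(nxt, log_p + math.log(p))


exact = expected_log_prob("C_1")
brute = sum(prob * log_p for prob, log_p in enumerate_walks("C_1"))
print(exact, brute)  # the two values agree up to floating-point error
```

On a graph with cycles the recursion would never bottom out, which is where sampling-based estimates become the practical option.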

Diversity Metrics

from LZGraphs import lz_centrality, k_diversity

centrality = lz_centrality(graph, 'CASSLAPGATNEKLFF')
diversity = k_diversity(sequences, graph.encode_sequence, sample_size=1000)

Information-Theoretic Analysis

from LZGraphs import (
    node_entropy, edge_entropy, graph_entropy,
    jensen_shannon_divergence, compare_repertoires,
)

print(f"Graph entropy: {graph_entropy(graph):.4f}")

# Compare two repertoires
jsd = jensen_shannon_divergence(graph1, graph2)
comparison = compare_repertoires(graph1, graph2)
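
For reference, the Jensen-Shannon divergence between two discrete distributions can be written in a few lines. This sketch operates on plain dictionaries mapping outcomes to probabilities. Which graph-level distributions the library compares is not stated here, so the inputs below are made-up examples, and `jensen_shannon` is a hypothetical helper rather than the library function.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions
    given as dicts that map outcomes to probabilities."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of outcomes.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence KL(a || M); zero-probability terms vanish.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


same = jensen_shannon({"A": 0.5, "S": 0.5}, {"A": 0.5, "S": 0.5})
disjoint = jensen_shannon({"A": 1.0}, {"S": 1.0})
print(same, disjoint)  # 0.0 1.0
```

With base-2 logarithms the value is bounded between 0 (identical distributions) and 1 (no shared outcomes), which makes it convenient for comparing repertoires of different sizes.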

Visualization

from LZGraphs import plot_graph, plot_possible_paths

plot_graph(graph)
plot_possible_paths(graph, 'CASSLAPGATNEKLFF')

Saturation Analysis

from LZGraphs import NodeEdgeSaturationProbe

probe = NodeEdgeSaturationProbe()
# Feed sequences incrementally and track node/edge saturation curves
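
The idea behind a saturation curve can be sketched independently of the probe class: add sequences one at a time and record how many distinct subpatterns (nodes) and transitions (edges) have been seen so far. The sketch assumes a simplified dictionary-style Lempel-Ziv parse. `lz_decompose` and `saturation_curve` are hypothetical helpers and do not reflect the `NodeEdgeSaturationProbe` interface.

```python
def lz_decompose(sequence):
    """Simplified parse: each subpattern is the shortest prefix of the
    remaining text not yet emitted as a subpattern."""
    phrases, seen = [], set()
    start, length = 0, 1
    while start + length <= len(sequence):
        candidate = sequence[start:start + length]
        if candidate in seen and start + length < len(sequence):
            length += 1
        else:
            phrases.append(candidate)
            seen.add(candidate)
            start, length = start + length, 1
    return phrases


def saturation_curve(sequences):
    """Return cumulative (unique nodes, unique edges) after each sequence."""
    nodes, edges, curve = set(), set(), []
    for seq in sequences:
        phrases = lz_decompose(seq)
        nodes.update(phrases)
        edges.update(zip(phrases, phrases[1:]))
        curve.append((len(nodes), len(edges)))
    return curve


repertoire = ["CASSLGQAYEQYF", "CASSLGQGNIQYF", "CASSLGQAYEQYF"]
for i, (n_nodes, n_edges) in enumerate(saturation_curve(repertoire), start=1):
    print(f"after {i} sequences: {n_nodes} nodes, {n_edges} edges")
```

A curve that flattens as more sequences are added suggests the sample already covers most of the subpatterns and transitions present in the repertoire.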

For detailed usage of every feature, see the documentation.

Contributing

Contributions are what make the open-source community a powerful place to share ideas and make progress. Any contributions you make are greatly appreciated.

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

Distributed under the MIT license. See LICENSE for more information.

Contact

Thomas Konstantinovsky - [email protected]

Project Link: https://github.com/MuteJester/LZGraphs
