Merged
30 commits
c00f47e
fix: add QTL Cascade command to CLI for enhanced analysis capabilities
enriquea Mar 28, 2026
0c704a7
fix: add eQTL and pQTL table builders to CLI for enhanced data proces…
enriquea Mar 28, 2026
e57a341
fix: add eQTL and pQTL table builders to registry for improved data p…
enriquea Mar 28, 2026
351605d
feat: implement eQTL and pQTL table builders with enhanced import fun…
enriquea Mar 28, 2026
292b3c6
feat: add QTL Cascade module for comprehensive variant analysis and r…
enriquea Mar 28, 2026
5c588d2
feat: add coloc module for colocalization analysis using Approximate …
enriquea Mar 28, 2026
6e77bd8
feat: add constants for QTL cascade analysis including thresholds and…
enriquea Mar 28, 2026
bc79713
feat: add gene-level summary aggregation for QTL cascade evidence
enriquea Mar 28, 2026
e3f6e0c
feat: implement stage-based pipeline orchestration for QTL cascade an…
enriquea Mar 28, 2026
5b5485f
feat: add CLI commands for QTL cascade analysis including cascade, co…
enriquea Mar 28, 2026
05befa3
feat: add HTML report generation for QTL cascade analysis
enriquea Mar 28, 2026
76d9008
update docs with qtl pipelines
enriquea Mar 28, 2026
42d68c6
feat: add visualization functions for QTL cascade analysis
enriquea Mar 28, 2026
0cbdea5
blacked
enriquea Mar 28, 2026
8707b25
feat: add logdiff function and improve tissue filtering in coloc anal…
enriquea Mar 28, 2026
b9ed7c9
feat: add warning for multi-tissue data without a tissue filter in co…
enriquea Mar 28, 2026
69e24cb
feat: update PQTL_SOURCES to reflect current implementation status
enriquea Mar 28, 2026
6b890fb
feat: update disease-gene table documentation for clarity on gene_id …
enriquea Mar 28, 2026
74f67cc
feat: clarify coloc results inclusion in outputs for QTL Cascade
enriquea Mar 28, 2026
5284f6f
feat: update tissue option help text for clarity on file import restr…
enriquea Mar 28, 2026
c35f744
feat: reorder pipeline stages and add single-tissue report generation
enriquea Mar 28, 2026
ad48b15
fix: correct formatting of confidence interval in plot output
enriquea Mar 28, 2026
5f1553e
feat: add handling for empty data in plot output and enhance heatmap …
enriquea Mar 28, 2026
7b25f31
fix: correct code block formatting in qtlcascade documentation
enriquea Mar 28, 2026
aea2d62
feat: enhance dry run output and update coloc threshold for gene colo…
enriquea Mar 28, 2026
290b790
feat: sanitize HTML output in report generation to prevent XSS vulner…
enriquea Mar 28, 2026
bf856b4
feat: normalize contig names for reference genome compatibility and u…
enriquea Mar 28, 2026
47610c8
feat: add CLI tests for eQTL and pQTL table creation and cascade comm…
enriquea Mar 28, 2026
a8f172e
feat: enhance error logging for gene summary loading and scope plots …
enriquea Mar 28, 2026
36ad0e6
fix: resolve Codacy static analysis issues - remove unused imports an…
Copilot Mar 28, 2026
21 changes: 21 additions & 0 deletions docs_site/examples/index.md
@@ -8,6 +8,7 @@ This directory contains end-to-end workflow examples demonstrating how to use hv
- [PSROC (Prediction Score ROC Analysis)](#psroc-prediction-score-roc-analysis)
- [EnrichEx (Gene Set Enrichment)](#enrichex-gene-set-enrichment)
- [PTM (Post-Translational Modification)](#ptm-post-translational-modification)
- [QTL Cascade (Molecular QTL Integration)](#qtl-cascade-molecular-qtl-integration)
- [Ancestry Inference](#ancestry-inference)
- [ClinVar Data Streaming](#clinvar-data-streaming)
- [ClinGen Streaming](#clingen-streaming)
@@ -105,6 +106,26 @@ hvantk ptm report -o report.html --landscape-json results/landscape/landscape_su

---

### QTL Cascade (Molecular QTL Integration)

Trace variant effects across molecular layers, from DNA to RNA (eQTL) to protein (pQTL), using outer-join cascade classification and Approximate Bayes Factor (ABF) colocalization analysis.

**Quick start:**
```bash
# Build eQTL and pQTL tables
hvantk mktable eqtl --raw-input /data/gtex_v11/signif_pairs/ --output-ht eqtl.ht --source gtex_v11 --tissue Liver
hvantk mktable pqtl --raw-input /data/fang_pqtl/Liver_allpairs.txt.gz --output-ht pqtl.ht --source gtex_fang --gene-map-ht ensembl.ht

# Run cascade pipeline
hvantk qtlcascade run --eqtl-ht eqtl.ht --pqtl-ht pqtl.ht -o results/cascade/
```

**Outputs:** Cascade Hail Table, gene summary (HT + TSV), plots (PNG), HTML report. Coloc results (TSV) are included when `--eqtl-allpairs` and `--pqtl-allpairs` are provided.

**Documentation:** [QTL Cascade Examples](qtlcascade.md) | [QTL Cascade Docs](../tools/qtlcascade.md)

---

### ClinVar Data Streaming

**Directory:** [`clinvar/`](https://github.com/bigbio/hvantk/tree/main/examples/clinvar/)
191 changes: 191 additions & 0 deletions docs_site/examples/qtlcascade.md
@@ -0,0 +1,191 @@
# QTL Cascade Example

This page demonstrates the QTL cascade pipeline for tracing variant effects from DNA to RNA (eQTL) to protein (pQTL).

## Overview

The QTL cascade pipeline:
1. Builds eQTL and pQTL Hail Tables from summary statistics
2. Outer-joins them on `(locus, alleles, gene_id)` to classify variant–gene pairs
3. Optionally runs Approximate Bayes Factor (ABF) colocalization to distinguish true cascades from linkage disequilibrium (LD) artifacts
4. Generates gene-level summaries with constraint and disease overlays
5. Produces plots and an HTML report
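
The outer-join classification in step 2 can be illustrated with plain Python dictionaries. This is a minimal sketch, not the hvantk implementation: the function `classify_cascade`, the class labels (`eqtl_and_pqtl`, `eqtl_only`, `pqtl_only`), and the variant and gene identifiers are all hypothetical and chosen only for illustration.

```python
def classify_cascade(eqtl, pqtl):
    """Outer-join two {key: stats} mappings and label every key.

    Keys mimic the (locus, alleles, gene_id) join key; labels are hypothetical.
    """
    classes = {}
    # The union of keys is the outer join: no pair from either side is dropped.
    for key in eqtl.keys() | pqtl.keys():
        if key in eqtl and key in pqtl:
            classes[key] = "eqtl_and_pqtl"
        elif key in eqtl:
            classes[key] = "eqtl_only"
        else:
            classes[key] = "pqtl_only"
    return classes


# Synthetic example data (not real variants or genes).
eqtl = {
    ("chr1:1000", ("A", "G"), "GENE_A"): {"beta": 0.4},
    ("chr1:2000", ("C", "T"), "GENE_B"): {"beta": -0.2},
}
pqtl = {
    ("chr1:1000", ("A", "G"), "GENE_A"): {"beta": 0.3},
    ("chr2:500", ("G", "T"), "GENE_C"): {"beta": 0.1},
}

classes = classify_cascade(eqtl, pqtl)
for key in sorted(classes):
    print(key, classes[key])
```

An outer join, rather than an inner join, keeps pairs seen in only one layer, so they can still be classified instead of silently disappearing.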

## Quick Start (CLI)

```bash
# Step 1: Build eQTL table from GTEx v11 significant pairs
hvantk mktable eqtl \
--raw-input /data/gtex_v11/signif_pairs/Liver.v11.signif_pairs.parquet \
--output-ht /data/tables/eqtl_liver.ht \
--source gtex_v11 \
--tissue Liver

# Step 2: Build pQTL table from Fang et al. allpairs
hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht /data/tables/pqtl_liver.ht \
--source gtex_fang \
--tissue Liver \
--gene-map-ht /data/tables/ensembl_gene.ht \
--p-threshold 5e-8

# Step 3: Run the cascade pipeline
hvantk qtlcascade run \
--eqtl-ht /data/tables/eqtl_liver.ht \
--pqtl-ht /data/tables/pqtl_liver.ht \
--constraint-ht /data/tables/gnomad_metrics.ht \
-o /results/cascade_liver
```

## Quick Start (Python API)

```python
from hvantk.qtlcascade import (
    build_cascade,
    build_cascade_gene_summary,
    CascadeConfig,
    CascadePipeline,
)

# Option A: Use individual functions
ht = build_cascade(
    eqtl_ht_path="/data/tables/eqtl_liver.ht",
    pqtl_ht_path="/data/tables/pqtl_liver.ht",
    output_path="/results/cascade.ht",
    tissue="Liver",
)
print(f"Cascade pairs: {ht.count()}")

gene_ht = build_cascade_gene_summary(
    cascade_ht_path="/results/cascade.ht",
    output_path="/results/gene_summary.ht",
    constraint_ht_path="/data/tables/gnomad_metrics.ht",
)
print(f"Cascade genes: {gene_ht.count()}")

# Option B: Use the pipeline
config = CascadeConfig(
    eqtl_ht="/data/tables/eqtl_liver.ht",
    pqtl_ht="/data/tables/pqtl_liver.ht",
    constraint_ht="/data/tables/gnomad_metrics.ht",
    output_dir="/results/cascade_liver",
)
pipeline = CascadePipeline(config)
result = pipeline.run()
print(f"Class counts: {result.class_counts}")
```

## With Colocalization

To distinguish true signal propagation from LD artifacts, provide allpairs tables for coloc:

```bash
# Build allpairs tables (set p-threshold to 0 to keep all variants)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/allpairs/Liver/ \
--output-ht /data/tables/eqtl_allpairs_liver.ht \
--source gtex_v11 --tissue Liver --p-threshold 0

hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht /data/tables/pqtl_allpairs_liver.ht \
--source gtex_fang --tissue Liver \
--gene-map-ht /data/tables/ensembl_gene.ht

# Run pipeline with coloc
hvantk qtlcascade run \
--eqtl-ht /data/tables/eqtl_liver.ht \
--pqtl-ht /data/tables/pqtl_liver.ht \
--eqtl-allpairs /data/tables/eqtl_allpairs_liver.ht \
--pqtl-allpairs /data/tables/pqtl_allpairs_liver.ht \
--constraint-ht /data/tables/gnomad_metrics.ht \
--disease-genes-ht /data/tables/clingen.ht \
-o /results/cascade_liver_coloc
```
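
For readers unfamiliar with ABF colocalization, the sketch below shows the standard formulation: Wakefield's approximate Bayes factor per variant, combined into posteriors for hypotheses H0 to H4 as in the `coloc` method. It is an illustration under stated assumptions, not the hvantk code: the function names, the prior standard deviation of 0.15, the priors `p1`, `p2`, `p12`, and the three-variant synthetic data are all assumptions.

```python
import math


def log_sum(values):
    """Return log(sum(exp(v))) computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff(a, b):
    """Return log(exp(a) - exp(b)); requires a > b."""
    if a <= b:
        raise ValueError("log_diff requires a > b")
    return a + math.log1p(-math.exp(b - a))


def log_abf(beta, se, prior_sd=0.15):
    """Wakefield log approximate Bayes factor for one variant."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # shrinkage factor
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0..H4] for two traits over the same variants."""
    if len(labf1) != len(labf2) or len(labf1) < 2:
        raise ValueError("need the same >= 2 variants for both traits")
    l1 = log_sum(labf1)
    l2 = log_sum(labf2)
    l12 = log_sum([a + b for a, b in zip(labf1, labf2)])
    log_h = [
        0.0,                                                   # H0: no association
        math.log(p1) + l1,                                     # H1: trait 1 only
        math.log(p2) + l2,                                     # H2: trait 2 only
        math.log(p1) + math.log(p2) + log_diff(l1 + l2, l12),  # H3: distinct variants
        math.log(p12) + l12,                                   # H4: shared variant
    ]
    total = log_sum(log_h)
    return [math.exp(h - total) for h in log_h]


# Synthetic effect sizes for three variants (standard error 0.1 throughout).
se = 0.1
eqtl_labf = [log_abf(b, se) for b in [0.6, 0.1, 0.05]]
shared = coloc_posteriors(eqtl_labf, [log_abf(b, se) for b in [0.5, 0.08, 0.02]])
distinct = coloc_posteriors(eqtl_labf, [log_abf(b, se) for b in [0.02, 0.08, 0.5]])
print("shared lead variant:   H4 = %.3f" % shared[4])
print("distinct lead variants: H3 = %.3f, H4 = %.3f" % (distinct[3], distinct[4]))
```

When both traits peak at the same variant, H4 dominates; when the peaks sit on different variants, the mass moves away from H4. This is why the method needs full allpairs statistics rather than only the significant pairs.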

## Multi-Tissue Analysis

Run across Fang et al. (2025) tissues with cross-tissue comparison:

```bash
hvantk qtlcascade run \
--eqtl-ht /data/tables/eqtl.ht \
--pqtl-ht /data/tables/pqtl.ht \
--eqtl-allpairs /data/tables/eqtl_allpairs.ht \
--pqtl-allpairs /data/tables/pqtl_allpairs.ht \
--tissues "Liver,Heart,Lung,Colon,Thyroid" \
--constraint-ht /data/tables/gnomad_metrics.ht \
-o /results/cascade_multi
```

```python
# Python API for multi-tissue
from hvantk.qtlcascade import CascadeConfig, CascadePipeline

config = CascadeConfig(
    eqtl_ht="/data/tables/eqtl.ht",
    pqtl_ht="/data/tables/pqtl.ht",
    eqtl_allpairs_ht="/data/tables/eqtl_allpairs.ht",
    pqtl_allpairs_ht="/data/tables/pqtl_allpairs.ht",
    tissues=["Liver", "Heart", "Lung", "Colon", "Thyroid"],
    constraint_ht="/data/tables/gnomad_metrics.ht",
    output_dir="/results/cascade_multi",
)
pipeline = CascadePipeline(config)
results = pipeline.run_collection()

for tissue, res in results.items():
    n_coloc = 0
    if res.coloc_df is not None:
        # Count genes whose H4 (shared causal variant) posterior exceeds 0.8
        n_coloc = (res.coloc_df["H4"] > 0.8).sum()
    print(f"{tissue}: {res.n_cascade_genes} genes, {n_coloc} colocalised")
```

## Expected Outputs

```
results/cascade_liver/
├── per_tissue/
│ └── liver/
│ ├── cascade.ht/ # Variant-gene cascade pairs
│ ├── gene_summary.ht/ # Gene-level aggregation
│ ├── gene_summary.tsv # TSV export
│ └── coloc_results.tsv # H0-H4 posteriors (if coloc run)
├── plots/
│ ├── liver_cascade_classes.png
│ └── liver_coloc_posteriors.png
└── qtlcascade_report.html
```
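
A quick way to confirm that a run produced the layout above is to check for each path. The sketch below is a hypothetical helper, not part of hvantk. It assumes the tissue directory and plot prefixes are the lowercased tissue name (as the `liver` entries in the tree suggest) and that both coloc files appear only when coloc was run.

```python
from pathlib import Path


def expected_outputs(output_dir, tissue, with_coloc=False):
    """List the paths shown in the output tree for one tissue."""
    root = Path(output_dir)
    name = tissue.lower()  # assumption: directory names are lowercased
    tissue_dir = root / "per_tissue" / name
    paths = [
        tissue_dir / "cascade.ht",
        tissue_dir / "gene_summary.ht",
        tissue_dir / "gene_summary.tsv",
        root / "plots" / f"{name}_cascade_classes.png",
        root / "qtlcascade_report.html",
    ]
    if with_coloc:
        # Assumption: these two files exist only when coloc was run.
        paths.append(tissue_dir / "coloc_results.tsv")
        paths.append(root / "plots" / f"{name}_coloc_posteriors.png")
    return paths


def missing_outputs(output_dir, tissue, with_coloc=False):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, tissue, with_coloc) if not p.exists()]


for path in missing_outputs("/results/cascade_liver", "Liver"):
    print("missing:", path)
```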

## Standalone Coloc

Run colocalization independently on a set of genes:

```bash
# Extract cascade genes with both eQTL + pQTL evidence
# (e.g., from gene_summary.tsv where has_complete_cascade = true)
awk -F'\t' 'NR>1 && $7=="true" {print $1}' gene_summary.tsv > cascade_genes.txt

# Run coloc
hvantk qtlcascade coloc \
--eqtl-allpairs /data/tables/eqtl_allpairs_liver.ht \
--pqtl-allpairs /data/tables/pqtl_allpairs_liver.ht \
--cascade-genes cascade_genes.txt \
--tissue Liver \
--window-kb 500 \
-o coloc_results.tsv
```
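
The `awk` filter above depends on `has_complete_cascade` being the seventh column. A name-based filter is less fragile if the column order changes. The sketch below uses only the standard library; it assumes the TSV has a header row and that the gene identifier is the first column (as in the `awk` command), and the `n_variants` column and gene names in the sample are made up.

```python
import csv
import io


def complete_cascade_genes(handle, flag_column="has_complete_cascade"):
    """Return first-column values for rows whose flag column is 'true'."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_column = reader.fieldnames[0]  # assumption: gene identifier comes first
    return [
        row[id_column]
        for row in reader
        if row[flag_column].strip().lower() == "true"
    ]


# Synthetic stand-in for gene_summary.tsv.
example = io.StringIO(
    "gene_id\tn_variants\thas_complete_cascade\n"
    "GENE_A\t3\ttrue\n"
    "GENE_B\t1\tfalse\n"
    "GENE_C\t2\ttrue\n"
)
genes = complete_cascade_genes(example)
print("\n".join(genes))
```

For a real file, pass `open("gene_summary.tsv", newline="")` instead of the `io.StringIO` object and write the result to `cascade_genes.txt`.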

## Data Sources

| Source | Type | Format | Reference |
|--------|------|--------|-----------|
| GTEx v11 | eQTL | Parquet | GTEx Consortium |
| GTEx v8 | eQTL | TSV | GTEx Consortium |
| eQTLGen | eQTL | TSV | Võsa et al. (2021) |
| Fang et al. 2025 | pQTL | Space-delimited text | Fang et al. (2025) |

See [Data Sources](../guide/data-sources.md#qtl-data) for download instructions.

---

**Documentation:** [QTL Cascade Docs](../tools/qtlcascade.md)
79 changes: 79 additions & 0 deletions docs_site/guide/data-sources.md
@@ -282,6 +282,85 @@ URL: https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/
wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz
```

## QTL data

These datasets are used to build eQTL and pQTL Hail Tables for the QTL cascade pipeline.

### GTEx eQTL data

Expression quantitative trait loci from the GTEx project. Available as significant pairs (genome-wide significant associations) and allpairs (full summary statistics for coloc).

**GTEx v11** (recommended):
URL: https://www.gtexportal.org/home/downloads/adult-gtex/qtl

```bash
# Download significant pairs (Parquet format, ~50 MB per tissue)
# Navigate to GTEx Portal → Downloads → Adult GTEx → QTL → eQTL → Significant pairs

# Build significant-pairs table
hvantk mktable eqtl \
--raw-input /data/gtex_v11/Liver.v11.signif_pairs.parquet \
--output-ht eqtl_liver.ht \
--source gtex_v11 \
--tissue Liver

# Build allpairs table for coloc (set p-threshold to 0)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/allpairs/Liver/ \
--output-ht eqtl_allpairs_liver.ht \
--source gtex_v11 \
--tissue Liver \
--p-threshold 0
```

**GTEx v8** (TSV format):

```bash
# Download from GTEx Portal v8 archive
hvantk mktable eqtl \
--raw-input /data/gtex_v8/Liver.v8.signif_variant_gene_pairs.txt.gz \
--output-ht eqtl_liver_v8.ht \
--source gtex_v8
```

**eQTLGen** (blood eQTLs):
URL: https://www.eqtlgen.org/cis-eqtls.html

```bash
# Download cis-eQTL full results (~2 GB)
hvantk mktable eqtl \
--raw-input /data/eqtlgen/cis-eQTLs_full.txt.gz \
--output-ht eqtl_blood.ht \
--source eqtlgen
```

### Fang et al. (2025) pQTL data

Protein quantitative trait loci from Fang et al. (2025), covering 5 tissues (Colon, Heart, Liver, Lung, Thyroid). Space-delimited allpairs format with columns: `gene_name SNP CHR BP A1 NMISS BETA STAT P`. SE is derived as `|BETA/STAT|` (rows with `STAT = 0` are filtered out).

Availability: contact the authors or check the GTEx Portal supplementary data.

> **Note:** Fang pQTL data uses gene symbols. Provide a `--gene-map-ht` (Ensembl gene table) for symbol → Ensembl ID mapping.

```bash
# Build pQTL table with gene mapping
hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht pqtl_liver.ht \
--source gtex_fang \
--tissue Liver \
--gene-map-ht ensembl_gene.ht \
--p-threshold 5e-8

# Allpairs for coloc (omit p-threshold)
hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht pqtl_allpairs_liver.ht \
--source gtex_fang \
--tissue Liver \
--gene-map-ht ensembl_gene.ht
```
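
The standard-error derivation described for the Fang et al. format can be sketched in plain Python. This is an illustration, not the hvantk importer: the function name is hypothetical, the sample rows are synthetic, and it assumes the file starts with a header line matching the documented column names.

```python
import io

FANG_COLUMNS = ["gene_name", "SNP", "CHR", "BP", "A1", "NMISS", "BETA", "STAT", "P"]


def parse_fang_allpairs(handle):
    """Parse space-delimited rows, deriving SE = |BETA / STAT|."""
    header = handle.readline().split()
    if header != FANG_COLUMNS:  # assumption: the file carries a header line
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line in handle:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) != len(FANG_COLUMNS):
            continue  # skip blank or malformed lines
        rec = dict(zip(FANG_COLUMNS, fields))
        beta, stat = float(rec["BETA"]), float(rec["STAT"])
        if stat == 0.0:
            continue  # SE is undefined when STAT = 0, so the row is dropped
        rows.append({
            "gene_name": rec["gene_name"],
            "snp": rec["SNP"],
            "beta": beta,
            "se": abs(beta / stat),
            "p": float(rec["P"]),
        })
    return rows


# Synthetic rows for illustration only.
sample = io.StringIO(
    "gene_name SNP CHR BP A1 NMISS BETA STAT P\n"
    "GENE_A rs0000001 1 1000 A 100 0.50 5.0 1e-6\n"
    "GENE_A rs0000002 1 2000 G 100 0.00 0.0 1.0\n"
    "GENE_B rs0000003 2 500 T 98 -0.30 -2.0 0.048\n"
)
rows = parse_fang_allpairs(sample)
for row in rows:
    print(row["gene_name"], row["snp"], round(row["se"], 4))
```

The identity holds because `STAT` is the test statistic `BETA / SE`, so dividing `BETA` by it recovers the magnitude of `SE`.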

## Expression data sources

These datasets are used to build expression MatrixTables via the UCSC Cell Browser and Expression Atlas downloaders.
44 changes: 44 additions & 0 deletions docs_site/guide/usage.md
@@ -109,6 +109,50 @@ hvantk ptm build \

> **Note:** The PTM build command downloads Ensembl GTF and UniProt PTM data automatically. Use `--gtf-path` and `--ptm-tsv` to provide pre-downloaded files.

- eQTL (keyed by locus, alleles, gene_id)

```bash
# GTEx v11 significant pairs (Parquet)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/Liver.v11.signif_pairs.parquet \
--output-ht /out/eqtl_liver.ht \
--source gtex_v11 \
--tissue Liver

# GTEx v8 (TSV)
hvantk mktable eqtl \
--raw-input /data/gtex_v8/Liver.v8.signif_variant_gene_pairs.txt.gz \
--output-ht /out/eqtl_liver.ht \
--source gtex_v8

# eQTLGen (blood cis-eQTLs)
hvantk mktable eqtl \
--raw-input /data/eqtlgen/cis-eQTLs_full.txt.gz \
--output-ht /out/eqtl_blood.ht \
--source eqtlgen

# Allpairs for coloc (set p-threshold to 0)
hvantk mktable eqtl \
--raw-input /data/gtex_v11/allpairs/Liver/ \
--output-ht /out/eqtl_allpairs_liver.ht \
--source gtex_v11 --tissue Liver --p-threshold 0
```

- pQTL (keyed by locus, alleles, gene_id)

```bash
# Fang et al. 2025 (space-delimited allpairs)
hvantk mktable pqtl \
--raw-input /data/fang_pqtl/Liver_allpairs.txt.gz \
--output-ht /out/pqtl_liver.ht \
--source gtex_fang \
--tissue Liver \
--gene-map-ht /data/ensembl_gene.ht \
--p-threshold 5e-8
```

> **Note:** Fang pQTL data uses gene symbols. Provide `--gene-map-ht` (Ensembl gene table with `gene_name` field) for symbol → Ensembl ID mapping.
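
The symbol-to-ID mapping mentioned in the note amounts to a lookup keyed by `gene_name`. The sketch below shows the idea with a plain dictionary. It is not how hvantk performs the join: the function name and the placeholder identifiers are hypothetical, and dropping unmapped symbols is an assumption about the desired behaviour.

```python
def map_symbols(rows, symbol_to_ensembl):
    """Attach gene_id to each row; collect symbols that have no mapping."""
    mapped, unmapped = [], set()
    for row in rows:
        gene_id = symbol_to_ensembl.get(row["gene_name"])
        if gene_id is None:
            unmapped.add(row["gene_name"])  # assumption: unmapped rows are dropped
            continue
        mapped.append({**row, "gene_id": gene_id})
    return mapped, sorted(unmapped)


# Placeholder identifiers, not real Ensembl IDs.
symbol_to_ensembl = {"GENE_A": "ENSG_PLACEHOLDER_A", "GENE_B": "ENSG_PLACEHOLDER_B"}
pqtl_rows = [
    {"gene_name": "GENE_A", "beta": 0.5},
    {"gene_name": "GENE_X", "beta": 0.1},
    {"gene_name": "GENE_B", "beta": -0.3},
]

mapped, unmapped = map_symbols(pqtl_rows, symbol_to_ensembl)
print(f"mapped {len(mapped)} rows; unmapped symbols: {unmapped}")
```

Reporting the unmapped symbols is worth keeping in any real implementation, because a stale gene table otherwise removes pQTL rows without any visible sign.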

## 2) Batch-create Tables (HT) from a recipe

Use a recipe to build many tables at once. JSON and YAML are both supported (YAML requires PyYAML installed).
2 changes: 1 addition & 1 deletion docs_site/images/hvantk-architecture.svg
4 changes: 4 additions & 0 deletions docs_site/images/hvantk-qtlcascade-workflow.svg