|
| 1 | +# AlphaGenome Variant Prediction Streamer — Design Spec |
| 2 | + |
| 3 | +**Date:** 2026-03-31 |
| 4 | +**Branch:** `variantpredictions` |
| 5 | +**Status:** Approved |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +Add an `AlphaGenomeStreamer` to hvantk that, given a set of variant positions, calls the AlphaGenome API and produces per-modality Hail Tables with the full multimodal predictions (expression, splicing, chromatin, contact maps). The streamer handles authentication, adaptive interval grouping, rate limiting, checkpoint-based resumption, and assembly of final outputs. |
| 10 | + |
| 11 | +## Motivation |
| 12 | + |
| 13 | +AlphaGenome (DeepMind, 2026) is a genomic foundation model that predicts regulatory effects of genetic variants at single base-pair resolution across multiple modalities. Integrating it into hvantk enables researchers to enrich variant annotations with predicted functional impact — complementing existing static annotation sources (ClinVar, gnomAD, dbNSFP) with model-based predictions. |
| 14 | + |
| 15 | +## Config File |
| 16 | + |
| 17 | +A YAML config file controls API access, ontology terms, and interval strategy: |
| 18 | + |
| 19 | +```yaml |
| 20 | +api: |
| 21 | + key: "AG-xxxx" # or null -> falls back to ALPHAGENOME_API_KEY env var |
| 22 | + max_retries: 3 |
| 23 | + retry_backoff: 2.0 # exponential backoff base (seconds) |
| 24 | + request_timeout: 120 # seconds per API call |
| 25 | + |
| 26 | +ontology: |
| 27 | + terms: |
| 28 | + - "UBERON:0001157" # liver |
| 29 | + - "UBERON:0000955" # brain |
| 30 | + - "UBERON:0002048" # lung |
| 31 | + output_types: |
| 32 | + - RNA_SEQ |
| 33 | + - CHROMATIN |
| 34 | + - SPLICING |
| 35 | + - CONTACT_MAP |
| 36 | + |
| 37 | +intervals: |
| 38 | + default_size: 1048576 # 1Mbp fixed-size fallback |
| 39 | + adaptive: true # group nearby variants into shared intervals |
| 40 | + adaptive_max_size: 1048576 # max interval size when adaptive |
| 41 | + density_window: 50000 # bp window to check for neighboring variants |
| 42 | +``` |
| 43 | +
|
| 44 | +**Auth resolution order:** config `api.key` > `ALPHAGENOME_API_KEY` env var > error with message. |
| 45 | + |
| 46 | +## Interval Strategy |
| 47 | + |
| 48 | +### Adaptive Mode (default) |
| 49 | + |
| 50 | +1. Sort all variants by genomic position (chrom, pos). |
| 51 | +2. Scan linearly — group consecutive variants within `density_window` (default 50kb) of each other. |
| 52 | +3. For each group, compute a bounding interval centered on the group midpoint, expanded to cover all variants plus padding. |
| 53 | +4. Cap at `adaptive_max_size` (1Mbp — AlphaGenome's limit). If a group exceeds this, split into sub-groups. |
| 54 | +5. Singleton variants (no neighbors within the window) get `default_size` centered on their position. |
| 55 | + |
| 56 | +### Fixed Mode (`adaptive: false`) |
| 57 | + |
| 58 | +Each variant gets a `default_size` window centered on its position. No grouping — one API call per variant. |
| 59 | + |
| 60 | +### Benefit |
| 61 | + |
| 62 | +If 200 of 5000 variants cluster in a gene-dense region, they share a single API call instead of 200 separate ones. For a typical GWAS-like variant set this can reduce API calls significantly. |
| 63 | + |
| 64 | +### Implementation |
| 65 | + |
| 66 | +A pure function `compute_intervals(variants, config) -> List[Tuple[Interval, List[Variant]]]` that returns interval-to-variant mappings. Testable independently of the API. |
| 67 | + |
| 68 | +## Streamer Architecture |
| 69 | + |
| 70 | +`AlphaGenomeStreamer` extends `HailDataStreamer`. |
| 71 | + |
| 72 | +### Input |
| 73 | + |
| 74 | +- Hail Table keyed by `(locus, alleles)`, or |
| 75 | +- TSV with `chrom, pos, ref, alt` columns (auto-imported into Hail Table on `setup()`). |
| 76 | + |
| 77 | +### Lifecycle |
| 78 | + |
| 79 | +#### `setup()` |
| 80 | + |
| 81 | +1. Initialize Hail. |
| 82 | +2. Load and validate config YAML. |
| 83 | +3. Authenticate — create `dna_client` via `dna_client.create(api_key)`. |
| 84 | +4. Load input variants (Hail Table or TSV). |
| 85 | +5. Load checkpoint state if resuming — a JSON file tracking completed intervals. |
| 86 | +6. Compute interval-to-variant mappings via `compute_intervals()`. |
| 87 | +7. Filter out already-completed intervals from checkpoint. |
| 88 | + |
| 89 | +#### `stream()` |
| 90 | + |
| 91 | +- Iterate over pending intervals in batches of `chunk_size`. |
| 92 | +- For each interval + its variants, call `model.predict_variant()` per variant. |
| 93 | +- Collect raw outputs into per-modality dicts. |
| 94 | +- After each batch: write intermediate results to checkpoint dir, update checkpoint JSON. |
| 95 | +- Yield a dict of `{modality_name: hl.Table}` per batch (intermediate, used by `StreamProcessor` pipeline or discarded if running standalone). |
| 96 | + |
| 97 | +#### `teardown()` |
| 98 | + |
| 99 | +- Union all batch checkpoints into final per-modality Hail Tables. |
| 100 | +- Write to output paths (e.g., `output_dir/rna_seq.ht`, `output_dir/chromatin.ht`). |
| 101 | +- Each table keyed by `(locus, alleles)` with ontology term as a column/nested struct. |
| 102 | +- Log summary stats (variants processed, API calls made, failures). |
| 103 | + |
| 104 | +### Checkpoint Directory Structure |
| 105 | + |
| 106 | +``` |
| 107 | +output_dir/ |
| 108 | +├── _checkpoints/ |
| 109 | +│ ├── state.json # {"completed_intervals": [...], "failed_variants": [...]} |
| 110 | +│ ├── batch_000.json # raw API responses for batch 0 |
| 111 | +│ ├── batch_001.json |
| 112 | +│ └── ... |
| 113 | +├── rna_seq.ht/ |
| 114 | +├── chromatin.ht/ |
| 115 | +├── splicing.ht/ |
| 116 | +└── contact_map.ht/ |
| 117 | +``` |
| 118 | +
|
| 119 | +## Rate Limiting & Error Handling |
| 120 | +
|
| 121 | +### Adaptive Throttling |
| 122 | +
|
| 123 | +AlphaGenome doesn't publish fixed rate limits (they vary by demand), so the streamer uses adaptive throttling: |
| 124 | +
|
| 125 | +- **Base delay:** Configurable minimum wait between API calls (default 0.5s). |
| 126 | +- **Backoff on 429/5xx:** Exponential backoff with jitter — `retry_backoff^attempt * (1 + random(0, 0.5))`, up to `max_retries`. |
| 127 | +- **Cooldown:** If 3 consecutive requests get rate-limited, pause for 60s before continuing. |
| 128 | +- **Progress logging:** Log every N intervals completed, estimated time remaining based on average call duration. |
| 129 | +
|
| 130 | +### Error Classification |
| 131 | +
|
| 132 | +| Error Type | Behavior | |
| 133 | +|------------|----------| |
| 134 | +| 429 Too Many Requests | Backoff + retry | |
| 135 | +| 5xx Server Error | Backoff + retry | |
| 136 | +| 4xx Client Error (not 429) | Log warning, skip variant, record in `failed_variants` | |
| 137 | +| Timeout | Retry up to `max_retries`, then skip + record | |
| 138 | +| Network Error | Retry up to `max_retries`, then skip + record | |
| 139 | +| Invalid variant (no ref/alt at position) | Log warning, skip, record | |
| 140 | +
|
| 141 | +### Resumption Flow |
| 142 | +
|
| 143 | +1. On start, check for `_checkpoints/state.json`. |
| 144 | +2. If exists, load completed intervals — skip them, log "Resuming from batch N". |
| 145 | +3. Failed variants from previous run are re-attempted once. |
| 146 | +4. User can force a clean start via `--no-resume` CLI flag. |
| 147 | +
|
| 148 | +### Final Report |
| 149 | +
|
| 150 | +After completion, log a summary: total variants, successful, failed (with reasons), API calls made, total runtime. |
| 151 | +
|
| 152 | +## CLI & Table Builder Integration |
| 153 | +
|
| 154 | +### CLI Command |
| 155 | +
|
| 156 | +```bash |
| 157 | +hvantk mktable alphagenome \ |
| 158 | + --input variants.ht \ |
| 159 | + --output-dir /out/alphagenome/ \ |
| 160 | + --config alphagenome_config.yaml \ |
| 161 | + --no-resume \ |
| 162 | + --overwrite |
| 163 | +``` |
| 164 | + |
| 165 | +### Builder Function |
| 166 | + |
| 167 | +In `hvantk/tables/table_builders.py`: |
| 168 | + |
| 169 | +```python |
| 170 | +def create_alphagenome_tb( |
| 171 | + input_path: str, |
| 172 | + output_path: str, |
| 173 | + config_path: str, |
| 174 | + no_resume: bool = False, |
| 175 | + overwrite: bool = False, |
| 176 | +) -> Dict[str, hl.Table]: |
| 177 | + """Build per-modality Hail Tables from AlphaGenome predictions.""" |
| 178 | +``` |
| 179 | + |
| 180 | +Returns `Dict[str, hl.Table]` — a deviation from the standard single-table builder signature, necessary for multi-table output. |
| 181 | + |
| 182 | +### Registry |
| 183 | + |
| 184 | +In `hvantk/tables/registry.py`: |
| 185 | + |
| 186 | +```python |
| 187 | +TABLE_BUILDERS["alphagenome"] = create_table_adapter( |
| 188 | + "hvantk.tables.table_builders", "create_alphagenome_tb" |
| 189 | +) |
| 190 | +``` |
| 191 | + |
| 192 | +### Recipe Support |
| 193 | + |
| 194 | +Output is a directory path: |
| 195 | + |
| 196 | +```json |
| 197 | +{ |
| 198 | + "name": "alphagenome", |
| 199 | + "input": "/data/variants.ht", |
| 200 | + "output": "/out/alphagenome/", |
| 201 | + "params": {"config_path": "alphagenome_config.yaml"} |
| 202 | +} |
| 203 | +``` |
| 204 | + |
| 205 | +## Files & Module Layout |
| 206 | + |
| 207 | +### New Files |
| 208 | + |
| 209 | +- `hvantk/data/alphagenome_streamer.py` — `AlphaGenomeStreamer` class + `compute_intervals()` function |
| 210 | +- `hvantk/tests/test_alphagenome_streamer.py` — unit tests with mocked API |
| 211 | +- `hvantk/tests/testdata/alphagenome_config.yaml` — test config fixture |
| 212 | + |
| 213 | +### Modified Files |
| 214 | + |
| 215 | +- `hvantk/tables/table_builders.py` — add `create_alphagenome_tb()` |
| 216 | +- `hvantk/tables/registry.py` — register `"alphagenome"` builder |
| 217 | +- `hvantk/commands/make_table_cli.py` — add `mktable_alphagenome` command |
| 218 | +- `hvantk/core/constants.py` — add AlphaGenome default constants |
| 219 | +- `hvantk/data/__init__.py` — export `AlphaGenomeStreamer` |
| 220 | + |
| 221 | +### Dependencies |
| 222 | + |
| 223 | +- `alphagenome` (PyPI) — **optional dependency**. Builder raises `ImportError` with install instructions if missing. |
| 224 | +- No new core dependencies. |
| 225 | + |
| 226 | +## Testing Strategy |
| 227 | + |
| 228 | +- Unit tests with mocked `dna_client` — test streamer lifecycle, interval computation, checkpoint logic, error handling. |
| 229 | +- `compute_intervals()` tested as a pure function with known variant distributions. |
| 230 | +- Integration tests marked `@pytest.mark.network` for real API calls (small variant set). |
| 231 | +- Config validation tests (missing key, invalid ontology terms, bad interval params). |
0 commit comments