Commit bd48823: Merge pull request #88 from bigbio/dev
Release: merge dev into main (AlphaGenome, PeptideAtlas, backend abstraction)
2 parents d69be00 + 58ceb3b
34 files changed: 5759 additions & 67 deletions

docs/superpowers/plans/2026-03-31-alphagenome-streamer.md (1572 additions & 0 deletions; large diff not rendered)

New file, 231 additions & 0 deletions:
# AlphaGenome Variant Prediction Streamer — Design Spec

**Date:** 2026-03-31
**Branch:** `variantpredictions`
**Status:** Approved

## Overview

Add an `AlphaGenomeStreamer` to hvantk that, given a set of variant positions, calls the AlphaGenome API and produces per-modality Hail Tables with the full multimodal predictions (expression, splicing, chromatin, contact maps). The streamer handles authentication, adaptive interval grouping, rate limiting, checkpoint-based resumption, and assembly of final outputs.

## Motivation

AlphaGenome (DeepMind, 2026) is a genomic foundation model that predicts regulatory effects of genetic variants at single base-pair resolution across multiple modalities. Integrating it into hvantk enables researchers to enrich variant annotations with predicted functional impact — complementing existing static annotation sources (ClinVar, gnomAD, dbNSFP) with model-based predictions.

## Config File

A YAML config file controls API access, ontology terms, and interval strategy:
```yaml
api:
  key: "AG-xxxx"        # or null -> falls back to ALPHAGENOME_API_KEY env var
  max_retries: 3
  retry_backoff: 2.0    # exponential backoff base (seconds)
  request_timeout: 120  # seconds per API call

ontology:
  terms:
    - "UBERON:0001157"  # liver
    - "UBERON:0000955"  # brain
    - "UBERON:0002048"  # lung
  output_types:
    - RNA_SEQ
    - CHROMATIN
    - SPLICING
    - CONTACT_MAP

intervals:
  default_size: 1048576       # 1 Mbp fixed-size fallback
  adaptive: true              # group nearby variants into shared intervals
  adaptive_max_size: 1048576  # max interval size when adaptive
  density_window: 50000       # bp window to check for neighboring variants
```
**Auth resolution order:** config `api.key` > `ALPHAGENOME_API_KEY` env var > error with message.
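The resolution order above can be sketched as a small helper. The name `resolve_api_key` is illustrative, not hvantk's actual function:

```python
import os

def resolve_api_key(config: dict) -> str:
    # Config value wins; otherwise fall back to the env var; otherwise fail
    # with an actionable message. (Illustrative sketch, not hvantk's API.)
    key = config.get("api", {}).get("key") or os.environ.get("ALPHAGENOME_API_KEY")
    if not key:
        raise ValueError(
            "No AlphaGenome API key found: set api.key in the config "
            "or export ALPHAGENOME_API_KEY"
        )
    return key
```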
## Interval Strategy

### Adaptive Mode (default)

1. Sort all variants by genomic position (chrom, pos).
2. Scan linearly — group consecutive variants within `density_window` (default 50 kb) of each other.
3. For each group, compute a bounding interval centered on the group midpoint, expanded to cover all variants plus padding.
4. Cap at `adaptive_max_size` (1 Mbp — AlphaGenome's limit). If a group exceeds this, split it into sub-groups.
5. Singleton variants (no neighbors within the window) get a `default_size` window centered on their position.
### Fixed Mode (`adaptive: false`)

Each variant gets a `default_size` window centered on its position. No grouping — one API call per variant.

### Benefit

If 200 of 5000 variants cluster in a gene-dense region, they share a single API call instead of 200 separate ones. For a typical GWAS-like variant set this can reduce API calls significantly.
### Implementation

A pure function `compute_intervals(variants, config) -> List[Tuple[Interval, List[Variant]]]` that returns interval-to-variant mappings. Testable independently of the API.
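A minimal sketch of this function, under simplifying assumptions: variants as `(chrom, pos)` tuples, intervals as `(chrom, start, end)` tuples, a greedy linear scan that closes a group early when its span would exceed `adaptive_max_size` (one way to realize the "split into sub-groups" rule), and small groups padded out to `default_size`:

```python
from typing import List, Tuple

Variant = Tuple[str, int]        # (chrom, pos); alleles omitted for brevity
Interval = Tuple[str, int, int]  # (chrom, start, end)

def _bounding_interval(group: List[Variant], default_size: int, max_size: int) -> Interval:
    # Singletons and tight groups get at least default_size; everything is
    # capped at max_size and centered on the group midpoint.
    chrom = group[0][0]
    lo, hi = group[0][1], group[-1][1]
    size = min(max(hi - lo, default_size), max_size)
    mid = (lo + hi) // 2
    start = max(0, mid - size // 2)
    return (chrom, start, start + size)

def compute_intervals(
    variants: List[Variant], config: dict
) -> List[Tuple[Interval, List[Variant]]]:
    iv = config["intervals"]
    default_size, max_size, window = iv["default_size"], iv["adaptive_max_size"], iv["density_window"]

    out: List[Tuple[Interval, List[Variant]]] = []
    group: List[Variant] = []
    # Sort by (chrom, pos), then greedily group neighbors within the window.
    for v in sorted(variants):
        if group and (
            v[0] != group[-1][0]                 # new chromosome
            or v[1] - group[-1][1] > window      # gap exceeds density_window
            or v[1] - group[0][1] >= max_size    # group span would exceed cap
        ):
            out.append((_bounding_interval(group, default_size, max_size), group))
            group = []
        group.append(v)
    if group:
        out.append((_bounding_interval(group, default_size, max_size), group))
    return out
```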
## Streamer Architecture

`AlphaGenomeStreamer` extends `HailDataStreamer`.

### Input

- Hail Table keyed by `(locus, alleles)`, or
- TSV with `chrom, pos, ref, alt` columns (auto-imported into a Hail Table on `setup()`).
### Lifecycle

#### `setup()`

1. Initialize Hail.
2. Load and validate the config YAML.
3. Authenticate — create the API client via `dna_client.create(api_key)`.
4. Load input variants (Hail Table or TSV).
5. If resuming, load checkpoint state — a JSON file tracking completed intervals.
6. Compute interval-to-variant mappings via `compute_intervals()`.
7. Filter already-completed intervals out using the checkpoint.
#### `stream()`

- Iterate over pending intervals in batches of `chunk_size`.
- For each interval and its variants, call `model.predict_variant()` per variant.
- Collect raw outputs into per-modality dicts.
- After each batch: write intermediate results to the checkpoint dir, update the checkpoint JSON.
- Yield a dict of `{modality_name: hl.Table}` per batch (intermediate, used by a `StreamProcessor` pipeline or discarded if running standalone).
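The shape of that loop can be sketched as a generator, with `predict_fn` standing in for the wrapped `model.predict_variant()` call and plain dicts standing in for the per-modality Hail Tables (both simplifications, not hvantk's actual types):

```python
from typing import Callable, Dict, Iterator, List, Tuple

def stream_batches(
    pending: List[Tuple[tuple, list]],    # (interval, variants) pairs still to do
    chunk_size: int,
    predict_fn: Callable[[str], dict],    # variant -> {modality: prediction}; illustrative
) -> Iterator[Dict[str, list]]:
    # Process pending intervals chunk by chunk, yielding per-modality results
    # after each batch so the caller can checkpoint between yields.
    for i in range(0, len(pending), chunk_size):
        batch = pending[i:i + chunk_size]
        results: Dict[str, list] = {}
        for interval, variants in batch:
            for variant in variants:
                for modality, value in predict_fn(variant).items():
                    results.setdefault(modality, []).append((variant, value))
        yield results
```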
#### `teardown()`

- Union all batch checkpoints into final per-modality Hail Tables.
- Write to output paths (e.g., `output_dir/rna_seq.ht`, `output_dir/chromatin.ht`).
- Each table keyed by `(locus, alleles)` with the ontology term as a column/nested struct.
- Log summary stats (variants processed, API calls made, failures).
### Checkpoint Directory Structure

```
output_dir/
├── _checkpoints/
│   ├── state.json       # {"completed_intervals": [...], "failed_variants": [...]}
│   ├── batch_000.json   # raw API responses for batch 0
│   ├── batch_001.json
│   └── ...
├── rna_seq.ht/
├── chromatin.ht/
├── splicing.ht/
└── contact_map.ht/
```
## Rate Limiting & Error Handling

### Adaptive Throttling

AlphaGenome doesn't publish fixed rate limits (they vary by demand), so the streamer uses adaptive throttling:

- **Base delay:** Configurable minimum wait between API calls (default 0.5 s).
- **Backoff on 429/5xx:** Exponential backoff with jitter — `retry_backoff^attempt * (1 + random(0, 0.5))`, up to `max_retries`.
- **Cooldown:** If 3 consecutive requests get rate-limited, pause for 60 s before continuing.
- **Progress logging:** Log every N intervals completed, with estimated time remaining based on average call duration.
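The backoff formula above translates directly into a small helper (a sketch; the function name is illustrative):

```python
import random

def backoff_delay(attempt: int, retry_backoff: float = 2.0) -> float:
    # retry_backoff^attempt seconds, inflated by up to 50% random jitter,
    # matching the formula in the bullet above.
    return (retry_backoff ** attempt) * (1 + random.uniform(0.0, 0.5))
```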
### Error Classification

| Error Type | Behavior |
|------------|----------|
| 429 Too Many Requests | Backoff + retry |
| 5xx Server Error | Backoff + retry |
| 4xx Client Error (not 429) | Log warning, skip variant, record in `failed_variants` |
| Timeout | Retry up to `max_retries`, then skip + record |
| Network Error | Retry up to `max_retries`, then skip + record |
| Invalid variant (no ref/alt at position) | Log warning, skip, record |
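The HTTP rows of this table reduce to a tiny classifier, sketched here with hypothetical action names:

```python
def classify_http_error(status: int) -> str:
    # Map an HTTP status code to the retry policy in the table above.
    # Action names ("backoff_retry", "skip_and_record") are illustrative.
    if status == 429 or 500 <= status <= 599:
        return "backoff_retry"
    if 400 <= status <= 499:
        return "skip_and_record"
    return "ok"
```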
### Resumption Flow

1. On start, check for `_checkpoints/state.json`.
2. If it exists, load the completed intervals — skip them and log "Resuming from batch N".
3. Failed variants from the previous run are re-attempted once.
4. The user can force a clean start via the `--no-resume` CLI flag.
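Step 1 and the fresh-start fallback can be sketched as (function name illustrative; the state shape follows the `state.json` comment in the checkpoint tree above):

```python
import json
from pathlib import Path

def load_checkpoint(checkpoint_dir: str) -> dict:
    # Return the prior run's state if a checkpoint exists, else a fresh one.
    state_path = Path(checkpoint_dir) / "state.json"
    if state_path.exists():
        return json.loads(state_path.read_text())
    return {"completed_intervals": [], "failed_variants": []}
```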
### Final Report

After completion, log a summary: total variants, successful, failed (with reasons), API calls made, total runtime.
## CLI & Table Builder Integration

### CLI Command

```bash
hvantk mktable alphagenome \
  --input variants.ht \
  --output-dir /out/alphagenome/ \
  --config alphagenome_config.yaml \
  --no-resume \
  --overwrite
```
### Builder Function

In `hvantk/tables/table_builders.py`:

```python
def create_alphagenome_tb(
    input_path: str,
    output_path: str,
    config_path: str,
    no_resume: bool = False,
    overwrite: bool = False,
) -> Dict[str, hl.Table]:
    """Build per-modality Hail Tables from AlphaGenome predictions."""
```

Returns `Dict[str, hl.Table]` — a deviation from the standard single-table builder signature, necessary for multi-table output.
### Registry

In `hvantk/tables/registry.py`:

```python
TABLE_BUILDERS["alphagenome"] = create_table_adapter(
    "hvantk.tables.table_builders", "create_alphagenome_tb"
)
```
### Recipe Support

Output is a directory path:

```json
{
  "name": "alphagenome",
  "input": "/data/variants.ht",
  "output": "/out/alphagenome/",
  "params": {"config_path": "alphagenome_config.yaml"}
}
```
## Files & Module Layout

### New Files

- `hvantk/data/alphagenome_streamer.py` — `AlphaGenomeStreamer` class + `compute_intervals()` function
- `hvantk/tests/test_alphagenome_streamer.py` — unit tests with mocked API
- `hvantk/tests/testdata/alphagenome_config.yaml` — test config fixture

### Modified Files

- `hvantk/tables/table_builders.py` — add `create_alphagenome_tb()`
- `hvantk/tables/registry.py` — register the `"alphagenome"` builder
- `hvantk/commands/make_table_cli.py` — add `mktable_alphagenome` command
- `hvantk/core/constants.py` — add AlphaGenome default constants
- `hvantk/data/__init__.py` — export `AlphaGenomeStreamer`

### Dependencies

- `alphagenome` (PyPI) — **optional dependency**. The builder raises `ImportError` with install instructions if it's missing.
- No new core dependencies.
## Testing Strategy

- Unit tests with a mocked `dna_client` — test streamer lifecycle, interval computation, checkpoint logic, and error handling.
- `compute_intervals()` tested as a pure function with known variant distributions.
- Integration tests marked `@pytest.mark.network` for real API calls (small variant set).
- Config validation tests (missing key, invalid ontology terms, bad interval params).
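The mocked-client approach can be sketched with `unittest.mock` (the helper `predict_all` is a stand-in for the streamer's per-variant call loop, not hvantk's real code):

```python
from unittest.mock import MagicMock

def predict_all(client, variants):
    # Illustrative stand-in for the streamer's per-variant prediction loop.
    return [client.predict_variant(v) for v in variants]

def test_predict_all_uses_mocked_client():
    # No network: the client is a MagicMock with a canned prediction.
    client = MagicMock()
    client.predict_variant.return_value = {"rna_seq": 0.7}
    out = predict_all(client, ["chr1:100:A:G", "chr1:200:C:T"])
    assert out == [{"rna_seq": 0.7}] * 2
    assert client.predict_variant.call_count == 2
```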
