Conversation
Phosphorylation-aware variant classification to evaluate whether multi-source phospho evidence adds signal for ClinVar VUS reclassification on top of existing pathogenicity predictors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add converters module with hail_mt_to_anndata and anndata_to_hail_mt functions for crossing between variant-keyed Hail data and expression AnnData in the scverse integration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes np.bool compatibility workaround for Hail 0.2.x with numpy >= 1.24. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
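The compatibility workaround mentioned above can be sketched as follows (illustrative, not the exact hvantk code): NumPy 1.24 removed the deprecated `np.bool` alias that Hail 0.2.x still references, so the alias is restored before Hail is imported.

```python
import numpy as np

# NumPy >= 1.24 removed the deprecated `np.bool` alias, but Hail 0.2.x
# still references it. Restore the alias before any Hail import.
# (Sketch of the described workaround; run it before `import hail`.)
if not hasattr(np, "bool"):
    np.bool = bool
```

On NumPy 2.x the alias exists again (as `np.bool_`), so the `hasattr` guard makes the shim a no-op there.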
Add load_ucsc_metadata and create_anndata_from_ucsc_matrix to ucsc.py alongside existing Hail functions, and build_ucsc_ad to matrix_builders. Hail imports are now lazy so AnnData paths work without Hail installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add convert_sdrf_to_dataframe and create_anndata_from_expression_atlas to expression_atlas.py alongside existing Hail functions. Add build_expression_atlas_ad to matrix_builders.py following the same pattern as build_ucsc_ad. Make hail imports lazy to allow non-Hail usage without hail installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add three new functions to matrix_utils.py that operate on AnnData objects alongside the existing Hail-based functions: describe_expression_ad, filter_by_metadata_ad, and summarize_expression_ad. Guard hail/anndata imports to allow standalone usage of either backend. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Hail MatrixTable builders with AnnData builders in all mkmatrix subcommands (ucsc, expression-atlas, cptac, cptac-phospho). Remove Hail-specific options (--min-partitions, --force-bgz, --auto-convert-bgz) and init_hail() calls. Rewrite expression describe/summarize/markers commands to operate on .h5ad files using scanpy for differential expression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n up imports Replace Hail-based matrix builders with AnnData equivalents in the registry. Remove build_ucsc_mt, build_expression_atlas_mt, build_cptac_mt, build_cptac_phospho_mt from matrix_builders.py. Remove old Hail helper functions from ucsc.py, expression_atlas.py, and cptac.py. Delete wilcoxon_hail.py and obsolete Hail expression test files. Update test_matrix_utils.py fixture to use Hail imports directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sts for downloads
The file was rewritten for AnnData during the scverse migration but kept the Hail-era filename. The leftover Hail histogram function had no live callers (only @pytest.mark.hail tests against removed Hail builders) and a broken `"ad.AnnData"` annotation that triggered an F821 lint error.
- New hvantk/visualization/expression/anndata.py exposes visualize_expression_distribution(adata, ...) operating on adata.X.
- Old hail.py removed; visualization facade and subpackage doc updated.
- Test import switched to the new module.
…ze_expression_ad

Completes the cleanup planned in the scverse-integration plan (Task 10 step 5). The Hail-based functions in matrix_utils.py were superseded by their _ad equivalents when expression I/O migrated to AnnData; their only remaining callers were two skipped @pytest.mark.hail test files.
- Remove annotate_column_summary, describe_expression_mt, summarize_expression, summarize_matrix, filter_by_metadata, filter_by_gene_list, filter_by_expression, get_top_expressed_genes (and the optional hail import).
- Delete tests/test_matrix_utils.py and tests/test_summarize_expression.py.
- summarize_expression_ad: stop mutating adata.obs (was writing _group_label into the caller's object) and replace the per-row index.get_loc lookup with groupby indices for O(n_cells) splitting.
- gene_sets.py: refresh extract_marker_gene_sets docstring to point at the AnnData summarizer and note its long-format output requires pivot or scanpy rank_genes_groups for direct use.
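The groupby-indices optimization described above can be sketched like this (toy data; names are illustrative, not the actual hvantk code): instead of one `index.get_loc` lookup per row, a single `groupby(...).indices` pass yields positional index arrays for every group.

```python
import pandas as pd

# Toy stand-in for adata.obs with a grouping column.
obs = pd.DataFrame({"cell_type": ["T", "B", "T", "B", "T"]})

# One pass over n_cells: dict mapping group label -> positional row indices.
groups = obs.groupby("cell_type").indices

t_positions = groups["T"]  # row positions of all "T" cells
b_positions = groups["B"]  # row positions of all "B" cells
```

Each index array can then slice the expression matrix directly, so splitting cells into groups is linear in the number of cells rather than quadratic.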
…hail_context

- converters.anndata_to_hail_mt previously built the long-format DataFrame with a Python double loop over n_obs * n_var entries (200M dict allocs for a modest 10k x 20k matrix). Replace with np.repeat / np.tile / X.ravel — single pass, no Python loop.
- The np.bool monkey-patch (Hail 0.2.x compat with NumPy >= 1.24) lived as a side-effect at the top of converters.py. Move it into hail_context.py so it runs before any Hail import regardless of which module pulls Hail in first. converters.py now imports hl via hail_context to guarantee the shim is applied.
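The vectorized long-format construction described above can be sketched with a minimal stand-in (not the actual converter): `np.repeat` expands the obs names, `np.tile` cycles the var names, and `X.ravel()` supplies values in matching row-major order, so no per-entry Python loop runs.

```python
import numpy as np
import pandas as pd

obs_names = np.array(["s1", "s2"])           # n_obs samples
var_names = np.array(["g1", "g2", "g3"])     # n_var genes
X = np.arange(6, dtype=np.float32).reshape(2, 3)

long_df = pd.DataFrame({
    "sample": np.repeat(obs_names, len(var_names)),  # s1 s1 s1 s2 s2 s2
    "gene": np.tile(var_names, len(obs_names)),      # g1 g2 g3 g1 g2 g3
    "value": X.ravel(),                              # row-major entries
})
```

All three columns are built by single vectorized calls, which is what turns the 200M-allocation double loop into one pass over the matrix.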
test_dataset_download_returns_intermediate_tsv mocked urllib.request.urlretrieve, but the production code calls requests.get. The mock was a no-op, so the test attempted a real network download and hung pytest -q indefinitely. Mock requests.get with a context-manager fake response that streams the bundled fixture zip.
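The fix described above, a context-manager fake response standing in for `requests.get(..., stream=True)`, can be sketched as follows (class and URL names are illustrative; the real test patches the module under test):

```python
import io
from unittest import mock

class FakeResponse:
    """Minimal stand-in for a streaming requests.Response."""
    def __init__(self, payload: bytes):
        self.raw = io.BytesIO(payload)   # .raw works with shutil.copyfileobj
        self.status_code = 200
    def raise_for_status(self):
        pass                             # always "OK" in the fake
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

# In a real test this mock would be patched in for the production module's
# `requests.get`; a MagicMock shows the call shape without network access.
fake_get = mock.MagicMock(return_value=FakeResponse(b"fixture-zip-bytes"))

with fake_get("https://example.invalid/data.zip", stream=True, timeout=600) as resp:
    resp.raise_for_status()
    data = resp.raw.read()
```

Because the fake implements `__enter__`/`__exit__`, `raise_for_status`, and `.raw`, the production code path runs unchanged against the bundled fixture bytes instead of hitting the network.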
- matrix_builders.py, cptac.py, expression_atlas.py: remove inner `import anndata as ad` repeats inside builder functions; module-level import already in scope.
- test_matrix_utils_anndata.py: replace the importlib.util workaround (which sidestepped hvantk.utils.__init__ over a non-existent Python 3.10 syntax issue) with a plain import. Project requires Python >= 3.10 where PEP 604 unions are valid.
Implements the `hvantk ptm constraint` CLI and `hvantk.ptm.constraint` module from the 2026-04-13 implementation plan. Compares gnomAD allele-frequency distributions between PTM-proximal and non-PTM variants, stratified by tissue, cell type, or any categorical metadata field.

Modules added:
- hvantk/ptm/constraint.py — orchestrator, PTMConstraintConfig/Result, five statistical tests (per-group ranking, τ quartile, τ × LOEUF factorial, PTM category × group heatmap, within-gene Wilcoxon)
- hvantk/ptm/constraint_expression.py — unified backend adapter for Hail MatrixTable / AnnData / tabular expression sources
- hvantk/ptm/constraint_plots.py — four-panel figure renderer
- hvantk/ptm/constraint_report.py — self-contained HTML report
- hvantk/utils/tissue_specificity.py — thin tspex wrapper for τ / TSI / Gini / Shannon metrics

Wiring:
- hvantk/commands/ptm_cli.py — `ptm constraint` subcommand
- hvantk/ptm/__init__.py — lazy exports for the new API
- pyproject.toml — `constraint` extra (tspex, matplotlib, seaborn)

Docs:
- docs_site/tools/ptm-constraint.md
- docs_site/guide/usage.md (10-line recipe)
- docs_site/architecture.md (module index)

Tests: 6 smoke tests in hvantk/tests/test_ptm_constraint.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ield

Code-review follow-ups to the initial constraint implementation:
- _anndata_value_column unconditionally returned "mean", so --expression-metric median and median_nonzero were silent no-ops on the AnnData backend. Replaced the value-column helper with a direct numpy aggregation (_aggregate_anndata_direct) that runs per-group np.median / np.nanmedian (for nonzero) over adata.X when the user requests a median metric.
- _load_variants silently skipped label filtering when the configured label_field was absent from the variants HT, leading to results that claim to be filtered to TN/TP but actually include every label. Now raises ValueError with a clear remediation hint unless the user explicitly passes --label-filter all.
- pyproject.toml + requirements.txt: added tspex so CI (which installs from requirements.txt) can import the wrapper and the constraint smoke tests do not fail with ImportError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
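A direct numpy aggregation of the kind described above might look like this sketch (the function name and dense-matrix assumption are illustrative; the real helper operates per group over `adata.X`):

```python
import numpy as np

def aggregate_by_group(X, groups, metric="median"):
    """Per-group, per-gene aggregation over a dense cells x genes matrix."""
    out = {}
    for g in np.unique(groups):
        sub = X[np.asarray(groups) == g]
        if metric == "median":
            out[g] = np.median(sub, axis=0)
        elif metric == "median_nonzero":
            # Mask zeros as NaN so nanmedian ignores them per gene.
            out[g] = np.nanmedian(np.where(sub > 0, sub, np.nan), axis=0)
        else:
            out[g] = sub.mean(axis=0)
    return out

X = np.array([[0.0, 2.0], [4.0, 6.0], [5.0, 1.0]])
groups = np.array(["a", "a", "b"])
med = aggregate_by_group(X, groups, "median")
mednz = aggregate_by_group(X, groups, "median_nonzero")
```

Routing the metric through the aggregation itself, rather than a value-column name, is what makes `median` and `median_nonzero` actually take effect instead of silently falling back to the mean.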
The Conda CI workflow has been failing since the AnnData migration landed on dev because environment.yml never gained anndata/scanpy. The constraint work also needs tspex; add it alongside so `python-package-conda.yml` can import every test module. Added: anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex.
feat(ptm): stratified AF-depletion analysis (`hvantk ptm constraint`)
📝 Walkthrough

Large-scale architectural migration from Hail MatrixTable to AnnData for expression matrix handling across multiple data sources, introduction of a new PTM constraint analysis pipeline with CLI integration and comprehensive statistical testing, removal of AlphaGenome streamer documentation, and addition of utility modules for AnnData operations and tissue specificity computation.

Estimated code review effort: 🎯 4 (Complex) · ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed · ❌ 1 failed (1 warning)
⛔ Not up to standards

🔴 Issues
| Category | Results |
|---|---|
| UnusedCode | 2 medium 1 minor |
| Documentation | 58 minor |
| ErrorProne | 6 high |
| Security | 17 high |
| Complexity | 16 medium |
| 🟢 Metrics | Results |
|---|---|
| Complexity | 307 |
| Duplication | 4 |
Pull request overview
This PR adds a new PTM constraint analysis workflow (hvantk ptm constraint) and refactors expression-matrix handling to support AnnData (.h5ad) as a first-class backend (alongside Hail MT and pre-aggregated tabular inputs), with updated builders/CLI/tests/docs.
Changes:
- Introduces PTM constraint analysis modules + CLI command, producing TSV/plots/HTML report across multiple expression backends.
- Refactors `hvantk mkmatrix` and expression builders to produce AnnData objects instead of Hail MatrixTables.
- Adds AnnData/scanpy/tspex-related utilities, converters, tests, and documentation updates.
Reviewed changes
Copilot reviewed 52 out of 55 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements.txt | Adds tspex to pip requirements. |
| pyproject.toml | Adds AnnData/Scanpy deps and new extras for constraint/PTM-related features. |
| environment.yml | Adds conda deps required for AnnData + constraint workflows. |
| hvantk/visualization/expression/hail.py | Removes Hail-backed expression distribution plotting module. |
| hvantk/visualization/expression/anndata.py | Adds AnnData-backed expression distribution plotting. |
| hvantk/visualization/expression/__init__.py | Updates expression visualization namespace docs to reference AnnData backend. |
| hvantk/visualization/__init__.py | Routes visualize_expression_distribution facade to AnnData implementation. |
| hvantk/utils/wilcoxon_hail.py | Removes Hail adapter for Wilcoxon marker detection. |
| hvantk/utils/tissue_specificity.py | Adds tspex wrapper for tissue-specificity metrics (τ, etc.). |
| hvantk/utils/gene_sets.py | Updates marker gene set extraction docstring to clarify expected summary format(s). |
| hvantk/tests/test_wilcoxon_hail.py | Removes Hail Wilcoxon integration tests. |
| hvantk/tests/test_summarize_expression.py | Removes Hail-based expression summarization tests. |
| hvantk/tests/test_matrix_utils.py | Removes Hail MatrixTable-based matrix_utils tests (incl. plotting). |
| hvantk/tests/test_expression_tables_hail.py | Removes Hail integration tests for expression table builders. |
| hvantk/tests/test_cptac_phospho.py | Removes CPTAC phospho dataset tests tied to old pipeline paths. |
| hvantk/tests/test_cptac.py | Removes Hail-based CPTAC conversion tests. |
| hvantk/tests/test_cli_mkmatrix.py | Removes MT-based mkmatrix CLI tests. |
| hvantk/tests/test_matrix_utils_anndata.py | Adds AnnData-based matrix_utils tests. |
| hvantk/tests/test_expression_builders_anndata.py | Adds AnnData builder + mkmatrix CLI integration tests. |
| hvantk/tests/test_anndata_utils.py | Adds tests for AnnData provenance, summaries, and I/O helpers. |
| hvantk/tests/test_converters.py | Adds tests for Hail MT ↔ AnnData converters. |
| hvantk/tests/test_ptm_constraint.py | Adds smoke tests for PTM constraint pipeline components + CLI help. |
| hvantk/tests/test_ptm.py | Updates PTM mapping test to read BGZF output. |
| hvantk/tests/test_peptideatlas_phospho.py | Updates download mocking to match new requests.get implementation. |
| hvantk/tables/ucsc.py | Refactors UCSC loader from Hail MT to pandas/AnnData ingestion. |
| hvantk/tables/matrix_builders.py | Refactors builders to produce AnnData and write .h5ad. |
| hvantk/tables/expression_atlas.py | Refactors Expression Atlas ingestion from Hail to pandas/AnnData. |
| hvantk/tables/cptac.py | Refactors CPTAC ingestion from Hail to AnnData (long + phospho wide). |
| hvantk/tables/registry.py | Updates registry to point matrix adapters to new AnnData builders. |
| hvantk/tables/table_builders.py | Adjusts a Hail import call to set min_partitions=16. |
| hvantk/core/anndata_utils.py | Adds AnnData helpers for provenance metadata, column summaries, save/load. |
| hvantk/core/converters.py | Adds Hail MatrixTable ↔ AnnData converters. |
| hvantk/core/hail_context.py | Adds NumPy compatibility shim (np.bool) before importing Hail. |
| hvantk/ptm/pipeline.py | Adds stdlib-only BGZF writing + switches downloads to requests. |
| hvantk/ptm/mapper.py | Adds CDS mapping cache/binary-search mapping and optimizes GTF parsing. |
| hvantk/ptm/annotate.py | Extends PTM annotation to optionally collect per-site evidence structs. |
| hvantk/ptm/__init__.py | Converts PTM package exports to lazy imports to avoid eager Hail import. |
| hvantk/ptm/constraint_expression.py | Adds expression-source adapter producing gene×group matrices from multiple backends. |
| hvantk/ptm/constraint_plots.py | Adds four-panel plotting for constraint analysis outputs. |
| hvantk/commands/make_matrix_cli.py | Refactors hvantk mkmatrix to produce .h5ad AnnData outputs. |
| hvantk/commands/ptm_cli.py | Adds hvantk ptm constraint CLI command + options. |
| hvantk/datasets/peptideatlas_phospho_datasets.py | Switches download implementation from urlretrieve to requests. |
| hvantk/datasets/cptac_phospho_datasets.py | Adds source fallback logic for CPTAC phospho/clinical retrieval. |
| hvantk/data/file_utils.py | Adds BGZF writer implementation (used for bgz output elsewhere). |
| docs_site/tools/ptm-constraint.md | Adds a user guide and flag reference for PTM constraint command. |
| docs_site/guide/usage.md | Documents how to run PTM constraint analysis in the general usage guide. |
| docs_site/architecture.md | Registers/describe new PTM constraint modules in architecture docs. |
| docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md | Removes obsolete AlphaGenome streamer design spec. |
| .gitignore | Adds ignores for notebook checkpoints, VS Code settings, and .claude/. |
Code context:

```python
    index_col=metadata_index_col,
)
logger.info("Loading UCSC metadata from %s", metadata_path)
metadata_df = load_ucsc_metadata(metadata_path)
```

The CLI passes `delimiter`, but metadata loading ignores it (`load_ucsc_metadata(metadata_path)` uses its default tab separator). This makes `--delimiter` inconsistent and will break non-tab metadata files. Pass `sep=delimiter` (and potentially expose `index_col` again if needed) when calling `load_ucsc_metadata`.

Suggested change:

```diff
-metadata_df = load_ucsc_metadata(metadata_path)
+metadata_df = load_ucsc_metadata(metadata_path, sep=delimiter)
```
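A loader that honors the caller-supplied delimiter, as the comment suggests, could look like this sketch (the function name and parameters are hypothetical; `load_ucsc_metadata`'s real signature may differ):

```python
import io
import pandas as pd

def load_metadata(path_or_buffer, sep="\t", index_col=0):
    """Read a metadata table, honoring the caller-supplied delimiter."""
    return pd.read_csv(path_or_buffer, sep=sep, index_col=index_col)

# Tab-separated by default; a comma-separated file needs sep=",".
tsv = io.StringIO("sample\ttissue\ns1\tliver\n")
csv = io.StringIO("sample,tissue\ns1,liver\n")
meta_tsv = load_metadata(tsv)
meta_csv = load_metadata(csv, sep=",")
```

Threading `sep` through from the CLI option keeps `--delimiter` meaningful for both the expression matrix and the metadata file.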
Code context:

```python
logger.info("Reading CPTAC expression from %s", expression_path)
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
meta_df = meta_df.set_index(sample_id_col)
```

`build_cptac_ad` unconditionally reads with `sep="\t"`, but the docstring says TSV/CSV are supported. Either detect the delimiter from the file extension (or sniff), or update the docstring/CLI to only support TSV to avoid surprising parse failures for CSV inputs.
Code context:

```python
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
```

`build_cptac_phospho_ad` also hard-codes `sep="\t"` while the docstring advertises TSV/CSV support. Please align behavior and docs (delimiter detection or explicit TSV-only contract).

Suggested change:

```diff
-expr_df = pd.read_csv(expression_path, sep="\t")
-logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
-meta_df = pd.read_csv(metadata_path, sep="\t")
+expr_df = pd.read_csv(expression_path, sep=None, engine="python")
+logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
+meta_df = pd.read_csv(metadata_path, sep=None, engine="python")
```
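The suggested `sep=None, engine="python"` relies on pandas delegating delimiter detection to `csv.Sniffer`; a quick self-contained illustration:

```python
import io
import pandas as pd

csv_text = "gene,s1,s2\nBRCA1,1.5,2.0\n"
tsv_text = "gene\ts1\ts2\nBRCA1\t1.5\t2.0\n"

# sep=None asks the python engine to sniff the delimiter per file.
df_csv = pd.read_csv(io.StringIO(csv_text), sep=None, engine="python")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep=None, engine="python")
```

Both frames parse to identical columns, so a single call site can accept TSV or CSV without extension checks, at the cost of the slower python parsing engine.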
Code context (diff as added):

```diff
 for line in f:
-    if line.startswith("#"):
+    if line[0] == "#":
         continue
-    fields = line.strip().split("\t")
+
+    # Quick scan for feature type before full split — skip the ~80%
+    # of lines that are gene/exon/UTR/start_codon/stop_codon/etc.
+    # GTF columns are tab-separated; feature type is in column 3.
+    tab1 = line.index("\t")
+    tab2 = line.index("\t", tab1 + 1)
+    tab3 = line.index("\t", tab2 + 1)
+    feature = line[tab2 + 1 : tab3]
```

`parse_ensembl_gtf` assumes every line is non-empty and tab-delimited. `line[0]` will raise on blank lines, and `line.index("\t")` will raise `ValueError` on malformed lines, aborting the whole parse. Add a quick guard (e.g., `if not line or line[0] == '#': continue`) and handle missing tabs safely (e.g., try/except `ValueError` or `if '\t' not in line: continue`).
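A guard of the kind the comment asks for might look like this sketch (an illustrative helper, not the project's `parse_ensembl_gtf`): blank lines, comments, and rows with too few tab-separated columns are skipped instead of raising.

```python
def iter_gtf_features(lines, wanted="CDS"):
    """Yield tab-split fields for lines of the wanted feature type,
    skipping comments, blank lines, and malformed rows."""
    for line in lines:
        if not line.strip() or line[0] == "#":
            continue                      # blank line or comment
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue                      # malformed: fewer than 9 GTF columns
        if fields[2] == wanted:           # feature type is column 3
            yield fields

sample = [
    "#!genome-build GRCh38\n",
    "\n",                                 # blank line survives the guard
    "not a gtf line\n",                   # malformed line is skipped
    "1\thavana\tCDS\t65565\t65573\t.\t+\t0\tgene_id \"ENSG1\";\n",
    "1\thavana\texon\t65419\t65433\t.\t+\t.\tgene_id \"ENSG1\";\n",
]
cds = list(iter_gtf_features(sample))
```

The full `split` here trades the fast `line.index` scan for robustness; a production version could keep the quick scan inside a `try`/`except ValueError`.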
Code context:

```python
    agg_expr = hl.median(hl.agg.collect(entry))
elif aggfunc == "median_nonzero":
    agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
```

For the Hail MT backend, computing the median via `hl.median(hl.agg.collect(entry))` (and the filtered variant) collects all per-group entry values into an array per row, which is not scalable for large MTs and can blow executor memory. Consider using Hail's approximate quantile aggregators (e.g., `hl.agg.approx_quantiles(entry, 0.5)[0]` / `hl.agg.approx_median` if available) or restricting the Hail backend to mean only.

Suggested change:

```diff
-    agg_expr = hl.median(hl.agg.collect(entry))
-elif aggfunc == "median_nonzero":
-    agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
+    agg_expr = hl.agg.approx_quantiles(entry, [0.5])[0]
+elif aggfunc == "median_nonzero":
+    agg_expr = hl.agg.filter(entry > 0, hl.agg.approx_quantiles(entry, [0.5]))[0]
```
Code context:

```python
import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import struct
import zlib

# ---------------------------------------------------------------------------
# Lightweight BGZF writer (stdlib-only, no Hail dependency).
#
# Mirrors hvantk.data.file_utils.BgzfWriter but lives here so the PTM
# pipeline's pure-Python stages never trigger hvantk.data.__init__, which
# eagerly imports Hail.
# ---------------------------------------------------------------------------

_BGZF_BLOCK_SIZE = 65280  # max uncompressed payload per BGZF block


def _make_bgzf_block(data: bytes) -> bytes:
    """Compress *data* into a single BGZF block."""
    comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    compressed = comp.compress(data) + comp.flush()
    bsize = 18 + len(compressed) + 8 - 1
    header = (
        b"\x1f\x8b\x08\x04"
        b"\x00\x00\x00\x00"
        b"\x00\xff"
        + struct.pack("<H", 6)
        + b"BC"
        + struct.pack("<H", 2)
        + struct.pack("<H", bsize)
    )
    crc = zlib.crc32(data) & 0xFFFFFFFF
    trailer = struct.pack("<I", crc) + struct.pack("<I", len(data) & 0xFFFFFFFF)
    return header + compressed + trailer


class BgzfWriter:
    """Buffered BGZF text writer for producing block-gzipped TSV files."""

    def __init__(self, path: str, encoding: str = "utf-8") -> None:
        self._fout = open(path, "wb")
        self._encoding = encoding
        self._buf = bytearray()

    def __enter__(self) -> "BgzfWriter":
        return self

    def __exit__(self, *exc) -> None:
        self.close()

    def write(self, text: str) -> None:
        self._buf.extend(text.encode(self._encoding))
        while len(self._buf) >= _BGZF_BLOCK_SIZE:
            chunk = bytes(self._buf[:_BGZF_BLOCK_SIZE])
            self._buf = self._buf[_BGZF_BLOCK_SIZE:]
            self._fout.write(_make_bgzf_block(chunk))

    def close(self) -> None:
        if self._fout.closed:
            return
        if self._buf:
            self._fout.write(_make_bgzf_block(bytes(self._buf)))
            self._buf.clear()
        self._fout.write(_make_bgzf_block(b""))
        self._fout.close()
```

`BgzfWriter` is now implemented both here and in `hvantk.data.file_utils`, which creates a maintenance risk (bugfixes can diverge). Since importing `hvantk.data.file_utils` triggers the heavy `hvantk.data.__init__`, consider moving the BGZF writer to a lightweight shared module (e.g. `hvantk.core.bgzf`) or making `hvantk.data.__init__` avoid eager heavy imports so both callers can reuse a single implementation.
Suggested change (replaces the inline `_make_bgzf_block` and `BgzfWriter` definitions with a loader that reuses the shared implementation):

```python
import importlib.util
from pathlib import Path

import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# ---------------------------------------------------------------------------
# Reuse the shared BGZF writer implementation without importing
# hvantk.data.__init__, which may eagerly import heavy dependencies.
# ---------------------------------------------------------------------------
def _load_shared_bgzf_writer():
    """Load hvantk.data.file_utils.BgzfWriter without package import side effects."""
    file_utils_path = Path(__file__).resolve().parents[1] / "data" / "file_utils.py"
    spec = importlib.util.spec_from_file_location("hvantk._bgzf_file_utils", file_utils_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Unable to load BGZF writer module from {file_utils_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    try:
        return module.BgzfWriter
    except AttributeError as exc:
        raise ImportError(
            f"Module loaded from {file_utils_path} does not define BgzfWriter"
        ) from exc


BgzfWriter = _load_shared_bgzf_writer()
```
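Either implementation can be sanity-checked with a small round trip: BGZF output is a sequence of standard gzip members, so the stdlib `gzip` module reads it back directly. This self-contained sketch repeats the block builder from the snippet above (same header layout, assumed faithful to the reviewed code):

```python
import gzip
import struct
import zlib

def make_bgzf_block(data: bytes) -> bytes:
    """One BGZF block: a gzip member with a 'BC' extra field carrying BSIZE."""
    comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    compressed = comp.compress(data) + comp.flush()
    bsize = 18 + len(compressed) + 8 - 1       # total block size minus one
    header = (
        b"\x1f\x8b\x08\x04"                    # gzip magic, deflate, FEXTRA
        + b"\x00\x00\x00\x00" + b"\x00\xff"    # mtime, XFL, OS
        + struct.pack("<H", 6)                 # XLEN: 6-byte extra field
        + b"BC" + struct.pack("<H", 2)         # subfield id + length
        + struct.pack("<H", bsize)             # BSIZE
    )
    trailer = struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF)
    trailer += struct.pack("<I", len(data) & 0xFFFFFFFF)
    return header + compressed + trailer

payload = b"gene\tpos\nBRCA1\t41196312\n"
# Data block followed by the empty EOF block, as BgzfWriter.close() emits.
stream = make_bgzf_block(payload) + make_bgzf_block(b"")
roundtrip = gzip.decompress(stream)
```

`gzip.decompress` concatenates all members in the stream, so the empty EOF block contributes nothing and the original payload comes back intact.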
Code context:

```python
# Collect row (var) and col (obs) annotations
row_df = mt.rows().to_pandas()
row_df = row_df.set_index(row_key)

col_df = mt.cols().to_pandas()
col_df = col_df.set_index(col_key)

# Collect entries and pivot to dense matrix
entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

# Reindex to match collected row/col order
pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

X = pivot.values.astype(np.float32)
```

`hail_mt_to_anndata` pulls `rows()`, `cols()`, and especially `entries()` into pandas and pivots to a dense matrix. This will only work for very small MatrixTables and can easily OOM the driver for real expression MTs. It would help to either document/guard this as a small-MT-only helper, or provide an export/chunked path (or a sparse AnnData build) for larger inputs.
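A sparse path of the kind suggested can be sketched by assembling a COO matrix straight from the long-format entry triples, never densifying (illustrative data; not the converter's actual API):

```python
import numpy as np
from scipy import sparse

# Long-format entries as they would come out of mt.entries():
# (sample, gene, value) triples.
samples = np.array(["s1", "s1", "s2"])
genes = np.array(["g1", "g2", "g2"])
values = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Factorize names into integer coordinates in one pass each.
obs_names, obs_idx = np.unique(samples, return_inverse=True)
var_names, var_idx = np.unique(genes, return_inverse=True)

X = sparse.coo_matrix(
    (values, (obs_idx, var_idx)),
    shape=(len(obs_names), len(var_names)),
).tocsr()  # CSR is directly usable as a sparse AnnData .X
```

Memory then scales with the number of stored entries rather than `n_obs * n_var`, which is what makes the converter viable for realistically sized expression matrices.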
Code context:

```python
X = adata.X
if sp.issparse(X):
    X = X.toarray()
values = np.asarray(X).ravel()
```

`adata.X` can be a large sparse matrix; calling `X.toarray()` materializes the full dense matrix and can OOM for realistic scRNA-seq datasets. For a histogram, consider using `X.data` (non-zeros) plus an implicit zero count, or sampling values, so sparse inputs stay memory-safe.
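The memory-safe histogram idea can be sketched like this (a minimal illustration on synthetic sparse data, not the plotting module's code): bin only the stored non-zeros and account for the implicit zeros as a separate count.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = sparse.random(100, 50, density=0.1, format="csr", random_state=rng)

# Histogram the stored non-zeros only; the implicit zeros are counted
# arithmetically instead of being materialized.
nonzero_counts, bin_edges = np.histogram(X.data, bins=20)
n_zeros = X.shape[0] * X.shape[1] - X.nnz
```

A plot can then show the non-zero distribution with the zero count reported alongside (or prepended to the first bin), without ever allocating the dense matrix.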
Actionable comments posted: 14
🧹 Nitpick comments (15)
hvantk/datasets/peptideatlas_phospho_datasets.py (1)
441-446: Consider moving the `requests` import to module level.

The `requests` import is placed inside the method. While this delays the import cost, the `requests` library is already a common dependency used elsewhere in the codebase (e.g., `hvantk/data/file_utils.py`). Moving it to the top of the file with other imports would be more consistent with the module's style.

♻️ Suggested change. At the top of the file (after line 27):

```python
import requests
```

Then in `download()`:

```diff
-        import requests
-
         with requests.get(self.zip_url, stream=True, timeout=600) as resp:
```
214-214: Consider handling non-numeric values gracefully.

The `.astype(np.float32)` conversion will fail if sample columns contain non-numeric values (e.g., "NA" strings not recognized by pandas). Consider using `pd.to_numeric` with `errors='coerce'` for robustness.

♻️ Suggested improvement:

```diff
-    X = df[sample_cols].values.T.astype(np.float32)
+    # Convert to numeric, coercing errors to NaN
+    X = df[sample_cols].apply(pd.to_numeric, errors='coerce').values.T.astype(np.float32)
```
212-213: Duplicate `.ipynb_checkpoints` entry.

Line 79 already ignores `.ipynb_checkpoints` (without trailing slash). This new entry with the trailing slash is redundant since both patterns achieve the same result.
462-466: Unused `_path` attribute.

The `_path` attribute is stored but never referenced. Consider removing it or using it in error messages/logging.

♻️ Suggested fix:

```diff
     def __init__(self, path: str, encoding: str = "utf-8") -> None:
-        self._path = path
         self._encoding = encoding
         self._fout = open(path, "wb")
         self._buf = bytearray()
```
183-186: Consider moving the `gzip` import to module level.

The `gzip` module is a standard library module that could be imported at the top of the file alongside other imports for consistency and clarity.

Suggested change. Add to the imports section at the top of the file:

```python
import gzip
```

Then simplify the test code:

```diff
     # Read back and verify (BGZF is gzip-compatible)
-    import gzip
     with gzip.open(output_bgz, "rt") as f:
```
35-38: Consider memory implications for large sparse matrices.

Converting a large sparse matrix to dense and raveling all values into memory could cause issues with very large datasets. For visualization purposes, consider sampling a subset of values when the matrix is large.
Optional: Add sampling for large matrices
```diff
+MAX_SAMPLES = 1_000_000  # Limit samples for memory efficiency
+
 X = adata.X
 if sp.issparse(X):
-    X = X.toarray()
-values = np.asarray(X).ravel()
+    # Sample from sparse data to avoid memory explosion
+    data = X.data
+    if len(data) > MAX_SAMPLES:
+        rng = np.random.default_rng(42)
+        data = rng.choice(data, MAX_SAMPLES, replace=False)
+    values = data
+else:
+    values = np.asarray(X).ravel()
+    if len(values) > MAX_SAMPLES:
+        rng = np.random.default_rng(42)
+        values = rng.choice(values, MAX_SAMPLES, replace=False)
```

🤖 Prompt for AI Agents
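The same idea as a self-contained helper; the cap and seed are arbitrary choices, and note that for the sparse path only the stored nonzeros are sampled (implicit zeros are dropped, as the suggestion above also accepts):

```python
import numpy as np
import scipy.sparse as sp

MAX_SAMPLES = 1_000_000  # arbitrary cap on values kept for plotting


def expression_values(X, max_samples=MAX_SAMPLES, seed=42):
    """Return a 1-D array of values for histogramming, sampled if large."""
    rng = np.random.default_rng(seed)
    if sp.issparse(X):
        values = X.data  # nonzero entries only; implicit zeros are skipped
    else:
        values = np.asarray(X).ravel()
    if values.size > max_samples:
        values = rng.choice(values, max_samples, replace=False)
    return values
```

This keeps peak memory bounded by `max_samples` regardless of matrix shape, at the cost of the distribution being approximate for very large inputs.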
Verify each finding against the current code and only fix it if needed. In `@hvantk/visualization/expression/anndata.py` around lines 35 - 38, the current code unconditionally densifies adata.X (X = X.toarray()) which can OOM for large sparse matrices; instead, when sp.issparse(X) detect large matrices by checking X.nnz or X.shape and apply sampling: use X.data (the nonzero values) and randomly sample up to a MAX_SAMPLES limit, and if you need zeros represented also randomly sample indices from the implicit zeros to approximate distribution; then build values from the sampled nonzero and sampled-zero values rather than calling toarray(); update the block around X, sp.issparse(X), and values to implement this conditional sampling strategy.

pyproject.toml (1)
31-32: Make `scanpy` an optional dependency.

`scanpy` is imported only once, within function scope in `summarize_expression_cli.py` (line 289), and is not used elsewhere in the codebase. This minimal and localized usage justifies making it optional to reduce the dependency footprint for users who don't need marker gene analysis.

Suggested change
```diff
 anndata = ">=0.10.0"
-scanpy = ">=1.10.0"
+scanpy = { version = ">=1.10.0", optional = true }

 [tool.poetry.extras]
 viz = ["matplotlib", "seaborn", "plotly"]
 duckdb = ["duckdb"]
 interactive = ["plotly"]
 hgc = ["matplotlib", "seaborn", "plotly", "gnomad"]
 psroc = ["matplotlib", "plotly"]
 ptm = ["cptac"]
 constraint = ["tspex", "matplotlib", "seaborn"]
+markers = ["scanpy"]
```

Then add a runtime check in `summarize_expression_cli.py`:

```python
try:
    import scanpy as sc
except ImportError as exc:
    raise ImportError(
        "scanpy is required for marker analysis. "
        "Install with 'pip install hvantk[markers]'"
    ) from exc
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 31 - 32, update pyproject.toml to make scanpy optional by removing it from the main dependencies and adding an extras group (e.g., extras = { markers = ["scanpy>=1.10.0"] }) so users can install hvantk[markers] when they need marker analysis; then in summarize_expression_cli.py, wrap the local import of scanpy (used in the marker-related function) in a try/except ImportError and raise a clear ImportError message pointing users to install the markers extra (e.g., "scanpy is required for marker analysis. Install with 'pip install hvantk[markers]'") so the runtime check fails fast with guidance.

hvantk/tests/test_ptm_constraint.py (1)
90-108: Use `tmp_path` fixture instead of hardcoded `/tmp/out`.

While this config is only validated (not executed), using a hardcoded `/tmp` path is flagged as a security concern and doesn't follow test isolation best practices.

📝 Proposed fix
```diff
-def test_constraint_config_validate():
+def test_constraint_config_validate(tmp_path):
     from hvantk.ptm.constraint import PTMConstraintConfig

     cfg = PTMConstraintConfig(
         variants_ht_path="/does/not/exist.ht",
         expression_source="bogus-source",
         expression_path="/also/missing.tsv",
         grouping="tissue",
-        output_dir="/tmp/out",
+        output_dir=str(tmp_path / "out"),
         label_filter="NOT_A_LABEL",
         min_cells_per_group=-5,
         min_variants_per_group=0,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/tests/test_ptm_constraint.py` around lines 90 - 108, the test test_constraint_config_validate uses a hardcoded output_dir "/tmp/out"; change the test to accept the pytest tmp_path fixture, pass output_dir=str(tmp_path) when constructing PTMConstraintConfig, and update the test function signature to def test_constraint_config_validate(tmp_path): so the config uses an isolated temporary directory rather than a global /tmp path while keeping all other validation assertions unchanged.

hvantk/tests/test_expression_builders_anndata.py (1)
12-24: Consider adding `strict=True` to `zip()` calls.

While the test data is controlled, using `strict=True` would catch accidental length mismatches during test maintenance.

📝 Proposed fix
```diff
 def _write_expression_matrix(self, path, genes, cells, values):
     """Write a genes-x-cells TSV expression matrix."""
     with open(path, "w") as fh:
         fh.write("gene\t" + "\t".join(cells) + "\n")
-        for gene, row in zip(genes, values):
+        for gene, row in zip(genes, values, strict=True):
             fh.write(gene + "\t" + "\t".join(str(v) for v in row) + "\n")

 def _write_metadata(self, path, cell_ids, cell_types):
     """Write a metadata TSV with cell_id index and cell_type column."""
     with open(path, "w") as fh:
         fh.write("cell_id\tcell_type\n")
-        for cid, ct in zip(cell_ids, cell_types):
+        for cid, ct in zip(cell_ids, cell_types, strict=True):
             fh.write(f"{cid}\t{ct}\n")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/tests/test_expression_builders_anndata.py` around lines 12 - 24, the zip() calls in _write_expression_matrix and _write_metadata can silently ignore length mismatches; update both to use zip(..., strict=True) so mismatched lengths raise an error during tests: change for gene,row in zip(genes, values) to zip(genes, values, strict=True) in _write_expression_matrix and for cid,ct in zip(cell_ids, cell_types) to zip(cell_ids, cell_types, strict=True) in _write_metadata.

docs_site/tools/ptm-constraint.md (1)
83-97: Add language specifier to fenced code block.

The output directory tree block should have a language identifier for consistency and to satisfy markdown linting rules.
📝 Proposed fix
````diff
-```
+```text
 <output-dir>/
 ├── summary.json   # top-line effect sizes + config echo
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs_site/tools/ptm-constraint.md` around lines 83 - 97, the fenced code block that starts with the directory tree beginning "<output-dir>/" is missing a language specifier; update the opening fence from ``` to ```text so the block becomes a ```text fenced code block (i.e., add "text" after the three backticks) to satisfy markdown linting and ensure consistent rendering of the directory tree.

hvantk/commands/summarize_expression_cli.py (1)
163-199: Consider using `ctx.exit(1)` for consistency.

Line 170 uses `raise SystemExit(1)` while other commands in this module use `ctx.exit(1)`. Both work, but `ctx.exit(1)` is more consistent with Click patterns used elsewhere in the codebase.

📝 Proposed fix
```diff
+@click.pass_context
 def summarize_expression_cmd(
+    ctx,
     matrix_path,
     ...
 ):
     ...
     if output_path.exists() and not overwrite:
         click.echo(
             f"Error: Output file already exists: {output_path}\n"
             "Use --overwrite to replace it.",
             err=True,
         )
-        raise SystemExit(1)
+        ctx.exit(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/summarize_expression_cli.py` around lines 163 - 199, replace the direct raise SystemExit(1) in the output existence check with a Click-style exit call for consistency: use ctx.exit(1) (or, if the command function does not receive a ctx parameter, call click.get_current_context().exit(1)). Update the code at the output_path existence branch (the block that currently does click.echo(...) then raise SystemExit(1)) to call the appropriate ctx.exit(1) instead.

hvantk/commands/make_matrix_cli.py (2)
116-135: Consider exposing a `--gene-name-column` option for completeness.

The `build_expression_atlas_ad` function accepts a `gene_name_column` parameter (default `"Gene Name"`) that isn't exposed in the CLI. While the default works for standard Expression Atlas files, users with non-standard column names would need to modify the code. This is optional since the default handles the common case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/make_matrix_cli.py` around lines 116 - 135, The CLI function mkmatrix_expression_atlas should expose a --gene-name-column option and forward it to the internal _build_expression_atlas_ad call; add a new parameter to mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that maps to the CLI flag --gene-name-column, and pass that value into _build_expression_atlas_ad via its gene_name_column argument so non-standard Expression Atlas files work without editing code.
222-240: Missing `--site-id-col` CLI option for CPTAC phospho builder.

The `build_cptac_phospho_ad` function accepts a `site_id_col` parameter (default `"SiteID"`) that isn't exposed in the CLI. This is inconsistent with the `cptac` command, which exposes all column configuration options. Users with phospho matrices using a different site identifier column name (e.g., `"Site"`, `"PhosphoSite"`) would be unable to process them via CLI.

Proposed fix to add site_id_col option
```diff
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
+@click.option("--site-id-col", default="SiteID", show_default=True, help="Column containing site identifiers")
 @click.option("-w", "--overwrite", is_flag=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_path,
     sample_id_col,
+    site_id_col,
     overwrite,
 ):
     """Build a CPTAC phospho AnnData object from site intensities and metadata."""
     logger.info("Building CPTAC phospho AnnData")
     adata = _build_cptac_phospho_ad(
         expression_path=expression,
         metadata_path=metadata,
         output_path=output_path,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         overwrite=overwrite,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/make_matrix_cli.py` around lines 222 - 240, add a new Click option "--site-id-col" (arg name site_id_col, default "SiteID", show_default=True) to the mkmatrix_cptac_phospho command and forward that site_id_col argument into the call to _build_cptac_phospho_ad so it passes site_id_col=site_id_col; this exposes the build_cptac_phospho_ad site_id_col parameter (default "SiteID") to the CLI so users can specify custom phospho site identifier column names.

hvantk/tables/matrix_builders.py (2)
10-21: Unused import and unsorted `__all__`.

- `List` is imported from `typing` but never used in this module.
- Static analysis flags that `__all__` is not sorted (RUF022).

Proposed fix
```diff
 import anndata as ad
 import logging
-from typing import Optional, List
+from typing import Optional

 logger = logging.getLogger(__name__)

 __all__ = [
+    "build_cptac_ad",
+    "build_cptac_phospho_ad",
+    "build_expression_atlas_ad",
     "build_ucsc_ad",
-    "build_expression_atlas_ad",
-    "build_cptac_ad",
-    "build_cptac_phospho_ad",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/tables/matrix_builders.py` around lines 10 - 21, Remove the unused typing import and sort the exported symbols: delete the unused "List" import from the typing import list and then reorder the __all__ list alphabetically (e.g., "build_cptac_ad", "build_cptac_phospho_ad", "build_expression_atlas_ad", "build_ucsc_ad") so static analysis warnings for the unused import (List) and unsorted __all__ are resolved; keep the existing function names (build_ucsc_ad, build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.
199-204: Hardcoded tab separator may fail for comma-delimited CSV files.

The docstring mentions "TSV/CSV" support, but `pd.read_csv` uses a hardcoded `sep="\t"`. Users with comma-separated files would encounter parsing errors.

Consider either:

- Adding a `delimiter` parameter (consistent with UCSC/Expression Atlas builders)
- Updating the docstring to clarify only TSV is supported
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hvantk/tables/matrix_builders.py` around lines 199 - 204, The code hardcodes a tab separator when reading CPTAC files (logger.info + pd.read_csv for expression_path and metadata_path, producing expr_df and meta_df), which breaks comma-delimited CSV input; add a new parameter (e.g., delimiter or sep) to the enclosing function and pass it into both pd.read_csv calls (use the default "\t" to preserve current behavior) or alternatively update the docstring to state only TSV is supported; ensure meta_df = meta_df.set_index(sample_id_col) still runs after reading with the chosen delimiter.
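A sketch of the first option; the function name and defaults here are illustrative, not the actual builder signature:

```python
import pandas as pd


def load_cptac_tables(expression_path, metadata_path,
                      sample_id_col="SampleID", sep="\t"):
    """Read expression and metadata tables with a configurable delimiter.

    sep="\t" preserves the current TSV behavior; pass sep="," for CSV,
    or sep=None to let pandas sniff the delimiter (python engine).
    """
    kwargs = {"sep": sep}
    if sep is None:
        kwargs["engine"] = "python"  # required for delimiter sniffing
    expr_df = pd.read_csv(expression_path, **kwargs)
    meta_df = pd.read_csv(metadata_path, **kwargs).set_index(sample_id_col)
    return expr_df, meta_df
```

Defaulting `sep` to `"\t"` keeps existing callers working while making the "TSV/CSV" claim in the docstring true.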
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/core/anndata_utils.py`:
- Around line 91-98: The numeric-aggregation code uses
np.nanmin/np.nanmax/np.nanmean on series.values which can be an ExtensionArray
for pandas nullable dtypes; convert the Series to a standard float NumPy array
or use pandas' native aggregations to avoid TypeError. Update the logic around
the variable 'series' (the Series from adata.obs[col]) so that before reducing
you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").
In `@hvantk/core/converters.py`:
- Around line 69-82: hail_mt_to_anndata and anndata_to_hail_mt currently
materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
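The guard itself is backend-agnostic; a sketch, where the default threshold and the environment variable name are assumptions, not existing codebase symbols:

```python
import os

# Hypothetical names: neither the constant nor the env var exists in hvantk yet.
DEFAULT_MAX_MATRIX_SIZE = 50_000_000


def check_matrix_size(n_obs, n_vars, max_size=None):
    """Raise ValueError before materializing an n_obs x n_vars dense matrix."""
    if max_size is None:
        max_size = int(os.environ.get("HVANTK_MAX_MATRIX_SIZE",
                                      DEFAULT_MAX_MATRIX_SIZE))
    n_entries = n_obs * n_vars
    if n_entries > max_size:
        raise ValueError(
            f"Refusing to densify a {n_obs} x {n_vars} matrix "
            f"({n_entries} entries > limit {max_size}); convert blockwise "
            "or raise HVANTK_MAX_MATRIX_SIZE."
        )
```

Calling this at the top of both converters (with counts from `mt.count()` or `adata.n_obs`/`adata.n_vars`) turns an eventual OOM kill into an immediate, explainable error.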
- Around line 55-59: The resolver currently takes the first element of
mt.row_key/mt.col_key without validating cardinality; update the logic around
row_key and col_key so you explicitly validate mt.row_key and mt.col_key: if
len(mt.row_key) == 0 or len(mt.col_key) == 0 raise a clear ValueError indicating
the MatrixTable is unkeyed, and if len(...) > 1 raise a ValueError indicating a
composite key and require the caller to supply an explicit row_key/col_key; only
when len(...) == 1 set row_key = list(mt.row_key)[0] (and similarly for
col_key). Ensure the error messages reference row_key/col_key and MatrixTable so
callers can fix input.
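The cardinality check reduces to a small helper over the key-name list, sketched here without a Hail dependency (the helper name is illustrative):

```python
def resolve_single_key(keys, kind):
    """Return the sole key field name, or fail with an actionable message.

    `keys` is the sequence of key field names (e.g., from mt.row_key or
    mt.col_key); `kind` is "row_key" or "col_key", used only in messages.
    """
    keys = list(keys)
    if len(keys) == 0:
        raise ValueError(
            f"MatrixTable has no {kind}; key the table before converting."
        )
    if len(keys) > 1:
        raise ValueError(
            f"MatrixTable has a composite {kind} {keys}; pass an explicit "
            f"{kind} to the converter."
        )
    return keys[0]
```

With this in place, `hail_mt_to_anndata` would call `resolve_single_key(mt.row_key, "row_key")` instead of blindly indexing the first element.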
In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 304-312: The per-call fallback in _get_phospho causes inconsistent
sources and overly-broad exception swallowing; instead, probe and pick a single
working source once (e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.
In `@hvantk/ptm/constraint_expression.py`:
- Around line 138-141: The loader currently accepts .pkl/.pickle and calls
pd.read_pickle (variable ext with branches creating df), which is unsafe; remove
the pickle branch entirely and ensure the CLI/validation for --expression-path
rejects .pkl/.pickle (or only allow .parquet/.pq), so the code only calls
pd.read_parquet for supported extensions; update any help/validation code that
lists accepted extensions to exclude pickle and add an explicit unsafe flag if
you want to reintroduce pickle behind opt-in.
- Around line 218-229: The code fails when adata.obs[grouping] contains NaN
because astype(str) yields mixed types that break sorted(set(labels)); fix by
filtering out missing group values before converting to strings and before
computing unique_groups: create a boolean index like valid =
adata.obs[grouping].notna(), then derive labels from adata.obs.loc[valid,
grouping].astype(str).to_numpy() and compute unique_groups from that filtered
labels; ensure subsequent masks and counts (mask, n) operate on the same
filtered index (or use the filtered AnnData slice) so grouping, X, and any
downstream arrays stay aligned with labels and min_cells_per_group checks.
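With plain pandas/numpy the fix looks like this, using a DataFrame-plus-matrix stand-in for the AnnData `obs`/`X` pair:

```python
import numpy as np
import pandas as pd


def group_means(obs, X, grouping, min_cells_per_group=1):
    """Mean expression per group, skipping rows whose group label is missing."""
    valid = obs[grouping].notna().to_numpy()
    labels = obs.loc[valid, grouping].astype(str).to_numpy()
    Xv = X[valid]  # keep matrix rows aligned with the filtered labels
    out = {}
    for group in sorted(set(labels)):
        mask = labels == group
        if mask.sum() >= min_cells_per_group:
            out[group] = Xv[mask].mean(axis=0)
    return out
```

Filtering with `notna()` before `astype(str)` is the key step: it prevents `NaN` from becoming the literal string `"nan"` (or a float that breaks `sorted(set(...))` on mixed types), and applying the same boolean index to both `obs` and `X` keeps everything aligned.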
In `@hvantk/ptm/constraint.py`:
- Around line 147-149: The code always creates/overwrites outputs (os.makedirs
and downstream TSV/summary/plots) without checking config.overwrite; update
run_ptm_constraint() to honor config.overwrite by adding an existence guard at
the start: if config.overwrite is False and config.output_dir already contains
the expected output files (TSV files, summary.json, or plots), abort/raise an
error or return early; conversely, if config.overwrite is True, allow creating
or delete existing outputs before writing. Ensure all places that delete or
re-create config.output_dir/plots only run when config.overwrite is True and
reference run_ptm_constraint(), config.output_dir, config.overwrite, and
summary.json/TSV/plots in your checks.
- Around line 461-465: The qcut currently runs on variant rows so genes with
many variants skew the cut points; instead compute quartile bins using one row
per gene by selecting unique gene entries (e.g.,
df.drop_duplicates(subset="gene_key") or groupby('gene_key') and take a
representative 'tau'/'loeuf' value), run pd.qcut on that deduplicated series
with the same labels, then map/join the resulting per-gene bin assignments back
onto the full variant-level df to set "tau_quartile" (and similarly for the
LOEUF qcut block referenced later) so bin boundaries are derived from unique
genes, not variant rows.
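A sketch of the deduplicate-then-map approach with pandas (column names follow the prompt; the function wrapper is illustrative):

```python
import pandas as pd


def per_gene_quartiles(df, gene_col="gene_key", value_col="tau",
                       out_col="tau_quartile"):
    """Assign quartile bins from one value per gene, then broadcast to variants."""
    # One representative value per gene (first occurrence), so bin edges
    # come from the gene-level distribution, not the variant-level one.
    per_gene = df.drop_duplicates(subset=gene_col).set_index(gene_col)[value_col]
    bins = pd.qcut(per_gene, 4, labels=["Q1", "Q2", "Q3", "Q4"])
    df = df.copy()
    df[out_col] = df[gene_col].map(bins)  # broadcast gene bin to every variant
    return df
```

Without the dedup step, a gene contributing thousands of variants drags the `qcut` boundaries toward its own value; with it, every gene counts once.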
- Around line 269-275: The code currently replaces a missing AF field with zeros
and later uses fillna(0.0), which falsely signals homozygous reference; instead,
when config.af_field is not present, raise an explicit error (e.g., throw/raise
ValueError or RuntimeError) so the pipeline fails fast and the schema issue is
fixed upstream, and remove any row-level fillna(0.0) calls so unparsable/null
AFs remain null (allowing explicit dropping or handling downstream); update the
branch that checks config.af_field (the select["af"] assignment) to raise and
remove/replace any hl.Table/MatrixTable .fillna(0.0) usage on AF to preserve
nulls.
In `@hvantk/ptm/pipeline.py`:
- Around line 209-213: The download currently streams directly into gtf_path
using requests.get and shutil.copyfileobj which can leave a truncated cache file
on failure; change the logic in the block that calls
requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file (e.g.,
using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw into
that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
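The temp-file-plus-`os.replace` pattern, with the HTTP fetch abstracted into a chunk iterable so the sketch stays self-contained:

```python
import os
import tempfile


def atomic_write(dest_path, chunks):
    """Stream `chunks` (an iterable of bytes) into dest_path atomically.

    The data lands in a temp file in the same directory (so os.replace
    stays on one filesystem) and only replaces dest_path on success; a
    failure mid-stream leaves no truncated cache file behind.
    """
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fout:
            for chunk in chunks:
                fout.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the pipeline this would wrap the streamed response, e.g. `atomic_write(gtf_path, resp.iter_content(1 << 20))` after `resp.raise_for_status()` has succeeded.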
- Around line 490-498: After concatenating CPTAC into all_combined_path and
repointing mapped_path, also set result.mapped_tsv_path = all_combined_path so
the returned result reflects the post-merge file; update this in the branch that
calls _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the
same place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.
In `@hvantk/tables/cptac.py`:
- Around line 89-92: In cptac.py, when joining metadata_df into obs (created
from wide.index) and the later join at lines ~149-153, first compute the overlap
= obs.index.intersection(metadata_df.index) and validate it; if overlap is
empty, log a warning (or raise a ValueError) and avoid doing a blind left join
that yields all-NaN rows; if partial overlap, log how many samples matched and
then perform obs = obs.join(metadata_df, how="left") restricted to the
overlapping index (or ensure rows with no metadata are handled explicitly). Use
the same validation/logging pattern used in hvantk/tables/ucsc.py (checking
adata.obs.index.intersection(metadata_df.index) and warning on empty overlap)
and apply it to the variables obs, metadata_df, and wide.index in both join
locations.
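The overlap check is a few lines of pandas; this sketch raises on zero overlap and warns on partial overlap (whether to raise or only warn is the reviewer's stated choice):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def join_metadata(obs, metadata_df):
    """Left-join metadata onto obs, validating index overlap first."""
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        raise ValueError(
            "No sample IDs overlap between the expression matrix and metadata; "
            "check the sample ID column configuration."
        )
    if len(overlap) < len(obs):
        logger.warning("Metadata matched %d of %d samples", len(overlap), len(obs))
    return obs.join(metadata_df, how="left")
```

This is the same pattern the review points to in `hvantk/tables/ucsc.py`: validate `index.intersection(...)` before the join so an all-NaN result cannot slip through silently.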
In `@hvantk/utils/matrix_utils.py`:
- Around line 127-139: The current label construction joins stringified fields
with "_" which corrupts tuples with underscores and turns missing values into
the literal "nan"; change the logic in the block that builds labels (the
group_by handling and label_series creation) to build canonical group keys as
tuples or a pandas.MultiIndex from the original adata.obs columns (e.g., using
tuple(zip(...)) or pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop
rows with any missing grouping fields before downstream aggregation, and ensure
label_series uses adata.obs.index (not np.arange(adata.n_obs)); update any
callers (e.g., the AnnData mean path referenced in constraint_expression.py) to
consume tuple/MultiIndex keys or a safely encoded representation.
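A sketch of tuple-keyed grouping that survives underscores in values and missing fields (the helper name is illustrative):

```python
import pandas as pd


def group_labels(obs, group_by):
    """Return a Series of tuple keys indexed like obs, dropping rows with
    any missing grouping field (no 'nan' strings, no '_'-join collisions)."""
    sub = obs[group_by].dropna()
    keys = pd.MultiIndex.from_frame(sub)
    return pd.Series(list(keys), index=sub.index)
```

Tuples are hashable, so they work directly as group keys downstream, and two distinct groups like `("liver_left", "x")` and `("liver", "left_x")` stay distinct where a `"_".join` would have collapsed both to `"liver_left_x"`.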
In `@requirements.txt`:
- Around line 14-15: requirements.txt is missing runtime deps required by the
new modules (hvantk/core/converters.py, hvantk/tables/cptac.py,
hvantk/tables/ucsc.py) which import anndata; update requirements.txt to include
anndata plus the other packages declared in environment.yml (scanpy>=1.10.0,
pysam>=0.22, psutil, scikit-learn) and any version constraints from
environment.yml so a plain pip install -r requirements.txt will succeed and
match the environment.yml runtime dependencies.
---
Nitpick comments:
In @.gitignore:
- Around line 212-213: Remove the duplicate .ipynb_checkpoints entry from the
ignore list: keep only one pattern for ".ipynb_checkpoints" (either with or
without the trailing slash) and delete the redundant line that adds
".ipynb_checkpoints/" so the .gitignore contains a single canonical entry for
that pattern.
In `@docs_site/tools/ptm-constraint.md`:
- Around line 83-97: The fenced code block that starts with the directory tree
beginning "<output-dir>/" is missing a language specifier; update the opening
fence from ``` to ```text so the block becomes a ```text fenced code block
(i.e., add "text" after the three backticks) to satisfy markdown linting and
ensure consistent rendering of the directory tree.
In `@hvantk/commands/make_matrix_cli.py`:
- Around line 116-135: The CLI function mkmatrix_expression_atlas should expose
a --gene-name-column option and forward it to the internal
_build_expression_atlas_ad call; add a new parameter to
mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that
maps to the CLI flag --gene-name-column, and pass that value into
_build_expression_atlas_ad via its gene_name_column argument so non-standard
Expression Atlas files work without editing code.
- Around line 222-240: Add a new Click option "--site-id-col" (arg name
site_id_col, default "SiteID", show_default=True) to the mkmatrix_cptac_phospho
command and forward that site_id_col argument into the call to
_build_cptac_phospho_ad so it passes site_id_col=site_id_col; this exposes the
build_cptac_phospho_ad site_id_col parameter (default "SiteID") to the CLI so
users can specify custom phospho site identifier column names.
In `@hvantk/commands/summarize_expression_cli.py`:
- Around line 163-199: Replace the direct raise SystemExit(1) in the output
existence check with a Click-style exit call for consistency: use ctx.exit(1)
(or, if the command function does not receive a ctx parameter, call
click.get_current_context().exit(1)). Update the code at the output_path
existence branch (the block that currently does click.echo(...) then raise
SystemExit(1)) to call the appropriate ctx.exit(1) instead.
In `@hvantk/data/file_utils.py`:
- Around line 462-466: The __init__ stores self._path but never uses it; either
remove the unused attribute or keep and use it for context (e.g., include
self._path in error messages/logging and in exception handling around opening
the file). Update the initializer in the class's __init__ (and any related error
handling elsewhere in this class) to either drop the assignment to self._path or
reference self._path in log/raise messages when calling open(path, "wb") and
when reporting I/O errors so the file path is not lost.
In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 441-446: Move the inline import of requests out of the download
method and add it to the module-level imports so the library is imported once
for the module; update the top of the file to include "import requests" and
remove the "import requests" line inside the download() block (the code using
requests remains the same: requests.get(self.zip_url, stream=True, timeout=600)
... resp.raise_for_status() ... shutil.copyfileobj(resp.raw, fout)).
In `@hvantk/tables/expression_atlas.py`:
- Line 214: The current assignment X =
df[sample_cols].values.T.astype(np.float32) will raise on non-numeric strings;
change to coerce non-numeric values to NaN first (e.g. replace the
df[sample_cols] selection with df[sample_cols].apply(lambda c: pd.to_numeric(c,
errors='coerce')) or df[sample_cols].apply(pd.to_numeric, errors='coerce')) and
then convert to float32 and transpose to produce X; adjust downstream code to
handle resulting NaNs (e.g., drop or impute) as needed.
In `@hvantk/tables/matrix_builders.py`:
- Around line 10-21: Remove the unused typing import and sort the exported
symbols: delete the unused "List" import from the typing import list and then
reorder the __all__ list alphabetically (e.g., "build_cptac_ad",
"build_cptac_phospho_ad", "build_expression_atlas_ad", "build_ucsc_ad") so
static analysis warnings for the unused import (List) and unsorted __all__ are
resolved; keep the existing function names (build_ucsc_ad,
build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.
- Around line 199-204: The code hardcodes a tab separator when reading CPTAC
files (logger.info + pd.read_csv for expression_path and metadata_path,
producing expr_df and meta_df), which breaks comma-delimited CSV input; add a
new parameter (e.g., delimiter or sep) to the enclosing function and pass it
into both pd.read_csv calls (use the default "\t" to preserve current behavior)
or alternatively update the docstring to state only TSV is supported; ensure
meta_df = meta_df.set_index(sample_id_col) still runs after reading with the
chosen delimiter.
In `@hvantk/tests/test_expression_builders_anndata.py`:
- Around line 12-24: The zip() calls in _write_expression_matrix and
_write_metadata can silently ignore length mismatches; update both to use
zip(..., strict=True) so mismatched lengths raise an error during tests — change
for gene,row in zip(genes, values) to zip(genes, values, strict=True) in
_write_expression_matrix and for cid,ct in zip(cell_ids, cell_types) to
zip(cell_ids, cell_types, strict=True) in _write_metadata.
In `@hvantk/tests/test_ptm_constraint.py`:
- Around line 90-108: The test test_constraint_config_validate uses a hardcoded
output_dir "/tmp/out"; change the test to accept the pytest tmp_path fixture,
pass output_dir=str(tmp_path) when constructing PTMConstraintConfig, and update
the test function signature to def test_constraint_config_validate(tmp_path): so
the config uses an isolated temporary directory rather than a global /tmp path
while keeping all other validation assertions unchanged.
In `@hvantk/tests/test_ptm.py`:
- Around line 183-186: Move the local "import gzip" out of the test body and add
it to the module-level imports at the top of hvantk/tests/test_ptm.py, then
remove the inline import so the block that opens output_bgz with gzip.open(...)
and constructs csv.DictReader(...) uses the top-level gzip; update any
linters/import-grouping if needed.
In `@hvantk/visualization/expression/anndata.py`:
- Around line 35-38: The current code unconditionally densifies adata.X (X =
X.toarray()) which can OOM for large sparse matrices; instead, when
sp.issparse(X) detect large matrices by checking X.nnz or X.shape and apply
sampling: use X.data (the nonzero values) and randomly sample up to a
MAX_SAMPLES limit, and if you need zeros represented also randomly sample
indices from the implicit zeros to approximate distribution; then build values
from the sampled nonzero and sampled-zero values rather than calling toarray();
update the block around X, sp.issparse(X), and values to implement this
conditional sampling strategy.
In `@pyproject.toml`:
- Around line 31-32: Update pyproject.toml to make scanpy optional by removing
it from the main dependencies and adding an extras group (e.g., extras = {
markers = ["scanpy>=1.10.0"] }) so users can install hvantk[markers] when they
need marker analysis; then in summarize_expression_cli.py, wrap the local import
of scanpy (used in the marker-related function) in a try/except ImportError and
raise a clear ImportError message pointing users to install the markers extra
(e.g., "scanpy is required for marker analysis. Install with 'pip install
hvantk[markers]'") so the runtime check fails fast with guidance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f90c4031-e371-48e9-b494-721fe6941d4f
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (54)
- .gitignore
- docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
- docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
- docs_site/architecture.md
- docs_site/guide/usage.md
- docs_site/tools/ptm-constraint.md
- environment.yml
- hvantk/commands/make_matrix_cli.py
- hvantk/commands/ptm_cli.py
- hvantk/commands/summarize_expression_cli.py
- hvantk/core/anndata_utils.py
- hvantk/core/converters.py
- hvantk/core/hail_context.py
- hvantk/data/file_utils.py
- hvantk/datasets/cptac_phospho_datasets.py
- hvantk/datasets/peptideatlas_phospho_datasets.py
- hvantk/ptm/__init__.py
- hvantk/ptm/annotate.py
- hvantk/ptm/constraint.py
- hvantk/ptm/constraint_expression.py
- hvantk/ptm/constraint_plots.py
- hvantk/ptm/constraint_report.py
- hvantk/ptm/mapper.py
- hvantk/ptm/pipeline.py
- hvantk/tables/cptac.py
- hvantk/tables/expression_atlas.py
- hvantk/tables/matrix_builders.py
- hvantk/tables/registry.py
- hvantk/tables/table_builders.py
- hvantk/tables/ucsc.py
- hvantk/tests/test_anndata_utils.py
- hvantk/tests/test_cli_mkmatrix.py
- hvantk/tests/test_converters.py
- hvantk/tests/test_cptac.py
- hvantk/tests/test_cptac_phospho.py
- hvantk/tests/test_expression_builders_anndata.py
- hvantk/tests/test_expression_tables_hail.py
- hvantk/tests/test_matrix_utils.py
- hvantk/tests/test_matrix_utils_anndata.py
- hvantk/tests/test_peptideatlas_phospho.py
- hvantk/tests/test_ptm.py
- hvantk/tests/test_ptm_constraint.py
- hvantk/tests/test_summarize_expression.py
- hvantk/tests/test_wilcoxon_hail.py
- hvantk/utils/gene_sets.py
- hvantk/utils/matrix_utils.py
- hvantk/utils/tissue_specificity.py
- hvantk/utils/wilcoxon_hail.py
- hvantk/visualization/__init__.py
- hvantk/visualization/expression/__init__.py
- hvantk/visualization/expression/anndata.py
- hvantk/visualization/expression/hail.py
- pyproject.toml
- requirements.txt
💤 Files with no reviewable changes (11)
- hvantk/tests/test_expression_tables_hail.py
- docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
- hvantk/tests/test_cli_mkmatrix.py
- hvantk/tests/test_matrix_utils.py
- hvantk/tests/test_cptac.py
- hvantk/tests/test_wilcoxon_hail.py
- hvantk/visualization/expression/hail.py
- hvantk/tests/test_summarize_expression.py
- hvantk/tests/test_cptac_phospho.py
- docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
- hvantk/utils/wilcoxon_hail.py
    if pd.api.types.is_numeric_dtype(series):
        summary[col] = {
            "dtype": "numeric",
            "min": float(np.nanmin(series.values)),
            "max": float(np.nanmax(series.values)),
            "mean": float(np.nanmean(series.values)),
            "n_missing": n_missing,
        }
🧩 Analysis chain
🏁 Script executed:

# First, locate and examine the file
find . -name "anndata_utils.py" -type f

Repository: bigbio/hvantk

🏁 Script executed:

# Read the full function context around lines 91-98
head -120 hvantk/core/anndata_utils.py | tail -50

🏁 Script executed:

# Also check imports to understand what's being used
head -20 hvantk/core/anndata_utils.py
🌐 Web query:
numpy nanmin nanmax pandas extension array Float64 Int64 behavior
💡 Result:
NumPy's nanmin and nanmax ignore NaN values and return the min/max among the non-NaN elements (or NaN if all elements are NaN), whereas plain min/max propagate NaN and return NaN if any is present. Pandas nullable extension arrays (Int64, Float64, etc.) use pd.NA rather than np.nan as their missing indicator and are backed by a data buffer plus a mask. Their .min()/.max() (default skipna=True) skip pd.NA and reduce over the valid data; with skipna=False they return pd.NA if any value is missing. Passing a pandas extension array directly to np.nanmin/np.nanmax is not supported and typically raises a TypeError: extension arrays are not NumPy arrays and lack full ufunc compatibility, and they implement their own masked _reduce("min"/"max") rather than delegating to NumPy. Example behaviors (from docs/source): np.nanmin(np.array([1, np.nan, 3])) → 1.0; pd.array([1, None, 3], dtype="Int64").min() → 1 (skips pd.NA); np.nanmin(pd.array([1, None, 3], dtype="Int64")) → TypeError. For interoperability, convert first — e.g., arr.to_numpy().astype(float), then np.nanmin — noting that pd.NA is coerced to np.nan in the process. Sources: the numpy.nanmin/nanmax docs, pandas nullable-dtype guides, and the IntegerArray/FloatingArray source (masked reductions via NumericArray).
Citations:
- 1: https://numpy.org/devdocs/reference/generated/numpy.nanmax.html
- 2: https://numpy.org/doc/stable/reference/generated/numpy.nanmin.html?highlight=numpy+maximum
- 3: https://numpy.org/doc/stable/reference/generated/numpy.min.html
- 4: https://stackoverflow.com/questions/62865647/why-is-max-and-min-of-numpy-array-nan
- 5: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/floating.py
- 6: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
- 7: https://pandas.pydata.org/docs/reference/api/pandas.api.extensions.ExtensionArray._reduce.html
- 8: https://pandas.pydata.org/docs/user_guide/missing_data.html
- 9: https://pandas.pydata.org/docs/user_guide/integer_na.html
- 10: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/integer.py
🌐 Web query:
pandas to_numeric to_numpy dtype conversion numeric series
💡 Result:
In pandas, pd.to_numeric converts a Series (typically object dtype with strings or mixed types) to a numeric Series. The result dtype defaults to float64 when floats or NaNs are involved (e.g., with errors='coerce') and int64 when all values are integers with no missing data. downcast='integer' or 'signed' yields smaller integer dtypes such as int8 where possible, and dtype_backend='numpy_nullable' produces nullable Int64 (capital I). Examples from the docs:

    s = pd.Series(['1.0', '2', -3])
    pd.to_numeric(s)                      # dtype: float64
    pd.to_numeric(s, downcast='signed')   # dtype: int8
    s_na = pd.Series(['1', '2', 'apple'])
    pd.to_numeric(s_na, errors='coerce')  # dtype: float64 with NaN

Calling .to_numpy() on the resulting numeric Series returns a NumPy ndarray with matching dtype: float64 → np.float64, int64 → np.int64, smaller dtypes like int8 → np.int8. For nullable dtypes (e.g., Int64 from dtype_backend), .to_numpy() without arguments coerces to an object-dtype ndarray; pass dtype=float, na_value=np.nan to get a float array. For standard non-nullable numeric Series there is no dtype loss, and .to_numpy() may return a view of the underlying data when no conversion is needed (copy=False by default).
Citations:
- 1: https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html
- 2: https://pandas.pydata.org/docs/reference/api/pandas.Series.to_numpy.html
- 3: http://prnum.s3-website.bhs.perf.cloud.ovh.net/docs/reference/api/pandas.Series.to_numpy.html
- 4: https://github.com/pandas-dev/pandas/blob/master/pandas/core/tools/numeric.py
- 5: http://pandas.pydata.org/docs/dev/reference/api/pandas.to_numeric.html
- 6: https://stackoverflow.com/questions/72520974/pandas-to-numeric-integer-downcast-cast-floats-not-to-integer
- 7: https://pandas.pydata.org/docs/user_guide/integer_na.html
Convert numeric columns to standard NumPy float array before reduction, or use pandas native aggregation methods.
If adata.obs[col] uses a pandas nullable numeric dtype (Int64, Float64, etc.), then series.values returns an ExtensionArray that is incompatible with np.nanmin/np.nanmax/np.nanmean, causing a TypeError at runtime. Use .astype(float).values or .to_numpy(dtype=float, na_value=np.nan) to convert before reducing, or call pandas native methods directly (series.min(), series.max(), series.mean() with skipna=True) which handle both regular and nullable dtypes correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/core/anndata_utils.py` around lines 91 - 98, The numeric-aggregation
code uses np.nanmin/np.nanmax/np.nanmean on series.values which can be an
ExtensionArray for pandas nullable dtypes; convert the Series to a standard
float NumPy array or use pandas' native aggregations to avoid TypeError. Update
the logic around the variable 'series' (the Series from adata.obs[col]) so that
before reducing you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").
    # Resolve keys
    if row_key is None:
        row_key = list(mt.row_key)[0]
    if col_key is None:
        col_key = list(mt.col_key)[0]
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:

find . -type f -name "converters.py" | head -20

🏁 Script executed:

git ls-files hvantk/core/ | head -20

🏁 Script executed:

cd hvantk/core && find . -name "*.py" -type f | head -20

🏁 Script executed:

wc -l hvantk/core/converters.py

🏁 Script executed:

cat -n hvantk/core/converters.py | head -100

🏁 Script executed:

cat -n hvantk/core/converters.py | tail -80

🏁 Script executed:

grep -r "row_key\|col_key" hvantk/ --include="*.py" | grep -E "test|spec" | head -20

🏁 Script executed:

cat -n hvantk/tests/test_converters.py

🏁 Script executed:

cd hvantk && find . -name "*.py" -type f -exec grep -l "to_matrix_table\|row_key\|col_key" {} \; | head -15

🏁 Script executed:

find . -name "*.py" -type f -exec grep -l "composite.*key\|multiple.*key" {} \; | head -10

🏁 Script executed:

rg "key_rows_by|key_cols_by" -A 3 -B 3 hvantk/ --include="*.py" | head -50

🏁 Script executed:

rg "row_key.*=.*\[" hvantk/ --include="*.py" | head -20

🏁 Script executed:

rg "row_key.*=.*\[" hvantk/

🏁 Script executed:

rg "to_matrix_table" hvantk/ -A 5

🏁 Script executed:

cat -n hvantk/core/converters.py | sed -n '136,141p'

🏁 Script executed:

rg "composite.*key|multiple.*key" hvantk/

🏁 Script executed:

grep -r "IndexError\|ValueError" hvantk/tests/test_converters.py

🏁 Script executed:

cat -n hvantk/enrichex/burden.py | grep -A 5 -B 5 "row_key_name"

🏁 Script executed:

cat -n hvantk/tests/test_table_utils.py | grep -A 20 "Create a table with multiple key fields"

🏁 Script executed:

rg "key_rows_by|key_cols_by" hvantk/tests/ -A 2 -B 2

🏁 Script executed:

rg "composite" hvantk/tests/ -A 3 -B 3

🏁 Script executed:

cat -n hvantk/tests/test_converters.py | tail -30

🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '1,40p'

🏁 Script executed:

rg "hail_mt_to_anndata" hvantk/ -B 2 -A 5

🏁 Script executed:

python3 << 'EOF'
# Check if we can examine Hail's type system
# Let's look for any documentation or usage patterns
import subprocess
result = subprocess.run(['rg', 'row_key.*len|len.*row_key', 'hvantk/'],
                        capture_output=True, text=True)
print("Length checks on row_key/col_key:")
print(result.stdout if result.stdout else "No matches found")
result = subprocess.run(['rg', 'range_matrix_table', 'hvantk/tests/', '-A', '10'],
                        capture_output=True, text=True)
print("\nrange_matrix_table usage:")
print(result.stdout[:2000] if result.stdout else "No matches")
EOF

🏁 Script executed:

rg "row_key|col_key" hvantk/tests/test_converters.py -B 5 -A 5

🏁 Script executed:

python3 << 'EOF'
# Let's check if there's a way to create an unkeyed MatrixTable or what happens
# We can look for any usage of unkeyed tables
import subprocess
# Search for any mention of "unkey" or empty keys
result = subprocess.run(['rg', 'unkey|drop.*key', 'hvantk/'],
                        capture_output=True, text=True)
print("Unkey operations:")
print(result.stdout if result.stdout else "No matches found")
# Check for any error handling around keys
result = subprocess.run(['rg', 'KeyError|IndexError|ValueError.*key', 'hvantk/'],
                        capture_output=True, text=True)
print("\nError handling for keys:")
print(result.stdout if result.stdout else "No matches found")
EOF

🏁 Script executed:

rg "simple_mt" hvantk/tests/test_converters.py -B 10 | head -40

🏁 Script executed:

cat -n hvantk/enrichex/burden.py | sed -n '1545,1560p'

🏁 Script executed:

cat -n hvantk/ptm/constraint_expression.py | grep -A 3 -B 3 "no row_key"

🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '45,65p'

🏁 Script executed:

rg "key_rows_by.*\[|key_cols_by.*\[" hvantk/
Add explicit validation for single row and column keys.
Lines 57–59 silently use the first key field without checking cardinality. This fails with IndexError on an unkeyed MatrixTable and silently drops key components on a composite-key MT. Defensive checks for this already exist elsewhere in the codebase (e.g., constraint_expression.py). Add explicit validation here.
Suggested guard
 if row_key is None:
-    row_key = list(mt.row_key)[0]
+    if len(mt.row_key) != 1:
+        raise ValueError(
+            "hail_mt_to_anndata requires exactly one row key."
+        )
+    row_key = next(iter(mt.row_key))
 if col_key is None:
-    col_key = list(mt.col_key)[0]
+    if len(mt.col_key) != 1:
+        raise ValueError(
+            "hail_mt_to_anndata requires exactly one column key."
+        )
+    col_key = next(iter(mt.col_key))

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

    # Resolve keys
    if row_key is None:
        row_key = list(mt.row_key)[0]
    if col_key is None:
        col_key = list(mt.col_key)[0]

Suggested:

    # Resolve keys
    if row_key is None:
        if len(mt.row_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one row key."
            )
        row_key = next(iter(mt.row_key))
    if col_key is None:
        if len(mt.col_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one column key."
            )
        col_key = next(iter(mt.col_key))
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 57-57: Prefer next(iter(mt.row_key)) over single element slice
Replace with next(iter(mt.row_key))
(RUF015)
[warning] 59-59: Prefer next(iter(mt.col_key)) over single element slice
Replace with next(iter(mt.col_key))
(RUF015)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/core/converters.py` around lines 55 - 59, The resolver currently takes
the first element of mt.row_key/mt.col_key without validating cardinality;
update the logic around row_key and col_key so you explicitly validate
mt.row_key and mt.col_key: if len(mt.row_key) == 0 or len(mt.col_key) == 0 raise
a clear ValueError indicating the MatrixTable is unkeyed, and if len(...) > 1
raise a ValueError indicating a composite key and require the caller to supply
an explicit row_key/col_key; only when len(...) == 1 set row_key =
list(mt.row_key)[0] (and similarly for col_key). Ensure the error messages
reference row_key/col_key and MatrixTable so callers can fix input.
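The guard can also be factored into a small helper; this standalone sketch uses a plain list of field names in place of Hail's key struct so it runs without Hail installed, and the helper name `resolve_single_key` is hypothetical:

```python
def resolve_single_key(keys, which):
    """Return the sole key field name or raise a clear error.

    `keys` is any iterable of field names (Hail's mt.row_key / mt.col_key
    could be passed directly); `which` is "row" or "column" for the message.
    """
    keys = list(keys)
    if len(keys) != 1:
        raise ValueError(
            f"hail_mt_to_anndata requires exactly one {which} key, "
            f"got {len(keys)}: {keys}"
        )
    return keys[0]

print(resolve_single_key(["gene_id"], "row"))  # single key: returned as-is
```

This covers both failure modes from the finding: an empty key list (unkeyed MatrixTable) and a composite key both raise instead of silently misbehaving.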
    row_df = mt.rows().to_pandas()
    row_df = row_df.set_index(row_key)

    col_df = mt.cols().to_pandas()
    col_df = col_df.set_index(col_key)

    # Collect entries and pivot to dense matrix
    entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
    pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

    # Reindex to match collected row/col order
    pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

    X = pivot.values.astype(np.float32)
🧩 Analysis chain
🏁 Script executed:

find . -name converters.py -type f | head -20

🏁 Script executed:

wc -l ./hvantk/core/converters.py

🏁 Script executed:

cat -n ./hvantk/core/converters.py

🏁 Script executed:

rg -n "hail_mt_to_anndata|anndata_to_hail_mt" --type py -B 2 -A 2 | head -50

🏁 Script executed:

rg -n "size|memory|OOM|large|limit|guard" hvantk/core/converters.py -i

🏁 Script executed:

rg -n "stream|block|chunk|slice" hvantk/core/ -i --type py
Both conversion paths unconditionally materialize the entire matrix on the driver.
Lines 69, 72, and 76 in hail_mt_to_anndata call to_pandas() three times (rows, cols, and entries), then pivot the entire entries DataFrame into a dense matrix. Lines 118 and 133 in anndata_to_hail_mt convert sparse matrices to dense via toarray() and then ravel() the entire matrix into memory. Neither function includes size validation, memory guards, or a streaming alternative. This design will cause OOM failures on realistic expression matrices (e.g., 100k cells × 30k genes).
Add an explicit size guard (e.g., max acceptable obs×var product) with a clear error message, or implement a blockwise/streamed conversion path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/core/converters.py` around lines 69 - 82, hail_mt_to_anndata and
anndata_to_hail_mt currently materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
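One way such a guard might look. The constant, environment-variable name, and threshold below are illustrative assumptions, not existing hvantk configuration:

```python
import os

# Illustrative default -- 200M float32 entries is roughly 800 MB dense.
DEFAULT_MAX_ENTRIES = 200_000_000

def check_matrix_size(n_obs, n_vars, max_entries=None):
    """Refuse to densify an n_obs x n_vars matrix beyond a size limit.

    Call this before any to_pandas()/pivot/toarray() step that would
    materialize the full matrix on the driver.
    """
    if max_entries is None:
        max_entries = int(os.environ.get("HVANTK_MAX_MATRIX_ENTRIES",
                                         DEFAULT_MAX_ENTRIES))
    n_entries = n_obs * n_vars
    if n_entries > max_entries:
        raise ValueError(
            f"Refusing to materialize {n_obs} x {n_vars} = {n_entries:,} "
            f"entries on the driver (limit {max_entries:,}). Convert in "
            f"blocks or raise HVANTK_MAX_MATRIX_ENTRIES."
        )

check_matrix_size(10_000, 3_000)  # 30M entries: under the illustrative limit
```

The 100k-cells-by-30k-genes case from the finding (3 billion entries) would be rejected with an actionable message instead of an opaque OOM kill.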
    def _get_phospho(tissue_type="both", source="umich"):
        """Try *source* first; fall back to default if unavailable."""
        try:
            return ds.get_phosphoproteomics(source=source, tissue_type=tissue_type)
        except Exception:
            logger.info(
                "Source '%s' unavailable for %s, falling back to default", source, ct
            )
            return ds.get_phosphoproteomics(tissue_type=tissue_type)
🧩 Analysis chain
🏁 Script executed:

# First, let's locate and examine the file
fd "cptac_phospho_datasets.py" -type f

🏁 Script executed:

# Get the file size and examine the relevant sections
wc -l hvantk/datasets/cptac_phospho_datasets.py

🏁 Script executed:

# Read the full context around the flagged lines (300-350) to understand the function and all call sites
sed -n '295,355p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

🏁 Script executed:

# Also look for the `_get_phospho` function definition to understand its scope and behavior
sed -n '300,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

🏁 Script executed:

# Look at imports and understand the CPTAC library usage
sed -n '1,50p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

🏁 Script executed:

# Search for any configuration or documentation about CPTAC sources
rg "get_phosphoproteomics|get_clinical" hvantk/datasets/cptac_phospho_datasets.py -B 2 -A 2

🏁 Script executed:

# Check if there's any CPTAC documentation or type hints in the codebase
rg "source.*umich|source.*mssm" hvantk/ -t py

🏁 Script executed:

# Look for the _load_cptac_dataset function to understand the library's behavior
rg "_load_cptac_dataset" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 10

🏁 Script executed:

# Look for test files related to CPTAC
fd "test.*cptac|cptac.*test" -type f hvantk/

🏁 Script executed:

# Check requirements or setup files for cptac package version
fd "requirements|setup|pyproject" -type f | xargs rg "cptac" 2>/dev/null

🏁 Script executed:

# Search for any documentation or comments about source handling
rg "source.*mssm|source.*umich" hvantk/ -B 3 -A 3

🏁 Script executed:

# Check the full download method to understand the complete context
sed -n '260,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

🏁 Script executed:

# Search for any imports or exception handling specific to cptac
rg "cptac\.|import cptac" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 5

🏁 Script executed:

# Check if there's any test coverage for this code path
fd "test" -type f hvantk/ | xargs rg "CPTACPhosphoDataset|_get_phospho|source.*umich" 2>/dev/null | head -20

🏁 Script executed:

# Look for any GitHub issues or comments about CPTAC source handling
rg "unavailable|fallback|source.*mssm" hvantk/ -i

🏁 Script executed:

# Check the actual CPTAC library behavior by looking at package docs or any local type hints
rg "get_phosphoproteomics|get_clinical" hvantk/ -A 3 -B 3
Determine and reuse a single CPTAC source instead of falling back per call.
Each _get_phospho call (lines 316, 324, 339) is evaluated independently, so one failed "umich" lookup leaves the tumor TSV, normal TSV, and matrix export backed by different sources. Both fallback blocks also catch every Exception, silently converting unrelated loader/network failures into source switches. Determine the supported source once and reuse it for all phosphoproteomics reads; apply the same narrowing to the clinical fallback (lines 351–354).
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 308-308: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/datasets/cptac_phospho_datasets.py` around lines 304 - 312, The
per-call fallback in _get_phospho causes inconsistent sources and overly-broad
exception swallowing; instead, probe and pick a single working source once
(e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.
    if ext in {".parquet", ".pq"}:
        df = pd.read_parquet(path)
    elif ext in {".pkl", ".pickle"}:
        df = pd.read_pickle(path)
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the file
fd -a constraint_expression.pyRepository: bigbio/hvantk
Length of output: 111
🏁 Script executed:
# Once we have the file, let's look at the code around lines 138-141
cat -n hvantk/ptm/constraint_expression.py | head -160 | tail -40Repository: bigbio/hvantk
Length of output: 1634
🏁 Script executed:
# Let's also get a broader view of the function containing these lines
cat -n hvantk/ptm/constraint_expression.py | head -200 | tail -100Repository: bigbio/hvantk
Length of output: 4086
🏁 Script executed:
# Find all calls to _load_from_tabular to understand the call chain
rg "_load_from_tabular" -n hvantk/ptm/constraint_expression.pyRepository: bigbio/hvantk
Length of output: 152
🏁 Script executed:
# Search for CLI entry points or argparse definitions in the ptm module
fd -a "\.py$" hvantk/ptm/ | head -20Repository: bigbio/hvantk
Length of output: 585
🏁 Script executed:
# Look for main entry points or CLI handlers
rg -n "argparse|click|typer|def main" hvantk/ptm/Repository: bigbio/hvantk
Length of output: 39
🏁 Script executed:
# Check if there's a CLI module or setup.py that shows how paths are passed
rg -n "_load_from_tabular" --type pyRepository: bigbio/hvantk
Length of output: 224
🏁 Script executed:
# Look at context around line 72 where _load_from_tabular is called
cat -n hvantk/ptm/constraint_expression.py | sed -n '50,90p'Repository: bigbio/hvantk
Length of output: 1800
🏁 Script executed:
# Look at the full function definition starting before line 72
cat -n hvantk/ptm/constraint_expression.py | sed -n '35,85p'Repository: bigbio/hvantk
Length of output: 2292
🏁 Script executed:
# Get the function signature and name
cat -n hvantk/ptm/constraint_expression.py | sed -n '1,40p'Repository: bigbio/hvantk
Length of output: 1371
🏁 Script executed:
# Search for all calls to this function across the codebase
rg "load_expression_source" --type py -nRepository: bigbio/hvantk
Length of output: 39
🏁 Script executed:
# Search for calls to load_gene_by_group_matrix
rg "load_gene_by_group_matrix" --type py -nRepository: bigbio/hvantk
Length of output: 986
🏁 Script executed:
# Also check for CLI/entry point files (main, CLI runner, etc.)
rg -n "def main|if __name__" hvantk/ --type py | head -20Repository: bigbio/hvantk
Length of output: 1364
🏁 Script executed:
# Look for any argument parsing or click/argparse decorators related to path arguments
rg -n "click|argparse|typer|@app\." hvantk/ --type py | grep -i "path\|file\|input" | head -20Repository: bigbio/hvantk
Length of output: 1857
🏁 Script executed:
# Let's examine constraint.py around line 158 to see how load_gene_by_group_matrix is called
cat -n hvantk/ptm/constraint.py | sed -n '145,175p'Repository: bigbio/hvantk
Length of output: 1272
🏁 Script executed:
# Check the function containing this call - look further back
cat -n hvantk/ptm/constraint.py | sed -n '1,50p'Repository: bigbio/hvantk
Length of output: 1708
🏁 Script executed:
# Search for CLI-related functions in constraint.py that might call this function
rg -n "def " hvantk/ptm/constraint.py | head -20Repository: bigbio/hvantk
Length of output: 822
🏁 Script executed:
# Search for calls to run_ptm_constraint
rg "run_ptm_constraint" --type py -nRepository: bigbio/hvantk
Length of output: 729
🏁 Script executed:
# Check if there's a PTM-related CLI entry in the main hvantk.py or other CLI modules
cat -n hvantk/hvantk.py | head -120Repository: bigbio/hvantk
Length of output: 4062
🏁 Script executed:
# Look for ptm command/subcommand in the CLI
rg -n "ptm\|constraint" hvantk/hvantk.py hvantk/commands/ --type py | head -30Repository: bigbio/hvantk
Length of output: 39
🏁 Script executed:
# Examine ptm_cli.py around line 656 to see how the path is supplied to run_ptm_constraint
cat -n hvantk/commands/ptm_cli.py | sed -n '640,700p'Repository: bigbio/hvantk
Length of output: 2420
🏁 Script executed:
# Look at the ptm command/subcommand definitions to see where expression_path comes from
cat -n hvantk/commands/ptm_cli.py | sed -n '1,100p'Repository: bigbio/hvantk
Length of output: 3934
🏁 Script executed:
# Search for where expression_path is defined as a CLI option
rg -n "expression_path\|expression-path" hvantk/commands/ptm_cli.py | head -20Repository: bigbio/hvantk
Length of output: 39
🏁 Script executed:
# Search for the constraint subcommand definition
rg -n "constraint" hvantk/commands/ptm_cli.py -A 5 -B 5 | head -100Repository: bigbio/hvantk
Length of output: 3440
🏁 Script executed:
# Search for `@ptm_group.command` or constraint in the file
rg -n "@ptm_group.command|def constraint" hvantk/commands/ptm_cli.pyRepository: bigbio/hvantk
Length of output: 305
🏁 Script executed:
# Now let's look at the --expression-path option definition
cat -n hvantk/commands/ptm_cli.py | sed -n '525,605p'Repository: bigbio/hvantk
Length of output: 3103
Remove pickle support from this user-facing loader.
The --expression-path CLI option accepts .pkl/.pickle files. pd.read_pickle() executes pickle payloads during deserialization, turning this into an arbitrary-code-execution sink. Remove pickle support entirely, or gate it behind an explicit unsafe flag.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 141-141: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/constraint_expression.py` around lines 138 - 141, The loader
currently accepts .pkl/.pickle and calls pd.read_pickle (variable ext with
branches creating df), which is unsafe; remove the pickle branch entirely and
ensure the CLI/validation for --expression-path rejects .pkl/.pickle (or only
allow .parquet/.pq), so the code only calls pd.read_parquet for supported
extensions; update any help/validation code that lists accepted extensions to
exclude pickle and add an explicit unsafe flag if you want to reintroduce pickle
behind opt-in.
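A hedged sketch of the loader with pickle removed. The .tsv/.csv branches extend the observed parquet-only allow-list and are assumptions, as is the function name:

```python
from pathlib import Path

import pandas as pd

def load_expression_table(path):
    """Load a tabular expression file, rejecting pickle payloads outright.

    pd.read_pickle() executes arbitrary code during deserialization, so
    pickle extensions fail fast with an actionable message instead of
    being loaded.
    """
    ext = Path(path).suffix.lower()
    if ext in {".pkl", ".pickle"}:
        raise ValueError(
            "Pickle files are not accepted for --expression-path: "
            "unpickling untrusted data executes arbitrary code. "
            "Convert the table to Parquet instead."
        )
    if ext in {".parquet", ".pq"}:
        return pd.read_parquet(path)
    if ext in {".tsv", ".csv"}:  # assumed extra formats, not in the original
        return pd.read_csv(path, sep="\t" if ext == ".tsv" else ",")
    raise ValueError(f"Unsupported extension {ext!r} for expression table")
```

Any CLI help text listing accepted extensions would need the same pruning so users are not steered toward the rejected formats.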
| logger.info("Downloading Ensembl GTF to %s...", gtf_path) | ||
| urllib.request.urlretrieve(ENSEMBL_GTF_URL, gtf_path) # nosec B310 | ||
| with requests.get(ENSEMBL_GTF_URL, stream=True, timeout=600) as resp: | ||
| resp.raise_for_status() | ||
| with open(gtf_path, "wb") as fout: | ||
| shutil.copyfileobj(resp.raw, fout) |
Write the GTF atomically instead of streaming into the cache path.
If the request or copy fails midway, gtf_path is left truncated and the next run will still treat it as a valid cached file because the cache check only looks at os.path.exists. Download to a temp file and os.replace() it only after the copy finishes successfully.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/pipeline.py` around lines 209 - 213, The download currently
streams directly into gtf_path using requests.get and shutil.copyfileobj which
can leave a truncated cache file on failure; change the logic in the block that
calls requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file
(e.g., using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw
into that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
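The temp-file-plus-os.replace pattern can be sketched generically. `atomic_write` is a hypothetical helper, and the requests usage appears only in a comment so the sketch stays stdlib-only:

```python
import os
import tempfile

def atomic_write(dest_path, write_fn):
    """Write via write_fn(fileobj) into a temp file beside dest_path, then
    os.replace() it into place only after write_fn succeeds.

    A mid-write failure leaves dest_path untouched, so an os.path.exists
    cache check can never see a truncated file.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fout:
            write_fn(fout)
        os.replace(tmp_path, dest_path)  # atomic on the same filesystem
    except BaseException:
        # os.replace never ran; discard the partial temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Usage with the GTF download (requests assumed, as in the pipeline):
# with requests.get(ENSEMBL_GTF_URL, stream=True, timeout=600) as resp:
#     resp.raise_for_status()
#     atomic_write(gtf_path, lambda f: shutil.copyfileobj(resp.raw, f))
```

Creating the temp file in the destination's own directory matters: os.replace is only atomic when source and destination are on the same filesystem.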
    # Use a distinct filename to avoid truncating mapped_path when it
    # points to ptm_sites_combined.tsv.bgz from step 4b.
    all_combined_path = os.path.join(
        config.output_dir, "ptm_sites_all_combined.tsv.bgz"
    )
    _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)

    mapped_path = all_combined_path
    result.n_total += cptac_result.n_total
Update the returned mapped path after the CPTAC merge.
Line 497 repoints mapped_path, but result.mapped_tsv_path is never updated in this branch. When cptac_tsv is present, callers get the pre-CPTAC file even though create_ptm_sites_tb() consumes ptm_sites_all_combined.tsv.bgz.
Suggested fix
 _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)
 mapped_path = all_combined_path
+result.mapped_tsv_path = all_combined_path
 result.n_total += cptac_result.n_total

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/ptm/pipeline.py` around lines 490 - 498, After concatenating CPTAC
into all_combined_path and repointing mapped_path, also set
result.mapped_tsv_path = all_combined_path so the returned result reflects the
post-merge file; update this in the branch that calls
_concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the same
place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.
```python
obs = pd.DataFrame(index=wide.index)
if metadata_df is not None:
    obs = obs.join(metadata_df, how="left")
```
Add overlap validation to metadata joins to prevent silent NaN filling.
Lines 89-92 and 149-153 join metadata_df without checking if its index overlaps with the sample IDs. A left join with no overlapping indices silently produces all-NaN obs, which breaks downstream analysis that depends on sample metadata (e.g., tissue/cell-type grouping).
The hvantk/tables/ucsc.py builder already addresses this by checking adata.obs.index.intersection(metadata_df.index) and logging a warning when the overlap is empty (lines 133-137). Adopt the same pattern here: validate that the metadata index contains the expected sample IDs, warn when the overlap is only partial, and preferably raise an error when it is empty.
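A sketch of that guard, assuming plain pandas objects (the helper name `join_metadata_checked` and the messages are illustrative, not the ucsc.py code):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def join_metadata_checked(obs: pd.DataFrame, metadata_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join sample metadata onto obs, refusing a silent all-NaN result."""
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        # A blind left join here would fill every metadata column with NaN.
        raise ValueError(
            "metadata index shares no sample IDs with the expression matrix; "
            "check that both use the same sample identifiers"
        )
    if len(overlap) < len(obs.index):
        logger.warning(
            "Metadata found for %d of %d samples; the rest will have NaN metadata",
            len(overlap),
            len(obs.index),
        )
    return obs.join(metadata_df, how="left")
```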
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/tables/cptac.py` around lines 89 - 92, In cptac.py, when joining
metadata_df into obs (created from wide.index) and the later join at lines
~149-153, first compute the overlap = obs.index.intersection(metadata_df.index)
and validate it; if overlap is empty, log a warning (or raise a ValueError) and
avoid doing a blind left join that yields all-NaN rows; if partial overlap, log
how many samples matched and then perform obs = obs.join(metadata_df,
how="left") restricted to the overlapping index (or ensure rows with no metadata
are handled explicitly). Use the same validation/logging pattern used in
hvantk/tables/ucsc.py (checking adata.obs.index.intersection(metadata_df.index)
and warning on empty overlap) and apply it to the variables obs, metadata_df,
and wide.index in both join locations.
```python
if isinstance(group_by, str):
    group_by = [group_by]

# --- Validate ---
_validate_group_by(mt, group_by, metadata_field)

# --- Filter columns ---
if filter_by:
    mt = filter_by_metadata(mt, filter_by)

# --- Build group label ---
if len(group_by) == 1:
    label_expr = hl.str(mt[metadata_field][group_by[0]])
else:
    label_expr = hl.delimit([hl.str(mt[metadata_field][f]) for f in group_by], "_")
mt = mt.annotate_cols(_group_label=label_expr)

# --- Filter groups by min cells ---
group_counts = mt.aggregate_cols(hl.agg.counter(mt._group_label))
valid_groups = {g for g, n in group_counts.items() if n >= min_cells_per_group}
skipped = {g: n for g, n in group_counts.items() if n < min_cells_per_group}
if skipped:
    logger.warning(
        "Skipping %d group(s) with fewer than %d cells: %s",
        len(skipped),
        min_cells_per_group,
        skipped,
    )
if not valid_groups:
    raise ValueError(
        f"No groups have >= {min_cells_per_group} cells. "
        f"Group sizes: {group_counts}"
    )

mt = mt.filter_cols(hl.literal(valid_groups).contains(mt._group_label))

# --- Group and aggregate ---
grouped_mt = mt.group_cols_by(mt._group_label).aggregate(
    mean=hl.agg.mean(mt[expr_field]),
    fraction_expressed=hl.agg.fraction(mt[expr_field] > 0),
    n_cells=hl.agg.count(),
)

# --- Collect per-gene stats into a dict ---
grouped_mt = grouped_mt.annotate_rows(
    stats=hl.dict(
        hl.agg.collect(
            hl.tuple(
                [
                    grouped_mt._group_label,
                    hl.struct(
                        mean=grouped_mt.mean,
                        fraction_expressed=grouped_mt.fraction_expressed,
                        n_cells=hl.int32(grouped_mt.n_cells),
                    ),
                ]
            )
        )
    )
)

# --- Build output Table ---
row_fields = {gene_id_field: mt.row[gene_id_field]}
if gene_name_field and gene_name_field in mt.row:
    row_fields[gene_name_field] = mt.row[gene_name_field]

tb = grouped_mt.rows()
select_exprs = {"gene_id": tb[gene_id_field], "stats": tb.stats}
if gene_name_field and gene_name_field in tb.row:
    select_exprs["gene_name"] = tb[gene_name_field]

tb = tb.select(**select_exprs)
tb = tb.key_by("gene_id")

if output_path:
    logger.info("Checkpointing summary table to %s", output_path)
    tb = tb.checkpoint(output_path, overwrite=overwrite)

logger.info(
    "Summarized %d genes across %d groups (group_by=%s)",
    tb.count(),
    len(valid_groups),
    group_by,
)
return tb


def summarize_matrix(mt: hl.MatrixTable) -> Dict:
    """
    Provide a comprehensive summary of a MatrixTable.

    Args:
        mt: Hail MatrixTable to summarize

    Returns:
        Dict containing summary statistics and information about the MatrixTable
    """
    # Basic counts
    n_samples = mt.count_cols()
    n_genes = mt.count_rows()
    n_entries = n_samples * n_genes

    # Sample metadata summaries
    if "metadata" in mt.col:
        metadata_fields = list(mt.col.metadata.dtype)
        metadata_stats = {}

        for field in metadata_fields:
            # Check if the field is likely numerical
            field_type = mt.col.metadata[field].dtype
            is_numeric = hl.is_numeric(field_type)

            if is_numeric:
                # For numerical fields, compute summary statistics
                field_stats = mt.aggregate_cols(hl.agg.stats(mt.metadata[field]))
                # Convert to a more compact dict format
                field_stats = {
                    "mean": field_stats.mean,
                    "std": field_stats.stdev,
                    "min": field_stats.min,
                    "max": field_stats.max,
                    "n": field_stats.n,
                    "sum": field_stats.sum,
```

```python
labels = adata.obs[group_by[0]].astype(str)
for col in group_by[1:]:
    labels = labels + "_" + adata.obs[col].astype(str)

# Local Series — does not touch adata.obs.
label_series = pd.Series(labels.values, index=np.arange(adata.n_obs))
```
Don't encode groups with plain underscore joins.
For multi-column group_by, this can merge distinct tuples when the values themselves contain underscores, and astype(str) also turns missing metadata into a real "nan" group. Group on the original column tuple (or a safely escaped encoding) and drop rows with missing grouping fields before aggregation. This also feeds the AnnData mean path in hvantk/ptm/constraint_expression.py:189-205.
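A minimal sketch of the tuple-keyed alternative (the `group_means` helper and the column names are made up for illustration): build a `pandas.MultiIndex` from the original columns and drop rows with missing grouping fields, so that `("a_b", "c")` and `("a", "b_c")` stay distinct and NaN never becomes a `"nan"` group.

```python
import numpy as np
import pandas as pd


def group_means(obs: pd.DataFrame, values: np.ndarray, group_by: list) -> pd.DataFrame:
    """Per-group column means of `values`, keyed by the original column tuple."""
    # Drop samples with any missing grouping field before aggregating.
    keep = obs[group_by].notna().all(axis=1)
    # MultiIndex keys preserve the tuple structure; no string joining.
    keys = pd.MultiIndex.from_frame(obs.loc[keep, group_by])
    df = pd.DataFrame(values[keep.to_numpy()], index=keys)
    return df.groupby(level=group_by).mean()
```

The same idea applies to the per-group mean/fraction aggregations above: group on the MultiIndex (or the raw column tuple) and only stringify labels, with a safe escape, at presentation time.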
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hvantk/utils/matrix_utils.py` around lines 127 - 139, The current label
construction joins stringified fields with "_" which corrupts tuples with
underscores and turns missing values into the literal "nan"; change the logic in
the block that builds labels (the group_by handling and label_series creation)
to build canonical group keys as tuples or a pandas.MultiIndex from the original
adata.obs columns (e.g., using tuple(zip(...)) or
pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop rows with any
missing grouping fields before downstream aggregation, and ensure label_series
uses adata.obs.index (not np.arange(adata.n_obs)); update any callers (e.g., the
AnnData mean path referenced in constraint_expression.py) to consume
tuple/MultiIndex keys or a safely encoded representation.
```text
pyyaml
tspex
```
requirements.txt is missing critical AnnData runtime dependencies.
The three new modules hvantk/core/converters.py (line 14), hvantk/tables/cptac.py (line 12), and hvantk/tables/ucsc.py (line 24) all import anndata, but this dependency is absent from requirements.txt. An environment built with a plain pip install -r requirements.txt will therefore fail at import time with ModuleNotFoundError: No module named 'anndata'.
Additionally, environment.yml specifies several other packages not in requirements.txt: scanpy>=1.10.0, pysam>=0.22, psutil, and scikit-learn. Sync requirements.txt with the actual runtime dependencies listed in environment.yml.
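A sketch of the synced requirements.txt entries, taking the version floors from the environment.yml packages quoted above (verify against the actual environment.yml before committing):

```text
anndata
scanpy>=1.10.0
pysam>=0.22
psutil
scikit-learn
```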
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@requirements.txt` around lines 14 - 15, requirements.txt is missing runtime
deps required by the new modules (hvantk/core/converters.py,
hvantk/tables/cptac.py, hvantk/tables/ucsc.py) which import anndata; update
requirements.txt to include anndata plus the other packages declared in
environment.yml (scanpy>=1.10.0, pysam>=0.22, psutil, scikit-learn) and any
version constraints from environment.yml so a plain pip install -r
requirements.txt will succeed and match the environment.yml runtime
dependencies.
@copilot, review and evaluate every comment in this PR. Fix only if valid.
Agent-Logs-Url: https://github.com/bigbio/hvantk/sessions/4e744d04-4f43-4311-857e-5eb7affb7697 Co-authored-by: enriquea <8707954+enriquea@users.noreply.github.com>
Done in
This pull request introduces a new PTM constraint analysis feature and refactors the expression matrix-building CLI to use AnnData objects instead of Hail MatrixTables. It also documents and registers new modules and dependencies to support stratified PTM constraint analysis, including support for multiple expression data backends (AnnData, Hail MT, tabular). The CLI and builder functions have been updated accordingly, and new documentation has been added to guide users through the new workflows.
PTM Constraint Analysis:
- Added `hvantk ptm constraint` for stratified allele-frequency depletion analysis at PTM codons, supporting grouping by tissue or cell type and multiple expression data backends (AnnData, Hail MT, tabular). Outputs include TSVs, plots, and a self-contained HTML report.

Codebase Organization & Documentation:
- Documented the new constraint modules (`constraint.py`, `constraint_expression.py`, `constraint_plots.py`, `constraint_report.py`) in the architecture docs and clarified their purpose within the `ptm/` directory. [1] [2]

Expression Matrix CLI Refactor:
- Refactored the `hvantk mkmatrix` CLI to create AnnData (`.h5ad`) objects instead of Hail MatrixTables, updating function names, options, and help text for all supported sources (UCSC, Expression Atlas, CPTAC). [1] [2] [3] [4]

Dependencies:
- Updated `environment.yml` for AnnData, Scanpy, pysam, PyYAML, psutil, scikit-learn, and tspex to support the new analysis and data backends.

Removed/Obsolete:

Summary by CodeRabbit

New Features
- New `hvantk ptm constraint` command for stratified PTM depletion analysis with statistical testing and HTML reports

Documentation