
Integrate AnnData support and implement PTM constraint analysis #90

Merged
enriquea merged 33 commits into main from dev on Apr 14, 2026
Conversation

Collaborator
@enriquea enriquea commented Apr 14, 2026

This pull request introduces a new PTM constraint analysis feature and refactors the expression matrix-building CLI to use AnnData objects instead of Hail MatrixTables. It also documents and registers new modules and dependencies to support stratified PTM constraint analysis, including support for multiple expression data backends (AnnData, Hail MT, tabular). The CLI and builder functions have been updated accordingly, and new documentation has been added to guide users through the new workflows.

PTM Constraint Analysis:

  • Added a new CLI command hvantk ptm constraint for stratified allele-frequency depletion analysis at PTM codons, supporting grouping by tissue or cell type and multiple expression data backends (AnnData, Hail MT, tabular). Outputs include TSVs, plots, and a self-contained HTML report.
  • Added a detailed user guide for the new PTM constraint command, including usage examples and flag documentation.

Codebase Organization & Documentation:

  • Registered and documented new PTM constraint modules (constraint.py, constraint_expression.py, constraint_plots.py, constraint_report.py) in the architecture docs and clarified their purpose within the ptm/ directory. [1] [2]

Expression Matrix CLI Refactor:

  • Refactored the hvantk mkmatrix CLI to create AnnData (.h5ad) objects instead of Hail MatrixTables, updating function names, options, and help text for all supported sources (UCSC, Expression Atlas, CPTAC). [1] [2] [3] [4]

Dependencies:

  • Added new dependencies to environment.yml for AnnData, Scanpy, pysam, PyYAML, psutil, scikit-learn, and tspex to support the new analysis and data backends.

Removed/Obsolete:

  • Removed the outdated AlphaGenome variant prediction streamer design spec, which is no longer relevant to the current codebase direction.

Summary by CodeRabbit

  • New Features

    • Added hvantk ptm constraint command for stratified PTM depletion analysis with statistical testing and HTML reports
    • Replaced MatrixTable expression format with AnnData (H5AD) for improved compatibility and performance
    • Enhanced expression matrix creation from UCSC, Expression Atlas, and CPTAC data sources
  • Documentation

    • Added comprehensive PTM constraint analysis guide with CLI examples and output descriptions

enriquea and others added 30 commits April 8, 2026 00:51
Phosphorylation-aware variant classification to evaluate whether
multi-source phospho evidence adds signal for ClinVar VUS
reclassification on top of existing pathogenicity predictors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add converters module with hail_mt_to_anndata and anndata_to_hail_mt
functions for crossing between variant-keyed Hail data and expression
AnnData in the scverse integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes np.bool compatibility workaround for Hail 0.2.x with numpy >= 1.24.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
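The workaround mentioned above follows a common pattern: NumPy >= 1.24 removed the deprecated `np.bool` alias that Hail 0.2.x still references at import time, so the alias is restored before Hail is imported. A minimal sketch of such a shim (illustrative, not the exact code in `hail_context.py`):

```python
import numpy as np

# NumPy >= 1.24 removed the deprecated `np.bool` alias that Hail 0.2.x still
# references at import time. Re-adding it before `import hail` keeps those
# releases importable without patching Hail itself.
if not hasattr(np, "bool"):
    np.bool = bool  # restore the removed alias for legacy consumers
```

Because the shim is a module-level side effect, it must run before the first `import hail` anywhere in the process — which is why a later commit moves it into `hail_context.py`.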
Add load_ucsc_metadata and create_anndata_from_ucsc_matrix to ucsc.py
alongside existing Hail functions, and build_ucsc_ad to matrix_builders.
Hail imports are now lazy so AnnData paths work without Hail installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add convert_sdrf_to_dataframe and create_anndata_from_expression_atlas
to expression_atlas.py alongside existing Hail functions. Add
build_expression_atlas_ad to matrix_builders.py following the same
pattern as build_ucsc_ad. Make hail imports lazy to allow non-Hail
usage without hail installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add three new functions to matrix_utils.py that operate on AnnData objects
alongside the existing Hail-based functions: describe_expression_ad,
filter_by_metadata_ad, and summarize_expression_ad. Guard hail/anndata
imports to allow standalone usage of either backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Hail MatrixTable builders with AnnData builders in all mkmatrix
subcommands (ucsc, expression-atlas, cptac, cptac-phospho). Remove
Hail-specific options (--min-partitions, --force-bgz, --auto-convert-bgz)
and init_hail() calls. Rewrite expression describe/summarize/markers
commands to operate on .h5ad files using scanpy for differential expression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n up imports

Replace Hail-based matrix builders with AnnData equivalents in the registry.
Remove build_ucsc_mt, build_expression_atlas_mt, build_cptac_mt,
build_cptac_phospho_mt from matrix_builders.py. Remove old Hail helper
functions from ucsc.py, expression_atlas.py, and cptac.py. Delete
wilcoxon_hail.py and obsolete Hail expression test files. Update
test_matrix_utils.py fixture to use Hail imports directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The file was rewritten for AnnData during scverse migration but kept the
Hail-era filename. The leftover Hail histogram function had no live callers
(only @pytest.mark.hail tests against removed Hail builders) and a broken
``"ad.AnnData"`` annotation that triggered an F821 lint error.

- New hvantk/visualization/expression/anndata.py exposes
  visualize_expression_distribution(adata, ...) operating on adata.X.
- Old hail.py removed; visualization facade and subpackage doc updated.
- Test import switched to the new module.
…ze_expression_ad

Completes the cleanup planned in the scverse-integration plan (Task 10
step 5). The Hail-based functions in matrix_utils.py were superseded by
their _ad equivalents when expression I/O migrated to AnnData; their only
remaining callers were two skipped @pytest.mark.hail test files.

- Remove annotate_column_summary, describe_expression_mt, summarize_expression,
  summarize_matrix, filter_by_metadata, filter_by_gene_list, filter_by_expression,
  get_top_expressed_genes (and the optional hail import).
- Delete tests/test_matrix_utils.py and tests/test_summarize_expression.py.
- summarize_expression_ad: stop mutating adata.obs (was writing _group_label
  into the caller's object) and replace the per-row index.get_loc lookup
  with groupby indices for O(n_cells) splitting.
- gene_sets.py: refresh extract_marker_gene_sets docstring to point at the
  AnnData summarizer and note its long-format output requires pivot or
  scanpy rank_genes_groups for direct use.
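The groupby-indices change above can be sketched in isolation. This hypothetical example (names are illustrative, not the actual `summarize_expression_ad` internals) shows why `groupby(...).indices` beats per-row `index.get_loc`: one pass over the labels yields integer row positions for every group at once.

```python
import numpy as np
import pandas as pd

# 6 cells x 2 genes, with a categorical group label per cell.
X = np.arange(12, dtype=float).reshape(6, 2)
groups = pd.Series(["a", "b", "a", "c", "b", "a"])

# One O(n_cells) pass: group label -> array of integer row positions,
# replacing a per-row index.get_loc lookup.
positions = groups.groupby(groups).indices

# Split the matrix per group without touching the caller's data.
group_means = {g: X[rows].mean(axis=0) for g, rows in positions.items()}
```

Note that this also avoids mutating the source object, in line with the commit's removal of the `_group_label` write into the caller's `adata.obs`.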
…hail_context

- converters.anndata_to_hail_mt previously built the long-format DataFrame
  with a Python double loop over n_obs * n_var entries (200M dict allocs
  for a modest 10k x 20k matrix). Replace with np.repeat / np.tile /
  X.ravel — single pass, no Python loop.
- The np.bool monkey-patch (Hail 0.2.x compat with NumPy >= 1.24) lived
  as a side-effect at the top of converters.py. Move it into hail_context.py
  so it runs before any Hail import regardless of which module pulls Hail
  in first. converters.py now imports hl via hail_context to guarantee the
  shim is applied.
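The np.repeat / np.tile / X.ravel construction described above can be sketched as follows (column names are hypothetical; the real converter's schema may differ). A single vectorized pass builds the long-format frame, instead of a Python double loop allocating one dict per (obs, var) pair:

```python
import numpy as np
import pandas as pd

# n_obs=2 samples x n_var=3 genes, dense matrix in row-major order.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
obs_names = np.array(["s1", "s2"])
var_names = np.array(["g1", "g2", "g3"])

# Long format in one pass, no Python-level loop:
long_df = pd.DataFrame({
    "sample": np.repeat(obs_names, len(var_names)),  # s1 s1 s1 s2 s2 s2
    "gene":   np.tile(var_names, len(obs_names)),    # g1 g2 g3 g1 g2 g3
    "value":  X.ravel(),                             # row-major flatten
})
```

`np.repeat` stretches the observation axis, `np.tile` cycles the variable axis, and `X.ravel()` flattens in matching row-major order, so the three columns stay aligned by construction.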
test_dataset_download_returns_intermediate_tsv mocked
urllib.request.urlretrieve, but the production code calls requests.get.
The mock was a no-op, so the test attempted a real network download and
hung pytest -q indefinitely. Mock requests.get with a context-manager
fake response that streams the bundled fixture zip.
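A context-manager fake response for `requests.get` can look roughly like the sketch below. The `iter_content` / `raise_for_status` surface mirrors common `requests` usage; the actual production call path and test fixture are assumptions here, so the downloader is parameterized for illustration.

```python
from unittest import mock

class FakeResponse:
    """Minimal stand-in for requests.Response used as a context manager."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self.status_code = 200

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def raise_for_status(self):
        pass

    def iter_content(self, chunk_size=8192):
        # Stream the fixture bytes in chunks, like a real response body.
        for i in range(0, len(self._payload), chunk_size):
            yield self._payload[i:i + chunk_size]


def download(url, http_get):
    # Hypothetical stand-in for the production downloader.
    chunks = []
    with http_get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4):
            chunks.append(chunk)
    return b"".join(chunks)


fake_get = mock.MagicMock(return_value=FakeResponse(b"fixture-zip-bytes"))
data = download("https://example.org/x.zip", http_get=fake_get)
```

Unlike the old `urllib.request.urlretrieve` mock, this fake intercepts the call the code actually makes, so no network request can slip through and hang the test.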
- matrix_builders.py, cptac.py, expression_atlas.py: remove inner
  ``import anndata as ad`` repeats inside builder functions; module-level
  import already in scope.
- test_matrix_utils_anndata.py: replace the importlib.util workaround
  (which sidestepped hvantk.utils.__init__ over a non-existent Python
  3.10 syntax issue) with a plain import. Project requires Python >= 3.10
  where PEP 604 unions are valid.
Implements the `hvantk ptm constraint` CLI and `hvantk.ptm.constraint`
module from the 2026-04-13 implementation plan. Compares gnomAD allele-
frequency distributions between PTM-proximal and non-PTM variants,
stratified by tissue, cell type, or any categorical metadata field.

Modules added:
- hvantk/ptm/constraint.py — orchestrator, PTMConstraintConfig/Result,
  five statistical tests (per-group ranking, τ quartile, τ × LOEUF
  factorial, PTM category × group heatmap, within-gene Wilcoxon)
- hvantk/ptm/constraint_expression.py — unified backend adapter for
  Hail MatrixTable / AnnData / tabular expression sources
- hvantk/ptm/constraint_plots.py — four-panel figure renderer
- hvantk/ptm/constraint_report.py — self-contained HTML report
- hvantk/utils/tissue_specificity.py — thin tspex wrapper for τ / TSI
  / Gini / Shannon metrics

Wiring:
- hvantk/commands/ptm_cli.py — `ptm constraint` subcommand
- hvantk/ptm/__init__.py — lazy exports for the new API
- pyproject.toml — `constraint` extra (tspex, matplotlib, seaborn)

Docs:
- docs_site/tools/ptm-constraint.md
- docs_site/guide/usage.md (10-line recipe)
- docs_site/architecture.md (module index)

Tests: 6 smoke tests in hvantk/tests/test_ptm_constraint.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
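The τ metric that `tissue_specificity.py` wraps via tspex can be written directly from its standard definition (Yanai-style τ). This standalone sketch illustrates the quantity being computed; it does not reproduce the tspex API or the wrapper's signature:

```python
import numpy as np

def tau(x) -> float:
    """Tissue-specificity τ in [0, 1]:
    0 = uniform expression across groups, 1 = expressed in a single group."""
    x = np.asarray(x, dtype=float)
    if x.max() == 0:
        return 0.0
    xhat = x / x.max()                         # scale to the max group
    return float((1.0 - xhat).sum() / (len(x) - 1))

tau(np.array([5.0, 0.0, 0.0, 0.0]))   # fully tissue-specific
tau(np.array([3.0, 3.0, 3.0, 3.0]))   # uniformly expressed
```

Quartiles of this τ over genes are what the "τ quartile" statistical test above stratifies on.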
…ield

Code-review follow-ups to the initial constraint implementation:

- _anndata_value_column unconditionally returned "mean", so
  --expression-metric median and median_nonzero were silent no-ops on
  the AnnData backend. Replaced the value-column helper with a direct
  numpy aggregation (_aggregate_anndata_direct) that runs per-group
  np.median / np.nanmedian (for nonzero) over adata.X when the user
  requests a median metric.

- _load_variants silently skipped label filtering when the configured
  label_field was absent from the variants HT, leading to results that
  claim to be filtered to TN/TP but actually include every label. Now
  raises ValueError with a clear remediation hint unless the user
  explicitly passes --label-filter all.

- pyproject.toml + requirements.txt: added tspex so CI (which installs
  from requirements.txt) can import the wrapper and the constraint
  smoke tests do not fail with ImportError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
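The direct numpy aggregation described above can be sketched like this. Names and signature are hypothetical (not the actual `_aggregate_anndata_direct`); the point is the per-group median over a dense cells × genes matrix, with the `median_nonzero` variant masking zeros via NaN before `np.nanmedian`:

```python
import numpy as np

def aggregate_by_group(X, labels, metric="median"):
    """Per-group, per-gene aggregation over a dense cells x genes matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for g in np.unique(labels):
        sub = X[labels == g]                       # rows for this group
        if metric == "median":
            out[g] = np.median(sub, axis=0)
        elif metric == "median_nonzero":
            masked = np.where(sub > 0, sub, np.nan)  # drop zeros per gene
            out[g] = np.nanmedian(masked, axis=0)    # NaN if all-zero column
        else:
            raise ValueError(f"unknown metric: {metric}")
    return out
```

With this shape of helper, `--expression-metric median` and `median_nonzero` produce genuinely different values on the AnnData backend instead of silently falling back to the mean.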
The Conda CI workflow has been failing since the AnnData migration
landed on dev because environment.yml never gained anndata/scanpy. The
constraint work also needs tspex; add it alongside so `python-package-
conda.yml` can import every test module.

Added: anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex.
feat(ptm): stratified AF-depletion analysis (`hvantk ptm constraint`)
@enriquea enriquea requested a review from Copilot April 14, 2026 08:18
@enriquea enriquea self-assigned this Apr 14, 2026
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Large-scale architectural migration from Hail MatrixTable to AnnData for expression matrix handling across multiple data sources, introduction of a new PTM constraint analysis pipeline with CLI integration and comprehensive statistical testing, removal of AlphaGenome streamer documentation, and addition of utility modules for AnnData operations and tissue specificity computation.

Changes

Cohort · File(s) · Summary
AnnData Core Infrastructure
hvantk/core/anndata_utils.py, hvantk/core/converters.py
New modules providing AnnData metadata construction, column summarization, save/load utilities, and bidirectional Hail ↔ AnnData conversion functions.
Expression Matrix Builders
hvantk/tables/matrix_builders.py, hvantk/tables/ucsc.py, hvantk/tables/cptac.py, hvantk/tables/expression_atlas.py
Replaced Hail MatrixTable builders with AnnData equivalents; removed MatrixTable-specific options (metadata_index_col, min_partitions, force_bgz, categorical_cols, numeric_cols); delegated metadata handling to pandas/AnnData conventions.
Expression Utilities & CLI
hvantk/utils/matrix_utils.py, hvantk/utils/tissue_specificity.py, hvantk/commands/summarize_expression_cli.py, hvantk/commands/make_matrix_cli.py
Replaced Hail-based matrix utilities (summarize_expression, filter_by_metadata, etc.) with AnnData equivalents; added compute_specificity() wrapper for tissue specificity via tspex; updated CLI to use .h5ad outputs and AnnData-based metadata handling.
PTM Constraint Analysis Pipeline
hvantk/ptm/constraint.py, hvantk/ptm/constraint_expression.py, hvantk/ptm/constraint_plots.py, hvantk/ptm/constraint_report.py, hvantk/ptm/__init__.py, hvantk/commands/ptm_cli.py
New end-to-end stratified constraint analysis with configuration validation, gene-by-group expression loading (supporting Hail MT, AnnData, tabular), statistical testing (Mann–Whitney, τ quartile, LOEUF factorial, category-group heatmap, within-gene paired Wilcoxon), PNG/HTML report generation, and new hvantk ptm constraint CLI subcommand.
PTM Data Processing
hvantk/ptm/mapper.py, hvantk/ptm/pipeline.py, hvantk/ptm/annotate.py
Added TranscriptCDS class and caching for residue-to-genomic mapping; replaced urllib downloads with streaming requests; added BGZF TSV output and concat utilities; extended variant annotation with optional ptm_evidence field.
File & Data Management
hvantk/data/file_utils.py, hvantk/hail_context.py, hvantk/datasets/cptac_phospho_datasets.py, hvantk/datasets/peptideatlas_phospho_datasets.py
Added BgzfWriter class for buffered BGZF output; patched NumPy compatibility shim for Hail import; improved dataset fallback handling for phosphoproteomics sources; replaced urllib downloads with streaming requests.
Visualization
hvantk/visualization/__init__.py, hvantk/visualization/expression/__init__.py, hvantk/visualization/expression/anndata.py, hvantk/visualization/expression/hail.py
Removed Hail MatrixTable visualization module; added AnnData-based visualize_expression_distribution() supporting sparse matrices; updated lazy-import facade to target AnnData implementation.
Test Updates — Deleted
hvantk/tests/test_cli_mkmatrix.py, hvantk/tests/test_cptac.py, hvantk/tests/test_cptac_phospho.py, hvantk/tests/test_expression_tables_hail.py, hvantk/tests/test_matrix_utils.py, hvantk/tests/test_summarize_expression.py, hvantk/tests/test_wilcoxon_hail.py
Removed Hail-dependent tests covering MatrixTable construction, CPTAC/phospho conversion, expression summarization, and Wilcoxon marker detection.
Test Updates — Added
hvantk/tests/test_anndata_utils.py, hvantk/tests/test_converters.py, hvantk/tests/test_expression_builders_anndata.py, hvantk/tests/test_matrix_utils_anndata.py, hvantk/tests/test_ptm_constraint.py
Added AnnData-based tests for utilities, Hail↔AnnData converters, expression builders with CLI integration, matrix operations, and PTM constraint pipeline smoke tests.
Test Updates — Modified
hvantk/tests/test_ptm.py, hvantk/tests/test_peptideatlas_phospho.py
Updated PTM roundtrip test for BGZF output; updated phosphoproteomics download mocking from urllib to requests.
Dependencies & Configuration
.gitignore, environment.yml, pyproject.toml, requirements.txt, hvantk/tables/registry.py, hvantk/tables/table_builders.py
Added ignore patterns for Jupyter/VSCode/Claude; updated dependencies (anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex); updated registry to reference new *_ad builder functions; added min_partitions=16 for PTM table import.
Documentation — Added
docs_site/guide/usage.md, docs_site/tools/ptm-constraint.md
Added usage section for hvantk ptm constraint workflow with expression backend options, grouping/aggregation choices, and output structure; added reference documentation for constraint analysis command with flags and caveats.
Documentation — Modified
docs_site/architecture.md
Expanded ptm/ module description to enumerate new constraint files and describe stratified AF-depletion analysis via hvantk ptm constraint.
Documentation — Removed
docs/superpowers/plans/2026-03-31-alphagenome-streamer.md, docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
Deleted AlphaGenome Variant Prediction Streamer implementation plan and design specification documents.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Suggested labels

enhancement, architecture, documentation, data-formats

Suggested reviewers

  • ypriverol

Poem

🐰 Hop! From tables of Hail to AnnData's way,
New constraint tests bloom—tau quartiles at play!
BGZF blocks cascade, PTM sites align,
One pipeline converges, where markers all shine.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: coverage is 57.78%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Integrate AnnData support and implement PTM constraint analysis' directly matches the two main themes of this changeset, AnnData integration and the new PTM constraint feature.



codacy-production bot commented Apr 14, 2026

Not up to standards ⛔

🔴 Issues 23 high · 18 medium · 59 minor

Alerts:
⚠ 100 issues (threshold: ≤ 0 issues of at least minor severity)

Results:
100 new issues

  Category       Results
  UnusedCode     2 medium · 1 minor
  Documentation  58 minor
  ErrorProne     6 high
  Security       17 high
  Complexity     16 medium

View in Codacy

🟢 Metrics 307 complexity · 4 duplication

  Metric       Results
  Complexity   307
  Duplication  4

View in Codacy


Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new PTM constraint analysis workflow (hvantk ptm constraint) and refactors expression-matrix handling to support AnnData (.h5ad) as a first-class backend (alongside Hail MT and pre-aggregated tabular inputs), with updated builders/CLI/tests/docs.

Changes:

  • Introduces PTM constraint analysis modules + CLI command, producing TSV/plots/HTML report across multiple expression backends.
  • Refactors hvantk mkmatrix and expression builders to produce AnnData objects instead of Hail MatrixTables.
  • Adds AnnData/scanpy/tspex-related utilities, converters, tests, and documentation updates.

Reviewed changes

Copilot reviewed 52 out of 55 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
requirements.txt Adds tspex to pip requirements.
pyproject.toml Adds AnnData/Scanpy deps and new extras for constraint/PTM-related features.
environment.yml Adds conda deps required for AnnData + constraint workflows.
hvantk/visualization/expression/hail.py Removes Hail-backed expression distribution plotting module.
hvantk/visualization/expression/anndata.py Adds AnnData-backed expression distribution plotting.
hvantk/visualization/expression/__init__.py Updates expression visualization namespace docs to reference AnnData backend.
hvantk/visualization/__init__.py Routes visualize_expression_distribution facade to AnnData implementation.
hvantk/utils/wilcoxon_hail.py Removes Hail adapter for Wilcoxon marker detection.
hvantk/utils/tissue_specificity.py Adds tspex wrapper for tissue-specificity metrics (τ, etc.).
hvantk/utils/gene_sets.py Updates marker gene set extraction docstring to clarify expected summary format(s).
hvantk/tests/test_wilcoxon_hail.py Removes Hail Wilcoxon integration tests.
hvantk/tests/test_summarize_expression.py Removes Hail-based expression summarization tests.
hvantk/tests/test_matrix_utils.py Removes Hail MatrixTable-based matrix_utils tests (incl. plotting).
hvantk/tests/test_expression_tables_hail.py Removes Hail integration tests for expression table builders.
hvantk/tests/test_cptac_phospho.py Removes CPTAC phospho dataset tests tied to old pipeline paths.
hvantk/tests/test_cptac.py Removes Hail-based CPTAC conversion tests.
hvantk/tests/test_cli_mkmatrix.py Removes MT-based mkmatrix CLI tests.
hvantk/tests/test_matrix_utils_anndata.py Adds AnnData-based matrix_utils tests.
hvantk/tests/test_expression_builders_anndata.py Adds AnnData builder + mkmatrix CLI integration tests.
hvantk/tests/test_anndata_utils.py Adds tests for AnnData provenance, summaries, and I/O helpers.
hvantk/tests/test_converters.py Adds tests for Hail MT ↔ AnnData converters.
hvantk/tests/test_ptm_constraint.py Adds smoke tests for PTM constraint pipeline components + CLI help.
hvantk/tests/test_ptm.py Updates PTM mapping test to read BGZF output.
hvantk/tests/test_peptideatlas_phospho.py Updates download mocking to match new requests.get implementation.
hvantk/tables/ucsc.py Refactors UCSC loader from Hail MT to pandas/AnnData ingestion.
hvantk/tables/matrix_builders.py Refactors builders to produce AnnData and write .h5ad.
hvantk/tables/expression_atlas.py Refactors Expression Atlas ingestion from Hail to pandas/AnnData.
hvantk/tables/cptac.py Refactors CPTAC ingestion from Hail to AnnData (long + phospho wide).
hvantk/tables/registry.py Updates registry to point matrix adapters to new AnnData builders.
hvantk/tables/table_builders.py Adjusts a Hail import call to set min_partitions=16.
hvantk/core/anndata_utils.py Adds AnnData helpers for provenance metadata, column summaries, save/load.
hvantk/core/converters.py Adds Hail MatrixTable ↔ AnnData converters.
hvantk/core/hail_context.py Adds NumPy compatibility shim (np.bool) before importing Hail.
hvantk/ptm/pipeline.py Adds stdlib-only BGZF writing + switches downloads to requests.
hvantk/ptm/mapper.py Adds CDS mapping cache/binary-search mapping and optimizes GTF parsing.
hvantk/ptm/annotate.py Extends PTM annotation to optionally collect per-site evidence structs.
hvantk/ptm/__init__.py Converts PTM package exports to lazy imports to avoid eager Hail import.
hvantk/ptm/constraint_expression.py Adds expression-source adapter producing gene×group matrices from multiple backends.
hvantk/ptm/constraint_plots.py Adds four-panel plotting for constraint analysis outputs.
hvantk/commands/make_matrix_cli.py Refactors hvantk mkmatrix to produce .h5ad AnnData outputs.
hvantk/commands/ptm_cli.py Adds hvantk ptm constraint CLI command + options.
hvantk/datasets/peptideatlas_phospho_datasets.py Switches download implementation from urlretrieve to requests.
hvantk/datasets/cptac_phospho_datasets.py Adds source fallback logic for CPTAC phospho/clinical retrieval.
hvantk/data/file_utils.py Adds BGZF writer implementation (used for bgz output elsewhere).
docs_site/tools/ptm-constraint.md Adds a user guide and flag reference for PTM constraint command.
docs_site/guide/usage.md Documents how to run PTM constraint analysis in the general usage guide.
docs_site/architecture.md Registers/describe new PTM constraint modules in architecture docs.
docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md Removes obsolete AlphaGenome streamer design spec.
.gitignore Adds ignores for notebook checkpoints, VS Code settings, and .claude/.


Comment thread hvantk/tables/matrix_builders.py Outdated
index_col=metadata_index_col,
)
logger.info("Loading UCSC metadata from %s", metadata_path)
metadata_df = load_ucsc_metadata(metadata_path)
Copilot AI Apr 14, 2026

The CLI passes delimiter, but metadata loading ignores it (load_ucsc_metadata(metadata_path) uses its default tab separator). This makes --delimiter inconsistent and will break non-tab metadata files. Pass sep=delimiter (and potentially expose index_col again if needed) when calling load_ucsc_metadata.

Suggested change
- metadata_df = load_ucsc_metadata(metadata_path)
+ metadata_df = load_ucsc_metadata(metadata_path, sep=delimiter)

Copilot uses AI. Check for mistakes.
Comment on lines +199 to +204
logger.info("Reading CPTAC expression from %s", expression_path)
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
meta_df = meta_df.set_index(sample_id_col)
Copilot AI Apr 14, 2026

build_cptac_ad unconditionally reads with sep="\t", but the docstring says TSV/CSV are supported. Either detect the delimiter from the file extension (or sniff), or update the docstring/CLI to only support TSV to avoid surprising parse failures for CSV inputs.

Comment thread hvantk/tables/matrix_builders.py Outdated
Comment on lines +266 to +269
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
Copilot AI Apr 14, 2026

build_cptac_phospho_ad also hard-codes sep="\t" while the docstring advertises TSV/CSV support. Please align behavior and docs (delimiter detection or explicit TSV-only contract).

Suggested change
- expr_df = pd.read_csv(expression_path, sep="\t")
- logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
- meta_df = pd.read_csv(metadata_path, sep="\t")
+ expr_df = pd.read_csv(expression_path, sep=None, engine="python")
+ logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
+ meta_df = pd.read_csv(metadata_path, sep=None, engine="python")

Comment thread hvantk/ptm/mapper.py Outdated
Comment on lines +121 to +131
 for line in f:
-    if line.startswith("#"):
+    if line[0] == "#":
         continue
-    fields = line.strip().split("\t")

+    # Quick scan for feature type before full split — skip the ~80%
+    # of lines that are gene/exon/UTR/start_codon/stop_codon/etc.
+    # GTF columns are tab-separated; feature type is in column 3.
+    tab1 = line.index("\t")
+    tab2 = line.index("\t", tab1 + 1)
+    tab3 = line.index("\t", tab2 + 1)
+    feature = line[tab2 + 1 : tab3]
Copilot AI Apr 14, 2026

parse_ensembl_gtf assumes every line is non-empty and tab-delimited. line[0] will raise on blank lines, and line.index("\t") will raise ValueError on malformed lines, aborting the whole parse. Add a quick guard (e.g., if not line or line[0] == '#': continue) and handle missing tabs safely (e.g., try/except ValueError or if '\t' not in line: continue).

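The guard suggested above can be sketched as a standalone helper (names are illustrative, not the actual `parse_ensembl_gtf` internals): skip blank and comment lines, and treat lines with fewer than three tabs as malformed rather than letting `ValueError` abort the whole parse.

```python
def iter_cds_lines(lines):
    """Yield only well-formed GTF lines whose feature type (column 3) is CDS."""
    for line in lines:
        if not line.strip() or line[0] == "#":
            continue                      # blank or header/comment line
        try:
            tab1 = line.index("\t")
            tab2 = line.index("\t", tab1 + 1)
            tab3 = line.index("\t", tab2 + 1)
        except ValueError:
            continue                      # malformed line: fewer than 3 tabs
        feature = line[tab2 + 1:tab3]     # GTF column 3 is the feature type
        if feature == "CDS":
            yield line

gtf = [
    "#!genome-build GRCh38\n",
    "\n",
    "1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id \"G1\";\n",
    "1\thavana\tCDS\t12010\t12057\t.\t+\t0\tgene_id \"G1\";\n",
    "not-a-gtf-line\n",
]
cds = list(iter_cds_lines(gtf))
```

The try/except keeps the fast `str.index` scan from the optimized parser while making the loop robust to trailing blank lines and stray non-GTF content.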
Comment thread hvantk/ptm/constraint_expression.py Outdated
Comment on lines +306 to +308
    agg_expr = hl.median(hl.agg.collect(entry))
elif aggfunc == "median_nonzero":
    agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
Copilot AI Apr 14, 2026

For the Hail MT backend, computing the median via hl.median(hl.agg.collect(entry)) (and the filtered variant) collects all per-group entry values into an array per row, which is not scalable for large MTs and can blow executor memory. Consider using Hail's approximate quantile aggregators (e.g., hl.agg.approx_quantiles(entry, 0.5)[0] / hl.agg.approx_median if available) or restricting the Hail backend to mean only.

Suggested change
-     agg_expr = hl.median(hl.agg.collect(entry))
- elif aggfunc == "median_nonzero":
-     agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
+     agg_expr = hl.agg.approx_quantiles(entry, [0.5])[0]
+ elif aggfunc == "median_nonzero":
+     agg_expr = hl.agg.filter(entry > 0, hl.agg.approx_quantiles(entry, [0.5]))[0]

Comment thread hvantk/ptm/pipeline.py Outdated
Comment on lines +24 to +91

import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import struct
import zlib

# ---------------------------------------------------------------------------
# Lightweight BGZF writer (stdlib-only, no Hail dependency).
#
# Mirrors hvantk.data.file_utils.BgzfWriter but lives here so the PTM
# pipeline's pure-Python stages never trigger hvantk.data.__init__, which
# eagerly imports Hail.
# ---------------------------------------------------------------------------

_BGZF_BLOCK_SIZE = 65280 # max uncompressed payload per BGZF block


def _make_bgzf_block(data: bytes) -> bytes:
"""Compress *data* into a single BGZF block."""
comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
compressed = comp.compress(data) + comp.flush()
bsize = 18 + len(compressed) + 8 - 1
header = (
b"\x1f\x8b\x08\x04"
b"\x00\x00\x00\x00"
b"\x00\xff"
+ struct.pack("<H", 6)
+ b"BC"
+ struct.pack("<H", 2)
+ struct.pack("<H", bsize)
)
crc = zlib.crc32(data) & 0xFFFFFFFF
trailer = struct.pack("<I", crc) + struct.pack("<I", len(data) & 0xFFFFFFFF)
return header + compressed + trailer


class BgzfWriter:
"""Buffered BGZF text writer for producing block-gzipped TSV files."""

def __init__(self, path: str, encoding: str = "utf-8") -> None:
self._fout = open(path, "wb")
self._encoding = encoding
self._buf = bytearray()

def __enter__(self) -> "BgzfWriter":
return self

def __exit__(self, *exc) -> None:
self.close()

def write(self, text: str) -> None:
self._buf.extend(text.encode(self._encoding))
while len(self._buf) >= _BGZF_BLOCK_SIZE:
chunk = bytes(self._buf[:_BGZF_BLOCK_SIZE])
self._buf = self._buf[_BGZF_BLOCK_SIZE:]
self._fout.write(_make_bgzf_block(chunk))

def close(self) -> None:
if self._fout.closed:
return
if self._buf:
self._fout.write(_make_bgzf_block(bytes(self._buf)))
self._buf.clear()
self._fout.write(_make_bgzf_block(b""))
self._fout.close()
Copilot AI Apr 14, 2026

BgzfWriter is now implemented both here and in hvantk.data.file_utils, which creates a maintenance risk (bugfixes can diverge). Since importing hvantk.data.file_utils triggers the heavy hvantk.data.__init__, consider moving the BGZF writer to a lightweight shared module (e.g. hvantk.core.bgzf) or making hvantk.data.__init__ avoid eager heavy imports so both callers can reuse a single implementation.

Suggested change (remove the duplicated `_make_bgzf_block` / `BgzfWriter` definitions shown above and replace the stdlib-only section with a loader for the shared implementation):

import importlib.util
from pathlib import Path

import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# ---------------------------------------------------------------------------
# Reuse the shared BGZF writer implementation without importing
# hvantk.data.__init__, which may eagerly import heavy dependencies.
# ---------------------------------------------------------------------------
def _load_shared_bgzf_writer():
"""Load hvantk.data.file_utils.BgzfWriter without package import side effects."""
file_utils_path = Path(__file__).resolve().parents[1] / "data" / "file_utils.py"
spec = importlib.util.spec_from_file_location("hvantk._bgzf_file_utils", file_utils_path)
if spec is None or spec.loader is None:
raise ImportError(f"Unable to load BGZF writer module from {file_utils_path}")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
try:
return module.BgzfWriter
except AttributeError as exc:
raise ImportError(
f"Module loaded from {file_utils_path} does not define BgzfWriter"
) from exc
BgzfWriter = _load_shared_bgzf_writer()

Copilot uses AI. Check for mistakes.
Comment thread hvantk/core/converters.py
Comment on lines +68 to +83
# Collect row (var) and col (obs) annotations
row_df = mt.rows().to_pandas()
row_df = row_df.set_index(row_key)

col_df = mt.cols().to_pandas()
col_df = col_df.set_index(col_key)

# Collect entries and pivot to dense matrix
entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

# Reindex to match collected row/col order
pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

X = pivot.values.astype(np.float32)

Copilot AI Apr 14, 2026

hail_mt_to_anndata pulls rows(), cols(), and especially entries() into pandas and pivots to a dense matrix. This will only work for very small MatrixTables and can easily OOM the driver for real expression MTs. It would help to either document/guard this as a small-MT-only helper, or provide an export/chunked path (or a sparse AnnData build) for larger inputs.
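A guard of the kind suggested here could look like the following sketch; the `HVANTK_MAX_MATRIX_CELLS` environment variable, the default limit, and the helper name are illustrative assumptions, not existing hvantk behavior:

```python
import os

# Hypothetical guard; the constant and env var are illustrative assumptions.
DEFAULT_MAX_MATRIX_CELLS = 50_000_000  # ~200 MB dense at float32

def check_dense_conversion_size(n_rows, n_cols, max_cells=None):
    """Fail fast instead of OOM-ing when a dense pivot would be too large."""
    if max_cells is None:
        max_cells = int(os.environ.get("HVANTK_MAX_MATRIX_CELLS",
                                       DEFAULT_MAX_MATRIX_CELLS))
    n_entries = n_rows * n_cols
    if n_entries > max_cells:
        raise ValueError(
            f"Matrix is {n_rows} x {n_cols} ({n_entries} entries), above the "
            f"dense-conversion limit of {max_cells}; export in chunks or "
            f"build a sparse AnnData instead."
        )

# Demo: a small matrix passes, a realistic expression MT is rejected.
small_ok = True
try:
    check_dense_conversion_size(1_000, 2_000)
except ValueError:
    small_ok = False
try:
    check_dense_conversion_size(200_000, 300_000)
    big_blocked = False
except ValueError:
    big_blocked = True
```

Calling this with `mt.count_rows()` / `mt.count_cols()` before the pivot turns a driver OOM into an actionable error.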

Comment on lines +35 to +38
X = adata.X
if sp.issparse(X):
X = X.toarray()
values = np.asarray(X).ravel()
Copilot AI Apr 14, 2026

adata.X can be a large sparse matrix; calling X.toarray() materializes the full dense matrix and can OOM for realistic scRNA-seq datasets. For a histogram, consider using X.data (non-zeros) plus an implicit zero count, or sampling values, so sparse inputs stay memory-safe.
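The X.data-plus-implicit-zero-count idea can be sketched as follows; `sparse_value_sample` is a hypothetical helper, not part of the module, and it assumes a CSR/CSC/COO matrix whose `.data` holds the stored entries:

```python
import numpy as np
import scipy.sparse as sp

def sparse_value_sample(X, max_samples=1_000_000, seed=42):
    """Collect histogram values without densifying a sparse matrix.

    Hypothetical helper: returns (values, n_implicit_zeros) so the caller
    can draw the zero mass as a single bar instead of materializing it.
    """
    rng = np.random.default_rng(seed)
    if sp.issparse(X):
        values = np.asarray(X.data).ravel()  # stored (non-zero) entries only
        n_zeros = X.shape[0] * X.shape[1] - X.nnz
    else:
        values = np.asarray(X).ravel()
        n_zeros = 0
    if values.size > max_samples:
        values = rng.choice(values, max_samples, replace=False)
    return values, n_zeros

m = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
vals, n_zeros = sparse_value_sample(m)
```

This keeps peak memory proportional to `nnz` (or `max_samples`) rather than `n_obs * n_vars`.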


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

🧹 Nitpick comments (15)
hvantk/datasets/peptideatlas_phospho_datasets.py (1)

441-446: Consider moving the requests import to module level.

The requests import is placed inside the method. While this delays the import cost, the requests library is already a common dependency used elsewhere in the codebase (e.g., hvantk/data/file_utils.py). Moving it to the top of the file with other imports would be more consistent with the module's style.

♻️ Suggested change

At the top of the file (after line 27):

import requests

Then in download():

-            import requests
-
             with requests.get(self.zip_url, stream=True, timeout=600) as resp:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 441 - 446,
Move the inline import of requests out of the download method and add it to the
module-level imports so the library is imported once for the module; update the
top of the file to include "import requests" and remove the "import requests"
line inside the download() block (the code using requests remains the same:
requests.get(self.zip_url, stream=True, timeout=600) ... resp.raise_for_status()
... shutil.copyfileobj(resp.raw, fout)).
hvantk/tables/expression_atlas.py (1)

214-214: Consider handling non-numeric values gracefully.

The .astype(np.float32) conversion will fail if sample columns contain non-numeric values (e.g., "NA" strings not recognized by pandas). Consider using pd.to_numeric with errors='coerce' for robustness:

♻️ Suggested improvement
-    X = df[sample_cols].values.T.astype(np.float32)
+    # Convert to numeric, coercing errors to NaN
+    X = df[sample_cols].apply(pd.to_numeric, errors='coerce').values.T.astype(np.float32)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/expression_atlas.py` at line 214, The current assignment X =
df[sample_cols].values.T.astype(np.float32) will raise on non-numeric strings;
change to coerce non-numeric values to NaN first (e.g. replace the
df[sample_cols] selection with df[sample_cols].apply(lambda c: pd.to_numeric(c,
errors='coerce')) or df[sample_cols].apply(pd.to_numeric, errors='coerce')) and
then convert to float32 and transpose to produce X; adjust downstream code to
handle resulting NaNs (e.g., drop or impute) as needed.
.gitignore (1)

212-213: Duplicate .ipynb_checkpoints entry.

Line 79 already ignores .ipynb_checkpoints (without trailing slash). This new entry with the trailing slash is redundant since both patterns achieve the same result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 212 - 213, Remove the duplicate .ipynb_checkpoints
entry from the ignore list: keep only one pattern for ".ipynb_checkpoints"
(either with or without the trailing slash) and delete the redundant line that
adds ".ipynb_checkpoints/" so the .gitignore contains a single canonical entry
for that pattern.
hvantk/data/file_utils.py (1)

462-466: Unused _path attribute.

The _path attribute is stored but never referenced. Consider removing it or using it in error messages/logging.

♻️ Suggested fix
     def __init__(self, path: str, encoding: str = "utf-8") -> None:
-        self._path = path
         self._encoding = encoding
         self._fout = open(path, "wb")
         self._buf = bytearray()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/file_utils.py` around lines 462 - 466, The __init__ stores
self._path but never uses it; either remove the unused attribute or keep and use
it for context (e.g., include self._path in error messages/logging and in
exception handling around opening the file). Update the initializer in the
class's __init__ (and any related error handling elsewhere in this class) to
either drop the assignment to self._path or reference self._path in log/raise
messages when calling open(path, "wb") and when reporting I/O errors so the file
path is not lost.
hvantk/tests/test_ptm.py (1)

183-186: Consider moving gzip import to module level.

The gzip module is a standard library module that could be imported at the top of the file alongside other imports for consistency and clarity.

Suggested change

Add to the imports section at the top of the file:

import gzip

Then simplify the test code:

     # Read back and verify (BGZF is gzip-compatible)
-    import gzip
     with gzip.open(output_bgz, "rt") as f:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm.py` around lines 183 - 186, Move the local "import
gzip" out of the test body and add it to the module-level imports at the top of
hvantk/tests/test_ptm.py, then remove the inline import so the block that opens
output_bgz with gzip.open(...) and constructs csv.DictReader(...) uses the
top-level gzip; update any linters/import-grouping if needed.
hvantk/visualization/expression/anndata.py (1)

35-38: Consider memory implications for large sparse matrices.

Converting a large sparse matrix to dense and raveling all values into memory could cause issues with very large datasets. For visualization purposes, consider sampling a subset of values when the matrix is large.

Optional: Add sampling for large matrices
+    MAX_SAMPLES = 1_000_000  # Limit samples for memory efficiency
+
     X = adata.X
     if sp.issparse(X):
-        X = X.toarray()
-    values = np.asarray(X).ravel()
+        # Sample from sparse data to avoid memory explosion
+        data = X.data
+        if len(data) > MAX_SAMPLES:
+            rng = np.random.default_rng(42)
+            data = rng.choice(data, MAX_SAMPLES, replace=False)
+        values = data
+    else:
+        values = np.asarray(X).ravel()
+        if len(values) > MAX_SAMPLES:
+            rng = np.random.default_rng(42)
+            values = rng.choice(values, MAX_SAMPLES, replace=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/visualization/expression/anndata.py` around lines 35 - 38, The current
code unconditionally densifies adata.X (X = X.toarray()) which can OOM for large
sparse matrices; instead, when sp.issparse(X) detect large matrices by checking
X.nnz or X.shape and apply sampling: use X.data (the nonzero values) and
randomly sample up to a MAX_SAMPLES limit, and if you need zeros represented
also randomly sample indices from the implicit zeros to approximate
distribution; then build values from the sampled nonzero and sampled-zero values
rather than calling toarray(); update the block around X, sp.issparse(X), and
values to implement this conditional sampling strategy.
pyproject.toml (1)

31-32: Make scanpy an optional dependency.

scanpy is imported only once, within function scope in summarize_expression_cli.py (line 289), and is not used elsewhere in the codebase. This minimal and localized usage justifies making it optional to reduce the dependency footprint for users who don't need marker gene analysis.

Suggested change
 anndata = ">=0.10.0"
-scanpy = ">=1.10.0"
+scanpy = { version = ">=1.10.0", optional = true }

 [tool.poetry.extras]
 viz = ["matplotlib", "seaborn", "plotly"]
 duckdb = ["duckdb"]
 interactive = ["plotly"]
 hgc = ["matplotlib", "seaborn", "plotly", "gnomad"]
 psroc = ["matplotlib", "plotly"]
 ptm = ["cptac"]
 constraint = ["tspex", "matplotlib", "seaborn"]
+markers = ["scanpy"]

Then add a runtime check in summarize_expression_cli.py:

try:
    import scanpy as sc
except ImportError as exc:
    raise ImportError(
        "scanpy is required for marker analysis. "
        "Install with 'pip install hvantk[markers]'"
    ) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 31 - 32, Update pyproject.toml to make scanpy
optional by removing it from the main dependencies and adding an extras group
(e.g., extras = { markers = ["scanpy>=1.10.0"] }) so users can install
hvantk[markers] when they need marker analysis; then in
summarize_expression_cli.py, wrap the local import of scanpy (used in the
marker-related function) in a try/except ImportError and raise a clear
ImportError message pointing users to install the markers extra (e.g., "scanpy
is required for marker analysis. Install with 'pip install hvantk[markers]'") so
the runtime check fails fast with guidance.
hvantk/tests/test_ptm_constraint.py (1)

90-108: Use tmp_path fixture instead of hardcoded /tmp/out.

While this config is only validated (not executed), using a hardcoded /tmp path is flagged as a security concern and doesn't follow test isolation best practices.

📝 Proposed fix
-def test_constraint_config_validate():
+def test_constraint_config_validate(tmp_path):
     from hvantk.ptm.constraint import PTMConstraintConfig

     cfg = PTMConstraintConfig(
         variants_ht_path="/does/not/exist.ht",
         expression_source="bogus-source",
         expression_path="/also/missing.tsv",
         grouping="tissue",
-        output_dir="/tmp/out",
+        output_dir=str(tmp_path / "out"),
         label_filter="NOT_A_LABEL",
         min_cells_per_group=-5,
         min_variants_per_group=0,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm_constraint.py` around lines 90 - 108, The test
test_constraint_config_validate uses a hardcoded output_dir "/tmp/out"; change
the test to accept the pytest tmp_path fixture, pass output_dir=str(tmp_path)
when constructing PTMConstraintConfig, and update the test function signature to
def test_constraint_config_validate(tmp_path): so the config uses an isolated
temporary directory rather than a global /tmp path while keeping all other
validation assertions unchanged.
hvantk/tests/test_expression_builders_anndata.py (1)

12-24: Consider adding strict=True to zip() calls.

While the test data is controlled, using strict=True would catch accidental length mismatches during test maintenance.

📝 Proposed fix
     def _write_expression_matrix(self, path, genes, cells, values):
         """Write a genes-x-cells TSV expression matrix."""
         with open(path, "w") as fh:
             fh.write("gene\t" + "\t".join(cells) + "\n")
-            for gene, row in zip(genes, values):
+            for gene, row in zip(genes, values, strict=True):
                 fh.write(gene + "\t" + "\t".join(str(v) for v in row) + "\n")

     def _write_metadata(self, path, cell_ids, cell_types):
         """Write a metadata TSV with cell_id index and cell_type column."""
         with open(path, "w") as fh:
             fh.write("cell_id\tcell_type\n")
-            for cid, ct in zip(cell_ids, cell_types):
+            for cid, ct in zip(cell_ids, cell_types, strict=True):
                 fh.write(f"{cid}\t{ct}\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_expression_builders_anndata.py` around lines 12 - 24, The
zip() calls in _write_expression_matrix and _write_metadata can silently ignore
length mismatches; update both to use zip(..., strict=True) so mismatched
lengths raise an error during tests — change for gene,row in zip(genes, values)
to zip(genes, values, strict=True) in _write_expression_matrix and for cid,ct in
zip(cell_ids, cell_types) to zip(cell_ids, cell_types, strict=True) in
_write_metadata.
docs_site/tools/ptm-constraint.md (1)

83-97: Add language specifier to fenced code block.

The output directory tree block should have a language identifier for consistency and to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 <output-dir>/
 ├── summary.json                     # top-line effect sizes + config echo
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs_site/tools/ptm-constraint.md` around lines 83 - 97, The fenced code
block that starts with the directory tree beginning "<output-dir>/" is missing a
language specifier; update the opening fence from ``` to ```text so the block
becomes a ```text fenced code block (i.e., add "text" after the three backticks)
to satisfy markdown linting and ensure consistent rendering of the directory
tree.
hvantk/commands/summarize_expression_cli.py (1)

163-199: Consider using ctx.exit(1) for consistency.

Line 170 uses raise SystemExit(1) while other commands in this module use ctx.exit(1). Both work, but ctx.exit(1) is more consistent with Click patterns used elsewhere in the codebase.

📝 Proposed fix
+@click.pass_context
 def summarize_expression_cmd(
+    ctx,
     matrix_path,
     ...
 ):
     ...
     if output_path.exists() and not overwrite:
         click.echo(
             f"Error: Output file already exists: {output_path}\n"
             "Use --overwrite to replace it.",
             err=True,
         )
-        raise SystemExit(1)
+        ctx.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/summarize_expression_cli.py` around lines 163 - 199, Replace
the direct raise SystemExit(1) in the output existence check with a Click-style
exit call for consistency: use ctx.exit(1) (or, if the command function does not
receive a ctx parameter, call click.get_current_context().exit(1)). Update the
code at the output_path existence branch (the block that currently does
click.echo(...) then raise SystemExit(1)) to call the appropriate ctx.exit(1)
instead.
hvantk/commands/make_matrix_cli.py (2)

116-135: Consider exposing --gene-name-column option for completeness.

The build_expression_atlas_ad function accepts a gene_name_column parameter (default "Gene Name") that isn't exposed in the CLI. While the default works for standard Expression Atlas files, users with non-standard column names would need to modify the code.

This is optional since the default handles the common case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 116 - 135, The CLI function
mkmatrix_expression_atlas should expose a --gene-name-column option and forward
it to the internal _build_expression_atlas_ad call; add a new parameter to
mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that
maps to the CLI flag --gene-name-column, and pass that value into
_build_expression_atlas_ad via its gene_name_column argument so non-standard
Expression Atlas files work without editing code.

222-240: Missing --site-id-col CLI option for CPTAC phospho builder.

The build_cptac_phospho_ad function accepts a site_id_col parameter (default "SiteID") that isn't exposed in the CLI. This is inconsistent with the cptac command which exposes all column configuration options.

Users with phospho matrices using a different site identifier column name (e.g., "Site", "PhosphoSite") would be unable to process them via CLI.

Proposed fix to add site_id_col option
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
+@click.option("--site-id-col", default="SiteID", show_default=True, help="Column containing site identifiers")
 @click.option("-w", "--overwrite", is_flag=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_path,
     sample_id_col,
+    site_id_col,
     overwrite,
 ):
     """Build a CPTAC phospho AnnData object from site intensities and metadata."""
     logger.info("Building CPTAC phospho AnnData")

     adata = _build_cptac_phospho_ad(
         expression_path=expression,
         metadata_path=metadata,
         output_path=output_path,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         overwrite=overwrite,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 222 - 240, Add a new Click
option "--site-id-col" (arg name site_id_col, default "SiteID",
show_default=True) to the mkmatrix_cptac_phospho command and forward that
site_id_col argument into the call to _build_cptac_phospho_ad so it passes
site_id_col=site_id_col; this exposes the build_cptac_phospho_ad site_id_col
parameter (default "SiteID") to the CLI so users can specify custom phospho site
identifier column names.
hvantk/tables/matrix_builders.py (2)

10-21: Unused import and unsorted __all__.

  1. List is imported from typing but never used in this module.
  2. Static analysis flags that __all__ is not sorted (RUF022).
Proposed fix
 import anndata as ad
 import logging
-from typing import Optional, List
+from typing import Optional

 logger = logging.getLogger(__name__)

 __all__ = [
+    "build_cptac_ad",
+    "build_cptac_phospho_ad",
+    "build_expression_atlas_ad",
     "build_ucsc_ad",
-    "build_expression_atlas_ad",
-    "build_cptac_ad",
-    "build_cptac_phospho_ad",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 10 - 21, Remove the unused
typing import and sort the exported symbols: delete the unused "List" import
from the typing import list and then reorder the __all__ list alphabetically
(e.g., "build_cptac_ad", "build_cptac_phospho_ad", "build_expression_atlas_ad",
"build_ucsc_ad") so static analysis warnings for the unused import (List) and
unsorted __all__ are resolved; keep the existing function names (build_ucsc_ad,
build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.

199-204: Hardcoded tab separator may fail for comma-delimited CSV files.

The docstring mentions "TSV/CSV" support, but pd.read_csv uses hardcoded sep="\t". Users with comma-separated files would encounter parsing errors.

Consider either:

  1. Adding a delimiter parameter (consistent with UCSC/Expression Atlas builders)
  2. Updating the docstring to clarify only TSV is supported
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 199 - 204, The code hardcodes
a tab separator when reading CPTAC files (logger.info + pd.read_csv for
expression_path and metadata_path, producing expr_df and meta_df), which breaks
comma-delimited CSV input; add a new parameter (e.g., delimiter or sep) to the
enclosing function and pass it into both pd.read_csv calls (use the default "\t"
to preserve current behavior) or alternatively update the docstring to state
only TSV is supported; ensure meta_df = meta_df.set_index(sample_id_col) still
runs after reading with the chosen delimiter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/core/anndata_utils.py`:
- Around line 91-98: The numeric-aggregation code uses
np.nanmin/np.nanmax/np.nanmean on series.values which can be an ExtensionArray
for pandas nullable dtypes; convert the Series to a standard float NumPy array
or use pandas' native aggregations to avoid TypeError. Update the logic around
the variable 'series' (the Series from adata.obs[col]) so that before reducing
you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").
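The conversion this prompt describes, as a minimal pandas/NumPy sketch (the column contents are made up):

```python
import numpy as np
import pandas as pd

# A nullable-dtype column of the kind adata.obs might hold; np.nanmean
# cannot reduce the backing ExtensionArray directly, so convert first.
series = pd.Series([1, 2, pd.NA], dtype="Int64")
arr = series.to_numpy(dtype=float, na_value=np.nan)

summary = {
    "min": float(np.nanmin(arr)),
    "max": float(np.nanmax(arr)),
    "mean": float(np.nanmean(arr)),
}
```

`series.min(skipna=True)` / `.max` / `.mean` would work equally well and avoid the intermediate array.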

In `@hvantk/core/converters.py`:
- Around line 69-82: hail_mt_to_anndata and anndata_to_hail_mt currently
materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
- Around line 55-59: The resolver currently takes the first element of
mt.row_key/mt.col_key without validating cardinality; update the logic around
row_key and col_key so you explicitly validate mt.row_key and mt.col_key: if
len(mt.row_key) == 0 or len(mt.col_key) == 0 raise a clear ValueError indicating
the MatrixTable is unkeyed, and if len(...) > 1 raise a ValueError indicating a
composite key and require the caller to supply an explicit row_key/col_key; only
when len(...) == 1 set row_key = list(mt.row_key)[0] (and similarly for
col_key). Ensure the error messages reference row_key/col_key and MatrixTable so
callers can fix input.
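The cardinality check could be factored as a small helper; `resolve_single_key` is hypothetical (not hvantk API) and takes a plain list standing in for `list(mt.row_key)` / `list(mt.col_key)`:

```python
def resolve_single_key(keys, kind):
    """Pick the single key field from a MatrixTable key list, failing loudly.

    Hypothetical helper illustrating the validation the prompt asks for.
    """
    if len(keys) == 0:
        raise ValueError(f"MatrixTable is unkeyed; set a {kind} key before converting.")
    if len(keys) > 1:
        raise ValueError(
            f"MatrixTable has a composite {kind} key {keys!r}; "
            f"pass an explicit {kind}_key argument instead."
        )
    return keys[0]

row_key = resolve_single_key(["gene_id"], "row")
try:
    resolve_single_key(["chrom", "pos"], "row")
    composite_blocked = False
except ValueError:
    composite_blocked = True
```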

In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 304-312: The per-call fallback in _get_phospho causes inconsistent
sources and overly-broad exception swallowing; instead, probe and pick a single
working source once (e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.

In `@hvantk/ptm/constraint_expression.py`:
- Around line 138-141: The loader currently accepts .pkl/.pickle and calls
pd.read_pickle (variable ext with branches creating df), which is unsafe; remove
the pickle branch entirely and ensure the CLI/validation for --expression-path
rejects .pkl/.pickle (or only allow .parquet/.pq), so the code only calls
pd.read_parquet for supported extensions; update any help/validation code that
lists accepted extensions to exclude pickle and add an explicit unsafe flag if
you want to reintroduce pickle behind opt-in.
- Around line 218-229: The code fails when adata.obs[grouping] contains NaN
because astype(str) yields mixed types that break sorted(set(labels)); fix by
filtering out missing group values before converting to strings and before
computing unique_groups: create a boolean index like valid =
adata.obs[grouping].notna(), then derive labels from adata.obs.loc[valid,
grouping].astype(str).to_numpy() and compute unique_groups from that filtered
labels; ensure subsequent masks and counts (mask, n) operate on the same
filtered index (or use the filtered AnnData slice) so grouping, X, and any
downstream arrays stay aligned with labels and min_cells_per_group checks.
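The NaN-filtering step can be sketched with plain pandas (the `tissue` column name and values are illustrative):

```python
import numpy as np
import pandas as pd

# Cells with a missing group label; both None and np.nan must be dropped
# before astype(str), otherwise they surface as a spurious "nan" group
# (or, for some dtypes, break sorted(set(...))).
obs = pd.DataFrame({"tissue": ["liver", None, "brain", "liver", np.nan]})

valid = obs["tissue"].notna()
labels = obs.loc[valid, "tissue"].astype(str).to_numpy()
unique_groups = sorted(set(labels))
```

Downstream masks and counts should then be computed on the same `valid`-filtered index so they stay aligned with `labels`.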

In `@hvantk/ptm/constraint.py`:
- Around line 147-149: The code always creates/overwrites outputs (os.makedirs
and downstream TSV/summary/plots) without checking config.overwrite; update
run_ptm_constraint() to honor config.overwrite by adding an existence guard at
the start: if config.overwrite is False and config.output_dir already contains
the expected output files (TSV files, summary.json, or plots), abort/raise an
error or return early; conversely, if config.overwrite is True, allow creating
or delete existing outputs before writing. Ensure all places that delete or
re-create config.output_dir/plots only run when config.overwrite is True and
reference run_ptm_constraint(), config.output_dir, config.overwrite, and
summary.json/TSV/plots in your checks.
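An existence guard of the suggested shape, as a sketch (`ensure_output_dir` is a hypothetical helper; `summary.json` is used as the sentinel file per the prompt):

```python
import json
import os
import tempfile

def ensure_output_dir(output_dir, overwrite):
    """Refuse to clobber existing results unless overwrite was requested.

    Hypothetical guard mirroring the check the prompt describes.
    """
    sentinel = os.path.join(output_dir, "summary.json")
    if os.path.exists(sentinel) and not overwrite:
        raise FileExistsError(
            f"{output_dir} already contains results; pass --overwrite to replace them."
        )
    os.makedirs(output_dir, exist_ok=True)

# Demo: fresh dir is fine, rerun without overwrite is blocked, with it is allowed.
demo_dir = os.path.join(tempfile.mkdtemp(), "results")
ensure_output_dir(demo_dir, overwrite=False)
with open(os.path.join(demo_dir, "summary.json"), "w") as fh:
    json.dump({}, fh)
try:
    ensure_output_dir(demo_dir, overwrite=False)
    blocked = False
except FileExistsError:
    blocked = True
ensure_output_dir(demo_dir, overwrite=True)
```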
- Around line 461-465: The qcut currently runs on variant rows so genes with
many variants skew the cut points; instead compute quartile bins using one row
per gene by selecting unique gene entries (e.g.,
df.drop_duplicates(subset="gene_key") or groupby('gene_key') and take a
representative 'tau'/'loeuf' value), run pd.qcut on that deduplicated series
with the same labels, then map/join the resulting per-gene bin assignments back
onto the full variant-level df to set "tau_quartile" (and similarly for the
LOEUF qcut block referenced later) so bin boundaries are derived from unique
genes, not variant rows.
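The per-gene deduplication before pd.qcut can be sketched as follows (column names follow the prompt; the data is made up):

```python
import pandas as pd

# Variant-level table: gene A has many variants and would otherwise
# dominate the quantile boundaries if qcut ran on variant rows.
df = pd.DataFrame({
    "gene_key": ["A", "A", "A", "B", "C", "D", "E"],
    "tau":      [0.9, 0.9, 0.9, 0.1, 0.4, 0.6, 0.8],
})

# One representative row per gene, qcut on that, then map back to variants.
per_gene = df.drop_duplicates(subset="gene_key").set_index("gene_key")["tau"]
bins = pd.qcut(per_gene, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
df["tau_quartile"] = df["gene_key"].map(bins)
```

The same drop-duplicates-then-map pattern applies to the LOEUF qcut block.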
- Around line 269-275: The code currently replaces a missing AF field with zeros
and later uses fillna(0.0), which falsely signals homozygous reference; instead,
when config.af_field is not present, raise an explicit error (e.g., throw/raise
ValueError or RuntimeError) so the pipeline fails fast and the schema issue is
fixed upstream, and remove any row-level fillna(0.0) calls so unparsable/null
AFs remain null (allowing explicit dropping or handling downstream); update the
branch that checks config.af_field (the select["af"] assignment) to raise and
remove/replace any hl.Table/MatrixTable .fillna(0.0) usage on AF to preserve
nulls.

In `@hvantk/ptm/pipeline.py`:
- Around line 209-213: The download currently streams directly into gtf_path
using requests.get and shutil.copyfileobj which can leave a truncated cache file
on failure; change the logic in the block that calls
requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file (e.g.,
using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw into
that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
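The temp-file-then-`os.replace` pattern described above could be sketched like this; `atomic_write` is a hypothetical helper name, and the demo feeds it an in-memory stream rather than `resp.raw`:

```python
import io
import os
import shutil
import tempfile

def atomic_write(dest_path, source_fileobj):
    """Stream into a temp file beside dest_path, then rename into place.

    Hypothetical helper: a failed download never leaves a truncated file at
    dest_path, because os.replace only runs after the copy completes.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(source_fileobj, tmp)
        os.replace(tmp_path, dest_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Demo with an in-memory "response body" instead of requests' resp.raw.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "annotation.gtf.gz")
atomic_write(demo_path, io.BytesIO(b"chr1\thavana\tgene\n"))
with open(demo_path, "rb") as fh:
    demo_content = fh.read()
leftovers = [f for f in os.listdir(demo_dir) if f.endswith(".part")]
```

Creating the temp file in the destination directory (not the system temp dir) is what keeps the final rename on one filesystem and therefore atomic.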
- Around line 490-498: After concatenating CPTAC into all_combined_path and
repointing mapped_path, also set result.mapped_tsv_path = all_combined_path so
the returned result reflects the post-merge file; update this in the branch that
calls _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the
same place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.

In `@hvantk/tables/cptac.py`:
- Around line 89-92: In cptac.py, when joining metadata_df into obs (created
from wide.index) and the later join at lines ~149-153, first compute the overlap
= obs.index.intersection(metadata_df.index) and validate it; if overlap is
empty, log a warning (or raise a ValueError) and avoid doing a blind left join
that yields all-NaN rows; if partial overlap, log how many samples matched and
then perform obs = obs.join(metadata_df, how="left") restricted to the
overlapping index (or ensure rows with no metadata are handled explicitly). Use
the same validation/logging pattern used in hvantk/tables/ucsc.py (checking
adata.obs.index.intersection(metadata_df.index) and warning on empty overlap)
and apply it to the variables obs, metadata_df, and wide.index in both join
locations.
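The overlap validation can be sketched with plain pandas; `join_metadata` is a hypothetical helper mirroring the pattern the prompt cites from hvantk/tables/ucsc.py:

```python
import pandas as pd

def join_metadata(obs, metadata_df):
    """Left-join metadata onto obs, failing loudly on zero index overlap."""
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        raise ValueError(
            "No sample IDs shared between matrix and metadata; "
            "check the sample_id_col setting."
        )
    if len(overlap) < len(obs):
        print(f"warning: metadata matched {len(overlap)}/{len(obs)} samples")
    return obs.join(metadata_df, how="left")

obs = pd.DataFrame(index=["s1", "s2", "s3"])
meta = pd.DataFrame({"tissue": ["liver", "brain"]}, index=["s1", "s2"])
joined = join_metadata(obs, meta)
matched = int(joined["tissue"].notna().sum())
```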

In `@hvantk/utils/matrix_utils.py`:
- Around line 127-139: The current label construction joins stringified fields
with "_", which corrupts group keys whose field values themselves contain
underscores and turns missing values into the literal "nan"; change the logic
in the block that builds labels (the
group_by handling and label_series creation) to build canonical group keys as
tuples or a pandas.MultiIndex from the original adata.obs columns (e.g., using
tuple(zip(...)) or pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop
rows with any missing grouping fields before downstream aggregation, and ensure
label_series uses adata.obs.index (not np.arange(adata.n_obs)); update any
callers (e.g., the AnnData mean path referenced in constraint_expression.py) to
consume tuple/MultiIndex keys or a safely encoded representation.
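One way to build collision-free group keys, as the comment suggests, is a tuple-valued series via pandas.MultiIndex.from_frame; `build_group_keys` is a hypothetical name for the sketch.

```python
import pandas as pd

def build_group_keys(obs: pd.DataFrame, group_by: list) -> pd.Series:
    """Per-observation group keys as tuples, indexed by obs.index.

    Rows with any missing grouping field are dropped before aggregation,
    and underscores inside field values can no longer collide.
    """
    sub = obs[group_by].dropna()
    keys = pd.MultiIndex.from_frame(sub)
    return pd.Series(list(keys), index=sub.index)
```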

In `@requirements.txt`:
- Around line 14-15: requirements.txt is missing runtime deps required by the
new modules (hvantk/core/converters.py, hvantk/tables/cptac.py,
hvantk/tables/ucsc.py) which import anndata; update requirements.txt to include
anndata plus the other packages declared in environment.yml (scanpy>=1.10.0,
pysam>=0.22, psutil, scikit-learn) and any version constraints from
environment.yml so a plain pip install -r requirements.txt will succeed and
match the environment.yml runtime dependencies.

---

Nitpick comments:
In @.gitignore:
- Around line 212-213: Remove the duplicate .ipynb_checkpoints entry from the
ignore list: keep only one pattern for ".ipynb_checkpoints" (either with or
without the trailing slash) and delete the redundant line that adds
".ipynb_checkpoints/" so the .gitignore contains a single canonical entry for
that pattern.

In `@docs_site/tools/ptm-constraint.md`:
- Around line 83-97: The fenced code block that starts with the directory tree
beginning "<output-dir>/" is missing a language specifier; update the opening
fence from ``` to ```text so the block becomes a ```text fenced code block
(i.e., add "text" after the three backticks) to satisfy markdown linting and
ensure consistent rendering of the directory tree.

In `@hvantk/commands/make_matrix_cli.py`:
- Around line 116-135: The CLI function mkmatrix_expression_atlas should expose
a --gene-name-column option and forward it to the internal
_build_expression_atlas_ad call; add a new parameter to
mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that
maps to the CLI flag --gene-name-column, and pass that value into
_build_expression_atlas_ad via its gene_name_column argument so non-standard
Expression Atlas files work without editing code.
- Around line 222-240: Add a new Click option "--site-id-col" (arg name
site_id_col, default "SiteID", show_default=True) to the mkmatrix_cptac_phospho
command and forward that site_id_col argument into the call to
_build_cptac_phospho_ad so it passes site_id_col=site_id_col; this exposes the
build_cptac_phospho_ad site_id_col parameter (default "SiteID") to the CLI so
users can specify custom phospho site identifier column names.
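The shape of the suggested flag, sketched on a stub command; the echo body is purely illustrative, standing in for the forward to `_build_cptac_phospho_ad`.

```python
import click

@click.command()
@click.option(
    "--site-id-col",
    "site_id_col",
    default="SiteID",
    show_default=True,
    help="Column holding phospho site identifiers.",
)
def mkmatrix_cptac_phospho(site_id_col):
    # In the real CLI this value would be forwarded as
    # _build_cptac_phospho_ad(site_id_col=site_id_col).
    click.echo(site_id_col)
```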

In `@hvantk/commands/summarize_expression_cli.py`:
- Around line 163-199: Replace the direct raise SystemExit(1) in the output
existence check with a Click-style exit call for consistency: use ctx.exit(1)
(or, if the command function does not receive a ctx parameter, call
click.get_current_context().exit(1)). Update the code at the output_path
existence branch (the block that currently does click.echo(...) then raise
SystemExit(1)) to call the appropriate ctx.exit(1) instead.

In `@hvantk/data/file_utils.py`:
- Around line 462-466: The __init__ stores self._path but never uses it; either
remove the unused attribute or keep and use it for context (e.g., include
self._path in error messages/logging and in exception handling around opening
the file). Update the initializer in the class's __init__ (and any related error
handling elsewhere in this class) to either drop the assignment to self._path or
reference self._path in log/raise messages when calling open(path, "wb") and
when reporting I/O errors so the file path is not lost.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 441-446: Move the inline import of requests out of the download
method and add it to the module-level imports so the library is imported once
for the module; update the top of the file to include "import requests" and
remove the "import requests" line inside the download() block (the code using
requests remains the same: requests.get(self.zip_url, stream=True, timeout=600)
... resp.raise_for_status() ... shutil.copyfileobj(resp.raw, fout)).

In `@hvantk/tables/expression_atlas.py`:
- Line 214: The current assignment X =
df[sample_cols].values.T.astype(np.float32) will raise on non-numeric strings;
change to coerce non-numeric values to NaN first (e.g. replace the
df[sample_cols] selection with df[sample_cols].apply(lambda c: pd.to_numeric(c,
errors='coerce')) or df[sample_cols].apply(pd.to_numeric, errors='coerce')) and
then convert to float32 and transpose to produce X; adjust downstream code to
handle resulting NaNs (e.g., drop or impute) as needed.
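The coercion in isolation; `to_float_matrix` is a hypothetical wrapper around the flagged assignment.

```python
import numpy as np
import pandas as pd

def to_float_matrix(df: pd.DataFrame, sample_cols: list) -> np.ndarray:
    """Coerce non-numeric entries to NaN, then build a samples x genes float32 matrix."""
    numeric = df[sample_cols].apply(pd.to_numeric, errors="coerce")
    return numeric.values.T.astype(np.float32)
```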

In `@hvantk/tables/matrix_builders.py`:
- Around line 10-21: Remove the unused typing import and sort the exported
symbols: delete the unused "List" import from the typing import list and then
reorder the __all__ list alphabetically (e.g., "build_cptac_ad",
"build_cptac_phospho_ad", "build_expression_atlas_ad", "build_ucsc_ad") so
static analysis warnings for the unused import (List) and unsorted __all__ are
resolved; keep the existing function names (build_ucsc_ad,
build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.
- Around line 199-204: The code hardcodes a tab separator when reading CPTAC
files (logger.info + pd.read_csv for expression_path and metadata_path,
producing expr_df and meta_df), which breaks comma-delimited CSV input; add a
new parameter (e.g., delimiter or sep) to the enclosing function and pass it
into both pd.read_csv calls (use the default "\t" to preserve current behavior)
or alternatively update the docstring to state only TSV is supported; ensure
meta_df = meta_df.set_index(sample_id_col) still runs after reading with the
chosen delimiter.

In `@hvantk/tests/test_expression_builders_anndata.py`:
- Around line 12-24: The zip() calls in _write_expression_matrix and
_write_metadata can silently ignore length mismatches; update both to use
zip(..., strict=True) so mismatched lengths raise an error during tests — change
for gene,row in zip(genes, values) to zip(genes, values, strict=True) in
_write_expression_matrix and for cid,ct in zip(cell_ids, cell_types) to
zip(cell_ids, cell_types, strict=True) in _write_metadata.

In `@hvantk/tests/test_ptm_constraint.py`:
- Around line 90-108: The test test_constraint_config_validate uses a hardcoded
output_dir "/tmp/out"; change the test to accept the pytest tmp_path fixture,
pass output_dir=str(tmp_path) when constructing PTMConstraintConfig, and update
the test function signature to def test_constraint_config_validate(tmp_path): so
the config uses an isolated temporary directory rather than a global /tmp path
while keeping all other validation assertions unchanged.

In `@hvantk/tests/test_ptm.py`:
- Around line 183-186: Move the local "import gzip" out of the test body and add
it to the module-level imports at the top of hvantk/tests/test_ptm.py, then
remove the inline import so the block that opens output_bgz with gzip.open(...)
and constructs csv.DictReader(...) uses the top-level gzip; update any
linters/import-grouping if needed.

In `@hvantk/visualization/expression/anndata.py`:
- Around line 35-38: The current code unconditionally densifies adata.X (X =
X.toarray()) which can OOM for large sparse matrices; instead, when
sp.issparse(X) detect large matrices by checking X.nnz or X.shape and apply
sampling: use X.data (the nonzero values) and randomly sample up to a
MAX_SAMPLES limit, and if you need zeros represented also randomly sample
indices from the implicit zeros to approximate distribution; then build values
from the sampled nonzero and sampled-zero values rather than calling toarray();
update the block around X, sp.issparse(X), and values to implement this
conditional sampling strategy.
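A sketch of the conditional sampling strategy; the `max_samples` cap and the proportional split between nonzeros and implicit zeros are illustrative choices, not the module's current behavior.

```python
import numpy as np
import scipy.sparse as sp

def sample_values(X, max_samples=100_000, rng=None):
    """Return a 1-D sample of entry values for plotting.

    Sparse inputs are never densified: nonzeros come from X.data and the
    implicit zeros are represented by an appropriately sized block of zeros,
    so the sampled distribution approximates the full matrix.
    """
    rng = np.random.default_rng(rng)
    if not sp.issparse(X):
        return np.asarray(X).ravel()
    total = X.shape[0] * X.shape[1]
    nnz_vals = np.asarray(X.data)
    if total <= max_samples:
        # Small matrix: represent every entry, including the implicit zeros.
        return np.concatenate([nnz_vals, np.zeros(total - nnz_vals.size)])
    # Large matrix: sample nonzeros in proportion to their share of entries.
    n_nnz = min(nnz_vals.size, max(1, int(max_samples * nnz_vals.size / total)))
    sampled = rng.choice(nnz_vals, size=n_nnz, replace=False) if n_nnz else np.empty(0)
    return np.concatenate([sampled, np.zeros(max_samples - n_nnz)])
```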

In `@pyproject.toml`:
- Around line 31-32: Update pyproject.toml to make scanpy optional by removing
it from the main dependencies and adding an extras group (e.g., extras = {
markers = ["scanpy>=1.10.0"] }) so users can install hvantk[markers] when they
need marker analysis; then in summarize_expression_cli.py, wrap the local import
of scanpy (used in the marker-related function) in a try/except ImportError and
raise a clear ImportError message pointing users to install the markers extra
(e.g., "scanpy is required for marker analysis. Install with 'pip install
hvantk[markers]'") so the runtime check fails fast with guidance.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f90c4031-e371-48e9-b494-721fe6941d4f

📥 Commits

Reviewing files that changed from the base of the PR and between bd48823 and 03d50b5.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (54)
  • .gitignore
  • docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
  • docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
  • docs_site/architecture.md
  • docs_site/guide/usage.md
  • docs_site/tools/ptm-constraint.md
  • environment.yml
  • hvantk/commands/make_matrix_cli.py
  • hvantk/commands/ptm_cli.py
  • hvantk/commands/summarize_expression_cli.py
  • hvantk/core/anndata_utils.py
  • hvantk/core/converters.py
  • hvantk/core/hail_context.py
  • hvantk/data/file_utils.py
  • hvantk/datasets/cptac_phospho_datasets.py
  • hvantk/datasets/peptideatlas_phospho_datasets.py
  • hvantk/ptm/__init__.py
  • hvantk/ptm/annotate.py
  • hvantk/ptm/constraint.py
  • hvantk/ptm/constraint_expression.py
  • hvantk/ptm/constraint_plots.py
  • hvantk/ptm/constraint_report.py
  • hvantk/ptm/mapper.py
  • hvantk/ptm/pipeline.py
  • hvantk/tables/cptac.py
  • hvantk/tables/expression_atlas.py
  • hvantk/tables/matrix_builders.py
  • hvantk/tables/registry.py
  • hvantk/tables/table_builders.py
  • hvantk/tables/ucsc.py
  • hvantk/tests/test_anndata_utils.py
  • hvantk/tests/test_cli_mkmatrix.py
  • hvantk/tests/test_converters.py
  • hvantk/tests/test_cptac.py
  • hvantk/tests/test_cptac_phospho.py
  • hvantk/tests/test_expression_builders_anndata.py
  • hvantk/tests/test_expression_tables_hail.py
  • hvantk/tests/test_matrix_utils.py
  • hvantk/tests/test_matrix_utils_anndata.py
  • hvantk/tests/test_peptideatlas_phospho.py
  • hvantk/tests/test_ptm.py
  • hvantk/tests/test_ptm_constraint.py
  • hvantk/tests/test_summarize_expression.py
  • hvantk/tests/test_wilcoxon_hail.py
  • hvantk/utils/gene_sets.py
  • hvantk/utils/matrix_utils.py
  • hvantk/utils/tissue_specificity.py
  • hvantk/utils/wilcoxon_hail.py
  • hvantk/visualization/__init__.py
  • hvantk/visualization/expression/__init__.py
  • hvantk/visualization/expression/anndata.py
  • hvantk/visualization/expression/hail.py
  • pyproject.toml
  • requirements.txt
💤 Files with no reviewable changes (11)
  • hvantk/tests/test_expression_tables_hail.py
  • docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
  • hvantk/tests/test_cli_mkmatrix.py
  • hvantk/tests/test_matrix_utils.py
  • hvantk/tests/test_cptac.py
  • hvantk/tests/test_wilcoxon_hail.py
  • hvantk/visualization/expression/hail.py
  • hvantk/tests/test_summarize_expression.py
  • hvantk/tests/test_cptac_phospho.py
  • docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
  • hvantk/utils/wilcoxon_hail.py

Comment on lines +91 to +98
if pd.api.types.is_numeric_dtype(series):
    summary[col] = {
        "dtype": "numeric",
        "min": float(np.nanmin(series.values)),
        "max": float(np.nanmax(series.values)),
        "mean": float(np.nanmean(series.values)),
        "n_missing": n_missing,
    }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
find . -name "anndata_utils.py" -type f

Repository: bigbio/hvantk

Length of output: 87


🏁 Script executed:

# Read the full function context around lines 91-98
head -120 hvantk/core/anndata_utils.py | tail -50

Repository: bigbio/hvantk

Length of output: 1944


🏁 Script executed:

# Also check imports to understand what's being used
head -20 hvantk/core/anndata_utils.py

Repository: bigbio/hvantk

Length of output: 614


🌐 Web query:

numpy nanmin nanmax pandas extension array Float64 Int64 behavior

💡 Result:

NumPy's nanmin and nanmax ignore NaN values and return the min/max among non-NaN elements (or NaN if all elements are NaN); regular min/max propagate NaN, returning NaN if any is present. Pandas nullable extension arrays (Int64, Float64, etc.) use pd.NA (not np.nan) as the missing indicator, backed by data plus a mask. Their .min()/.max() (default skipna=True) skip pd.NA and compute on the valid data; with skipna=False they return pd.NA if any value is missing. Direct use of np.nanmin/np.nanmax on pandas extension arrays is not supported and typically fails with a TypeError, as they are not NumPy arrays and lack full NumPy ufunc compatibility. ExtensionArrays implement a custom _reduce("min"/"max") that handles missing values via the mask rather than delegating to NumPy nanmin/nanmax. Example behaviors (from docs/source): NumPy: np.nanmin(np.array([1, np.nan, 3])) → 1.0; pandas Int64: pd.array([1, None, 3], dtype="Int64").min() → 1 (skips pd.NA); np.nanmin(pd.array([1, None, 3], dtype="Int64")) → TypeError (incompatible). For interoperability, convert to NumPy first, e.g. arr.to_numpy(dtype=float, na_value=np.nan), then apply np.nanmin; note this maps pd.NA to np.nan. Sources confirm via official docs (numpy.nanmin docs, pandas ExtensionArray._reduce, nullable dtype guides, and the IntegerArray/FloatingArray source, which inherit masked reductions from NumericArray).


🌐 Web query:

pandas to_numeric to_numpy dtype conversion numeric series

💡 Result:

In pandas, pd.to_numeric converts a Series (typically object dtype with strings or mixed types) to a numeric Series. The resulting dtype is float64 if the data requires floats or contains NaN (e.g. from errors='coerce'), or int64 if all values are integers with no NaN. Use downcast='integer' or 'signed' for smaller int dtypes like int8 where possible, or dtype_backend='numpy_nullable' for nullable Int64 (capital I). Examples from the docs: s = pd.Series(['1.0', '2', -3]); pd.to_numeric(s) → dtype float64; pd.to_numeric(s, downcast='signed') → dtype int8; s_na = pd.Series(['1', '2', 'apple']); pd.to_numeric(s_na, errors='coerce') → float64 with NaN. Calling .to_numpy() on the resulting numeric Series returns a NumPy ndarray with matching dtype: pandas float64 → np.float64, int64 → np.int64, smaller dtypes like int8 → np.int8. For nullable dtypes (e.g. Int64 from dtype_backend), .to_numpy() coerces to an object-dtype ndarray unless dtype/na_value are passed, with missing values as np.nan or None. For standard numeric Series without NaN, the native NumPy dtype is preserved: pd.Series([1, 2, 3]).to_numpy() → array([1, 2, 3], dtype=int64); pd.Series([1.0, 2.0]).to_numpy() → array([1., 2.], dtype=float64). No dtype loss occurs for non-nullable numeric dtypes; .to_numpy() provides a direct view of the underlying NumPy data (copy=False by default).


Convert numeric columns to standard NumPy float array before reduction, or use pandas native aggregation methods.

If adata.obs[col] uses a pandas nullable numeric dtype (Int64, Float64, etc.), then series.values returns an ExtensionArray that is incompatible with np.nanmin/np.nanmax/np.nanmean, causing a TypeError at runtime. Use .astype(float).values or .to_numpy(dtype=float, na_value=np.nan) to convert before reducing, or call pandas native methods directly (series.min(), series.max(), series.mean() with skipna=True) which handle both regular and nullable dtypes correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/anndata_utils.py` around lines 91 - 98, The numeric-aggregation
code uses np.nanmin/np.nanmax/np.nanmean on series.values which can be an
ExtensionArray for pandas nullable dtypes; convert the Series to a standard
float NumPy array or use pandas' native aggregations to avoid TypeError. Update
the logic around the variable 'series' (the Series from adata.obs[col]) so that
before reducing you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").

Comment thread hvantk/core/converters.py
Comment on lines +55 to +59
# Resolve keys
if row_key is None:
    row_key = list(mt.row_key)[0]
if col_key is None:
    col_key = list(mt.col_key)[0]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "converters.py" | head -20

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

git ls-files hvantk/core/ | head -20

Repository: bigbio/hvantk

Length of output: 351


🏁 Script executed:

cd hvantk/core && find . -name "*.py" -type f | head -20

Repository: bigbio/hvantk

Length of output: 231


🏁 Script executed:

wc -l hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 86


🏁 Script executed:

cat -n hvantk/core/converters.py | head -100

Repository: bigbio/hvantk

Length of output: 3473


🏁 Script executed:

cat -n hvantk/core/converters.py | tail -80

Repository: bigbio/hvantk

Length of output: 3069


🏁 Script executed:

grep -r "row_key\|col_key" hvantk/ --include="*.py" | grep -E "test|spec" | head -20

Repository: bigbio/hvantk

Length of output: 423


🏁 Script executed:

cat -n hvantk/tests/test_converters.py

Repository: bigbio/hvantk

Length of output: 4046


🏁 Script executed:

cd hvantk && find . -name "*.py" -type f -exec grep -l "to_matrix_table\|row_key\|col_key" {} \; | head -15

Repository: bigbio/hvantk

Length of output: 369


🏁 Script executed:

find . -name "*.py" -type f -exec grep -l "composite.*key\|multiple.*key" {} \; | head -10

Repository: bigbio/hvantk

Length of output: 125


🏁 Script executed:

rg "key_rows_by|key_cols_by" -A 3 -B 3 hvantk/ --include="*.py" | head -50

Repository: bigbio/hvantk

Length of output: 498


🏁 Script executed:

rg "row_key.*=.*\[" hvantk/ --include="*.py" | head -20

Repository: bigbio/hvantk

Length of output: 498


🏁 Script executed:

rg "row_key.*=.*\[" hvantk/

Repository: bigbio/hvantk

Length of output: 244


🏁 Script executed:

rg "to_matrix_table" hvantk/ -A 5

Repository: bigbio/hvantk

Length of output: 371


🏁 Script executed:

cat -n hvantk/core/converters.py | sed -n '136,141p'

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

rg "composite.*key|multiple.*key" hvantk/

Repository: bigbio/hvantk

Length of output: 238


🏁 Script executed:

grep -r "IndexError\|ValueError" hvantk/tests/test_converters.py

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

cat -n hvantk/enrichex/burden.py | grep -A 5 -B 5 "row_key_name"

Repository: bigbio/hvantk

Length of output: 1325


🏁 Script executed:

cat -n hvantk/tests/test_table_utils.py | grep -A 20 "Create a table with multiple key fields"

Repository: bigbio/hvantk

Length of output: 917


🏁 Script executed:

rg "key_rows_by|key_cols_by" hvantk/tests/ -A 2 -B 2

Repository: bigbio/hvantk

Length of output: 7781


🏁 Script executed:

rg "composite" hvantk/tests/ -A 3 -B 3

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

cat -n hvantk/tests/test_converters.py | tail -30

Repository: bigbio/hvantk

Length of output: 1453


🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '1,40p'

Repository: bigbio/hvantk

Length of output: 1393


🏁 Script executed:

rg "hail_mt_to_anndata" hvantk/ -B 2 -A 5

Repository: bigbio/hvantk

Length of output: 5428


🏁 Script executed:

python3 << 'EOF'
# Check if we can examine Hail's type system
# Let's look for any documentation or usage patterns

import subprocess
result = subprocess.run(['rg', 'row_key.*len|len.*row_key', 'hvantk/'], 
                       capture_output=True, text=True)
print("Length checks on row_key/col_key:")
print(result.stdout if result.stdout else "No matches found")

result = subprocess.run(['rg', 'range_matrix_table', 'hvantk/tests/', '-A', '10'], 
                       capture_output=True, text=True)
print("\nrange_matrix_table usage:")
print(result.stdout[:2000] if result.stdout else "No matches")
EOF

Repository: bigbio/hvantk

Length of output: 2135


🏁 Script executed:

rg "row_key|col_key" hvantk/tests/test_converters.py -B 5 -A 5

Repository: bigbio/hvantk

Length of output: 986


🏁 Script executed:

python3 << 'EOF'
# Let's check if there's a way to create an unkeyed MatrixTable or what happens
# We can look for any usage of unkeyed tables

import subprocess

# Search for any mention of "unkey" or empty keys
result = subprocess.run(['rg', 'unkey|drop.*key', 'hvantk/'], 
                       capture_output=True, text=True)
print("Unkey operations:")
print(result.stdout if result.stdout else "No matches found")

# Check for any error handling around keys
result = subprocess.run(['rg', 'KeyError|IndexError|ValueError.*key', 'hvantk/'], 
                       capture_output=True, text=True)
print("\nError handling for keys:")
print(result.stdout if result.stdout else "No matches found")
EOF

Repository: bigbio/hvantk

Length of output: 3166


🏁 Script executed:

rg "simple_mt" hvantk/tests/test_converters.py -B 10 | head -40

Repository: bigbio/hvantk

Length of output: 1260


🏁 Script executed:

cat -n hvantk/enrichex/burden.py | sed -n '1545,1560p'

Repository: bigbio/hvantk

Length of output: 823


🏁 Script executed:

cat -n hvantk/ptm/constraint_expression.py | grep -A 3 -B 3 "no row_key"

Repository: bigbio/hvantk

Length of output: 350


🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '45,65p'

Repository: bigbio/hvantk

Length of output: 858


🏁 Script executed:

rg "key_rows_by.*\[|key_cols_by.*\[" hvantk/

Repository: bigbio/hvantk

Length of output: 122


Add explicit validation for single row and column keys.

Lines 57–59 silently use the first key field without checking cardinality. This fails with IndexError on an unkeyed MatrixTable and silently drops key components on a composite-key MT. Defensive checks for this already exist elsewhere in the codebase (e.g., constraint_expression.py). Add explicit validation here.

Suggested guard
     if row_key is None:
-        row_key = list(mt.row_key)[0]
+        if len(mt.row_key) != 1:
+            raise ValueError(
+                "hail_mt_to_anndata requires exactly one row key."
+            )
+        row_key = next(iter(mt.row_key))
     if col_key is None:
-        col_key = list(mt.col_key)[0]
+        if len(mt.col_key) != 1:
+            raise ValueError(
+                "hail_mt_to_anndata requires exactly one column key."
+            )
+        col_key = next(iter(mt.col_key))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    # Resolve keys
    if row_key is None:
        row_key = list(mt.row_key)[0]
    if col_key is None:
        col_key = list(mt.col_key)[0]

After:

    # Resolve keys
    if row_key is None:
        if len(mt.row_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one row key."
            )
        row_key = next(iter(mt.row_key))
    if col_key is None:
        if len(mt.col_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one column key."
            )
        col_key = next(iter(mt.col_key))
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 57-57: Prefer next(iter(mt.row_key)) over single element slice

Replace with next(iter(mt.row_key))

(RUF015)


[warning] 59-59: Prefer next(iter(mt.col_key)) over single element slice

Replace with next(iter(mt.col_key))

(RUF015)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/converters.py` around lines 55 - 59, The resolver currently takes
the first element of mt.row_key/mt.col_key without validating cardinality;
update the logic around row_key and col_key so you explicitly validate
mt.row_key and mt.col_key: if len(mt.row_key) == 0 or len(mt.col_key) == 0 raise
a clear ValueError indicating the MatrixTable is unkeyed, and if len(...) > 1
raise a ValueError indicating a composite key and require the caller to supply
an explicit row_key/col_key; only when len(...) == 1 set row_key =
list(mt.row_key)[0] (and similarly for col_key). Ensure the error messages
reference row_key/col_key and MatrixTable so callers can fix input.

Comment thread hvantk/core/converters.py
Comment on lines +69 to +82
row_df = mt.rows().to_pandas()
row_df = row_df.set_index(row_key)

col_df = mt.cols().to_pandas()
col_df = col_df.set_index(col_key)

# Collect entries and pivot to dense matrix
entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

# Reindex to match collected row/col order
pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

X = pivot.values.astype(np.float32)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name converters.py -type f | head -20

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

wc -l ./hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 88


🏁 Script executed:

cat -n ./hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 5747


🏁 Script executed:

rg -n "hail_mt_to_anndata|anndata_to_hail_mt" --type py -B 2 -A 2 | head -50

Repository: bigbio/hvantk

Length of output: 3560


🏁 Script executed:

rg -n "size|memory|OOM|large|limit|guard" hvantk/core/converters.py -i

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

rg -n "stream|block|chunk|slice" hvantk/core/ -i --type py

Repository: bigbio/hvantk

Length of output: 950


Both conversion paths unconditionally materialize the entire matrix on the driver.

Lines 69, 72, and 76 in hail_mt_to_anndata call to_pandas() three times (rows, cols, and entries), then pivot the entire entries DataFrame into a dense matrix. Lines 118 and 133 in anndata_to_hail_mt convert sparse matrices to dense via toarray() and then ravel() the entire matrix into memory. Neither function includes size validation, memory guards, or a streaming alternative. This design will cause OOM failures on realistic expression matrices (e.g., 100k cells × 30k genes).

Add an explicit size guard (e.g., max acceptable obs×var product) with a clear error message, or implement a blockwise/streamed conversion path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/converters.py` around lines 69 - 82, hail_mt_to_anndata and
anndata_to_hail_mt currently materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
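A minimal size guard along these lines; the HVANTK_MAX_MATRIX_SIZE variable name and default threshold are assumptions for the sketch, not existing configuration.

```python
import os

DEFAULT_MAX_ENTRIES = 50_000_000  # ~200 MB of float32; tune per deployment

def check_matrix_size(n_obs, n_vars, max_entries=None):
    """Raise early if a dense obs x var matrix would be too large to materialize."""
    if max_entries is None:
        max_entries = int(os.environ.get("HVANTK_MAX_MATRIX_SIZE", DEFAULT_MAX_ENTRIES))
    if n_obs * n_vars > max_entries:
        raise ValueError(
            f"Dense conversion of {n_obs} x {n_vars} = {n_obs * n_vars} entries "
            f"exceeds the limit of {max_entries}; use a blockwise conversion "
            f"or raise HVANTK_MAX_MATRIX_SIZE."
        )
```

Calling this at the top of both conversion functions turns a silent OOM into an actionable error.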

Comment on lines +304 to +312
def _get_phospho(tissue_type="both", source="umich"):
    """Try *source* first; fall back to default if unavailable."""
    try:
        return ds.get_phosphoproteomics(source=source, tissue_type=tissue_type)
    except Exception:
        logger.info(
            "Source '%s' unavailable for %s, falling back to default", source, ct
        )
        return ds.get_phosphoproteomics(tissue_type=tissue_type)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd "cptac_phospho_datasets.py" -type f

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

# Get the file size and examine the relevant sections
wc -l hvantk/datasets/cptac_phospho_datasets.py

Repository: bigbio/hvantk

Length of output: 102


🏁 Script executed:

# Read the full context around the flagged lines (300-350) to understand the function and all call sites
sed -n '295,355p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 3108


🏁 Script executed:

# Also look for the `_get_phospho` function definition to understand its scope and behavior
sed -n '300,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 2657


🏁 Script executed:

# Look at imports and understand the CPTAC library usage
sed -n '1,50p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 1687


🏁 Script executed:

# Search for any configuration or documentation about CPTAC sources
rg "get_phosphoproteomics|get_clinical" hvantk/datasets/cptac_phospho_datasets.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 862


🏁 Script executed:

# Check if there's any CPTAC documentation or type hints in the codebase
rg "source.*umich|source.*mssm" hvantk/ -t py

Repository: bigbio/hvantk

Length of output: 259


🏁 Script executed:

# Look for the _load_cptac_dataset function to understand the library's behavior
rg "_load_cptac_dataset" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 10

Repository: bigbio/hvantk

Length of output: 1315


🏁 Script executed:

# Look for test files related to CPTAC
fd "test.*cptac|cptac.*test" -type f hvantk/

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

# Check requirements or setup files for cptac package version
fd "requirements|setup|pyproject" -type f | xargs rg "cptac" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 10473


🏁 Script executed:

# Search for any documentation or comments about source handling
rg "source.*mssm|source.*umich" hvantk/ -B 3 -A 3

Repository: bigbio/hvantk

Length of output: 1319


🏁 Script executed:

# Check the full download method to understand the complete context
sed -n '260,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 4790


🏁 Script executed:

# Search for any imports or exception handling specific to cptac
rg "cptac\.|import cptac" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 5

Repository: bigbio/hvantk

Length of output: 396


🏁 Script executed:

# Check if there's any test coverage for this code path
fd "test" -type f hvantk/ | xargs rg "CPTACPhosphoDataset|_get_phospho|source.*umich" 2>/dev/null | head -20

Repository: bigbio/hvantk

Length of output: 1023


🏁 Script executed:

# Look for any GitHub issues or comments about CPTAC source handling
rg "unavailable|fallback|source.*mssm" hvantk/ -i

Repository: bigbio/hvantk

Length of output: 2857


🏁 Script executed:

# Check the actual CPTAC library behavior by looking at package docs or any local type hints
rg "get_phosphoproteomics|get_clinical" hvantk/ -A 3 -B 3

Repository: bigbio/hvantk

Length of output: 1990


Determine and reuse a single CPTAC source instead of falling back per call.

Each _get_phospho call (lines 316, 324, 339) is evaluated independently, so one failed "umich" lookup leaves the tumor TSV, normal TSV, and matrix export backed by different sources. Both fallback blocks also catch every Exception, silently converting unrelated loader/network failures into source switches. Determine the supported source once and reuse it for all phosphoproteomics reads; apply the same narrowing to the clinical fallback (lines 351–354).

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 308-308: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/cptac_phospho_datasets.py` around lines 304 - 312, The
per-call fallback in _get_phospho causes inconsistent sources and overly-broad
exception swallowing; instead, probe and pick a single working source once
(e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.
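The single-source selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `SourceUnavailableError`, `choose_source`, and `probe` are hypothetical stand-ins, since the exact exception type the cptac library raises for a missing source needs to be confirmed against its documentation.

```python
class SourceUnavailableError(Exception):
    """Stand-in for the narrow cptac error raised when a source is missing."""


def choose_source(probe, preferred="umich", fallback="mssm"):
    """Probe the preferred source once; switch only on the specific error.

    `probe(source)` attempts a load for the given source and raises
    SourceUnavailableError if, and only if, that source is unavailable.
    Unrelated loader/network failures propagate instead of being swallowed.
    """
    try:
        probe(preferred)
        return preferred
    except SourceUnavailableError:
        return fallback


# All subsequent phosphoproteomics reads reuse the one chosen source, so the
# tumor TSV, normal TSV, and matrix export can never end up backed by
# different sources.
```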

Comment on lines +138 to +141
if ext in {".parquet", ".pq"}:
    df = pd.read_parquet(path)
elif ext in {".pkl", ".pickle"}:
    df = pd.read_pickle(path)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd -a constraint_expression.py

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

# Once we have the file, let's look at the code around lines 138-141
cat -n hvantk/ptm/constraint_expression.py | head -160 | tail -40

Repository: bigbio/hvantk

Length of output: 1634


🏁 Script executed:

# Let's also get a broader view of the function containing these lines
cat -n hvantk/ptm/constraint_expression.py | head -200 | tail -100

Repository: bigbio/hvantk

Length of output: 4086


🏁 Script executed:

# Find all calls to _load_from_tabular to understand the call chain
rg "_load_from_tabular" -n hvantk/ptm/constraint_expression.py

Repository: bigbio/hvantk

Length of output: 152


🏁 Script executed:

# Search for CLI entry points or argparse definitions in the ptm module
fd -a "\.py$" hvantk/ptm/ | head -20

Repository: bigbio/hvantk

Length of output: 585


🏁 Script executed:

# Look for main entry points or CLI handlers
rg -n "argparse|click|typer|def main" hvantk/ptm/

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check if there's a CLI module or setup.py that shows how paths are passed
rg -n "_load_from_tabular" --type py

Repository: bigbio/hvantk

Length of output: 224


🏁 Script executed:

# Look at context around line 72 where _load_from_tabular is called
cat -n hvantk/ptm/constraint_expression.py | sed -n '50,90p'

Repository: bigbio/hvantk

Length of output: 1800


🏁 Script executed:

# Look at the full function definition starting before line 72
cat -n hvantk/ptm/constraint_expression.py | sed -n '35,85p'

Repository: bigbio/hvantk

Length of output: 2292


🏁 Script executed:

# Get the function signature and name
cat -n hvantk/ptm/constraint_expression.py | sed -n '1,40p'

Repository: bigbio/hvantk

Length of output: 1371


🏁 Script executed:

# Search for all calls to this function across the codebase
rg "load_expression_source" --type py -n

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Search for calls to load_gene_by_group_matrix
rg "load_gene_by_group_matrix" --type py -n

Repository: bigbio/hvantk

Length of output: 986


🏁 Script executed:

# Also check for CLI/entry point files (main, CLI runner, etc.)
rg -n "def main|if __name__" hvantk/ --type py | head -20

Repository: bigbio/hvantk

Length of output: 1364


🏁 Script executed:

# Look for any argument parsing or click/argparse decorators related to path arguments
rg -n "click|argparse|typer|@app\." hvantk/ --type py | grep -i "path\|file\|input" | head -20

Repository: bigbio/hvantk

Length of output: 1857


🏁 Script executed:

# Let's examine constraint.py around line 158 to see how load_gene_by_group_matrix is called
cat -n hvantk/ptm/constraint.py | sed -n '145,175p'

Repository: bigbio/hvantk

Length of output: 1272


🏁 Script executed:

# Check the function containing this call - look further back
cat -n hvantk/ptm/constraint.py | sed -n '1,50p'

Repository: bigbio/hvantk

Length of output: 1708


🏁 Script executed:

# Search for CLI-related functions in constraint.py that might call this function
rg -n "def " hvantk/ptm/constraint.py | head -20

Repository: bigbio/hvantk

Length of output: 822


🏁 Script executed:

# Search for calls to run_ptm_constraint
rg "run_ptm_constraint" --type py -n

Repository: bigbio/hvantk

Length of output: 729


🏁 Script executed:

# Check if there's a PTM-related CLI entry in the main hvantk.py or other CLI modules
cat -n hvantk/hvantk.py | head -120

Repository: bigbio/hvantk

Length of output: 4062


🏁 Script executed:

# Look for ptm command/subcommand in the CLI
rg -n "ptm\|constraint" hvantk/hvantk.py hvantk/commands/ --type py | head -30

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Examine ptm_cli.py around line 656 to see how the path is supplied to run_ptm_constraint
cat -n hvantk/commands/ptm_cli.py | sed -n '640,700p'

Repository: bigbio/hvantk

Length of output: 2420


🏁 Script executed:

# Look at the ptm command/subcommand definitions to see where expression_path comes from
cat -n hvantk/commands/ptm_cli.py | sed -n '1,100p'

Repository: bigbio/hvantk

Length of output: 3934


🏁 Script executed:

# Search for where expression_path is defined as a CLI option
rg -n "expression_path\|expression-path" hvantk/commands/ptm_cli.py | head -20

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Search for the constraint subcommand definition
rg -n "constraint" hvantk/commands/ptm_cli.py -A 5 -B 5 | head -100

Repository: bigbio/hvantk

Length of output: 3440


🏁 Script executed:

# Search for `@ptm_group.command` or constraint in the file
rg -n "@ptm_group.command|def constraint" hvantk/commands/ptm_cli.py

Repository: bigbio/hvantk

Length of output: 305


🏁 Script executed:

# Now let's look at the --expression-path option definition
cat -n hvantk/commands/ptm_cli.py | sed -n '525,605p'

Repository: bigbio/hvantk

Length of output: 3103


Remove pickle support from this user-facing loader.

The --expression-path CLI option accepts .pkl/.pickle files. pd.read_pickle() executes pickle payloads during deserialization, turning this into an arbitrary-code-execution sink. Remove pickle support entirely, or gate it behind an explicit unsafe flag.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 141-141: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/constraint_expression.py` around lines 138 - 141, The loader
currently accepts .pkl/.pickle and calls pd.read_pickle (variable ext with
branches creating df), which is unsafe; remove the pickle branch entirely and
ensure the CLI/validation for --expression-path rejects .pkl/.pickle (or only
allow .parquet/.pq), so the code only calls pd.read_parquet for supported
extensions; update any help/validation code that lists accepted extensions to
exclude pickle and add an explicit unsafe flag if you want to reintroduce pickle
behind opt-in.

Comment thread hvantk/ptm/pipeline.py
Comment on lines 209 to +213
  logger.info("Downloading Ensembl GTF to %s...", gtf_path)
- urllib.request.urlretrieve(ENSEMBL_GTF_URL, gtf_path)  # nosec B310
+ with requests.get(ENSEMBL_GTF_URL, stream=True, timeout=600) as resp:
+     resp.raise_for_status()
+     with open(gtf_path, "wb") as fout:
+         shutil.copyfileobj(resp.raw, fout)

⚠️ Potential issue | 🟠 Major

Write the GTF atomically instead of streaming into the cache path.

If the request or copy fails midway, gtf_path is left truncated and the next run will still treat it as a valid cached file because the cache check only looks at os.path.exists. Download to a temp file and os.replace() it only after the copy finishes successfully.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 209 - 213, The download currently
streams directly into gtf_path using requests.get and shutil.copyfileobj which
can leave a truncated cache file on failure; change the logic in the block that
calls requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file
(e.g., using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw
into that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
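The temp-file-then-`os.replace()` flow can be sketched as below. `atomic_download` is a hypothetical helper; it factors the transfer out as a callback so the atomicity logic is testable without a network — in the real pipeline the callback would wrap `requests.get(..., stream=True)` and `shutil.copyfileobj(resp.raw, fout)` as in the diff above.

```python
import os
import tempfile


def atomic_download(dest_path, write_body):
    """Stream into a temp file, then publish it atomically with os.replace().

    `write_body(fileobj)` must raise on any transfer error. The temp file is
    created in the destination directory so os.replace() stays on one
    filesystem; on failure the partial file is removed, so a half-written
    download can never be mistaken for a valid cached file.
    """
    dirname = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fout:
            write_body(fout)
        os.replace(tmp_path, dest_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```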

Comment thread hvantk/ptm/pipeline.py
Comment on lines +490 to 498
# Use a distinct filename to avoid truncating mapped_path when it
# points to ptm_sites_combined.tsv.bgz from step 4b.
all_combined_path = os.path.join(
config.output_dir, "ptm_sites_all_combined.tsv.bgz"
)
_concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)

mapped_path = all_combined_path
result.n_total += cptac_result.n_total

⚠️ Potential issue | 🟠 Major

Update the returned mapped path after the CPTAC merge.

Line 497 repoints mapped_path, but result.mapped_tsv_path is never updated in this branch. When cptac_tsv is present, callers get the pre-CPTAC file even though create_ptm_sites_tb() consumes ptm_sites_all_combined.tsv.bgz.

Suggested fix
         _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)

         mapped_path = all_combined_path
+        result.mapped_tsv_path = all_combined_path
         result.n_total += cptac_result.n_total
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 490 - 498, After concatenating CPTAC
into all_combined_path and repointing mapped_path, also set
result.mapped_tsv_path = all_combined_path so the returned result reflects the
post-merge file; update this in the branch that calls
_concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the same
place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.

Comment thread hvantk/tables/cptac.py
Comment on lines +89 to 92
obs = pd.DataFrame(index=wide.index)
if metadata_df is not None:
    obs = obs.join(metadata_df, how="left")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | grep -E "(cptac|ucsc)" | head -20

Repository: bigbio/hvantk

Length of output: 358


🏁 Script executed:

wc -l hvantk/tables/cptac.py hvantk/tables/ucsc.py

Repository: bigbio/hvantk

Length of output: 122


🏁 Script executed:

sed -n '80,100p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 683


🏁 Script executed:

sed -n '140,160p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 686


🏁 Script executed:

cat -n hvantk/tables/ucsc.py | head -200

Repository: bigbio/hvantk

Length of output: 5256


🏁 Script executed:

# Check if there's any validation of metadata_df before these join calls in cptac.py
sed -n '75,95p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 678


🏁 Script executed:

sed -n '130,160p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 1041


🏁 Script executed:

# Check function signatures to understand where metadata_df comes from
grep -n "def.*metadata" hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

grep -B 5 "metadata_df" hvantk/tables/cptac.py | grep -E "(def |metadata_df)"

Repository: bigbio/hvantk

Length of output: 469


🏁 Script executed:

# Get full function signatures for both builders
sed -n '40,70p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 1169


🏁 Script executed:

# Check if there are any other callers or tests that show how metadata_df is expected to be structured
grep -r "create_anndata_from_cptac" --include="*.py" | grep -v "^Binary"

Repository: bigbio/hvantk

Length of output: 662


🏁 Script executed:

# Check the full phospho function signature
sed -n '105,125p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 806


🏁 Script executed:

# Check how metadata_df is constructed in the matrix_builders callers
sed -n '1,50p' hvantk/tables/matrix_builders.py | head -100

Repository: bigbio/hvantk

Length of output: 1431


🏁 Script executed:

# Get the full context around the cptac function calls in matrix_builders
rg -B 5 -A 5 "create_anndata_from_cptac" hvantk/tables/matrix_builders.py | head -60

Repository: bigbio/hvantk

Length of output: 1582


Add overlap validation to metadata joins to prevent silent NaN filling.

Lines 89-92 and 149-153 join metadata_df without checking if its index overlaps with the sample IDs. A left join with no overlapping indices silently produces all-NaN obs, which breaks downstream analysis that depends on sample metadata (e.g., tissue/cell-type grouping).

The hvantk/tables/ucsc.py builder already addresses this by checking adata.obs.index.intersection(metadata_df.index) and logging a warning when the overlap is empty (lines 133-137). Adopt the same pattern here: validate that the metadata index contains the expected sample IDs, emit a warning if overlap is missing, and preferably raise an error if the overlap is empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/cptac.py` around lines 89 - 92, In cptac.py, when joining
metadata_df into obs (created from wide.index) and the later join at lines
~149-153, first compute the overlap = obs.index.intersection(metadata_df.index)
and validate it; if overlap is empty, log a warning (or raise a ValueError) and
avoid doing a blind left join that yields all-NaN rows; if partial overlap, log
how many samples matched and then perform obs = obs.join(metadata_df,
how="left") restricted to the overlapping index (or ensure rows with no metadata
are handled explicitly). Use the same validation/logging pattern used in
hvantk/tables/ucsc.py (checking adata.obs.index.intersection(metadata_df.index)
and warning on empty overlap) and apply it to the variables obs, metadata_df,
and wide.index in both join locations.
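The overlap check borrowed from the ucsc.py builder could look like this sketch (the function name and messages are illustrative, not the project's actual helper):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def join_metadata(obs, metadata_df):
    """Left-join sample metadata after validating index overlap.

    Raises if no sample IDs match (a blind left join would silently yield
    all-NaN obs); warns on partial overlap so downstream grouping failures
    are traceable to missing metadata rather than a silent join.
    """
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        raise ValueError(
            "No overlap between sample IDs and metadata index; "
            "check that both use the same sample identifiers."
        )
    if len(overlap) < len(obs.index):
        logger.warning(
            "Metadata found for %d/%d samples; the rest will have NaN obs.",
            len(overlap), len(obs.index),
        )
    return obs.join(metadata_df, how="left")
```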

Comment on lines 127 to +139
if isinstance(group_by, str):
group_by = [group_by]

# --- Validate ---
_validate_group_by(mt, group_by, metadata_field)

# --- Filter columns ---
if filter_by:
mt = filter_by_metadata(mt, filter_by)

# --- Build group label ---
if len(group_by) == 1:
label_expr = hl.str(mt[metadata_field][group_by[0]])
labels = adata.obs[group_by[0]].astype(str)
else:
label_expr = hl.delimit([hl.str(mt[metadata_field][f]) for f in group_by], "_")
mt = mt.annotate_cols(_group_label=label_expr)

# --- Filter groups by min cells ---
group_counts = mt.aggregate_cols(hl.agg.counter(mt._group_label))
valid_groups = {g for g, n in group_counts.items() if n >= min_cells_per_group}
skipped = {g: n for g, n in group_counts.items() if n < min_cells_per_group}
if skipped:
logger.warning(
"Skipping %d group(s) with fewer than %d cells: %s",
len(skipped),
min_cells_per_group,
skipped,
)
if not valid_groups:
raise ValueError(
f"No groups have >= {min_cells_per_group} cells. "
f"Group sizes: {group_counts}"
)

mt = mt.filter_cols(hl.literal(valid_groups).contains(mt._group_label))

# --- Group and aggregate ---
grouped_mt = mt.group_cols_by(mt._group_label).aggregate(
mean=hl.agg.mean(mt[expr_field]),
fraction_expressed=hl.agg.fraction(mt[expr_field] > 0),
n_cells=hl.agg.count(),
)

# --- Collect per-gene stats into a dict ---
grouped_mt = grouped_mt.annotate_rows(
stats=hl.dict(
hl.agg.collect(
hl.tuple(
[
grouped_mt._group_label,
hl.struct(
mean=grouped_mt.mean,
fraction_expressed=grouped_mt.fraction_expressed,
n_cells=hl.int32(grouped_mt.n_cells),
),
]
)
)
)
)

# --- Build output Table ---
row_fields = {gene_id_field: mt.row[gene_id_field]}
if gene_name_field and gene_name_field in mt.row:
row_fields[gene_name_field] = mt.row[gene_name_field]

tb = grouped_mt.rows()
select_exprs = {"gene_id": tb[gene_id_field], "stats": tb.stats}
if gene_name_field and gene_name_field in tb.row:
select_exprs["gene_name"] = tb[gene_name_field]

tb = tb.select(**select_exprs)
tb = tb.key_by("gene_id")

if output_path:
logger.info("Checkpointing summary table to %s", output_path)
tb = tb.checkpoint(output_path, overwrite=overwrite)

logger.info(
"Summarized %d genes across %d groups (group_by=%s)",
tb.count(),
len(valid_groups),
group_by,
)
return tb


def summarize_matrix(mt: hl.MatrixTable) -> Dict:
"""
Provide a comprehensive summary of a MatrixTable.

Args:
mt: Hail MatrixTable to summarize

Returns:
Dict containing summary statistics and information about the MatrixTable
"""
# Basic counts
n_samples = mt.count_cols()
n_genes = mt.count_rows()
n_entries = n_samples * n_genes

# Sample metadata summaries
if "metadata" in mt.col:
metadata_fields = list(mt.col.metadata.dtype)
metadata_stats = {}

for field in metadata_fields:
# Check if the field is likely numerical
field_type = mt.col.metadata[field].dtype
is_numeric = hl.is_numeric(field_type)

if is_numeric:
# For numerical fields, compute summary statistics
field_stats = mt.aggregate_cols(hl.agg.stats(mt.metadata[field]))
# Convert to a more compact dict format
field_stats = {
"mean": field_stats.mean,
"std": field_stats.stdev,
"min": field_stats.min,
"max": field_stats.max,
"n": field_stats.n,
"sum": field_stats.sum,
labels = adata.obs[group_by[0]].astype(str)
for col in group_by[1:]:
    labels = labels + "_" + adata.obs[col].astype(str)

# Local Series — does not touch adata.obs.
label_series = pd.Series(labels.values, index=np.arange(adata.n_obs))


⚠️ Potential issue | 🟠 Major

Don't encode groups with plain underscore joins.

For multi-column group_by, this can merge distinct tuples when the values themselves contain underscores, and astype(str) also turns missing metadata into a real "nan" group. Group on the original column tuple (or a safely escaped encoding) and drop rows with missing grouping fields before aggregation. This also feeds the AnnData mean path in hvantk/ptm/constraint_expression.py:189-205.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/utils/matrix_utils.py` around lines 127 - 139, The current label
construction joins stringified fields with "_" which corrupts tuples with
underscores and turns missing values into the literal "nan"; change the logic in
the block that builds labels (the group_by handling and label_series creation)
to build canonical group keys as tuples or a pandas.MultiIndex from the original
adata.obs columns (e.g., using tuple(zip(...)) or
pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop rows with any
missing grouping fields before downstream aggregation, and ensure label_series
uses adata.obs.index (not np.arange(adata.n_obs)); update any callers (e.g., the
AnnData mean path referenced in constraint_expression.py) to consume
tuple/MultiIndex keys or a safely encoded representation.
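Tuple-keyed grouping avoids both failure modes at once. The sketch below is a hypothetical helper (`group_means` is not the project's function, and the real code aggregates more statistics than the mean): keys stay a MultiIndex built from the raw columns, so values containing "_" can never collide, and rows with a missing grouping field are dropped instead of becoming a literal "nan" group.

```python
import numpy as np
import pandas as pd


def group_means(obs, X, group_by):
    """Mean-aggregate rows of X, keyed by tuples of the original obs columns."""
    # Drop rows where any grouping field is missing.
    keep = obs[group_by].notna().all(axis=1).to_numpy()
    # MultiIndex keys preserve the column tuple exactly — no string joining.
    keys = pd.MultiIndex.from_frame(obs.loc[keep, group_by])
    df = pd.DataFrame(np.asarray(X)[keep], index=keys)
    return df.groupby(level=list(range(len(group_by)))).mean()
```

Under an underscore join, (`"a_b"`, `"c"`) and (`"a"`, `"b_c"`) would both collapse to `"a_b_c"`; with tuple keys they remain distinct groups.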

Comment thread requirements.txt Outdated
Comment on lines +14 to +15
pyyaml
tspex

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, check if anndata is imported in the three mentioned files
echo "=== Checking for anndata imports ==="
for file in hvantk/core/converters.py hvantk/tables/cptac.py hvantk/tables/ucsc.py; do
  if [ -f "$file" ]; then
    echo "File: $file"
    grep -n "anndata" "$file" | head -20
  else
    echo "File not found: $file"
  fi
done

echo ""
echo "=== Current requirements.txt ==="
if [ -f "requirements.txt" ]; then
  cat -n requirements.txt
else
  echo "requirements.txt not found"
fi

echo ""
echo "=== Current environment.yml ==="
if [ -f "environment.yml" ]; then
  cat -n environment.yml
else
  echo "environment.yml not found"
fi

Repository: bigbio/hvantk

Length of output: 1805


requirements.txt is missing critical AnnData runtime dependencies.

The three new modules hvantk/core/converters.py (line 14), hvantk/tables/cptac.py (line 12), and hvantk/tables/ucsc.py (line 24) all import anndata, but this dependency is absent from requirements.txt. A plain pip install -r requirements.txt will fail with ModuleNotFoundError: anndata.

Additionally, environment.yml specifies several other packages not in requirements.txt: scanpy>=1.10.0, pysam>=0.22, psutil, and scikit-learn. Sync requirements.txt with the actual runtime dependencies listed in environment.yml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 14 - 15, requirements.txt is missing runtime
deps required by the new modules (hvantk/core/converters.py,
hvantk/tables/cptac.py, hvantk/tables/ucsc.py) which import anndata; update
requirements.txt to include anndata plus the other packages declared in
environment.yml (scanpy>=1.10.0, pysam>=0.22, psutil, scikit-learn) and any
version constraints from environment.yml so a plain pip install -r
requirements.txt will succeed and match the environment.yml runtime
dependencies.

@enriquea
Collaborator Author

@copilot, review and evaluate every comment in this PR. Fix only if valid.
Check the CI/CD GitHub workflows that are failing. Fix them.

Contributor

Copilot AI commented Apr 14, 2026

@copilot, review and evaluate every comment in this PR. Fix only if valid.
Check the CI/CD GitHub workflows that are failing. Fix them.

Done in 32a57e5. I reviewed all prior PR comments, fixed the valid ones (UCSC/CPTAC delimiter handling, GTF parsing guards, Hail MT aggregation safety for medians, BGZF writer deduplication, sparse-safe AnnData histogram behavior, and small-MT guard for MT→AnnData conversion), and investigated CI failures via workflow logs. The failing workflows were due to missing anndata in requirements.txt; that is now fixed and validated with targeted tests/lint.

@enriquea enriquea merged commit a67cbaa into main Apr 14, 2026
5 of 6 checks passed