
Integrate AnnData support and implement PTM constraint analysis #90

Merged
enriquea merged 33 commits into main from dev on Apr 14, 2026
Conversation

Collaborator
@enriquea enriquea commented Apr 14, 2026

This pull request introduces a new PTM constraint analysis feature and refactors the expression matrix-building CLI to use AnnData objects instead of Hail MatrixTables. It also documents and registers new modules and dependencies to support stratified PTM constraint analysis, including support for multiple expression data backends (AnnData, Hail MT, tabular). The CLI and builder functions have been updated accordingly, and new documentation has been added to guide users through the new workflows.

PTM Constraint Analysis:

  • Added a new CLI command hvantk ptm constraint for stratified allele-frequency depletion analysis at PTM codons, supporting grouping by tissue or cell type and multiple expression data backends (AnnData, Hail MT, tabular). Outputs include TSVs, plots, and a self-contained HTML report.
  • Added a detailed user guide for the new PTM constraint command, including usage examples and flag documentation.

Codebase Organization & Documentation:

  • Registered and documented new PTM constraint modules (constraint.py, constraint_expression.py, constraint_plots.py, constraint_report.py) in the architecture docs and clarified their purpose within the ptm/ directory. [1] [2]

Expression Matrix CLI Refactor:

  • Refactored the hvantk mkmatrix CLI to create AnnData (.h5ad) objects instead of Hail MatrixTables, updating function names, options, and help text for all supported sources (UCSC, Expression Atlas, CPTAC). [1] [2] [3] [4]

Dependencies:

  • Added new dependencies to environment.yml for AnnData, Scanpy, pysam, PyYAML, psutil, scikit-learn, and tspex to support the new analysis and data backends.

Removed/Obsolete:

  • Removed the outdated AlphaGenome variant prediction streamer design spec, which is no longer relevant to the current codebase direction.

Summary by CodeRabbit

  • New Features

    • Added hvantk ptm constraint command for stratified PTM depletion analysis with statistical testing and HTML reports
    • Replaced MatrixTable expression format with AnnData (H5AD) for improved compatibility and performance
    • Enhanced expression matrix creation from UCSC, Expression Atlas, and CPTAC data sources
  • Documentation

    • Added comprehensive PTM constraint analysis guide with CLI examples and output descriptions

enriquea and others added 30 commits April 8, 2026 00:51
Phosphorylation-aware variant classification to evaluate whether
multi-source phospho evidence adds signal for ClinVar VUS
reclassification on top of existing pathogenicity predictors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add converters module with hail_mt_to_anndata and anndata_to_hail_mt
functions for crossing between variant-keyed Hail data and expression
AnnData in the scverse integration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes np.bool compatibility workaround for Hail 0.2.x with numpy >= 1.24.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
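The workaround mentioned above follows a common pattern: NumPy >= 1.24 removed the deprecated `np.bool` alias that Hail 0.2.x still references at import time, so the alias is restored before Hail is imported. A minimal sketch of such a shim (illustrative, not the exact code in `hail_context.py`):

```python
import numpy as np

# NumPy >= 1.24 removed the deprecated `np.bool` alias that Hail 0.2.x still
# references at import time. Re-adding it before `import hail` keeps those
# releases importable without patching Hail itself.
if not hasattr(np, "bool"):
    np.bool = bool  # restore the removed alias for legacy consumers
```

Because the shim is a module-level side effect, it must run before the first `import hail` anywhere in the process — which is why a later commit moves it into `hail_context.py`.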
Add load_ucsc_metadata and create_anndata_from_ucsc_matrix to ucsc.py
alongside existing Hail functions, and build_ucsc_ad to matrix_builders.
Hail imports are now lazy so AnnData paths work without Hail installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add convert_sdrf_to_dataframe and create_anndata_from_expression_atlas
to expression_atlas.py alongside existing Hail functions. Add
build_expression_atlas_ad to matrix_builders.py following the same
pattern as build_ucsc_ad. Make hail imports lazy to allow non-Hail
usage without hail installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add three new functions to matrix_utils.py that operate on AnnData objects
alongside the existing Hail-based functions: describe_expression_ad,
filter_by_metadata_ad, and summarize_expression_ad. Guard hail/anndata
imports to allow standalone usage of either backend.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Hail MatrixTable builders with AnnData builders in all mkmatrix
subcommands (ucsc, expression-atlas, cptac, cptac-phospho). Remove
Hail-specific options (--min-partitions, --force-bgz, --auto-convert-bgz)
and init_hail() calls. Rewrite expression describe/summarize/markers
commands to operate on .h5ad files using scanpy for differential expression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n up imports

Replace Hail-based matrix builders with AnnData equivalents in the registry.
Remove build_ucsc_mt, build_expression_atlas_mt, build_cptac_mt,
build_cptac_phospho_mt from matrix_builders.py. Remove old Hail helper
functions from ucsc.py, expression_atlas.py, and cptac.py. Delete
wilcoxon_hail.py and obsolete Hail expression test files. Update
test_matrix_utils.py fixture to use Hail imports directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The file was rewritten for AnnData during scverse migration but kept the
Hail-era filename. The leftover Hail histogram function had no live callers
(only @pytest.mark.hail tests against removed Hail builders) and a broken
``"ad.AnnData"`` annotation that triggered an F821 lint error.

- New hvantk/visualization/expression/anndata.py exposes
  visualize_expression_distribution(adata, ...) operating on adata.X.
- Old hail.py removed; visualization facade and subpackage doc updated.
- Test import switched to the new module.
…ze_expression_ad

Completes the cleanup planned in the scverse-integration plan (Task 10
step 5). The Hail-based functions in matrix_utils.py were superseded by
their _ad equivalents when expression I/O migrated to AnnData; their only
remaining callers were two skipped @pytest.mark.hail test files.

- Remove annotate_column_summary, describe_expression_mt, summarize_expression,
  summarize_matrix, filter_by_metadata, filter_by_gene_list, filter_by_expression,
  get_top_expressed_genes (and the optional hail import).
- Delete tests/test_matrix_utils.py and tests/test_summarize_expression.py.
- summarize_expression_ad: stop mutating adata.obs (was writing _group_label
  into the caller's object) and replace the per-row index.get_loc lookup
  with groupby indices for O(n_cells) splitting.
- gene_sets.py: refresh extract_marker_gene_sets docstring to point at the
  AnnData summarizer and note its long-format output requires pivot or
  scanpy rank_genes_groups for direct use.
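The groupby-indices change above can be sketched in isolation. This hypothetical example (names are illustrative, not the actual `summarize_expression_ad` internals) shows why `groupby(...).indices` beats per-row `index.get_loc`: one pass over the labels yields integer row positions for every group at once.

```python
import numpy as np
import pandas as pd

# 6 cells x 2 genes, with a categorical group label per cell.
X = np.arange(12, dtype=float).reshape(6, 2)
groups = pd.Series(["a", "b", "a", "c", "b", "a"])

# One O(n_cells) pass: group label -> array of integer row positions,
# replacing a per-row index.get_loc lookup.
positions = groups.groupby(groups).indices

# Split the matrix per group without touching the caller's data.
group_means = {g: X[rows].mean(axis=0) for g, rows in positions.items()}
```

Note that this also avoids mutating the source object, in line with the commit's removal of the `_group_label` write into the caller's `adata.obs`.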
…hail_context

- converters.anndata_to_hail_mt previously built the long-format DataFrame
  with a Python double loop over n_obs * n_var entries (200M dict allocs
  for a modest 10k x 20k matrix). Replace with np.repeat / np.tile /
  X.ravel — single pass, no Python loop.
- The np.bool monkey-patch (Hail 0.2.x compat with NumPy >= 1.24) lived
  as a side-effect at the top of converters.py. Move it into hail_context.py
  so it runs before any Hail import regardless of which module pulls Hail
  in first. converters.py now imports hl via hail_context to guarantee the
  shim is applied.
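The np.repeat / np.tile / X.ravel construction described above can be sketched as follows (column names are hypothetical; the real converter's schema may differ). A single vectorized pass builds the long-format frame, instead of a Python double loop allocating one dict per (obs, var) pair:

```python
import numpy as np
import pandas as pd

# n_obs=2 samples x n_var=3 genes, dense matrix in row-major order.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
obs_names = np.array(["s1", "s2"])
var_names = np.array(["g1", "g2", "g3"])

# Long format in one pass, no Python-level loop:
long_df = pd.DataFrame({
    "sample": np.repeat(obs_names, len(var_names)),  # s1 s1 s1 s2 s2 s2
    "gene":   np.tile(var_names, len(obs_names)),    # g1 g2 g3 g1 g2 g3
    "value":  X.ravel(),                             # row-major flatten
})
```

`np.repeat` stretches the observation axis, `np.tile` cycles the variable axis, and `X.ravel()` flattens in matching row-major order, so the three columns stay aligned by construction.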
test_dataset_download_returns_intermediate_tsv mocked
urllib.request.urlretrieve, but the production code calls requests.get.
The mock was a no-op, so the test attempted a real network download and
hung pytest -q indefinitely. Mock requests.get with a context-manager
fake response that streams the bundled fixture zip.
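A context-manager fake response for `requests.get` can look roughly like the sketch below. The `iter_content` / `raise_for_status` surface mirrors common `requests` usage; the actual production call path and test fixture are assumptions here, so the downloader is parameterized for illustration.

```python
from unittest import mock

class FakeResponse:
    """Minimal stand-in for requests.Response used as a context manager."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self.status_code = 200

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

    def raise_for_status(self):
        pass

    def iter_content(self, chunk_size=8192):
        # Stream the fixture bytes in chunks, like a real response body.
        for i in range(0, len(self._payload), chunk_size):
            yield self._payload[i:i + chunk_size]


def download(url, http_get):
    # Hypothetical stand-in for the production downloader.
    chunks = []
    with http_get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4):
            chunks.append(chunk)
    return b"".join(chunks)


fake_get = mock.MagicMock(return_value=FakeResponse(b"fixture-zip-bytes"))
data = download("https://example.org/x.zip", http_get=fake_get)
```

Unlike the old `urllib.request.urlretrieve` mock, this fake intercepts the call the code actually makes, so no network request can slip through and hang the test.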
- matrix_builders.py, cptac.py, expression_atlas.py: remove inner
  ``import anndata as ad`` repeats inside builder functions; module-level
  import already in scope.
- test_matrix_utils_anndata.py: replace the importlib.util workaround
  (which sidestepped hvantk.utils.__init__ over a non-existent Python
  3.10 syntax issue) with a plain import. Project requires Python >= 3.10
  where PEP 604 unions are valid.
Implements the `hvantk ptm constraint` CLI and `hvantk.ptm.constraint`
module from the 2026-04-13 implementation plan. Compares gnomAD allele-
frequency distributions between PTM-proximal and non-PTM variants,
stratified by tissue, cell type, or any categorical metadata field.

Modules added:
- hvantk/ptm/constraint.py — orchestrator, PTMConstraintConfig/Result,
  five statistical tests (per-group ranking, τ quartile, τ × LOEUF
  factorial, PTM category × group heatmap, within-gene Wilcoxon)
- hvantk/ptm/constraint_expression.py — unified backend adapter for
  Hail MatrixTable / AnnData / tabular expression sources
- hvantk/ptm/constraint_plots.py — four-panel figure renderer
- hvantk/ptm/constraint_report.py — self-contained HTML report
- hvantk/utils/tissue_specificity.py — thin tspex wrapper for τ / TSI
  / Gini / Shannon metrics

Wiring:
- hvantk/commands/ptm_cli.py — `ptm constraint` subcommand
- hvantk/ptm/__init__.py — lazy exports for the new API
- pyproject.toml — `constraint` extra (tspex, matplotlib, seaborn)

Docs:
- docs_site/tools/ptm-constraint.md
- docs_site/guide/usage.md (10-line recipe)
- docs_site/architecture.md (module index)

Tests: 6 smoke tests in hvantk/tests/test_ptm_constraint.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
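The τ metric that `tissue_specificity.py` wraps via tspex can be written directly from its standard definition (Yanai-style τ). This standalone sketch illustrates the quantity being computed; it does not reproduce the tspex API or the wrapper's signature:

```python
import numpy as np

def tau(x) -> float:
    """Tissue-specificity τ in [0, 1]:
    0 = uniform expression across groups, 1 = expressed in a single group."""
    x = np.asarray(x, dtype=float)
    if x.max() == 0:
        return 0.0
    xhat = x / x.max()                         # scale to the max group
    return float((1.0 - xhat).sum() / (len(x) - 1))

tau(np.array([5.0, 0.0, 0.0, 0.0]))   # fully tissue-specific
tau(np.array([3.0, 3.0, 3.0, 3.0]))   # uniformly expressed
```

Quartiles of this τ over genes are what the "τ quartile" statistical test above stratifies on.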
…ield

Code-review follow-ups to the initial constraint implementation:

- _anndata_value_column unconditionally returned "mean", so
  --expression-metric median and median_nonzero were silent no-ops on
  the AnnData backend. Replaced the value-column helper with a direct
  numpy aggregation (_aggregate_anndata_direct) that runs per-group
  np.median / np.nanmedian (for nonzero) over adata.X when the user
  requests a median metric.

- _load_variants silently skipped label filtering when the configured
  label_field was absent from the variants HT, leading to results that
  claim to be filtered to TN/TP but actually include every label. Now
  raises ValueError with a clear remediation hint unless the user
  explicitly passes --label-filter all.

- pyproject.toml + requirements.txt: added tspex so CI (which installs
  from requirements.txt) can import the wrapper and the constraint
  smoke tests do not fail with ImportError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
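The direct numpy aggregation described above can be sketched like this. Names and signature are hypothetical (not the actual `_aggregate_anndata_direct`); the point is the per-group median over a dense cells × genes matrix, with the `median_nonzero` variant masking zeros via NaN before `np.nanmedian`:

```python
import numpy as np

def aggregate_by_group(X, labels, metric="median"):
    """Per-group, per-gene aggregation over a dense cells x genes matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    out = {}
    for g in np.unique(labels):
        sub = X[labels == g]                       # rows for this group
        if metric == "median":
            out[g] = np.median(sub, axis=0)
        elif metric == "median_nonzero":
            masked = np.where(sub > 0, sub, np.nan)  # drop zeros per gene
            out[g] = np.nanmedian(masked, axis=0)    # NaN if all-zero column
        else:
            raise ValueError(f"unknown metric: {metric}")
    return out
```

With this shape of helper, `--expression-metric median` and `median_nonzero` produce genuinely different values on the AnnData backend instead of silently falling back to the mean.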
The Conda CI workflow has been failing since the AnnData migration
landed on dev because environment.yml never gained anndata/scanpy. The
constraint work also needs tspex; add it alongside so `python-package-
conda.yml` can import every test module.

Added: anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex.
feat(ptm): stratified AF-depletion analysis (`hvantk ptm constraint`)
@enriquea enriquea requested a review from Copilot April 14, 2026 08:18
@enriquea enriquea self-assigned this Apr 14, 2026
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Large-scale architectural migration from Hail MatrixTable to AnnData for expression matrix handling across multiple data sources, introduction of a new PTM constraint analysis pipeline with CLI integration and comprehensive statistical testing, removal of AlphaGenome streamer documentation, and addition of utility modules for AnnData operations and tissue specificity computation.

Changes

Cohort · File(s) · Summary
AnnData Core Infrastructure
hvantk/core/anndata_utils.py, hvantk/core/converters.py
New modules providing AnnData metadata construction, column summarization, save/load utilities, and bidirectional Hail ↔ AnnData conversion functions.
Expression Matrix Builders
hvantk/tables/matrix_builders.py, hvantk/tables/ucsc.py, hvantk/tables/cptac.py, hvantk/tables/expression_atlas.py
Replaced Hail MatrixTable builders with AnnData equivalents; removed MatrixTable-specific options (metadata_index_col, min_partitions, force_bgz, categorical_cols, numeric_cols); delegated metadata handling to pandas/AnnData conventions.
Expression Utilities & CLI
hvantk/utils/matrix_utils.py, hvantk/utils/tissue_specificity.py, hvantk/commands/summarize_expression_cli.py, hvantk/commands/make_matrix_cli.py
Replaced Hail-based matrix utilities (summarize_expression, filter_by_metadata, etc.) with AnnData equivalents; added compute_specificity() wrapper for tissue specificity via tspex; updated CLI to use .h5ad outputs and AnnData-based metadata handling.
PTM Constraint Analysis Pipeline
hvantk/ptm/constraint.py, hvantk/ptm/constraint_expression.py, hvantk/ptm/constraint_plots.py, hvantk/ptm/constraint_report.py, hvantk/ptm/__init__.py, hvantk/commands/ptm_cli.py
New end-to-end stratified constraint analysis with configuration validation, gene-by-group expression loading (supporting Hail MT, AnnData, tabular), statistical testing (Mann–Whitney, τ quartile, LOEUF factorial, category-group heatmap, within-gene paired Wilcoxon), PNG/HTML report generation, and new hvantk ptm constraint CLI subcommand.
PTM Data Processing
hvantk/ptm/mapper.py, hvantk/ptm/pipeline.py, hvantk/ptm/annotate.py
Added TranscriptCDS class and caching for residue-to-genomic mapping; replaced urllib downloads with streaming requests; added BGZF TSV output and concat utilities; extended variant annotation with optional ptm_evidence field.
File & Data Management
hvantk/data/file_utils.py, hvantk/hail_context.py, hvantk/datasets/cptac_phospho_datasets.py, hvantk/datasets/peptideatlas_phospho_datasets.py
Added BgzfWriter class for buffered BGZF output; patched NumPy compatibility shim for Hail import; improved dataset fallback handling for phosphoproteomics sources; replaced urllib downloads with streaming requests.
Visualization
hvantk/visualization/__init__.py, hvantk/visualization/expression/__init__.py, hvantk/visualization/expression/anndata.py, hvantk/visualization/expression/hail.py
Removed Hail MatrixTable visualization module; added AnnData-based visualize_expression_distribution() supporting sparse matrices; updated lazy-import facade to target AnnData implementation.
Test Updates — Deleted
hvantk/tests/test_cli_mkmatrix.py, hvantk/tests/test_cptac.py, hvantk/tests/test_cptac_phospho.py, hvantk/tests/test_expression_tables_hail.py, hvantk/tests/test_matrix_utils.py, hvantk/tests/test_summarize_expression.py, hvantk/tests/test_wilcoxon_hail.py
Removed Hail-dependent tests covering MatrixTable construction, CPTAC/phospho conversion, expression summarization, and Wilcoxon marker detection.
Test Updates — Added
hvantk/tests/test_anndata_utils.py, hvantk/tests/test_converters.py, hvantk/tests/test_expression_builders_anndata.py, hvantk/tests/test_matrix_utils_anndata.py, hvantk/tests/test_ptm_constraint.py
Added AnnData-based tests for utilities, Hail↔AnnData converters, expression builders with CLI integration, matrix operations, and PTM constraint pipeline smoke tests.
Test Updates — Modified
hvantk/tests/test_ptm.py, hvantk/tests/test_peptideatlas_phospho.py
Updated PTM roundtrip test for BGZF output; updated phosphoproteomics download mocking from urllib to requests.
Dependencies & Configuration
.gitignore, environment.yml, pyproject.toml, requirements.txt, hvantk/tables/registry.py, hvantk/tables/table_builders.py
Added ignore patterns for Jupyter/VSCode/Claude; updated dependencies (anndata, scanpy, pysam, PyYAML, psutil, scikit-learn, tspex); updated registry to reference new *_ad builder functions; added min_partitions=16 for PTM table import.
Documentation — Added
docs_site/guide/usage.md, docs_site/tools/ptm-constraint.md
Added usage section for hvantk ptm constraint workflow with expression backend options, grouping/aggregation choices, and output structure; added reference documentation for constraint analysis command with flags and caveats.
Documentation — Modified
docs_site/architecture.md
Expanded ptm/ module description to enumerate new constraint files and describe stratified AF-depletion analysis via hvantk ptm constraint.
Documentation — Removed
docs/superpowers/plans/2026-03-31-alphagenome-streamer.md, docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
Deleted AlphaGenome Variant Prediction Streamer implementation plan and design specification documents.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Suggested labels

enhancement, architecture, documentation, data-formats

Suggested reviewers

  • ypriverol

Poem

🐰 Hop! From tables of Hail to AnnData's way,
New constraint tests bloom—tau quartiles at play!
BGZF blocks cascade, PTM sites align,
One pipeline converges, where markers all shine.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: coverage is 57.78%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'Integrate AnnData support and implement PTM constraint analysis' directly matches the two main themes of this changeset, AnnData integration and the new PTM constraint feature.



codacy-production bot commented Apr 14, 2026

Not up to standards ⛔

🔴 Issues 23 high · 18 medium · 59 minor

Alerts:
⚠ 100 issues (threshold: ≤ 0 issues of at least minor severity)

Results:
100 new issues

  Category       Results
  UnusedCode     2 medium · 1 minor
  Documentation  58 minor
  ErrorProne     6 high
  Security       17 high
  Complexity     16 medium

View in Codacy

🟢 Metrics 307 complexity · 4 duplication

  Metric       Results
  Complexity   307
  Duplication  4

View in Codacy


Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new PTM constraint analysis workflow (hvantk ptm constraint) and refactors expression-matrix handling to support AnnData (.h5ad) as a first-class backend (alongside Hail MT and pre-aggregated tabular inputs), with updated builders/CLI/tests/docs.

Changes:

  • Introduces PTM constraint analysis modules + CLI command, producing TSV/plots/HTML report across multiple expression backends.
  • Refactors hvantk mkmatrix and expression builders to produce AnnData objects instead of Hail MatrixTables.
  • Adds AnnData/scanpy/tspex-related utilities, converters, tests, and documentation updates.

Reviewed changes

Copilot reviewed 52 out of 55 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
requirements.txt Adds tspex to pip requirements.
pyproject.toml Adds AnnData/Scanpy deps and new extras for constraint/PTM-related features.
environment.yml Adds conda deps required for AnnData + constraint workflows.
hvantk/visualization/expression/hail.py Removes Hail-backed expression distribution plotting module.
hvantk/visualization/expression/anndata.py Adds AnnData-backed expression distribution plotting.
hvantk/visualization/expression/__init__.py Updates expression visualization namespace docs to reference AnnData backend.
hvantk/visualization/__init__.py Routes visualize_expression_distribution facade to AnnData implementation.
hvantk/utils/wilcoxon_hail.py Removes Hail adapter for Wilcoxon marker detection.
hvantk/utils/tissue_specificity.py Adds tspex wrapper for tissue-specificity metrics (τ, etc.).
hvantk/utils/gene_sets.py Updates marker gene set extraction docstring to clarify expected summary format(s).
hvantk/tests/test_wilcoxon_hail.py Removes Hail Wilcoxon integration tests.
hvantk/tests/test_summarize_expression.py Removes Hail-based expression summarization tests.
hvantk/tests/test_matrix_utils.py Removes Hail MatrixTable-based matrix_utils tests (incl. plotting).
hvantk/tests/test_expression_tables_hail.py Removes Hail integration tests for expression table builders.
hvantk/tests/test_cptac_phospho.py Removes CPTAC phospho dataset tests tied to old pipeline paths.
hvantk/tests/test_cptac.py Removes Hail-based CPTAC conversion tests.
hvantk/tests/test_cli_mkmatrix.py Removes MT-based mkmatrix CLI tests.
hvantk/tests/test_matrix_utils_anndata.py Adds AnnData-based matrix_utils tests.
hvantk/tests/test_expression_builders_anndata.py Adds AnnData builder + mkmatrix CLI integration tests.
hvantk/tests/test_anndata_utils.py Adds tests for AnnData provenance, summaries, and I/O helpers.
hvantk/tests/test_converters.py Adds tests for Hail MT ↔ AnnData converters.
hvantk/tests/test_ptm_constraint.py Adds smoke tests for PTM constraint pipeline components + CLI help.
hvantk/tests/test_ptm.py Updates PTM mapping test to read BGZF output.
hvantk/tests/test_peptideatlas_phospho.py Updates download mocking to match new requests.get implementation.
hvantk/tables/ucsc.py Refactors UCSC loader from Hail MT to pandas/AnnData ingestion.
hvantk/tables/matrix_builders.py Refactors builders to produce AnnData and write .h5ad.
hvantk/tables/expression_atlas.py Refactors Expression Atlas ingestion from Hail to pandas/AnnData.
hvantk/tables/cptac.py Refactors CPTAC ingestion from Hail to AnnData (long + phospho wide).
hvantk/tables/registry.py Updates registry to point matrix adapters to new AnnData builders.
hvantk/tables/table_builders.py Adjusts a Hail import call to set min_partitions=16.
hvantk/core/anndata_utils.py Adds AnnData helpers for provenance metadata, column summaries, save/load.
hvantk/core/converters.py Adds Hail MatrixTable ↔ AnnData converters.
hvantk/core/hail_context.py Adds NumPy compatibility shim (np.bool) before importing Hail.
hvantk/ptm/pipeline.py Adds stdlib-only BGZF writing + switches downloads to requests.
hvantk/ptm/mapper.py Adds CDS mapping cache/binary-search mapping and optimizes GTF parsing.
hvantk/ptm/annotate.py Extends PTM annotation to optionally collect per-site evidence structs.
hvantk/ptm/__init__.py Converts PTM package exports to lazy imports to avoid eager Hail import.
hvantk/ptm/constraint_expression.py Adds expression-source adapter producing gene×group matrices from multiple backends.
hvantk/ptm/constraint_plots.py Adds four-panel plotting for constraint analysis outputs.
hvantk/commands/make_matrix_cli.py Refactors hvantk mkmatrix to produce .h5ad AnnData outputs.
hvantk/commands/ptm_cli.py Adds hvantk ptm constraint CLI command + options.
hvantk/datasets/peptideatlas_phospho_datasets.py Switches download implementation from urlretrieve to requests.
hvantk/datasets/cptac_phospho_datasets.py Adds source fallback logic for CPTAC phospho/clinical retrieval.
hvantk/data/file_utils.py Adds BGZF writer implementation (used for bgz output elsewhere).
docs_site/tools/ptm-constraint.md Adds a user guide and flag reference for PTM constraint command.
docs_site/guide/usage.md Documents how to run PTM constraint analysis in the general usage guide.
docs_site/architecture.md Registers/describe new PTM constraint modules in architecture docs.
docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md Removes obsolete AlphaGenome streamer design spec.
.gitignore Adds ignores for notebook checkpoints, VS Code settings, and .claude/.


Comment thread hvantk/tables/matrix_builders.py Outdated
index_col=metadata_index_col,
)
logger.info("Loading UCSC metadata from %s", metadata_path)
metadata_df = load_ucsc_metadata(metadata_path)
Copilot AI Apr 14, 2026

The CLI passes delimiter, but metadata loading ignores it (load_ucsc_metadata(metadata_path) uses its default tab separator). This makes --delimiter inconsistent and will break non-tab metadata files. Pass sep=delimiter (and potentially expose index_col again if needed) when calling load_ucsc_metadata.

Suggested change
- metadata_df = load_ucsc_metadata(metadata_path)
+ metadata_df = load_ucsc_metadata(metadata_path, sep=delimiter)

Copilot uses AI. Check for mistakes.
Comment on lines +199 to +204
logger.info("Reading CPTAC expression from %s", expression_path)
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
meta_df = meta_df.set_index(sample_id_col)
Copilot AI Apr 14, 2026

build_cptac_ad unconditionally reads with sep="\t", but the docstring says TSV/CSV are supported. Either detect the delimiter from the file extension (or sniff), or update the docstring/CLI to only support TSV to avoid surprising parse failures for CSV inputs.

Comment thread hvantk/tables/matrix_builders.py Outdated
Comment on lines +266 to +269
expr_df = pd.read_csv(expression_path, sep="\t")

logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
meta_df = pd.read_csv(metadata_path, sep="\t")
Copilot AI Apr 14, 2026

build_cptac_phospho_ad also hard-codes sep="\t" while the docstring advertises TSV/CSV support. Please align behavior and docs (delimiter detection or explicit TSV-only contract).

Suggested change
- expr_df = pd.read_csv(expression_path, sep="\t")
- logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
- meta_df = pd.read_csv(metadata_path, sep="\t")
+ expr_df = pd.read_csv(expression_path, sep=None, engine="python")
+ logger.info("Reading CPTAC phospho metadata from %s", metadata_path)
+ meta_df = pd.read_csv(metadata_path, sep=None, engine="python")

Comment thread hvantk/ptm/mapper.py Outdated
Comment on lines +121 to +131
 for line in f:
-    if line.startswith("#"):
+    if line[0] == "#":
         continue
-    fields = line.strip().split("\t")

+    # Quick scan for feature type before full split — skip the ~80%
+    # of lines that are gene/exon/UTR/start_codon/stop_codon/etc.
+    # GTF columns are tab-separated; feature type is in column 3.
+    tab1 = line.index("\t")
+    tab2 = line.index("\t", tab1 + 1)
+    tab3 = line.index("\t", tab2 + 1)
+    feature = line[tab2 + 1 : tab3]
Copilot AI Apr 14, 2026

parse_ensembl_gtf assumes every line is non-empty and tab-delimited. line[0] will raise on blank lines, and line.index("\t") will raise ValueError on malformed lines, aborting the whole parse. Add a quick guard (e.g., if not line or line[0] == '#': continue) and handle missing tabs safely (e.g., try/except ValueError or if '\t' not in line: continue).

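The guard suggested above can be sketched as a standalone helper (names are illustrative, not the actual `parse_ensembl_gtf` internals): skip blank and comment lines, and treat lines with fewer than three tabs as malformed rather than letting `ValueError` abort the whole parse.

```python
def iter_cds_lines(lines):
    """Yield only well-formed GTF lines whose feature type (column 3) is CDS."""
    for line in lines:
        if not line.strip() or line[0] == "#":
            continue                      # blank or header/comment line
        try:
            tab1 = line.index("\t")
            tab2 = line.index("\t", tab1 + 1)
            tab3 = line.index("\t", tab2 + 1)
        except ValueError:
            continue                      # malformed line: fewer than 3 tabs
        feature = line[tab2 + 1:tab3]     # GTF column 3 is the feature type
        if feature == "CDS":
            yield line

gtf = [
    "#!genome-build GRCh38\n",
    "\n",
    "1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id \"G1\";\n",
    "1\thavana\tCDS\t12010\t12057\t.\t+\t0\tgene_id \"G1\";\n",
    "not-a-gtf-line\n",
]
cds = list(iter_cds_lines(gtf))
```

The try/except keeps the fast `str.index` scan from the optimized parser while making the loop robust to trailing blank lines and stray non-GTF content.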
Comment thread hvantk/ptm/constraint_expression.py Outdated
Comment on lines +306 to +308
    agg_expr = hl.median(hl.agg.collect(entry))
elif aggfunc == "median_nonzero":
    agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
Copilot AI Apr 14, 2026

For the Hail MT backend, computing the median via hl.median(hl.agg.collect(entry)) (and the filtered variant) collects all per-group entry values into an array per row, which is not scalable for large MTs and can blow executor memory. Consider using Hail's approximate quantile aggregators (e.g., hl.agg.approx_quantiles(entry, 0.5)[0] / hl.agg.approx_median if available) or restricting the Hail backend to mean only.

Suggested change
-     agg_expr = hl.median(hl.agg.collect(entry))
- elif aggfunc == "median_nonzero":
-     agg_expr = hl.median(hl.agg.filter(entry > 0, hl.agg.collect(entry)))
+     agg_expr = hl.agg.approx_quantiles(entry, [0.5])[0]
+ elif aggfunc == "median_nonzero":
+     agg_expr = hl.agg.filter(entry > 0, hl.agg.approx_quantiles(entry, [0.5]))[0]

Comment thread hvantk/ptm/pipeline.py Outdated
Comment on lines +24 to +91

import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import struct
import zlib

# ---------------------------------------------------------------------------
# Lightweight BGZF writer (stdlib-only, no Hail dependency).
#
# Mirrors hvantk.data.file_utils.BgzfWriter but lives here so the PTM
# pipeline's pure-Python stages never trigger hvantk.data.__init__, which
# eagerly imports Hail.
# ---------------------------------------------------------------------------

_BGZF_BLOCK_SIZE = 65280 # max uncompressed payload per BGZF block


def _make_bgzf_block(data: bytes) -> bytes:
"""Compress *data* into a single BGZF block."""
comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
compressed = comp.compress(data) + comp.flush()
bsize = 18 + len(compressed) + 8 - 1
header = (
b"\x1f\x8b\x08\x04"
b"\x00\x00\x00\x00"
b"\x00\xff"
+ struct.pack("<H", 6)
+ b"BC"
+ struct.pack("<H", 2)
+ struct.pack("<H", bsize)
)
crc = zlib.crc32(data) & 0xFFFFFFFF
trailer = struct.pack("<I", crc) + struct.pack("<I", len(data) & 0xFFFFFFFF)
return header + compressed + trailer


class BgzfWriter:
"""Buffered BGZF text writer for producing block-gzipped TSV files."""

def __init__(self, path: str, encoding: str = "utf-8") -> None:
self._fout = open(path, "wb")
self._encoding = encoding
self._buf = bytearray()

def __enter__(self) -> "BgzfWriter":
return self

def __exit__(self, *exc) -> None:
self.close()

def write(self, text: str) -> None:
self._buf.extend(text.encode(self._encoding))
while len(self._buf) >= _BGZF_BLOCK_SIZE:
chunk = bytes(self._buf[:_BGZF_BLOCK_SIZE])
self._buf = self._buf[_BGZF_BLOCK_SIZE:]
self._fout.write(_make_bgzf_block(chunk))

def close(self) -> None:
if self._fout.closed:
return
if self._buf:
self._fout.write(_make_bgzf_block(bytes(self._buf)))
self._buf.clear()
self._fout.write(_make_bgzf_block(b""))
self._fout.close()
Copilot AI Apr 14, 2026

BgzfWriter is now implemented both here and in hvantk.data.file_utils, which creates a maintenance risk (bugfixes can diverge). Since importing hvantk.data.file_utils triggers the heavy hvantk.data.__init__, consider moving the BGZF writer to a lightweight shared module (e.g. hvantk.core.bgzf) or making hvantk.data.__init__ avoid eager heavy imports so both callers can reuse a single implementation.

Suggested change (remove the duplicated `_make_bgzf_block` / `BgzfWriter` definitions shown above and replace the stdlib-only section with a loader for the shared implementation):

import importlib.util
from pathlib import Path

import requests
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# ---------------------------------------------------------------------------
# Reuse the shared BGZF writer implementation without importing
# hvantk.data.__init__, which may eagerly import heavy dependencies.
# ---------------------------------------------------------------------------
def _load_shared_bgzf_writer():
"""Load hvantk.data.file_utils.BgzfWriter without package import side effects."""
file_utils_path = Path(__file__).resolve().parents[1] / "data" / "file_utils.py"
spec = importlib.util.spec_from_file_location("hvantk._bgzf_file_utils", file_utils_path)
if spec is None or spec.loader is None:
raise ImportError(f"Unable to load BGZF writer module from {file_utils_path}")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
try:
return module.BgzfWriter
except AttributeError as exc:
raise ImportError(
f"Module loaded from {file_utils_path} does not define BgzfWriter"
) from exc
BgzfWriter = _load_shared_bgzf_writer()

Copilot uses AI. Check for mistakes.
Comment thread hvantk/core/converters.py
Comment on lines +68 to +83
# Collect row (var) and col (obs) annotations
row_df = mt.rows().to_pandas()
row_df = row_df.set_index(row_key)

col_df = mt.cols().to_pandas()
col_df = col_df.set_index(col_key)

# Collect entries and pivot to dense matrix
entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

# Reindex to match collected row/col order
pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

X = pivot.values.astype(np.float32)

Copilot AI Apr 14, 2026

hail_mt_to_anndata pulls rows(), cols(), and especially entries() into pandas and pivots to a dense matrix. This will only work for very small MatrixTables and can easily OOM the driver for real expression MTs. It would help to either document/guard this as a small-MT-only helper, or provide an export/chunked path (or a sparse AnnData build) for larger inputs.
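A guard of the kind suggested here could look like the following sketch; the `HVANTK_MAX_MATRIX_CELLS` environment variable, the default limit, and the helper name are illustrative assumptions, not existing hvantk behavior:

```python
import os

# Hypothetical guard; the constant and env var are illustrative assumptions.
DEFAULT_MAX_MATRIX_CELLS = 50_000_000  # ~200 MB dense at float32

def check_dense_conversion_size(n_rows, n_cols, max_cells=None):
    """Fail fast instead of OOM-ing when a dense pivot would be too large."""
    if max_cells is None:
        max_cells = int(os.environ.get("HVANTK_MAX_MATRIX_CELLS",
                                       DEFAULT_MAX_MATRIX_CELLS))
    n_entries = n_rows * n_cols
    if n_entries > max_cells:
        raise ValueError(
            f"Matrix is {n_rows} x {n_cols} ({n_entries} entries), above the "
            f"dense-conversion limit of {max_cells}; export in chunks or "
            f"build a sparse AnnData instead."
        )

# Demo: a small matrix passes, a realistic expression MT is rejected.
small_ok = True
try:
    check_dense_conversion_size(1_000, 2_000)
except ValueError:
    small_ok = False
try:
    check_dense_conversion_size(200_000, 300_000)
    big_blocked = False
except ValueError:
    big_blocked = True
```

Calling this with `mt.count_rows()` / `mt.count_cols()` before the pivot turns a driver OOM into an actionable error.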

Comment on lines +35 to +38
X = adata.X
if sp.issparse(X):
X = X.toarray()
values = np.asarray(X).ravel()
Copilot AI Apr 14, 2026

adata.X can be a large sparse matrix; calling X.toarray() materializes the full dense matrix and can OOM for realistic scRNA-seq datasets. For a histogram, consider using X.data (non-zeros) plus an implicit zero count, or sampling values, so sparse inputs stay memory-safe.
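The X.data-plus-implicit-zero-count idea can be sketched as follows; `sparse_value_sample` is a hypothetical helper, not part of the module, and it assumes a CSR/CSC/COO matrix whose `.data` holds the stored entries:

```python
import numpy as np
import scipy.sparse as sp

def sparse_value_sample(X, max_samples=1_000_000, seed=42):
    """Collect histogram values without densifying a sparse matrix.

    Hypothetical helper: returns (values, n_implicit_zeros) so the caller
    can draw the zero mass as a single bar instead of materializing it.
    """
    rng = np.random.default_rng(seed)
    if sp.issparse(X):
        values = np.asarray(X.data).ravel()  # stored (non-zero) entries only
        n_zeros = X.shape[0] * X.shape[1] - X.nnz
    else:
        values = np.asarray(X).ravel()
        n_zeros = 0
    if values.size > max_samples:
        values = rng.choice(values, max_samples, replace=False)
    return values, n_zeros

m = sp.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
vals, n_zeros = sparse_value_sample(m)
```

This keeps peak memory proportional to `nnz` (or `max_samples`) rather than `n_obs * n_vars`.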


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

🧹 Nitpick comments (15)
hvantk/datasets/peptideatlas_phospho_datasets.py (1)

441-446: Consider moving the requests import to module level.

The requests import is placed inside the method. While this delays the import cost, the requests library is already a common dependency used elsewhere in the codebase (e.g., hvantk/data/file_utils.py). Moving it to the top of the file with other imports would be more consistent with the module's style.

♻️ Suggested change

At the top of the file (after line 27):

import requests

Then in download():

-            import requests
-
             with requests.get(self.zip_url, stream=True, timeout=600) as resp:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 441 - 446,
Move the inline import of requests out of the download method and add it to the
module-level imports so the library is imported once for the module; update the
top of the file to include "import requests" and remove the "import requests"
line inside the download() block (the code using requests remains the same:
requests.get(self.zip_url, stream=True, timeout=600) ... resp.raise_for_status()
... shutil.copyfileobj(resp.raw, fout)).
hvantk/tables/expression_atlas.py (1)

214-214: Consider handling non-numeric values gracefully.

The .astype(np.float32) conversion will fail if sample columns contain non-numeric values (e.g., "NA" strings not recognized by pandas). Consider using pd.to_numeric with errors='coerce' for robustness:

♻️ Suggested improvement
-    X = df[sample_cols].values.T.astype(np.float32)
+    # Convert to numeric, coercing errors to NaN
+    X = df[sample_cols].apply(pd.to_numeric, errors='coerce').values.T.astype(np.float32)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/expression_atlas.py` at line 214, The current assignment X =
df[sample_cols].values.T.astype(np.float32) will raise on non-numeric strings;
change to coerce non-numeric values to NaN first (e.g. replace the
df[sample_cols] selection with df[sample_cols].apply(lambda c: pd.to_numeric(c,
errors='coerce')) or df[sample_cols].apply(pd.to_numeric, errors='coerce')) and
then convert to float32 and transpose to produce X; adjust downstream code to
handle resulting NaNs (e.g., drop or impute) as needed.
.gitignore (1)

212-213: Duplicate .ipynb_checkpoints entry.

Line 79 already ignores .ipynb_checkpoints (without trailing slash). This new entry with the trailing slash is redundant since both patterns achieve the same result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 212 - 213, Remove the duplicate .ipynb_checkpoints
entry from the ignore list: keep only one pattern for ".ipynb_checkpoints"
(either with or without the trailing slash) and delete the redundant line that
adds ".ipynb_checkpoints/" so the .gitignore contains a single canonical entry
for that pattern.
hvantk/data/file_utils.py (1)

462-466: Unused _path attribute.

The _path attribute is stored but never referenced. Consider removing it or using it in error messages/logging.

♻️ Suggested fix
     def __init__(self, path: str, encoding: str = "utf-8") -> None:
-        self._path = path
         self._encoding = encoding
         self._fout = open(path, "wb")
         self._buf = bytearray()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/data/file_utils.py` around lines 462 - 466, The __init__ stores
self._path but never uses it; either remove the unused attribute or keep and use
it for context (e.g., include self._path in error messages/logging and in
exception handling around opening the file). Update the initializer in the
class's __init__ (and any related error handling elsewhere in this class) to
either drop the assignment to self._path or reference self._path in log/raise
messages when calling open(path, "wb") and when reporting I/O errors so the file
path is not lost.
hvantk/tests/test_ptm.py (1)

183-186: Consider moving gzip import to module level.

The gzip module is a standard library module that could be imported at the top of the file alongside other imports for consistency and clarity.

Suggested change

Add to the imports section at the top of the file:

import gzip

Then simplify the test code:

     # Read back and verify (BGZF is gzip-compatible)
-    import gzip
     with gzip.open(output_bgz, "rt") as f:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm.py` around lines 183 - 186, Move the local "import
gzip" out of the test body and add it to the module-level imports at the top of
hvantk/tests/test_ptm.py, then remove the inline import so the block that opens
output_bgz with gzip.open(...) and constructs csv.DictReader(...) uses the
top-level gzip; update any linters/import-grouping if needed.
hvantk/visualization/expression/anndata.py (1)

35-38: Consider memory implications for large sparse matrices.

Converting a large sparse matrix to dense and raveling all values into memory could cause issues with very large datasets. For visualization purposes, consider sampling a subset of values when the matrix is large.

Optional: Add sampling for large matrices
+    MAX_SAMPLES = 1_000_000  # Limit samples for memory efficiency
+
     X = adata.X
     if sp.issparse(X):
-        X = X.toarray()
-    values = np.asarray(X).ravel()
+        # Sample from sparse data to avoid memory explosion
+        data = X.data
+        if len(data) > MAX_SAMPLES:
+            rng = np.random.default_rng(42)
+            data = rng.choice(data, MAX_SAMPLES, replace=False)
+        values = data
+    else:
+        values = np.asarray(X).ravel()
+        if len(values) > MAX_SAMPLES:
+            rng = np.random.default_rng(42)
+            values = rng.choice(values, MAX_SAMPLES, replace=False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/visualization/expression/anndata.py` around lines 35 - 38, The current
code unconditionally densifies adata.X (X = X.toarray()) which can OOM for large
sparse matrices; instead, when sp.issparse(X) detect large matrices by checking
X.nnz or X.shape and apply sampling: use X.data (the nonzero values) and
randomly sample up to a MAX_SAMPLES limit, and if you need zeros represented
also randomly sample indices from the implicit zeros to approximate
distribution; then build values from the sampled nonzero and sampled-zero values
rather than calling toarray(); update the block around X, sp.issparse(X), and
values to implement this conditional sampling strategy.
pyproject.toml (1)

31-32: Make scanpy an optional dependency.

scanpy is imported only once, within function scope in summarize_expression_cli.py (line 289), and is not used elsewhere in the codebase. This minimal and localized usage justifies making it optional to reduce the dependency footprint for users who don't need marker gene analysis.

Suggested change
 anndata = ">=0.10.0"
-scanpy = ">=1.10.0"
+scanpy = { version = ">=1.10.0", optional = true }

 [tool.poetry.extras]
 viz = ["matplotlib", "seaborn", "plotly"]
 duckdb = ["duckdb"]
 interactive = ["plotly"]
 hgc = ["matplotlib", "seaborn", "plotly", "gnomad"]
 psroc = ["matplotlib", "plotly"]
 ptm = ["cptac"]
 constraint = ["tspex", "matplotlib", "seaborn"]
+markers = ["scanpy"]

Then add a runtime check in summarize_expression_cli.py:

try:
    import scanpy as sc
except ImportError as exc:
    raise ImportError(
        "scanpy is required for marker analysis. "
        "Install with 'pip install hvantk[markers]'"
    ) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 31 - 32, Update pyproject.toml to make scanpy
optional by removing it from the main dependencies and adding an extras group
(e.g., extras = { markers = ["scanpy>=1.10.0"] }) so users can install
hvantk[markers] when they need marker analysis; then in
summarize_expression_cli.py, wrap the local import of scanpy (used in the
marker-related function) in a try/except ImportError and raise a clear
ImportError message pointing users to install the markers extra (e.g., "scanpy
is required for marker analysis. Install with 'pip install hvantk[markers]'") so
the runtime check fails fast with guidance.
hvantk/tests/test_ptm_constraint.py (1)

90-108: Use tmp_path fixture instead of hardcoded /tmp/out.

While this config is only validated (not executed), using a hardcoded /tmp path is flagged as a security concern and doesn't follow test isolation best practices.

📝 Proposed fix
-def test_constraint_config_validate():
+def test_constraint_config_validate(tmp_path):
     from hvantk.ptm.constraint import PTMConstraintConfig

     cfg = PTMConstraintConfig(
         variants_ht_path="/does/not/exist.ht",
         expression_source="bogus-source",
         expression_path="/also/missing.tsv",
         grouping="tissue",
-        output_dir="/tmp/out",
+        output_dir=str(tmp_path / "out"),
         label_filter="NOT_A_LABEL",
         min_cells_per_group=-5,
         min_variants_per_group=0,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_ptm_constraint.py` around lines 90 - 108, The test
test_constraint_config_validate uses a hardcoded output_dir "/tmp/out"; change
the test to accept the pytest tmp_path fixture, pass output_dir=str(tmp_path)
when constructing PTMConstraintConfig, and update the test function signature to
def test_constraint_config_validate(tmp_path): so the config uses an isolated
temporary directory rather than a global /tmp path while keeping all other
validation assertions unchanged.
hvantk/tests/test_expression_builders_anndata.py (1)

12-24: Consider adding strict=True to zip() calls.

While the test data is controlled, using strict=True would catch accidental length mismatches during test maintenance.

📝 Proposed fix
     def _write_expression_matrix(self, path, genes, cells, values):
         """Write a genes-x-cells TSV expression matrix."""
         with open(path, "w") as fh:
             fh.write("gene\t" + "\t".join(cells) + "\n")
-            for gene, row in zip(genes, values):
+            for gene, row in zip(genes, values, strict=True):
                 fh.write(gene + "\t" + "\t".join(str(v) for v in row) + "\n")

     def _write_metadata(self, path, cell_ids, cell_types):
         """Write a metadata TSV with cell_id index and cell_type column."""
         with open(path, "w") as fh:
             fh.write("cell_id\tcell_type\n")
-            for cid, ct in zip(cell_ids, cell_types):
+            for cid, ct in zip(cell_ids, cell_types, strict=True):
                 fh.write(f"{cid}\t{ct}\n")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tests/test_expression_builders_anndata.py` around lines 12 - 24, The
zip() calls in _write_expression_matrix and _write_metadata can silently ignore
length mismatches; update both to use zip(..., strict=True) so mismatched
lengths raise an error during tests — change for gene,row in zip(genes, values)
to zip(genes, values, strict=True) in _write_expression_matrix and for cid,ct in
zip(cell_ids, cell_types) to zip(cell_ids, cell_types, strict=True) in
_write_metadata.
docs_site/tools/ptm-constraint.md (1)

83-97: Add language specifier to fenced code block.

The output directory tree block should have a language identifier for consistency and to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 <output-dir>/
 ├── summary.json                     # top-line effect sizes + config echo
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs_site/tools/ptm-constraint.md` around lines 83 - 97, The fenced code
block that starts with the directory tree beginning "<output-dir>/" is missing a
language specifier; update the opening fence from ``` to ```text so the block
becomes a ```text fenced code block (i.e., add "text" after the three backticks)
to satisfy markdown linting and ensure consistent rendering of the directory
tree.
hvantk/commands/summarize_expression_cli.py (1)

163-199: Consider using ctx.exit(1) for consistency.

Line 170 uses raise SystemExit(1) while other commands in this module use ctx.exit(1). Both work, but ctx.exit(1) is more consistent with Click patterns used elsewhere in the codebase.

📝 Proposed fix
+@click.pass_context
 def summarize_expression_cmd(
+    ctx,
     matrix_path,
     ...
 ):
     ...
     if output_path.exists() and not overwrite:
         click.echo(
             f"Error: Output file already exists: {output_path}\n"
             "Use --overwrite to replace it.",
             err=True,
         )
-        raise SystemExit(1)
+        ctx.exit(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/summarize_expression_cli.py` around lines 163 - 199, Replace
the direct raise SystemExit(1) in the output existence check with a Click-style
exit call for consistency: use ctx.exit(1) (or, if the command function does not
receive a ctx parameter, call click.get_current_context().exit(1)). Update the
code at the output_path existence branch (the block that currently does
click.echo(...) then raise SystemExit(1)) to call the appropriate ctx.exit(1)
instead.
hvantk/commands/make_matrix_cli.py (2)

116-135: Consider exposing --gene-name-column option for completeness.

The build_expression_atlas_ad function accepts a gene_name_column parameter (default "Gene Name") that isn't exposed in the CLI. While the default works for standard Expression Atlas files, users with non-standard column names would need to modify the code.

This is optional since the default handles the common case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 116 - 135, The CLI function
mkmatrix_expression_atlas should expose a --gene-name-column option and forward
it to the internal _build_expression_atlas_ad call; add a new parameter to
mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that
maps to the CLI flag --gene-name-column, and pass that value into
_build_expression_atlas_ad via its gene_name_column argument so non-standard
Expression Atlas files work without editing code.

222-240: Missing --site-id-col CLI option for CPTAC phospho builder.

The build_cptac_phospho_ad function accepts a site_id_col parameter (default "SiteID") that isn't exposed in the CLI. This is inconsistent with the cptac command which exposes all column configuration options.

Users with phospho matrices using a different site identifier column name (e.g., "Site", "PhosphoSite") would be unable to process them via CLI.

Proposed fix to add site_id_col option
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
+@click.option("--site-id-col", default="SiteID", show_default=True, help="Column containing site identifiers")
 @click.option("-w", "--overwrite", is_flag=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_path,
     sample_id_col,
+    site_id_col,
     overwrite,
 ):
     """Build a CPTAC phospho AnnData object from site intensities and metadata."""
     logger.info("Building CPTAC phospho AnnData")

     adata = _build_cptac_phospho_ad(
         expression_path=expression,
         metadata_path=metadata,
         output_path=output_path,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         overwrite=overwrite,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 222 - 240, Add a new Click
option "--site-id-col" (arg name site_id_col, default "SiteID",
show_default=True) to the mkmatrix_cptac_phospho command and forward that
site_id_col argument into the call to _build_cptac_phospho_ad so it passes
site_id_col=site_id_col; this exposes the build_cptac_phospho_ad site_id_col
parameter (default "SiteID") to the CLI so users can specify custom phospho site
identifier column names.
hvantk/tables/matrix_builders.py (2)

10-21: Unused import and unsorted __all__.

  1. List is imported from typing but never used in this module.
  2. Static analysis flags that __all__ is not sorted (RUF022).
Proposed fix
 import anndata as ad
 import logging
-from typing import Optional, List
+from typing import Optional

 logger = logging.getLogger(__name__)

 __all__ = [
+    "build_cptac_ad",
+    "build_cptac_phospho_ad",
+    "build_expression_atlas_ad",
     "build_ucsc_ad",
-    "build_expression_atlas_ad",
-    "build_cptac_ad",
-    "build_cptac_phospho_ad",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 10 - 21, Remove the unused
typing import and sort the exported symbols: delete the unused "List" import
from the typing import list and then reorder the __all__ list alphabetically
(e.g., "build_cptac_ad", "build_cptac_phospho_ad", "build_expression_atlas_ad",
"build_ucsc_ad") so static analysis warnings for the unused import (List) and
unsorted __all__ are resolved; keep the existing function names (build_ucsc_ad,
build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.

199-204: Hardcoded tab separator may fail for comma-delimited CSV files.

The docstring mentions "TSV/CSV" support, but pd.read_csv uses hardcoded sep="\t". Users with comma-separated files would encounter parsing errors.

Consider either:

  1. Adding a delimiter parameter (consistent with UCSC/Expression Atlas builders)
  2. Updating the docstring to clarify only TSV is supported
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 199 - 204, The code hardcodes
a tab separator when reading CPTAC files (logger.info + pd.read_csv for
expression_path and metadata_path, producing expr_df and meta_df), which breaks
comma-delimited CSV input; add a new parameter (e.g., delimiter or sep) to the
enclosing function and pass it into both pd.read_csv calls (use the default "\t"
to preserve current behavior) or alternatively update the docstring to state
only TSV is supported; ensure meta_df = meta_df.set_index(sample_id_col) still
runs after reading with the chosen delimiter.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/core/anndata_utils.py`:
- Around line 91-98: The numeric-aggregation code uses
np.nanmin/np.nanmax/np.nanmean on series.values which can be an ExtensionArray
for pandas nullable dtypes; convert the Series to a standard float NumPy array
or use pandas' native aggregations to avoid TypeError. Update the logic around
the variable 'series' (the Series from adata.obs[col]) so that before reducing
you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").
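The conversion this prompt describes, as a minimal pandas/NumPy sketch (the column contents are made up):

```python
import numpy as np
import pandas as pd

# A nullable-dtype column of the kind adata.obs might hold; np.nanmean
# cannot reduce the backing ExtensionArray directly, so convert first.
series = pd.Series([1, 2, pd.NA], dtype="Int64")
arr = series.to_numpy(dtype=float, na_value=np.nan)

summary = {
    "min": float(np.nanmin(arr)),
    "max": float(np.nanmax(arr)),
    "mean": float(np.nanmean(arr)),
}
```

`series.min(skipna=True)` / `.max` / `.mean` would work equally well and avoid the intermediate array.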

In `@hvantk/core/converters.py`:
- Around line 69-82: hail_mt_to_anndata and anndata_to_hail_mt currently
materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
- Around line 55-59: The resolver currently takes the first element of
mt.row_key/mt.col_key without validating cardinality; update the logic around
row_key and col_key so you explicitly validate mt.row_key and mt.col_key: if
len(mt.row_key) == 0 or len(mt.col_key) == 0 raise a clear ValueError indicating
the MatrixTable is unkeyed, and if len(...) > 1 raise a ValueError indicating a
composite key and require the caller to supply an explicit row_key/col_key; only
when len(...) == 1 set row_key = list(mt.row_key)[0] (and similarly for
col_key). Ensure the error messages reference row_key/col_key and MatrixTable so
callers can fix input.
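The cardinality check could be factored as a small helper; `resolve_single_key` is hypothetical (not hvantk API) and takes a plain list standing in for `list(mt.row_key)` / `list(mt.col_key)`:

```python
def resolve_single_key(keys, kind):
    """Pick the single key field from a MatrixTable key list, failing loudly.

    Hypothetical helper illustrating the validation the prompt asks for.
    """
    if len(keys) == 0:
        raise ValueError(f"MatrixTable is unkeyed; set a {kind} key before converting.")
    if len(keys) > 1:
        raise ValueError(
            f"MatrixTable has a composite {kind} key {keys!r}; "
            f"pass an explicit {kind}_key argument instead."
        )
    return keys[0]

row_key = resolve_single_key(["gene_id"], "row")
try:
    resolve_single_key(["chrom", "pos"], "row")
    composite_blocked = False
except ValueError:
    composite_blocked = True
```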

In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 304-312: The per-call fallback in _get_phospho causes inconsistent
sources and overly-broad exception swallowing; instead, probe and pick a single
working source once (e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.

In `@hvantk/ptm/constraint_expression.py`:
- Around line 138-141: The loader currently accepts .pkl/.pickle and calls
pd.read_pickle (variable ext with branches creating df), which is unsafe; remove
the pickle branch entirely and ensure the CLI/validation for --expression-path
rejects .pkl/.pickle (or only allow .parquet/.pq), so the code only calls
pd.read_parquet for supported extensions; update any help/validation code that
lists accepted extensions to exclude pickle and add an explicit unsafe flag if
you want to reintroduce pickle behind opt-in.
- Around line 218-229: The code fails when adata.obs[grouping] contains NaN
because astype(str) yields mixed types that break sorted(set(labels)); fix by
filtering out missing group values before converting to strings and before
computing unique_groups: create a boolean index like valid =
adata.obs[grouping].notna(), then derive labels from adata.obs.loc[valid,
grouping].astype(str).to_numpy() and compute unique_groups from that filtered
labels; ensure subsequent masks and counts (mask, n) operate on the same
filtered index (or use the filtered AnnData slice) so grouping, X, and any
downstream arrays stay aligned with labels and min_cells_per_group checks.
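The NaN-filtering step can be sketched with plain pandas (the `tissue` column name and values are illustrative):

```python
import numpy as np
import pandas as pd

# Cells with a missing group label; both None and np.nan must be dropped
# before astype(str), otherwise they surface as a spurious "nan" group
# (or, for some dtypes, break sorted(set(...))).
obs = pd.DataFrame({"tissue": ["liver", None, "brain", "liver", np.nan]})

valid = obs["tissue"].notna()
labels = obs.loc[valid, "tissue"].astype(str).to_numpy()
unique_groups = sorted(set(labels))
```

Downstream masks and counts should then be computed on the same `valid`-filtered index so they stay aligned with `labels`.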

In `@hvantk/ptm/constraint.py`:
- Around line 147-149: The code always creates/overwrites outputs (os.makedirs
and downstream TSV/summary/plots) without checking config.overwrite; update
run_ptm_constraint() to honor config.overwrite by adding an existence guard at
the start: if config.overwrite is False and config.output_dir already contains
the expected output files (TSV files, summary.json, or plots), abort/raise an
error or return early; conversely, if config.overwrite is True, allow creating
or delete existing outputs before writing. Ensure all places that delete or
re-create config.output_dir/plots only run when config.overwrite is True and
reference run_ptm_constraint(), config.output_dir, config.overwrite, and
summary.json/TSV/plots in your checks.
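An existence guard of the suggested shape, as a sketch (`ensure_output_dir` is a hypothetical helper; `summary.json` is used as the sentinel file per the prompt):

```python
import json
import os
import tempfile

def ensure_output_dir(output_dir, overwrite):
    """Refuse to clobber existing results unless overwrite was requested.

    Hypothetical guard mirroring the check the prompt describes.
    """
    sentinel = os.path.join(output_dir, "summary.json")
    if os.path.exists(sentinel) and not overwrite:
        raise FileExistsError(
            f"{output_dir} already contains results; pass --overwrite to replace them."
        )
    os.makedirs(output_dir, exist_ok=True)

# Demo: fresh dir is fine, rerun without overwrite is blocked, with it is allowed.
demo_dir = os.path.join(tempfile.mkdtemp(), "results")
ensure_output_dir(demo_dir, overwrite=False)
with open(os.path.join(demo_dir, "summary.json"), "w") as fh:
    json.dump({}, fh)
try:
    ensure_output_dir(demo_dir, overwrite=False)
    blocked = False
except FileExistsError:
    blocked = True
ensure_output_dir(demo_dir, overwrite=True)
```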
- Around line 461-465: The qcut currently runs on variant rows so genes with
many variants skew the cut points; instead compute quartile bins using one row
per gene by selecting unique gene entries (e.g.,
df.drop_duplicates(subset="gene_key") or groupby('gene_key') and take a
representative 'tau'/'loeuf' value), run pd.qcut on that deduplicated series
with the same labels, then map/join the resulting per-gene bin assignments back
onto the full variant-level df to set "tau_quartile" (and similarly for the
LOEUF qcut block referenced later) so bin boundaries are derived from unique
genes, not variant rows.
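The per-gene deduplication before pd.qcut can be sketched as follows (column names follow the prompt; the data is made up):

```python
import pandas as pd

# Variant-level table: gene A has many variants and would otherwise
# dominate the quantile boundaries if qcut ran on variant rows.
df = pd.DataFrame({
    "gene_key": ["A", "A", "A", "B", "C", "D", "E"],
    "tau":      [0.9, 0.9, 0.9, 0.1, 0.4, 0.6, 0.8],
})

# One representative row per gene, qcut on that, then map back to variants.
per_gene = df.drop_duplicates(subset="gene_key").set_index("gene_key")["tau"]
bins = pd.qcut(per_gene, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
df["tau_quartile"] = df["gene_key"].map(bins)
```

The same drop-duplicates-then-map pattern applies to the LOEUF qcut block.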
- Around line 269-275: The code currently replaces a missing AF field with zeros
and later uses fillna(0.0), which falsely signals homozygous reference; instead,
when config.af_field is not present, raise an explicit error (e.g., throw/raise
ValueError or RuntimeError) so the pipeline fails fast and the schema issue is
fixed upstream, and remove any row-level fillna(0.0) calls so unparsable/null
AFs remain null (allowing explicit dropping or handling downstream); update the
branch that checks config.af_field (the select["af"] assignment) to raise and
remove/replace any hl.Table/MatrixTable .fillna(0.0) usage on AF to preserve
nulls.

In `@hvantk/ptm/pipeline.py`:
- Around line 209-213: The download currently streams directly into gtf_path
using requests.get and shutil.copyfileobj which can leave a truncated cache file
on failure; change the logic in the block that calls
requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file (e.g.,
using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw into
that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
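The temp-file-then-`os.replace` pattern described above could be sketched like this; `atomic_write` is a hypothetical helper name, and the demo feeds it an in-memory stream rather than `resp.raw`:

```python
import io
import os
import shutil
import tempfile

def atomic_write(dest_path, source_fileobj):
    """Stream into a temp file beside dest_path, then rename into place.

    Hypothetical helper: a failed download never leaves a truncated file at
    dest_path, because os.replace only runs after the copy completes.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(source_fileobj, tmp)
        os.replace(tmp_path, dest_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

# Demo with an in-memory "response body" instead of requests' resp.raw.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "annotation.gtf.gz")
atomic_write(demo_path, io.BytesIO(b"chr1\thavana\tgene\n"))
with open(demo_path, "rb") as fh:
    demo_content = fh.read()
leftovers = [f for f in os.listdir(demo_dir) if f.endswith(".part")]
```

Creating the temp file in the destination directory (not the system temp dir) is what keeps the final rename on one filesystem and therefore atomic.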
- Around line 490-498: After concatenating CPTAC into all_combined_path and
repointing mapped_path, also set result.mapped_tsv_path = all_combined_path so
the returned result reflects the post-merge file; update this in the branch that
calls _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the
same place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.

In `@hvantk/tables/cptac.py`:
- Around line 89-92: In cptac.py, when joining metadata_df into obs (created
from wide.index) and the later join at lines ~149-153, first compute the overlap
= obs.index.intersection(metadata_df.index) and validate it; if overlap is
empty, log a warning (or raise a ValueError) and avoid doing a blind left join
that yields all-NaN rows; if partial overlap, log how many samples matched and
then perform obs = obs.join(metadata_df, how="left") restricted to the
overlapping index (or ensure rows with no metadata are handled explicitly). Use
the same validation/logging pattern used in hvantk/tables/ucsc.py (checking
adata.obs.index.intersection(metadata_df.index) and warning on empty overlap)
and apply it to the variables obs, metadata_df, and wide.index in both join
locations.
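The overlap validation can be sketched with plain pandas; `join_metadata` is a hypothetical helper mirroring the pattern the prompt cites from hvantk/tables/ucsc.py:

```python
import pandas as pd

def join_metadata(obs, metadata_df):
    """Left-join metadata onto obs, failing loudly on zero index overlap."""
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        raise ValueError(
            "No sample IDs shared between matrix and metadata; "
            "check the sample_id_col setting."
        )
    if len(overlap) < len(obs):
        print(f"warning: metadata matched {len(overlap)}/{len(obs)} samples")
    return obs.join(metadata_df, how="left")

obs = pd.DataFrame(index=["s1", "s2", "s3"])
meta = pd.DataFrame({"tissue": ["liver", "brain"]}, index=["s1", "s2"])
joined = join_metadata(obs, meta)
matched = int(joined["tissue"].notna().sum())
```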

In `@hvantk/utils/matrix_utils.py`:
- Around line 127-139: The current label construction joins stringified fields
with "_", which corrupts group keys whose field values themselves contain
underscores and turns missing values into the literal "nan"; change the logic
in the block that builds labels (the
group_by handling and label_series creation) to build canonical group keys as
tuples or a pandas.MultiIndex from the original adata.obs columns (e.g., using
tuple(zip(...)) or pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop
rows with any missing grouping fields before downstream aggregation, and ensure
label_series uses adata.obs.index (not np.arange(adata.n_obs)); update any
callers (e.g., the AnnData mean path referenced in constraint_expression.py) to
consume tuple/MultiIndex keys or a safely encoded representation.
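One way to build collision-free group keys, as the comment suggests, is a tuple-valued series via pandas.MultiIndex.from_frame; `build_group_keys` is a hypothetical name for the sketch.

```python
import pandas as pd

def build_group_keys(obs: pd.DataFrame, group_by: list) -> pd.Series:
    """Per-observation group keys as tuples, indexed by obs.index.

    Rows with any missing grouping field are dropped before aggregation,
    and underscores inside field values can no longer collide.
    """
    sub = obs[group_by].dropna()
    keys = pd.MultiIndex.from_frame(sub)
    return pd.Series(list(keys), index=sub.index)
```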

In `@requirements.txt`:
- Around line 14-15: requirements.txt is missing runtime deps required by the
new modules (hvantk/core/converters.py, hvantk/tables/cptac.py,
hvantk/tables/ucsc.py) which import anndata; update requirements.txt to include
anndata plus the other packages declared in environment.yml (scanpy>=1.10.0,
pysam>=0.22, psutil, scikit-learn) and any version constraints from
environment.yml so a plain pip install -r requirements.txt will succeed and
match the environment.yml runtime dependencies.

---

Nitpick comments:
In @.gitignore:
- Around line 212-213: Remove the duplicate .ipynb_checkpoints entry from the
ignore list: keep only one pattern for ".ipynb_checkpoints" (either with or
without the trailing slash) and delete the redundant line that adds
".ipynb_checkpoints/" so the .gitignore contains a single canonical entry for
that pattern.

In `@docs_site/tools/ptm-constraint.md`:
- Around line 83-97: The fenced code block that starts with the directory tree
beginning "<output-dir>/" is missing a language specifier; update the opening
fence from ``` to ```text so the block becomes a ```text fenced code block
(i.e., add "text" after the three backticks) to satisfy markdown linting and
ensure consistent rendering of the directory tree.

In `@hvantk/commands/make_matrix_cli.py`:
- Around line 116-135: The CLI function mkmatrix_expression_atlas should expose
a --gene-name-column option and forward it to the internal
_build_expression_atlas_ad call; add a new parameter to
mkmatrix_expression_atlas (e.g., gene_name_column with default "Gene Name") that
maps to the CLI flag --gene-name-column, and pass that value into
_build_expression_atlas_ad via its gene_name_column argument so non-standard
Expression Atlas files work without editing code.
- Around line 222-240: Add a new Click option "--site-id-col" (arg name
site_id_col, default "SiteID", show_default=True) to the mkmatrix_cptac_phospho
command and forward that site_id_col argument into the call to
_build_cptac_phospho_ad so it passes site_id_col=site_id_col; this exposes the
build_cptac_phospho_ad site_id_col parameter (default "SiteID") to the CLI so
users can specify custom phospho site identifier column names.
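The shape of the suggested flag, sketched on a stub command; the echo body is purely illustrative, standing in for the forward to `_build_cptac_phospho_ad`.

```python
import click

@click.command()
@click.option(
    "--site-id-col",
    "site_id_col",
    default="SiteID",
    show_default=True,
    help="Column holding phospho site identifiers.",
)
def mkmatrix_cptac_phospho(site_id_col):
    # In the real CLI this value would be forwarded as
    # _build_cptac_phospho_ad(site_id_col=site_id_col).
    click.echo(site_id_col)
```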

In `@hvantk/commands/summarize_expression_cli.py`:
- Around line 163-199: Replace the direct raise SystemExit(1) in the output
existence check with a Click-style exit call for consistency: use ctx.exit(1)
(or, if the command function does not receive a ctx parameter, call
click.get_current_context().exit(1)). Update the code at the output_path
existence branch (the block that currently does click.echo(...) then raise
SystemExit(1)) to call the appropriate ctx.exit(1) instead.

In `@hvantk/data/file_utils.py`:
- Around line 462-466: The __init__ stores self._path but never uses it; either
remove the unused attribute or keep and use it for context (e.g., include
self._path in error messages/logging and in exception handling around opening
the file). Update the initializer in the class's __init__ (and any related error
handling elsewhere in this class) to either drop the assignment to self._path or
reference self._path in log/raise messages when calling open(path, "wb") and
when reporting I/O errors so the file path is not lost.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 441-446: Move the inline import of requests out of the download
method and add it to the module-level imports so the library is imported once
for the module; update the top of the file to include "import requests" and
remove the "import requests" line inside the download() block (the code using
requests remains the same: requests.get(self.zip_url, stream=True, timeout=600)
... resp.raise_for_status() ... shutil.copyfileobj(resp.raw, fout)).

In `@hvantk/tables/expression_atlas.py`:
- Line 214: The current assignment X =
df[sample_cols].values.T.astype(np.float32) will raise on non-numeric strings;
change to coerce non-numeric values to NaN first (e.g. replace the
df[sample_cols] selection with df[sample_cols].apply(lambda c: pd.to_numeric(c,
errors='coerce')) or df[sample_cols].apply(pd.to_numeric, errors='coerce')) and
then convert to float32 and transpose to produce X; adjust downstream code to
handle resulting NaNs (e.g., drop or impute) as needed.
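The coercion in isolation; `to_float_matrix` is a hypothetical wrapper around the flagged assignment.

```python
import numpy as np
import pandas as pd

def to_float_matrix(df: pd.DataFrame, sample_cols: list) -> np.ndarray:
    """Coerce non-numeric entries to NaN, then build a samples x genes float32 matrix."""
    numeric = df[sample_cols].apply(pd.to_numeric, errors="coerce")
    return numeric.values.T.astype(np.float32)
```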

In `@hvantk/tables/matrix_builders.py`:
- Around line 10-21: Remove the unused typing import and sort the exported
symbols: delete the unused "List" import from the typing import list and then
reorder the __all__ list alphabetically (e.g., "build_cptac_ad",
"build_cptac_phospho_ad", "build_expression_atlas_ad", "build_ucsc_ad") so
static analysis warnings for the unused import (List) and unsorted __all__ are
resolved; keep the existing function names (build_ucsc_ad,
build_expression_atlas_ad, build_cptac_ad, build_cptac_phospho_ad) intact.
- Around line 199-204: The code hardcodes a tab separator when reading CPTAC
files (logger.info + pd.read_csv for expression_path and metadata_path,
producing expr_df and meta_df), which breaks comma-delimited CSV input; add a
new parameter (e.g., delimiter or sep) to the enclosing function and pass it
into both pd.read_csv calls (use the default "\t" to preserve current behavior)
or alternatively update the docstring to state only TSV is supported; ensure
meta_df = meta_df.set_index(sample_id_col) still runs after reading with the
chosen delimiter.

In `@hvantk/tests/test_expression_builders_anndata.py`:
- Around line 12-24: The zip() calls in _write_expression_matrix and
_write_metadata can silently ignore length mismatches; update both to use
zip(..., strict=True) so mismatched lengths raise an error during tests — change
for gene,row in zip(genes, values) to zip(genes, values, strict=True) in
_write_expression_matrix and for cid,ct in zip(cell_ids, cell_types) to
zip(cell_ids, cell_types, strict=True) in _write_metadata.

In `@hvantk/tests/test_ptm_constraint.py`:
- Around line 90-108: The test test_constraint_config_validate uses a hardcoded
output_dir "/tmp/out"; change the test to accept the pytest tmp_path fixture,
pass output_dir=str(tmp_path) when constructing PTMConstraintConfig, and update
the test function signature to def test_constraint_config_validate(tmp_path): so
the config uses an isolated temporary directory rather than a global /tmp path
while keeping all other validation assertions unchanged.

In `@hvantk/tests/test_ptm.py`:
- Around line 183-186: Move the local "import gzip" out of the test body and add
it to the module-level imports at the top of hvantk/tests/test_ptm.py, then
remove the inline import so the block that opens output_bgz with gzip.open(...)
and constructs csv.DictReader(...) uses the top-level gzip; update any
linters/import-grouping if needed.

In `@hvantk/visualization/expression/anndata.py`:
- Around line 35-38: The current code unconditionally densifies adata.X (X =
X.toarray()) which can OOM for large sparse matrices; instead, when
sp.issparse(X) detect large matrices by checking X.nnz or X.shape and apply
sampling: use X.data (the nonzero values) and randomly sample up to a
MAX_SAMPLES limit, and if you need zeros represented also randomly sample
indices from the implicit zeros to approximate distribution; then build values
from the sampled nonzero and sampled-zero values rather than calling toarray();
update the block around X, sp.issparse(X), and values to implement this
conditional sampling strategy.
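A sketch of the conditional sampling strategy; the `max_samples` cap and the proportional split between nonzeros and implicit zeros are illustrative choices, not the module's current behavior.

```python
import numpy as np
import scipy.sparse as sp

def sample_values(X, max_samples=100_000, rng=None):
    """Return a 1-D sample of entry values for plotting.

    Sparse inputs are never densified: nonzeros come from X.data and the
    implicit zeros are represented by an appropriately sized block of zeros,
    so the sampled distribution approximates the full matrix.
    """
    rng = np.random.default_rng(rng)
    if not sp.issparse(X):
        return np.asarray(X).ravel()
    total = X.shape[0] * X.shape[1]
    nnz_vals = np.asarray(X.data)
    if total <= max_samples:
        # Small matrix: represent every entry, including the implicit zeros.
        return np.concatenate([nnz_vals, np.zeros(total - nnz_vals.size)])
    # Large matrix: sample nonzeros in proportion to their share of entries.
    n_nnz = min(nnz_vals.size, max(1, int(max_samples * nnz_vals.size / total)))
    sampled = rng.choice(nnz_vals, size=n_nnz, replace=False) if n_nnz else np.empty(0)
    return np.concatenate([sampled, np.zeros(max_samples - n_nnz)])
```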

In `@pyproject.toml`:
- Around line 31-32: Update pyproject.toml to make scanpy optional by removing
it from the main dependencies and adding an extras group (e.g., extras = {
markers = ["scanpy>=1.10.0"] }) so users can install hvantk[markers] when they
need marker analysis; then in summarize_expression_cli.py, wrap the local import
of scanpy (used in the marker-related function) in a try/except ImportError and
raise a clear ImportError message pointing users to install the markers extra
(e.g., "scanpy is required for marker analysis. Install with 'pip install
hvantk[markers]'") so the runtime check fails fast with guidance.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f90c4031-e371-48e9-b494-721fe6941d4f

📥 Commits

Reviewing files that changed from the base of the PR and between bd48823 and 03d50b5.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (54)
  • .gitignore
  • docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
  • docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
  • docs_site/architecture.md
  • docs_site/guide/usage.md
  • docs_site/tools/ptm-constraint.md
  • environment.yml
  • hvantk/commands/make_matrix_cli.py
  • hvantk/commands/ptm_cli.py
  • hvantk/commands/summarize_expression_cli.py
  • hvantk/core/anndata_utils.py
  • hvantk/core/converters.py
  • hvantk/core/hail_context.py
  • hvantk/data/file_utils.py
  • hvantk/datasets/cptac_phospho_datasets.py
  • hvantk/datasets/peptideatlas_phospho_datasets.py
  • hvantk/ptm/__init__.py
  • hvantk/ptm/annotate.py
  • hvantk/ptm/constraint.py
  • hvantk/ptm/constraint_expression.py
  • hvantk/ptm/constraint_plots.py
  • hvantk/ptm/constraint_report.py
  • hvantk/ptm/mapper.py
  • hvantk/ptm/pipeline.py
  • hvantk/tables/cptac.py
  • hvantk/tables/expression_atlas.py
  • hvantk/tables/matrix_builders.py
  • hvantk/tables/registry.py
  • hvantk/tables/table_builders.py
  • hvantk/tables/ucsc.py
  • hvantk/tests/test_anndata_utils.py
  • hvantk/tests/test_cli_mkmatrix.py
  • hvantk/tests/test_converters.py
  • hvantk/tests/test_cptac.py
  • hvantk/tests/test_cptac_phospho.py
  • hvantk/tests/test_expression_builders_anndata.py
  • hvantk/tests/test_expression_tables_hail.py
  • hvantk/tests/test_matrix_utils.py
  • hvantk/tests/test_matrix_utils_anndata.py
  • hvantk/tests/test_peptideatlas_phospho.py
  • hvantk/tests/test_ptm.py
  • hvantk/tests/test_ptm_constraint.py
  • hvantk/tests/test_summarize_expression.py
  • hvantk/tests/test_wilcoxon_hail.py
  • hvantk/utils/gene_sets.py
  • hvantk/utils/matrix_utils.py
  • hvantk/utils/tissue_specificity.py
  • hvantk/utils/wilcoxon_hail.py
  • hvantk/visualization/__init__.py
  • hvantk/visualization/expression/__init__.py
  • hvantk/visualization/expression/anndata.py
  • hvantk/visualization/expression/hail.py
  • pyproject.toml
  • requirements.txt
💤 Files with no reviewable changes (11)
  • hvantk/tests/test_expression_tables_hail.py
  • docs/superpowers/specs/2026-03-31-alphagenome-streamer-design.md
  • hvantk/tests/test_cli_mkmatrix.py
  • hvantk/tests/test_matrix_utils.py
  • hvantk/tests/test_cptac.py
  • hvantk/tests/test_wilcoxon_hail.py
  • hvantk/visualization/expression/hail.py
  • hvantk/tests/test_summarize_expression.py
  • hvantk/tests/test_cptac_phospho.py
  • docs/superpowers/plans/2026-03-31-alphagenome-streamer.md
  • hvantk/utils/wilcoxon_hail.py

Comment on lines +91 to +98
if pd.api.types.is_numeric_dtype(series):
    summary[col] = {
        "dtype": "numeric",
        "min": float(np.nanmin(series.values)),
        "max": float(np.nanmax(series.values)),
        "mean": float(np.nanmean(series.values)),
        "n_missing": n_missing,
    }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the file
find . -name "anndata_utils.py" -type f

Repository: bigbio/hvantk

Length of output: 87


🏁 Script executed:

# Read the full function context around lines 91-98
head -120 hvantk/core/anndata_utils.py | tail -50

Repository: bigbio/hvantk

Length of output: 1944


🏁 Script executed:

# Also check imports to understand what's being used
head -20 hvantk/core/anndata_utils.py

Repository: bigbio/hvantk

Length of output: 614


🌐 Web query:

numpy nanmin nanmax pandas extension array Float64 Int64 behavior

💡 Result:

NumPy's nanmin and nanmax ignore NaN values and return the min/max among non-NaN elements (or NaN if all elements are NaN); regular min/max propagate NaN, returning NaN if any is present. Pandas nullable extension arrays (Int64, Float64, etc.) use pd.NA (not np.nan) as the missing indicator, backed by data plus a mask. Their .min()/.max() (default skipna=True) skip pd.NA and compute on the valid data; with skipna=False they return pd.NA if any value is missing. Direct use of np.nanmin/np.nanmax on pandas extension arrays is not supported and typically fails with a TypeError, as they are not NumPy arrays and lack full NumPy ufunc compatibility. ExtensionArrays implement a custom _reduce("min"/"max") that handles missing values via the mask rather than delegating to NumPy nanmin/nanmax. Example behaviors (from docs/source): NumPy: np.nanmin(np.array([1, np.nan, 3])) → 1.0; pandas Int64: pd.array([1, None, 3], dtype="Int64").min() → 1 (skips pd.NA); np.nanmin(pd.array([1, None, 3], dtype="Int64")) → TypeError (incompatible). For interoperability, convert to NumPy first, e.g. arr.to_numpy(dtype=float, na_value=np.nan), then apply np.nanmin; note this maps pd.NA to np.nan. Sources confirm via official docs (numpy.nanmin docs, pandas ExtensionArray._reduce, nullable dtype guides, and the IntegerArray/FloatingArray source, which inherit masked reductions from NumericArray).


🌐 Web query:

pandas to_numeric to_numpy dtype conversion numeric series

💡 Result:

In pandas, pd.to_numeric converts a Series (typically object dtype with strings or mixed types) to a numeric Series. The resulting dtype is float64 if the data requires floats or contains NaN (e.g. from errors='coerce'), or int64 if all values are integers with no NaN. Use downcast='integer' or 'signed' for smaller int dtypes like int8 where possible, or dtype_backend='numpy_nullable' for nullable Int64 (capital I). Examples from the docs: s = pd.Series(['1.0', '2', -3]); pd.to_numeric(s) → dtype float64; pd.to_numeric(s, downcast='signed') → dtype int8; s_na = pd.Series(['1', '2', 'apple']); pd.to_numeric(s_na, errors='coerce') → float64 with NaN. Calling .to_numpy() on the resulting numeric Series returns a NumPy ndarray with matching dtype: pandas float64 → np.float64, int64 → np.int64, smaller dtypes like int8 → np.int8. For nullable dtypes (e.g. Int64 from dtype_backend), .to_numpy() coerces to an object-dtype ndarray unless dtype/na_value are passed, with missing values as np.nan or None. For standard numeric Series without NaN, the native NumPy dtype is preserved: pd.Series([1, 2, 3]).to_numpy() → array([1, 2, 3], dtype=int64); pd.Series([1.0, 2.0]).to_numpy() → array([1., 2.], dtype=float64). No dtype loss occurs for non-nullable numeric dtypes; .to_numpy() provides a direct view of the underlying NumPy data (copy=False by default).


Convert numeric columns to standard NumPy float array before reduction, or use pandas native aggregation methods.

If adata.obs[col] uses a pandas nullable numeric dtype (Int64, Float64, etc.), then series.values returns an ExtensionArray that is incompatible with np.nanmin/np.nanmax/np.nanmean, causing a TypeError at runtime. Use .astype(float).values or .to_numpy(dtype=float, na_value=np.nan) to convert before reducing, or call pandas native methods directly (series.min(), series.max(), series.mean() with skipna=True) which handle both regular and nullable dtypes correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/anndata_utils.py` around lines 91 - 98, The numeric-aggregation
code uses np.nanmin/np.nanmax/np.nanmean on series.values which can be an
ExtensionArray for pandas nullable dtypes; convert the Series to a standard
float NumPy array or use pandas' native aggregations to avoid TypeError. Update
the logic around the variable 'series' (the Series from adata.obs[col]) so that
before reducing you either call series.to_numpy(dtype=float, na_value=np.nan) or
series.astype(float).values, or replace the np.nan* calls with
series.min(skipna=True), series.max(skipna=True), and series.mean(skipna=True),
and then store those results back into summary[col] (same dict keys
"min","max","mean").

Comment thread hvantk/core/converters.py
Comment on lines +55 to +59
# Resolve keys
if row_key is None:
    row_key = list(mt.row_key)[0]
if col_key is None:
    col_key = list(mt.col_key)[0]

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "converters.py" | head -20

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

git ls-files hvantk/core/ | head -20

Repository: bigbio/hvantk

Length of output: 351


🏁 Script executed:

cd hvantk/core && find . -name "*.py" -type f | head -20

Repository: bigbio/hvantk

Length of output: 231


🏁 Script executed:

wc -l hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 86


🏁 Script executed:

cat -n hvantk/core/converters.py | head -100

Repository: bigbio/hvantk

Length of output: 3473


🏁 Script executed:

cat -n hvantk/core/converters.py | tail -80

Repository: bigbio/hvantk

Length of output: 3069


🏁 Script executed:

grep -r "row_key\|col_key" hvantk/ --include="*.py" | grep -E "test|spec" | head -20

Repository: bigbio/hvantk

Length of output: 423


🏁 Script executed:

cat -n hvantk/tests/test_converters.py

Repository: bigbio/hvantk

Length of output: 4046


🏁 Script executed:

cd hvantk && find . -name "*.py" -type f -exec grep -l "to_matrix_table\|row_key\|col_key" {} \; | head -15

Repository: bigbio/hvantk

Length of output: 369


🏁 Script executed:

find . -name "*.py" -type f -exec grep -l "composite.*key\|multiple.*key" {} \; | head -10

Repository: bigbio/hvantk

Length of output: 125


🏁 Script executed:

rg "key_rows_by|key_cols_by" -A 3 -B 3 hvantk/ --include="*.py" | head -50

Repository: bigbio/hvantk

Length of output: 498


🏁 Script executed:

rg "row_key.*=.*\[" hvantk/ --include="*.py" | head -20

Repository: bigbio/hvantk

Length of output: 498


🏁 Script executed:

rg "row_key.*=.*\[" hvantk/

Repository: bigbio/hvantk

Length of output: 244


🏁 Script executed:

rg "to_matrix_table" hvantk/ -A 5

Repository: bigbio/hvantk

Length of output: 371


🏁 Script executed:

cat -n hvantk/core/converters.py | sed -n '136,141p'

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

rg "composite.*key|multiple.*key" hvantk/

Repository: bigbio/hvantk

Length of output: 238


🏁 Script executed:

grep -r "IndexError\|ValueError" hvantk/tests/test_converters.py

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

cat -n hvantk/enrichex/burden.py | grep -A 5 -B 5 "row_key_name"

Repository: bigbio/hvantk

Length of output: 1325


🏁 Script executed:

cat -n hvantk/tests/test_table_utils.py | grep -A 20 "Create a table with multiple key fields"

Repository: bigbio/hvantk

Length of output: 917


🏁 Script executed:

rg "key_rows_by|key_cols_by" hvantk/tests/ -A 2 -B 2

Repository: bigbio/hvantk

Length of output: 7781


🏁 Script executed:

rg "composite" hvantk/tests/ -A 3 -B 3

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

cat -n hvantk/tests/test_converters.py | tail -30

Repository: bigbio/hvantk

Length of output: 1453


🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '1,40p'

Repository: bigbio/hvantk

Length of output: 1393


🏁 Script executed:

rg "hail_mt_to_anndata" hvantk/ -B 2 -A 5

Repository: bigbio/hvantk

Length of output: 5428


🏁 Script executed:

python3 << 'EOF'
# Check if we can examine Hail's type system
# Let's look for any documentation or usage patterns

import subprocess
result = subprocess.run(['rg', 'row_key.*len|len.*row_key', 'hvantk/'], 
                       capture_output=True, text=True)
print("Length checks on row_key/col_key:")
print(result.stdout if result.stdout else "No matches found")

result = subprocess.run(['rg', 'range_matrix_table', 'hvantk/tests/', '-A', '10'], 
                       capture_output=True, text=True)
print("\nrange_matrix_table usage:")
print(result.stdout[:2000] if result.stdout else "No matches")
EOF

Repository: bigbio/hvantk

Length of output: 2135


🏁 Script executed:

rg "row_key|col_key" hvantk/tests/test_converters.py -B 5 -A 5

Repository: bigbio/hvantk

Length of output: 986


🏁 Script executed:

python3 << 'EOF'
# Let's check if there's a way to create an unkeyed MatrixTable or what happens
# We can look for any usage of unkeyed tables

import subprocess

# Search for any mention of "unkey" or empty keys
result = subprocess.run(['rg', 'unkey|drop.*key', 'hvantk/'], 
                       capture_output=True, text=True)
print("Unkey operations:")
print(result.stdout if result.stdout else "No matches found")

# Check for any error handling around keys
result = subprocess.run(['rg', 'KeyError|IndexError|ValueError.*key', 'hvantk/'], 
                       capture_output=True, text=True)
print("\nError handling for keys:")
print(result.stdout if result.stdout else "No matches found")
EOF

Repository: bigbio/hvantk

Length of output: 3166


🏁 Script executed:

rg "simple_mt" hvantk/tests/test_converters.py -B 10 | head -40

Repository: bigbio/hvantk

Length of output: 1260


🏁 Script executed:

cat -n hvantk/enrichex/burden.py | sed -n '1545,1560p'

Repository: bigbio/hvantk

Length of output: 823


🏁 Script executed:

cat -n hvantk/ptm/constraint_expression.py | grep -A 3 -B 3 "no row_key"

Repository: bigbio/hvantk

Length of output: 350


🏁 Script executed:

cat -n hvantk/tests/ancestry/conftest.py | sed -n '45,65p'

Repository: bigbio/hvantk

Length of output: 858


🏁 Script executed:

rg "key_rows_by.*\[|key_cols_by.*\[" hvantk/

Repository: bigbio/hvantk

Length of output: 122


Add explicit validation for single row and column keys.

Lines 57–59 silently use the first key field without checking cardinality. This fails with IndexError on an unkeyed MatrixTable and silently drops key components on a composite-key MT. Defensive checks for this already exist elsewhere in the codebase (e.g., constraint_expression.py). Add explicit validation here.

Suggested guard
     if row_key is None:
-        row_key = list(mt.row_key)[0]
+        if len(mt.row_key) != 1:
+            raise ValueError(
+                "hail_mt_to_anndata requires exactly one row key."
+            )
+        row_key = next(iter(mt.row_key))
     if col_key is None:
-        col_key = list(mt.col_key)[0]
+        if len(mt.col_key) != 1:
+            raise ValueError(
+                "hail_mt_to_anndata requires exactly one column key."
+            )
+        col_key = next(iter(mt.col_key))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    # Resolve keys
    if row_key is None:
        row_key = list(mt.row_key)[0]
    if col_key is None:
        col_key = list(mt.col_key)[0]

After:

    # Resolve keys
    if row_key is None:
        if len(mt.row_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one row key."
            )
        row_key = next(iter(mt.row_key))
    if col_key is None:
        if len(mt.col_key) != 1:
            raise ValueError(
                "hail_mt_to_anndata requires exactly one column key."
            )
        col_key = next(iter(mt.col_key))
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 57-57: Prefer next(iter(mt.row_key)) over single element slice

Replace with next(iter(mt.row_key))

(RUF015)


[warning] 59-59: Prefer next(iter(mt.col_key)) over single element slice

Replace with next(iter(mt.col_key))

(RUF015)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/converters.py` around lines 55 - 59, The resolver currently takes
the first element of mt.row_key/mt.col_key without validating cardinality;
update the logic around row_key and col_key so you explicitly validate
mt.row_key and mt.col_key: if len(mt.row_key) == 0 or len(mt.col_key) == 0 raise
a clear ValueError indicating the MatrixTable is unkeyed, and if len(...) > 1
raise a ValueError indicating a composite key and require the caller to supply
an explicit row_key/col_key; only when len(...) == 1 set row_key =
list(mt.row_key)[0] (and similarly for col_key). Ensure the error messages
reference row_key/col_key and MatrixTable so callers can fix input.

Comment thread hvantk/core/converters.py
Comment on lines +69 to +82
row_df = mt.rows().to_pandas()
row_df = row_df.set_index(row_key)

col_df = mt.cols().to_pandas()
col_df = col_df.set_index(col_key)

# Collect entries and pivot to dense matrix
entries_df = mt.select_entries(mt[entry_field]).entries().to_pandas()
pivot = entries_df.pivot(index=col_key, columns=row_key, values=entry_field)

# Reindex to match collected row/col order
pivot = pivot.reindex(index=col_df.index, columns=row_df.index)

X = pivot.values.astype(np.float32)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name converters.py -type f | head -20

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

wc -l ./hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 88


🏁 Script executed:

cat -n ./hvantk/core/converters.py

Repository: bigbio/hvantk

Length of output: 5747


🏁 Script executed:

rg -n "hail_mt_to_anndata|anndata_to_hail_mt" --type py -B 2 -A 2 | head -50

Repository: bigbio/hvantk

Length of output: 3560


🏁 Script executed:

rg -n "size|memory|OOM|large|limit|guard" hvantk/core/converters.py -i

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

rg -n "stream|block|chunk|slice" hvantk/core/ -i --type py

Repository: bigbio/hvantk

Length of output: 950


Both conversion paths unconditionally materialize the entire matrix on the driver.

Lines 69, 72, and 76 in hail_mt_to_anndata call to_pandas() three times (rows, cols, and entries), then pivot the entire entries DataFrame into a dense matrix. Lines 118 and 133 in anndata_to_hail_mt convert sparse matrices to dense via toarray() and then ravel() the entire matrix into memory. Neither function includes size validation, memory guards, or a streaming alternative. This design will cause OOM failures on realistic expression matrices (e.g., 100k cells × 30k genes).

Add an explicit size guard (e.g., max acceptable obs×var product) with a clear error message, or implement a blockwise/streamed conversion path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/core/converters.py` around lines 69 - 82, hail_mt_to_anndata and
anndata_to_hail_mt currently materialize the entire matrix into memory (via
row_df/col_df/entries_df.to_pandas(), pivot, and sparse.toarray()) which will
OOM for large matrices; add an explicit size guard that computes n_obs * n_vars
(from mt.rows/mt.cols counts or adata.n_obs/adata.n_vars) and compare against a
configurable MAX_MATRIX_SIZE threshold (env var or constant), and raise a clear
ValueError with guidance (e.g., suggest using a blockwise/streamed conversion or
increasing the threshold) if exceeded; alternatively implement a blockwise
conversion path (stream entries by column/row chunks to avoid pivoting
everything at once) and only fall back to full-materialization when under the
size limit — update logic around symbols row_df, col_df, entries_df, pivot, X in
hail_mt_to_anndata and where sparse.toarray()/ravel() is called in
anndata_to_hail_mt.
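A minimal size guard along these lines; the HVANTK_MAX_MATRIX_SIZE variable name and default threshold are assumptions for the sketch, not existing configuration.

```python
import os

DEFAULT_MAX_ENTRIES = 50_000_000  # ~200 MB of float32; tune per deployment

def check_matrix_size(n_obs, n_vars, max_entries=None):
    """Raise early if a dense obs x var matrix would be too large to materialize."""
    if max_entries is None:
        max_entries = int(os.environ.get("HVANTK_MAX_MATRIX_SIZE", DEFAULT_MAX_ENTRIES))
    if n_obs * n_vars > max_entries:
        raise ValueError(
            f"Dense conversion of {n_obs} x {n_vars} = {n_obs * n_vars} entries "
            f"exceeds the limit of {max_entries}; use a blockwise conversion "
            f"or raise HVANTK_MAX_MATRIX_SIZE."
        )
```

Calling this at the top of both conversion functions turns a silent OOM into an actionable error.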

Comment on lines +304 to +312
def _get_phospho(tissue_type="both", source="umich"):
    """Try *source* first; fall back to default if unavailable."""
    try:
        return ds.get_phosphoproteomics(source=source, tissue_type=tissue_type)
    except Exception:
        logger.info(
            "Source '%s' unavailable for %s, falling back to default", source, ct
        )
        return ds.get_phosphoproteomics(tissue_type=tissue_type)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd "cptac_phospho_datasets.py" -type f

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

# Get the file size and examine the relevant sections
wc -l hvantk/datasets/cptac_phospho_datasets.py

Repository: bigbio/hvantk

Length of output: 102


🏁 Script executed:

# Read the full context around the flagged lines (300-350) to understand the function and all call sites
sed -n '295,355p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 3108


🏁 Script executed:

# Also look for the `_get_phospho` function definition to understand its scope and behavior
sed -n '300,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 2657


🏁 Script executed:

# Look at imports and understand the CPTAC library usage
sed -n '1,50p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 1687


🏁 Script executed:

# Search for any configuration or documentation about CPTAC sources
rg "get_phosphoproteomics|get_clinical" hvantk/datasets/cptac_phospho_datasets.py -B 2 -A 2

Repository: bigbio/hvantk

Length of output: 862


🏁 Script executed:

# Check if there's any CPTAC documentation or type hints in the codebase
rg "source.*umich|source.*mssm" hvantk/ -t py

Repository: bigbio/hvantk

Length of output: 259


🏁 Script executed:

# Look for the _load_cptac_dataset function to understand the library's behavior
rg "_load_cptac_dataset" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 10

Repository: bigbio/hvantk

Length of output: 1315


🏁 Script executed:

# Look for test files related to CPTAC
fd "test.*cptac|cptac.*test" -type f hvantk/

Repository: bigbio/hvantk

Length of output: 227


🏁 Script executed:

# Check requirements or setup files for cptac package version
fd "requirements|setup|pyproject" -type f | xargs rg "cptac" 2>/dev/null

Repository: bigbio/hvantk

Length of output: 10473


🏁 Script executed:

# Search for any documentation or comments about source handling
rg "source.*mssm|source.*umich" hvantk/ -B 3 -A 3

Repository: bigbio/hvantk

Length of output: 1319


🏁 Script executed:

# Check the full download method to understand the complete context
sed -n '260,350p' hvantk/datasets/cptac_phospho_datasets.py | cat -n

Repository: bigbio/hvantk

Length of output: 4790


🏁 Script executed:

# Search for any imports or exception handling specific to cptac
rg "cptac\.|import cptac" hvantk/datasets/cptac_phospho_datasets.py -B 5 -A 5

Repository: bigbio/hvantk

Length of output: 396


🏁 Script executed:

# Check if there's any test coverage for this code path
fd "test" -type f hvantk/ | xargs rg "CPTACPhosphoDataset|_get_phospho|source.*umich" 2>/dev/null | head -20

Repository: bigbio/hvantk

Length of output: 1023


🏁 Script executed:

# Look for any GitHub issues or comments about CPTAC source handling
rg "unavailable|fallback|source.*mssm" hvantk/ -i

Repository: bigbio/hvantk

Length of output: 2857


🏁 Script executed:

# Check the actual CPTAC library behavior by looking at package docs or any local type hints
rg "get_phosphoproteomics|get_clinical" hvantk/ -A 3 -B 3

Repository: bigbio/hvantk

Length of output: 1990


Determine and reuse a single CPTAC source instead of falling back per call.

Each _get_phospho call (lines 316, 324, 339) is evaluated independently, so one failed "umich" lookup leaves the tumor TSV, normal TSV, and matrix export backed by different sources. Both fallback blocks also catch every Exception, silently converting unrelated loader/network failures into source switches. Determine the supported source once and reuse it for all phosphoproteomics reads; apply the same narrowing to the clinical fallback (lines 351–354).

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 308-308: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/cptac_phospho_datasets.py` around lines 304 - 312, The
per-call fallback in _get_phospho causes inconsistent sources and overly-broad
exception swallowing; instead, probe and pick a single working source once
(e.g., at function/class/module init) by calling
ds.get_phosphoproteomics(source="umich", tissue_type="both") and if that raises
only the specific "source unavailable" error switch to the default and store the
chosen_source in a variable; then have _get_phospho use that chosen_source (call
ds.get_phosphoproteomics(source=chosen_source, tissue_type=...)) so all reads
(tumor TSV, normal TSV, matrix export) use the same source, and apply the same
single-source selection and narrow exception handling approach to the clinical
fallback that uses ct/clinical loader so you don’t catch all Exception types.
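The single-source selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `SourceUnavailableError`, `choose_source`, and `probe` are hypothetical stand-ins, since the exact exception type the cptac library raises for a missing source needs to be confirmed against its documentation.

```python
class SourceUnavailableError(Exception):
    """Stand-in for the narrow cptac error raised when a source is missing."""


def choose_source(probe, preferred="umich", fallback="mssm"):
    """Probe the preferred source once; switch only on the specific error.

    `probe(source)` attempts a load for the given source and raises
    SourceUnavailableError if, and only if, that source is unavailable.
    Unrelated loader/network failures propagate instead of being swallowed.
    """
    try:
        probe(preferred)
        return preferred
    except SourceUnavailableError:
        return fallback


# All subsequent phosphoproteomics reads reuse the one chosen source, so the
# tumor TSV, normal TSV, and matrix export can never end up backed by
# different sources.
```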

Comment on lines +138 to +141
if ext in {".parquet", ".pq"}:
    df = pd.read_parquet(path)
elif ext in {".pkl", ".pickle"}:
    df = pd.read_pickle(path)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd -a constraint_expression.py

Repository: bigbio/hvantk

Length of output: 111


🏁 Script executed:

# Once we have the file, let's look at the code around lines 138-141
cat -n hvantk/ptm/constraint_expression.py | head -160 | tail -40

Repository: bigbio/hvantk

Length of output: 1634


🏁 Script executed:

# Let's also get a broader view of the function containing these lines
cat -n hvantk/ptm/constraint_expression.py | head -200 | tail -100

Repository: bigbio/hvantk

Length of output: 4086


🏁 Script executed:

# Find all calls to _load_from_tabular to understand the call chain
rg "_load_from_tabular" -n hvantk/ptm/constraint_expression.py

Repository: bigbio/hvantk

Length of output: 152


🏁 Script executed:

# Search for CLI entry points or argparse definitions in the ptm module
fd -a "\.py$" hvantk/ptm/ | head -20

Repository: bigbio/hvantk

Length of output: 585


🏁 Script executed:

# Look for main entry points or CLI handlers
rg -n "argparse|click|typer|def main" hvantk/ptm/

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Check if there's a CLI module or setup.py that shows how paths are passed
rg -n "_load_from_tabular" --type py

Repository: bigbio/hvantk

Length of output: 224


🏁 Script executed:

# Look at context around line 72 where _load_from_tabular is called
cat -n hvantk/ptm/constraint_expression.py | sed -n '50,90p'

Repository: bigbio/hvantk

Length of output: 1800


🏁 Script executed:

# Look at the full function definition starting before line 72
cat -n hvantk/ptm/constraint_expression.py | sed -n '35,85p'

Repository: bigbio/hvantk

Length of output: 2292


🏁 Script executed:

# Get the function signature and name
cat -n hvantk/ptm/constraint_expression.py | sed -n '1,40p'

Repository: bigbio/hvantk

Length of output: 1371


🏁 Script executed:

# Search for all calls to this function across the codebase
rg "load_expression_source" --type py -n

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Search for calls to load_gene_by_group_matrix
rg "load_gene_by_group_matrix" --type py -n

Repository: bigbio/hvantk

Length of output: 986


🏁 Script executed:

# Also check for CLI/entry point files (main, CLI runner, etc.)
rg -n "def main|if __name__" hvantk/ --type py | head -20

Repository: bigbio/hvantk

Length of output: 1364


🏁 Script executed:

# Look for any argument parsing or click/argparse decorators related to path arguments
rg -n "click|argparse|typer|@app\." hvantk/ --type py | grep -i "path\|file\|input" | head -20

Repository: bigbio/hvantk

Length of output: 1857


🏁 Script executed:

# Let's examine constraint.py around line 158 to see how load_gene_by_group_matrix is called
cat -n hvantk/ptm/constraint.py | sed -n '145,175p'

Repository: bigbio/hvantk

Length of output: 1272


🏁 Script executed:

# Check the function containing this call - look further back
cat -n hvantk/ptm/constraint.py | sed -n '1,50p'

Repository: bigbio/hvantk

Length of output: 1708


🏁 Script executed:

# Search for CLI-related functions in constraint.py that might call this function
rg -n "def " hvantk/ptm/constraint.py | head -20

Repository: bigbio/hvantk

Length of output: 822


🏁 Script executed:

# Search for calls to run_ptm_constraint
rg "run_ptm_constraint" --type py -n

Repository: bigbio/hvantk

Length of output: 729


🏁 Script executed:

# Check if there's a PTM-related CLI entry in the main hvantk.py or other CLI modules
cat -n hvantk/hvantk.py | head -120

Repository: bigbio/hvantk

Length of output: 4062


🏁 Script executed:

# Look for ptm command/subcommand in the CLI
rg -n "ptm\|constraint" hvantk/hvantk.py hvantk/commands/ --type py | head -30

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Examine ptm_cli.py around line 656 to see how the path is supplied to run_ptm_constraint
cat -n hvantk/commands/ptm_cli.py | sed -n '640,700p'

Repository: bigbio/hvantk

Length of output: 2420


🏁 Script executed:

# Look at the ptm command/subcommand definitions to see where expression_path comes from
cat -n hvantk/commands/ptm_cli.py | sed -n '1,100p'

Repository: bigbio/hvantk

Length of output: 3934


🏁 Script executed:

# Search for where expression_path is defined as a CLI option
rg -n "expression_path\|expression-path" hvantk/commands/ptm_cli.py | head -20

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

# Search for the constraint subcommand definition
rg -n "constraint" hvantk/commands/ptm_cli.py -A 5 -B 5 | head -100

Repository: bigbio/hvantk

Length of output: 3440


🏁 Script executed:

# Search for `@ptm_group.command` or constraint in the file
rg -n "@ptm_group.command|def constraint" hvantk/commands/ptm_cli.py

Repository: bigbio/hvantk

Length of output: 305


🏁 Script executed:

# Now let's look at the --expression-path option definition
cat -n hvantk/commands/ptm_cli.py | sed -n '525,605p'

Repository: bigbio/hvantk

Length of output: 3103


Remove pickle support from this user-facing loader.

The --expression-path CLI option accepts .pkl/.pickle files. pd.read_pickle() executes pickle payloads during deserialization, turning this into an arbitrary-code-execution sink. Remove pickle support entirely, or gate it behind an explicit unsafe flag.

🧰 Tools
🪛 Ruff (0.15.9)

[error] 141-141: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/constraint_expression.py` around lines 138 - 141, The loader
currently accepts .pkl/.pickle and calls pd.read_pickle (variable ext with
branches creating df), which is unsafe; remove the pickle branch entirely and
ensure the CLI/validation for --expression-path rejects .pkl/.pickle (or only
allow .parquet/.pq), so the code only calls pd.read_parquet for supported
extensions; update any help/validation code that lists accepted extensions to
exclude pickle and add an explicit unsafe flag if you want to reintroduce pickle
behind opt-in.

Comment thread hvantk/ptm/pipeline.py
Comment on lines 209 to +213
  logger.info("Downloading Ensembl GTF to %s...", gtf_path)
- urllib.request.urlretrieve(ENSEMBL_GTF_URL, gtf_path)  # nosec B310
+ with requests.get(ENSEMBL_GTF_URL, stream=True, timeout=600) as resp:
+     resp.raise_for_status()
+     with open(gtf_path, "wb") as fout:
+         shutil.copyfileobj(resp.raw, fout)

⚠️ Potential issue | 🟠 Major

Write the GTF atomically instead of streaming into the cache path.

If the request or copy fails midway, gtf_path is left truncated and the next run will still treat it as a valid cached file because the cache check only looks at os.path.exists. Download to a temp file and os.replace() it only after the copy finishes successfully.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 209 - 213, The download currently
streams directly into gtf_path using requests.get and shutil.copyfileobj which
can leave a truncated cache file on failure; change the logic in the block that
calls requests.get(ENSEMBL_GTF_URL, ...) so that you write to a temporary file
(e.g., using tempfile.NamedTemporaryFile or tempfile.mkstemp) and copy resp.raw
into that temp file, then atomically move the completed temp into gtf_path with
os.replace() only after the copy and resp.raise_for_status() succeed; ensure the
temp file is cleaned up on exceptions and any open files/streams (resp, temp)
are properly closed.
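The temp-file-then-`os.replace()` flow can be sketched as below. `atomic_download` is a hypothetical helper; it factors the transfer out as a callback so the atomicity logic is testable without a network — in the real pipeline the callback would wrap `requests.get(..., stream=True)` and `shutil.copyfileobj(resp.raw, fout)` as in the diff above.

```python
import os
import tempfile


def atomic_download(dest_path, write_body):
    """Stream into a temp file, then publish it atomically with os.replace().

    `write_body(fileobj)` must raise on any transfer error. The temp file is
    created in the destination directory so os.replace() stays on one
    filesystem; on failure the partial file is removed, so a half-written
    download can never be mistaken for a valid cached file.
    """
    dirname = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fout:
            write_body(fout)
        os.replace(tmp_path, dest_path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```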

Comment thread hvantk/ptm/pipeline.py
Comment on lines +490 to 498
# Use a distinct filename to avoid truncating mapped_path when it
# points to ptm_sites_combined.tsv.bgz from step 4b.
all_combined_path = os.path.join(
config.output_dir, "ptm_sites_all_combined.tsv.bgz"
)
_concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)

mapped_path = all_combined_path
result.n_total += cptac_result.n_total

⚠️ Potential issue | 🟠 Major

Update the returned mapped path after the CPTAC merge.

Line 497 repoints mapped_path, but result.mapped_tsv_path is never updated in this branch. When cptac_tsv is present, callers get the pre-CPTAC file even though create_ptm_sites_tb() consumes ptm_sites_all_combined.tsv.bgz.

Suggested fix
         _concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path)

         mapped_path = all_combined_path
+        result.mapped_tsv_path = all_combined_path
         result.n_total += cptac_result.n_total
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 490 - 498, After concatenating CPTAC
into all_combined_path and repointing mapped_path, also set
result.mapped_tsv_path = all_combined_path so the returned result reflects the
post-merge file; update this in the branch that calls
_concat_bgz_tsvs([mapped_path, cptac_mapped_path], all_combined_path) (the same
place you update mapped_path and result.n_total), ensuring callers of
create_ptm_sites_tb() receive the updated mapped_tsv_path.

Comment thread hvantk/tables/cptac.py
Comment on lines +89 to 92
obs = pd.DataFrame(index=wide.index)
if metadata_df is not None:
    obs = obs.join(metadata_df, how="left")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | grep -E "(cptac|ucsc)" | head -20

Repository: bigbio/hvantk

Length of output: 358


🏁 Script executed:

wc -l hvantk/tables/cptac.py hvantk/tables/ucsc.py

Repository: bigbio/hvantk

Length of output: 122


🏁 Script executed:

sed -n '80,100p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 683


🏁 Script executed:

sed -n '140,160p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 686


🏁 Script executed:

cat -n hvantk/tables/ucsc.py | head -200

Repository: bigbio/hvantk

Length of output: 5256


🏁 Script executed:

# Check if there's any validation of metadata_df before these join calls in cptac.py
sed -n '75,95p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 678


🏁 Script executed:

sed -n '130,160p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 1041


🏁 Script executed:

# Check function signatures to understand where metadata_df comes from
grep -n "def.*metadata" hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 39


🏁 Script executed:

grep -B 5 "metadata_df" hvantk/tables/cptac.py | grep -E "(def |metadata_df)"

Repository: bigbio/hvantk

Length of output: 469


🏁 Script executed:

# Get full function signatures for both builders
sed -n '40,70p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 1169


🏁 Script executed:

# Check if there are any other callers or tests that show how metadata_df is expected to be structured
grep -r "create_anndata_from_cptac" --include="*.py" | grep -v "^Binary"

Repository: bigbio/hvantk

Length of output: 662


🏁 Script executed:

# Check the full phospho function signature
sed -n '105,125p' hvantk/tables/cptac.py

Repository: bigbio/hvantk

Length of output: 806


🏁 Script executed:

# Check how metadata_df is constructed in the matrix_builders callers
sed -n '1,50p' hvantk/tables/matrix_builders.py | head -100

Repository: bigbio/hvantk

Length of output: 1431


🏁 Script executed:

# Get the full context around the cptac function calls in matrix_builders
rg -B 5 -A 5 "create_anndata_from_cptac" hvantk/tables/matrix_builders.py | head -60

Repository: bigbio/hvantk

Length of output: 1582


Add overlap validation to metadata joins to prevent silent NaN filling.

Lines 89-92 and 149-153 join metadata_df without checking if its index overlaps with the sample IDs. A left join with no overlapping indices silently produces all-NaN obs, which breaks downstream analysis that depends on sample metadata (e.g., tissue/cell-type grouping).

The hvantk/tables/ucsc.py builder already addresses this by checking adata.obs.index.intersection(metadata_df.index) and logging a warning when the overlap is empty (lines 133-137). Adopt the same pattern here: validate that the metadata index contains the expected sample IDs, emit a warning if overlap is missing, and preferably raise an error if the overlap is empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/cptac.py` around lines 89 - 92, In cptac.py, when joining
metadata_df into obs (created from wide.index) and the later join at lines
~149-153, first compute the overlap = obs.index.intersection(metadata_df.index)
and validate it; if overlap is empty, log a warning (or raise a ValueError) and
avoid doing a blind left join that yields all-NaN rows; if partial overlap, log
how many samples matched and then perform obs = obs.join(metadata_df,
how="left") restricted to the overlapping index (or ensure rows with no metadata
are handled explicitly). Use the same validation/logging pattern used in
hvantk/tables/ucsc.py (checking adata.obs.index.intersection(metadata_df.index)
and warning on empty overlap) and apply it to the variables obs, metadata_df,
and wide.index in both join locations.
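The overlap check borrowed from the ucsc.py builder could look like this sketch (the function name and messages are illustrative, not the project's actual helper):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def join_metadata(obs, metadata_df):
    """Left-join sample metadata after validating index overlap.

    Raises if no sample IDs match (a blind left join would silently yield
    all-NaN obs); warns on partial overlap so downstream grouping failures
    are traceable to missing metadata rather than a silent join.
    """
    overlap = obs.index.intersection(metadata_df.index)
    if len(overlap) == 0:
        raise ValueError(
            "No overlap between sample IDs and metadata index; "
            "check that both use the same sample identifiers."
        )
    if len(overlap) < len(obs.index):
        logger.warning(
            "Metadata found for %d/%d samples; the rest will have NaN obs.",
            len(overlap), len(obs.index),
        )
    return obs.join(metadata_df, how="left")
```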

Comment on lines 127 to +139
if isinstance(group_by, str):
group_by = [group_by]

# --- Validate ---
_validate_group_by(mt, group_by, metadata_field)

# --- Filter columns ---
if filter_by:
mt = filter_by_metadata(mt, filter_by)

# --- Build group label ---
if len(group_by) == 1:
label_expr = hl.str(mt[metadata_field][group_by[0]])
labels = adata.obs[group_by[0]].astype(str)
else:
label_expr = hl.delimit([hl.str(mt[metadata_field][f]) for f in group_by], "_")
mt = mt.annotate_cols(_group_label=label_expr)

# --- Filter groups by min cells ---
group_counts = mt.aggregate_cols(hl.agg.counter(mt._group_label))
valid_groups = {g for g, n in group_counts.items() if n >= min_cells_per_group}
skipped = {g: n for g, n in group_counts.items() if n < min_cells_per_group}
if skipped:
logger.warning(
"Skipping %d group(s) with fewer than %d cells: %s",
len(skipped),
min_cells_per_group,
skipped,
)
if not valid_groups:
raise ValueError(
f"No groups have >= {min_cells_per_group} cells. "
f"Group sizes: {group_counts}"
)

mt = mt.filter_cols(hl.literal(valid_groups).contains(mt._group_label))

# --- Group and aggregate ---
grouped_mt = mt.group_cols_by(mt._group_label).aggregate(
mean=hl.agg.mean(mt[expr_field]),
fraction_expressed=hl.agg.fraction(mt[expr_field] > 0),
n_cells=hl.agg.count(),
)

# --- Collect per-gene stats into a dict ---
grouped_mt = grouped_mt.annotate_rows(
stats=hl.dict(
hl.agg.collect(
hl.tuple(
[
grouped_mt._group_label,
hl.struct(
mean=grouped_mt.mean,
fraction_expressed=grouped_mt.fraction_expressed,
n_cells=hl.int32(grouped_mt.n_cells),
),
]
)
)
)
)

# --- Build output Table ---
row_fields = {gene_id_field: mt.row[gene_id_field]}
if gene_name_field and gene_name_field in mt.row:
row_fields[gene_name_field] = mt.row[gene_name_field]

tb = grouped_mt.rows()
select_exprs = {"gene_id": tb[gene_id_field], "stats": tb.stats}
if gene_name_field and gene_name_field in tb.row:
select_exprs["gene_name"] = tb[gene_name_field]

tb = tb.select(**select_exprs)
tb = tb.key_by("gene_id")

if output_path:
logger.info("Checkpointing summary table to %s", output_path)
tb = tb.checkpoint(output_path, overwrite=overwrite)

logger.info(
"Summarized %d genes across %d groups (group_by=%s)",
tb.count(),
len(valid_groups),
group_by,
)
return tb


def summarize_matrix(mt: hl.MatrixTable) -> Dict:
"""
Provide a comprehensive summary of a MatrixTable.

Args:
mt: Hail MatrixTable to summarize

Returns:
Dict containing summary statistics and information about the MatrixTable
"""
# Basic counts
n_samples = mt.count_cols()
n_genes = mt.count_rows()
n_entries = n_samples * n_genes

# Sample metadata summaries
if "metadata" in mt.col:
metadata_fields = list(mt.col.metadata.dtype)
metadata_stats = {}

for field in metadata_fields:
# Check if the field is likely numerical
field_type = mt.col.metadata[field].dtype
is_numeric = hl.is_numeric(field_type)

if is_numeric:
# For numerical fields, compute summary statistics
field_stats = mt.aggregate_cols(hl.agg.stats(mt.metadata[field]))
# Convert to a more compact dict format
field_stats = {
"mean": field_stats.mean,
"std": field_stats.stdev,
"min": field_stats.min,
"max": field_stats.max,
"n": field_stats.n,
"sum": field_stats.sum,
labels = adata.obs[group_by[0]].astype(str)
for col in group_by[1:]:
    labels = labels + "_" + adata.obs[col].astype(str)

# Local Series — does not touch adata.obs.
label_series = pd.Series(labels.values, index=np.arange(adata.n_obs))


⚠️ Potential issue | 🟠 Major

Don't encode groups with plain underscore joins.

For multi-column group_by, this can merge distinct tuples when the values themselves contain underscores, and astype(str) also turns missing metadata into a real "nan" group. Group on the original column tuple (or a safely escaped encoding) and drop rows with missing grouping fields before aggregation. This also feeds the AnnData mean path in hvantk/ptm/constraint_expression.py:189-205.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/utils/matrix_utils.py` around lines 127 - 139, The current label
construction joins stringified fields with "_" which corrupts tuples with
underscores and turns missing values into the literal "nan"; change the logic in
the block that builds labels (the group_by handling and label_series creation)
to build canonical group keys as tuples or a pandas.MultiIndex from the original
adata.obs columns (e.g., using tuple(zip(...)) or
pandas.MultiIndex.from_frame(adata.obs[group_by])) and drop rows with any
missing grouping fields before downstream aggregation, and ensure label_series
uses adata.obs.index (not np.arange(adata.n_obs)); update any callers (e.g., the
AnnData mean path referenced in constraint_expression.py) to consume
tuple/MultiIndex keys or a safely encoded representation.
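Tuple-keyed grouping avoids both failure modes at once. The sketch below is a hypothetical helper (`group_means` is not the project's function, and the real code aggregates more statistics than the mean): keys stay a MultiIndex built from the raw columns, so values containing "_" can never collide, and rows with a missing grouping field are dropped instead of becoming a literal "nan" group.

```python
import numpy as np
import pandas as pd


def group_means(obs, X, group_by):
    """Mean-aggregate rows of X, keyed by tuples of the original obs columns."""
    # Drop rows where any grouping field is missing.
    keep = obs[group_by].notna().all(axis=1).to_numpy()
    # MultiIndex keys preserve the column tuple exactly — no string joining.
    keys = pd.MultiIndex.from_frame(obs.loc[keep, group_by])
    df = pd.DataFrame(np.asarray(X)[keep], index=keys)
    return df.groupby(level=list(range(len(group_by)))).mean()
```

Under an underscore join, (`"a_b"`, `"c"`) and (`"a"`, `"b_c"`) would both collapse to `"a_b_c"`; with tuple keys they remain distinct groups.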

Comment thread requirements.txt Outdated
Comment on lines +14 to +15
pyyaml
tspex

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, check if anndata is imported in the three mentioned files
echo "=== Checking for anndata imports ==="
for file in hvantk/core/converters.py hvantk/tables/cptac.py hvantk/tables/ucsc.py; do
  if [ -f "$file" ]; then
    echo "File: $file"
    grep -n "anndata" "$file" | head -20
  else
    echo "File not found: $file"
  fi
done

echo ""
echo "=== Current requirements.txt ==="
if [ -f "requirements.txt" ]; then
  cat -n requirements.txt
else
  echo "requirements.txt not found"
fi

echo ""
echo "=== Current environment.yml ==="
if [ -f "environment.yml" ]; then
  cat -n environment.yml
else
  echo "environment.yml not found"
fi

Repository: bigbio/hvantk

Length of output: 1805


requirements.txt is missing critical AnnData runtime dependencies.

The three new modules hvantk/core/converters.py (line 14), hvantk/tables/cptac.py (line 12), and hvantk/tables/ucsc.py (line 24) all import anndata, but this dependency is absent from requirements.txt. A plain pip install -r requirements.txt will fail with ModuleNotFoundError: anndata.

Additionally, environment.yml specifies several other packages not in requirements.txt: scanpy>=1.10.0, pysam>=0.22, psutil, and scikit-learn. Sync requirements.txt with the actual runtime dependencies listed in environment.yml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` around lines 14 - 15, requirements.txt is missing runtime
deps required by the new modules (hvantk/core/converters.py,
hvantk/tables/cptac.py, hvantk/tables/ucsc.py) which import anndata; update
requirements.txt to include anndata plus the other packages declared in
environment.yml (scanpy>=1.10.0, pysam>=0.22, psutil, scikit-learn) and any
version constraints from environment.yml so a plain pip install -r
requirements.txt will succeed and match the environment.yml runtime
dependencies.

@enriquea
Collaborator Author

@copilot, review and evaluate every comment in this PR. Fix only if valid.
Check the CI/CD GitHub workflows that are failing. Fix them.

Contributor

Copilot AI commented Apr 14, 2026

@copilot, review and evaluate every comment in this PR. Fix only if valid.
Check the CI/CD GitHub workflows that are failing. Fix them.

Done in 32a57e5. I reviewed all prior PR comments, fixed the valid ones (UCSC/CPTAC delimiter handling, GTF parsing guards, Hail MT aggregation safety for medians, BGZF writer deduplication, sparse-safe AnnData histogram behavior, and small-MT guard for MT→AnnData conversion), and investigated CI failures via workflow logs. The failing workflows were due to missing anndata in requirements.txt; that is now fixed and validated with targeted tests/lint.

@enriquea enriquea merged commit a67cbaa into main Apr 14, 2026
5 of 6 checks passed