
including PeptideAtlas and CPTAC PTMs as stream information #84

Merged
enriquea merged 26 commits into main from ptm on Mar 31, 2026
Conversation

@ypriverol ypriverol (Member) commented Mar 31, 2026

Summary by CodeRabbit

  • New Features

    • Download CPTAC phosphoproteomics datasets via new cptac-phospho CLI command with cancer type selection and pan-cancer merging support
    • Download PeptideAtlas human phosphoproteome datasets via new peptideatlas-phospho CLI command
    • Build CPTAC phosphoproteomics matrix tables with sample and site metadata integration
    • Extend PTM building pipeline to incorporate PeptideAtlas and CPTAC phosphoproteomics sources
    • Visualize source data overlap analysis for phospho-sites
  • Tests

    • Added comprehensive test coverage for phosphoproteomics dataset parsing and processing

ypriverol and others added 25 commits March 30, 2026 08:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…UMNS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a new CLI subcommand `hvantk download peptideatlas-phospho` that
downloads PeptideAtlas human phospho build data, supporting latest or
specific build via --build-date and --build-id options.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e comparison

Three-panel plot: site overlap counts, box plot of observation counts
by source category, and cumulative distribution of PeptideAtlas evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The relative URLs on the PeptideAtlas page resolve under
/builds/human/phospho/, not /builds/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Support text-based notation (S[Phospho]) alongside numeric (S[167])
- Stream large tables instead of loading into memory
- Fix amino acid extraction by returning AA from offset parser
- Handle malformed sequences with unclosed brackets
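A minimal sketch of how such an offset parser might behave, assuming the notation described above (illustrative only; the actual `_extract_phospho_offsets` in hvantk may differ):

```python
def extract_phospho_offsets(peptide):
    """Return (offset, amino_acid) pairs for phospho-modified residues.

    Supports numeric mass notation (S[167], T[181], Y[243]) and text
    notation (S[Phospho]); offsets are 0-based positions in the bare
    peptide sequence.
    """
    offsets = []
    seq = []  # bare residues seen so far
    i = 0
    while i < len(peptide):
        ch = peptide[i]
        if ch == "[":
            close = peptide.find("]", i)
            if close == -1:
                break  # malformed sequence with an unclosed bracket
            mod = peptide[i + 1 : close]
            aa = seq[-1] if seq else ""
            is_phospho = mod == "Phospho" or (
                (aa, mod.split(".")[0]) in {("S", "167"), ("T", "181"), ("Y", "243")}
            )
            if is_phospho and seq:
                # Report the offset of the residue immediately before the bracket
                offsets.append((len(seq) - 1, aa))
            i = close + 1
        else:
            seq.append(ch)
            i += 1
    return offsets
```

Returning the amino acid together with the offset (rather than re-reading it from the sequence afterwards) matches the fix described above, and unclosed brackets simply truncate parsing instead of raising.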

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this filter, shared peptides mapped to all 588K protein entries
inflated site count from ~260K to 9.7M. The protein_identification
table's presence_level_id=1 gives the 11,852 canonical proteins
matching the PeptideAtlas build page.
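The effect of that filter can be illustrated with a small sketch (table and column names follow the description above, but the real PeptideAtlas dump layout may differ):

```python
# Hypothetical miniature of the canonical-protein filter: only mappings
# to proteins with presence_level_id == 1 are kept, so shared peptides
# no longer fan out to every redundant protein entry.
protein_identification = [
    {"protein_id": "P1", "presence_level_id": 1},  # canonical
    {"protein_id": "P2", "presence_level_id": 2},  # non-canonical
    {"protein_id": "P3", "presence_level_id": 1},  # canonical
]
peptide_mappings = [
    {"peptide": "AS[167]DK", "protein_id": "P1"},
    {"peptide": "AS[167]DK", "protein_id": "P2"},  # shared peptide, dropped
    {"peptide": "T[181]PEK", "protein_id": "P3"},
]

canonical = {
    p["protein_id"]
    for p in protein_identification
    if p["presence_level_id"] == 1
}
filtered = [m for m in peptide_mappings if m["protein_id"] in canonical]
```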

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tests

- Fix doubled /phospho in URL construction (from_build method)
- Downloader now produces intermediate TSV, not just raw zip
- Numeric mass detection covers T[181] and Y[243] in addition to S[167]
- Add parametrized tests for _extract_phospho_offsets (16 cases)
- Add test for canonical protein filtering (presence_level_id=1)
- Add test for URL construction correctness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add cptac_tsv field to PTMBuildConfig with validation, CPTAC mapping
and concatenation step in ptm_build_pipeline(), and --cptac-tsv CLI
option to the ptm build command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…load()

- Rename colon→coad, remove endometrial (matching actual cptac package classes)
- Add overwrite parameter and Dict return to CPTACPhosphoDataset.download()
- Use consistent cptac-phospho-{type} file naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cptac package has a bug where get_phosphoproteomics() and
get_clinical() fail with "generator has no len()" when multiple
sources exist and no source is specified. Use source='umich' for
phospho and source='mssm' for clinical to work around this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ho output

- Download tumor and normal phospho data separately per cancer type
- Intermediate TSV now includes tissue_type, cancer_type, mean_intensity columns
- Produces separate tumor/normal TSVs plus a combined TSV
- Metadata CSV includes tissue_type per sample (from .N suffix convention)
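A rough sketch of that suffix convention (an assumption based on the description above, not the exact hvantk code):

```python
def infer_tissue_type(sample_id):
    # Assumption from the ".N suffix convention" above: normal-tissue
    # CPTAC samples carry a trailing ".N" in their sample IDs.
    return "normal" if sample_id.endswith(".N") else "tumor"
```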

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- write_matrix_csv now expands multi-sites (T185_Y187 → T185, Y187)
  and averages duplicates from different peptides
- build_cptac_phospho_mt uses regex for robust site ID parsing
- Updated tests to verify multi-site expansion and per-sample values
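The expansion-plus-averaging step can be sketched as follows (a hypothetical miniature; the real `write_matrix_csv` operates on full expression matrices and its identifiers may differ):

```python
from collections import defaultdict

# Two input rows from different peptides; "MAPK1_T185_Y187" is a
# multi-site ID that should expand to MAPK1_T185 and MAPK1_Y187.
rows = [
    {"site_id": "MAPK1_T185_Y187", "sample_1": 2.0},
    {"site_id": "MAPK1_T185", "sample_1": 4.0},  # same site, other peptide
]

def expand_and_average(rows):
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        gene, *sites = row["site_id"].split("_")
        for site in sites:  # "T185_Y187" -> "T185", "Y187"
            key = f"{gene}_{site}"
            sums[key][0] += row["sample_1"]
            sums[key][1] += 1
    # Duplicates originating from different peptides are averaged
    return {k: total / n for k, (total, n) in sums.items()}

matrix = expand_and_average(rows)
```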

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Downloaded datasets (UniProt, PeptideAtlas, CPTAC) are reproducible
from code and should not be committed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai coderabbitai bot (Contributor) commented Mar 31, 2026

📝 Walkthrough

This PR adds support for downloading and integrating CPTAC and PeptideAtlas phosphoproteomics datasets into hvantk. New CLI commands enable dataset downloads to intermediate TSV formats, a matrix builder creates Hail MatrixTables from CPTAC phospho data, and the PTM pipeline is extended to ingest these datasets alongside existing UniProt sources.

Changes

Cohort / File(s) Summary
CPTAC Phospho Dataset Infrastructure
hvantk/datasets/cptac_phospho_datasets.py, hvantk/commands/cptac_phospho_downloader.py, hvantk/tests/test_cptac_phospho.py
Adds module to download CPTAC phosphoproteomics per cancer type, extract phospho sites from intensity matrices, write intermediate TSVs and metadata CSVs, and defines CPTACPhosphoDataset class. CLI command registers cptac-phospho downloader with options for cancer type, all-cancer pan-cancer merge, listing available types, and overwrite behavior. Tests validate phosphosite parsing, metadata handling, and matrix CSV conversion.
PeptideAtlas Phospho Dataset Infrastructure
hvantk/datasets/peptideatlas_phospho_datasets.py, hvantk/commands/peptideatlas_phospho_downloader.py, hvantk/tests/test_peptideatlas_phospho.py
Adds module to stream-parse PeptideAtlas ZIP archives, extract phospho sites with observation count aggregation, and write intermediate TSVs. Includes factory methods for latest/specific build selection. CLI command registers peptideatlas-phospho downloader with build date/ID options. Tests cover phospho offset extraction, site aggregation across peptides, DECOY filtering, and dataset metadata.
PTM Pipeline Integration
hvantk/ptm/constants.py, hvantk/ptm/pipeline.py, hvantk/ptm/plot.py, hvantk/commands/ptm_cli.py
Extends PTMBuildConfig with optional peptideatlas_tsv and cptac_tsv paths, updates ptm_build_pipeline() to map and concatenate sites from multiple sources with aggregated statistics, adds n_observations field to output schema, and introduces plot_source_overlap() for visualizing UniProt/PeptideAtlas coverage. PTM CLI exposes new --peptideatlas-tsv and --cptac-tsv options.
Matrix Table Building
hvantk/tables/matrix_builders.py, hvantk/tables/registry.py, hvantk/tables/table_builders.py, hvantk/commands/make_matrix_cli.py
Adds build_cptac_phospho_mt builder to convert CPTAC phospho expression CSVs and metadata into Hail MatrixTables, with site ID parsing and optional column casting. Registers builder in matrix registry and extends PTM table builder to handle n_observations field. CLI command mkmatrix cptac-phospho accepts expression, metadata, site/sample ID columns, and categorical/numeric column specifications.
CLI Command Integration
hvantk/commands/download_cli.py
Imports and registers cptac-phospho and peptideatlas-phospho subcommands under the download group.
Configuration & Tests
.gitignore, hvantk/tests/test_ptm.py
Updates .gitignore to ignore data/ directory and extends PTM round-trip test to assert evidence_type and n_observations fields.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement, feature, testing

Suggested reviewers

  • enriquea

Poem

🐰 Phospho sites hopping through the data streams so bright,
CPTAC and PeptideAtlas bringing coverage into light,
TSVs merge and matrices bloom in organized delight,
The PTM pipeline learns to paint with every data source's might! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
The title is vague and does not clearly convey the main purpose of the changeset; "stream information" is imprecise and does not adequately summarize the substantial new functionality added.
Resolution: Revise the title to be more specific and descriptive. Consider: "Add PeptideAtlas and CPTAC phosphoproteomics dataset downloaders and matrix builders" or similar to accurately reflect the scope of changes.

✅ Passed checks (2 passed)

Description Check: ✅ Passed
Check skipped - CodeRabbit’s high-level summary is enabled.

Docstring Coverage: ✅ Passed
Docstring coverage is 87.93%, which is sufficient; the required threshold is 80.00%.



@codacy-production

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
hvantk/ptm/constants.py (1)

19-20: Avoid hardcoding values labeled as “latest.”

Line 19 and Line 20 encode a point-in-time build as constants named LATEST_*, which will eventually drift and silently age. Prefer an override mechanism (env/config) or runtime resolution so releases don’t require code edits for dataset freshness.

Suggested adjustment
+import os
+
-PEPTIDEATLAS_LATEST_BUILD_DATE = "202512"
-PEPTIDEATLAS_LATEST_BUILD_ID = "606"
+PEPTIDEATLAS_LATEST_BUILD_DATE = os.getenv("HVANTK_PEPA_BUILD_DATE", "202512")
+PEPTIDEATLAS_LATEST_BUILD_ID = os.getenv("HVANTK_PEPA_BUILD_ID", "606")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/constants.py` around lines 19 - 20, The constants
PEPTIDEATLAS_LATEST_BUILD_DATE and PEPTIDEATLAS_LATEST_BUILD_ID are hardcoded as
"LATEST" values; change them to be resolved at runtime via configuration or
environment variables so they don't silently age. Replace the fixed constants
with a runtime accessor (e.g., get_peptideatlas_latest_build() or module-level
resolution) that checks environment variables (e.g.,
PEPTIDEATLAS_LATEST_BUILD_DATE, PEPTIDEATLAS_LATEST_BUILD_ID) and falls back to
the existing literals only as defaults; update any callers that import these
names to use the accessor or the module attributes that are set on import.
Ensure parsing/validation of the env values and add tests or comments
documenting the override mechanism.
hvantk/commands/make_matrix_cli.py (1)

245-301: Expose site_id_col in the CLI to match builder capability.

build_cptac_phospho_mt supports site_id_col, but this command hardcodes the default by not exposing it. Adding an option improves compatibility with alternate matrix schemas.

Suggested adjustment
 @click.option("-o", "--output-mt", "output_mt", required=True, type=click.Path())
+@click.option("--site-id-col", "site_id_col", default="SiteID", show_default=True)
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_mt,
+    site_id_col,
     sample_id_col,
     categorical_cols,
     numeric_cols,
     overwrite,
 ):
@@
     mt = _build_cptac_phospho_mt(
         expression_path=expression,
         metadata_path=metadata,
         output_mt=output_mt,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         categorical_cols=_split(categorical_cols),
         numeric_cols=_split(numeric_cols),
         overwrite=overwrite,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 245 - 301, The CLI command
mkmatrix_cptac_phospho currently doesn't expose the site_id_col parameter even
though the builder _build_cptac_phospho_mt supports it; add a new click.option
(e.g., "--site-id-col" with a sensible default like "SiteID") to the
mkmatrix_cptac_phospho signature and forward that variable into the call to
_build_cptac_phospho_mt via site_id_col=site_id_col so alternate matrix schemas
can specify the site identifier column.
hvantk/ptm/pipeline.py (1)

323-387: Consider extracting duplicated TSV concatenation logic into a helper function.

The concatenation logic for combining mapped TSVs (lines 329-349 for PeptideAtlas and lines 362-382 for CPTAC) is nearly identical. This could be refactored into a helper function to reduce duplication and improve maintainability.

♻️ Proposed refactor to extract helper function
+def _concatenate_mapped_tsvs(
+    source_paths: List[str],
+    output_path: str,
+) -> None:
+    """Concatenate multiple mapped TSV files into a single combined file."""
+    with open(output_path, "w", newline="") as fout:
+        writer = csv.DictWriter(
+            fout, fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t", lineterminator="\n"
+        )
+        writer.writeheader()
+        for src_path in source_paths:
+            with open(src_path) as fin:
+                reader = csv.DictReader(fin, delimiter="\t")
+                for row in reader:
+                    writer.writerow(row)

Then replace the duplicated blocks with calls to this helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 323 - 387, The duplicated TSV
concatenation blocks should be moved into a single helper (e.g., def
concatenate_mapped_tsvs(src_paths: List[str], out_path: str,
fieldnames=PTM_OUTPUT_COLUMNS)) and both use sites: call it to concatenate
[mapped_path, pa_mapped_path] and [mapped_path, cptac_mapped_path]; update the
callers in the mapping flow (around map_ptm_sites results for pa_result and
cptac_result) to set mapped_path = returned out_path and keep the existing
updates to result (n_total, n_mapped, n_failed, resolution_counts) unchanged;
ensure the helper opens out_path with newline="", writes the header using
csv.DictWriter(fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t",
lineterminator="\n"), iterates each src_path with csv.DictReader(delimiter="\t")
and copies rows, and raise or log a clear error if any src_path is missing so
behavior matches original expectations.
hvantk/datasets/peptideatlas_phospho_datasets.py (1)

63-64: Unused constant — consider removing.

_BRACKET_MOD_PATTERN is defined but never used. The _extract_phospho_offsets function manually parses brackets instead of using this regex.

🧹 Suggested removal
-# Pattern to match bracket modifications like [167] or [167.0]
-_BRACKET_MOD_PATTERN = re.compile(r"\[([0-9.]+)\]")

Also remove import re from line 20 if no other uses exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 63 - 64, The
constant _BRACKET_MOD_PATTERN is unused because _extract_phospho_offsets
performs manual bracket parsing; remove the unused _BRACKET_MOD_PATTERN
definition and delete the unused import re (if there are no other usages of re
in the file), and keep _extract_phospho_offsets as-is or refactor it to use
_BRACKET_MOD_PATTERN if you prefer regex — but do not leave the orphaned regex
constant/import in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/commands/peptideatlas_phospho_downloader.py`:
- Around line 51-54: The current logic silently ignores partial build parameters
(build_date or build_id) and calls PeptideAtlasPhosphoDataset.from_latest();
update the code that branches on build_date and build_id to detect the partial
case and either raise a clear error or log/warn and exit, ensuring users must
provide both parameters together; specifically modify the conditional around
build_date/build_id to: if both provided call
PeptideAtlasPhosphoDataset.from_build(build_date, build_id), if neither call
PeptideAtlasPhosphoDataset.from_latest(), and if exactly one is provided emit a
user-facing warning or raise a ValueError indicating both --build-date and
--build-id are required together.

In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 313-322: The try/except around ds.get_phosphoproteomics is too
broad and leaves normal_df possibly undefined; replace the blanket except
Exception with catching the specific error(s) that ds.get_phosphoproteomics can
raise (e.g., KeyError/IOError/ValueError as appropriate) and only handle the "no
data" case there, and ensure normal_df is always defined before it's referenced
later (either initialize normal_df = None before the try, or move subsequent
references such as the normal_df.shape check, extract_phospho_sites(normal_df,
ct, tissue_type="normal") and write_intermediate_tsv(normal_sites, normal_tsv)
inside the try block or guard them with an explicit "if normal_df is not None"
check); keep logger calls (logger.info) but include the specific caught
exception variable when logging expected failures for context.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 396-400: Validate that self.zip_url uses an allowed scheme (e.g.,
parse the URL and ensure scheme == "https") before attempting download, then
perform the download to a temporary file (e.g., zip_path + ".part") and only
rename/move it to zip_path on successful completion; wrap the
urllib.request.urlretrieve call in a try/except that catches network errors
(URLError, HTTPError) and generic Exception, log a descriptive error with
exception details via logger.error, remove any partial temp file on failure, and
either re-raise the exception or return a clear failure value so partial files
are not left behind and invalid schemes are rejected.

In `@hvantk/ptm/plot.py`:
- Line 487: The int(...) cast for n_obs can raise ValueError for blank or
non-numeric TSV values; update the parsing around the n_obs assignment (the line
using row.get("n_observations", "0") that sets n_obs) to safely handle
missing/invalid input by normalizing the fetched value (e.g., str(...).strip()),
then try converting to int inside a try/except ValueError block and fall back to
0 on failure; ensure the code still accepts valid numeric strings and preserves
the n_obs variable for downstream plotting logic.
- Around line 505-514: The current logic only checks for "UniProt" and
"PeptideAtlas" and misclassifies sites that include "CPTAC" (or any other
additional source) as UniProt-only; update the categorization to consider the
full set of sources in info["sources"] (variables: site_sources,
info["sources"], both_sources, uniprot_only, pa_only). Compute flags has_up and
has_pa as before but also inspect the total number of sources (e.g.,
len(info["sources"]) or a has_other flag) and only classify as uniprot_only or
pa_only when that source is present and it is the sole source; keep the
both_sources branch for has_up and has_pa combinations so that UniProt+CPTAC (or
other combos) are not incorrectly labeled UniProt-only.

In `@hvantk/tables/matrix_builders.py`:
- Around line 262-268: _parsе_site_id currently returns (sid, "", 0) for
unparseable inputs which yields empty amino_acid and residue_pos=0; update
_parse_site_id to detect non-matching site IDs and either (a) log a warning via
the module logger including the offending sid and context (use the existing
logger or create one) and return None or raise a specific ValueError, or (b)
return a sentinel (e.g. None) so callers can filter them out; also update any
callers that iterate over _parse_site_id (refer to the function name
_parse_site_id and the regex _site_id_re) to skip None results or catch the
ValueError and drop/log the row instead of inserting empty/zero fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ced376e-1f01-4814-b768-bc8a96d19db0

📥 Commits

Reviewing files that changed from the base of the PR and between e5ef742 and 80bbad4.

📒 Files selected for processing (17)
  • .gitignore
  • hvantk/commands/cptac_phospho_downloader.py
  • hvantk/commands/download_cli.py
  • hvantk/commands/make_matrix_cli.py
  • hvantk/commands/peptideatlas_phospho_downloader.py
  • hvantk/commands/ptm_cli.py
  • hvantk/datasets/cptac_phospho_datasets.py
  • hvantk/datasets/peptideatlas_phospho_datasets.py
  • hvantk/ptm/constants.py
  • hvantk/ptm/pipeline.py
  • hvantk/ptm/plot.py
  • hvantk/tables/matrix_builders.py
  • hvantk/tables/registry.py
  • hvantk/tables/table_builders.py
  • hvantk/tests/test_cptac_phospho.py
  • hvantk/tests/test_peptideatlas_phospho.py
  • hvantk/tests/test_ptm.py

Comment on lines +51 to +54
if build_date and build_id:
    dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
else:
    dataset = PeptideAtlasPhosphoDataset.from_latest()
Contributor

⚠️ Potential issue | 🟡 Minor

Partial build parameters silently ignored.

If a user provides only --build-date or only --build-id (but not both), the code silently falls back to from_latest(). This could be confusing. Consider warning the user or requiring both when either is provided.

🛡️ Proposed fix to warn on partial parameters
         if build_date and build_id:
             dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
+        elif build_date or build_id:
+            click.echo(
+                "Warning: Both --build-date and --build-id are required for a specific build. "
+                "Using latest build instead.",
+                err=True,
+            )
+            dataset = PeptideAtlasPhosphoDataset.from_latest()
         else:
             dataset = PeptideAtlasPhosphoDataset.from_latest()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/peptideatlas_phospho_downloader.py` around lines 51 - 54, The
current logic silently ignores partial build parameters (build_date or build_id)
and calls PeptideAtlasPhosphoDataset.from_latest(); update the code that
branches on build_date and build_id to detect the partial case and either raise
a clear error or log/warn and exit, ensuring users must provide both parameters
together; specifically modify the conditional around build_date/build_id to: if
both provided call PeptideAtlasPhosphoDataset.from_build(build_date, build_id),
if neither call PeptideAtlasPhosphoDataset.from_latest(), and if exactly one is
provided emit a user-facing warning or raise a ValueError indicating both
--build-date and --build-id are required together.

Comment on lines +313 to +322
try:
    normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
    if len(normal_df) > 0:
        logger.info("Normal: %d samples x %d columns", *normal_df.shape)
        normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
        write_intermediate_tsv(normal_sites, normal_tsv)
    else:
        logger.info("No normal samples available for %s", ct)
except Exception as e:
    logger.info("No normal tissue data for %s: %s", ct, e)
Contributor

⚠️ Potential issue | 🟡 Minor

Catching blind Exception and potential NameError on normal_df.

  1. Line 321 catches a broad Exception, which could mask unexpected errors. Consider catching a more specific exception type.
  2. Line 357 references normal_df which may not be defined if an exception was raised before line 314.
🛡️ Proposed fix
         # Fetch normal phospho data (may be empty for some cancer types)
         normal_sites = []
+        normal_df = None
         try:
             normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
             if len(normal_df) > 0:
                 logger.info("Normal: %d samples x %d columns", *normal_df.shape)
                 normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
                 write_intermediate_tsv(normal_sites, normal_tsv)
             else:
                 logger.info("No normal samples available for %s", ct)
-        except Exception as e:
+        except (KeyError, ValueError, AttributeError) as e:
             logger.info("No normal tissue data for %s: %s", ct, e)

And update line 357:

-            len(tumor_df), len(normal_df) if normal_sites else 0,
+            len(tumor_df), len(normal_df) if normal_df is not None else 0,
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 321-321: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/cptac_phospho_datasets.py` around lines 313 - 322, The
try/except around ds.get_phosphoproteomics is too broad and leaves normal_df
possibly undefined; replace the blanket except Exception with catching the
specific error(s) that ds.get_phosphoproteomics can raise (e.g.,
KeyError/IOError/ValueError as appropriate) and only handle the "no data" case
there, and ensure normal_df is always defined before it's referenced later
(either initialize normal_df = None before the try, or move subsequent
references such as the normal_df.shape check, extract_phospho_sites(normal_df,
ct, tissue_type="normal") and write_intermediate_tsv(normal_sites, normal_tsv)
inside the try block or guard them with an explicit "if normal_df is not None"
check); keep logger calls (logger.info) but include the specific caught
exception variable when logging expected failures for context.

Comment on lines +396 to +400
# Download zip if needed
if not os.path.exists(zip_path) or overwrite:
    logger.info("Downloading %s -> %s", self.zip_url, zip_path)
    urllib.request.urlretrieve(self.zip_url, zip_path)
    logger.info("Download complete: %s", zip_path)
Contributor

⚠️ Potential issue | 🟡 Minor

Add error handling for download failures and URL scheme validation.

  1. Network failures will raise uncaught exceptions, potentially leaving partial files on disk.
  2. While the base URL is hardcoded to https://, explicit scheme validation provides defense-in-depth (addresses static analysis S310).
🛡️ Suggested improvement
         # Download zip if needed
         if not os.path.exists(zip_path) or overwrite:
             logger.info("Downloading %s -> %s", self.zip_url, zip_path)
-            urllib.request.urlretrieve(self.zip_url, zip_path)
-            logger.info("Download complete: %s", zip_path)
+            # Validate URL scheme
+            if not self.zip_url.startswith(("https://", "http://")):
+                raise ValueError(f"Invalid URL scheme: {self.zip_url}")
+            try:
+                urllib.request.urlretrieve(self.zip_url, zip_path)
+                logger.info("Download complete: %s", zip_path)
+            except Exception as e:
+                # Clean up partial download
+                if os.path.exists(zip_path):
+                    os.remove(zip_path)
+                raise RuntimeError(f"Failed to download {self.zip_url}: {e}") from e
         else:
             logger.info("Using cached zip: %s", zip_path)
🧰 Tools
🪛 Ruff (0.15.7)

[error] 399-399: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 396 - 400,
validate that `self.zip_url` uses an allowed scheme (e.g., parse the URL and
ensure `scheme == "https"`) before attempting the download. Download to a
temporary file (e.g., `zip_path + ".part"`) and only rename/move it to
`zip_path` on successful completion. Wrap the `urllib.request.urlretrieve` call
in a try/except that catches network errors (`URLError`, `HTTPError`), log a
descriptive error with exception details via `logger.error`, remove any partial
temp file on failure, and either re-raise the exception or return a clear
failure value, so partial files are not left behind and invalid schemes are
rejected.
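
A standard-library sketch of the pattern the prompt asks for (scheme check, download to a `.part` temp file, rename on success, cleanup on failure); `download_zip` is a hypothetical name, not the method in the PR:

```python
import os
import urllib.parse
import urllib.request
from urllib.error import URLError


def download_zip(zip_url: str, zip_path: str) -> str:
    """Download zip_url to zip_path atomically; never leave a partial zip_path."""
    if urllib.parse.urlparse(zip_url).scheme != "https":
        raise ValueError(f"Disallowed URL scheme: {zip_url}")
    tmp_path = zip_path + ".part"  # partial data only ever lands here
    try:
        urllib.request.urlretrieve(zip_url, tmp_path)
        os.replace(tmp_path, zip_path)  # atomic rename on POSIX
    except (URLError, OSError):
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial download
        raise
    return zip_path
```

Because the rename happens only after `urlretrieve` returns, a cached `zip_path` is always a complete file.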

Comment thread hvantk/ptm/plot.py
for row in reader:
    key = (row["uniprot_id"], row["residue_pos"])
    site_sources[key]["sources"].add(row["source_db"])
    n_obs = int(row.get("n_observations", "0"))

⚠️ Potential issue | 🟠 Major

n_observations parsing can raise and abort plotting.

Line 487 does a raw int(...) cast; blank/non-numeric values from external TSVs will raise ValueError and fail the plot path.

Proposed fix
-            n_obs = int(row.get("n_observations", "0"))
+            raw_n_obs = (row.get("n_observations") or "0").strip()
+            try:
+                n_obs = int(raw_n_obs)
+            except ValueError:
+                n_obs = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` at line 487, the `int(...)` cast for `n_obs` can raise
`ValueError` for blank or non-numeric TSV values. Update the parsing around the
`n_obs` assignment (the line using `row.get("n_observations", "0")`) to handle
missing or invalid input safely: normalize the fetched value (e.g.,
`str(...).strip()`), then convert to int inside a try/except `ValueError` block,
falling back to 0 on failure. Ensure the code still accepts valid numeric
strings and preserves the `n_obs` variable for downstream plotting logic.

Comment thread hvantk/ptm/plot.py
Comment on lines +505 to +514
for _key, info in site_sources.items():
    has_up = "UniProt" in info["sources"]
    has_pa = "PeptideAtlas" in info["sources"]
    if has_up and has_pa:
        both_sources.append(info["n_obs"])
    elif has_up:
        uniprot_only.append(info["n_obs"])
    elif has_pa:
        pa_only.append(info["n_obs"])


⚠️ Potential issue | 🟠 Major

Source-overlap categories are incorrect when CPTAC rows are present.

Line 506–Line 513 only test UniProt/PeptideAtlas. Sites with CPTAC get miscounted (e.g., UniProt+CPTAC is currently labeled as UniProt-only), which biases panel-1 counts and downstream distributions.

Proposed fix
-    for _key, info in site_sources.items():
-        has_up = "UniProt" in info["sources"]
-        has_pa = "PeptideAtlas" in info["sources"]
-        if has_up and has_pa:
-            both_sources.append(info["n_obs"])
-        elif has_up:
-            uniprot_only.append(info["n_obs"])
-        elif has_pa:
-            pa_only.append(info["n_obs"])
+    excluded_other_sources = 0
+    for _key, info in site_sources.items():
+        sources = info["sources"]
+        has_up = "UniProt" in sources
+        has_pa = "PeptideAtlas" in sources
+        has_cptac = "CPTAC" in sources
+
+        if has_up and has_pa:
+            both_sources.append(info["n_obs"])
+        elif has_up and not has_cptac:
+            uniprot_only.append(info["n_obs"])
+        elif has_pa and not has_cptac:
+            pa_only.append(info["n_obs"])
+        else:
+            excluded_other_sources += 1
+
+    if excluded_other_sources:
+        logger.info(
+            "Excluded %d sites containing non-target sources from UniProt/PeptideAtlas overlap plot",
+            excluded_other_sources,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 505 - 514, the current logic only checks
for "UniProt" and "PeptideAtlas" and misclassifies sites that also include
"CPTAC" (or any other additional source) as UniProt-only. Update the
categorization to consider the full set of sources in `info["sources"]`
(variables: `site_sources`, `info["sources"]`, `both_sources`, `uniprot_only`,
`pa_only`). Compute the `has_up` and `has_pa` flags as before, but also inspect
the total number of sources (e.g., `len(info["sources"])` or a `has_other`
flag) and only classify a site as `uniprot_only` or `pa_only` when that source
is the sole source; keep the `both_sources` branch for the `has_up` and
`has_pa` combination so that UniProt+CPTAC (and other combos) are not
incorrectly labeled UniProt-only.
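
The mutually exclusive bucketing can be unit-tested in isolation; this sketch (hypothetical `categorize` helper, not the merged code) shows why UniProt+CPTAC must fall outside the UniProt-only bucket:

```python
def categorize(sources: set) -> str:
    """Bucket a phospho-site by source membership with mutually exclusive categories."""
    has_up = "UniProt" in sources
    has_pa = "PeptideAtlas" in sources
    if has_up and has_pa:
        return "both"
    if sources == {"UniProt"}:  # sole source, not merely present
        return "uniprot_only"
    if sources == {"PeptideAtlas"}:
        return "pa_only"
    return "other"  # e.g. CPTAC-only or UniProt+CPTAC
```

Comparing against the full set (rather than testing membership alone) is what keeps the panel-1 counts unbiased.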

Comment on lines +262 to +268
_site_id_re = re.compile(r"^(.+)_([STY])(\d+)$")

def _parse_site_id(sid):
    m = _site_id_re.match(sid)
    if m:
        return m.group(1), m.group(2), int(m.group(3))
    return sid, "", 0

⚠️ Potential issue | 🟡 Minor

Unparseable site IDs produce empty/zero row fields.

When a site ID doesn't match the expected Gene_[STY](\d+) pattern, _parse_site_id returns (sid, "", 0). This means rows may have empty amino_acid and residue_pos=0, which could cause issues in downstream analysis expecting valid phospho site data.

Consider logging a warning for unparseable site IDs or filtering them out.

🛡️ Proposed fix to log warnings for unparseable site IDs
     def _parse_site_id(sid):
         m = _site_id_re.match(sid)
         if m:
             return m.group(1), m.group(2), int(m.group(3))
+        logger.warning("Unparseable site ID format: %s", sid)
         return sid, "", 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 262 - 268, `_parse_site_id`
currently returns `(sid, "", 0)` for unparseable inputs, which yields an empty
`amino_acid` and `residue_pos=0`. Update `_parse_site_id` to detect non-matching
site IDs and either (a) log a warning via the module logger, including the
offending `sid` and context, and return `None` or raise a specific
`ValueError`, or (b) return a sentinel (e.g., `None`) so callers can filter
them out. Also update any callers of `_parse_site_id` (see the regex
`_site_id_re`) to skip `None` results or catch the `ValueError` and drop/log
the row instead of inserting empty/zero fields.
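
A sketch of the sentinel variant suggested above, so callers can drop unparseable rows instead of keeping empty fields; `parse_site_id` returning `None` is an assumption for illustration, not the merged behavior:

```python
import logging
import re

logger = logging.getLogger(__name__)
_site_id_re = re.compile(r"^(.+)_([STY])(\d+)$")


def parse_site_id(sid):
    """Return (gene, residue, position) for IDs like 'AKT1_S473', else None."""
    m = _site_id_re.match(sid)
    if m is None:
        logger.warning("Unparseable site ID format: %s", sid)
        return None  # sentinel: caller should skip this row
    return m.group(1), m.group(2), int(m.group(3))
```

A caller would then filter with `if (parsed := parse_site_id(sid)) is not None:` rather than emitting rows with `residue_pos=0`.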

@enriquea enriquea merged commit d69be00 into main Mar 31, 2026
7 checks passed