
including PeptideAtlas and CPTAC PTMs as stream information #84

Merged
enriquea merged 26 commits into main from ptm on Mar 31, 2026
Conversation

@ypriverol ypriverol (Member) commented Mar 31, 2026

Summary by CodeRabbit

  • New Features

    • Download CPTAC phosphoproteomics datasets via new cptac-phospho CLI command with cancer type selection and pan-cancer merging support
    • Download PeptideAtlas human phosphoproteome datasets via new peptideatlas-phospho CLI command
    • Build CPTAC phosphoproteomics matrix tables with sample and site metadata integration
    • Extend PTM building pipeline to incorporate PeptideAtlas and CPTAC phosphoproteomics sources
    • Visualize source data overlap analysis for phospho-sites
  • Tests

    • Added comprehensive test coverage for phosphoproteomics dataset parsing and processing

ypriverol and others added 25 commits March 30, 2026 08:47
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…UMNS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a new CLI subcommand `hvantk download peptideatlas-phospho` that
downloads PeptideAtlas human phospho build data, supporting latest or
specific build via --build-date and --build-id options.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e comparison

Three-panel plot: site overlap counts, box plot of observation counts
by source category, and cumulative distribution of PeptideAtlas evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The relative URLs on the PeptideAtlas page resolve under
/builds/human/phospho/, not /builds/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Support text-based notation (S[Phospho]) alongside numeric (S[167])
- Stream large tables instead of loading into memory
- Fix amino acid extraction by returning AA from offset parser
- Handle malformed sequences with unclosed brackets
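A minimal sketch of how such an offset parser might behave, assuming the notation described above (illustrative only; the actual `_extract_phospho_offsets` in hvantk may differ):

```python
def extract_phospho_offsets(peptide):
    """Return (offset, amino_acid) pairs for phospho-modified residues.

    Supports numeric mass notation (S[167], T[181], Y[243]) and text
    notation (S[Phospho]); offsets are 0-based positions in the bare
    peptide sequence.
    """
    offsets = []
    seq = []  # bare residues seen so far
    i = 0
    while i < len(peptide):
        ch = peptide[i]
        if ch == "[":
            close = peptide.find("]", i)
            if close == -1:
                break  # malformed sequence with an unclosed bracket
            mod = peptide[i + 1 : close]
            aa = seq[-1] if seq else ""
            is_phospho = mod == "Phospho" or (
                (aa, mod.split(".")[0]) in {("S", "167"), ("T", "181"), ("Y", "243")}
            )
            if is_phospho and seq:
                # Report the offset of the residue immediately before the bracket
                offsets.append((len(seq) - 1, aa))
            i = close + 1
        else:
            seq.append(ch)
            i += 1
    return offsets
```

Returning the amino acid together with the offset (rather than re-reading it from the sequence afterwards) matches the fix described above, and unclosed brackets simply truncate parsing instead of raising.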

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this filter, shared peptides mapped to all 588K protein entries
inflated site count from ~260K to 9.7M. The protein_identification
table's presence_level_id=1 gives the 11,852 canonical proteins
matching the PeptideAtlas build page.
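The effect of that filter can be illustrated with a small sketch (table and column names follow the description above, but the real PeptideAtlas dump layout may differ):

```python
# Hypothetical miniature of the canonical-protein filter: only mappings
# to proteins with presence_level_id == 1 are kept, so shared peptides
# no longer fan out to every redundant protein entry.
protein_identification = [
    {"protein_id": "P1", "presence_level_id": 1},  # canonical
    {"protein_id": "P2", "presence_level_id": 2},  # non-canonical
    {"protein_id": "P3", "presence_level_id": 1},  # canonical
]
peptide_mappings = [
    {"peptide": "AS[167]DK", "protein_id": "P1"},
    {"peptide": "AS[167]DK", "protein_id": "P2"},  # shared peptide, dropped
    {"peptide": "T[181]PEK", "protein_id": "P3"},
]

canonical = {
    p["protein_id"]
    for p in protein_identification
    if p["presence_level_id"] == 1
}
filtered = [m for m in peptide_mappings if m["protein_id"] in canonical]
```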

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tests

- Fix doubled /phospho in URL construction (from_build method)
- Downloader now produces intermediate TSV, not just raw zip
- Numeric mass detection covers T[181] and Y[243] in addition to S[167]
- Add parametrized tests for _extract_phospho_offsets (16 cases)
- Add test for canonical protein filtering (presence_level_id=1)
- Add test for URL construction correctness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add cptac_tsv field to PTMBuildConfig with validation, CPTAC mapping
and concatenation step in ptm_build_pipeline(), and --cptac-tsv CLI
option to the ptm build command.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…load()

- Rename colon→coad, remove endometrial (matching actual cptac package classes)
- Add overwrite parameter and Dict return to CPTACPhosphoDataset.download()
- Use consistent cptac-phospho-{type} file naming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cptac package has a bug where get_phosphoproteomics() and
get_clinical() fail with "generator has no len()" when multiple
sources exist and no source is specified. Use source='umich' for
phospho and source='mssm' for clinical to work around this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ho output

- Download tumor and normal phospho data separately per cancer type
- Intermediate TSV now includes tissue_type, cancer_type, mean_intensity columns
- Produces separate tumor/normal TSVs plus a combined TSV
- Metadata CSV includes tissue_type per sample (from .N suffix convention)
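A rough sketch of that suffix convention (an assumption based on the description above, not the exact hvantk code):

```python
def infer_tissue_type(sample_id):
    # Assumption from the ".N suffix convention" above: normal-tissue
    # CPTAC samples carry a trailing ".N" in their sample IDs.
    return "normal" if sample_id.endswith(".N") else "tumor"
```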

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- write_matrix_csv now expands multi-sites (T185_Y187 → T185, Y187)
  and averages duplicates from different peptides
- build_cptac_phospho_mt uses regex for robust site ID parsing
- Updated tests to verify multi-site expansion and per-sample values
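The expansion-plus-averaging step can be sketched as follows (a hypothetical miniature; the real `write_matrix_csv` operates on full expression matrices and its identifiers may differ):

```python
from collections import defaultdict

# Two input rows from different peptides; "MAPK1_T185_Y187" is a
# multi-site ID that should expand to MAPK1_T185 and MAPK1_Y187.
rows = [
    {"site_id": "MAPK1_T185_Y187", "sample_1": 2.0},
    {"site_id": "MAPK1_T185", "sample_1": 4.0},  # same site, other peptide
]

def expand_and_average(rows):
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        gene, *sites = row["site_id"].split("_")
        for site in sites:  # "T185_Y187" -> "T185", "Y187"
            key = f"{gene}_{site}"
            sums[key][0] += row["sample_1"]
            sums[key][1] += 1
    # Duplicates originating from different peptides are averaged
    return {k: total / n for k, (total, n) in sums.items()}

matrix = expand_and_average(rows)
```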

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Downloaded datasets (UniProt, PeptideAtlas, CPTAC) are reproducible
from code and should not be committed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai coderabbitai bot (Contributor) commented Mar 31, 2026

📝 Walkthrough

This PR adds support for downloading and integrating CPTAC and PeptideAtlas phosphoproteomics datasets into hvantk. New CLI commands enable dataset downloads to intermediate TSV formats, a matrix builder creates Hail MatrixTables from CPTAC phospho data, and the PTM pipeline is extended to ingest these datasets alongside existing UniProt sources.

Changes

Cohort / File(s) Summary
CPTAC Phospho Dataset Infrastructure
hvantk/datasets/cptac_phospho_datasets.py, hvantk/commands/cptac_phospho_downloader.py, hvantk/tests/test_cptac_phospho.py
Adds module to download CPTAC phosphoproteomics per cancer type, extract phospho sites from intensity matrices, write intermediate TSVs and metadata CSVs, and defines CPTACPhosphoDataset class. CLI command registers cptac-phospho downloader with options for cancer type, all-cancer pan-cancer merge, listing available types, and overwrite behavior. Tests validate phosphosite parsing, metadata handling, and matrix CSV conversion.
PeptideAtlas Phospho Dataset Infrastructure
hvantk/datasets/peptideatlas_phospho_datasets.py, hvantk/commands/peptideatlas_phospho_downloader.py, hvantk/tests/test_peptideatlas_phospho.py
Adds module to stream-parse PeptideAtlas ZIP archives, extract phospho sites with observation count aggregation, and write intermediate TSVs. Includes factory methods for latest/specific build selection. CLI command registers peptideatlas-phospho downloader with build date/ID options. Tests cover phospho offset extraction, site aggregation across peptides, DECOY filtering, and dataset metadata.
PTM Pipeline Integration
hvantk/ptm/constants.py, hvantk/ptm/pipeline.py, hvantk/ptm/plot.py, hvantk/commands/ptm_cli.py
Extends PTMBuildConfig with optional peptideatlas_tsv and cptac_tsv paths, updates ptm_build_pipeline() to map and concatenate sites from multiple sources with aggregated statistics, adds n_observations field to output schema, and introduces plot_source_overlap() for visualizing UniProt/PeptideAtlas coverage. PTM CLI exposes new --peptideatlas-tsv and --cptac-tsv options.
Matrix Table Building
hvantk/tables/matrix_builders.py, hvantk/tables/registry.py, hvantk/tables/table_builders.py, hvantk/commands/make_matrix_cli.py
Adds build_cptac_phospho_mt builder to convert CPTAC phospho expression CSVs and metadata into Hail MatrixTables, with site ID parsing and optional column casting. Registers builder in matrix registry and extends PTM table builder to handle n_observations field. CLI command mkmatrix cptac-phospho accepts expression, metadata, site/sample ID columns, and categorical/numeric column specifications.
CLI Command Integration
hvantk/commands/download_cli.py
Imports and registers cptac-phospho and peptideatlas-phospho subcommands under the download group.
Configuration & Tests
.gitignore, hvantk/tests/test_ptm.py
Updates .gitignore to ignore data/ directory and extends PTM round-trip test to assert evidence_type and n_observations fields.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement, feature, testing

Suggested reviewers

  • enriquea

Poem

🐰 Phospho sites hopping through the data streams so bright,
CPTAC and PeptideAtlas bringing coverage into light,
TSVs merge and matrices bloom in organized delight,
The PTM pipeline learns to paint with every data source's might! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
The title is vague and does not clearly convey the main purpose of the changeset; "stream information" is imprecise and does not adequately summarize the substantial new functionality added.
Resolution: Revise the title to be more specific and descriptive. Consider: "Add PeptideAtlas and CPTAC phosphoproteomics dataset downloaders and matrix builders" or similar to accurately reflect the scope of changes.

✅ Passed checks (2 passed)

Description Check: ✅ Passed
Check skipped - CodeRabbit’s high-level summary is enabled.

Docstring Coverage: ✅ Passed
Docstring coverage is 87.93%, which is sufficient; the required threshold is 80.00%.



@codacy-production

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
hvantk/ptm/constants.py (1)

19-20: Avoid hardcoding values labeled as “latest.”

Line 19 and Line 20 encode a point-in-time build as constants named LATEST_*, which will eventually drift and silently age. Prefer an override mechanism (env/config) or runtime resolution so releases don’t require code edits for dataset freshness.

Suggested adjustment
+import os
+
-PEPTIDEATLAS_LATEST_BUILD_DATE = "202512"
-PEPTIDEATLAS_LATEST_BUILD_ID = "606"
+PEPTIDEATLAS_LATEST_BUILD_DATE = os.getenv("HVANTK_PEPA_BUILD_DATE", "202512")
+PEPTIDEATLAS_LATEST_BUILD_ID = os.getenv("HVANTK_PEPA_BUILD_ID", "606")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/constants.py` around lines 19 - 20, The constants
PEPTIDEATLAS_LATEST_BUILD_DATE and PEPTIDEATLAS_LATEST_BUILD_ID are hardcoded as
"LATEST" values; change them to be resolved at runtime via configuration or
environment variables so they don't silently age. Replace the fixed constants
with a runtime accessor (e.g., get_peptideatlas_latest_build() or module-level
resolution) that checks environment variables (e.g.,
PEPTIDEATLAS_LATEST_BUILD_DATE, PEPTIDEATLAS_LATEST_BUILD_ID) and falls back to
the existing literals only as defaults; update any callers that import these
names to use the accessor or the module attributes that are set on import.
Ensure parsing/validation of the env values and add tests or comments
documenting the override mechanism.
hvantk/commands/make_matrix_cli.py (1)

245-301: Expose site_id_col in the CLI to match builder capability.

build_cptac_phospho_mt supports site_id_col, but this command hardcodes the default by not exposing it. Adding an option improves compatibility with alternate matrix schemas.

Suggested adjustment
 @click.option("-o", "--output-mt", "output_mt", required=True, type=click.Path())
+@click.option("--site-id-col", "site_id_col", default="SiteID", show_default=True)
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_mt,
+    site_id_col,
     sample_id_col,
     categorical_cols,
     numeric_cols,
     overwrite,
 ):
@@
     mt = _build_cptac_phospho_mt(
         expression_path=expression,
         metadata_path=metadata,
         output_mt=output_mt,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         categorical_cols=_split(categorical_cols),
         numeric_cols=_split(numeric_cols),
         overwrite=overwrite,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/make_matrix_cli.py` around lines 245 - 301, The CLI command
mkmatrix_cptac_phospho currently doesn't expose the site_id_col parameter even
though the builder _build_cptac_phospho_mt supports it; add a new click.option
(e.g., "--site-id-col" with a sensible default like "SiteID") to the
mkmatrix_cptac_phospho signature and forward that variable into the call to
_build_cptac_phospho_mt via site_id_col=site_id_col so alternate matrix schemas
can specify the site identifier column.
hvantk/ptm/pipeline.py (1)

323-387: Consider extracting duplicated TSV concatenation logic into a helper function.

The concatenation logic for combining mapped TSVs (lines 329-349 for PeptideAtlas and lines 362-382 for CPTAC) is nearly identical. This could be refactored into a helper function to reduce duplication and improve maintainability.

♻️ Proposed refactor to extract helper function
+def _concatenate_mapped_tsvs(
+    source_paths: List[str],
+    output_path: str,
+) -> None:
+    """Concatenate multiple mapped TSV files into a single combined file."""
+    with open(output_path, "w", newline="") as fout:
+        writer = csv.DictWriter(
+            fout, fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t", lineterminator="\n"
+        )
+        writer.writeheader()
+        for src_path in source_paths:
+            with open(src_path) as fin:
+                reader = csv.DictReader(fin, delimiter="\t")
+                for row in reader:
+                    writer.writerow(row)

Then replace the duplicated blocks with calls to this helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/pipeline.py` around lines 323 - 387, The duplicated TSV
concatenation blocks should be moved into a single helper (e.g., def
concatenate_mapped_tsvs(src_paths: List[str], out_path: str,
fieldnames=PTM_OUTPUT_COLUMNS)) and both use sites: call it to concatenate
[mapped_path, pa_mapped_path] and [mapped_path, cptac_mapped_path]; update the
callers in the mapping flow (around map_ptm_sites results for pa_result and
cptac_result) to set mapped_path = returned out_path and keep the existing
updates to result (n_total, n_mapped, n_failed, resolution_counts) unchanged;
ensure the helper opens out_path with newline="", writes the header using
csv.DictWriter(fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t",
lineterminator="\n"), iterates each src_path with csv.DictReader(delimiter="\t")
and copies rows, and raise or log a clear error if any src_path is missing so
behavior matches original expectations.
hvantk/datasets/peptideatlas_phospho_datasets.py (1)

63-64: Unused constant — consider removing.

_BRACKET_MOD_PATTERN is defined but never used. The _extract_phospho_offsets function manually parses brackets instead of using this regex.

🧹 Suggested removal
-# Pattern to match bracket modifications like [167] or [167.0]
-_BRACKET_MOD_PATTERN = re.compile(r"\[([0-9.]+)\]")

Also remove import re from line 20 if no other uses exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 63 - 64, The
constant _BRACKET_MOD_PATTERN is unused because _extract_phospho_offsets
performs manual bracket parsing; remove the unused _BRACKET_MOD_PATTERN
definition and delete the unused import re (if there are no other usages of re
in the file), and keep _extract_phospho_offsets as-is or refactor it to use
_BRACKET_MOD_PATTERN if you prefer regex — but do not leave the orphaned regex
constant/import in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hvantk/commands/peptideatlas_phospho_downloader.py`:
- Around line 51-54: The current logic silently ignores partial build parameters
(build_date or build_id) and calls PeptideAtlasPhosphoDataset.from_latest();
update the code that branches on build_date and build_id to detect the partial
case and either raise a clear error or log/warn and exit, ensuring users must
provide both parameters together; specifically modify the conditional around
build_date/build_id to: if both provided call
PeptideAtlasPhosphoDataset.from_build(build_date, build_id), if neither call
PeptideAtlasPhosphoDataset.from_latest(), and if exactly one is provided emit a
user-facing warning or raise a ValueError indicating both --build-date and
--build-id are required together.

In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 313-322: The try/except around ds.get_phosphoproteomics is too
broad and leaves normal_df possibly undefined; replace the blanket except
Exception with catching the specific error(s) that ds.get_phosphoproteomics can
raise (e.g., KeyError/IOError/ValueError as appropriate) and only handle the "no
data" case there, and ensure normal_df is always defined before it's referenced
later (either initialize normal_df = None before the try, or move subsequent
references such as the normal_df.shape check, extract_phospho_sites(normal_df,
ct, tissue_type="normal") and write_intermediate_tsv(normal_sites, normal_tsv)
inside the try block or guard them with an explicit "if normal_df is not None"
check); keep logger calls (logger.info) but include the specific caught
exception variable when logging expected failures for context.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 396-400: Validate that self.zip_url uses an allowed scheme (e.g.,
parse the URL and ensure scheme == "https") before attempting download, then
perform the download to a temporary file (e.g., zip_path + ".part") and only
rename/move it to zip_path on successful completion; wrap the
urllib.request.urlretrieve call in a try/except that catches network errors
(URLError, HTTPError) and generic Exception, log a descriptive error with
exception details via logger.error, remove any partial temp file on failure, and
either re-raise the exception or return a clear failure value so partial files
are not left behind and invalid schemes are rejected.

In `@hvantk/ptm/plot.py`:
- Line 487: The int(...) cast for n_obs can raise ValueError for blank or
non-numeric TSV values; update the parsing around the n_obs assignment (the line
using row.get("n_observations", "0") that sets n_obs) to safely handle
missing/invalid input by normalizing the fetched value (e.g., str(...).strip()),
then try converting to int inside a try/except ValueError block and fall back to
0 on failure; ensure the code still accepts valid numeric strings and preserves
the n_obs variable for downstream plotting logic.
- Around line 505-514: The current logic only checks for "UniProt" and
"PeptideAtlas" and misclassifies sites that include "CPTAC" (or any other
additional source) as UniProt-only; update the categorization to consider the
full set of sources in info["sources"] (variables: site_sources,
info["sources"], both_sources, uniprot_only, pa_only). Compute flags has_up and
has_pa as before but also inspect the total number of sources (e.g.,
len(info["sources"]) or a has_other flag) and only classify as uniprot_only or
pa_only when that source is present and it is the sole source; keep the
both_sources branch for has_up and has_pa combinations so that UniProt+CPTAC (or
other combos) are not incorrectly labeled UniProt-only.

In `@hvantk/tables/matrix_builders.py`:
- Around line 262-268: _parsе_site_id currently returns (sid, "", 0) for
unparseable inputs which yields empty amino_acid and residue_pos=0; update
_parse_site_id to detect non-matching site IDs and either (a) log a warning via
the module logger including the offending sid and context (use the existing
logger or create one) and return None or raise a specific ValueError, or (b)
return a sentinel (e.g. None) so callers can filter them out; also update any
callers that iterate over _parse_site_id (refer to the function name
_parse_site_id and the regex _site_id_re) to skip None results or catch the
ValueError and drop/log the row instead of inserting empty/zero fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ced376e-1f01-4814-b768-bc8a96d19db0

📥 Commits

Reviewing files that changed from the base of the PR and between e5ef742 and 80bbad4.

📒 Files selected for processing (17)
  • .gitignore
  • hvantk/commands/cptac_phospho_downloader.py
  • hvantk/commands/download_cli.py
  • hvantk/commands/make_matrix_cli.py
  • hvantk/commands/peptideatlas_phospho_downloader.py
  • hvantk/commands/ptm_cli.py
  • hvantk/datasets/cptac_phospho_datasets.py
  • hvantk/datasets/peptideatlas_phospho_datasets.py
  • hvantk/ptm/constants.py
  • hvantk/ptm/pipeline.py
  • hvantk/ptm/plot.py
  • hvantk/tables/matrix_builders.py
  • hvantk/tables/registry.py
  • hvantk/tables/table_builders.py
  • hvantk/tests/test_cptac_phospho.py
  • hvantk/tests/test_peptideatlas_phospho.py
  • hvantk/tests/test_ptm.py

Comment on lines +51 to +54
if build_date and build_id:
    dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
else:
    dataset = PeptideAtlasPhosphoDataset.from_latest()
Contributor

⚠️ Potential issue | 🟡 Minor

Partial build parameters silently ignored.

If a user provides only --build-date or only --build-id (but not both), the code silently falls back to from_latest(). This could be confusing. Consider warning the user or requiring both when either is provided.

🛡️ Proposed fix to warn on partial parameters
         if build_date and build_id:
             dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
+        elif build_date or build_id:
+            click.echo(
+                "Warning: Both --build-date and --build-id are required for a specific build. "
+                "Using latest build instead.",
+                err=True,
+            )
+            dataset = PeptideAtlasPhosphoDataset.from_latest()
         else:
             dataset = PeptideAtlasPhosphoDataset.from_latest()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/commands/peptideatlas_phospho_downloader.py` around lines 51 - 54, The
current logic silently ignores partial build parameters (build_date or build_id)
and calls PeptideAtlasPhosphoDataset.from_latest(); update the code that
branches on build_date and build_id to detect the partial case and either raise
a clear error or log/warn and exit, ensuring users must provide both parameters
together; specifically modify the conditional around build_date/build_id to: if
both provided call PeptideAtlasPhosphoDataset.from_build(build_date, build_id),
if neither call PeptideAtlasPhosphoDataset.from_latest(), and if exactly one is
provided emit a user-facing warning or raise a ValueError indicating both
--build-date and --build-id are required together.

Comment on lines +313 to +322
try:
    normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
    if len(normal_df) > 0:
        logger.info("Normal: %d samples x %d columns", *normal_df.shape)
        normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
        write_intermediate_tsv(normal_sites, normal_tsv)
    else:
        logger.info("No normal samples available for %s", ct)
except Exception as e:
    logger.info("No normal tissue data for %s: %s", ct, e)
Contributor

⚠️ Potential issue | 🟡 Minor

Catching blind Exception and potential NameError on normal_df.

  1. Line 321 catches a broad Exception, which could mask unexpected errors. Consider catching a more specific exception type.
  2. Line 357 references normal_df which may not be defined if an exception was raised before line 314.
🛡️ Proposed fix
         # Fetch normal phospho data (may be empty for some cancer types)
         normal_sites = []
+        normal_df = None
         try:
             normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
             if len(normal_df) > 0:
                 logger.info("Normal: %d samples x %d columns", *normal_df.shape)
                 normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
                 write_intermediate_tsv(normal_sites, normal_tsv)
             else:
                 logger.info("No normal samples available for %s", ct)
-        except Exception as e:
+        except (KeyError, ValueError, AttributeError) as e:
             logger.info("No normal tissue data for %s: %s", ct, e)

And update line 357:

-            len(tumor_df), len(normal_df) if normal_sites else 0,
+            len(tumor_df), len(normal_df) if normal_df is not None else 0,
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 321-321: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/cptac_phospho_datasets.py` around lines 313 - 322, The
try/except around ds.get_phosphoproteomics is too broad and leaves normal_df
possibly undefined; replace the blanket except Exception with catching the
specific error(s) that ds.get_phosphoproteomics can raise (e.g.,
KeyError/IOError/ValueError as appropriate) and only handle the "no data" case
there, and ensure normal_df is always defined before it's referenced later
(either initialize normal_df = None before the try, or move subsequent
references such as the normal_df.shape check, extract_phospho_sites(normal_df,
ct, tissue_type="normal") and write_intermediate_tsv(normal_sites, normal_tsv)
inside the try block or guard them with an explicit "if normal_df is not None"
check); keep logger calls (logger.info) but include the specific caught
exception variable when logging expected failures for context.

Comment on lines +396 to +400
# Download zip if needed
if not os.path.exists(zip_path) or overwrite:
    logger.info("Downloading %s -> %s", self.zip_url, zip_path)
    urllib.request.urlretrieve(self.zip_url, zip_path)
    logger.info("Download complete: %s", zip_path)
Contributor

⚠️ Potential issue | 🟡 Minor

Add error handling for download failures and URL scheme validation.

  1. Network failures will raise uncaught exceptions, potentially leaving partial files on disk.
  2. While the base URL is hardcoded to https://, explicit scheme validation provides defense-in-depth (addresses static analysis S310).
🛡️ Suggested improvement
         # Download zip if needed
         if not os.path.exists(zip_path) or overwrite:
             logger.info("Downloading %s -> %s", self.zip_url, zip_path)
-            urllib.request.urlretrieve(self.zip_url, zip_path)
-            logger.info("Download complete: %s", zip_path)
+            # Validate URL scheme
+            if not self.zip_url.startswith(("https://", "http://")):
+                raise ValueError(f"Invalid URL scheme: {self.zip_url}")
+            try:
+                urllib.request.urlretrieve(self.zip_url, zip_path)
+                logger.info("Download complete: %s", zip_path)
+            except Exception as e:
+                # Clean up partial download
+                if os.path.exists(zip_path):
+                    os.remove(zip_path)
+                raise RuntimeError(f"Failed to download {self.zip_url}: {e}") from e
         else:
             logger.info("Using cached zip: %s", zip_path)
🧰 Tools
🪛 Ruff (0.15.7)

[error] 399-399: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 396 - 400,
validate that `self.zip_url` uses an allowed scheme (e.g., parse the URL and
ensure `scheme == "https"`) before attempting the download. Download to a
temporary file (e.g., `zip_path + ".part"`) and only rename/move it to
`zip_path` on successful completion. Wrap the `urllib.request.urlretrieve` call
in a try/except that catches network errors (`URLError`, `HTTPError`), log a
descriptive error with exception details via `logger.error`, remove any partial
temp file on failure, and either re-raise the exception or return a clear
failure value, so partial files are not left behind and invalid schemes are
rejected.
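
A standard-library sketch of the pattern the prompt asks for (scheme check, download to a `.part` temp file, rename on success, cleanup on failure); `download_zip` is a hypothetical name, not the method in the PR:

```python
import os
import urllib.parse
import urllib.request
from urllib.error import URLError


def download_zip(zip_url: str, zip_path: str) -> str:
    """Download zip_url to zip_path atomically; never leave a partial zip_path."""
    if urllib.parse.urlparse(zip_url).scheme != "https":
        raise ValueError(f"Disallowed URL scheme: {zip_url}")
    tmp_path = zip_path + ".part"  # partial data only ever lands here
    try:
        urllib.request.urlretrieve(zip_url, tmp_path)
        os.replace(tmp_path, zip_path)  # atomic rename on POSIX
    except (URLError, OSError):
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the partial download
        raise
    return zip_path
```

Because the rename happens only after `urlretrieve` returns, a cached `zip_path` is always a complete file.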

Comment thread hvantk/ptm/plot.py
for row in reader:
    key = (row["uniprot_id"], row["residue_pos"])
    site_sources[key]["sources"].add(row["source_db"])
    n_obs = int(row.get("n_observations", "0"))

⚠️ Potential issue | 🟠 Major

n_observations parsing can raise and abort plotting.

Line 487 does a raw int(...) cast; blank/non-numeric values from external TSVs will raise ValueError and fail the plot path.

Proposed fix
-            n_obs = int(row.get("n_observations", "0"))
+            raw_n_obs = (row.get("n_observations") or "0").strip()
+            try:
+                n_obs = int(raw_n_obs)
+            except ValueError:
+                n_obs = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` at line 487, the `int(...)` cast for `n_obs` can raise
`ValueError` for blank or non-numeric TSV values. Update the parsing around the
`n_obs` assignment (the line using `row.get("n_observations", "0")`) to handle
missing or invalid input safely: normalize the fetched value (e.g.,
`str(...).strip()`), then convert to int inside a try/except `ValueError` block,
falling back to 0 on failure. Ensure the code still accepts valid numeric
strings and preserves the `n_obs` variable for downstream plotting logic.

Comment thread hvantk/ptm/plot.py
Comment on lines +505 to +514
for _key, info in site_sources.items():
    has_up = "UniProt" in info["sources"]
    has_pa = "PeptideAtlas" in info["sources"]
    if has_up and has_pa:
        both_sources.append(info["n_obs"])
    elif has_up:
        uniprot_only.append(info["n_obs"])
    elif has_pa:
        pa_only.append(info["n_obs"])


⚠️ Potential issue | 🟠 Major

Source-overlap categories are incorrect when CPTAC rows are present.

Line 506–Line 513 only test UniProt/PeptideAtlas. Sites with CPTAC get miscounted (e.g., UniProt+CPTAC is currently labeled as UniProt-only), which biases panel-1 counts and downstream distributions.

Proposed fix
-    for _key, info in site_sources.items():
-        has_up = "UniProt" in info["sources"]
-        has_pa = "PeptideAtlas" in info["sources"]
-        if has_up and has_pa:
-            both_sources.append(info["n_obs"])
-        elif has_up:
-            uniprot_only.append(info["n_obs"])
-        elif has_pa:
-            pa_only.append(info["n_obs"])
+    excluded_other_sources = 0
+    for _key, info in site_sources.items():
+        sources = info["sources"]
+        has_up = "UniProt" in sources
+        has_pa = "PeptideAtlas" in sources
+        has_cptac = "CPTAC" in sources
+
+        if has_up and has_pa:
+            both_sources.append(info["n_obs"])
+        elif has_up and not has_cptac:
+            uniprot_only.append(info["n_obs"])
+        elif has_pa and not has_cptac:
+            pa_only.append(info["n_obs"])
+        else:
+            excluded_other_sources += 1
+
+    if excluded_other_sources:
+        logger.info(
+            "Excluded %d sites containing non-target sources from UniProt/PeptideAtlas overlap plot",
+            excluded_other_sources,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/ptm/plot.py` around lines 505 - 514, the current logic only checks
for "UniProt" and "PeptideAtlas" and misclassifies sites that also include
"CPTAC" (or any other additional source) as UniProt-only. Update the
categorization to consider the full set of sources in `info["sources"]`
(variables: `site_sources`, `info["sources"]`, `both_sources`, `uniprot_only`,
`pa_only`). Compute the `has_up` and `has_pa` flags as before, but also inspect
the total number of sources (e.g., `len(info["sources"])` or a `has_other`
flag) and only classify a site as `uniprot_only` or `pa_only` when that source
is the sole source; keep the `both_sources` branch for the `has_up` and
`has_pa` combination so that UniProt+CPTAC (and other combos) are not
incorrectly labeled UniProt-only.
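
The mutually exclusive bucketing can be unit-tested in isolation; this sketch (hypothetical `categorize` helper, not the merged code) shows why UniProt+CPTAC must fall outside the UniProt-only bucket:

```python
def categorize(sources: set) -> str:
    """Bucket a phospho-site by source membership with mutually exclusive categories."""
    has_up = "UniProt" in sources
    has_pa = "PeptideAtlas" in sources
    if has_up and has_pa:
        return "both"
    if sources == {"UniProt"}:  # sole source, not merely present
        return "uniprot_only"
    if sources == {"PeptideAtlas"}:
        return "pa_only"
    return "other"  # e.g. CPTAC-only or UniProt+CPTAC
```

Comparing against the full set (rather than testing membership alone) is what keeps the panel-1 counts unbiased.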

Comment on lines +262 to +268
_site_id_re = re.compile(r"^(.+)_([STY])(\d+)$")

def _parse_site_id(sid):
    m = _site_id_re.match(sid)
    if m:
        return m.group(1), m.group(2), int(m.group(3))
    return sid, "", 0

⚠️ Potential issue | 🟡 Minor

Unparseable site IDs produce empty/zero row fields.

When a site ID doesn't match the expected Gene_[STY](\d+) pattern, _parse_site_id returns (sid, "", 0). This means rows may have empty amino_acid and residue_pos=0, which could cause issues in downstream analysis expecting valid phospho site data.

Consider logging a warning for unparseable site IDs or filtering them out.

🛡️ Proposed fix to log warnings for unparseable site IDs
     def _parse_site_id(sid):
         m = _site_id_re.match(sid)
         if m:
             return m.group(1), m.group(2), int(m.group(3))
+        logger.warning("Unparseable site ID format: %s", sid)
         return sid, "", 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hvantk/tables/matrix_builders.py` around lines 262 - 268, `_parse_site_id`
currently returns `(sid, "", 0)` for unparseable inputs, which yields an empty
`amino_acid` and `residue_pos=0`. Update `_parse_site_id` to detect non-matching
site IDs and either (a) log a warning via the module logger, including the
offending `sid` and context, and return `None` or raise a specific
`ValueError`, or (b) return a sentinel (e.g., `None`) so callers can filter
them out. Also update any callers of `_parse_site_id` (see the regex
`_site_id_re`) to skip `None` results or catch the `ValueError` and drop/log
the row instead of inserting empty/zero fields.
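
A sketch of the sentinel variant suggested above, so callers can drop unparseable rows instead of keeping empty fields; `parse_site_id` returning `None` is an assumption for illustration, not the merged behavior:

```python
import logging
import re

logger = logging.getLogger(__name__)
_site_id_re = re.compile(r"^(.+)_([STY])(\d+)$")


def parse_site_id(sid):
    """Return (gene, residue, position) for IDs like 'AKT1_S473', else None."""
    m = _site_id_re.match(sid)
    if m is None:
        logger.warning("Unparseable site ID format: %s", sid)
        return None  # sentinel: caller should skip this row
    return m.group(1), m.group(2), int(m.group(3))
```

A caller would then filter with `if (parsed := parse_site_id(sid)) is not None:` rather than emitting rows with `residue_pos=0`.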

@enriquea enriquea merged commit d69be00 into main Mar 31, 2026
7 checks passed