Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a new CLI subcommand `hvantk download peptideatlas-phospho` that downloads PeptideAtlas human phospho build data, supporting latest or specific build via --build-date and --build-id options. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
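The subcommand's both-or-neither option contract can be sketched with click (hypothetical command and function names; only the option names come from the commit message):

```python
import click

@click.command("peptideatlas-phospho")
@click.option("--build-date", default=None, help="Build date, e.g. 202512.")
@click.option("--build-id", default=None, help="Numeric build ID, e.g. 606.")
def download_peptideatlas_phospho(build_date, build_id):
    """Download PeptideAtlas human phospho build data (latest by default)."""
    # Require the two build options together: either both or neither.
    if (build_date is None) != (build_id is None):
        raise click.UsageError("--build-date and --build-id must be given together.")
    target = f"build {build_date} (id {build_id})" if build_date else "latest build"
    click.echo(f"Selected {target}")
```

Invoking with only one of the two options fails fast instead of silently falling back to the latest build.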
…e comparison Three-panel plot: site overlap counts, box plot of observation counts by source category, and cumulative distribution of PeptideAtlas evidence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The relative URLs on the PeptideAtlas page resolve under /builds/human/phospho/, not /builds/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
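This is ordinary relative-reference resolution; a small sketch with the standard library's `urllib.parse.urljoin` (the file name is illustrative, not an actual PeptideAtlas link):

```python
from urllib.parse import urljoin

# A relative href resolves against the page's own directory, so links on the
# phospho build page land under /builds/human/phospho/.
page_url = "https://peptideatlas.org/builds/human/phospho/"
print(urljoin(page_url, "build_606.zip"))
# → https://peptideatlas.org/builds/human/phospho/build_606.zip

# Joining against /builds/ instead drops the human/phospho segment:
print(urljoin("https://peptideatlas.org/builds/", "build_606.zip"))
# → https://peptideatlas.org/builds/build_606.zip
```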
- Support text-based notation (S[Phospho]) alongside numeric (S[167]) - Stream large tables instead of loading into memory - Fix amino acid extraction by returning AA from offset parser - Handle malformed sequences with unclosed brackets Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
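A minimal sketch of a parser with those properties; the function name, the accepted tag set, and the return shape are assumptions for illustration, not the actual hvantk helpers:

```python
# Numeric masses for phospho-S/T/Y plus the text label (assumed tag set).
PHOSPHO_TAGS = {"167", "181", "243", "Phospho"}

def extract_phospho_offsets(peptide):
    """Return (0-based offset, amino acid) for phospho-modified residues.

    Offsets count unmodified sequence positions. Handles numeric (S[167],
    including "167.0") and text (S[Phospho]) bracket notation; an unclosed
    bracket ends parsing without raising.
    """
    sites = []
    offset = 0        # position within the plain (bracket-free) sequence
    last_aa = None    # most recently seen residue, returned with its offset
    i = 0
    while i < len(peptide):
        ch = peptide[i]
        if ch == "[":
            end = peptide.find("]", i)
            if end == -1:
                break  # malformed: unclosed bracket, stop gracefully
            tag = peptide[i + 1 : end].split(".")[0]  # "167.0" -> "167"
            if tag in PHOSPHO_TAGS and last_aa is not None:
                sites.append((offset - 1, last_aa))
            i = end + 1
        else:
            last_aa = ch
            offset += 1
            i += 1
    return sites

print(extract_phospho_offsets("AS[167]DT[181]K"))  # → [(1, 'S'), (3, 'T')]
```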
Without this filter, shared peptides mapped to all 588K protein entries inflated site count from ~260K to 9.7M. The protein_identification table's presence_level_id=1 gives the 11,852 canonical proteins matching the PeptideAtlas build page. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
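The effect of the canonical filter can be sketched with toy data (the `presence_level_id == 1` rule comes from the commit; table layout and values here are illustrative):

```python
# Toy peptide-to-protein mappings: one shared peptide maps to many entries,
# which is what inflated the site count before filtering.
mappings = [
    ("PEP1", "P1"),
    ("PEP1", "P1-iso2"),
    ("PEP1", "P1-frag"),
    ("PEP2", "P2"),
]

# protein_identification rows: presence_level_id == 1 marks canonical proteins
# (column names follow the commit's description; values are illustrative).
protein_identification = {
    "P1": 1,
    "P1-iso2": 3,
    "P1-frag": 4,
    "P2": 1,
}

canonical = {pid for pid, level in protein_identification.items() if level == 1}
filtered = [(pep, pid) for pep, pid in mappings if pid in canonical]
print(filtered)  # → [('PEP1', 'P1'), ('PEP2', 'P2')]
```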
…tests - Fix doubled /phospho in URL construction (from_build method) - Downloader now produces intermediate TSV, not just raw zip - Numeric mass detection covers T[181] and Y[243] in addition to S[167] - Add parametrized tests for _extract_phospho_offsets (16 cases) - Add test for canonical protein filtering (presence_level_id=1) - Add test for URL construction correctness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cptac_tsv field to PTMBuildConfig with validation, CPTAC mapping and concatenation step in ptm_build_pipeline(), and --cptac-tsv CLI option to the ptm build command. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…load()
- Rename colon→coad, remove endometrial (matching actual cptac package classes)
- Add overwrite parameter and Dict return to CPTACPhosphoDataset.download()
- Use consistent cptac-phospho-{type} file naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cptac package has a bug where get_phosphoproteomics() and get_clinical() fail with "generator has no len()" when multiple sources exist and no source is specified. Use source='umich' for phospho and source='mssm' for clinical to work around this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ho output - Download tumor and normal phospho data separately per cancer type - Intermediate TSV now includes tissue_type, cancer_type, mean_intensity columns - Produces separate tumor/normal TSVs plus a combined TSV - Metadata CSV includes tissue_type per sample (from .N suffix convention) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
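The `.N` suffix convention for normal-tissue samples can be sketched as a tiny helper (hypothetical function name; sample IDs are illustrative):

```python
def tissue_type_from_sample_id(sample_id: str) -> str:
    """Infer tissue type from a CPTAC-style sample ID.

    A trailing '.N' marks a normal-tissue sample; anything else is
    treated as tumor (assumed convention per the commit message).
    """
    return "normal" if sample_id.endswith(".N") else "tumor"

for sid in ["C3L-00001", "C3L-00001.N"]:
    print(sid, tissue_type_from_sample_id(sid))
```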
- write_matrix_csv now expands multi-sites (T185_Y187 → T185, Y187) and averages duplicates from different peptides - build_cptac_phospho_mt uses regex for robust site ID parsing - Updated tests to verify multi-site expansion and per-sample values Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
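The expansion-and-averaging behavior can be sketched as follows (hypothetical helper, not the actual `write_matrix_csv`; the toy parsing assumes gene names contain no underscores):

```python
from collections import defaultdict

def expand_multi_sites(rows):
    """Expand multi-site IDs (e.g. 'MAPK1_T185_Y187') into one entry per
    single site, averaging values when different peptides report the
    same site."""
    acc = defaultdict(list)
    for site_id, value in rows:
        gene, *sites = site_id.split("_")
        for site in sites:  # each residue token like 'T185' becomes its own site
            acc[f"{gene}_{site}"].append(value)
    return {site: sum(vals) / len(vals) for site, vals in acc.items()}

rows = [
    ("MAPK1_T185_Y187", 2.0),  # multi-site peptide
    ("MAPK1_T185", 4.0),       # second peptide covering T185
]
print(expand_multi_sites(rows))
# → {'MAPK1_T185': 3.0, 'MAPK1_Y187': 2.0}
```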
Downloaded datasets (UniProt, PeptideAtlas, CPTAC) are reproducible from code and should not be committed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

This PR adds support for downloading and integrating CPTAC and PeptideAtlas phosphoproteomics datasets into hvantk. New CLI commands enable dataset downloads to intermediate TSV formats, a matrix builder creates Hail MatrixTables from CPTAC phospho data, and the PTM pipeline is extended to ingest these datasets alongside existing UniProt sources.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (inconclusive)
Actionable comments posted: 6
🧹 Nitpick comments (4)

hvantk/ptm/constants.py (1)

19-20: Avoid hardcoding values labeled as "latest."

Lines 19-20 encode a point-in-time build as constants named `LATEST_*`, which will eventually drift and silently age. Prefer an override mechanism (env/config) or runtime resolution so releases don't require code edits for dataset freshness.

Suggested adjustment:

```diff
+import os
+
-PEPTIDEATLAS_LATEST_BUILD_DATE = "202512"
-PEPTIDEATLAS_LATEST_BUILD_ID = "606"
+PEPTIDEATLAS_LATEST_BUILD_DATE = os.getenv("HVANTK_PEPA_BUILD_DATE", "202512")
+PEPTIDEATLAS_LATEST_BUILD_ID = os.getenv("HVANTK_PEPA_BUILD_ID", "606")
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/constants.py` around lines 19-20: the constants PEPTIDEATLAS_LATEST_BUILD_DATE and PEPTIDEATLAS_LATEST_BUILD_ID are hardcoded as "LATEST" values; change them to be resolved at runtime via configuration or environment variables so they don't silently age. Replace the fixed constants with a runtime accessor (e.g., get_peptideatlas_latest_build() or module-level resolution) that checks environment variables (e.g., PEPTIDEATLAS_LATEST_BUILD_DATE, PEPTIDEATLAS_LATEST_BUILD_ID) and falls back to the existing literals only as defaults; update any callers that import these names to use the accessor or the module attributes that are set on import. Ensure parsing/validation of the env values and add tests or comments documenting the override mechanism.

hvantk/commands/make_matrix_cli.py (1)

245-301: Expose `site_id_col` in the CLI to match builder capability.

`build_cptac_phospho_mt` supports `site_id_col`, but this command hardcodes the default by not exposing it. Adding an option improves compatibility with alternate matrix schemas.

Suggested adjustment:

```diff
 @click.option("-o", "--output-mt", "output_mt", required=True, type=click.Path())
+@click.option("--site-id-col", "site_id_col", default="SiteID", show_default=True)
 @click.option("--sid", "--sample-id-col", "sample_id_col", default="SampleID", show_default=True)
 def mkmatrix_cptac_phospho(
     expression,
     metadata,
     output_mt,
+    site_id_col,
     sample_id_col,
     categorical_cols,
     numeric_cols,
     overwrite,
 ):
@@
     mt = _build_cptac_phospho_mt(
         expression_path=expression,
         metadata_path=metadata,
         output_mt=output_mt,
+        site_id_col=site_id_col,
         sample_id_col=sample_id_col,
         categorical_cols=_split(categorical_cols),
         numeric_cols=_split(numeric_cols),
         overwrite=overwrite,
     )
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/commands/make_matrix_cli.py` around lines 245-301: the CLI command mkmatrix_cptac_phospho currently doesn't expose the site_id_col parameter even though the builder _build_cptac_phospho_mt supports it; add a new click.option (e.g., "--site-id-col" with a sensible default like "SiteID") to the mkmatrix_cptac_phospho signature and forward that variable into the call to _build_cptac_phospho_mt via site_id_col=site_id_col so alternate matrix schemas can specify the site identifier column.

hvantk/ptm/pipeline.py (1)

323-387: Consider extracting duplicated TSV concatenation logic into a helper function.

The concatenation logic for combining mapped TSVs (lines 329-349 for PeptideAtlas and lines 362-382 for CPTAC) is nearly identical. This could be refactored into a helper function to reduce duplication and improve maintainability.

♻️ Proposed refactor to extract helper function:

```diff
+def _concatenate_mapped_tsvs(
+    source_paths: List[str],
+    output_path: str,
+) -> None:
+    """Concatenate multiple mapped TSV files into a single combined file."""
+    with open(output_path, "w", newline="") as fout:
+        writer = csv.DictWriter(
+            fout, fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t", lineterminator="\n"
+        )
+        writer.writeheader()
+        for src_path in source_paths:
+            with open(src_path) as fin:
+                reader = csv.DictReader(fin, delimiter="\t")
+                for row in reader:
+                    writer.writerow(row)
```

Then replace the duplicated blocks with calls to this helper.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/ptm/pipeline.py` around lines 323-387: the duplicated TSV concatenation blocks should be moved into a single helper (e.g., def concatenate_mapped_tsvs(src_paths: List[str], out_path: str, fieldnames=PTM_OUTPUT_COLUMNS)) and both use sites: call it to concatenate [mapped_path, pa_mapped_path] and [mapped_path, cptac_mapped_path]; update the callers in the mapping flow (around map_ptm_sites results for pa_result and cptac_result) to set mapped_path = returned out_path and keep the existing updates to result (n_total, n_mapped, n_failed, resolution_counts) unchanged; ensure the helper opens out_path with newline="", writes the header using csv.DictWriter(fieldnames=PTM_OUTPUT_COLUMNS, delimiter="\t", lineterminator="\n"), iterates each src_path with csv.DictReader(delimiter="\t") and copies rows, and raise or log a clear error if any src_path is missing so behavior matches original expectations.

hvantk/datasets/peptideatlas_phospho_datasets.py (1)

63-64: Unused constant; consider removing.

`_BRACKET_MOD_PATTERN` is defined but never used. The `_extract_phospho_offsets` function manually parses brackets instead of using this regex.

🧹 Suggested removal:

```diff
-# Pattern to match bracket modifications like [167] or [167.0]
-_BRACKET_MOD_PATTERN = re.compile(r"\[([0-9.]+)\]")
```

Also remove `import re` from line 20 if no other uses exist.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hvantk/datasets/peptideatlas_phospho_datasets.py` around lines 63-64: the constant _BRACKET_MOD_PATTERN is unused because _extract_phospho_offsets performs manual bracket parsing; remove the unused _BRACKET_MOD_PATTERN definition and delete the unused import re (if there are no other usages of re in the file), and keep _extract_phospho_offsets as-is or refactor it to use _BRACKET_MOD_PATTERN if you prefer regex, but do not leave the orphaned regex constant/import in the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hvantk/commands/peptideatlas_phospho_downloader.py`:
- Around line 51-54: The current logic silently ignores partial build parameters
(build_date or build_id) and calls PeptideAtlasPhosphoDataset.from_latest();
update the code that branches on build_date and build_id to detect the partial
case and either raise a clear error or log/warn and exit, ensuring users must
provide both parameters together; specifically modify the conditional around
build_date/build_id to: if both provided call
PeptideAtlasPhosphoDataset.from_build(build_date, build_id), if neither call
PeptideAtlasPhosphoDataset.from_latest(), and if exactly one is provided emit a
user-facing warning or raise a ValueError indicating both --build-date and
--build-id are required together.
In `@hvantk/datasets/cptac_phospho_datasets.py`:
- Around line 313-322: The try/except around ds.get_phosphoproteomics is too
broad and leaves normal_df possibly undefined; replace the blanket except
Exception with catching the specific error(s) that ds.get_phosphoproteomics can
raise (e.g., KeyError/IOError/ValueError as appropriate) and only handle the "no
data" case there, and ensure normal_df is always defined before it's referenced
later (either initialize normal_df = None before the try, or move subsequent
references such as the normal_df.shape check, extract_phospho_sites(normal_df,
ct, tissue_type="normal") and write_intermediate_tsv(normal_sites, normal_tsv)
inside the try block or guard them with an explicit "if normal_df is not None"
check); keep logger calls (logger.info) but include the specific caught
exception variable when logging expected failures for context.
In `@hvantk/datasets/peptideatlas_phospho_datasets.py`:
- Around line 396-400: Validate that self.zip_url uses an allowed scheme (e.g.,
parse the URL and ensure scheme == "https") before attempting download, then
perform the download to a temporary file (e.g., zip_path + ".part") and only
rename/move it to zip_path on successful completion; wrap the
urllib.request.urlretrieve call in a try/except that catches network errors
(URLError, HTTPError) and generic Exception, log a descriptive error with
exception details via logger.error, remove any partial temp file on failure, and
either re-raise the exception or return a clear failure value so partial files
are not left behind and invalid schemes are rejected.
In `@hvantk/ptm/plot.py`:
- Line 487: The int(...) cast for n_obs can raise ValueError for blank or
non-numeric TSV values; update the parsing around the n_obs assignment (the line
using row.get("n_observations", "0") that sets n_obs) to safely handle
missing/invalid input by normalizing the fetched value (e.g., str(...).strip()),
then try converting to int inside a try/except ValueError block and fall back to
0 on failure; ensure the code still accepts valid numeric strings and preserves
the n_obs variable for downstream plotting logic.
- Around line 505-514: The current logic only checks for "UniProt" and
"PeptideAtlas" and misclassifies sites that include "CPTAC" (or any other
additional source) as UniProt-only; update the categorization to consider the
full set of sources in info["sources"] (variables: site_sources,
info["sources"], both_sources, uniprot_only, pa_only). Compute flags has_up and
has_pa as before but also inspect the total number of sources (e.g.,
len(info["sources"]) or a has_other flag) and only classify as uniprot_only or
pa_only when that source is present and it is the sole source; keep the
both_sources branch for has_up and has_pa combinations so that UniProt+CPTAC (or
other combos) are not incorrectly labeled UniProt-only.
In `@hvantk/tables/matrix_builders.py`:
- Around line 262-268: _parse_site_id currently returns (sid, "", 0) for
unparseable inputs which yields empty amino_acid and residue_pos=0; update
_parse_site_id to detect non-matching site IDs and either (a) log a warning via
the module logger including the offending sid and context (use the existing
logger or create one) and return None or raise a specific ValueError, or (b)
return a sentinel (e.g. None) so callers can filter them out; also update any
callers that iterate over _parse_site_id (refer to the function name
_parse_site_id and the regex _site_id_re) to skip None results or catch the
ValueError and drop/log the row instead of inserting empty/zero fields.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7ced376e-1f01-4814-b768-bc8a96d19db0
📒 Files selected for processing (17)
.gitignore
hvantk/commands/cptac_phospho_downloader.py
hvantk/commands/download_cli.py
hvantk/commands/make_matrix_cli.py
hvantk/commands/peptideatlas_phospho_downloader.py
hvantk/commands/ptm_cli.py
hvantk/datasets/cptac_phospho_datasets.py
hvantk/datasets/peptideatlas_phospho_datasets.py
hvantk/ptm/constants.py
hvantk/ptm/pipeline.py
hvantk/ptm/plot.py
hvantk/tables/matrix_builders.py
hvantk/tables/registry.py
hvantk/tables/table_builders.py
hvantk/tests/test_cptac_phospho.py
hvantk/tests/test_peptideatlas_phospho.py
hvantk/tests/test_ptm.py
```python
if build_date and build_id:
    dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
else:
    dataset = PeptideAtlasPhosphoDataset.from_latest()
```
Partial build parameters silently ignored.
If a user provides only --build-date or only --build-id (but not both), the code silently falls back to from_latest(). This could be confusing. Consider warning the user or requiring both when either is provided.
🛡️ Proposed fix to warn on partial parameters
```diff
     if build_date and build_id:
         dataset = PeptideAtlasPhosphoDataset.from_build(build_date, build_id)
+    elif build_date or build_id:
+        click.echo(
+            "Warning: Both --build-date and --build-id are required for a specific build. "
+            "Using latest build instead.",
+            err=True,
+        )
+        dataset = PeptideAtlasPhosphoDataset.from_latest()
     else:
         dataset = PeptideAtlasPhosphoDataset.from_latest()
```
```python
try:
    normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
    if len(normal_df) > 0:
        logger.info("Normal: %d samples x %d columns", *normal_df.shape)
        normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
        write_intermediate_tsv(normal_sites, normal_tsv)
    else:
        logger.info("No normal samples available for %s", ct)
except Exception as e:
    logger.info("No normal tissue data for %s: %s", ct, e)
```
Catching blind `Exception` and potential `NameError` on `normal_df`.
- Line 321 catches a broad `Exception`, which could mask unexpected errors. Consider catching a more specific exception type.
- Line 357 references `normal_df`, which may not be defined if an exception was raised before line 314.
🛡️ Proposed fix
```diff
         # Fetch normal phospho data (may be empty for some cancer types)
         normal_sites = []
+        normal_df = None
         try:
             normal_df = ds.get_phosphoproteomics(source="umich", tissue_type="normal")
             if len(normal_df) > 0:
                 logger.info("Normal: %d samples x %d columns", *normal_df.shape)
                 normal_sites = extract_phospho_sites(normal_df, ct, tissue_type="normal")
                 write_intermediate_tsv(normal_sites, normal_tsv)
             else:
                 logger.info("No normal samples available for %s", ct)
-        except Exception as e:
+        except (KeyError, ValueError, AttributeError) as e:
             logger.info("No normal tissue data for %s: %s", ct, e)
```

And update line 357:

```diff
-            len(tumor_df), len(normal_df) if normal_sites else 0,
+            len(tumor_df), len(normal_df) if normal_df is not None else 0,
```
+ len(tumor_df), len(normal_df) if normal_df is not None else 0,🧰 Tools
🪛 Ruff (0.15.7)
[warning] 321-321: Do not catch blind exception: Exception
(BLE001)
```python
# Download zip if needed
if not os.path.exists(zip_path) or overwrite:
    logger.info("Downloading %s -> %s", self.zip_url, zip_path)
    urllib.request.urlretrieve(self.zip_url, zip_path)
    logger.info("Download complete: %s", zip_path)
```
Add error handling for download failures and URL scheme validation.
- Network failures will raise uncaught exceptions, potentially leaving partial files on disk.
- While the base URL is hardcoded to `https://`, explicit scheme validation provides defense-in-depth (addresses static analysis S310).
🛡️ Suggested improvement
```diff
         # Download zip if needed
         if not os.path.exists(zip_path) or overwrite:
             logger.info("Downloading %s -> %s", self.zip_url, zip_path)
-            urllib.request.urlretrieve(self.zip_url, zip_path)
-            logger.info("Download complete: %s", zip_path)
+            # Validate URL scheme
+            if not self.zip_url.startswith(("https://", "http://")):
+                raise ValueError(f"Invalid URL scheme: {self.zip_url}")
+            try:
+                urllib.request.urlretrieve(self.zip_url, zip_path)
+                logger.info("Download complete: %s", zip_path)
+            except Exception as e:
+                # Clean up partial download
+                if os.path.exists(zip_path):
+                    os.remove(zip_path)
+                raise RuntimeError(f"Failed to download {self.zip_url}: {e}") from e
         else:
             logger.info("Using cached zip: %s", zip_path)
```
🧰 Tools
🪛 Ruff (0.15.7)
[error] 399-399: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
```python
for row in reader:
    key = (row["uniprot_id"], row["residue_pos"])
    site_sources[key]["sources"].add(row["source_db"])
    n_obs = int(row.get("n_observations", "0"))
```
n_observations parsing can raise and abort plotting.
Line 487 does a raw int(...) cast; blank/non-numeric values from external TSVs will raise ValueError and fail the plot path.
Proposed fix
```diff
-            n_obs = int(row.get("n_observations", "0"))
+            raw_n_obs = (row.get("n_observations") or "0").strip()
+            try:
+                n_obs = int(raw_n_obs)
+            except ValueError:
+                n_obs = 0
```
```python
for _key, info in site_sources.items():
    has_up = "UniProt" in info["sources"]
    has_pa = "PeptideAtlas" in info["sources"]
    if has_up and has_pa:
        both_sources.append(info["n_obs"])
    elif has_up:
        uniprot_only.append(info["n_obs"])
    elif has_pa:
        pa_only.append(info["n_obs"])
```
Source-overlap categories are incorrect when CPTAC rows are present.
Line 506–Line 513 only test UniProt/PeptideAtlas. Sites with CPTAC get miscounted (e.g., UniProt+CPTAC is currently labeled as UniProt-only), which biases panel-1 counts and downstream distributions.
Proposed fix
```diff
-    for _key, info in site_sources.items():
-        has_up = "UniProt" in info["sources"]
-        has_pa = "PeptideAtlas" in info["sources"]
-        if has_up and has_pa:
-            both_sources.append(info["n_obs"])
-        elif has_up:
-            uniprot_only.append(info["n_obs"])
-        elif has_pa:
-            pa_only.append(info["n_obs"])
+    excluded_other_sources = 0
+    for _key, info in site_sources.items():
+        sources = info["sources"]
+        has_up = "UniProt" in sources
+        has_pa = "PeptideAtlas" in sources
+        has_cptac = "CPTAC" in sources
+
+        if has_up and has_pa:
+            both_sources.append(info["n_obs"])
+        elif has_up and not has_cptac:
+            uniprot_only.append(info["n_obs"])
+        elif has_pa and not has_cptac:
+            pa_only.append(info["n_obs"])
+        else:
+            excluded_other_sources += 1
+
+    if excluded_other_sources:
+        logger.info(
+            "Excluded %d sites containing non-target sources from UniProt/PeptideAtlas overlap plot",
+            excluded_other_sources,
+        )
```
```python
_site_id_re = re.compile(r"^(.+)_([STY])(\d+)$")


def _parse_site_id(sid):
    m = _site_id_re.match(sid)
    if m:
        return m.group(1), m.group(2), int(m.group(3))
    return sid, "", 0
```
Unparseable site IDs produce empty/zero row fields.
When a site ID doesn't match the expected `Gene_[STY](\d+)` pattern, `_parse_site_id` returns `(sid, "", 0)`. This means rows may have empty `amino_acid` and `residue_pos=0`, which could cause issues in downstream analysis expecting valid phospho site data.
Consider logging a warning for unparseable site IDs or filtering them out.
🛡️ Proposed fix to log warnings for unparseable site IDs
```diff
 def _parse_site_id(sid):
     m = _site_id_re.match(sid)
     if m:
         return m.group(1), m.group(2), int(m.group(3))
+    logger.warning("Unparseable site ID format: %s", sid)
     return sid, "", 0
```
Summary by CodeRabbit

New Features
- `cptac-phospho` CLI command with cancer type selection and pan-cancer merging support
- `peptideatlas-phospho` CLI command

Tests