
Add dat bypass, build-cluster-db CLI, and incremental improvements#10

Open
ypriverol wants to merge 9 commits into main from
feature/dat-bypass-cluster-db

Conversation


@ypriverol ypriverol commented Apr 13, 2026

Summary

  • Add maracluster_dat.py: parquet-to-dat conversion bypassing MGF entirely (15x smaller output)
  • Add build-cluster-db and parquet2dat CLI commands
  • Expand incremental merge-clusters with BIN strategy support
  • Add QPX metadata module (qpx_metadata.py) and MSNet-to-QPX converter

Details

The dat bypass writes MaRaCluster's internal binary format directly from parquet. Binning replicates BinSpectra::binBinaryTruncated: bin = floor(mz / 1.000508 + 0.32), top-40 peaks, skip peaks >= precursor neutral mass. Size: ~100 bytes/spectrum vs ~1.5 KB in MGF.
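A minimal Python sketch of that binning rule (helper names are hypothetical; the real logic lives in maracluster_dat.py):

```python
import math

def bin_mz(mz: float) -> int:
    # MaRaCluster's BinSpectra::binBinaryTruncated rule:
    # bin = floor(mz / 1.000508 + 0.32)
    return math.floor(mz / 1.000508 + 0.32)

def bin_peaks(mz, intensity, precursor_neutral_mass, top_n=40):
    # Skip peaks at or above the precursor neutral mass, keep the
    # top-N most intense, and bin the surviving m/z values.
    kept = [(m, i) for m, i in zip(mz, intensity) if m < precursor_neutral_mass]
    kept.sort(key=lambda peak: peak[1], reverse=True)
    return sorted(bin_mz(m) for m, _ in kept[:top_n])
```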

The build-cluster-db command creates cluster_metadata.parquet + psm_cluster_membership.parquet from MaRaCluster output + scan_titles mapping, designed for the dat-bypass workflow.
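To illustrate the per-cluster stats this command computes (count, best PEP, purity), here is a hedged pandas sketch with made-up column names — the actual schemas are defined in build_cluster_db.py:

```python
import pandas as pd

def cluster_stats(membership: pd.DataFrame) -> pd.DataFrame:
    # Per-cluster aggregates: spectrum count, best (lowest) PEP, and
    # purity = fraction of PSMs carrying the most frequent peptidoform.
    def purity(peptidoforms: pd.Series) -> float:
        return peptidoforms.value_counts(normalize=True).iloc[0]

    return membership.groupby("cluster_id").agg(
        spectrum_count=("peptidoform", "size"),
        best_pep=("posterior_error_probability", "min"),
        purity=("peptidoform", purity),
    ).reset_index()
```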

Test plan

  • Run python test_e2e_single_dataset.py on PXD014877
  • Verify pyspectrafuse convert-dat produces valid .dat files
  • Test pyspectrafuse build-cluster-db end-to-end
  • Test incremental merge with --method_type bin

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • New CLI commands: build-cluster-db, cluster2parquet, convert-to-qpx, convert-dat, parquet2mgf enhancements, and an incremental group (extract-reps, merge-clusters).
    • MSNet→QPX conversion and automated QPX metadata generation.
    • Parquet→DAT export for MaRaCluster and representative-MGF extraction for clusters.
  • Refactor

    • Broader parquet schema support and metadata lookup via QPX; USI construction from explicit components.
    • Improved consensus strategies with size-adaptive algorithms.
  • Documentation

    • README updated with new workflows and command examples.
  • Tests

    • Added incremental workflow tests (representative extraction and resolution).

ypriverol and others added 5 commits April 11, 2026 12:01
Normalizes both legacy msnet and QPX parquet formats to canonical column
names at load time, extracts global_qvalue from QPX additional_scores,
and adds a standalone cluster2parquet command for the cluster-parquet step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…head

BestSpectrumStrategy now uses groupby.idxmin() instead of
groupby.apply(top_n_rows) for selecting the best spectrum per cluster
(~27x measured speedup). Base class uses sort_values + groupby.head()
for top-N selection in large clusters (~76x measured speedup). Also
replaces np.isin with .isin() and np.vstack with pd.concat for
cleaner DataFrame operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
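The vectorization this commit describes can be sketched as follows (illustrative data, not the project's schema):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": ["a", "a", "b", "b"],
    "pep": [0.3, 0.1, 0.5, 0.2],
})

# Best spectrum per cluster: one vectorized idxmin per group instead of
# a Python-level groupby.apply over every cluster.
best = df.loc[df.groupby("cluster")["pep"].idxmin()]

# Top-N per group without apply: sort once, then take each group's head.
top_n = df.sort_values("pep").groupby("cluster").head(2)
```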
Previously only the first parquet file was processed when a list was
passed. Now iterates over all files, accumulating cluster results across
the full dataset. State (write counts, file indices) persists across
files so MGF path/order mapping remains consistent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements three-phase incremental clustering for adding new projects
without re-clustering the entire dataset:

1. extract-reps: Writes one representative spectrum per cluster from
   cluster_metadata.parquet to MGF (TITLE=rep:{cluster_id} for tracing)
2. resolve-clusters: After MaRaCluster runs on reps + new data, maps
   new cluster IDs back to existing ones via representative membership.
   Handles cluster merges (multiple old clusters → one new).
3. merge-results: Appends new PSMs to membership table (dedup by USI),
   rebuilds cluster metadata with updated stats and consensus spectra.

CLI: pyspectrafuse incremental extract-reps / merge-clusters
Tests cover extraction, index building, resolution, and merge scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
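Phase 2 above hinges on the TITLE=rep:{cluster_id} convention. A simplified sketch of the mapping (hypothetical function; the real resolver also handles merges explicitly):

```python
def resolve_cluster_ids(new_clusters: dict) -> dict:
    """Map new cluster IDs back to old ones via representative titles.

    new_clusters: new_cluster_id -> list of spectrum TITLE strings.
    Representatives carry 'rep:{old_cluster_id}' titles, so a new
    cluster containing one inherits that old ID. Several old IDs in
    one new cluster indicate a merge.
    """
    id_map = {}
    for new_id, titles in new_clusters.items():
        old_ids = [t.split("rep:", 1)[1] for t in titles if t.startswith("rep:")]
        if old_ids:
            id_map[new_id] = old_ids[0]
    return id_map
```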
- Add parquet-to-dat conversion (maracluster_dat.py) bypassing MGF entirely (15x smaller)
- Add build-cluster-db command for dat-bypass workflow
- Add parquet2dat CLI command
- Expand incremental merge-clusters with BIN strategy support
- Update cluster2parquet for QPX metadata compatibility
- Register new CLI commands in pyspectrafuse.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

This PR adds a cluster database pipeline, incremental clustering tools, QPX metadata ingestion and schema normalization, new CLI commands (build-cluster-db, cluster2parquet, convert-to-qpx, convert-dat, incremental subcommands), improved consensus strategies, DAT conversion support, and related tests and packaging/docs updates.

Changes

Cohort / File(s) Summary
Cluster result & combining
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py, pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
Switched DataFrame assembly from vstack→pd.concat; normalize mgf_path keys to basenames when building lookup; CombineCluster2Parquet.inject_cluster_info now accepts sample_info_dict and supports multiple parquet inputs with per-file schema adaptation and USI construction.
Parquet schema & metadata
pyspectrafuse/common/constant.py, pyspectrafuse/common/parquet_utils.py, pyspectrafuse/common/qpx_metadata.py, pyspectrafuse/common/sdrf_utils.py
Added ParquetSchemaAdapter to normalize legacy/QPX schemas and canonical column constants; replaced rglob with recursive glob and explicit exclusions for metadata/outputs; new QPX metadata reader qpx_metadata.py; SDRF util gains skip_instrument and improved file-suffix stripping.
New CLI commands
pyspectrafuse/commands/build_cluster_db.py, pyspectrafuse/commands/cluster2parquet.py, pyspectrafuse/commands/convert_msnet_to_qpx.py, pyspectrafuse/commands/parquet2dat.py, pyspectrafuse/commands/incremental.py
Added commands for building cluster DB, cluster→parquet, MSNet→QPX conversion, parquet→DAT conversion, and incremental workflows (extract-reps, merge-clusters), with associated parsing, streaming, and Parquet writers.
Consensus strategies
pyspectrafuse/consensus_strategy/...
Changed pandas group apply calls to group_keys=False; replaced some group-by-top-N apply with vectorized sort_values+groupby().head()/idxmin(); added size-aware representative selection (_select_pairwise, _select_centroid) and fallback peak-clustering behavior.
DAT conversion & incremental utilities
pyspectrafuse/maracluster_dat.py, pyspectrafuse/incremental/resolve_clusters.py, pyspectrafuse/incremental/merge_results.py, pyspectrafuse/incremental/representative_mgf.py
New MaRaCluster DAT writer with peak binning and scan info; incremental resolution of new→old cluster IDs via representative MGF; append/rebuild membership and cluster metadata writers; representative-MGF extractor.
Existing command updates & parquet→MGF
pyspectrafuse/mgf_convert/parquet2mgf.py, pyspectrafuse/commands/spectrum2msp.py, pyspectrafuse/commands/quantmsio2mgf.py, pyspectrafuse/common/msp_utils.py
Replaced SDRF lookup with get_metadata_dict() (QPX); Parquet2Mgf updated to use ParquetSchemaAdapter and build USI from components; added skip_instrument flag; improved mz_array parsing.
CLI entry, tests, packaging, docs
pyspectrafuse/pyspectrafuse.py, tests/test_incremental.py, pyproject.toml, README.md, Dockerfile, tests/conftest.py
Registered new CLI subcommands; added incremental tests and fixtures; bumped packaging/build backend and version; README and Dockerfile updates to reflect new commands/workflow.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as build-cluster-db CMD
    participant TSV as Cluster TSV Parser
    participant Titles as ScanTitles Parser
    participant Parquet as Parquet Scanner
    participant Schema as ParquetSchemaAdapter
    participant PSM as PSM Membership Builder
    participant Consensus as Consensus Builder
    participant Output as Parquet Writer

    User->>CLI: invoke with cluster_tsv, scan_titles, parquet_dirs
    CLI->>TSV: parse_cluster_tsv()
    TSV-->>CLI: cluster assignments (mgf_basename, scannr, cluster_id)
    CLI->>Titles: parse_scan_titles()
    Titles-->>CLI: scan → run_file, orig_scan, peptidoform, charge
    CLI->>Parquet: scan parquet_dirs for scalar fields
    Parquet->>Schema: available_columns() + adapt()
    Schema-->>Parquet: normalized rows
    Parquet-->>CLI: scalar values (PEP, q-value, mz)
    CLI->>PSM: merge cluster assignments + scan info + scalars
    PSM->>PSM: compute per-cluster stats (count, best PEP/q, purity)
    PSM->>Output: write psm_cluster_membership.parquet
    CLI->>Parquet: stream mz/intensity arrays per cluster
    Parquet->>Consensus: build consensus spectra
    Consensus-->>PSM: consensus arrays
    PSM->>Output: write cluster_metadata.parquet
    Output-->>User: return output paths
sequenceDiagram
    actor User
    participant CLI as merge-clusters CMD
    participant Resolver as resolve_incremental_clusters()
    participant RepMGF as Representative MGF parser
    participant IDMap as ID Mapper
    participant ClusterRes as ClusterResHandler
    participant Combiner as CombineCluster2Parquet
    participant Merger as append_psm_membership()
    participant Rebuilder as rebuild_cluster_metadata()
    participant Output as Parquet Writer

    User->>CLI: invoke with cluster_tsv, rep_mgf, new_parquet_dir
    CLI->>Resolver: resolve_incremental_clusters()
    Resolver->>RepMGF: build_representative_index()
    RepMGF-->>Resolver: mapping rep order → old_cluster_id
    Resolver->>IDMap: correlate new_cluster_id → old_cluster_id
    Resolver-->>CLI: id_map + new_spectra_df
    CLI->>ClusterRes: get_cluster_dict(new_cluster_tsv)
    ClusterRes-->>CLI: cluster_accession → mgf_path lookup
    CLI->>Combiner: inject_cluster_info(new_parquets, sample_info_dict)
    Combiner-->>CLI: enriched_df with cluster_accession
    CLI->>Merger: append_psm_membership(existing_membership, new_df)
    Merger-->>CLI: merged membership parquet
    CLI->>Rebuilder: rebuild_cluster_metadata(merged_membership,...)
    Rebuilder-->>Output: updated cluster_metadata.parquet
    CLI-->>User: summary + output paths

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

Review effort [1-5]: 4

Poem

🐰 nibble, hop, and tap the keys,

I sort the clusters, name with ease,
QPX sprites and Parquet dreams,
Consensus hummed in tiny beams,
Hooray — the database blooms for me!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add dat bypass, build-cluster-db CLI, and incremental improvements' accurately summarizes the main additions in this changeset, directly referencing the three key components added: dat bypass (parquet2dat conversion), build-cluster-db CLI command, and incremental clustering improvements.
Docstring Coverage ✅ Passed Docstring coverage is 85.14% which is sufficient. The required threshold is 80.00%.


- Lead with convert-dat and build-cluster-db as recommended commands
- Add incremental clustering and QPX metadata to feature list
- Update project structure with new modules
- Mark MGF conversion and SDRF handling as legacy
- Update version to 0.0.4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
pyspectrafuse/commands/parquet2dat.py (1)

35-39: Move parquet discovery out of the spectrum2msp command module.

This command now depends on another CLI command for a generic filesystem helper. That makes command imports bidirectional over time and duplicates logic that already belongs in pyspectrafuse.common.parquet_utils. Pull the helper into common and reuse it from both commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/commands/parquet2dat.py` around lines 35 - 39, The parquet
discovery helper find_target_ext_files is being imported from
pyspectrafuse.commands.spectrum2msp causing a cross-command dependency; move or
reuse the generic helper from pyspectrafuse.common.parquet_utils instead: remove
the import of find_target_ext_files from pyspectrafuse.commands.spectrum2msp in
parquet2dat.py and import the equivalent function (or add a new helper) from
pyspectrafuse.common.parquet_utils, then call that helper to produce
parquet_files and delete any duplicated discovery logic so both spectrum2msp and
parquet2dat rely on the single common helper (update exports in parquet_utils if
needed).
pyspectrafuse/commands/spectrum2msp.py (1)

40-53: Return parquet files in a stable order.

glob.glob() does not guarantee a reproducible traversal order. Sorting here keeps downstream behavior deterministic — convert-dat increments file_idx per discovered file, and this command also uses the first entry for output naming.

Proposed fix
-    for path_str in glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True):
+    for path_str in sorted(glob.glob(str(Path(directory) / f'**/*{extensions}'), recursive=True)):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/commands/spectrum2msp.py` around lines 40 - 53, glob.glob() can
return files in nondeterministic order; make the discovery deterministic by
sorting the found paths. Change the loop over glob.glob(...) to iterate over
sorted(glob.glob(...)) or sort the collected list `res` (e.g., call `res.sort()`
before returning) so that the order of `path_str` items and the resulting `res`
list is stable for downstream `file_idx` increments and output naming; update
code around the for-loop that builds `res` (referencing `path_str`, `file`, and
`res`) to ensure the sorted behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyspectrafuse/commands/build_cluster_db.py`:
- Around line 272-275: The groupby().idxmin() call can fail when a cluster's
posterior_error_probability is all NaN; before computing best_idx, coerce
posterior_error_probability to numeric on the enriched DataFrame and replace
NaNs with np.inf (e.g., set enriched['_pep'] =
pd.to_numeric(enriched['posterior_error_probability'],
errors='coerce').fillna(np.inf) or call enriched['_pep'].fillna(np.inf,
inplace=True)) so that enriched.groupby('cluster_id')['_pep'].idxmin() returns
valid positions; then compute best_idx and select best_rows as before (symbols:
enriched, '_pep', posterior_error_probability, idxmin, best_idx, best_rows).

In `@pyspectrafuse/commands/quantmsio2mgf.py`:
- Around line 46-53: The worker pool is being created with
Pool(processes=os.cpu_count()) which ignores the --task_parallel option; change
the Pool creation to use the task_parallel value (i.e.
Pool(processes=task_parallel)) so that Parquet2Mgf.convert_to_mgf_task actually
runs with the requested number of worker processes, and keep the existing
chunksize parameter for batching (or compute a separate chunksize variable) when
calling pool.imap to control task batching rather than concurrency.

In `@pyspectrafuse/commands/spectrum2msp.py`:
- Line 198: The code currently calls ParquetPathHandler(...).get_item_info()
with the first parquet file path (path_parquet_lst[0]), which extracts the
accession from the last path segment and therefore yields the parquet filename
instead of the project basename; change the call to pass the project directory
(e.g., the parent directory of the parquet files or the existing project_dir
variable) into ParquetPathHandler so get_item_info() reads the directory name as
the accession; update the reference at the call site (where basename is set) to
construct the handler with the folder containing the parquet files rather than
the file path itself.

In `@pyspectrafuse/common/parquet_utils.py`:
- Around line 58-68: The loop building parquet_path_lst currently appends any
glob hit (parquet_file) without checking whether it's a file, so partitioned
datasets stored as directories ending in .parquet get returned; update the loop
in this helper to skip directories by calling parquet_file.is_file() (or
equivalently ensure pathlib.Path(parquet_file).is_file()) before applying the
existing suffix/name/parts filters and appending to parquet_path_lst, leaving
all other filtering (checks against _metadata_suffixes, _exclude_names,
_exclude_dirs) unchanged.

In `@pyspectrafuse/incremental/resolve_clusters.py`:
- Around line 99-103: The current filtering uses rep_mgf_basename and
str.contains, which yields false positives; instead compute exact basenames
using pathlib.Path and do an equality match: derive rep_mgf_basename from
Path(rep_mgf_path).name, compare against cluster_df['mgf_path'].map(lambda p:
Path(p).name) (or add a temporary column of basenames) to build rep_mask, then
set rep_rows/new_rows accordingly; ensure pathlib.Path is imported if not
already.

In `@pyspectrafuse/mgf_convert/parquet2mgf.py`:
- Around line 155-163: _get_mgf_path currently always pulls instrument from
sample_info_dict, ignoring the skip_instrument option; update _get_mgf_path to
check the skip_instrument flag (the variable/attribute used by Parquet2Mgf for
that option) and, when True, use a single shared instrument bucket (e.g., a
fixed name like "all_instruments" or an empty instrument segment) instead of
info[0]; modify the function where mgf_group_df['mgf_file_path'] is set so
_get_mgf_path reads skip_instrument and constructs the path accordingly
(reference function name _get_mgf_path, the sample_info_dict lookup, and
Parquet2Mgf.skip_instrument or the local skip_instrument variable).

In `@tests/test_incremental.py`:
- Around line 150-155: The fixture string assignments for the variable
tsv_content use unnecessary f-string prefixes (e.g., f"reps.mgf\t0\tNEW_A\n"
blocks) which trigger Ruff F541; remove the leading f from each quoted literal
(including the second occurrence around lines 193-196) so they become plain
string literals while preserving the content and escape sequences.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b66a7c3-de12-4302-8ff6-4d095e94000f

📥 Commits

Reviewing files that changed from the base of the PR and between 891c6d9 and 3116dcc.

📒 Files selected for processing (27)
  • pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
  • pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
  • pyspectrafuse/commands/build_cluster_db.py
  • pyspectrafuse/commands/cluster2parquet.py
  • pyspectrafuse/commands/convert_msnet_to_qpx.py
  • pyspectrafuse/commands/incremental.py
  • pyspectrafuse/commands/parquet2dat.py
  • pyspectrafuse/commands/quantmsio2mgf.py
  • pyspectrafuse/commands/spectrum2msp.py
  • pyspectrafuse/common/constant.py
  • pyspectrafuse/common/msp_utils.py
  • pyspectrafuse/common/parquet_utils.py
  • pyspectrafuse/common/qpx_metadata.py
  • pyspectrafuse/common/sdrf_utils.py
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
  • pyspectrafuse/consensus_strategy/best_spetrum_strategy.py
  • pyspectrafuse/consensus_strategy/binning_strategy.py
  • pyspectrafuse/consensus_strategy/consensus_strategy_base.py
  • pyspectrafuse/consensus_strategy/most_similar_strategy.py
  • pyspectrafuse/incremental/__init__.py
  • pyspectrafuse/incremental/merge_results.py
  • pyspectrafuse/incremental/representative_mgf.py
  • pyspectrafuse/incremental/resolve_clusters.py
  • pyspectrafuse/maracluster_dat.py
  • pyspectrafuse/mgf_convert/parquet2mgf.py
  • pyspectrafuse/pyspectrafuse.py
  • tests/test_incremental.py
👮 Files not reviewed due to content moderation or server errors (10)
  • pyspectrafuse/common/msp_utils.py
  • pyspectrafuse/common/constant.py
  • pyspectrafuse/common/qpx_metadata.py
  • pyspectrafuse/consensus_strategy/binning_strategy.py
  • pyspectrafuse/pyspectrafuse.py
  • pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
  • pyspectrafuse/common/sdrf_utils.py
  • pyspectrafuse/commands/convert_msnet_to_qpx.py
  • pyspectrafuse/commands/incremental.py

Comment on lines +272 to +275
enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()
best_rows = enriched.loc[best_idx, ['cluster_id', 'run_file_name', 'orig_scan',
'peptidoform', 'precursor_mz']].copy()

⚠️ Potential issue | 🟠 Major

Guard idxmin() against all-NaN PEP groups.

posterior_error_probability is populated through a left join, so a cluster can end up with _pep entirely missing. In that case groupby().idxmin() produces invalid indexers and enriched.loc[best_idx, ...] will fail. Filling _pep with np.inf before the groupby is enough here.

Proposed fix
-    enriched['_pep'] = pd.to_numeric(enriched['posterior_error_probability'], errors='coerce')
+    enriched['_pep'] = pd.to_numeric(
+        enriched['posterior_error_probability'], errors='coerce'
+    ).fillna(np.inf)
     best_idx = enriched.groupby('cluster_id')['_pep'].idxmin()

Comment on lines +46 to 53
pool = Pool(processes=os.cpu_count())

# Pack all required parameters into a list of tuples
tasks = [(parquet_file_path, sdrf_file_path, res_file_path, batch_size, spectra_capacity) for parquet_file_path
in
parquet_file_path_lst]
tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
for parquet_file_path in parquet_file_path_lst]

# Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):

⚠️ Potential issue | 🟠 Major



task_parallel is not limiting worker concurrency.

Line 46 creates os.cpu_count() worker processes, while line 53 passes task_parallel as chunksize, which only controls task batching, not the number of workers. So --task_parallel 1 still spawns the full CPU count in parallel. Wire the option into Pool(processes=...) to actually limit worker count, and use chunksize separately for task batching.

Proposed fix
-    pool = Pool(processes=os.cpu_count())
-
-    # Pack all required parameters into a list of tuples
-    tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
-             for parquet_file_path in parquet_file_path_lst]
-
-    # Use pool.imap to execute tasks in parallel with specified maximum parallel tasks
-    for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=task_parallel):
-        pass
-
-    # Close the process pool and wait for all processes to complete
-    pool.close()
-    pool.join()
+    worker_count = min(task_parallel, os.cpu_count() or 1)
+
+    # Pack all required parameters into a list of tuples
+    tasks = [(parquet_file_path, sample_info_dict, res_file_path, batch_size, spectra_capacity, skip_instrument)
+             for parquet_file_path in parquet_file_path_lst]
+
+    with Pool(processes=worker_count) as pool:
+        for _ in pool.imap(Parquet2Mgf().convert_to_mgf_task, tasks, chunksize=1):
+            pass

output_dir = Path(parquet_dir) / 'msp' / species / instrument / charge
output_dir.mkdir(parents=True, exist_ok=True)
basename = ParquetPathHandler(parquet_dir).get_item_info()
basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()

⚠️ Potential issue | 🟠 Major

Use the project directory for get_item_info(), not the first parquet filename.

Line 198 passes a parquet file path into get_item_info(), but that helper extracts the accession from the last path segment. For typical inputs the last segment is just something.parquet, so this now raises ValueError or derives the wrong basename.

Proposed fix
-    basename = ParquetPathHandler(path_parquet_lst[0]).get_item_info()
+    basename = ParquetPathHandler(parquet_dir).get_item_info()

Comment on lines +58 to 68
# Use glob.glob (follows symlinks) instead of Path.rglob (does not)
for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
parquet_file = Path(path_str)
name = str(parquet_file)
if any(name.endswith(s) for s in _metadata_suffixes):
continue
if any(ex in parquet_file.stem for ex in _exclude_names):
continue
if any(part in _exclude_dirs for part in parquet_file.parts):
continue
parquet_path_lst.append(parquet_file)

⚠️ Potential issue | 🟠 Major

Filter out .parquet directories before returning results.

Line 60 turns every glob hit into a candidate path, but this helper never checks is_file(). Partitioned parquet datasets stored as directories named *.parquet will now be returned here and later consumers will treat them as files.

Proposed fix
         # Use glob.glob (follows symlinks) instead of Path.rglob (does not)
         for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
             parquet_file = Path(path_str)
+            if not parquet_file.is_file():
+                continue
             name = str(parquet_file)
             if any(name.endswith(s) for s in _metadata_suffixes):
                 continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Use glob.glob (follows symlinks) instead of Path.rglob (does not)
for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
parquet_file = Path(path_str)
name = str(parquet_file)
if any(name.endswith(s) for s in _metadata_suffixes):
continue
if any(ex in parquet_file.stem for ex in _exclude_names):
continue
if any(part in _exclude_dirs for part in parquet_file.parts):
continue
parquet_path_lst.append(parquet_file)
# Use glob.glob (follows symlinks) instead of Path.rglob (does not)
for path_str in glob.glob(str(Path(dir_path) / '**/*.parquet'), recursive=True):
parquet_file = Path(path_str)
if not parquet_file.is_file():
continue
name = str(parquet_file)
if any(name.endswith(s) for s in _metadata_suffixes):
continue
if any(ex in parquet_file.stem for ex in _exclude_names):
continue
if any(part in _exclude_dirs for part in parquet_file.parts):
continue
parquet_path_lst.append(parquet_file)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/common/parquet_utils.py` around lines 58 - 68, The loop
building parquet_path_lst currently appends any glob hit (parquet_file) without
checking whether it's a file, so partitioned datasets stored as directories
ending in .parquet get returned; update the loop in this helper to skip
directories by calling parquet_file.is_file() (or equivalently ensure
pathlib.Path(parquet_file).is_file()) before applying the existing
suffix/name/parts filters and appending to parquet_path_lst, leaving all other
filtering (checks against _metadata_suffixes, _exclude_names, _exclude_dirs)
unchanged.
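As a standalone illustration of the pitfall flagged above (the temporary file names here are invented for the demo): `glob` patterns match directories as well as regular files, so a partitioned dataset stored as a directory named `*.parquet` is returned unless the caller filters with `is_file()`.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "real.parquet").write_bytes(b"")  # ordinary parquet file
    (root / "partitioned.parquet").mkdir()    # partitioned dataset directory

    # Same pattern style as the helper under review: recursive glob.
    hits = [Path(p) for p in glob.glob(str(root / "**/*.parquet"), recursive=True)]
    files_only = [p for p in hits if p.is_file()]

assert len(hits) == 2        # glob returns both the file and the directory
assert len(files_only) == 1  # is_file() filters out the directory
```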

Comment on lines +99 to +103
rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path

rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
rep_rows = cluster_df[rep_mask].copy()
new_rows = cluster_df[~rep_mask].copy()

⚠️ Potential issue | 🟠 Major

Use exact basename matching for representative rows.

str.contains(rep_mgf_basename, regex=False) will also match unrelated paths like foo_reps.mgf.bak, so those spectra get filtered out of new_rows and can inherit the wrong resolved_cluster_id. Using Path(...).name == rep_mgf_basename avoids both the false positives and the current /-only path split.

Proposed fix
+from pathlib import Path
+
-    rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path
-
-    rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
+    rep_mgf_basename = Path(rep_mgf_path).name
+    rep_mask = cluster_df['mgf_path'].map(lambda p: Path(str(p)).name == rep_mgf_basename)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
rep_mgf_basename = rep_mgf_path.rsplit('/', 1)[-1] if '/' in rep_mgf_path else rep_mgf_path
rep_mask = cluster_df['mgf_path'].str.contains(rep_mgf_basename, regex=False)
rep_rows = cluster_df[rep_mask].copy()
new_rows = cluster_df[~rep_mask].copy()
rep_mgf_basename = Path(rep_mgf_path).name
rep_mask = cluster_df['mgf_path'].map(lambda p: Path(str(p)).name == rep_mgf_basename)
rep_rows = cluster_df[rep_mask].copy()
new_rows = cluster_df[~rep_mask].copy()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/incremental/resolve_clusters.py` around lines 99 - 103, The
current filtering uses rep_mgf_basename and str.contains, which yields false
positives; instead compute exact basenames using pathlib.Path and do an equality
match: derive rep_mgf_basename from Path(rep_mgf_path).name, compare against
cluster_df['mgf_path'].map(lambda p: Path(p).name) (or add a temporary column of
basenames) to build rep_mask, then set rep_rows/new_rows accordingly; ensure
pathlib.Path is imported if not already.
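A minimal self-contained sketch of the false positive described above (the paths are invented for the demo): substring matching with `str.contains` also catches files like `foo_reps.mgf.bak`, while exact `Path(...).name` equality matches only the representative file.

```python
from pathlib import Path

import pandas as pd

cluster_df = pd.DataFrame({
    "mgf_path": ["out/foo_reps.mgf", "out/foo_reps.mgf.bak", "out/new_data.mgf"],
})
rep_mgf_path = "out/foo_reps.mgf"
rep_mgf_basename = Path(rep_mgf_path).name

# Substring match: the unrelated .bak file is also flagged as a representative.
contains_mask = cluster_df["mgf_path"].str.contains(rep_mgf_basename, regex=False)
assert int(contains_mask.sum()) == 2

# Exact basename equality: only the true representative file matches.
exact_mask = cluster_df["mgf_path"].map(
    lambda p: Path(str(p)).name == rep_mgf_basename
)
assert int(exact_mask.sum()) == 1
```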

Comment on lines +155 to +163
def _get_mgf_path(row):
fname = Parquet2Mgf.get_filename_from_usi(row)
info = sample_info_dict.get(fname)
if info is None:
logger.warning(f"No metadata for run file: {fname}")
info = ['Unknown', 'Unknown']
return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])

mgf_group_df['mgf_file_path'] = row_group.apply(
lambda row: '/'.join(sample_info_dict.get(Parquet2Mgf.get_filename_from_usi(row)) +
['charge' + str(row["charge"]), 'mgf files']), axis=1)
mgf_group_df['mgf_file_path'] = row_group.apply(_get_mgf_path, axis=1)

⚠️ Potential issue | 🟠 Major

skip_instrument is currently a no-op.

Lines 94-96 advertise a mode that should collapse output under a shared instrument bucket, but _get_mgf_path() always uses the instrument from sample_info_dict. As written, --skip_instrument still writes into per-instrument folders.

Proposed fix
             def _get_mgf_path(row):
                 fname = Parquet2Mgf.get_filename_from_usi(row)
                 info = sample_info_dict.get(fname)
                 if info is None:
                     logger.warning(f"No metadata for run file: {fname}")
                     info = ['Unknown', 'Unknown']
+                if skip_instrument:
+                    info = [info[0], 'all_instruments']
                 return '/'.join(info + ['charge' + str(row["charge"]), 'mgf files'])
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 161-161: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/mgf_convert/parquet2mgf.py` around lines 155 - 163,
_get_mgf_path currently always pulls instrument from sample_info_dict, ignoring
the skip_instrument option; update _get_mgf_path to check the skip_instrument
flag (the variable/attribute used by Parquet2Mgf for that option) and, when
True, use a single shared instrument bucket (e.g., a fixed name like
"all_instruments" or an empty instrument segment) instead of info[0]; modify the
function where mgf_group_df['mgf_file_path'] is set so _get_mgf_path reads
skip_instrument and constructs the path accordingly (reference function name
_get_mgf_path, the sample_info_dict lookup, and Parquet2Mgf.skip_instrument or
the local skip_instrument variable).

Comment thread tests/test_incremental.py
Comment on lines +150 to +155
tsv_content = (
f"reps.mgf\t0\tNEW_A\n"
f"reps.mgf\t1\tNEW_B\n"
f"new_data.mgf\t0\tNEW_A\n"
f"new_data.mgf\t1\tNEW_C\n"
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the test_incremental.py file
find . -name "test_incremental.py" -type f

Repository: bigbio/pyspectrafuse

Length of output: 91


🏁 Script executed:

# Read the specific lines mentioned in the review
wc -l tests/test_incremental.py

Repository: bigbio/pyspectrafuse

Length of output: 93


🏁 Script executed:

# Read lines 150-155 and 193-196 from the test file
sed -n '150,155p; 193,196p' tests/test_incremental.py

Repository: bigbio/pyspectrafuse

Length of output: 357


🏁 Script executed:

# Double-check by searching for any interpolation patterns in these strings
sed -n '150,155p; 193,196p' tests/test_incremental.py | grep -E '\{.*\}' || echo "No placeholders found"

Repository: bigbio/pyspectrafuse

Length of output: 85


Remove the unused f prefixes in these fixture strings.

These f-string literals contain no placeholders, so Ruff flags them with F541 and the test module will fail lint in its current form.

Proposed fix
-            f"reps.mgf\t0\tNEW_A\n"
-            f"reps.mgf\t1\tNEW_B\n"
-            f"new_data.mgf\t0\tNEW_A\n"
-            f"new_data.mgf\t1\tNEW_C\n"
+            "reps.mgf\t0\tNEW_A\n"
+            "reps.mgf\t1\tNEW_B\n"
+            "new_data.mgf\t0\tNEW_A\n"
+            "new_data.mgf\t1\tNEW_C\n"
-            f"reps.mgf\t0\tMERGED\n"
-            f"reps.mgf\t1\tMERGED\n"
+            "reps.mgf\t0\tMERGED\n"
+            "reps.mgf\t1\tMERGED\n"

Also applies to: 193-196

🧰 Tools
🪛 Ruff (0.15.9)

[error] 151-151: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 152-152: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 153-153: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 154-154: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_incremental.py` around lines 150 - 155, The fixture string
assignments for the variable tsv_content use unnecessary f-string prefixes
(e.g., f"reps.mgf\t0\tNEW_A\n" blocks) which trigger Ruff F541; remove the
leading f from each quoted literal (including the second occurrence around lines
193-196) so they become plain string literals while preserving the content and
escape sequences.

ypriverol and others added 3 commits April 13, 2026 20:59
Parquet-to-dat is the standard conversion path, not a bypass of anything.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix test_get_metadata_dict_from_sdrf: column name case mismatch
  (Characteristics[organism] -> characteristics[organism])
- Switch from setuptools to hatchling build backend
- Bump version to 0.0.4, require Python >=3.10
- Update Dockerfile to use uv for faster installs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace per-row iterrows() with columnar array access
- Buffer dat/scaninfo/title writes per batch instead of per spectrum
- Pre-extract columns (charges, mz, runs, etc.) outside inner loop
- Same output, 3-5x faster on large datasets

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Dockerfile`:
- Around line 39-44: The Docker build fails because pyproject.toml declares
readme = "README.md" but README.md is not copied before running RUN uv pip
install --system --no-cache . Update the Dockerfile so that README.md is copied
into the image before the install step (i.e., place a COPY README.md . between
the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines) so the package
installation can read the declared README during build.
- Line 34: Replace the non-reproducible image reference used as a build stage
source in the COPY statement (currently "ghcr.io/astral-sh/uv:latest" in the
COPY --from=... line) with an explicit version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.

In `@pyspectrafuse/maracluster_dat.py`:
- Around line 147-149: The code defines a columns list including 'rt' but still
writes Spectrum records with retentionTime=0.0; update the Spectrum/dict
construction to use the parsed rt value instead of the hard-coded 0.0 (e.g.,
retentionTime=rt or retentionTime=row['rt']), and make the same change at the
other spot flagged (the second Spectrum/dict creation). Ensure the 'rt' value is
cast to float if necessary before assigning to retentionTime so the parquet
persists the actual retention time.
- Around line 196-218: The code converts mz_array and intensity_array to numpy
arrays then calls _bin_spectrum but doesn't verify they have the same length;
when they differ _bin_spectrum may index out-of-bounds. Before computing
charge/prec_mz/etc., add a validation step comparing lengths of mz_arr and
int_arr (and that both lengths > 0) and skip the spectrum (increment skipped) if
they differ or are zero-length; reference the variables mz_arr, int_arr and the
called function _bin_spectrum when adding this guard so only well-formed arrays
reach _bin_spectrum.
- Around line 264-279: The converter is currently grabbing any ".parquet" files
(via find_target_ext_files) and produces .dat outputs for non-spectrum files;
restrict it to spectrum inputs by changing the finder/filter to only match
".psm.parquet" (e.g., call find_target_ext_files(parquet_dir, '.psm.parquet') or
post-filter parquet_files to keep only names ending with '.psm.parquet'), and
keep the FileNotFoundError when the filtered list is empty; reference the
existing parquet_files variable and the find_target_ext_files call so the filter
change is localized and prevents producing .dat for cluster metadata or
membership files.

In `@README.md`:
- Line 162: The fenced code block at the README's project tree is missing a
language identifier (triggering markdownlint MD040); update the opening fence
from ``` to ```text (or another appropriate language like ```bash) so the block
becomes labeled (e.g., change the block around the pyspectrafuse tree to start
with ```text) and leave the closing ``` as-is to satisfy the linter and improve
rendering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9eb047c6-ab59-4739-9146-a66082074b31

📥 Commits

Reviewing files that changed from the base of the PR and between 3116dcc and a2fa10a.

📒 Files selected for processing (5)
  • Dockerfile
  • README.md
  • pyproject.toml
  • pyspectrafuse/maracluster_dat.py
  • tests/conftest.py

Comment thread Dockerfile
&& rm -rf /var/lib/apt/lists/*

## Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What is the recommended secure Docker pattern for ghcr.io/astral-sh/uv (pinning by explicit version and sha256 digest)?

💡 Result:

The recommended secure Docker pattern for ghcr.io/astral-sh/uv is to pin to an explicit version tag, and further pin by SHA256 digest, for reproducibility and security against tag movement. Official uv documentation provides this example in a multi-stage Dockerfile:

FROM python:3.12-slim-trixie
COPY --from=ghcr.io/astral-sh/uv:0.11.6 /uv /uvx /bin/

For maximum security, replace the tag with @sha256:<digest>:

COPY --from=ghcr.io/astral-sh/uv@sha256:2381d6aa60c326b71fd40023f921a0a3b8f91b14d5db6b90402e65a635053709 /uv /uvx /bin/

To obtain the current digest for a specific tag (e.g., :0.4.17), run docker manifest inspect ghcr.io/astral-sh/uv:0.4.17 and look for "digest" under the platform config. Available tags include :latest, :{version} (e.g., 0.11.6), and variants like :debian and :python3.12-trixie. View all at https://ghcr.io/astral-sh/uv. Images are signed with GitHub attestations; verify with gh attestation verify oci://ghcr.io/astral-sh/uv:0.11.6. This prevents image substitution and strengthens supply-chain security.


Pin the uv image by explicit version tag and SHA256 digest.

The ghcr.io/astral-sh/uv:latest tag is non-reproducible and weakens supply-chain control. Use an explicit version tag combined with SHA256 digest pinning to ensure build reproducibility and prevent image substitution attacks.

Proposed change
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:<pinned-version>@sha256:<pinned-digest> /uv /uvx /bin/

Consult the official uv Docker integration guide for available version tags and instructions to retrieve the corresponding digest using docker manifest inspect.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` at line 34, Replace the non-reproducible image reference used as
a build stage source in the COPY statement (currently
"ghcr.io/astral-sh/uv:latest" in the COPY --from=... line) with an explicit
version tag and its SHA256 digest (e.g.,
ghcr.io/astral-sh/uv:<version>@sha256:<digest>); obtain the correct tag and
digest using docker manifest inspect (or the uv Docker integration docs) and
update the COPY --from=... source to that pinned reference to ensure
reproducible, verifiable builds.

Comment thread Dockerfile
Comment on lines 39 to +44
## Copy project files
COPY pyproject.toml .
COPY pyspectrafuse/ ./pyspectrafuse/

## Install the package
RUN pip install --no-cache-dir -e .
## Install the package with uv
RUN uv pip install --system --no-cache .

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Checking declared readme in pyproject.toml:"
rg -n '^\s*readme\s*=' pyproject.toml

echo
echo "Checking copied files in Dockerfile:"
rg -n '^\s*COPY\s+' Dockerfile

Repository: bigbio/pyspectrafuse

Length of output: 291


Add missing README.md to Docker build context before install.

When uv pip install . runs from the local directory, it reads pyproject.toml which declares readme = "README.md" at line 9. The build will fail because README.md is not copied into the Docker image before the install command executes. Add the missing file between the pyproject.toml and pyspectrafuse/ copies:

Fix
 ## Copy project files
 COPY pyproject.toml .
+COPY README.md .
 COPY pyspectrafuse/ ./pyspectrafuse/

 ## Install the package with uv
 RUN uv pip install --system --no-cache .
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Copy project files
COPY pyproject.toml .
COPY pyspectrafuse/ ./pyspectrafuse/
## Install the package
RUN pip install --no-cache-dir -e .
## Install the package with uv
RUN uv pip install --system --no-cache .
## Copy project files
COPY pyproject.toml .
COPY README.md .
COPY pyspectrafuse/ ./pyspectrafuse/
## Install the package with uv
RUN uv pip install --system --no-cache .
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 39 - 44, The Docker build fails because
pyproject.toml declares readme = "README.md" but README.md is not copied before
running RUN uv pip install --system --no-cache . Update the Dockerfile so that
README.md is copied into the image before the install step (i.e., place a COPY
README.md . between the COPY pyproject.toml . and COPY pyspectrafuse/ ./ lines)
so the package installation can read the declared README during build.

Comment on lines +147 to +149
columns = ['mz_array', 'intensity_array', 'charge',
'observed_mz', 'calculated_mz', 'rt', 'scan',
'run_file_name', 'peptidoform']

⚠️ Potential issue | 🟠 Major

Persist the parquet RT instead of hard-coding 0.0.

rt is loaded here, but every Spectrum record is still written with retentionTime=0. Any MaRaCluster mode that consumes RT from the .dat payload will see all spectra at time zero.

Proposed fix
             runs = df['run_file_name'].values if 'run_file_name' in df.columns else [''] * len(df)
             peps = df['peptidoform'].values if 'peptidoform' in df.columns else [''] * len(df)
             scans = df['scan'].values if 'scan' in df.columns else [0] * len(df)
+            rts = df['rt'].values if 'rt' in df.columns else np.zeros(len(df), dtype=np.float32)
...
+                rt = rts[i]
+                rt = 0.0 if rt is None or np.isnan(rt) else float(rt)
                 dat_buf.extend(_pack_spectrum(
-                    file_idx, scannr, charge, prec_mz, 0.0, peak_bins))
+                    file_idx, scannr, charge, prec_mz, rt, peak_bins))

Also applies to: 226-227

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/maracluster_dat.py` around lines 147 - 149, The code defines a
columns list including 'rt' but still writes Spectrum records with
retentionTime=0.0; update the Spectrum/dict construction to use the parsed rt
value instead of the hard-coded 0.0 (e.g., retentionTime=rt or
retentionTime=row['rt']), and make the same change at the other spot flagged
(the second Spectrum/dict creation). Ensure the 'rt' value is cast to float if
necessary before assigning to retentionTime so the parquet persists the actual
retention time.

Comment on lines +196 to +218
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)

if len(mz_arr) == 0 or len(int_arr) == 0:
skipped += 1
continue

charge = int(charges[i])
if charge <= 0:
skipped += 1
continue

# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)

# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)

# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)

⚠️ Potential issue | 🟠 Major

Validate fragment array lengths before binning.

If a parquet row has different-length mz_array and intensity_array, _bin_spectrum() can index past the end of mz_array and kill the whole conversion instead of skipping one bad spectrum.

Proposed fix
-                if len(mz_arr) == 0 or len(int_arr) == 0:
+                if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
                     skipped += 1
                     continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)
if len(mz_arr) == 0 or len(int_arr) == 0:
skipped += 1
continue
charge = int(charges[i])
if charge <= 0:
skipped += 1
continue
# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)
# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)
# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
mz_arr = np.asarray(mz_arr, dtype=np.float64)
int_arr = np.asarray(int_arr, dtype=np.float64)
if len(mz_arr) == 0 or len(int_arr) == 0 or len(mz_arr) != len(int_arr):
skipped += 1
continue
charge = int(charges[i])
if charge <= 0:
skipped += 1
continue
# Precursor m/z: prefer observed_mz, fall back to calculated_mz
prec_mz = obs_mz[i] if obs_mz is not None else None
if prec_mz is None or np.isnan(prec_mz):
prec_mz = calc_mz[i] if calc_mz is not None else 0.0
prec_mz = float(prec_mz)
# Neutral mass for peak filtering
prec_mass = prec_mz * charge - PROTON_MASS * (charge - 1)
# Bin the spectrum
peak_bins = _bin_spectrum(mz_arr, int_arr, prec_mass)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/maracluster_dat.py` around lines 196 - 218, The code converts
mz_array and intensity_array to numpy arrays then calls _bin_spectrum but
doesn't verify they have the same length; when they differ _bin_spectrum may
index out-of-bounds. Before computing charge/prec_mz/etc., add a validation step
comparing lengths of mz_arr and int_arr (and that both lengths > 0) and skip the
spectrum (increment skipped) if they differ or are zero-length; reference the
variables mz_arr, int_arr and the called function _bin_spectrum when adding this
guard so only well-formed arrays reach _bin_spectrum.
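For illustration, a standalone sketch of the validation guard described above (the helper name is hypothetical; it mirrors the review's mz_arr/int_arr variables, and the caller would increment its skipped counter whenever this returns False):

```python
import numpy as np

def validate_fragment_arrays(mz_arr, int_arr) -> bool:
    """Return True only when both arrays are non-empty and equal-length."""
    mz_arr = np.asarray(mz_arr, dtype=np.float64)
    int_arr = np.asarray(int_arr, dtype=np.float64)
    return len(mz_arr) > 0 and len(mz_arr) == len(int_arr)

# Well-formed spectrum: safe to pass to binning.
assert validate_fragment_arrays([100.1, 200.2], [10.0, 20.0])
# Mismatched lengths would index out of bounds downstream: skip.
assert not validate_fragment_arrays([100.1, 200.2], [10.0])
# Empty arrays carry no peaks: skip.
assert not validate_fragment_arrays([], [])
```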

Comment on lines +264 to +279
"""Convert all .psm.parquet files in a directory to .dat format.

Args:
parquet_dir: Directory containing .psm.parquet files
output_dir: Directory to write .dat files
file_idx_start: Starting file index
charge_filter: If set, only include spectra with this charge

Returns:
Tuple of (list of .dat file paths, total spectra written)
"""
from pyspectrafuse.commands.spectrum2msp import find_target_ext_files

parquet_files = find_target_ext_files(parquet_dir, '.parquet')
if not parquet_files:
raise FileNotFoundError(f"No parquet files found in {parquet_dir}")

⚠️ Potential issue | 🟠 Major

Restrict dataset conversion to the spectrum parquet inputs.

The docstring says this path converts .psm.parquet files, but find_target_ext_files(parquet_dir, '.parquet') will also pick up outputs like cluster_metadata.parquet and psm_cluster_membership.parquet. Those files do not have the spectrum arrays this converter expects, so this silently generates useless .dat artifacts for non-spectrum inputs.

Proposed fix
-    parquet_files = find_target_ext_files(parquet_dir, '.parquet')
+    parquet_files = find_target_ext_files(parquet_dir, '.psm.parquet')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyspectrafuse/maracluster_dat.py` around lines 264 - 279, The converter is
currently grabbing any ".parquet" files (via find_target_ext_files) and produces
.dat outputs for non-spectrum files; restrict it to spectrum inputs by changing
the finder/filter to only match ".psm.parquet" (e.g., call
find_target_ext_files(parquet_dir, '.psm.parquet') or post-filter parquet_files
to keep only names ending with '.psm.parquet'), and keep the FileNotFoundError
when the filtered list is empty; reference the existing parquet_files variable
and the find_target_ext_files call so the filter change is localized and
prevents producing .dat for cluster metadata or membership files.

Comment thread README.md
## Project Structure

Run the test suite:
```

⚠️ Potential issue | 🟡 Minor

Add a language identifier to the fenced block.

Line 162 uses a fenced code block without a language, which triggers markdownlint MD040.

Proposed fix
-```
+```text
 pyspectrafuse/
 ├── maracluster_dat.py              # Parquet to .dat binary conversion
 ...
 └── mgf_convert/                    # MGF conversion utilities
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 162-162: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 162, The fenced code block at the README's project tree is
missing a language identifier (triggering markdownlint MD040); update the
opening fence from ``` to ```text (or another appropriate language like ```bash)
so the block becomes labeled (e.g., change the block around the pyspectrafuse
tree to start with ```text) and leave the closing ``` as-is to satisfy the
linter and improve rendering.
