major performance improvements #4

Merged: ypriverol merged 2 commits into main from dev on Jan 6, 2026

Conversation


@ypriverol ypriverol commented Jan 6, 2026

PR Type

Enhancement


Description

  • Replace pandas apply() with vectorized operations for 10-50x speedup

    • String concatenation using + operator instead of lambda functions
    • Use map() for dictionary lookups instead of apply()
    • Use .str accessor for string operations on Series
  • Optimize numerical computations in binning and similarity calculations

    • Pre-compute constants to avoid repeated calculations in loops
    • Cache spectrum list to avoid repeated dictionary lookups
    • Optimize bin array indexing with pre-computed valid indices
  • Remove duplicate code blocks and improve code clarity

    • Eliminate redundant function implementations in consensus strategies
    • Simplify file path construction with pre-computed basename
  • Add logging for MSP file generation completion
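To make the first pattern above concrete, here is a minimal, hypothetical sketch (the column names and lookup dict are illustrative, not the project's actual code) of replacing a row-wise apply() with map() plus vectorized string concatenation:

```python
import pandas as pd

df = pd.DataFrame({
    "filename": ["a.mzML", "b.mzML"],
    "charge": [2, 3],
})
lookup = {"a.mzML": "human/orbitrap", "b.mzML": "mouse/orbitrap"}

# Slow: one Python-level lambda call per row
slow = df.apply(lambda r: lookup[r["filename"]] + "/charge" + str(r["charge"]), axis=1)

# Fast: map() for the dict lookup, '+' and astype(str) for string building
fast = df["filename"].map(lookup) + "/charge" + df["charge"].astype(str)

assert slow.equals(fast)
```

Both produce the same Series; the vectorized version avoids per-row Python function calls, which is where the claimed 10-50x speedup comes from on large frames.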


Diagram Walkthrough

flowchart LR
  A["Pandas apply() calls"] -->|Replace with vectorized ops| B["String concatenation + map()"]
  C["Repeated calculations in loops"] -->|Pre-compute constants| D["Cached values"]
  E["Dictionary lookups in loops"] -->|Cache spectrum list| F["Direct list access"]
  G["Duplicate code blocks"] -->|Remove redundancy| H["Single implementation"]
  B --> I["10-50x performance improvement"]
  D --> I
  F --> I

File Walkthrough

Relevant files
Enhancement
cluster_res_handler.py
Vectorize string operations in cluster path construction 

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py

  • Replace two apply() calls with vectorized string concatenation
    operations
  • Use + operator for string operations instead of lambda functions
  • Convert index column to string using astype(str) for concatenation
+6/-6     
combine_cluster_and_parquet.py
Optimize dictionary mapping and string operations               

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py

  • Replace apply() with map() for dictionary lookups (10-50x speedup)
  • Vectorize USI parsing using .str.split() accessor
  • Pre-compute basename and file index to avoid repeated string
    operations in loop
  • Simplify mgf_file_path construction with pre-computed values
+13/-8   
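A hedged sketch of the USI parsing described above (the USI strings are invented for illustration): the .str accessor splits every row at once instead of calling a lambda per row.

```python
import pandas as pd

# USI format: mzspec:dataset:filename:scan:sequence/charge
usis = pd.Series([
    "mzspec:PXD000001:run_A:scan:1000",
    "mzspec:PXD000001:run_B:scan:2000",
])

# Vectorized: split every USI and take the third field (the filename)
filenames = usis.str.split(":").str[2]
```

As the compliance checks below point out, this assumes every USI is non-null and has at least three ':'-separated fields; malformed rows silently become NaN.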
spectrum2msp.py
Add completion logging for MSP generation                               

pyspectrafuse/commands/spectrum2msp.py

  • Add logging message after successful MSP file generation
+2/-0     
sdrf_utils.py
Vectorize string and data type operations                               

pyspectrafuse/common/sdrf_utils.py

  • Replace apply() with .str.split().str[0] for vectorized string
    splitting
  • Replace apply(lambda x: list(x), axis=1) with .values.tolist() for
    vectorized conversion
  • Add comments explaining vectorized operation benefits
+4/-2     
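A small illustration of the sdrf_utils change (column names are hypothetical): converting each row to a list via .values.tolist() instead of a row-wise apply().

```python
import pandas as pd

df = pd.DataFrame({"organism": ["human", "mouse"],
                   "instrument": ["Orbitrap", "Orbitrap"]})

# Slow: builds one Python list per row through apply()
slow = df.apply(lambda x: list(x), axis=1)

# Fast: one conversion of the underlying numpy array
fast = df.values.tolist()

assert list(slow) == fast
```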
binning_strategy.py
Optimize bin calculation with pre-computed constants         

pyspectrafuse/consensus_strategy/binning_strategy.py

  • Pre-compute min_mz_div_bin_size constant to avoid repeated division in
    loop
  • Optimize bin calculation formula using pre-computed constant
  • Cache valid bin indices to avoid repeated array indexing operations
  • Remove duplicate code block with old implementation
+9/-73   
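The binning optimization can be sketched as follows (the bin parameters and arrays are illustrative, not the project's values). Hoisting min_mz / bin_size out of the loop and masking to valid indices avoids repeated division and unchecked indexing; note that floor(mz/bin - min_mz/bin) equals floor((mz - min_mz)/bin) up to floating-point rounding.

```python
import numpy as np

min_mz, max_mz, bin_size = 100.0, 2000.0, 0.02
n_bins = int(np.ceil((max_mz - min_mz) / bin_size))
min_mz_div_bin_size = min_mz / bin_size  # pre-computed once, not per spectrum

mz = np.array([100.5, 500.5, 1999.5])
intensity = np.array([10.0, 20.0, 30.0])

bins = np.floor(mz / bin_size - min_mz_div_bin_size).astype(int)
valid = (bins >= 0) & (bins < n_bins)   # bounds mask computed once
valid_indices = bins[valid]             # cached, not re-derived per access

acc = np.zeros(n_bins)
np.add.at(acc, valid_indices, intensity[valid])
```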
most_similar_strategy.py
Optimize similarity matrix computation with caching           

pyspectrafuse/consensus_strategy/most_similar_strategy.py

  • Optimize similarity matrix construction by caching spectrum list
  • Pre-compute spectrum list to avoid repeated dictionary lookups
  • Simplify matrix mirroring logic with explicit condition check
  • Remove duplicate code block with old implementation
+14/-28 
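The most_similar optimization amounts to the following sketch, with cosine similarity standing in for the project's metric. Caching the spectrum list avoids dict lookups inside the O(n²) loop, and mirroring halves the comparisons; note that mirroring is only valid if the metric is symmetric, a point one of the reviews below questions for the project's pointer-based dot product.

```python
import numpy as np

def cosine(a, b):
    # Symmetric stand-in for the project's compare_spectra metric
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cluster_spectra = {"s1": np.array([1.0, 0.0]),
                   "s2": np.array([1.0, 1.0]),
                   "s3": np.array([0.0, 1.0])}
spectra_keys = list(cluster_spectra)
spectra_list = [cluster_spectra[k] for k in spectra_keys]  # cache once

n = len(spectra_list)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):                      # upper triangle only
        s = cosine(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = s
        sim_matrix[j, i] = s                   # mirror: symmetric metrics only

# Pick the spectrum most similar to all the others
best = spectra_keys[int(sim_matrix.sum(axis=0).argmax())]
```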
parquet2mgf.py
Cache basename for repeated file path construction             

pyspectrafuse/mgf_convert/parquet2mgf.py

  • Pre-compute basename from parquet path to avoid repeated string
    operations
  • Simplify mgf_file_path construction using pre-computed basename and
    file_index
  • Apply same optimization pattern in two locations within the method
+7/-4     
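The parquet2mgf change is essentially this (paths invented for illustration): compute the basename once, then reuse it in the per-group f-string.

```python
from pathlib import Path

parquet_path = "/data/PXD000001/run1.sample.parquet"
# Pre-computed once instead of re-parsing the path inside the group loop
basename_parquet = Path(parquet_path).parts[-1].split('.')[0]

output_path, group, file_index = "/out", "human/charge2/mgf files", 3
mgf_file_path = f"{output_path}/{group}/{basename_parquet}_{file_index}.mgf"
```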
Miscellaneous
average_spectrum_strategy.py
Remove duplicate code in consensus aggregation                     

pyspectrafuse/consensus_strategy/average_spectrum_strategy.py

  • Remove duplicate code block that was unreachable after return
    statement
  • Eliminate redundant implementation of consensus spectrum aggregation
+0/-4     

Summary by CodeRabbit

  • Performance Improvements

    • Vectorized string and mapping operations across data pipelines for faster processing (clustering, parquet↔mgf handling, SDRF parsing).
    • Optimized consensus algorithms and similarity computations for more efficient spectrum aggregation.
  • Refactor

    • Internal simplifications reducing redundant path/string work and streamlining aggregation logic.
  • Logging

    • Added a completion log entry after MSP file generation.
  • Chores

    • Added container build assets and automated CI workflow for image build and distribution.


@ypriverol ypriverol requested a review from Copilot January 6, 2026 14:40

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

This PR replaces many per-row apply operations with vectorized string and array operations, precomputes repeated path/basename values, optimizes similarity matrix computation, simplifies consensus aggregation logic, and adds a single log message and a new CI workflow plus Dockerfile.

Changes

Cohort / File(s) Summary
Cluster mgf path vectorization
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
Replaced row-wise apply mgf_path/index construction with vectorized string concatenation and index appending.
Cluster–parquet injection & mapping
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
Swapped apply with map for cluster_accession; vectorized mgf path/filename extraction and introduced precomputed basename_parquet and file_index logic.
Parquet→MGF basename optimization
pyspectrafuse/mgf_convert/parquet2mgf.py
Precomputed basename_parquet and replaced repeated Path parsing with basename + per-group file_index for mgf path construction.
DataFrame vectorization
pyspectrafuse/common/sdrf_utils.py
Replaced apply-based string split with str.split().str[0] and axis=1 apply constructing lists with values.tolist() for organism_instrument.
Average consensus simplification
pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
Removed block that built ms2spectrumDict and performed groupby + aggregated average consensus path; function now omits that aggregation branch.
Binning strategy refactor
pyspectrafuse/consensus_strategy/binning_strategy.py
Precomputed min_mz_div_bin_size, revised bin index math, used validated indices to update arrays, and removed legacy per-cluster loop/quorum and peak-consolidation logic.
Similarity matrix optimization
pyspectrafuse/consensus_strategy/most_similar_strategy.py
Replaced full symmetric O(n²) double-loop with upper-triangular computation and mirrored assignment; prepared spectra as list to avoid repeated dict lookups.
MGF/Parquet helper optimization
pyspectrafuse/mgf_convert/parquet2mgf.py
(Related internal path/name deduplication) Uses precomputed basename and consistent file_index across group operations.
Logging
pyspectrafuse/commands/spectrum2msp.py
Added log statement announcing successful MSP file generation after write completion.
CI workflow
.github/workflows/pyspectrafuse-containers.yml
New GitHub Actions workflow to build/push Docker and Singularity images, extract version from pyproject, and handle tagging/conditional pushes.
Container image
Dockerfile
New Dockerfile for building the pyspectrafuse CLI image (python:3.10-slim base, installs deps, pip install -e ., entrypoint set).

Sequence Diagram(s)

(No sequence diagrams generated — changes are internal optimizations and refactors without new multi-component control flows requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • the latest version #2 — Directly modifies cluster_res_handler.py and combine_cluster_and_parquet.py with overlapping mgf_path and mapping refactors.

Suggested labels

Review effort [1-5]: 4

Poem

🐰 I hopped through strings and trims with cheer,
Vectorized paths now swift and clear.
Bins and matrices trimmed of repeat,
Faster hops on nimble feet.
A tiny carrot for code so dear 🥕

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'major performance improvements' accurately reflects the main objective of the PR, which involves replacing pandas apply() calls with vectorized operations and optimizing numerical computations across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.



qodo-code-review bot commented Jan 6, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Path traversal write

Description: The new mgf_file_path construction uses path components derived from external/parsing
sources (e.g., group from sample_info_dict/USI-derived filename and output_path) to build
filesystem paths and create directories/write files, which can enable path traversal or
arbitrary file write if those inputs contain .., absolute paths, or path separators.
parquet2mgf.py [120-147]

Referred Code
# Pre-compute basename to avoid repeated string operations
basename_parquet = Path(parquet_path).parts[-1].split('.')[0]

for group, group_df in mgf_group_df.groupby('mgf_file_path'):
    base_mgf_path = f"{output_path}/{group}"
    file_index = relation_dict[base_mgf_path] + 1
    mgf_file_path = f"{base_mgf_path}/{basename_parquet}_{file_index}.mgf"
    Path(mgf_file_path).parent.mkdir(parents=True, exist_ok=True)

    if write_count_dict[group] + group_df.shape[0] <= SPECTRA_NUM:
        Parquet2Mgf.write2mgf(mgf_file_path, '\n\n'.join(group_df["spectrum"]))
        write_count_dict[group] += group_df.shape[0]
    else:
        remain_num = SPECTRA_NUM - write_count_dict[group]
        if remain_num > 0:
            group_df_remain = group_df.head(remain_num)
            Parquet2Mgf.write2mgf(mgf_file_path, '\n\n'.join(group_df_remain["spectrum"]))
            write_count_dict[group] += group_df_remain.shape[0]

        # Update the index of the read and mgf files
        relation_dict[base_mgf_path] += 1


 ... (clipped 7 lines)
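One possible mitigation, sketched here as an assumption rather than a fix present in the PR: validate each externally derived path component before joining, and confirm the resolved path stays under the output root (requires Python 3.9+ for is_relative_to).

```python
from pathlib import Path, PurePosixPath

def safe_join(base: str, *parts: str) -> Path:
    """Join untrusted components under base, rejecting traversal attempts."""
    for part in parts:
        pp = PurePosixPath(part)
        # Reject absolute components, embedded separators, '.', and '..'
        if pp.is_absolute() or len(pp.parts) != 1 or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    resolved = Path(base, *parts).resolve()
    if not resolved.is_relative_to(Path(base).resolve()):
        raise ValueError("path escapes the base directory")
    return resolved
```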
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing USI validation: New vectorized USI parsing assumes all USI values are non-null and contain at least three
:-separated fields, which can silently produce null paths or downstream failures without
explicit validation/handling.

Referred Code
# Vectorized string operations - extract filename from USI and map to sample info
# USI format: mzspec:dataset:filename:scan:sequence/charge
filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external data: The new logic derives filesystem paths from externally-sourced USI strings and dictionary
mappings without validating/sanitizing the parsed filename component, which can lead to
unsafe or incorrect path construction.

Referred Code
# Vectorized string operations - extract filename from USI and map to sample info
# USI format: mzspec:dataset:filename:scan:sequence/charge
filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Jan 6, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix TypeError in path construction

Fix a TypeError in path construction by joining the list elements in the
sample_info Series into a single string using .str.join('/') before
concatenation.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py [92-94]

 sample_info = filenames.map(sample_info_dict)
 charges = 'charge' + row_group['charge'].astype(str)
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = (sample_info.str.join('/') + '/' + charges + '/mgf files')


Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a TypeError that would occur when concatenating a pandas Series of lists with strings, which makes the current PR code non-functional. The proposed fix using .str.join('/') is correct and essential.

High
General
Validate mapped sample_info presence

Add validation to ensure all sample_info mappings exist after the map operation,
raising a KeyError if any are missing to prevent downstream errors.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py [92]

 sample_info = filenames.map(sample_info_dict)
+missing = sample_info.isna()
+if missing.any():
+    missing_files = filenames[missing].unique().tolist()
+    raise KeyError(f"Missing sample info for files: {missing_files}")
Suggestion importance[1-10]: 8


Why: This suggestion adds crucial error handling by checking for missing sample_info mappings. This prevents potential NaN values from propagating and causing cryptic downstream errors, making the code more robust.

Medium
Use pd.concat for concatenation

Use pd.concat(charge_df_lst, ignore_index=True) instead of
pd.DataFrame(np.vstack(charge_df_lst)) to concatenate DataFrames more
efficiently and preserve column types.

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py [45-46]

-charge_df = pd.DataFrame(np.vstack(charge_df_lst), 
-                         columns=['mgf_path', 'index', 'cluster_accession'])
+charge_df = pd.concat(charge_df_lst, ignore_index=True)
Suggestion importance[1-10]: 6


Why: The suggestion correctly recommends using pd.concat over np.vstack for concatenating DataFrames, which is more idiomatic, efficient, and safer as it preserves dtypes.

Low
Simplify symmetric matrix construction logic

Simplify the symmetric matrix construction by removing the redundant if i != j
check and assigning the similarity score to both sim_matrix[i, j] and
sim_matrix[j, i] unconditionally.

pyspectrafuse/consensus_strategy/most_similar_strategy.py [99-111]

 # Build similarity matrix - optimized for symmetric matrix
 # Only calculate upper triangle and mirror to lower triangle
 n = len(spectra_keys)
 sim_matrix = np.zeros((n, n))
 spectra_list = [cluster_spectra[key] for key in spectra_keys]
 
 # Calculate upper triangle including diagonal
 for i in range(n):
     for j in range(i, n):
         similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
         sim_matrix[i, j] = similarity
-        if i != j:  # Mirror to lower triangle (skip diagonal)
-            sim_matrix[j, i] = similarity
+        sim_matrix[j, i] = similarity
Suggestion importance[1-10]: 4


Why: The suggestion offers a valid micro-optimization by removing a redundant if i != j check, which simplifies the code and slightly improves performance by eliminating a conditional branch inside the loop.

Low

Comment on lines +93 to +94
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')

Suggestion: Fix TypeError in path construction

Suggested change
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = (sample_info.str.join('/') + '/' + charges + '/mgf files')

Contributor

Copilot AI left a comment

Pull request overview

This PR implements performance optimizations across the codebase by replacing inefficient apply() calls with vectorized pandas operations, pre-computing repeated calculations, and removing duplicate code. The changes aim to significantly speed up data processing, particularly for large datasets.

Key changes:

  • Vectorized pandas operations replacing apply() calls for 10-50x speedup potential
  • Pre-computation of repeated string operations to avoid redundant calculations
  • Removal of duplicate/dead code in consensus strategies

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pyspectrafuse/mgf_convert/parquet2mgf.py Pre-computes basename to avoid repeated Path operations in loop
pyspectrafuse/consensus_strategy/most_similar_strategy.py Optimizes similarity matrix calculation with list comprehension; removes duplicate code
pyspectrafuse/consensus_strategy/binning_strategy.py Pre-computes division factor for bin calculation; optimizes array indexing; removes duplicate code
pyspectrafuse/consensus_strategy/average_spectrum_strategy.py Removes duplicate code block
pyspectrafuse/common/sdrf_utils.py Replaces apply() with vectorized str operations and values.tolist()
pyspectrafuse/commands/spectrum2msp.py Adds success logging message
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py Vectorizes string operations and USI parsing; pre-computes basename
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py Vectorizes string concatenation operations


filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
Copilot AI Jan 6, 2026

The vectorized string concatenation is incorrect. The sample_info variable contains a pandas Series where each element is a list [organism, instrument] (from sample_info_dict). The current code attempts to concatenate these lists with strings using +, which will fail because you cannot directly concatenate a Series of lists with strings.

The original code used '/'.join(sample_info_dict.get(...) + ['charge' + str(row["charge"]), 'mgf files']) which joins list elements with '/'. The vectorized version should first construct the full list per row, then join with '/'. Consider using apply() with a lambda that constructs the path from the list elements, or properly concatenate list elements before joining.

Suggested change
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = [
+    '/'.join(info + [charge, 'mgf files'])
+    for info, charge in zip(sample_info, charges)
+]

Comment on lines 92 to 93
return res, single_spectrum_df

Copilot AI Jan 6, 2026

There is a duplicate return statement. Line 90 returns the result, making line 92 unreachable dead code. Remove the duplicate return statement on line 92.

Suggested change
-return res, single_spectrum_df

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py:
- Around line 99-111: The similarity matrix construction incorrectly mirrors the
upper triangle despite the similarity function being asymmetric (the
pointer-based _dot() makes dot(a,b) != dot(b,a)), so remove the mirroring and
compute the full n×n matrix by calling compare_spectra for every (i,j) pair;
update the loops that build sim_matrix (references: sim_matrix, spectra_list,
compare_spectra, _dot) to iterate j from 0..n-1 and assign sim_matrix[i, j] =
similarity for every pair without setting sim_matrix[j, i].
🧹 Nitpick comments (1)
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py (1)

89-97: Great vectorization! Complete the basename optimization.

The vectorized string operations eliminate per-row overhead and are a significant performance improvement. Pre-computing basename_parquet (line 97) is also excellent.

However, line 130 still uses the old pattern Path(path_parquet).parts[-1].split('.')[0] instead of reusing basename_parquet. This should be updated for consistency and to complete the optimization.

🔎 Proposed fix to use basename_parquet at line 130
-                    mgf_file_path = (f"{base_mgf_path}/{Path(path_parquet).parts[-1].split('.')[0]}_"
+                    mgf_file_path = (f"{base_mgf_path}/{basename_parquet}_"
                                     f"{file_index_dict[base_mgf_path] + 1}.mgf")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2aa8d28 and 027cacd.

📒 Files selected for processing (8)
  • pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
  • pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
  • pyspectrafuse/commands/spectrum2msp.py
  • pyspectrafuse/common/sdrf_utils.py
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
  • pyspectrafuse/consensus_strategy/binning_strategy.py
  • pyspectrafuse/consensus_strategy/most_similar_strategy.py
  • pyspectrafuse/mgf_convert/parquet2mgf.py
💤 Files with no reviewable changes (1)
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: test (macos-latest, 3.10)
  • GitHub Check: test (macos-latest, 3.9)
🔇 Additional comments (12)
pyspectrafuse/commands/spectrum2msp.py (1)

207-208: LGTM! Good user experience improvement.

The completion log provides clear feedback to users about the output location after MSP file generation.

pyspectrafuse/common/sdrf_utils.py (2)

53-54: Excellent vectorization!

Replacing the per-row apply() with vectorized string operations (str.split('.').str[0]) significantly improves performance while maintaining identical semantics.


71-73: Great performance optimization!

Converting from apply() to .values.tolist() provides a significant performance boost. The vectorized approach directly converts the underlying numpy array to a list of lists, eliminating Python-level iteration overhead.

pyspectrafuse/mgf_convert/parquet2mgf.py (1)

120-126: Excellent pre-computation optimization!

Pre-computing basename_parquet once before the groupby loop eliminates redundant path parsing operations. The basename is constant for the input parquet file, making this a safe and effective performance improvement.

Also applies to: 143-144

pyspectrafuse/consensus_strategy/binning_strategy.py (3)

98-100: Smart pre-computation for bin calculation!

Pre-computing min_mz_div_bin_size eliminates repeated division operations in the loop. The refactored bin calculation floor(mz/bin_size - min_mz/bin_size) is mathematically equivalent to floor((mz - min_mz)/bin_size) while being more efficient.


103-110: Optimized bin calculation and indexing.

The two-step validation approach (computing valid_bins mask, then extracting valid_indices) ensures bins are within bounds before updating accumulators. The logic correctly uses valid_bins to slice spectrum data and valid_indices to index the bin arrays.


114-132: Good bounds checking in edge case handling.

The edge case consolidation logic correctly validates indices before accessing valid_indices + 1 to prevent out-of-bounds array access. The threshold-based consolidation of split peaks maintains the binning strategy's correctness.

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py (2)

40-42: Excellent vectorization of path construction!

The two-step vectorized approach (prepending sample info, then appending index) replaces per-row operations with efficient pandas string operations, providing a significant performance boost while maintaining identical semantics.


118-120: Consistent vectorization pattern!

The vectorized path construction follows the same efficient pattern as walk_dir, eliminating Python-level iteration overhead while preserving the original path construction logic.

pyspectrafuse/consensus_strategy/most_similar_strategy.py (1)

114-114: Good documentation improvement.

The updated comment clarifies that sum(axis=0) computes the total similarity for each spectrum, making the selection logic more understandable.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py (2)

33-34: Excellent vectorization!

Replacing apply() with map() for dictionary lookups is a well-known pandas optimization. The map() method is specifically optimized for this use case and can provide 10-50x speedup on large DataFrames by avoiding Python lambda overhead.


101-102: Correct usage of pre-computed basename.

The usage of basename_parquet here correctly applies the optimization by avoiding repeated Path operations within the groupby loop.

Comment on lines +99 to +111
# Build similarity matrix - optimized for symmetric matrix
# Only calculate upper triangle and mirror to lower triangle
n = len(spectra_keys)
sim_matrix = np.zeros((n, n))
spectra_list = [cluster_spectra[key] for key in spectra_keys]

# Calculate upper triangle including diagonal
for i in range(n):
for j in range(i, n):
similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
sim_matrix[i, j] = similarity
if i != j: # Mirror to lower triangle (skip diagonal)
sim_matrix[j, i] = similarity

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find spectrum_utils module and dot function
find . -type f -name "*spectrum_utils*" | head -20

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Search for the dot function definition
rg -n "def dot" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 175


🏁 Script executed:

# Find the compare_spectra method in most_similar_strategy.py
rg -n "def compare_spectra" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Check for any imports of dot or similarity functions in most_similar_strategy.py
rg -n "import.*dot|from.*dot" pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 122


🏁 Script executed:

# Read the dot function implementation in metrics.py
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -60

Repository: bigbio/pyspectrafuse

Length of output: 2205


🏁 Script executed:

# Find the compare_spectra method
rg -n -A15 "def compare_spectra" pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Read more of metrics.py to see the full _dot implementation
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -100

Repository: bigbio/pyspectrafuse

Length of output: 3844


🏁 Script executed:

# Search for compare_spectra in the entire codebase
rg -n "compare_spectra" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 334


🏁 Script executed:

# Read lines around compare_spectra assignment in most_similar_strategy.py
sed -n '30,45p' pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 986


🏁 Script executed:

# Also read the full metrics.py to understand normalization and get complete picture
wc -l pyspectrafuse/consensus_strategy/metrics.py

Repository: bigbio/pyspectrafuse

Length of output: 111


🏁 Script executed:

# Read the full _dot implementation and surrounding code
cat -n pyspectrafuse/consensus_strategy/metrics.py | sed -n '52,100p'

Repository: bigbio/pyspectrafuse

Length of output: 2073


🏁 Script executed:

# Look for test files that might test the symmetry of the dot function
find . -type f -name "*test*.py" | xargs grep -l "dot\|symmetry" | head -5

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Check if there are any comments or documentation about symmetry assumptions
rg -n "symmetric|symmetry" pyspectrafuse/consensus_strategy/ --type=py

Repository: bigbio/pyspectrafuse

Length of output: 191


🏁 Script executed:

# Create a simple test to demonstrate the asymmetry
python3 << 'EOF'
# Demonstrate why the forward-pointer algorithm is asymmetric

# Simulate the _dot algorithm
def simulate_dot(mz1, intensity1, mz2, intensity2, tol):
    """Simulate the _dot function behavior"""
    fragment_other_i = 0
    score = 0.0
    for fragment_i in range(len(mz1)):
        # Advance pointer while mz2[pointer] is too small
        while (fragment_other_i < len(mz2) - 1 and
               mz2[fragment_other_i] < mz1[fragment_i] - tol):
            fragment_other_i += 1
        # Check if current peaks match
        # Check bounds before indexing to avoid an IndexError when the
        # pointer has advanced past the end of mz2, then check for a match
        if (fragment_other_i < len(mz2) and
                abs(mz1[fragment_i] - mz2[fragment_other_i]) <= tol):
            score += intensity1[fragment_i] * intensity2[fragment_other_i]
            fragment_other_i += 1
    return score

import numpy as np

# Test case showing asymmetry
mz1 = np.array([100.0, 200.0, 300.0])
intensity1 = np.array([1.0, 1.0, 1.0])
mz2 = np.array([125.0, 275.0])
intensity2 = np.array([1.0, 1.0])
tol = 50.0

score1 = simulate_dot(mz1, intensity1, mz2, intensity2, tol)
score2 = simulate_dot(mz2, intensity2, mz1, intensity1, tol)

print(f"dot(spec1, spec2) = {score1}")
print(f"dot(spec2, spec1) = {score2}")
print(f"Symmetric? {score1 == score2}")
EOF

Repository: bigbio/pyspectrafuse

Length of output: 150


Critical bug: the dot product is not symmetric, so do not mirror the matrix.

The similarity metric is asymmetric due to the one-way pointer algorithm in _dot(). Calling dot(spec_a, spec_b) does not equal dot(spec_b, spec_a) because the pointer-based matching depends on which spectrum's peaks are iterated first. Mirroring the upper triangle produces an incorrect similarity matrix and will lead to wrong consensus spectrum selection.

Remove the mirroring logic and compute the full n×n matrix:

for i in range(n):
    for j in range(n):
        similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = similarity
🤖 Prompt for AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py around lines 99-111,
the similarity matrix construction incorrectly mirrors the upper triangle
despite the similarity function being asymmetric (the pointer-based _dot() makes
dot(a,b) != dot(b,a)), so remove the mirroring and compute the full n×n matrix
by calling compare_spectra for every (i,j) pair; update the loops that build
sim_matrix (references: sim_matrix, spectra_list, compare_spectra, _dot) to
iterate j from 0..n-1 and assign sim_matrix[i, j] = similarity for every pair
without setting sim_matrix[j, i].
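The full-matrix construction above can be sketched in isolation. This is a minimal illustration, not the project's code: `full_similarity_matrix` and the toy `sim` function are hypothetical stand-ins for the strategy class and `compare_spectra`, chosen only to show why copying one triangle is unsafe for an asymmetric metric.

```python
import numpy as np

def full_similarity_matrix(items, sim):
    """Compute every (i, j) pair; no mirroring, since sim may be asymmetric."""
    n = len(items)
    mat = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            mat[i, j] = sim(items[i], items[j])
    return mat

# Toy asymmetric "similarity": mirroring the upper triangle would be wrong here.
sim = lambda a, b: a - b
mat = full_similarity_matrix([1.0, 2.0, 3.0], sim)
print(np.allclose(mat, mat.T))  # False: the matrix is genuinely asymmetric
```

The same loop shape applies to `sim_matrix` and `spectra_list`; the diagonal could be skipped as a minor optimization if self-similarity is never consulted.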


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @.github/workflows/pyspectrafuse-containers.yml:
- Around lines 114-124: The "Login and Deploy Container" step has two issues: its
if condition omits the same push_images guard used in the other Singularity
steps so it will run on master pushes even when push_images is false, and piping
the token via echo risks exposing GHCR_TOKEN in logs. Fix by changing the step's
if to match the other steps (include the push_images check used elsewhere, e.g.,
ensure the if includes github.event.inputs.push_images == 'true' or the same
expression used in lines 99/105/109) and replace the echo pipe with a safe
--password-stdin usage that doesn't print the token (e.g., feed the secret via a
heredoc or pass it via an environment variable read by singularity's
--password-stdin), keeping the rest of the pushes (IS_RELEASE/IS_MASTER logic)
intact.
🧹 Nitpick comments (3)
Dockerfile (2)

26-26: Prefer regular install over editable mode in production containers.

Using pip install -e . (editable/development mode) in a Docker image is unnecessary and can cause issues. Editable installs create symlinks to the source, which is useful during development but adds no value in a container where the code won't change. Use a regular install instead.

🔎 Proposed fix
 # Install the package
-RUN pip install --no-cache-dir -e .
+RUN pip install --no-cache-dir .

1-32: Consider adding a non-root user for improved security.

The container runs as root by default. While this is acceptable for many CLI tools, adding a non-root user is a security best practice, especially if the container might be used in production environments or pipelines.

🔎 Optional enhancement
 # Install the package
 RUN pip install --no-cache-dir .

+# Create non-root user
+RUN useradd --create-home --shell /bin/bash appuser
+USER appuser
+
 # Set the entrypoint
 ENTRYPOINT ["pyspectrafuse_cli"]
.github/workflows/pyspectrafuse-containers.yml (1)

62-73: Version extraction may fail silently if pattern doesn't match.

If pyproject.toml format changes or the version line is missing, grep and cut will produce empty output without error. Consider adding validation.

🔎 Proposed fix with validation
     - name: Extract version from pyproject.toml
       id: version
       run: |
         VERSION=$(grep '^version = ' pyproject.toml | cut -d'"' -f2)
+        if [ -z "$VERSION" ]; then
+          echo "::error::Failed to extract version from pyproject.toml"
+          exit 1
+        fi
         echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
         echo "Extracted version: $VERSION"
         # Extract major and minor version
         MAJOR=$(echo $VERSION | cut -d'.' -f1)
         MINOR=$(echo $VERSION | cut -d'.' -f2)
         echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
         echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
         echo "Major.Minor: $MAJOR.$MINOR"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 027cacd and d0c86fa.

📒 Files selected for processing (2)
  • .github/workflows/pyspectrafuse-containers.yml
  • Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (macos-latest, 3.8)
🔇 Additional comments (3)
.github/workflows/pyspectrafuse-containers.yml (3)

55-60: Verify the need for a custom GHCR_TOKEN secret.

GitHub Actions provides GITHUB_TOKEN with packages: write permission (already declared on line 32) that can authenticate to GHCR. Using secrets.GITHUB_TOKEN is simpler and doesn't require manually creating and maintaining a separate secret.

If a custom token is intentionally required (e.g., for cross-repo access), please disregard. Otherwise, consider:

🔎 Proposed simplification
     - name: Log in to GitHub Container Registry
       uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef
       with:
         registry: ghcr.io
         username: ${{ github.actor }}
-        password: ${{ secrets.GHCR_TOKEN }}
+        password: ${{ secrets.GITHUB_TOKEN }}

98-112: Singularity build will fail because the Docker image was removed.

Line 93 removes the Docker image (docker image rm ... || true) before the Singularity conversion step attempts to pull it from the registry on line 111. For PRs, the image is never pushed (line 79), so docker://ghcr.io/bigbio/pyspectrafuse:latest won't exist remotely, and the Singularity build will fail.

However, since the Singularity steps are skipped for PRs (line 99 condition), this is consistent. Just noting the cleanup on line 93 is safe because Singularity pulls from the remote registry, not local Docker.


3-24: Workflow triggers are well-structured.

The trigger configuration correctly handles multiple scenarios: master pushes, PRs with relevant path filters, releases, and manual dispatch with configurable options. The path filters for PRs (Dockerfile, .github/workflows/**) appropriately limit when the workflow runs.

Comment on lines +114 to +124
    - name: Login and Deploy Container
      if: (github.event_name != 'pull_request')
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
      run: |
        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi

⚠️ Potential issue | 🟡 Minor

Inconsistent condition and potential secret exposure risk.

Two issues here:

  1. The if condition on line 115 differs from other Singularity steps (lines 99, 105, 109) which also check push_images. This step will run on master push even if push_images is explicitly set to false via workflow_dispatch.

  2. Line 120 pipes the token through echo which could expose it in debug logs. Use --password-stdin with a heredoc or environment variable instead.

🔎 Proposed fix
     - name: Login and Deploy Container
-      if: (github.event_name != 'pull_request')
+      if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
       env:
         IS_RELEASE: ${{ github.event_name == 'release' }}
         IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
+        GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
       run: |
-        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
+        singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
         singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    - name: Login and Deploy Container
      if: (github.event_name != 'pull_request')
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
      run: |
        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi
    - name: Login and Deploy Container
      if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
        GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
      run: |
        singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi
🤖 Prompt for AI Agents
In @.github/workflows/pyspectrafuse-containers.yml around lines 114-124, the
"Login and Deploy Container" step has two issues: its if condition omits the
same push_images guard used in the other Singularity steps so it will run on
master pushes even when push_images is false, and piping the token via echo
risks exposing GHCR_TOKEN in logs. Fix by changing the step's if to match the
other steps (include the push_images check used elsewhere, e.g., ensure the if
includes github.event.inputs.push_images == 'true' or the same expression used
in lines 99/105/109) and replace the echo pipe with a safe --password-stdin
usage that doesn't print the token (e.g., feed the secret via a heredoc or pass
it via an environment variable read by singularity's --password-stdin), keeping
the rest of the pushes (IS_RELEASE/IS_MASTER logic) intact.

@ypriverol ypriverol merged commit dd72319 into main Jan 6, 2026
9 checks passed
This was referenced Jan 7, 2026