major performance improvements #4

Merged: ypriverol merged 2 commits into main from dev on Jan 6, 2026

Conversation


@ypriverol ypriverol commented Jan 6, 2026

PR Type

Enhancement


Description

  • Replace pandas apply() with vectorized operations for 10-50x speedup

    • String concatenation using + operator instead of lambda functions
    • Use map() for dictionary lookups instead of apply()
    • Use .str accessor for string operations on Series
  • Optimize numerical computations in binning and similarity calculations

    • Pre-compute constants to avoid repeated calculations in loops
    • Cache spectrum list to avoid repeated dictionary lookups
    • Optimize bin array indexing with pre-computed valid indices
  • Remove duplicate code blocks and improve code clarity

    • Eliminate redundant function implementations in consensus strategies
    • Simplify file path construction with pre-computed basename
  • Add logging for MSP file generation completion
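To make the first pattern above concrete, here is a minimal, hypothetical sketch (the column names and lookup dict are illustrative, not the project's actual code) of replacing a row-wise apply() with map() plus vectorized string concatenation:

```python
import pandas as pd

df = pd.DataFrame({
    "filename": ["a.mzML", "b.mzML"],
    "charge": [2, 3],
})
lookup = {"a.mzML": "human/orbitrap", "b.mzML": "mouse/orbitrap"}

# Slow: one Python-level lambda call per row
slow = df.apply(lambda r: lookup[r["filename"]] + "/charge" + str(r["charge"]), axis=1)

# Fast: map() for the dict lookup, '+' and astype(str) for string building
fast = df["filename"].map(lookup) + "/charge" + df["charge"].astype(str)

assert slow.equals(fast)
```

Both produce the same Series; the vectorized version avoids per-row Python function calls, which is where the claimed 10-50x speedup comes from on large frames.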


Diagram Walkthrough

flowchart LR
  A["Pandas apply() calls"] -->|Replace with vectorized ops| B["String concatenation + map()"]
  C["Repeated calculations in loops"] -->|Pre-compute constants| D["Cached values"]
  E["Dictionary lookups in loops"] -->|Cache spectrum list| F["Direct list access"]
  G["Duplicate code blocks"] -->|Remove redundancy| H["Single implementation"]
  B --> I["10-50x performance improvement"]
  D --> I
  F --> I

File Walkthrough

Relevant files
Enhancement
cluster_res_handler.py
Vectorize string operations in cluster path construction 

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py

  • Replace two apply() calls with vectorized string concatenation
    operations
  • Use + operator for string operations instead of lambda functions
  • Convert index column to string using astype(str) for concatenation
+6/-6     
combine_cluster_and_parquet.py
Optimize dictionary mapping and string operations               

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py

  • Replace apply() with map() for dictionary lookups (10-50x speedup)
  • Vectorize USI parsing using .str.split() accessor
  • Pre-compute basename and file index to avoid repeated string
    operations in loop
  • Simplify mgf_file_path construction with pre-computed values
+13/-8   
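A hedged sketch of the USI parsing described above (the USI strings are invented for illustration): the .str accessor splits every row at once instead of calling a lambda per row.

```python
import pandas as pd

# USI format: mzspec:dataset:filename:scan:sequence/charge
usis = pd.Series([
    "mzspec:PXD000001:run_A:scan:1000",
    "mzspec:PXD000001:run_B:scan:2000",
])

# Vectorized: split every USI and take the third field (the filename)
filenames = usis.str.split(":").str[2]
```

As the compliance checks below point out, this assumes every USI is non-null and has at least three ':'-separated fields; malformed rows silently become NaN.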
spectrum2msp.py
Add completion logging for MSP generation                               

pyspectrafuse/commands/spectrum2msp.py

  • Add logging message after successful MSP file generation
+2/-0     
sdrf_utils.py
Vectorize string and data type operations                               

pyspectrafuse/common/sdrf_utils.py

  • Replace apply() with .str.split().str[0] for vectorized string
    splitting
  • Replace apply(lambda x: list(x), axis=1) with .values.tolist() for
    vectorized conversion
  • Add comments explaining vectorized operation benefits
+4/-2     
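A small illustration of the sdrf_utils change (column names are hypothetical): converting each row to a list via .values.tolist() instead of a row-wise apply().

```python
import pandas as pd

df = pd.DataFrame({"organism": ["human", "mouse"],
                   "instrument": ["Orbitrap", "Orbitrap"]})

# Slow: builds one Python list per row through apply()
slow = df.apply(lambda x: list(x), axis=1)

# Fast: one conversion of the underlying numpy array
fast = df.values.tolist()

assert list(slow) == fast
```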
binning_strategy.py
Optimize bin calculation with pre-computed constants         

pyspectrafuse/consensus_strategy/binning_strategy.py

  • Pre-compute min_mz_div_bin_size constant to avoid repeated division in
    loop
  • Optimize bin calculation formula using pre-computed constant
  • Cache valid bin indices to avoid repeated array indexing operations
  • Remove duplicate code block with old implementation
+9/-73   
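The binning optimization can be sketched as follows (the bin parameters and arrays are illustrative, not the project's values). Hoisting min_mz / bin_size out of the loop and masking to valid indices avoids repeated division and unchecked indexing; note that floor(mz/bin - min_mz/bin) equals floor((mz - min_mz)/bin) up to floating-point rounding.

```python
import numpy as np

min_mz, max_mz, bin_size = 100.0, 2000.0, 0.02
n_bins = int(np.ceil((max_mz - min_mz) / bin_size))
min_mz_div_bin_size = min_mz / bin_size  # pre-computed once, not per spectrum

mz = np.array([100.5, 500.5, 1999.5])
intensity = np.array([10.0, 20.0, 30.0])

bins = np.floor(mz / bin_size - min_mz_div_bin_size).astype(int)
valid = (bins >= 0) & (bins < n_bins)   # bounds mask computed once
valid_indices = bins[valid]             # cached, not re-derived per access

acc = np.zeros(n_bins)
np.add.at(acc, valid_indices, intensity[valid])
```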
most_similar_strategy.py
Optimize similarity matrix computation with caching           

pyspectrafuse/consensus_strategy/most_similar_strategy.py

  • Optimize similarity matrix construction by caching spectrum list
  • Pre-compute spectrum list to avoid repeated dictionary lookups
  • Simplify matrix mirroring logic with explicit condition check
  • Remove duplicate code block with old implementation
+14/-28 
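The most_similar optimization amounts to the following sketch, with cosine similarity standing in for the project's metric. Caching the spectrum list avoids dict lookups inside the O(n²) loop, and mirroring halves the comparisons; note that mirroring is only valid if the metric is symmetric, a point one of the reviews below questions for the project's pointer-based dot product.

```python
import numpy as np

def cosine(a, b):
    # Symmetric stand-in for the project's compare_spectra metric
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cluster_spectra = {"s1": np.array([1.0, 0.0]),
                   "s2": np.array([1.0, 1.0]),
                   "s3": np.array([0.0, 1.0])}
spectra_keys = list(cluster_spectra)
spectra_list = [cluster_spectra[k] for k in spectra_keys]  # cache once

n = len(spectra_list)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i, n):                      # upper triangle only
        s = cosine(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = s
        sim_matrix[j, i] = s                   # mirror: symmetric metrics only

# Pick the spectrum most similar to all the others
best = spectra_keys[int(sim_matrix.sum(axis=0).argmax())]
```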
parquet2mgf.py
Cache basename for repeated file path construction             

pyspectrafuse/mgf_convert/parquet2mgf.py

  • Pre-compute basename from parquet path to avoid repeated string
    operations
  • Simplify mgf_file_path construction using pre-computed basename and
    file_index
  • Apply same optimization pattern in two locations within the method
+7/-4     
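The parquet2mgf change is essentially this (paths invented for illustration): compute the basename once, then reuse it in the per-group f-string.

```python
from pathlib import Path

parquet_path = "/data/PXD000001/run1.sample.parquet"
# Pre-computed once instead of re-parsing the path inside the group loop
basename_parquet = Path(parquet_path).parts[-1].split('.')[0]

output_path, group, file_index = "/out", "human/charge2/mgf files", 3
mgf_file_path = f"{output_path}/{group}/{basename_parquet}_{file_index}.mgf"
```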
Miscellaneous
average_spectrum_strategy.py
Remove duplicate code in consensus aggregation                     

pyspectrafuse/consensus_strategy/average_spectrum_strategy.py

  • Remove duplicate code block that was unreachable after return
    statement
  • Eliminate redundant implementation of consensus spectrum aggregation
+0/-4     

Summary by CodeRabbit

  • Performance Improvements

    • Vectorized string and mapping operations across data pipelines for faster processing (clustering, parquet↔mgf handling, SDRF parsing).
    • Optimized consensus algorithms and similarity computations for more efficient spectrum aggregation.
  • Refactor

    • Internal simplifications reducing redundant path/string work and streamlining aggregation logic.
  • Logging

    • Added a completion log entry after MSP file generation.
  • Chores

    • Added container build assets and automated CI workflow for image build and distribution.


@ypriverol ypriverol requested a review from Copilot January 6, 2026 14:40

coderabbitai bot commented Jan 6, 2026

📝 Walkthrough

Walkthrough

This PR replaces many per-row apply operations with vectorized string and array operations, precomputes repeated path/basename values, optimizes similarity matrix computation, simplifies consensus aggregation logic, and adds a single log message and a new CI workflow plus Dockerfile.

Changes

Cohort / File(s) Summary
Cluster mgf path vectorization
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
Replaced row-wise apply mgf_path/index construction with vectorized string concatenation and index appending.
Cluster–parquet injection & mapping
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
Swapped apply with map for cluster_accession; vectorized mgf path/filename extraction and introduced precomputed basename_parquet and file_index logic.
Parquet→MGF basename optimization
pyspectrafuse/mgf_convert/parquet2mgf.py
Precomputed basename_parquet and replaced repeated Path parsing with basename + per-group file_index for mgf path construction.
DataFrame vectorization
pyspectrafuse/common/sdrf_utils.py
Replaced apply-based string split with str.split().str[0] and axis=1 apply constructing lists with values.tolist() for organism_instrument.
Average consensus simplification
pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
Removed block that built ms2spectrumDict and performed groupby + aggregated average consensus path; function now omits that aggregation branch.
Binning strategy refactor
pyspectrafuse/consensus_strategy/binning_strategy.py
Precomputed min_mz_div_bin_size, revised bin index math, used validated indices to update arrays, and removed legacy per-cluster loop/quorum and peak-consolidation logic.
Similarity matrix optimization
pyspectrafuse/consensus_strategy/most_similar_strategy.py
Replaced full symmetric O(n²) double-loop with upper-triangular computation and mirrored assignment; prepared spectra as list to avoid repeated dict lookups.
MGF/Parquet helper optimization
pyspectrafuse/mgf_convert/parquet2mgf.py
(Related internal path/name deduplication) Uses precomputed basename and consistent file_index across group operations.
Logging
pyspectrafuse/commands/spectrum2msp.py
Added log statement announcing successful MSP file generation after write completion.
CI workflow
.github/workflows/pyspectrafuse-containers.yml
New GitHub Actions workflow to build/push Docker and Singularity images, extract version from pyproject, and handle tagging/conditional pushes.
Container image
Dockerfile
New Dockerfile for building the pyspectrafuse CLI image (python:3.10-slim base, installs deps, pip install -e ., entrypoint set).

Sequence Diagram(s)

(No sequence diagrams generated — changes are internal optimizations and refactors without new multi-component control flows requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • the latest version #2 — Directly modifies cluster_res_handler.py and combine_cluster_and_parquet.py with overlapping mgf_path and mapping refactors.

Suggested labels

Review effort [1-5]: 4

Poem

🐰 I hopped through strings and trims with cheer,
Vectorized paths now swift and clear.
Bins and matrices trimmed of repeat,
Faster hops on nimble feet.
A tiny carrot for code so dear 🥕

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'major performance improvements' accurately reflects the main objective of the PR, which involves replacing pandas apply() calls with vectorized operations and optimizing numerical computations across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.



qodo-code-review bot commented Jan 6, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Path traversal write

Description: The new mgf_file_path construction uses path components derived from external/parsing
sources (e.g., group from sample_info_dict/USI-derived filename and output_path) to build
filesystem paths and create directories/write files, which can enable path traversal or
arbitrary file write if those inputs contain .., absolute paths, or path separators.
parquet2mgf.py [120-147]

Referred Code
# Pre-compute basename to avoid repeated string operations
basename_parquet = Path(parquet_path).parts[-1].split('.')[0]

for group, group_df in mgf_group_df.groupby('mgf_file_path'):
    base_mgf_path = f"{output_path}/{group}"
    file_index = relation_dict[base_mgf_path] + 1
    mgf_file_path = f"{base_mgf_path}/{basename_parquet}_{file_index}.mgf"
    Path(mgf_file_path).parent.mkdir(parents=True, exist_ok=True)

    if write_count_dict[group] + group_df.shape[0] <= SPECTRA_NUM:
        Parquet2Mgf.write2mgf(mgf_file_path, '\n\n'.join(group_df["spectrum"]))
        write_count_dict[group] += group_df.shape[0]
    else:
        remain_num = SPECTRA_NUM - write_count_dict[group]
        if remain_num > 0:
            group_df_remain = group_df.head(remain_num)
            Parquet2Mgf.write2mgf(mgf_file_path, '\n\n'.join(group_df_remain["spectrum"]))
            write_count_dict[group] += group_df_remain.shape[0]

        # Update the index of the read and mgf files
        relation_dict[base_mgf_path] += 1


 ... (clipped 7 lines)
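One possible mitigation, sketched here as an assumption rather than a fix present in the PR: validate each externally derived path component before joining, and confirm the resolved path stays under the output root (requires Python 3.9+ for is_relative_to).

```python
from pathlib import Path, PurePosixPath

def safe_join(base: str, *parts: str) -> Path:
    """Join untrusted components under base, rejecting traversal attempts."""
    for part in parts:
        pp = PurePosixPath(part)
        # Reject absolute components, embedded separators, '.', and '..'
        if pp.is_absolute() or len(pp.parts) != 1 or part in (".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
    resolved = Path(base, *parts).resolve()
    if not resolved.is_relative_to(Path(base).resolve()):
        raise ValueError("path escapes the base directory")
    return resolved
```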
Ticket Compliance
🎫 No ticket provided
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing USI validation: New vectorized USI parsing assumes all USI values are non-null and contain at least three
:-separated fields, which can silently produce null paths or downstream failures without
explicit validation/handling.

Referred Code
# Vectorized string operations - extract filename from USI and map to sample info
# USI format: mzspec:dataset:filename:scan:sequence/charge
filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external data: The new logic derives filesystem paths from externally-sourced USI strings and dictionary
mappings without validating/sanitizing the parsed filename component, which can lead to
unsafe or incorrect path construction.

Referred Code
# Vectorized string operations - extract filename from USI and map to sample info
# USI format: mzspec:dataset:filename:scan:sequence/charge
filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Jan 6, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix TypeError in path construction

Fix a TypeError in path construction by joining the list elements in the
sample_info Series into a single string using .str.join('/') before
concatenation.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py [92-94]

 sample_info = filenames.map(sample_info_dict)
 charges = 'charge' + row_group['charge'].astype(str)
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = (sample_info.str.join('/') + '/' + charges + '/mgf files')


Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a TypeError that would occur when concatenating a pandas Series of lists with strings, which makes the current PR code non-functional. The proposed fix using .str.join('/') is correct and essential.

High
General
Validate mapped sample_info presence

Add validation to ensure all sample_info mappings exist after the map operation,
raising a KeyError if any are missing to prevent downstream errors.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py [92]

 sample_info = filenames.map(sample_info_dict)
+missing = sample_info.isna()
+if missing.any():
+    missing_files = filenames[missing].unique().tolist()
+    raise KeyError(f"Missing sample info for files: {missing_files}")
Suggestion importance[1-10]: 8


Why: This suggestion adds crucial error handling by checking for missing sample_info mappings. This prevents potential NaN values from propagating and causing cryptic downstream errors, making the code more robust.

Medium
Use pd.concat for concatenation

Use pd.concat(charge_df_lst, ignore_index=True) instead of
pd.DataFrame(np.vstack(charge_df_lst)) to concatenate DataFrames more
efficiently and preserve column types.

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py [45-46]

-charge_df = pd.DataFrame(np.vstack(charge_df_lst), 
-                         columns=['mgf_path', 'index', 'cluster_accession'])
+charge_df = pd.concat(charge_df_lst, ignore_index=True)
Suggestion importance[1-10]: 6


Why: The suggestion correctly recommends using pd.concat over np.vstack for concatenating DataFrames, which is more idiomatic, efficient, and safer as it preserves dtypes.

Low
Simplify symmetric matrix construction logic

Simplify the symmetric matrix construction by removing the redundant if i != j
check and assigning the similarity score to both sim_matrix[i, j] and
sim_matrix[j, i] unconditionally.

pyspectrafuse/consensus_strategy/most_similar_strategy.py [99-111]

 # Build similarity matrix - optimized for symmetric matrix
 # Only calculate upper triangle and mirror to lower triangle
 n = len(spectra_keys)
 sim_matrix = np.zeros((n, n))
 spectra_list = [cluster_spectra[key] for key in spectra_keys]
 
 # Calculate upper triangle including diagonal
 for i in range(n):
     for j in range(i, n):
         similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
         sim_matrix[i, j] = similarity
-        if i != j:  # Mirror to lower triangle (skip diagonal)
-            sim_matrix[j, i] = similarity
+        sim_matrix[j, i] = similarity
Suggestion importance[1-10]: 4


Why: The suggestion offers a valid micro-optimization by removing a redundant if i != j check, which simplifies the code and slightly improves performance by eliminating a conditional branch inside the loop.

Low

Comment on lines +93 to +94
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')

Suggestion: Fix TypeError in path construction

Suggested change
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = (sample_info.str.join('/') + '/' + charges + '/mgf files')

Contributor

Copilot AI left a comment

Pull request overview

This PR implements performance optimizations across the codebase by replacing inefficient apply() calls with vectorized pandas operations, pre-computing repeated calculations, and removing duplicate code. The changes aim to significantly speed up data processing, particularly for large datasets.

Key changes:

  • Vectorized pandas operations replacing apply() calls for 10-50x speedup potential
  • Pre-computation of repeated string operations to avoid redundant calculations
  • Removal of duplicate/dead code in consensus strategies

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pyspectrafuse/mgf_convert/parquet2mgf.py Pre-computes basename to avoid repeated Path operations in loop
pyspectrafuse/consensus_strategy/most_similar_strategy.py Optimizes similarity matrix calculation with list comprehension; removes duplicate code
pyspectrafuse/consensus_strategy/binning_strategy.py Pre-computes division factor for bin calculation; optimizes array indexing; removes duplicate code
pyspectrafuse/consensus_strategy/average_spectrum_strategy.py Removes duplicate code block
pyspectrafuse/common/sdrf_utils.py Replaces apply() with vectorized str operations and values.tolist()
pyspectrafuse/commands/spectrum2msp.py Adds success logging message
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py Vectorizes string operations and USI parsing; pre-computes basename
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py Vectorizes string concatenation operations


filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
Copilot AI Jan 6, 2026

The vectorized string concatenation is incorrect. The sample_info variable contains a pandas Series where each element is a list [organism, instrument] (from sample_info_dict). The current code attempts to concatenate these lists with strings using +, which will fail because you cannot directly concatenate a Series of lists with strings.

The original code used '/'.join(sample_info_dict.get(...) + ['charge' + str(row["charge"]), 'mgf files']) which joins list elements with '/'. The vectorized version should first construct the full list per row, then join with '/'. Consider using apply() with a lambda that constructs the path from the list elements, or properly concatenate list elements before joining.

Suggested change
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = [
+    '/'.join(info + [charge, 'mgf files'])
+    for info, charge in zip(sample_info, charges)
+]

Comment on lines 92 to 93
return res, single_spectrum_df

Copilot AI Jan 6, 2026

There is a duplicate return statement. Line 90 returns the result, making line 92 unreachable dead code. Remove the duplicate return statement on line 92.

Suggested change
-return res, single_spectrum_df

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py:
- Around line 99-111: The similarity matrix construction incorrectly mirrors the
upper triangle despite the similarity function being asymmetric (the
pointer-based _dot() makes dot(a,b) != dot(b,a)), so remove the mirroring and
compute the full n×n matrix by calling compare_spectra for every (i,j) pair;
update the loops that build sim_matrix (references: sim_matrix, spectra_list,
compare_spectra, _dot) to iterate j from 0..n-1 and assign sim_matrix[i, j] =
similarity for every pair without setting sim_matrix[j, i].
🧹 Nitpick comments (1)
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py (1)

89-97: Great vectorization! Complete the basename optimization.

The vectorized string operations eliminate per-row overhead and are a significant performance improvement. Pre-computing basename_parquet (line 97) is also excellent.

However, line 130 still uses the old pattern Path(path_parquet).parts[-1].split('.')[0] instead of reusing basename_parquet. This should be updated for consistency and to complete the optimization.

🔎 Proposed fix to use basename_parquet at line 130
-                    mgf_file_path = (f"{base_mgf_path}/{Path(path_parquet).parts[-1].split('.')[0]}_"
+                    mgf_file_path = (f"{base_mgf_path}/{basename_parquet}_"
                                     f"{file_index_dict[base_mgf_path] + 1}.mgf")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2aa8d28 and 027cacd.

📒 Files selected for processing (8)
  • pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
  • pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
  • pyspectrafuse/commands/spectrum2msp.py
  • pyspectrafuse/common/sdrf_utils.py
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
  • pyspectrafuse/consensus_strategy/binning_strategy.py
  • pyspectrafuse/consensus_strategy/most_similar_strategy.py
  • pyspectrafuse/mgf_convert/parquet2mgf.py
💤 Files with no reviewable changes (1)
  • pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: test (macos-latest, 3.10)
  • GitHub Check: test (macos-latest, 3.9)
🔇 Additional comments (12)
pyspectrafuse/commands/spectrum2msp.py (1)

207-208: LGTM! Good user experience improvement.

The completion log provides clear feedback to users about the output location after MSP file generation.

pyspectrafuse/common/sdrf_utils.py (2)

53-54: Excellent vectorization!

Replacing the per-row apply() with vectorized string operations (str.split('.').str[0]) significantly improves performance while maintaining identical semantics.


71-73: Great performance optimization!

Converting from apply() to .values.tolist() provides a significant performance boost. The vectorized approach directly converts the underlying numpy array to a list of lists, eliminating Python-level iteration overhead.

pyspectrafuse/mgf_convert/parquet2mgf.py (1)

120-126: Excellent pre-computation optimization!

Pre-computing basename_parquet once before the groupby loop eliminates redundant path parsing operations. The basename is constant for the input parquet file, making this a safe and effective performance improvement.

Also applies to: 143-144

pyspectrafuse/consensus_strategy/binning_strategy.py (3)

98-100: Smart pre-computation for bin calculation!

Pre-computing min_mz_div_bin_size eliminates repeated division operations in the loop. The refactored bin calculation floor(mz/bin_size - min_mz/bin_size) is mathematically equivalent to floor((mz - min_mz)/bin_size) while being more efficient.


103-110: Optimized bin calculation and indexing.

The two-step validation approach (computing valid_bins mask, then extracting valid_indices) ensures bins are within bounds before updating accumulators. The logic correctly uses valid_bins to slice spectrum data and valid_indices to index the bin arrays.


114-132: Good bounds checking in edge case handling.

The edge case consolidation logic correctly validates indices before accessing valid_indices + 1 to prevent out-of-bounds array access. The threshold-based consolidation of split peaks maintains the binning strategy's correctness.

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py (2)

40-42: Excellent vectorization of path construction!

The two-step vectorized approach (prepending sample info, then appending index) replaces per-row operations with efficient pandas string operations, providing a significant performance boost while maintaining identical semantics.


118-120: Consistent vectorization pattern!

The vectorized path construction follows the same efficient pattern as walk_dir, eliminating Python-level iteration overhead while preserving the original path construction logic.

pyspectrafuse/consensus_strategy/most_similar_strategy.py (1)

114-114: Good documentation improvement.

The updated comment clarifies that sum(axis=0) computes the total similarity for each spectrum, making the selection logic more understandable.

pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py (2)

33-34: Excellent vectorization!

Replacing apply() with map() for dictionary lookups is a well-known pandas optimization. The map() method is specifically optimized for this use case and can provide 10-50x speedup on large DataFrames by avoiding Python lambda overhead.


101-102: Correct usage of pre-computed basename.

The usage of basename_parquet here correctly applies the optimization by avoiding repeated Path operations within the groupby loop.

Comment on lines +99 to +111
# Build similarity matrix - optimized for symmetric matrix
# Only calculate upper triangle and mirror to lower triangle
n = len(spectra_keys)
sim_matrix = np.zeros((n, n))
spectra_list = [cluster_spectra[key] for key in spectra_keys]

# Calculate upper triangle including diagonal
for i in range(n):
for j in range(i, n):
similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
sim_matrix[i, j] = similarity
if i != j: # Mirror to lower triangle (skip diagonal)
sim_matrix[j, i] = similarity

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find spectrum_utils module and dot function
find . -type f -name "*spectrum_utils*" | head -20

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Search for the dot function definition
rg -n "def dot" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 175


🏁 Script executed:

# Find the compare_spectra method in most_similar_strategy.py
rg -n "def compare_spectra" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Check for any imports of dot or similarity functions in most_similar_strategy.py
rg -n "import.*dot|from.*dot" pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 122


🏁 Script executed:

# Read the dot function implementation in metrics.py
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -60

Repository: bigbio/pyspectrafuse

Length of output: 2205


🏁 Script executed:

# Find the compare_spectra method
rg -n -A15 "def compare_spectra" pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Read more of metrics.py to see the full _dot implementation
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -100

Repository: bigbio/pyspectrafuse

Length of output: 3844


🏁 Script executed:

# Search for compare_spectra in the entire codebase
rg -n "compare_spectra" --type=py

Repository: bigbio/pyspectrafuse

Length of output: 334


🏁 Script executed:

# Read lines around compare_spectra assignment in most_similar_strategy.py
sed -n '30,45p' pyspectrafuse/consensus_strategy/most_similar_strategy.py

Repository: bigbio/pyspectrafuse

Length of output: 986


🏁 Script executed:

# Also read the full metrics.py to understand normalization and get complete picture
wc -l pyspectrafuse/consensus_strategy/metrics.py

Repository: bigbio/pyspectrafuse

Length of output: 111


🏁 Script executed:

# Read the full _dot implementation and surrounding code
cat -n pyspectrafuse/consensus_strategy/metrics.py | sed -n '52,100p'

Repository: bigbio/pyspectrafuse

Length of output: 2073


🏁 Script executed:

# Look for test files that might test the symmetry of the dot function
find . -type f -name "*test*.py" | xargs grep -l "dot\|symmetry" | head -5

Repository: bigbio/pyspectrafuse

Length of output: 46


🏁 Script executed:

# Check if there are any comments or documentation about symmetry assumptions
rg -n "symmetric|symmetry" pyspectrafuse/consensus_strategy/ --type=py

Repository: bigbio/pyspectrafuse

Length of output: 191


🏁 Script executed:

# Create a simple test to demonstrate the asymmetry
python3 << 'EOF'
# Demonstrate why the forward-pointer algorithm is asymmetric

# Simulate the _dot algorithm
def simulate_dot(mz1, intensity1, mz2, intensity2, tol):
    """Simulate the _dot function behavior"""
    fragment_other_i = 0
    score = 0.0
    for fragment_i in range(len(mz1)):
        # Advance pointer while mz2[pointer] is too small
        while (fragment_other_i < len(mz2) - 1 and
               mz2[fragment_other_i] < mz1[fragment_i] - tol):
            fragment_other_i += 1
        # Check if current peaks match
        # Check bounds before indexing to avoid an IndexError when the
        # pointer has advanced past the end of mz2, then check for a match
        if (fragment_other_i < len(mz2) and
                abs(mz1[fragment_i] - mz2[fragment_other_i]) <= tol):
            score += intensity1[fragment_i] * intensity2[fragment_other_i]
            fragment_other_i += 1
    return score

import numpy as np

# Test case showing asymmetry
mz1 = np.array([100.0, 200.0, 300.0])
intensity1 = np.array([1.0, 1.0, 1.0])
mz2 = np.array([125.0, 275.0])
intensity2 = np.array([1.0, 1.0])
tol = 50.0

score1 = simulate_dot(mz1, intensity1, mz2, intensity2, tol)
score2 = simulate_dot(mz2, intensity2, mz1, intensity1, tol)

print(f"dot(spec1, spec2) = {score1}")
print(f"dot(spec2, spec1) = {score2}")
print(f"Symmetric? {score1 == score2}")
EOF

Repository: bigbio/pyspectrafuse

Length of output: 150


Critical bug: the dot product is not symmetric, so do not mirror the matrix.

The similarity metric is asymmetric due to the one-way pointer algorithm in _dot(). Calling dot(spec_a, spec_b) does not equal dot(spec_b, spec_a) because the pointer-based matching depends on which spectrum's peaks are iterated first. Mirroring the upper triangle produces an incorrect similarity matrix and will lead to wrong consensus spectrum selection.

Remove the mirroring logic and compute the full n×n matrix:

for i in range(n):
    for j in range(n):
        similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = similarity
🤖 Prompt for AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py around lines 99-111,
the similarity matrix construction incorrectly mirrors the upper triangle
despite the similarity function being asymmetric (the pointer-based _dot() makes
dot(a,b) != dot(b,a)), so remove the mirroring and compute the full n×n matrix
by calling compare_spectra for every (i,j) pair; update the loops that build
sim_matrix (references: sim_matrix, spectra_list, compare_spectra, _dot) to
iterate j from 0..n-1 and assign sim_matrix[i, j] = similarity for every pair
without setting sim_matrix[j, i].
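The full-matrix construction above can be sketched in isolation. This is a minimal illustration, not the project's code: `full_similarity_matrix` and the toy `sim` function are hypothetical stand-ins for the strategy class and `compare_spectra`, chosen only to show why copying one triangle is unsafe for an asymmetric metric.

```python
import numpy as np

def full_similarity_matrix(items, sim):
    """Compute every (i, j) pair; no mirroring, since sim may be asymmetric."""
    n = len(items)
    mat = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            mat[i, j] = sim(items[i], items[j])
    return mat

# Toy asymmetric "similarity": mirroring the upper triangle would be wrong here.
sim = lambda a, b: a - b
mat = full_similarity_matrix([1.0, 2.0, 3.0], sim)
print(np.allclose(mat, mat.T))  # False: the matrix is genuinely asymmetric
```

The same loop shape applies to `sim_matrix` and `spectra_list`; the diagonal could be skipped as a minor optimization if self-similarity is never consulted.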


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @.github/workflows/pyspectrafuse-containers.yml:
- Around lines 114-124: The "Login and Deploy Container" step has two issues: its
if condition omits the same push_images guard used in the other Singularity
steps so it will run on master pushes even when push_images is false, and piping
the token via echo risks exposing GHCR_TOKEN in logs. Fix by changing the step's
if to match the other steps (include the push_images check used elsewhere, e.g.,
ensure the if includes github.event.inputs.push_images == 'true' or the same
expression used in lines 99/105/109) and replace the echo pipe with a safe
--password-stdin usage that doesn't print the token (e.g., feed the secret via a
heredoc or pass it via an environment variable read by singularity's
--password-stdin), keeping the rest of the pushes (IS_RELEASE/IS_MASTER logic)
intact.
🧹 Nitpick comments (3)
Dockerfile (2)

26-26: Prefer regular install over editable mode in production containers.

Using pip install -e . (editable/development mode) in a Docker image is unnecessary and can cause issues. Editable installs create symlinks to the source, which is useful during development but adds no value in a container where the code won't change. Use a regular install instead.

🔎 Proposed fix
 # Install the package
-RUN pip install --no-cache-dir -e .
+RUN pip install --no-cache-dir .

1-32: Consider adding a non-root user for improved security.

The container runs as root by default. While this is acceptable for many CLI tools, adding a non-root user is a security best practice, especially if the container might be used in production environments or pipelines.

🔎 Optional enhancement
 # Install the package
 RUN pip install --no-cache-dir .

+# Create non-root user
+RUN useradd --create-home --shell /bin/bash appuser
+USER appuser
+
 # Set the entrypoint
 ENTRYPOINT ["pyspectrafuse_cli"]
.github/workflows/pyspectrafuse-containers.yml (1)

62-73: Version extraction may fail silently if pattern doesn't match.

If pyproject.toml format changes or the version line is missing, grep and cut will produce empty output without error. Consider adding validation.

🔎 Proposed fix with validation
     - name: Extract version from pyproject.toml
       id: version
       run: |
         VERSION=$(grep '^version = ' pyproject.toml | cut -d'"' -f2)
+        if [ -z "$VERSION" ]; then
+          echo "::error::Failed to extract version from pyproject.toml"
+          exit 1
+        fi
         echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
         echo "Extracted version: $VERSION"
         # Extract major and minor version
         MAJOR=$(echo $VERSION | cut -d'.' -f1)
         MINOR=$(echo $VERSION | cut -d'.' -f2)
         echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
         echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
         echo "Major.Minor: $MAJOR.$MINOR"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 027cacd and d0c86fa.

📒 Files selected for processing (2)
  • .github/workflows/pyspectrafuse-containers.yml
  • Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (macos-latest, 3.8)
🔇 Additional comments (3)
.github/workflows/pyspectrafuse-containers.yml (3)

55-60: Verify the need for a custom GHCR_TOKEN secret.

GitHub Actions provides GITHUB_TOKEN with packages: write permission (already declared on line 32) that can authenticate to GHCR. Using secrets.GITHUB_TOKEN is simpler and doesn't require manually creating and maintaining a separate secret.

If a custom token is intentionally required (e.g., for cross-repo access), please disregard. Otherwise, consider:

🔎 Proposed simplification
     - name: Log in to GitHub Container Registry
       uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef
       with:
         registry: ghcr.io
         username: ${{ github.actor }}
-        password: ${{ secrets.GHCR_TOKEN }}
+        password: ${{ secrets.GITHUB_TOKEN }}

98-112: Singularity build will fail because the Docker image was removed.

Line 93 removes the Docker image (docker image rm ... || true) before the Singularity conversion step attempts to pull it from the registry on line 111. For PRs, the image is never pushed (line 79), so docker://ghcr.io/bigbio/pyspectrafuse:latest won't exist remotely, and the Singularity build will fail.

However, since the Singularity steps are skipped for PRs (line 99 condition), this is consistent. Just noting the cleanup on line 93 is safe because Singularity pulls from the remote registry, not local Docker.


3-24: Workflow triggers are well-structured.

The trigger configuration correctly handles multiple scenarios: master pushes, PRs with relevant path filters, releases, and manual dispatch with configurable options. The path filters for PRs (Dockerfile, .github/workflows/**) appropriately limit when the workflow runs.

Comment on lines +114 to +124
    - name: Login and Deploy Container
      if: (github.event_name != 'pull_request')
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
      run: |
        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi

⚠️ Potential issue | 🟡 Minor

Inconsistent condition and potential secret exposure risk.

Two issues here:

  1. The if condition on line 115 differs from other Singularity steps (lines 99, 105, 109) which also check push_images. This step will run on master push even if push_images is explicitly set to false via workflow_dispatch.

  2. Line 120 pipes the token through echo which could expose it in debug logs. Use --password-stdin with a heredoc or environment variable instead.

🔎 Proposed fix
     - name: Login and Deploy Container
-      if: (github.event_name != 'pull_request')
+      if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
       env:
         IS_RELEASE: ${{ github.event_name == 'release' }}
         IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
+        GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
       run: |
-        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
+        singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
         singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    - name: Login and Deploy Container
      if: (github.event_name != 'pull_request')
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
      run: |
        echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi
    - name: Login and Deploy Container
      if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
      env:
        IS_RELEASE: ${{ github.event_name == 'release' }}
        IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
        GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
      run: |
        singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
        singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
        if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
          singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
        fi
🤖 Prompt for AI Agents
In @.github/workflows/pyspectrafuse-containers.yml around lines 114-124, the
"Login and Deploy Container" step has two issues: its if condition omits the
same push_images guard used in the other Singularity steps so it will run on
master pushes even when push_images is false, and piping the token via echo
risks exposing GHCR_TOKEN in logs. Fix by changing the step's if to match the
other steps (include the push_images check used elsewhere, e.g., ensure the if
includes github.event.inputs.push_images == 'true' or the same expression used
in lines 99/105/109) and replace the echo pipe with a safe --password-stdin
usage that doesn't print the token (e.g., feed the secret via a heredoc or pass
it via an environment variable read by singularity's --password-stdin), keeping
the rest of the pushes (IS_RELEASE/IS_MASTER logic) intact.

@ypriverol ypriverol merged commit dd72319 into main Jan 6, 2026
9 checks passed
This was referenced Jan 7, 2026