Conversation
📝 Walkthrough

This PR replaces many per-row apply operations with vectorized string and array operations, precomputes repeated path/basename values, optimizes similarity matrix computation, simplifies consensus aggregation logic, and adds a single log message plus a new CI workflow and Dockerfile.

Sequence Diagram(s)

No sequence diagrams generated — changes are internal optimizations and refactors without new multi-component control flows requiring visualization.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ Passed (3 passed)
PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Compliance status legend:
- 🟢 Fully Compliant
- 🟡 Partially Compliant
- 🔴 Not Compliant
- ⚪ Requires Further Human Verification
- 🏷️ Compliance label
PR Code Suggestions ✨

Explore these optional code suggestions:
```python
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
```

Suggestion: Fix TypeError in path construction

```diff
-charges = 'charge' + row_group['charge'].astype(str)
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+charges = 'charge' + row_group['charge'].astype(str)
+mgf_group_df['mgf_file_path'] = (sample_info.str.join('/') + '/' + charges + '/mgf files')
```
Pull request overview
This PR implements performance optimizations across the codebase by replacing inefficient apply() calls with vectorized pandas operations, pre-computing repeated calculations, and removing duplicate code. The changes aim to significantly speed up data processing, particularly for large datasets.
Key changes:
- Vectorized pandas operations replacing `apply()` calls, with 10-50x speedup potential
- Pre-computation of repeated string operations to avoid redundant calculations
- Removal of duplicate/dead code in consensus strategies
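As a generic sketch of the apply-to-vectorized pattern described in these key changes (data and frame are invented for illustration, not taken from the PR):

```python
import pandas as pd

# Hypothetical frame; 'charge' mirrors the column named in this PR's diffs.
df = pd.DataFrame({"charge": [2, 3, 2]})

# Per-row apply(): a Python-level lambda call for every row.
slow = df.apply(lambda row: "charge" + str(row["charge"]), axis=1)

# Vectorized equivalent: one astype() plus a broadcast '+'.
fast = "charge" + df["charge"].astype(str)

assert slow.tolist() == fast.tolist() == ["charge2", "charge3", "charge2"]
```

Both produce the same values; the vectorized form avoids the per-row Python function-call overhead that the review estimates at 10-50x on large frames.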
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pyspectrafuse/mgf_convert/parquet2mgf.py | Pre-computes basename to avoid repeated Path operations in loop |
| pyspectrafuse/consensus_strategy/most_similar_strategy.py | Optimizes similarity matrix calculation with list comprehension; removes duplicate code |
| pyspectrafuse/consensus_strategy/binning_strategy.py | Pre-computes division factor for bin calculation; optimizes array indexing; removes duplicate code |
| pyspectrafuse/consensus_strategy/average_spectrum_strategy.py | Removes duplicate code block |
| pyspectrafuse/common/sdrf_utils.py | Replaces apply() with vectorized str operations and values.tolist() |
| pyspectrafuse/commands/spectrum2msp.py | Adds success logging message |
| pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | Vectorizes string operations and USI parsing; pre-computes basename |
| pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py | Vectorizes string concatenation operations |
```python
filenames = row_group['USI'].str.split(':').str[2]
sample_info = filenames.map(sample_info_dict)
charges = 'charge' + row_group['charge'].astype(str)
mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
```
The vectorized string concatenation is incorrect. The `sample_info` variable contains a pandas Series where each element is a list `[organism, instrument]` (from `sample_info_dict`). The current code attempts to concatenate these lists with strings using `+`, which will fail because you cannot directly concatenate a Series of lists with strings.

The original code used `'/'.join(sample_info_dict.get(...) + ['charge' + str(row["charge"]), 'mgf files'])`, which joins list elements with `'/'`. The vectorized version should first construct the full list per row, then join with `'/'`. Consider using `apply()` with a lambda that constructs the path from the list elements, or properly concatenate list elements before joining.
```diff
-mgf_group_df['mgf_file_path'] = (sample_info + '/' + charges + '/mgf files')
+mgf_group_df['mgf_file_path'] = [
+    '/'.join(info + [charge, 'mgf files'])
+    for info, charge in zip(sample_info, charges)
+]
```
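The suggested list-comprehension fix can be exercised with hypothetical stand-in data (the real `sample_info_dict` contents are not shown in this diff, so the organism/instrument values below are invented):

```python
import pandas as pd

# Hypothetical stand-ins for the variables in this suggestion: sample_info is
# a Series whose elements are lists [organism, instrument].
sample_info = pd.Series([["human", "orbitrap"], ["mouse", "tof"]])
charges = "charge" + pd.Series([2, 3]).astype(str)

# '+' between a Series of lists and a plain string fails elementwise, so
# build the full path list per row and join with '/' instead:
paths = [
    "/".join(info + [charge, "mgf files"])
    for info, charge in zip(sample_info, charges)
]

assert paths == ["human/orbitrap/charge2/mgf files",
                 "mouse/tof/charge3/mgf files"]
```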
```python
return res, single_spectrum_df
```
There is a duplicate return statement. Line 90 returns the result, making line 92 unreachable dead code. Remove the duplicate return statement on line 92.
```python
return res, single_spectrum_df
```
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py:
- Around line 99-111: The similarity matrix construction incorrectly mirrors the
upper triangle despite the similarity function being asymmetric (the
pointer-based _dot() makes dot(a,b) != dot(b,a)), so remove the mirroring and
compute the full n×n matrix by calling compare_spectra for every (i,j) pair;
update the loops that build sim_matrix (references: sim_matrix, spectra_list,
compare_spectra, _dot) to iterate j from 0..n-1 and assign sim_matrix[i, j] =
similarity for every pair without setting sim_matrix[j, i].
🧹 Nitpick comments (1)
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py (1)
89-97: Great vectorization! Complete the basename optimization.

The vectorized string operations eliminate per-row overhead and are a significant performance improvement. Pre-computing `basename_parquet` (line 97) is also excellent. However, line 130 still uses the old pattern `Path(path_parquet).parts[-1].split('.')[0]` instead of reusing `basename_parquet`. This should be updated for consistency and to complete the optimization.

🔎 Proposed fix to use basename_parquet at line 130

```diff
-    mgf_file_path = (f"{base_mgf_path}/{Path(path_parquet).parts[-1].split('.')[0]}_"
+    mgf_file_path = (f"{base_mgf_path}/{basename_parquet}_"
                      f"{file_index_dict[base_mgf_path] + 1}.mgf")
```
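The pre-computation pattern in this nitpick can be sketched with a hypothetical path (not a file from the repository):

```python
from pathlib import Path

# Hypothetical parquet path, used only to illustrate the basename pattern.
path_parquet = "data/sample_A.mzML.parquet"

# Compute the basename once, outside any per-group loop...
basename_parquet = Path(path_parquet).parts[-1].split(".")[0]

# ...then reuse it when building each output file name.
mgf_file_path = f"out/{basename_parquet}_1.mgf"

assert basename_parquet == "sample_A"
assert mgf_file_path == "out/sample_A_1.mgf"
```

Since the basename never changes between loop iterations, parsing it once is both safe and strictly cheaper.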
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py
- pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
- pyspectrafuse/commands/spectrum2msp.py
- pyspectrafuse/common/sdrf_utils.py
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
- pyspectrafuse/consensus_strategy/binning_strategy.py
- pyspectrafuse/consensus_strategy/most_similar_strategy.py
- pyspectrafuse/mgf_convert/parquet2mgf.py
💤 Files with no reviewable changes (1)
- pyspectrafuse/consensus_strategy/average_spectrum_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Agent
- GitHub Check: CodeQL analysis (python)
- GitHub Check: test (macos-latest, 3.10)
- GitHub Check: test (macos-latest, 3.9)
🔇 Additional comments (12)
pyspectrafuse/commands/spectrum2msp.py (1)
207-208: LGTM! Good user experience improvement.

The completion log provides clear feedback to users about the output location after MSP file generation.
pyspectrafuse/common/sdrf_utils.py (2)
53-54: Excellent vectorization!

Replacing the per-row `apply()` with vectorized string operations (`str.split('.').str[0]`) significantly improves performance while maintaining identical semantics.
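A minimal sketch of this replacement with invented filenames:

```python
import pandas as pd

# Invented filename column illustrating the pattern described above.
s = pd.Series(["run_01.raw", "run_02.raw"])

per_row = s.apply(lambda x: x.split(".")[0])   # old per-row style
vectorized = s.str.split(".").str[0]           # vectorized replacement

assert per_row.tolist() == vectorized.tolist() == ["run_01", "run_02"]
```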
71-73: Great performance optimization!

Converting from `apply()` to `.values.tolist()` provides a significant performance boost. The vectorized approach directly converts the underlying numpy array to a list of lists, eliminating Python-level iteration overhead.

pyspectrafuse/mgf_convert/parquet2mgf.py (1)
120-126: Excellent pre-computation optimization!

Pre-computing `basename_parquet` once before the groupby loop eliminates redundant path parsing operations. The basename is constant for the input parquet file, making this a safe and effective performance improvement.

Also applies to: 143-144
pyspectrafuse/consensus_strategy/binning_strategy.py (3)
98-100: Smart pre-computation for bin calculation!

Pre-computing `min_mz_div_bin_size` eliminates repeated division operations in the loop. The refactored bin calculation `floor(mz/bin_size - min_mz/bin_size)` is mathematically equivalent to `floor((mz - min_mz)/bin_size)` while being more efficient.
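The equivalence claimed here can be checked numerically. This is an illustrative sketch with invented `bin_size`/`min_mz` values (note that floating-point rounding could, in principle, differ right at bin edges, so the values below are chosen away from edges):

```python
import numpy as np

# Illustrative constants; real values come from the spectra being binned.
bin_size, min_mz = 0.02, 100.0
mz = np.array([100.05, 123.456, 250.019])

# Pre-compute the constant term once...
min_mz_div_bin_size = min_mz / bin_size
bins_fast = np.floor(mz / bin_size - min_mz_div_bin_size).astype(int)

# ...matching the original per-peak form floor((mz - min_mz) / bin_size).
bins_ref = np.floor((mz - min_mz) / bin_size).astype(int)

assert np.array_equal(bins_fast, bins_ref)
```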
103-110: Optimized bin calculation and indexing.

The two-step validation approach (computing a `valid_bins` mask, then extracting `valid_indices`) ensures bins are within bounds before updating accumulators. The logic correctly uses `valid_bins` to slice spectrum data and `valid_indices` to index the bin arrays.
114-132: Good bounds checking in edge case handling.

The edge case consolidation logic correctly validates indices before accessing `valid_indices + 1` to prevent out-of-bounds array access. The threshold-based consolidation of split peaks maintains the binning strategy's correctness.

pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py (2)
40-42: Excellent vectorization of path construction!

The two-step vectorized approach (prepending sample info, then appending the index) replaces per-row operations with efficient pandas string operations, providing a significant performance boost while maintaining identical semantics.
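A minimal sketch of the two-step pattern, using invented column names rather than the actual cluster frame:

```python
import pandas as pd

# Hypothetical cluster frame; column names are invented for illustration.
df = pd.DataFrame({"sample": ["s1", "s2"], "idx": [0, 1]})

# Step 1: prepend the base path; step 2: append the stringified index.
path = "base/" + df["sample"]
path = path + "/" + df["idx"].astype(str)

assert path.tolist() == ["base/s1/0", "base/s2/1"]
```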
118-120: Consistent vectorization pattern!

The vectorized path construction follows the same efficient pattern as `walk_dir`, eliminating Python-level iteration overhead while preserving the original path construction logic.

pyspectrafuse/consensus_strategy/most_similar_strategy.py (1)
114-114: Good documentation improvement.

The updated comment clarifies that `sum(axis=0)` computes the total similarity for each spectrum, making the selection logic more understandable.
33-34: Excellent vectorization!

Replacing `apply()` with `map()` for dictionary lookups is a well-known pandas optimization. The `map()` method is specifically optimized for this use case and can provide 10-50x speedup on large DataFrames by avoiding Python lambda overhead.
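A small illustration of the `map()`-vs-`apply()` pattern with an invented lookup table (standing in for `sample_info_dict`-style mappings):

```python
import pandas as pd

# Invented lookup table; all keys are present so both forms agree exactly.
lookup = {"f1.raw": "human", "f2.raw": "mouse"}
s = pd.Series(["f1.raw", "f2.raw", "f1.raw"])

slow = s.apply(lambda x: lookup.get(x))  # per-row Python function call
fast = s.map(lookup)                     # pandas-optimized dictionary lookup

assert slow.tolist() == fast.tolist() == ["human", "mouse", "human"]
```

One behavioral note: for keys missing from the dict, `map()` yields `NaN` while `lookup.get` yields `None`, so the two are only drop-in equivalent when every key is present.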
101-102: Correct usage of pre-computed basename.

Using `basename_parquet` here correctly applies the optimization by avoiding repeated Path operations within the groupby loop.
```python
# Build similarity matrix - optimized for symmetric matrix
# Only calculate upper triangle and mirror to lower triangle
n = len(spectra_keys)
sim_matrix = np.zeros((n, n))
spectra_list = [cluster_spectra[key] for key in spectra_keys]

# Calculate upper triangle including diagonal
for i in range(n):
    for j in range(i, n):
        similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = similarity
        if i != j:  # Mirror to lower triangle (skip diagonal)
            sim_matrix[j, i] = similarity
```
🧩 Analysis chain
🏁 Script executed:
```shell
# Find spectrum_utils module and dot function
find . -type f -name "*spectrum_utils*" | head -20
```

Repository: bigbio/pyspectrafuse
Length of output: 46
🏁 Script executed:
```shell
# Search for the dot function definition
rg -n "def dot" --type=py
```

Length of output: 175
🏁 Script executed:
```shell
# Find the compare_spectra method in most_similar_strategy.py
rg -n "def compare_spectra" --type=py
```

Length of output: 46
🏁 Script executed:
```shell
# Check for any imports of dot or similarity functions in most_similar_strategy.py
rg -n "import.*dot|from.*dot" pyspectrafuse/consensus_strategy/most_similar_strategy.py
```

Length of output: 122
🏁 Script executed:
```shell
# Read the dot function implementation in metrics.py
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -60
```

Length of output: 2205
🏁 Script executed:
```shell
# Find the compare_spectra method
rg -n -A15 "def compare_spectra" pyspectrafuse/consensus_strategy/most_similar_strategy.py
```

Length of output: 46
🏁 Script executed:
```shell
# Read more of metrics.py to see the full _dot implementation
cat -n pyspectrafuse/consensus_strategy/metrics.py | head -100
```

Length of output: 3844
🏁 Script executed:
```shell
# Search for compare_spectra in the entire codebase
rg -n "compare_spectra" --type=py
```

Length of output: 334
🏁 Script executed:
```shell
# Read lines around compare_spectra assignment in most_similar_strategy.py
sed -n '30,45p' pyspectrafuse/consensus_strategy/most_similar_strategy.py
```

Length of output: 986
🏁 Script executed:
```shell
# Also read the full metrics.py to understand normalization and get complete picture
wc -l pyspectrafuse/consensus_strategy/metrics.py
```

Length of output: 111
🏁 Script executed:
```shell
# Read the full _dot implementation and surrounding code
cat -n pyspectrafuse/consensus_strategy/metrics.py | sed -n '52,100p'
```

Length of output: 2073
🏁 Script executed:
```shell
# Look for test files that might test the symmetry of the dot function
find . -type f -name "*test*.py" | xargs grep -l "dot\|symmetry" | head -5
```

Length of output: 46
🏁 Script executed:
```shell
# Check if there are any comments or documentation about symmetry assumptions
rg -n "symmetric|symmetry" pyspectrafuse/consensus_strategy/ --type=py
```

Length of output: 191
🏁 Script executed:
```shell
# Create a simple test to demonstrate the asymmetry
python3 << 'EOF'
# Demonstrate why the forward-pointer algorithm is asymmetric
# Simulate the _dot algorithm
def simulate_dot(mz1, intensity1, mz2, intensity2, tol):
    """Simulate the _dot function behavior"""
    fragment_other_i = 0
    score = 0.0
    for fragment_i in range(len(mz1)):
        # Advance pointer while mz2[pointer] is too small
        while (fragment_other_i < len(mz2) - 1 and
               mz2[fragment_other_i] < mz1[fragment_i] - tol):
            fragment_other_i += 1
        # Check if current peaks match
        if (abs(mz1[fragment_i] - mz2[fragment_other_i]) <= tol and
                fragment_other_i < len(mz2)):
            score += intensity1[fragment_i] * intensity2[fragment_other_i]
            fragment_other_i += 1
    return score

import numpy as np

# Test case showing asymmetry
mz1 = np.array([100.0, 200.0, 300.0])
intensity1 = np.array([1.0, 1.0, 1.0])
mz2 = np.array([125.0, 275.0])
intensity2 = np.array([1.0, 1.0])
tol = 50.0

score1 = simulate_dot(mz1, intensity1, mz2, intensity2, tol)
score2 = simulate_dot(mz2, intensity2, mz1, intensity1, tol)
print(f"dot(spec1, spec2) = {score1}")
print(f"dot(spec2, spec1) = {score2}")
print(f"Symmetric? {score1 == score2}")
EOF
```

Length of output: 150
Critical bug: Dot product is not symmetric—do not mirror the matrix.
The similarity metric is asymmetric due to the one-way pointer algorithm in _dot(). Calling dot(spec_a, spec_b) does not equal dot(spec_b, spec_a) because the pointer-based matching depends on which spectrum's peaks are iterated first. Mirroring the upper triangle produces an incorrect similarity matrix and will lead to wrong consensus spectrum selection.
Remove the mirroring logic and compute the full n×n matrix:
```python
for i in range(n):
    for j in range(n):
        similarity = self.compare_spectra(spectra_list[i], spectra_list[j])
        sim_matrix[i, j] = similarity
```

🤖 Prompt for AI Agents
In @pyspectrafuse/consensus_strategy/most_similar_strategy.py around lines 99 -
111, The similarity matrix construction incorrectly mirrors the upper triangle
despite the similarity function being asymmetric (the pointer-based _dot() makes
dot(a,b) != dot(b,a)), so remove the mirroring and compute the full n×n matrix
by calling compare_spectra for every (i,j) pair; update the loops that build
sim_matrix (references: sim_matrix, spectra_list, compare_spectra, _dot) to
iterate j from 0..n-1 and assign sim_matrix[i, j] = similarity for every pair
without setting sim_matrix[j, i].
Actionable comments posted: 1
🤖 Fix all issues with AI Agents
In @.github/workflows/pyspectrafuse-containers.yml:
- Around line 114-124: The "Login and Deploy Container" step has two issues: its
if condition omits the same push_images guard used in the other Singularity
steps so it will run on master pushes even when push_images is false, and piping
the token via echo risks exposing GHCR_TOKEN in logs. Fix by changing the step's
if to match the other steps (include the push_images check used elsewhere, e.g.,
ensure the if includes github.event.inputs.push_images == 'true' or the same
expression used in lines 99/105/109) and replace the echo pipe with a safe
--password-stdin usage that doesn't print the token (e.g., feed the secret via a
heredoc or pass it via an environment variable read by singularity's
--password-stdin), keeping the rest of the pushes (IS_RELEASE/IS_MASTER logic)
intact.
🧹 Nitpick comments (3)
Dockerfile (2)
26-26: Prefer regular install over editable mode in production containers.

Using `pip install -e .` (editable/development mode) in a Docker image is unnecessary and can cause issues. Editable installs create symlinks to the source, which is useful during development but adds no value in a container where the code won't change. Use a regular install instead.

🔎 Proposed fix

```diff
 # Install the package
-RUN pip install --no-cache-dir -e .
+RUN pip install --no-cache-dir .
```
1-32: Consider adding a non-root user for improved security.

The container runs as root by default. While this is acceptable for many CLI tools, adding a non-root user is a security best practice, especially if the container might be used in production environments or pipelines.

🔎 Optional enhancement

```diff
 # Install the package
 RUN pip install --no-cache-dir .

+# Create non-root user
+RUN useradd --create-home --shell /bin/bash appuser
+USER appuser
+
 # Set the entrypoint
 ENTRYPOINT ["pyspectrafuse_cli"]
```

.github/workflows/pyspectrafuse-containers.yml (1)
62-73: Version extraction may fail silently if pattern doesn't match.

If the `pyproject.toml` format changes or the `version` line is missing, `grep` and `cut` will produce empty output without error. Consider adding validation.

🔎 Proposed fix with validation

```diff
 - name: Extract version from pyproject.toml
   id: version
   run: |
     VERSION=$(grep '^version = ' pyproject.toml | cut -d'"' -f2)
+    if [ -z "$VERSION" ]; then
+      echo "::error::Failed to extract version from pyproject.toml"
+      exit 1
+    fi
     echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
     echo "Extracted version: $VERSION"
     # Extract major and minor version
     MAJOR=$(echo $VERSION | cut -d'.' -f1)
     MINOR=$(echo $VERSION | cut -d'.' -f2)
     echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
     echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
     echo "Major.Minor: $MAJOR.$MINOR"
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/pyspectrafuse-containers.yml
- Dockerfile
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (macos-latest, 3.8)
🔇 Additional comments (3)
.github/workflows/pyspectrafuse-containers.yml (3)
55-60: Verify the need for a custom `GHCR_TOKEN` secret.

GitHub Actions provides `GITHUB_TOKEN` with `packages: write` permission (already declared on line 32) that can authenticate to GHCR. Using `secrets.GITHUB_TOKEN` is simpler and doesn't require manually creating and maintaining a separate secret.

If a custom token is intentionally required (e.g., for cross-repo access), please disregard. Otherwise, consider:

🔎 Proposed simplification

```diff
 - name: Log in to GitHub Container Registry
   uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef
   with:
     registry: ghcr.io
     username: ${{ github.actor }}
-    password: ${{ secrets.GHCR_TOKEN }}
+    password: ${{ secrets.GITHUB_TOKEN }}
```
98-112: Singularity pulls from the registry, so the local Docker cleanup is safe.

Line 93 removes the Docker image (`docker image rm ... || true`) before the Singularity conversion step attempts to pull it from the registry on line 111. For PRs, the image is never pushed (line 79), so `docker://ghcr.io/bigbio/pyspectrafuse:latest` wouldn't exist remotely and the Singularity build would fail. However, since the Singularity steps are skipped for PRs (line 99 condition), this is consistent. Just noting that the cleanup on line 93 is safe because Singularity pulls from the remote registry, not local Docker.
3-24: Workflow triggers are well-structured.

The trigger configuration correctly handles multiple scenarios: master pushes, PRs with relevant path filters, releases, and manual dispatch with configurable options. The path filters for PRs (`Dockerfile`, `.github/workflows/**`) appropriately limit when the workflow runs.
```yaml
- name: Login and Deploy Container
  if: (github.event_name != 'pull_request')
  env:
    IS_RELEASE: ${{ github.event_name == 'release' }}
    IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
  run: |
    echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
    singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
    if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
      singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
    fi
```
Inconsistent condition and potential secret exposure risk.

Two issues here:

1. The `if` condition on line 115 differs from the other Singularity steps (lines 99, 105, 109), which also check `push_images`. This step will run on a master push even if `push_images` is explicitly set to `false` via workflow_dispatch.
2. Line 120 pipes the token through `echo`, which could expose it in debug logs. Use `--password-stdin` with a heredoc or environment variable instead.
🔎 Proposed fix

```diff
 - name: Login and Deploy Container
-  if: (github.event_name != 'pull_request')
+  if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
   env:
     IS_RELEASE: ${{ github.event_name == 'release' }}
     IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
+    GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
   run: |
-    echo ${{ secrets.GHCR_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
+    singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
     singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
```
singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
- name: Login and Deploy Container
  if: ${{ (github.event.inputs.push_images == true || github.event.inputs.push_images == '') && (github.event_name != 'pull_request') }}
  env:
    IS_RELEASE: ${{ github.event_name == 'release' }}
    IS_MASTER: ${{ github.ref == 'refs/heads/master' && github.event_name == 'push' }}
    GHCR_TOKEN: ${{ secrets.GHCR_TOKEN }}
  run: |
    singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io <<< "$GHCR_TOKEN"
    singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:${{ steps.version.outputs.VERSION }}
    if [[ "${{ env.IS_RELEASE }}" == "true" || "${{ env.IS_MASTER }}" == "true" ]]; then
      singularity push pyspectrafuse.sif oras://ghcr.io/bigbio/pyspectrafuse-sif:latest
    fi
```
🤖 Prompt for AI Agents
In @.github/workflows/pyspectrafuse-containers.yml around lines 114 - 124, The
"Login and Deploy Container" step has two issues: its if condition omits the
same push_images guard used in the other Singularity steps so it will run on
master pushes even when push_images is false, and piping the token via echo
risks exposing GHCR_TOKEN in logs. Fix by changing the step's if to match the
other steps (include the push_images check used elsewhere, e.g., ensure the if
includes github.event.inputs.push_images == 'true' or the same expression used
in lines 99/105/109) and replace the echo pipe with a safe --password-stdin
usage that doesn't print the token (e.g., feed the secret via a heredoc or pass
it via an environment variable read by singularity's --password-stdin), keeping
the rest of the pushes (IS_RELEASE/IS_MASTER logic) intact.
PR Type
Enhancement
Description
- Replace pandas `apply()` with vectorized operations for 10-50x speedup
  - `+` operator instead of lambda functions
  - `map()` for dictionary lookups instead of `apply()`
  - `.str` accessor for string operations on Series
- Optimize numerical computations in binning and similarity calculations
- Remove duplicate code blocks and improve code clarity
- Add logging for MSP file generation completion
Diagram Walkthrough
File Walkthrough
cluster_res_handler.py
Vectorize string operations in cluster path construction
pyspectrafuse/cluster_parquet_combine/cluster_res_handler.py

- Replaced `apply()` calls with vectorized string concatenation operations
- Used the `+` operator for string operations instead of lambda functions
- Converted the `index` column to string using `astype(str)` for concatenation

combine_cluster_and_parquet.py
Optimize dictionary mapping and string operations
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py

- Replaced `apply()` with `map()` for dictionary lookups (10-50x speedup)
- Used the `.str.split()` accessor for vectorized USI parsing
- Pre-computed the basename to avoid repeated Path operations in the loop
spectrum2msp.py
Add completion logging for MSP generation
pyspectrafuse/commands/spectrum2msp.py
sdrf_utils.py
Vectorize string and data type operations
pyspectrafuse/common/sdrf_utils.py

- Replaced `apply()` with `.str.split().str[0]` for vectorized string splitting
- Replaced `apply(lambda x: list(x), axis=1)` with `.values.tolist()` for vectorized conversion
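The second replacement can be sketched like this (column names are invented; the real ones come from the SDRF table):

```python
import pandas as pd

# Invented two-column frame standing in for the SDRF data.
df = pd.DataFrame({"organism": ["human", "mouse"],
                   "instrument": ["orbitrap", "tof"]})

slow = df.apply(lambda x: list(x), axis=1).tolist()  # per-row lambda
fast = df.values.tolist()                            # direct array conversion

assert slow == fast == [["human", "orbitrap"], ["mouse", "tof"]]
```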
binning_strategy.py
Optimize bin calculation with pre-computed constants
pyspectrafuse/consensus_strategy/binning_strategy.py

- Pre-computed the `min_mz_div_bin_size` constant to avoid repeated division in the loop
most_similar_strategy.py
Optimize similarity matrix computation with caching
pyspectrafuse/consensus_strategy/most_similar_strategy.py
parquet2mgf.py
Cache basename for repeated file path construction
pyspectrafuse/mgf_convert/parquet2mgf.py

- Pre-computed the basename to avoid repeated Path operations
- Reused the cached basename when constructing output paths with `file_index`
average_spectrum_strategy.py
Remove duplicate code in consensus aggregation
pyspectrafuse/consensus_strategy/average_spectrum_strategy.py

- Removed a duplicate return statement
Summary by CodeRabbit
Performance Improvements
Refactor
Logging
Chores