Skip to content

minor changes#7

Merged
ypriverol merged 4 commits intomainfrom
dev
Jan 7, 2026
Merged

minor changes#7
ypriverol merged 4 commits intomainfrom
dev

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented Jan 7, 2026

PR Type

Bug fix, Tests


Description

  • Fix MGF file batch concatenation by adding separators between appends

  • Ensure proper newline handling to prevent parser failures

  • Add trailing newlines to prevent malformed batch boundaries

  • Clean up trailing whitespace in test files


Diagram Walkthrough

flowchart LR
  A["Parquet2Mgf.write2mgf"] --> B["Check if file exists and has content"]
  B --> C["Add blank line separator if needed"]
  C --> D["Write spectrum content"]
  D --> E["Ensure file ends with newline"]
  E --> F["Prevent malformed batch boundaries"]
Loading

File Walkthrough

Relevant files
Bug fix
parquet2mgf.py
Add batch separator and newline handling logic                     

pyspectrafuse/mgf_convert/parquet2mgf.py

  • Added logic to detect existing file content and insert blank line
    separators between batches
  • Implemented newline validation to ensure write_content ends with
    newline character
  • Added comprehensive comments explaining the batch concatenation issue
    and solution
  • Prevents "END IONSBEGIN IONS" malformation that breaks strict MGF
    parsers
+10/-0   
Formatting
test_msp_utils.py
Clean up trailing whitespace                                                         

tests/test_msp_utils.py

  • Removed trailing blank lines at end of file
+2/-0     
test_parquet_utils.py
Clean up trailing whitespace                                                         

tests/test_parquet_utils.py

  • Removed trailing blank lines at end of file
+2/-0     
test_sdrf_utils.py
Clean up trailing whitespace                                                         

tests/test_sdrf_utils.py

  • Removed trailing blank lines at end of file
+2/-0     
tests.yml
Clean up trailing whitespace                                                         

.github/workflows/tests.yml

  • Removed trailing blank lines at end of file
+2/-0     
README.md
Clean up trailing whitespace                                                         

tests/README.md

  • Removed trailing blank lines at end of file
+2/-0     

Summary by CodeRabbit

  • Bug Fixes

    • Ensure proper separators when appending spectral data so consecutive writes to the same MGF file remain parseable.
    • Robust handling of multi-valued sample metadata when combining cluster and parquet data to avoid failures with list-valued fields.
    • Updated MSP output naming: generated MSP files now use a .msp extension (was .msp.txt).
  • Chores

    • Minor formatting and trailing-whitespace tidy-up across tests and CI docs.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

Adds safe batch-appending to MGF output and robust sample_info handling for MGF path construction; other edits are whitespace-only formatting changes across CI and test files.

Changes

Cohort / File(s) Summary
Batch append safety
pyspectrafuse/mgf_convert/parquet2mgf.py
write2mgf now detects an existing non-empty target and inserts a separator newline before appending; it also ensures appended content ends with a newline so subsequent appends don't merge END IONS/BEGIN IONS.
Sample-info path handling
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
Constructs mgf_file_path using sample_info.apply(...) that joins list-valued sample_info entries with '/' (stringifies non-lists) to avoid errors when sample_info elements are lists.
MSP filename extension
pyspectrafuse/commands/spectrum2msp.py
Changed MSP output filename extension from .msp.txt to .msp in output path construction.
Formatting & whitespace
.github/workflows/tests.yml, tests/README.md, tests/test_msp_utils.py, tests/test_parquet_utils.py, tests/test_sdrf_utils.py
Added trailing blank lines / minor whitespace changes only; no functional changes.

Sequence Diagram(s)

(omitted — changes are local/function-level and do not introduce multi-component sequential flows requiring visualization)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through code with nimble paws,
I stitched the MGF with tidy laws,
A newline placed to keep apart,
Lists joined neatly, play their part —
Hooray for neat and careful draws!

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'minor changes' is vague and generic, failing to communicate the main change despite the PR fixing critical MGF file batch concatenation issues and parser compatibility problems. Revise the title to clearly describe the primary change, such as 'Fix MGF batch concatenation by adding blank-line separators between appended files' or 'Ensure proper MGF batch boundaries to prevent parser failures'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Jan 7, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Path traversal write

Description: write2mgf appends to an arbitrary target_path without validation/allowlisting, so if
target_path can be influenced by untrusted input it could enable path
traversal/overwriting or creating files in unintended locations (e.g.,
../../.ssh/authorized_keys or other sensitive paths) and would also follow symlinks when
opening in append mode.
parquet2mgf.py [20-33]

Referred Code
def write2mgf(target_path: str, write_content: str):
    # NOTE: This function may be called multiple times for the same target file (parquet batches).
    # Always ensure we separate batches with blank lines, otherwise we can end up with:
    #   END IONSBEGIN IONS
    # which breaks strict MGF parsers (e.g. MaRaCluster).
    needs_separator = os.path.exists(target_path) and os.path.getsize(target_path) > 0
    with open(target_path, 'a') as f:
        logger.info(f"Writing spectrum to MGF file at path: {target_path}")
        if needs_separator:
            f.write("\n\n")
        f.write(write_content)
        # Ensure file ends with a newline to make subsequent appends safe.
        if write_content and not write_content.endswith("\n"):
            f.write("\n")
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit context: The new file-write log entry records the path but does not include user identity or an
explicit success/failure outcome, so it may be insufficient as an audit trail depending on
whether this action is considered critical in this system.

Referred Code
# NOTE: This function may be called multiple times for the same target file (parquet batches).
# Always ensure we separate batches with blank lines, otherwise we can end up with:
#   END IONSBEGIN IONS
# which breaks strict MGF parsers (e.g. MaRaCluster).
needs_separator = os.path.exists(target_path) and os.path.getsize(target_path) > 0
with open(target_path, 'a') as f:
    logger.info(f"Writing spectrum to MGF file at path: {target_path}")
    if needs_separator:
        f.write("\n\n")
    f.write(write_content)
    # Ensure file ends with a newline to make subsequent appends safe.
    if write_content and not write_content.endswith("\n"):
        f.write("\n")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unhandled I/O failures: The new append logic performs filesystem operations (exists, getsize, open, write) without
handling expected I/O error cases (e.g., permission issues, missing directories, partial
writes) or emitting actionable context on failure.

Referred Code
needs_separator = os.path.exists(target_path) and os.path.getsize(target_path) > 0
with open(target_path, 'a') as f:
    logger.info(f"Writing spectrum to MGF file at path: {target_path}")
    if needs_separator:
        f.write("\n\n")
    f.write(write_content)
    # Ensure file ends with a newline to make subsequent appends safe.
    if write_content and not write_content.endswith("\n"):
        f.write("\n")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Path logged in info: The new logger.info message logs the full filesystem path which may be sensitive in some
environments (e.g., usernames in home paths), so it should be verified that this is
acceptable and does not expose PII/PHI/secrets via logs.

Referred Code
logger.info(f"Writing spectrum to MGF file at path: {target_path}")
if needs_separator:

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated file path: The new code writes to target_path without validating or constraining the path (e.g.,
allowed directory, traversal protection), which could be a security issue if target_path
can be influenced by untrusted input.

Referred Code
def write2mgf(target_path: str, write_content: str):
    # NOTE: This function may be called multiple times for the same target file (parquet batches).
    # Always ensure we separate batches with blank lines, otherwise we can end up with:
    #   END IONSBEGIN IONS
    # which breaks strict MGF parsers (e.g. MaRaCluster).
    needs_separator = os.path.exists(target_path) and os.path.getsize(target_path) > 0
    with open(target_path, 'a') as f:
        logger.info(f"Writing spectrum to MGF file at path: {target_path}")
        if needs_separator:
            f.write("\n\n")
        f.write(write_content)
        # Ensure file ends with a newline to make subsequent appends safe.
        if write_content and not write_content.endswith("\n"):
            f.write("\n")

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Jan 7, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix logic to prevent extra newlines
Suggestion Impact:The commit changed the separator written when appending from "\n\n" to "\n", reducing the extra blank line between batches and addressing the newline-spacing issue, though it did not implement the full refactor proposed in the suggestion.

code diff:

         with open(target_path, 'a') as f:
             logger.info(f"Writing spectrum to MGF file at path: {target_path}")
             if needs_separator:
-                f.write("\n\n")
+                f.write("\n")
             f.write(write_content)

Refactor the file writing logic to prevent adding extra newlines between
batches. The current implementation can result in three newlines instead of the
intended two.

pyspectrafuse/mgf_convert/parquet2mgf.py [25-33]

-needs_separator = os.path.exists(target_path) and os.path.getsize(target_path) > 0
-with open(target_path, 'a') as f:
+# Ensure the content to be written ends with a double newline for separation.
+# This simplifies the logic and ensures consistent spacing between batches.
+if not write_content.endswith("\n\n"):
+    if write_content.endswith("\n"):
+        write_content += "\n"
+    else:
+        write_content += "\n\n"
+
+# If the file is empty, we write directly. If not, we prepend a separator.
+# However, since we now always end with \n\n, we can just append.
+# The first write will have a trailing blank line, and subsequent appends will follow.
+# To avoid a leading blank line on the first write, we can do this:
+
+mode = 'a' if os.path.exists(target_path) and os.path.getsize(target_path) > 0 else 'w'
+
+with open(target_path, mode) as f:
     logger.info(f"Writing spectrum to MGF file at path: {target_path}")
-    if needs_separator:
-        f.write("\n\n")
     f.write(write_content)
-    # Ensure file ends with a newline to make subsequent appends safe.
-    if write_content and not write_content.endswith("\n"):
-        f.write("\n")

[Suggestion processed]

Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies a logic flaw in the PR where an extra blank line (three newlines) is inserted between batches, and proposes a valid fix to ensure consistent spacing.

Low
  • Update

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes MGF file batch concatenation issues by adding proper separators between batch appends to prevent malformed "END IONSBEGIN IONS" sequences that break strict MGF parsers. The changes also include formatting improvements to ensure files end with proper newlines.

Key Changes:

  • Added batch separator logic to prevent concatenation errors in MGF file generation
  • Implemented newline validation to ensure proper file endings
  • Cleaned up trailing whitespace in test files and configuration

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pyspectrafuse/mgf_convert/parquet2mgf.py Added separator and newline handling logic to prevent malformed MGF batch boundaries
tests/test_sdrf_utils.py Added trailing newlines for proper file formatting
tests/test_parquet_utils.py Added trailing newlines for proper file formatting
tests/test_msp_utils.py Added trailing newlines for proper file formatting
tests/README.md Added trailing newlines for proper file formatting
.github/workflows/tests.yml Added trailing newlines for proper file formatting

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyspectrafuse/mgf_convert/parquet2mgf.py Outdated
Comment thread pyspectrafuse/mgf_convert/parquet2mgf.py
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py:
- Around line 94-95: Validate mapping results from
filenames.map(sample_info_dict) before building paths: check sample_info for NaN
entries (e.g., sample_info.isna()) and either raise a clear error listing
missing filenames or drop those rows from mgf_group_df; then build mgf_file_path
using the now-verified sample_info (join list elements with '/'). Reference the
variables sample_info, sample_info_dict, filenames, and mgf_group_df so the
check happens immediately after the mapping and before the lambda that
constructs mgf_file_path.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35714cd and 8fb05c2.

📒 Files selected for processing (1)
  • pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: test (ubuntu-latest, 3.9)
  • GitHub Check: test (ubuntu-latest, 3.8)
  • GitHub Check: test (macos-latest, 3.11)
  • GitHub Check: test (ubuntu-latest, 3.10)
  • GitHub Check: test (macos-latest, 3.8)
  • GitHub Check: test (macos-latest, 3.10)
  • GitHub Check: test (macos-latest, 3.9)
  • GitHub Check: test (ubuntu-latest, 3.11)

Comment on lines +94 to +95
# sample_info_dict values are lists [species, instrument]; join them with '/'
mgf_group_df['mgf_file_path'] = sample_info.apply(lambda x: '/'.join(x) if isinstance(x, list) else str(x)) + '/' + charges + '/mgf files'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# First, let's understand the structure of the file and find sample_info_dict usage
echo "=== Examining combine_cluster_and_parquet.py around lines 94-95 ==="
head -120 pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | tail -40

echo -e "\n=== Searching for sample_info_dict definition ==="
rg -n 'sample_info_dict' --type=py -B2 -A2

echo -e "\n=== Searching for get_metadata_dict_from_sdrf function ==="
rg -n 'def get_metadata_dict_from_sdrf' --type=py -A30

echo -e "\n=== Searching for NaN/None handling related to sample_info ==="
rg -n 'isna|notna|dropna|fillna' --type=py | grep -i 'sample\|filename\|nan' | head -20

echo -e "\n=== Checking for validation of filenames ==="
rg -n 'sample_info_dict.*keys\|in sample_info_dict\|sample_info_dict.get' --type=py -B2 -A2

Repository: bigbio/pyspectrafuse

Length of output: 8314


🏁 Script executed:

#!/bin/bash

echo "=== Reading full sdrf_utils.py to see complete data processing ==="
cat -n pyspectrafuse/common/sdrf_utils.py

echo -e "\n=== Checking for validation before line 92 in combine_cluster_and_parquet.py ==="
head -92 pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | tail -30

echo -e "\n=== Searching for unit tests related to this functionality ==="
find . -name '*test*.py' -type f | xargs grep -l 'sample_info\|get_metadata_dict_from_sdrf\|combine_cluster' 2>/dev/null | head -10

echo -e "\n=== Checking parquet2mgf.py to see how it handles the same scenario ==="
sed -n '120,135p' pyspectrafuse/mgf_convert/parquet2mgf.py

Repository: bigbio/pyspectrafuse

Length of output: 6605


🏁 Script executed:

#!/bin/bash

echo "=== Checking test_sdrf_utils.py ==="
cat -n pyspectrafuse/../tests/test_sdrf_utils.py 2>/dev/null || find . -name 'test_sdrf_utils.py' -exec cat -n {} \;

echo -e "\n=== Checking if filenames are validated against sample_info_dict ==="
rg -n 'sample_info_dict.*keys\|in sample_info_dict\|isin.*sample_info' --type=py -B3 -A3

echo -e "\n=== Looking for any error handling or assertions related to filenames ==="
rg -n 'assert.*filename\|filename.*assert\|filename.*in\|filename.*valid' --type=py pyspectrafuse/cluster_parquet_combine/ pyspectrafuse/mgf_convert/ -B2 -A2

echo -e "\n=== Checking USI parsing and filename extraction ==="
rg -n 'get_filename_from_usi' --type=py -B2 -A10

Repository: bigbio/pyspectrafuse

Length of output: 4251


🏁 Script executed:

#!/bin/bash

echo "=== Check if there's any filtering or validation of filenames before mapping ==="
sed -n '60,95p' pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py

echo -e "\n=== Check if parquet data is validated before processing ==="
rg -n 'dropna|notna|fillna|isnull|notnull' --type=py pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py

echo -e "\n=== Compare with parquet2mgf.py structure around same operation ==="
sed -n '110,135p' pyspectrafuse/mgf_convert/parquet2mgf.py

echo -e "\n=== Check if there's error handling around the mgf_file_path construction ==="
rg -n 'try|except|raise' --type=py pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | head -20

Repository: bigbio/pyspectrafuse

Length of output: 4027


Validate filenames against sample_info_dict before mapping.

The lambda function prevents crashes better than the .get() approach in parquet2mgf.py, but creates malformed paths when filenames don't map. When filenames.map(sample_info_dict) encounters unmapped keys, it returns NaN. The isinstance(x, list) check converts NaN to the string 'nan', resulting in paths like 'nan/charge1/mgf files' instead of raising an error or filtering bad data.

Add validation to ensure all filenames extracted from the USI column exist in sample_info_dict before the mapping operation, or explicitly handle and exclude unmapped entries.

🤖 Prompt for AI Agents
In @pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py around
lines 94 - 95, Validate mapping results from filenames.map(sample_info_dict)
before building paths: check sample_info for NaN entries (e.g.,
sample_info.isna()) and either raise a clear error listing missing filenames or
drop those rows from mgf_group_df; then build mgf_file_path using the
now-verified sample_info (join list elements with '/'). Reference the variables
sample_info, sample_info_dict, filenames, and mgf_group_df so the check happens
immediately after the mapping and before the lambda that constructs
mgf_file_path.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @pyspectrafuse/commands/spectrum2msp.py:
- Line 181: Tests are still expecting the old .msp.txt extension while
production now writes .msp; update the test paths in tests/test_msp_utils.py
(references around the two assertions at the locations mentioned) to use the
.msp extension instead of .msp.txt so they match the output created by
spectrum2msp where output = output_dir / f"{basename}_{uuid.uuid4()}.msp".
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6004a1 and d2d4e45.

📒 Files selected for processing (1)
  • pyspectrafuse/commands/spectrum2msp.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: test (macos-latest, 3.10)
  • GitHub Check: test (ubuntu-latest, 3.8)
  • GitHub Check: test (macos-latest, 3.8)
  • GitHub Check: test (macos-latest, 3.11)
  • GitHub Check: test (ubuntu-latest, 3.11)
  • GitHub Check: test (ubuntu-latest, 3.9)
  • GitHub Check: test (macos-latest, 3.9)
  • GitHub Check: test (ubuntu-latest, 3.10)

Comment thread pyspectrafuse/commands/spectrum2msp.py
@ypriverol ypriverol merged commit d1ed947 into main Jan 7, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants