Walkthrough
Adds safe batch-appending to MGF output and robust sample_info handling for MGF path construction; other edits are whitespace-only formatting changes across CI and test files.
Sequence Diagram(s): omitted (changes are local/function-level and do not introduce multi-component sequential flows requiring visualization).
Estimated code review effort: 3 (Moderate), ~20 minutes
Pull request overview
This PR fixes MGF file batch concatenation issues by adding proper separators between batch appends to prevent malformed "END IONSBEGIN IONS" sequences that break strict MGF parsers. The changes also include formatting improvements to ensure files end with proper newlines.
Key Changes:
- Added batch separator logic to prevent concatenation errors in MGF file generation
- Implemented newline validation to ensure proper file endings
- Cleaned up trailing whitespace in test files and configuration
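The separator logic summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in parquet2mgf.py: the function name `append_mgf_batch` and its arguments are hypothetical, and it assumes each batch arrives as one string of complete `BEGIN IONS` ... `END IONS` blocks.

```python
import os
from pathlib import Path


def append_mgf_batch(mgf_path: Path, batch_text: str) -> None:
    """Append one batch of MGF text so that batches stay newline-separated.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    if not batch_text:
        return

    # If the file already has content that does not end in a newline,
    # a separator is needed before the next "BEGIN IONS".
    needs_separator = False
    if mgf_path.exists() and mgf_path.stat().st_size > 0:
        with mgf_path.open("rb") as existing:
            existing.seek(-1, os.SEEK_END)  # inspect only the last byte
            needs_separator = existing.read(1) != b"\n"

    with mgf_path.open("a", encoding="utf-8", newline="\n") as out:
        if needs_separator:
            out.write("\n")
        out.write(batch_text)
        # Always leave the file ending with a newline for the next append.
        if not batch_text.endswith("\n"):
            out.write("\n")
```

Checking only the last byte keeps the append cheap even for large MGF files, and normalizing the trailing newline on every write means a later batch never lands directly after `END IONS`.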
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyspectrafuse/mgf_convert/parquet2mgf.py | Added separator and newline handling logic to prevent malformed MGF batch boundaries |
| tests/test_sdrf_utils.py | Added trailing newlines for proper file formatting |
| tests/test_parquet_utils.py | Added trailing newlines for proper file formatting |
| tests/test_msp_utils.py | Added trailing newlines for proper file formatting |
| tests/README.md | Added trailing newlines for proper file formatting |
| .github/workflows/tests.yml | Added trailing newlines for proper file formatting |
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py:
- Around line 94-95: Validate mapping results from
filenames.map(sample_info_dict) before building paths: check sample_info for NaN
entries (e.g., sample_info.isna()) and either raise a clear error listing
missing filenames or drop those rows from mgf_group_df; then build mgf_file_path
using the now-verified sample_info (join list elements with '/'). Reference the
variables sample_info, sample_info_dict, filenames, and mgf_group_df so the
check happens immediately after the mapping and before the lambda that
constructs mgf_file_path.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: test (ubuntu-latest, 3.9)
- GitHub Check: test (ubuntu-latest, 3.8)
- GitHub Check: test (macos-latest, 3.11)
- GitHub Check: test (ubuntu-latest, 3.10)
- GitHub Check: test (macos-latest, 3.8)
- GitHub Check: test (macos-latest, 3.10)
- GitHub Check: test (macos-latest, 3.9)
- GitHub Check: test (ubuntu-latest, 3.11)
# sample_info_dict values are lists [species, instrument]; join them with '/'
mgf_group_df['mgf_file_path'] = sample_info.apply(lambda x: '/'.join(x) if isinstance(x, list) else str(x)) + '/' + charges + '/mgf files'
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's understand the structure of the file and find sample_info_dict usage
echo "=== Examining combine_cluster_and_parquet.py around lines 94-95 ==="
head -120 pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | tail -40
echo -e "\n=== Searching for sample_info_dict definition ==="
rg -n 'sample_info_dict' --type=py -B2 -A2
echo -e "\n=== Searching for get_metadata_dict_from_sdrf function ==="
rg -n 'def get_metadata_dict_from_sdrf' --type=py -A30
echo -e "\n=== Searching for NaN/None handling related to sample_info ==="
rg -n 'isna|notna|dropna|fillna' --type=py | grep -i 'sample\|filename\|nan' | head -20
echo -e "\n=== Checking for validation of filenames ==="
rg -n 'sample_info_dict.*keys\|in sample_info_dict\|sample_info_dict.get' --type=py -B2 -A2
Repository: bigbio/pyspectrafuse
Length of output: 8314
🏁 Script executed:
#!/bin/bash
echo "=== Reading full sdrf_utils.py to see complete data processing ==="
cat -n pyspectrafuse/common/sdrf_utils.py
echo -e "\n=== Checking for validation before line 92 in combine_cluster_and_parquet.py ==="
head -92 pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | tail -30
echo -e "\n=== Searching for unit tests related to this functionality ==="
find . -name '*test*.py' -type f | xargs grep -l 'sample_info\|get_metadata_dict_from_sdrf\|combine_cluster' 2>/dev/null | head -10
echo -e "\n=== Checking parquet2mgf.py to see how it handles the same scenario ==="
sed -n '120,135p' pyspectrafuse/mgf_convert/parquet2mgf.py
Repository: bigbio/pyspectrafuse
Length of output: 6605
🏁 Script executed:
#!/bin/bash
echo "=== Checking test_sdrf_utils.py ==="
cat -n pyspectrafuse/../tests/test_sdrf_utils.py 2>/dev/null || find . -name 'test_sdrf_utils.py' -exec cat -n {} \;
echo -e "\n=== Checking if filenames are validated against sample_info_dict ==="
rg -n 'sample_info_dict.*keys\|in sample_info_dict\|isin.*sample_info' --type=py -B3 -A3
echo -e "\n=== Looking for any error handling or assertions related to filenames ==="
rg -n 'assert.*filename\|filename.*assert\|filename.*in\|filename.*valid' --type=py pyspectrafuse/cluster_parquet_combine/ pyspectrafuse/mgf_convert/ -B2 -A2
echo -e "\n=== Checking USI parsing and filename extraction ==="
rg -n 'get_filename_from_usi' --type=py -B2 -A10
Repository: bigbio/pyspectrafuse
Length of output: 4251
🏁 Script executed:
#!/bin/bash
echo "=== Check if there's any filtering or validation of filenames before mapping ==="
sed -n '60,95p' pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
echo -e "\n=== Check if parquet data is validated before processing ==="
rg -n 'dropna|notna|fillna|isnull|notnull' --type=py pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py
echo -e "\n=== Compare with parquet2mgf.py structure around same operation ==="
sed -n '110,135p' pyspectrafuse/mgf_convert/parquet2mgf.py
echo -e "\n=== Check if there's error handling around the mgf_file_path construction ==="
rg -n 'try|except|raise' --type=py pyspectrafuse/cluster_parquet_combine/combine_cluster_and_parquet.py | head -20
Repository: bigbio/pyspectrafuse
Length of output: 4027
Validate filenames against sample_info_dict before mapping.
The lambda function prevents crashes better than the .get() approach in parquet2mgf.py, but it creates malformed paths when filenames don't map. When filenames.map(sample_info_dict) encounters an unmapped key, it returns NaN. NaN is not a list, so the isinstance(x, list) check falls through to str(x), which yields the string 'nan'. The result is a path like 'nan/charge1/mgf files' instead of an error or a filtered-out row.
Add validation to ensure all filenames extracted from the USI column exist in sample_info_dict before the mapping operation, or explicitly handle and exclude unmapped entries.
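One way to do this check is sketched below, assuming pandas and the raise-on-missing option rather than dropping rows. The function name `build_mgf_file_paths` is hypothetical, and the sketch assumes `filenames` and `charges` are string Series sharing an index and that every value in `sample_info_dict` is a list of strings.

```python
import pandas as pd


def build_mgf_file_paths(
    filenames: pd.Series, charges: pd.Series, sample_info_dict: dict
) -> pd.Series:
    """Build '<species>/<instrument>/<charge>/mgf files' paths.

    Hypothetical helper for illustration: it fails fast when a filename
    has no entry in sample_info_dict instead of emitting 'nan/...' paths.
    """
    sample_info = filenames.map(sample_info_dict)

    # Unmapped filenames come back as NaN; report them before building paths.
    missing = sample_info.isna()
    if missing.any():
        unmapped = sorted(filenames[missing].unique())
        raise ValueError(f"Filenames missing from sample_info_dict: {unmapped}")

    # Every entry is now a list such as [species, instrument].
    joined = sample_info.apply(lambda parts: "/".join(parts))
    return joined + "/" + charges + "/mgf files"
```

If skipping unmapped spectra is preferable to aborting, the same `missing` mask could instead be used to filter the rows before the paths are built.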
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @pyspectrafuse/commands/spectrum2msp.py:
- Line 181: Tests are still expecting the old .msp.txt extension while
production now writes .msp; update the test paths in tests/test_msp_utils.py
(references around the two assertions at the locations mentioned) to use the
.msp extension instead of .msp.txt so they match the output created by
spectrum2msp where output = output_dir / f"{basename}_{uuid.uuid4()}.msp".
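A check along these lines would catch the extension mismatch. It is only a sketch: `build_msp_output_path` is a hypothetical helper wrapping the expression quoted above, the directory and basename are example values, and the actual assertions in tests/test_msp_utils.py may be structured differently.

```python
import uuid
from pathlib import Path


def build_msp_output_path(output_dir: Path, basename: str) -> Path:
    # Hypothetical helper mirroring the expression quoted in the comment.
    return output_dir / f"{basename}_{uuid.uuid4()}.msp"


# Example values only; they are not taken from the repository.
output = build_msp_output_path(Path("out"), "example")
assert output.suffix == ".msp"
assert not output.name.endswith(".msp.txt")
```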
📒 Files selected for processing (1)
pyspectrafuse/commands/spectrum2msp.py
PR Type
Bug fix, Tests
Description
- Fix MGF file batch concatenation by adding separators between appends
- Ensure proper newline handling to prevent parser failures
- Add trailing newlines to prevent malformed batch boundaries
- Clean up trailing whitespace in test files
File Walkthrough
| File | Change | Path |
|---|---|---|
| parquet2mgf.py | Add batch separator and newline handling logic | pyspectrafuse/mgf_convert/parquet2mgf.py |
| test_msp_utils.py | Clean up trailing whitespace | tests/test_msp_utils.py |
| test_parquet_utils.py | Clean up trailing whitespace | tests/test_parquet_utils.py |
| test_sdrf_utils.py | Clean up trailing whitespace | tests/test_sdrf_utils.py |
| tests.yml | Clean up trailing whitespace | .github/workflows/tests.yml |
| README.md | Clean up trailing whitespace | tests/README.md |