
Dev #20

Open
ypriverol wants to merge 10 commits into main from dev

Conversation


@ypriverol ypriverol commented Apr 13, 2026

Summary by CodeRabbit

  • Chores
    • Restructured continuous integration pipeline with enhanced three-stage validation process: syntax checking, test data validation, and pipeline execution testing
    • Test environment now uses embedded test data instead of external downloads for faster, more reliable execution
    • Reduced test execution timeout to 30 minutes for quicker feedback cycles
    • Simplified artifact collection and reporting for improved transparency

ypriverol and others added 7 commits April 13, 2026 21:38
- Add minimal QPX test dataset (50 spectra, 14 KB) in test_data/TEST001/
- Update test.config with default_species/default_instrument for QPX
- Fix SVG workflow diagram: wider spacing, no overlapping labels/badges
- Update .gitignore to allow test_data/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated CI workflow to include syntax validation and test data checks.
Agent-Logs-Url: https://github.com/bigbio/spectrafuse/sessions/955c5f4a-8a76-437f-b5c2-f9c5f8ba6f3d

Co-authored-by: ypriverol <52113+ypriverol@users.noreply.github.com>
Fix CI workflow YAML parse error in artifact upload step

coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@ypriverol has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 33 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 33 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c5d7c15-7c4b-4b27-a317-04e74ffbb8aa

📥 Commits

Reviewing files that changed from the base of the PR and between ce3c7dd and 83831d4.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml
📝 Walkthrough

Walkthrough

The CI pipeline is refactored from a single test job into a three-stage sequence: syntax validation, data validation, and pipeline execution. Singularity support and FTP-based test data downloads are removed. Test data is bundled with the repository, and Docker becomes the sole execution profile with a shortened 30-minute timeout.
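
As a rough illustration of the shape described above, a three-stage chain can be wired with `needs:`. This is a hypothetical sketch only; job names, paths, and commands here are assumptions, not the actual contents of `.github/workflows/ci.yml`:

```yaml
# Hypothetical skeleton of the three-stage CI chain; all names are illustrative.
jobs:
  syntax-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate pipeline syntax (preview mode)
        run: nextflow run main.nf -preview -profile test

  test-data-check:
    needs: syntax-check        # runs only after syntax validation passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate bundled parquet fixtures
        run: python validate_test_data.py   # illustrative script name

  pipeline-test:
    needs: test-data-check     # actual execution comes last
    runs-on: ubuntu-latest
    timeout-minutes: 30        # matches the shortened timeout described above
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline with Docker profile
        run: nextflow run main.nf -profile test,docker
```

The `needs:` keys give the fail-fast ordering the walkthrough describes: a syntax or data error stops the workflow before the expensive pipeline run starts.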

Changes

  • CI Pipeline Refactoring (.github/workflows/ci.yml): Replaced monolithic test job with three-stage pipeline: syntax-check (Nextflow -preview mode validation), test-data-check (Python/PyArrow column and content validation on .parquet files), and pipeline-test (actual Nextflow execution). Removed Singularity environment variables and conditional setup logic. Simplified artifact uploads: failure logs conditionally copied if execution traces exist, .nextflow.log always uploaded, success artifacts removed.
  • Configuration Updates (.gitignore, conf/tests/test.config): Removed test_data/ from ignored paths to bundle test data in the repository. Added params.default_species (Homo sapiens) and params.default_instrument (Q Exactive HF) to the test config. Updated the parquet_dir description to reference the bundled 50-spectra QPX dataset.
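
The test-config additions listed above can be pictured as a small Nextflow params block. This is a hedged sketch; the exact structure of conf/tests/test.config may differ:

```groovy
// Illustrative sketch of the conf/tests/test.config additions; the real file
// may organize these parameters differently.
params {
    // Bundled 50-spectra QPX dataset shipped in the repository
    parquet_dir        = "${projectDir}/test_data/TEST001"

    // Defaults applied when run/sample metadata does not specify them
    default_species    = 'Homo sapiens'
    default_instrument = 'Q Exactive HF'
}
```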

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Three jobs now bloom where once was one,
Syntax checked before the tests are run,
Data validated, Docker takes the stage,
No Singularity in this new age!
Bundled and bright, our pipeline's grown! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'Dev' is vague and generic, providing no meaningful information about the changeset's primary purpose or scope. Resolution: replace 'Dev' with a descriptive title that summarizes the main change, such as 'Refactor CI pipeline to separate syntax validation, data checks, and pipeline testing' or 'Update test infrastructure and configuration'.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




codacy-production bot commented Apr 13, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

💡 Tip: This summary will be updated as you push new changes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.gitignore (1)

175-180: ⚠️ Potential issue | 🟠 Major

*.parquet still keeps new bundled fixtures out of version control.

Line 177 still ignores every Parquet file, so adding or renaming anything under test_data/ will stay untracked even though this PR moves test fixtures into the repo.

Suggested fix

```diff
 # Data files - should not be in repository
 data/
 *.parquet
+!test_data/**/*.parquet

 # CI/CD reports
 reports/
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 175 - 180, The .gitignore entry '*.parquet' is
globally ignoring all Parquet files (so new test fixtures under test_data/
remain untracked); update .gitignore to stop excluding every .parquet file —
either remove the '*.parquet' pattern or replace it with a more specific pattern
(e.g. only ignore data/ or a specific fixtures path) and if you need to keep
most Parquets ignored, add a negation for the test fixtures (e.g. add a pattern
like '!test_data/**.parquet') so the project’s test fixtures are tracked; update
the file to remove the global '*.parquet' ignore and ensure test_data Parquet
files are no longer excluded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 37-53: The CI Nextflow runs for the jobs named "Validate pipeline
syntax (preview mode)" and "Validate incremental mode syntax" are bypassing the
test profile; update both nextflow run commands to enable the test profile
(e.g., add -profile test or -c conf/tests/test.config) so conf/tests/test.config
is loaded during CI rather than relying on ad-hoc overrides, and keep any
necessary lightweight overrides but do not remove or bypass the test profile
flag.
- Around line 84-92: The preflight currently only asserts presence of columns
for the `run` and `sample` pyarrow Tables but doesn't fail when those tables are
empty; update the checks around the `run` and `sample` variables to also assert
they contain rows (e.g., assert len(run) > 0 and assert len(sample) > 0 or use
run.num_rows/sample.num_rows) and include brief failure messages so the CI fails
fast if either parquet fixture is empty.
- Around line 136-145: The shell test using an unquoted glob ([ -f
test_results/pipeline_info/execution_trace*.txt ]) can expand to multiple args
and fail when several trace files exist; replace that check with a glob-safe
probe such as using compgen -G "test_results/pipeline_info/execution_trace*.txt"
>/dev/null || true (or enabling nullglob and checking for files), then run grep
against the matched files (execution_trace*.txt) as before and preserve the rest
of the logic that sets failed and iterates to copy logs in the "Gather failed
logs" step so the variable failed and the while-read loop continue to work
correctly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51a96497-2a4b-48cd-acf7-84ea414b9da1

📥 Commits

Reviewing files that changed from the base of the PR and between cecc27f and ce3c7dd.

⛔ Files ignored due to path filters (4)
  • docs/images/spectrafuse_workflow.svg is excluded by !**/*.svg
  • test_data/TEST001/TEST001.psm.parquet is excluded by !**/*.parquet
  • test_data/TEST001/TEST001.run.parquet is excluded by !**/*.parquet
  • test_data/TEST001/TEST001.sample.parquet is excluded by !**/*.parquet
📒 Files selected for processing (3)
  • .github/workflows/ci.yml
  • .gitignore
  • conf/tests/test.config

Comment on lines +84 to +92

```python
run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
assert 'run_file_name' in run.column_names
assert 'instrument' in run.column_names
print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
assert 'run_file_name' in sample.column_names
assert 'organism' in sample.column_names
print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```

⚠️ Potential issue | 🟡 Minor

Fail fast if the run or sample parquet fixtures are empty.

This preflight only checks row count for the PSM parquet. Empty metadata tables would still pass this job and then fail later in the actual pipeline run.

Suggested fix

```diff
           run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
           assert 'run_file_name' in run.column_names
           assert 'instrument' in run.column_names
+          assert len(run) > 0, 'Run table is empty'
           print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

           sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
           assert 'run_file_name' in sample.column_names
           assert 'organism' in sample.column_names
+          assert len(sample) > 0, 'Sample table is empty'
           print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
assert 'run_file_name' in run.column_names
assert 'instrument' in run.column_names
assert len(run) > 0, 'Run table is empty'
print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
assert 'run_file_name' in sample.column_names
assert 'organism' in sample.column_names
assert len(sample) > 0, 'Sample table is empty'
print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```
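
The fail-fast idea behind this suggestion can be demonstrated without PyArrow. The `Table` class below is a minimal stand-in (an assumption for illustration, not the real pyarrow API) so the column-and-row-count check can run anywhere:

```python
# Sketch of the review's "fail fast" pattern: require both the expected
# columns AND at least one row before the pipeline job is allowed to run.
# `Table` is a hypothetical stand-in for a pyarrow Table.
from dataclasses import dataclass, field


@dataclass
class Table:
    column_names: list
    rows: list = field(default_factory=list)

    def __len__(self):
        return len(self.rows)


def check_table(table, name, required_columns):
    """Raise AssertionError with a clear message on any missing column or empty table."""
    for col in required_columns:
        assert col in table.column_names, f"{name}: missing column {col!r}"
    assert len(table) > 0, f"{name} table is empty"


# A well-formed fixture passes silently.
run = Table(column_names=["run_file_name", "instrument"],
            rows=[("r1.mzML", "Q Exactive HF")])
check_table(run, "run", ["run_file_name", "instrument"])

# An empty metadata table is caught here, not later in the pipeline run.
empty_sample = Table(column_names=["run_file_name", "organism"])
try:
    check_table(empty_sample, "sample", ["run_file_name", "organism"])
except AssertionError as err:
    print(err)  # sample table is empty
```

Wiring the same three assertions into the CI preflight step (as the suggested diff does) turns a confusing downstream pipeline failure into an immediate, named error.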

Comment on lines +136 to +145
- name: Gather failed logs
if: failure() || cancelled()
run: |
echo "ARTIFACT_TIMESTAMP=$(date +%s)" >> $GITHUB_ENV

- uses: actions/upload-artifact@v4
mkdir -p failed_logs
if [ -f test_results/pipeline_info/execution_trace*.txt ]; then
failed=$(grep "FAILED" test_results/pipeline_info/execution_trace*.txt | cut -f 2 || true)
while read -r line ; do
[ -n "$line" ] && cp $(ls work/${line}*/*.log 2>/dev/null) failed_logs/ 2>/dev/null || true
done <<< "$failed"
fi

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT

mkdir -p "$tmpdir/test_results/pipeline_info"
touch \
  "$tmpdir/test_results/pipeline_info/execution_trace_1.txt" \
  "$tmpdir/test_results/pipeline_info/execution_trace_2.txt"

cd "$tmpdir"

set +e
[ -f test_results/pipeline_info/execution_trace*.txt ]
status=$?
set -e

echo "exit_status=$status"
echo "Expected: non-zero status because the glob expands to multiple paths."
```

Repository: bigbio/spectrafuse

Length of output: 260

🏁 Script executed:

```bash
# First, check if the file exists and retrieve the relevant section
cat -n .github/workflows/ci.yml | sed -n '130,150p'
```

Repository: bigbio/spectrafuse

Length of output: 1922

🏁 Script executed:

```bash
# Check for other problematic glob patterns in the workflow file
rg '\*' .github/workflows/ci.yml | head -20
```

Repository: bigbio/spectrafuse

Length of output: 344


Make the trace-file probe glob-safe.

Lines 140–141 use unquoted globs that break when multiple trace files exist. The [ -f test_results/pipeline_info/execution_trace*.txt ] test expands to multiple arguments, causing a shell syntax error and silently skipping failure-log collection.

Suggested fix

```diff
       - name: Gather failed logs
         if: failure() || cancelled()
         run: |
           mkdir -p failed_logs
-          if [ -f test_results/pipeline_info/execution_trace*.txt ]; then
-            failed=$(grep "FAILED" test_results/pipeline_info/execution_trace*.txt | cut -f 2 || true)
+          shopt -s nullglob
+          trace_files=(test_results/pipeline_info/execution_trace*.txt)
+          if (( ${#trace_files[@]} )); then
+            failed=$(grep "FAILED" "${trace_files[@]}" | cut -f 2 || true)
             while read -r line ; do
-              [ -n "$line" ] && cp $(ls work/${line}*/*.log 2>/dev/null) failed_logs/ 2>/dev/null || true
+              [ -n "$line" ] && cp work/${line}*/*.log failed_logs/ 2>/dev/null || true
             done <<< "$failed"
           fi
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
      - name: Gather failed logs
        if: failure() || cancelled()
        run: |
          mkdir -p failed_logs
          shopt -s nullglob
          trace_files=(test_results/pipeline_info/execution_trace*.txt)
          if (( ${#trace_files[@]} )); then
            failed=$(grep "FAILED" "${trace_files[@]}" | cut -f 2 || true)
            while read -r line ; do
              [ -n "$line" ] && cp work/${line}*/*.log failed_logs/ 2>/dev/null || true
            done <<< "$failed"
          fi
```
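
A small self-contained demonstration of why the unquoted-glob test fails and how the suggested glob-safe probes behave. The file names are made up for the demo:

```bash
#!/usr/bin/env bash
# Demo: `[ -f glob ]` breaks when the glob matches more than one file,
# while `compgen -G` and a nullglob array work correctly.
set -u

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT
mkdir -p "$tmpdir/pipeline_info"
touch "$tmpdir/pipeline_info/execution_trace_1.txt" \
      "$tmpdir/pipeline_info/execution_trace_2.txt"

cd "$tmpdir"

# Broken probe: the glob expands to two paths, so [ receives extra
# arguments and errors out, taking the else branch.
if [ -f pipeline_info/execution_trace*.txt ] 2>/dev/null; then
  echo "unquoted glob: matched"
else
  echo "unquoted glob: errored (expected with more than one match)"
fi

# Safe probe 1: compgen -G exits 0 iff the pattern matches anything.
if compgen -G "pipeline_info/execution_trace*.txt" >/dev/null; then
  echo "compgen: matched"
fi

# Safe probe 2: nullglob makes a non-matching glob expand to nothing,
# so the array length is a reliable existence check.
shopt -s nullglob
trace_files=(pipeline_info/execution_trace*.txt)
if (( ${#trace_files[@]} )); then
  echo "nullglob array: ${#trace_files[@]} files"
fi
```

Both safe probes also hand you the matched paths directly (`compgen -G` prints them; the array holds them), which is what the subsequent `grep "FAILED"` step needs.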
