
Dev #20

Open
ypriverol wants to merge 10 commits into main from dev

Conversation


@ypriverol ypriverol commented Apr 13, 2026

Summary by CodeRabbit

  • Chores
    • Restructured continuous integration pipeline with enhanced three-stage validation process: syntax checking, test data validation, and pipeline execution testing
    • Test environment now uses embedded test data instead of external downloads for faster, more reliable execution
    • Reduced test execution timeout to 30 minutes for quicker feedback cycles
    • Simplified artifact collection and reporting for improved transparency

ypriverol and others added 7 commits April 13, 2026 21:38
- Add minimal QPX test dataset (50 spectra, 14 KB) in test_data/TEST001/
- Update test.config with default_species/default_instrument for QPX
- Fix SVG workflow diagram: wider spacing, no overlapping labels/badges
- Update .gitignore to allow test_data/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updated CI workflow to include syntax validation and test data checks.
Agent-Logs-Url: https://github.com/bigbio/spectrafuse/sessions/955c5f4a-8a76-437f-b5c2-f9c5f8ba6f3d

Co-authored-by: ypriverol <52113+ypriverol@users.noreply.github.com>
Fix CI workflow YAML parse error in artifact upload step

coderabbitai bot commented Apr 13, 2026

Warning

Rate limit exceeded

@ypriverol has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 33 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 39 minutes and 33 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4c5d7c15-7c4b-4b27-a317-04e74ffbb8aa

📥 Commits

Reviewing files that changed from the base of the PR and between ce3c7dd and 83831d4.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml
📝 Walkthrough

Walkthrough

The CI pipeline is refactored from a single test job into a three-stage sequence: syntax validation, data validation, and pipeline execution. Singularity support and FTP-based test data downloads are removed. Test data is bundled with the repository, and Docker becomes the sole execution profile with a shortened 30-minute timeout.
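
As a rough illustration of the shape described above, a three-stage chain can be wired with `needs:`. This is a hypothetical sketch only; job names, paths, and commands here are assumptions, not the actual contents of `.github/workflows/ci.yml`:

```yaml
# Hypothetical skeleton of the three-stage CI chain; all names are illustrative.
jobs:
  syntax-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate pipeline syntax (preview mode)
        run: nextflow run main.nf -preview -profile test

  test-data-check:
    needs: syntax-check        # runs only after syntax validation passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate bundled parquet fixtures
        run: python validate_test_data.py   # illustrative script name

  pipeline-test:
    needs: test-data-check     # actual execution comes last
    runs-on: ubuntu-latest
    timeout-minutes: 30        # matches the shortened timeout described above
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline with Docker profile
        run: nextflow run main.nf -profile test,docker
```

The `needs:` keys give the fail-fast ordering the walkthrough describes: a syntax or data error stops the workflow before the expensive pipeline run starts.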

Changes

  • CI Pipeline Refactoring (.github/workflows/ci.yml): Replaced monolithic test job with three-stage pipeline: syntax-check (Nextflow -preview mode validation), test-data-check (Python/PyArrow column and content validation on .parquet files), and pipeline-test (actual Nextflow execution). Removed Singularity environment variables and conditional setup logic. Simplified artifact uploads: failure logs conditionally copied if execution traces exist, .nextflow.log always uploaded, success artifacts removed.
  • Configuration Updates (.gitignore, conf/tests/test.config): Removed test_data/ from ignored paths to bundle test data in the repository. Added params.default_species (Homo sapiens) and params.default_instrument (Q Exactive HF) to the test config. Updated the parquet_dir description to reference the bundled 50-spectra QPX dataset.
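
The test-config additions listed above can be pictured as a small Nextflow params block. This is a hedged sketch; the exact structure of conf/tests/test.config may differ:

```groovy
// Illustrative sketch of the conf/tests/test.config additions; the real file
// may organize these parameters differently.
params {
    // Bundled 50-spectra QPX dataset shipped in the repository
    parquet_dir        = "${projectDir}/test_data/TEST001"

    // Defaults applied when run/sample metadata does not specify them
    default_species    = 'Homo sapiens'
    default_instrument = 'Q Exactive HF'
}
```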

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Three jobs now bloom where once was one,
Syntax checked before the tests are run,
Data validated, Docker takes the stage,
No Singularity in this new age!
Bundled and bright, our pipeline's grown! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'Dev' is vague and generic, providing no meaningful information about the changeset's primary purpose or scope. Resolution: replace 'Dev' with a descriptive title that summarizes the main change, such as 'Refactor CI pipeline to separate syntax validation, data checks, and pipeline testing' or 'Update test infrastructure and configuration'.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




codacy-production bot commented Apr 13, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

💡 Tip: This summary will be updated as you push new changes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.gitignore (1)

175-180: ⚠️ Potential issue | 🟠 Major

*.parquet still keeps new bundled fixtures out of version control.

Line 177 still ignores every Parquet file, so adding or renaming anything under test_data/ will stay untracked even though this PR moves test fixtures into the repo.

Suggested fix

```diff
 # Data files - should not be in repository
 data/
 *.parquet
+!test_data/**/*.parquet

 # CI/CD reports
 reports/
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 175 - 180, The .gitignore entry '*.parquet' is
globally ignoring all Parquet files (so new test fixtures under test_data/
remain untracked); update .gitignore to stop excluding every .parquet file —
either remove the '*.parquet' pattern or replace it with a more specific pattern
(e.g. only ignore data/ or a specific fixtures path) and if you need to keep
most Parquets ignored, add a negation for the test fixtures (e.g. add a pattern
like '!test_data/**.parquet') so the project’s test fixtures are tracked; update
the file to remove the global '*.parquet' ignore and ensure test_data Parquet
files are no longer excluded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 37-53: The CI Nextflow runs for the jobs named "Validate pipeline
syntax (preview mode)" and "Validate incremental mode syntax" are bypassing the
test profile; update both nextflow run commands to enable the test profile
(e.g., add -profile test or -c conf/tests/test.config) so conf/tests/test.config
is loaded during CI rather than relying on ad-hoc overrides, and keep any
necessary lightweight overrides but do not remove or bypass the test profile
flag.
- Around line 84-92: The preflight currently only asserts presence of columns
for the `run` and `sample` pyarrow Tables but doesn't fail when those tables are
empty; update the checks around the `run` and `sample` variables to also assert
they contain rows (e.g., assert len(run) > 0 and assert len(sample) > 0 or use
run.num_rows/sample.num_rows) and include brief failure messages so the CI fails
fast if either parquet fixture is empty.
- Around line 136-145: The shell test using an unquoted glob ([ -f
test_results/pipeline_info/execution_trace*.txt ]) can expand to multiple args
and fail when several trace files exist; replace that check with a glob-safe
probe such as using compgen -G "test_results/pipeline_info/execution_trace*.txt"
>/dev/null || true (or enabling nullglob and checking for files), then run grep
against the matched files (execution_trace*.txt) as before and preserve the rest
of the logic that sets failed and iterates to copy logs in the "Gather failed
logs" step so the variable failed and the while-read loop continue to work
correctly.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51a96497-2a4b-48cd-acf7-84ea414b9da1

📥 Commits

Reviewing files that changed from the base of the PR and between cecc27f and ce3c7dd.

⛔ Files ignored due to path filters (4)
  • docs/images/spectrafuse_workflow.svg is excluded by !**/*.svg
  • test_data/TEST001/TEST001.psm.parquet is excluded by !**/*.parquet
  • test_data/TEST001/TEST001.run.parquet is excluded by !**/*.parquet
  • test_data/TEST001/TEST001.sample.parquet is excluded by !**/*.parquet
📒 Files selected for processing (3)
  • .github/workflows/ci.yml
  • .gitignore
  • conf/tests/test.config

Comment on lines +84 to +92

```python
run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
assert 'run_file_name' in run.column_names
assert 'instrument' in run.column_names
print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
assert 'run_file_name' in sample.column_names
assert 'organism' in sample.column_names
print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```

⚠️ Potential issue | 🟡 Minor

Fail fast if the run or sample parquet fixtures are empty.

This preflight only checks row count for the PSM parquet. Empty metadata tables would still pass this job and then fail later in the actual pipeline run.

Suggested fix

```diff
           run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
           assert 'run_file_name' in run.column_names
           assert 'instrument' in run.column_names
+          assert len(run) > 0, 'Run table is empty'
           print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

           sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
           assert 'run_file_name' in sample.column_names
           assert 'organism' in sample.column_names
+          assert len(sample) > 0, 'Sample table is empty'
           print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
run = pq.read_table('test_data/TEST001/TEST001.run.parquet')
assert 'run_file_name' in run.column_names
assert 'instrument' in run.column_names
assert len(run) > 0, 'Run table is empty'
print(f'Run parquet: {len(run)} rows, columns: {run.column_names}')

sample = pq.read_table('test_data/TEST001/TEST001.sample.parquet')
assert 'run_file_name' in sample.column_names
assert 'organism' in sample.column_names
assert len(sample) > 0, 'Sample table is empty'
print(f'Sample parquet: {len(sample)} rows, columns: {sample.column_names}')
```
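
The fail-fast idea behind this suggestion can be demonstrated without PyArrow. The `Table` class below is a minimal stand-in (an assumption for illustration, not the real pyarrow API) so the column-and-row-count check can run anywhere:

```python
# Sketch of the review's "fail fast" pattern: require both the expected
# columns AND at least one row before the pipeline job is allowed to run.
# `Table` is a hypothetical stand-in for a pyarrow Table.
from dataclasses import dataclass, field


@dataclass
class Table:
    column_names: list
    rows: list = field(default_factory=list)

    def __len__(self):
        return len(self.rows)


def check_table(table, name, required_columns):
    """Raise AssertionError with a clear message on any missing column or empty table."""
    for col in required_columns:
        assert col in table.column_names, f"{name}: missing column {col!r}"
    assert len(table) > 0, f"{name} table is empty"


# A well-formed fixture passes silently.
run = Table(column_names=["run_file_name", "instrument"],
            rows=[("r1.mzML", "Q Exactive HF")])
check_table(run, "run", ["run_file_name", "instrument"])

# An empty metadata table is caught here, not later in the pipeline run.
empty_sample = Table(column_names=["run_file_name", "organism"])
try:
    check_table(empty_sample, "sample", ["run_file_name", "organism"])
except AssertionError as err:
    print(err)  # sample table is empty
```

Wiring the same three assertions into the CI preflight step (as the suggested diff does) turns a confusing downstream pipeline failure into an immediate, named error.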

Comment on lines +136 to +145
- name: Gather failed logs
if: failure() || cancelled()
run: |
echo "ARTIFACT_TIMESTAMP=$(date +%s)" >> $GITHUB_ENV

- uses: actions/upload-artifact@v4
mkdir -p failed_logs
if [ -f test_results/pipeline_info/execution_trace*.txt ]; then
failed=$(grep "FAILED" test_results/pipeline_info/execution_trace*.txt | cut -f 2 || true)
while read -r line ; do
[ -n "$line" ] && cp $(ls work/${line}*/*.log 2>/dev/null) failed_logs/ 2>/dev/null || true
done <<< "$failed"
fi

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT

mkdir -p "$tmpdir/test_results/pipeline_info"
touch \
  "$tmpdir/test_results/pipeline_info/execution_trace_1.txt" \
  "$tmpdir/test_results/pipeline_info/execution_trace_2.txt"

cd "$tmpdir"

set +e
[ -f test_results/pipeline_info/execution_trace*.txt ]
status=$?
set -e

echo "exit_status=$status"
echo "Expected: non-zero status because the glob expands to multiple paths."
```

Repository: bigbio/spectrafuse

Length of output: 260

🏁 Script executed:

```bash
# First, check if the file exists and retrieve the relevant section
cat -n .github/workflows/ci.yml | sed -n '130,150p'
```

Repository: bigbio/spectrafuse

Length of output: 1922

🏁 Script executed:

```bash
# Check for other problematic glob patterns in the workflow file
rg '\*' .github/workflows/ci.yml | head -20
```

Repository: bigbio/spectrafuse

Length of output: 344


Make the trace-file probe glob-safe.

Lines 140–141 use unquoted globs that break when multiple trace files exist. The [ -f test_results/pipeline_info/execution_trace*.txt ] test expands to multiple arguments, causing a shell syntax error and silently skipping failure-log collection.

Suggested fix

```diff
       - name: Gather failed logs
         if: failure() || cancelled()
         run: |
           mkdir -p failed_logs
-          if [ -f test_results/pipeline_info/execution_trace*.txt ]; then
-            failed=$(grep "FAILED" test_results/pipeline_info/execution_trace*.txt | cut -f 2 || true)
+          shopt -s nullglob
+          trace_files=(test_results/pipeline_info/execution_trace*.txt)
+          if (( ${#trace_files[@]} )); then
+            failed=$(grep "FAILED" "${trace_files[@]}" | cut -f 2 || true)
             while read -r line ; do
-              [ -n "$line" ] && cp $(ls work/${line}*/*.log 2>/dev/null) failed_logs/ 2>/dev/null || true
+              [ -n "$line" ] && cp work/${line}*/*.log failed_logs/ 2>/dev/null || true
             done <<< "$failed"
           fi
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
      - name: Gather failed logs
        if: failure() || cancelled()
        run: |
          mkdir -p failed_logs
          shopt -s nullglob
          trace_files=(test_results/pipeline_info/execution_trace*.txt)
          if (( ${#trace_files[@]} )); then
            failed=$(grep "FAILED" "${trace_files[@]}" | cut -f 2 || true)
            while read -r line ; do
              [ -n "$line" ] && cp work/${line}*/*.log failed_logs/ 2>/dev/null || true
            done <<< "$failed"
          fi
```
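
A small self-contained demonstration of why the unquoted-glob test fails and how the suggested glob-safe probes behave. The file names are made up for the demo:

```bash
#!/usr/bin/env bash
# Demo: `[ -f glob ]` breaks when the glob matches more than one file,
# while `compgen -G` and a nullglob array work correctly.
set -u

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT
mkdir -p "$tmpdir/pipeline_info"
touch "$tmpdir/pipeline_info/execution_trace_1.txt" \
      "$tmpdir/pipeline_info/execution_trace_2.txt"

cd "$tmpdir"

# Broken probe: the glob expands to two paths, so [ receives extra
# arguments and errors out, taking the else branch.
if [ -f pipeline_info/execution_trace*.txt ] 2>/dev/null; then
  echo "unquoted glob: matched"
else
  echo "unquoted glob: errored (expected with more than one match)"
fi

# Safe probe 1: compgen -G exits 0 iff the pattern matches anything.
if compgen -G "pipeline_info/execution_trace*.txt" >/dev/null; then
  echo "compgen: matched"
fi

# Safe probe 2: nullglob makes a non-matching glob expand to nothing,
# so the array length is a reliable existence check.
shopt -s nullglob
trace_files=(pipeline_info/execution_trace*.txt)
if (( ${#trace_files[@]} )); then
  echo "nullglob array: ${#trace_files[@]} files"
fi
```

Both safe probes also hand you the matched paths directly (`compgen -G` prints them; the array holds them), which is what the subsequent `grep "FAILED"` step needs.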
