Add DIAMOND blastp subworkflow for protein homology search against NCBI RefSeq #50

Open
tracelail wants to merge 55 commits into nf-core:dev from tracelail:unfinished-diamond-blastp

@tracelail tracelail commented Jun 2, 2025

Protein functional annotation at scale requires efficient sequence alignment against large reference databases. BLAST, the traditional tool for this task, is computationally expensive on large datasets. DIAMOND (Buchfink et al., Nature Methods 2021) offers sensitivity comparable to BLAST at much higher throughput, up to 10,000× faster on large protein databases, by using a double-indexed alignment algorithm optimized for modern hardware. Adding DIAMOND blastp to proteinannotator lets users run protein homology searches against the full NCBI RefSeq protein database as part of their functional annotation workflow, which would be impractical with BLAST at that scale.

This PR adds DIAMOND blastp to the pipeline, using the existing nf-core diamond/blastp and diamond/makedb modules, plus two new local modules:

New local modules:

  • ncbirefseqdownload : Downloads and concatenates NCBI RefSeq protein FASTAs for a specified release category (e.g. complete, other) into a single compressed reference FASTA for use by diamond/makedb
  • diamondpreparetaxa : Downloads and extracts NCBI taxonomy files (nodes.dmp, names.dmp) required for taxonomic classification in diamond/makedb

New subworkflows:

  • subworkflows/local/diamond : Orchestrates the full DIAMOND pipeline: RefSeq download → taxonomy preparation → DIAMOND_MAKEDB → DIAMOND_BLASTP. Supports all seven DIAMOND output formats (blast, xml, txt, daa, sam, tsv, paf) via params.diamond_outfmt and params.diamond_blast_columns
  • subworkflows/local/functional_annotation : Integrates the DIAMOND subworkflow alongside the existing InterProScan logic. InterProScan execution is controlled by params.skip_interproscan and is intentionally left as-is for a collaborator's parallel branch
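
The DIAMOND_MAKEDB and DIAMOND_BLASTP steps above can be sketched at the command-line level. This is a minimal illustration, not the subworkflow's actual code: the file names (refseq.faa.gz, proteins.faa, hits.*) and the helper functions are hypothetical, the pairing of outfmt codes with file extensions is assumed to follow the nf-core diamond/blastp module, and the commands are printed rather than executed so the snippet runs without DIAMOND installed.

```shell
#!/usr/bin/env bash
# Sketch only: prints DIAMOND commands instead of running them.
# All file names are hypothetical placeholders.

# Map a DIAMOND --outfmt code to the output extension listed in this PR
# (assumed pairing; unknown codes are rejected).
outfmt_ext() {
    case "$1" in
        0)   echo blast ;;
        5)   echo xml ;;
        6)   echo txt ;;
        100) echo daa ;;
        101) echo sam ;;
        102) echo tsv ;;
        103) echo paf ;;
        *)   echo "unsupported outfmt: $1" >&2; return 1 ;;
    esac
}

# Database build with the taxonomy inputs prepared by the local modules.
makedb_cmd() {
    echo "diamond makedb --in refseq.faa.gz --db refseq" \
         "--taxonmap prot.accession2taxid.gz" \
         "--taxonnodes nodes.dmp --taxonnames names.dmp"
}

# Search step: $1 = outfmt code, $2 = optional column list.
blastp_cmd() {
    local outfmt="$1" columns="${2:-}" ext
    ext="$(outfmt_ext "$outfmt")" || return 1
    # In this sketch, custom columns are appended only for tabular output (6).
    if [ "$outfmt" = 6 ] && [ -n "$columns" ]; then
        outfmt="$outfmt $columns"
    fi
    echo "diamond blastp --db refseq.dmnd --query proteins.faa --outfmt $outfmt --out hits.$ext"
}

makedb_cmd
blastp_cmd 6 "qseqid sseqid pident evalue"
```

Calling `blastp_cmd 0` instead yields the pairwise BLAST variant with a `.blast` output file, which corresponds to the outfmt 0 stub scenario in the tests.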

New parameters (added to nextflow_schema.json and nextflow.config):

  • --refseq_release : NCBI RefSeq release category (default: complete)
  • --taxondmp_zip : URL to NCBI taxonomy dump archive
  • --taxonmap : URL to compressed protein accession-to-taxid map
  • --diamond_outfmt : Output format code (default: 6, tabular)
  • --diamond_blast_columns : Optional column list for tabular output
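
As a rough illustration of how these parameters might be passed, the sketch below assembles a pipeline invocation and prints it. Nextflow pipeline parameters take a double dash, while Nextflow's own options such as -profile take a single dash. Several things here are assumptions rather than details from this PR: the samplesheet.csv and results names, the --input/--outdir options (the usual nf-core convention), and the two URLs, which are the standard NCBI taxonomy FTP locations and may differ from the defaults in nextflow.config.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command line instead of launching the pipeline.
# URLs are the standard NCBI taxonomy locations (assumed, not taken from the PR).
TAXDMP_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip"
TAXONMAP_URL="https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz"

# $1 = RefSeq release category (default: complete)
# $2 = DIAMOND output format code (default: 6)
run_cmd() {
    echo "nextflow run nf-core/proteinannotator -profile docker" \
         "--input samplesheet.csv --outdir results" \
         "--refseq_release ${1:-complete}" \
         "--taxondmp_zip $TAXDMP_URL" \
         "--taxonmap $TAXONMAP_URL" \
         "--diamond_outfmt ${2:-6}"
}

run_cmd
```

`run_cmd other 0` would print the same command for the `other` release category with pairwise BLAST output.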

Testing:

All new local modules have nf-test test suites with both live and stub tests:

  • ncbirefseqdownload — tested against the NCBI other RefSeq release; stub uses echo "" | gzip to produce a valid (non-empty) gzip stub
  • diamondpreparetaxa — tested against the NCBI taxonomy dump; stub creates placeholder nodes.dmp and names.dmp
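
The gzip stub mentioned above can be reproduced and checked in isolation. This is a small sketch with a hypothetical file name; it only demonstrates that piping `echo ""` through gzip yields a non-empty file that passes an integrity check, unlike a zero-byte placeholder, which would carry no gzip header at all.

```shell
#!/usr/bin/env bash
# Sketch: build the stub as described and verify it (file name is hypothetical).
stub="$(mktemp -d)/refseq_stub.faa.gz"
echo "" | gzip > "$stub"

# -t tests the compressed stream without writing any output.
gzip -t "$stub" && echo "stub is a valid gzip stream"

# The decompressed payload is just the newline written by echo.
gzip -dc "$stub" | wc -c
```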

The diamond subworkflow is tested end-to-end using a miniature mini_prot.accession2taxid.gz taxon map and a small test_refseq.fasta. Tests cover four scenarios:

  • Live run with tabular output (outfmt 6, no columns)
  • Stub: outfmt 6, no columns
  • Stub: outfmt 6, with custom columns
  • Stub: outfmt 0 (pairwise BLAST format)

The functional_annotation subworkflow is tested with skip_interproscan = true in both live and stub modes, confirming that the DIAMOND path runs correctly independently of InterProScan.

All stub tests are tagged CI for fast pipeline CI runs. Live module tests require -profile docker and are tagged accordingly.

Scope notes:

  • diamond_tsv is emitted from FUNCTIONAL_ANNOTATION but not consumed downstream in proteinannotator.nf — comparison module and downstream integration deferred to follow-up PR
  • --skip_diamond parameter deferred to follow-up PR
  • Version collection uses current versions.yml pattern, not topic channels. Topic channel migration deferred pending nf-core mandate timeline (Q2 2026)

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/proteinannotator branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (e.g. nf-test test */local --profile=~test,docker for all new local tests).
  • Check for unexpected warnings in debug mode (nf-test test */local --profile=~test,docker,debug).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

nf-core-bot (Member) commented Jun 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@olgabot olgabot mentioned this pull request Jun 24, 2025
11 tasks

@olgabot olgabot (Collaborator) left a comment

Turns out I had started this review a while ago... But here are a bunch of suggestions that will hopefully deal with the container issues.

Comment thread .vscode/settings.json
Comment thread .nf-test.log
Comment thread CITATIONS.md Outdated
Comment thread docs/output.md Outdated
Comment thread docs/output.md
Comment thread modules/local/diamondpreparetaxa/environment.yml
Comment thread modules/local/diamondpreparetaxa/main.nf Outdated
Comment thread modules/local/diamondpreparetaxa/main.nf Outdated
Comment thread modules/local/diamondpreparetaxa/main.nf Outdated
Comment thread modules/local/ncbirefseqdownload/main.nf Outdated
tracelail and others added 5 commits July 28, 2025 12:21
Comment thread subworkflows/local/diamond/tests/main.nf.test Outdated
…rs for DIAMOND subworkflow to nextflow.config.
…ded diamond test to functional annotation main.nf.test. copied test data to functional annotation as well.
…Fixed functional annotation diamond_tsv output type error.
… modules and subworkflows. Added some tests and confirmed updated snapshots.
… merged back in if needed or fixed if requested.

updated docs.md with some breaks in markdown and output examples of PAF.

Fixed some dev merging issues in functional_annotation/meta.yml.
…ond test had a typo of tsv output when it should have been txt. Updated snapshots are also included.
…hannels from diamond subworkflow main and nf-test assertions.
@tracelail tracelail marked this pull request as ready for review April 9, 2026 14:55
@tracelail tracelail changed the title wrote draft integration of BLAST_MAKEBLASTDB and NCBIREFSEQDOWNLOAD into functional_annotation subworkflow. Integration of BLAST_MAKEBLASTDB and NCBIREFSEQDOWNLOAD into functional_annotation subworkflow. Apr 9, 2026
@tracelail tracelail changed the title Integration of BLAST_MAKEBLASTDB and NCBIREFSEQDOWNLOAD into functional_annotation subworkflow. Add DIAMOND blastp subworkflow for protein homology search against NCBI RefSeq Apr 9, 2026
@tracelail tracelail requested a review from olgabot April 9, 2026 15:17
4 participants