Add DIAMOND blastp subworkflow for protein homology search against NCBI RefSeq #50
Open
tracelail wants to merge 55 commits into nf-core:dev from …
Member
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.3.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
…nto unfinished-diamond-blastp
…o unfinished-diamond-blastp
…aretaxa local module template.
…t and wrote an initial main.nf.test for the process
…t process is successful.
…utilized tests/nextflow.config to fix memory error.
…to diamondpreparetaxa module
…est1.fasta and test2.fasta to the diamond subworkflow directory for diamond/blastp nf-test input.
…iamond subworkflow Diamond_blastp output. Included Diamond subworkflow execution in the functional annotation subworkflow.
olgabot
requested changes
Jul 22, 2025
Collaborator
olgabot
left a comment
Turns out I had started this review a while ago... But here are a bunch of suggestions that will hopefully deal with the container issues.
Co-authored-by: Olga Botvinnik <olga.botvinnik@gmail.com>
…w config processes for DIAMOND_MAKEDB and DIAMOND_BLASTP.
…o unfinished-diamond-blastp
…blastp' into unfinished-diamond-blastp
olgabot
reviewed
Aug 6, 2025
…rs for DIAMOND subworkflow to nextflow.config.
…ded diamond test to functional annotation main.nf.test. copied test data to functional annotation as well.
…Fixed functional annotation diamond_tsv output type error.
…ed placeholder, todo lint warnings.
… modules and subworkflows. Added some tests and confirmed updated snapshots.
… merged back in if needed or fixed if requested. updated docs.md with some breaks in markdown and output examples of PAF. Fixed some dev merging issues in functional_annotation/meta.yml.
…and annotation information.
…ond test had a typo of tsv output when it should have been txt. Updated snapshots are also included.
…hannels from diamond subworkflow main and nf-test assertions.
Protein functional annotation at scale demands efficient sequence alignment against large reference databases. BLAST, the traditional tool for this task, is computationally expensive for large datasets. DIAMOND (Buchfink et al., Nature Methods 2021) offers sensitivity comparable to BLAST at much higher throughput, reported as up to 10,000× faster on large protein databases, by using a double-indexed alignment algorithm optimized for modern hardware. Adding DIAMOND blastp to `proteinannotator` enables users to run protein homology searches against the full NCBI RefSeq protein database as part of their functional annotation workflow, which would be impractical with BLAST at scale.

This PR adds DIAMOND blastp to the pipeline, using the existing nf-core `diamond/blastp` and `diamond/makedb` modules, plus two new local modules.

New local modules:

- `ncbirefseqdownload`: Downloads and concatenates NCBI RefSeq protein FASTAs for a specified release category (e.g. `complete`, `other`) into a single compressed reference FASTA for use by `diamond/makedb`.
- `diamondpreparetaxa`: Downloads and extracts the NCBI taxonomy files (`nodes.dmp`, `names.dmp`) required for taxonomic classification in `diamond/makedb`.

New subworkflows:

- `subworkflows/local/diamond`: Orchestrates the full DIAMOND pipeline: RefSeq download → taxonomy preparation → `DIAMOND_MAKEDB` → `DIAMOND_BLASTP`. Supports all seven DIAMOND output formats (`blast`, `xml`, `txt`, `daa`, `sam`, `tsv`, `paf`) via `params.diamond_outfmt` and `params.diamond_blast_columns`.
- `subworkflows/local/functional_annotation`: Integrates the `DIAMOND` subworkflow alongside the existing InterProScan logic. InterProScan execution is controlled by `params.skip_interproscan` and is intentionally kept as-is for a collaborator's parallel branch.

New parameters (added to `nextflow_schema.json` and `nextflow.config`):

- `refseq_release`: NCBI RefSeq release category (default: `complete`)
- `taxondmp_zip`: URL to the NCBI taxonomy dump archive
- `taxonmap`: URL to the compressed protein accession-to-taxid map
- `diamond_outfmt`: Output format code (default: `6`, tabular)
- `diamond_blast_columns`: Optional column list for tabular output

Testing:

All new local modules have nf-test test suites with both live and stub tests:

- `ncbirefseqdownload`: tested against the NCBI `other` RefSeq release; the stub uses `echo "" | gzip` to produce a valid (non-empty) gzip file.
- `diamondpreparetaxa`: tested against the NCBI taxonomy dump; the stub creates placeholder `nodes.dmp` and `names.dmp` files.

The `diamond` subworkflow is tested end-to-end using a miniature `mini_prot.accession2taxid.gz` taxon map and a small `test_refseq.fasta`. Tests cover four scenarios: …

The `functional_annotation` subworkflow is tested with `skip_interproscan = true` in both live and stub modes, confirming that the DIAMOND path runs correctly independently of InterProScan.

All stub tests are tagged `CI` for fast pipeline CI runs. Live module tests require `-profile docker` and are tagged accordingly.

Scope notes:

- `diamond_tsv` is emitted from `FUNCTIONAL_ANNOTATION` but not consumed downstream in `proteinannotator.nf`; the comparison module and downstream integration are deferred to a follow-up PR.
- The `--skip_diamond` parameter is deferred to a follow-up PR.
- … `versions.yml` pattern, not topic channels. Topic channel migration is deferred pending the nf-core mandate timeline (Q2 2026).

PR checklist

- … (`nf-core pipelines lint`).
- … (`nf-test test */local --profile=~test,docker` for all new local tests).
- … (`nf-test test */local --profile=~test,docker,debug`).
- … `docs/usage.md` is updated.
- … `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).
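
For anyone inspecting results produced with the default `diamond_outfmt` of 6 (tabular), the Python sketch below shows one way to read that output and keep the best hit per query. It is not part of this PR. It assumes the standard 12-column BLAST tabular layout that DIAMOND writes when no column list is given; the function names (`parse_diamond_tabular`, `best_hits`) and the sample rows are hypothetical and exist only for illustration. If `diamond_blast_columns` is set, the matching column names would need to be passed in explicitly.

```python
import csv
import io

# Standard BLAST tabular columns, assumed here to be what DIAMOND
# writes for output format 6 when no custom column list is supplied.
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_diamond_tabular(handle, columns=DEFAULT_COLUMNS):
    """Yield one dict per alignment from a tab-separated DIAMOND result."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(row)}: {row!r}"
            )
        yield dict(zip(columns, row))


def best_hits(hits):
    """Return the highest-bitscore hit for each query sequence."""
    best = {}
    for hit in hits:
        query = hit["qseqid"]
        current = best.get(query)
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[query] = hit
    return best


if __name__ == "__main__":
    # Made-up rows standing in for a real DIAMOND result file.
    sample = (
        "q1\tsubjA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
        "q1\tsubjB\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-60\t250.5\n"
        "q2\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60.0\n"
    )
    for query, hit in best_hits(parse_diamond_tabular(io.StringIO(sample))).items():
        print(query, hit["sseqid"], hit["bitscore"])
```

For a real run, the `io.StringIO` object would be replaced by an open file handle on the DIAMOND output file.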