Motivation
ScanNeo2 currently processes a single sample per invocation — config['data'] holds one sample's data, and rule all expands over the scalar config['data']['name']. For HPC/SLURM use it would be much better to submit one workflow that processes many samples in parallel, driven by a sample sheet, with all non-sample parameters shared from config.yaml.
The good news: the {sample} wildcard is already threaded through every rule and output path. The only thing pinning the workflow to one sample is that config['data'] is a single global structure rather than keyed by sample.
Approach
Use the idiomatic Snakemake pattern: a sample sheet plus the native DAG parallelism. No new parallelism machinery is needed. The per-sample jobs form independent branches of the DAG (they share only the reference/resource jobs), so Snakemake parallelizes them automatically. On SLURM, each rule-job is submitted as its own job.
Core change: config['data'] (a single global, built by data_structure() in common.smk) becomes a SAMPLES dict keyed by sample name. Every config['data'][X] lookup becomes SAMPLES[wildcards.sample][X]; wildcards.sample is already available in every input helper.
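The lookup change can be illustrated with a minimal sketch. The sample names, the field values and the helper name get_rnaseq_input are hypothetical; the real records would come from data_structure(), and SimpleNamespace only stands in for the wildcards object Snakemake passes to input functions.

```python
from types import SimpleNamespace

# Hypothetical per-sample records keyed by sample name.
SAMPLES = {
    "patientA": {"dnaseq_filetype": ".fastq", "rnaseq": ["a_R1.fq.gz", "a_R2.fq.gz"]},
    "patientB": {"dnaseq_filetype": ".bam", "rnaseq": ["b.bam"]},
}


def get_rnaseq_input(wildcards):
    """Input helper: resolve the sample via its wildcard, not a global."""
    # Before the refactor this would have read config["data"]["rnaseq"].
    return SAMPLES[wildcards.sample]["rnaseq"]


# Stand-in for the wildcards object of a job with sample=patientB.
print(get_rnaseq_input(SimpleNamespace(sample="patientB")))
```

Because each helper only reads SAMPLES[wildcards.sample], two jobs for different samples never touch the same record.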
Implementation plan
Sample sheet — config/samples.tsv, one row per sample (wide format: sample, dnaseq_tumor, dnaseq_normal, rnaseq, custom_variants, custom_hla_I, custom_hla_II). config.yaml keeps all shared parameters (reference, threads, mapq, indel, hlatyping, prioritization, ...) and gains a samples: config/samples.tsv key. The current per-sample data: block is removed.
Snakefile — load the sheet with pandas; rule all does expand("results/{sample}/prioritization/", sample=SAMPLES.keys()).
common.smk — make data_structure() a pure per-sample function (no global config['data'] mutation); build SAMPLES = {row.sample: data_structure(row) for row in sheet}. Rewrite the ~120 config['data'][...] references (all concentrated in this file) to SAMPLES[wildcards.sample][...]. handle_seqfiles() stays largely as-is (called per sample).
align.smk parse-time conditionals — align.smk defines rules conditionally with if config['data']['dnaseq_filetype'] in ['.fq','.fastq']: (3 references). Rule definitions are evaluated at parse time and cannot branch per sample. Resolve by either (a) always defining the rules and moving the filetype branch into the input functions, or (b) constraining v1 to "all samples share filetype/readtype" with explicit validation.
Schema validation — extend workflow/schemas/config.schema.yaml to validate the sample sheet (columns, required fields). Pairs naturally with the open schema-validation half of Add config schema validation and wildcard constraints #65.
SLURM profile — add profiles/slurm/config.yaml using the Snakemake-8 snakemake-executor-plugin-slurm (partition, default mem_mb/runtime, jobs). Invocation: snakemake --workflow-profile profiles/slurm --jobs N.
Download atomicity audit — with all samples sharing the reference/resource outputs, audit ref.smk and the download rules so a killed or interrupted download (SLURM timeout, network drop) cannot leave a truncated file at the final output path. Snakemake builds each shared resource exactly once (single DAG node, no {sample} in the path) and re-runs jobs it knows were interrupted, but the robust pattern is download-to-temp + atomic mv so the final path only ever appears complete. curl --fail (PR Replace shell=True subprocess calls with list-based arguments #74) catches HTTP errors but not mid-transfer kills. Optionally document a pre-stage step (snakemake --until <ref targets>) to build resources/ once before launching the sample batch.
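Steps 1 to 3 of the plan could look roughly like the sketch below. It is an illustration under assumptions, not the actual implementation: the plan calls for pandas, but the stdlib csv module is used here to keep the example dependency-free; the file names are made up; the space-separated multi-file cell convention is an assumption; and this data_structure() is a simplified stand-in for the real function in common.smk.

```python
import csv
import io

# Hypothetical two-row sheet in the wide format; file names are invented.
SHEET_TSV = (
    "sample\tdnaseq_tumor\tdnaseq_normal\trnaseq\tcustom_variants\tcustom_hla_I\tcustom_hla_II\n"
    "patientA\ta_T_R1.fq.gz a_T_R2.fq.gz\ta_N_R1.fq.gz a_N_R2.fq.gz\ta_rna.bam\t\t\t\n"
    "patientB\tb_T.bam\tb_N.bam\t\t\t\t\n"
)


def data_structure(row):
    """Pure per-sample builder: returns a new dict and mutates no global."""
    record = {"name": row["sample"]}
    for group in ("dnaseq_tumor", "dnaseq_normal", "rnaseq"):
        # Assumption: several files in one cell are separated by spaces.
        record[group] = row[group].split()
    return record


def load_samples(handle):
    """Read the sheet and build one record per unique sample name."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        name = row["sample"]
        if not name:
            raise ValueError("sample sheet row without a sample name")
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = data_structure(row)
    return samples


SAMPLES = load_samples(io.StringIO(SHEET_TSV))
# The targets rule all would request: one prioritization directory per sample.
TARGETS = [f"results/{name}/prioritization/" for name in SAMPLES]
print(TARGETS)
```

In the Snakefile the same list would come from expand(..., sample=SAMPLES.keys()). Rejecting duplicate names at load time matters because two rows with the same sample name would otherwise silently overwrite each other's record.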
Open design decisions
Clean break vs. backwards compatibility — fully switch to the sample sheet (simpler code, breaking change → minor version bump) vs. keep the single-sample data: block working as a 1-row fallback (more code). Leaning clean break.
align.smk per-sample filetype — option (a) vs. (b) above.
Sheet format — wide (fixed group columns) vs. long (one row per sample×group). Wide is friendlier for the "same config, list of samples" use case.
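To make the align.smk decision more concrete, here is a rough sketch of both options. Everything in it is hypothetical: the function names, the demo records and especially the result paths are invented for illustration and do not reflect the workflow's real layout.

```python
def validate_shared_filetype(samples, key="dnaseq_filetype"):
    """Option (b): fail early unless every sample reports the same value."""
    seen = {record[key] for record in samples.values()}
    if len(seen) != 1:
        raise ValueError(f"samples mix {key} values: {sorted(seen)}")
    return seen.pop()


def dnaseq_align_input(samples, sample):
    """Option (a): branch per sample inside the input function."""
    if samples[sample]["dnaseq_filetype"] in (".fq", ".fastq"):
        # Hypothetical path for FASTQ input that still needs alignment.
        return f"results/{sample}/dnaseq/reads/preprocessed.fq.gz"
    # Hypothetical path for input that arrives already aligned.
    return f"results/{sample}/dnaseq/reads/input.bam"


DEMO = {
    "patientA": {"dnaseq_filetype": ".fastq"},
    "patientB": {"dnaseq_filetype": ".bam"},
}

# Option (a) handles the mixed sheet sample by sample.
for name in DEMO:
    print(name, dnaseq_align_input(DEMO, name))

# Option (b) rejects the same sheet before any job is scheduled.
try:
    validate_shared_filetype(DEMO)
except ValueError as err:
    print("validation failed:", err)
```

Option (a) keeps every rule defined at parse time and defers the decision to DAG construction, when the sample wildcard is known. Option (b) is less code but turns a mixed sheet into a hard error.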
Concurrency notes
Within one run, shared resources are safe automatically: reference download/index outputs have no {sample} wildcard, so Snakemake runs each as a single job and all samples fan out from it — no duplication, no double-write.
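Across separate invocations in the same working directory, Snakemake's directory lock prevents a concurrent run. Launch separate invocations from separate working directories; because the resources/ paths are relative, each working directory then gets its own copy of the shared resources.
Partial-download robustness is covered by the download atomicity audit (step 7 of the implementation plan).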
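The download-to-temp plus atomic rename pattern from step 7 can be sketched in Python as follows. This is an assumption-laden illustration: atomic_download and its fetch callable are hypothetical names, and in a shell-based rule the equivalent would be writing to a temporary path and then running mv.

```python
import os
import tempfile


def atomic_download(final_path, fetch):
    """Write to a temp file next to the target, then rename it into place.

    `fetch` is a hypothetical callable that writes the payload to an open
    binary file object, for example by streaming an HTTP response into it.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(directory, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            fetch(handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up on errors and interrupts; a hard kill leaves only the
        # .part file behind, never a truncated file at final_path.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as atomic_download("resources/refs/genome.fasta", fetch) (path invented for the example) either produces the complete file or leaves the final path absent, which is the property Snakemake needs in order to treat an existing output as valid.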
Testing
Add a 2-sample sheet under .tests/integration/.
snakemake -n dry-run to confirm the DAG expands over both samples.
A real 2-sample run on SLURM via the profile.
Scope
Standalone feature, not part of the 2026-03-26 audit cluster. Likely one PR for the sample-sheet + common.smk refactor, optionally a second for the SLURM profile.