Multi-sample support: process multiple samples in one run via a sample sheet #93

@riasc

Motivation

ScanNeo2 currently processes a single sample per invocation — config['data'] holds one sample's data, and rule all expands over the scalar config['data']['name']. For HPC/SLURM use it would be much better to submit one workflow that processes many samples in parallel, driven by a sample sheet, with all non-sample parameters shared from config.yaml.

The good news: the {sample} wildcard is already threaded through every rule and output path. The only thing pinning the workflow to one sample is that config['data'] is a single global structure rather than keyed by sample.

Approach

Use the idiomatic Snakemake pattern: a sample sheet plus the native DAG parallelism. No new parallelism machinery is needed, because the per-sample jobs of different samples depend on each other only through the shared reference resources, so Snakemake schedules them in parallel automatically. On SLURM, each rule-job is submitted as its own job.

Core change: config['data'] (single global, built by data_structure() in common.smk) becomes a SAMPLES dict keyed by sample name. Every config['data'][X] lookup becomes SAMPLES[wildcards.sample][X]; wildcards.sample is already available in every input helper.
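A minimal sketch of this change, written in plain Python so it runs outside Snakemake. The column names follow the wide sheet format proposed in the implementation plan; `build_samples`, `get_rnaseq_input`, the layout of the returned dict, the space-separated file lists, and the use of the standard-library `csv` module (rather than pandas, which the plan proposes) are all illustrative assumptions. The real `data_structure()` does considerably more than shown here.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical sheet contents; column names follow the proposed wide format.
SHEET_TSV = (
    "sample\tdnaseq_tumor\tdnaseq_normal\trnaseq\n"
    "patient1\tp1_T.bam\tp1_N.bam\tp1_rna_R1.fq.gz p1_rna_R2.fq.gz\n"
    "patient2\tp2_T.bam\tp2_N.bam\t\n"
)


def data_structure(row):
    """Pure per-sample function: builds and returns a dict, mutates no globals."""
    return {
        "name": row["sample"],
        # Empty cells become empty lists so callers can test truthiness.
        "dnaseq": {
            "tumor": row["dnaseq_tumor"].split(),
            "normal": row["dnaseq_normal"].split(),
        },
        "rnaseq": row["rnaseq"].split(),
    }


def build_samples(handle):
    """Return {sample name: per-sample data}, rejecting duplicate names."""
    samples = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["sample"] in samples:
            raise ValueError(f"duplicate sample name: {row['sample']}")
        samples[row["sample"]] = data_structure(row)
    return samples


SAMPLES = build_samples(io.StringIO(SHEET_TSV))


def get_rnaseq_input(wildcards):
    """Input helper: previously config['data']['rnaseq'], now keyed by sample."""
    return SAMPLES[wildcards.sample]["rnaseq"]


# SimpleNamespace stands in for the wildcards object Snakemake passes in.
print(get_rnaseq_input(SimpleNamespace(sample="patient1")))
```

Because `data_structure()` returns its result instead of writing to `config['data']`, the same function can be called once per row without samples overwriting each other.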

Implementation plan

  1. Sample sheet: config/samples.tsv, one row per sample (wide format: sample, dnaseq_tumor, dnaseq_normal, rnaseq, custom_variants, custom_hla_I, custom_hla_II). config.yaml keeps all shared parameters (reference, threads, mapq, indel, hlatyping, prioritization, ...) and gains a samples: config/samples.tsv key. The current per-sample data: block is removed.

  2. Snakefile — load the sheet with pandas; rule all does expand("results/{sample}/prioritization/", sample=SAMPLES.keys()).

  3. common.smk — make data_structure() a pure per-sample function (no global config['data'] mutation); build SAMPLES = {row.sample: data_structure(row) for row in sheet}. Rewrite the ~120 config['data'][...] references (all concentrated in this file) to SAMPLES[wildcards.sample][...]. handle_seqfiles() stays largely as-is (called per sample).

  4. align.smk parse-time conditionals: align.smk defines rules conditionally with if config['data']['dnaseq_filetype'] in ['.fq','.fastq']: (3 references). Rule definitions are evaluated at parse time, before any {sample} wildcard is resolved, so they cannot branch per sample. Resolve by either (a) always defining the rules and moving the filetype branch into the input functions, or (b) constraining v1 to "all samples share filetype/readtype" with explicit validation.

  5. Schema validation: extend workflow/schemas/config.schema.yaml to validate the sample sheet (columns, required fields). Pairs naturally with the still-open schema-validation half of #65 ("Add config schema validation and wildcard constraints").

  6. SLURM profile — add profiles/slurm/config.yaml using the Snakemake-8 snakemake-executor-plugin-slurm (partition, default mem_mb/runtime, jobs). Invocation: snakemake --workflow-profile profiles/slurm --jobs N.

  7. Download atomicity audit: with all samples sharing the reference/resource outputs, audit ref.smk and the download rules so a killed or interrupted download (SLURM timeout, network drop) cannot leave a truncated file at the final output path. Snakemake builds each shared resource exactly once (single DAG node, no {sample} in the path) and re-runs jobs it knows were interrupted, but the robust pattern is download-to-temp + atomic mv so the final path only ever appears complete. curl --fail (PR #74, "Replace shell=True subprocess calls with list-based arguments") catches HTTP errors but not mid-transfer kills. Optionally document a pre-stage step (snakemake --until <ref targets>) to build resources/ once before launching the sample batch.
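The download-to-temp + atomic mv pattern from step 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: `atomic_write`, `interrupted_transfer`, and the file names are hypothetical, the transfer is simulated with a generator instead of a real network call, and the actual download rules may be shell-based, where the equivalent is writing to a temporary name and then running `mv`.

```python
import os
import tempfile


def atomic_write(dest, chunks):
    """Write byte chunks to a temp file next to dest, then rename into place.

    os.replace is atomic when source and target are on the same filesystem,
    which is why the temp file is created in dest's own directory.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, dest)
    except BaseException:
        # Interrupted or failed transfer: discard the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def interrupted_transfer():
    """Simulates a download that dies after the first chunk."""
    yield b"partial data"
    raise ConnectionError("network drop")


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "reference.fa")  # hypothetical resource name

try:
    atomic_write(target, interrupted_transfer())
except ConnectionError:
    pass
print(os.path.exists(target))  # False: the final path was never created

atomic_write(target, [b">chr1\n", b"ACGT\n"])
print(os.path.exists(target))  # True: only the complete file appears
```

A hard kill (SIGKILL after a SLURM timeout) skips the cleanup branch and can leave a stray `.part` file behind, but the final output path still never holds a truncated file, which is the property the audit is after.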

Open design decisions

  • Clean break vs. backwards compatibility — fully switch to the sample sheet (simpler code, breaking change → minor version bump) vs. keep the single-sample data: block working as a 1-row fallback (more code). Leaning clean break.
  • align.smk per-sample filetype — option (a) vs. (b) above.
  • Sheet format — wide (fixed group columns) vs. long (one row per sample×group). Wide is friendlier for the "same config, list of samples" use case.
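To make the align.smk decision concrete, both options can be sketched in plain Python. `SAMPLES`, its per-sample `dnaseq_filetype` key, `get_align_input`, `check_uniform_filetype`, and the placeholder paths are hypothetical names for illustration; only the `.fq`/`.fastq` membership test comes from the existing conditional.

```python
from types import SimpleNamespace

FASTQ_TYPES = {".fq", ".fastq"}

# Hypothetical per-sample data; only the filetype field matters here.
SAMPLES = {
    "patient1": {"dnaseq_filetype": ".fastq"},
    "patient2": {"dnaseq_filetype": ".bam"},
}


def get_align_input(wildcards):
    """Option (a): the rule is always defined; the filetype branch lives in
    the input function and is resolved separately for each sample."""
    filetype = SAMPLES[wildcards.sample]["dnaseq_filetype"]
    if filetype in FASTQ_TYPES:
        # Placeholder path, not the workflow's real layout.
        return f"results/{wildcards.sample}/placeholder/reads.fq.gz"
    # Placeholder path for reads first extracted from a BAM file.
    return f"results/{wildcards.sample}/placeholder/reads_from_bam.fq.gz"


def check_uniform_filetype(samples):
    """Option (b): fail early unless every sample shares one filetype."""
    filetypes = {s["dnaseq_filetype"] for s in samples.values()}
    if len(filetypes) > 1:
        raise ValueError(
            "all samples must share one DNA-seq filetype, found: "
            + ", ".join(sorted(filetypes))
        )
    return filetypes.pop() if filetypes else None


# Option (a) handles a mixed sheet sample by sample.
for name in SAMPLES:
    print(name, get_align_input(SimpleNamespace(sample=name)))

# Option (b) rejects the same mixed sheet before any rule is defined.
try:
    check_uniform_filetype(SAMPLES)
except ValueError as err:
    print(f"validation error: {err}")
```

Option (a) supports mixed sheets at the cost of touching every affected input function; option (b) keeps the parse-time conditionals unchanged and only adds one check at load time.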

Concurrency notes

  • Within one run, shared resources are safe automatically: reference download/index outputs have no {sample} wildcard, so Snakemake runs each as a single job and all samples fan out from it — no duplication, no double-write.
  • Across separate invocations in the same working directory, Snakemake's directory lock prevents a concurrent run. Keep each invocation in its own working directory (relative resources/ paths).
  • Partial-download robustness is covered by step 7.

Testing

  • Add a 2-sample sheet under .tests/integration/.
  • snakemake -n dry-run to confirm the DAG expands over both samples.
  • A real 2-sample run on SLURM via the profile.

Scope

Standalone feature, not part of the 2026-03-26 audit cluster. Likely one PR for the sample-sheet + common.smk refactor, optionally a second for the SLURM profile.

Labels: enhancement (New feature or request)
