Multi-sample support: process multiple samples in one run via a sample sheet

## Motivation

ScanNeo2 currently processes a single sample per invocation — `config['data']` holds one sample's data, and `rule all` expands over the scalar `config['data']['name']`. For HPC/SLURM use it would be much better to submit one workflow that processes **many samples in parallel**, driven by a **sample sheet**, with all non-sample parameters shared from `config.yaml`.

The good news: the `{sample}` wildcard is already threaded through every rule and output path. The only thing pinning the workflow to one sample is that `config['data']` is a single global structure rather than keyed by sample.

## Approach

Use the idiomatic Snakemake pattern: a sample sheet plus the native DAG parallelism. No new parallelism machinery is needed: apart from the shared reference/resource jobs, each sample's jobs form an independent branch of the DAG, so Snakemake schedules them in parallel automatically. On SLURM, each rule-job is submitted as its own job.

**Core change:** `config['data']` (single global, built by `data_structure()` in `common.smk`) becomes a `SAMPLES` dict keyed by sample name. Every `config['data'][X]` lookup becomes `SAMPLES[wildcards.sample][X]` — `wildcards.sample` is already available in every input helper.
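A minimal sketch of the core change, under stated assumptions: the sheet is read with the standard-library `csv` module instead of pandas so the example has no dependencies, `data_structure()` is a stand-in for the real function in `common.smk`, and the file names, the space-separated file lists, and the helper names `load_samples`, `get_dnaseq_tumor` and `all_targets` are hypothetical.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical two-sample sheet in the wide format proposed in this issue.
SHEET_TSV = (
    "sample\tdnaseq_tumor\tdnaseq_normal\trnaseq\n"
    "patient1\tp1_T_R1.fq.gz p1_T_R2.fq.gz\tp1_N_R1.fq.gz p1_N_R2.fq.gz\tp1_rna.bam\n"
    "patient2\tp2_T.bam\t\t\n"
)


def data_structure(row):
    """Stand-in for the real data_structure(): builds one sample's dict and
    touches no global state. Empty cells become None."""
    return {key: (value or None) for key, value in row.items() if key != "sample"}


def load_samples(handle):
    """Build the SAMPLES dict, keyed by sample name, from a TSV file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    if "sample" not in (reader.fieldnames or []):
        raise ValueError("sample sheet needs a 'sample' column")
    samples = {}
    for row in reader:
        name = row["sample"]
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = data_structure(row)
    return samples


SAMPLES = load_samples(io.StringIO(SHEET_TSV))


def get_dnaseq_tumor(wildcards):
    """Input helper: formerly config['data']['dnaseq_tumor'], now per sample."""
    return SAMPLES[wildcards.sample]["dnaseq_tumor"]


def all_targets():
    """Targets that `rule all` would request, one per sample."""
    return [f"results/{sample}/prioritization/" for sample in SAMPLES]


# Snakemake passes a wildcards object; SimpleNamespace imitates it here.
print(get_dnaseq_tumor(SimpleNamespace(sample="patient2")))
print(all_targets())
```

Because the helper only reads `SAMPLES[wildcards.sample]`, the same function serves every sample, and nothing is mutated at parse time.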

## Implementation plan

1. **Sample sheet** — `config/samples.tsv`, one row per sample (wide format: `sample`, `dnaseq_tumor`, `dnaseq_normal`, `rnaseq`, `custom_variants`, `custom_hla_I`, `custom_hla_II`). `config.yaml` keeps all shared parameters (`reference`, `threads`, `mapq`, `indel`, `hlatyping`, `prioritization`, ...) and gains a `samples: config/samples.tsv` key. The current per-sample `data:` block is removed.

2. **`Snakefile`** — load the sheet with pandas; `rule all` does `expand("results/{sample}/prioritization/", sample=SAMPLES.keys())`.

3. **`common.smk`** — make `data_structure()` a pure per-sample function (no global `config['data']` mutation); build `SAMPLES = {row.sample: data_structure(row) for row in sheet.itertuples(index=False)}` (iterating a pandas DataFrame directly yields column names, not rows). Rewrite the ~120 `config['data'][...]` references (all concentrated in this file) to `SAMPLES[wildcards.sample][...]`. `handle_seqfiles()` stays largely as-is (called per sample).

4. **`align.smk` parse-time conditionals** — `align.smk` defines rules conditionally with `if config['data']['dnaseq_filetype'] in ['.fq','.fastq']:` (3 references). Rule definitions are evaluated at parse time and cannot branch per sample. Resolve by either (a) always defining the rules and moving the filetype branch into the input functions, or (b) constraining v1 to "all samples share filetype/readtype" with explicit validation.

5. **Schema validation** — extend `workflow/schemas/config.schema.yaml` to validate the sample sheet (columns, required fields). Pairs naturally with the open schema-validation half of #65.

6. **SLURM profile** — add `profiles/slurm/config.yaml` using the Snakemake-8 `snakemake-executor-plugin-slurm` (partition, default `mem_mb`/`runtime`, `jobs`). Invocation: `snakemake --workflow-profile profiles/slurm --jobs N`.

7. **Download atomicity audit** — with all samples sharing the reference/resource outputs, audit `ref.smk` and the download rules so a killed or interrupted download (SLURM timeout, network drop) cannot leave a truncated file at the final output path. Snakemake builds each shared resource exactly once (single DAG node, no `{sample}` in the path) and re-runs jobs it knows were interrupted, but the robust pattern is download-to-temp + atomic `mv` so the final path only ever appears complete. `curl --fail` (PR #74) catches HTTP errors but not mid-transfer kills. Optionally document a pre-stage step (`snakemake --until <ref targets>`) to build `resources/` once before launching the sample batch.
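The download-to-temp plus atomic rename pattern from step 7 can be sketched in Python as below. This is an illustration of the technique, not the workflow's code: the existing rules use `curl`, where the shell equivalent is writing to a temporary path and then `mv`-ing it into place. The function name `atomic_write` and the `.part` suffix are hypothetical, and the data source is an arbitrary iterable of byte chunks so the sketch runs without network access.

```python
import os
import tempfile


def atomic_write(final_path, chunks):
    """Stream byte chunks to a temp file next to final_path, then rename it.

    The final path only ever appears once the whole transfer has succeeded.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(directory, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Failed or interrupted transfer: discard the partial file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A hard kill (for example `SIGKILL` after a SLURM timeout) skips the cleanup branch, so a stray `.part` file may remain, but the final output path is still never left truncated.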

## Open design decisions

- **Clean break vs. backwards compatibility** — fully switch to the sample sheet (simpler code, breaking change → minor version bump) vs. keep the single-sample `data:` block working as a 1-row fallback (more code). Leaning clean break.
- **`align.smk` per-sample filetype** — option (a) vs. (b) above.
- **Sheet format** — wide (fixed group columns) vs. long (one row per sample×group). Wide is friendlier for the "same config, list of samples" use case.
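To make the `align.smk` decision concrete, here is a rough sketch of both options from step 4. Only the `dnaseq_filetype` key and the `.fq`/`.fastq` extensions come from the existing code; the sample records, output paths, and the function names `dnaseq_bam_source` and `require_uniform_filetype` are hypothetical.

```python
from types import SimpleNamespace

FASTQ_EXTS = {".fq", ".fastq"}  # extensions currently tested in align.smk

# Hypothetical per-sample records; only `dnaseq_filetype` is a key from the workflow.
SAMPLES = {
    "patient1": {"dnaseq_filetype": ".fq"},
    "patient2": {"dnaseq_filetype": ".bam"},
}


def dnaseq_bam_source(wildcards):
    """Option (a): rules are always defined; the branch lives in an input function.

    Returns the (hypothetical) path a downstream rule should request, so only
    the upstream rule matching this sample's filetype ends up in the DAG.
    """
    sample = wildcards.sample
    if SAMPLES[sample]["dnaseq_filetype"] in FASTQ_EXTS:
        # Hypothetical output of the rule that aligns FASTQ input.
        return f"results/{sample}/dnaseq/align/aligned.bam"
    # Hypothetical output of the rule that handles pre-aligned BAM input.
    return f"results/{sample}/dnaseq/align/realigned.bam"


def require_uniform_filetype(samples, key="dnaseq_filetype"):
    """Option (b): fail at parse time unless every sample shares one filetype."""
    kinds = sorted({entry[key] for entry in samples.values()})
    if len(kinds) != 1:
        raise ValueError(f"all samples must share one {key}, found: {kinds}")
    return kinds[0]


# Snakemake passes a wildcards object; SimpleNamespace imitates it here.
print(dnaseq_bam_source(SimpleNamespace(sample="patient1")))
```

Option (a) supports mixed batches at the cost of touching every affected rule's inputs; option (b) keeps `align.smk` unchanged apart from replacing the three conditionals with the validated shared value.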

## Concurrency notes

- Within one run, shared resources are safe automatically: reference download/index outputs have no `{sample}` wildcard, so Snakemake runs each as a single job and all samples fan out from it — no duplication, no double-write.
- Across separate invocations, Snakemake's directory lock blocks a second concurrent run in the same working directory. Give each concurrent invocation its own working directory; because the `resources/` paths are relative, each directory then holds its own copy of the shared resources.
- Partial-download robustness is covered by step 7.

## Testing

- Add a 2-sample sheet under `.tests/integration/`.
- `snakemake -n` dry-run to confirm the DAG expands over both samples.
- A real 2-sample run on SLURM via the profile.

## Scope

Standalone feature, not part of the 2026-03-26 audit cluster. Likely one PR for the sample-sheet + `common.smk` refactor, optionally a second for the SLURM profile.
