Purpose
Make RunPod execution reproducible, interruption-safe, and auditable for this repo.
Mandatory Reading (blocking)
First comment must summarize:
- reports/NL_IMPLEMENTATION_ORACLE.md, sections 6.3.3 and 6.3.4
- docs/FSDP_SCALING_GUIDE.md
- docs/release_checklist.md
- docs/env_matrix.md
Also include in the first comment links to the RunPod docs pages you reviewed.
Required Code Anchors
- scripts/compute/
- training entrypoints (train.py, train_fsdp.py, train_deepspeed.py)
- docs under docs/
Scope
- Add concrete RunPod playbook:
  - pod create settings
  - persistent storage conventions
  - SSH + transfer commands
  - checkpoint frequency guidance for spot/on-demand
  - forced stop/resume drill
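As a starting point for the checkpoint frequency guidance, one common heuristic is Young's first-order approximation, which sets the interval to sqrt(2 × checkpoint cost × mean time between interruptions). The sketch below is illustrative only: the function name and the example inputs (60 s checkpoint write, 4 h between spot interruptions, 7 days for on-demand) are assumptions, not measurements from this repo or from RunPod, and should be replaced with observed values.

```python
import math


def checkpoint_interval_s(checkpoint_cost_s: float,
                          mean_time_between_interruptions_s: float) -> float:
    """Young's approximation for the checkpoint interval, in seconds.

    checkpoint_cost_s: wall-clock time to write one checkpoint.
    mean_time_between_interruptions_s: observed average uptime before a
        pod is stopped or preempted.
    """
    if checkpoint_cost_s <= 0 or mean_time_between_interruptions_s <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_interruptions_s)


# Placeholder inputs (assumptions): measure these on real pods before use.
spot = checkpoint_interval_s(60.0, 4 * 3600.0)            # about 22 minutes
on_demand = checkpoint_interval_s(60.0, 7 * 24 * 3600.0)  # longer, since stops are rarer

print(f"spot: every {spot / 60:.1f} min, on-demand: every {on_demand / 60:.1f} min")
```

The formula only balances checkpoint overhead against expected lost work; if the playbook also needs a hard cap on lost progress, a fixed upper bound on the interval can be applied on top of it.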
Deliverables
- docs/runpod_execution.md
Acceptance Criteria
- Fresh pod can run smoke end-to-end using docs only.
- Resume drill validated with evidence.
- First issue comment contains mandatory reading summary.
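
To make the forced stop/resume drill concrete, here is a minimal, self-contained sketch of what "validated with evidence" could look like. It does not use the repo's training entrypoints or checkpoint format, which are not described in this issue; `save_checkpoint`, `load_checkpoint`, `train`, and `run_drill` are hypothetical stand-ins. The toy loop writes checkpoints atomically, is interrupted mid-run, resumes from the last saved step, and is compared against an uninterrupted reference run.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically: temp file in the same directory, then os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers never observe a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every, stop_at=None):
    """Toy loop. stop_at simulates a forced stop once that step count is reached."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            return state  # simulated kill: progress since the last save is lost
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for real training work
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


def run_drill():
    """Interrupt a run, resume it, and compare with an uninterrupted run."""
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        train(ckpt, total_steps=20, save_every=5, stop_at=13)
        resumed_from = load_checkpoint(ckpt)["step"]
        final = train(ckpt, total_steps=20, save_every=5)
        reference = train(os.path.join(workdir, "ref.json"),
                          total_steps=20, save_every=5)
        return resumed_from, final, reference


resumed_from, final, reference = run_drill()
print("resumed from step", resumed_from)
print("matches uninterrupted run:", final == reference)
```

In a real drill the evidence would be the equivalent artifacts from an actual pod: the step the job resumed from, and a comparison of post-resume metrics against an uninterrupted run.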