v0.2: batch-1 / small-batch latency path (pinned memory, graph capture)

Batch-1 decode takes ~1 ms on H200 and is bound by kernel launch and H2D/D2H transfer overhead, not compute. After the megakernel (issue #2) lands, measure the residual small-batch latency floor and attack it with pinned host memory plus CUDA/HIP graph capture. Honest scope note: real-time superconducting cadence stays FPGA territory; the target here is 'fast single-shot', receipted.
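
A minimal sketch of how the residual floor could be measured, so that before/after numbers for pinned memory and graph capture are comparable. Everything here is an assumption for illustration: `measure_floor`, `step`, and `sync` are hypothetical names, not part of this project. `step` stands in for one batch-1 decode call, and `sync` for whatever blocks until the device has finished (for example a stream synchronize); without it, only the asynchronous launch would be timed.

```python
import time
from typing import Callable, Dict, Optional


def measure_floor(
    step: Callable[[], None],
    warmup: int = 100,
    iters: int = 1000,
    sync: Optional[Callable[[], None]] = None,
) -> Dict[str, float]:
    """Time `step` end to end and report min / p50 / p99 in microseconds.

    `step` and `sync` are hypothetical hooks: `step` issues one batch-1
    decode, `sync` (optional) waits for the device to finish that work.
    """

    def run_once() -> None:
        step()
        if sync is not None:
            sync()  # include device completion in the timed region

    # Warm-up runs are discarded: first calls pay one-time costs
    # (allocation, caching, lazy initialisation) that are not the floor.
    for _ in range(warmup):
        run_once()

    samples_ns = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        run_once()
        samples_ns.append(time.perf_counter_ns() - t0)

    samples_ns.sort()

    def pct(p: float) -> float:
        # Nearest-rank style percentile on the sorted samples.
        idx = min(len(samples_ns) - 1, int(p * len(samples_ns)))
        return samples_ns[idx] / 1000.0

    return {
        "min_us": samples_ns[0] / 1000.0,
        "p50_us": pct(0.50),
        "p99_us": pct(0.99),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the real decode call and sync hook.
    stats = measure_floor(lambda: sum(range(100)), warmup=10, iters=200)
    print(stats)
```

Reporting the minimum alongside p50 and p99 is a deliberate choice: the minimum approximates the floor itself, while the gap to p99 shows how much jitter remains once launch and transfer overhead have been reduced.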