The dominant cost in RelayBpDecoder is the host-side leg loop: dozens of kernel launches per relay leg, thousands per batch. On Metal this is ~31 s of launch overhead against ~1.3 s of math; on H200 the gap to vendor-class throughput is ~50x. The [[72,12,6]] per-shot message state (mu 18.1 KB + nu 18.1 KB + posterior/gamma ~13 KB, ~49 KB total) fits in shared memory, so the design target is: one Triton program = one shot; all BP iterations, relay legs, and nconv selection run in-kernel from SRAM; per-shot early exit on convergence; ONE launch per batch. Same source on CUDA/ROCm/Metal. Gate: LER-identity vs the current implementation under the existing validation tiers, then throughput receipts on all three platforms. Fallback if SRAM or occupancy blocks this: CUDA/HIP graph capture of the leg schedule (NVIDIA/AMD only), tracked separately.
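The SRAM/occupancy condition behind the fallback can be checked with simple arithmetic before any kernel is written. The sketch below sums the per-shot state sizes quoted above and reports how many one-shot programs could be resident for a given shared-memory capacity. The names `ShotState` and `shots_resident` are hypothetical, and the capacities in the loop are placeholders, not device specifications; substitute the per-block limit each target device actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotState:
    """Per-shot message state sizes in KB, taken from the note above."""
    mu_kb: float = 18.1
    nu_kb: float = 18.1
    posterior_gamma_kb: float = 13.0  # quoted as "~13 KB", so approximate

    @property
    def total_kb(self) -> float:
        # Total shared memory one program (= one shot) needs.
        return self.mu_kb + self.nu_kb + self.posterior_gamma_kb


def shots_resident(state: ShotState, sram_kb: float) -> int:
    """How many one-shot programs fit in `sram_kb` of shared memory at once.

    A result of 0 means the state does not fit and the fallback applies.
    """
    return int(sram_kb // state.total_kb)


if __name__ == "__main__":
    state = ShotState()
    print(f"per-shot state: {state.total_kb:.1f} KB")
    # Placeholder capacities (ASSUMPTION): replace with queried device limits.
    for capacity_kb in (32.0, 64.0, 228.0):
        n = shots_resident(state, capacity_kb)
        verdict = "fits" if n > 0 else "does not fit"
        print(f"{capacity_kb:6.1f} KB -> {n} resident shot(s), {verdict}")
```

This only bounds residency by shared memory. Register pressure and any padding or alignment the compiler adds are not modeled, so a result of 1 or more is necessary but not sufficient for the single-launch design.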