v0.2: single-launch persistent Relay-BP megakernel (shot-per-program, SRAM-resident) #2

@bledden

The dominant cost in RelayBpDecoder is the host-side leg loop: dozens of kernel launches per relay leg, thousands per batch. On Metal that is ~31 s of launch overhead against ~1.3 s of math; on H200 the gap to vendor-class throughput is ~50x.

The [[72,12,6]] per-shot message state (mu 18.1 KB + nu 18.1 KB + posterior/gamma ~13 KB, roughly 49 KB in total) fits in shared memory, so the design target is: one Triton program = one shot; all BP iterations, relay legs, and nconv selection run in-kernel from SRAM; per-shot early exit on convergence; ONE launch per batch. Same source on CUDA/ROCm/Metal.

Gate: LER-identity vs the current implementation under the existing validation tiers, then throughput receipts on all three platforms.

Fallback if SRAM or occupancy blocks this: CUDA/HIP graph capture of the leg schedule (NVIDIA/AMD only), tracked separately.
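To make the shot-per-program control flow concrete, here is a minimal CPU-side sketch in plain Python. It is not Triton code and not the RelayBpDecoder implementation: `decode_one_shot`, `decode_batch`, `make_toy_step`, and the `step` callable signature are all hypothetical names. It also assumes that a leg ends as soon as it converges, that decoding stops once `nconv` converged solutions exist, and that selection keeps the lowest Hamming weight. The real schedule and selection rule may differ.

```python
from typing import Callable, List, Optional, Tuple

Candidate = List[int]  # hypothetical: a 0/1 error estimate that satisfies the syndrome
# Hypothetical step signature: (state, leg, iteration) -> (new_state, candidate or None)
Step = Callable[[object, int, int], Tuple[object, Optional[Candidate]]]


def decode_one_shot(state, step: Step, num_legs: int, iters_per_leg: int, nconv: int):
    """Control flow for ONE shot, i.e. what a single program would run in-kernel.

    Returns (best_candidate_or_None, solutions_found, final_state).
    """
    best: Optional[Candidate] = None
    found = 0
    for leg in range(num_legs):
        for it in range(iters_per_leg):
            # One BP iteration; in the proposed kernel this state stays in SRAM.
            state, candidate = step(state, leg, it)
            if candidate is not None:
                found += 1
                # Assumed selection rule: keep the lowest Hamming weight.
                if best is None or sum(candidate) < sum(best):
                    best = candidate
                break  # this leg converged, hand over to the next leg
        if found >= nconv:
            break  # per-shot early exit: enough converged solutions
    return best, found, state


def decode_batch(initial_states, step: Step, num_legs: int, iters_per_leg: int, nconv: int):
    """Stand-in for the single launch: one loop body per shot ("program")."""
    return [decode_one_shot(s, step, num_legs, iters_per_leg, nconv)
            for s in initial_states]


def make_toy_step(schedule):
    """Toy step for illustration: the state just counts iterations, and
    convergence is scripted by a {(leg, iteration): candidate} dict."""
    def step(state, leg, it):
        return state + 1, schedule.get((leg, it))
    return step
```

With a toy step that converges at leg 0 iteration 2 and leg 1 iteration 0, `decode_one_shot(0, step, num_legs=4, iters_per_leg=3, nconv=2)` stops after four iterations in total rather than the full twelve. That saving is what the per-shot early exit is meant to capture, and here it happens without a round trip to the host between legs.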
