
v0.2: batch-1 / small-batch latency path (pinned memory, graph capture) #4

@bledden

Description

Batch-1 decode takes ~1 ms on H200. That time is bound by kernel launch overhead and H2D/D2H transfers, not by compute. After the megakernel (issue #2) lands, measure the residual small-batch latency floor and attack it with pinned host memory plus (CUDA/HIP) graph capture. Honest scope note: real-time superconducting cadence stays FPGA territory; the target here is 'fast single-shot', receipted.
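For the "measure the residual floor" step, a minimal timing harness might look like the sketch below. It is not this repo's benchmark code: `measure_latency`, `step`, and `sync` are hypothetical names, and the warmup and iteration counts are arbitrary placeholders. The one assumption that matters is that GPU work is asynchronous, so `sync` should block until the device is idle (for example `torch.cuda.synchronize` if the decode path runs through PyTorch); without it the timer only sees launch cost.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def measure_latency(
    step: Callable[[], object],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 50,
    iters: int = 1000,
) -> Dict[str, float]:
    """Time `step` end to end; return latency statistics in microseconds.

    `step` is one full single-shot call (H2D copy, kernels, D2H copy).
    `sync` is a hypothetical hook that blocks until the device is idle.
    """
    if iters < 2:
        raise ValueError("iters must be at least 2")
    sync = sync or (lambda: None)

    # Warm up so one-time costs (allocation, JIT, graph instantiation)
    # do not pollute the measured samples.
    for _ in range(warmup):
        step()
    sync()

    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        step()
        sync()  # include device completion in the timed region
        samples.append((time.perf_counter_ns() - t0) / 1_000.0)

    samples.sort()
    # Nearest-rank style index for the 99th percentile, clamped to range.
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "n": float(len(samples)),
        "min_us": samples[0],
        "median_us": statistics.median(samples),
        "p99_us": samples[p99_index],
        "max_us": samples[-1],
    }


if __name__ == "__main__":
    # CPU-only stand-in for a decode call, just to show the call shape.
    print(measure_latency(lambda: sum(range(1000)), warmup=10, iters=200))
```

Running the same harness before and after each change (pinned buffers, then graph capture) would give comparable median and tail numbers; reporting `min_us` alongside the median helps separate the floor from scheduling jitter.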
