v0.2 merged the megakernel as standalone backend classes (tridec.backends.megakernel.{BpMegaTriton,RelayBpMegaTriton}), validated directly on Metal/CUDA/ROCm with receipts. NOT yet wired into the public from_dem/RelayBpDecoder dispatch: that path still uses the two-kernel RelayBpTriton.
Design (validated finding): on a GPU backend, RELAY should default to the megakernel (9-22x faster, identical correctness); plain BP should STAY two-kernel (the BP megakernel loses at large batch — no early-exit lever). So: RelayBpDecoder(backend='triton'|'metal') -> mega by default; BpDecoder -> two-kernel; expose kernel='mega'|'two-kernel'|'auto' override.
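The default-selection rule above could be sketched roughly as follows. This is only an illustration: `resolve_kernel`, the `decoder` strings 'relay' and 'bp', and the fall-through to two-kernel for backends other than 'triton'/'metal' are assumptions made for the sketch, not the tridec API.

```python
# Hypothetical sketch of the kernel-selection rule; not the tridec implementation.
GPU_BACKENDS = {"triton", "metal"}          # backends named in the design note
KERNELS = {"mega", "two-kernel", "auto"}    # override values named in the design note


def resolve_kernel(decoder: str, backend: str, kernel: str = "auto") -> str:
    """Return the kernel variant to dispatch to: 'mega' or 'two-kernel'.

    decoder: 'relay' or 'bp' (hypothetical labels for RelayBpDecoder / BpDecoder).
    backend: backend name as passed by the caller, e.g. 'triton' or 'metal'.
    kernel:  'mega', 'two-kernel', or 'auto' (pick the default for this decoder).
    """
    if kernel not in KERNELS:
        raise ValueError(f"kernel must be one of {sorted(KERNELS)}, got {kernel!r}")
    if kernel != "auto":
        # An explicit override always wins over the default.
        return kernel
    if decoder == "relay" and backend in GPU_BACKENDS:
        # Relay on a GPU backend defaults to the megakernel.
        return "mega"
    # Plain BP stays two-kernel. Assumption: any other backend also falls back here.
    return "two-kernel"
```

With this sketch, resolve_kernel('relay', 'triton') selects 'mega', resolve_kernel('bp', 'triton') selects 'two-kernel', and an explicit kernel='two-kernel' keeps the relay path on the existing two-kernel implementation.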
Gate before shipping: re-run the relay gates THROUGH from_dem(...).decode_batch (not just the standalone classes) on a CUDA or ROCm GPU — the dispatch wiring is the untested surface. Needs a GPU session; do not flip the default without it.
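One plausible piece of such a gate is an equality check between the public dispatch path and the already-validated standalone class on the same syndromes. The helper below is a minimal sketch under that assumption: `count_mismatched_shots` is a hypothetical name, and both decoders are passed in as plain callables because the exact call signatures are not stated in this note.

```python
# Hypothetical gate helper; the real relay gates may check more than output equality.
def count_mismatched_shots(public_decode, reference_decode, syndromes):
    """Count shots where the public path and the reference path disagree.

    public_decode:    callable for the path under test, e.g. a bound
                      from_dem(...).decode_batch (assumed to take a batch of syndromes).
    reference_decode: callable wrapping the standalone, already-validated class.
    syndromes:        batch of syndromes, one row per shot.
    """
    # Normalize each output row to a tuple so lists and arrays compare the same way.
    public = [tuple(row) for row in public_decode(syndromes)]
    reference = [tuple(row) for row in reference_decode(syndromes)]
    if len(public) != len(reference):
        raise ValueError(
            f"batch size mismatch: {len(public)} public vs {len(reference)} reference"
        )
    return sum(p != r for p, r in zip(public, reference))
```

A gate built on this would require a result of 0 on the GPU session before the default is flipped.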