tridec

Badge honesty: CI is CPU-only (ubuntu + macos arm64; the macos lane binds the strict exact-count receipt gates). There are no GPU runners — the CUDA/ROCm kernel paths are validated by the carried H200/MI300X receipts in bench/receipts/, and the experimental Metal tier runs on a local machine.

An open, vendor-portable GPU decoder library for quantum LDPC codes — Triton min-sum BP and Relay-BP decoders that consume any stim DetectorErrorModel or raw parity-check matrices, with CPU reference implementations, validated against the standard CPU references (ldpc, relay-bp), running on NVIDIA (CUDA) and AMD (ROCm) GPUs.

The same Triton kernels run unmodified on both vendors: the Relay-BP kernel reproduces its logical-error-rate validation numbers identically on an NVIDIA H200 (CUDA 12.4, triton 3.0) and an AMD MI300X (ROCm 7.0, triton 3.4) — see docs/benchmark.md and the raw receipts in bench/receipts/. Validated scope is NVIDIA + AMD; Apple silicon runs the same kernels through triton-metal as an experimental backend (see below).

v0.2: the megakernel backend (opt-in)

A single-launch persistent megakernel — the entire Relay-BP decode (every BP iteration, every relay leg, in-kernel syndrome convergence + nconv stop + lowest-weight selection) in one kernel launch per decode_batch, with per-shot early exit, instead of the v0.1 host loop's thousands of launches. Validated on all three platforms against the v0.1 two-kernel path and the relay-bp Rust oracle (14/14 gates at BLOCK 128/256 on CUDA and ROCm; barriers verified honored on both):

Relay-BP megakernel vs v0.1 two-kernel	speedup
Apple M4 Max (Metal, triton-metal)	~197× — 30.0 s → 0.152 s / 2000 shots (relay BLOCK=256, num_warps=8)
NVIDIA H200 (CUDA)	9–19× fp32 (fp64 to 37× mid-batch) — batch-1 62.5 → 3.44 ms; 34.6 µs/syn @8192
AMD MI300X (ROCm)	9–32× — batch-1 8.48 ms; 46.0 µs/syn @8192

(Speedups are vs each platform's own v0.1 two-kernel path and vary with batch size — min–max across batch 1–16384. Absolute cross-vendor performance, where H200 leads, is in the limits below.) Receipt stacks: H200 (CUDA 12.4 / triton 3.0), MI300X (ROCm 6.2 / torch 2.5.1 / triton 3.1, gfx942), M4 Max (triton-metal, CODEGEN_VERSION 2026.06.13); raw in bench/receipts/megakernel_* — the Metal block-lift re-measure is in megakernel_metal_lift.{md,json}. The Metal 30.0 s baseline and the v0.1 Apple section's 31 s below are independent measurement runs of the same two-kernel relay (run-to-run jitter), not a discrepancy.
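
The headline ratios can be recomputed from the timings quoted in the table. The snippet below is plain arithmetic on those published numbers, not a benchmark; the dictionary keys and variable names are illustrative only.

```python
# Timings quoted in the table above, in seconds: (v0.1 two-kernel, megakernel).
timings = {
    "M4 Max, 2000 shots": (30.0, 0.152),
    "H200, batch-1": (62.5e-3, 3.44e-3),
}

# Speedup is baseline time divided by megakernel time on the same platform.
speedups = {name: base / mega for name, (base, mega) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")

# A per-syndrome cost at batch 8192 converts to a whole-batch wall time.
h200_batch_seconds = 34.6e-6 * 8192
mi300x_batch_seconds = 46.0e-6 * 8192
batched_gap = mi300x_batch_seconds / h200_batch_seconds
print(f"MI300X / H200 at batch 8192: {batched_gap:.2f}x")
```

The batch-8192 ratio lands at the upper end of the batched cross-vendor gap quoted in the limits section.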

Auto-dispatch (v0.2.1): tridec.from_dem(..., algorithm="relay") / RelayBpDecoder now use the megakernel by default on GPU, because relay wins decisively there; pass megakernel=False for the v0.1 two-kernel host loop. The path is GPU-gated by construction (RelayBpDecoder only accepts the triton/metal backends, never CPU). BP keeps the two-kernel default: BpMegaTriton stays opt-in via tridec.backends.megakernel, because the plain-BP megakernel is a single-shot latency tool that loses at batch throughput (#5). Receipts: bench/receipts/megakernel_{h200,mi300x,metal}*.
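
The dispatch defaults described above can be summarised as a small decision function. This is a sketch of the documented rules only: `choose_kernel_path` and its string return values are hypothetical names invented for illustration and are not part of the tridec API.

```python
GPU_BACKENDS = {"triton", "metal"}


def choose_kernel_path(algorithm, backend, megakernel=None):
    """Return "megakernel" or "two-kernel" for a decoder configuration.

    Hypothetical helper mirroring the documented defaults; not tridec code.
    """
    if algorithm == "relay":
        # Relay-BP is GPU-gated: CPU backends are rejected outright.
        if backend not in GPU_BACKENDS:
            raise ValueError("Relay-BP only accepts the triton/metal backends")
        # Megakernel is the default; megakernel=False opts back out.
        return "two-kernel" if megakernel is False else "megakernel"
    if algorithm == "bp":
        # Plain BP keeps the two-kernel default; the megakernel is opt-in.
        if megakernel and backend not in GPU_BACKENDS:
            raise ValueError("the BP megakernel needs a GPU backend")
        return "megakernel" if megakernel else "two-kernel"
    raise ValueError(f"unknown algorithm: {algorithm!r}")


print(choose_kernel_path("relay", "triton"))                  # default relay path
print(choose_kernel_path("relay", "metal", megakernel=False))  # explicit opt-out
print(choose_kernel_path("bp", "triton"))                      # default BP path
```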

Megakernel: honest limits + tuning

Plain-BP megakernel is a single-shot latency tool, not a throughput tool. At batch-1 it is ~1.7× faster than the two-kernel BP path (H200 0.61 vs 1.06 ms); at large batch it loses (plain BP has no early-exit lever) — the two-kernel BP path stays the throughput default. Use BpMegaTriton for low-latency bare BP, RelayBpMegaTriton for the accurate latency path.
Real-time / single-shot: H200 leads MI300X 2.47× at batch-1 (3.44 vs 8.48 ms) — wider than v0.1's ~9% two-kernel gap, because the single-CTA-per-shot design amplifies per-SM and codegen differences at batch-1. Batched, the gap is 1.25–1.33×; correctness is identical across vendors. (The pitch is vendor-portable + performant on both, never parity.)
Per-arch autotuning. v0.2 ships autotuned BLOCK/num_warps configs for H200, MI300X and M4 Max, pinned in _CUDA_TUNED keyed by gcnArchName/device name. AMD (wavefront-64) wants the opposite shape from NVIDIA warps — low warps for BP, max BLOCK+warps for relay.
Metal: fully lifted off the old BLOCK=32 pin — both kernels at BLOCK=256: BP (256) (20 → 12 ms, 1.67×) and relay (256, num_warps=8) (441 → 152 ms, 2.89× — the ~197× headline above), relay bit-identical to BLOCK=128. The relay num_warps=8 is load-bearing: it sets num_threads = num_warps×32 = 256 = BLOCK so each thread handles exactly one element (n = BLOCK/num_threads = 1); at n>1 triton-metal's base path under-covers a BLOCK-wide store and now loudly refuses (MetalNonRecoverableError, never silent-wrong), so the footgun can't bite. Requires triton-metal with the in-loop-reduction + n=1-store fixes (older triton-metal loudly refuses relay@256). fp32-only on Metal (no fp64), same as the two-kernel path, and the fp32 near-tie-flip caveat below applies to the megakernel unchanged. Receipt: bench/receipts/megakernel_metal_lift.md.
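
The Metal thread-coverage rule above reduces to one line of arithmetic. The sketch below checks a (BLOCK, num_warps) pair against it; the function names are hypothetical, and a plain `ValueError` stands in for the loud refusal that triton-metal itself raises.

```python
WARP_SIZE = 32  # threads per warp, as in num_threads = num_warps × 32


def elements_per_thread(block, num_warps):
    """n = BLOCK / num_threads, requiring an exact division."""
    num_threads = num_warps * WARP_SIZE
    if block % num_threads:
        raise ValueError(f"BLOCK={block} is not a multiple of {num_threads} threads")
    return block // num_threads


def check_metal_relay_config(block, num_warps):
    """Accept only configs where each thread owns exactly one element (n = 1)."""
    n = elements_per_thread(block, num_warps)
    if n != 1:
        # Stand-in for the backend's refusal; the real error type differs.
        raise ValueError(f"n={n} > 1 would under-cover a BLOCK-wide store")
    return n


print(check_metal_relay_config(256, 8))  # 8 × 32 = 256 threads, one element each
print(elements_per_thread(256, 4))       # 128 threads, two elements each
```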

Install

Most users want pip install "tridec[torch,decoders]" (CPU+GPU torch backend plus the reference adapters). The bare install is the numpy CPU reference only — correct but slow.

pip install tridec                # numpy CPU reference only
pip install "tridec[torch]"       # + batched torch backend (CPU/GPU)
pip install "tridec[gpu]"         # + Triton GPU kernels (CUDA or ROCm)
pip install "tridec[decoders]"    # + ldpc / relay-bp reference adapters
pip install "tridec[sinter]"      # + sinter.collect integration

Quickstart

import stim
import tridec

circuit = stim.Circuit.from_file("memory.stim")
dem = circuit.detector_error_model(decompose_errors=False)

decoder = tridec.from_dem(dem, backend="auto")   # triton > torch > numpy

dets, obs = circuit.compile_detector_sampler(seed=0).sample(
    100_000, separate_observables=True)
pred = decoder.decode_batch(dets)                      # (shots, n_obs) bool
print("logical error rate:", (pred != obs).any(axis=1).mean())

Raw matrices work too: tridec.from_matrices(H, priors, observables=Lo). Relay-BP: tridec.from_dem(dem, algorithm="relay") (Triton kernels only).
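
For readers new to the raw-matrix inputs, here is a minimal pure-Python sketch of syndrome-based min-sum BP on a tiny parity-check matrix. It is a textbook version written for illustration, not tridec's kernel: it has no batching, scaling or damping, and it assumes every check touches at least two variables. Reading `H` as the parity-check matrix, `priors` as per-column error probabilities and `Lo` as the observables matrix is an assumption about the argument meanings.

```python
import math


def min_sum_bp(H, priors, syndrome, max_iter=20):
    """Return (error_estimate, converged) for a dense 0/1 matrix H."""
    m, n = len(H), len(H[0])
    checks = [[j for j in range(n) if H[i][j]] for i in range(m)]
    prior_llr = [math.log((1 - p) / p) for p in priors]
    # Variable-to-check messages start at the prior log-likelihood ratios.
    v2c = {(i, j): prior_llr[j] for i in range(m) for j in checks[i]}
    error = [0] * n
    for _ in range(max_iter):
        # Check update: sign product and minimum magnitude over the other edges.
        c2v = {}
        for i in range(m):
            for j in checks[i]:
                sign = -1.0 if syndrome[i] else 1.0
                mag = math.inf
                for k in checks[i]:
                    if k == j:
                        continue
                    if v2c[(i, k)] < 0:
                        sign = -sign
                    mag = min(mag, abs(v2c[(i, k)]))
                c2v[(i, j)] = sign * mag
        # Posterior LLR per variable, then a hard decision.
        posterior = list(prior_llr)
        for (i, j), msg in c2v.items():
            posterior[j] += msg
        error = [1 if llr < 0 else 0 for llr in posterior]
        # Stop as soon as the estimate reproduces the syndrome.
        if all(sum(error[j] for j in checks[i]) % 2 == syndrome[i] for i in range(m)):
            return error, True
        # Variable update: remove each edge's own incoming message.
        for (i, j), msg in c2v.items():
            v2c[(i, j)] = posterior[j] - msg
    return error, False


H = [[1, 1, 0], [0, 1, 1]]   # two checks over three error mechanisms
priors = [0.1, 0.1, 0.1]
Lo = [[1, 0, 0]]             # hypothetical observable supported on bit 0
error, converged = min_sum_bp(H, priors, syndrome=[1, 0])
flips = [sum(a * e for a, e in zip(row, error)) % 2 for row in Lo]
print(error, converged, flips)
```

The last line maps the estimated error onto the observables, which is the quantity compared against `obs` in the quickstart.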

With sinter (the [sinter] extra):

import sinter
from tridec.sinter import sinter_decoders

# `tasks` is an iterable of sinter.Task objects built elsewhere.
stats = sinter.collect(
    num_workers=4, tasks=tasks,
    decoders=["tridec_bp", "pymatching"],
    custom_decoders=sinter_decoders(),
    max_shots=1_000_000)

Backend × algorithm matrix (honest availability)

Algorithm	`numpy`	`torch`	`triton`	`metal` (experimental)
min-sum BP	yes (CPU reference)	yes (CPU + CUDA/ROCm)	yes (CUDA + ROCm)	yes (fp32)
Relay-BP	no	no	yes (CUDA + ROCm)	yes (fp32, slow — see below)

There is no in-package CPU Relay-BP; its CPU reference is IBM's relay-bp Rust decoder, wrapped in tridec.adapters and used as the validation oracle for the Triton path.

What's validated where

Environment	Status
CPU (any)	numpy BP reference; torch BP bit-identical to numpy at fp64 (one iteration), LER-identical full decode
NVIDIA H200, CUDA 12.4, torch 2.4.1, triton 3.0.0	Triton BP: ≥99.5% hard-decision agreement vs fp64 references, LER-identical (156 = 156 = 156 fails / 2000 shots vs numpy/torch). Triton Relay-BP: LER-matches the `relay-bp` Rust oracle (31 vs 38 fails / 2000, overlapping Wilson CIs) — carried source-repo receipts
AMD MI300X, ROCm 7.0.0, torch 2.9, triton 3.4.0	Same kernels, unmodified: identical primitive-identity numbers (pre-leg posterior max-diff 1.8e-15) and the same oracle-vs-Triton LER identity (carried receipts) — and validated through the installed package for v0.1.0 (`bench/receipts/mi300x_packaged.json`): full suite 88 passed / 10 skipped on gfx942 (GPU tiers bind, darwin-only strict tiers skip), packaged-API BP 166 = numpy 166 fails / 2000, Relay-BP fp32 34 vs Rust oracle 31 (overlapping CIs), throughput within ±2.2% of the carried receipt
Apple silicon (M4 Max), triton-metal	Experimental, spike-validated only (`bench/receipts/metal_spike.md`): both kernels pass the same correctness gates at fp32; see the section below

This table covers the two-kernel BP/Relay-BP path (v0.1). The v0.2 megakernel's own per-platform validation (14/14 gates on CUDA + ROCm, ~197× on Metal at the lifted BLOCK=256 blocks — BP (256), relay (256,8)) is in the megakernel section above, with raw receipts in bench/receipts/megakernel_*.

Experimental: Apple silicon (Metal)

The same Triton kernels run on Apple-silicon GPUs through triton-metal, with zero changes to the kernel source. This is experimental: validated at spike level on one machine (M4 Max), fp32 only (Metal has no fp64), and not part of the official receipt set.

# triton-metal + a triton >= 3.6 build + torch must be importable, then:
pip install tridec
python -c "import tridec; print(tridec.available_backends())"  # ['metal', ...]

backend="auto" detects the triton-metal environment (darwin, triton + triton_metal importable, no CUDA/ROCm device) and selects "metal"; backend="triton" resolves to "metal" there too, and backend="metal" asserts the environment is present. The execution pattern is triton-metal's documented one — CPU torch tensors (zero-copy via unified memory; not mps) — so no device arguments are needed.
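
The resolution rules in the paragraph above can be written as a pure function over environment facts. This is a sketch under stated assumptions: `resolve_backend` and its boolean parameters are hypothetical, the real detection probes imports and devices itself, and the torch/numpy fallback order is taken from the quickstart comment (triton > torch > numpy).

```python
def resolve_backend(requested, *, is_darwin, has_triton, has_triton_metal,
                    has_torch, has_gpu_device):
    """Map a requested backend name to the one that would be used."""
    # The triton-metal environment: darwin, both modules importable, no CUDA/ROCm.
    metal_env = is_darwin and has_triton and has_triton_metal and not has_gpu_device
    triton_env = has_triton and has_gpu_device

    if requested == "metal":
        # An explicit request asserts that the environment is present.
        if not metal_env:
            raise RuntimeError("triton-metal environment not present")
        return "metal"
    if requested == "triton":
        if metal_env:
            return "metal"  # "triton" resolves to "metal" on Apple silicon
        if triton_env:
            return "triton"
        raise RuntimeError("no Triton-capable GPU environment")
    if requested == "auto":
        if metal_env:
            return "metal"
        if triton_env:
            return "triton"
        return "torch" if has_torch else "numpy"
    raise ValueError(f"unknown backend request: {requested!r}")


mac = dict(is_darwin=True, has_triton=True, has_triton_metal=True,
           has_torch=True, has_gpu_device=False)
linux_gpu = dict(is_darwin=False, has_triton=True, has_triton_metal=False,
                 has_torch=True, has_gpu_device=True)
print(resolve_backend("auto", **mac), resolve_backend("auto", **linux_gpu))
```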

What the spike measured (2000 canonical shots, seed 0, M4 Max — re-validated through this API path in tests/test_metal.py):

min-sum BP (fp32): all correctness gates pass — one-iteration hard agreement 1.000 vs the fp64 numpy reference on both the surface-code and BB-code fixtures; LER 76 = 76 / 2000 (surface) and 167 vs 168 / 2000 (BB). Batched decode of 2000 shots in 28 ms (surface) / 167 ms (BB) — 37–56× the per-shot numpy baseline on the same machine.
Relay-BP (fp32): correct but slow — LER matches the relay-bp Rust oracle (31 vs 39 fails / 2000, per-shot agreement 99.3%), but decode_batch(2000) takes 31 s vs 1.26 s for the Rust CPU oracle: relay's per-iteration host loop (~7k small kernel launches) is launch-overhead dominated on Metal. Use it for validation, not production.
Relay-BP on metal enforces fp32: dtype="float64" raises with a clear error; the default resolves to float32.

No claims beyond the spike: no official LER receipts, no cross-machine validation, no performance tuning. CUDA/ROCm remain the supported GPU paths.

Compatibility floors are set in pyproject.toml; known-good pins: stim 1.15.0, ldpc 2.4.1, relay-bp 0.2.2, torch 2.4.1 / 2.9, triton 3.0 / 3.4.

Validation discipline

tridec.validation ships the matched-protocol harness the numbers were produced with: dem_hash (sha256 of the DEM's canonical bytes), run_matched (one shared DEM, one shot set, fail-fast DEM-identity and tie-break gates), Wilson/TOST statistics and a paired per-shot gap-to-MLE bootstrap. The test suite pins the extraction byte-for-byte: 8 canonical BB-code fixture circuits must hash to the exact DEM sha256s recorded in the carried zoo_grid.json receipt, and a full 16,667-shot cell must reproduce the recorded logical-failure counts of the ldpc reference adapters exactly.
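
To make the "overlapping Wilson CIs" wording concrete, the sketch below computes standard Wilson score intervals for two failure counts and tests them for overlap. It uses the textbook formula with z = 1.96 (a 95% interval), which is an assumption; the harness in tridec.validation may use a different confidence level or overlap rule, and the function names here are illustrative.

```python
import math


def wilson_interval(failures, shots, z=1.96):
    """Wilson score interval for a binomial proportion failures / shots."""
    p = failures / shots
    denom = 1 + z * z / shots
    center = (p + z * z / (2 * shots)) / denom
    half = z * math.sqrt(p * (1 - p) / shots + z * z / (4 * shots * shots)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# The 31 vs 38 fails / 2000 shots comparison quoted in the validation table.
ci_a = wilson_interval(31, 2000)
ci_b = wilson_interval(38, 2000)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Because the interval width shrinks with the number of shots, the same absolute gap in failure counts can pass at 2000 shots and fail at a much larger sample.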

For v0.1.0 the WHOLE grid was re-decoded in the receipt environment (bench/full_grid_noregression.py): 31 of 32 (cell, decoder) failure counts reproduce exactly — all 24 BP / BP-OSD-0 / BP-OSD-10 counts, and 7 of 8 BPLSD counts. The single deviation (BPLSD, p=0.002/X: 879 vs 880, one shot in 200,000) is attributed by a same-environment repeat experiment to run-to-run nondeterminism inside ldpc's BpLsdDecoder itself (identical shots, fresh instances, 5 repeats: 880/880/879/880/880 — the same single shot flips) — documented in bench/receipts/full_grid_noregression.json.

Status

0.2.1: Relay-BP auto-dispatch. from_dem(..., algorithm="relay") / RelayBpDecoder use the megakernel by default on GPU (megakernel=False opts back to the two-kernel host loop); GPU-gated by construction. Validated on all three platforms: Metal (M4 Max), NVIDIA H200 (CUDA), AMD MI300X (ROCm/gfx942). Also: the statistical-tier validation gates now use a sample-size-aware Wilson-CI overlap test (#1). 0.2.0 added the megakernel backend (tri-platform validated; see above). v0.1.0 shipped the two-kernel BP/Relay-BP path + the validation discipline. The kernels and their receipts are stable; the public API surface is young and may still move before 1.0: minor 0.x releases may rename or remove public API, and 1.0 will lock the surface. GPU paths require triton + a CUDA/ROCm GPU (or the experimental triton-metal environment); the GPU/metal test tiers skip cleanly where unavailable.

License

Apache-2.0.
