Fast LLM speculative inference server for consumer hardware.
Updated Jun 13, 2026 - C++
An agent harness that compiles a model into a single provably correct, self-retargeting CUDA megakernel and self-tunes it to outperform cuBLAS at batch-1 LLM decode.
Air.rs: 70B+ inference on consumer GPUs; LLM inference in Rust.
A light, transparent, and modular inference & quantization engine for studying LLMs.
Vendor-portable GPU decoders for quantum LDPC codes — Triton min-sum BP and Relay-BP for any stim detector error model, validated on NVIDIA (CUDA), AMD (ROCm), and Apple (Metal, experimental); opt-in single-launch persistent-megakernel backend.