This is the Codex/native-agent guide to driving AutoMegaKernel (AMK), the Codex analogue of
CLAUDE.md. It tells an agent exactly what AMK is, the one edit surface it owns, the canonical
tool/CLI names, the non-negotiable honesty rules, and a copy-paste optimization loop. Everything
here maps 1:1 to existing, verified code; nothing is faked.
For the full human/agent contract read HARNESS.md. For what AMK is and what is
measured-real today, read README.md.
AMK compiles a HuggingFace Llama-family model into ONE persistent CUDA megakernel (the whole forward pass fused into a single cooperative kernel launch), then self-improves the schedule over its own baseline, correctness-gated and unattended. It is the sibling of AutoKernel, which tunes one kernel; AMK tunes the whole-model megakernel via a new search axis: the schedule.
The agent's job is to search the schedule, not to write kernels. The frozen VM deterministically
lowers a ScheduleConfig into a runnable megakernel and proves it deadlock/race-free before any
launch. An unsafe config is a clean REJECTED, never a hung GPU.
You edit one structured object: a ScheduleConfig (a JSON dict of typed knobs), optionally with
a kernel_knobs sub-object. You NEVER touch raw kernel code, vm/, Task.sm, or the frozen ABI.
The schema is schemas/schedule_config.schema.json; the live, machine-readable knob list comes
from amk_propose(...)["search_space"]. The knobs:
| knob | type | choices | meaning |
|---|---|---|---|
| tiling.gemv.N_tile | int | 64, 128, 256, 512 | GEMV output-column tile width (the one tiling knob lowered today) |
| tiling.attention.kv_block | int | 64, 128, 256 | KV window block (recorded; not yet lowered) |
| fusion_grouping | list[list[str]] | [], [["gate","up"]], ... | op groups to co-reside (a safe hint) |
| sm_assignment | str/dict | round_robin/load_balance/map | SM placement policy |
| pipelining_depth | int | 0–4 typ. | instructions ahead to prefetch weights (hides the HBM bubble) |
| page_allocation | str | graph_color/linear/none | activation page reuse policy |
| threads_per_block | int | 128, 256, 512 | persistent VM block size (occupancy-proven) |
| smem_bytes_per_block | int | 0, 16384, 49152 | dynamic SMEM opt-in (over-cap = clean reject) |
kernel_knobs (e.g. instruction -D macros the autoresearch/loop drivers explore) is part of the
search space the loop mutates; pass it inside the config dict to amk_eval as a "kernel_knobs" object.
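A cheap local pre-check can catch out-of-range knob values before a config ever reaches amk_eval. The sketch below simply mirrors the enumerated choices from the knob table; `ALLOWED` and `check_knobs` are hypothetical names, not part of AMK, and the real validation (including the deadlock/race-free proof) still happens inside the harness. sm_assignment, fusion_grouping and pipelining_depth are left out because the table does not give them a closed set of values.

```python
# Hypothetical pre-check: mirrors the knob table in this guide, NOT AMK's validator.
ALLOWED = {
    ("tiling", "gemv", "N_tile"): (64, 128, 256, 512),
    ("tiling", "attention", "kv_block"): (64, 128, 256),
    ("page_allocation",): ("graph_color", "linear", "none"),
    ("threads_per_block",): (128, 256, 512),
    ("smem_bytes_per_block",): (0, 16384, 49152),
}


def _lookup(config, path):
    """Walk a nested dict along `path`; return None if any key is missing."""
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_knobs(config):
    """Return a list of problems; an empty list means no listed knob is out of range."""
    problems = []
    for path, choices in ALLOWED.items():
        value = _lookup(config, path)
        # Missing knobs are skipped here; only present-but-unlisted values are flagged.
        if value is not None and value not in choices:
            problems.append(f"{'.'.join(path)}={value!r} not in {list(choices)}")
    return problems


if __name__ == "__main__":
    bad = {"tiling": {"gemv": {"N_tile": 96}}, "threads_per_block": 256}
    print(check_knobs(bad))
```

An empty result only means the listed knobs use documented values; it says nothing about whether the schedule is safe or correct.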
- Correctness FIRST. A latency is NEVER reported without a correctness PASS vs the CPU ReferenceVM. Keep a candidate only if it is correct AND ≥1% faster than the incumbent.
- validate-before-launch. An unsafe ScheduleConfig is a clean REJECTED (deadlock/race-free proof), never a hung GPU.
- The edit surface is ScheduleConfig + kernel_knobs ONLY, never raw kernel code, never vm/, never the frozen ABI.
- Measured-gpu latency is drift-robust; physically-impossible sub-roofline latencies are withheld.
- All speedups are vs AMK's OWN baseline, NOT a claim of beating cuBLAS/vLLM (AMK is currently behind cuBLAS at batch-1, within ~13%, and says so).
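The keep rule in the first bullet can be written down as a small predicate. This is a minimal sketch, assuming the verdict is a dict with the `valid`, `correct` and `latency_us` fields listed for amk_eval, and assuming "≥1% faster" means the candidate latency is at most 99% of the incumbent's; `should_keep` is a hypothetical helper, not an AMK function.

```python
def should_keep(verdict, incumbent_latency_us, min_gain=0.01):
    """Keep a candidate only if it is valid, correct, and >= min_gain faster."""
    # Correctness first: latency is not even looked at for an invalid/incorrect run.
    if not (verdict.get("valid") and verdict.get("correct")):
        return False
    latency = verdict.get("latency_us")
    # A withheld latency (assumed here to show up as None) cannot justify a keep.
    if latency is None:
        return False
    # Assumption: ">=1% faster" means latency <= 0.99 * incumbent latency.
    return latency <= incumbent_latency_us * (1.0 - min_gain)


if __name__ == "__main__":
    candidate = {"valid": True, "correct": True, "latency_us": 97.0}
    print(should_keep(candidate, incumbent_latency_us=100.0))
```

Checking correctness before touching the latency field keeps the decision aligned with the rule that a latency is never reported without a correctness PASS.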
The MCP server module is amk_mcp.py (repo root). It exposes:
- amk_doctor() → torch/CUDA availability, device name, registered GpuTargets.
- amk_propose(model, gpu="rtx5090") → incumbent ScheduleConfig + editable search_space (includes the kernel_knobs sub-surface).
- amk_eval(model, gpu, config, device="auto") → structured verdict {valid, correct, latency_us, latency_kind, pct_of_roofline, bound_us, schedule_id, ...}. config is a JSON object (a ScheduleConfig, optionally with a "kernel_knobs" object).
- amk_loop(model, gpu, budget=8, device="auto") → keep/revert loop result {best_verdict, best_config, rows, ...}.
- amk_autoresearch(model, gpu, minutes=None, iters=None, device="auto", overnight=False, cold=False) → unattended campaign result.
- amk_orchestrate_status() / amk_orchestrate_next() / amk_orchestrate_report() → campaign state-machine snapshots (structured dicts).
- amk_orchestrate_record(status, latency_us=None, pct_roofline=None, kind=None, config=None, description="") → record one experiment outcome (status ∈ kept|revert|failed|crash|timeout|rejected).
amk propose|eval|loop|autoresearch|compile|generate|doctor # the `amk` console command
python amk_orchestrate.py status|next|record|report # the campaign state machine
Every amk <cmd> can also be run as uv run python amk_cli.py <cmd> .... The eval/tune-instruction/etc.
verbs print only JSON on stdout (build chatter goes to stderr), so their output is safe to pipe into jq/json.loads.
The eval exit code is 0 iff the config is valid and correct (gate your loop on it).
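Because eval prints JSON only on stdout and signals valid+correct through its exit code, a driver script can gate on both. The following is a sketch under those two documented properties; `parse_eval` and `run_eval` are hypothetical wrappers, and `run_eval` assumes `uv` is installed and that it is called from the repo root where amk_cli.py lives.

```python
import json
import subprocess


def parse_eval(returncode, stdout):
    """Turn an eval exit code and its stdout into (ok, verdict)."""
    try:
        verdict = json.loads(stdout)  # stdout is documented as JSON-only
    except json.JSONDecodeError:
        verdict = None  # treat unparsable output as a failed evaluation
    ok = returncode == 0 and verdict is not None
    return ok, verdict


def run_eval(model, gpu, config_path):
    """Run the eval verb once; build chatter on stderr is captured and ignored."""
    cmd = ["uv", "run", "python", "amk_cli.py", "eval", model,
           "--gpu", gpu, "--config", config_path]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_eval(proc.returncode, proc.stdout)


if __name__ == "__main__":
    ok, verdict = run_eval("toy", "rtx5090", "cfg.json")
    print("keep-eligible" if ok else "revert", verdict)
```

Splitting parsing from process launch keeps the gating logic testable without a GPU or an AMK checkout.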
Add this table to your Codex config so Codex launches the AMK MCP server (the agent extra pulls in the mcp SDK):
[mcp_servers.automegakernel]
command = "uv"
args = ["run", "--extra", "agent", "python", "amk_mcp.py"]
(Claude Code reads the equivalent project config from .mcp.json at the repo root.)
The mcp SDK is optional; install it once with uv sync --extra agent. The tool LOGIC in
amk_mcp.py imports ONLY existing AMK modules, so the package stays importable without it.
# 1) read the edit surface (the "program.md" step)
uv run python amk_cli.py propose toy --gpu rtx5090 > surface.json
# 2) write a candidate config (edit ONE knob), evaluate it (JSON-only on stdout)
cat > cfg.json <<'JSON'
{ "tiling": {"gemv": {"N_tile": 128}, "attention": {"kv_block": 128}},
"fusion_grouping": [], "sm_assignment": "load_balance",
"pipelining_depth": 3, "page_allocation": "graph_color",
"threads_per_block": 256, "smem_bytes_per_block": 0 }
JSON
uv run python amk_cli.py eval toy --gpu rtx5090 --config cfg.json # exit 0 => valid+correct
# 3) KEEP the candidate only if it is correct AND >=1% faster than the incumbent; else REVERT.
# Repeat with the next single-knob edit.
# 4) or let the built-in keep/revert loop do all of the above:
uv run python amk_cli.py loop toy --gpu rtx5090 --budget 16
Equivalent via MCP tools: amk_propose → amk_eval (loop, keeping on correctness-then-≥1%) →
amk_loop to automate. Programmatic Python is in HARNESS.md §6.
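The "edit ONE knob" step of the manual loop can be made mechanical by enumerating every config that differs from the incumbent in exactly one knob. This is an illustrative sketch only: `CHOICES` copies a subset of the knob table, `single_knob_edits` is a hypothetical helper, and the built-in loop may propose candidates in a different way. It assumes the incumbent config already contains every knob listed in `CHOICES`.

```python
import copy

# Subset of the knob table in this guide; each key is a path of nested dict keys.
CHOICES = {
    ("tiling", "gemv", "N_tile"): [64, 128, 256, 512],
    ("threads_per_block",): [128, 256, 512],
    ("page_allocation",): ["graph_color", "linear", "none"],
}


def single_knob_edits(config):
    """Yield (path, value, candidate) for each config differing in exactly one knob."""
    for path, values in CHOICES.items():
        node = config
        for key in path[:-1]:
            node = node[key]
        current = node[path[-1]]
        for value in values:
            if value == current:
                continue  # skip the incumbent's own value
            candidate = copy.deepcopy(config)  # never mutate the incumbent
            target = candidate
            for key in path[:-1]:
                target = target[key]
            target[path[-1]] = value
            yield path, value, candidate


if __name__ == "__main__":
    incumbent = {"tiling": {"gemv": {"N_tile": 128}},
                 "threads_per_block": 256, "page_allocation": "graph_color"}
    for path, value, _ in single_knob_edits(incumbent):
        print(".".join(path), "->", value)
```

Each yielded candidate would then be written to a file and passed to eval, keeping it only under the correctness-then-≥1% rule.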
# fast, deterministic cost-model pass (no GPU)
uv run python amk_cli.py autoresearch toy --gpu rtx5090 --iters 20 --device cpu
# sleep on it: an ~8h real-GPU campaign that never stops on a plateau (basin-hops, preserves best)
uv run python amk_cli.py autoresearch small --gpu rtx5090 --minutes 480 --device cuda --overnight
# wake up: workspace/amk_overnight_report.md (best config + speedup vs AMK's OWN baseline)
Via MCP: amk_autoresearch(model, gpu, minutes=..., overnight=true). Drive the campaign brain with
amk_orchestrate_status / amk_orchestrate_next / amk_orchestrate_record / amk_orchestrate_report
(or python amk_orchestrate.py status|next|record|report). Both the headless driver and an agent
talking to the orchestrator write the same results.tsv + flywheel corpus, so AMK gets smarter
every run regardless of who drove it.
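When an agent drives the orchestrator itself, each eval verdict has to be turned into arguments for amk_orchestrate_record. The sketch below shows one possible mapping and is an assumption throughout: which verdict maps to which status (in particular "rejected" for invalid and "failed" for incorrect), and the pairing of `pct_of_roofline`/`latency_kind` with the `pct_roofline`/`kind` parameters, are guesses from the field names, and `record_args` is a hypothetical helper. The crash and timeout statuses are not produced here because they describe runs that never returned a verdict.

```python
def record_args(verdict, incumbent_latency_us, description=""):
    """Build keyword arguments for amk_orchestrate_record from one eval verdict."""
    latency = verdict.get("latency_us")
    if not verdict.get("valid"):
        status = "rejected"   # assumed: config failed validate-before-launch
    elif not verdict.get("correct"):
        status = "failed"     # assumed: correctness gate not passed
    elif latency is not None and latency <= 0.99 * incumbent_latency_us:
        status = "kept"       # correct AND >=1% faster than the incumbent
    else:
        status = "revert"
    passed = status in ("kept", "revert")
    return {
        "status": status,
        # Never report a latency without a correctness PASS.
        "latency_us": latency if passed else None,
        "pct_roofline": verdict.get("pct_of_roofline") if passed else None,
        "kind": verdict.get("latency_kind") if passed else None,
        "description": description,
    }


if __name__ == "__main__":
    v = {"valid": True, "correct": True, "latency_us": 95.0,
         "latency_kind": "measured-gpu", "pct_of_roofline": 61.0}
    print(record_args(v, incumbent_latency_us=100.0, description="N_tile 128 -> 256"))
```

Dropping the latency fields for rejected or failed runs keeps recorded rows consistent with the correctness-first rule.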
<gpu> is a registered GpuTarget: rtx5090, b200, h100, a100. <model> is toy / toy-2L
(fully supported) or a HuggingFace id (best-effort via schedule.graph.from_hf). The harness uses
the local GPU/CPU only; it never touches Modal or any cloud GPU.