core: drop per-run String/TypedFact alloc in resolve_symbols_with_states by czoli1976 · Pull Request #2364 · sonos/tract

czoli1976 · 2026-06-13T09:20:41Z

What

SimpleState::resolve_symbols_with_states runs once per run() — i.e. once per decode token for an autoregressive model. It selects the stateful ops that need symbol resolution with:

.filter(|s| s.init_tensor_fact().is_some())

but OpState::init_tensor_fact() clones a String (the cache name) and a TypedFact, wraps them in an Option, and returns them — only for the result to be tested with is_some() and immediately dropped.

This PR adds an allocation-free predicate OpState::has_init_tensor_fact() -> bool that mirrors init_tensor_fact(), and uses it for the filter.
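The shape of the change can be sketched roughly as below. This is a minimal illustration, not tract's actual code: the `TypedFact`, `KvCacheState`, `StatelessLike`, and `count_resolvable` definitions are simplified stand-ins invented for the example, and the real `OpState` trait has more methods and different signatures.

```rust
// Simplified stand-ins: these are NOT tract's real types or signatures.
#[derive(Clone, Debug)]
struct TypedFact {
    shape: Vec<usize>,
}

trait OpState {
    /// Expensive form: clones a name and a fact even if the caller only
    /// wants to know whether a value is present.
    fn init_tensor_fact(&self) -> Option<(String, TypedFact)> {
        None
    }

    /// Allocation-free form. The default mirrors the `None` default above.
    fn has_init_tensor_fact(&self) -> bool {
        false
    }
}

/// Hypothetical KV-cache-like state that owns a name and a fact.
struct KvCacheState {
    name: String,
    fact: TypedFact,
}

impl OpState for KvCacheState {
    fn init_tensor_fact(&self) -> Option<(String, TypedFact)> {
        Some((self.name.clone(), self.fact.clone()))
    }

    fn has_init_tensor_fact(&self) -> bool {
        true
    }
}

/// Hypothetical state that keeps both defaults.
struct StatelessLike;
impl OpState for StatelessLike {}

/// Counts the states that need symbol resolution without cloning anything.
fn count_resolvable(states: &[Box<dyn OpState>]) -> usize {
    states.iter().filter(|s| s.has_init_tensor_fact()).count()
}

fn main() {
    let cache = KvCacheState {
        name: "k_cache_0".to_string(),
        fact: TypedFact { shape: vec![1, 8, 64] },
    };
    println!("fact carried by the cache: {:?}", cache.fact);
    let states: Vec<Box<dyn OpState>> = vec![Box::new(cache), Box::new(StatelessLike)];
    println!("states needing resolution: {}", count_resolvable(&states));
}
```

The filter closure only reads a `bool`, so nothing is cloned or dropped per state.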

The set of ops that override init_tensor_fact to return Some (the transformers + GPU DynKeyValueCache states, with the metal/cuda fused ops delegating) is exactly the set that overrides resolve_symbols, so the filter selects the same states as before — only the per-state allocation is removed. A drift-guard test (has_init_tensor_fact_matches_init_tensor_fact) keeps the two in sync, since if they ever disagreed an op's resolve_symbols would silently stop running.
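A drift guard of this kind might look like the following sketch. It assumes a cut-down `OpState` trait and made-up state types (`Cache`, `Plain`, `Drifted`); the actual test in the PR, `has_init_tensor_fact_matches_init_tensor_fact`, is not reproduced here and may be structured differently.

```rust
// Hypothetical minimal trait; tract's real OpState has more methods.
trait OpState {
    fn init_tensor_fact(&self) -> Option<(String, Vec<usize>)> {
        None
    }
    fn has_init_tensor_fact(&self) -> bool {
        false
    }
}

/// State that overrides both methods consistently.
struct Cache {
    name: String,
    shape: Vec<usize>,
}

impl OpState for Cache {
    fn init_tensor_fact(&self) -> Option<(String, Vec<usize>)> {
        Some((self.name.clone(), self.shape.clone()))
    }
    fn has_init_tensor_fact(&self) -> bool {
        true
    }
}

/// State that keeps both defaults.
struct Plain;
impl OpState for Plain {}

/// Deliberately broken state: overrides the accessor but forgets the predicate.
struct Drifted;
impl OpState for Drifted {
    fn init_tensor_fact(&self) -> Option<(String, Vec<usize>)> {
        Some(("drifted".to_string(), vec![1]))
    }
}

/// True when the cheap predicate agrees with the expensive accessor.
fn predicate_in_sync(state: &dyn OpState) -> bool {
    state.has_init_tensor_fact() == state.init_tensor_fact().is_some()
}

fn main() {
    let cache = Cache { name: "k_cache_0".to_string(), shape: vec![1, 8, 64] };
    println!("Cache in sync:   {}", predicate_in_sync(&cache));
    println!("Plain in sync:   {}", predicate_in_sync(&Plain));
    println!("Drifted in sync: {}", predicate_in_sync(&Drifted));
}
```

The `Drifted` case is the failure the guard exists for: a filter based on the predicate would skip that state even though it has an init tensor fact.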

Honesty up front: this is a micro-optimization

It removes exactly one heap allocation per KV-cache op per decode step (2 * n_layers per token), with no behavioural change. On a real, compute-bound model it is a tiny fraction of per-token work and does not move wall-clock time. The value is reduced allocator pressure on the per-token hot path — most visible when layer count is high relative to compute, and under allocator contention. I'm filing it because it's a strictly-better, zero-risk change with a clear measurement, not because it's a speedup you'll feel on a 1.7B model.

Benchmarks

Two reproducible benchmark examples are included under transformers/examples/, both using a counting global allocator.
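For readers unfamiliar with the technique, a counting global allocator can be sketched as follows. This is a generic illustration built on the standard library's `GlobalAlloc` trait, not the code from the two benchmark examples; the names `CountingAlloc`, `ALLOCS`, and `allocs_during` are made up for this sketch.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps the system allocator and counts every allocation request.
struct CountingAlloc;

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        // Forward the request unchanged to the system allocator.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result with the number of allocations counted
/// process-wide while it ran (other threads are included in the count).
fn allocs_during<R>(f: impl FnOnce() -> R) -> (R, usize) {
    let before = ALLOCS.load(Ordering::Relaxed);
    let out = f();
    let after = ALLOCS.load(Ordering::Relaxed);
    (out, after - before)
}

fn main() {
    let name = String::from("k_cache_0");

    // Cloning a non-empty String requests heap memory.
    let (_copy, cloned) = allocs_during(|| name.clone());
    println!("allocations while cloning the name: {cloned}");

    // A boolean check on the same String requests none.
    let (_flag, checked) = allocs_during(|| !name.is_empty());
    println!("allocations while checking a bool:  {checked}");
}
```

Because the counter is global, a measurement like this is only meaningful when nothing else allocates concurrently, which is why a single-threaded decode loop is a convenient thing to measure.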

kv_resolve_probe — compute-light model with N real DynKeyValueCache ops, allocations per decode step:

n_caches	before	after	Δ allocs	before ns/step	after ns/step
16	52	36	−16	8544	7723
32	101	69	−32	14842	11923
64	198	134	−64	24336	22628
128	391	263	−128	37502	31523

Allocations drop by exactly N (one String clone per cache per step; the TypedFact clone was inline-smallvec and didn't hit the heap). In this compute-light setting, wall time per step is also lower at every N, by roughly 7-20%.

llm_decode_bench — end-to-end, Qwen3-1.7B q40, folded DynKeyValueCache decode (56 caches = 28 layers × 2), persistent SimpleState, 128 decode tokens × 3 runs:

metric	before	after
allocs/token	17444	17388
tokens/sec	~19.6	~19.6

Exactly −56 allocations/token (= 2 * n_layers), deterministic across runs, no wall-clock regression.
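The figures quoted in the two tables can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers from the tables in this description; the `ProbeRow` struct and helper names are hypothetical and exist purely for the check.

```rust
/// One row of the kv_resolve_probe table from this description.
struct ProbeRow {
    n_caches: u64,
    allocs_before: u64,
    allocs_after: u64,
    ns_before: u64,
    ns_after: u64,
}

const PROBE: [ProbeRow; 4] = [
    ProbeRow { n_caches: 16, allocs_before: 52, allocs_after: 36, ns_before: 8544, ns_after: 7723 },
    ProbeRow { n_caches: 32, allocs_before: 101, allocs_after: 69, ns_before: 14842, ns_after: 11923 },
    ProbeRow { n_caches: 64, allocs_before: 198, allocs_after: 134, ns_before: 24336, ns_after: 22628 },
    ProbeRow { n_caches: 128, allocs_before: 391, allocs_after: 263, ns_before: 37502, ns_after: 31523 },
];

/// Allocations removed per decode step.
fn alloc_delta(r: &ProbeRow) -> u64 {
    r.allocs_before - r.allocs_after
}

/// Relative reduction in time per step, in percent.
fn time_saved_percent(r: &ProbeRow) -> f64 {
    100.0 * (r.ns_before - r.ns_after) as f64 / r.ns_before as f64
}

/// Allocations removed per token in llm_decode_bench (17444 before, 17388 after).
fn decode_bench_delta() -> u64 {
    17444 - 17388
}

fn main() {
    for r in &PROBE {
        println!(
            "N={:>3}: {} fewer allocs, {:.1}% less time per step",
            r.n_caches,
            alloc_delta(r),
            time_saved_percent(r)
        );
    }
    println!("llm_decode_bench: {} fewer allocs per token", decode_bench_delta());
}
```

Each probe row loses exactly `n_caches` allocations, and the end-to-end delta equals 2 × 28 layers.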

(Note: the causal_llm example unfolds KV caches into explicit model I/O, which removes these stateful ops entirely, so it does not exercise this path — hence the dedicated folded-mode llm_decode_bench.)

Testing

cargo test -p tract-core (247 tests) and cargo test -p tract-transformers (incl. the dyn_kv_cache NNEF round-trip + the new drift-guard test) pass.
metal builds; the cuda override mirrors metal (delegation).
The stateless path is unaffected by construction (no op states → the filter iterates nothing).

Files

core/src/ops/mod.rs — new has_init_tensor_fact() trait method (default false)
core/src/plan.rs — use it in the per-run filter
transformers/src/ops/dyn_kv_cache.rs — override true + drift-guard test
gpu/src/ops/dyn_kv_cache.rs — override true
metal/src/ops/fused_axis_op.rs, cuda/src/ops/fused_axis_op.rs — delegate
transformers/examples/{kv_resolve_probe,llm_decode_bench}.rs — benchmarks
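The "delegate" entries for the fused ops can be pictured with the sketch below. It is an assumption-laden illustration: `FusedState`, `KvCacheState`, and `PlainState` are invented names, and the real metal/cuda fused-op states are not shown in this description.

```rust
// Hypothetical minimal trait; only the predicate is modelled here.
trait OpState {
    fn has_init_tensor_fact(&self) -> bool {
        false
    }
}

/// Stand-in for a KV-cache state that has an init tensor fact.
struct KvCacheState;
impl OpState for KvCacheState {
    fn has_init_tensor_fact(&self) -> bool {
        true
    }
}

/// Stand-in for a state that keeps the default.
struct PlainState;
impl OpState for PlainState {}

/// Hypothetical wrapper standing in for a fused-op state that owns another state.
struct FusedState {
    inner: Box<dyn OpState>,
}

impl OpState for FusedState {
    // Delegate to the wrapped state. Without this override the wrapper would
    // report the default `false` and hide an inner KV cache from the filter.
    fn has_init_tensor_fact(&self) -> bool {
        self.inner.has_init_tensor_fact()
    }
}

fn main() {
    let fused_cache = FusedState { inner: Box::new(KvCacheState) };
    let fused_plain = FusedState { inner: Box::new(PlainState) };
    println!("fused over cache: {}", fused_cache.has_init_tensor_fact());
    println!("fused over plain: {}", fused_plain.has_init_tensor_fact());
}
```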

🤖 Generated with Claude Code

`SimpleState::resolve_symbols_with_states` runs once per `run()`, i.e. once per decode token for an LLM. It selected the stateful ops that need symbol resolution with `s.init_tensor_fact().is_some()`, but `init_tensor_fact()` clones a `String` (the cache name) and a `TypedFact` and returns them in an `Option` purely so the result can be tested with `is_some()` and dropped.

Add an allocation-free `OpState::has_init_tensor_fact() -> bool` predicate that mirrors `init_tensor_fact()`, and use it for the filter. The set of ops that override `init_tensor_fact` to return `Some` (the transformers and GPU KV-cache states, with the metal/cuda fused ops delegating) is exactly the set that overrides `resolve_symbols`, so the filter selects the same states as before; only the per-state allocation is removed. A drift-guard test keeps the two methods in sync.

This is a micro-optimization: it removes exactly one heap allocation per KV-cache op per decode step (2 * n_layers per token), with no change in behaviour. On a real model it is a small fraction of per-token work and does not move wall-clock; the value is reduced allocator pressure on the per-token hot path, most visible when layer count is high relative to compute and under allocator contention.

Benchmarks (added as transformers examples):

kv_resolve_probe (compute-light, N KV caches), allocs/decode-step:
  N=16: 52 -> 36
  N=32: 101 -> 69
  N=64: 198 -> 134
  N=128: 391 -> 263
  (exactly N fewer; ~7-20% faster)

llm_decode_bench (Qwen3-1.7B q40, folded decode, 56 KV caches):
  allocs/token: 17444 -> 17388 (exactly -56 = 2 * 28 layers, deterministic)
  tokens/sec: ~19.6 -> ~19.6 (unchanged, within noise)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

kali · 2026-06-17T18:04:07Z

Rebased!

czoli1976 force-pushed the perf/resolve-symbols-no-alloc branch from fd9b857 to 9ca9631 (June 13, 2026 09:23)

kali force-pushed the perf/resolve-symbols-no-alloc branch from 9ca9631 to 3872934 (June 17, 2026 18:04)

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:perf/resolve-symbols-no-alloc
