core: drop per-run String/TypedFact alloc in resolve_symbols_with_states #2364

Open
czoli1976 wants to merge 1 commit into sonos:main from czoli1976:perf/resolve-symbols-no-alloc

@czoli1976

Contributor

What

SimpleState::resolve_symbols_with_states runs once per run() — i.e. once per decode token for an autoregressive model. It selects the stateful ops that need symbol resolution with:

.filter(|s| s.init_tensor_fact().is_some())

but OpState::init_tensor_fact() clones a String (the cache name) and a TypedFact, wraps them in an Option, and returns them — only for the result to be tested with is_some() and immediately dropped.

This PR adds an allocation-free predicate OpState::has_init_tensor_fact() -> bool that mirrors init_tensor_fact(), and uses it for the filter.

The set of ops that override init_tensor_fact to return Some (the transformers + GPU DynKeyValueCache states, with the metal/cuda fused ops delegating) is exactly the set that overrides resolve_symbols, so the filter selects the same states as before — only the per-state allocation is removed. A drift-guard test (has_init_tensor_fact_matches_init_tensor_fact) keeps the two in sync, since if they ever disagreed an op's resolve_symbols would silently stop running.
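The predicate, the filter, and the drift guard described above can be sketched as follows. This is a minimal illustration with simplified stand-ins, not the code in this PR: `TypedFact` is a toy struct, `KvCacheState`, `PlainState`, `states_needing_resolution`, `in_sync` and `sample_states` are hypothetical names, and the `init_tensor_fact` signature is assumed from the description.

```rust
// Toy stand-in for tract's TypedFact; the real type carries much more.
#[derive(Clone, Debug, PartialEq)]
pub struct TypedFact {
    pub shape: Vec<usize>,
}

// Simplified trait; signatures are assumed, not copied from tract.
pub trait OpState {
    /// Returns owned data, so an overriding impl has to clone the name.
    fn init_tensor_fact(&self) -> Option<(String, TypedFact)> {
        None
    }
    /// Allocation-free predicate; must agree with `init_tensor_fact().is_some()`.
    fn has_init_tensor_fact(&self) -> bool {
        false
    }
}

// Hypothetical KV-cache-like state that overrides both methods.
pub struct KvCacheState {
    pub name: String,
    pub fact: TypedFact,
}

impl OpState for KvCacheState {
    fn init_tensor_fact(&self) -> Option<(String, TypedFact)> {
        Some((self.name.clone(), self.fact.clone()))
    }
    fn has_init_tensor_fact(&self) -> bool {
        true
    }
}

// Hypothetical state that keeps both defaults.
pub struct PlainState;
impl OpState for PlainState {}

/// Indices of the states the per-run filter would select, using only the predicate.
pub fn states_needing_resolution(states: &[Box<dyn OpState>]) -> Vec<usize> {
    states
        .iter()
        .enumerate()
        .filter(|(_, s)| s.has_init_tensor_fact())
        .map(|(ix, _)| ix)
        .collect()
}

/// Drift guard: the cheap predicate and the allocating method must agree.
pub fn in_sync(state: &dyn OpState) -> bool {
    state.has_init_tensor_fact() == state.init_tensor_fact().is_some()
}

/// A small mixed set of states for trying the functions above.
pub fn sample_states() -> Vec<Box<dyn OpState>> {
    let fact = TypedFact { shape: vec![1, 8, 0, 64] };
    vec![
        Box::new(KvCacheState { name: "k_cache".to_string(), fact: fact.clone() }),
        Box::new(PlainState),
        Box::new(KvCacheState { name: "v_cache".to_string(), fact }),
    ]
}
```

Calling `states_needing_resolution(&sample_states())` selects the two cache-like states, and `in_sync` is the kind of check a drift-guard test would assert for every state type that overrides either method.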

Honesty up front: this is a micro-optimization

It removes exactly one heap allocation per KV-cache op per decode step (2 * n_layers per token), with no behavioural change. On a real, compute-bound model it is a tiny fraction of per-token work and does not move wall-clock time. The value is reduced allocator pressure on the per-token hot path — most visible when layer count is high relative to compute, and under allocator contention. I'm filing it because it's a strictly-better, zero-risk change with a clear measurement, not because it's a speedup you'll feel on a 1.7B model.

Benchmarks

Two reproducible benchmark examples are included under transformers/examples/, both using a counting global allocator.
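For readers unfamiliar with the technique, a counting global allocator can be sketched roughly as below. This is an assumption about the general approach, not the code in the two examples: `CountingAlloc`, `ALLOCS` and `allocs_during` are hypothetical names, and only `std::alloc::GlobalAlloc` and `System` are real standard-library items.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical wrapper that forwards to the system allocator and counts calls.
pub struct CountingAlloc;

// Process-wide number of allocation calls observed so far.
pub static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Install the counting allocator for the whole program.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocation calls made while `f` runs.
/// The counter is process-wide, so other threads allocating at the same
/// time would inflate the result.
pub fn allocs_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    f();
    ALLOCS.load(Ordering::Relaxed) - before
}
```

Wrapping one decode step in `allocs_during` gives an allocs-per-step figure of the kind reported below. Because the counter is global, a measurement like this is most trustworthy on a single thread.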

kv_resolve_probe — compute-light model with N real DynKeyValueCache ops, allocations per decode step:

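| n_caches | allocs before | allocs after | Δ allocs | ns/step before | ns/step after |
|---|---|---|---|---|---|
| 16 | 52 | 36 | −16 | 8544 | 7723 |
| 32 | 101 | 69 | −32 | 14842 | 11923 |
| 64 | 198 | 134 | −64 | 24336 | 22628 |
| 128 | 391 | 263 | −128 | 37502 | 31523 |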

Allocations drop by exactly N (one String clone per cache per step; the TypedFact clone was inline-smallvec and didn't hit the heap). Wall time per step is also lower at every N in these runs (roughly 7-20%), which is only visible because compute is light.
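The kv_resolve_probe figures can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table above; `Row`, `ROWS`, `allocs_saved`, `percent_faster` and `report` are hypothetical names used only for this illustration.

```rust
// One row of the kv_resolve_probe table.
pub struct Row {
    pub n_caches: u64,
    pub allocs_before: u64,
    pub allocs_after: u64,
    pub ns_before: f64,
    pub ns_after: f64,
}

// Values copied from the table in this PR description.
pub const ROWS: [Row; 4] = [
    Row { n_caches: 16, allocs_before: 52, allocs_after: 36, ns_before: 8544.0, ns_after: 7723.0 },
    Row { n_caches: 32, allocs_before: 101, allocs_after: 69, ns_before: 14842.0, ns_after: 11923.0 },
    Row { n_caches: 64, allocs_before: 198, allocs_after: 134, ns_before: 24336.0, ns_after: 22628.0 },
    Row { n_caches: 128, allocs_before: 391, allocs_after: 263, ns_before: 37502.0, ns_after: 31523.0 },
];

/// Allocations removed per decode step.
pub fn allocs_saved(r: &Row) -> u64 {
    r.allocs_before - r.allocs_after
}

/// Relative reduction in ns/step, in percent.
pub fn percent_faster(r: &Row) -> f64 {
    100.0 * (r.ns_before - r.ns_after) / r.ns_before
}

/// One human-readable summary line per row.
pub fn report() -> Vec<String> {
    ROWS.iter()
        .map(|r| {
            format!(
                "N={}: {} fewer allocs, {:.1}% less time per step",
                r.n_caches,
                allocs_saved(r),
                percent_faster(r)
            )
        })
        .collect()
}
```

Printing the lines returned by `report()` shows the saved allocation count matching N in every row.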

llm_decode_bench — end-to-end, Qwen3-1.7B q40, folded DynKeyValueCache decode (56 caches = 28 layers × 2), persistent SimpleState, 128 decode tokens × 3 runs:

| metric | before | after |
|---|---|---|
| allocs/token | 17444 | 17388 |
| tokens/sec | ~19.6 | ~19.6 |

Exactly −56 allocations/token (= 2 * n_layers), deterministic across runs, no wall-clock regression.
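As a worked check on these numbers (using only the figures reported above), the saving matches the cache count and is a small share of the per-token allocations:

```latex
\[
2 \times 28 = 56, \qquad
17444 - 17388 = 56, \qquad
\frac{56}{17444} \approx 0.32\%
\]
```

That share is consistent with the unchanged tokens/sec figure.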

(Note: the causal_llm example unfolds KV caches into explicit model I/O, which removes these stateful ops entirely, so it does not exercise this path — hence the dedicated folded-mode llm_decode_bench.)

Testing

  • cargo test -p tract-core (247) and -p tract-transformers (incl. the dyn_kv_cache NNEF round-trip + the new drift-guard test) pass.
  • metal builds; the cuda override mirrors metal (delegation).
  • The stateless path is unaffected by construction (no op states → the filter iterates nothing).

Files

  • core/src/ops/mod.rs — new has_init_tensor_fact() trait method (default false)
  • core/src/plan.rs — use it in the per-run filter
  • transformers/src/ops/dyn_kv_cache.rs — override true + drift-guard test
  • gpu/src/ops/dyn_kv_cache.rs — override true
  • metal/src/ops/fused_axis_op.rs, cuda/src/ops/fused_axis_op.rs — delegate
  • transformers/examples/{kv_resolve_probe,llm_decode_bench}.rs — benchmarks

🤖 Generated with Claude Code

@czoli1976 czoli1976 force-pushed the perf/resolve-symbols-no-alloc branch from fd9b857 to 9ca9631 Compare June 13, 2026 09:23
`SimpleState::resolve_symbols_with_states` runs once per `run()` — i.e. once
per decode token for an LLM. It selected the stateful ops that need symbol
resolution with `s.init_tensor_fact().is_some()`, but `init_tensor_fact()`
clones a `String` (the cache name) and a `TypedFact` and returns them in an
`Option` purely so the result can be tested with `is_some()` and dropped.

Add an allocation-free `OpState::has_init_tensor_fact() -> bool` predicate that
mirrors `init_tensor_fact()`, and use it for the filter. The set of ops that
override `init_tensor_fact` to return `Some` (the transformers and GPU KV-cache
states, with the metal/cuda fused ops delegating) is exactly the set that
overrides `resolve_symbols`, so the filter selects the same states as before —
only the per-state allocation is removed. A drift-guard test keeps the two
methods in sync.

This is a micro-optimization: it removes exactly one heap allocation per
KV-cache op per decode step (2 * n_layers per token), with no change in
behaviour. On a real model it is a small fraction of per-token work and does
not move wall-clock; the value is reduced allocator pressure on the per-token
hot path, most visible when layer count is high relative to compute and under
allocator contention.

Benchmarks (added as transformers examples):

  kv_resolve_probe (compute-light, N KV caches), allocs/decode-step:
    N=16:  52 -> 36   N=32: 101 -> 69
    N=64: 198 -> 134   N=128: 391 -> 263   (exactly N fewer; ~7-20% faster)

  llm_decode_bench (Qwen3-1.7B q40, folded decode, 56 KV caches):
    allocs/token: 17444 -> 17388  (exactly -56 = 2 * 28 layers, deterministic)
    tokens/sec:   ~19.6 -> ~19.6  (unchanged, within noise)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@kali kali force-pushed the perf/resolve-symbols-no-alloc branch from 9ca9631 to 3872934 Compare June 17, 2026 18:04
@kali

kali commented Jun 17, 2026

Collaborator

Rebased!
