linalg/cache sorcery: LLC/SLC-aware budget for the L3 outer blocking tier#2352

Open
czoli1976 wants to merge 3 commits into
sonos:mainfrom
czoli1976:feature/mmm-st-llc-slc

@czoli1976 czoli1976 commented Jun 7, 2026

Contributor

Stacked on #2349 (cache module) + #2350 (L3 outer tier). Review only the top commit — GitHub shows the whole stack until the lower PRs merge (a fork branch can't be a PR base in upstream).

Problem

#2350's outer tier fires only when the per-CPU cache topology exposes an architectural L3 (cpu/cache level 3). Many Arm SoCs (and Apple) have no cluster L3 but a System-Level Cache (SLC) shared with the GPU/NPU/display (Qualcomm LLCC, Apple SLC), which is never listed under cpu/cache/index* — so those parts see l3 == 0 and miss a real last-level cache the tier could use.

"LLC" names a role (the last cache before DRAM); "SLC" is a specific shared interconnect cache that is one kind of LLC. This PR generalises "detect L3" to "establish the LLC budget" and treats a contended SLC more conservatively than a dedicated L3.

Change

New cache::last_level_cache() -> Option<(usize, LlcKind)> resolving, first hit wins:

  1. TRACT_LLC_BYTES env override ("8M", "33554432"); kind = SystemLevel iff TRACT_LLC_CONTENDED is set, else Dedicated.
  2. architectural L3 (cache_info().l3 > l2) → Dedicated.
  3. best-effort Linux devicetree SLC probe (cache-level == 3 + cache-size, outside /cpus) → SystemLevel.
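The first-hit-wins order above can be sketched as a pure function. This is a minimal illustration, not the code in this PR: the parameter list of `resolve_llc` is an assumption (the description names the function but not its signature), and the env parsing and devicetree probe are represented by already-resolved `Option` inputs.

```rust
/// How the last-level cache is shared (variant names taken from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlcKind {
    Dedicated,
    SystemLevel,
}

/// Hypothetical signature: every input is passed in, so the priority order
/// can be tested without touching the environment or the filesystem.
pub fn resolve_llc(
    override_bytes: Option<usize>, // parsed TRACT_LLC_BYTES, if any
    contended: bool,               // TRACT_LLC_CONTENDED is set
    l2: usize,                     // probed L2 size, 0 if unknown
    l3: usize,                     // probed architectural L3 size, 0 if absent
    dt_slc: Option<usize>,         // devicetree SLC size, if the probe found one
) -> Option<(usize, LlcKind)> {
    // 1. An explicit override always wins.
    if let Some(bytes) = override_bytes.filter(|&b| b > 0) {
        let kind = if contended { LlcKind::SystemLevel } else { LlcKind::Dedicated };
        return Some((bytes, kind));
    }
    // 2. An architectural L3 larger than L2 is treated as dedicated.
    if l3 > l2 {
        return Some((l3, LlcKind::Dedicated));
    }
    // 3. Otherwise fall back to a devicetree SLC, assumed contended.
    dt_slc.filter(|&b| b > 0).map(|b| (b, LlcKind::SystemLevel))
}
```

Keeping the resolver free of I/O is what makes the priority and regression-safety checks cheap: with no override, no L3 and no devicetree hit, it returns `None` and the outer tier stays off.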

l3_block_budget_bytes() now budgets a Dedicated L3 at ~½ and a contended SystemLevel cache at ~¼ (can't assume residency of lines the GPU/NPU keep evicting). Unknown ⇒ None ⇒ no outer tier (unchanged, regression-safe).
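As a rough sketch of the budgeting rule, assuming the fractions are exactly ½ and ¼ (the description says "~", so the real rounding may differ). `budget_for` is a hypothetical helper that takes the resolved cache as an argument; it is not the actual `l3_block_budget_bytes()`.

```rust
/// How the last-level cache is shared (variant names taken from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LlcKind {
    Dedicated,
    SystemLevel,
}

/// Hypothetical helper: map a resolved LLC to the byte budget for the outer tier.
pub fn budget_for(llc: Option<(usize, LlcKind)>) -> Option<usize> {
    llc.map(|(bytes, kind)| match kind {
        // A dedicated L3 is assumed to keep roughly half of its lines resident.
        LlcKind::Dedicated => bytes / 2,
        // A contended SLC loses lines to the GPU/NPU, so budget a quarter.
        LlcKind::SystemLevel => bytes / 4,
    })
}
```

For example, a 32 MiB dedicated L3 yields a 16 MiB budget, an 8 MiB SLC yields 2 MiB, and an unknown LLC yields `None`, which leaves the outer tier disabled.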

Notes

  • Purely additive; default behavior on parts with a normal L3 is identical (L3 path, ½ budget).
  • SLCs whose size is fixed in the controller (e.g. Qualcomm LLCC carries no cache-size in DT) fall to the TRACT_LLC_BYTES override — documented.
  • The pure resolver resolve_llc() is unit-tested for priority and regression-safety; the DT probe is tested only for not panicking.
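The override accepts both a suffixed form ("8M") and a plain byte count ("33554432"). One way such a value could be parsed is sketched below. `parse_cache_size` is a hypothetical name, and treating K/M/G as binary multiples (2^10, 2^20, 2^30) is an assumption, not something the description states.

```rust
/// Hypothetical parser for values such as "8M" or "33554432".
/// Assumes K/M/G are binary multiples; returns None for empty,
/// zero, malformed or overflowing input.
pub fn parse_cache_size(s: &str) -> Option<usize> {
    let s = s.trim();
    // The matched suffixes are ASCII, so slicing off one byte is safe.
    let (digits, shift) = match s.chars().last()? {
        'k' | 'K' => (&s[..s.len() - 1], 10),
        'm' | 'M' => (&s[..s.len() - 1], 20),
        'g' | 'G' => (&s[..s.len() - 1], 30),
        _ => (s, 0),
    };
    let n: usize = digits.trim().parse().ok()?;
    n.checked_mul(1usize << shift).filter(|&b| b > 0)
}
```

A caller could feed it from the environment with `std::env::var("TRACT_LLC_BYTES").ok().and_then(|v| parse_cache_size(&v))`, so an unset or unparsable variable simply falls through to the next detection step.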

Prior art

  • Runtime cache sizing: Eigen queryCacheSizes (CPUID/sysctl) + manage_caching_sizes; glibc sysconf(_SC_LEVELx_CACHE_SIZE); ACPI PPTT; hwloc.
  • SLC exposure: Qualcomm LLCC (drivers/soc/qcom/llcc-qcom.c, devicetree qcom,llcc; LWN "SDM845 System Cache Driver"); generic devicetree cache bindings (cache-level/cache-size).
  • Gap this addresses: mainstream GEMM libs (OpenBLAS/BLIS/Eigen/MKL) detect L1/L2/L3 but don't chase the SLC — they equate "L3" with "LLC" — so SLC-aware blocking on SoCs whose LLC is an SLC is under-addressed.

Caveats (honest)

  • The ¼ SLC fraction and the depth-4 DT walk are heuristics that still need validation on a real SLC SoC (Snapdragon/Apple). None was available where this was authored: on the x86 and Apple-Silicon hosts used, the SLC path is inert (x86 takes the Dedicated L3 branch, and Apple Silicon exposes no DT SLC), so only the regression-safe path was exercised locally.
  • The robust core is the TRACT_LLC_BYTES override; the devicetree autodetect is best-effort and only kicks in when there is no architectural L3 and a DT node actually carries a numeric cache-size.
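To make the best-effort probe concrete, here is a minimal sketch of a depth-limited devicetree walk. It assumes the tree is exposed as a directory hierarchy (on Linux typically under /sys/firmware/devicetree/base) and that `cache-level` and `cache-size` are single big-endian 32-bit cells, as in the generic devicetree cache bindings. The function names, the "largest match wins" rule and the handling of `cpus` are illustrative choices, not the PR's implementation.

```rust
use std::fs;
use std::path::Path;

/// Read a single-cell devicetree property: exactly one big-endian u32.
fn read_be_u32(path: &Path) -> Option<u32> {
    let b = fs::read(path).ok()?;
    if b.len() != 4 {
        return None;
    }
    Some(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Hypothetical probe: search `dir` and its children, up to `depth` levels
/// down, for a node with cache-level == 3 and a numeric cache-size.
/// Directories named "cpus" are skipped, and the largest size found wins.
pub fn probe_dt_slc(dir: &Path, depth: usize) -> Option<usize> {
    let mut best = None;
    if read_be_u32(&dir.join("cache-level")) == Some(3) {
        best = read_be_u32(&dir.join("cache-size"))
            .map(|b| b as usize)
            .filter(|&b| b > 0);
    }
    if depth == 0 {
        return best;
    }
    // Any I/O error simply ends the search: the probe must never panic.
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            let path = entry.path();
            if !path.is_dir() || entry.file_name() == "cpus" {
                continue;
            }
            // None orders below Some(_), so max keeps the largest hit.
            best = best.max(probe_dt_slc(&path, depth - 1));
        }
    }
    best
}
```

Under these assumptions a call such as `probe_dt_slc(Path::new("/sys/firmware/devicetree/base"), 4)` returns `None` on hosts without a devicetree and on nodes that carry no `cache-size`, which is the case the override is meant to cover.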

This falls in the Extreme Category

czoli1976 and others added 3 commits June 7, 2026 16:12
The single-thread MMM block-budget probed L2 with detection logic inlined in
frame/mmm/mod.rs, reusable nowhere and limited to macOS/Linux L2. Move it into a
cache module that exposes L1d/L2/L3 through one memoised probe (macOS/iOS via
sysctlbyname, Linux/Android via /sys, Windows via wmic) and have the block
budget read it. The existing L2 budget is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…walk

The single-thread tile walk blocked one level, sizing panel blocks to L2 only;
at large k a grid that exceeds L2 still re-fetches shared A/B panels from DRAM as
it sweeps. Wrap the L2 inner block in an outer super-block sized to L3 (from the
crate::cache probe) so a group of inner blocks stays L3-resident across the
sweep. The outer tier engages only when an L3 larger than L2 is detected;
otherwise the edge is the whole grid and the walk is identical to before. Still
pure tile reordering, so bit-exact with the naive loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
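The two-level reordering described in this commit can be illustrated with a small sketch that only produces the tile visit order. It is not the walk in frame/mmm: `two_tier_walk` and its edge parameters (counted in tiles) are hypothetical, and the real code derives the edges from the L2 and L3 budgets.

```rust
/// Hypothetical sketch: visit an m x n tile grid in outer super-blocks of
/// edge `outer`, each swept in inner blocks of edge `inner`, each row-major.
pub fn two_tier_walk(m: usize, n: usize, inner: usize, outer: usize) -> Vec<(usize, usize)> {
    let inner = inner.max(1); // step_by(0) would panic
    let outer = outer.max(inner); // the outer tier never splits an inner block
    let mut order = Vec::with_capacity(m * n);
    for oi in (0..m).step_by(outer) {
        for oj in (0..n).step_by(outer) {
            let oi_end = oi.saturating_add(outer).min(m);
            let oj_end = oj.saturating_add(outer).min(n);
            // Inner blocks stay inside the current super-block.
            for bi in (oi..oi_end).step_by(inner) {
                for bj in (oj..oj_end).step_by(inner) {
                    for i in bi..bi.saturating_add(inner).min(oi_end) {
                        for j in bj..bj.saturating_add(inner).min(oj_end) {
                            order.push((i, j));
                        }
                    }
                }
            }
        }
    }
    order
}
```

Each tile is visited exactly once, so the result is a permutation of the naive row-major order. When both edges cover the whole grid (for instance `usize::MAX`), the order degenerates to the naive loop, which corresponds to the no-L3 case described above.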
sonos#2350's outer tier fires only when the per-CPU cache topology exposes an
architectural L3 (cpu/cache level 3). Many Arm SoCs (and Apple) have no cluster
L3 but a System-Level Cache (SLC) shared with the GPU/NPU/display (Qualcomm
LLCC, Apple SLC), which is never listed under cpu/cache/index* — so those parts
see l3 == 0 and miss a real last-level cache the tier could use.

Add cache::last_level_cache() -> Option<(usize, LlcKind)> resolving, first hit
wins: (1) TRACT_LLC_BYTES env override (+ TRACT_LLC_CONTENDED); (2) architectural
L3 → Dedicated; (3) best-effort Linux devicetree SLC (cache-level == 3 +
cache-size, outside /cpus) → SystemLevel. l3_block_budget_bytes() now budgets a
Dedicated L3 at ~1/2 and a contended SystemLevel cache at ~1/4. Unknown ⇒ None ⇒
no outer tier (unchanged, regression-safe).

Purely additive — default behaviour on parts with a normal L3 is identical. Pure
resolver unit-tested for priority/regression-safety; DT probe is panic-tested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@czoli1976 czoli1976 changed the title linalg/cache: LLC/SLC-aware budget for the L3 outer blocking tier linalg/cache sorcery: LLC/SLC-aware budget for the L3 outer blocking tier Jun 7, 2026
@czoli1976 czoli1976 marked this pull request as ready for review June 9, 2026 08:37