linalg/cache sorcery: LLC/SLC-aware budget for the L3 outer blocking tier #2352
Open
czoli1976 wants to merge 3 commits into
The single-thread MMM block-budget probed L2 with detection logic inlined in frame/mmm/mod.rs, reusable nowhere and limited to macOS/Linux L2. Move it into a cache module that exposes L1d/L2/L3 through one memoised probe (macOS/iOS via sysctlbyname, Linux/Android via /sys, Windows via wmic) and have the block budget read it. The existing L2 budget is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
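A minimal sketch of what a memoised probe of this kind could look like, covering only the Linux `/sys` path. The struct name `CacheInfo`, the helper names, and the choice to read only `cpu0` are assumptions made for illustration; the macOS (`sysctlbyname`) and Windows (`wmic`) paths named above are not shown, and this is not the module's actual code.

```rust
use std::fs;
use std::sync::OnceLock;

/// Hypothetical result type; the real module's layout is not shown in this PR text.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CacheInfo {
    pub l1d: usize,
    pub l2: usize,
    pub l3: usize,
}

/// Parse a sysfs cache size such as "32K" or "8M" into bytes.
pub fn parse_sysfs_size(s: &str) -> Option<usize> {
    let s = s.trim();
    let (digits, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1usize << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1usize << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1usize << 30),
        _ => (s, 1),
    };
    digits.parse::<usize>().ok()?.checked_mul(mult)
}

/// Walk cpu0's cache indices; a missing or unreadable file leaves the field at 0.
fn probe_linux() -> CacheInfo {
    let mut info = CacheInfo::default();
    for idx in 0..8 {
        let dir = format!("/sys/devices/system/cpu/cpu0/cache/index{idx}");
        let read = |name: &str| fs::read_to_string(format!("{dir}/{name}")).ok();
        let (Some(level), Some(kind), Some(size)) = (read("level"), read("type"), read("size"))
        else {
            continue;
        };
        let Some(bytes) = parse_sysfs_size(&size) else { continue };
        match (level.trim(), kind.trim()) {
            ("1", "Data") => info.l1d = bytes,
            ("2", _) => info.l2 = bytes,
            ("3", _) => info.l3 = bytes,
            _ => {}
        }
    }
    info
}

/// Memoised accessor: the filesystem is probed at most once per process.
pub fn cache_info() -> CacheInfo {
    static INFO: OnceLock<CacheInfo> = OnceLock::new();
    *INFO.get_or_init(probe_linux)
}
```

On a host without these sysfs files every field stays at 0, which callers can treat as "unknown".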
…walk The single-thread tile walk blocked one level, sizing panel blocks to L2 only; at large k a grid that exceeds L2 still re-fetches shared A/B panels from DRAM as it sweeps. Wrap the L2 inner block in an outer super-block sized to L3 (from the crate::cache probe) so a group of inner blocks stays L3-resident across the sweep. The outer tier engages only when an L3 larger than L2 is detected; otherwise the edge is the whole grid and the walk is identical to before. Still pure tile reordering, so bit-exact with the naive loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
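To illustrate the two-tier ordering described above, here is a hedged sketch of a walk over an output tile grid. The function name, its parameters, and the use of square block edges measured in tiles are assumptions; the real walk in frame/mmm may be structured differently. The point it shows is that every tile is still visited exactly once, so only the order changes.

```rust
/// Visit every (row, col) tile of an `m_tiles` x `n_tiles` grid, grouped first
/// into outer super-blocks of edge `outer`, then into inner blocks of edge `inner`.
/// An `outer` at least as large as the grid degenerates to the single-tier walk.
pub fn two_tier_walk(
    m_tiles: usize,
    n_tiles: usize,
    inner: usize,
    outer: usize,
    mut visit: impl FnMut(usize, usize),
) {
    // Clamp edges so steps are never zero and index arithmetic cannot overflow.
    let grid = m_tiles.max(n_tiles).max(1);
    let outer = outer.clamp(1, grid);
    let inner = inner.clamp(1, outer);

    for om in (0..m_tiles).step_by(outer) {
        for on in (0..n_tiles).step_by(outer) {
            // Bounds of the current super-block (sized to the L3 budget).
            let om_end = (om + outer).min(m_tiles);
            let on_end = (on + outer).min(n_tiles);
            for im in (om..om_end).step_by(inner) {
                for i_n in (on..on_end).step_by(inner) {
                    // Inner block (sized to the L2 budget), cut at the super-block edge.
                    for m in im..(im + inner).min(om_end) {
                        for n in i_n..(i_n + inner).min(on_end) {
                            visit(m, n);
                        }
                    }
                }
            }
        }
    }
}
```

Calling `two_tier_walk(7, 5, 2, 4, |r, c| ...)` touches each of the 35 tiles once; passing an `outer` that covers the grid yields the same order as a single-level blocked walk.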
sonos#2350's outer tier fires only when the per-CPU cache topology exposes an architectural L3 (cpu/cache level 3). Many Arm SoCs (and Apple) have no cluster L3 but a System-Level Cache (SLC) shared with the GPU/NPU/display (Qualcomm LLCC, Apple SLC), which is never listed under cpu/cache/index* — so those parts see l3 == 0 and miss a real last-level cache the tier could use. Add cache::last_level_cache() -> Option<(usize, LlcKind)> resolving, first hit wins: (1) TRACT_LLC_BYTES env override (+ TRACT_LLC_CONTENDED); (2) architectural L3 → Dedicated; (3) best-effort Linux devicetree SLC (cache-level == 3 + cache-size, outside /cpus) → SystemLevel. l3_block_budget_bytes() now budgets a Dedicated L3 at ~1/2 and a contended SystemLevel cache at ~1/4. Unknown ⇒ None ⇒ no outer tier (unchanged, regression-safe). Purely additive — default behaviour on parts with a normal L3 is identical. Pure resolver unit-tested for priority/regression-safety; DT probe is panic-tested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
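The "first hit wins" order and the two budget fractions can be sketched as a pure function over already-probed inputs. The names `resolve_llc`, `LlcKind` and `l3_block_budget_bytes` come from this PR, but the signatures below are assumptions, and the exact 1/2 and 1/4 divisions stand in for the "~1/2" and "~1/4" stated above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LlcKind {
    /// Architectural L3 owned by the CPU cluster.
    Dedicated,
    /// System-level cache shared with GPU/NPU/display.
    SystemLevel,
}

/// Pure resolver (hypothetical signature): env override, then L3, then devicetree SLC.
pub fn resolve_llc(
    env_bytes: Option<usize>,
    env_contended: bool,
    l2: usize,
    l3: usize,
    dt_slc_bytes: Option<usize>,
) -> Option<(usize, LlcKind)> {
    // (1) Explicit override always wins; the contended flag selects the kind.
    if let Some(bytes) = env_bytes.filter(|&b| b > 0) {
        let kind = if env_contended { LlcKind::SystemLevel } else { LlcKind::Dedicated };
        return Some((bytes, kind));
    }
    // (2) Architectural L3, accepted only if it is larger than L2.
    if l3 > l2 {
        return Some((l3, LlcKind::Dedicated));
    }
    // (3) Best-effort devicetree SLC; None here means "unknown", so no outer tier.
    dt_slc_bytes.filter(|&b| b > 0).map(|b| (b, LlcKind::SystemLevel))
}

/// Budget half of a dedicated L3, a quarter of a contended system-level cache.
pub fn l3_block_budget_bytes(llc: Option<(usize, LlcKind)>) -> Option<usize> {
    llc.map(|(bytes, kind)| match kind {
        LlcKind::Dedicated => bytes / 2,
        LlcKind::SystemLevel => bytes / 4,
    })
}
```

Keeping the resolver free of I/O is what makes the priority and regression-safety checks unit-testable: a part with `l3 == 0` and no devicetree hit resolves to `None`, leaving the walk as it was.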
@kali can you please review this cache PR trio?
Stacked on #2349 (cache module) + #2350 (L3 outer tier). Review only the top commit — GitHub shows the whole stack until the lower PRs merge (a fork branch can't be a PR base in upstream).
Problem
#2350's outer tier fires only when the per-CPU cache topology exposes an architectural L3 (cpu/cache level 3). Many Arm SoCs (and Apple) have no cluster L3 but a System-Level Cache (SLC) shared with the GPU/NPU/display (Qualcomm LLCC, Apple SLC), which is never listed under cpu/cache/index*, so those parts see l3 == 0 and miss a real last-level cache the tier could use.

"LLC" = role (last cache before DRAM). "SLC" = a specific shared interconnect cache that is one kind of LLC. This generalises "detect L3" → "establish the LLC budget," and treats a contended SLC more conservatively than a dedicated L3.
Change
New cache::last_level_cache() -> Option<(usize, LlcKind)> resolving, first hit wins:
1. TRACT_LLC_BYTES env override ("8M", "33554432"); kind = SystemLevel iff TRACT_LLC_CONTENDED is set, else Dedicated.
2. Architectural L3 (cache_info().l3 > l2) → Dedicated.
3. Best-effort Linux devicetree SLC (cache-level == 3 + cache-size, outside /cpus) → SystemLevel.

l3_block_budget_bytes() now budgets a Dedicated L3 at ~½ and a contended SystemLevel cache at ~¼ (can't assume residency of lines the GPU/NPU keep evicting). Unknown ⇒ None ⇒ no outer tier (unchanged, regression-safe).
Notes
- cache-size in DT) fall to the TRACT_LLC_BYTES override; documented.
- resolve_llc() is unit-tested for priority/regression-safety; the DT probe is panic-tested.
Prior art
- queryCacheSizes (CPUID/sysctl) + manage_caching_sizes; glibc sysconf(_SC_LEVELx_CACHE_SIZE); ACPI PPTT; hwloc.
- drivers/soc/qcom/llcc-qcom.c, devicetree qcom,llcc; LWN "SDM845 System Cache Driver"); generic devicetree cache bindings (cache-level/cache-size).
Caveats (honest)
- Dedicated L3 branch, Apple Silicon exposes no DT SLC, so locally only the regression-safe path is exercised).
- TRACT_LLC_BYTES override; the devicetree autodetect is best-effort and only kicks in when there is no architectural L3 and a DT node actually carries a numeric cache-size.

This falls in the Extreme Category.