
[wave] NSA: GQA-aware memory layout & tiling for MI350 cache hierarchy #1258

@harsh-nod


Parent

Part of #1243 — DeepSeek NSA kernels for MI350

Description

Optimize memory layouts and tiling strategies for NSA kernels to exploit MI350's cache hierarchy and matrix core (MFMA) shapes under GQA configurations.

GQA layout challenges

NSA with GQA has G KV head groups and H = G × HEADS_PER_GROUP query heads. The reference implementation tiles the head dimension as BLOCK_H = max(16, HEADS_PER_GROUP).
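To make the head bookkeeping concrete, here is a minimal sketch of the two relations stated above. The function names and the returned fields are hypothetical, and the contiguous assignment of query heads to KV groups is an assumption, not something the reference implementation is confirmed to do.

```python
def gqa_head_tiling(num_kv_groups: int, heads_per_group: int) -> dict:
    """Head-dimension tiling for one GQA configuration (illustrative only)."""
    num_q_heads = num_kv_groups * heads_per_group   # H = G x HEADS_PER_GROUP
    block_h = max(16, heads_per_group)              # BLOCK_H as stated above
    padded_rows = block_h - heads_per_group         # tile rows that carry no real head
    return {
        "H": num_q_heads,
        "BLOCK_H": block_h,
        "padded_rows": padded_rows,
        "useful_fraction": heads_per_group / block_h,
    }


def kv_group_of(query_head: int, heads_per_group: int) -> int:
    """KV group read by a query head, ASSUMING heads are grouped contiguously."""
    return query_head // heads_per_group
```

With 8 groups of 16 heads the tile has no padded rows; with 4 heads per group, 12 of the 16 rows in each tile would be padding, which is the case worth checking against the MFMA shapes.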

Key questions for MI350:

  1. MFMA shape alignment: MI350's MFMA instructions operate on fixed tile shapes (e.g., 16×16 and 32×32 output tiles). Does HEADS_PER_GROUP (typically 16 for DeepSeek-V3: 128 heads / 8 groups) align with the MFMA M dimension?
  2. Register tiling: Processing all heads in a GQA group together means Q is [BLOCK_H, D] and each KV block is [block_size, D]. The QK^T result is therefore [BLOCK_H, block_size]; the tiling of both dimensions should be chosen to match an MFMA shape.
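One way to start answering the two questions above is to count how many MFMA output tiles cover the [BLOCK_H, block_size] score matrix and how much of that tile area is real work. This is a rough sketch: the tile shapes are just the examples from the text, `mfma_tile_fit` is a hypothetical helper, and the K (reduction) dimension of the instruction is ignored.

```python
import math

# Output-tile shapes (M, N) taken from the examples in the text; the exact
# set of instructions available on MI350 is not asserted here.
CANDIDATE_TILES = [(16, 16), (32, 32)]


def mfma_tile_fit(block_h: int, block_size: int, tile_m: int, tile_n: int) -> dict:
    """Count the tiles needed to cover a [block_h, block_size] QK^T result."""
    tiles_m = math.ceil(block_h / tile_m)
    tiles_n = math.ceil(block_size / tile_n)
    covered = (tiles_m * tile_m) * (tiles_n * tile_n)   # area including padding
    return {
        "tiles": tiles_m * tiles_n,
        "utilization": (block_h * block_size) / covered,
    }


if __name__ == "__main__":
    # block_size = 64 is an assumed value for illustration.
    for tile_m, tile_n in CANDIDATE_TILES:
        print((tile_m, tile_n), mfma_tile_fit(16, 64, tile_m, tile_n))
```

Under these assumptions, BLOCK_H = 16 fills a 16×16 tile exactly, while a 32×32 tile would leave half of its M dimension as padding.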

Memory layout optimization

  1. Input tensor layout

    • Evaluate BHMD vs BHMGD vs custom swizzled layouts
    • Ensure stride patterns enable coalesced global loads on MI350
    • Consider whether K should be stored pre-transposed in memory for QK^T
  2. L2 cache optimization

    • MI350 L2 is shared across CUs — size TBD (likely 32-96MB)
    • For selection attention: KV blocks selected by different queries may overlap, which creates an opportunity for L2 cache reuse
    • Consider sorting queries by their selected block indices to improve L2 hit rate
  3. Register file optimization

    • MI350 VGPR budget: up to 512 VGPRs per wavefront at minimum occupancy (one wave per SIMD)
    • Selection attention forward needs: Q (BLOCK_H × D), K_block (D × block_size), V_block (block_size × D), accum (BLOCK_H × D), max/sum (BLOCK_H). The total register pressure of these tiles still needs to be computed
    • Consider splitting D dimension if register pressure is too high
  4. Compressed attention layout

    • Compressed KV is contiguous and small (N/block_size entries), so it should be kept L2-resident
    • The causal block mask can be precomputed and stored in constant memory
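For the register pressure item in the list above, a back-of-the-envelope estimate could look like the sketch below. It rests on several assumptions that are not from the issue: 64-lane wavefronts, 32-bit VGPRs, 2-byte (bf16/fp16) inputs, fp32 accumulators, one wavefront holding every tile, and values spread evenly across lanes. It also ignores the QK^T scores, address registers, and any MFMA operand layout constraints, so the real number will be higher.

```python
import math

WAVE_LANES = 64   # assumption: 64-lane wavefronts
VGPR_BYTES = 4    # assumption: one VGPR holds 32 bits per lane


def vgprs_for(elements: int, bytes_per_element: int) -> int:
    """VGPRs per lane needed to spread `elements` values evenly over one wavefront."""
    return math.ceil(elements * bytes_per_element / (WAVE_LANES * VGPR_BYTES))


def selection_fwd_vgprs(block_h, d, block_size, in_bytes=2, acc_bytes=4):
    """Per-tensor VGPR estimate for the operands listed in the issue."""
    usage = {
        "Q": vgprs_for(block_h * d, in_bytes),
        "K_block": vgprs_for(d * block_size, in_bytes),
        "V_block": vgprs_for(block_size * d, in_bytes),
        "accum": vgprs_for(block_h * d, acc_bytes),
        "max_sum": vgprs_for(2 * block_h, acc_bytes),   # running max and running sum
    }
    usage["total"] = sum(usage.values())
    return usage


if __name__ == "__main__":
    # D = 128 and block_size = 64 are assumed values for illustration.
    print(selection_fwd_vgprs(block_h=16, d=128, block_size=64))
    print(selection_fwd_vgprs(block_h=16, d=64, block_size=64))   # D split in two
```

With these assumed sizes the listed tiles come to 177 VGPRs, and splitting D in half brings that to 89, which gives a first idea of how much headroom the D split buys.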
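The query-sorting idea from the L2 item can be prototyped on the host before touching any kernel. The sketch below orders queries by their selected block indices and compares hit counts in a toy LRU cache. Both helpers are hypothetical, and an LRU cache that holds a handful of blocks is only a stand-in for the real L2 replacement policy and capacity.

```python
from collections import OrderedDict


def sort_queries_by_blocks(selected):
    """Return query ids ordered so that queries with similar block sets are adjacent."""
    return sorted(range(len(selected)), key=lambda q: tuple(sorted(selected[q])))


def lru_hits(order, selected, capacity):
    """Count block hits in a toy LRU cache when queries run in `order`."""
    cache, hits = OrderedDict(), 0
    for q in order:
        for blk in selected[q]:
            if blk in cache:
                hits += 1
                cache.move_to_end(blk)          # mark as most recently used
            else:
                cache[blk] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict least recently used
    return hits


if __name__ == "__main__":
    # Made-up selection pattern: queries 0/2 and 1/3 share their blocks.
    selected = [[0, 5], [2, 7], [0, 5], [2, 7]]
    order = sort_queries_by_blocks(selected)
    print(lru_hits(range(len(selected)), selected, capacity=2))
    print(lru_hits(order, selected, capacity=2))
```

On this made-up pattern the sorted order turns zero hits into four. Whether the gain survives on real selection patterns, and whether the reordering cost is acceptable, would need to be measured on hardware.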

Depends on

Metadata


    Labels

    enhancement (New feature or request), nsa (DeepSeek Native Sparse Attention)
