Parent
Part of #1243 — DeepSeek NSA kernels for MI350
Description
Optimize memory layouts and tiling strategies for NSA kernels to exploit MI350's cache hierarchy and matrix core (MFMA) shapes under GQA configurations.
GQA layout challenges
NSA with GQA has G KV head groups and H = G × HEADS_PER_GROUP query heads. The reference implementation tiles the head dimension as BLOCK_H = max(16, HEADS_PER_GROUP).
Key questions for MI350:
- MFMA shape alignment: MI350's MFMA instructions operate on specific shapes (e.g., 16×16, 32×32). Does HEADS_PER_GROUP (typically 16 for DeepSeek-V3: 128 heads / 8 groups) align with MFMA M-dimension?
- Register tiling: processing all heads in a GQA group together makes Q a [BLOCK_H, D] tile and KV a [block_size, D] tile. The QK^T result is [BLOCK_H, block_size] — choose a tiling whose output tile maps onto an MFMA accumulator shape.
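A minimal host-side sketch of the alignment question above (names like `block_h` and `mfma_aligned` are illustrative, not from the kernel source; the MFMA M-dims are the assumed 16×16 / 32×32 shapes):

```python
# Assumed MFMA tile M-dimensions on MI350 (e.g. 16x16, 32x32 instructions).
MFMA_M_SHAPES = (16, 32)

def block_h(heads_per_group: int) -> int:
    """Head-dimension tile used by the reference implementation."""
    return max(16, heads_per_group)

def mfma_aligned(bh: int) -> bool:
    """True if the head tile is a whole multiple of some MFMA M-dimension."""
    return any(bh % m == 0 for m in MFMA_M_SHAPES)

# DeepSeek-V3: 128 query heads / 8 KV groups -> 16 heads per group.
bh = block_h(128 // 8)
print(bh, mfma_aligned(bh))  # 16 True
```

For DeepSeek-V3's configuration the tile lands exactly on the 16-wide MFMA M-dimension, so the question reduces to whether other GQA configurations (e.g. HEADS_PER_GROUP of 4 or 8, padded up to BLOCK_H = 16) waste MFMA lanes on padding rows.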
Memory layout optimization
Input tensor layout
- Evaluate BHMD vs BHMGD vs custom swizzled layouts
- Ensure stride patterns enable coalesced global loads on MI350
- Consider whether K should be stored pre-transposed in memory for QK^T
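The stride question above can be made concrete with numpy (shapes here are illustrative; `k_bhmd` stands in for a K tensor in the BHMD layout):

```python
import numpy as np

B, H, M, D = 1, 8, 64, 128

# Row-major BHMD: D is the fastest-moving axis, so loading one key's full
# head_dim vector is a contiguous (coalescable) access.
k_bhmd = np.zeros((B, H, M, D), dtype=np.float16)
print(k_bhmd.strides[-1])  # 2 bytes: unit stride along D

# K stored pre-transposed for QK^T (i.e. K^T laid out as [B, H, D, M]):
# now the key/sequence axis is contiguous, matching an MFMA B-operand that
# walks the block_size axis.
k_t = np.ascontiguousarray(k_bhmd.transpose(0, 1, 3, 2))
print(k_t.strides[-1])  # 2 bytes: unit stride along M
```

The trade-off is that a pre-transposed K helps QK^T but pessimizes the PV matmul and any consumer that wants D-contiguous rows, so it likely only pays off if K is written once and read many times per step.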
L2 cache optimization
- MI350 L2 is shared across CUs — size TBD (likely 32-96MB)
- For selection attention: KV blocks selected by different queries may overlap, so there is potential for L2 cache reuse
- Consider sorting queries by their selected block indices to improve L2 hit rate
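A hedged sketch of the sorting idea, assuming host-side preprocessing over a `selected` array of per-query top-k block indices (all data and names here are illustrative):

```python
import numpy as np

# Each row holds the KV block indices one query selected (top-k = 2 here).
selected = np.array([[3, 7],
                     [0, 1],
                     [3, 8],
                     [0, 2]])

# Cheap locality key: sort queries by their lowest selected block index so
# queries sharing a block land in adjacent workgroups while that block is
# still L2-resident. A stable sort keeps the original order within ties.
order = np.argsort(selected.min(axis=1), kind="stable")
print(order.tolist())  # [1, 3, 0, 2]
```

The single-key sort is only a first-order heuristic; if selected sets overlap heavily, a proper clustering of the index sets (or a space-filling-curve ordering) could capture more reuse, at the cost of extra preprocessing per decode step.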
Register file optimization
- MI350 VGPR file: 512 VGPRs per SIMD at min occupancy
- Selection attention forward needs: Q (BLOCK_H × D), K_block (D × block_size), V_block (block_size × D), accum (BLOCK_H × D), max/sum (BLOCK_H) — compute total register pressure
- Consider splitting D dimension if register pressure is too high
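A back-of-the-envelope VGPR estimate for the forward tile list above (a sketch, assuming fp16 operands, an fp32 accumulator and softmax state, a 64-lane wavefront, and everything held in registers — real allocation by the compiler will differ):

```python
LANES = 64  # wavefront size on CDNA

def vgprs(block_h: int, d: int, block_size: int) -> int:
    """Rough VGPR count if all forward-pass tiles stay register-resident."""
    FP16, FP32 = 2, 4  # bytes per element
    total_bytes = (
        block_h * d * FP16           # Q tile
        + d * block_size * FP16      # K block
        + block_size * d * FP16      # V block
        + block_h * d * FP32         # output accumulator
        + 2 * block_h * FP32         # running max and sum (online softmax)
    )
    # One VGPR holds 4 bytes per lane across the wavefront.
    return -(-total_bytes // (4 * LANES))  # ceiling division

print(vgprs(16, 128, 64))  # compare against the 512-VGPR/SIMD budget
```

Under these assumptions the tile fits in well under half the register file, which suggests the 2-wave-per-SIMD occupancy floor is reachable without splitting D; the estimate should be rechecked once double-buffered K/V tiles are added.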
Compressed attention layout
- Compressed KV is contiguous and small (N/block_size entries per head) — ensure it stays L2-resident
- The causal block mask can be precomputed and stored in constant memory
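The block-mask precomputation can be sketched as follows (a hypothetical host-side version; the causal rule assumed here is that a query at position q may attend to a compressed block only once the whole block is in its past):

```python
import numpy as np

def causal_block_mask(seq_len: int, block_size: int) -> np.ndarray:
    """[seq_len, n_blocks] bool mask: True where query q may see block b."""
    n_blocks = seq_len // block_size
    q = np.arange(seq_len)[:, None]                          # query positions
    block_end = (np.arange(n_blocks) + 1) * block_size - 1   # last pos of each block
    return q >= block_end[None, :]

mask = causal_block_mask(8, 4)
print(mask.shape)  # (8, 2)
```

At N/block_size columns per query the mask is tiny, and since it depends only on positions (not data), it can be built once per sequence length and kept in constant memory as the issue suggests.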
Depends on