Parent
Part of #1243 — DeepSeek NSA kernels for MI350
Description
Optimize memory layouts and tiling strategies for NSA kernels to exploit MI350's cache hierarchy and matrix core (MFMA) shapes under GQA configurations.
GQA layout challenges
NSA with GQA has G KV head groups and H = G × HEADS_PER_GROUP query heads. The reference implementation tiles the head dimension as BLOCK_H = max(16, HEADS_PER_GROUP).
Key questions for MI350:
- MFMA shape alignment: MI350's MFMA instructions operate on specific shapes (e.g., 16×16, 32×32). Does HEADS_PER_GROUP (typically 16 for DeepSeek-V3: 128 heads / 8 groups) align with MFMA M-dimension?
- Register tiling: processing all heads in a GQA group together makes Q a [BLOCK_H, D] tile and KV a [block_size, D] tile. The QK^T result is [BLOCK_H, block_size] — choose a tiling whose output tile maps onto an MFMA accumulator shape.
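A minimal host-side sketch of the alignment question above (names like `block_h` and `mfma_aligned` are illustrative, not from the kernel source; the MFMA M-dims are the assumed 16×16 / 32×32 shapes):

```python
# Assumed MFMA tile M-dimensions on MI350 (e.g. 16x16, 32x32 instructions).
MFMA_M_SHAPES = (16, 32)

def block_h(heads_per_group: int) -> int:
    """Head-dimension tile used by the reference implementation."""
    return max(16, heads_per_group)

def mfma_aligned(bh: int) -> bool:
    """True if the head tile is a whole multiple of some MFMA M-dimension."""
    return any(bh % m == 0 for m in MFMA_M_SHAPES)

# DeepSeek-V3: 128 query heads / 8 KV groups -> 16 heads per group.
bh = block_h(128 // 8)
print(bh, mfma_aligned(bh))  # 16 True
```

For DeepSeek-V3's configuration the tile lands exactly on the 16-wide MFMA M-dimension, so the question reduces to whether other GQA configurations (e.g. HEADS_PER_GROUP of 4 or 8, padded up to BLOCK_H = 16) waste MFMA lanes on padding rows.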
Memory layout optimization
Input tensor layout
- Evaluate BHMD vs BHMGD vs custom swizzled layouts
- Ensure stride patterns enable coalesced global loads on MI350
- Consider whether K should be stored pre-transposed in memory for QK^T
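The stride question above can be made concrete with numpy (shapes here are illustrative; `k_bhmd` stands in for a K tensor in the BHMD layout):

```python
import numpy as np

B, H, M, D = 1, 8, 64, 128

# Row-major BHMD: D is the fastest-moving axis, so loading one key's full
# head_dim vector is a contiguous (coalescable) access.
k_bhmd = np.zeros((B, H, M, D), dtype=np.float16)
print(k_bhmd.strides[-1])  # 2 bytes: unit stride along D

# K stored pre-transposed for QK^T (i.e. K^T laid out as [B, H, D, M]):
# now the key/sequence axis is contiguous, matching an MFMA B-operand that
# walks the block_size axis.
k_t = np.ascontiguousarray(k_bhmd.transpose(0, 1, 3, 2))
print(k_t.strides[-1])  # 2 bytes: unit stride along M
```

The trade-off is that a pre-transposed K helps QK^T but pessimizes the PV matmul and any consumer that wants D-contiguous rows, so it likely only pays off if K is written once and read many times per step.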
L2 cache optimization
- MI350 L2 is shared across CUs — size TBD (likely 32-96MB)
- For selection attention: KV blocks selected by different queries may overlap, so there is potential for L2 cache reuse
- Consider sorting queries by their selected block indices to improve L2 hit rate
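A hedged sketch of the sorting idea, assuming host-side preprocessing over a `selected` array of per-query top-k block indices (all data and names here are illustrative):

```python
import numpy as np

# Each row holds the KV block indices one query selected (top-k = 2 here).
selected = np.array([[3, 7],
                     [0, 1],
                     [3, 8],
                     [0, 2]])

# Cheap locality key: sort queries by their lowest selected block index so
# queries sharing a block land in adjacent workgroups while that block is
# still L2-resident. A stable sort keeps the original order within ties.
order = np.argsort(selected.min(axis=1), kind="stable")
print(order.tolist())  # [1, 3, 0, 2]
```

The single-key sort is only a first-order heuristic; if selected sets overlap heavily, a proper clustering of the index sets (or a space-filling-curve ordering) could capture more reuse, at the cost of extra preprocessing per decode step.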
Register file optimization
- MI350 VGPR file: 512 VGPRs per SIMD at min occupancy
- Selection attention forward needs: Q (BLOCK_H × D), K_block (D × block_size), V_block (block_size × D), accum (BLOCK_H × D), max/sum (BLOCK_H) — compute total register pressure
- Consider splitting D dimension if register pressure is too high
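A back-of-the-envelope VGPR estimate for the forward tile list above (a sketch, assuming fp16 operands, an fp32 accumulator and softmax state, a 64-lane wavefront, and everything held in registers — real allocation by the compiler will differ):

```python
LANES = 64  # wavefront size on CDNA

def vgprs(block_h: int, d: int, block_size: int) -> int:
    """Rough VGPR count if all forward-pass tiles stay register-resident."""
    FP16, FP32 = 2, 4  # bytes per element
    total_bytes = (
        block_h * d * FP16           # Q tile
        + d * block_size * FP16      # K block
        + block_size * d * FP16      # V block
        + block_h * d * FP32         # output accumulator
        + 2 * block_h * FP32         # running max and sum (online softmax)
    )
    # One VGPR holds 4 bytes per lane across the wavefront.
    return -(-total_bytes // (4 * LANES))  # ceiling division

print(vgprs(16, 128, 64))  # compare against the 512-VGPR/SIMD budget
```

Under these assumptions the tile fits in well under half the register file, which suggests the 2-wave-per-SIMD occupancy floor is reachable without splitting D; the estimate should be rechecked once double-buffered K/V tiles are added.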
Compressed attention layout
- Compressed KV is contiguous and small (N/block_size entries per head) — ensure it stays L2-resident
- The causal block mask can be precomputed and stored in constant memory
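The block-mask precomputation can be sketched as follows (a hypothetical host-side version; the causal rule assumed here is that a query at position q may attend to a compressed block only once the whole block is in its past):

```python
import numpy as np

def causal_block_mask(seq_len: int, block_size: int) -> np.ndarray:
    """[seq_len, n_blocks] bool mask: True where query q may see block b."""
    n_blocks = seq_len // block_size
    q = np.arange(seq_len)[:, None]                          # query positions
    block_end = (np.arange(n_blocks) + 1) * block_size - 1   # last pos of each block
    return q >= block_end[None, :]

mask = causal_block_mask(8, 4)
print(mask.shape)  # (8, 2)
```

At N/block_size columns per query the mask is tiny, and since it depends only on positions (not data), it can be built once per sequence length and kept in constant memory as the issue suggests.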
Depends on