Divergence-Aware SSA Register Allocator — Spill Analysis
=========================================================

Date:    2026-03-14
Kernel:  Moa transport kernel (gpu/tp_kern.cu)
         Monte Carlo neutron transport, 654 source lines
         253 MIR blocks, 4035 virtual registers (3933V 102S)
         3854 divergent VGPRs, 79 uniform VGPRs
Target:  GFX942 (CDNA3, MI300X), Wave64

The new SSA register allocator (ra_ssa) eliminates ALL 186 VGPR spills
on the Moa transport kernel.  Scratch operations (stores + loads) drop
by 78% and scratch bytes by 81%.  Total emitted instruction count drops
from 9,448 to 6,761 (a 28.4% reduction).

On Wave64 hardware, spilling a divergent VGPR costs 64 dwords of
scratch (one dword per lane, 256 bytes per wave).  Spilling a uniform
VGPR costs 1 dword via v_readfirstlane.  The old allocator treated all
spills equally; the new one exploits the 64:1 cost ratio.
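
As a rough illustration of that ratio, the sketch below computes the
scratch footprint of a single spill store per register class.  The
enum and function names are hypothetical and are not taken from
ra_ssa.c; only the 64-lane wave width and the 4-byte dword come from
the text above.

```c
#include <assert.h>

#define WAVE_WIDTH  64u  /* lanes per wave on Wave64 */
#define DWORD_BYTES 4u   /* one 32-bit register value */

/* Hypothetical register classes, for illustration only. */
typedef enum { RC_DIVERGENT_VGPR, RC_UNIFORM_VGPR, RC_SGPR } reg_class_t;

/* Scratch bytes written by one spill store, counted per wave. */
unsigned spill_bytes(reg_class_t rc)
{
    assert(rc == RC_DIVERGENT_VGPR || rc == RC_UNIFORM_VGPR ||
           rc == RC_SGPR);

    /* A divergent VGPR may hold a different value in every lane, so
     * all WAVE_WIDTH dwords have to be saved.  A uniform VGPR or an
     * SGPR carries one value per wave, so a single dword is enough. */
    if (rc == RC_DIVERGENT_VGPR)
        return WAVE_WIDTH * DWORD_BYTES;
    return DWORD_BYTES;
}
```

Under these assumptions spill_bytes(RC_DIVERGENT_VGPR) is 256 and the
other two classes are 4, which is where the 64:1 ratio comes from.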

                        Old (ra_gc)     SSA (ra_ssa)     Change
                        -----------     -----------     -------
VGPR spills                     186               0      -100% (HOLY!)
SGPR spills                      21              29       +38%
Total spills                    207              29       -86%
Scratch ops (store+load)      1,754             392       -78%
Scratch bytes                 1,396             272       -81%
VGPRs used                      250             237        -5%
SGPRs used                      102             102         0%
Total emitted instructions    9,448           6,761       -28%
v_readfirstlane (SGPR path)       0              39        new

The 29 SGPR spills are cheap (4 bytes each via VGPR relay to scratch).
The 39 v_readfirstlane instructions are the scalar extraction path
for SGPR spill/reload; each one stands in for what would have been a
256-byte (64 lanes × 4 bytes) scratch store under the old allocator.

Algorithm
---------
1. CFG + Cooper et al. (2001) iterative dominator tree
2. Loop nesting depth (exponential weighting, Braun & Hack 2009)
3. SSA liveness with PHI-aware dataflow + exec-mask region extension
4. Divergence-aware spill cost: cost(v) = Σ depth_weight × div_weight
     - div_weight = 64 for divergent VGPRs (Wave64 scratch cost)
     - div_weight = 1  for uniform VGPRs (readfirstlane to scalar)
     - div_weight = 1  for SGPRs (already scalar)
5. Rematerialisation detection (immediate loads → cost 0)
6. SSA coloring: domtree preorder, backward scan, greedy lowest-color
     - Precoloring for intra-block interference resolution
     - Divergence-weighted spill victim selection on pressure overflow
7. Spill codegen with 4 paths:
     A. Remat (0 bytes scratch, 1 instruction)
     B. Uniform VGPR: v_readfirstlane → scratch (4 bytes)
     C. Divergent VGPR: full per-lane scratch (wave_width × 4 bytes)
     D. SGPR: v_mov to relay → scratch (4 bytes)
8. Post-RA phi elimination with free coalescing (same color = no copy)
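
Steps 4 to 6 can be sketched roughly as follows.  This is a minimal
illustration, not the code in ra_ssa.c: the vreg_t layout, the
function names, the choice to sum over defs and uses, and the 10^depth
loop weighting (capped at depth 6) are all assumptions made for the
example.  Only the div_weight values and the zero cost for
rematerialisable values come from the list above.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_WIDTH 64u

/* Hypothetical per-vreg record; field names are illustrative. */
typedef struct {
    int divergent;        /* 1 if the value differs across lanes   */
    int is_sgpr;          /* 1 if the vreg is in the scalar file   */
    int remat;            /* 1 if it can be rematerialised         */
    int n_refs;           /* number of defs + uses (assumed scope) */
    const int *ref_depth; /* loop nesting depth of each def/use    */
} vreg_t;

/* Assumed exponential loop weighting: 10^depth, capped at depth 6. */
unsigned long depth_weight(int depth)
{
    unsigned long w = 1;

    assert(depth >= 0);
    if (depth > 6)
        depth = 6;
    while (depth-- > 0)
        w *= 10;
    return w;
}

/* cost(v) = sum over refs of depth_weight * div_weight. */
unsigned long spill_cost(const vreg_t *v)
{
    unsigned long div_weight, cost = 0;
    int i;

    if (v->remat)
        return 0;  /* recomputed at the use, no scratch traffic */

    /* Only divergent VGPRs pay the full wave width. */
    div_weight = (!v->is_sgpr && v->divergent) ? WAVE_WIDTH : 1;
    for (i = 0; i < v->n_refs; i++)
        cost += depth_weight(v->ref_depth[i]) * div_weight;
    return cost;
}

/* On pressure overflow, pick the cheapest live value to spill. */
size_t pick_victim(const vreg_t *live, size_t n)
{
    size_t best = 0, i;

    assert(n > 0);
    for (i = 1; i < n; i++)
        if (spill_cost(&live[i]) < spill_cost(&live[best]))
            best = i;
    return best;
}
```

With these weights a uniform value referenced twice inside a loop
(cost 20) is still a cheaper victim than a divergent value referenced
twice outside any loop (cost 128), which is the bias the allocator
relies on.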

All static memory (~30 MB), no malloc.  ~1,300 lines of C99.
Operates on SSA form before phi elimination — free PHI coalescing.
Fallback to ra_gc/ra_lin for functions exceeding pool limits.
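
The "same color = no copy" rule for phi elimination can be pictured
with the sketch below.  It is only an illustration under assumed data
structures: phi_move_t, count_phi_copies and the color[] array are
hypothetical names, and the ordering of the remaining copies (parallel
copy sequencing) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one phi operand: on some incoming edge,
 * the value of src_vreg flows into dst_vreg. */
typedef struct {
    int dst_vreg;
    int src_vreg;
} phi_move_t;

/* Count the copies that still have to be emitted after allocation.
 * color[v] is the physical register assigned to vreg v. */
size_t count_phi_copies(const phi_move_t *moves, size_t n,
                        const int *color)
{
    size_t copies = 0, i;

    assert(color != NULL);
    assert(moves != NULL || n == 0);

    for (i = 0; i < n; i++) {
        /* Same physical register: the operand is already where the
         * phi result lives, so the move coalesces away for free. */
        if (color[moves[i].dst_vreg] != color[moves[i].src_vreg])
            copies++;
    }
    return copies;
}
```

Because allocation runs before phi elimination, the allocator is free
to give a phi and its operands the same color, and every such match
removes one move from the emitted code.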

Enabled via: barracuda --ssa-ra

Files
-----
src/amdgpu/ra_ssa.c   — allocator implementation (approx. 1,300 lines)
src/amdgpu/ra_ssa.h   — public interface
src/amdgpu/amdgpu.h   — vr_divg[] bitvector, shared helpers
src/amdgpu/isel.c     — divergence propagation to per-vreg bitvector
src/amdgpu/emit.c     — SSA dispatch, un-static shared helpers
src/main.c            — --ssa-ra flag

References
----------
Sampaio, D., Souza, R. M. de, Collange, S., & Pereira, F. M. Q. (2013).
  Divergence analysis. ACM TOPLAS 35(4), Article 13, 1-36.
  https://doi.org/10.1145/2523815

Cooper, K. D., Harvey, T. J., & Kennedy, K. (2001).
  A simple, fast dominance algorithm.
  Software Practice and Experience, 4, 1-10.

Braun, M., & Hack, S. (2009).
  Register spilling and live-range splitting for SSA-form programs.
  CC 2009, LNCS 5501, pp. 174-189.
  https://doi.org/10.1007/978-3-642-00722-4_13

Yes, I used Zotero because I always seem to miss something in APA 7th lol.

Next Steps
----------
- Run on MI300X hardware
  barracuda --ssa-ra --amdgpu-bin --gfx942 gpu/tp_kern.cu
  then: kahu a.hsaco --all
  then: run Godiva benchmark, verify k_eff matches CPU (0.995 ± 0.001)
- If kernel runs correctly, benchmark against ra_gc binary
- Expected: a measurable speedup from the 78% reduction in scratch ops
  (for comparison, Sampaio et al. report a 26.21% speedup across 395
  CUDA kernels with their divergence-aware allocator)

Test Status
-----------
- 90/91 tests pass (1 skipped, same as before so no change there)
- vector_add: 6 VGPRs, 0 spills, 0 scratch (both RDNA3 and CDNA3)
- Moa kernel: binary generation succeeds (a.hsaco, 56 KB)
- No verifier errors