Divergence-Aware SSA Register Allocator — Spill Analysis
=========================================================
Date:    2026-03-14
Kernel:  Moa transport kernel (gpu/tp_kern.cu)
         Monte Carlo neutron transport, 654 source lines
         253 MIR blocks, 4,035 virtual registers (3,933 VGPR, 102 SGPR)
         3,854 divergent VGPRs, 79 uniform VGPRs
Target:  GFX942 (CDNA3, MI300X), Wave64

The new SSA register allocator (ra_ssa) eliminates all 186 VGPR spills
on the Moa transport kernel. Scratch operations (stores plus loads) drop
by 78%, and the emitted instruction count drops from 9,448 to 6,761
(-28.4%).

On Wave64 hardware, spilling a divergent VGPR costs 64 dwords of scratch
(one dword per lane, 256 bytes per wave). Spilling a uniform VGPR costs
1 dword via v_readfirstlane. The old allocator (ra_gc) treated all spills
equally; the new one exploits the 64:1 cost ratio.
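
A minimal C sketch of the 64:1 ratio described above. The names
(WAVE_WIDTH, spill_scratch_bytes, spill_cost_ratio) are hypothetical and
not taken from ra_ssa.c; the sketch only encodes the per-spill slot sizes
stated in this note.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for illustration; not taken from ra_ssa.c. */
enum { WAVE_WIDTH = 64, DWORD_BYTES = 4 };

/* Scratch bytes that one spilled 32-bit value occupies for the whole wave. */
static uint32_t spill_scratch_bytes(int is_divergent)
{
    assert(is_divergent == 0 || is_divergent == 1);
    /* Divergent: each lane may hold a different value, so all 64 dwords
     * must be stored. Uniform: every lane agrees, so a single dword read
     * with v_readfirstlane is enough. */
    return is_divergent ? (uint32_t)(WAVE_WIDTH * DWORD_BYTES)
                        : (uint32_t)DWORD_BYTES;
}

/* Cost ratio between a divergent and a uniform spill. */
static uint32_t spill_cost_ratio(void)
{
    return spill_scratch_bytes(1) / spill_scratch_bytes(0);
}
```

Under these assumptions spill_scratch_bytes(1) is 256, spill_scratch_bytes(0)
is 4, and spill_cost_ratio() is 64.
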

                             Old (ra_gc)   SSA (ra_ssa)   Change
                             -----------   ------------   ------
VGPR spills                          186              0   -100% (HOLY!)
SGPR spills                           21             29   +38%
Total spills                         207             29   -86%
Scratch ops (store+load)           1,754            392   -78%
Scratch bytes                      1,396            272   -81%
VGPRs used                           250            237   -5%
SGPRs used                           102            102   0%
Total emitted instructions         9,448          6,761   -28%
v_readfirstlane (SGPR path)            0             39   new

The 29 SGPR spills are cheap (4 bytes each, via a VGPR relay to scratch).
The 39 v_readfirstlane instructions are the scalar extraction path
for SGPR spill/reload; each replaces what would have been a 256-byte
full-wave scratch store (64 lanes × 4 bytes) under the old allocator.
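
The Change column in the table can be re-derived with a few lines of C.
This is only an arithmetic check, assuming the column is the relative
change (new - old) / old rounded to the nearest whole percent; pct_change
is a hypothetical helper, not part of the allocator.

```c
#include <assert.h>

/* Relative change from old_v to new_v, rounded to the nearest percent.
 * Hypothetical helper used only to re-check the table's Change column. */
static int pct_change(long old_v, long new_v)
{
    double pct;
    assert(old_v > 0);
    pct = 100.0 * (double)(new_v - old_v) / (double)old_v;
    /* Round half away from zero, then truncate. */
    return (int)(pct < 0.0 ? pct - 0.5 : pct + 0.5);
}
```

For example, pct_change(1754, 392) gives -78 (scratch ops) and
pct_change(21, 29) gives 38 (SGPR spills).
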
Algorithm
---------
1. CFG + Cooper et al. (2001) iterative dominator tree
2. Loop nesting depth (exponential weighting, Braun & Hack 2009)
3. SSA liveness with PHI-aware dataflow + exec-mask region extension
4. Divergence-aware spill cost: cost(v) = Σ depth_weight × div_weight
   - div_weight = 64 for divergent VGPRs (Wave64 scratch cost)
   - div_weight = 1 for uniform VGPRs (readfirstlane to scalar)
   - div_weight = 1 for SGPRs (already scalar)
5. Rematerialisation detection (immediate loads → cost 0)
6. SSA coloring: domtree preorder, backward scan, greedy lowest-color
   - Precoloring for intra-block interference resolution
   - Divergence-weighted spill victim selection on pressure overflow
7. Spill codegen with 4 paths:
   A. Remat (0 bytes scratch, 1 instruction)
   B. Uniform VGPR: v_readfirstlane → scratch (4 bytes)
   C. Divergent VGPR: full per-lane scratch (wave_width × 4 bytes)
   D. SGPR: v_mov to relay → scratch (4 bytes)
8. Post-RA phi elimination with free coalescing (same color = no copy)
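
Steps 4 to 6 can be sketched in C as follows. This is an illustration
under stated assumptions, not the code in ra_ssa.c: the vreg_info struct
and all function names are hypothetical, the sum is assumed to run over
each def and use of the value, and the exponential loop weight is assumed
to be base 10 (the note only says "exponential").

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vreg record; field names are illustrative only. */
typedef struct {
    int is_sgpr;          /* 1 if the value lives in a scalar register   */
    int is_divergent;     /* 1 if lanes may hold different values        */
    int is_remat;         /* 1 if an immediate load can recreate it      */
    int n_refs;           /* number of defs + uses                       */
    const int *ref_depth; /* loop nesting depth at each def/use          */
} vreg_info;

/* ASSUMPTION: base-10 exponential loop weighting. */
static uint64_t depth_weight(int depth)
{
    uint64_t w = 1;
    assert(depth >= 0 && depth <= 12); /* keeps 10^depth well inside 64 bits */
    while (depth-- > 0)
        w *= 10;
    return w;
}

/* 64 for divergent VGPRs, 1 for uniform VGPRs and for SGPRs. */
static uint64_t div_weight(const vreg_info *v)
{
    return (!v->is_sgpr && v->is_divergent) ? 64 : 1;
}

/* cost(v) = sum over refs of depth_weight * div_weight; remat costs 0. */
static uint64_t spill_cost(const vreg_info *v)
{
    uint64_t cost = 0;
    int i;
    if (v->is_remat)
        return 0;
    for (i = 0; i < v->n_refs; i++)
        cost += depth_weight(v->ref_depth[i]) * div_weight(v);
    return cost;
}

/* Pick the cheapest live value to spill; returns its index, or -1 if n == 0. */
static int pick_spill_victim(const vreg_info *live, int n)
{
    int best = -1, i;
    uint64_t best_cost = 0;
    for (i = 0; i < n; i++) {
        uint64_t c = spill_cost(&live[i]);
        if (best < 0 || c < best_cost) {
            best = i;
            best_cost = c;
        }
    }
    return best;
}
```

With references at loop depths 0 and 1, a divergent VGPR costs
(1 + 10) × 64 = 704 and a uniform VGPR costs 11, so the uniform value is
chosen as the victim when both are live.
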
All memory is static (~30 MB), no malloc. ~1,300 lines of C99.
Operates on SSA form before phi elimination, so PHI coalescing is free.
Falls back to ra_gc/ra_lin for functions exceeding pool limits.

Enabled via: barracuda --ssa-ra
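
The static-memory design with a fallback can be sketched as a fixed pool
with a bump pointer. Everything here is an assumption made for
illustration: the pool size, pool_alloc, allocate_function and the
ra_result enum are hypothetical and much smaller than the ~30 MB of
pools mentioned above. The point is only that exhausting the pool is
reported to the caller, which can then take the ra_gc/ra_lin path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: a 64 KiB pool for the sketch; real pool sizes are not shown. */
enum { POOL_WORDS = 8192 };          /* 8192 * 8 bytes = 64 KiB */

static uint64_t pool_words[POOL_WORDS];
static size_t   pool_used;           /* words handed out so far */

static void pool_reset(void) { pool_used = 0; }

/* Bump-allocate n bytes (8-byte aligned); NULL when the pool is exhausted. */
static void *pool_alloc(size_t n)
{
    size_t words;
    void *p;
    if (n > (size_t)POOL_WORDS * 8)
        return NULL;                 /* also guards the rounding below */
    words = (n + 7) / 8;
    if (words > POOL_WORDS - pool_used)
        return NULL;
    p = &pool_words[pool_used];
    pool_used += words;
    return p;
}

typedef enum { RA_USED_SSA, RA_USED_FALLBACK } ra_result;

/* Try to set up per-vreg color storage; signal fallback if it does not fit. */
static ra_result allocate_function(size_t n_vregs)
{
    uint32_t *colors;
    size_t i;
    pool_reset();
    if (n_vregs > SIZE_MAX / sizeof *colors)
        return RA_USED_FALLBACK;
    colors = pool_alloc(n_vregs * sizeof *colors);
    if (colors == NULL)
        return RA_USED_FALLBACK;     /* caller would run ra_gc/ra_lin */
    for (i = 0; i < n_vregs; i++)
        colors[i] = 0;               /* 0 = uncolored in this sketch */
    assert(pool_used <= POOL_WORDS);
    return RA_USED_SSA;
}
```

In this sketch a function with 4,035 virtual registers fits in the pool,
while one with 1,048,576 virtual registers does not and returns
RA_USED_FALLBACK.
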
Files
-----
src/amdgpu/ra_ssa.c   - allocator implementation (~1,300 lines)
src/amdgpu/ra_ssa.h   - public interface
src/amdgpu/amdgpu.h   - vr_divg[] bitvector, shared helpers
src/amdgpu/isel.c     - divergence propagation to per-vreg bitvector
src/amdgpu/emit.c     - SSA dispatch, un-static shared helpers
src/main.c            - --ssa-ra flag

References
----------
Sampaio, D., Souza, R. M. de, Collange, S., & Pereira, F. M. Q. (2013).
    Divergence analysis. ACM TOPLAS 35(4), Article 13, 1-36.
    https://doi.org/10.1145/2523815

Cooper, K. D., Harvey, T. J., & Kennedy, K. (2001).
    A simple, fast dominance algorithm.
    Software Practice and Experience, 4, 1-10.

Braun, M., & Hack, S. (2009).
    Register spilling and live-range splitting for SSA-form programs.
    CC 2009, LNCS 5501, pp. 174-189.
    https://doi.org/10.1007/978-3-642-00722-4_13

Yes, I used Zotero, because I always seem to miss something in APA 7th lol.

Next Steps
----------
- Run on MI300X hardware:
    barracuda --ssa-ra --amdgpu-bin --gfx942 gpu/tp_kern.cu
    then: kahu a.hsaco --all
    then: run the Godiva benchmark, verify k_eff matches CPU (0.995 ± 0.001)
- If the kernel runs correctly, benchmark against the ra_gc binary
- Expected: significant speedup from the 78% reduction in scratch ops
  (Sampaio et al. report a 26.21% speedup on 395 CUDA kernels)

Test Status
-----------
- 90/91 tests pass (1 skipped, unchanged from before)
- vector_add: 6 VGPRs, 0 spills, 0 scratch (both RDNA3 and CDNA3)
- Moa kernel: binary generation succeeds (a.hsaco, 56 KB)
- No verifier errors