GitHub - AlexanderGSC/rvv-gemm: This repository contains the analysis and progressive optimisation of a generic matrix multiplication algorithm (GEMM) for large matrices running natively on a 64-bit RISC-V architecture utilising the RVV 1.0 vector extension and advanced cache locality techniques (Tiling).

GEMM Optimization with RISC-V RVV 1.0

This repository contains the analysis and progressive optimisation of a general matrix multiplication (GEMM) algorithm for large matrices ($2000 \times 3000 \times 100$), running natively on a 64-bit RISC-V architecture. It uses the RVV 1.0 vector extension, cache locality techniques (tiling), ILP with out-of-order execution, and multithreading via OpenMP. All improvements are cumulative.

The repository develops the algorithm through various stages:

GEMM_O2.c: Pure scalar code.
GEMM_O3.c: Code auto-vectorised by the compiler with the -O3 flag.
GEMM_vaddmul.c: First manual vector approach using vle32.v, vfmul.vf and vfadd.vv.
GEMM_vmacc.c: Latency reduction by replacing the multiply/add pair with the single multiply-accumulate instruction vfmacc.vf.
GEMM_tiled.c: Loop nest blocking (tiling), dividing the problem into submatrices that fit in the L1 cache and eliminating strided stores to RAM.
GEMM_unroll_tiled.c: ILP (instruction-level parallelism) through loop unrolling by a factor of 4 with interleaving, so that out-of-order (OoO) execution keeps the VPU pipeline busy and RAW hazards are mitigated.
GEMM_omp.c: Multithreading with OpenMP to use all 8 physical cores of the SBC.
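
The step from the vfmul.vf/vfadd.vv pair to vfmacc.vf can be pictured as a single row update, C[i,:] += a * B[k,:]. The following is a minimal sketch, not the repository's code: the function names `row_axpy` and `gemm_rows` and the row-major layout are assumptions made for illustration. The vector path uses the `riscv_vector.h` intrinsics and is only compiled when the toolchain defines `__riscv_v_intrinsic`; on other hosts a scalar fallback with the same semantics is used.

```c
#include <stddef.h>

#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Hypothetical helper: c[0..n) += a * b[0..n), one row update of GEMM. */
void row_axpy(float *c, const float *b, float a, size_t n)
{
#if defined(__riscv_v_intrinsic)
    /* Strip-mined loop: vsetvl decides how many elements each pass handles. */
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);   /* vle32.v */
        vfloat32m1_t vc = __riscv_vle32_v_f32m1(c, vl);   /* vle32.v */
        /* One vfmacc.vf takes the place of vfmul.vf followed by vfadd.vv. */
        vc = __riscv_vfmacc_vf_f32m1(vc, a, vb, vl);
        __riscv_vse32_v_f32m1(c, vc, vl);                 /* vse32.v */
        n -= vl;
        b += vl;
        c += vl;
    }
#else
    /* Scalar fallback producing the same result on non-RVV hosts. */
    for (size_t j = 0; j < n; j++)
        c[j] += a * b[j];
#endif
}

/* C (n x m) += A (n x p) * B (p x m), row-major, i-k-j loop order (assumed). */
void gemm_rows(float *C, const float *A, const float *B,
               size_t n, size_t p, size_t m)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < p; k++)
            row_axpy(&C[i * m], &B[k * m], A[i * p + k], m);
}
```

The i-k-j order is chosen here so that the innermost loop walks rows of B and C with unit stride, which is what lets a scalar element of A be broadcast by the `.vf` form of the instruction.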

Benchmark Results

The experiment was carried out on a Banana Pi BPI-F3 SBC (RISC-V processor), measuring the pure computation time required to calculate the result matrix $C$. Hardware counters were collected with perf:

perf stat -e cycles,instructions,branches,branch-misses,L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses ./P3GEMM_version 2000 3000 100

Metric	`GEMM_O2` (Base)	`GEMM_O3`	`GEMM_vaddmul`	`GEMM_vmacc`	`GEMM_tiled`	`GEMM_unrolled_tiled`	`GEMM_omp`
Computing Time	2.881 s	0.812 s	0.776 s	0.708 s	0.319 s	0.211 s	0.043 s
Speedup	1.0x	3.75x	3.69x	4.06x	8.99x	15.02x	66.58x
CPU Cycles (`cycles`)	5,484 M	2,177 M	2,138 M	2,010 M	1,389 M	1,234 M	1,436 M
Instructions	5,412 M	1,562 M	793 M	782 M	675 M	674 M	741 M
IPC (`insn per cycle`)	0.99	0.72	0.37	0.39	0.49	0.55	0.52
L1 Loads (`loads`)	1,949 M	466 M	574 M	574 M	361 M	223 M	231 M
L1 Load Misses	1,955,513	1,877,319	2,044,323	2,036,231	2,479,215	2,415,152	2,300,855
L1 Stores (`stores`)	689 M	244 M	280 M	280 M	91 M	91 M	91 M
L1 Store Misses	568 K	553 K	550 K	550 K	546 K	543 K	549 K
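
The derived figures can be recomputed from the raw counters. The sketch below uses hypothetical helper names and takes values from the `GEMM_O2` and `GEMM_tiled` columns. It reads the cycle and instruction counts as thousands of millions (5,484 M cycles for `GEMM_O2`), which is the reading that reproduces the listed IPC values.

```c
#include <stdio.h>

/* Hypothetical helpers for the derived metrics discussed in this section. */

/* Instructions per cycle; both arguments must use the same unit. */
double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}

/* Cache miss rate as a percentage of accesses. */
double miss_rate_percent(double misses, double accesses)
{
    return 100.0 * misses / accesses;
}

/* Percentage reduction in run time relative to a baseline. */
double reduction_percent(double base_seconds, double opt_seconds)
{
    return 100.0 * (1.0 - opt_seconds / base_seconds);
}

/* Print the derived metrics for two columns of the table. */
void print_derived(void)
{
    /* Counts in millions, taken from the table. */
    printf("GEMM_O2    IPC: %.2f\n", ipc(5412.0, 5484.0));
    printf("GEMM_tiled IPC: %.2f\n", ipc(675.0, 1389.0));
    printf("GEMM_O2 L1 load miss rate: %.3f%%\n",
           miss_rate_percent(1955513.0, 1949.0e6));
    printf("GEMM_tiled time reduction vs GEMM_O2: %.1f%%\n",
           reduction_percent(2.881, 0.319));
}
```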

Key takeaway: Maximum optimisation is not achieved simply by inserting vector instructions, but by managing how data flows between RAM, L1 cache blocks and the silicon’s vector ALUs.

Overall acceleration: The tiled version achieves an 88.9% reduction in computation time compared to the base _O2 version (from 2.881 s to 0.319 s).
Instruction reduction: Manual vectorisation (_vaddmul and _vmacc) and blocking (_tiled) roughly halve the instruction count relative to the auto-vectorised _O3 build (about 1.56 billion versus 0.79 billion or fewer). The compiler vectorises conservatively and selects an LMUL of 1, which avoids excessive register pressure. This can be seen in the assembly code it generates:

    vsetvli	a5,a3,e32,m1,ta,ma	# LMUL=1
    vle32.v	v1,0(a0)
    slli	a1,a5,2
    vle32.v	v2,0(a2)
    # P3GEMM.c:19  for (j = 0; j < m; j++)
    sub	a3,a3,a5
    add	a0,a0,a1
    add	a2,a2,a1
    vfmacc.vv	v1,v3,v2	# uses vfmacc
    vse32.v	v1,0(a4)

The misleading IPC of _O2: _O2 has the highest IPC (0.99) yet is the slowest version. The core retires scalar instructions at a high rate, but each one processes a single element, so far more instructions are needed (about 5.4 billion versus under 0.8 billion for the manual vector versions).
Efficient operation fusion: The _vmacc version executes fewer instructions than _vaddmul (782 M versus 793 M) because each multiply/add pair is fused into a single vfmacc.vf instruction.
Instruction interleaving to saturate the hardware: Unrolling the row loop by a factor of 4 hides FMA (fused multiply-add) latency, improving pipeline throughput and raising IPC from 0.49 to 0.55.
Store traffic: _tiled reduces L1 cache stores from 689 million to 91 million (a reduction of about 87%), consistent with partial sums being kept in registers before they are written to memory.
Load traffic: _tiled cuts L1 loads to roughly one-fifth of the base version, which reduces data starvation of the CPU and raises IPC to 0.49 (from 0.39 for _vmacc).
Consistency of compulsory misses: Store misses remain constant (~550 K) across all versions because they correspond to the unavoidable initialisation of the array.
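
The tiling, register-accumulation and unrolling points above can be sketched in portable C. This is an illustration under stated assumptions rather than the repository's kernel: the function name `gemm_tiled_unroll4`, the constants `TILE_K` and `VW`, and the row-major layout are all hypothetical, and the short `j` loops stand in for single vector instructions. Four rows of C are accumulated per pass, so the four multiply-accumulate chains are independent of one another and each block of C is stored once per k-tile.

```c
#include <stddef.h>

enum {
    TILE_K = 64, /* assumed tile depth along k */
    VW = 8       /* assumed strip width; stands in for one vector register */
};

/* C (n x m) += A (n x p) * B (p x m), row-major. */
void gemm_tiled_unroll4(float *C, const float *A, const float *B,
                        size_t n, size_t p, size_t m)
{
    for (size_t kk = 0; kk < p; kk += TILE_K) {
        size_t k_end = (kk + TILE_K < p) ? kk + TILE_K : p;
        size_t i = 0;

        /* Main loop: four rows of C per iteration (unroll factor 4). */
        for (; i + 4 <= n; i += 4) {
            for (size_t jj = 0; jj < m; jj += VW) {
                size_t w = (m - jj < VW) ? m - jj : VW;
                /* Partial sums stay in local accumulators across the k-tile. */
                float acc0[VW] = {0}, acc1[VW] = {0};
                float acc2[VW] = {0}, acc3[VW] = {0};

                for (size_t k = kk; k < k_end; k++) {
                    const float *b = &B[k * m + jj];
                    float a0 = A[(i + 0) * p + k], a1 = A[(i + 1) * p + k];
                    float a2 = A[(i + 2) * p + k], a3 = A[(i + 3) * p + k];
                    for (size_t j = 0; j < w; j++) {
                        float bj = b[j]; /* one load of B feeds four FMAs */
                        acc0[j] += a0 * bj;
                        acc1[j] += a1 * bj;
                        acc2[j] += a2 * bj;
                        acc3[j] += a3 * bj;
                    }
                }

                /* One store per C element per k-tile. */
                for (size_t j = 0; j < w; j++) {
                    C[(i + 0) * m + jj + j] += acc0[j];
                    C[(i + 1) * m + jj + j] += acc1[j];
                    C[(i + 2) * m + jj + j] += acc2[j];
                    C[(i + 3) * m + jj + j] += acc3[j];
                }
            }
        }

        /* Remainder rows when n is not a multiple of 4. */
        for (; i < n; i++)
            for (size_t k = kk; k < k_end; k++) {
                float a = A[i * p + k];
                for (size_t j = 0; j < m; j++)
                    C[i * m + j] += a * B[k * m + j];
            }
    }
}
```

In a vector version, each `acc` array would correspond to one vector register group and each inner `j` loop to one vfmacc.vf, so consecutive vfmacc.vf instructions would write different registers and need not wait on each other.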

Tile Scaling Analysis and L1 Cache Behaviour

The Banana Pi features a 32 KB L1 data cache, and the code uses 32-bit floating-point precision (4 bytes per element). An empirical study varied the tile size to find the point at which the L1 cache is used most effectively. These are the results for the tiled version without unrolling:

Tile Size	Computation Time	L1 Loads	L1 Misses	% Misses	L1 Cache Status
32 x 32	0.712 s	588,147,925	2,267,033	0.39%	Underutilised
50 x 50	0.310 s	368,267,051	2,762,719	0.75%	Optimal net balance
64 x 64	0.320 s	361,152,490	2,463,928	0.68%	Alignment optimum
128 x 128	0.452 s	359,436,762	8,375,178	2.33%	Saturation and Overflow
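
The "L1 Cache Status" column can be related to simple arithmetic on tile footprints. The sketch below assumes square T x T tiles of 4-byte floats and compares the footprint of a single tile with the 32 KB L1 data cache. How many tiles are live at once depends on the kernel's loop structure, which is not assumed here. The names `tile_bytes` and `print_tile_footprints` are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define L1D_BYTES (32u * 1024u) /* 32 KB L1 data cache, as stated above */

/* Bytes occupied by one t x t tile of 32-bit floats. */
size_t tile_bytes(size_t t)
{
    return t * t * sizeof(float);
}

/* Print the footprint of each tile size used in the table. */
void print_tile_footprints(void)
{
    const size_t sizes[] = {32, 50, 64, 128};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        size_t bytes = tile_bytes(sizes[i]);
        printf("%3zu x %-3zu: %6zu bytes (%.1f%% of L1D)\n",
               sizes[i], sizes[i], bytes,
               100.0 * (double)bytes / (double)L1D_BYTES);
    }
}
```

A single 128 x 128 tile occupies 64 KB and therefore cannot fit in the 32 KB cache, which is consistent with the miss rate rising to 2.33%. A 32 x 32 tile occupies only 4 KB, leaving most of the cache unused.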

Compilation

To compile natively in the RISC-V environment using vector support (requires gcc with support for RVV 1.0):

# Compile all versions
make

# Run with dimensions N P M, where A is N x P and B is P x M
./P3GEMM_su 2000 3000 100

Next Steps

Implement the same algorithm in C++ to measure the impact of zero-cost abstractions on the code.
Use GEMM to implement a 2D convolution using the im2col algorithm.
