
Memory Access Streaming Fusion Pass#268

Merged
ShangkunLi merged 3 commits into coredac:main from ShangkunLi:memory-fusion
Feb 12, 2026

Conversation

@ShangkunLi
Collaborator

--memory-access-streaming-fusion: Memory Access Streaming Fusion Pass

Summary

This PR adds the MemoryAccessStreamingFusion pass (--memory-access-streaming-fusion), which identifies and fuses taskflow.task operations connected by intermediate memory buffers. When one task writes to a memref and another reads from it, the pass merges them into a single fused task, eliminating the intermediate memref.alloc (if present) and converting the memory-based data transfer into direct SSA value flow (streaming).

Motivation

After convert-affine-to-taskflow, each serialized loop nest becomes an independent task that communicates with other tasks through shared memrefs. Many of these intermediate buffers exist solely to pass data between producer and consumer tasks. Fusing these tasks:

  • Reduces memory traffic by eliminating redundant store/load pairs through intermediate buffers.
  • Enables streaming execution — the fused task computes the writer's value and immediately uses it in the reader's computation, without materializing the full intermediate buffer.
  • Reduces task count, simplifying downstream scheduling and placement.

How It Works

The pass operates in iterative rounds to handle fusion chains (e.g., A→B→C: first round fuses A+B, second round fuses (A+B)+C):

  1. Dependency Analysis — Traces SSA value flow through write_outputs and read/write_memrefs to build a memory dependency graph capturing RAW, WAW, and WAR dependencies. Uses original_read/write_memrefs to identify the physical intermediate %alloc buffers.

  2. Candidate Identification — Finds fusable (writer, reader) pairs that satisfy:

    • Writer has no value_outputs (simplified constraint for correctness).
    • Both tasks have compatible perfectly-nested loop bounds.
    • The intermediate memref has no external uses (only used by the two tasks).
    • No cyclic dependency between the pair.
  3. Fusion Transformation — For each valid candidate:

    • Creates a new fused task with merged operand lists (excluding the intermediate).
    • Inlines the writer's loop body, replacing the intermediate store with the stored value.
    • Inlines the reader's loop body, replacing the intermediate load with the writer's stored value (direct SSA forwarding).
    • Replaces all uses of the original tasks' outputs with the fused task's outputs.
    • Erases the original writer and reader tasks, and the intermediate memref.alloc.
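The iterative driver described above can be sketched on a toy task graph. This is a minimal illustration, not the pass's actual implementation: `Task`, `fuseRound`, and `fuseAll` are hypothetical names, buffers are plain integer ids standing in for memrefs, and loop-bound and cycle checks are omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model of a taskflow.task: a name plus the buffers it reads/writes.
struct Task {
  std::string name;
  std::set<int> reads, writes;  // buffer ids standing in for memrefs
};

// One fusion round: fuse the first (writer, reader) pair connected by a
// buffer that no other task touches. Returns true if a fusion happened.
static bool fuseRound(std::vector<Task> &tasks) {
  for (std::size_t w = 0; w < tasks.size(); ++w) {
    for (int buf : tasks[w].writes) {
      // Find the unique reader of `buf`; bail out if the buffer has any
      // other user (the "private intermediate" criterion).
      int reader = -1;
      bool priv = true;
      for (std::size_t r = 0; r < tasks.size(); ++r) {
        if (r == w) continue;
        if (tasks[r].writes.count(buf)) { priv = false; break; }
        if (tasks[r].reads.count(buf)) {
          if (reader != -1) { priv = false; break; }
          reader = static_cast<int>(r);
        }
      }
      if (!priv || reader < 0) continue;
      // Merge operand lists, dropping the now-internal intermediate buffer.
      Task fused;
      fused.name = tasks[w].name + "_" + tasks[reader].name + "_fused";
      fused.reads = tasks[w].reads;
      fused.reads.insert(tasks[reader].reads.begin(), tasks[reader].reads.end());
      fused.reads.erase(buf);
      fused.writes = tasks[w].writes;
      fused.writes.insert(tasks[reader].writes.begin(), tasks[reader].writes.end());
      fused.writes.erase(buf);
      // Erase the originals (higher index first) and append the fused task.
      std::size_t hi = std::max(w, (std::size_t)reader);
      std::size_t lo = std::min(w, (std::size_t)reader);
      tasks.erase(tasks.begin() + hi);
      tasks.erase(tasks.begin() + lo);
      tasks.push_back(fused);
      return true;
    }
  }
  return false;
}

// Iterative driver: keep fusing until a fixpoint, so a chain A->B->C
// collapses over two rounds, matching the A+B then (A+B)+C behavior above.
static void fuseAll(std::vector<Task> &tasks) {
  while (fuseRound(tasks)) {}
}
```

Running `fuseAll` on a three-task chain collapses it to a single task whose name accumulates a `_fused` suffix per round, which is why chained fusion produces names like `Task_10_Task_11_Task_12_fused_fused`.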

Example

Before (3 tasks, 2 intermediate buffers):

%alloc_A = memref.alloc() : memref<1x64x8x8xf32>
// Task_4: transpose (writes %alloc_A)
%wo_4 = taskflow.task @Task_4 read_memrefs(%wo_3) write_memrefs(%alloc_A)
    [original_read_memrefs(%alloc_prev), original_write_memrefs(%alloc_A)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x8x8x64xf32>
    affine.store %v, %arg2[...] : memref<1x64x8x8xf32>   // ← store to intermediate
  }
}

%alloc_B = memref.alloc() : memref<1x64x8x8xf32>
// Task_5: clamp (reads %alloc_A, writes %alloc_B)
%wo_5 = taskflow.task @Task_5 read_memrefs(%wo_4) write_memrefs(%alloc_B)
    [original_read_memrefs(%alloc_A), original_write_memrefs(%alloc_B)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x64x8x8xf32>   // ← load from intermediate
    %c = arith.minimumf %v, %cst_max : f32
    %r = arith.maximumf %c, %cst_min : f32
    affine.store %r, %arg2[...] : memref<1x64x8x8xf32>
  }
}

After (1 fused task, intermediate %alloc_A eliminated):

%alloc_B = memref.alloc() : memref<1x64x8x8xf32>
// Task_4_Task_5_fused: transpose + clamp (streaming — no intermediate buffer)
%wo_fused = taskflow.task @Task_4_Task_5_fused read_memrefs(%wo_3) write_memrefs(%alloc_B)
    [original_read_memrefs(%alloc_prev), original_write_memrefs(%alloc_B)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x8x8x64xf32>
    // ↓ writer's store eliminated, value streamed directly ↓
    %c = arith.minimumf %v, %cst_max : f32
    %r = arith.maximumf %c, %cst_min : f32
    affine.store %r, %arg2[...] : memref<1x64x8x8xf32>
  }
}

Iterative Chaining Example (ResNet)

On the SimpleResNet benchmark, the pass performs iterative fusion across multiple rounds:

| Round | Fusion | Result |
| --- | --- | --- |
| 1 | Task_4 (transpose) + Task_5 (clamp) | Task_4_Task_5_fused |
| 2 | Task_10 (transpose) + Task_11 (add) | Task_10_Task_11_fused |
| 3 | Task_10_Task_11_fused + Task_12 (clamp) | Task_10_Task_11_Task_12_fused_fused |

This reduces the ResNet task graph from 13 tasks → 10 tasks, eliminating 3 intermediate buffers.

Fusion Criteria

| Check | Description |
| --- | --- |
| No writer value_outputs | Fused task only propagates reader's outputs |
| Compatible loop bounds | Both tasks must have identical perfectly-nested affine.for loop bounds |
| Private intermediate | The intermediate memref must only be used by writer + reader |
| No cyclic deps | Writer must not read any memref that reader writes (excluding intermediate) |
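The four criteria combine into a single candidate predicate. The sketch below is illustrative only: `TaskInfo` and `isFusableCandidate` are hypothetical names, and the caller is assumed to supply the count of uses of the intermediate buffer outside the pair.

```cpp
#include <set>
#include <vector>

// Hypothetical flattened view of a task; fields are stand-ins for the
// pass's actual data structures.
struct TaskInfo {
  std::set<int> reads, writes;   // memref ids
  std::vector<int> loopBounds;   // perfectly-nested affine.for trip counts
  bool hasValueOutputs;
};

// Mirrors the four checks in the table above.
static bool isFusableCandidate(const TaskInfo &writer, const TaskInfo &reader,
                               int intermediate, int externalUses) {
  // 1. Writer must have no value_outputs.
  if (writer.hasValueOutputs) return false;
  // 2. Both tasks need identical perfectly-nested loop bounds.
  if (writer.loopBounds != reader.loopBounds) return false;
  // 3. The intermediate memref must be private to the pair.
  if (externalUses != 0) return false;
  // 4. No cycle: writer must not read anything reader writes,
  //    other than the intermediate itself.
  for (int m : reader.writes)
    if (m != intermediate && writer.reads.count(m)) return false;
  return true;
}
```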

@ShangkunLi ShangkunLi merged commit f7f2ee2 into coredac:main Feb 12, 2026
1 check passed
// shape).
if (writer->write_memrefs.size() == 1 && reader->read_memrefs.size() == 1) {
  benefit += 50;
}
Collaborator
This is a simple calculation, will it result in a lot of ties?

Collaborator Author

Yes, this may introduce some ties.

We use greedy fusion here, and we actually fuse all the task pairs that meet the constraints.

I cannot tell the effect of ties for now. Maybe we can run some tests by applying the pass to more benchmarks.
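One way to keep the greedy order reproducible despite tied benefit scores is a deterministic tie-break on task names. This is a hedged sketch of that idea, not code from the pass: `Candidate` and `sortCandidates` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// A candidate (writer, reader) pair with its computed benefit score.
struct Candidate {
  int benefit;
  std::string writer, reader;  // task symbol names
};

// Sort by descending benefit; break ties by (writer, reader) name so the
// greedy pass visits candidates in the same order on every run.
static void sortCandidates(std::vector<Candidate> &cands) {
  std::stable_sort(cands.begin(), cands.end(),
                   [](const Candidate &a, const Candidate &b) {
                     if (a.benefit != b.benefit) return a.benefit > b.benefit;
                     return std::tie(a.writer, a.reader) <
                            std::tie(b.writer, b.reader);
                   });
}
```

Since the pass ultimately fuses every pair that passes the constraints, ties change the visit order rather than the final fusion set, but a stable order still makes results easier to compare across benchmarks.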
