
Memory Access Streaming Fusion Pass#268

Merged
ShangkunLi merged 3 commits into coredac:main from ShangkunLi:memory-fusion
Feb 12, 2026

Conversation

@ShangkunLi
Collaborator

--memory-access-streaming-fusion: Memory Access Streaming Fusion Pass

Summary

This PR adds the MemoryAccessStreamingFusion pass (--memory-access-streaming-fusion), which identifies and fuses taskflow.task operations connected by intermediate memory buffers. When one task writes to a memref and another reads from it, the pass merges them into a single fused task, eliminating the intermediate memref.alloc (if present) and converting the memory-based data transfer into direct SSA value flow (streaming).

Motivation

After convert-affine-to-taskflow, each serialized loop nest becomes an independent task that communicates with other tasks through shared memrefs. Many of these intermediate buffers exist solely to pass data between producer and consumer tasks. Fusing these tasks:

  • Reduces memory traffic by eliminating redundant store/load pairs through intermediate buffers.
  • Enables streaming execution — the fused task computes the writer's value and immediately uses it in the reader's computation, without materializing the full intermediate buffer.
  • Reduces task count, simplifying downstream scheduling and placement.

How It Works

The pass operates in iterative rounds to handle fusion chains (e.g., A→B→C: first round fuses A+B, second round fuses (A+B)+C):

  1. Dependency Analysis — Traces SSA value flow through write_outputs and read/write_memrefs to build a memory dependency graph capturing RAW, WAW, and WAR dependencies. Uses original_read/write_memrefs to identify the physical intermediate %alloc buffers.

  2. Candidate Identification — Finds fusable (writer, reader) pairs that satisfy:

    • Writer has no value_outputs (simplified constraint for correctness).
    • Both tasks have compatible perfectly-nested loop bounds.
    • The intermediate memref has no external uses (only used by the two tasks).
    • No cyclic dependency between the pair.
  3. Fusion Transformation — For each valid candidate:

    • Creates a new fused task with merged operand lists (excluding the intermediate).
    • Inlines the writer's loop body, replacing the intermediate store with the stored value.
    • Inlines the reader's loop body, replacing the intermediate load with the writer's stored value (direct SSA forwarding).
    • Replaces all uses of the original tasks' outputs with the fused task's outputs.
    • Erases the original writer and reader tasks, and the intermediate memref.alloc.
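The iterative driver described above can be sketched on a toy task graph. This is a minimal illustration, not the pass's actual implementation: `Task`, `fuseRound`, and `fuseAll` are hypothetical names, buffers are plain integer ids standing in for memrefs, and loop-bound and cycle checks are omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model of a taskflow.task: a name plus the buffers it reads/writes.
struct Task {
  std::string name;
  std::set<int> reads, writes;  // buffer ids standing in for memrefs
};

// One fusion round: fuse the first (writer, reader) pair connected by a
// buffer that no other task touches. Returns true if a fusion happened.
static bool fuseRound(std::vector<Task> &tasks) {
  for (std::size_t w = 0; w < tasks.size(); ++w) {
    for (int buf : tasks[w].writes) {
      // Find the unique reader of `buf`; bail out if the buffer has any
      // other user (the "private intermediate" criterion).
      int reader = -1;
      bool priv = true;
      for (std::size_t r = 0; r < tasks.size(); ++r) {
        if (r == w) continue;
        if (tasks[r].writes.count(buf)) { priv = false; break; }
        if (tasks[r].reads.count(buf)) {
          if (reader != -1) { priv = false; break; }
          reader = static_cast<int>(r);
        }
      }
      if (!priv || reader < 0) continue;
      // Merge operand lists, dropping the now-internal intermediate buffer.
      Task fused;
      fused.name = tasks[w].name + "_" + tasks[reader].name + "_fused";
      fused.reads = tasks[w].reads;
      fused.reads.insert(tasks[reader].reads.begin(), tasks[reader].reads.end());
      fused.reads.erase(buf);
      fused.writes = tasks[w].writes;
      fused.writes.insert(tasks[reader].writes.begin(), tasks[reader].writes.end());
      fused.writes.erase(buf);
      // Erase the originals (higher index first) and append the fused task.
      std::size_t hi = std::max(w, (std::size_t)reader);
      std::size_t lo = std::min(w, (std::size_t)reader);
      tasks.erase(tasks.begin() + hi);
      tasks.erase(tasks.begin() + lo);
      tasks.push_back(fused);
      return true;
    }
  }
  return false;
}

// Iterative driver: keep fusing until a fixpoint, so a chain A->B->C
// collapses over two rounds, matching the A+B then (A+B)+C behavior above.
static void fuseAll(std::vector<Task> &tasks) {
  while (fuseRound(tasks)) {}
}
```

Running `fuseAll` on a three-task chain collapses it to a single task whose name accumulates a `_fused` suffix per round, which is why chained fusion produces names like `Task_10_Task_11_Task_12_fused_fused`.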

Example

Before (3 tasks, 2 intermediate buffers):

%alloc_A = memref.alloc() : memref<1x64x8x8xf32>
// Task_4: transpose (writes %alloc_A)
%wo_4 = taskflow.task @Task_4 read_memrefs(%wo_3) write_memrefs(%alloc_A)
    [original_read_memrefs(%alloc_prev), original_write_memrefs(%alloc_A)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x8x8x64xf32>
    affine.store %v, %arg2[...] : memref<1x64x8x8xf32>   // ← store to intermediate
  }
}

%alloc_B = memref.alloc() : memref<1x64x8x8xf32>
// Task_5: clamp (reads %alloc_A, writes %alloc_B)
%wo_5 = taskflow.task @Task_5 read_memrefs(%wo_4) write_memrefs(%alloc_B)
    [original_read_memrefs(%alloc_A), original_write_memrefs(%alloc_B)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x64x8x8xf32>   // ← load from intermediate
    %c = arith.minimumf %v, %cst_max : f32
    %r = arith.maximumf %c, %cst_min : f32
    affine.store %r, %arg2[...] : memref<1x64x8x8xf32>
  }
}

After (1 fused task, intermediate %alloc_A eliminated):

%alloc_B = memref.alloc() : memref<1x64x8x8xf32>
// Task_4_Task_5_fused: transpose + clamp (streaming — no intermediate buffer)
%wo_fused = taskflow.task @Task_4_Task_5_fused read_memrefs(%wo_3) write_memrefs(%alloc_B)
    [original_read_memrefs(%alloc_prev), original_write_memrefs(%alloc_B)] {
  affine.for ... {
    %v = affine.load %arg1[...] : memref<1x8x8x64xf32>
    // ↓ writer's store eliminated, value streamed directly ↓
    %c = arith.minimumf %v, %cst_max : f32
    %r = arith.maximumf %c, %cst_min : f32
    affine.store %r, %arg2[...] : memref<1x64x8x8xf32>
  }
}

Iterative Chaining Example (ResNet)

On the SimpleResNet benchmark, the pass performs iterative fusion across multiple rounds:

| Round | Fusion | Result |
| --- | --- | --- |
| 1 | Task_4 (transpose) + Task_5 (clamp) | Task_4_Task_5_fused |
| 2 | Task_10 (transpose) + Task_11 (add) | Task_10_Task_11_fused |
| 3 | Task_10_Task_11_fused + Task_12 (clamp) | Task_10_Task_11_Task_12_fused_fused |

This reduces the ResNet task graph from 13 tasks → 10 tasks, eliminating 3 intermediate buffers.

Fusion Criteria

| Check | Description |
| --- | --- |
| No writer value_outputs | Fused task only propagates reader's outputs |
| Compatible loop bounds | Both tasks must have identical perfectly-nested affine.for loop bounds |
| Private intermediate | The intermediate memref must only be used by writer + reader |
| No cyclic deps | Writer must not read any memref that reader writes (excluding intermediate) |
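The four criteria combine into a single candidate predicate. The sketch below is illustrative only: `TaskInfo` and `isFusableCandidate` are hypothetical names, and the caller is assumed to supply the count of uses of the intermediate buffer outside the pair.

```cpp
#include <set>
#include <vector>

// Hypothetical flattened view of a task; fields are stand-ins for the
// pass's actual data structures.
struct TaskInfo {
  std::set<int> reads, writes;   // memref ids
  std::vector<int> loopBounds;   // perfectly-nested affine.for trip counts
  bool hasValueOutputs;
};

// Mirrors the four checks in the table above.
static bool isFusableCandidate(const TaskInfo &writer, const TaskInfo &reader,
                               int intermediate, int externalUses) {
  // 1. Writer must have no value_outputs.
  if (writer.hasValueOutputs) return false;
  // 2. Both tasks need identical perfectly-nested loop bounds.
  if (writer.loopBounds != reader.loopBounds) return false;
  // 3. The intermediate memref must be private to the pair.
  if (externalUses != 0) return false;
  // 4. No cycle: writer must not read anything reader writes,
  //    other than the intermediate itself.
  for (int m : reader.writes)
    if (m != intermediate && writer.reads.count(m)) return false;
  return true;
}
```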

@ShangkunLi ShangkunLi merged commit f7f2ee2 into coredac:main Feb 12, 2026
1 check passed
// shape).
if (writer->write_memrefs.size() == 1 && reader->read_memrefs.size() == 1) {
  benefit += 50;
}
Collaborator
This is a simple calculation, will it result in a lot of ties?

Collaborator Author

Yes, this may introduce some ties.

We use greedy fusion here, and we actually fuse all the task pairs that meet the constraints.

I cannot tell the effect of ties for now. Maybe we can run some tests by applying the pass to more benchmarks.
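One way to keep the greedy order reproducible despite tied benefit scores is a deterministic tie-break on task names. This is a hedged sketch of that idea, not code from the pass: `Candidate` and `sortCandidates` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// A candidate (writer, reader) pair with its computed benefit score.
struct Candidate {
  int benefit;
  std::string writer, reader;  // task symbol names
};

// Sort by descending benefit; break ties by (writer, reader) name so the
// greedy pass visits candidates in the same order on every run.
static void sortCandidates(std::vector<Candidate> &cands) {
  std::stable_sort(cands.begin(), cands.end(),
                   [](const Candidate &a, const Candidate &b) {
                     if (a.benefit != b.benefit) return a.benefit > b.benefit;
                     return std::tie(a.writer, a.reader) <
                            std::tie(b.writer, b.reader);
                   });
}
```

Since the pass ultimately fuses every pair that passes the constraints, ties change the visit order rather than the final fusion set, but a stable order still makes results easier to compare across benchmarks.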
