
Resource aware task optimization #269

Open
guosran wants to merge 11 commits into main from feature/resource-aware-task-optimization

Conversation

@guosran (Collaborator) commented Feb 17, 2026

Overview

This PR introduces ResourceAwareTaskOptimizationPass, a two-phase MLIR pass that optimizes CGRA resource allocation for the Neura taskflow dialect on a 4×4 CGRA grid (16 CGRAs total).

Phase 1: Utilization Fusion

Merges independent tasks (no SSA or memory dependency edges in either direction) into a single fused task, sequentially concatenating their loop bodies. This frees up CGRA budget that Phase 2 can reallocate to critical-path bottlenecks.
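
For illustration, here is a minimal C++ sketch of the independence test, assuming tasks are nodes in a dependency graph; TaskNode, reaches, and canFuse are hypothetical names, not the pass's actual API:

```cpp
#include <deque>
#include <unordered_set>
#include <vector>

struct TaskNode {
  std::vector<TaskNode *> successors; // SSA + memory dependency edges
};

// True if 'to' is reachable from 'from' along dependency edges (BFS).
static bool reaches(TaskNode *from, TaskNode *to) {
  std::unordered_set<TaskNode *> seen{from};
  std::deque<TaskNode *> work{from};
  while (!work.empty()) {
    TaskNode *n = work.front();
    work.pop_front();
    if (n == to)
      return true;
    for (TaskNode *s : n->successors)
      if (seen.insert(s).second)
        work.push_back(s);
  }
  return false;
}

// Phase 1 may fuse two tasks only when neither depends on the other.
static bool canFuse(TaskNode *a, TaskNode *b) {
  return !reaches(a, b) && !reaches(b, a);
}
```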

Phase 2: Latency-Aware Pipeline Balance

Uses the pipelined latency model:

latency(task) = II × (⌈trip_count / cgra_count⌉ − 1) + steps

Iteratively finds the critical-path bottleneck (minimum slack node with highest individual latency) and allocates one additional CGRA to it, repeating until the 16-CGRA budget is exhausted or no improvement is possible.

The outer loop (max 10 iterations) alternates fusion and balance until convergence (no change in either phase).
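
The following is a minimal sketch of the latency model and the greedy balance loop, simplified to pick the highest-latency task rather than the pass's slack-based selection; Task and its field names are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Task {
  int64_t ii = 1;        // initiation interval from profiling
  int64_t steps = 1;     // pipeline depth (schedule length)
  int64_t tripCount = 1; // total loop iterations
  int64_t cgraCount = 1; // CGRAs currently allocated
};

// latency(task) = II * (ceil(trip_count / cgra_count) - 1) + steps
// e.g. II=2, trip_count=100, cgra_count=4, steps=10 -> 2*(25-1)+10 = 58.
static int64_t latency(const Task &t) {
  int64_t waves = (t.tripCount + t.cgraCount - 1) / t.cgraCount;
  return t.ii * (waves - 1) + t.steps;
}

// Greedily grant one extra CGRA to the slowest task until the 16-CGRA
// budget is spent or the extra CGRA no longer reduces its latency.
static void balance(std::vector<Task> &tasks, int64_t budget = 16) {
  if (tasks.empty())
    return;
  int64_t used = 0;
  for (const Task &t : tasks)
    used += t.cgraCount;
  while (used < budget) {
    auto slowest = std::max_element(
        tasks.begin(), tasks.end(),
        [](const Task &a, const Task &b) { return latency(a) < latency(b); });
    Task trial = *slowest;
    ++trial.cgraCount;
    if (latency(trial) >= latency(*slowest))
      break; // no improvement possible
    *slowest = trial;
    ++used;
  }
}
```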


Speculative Profiling for compiled_ii and steps

To obtain accurate II and steps without waiting for full compilation, the pass runs a two-phase profiling pipeline (sketched after the list):

  1. Phase 1 (Taskflow → Neura): Clone the parent func::FuncOp, strip all tasks except the target, run ConstructHyperblockFromTask → ClassifyCounters → ConvertTaskflowToNeura on the clone to produce neura.kernel ops.
  2. Phase 2 (Neura pipeline): Clone each kernel body into a standalone func::FuncOp tagged accelerator="neura", then run the full Neura lowering pipeline (LowerAffinePass → ConvertSCFToCFPass → AssignAccelerator → LowerMemRefToNeura → LowerArithToNeura → ... → InsertDataMovPass).
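
Below is a hedged sketch of how this two-phase profiling could drive an mlir::PassManager on the cloned module; the Taskflow/Neura pass-factory names are assumptions inferred from the pass names above, not verified APIs:

```cpp
#include <memory>

#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Assumed Taskflow/Neura factories, inferred from the pass names above.
std::unique_ptr<mlir::Pass> createConstructHyperblockFromTaskPass();
std::unique_ptr<mlir::Pass> createClassifyCountersPass();
std::unique_ptr<mlir::Pass> createConvertTaskflowToNeuraPass();

static mlir::LogicalResult profileClone(mlir::ModuleOp scratchModule) {
  mlir::PassManager pm(scratchModule->getContext());
  // Phase 1: Taskflow -> Neura on the stripped clone.
  pm.addPass(createConstructHyperblockFromTaskPass());
  pm.addPass(createClassifyCountersPass());
  pm.addPass(createConvertTaskflowToNeuraPass());
  // Phase 2: standard lowerings on the extracted kernel function; the
  // remaining Neura passes (AssignAccelerator .. InsertDataMovPass) are
  // elided here.
  pm.addPass(mlir::createLowerAffinePass());
  pm.addPass(mlir::createConvertSCFToCFPass());
  return pm.run(scratchModule);
}
```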

compiled_ii Extraction: Trade-offs

| Source | When used | Accuracy |
| --- | --- | --- |
| MapToAcceleratorPass mapping_info.compiled_ii | All ops are DataMov-wrapped AND total ops ≤ 150 | Highest (real modulo-scheduler result) |
| max(ResMII, RecMII) | Mapper skipped (size guard or DataMov guard fails) | Lower bound, conservative |
| Default ii=1, steps=1 | Phase 1 or 2 pipeline fails entirely | Pessimistic fallback |

Guard conditions for the mapper (decision logic sketched after this list):

  • DataMov completeness: All non-reserve operand producers must be neura.data_mov. If InsertDataMovPass didn't fully wrap every operand (which happens for kernels with complex control flow), the mapper asserts.
  • Op count limit (kMapperOpLimit = 150): Prevents exponential backtracking in the modulo scheduler during speculative profiling of large kernels.
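
Putting the table and the guards together, here is a sketch of the fallback chain; kMapperOpLimit matches the guard above, while ProfileResult, selectII, and the parameters are illustrative names:

```cpp
#include <algorithm>
#include <cstdint>

constexpr unsigned kMapperOpLimit = 150;

struct ProfileResult {
  int64_t ii = 1;    // pessimistic defaults used when profiling fails
  int64_t steps = 1;
};

static ProfileResult selectII(bool pipelineOk, bool allDataMovWrapped,
                              unsigned opCount, int64_t compiledII,
                              int64_t resMII, int64_t recMII,
                              int64_t scheduleSteps) {
  if (!pipelineOk)
    return {}; // Phase 1 or 2 failed: fall back to ii=1, steps=1
  ProfileResult r;
  r.steps = scheduleSteps;
  if (allDataMovWrapped && opCount <= kMapperOpLimit)
    r.ii = compiledII;               // real modulo-scheduler result
  else
    r.ii = std::max(resMII, recMII); // conservative lower bound
  return r;
}
```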

Split-Profile for Fused Tasks

After fusion, the fused task body contains N sequential loop nests. ConvertTaskflowToNeuraPass asserts hyperblock_count == 1, so we cannot profile the fused task directly. Instead we do the following (aggregation sketched after the list):

  1. Create a temporary single-loop wrapper task for each top-level loop nest.
  2. Profile each independently.
  3. Assign max(ii) and sum(steps) to the fused task.
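
A small sketch of the aggregation step, assuming per-nest profiles are already available; LoopProfile and aggregateFused are illustrative names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct LoopProfile {
  int64_t ii;
  int64_t steps;
};

// The fused task inherits the worst II (the slowest nest bounds steady-state
// throughput) and the summed steps (the nests execute sequentially).
static LoopProfile aggregateFused(const std::vector<LoopProfile> &nests) {
  LoopProfile fused{1, 0};
  for (const LoopProfile &p : nests) {
    fused.ii = std::max(fused.ii, p.ii);
    fused.steps += p.steps;
  }
  return fused;
}
```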

Test Coverage

| Test | Tasks | Fusions | Result |
| --- | --- | --- | --- |
| irregular-loop | 3 (incl. reduction) | 0 (value output guard) | 1+7+8 = 16 CGRAs |
| parallel-nested | 2 → 1 (fused) | 1 | cgra_count=10, total=16 |
| multi-nested | 4 → 3 (one fusion) | 1 | Task_1: 6, fused: 9, Task_4: 1 |
| resnet | 13 → 6 (7 fusions) | 7 | Task_3: 6, Task_9: 6, others: 1 |

Known Limitations

  1. Perfectly-nested assumption for trip_count: For non-perfectly-nested loops inside a task body, computeTripCount simply multiplies the inner-loop trip counts of each top-level loop structure. This is accurate for the current workloads (convolutions, matmuls); see the sketch after this list.
  2. kMapperOpLimit = 150: Large kernels skip MapToAcceleratorPass and fall back to ResMII/RecMII bounds. This is a deliberate performance vs. accuracy trade-off for speculative profiling.
  3. Fusion limited to write-output tasks: Tasks with value outputs (reductions) are excluded from utilization fusion. Full support would require tracking value-output flow across the fused task boundary.
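
The following sketches the trip-count computation described in limitation 1 (and refined in the commit notes below), assuming constant loop bounds; it uses mlir::affine::getConstantTripCount, whose exact namespace varies across MLIR versions:

```cpp
#include <cstdint>

#include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"

// For each top-level affine.for in the task body, multiply the trip counts
// of its whole nest, then sum across top-level loops:
// 'for i=0..10 { for j=0..5 }' -> 50, but 'for i=0..10; for j=0..5' -> 15.
static int64_t computeTripCount(mlir::Block &taskBody) {
  int64_t total = 0;
  for (mlir::Operation &op : taskBody) {
    auto topLoop = llvm::dyn_cast<mlir::affine::AffineForOp>(&op);
    if (!topLoop)
      continue; // only top-level loops start a new summand
    int64_t product = 1;
    // walk() visits the loop itself and every loop nested inside it.
    topLoop->walk([&](mlir::affine::AffineForOp loop) {
      if (auto tc = mlir::affine::getConstantTripCount(loop))
        product *= static_cast<int64_t>(*tc);
    });
    total += product; // sequential nests add, nested loops multiply
  }
  return total;
}
```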

…ce and fusion

- Add two-phase optimization: Utilization Fusion + Latency-Aware Pipeline Balance
- Implement pipelined latency model: latency = II * (ceil(trip_count/cgra_count) - 1) + steps
- Add fallback profiling using operation counting for robust performance estimation
- Critical path detection using slack analysis for bottleneck identification
- Task fusion for independent tasks to free up CGRA budget
- Support 4x4 CGRA grid (16 total) with complete allocation
- All 4 taskflow lit tests passing (multi-nested, parallel-nested, irregular-loop, resnet)
- Environment-agnostic: no Neura-specific analysis APIs, only standard MLIR operations
…erage

Bug fixes:
- Fix RecMII computation: use cycle.length (excl. reserve/ctrl_mov) instead
  of cycle.operations.size(), consistent with MapToAcceleratorPass
- Fix PipelineBalancer: the outer for-loop was dead code because of a 'return'
  inside the first iteration; refactored to recompute the critical path after
  each CGRA increment
- Fix placeholder generation in profileTask: replace type-specific AllocOp /
  ConstantIntOp with UnrealizedConversionCastOp which handles all types
  including dynamic-shape MemRefs without requiring dynamic-size operands
- Fix fusion guard: skip tasks with value outputs (reduction/iter_args loops)
  to prevent assertion failure in replaceTaskResults

New features:
- Add WAW (write-after-write) memory dependency edges to prevent incorrect
  fusion of tasks that write the same memref in program order (see the
  sketch after this list)
- Improve computeTripCount: walk only top-level affine.for ops and sum their
  nested products, correctly handling sequential loops at the same IR level
  (e.g. 'for i=0..10; for j=0..5' yields 15, not 50)
- Persist trip_count attribute at convergence alongside cgra_count/ii/steps
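
A sketch of the WAW-edge construction, assuming tasks are visited in program order and each task exposes the memrefs it writes; TaskNode and its members are hypothetical stand-ins for the pass's actual graph types:

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Value.h"

// Hypothetical task-graph node; the real pass's node type differs.
struct TaskNode {
  llvm::SmallVector<mlir::Value, 4> writes; // memrefs this task stores to
  llvm::SmallVector<TaskNode *, 4> succs;   // outgoing dependency edges
  void addSuccessor(TaskNode *t) { succs.push_back(t); }
};

// For each memref, the previous writer must finish before the next one:
// chain tasks through a WAW edge in program order.
static void addWAWEdges(llvm::ArrayRef<TaskNode *> tasksInProgramOrder) {
  llvm::DenseMap<mlir::Value, TaskNode *> lastWriter;
  for (TaskNode *task : tasksInProgramOrder) {
    for (mlir::Value memref : task->writes) {
      if (TaskNode *prev = lastWriter.lookup(memref))
        prev->addSuccessor(task); // WAW edge blocks incorrect fusion
      lastWriter[memref] = task;
    }
  }
}
```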

Cleanups:
- Remove unused #include <cmath>
- Add RESOPT lit checks for irregular-loop test (previously uncovered)

Tests: 4/4 PASS (irregular-loop, parallel-nested, multi-nested, resnet)
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a new MLIR optimization pass that fuses independent Taskflow tasks and balances CGRA allocation using a pipelined latency model, plus updates several multi-CGRA tests to exercise the new behavior.

Changes:

  • Introduces ResourceAwareTaskOptimizationPass implementing utilization fusion + latency-aware CGRA rebalancing with speculative profiling.
  • Wires the new pass into build/registration (CMake + Passes.td/h).
  • Extends Taskflow MLIR tests with --resource-aware-task-optimization RUN lines and RESOPT FileCheck assertions.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp | Implements the new two-phase optimization pass and speculative profiling pipeline |
| lib/TaskflowDialect/Transforms/Optimizations/CMakeLists.txt | Builds/links the new pass into the optimization library |
| include/TaskflowDialect/TaskflowPasses.td | Registers the new pass and its summary/description |
| include/TaskflowDialect/TaskflowPasses.h | Exposes the factory method for the new pass |
| test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir | Adds RUN + FileCheck coverage for RESOPT expectations |
| test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir | Adds RUN + RESOPT checks (but currently duplicated) |
| test/multi-cgra/taskflow/multi-nested/multi-nested.mlir | Adds RUN + RESOPT checks |
| test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir | Adds RUN + RESOPT checks |
| test/benchmark/Zeonica_Testbench | Updates submodule pointer |
| debug.log | Adds a debug artifact containing a crash backtrace/logs |


…oss iterations

- Remove duplicate RESOPT RUN+FileCheck block in parallel-nested.mlir
  that was a copy-paste error (identical input/output/check-prefix).

- Persist ii, steps, and trip_count to IR during intermediate iterations
  (alongside cgra_count) so that graph.build() on subsequent iterations
  can skip expensive speculative profiling for unchanged tasks via the
  existing has_precomputed guard.
@tancheng (Contributor) commented:

Shouldn't we fix #260 first to align the task/func/kernel?

@ShangkunLi (Collaborator) commented:

> Shouldn't we fix #260 first to align the task/func/kernel?

I think they are orthogonal. This PR does optimizations on the task dependency graph, regardless of how we construct that graph.

For now, we build the task dependency graph from the affine loops within one func. We can later extend it to create a task dependency graph from multiple funcs.

