
Resource aware task optimization #269

Open
guosran wants to merge 11 commits into main from feature/resource-aware-task-optimization

Conversation

@guosran (Collaborator) commented Feb 17, 2026

Overview

This PR introduces ResourceAwareTaskOptimizationPass, a two-phase MLIR pass that optimizes CGRA resource allocation for the Neura taskflow dialect on a 4×4 CGRA grid (16 CGRAs total).

Phase 1: Utilization Fusion

Merges independent tasks (no SSA or memory dependency edges in either direction) into a single fused task, sequentially concatenating their loop bodies. This frees up CGRA budget that Phase 2 can reallocate to critical-path bottlenecks.
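
For illustration, here is a minimal C++ sketch of the independence test, assuming tasks are nodes in a dependency graph; TaskNode, reaches, and canFuse are hypothetical names, not the pass's actual API:

```cpp
#include <deque>
#include <unordered_set>
#include <vector>

struct TaskNode {
  std::vector<TaskNode *> successors; // SSA + memory dependency edges
};

// True if 'to' is reachable from 'from' along dependency edges (BFS).
static bool reaches(TaskNode *from, TaskNode *to) {
  std::unordered_set<TaskNode *> seen{from};
  std::deque<TaskNode *> work{from};
  while (!work.empty()) {
    TaskNode *n = work.front();
    work.pop_front();
    if (n == to)
      return true;
    for (TaskNode *s : n->successors)
      if (seen.insert(s).second)
        work.push_back(s);
  }
  return false;
}

// Phase 1 may fuse two tasks only when neither depends on the other.
static bool canFuse(TaskNode *a, TaskNode *b) {
  return !reaches(a, b) && !reaches(b, a);
}
```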

Phase 2: Latency-Aware Pipeline Balance

Uses the pipelined latency model:

latency(task) = II × (⌈trip_count / cgra_count⌉ − 1) + steps

Iteratively finds the critical-path bottleneck (minimum slack node with highest individual latency) and allocates one additional CGRA to it, repeating until the 16-CGRA budget is exhausted or no improvement is possible.

The outer loop (max 10 iterations) alternates fusion and balance until convergence (no change in either phase).
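
The following is a minimal sketch of the latency model and the greedy balance loop, simplified to pick the highest-latency task rather than the pass's slack-based selection; Task and its field names are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Task {
  int64_t ii = 1;        // initiation interval from profiling
  int64_t steps = 1;     // pipeline depth (schedule length)
  int64_t tripCount = 1; // total loop iterations
  int64_t cgraCount = 1; // CGRAs currently allocated
};

// latency(task) = II * (ceil(trip_count / cgra_count) - 1) + steps
// e.g. II=2, trip_count=100, cgra_count=4, steps=10 -> 2*(25-1)+10 = 58.
static int64_t latency(const Task &t) {
  int64_t waves = (t.tripCount + t.cgraCount - 1) / t.cgraCount;
  return t.ii * (waves - 1) + t.steps;
}

// Greedily grant one extra CGRA to the slowest task until the 16-CGRA
// budget is spent or the extra CGRA no longer reduces its latency.
static void balance(std::vector<Task> &tasks, int64_t budget = 16) {
  if (tasks.empty())
    return;
  int64_t used = 0;
  for (const Task &t : tasks)
    used += t.cgraCount;
  while (used < budget) {
    auto slowest = std::max_element(
        tasks.begin(), tasks.end(),
        [](const Task &a, const Task &b) { return latency(a) < latency(b); });
    Task trial = *slowest;
    ++trial.cgraCount;
    if (latency(trial) >= latency(*slowest))
      break; // no improvement possible
    *slowest = trial;
    ++used;
  }
}
```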


Speculative Profiling for compiled_ii and steps

To obtain accurate II and steps without waiting for full compilation, the pass runs a two-phase profiling pipeline (sketched after the list):

  1. Phase 1 (Taskflow → Neura): Clone the parent func::FuncOp, strip all tasks except the target, run ConstructHyperblockFromTask → ClassifyCounters → ConvertTaskflowToNeura on the clone to produce neura.kernel ops.
  2. Phase 2 (Neura pipeline): Clone each kernel body into a standalone func::FuncOp tagged accelerator="neura", then run the full Neura lowering pipeline (LowerAffinePass → ConvertSCFToCFPass → AssignAccelerator → LowerMemRefToNeura → LowerArithToNeura → ... → InsertDataMovPass).
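
Below is a hedged sketch of how this two-phase profiling could drive an mlir::PassManager on the cloned module; the Taskflow/Neura pass-factory names are assumptions inferred from the pass names above, not verified APIs:

```cpp
#include <memory>

#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Pass/PassManager.h"

// Assumed Taskflow/Neura factories, inferred from the pass names above.
std::unique_ptr<mlir::Pass> createConstructHyperblockFromTaskPass();
std::unique_ptr<mlir::Pass> createClassifyCountersPass();
std::unique_ptr<mlir::Pass> createConvertTaskflowToNeuraPass();

static mlir::LogicalResult profileClone(mlir::ModuleOp scratchModule) {
  mlir::PassManager pm(scratchModule->getContext());
  // Phase 1: Taskflow -> Neura on the stripped clone.
  pm.addPass(createConstructHyperblockFromTaskPass());
  pm.addPass(createClassifyCountersPass());
  pm.addPass(createConvertTaskflowToNeuraPass());
  // Phase 2: standard lowerings on the extracted kernel function; the
  // remaining Neura passes (AssignAccelerator .. InsertDataMovPass) are
  // elided here.
  pm.addPass(mlir::createLowerAffinePass());
  pm.addPass(mlir::createConvertSCFToCFPass());
  return pm.run(scratchModule);
}
```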

compiled_ii Extraction: Trade-offs

| Source | When used | Accuracy |
| --- | --- | --- |
| MapToAcceleratorPass mapping_info.compiled_ii | All ops are DataMov-wrapped AND total ops ≤ 150 | Highest (real modulo-scheduler result) |
| max(ResMII, RecMII) | Mapper skipped (size guard or DataMov guard fails) | Lower bound, conservative |
| Default ii=1, steps=1 | Phase 1 or 2 pipeline fails entirely | Pessimistic fallback |

Guard conditions for the mapper (decision logic sketched after this list):

  • DataMov completeness: All non-reserve operand producers must be neura.data_mov. If InsertDataMovPass didn't fully wrap every operand (which happens for kernels with complex control flow), the mapper asserts.
  • Op count limit (kMapperOpLimit = 150): Prevents exponential backtracking in the modulo scheduler during speculative profiling of large kernels.
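
Putting the table and the guards together, here is a sketch of the fallback chain; kMapperOpLimit matches the guard above, while ProfileResult, selectII, and the parameters are illustrative names:

```cpp
#include <algorithm>
#include <cstdint>

constexpr unsigned kMapperOpLimit = 150;

struct ProfileResult {
  int64_t ii = 1;    // pessimistic defaults used when profiling fails
  int64_t steps = 1;
};

static ProfileResult selectII(bool pipelineOk, bool allDataMovWrapped,
                              unsigned opCount, int64_t compiledII,
                              int64_t resMII, int64_t recMII,
                              int64_t scheduleSteps) {
  if (!pipelineOk)
    return {}; // Phase 1 or 2 failed: fall back to ii=1, steps=1
  ProfileResult r;
  r.steps = scheduleSteps;
  if (allDataMovWrapped && opCount <= kMapperOpLimit)
    r.ii = compiledII;               // real modulo-scheduler result
  else
    r.ii = std::max(resMII, recMII); // conservative lower bound
  return r;
}
```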

Split-Profile for Fused Tasks

After fusion, the fused task body contains N sequential loop nests. ConvertTaskflowToNeuraPass asserts hyperblock_count == 1, so we cannot profile the fused task directly. Instead we do the following (aggregation sketched after the list):

  1. Create a temporary single-loop wrapper task for each top-level loop nest.
  2. Profile each independently.
  3. Assign max(ii) and sum(steps) to the fused task.
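
A small sketch of the aggregation step, assuming per-nest profiles are already available; LoopProfile and aggregateFused are illustrative names:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct LoopProfile {
  int64_t ii;
  int64_t steps;
};

// The fused task inherits the worst II (the slowest nest bounds steady-state
// throughput) and the summed steps (the nests execute sequentially).
static LoopProfile aggregateFused(const std::vector<LoopProfile> &nests) {
  LoopProfile fused{1, 0};
  for (const LoopProfile &p : nests) {
    fused.ii = std::max(fused.ii, p.ii);
    fused.steps += p.steps;
  }
  return fused;
}
```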

Test Coverage

| Test | Tasks | Fusions | Result |
| --- | --- | --- | --- |
| irregular-loop | 3 (incl. reduction) | 0 (value output guard) | 1+7+8 = 16 CGRAs |
| parallel-nested | 2 → 1 (fused) | 1 | cgra_count=10, total=16 |
| multi-nested | 4 → 3 (one fusion) | 1 | Task_1: 6, fused: 9, Task_4: 1 |
| resnet | 13 → 6 (7 fusions) | 7 | Task_3: 6, Task_9: 6, others: 1 |

Known Limitations

  1. Perfectly-nested assumption for trip_count: For non-perfectly-nested loops inside a task body, computeTripCount simply multiplies the inner-loop trip counts of each top-level loop structure. This is accurate for the current workloads (convolutions, matmuls); see the sketch after this list.
  2. kMapperOpLimit = 150: Large kernels skip MapToAcceleratorPass and fall back to ResMII/RecMII bounds. This is a deliberate performance vs. accuracy trade-off for speculative profiling.
  3. Fusion limited to write-output tasks: Tasks with value outputs (reductions) are excluded from utilization fusion. Full support would require tracking value-output flow across the fused task boundary.
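
The following sketches the trip-count computation described in limitation 1 (and refined in the commit notes below), assuming constant loop bounds; it uses mlir::affine::getConstantTripCount, whose exact namespace varies across MLIR versions:

```cpp
#include <cstdint>

#include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"

// For each top-level affine.for in the task body, multiply the trip counts
// of its whole nest, then sum across top-level loops:
// 'for i=0..10 { for j=0..5 }' -> 50, but 'for i=0..10; for j=0..5' -> 15.
static int64_t computeTripCount(mlir::Block &taskBody) {
  int64_t total = 0;
  for (mlir::Operation &op : taskBody) {
    auto topLoop = llvm::dyn_cast<mlir::affine::AffineForOp>(&op);
    if (!topLoop)
      continue; // only top-level loops start a new summand
    int64_t product = 1;
    // walk() visits the loop itself and every loop nested inside it.
    topLoop->walk([&](mlir::affine::AffineForOp loop) {
      if (auto tc = mlir::affine::getConstantTripCount(loop))
        product *= static_cast<int64_t>(*tc);
    });
    total += product; // sequential nests add, nested loops multiply
  }
  return total;
}
```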

…ce and fusion

- Add two-phase optimization: Utilization Fusion + Latency-Aware Pipeline Balance
- Implement pipelined latency model: latency = II * (ceil(trip_count/cgra_count) - 1) + steps
- Add fallback profiling using operation counting for robust performance estimation
- Critical path detection using slack analysis for bottleneck identification
- Task fusion for independent tasks to free up CGRA budget
- Support 4x4 CGRA grid (16 total) with complete allocation
- All 4 taskflow lit tests passing (multi-nested, parallel-nested, irregular-loop, resnet)
- Environment-agnostic: no Neura-specific analysis APIs, only standard MLIR operations
…erage

Bug fixes:
- Fix RecMII computation: use cycle.length (excl. reserve/ctrl_mov) instead
  of cycle.operations.size(), consistent with MapToAcceleratorPass
- Fix PipelineBalancer: the outer for-loop was dead code because of a 'return'
  inside the first iteration; refactored to recompute the critical path after
  each CGRA increment
- Fix placeholder generation in profileTask: replace type-specific AllocOp /
  ConstantIntOp with UnrealizedConversionCastOp which handles all types
  including dynamic-shape MemRefs without requiring dynamic-size operands
- Fix fusion guard: skip tasks with value outputs (reduction/iter_args loops)
  to prevent assertion failure in replaceTaskResults

New features:
- Add WAW (write-after-write) memory dependency edges to prevent incorrect
  fusion of tasks that write the same memref in program order (see the
  sketch after this list)
- Improve computeTripCount: walk only top-level affine.for ops and sum their
  nested products, correctly handling sequential loops at the same IR level
  (e.g. 'for i=0..10; for j=0..5' yields 15, not 50)
- Persist trip_count attribute at convergence alongside cgra_count/ii/steps
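
A sketch of the WAW-edge construction, assuming tasks are visited in program order and each task exposes the memrefs it writes; TaskNode and its members are hypothetical stand-ins for the pass's actual graph types:

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Value.h"

// Hypothetical task-graph node; the real pass's node type differs.
struct TaskNode {
  llvm::SmallVector<mlir::Value, 4> writes; // memrefs this task stores to
  llvm::SmallVector<TaskNode *, 4> succs;   // outgoing dependency edges
  void addSuccessor(TaskNode *t) { succs.push_back(t); }
};

// For each memref, the previous writer must finish before the next one:
// chain tasks through a WAW edge in program order.
static void addWAWEdges(llvm::ArrayRef<TaskNode *> tasksInProgramOrder) {
  llvm::DenseMap<mlir::Value, TaskNode *> lastWriter;
  for (TaskNode *task : tasksInProgramOrder) {
    for (mlir::Value memref : task->writes) {
      if (TaskNode *prev = lastWriter.lookup(memref))
        prev->addSuccessor(task); // WAW edge blocks incorrect fusion
      lastWriter[memref] = task;
    }
  }
}
```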

Cleanups:
- Remove unused #include <cmath>
- Add RESOPT lit checks for irregular-loop test (previously uncovered)

Tests: 4/4 PASS (irregular-loop, parallel-nested, multi-nested, resnet)
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a new MLIR optimization pass that fuses independent Taskflow tasks and balances CGRA allocation using a pipelined latency model, plus updates several multi-CGRA tests to exercise the new behavior.

Changes:

  • Introduces ResourceAwareTaskOptimizationPass implementing utilization fusion + latency-aware CGRA rebalancing with speculative profiling.
  • Wires the new pass into build/registration (CMake + Passes.td/h).
  • Extends Taskflow MLIR tests with --resource-aware-task-optimization RUN lines and RESOPT FileCheck assertions.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp | Implements the new two-phase optimization pass and speculative profiling pipeline |
| lib/TaskflowDialect/Transforms/Optimizations/CMakeLists.txt | Builds/links the new pass into the optimization library |
| include/TaskflowDialect/TaskflowPasses.td | Registers the new pass and its summary/description |
| include/TaskflowDialect/TaskflowPasses.h | Exposes the factory method for the new pass |
| test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir | Adds RUN + FileCheck coverage for RESOPT expectations |
| test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir | Adds RUN + RESOPT checks (but currently duplicated) |
| test/multi-cgra/taskflow/multi-nested/multi-nested.mlir | Adds RUN + RESOPT checks |
| test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir | Adds RUN + RESOPT checks |
| test/benchmark/Zeonica_Testbench | Updates submodule pointer |
| debug.log | Adds a debug artifact containing a crash backtrace/logs |


…oss iterations

- Remove duplicate RESOPT RUN+FileCheck block in parallel-nested.mlir
  that was a copy-paste error (identical input/output/check-prefix).

- Persist ii, steps, and trip_count to IR during intermediate iterations
  (alongside cgra_count) so that graph.build() on subsequent iterations
  can skip expensive speculative profiling for unchanged tasks via the
  existing has_precomputed guard.
@tancheng (Contributor) commented:

Shouldn't we fix #260 first to align the task/func/kernel?

@ShangkunLi (Collaborator) commented:

> Shouldn't we fix #260 first to align the task/func/kernel?

I think they are orthogonal. This PR does optimizations on the task dependency graph, regardless of how we construct that graph.

For now, we build the task dependency graph from the affine loops within one func. We can later extend it to create a task dependency graph from multiple funcs.

