Conversation
…alancing and fusion
…ce and fusion
- Add two-phase optimization: Utilization Fusion + Latency-Aware Pipeline Balance
- Implement pipelined latency model: latency = II * (ceil(trip_count/cgra_count) - 1) + steps
- Add fallback profiling using operation counting for robust performance estimation
- Critical path detection using slack analysis for bottleneck identification
- Task fusion for independent tasks to free up CGRA budget
- Support 4x4 CGRA grid (16 total) with complete allocation
- All 4 taskflow lit tests passing (multi-nested, parallel-nested, irregular-loop, resnet)
- Environment-agnostic: no Neura-specific analysis APIs, only standard MLIR operations
…erage
Bug fixes:
- Fix RecMII computation: use cycle.length (excl. reserve/ctrl_mov) instead of cycle.operations.size(), consistent with MapToAcceleratorPass
- Fix PipelineBalancer: the outer for-loop was dead code due to 'return' inside the first iteration; refactor to recompute critical path each CGRA increment
- Fix placeholder generation in profileTask: replace type-specific AllocOp / ConstantIntOp with UnrealizedConversionCastOp which handles all types including dynamic-shape MemRefs without requiring dynamic-size operands
- Fix fusion guard: skip tasks with value outputs (reduction/iter_args loops) to prevent assertion failure in replaceTaskResults
New features:
- Add WAW (write-after-write) memory dependency edges to prevent incorrect fusion of tasks that write the same memref in program order
- Improve computeTripCount: walk only top-level affine.for ops and sum their nested products, correctly handling sequential loops at the same IR level (e.g. 'for i=0..10; for j=0..5' yields 15 not 50)
- Persist trip_count attribute at convergence alongside cgra_count/ii/steps
Cleanups:
- Remove unused #include <cmath>
- Add RESOPT lit checks for irregular-loop test (previously uncovered)
Tests: 4/4 PASS (irregular-loop, parallel-nested, multi-nested, resnet)
Pull request overview
This PR adds a new MLIR optimization pass that fuses independent Taskflow tasks and balances CGRA allocation using a pipelined latency model, plus updates several multi-CGRA tests to exercise the new behavior.
Changes:
- Introduces ResourceAwareTaskOptimizationPass implementing utilization fusion + latency-aware CGRA rebalancing with speculative profiling.
- Wires the new pass into build/registration (CMake + Passes.td/h).
- Extends Taskflow MLIR tests with --resource-aware-task-optimization RUN lines and RESOPT FileCheck assertions.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp | Implements the new two-phase optimization pass and speculative profiling pipeline |
| lib/TaskflowDialect/Transforms/Optimizations/CMakeLists.txt | Builds/links the new pass into the optimization library |
| include/TaskflowDialect/TaskflowPasses.td | Registers the new pass and its summary/description |
| include/TaskflowDialect/TaskflowPasses.h | Exposes the factory method for the new pass |
| test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir | Adds RUN + FileCheck coverage for RESOPT expectations |
| test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir | Adds RUN + RESOPT checks (but currently duplicated) |
| test/multi-cgra/taskflow/multi-nested/multi-nested.mlir | Adds RUN + RESOPT checks |
| test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir | Adds RUN + RESOPT checks |
| test/benchmark/Zeonica_Testbench | Updates submodule pointer |
| debug.log | Adds a debug artifact containing a crash backtrace/logs |
…oss iterations
- Remove duplicate RESOPT RUN+FileCheck block in parallel-nested.mlir that was a copy-paste error (identical input/output/check-prefix).
- Persist ii, steps, and trip_count to IR during intermediate iterations (alongside cgra_count) so that graph.build() on subsequent iterations can skip expensive speculative profiling for unchanged tasks via the existing has_precomputed guard.
Shouldn't we fix #260 first to align the task/func/kernel?
I think they are orthogonal. This PR is trying to do some optimizations on the task dependency graph, regardless of how we construct this graph. For now, we build the task dependency graph based on the …
Overview
This PR introduces ResourceAwareTaskOptimizationPass, a two-phase MLIR pass that optimizes CGRA resource allocation for the Neura taskflow dialect on a 4×4 CGRA grid (16 CGRAs total).
Phase 1: Utilization Fusion
Merges independent tasks (no SSA or memory dependency edges in either direction) into a single fused task, sequentially concatenating their loop bodies. This frees up CGRA budget that Phase 2 can reallocate to critical-path bottlenecks.
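For illustration, here is a minimal standalone C++ sketch of the independence test that Phase 1 relies on. TaskGraph and areIndependent are hypothetical names rather than the pass's actual API; the real edges come from SSA uses plus RAW/WAW memref accesses between taskflow tasks.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical task-dependency-graph sketch: an edge u -> v means task v
// consumes an SSA value produced by u, or touches the same memref (RAW or
// WAW), so v must run after u.
struct TaskGraph {
  std::vector<std::unordered_set<size_t>> succs; // successors per task

  // Transitive reachability: can we get from 'from' to 'to'?
  bool reaches(size_t from, size_t to) const {
    std::vector<size_t> stack{from};
    std::unordered_set<size_t> seen{from};
    while (!stack.empty()) {
      size_t cur = stack.back();
      stack.pop_back();
      if (cur == to) return true;
      for (size_t next : succs[cur])
        if (seen.insert(next).second) stack.push_back(next);
    }
    return false;
  }

  // Two tasks are fusable only if neither depends on the other,
  // directly or transitively.
  bool areIndependent(size_t a, size_t b) const {
    return !reaches(a, b) && !reaches(b, a);
  }
};
```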
Phase 2: Latency-Aware Pipeline Balance
Uses the pipelined latency model: latency = II * (ceil(trip_count / cgra_count) - 1) + steps.
Iteratively finds the critical-path bottleneck (minimum slack node with highest individual latency) and allocates one additional CGRA to it, repeating until the 16-CGRA budget is exhausted or no improvement is possible.
The outer loop (max 10 iterations) alternates fusion and balance until convergence (no change in either phase).
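As a rough sketch (not the pass's actual code), the latency model and one greedy rebalance step could look like the following. TaskProfile, pipelinedLatency, and rebalanceOnce are illustrative names, and the real pass additionally uses slack analysis over the dependency graph to pick the bottleneck.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-task profile; ii and steps come from speculative profiling.
struct TaskProfile {
  int64_t trip_count;
  int64_t ii;
  int64_t steps;
  int64_t cgra_count; // CGRAs currently allocated to this task
};

// Pipelined latency model from this PR:
//   latency = II * (ceil(trip_count / cgra_count) - 1) + steps
int64_t pipelinedLatency(const TaskProfile &t) {
  int64_t iters_per_cgra = (t.trip_count + t.cgra_count - 1) / t.cgra_count;
  return t.ii * (iters_per_cgra - 1) + t.steps;
}

// One greedy rebalance step: give one extra CGRA to the slowest task,
// provided budget remains and the extra CGRA actually reduces its latency.
// Returns true if an allocation was made.
bool rebalanceOnce(std::vector<TaskProfile> &tasks, int64_t &budget) {
  if (budget <= 0 || tasks.empty()) return false;
  auto worst = std::max_element(
      tasks.begin(), tasks.end(),
      [](const TaskProfile &a, const TaskProfile &b) {
        return pipelinedLatency(a) < pipelinedLatency(b);
      });
  TaskProfile trial = *worst;
  trial.cgra_count += 1;
  if (pipelinedLatency(trial) >= pipelinedLatency(*worst)) return false;
  *worst = trial;
  budget -= 1;
  return true;
}
```

For example, a task with trip_count = 64, ii = 2, steps = 10 costs 2 * (64 - 1) + 10 = 136 cycles on one CGRA but 2 * (16 - 1) + 10 = 40 cycles on four.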
Speculative Profiling for compiled_ii and steps
To obtain accurate II and steps without waiting for full compilation:
1. Clone the func::FuncOp, strip all tasks except the target, and run ConstructHyperblockFromTask → ClassifyCounters → ConvertTaskflowToNeura on the clone to produce neura.kernel ops.
2. Tag the resulting func::FuncOp with accelerator="neura", then run the full Neura lowering pipeline (LowerAffinePass → ConvertSCFToCFPass → AssignAccelerator → LowerMemRefToNeura → LowerArithToNeura → ... → InsertDataMovPass).
compiled_ii Extraction — Trade-offs
- MapToAcceleratorPass → mapping_info.compiled_ii, used when the kernel is DataMov-wrapped AND total ops ≤ 150.
- max(ResMII, RecMII) when the mapper is skipped.
- ii=1, steps=1 as the last-resort default.
Guard conditions for the mapper:
- All kernel operands must be wrapped in neura.data_mov. If InsertDataMovPass didn't fully wrap all operands (happens for kernels with complex control flow), the mapper asserts.
- Op-count limit (kMapperOpLimit = 150): prevents exponential backtracking in the modulo scheduler during speculative profiling of large kernels.
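A rough sketch of how the clone-and-lower flow above could be driven with a standard mlir::PassManager. Only createLowerAffinePass and createConvertSCFToCFPass are upstream MLIR APIs; the neura::create*Pass factory names are placeholders for this repository's registered passes, and the task-stripping and attribute-extraction steps are elided.

```cpp
// Sketch only: the neura::create*Pass factories are placeholder names for
// this repo's passes; the actual wiring lives in
// ResourceAwareTaskOptimizationPass.cpp.
#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/OwningOpRef.h"
#include "mlir/Pass/PassManager.h"

static mlir::LogicalResult profileTaskSpeculatively(mlir::ModuleOp module) {
  // 1. Work on a clone so the real IR is never touched; stripping every
  //    task except the one being profiled is not shown here.
  mlir::OwningOpRef<mlir::ModuleOp> clone(
      llvm::cast<mlir::ModuleOp>(module->clone()));

  mlir::PassManager pm(module.getContext());
  // Taskflow -> neura.kernel on the clone.
  pm.addPass(neura::createConstructHyperblockFromTaskPass()); // placeholder
  pm.addPass(neura::createClassifyCountersPass());            // placeholder
  pm.addPass(neura::createConvertTaskflowToNeuraPass());      // placeholder
  // Full Neura lowering pipeline on the accelerator="neura" function.
  pm.addNestedPass<mlir::func::FuncOp>(mlir::createLowerAffinePass());
  pm.addPass(mlir::createConvertSCFToCFPass());
  pm.addPass(neura::createAssignAcceleratorPass());           // placeholder
  pm.addPass(neura::createLowerMemRefToNeuraPass());          // placeholder
  pm.addPass(neura::createLowerArithToNeuraPass());           // placeholder
  pm.addPass(neura::createInsertDataMovPass());               // placeholder
  if (mlir::failed(pm.run(*clone)))
    return mlir::failure();

  // 2. If the guard conditions hold, run MapToAcceleratorPass on the clone
  //    and read mapping_info.compiled_ii; otherwise fall back to
  //    max(ResMII, RecMII). (Attribute extraction not shown.)
  return mlir::success();
}
```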
Split-Profile for Fused Tasks
After fusion, the fused task body contains N sequential loop nests.
ConvertTaskflowToNeuraPass asserts hyperblock_count == 1, so we cannot profile the fused task directly. Instead, we profile each constituent loop nest separately and assign max(ii) and sum(steps) to the fused task.
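A small sketch of that merge rule, assuming per-nest profiles have already been collected (NestProfile and combineFusedProfile are illustrative names, not the pass's API):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-loop-nest result from profiling each nest in isolation.
struct NestProfile {
  int64_t ii;
  int64_t steps;
};

// The fused task runs its nests back to back, so its steps accumulate,
// while its II is bounded by the slowest nest.
NestProfile combineFusedProfile(const std::vector<NestProfile> &nests) {
  NestProfile fused{1, 0};
  for (const NestProfile &n : nests) {
    fused.ii = std::max(fused.ii, n.ii);
    fused.steps += n.steps;
  }
  return fused;
}
```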
Test Coverage
All four taskflow lit tests pass: irregular-loop, parallel-nested, multi-nested, resnet.
Known Limitations
- computeTripCount multiplies inner-loop counts of each top-level loop structure (sketched below). This is accurate for the current workloads (convolutions, matmuls).
- kMapperOpLimit = 150: large kernels skip MapToAcceleratorPass and fall back to ResMII/RecMII bounds. This is a deliberate performance vs. accuracy trade-off for speculative profiling.
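To make the first limitation concrete, here is a rough sketch of the trip-count estimate described above; estimateTripCount is an illustrative name, and getConstantTripCount's namespace and return type vary slightly across MLIR versions.

```cpp
#include <cstdint>
#include <optional>

#include "mlir/Dialect/Affine/Analysis/LoopAnalysis.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/IR/Region.h"

// Sketch, not the pass's actual implementation: for each *top-level*
// affine.for in the task body, multiply the constant trip counts of the
// loops nested under it, then sum across the top-level loops. Sequential
// loops at the same level therefore add (10 + 5 = 15) instead of
// multiplying (50).
static int64_t estimateTripCount(mlir::Region &taskBody) {
  int64_t total = 0;
  for (mlir::affine::AffineForOp top :
       taskBody.getOps<mlir::affine::AffineForOp>()) {
    int64_t product = 1;
    top->walk([&](mlir::affine::AffineForOp loop) {
      // Non-constant trip counts are conservatively treated as 1 here.
      if (std::optional<uint64_t> tc =
              mlir::affine::getConstantTripCount(loop))
        product *= static_cast<int64_t>(*tc);
    });
    total += product;
  }
  return total;
}
```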