Conversation
| @@ -0,0 +1,234 @@ | |||
| # Copyright 2025 The IREE Authors | |||
9ebc371 to
f037848
Compare
|
I've gone back and forth on supporting mapping_dynamic_vals in flattening. It should work in theory, but it's been tricky to get all of the tests passing, so I've set it to skip in that case at this point. So if we were to land it in the current state, we would need follow-ups to (1) get it working for mapping_dynamic_vals and (2) get it working with Water. |
fa75e88 to
b0da273
Compare
Hardcode84
left a comment
There was a problem hiding this comment.
Mostly LGTM, just a few comments
| """ | ||
| ip = InsertionPoint.current | ||
| owner = ip.block.owner | ||
| is_in_loop = not isinstance(owner, func_d.FuncOp) and owner.name == "scf.for" |
There was a problem hiding this comment.
Ideally, this should be done by LICM pass in our waveasm pipeline. We probably can add a TODO
| zero_indices = [arith_d.constant(IndexType.get(), 0)] * len(sym_strides) | ||
| lin_src, _ = _linearize_memref( | ||
| kb_src, zero_indices, zero_indices, strides_vals | ||
| ) | ||
| if buffer_ops_enabled: | ||
| valid_bytes = _compute_valid_bytes( | ||
| lin_src, | ||
| element_type, | ||
| input_shape, | ||
| emitter, | ||
| use_real_bounds_override=True, | ||
| ) | ||
| lin_src = _cast_buffer_and_encode_stride( | ||
| lin_src, | ||
| strides_vals, | ||
| element_type, | ||
| valid_bytes, | ||
| ) |
There was a problem hiding this comment.
We can it into lambda to avoid copypaste
|
|
||
| def _get_iv_symbols(expr: sympy.Expr) -> list[sympy.Symbol]: | ||
| """Return all induction-variable symbols in *expr*.""" | ||
| return [s for s in expr.free_symbols if str(s).startswith(_INDUCTION_PREFIX)] |
There was a problem hiding this comment.
This string matching is unfortunate, can you at least move this function to symbol_utils.py and use _INDUCTION_SYMBOL_PREFIX.
|
also, we need unittests for the new symbol/numeric probing functions |
Water Code Coverage |
ede0a8f to
6d6c076
Compare
The main goal of this is to be able to have memory addresses for reads in a loop be simplified to `start + IV * stride`, with each of those values either being constant or at least able to be hoisted outside of the loop body. Adds a pre-codegen pipeline that flattens N-dimensional read addresses into 1-dimensional LINEAR_INDEX accesses: - flatten_read_indices: rewrites mapped reads to use a single flat offset - annotate_iv_strides: extracts constant IV strides for loop-carried reads - Codegen LINEAR_INDEX paths for both vector loads and GatherToLDS - Removes the old _try_iv_split_offset codegen approach in favor of the new pipeline-based linearization - Helper functions (mem_simplify, linearize_dims, _infer_floor_to_exact) in mapping_utils for symbolic floor/Mod cancellation - Adjust bounds for linearized reads This adds some new lit tests that show that with our mxfp4 shuffle layout we can generate linearized reads with constant stride. This changes a ton of other lit tests. Disclaimer: I was asked to get this PR up ASAP, and this is a ton of churn in the lit tests, and I have not yet validated that they are all correct. Signed-off-by: William G Hatch <william@hatch.uno>
Three fixes to the flatten_read_indices pass: 1. Skip reads with MemoryAccessFlags (VOLATILE, NONTEMPORAL). The LINEAR_INDEX codegen fallback path uses vector.maskedload which drops volatile semantics. This caused incorrect streamk partial buffer synchronization (stale spinlock reads). 2. Use physical_layout.shape for stride computation when a MemoryLayout is present, matching the strides the emitter creates for the memref via reinterpret_cast. 3. Use physical (post-mapping) start expressions as bound keys in _convert_bounds, falling back to the original index when the physical start contains $dynamic_val symbols that are only resolvable through the mapping at codegen time. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
Replace the water-specific emit_water_dialect flag with a general linearize_reads option (default True). Set it to False in: - water_e2e_test: water-opt does not yet understand $LINEAR_INDEX - waveasm 256x224x256 dynamic+bufops: linearized reads push VGPR count past the 256-register limit; disabling linearization lets the test pass instead of needing an xfail Signed-off-by: William G Hatch <william@hatch.uno>
These tests check MLIR output via FileCheck and their CHECK patterns have not been updated for the linearized-read output. The water e2e tests already disable linearize_reads; align the lit tests to match. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
256x224x256 with dynamic dims overflows VGPRs regardless of buffer ops when linearized reads are enabled (unscheduled). 256x160x256 with dynamic dims produces numerical mismatches when linearized, likely due to the preshuffle mapping's floor/Mod expressions over dynamic K not simplifying correctly after flatten_read_indices. Disable linearize_reads for both configurations when unscheduled + dynamic dims. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
- Update CHECK line numbers in mlir_converter_debug_locations.py (shifted +1 by linearize_reads=False insertion) - Add annotate_iv_strides to expected_failures in mlir_roundtrip_pipeline.py - Disable flatten_read_indices when dynamic_strides or water backend is active (incompatible with pre-computed symbolic strides) - Broaden waveasm e2e MXFP4 skip to all dynamic-dim configs (floor/Mod expressions over dynamic K cause numerical mismatches) - Update expected VGPR/SGPR counts for testScaledBatchedGemmMXFP4Codegen (172/62 with linearize_reads vs previous 170/61) Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
A previous commit disabled linearization for dynamic strides cases. Signed-off-by: William G Hatch <william@hatch.uno>
Linearized reads set the SRD base to the buffer start (offset=0) and use the full flat offset as the load index. For masked-off threads at workgroup boundaries, the computed flat offset can exceed the actual tensor size. With validBytes set to the hardware maximum (~2GB), the SRD bounds check does not catch these, causing GPU memory access faults on large tensors (e.g. GEMM shape 4096x20480x2560). Fix by passing use_real_bounds_override=True to _compute_valid_bytes for all linearized read paths (handle_read LINEAR_INDEX and handle_gather_to_lds LINEAR_INDEX), so the SRD's validBytes reflects the actual tensor size and OOB accesses from masked-off threads harmlessly return zero. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
…er reads Linearized reads with buffer ops used vector.maskedload on fat_raw_buffer memrefs to handle boundary elements. This does not lower correctly through the AMDGPU backend: the masked vector load can interact poorly with the SRD bounds check when validBytes is set to the exact tensor size, causing valid elements near the boundary to be zeroed. Switch to the same OOB-index-redirect strategy that the write path already uses: for each element, arith.select between the real offset and an out-of-bounds index based on the mask. When the mask is splatted (uniform across all elements), use a scalar select on the vector load offset for better codegen. This fixes 8 shape1 (111, 813) test failures in mi35x CI: test_copy, test_dynamic_copy, test_vector_add, test_bound_check, test_read_write_same, test_offset_read_one, test_offset_write, test_transpose_write. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
_convert_bounds was using the post-mapping (physical) start expression as the bounds-check key. When a mapping adds a runtime offset via set_symbol (e.g. OFFSET in prefill_attention, EXT_IDX in extend_attention), the physical start is in a different coordinate space than the bound value. For example, with mapping N_Q: j + OFFSET and bound N_Q (per-sequence length 64), the mask became (j + OFFSET) < 64, which is always false for OFFSET >= 64. Generalize the existing $dynamic_val fallback: instead of string- matching "dynamic_val", detect any extra free symbols in the physical start compared to the original (pre-mapping) index. Extra symbols indicate a coordinate-space shift, so fall back to the original index for bounds checking. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
There are still some dynamic cases used by these kernels that fail for linearization. But I want to land something, so let's iterate on those later along with the other follow-ups. Signed-off-by: William G Hatch <william@hatch.uno>
Signed-off-by: William G Hatch <william@hatch.uno>
Signed-off-by: William G Hatch <william@hatch.uno>
Signed-off-by: William G Hatch <william@hatch.uno>
Signed-off-by: William G Hatch <william@hatch.uno>
6d6c076 to
2f89f54
Compare
The pass only operates on LINEAR_INDEX reads, so it should not run when linearize_reads is disabled. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
The Water backend does not yet handle $LINEAR_INDEX symbols, so set linearize_reads=False for water e2e, roundtrip pipeline, and the new mxfp4 scaled MMA converter tests. Made-with: Cursor Signed-off-by: William G Hatch <william@hatch.uno>
2f89f54 to
6dbdfe7
Compare
|
@Hardcode84 I've addressed your comments, let me know if you have any other comments. |
The main goal of this is to be able to have memory addresses for reads in a loop be simplified to
start + IV * stride, with each of those values either being constant or at least able to be hoisted outside of the loop body.Adds a pre-codegen pipeline that flattens N-dimensional read addresses into 1-dimensional LINEAR_INDEX accesses:
This adds some new lit tests that show that with our mxfp4 shuffle layout we can generate linearized reads with constant stride.
This changes a ton of other lit tests. Disclaimer: I was asked to get this PR up ASAP, and this is a ton of churn in the lit tests, and I have not yet validated that they are all correct.