cuda::transform_iterator deliberately sets its classic iterator_category member to input_iterator_tag whenever the wrapped functor returns by value, even when the underlying iterator is random-access. This is spec-compliant for C++20 std::ranges::transform_view::iterator, but it silently breaks downstream Thrust algorithms that dispatch on std::iterator_traits<It>::iterator_category (e.g. thrust::scatter → thrust::copy → thrust::make_permutation_iterator), causing them to fall off their CUB bulk / vectorized fast paths.
In cuDF we observed a 2.5–3.3× GPU-side slowdown after migrating two call sites from thrust::make_transform_iterator to cuda::transform_iterator. A one-line revert restored parity with the thrust:: baseline: rapidsai/cudf#14162.
Observed impact (cuDF scatter regression, RTX 6000 / CUDA 13.1)
Benchmark: cudf::scatter with COPYING_NVBENCH -b scatter. Baseline origin/main uses thrust::make_transform_iterator. The regressing branch uses cuda::transform_iterator over this functor:
The transformed iterator flows into thrust::scatter(..., map_it, ...), which the generic implementation expands to thrust::copy(first, last, thrust::make_permutation_iterator(output, map_it)).
thrust::make_permutation_iterator takes the minimum traversal of its two constituents. With map_it downgraded to input, the permutation iterator becomes input, thrust::copy loses its CUB bulk fast path, and the scatter kernel falls onto a materializing fallback. This costs ~170 ns/row of extra device work and extra device allocations (causing new RMM-pool OOMs at medium sizes).
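The way a single downgraded constituent can flip the dispatch may be easier to see in a small model. The sketch below is not Thrust's implementation. min_category_t, copy_path, and the category aliases are hypothetical names, and the model assumes only the standard tag hierarchy, in which random_access_iterator_tag derives from input_iterator_tag.

```cpp
#include <iterator>
#include <string_view>
#include <type_traits>

// Hypothetical "minimum traversal" rule for a composite of two iterators.
// If A is a base of B (or the same tag), A is the weaker category.
// Assumes both tags belong to the input-iterator hierarchy.
template <class A, class B>
using min_category_t = std::conditional_t<std::is_base_of_v<A, B>, A, B>;

// Hypothetical stand-in for an algorithm that tag-dispatches on the classic
// category: random access gets the fast path, input gets the slow one.
constexpr std::string_view copy_path(std::random_access_iterator_tag) { return "bulk"; }
constexpr std::string_view copy_path(std::input_iterator_tag) { return "fallback"; }

// Categories of the two constituents of the modeled permutation iterator.
using output_cat         = std::random_access_iterator_tag;  // e.g. a raw pointer
using healthy_map_cat    = std::random_access_iterator_tag;  // map iterator before the migration
using downgraded_map_cat = std::input_iterator_tag;          // map iterator after the downgrade

// Both constituents random access: the composite stays on the bulk path.
static_assert(copy_path(min_category_t<output_cat, healthy_map_cat>{}) == "bulk");

// One constituent downgraded: the composite becomes input and the
// dispatch selects the fallback.
static_assert(copy_path(min_category_t<output_cat, downgraded_map_cat>{}) == "fallback");
```

In this model the output iterator never changes. The map iterator's category alone decides which overload is chosen, which matches the behavior described above.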
Is this a duplicate?
Type of Bug
Performance
Component
General CCCL
How to Reproduce
1. Build COPYING_NVBENCH.
2. Run COPYING_NVBENCH -b scatter --json broken.json on any NVIDIA GPU.
3. Run COPYING_NVBENCH -b scatter --json fixed.json on any NVIDIA GPU.
4. Run nvbench_compare.py broken.json fixed.json
Expected behavior
There should be no significant change in performance when switching from thrust::transform_iterator to cuda::transform_iterator.
Reproduction link
No response
Operating System
No response
nvidia-smi output
No response
NVCC version
No response