Skip to content

[BUG]: cuda::transform_iterator silently downgrades iterator_category → 2.5–3.3× slowdowns for thrust algorithms that dispatch on it #8640

@ttnghia

Description

@ttnghia

Is this a duplicate?

Type of Bug

Performance

Component

General CCCL

Describe the bug

Problem

cuda::transform_iterator deliberately sets its classic iterator_category member to input_iterator_tag whenever the wrapped functor returns by value, even when the underlying iterator is random-access. This is spec-compliant for C++20 std::ranges::transform_view::iterator, but it silently breaks downstream Thrust algorithms that dispatch on std::iterator_traits<It>::iterator_category (e.g. thrust::copythrust::scatterthrust::make_permutation_iterator), causing them to fall off their CUB bulk / vectorized fast paths.

In cuDF we observed a 2.5–3.3× GPU-side slowdown after migrating two call sites from thrust::make_transform_iterator to cuda::transform_iterator. One-line revert restored parity with thrust::: rapidsai/cudf#14162.

Observed impact (cuDF scatter regression, RTX 6000 / CUDA 13.1)

Benchmark: cudf::scatter with COPYING_NVBENCH -b scatter. Baseline origin/main uses thrust::make_transform_iterator. The regressing branch uses cuda::transform_iterator over this functor:

template <typename MapType>
struct index_converter {
  __device__ MapType operator()(MapType in) const { return ((in % n_rows) + n_rows) % n_rows; }
  size_type n_rows;
};

The transformed iterator flows into thrust::scatter(..., map_it, ...), which the generic implementation expands to
thrust::copy(first, last, thrust::make_permutation_iterator(output, map_it)).

thrust::make_permutation_iterator takes the minimum traversal of its two constituents; with map_it downgraded to input, the permutation iterator becomes input, thrust::copy loses its CUB bulk fast path, and the scatter kernel falls onto a materializing fallback — costing ~170 ns/row of extra device work and extra device allocations (causing new RMM-pool OOMs at medium sizes):

num_rows num_cols thrust (ref) cuda (cmp) GPU Δ %Diff
262 144 1 24.40 µs 70.66 µs +46.3 µs +189.6 %
2 097 152 1 159.34 µs 512.57 µs +353.2 µs +221.7 %
16 777 216 1 1.227 ms 4.062 ms +2.835 ms +231.1 %
262 144 8 171.12 µs 463.55 µs +292.4 µs +170.9 %
2 097 152 8 1.256 ms 3.414 ms +2.158 ms +171.9 %
16 777 216 8 10.18 ms 27.27 ms +17.09 ms +167.9 %

How to Reproduce

  1. Check out branch in Use stream pool for gather/scatter. rapidsai/cudf#14162, but commit 2076570046683348807e34655ef9bca7fda1c9a9.
  2. Build COPYING_NVBENCH.
  3. COPYING_NVBENCH -b scatter --json broken.json on any NVIDIA GPU.
  4. Checkout the latest commit on that branch and run .
  5. COPYING_NVBENCH -b scatter --json fixed.json on any NVIDIA GPU.
  6. nvbench_compare.py broken.json fixed.json

Expected behavior

There should be significant change in the performance when switching from thrust::transform_iterator vs cuda::transform_iterator.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

Metadata

Metadata

Assignees

Labels

needs triageIssues that require the team's attention

Type

No fields configured for Bug.

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions