[BUG]: `cuda::transform_iterator` silently downgrades `iterator_category` → 2.5–3.3× slowdowns for thrust algorithms that dispatch on it

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this bug and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Type of Bug

Performance

### Component

General CCCL

### Describe the bug

## Problem

`cuda::transform_iterator` deliberately sets its classic `iterator_category` member to `input_iterator_tag` whenever the wrapped functor returns by value, even when the underlying iterator is random-access. This is spec-compliant for C++20 `std::ranges::transform_view::iterator`, but it silently breaks downstream Thrust algorithms that dispatch on `std::iterator_traits<It>::iterator_category` (e.g. `thrust::copy` → `thrust::scatter` → `thrust::make_permutation_iterator`), causing them to fall off their CUB bulk / vectorized fast paths.

In cuDF we observed a **2.5–3.3× GPU-side slowdown** after migrating two call sites from `thrust::make_transform_iterator` to `cuda::transform_iterator`. One-line revert restored parity with `thrust::`: https://github.com/rapidsai/cudf/pull/14162.


## Observed impact (cuDF scatter regression, RTX 6000 / CUDA 13.1)

Benchmark: `cudf::scatter` with `COPYING_NVBENCH -b scatter`. Baseline `origin/main` uses `thrust::make_transform_iterator`. The regressing branch uses `cuda::transform_iterator` over this functor:

```cpp
template <typename MapType>
struct index_converter {
  __device__ MapType operator()(MapType in) const { return ((in % n_rows) + n_rows) % n_rows; }
  size_type n_rows;
};
```

The transformed iterator flows into `thrust::scatter(..., map_it, ...)`, which the generic implementation expands to
`thrust::copy(first, last, thrust::make_permutation_iterator(output, map_it))`.

`thrust::make_permutation_iterator` takes the minimum traversal of its two constituents; with `map_it` downgraded to input, the permutation iterator becomes input, `thrust::copy` loses its CUB bulk fast path, and the scatter kernel falls onto a materializing fallback — costing **~170 ns/row of extra device work** and extra device allocations (causing new RMM-pool OOMs at medium sizes):

| num_rows | num_cols | thrust (ref) | cuda (cmp) | GPU Δ | %Diff |
|---:|---:|---:|---:|---:|---:|
| 262 144 | 1 | 24.40 µs | 70.66 µs | +46.3 µs | **+189.6 %** |
| 2 097 152 | 1 | 159.34 µs | 512.57 µs | +353.2 µs | **+221.7 %** |
| 16 777 216 | 1 | 1.227 ms | 4.062 ms | +2.835 ms | **+231.1 %** |
| 262 144 | 8 | 171.12 µs | 463.55 µs | +292.4 µs | **+170.9 %** |
| 2 097 152 | 8 | 1.256 ms | 3.414 ms | +2.158 ms | **+171.9 %** |
| 16 777 216 | 8 | 10.18 ms | 27.27 ms | +17.09 ms | **+167.9 %** |

### How to Reproduce

1. Check out branch in https://github.com/rapidsai/cudf/pull/14162, but commit 2076570046683348807e34655ef9bca7fda1c9a9.
2. Build `COPYING_NVBENCH`.
3. `COPYING_NVBENCH -b scatter --json broken.json` on any NVIDIA GPU.
4. Checkout the latest commit on that branch and run .
5. `COPYING_NVBENCH -b scatter --json fixed.json` on any NVIDIA GPU.
6. `nvbench_compare.py broken.json fixed.json`

### Expected behavior

There should be significant change in the performance when switching from `thrust::transform_iterator` vs `cuda::transform_iterator`.

### Reproduction link

_No response_

### Operating System

_No response_

### nvidia-smi output

_No response_

### NVCC version

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: `cuda::transform_iterator` silently downgrades `iterator_category` → 2.5–3.3× slowdowns for thrust algorithms that dispatch on it #8640

Is this a duplicate?

Type of Bug

Component

Describe the bug

Problem

Observed impact (cuDF scatter regression, RTX 6000 / CUDA 13.1)

How to Reproduce

Expected behavior

Reproduction link

Operating System

nvidia-smi output

NVCC version

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

num_rows	num_cols	thrust (ref)	cuda (cmp)	GPU Δ	%Diff
262 144	1	24.40 µs	70.66 µs	+46.3 µs	+189.6 %
2 097 152	1	159.34 µs	512.57 µs	+353.2 µs	+221.7 %
16 777 216	1	1.227 ms	4.062 ms	+2.835 ms	+231.1 %
262 144	8	171.12 µs	463.55 µs	+292.4 µs	+170.9 %
2 097 152	8	1.256 ms	3.414 ms	+2.158 ms	+171.9 %
16 777 216	8	10.18 ms	27.27 ms	+17.09 ms	+167.9 %

[BUG]: cuda::transform_iterator silently downgrades iterator_category → 2.5–3.3× slowdowns for thrust algorithms that dispatch on it #8640

Description

Is this a duplicate?

Type of Bug

Component

Describe the bug

Problem

Observed impact (cuDF scatter regression, RTX 6000 / CUDA 13.1)

How to Reproduce

Expected behavior

Reproduction link

Operating System

nvidia-smi output

NVCC version

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[BUG]: `cuda::transform_iterator` silently downgrades `iterator_category` → 2.5–3.3× slowdowns for thrust algorithms that dispatch on it #8640