[cudax] Add support for generic thread groups within warp and cluster by davebayer · Pull Request #8792 · NVIDIA/cccl

davebayer · 2026-05-04T07:33:31Z

No description provided.

github-actions · 2026-05-04T08:10:48Z

🥳 CI Workflow Results

🟩 Finished in 32m 24s: Pass: 100%/54 | Total: 5h 14m | Max: 32m 17s | Hits: 97%/31859

miscco · 2026-05-04T09:42:58Z

+  using Level = typename Group::level_type;

-  if (!Unit{}.is_part_of(group))
+  if constexpr (cuda::std::is_same_v<Level, cuda::warp_level>)


Question: Does this also handle the cluster level correctly, or was the comment outdated?
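
To make the question concrete, here is a minimal host-side sketch of the dispatch pattern in the diff above. It assumes nothing about the PR beyond the `if constexpr` on `Group::level_type`; the tag types, the `group` template, `reduce_path`, and `select_reduce_path` are hypothetical stand-ins, and `std::is_same_v` replaces `cuda::std::is_same_v` so the snippet compiles without the CUDA toolkit. It only shows that any level other than the warp level, including a cluster level, falls through to the `else` branch; what that branch does in the PR is not visible in this excerpt.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the level tags referenced in the diff.
struct warp_level {};
struct cluster_level {};

// Hypothetical group type exposing the level_type alias the diff relies on.
template <class Level>
struct group
{
  using level_type = Level;
};

// Names the branch taken; these enumerators are illustrative only.
enum class reduce_path
{
  warp_specialization,
  generic_fallback
};

template <class Group>
constexpr reduce_path select_reduce_path(const Group&)
{
  using Level = typename Group::level_type;

  if constexpr (std::is_same_v<Level, warp_level>)
  {
    // Only groups whose level is exactly warp_level reach this branch.
    return reduce_path::warp_specialization;
  }
  else
  {
    // Every other level, cluster_level included, ends up here.
    return reduce_path::generic_fallback;
  }
}
```

Under these assumptions, a cluster-level group is handled correctly only if the generic branch itself is valid at cluster scope, which is what the question asks the author to confirm.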

miscco · 2026-05-04T09:44:08Z

+    // todo(dabayer): Implement fallback for cc < 80.
+    T result;
+    NV_IF_TARGET(NV_PROVIDES_SM_80,
+                 ({ result = __reduce_add_sync(group.__synchronizer_instance().__lane_mask(), result_unit.value()); }))
+    return (cuda::gpu_thread.is_root_rank(group)) ? cuda::std::optional{result} : cuda::std::nullopt;


Question: The comment suggests this code path is not valid for SM < 80. We should at least assert that, so we do not forget about it once this moves out of experimental.
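
For context on what a fallback could look like: `__reduce_add_sync` is only available from compute capability 8.0, and a common substitute on older architectures is a shuffle-based tree reduction built on `__shfl_down_sync`. The sketch below is not the PR's code. It is a host-side model in plain C++ of that tree reduction over a full 32-lane warp, so it runs without a GPU; `warp_size` and `simulate_warp_reduce_add` are hypothetical names, and partial lane masks are deliberately not modelled.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Number of lanes in the modelled warp.
constexpr std::size_t warp_size = 32;

// Hypothetical host-side model of a shuffle-down tree reduction.
// Each array element plays the role of one lane's register value.
inline std::uint32_t simulate_warp_reduce_add(std::array<std::uint32_t, warp_size> lanes)
{
  for (std::size_t offset = warp_size / 2; offset > 0; offset /= 2)
  {
    // All lanes read the values from before this step, as a real shuffle does.
    const auto previous = lanes;

    for (std::size_t lane = 0; lane < warp_size; ++lane)
    {
      const std::size_t source = lane + offset;
      // A lane whose source is out of range reads its own value,
      // mirroring the behaviour of __shfl_down_sync.
      lanes[lane] = previous[lane] + (source < warp_size ? previous[source] : previous[lane]);
    }
  }

  // After log2(warp_size) steps, lane 0 holds the sum of all lanes.
  return lanes[0];
}
```

Only lane 0 ends up with the full sum, which matches the diff's choice to return a value on the root rank and `nullopt` elsewhere.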

miscco · 2026-05-04T09:45:39Z

+    // todo(dabayer): Implement fallback for cc < 80.
+    T result;
+    NV_IF_TARGET(NV_PROVIDES_SM_80,
+                 ({ result = __reduce_add_sync(group.__synchronizer_instance().__lane_mask(), result_unit.value()); }))


Question: Can't we just use ThreadReduce here?

It should use the __reduce_add_sync optimization when applicable.

pciolkosz · 2026-05-05T02:21:14Z

-  {
-    group_sums[group_rank] = 0;
-  }
+    __shared__ T group_sums[ngroups];


Hmm, I was wondering whether, in the actual interface, shared memory should come from the user instead. For example, if only one group calls this, we waste shared memory.

This is correct, but I would say it is outside this epic's scope, so I just went with static shared memory allocations inside the algorithms.

github-project-automation Bot added this to CCCL May 4, 2026

davebayer requested a review from a team as a code owner May 4, 2026 07:33

github-project-automation Bot moved this to Todo in CCCL May 4, 2026

davebayer requested a review from andralex May 4, 2026 07:33

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 4, 2026

[cudax] Add support for generic thread groups within warp and cluster

37cc75e

davebayer force-pushed the groups_improve_coop_alg branch from b20f402 to 37cc75e Compare May 4, 2026 07:36

miscco reviewed May 4, 2026

pciolkosz reviewed May 5, 2026

davebayer wants to merge 1 commit into NVIDIA:main from davebayer:groups_improve_coop_alg

3 participants