Skip to content

[cudax] Add support for generic thread groups within warp and cluster#8792

Open
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:groups_improve_coop_alg
Open

[cudax] Add support for generic thread groups within warp and cluster#8792
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:groups_improve_coop_alg

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

No description provided.

@davebayer davebayer requested a review from a team as a code owner May 4, 2026 07:33
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 4, 2026
@davebayer davebayer requested a review from andralex May 4, 2026 07:33
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 4, 2026
@davebayer davebayer force-pushed the groups_improve_coop_alg branch from b20f402 to 37cc75e Compare May 4, 2026 07:36
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

🥳 CI Workflow Results

🟩 Finished in 32m 24s: Pass: 100%/54 | Total: 5h 14m | Max: 32m 17s | Hits: 97%/31859

See results here.

using Level = typename Group::level_type;

if (!Unit{}.is_part_of(group))
if constexpr (cuda::std::is_same_v<Level, cuda::warp_level>)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Does this also handle cluster level fine or was the comment outdated

Comment on lines +128 to +132
// todo(dabayer): Implement fallback for cc < 80.
T result;
NV_IF_TARGET(NV_PROVIDES_SM_80,
({ result = __reduce_add_sync(group.__synchronizer_instance().__lane_mask(), result_unit.value()); }))
return (cuda::gpu_thread.is_root_rank(group)) ? cuda::std::optional{result} : cuda::std::nullopt;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: That comment suggests the code path is not valid for SM < 80, we should at least assert that to ensure we do not forget it once this goes out of experimentall

// todo(dabayer): Implement fallback for cc < 80.
T result;
NV_IF_TARGET(NV_PROVIDES_SM_80,
({ result = __reduce_add_sync(group.__synchronizer_instance().__lane_mask(), result_unit.value()); }))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Cannot we use ThreadReduce here just fine?

It should use the __reduce_add_sync optimization when applicable

{
group_sums[group_rank] = 0;
}
__shared__ T group_sums[ngroups];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was thinking if in actual interface shared memory should come from the user instead, for example here if only one group calls this we waste shared memory

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but I would say that this is outside of this epic's scope, so I just went with statically shared memory allocations inside the algorithms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

3 participants