Fix potential deadlock during NCCL local communicator creation. by romerojosh · Pull Request #98 · NVIDIA/cuDecomp

romerojosh · 2026-01-21T21:02:42Z

For grid configurations that distribute intra-node GPUs unevenly across row and column communicators, the library can deadlock when creating a node-local NCCL communicator. The cause is that each rank decides whether to create an NCCL local communicator based only on the row and column communicators it belongs to, but creating that communicator involves all ranks on a node. If some but not all ranks on a node decide they need an NCCL local communicator, the creation deadlocks.

This only impacts cases where a user explicitly sets a processor decomposition that results in an uneven distribution of intra-node GPUs (e.g., a 4x3 process grid on a system with 4 GPUs per node). In cases where the grid is autotuned and ends up in this situation, the local NCCL communicator is created unconditionally before autotuning, so there is no deadlock.
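To make the failure mode concrete, here is a small pure-Python simulation (no MPI or NCCL involved) of the 4x3 grid on nodes with 4 GPUs each. It is a sketch under stated assumptions, not the library's logic: the row-major rank layout, the contiguous rank-to-node mapping, and the per-rank rule ("create the local communicator if my row or column communicator sits entirely on one node") are all hypothetical stand-ins chosen only to illustrate how ranks on the same node can disagree.

```python
from collections import defaultdict


def node_of(rank, gpus_per_node):
    # Assumption: ranks are packed onto nodes contiguously.
    return rank // gpus_per_node


def local_votes(prows, pcols, gpus_per_node):
    """Return {node: [per-rank decision]} under a hypothetical per-rank rule."""

    def intra_node(ranks):
        # True if every rank in the group lives on the same node.
        return len({node_of(r, gpus_per_node) for r in ranks}) == 1

    votes = defaultdict(list)
    for rank in range(prows * pcols):
        # Assumption: row-major layout, rank = i * pcols + j.
        i, j = divmod(rank, pcols)
        row = [i * pcols + jj for jj in range(pcols)]
        col = [ii * pcols + j for ii in range(prows)]
        # Hypothetical rule: each rank looks only at its own row/column.
        wants_local = intra_node(row) or intra_node(col)
        votes[node_of(rank, gpus_per_node)].append(wants_local)
    return dict(votes)


votes = local_votes(4, 3, 4)
for node, decisions in sorted(votes.items()):
    status = "consistent" if len(set(decisions)) == 1 else "MISMATCH"
    print(node, decisions, status)
```

Under these assumptions, the first and last nodes each contain one rank that disagrees with the other three. A collective creation call that only some ranks on a node enter is exactly the situation described above.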

This PR fixes the problem by making the NCCL local communicator decision consistent across nodes.
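The description does not say how the decision is made consistent. One common way to do it, shown here purely as an illustration, is to combine the per-rank decisions with a logical OR so that every rank acts on the same answer. In an MPI code this could be an `MPI_Allreduce` with `MPI_LOR`; that mapping is an assumption, not something this PR states. The function name `consistent_decision` and the sample votes are hypothetical.

```python
def consistent_decision(votes_by_node):
    """Combine per-rank votes so all ranks share one decision.

    votes_by_node maps a node id to the list of boolean votes cast by the
    ranks on that node. The result is the logical OR over every vote, so
    if any rank needs the local communicator, all ranks take part in
    creating it.
    """
    return any(any(votes) for votes in votes_by_node.values())


# Hypothetical votes in which ranks on nodes 0 and 2 disagree.
example = {
    0: [True, True, True, False],
    1: [False, False, False, False],
    2: [False, True, True, True],
}
decision = consistent_decision(example)
print(decision)
```

Because every rank acts on the reduced value instead of its own vote, either all ranks enter the collective creation call or none do.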

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh · 2026-01-21T21:21:40Z

/build

github-actions · 2026-01-21T21:22:14Z

🚀 Build workflow triggered! View run

github-actions · 2026-01-21T21:29:34Z

✅ Build workflow passed! View run

romerojosh added 3 commits January 21, 2026 12:45

Fix potential deadlock during NCCL local communicator creation.

46a2e9e

Signed-off-by: Josh Romero <joshr@nvidia.com>

Use clique comm when needed.

f22dce0

Signed-off-by: Josh Romero <joshr@nvidia.com>

Formatting.

40eadb7

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh merged commit 6718b14 into main Jan 22, 2026
4 checks passed

romerojosh deleted the local_nccl_comm_init_fix branch February 3, 2026 22:24
