
Fix potential deadlock during NCCL local communicator creation. #98

Merged
romerojosh merged 3 commits into main from local_nccl_comm_init_fix on Jan 22, 2026
@romerojosh
Collaborator

It was discovered that for grid configurations that distribute intra-node GPUs unevenly across row and column communicators, the library can deadlock when creating a node-local NCCL communicator. Each rank decides whether to create the NCCL local communicator based only on the row and column communicators it belongs to, but creating that communicator involves all ranks on the node. If some but not all ranks on a node determine they need to create it, the creation deadlocks.

This only impacts cases where a user explicitly sets a processor decomposition that results in an uneven distribution of intra-node GPUs (e.g., using a 4x3 process grid on a system with 4 GPUs per node). For cases that autotune the grid and end up in this situation, the local NCCL communicator is created unconditionally before autotuning, so there is no deadlock.
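To make the 4x3 example concrete, here is a small pure-Python simulation (no MPI or NCCL involved). It rests on assumptions that are not stated in this PR: ranks are laid out row-major on the grid, ranks are packed onto nodes in order, and a rank "wants" the local communicator when its row or column communicator sits entirely on one node. The function names are hypothetical, and the library's actual criterion may differ.

```python
from collections import defaultdict


def local_comm_votes(nrows, ncols, gpus_per_node):
    """Return {node: {rank: wants_local_comm}} for a row-major process grid."""
    def node_of(rank):
        # Assumption: ranks are packed onto nodes in rank order.
        return rank // gpus_per_node

    votes = defaultdict(dict)
    for rank in range(nrows * ncols):
        row, col = divmod(rank, ncols)
        row_ranks = [row * ncols + c for c in range(ncols)]
        col_ranks = [r * ncols + col for r in range(nrows)]
        # Hypothetical criterion: the rank wants a node-local communicator
        # if its row or its column communicator lies within a single node.
        wants = any(
            len({node_of(r) for r in comm}) == 1
            for comm in (row_ranks, col_ranks)
        )
        votes[node_of(rank)][rank] = wants
    return dict(votes)


def inconsistent_nodes(votes):
    """Nodes whose ranks disagree, i.e. where creation could hang."""
    return sorted(n for n, v in votes.items() if len(set(v.values())) > 1)


votes = local_comm_votes(nrows=4, ncols=3, gpus_per_node=4)
print(votes[0])                   # {0: True, 1: True, 2: True, 3: False}
print(inconsistent_nodes(votes))  # [0, 2]
```

Under these assumptions, ranks 0 to 2 on node 0 share a row that fits on the node, while rank 3 belongs to a row that spills onto node 1, so the four ranks on node 0 disagree about whether to enter communicator creation.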

This PR fixes the problem by making the NCCL local communicator decision consistent across nodes.
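The PR text does not describe how the decision is made consistent. One plausible approach, sketched below in plain Python under that caveat, is to combine the per-rank decisions with a logical OR within each node so that every rank on a node reaches the same answer. The function name and the rank-to-node mapping are hypothetical; in an MPI program a similar effect could be obtained with `MPI_Allreduce` using `MPI_LOR` over a node-local communicator.

```python
def node_consistent_decision(per_rank_wants, gpus_per_node):
    """Combine per-rank votes with a logical OR within each node.

    per_rank_wants maps rank -> bool (the rank's own, possibly
    inconsistent, decision). Assumes ranks are packed onto nodes in order.
    """
    node_any = {}
    for rank, wants in per_rank_wants.items():
        node = rank // gpus_per_node
        node_any[node] = node_any.get(node, False) or wants
    # Every rank adopts its node's combined decision.
    return {rank: node_any[rank // gpus_per_node] for rank in per_rank_wants}


# Node 0 (ranks 0-3) disagrees; node 1 (ranks 4-7) agrees on False.
per_rank = {0: True, 1: True, 2: True, 3: False,
            4: False, 5: False, 6: False, 7: False}
decision = node_consistent_decision(per_rank, gpus_per_node=4)
print(decision[3], decision[4])  # True False
```

With the combined decision, either all ranks on a node enter communicator creation or none do, which removes the partial-participation case described above.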

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh
Collaborator Author

/build

@github-actions

🚀 Build workflow triggered! View run

@github-actions

✅ Build workflow passed! View run

@romerojosh romerojosh merged commit 6718b14 into main Jan 22, 2026
4 checks passed
@romerojosh romerojosh deleted the local_nccl_comm_init_fix branch February 3, 2026 22:24