Fix potential deadlock during NCCL local communicator creation. #98
Merged
romerojosh merged 3 commits into main on Jan 22, 2026
Signed-off-by: Josh Romero <joshr@nvidia.com>
/build
🚀 Build workflow triggered!
✅ Build workflow passed!
For grid configurations that distribute intra-node GPUs unevenly across row and column communicators, the library can deadlock when creating a node-local NCCL communicator. The decision to create an NCCL local communicator is based only on the row and column communicators a particular rank belongs to, but creating the communicator involves all ranks on a node. If some but not all ranks on a node decide they need the NCCL local communicator, the creation call deadlocks.
This only affects cases where a user explicitly sets a processor decomposition that results in an uneven distribution of intra-node GPUs (e.g., using a 4x3 process grid on a system with 4 GPUs per node). For cases that autotune the grid and end up in this situation, the local NCCL communicator is created unconditionally before autotuning, so there is no deadlock.
This PR fixes the problem by making the NCCL local communicator decision consistent across nodes.
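
To illustrate the failure mode described above, here is a small pure-Python simulation (no MPI or NCCL involved). It is only a sketch under stated assumptions: ranks are laid out row-major on the grid and packed onto nodes consecutively, and the per-rank rule `is_mixed` (a communicator "needs" a local NCCL communicator when it has both same-node and off-node peers) is a hypothetical stand-in, not the library's actual criterion. The `node_consistent_decision` helper is likewise hypothetical; it shows one possible way to make the decision agree within a node (a logical OR over the node's ranks), which may differ from what this PR does.

```python
from collections import defaultdict


def build_grid(rows, cols, gpus_per_node):
    """Map rank -> (row, col, node).

    Assumes row-major rank ordering and consecutive packing of ranks
    onto nodes; both are illustrative assumptions.
    """
    return {r: (r // cols, r % cols, r // gpus_per_node)
            for r in range(rows * cols)}


def is_mixed(rank, members, grid):
    """Hypothetical rule: True if the communicator has both same-node
    and off-node peers relative to `rank`."""
    my_node = grid[rank][2]
    peers = [m for m in members if m != rank]
    same = any(grid[m][2] == my_node for m in peers)
    diff = any(grid[m][2] != my_node for m in peers)
    return same and diff


def per_rank_decision(grid):
    """Each rank decides using only its own row and column communicators."""
    decisions = {}
    for rank, (row, col, _) in grid.items():
        row_comm = [r for r, (rr, _, _) in grid.items() if rr == row]
        col_comm = [r for r, (_, cc, _) in grid.items() if cc == col]
        decisions[rank] = (is_mixed(rank, row_comm, grid)
                           or is_mixed(rank, col_comm, grid))
    return decisions


def inconsistent_nodes(grid, decisions):
    """Nodes whose ranks disagree; in a collective creation call,
    such disagreement is what would hang."""
    votes = defaultdict(set)
    for rank, want in decisions.items():
        votes[grid[rank][2]].add(want)
    return sorted(node for node, v in votes.items() if len(v) > 1)


def node_consistent_decision(grid, decisions):
    """Hypothetical remedy: OR the per-rank decisions within each node
    so every rank on a node reaches the same answer."""
    by_node = defaultdict(bool)
    for rank, want in decisions.items():
        by_node[grid[rank][2]] |= want
    return {rank: by_node[grid[rank][2]] for rank in grid}


if __name__ == "__main__":
    grid = build_grid(rows=4, cols=3, gpus_per_node=4)
    local = per_rank_decision(grid)
    print("disagreeing nodes (per-rank rule):", inconsistent_nodes(grid, local))
    fixed = node_consistent_decision(grid, local)
    print("disagreeing nodes (node-wide OR): ", inconsistent_nodes(grid, fixed))
```

Under these assumptions, the 4x3 grid on 4 GPUs per node leaves the ranks on nodes 0 and 2 in disagreement, while a 4x4 grid on the same system does not. In a real MPI program, the node-wide OR would typically be a reduction over a node-local communicator rather than a dictionary lookup.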