This repository was archived by the owner on Mar 23, 2023. It is now read-only.

The error happened when I did multi-node distributed training #180

@ShangWeiKuo


🐛 Describe the bug

When I run the command `colossalai run --nproc_per_node 4 --host [host1 ip addr],[host2 ip addr] --master_addr [host1 ip addr] train.py`, I get the following message:

Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=[host1 ip addr]:29500 --rdzv_id=colossalai-default-job train.py on [host2 ip addr]

What configuration do I need to set in the train.py that you provide?
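
To make the failing command easier to reproduce by hand on each host, here is a minimal sketch that rebuilds the per-node torchrun command from the values shown in the error message. This is not ColossalAI's launcher code: the function name `build_torchrun_cmd` and the example IP addresses are hypothetical placeholders, while the port 29500 and the rendezvous id are copied from the error message.

```python
import shlex


def build_torchrun_cmd(hosts, master_addr, nproc_per_node, script,
                       port=29500, rdzv_id="colossalai-default-job"):
    """Return a dict mapping each host to the torchrun command for that node.

    Hypothetical helper for illustration only. node_rank is assumed to
    follow the order of the hosts list, as the error message suggests
    (the second host received --node_rank=1).
    """
    cmds = {}
    for node_rank, host in enumerate(hosts):
        args = [
            "torchrun",
            f"--nproc_per_node={nproc_per_node}",
            f"--nnodes={len(hosts)}",
            f"--node_rank={node_rank}",
            "--rdzv_backend=c10d",
            f"--rdzv_endpoint={master_addr}:{port}",
            f"--rdzv_id={rdzv_id}",
            script,
        ]
        cmds[host] = shlex.join(args)
    return cmds


if __name__ == "__main__":
    # Placeholder addresses; substitute the real host1 and host2 IPs.
    hosts = ["10.0.0.1", "10.0.0.2"]
    for host, cmd in build_torchrun_cmd(hosts, hosts[0], 4, "train.py").items():
        print(f"{host}: {cmd}")
```

Running the printed command directly on the second host may show the underlying error that the launcher's "failed to run" message hides.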

Environment

CUDA Version: 11.3
PyTorch Version: 1.12.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
