1. Problem
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
2. Solution
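import os
import torch.distributed as dist

# env:// rendezvous reads the master address and port from these variables.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '5678'
# rank is passed explicitly, so the missing RANK variable is no longer looked up.
world_size = int(os.environ.get('WORLD_SIZE', 1))
dist.init_process_group(backend='gloo', init_method='env://', rank=0, world_size=world_size)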
Replace the initialization code in the main function with the snippet above.
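The snippet above hard-codes rank=0, which suits a single-process run. As a rough sketch (not part of the original fix), the same settings could be read from the environment with single-process fallbacks, so the script also works when a launcher does set RANK and WORLD_SIZE. `resolve_dist_env` is a hypothetical helper name, and the default port 5678 is simply the one used above.

```python
import os


def resolve_dist_env(env=None):
    """Return rendezvous settings, falling back to single-process defaults.

    Hypothetical helper: `env` is any mapping of environment variables
    (defaults to os.environ).
    """
    env = os.environ if env is None else env
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": env.get("MASTER_PORT", "5678"),
        "rank": int(env.get("RANK", 0)),            # 0 when no launcher set RANK
        "world_size": int(env.get("WORLD_SIZE", 1)),  # 1 when running standalone
    }


# Usage (requires PyTorch; shown as comments so the helper runs without it):
#   import torch.distributed as dist
#   cfg = resolve_dist_env()
#   os.environ.setdefault("MASTER_ADDR", cfg["master_addr"])
#   os.environ.setdefault("MASTER_PORT", cfg["master_port"])
#   dist.init_process_group(backend="gloo", init_method="env://",
#                           rank=cfg["rank"], world_size=cfg["world_size"])
```

With no variables set this yields rank 0 and world size 1, matching the hard-coded call.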
3. Follow-up
Running the script again raised another error:
RuntimeError: Distributed package doesn't have NCCL built in
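Changing the backend argument of init_process_group from nccl to gloo fixes this. The error means the installed PyTorch build was compiled without NCCL support, whereas gloo also runs on CPU.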
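Instead of editing the backend by hand, it can be picked at runtime. The following is only a sketch under that assumption: `choose_backend` and `detect_backend` are hypothetical helper names, while `torch.distributed.is_available()`, `torch.distributed.is_nccl_available()` and `torch.cuda.is_available()` are existing PyTorch calls.

```python
def choose_backend(nccl_available, cuda_available):
    """Pick nccl only when the build has NCCL and a CUDA device is usable."""
    return "nccl" if (nccl_available and cuda_available) else "gloo"


def detect_backend():
    """Query the installed PyTorch build and choose a backend.

    torch is imported lazily so choose_backend can be used without it.
    """
    import torch
    import torch.distributed as dist

    # is_nccl_available is only consulted when the distributed package exists.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return choose_backend(nccl_ok, torch.cuda.is_available())


# Usage (requires PyTorch):
#   dist.init_process_group(backend=detect_backend(), init_method="env://",
#                           rank=0, world_size=1)
```

On a build without NCCL this falls back to gloo, which is the manual change described above.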