First, the SSH connection between two Linux servers:
For example, in the terminal of one server type ssh root@218.195.247.241 -p 30430 (the other server's address and port), then follow the prompt and enter the password to connect.
To run multi-machine multi-GPU training, the two servers must be able to ping each other (as far as I have tested). The ping command is used in the terminal like this:
ping 202.201.252.102   # V100 machine
ping 202.201.252.101   # A30 machine
Suppose we have two servers: one is a V100 (GPU model) - 3 (number of GPUs) - 16 GB (memory per GPU) machine, the other is an A30-6-24 GB machine. Each server holds one .py file: V_dist.py on the V100 machine and A_dist.py on the A30 machine. Their contents are shown below and are identical except for the value of os.environ["CUDA_VISIBLE_DEVICES"]:
import os
import datetime
import torch

# The two files differ only in this line: V_dist.py (V100 machine) keeps the
# first assignment, A_dist.py (A30 machine) keeps the second.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"   # V100
os.environ["CUDA_VISIBLE_DEVICES"] = "2,4"   # A30

# These environment variables are set by torch.distributed.launch.
print(os.environ['MASTER_ADDR'], flush=True)
print(os.environ['MASTER_PORT'], flush=True)
world_size = int(os.environ['WORLD_SIZE'])   # total number of processes (here 4)
rank = int(os.environ["RANK"])               # global rank of this process (0..3)
gpu = int(os.environ["LOCAL_RANK"])          # local rank on this machine (0 or 1)
torch.cuda.set_device(gpu)

dist_backend = 'nccl'
# dist_url = 'env://'
dist_url = "tcp://%s:%s" % (os.environ['MASTER_ADDR'], os.environ['MASTER_PORT'])
print('| distributed init (rank {}): {}, gpu {}'.format(rank, dist_url, gpu), flush=True)
torch.distributed.init_process_group(
    backend=dist_backend, init_method=dist_url,
    world_size=world_size, rank=rank,
    timeout=datetime.timedelta(0, 7200)      # 2-hour timeout
)
torch.distributed.barrier()

data = torch.tensor([[0.1, 0.1],
                     [0.2, 0.2],
                     [0.3, 0.3],
                     [0.4, 0.4],
                     ], dtype=torch.float).to(torch.device('cuda'))
print('data:', data, flush=True)

# DistributedSampler splits the 4 samples across the 4 ranks (one sample each).
sampler = torch.utils.data.distributed.DistributedSampler(data, shuffle=False)
loader = torch.utils.data.DataLoader(data, batch_size=1, sampler=sampler)

# The labels are random, so they differ from rank to rank.
labels = torch.randn(1, 2).to(torch.device('cuda'))
print('labels:', labels, flush=True)

model = torch.nn.Linear(2, 2).to(torch.device('cuda'))
if rank == 0 or rank == 2:
    # print the parameters before the DDP wrap on one rank per machine
    print(f'model_{rank}:', list(model.named_parameters()), flush=True)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
if rank == 1 or rank == 3:
    # print the parameters after the DDP wrap on the other rank per machine
    print(f'Model_{rank}:', list(model.named_parameters()), flush=True)
# Only some ranks print, to keep the output short.

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for data in loader:
    print(data)
    optimizer.zero_grad()   # clear any accumulated gradients before the step
    # forward pass
    output = model(data)
    # backward pass
    loss_fn(output, labels).backward()
    # update parameters
    optimizer.step()
Command executed on the V100 machine:
python -m torch.distributed.launch --nproc_per_node=2 \
--nnodes=2 --node_rank=0 --master_addr="202.201.252.102" \
--master_port=29522 V_dist.py
Command executed on the A30 machine:
python -m torch.distributed.launch --nproc_per_node=2 \
--nnodes=2 --node_rank=1 --master_addr="202.201.252.102" \
--master_port=29522 A_dist.py
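As a side note, the global rank of each process follows from --node_rank and the process's local rank on its node. A minimal sketch of that mapping for this example (assuming, as torch.distributed.launch requires, the same --nproc_per_node on every node):

# How the four processes in this example get their global ranks,
# assuming nproc_per_node = 2 on both machines.
nproc_per_node = 2
for node_rank in range(2):              # 0 = V100 machine, 1 = A30 machine
    for local_rank in range(nproc_per_node):
        rank = node_rank * nproc_per_node + local_rank
        print(f"node_rank={node_rank} local_rank={local_rank} -> rank={rank}")
# node_rank=0 local_rank=0 -> rank=0
# node_rank=0 local_rank=1 -> rank=1
# node_rank=1 local_rank=0 -> rank=2
# node_rank=1 local_rank=1 -> rank=3

This matches the ranks printed in the terminal output below.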
The V100 terminal shows (tidied up, not the raw display):
202.201.252.102
29522
| distributed init (rank 0): tcp://202.201.252.102:29522, gpu 0
202.201.252.102
29522
| distributed init (rank 1): tcp://202.201.252.102:29522, gpu 1
data: tensor([[0.1000, 0.1000],
[0.2000, 0.2000],
[0.3000, 0.3000],
[0.4000, 0.4000]], device='cuda:0')
data: tensor([[0.1000, 0.1000],
[0.2000, 0.2000],
[0.3000, 0.3000],
[0.4000, 0.4000]], device='cuda:1')
labels: tensor([[-1.8352, 2.8090]], device='cuda:0')
labels: tensor([[-1.2993, -0.8943]], device='cuda:1')
model_0: [('weight', Parameter containing:
tensor([[0.5818, 0.4880],
[0.4510, 0.6375]], device='cuda:0', requires_grad=True)),
('bias', Parameter containing:
tensor([0.4690, 0.3800], device='cuda:0', requires_grad=True))]
Model_1: [('module.weight', Parameter containing:
tensor([[0.5818, 0.4880],
[0.4510, 0.6375]], device='cuda:1', requires_grad=True)),
('module.bias', Parameter containing:
tensor([0.4690, 0.3800], device='cuda:1', requires_grad=True))]
tensor([[0.1000, 0.1000]], device='cuda:0')
tensor([[0.2000, 0.2000]], device='cuda:1')
The A30 terminal shows (likewise tidied up):
202.201.252.102
29522
| distributed init (rank 2): tcp://202.201.252.102:29522, gpu 0
202.201.252.102
29522
| distributed init (rank 3): tcp://202.201.252.102:29522, gpu 1
data: tensor([[0.1000, 0.1000],
[0.2000, 0.2000],
[0.3000, 0.3000],
[0.4000, 0.4000]], device='cuda:1')
data: tensor([[0.1000, 0.1000],
[0.2000, 0.2000],
[0.3000, 0.3000],
[0.4000, 0.4000]], device='cuda:0')
labels: tensor([[-0.1021, -0.1853]], device='cuda:1')
labels: tensor([[-0.6373, 0.6369]], device='cuda:0')
model_2: [('weight', Parameter containing:
tensor([[-0.2403, -0.4664],
[-0.4876, -0.3261]], device='cuda:0', requires_grad=True)),
('bias', Parameter containing:
tensor([-0.2044, 0.6556], device='cuda:0', requires_grad=True))]
Model_3: [('module.weight', Parameter containing:
tensor([[0.5818, 0.4880],
[0.4510, 0.6375]], device='cuda:1', requires_grad=True)),
('module.bias', Parameter containing:
tensor([0.4690, 0.3800], device='cuda:1', requires_grad=True))]
tensor([[0.4000, 0.4000]], device='cuda:1')
tensor([[0.3000, 0.3000]], device='cuda:0')
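Two things are worth noting in this output. First, DistributedSampler with shuffle=False hands rank r the sample with index r, which is why rank 0 trains on [0.1, 0.1], rank 1 on [0.2, 0.2], rank 2 on [0.3, 0.3] and rank 3 on [0.4, 0.4]. Second, model_2 (printed before the DDP wrap on the A30 machine) has different randomly initialized weights from model_0, yet Model_3 (printed after the wrap) holds exactly the same weights as model_0: DistributedDataParallel broadcasts rank 0's parameters to every other rank when it is constructed. If you want to check that synchronization explicitly, a minimal sketch of a hypothetical helper (not part of the scripts above) could look like this:

def check_params_in_sync(ddp_model):
    # Collective check: call on every rank after init_process_group and the
    # DDP wrap. Broadcasts rank 0's parameter values and compares them with
    # the local copy; returns True on every rank if the parameters match.
    in_sync = True
    for p in ddp_model.parameters():
        ref = p.detach().clone()
        torch.distributed.broadcast(ref, src=0)   # every rank now holds rank 0's values
        in_sync = in_sync and torch.allclose(p.detach(), ref)
    return in_sync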
It is best to assign the same number of GPUs on every machine. I have not found a way to run with unequal GPU counts per machine while still using torch.distributed.launch. (Based on my attempts, I suspect it simply does not work: when the counts differ, the global rank values cannot be assigned consistently.)
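The one alternative I can think of (not verified on these two machines) is to skip torch.distributed.launch and start each process by hand, telling it its global rank explicitly, so the rank no longer has to follow the node_rank * nproc_per_node + local_rank pattern. A rough sketch; MY_RANK and MY_GPU are hypothetical variables you would export yourself before starting each process:

import os
import torch

# Hypothetical manual launch: e.g. 2 processes on the V100 machine (ranks 0 and 1)
# and 1 process on the A30 machine (rank 2), so world_size = 3.
rank = int(os.environ["MY_RANK"])    # global rank, set by hand for each process
gpu = int(os.environ["MY_GPU"])      # which local GPU this process should use
world_size = 3
torch.cuda.set_device(gpu)
torch.distributed.init_process_group(
    backend='nccl',
    init_method="tcp://202.201.252.102:29522",
    world_size=world_size, rank=rank,
)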
If the following code (test.py) is run directly with plain python, it prints nothing, and reordering the print lines makes no difference: RANK, LOCAL_RANK, MASTER_ADDR and MASTER_PORT are only set by the launcher, so the very first os.environ lookup raises a KeyError.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,4"
print(os.environ["RANK"],flush=True)
print(os.environ["LOCAL_RANK"],flush=True)
print(os.environ['MASTER_ADDR'],flush=True)
print(os.environ['MASTER_PORT'],flush=True)
To run it, use the following command in the terminal (the two launched processes then print the output shown after the command):
python -m torch.distributed.launch --nproc_per_node=2 \
--nnodes=1 --node_rank=0 --master_addr="202.201.252.102" \
--master_port=29522 test.py
0
0
202.201.252.102
29522
1
1
202.201.252.102
29522
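As a side note, if you want test.py to also run without the launcher, you can read the variables with fallback defaults instead of indexing os.environ directly. A minimal sketch (the default values here are arbitrary):

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2,4"
# os.environ.get returns the default instead of raising KeyError when
# the launcher has not set the variable.
print(os.environ.get("RANK", "0"), flush=True)
print(os.environ.get("LOCAL_RANK", "0"), flush=True)
print(os.environ.get("MASTER_ADDR", "127.0.0.1"), flush=True)
print(os.environ.get("MASTER_PORT", "29522"), flush=True)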