1. System Environment
Hardware Environment(Ascend/GPU/CPU): Ascend
Software Environment:
MindSpore version (source or binary): 1.6.1
Python version (e.g., Python 3.7.5): 3.7.6
OS platform and distribution (e.g., Linux Ubuntu 16.04):
GCC/Compiler version (if compiled from source):
2. Scripts
Running in a distributed environment requires a bash launch script and a Python training script, given below.
Save the following bash script as run.sh:
#!/bin/bash
set -e
EXEC_PATH=$(pwd)
export HCCL_CONNECT_TIMEOUT=120  # set a 120 s timeout so the reproduction does not hang for long
export RANK_SIZE=8
export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json
for ((i = 0; i < 8; i++))
do
    rm -rf device$i
    mkdir device$i
    cp ./train.py ./device$i
    cd ./device$i
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "start training for device $i"
    env > env$i.log
    python ./train.py > train.log$i 2>&1 &
    cd ../
done
echo "The program launch succeeded; the log is under device0/train.log0."
Save the following Python script as train.py (the file name that run.sh copies into each device directory):
"""Operator Parallel Example"""
import numpy as np
from mindspore import context, Parameter
from mindspore.nn import Cell, Momentum
from mindspore.ops import operations as P
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
import mindspore.dataset as ds
import mindspore.communication.management as D
from mindspore.train.callback import LossMonitor
from mindspore.train.callback import ModelCheckpoint
from mindspore.common.initializer import initializer
step_per_epoch = 4

def get_dataset(*inputs):
    def generate():
        for _ in range(step_per_epoch):
            yield inputs
    return generate

class Net(Cell):
    """define net"""
    def __init__(self):
        super().__init__()
        self.matmul = P.MatMul().shard(((6, 4), (4, 6)))
        self.weight = Parameter(initializer("normal", [32, 60]), "w1")
        self.relu = P.ReLU().shard(((6, 4),))

    def construct(self, x):
        out = self.matmul(x, self.weight)
        out = self.relu(out)
        return out

if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", save_graphs=True)
    D.init()
    rank = D.get_rank()
    context.set_auto_parallel_context(parallel_mode="semi_auto_parallel", device_num=8, full_batch=True)
    np.random.seed(1)
    input_data = np.random.rand(60, 32).astype(np.float32)
    label_data = np.random.rand(60, 16).astype(np.float32)
    fake_dataset = get_dataset(input_data, label_data)
    net = Net()
    callback = [LossMonitor(), ModelCheckpoint(directory="{}".format(rank))]
    dataset = ds.GeneratorDataset(fake_dataset, ["input", "label"])
    loss = SoftmaxCrossEntropyWithLogits()
    learning_rate = 0.001
    momentum = 0.1
    epoch_size = 1
    opt = Momentum(net.trainable_params(), learning_rate, momentum)
    model = Model(net, loss_fn=loss, optimizer=opt)
    model.train(epoch_size, dataset, callbacks=callback, dataset_sink_mode=False)
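For intuition about what the strategy tuples mean: each number passed to shard is the number of slices taken along the corresponding tensor dimension, so ((6, 4), (4, 6)) cuts the (60, 32) input into 6 × 4 blocks. The following is a plain NumPy sketch of that slicing (illustration only, not the MindSpore API; the variable names are made up here):

```python
import numpy as np

# shard strategy ((6, 4), (4, 6)): the first tuple splits the input x
# along (rows, cols) into 6 x 4 blocks; the second splits the weight.
x = np.arange(60 * 32, dtype=np.float32).reshape(60, 32)
splits = (6, 4)

# each device would hold one block of shape (60/6, 32/4) = (10, 8)
block = x[: x.shape[0] // splits[0], : x.shape[1] // splits[1]]
print(block.shape)   # (10, 8)

# total number of slices implied by this strategy for x alone
num_slices = splits[0] * splits[1]
print(num_slices)    # 24
```

Note that the split counts 6 and 24 are not powers of 2, which is exactly what the parallel module rejects below.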
3. A reference rank_table_8pcs.json (the file RANK_TABLE_FILE points at) is as follows:
{
"version": "1.0",
"server_count": "1",
"server_list": [
{
"server_id": "10.90.41.205",
"device": [
{
"device_id": "0",
"device_ip": "192.98.92.107",
"rank_id": "0"
},
{
"device_id": "1",
"device_ip": "192.98.93.107",
"rank_id": "1"
},
{
"device_id": "2",
"device_ip": "192.98.94.107",
"rank_id": "2"
},
{
"device_id": "3",
"device_ip": "192.98.95.107",
"rank_id": "3"
},
{
"device_id": "4",
"device_ip": "192.98.92.108",
"rank_id": "4"
},
{
"device_id": "5",
"device_ip": "192.98.93.108",
"rank_id": "5"
},
{
"device_id": "6",
"device_ip": "192.98.94.108",
"rank_id": "6"
},
{
"device_id": "7",
"device_ip": "192.98.95.108",
"rank_id": "7"
}
],
"host_nic_ip": "reserve"
}
],
"status": "completed"
}
Assuming eight Ascend 910 devices are available locally, launch with:
bash run.sh
4. Error Message
With this configuration, the run fails with a "parallel degree is not a power of 2" error:
[ERROR] PARALLEL (70022, 7fd69fa7700,python):2022-04-06-15:31:52:343.111 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:125] CheckStrategyValue] MatMulInfo00: The strategy is ((6, 4), (4, 6)), the value of strategy must be the power of 2, but get 6.
5. Error Analysis and Solution
The error occurs because operator-level parallelism currently does not support split factors that are not powers of 2, and 6 × 4 clearly contains a factor that is not a power of 2. The fix is simply to pick a parallel degree that meets the requirement, such as 16, 32, or another value suited to the actual setup, and change each operator's shard strategy to conforming values.
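The check that trips in operator_info.cc can be mimicked in plain Python. The hypothetical helpers below (is_power_of_two and check_strategy are illustration names, not MindSpore API) show that ((6, 4), (4, 6)) fails on the value 6, while a power-of-2 strategy such as ((2, 2), (2, 2)) passes:

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... -- the split counts the parallel module accepts."""
    return n > 0 and n & (n - 1) == 0

def check_strategy(strategy):
    """Return the first offending split count, or None if every value is a power of 2."""
    for per_input in strategy:
        for n in per_input:
            if not is_power_of_two(n):
                return n
    return None

print(check_strategy(((6, 4), (4, 6))))  # 6 -> triggers the reported error
print(check_strategy(((2, 2), (2, 2))))  # None -> all splits are powers of 2
```

In the reproduction script, one possible fix for the 8-device setup would be to change the shard calls to power-of-2 strategies, for example P.MatMul().shard(((2, 2), (2, 2))) and P.ReLU().shard(((2, 4),)); the concrete choice depends on the actual device count and tensor shapes.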