Resolving a MindSpore parallel mode configuration error: Parallel mode dose not support **

System Environment

Hardware Environment (Ascend/GPU/CPU): Ascend
Software Environment:
MindSpore version (source or binary): 1.6.0
Python version (e.g., Python 3.7.5): 3.7.5
OS platform and distribution (e.g., Linux Ubuntu 16.04):
GCC/Compiler version (if compiled from source): 7.3.0

Scripts

A distributed run needs three files: a bash launch script, a Python training script, and a rank_table configuration file. Save all three in the same directory; each file is listed below.
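For reference, a minimal sketch of the expected directory layout (the file names follow the sections below):

./
├── run.sh                  # bash launch script
├── train.py                # Python training script
└── rank_table_8pcs.json    # rank table configuration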

Bash launch script

Save the following shell script as run.sh:

#!/bin/bash

RANK_SIZE=$1
EXEC_PATH=$(pwd)
export RANK_SIZE=${RANK_SIZE}
export HCCL_CONNECT_TIMEOUT=120 # set the timeout to 120s so reproducing the error does not take long
export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json # path to the rank_table file

for((i=0;i<${RANK_SIZE};i++))
do
    rm -rf device$i
    mkdir device$i
    cp ./train.py ./device$i
    cd ./device$i
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "start training for device $i"
    env > env$i.log
    python ./train.py > train.log$i 2>&1 &
    cd ../
done
echo "The program launch succeed, the log is under device*/train.log*."

Training script

Save the following Python script as train.py:

"""Parallel Example"""
import numpy as np

from mindspore import context, Parameter
from mindspore.nn import Cell, Momentum
from mindspore.ops import operations as P
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
import mindspore.dataset as ds
import mindspore.communication.management as D
from mindspore.train.callback import LossMonitor
from mindspore.train.callback import ModelCheckpoint
from mindspore.common.initializer import initializer

step_per_epoch = 4

def get_dataset(*inputs):
    def generate():
        for _ in range(step_per_epoch):
            yield inputs
    return generate


class Net(Cell):
    """define net"""
    def __init__(self):
        super().__init__()
        self.matmul = P.MatMul().shard(((2, 4), (4, 1)))
        self.weight = Parameter(initializer("normal", [32, 16]), "w1")
        self.relu = P.ReLU().shard(((8, 1),))

    def construct(self, x):
        out = self.matmul(x, self.weight)
        out = self.relu(out)
        return out


if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", save_graphs=True)
    D.init()
    rank = D.get_rank()
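    # NOTE: "semi_parallel" below is not a valid parallel_mode value; this line reproduces the reported error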
    context.set_auto_parallel_context(parallel_mode="semi_parallel", device_num=8, full_batch=True)

    np.random.seed(1)
    input_data = np.random.rand(16, 32).astype(np.float32)
    label_data = np.random.rand(16, 16).astype(np.float32)
    fake_dataset = get_dataset(input_data, label_data)

    net = Net()

    callback = [LossMonitor(), ModelCheckpoint(directory="{}".format(rank))]
    dataset = ds.GeneratorDataset(fake_dataset, ["input", "label"])
    loss = SoftmaxCrossEntropyWithLogits()

    learning_rate = 0.001
    momentum = 0.1
    epoch_size = 1
    opt = Momentum(net.trainable_params(), learning_rate, momentum)

    model = Model(net, loss_fn=loss, optimizer=opt)
    model.train(epoch_size, dataset, callbacks=callback, dataset_sink_mode=False)
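Before launching the 8-card job, it can help to sanity-check the fake dataset locally. The snippet below is a standalone sketch (single process, no Ascend or distributed context required) that reuses the generator from train.py and simply prints the shape of each step; the names mirror the script above.

import numpy as np
import mindspore.dataset as ds

step_per_epoch = 4

def get_dataset(*inputs):
    def generate():
        for _ in range(step_per_epoch):
            yield inputs
    return generate

np.random.seed(1)
fake_dataset = get_dataset(np.random.rand(16, 32).astype(np.float32),
                           np.random.rand(16, 16).astype(np.float32))
dataset = ds.GeneratorDataset(fake_dataset, ["input", "label"])
for item in dataset.create_dict_iterator(output_numpy=True):
    # expect (16, 32) inputs and (16, 16) labels, four steps per epoch
    print(item["input"].shape, item["label"].shape)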

rank_table file

Save the following JSON file as rank_table_8pcs.json:

{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.90.41.205",
            "device": [
                {
                    "device_id": "0",
                    "device_ip": "192.98.92.107",
                    "rank_id": "0"
                },
                {
                    "device_id": "1",
                    "device_ip": "192.98.93.107",
                    "rank_id": "1"
                },
                {
                    "device_id": "2",
                    "device_ip": "192.98.94.107",
                    "rank_id": "2"
                },
                {
                    "device_id": "3",
                    "device_ip": "192.98.95.107",
                    "rank_id": "3"
                },
                {
                    "device_id": "4",
                    "device_ip": "192.98.92.108",
                    "rank_id": "4"
                },
                {
                    "device_id": "5",
                    "device_ip": "192.98.93.108",
                    "rank_id": "5"
                },
                {
                    "device_id": "6",
                    "device_ip": "192.98.94.108",
                    "rank_id": "6"
                },
                {
                    "device_id": "7",
                    "device_ip": "192.98.95.108",
                    "rank_id": "7"
                }
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
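As a minimal sanity-check sketch, the short Python snippet below loads rank_table_8pcs.json, confirms that it lists eight devices for the 8-card launch, and checks that the rank_id values are unique; the field names follow the JSON above.

import json

with open("rank_table_8pcs.json") as f:
    table = json.load(f)

devices = table["server_list"][0]["device"]
rank_ids = [d["rank_id"] for d in devices]
# the 8-card launch assumes exactly eight device entries with unique rank_ids (0-7)
assert len(devices) == 8, "expected 8 device entries for an 8-card launch"
assert len(rank_ids) == len(set(rank_ids)), "rank_id values must be unique"
print("rank table looks consistent, rank_ids:", rank_ids)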

Error Analysis

Launching the scripts

If the local machine has eight Ascend 910 devices, launch with:

bash run.sh 8

Error message analysis

The error message reported at startup clearly states that semi_parallel is not a supported parallel mode.

Resolving the Error

Fixing the configuration

Look up the set_auto_parallel_context API in the MindSpore documentation: the valid options for its parallel_mode parameter are stand_alone, data_parallel, hybrid_parallel, semi_auto_parallel and auto_parallel. Setting one of these correctly resolves the error.
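Concretely, the fix in train.py is a one-line change, using semi_auto_parallel, which matches the per-operator shard() strategies in Net:

    # corrected: "semi_auto_parallel" is a valid parallel_mode value
    context.set_auto_parallel_context(parallel_mode="semi_auto_parallel", device_num=8, full_batch=True)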

Successful run

After configuring a valid parallel mode (for example semi_auto_parallel), the job runs successfully and the loss values are printed in the per-device logs.
