MindSpore error: The strategy is ((6, 4), (4, 6)), the value of strategy must be the power of 2, but get 6.

1. System environment

Hardware Environment (Ascend/GPU/CPU): Ascend
Software Environment:
MindSpore version (source or binary): 1.6.1
Python version (e.g., Python 3.7.5): 3.7.6
OS platform and distribution (e.g., Linux Ubuntu 16.04):
GCC/Compiler version (if compiled from source):

2. Scripts

Running in a distributed environment requires a bash launch script and a Python training file, shown below.

Save the following bash script as run.sh:

#!/bin/bash
set -e
EXEC_PATH=$(pwd)
export HCCL_CONNECT_TIMEOUT=120 # set the timeout to 120s so reproducing the issue does not take long
export RANK_SIZE=8
export RANK_TABLE_FILE=${EXEC_PATH}/rank_table_8pcs.json

for((i=0;i<8;i++))
do
    rm -rf device$i
    mkdir device$i
    cp ./train.py ./device$i
    cd ./device$i
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "start training for device $i"
    env > env$i.log
    python ./train.py > train.log$i 2>&1 &
    cd ../
done
echo "The program launch succeed, the log is under device0/train.log0."复制

Save the following script as train.py (the file that run.sh copies and launches):

"""Operator Parallel Example"""
import numpy as np

from mindspore import context, Parameter
from mindspore.nn import Cell, Momentum
from mindspore.ops import operations as P
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
import mindspore.dataset as ds
import mindspore.communication.management as D
from mindspore.train.callback import LossMonitor
from mindspore.train.callback import ModelCheckpoint
from mindspore.common.initializer import initializer

step_per_epoch = 4

def get_dataset(*inputs):
    def generate():
        for _ in range(step_per_epoch):
            yield inputs
    return generate


class Net(Cell):
    """define net"""
    def __init__(self):
        super().__init__()
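        # Repro core: the strategy values below contain 6, which is not a
        # power of 2 and is exactly what triggers the reported error.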
        self.matmul = P.MatMul().shard(((6, 4), (4, 6)))
        self.weight = Parameter(initializer("normal", [32, 60]), "w1")
        self.relu = P.ReLU().shard(((6, 4),))

    def construct(self, x):
        out = self.matmul(x, self.weight)
        out = self.relu(out)
        return out


if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", save_graphs=True)
    D.init()
    rank = D.get_rank()
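    # semi_auto_parallel applies the shard strategies set on each operator in Net;
    # full_batch=True means every device loads the complete dataset.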
    context.set_auto_parallel_context(parallel_mode="semi_auto_parallel", device_num=8, full_batch=True)

    np.random.seed(1)
    input_data = np.random.rand(60, 32).astype(np.float32)
    label_data = np.random.rand(60, 60).astype(np.float32)  # labels must match the (60, 60) network output
    fake_dataset = get_dataset(input_data, label_data)

    net = Net()

    callback = [LossMonitor(), ModelCheckpoint(directory="{}".format(rank))]
    dataset = ds.GeneratorDataset(fake_dataset, ["input", "label"])
    loss = SoftmaxCrossEntropyWithLogits()

    learning_rate = 0.001
    momentum = 0.1
    epoch_size = 1
    opt = Momentum(net.trainable_params(), learning_rate, momentum)

    model = Model(net, loss_fn=loss, optimizer=opt)
    model.train(epoch_size, dataset, callbacks=callback, dataset_sink_mode=False)

3. A reference rank_table_8pcs.json (the file run.sh points to) is as follows:

{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.90.41.205",
            "device": [
                {
                    "device_id": "0",
                    "device_ip": "192.98.92.107",
                    "rank_id": "0"
                },
                {
                    "device_id": "1",
                    "device_ip": "192.98.93.107",
                    "rank_id": "1"
                },
                {
                    "device_id": "2",
                    "device_ip": "192.98.94.107",
                    "rank_id": "2"
                },
                {
                    "device_id": "3",
                    "device_ip": "192.98.95.107",
                    "rank_id": "3"
                },
                {
                    "device_id": "4",
                    "device_ip": "192.98.92.108",
                    "rank_id": "4"
                },
                {
                    "device_id": "5",
                    "device_ip": "192.98.93.108",
                    "rank_id": "5"
                },
                {
                    "device_id": "6",
                    "device_ip": "192.98.94.108",
                    "rank_id": "6"
                },
                {
                    "device_id": "7",
                    "device_ip": "192.98.95.108",
                    "rank_id": "7"
                }
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}

Assuming 8 local Ascend 910 devices, the launch command is:

bash run.sh

4. Error message

With this configuration, the run fails with a "parallel degree is not a power of 2" error:

[ERROR] PARALLEL (70022, 7fd69fa7700,python):2022-04-06-15:31:52:343.111 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:125] CheckStrategyValue] MatMulInfo00: The strategy is ((6, 4), (4, 6)), the value of strategy must be the power of 2, but get 6.

5. Error analysis and solution

The cause of this error is that the parallel module does not support shard strategies containing values that are not powers of 2, and the 6 in ((6, 4), (4, 6)) is clearly not a power of 2. The fix is simply to change each operator's shard strategy to compliant values, i.e., powers of 2 such as 2, 4, 8, 16, or 32, chosen to fit the actual tensor shapes and device count, as in the sketch below.
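As one possible fix (a sketch based on the 8-device setup and the (60, 32) input / (32, 60) weight used above), the strategy ((2, 4), (4, 1)) meets every constraint: 2, 4, and 1 are powers of 2; 2 * 4 * 1 = 8 matches device_num; and each value evenly divides the dimension it splits:

# Inside Net.__init__, replace the two shard calls with power-of-2 strategies.
# MatMul: the (60, 32) input is split 2 x 4 and the (32, 60) weight 4 x 1;
# 2 * 4 * 1 = 8 = device_num, and 60 % 2 == 0, 32 % 4 == 0, 60 % 1 == 0.
self.matmul = P.MatMul().shard(((2, 4), (4, 1)))
# ReLU: its (60, 60) input also divides evenly under (2, 4).
self.relu = P.ReLU().shard(((2, 4),))

Other power-of-2 combinations, for example ((4, 2), (2, 1)), work just as well, provided each value divides the dimension it splits and the strategy's product stays consistent with device_num.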
