Distributed model parallelism error: operator Mul init failed or CheckStrategy failed

1. System environment

Hardware environment (Ascend/GPU/CPU): Ascend

Execution mode: static graph

Python version: 3.7

Operating system: Linux

2. Error message

2.1 Problem description

Running the network with distributed model parallelism fails with:

operator Mul init failed, or CheckStrategy failed.

[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.073 [mindspore/ccsrc/frontend/parallel/ops_info/arithmetic_info.cc:89] CheckStrategy] MulInfo7575 : Invalid strategy.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.127 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:885] InitForCostModelWithAutoRepeatCalc] MulInfo7575: CheckStrategy failed.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.140 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:852] Init] MulInfo7575 : Init failed.
 
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
In file /opt/huawei/schedule-train/algorithm/*/nn/transformer.py:1197/                current_key = self.mul1(self.tile(key, (1, 1, 1, self.src_seq_length)),/
In file /opt/huawei/schedule-train/algorithm/*/transformer.py:1187/            if self.is_first_iteration:/
In file /opt/huawei/schedule-train/algorithm/pangu_sigma/ms/parallel/nn*/angu_sigma/src/pangu_moe.py:133/        attention, layer_present = self.attention(query_vector, input_x, input_x,  abs_pos_embed, rel_pos_embed, input_mask,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:379/            encoder_output, _ = self.top_query_layer(encoder_output, top_query_hidden_states, encoder_masks,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:370/        if self.is_pipeline:/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:434/        output_states, word_table = self.backbone(input_ids, input_position, input_rel_pos, attention_mask,/
In file /opt/huawei/schedule-train/algorithm/*/mian_model.py:530/        logits = self.backbone(input_ids, position_ids, None, attention_mask, expert_ids, current_index,/
 
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/frontend/parallel/step_parallel.cc:1580 ExtractStrategyAndInit

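3. Root cause analysis

The sharding strategy configured for the Mul operator under model parallelism is invalid, which triggers this error. With INFO-level logging enabled, the following message appears: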

[INFO] PARALLEL(283,ffff95abcbf0,python):2023-07-10-17:11:31.981.235 [mindspore/ccsrc/frontend/parallel/step_parallel_utils.cc:1253] ExtractStrategy] Extract information: strategy ((4, 8, 1, 1), (1, 1, 1,11))
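INFO-level logs in MindSpore are normally enabled by setting the environment variable GLOG_v=1 before launching the job. When the log is long, it can help to pull the extracted strategy out of the line programmatically. The following is a minimal sketch, assuming the log line ends with the "Extract information: strategy (...)" text shown above; the helper name parse_strategy is hypothetical and not part of MindSpore.

```python
import ast
import re

# Matches the tail of the INFO line and captures the nested tuple text.
# The format is assumed from the single log line shown in this post.
STRATEGY_RE = re.compile(r"Extract information: strategy (\(.*\))\s*$")


def parse_strategy(line):
    """Return the strategy as a tuple of tuples, or None if the line has no strategy."""
    match = STRATEGY_RE.search(line)
    if match is None:
        return None
    # literal_eval safely turns "((4, 8, 1, 1), ...)" into Python tuples.
    return ast.literal_eval(match.group(1))


# Tail of the log line from this post, copied verbatim.
LOG_TAIL = "Extract information: strategy ((4, 8, 1, 1), (1, 1, 1,11))"

strategy = parse_strategy(LOG_TAIL)
print("input 0 strategy:", strategy[0])
print("input 1 strategy:", strategy[1])
```

Each inner tuple is the split applied to one input of the operator, one entry per dimension.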

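The INFO line shows the strategy that was actually extracted for the operator. Combined with the call stack in the error message (transformer.py:1197, self.mul1), this pinpoints the failing Mul operator. The fix is then to adjust the sharding strategy of that operator.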
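Before changing the strategy, it can be useful to check a candidate against the input shapes offline. The sketch below is a simplified stand-in for the kind of checks an element-wise operator's CheckStrategy performs, not MindSpore's actual implementation. It assumes both inputs have the same rank, and the shapes used in the example are hypothetical because the post does not state the real tensor shapes.

```python
def check_elementwise_strategy(shape_a, shape_b, strategy):
    """Return a list of problems; an empty list means none were found.

    Assumed, simplified rules:
      * each input's strategy has one entry per dimension of that input;
      * each entry is a positive integer that evenly divides its dimension;
      * on dimensions where neither input has size 1 (no broadcasting),
        both inputs must be split the same way.
    """
    strat_a, strat_b = strategy
    problems = []
    for name, shape, strat in (("a", shape_a, strat_a), ("b", shape_b, strat_b)):
        if len(shape) != len(strat):
            problems.append(f"input {name}: strategy rank does not match shape rank")
            continue
        for dim, (size, cut) in enumerate(zip(shape, strat)):
            if cut < 1 or size % cut != 0:
                problems.append(f"input {name}, dim {dim}: cannot split {size} into {cut} parts")
    if len(shape_a) == len(shape_b) == len(strat_a) == len(strat_b):
        for dim, (sa, sb, ca, cb) in enumerate(zip(shape_a, shape_b, strat_a, strat_b)):
            if ca != cb and sa != 1 and sb != 1:
                problems.append(f"dim {dim}: inputs split differently ({ca} vs {cb})")
    return problems


# Hypothetical shapes, chosen only for illustration.
SHAPE = (8, 16, 128, 1024)

logged = ((4, 8, 1, 1), (1, 1, 1, 11))   # strategy seen in the INFO log
unsplit = ((1, 1, 1, 1), (1, 1, 1, 1))   # strategy with no splitting at all

bad = check_elementwise_strategy(SHAPE, SHAPE, logged)
good = check_elementwise_strategy(SHAPE, SHAPE, unsplit)
for problem in bad:
    print(problem)
print("unsplit strategy problems:", good)
```

Under these assumed shapes the logged strategy is rejected because the two inputs are split differently on dimensions that are not broadcast, while the unsplit strategy passes.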

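4. Solution

# Inside the cell's __init__; P is `from mindspore.ops import operations as P`.
# Leave both inputs of Mul unsplit: every dimension stays whole on each device.
self.mul1 = P.Mul().shard(((1, 1, 1, 1),
                           (1, 1, 1, 1)))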

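When you hit a CheckStrategy failed error of this kind, try changing the sharding strategy of the operator named in the error. The right strategy depends on the actual network and device layout.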
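If leaving the operator completely unsplit is not desirable, one option is to give the second input the same split as the first, keeping broadcast dimensions (size 1) unsplit. The following is a minimal sketch of that idea. The helper mirror_strategy and the shapes are hypothetical, and it assumes both inputs have the same rank. Whether the resulting strategy suits a given network still depends on the device count and on the neighbouring operators.

```python
def mirror_strategy(shape_a, shape_b, strategy_a):
    """Derive a two-input strategy in which input b follows input a.

    Dimensions where b has size 1 (broadcast) are left unsplit; every other
    dimension of b copies the split used for a.
    """
    if not (len(shape_a) == len(shape_b) == len(strategy_a)):
        raise ValueError("this sketch assumes both inputs have the same rank")
    strategy_b = tuple(
        1 if size_b == 1 else cut
        for size_b, cut in zip(shape_b, strategy_a)
    )
    return (tuple(strategy_a), strategy_b)


# Hypothetical shapes for illustration only.
same_shape = mirror_strategy((8, 16, 128, 1024), (8, 16, 128, 1024), (4, 8, 1, 1))
broadcast = mirror_strategy((8, 16, 128, 1024), (8, 1, 128, 1024), (4, 8, 1, 1))
print(same_shape)
print(broadcast)
```

The returned tuple has the same layout as the argument passed to shard in the solution above, so it could be supplied there in place of the all-ones strategy.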
