1. System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
Execution mode: Graph (static graph)
Python version: 3.7
OS platform: Linux
2. Error Message
2.1 Problem Description
Distributed training with model parallelism fails with:
operator Mul init failed or CheckStrategy failed.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.073 [mindspore/ccsrc/frontend/parallel/ops_info/arithmetic_info.cc:89] CheckStrategy] MulInfo7575 : Invalid strategy.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.127 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:885] InitForCostModelWithAutoRepeatCalc] MulInfo7575: CheckStrategy failed.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.140 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:852] Init] MulInfo7575 : Init failed.
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
In file /opt/huawei/schedule-train/algorithm/*/nn/transformer.py:1197/ current_key = self.mul1(self.tile(key, (1, 1, 1, self.src_seq_length)),/
In file /opt/huawei/schedule-train/algorithm/*/transformer.py:1187/ if self.is_first_iteration:/
In file /opt/huawei/schedule-train/algorithm/pangu_sigma/ms/parallel/nn*/angu_sigma/src/pangu_moe.py:133/ attention, layer_present = self.attention(query_vector, input_x, input_x, abs_pos_embed, rel_pos_embed, input_mask,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:379/ encoder_output, _ = self.top_query_layer(encoder_output, top_query_hidden_states, encoder_masks,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:370/ if self.is_pipeline:/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:434/ output_states, word_table = self.backbone(input_ids, input_position, input_rel_pos, attention_mask,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:530/ logits = self.backbone(input_ids, position_ids, None, attention_mask, expert_ids, current_index,/
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/frontend/parallel/step_parallel.cc:1580 ExtractStrategyAndInit
3. Root Cause Analysis
The shard strategy configured for the Mul operator under model parallelism is invalid, which triggers this error. Enabling INFO-level logging shows the following:
[INFO] PARALLEL(283,ffff95abcbf0,python):2023-07-10-17:11:31.981.235 [mindspore/ccsrc/frontend/parallel/step_parallel_utils.cc:1253] ExtractStrategy] Extract information: strategy ((4, 8, 1, 1), (1, 1, 1, 11))
Combining this with the function call stack in the error message, we can locate the exact Mul operator that failed. Adjusting that operator's shard strategy resolves the error.
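The compatibility rule that CheckStrategy enforces for an elementwise operator such as Mul can be approximated in plain Python. The helper below is a simplified, hypothetical sketch (not MindSpore's actual implementation): for same-shaped inputs the two input strategies must split every dimension identically, and the total number of slices must evenly divide the device count. The mismatched pair from the INFO log above fails this check.

```python
def check_elementwise_strategy(s0, s1, device_num):
    """Simplified sketch of the CheckStrategy rule for an elementwise op
    (e.g. Mul) on two same-shaped inputs. Hypothetical helper, not
    MindSpore's real implementation.

    Returns None on success, or an error string on mismatch.
    """
    if len(s0) != len(s1):
        return "Invalid strategy: strategy ranks differ"
    # For same-shaped inputs, the split count of every dimension
    # must match between the two strategies.
    if s0 != s1:
        return f"Invalid strategy: {s0} vs {s1}"
    # The total number of slices must evenly divide the device count.
    slices = 1
    for d in s0:
        slices *= d
    if device_num % slices != 0:
        return f"Invalid strategy: {slices} slices do not divide {device_num} devices"
    return None

# The mismatched strategies from the INFO log (assuming 32 devices):
print(check_elementwise_strategy((4, 8, 1, 1), (1, 1, 1, 11), 32))
# -> Invalid strategy: (4, 8, 1, 1) vs (1, 1, 1, 11)

# The fixed strategy from the solution below: no splitting at all.
print(check_elementwise_strategy((1, 1, 1, 1), (1, 1, 1, 1), 32))
# -> None
```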
4. Solution
self.mul1 = P.Mul().shard(((1, 1, 1, 1),
(1, 1, 1, 1)))
When you encounter this kind of CheckStrategy failed error, try modifying the shard strategy of the reported operator. The appropriate strategy depends on your actual setup, in particular the device count and the shapes of the operator's inputs.
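Since an elementwise Mul on same-shaped inputs needs both strategies to agree dimension by dimension, one practical approach is to derive both tuples from the same data-parallel/model-parallel split. The helper below is a hypothetical illustration (`make_mul_strategy` is not a MindSpore API):

```python
def make_mul_strategy(dp, mp, rank=4):
    """Build matching shard strategies for the two same-shaped inputs
    of an elementwise Mul: split dim 0 by dp (data parallel), dim 1 by
    mp (model parallel), and leave the remaining dims unsplit.
    Hypothetical helper for illustration only.
    """
    strategy = (dp, mp) + (1,) * (rank - 2)
    return (strategy, strategy)

# With 4-way data parallel and 8-way model parallel (32 devices):
print(make_mul_strategy(4, 8))  # ((4, 8, 1, 1), (4, 8, 1, 1))

# No splitting at all, as used in the fix above:
print(make_mul_strategy(1, 1))  # ((1, 1, 1, 1), (1, 1, 1, 1))
```

The resulting pair of tuples can then be passed to `P.Mul().shard(...)`, as shown in the solution.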