1. System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
Execution mode: Graph (static graph)
Python version: 3.7
OS platform: Linux
2. Error Message
2.1 Problem Description
Distributed training with model parallelism fails with:
operator Mul init failed or CheckStrategy failed.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.073 [mindspore/ccsrc/frontend/parallel/ops_info/arithmetic_info.cc:89] CheckStrategy] MulInfo7575 : Invalid strategy.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.127 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:885] InitForCostModelWithAutoRepeatCalc] MulInfo7575: CheckStrategy failed.
[ERROR] PARALLEL(99,ffff90ca1930,python):2023-07-10-13:29:46.150.140 [mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc:852] Init] MulInfo7575 : Init failed.
----------------------------------------------------
- The Function Call Stack: (For framework developers)
----------------------------------------------------
In file /opt/huawei/schedule-train/algorithm/*/nn/transformer.py:1197/ current_key = self.mul1(self.tile(key, (1, 1, 1, self.src_seq_length)),/
In file /opt/huawei/schedule-train/algorithm/*/transformer.py:1187/ if self.is_first_iteration:/
In file /opt/huawei/schedule-train/algorithm/pangu_sigma/ms/parallel/nn*/angu_sigma/src/pangu_moe.py:133/ attention, layer_present = self.attention(query_vector, input_x, input_x, abs_pos_embed, rel_pos_embed, input_mask,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:379/ encoder_output, _ = self.top_query_layer(encoder_output, top_query_hidden_states, encoder_masks,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:370/ if self.is_pipeline:/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:434/ output_states, word_table = self.backbone(input_ids, input_position, input_rel_pos, attention_mask,/
In file /opt/huawei/schedule-train/algorithm/*/main_model.py:530/ logits = self.backbone(input_ids, position_ids, None, attention_mask, expert_ids, current_index,/
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/frontend/parallel/step_parallel.cc:1580 ExtractStrategyAndInit
3. Root Cause Analysis
The shard strategy configured for the Mul operator under model parallelism is invalid, which triggers this error. Enabling INFO-level logging shows the following:
[INFO] PARALLEL(283,ffff95abcbf0,python):2023-07-10-17:11:31.981.235 [mindspore/ccsrc/frontend/parallel/step_parallel_utils.cc:1253] ExtractStrategy] Extract information: strategy ((4, 8, 1, 1), (1, 1, 1, 11))
Combining this with the function call stack in the error message, we can locate the exact Mul operator that failed. Adjusting that operator's shard strategy resolves the error.
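The compatibility rule that CheckStrategy enforces for an elementwise operator such as Mul can be approximated in plain Python. The helper below is a simplified, hypothetical sketch (not MindSpore's actual implementation): for same-shaped inputs the two input strategies must split every dimension identically, and the total number of slices must evenly divide the device count. The mismatched pair from the INFO log above fails this check.

```python
def check_elementwise_strategy(s0, s1, device_num):
    """Simplified sketch of the CheckStrategy rule for an elementwise op
    (e.g. Mul) on two same-shaped inputs. Hypothetical helper, not
    MindSpore's real implementation.

    Returns None on success, or an error string on mismatch.
    """
    if len(s0) != len(s1):
        return "Invalid strategy: strategy ranks differ"
    # For same-shaped inputs, the split count of every dimension
    # must match between the two strategies.
    if s0 != s1:
        return f"Invalid strategy: {s0} vs {s1}"
    # The total number of slices must evenly divide the device count.
    slices = 1
    for d in s0:
        slices *= d
    if device_num % slices != 0:
        return f"Invalid strategy: {slices} slices do not divide {device_num} devices"
    return None

# The mismatched strategies from the INFO log (assuming 32 devices):
print(check_elementwise_strategy((4, 8, 1, 1), (1, 1, 1, 11), 32))
# -> Invalid strategy: (4, 8, 1, 1) vs (1, 1, 1, 11)

# The fixed strategy from the solution below: no splitting at all.
print(check_elementwise_strategy((1, 1, 1, 1), (1, 1, 1, 1), 32))
# -> None
```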
4. Solution
self.mul1 = P.Mul().shard(((1, 1, 1, 1),
(1, 1, 1, 1)))
When you encounter this kind of CheckStrategy failed error, try modifying the shard strategy of the reported operator. The appropriate strategy depends on your actual setup, in particular the device count and the shapes of the operator's inputs.
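Since an elementwise Mul on same-shaped inputs needs both strategies to agree dimension by dimension, one practical approach is to derive both tuples from the same data-parallel/model-parallel split. The helper below is a hypothetical illustration (`make_mul_strategy` is not a MindSpore API):

```python
def make_mul_strategy(dp, mp, rank=4):
    """Build matching shard strategies for the two same-shaped inputs
    of an elementwise Mul: split dim 0 by dp (data parallel), dim 1 by
    mp (model parallel), and leave the remaining dims unsplit.
    Hypothetical helper for illustration only.
    """
    strategy = (dp, mp) + (1,) * (rank - 2)
    return (strategy, strategy)

# With 4-way data parallel and 8-way model parallel (32 devices):
print(make_mul_strategy(4, 8))  # ((4, 8, 1, 1), (4, 8, 1, 1))

# No splitting at all, as used in the fix above:
print(make_mul_strategy(1, 1))  # ((1, 1, 1, 1), (1, 1, 1, 1))
```

The resulting pair of tuples can then be passed to `P.Mul().shard(...)`, as shown in the solution.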