1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.1
Execution mode (PyNative/Graph): any
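2 Error Message
2.1 Problem Description
On Ascend with MindSpore 2.1, when parallelism is set to dp:mp:pp = 2:2:2, every model save reports the following (sometimes it is reported as a WARNING):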
[WARNING] MD(1764057,ffff9bd28b20,python):2023-09-08-20:30:46.738.589 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:115] ~DataQueueOp] preprocess_batch: 600; batch_queue: 8, 8, 8, 8, 8, 8, 8, 8, 8, 8;
push_start_time -> push_end_time
2023-09-08-20:29:44.203.784 -> 2023-09-08-20:29:44.204.298
2023-09-08-20:29:44.204.457 -> 2023-09-08-20:29:44.204.748
2023-09-08-20:29:53.230.823 -> 2023-09-08-20:29:53.231.299
2023-09-08-20:29:53.231.463 -> 2023-09-08-20:29:53.231.726
2023-09-08-20:30:02.229.866 -> 2023-09-08-20:30:02.230.527
2023-09-08-20:30:02.230.696 -> 2023-09-08-20:30:02.231.047
2023-09-08-20:30:11.231.599 -> 2023-09-08-20:30:11.232.006
2023-09-08-20:30:11.232.116 -> 2023-09-08-20:30:11.232.359
2023-09-08-20:30:20.206.672 -> 2023-09-08-20:30:20.207.262
2023-09-08-20:30:20.207.400 -> 2023-09-08-20:30:20.207.788
For more details, please refer to the FAQ at https://www.mindspore.cn/docs/en/master/faq/data_processing.html.
Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/trainer.py", line 424, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/base_trainer.py", line 631, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 709, in _train_dataset_sink_process
    list_callback.on_train_step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 412, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 254, in on_train_step_end
    self.step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_checkpoint.py", line 461, in step_end
    self._save_ckpt(cb_params)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/core/callback/callback.py", line 481, in _save_ckpt
    cb_params.train_network.exec_checkpoint_graph()
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/nn/cell.py", line 976, in exec_checkpoint_graph
    cell_graph_executor(self, phase='save')
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/common/api.py", line 1672, in __call__
    return self.run(obj, *args, phase=phase)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/common/api.py", line 17??, in run
    return self._graph_executor((), exe_phase)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 88, in checkpoint_cb_for_save_op
    _fill_param_into_net(CUR_NET, parameter_list)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 68, in _fill_param_into_net
    load_param_into_net(net, parameter_dict, strict_load=True)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/serialization.py", line 1222, in load_param_into_net
    _update_param(param, new_param, strict_load)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/serialization.py", line 125, in _update_param
RuntimeError: For 'load_param_into_net', accu_grads.backbone.blocks.5.attention.projection.weight in the argument 'net' should have the same shape as accu_grads.backbone.blocks.5.attention.projection.weight in the argument 'parameter_dict'. But got its shape (1536, 6144) in the argument 'net' and shape (3072, 6144) in the argument 'parameter_dict'.May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
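3 Root Cause Analysis
Saving the model goes through _save_ckpt() in callback.py. When running on Ascend, execution enters the first if branch in the code below, and that branch is what triggers the error.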
self._last_triggered_step = cb_params.cur_step_num
if context.get_context("enable_ge") and os.getenv("MS_ENABLE_REF_MODE", "0") == "0":
    set_cur_net(cb_params.train_network)
    cb_params.train_network.exec_checkpoint_graph()
if "epoch_num" in self._append_dict:
    self._append_dict["epoch_num"] = self._append_epoch_num + cb_params.cur_epoch_num
if "step_num" in self._append_dict:
    self._append_dict["step_num"] = self._append_step_num + cb_params.cur_epoch_num * cb_params.batch_num
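The shapes in the error message fit one possible reading, stated here as an assumption rather than something the analysis above confirms: with mp = 2, the parameter held by 'net' is a model-parallel slice, while 'parameter_dict' carries the full shape (3072 / 2 = 1536). A minimal sketch of that arithmetic check follows; sliced_axis is a hypothetical helper written for illustration, not a MindSpore API.

```python
from typing import Optional, Sequence


def sliced_axis(net_shape: Sequence[int],
                dict_shape: Sequence[int],
                mp: int) -> Optional[int]:
    """Return the single axis on which dict_shape == net_shape * mp, else None.

    Hypothetical helper: it only checks whether a shape mismatch looks like
    an mp-way slice along exactly one axis.
    """
    if len(net_shape) != len(dict_shape):
        return None
    # Axes on which the two shapes disagree.
    diff = [i for i, (n, d) in enumerate(zip(net_shape, dict_shape)) if n != d]
    if len(diff) != 1:
        return None
    axis = diff[0]
    return axis if net_shape[axis] * mp == dict_shape[axis] else None


# Shapes taken from the RuntimeError above, with mp = 2.
print(sliced_axis((1536, 6144), (3072, 6144), mp=2))  # 0
```

If the helper returns an axis, the mismatch is at least numerically consistent with a sliced-versus-full parameter; it does not prove that this is what happened inside exec_checkpoint_graph().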
4 Solution
There are currently two workarounds; both aim to keep execution out of this conditional branch:
self._last_triggered_step = cb_params.cur_step_num
if context.get_context("enable_ge") and os.getenv("MS_ENABLE_REF_MODE", "0") == "0":
    set_cur_net(cb_params.train_network)
    cb_params.train_network.exec_checkpoint_graph()
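1) Comment out the if block shown above; this resolves the problem.
2) Set export MS_ENABLE_REF_MODE=1, in which case the code above does not need to be commented out. Note: in some environments setting MS_ENABLE_REF_MODE=1 may itself raise an error, possibly because of the CANN version or similar factors, so use it only in an environment that supports it.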
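To see why the second workaround skips the branch, the condition can be mirrored in a small standalone function. This is only a sketch: enters_checkpoint_graph_branch is a hypothetical name, and the enable_ge flag is passed in as a plain bool instead of being read from context.get_context("enable_ge"), so the snippet runs without MindSpore installed.

```python
import os
from typing import Mapping


def enters_checkpoint_graph_branch(enable_ge: bool,
                                   env: Mapping[str, str] = os.environ) -> bool:
    """Mirror the `if` condition quoted in the callback code.

    The branch is taken only when GE is enabled and MS_ENABLE_REF_MODE is
    unset or equal to "0" (the digit zero).
    """
    return bool(enable_ge) and env.get("MS_ENABLE_REF_MODE", "0") == "0"


# Python-side equivalent of `export MS_ENABLE_REF_MODE=1` for this process.
os.environ["MS_ENABLE_REF_MODE"] = "1"
print(enters_checkpoint_graph_branch(enable_ge=True))  # False
```

Setting the variable from Python is shown only to make the check concrete. Other parts of the stack may read it at startup, so exporting it in the shell before launching training, as described in option 2, is the safer route.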