MindSpore checkpoint-saving error: "need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same"

System Environment

Hardware (Ascend/GPU/CPU): Ascend

MindSpore version: 2.1

Execution mode (PyNative/Graph): either

Error Message

Problem Description

On Ascend with MindSpore 2.1, with parallelism set to dp:mp:pp = 2:2:2, the following message is printed every time the model is saved (sometimes only as a WARNING):

[WARNING] MD(1764057,ffff9bd28b20,python):2023-09-08-20:30:46.738.589 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:115] ~DataQueueOp] preprocess_batch: 600; batch_queue: 8, 8, 8, 8, 8, 8, 8, 8, 8, 8;

push_start_time -> push_end_time
2023-09-08-20:29:44.203.784 -> 2023-09-08-20:29:44.204.298
2023-09-08-20:29:44.204.457 -> 2023-09-08-20:29:44.204.748
2023-09-08-20:29:53.230.823 -> 2023-09-08-20:29:53.231.299
2023-09-08-20:29:53.231.463 -> 2023-09-08-20:29:53.231.726
2023-09-08-20:30:02.229.866 -> 2023-09-08-20:30:02.230.527
2023-09-08-20:30:02.230.696 -> 2023-09-08-20:30:02.231.047
2023-09-08-20:30:11.231.599 -> 2023-09-08-20:30:11.232.006
2023-09-08-20:30:11.232.116 -> 2023-09-08-20:30:11.232.359
2023-09-08-20:30:20.206.672 -> 2023-09-08-20:30:20.207.262
2023-09-08-20:30:20.207.400 -> 2023-09-08-20:30:20.207.788
For more details, please refer to the FAQ at https://www.mindspore.cn/docs/en/master/faq/data_processing.html.

Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/trainer.py", line 424, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 104, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/trainer/base_trainer.py", line 631, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 620, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/model.py", line 709, in _train_dataset_sink_process
    list_callback.on_train_step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 412, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 254, in on_train_step_end
    self.step_end(run_context)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_checkpoint.py", line 461, in step_end
    self._save_ckpt(cb_params)
  File "/home/wizardcoder/1_wizardcoder-mindformers/mindformers/core/callback/callback.py", line 481, in _save_ckpt
    cb_params.train_network.exec_checkpoint_graph()
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/nn/cell.py", line 976, in exec_checkpoint_graph
    cell_graph_executor(self, phase='save')
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/common/api.py", line 1672, in __call__
    return self.run(obj, *args, phase=phase)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/common/api.py", line 17??, in run
    return self._graph_executor((), exe_phase)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 88, in checkpoint_cb_for_save_op
    _fill_param_into_net(CUR_NET, parameter_list)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 68, in _fill_param_into_net
    load_param_into_net(net, parameter_dict, strict_load=True)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/serialization.py", line 1222, in load_param_into_net
    _update_param(param, new_param, strict_load)
  File "/home/xxx/miniconda3/envs/test0823/lib/python3.7/site-packages/mindspore/train/serialization.py", line 125, in _update_param
RuntimeError: For 'load_param_into_net', accu_grads.backbone.blocks.5.attention.projection.weight in the argument 'net' should have the same shape as accu_grads.backbone.blocks.5.attention.projection.weight in the argument 'parameter_dict'. But got its shape (1536, 6144) in the argument 'net' and shape (3072, 6144) in the argument 'parameter_dict'. May you need to check whether the checkpoint you loaded is correct or the batch size and so on in the 'net' and 'parameter_dict' are same.
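The error above pinpoints a single parameter whose shape differs between the live network and the parameter dict being loaded back. When debugging this class of failure, it can help to diff all parameter shapes at once instead of failing on the first mismatch. The sketch below is illustrative only: it uses plain dicts mapping parameter name to shape tuple as stand-ins for MindSpore's `Cell` parameters and checkpoint dict, and the helper name `diff_param_shapes` is made up for this example, not a MindSpore API.

```python
def diff_param_shapes(net_shapes, ckpt_shapes):
    """Return {name: (net_shape, ckpt_shape)} for every parameter whose
    shape in the network disagrees with the shape in the checkpoint dict."""
    mismatches = {}
    for name, net_shape in net_shapes.items():
        ckpt_shape = ckpt_shapes.get(name)
        if ckpt_shape is not None and ckpt_shape != net_shape:
            mismatches[name] = (net_shape, ckpt_shape)
    return mismatches

# Reproducing the shapes from the error message above:
net = {"accu_grads.backbone.blocks.5.attention.projection.weight": (1536, 6144)}
ckpt = {"accu_grads.backbone.blocks.5.attention.projection.weight": (3072, 6144)}
print(diff_param_shapes(net, ckpt))
```

With real MindSpore objects you would build the two dicts from `net.parameters_dict()` and the loaded checkpoint before calling `load_param_into_net`, so a full list of mismatches (here a 2x factor consistent with the model-parallel split) is visible up front.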

Root Cause Analysis

Saving the model goes through _save_ckpt() in callback.py. When running on Ascend, execution enters the first if branch of the code below, which triggers the error.

self._last_triggered_step = cb_params.cur_step_num
if context.get_context("enable_ge") and os.getenv("MS_ENABLE_REF_MODE", "0") == "0":
    set_cur_net(cb_params.train_network)
    cb_params.train_network.exec_checkpoint_graph()
if "epoch_num" in self._append_dict:
    self._append_dict["epoch_num"] = self._append_epoch_num + cb_params.cur_epoch_num
if "step_num" in self._append_dict:
    self._append_dict["step_num"] = self._append_step_num + cb_params.cur_epoch_num * cb_params.batch_num
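The faulty branch is guarded only by the GE flag and the MS_ENABLE_REF_MODE environment variable, so whether it is entered can be checked in isolation. The sketch below models that condition with a plain boolean standing in for `context.get_context("enable_ge")` and a dict standing in for the process environment; the function name is invented for illustration and is not part of MindSpore.

```python
import os

def takes_checkpoint_graph_branch(enable_ge, env=None):
    """Mirror the guard on the exec_checkpoint_graph() branch above:
    it is entered only when GE is enabled AND MS_ENABLE_REF_MODE is
    unset or "0" (os.getenv defaults to "0" when the variable is absent)."""
    if env is None:
        env = os.environ
    return bool(enable_ge) and env.get("MS_ENABLE_REF_MODE", "0") == "0"

# With GE enabled and the variable unset, the faulty branch is entered:
print(takes_checkpoint_graph_branch(True, {}))                            # True
# Setting MS_ENABLE_REF_MODE=1 skips it:
print(takes_checkpoint_graph_branch(True, {"MS_ENABLE_REF_MODE": "1"}))   # False
```

This makes the logic of both workarounds below concrete: either the branch body is removed outright, or the environment variable flips the guard to False.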

Solution

There are currently two workarounds (both with the same goal: prevent execution from entering this branch):

self._last_triggered_step = cb_params.cur_step_num
if context.get_context("enable_ge") and os.getenv("MS_ENABLE_REF_MODE", "0") == "0":
    set_cur_net(cb_params.train_network)
    cb_params.train_network.exec_checkpoint_graph()

1) Comment out the if branch above; this resolves the problem.

2) Set export MS_ENABLE_REF_MODE=1, in which case the code above does not need to be commented out. Note: in some environments, setting MS_ENABLE_REF_MODE=1 may itself cause errors, possibly due to the CANN version, so apply it only in an environment that supports it.
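For workaround 2, the variable must be set before MindSpore initializes, either with `export MS_ENABLE_REF_MODE=1` in the launch shell or programmatically at the very top of the entry script (e.g. run_wizardcoder.py from the traceback), before `import mindspore`. A minimal sketch:

```python
import os

# Set MS_ENABLE_REF_MODE before MindSpore is imported/initialized, so that
# _save_ckpt's guard `os.getenv("MS_ENABLE_REF_MODE", "0") == "0"` is False
# and the exec_checkpoint_graph() branch is skipped.
os.environ["MS_ENABLE_REF_MODE"] = "1"

print(os.getenv("MS_ENABLE_REF_MODE", "0"))
```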
