RuntimeError: Stage num is 8 is not equal to stage used: 5 when the parallel strategy is 1:1:8

System environment

Hardware environment (Ascend/GPU/CPU): Ascend

MindSpore version: 2.2.0

Execution mode (PyNative/Graph): any

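Error information

2.1 Problem description

The model runs a 4-layer network with the parallel strategy set to dp:mp:pp = 1:1:8 (data parallel : model parallel : pipeline stages), and compilation fails with the error below.

2.2 Error message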

Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 148, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 90, in main
    task.finetune(finetune_checkpoint=config.load_checkpoint, auto_trans_ckpt=config.auto_trans_ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/trainer.py", line 522, in finetune
    is_full_config=True, **kwargs
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 106, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/base_trainer.py", line 616, in training_process
    transform_and_load_checkpoint(config, model, network, dataset)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/utils.py", line 300, in transform_and_load_checkpoint
    build_model(config, model, dataset, do_eval=do_eval, do_predict=do_predict)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/utils.py", line 330, in build_model
    sink_size=config.runner_config.sink_size)
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 1263, in build
    self._init(train_dataset, valid_dataset, sink_size, epoch)
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 524, in _init
    train_network.compile(*inputs)
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/nn/cell.py", line 939, in compile
    jit_config_dict=self._jit_config_dict, *compile_args, **kwargs)
  File "/root/miniconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/common/api.py", line 1623, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Stage num is 8 is not equal to stage used: 5

Root cause analysis

The model has only 4 layers, so its layers cannot be split across 8 pipeline stages (pipeline_stage = 8).
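The sketch below shows one plausible way the count of 5 in the error message could arise. It is an illustration only: the placement rule (layers spread over stages in order, output head pinned to the last stage) and the function name `stages_used` are assumptions made here, not something the original post or the traceback confirms.

```python
def stages_used(num_layers: int, pipeline_stage: int) -> set:
    """Return the stage indices that receive at least one cell.

    Assumed placement rule (hypothetical, for illustration only):
      * layers are assigned to stages in order, at least one layer per step
      * the output head is pinned to the last stage
    """
    layers_per_stage = max((num_layers + 1) // pipeline_stage, 1)
    used = set()
    for layer_id in range(num_layers):
        used.add(min(layer_id // layers_per_stage, pipeline_stage - 1))
    used.add(pipeline_stage - 1)  # output head on the last stage (assumption)
    return used


# 4 layers over 8 stages: layers land on stages 0-3, the head on stage 7,
# so stages 4-6 stay empty and only 5 of the 8 stages are used.
print(sorted(stages_used(num_layers=4, pipeline_stage=8)))  # [0, 1, 2, 3, 7]

# 4 layers over 4 stages: every stage is used.
print(sorted(stages_used(num_layers=4, pipeline_stage=4)))  # [0, 1, 2, 3]
```

Under this assumed rule, any pipeline_stage larger than num_layers leaves some stages without a layer, which matches the mismatch reported by the compiler.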

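Solution

pipeline_stage must be less than or equal to num_layers. For the 4-layer model here, set pipeline_stage to 4 or fewer, or increase num_layers to at least 8 if pipeline_stage must stay at 8.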
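The condition can be checked before launching a job. The following is a minimal sketch: `check_pipeline_config` is a hypothetical helper written for this note, not a MindSpore or MindFormers API, and it only enforces the pipeline_stage <= num_layers rule stated above.

```python
def check_pipeline_config(num_layers: int, pipeline_stage: int) -> None:
    """Raise ValueError if the pipeline setting cannot fit the model depth.

    Hypothetical pre-flight helper; it only checks the rule from this post.
    """
    if pipeline_stage < 1:
        raise ValueError(f"pipeline_stage must be >= 1, got {pipeline_stage}")
    if pipeline_stage > num_layers:
        raise ValueError(
            f"pipeline_stage ({pipeline_stage}) must be <= "
            f"num_layers ({num_layers})"
        )


# A 4-layer model with 4 pipeline stages satisfies the rule.
check_pipeline_config(num_layers=4, pipeline_stage=4)

# The failing setup from this post (4 layers, dp:mp:pp = 1:1:8) is rejected early.
try:
    check_pipeline_config(num_layers=4, pipeline_stage=8)
except ValueError as err:
    print(err)
```

Running a check like this on the values read from the training config surfaces the problem before graph compilation, where the error is harder to interpret.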
