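MindSpore: ValueError: not enough values to unpack (expected 4, got 0) when summary is enabled

System environment

Hardware environment (Ascend/GPU/CPU): Ascend

MindSpore version: 2.1.1

Execution mode (PyNative/Graph): any

Error information

2.1 Problem description

In an Ascend + MindSpore 2.1.1 environment, full-parameter training was configured on 2 machines with 16 cards. With profiling enabled the training starts, but it fails while the profiler is collecting performance data, with the following error: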

Thu 21 Sep 2023 11:00:57 [INFO] [MSVP] [2328699] msprof_common.py: start analyzing data in "/home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/PROF_000001_20230921085903388_FMDGQIRGMNNQRFFB/device_7"
Thu 21 Sep 2023 11:00:57 [INFO] [MSVP] [2328699] msprof_common.py: It may take few minutes, please be patient
Thu 21 Sep 2023 11:01:07 [INFO] [MSVP] [2328699] msprof_common.py: Analysis data in "/home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/PROF_000001_20230921085903388_FMDGQIRGMNNQRFFB/device_7" finished
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.598.182 [mindspore/profiler/profiling.py:1102] [Profiler] Can not found cube fops and vector fops data in the summary
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.994.536 [mindspore/profiler/parser/memory_usage_parser.py:135] The memory file does not exist! Please ignore the warning if you are running heterogeneous training.
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.994.877 [mindspore/profiler/profiling.py:1134] The file </home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/memory_usage_15.pb> not found
Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt,resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/trainer.py", line 423, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 106, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/base_trainer.py", line 644, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 613, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 921, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 412, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 254, in on_train_step_end
    self.step_end(run_context)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/core/callback/callback.py", line 630, in step_end
    self.profiler.analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 579, in analyse
    self._ascend_analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 970, in _ascend_analyse
    self._ascend_graph_analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 1205, in _ascend_graph_analyse
    self._ascend_graph_hccl_analyse(source_path)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 1153, in _ascend_graph_hccl_analyse
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 148, in parse
    raw = self._iteration_analyse(hccl_detail_data, iteration_id)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 222, in _iteration_analyse
    link_info = self._link_info_analyse(hccl_detail_data)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 247, in _link_info_analyse
    transport_information['RDMA'] = self._rdma_analyse(groupby_transport)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 102, in _rdma_analyse
    thread_groups, _, _, _ = np.unique(groupby_transport['tid'])
ValueError: not enough values to unpack (expected 4, got 0)

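Root cause analysis

The failing statement unpacks the result of np.unique(groupby_transport['tid']) into four names. Called without any return_* arguments, np.unique returns a single array of unique values, so the unpacking iterates over that array. "got 0" therefore means the array was empty: the RDMA group being analysed contained no tid values. Why that data is empty is probably related to network communication between the two machines.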
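The unpacking failure in the traceback can be reproduced without MindSpore or NumPy. The sketch below is only an illustration: unique_sorted is a hypothetical pure-Python stand-in for np.unique called without return_* arguments (one sequence of sorted unique values comes back), and unpack_like_line_102 and error_for are names made up for this example, not part of MindSpore.

```python
def unique_sorted(values):
    """Hypothetical stand-in for np.unique(values) with no return_* flags:
    it returns ONE sequence of sorted unique values, not a 4-tuple."""
    return sorted(set(values))


def unpack_like_line_102(tids):
    """Mimics the failing statement from the traceback."""
    thread_groups, _, _, _ = unique_sorted(tids)
    return thread_groups


def error_for(tids):
    """Return the ValueError message raised for the given tids, or None."""
    try:
        unpack_like_line_102(tids)
    except ValueError as exc:
        return str(exc)
    return None


print(error_for([]))            # no tid values at all: the case from the report
print(error_for([7, 7, 9]))     # two unique thread ids
print(error_for([1, 2, 3, 4]))  # exactly four unique ids unpack "by accident"
```

With an empty input the message matches the one in the report. The statement only succeeds when there happen to be exactly four unique values, which suggests the four-name unpacking itself is the defect rather than the empty data alone.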

Solution

File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 102, in _rdma_analyse
    thread_groups, _, _, _ = np.unique(groupby_transport['tid'])
ValueError: not enough values to unpack (expected 4, got 0)

The fix is to remove the extra "_, _, _" targets from the statement shown in the error above, so that the whole result of np.unique is assigned to thread_groups.
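A minimal sketch of what the statement looks like after the fix, assuming the intent is to bind the whole result to a single name. unique_tids is a hypothetical pure-Python stand-in for np.unique, and collect_thread_groups is an illustrative name, not a MindSpore function.

```python
def unique_tids(tids):
    """Hypothetical stand-in for np.unique(tids): one sorted sequence of unique values."""
    return sorted(set(tids))


def collect_thread_groups(tids):
    """Patched form of the statement: bind the whole result to one name."""
    thread_groups = unique_tids(tids)
    return thread_groups


# An empty RDMA group now yields an empty result instead of raising.
print(collect_thread_groups([]))         # []
print(collect_thread_groups([9, 7, 7]))  # [7, 9]
```

Any loop over thread_groups then simply does nothing for an empty group. How the rest of _rdma_analyse behaves with empty data is not shown in the traceback, so that part remains an assumption.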
