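1 System Environment
Hardware environment (Ascend/GPU/CPU): Ascend
MindSpore version: 2.1.1
Execution mode (PyNative/Graph): any
2 Error Message
2.1 Problem Description
In an Ascend + MindSpore 2.1.1 environment, full-parameter training was configured on 2 machines with 16 cards. With profiling enabled, training starts normally, but an error is raised when the profiler collects and analyses the performance data. The output is as follows: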
Thu 21 Sep 2023 11:00:57 [INFO] [MSVP] [2328699] msprof_common.py: start analyzing data in "/home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/PROF_000001_20230921085903388_FMDGQIRGMNNQRFFB/device_7"
Thu 21 Sep 2023 11:00:57 [INFO] [MSVP] [2328699] msprof_common.py: It may take few minutes, please be patient
Thu 21 Sep 2023 11:01:07 [INFO] [MSVP] [2328699] msprof_common.py: Analysis data in "/home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/PROF_000001_20230921085903388_FMDGQIRGMNNQRFFB/device_7" finished
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.598.182 [mindspore/profiler/profiling.py:1102] [Profiler] Can not found cube fops and vector fops data in the summary
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.994.536 [mindspore/profiler/parser/memory_usage_parser.py:135] The memory file does not exist! Please ignore the warning if you are running heterogeneous training.
[WARNING] ME(1932377:281473783364672,MainProcess):2023-09-21-11:04:05.994.877 [mindspore/profiler/profiling.py:1134] The file </home/wizardcoder/1_wizardcoder-mindformers-916/research/output/profile/rank_15/profiler/memory_usage_15.pb> not found
Traceback (most recent call last):
  File "wizardcoder/run_wizardcoder.py", line 149, in <module>
    device_id=args.device_id)
  File "wizardcoder/run_wizardcoder.py", line 81, in main
    task.train(train_checkpoint=ckpt, resume=resume)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/trainer.py", line 423, in train
    is_full_config=True, **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 106, in train
    **kwargs)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/trainer/base_trainer.py", line 644, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 1066, in train
    initial_epoch=initial_epoch)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 113, in wrapper
    func(self, *args, **kwargs)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 613, in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/model.py", line 921, in _train_process
    list_callback.on_train_step_end(run_context)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 412, in on_train_step_end
    cb.on_train_step_end(run_context)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/train/callback/_callback.py", line 254, in on_train_step_end
    self.step_end(run_context)
  File "/home/wizardcoder/1_wizardcoder-mindformers-916/mindformers/core/callback/callback.py", line 630, in step_end
    self.profiler.analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 579, in analyse
    self._ascend_analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 970, in _ascend_analyse
    self._ascend_graph_analyse()
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 1205, in _ascend_graph_analyse
    self._ascend_graph_hccl_analyse(source_path)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 1153, in _ascend_graph_hccl_analyse
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 148, in parse
    raw = self._iteration_analyse(hccl_detail_data, iteration_id)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 222, in _iteration_analyse
    link_info = self._link_info_analyse(hccl_detail_data)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 247, in _link_info_analyse
    transport_information['RDMA'] = self._rdma_analyse(groupby_transport)
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 102, in _rdma_analyse
    thread_groups, _, _, _ = np.unique(groupby_transport['tid'])
ValueError: not enough values to unpack (expected 4, got 0)
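3 Root Cause Analysis
The problem is probably related to network communication between the two machines. The ValueError itself shows that np.unique(groupby_transport['tid']) returned an empty array while the statement tried to unpack it into four names, so the RDMA data being parsed for this rank contained no 'tid' entries.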
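The unpacking failure can be reproduced outside MindSpore. The following is a minimal sketch, assuming only NumPy; the function name `unpack_like_original` and the sample arrays are hypothetical stand-ins for the profiler's real data, not MindSpore code.

```python
import numpy as np


def unpack_like_original(tids):
    """Mimic the failing statement: unpack np.unique's result into four names."""
    # np.unique returns ONE array by default, so this unpacks the array's
    # elements and only succeeds when there are exactly four unique values.
    thread_groups, _, _, _ = np.unique(tids)
    return thread_groups


# Hypothetical input: no RDMA records, so the 'tid' column is empty.
empty_tids = np.array([], dtype=np.int64)

try:
    unpack_like_original(empty_tids)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With exactly four unique values the statement runs, but thread_groups is
# just the smallest value, not the array of thread ids.
print(unpack_like_original(np.array([4, 1, 3, 2, 1])))
```

The sketch suggests the statement is fragile for any input that does not have exactly four unique thread ids, not only for the empty case seen here.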
4 Solution
  File "/root/anaconda3/envs/wizardcoder/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_hccl_generator.py", line 102, in _rdma_analyse
    thread_groups, _, _, _ = np.unique(groupby_transport['tid'])
ValueError: not enough values to unpack (expected 4, got 0)
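The fix is to remove the "_, _, _" from the statement shown in the error above (line 102 of ascend_hccl_generator.py), so that it reads: thread_groups = np.unique(groupby_transport['tid']).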
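As a rough illustration of why the edited statement no longer raises, the sketch below assigns the result of `np.unique` to a single name. It assumes only NumPy; `thread_groups_fixed` and the sample arrays are hypothetical and stand in for the patched line, not for the rest of `_rdma_analyse`, which the post does not show.

```python
import numpy as np


def thread_groups_fixed(tids):
    """Patched form of the statement: keep the whole array of unique ids."""
    thread_groups = np.unique(tids)
    return thread_groups


# Hypothetical empty 'tid' column: the result is an empty array, no exception.
empty_result = thread_groups_fixed(np.array([], dtype=np.int64))
print(empty_result.size)

# Hypothetical non-empty column: the sorted unique thread ids come back.
print(thread_groups_fixed(np.array([7, 7, 9])).tolist())

# Code that loops over thread_groups simply runs zero iterations when empty.
for tid in empty_result:
    print("never reached for empty input", tid)
```

Whether the remaining profiler code produces meaningful RDMA statistics from an empty result is not covered by the post, so the sketch only demonstrates that the unpacking error disappears.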