Dataset processing problem when developing the SlowFast operator on the ModelArts platform
[10/04 14:43:43][INFO] start copy.py: 299: ============== Starting Training ==============
[10/04 14:43:43][INFO] start copy.py: 301: total_epoch=20, steps_per_epoch=101
[WARNING] MD(178,fffba4ff91e0,python):2022-10-04-14:44:30.306.953 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:725] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
[ERROR] MD(178,ffff60c791e0,python):2022-10-04-14:45:25.453.944 [mindspore/ccsrc/minddata/dataset/util/task.cc:67] operator()] Task: GeneratorOp(ID:3) - thread(281472305435104) is terminated with err msg: Exception thrown from PyFunc. Exception: Generator worker process timeout.
At:
/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/dataset/engine/datasets.py(3841): process
Line of code : 195
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS@2/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
[ERROR] MD(178,ffff60c791e0,python):2022-10-04-14:45:25.454.325 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Exception thrown from PyFunc. Exception: Generator worker process timeout.
At:
/home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/dataset/engine/datasets.py(3841): process
Line of code : 195
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS@2/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
[WARNING] CORE(178,ffffaff20170,python):2022-10-04-14:48:20.618.138 [mindspore/core/ir/anf_extends.cc:65] fullname_with_scope] Input 0 of cnode is not a value node, its type is CNode.
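The log shows that dataset processing timed out, yet the same dataset runs without problems on the OpenI (启智) platform.
The OpenI run used MindSpore 1.7, whereas Huawei Cloud ModelArts uses MindSpore 1.5.1. Could this version difference be the cause? Is there any other way to work around it?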
****************************************************Answer*****************************************************
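Judging from the error, the Python function feeding the GeneratorDataset takes too long to execute, so the generator worker times out. You could try the following:
1. Set python_multiprocessing to True in GeneratorDataset.
2. Set num_parallel_workers of GeneratorDataset to a larger value (the default should be 1).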
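The two suggestions can be sketched roughly as follows. This is only an illustration: `SlowSource` is a hypothetical stand-in for the real SlowFast video dataset, the column name "video" and the worker count of 8 are assumptions, and the batch-time estimate assumes ideal linear scaling across workers, which real pipelines rarely reach. Timing a few samples in plain Python first shows whether the per-sample cost alone can explain the 25-second warning.

```python
import time


class SlowSource:
    """Hypothetical random-access source standing in for the video dataset."""

    def __init__(self, size=8, delay=0.01):
        self.size = size
        self.delay = delay  # simulated decode/preprocess cost per sample, in seconds

    def __getitem__(self, index):
        time.sleep(self.delay)  # placeholder for the real video decoding work
        return (index,)

    def __len__(self):
        return self.size


def time_per_sample(source, samples=5):
    """Average seconds needed to fetch one sample directly, outside MindSpore."""
    count = min(samples, len(source))
    start = time.perf_counter()
    for i in range(count):
        source[i]
    return (time.perf_counter() - start) / count


def estimate_batch_seconds(per_sample, batch_size, workers):
    """Rough batch fetch time, assuming work splits evenly across workers."""
    return per_sample * batch_size / workers


def build_dataset(source, workers=8):
    """Create the dataset with both suggested settings applied."""
    # Deferred import so the timing helpers above run without MindSpore installed.
    import mindspore.dataset as ds

    return ds.GeneratorDataset(
        source,
        column_names=["video"],        # assumed column name
        num_parallel_workers=workers,  # suggestion 2: more than one worker
        python_multiprocessing=True,   # suggestion 1: use worker processes
    )


if __name__ == "__main__":
    per_sample = time_per_sample(SlowSource())
    print("seconds per sample:", per_sample)
    print("estimated seconds per batch of 16 with 8 workers:",
          estimate_batch_seconds(per_sample, 16, 8))
```

If the estimated batch time with one worker is already near or above the 25 seconds mentioned in the warning, raising num_parallel_workers is the first thing to try. If it remains high even with many workers, the per-sample processing itself probably needs to be made cheaper.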