Notes on Using ModelArts

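Goal: reproduce the paper avatarGAN on a Huawei NPU while preserving accuracy.
The tools all felt unfamiliar, and the getting-started tutorials did not match what I saw in many places. I forced myself through it once as an exercise in patience.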

Starting point: the NPU code and the dataset are ready (uploaded to my own OBS directory), and the ModelArts plugin is installed.

Training the model

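  1. Modify the training script as described in the tutorial.
  2. In the PyCharm toolbar, choose “ModelArts > Edit Training Job Configuration”.
  3. In the dialog that opens, the parameters mean the following.
  • “Job Name”: generated automatically. You can also set it yourself the first time you submit a training job.
  • “AI Engine”: the training framework and its version. This task trains with TensorFlow on Ascend AI processors.
  • “Algorithm Source”: choose “Frequently-used”, meaning a commonly used framework.
  • “Boot File Path”: choose the local training script “main.py”.
  • “Code Directory”: choose the “src” directory that contains the boot script. (It is filled in automatically once “Boot File Path” is set.)
  • “OBS Path”: the output path, used to store the trained model and log files.
  • “Data Path in OBS”: the OBS directory the data was uploaded to. This must be the full OBS path, including the bucket name.
  • “Specifications”: the resource flavor. The tutorial says GPU; since this job runs on Ascend, pick the flavor that matches the selected engine.
  • “Running Parameters”: the input arguments that the training script needs.
    When everything is filled in, click “Apply and Run” to submit the training job to ModelArts in the cloud.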

Troubleshooting: loading the dataset

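After submission the job reported "job is failed", and the error output was too sparse to pinpoint the cause. My first suspicion was that the data-loading part of the training script had not been adapted correctly.
Original code (on a Linux server): reads the dataset directly from local disk.
Current code: has ModelArts fetch the dataset from OBS.
In other words, every data-loading call in the script assumed a local path on the Linux server. These calls need to change so that the data is first copied from OBS to ModelArts and then read locally inside ModelArts.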

  1. Copy from OBS to ModelArts
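import argparse
import os

import moxing as mox

# Parse the data_url input argument
parser = argparse.ArgumentParser()
parser.add_argument("--data_url", type=str, default="./dataset")
config = parser.parse_args()
# Create the data directory inside the ModelArts container
# (exist_ok avoids an error if the directory is already there)
data_dir = "/cache/dataset"
os.makedirs(data_dir, exist_ok=True)
# Copy the data from OBS into the ModelArts container
mox.file.copy_parallel(config.data_url, data_dir)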

The fields in Training Job Configuration map to run parameters: --train_url corresponds to "OBS Path", and --data_url corresponds to "Data Path in OBS".
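To make that mapping concrete, here is a minimal sketch of a boot script reading both parameters. The helper name parse_job_args, the default values, and the example bucket paths are assumptions for illustration; only the flag names --data_url and --train_url come from the configuration above.

```python
import argparse


def parse_job_args(argv=None):
    """Parse the two paths that the job configuration passes to the boot script."""
    parser = argparse.ArgumentParser()
    # "Data Path in OBS" in the dialog
    parser.add_argument("--data_url", type=str, default="./dataset")
    # "OBS Path" in the dialog
    parser.add_argument("--train_url", type=str, default="./output")
    # parse_known_args tolerates flags that are not declared here
    args, _unknown = parser.parse_known_args(argv)
    return args


# Hypothetical values, standing in for what the platform would pass
args = parse_job_args([
    "--data_url", "obs://example-bucket/datasets/",
    "--train_url", "obs://example-bucket/output/",
])
print(args.data_url, args.train_url)
```

Using parse_known_args rather than parse_args is a defensive choice here: the script keeps running even if it receives arguments it does not know about.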
  2. Read the data inside ModelArts
Update every place that loads data so that it reads from the local copy. That resolved the problem.
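One way to keep that change in a single place is to route every dataset lookup through a small helper that takes the local data directory as an argument. This is only a sketch: list_images, the "trainA" subfolder name, and the .jpg extension are hypothetical and not taken from the original code.

```python
import glob
import os


def list_images(data_dir, subset, ext="jpg"):
    """Return the sorted image paths under <data_dir>/<subset>."""
    pattern = os.path.join(data_dir, subset, "*." + ext)
    return sorted(glob.glob(pattern))


# Inside ModelArts, data_dir would be the local copy (for example /cache/dataset),
# not the OBS URL and not the old path on the Linux server.
train_a = list_images("/cache/dataset", "trainA")
print(len(train_a), "images found")
```

With this layout, switching between the Linux server and ModelArts only means changing the data_dir value.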

Troubleshooting: saving outputs

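Once the code ran normally, no images appeared in output.
As before, the data is generated inside ModelArts first and then transferred to OBS for storage, so I suspected that this step had gone wrong.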
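The reverse direction can be sketched the same way as the download step: write results to a local directory, then copy that directory to the --train_url location. The helper name upload_outputs, the copy_fn parameter, and the example paths are assumptions; mox.file.copy_parallel is the same call used above for the download.

```python
def upload_outputs(local_dir, train_url, copy_fn=None):
    """Copy everything under local_dir to train_url (an OBS path)."""
    if copy_fn is None:
        # moxing is only available inside the ModelArts environment,
        # so it is imported lazily here.
        import moxing as mox
        copy_fn = mox.file.copy_parallel
    copy_fn(local_dir, train_url)


# Dry run with a stand-in copy function, so the sketch also runs outside ModelArts
calls = []
upload_outputs("/cache/output", "obs://example-bucket/output/",
               copy_fn=lambda src, dst: calls.append((src, dst)))
print(calls)
```

Inside a real job, the call would simply be upload_outputs(local_dir, args.train_url) after sampling or at the end of training.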

File "/home/ma-user/anaconda/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    model.train(args) if args.phase == 'train' \
  File "/home/ma-user/modelarts/user-job-dir/code/model.py", line 346, in train
    self.sample_model(args.sample_dir, epoch, idx)
  File "/home/ma-user/modelarts/user-job-dir/code/model.py", line 396, in sample_model
    './{}/A_{:02d}_{:04d}.jpg'.format(sample_dir, epoch, idx))
  File "/home/ma-user/modelarts/user-job-dir/code/utils.py", line 123, in save_images
    return imsave(inverse_transform(images), size, image_path)
  File "/home/ma-user/modelarts/user-job-dir/code/utils.py", line 149, in imsave
    return imageio.imwrite(path, img_as_ubyte(merge(images, size)))
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/imageio/core/functions.py", line 303, in imwrite
    writer = get_writer(uri, format, "i", **kwargs)
    self._parse_uri(uri)
  File "/home/ma-user/anaconda/lib/python3.7/site-packages/imageio/core/request.py", line 265, in _parse_uri
    raise FileNotFoundError("The directory %r does not exist" % dn)
FileNotFoundError: The directory '/home/ma-user/modelarts/workspace/device0/cache/output/sample' does not exist
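The last line of the traceback says that the target directory does not exist, and the format string in sample_model starts with './', so the path is resolved relative to the job's working directory. A minimal sketch of one possible fix is to create the parent directory right before writing. The helper name ensure_parent_dir is hypothetical, and wiring it into imsave is an assumption about the original utils.py.

```python
import os


def ensure_parent_dir(path):
    """Create the directory that will hold `path` if it is missing, then return path."""
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)
    return path
```

In imsave, the write would then look like imageio.imwrite(ensure_parent_dir(path), ...), so the sample directory is created on first use regardless of the working directory.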

Troubleshooting: “registered signal handler”

[ModelArts Service Log][init] toolkit_obs_upload_job_pid = 56
[ModelArts Service Log][init] toolkit_obs_upload_pid = 58
[ModelArts Service Log][init] running at 2021-11-24-10:45:59
[ModelArts Service Log][init] ip of the pod: 172.16.0.118
[ModelArts Service Log][init] autosearch_path is empty, skip the autosearch download
[ModelArts Service Log][init] download code_url: s3://avatar-gan-npu/GAN_npu_for_TensorFlow/output/MA-new-GAN_npu_for_TensorFlow-11-24-10-45/code/
[ModelArts Service Log][init] modelarts_downloader_job_pid = 126
[ModelArts Service Log]2021-11-24 10:46:01,691 - modelarts-downloader.py[line:264] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=False, src='s3://avatar-gan-npu/GAN_npu_for_TensorFlow/output/MA-new-GAN_npu_for_TensorFlow-11-24-10-45/code/', trace=False, type='common', verbose=False)
[ModelArts Service Log][init] code is now in /home/ma-user/modelarts/user-job-dir
[ModelArts Service Log][init]engine type is: ascend-powered-engine
[ModelArts Service Log][init]python version: 2
[ModelArts Service Log][init] inputs_handler_job_pid = 217
[ModelArts Service Log][INFO][2021/11/24 10:46:03]: caching the content of [data_url] inputs
[ModelArts Service Log]2021-11-24 10:46:04,725 - modelarts-downloader.py[line:264] - INFO: Main: modelarts-downloader starting with Namespace(dst='./', recursive=True, skip_creating_dir=True, src='s3://avatar-gan-npu/GAN_npu_for_TensorFlow/datasets/', trace=False, type='common', verbose=False)
[ModelArts Service Log]2021-11-24 10:46:05,228 - file_io.py[line:1131] - INFO: Listing OBS: 1000
[ModelArts Service Log]2021-11-24 10:46:05,370 - file_io.py[line:1131] - INFO: Listing OBS: 2000
[ModelArts Service Log]2021-11-24 10:46:05,514 - file_io.py[line:1131] - INFO: Listing OBS: 3000
[ModelArts Service Log]2021-11-24 10:46:05,545 - file_io.py[line:1131] - INFO: Listing OBS: 4000
[ModelArts Service Log]2021-11-24 10:46:29,532 - file_io.py[line:1131] - INFO: Listing OBS: 1000
[ModelArts Service Log]2021-11-24 10:46:29,676 - file_io.py[line:1131] - INFO: Listing OBS: 2000
[ModelArts Service Log]2021-11-24 10:46:29,858 - file_io.py[line:1131] - INFO: Listing OBS: 3000
[ModelArts Service Log]2021-11-24 10:46:29,890 - file_io.py[line:1131] - INFO: Listing OBS: 4000
[ModelArts Service Log]2021-11-24 10:46:31,553 - file_io.py[line:2101] - INFO: pid: None.	1000/4147
[ModelArts Service Log]2021-11-24 10:46:32,893 - file_io.py[line:2101] - INFO: pid: None.	2000/4147
[ModelArts Service Log]2021-11-24 10:46:33,994 - file_io.py[line:2101] - INFO: pid: None.	3000/4147
[ModelArts Service Log]2021-11-24 10:46:35,286 - file_io.py[line:2101] - INFO: pid: None.	4000/4147
[ModelArts Service Log][INFO][2021/11/24 10:46:35]: cache the content of [data_url] inputs successfully
[ModelArts Service Log][INFO][2021/11/24 10:46:35]: it can be accessed at local dir [/home/ma-user/modelarts/inputs/data_url_0]
[ModelArts Service Log][INFO][2021/11/24 10:46:36,599]: mkdir for local output dir
[ModelArts Service Log][INFO][2021/11/24 10:46:36,599]: output-handler finalized
[ModelArts Service Log][init] exiting at 2021-11-24-10:46:36
[ModelArts Service Log][init] upload_metrics_pid = 447
[ModelArts Service Log][init] stop toolkit_obs_upload_pid = 58 by signal SIGTERM
[ModelArts Service Log][sidecar] toolkit_obs_upload 58 ret_code is 0
[ModelArts Service Log][init] exit with 0
[ModelArts Service Log][sidecar] toolkit_obs_upload_job_pid = 32
[ModelArts Service Log][sidecar] toolkit_obs_upload_pid = 33
[ModelArts Service Log][sidecar] running at 2021-11-24-10:46:40
[ModelArts Service Log][sidecar] outputs_handler_job_pid = 60
[ModelArts Service Log][sidecar] outputs_handler_pid = 61
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_job_pid = 75
[ModelArts Service Log][sidecar] toolkit_obs_sync_by_channels_pid = 76
[ModelArts Service Log][sidecar] waiting for training complete
time="2021-11-24T10:46:40+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-24T10:46:40+08:00" level=info msg="obs dir = s3://modelarts-training-log-cn-north-4/dbda8e34-ee6d-438d-bff9-1cf651eb96a0/worker-0" file="upload.go:42" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-24T10:46:40+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:52" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="local dir = /home/ma-user/modelarts/outputs/train_url_0/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="obs dir = s3://avatar-gan-npu/GAN_npu_for_TensorFlow/output/MA-new-GAN_npu_for_TensorFlow-11-24-10-45/output/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=train_url Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/" file="upload.go:39" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="obs dir = s3://avatar-gan-npu/GAN_npu_for_TensorFlow/output/MA-new-GAN_npu_for_TensorFlow-11-24-10-45/log/" file="upload.go:42" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
time="2021-11-24T10:46:41+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service
[ModelArts Service Log][INFO][2021/11/24 10:46:42,107]: registered signal handler

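Previously I had to copy data from ModelArts to OBS by hand. After switching to the latest version of the ModelArts plugin, the program always hung at "registered signal handler" (the longest hang lasted more than 6 hours before I killed it).
Judging from the surrounding log lines,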

time="2021-11-24T10:46:41+08:00" level=info msg="start the periodic upload task, upload period = 30 seconds " file="upload.go:52" Command=obs/sync_by_channels Component=ma-training-toolkit Ctx=log_url Platform=ModelArts-Service

my guess was that the periodic upload that runs every 30 seconds was what had gone wrong.

-------------------------------------------------- Update ---------------------------------------------------------
Cause: the Image Path had not been switched to the matching version (this has to be changed manually).
