FAQ notes: testing TF training networks on GPU/NPU

我是一只月月鸟

Category: network training

Article link: https://blog.csdn.net/yuzipeng/article/details/119675755

Running the DFN network on GPU

GitHub link: https://github.com/YuhuiMa/DFN-tensorflow

1. Error: Could not load dynamic library 'libcudart.so.11.0'

Fix: on the host, run apt install nvidia-cuda-toolkit, then re-enter the container and run again.

2. Error: AssertionError: The number of images in the data/train/main is not equal to that in the data/train/segmentation

Fix:

After extracting VOCtrainval_11-May-2012.tar, the JPEGImages and SegmentationClass directories do not contain matching sets of images, so you have to write your own script to reconcile them.
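One way to do the reconciliation is to keep only the images that have a segmentation mask with the same file stem. The sketch below is my own illustration, not part of the DFN repo: the function name copy_matched_pairs and the four directory arguments are hypothetical, and it assumes the images are .jpg files and the masks are .png files.

```python
import shutil
from pathlib import Path


def copy_matched_pairs(jpeg_dir, seg_dir, out_main, out_seg):
    """Copy only the images that have a mask with the same stem.

    Returns the sorted list of stems that were copied.
    """
    jpeg_dir, seg_dir = Path(jpeg_dir), Path(seg_dir)
    out_main, out_seg = Path(out_main), Path(out_seg)
    out_main.mkdir(parents=True, exist_ok=True)
    out_seg.mkdir(parents=True, exist_ok=True)

    # Index both directories by file stem, e.g. "2009_000039".
    images = {p.stem: p for p in jpeg_dir.glob("*.jpg")}
    masks = {p.stem: p for p in seg_dir.glob("*.png")}

    # Only stems present on both sides are kept.
    matched = sorted(images.keys() & masks.keys())
    for stem in matched:
        shutil.copy2(images[stem], out_main / images[stem].name)
        shutil.copy2(masks[stem], out_seg / masks[stem].name)
    return matched
```

For example, copy_matched_pairs("JPEGImages", "SegmentationClass", "data/train/main", "data/train/segmentation") would fill both training folders with the same number of files. The source and destination paths here are placeholders for wherever you extracted the archive.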

3. Error: [Errno 2] No such file or directory: 'data/train/segmentation/2009_000039.jpg'

Fix: the images in the original segmentation directory are all in PNG format, so convert them manually.

4. Error: OSError: cannot identify image file 'data/val/main/.gitkeep'

Fix: delete the .gitkeep file.

5. Error: ZeroDivisionError: division by zero

Fix: the val folder also needs main/segmentation data.
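Most of the DFN errors above come from the layout of the data folders, so a quick pre-flight check can catch them before training starts. This is only a sketch under my own assumptions: check_split and IMAGE_SUFFIXES are hypothetical names, and the check assumes each split directory holds a main and a segmentation subfolder, as the error messages suggest.

```python
from pathlib import Path

# File types treated as images; everything else is reported.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_split(split_dir):
    """Return a list of human-readable problems found in one split directory."""
    split_dir = Path(split_dir)
    problems = []
    counts = {}
    for sub in ("main", "segmentation"):
        folder = split_dir / sub
        files = sorted(folder.iterdir()) if folder.is_dir() else []
        images = [f for f in files if f.suffix.lower() in IMAGE_SUFFIXES]
        # Stray files such as .gitkeep cannot be opened as images.
        for f in files:
            if f not in images:
                problems.append("non-image file: {}".format(f))
        # An empty folder is a likely cause of the division by zero.
        if not images:
            problems.append("no images in {}".format(folder))
        counts[sub] = len(images)
    if counts["main"] != counts["segmentation"]:
        problems.append("count mismatch in {}: {} main vs {} segmentation".format(
            split_dir, counts["main"], counts["segmentation"]))
    return problems
```

Calling check_split("data/train") and check_split("data/val") and printing the returned lists should show an empty list for a healthy split.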

Running the XLNet network on GPU

GitHub link: https://github.com/zihangdai/xlnet

1. Error: data/squad/train-v2.0.json; No such file or directory

Fix: create the data/squad/ folder and download the data files:

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

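2. The scripts/prepro_squad.sh script takes a long time to run

Fix: none yet; ignoring it for now.

3. Error: ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor 'args_0:0' shape=() dtype=float32>

Fix: this happens when the tfrecord files do not exist or the path to them is wrong.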

GS_ROOT=
GS_PROC_DATA_DIR=${GS_ROOT}/proc_data/squad

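In the data preprocessing script, the GS_ROOT variable is not set, so GS_PROC_DATA_DIR resolves to /proc_data/squad at the filesystem root rather than proc_data/squad under the current directory.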
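The effect of the empty variable is easy to reproduce outside the shell. The snippet below mirrors the ${GS_ROOT}/proc_data/squad expansion as plain string formatting; proc_data_dir is a hypothetical helper, and setting GS_ROOT to "." is just my assumption of one possible fix.

```python
import posixpath


def proc_data_dir(gs_root):
    """Mirror the shell expansion ${GS_ROOT}/proc_data/squad."""
    return "{}/proc_data/squad".format(gs_root)


unset = proc_data_dir("")   # GS_ROOT left empty, as in the script
local = proc_data_dir(".")  # assumed fix: point GS_ROOT at the current directory

print(unset, posixpath.isabs(unset))  # /proc_data/squad True
print(local, posixpath.isabs(local))  # ./proc_data/squad False
```

With an empty GS_ROOT the result is an absolute path, which is why the preprocessing output and the training script can end up looking in different places.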

Also check that max_seq_length and max_query_length in prepro_squad.sh match the parameters in gpu_squad_base.sh; otherwise training still will not find the tfrecord files. The tfrecord file name format is:

"{}.*.slen-{}.qlen-{}.train.tf_record".format(spm_basename, FLAGS.max_seq_length,FLAGS.max_query_length)
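To see which length settings the existing tfrecord files were generated with, you can glob for the naming pattern quoted above. This is a small sketch, not code from the XLNet repo: find_train_records and its record_dir argument are hypothetical, and the spm_basename value has to be whatever your own preprocessing run used.

```python
import glob
import os


def find_train_records(record_dir, spm_basename, max_seq_length, max_query_length):
    """List the tfrecord files whose names match the given length settings."""
    pattern = "{}.*.slen-{}.qlen-{}.train.tf_record".format(
        spm_basename, max_seq_length, max_query_length)
    return sorted(glob.glob(os.path.join(record_dir, pattern)))
```

If the call returns an empty list for the values in gpu_squad_base.sh but a non-empty list for the values in prepro_squad.sh, the two scripts disagree on the lengths.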

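Running the RCAN network on GPU

GitHub link: https://github.com/dongheehand/RCAN-tf

1. Error: ImportError: cannot import name '_validate_lengths'

Fix:

(1) Installing the versions the author requires (numpy=1.15.0, scikit-image=0.15.0) leaves NumPy too old and breaks other libraries; for example, tensorflow-gpu requires numpy>=1.16.

(2) Posts online all say that upgrading scikit-image fixes this, but the error persists for this network. Debugging showed that the author's util.py reimplements the crop method itself, so upgrading scikit-image alone cannot fix the problem here.

(3) Modify util.py in two places: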

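# Original code, line 245
# import numpy as np
# from numpy.lib.arraypad import _validate_lengths
# New code
import numpy as np
from distutils.version import LooseVersion as Version
# _validate_lengths is no longer available from NumPy 1.16 on; use _as_pairs there
old_numpy = Version(np.__version__) < Version('1.16')
if old_numpy:
    from numpy.lib.arraypad import _validate_lengths
else:
    from numpy.lib.arraypad import _as_pairs

# Original code, line 292
# crops = _validate_lengths(ar, crop_width)
# New code: call whichever helper was imported above
if old_numpy:
    crops = _validate_lengths(ar, crop_width)
else:
    crops = _as_pairs(crop_width, ar.ndim, as_index=True)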

2. Error: FileNotFoundError: [Errno 2] No such file or directory: './HR'

Fix: only the (NTIRE 2017) Low Res Images dataset had been downloaded; the High Resolution Images dataset was missing.

3. Error: ValueError: setting an array element with a sequence.

Fix: the error goes away once the HR dataset is added under GT_path.

4. Commands, for reference:

Training command: python main.py --train_GT_path dataSet/DIV2K_train_HR/ --test_GT_path dataSet/DIV2K_valid_HR/ --train_LR_path dataSet/DIV2K_train_LR_bicubic/X2/ --test_LR_path dataSet/DIV2K_valid_LR_bicubic/X2/ --test_with_train True --scale 2 --log_freq 1000 --max_step 1000 --mode train

Test command: python main.py --mode test --pre_trained_model ./model/RCA_model_0064_feats_10_res_2.00_scale_last --test_LR_path ./dataSet/benchmark/Set5/LR_bicubic/X2/ --test_GT_path ./dataSet/benchmark/Set5/HR/ --scale 2 --self_ensemble False

Running the MAMNet network on GPU

GitHub link: https://github.com/junhyukk/MAMNet-Tensorflow

1. Error: [Errno 2] No such file or directory: '…/dataSet/DIV2K_train_HR/DIV2K_train_LR_bicubic/X2'

Fix: the script's default dataset is DIV2K_train_LR_bicubic/X2, so data_dir can only be set to the DIV2K_train_LR_bicubic dataset, not the DIV2K_train_HR dataset.

Running the pix2pix network

1. Error on GPU: Invalid argument: Nan in summary histogram for: generator/encoder_2/conv2d/bias/values.

Fix:

(1) Do not set batch_size to 1, and reduce the number of epochs.

(2) Lower the learning rate to 0.00005.

(3) Delete the previously generated training checkpoints and related data, then rerun.

2. Error on NPU: tensorflow.python.framework.errors_impl.InternalError: Missing 0-th output from {{node generator/decoder_8/DropOutGenMask}}

Fix: https://gitee.com/ascend/modelzoo/issues/I43NHY?from=project-issue

Running the finetune-transformer-lm network

1. Error on GPU: OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[16,77,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Fix: this kind of error is caused by insufficient memory, and the log usually also prints memory statistics like the following.

2021-07-30 02:59:02.387058: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit:                 10740259881
InUse:                 10740112128
MaxInUse:              10740112128
NumAllocs:                    2096
MaxAllocSize:            124594176

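So when creating the docker container, request 2G of memory, because 1G is not enough; for example: --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864

Reducing batch_size also lowers the model's memory requirement.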
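The numbers in the log can be checked with a little arithmetic: the Stats block (printed for the GPU_0_bfc allocator named in the error) shows how much room was left, and the tensor shape gives the size of the failed request. The sketch below is only a back-of-the-envelope check; tensor_bytes is a hypothetical helper, "type float" is taken to mean 4-byte float32, and treating the first dimension (16) as the batch size is my assumption.

```python
FLOAT32_BYTES = 4  # "type float" in the error is assumed to be float32


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


limit = 10740259881   # "Limit" from the Stats block
in_use = 10740112128  # "InUse" from the Stats block
free = limit - in_use
needed = tensor_bytes([16, 77, 768])

print(free)           # 147753
print(needed)         # 3784704
print(needed > free)  # True: the request cannot fit

# Assumption: the first dimension is the batch size, so halving it halves the request.
print(tensor_bytes([8, 77, 768]))  # 1892352
```

Only about 148 KB was free while the tensor needed roughly 3.6 MiB, which is consistent with the advice to reduce batch_size.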

2. Error on NPU: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Fix: the numpy version is too old; upgrade it with pip install --upgrade numpy.

Running the ViDeNN network

1. Error: assert len(eval_data) != 0, 'No testing data!'

Fix: lines 47/49 of main_spatialCNN.py expect test images in jpg format, but the dataset at that path is actually in PNG format.
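Before editing the script, it can help to confirm which extensions the data folder really contains. The helper below is a hypothetical sketch (count_by_extension is my own name, and I have not copied the file-matching code from main_spatialCNN.py); it simply counts the files in a directory by extension.

```python
import glob
import os
from collections import Counter


def count_by_extension(data_dir):
    """Count the regular files in data_dir, grouped by lowercased extension."""
    counts = Counter()
    for path in glob.glob(os.path.join(data_dir, "*")):
        if os.path.isfile(path):
            counts[os.path.splitext(path)[1].lower()] += 1
    return counts
```

If the result has entries for ".png" but none for ".jpg", either change the extension the script looks for or convert the images.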

2. Training command: python main_spatialCNN.py

Test command: python main_spatialCNN.py --phase test

3. Error on NPU: TypeError: train() got an unexpected keyword argument 'hooks'

Fix: https://gitee.com/ascend/modelzoo/issues/I43NKO?from=project-issue

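Running the NLRN network on GPU/NPU

1. The script defaults to train_steps=-1 and never stops, so be sure to set train_steps explicitly.

2. Error when running on ModelArts: NameError: name 'npu_config_proto' is not defined

Fix: running via the toolkit avoids the problem. If you need to run on ModelArts, you must enter the container with source /home/ma-user/miniconda3/bin/activate TensorFlow-1.15.0.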
