could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED Process finished with exit code -1073741819

Solution: comment out the run parameter `--gpu_memory_fraction=0.9` and the problem goes away!
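The log below also shows why this flag was the culprit. Assuming `--gpu_memory_fraction` is passed through to TensorFlow's `per_process_gpu_memory_fraction`, which is computed against the card's *total* memory rather than its *free* memory, a fraction of 0.9 on an 8 GiB GTX 1070 asks for about 7.20 GiB up front, while only 6.62 GiB is free; the allocation fails and cuDNN then cannot create its handle. A minimal sketch of the arithmetic (variable names are my own):

```python
GIB = 1024 ** 3

total_memory = 8.00 * GIB  # totalMemory reported for the GTX 1070 in the log
free_memory  = 6.62 * GIB  # freeMemory reported at startup
fraction     = 0.9         # --gpu_memory_fraction

# TensorFlow reserves fraction * total memory, not fraction * free memory.
requested = fraction * total_memory
print(f"requested: {requested / GIB:.2f} GiB")  # ~7.20 GiB, matching the 7.20G in the log

# The request exceeds what is actually free -> CUDA_ERROR_OUT_OF_MEMORY,
# and cuDNN handle creation fails right after.
print("fits in free memory:", requested <= free_memory)
```

With the flag commented out, TensorFlow falls back to its default allocation behavior instead of demanding 90% of the card at once.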

The error messages:

INFO:tensorflow:global_step/sec: 0
2019-03-12 19:11:20.300266: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:11:20.301051: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

Process finished with exit code -1073741819 (0xC0000005)

Full console output:

WARNING:tensorflow:From D:/work/SSD-Tensorflow-master/train_ssd_network.py:202: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step

# =========================================================================== #
# Training | Evaluation flags:
# =========================================================================== #
{'adadelta_rho': <absl.flags._flag.Flag object at 0x000002604F58CDD8>,
 'adagrad_initial_accumulator_value': <absl.flags._flag.Flag object at 0x000002604F58CE80>,
 'adam_beta1': <absl.flags._flag.Flag object at 0x000002604F58CF28>,
 'adam_beta2': <absl.flags._flag.Flag object at 0x000002604F58CFD0>,
 'batch_size': <absl.flags._flag.Flag object at 0x000002604F598DA0>,
 'checkpoint_exclude_scopes': <absl.flags._flag.Flag object at 0x000002604F59B080>,
 'checkpoint_model_scope': <absl.flags._flag.Flag object at 0x000002604F598FD0>,
 'checkpoint_path': <absl.flags._flag.Flag object at 0x000002604F598F60>,
 'clone_on_cpu': <absl.flags._flag.BooleanFlag object at 0x000002604F58C7B8>,
 'dataset_dir': <absl.flags._flag.Flag object at 0x000002604F598BA8>,
 'dataset_name': <absl.flags._flag.Flag object at 0x000002604F5989E8>,
 'dataset_split_name': <absl.flags._flag.Flag object at 0x000002604F598B00>,
 'end_learning_rate': <absl.flags._flag.Flag object at 0x000002604F598710>,
 'ftrl_initial_accumulator_value': <absl.flags._flag.Flag object at 0x000002604F598208>,
 'ftrl_l1': <absl.flags._flag.Flag object at 0x000002604F5982B0>,
 'ftrl_l2': <absl.flags._flag.Flag object at 0x000002604F598358>,
 'ftrl_learning_rate_power': <absl.flags._flag.Flag object at 0x000002604F598160>,
 'gpu_memory_fraction': <absl.flags._flag.Flag object at 0x000002604F58CC18>,
 'h': <tensorflow.python.platform.app._HelpFlag object at 0x000002604F59B198>,
 'help': <tensorflow.python.platform.app._HelpFlag object at 0x000002604F59B198>,
 'helpfull': <tensorflow.python.platform.app._HelpfullFlag object at 0x000002604F59B208>,
 'helpshort': <tensorflow.python.platform.app._HelpshortFlag object at 0x000002604F59B278>,
 'ignore_missing_vars': <absl.flags._flag.BooleanFlag object at 0x000002604F59B128>,
 'label_smoothing': <absl.flags._flag.Flag object at 0x000002604F598780>,
 'labels_offset': <absl.flags._flag.Flag object at 0x000002604F598C18>,
 'learning_rate': <absl.flags._flag.Flag object at 0x000002604F598668>,
 'learning_rate_decay_factor': <absl.flags._flag.Flag object at 0x000002604F598828>,
 'learning_rate_decay_type': <absl.flags._flag.Flag object at 0x000002604F5985F8>,
 'log_every_n_steps': <absl.flags._flag.Flag object at 0x000002604F58CA20>,
 'loss_alpha': <absl.flags._flag.Flag object at 0x0000026031DEA240>,
 'match_threshold': <absl.flags._flag.Flag object at 0x000002604F58C630>,
 'max_number_of_steps': <absl.flags._flag.Flag object at 0x000002604F598EB8>,
 'model_name': <absl.flags._flag.Flag object at 0x000002604F598CC0>,
 'momentum': <absl.flags._flag.Flag object at 0x000002604F598400>,
 'moving_average_decay': <absl.flags._flag.Flag object at 0x000002604F598978>,
 'negative_ratio': <absl.flags._flag.Flag object at 0x000002604F58C588>,
 'num_classes': <absl.flags._flag.Flag object at 0x000002604F598A58>,
 'num_clones': <absl.flags._flag.Flag object at 0x000002604F58C780>,
 'num_epochs_per_decay': <absl.flags._flag.Flag object at 0x000002604F5988D0>,
 'num_preprocessing_threads': <absl.flags._flag.Flag object at 0x000002604F58C978>,
 'num_readers': <absl.flags._flag.Flag object at 0x000002604F58C8D0>,
 'opt_epsilon': <absl.flags._flag.Flag object at 0x000002604F5980B8>,
 'optimizer': <absl.flags._flag.Flag object at 0x000002604F58CD68>,
 'preprocessing_name': <absl.flags._flag.Flag object at 0x000002604F598D30>,
 'rmsprop_decay': <absl.flags._flag.Flag object at 0x000002604F598550>,
 'rmsprop_momentum': <absl.flags._flag.Flag object at 0x000002604F5984A8>,
 'save_interval_secs': <absl.flags._flag.Flag object at 0x000002604F58CB70>,
 'save_summaries_secs': <absl.flags._flag.Flag object at 0x000002604F58CAC8>,
 'train_dir': <absl.flags._flag.Flag object at 0x000002604F58C6D8>,
 'train_image_size': <absl.flags._flag.Flag object at 0x000002604F598E48>,
 'trainable_scopes': <absl.flags._flag.Flag object at 0x000002604F59B0F0>,
 'weight_decay': <absl.flags._flag.Flag object at 0x000002604F58CCC0>}

# =========================================================================== #
# SSD net parameters:
# =========================================================================== #
{'anchor_offset': 0.5,
 'anchor_ratios': [[2, 0.5],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5, 3, 0.3333333333333333],
                   [2, 0.5],
                   [2, 0.5]],
 'anchor_size_bounds': [0.15, 0.9],
 'anchor_sizes': [(2.0, 45.0),
                  (45.0, 99.0),
                  (99.0, 153.0),
                  (153.0, 207.0),
                  (207.0, 261.0),
                  (261.0, 315.0)],
 'anchor_steps': [8, 16, 32, 64, 100, 300],
 'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
 'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
 'img_shape': (300, 300),
 'no_annotation_label': 2,
 'normalizations': [20, -1, -1, -1, -1, -1],
 'num_classes': 2,
 'prior_scaling': [0.1, 0.1, 0.2, 0.2]}

# =========================================================================== #
# Training | Evaluation dataset files:
# =========================================================================== #
['.\\tfrecords\\voc_2007_train_000.tfrecord',
 '.\\tfrecords\\voc_2007_train_001.tfrecord',
 '.\\tfrecords\\voc_2007_train_002.tfrecord',
 '.\\tfrecords\\voc_2007_train_003.tfrecord']

INFO:tensorflow:Fine-tuning from ./checkpoints/vgg_16.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From C:\Users\11327\AppData\Roaming\Python\Python36\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-03-12 19:35:08.779143: I c:\users\user\source\repos\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-12 19:35:08.976953: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.62GiB
2019-03-12 19:35:08.977310: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2019-03-12 19:35:09.609884: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-12 19:35:09.610104: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958]      0 
2019-03-12 19:35:09.610250: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0:   N 
2019-03-12 19:35:09.610476: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7372 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-03-12 19:35:09.611373: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_driver.cc:903] failed to allocate 7.20G (7730940928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
INFO:tensorflow:Restoring parameters from ./checkpoints/vgg_16.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2019-03-12 19:35:16.465249: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-03-12 19:35:16.466070: E c:\users\user\source\repos\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED

Process finished with exit code -1073741819 (0xC0000005)

The parameters used at the time:

--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/vgg_16.ckpt
--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
--gpu_memory_fraction=0.9

Annotated parameters (comments translated; strip them before actually running the command):

python3 train_ssd_network.py \
    --train_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/train_log/ \                        # where trained model checkpoints are saved
    --dataset_dir=/media/comway/data/dial_SSD/SSD-Tensorflow-master/dialvoc-train-tfrecords \         # data path
    --dataset_name=pascalvoc_2007 \     # dataset name prefix; I believe this selects 2007 vs. 2012
    --dataset_split_name=train \        # load the train split or the test split
    --model_name=ssd_300_vgg \          # name of the model to load
    --checkpoint_path=/media/comway/data/dial_SSD/SSD-Tensorflow-master/checkpoints/ssd_300_vgg.ckpt \  # path of the checkpoint to load
    --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
    --save_summaries_secs=60 \          # save summaries every 60 s
    --save_interval_secs=600 \          # save the model every 600 s
    --weight_decay=0.0005 \             # weight-decay (regularization) coefficient
    --optimizer=adam \                  # optimizer choice
    --learning_rate=0.001 \             # learning rate
    --learning_rate_decay_factor=0.94 \ # learning-rate decay factor
    --batch_size=16 \
    --gpu_memory_fraction=0.9           # fraction of GPU memory to occupy
Reference: the last solution in https://blog.csdn.net/comway_Li/article/details/85239484. I had already tried the first solution there, and the error persisted.

Steps taken:

1. I had used these same parameters before, yet it still errored for no clear reason. I then tried everything: shutting down and restarting the machine many times, restarting PyCharm, tweaking parameters, and re-extracting the pretrained model file.

2. I also reverted my changes. A senior labmate said that going back to the original, erroring state might make the error disappear, so I reverted, and sure enough it did: training finally ran.

3. I killed all the NVIDIA processes, restarted the machine, and closed protection software such as PC Manager. (As a beginner I don't know whether this matters; I just did it.)

4. The key step: change every occurrence of the class count 21 in the training code to your own number of classes. Don't miss any individual num_classes value.
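To illustrate step 4: the default 21 is Pascal VOC's 20 foreground classes plus the background class, so for a custom dataset the value should be your own class count plus one, and the same number must appear everywhere num_classes is defined (the flag default, the dataset definition, and the SSD net parameters printed above, which here show `'num_classes': 2`). A minimal sketch with a hypothetical label list of my own:

```python
# Hypothetical label list for a custom single-class dataset
# (matching the 'num_classes': 2 in the printed SSD net parameters).
MY_LABELS = ["dial"]

# SSD counts background as class 0, so the total is len(labels) + 1.
num_classes = len(MY_LABELS) + 1
no_annotation_label = num_classes  # SSD-Tensorflow sets this to the same value

print(num_classes)  # 2
```

Deriving every num_classes from one label list, instead of editing scattered literal 21s, makes it much harder to miss one.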

Final parameters used to train my own dataset (cleaned-up version = the version actually run):
--train_dir=./logs/
--dataset_dir=./tfrecords/
--dataset_name=pascalvoc_2007
--dataset_split_name=train
--model_name=ssd_300_vgg
--checkpoint_path=./checkpoints/ssd_300_vgg.ckpt
#--checkpoint_model_scope=vgg_16
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box
--save_summaries_secs=60
--save_interval_secs=600
--weight_decay=0.0005
--optimizer=adam
--learning_rate=0.001
--learning_rate_decay_factor=0.94
--batch_size=16
#--gpu_memory_fraction=0.9

 
