Distributed computing learning, pitfall #2: cuDNN

On the chief I set TF_CONFIG and launch the ResNet CIFAR example with the multi_worker_mirrored strategy:

andrew@1manjaro:~/mount/arch/TensorFlowOnSpark/examples/resnet# export TF_CONFIG='{"cluster": { "chief": ["localhost:2222"], "worker": ["localhost:2223"]}, "task": {"type": "chief", "index": 0}}'
python resnet_cifar_main.py --data_dir=${CIFAR_DATA} --num_gpus=1 --ds=multi_worker_mirrored --train_epochs=100
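
As a side note, here is a minimal sketch of how this TF_CONFIG is consumed (assuming TF 2.x, where MultiWorkerMirroredStrategy still lives under tf.distribute.experimental; this is not the resnet_cifar_main.py code). The second process needs a matching TF_CONFIG with "task": {"type": "worker", "index": 0}. The chief's log follows.

import json
import os

import tensorflow as tf

# Same cluster spec as the export above.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"chief": ["localhost:2222"], "worker": ["localhost:2223"]},
    "task": {"type": "chief", "index": 0},  # second process: {"type": "worker", "index": 0}
})

# The strategy reads TF_CONFIG when it is constructed and starts the gRPC server.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)  # 2: one GPU on the chief + one on the worker
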
2020-05-26 10:43:02.240568: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-26 10:43:03.329536: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-26 10:43:03.332103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.332337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:1c:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.725GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-26 10:43:03.332359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-26 10:43:03.333363: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-26 10:43:03.334473: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-26 10:43:03.334654: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-26 10:43:03.335746: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-26 10:43:03.336382: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-26 10:43:03.338684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-26 10:43:03.338811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.339096: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.339297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-26 10:43:03.339909: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2020-05-26 10:43:03.360884: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3701195000 Hz
2020-05-26 10:43:03.361383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5586666e7300 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-26 10:43:03.361399: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-26 10:43:03.682688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.682977: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558665a43420 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-26 10:43:03.683005: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-05-26 10:43:03.683211: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.683432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:1c:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.725GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-26 10:43:03.683458: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-26 10:43:03.683481: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-26 10:43:03.683491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-26 10:43:03.683500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-26 10:43:03.683510: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-26 10:43:03.683519: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-26 10:43:03.683528: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-26 10:43:03.683571: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.683789: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.683974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-26 10:43:03.683995: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-26 10:43:03.959819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-26 10:43:03.959852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-05-26 10:43:03.959860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-05-26 10:43:03.960066: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.960320: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.960529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 35 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2020-05-26 10:43:03.960938: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2020-05-26 10:43:03.961343: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.961546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:1c:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.725GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-26 10:43:03.961570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-26 10:43:03.961593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-26 10:43:03.961604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-26 10:43:03.961614: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-26 10:43:03.961624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-26 10:43:03.961634: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-26 10:43:03.961644: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-26 10:43:03.961686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.961902: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.962084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-05-26 10:43:03.962100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-26 10:43:03.962106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-05-26 10:43:03.962111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-05-26 10:43:03.962169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.962391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-26 10:43:03.962579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:chief/replica:0/task:0/device:GPU:0 with 35 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:1c:00.0, compute capability: 7.5)
2020-05-26 10:43:03.964986: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2222}
2020-05-26 10:43:03.965006: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223}
2020-05-26 10:43:03.965380: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:2222
INFO:tensorflow:Enabled multi-worker collective ops with available devices: ['/job:chief/replica:0/task:0/device:CPU:0', '/job:chief/replica:0/task:0/device:XLA_CPU:0', '/job:chief/replica:0/task:0/device:XLA_GPU:0', '/job:chief/replica:0/task:0/device:GPU:0']
I0526 10:43:03.965880 140302475241280 collective_all_reduce_strategy.py:303] Enabled multi-worker collective ops with available devices: ['/job:chief/replica:0/task:0/device:CPU:0', '/job:chief/replica:0/task:0/device:XLA_CPU:0', '/job:chief/replica:0/task:0/device:XLA_GPU:0', '/job:chief/replica:0/task:0/device:GPU:0']
INFO:tensorflow:Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
I0526 10:43:03.966263 140302475241280 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
I0526 10:43:03.966428 140302475241280 collective_all_reduce_strategy.py:344] MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.433851 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.437643 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.444913 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.447288 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.500257 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.502656 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.508196 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.510025 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.543776 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:04.545943 140302475241280 cross_device_ops.py:1059] Collective batch_all_reduce: 1 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Running Distribute Coordinator with mode = 'independent_worker', cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, environment = None, rpc_layer = 'grpc'
I0526 10:43:06.832544 140302475241280 distribute_coordinator.py:773] Running Distribute Coordinator with mode = 'independent_worker', cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, environment = None, rpc_layer = 'grpc'
WARNING:tensorflow:`eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
W0526 10:43:06.832681 140302475241280 distribute_coordinator.py:825] `eval_fn` is not passed in. The `worker_fn` will be used if an "evaluator" task exists in the cluster.
WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
W0526 10:43:06.832730 140302475241280 distribute_coordinator.py:829] `eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
I0526 10:43:06.833080 140302475241280 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
I0526 10:43:06.833179 140302475241280 collective_all_reduce_strategy.py:344] MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
INFO:tensorflow:Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
I0526 10:43:06.833521 140302475241280 mirrored_strategy.py:500] Using MirroredStrategy with devices ('/job:chief/task:0/device:GPU:0',)
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
I0526 10:43:06.833611 140302475241280 collective_all_reduce_strategy.py:344] MultiWorkerMirroredStrategy with cluster_spec = {'chief': ['localhost:2222'], 'worker': ['localhost:2223']}, task_type = 'chief', task_id = 0, num_workers = 2, local_devices = ('/job:chief/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO
Epoch 1/100
INFO:tensorflow:Collective batch_all_reduce: 176 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:08.031161 140302475241280 cross_device_ops.py:1054] Collective batch_all_reduce: 176 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 176 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
I0526 10:43:11.108297 140302475241280 cross_device_ops.py:1054] Collective batch_all_reduce: 176 all-reduces, num_workers = 2, communication_hint = AUTO, num_packs = 1
2020-05-26 10:43:15.515986: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-26 10:43:15.864391: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-26 10:43:16.010903: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-26 10:43:16.021040: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "resnet_cifar_main.py", line 288, in <module>
    app.run(main)
  File "/usr/lib/python3.8/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/lib/python3.8/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "resnet_cifar_main.py", line 282, in main
    return run(flags.FLAGS)
  File "resnet_cifar_main.py", line 251, in run
    history = model.fit(train_input_dataset,
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 72, in _method_wrapper
    return dc.run_distribute_coordinator(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 852, in run_distribute_coordinator
    return _run_single_worker(worker_fn, strategy, cluster_spec, task_type,
  File "/usr/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_coordinator.py", line 360, in _run_single_worker
    return worker_fn(strategy)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 73, in <lambda>
    lambda _: method(self, *args, **kwargs),
  File "/usr/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/usr/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet56/conv1/Conv2D (defined at /threading.py:932) ]]
	 [[GroupCrossDeviceControlEdges_0/Identity_2/_35]]
  (1) Unknown:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node resnet56/conv1/Conv2D (defined at /threading.py:932) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_19079]

Function call stack:
train_function -> train_function

2020-05-26 10:43:16.566656: W tensorflow/core/common_runtime/eager/context.cc:447] Unable to destroy server_ object, so releasing instead. Servers don't support clean shutdown.

The error is caused by running out of GPU memory: the chief and the worker share the same RTX 2070, so by the time this process creates its TensorFlow device only about 35 MB is left (see the "Created TensorFlow device ... with 35 MB memory" lines above), which is not enough for cuDNN to initialize. Capping how much of the GPU each process may take fixes it:

import tensorflow as tf

# Run this before the model / distribution strategy is created: limit this
# process to 40% of the GPU so the other process has room for its own session.
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))
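
If you prefer the native TF 2 API to the compat.v1 session, the same cap can be applied with the tf.config utilities. This is an alternative sketch, not part of the original fix; the 3 GB figure is an assumption for an 8 GB card shared by two processes.

import tensorflow as tf

# Must run before anything touches the GPU (i.e. before the model/strategy is built).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Option 1: grow GPU memory on demand instead of reserving it all up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Option 2 (use instead of option 1): hard-cap this process at ~3 GB so the
    # chief and the worker both fit on the 8 GB RTX 2070.
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=3072)])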
