Notes on configuring CUDA, cuDNN, TensorFlow, and PyTorch on Windows 10

枪枪枪

Category: Machine Learning. Tags: machine learning, tensorflow

Article link: https://blog.csdn.net/az9996/article/details/107448541

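Installation is much more convenient now than it was a year ago; there are far fewer fiddly steps.

Pay attention to the version compatibility between CUDA, cuDNN, the GPU, and TensorFlow.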

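References

TensorFlow official website

Steps for installing tensorflow-gpu on Win10

Installing CUDA on Windows

cuDNN official documentation

Checking whether TensorFlow is using the CPU or the GPU

Installing TensorFlow (GPU version) on Win10 in detail (very detailed, starting from scratch)

CUDA toolkit and GPU driver version compatibility table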

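1. Install CUDA

Google search: cuda 10.2.141 driver
https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

To avoid installing unnecessary components, choose the custom installation here.

The C drive has plenty of free space, so I left the install location unchanged.

Installation complete.

As for the CUDA environment variables, on my machine they were added automatically once the installation finished.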
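If you want to confirm that the installer really did register the environment variables, a small script can report them. This is only a sketch: the function name `cuda_env_report` is made up for illustration, and it assumes the installer sets `CUDA_PATH` and puts the toolkit's `bin` folder on `PATH`, as it did on my machine.

```python
import os
import shutil


def cuda_env_report(env=None):
    """Summarize CUDA-related environment settings.

    env defaults to os.environ; pass a dict to inspect other settings.
    """
    env = os.environ if env is None else env
    path_value = env.get("PATH", "")
    path_entries = [p for p in path_value.split(os.pathsep) if p]
    return {
        # Set by the CUDA installer; None if it is missing.
        "CUDA_PATH": env.get("CUDA_PATH"),
        # PATH entries that look like they belong to a CUDA install.
        "cuda_entries_on_path": [p for p in path_entries if "cuda" in p.lower()],
        # shutil.which returns None when nvcc cannot be found on PATH.
        "nvcc": shutil.which("nvcc", path=path_value),
    }


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(key, "=", value)
```

If `CUDA_PATH` comes back as None or `nvcc` is not found, the variables probably need to be added by hand.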

Launch one of the examples in the CUDA samples folder and check its output; it reports how long the sample program took on the GPU and on the CPU.

2. Install cuDNN

Visit this page (url) to check which cuDNN version matches which CUDA version.
cuDNN download page: https://developer.nvidia.com/rdp/form/cudnn-download-survey
Based on my operating system, I chose the cuDNN library for Win10.

After downloading and extracting the archive:
Copy F:\下载\ChromeDownload\cudnn-10.2-windows10-x64-v7.6.5.32\cuda\bin\cudnn*.dll to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin
Copy F:\下载\ChromeDownload\cudnn-10.2-windows10-x64-v7.6.5.32\cuda\include\cudnn*.h to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\include

Copy the files in F:\下载\ChromeDownload\cudnn-10.2-windows10-x64-v7.6.5.32\cuda\lib\x64 to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64
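The three copy steps can also be scripted. The following is a minimal sketch, not an official installer: `copy_cudnn` is a hypothetical helper, both folders are passed in as arguments, and the `cudnn*.lib` pattern for `lib\x64` is my assumption (the steps above only name the folder). Writing into C:\Program Files normally requires an administrator prompt.

```python
import glob
import os
import shutil

# Sub-folder of the extracted archive -> file pattern to copy.
# The lib/x64 pattern is an assumption; the text only names the folder.
CUDNN_PATTERNS = {
    "bin": "cudnn*.dll",
    "include": "cudnn*.h",
    os.path.join("lib", "x64"): "cudnn*.lib",
}


def copy_cudnn(cudnn_root, cuda_root):
    """Copy cuDNN files from the extracted archive into the CUDA install.

    cudnn_root: the extracted "cuda" folder of the cuDNN archive.
    cuda_root:  the CUDA toolkit folder (the v10.2 folder in this post).
    Returns the list of destination paths that were written.
    """
    copied = []
    for subdir, pattern in CUDNN_PATTERNS.items():
        dest_dir = os.path.join(cuda_root, subdir)
        os.makedirs(dest_dir, exist_ok=True)
        for src in sorted(glob.glob(os.path.join(cudnn_root, subdir, pattern))):
            # copy2 keeps timestamps and returns the new file's path.
            copied.append(shutil.copy2(src, dest_dir))
    return copied
```

Calling `copy_cudnn` with the extracted `cuda` folder and the `v10.2` toolkit folder reproduces the manual steps and returns the files it wrote, which makes it easy to see whether anything was missed.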
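Open a cmd window and type control sysdm.cpl

Add an environment variable
Variable name: CUDA_PATH
Variable value: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2

On my machine this environment variable had already been added automatically when CUDA was installed.

If you develop with Visual Studio, you also need to add cudnn.lib to your project: go to "Project" -> "Properties" -> "Linker" -> "Input" -> "Additional Dependencies", add cudnn.lib, and confirm.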

Test whether TensorFlow is using the GPU

Create a new Python file
and run the following:

from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())

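Check the output: a GPU device is listed (here as XLA_GPU:0), so the card is visible. Note, however, the warning that cudart64_101.dll was not found and the line "Skipping registering GPU devices..."; step 3 comes back to this.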

2020-07-20 19:46:00.994189: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-07-20 19:46:00.994416: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-07-20 19:46:03.496997: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-07-20 19:46:03.506445: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x27217c3dcf0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-20 19:46:03.506899: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-20 19:46:03.518994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-07-20 19:46:03.575822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
2020-07-20 19:46:03.579216: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-07-20 19:46:03.662971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-07-20 19:46:03.714476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-07-20 19:46:03.735304: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-07-20 19:46:03.804328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-07-20 19:46:03.838115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-07-20 19:46:03.947359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-07-20 19:46:03.947536: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-07-20 19:46:04.065192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-20 19:46:04.065428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 
2020-07-20 19:46:04.065533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N 
2020-07-20 19:46:04.070126: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x27217c3c970 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-20 19:46:04.070448: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1050 Ti, Compute Capability 6.1
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 11094447684184939916
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11168722219354010654
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 5706144976707029859
physical_device_desc: "device: XLA_GPU device"
]
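In a log this long the one important warning is easy to miss. As a rough aid, the sketch below scans saved log text for the "Could not load dynamic library" lines and lists the libraries TensorFlow failed to open. The helper name `missing_libraries` is made up, and the regular expression is based only on the message format shown in the log above.

```python
import re

# Matches lines such as:
#   ... Could not load dynamic library 'cudart64_101.dll'; dlerror: ...
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the libraries TensorFlow reported as not loadable, without duplicates."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "W tensorflow/stream_executor/platform/default/dso_loader.cc:55] "
        "Could not load dynamic library 'cudart64_101.dll'; "
        "dlerror: cudart64_101.dll not found\n"
        "I tensorflow/stream_executor/platform/default/dso_loader.cc:44] "
        "Successfully opened dynamic library cublas64_10.dll\n"
    )
    print(missing_libraries(sample))
```

An empty list means every library loaded; any name that shows up is a DLL TensorFlow looked for and could not find.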

3. Running the test file in Jupyter Notebook cannot use the GPU

It keeps reporting: Could not load dynamic library 'cudart64_101.dll'

Going back to the output in step 2, this message is there:
2020-07-20 19:46:00.994189: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found

In C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin there is a cudart64_102.dll but no cudart64_101.dll.
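The same check can be done from Python instead of browsing the folder. This is a small sketch with hypothetical helper names; it assumes the runtime DLLs follow the `cudart64_<number>.dll` naming seen here.

```python
import glob
import os
import re


def installed_cudart_versions(bin_dir):
    """Return the numeric suffixes of cudart64_*.dll files found in bin_dir."""
    versions = []
    for path in glob.glob(os.path.join(bin_dir, "cudart64_*.dll")):
        name = os.path.basename(path)
        match = re.fullmatch(r"cudart64_(\d+)\.dll", name, re.IGNORECASE)
        if match:
            versions.append(match.group(1))
    return sorted(versions)


def has_cudart(bin_dir, wanted):
    """True if cudart64_<wanted>.dll is present, e.g. wanted='101'."""
    return wanted in installed_cudart_versions(bin_dir)
```

On the setup described in this post, `installed_cudart_versions` on the v10.2 `bin` folder should list 102 but not 101, which is exactly the mismatch the warning is complaining about.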

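Fixes:
Method 1: rename the file (the existing cudart64_102.dll to cudart64_101.dll).
Method 2: download cudart64_101.dll (CUDART64_101.DLL) from the web and place it in that folder.

Run again, and it works!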

References:
Fix for "cudart64_101.dll not found"
https://blog.csdn.net/qq_32939413/article/details/105525025

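4. Fixing the "status: Internal: invalid device function" error

Root cause in short: the CUDA and TensorFlow versions are incompatible.

tensorflow-gpu error | Non-OK-status: GpuLaunchKernel | status: Internal: invalid device function

I took over a new model that uses TensorFlow 1.15.2. After installing tensorflow-gpu==1.15.2, it reported at run time that the dynamic libraries failed to load. Since CUDA 10.2 ships DLL files with the same prefix and only a different numeric suffix, I renamed them and ran again. The dynamic libraries then loaded, but the run ended with "status: Internal: invalid device function".

With no other option, I downloaded CUDA 10.0 from the official site, installed it, and configured the environment variables.

Alternatively, copy the DLL files from 10.0 into the 10.2 folder.

Run again: the GPU is used successfully.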
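To avoid this kind of trial and error, it helps to look up which CUDA version a TensorFlow release was built against before installing. The sketch below hard-codes a few rows from TensorFlow's published "tested build configurations" table for the GPU packages; it covers only the releases relevant to this post, and the function name `expected_cuda` is made up for illustration.

```python
# CUDA / cuDNN versions from TensorFlow's published "tested build
# configurations" table for the GPU packages. Only a few releases
# relevant to this post are listed; check the official table for others.
TESTED_CONFIGS = {
    "1.15": {"cuda": "10.0", "cudnn": "7.4"},
    "2.0": {"cuda": "10.0", "cudnn": "7.4"},
    "2.1": {"cuda": "10.1", "cudnn": "7.6"},
    "2.2": {"cuda": "10.1", "cudnn": "7.6"},
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
}


def expected_cuda(tf_version):
    """Return the CUDA version a TensorFlow release was built against, or None."""
    major_minor = ".".join(tf_version.split(".")[:2])
    entry = TESTED_CONFIGS.get(major_minor)
    return entry["cuda"] if entry else None


if __name__ == "__main__":
    print(expected_cuda("1.15.2"))
```

This agrees with what happened here: tensorflow-gpu 1.15.2 expects CUDA 10.0, so DLLs renamed from a 10.2 install load but do not behave as the library expects.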

5. Using PyTorch

Go to the PyTorch official website, pick the options that match your setup, and install with conda using the command it gives you:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
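Once the install finishes, a quick check shows whether PyTorch can see the GPU. This is a minimal sketch; `torch_cuda_summary` is a made-up helper, and it imports torch lazily so that the script still runs (and says so) in an environment where PyTorch is not installed.

```python
import importlib.util


def torch_cuda_summary():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so the script also runs without PyTorch

    available = torch.cuda.is_available()
    return {
        "installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": available,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    print(torch_cuda_summary())
```

If `cuda_available` is False while `built_for_cuda` is set, the usual suspects are the GPU driver or a mismatch between the driver and the cudatoolkit version that was installed.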

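6. Offline migration of a conda environment

Go to anaconda\envs under the Anaconda installation path and find the folder named after the environment you want to migrate. Pack it into an archive and copy it to the same path on the new machine. If conda env list does not show the copied environment, a restart fixes it.

This method only works within the same major version of Anaconda; going from anaconda2 to anaconda3, for example, does not work.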
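The pack-and-copy step can be scripted with the standard library. This is just a sketch of the manual procedure described above, with hypothetical helper names; it does not rewrite any absolute paths stored inside the environment, which is why the target has to use the same installation path.

```python
import os
import shutil


def pack_env(envs_dir, env_name, out_dir):
    """Zip <envs_dir>/<env_name> into <out_dir>/<env_name>.zip and return the archive path."""
    base_name = os.path.join(out_dir, env_name)
    # root_dir/base_dir keep the environment folder as the top-level entry.
    return shutil.make_archive(base_name, "zip", root_dir=envs_dir, base_dir=env_name)


def unpack_env(archive_path, envs_dir):
    """Extract a packed environment into the target machine's envs folder."""
    shutil.unpack_archive(archive_path, envs_dir)
```

On the source machine, call `pack_env` with the `envs` folder and the environment name; on the new machine, call `unpack_env` with the copied archive and its own `envs` folder.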

References:
Anaconda tutorial, plus fixes for problems encountered when migrating an environment by copying it directly

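7. Training a TensorFlow model and a PyTorch model on one GPU at the same time

Later there was another model to train, this one written in PyTorch.

At first I started PyTorch first; when TensorFlow was started afterwards, it reported that no device was available.
The error message was: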

2020-09-21 23:12:59.250765: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-09-21 23:12:59.253410: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "run_train_win.py", line 46, in <module>
    run_train()
  File "run_train_win.py", line 42, in run_train
    train(args=args)
  File "../..\keras_bert_ner\train.py", line 138, in train
    validation_data=devs)
......
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(6272, 11), b.shape=(11, 11), m=6272, n=11, k=11
         [[{{node loss/CRF_loss/crf_loss/MatMul_1}}]]
         [[Mean/_831]]
  (1) Internal: Blas GEMM launch failed : a.shape=(6272, 11), b.shape=(11, 11), m=6272, n=11, k=11
         [[{{node loss/CRF_loss/crf_loss/MatMul_1}}]]
0 successful operations.
0 derived errors ignored.

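After some searching, I found an answer saying that TensorFlow by default claims the entire GPU at startup. When TensorFlow is started second, it finds the card already in use, so it cannot load properly.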
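One setting that targets this default is TensorFlow's TF_FORCE_GPU_ALLOW_GROWTH environment variable, which asks it to allocate GPU memory on demand instead of reserving it up front. I did not try this in the run described here, so treat the following as a sketch of an option rather than a confirmed fix; the helper name is made up.

```python
import os


def request_tf_memory_growth(env=None):
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    Call this before TensorFlow initializes the GPU; doing it before
    the import is the simplest way to be sure.
    """
    env = os.environ if env is None else env
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


if __name__ == "__main__":
    request_tf_memory_growth()
    # import tensorflow as tf  # import only after the variable is set
    print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The variable can equally be set in the cmd window before launching the training script; the point is only that it must be in place before TensorFlow touches the GPU.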

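References:
https://www.zhihu.com/question/353248304
周军:
"Here is one that has nothing to do with GPU memory: on a single card you have to load TF first and then PyTorch, otherwise you get a cuDNN initialization error."

I switched to starting TensorFlow first and PyTorch second, and both started without problems.

The PyTorch process, started second, threw an error partway through: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Error output: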

2020-09-21T23:37:43.852019 step: 1700, loss: 7405.79
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    train(model, train_iter, optimizer, criterion, device)
  File "train.py", line 32, in train
    loss.backward()
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from _cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:923 (most recent call first):
00007FFC409A75A200007FFC409A7540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FFB87654F3600007FFB87654E80 torch_cuda.dll!at::native::Descriptor<cudnnRNNStruct,&cudnnCreateRNNDescriptor,&cudnnDestroyRNNDescriptor>::Descriptor<cudnnRNNStruct,&cudnnCreateRNNDescriptor,&cudnnDestroyRNNDescriptor> [<unknown file> @ <unknown line number>]
00007FFB8766BDBB00007FFB87669770 torch_cuda.dll!at::native::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FFB87669CD000007FFB87669770 torch_cuda.dll!at::native::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FFB876C284800007FFB8767E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFB876D107D00007FFB8767E0A0 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FFBDE95BBF100007FFBDE8CD9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFBDE9AB9DA00007FFBDE9A8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFBDE992ECA00007FFBDE992D40 torch_cpu.dll!at::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FFBDFC9088900007FFBDFC4E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFBDFC9D12D00007FFBDFC4E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FFBDE95BBF100007FFBDE8CD9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FFBDE9AB9DA00007FFBDE9A8FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FFBDE992ECA00007FFBDE992D40 torch_cpu.dll!at::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FFBDFB9C12D00007FFBDFB9BAF0 torch_cpu.dll!torch::autograd::generated::CudnnRnnBackward::apply [<unknown file> @ <unknown line number>]
00007FFBDFB87E9100007FFBDFB87B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FFBE00EF9BA00007FFBE00EF300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FFBE00F03AD00007FFBE00EFFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FFBE00F4FE200007FFBE00F4CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FFBE00F4C4100007FFBE00F4BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FFBC5FF0A7700007FFBC5FCA150 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FFBE00EBF1400007FFBE00EB780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FFC819803BA00007FFC81980360 ucrtbase.dll!o_exp [<unknown file> @ <unknown line number>]
00007FFC82567E9400007FFC82567E80 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FFC84F87AD100007FFC84F87AB0 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]

After some searching, I found the following workaround, which disables cuDNN in PyTorch:

import torch.backends.cudnn
torch.backends.cudnn.enabled = False

Its definition in the PyTorch source:

# Add type annotation for the replaced module
enabled: bool
deterministic: bool
benchmark: bool

For now I have found this one write-up:

What the torch.backends.cudnn settings do in PyTorch

and some other possibly useful material:
https://blog.csdn.net/qq_39938666/article/details/86611474

https://github.com/pytorch/pytorch/issues/17543

https://github.com/NVIDIA/tacotron2/issues/109

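Error log

CUDA out of memory

The epochs and batch_size I chose were too large. Of the two, batch_size is the one that determines peak GPU memory use.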

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 6001/6001 [00:02<00:00, 2226.86it/s]
Load Data Done
Initial model...
Initial model Done
Start Train...
Traceback (most recent call last):
  File "train.py", line 165, in <module>
    train(model, train_iter, optimizer, criterion, device)
  File "train.py", line 28, in train
    loss = model.neg_log_likelihood(x, y) # logits: (N, T, VOCAB), y: (N, T)
  File "D:\programing\Bert-BiLSTM-CRF-pytorch\Bert-BiLSTM-CRF-pytorch\crf.py", line 150, in neg_log_likelihood
    feats = self._get_lstm_features(sentence)  #[batch_size, max_len, 16]
  File "D:\programing\Bert-BiLSTM-CRF-pytorch\Bert-BiLSTM-CRF-pytorch\crf.py", line 159, in _get_lstm_features
    embeds = self._bert_enc(sentence)  # [8, 75, 768]
  File "D:\programing\Bert-BiLSTM-CRF-pytorch\Bert-BiLSTM-CRF-pytorch\crf.py", line 108, in _bert_enc
    encoded_layer, _  = self.bert(x)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 733, in forward
    output_all_encoded_layers=output_all_encoded_layers)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 406, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 392, in forward
    intermediate_output = self.intermediate(attention_output)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 365, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "D:\main\Anaconda3\envs\Bert-BiLSTM-CRF-pytorch\lib\site-packages\pytorch_pretrained_bert\modeling.py", line 124, in gelu
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
RuntimeError: CUDA out of memory. Tried to allocate 754.00 MiB (GPU 0; 11.00 GiB total capacity; 4.27 GiB already allocated; 524.59 MiB free; 8.10 GiB reserved in total by PyTorch)
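In the message above the request (754.00 MiB) is larger than what is still free (524.59 MiB), so the practical fix is a smaller batch. One way to find a workable size is to halve it until a training step succeeds. The sketch below shows that idea only; `run_step` is a hypothetical callable standing in for one training step of your own code, assumed to raise RuntimeError with "out of memory" in the message, as PyTorch does.

```python
def largest_batch_that_fits(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error.

    run_step is a hypothetical callable that runs one training step with
    the given batch size and raises RuntimeError on failure.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only retry on out-of-memory; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= %d fits in GPU memory" % min_batch_size)
```

In a real PyTorch script it is probably worth calling `torch.cuda.empty_cache()` between attempts so that memory held by the failed step is released before the retry.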
