ubuntu server 18.04使用tensorflow进行ddqn训练全过程

0. 前言

需要使用ddqn完成某项任务,为了快速训练,使用带有GPU的服务器进行训练。记录下整个过程,以及遇到的坑。

1. 选择模板代码

参考代码来源
GitHub
该代码最后一次更新是Mar 24, 2020。

环境配置:
python3.8
运行安装脚本:

apt-get update
apt-get install xvfb
apt-get install python-opengl
apt-get install python3-pip
python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/

安装python requirements
所需requirements文件

tensorflow
tensorlayer
opencv-python-headless
matplotlib
pyglet==1.5.27
gym==0.20.0
python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/

运行模板代码

xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py

2. ubuntu 环境准备

本部分为踩坑记录,不需要跟着做

服务器之前有其他人用过,也可能是系统自带python,因此会有python环境,首先查看python版本

python --version

显示python命令对应的版本是2.7.17,之后查找该命令对应的符号链接文件位置。

which python

会显示python命令使用的符号链接文件

/usr/bin/python

查看该路径下还有没有其它python版本

ls -al | grep python

输出如下

lrwxrwxrwx  1 root   root           9 Apr 16  2018 python -> python2.7
lrwxrwxrwx  1 root   root           9 Apr 16  2018 python2 -> python2.7
-rwxr-xr-x  1 root   root     3628904 Nov 29 02:51 python2.7
lrwxrwxrwx  1 root   root           9 Jun 22  2018 python3 -> python3.6
-rwxr-xr-x  2 root   root     4526456 Nov 25 22:10 python3.6
-rwxr-xr-x  2 root   root     4526456 Nov 25 22:10 python3.6m
-rwxr-xr-x  1 root   root        1018 Oct 29  2017 python3-jsondiff
-rwxr-xr-x  1 root   root        3661 Oct 29  2017 python3-jsonpatch
-rwxr-xr-x  1 root   root        1342 May  2  2016 python3-jsonpointer
-rwxr-xr-x  1 root   root         398 Nov 16  2017 python3-jsonschema
lrwxrwxrwx  1 root   root          10 Jun 22  2018 python3m -> python3.6m

发现python命令使用的是2.7,但python3可以使用3.6。因为目前有的tensorflow版本不支持2.7了,先使用3.6.

接着查看tensorflow gpu各个版本的要求:官网。如下图所示,我选择了2.0.0,发布于2019年9月30日,和代码更新时间比较近,并且支持python 3.6。
在这里插入图片描述
准备使用pip安装tensorflow,但pip并没有安装,使用一下命令安装pip。

apt install python3-pip

之后执行安装命令

python -m pip install tensorflow-gpu==2.0.0

比较难受的是pip源中并没有2.0.0,换了清华源也没有,输出如下

Could not find a version that satisfies the requirement tensorflow-gpu==2.0.0 (from versions: 1.13.1, 1.13.2, 1.14.0, 2.12.0)
No matching distribution found for tensorflow-gpu==2.0.0

可以看到最新的版本只有2.12.0,那只能安装最新的版本。
还需要选择python版本,至少需要python3.7。我为了之后使用方便直接把python命令的软连接接入到新安装的python3.7上。

apt install python3.7
rm -f /usr/bin/python
ln -s /usr/bin/python3.7  /usr/bin/python
python --version

最后显示版本为3.7.5,替换成功,
这时候需要更新一下pip(后面装tensorflow的时候需要安装很多相关包,如果不升级pip的话会有很多包装不上,其中一个报错是 Failed building wheel for grpcio)。

python -m pip install --upgrade pip

继续安装tensorflow-gpu。

python -m pip install tensorflow-gpu

输出如下

The "tensorflow-gpu" package has been removed!
    
Please install "tensorflow" instead.

Other than the name, the two packages have been identical
since TensorFlow 2.1, or roughly since Sep 2019. For more
information, see: pypi.org/project/tensorflow-gpu

意思是tensorflow2.1之后gpu包没得了,直接pip install tensorflow就可以。(安装速度感人,切换清华源)

python -m pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple/

安装成功!!

3. 模板代码运行

本部分为踩坑记录,不需要跟着做

将代码下载到相应文件夹之后,可以使用如下语句运行ddqn模板代码。

python double_DQN\ \&\ dueling_DQN.py

当然会报很多no module的错误,使用pip依次安装,requirements总结如下

tensorlayer
opencv-python
opencv-python-headless
matplotlib
pyglet

将上述内容写文件,之后一键安装

python -m pip install -r requirments -i https://pypi.tuna.tsinghua.edu.cn/simple/

还需要安装gym,它是一个经常用于测试强化学习的示例,目前新的版本中获取新的状态时参数数量增加了,即以下语句会报错,step函数不仅输出变多了,而且s_的输出也不太正常。因此更换为早一点的版本。

s_,r,done,_ = self.env.step(a)

我根据模板代码的时间查看了gym的tag,发现时间上和模板代码相似,再打开gym的core文件查看step函数,果然从输出数量上合适的。进行安装:

python -m pip install gym==0.20.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

报错如下:

Collecting gym==0.20.0
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f1/16/a421155206e7dc41b3a79d4e9311287b88c20140d567182839775088e9ad/gym-0.20.0.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

原因找了好久,从【参考】中找到了解决办法,更新为指定版本的setuptools:

python -m pip install --upgrade pip setuptools==57.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

运行代码

python double_DQN\ \&\ dueling_DQN.py

发现pyglet最低要求python3.8。。。重新安装python3.8,之后直接使用requirements文件一键安装到新环境。

运行过程中报错

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 27, in <module>
    from pyglet.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/__init__.py", line 47, in <module>
    from pyglet.gl.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/gl.py", line 7, in <module>
    from pyglet.gl.lib import link_GL as _link_function
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib.py", line 98, in <module>
    from pyglet.gl.lib_glx import link_GL, link_GLX
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib_glx.py", line 11, in <module>
    gl_lib = pyglet.lib.load_library('GL')
  File "/usr/local/lib/python3.8/dist-packages/pyglet/lib.py", line 134, in load_library
    raise ImportError(f'Library "{names[0]}" not found.')
ImportError: Library "GL" not found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 179, in render
    from gym.envs.classic_control import rendering
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 29, in <module>
    raise ImportError(
ImportError: 
    Error occurred while running `from pyglet.gl import *`
    HINT: make sure you have OpenGL installed. On Ubuntu, you can run 'apt-get install python-opengl'.
    If you're running on a server, you may need a virtual frame buffer; something like this should work:
    'xvfb-run -s "-screen 0 1400x900x24" python <your_script.py>'

最后说明了原因,缺少OpenGL 。并且在服务器上运行显示有点问题,就按照他给的解决方案处理。

apt-get install python-opengl
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py

处理之后,再次报错

Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 229, in render
    return self.viewer.render(return_rgb_array=mode == "rgb_array")
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 126, in render
    self.transform.enable()
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 232, in enable
    glPushMatrix()
NameError: name 'glPushMatrix' is not defined

原因是pyglet版本太高,降为1.5.27即可。

4. 安装GPU支持

根据tensorflow版本选择cuda和cudnn。

4.1 安装cuda 11.2

wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
chmod +x cuda_11.2.2_460.32.03_linux.run
sudo ./cuda_11.2.2_460.32.03_linux.run

安装完成后,需要将CUDA的路径添加到环境变量中。打开~/.bashrc文件,在文件末尾添加以下两行代码:

export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

然后运行以下命令使环境变量生效:

source ~/.bashrc

验证CUDA的安装是否成功。运行以下命令:

nvcc -V

4.2 安装cudnn 8.1

在NVIDIA官网下载
之后上传到服务器,

tar -xzvf cudnn-11.2-linux-x64-v8.1.1.33.tgz
cp -P cuda/include/cudnn*.h /usr/local/cuda-11.2/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.2/lib64/
chmod a+r /usr/local/cuda-11.2/include/cudnn*.h /usr/local/cuda-11.2/lib64/libcudnn*

使用如下代码测试gpu是否正常使用

import tensorflow as tf

# 显示当前GPU设备信息
print(tf.config.list_physical_devices('GPU'))

# 创建一个TensorFlow的Session并在其中进行一个简单的运算
with tf.compat.v1.Session() as sess:
    # 创建一个TensorFlow的常量张量
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    # 创建一个TensorFlow的变量张量
    b = tf.Variable(tf.random.normal([3, 2], stddev=0.1), name='b')
    # 进行矩阵乘法运算
    c = tf.matmul(a, b, name='c')

    # 初始化所有变量
    sess.run(tf.compat.v1.global_variables_initializer())

    # 运行TensorFlow图
    print(sess.run(c))

输出如下,可以正常使用

2023-03-05 14:43:55.699137: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:55.861866: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-05 14:43:56.628130: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628241: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628256: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-05 14:43:58.016301: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.031069: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.032359: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-03-05 14:43:58.035332: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:58.036833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.038127: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.039367: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.106971: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.108369: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.109663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.110894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30969 MB memory:  -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:06.0, compute capability: 7.0
2023-03-05 14:43:59.127126: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
[[0.50951785 0.10452858]
 [1.070737   0.27480656]]

模板代码使用GPU加速感觉速度没快多少,可能是神经网络层数比较少的原因。

  • 2
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Fourier_1024

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值