Notes on setting up a TensorFlow model training environment on Ubuntu 18.04

久别涉远道、

Last modified 2024-06-18 10:36:28

Category: Ubuntu. Tags: ubuntu, notes, python, tensorflow, conda

First published 2024-06-17 16:04:49

Article link: https://blog.csdn.net/qq_39631051/article/details/139738480

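Contents

Preface
I. Install Ubuntu
II. Install the GPU driver
III. Install CUDA
IV. Install cuDNN
V. Install Anaconda
- 1. Install Anaconda
- 2. Create a virtual environment
VI. Deploy the training code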
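Preface

These notes record how I set up a model training environment on Ubuntu 18.04.

I. Install Ubuntu

This build uses Ubuntu 18.04; installation guides are easy to find with a web search. Once the system is installed, set a static IP address and switch apt to a mirror inside China.
After switching mirrors, refresh the package index and upgrade the installed packages: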

sudo apt update
sudo apt upgrade

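II. Install the GPU driver

1. Remove existing drivers

The quotes keep the shell from expanding the pattern against local file names.

sudo apt remove --purge 'nvidia*'
sudo apt autoremove

2. Disable the nouveau driver (the open-source NVIDIA driver)

Edit /etc/modprobe.d/blacklist-nouveau.conf (root privileges are needed to write under /etc):

sudo vim /etc/modprobe.d/blacklist-nouveau.conf

Add the following lines: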

blacklist nouveau
options nouveau modeset=0

Save the file, then rebuild the initramfs so the change takes effect:

sudo update-initramfs -u

Reboot the system:

sudo reboot

After the reboot, run:

lsmod | grep nouveau

No output means nouveau has been disabled successfully.
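The same check can be scripted. Below is a minimal Python sketch that looks for a module name in the first column of lsmod output, so a module that merely lists nouveau under "Used by" is not mistaken for nouveau itself. The function name module_loaded and the sample text are illustrative and not part of the original notes.

```python
def module_loaded(lsmod_output, module_name):
    """Return True if module_name is listed as a loaded module (first column)."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields and fields[0] == module_name:
            return True
    return False


# Illustrative lsmod text: ttm is loaded and names nouveau only as a user.
sample = (
    "Module                  Size  Used by\n"
    "ttm                   106496  1 nouveau\n"
)
print(module_loaded(sample, "nouveau"))
```

On the real machine the text would come from running lsmod, for example through subprocess.check_output(["lsmod"]).decode().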

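3. Stop the desktop environment

sudo service lightdm stop

This assumes lightdm is the display manager. A stock Ubuntu 18.04 desktop uses gdm3 by default, in which case stop gdm3 instead.

4. Install the GPU driver

Source for this step: https://blog.csdn.net/JineD/article/details/131201121
On the NVIDIA website, pick the driver that matches your GPU and download it. The file used here is NVIDIA-Linux-x86_64-535.146.02.run.

Make the driver file executable:

chmod +x NVIDIA-Linux-x86_64-535.146.02.run

Install gcc and make:

sudo apt install gcc make

Start the driver installation:

sudo ./NVIDIA-Linux-x86_64-535.146.02.run

The following prompt appeared on every install I did. The pre-install script seems to fail every time, and the prompt mainly makes sure you know what you are doing. Choose Continue installation.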

The distribution-provided pre-install script failed! Are you sure you want to continue?

The next message concerns 32-bit compatibility libraries. They are not needed here; choose no if asked.

Unable to find a suitable destination to install 32-bit compatibility libraries. 
Your system may not be set up for 32-bit compatibility. 32-bit compatibility files will not be installed; 
if you wish to install them, re-run the installation and set a valid directory with the --compat32-libdir option.

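When asked whether to register the kernel module sources with DKMS, choose no; it is not needed here. As the prompt explains, DKMS only matters if a different kernel is installed later.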

Would you like to register the kernel module sources with DKMS? 
This will allow DKMS to automatically build a new module, 
if you install a different kernel later

Next the installer asks whether to run nvidia-xconfig to update the X configuration file. Choose yes.

Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the
NVIDIA X driver will be used when you restart X?  Any pre-existing X configuration file will be backed up.

Tip: if the following error appears, the X server is still running. Repeat the step above that stops the desktop environment.

ERROR: You appear to be running an X server; please exit X before installing.  
For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.

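To verify the installation, run nvidia-smi; it should print information about the GPU. The CUDA Version: 12.2 shown in its header is the highest CUDA version this driver supports, not the CUDA version installed on the system.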
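To make the "highest supported version" rule concrete, here is a small Python sketch that pulls the driver version and the CUDA Version field out of the nvidia-smi banner and checks a candidate toolkit version against it. The banner string is an illustrative sample built from the versions named in these notes, and the function names are my own.

```python
import re


def parse_nvidia_smi_header(text):
    """Extract (driver_version, max_cuda_version) from the nvidia-smi banner."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", text
    )
    if match is None:
        return None
    return match.group(1), match.group(2)


def version_tuple(version):
    """Turn a dotted version string such as '12.0.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def toolkit_fits_driver(toolkit_version, max_cuda_version):
    """True if the toolkit's major.minor does not exceed the driver's maximum."""
    return version_tuple(toolkit_version)[:2] <= version_tuple(max_cuda_version)[:2]


# Illustrative banner line, not captured from the original machine.
banner = "| NVIDIA-SMI 535.146.02   Driver Version: 535.146.02   CUDA Version: 12.2 |"
driver, max_cuda = parse_nvidia_smi_header(banner)
print(driver, max_cuda)
print(toolkit_fits_driver("12.0.0", max_cuda))
```

Only major.minor is compared, since the banner reports the maximum at that granularity.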

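III. Install CUDA

Download CUDA from the official archive: https://developer.nvidia.com/cuda-toolkit-archive/
Choose a version no higher than the maximum CUDA version the driver supports. I chose 12.0.0, then selected the options that match this system (operating system, architecture, distribution, installer type).

Download the CUDA installer:

wget https://developer.download.nvidia.com/compute/cuda/12.0.0/local_installers/cuda_12.0.0_525.60.13_linux.run

When the download finishes, run:

sudo sh cuda_12.0.0_525.60.13_linux.run

The installer first shows a long license text. Press the space bar repeatedly to page to the end, where this prompt appears: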

Do you accept the previously read EULA?
accept/decline/quit:

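Type accept to accept the license. The GPU driver is already installed, so the installer's bundled driver is left unselected, as the summary printed at the end shows: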

===========
= Summary =
===========	
Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-12.0/
		
Please make sure that
 -   PATH includes /usr/local/cuda-12.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.0/lib64, or, add /usr/local/cuda-12.0/lib64 to /etc/ld.so.conf and run ldconfig as root
		
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.0/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 525.00 is required for CUDA 12.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
		
Logfile is /var/log/cuda-installer.log

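Add the environment variables. Open ~/.bashrc:

vi ~/.bashrc

Append the following lines at the end of the file:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export CUDA_HOME=/usr/local/cuda

Note: /usr/local/cuda is a symbolic link to the directory of the current CUDA version; the real path here is /usr/local/cuda-12.0. If several CUDA versions are installed and you need to switch between them, delete the symlink and create a new one pointing at the version you want.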
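As a rough illustration of switching the symlink, the Python sketch below repoints a link from one version directory to another. Instead of deleting and recreating the link, it creates a temporary link and renames it over the old one, so the path never disappears in between; that is a swapped-in variant, not the procedure from these notes. It runs in a scratch directory, and cuda-11.8 is a hypothetical second install.

```python
import os
import tempfile


def switch_cuda_symlink(link_path, target_dir):
    """Point link_path at target_dir, replacing an existing symlink if present."""
    if not os.path.isdir(target_dir):
        raise ValueError("target does not exist: {}".format(target_dir))
    if os.path.lexists(link_path) and not os.path.islink(link_path):
        raise ValueError("refusing to replace a non-symlink: {}".format(link_path))
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_dir, tmp_link)
    os.replace(tmp_link, link_path)  # rename over the old link in one step


# Demonstration in a scratch directory, so nothing under /usr/local is touched.
root = tempfile.mkdtemp()
for name in ("cuda-11.8", "cuda-12.0"):  # cuda-11.8 is hypothetical
    os.mkdir(os.path.join(root, name))
link = os.path.join(root, "cuda")
switch_cuda_symlink(link, os.path.join(root, "cuda-12.0"))
switch_cuda_symlink(link, os.path.join(root, "cuda-11.8"))
print(os.readlink(link))
```

Against the real /usr/local/cuda the script would need root privileges.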

Then reload the file so the changes take effect:

source ~/.bashrc

Verify the installation by checking the CUDA version:

nvcc --version

The output should report the installed release, 12.0 in this case.
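If the version check needs to be automated, the release number can be read from the nvcc output with a short Python sketch like the one below. The sample line only illustrates the general format of nvcc's "release" line; it is not the output captured in the original screenshot, and parse_nvcc_release is an illustrative name.

```python
import re


def parse_nvcc_release(text):
    """Return (major, minor) from the 'release X.Y' part of nvcc --version output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative line in the format nvcc prints.
sample_line = "Cuda compilation tools, release 12.0, V12.0.76"
print(parse_nvcc_release(sample_line))  # prints (12, 0)
```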
Change into the demo directory:

cd /usr/local/cuda/extras/demo_suite/

Run:

./deviceQuery

A final line of Result = PASS means that CUDA and the GPU are working correctly.

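IV. Install cuDNN

Installing cuDNN only means copying the downloaded files into the CUDA directory; no separate software is actually installed.
Download page: https://developer.nvidia.com/cudnn-archive
The version chosen here is Download cuDNN v8.9.7 (December 5th, 2023), for CUDA 12.x.
1. Extract the downloaded archive

tar -xvf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz

2. Copy the header files

sudo cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/cudnn* /usr/local/cuda/include/

3. Copy the library files (-P keeps the symbolic links between the .so files intact)

sudo cp -P ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/libcudnn* /usr/local/cuda/lib64/

4. Add read permission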

sudo chmod a+r /usr/local/cuda/include/cudnn*
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
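One way to confirm that the headers landed in the right place is to read the version macros that cuDNN 8 keeps in cudnn_version.h. The Python sketch below parses them from header text; the sample header is a trimmed stand-in for the real file, which after the copy above would be /usr/local/cuda/include/cudnn_version.h.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn_version.h text, or None if missing."""
    values = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+{}\s+(\d+)".format(key), header_text)
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


# Trimmed stand-in for the real header; only the three macros are shown.
sample_header = """
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 7
"""
print(parse_cudnn_version(sample_header))  # prints (8, 9, 7)
```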

V. Install Anaconda

1. Install Anaconda

Installation reference: https://blog.csdn.net/G_C_H/article/details/133553961
1. Download Anaconda
Download it from the Tsinghua mirror: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D
The installer chosen here is Anaconda3-2024.02-1-Linux-x86_64.sh.
2. Install Anaconda. First make the installer executable:

sudo chmod +x ./Anaconda3-2024.02-1-Linux-x86_64.sh

Then run it:

./Anaconda3-2024.02-1-Linux-x86_64.sh

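Near the end the installer asks whether to initialize conda automatically. You can answer yes, or answer no and configure it yourself.
If you answered no, run the following (replace /home/wlp/anaconda3 with your own install path):

eval "$(/home/wlp/anaconda3/bin/conda shell.bash hook)"

Initialize conda for the shell:

conda init

To keep the shell from entering conda's base environment at startup, so that you activate the environment you need yourself:

source ~/.bashrc
conda config --set auto_activate_base false

Verify: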

conda env list

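2. Create a virtual environment

To open the Anaconda graphical interface, run:

anaconda-navigator

The command for creating a virtual environment is conda create --name <environment name>:

conda create --name py35-tsflow

Because I already have a backup of the environment, py35-tsflow.yaml, I simply imported it this time instead of building the environment by hand (for example with conda env create -n py35-tsflow -f py35-tsflow.yaml, where -n overrides the name field in the file).
Note: to train on the GPU, make sure the libraries installed into the virtual environment are the GPU builds; otherwise training will not run on the GPU.
The rest of this article assumes training on the GPU.
Below is my backup environment file, py35-tsflow.yaml. TensorFlow is the GPU build: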

name: py35-tsflow-copy
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.4.1=py35_0
  - astor=0.7.1=py35_0
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2019.10.16=0
  - cairo=1.14.12=h8948797_3
  - certifi=2018.8.24=py35_1
  - cudatoolkit=9.2=0
  - cudnn=7.6.4=cuda9.2_0
  - cupti=9.2.148=0
  - cycler=0.10.0=py35hc4d5149_0
  - cython=0.28.5=py35hf484d3e_0
  - dbus=1.13.12=h746ee38_0
  - expat=2.2.6=he6710b0_0
  - ffmpeg=4.0=hcdf2ecd_0
  - fontconfig=2.13.0=h9420a91_0
  - freeglut=3.0.0=hf484d3e_5
  - freetype=2.9.1=h8a8886c_1
  - gast=0.3.2=py_0
  - glib=2.63.1=h5a9c865_0
  - graphite2=1.3.13=h23475e2_0
  - grpcio=1.12.1=py35hdbcaa40_0
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - harfbuzz=1.8.8=hffaf4a1_0
  - hdf5=1.10.2=hba1933b_1
  - icu=58.2=h9c2bf20_1
  - intel-openmp=2019.4=243
  - jasper=2.0.14=h07fcdf6_1
  - jpeg=9b=h024ee3a_2
  - kiwisolver=1.0.1=py35hf484d3e_0
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libglu=9.0.0=hf484d3e_1
  - libopencv=3.4.2=hb342d67_1
  - libopus=1.3=h7b6447c_0
  - libpng=1.6.37=hbc83047_0
  - libprotobuf=3.6.0=hdbcaa40_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.1.0=h2733197_0
  - libuuid=1.0.3=h1bed415_2
  - libvpx=1.7.0=h439df22_0
  - libxcb=1.13=h1bed415_1
  - libxml2=2.9.9=hea5a465_1
  - markdown=2.6.11=py35_0
  - matplotlib=3.0.0=py35h5429711_0
  - mkl=2018.0.3=1
  - mkl_fft=1.0.6=py35h7dd41cf_0
  - mkl_random=1.0.1=py35h4414c95_1
  - ncurses=6.1=he6710b0_1
  - numpy=1.15.2=py35h1d66e8a_0
  - numpy-base=1.15.2=py35h81de0dd_0
  - olefile=0.46=py_0
  - openssl=1.0.2t=h7b6447c_1
  - pcre=8.43=he6710b0_0
  - pillow=5.2.0=py35heded4f4_0
  - pip=10.0.1=py35_0
  - pixman=0.38.0=h7b6447c_0
  - protobuf=3.6.0=py35hf484d3e_0
  - py-opencv=3.4.2=py35hb342d67_1
  - pyparsing=2.4.5=py_0
  - pyqt=5.9.2=py35h05f1152_2
  - python=3.5.6=hc3d631a_0
  - python-dateutil=2.8.1=py_0
  - pytz=2019.3=py_0
  - qt=5.9.6=h8703b6f_2
  - readline=7.0=h7b6447c_5
  - scipy=1.1.0=py35hfa4b5c9_1
  - setuptools=40.2.0=py35_0
  - sip=4.19.8=py35hf484d3e_0
  - six=1.11.0=py35_1
  - sqlite=3.30.1=h7b6447c_0
  - tbb=2019.8=hfd86e86_0
  - tbb4py=2018.0.5=py35h6bb024c_0
  - tensorboard=1.10.0=py35hf484d3e_0
  - tensorflow=1.10.0=gpu_py35hd9c640d_0
  - tensorflow-base=1.10.0=gpu_py35had579c0_0
  - tensorflow-gpu=1.10.0=hf154084_0
  - termcolor=1.1.0=py35_1
  - tk=8.6.8=hbc83047_0
  - tornado=5.1.1=py35h7b6447c_0
  - werkzeug=0.16.0=py_0
  - wheel=0.31.1=py35_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.3.7=h0b5b093_0
  - pip:
    - easydict==1.9
    - pycocotools==2.0
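Since the GPU build is the part that matters, here is a quick way to spot it in an exported environment file. This Python sketch scans conda pins of the form name=version=build and reports those whose build string mentions gpu. The function name is illustrative and the sample lines are copied from the file above; a full YAML parser is deliberately not used, so the sketch has no third-party dependency.

```python
def gpu_builds(dependency_lines):
    """Return {package: build} for conda pins whose build string contains 'gpu'."""
    found = {}
    for line in dependency_lines:
        spec = line.strip().lstrip("-").strip()
        parts = spec.split("=")
        # Conda pins have exactly three fields: name, version, build.
        if len(parts) == 3 and "gpu" in parts[2]:
            found[parts[0]] = parts[2]
    return found


pins = [
    "  - _tflow_select=2.1.0=gpu",
    "  - numpy=1.15.2=py35h1d66e8a_0",
    "  - tensorflow=1.10.0=gpu_py35hd9c640d_0",
    "  - tensorflow-base=1.10.0=gpu_py35had579c0_0",
    "    - easydict==1.9",
]
print(sorted(gpu_builds(pins)))
```

An empty result for a file that is supposed to describe a GPU environment would suggest that the CPU build of TensorFlow was exported.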

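VI. Deploy the training code

1. Install Python libraries

First activate the conda virtual environment; all Python libraries are installed inside it.
Activation command (older conda releases use source activate py35-tsflow instead):

conda activate py35-tsflow

Here py35-tsflow is the name of the virtual environment. Run conda env list to see all environments.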
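Once the environment is active, it may be worth confirming that TensorFlow can actually reach the GPU before going further. The Python sketch below uses tf.test.is_gpu_available(), which is available in TensorFlow 1.x and deprecated in later releases; the wrapper function is my own and returns None when TensorFlow cannot be imported at all.

```python
def tensorflow_sees_gpu():
    """Return True or False from TensorFlow, or None if it is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # TF 1.x style check; later releases mark this call as deprecated.
    return bool(tf.test.is_gpu_available())


print("TensorFlow sees a GPU:", tensorflow_sees_gpu())
```

A result of False inside the GPU environment usually points at a driver or CUDA library mismatch rather than at the training code.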
Install Cython, opencv-python, easydict, numpy and the other required Python libraries.
Install commands (using the Tsinghua PyPI mirror for these calls only):

pip install Cython -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install opencv-python -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple

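2. Deploy the training code

Download the fast-rcnn project: https://github.com/rbgirshick/fast-rcnn
Download the tsflow-rcnn project code: https://github.com/dBeker/Faster-RCNN-TensorFlow-Python3.5.git
Extract both projects to a local directory; here they are extracted under /home/wlp/.
To build the Cython extension, change into the lib directory of the fast-rcnn project.
Run make there. Do this inside the Anaconda Python 3.5 environment; otherwise the extension is built for whatever Python is currently active: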

make

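If make reports that the build was skipped, the Cython extension already exists; otherwise it is built.
Copy the generated extension, cython_bbox.cpython-35m-x86_64-linux-gnu.so, from ./lib/utils/ into the /lib/utils/ directory of the tsflow-rcnn project.

Note: in my setup the tsflow directory is the tsflow-rcnn project directory; I only renamed it when extracting. The file name cython_bbox.cpython-35m-x86_64-linux-gnu.so depends on the environment, so yours may differ.
Change into the tsflow-rcnn project's /data/coco/PythonAPI/ directory and run: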

python setup.py build_ext --inplace
python setup.py build_ext install

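3. Download the pretrained model

Download link: http://download.tensorflow.org/models/vgg_16_2016_08_28.tar.gz
Copy the archive into /tsflow-rcnn/data/imagenet_weights/; create the directory if it does not exist.
Extract it:

tar -zxvf vgg_16_2016_08_28.tar.gz

Rename the extracted model file to vgg16.ckpt.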

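4. Create the directories and training files

Create whatever directories and files are missing from the project layout. The directories are:
imagenet_weights: holds the pretrained model
Annotations: holds the annotation files
JPEGImages: holds the image files
ImageSets/Main: holds the test, training and validation set list files. They can be created empty; their contents are filled in when training is run.
Note: after uploading the image and annotation files, update the image path recorded in each annotation file so that it points to where the image now lives.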
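Updating the paths by hand gets tedious with many annotations, so here is a hedged Python sketch of how it could be scripted. It assumes VOC-style XML with filename and path elements, as written by common labelling tools; the function name and the target directory are hypothetical. If an annotation has no path element, the text is returned unchanged.

```python
import os
import xml.etree.ElementTree as ET


def rewrite_image_path(xml_text, image_dir):
    """Return annotation XML whose <path> points at image_dir/<filename>."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    path_node = root.find("path")
    if filename is None or path_node is None:
        return xml_text  # nothing to rewrite in this annotation
    path_node.text = os.path.join(image_dir, filename)
    return ET.tostring(root, encoding="unicode")


# Illustrative annotation; real files also carry size and object elements.
sample_xml = (
    "<annotation>"
    "<filename>000001.jpg</filename>"
    "<path>C:/old/location/000001.jpg</path>"
    "</annotation>"
)
print(rewrite_image_path(sample_xml, "/new/JPEGImages"))  # hypothetical directory
```

To process a whole Annotations directory, read each .xml file, pass its text through the function and write the result back.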

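5. Start training

Activate the conda virtual environment; all commands below run inside it.
Generate the image set lists: edit tsflow-rcnn/imageset.py and change the paths in it to match your data.
After saving the changes, run:

python imageset.py

Then edit /tsflow-rcnn/lib/datasets/pascal_voc.py and replace the class names in the original file with the labels used in your own annotations.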

Before training, delete the files previously generated in tsflow-rcnn/data/cache. Training reads the cached data from that directory, and stale entries cause errors.
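Clearing the cache can be folded into a small helper that is run before each training session. The Python sketch below deletes only regular files directly inside the given directory and leaves subdirectories alone. It is demonstrated on a scratch directory; the cache file name is illustrative, and clear_cache is not a function from the project.

```python
import os
import tempfile


def clear_cache(cache_dir):
    """Delete regular files directly inside cache_dir; return the names removed."""
    removed = []
    if not os.path.isdir(cache_dir):
        return removed
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed


# Scratch directory standing in for tsflow-rcnn/data/cache.
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "voc_2007_trainval_gt_roidb.pkl"), "w").close()
print(clear_cache(scratch))
```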
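Start training from the tsflow-rcnn/ directory, inside the Anaconda py35-tsflow environment.
Run:

python ./train.py

Wait for training to finish. The trained model is written to /tsflow/default/voc_2007_trainval/default/
To use the model, copy all four model files together, for example: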

vgg16_faster_rcnn_iter_40000.ckpt.data-00000-of-00001
vgg16_faster_rcnn_iter_40000.ckpt.index
vgg16_faster_rcnn_iter_40000.ckpt.meta
vgg16_faster_rcnn_iter_40000.pkl

Testing after training
Copy the files from the 40000-iteration checkpoint above into tsflow-rcnn/output/vgg16/voc_2007_trainval/default/ and rename them with the prefix vgg16, for example:

vgg16.ckpt.data-00000-of-00001
vgg16.ckpt.index
vgg16.ckpt.meta
vgg16.pkl
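The copy-and-rename step can be done in one pass. The Python sketch below copies every file that shares a checkpoint prefix into a destination directory under a new prefix, keeping the suffixes intact. It runs against scratch directories here; the helper name copy_checkpoint is my own.

```python
import glob
import os
import shutil
import tempfile


def copy_checkpoint(src_dir, src_prefix, dst_dir, dst_prefix):
    """Copy files named src_prefix.* into dst_dir, renamed to dst_prefix.*."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(src_dir, src_prefix + ".*"))):
        suffix = os.path.basename(src)[len(src_prefix):]
        dst = os.path.join(dst_dir, dst_prefix + suffix)
        shutil.copy2(src, dst)
        copied.append(os.path.basename(dst))
    return copied


# Scratch directories standing in for the training output and test directories.
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), "default")
prefix = "vgg16_faster_rcnn_iter_40000"
for suffix in (".ckpt.data-00000-of-00001", ".ckpt.index", ".ckpt.meta", ".pkl"):
    open(os.path.join(src, prefix + suffix), "w").close()
print(copy_checkpoint(src, prefix, dst, "vgg16"))
```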

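Put the test images into /tsflow-rcnn/data/demo/

Edit the code in tsflow-rcnn/demo.py:

In older versions of the code n_classes is hard-coded, so the number of classes has to be changed by hand. In newer versions it is enough to edit the class list CLASSES.

net.create_architecture(sess, "TEST", n_classes, tag='default', anchor_scales=[8, 16, 32])

In the older code n_classes is 21: 20 object classes plus background. Change it to the number of your classes plus one for background. For example, to detect two kinds of objects, set it to 3.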
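The "your classes plus background" rule can be kept consistent by deriving the count from the class list instead of typing it in. The Python sketch below assumes the background entry is named __background__ and comes first, and the labels cat and dog are hypothetical.

```python
def build_classes(object_names):
    """Prepend the background entry to the user's object labels."""
    return ("__background__",) + tuple(object_names)


CLASSES = build_classes(["cat", "dog"])  # hypothetical labels
n_classes = len(CLASSES)
print(n_classes)  # prints 3
```

With 20 object labels the same code yields 21, matching the hard-coded value in the older code.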
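With everything in place, run the test:

python demo.py

The test displays the result images directly. If it is not run from the Ubuntu desktop, the images cannot be displayed; in that case, modify demo.py to save the images instead of showing them.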
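Saving instead of showing could look like the Python sketch below, assuming demo.py draws its results with matplotlib (which the earlier pip install suggests). It selects the non-interactive Agg backend so no display is needed, and writes the figure to a file. The plotted line is only a stand-in for the detection drawing, and save_demo_figure is an illustrative name.

```python
def save_demo_figure(output_path):
    """Draw a placeholder figure and write it to output_path without a display."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; works over SSH
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot([0, 1], [0, 1])  # stand-in for the detection boxes drawn in demo.py
    fig.savefig(output_path, bbox_inches="tight")
    plt.close(fig)
    return output_path


# Usage (the file name is hypothetical):
# save_demo_figure("result_000001.png")
```

In demo.py itself the equivalent change would be to call savefig on each figure where the script currently shows it.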
To leave the conda virtual environment:

conda deactivate
