https://www.douban.com/note/568373446/?type=like
[DL] GTX1080 + Ubuntu16.04 + CUDA 8.0RC + Tensorflow + Theano + keras
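I recently put together a machine with three GTX 1080 cards and set up a TF + Theano + keras training environment on it.
There were quite a few pitfalls along the way, so I am noting them here :)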
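# Ubuntu USB install fails with "Failed to copy file from CD-ROM":
Write the ISO image to the USB drive with win32diskimager
# At boot the system reports nouveau error: unknown chipset
# nouveau does not recognize the GTX 1080 - disable nouveau
sudo vi /etc/modprobe.d/blacklist.conf
# Add: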
blacklist nouveau
sudo update-initramfs -u
sudo reboot
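# After the reboot it can help to confirm that nouveau is really gone.
# A minimal sketch, assuming a hypothetical helper named module_loaded that
# reads a /proc/modules-style listing (the same data lsmod prints):

```shell
# module_loaded NAME [LISTING]: succeed if NAME appears as a loaded module.
# LISTING defaults to /proc/modules; the helper name is an assumption.
module_loaded() (
    name=$1
    listing=${2:-/proc/modules}
    # Each line of /proc/modules starts with the module name followed by a space.
    # -q prints nothing, -s stays quiet if the listing cannot be read.
    grep -qs "^${name} " "$listing"
)

if module_loaded nouveau; then
    echo "nouveau is still loaded"
else
    echo "nouveau is not loaded"
fi
```

# If nouveau is still loaded, check that update-initramfs ran before the reboot.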
# Prepare the system environment
sudo apt-get install build-essential wget
# Install gcc/g++ 4.8
sudo apt-get install gcc-4.8 gcc-4.8-multilib g++-4.8 g++-4.8-multilib
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 50
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 60
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 50
# Switch the gcc/g++ version
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
# Remove gcc/g++ 4.8 from the alternatives
# sudo update-alternatives --remove gcc /usr/bin/gcc-4.8
# sudo update-alternatives --remove g++ /usr/bin/g++-4.8
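# The steps below switch between gcc-4.8 and gcc-5 more than once, so it is
# easy to lose track of which one is active. A minimal sketch, assuming a
# hypothetical helper named gcc_version_matches that compares version strings:

```shell
# gcc_version_matches VERSION WANTED: succeed if VERSION equals WANTED or
# starts with "WANTED." (the helper name is an assumption).
gcc_version_matches() (
    version=$1   # e.g. the output of: gcc -dumpversion
    wanted=$2    # e.g. 4.8 or 5
    case "$version" in
        "$wanted"|"$wanted".*) exit 0 ;;
        *) exit 1 ;;
    esac
)

# Usage: confirm gcc 4.8 is selected before installing the CUDA toolkit.
if command -v gcc >/dev/null 2>&1; then
    if gcc_version_matches "$(gcc -dumpversion)" 4.8; then
        echo "gcc 4.8 is active"
    else
        echo "active gcc is $(gcc -dumpversion); run: sudo update-alternatives --config gcc"
    fi
fi
```

# Pass 5 instead of 4.8 before building the nvidia driver.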
# CUDA 8.0RC
# https://developer.nvidia.com/cuda-release-candidate-download
# Install the CUDA toolkit
# Switch to gcc-4.8 first
sudo dpkg -i cuda-repo-ubuntu1604-8-0-rc_8.0.27-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
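# Configure environment variables
# (single quotes keep $PATH and $LD_LIBRARY_PATH unexpanded until ~/.bashrc is sourced)
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc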
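# Running the echo lines above a second time appends the same exports again.
# A minimal sketch of an idempotent variant, assuming a hypothetical helper
# named append_once:

```shell
# append_once FILE LINE: append LINE to FILE unless that exact line is
# already present (the helper name is an assumption).
append_once() (
    file=$1
    line=$2
    touch "$file"
    # -x matches whole lines only, -F treats LINE as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
)

# Usage (left commented out so nothing is written by accident):
# append_once ~/.bashrc 'export CUDA_HOME=/usr/local/cuda'
# append_once ~/.bashrc 'export PATH=/usr/local/cuda/bin:$PATH'
# append_once ~/.bashrc 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH'
```

# Open a new shell or run: source ~/.bashrc for the variables to take effect.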
# Install cuDNN
tar -xf cudnn-8.0-linux-x64-v5.0-ga.tgz
sudo cp -f cuda/lib64/*.* /usr/local/cuda/lib64/
sudo cp -f cuda/include/*.* /usr/local/cuda/include/
# Note: GeForce GTX 1080 Developers must re-install the latest driver from www.nvidia.com/drivers after installing any of these CUDA Toolkits.
# Note: gcc-4.8 cannot compile the nvidia driver
# Note: allow dkms when the driver installer asks
# Switch to gcc-5
sudo sh NVIDIA-Linux-x86_64-*.run
# To uninstall the driver: sudo nvidia-uninstall
# Test
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
# modprobe: ERROR: could not insert 'nvidia_361_uvm': Invalid argument
# This happens because CUDA 8.0 ships with version 361 of the nvidia driver, which has to be removed
sudo apt-get remove nvidia-361
---------------------------------------
The following packages will be REMOVED:
cuda cuda-8-0 cuda-demo-suite-8-0 cuda-drivers cuda-runtime-8-0 nvidia-361 nvidia-361-dev
0 upgraded, 0 newly installed, 7 to remove and 76 not upgraded.
After this operation, 312 MB disk space will be freed.
Do you want to continue? [Y/n] y (don't worry, this is fine)
---------------------------------------
# Tensorflow 0.9.0 pip install (does not support CUDA 8.0 at the moment)
sudo apt-get install python-pip python-dev
sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0-cp27-none-linux_x86_64.whl
# Test
python -c "import tensorflow"
# ImportError: libcudart.so.7.5: cannot open shared object file: No such file or directory (does not support CUDA 8.0 at the moment)
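# The error names libcudart.so.7.5, while the toolkit installed above is 8.0.
# A minimal sketch for seeing which runtime versions are actually present,
# assuming a hypothetical helper named list_cudart:

```shell
# list_cudart [DIR]: print the libcudart files found in DIR.
# DIR defaults to /usr/local/cuda/lib64; the helper name is an assumption.
list_cudart() (
    dir=${1:-/usr/local/cuda/lib64}
    for f in "$dir"/libcudart.so*; do
        # If nothing matches, the glob stays literal, so skip paths that do not exist.
        [ -e "$f" ] && basename "$f"
    done
    exit 0
)

list_cudart
```

# If only 8.0 files are listed, the prebuilt 0.9.0 wheel cannot load; building from source is covered further down.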
# Tensorflow 0.9.0 docker install (does not support CUDA 8.0 at the moment)
sudo docker pull tensorflow/tensorflow:r0.9-gpu
# Tensorflow 0.9.0 build from source
# Install bazel
sudo apt-get install openjdk-8-jdk
echo "deb http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://storage.googleapis.com/bazel-apt/doc/apt-key.pub.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install bazel
# Build tensorflow
sudo apt-get install python-numpy swig python-dev
mkdir ~/github && cd ~/github
git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd ~/github/tensorflow && ./configure
---------------------------------------
Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with GPU support? [y/N] y
GPU support will be enabled for TensorFlow
Please specify which gcc nvcc should use as the host compiler. [Default is /usr/bin/gcc]:
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the Cudnn version you want to use. [Leave empty to use system default]: 5 (not 5.0)
Please specify the location where cuDNN 5 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]:
Setting up Cuda include
Setting up Cuda lib64
Setting up Cuda bin
Setting up Cuda nvvm
Setting up CUPTI include
Setting up CUPTI lib64
Configuration finished
---------------------------------------
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install /tmp/tensorflow_pkg/tensorflow-0.9.0-py2-none-any.whl
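# Test
python -c "import tensorflow"
# ImportError: cannot import name pywrap_tensorflow: a reboot fixed it for me
# (this error is also known to appear when python is started from inside the tensorflow source directory, so running it from another directory may be enough)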
sudo reboot
# Theano & keras
sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose libopenblas-dev git
sudo pip install Theano
sudo pip install keras
# Configure Theano
echo "[global]" > ~/.theanorc
echo "floatX = float32" >> ~/.theanorc
echo "device = gpu0" >> ~/.theanorc
echo "[nvcc]" >> ~/.theanorc
echo "fastmath = True" >> ~/.theanorc
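# The five echo lines above can also be written as one here-document.
# A minimal sketch, assuming a hypothetical helper named write_theanorc:

```shell
# write_theanorc [TARGET]: write the Theano settings to TARGET, replacing
# any existing file. TARGET defaults to ~/.theanorc; the helper name is an
# assumption.
write_theanorc() (
    target=${1:-"$HOME/.theanorc"}
    # The quoted 'EOF' delimiter keeps the body literal (no expansion).
    cat > "$target" <<'EOF'
[global]
floatX = float32
device = gpu0
[nvcc]
fastmath = True
EOF
)

# Usage (left commented out because it overwrites the file):
# write_theanorc
```

# The settings are the same as above; only the way of writing the file differs.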
# Test
python -c "import keras"
# matplotlib
sudo apt-get build-dep python-matplotlib
# E: You must put some 'source' URIs in your sources.list
sudo vi /etc/apt/sources.list
# Remove the leading # from every deb-src line, then re-run the build-dep command above after the update
sudo apt-get update
sudo pip install matplotlib
# h5py
sudo apt-get install libhdf5-dev
sudo apt-get install cython
sudo pip install h5py
# Docker
# Update apt sources
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
sudo vi /etc/apt/sources.list.d/docker.list
# Add (14.04):
deb https://apt.dockerproject.org/repo ubuntu-trusty main
# Add (16.04):
deb https://apt.dockerproject.org/repo ubuntu-xenial main
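# The two entries differ only in the Ubuntu codename, so the line can be
# generated instead of typed. A minimal sketch, assuming a hypothetical
# helper named docker_repo_line:

```shell
# docker_repo_line CODENAME: print the docker.list entry for that Ubuntu
# codename (the helper name is an assumption).
docker_repo_line() {
    printf 'deb https://apt.dockerproject.org/repo ubuntu-%s main\n' "$1"
}

# Usage: lsb_release -cs prints the codename (trusty on 14.04, xenial on 16.04).
# docker_repo_line "$(lsb_release -cs)" | sudo tee /etc/apt/sources.list.d/docker.list
docker_repo_line xenial
```

# Piping into sudo tee, as in the bazel step earlier, avoids opening an editor as root.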
sudo apt-get update
sudo apt-get install docker-engine
sudo service docker start
# add user group
sudo groupadd docker
sudo usermod -aG docker [your username]