以Ubuntu16.04系统为例(x86架构,64bit),安装Docker社区版和NVIDIA-Docker并下载启动TensorFlow镜像,跑起hello_tensorflow的ipynb。
准备工作
首先需要安装一些基本的组件,否则后面安装cuda会失败(比方会因为缺失gcc/g++/cc等编译工具安装cuda失败等等),所以先执行下面的命令:
1
2
|
sudo
apt
-
get
update
sudo
apt
-
get
install
-
y
build
-
essential
git
|
以上安装完成后,再次确认显卡是否支持cuda,打开NVIDIA官方查看显卡是否支持CUDA的连接:https://developer.nvidia.com/cuda-gpus
里面列举了不同系列下支持CUDA的显卡列表,因为需要翻墙,这里只列出GeForce系列的部分显卡(左边是卡的名字,右边是CUDA计算力,数值越大越好):
GPU | Compute Capability |
---|---|
NVIDIA TITAN Xp | 6.1 |
NVIDIA TITAN X | 6.1 |
GeForce GTX 1080 Ti | 6.1 |
GeForce GTX 1080 | 6.1 |
GeForce GTX 1070 | 6.1 |
GeForce GTX 1060 | 6.1 |
GeForce GTX 1050 | 6.1 |
GeForce GTX TITAN X | 5.2 |
GeForce GTX TITAN Z | 3.5 |
GeForce GTX TITAN Black | 3.5 |
GeForce GTX TITAN | 3.5 |
GeForce GTX 980 Ti | 5.2 |
GeForce GTX 980 | 5.2 |
GeForce GTX 970 | 5.2 |
GeForce GTX 960 | 5.2 |
GeForce GTX 950 | 5.2 |
虽然只是部分列表,但如果GPU不在以上GTX系列卡的名单里,还是建议装CPU版本(跳过安装CUDA的部分),只看安装Docker(不是NVIDIA-Docker)和安装TensorFlow的CPU镜像。
安装CUDA9.0
从官网下载cuda:CUDA Toolkit Download | NVIDIA Developer
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal
点击进入链接看到如下图,根据咱们的系统版本(Linux的Ubuntu发行版的16.04,x86架构),选择对应的选项,Installer Type建议选择第一个。
下载完成后,根据官网给出的命令提示,对下载好的文件执行如下命令安装cuda:
1
|
sudo
sh
cuda_9
.
0.176_384.81_linux.run
|
以下问题在安装的时候会出现(我没有输入的,就是直接回车,选默认值):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
|
Do
you
accept
the
previously
read
EULA
?
accept
/
decline
/
quit
:
accept
Install
NVIDIA
Accelerated
Graphics
Driver
for
Linux
-
x86
_64
384.81
?
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
:
y
Do
you
want
to
install
the
OpenGL
libraries
?
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
[
default
is
yes
]
:
y
Do
you
want
to
run
nvidia
-
xconfig
?
This
will
update
the
system
X
configuration
file
so
that
the
NVIDIA
X
driver
is
used
.
The
pre
-
existing
X
configuration
file
will
be
backed
up
.
This
option
should
not
be
used
on
systems
that
require
a
custom
X
configuration
,
such
as
systems
with
multiple
GPU
vendors
.
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
[
default
is
no
]
:
Install
the
CUDA
9.0
Toolkit
?
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
:
yes
Enter
Toolkit
Location
[
default
is
/
usr
/
local
/
cuda
-
9.0
]
:
Do
you
want
to
install
a
symbolic
link
at
/
usr
/
local
/
cuda
?
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
:
y
Install
the
CUDA
9.0
Samples
?
(
y
)
es
/
(
n
)
o
/
(
q
)
uit
:
y
Enter
CUDA
Samples
Location
[
default
is
/
home
/
yuens
]
:
|
安装成功则会显示以下信息:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
Installing
the
NVIDIA
display
driver
.
.
.
Installing
the
CUDA
Toolkit
in
/
usr
/
local
/
cuda
-
9.0
.
.
.
Missing
recommended
library
:
libGLU
.so
Missing
recommended
library
:
libXmu
.so
Installing
the
CUDA
Samples
in
/
home
/
yuanshuai
.
.
.
Copying
samples
to
/
home
/
yuanshuai
/
NVIDIA_CUDA
-
9.0_Samples
now
.
.
.
Finished
copying
samples
.
===
===
===
==
=
Summary
=
===
===
===
==
Driver
:
Installed
Toolkit
:
Installed
in
/
usr
/
local
/
cuda
-
9.0
Samples
:
Installed
in
/
home
/
yuanshuai
,
but
missing
recommended
libraries
Please
make
sure
that
-
PATH
includes
/
usr
/
local
/
cuda
-
9.0
/
bin
-
LD_LIBRARY_PATH
includes
/
usr
/
local
/
cuda
-
9.0
/
lib64
,
or
,
add
/
usr
/
local
/
cuda
-
9.0
/
lib64
to
/
etc
/
ld
.so
.conf
and
run
ldconfig
as
root
To
uninstall
the
CUDA
Toolkit
,
run
the
uninstall
script
in
/
usr
/
local
/
cuda
-
9.0
/
bin
To
uninstall
the
NVIDIA
Driver
,
run
nvidia
-
uninstall
Please
see
CUDA_Installation_Guide_Linux
.pdf
in
/
usr
/
local
/
cuda
-
9.0
/
doc
/
pdf
for
detailed
information
on
setting
up
CUDA
.
Logfile
is
/
tmp
/
cuda_install_5544
.log
|
安装失败也会给出Logfile的地址,只需要看看log的内容是啥,并解决就OK(像我上面一开始提到的如果没有安装gcc/cc/icc/g++等编译器,就会导致安装cuda失败,这都会在log里面写的很清楚并给出解决方案)。
检查是否安装成功,在命令行输入:nvidia-smi,如果得到以下类似的内容,说明安装成功:
安装Docker
打开官网关于docker installation for Linux的文档页(我建议还是打开下面的链接,看看):
- Get Docker CE for Ubuntu | Docker Documentation
https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/#uninstall-old-versions
我们用的系统是Xenial 16.04 (LTS),所以查看Ubuntu这个版本的部分,首先若先前有安装docker需要先卸载(若没有安装过则无需执行),执行命令:
1
|
sudo
apt
-
get
remove
docker
docker
-
engine
docker
.io
|
Docker的安装有多个方式,这里以最常见的方式(文档第一种)为例:
首先依次执行以下命令(反斜杠\代表一行,只是换行写更清晰),把docker仓库加进到apt里:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
sudo
apt
-
get
update
sudo
apt
-
get
install
\
apt
-
transport
-
https
\
ca
-
certificates
\
curl
\
software
-
properties
-
common
curl
-
fsSL
https
:
/
/
download
.docker
.com
/
linux
/
ubuntu
/
gpg
|
sudo
apt
-
key
add
-
sudo
apt
-
key
fingerprint
0EBFCD88
sudo
add
-
apt
-
repository
\
"
deb
[
arch
=
amd64
]
https
:
/
/
download
.docker
.com
/
linux
/
ubuntu
\
$
(
lsb_release
-
cs
)
\
stable"
|
正式安装docker:
1
2
3
4
5
6
7
|
sudo
apt
-
get
update
sudo
apt
-
get
install
docker
-
ce
apt
-
cache
madison
docker
-
ce
sudo
docker
run
hello
-
world
|
最后一个命令是验证docker是否安装成功,它会下载并执行hello-world镜像。如果安装正确,执行后的结果应该类似下面这样:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
Unable
to
find
image
'hello-world:latest'
locally
latest
:
Pulling
from
library
/
hello
-
world
5b0f327be733
:
Pull
complete
Digest
:
sha256
:
07d5f7800dfe37b8c2196c7b1c524c33808ce2e0f74e7aa00e603295ca9a0972
Status
:
Downloaded
newer
image
for
hello
-
world
:
latest
Hello
from
Docker
!
This
message
shows
that
your
installation
appears
to
be
working
correctly
.
To
generate
this
message
,
Docker
took
the
following
steps
:
1.
The
Docker
client
contacted
the
Docker
daemon
.
2.
The
Docker
daemon
pulled
the
"hello-world"
image
from
the
Docker
Hub
.
3.
The
Docker
daemon
created
a
new
container
from
that
image
which
runs
the
executable
that
produces
the
output
you
are
currently
reading
.
4.
The
Docker
daemon
streamed
that
output
to
the
Docker
client
,
which
sent
it
to
your
terminal
.
To
try
something
more
ambitious
,
you
can
run
an
Ubuntu
container
with
:
$
docker
run
-
it
ubuntu
bash
Share
images
,
automate
workflows
,
and
more
with
a
free
Docker
ID
:
https
:
/
/
cloud
.docker
.com
/
For
more
examples
and
ideas
,
visit
:
https
:
/
/
docs
.docker
.com
/
engine
/
userguide
/
|
安装NVIDIA-Docker
安装完成docker并检查安装正确(能跑出来hello-world)后,如果需要docker容器中有gpu支持,需要再安装NVIDIA-Docker,同样找到并打开该项目的主页:
- NVIDIA/nvidia-docker: Build and run Docker containers leveraging NVIDIA GPUs
https://github.com/NVIDIA/nvidia-docker
可以看到在Quick start小节,选择我们Ubuntu的发行版,依次执行命令:
1
2
3
4
5
6
7
8
9
|
# Install nvidia-docker and nvidia-docker-plugin
wget
-
P
/
tmp
https
:
/
/
github
.com
/
NVIDIA
/
nvidia
-
docker
/
releases
/
download
/
v1
.
0.1
/
nvidia
-
docker_1
.
0.1
-
1_amd64.deb
sudo
dpkg
-
i
/
tmp
/
nvidia
-
docker
*
.deb
&&
rm
/
tmp
/
nvidia
-
docker
*
.deb
# Test nvidia-smi 验证是否安装成功
nvidia
-
docker
run
--
rm
nvidia
/
cuda
nvidia
-
smi
|
上面最后一条命令是检查是否安装成功,安装成功,则会显示关于GPU的信息,类似前面的一个截图:
然后在执行下面这句,默认用nvdia-docker替代docker命令:
1
2
|
echo
'alias docker=nvidia-docker'
>>
~
/
.bashrc
bash
|
下载使用TensorFlow镜像
打开dockerhub关于tensorflow的页面:
- tensorflow/tensorflow – Docker Hub
https://hub.docker.com/r/tensorflow/tensorflow/
根据需要的版本下载tensorflow镜像并开启tensorflow容器:
CPU版本
1
|
docker
run
-
it
-
p
8888
:
8888
tensorflow
/
tensorflow
|
GPU版本
1
|
nvidia
-
docker
run
-
it
-
p
8888
:
8888
tensorflow
/
tensorflow
:
latest
-
gpu
|
如何使用
执行以上命令的结果类似如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
$
nvidia
-
docker
run
-
it
-
p
8888
:
8888
tensorflow
/
tensorflow
:
latest
-
gpu
[
I
02
:
51
:
21.230
NotebookApp
]
Writing
notebook
server
cookie
secret
to
/
root
/
.local
/
share
/
jupyter
/
runtime
/
notebook_cookie
_secret
[
W
02
:
51
:
21.242
NotebookApp
]
WARNING
:
The
notebook
server
is
listening
on
all
IP
addresses
and
not
using
encryption
.
This
is
not
recommended
.
[
I
02
:
51
:
21.249
NotebookApp
]
Serving
notebooks
from
local
directory
:
/
notebooks
[
I
02
:
51
:
21.249
NotebookApp
]
0
active
kernels
[
I
02
:
51
:
21.249
NotebookApp
]
The
Jupyter
Notebook
is
running
at
:
http
:
/
/
[
all
ip
addresses
on
your
system
]
:
8888
/
?
token
=
8f90cc7b9ad6ccc4f36f53f347c7a314220bbcb82dd416ea
[
I
02
:
51
:
21.249
NotebookApp
]
Use
Control
-
C
to
stop
this
server
and
shut
down
all
kernels
(
twice
to
skip
confirmation
)
.
[
C
02
:
51
:
21.249
NotebookApp
]
Copy
/
paste
this
URL
into
your
browser
when
you
connect
for
the
first
time
,
to
login
with
a
token
:
http
:
/
/
localhost
:
8888
/
?
token
=
8f90cc7b9ad6ccc4f36f53f347c7a314220bbcb82dd416ea
[
I
02
:
51
:
31.832
NotebookApp
]
302
GET
/
(
172.17.0.1
)
0.74ms
[
I
02
:
51
:
31.943
NotebookApp
]
302
GET
/
tree
?
(
172.17.0.1
)
1.44ms
|
其中看到有个网址:
1
|
http
:
/
/
localhost
:
8888
/
?
token
=
8f90cc7b9ad6ccc4f36f53f347c7a314220bbcb82dd416ea
|
每个人的网址在token=后面的内容是不一样的,现在我们打开浏览器,输入网址:
1
|
http
:
/
/
localhost
:
8888
/
|
这时会出现如下画面:
输入刚刚token后面的值后,点击login会看到一下画面:
点击第一个1_hello_tensorflow.ipynb,然后可以选择执行所有代码(见下图):
下载使用Kaggle镜像
其实kaggle官方提供了好些个镜像,这里以python的为例,即kaggle/python:latest,这个镜像地址在:https://hub.docker.com/r/kaggle/python/
下载该镜像使用命令:
1
|
docker
pull
kaggle
/
python
:
latest
|
这个镜像包含了tensorflow和xgboost。下面请看:
1
2
3
4
5
6
7
8
9
|
In
[
1
]
:
import
xgboost
as
xgb
In
[
2
]
:
xgb
.__version__
Out
[
2
]
:
'0.6'
In
[
3
]
:
import
tensorflow
as
tf
In
[
4
]
:
tf
.__version__
Out
[
4
]
:
'1.3.0'
|