1. Confirm that the server OS version is Ubuntu 16.04.2 (run on every machine)
For pre-installation steps, see the official guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions
for i in xsgpu81 xsgpu82 xsgpu83 xsgpu84 xsgpu85; do qssh root@$i 'cat /etc/issue;uname -r';done
Ubuntu 16.04.2 LTS \n \l
4.4.0-62-generic
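The per-host check above can be scripted rather than read by eye. The following is a minimal sketch, assuming the expected release and kernel are the values shown in the output above; the function name check_host_version is hypothetical and not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify the text printed by 'cat /etc/issue; uname -r'.
# Arguments: $1 = combined output, $2 = expected release, $3 = expected kernel.
check_host_version() {
    local output="$1" release="$2" kernel="$3"
    # -F treats the expected values as fixed strings, not regular expressions.
    printf '%s\n' "$output" | grep -qF "Ubuntu $release" || return 1
    # -x requires the kernel version to match a whole line of the output.
    printf '%s\n' "$output" | grep -qxF "$kernel" || return 1
}

# Usage sketch (qssh is the site-specific ssh wrapper used in this document):
# for i in xsgpu81 xsgpu82 xsgpu83 xsgpu84 xsgpu85; do
#     out=$(qssh root@$i 'cat /etc/issue; uname -r')
#     check_host_version "$out" 16.04.2 4.4.0-62-generic || echo "$i: unexpected version"
# done
```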
2. Download and install the NVIDIA driver
If the machine is not clean (GPU-related software was installed on it before), you may first need to run: service lightdm stop
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/375.26/NVIDIA-Linux-x86_64-375.26.run
root@xsgpu81:~# sudo sh NVIDIA-Linux-x86_64-375.26.run
Accept
OK
OK
OK
3. Install CUDA
wget http://ogo0b6qe6.bkt.clouddn.com/cuda_8.0.61_375.26_linux.run
chmod +x cuda_8.0.61_375.26_linux.run
sudo sh cuda_8.0.61_375.26_linux.run --silent
echo 'export PATH=/usr/local/cuda-8.0/bin:$PATH' >> /root/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' >> /root/.bashrc
source /root/.bashrc
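Running the echo lines above a second time appends duplicate entries to /root/.bashrc. A minimal sketch of an idempotent alternative follows; the helper name append_once is hypothetical, and the lines it writes are assumed to be the CUDA 8.0 paths used in this step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if it is not already present.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x: match the whole line, -F: fixed string, -q: no output.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage sketch; single quotes keep $PATH unexpanded until .bashrc is sourced.
# append_once 'export PATH=/usr/local/cuda-8.0/bin:$PATH' /root/.bashrc
# append_once 'export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH' /root/.bashrc
```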
4. Copy the test binary to the machine and run it
qscp NVIDIA_CUDA-8.0_Samples/0_Simple/vectorAdd/vectorAdd root@xsgpu81:/root/
root@xsgpu81:~# ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
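When the sample is run on several hosts, the result can be checked mechanically by looking for the "Test PASSED" line shown above. This is a minimal sketch; the function name sample_passed is hypothetical, and it assumes vectorAdd prints that line exactly as in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if stdin contains the exact line "Test PASSED".
sample_passed() {
    grep -qxF 'Test PASSED'
}

# Usage sketch (qssh is the site-specific ssh wrapper used in this document):
# for i in xsgpu81 xsgpu82; do
#     qssh root@$i './vectorAdd' | sample_passed || echo "$i: vectorAdd did not pass"
# done
```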
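Manually deploying mesos-agent nodes with GPU devices
Deploy mesos-agent and the other base services (boots-docker, consul, logbeat) on the GPU machines following the standard procedure
Manual procedure:
Stop the mesos-agent service on the GPU machine: supervisorctl stop mesos-agent
Clean up the mesos-agent work_dir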
rm -rf "$(cat /home/qboxserver/mesos-agent/current/conf/mesos-agent/work_dir)"
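Go to the mesos-agent configuration directory /home/qboxserver/mesos-agent/current/conf/mesos-agent and update the configuration
Get the number and model of the GPU devices on the machine with nvidia-smi -L; the number of devices listed is the total device count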
Write the device model to the attributes file: echo "NETWORK:BRIDGE;GPU_MODEL:$MODEL" > attributes
Add the isolation setting: echo "cgroups/devices,gpu/nvidia" > isolation
List the usable GPU device indices: echo "0, 1, …, <total device count> - 1" > nvidia_gpu_devices
Add the GPU resource to resources: {"name":"gpus","type":"SCALAR","scalar":{"value":<total device count>}}
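The device count, index list, model name, and resources entry described above can all be derived from the nvidia-smi -L listing. The following is a minimal sketch, assuming each device is printed as one line of the form "GPU <index>: <model> (UUID: ...)" and that GNU seq is available (as on Ubuntu). All function names are hypothetical. The index list is emitted without spaces, which is an assumption about what the agent accepts.

```shell
#!/usr/bin/env bash
# All helpers read 'nvidia-smi -L' output on stdin or take the count as $1.

# Count devices: each device is one line starting with "GPU <index>:".
gpu_count() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Build the comma-separated index list 0..N-1 for nvidia_gpu_devices.
gpu_device_list() {
    local n="$1"
    [ "$n" -gt 0 ] || return 1
    seq -s, 0 $((n - 1))
}

# Extract the model name of the first listed device.
gpu_model() {
    sed -n '1s/^GPU [0-9][0-9]*: \(.*\) (UUID: .*$/\1/p'
}

# Render the gpus entry for the resources file.
gpu_resource_json() {
    printf '{"name":"gpus","type":"SCALAR","scalar":{"value":%d}}\n' "$1"
}

# Usage sketch, run inside the mesos-agent configuration directory:
# listing=$(nvidia-smi -L)
# n=$(printf '%s\n' "$listing" | gpu_count)
# MODEL=$(printf '%s\n' "$listing" | gpu_model)
# gpu_device_list "$n" > nvidia_gpu_devices
# gpu_resource_json "$n"   # merge this entry into the existing resources file by hand
```

The resources entry is only printed rather than written, because the existing resources file already holds other entries that must be preserved.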
Go to /home/qboxserver/mesos-agent/current/libexec/mesos and replace the executor
Keep the original executor: mv mesos-docker-executor mesos-docker-executor.cpp
Download the GPU executor
wget http://ogo0b6qe6.bkt.clouddn.com/mesos-docker-executor-2017-11-18
mv mesos-docker-executor-2017-11-18 mesos-docker-executor; chown qboxserver.qboxserver mesos-docker-executor
cp mesos-docker-executor.go mesos-docker-executor
Install nvidia-docker-plugin
cd /home/qboxserver && mkdir nvidia-docker
cd /home/qboxserver/nvidia-docker
wget