GPU scheduling approaches
Setting the CUDA_VISIBLE_DEVICES environment variable
$ deviceQuery |& grep ^Device
Device 0: "Tesla M2090"
Device 1: "Tesla M2090"
$ CUDA_VISIBLE_DEVICES=0 deviceQuery |& grep ^Device
Device 0: "Tesla M2090"
If this does not take effect, also try setting
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
so that CUDA enumerates devices in the same PCI bus order that nvidia-smi reports. When setting these environment variables from Python, make sure to do it before importing tensorflow or pycuda, as in the sketch below.
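A minimal sketch of the required ordering (the GPU id "0" is a placeholder):

import os

# Both variables must be set before the first import of tensorflow/pycuda,
# because CUDA enumerates devices when the library initializes.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only physical GPU 0

import tensorflow as tf  # this process now sees a single device as /gpu:0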
Note that CUDA_VISIBLE_DEVICES does not necessarily isolate GPUs completely.
Reference: https://stackoverflow.com/a/58445444/6010781
In docker/kubernetes, the environment variable that is usually set is NVIDIA_VISIBLE_DEVICES instead.
If you do not rely on an external scheduling framework such as k8s or YARN 3.x, then before setting the environment variable you must maintain a resource table yourself that records which GPUs on each node are allocated and which are free, along the lines of the sketch below.
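A minimal single-process sketch (GpuTable and its methods are hypothetical names, not an existing library):

import threading

class GpuTable:
    # Tracks free GPU indices per node for a single scheduler process.
    def __init__(self, node_gpus):
        # node_gpus: e.g. {"node1": [0, 1, 2, 3], "node2": [0, 1]}
        self._free = {node: set(gpus) for node, gpus in node_gpus.items()}
        self._lock = threading.Lock()

    def acquire(self, node):
        # Pop a free GPU index on the node, or return None if all are busy.
        with self._lock:
            return self._free[node].pop() if self._free[node] else None

    def release(self, node, gpu):
        with self._lock:
            self._free[node].add(gpu)

table = GpuTable({"node1": [0, 1]})
gpu = table.acquire("node1")
if gpu is not None:
    # launch the task with CUDA_VISIBLE_DEVICES set to this index
    print("launch with CUDA_VISIBLE_DEVICES=%d" % gpu)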
Specifying the GPU directly in code (strongly discouraged)
import tensorflow as tf

# The device string is hardcoded; this op will only ever run on GPU 0.
with tf.device('/gpu:0'):
    a = tf.constant(3.0)

with tf.Session() as sess:
    while True:  # loop forever so the GPU stays occupied
        print(sess.run(a))
As before, there is no way to know which GPUs are already allocated and which are free, and the hardcoded device string makes the code non-portable.
Detect in code which GPUs have enough free memory, then set the environment variable
import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
    # Hide all but `leave_unmasked` GPUs that have enough free memory.
    ACCEPTABLE_AVAILABLE_MEMORY = 1024  # MiB
    COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"
    try:
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        # Skip the CSV header line, then parse each GPU's free-memory value.
        memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]
        if len(available_gpus) < leave_unmasked:
            raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
    except Exception as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked.', e)

mask_unused_gpus(2)
Reference: https://stackoverflow.com/a/47998168/6010781
Because free GPU memory changes dynamically, two processes can race and pick the same GPU.
Also, the check only runs after the task has already been placed on a node: a task may miss the idle nodes entirely and end up starving for resources on a busy one.
Request a proportional amount of memory with every GPU request
leewyang commented on Dec 9, 2017:
This is dependent on your Spark setup. For instance, in our case, we run Spark on top of Hadoop/YARN, so YARN is responsible for allocating containers to run the Spark executors (which in turn run the TensorFlow nodes). Unfortunately, YARN currently does not have GPUs as a schedulable resource. Instead, YARN schedules generally on CPU and Memory, so in our case, we use Memory as a proxy for GPU. So, in your example, if we assume that your nodes are 64GB nodes with 4 GPUs each, then we’d schedule a GPU by requesting 16GB of memory. And, if this proxy is consistently used, then a node with all four GPUs in use (i.e. 64GB memory) would not be scheduled for any new executors/containers by YARN.
For example, on a node with 4 GPUs and 64 GB of memory, every request for 1 GPU also requests 16 GB of memory, so no node can ever be assigned more than 4 GPU-requesting tasks; see the sketch below.
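A minimal sketch of the proxy request in PySpark (the figures assume the 64 GB / 4 GPU node above):

from pyspark import SparkConf

# Memory stands in for the GPU: 16 GB per executor means at most
# 4 executors (and hence 4 GPU tasks) fit on a 64 GB node.
conf = (SparkConf()
        .set("spark.executor.memory", "16g")
        .set("spark.executor.instances", "4"))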
This approach is very hacky: it does not prevent OOMs caused by GPU contention between tasks, and mixed workloads lead to wasted resources.
Reference: https://github.com/yahoo/TensorFlowOnSpark/issues/185
Use YARN 3.1.0+
Request GPUs by setting
spark.yarn.driver.resource.yarn.io/gpu.amount
spark.yarn.executor.resource.yarn.io/gpu.amount
YARN does not tell Spark which GPUs it assigned to a given container, so you must also set
spark.{driver/executor}.resource.gpu.discoveryScript
to a resource discovery script that the driver/executor runs at startup to find the resources available to it.
An example script:
#!/usr/bin/env bash
# Print the indices of all GPUs on this host as the JSON Spark expects.
ADDRS=`nvidia-smi --query-gpu=index --format=csv,noheader | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/","/g'`
echo {\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}
The script must print JSON in a fixed format, for example
{"name": "gpu", "addresses":["0","1","2","3","4","5","6","7"]}
Use k8s cluster scheduling
k8s automatically sets the NVIDIA_VISIBLE_DEVICES environment variable for every pod, so each pod sees only the GPUs allocated to it and cannot conflict with other pods; a minimal pod spec is sketched below.
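A minimal pod-spec sketch requesting one GPU (assumes the NVIDIA device plugin is deployed on the cluster; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1  # the scheduler places the pod on a node with a free GPU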
Reference: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
NVIDIA Triton
A high-performance inference server.
It lets multiple models share a GPU, but it is for serving only and cannot be used for training.
Triton runs multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding.
Reference: https://developer.nvidia.com/nvidia-triton-inference-server
Summary
GPU scheduling is best delegated to an existing framework such as k8s or YARN (since 3.x).
If only serving is needed, use dedicated servers (with no GPU training jobs) and deploy multiple models with Triton; alternatively, containers + environment variables + tfserving also works.