[k8s device plugin] Writing a Kubernetes device plugin: a worked example

Latest recommended article published 2025-03-01 17:14:30

oceanweave


Original article: https://my.oschina.net/jxcdwangtao/blog/1793656


This post is a repost. An example implementation for reference:

Kubernetes development notes: implementing a device plugin
https://github.com/joyme123/cola-device-plugin

Device Plugins

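Device Plugins were introduced in Kubernetes 1.8 and became a beta feature in Kubernetes 1.10. They let third-party device vendors plug their device resources into Kubernetes and offer them to containers as Extended Resources.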

With Device Plugins, users do not need to modify Kubernetes code: the device vendor develops a plugin that implements the Kubernetes Device Plugin interfaces.

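Device Plugin implementations that currently draw the most attention include:

- The GPU plugin from Nvidia: NVIDIA device plugin for Kubernetes
- A plugin for high-performance, low-latency RDMA cards: RDMA device plugin for Kubernetes
- A driver for low-latency Solarflare 10 GbE NICs: Solarflare Device Plugin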

When a device plugin starts, it exposes several gRPC services and registers itself with kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.

Device Plugins Registration

Before Kubernetes 1.10, the DevicePlugins feature gate is disabled by default and users must enable it explicitly.
In Kubernetes 1.10, DevicePlugins is enabled by default, and users can disable it through the feature gate.
When the DevicePlugins feature gate is enabled, kubelet exposes a Register gRPC interface. A Device Plugin calls Register to register its devices.

	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:440
	type RegistrationServer interface {
		Register(context.Context, *RegisterRequest) (*Empty, error)
	}


	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:87
	type RegisterRequest struct {
		// Version of the API the Device Plugin was built against
		Version string `protobuf:"bytes,1,opt,name=version,proto3" json:"version,omitempty"`
		// Name of the unix socket the device plugin is listening on
		// PATH = path.Join(DevicePluginPath, endpoint)
		Endpoint string `protobuf:"bytes,2,opt,name=endpoint,proto3" json:"endpoint,omitempty"`
		// Schedulable resource name. As of now it's expected to be a DNS Label
		ResourceName string `protobuf:"bytes,3,opt,name=resource_name,json=resourceName,proto3" json:"resource_name,omitempty"`
		// Options to be communicated with Device Manager
		Options *DevicePluginOptions `protobuf:"bytes,4,opt,name=options" json:"options,omitempty"`
	}

RegisterRequest takes the following parameters:
- Version: there are currently two versions, v1alpha and v1beta1.
- Endpoint: the name of the socket the device plugin exposes. The plugin's socket lives under /var/lib/kubelet/device-plugins/ and its path is derived from Endpoint at registration; for example, the Nvidia GPU Device Plugin uses /var/lib/kubelet/device-plugins/nvidia.sock.
- ResourceName: must follow the Extended Resource Naming Scheme, vendor-domain/resource, for example nvidia.com/gpu.
- DevicePluginOptions: extra parameters passed between kubelet and the device plugin.
  - For the Nvidia GPU plugin there is a single option, PreStartRequired, which states whether kubelet must call the plugin's PreStartContainer interface (one of the Device Plugin Interface methods in Kubernetes 1.10) before each container start. It defaults to false.
```
	github.com/NVIDIA/k8s-device-plugin/server.go:80
	func (m *NvidiaDevicePlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
		return &pluginapi.DevicePluginOptions{}, nil
	}

	vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:71
	type DevicePluginOptions struct {
		// Indicates if PreStartContainer call is required before each container start
		PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
	}
```
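As a rough illustration of the RegisterRequest fields, the sketch below derives the socket path from Endpoint and applies a simplified check to ResourceName. The `registerRequest` struct, `socketPath`, and `checkResourceName` are hypothetical local stand-ins rather than the generated API types, and kubelet's real validation is stricter than this check.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// devicePluginPath mirrors the directory named in the text.
const devicePluginPath = "/var/lib/kubelet/device-plugins/"

// registerRequest is a local stand-in for the generated RegisterRequest type.
type registerRequest struct {
	Version      string
	Endpoint     string
	ResourceName string
}

// socketPath applies the rule from the RegisterRequest comment:
// PATH = path.Join(DevicePluginPath, endpoint).
func socketPath(endpoint string) string {
	return path.Join(devicePluginPath, endpoint)
}

// checkResourceName is a simplified, hypothetical check of the
// vendor-domain/resource shape; it is not kubelet's actual validation.
func checkResourceName(name string) error {
	parts := strings.Split(name, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return fmt.Errorf("resource name %q is not of the form vendor-domain/resource", name)
	}
	if parts[0] == "kubernetes.io" || strings.HasSuffix(parts[0], ".kubernetes.io") {
		return fmt.Errorf("resource name %q must not use the kubernetes.io domain", name)
	}
	return nil
}

func main() {
	req := registerRequest{Version: "v1beta1", Endpoint: "nvidia.sock", ResourceName: "nvidia.com/gpu"}
	fmt.Println(req.Version, socketPath(req.Endpoint))
	fmt.Println(checkResourceName(req.ResourceName))
}
```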

As mentioned above, the Device Plugin Interface currently has two versions, v1alpha and v1beta1. Each version exposes the following interfaces:

v1alpha:

/deviceplugin.Registration/Register

	pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:374
	var _Registration_serviceDesc = grpc.ServiceDesc{
		ServiceName: "deviceplugin.Registration",
		HandlerType: (*RegistrationServer)(nil),
		Methods: []grpc.MethodDesc{
			{
				MethodName: "Register",
				Handler:    _Registration_Register_Handler,
			},
		},
		Streams:  []grpc.StreamDesc{},
		Metadata: "api.proto",
	}

/deviceplugin.DevicePlugin/Allocate

/deviceplugin.DevicePlugin/ListAndWatch

	pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:505
	var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
		ServiceName: "deviceplugin.DevicePlugin",
		HandlerType: (*DevicePluginServer)(nil),
		Methods: []grpc.MethodDesc{
			{
				MethodName: "Allocate",
				Handler:    _DevicePlugin_Allocate_Handler,
			},
		},
		Streams: []grpc.StreamDesc{
			{
				StreamName:    "ListAndWatch",
				Handler:       _DevicePlugin_ListAndWatch_Handler,
				ServerStreams: true,
			},
		},
		Metadata: "api.proto",
	}

v1beta1:

/v1beta1.Registration/Register

	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:466
	var _Registration_serviceDesc = grpc.ServiceDesc{
		ServiceName: "v1beta1.Registration",
		HandlerType: (*RegistrationServer)(nil),
		Methods: []grpc.MethodDesc{
			{
				MethodName: "Register",
				Handler:    _Registration_Register_Handler,
			},
		},
		Streams:  []grpc.StreamDesc{},
		Metadata: "api.proto",
	}

/v1beta1.DevicePlugin/ListAndWatch
/v1beta1.DevicePlugin/Allocate
/v1beta1.DevicePlugin/PreStartContainer

/v1beta1.DevicePlugin/GetDevicePluginOptions

	pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:665
	var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
		ServiceName: "v1beta1.DevicePlugin",
		HandlerType: (*DevicePluginServer)(nil),
		Methods: []grpc.MethodDesc{
			{
				MethodName: "GetDevicePluginOptions",
				Handler:    _DevicePlugin_GetDevicePluginOptions_Handler,
			},
			{
				MethodName: "Allocate",
				Handler:    _DevicePlugin_Allocate_Handler,
			},
			{
				MethodName: "PreStartContainer",
				Handler:    _DevicePlugin_PreStartContainer_Handler,
			},
		},
		Streams: []grpc.StreamDesc{
			{
				StreamName:    "ListAndWatch",
				Handler:       _DevicePlugin_ListAndWatch_Handler,
				ServerStreams: true,
			},
		},
		Metadata: "api.proto",
	}

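After a Device Plugin registers successfully, it sends the list of devices it manages to kubelet through ListAndWatch. On receiving the list, kubelet updates the status of the corresponding node, stored in etcd, through the API Server.
Users can then request the device in a container spec, subject to the following restrictions:
- Extended Resources can only be requested in whole numbers of devices; fractional amounts are not supported.
- Overcommit is not supported, so the resource QoS can only be Guaranteed (requests must equal limits).
- A single device cannot be shared among multiple containers.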
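The no-overcommit restriction can be sketched as a small check on one container's spec. This is a simplified model, not API Server code: `effectiveRequest` and its map arguments are hypothetical stand-ins for resources.requests and resources.limits, and quantities are plain integers.

```go
package main

import "fmt"

// effectiveRequest returns the request the scheduler would see for one
// extended resource in a container, or an error if the spec is invalid.
// The maps are a hypothetical stand-in for resources.requests and
// resources.limits, restricted to whole-number quantities.
func effectiveRequest(name string, requests, limits map[string]int64) (int64, error) {
	lim, hasLim := limits[name]
	req, hasReq := requests[name]
	switch {
	case !hasLim && !hasReq:
		return 0, nil // the container does not use this resource
	case !hasLim:
		return 0, fmt.Errorf("%s: requests set without limits", name)
	case !hasReq:
		return lim, nil // requests defaults to limits
	case req != lim:
		return 0, fmt.Errorf("%s: requests (%d) must equal limits (%d)", name, req, lim)
	default:
		return req, nil
	}
}

func main() {
	gpu := "nvidia.com/gpu"
	fmt.Println(effectiveRequest(gpu, nil, map[string]int64{gpu: 2}))
	fmt.Println(effectiveRequest(gpu, map[string]int64{gpu: 1}, map[string]int64{gpu: 2}))
}
```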

Device Plugins Workflow

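The Device Plugins workflow is as follows:

Initialization: after it starts, the Device Plugin performs plugin-specific initialization to make sure its devices are in the Ready state. For Nvidia GPUs, this means loading the NVML library.

Starting the gRPC service: the plugin exposes its gRPC service through /var/lib/kubelet/device-plugins/${Endpoint}.sock. Different API versions expose different service interfaces, as noted earlier; each interface is described below.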

v1alpha:

ListAndWatch

Allocate

	pkg/kubelet/apis/deviceplugin/v1alpha/api.proto
	// DevicePlugin is the service advertised by Device Plugins
	service DevicePlugin {
		// ListAndWatch returns a stream of List of Devices
		// Whenever a Device state changes or a Device disappears, ListAndWatch
		// returns the new list
		rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

		// Allocate is called during container creation so that the Device
		// Plugin can run device specific operations and instruct Kubelet
		// of the steps to make the Device available in the container
		rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
	}

v1beta1:

ListAndWatch
Allocate
GetDevicePluginOptions

PreStartContainer

	pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
	// DevicePlugin is the service advertised by Device Plugins
	service DevicePlugin {
		// GetDevicePluginOptions returns options to be communicated with Device
	        // Manager
		rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

		// ListAndWatch returns a stream of List of Devices
		// Whenever a Device state change or a Device disapears, ListAndWatch
		// returns the new list
		rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

		// Allocate is called during container creation so that the Device
		// Plugin can run device specific operations and instruct Kubelet
		// of the steps to make the Device available in the container
		rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

		// PreStartContainer is called, if indicated by Device Plugin during registration phase,
		// before each container start. Device plugin can run device specific operations
		// such as resetting the device before making devices available to the container.
		// In other words: preparatory work done before the device is made available.
		rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
	}
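Serving these interfaces starts with a listener on the plugin's own Unix socket. The sketch below shows only that step with the standard library, assuming a stale socket file may be left over from a previous run; it uses a temporary directory and a hypothetical endpoint name (`example.sock`) instead of /var/lib/kubelet/device-plugins, and it leaves out the gRPC server that a real plugin would attach to the listener.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// listenOnSocket removes a stale socket file, if any, and then listens on
// the given path. A real plugin would hand the listener to a gRPC server.
func listenOnSocket(socketPath string) (net.Listener, error) {
	if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return net.Listen("unix", socketPath)
}

// demo runs the sketch against a temporary directory so that it needs no
// special privileges.
func demo() error {
	dir, err := os.MkdirTemp("", "dp")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "example.sock") // hypothetical endpoint name

	// Simulate a stale file left behind by a previous run.
	if err := os.WriteFile(sock, nil, 0o600); err != nil {
		return err
	}

	l, err := listenOnSocket(sock)
	if err != nil {
		return err
	}
	defer l.Close()

	// A client (kubelet, in the real flow) can now dial the socket.
	c, err := net.Dial("unix", sock)
	if err != nil {
		return err
	}
	return c.Close()
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("listening and dialing succeeded")
}
```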

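The Device Plugin registers with kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.

After a successful registration, the Device Plugin enters Serving mode and serves the gRPC interfaces listed above. Each v1beta1 interface is analyzed below:

ListAndWatch: monitors state changes and disappearance events of the plugin's devices and returns a ListAndWatchResponse, which is simply the device list, to kubelet.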

	type ListAndWatchResponse struct {
		Devices []*Device `protobuf:"bytes,1,rep,name=devices" json:"devices,omitempty"`
	}

	type Device struct {
		// A unique ID assigned by the device plugin used
		// to identify devices during the communication
		// Max length of this field is 63 characters
		ID string `protobuf:"bytes,1,opt,name=ID,json=iD,proto3" json:"ID,omitempty"`
		// Health of the device, can be healthy or unhealthy, see constants.go
		Health string `protobuf:"bytes,2,opt,name=health,proto3" json:"health,omitempty"`
	}

Here is a sample Device value for a GPU:

Device{
    ID:     "GPU-fef8089b-4820-abfc-e83e-94318197576e",
    Health: "Healthy",
}
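The streaming behaviour of ListAndWatch can be modelled without gRPC. In the sketch below, the `device` type, the `updates` channel, and the `send` callback are hypothetical stand-ins for the generated types and the server stream: the full list is sent once, then resent after every health change.

```go
package main

import "fmt"

// device is a local stand-in for the generated Device type.
type device struct {
	ID     string
	Health string
}

// listAndWatch sends the full device list once, then resends it after every
// health change read from updates. It returns when updates is closed or when
// send reports an error.
func listAndWatch(devs []device, updates <-chan device, send func([]device) error) error {
	// snapshot copies the list so earlier responses are not mutated later.
	snapshot := func() []device { return append([]device(nil), devs...) }

	if err := send(snapshot()); err != nil {
		return err
	}
	for u := range updates {
		for i := range devs {
			if devs[i].ID == u.ID {
				devs[i].Health = u.Health
			}
		}
		if err := send(snapshot()); err != nil {
			return err
		}
	}
	return nil
}

// run drives listAndWatch with one health change and collects the responses.
func run() [][]device {
	devs := []device{{ID: "GPU-0", Health: "Healthy"}, {ID: "GPU-1", Health: "Healthy"}}
	updates := make(chan device, 1)
	updates <- device{ID: "GPU-1", Health: "Unhealthy"}
	close(updates)

	var sent [][]device
	_ = listAndWatch(devs, updates, func(l []device) error {
		sent = append(sent, l)
		return nil
	})
	return sent
}

func main() {
	for i, l := range run() {
		fmt.Printf("response %d: %v\n", i, l)
	}
}
```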

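Allocate: the Device Plugin performs device-specific operations and returns an AllocateResponse to kubelet. kubelet passes it on to dockerd, which (by calling nvidia-docker) uses it to assign the devices when creating the container. The request and response for this interface are described below.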

Allocate is expected to be called during pod creation since allocation failures for any container would result in pod startup failure.
Allocate allows kubelet to expose additional artifacts in a pod’s environment as directed by the plugin.

Allocate allows Device Plugin to run device specific operations on the Devices requested

	type AllocateRequest struct {
		ContainerRequests []*ContainerAllocateRequest `protobuf:"bytes,1,rep,name=container_requests,json=containerRequests" json:"container_requests,omitempty"`
	}

	type ContainerAllocateRequest struct {
		DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
	}

	// AllocateResponse includes the artifacts that needs to be injected into
	// a container for accessing 'deviceIDs' that were mentioned as part of
	// 'AllocateRequest'.
	// Failure Handling:
	// if Kubelet sends an allocation request for dev1 and dev2.
	// Allocation on dev1 succeeds but allocation on dev2 fails.
	// The Device plugin should send a ListAndWatch update and fail the
	// Allocation request
	type AllocateResponse struct {
		ContainerResponses []*ContainerAllocateResponse `protobuf:"bytes,1,rep,name=container_responses,json=containerResponses" json:"container_responses,omitempty"`
	}

	type ContainerAllocateResponse struct {
		// List of environment variable to be set in the container to access one of more devices.
		Envs map[string]string `protobuf:"bytes,1,rep,name=envs" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
		// Mounts for the container.
		Mounts []*Mount `protobuf:"bytes,2,rep,name=mounts" json:"mounts,omitempty"`
		// Devices for the container.
		Devices []*DeviceSpec `protobuf:"bytes,3,rep,name=devices" json:"devices,omitempty"`
		// Container annotations to pass to the container runtime
		Annotations map[string]string `protobuf:"bytes,4,rep,name=annotations" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	}

	// DeviceSpec specifies a host device to mount into a container.
	type DeviceSpec struct {
		// Path of the device within the container.
		ContainerPath string `protobuf:"bytes,1,opt,name=container_path,json=containerPath,proto3" json:"container_path,omitempty"`
		// Path of the device on the host.
		HostPath string `protobuf:"bytes,2,opt,name=host_path,json=hostPath,proto3" json:"host_path,omitempty"`
		// Cgroups permissions of the device, candidates are one or more of
		// * r - allows container to read from the specified device.
		// * w - allows container to write to the specified device.
		// * m - allows container to create device files that do not yet exist.
		Permissions string `protobuf:"bytes,3,opt,name=permissions,proto3" json:"permissions,omitempty"`
	}

AllocateRequest is simply a list of device IDs.
AllocateResponse carries what must be injected into the container: Envs, the device mount information (including the device's cgroup permissions), and custom Annotations.
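A minimal sketch of an Allocate handler follows, assuming the plugin exposes devices to the runtime through an environment variable. The local structs are stand-ins for the generated types, the `known` set is hypothetical, and the variable name NVIDIA_VISIBLE_DEVICES is an assumption modelled on the NVIDIA container runtime convention.

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins for the generated request and response types.
type containerAllocateRequest struct {
	DevicesIDs []string
}

type containerAllocateResponse struct {
	Envs map[string]string
}

// allocate builds one response per container request. It fails the whole
// call if any requested device ID is not in the known set.
func allocate(known map[string]bool, reqs []containerAllocateRequest) ([]containerAllocateResponse, error) {
	resps := make([]containerAllocateResponse, 0, len(reqs))
	for _, r := range reqs {
		for _, id := range r.DevicesIDs {
			if !known[id] {
				return nil, fmt.Errorf("unknown device %q", id)
			}
		}
		resps = append(resps, containerAllocateResponse{
			// Assumed env var name; the runtime reads it to pick devices.
			Envs: map[string]string{"NVIDIA_VISIBLE_DEVICES": strings.Join(r.DevicesIDs, ",")},
		})
	}
	return resps, nil
}

func main() {
	known := map[string]bool{"GPU-a": true, "GPU-b": true}
	resps, err := allocate(known, []containerAllocateRequest{{DevicesIDs: []string{"GPU-a", "GPU-b"}}})
	fmt.Println(resps, err)
}
```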

PreStartContainer:
- PreStartContainer is expected to be called before each container start if indicated by plugin during registration phase.
- PreStartContainer allows kubelet to pass reinitialized devices to containers.
- PreStartContainer allows Device Plugin to run device specific operations on the Devices requested.
```
	type PreStartContainerRequest struct {
		DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
	}

	// PreStartContainerResponse will be send by plugin in response to PreStartContainerRequest
	type PreStartContainerResponse struct {
	}
```

GetDevicePluginOptions: currently returns a single field, PreStartRequired.

type DevicePluginOptions struct {
	// Indicates if PreStartContainer call is required before each container start
	PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
}
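How PreStartRequired gates the PreStartContainer call can be sketched from the caller's side. This is a simplified model rather than kubelet code: the `plugin` interface, `fakePlugin`, and `beforeContainerStart` are hypothetical names, and the options are queried on every call here only to keep the example short.

```go
package main

import "fmt"

// options is a local stand-in for DevicePluginOptions.
type options struct {
	PreStartRequired bool
}

// plugin models the two calls involved in the decision.
type plugin interface {
	GetDevicePluginOptions() options
	PreStartContainer(deviceIDs []string) error
}

// fakePlugin counts PreStartContainer calls so the behaviour is observable.
type fakePlugin struct {
	opts          options
	preStartCalls int
}

func (p *fakePlugin) GetDevicePluginOptions() options { return p.opts }

func (p *fakePlugin) PreStartContainer(deviceIDs []string) error {
	p.preStartCalls++
	return nil
}

// beforeContainerStart calls PreStartContainer only when the plugin asked
// for it through PreStartRequired.
func beforeContainerStart(p plugin, deviceIDs []string) error {
	if !p.GetDevicePluginOptions().PreStartRequired {
		return nil
	}
	return p.PreStartContainer(deviceIDs)
}

func main() {
	withPreStart := &fakePlugin{opts: options{PreStartRequired: true}}
	withoutPreStart := &fakePlugin{}
	_ = beforeContainerStart(withPreStart, []string{"dev0"})
	_ = beforeContainerStart(withoutPreStart, []string{"dev0"})
	fmt.Println(withPreStart.preStartCalls, withoutPreStart.preStartCalls)
}
```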

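Failure handling

Every time kubelet starts or restarts, it deletes all socket files under /var/lib/kubelet/device-plugins.
A Device Plugin is responsible for detecting that its socket has been deleted, then re-registering and recreating its socket.
But what should a Device Plugin do when its own socket is deleted by mistake?

Let's look at how the Nvidia Device Plugin handles this. The relevant code is: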

github.com/NVIDIA/k8s-device-plugin/main.go:15

func main() {
	...
	
	log.Println("Starting FS watcher.")
	watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
	
    ...

	restart := true
	var devicePlugin *NvidiaDevicePlugin

L:
	for {
		if restart {
			if devicePlugin != nil {
				devicePlugin.Stop()
			}

			devicePlugin = NewNvidiaDevicePlugin()
			if err := devicePlugin.Serve(); err != nil {
				log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
				log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
				log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
			} else {
				restart = false
			}
		}

		select {
		case event := <-watcher.Events:
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
				restart = true
			}

		case err := <-watcher.Errors:
			log.Printf("inotify: %s", err)

		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				log.Println("Received SIGHUP, restarting.")
				restart = true
			default:
				log.Printf("Received signal \"%v\", shutting down.", s)
				devicePlugin.Stop()
				break L
			}
		}
	}
}

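The plugin uses fsnotify.Watcher to monitor the /var/lib/kubelet/device-plugins/ directory.
If the watcher's Events channel receives a Create event for kubelet.sock (meaning kubelet has restarted), the Nvidia Device Plugin restarts.
The restart logic first checks whether the devicePlugin object is non-nil (meaning the Nvidia Device Plugin has already been initialized):
- If it is non-nil, the plugin first stops its gRPC server.
- It then calls NewNvidiaDevicePlugin() to build a new DevicePlugin instance.
- It calls Serve() to start the gRPC server and register itself with kubelet.

So the code only watches for the Create event on kubelet.sock. That handles kubelet restarts well, but it does not watch for deletion of the plugin's own socket. If the Nvidia Device Plugin's socket is deleted by mistake, kubelet can no longer talk to the plugin on that node over the socket, and none of the plugin's gRPC interfaces can be reached:

- ListAndWatch cannot report the node's device list and health, so device information is no longer synchronized.
- Allocate cannot be called, so container creation fails.

The recommendation is therefore to also watch for the deletion of the plugin's own socket and trigger a restart as soon as a deletion is detected.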

select {
    case event := <-watcher.Events:
    	if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
    		log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
    		restart = true
    	}
    	
    	// Also watch for removal of the plugin's own socket (nvidia.sock).
    	// fsnotify reports deletions with the Remove op.
    	if event.Name == serverSocket && event.Op&fsnotify.Remove == fsnotify.Remove {
    		log.Printf("inotify: %s deleted, restarting.", serverSocket)
    		restart = true
    	}
    	
    	...
}
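The restart decision can also be isolated into a pure function, which makes it easy to test without a real watcher. In this sketch the `op` type and its constants are local stand-ins for the watcher's event bitmask, and `shouldRestart` and the socket paths are hypothetical names.

```go
package main

import "fmt"

// op is a local stand-in for the file watcher's event bitmask.
type op uint32

const (
	opCreate op = 1 << iota
	opWrite
	opRemove
)

// shouldRestart reports whether the plugin should restart and re-register:
// either kubelet.sock was created (kubelet restarted) or the plugin's own
// socket was removed.
func shouldRestart(name string, o op, kubeletSock, pluginSock string) bool {
	if name == kubeletSock && o&opCreate == opCreate {
		return true
	}
	if name == pluginSock && o&opRemove == opRemove {
		return true
	}
	return false
}

func main() {
	const (
		kubeletSock = "/var/lib/kubelet/device-plugins/kubelet.sock"
		pluginSock  = "/var/lib/kubelet/device-plugins/nvidia.sock"
	)
	fmt.Println(shouldRestart(kubeletSock, opCreate, kubeletSock, pluginSock))
	fmt.Println(shouldRestart(pluginSock, opRemove, kubeletSock, pluginSock))
	fmt.Println(shouldRestart(pluginSock, opWrite, kubeletSock, pluginSock))
}
```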

Extended Resources

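A Device Plugin exposes host resources through Extended Resources. Kubernetes' built-in resources all belong to the kubernetes.io domain, so Extended Resources may not be advertised under kubernetes.io.
Node-level Extended Resources
- Resources managed by a device plugin
- Other resources
  - Submit a PATCH request to the API Server to add the new resource name and quantity to the node's status.capacity.
  - kubelet periodically reports the node's status.allocatable to the API Server, which includes the resource added by the earlier PATCH. Pods that request the new resource are then filtered onto nodes by the scheduler's FitResources predicate, based on node status.allocatable.
  - Note: the interval is set with kubelet's --node-status-update-frequency flag and defaults to 10s. After you submit the PATCH, in the worst case it can take about 10s before the scheduler sees the resource and can use it.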
```
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "5"}]' \
http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
```
Note: ~1 is the encoding for the character / in the patch path.
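The escaping follows the JSON Pointer rules (RFC 6901): `~` becomes `~0` and `/` becomes `~1`. As a small sketch, the helper below builds the same patch body as the curl command; `capacityPatch` and `patchOp` are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// pointerEscaper applies JSON Pointer escaping in a single pass, so a "~"
// produced by escaping "/" is never escaped again.
var pointerEscaper = strings.NewReplacer("~", "~0", "/", "~1")

// capacityPatch builds a JSON Patch body that adds an extended resource to
// a node's status.capacity.
func capacityPatch(resource, quantity string) (string, error) {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/status/capacity/" + pointerEscaper.Replace(resource),
		Value: quantity,
	}}
	b, err := json.Marshal(ops)
	return string(b), err
}

func main() {
	body, err := capacityPatch("example.com/foo", "5")
	fmt.Println(body, err)
}
```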
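Cluster-level Extended Resources
- Cluster-level Extended Resources are usually consumed by a scheduler extender, which uses them for resource quota management.
- Only when a Pod requests such an extended resource does the default scheduler send the Pod to the corresponding scheduler extender for a second scheduling pass.
- If the ignoredByScheduler field is set to true, the default scheduler skips the PodFitsResources predicate for this resource. It is usually set to true, because a cluster-level resource is not tied to any node, so checking it against node resources with PodFitsResources is not appropriate.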
```
{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix":"<extender-endpoint>",
      "bindVerb": "bind",
      "ManagedResources": [
        {
          "name": "example.com/foo",
          "ignoredByScheduler": true
        }
      ]
    }
  ]
}
```
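The API Server only accepts whole-number quantities for Extended Resources, for example 2, 2000m, or 2Ki; 1.5 and 1500m are rejected.
If a container's resources field configures only Extended Resources, it must be Guaranteed QoS: either only limits is set explicitly (requests then defaults to limits), or requests and limits are set explicitly to the same value.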
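The whole-number rule is easiest to see with exact arithmetic: 2000m is 2000/1000 = 2, while 1500m is 3/2. The sketch below is a simplified checker, not the Kubernetes quantity parser; `isWholeQuantity` is a hypothetical helper that only understands plain decimals and the suffixes Ki, Mi, and m.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// suffixes lists the few quantity suffixes this sketch understands.
var suffixes = []struct {
	s    string
	mult *big.Rat
}{
	{"Ki", big.NewRat(1024, 1)},
	{"Mi", big.NewRat(1024*1024, 1)},
	{"m", big.NewRat(1, 1000)},
}

// isWholeQuantity reports whether the quantity evaluates to an integer.
// It uses exact rational arithmetic to avoid floating-point rounding.
func isWholeQuantity(q string) (bool, error) {
	mult := big.NewRat(1, 1)
	num := q
	for _, sf := range suffixes {
		if strings.HasSuffix(q, sf.s) {
			mult, num = sf.mult, strings.TrimSuffix(q, sf.s)
			break
		}
	}
	v, ok := new(big.Rat).SetString(num)
	if !ok {
		return false, fmt.Errorf("cannot parse quantity %q", q)
	}
	return v.Mul(v, mult).IsInt(), nil
}

func main() {
	for _, q := range []string{"2", "2000m", "2Ki", "1.5", "1500m"} {
		whole, err := isWholeQuantity(q)
		fmt.Println(q, whole, err)
	}
}
```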

Scheduling GPUs

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

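Here we only discuss how to schedule and use GPUs in Kubernetes 1.10.

Before Kubernetes 1.8, the official recommendation was to enable the alpha feature gate Accelerators and request the resource alpha.kubernetes.io/nvidia-gpu to use GPUs, which also required mounting the host's nvidia libraries and driver into the container. For that approach, see my earlier post, "How to use GPUs for AI training in a Kubernetes cluster".

Starting with Kubernetes 1.8, the official recommendation is to use GPUs through Device Plugins.
The NVIDIA driver must be pre-installed on the node, and deploying the NVIDIA Device Plugin as a DaemonSet is recommended; only then can Kubernetes discover nvidia.com/gpu.
Because the device plugin exposes GPUs as extended resources, a container requesting GPUs must use Guaranteed resource QoS.
Containers still cannot share a single GPU card. A container can request multiple GPU cards, but GPU fractions are not supported.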

Besides the points above, using the official nvidia driver also requires the following:

nvidia docker 2.0 must be pre-installed on the node, with nvidia docker replacing runC as Docker's default runtime.

On CentOS, install nvidia docker 2.0 as follows:

	# Add the package repositories
	distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
	curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
	  sudo tee /etc/yum.repos.d/nvidia-docker.repo

	# Install nvidia-docker2 and reload the Docker daemon configuration
	sudo yum install -y nvidia-docker2
	sudo pkill -SIGHUP dockerd

	# Test nvidia-smi with the latest official CUDA image
	docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Once all of this is done, a container can request GPU resources just like built-in resources:

	apiVersion: v1
	kind: Pod
	metadata:
	  name: cuda-vector-add
	spec:
	  restartPolicy: OnFailure
	  containers:
	    - name: cuda-vector-add
	      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
	      image: "k8s.gcr.io/cuda-vector-add:v0.1"
	      resources:
	        limits:
	          nvidia.com/gpu: 2 # requesting 2 GPU

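Using NodeSelector to distinguish GPU server models

If your cluster contains GPU servers of different models, such as nvidia tesla k80, p100, and v100, and different training jobs need specific GPU models, first label the nodes accordingly: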

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100

In the Pod, use NodeSelector to select the GPU model:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.

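A further thought: NodeSelector alone does not solve this problem well, because every pod must then carry the matching NodeSelector. Expensive, scarce GPU cards such as the V100 often must also be kept away from other training jobs and reserved for specific algorithm training. In that case, taint the node and add the matching toleration to the pods that need it, which satisfies the requirement completely.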
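To illustrate why a taint keeps other jobs off such nodes, here is a simplified model of toleration matching. It is a sketch rather than scheduler code: the `taint` and `toleration` structs are local stand-ins, and the taint key `accelerator` with value `nvidia-tesla-v100` is a hypothetical choice.

```go
package main

import "fmt"

// Local stand-ins for the Kubernetes taint and toleration types.
type taint struct{ Key, Value, Effect string }

type toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether one toleration matches one taint. An empty
// effect or key on the toleration acts as a wildcard.
func tolerates(tol toleration, t taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	default:
		return false
	}
}

// schedulable reports whether a pod with the given tolerations may be placed
// on a node with the given taints. PreferNoSchedule taints never block.
func schedulable(taints []taint, tols []toleration) bool {
	for _, t := range taints {
		if t.Effect == "PreferNoSchedule" {
			continue
		}
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	v100 := []taint{{Key: "accelerator", Value: "nvidia-tesla-v100", Effect: "NoSchedule"}}
	allowed := []toleration{{Key: "accelerator", Operator: "Equal", Value: "nvidia-tesla-v100", Effect: "NoSchedule"}}
	fmt.Println(schedulable(v100, nil), schedulable(v100, allowed))
}
```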

Deploy

Deploying the Device Plugin as a DaemonSet is recommended, since it makes failover easy.
The Device Plugin Pod must run privileged to access /var/lib/kubelet/device-plugins.
The Device Plugin Pod must mount the host path /var/lib/kubelet/device-plugins into the container at the same path.

kubernetes 1.8

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvidia/k8s-device-plugin:1.8
        name: nvidia-device-plugin-ctr
        securityContext:
          privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

kubernetes 1.10

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.10
        name: nvidia-device-plugin-ctr
        securityContext:
          privileged: true
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

How Kubernetes handles critical pods is getting more and more interesting; I will find time to cover it in detail in a separate post.

Device Plugins architecture diagram

[Figure: Device Plugins architecture diagram]

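Summary

A few months ago, my post "How to use GPUs for AI training in a Kubernetes cluster" analyzed how to use GPUs in Kubernetes 1.8. In Kubernetes 1.10, Device Plugins are the recommended way to use GPUs. This post covered how Device Plugins work, introduced Extended Resources, examined the Nvidia Device Plugin's failure handling and a possible improvement, and showed how to use and schedule GPUs. In the next post, I will walk through the source code of NVIDIA/k8s-device-plugin and the kubelet device plugin manager to look more closely at how kubelet and the nvidia device plugin interact.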