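Kubernetes Scheduler: How It Works
The scheduler binds each pending Pod to a suitable Node in the cluster according to the configured scheduling algorithm and policies, and the binding is written to etcd. The kubelet on the target node watches the API Server, observes the Pod binding event produced by the Kubernetes Scheduler, fetches the corresponding Pod manifest, and pulls the container images.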
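Scheduling involves three objects:
- the list of Pods waiting to be scheduled
- the list of available Nodes
- the scheduling algorithm and scheduling policies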
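The default scheduling flow has two steps:
1. Filtering (predicates): iterate over all target Nodes and keep the candidates that meet the Pod's requirements. k8s ships with a number of built-in predicate policies.
2. Scoring (priorities): the k8s scheduling algorithm is greedy. It applies the priority policies to compute a score for each candidate node and selects the node with the highest score.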
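The two phases can be sketched in a few lines of Go. This is a minimal illustration, not the scheduler's own code: the `Node` and `Pod` types, the resource-only filter, and the "most CPU left over" score are all assumptions made for the example.

```go
package main

import "fmt"

// Node is a hypothetical, simplified view of a cluster node.
type Node struct {
	Name    string
	FreeCPU int64 // millicores still allocatable
	FreeMem int64 // bytes still allocatable
}

// Pod is a hypothetical, simplified pod that carries only resource requests.
type Pod struct {
	Name   string
	CPUReq int64
	MemReq int64
}

// filter is the filtering phase: keep only nodes that can hold the pod.
func filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU >= pod.CPUReq && n.FreeMem >= pod.MemReq {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score is an illustrative scoring function: the more CPU left after
// placing the pod, the higher the score.
func score(pod Pod, n Node) int64 {
	return n.FreeCPU - pod.CPUReq
}

// pickNode runs both phases and greedily returns the highest-scoring node.
// The boolean is false when no node passes filtering.
func pickNode(pod Pod, nodes []Node) (Node, bool) {
	feasible := filter(pod, nodes)
	if len(feasible) == 0 {
		return Node{}, false
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(pod, n) > score(pod, best) {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []Node{
		{"node-a", 500, 1 << 30},
		{"node-b", 2000, 4 << 30},
		{"node-c", 4000, 256 << 20},
	}
	pod := Pod{"web", 1000, 512 << 20}
	if n, ok := pickNode(pod, nodes); ok {
		fmt.Println("selected:", n.Name)
	}
}
```

The choice is greedy in the sense that each Pod is placed on the best node at that moment, with no attempt to optimize the placement of later Pods.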
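Scheduling flow
1. sched.NextPod() takes the highest-priority pending Pod from the priority queue. If no Pod is available, the call blocks.
2. sched.Algorithm.Schedule runs the Predicates (filtering) and Priorities (scoring) algorithms and picks a suitable node.
3. If no suitable node is found, the scheduler calls prof.RunPostFilterPlugins to try to preempt lower-priority Pods on some node.
4. Once the scheduler has chosen a suitable node for the Pod, sched.bind binds the Pod to that node.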
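The control flow of one scheduling cycle can be sketched as follows. `miniScheduler`, its function fields, and `errNoFit` are hypothetical stand-ins for the four steps above; they are not the real kube-scheduler types or signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoFit is a hypothetical error meaning that no node passed filtering.
var errNoFit = errors.New("no feasible node")

// miniScheduler is a hypothetical stand-in for the real Scheduler;
// each field models one of the four steps.
type miniScheduler struct {
	nextPod    func() string                    // step 1: blocks until a pod is available
	schedule   func(pod string) (string, error) // step 2: filtering + scoring
	postFilter func(pod string)                 // step 3: preemption attempt
	bind       func(pod, node string) error     // step 4: binding
}

// scheduleOne runs a single scheduling cycle.
// It returns the chosen node name, or "" when the pod was not bound.
func (s *miniScheduler) scheduleOne() string {
	pod := s.nextPod()
	node, err := s.schedule(pod)
	if err != nil {
		if errors.Is(err, errNoFit) {
			s.postFilter(pod) // try to make room by preempting
		}
		return ""
	}
	if err := s.bind(pod, node); err != nil {
		return ""
	}
	return node
}

func main() {
	queue := make(chan string, 2)
	queue <- "pod-a"
	queue <- "pod-b"
	bound := map[string]string{}

	s := &miniScheduler{
		nextPod: func() string { return <-queue },
		schedule: func(pod string) (string, error) {
			if pod == "pod-b" {
				return "", errNoFit
			}
			return "node-1", nil
		},
		postFilter: func(pod string) { fmt.Println("trying preemption for", pod) },
		bind: func(pod, node string) error {
			bound[pod] = node
			return nil
		},
	}

	s.scheduleOne() // pod-a is bound
	s.scheduleOne() // pod-b triggers the preemption attempt
	fmt.Println(bound)
}
```

Reading from the channel in `nextPod` reproduces the blocking behaviour of step 1: the loop simply waits when nothing is pending.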
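Predicate (filtering) policies
The criteria are disks, memory, CPU, labels, nodes, and ports. Each criterion maps to a predicate policy:
(1) NoDiskConflict: checks whether the Pod's volumes conflict with disks already in use on the node
(2) PodFitsResources: checks whether the node can satisfy the Pod's requests; this covers not only CPU and memory but any resource the Pod asks for
(3) PodSelectorMatches: checks whether the node matches the nodeSelector specified on the Pod
(4) PodFitsHost: checks whether the nodeName specified on the Pod equals the candidate node's name
(5) CheckNodeLabelPresence: checks whether the specified labels are present on the candidate node
(6) CheckServiceAffinity: affinity and anti-affinity scheduling
(7) PodFitsPorts: checks whether the ports the Pod needs are already in use on the candidate node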
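As an illustration of how such predicates compose, the sketch below implements simplified versions of the resource and port checks. `NodeState`, `PodSpec`, and the `Predicate` function type are assumptions made for this example; the real predicates work on the scheduler's NodeInfo and full Pod objects.

```go
package main

import "fmt"

// NodeState is a hypothetical summary of what is already placed on a node.
type NodeState struct {
	AllocatableCPU, RequestedCPU int64 // millicores
	AllocatableMem, RequestedMem int64 // bytes
	UsedHostPorts                map[int]bool
}

// PodSpec is a hypothetical, simplified pod.
type PodSpec struct {
	CPU, Mem  int64
	HostPorts []int
}

// Predicate reports whether the pod may run on the node.
type Predicate func(PodSpec, NodeState) bool

// podFitsResources passes when the pod's requests fit in the remaining capacity.
func podFitsResources(p PodSpec, n NodeState) bool {
	return n.RequestedCPU+p.CPU <= n.AllocatableCPU &&
		n.RequestedMem+p.Mem <= n.AllocatableMem
}

// podFitsPorts passes when none of the pod's host ports is already taken.
func podFitsPorts(p PodSpec, n NodeState) bool {
	for _, port := range p.HostPorts {
		if n.UsedHostPorts[port] {
			return false
		}
	}
	return true
}

// fits returns true only when every predicate passes.
func fits(p PodSpec, n NodeState, preds ...Predicate) bool {
	for _, pred := range preds {
		if !pred(p, n) {
			return false
		}
	}
	return true
}

func main() {
	node := NodeState{
		AllocatableCPU: 4000, RequestedCPU: 3000,
		AllocatableMem: 8 << 30, RequestedMem: 2 << 30,
		UsedHostPorts: map[int]bool{8080: true},
	}
	pod := PodSpec{CPU: 500, Mem: 1 << 30, HostPorts: []int{8080}}

	fmt.Println("resources ok:", podFitsResources(pod, node))
	fmt.Println("ports ok:", podFitsPorts(pod, node))
	fmt.Println("schedulable:", fits(pod, node, podFitsResources, podFitsPorts))
}
```

A node is a candidate only if all predicates pass, so a single failing check (here the occupied host port) is enough to exclude it.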
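Priority (scoring) policies
The priority policies include:
- LeastRequestedPriority: selects the candidate node with the lowest resource consumption. The node score is computed from the amount of memory and CPU still available on the node.
- CalculateNodeLabelPriority: decides whether to prefer a node depending on whether the labels listed in the policy are present on it. A candidate node that carries a listed label gets score=10; otherwise score=0.
- BalancedResourceAllocation: selects the candidate node whose resource utilization is the most balanced. Only two resources are considered: cpu and memory.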
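The three scores can be sketched as below. The formulas follow the commonly described form of these legacy policies (unused fraction scaled to 0..10, and 10 minus the scaled gap between CPU and memory utilization); treat them as an approximation rather than the exact upstream code. All function names are made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

const maxScore = 10

// unusedScore maps the unused fraction of one resource to the range 0..10.
func unusedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * maxScore / capacity
}

// leastRequestedScore averages the unused-resource score of CPU and memory.
func leastRequestedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (unusedScore(cpuReq, cpuCap) + unusedScore(memReq, memCap)) / 2
}

// balancedScore is highest when CPU and memory utilization are equal.
func balancedScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	if cpuCap == 0 || memCap == 0 {
		return 0
	}
	cpuFrac := float64(cpuReq) / float64(cpuCap)
	memFrac := float64(memReq) / float64(memCap)
	if cpuFrac >= 1 || memFrac >= 1 {
		return 0 // a fully used resource is never "balanced"
	}
	return int64((1 - math.Abs(cpuFrac-memFrac)) * maxScore)
}

// nodeLabelScore gives 10 when the label's presence matches what the policy wants.
func nodeLabelScore(nodeLabels map[string]string, label string, wantPresent bool) int64 {
	_, present := nodeLabels[label]
	if present == wantPresent {
		return maxScore
	}
	return 0
}

func main() {
	// Hypothetical node: 4000 millicores and 8 GiB; values are requests after placement.
	fmt.Println("least requested:", leastRequestedScore(2000, 4000, 4<<30, 8<<30))
	fmt.Println("balanced:", balancedScore(1000, 4000, 6<<30, 8<<30))
	fmt.Println("label:", nodeLabelScore(map[string]string{"disk": "ssd"}, "disk", true))
}
```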
Source code analysis
- Start with the Scheduler data structure:
type Scheduler struct {
    // It is expected that changes made via SchedulerCache will be observed
    // by NodeLister and Algorithm.
    SchedulerCache internalcache.Cache

    Algorithm core.ScheduleAlgorithm

    // NextPod should be a function that blocks until the next pod
    // is available. We don't use a channel for this, because scheduling
    // a pod may take some amount of time and we don't want pods to get
    // stale while they sit in a channel.
    NextPod func() *framework.QueuedPodInfo

    // Error is called if there is an error. It is passed the pod in
    // question, and the error
    Error func(*framework.QueuedPodInfo, error)

    // Close this to shut down the scheduler.
    StopEverything <-chan struct{}

    // SchedulingQueue holds pods to be scheduled
    SchedulingQueue internalqueue.SchedulingQueue

    // Profiles are the scheduling profiles.
    Profiles profile.Map

    client clientset.Interface
}
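The Scheduler struct mainly consists of the scheduling cache, the scheduling queue, the scheduling algorithm, and a clientset.
- The scheduling cache exists mainly so that the scheduler does not have to fetch NodeInfo on every scheduling attempt. Its structure is: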
type schedulerCache struct {
    stop   <-chan struct{}
    ttl    time.Duration
    period time.Duration

    // This mutex guards all fields within this cache struct.
    mu sync.RWMutex
    // a set of assumed pod keys.
    // The key could further be used to get an entry in podStates.
    assumedPods map[string]bool
    // a map from pod key to podState.
    podStates map[string]*podState
    nodes     map[string]*nodeInfoListItem
    // headNode points to the most recently updated NodeInfo in "nodes". It is the
    // head of the linked list.
    headNode *nodeInfoListItem
    nodeTree *nodeTree
    // A map from image name to its imageState.
    imageStates map[string]*imageState
}
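The cache package has three main parts: cache, node_tree (the algorithm that spreads nodes across zones), and snapshot.
A source-level walkthrough of the scheduler cache is available here; that article explains it very clearly.
The Cache interface stores the data fetched from the apiserver and supplies Node information to the scheduler; the scheduling algorithm then decides which node each Pod finally lands on. The Snapshot mechanism and the node-spreading algorithm are well worth studying.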
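One reason the snapshot design is worth studying is that it can be refreshed incrementally. The sketch below shows the idea suggested by the headNode comment in the struct above: every update moves a node to the head of a linked list and stamps it with a growing generation number, so a snapshot refresh only has to walk the list until it reaches an entry it has already seen. `miniCache`, `nodeItem`, and `snapshot` are hypothetical types for illustration, and node deletion is left out.

```go
package main

import "fmt"

// nodeItem is a hypothetical list entry, loosely modelled on nodeInfoListItem.
type nodeItem struct {
	name       string
	generation int64
	pods       int
	prev, next *nodeItem
}

// miniCache keeps nodes in a list ordered from most to least recently updated.
type miniCache struct {
	nodes      map[string]*nodeItem
	head       *nodeItem
	generation int64
}

// snapshot is a hypothetical read-only copy used during one scheduling cycle.
type snapshot struct {
	generation int64
	pods       map[string]int
}

func newMiniCache() *miniCache {
	return &miniCache{nodes: map[string]*nodeItem{}}
}

// touch records a change to a node and moves it to the head of the list.
func (c *miniCache) touch(name string, pods int) {
	item, ok := c.nodes[name]
	if !ok {
		item = &nodeItem{name: name}
		c.nodes[name] = item
	} else if item != c.head {
		// Unlink the existing entry; it is not the head, so prev is non-nil.
		item.prev.next = item.next
		if item.next != nil {
			item.next.prev = item.prev
		}
	}
	if item != c.head {
		item.prev, item.next = nil, c.head
		if c.head != nil {
			c.head.prev = item
		}
		c.head = item
	}
	c.generation++
	item.generation = c.generation
	item.pods = pods
}

// updateSnapshot copies only entries newer than the snapshot and
// returns how many entries were copied.
func (c *miniCache) updateSnapshot(s *snapshot) int {
	copied := 0
	for item := c.head; item != nil && item.generation > s.generation; item = item.next {
		s.pods[item.name] = item.pods
		copied++
	}
	s.generation = c.generation
	return copied
}

func main() {
	c := newMiniCache()
	c.touch("node-a", 1)
	c.touch("node-b", 2)
	c.touch("node-c", 3)

	s := &snapshot{pods: map[string]int{}}
	fmt.Println("first refresh copied:", c.updateSnapshot(s))

	c.touch("node-a", 5)
	fmt.Println("second refresh copied:", c.updateSnapshot(s))
}
```

With thousands of nodes and only a handful changing between scheduling cycles, this keeps the cost of refreshing the snapshot proportional to the number of changed nodes rather than to the cluster size.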
- The PriorityQueue data structure:
type PriorityQueue struct {
    // PodNominator abstracts the operations to maintain nominated Pods.
    framework.PodNominator

    stop  chan struct{}
    clock util.Clock

    // pod initial backoff duration.
    podInitialBackoffDuration time.Duration
    // pod maximum backoff duration.
    podMaxBackoffDuration time.Duration

    lock sync.RWMutex
    cond sync.Cond

    // activeQ is heap structure that scheduler actively looks at to find pods to
    // schedule. Head of heap is the highest priority pod.
    activeQ *heap.Heap
    // podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
    // are popped from this heap before the scheduler looks at activeQ
    podBackoffQ *heap.Heap
    // unschedulableQ holds pods that have been tried and determined unschedulable.
    unschedulableQ *UnschedulablePodsMap
    // schedulingCycle represents sequence number of scheduling cycle and is incremented
    // when a pod is popped.
    schedulingCycle int64
    // moveRequestCycle caches the sequence number of scheduling cycle when we
    // received a move request. Unscheduable pods in and before this scheduling
    // cycle will be put back to activeQueue if we were trying to schedule them
    // when we received move request.
    moveRequestCycle int64

    // closed indicates that the queue is closed.
    // It is mainly used to let Pop() exit its control loop while waiting for an item.
    closed bool
}
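The scheduling priority queue consists of three queues: the active queue, the unschedulable queue, and the backoff queue.
Backoff queue: backoff is a common mechanism in concurrent programming. If a task keeps failing after repeated attempts, the wait before its next attempt grows with each failure. This lowers the retry frequency and avoids wasting scheduling resources on repeated failures.
Pods that fail scheduling are stored in the backoff queue first and wait there for a later retry.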