【kubernetes/k8s源码分析】k8s extender scheduler 分析

1. Scheduler extender

    有三种方式为 kubernetes 添加新的调度规则,包括 predicates 和 priority 功能,本文讲解第三种方式

  • 第一种,直接在 kubernetes 添加调度规则,重新编译
  • 第二种,实现自己的调度,替换 k8s 的 scheduler
  • 第三种,实现 scheduler extender,提供扩展 k8s 调度的一个能力

   1.1 http body 结构 ExtenderArgs 

     设置的 scheduler 会向 extender 服务发起 http 请求,其中将 ExtenderArgs 序列化作为 body,主要是需要调度的 pod 的信息,以及调度的节点列表。

// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod *v1.Pod
	// List of candidate nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// List of candidate node names where the pod can be scheduled; to be
	// populated only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
}

     1.1.1 读取 http body

      如果使用 github.com/emicklei/go-restful 包比较简单,直接使用 ReadEntity 读取到 ExtenderArgs 变量即可

func predicates(r *restful.Request, w *restful.Response) {
	var extenderArgs schedulerapi.ExtenderArgs

	if err := r.ReadEntity(&extenderArgs); err != nil {
		logrus.Errorf("predicate read entity error: %v", err)
		w.WriteErrorString(http.StatusInternalServerError, err.Error())
		return
	}

    1.1.2 定义自己的 filter 规则

     轮询所有 node 节点,使用自定义的过略规则,比如 CPU 内存 存储等指标,通过的加入到 canSchedule,未通的加入到 canNotSchedule,返回结果在 ExtenderFilterResult。

     predicateHandler 可以根据情况二定义,比如 CPU,内存 存储等等指标

func handleFilter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
	pod := args.Pod
	canSchedule := make([]v1.Node, 0, len(args.Nodes.Items))
	canNotSchedule := make(map[string]string)

	for _, node := range args.Nodes.Items {
		result, err := predicateHandler(*pod, node)
		if err != nil {
			canNotSchedule[node.Name] = err.Error()
		} else if result {
			canSchedule = append(canSchedule, node)
		}
	}
	return &schedulerapi.ExtenderFilterResult{
		Nodes: &v1.NodeList{
			Items: canSchedule,
		},
		FailedNodes: canNotSchedule,
		Error:       "",
	}
}

   1.2 http response body 结构 ExtenderFilterResult

     使用 ExtenderFilterResult 作为结构告知 scheduler 哪些 node 可以调度,哪些是不可以调度节点

// ExtenderFilterResult represents the results of a filter call to an extender
type ExtenderFilterResult struct {
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == false
	Nodes *v1.NodeList
	// Filtered set of nodes where the pod can be scheduled; to be populated
	// only if ExtenderConfig.NodeCacheCapable == true
	NodeNames *[]string
	// Filtered out nodes where the pod can't be scheduled and the failure messages
	FailedNodes FailedNodesMap
	// Error message indicating failure
	Error string
}

 

2. How the scheduler extender works

    2.1 extender scheduler policy 配置文件样例 policy.yaml

  •      urlPrefix 是向 scheduler 注册需要回调服务地址前缀
  •      enableHttps 是否 https 服务
  •      nodeCacheCapable :如果设置 NodeCache,那调度器只会传给 nodenames 列表。如果没有开启调度器会把所有 nodeinfo 完整结构都传递过来。
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
  {"name" : "PodFitsHostPorts"},
  {"name" : "PodFitsResources"},
  {"name" : "NoDiskConflict"},
  {"name" : "MatchNodeSelector"},
  {"name" : "HostName"}
  ],
  "priorities" : [
  {"name" : "LeastRequestedPriority", "weight" : 1},
  {"name" : "BalancedResourceAllocation", "weight" : 1},
  {"name" : "ServiceSpreadingPriority", "weight" : 1},
  {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
                   "urlPrefix": "http://localhost:8880",
                   "filterVerb": "predicates",
                   "prioritizeVerb": "priorities",
                   "preemptVerb": "preemption",
                   "bindVerb": "",
                   "weight": 1,
                   "enableHttps": false,
                   "nodeCacheCapable": false
                 }],
  "hardPodAffinitySymmetricWeight" : 10
}

  2.2 定义的 KubeSchedulerConfiguration

    设置调度器,以及调度算法的配置 policy 文件,使用上文的 2.1 policy.yaml 文件,下文 schedulerConfig.yaml

    kube-scheduler 启动时可以通过 --config=schedulerConfig.yaml 参数可以指定调度策略文件,用户可以根据需要组装Predicates 和 Priority函数。选择不同的过滤函数和优先级函数、控制优先级函数的权重、调整过滤函数的顺序都会影响调度过程。

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: policy.yaml
leaderElection:
  leaderElect: true
  lockObjectName: my-scheduler
  lockObjectNamespace: kube-system

 

3. k8s scheduler extender 源码实现分析

    3.1 读取配置文件如果设置 file 或者 configMap

apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: my-scheduler
algorithmSource:
  policy:
    file:
      path: extender-policy.yaml

source := schedulerAlgorithmSource
switch {
case source.Provider != nil:
	// Create the config from a named algorithm provider.
	sc, err := configurator.CreateFromProvider(*source.Provider)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
	}
	config = sc
case source.Policy != nil:
	// Create the config from a user specified policy source.
	policy := &schedulerapi.Policy{}
	switch {
	case source.Policy.File != nil:
		if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
			return nil, err
		}
	case source.Policy.ConfigMap != nil:
		if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
			return nil, err
		}
	}
	sc, err := configurator.CreateFromConfig(*policy)
	if err != nil {
		return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
	}
	config = sc

    3.1.1 CreateFromConfig 读取设置的 policy.yaml 文件或者 configMap 配置

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ],
  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
   }],
  "hardPodAffinitySymmetricWeight" : 10
}

    3.1.2 SchedulerExtender 接口

    方法比较简单明了,路径 pkg/scheduler/algorithm/scheduler_interface.go

type SchedulerExtender interface {

	// Filter based on extender-implemented predicate functions. The filtered list is
	// expected to be a subset of the supplied list. failedNodesMap optionally contains
	// the list of failed nodes and failure reasons.
	Filter(pod *v1.Pod,
		nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	) (filteredNodes []*v1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)

	// Prioritize based on extender-implemented priority functions. The returned scores & weight
	// are used to compute the weighted score for an extender. The weighted scores are added to
	// the scores computed  by Kubernetes scheduler. The total scores are used to do the host selection.
	Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *schedulerapi.HostPriorityList, weight int, err error)

	// Bind delegates the action of binding a pod to a node to the extender.
	Bind(binding *v1.Binding) error
   ..................................
}

    3.1.3 ExtenderConfig 结构体

    当调度 pod 时,extender 通过外部的进程来预选(filter)和优选 (prioritize) 节点,extender 也可以直接实现把 pod bind 到node

    使用 extender 功能时,需要创建 scheduler policy 配置文件,配置文件指明怎样能访问到 extender

  "extenders" : [{
    "urlPrefix": "http://localhost:8880",
    "filterVerb": "predicates",
    "prioritizeVerb": "priorities",
    "preemptVerb": "preemption",
    "bindVerb": "",
    "weight": 1,
    "enableHttps": false,
    "nodeCacheCapable": false
   }],

  •     URLPrefix 是 extender 服务的可用地址前缀
  •     FilterVerb filter 的调用地址,例如上述配置就是 http://localhost:8880/predicates,如果未空则不支持   
  •     enableHttps 是否 https 服务
  •     nodeCacheCapable :如果设置 NodeCache,那调度器只会传给 nodenames 列表。如果没有开启调度器会把所有 nodeinfo 完整结构都传递过来。
type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
	FilterVerb string
	// Verb for the preempt call, empty if not supported. This verb is appended to the URLPrefix when issuing the preempt call to extender.
	PreemptVerb string
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
	PrioritizeVerb string
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer
	Weight int
	// Verb for the bind call, empty if not supported. This verb is appended to the URLPrefix when issuing the bind call to extender.
	// If this method is implemented by the extender, it is the extender's responsibility to bind the pod to apiserver. Only one extender
	// can implement this function.
	BindVerb string
	// EnableHTTPS specifies whether https should be used to communicate with the extender
	EnableHTTPS bool
	// TLSConfig specifies the transport layer security config
	TLSConfig *ExtenderTLSConfig
	// HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
	// timeout is ignored, k8s/other extenders priorities are used to select the node.
	HTTPTimeout time.Duration
	// NodeCacheCapable specifies that the extender is capable of caching node information,
	// so the scheduler should only send minimal information about the eligible nodes
	// assuming that the extender already cached full details of all nodes in the cluster
	NodeCacheCapable bool
	// ManagedResources is a list of extended resources that are managed by
	// this extender.
	// - A pod will be sent to the extender on the Filter, Prioritize and Bind
	//   (if the extender is the binder) phases iff the pod requests at least
	//   one of the extended resources in this list. If empty or unspecified,
	//   all pods will be sent to this extender.
	// - If IgnoredByScheduler is set to true for a resource, kube-scheduler
	//   will skip checking the resource in predicates.
	// +optional
	ManagedResources []ExtenderManagedResource
	// Ignorable specifies if the extender is ignorable, i.e. scheduling should not
	// fail when the extender returns an error or is not reachable.
	Ignorable bool
}

  3.2 HTTPExtender 实现了 SchedulerExtender 接口

    NewHTTPExtender 实例化 HTTPExtender 对象,实现了 SchedulerExtender 接口

var extenders []algorithm.SchedulerExtender
if len(policy.ExtenderConfigs) != 0 {
	ignoredExtendedResources := sets.NewString()
	var ignorableExtenders []algorithm.SchedulerExtender
	for ii := range policy.ExtenderConfigs {
		klog.V(2).Infof("Creating extender with config %+v", policy.ExtenderConfigs[ii])
		extender, err := core.NewHTTPExtender(&policy.ExtenderConfigs[ii])

    哪里会调用 Extender 呢?请看下文分解

   3.3 findNodesThatFit 函数

    预选阶段,这里会调用 extender 在根据自定义规则在过滤一下,所以这里是 HTTPExtender 的 Filter 方法

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
	var filtered []*v1.Node
	failedPredicateMap := FailedPredicateMap{}

	if len(filtered) > 0 && len(g.extenders) != 0 {
		for _, extender := range g.extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			filteredList, failedMap, err := extender.Filter(pod, filtered, g.nodeInfoSnapshot.NodeInfoMap)

  3.3.1 HTTPExtender 的 Filter 方法

   这里 ExtenderArgs 作为请求 body 的主体结构,ExtenderFilterResult 作为 response 的返回结构

// Filter based on extender implemented predicate functions. The filtered list is
// expected to be a subset of the supplied list. failedNodesMap optionally contains
// the list of failed nodes and failure reasons.
func (h *HTTPExtender) Filter(
	pod *v1.Pod,
	nodes []*v1.Node, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
) ([]*v1.Node, schedulerapi.FailedNodesMap, error) {
	var (
		result     schedulerapi.ExtenderFilterResult
		nodeList   *v1.NodeList
		nodeNames  *[]string
		nodeResult []*v1.Node
		args       *schedulerapi.ExtenderArgs
	)

     3.3.1.1 如果 filterVerb 未空,则未过略任何节点

if h.filterVerb == "" {
	return nodes, schedulerapi.FailedNodesMap{}, nil
}

     3.3.1.2 是否设置 nodeCacheCapable

     如果设置 NodeCache,那调度器只会传给 nodenames 列表。如果没有开启调度器会把所有 nodeinfo 完整结构都传递过来。

if h.nodeCacheCapable {
	nodeNameSlice := make([]string, 0, len(nodes))
	for _, node := range nodes {
		nodeNameSlice = append(nodeNameSlice, node.Name)
	}
	nodeNames = &nodeNameSlice
} else {
	nodeList = &v1.NodeList{}
	for _, node := range nodes {
		nodeList.Items = append(nodeList.Items, *node)
	}
}

     3.3.1.3 封装 ExtenderArgs 结构,包括需要调度的 Pod 与节点列表

args = &schedulerapi.ExtenderArgs{
	Pod:       pod,
	Nodes:     nodeList,
	NodeNames: nodeNames,
}

if err := h.send(h.filterVerb, args, &result); err != nil {
	return nil, nil, err
}

     3.3.1.4 发送 http 请求

     将 ExtenderArgs 序列化,请求地址就是 urlPrefix + “/” + filterVerb,所以主要实现逻辑在自定义 extender 服务中

// Helper function to send messages to the extender
func (h *HTTPExtender) send(action string, args interface{}, result interface{}) error {
	out, err := json.Marshal(args)
	if err != nil {
		return err
	}

	url := strings.TrimRight(h.extenderURL, "/") + "/" + action

	req, err := http.NewRequest("POST", url, bytes.NewReader(out))
	if err != nil {
		return err
	}

	req.Header.Set("Content-Type", "application/json")

	resp, err := h.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
	}

	return json.NewDecoder(resp.Body).Decode(result)
}

   对于优选阶段怎么处理的呢?请看第 4 章节

4. PiroritizeNodes 函数

// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
	pod *v1.Pod,
	nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo,
	meta interface{},
	priorityConfigs []priorities.PriorityConfig,
	nodes []*v1.Node,
	extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {

   4.1 对所有 extender 实行并发异步处理

    对所有结果分数累加,分析 HTTPExtender 的 Prioritize 方法

if len(extenders) != 0 && nodes != nil {
	combinedScores := make(map[string]int, len(nodeNameToInfo))
	for i := range extenders {
		if !extenders[i].IsInterested(pod) {
			continue
		}
		wg.Add(1)
		go func(extIndex int) {
			defer wg.Done()
			prioritizedList, weight, err := extenders[extIndex].Prioritize(pod, nodes)
			if err != nil {
				// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
				return
			}
			mu.Lock()
			for i := range *prioritizedList {
				host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
				if klog.V(10) {
					klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, extenders[extIndex].Name(), score)
				}
				combinedScores[host] += score * weight
			}
			mu.Unlock()
		}(i)
	}
	// wait for all go routines to finish
	wg.Wait()
	for i := range result {
		result[i].Score += combinedScores[result[i].Host]
	}
}

   4.2 HTTPExtender 的 Prioritize 方法

// Prioritize based on extender implemented priority functions. Weight*priority is added
// up for each such priority function. The returned score is added to the score computed
// by Kubernetes scheduler. The total score is used to do the host selection.
func (h *HTTPExtender) Prioritize(pod *v1.Pod, nodes []*v1.Node) (*schedulerapi.HostPriorityList, int, error) {
	var (
		result    schedulerapi.HostPriorityList
		nodeList  *v1.NodeList
		nodeNames *[]string
		args      *schedulerapi.ExtenderArgs
	)

     4.2.1 如果未设置 prioritizeVerb 

     打个比方,交白卷打 0 分,也就是无需实现自定义 prioritize 打分,0 分也就是对原来的得分没有影响

if h.prioritizeVerb == "" {
	result := schedulerapi.HostPriorityList{}
	for _, node := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: node.Name, Score: 0})
	}
	return &result, 0, nil
}

    预选与优选的 HTTPExtender 处理方法相同,不过返回结果不同而已,一个是过略的 node 节点,一个是 node 节点给的分数

 

总结:

   部署需要scheduler指定配置 policy 的 Extender 配置

   Extender 服务实现 HTTP server,实现预选优选方法

 

参考:

    https://developer.ibm.com/technologies/containers/articles/creating-a-custom-kube-scheduler/

    https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md

  • 3
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值