Preface
The kube-controller-manager component ultimately starts a large number of controllers. This article reads through and analyzes the source code of one of them: the DaemonSet controller.
Starting the DaemonSet Controller
The startDaemonSetController function is kube-controller-manager's entry point for starting the DaemonSet controller. It is quite simple, with just three steps:
- Check whether the apps/v1/daemonsets resource is available
- Call the daemon package's NewDaemonSetsController function to create a DaemonSet controller instance, dsc
- Call the Run method of the DaemonSet controller instance dsc
k8s.io/kubernetes/cmd/kube-controller-manager/app/apps.go:36
func startDaemonSetController(ctx ControllerContext) (http.Handler, bool, error) {
    if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "daemonsets"}] {
        return nil, false, nil
    }
    dsc, err := daemon.NewDaemonSetsController(
        ctx.InformerFactory.Apps().V1().DaemonSets(),
        ctx.InformerFactory.Apps().V1().ControllerRevisions(),
        ctx.InformerFactory.Core().V1().Pods(),
        ctx.InformerFactory.Core().V1().Nodes(),
        ctx.ClientBuilder.ClientOrDie("daemon-set-controller"),
        flowcontrol.NewBackOff(1*time.Second, 15*time.Minute),
    )
    if err != nil {
        return nil, true, fmt.Errorf("error creating DaemonSets controller: %v", err)
    }
    go dsc.Run(int(ctx.ComponentConfig.DaemonSetController.ConcurrentDaemonSetSyncs), ctx.Stop)
    return nil, true, nil
}
Creating the DaemonSet Controller
kube-controller-manager calls the NewDaemonSetsController function to create the DaemonSet controller instance. The logic is as follows:
- Create the DaemonSetsController instance: (1) burstReplicas is set to 250, i.e. the DaemonSetsController creates or deletes at most 250 pods in a single sync pass; (2) controller.NewControllerExpectations creates a ControllerExpectations used to track in-flight pod creations and deletions; (3) workqueue.NewNamedRateLimitingQueue creates a workqueue that stores the DaemonSets waiting to be synced
- Watch DaemonSet events and register the DaemonSet lister and its HasSynced function. The DaemonSet event handlers are: (1) AddFunc calls dsc.enqueueDaemonSet to add the DaemonSet's key (namespace and name) to the DaemonSetsController's queue; (2) UpdateFunc likewise calls dsc.enqueueDaemonSet to add the new DaemonSet's key to the queue; (3) DeleteFunc calls dsc.deleteDaemonset, which first checks that the incoming interface is a DaemonSet and, if so, calls dsc.enqueueDaemonSet to add its key to the queue
- Watch the Add/Update/Delete funcs for history (ControllerRevision) events, and register the history lister and HasSynced function
- Watch the Add/Update/Delete funcs for pod events, register the pod lister and HasSynced function, and define a pod indexer keyed by node name
- Watch the AddFunc and UpdateFunc for node events, and register the node lister and HasSynced function
- Register dsc.syncDaemonSet as the syncHandler; it is the DaemonSet controller's main logic for processing a DaemonSet
- Register dsc.enqueue as enqueueDaemonSet and dsc.enqueueRateLimited as enqueueDaemonSetRateLimited; both add a DaemonSet that needs processing to the queue the controller maintains
- Assign the backoff created by flowcontrol.NewBackOff (default duration 1s, max duration 15m) to failedPodsBackoff
As you can see, the DaemonSet controller watches the cluster's DaemonSet, History, Pod, and Node objects, and maintains a queue holding the DaemonSets that need processing. A sketch of the enqueue helper and the nodeName indexer follows the code below.
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:143

func NewDaemonSetsController(
    daemonSetInformer appsinformers.DaemonSetInformer,
    historyInformer appsinformers.ControllerRevisionInformer,
    podInformer coreinformers.PodInformer,
    nodeInformer coreinformers.NodeInformer,
    kubeClient clientset.Interface,
    failedPodsBackoff *flowcontrol.Backoff,
) (*DaemonSetsController, error) {
    eventBroadcaster := record.NewBroadcaster()
    eventBroadcaster.StartLogging(glog.Infof)
    eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})

    if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
        if err := metrics.RegisterMetricAndTrackRateLimiterUsage("daemon_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter()); err != nil {
            return nil, err
        }
    }
    // Create the DaemonSetsController instance: burstReplicas caps creates/deletes
    // per sync at 250, a ControllerExpectations tracks in-flight pod creations and
    // deletions, and a rate-limited workqueue stores the DaemonSets to be synced.
    dsc := &DaemonSetsController{
        kubeClient:    kubeClient,
        eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
        podControl: controller.RealPodControl{
            KubeClient: kubeClient,
            Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
        },
        crControl: controller.RealControllerRevisionControl{
            KubeClient: kubeClient,
        },
        burstReplicas:       BurstReplicas,
        expectations:        controller.NewControllerExpectations(),
        queue:               workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "daemonset"),
        suspendedDaemonPods: map[string]sets.String{},
    }

    // Watch DaemonSet events: Add and Update enqueue the DaemonSet's key (namespace
    // and name); Delete goes through dsc.deleteDaemonset, which enqueues the key
    // after a type check.
    daemonSetInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            ds := obj.(*apps.DaemonSet)
            glog.V(4).Infof("Adding daemon set %s", ds.Name)
            dsc.enqueueDaemonSet(ds)
        },
        UpdateFunc: func(old, cur interface{}) {
            oldDS := old.(*apps.DaemonSet)
            curDS := cur.(*apps.DaemonSet)
            glog.V(4).Infof("Updating daemon set %s", oldDS.Name)
            dsc.enqueueDaemonSet(curDS)
        },
        DeleteFunc: dsc.deleteDaemonset,
    })
    // Register the DaemonSet lister and its HasSynced function.
    dsc.dsLister = daemonSetInformer.Lister()
    dsc.dsStoreSynced = daemonSetInformer.Informer().HasSynced

    // Watch ControllerRevision (history) events and register the history lister and HasSynced function.
    historyInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dsc.addHistory,
        UpdateFunc: dsc.updateHistory,
        DeleteFunc: dsc.deleteHistory,
    })
    dsc.historyLister = historyInformer.Lister()
    dsc.historyStoreSynced = historyInformer.Informer().HasSynced

    // Watch pod events, register the pod lister and HasSynced function, and add
    // a pod index keyed by node name.
    // Watch for creation/deletion of pods. The reason we watch is that we don't want a daemon set to create/delete
    // more pods until all the effects (expectations) of a daemon set's create/delete have been observed.
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dsc.addPod,
        UpdateFunc: dsc.updatePod,
        DeleteFunc: dsc.deletePod,
    })
    dsc.podLister = podInformer.Lister()

    // This custom indexer will index pods based on their NodeName which will decrease the amount of pods we need to get in simulate() call.
    podInformer.Informer().GetIndexer().AddIndexers(cache.Indexers{
        "nodeName": indexByPodNodeName,
    })
    dsc.podNodeIndex = podInformer.Informer().GetIndexer()
    dsc.podStoreSynced = podInformer.Informer().HasSynced

    // Watch node Add/Update events and register the node lister and HasSynced function.
    nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    dsc.addNode,
        UpdateFunc: dsc.updateNode,
    },
    )
    dsc.nodeStoreSynced = nodeInformer.Informer().HasSynced
    dsc.nodeLister = nodeInformer.Lister()

    // syncHandler is dsc.syncDaemonSet, the controller's main reconciliation logic.
    dsc.syncHandler = dsc.syncDaemonSet
    // Both enqueue helpers put a DaemonSet that needs processing onto the queue.
    dsc.enqueueDaemonSet = dsc.enqueue
    dsc.enqueueDaemonSetRateLimited = dsc.enqueueRateLimited
    // failedPodsBackoff was created with a default duration of 1s and a max of 15m.
    dsc.failedPodsBackoff = failedPodsBackoff
    return dsc, nil
}
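The enqueue helper and the nodeName indexer referenced above are both small. Their implementations in the same file look roughly like this (a paraphrased sketch, not a verbatim quote):

// Sketch of dsc.enqueue: compute the "namespace/name" key and add it to the queue.
func (dsc *DaemonSetsController) enqueue(ds *apps.DaemonSet) {
    key, err := controller.KeyFunc(ds)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", ds, err))
        return
    }
    dsc.queue.Add(key)
}

// Sketch of indexByPodNodeName: map a pod to its node name so simulate() can
// list only the pods on one node instead of every pod in the cluster.
func indexByPodNodeName(obj interface{}) ([]string, error) {
    pod, ok := obj.(*v1.Pod)
    if !ok {
        return []string{}, nil
    }
    // Only pods that are bound to a node and still active are indexed.
    if len(pod.Spec.NodeName) == 0 || pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
        return []string{}, nil
    }
    return []string{pod.Spec.NodeName}, nil
}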
History Event
When NewDaemonSetsController creates the DaemonSet controller, it watches the cluster's History (ControllerRevision) objects. Let's look at the specific logic.
AddFunc
History's AddFunc is registered as the addHistory method, whose logic is:
- Cast obj to a history and check its DeletionTimestamp; if it is not nil, the history is pending deletion, so call dsc.deleteHistory and return
- Get the history's OwnerReference; if it is not nil, resolve the ds from the history's namespace and the OwnerReference (a sketch of resolveControllerRef follows the code below); if a valid ds is found, simply return without further processing
- If no ds was resolved in the previous step, this is an orphan history: call dsc.getDaemonSetsForHistory to find all DaemonSets whose selector matches the history's labels, then call dsc.enqueueDaemonSet to add each of them to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:374

func (dsc *DaemonSetsController) addHistory(obj interface{}) {
    // A history with a DeletionTimestamp is pending deletion; hand it to deleteHistory.
    history := obj.(*apps.ControllerRevision)
    if history.DeletionTimestamp != nil {
        // On a restart of the controller manager, it's possible for an object to
        // show up in a state that is already pending deletion.
        dsc.deleteHistory(history)
        return
    }

    // If it has a ControllerRef, that's all that matters.
    if controllerRef := metav1.GetControllerOf(history); controllerRef != nil {
        ds := dsc.resolveControllerRef(history.Namespace, controllerRef)
        if ds == nil {
            return
        }
        glog.V(4).Infof("ControllerRevision %s added.", history.Name)
        return
    }

    // Otherwise, it's an orphan. Get a list of all matching DaemonSets and sync
    // them to see if anyone wants to adopt it.
    daemonSets := dsc.getDaemonSetsForHistory(history)
    if len(daemonSets) == 0 {
        return
    }
    glog.V(4).Infof("Orphan ControllerRevision %s added.", history.Name)
    for _, ds := range daemonSets {
        dsc.enqueueDaemonSet(ds)
    }
}
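addHistory, and all the handlers below, rely on dsc.resolveControllerRef, which turns an OwnerReference into a DaemonSet only when the kind, name, and UID all line up. A paraphrased sketch of the helper in the same file:

// resolveControllerRef returns the DaemonSet referenced by an OwnerReference,
// or nil if the reference does not point at a live DaemonSet.
func (dsc *DaemonSetsController) resolveControllerRef(namespace string, controllerRef *metav1.OwnerReference) *apps.DaemonSet {
    // We can't look up by UID, so look up by name and then verify the UID.
    if controllerRef.Kind != controllerKind.Kind {
        return nil
    }
    ds, err := dsc.dsLister.DaemonSets(namespace).Get(controllerRef.Name)
    if err != nil {
        return nil
    }
    if ds.UID != controllerRef.UID {
        // A DaemonSet with the same name exists, but it is not the object the
        // OwnerReference points to (e.g. a deleted-and-recreated namesake).
        return nil
    }
    return ds
}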
UpdateFunc
History's UpdateFunc is registered as the updateHistory method, whose main logic is:
- Get curHistory and oldHistory; if their ResourceVersions are equal, the history has not changed, so do nothing and return
- Get the OwnerReferences of curHistory and oldHistory; if the OwnerReference changed and oldHistory's OwnerReference is not nil, resolve the ds from oldHistory's namespace and OwnerReference; if the resolved ds is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue
- If curHistory's OwnerReference is not nil, resolve the ds from curHistory's namespace and OwnerReference; if the ds is nil, return; otherwise call dsc.enqueueDaemonSet to add it to the controller's queue, then return
- If curHistory's OwnerReference is nil, the history has become an orphan; if either the labels or the OwnerReference changed between oldHistory and curHistory, call dsc.getDaemonSetsForHistory to find all DaemonSets whose selector matches the labels, then call dsc.enqueueDaemonSet to add each of them to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:408

func (dsc *DaemonSetsController) updateHistory(old, cur interface{}) {
    // Equal ResourceVersions mean the history has not changed.
    curHistory := cur.(*apps.ControllerRevision)
    oldHistory := old.(*apps.ControllerRevision)
    if curHistory.ResourceVersion == oldHistory.ResourceVersion {
        // Periodic resync will send update events for all known ControllerRevisions.
        return
    }

    // If the ControllerRef changed and the old one is not nil, resolve and
    // enqueue the old owning ds.
    curControllerRef := metav1.GetControllerOf(curHistory)
    oldControllerRef := metav1.GetControllerOf(oldHistory)
    controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
    if controllerRefChanged && oldControllerRef != nil {
        // The ControllerRef was changed. Sync the old controller, if any.
        if ds := dsc.resolveControllerRef(oldHistory.Namespace, oldControllerRef); ds != nil {
            dsc.enqueueDaemonSet(ds)
        }
    }

    // If it has a ControllerRef, that's all that matters: resolve the ds from
    // curHistory's namespace and OwnerReference, enqueue it if not nil, and return.
    if curControllerRef != nil {
        ds := dsc.resolveControllerRef(curHistory.Namespace, curControllerRef)
        if ds == nil {
            return
        }
        glog.V(4).Infof("ControllerRevision %s updated.", curHistory.Name)
        dsc.enqueueDaemonSet(ds)
        return
    }

    // Otherwise, it's an orphan. If anything changed, sync matching controllers
    // to see if anyone wants to adopt it now.
    labelChanged := !reflect.DeepEqual(curHistory.Labels, oldHistory.Labels)
    if labelChanged || controllerRefChanged {
        daemonSets := dsc.getDaemonSetsForHistory(curHistory)
        if len(daemonSets) == 0 {
            return
        }
        glog.V(4).Infof("Orphan ControllerRevision %s updated.", curHistory.Name)
        for _, ds := range daemonSets {
            dsc.enqueueDaemonSet(ds)
        }
    }
}
DeleteFunc
History's DeleteFunc is registered as the deleteHistory method, whose main logic is:
- Check whether the incoming obj is a history (unwrapping a tombstone if necessary); if not, return; otherwise extract the history
- Get the history's OwnerReference; if it is nil, return; otherwise resolve the ds from the history's namespace and OwnerReference; if the ds is nil, return; if it is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:455

func (dsc *DaemonSetsController) deleteHistory(obj interface{}) {
    // Check that obj is a ControllerRevision, unwrapping a tombstone if necessary.
    history, ok := obj.(*apps.ControllerRevision)
    // When a delete is dropped, the relist will notice a ControllerRevision in the store not
    // in the list, leading to the insertion of a tombstone object which contains
    // the deleted key/value. Note that this value might be stale. If the ControllerRevision
    // changed labels the new DaemonSet will not be woken up till the periodic resync.
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
            return
        }
        history, ok = tombstone.Obj.(*apps.ControllerRevision)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a ControllerRevision %#v", obj))
            return
        }
    }

    // If there is no ControllerRef, nothing cares about the deletion; otherwise
    // resolve the owning ds and enqueue it if not nil.
    controllerRef := metav1.GetControllerOf(history)
    if controllerRef == nil {
        // No controller should care about orphans being deleted.
        return
    }
    ds := dsc.resolveControllerRef(history.Namespace, controllerRef)
    if ds == nil {
        return
    }
    glog.V(4).Infof("ControllerRevision %s deleted.", history.Name)
    dsc.enqueueDaemonSet(ds)
}
Pod Event
AddFunc
Pod's AddFunc is registered as the addPod method, whose logic is:
- Cast the incoming obj to a pod and check its DeletionTimestamp; if it is not nil, the pod is pending deletion, so call dsc.deletePod and return
- Get the pod's OwnerReference; if it is not nil, resolve the ds from the pod's namespace and OwnerReference; if the ds is nil, return; otherwise get the ds's key (namespace/name); if that fails, return; if it succeeds, record a creation observation on the expectations, call dsc.enqueueDaemonSet to add the ds to the controller's queue, and return
- If the pod's OwnerReference is nil, it is an orphan pod: find all DaemonSets whose selector matches the pod's labels and call dsc.enqueueDaemonSet to add each of them to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:488

func (dsc *DaemonSetsController) addPod(obj interface{}) {
    // A pod with a DeletionTimestamp is pending deletion; hand it to deletePod.
    pod := obj.(*v1.Pod)
    if pod.DeletionTimestamp != nil {
        // on a restart of the controller manager, it's possible a new pod shows up in a state that
        // is already pending deletion. Prevent the pod from being a creation observation.
        dsc.deletePod(pod)
        return
    }

    // If it has a ControllerRef, that's all that matters.
    if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
        ds := dsc.resolveControllerRef(pod.Namespace, controllerRef)
        if ds == nil {
            return
        }
        dsKey, err := controller.KeyFunc(ds)
        if err != nil {
            return
        }
        glog.V(4).Infof("Pod %s added.", pod.Name)
        dsc.expectations.CreationObserved(dsKey)
        dsc.enqueueDaemonSet(ds)
        return
    }

    // Otherwise, it's an orphan. Get a list of all matching DaemonSets and sync
    // them to see if anyone wants to adopt it.
    // DO NOT observe creation because no controller should be waiting for an
    // orphan.
    dss := dsc.getDaemonSetsForPod(pod)
    if len(dss) == 0 {
        return
    }
    glog.V(4).Infof("Orphan Pod %s added.", pod.Name)
    for _, ds := range dss {
        dsc.enqueueDaemonSet(ds)
    }
}
UpdateFunc
Pod's UpdateFunc is registered as the updatePod method, whose logic is:
- Compare curPod's and oldPod's ResourceVersions; if they are equal, the pod has not changed, so return without further processing
- Compare curPod's and oldPod's OwnerReferences; if they changed and oldPod's OwnerReference is not nil, resolve the ds from oldPod's namespace and OwnerReference; if the resolved ds is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue
- If curPod's OwnerReference is not nil, resolve the ds from curPod's namespace and OwnerReference and call dsc.enqueueDaemonSet to add it to the controller's queue; then check the pod's Ready state: if oldPod was not Ready but curPod is, and the ds's Spec.MinReadySeconds is greater than 0, call dsc.enqueueDaemonSetAfter to add the ds to the queue again after MinReadySeconds+1 seconds (a sketch of enqueueDaemonSetAfter follows the code below)
- If curPod's OwnerReference is nil, it is an orphan pod: find all DaemonSets whose selector matches the pod's labels; if the labels or the OwnerReference changed between oldPod and curPod, call dsc.enqueueDaemonSet to add all of those DaemonSets to the controller's queue; if neither changed, do nothing
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:531

func (dsc *DaemonSetsController) updatePod(old, cur interface{}) {
    curPod := cur.(*v1.Pod)
    oldPod := old.(*v1.Pod)
    // Equal ResourceVersions mean the pod has not changed.
    if curPod.ResourceVersion == oldPod.ResourceVersion {
        // Periodic resync will send update events for all known pods.
        // Two different versions of the same pod will always have different RVs.
        return
    }

    // If the ControllerRef changed and the old one is not nil, resolve and
    // enqueue the old owning ds.
    curControllerRef := metav1.GetControllerOf(curPod)
    oldControllerRef := metav1.GetControllerOf(oldPod)
    controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
    if controllerRefChanged && oldControllerRef != nil {
        // The ControllerRef was changed. Sync the old controller, if any.
        if ds := dsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); ds != nil {
            dsc.enqueueDaemonSet(ds)
        }
    }

    // If it has a ControllerRef, enqueue the owning ds; if the pod just became
    // Ready and the ds has MinReadySeconds > 0, also re-enqueue the ds after
    // MinReadySeconds+1 seconds.
    if curControllerRef != nil {
        ds := dsc.resolveControllerRef(curPod.Namespace, curControllerRef)
        if ds == nil {
            return
        }
        glog.V(4).Infof("Pod %s updated.", curPod.Name)
        dsc.enqueueDaemonSet(ds)
        changedToReady := !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod)
        // See https://github.com/kubernetes/kubernetes/pull/38076 for more details
        if changedToReady && ds.Spec.MinReadySeconds > 0 {
            // Add a second to avoid milliseconds skew in AddAfter.
            // See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
            dsc.enqueueDaemonSetAfter(ds, (time.Duration(ds.Spec.MinReadySeconds)*time.Second)+time.Second)
        }
        return
    }

    // Otherwise, it's an orphan. If anything changed, sync matching controllers
    // to see if anyone wants to adopt it now.
    dss := dsc.getDaemonSetsForPod(curPod)
    if len(dss) == 0 {
        return
    }
    glog.V(4).Infof("Orphan Pod %s updated.", curPod.Name)
    labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
    if labelChanged || controllerRefChanged {
        for _, ds := range dss {
            dsc.enqueueDaemonSet(ds)
        }
    }
}
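enqueueDaemonSetAfter is the delayed variant of enqueue: it goes through the workqueue's AddAfter so the DaemonSet is re-synced once the delay elapses. A paraphrased sketch of the helper in the same file:

// Sketch of dsc.enqueueDaemonSetAfter: like enqueue, but the key only becomes
// visible to the workers after the given duration.
func (dsc *DaemonSetsController) enqueueDaemonSetAfter(obj interface{}, after time.Duration) {
    key, err := controller.KeyFunc(obj)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %+v: %v", obj, err))
        return
    }
    dsc.queue.AddAfter(key, after)
}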
DeleteFunc
Pod's DeleteFunc is registered as the deletePod method, whose logic is:
- Check whether the incoming obj is a pod (unwrapping a tombstone if necessary); if not, return without further processing
- If the pod's OwnerReference is nil, it is an orphan pod: call dsc.requeueSuspendedDaemonPods, which re-adds to the controller's queue every DaemonSet recorded in the controller's suspendedDaemonPods entry for the pod's node, then return (a sketch of requeueSuspendedDaemonPods follows the code below)
- If the OwnerReference is not nil, call dsc.resolveControllerRef with the pod's namespace and OwnerReference to resolve the ds
- If the resolved ds is nil and the pod's Spec.NodeName is not empty, call dsc.requeueSuspendedDaemonPods
- If the resolved ds is not nil, record a deletion observation on the expectations and call dsc.enqueueDaemonSet to add the ds to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:644

func (dsc *DaemonSetsController) deletePod(obj interface{}) {
    // Check that obj is a pod, unwrapping a tombstone if necessary.
    pod, ok := obj.(*v1.Pod)
    // When a delete is dropped, the relist will notice a pod in the store not
    // in the list, leading to the insertion of a tombstone object which contains
    // the deleted key/value. Note that this value might be stale. If the pod
    // changed labels the new daemonset will not be woken up till the periodic
    // resync.
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj))
            return
        }
        pod, ok = tombstone.Obj.(*v1.Pod)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj))
            return
        }
    }

    // An orphan pod: requeue the daemon sets suspended on its node, then return.
    controllerRef := metav1.GetControllerOf(pod)
    if controllerRef == nil {
        // No controller should care about orphans being deleted.
        if len(pod.Spec.NodeName) != 0 {
            // If scheduled pods were deleted, requeue suspended daemon pods.
            dsc.requeueSuspendedDaemonPods(pod.Spec.NodeName)
        }
        return
    }

    // Resolve the owning ds; if none and the pod was scheduled, requeue the
    // suspended daemon pods for the node; otherwise record the deletion
    // observation and enqueue the ds.
    ds := dsc.resolveControllerRef(pod.Namespace, controllerRef)
    if ds == nil {
        if len(pod.Spec.NodeName) != 0 {
            // If scheduled pods were deleted, requeue suspended daemon pods.
            dsc.requeueSuspendedDaemonPods(pod.Spec.NodeName)
        }
        return
    }
    dsKey, err := controller.KeyFunc(ds)
    if err != nil {
        return
    }
    glog.V(4).Infof("Pod %s deleted.", pod.Name)
    dsc.expectations.DeletionObserved(dsKey)
    dsc.enqueueDaemonSet(ds)
}
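requeueSuspendedDaemonPods walks the suspendedDaemonPods entry for the node and re-enqueues, with rate limiting, every DaemonSet recorded there. A paraphrased sketch of the helper in the same file (field and lock names may differ slightly from the actual source):

// Sketch of dsc.requeueSuspendedDaemonPods: re-enqueue every DaemonSet that
// was previously suspended on this node because it could not be scheduled.
func (dsc *DaemonSetsController) requeueSuspendedDaemonPods(node string) {
    dsc.suspendedDaemonPodsMutex.Lock()
    defer dsc.suspendedDaemonPodsMutex.Unlock()

    for _, dsKey := range dsc.suspendedDaemonPods[node].UnsortedList() {
        if ns, name, err := cache.SplitMetaNamespaceKey(dsKey); err != nil {
            glog.Errorf("Failed to get DaemonSet's namespace and name from %s: %v", dsKey, err)
            continue
        } else if ds, err := dsc.dsLister.DaemonSets(ns).Get(name); err != nil {
            glog.Errorf("Failed to get DaemonSet %s/%s: %v", ns, name, err)
            continue
        } else {
            dsc.enqueueDaemonSetRateLimited(ds)
        }
    }
}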
Node Event
The DaemonSet controller registers only an AddFunc and an UpdateFunc for node events. Let's look at the logic of these two funcs.
AddFunc
The node event AddFunc is registered as the addNode method, whose logic is:
- Call dsc.dsLister.List to collect all DaemonSets into dsList
- Loop over every ds in dsList, calling dsc.nodeShouldRunDaemonPod to check whether the ds should be scheduled onto the newly added node, and calling dsc.enqueueDaemonSet to add the DaemonSets that should be scheduled to the controller's queue
dsc.nodeShouldRunDaemonPod is one of the DaemonSet controller's most important methods: it checks whether a ds should schedule a pod onto a node and is used in many places in the controller. It will be read in detail in the "Running the DaemonSet Controller" section.
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:690

func (dsc *DaemonSetsController) addNode(obj interface{}) {
    // TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
    // Collect all DaemonSets into dsList.
    dsList, err := dsc.dsLister.List(labels.Everything())
    if err != nil {
        glog.V(4).Infof("Error enqueueing daemon sets: %v", err)
        return
    }
    node := obj.(*v1.Node)
    // For every ds, check whether it should be scheduled onto the new node and,
    // if so, enqueue it.
    for _, ds := range dsList {
        _, shouldSchedule, _, err := dsc.nodeShouldRunDaemonPod(node, ds)
        if err != nil {
            continue
        }
        if shouldSchedule {
            dsc.enqueueDaemonSet(ds)
        }
    }
}
UpdateFunc
The node event UpdateFunc is registered as the updateNode method, whose logic is:
- Call the shouldIgnoreNodeUpdate function to decide whether to ignore this node update event (a sketch of it follows the code below); its checks are:
- If the set of conditions that are True in oldNode's and curNode's status.Conditions is not identical, return immediately without ignoring the event; otherwise continue to the next check
- If the True conditions are identical, further check whether oldNode and curNode are identical in everything except ResourceVersion and status.Conditions; if so, ignore the node update event, otherwise don't
- Fetch all DaemonSets in the cluster into dsList
- For each ds in dsList:
- Call dsc.nodeShouldRunDaemonPod against the ds and oldNode, assigning whether the ds should be scheduled onto oldNode and whether its pod should continue running to the variables oldShouldSchedule and oldShouldContinueRunning
- Call dsc.nodeShouldRunDaemonPod against the ds and curNode, assigning the corresponding values to the variables currentShouldSchedule and currentShouldContinueRunning
- If oldShouldSchedule differs from currentShouldSchedule or oldShouldContinueRunning differs from currentShouldContinueRunning (i.e. the ds's scheduling or running state on the node changed), call dsc.enqueueDaemonSet to add the ds to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:747

func (dsc *DaemonSetsController) updateNode(old, cur interface{}) {
    oldNode := old.(*v1.Node)
    curNode := cur.(*v1.Node)
    // Ignore the update when the set of True conditions is unchanged and the
    // nodes are otherwise identical apart from ResourceVersion and conditions.
    if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
        return
    }

    // Fetch all DaemonSets in the cluster into dsList.
    dsList, err := dsc.dsLister.List(labels.Everything())
    if err != nil {
        glog.V(4).Infof("Error listing daemon sets: %v", err)
        return
    }
    // For every ds, compare nodeShouldRunDaemonPod against the old and new node;
    // enqueue the ds when its scheduling or running state on the node changed.
    // TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
    for _, ds := range dsList {
        _, oldShouldSchedule, oldShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(oldNode, ds)
        if err != nil {
            continue
        }
        _, currentShouldSchedule, currentShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(curNode, ds)
        if err != nil {
            continue
        }
        if (oldShouldSchedule != currentShouldSchedule) || (oldShouldContinueRunning != currentShouldContinueRunning) {
            dsc.enqueueDaemonSet(ds)
        }
    }
}
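The shouldIgnoreNodeUpdate function itself is not quoted above. A paraphrased sketch that captures its behavior (the upstream helper and its condition comparison are structured a little differently, so treat this as an approximation):

// nodeInSameCondition reports whether the sets of True conditions are identical.
func nodeInSameCondition(old []v1.NodeCondition, cur []v1.NodeCondition) bool {
    trueConds := map[v1.NodeConditionType]bool{}
    for _, c := range old {
        if c.Status == v1.ConditionTrue {
            trueConds[c.Type] = true
        }
    }
    for _, c := range cur {
        if c.Status != v1.ConditionTrue {
            continue
        }
        if !trueConds[c.Type] {
            return false
        }
        delete(trueConds, c.Type)
    }
    return len(trueConds) == 0
}

// shouldIgnoreNodeUpdate ignores an update when the True conditions are the
// same and everything else (apart from ResourceVersion and conditions) is equal.
func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
    if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
        return false
    }
    // Blank out the fields we deliberately ignore before comparing.
    oldNode.ResourceVersion = curNode.ResourceVersion
    oldNode.Status.Conditions = curNode.Status.Conditions
    return apiequality.Semantic.DeepEqual(oldNode, curNode)
}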
Running the DaemonSet Controller
kube-controller-manager calls the controller's Run method to run the DaemonSet controller. Run's main logic is:
- Call controller.WaitForCacheSync and wait until the HasSynced functions of the Pod, Node, History, and DaemonSet informers all return true, i.e. wait for those caches to finish syncing
- Start --concurrent-daemonset-syncs goroutines, each re-running dsc.runWorker on a 1-second interval; dsc.runWorker loops over dsc.processNextWorkItem until it returns false. dsc.processNextWorkItem's logic is:
- Take one ds key off the controller's queue; if the queue has been shut down, return false immediately
- Defer marking the key as done when dsc.processNextWorkItem finishes
- Call dsc.syncHandler (i.e. dsc.syncDaemonSet) to process the key obtained in the first step; if it returns a nil error the sync succeeded, so call dsc.queue.Forget to clear the key's rate-limiting history
- If dsc.syncHandler failed to process the ds, call dsc.queue.AddRateLimited to re-add the key to the controller's queue
- Start one goroutine that runs the failedPodsBackoff GC every minute
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:265

func (dsc *DaemonSetsController) Run(workers int, stopCh <-chan struct{}) {
    defer utilruntime.HandleCrash()
    defer dsc.queue.ShutDown()

    glog.Infof("Starting daemon sets controller")
    defer glog.Infof("Shutting down daemon sets controller")

    // Wait for the pod, node, history, and DaemonSet informer caches to sync.
    if !controller.WaitForCacheSync("daemon sets", stopCh, dsc.podStoreSynced, dsc.nodeStoreSynced, dsc.historyStoreSynced, dsc.dsStoreSynced) {
        return
    }

    // Start `workers` goroutines, each re-running runWorker on a 1s interval.
    for i := 0; i < workers; i++ {
        go wait.Until(dsc.runWorker, time.Second, stopCh)
    }

    // Start one goroutine that runs the failedPodsBackoff GC every minute.
    go wait.Until(dsc.failedPodsBackoff.GC, BackoffGCInterval, stopCh)

    <-stopCh
}

func (dsc *DaemonSetsController) runWorker() {
    for dsc.processNextWorkItem() {
    }
}

// processNextWorkItem deals with one key off the queue. It returns false when it's time to quit.
func (dsc *DaemonSetsController) processNextWorkItem() bool {
    // Take one ds key off the queue; quit is true once the queue shuts down.
    dsKey, quit := dsc.queue.Get()
    if quit {
        return false
    }
    defer dsc.queue.Done(dsKey)

    // syncHandler is dsc.syncDaemonSet; on success, clear the key's rate-limit history.
    err := dsc.syncHandler(dsKey.(string))
    if err == nil {
        dsc.queue.Forget(dsKey)
        return true
    }

    // On failure, re-add the key to the queue with rate limiting.
    utilruntime.HandleError(fmt.Errorf("%v failed with : %v", dsKey, err))
    dsc.queue.AddRateLimited(dsKey)

    return true
}
The dsc.nodeShouldRunDaemonPod method
dsc.nodeShouldRunDaemonPod is the method the controller uses to check whether a ds should schedule a pod onto a node. It returns three results: wantToRun means the node should run the ds's pod, ignoring node conditions that merely prevent scheduling (such as DiskPressure or insufficient resources); shouldSchedule means the ds's pod should be scheduled onto the node; shouldContinueRunning means a ds pod already running on the node should keep running. The controller used this method when registering the node event handlers and will use it several more times, so we read it in detail here first.
- Create a pod from the ds and the node name, assigned to newPod
- Initialize wantToRun, shouldSchedule, and shouldContinueRunning to true
- If ds.Spec.Template.Spec.NodeName is not empty and node.Name does not equal it (the node is not the one the ds selects, so the node need not run the ds), return with wantToRun, shouldSchedule, and shouldContinueRunning all false
- Call dsc.simulate to check the node with a set of scheduling predicates, returning the failure reasons into reasons and the node's information into nodeInfo; dsc.simulate's logic is:
- Fetch all the pods on the node into objects
- Create nodeInfo and set its node
- Loop over all pods in objects; for each pod that does not belong to the ds, call nodeInfo.AddPod to account its resources in nodeInfo, i.e. tally the node's already-used resources
- Call the Predicates function to check with scheduling predicates whether the ds can run on the node; Predicates' logic is:
- If the ScheduleDaemonSetPods feature gate is enabled (Alpha in 1.11, off by default; Beta in 1.12+, on by default), call the checkNodeFitness function, which checks only the PodFitsHost, PodMatchNodeSelector, and PodToleratesNodeTaints predicates, then return the result directly (a sketch of checkNodeFitness follows the Predicates code below)
- Call kubelettypes.IsCriticalPod to check whether this is a critical pod: (1) the PodPriority feature gate is enabled (off by default) and the pod's priority is at least 2000000000; or (2) the ExperimentalCriticalPodAnnotation feature gate is enabled (off by default), the pod's namespace is kube-system, and its annotations contain [scheduler.alpha.kubernetes.io/critical-pod=""]; meeting either condition makes it a critical pod
- First check the pod with the PodToleratesNodeTaints predicate
- If it is a critical pod, call predicates.EssentialPredicates, otherwise call predicates.GeneralPredicates. predicates.EssentialPredicates checks the pod with the PodFitsHost, PodFitsHostPorts, and PodMatchNodeSelector predicates; predicates.GeneralPredicates first checks the pod with the PodFitsResources predicate and then also calls predicates.EssentialPredicates
- Return the result of the predicate checks
- Examine each predicate failure reason in reasons:
- If the failure reason is InsufficientResourceError (the node lacks resources), assign the reason to insufficientResourceErr
- If the failure reason is ErrNodeSelectorNotMatch, ErrPodNotMatchHostName, ErrNodeLabelPresenceViolated, or ErrPodNotFitsHostPorts, return with wantToRun, shouldSchedule, and shouldContinueRunning all false
- If the failure reason is ErrTaintsTolerationsNotMatch, additionally run the predicates.PodToleratesNodeNoExecuteTaints predicate against the NoExecute taints; if the tolerations do not cover all NoExecute taints, return with wantToRun, shouldSchedule, and shouldContinueRunning all false; if they do, set wantToRun and shouldSchedule to false and continue checking the remaining reasons
- If the failure reason is ErrDiskConflict, ErrVolumeZoneConflict, ErrMaxVolumeCountExceeded, ErrNodeUnderMemoryPressure, or ErrNodeUnderDiskPressure, set shouldSchedule to false
- If the failure reason is ErrPodAffinityNotMatch or ErrServiceAffinityViolated, return with wantToRun, shouldSchedule, and shouldContinueRunning all false, along with an error, since these reasons are unexpected here
- After checking all reasons, look at shouldSchedule and insufficientResourceErr: if shouldSchedule is true and insufficientResourceErr is not nil, set shouldSchedule to false (a predicate failed with InsufficientResourceError, so the node cannot run the ds either)
For the details of the scheduling predicates, see scheduler_algorithm, or read the code in pkg/scheduler/algorithm/predicates/predicates.go.
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1327

func (dsc *DaemonSetsController) nodeShouldRunDaemonPod(node *v1.Node, ds *apps.DaemonSet) (wantToRun, shouldSchedule, shouldContinueRunning bool, err error) {
    // Build a pod from the ds and the node name.
    newPod := NewPod(ds, node.Name)

    // Because these bools require an && of all their required conditions, we start
    // with all bools set to true and set a bool to false if a condition is not met.
    // A bool should probably not be set to true after this line.
    wantToRun, shouldSchedule, shouldContinueRunning = true, true, true
    // If the daemon set specifies a node name, check that it matches with node.Name.
    if !(ds.Spec.Template.Spec.NodeName == "" || ds.Spec.Template.Spec.NodeName == node.Name) {
        return false, false, false, nil
    }

    // Run the scheduling predicates against the node; the failure reasons go
    // into reasons and the node's information into nodeInfo.
    reasons, nodeInfo, err := dsc.simulate(newPod, node, ds)
    if err != nil {
        glog.Warningf("DaemonSet Predicates failed on node %s for ds '%s/%s' due to unexpected error: %v", node.Name, ds.ObjectMeta.Namespace, ds.ObjectMeta.Name, err)
        return false, false, false, err
    }

    // Walk the predicate failure reasons; see the bullet list above for the
    // handling of each group of reasons.
    // TODO(k82cn): When 'ScheduleDaemonSetPods' upgrade to beta or GA, remove unnecessary check on failure reason,
    // e.g. InsufficientResourceError; and simplify "wantToRun, shouldSchedule, shouldContinueRunning"
    // into one result, e.g. selectedNode.
    var insufficientResourceErr error
    for _, r := range reasons {
        glog.V(4).Infof("DaemonSet Predicates failed on node %s for ds '%s/%s' for reason: %v", node.Name, ds.ObjectMeta.Namespace, ds.ObjectMeta.Name, r.GetReason())
        switch reason := r.(type) {
        case *predicates.InsufficientResourceError:
            insufficientResourceErr = reason
        case *predicates.PredicateFailureError:
            var emitEvent bool
            // we try to partition predicates into two partitions here: intentional on the part of the operator and not.
            switch reason {
            // intentional
            case
                predicates.ErrNodeSelectorNotMatch,
                predicates.ErrPodNotMatchHostName,
                predicates.ErrNodeLabelPresenceViolated,
                // this one is probably intentional since it's a workaround for not having
                // pod hard anti affinity.
                predicates.ErrPodNotFitsHostPorts:
                return false, false, false, nil
            case predicates.ErrTaintsTolerationsNotMatch:
                // DaemonSet is expected to respect taints and tolerations
                fitsNoExecute, _, err := predicates.PodToleratesNodeNoExecuteTaints(newPod, nil, nodeInfo)
                if err != nil {
                    return false, false, false, err
                }
                if !fitsNoExecute {
                    return false, false, false, nil
                }
                wantToRun, shouldSchedule = false, false
            // unintentional
            case
                predicates.ErrDiskConflict,
                predicates.ErrVolumeZoneConflict,
                predicates.ErrMaxVolumeCountExceeded,
                predicates.ErrNodeUnderMemoryPressure,
                predicates.ErrNodeUnderDiskPressure:
                // wantToRun and shouldContinueRunning are likely true here. They are
                // absolutely true at the time of writing the comment. See first comment
                // of this method.
                shouldSchedule = false
                emitEvent = true
            // unexpected
            case
                predicates.ErrPodAffinityNotMatch,
                predicates.ErrServiceAffinityViolated:
                glog.Warningf("unexpected predicate failure reason: %s", reason.GetReason())
                return false, false, false, fmt.Errorf("unexpected reason: DaemonSet Predicates should not return reason %s", reason.GetReason())
            default:
                glog.V(4).Infof("unknown predicate failure reason: %s", reason.GetReason())
                wantToRun, shouldSchedule, shouldContinueRunning = false, false, false
                emitEvent = true
            }
            if emitEvent {
                dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedPlacementReason, "failed to place pod on %q: %s", node.ObjectMeta.Name, reason.GetReason())
            }
        }
    }

    // only emit this event if insufficient resource is the only thing
    // preventing the daemon pod from scheduling
    if shouldSchedule && insufficientResourceErr != nil {
        dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedPlacementReason, "failed to place pod on %q: %s", node.ObjectMeta.Name, insufficientResourceErr.Error())
        shouldSchedule = false
    }
    return
}
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1289

func (dsc *DaemonSetsController) simulate(newPod *v1.Pod, node *v1.Node, ds *apps.DaemonSet) ([]algorithm.PredicateFailureReason, *schedulercache.NodeInfo, error) {
    // Fetch all pods on the node via the nodeName index.
    objects, err := dsc.podNodeIndex.ByIndex("nodeName", node.Name)
    if err != nil {
        return nil, nil, err
    }

    // Create nodeInfo and set its node.
    nodeInfo := schedulercache.NewNodeInfo()
    nodeInfo.SetNode(node)

    // Account every pod that does not belong to this ds into nodeInfo, i.e.
    // tally the node's already-used resources.
    for _, obj := range objects {
        // Ignore pods that belong to the daemonset when taking into account whether a daemonset should bind to a node.
        // TODO: replace this with metav1.IsControlledBy() in 1.12
        pod, ok := obj.(*v1.Pod)
        if !ok {
            continue
        }
        if isControlledByDaemonSet(pod, ds.GetUID()) {
            continue
        }
        nodeInfo.AddPod(pod)
    }

    // Run the scheduling predicates to check whether the ds can run on the node.
    _, reasons, err := Predicates(newPod, nodeInfo)
    return reasons, nodeInfo, err
}
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1461

func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason

    // If ScheduleDaemonSetPods is enabled, only check nodeSelector and nodeAffinity.
    // checkNodeFitness runs just PodFitsHost, PodMatchNodeSelector, and
    // PodToleratesNodeTaints, and its result is returned directly.
    if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
        fit, reasons, err := checkNodeFitness(pod, nil, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }
        return len(predicateFails) == 0, predicateFails, nil
    }

    // Check whether this is a critical pod; see the bullet list above for the
    // exact feature-gate and annotation conditions.
    critical := kubelettypes.IsCriticalPod(pod)

    // First check the pod with the PodToleratesNodeTaints predicate.
    fit, reasons, err := predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
    if err != nil {
        return false, predicateFails, err
    }
    if !fit {
        predicateFails = append(predicateFails, reasons...)
    }

    // Critical pods go through EssentialPredicates only (PodFitsHost,
    // PodFitsHostPorts, PodMatchNodeSelector); others go through
    // GeneralPredicates, which adds PodFitsResources on top of those.
    if critical {
        // If the pod is marked as critical and support for critical pod annotations is enabled,
        // check predicates for critical pods only.
        fit, reasons, err = predicates.EssentialPredicates(pod, nil, nodeInfo)
    } else {
        fit, reasons, err = predicates.GeneralPredicates(pod, nil, nodeInfo)
    }
    if err != nil {
        return false, predicateFails, err
    }
    if !fit {
        predicateFails = append(predicateFails, reasons...)
    }

    // Return the result of the predicate checks.
    return len(predicateFails) == 0, predicateFails, nil
}
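checkNodeFitness is not quoted above. A paraphrased sketch of it, following the same accumulate-failures pattern as Predicates (treat the exact signature as an approximation):

// Sketch of checkNodeFitness: with ScheduleDaemonSetPods enabled, only
// PodFitsHost, PodMatchNodeSelector, and PodToleratesNodeTaints are run;
// resource fit is left to the default scheduler.
func checkNodeFitness(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
    var predicateFails []algorithm.PredicateFailureReason
    fit, reasons, err := predicates.PodFitsHost(pod, meta, nodeInfo)
    if err != nil {
        return false, predicateFails, err
    }
    if !fit {
        predicateFails = append(predicateFails, reasons...)
    }

    fit, reasons, err = predicates.PodMatchNodeSelector(pod, meta, nodeInfo)
    if err != nil {
        return false, predicateFails, err
    }
    if !fit {
        predicateFails = append(predicateFails, reasons...)
    }

    fit, reasons, err = predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
    if err != nil {
        return false, predicateFails, err
    }
    if !fit {
        predicateFails = append(predicateFails, reasons...)
    }
    return len(predicateFails) == 0, predicateFails, nil
}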
The dsc.syncDaemonSet method
dsc.syncDaemonSet is the controller's main logic for processing a ds. The processing is:
- Split the incoming key into namespace and name
- Use the namespace and name from the previous step to fetch the ds via dsc.dsLister.DaemonSets
- Check whether ds.Spec.Selector is empty (an empty selector would select all pods); if so, record a warning event and return nil
- Check whether the ds's key (its namespace and name) can be obtained; if that errors, return the error
- Check the ds's DeletionTimestamp; if it is not nil, the ds is being deleted, so return nil directly
- Call dsc.constructHistory to get the current history (the one whose spec matches the ds) into cur, creating one if none is found, and de-duplicating down to one if several match; also collect all histories other than cur into old, then read cur's hash value
- Call dsc.expectations.SatisfiedExpectations to check whether the ds's expectations are satisfied; if not, only call dsc.updateDaemonSetStatus to update the ds's status and return (a toy illustration of the expectations mechanism follows the code below)
- Call dsc.manage to manage the ds: check the running state of the ds's pods on each node and take the follow-up actions
- Call dsc.expectations.SatisfiedExpectations again; if satisfied, update the ds according to UpdateStrategy.Type:
- If UpdateStrategy.Type is OnDelete, do nothing
- If UpdateStrategy.Type is RollingUpdate, call dsc.rollingUpdate to update the ds
- Call dsc.cleanupHistory to delete surplus histories according to the ds's Spec.RevisionHistoryLimit
- Call dsc.updateDaemonSetStatus to update the ds's status, then return
As you can see, dsc.syncDaemonSet mainly calls dsc.updateDaemonSetStatus to update the DaemonSet's status, dsc.manage to manage the ds, dsc.rollingUpdate to roll out updates to the DaemonSet, and dsc.cleanupHistory to clean up surplus histories. Let's look at these methods in detail next.
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1206

func (dsc *DaemonSetsController) syncDaemonSet(key string) error {
    startTime := time.Now()
    defer func() {
        glog.V(4).Infof("Finished syncing daemon set %q (%v)", key, time.Since(startTime))
    }()

    // Split the key into namespace and name.
    namespace, name, err := cache.SplitMetaNamespaceKey(key)
    if err != nil {
        return err
    }

    // Fetch the ds from the lister by namespace and name.
    ds, err := dsc.dsLister.DaemonSets(namespace).Get(name)
    if errors.IsNotFound(err) {
        glog.V(3).Infof("daemon set has been deleted %v", key)
        dsc.expectations.DeleteExpectations(key)
        return nil
    }
    if err != nil {
        return fmt.Errorf("unable to retrieve ds %v from store: %v", key, err)
    }

    // An empty selector would select all pods: record an event and return nil.
    everything := metav1.LabelSelector{}
    if reflect.DeepEqual(ds.Spec.Selector, &everything) {
        dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, SelectingAllReason, "This daemon set is selecting all pods. A non-empty selector is required.")
        return nil
    }

    // Make sure the ds's key (namespace/name) can be obtained; error out otherwise.
    // Don't process a daemon set until all its creations and deletions have been processed.
    // For example if daemon set foo asked for 3 new daemon pods in the previous call to manage,
    // then we do not want to call manage on foo until the daemon pods have been created.
    dsKey, err := controller.KeyFunc(ds)
    if err != nil {
        return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
    }

    // A non-nil DeletionTimestamp means the ds is being deleted: return nil.
    // If the DaemonSet is being deleted (either by foreground deletion or
    // orphan deletion), we cannot be sure if the DaemonSet history objects
    // it owned still exist -- those history objects can either be deleted
    // or orphaned. Garbage collector doesn't guarantee that it will delete
    // DaemonSet pods before deleting DaemonSet history objects, because
    // DaemonSet history doesn't own DaemonSet pods. We cannot reliably
    // calculate the status of a DaemonSet being deleted. Therefore, return
    // here without updating status for the DaemonSet being deleted.
    if ds.DeletionTimestamp != nil {
        return nil
    }

    // Construct histories of the DaemonSet, and get the hash of current history
    cur, old, err := dsc.constructHistory(ds)
    if err != nil {
        return fmt.Errorf("failed to construct revisions of DaemonSet: %v", err)
    }
    hash := cur.Labels[apps.DefaultDaemonSetUniqueLabelKey]

    // If the expectations are not satisfied, only update the status.
    if !dsc.expectations.SatisfiedExpectations(dsKey) {
        // Only update status.
        return dsc.updateDaemonSetStatus(ds, hash)
    }

    // Manage the ds: check its pods' running state on every node and act on it.
    err = dsc.manage(ds, hash)
    if err != nil {
        return err
    }

    // Process rolling updates if we're ready: OnDelete needs no action here;
    // RollingUpdate goes through dsc.rollingUpdate.
    if dsc.expectations.SatisfiedExpectations(dsKey) {
        switch ds.Spec.UpdateStrategy.Type {
        case apps.OnDeleteDaemonSetStrategyType:
        case apps.RollingUpdateDaemonSetStrategyType:
            err = dsc.rollingUpdate(ds, hash)
        }
        if err != nil {
            return err
        }
    }

    // Delete surplus histories according to Spec.RevisionHistoryLimit.
    err = dsc.cleanupHistory(ds, old)
    if err != nil {
        return fmt.Errorf("failed to clean up revisions of DaemonSet: %v", err)
    }

    // Finally, update the ds's status.
    return dsc.updateDaemonSetStatus(ds, hash)
}
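To make the SatisfiedExpectations gating concrete, here is a deliberately simplified, standalone Go illustration of the idea (this is a toy, not the real controller.ControllerExpectations, which also handles TTL expiry and per-key storage): the sync loop records how many creates/deletes it issued, each observed pod event decrements a counter, and manage() is skipped until both counters drain.

package main

import "fmt"

// toyExpectations is a simplified stand-in for ControllerExpectations.
type toyExpectations struct {
    adds, dels int
}

func (e *toyExpectations) setExpectations(adds, dels int) { e.adds, e.dels = adds, dels }
func (e *toyExpectations) creationObserved()              { e.adds-- }
func (e *toyExpectations) deletionObserved()              { e.dels-- }
func (e *toyExpectations) satisfied() bool                { return e.adds <= 0 && e.dels <= 0 }

func main() {
    e := &toyExpectations{}
    e.setExpectations(2, 0)    // syncNodes decided to create 2 pods
    fmt.Println(e.satisfied()) // false: this sync only updates status
    e.creationObserved()       // addPod observed the first new pod
    e.creationObserved()       // addPod observed the second new pod
    fmt.Println(e.satisfied()) // true: the next sync may call manage() again
}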
The dsc.updateDaemonSetStatus method
dsc.updateDaemonSetStatus, as the name suggests, checks the ds's state in the cluster and updates the ds's status. Its logic is:
- Call dsc.getNodesToDaemonPods to build a map, keyed by node name, of the ds's currently running pods and the nodes they are on. It works by creating a PodControllerRefManager from ds.Spec.Selector and controllerKind, calling its ClaimPods method to claim the pods, then grouping the claimed pods into nodeToDaemonPods by the node name each pod runs on (a sketch follows the code below; note: the replicaset-controller source reading walks through PodControllerRefManager's ClaimPods method in detail: https://my.oschina.net/u/3797264/blog/2985926)
- List all nodes in the cluster into nodeList
- Loop over the nodeList obtained in the previous step:
- Call dsc.nodeShouldRunDaemonPod to check whether the ds should run a pod on the node, assigning the result to the variable wantToRun
- Check whether the node is running one of the ds's pods, assigning the result to the variable scheduled
- From wantToRun, scheduled, and the pods' Ready/Available states, compute the various counters that go into the status; the computation code is very clear, so it is not listed item by item here
- Call the storeDaemonSetStatus function to check whether the ds's current status matches the computed one; if not, call the API to update the ds's status
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1145

func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, hash string) error {
    glog.V(4).Infof("Updating daemon set status")
    // Build the map, keyed by node name, of the ds's pods and the nodes they run on.
    nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
    if err != nil {
        return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
    }

    // List all nodes in the cluster into nodeList.
    nodeList, err := dsc.nodeLister.List(labels.Everything())
    if err != nil {
        return fmt.Errorf("couldn't get list of nodes when updating daemon set %#v: %v", ds, err)
    }

    // For every node: compute wantToRun and scheduled, then derive the status
    // counters from them and the pods' Ready/Available states.
    var desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable int
    for _, node := range nodeList {
        wantToRun, _, _, err := dsc.nodeShouldRunDaemonPod(node, ds)
        if err != nil {
            return err
        }

        scheduled := len(nodeToDaemonPods[node.Name]) > 0

        if wantToRun {
            desiredNumberScheduled++
            if scheduled {
                currentNumberScheduled++
                // Sort the daemon pods by creation time, so that the oldest is first.
                daemonPods, _ := nodeToDaemonPods[node.Name]
                sort.Sort(podByCreationTimestampAndPhase(daemonPods))
                pod := daemonPods[0]
                if podutil.IsPodReady(pod) {
                    numberReady++
                    if podutil.IsPodAvailable(pod, ds.Spec.MinReadySeconds, metav1.Now()) {
                        numberAvailable++
                    }
                }
                // If the returned error is not nil we have a parse error.
                // The controller handles this via the hash.
                generation, err := util.GetTemplateGeneration(ds)
                if err != nil {
                    generation = nil
                }
                if util.IsPodUpdated(pod, hash, generation) {
                    updatedNumberScheduled++
                }
            }
        } else {
            if scheduled {
                numberMisscheduled++
            }
        }
    }
    numberUnavailable := desiredNumberScheduled - numberAvailable

    // Compare the computed counters against the current status and call the API
    // to update the ds's status when they differ.
    err = storeDaemonSetStatus(dsc.kubeClient.AppsV1().DaemonSets(ds.Namespace), ds, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable)
    if err != nil {
        return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
    }

    return nil
}
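A paraphrased sketch of getNodesToDaemonPods (the pod claiming itself happens in a getDaemonPods helper built around PodControllerRefManager; only the grouping is shown here):

// Sketch of dsc.getNodesToDaemonPods: claim the pods matching the DaemonSet's
// selector, then group them by the node each pod is bound to.
func (dsc *DaemonSetsController) getNodesToDaemonPods(ds *apps.DaemonSet) (map[string][]*v1.Pod, error) {
    claimedPods, err := dsc.getDaemonPods(ds)
    if err != nil {
        return nil, err
    }
    // Group the claimed pods by spec.nodeName.
    nodeToDaemonPods := make(map[string][]*v1.Pod)
    for _, pod := range claimedPods {
        nodeName := pod.Spec.NodeName
        nodeToDaemonPods[nodeName] = append(nodeToDaemonPods[nodeName], pod)
    }
    return nodeToDaemonPods, nil
}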
The dsc.manage method
dsc.manage manages the ds's pods, checking which nodes need a ds pod scheduled or deleted. Its logic is:
- Call dsc.getNodesToDaemonPods to build the map, keyed by node name, of the ds's currently running pods and their nodes
- List all nodes in the cluster into nodeList
- For every node in nodeList:
- Call dsc.podsShouldBeOnNode to check whether the node needs a ds pod created (if so, its name goes into nodesNeedingDaemonPodsOnNode), which of the ds's pods on the node need deleting (into podsToDeleteOnNode), and how many of the ds's pods failed on the node (into failedPodsObservedOnNode)
- Aggregate nodesNeedingDaemonPodsOnNode into nodesNeedingDaemonPods (all the nodes that need a ds pod created), podsToDeleteOnNode into podsToDelete (all of the ds's pods that need deleting), and failedPodsObservedOnNode into failedPodsObserved (the ds's current number of failed pods)
- Call dsc.syncNodes with nodesNeedingDaemonPods and podsToDelete to create pods on the corresponding nodes and delete the pods that need deleting
- Check failedPodsObserved; if it is greater than 0 (the ds has failed pods), return an error so the queue's rate limiter requeues the ds instead of hot-looping on kill-and-recreate
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:941

func (dsc *DaemonSetsController) manage(ds *apps.DaemonSet, hash string) error {
    // Build the map, keyed by node name, of the ds's pods and the nodes they run on.
    // Find out the pods which are created for the nodes by DaemonSet.
    nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
    if err != nil {
        return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
    }

    // For each node, if the node is running the daemon pod but isn't supposed to, kill the daemon
    // pod. If the node is supposed to run the daemon pod, but isn't, create the daemon pod on the node.
    nodeList, err := dsc.nodeLister.List(labels.Everything())
    if err != nil {
        return fmt.Errorf("couldn't get list of nodes when syncing daemon set %#v: %v", ds, err)
    }

    // For every node, call podsShouldBeOnNode and aggregate the per-node results
    // into nodesNeedingDaemonPods, podsToDelete, and failedPodsObserved.
    var nodesNeedingDaemonPods, podsToDelete []string
    var failedPodsObserved int
    for _, node := range nodeList {
        nodesNeedingDaemonPodsOnNode, podsToDeleteOnNode, failedPodsObservedOnNode, err := dsc.podsShouldBeOnNode(
            node, nodeToDaemonPods, ds)
        if err != nil {
            continue
        }

        nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, nodesNeedingDaemonPodsOnNode...)
        podsToDelete = append(podsToDelete, podsToDeleteOnNode...)
        failedPodsObserved += failedPodsObservedOnNode
    }

    // Create pods on the nodes that need one and delete the pods marked for deletion.
    // Label new pods using the hash label value of the current history when creating them
    if err = dsc.syncNodes(ds, podsToDelete, nodesNeedingDaemonPods, hash); err != nil {
        return err
    }

    // Throw an error when the daemon pods fail, to use ratelimiter to prevent kill-recreate hot loop
    if failedPodsObserved > 0 {
        return fmt.Errorf("deleted %d failed pods of DaemonSet %s/%s", failedPodsObserved, ds.Namespace, ds.Name)
    }

    return nil
}
The dsc.podsShouldBeOnNode method
dsc.podsShouldBeOnNode decides, from the running state of the ds's pods on a node, whether the node should have a ds pod scheduled, deleted, or otherwise handled. Its logic is:
- Call dsc.nodeShouldRunDaemonPod to get whether the ds should run on the node into the variable wantToRun, whether it should be scheduled onto the node into shouldSchedule, and whether a ds pod running on the node should keep running into shouldContinueRunning
- Check whether the node is running the ds's pods, assigning the pods to the variable daemonPods and their existence to the variable exists
- Call dsc.removeSuspendedDaemonPods to remove the ds from the suspendedDaemonPods entry for this node
- Inspect wantToRun, shouldSchedule, shouldContinueRunning, daemonPods, and exists to decide whether the node should run the ds's pod (result into nodesNeedingDaemonPods), whether the ds pods running on the node need deleting, and how many of the ds's pods failed on the node:
- If wantToRun is true but shouldSchedule is false, the node should run the ds but cannot schedule it (because of insufficient node resources or other problems), so call dsc.addSuspendedDaemonPods to add node.Name and the ds to the controller's suspendedDaemonPods map
- If shouldSchedule is true but exists is false, the node can schedule and run the ds but is not yet running it, so append node.Name to the nodesNeedingDaemonPods slice so a ds pod is created on the node later
- If shouldContinueRunning is true, the node already has ds pods that should keep running, so check the state of each pod in daemonPods and handle it as below; after checking all daemonPods, check how many pods are in daemonPodsRunning: if more than one, keep the earliest-created pod in daemonPodsRunning and add all the others to podsToDelete
- If the pod's DeletionTimestamp is not nil, the pod is about to be deleted, so skip it; the pod event raised after deletion will be handled then
- If the pod failed, increment failedPodsObserved, then check the pod's backoff key to decide whether to delete the pod; the delay before deleting the same failed pod again grows exponentially (1s, 2s, 4s, 8s, ..., capped at 15 minutes; illustrated after the code below)
- If the pod's status phase is not Failed, append the pod to daemonPodsRunning
- If shouldContinueRunning is false but exists is true, the node should not run the ds's pods but actually does, so add every pod in daemonPods to podsToDelete
- Return nodesNeedingDaemonPods (the nodes needing a ds pod created), podsToDelete (the ds pods to delete on the node), and failedPodsObserved (how many of the ds's pods failed on this node)
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:860

func (dsc *DaemonSetsController) podsShouldBeOnNode(
    node *v1.Node,
    nodeToDaemonPods map[string][]*v1.Pod,
    ds *apps.DaemonSet,
) (nodesNeedingDaemonPods, podsToDelete []string, failedPodsObserved int, err error) {
    // Ask nodeShouldRunDaemonPod whether the ds should run, be scheduled, or
    // keep running on this node.
    wantToRun, shouldSchedule, shouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(node, ds)
    if err != nil {
        return
    }

    // Check whether the node is running the ds's pods.
    daemonPods, exists := nodeToDaemonPods[node.Name]
    dsKey, _ := cache.MetaNamespaceKeyFunc(ds)

    // Remove this ds from the suspendedDaemonPods entry for this node.
    dsc.removeSuspendedDaemonPods(node.Name, dsKey)

    // Decide from wantToRun, shouldSchedule, shouldContinueRunning, daemonPods,
    // and exists what to do on this node; see the bullet list above for the
    // meaning of each case.
    switch {
    case wantToRun && !shouldSchedule:
        // If daemon pod is supposed to run, but can not be scheduled, add to suspended list.
        dsc.addSuspendedDaemonPods(node.Name, dsKey)
    case shouldSchedule && !exists:
        // If daemon pod is supposed to be running on node, but isn't, create daemon pod.
        nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, node.Name)
    case shouldContinueRunning:
        // If a daemon pod failed, delete it
        // If there's non-daemon pods left on this node, we will create it in the next sync loop
        var daemonPodsRunning []*v1.Pod
        for _, pod := range daemonPods {
            if pod.DeletionTimestamp != nil {
                continue
            }
            if pod.Status.Phase == v1.PodFailed {
                failedPodsObserved++

                // This is a critical place where DS is often fighting with kubelet that rejects pods.
                // We need to avoid hot looping and backoff.
                backoffKey := failedPodsBackoffKey(ds, node.Name)

                now := dsc.failedPodsBackoff.Clock.Now()
                inBackoff := dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now)
                if inBackoff {
                    delay := dsc.failedPodsBackoff.Get(backoffKey)
                    glog.V(4).Infof("Deleting failed pod %s/%s on node %s has been limited by backoff - %v remaining",
                        pod.Namespace, pod.Name, node.Name, delay)
                    dsc.enqueueDaemonSetAfter(ds, delay)
                    continue
                }

                dsc.failedPodsBackoff.Next(backoffKey, now)

                msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
                glog.V(2).Infof(msg)
                // Emit an event so that it's discoverable to users.
                dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
                podsToDelete = append(podsToDelete, pod.Name)
            } else {
                daemonPodsRunning = append(daemonPodsRunning, pod)
            }
        }
        // If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
        // Sort the daemon pods by creation time, so the oldest is preserved.
        if len(daemonPodsRunning) > 1 {
            sort.Sort(podByCreationTimestampAndPhase(daemonPodsRunning))
            for i := 1; i < len(daemonPodsRunning); i++ {
                podsToDelete = append(podsToDelete, daemonPodsRunning[i].Name)
            }
        }
    case !shouldContinueRunning && exists:
        // If daemon pod isn't supposed to run on node, but it is, delete all daemon pods on node.
        for _, pod := range daemonPods {
            podsToDelete = append(podsToDelete, pod.Name)
        }
    }

    // Return the nodes needing a ds pod, the pods to delete, and the failed pod count.
    return nodesNeedingDaemonPods, podsToDelete, failedPodsObserved, nil
}
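The exponential growth of the failed-pod backoff is easy to verify with a standalone illustration (this is not controller code, just the arithmetic of a 1s initial duration doubling up to a 15m cap, matching the flowcontrol.NewBackOff(1*time.Second, 15*time.Minute) call seen at startup):

package main

import (
    "fmt"
    "time"
)

func main() {
    delay, max := 1*time.Second, 15*time.Minute
    for i := 0; i < 12; i++ {
        fmt.Println(delay) // 1s, 2s, 4s, 8s, ... clamped at 15m
        delay *= 2
        if delay > max {
            delay = max
        }
    }
}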
The dsc.syncNodes method

dsc.syncNodes takes the nodesNeedingDaemonPods and podsToDelete computed by dsc.podsShouldBeOnNode in the previous step, creates ds pods on the corresponding nodes and deletes the pods listed in podsToDelete. The logic is:
- Take the length of nodesNeedingDaemonPods as createDiff and the length of podsToDelete as deleteDiff; cap either at 250 (burstReplicas), i.e. a single sync creates or deletes at most 250 pods
- Call dsc.expectations.SetExpectations to record how many pods this sync will create and delete
- Read the ds generation from the "deprecated.daemonset.template.generation" annotation of the ds
- Call util.CreatePodTemplate to build the pod Template, which works as follows:
  - Check whether this is a Critical pod. A pod is critical only when all three conditions hold: a. the cluster has the ExperimentalCriticalPodAnnotation feature gate enabled (off by default); b. the pod's namespace is kube-system; c. its annotations contain scheduler.alpha.kubernetes.io/critical-pod=""
  - Add Tolerations to the pod: two tolerations with keys "node.kubernetes.io/not-ready" and "node.kubernetes.io/unreachable", Operator "Exists", Effect "NoExecute"; tolerations with keys "node.kubernetes.io/disk-pressure", "node.kubernetes.io/memory-pressure" and "node.kubernetes.io/unschedulable", Operator "Exists", Effect "NoSchedule"; if the ds uses HostNetwork, an additional toleration with key "node.kubernetes.io/network-unavailable", Operator "Exists", Effect "NoSchedule"; and if this is a Critical pod, two more tolerations with key "node.kubernetes.io/out-of-disk", Operator "Exists", Effects "NoSchedule" and "NoExecute" (see the sketch after this list)
  - If a generation and hash are available, add the "pod-template-generation" and "controller-revision-hash" labels with the corresponding values to the pod
- Create pods according to createDiff; two details of the creation loop are worth noting:
  - The first batch starts one goroutine creating one pod; each subsequent batch doubles in size (1, 2, 4, 8... pods), and a batch only starts after the previous batch's goroutines have finished
  - If the cluster has the ScheduleDaemonSetPods feature gate enabled (off by default in 1.11, on by default since 1.12), each pod gets a requiredDuringSchedulingIgnoredDuringExecution nodeAffinity added before creation, whose matchFields entry uses key metadata.name, operator In and the target nodename as value (see the sketch after the syncNodes listing below)
- Call dsc.podControl.DeletePod, starting one goroutine per pod to delete the pods in podsToDelete
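To make the toleration set above easier to scan, here is a minimal sketch that builds the same list as plain k8s.io/api/core/v1 values. The helper defaultDaemonPodTolerations is made up for the example; the real values are injected inside util.CreatePodTemplate:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// defaultDaemonPodTolerations illustrates the toleration set described above.
func defaultDaemonPodTolerations(hostNetwork, critical bool) []v1.Toleration {
	ts := []v1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/unreachable", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/disk-pressure", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
		{Key: "node.kubernetes.io/memory-pressure", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
		{Key: "node.kubernetes.io/unschedulable", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
	}
	if hostNetwork {
		ts = append(ts, v1.Toleration{Key: "node.kubernetes.io/network-unavailable", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule})
	}
	if critical {
		ts = append(ts,
			v1.Toleration{Key: "node.kubernetes.io/out-of-disk", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
			v1.Toleration{Key: "node.kubernetes.io/out-of-disk", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoExecute},
		)
	}
	return ts
}

func main() {
	for _, t := range defaultDaemonPodTolerations(true, false) {
		fmt.Printf("%s %s %s\n", t.Key, t.Operator, t.Effect)
	}
}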
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:984

func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nodesNeedingDaemonPods []string, hash string) error {
	// We need to set expectations before creating/deleting pods to avoid race conditions.
	dsKey, err := controller.KeyFunc(ds)
	if err != nil {
		return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
	}

	// Take len(nodesNeedingDaemonPods) as createDiff and len(podsToDelete) as deleteDiff, capping both
	// at burstReplicas (250): at most 250 pods are created or deleted per sync.
	createDiff := len(nodesNeedingDaemonPods)
	deleteDiff := len(podsToDelete)

	if createDiff > dsc.burstReplicas {
		createDiff = dsc.burstReplicas
	}
	if deleteDiff > dsc.burstReplicas {
		deleteDiff = dsc.burstReplicas
	}

	// Record how many pods this sync will create and delete.
	dsc.expectations.SetExpectations(dsKey, createDiff, deleteDiff)

	// error channel to communicate back failures.  make the buffer big enough to avoid any blocking
	errCh := make(chan error, createDiff+deleteDiff)

	glog.V(4).Infof("Nodes needing daemon pods for daemon set %s: %+v, creating %d", ds.Name, nodesNeedingDaemonPods, createDiff)
	createWait := sync.WaitGroup{}
	// If the returned error is not nil we have a parse error.
	// The controller handles this via the hash.
	// Read the ds generation from the "deprecated.daemonset.template.generation" annotation.
	generation, err := util.GetTemplateGeneration(ds)
	if err != nil {
		generation = nil
	}
	// Build the pod Template via util.CreatePodTemplate: it checks whether the pod is critical,
	// adds the default tolerations described above, and attaches the "pod-template-generation"
	// and "controller-revision-hash" labels when generation and hash are available.
	template := util.CreatePodTemplate(ds.Namespace, ds.Spec.Template, generation, hash)
	// Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize
	// and double with each successful iteration in a kind of "slow start".
	// This handles attempts to start large numbers of pods that would
	// likely all fail with the same error. For example a project with a
	// low quota that attempts to create a large number of pods will be
	// prevented from spamming the API service with the pod create requests
	// after one of its pods fails. Conveniently, this also prevents the
	// event spam that those failures would generate.
	//
	// Create pods according to createDiff. Two details to note: 1. the first batch starts one
	// goroutine creating one pod, later batches double in size (1, 2, 4, 8...), and each batch
	// waits for the previous one to finish; 2. with the ScheduleDaemonSetPods feature gate enabled
	// (off by default in 1.11, on by default since 1.12), each pod gets a
	// requiredDuringSchedulingIgnoredDuringExecution nodeAffinity (matchFields on metadata.name,
	// operator In, value the target nodename) before creation.
	batchSize := integer.IntMin(createDiff, controller.SlowStartInitialBatchSize)
	for pos := 0; createDiff > pos; batchSize, pos = integer.IntMin(2*batchSize, createDiff-(pos+batchSize)), pos+batchSize {
		errorCount := len(errCh)
		createWait.Add(batchSize)
		for i := pos; i < pos+batchSize; i++ {
			go func(ix int) {
				defer createWait.Done()
				var err error

				podTemplate := &template

				if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
					podTemplate = template.DeepCopy()
					// The pod's NodeAffinity will be updated to make sure the Pod is bound
					// to the target node by default scheduler. It is safe to do so because there
					// should be no conflicting node affinity with the target node.
					podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
						podTemplate.Spec.Affinity, nodesNeedingDaemonPods[ix])

					err = dsc.podControl.CreatePodsWithControllerRef(ds.Namespace, podTemplate,
						ds, metav1.NewControllerRef(ds, controllerKind))
				} else {
					err = dsc.podControl.CreatePodsOnNode(nodesNeedingDaemonPods[ix], ds.Namespace, podTemplate,
						ds, metav1.NewControllerRef(ds, controllerKind))
				}

				if err != nil && errors.IsTimeout(err) {
					// Pod is created but its initialization has timed out.
					// If the initialization is successful eventually, the
					// controller will observe the creation via the informer.
					// If the initialization fails, or if the pod keeps
					// uninitialized for a long time, the informer will not
					// receive any update, and the controller will create a new
					// pod when the expectation expires.
					return
				}
				if err != nil {
					glog.V(2).Infof("Failed creation, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
					dsc.expectations.CreationObserved(dsKey)
					errCh <- err
					utilruntime.HandleError(err)
				}
			}(i)
		}
		createWait.Wait()
		// any skipped pods that we never attempted to start shouldn't be expected.
		skippedPods := createDiff - batchSize
		if errorCount < len(errCh) && skippedPods > 0 {
			glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for set %q/%q", skippedPods, ds.Namespace, ds.Name)
			for i := 0; i < skippedPods; i++ {
				dsc.expectations.CreationObserved(dsKey)
			}
			// The skipped pods will be retried later. The next controller resync will
			// retry the slow start process.
			break
		}
	}

	// Delete the pods in podsToDelete via dsc.podControl.DeletePod, one goroutine per pod.
	glog.V(4).Infof("Pods to delete for daemon set %s: %+v, deleting %d", ds.Name, podsToDelete, deleteDiff)
	deleteWait := sync.WaitGroup{}
	deleteWait.Add(deleteDiff)
	for i := 0; i < deleteDiff; i++ {
		go func(ix int) {
			defer deleteWait.Done()
			if err := dsc.podControl.DeletePod(ds.Namespace, podsToDelete[ix], ds); err != nil {
				glog.V(2).Infof("Failed deletion, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
				dsc.expectations.DeletionObserved(dsKey)
				errCh <- err
				utilruntime.HandleError(err)
			}
		}(i)
	}
	deleteWait.Wait()

	// collect errors if any for proper reporting/retry logic in the controller
	errors := []error{}
	close(errCh)
	for err := range errCh {
		errors = append(errors, err)
	}
	return utilerrors.NewAggregate(errors)
}
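For the ScheduleDaemonSetPods path, the node affinity that util.ReplaceDaemonSetPodNodeNameNodeAffinity pins onto the pod looks roughly like the following sketch. nodeNameAffinity is a made-up helper for illustration; the field names are the real k8s.io/api/core/v1 types:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodeNameAffinity builds a required node selector term that matches the
// metadata.name field against the target node, as described above.
func nodeNameAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", nodeNameAffinity("node-1"))
}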
The dsc.rollingUpdate method

dsc.rollingUpdate handles the rolling update of a ds; in effect it deletes the pods built from the old Template and lets the ds Controller create new ones. The logic is:
- Call dsc.getNodesToDaemonPods to build a map, keyed by nodename, of the ds pods currently running in the cluster and the nodes they run on
- Call dsc.getAllDaemonSetPods, which compares the Generation, hash and Template of the ds against the pods in nodeToDaemonPods, to collect the old pods into oldPods
- Call dsc.getUnavailableNumbers to get the maximum number of unavailable pods for the ds, maxUnavailable, and the current number of unavailable pods, numUnavailable. The computation is:
  - numUnavailable counts two cases: a. a node should run a ds pod but none is actually running on it; b. a node should run a ds pod and one is running, but the pod is not available
  - maxUnavailable is derived from ds.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable: an integer value is used as-is; a percentage is multiplied by the number of nodes in the cluster that should run the ds and rounded up (see the sketch after this list)
- Call util.SplitByAvailablePods to split oldPods into the available pods, oldAvailablePods, and the unavailable ones, oldUnavailablePods
- Add every pod in oldUnavailablePods to oldPodsToDelete, to be deleted
- Then add up to maxUnavailable - numUnavailable pods from oldAvailablePods to oldPodsToDelete, so that the number of unavailable pods never exceeds maxUnavailable
- Call dsc.syncNodes to delete the pods in oldPodsToDelete
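The maxUnavailable arithmetic above boils down to intstr scaling. A minimal sketch, assuming the same roundUp=true behaviour as dsc.getUnavailableNumbers and a made-up node count of 10:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Number of nodes that should run the daemon pod (hypothetical).
	desiredNodes := 10

	// A percentage is scaled by the node count and rounded up.
	pct := intstr.FromString("25%")
	maxUnavailable, err := intstr.GetValueFromIntOrPercent(&pct, desiredNodes, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(maxUnavailable) // ceil(10 * 25%) = 3

	// An integer value is used as-is.
	n := intstr.FromInt(2)
	maxUnavailable, _ = intstr.GetValueFromIntOrPercent(&n, desiredNodes, true)
	fmt.Println(maxUnavailable) // 2
}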
func (dsc *DaemonSetsController) rollingUpdate(ds *apps.DaemonSet, hash string) error {
	// Build a map, keyed by nodename, of the ds pods currently running in the cluster.
	nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
	if err != nil {
		return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
	}

	// Compare the Generation, hash and Template of the ds against the pods in nodeToDaemonPods
	// to collect the old pods into oldPods.
	_, oldPods := dsc.getAllDaemonSetPods(ds, nodeToDaemonPods, hash)
	// Compute maxUnavailable (an integer is used as-is; a percentage is scaled by the number of
	// nodes that should run the ds and rounded up) and numUnavailable (nodes that should run a
	// ds pod but either have none, or have one that is not available).
	maxUnavailable, numUnavailable, err := dsc.getUnavailableNumbers(ds, nodeToDaemonPods)
	if err != nil {
		return fmt.Errorf("Couldn't get unavailable numbers: %v", err)
	}
	// Split oldPods into available pods (oldAvailablePods) and unavailable pods (oldUnavailablePods).
	oldAvailablePods, oldUnavailablePods := util.SplitByAvailablePods(ds.Spec.MinReadySeconds, oldPods)

	// Add every pod in oldUnavailablePods to oldPodsToDelete for deletion.
	// for oldPods delete all not running pods
	var oldPodsToDelete []string
	glog.V(4).Infof("Marking all unavailable old pods for deletion")
	for _, pod := range oldUnavailablePods {
		// Skip terminating pods. We won't delete them again
		if pod.DeletionTimestamp != nil {
			continue
		}
		glog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
		oldPodsToDelete = append(oldPodsToDelete, pod.Name)
	}

	// Then add up to maxUnavailable - numUnavailable pods from oldAvailablePods to oldPodsToDelete,
	// keeping the number of unavailable pods within maxUnavailable.
	glog.V(4).Infof("Marking old pods for deletion")
	for _, pod := range oldAvailablePods {
		if numUnavailable >= maxUnavailable {
			glog.V(4).Infof("Number of unavailable DaemonSet pods: %d, is equal to or exceeds allowed maximum: %d", numUnavailable, maxUnavailable)
			break
		}
		glog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
		oldPodsToDelete = append(oldPodsToDelete, pod.Name)
		numUnavailable++
	}
	// Delete the pods in oldPodsToDelete via dsc.syncNodes.
	return dsc.syncNodes(ds, oldPodsToDelete, []string{}, hash)
}
Summary

This completes the read-through of the DaemonSet Controller. Its job is to control where DaemonSet pods run on the cluster's Nodes: it watches DaemonSet, ControllerRevisions, Pod and Node events, maintains a Queue of DaemonSets that need to be synced, and finally starts --concurrent-daemonset-syncs goroutines that loop, pulling DaemonSets off the Queue and processing them. dsc.syncDaemonSet is the main entry point for handling a DaemonSet; it calls dsc.manage to control which nodes the ds runs on, and uses the Create/Delete Pod APIs to create and delete pods. Two points are worth highlighting:
1. DaemonSet pods get two default tolerations with keys "node.kubernetes.io/not-ready" and "node.kubernetes.io/unreachable", Operator "Exists", Effect "NoExecute"; tolerations with keys "node.kubernetes.io/disk-pressure", "node.kubernetes.io/memory-pressure" and "node.kubernetes.io/unschedulable", Operator "Exists", Effect "NoSchedule"; if the ds uses HostNetwork, an additional default toleration with key "node.kubernetes.io/network-unavailable", Operator "Exists", Effect "NoSchedule"; and if the pod is a Critical pod, two more default tolerations with key "node.kubernetes.io/out-of-disk", Operator "Exists", Effects "NoSchedule" and "NoExecute".
2. A DaemonSet has two update strategies, OnDelete and RollingUpdate. When the DaemonSet's Spec is updated: with OnDelete the DaemonSet Controller does nothing; with RollingUpdate the controller first deletes old pods, at most maxUnavailable at a time, then creates new ones, repeating until every pod runs the new Spec. A minimal sketch of the two strategies is shown below.
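For reference, the two strategies map onto the apps/v1 API types as follows; a minimal sketch with a made-up 10% value:

package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// OnDelete: after a spec update the controller leaves existing pods alone;
	// new pods are only created once the old ones are deleted manually.
	onDelete := apps.DaemonSetUpdateStrategy{Type: apps.OnDeleteDaemonSetStrategyType}

	// RollingUpdate: at most maxUnavailable pods are taken down at a time.
	maxUnavailable := intstr.FromString("10%")
	rolling := apps.DaemonSetUpdateStrategy{
		Type:          apps.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &apps.RollingUpdateDaemonSet{MaxUnavailable: &maxUnavailable},
	}

	fmt.Println(onDelete.Type, rolling.Type, rolling.RollingUpdate.MaxUnavailable.String())
}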