Kubernetes 1.12.0 Kube-controller-manager: DaemonSet controller source code walkthrough

Preface

Kube-controller-manager ultimately starts a large number of controllers. This article walks through the source code of one of them: the DaemonSet controller.

Starting the DaemonSet Controller

The startDaemonSetController function is the entry point through which Kube-controller-manager starts the DaemonSet controller. The function is simple and does just three things:

  • Check whether the apps/v1/daemonsets resource is available
  • Call NewDaemonSetsController in the daemon package to create the DaemonSet controller instance dsc
  • Call the Run method of the dsc instance in a goroutine
k8s.io/kubernetes/cmd/kube-controller-manager/app/apps.go:36

func startDaemonSetController(ctx ControllerContext) (http.Handler, bool, error) {
   if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "daemonsets"}] {
      return nil, false, nil
   }
   dsc, err := daemon.NewDaemonSetsController(
      ctx.InformerFactory.Apps().V1().DaemonSets(),
      ctx.InformerFactory.Apps().V1().ControllerRevisions(),
      ctx.InformerFactory.Core().V1().Pods(),
      ctx.InformerFactory.Core().V1().Nodes(),
      ctx.ClientBuilder.ClientOrDie("daemon-set-controller"),
      flowcontrol.NewBackOff(1*time.Second, 15*time.Minute),
   )
   if err != nil {
      return nil, true, fmt.Errorf("error creating DaemonSets controller: %v", err)
   }
   go dsc.Run(int(ctx.ComponentConfig.DaemonSetController.ConcurrentDaemonSetSyncs), ctx.Stop)
   return nil, true, nil
}

Creating the DaemonSet Controller

Kube-controller-manager calls the NewDaemonSetsController function to create the DaemonSet controller instance. The logic is as follows:

  • Create the DaemonSetsController instance: (1) burstReplicas is set to 250, i.e. the DaemonSetsController handles at most 250 pods in one pass; (2) controller.NewControllerExpectations creates a ControllerExpectations used to track expected pod creations and deletions; (3) workqueue.NewNamedRateLimitingQueue creates a workqueue that stores the DaemonSets that need to be synced
  • Watch DaemonSet events and register the DaemonSet lister and its Synced function. The DaemonSet event handlers are: (1) AddFunc calls dsc.enqueueDaemonSet to add the DaemonSet's key (namespace and name) to the DaemonSetsController's queue; (2) UpdateFunc likewise calls dsc.enqueueDaemonSet to add the new DaemonSet's key to the queue; (3) DeleteFunc calls dsc.deleteDaemonset, which first checks whether the passed-in interface is a DaemonSet and, if so, calls dsc.enqueueDaemonSet to add the DaemonSet's key to the queue
  • Watch history (ControllerRevision) events with their Add/Update/Delete funcs, and register the history lister and Synced function
  • Watch pod events with their Add/Update/Delete funcs, register the pod lister and Synced function, and define a pod index keyed by nodeName
  • Watch node events with AddFunc and UpdateFunc, and register the node lister and Synced function
  • Register syncHandler as the dsc.syncDaemonSet method, which is the main logic the DaemonSet controller uses to process a DaemonSet
  • Register enqueueDaemonSet as the dsc.enqueue method and enqueueDaemonSetRateLimited as the dsc.enqueueRateLimited method; both add the DaemonSet that needs processing to the queue maintained by the controller (a sketch of enqueue follows the code below)
  • Assign failedPodsBackoff, the backoff passed in by the caller (created with flowcontrol.NewBackOff, default duration 1s, max duration 15m)

As you can see, the DaemonSet controller watches the cluster's DaemonSet, ControllerRevision (history), Pod and Node objects, and maintains a queue holding the DaemonSets that need to be processed.

k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:143

func NewDaemonSetsController(
   daemonSetInformer appsinformers.DaemonSetInformer,
   historyInformer appsinformers.ControllerRevisionInformer,
   podInformer coreinformers.PodInformer,
   nodeInformer coreinformers.NodeInformer,
   kubeClient clientset.Interface,
   failedPodsBackoff *flowcontrol.Backoff,
) (*DaemonSetsController, error) {
   eventBroadcaster := record.NewBroadcaster()
   eventBroadcaster.StartLogging(glog.Infof)
   eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})

   if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
      if err := metrics.RegisterMetricAndTrackRateLimiterUsage("daemon_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter()); err != nil {
         return nil, err
      }
   }
   // Create the DaemonSetsController instance: (1) burstReplicas is set to 250, i.e. at most 250 pods are handled in one pass;
   // (2) controller.NewControllerExpectations creates a ControllerExpectations used to track expected pod creations/deletions;
   // (3) workqueue.NewNamedRateLimitingQueue creates a workqueue that stores the DaemonSets that need to be synced
   dsc := &DaemonSetsController{
      kubeClient:    kubeClient,
      eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
      podControl: controller.RealPodControl{
         KubeClient: kubeClient,
         Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "daemonset-controller"}),
      },
      crControl: controller.RealControllerRevisionControl{
         KubeClient: kubeClient,
      },
      burstReplicas:       BurstReplicas,
      expectations:        controller.NewControllerExpectations(),
      queue:               workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "daemonset"),
      suspendedDaemonPods: map[string]sets.String{},
   }

   // Watch DaemonSet events: (1) AddFunc calls dsc.enqueueDaemonSet to add the DaemonSet's key (namespace and name) to the queue;
   // (2) UpdateFunc likewise calls dsc.enqueueDaemonSet to add the new DaemonSet's key to the queue;
   // (3) DeleteFunc calls dsc.deleteDaemonset, which checks whether the passed-in interface is a DaemonSet and, if so,
   //     calls dsc.enqueueDaemonSet to add the DaemonSet's key to the queue
   daemonSetInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc: func(obj interface{}) {
         ds := obj.(*apps.DaemonSet)
         glog.V(4).Infof("Adding daemon set %s", ds.Name)
         dsc.enqueueDaemonSet(ds)
      },
      UpdateFunc: func(old, cur interface{}) {
         oldDS := old.(*apps.DaemonSet)
         curDS := cur.(*apps.DaemonSet)
         glog.V(4).Infof("Updating daemon set %s", oldDS.Name)
         dsc.enqueueDaemonSet(curDS)
      },
      DeleteFunc: dsc.deleteDaemonset,
   })

   // Register the DaemonSet lister and its Synced function
   dsc.dsLister = daemonSetInformer.Lister()
   dsc.dsStoreSynced = daemonSetInformer.Informer().HasSynced

   // Watch history (ControllerRevision) Add/Update/Delete events, and register the history lister and Synced function
   historyInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc:    dsc.addHistory,
      UpdateFunc: dsc.updateHistory,
      DeleteFunc: dsc.deleteHistory,
   })
   dsc.historyLister = historyInformer.Lister()
   dsc.historyStoreSynced = historyInformer.Informer().HasSynced

   // Watch pod Add/Update/Delete events, register the pod lister and Synced function, and define a pod index keyed by nodeName
   // Watch for creation/deletion of pods. The reason we watch is that we don't want a daemon set to create/delete
   // more pods until all the effects (expectations) of a daemon set's create/delete have been observed.
   podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc:    dsc.addPod,
      UpdateFunc: dsc.updatePod,
      DeleteFunc: dsc.deletePod,
   })
   dsc.podLister = podInformer.Lister()

   // This custom indexer will index pods based on their NodeName which will decrease the amount of pods we need to get in simulate() call.
   podInformer.Informer().GetIndexer().AddIndexers(cache.Indexers{
      "nodeName": indexByPodNodeName,
   })
   dsc.podNodeIndex = podInformer.Informer().GetIndexer()
   dsc.podStoreSynced = podInformer.Informer().HasSynced

   // Watch node Add and Update events, and register the node lister and Synced function
   nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc:    dsc.addNode,
      UpdateFunc: dsc.updateNode,
   },
   )
   dsc.nodeStoreSynced = nodeInformer.Informer().HasSynced
   dsc.nodeLister = nodeInformer.Lister()

   // Register syncHandler as dsc.syncDaemonSet, the main logic for processing a DaemonSet
   dsc.syncHandler = dsc.syncDaemonSet
   
   // Register enqueueDaemonSet as dsc.enqueue and enqueueDaemonSetRateLimited as dsc.enqueueRateLimited; both add the DaemonSet to the controller's queue
   dsc.enqueueDaemonSet = dsc.enqueue
   dsc.enqueueDaemonSetRateLimited = dsc.enqueueRateLimited

   // Assign the failedPodsBackoff passed in by the caller (flowcontrol.NewBackOff, default duration 1s, max duration 15m)
   dsc.failedPodsBackoff = failedPodsBackoff

   return dsc, nil
}
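
The enqueue method (registered above as enqueueDaemonSet) and the indexByPodNodeName index function are referenced in NewDaemonSetsController but not shown in this article. Roughly, enqueue turns the DaemonSet into a namespace/name key and puts it on the workqueue, and the nodeName indexer returns the pod's Spec.NodeName so pods can later be listed per node in simulate(). The following is a minimal sketch of that behavior, simplified for illustration rather than copied verbatim from the source:

func (dsc *DaemonSetsController) enqueue(ds *apps.DaemonSet) {
   // Build the "namespace/name" key for the DaemonSet.
   key, err := controller.KeyFunc(ds)
   if err != nil {
      utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", ds, err))
      return
   }
   // Hand the key to the rate-limited workqueue; processNextWorkItem picks it up later.
   dsc.queue.Add(key)
}

func indexByPodNodeName(obj interface{}) ([]string, error) {
   pod, ok := obj.(*v1.Pod)
   if !ok {
      return []string{}, nil
   }
   // Only index pods that are bound to a node and have not terminated.
   if len(pod.Spec.NodeName) == 0 || pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed {
      return []string{}, nil
   }
   return []string{pod.Spec.NodeName}, nil
}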

History Event

When NewDaemonSetsController creates the DaemonSet controller it watches the cluster's ControllerRevision (history) objects. Let's look at the concrete logic.

AddFunc

History's AddFunc is registered as the addHistory method, whose logic is:

  • Cast obj to a history (ControllerRevision) and check its DeletionTimestamp; if it is not nil the history is pending deletion, so call dsc.deleteHistory and return
  • Get the history's OwnerReference; if it is not nil, resolve the owning DaemonSet from the history's namespace and the OwnerReference (see the resolveControllerRef sketch after the code below); whether or not a ds is found, return without further processing
  • If the OwnerReference is nil, the history is an orphan; call dsc.getDaemonSetsForHistory to get all DaemonSets whose selector matches the history's labels, then call dsc.enqueueDaemonSet to add each of them to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:374

func (dsc *DaemonSetsController) addHistory(obj interface{}) {
   // Cast obj to a ControllerRevision and check its DeletionTimestamp; if not nil the history is pending deletion, so call dsc.deleteHistory and return
   history := obj.(*apps.ControllerRevision)
   if history.DeletionTimestamp != nil {
      // On a restart of the controller manager, it's possible for an object to
      // show up in a state that is already pending deletion.
      dsc.deleteHistory(history)
      return
   }

   // Get the history's OwnerReference. If it is not nil, resolve the ds from the history's namespace and the OwnerReference;
   // if no ds is found return directly, otherwise log a message and return. In other words, when the OwnerReference is not nil nothing else needs to be done.
   // If it has a ControllerRef, that's all that matters.
   if controllerRef := metav1.GetControllerOf(history); controllerRef != nil {
      ds := dsc.resolveControllerRef(history.Namespace, controllerRef)
      if ds == nil {
         return
      }
      glog.V(4).Infof("ControllerRevision %s added.", history.Name)
      return
   }

   // If the OwnerReference is nil the history is an orphan: call dsc.getDaemonSetsForHistory to get all DaemonSets whose selector matches its labels, then enqueue each of them
   // Otherwise, it's an orphan. Get a list of all matching DaemonSets and sync
   // them to see if anyone wants to adopt it.
   daemonSets := dsc.getDaemonSetsForHistory(history)
   if len(daemonSets) == 0 {
      return
   }
   glog.V(4).Infof("Orphan ControllerRevision %s added.", history.Name)
   for _, ds := range daemonSets {
      dsc.enqueueDaemonSet(ds)
   }
}
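
resolveControllerRef is used by nearly every event handler in this controller but is not shown in this article. Conceptually it looks up the owning DaemonSet through the lister and double-checks the UID so a stale reference to a deleted-and-recreated object does not match. A sketch of that logic, assuming the controllerKind variable the file defines for apps/v1 DaemonSet (illustrative, not an exact copy):

func (dsc *DaemonSetsController) resolveControllerRef(namespace string, controllerRef *metav1.OwnerReference) *apps.DaemonSet {
   // Ignore owners of a different Kind (the object may be controlled by something other than a DaemonSet).
   if controllerRef.Kind != controllerKind.Kind {
      return nil
   }
   ds, err := dsc.dsLister.DaemonSets(namespace).Get(controllerRef.Name)
   if err != nil {
      return nil
   }
   // The owner could have been deleted and recreated with the same name; compare UIDs to be sure.
   if ds.UID != controllerRef.UID {
      return nil
   }
   return ds
}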

UpdateFunc

History's UpdateFunc is registered as the updateHistory method, whose main logic is:

  • Cast cur and old to curHistory and oldHistory; if their ResourceVersions are equal the history has not changed, so return without doing anything
  • Get the OwnerReferences of curHistory and oldHistory; if the OwnerReference changed and oldHistory's OwnerReference is not nil, resolve the ds from oldHistory's namespace and OwnerReference, and if the ds is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue
  • If curHistory's OwnerReference is not nil, resolve the ds from curHistory's namespace and OwnerReference; if the ds is nil return directly, otherwise call dsc.enqueueDaemonSet to add it to the queue, then return
  • If curHistory's OwnerReference is nil, the history has become an orphan; if either the labels or the OwnerReference changed between oldHistory and curHistory, call dsc.getDaemonSetsForHistory to get all DaemonSets whose selector matches the labels (a sketch of this matching follows the code below), then call dsc.enqueueDaemonSet to add each of them to the queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:408
func (dsc *DaemonSetsController) updateHistory(old, cur interface{}) {
   // Cast cur and old to curHistory and oldHistory; if their ResourceVersions are equal the history has not changed, so return
   curHistory := cur.(*apps.ControllerRevision)
   oldHistory := old.(*apps.ControllerRevision)
   if curHistory.ResourceVersion == oldHistory.ResourceVersion {
      // Periodic resync will send update events for all known ControllerRevisions.
      return
   }

   // Get the OwnerReferences of curHistory and oldHistory. If the OwnerReference changed and oldHistory's OwnerReference is not nil,
   // resolve the ds from oldHistory's namespace and OwnerReference; if the ds is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue.
   curControllerRef := metav1.GetControllerOf(curHistory)
   oldControllerRef := metav1.GetControllerOf(oldHistory)
   controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
   if controllerRefChanged && oldControllerRef != nil {
      // The ControllerRef was changed. Sync the old controller, if any.
      if ds := dsc.resolveControllerRef(oldHistory.Namespace, oldControllerRef); ds != nil {
         dsc.enqueueDaemonSet(ds)
      }
   }

   // If curHistory's OwnerReference is not nil, resolve the ds from curHistory's namespace and OwnerReference; if the ds is nil return,
   // otherwise call dsc.enqueueDaemonSet to add it to the queue and return.
   // If it has a ControllerRef, that's all that matters.
   if curControllerRef != nil {
      ds := dsc.resolveControllerRef(curHistory.Namespace, curControllerRef)
      if ds == nil {
         return
      }
      glog.V(4).Infof("ControllerRevision %s updated.", curHistory.Name)
      dsc.enqueueDaemonSet(ds)
      return
   }

   // If curHistory's OwnerReference is nil the history has become an orphan. If either the labels or the OwnerReference changed,
   // call dsc.getDaemonSetsForHistory to get all DaemonSets whose selector matches the labels and enqueue each of them.
   // Otherwise, it's an orphan. If anything changed, sync matching controllers
   // to see if anyone wants to adopt it now.
   labelChanged := !reflect.DeepEqual(curHistory.Labels, oldHistory.Labels)
   if labelChanged || controllerRefChanged {
      daemonSets := dsc.getDaemonSetsForHistory(curHistory)
      if len(daemonSets) == 0 {
         return
      }
      glog.V(4).Infof("Orphan ControllerRevision %s updated.", curHistory.Name)
      for _, ds := range daemonSets {
         dsc.enqueueDaemonSet(ds)
      }
   }
}
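
Both addHistory and updateHistory rely on dsc.getDaemonSetsForHistory, which is not shown here. Its job is to find every DaemonSet in the history's namespace whose selector matches the history's labels, so that a potential adopter gets resynced. A conceptual sketch of that matching (the helper name below is hypothetical, and the real code goes through a lister expansion rather than this inline loop):

func (dsc *DaemonSetsController) daemonSetsMatchingHistory(history *apps.ControllerRevision) []*apps.DaemonSet {
   var matching []*apps.DaemonSet
   dsList, err := dsc.dsLister.DaemonSets(history.Namespace).List(labels.Everything())
   if err != nil {
      return nil
   }
   for _, ds := range dsList {
      selector, err := metav1.LabelSelectorAsSelector(ds.Spec.Selector)
      if err != nil {
         continue
      }
      // A DaemonSet is a candidate adopter if its selector matches the history's labels.
      if !selector.Empty() && selector.Matches(labels.Set(history.Labels)) {
         matching = append(matching, ds)
      }
   }
   return matching
}

dsc.getDaemonSetsForPod, used by the pod event handlers below, works the same way, only matching the DaemonSet selectors against a pod's labels instead.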

DeleteFunc

History's DeleteFunc is registered as the deleteHistory method, whose main logic is:

  • Check whether the passed-in obj is a ControllerRevision (possibly wrapped in a DeletedFinalStateUnknown tombstone); if not, return; otherwise extract the history
  • Get the history's OwnerReference; if it is nil return directly, otherwise resolve the ds from the history's namespace and OwnerReference; if the ds is nil return, and if not, call dsc.enqueueDaemonSet to add it to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:455
func (dsc *DaemonSetsController) deleteHistory(obj interface{}) {
   // Check whether the passed-in obj is a ControllerRevision (possibly wrapped in a tombstone); if not, return; otherwise extract the history
   history, ok := obj.(*apps.ControllerRevision)

   // When a delete is dropped, the relist will notice a ControllerRevision in the store not
   // in the list, leading to the insertion of a tombstone object which contains
   // the deleted key/value. Note that this value might be stale. If the ControllerRevision
   // changed labels the new DaemonSet will not be woken up till the periodic resync.
   if !ok {
      tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
      if !ok {
         utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
         return
      }
      history, ok = tombstone.Obj.(*apps.ControllerRevision)
      if !ok {
         utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a ControllerRevision %#v", obj))
         return
      }
   }

   // Get the history's OwnerReference; if nil return. Otherwise resolve the ds from the history's namespace and OwnerReference;
   // if the ds is nil return, and if not, call dsc.enqueueDaemonSet to add it to the controller's queue.
   controllerRef := metav1.GetControllerOf(history)
   if controllerRef == nil {
      // No controller should care about orphans being deleted.
      return
   }
   ds := dsc.resolveControllerRef(history.Namespace, controllerRef)
   if ds == nil {
      return
   }
   glog.V(4).Infof("ControllerRevision %s deleted.", history.Name)
   dsc.enqueueDaemonSet(ds)
}

Pod Event

AddFunc

Pod's AddFunc is registered as the addPod method, whose logic is:

  • Cast obj to a pod and check its DeletionTimestamp; if it is not nil the pod is pending deletion, so call dsc.deletePod and return
  • Get the pod's OwnerReference; if it is not nil, resolve the ds from the pod's namespace and the OwnerReference; if the ds is nil return; otherwise get the ds's key (namespace/name) and, if that succeeds, record a creation observation in the expectations and call dsc.enqueueDaemonSet to add the ds to the controller's queue, then return; if getting the key fails, return
  • If the pod's OwnerReference is nil it is an orphan pod; get all DaemonSets whose selector matches the pod's labels and call dsc.enqueueDaemonSet to add each of them to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:488
func (dsc *DaemonSetsController) addPod(obj interface{}) {
   // Cast obj to a pod and check its DeletionTimestamp; if not nil the pod is pending deletion, so call dsc.deletePod and return
   pod := obj.(*v1.Pod)

   if pod.DeletionTimestamp != nil {
      // on a restart of the controller manager, it's possible a new pod shows up in a state that
      // is already pending deletion. Prevent the pod from being a creation observation.
      dsc.deletePod(pod)
      return
   }

   // Get the pod's OwnerReference. If it is not nil, resolve the ds from the pod's namespace and the OwnerReference; if the ds is nil return;
   // otherwise get the ds's key and, if that succeeds, observe the creation in the expectations and enqueue the ds, then return; if getting the key fails, return.
   // If it has a ControllerRef, that's all that matters.
   if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
      ds := dsc.resolveControllerRef(pod.Namespace, controllerRef)
      if ds == nil {
         return
      }
      dsKey, err := controller.KeyFunc(ds)
      if err != nil {
         return
      }
      glog.V(4).Infof("Pod %s added.", pod.Name)
      dsc.expectations.CreationObserved(dsKey)
      dsc.enqueueDaemonSet(ds)
      return
   }

   // If the pod's OwnerReference is nil it is an orphan pod: get all DaemonSets whose selector matches the pod's labels
   // and call dsc.enqueueDaemonSet to add each of them to the controller's queue.
   // Otherwise, it's an orphan. Get a list of all matching DaemonSets and sync
   // them to see if anyone wants to adopt it.
   // DO NOT observe creation because no controller should be waiting for an
   // orphan.
   dss := dsc.getDaemonSetsForPod(pod)
   if len(dss) == 0 {
      return
   }
   glog.V(4).Infof("Orphan Pod %s added.", pod.Name)
   for _, ds := range dss {
      dsc.enqueueDaemonSet(ds)
   }
}

UpdateFunc

Pod's UpdateFunc is registered as the updatePod method, whose logic is:

  • Compare the ResourceVersions of curPod and oldPod; if they are equal the pod has not changed, so return without further processing
  • Compare the OwnerReferences of curPod and oldPod; if the OwnerReference changed and oldPod's OwnerReference is not nil, resolve the ds from oldPod's namespace and OwnerReference, and if the ds is not nil, call dsc.enqueueDaemonSet to add it to the controller's queue
  • If curPod's OwnerReference is not nil, resolve the ds from curPod's namespace and OwnerReference; if the ds is nil return, otherwise call dsc.enqueueDaemonSet to add it to the queue; then check readiness: if oldPod was not Ready, curPod is Ready, and the ds's Spec.MinReadySeconds is greater than 0, call dsc.enqueueDaemonSetAfter to re-add the ds to the queue after MinReadySeconds+1 seconds (see the enqueueDaemonSetAfter sketch after the code below); finally return
  • If curPod's OwnerReference is nil it is an orphan pod; get all DaemonSets whose selector matches the pod's labels, and if either the labels or the OwnerReference changed between oldPod and curPod, call dsc.enqueueDaemonSet to add all of those DaemonSets to the queue; if neither changed, do nothing
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:531
func (dsc *DaemonSetsController) updatePod(old, cur interface{}) {
   curPod := cur.(*v1.Pod)
   oldPod := old.(*v1.Pod)
   // Compare the ResourceVersions of curPod and oldPod; if they are equal the pod has not changed, so return
   if curPod.ResourceVersion == oldPod.ResourceVersion {
      // Periodic resync will send update events for all known pods.
      // Two different versions of the same pod will always have different RVs.
      return
   }

   // Compare the OwnerReferences of curPod and oldPod. If the OwnerReference changed and oldPod's OwnerReference is not nil,
   // resolve the ds from oldPod's namespace and OwnerReference; if the ds is not nil, enqueue it.
   curControllerRef := metav1.GetControllerOf(curPod)
   oldControllerRef := metav1.GetControllerOf(oldPod)
   controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
   if controllerRefChanged && oldControllerRef != nil {
      // The ControllerRef was changed. Sync the old controller, if any.
      if ds := dsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); ds != nil {
         dsc.enqueueDaemonSet(ds)
      }
   }

   // If curPod's OwnerReference is not nil, resolve the ds from curPod's namespace and OwnerReference and enqueue it.
   // Then check readiness: if oldPod was not Ready, curPod is Ready, and ds.Spec.MinReadySeconds > 0,
   // call dsc.enqueueDaemonSetAfter to re-add the ds to the queue after MinReadySeconds+1 seconds.
   // If it has a ControllerRef, that's all that matters.
   if curControllerRef != nil {
      ds := dsc.resolveControllerRef(curPod.Namespace, curControllerRef)
      if ds == nil {
         return
      }
      glog.V(4).Infof("Pod %s updated.", curPod.Name)
      dsc.enqueueDaemonSet(ds)
      changedToReady := !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod)
      // See https://github.com/kubernetes/kubernetes/pull/38076 for more details
      if changedToReady && ds.Spec.MinReadySeconds > 0 {
         // Add a second to avoid milliseconds skew in AddAfter.
         // See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
         dsc.enqueueDaemonSetAfter(ds, (time.Duration(ds.Spec.MinReadySeconds)*time.Second)+time.Second)
      }
      return
   }

   // If curPod's OwnerReference is nil it is an orphan pod: get all DaemonSets whose selector matches its labels, and if either
   // the labels or the OwnerReference changed, enqueue all of them; if neither changed, do nothing.
   // Otherwise, it's an orphan. If anything changed, sync matching controllers
   // to see if anyone wants to adopt it now.
   dss := dsc.getDaemonSetsForPod(curPod)
   if len(dss) == 0 {
      return
   }
   glog.V(4).Infof("Orphan Pod %s updated.", curPod.Name)
   labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
   if labelChanged || controllerRefChanged {
      for _, ds := range dss {
         dsc.enqueueDaemonSet(ds)
      }
   }
}
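
enqueueDaemonSetAfter is not shown above; it behaves like enqueue but delays the requeue using the workqueue's AddAfter, which is how the MinReadySeconds+1 delay mentioned above is implemented. A minimal sketch (illustrative):

func (dsc *DaemonSetsController) enqueueDaemonSetAfter(obj interface{}, after time.Duration) {
   key, err := controller.KeyFunc(obj)
   if err != nil {
      utilruntime.HandleError(fmt.Errorf("couldn't get key for object %+v: %v", obj, err))
      return
   }
   // AddAfter re-adds the key once the delay has elapsed, so the DaemonSet is resynced
   // right after MinReadySeconds has passed for the pod that just became Ready.
   dsc.queue.AddAfter(key, after)
}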

DeleteFunc

Pod's DeleteFunc is registered as the deletePod method, whose logic is:

  • Check whether the passed-in obj is a pod (possibly wrapped in a DeletedFinalStateUnknown tombstone); if not, return without any further processing
  • If the pod's OwnerReference is nil it is an orphan pod, so call dsc.requeueSuspendedDaemonPods, which re-adds to the controller's queue every DaemonSet recorded in the controller's suspendedDaemonPods map for the pod's node, then return (see the sketch after the code below)
  • If the OwnerReference is not nil, call dsc.resolveControllerRef with the pod's namespace and OwnerReference to get the ds
    • If the ds is nil and the pod's Spec.NodeName is not empty, call dsc.requeueSuspendedDaemonPods
    • If the ds is not nil, record a deletion observation in the expectations and call dsc.enqueueDaemonSet to add the ds to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:644
func (dsc *DaemonSetsController) deletePod(obj interface{}) {
   // Check whether the passed-in obj is a pod (possibly wrapped in a tombstone); if not, return
   pod, ok := obj.(*v1.Pod)
   // When a delete is dropped, the relist will notice a pod in the store not
   // in the list, leading to the insertion of a tombstone object which contains
   // the deleted key/value. Note that this value might be stale. If the pod
   // changed labels the new daemonset will not be woken up till the periodic
   // resync.
   if !ok {
      tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
      if !ok {
         utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj))
         return
      }
      pod, ok = tombstone.Obj.(*v1.Pod)
      if !ok {
         utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj))
         return
      }
   }

   // If the pod's OwnerReference is nil it is an orphan pod: call dsc.requeueSuspendedDaemonPods for its node (if scheduled) and return
   controllerRef := metav1.GetControllerOf(pod)
   if controllerRef == nil {
      // No controller should care about orphans being deleted.
      if len(pod.Spec.NodeName) != 0 {
         // If scheduled pods were deleted, requeue suspended daemon pods.
         dsc.requeueSuspendedDaemonPods(pod.Spec.NodeName)
      }
      return
   }

   // If the OwnerReference is not nil, resolve the ds from the pod's namespace and OwnerReference.
   // If the ds is nil and the pod's Spec.NodeName is not empty, call dsc.requeueSuspendedDaemonPods;
   // if the ds is not nil, observe the deletion in the expectations and enqueue the ds.
   ds := dsc.resolveControllerRef(pod.Namespace, controllerRef)
   if ds == nil {
      if len(pod.Spec.NodeName) != 0 {
         // If scheduled pods were deleted, requeue suspended daemon pods.
         dsc.requeueSuspendedDaemonPods(pod.Spec.NodeName)
      }
      return
   }
   dsKey, err := controller.KeyFunc(ds)
   if err != nil {
      return
   }
   glog.V(4).Infof("Pod %s deleted.", pod.Name)
   dsc.expectations.DeletionObserved(dsKey)
   dsc.enqueueDaemonSet(ds)
}
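
deletePod (and podsShouldBeOnNode later on) relies on the suspendedDaemonPods map created in NewDaemonSetsController: for each node it records the DaemonSets that wanted to run a pod there but could not be scheduled. The helper methods are not shown in this article; conceptually they look like the sketch below (simplified: the real code also guards the map with a mutex and handles errors and logging):

func (dsc *DaemonSetsController) addSuspendedDaemonPods(node, ds string) {
   if _, found := dsc.suspendedDaemonPods[node]; !found {
      dsc.suspendedDaemonPods[node] = sets.NewString()
   }
   dsc.suspendedDaemonPods[node].Insert(ds)
}

func (dsc *DaemonSetsController) requeueSuspendedDaemonPods(node string) {
   // Pods were deleted on this node, so resources may have been freed; give every DaemonSet
   // that was previously suspended on this node another chance, via the rate limited queue.
   for _, dsKey := range dsc.suspendedDaemonPods[node].List() {
      ns, name, err := cache.SplitMetaNamespaceKey(dsKey)
      if err != nil {
         continue
      }
      if ds, err := dsc.dsLister.DaemonSets(ns).Get(name); err == nil {
         dsc.enqueueDaemonSetRateLimited(ds)
      }
   }
}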

Node Event

The DaemonSet controller only registers AddFunc and UpdateFunc for node events. Let's look at the logic of these two funcs.

AddFunc

The node AddFunc is registered as the addNode method, whose logic is:

  • Call dsc.dsLister.List to store all DaemonSets in dsList
  • Loop over every ds in dsList, call dsc.nodeShouldRunDaemonPod to check whether the ds should be scheduled onto the newly added node, and call dsc.enqueueDaemonSet to add the DaemonSets that should be scheduled to the controller's queue

dsc.nodeShouldRunDaemonPod is the most important method of the DaemonSet controller: it checks whether a DaemonSet should run a pod on a given node, and the controller uses it in many places. It is analyzed in detail in the "Running the DaemonSet Controller" section.

k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:690
func (dsc *DaemonSetsController) addNode(obj interface{}) {
   // TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
   // Call dsc.dsLister.List to store all DaemonSets in dsList
   dsList, err := dsc.dsLister.List(labels.Everything())
   if err != nil {
      glog.V(4).Infof("Error enqueueing daemon sets: %v", err)
      return
   }
   node := obj.(*v1.Node)
   // Loop over every ds in dsList, call dsc.nodeShouldRunDaemonPod to check whether the ds should be scheduled onto the new node, and enqueue the ones that should
   for _, ds := range dsList {
      _, shouldSchedule, _, err := dsc.nodeShouldRunDaemonPod(node, ds)
      if err != nil {
         continue
      }
      if shouldSchedule {
         dsc.enqueueDaemonSet(ds)
      }
   }
}

UpdateFunc

The node UpdateFunc is registered as the updateNode method, whose logic is:

  • Call the shouldIgnoreNodeUpdate function to decide whether this node update event should be ignored (a sketch follows the code below); its checks are:
    • If the set of conditions that are True in oldNode's and curNode's status.Conditions is not exactly the same, immediately report that the event must not be ignored; otherwise go on to the next check
    • If the True conditions are identical, additionally check whether oldNode and curNode are identical apart from ResourceVersion and status.Conditions; if so the event is ignored, otherwise it is not
  • List all DaemonSets in the cluster into dsList
  • For each ds in dsList:
    • Call dsc.nodeShouldRunDaemonPod with the ds and oldNode, storing whether the ds should be scheduled onto oldNode and whether its pod should keep running there in oldShouldSchedule and oldShouldContinueRunning
    • Call dsc.nodeShouldRunDaemonPod with the ds and curNode, storing the corresponding results in currentShouldSchedule and currentShouldContinueRunning
    • If oldShouldSchedule differs from currentShouldSchedule or oldShouldContinueRunning differs from currentShouldContinueRunning, i.e. the ds's scheduling or running situation on the node changed, call dsc.enqueueDaemonSet to add the ds to the controller's queue
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:747
func (dsc *DaemonSetsController) updateNode(old, cur interface{}) {
   oldNode := old.(*v1.Node)
   curNode := cur.(*v1.Node)
   // Call shouldIgnoreNodeUpdate to decide whether this node update event should be ignored. Its checks are:
   // (1) if the set of True conditions in oldNode's and curNode's status.Conditions differs, the event is not ignored;
   // (2) if the True conditions are identical, additionally compare oldNode and curNode ignoring ResourceVersion and
   //     status.Conditions; if they are otherwise identical the event is ignored, otherwise it is not.
   if shouldIgnoreNodeUpdate(*oldNode, *curNode) {
      return
   }

   // List all DaemonSets in the cluster into dsList
   dsList, err := dsc.dsLister.List(labels.Everything())
   if err != nil {
      glog.V(4).Infof("Error listing daemon sets: %v", err)
      return
   }

   // For each ds in dsList:
   // (1) call dsc.nodeShouldRunDaemonPod with the ds and oldNode, storing the results in oldShouldSchedule and oldShouldContinueRunning;
   // (2) call dsc.nodeShouldRunDaemonPod with the ds and curNode, storing the results in currentShouldSchedule and currentShouldContinueRunning;
   // (3) if either pair differs, i.e. the ds's scheduling/running situation on the node changed, enqueue the ds.
   // TODO: it'd be nice to pass a hint with these enqueues, so that each ds would only examine the added node (unless it has other work to do, too).
   for _, ds := range dsList {
      _, oldShouldSchedule, oldShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(oldNode, ds)
      if err != nil {
         continue
      }
      _, currentShouldSchedule, currentShouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(curNode, ds)
      if err != nil {
         continue
      }
      if (oldShouldSchedule != currentShouldSchedule) || (oldShouldContinueRunning != currentShouldContinueRunning) {
         dsc.enqueueDaemonSet(ds)
      }
   }
}
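
shouldIgnoreNodeUpdate is described above but not shown. It treats two versions of a node as equivalent for DaemonSet purposes when the set of conditions whose status is True is unchanged (that comparison is done by the nodeInSameCondition helper in the same file) and nothing else except ResourceVersion and the conditions differs. A sketch of that logic (illustrative; note the parameters are passed by value, so mutating the local copies is safe):

func shouldIgnoreNodeUpdate(oldNode, curNode v1.Node) bool {
   // If the set of True conditions changed (e.g. the node just became Ready or hit DiskPressure),
   // the update matters to DaemonSets, so it must not be ignored.
   if !nodeInSameCondition(oldNode.Status.Conditions, curNode.Status.Conditions) {
      return false
   }
   // Otherwise compare everything except ResourceVersion and the conditions themselves;
   // if nothing else changed, the update can safely be ignored.
   oldNode.ResourceVersion = curNode.ResourceVersion
   oldNode.Status.Conditions = curNode.Status.Conditions
   return apiequality.Semantic.DeepEqual(oldNode, curNode)
}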

Running the DaemonSet Controller

Kube-controller-manager calls the dsController's Run method to run the DaemonSet controller. Its main logic is:

  • Call controller.WaitForCacheSync and wait until the HasSynced functions of the DaemonSet, Pod, Node and History informers all return true, i.e. wait for their caches to finish syncing
  • Start ConcurrentDaemonSetSyncs (--concurrent-daemonset-syncs) goroutines, each running dsc.runWorker in a loop with a 1-second interval; dsc.runWorker in turn calls dsc.processNextWorkItem repeatedly until it returns false. dsc.processNextWorkItem works as follows:
    • Take one DaemonSet key off the controller's queue; if the queue has been shut down, return false
    • Defer marking the key as done when dsc.processNextWorkItem finishes
    • Call dsc.syncHandler (i.e. dsc.syncDaemonSet) to process the key; if it returns a nil error, call dsc.queue.Forget to clear the key's rate-limiting history
    • If dsc.syncHandler failed, call dsc.queue.AddRateLimited to put the key back into the queue (see the note after the code below)
  • Start one goroutine that runs the failedPodsBackoff garbage collection every minute
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:265
func (dsc *DaemonSetsController) Run(workers int, stopCh <-chan struct{}) {
   defer utilruntime.HandleCrash()
   defer dsc.queue.ShutDown()

   glog.Infof("Starting daemon sets controller")
   defer glog.Infof("Shutting down daemon sets controller")

   // Call controller.WaitForCacheSync and wait for the DaemonSet, Pod, Node and History informer caches to finish syncing
   if !controller.WaitForCacheSync("daemon sets", stopCh, dsc.podStoreSynced, dsc.nodeStoreSynced, dsc.historyStoreSynced, dsc.dsStoreSynced) {
      return
   }

   // Start the configured number of worker goroutines, each running dsc.runWorker in a loop with a 1-second interval; runWorker calls dsc.processNextWorkItem until it returns false
   for i := 0; i < workers; i++ {
      go wait.Until(dsc.runWorker, time.Second, stopCh)
   }

   // Start one goroutine that runs the failedPodsBackoff garbage collection every minute
   go wait.Until(dsc.failedPodsBackoff.GC, BackoffGCInterval, stopCh)

   <-stopCh
}

func (dsc *DaemonSetsController) runWorker() {
   for dsc.processNextWorkItem() {
   }
}

// processNextWorkItem deals with one key off the queue.  It returns false when it's time to quit.
func (dsc *DaemonSetsController) processNextWorkItem() bool {
   // Take one DaemonSet key off the controller's queue; if the queue has been shut down, return false
   dsKey, quit := dsc.queue.Get()
   if quit {
      return false
   }
   defer dsc.queue.Done(dsKey)

   // Call dsc.syncHandler (i.e. dsc.syncDaemonSet) to process the key taken from the queue
   err := dsc.syncHandler(dsKey.(string))
   if err == nil {
      dsc.queue.Forget(dsKey)
      return true
   }

   // If dsc.syncHandler failed, call dsc.queue.AddRateLimited to put the key back into the queue
   utilruntime.HandleError(fmt.Errorf("%v failed with : %v", dsKey, err))
   dsc.queue.AddRateLimited(dsKey)

   return true
}
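
A note on the retry behavior in processNextWorkItem: the queue was created with workqueue.DefaultControllerRateLimiter(), which, roughly, combines a per-item exponential backoff (starting at a few milliseconds and capped at 1000 seconds) with an overall token bucket, so a DaemonSet whose sync keeps failing is retried with increasing delays, and Forget resets that history once a sync succeeds. A small standalone illustration of the per-key behavior (example code, not controller code):

func retryDelaysExample() {
   limiter := workqueue.DefaultControllerRateLimiter()
   key := "default/my-daemonset" // hypothetical DaemonSet key

   for i := 0; i < 4; i++ {
      // Each failed sync calls AddRateLimited, and the delay computed for this key grows exponentially.
      fmt.Printf("retry %d would be delayed by %v\n", i+1, limiter.When(key))
   }

   // A successful sync calls Forget, which resets the per-key failure counter.
   limiter.Forget(key)
}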

The dsc.nodeShouldRunDaemonPod method

dsc.nodeShouldRunDaemonPod is the method the DaemonSet controller uses to decide whether a DaemonSet should run a pod on a node. It returns three results: wantToRun means the node should run the daemon pod, ignoring node conditions that merely prevent scheduling (such as DiskPressure or insufficient resources); shouldSchedule means the daemon pod should be scheduled onto the node; shouldContinueRunning means a daemon pod already running on the node should keep running. The controller used this method when registering the node event handlers and will use it several more times, so it is examined in detail here.

  • Create a new pod from the ds and the node name, stored in newPod
  • Initialize wantToRun, shouldSchedule and shouldContinueRunning to true
  • If ds.Spec.Template.Spec.NodeName is not empty and node.Name does not equal ds.Spec.Template.Spec.NodeName (i.e. the ds is pinned to a different node, so it should not run here), return with wantToRun, shouldSchedule and shouldContinueRunning all false
  • Call dsc.simulate to run a set of scheduling predicates against the node, returning the failure reasons in reasons and the node's information in nodeInfo. dsc.simulate works as follows:
    • Get all pods on the node into objects
    • Create a nodeInfo and set its node
    • Loop over the pods in objects; for every pod that does not belong to this ds, call nodeInfo.AddPod so its resources are accounted to nodeInfo, i.e. tally the node's already-used resources
    • Call the Predicates function to check with scheduling predicates whether the ds can run on the node. Predicates works as follows:
      • If the ScheduleDaemonSetPods feature gate is enabled (alpha and disabled by default in 1.11; beta and enabled by default in 1.12+), call checkNodeFitness, which only checks the PodFitsHost, PodMatchNodeSelector and PodToleratesNodeTaints predicates, then return the result directly (a sketch of checkNodeFitness follows the code below)
      • Call kubelettypes.IsCriticalPod to check whether this is a critical pod: (1) the PodPriority feature gate is enabled (off by default) and the pod's Priority is at least 2000000000; or (2) the ExperimentalCriticalPodAnnotation feature gate is enabled (off by default), the pod's namespace is kube-system, and its annotations contain scheduler.alpha.kubernetes.io/critical-pod=""; a pod satisfying either condition is critical
      • First check the pod with the PodToleratesNodeTaints predicate
      • If it is a critical pod, call predicates.EssentialPredicates, otherwise call predicates.GeneralPredicates. EssentialPredicates checks the pod with the PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector predicates; GeneralPredicates first checks the pod with PodFitsResources and then also calls EssentialPredicates
      • Return the result of the predicate checks
  • Examine every predicate failure reason in reasons:
    • If the failure reason is InsufficientResourceError (the node lacks resources), store the reason in insufficientResourceErr
    • If the failure reason is ErrNodeSelectorNotMatch, ErrPodNotMatchHostName, ErrNodeLabelPresenceViolated or ErrPodNotFitsHostPorts, return with wantToRun, shouldSchedule and shouldContinueRunning all false
    • If the failure reason is ErrTaintsTolerationsNotMatch, additionally run the predicates.PodToleratesNodeNoExecuteTaints predicate against the NoExecute taints; if the tolerations do not cover all NoExecute taints, return with wantToRun, shouldSchedule and shouldContinueRunning all false; if they do, set wantToRun and shouldSchedule to false and keep examining the remaining reasons
    • If the failure reason is ErrDiskConflict, ErrVolumeZoneConflict, ErrMaxVolumeCountExceeded, ErrNodeUnderMemoryPressure or ErrNodeUnderDiskPressure, set shouldSchedule to false
    • If the failure reason is ErrPodAffinityNotMatch or ErrServiceAffinityViolated, return with wantToRun, shouldSchedule and shouldContinueRunning all false
  • After examining all reasons, check shouldSchedule and insufficientResourceErr: if shouldSchedule is still true and insufficientResourceErr is not nil, set shouldSchedule to false (i.e. when the only failure reason is InsufficientResourceError, the node still cannot run the ds right now)

For more detail on the scheduling predicates, see the scheduler_algorithm documentation or read the code in pkg/scheduler/algorithm/predicates/predicates.go.

k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1327
func (dsc *DaemonSetsController) nodeShouldRunDaemonPod(node *v1.Node, ds *apps.DaemonSet) (wantToRun, shouldSchedule, shouldContinueRunning bool, err error) {
   // Create a new pod from the ds and the node name, stored in newPod
   newPod := NewPod(ds, node.Name)

   // Because these bools require an && of all their required conditions, we start
   // with all bools set to true and set a bool to false if a condition is not met.
   // A bool should probably not be set to true after this line.
   // Initialize wantToRun, shouldSchedule and shouldContinueRunning to true
   wantToRun, shouldSchedule, shouldContinueRunning = true, true, true

   // If the daemon set specifies a node name, check that it matches with node.Name.
   // If ds.Spec.Template.Spec.NodeName is not empty and node.Name does not equal it (the ds is pinned to another node), return with all three results false
   if !(ds.Spec.Template.Spec.NodeName == "" || ds.Spec.Template.Spec.NodeName == node.Name) {
      return false, false, false, nil
   }

   // Call dsc.simulate to run scheduling predicates against the node; failure reasons go into reasons and the node's information into nodeInfo
   reasons, nodeInfo, err := dsc.simulate(newPod, node, ds)
   if err != nil {
      glog.Warningf("DaemonSet Predicates failed on node %s for ds '%s/%s' due to unexpected error: %v", node.Name, ds.ObjectMeta.Namespace, ds.ObjectMeta.Name, err)
      return false, false, false, err
   }

   // Examine every predicate failure reason in reasons:
   // 1. InsufficientResourceError (node lacks resources): store the reason in insufficientResourceErr
   // 2. ErrNodeSelectorNotMatch, ErrPodNotMatchHostName, ErrNodeLabelPresenceViolated, ErrPodNotFitsHostPorts: return with all three results false
   // 3. ErrTaintsTolerationsNotMatch: additionally run predicates.PodToleratesNodeNoExecuteTaints; if the NoExecute taints are not all tolerated,
   //    return with all three results false; if they are, set wantToRun and shouldSchedule to false and keep examining the remaining reasons
   // 4. ErrDiskConflict, ErrVolumeZoneConflict, ErrMaxVolumeCountExceeded, ErrNodeUnderMemoryPressure, ErrNodeUnderDiskPressure: set shouldSchedule to false
   // 5. ErrPodAffinityNotMatch, ErrServiceAffinityViolated: return with all three results false
   // TODO(k82cn): When 'ScheduleDaemonSetPods' upgrade to beta or GA, remove unnecessary check on failure reason,
   //              e.g. InsufficientResourceError; and simplify "wantToRun, shouldSchedule, shouldContinueRunning"
   //              into one result, e.g. selectedNode.
   var insufficientResourceErr error
   for _, r := range reasons {
      glog.V(4).Infof("DaemonSet Predicates failed on node %s for ds '%s/%s' for reason: %v", node.Name, ds.ObjectMeta.Namespace, ds.ObjectMeta.Name, r.GetReason())
      switch reason := r.(type) {
      case *predicates.InsufficientResourceError:
         insufficientResourceErr = reason
      case *predicates.PredicateFailureError:
         var emitEvent bool
         // we try to partition predicates into two partitions here: intentional on the part of the operator and not.
         switch reason {
         // intentional
         case
            predicates.ErrNodeSelectorNotMatch,
            predicates.ErrPodNotMatchHostName,
            predicates.ErrNodeLabelPresenceViolated,
            // this one is probably intentional since it's a workaround for not having
            // pod hard anti affinity.
            predicates.ErrPodNotFitsHostPorts:
            return false, false, false, nil
         case predicates.ErrTaintsTolerationsNotMatch:
            // DaemonSet is expected to respect taints and tolerations
            fitsNoExecute, _, err := predicates.PodToleratesNodeNoExecuteTaints(newPod, nil, nodeInfo)
            if err != nil {
               return false, false, false, err
            }
            if !fitsNoExecute {
               return false, false, false, nil
            }
            wantToRun, shouldSchedule = false, false
         // unintentional
         case
            predicates.ErrDiskConflict,
            predicates.ErrVolumeZoneConflict,
            predicates.ErrMaxVolumeCountExceeded,
            predicates.ErrNodeUnderMemoryPressure,
            predicates.ErrNodeUnderDiskPressure:
            // wantToRun and shouldContinueRunning are likely true here. They are
            // absolutely true at the time of writing the comment. See first comment
            // of this method.
            shouldSchedule = false
            emitEvent = true
         // unexpected
         case
            predicates.ErrPodAffinityNotMatch,
            predicates.ErrServiceAffinityViolated:
            glog.Warningf("unexpected predicate failure reason: %s", reason.GetReason())
            return false, false, false, fmt.Errorf("unexpected reason: DaemonSet Predicates should not return reason %s", reason.GetReason())
         default:
            glog.V(4).Infof("unknown predicate failure reason: %s", reason.GetReason())
            wantToRun, shouldSchedule, shouldContinueRunning = false, false, false
            emitEvent = true
         }
         if emitEvent {
            dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedPlacementReason, "failed to place pod on %q: %s", node.ObjectMeta.Name, reason.GetReason())
         }
      }
   }

   // only emit this event if insufficient resource is the only thing
   // preventing the daemon pod from scheduling
   // After examining all reasons, if shouldSchedule is still true and insufficientResourceErr is not nil, set shouldSchedule to false (the only failure reason was InsufficientResourceError, and the node still cannot run the ds)
   if shouldSchedule && insufficientResourceErr != nil {
      dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedPlacementReason, "failed to place pod on %q: %s", node.ObjectMeta.Name, insufficientResourceErr.Error())
      shouldSchedule = false
   }
   return
}
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1289
func (dsc *DaemonSetsController) simulate(newPod *v1.Pod, node *v1.Node, ds *apps.DaemonSet) ([]algorithm.PredicateFailureReason, *schedulercache.NodeInfo, error) {
   // Get all pods on the node into objects
   objects, err := dsc.podNodeIndex.ByIndex("nodeName", node.Name)
   if err != nil {
      return nil, nil, err
   }

   // Create a nodeInfo and set its node
   nodeInfo := schedulercache.NewNodeInfo()
   nodeInfo.SetNode(node)

   // Loop over the pods in objects; for every pod that does not belong to this ds, call nodeInfo.AddPod so its resources are accounted to nodeInfo (tally the node's already-used resources)
   for _, obj := range objects {
      // Ignore pods that belong to the daemonset when taking into account whether a daemonset should bind to a node.
      // TODO: replace this with metav1.IsControlledBy() in 1.12
      pod, ok := obj.(*v1.Pod)
      if !ok {
         continue
      }
      if isControlledByDaemonSet(pod, ds.GetUID()) {
         continue
      }
      nodeInfo.AddPod(pod)
   }

   // Call the Predicates function to check with scheduling predicates whether the ds can run on the node
   _, reasons, err := Predicates(newPod, nodeInfo)
   return reasons, nodeInfo, err
}
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1461
func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
   var predicateFails []algorithm.PredicateFailureReason

   // If ScheduleDaemonSetPods is enabled, only check nodeSelector and nodeAffinity.
   // If the ScheduleDaemonSetPods feature gate is enabled (alpha and off by default in 1.11, beta and on by default in 1.12+), call checkNodeFitness, which only checks the PodFitsHost, PodMatchNodeSelector and PodToleratesNodeTaints predicates, then return the result
   if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
      fit, reasons, err := checkNodeFitness(pod, nil, nodeInfo)
      if err != nil {
         return false, predicateFails, err
      }
      if !fit {
         predicateFails = append(predicateFails, reasons...)
      }

      return len(predicateFails) == 0, predicateFails, nil
   }

   // Call kubelettypes.IsCriticalPod to check whether this is a critical pod: (1) the PodPriority feature gate is enabled (off by default) and the pod's Priority is >= 2000000000;
   // or (2) the ExperimentalCriticalPodAnnotation feature gate is enabled (off by default), the pod is in kube-system, and it carries the scheduler.alpha.kubernetes.io/critical-pod="" annotation. Either condition makes the pod critical.
   critical := kubelettypes.IsCriticalPod(pod)

   // First check the pod with the PodToleratesNodeTaints predicate
   fit, reasons, err := predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
   if err != nil {
      return false, predicateFails, err
   }
   if !fit {
      predicateFails = append(predicateFails, reasons...)
   }

   // If it is a critical pod, call predicates.EssentialPredicates, otherwise predicates.GeneralPredicates. EssentialPredicates checks PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector;
   // GeneralPredicates first checks PodFitsResources and then also calls EssentialPredicates.
   if critical {
      // If the pod is marked as critical and support for critical pod annotations is enabled,
      // check predicates for critical pods only.
      fit, reasons, err = predicates.EssentialPredicates(pod, nil, nodeInfo)
   } else {
      fit, reasons, err = predicates.GeneralPredicates(pod, nil, nodeInfo)
   }
   if err != nil {
      return false, predicateFails, err
   }
   if !fit {
      predicateFails = append(predicateFails, reasons...)
   }

   // Return the result of the predicate checks
   return len(predicateFails) == 0, predicateFails, nil
}
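
checkNodeFitness, which Predicates delegates to when ScheduleDaemonSetPods is enabled, is not shown above. It evaluates only the predicates that express "this pod is meant for this node" (PodFitsHost, PodMatchNodeSelector, PodToleratesNodeTaints) and deliberately skips resource checks, because in this mode the default scheduler is responsible for placing daemon pods. A sketch of that behavior (simplified into a loop; the real function checks the three predicates one after another with the same error-aggregation pattern as Predicates above):

func checkNodeFitness(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
   var predicateFails []algorithm.PredicateFailureReason
   for _, predicate := range []algorithm.FitPredicate{
      predicates.PodFitsHost,
      predicates.PodMatchNodeSelector,
      predicates.PodToleratesNodeTaints,
   } {
      fit, reasons, err := predicate(pod, meta, nodeInfo)
      if err != nil {
         return false, predicateFails, err
      }
      if !fit {
         predicateFails = append(predicateFails, reasons...)
      }
   }
   return len(predicateFails) == 0, predicateFails, nil
}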

The dsc.syncDaemonSet method

dsc.syncDaemonSet is the main logic the DaemonSet controller uses to process a DaemonSet. It works as follows:

  • Split the namespace and name out of the key passed in
  • Use that namespace and name to fetch the ds via dsc.dsLister.DaemonSets
  • Check whether ds.Spec.Selector is empty (it would select all pods); if so, record a warning event and return nil
  • Check that the ds's key (namespace/name) can be obtained; if that fails, return the error
  • Check the ds's DeletionTimestamp; if it is not nil the ds is being deleted, so return nil directly
  • Call dsc.constructHistory to get the current history (the ControllerRevision whose spec matches the ds) into cur, creating one if none exists and de-duplicating if there are several; all other histories go into old; then take the hash from cur's labels
  • Call dsc.expectations.SatisfiedExpectations to check whether the ds's expectations are satisfied, i.e. whether all previously requested pod creations/deletions have been observed (see the expectations example after the code below); if not, only call dsc.updateDaemonSetStatus to update the ds's status and return
  • Call dsc.manage to manage the ds: check how its pods are running on each node and take the corresponding actions
  • Call dsc.expectations.SatisfiedExpectations again; if the expectations are satisfied, update the ds according to UpdateStrategy.Type
    • If UpdateStrategy.Type is OnDelete, do nothing
    • If UpdateStrategy.Type is RollingUpdate, call dsc.rollingUpdate to update the ds
  • Call dsc.cleanupHistory to delete surplus histories according to the ds's Spec.RevisionHistoryLimit
  • Call dsc.updateDaemonSetStatus to update the ds's status, then return

As you can see, dsc.syncDaemonSet mainly calls dsc.updateDaemonSetStatus to update the DaemonSet's status, dsc.manage to manage the ds, dsc.rollingUpdate to roll out updates to the DaemonSet, and dsc.cleanupHistory to clean up surplus histories. Next, let's look at the detailed logic of these four methods.

k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1206
func (dsc *DaemonSetsController) syncDaemonSet(key string) error {
   startTime := time.Now()
   defer func() {
      glog.V(4).Infof("Finished syncing daemon set %q (%v)", key, time.Since(startTime))
   }()

   // Split the namespace and name out of the key passed in
   namespace, name, err := cache.SplitMetaNamespaceKey(key)
   if err != nil {
      return err
   }

   // Use that namespace and name to fetch the ds via dsc.dsLister.DaemonSets
   ds, err := dsc.dsLister.DaemonSets(namespace).Get(name)
   if errors.IsNotFound(err) {
      glog.V(3).Infof("daemon set has been deleted %v", key)
      dsc.expectations.DeleteExpectations(key)
      return nil
   }
   if err != nil {
      return fmt.Errorf("unable to retrieve ds %v from store: %v", key, err)
   }

   // Check whether ds.Spec.Selector is empty (it would select all pods); if so, record a warning event and return nil
   everything := metav1.LabelSelector{}
   if reflect.DeepEqual(ds.Spec.Selector, &everything) {
      dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, SelectingAllReason, "This daemon set is selecting all pods. A non-empty selector is required.")
      return nil
   }

   // Check that the ds's key (namespace/name) can be obtained; if that fails, return the error
   // Don't process a daemon set until all its creations and deletions have been processed.
   // For example if daemon set foo asked for 3 new daemon pods in the previous call to manage,
   // then we do not want to call manage on foo until the daemon pods have been created.
   dsKey, err := controller.KeyFunc(ds)
   if err != nil {
      return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
   }

   // Check the ds's DeletionTimestamp; if it is not nil the ds is being deleted, so return nil directly
   // If the DaemonSet is being deleted (either by foreground deletion or
   // orphan deletion), we cannot be sure if the DaemonSet history objects
   // it owned still exist -- those history objects can either be deleted
   // or orphaned. Garbage collector doesn't guarantee that it will delete
   // DaemonSet pods before deleting DaemonSet history objects, because
   // DaemonSet history doesn't own DaemonSet pods. We cannot reliably
   // calculate the status of a DaemonSet being deleted. Therefore, return
   // here without updating status for the DaemonSet being deleted.
   if ds.DeletionTimestamp != nil {
      return nil
   }

   // Call dsc.constructHistory to get the current history (the one matching the ds's spec) into cur, creating one if none exists and de-duplicating if there are several;
   // all other histories go into old; then take the hash from cur's labels.
   // Construct histories of the DaemonSet, and get the hash of current history
   cur, old, err := dsc.constructHistory(ds)
   if err != nil {
      return fmt.Errorf("failed to construct revisions of DaemonSet: %v", err)
   }
   hash := cur.Labels[apps.DefaultDaemonSetUniqueLabelKey]

   // Check whether the ds's expectations are satisfied; if not, only update the ds's status and return
   if !dsc.expectations.SatisfiedExpectations(dsKey) {
      // Only update status.
      return dsc.updateDaemonSetStatus(ds, hash)
   }

   // Call dsc.manage to manage the ds: check how its pods are running on each node and take the corresponding actions
   err = dsc.manage(ds, hash)
   if err != nil {
      return err
   }

   // Check the expectations again; if satisfied, update the ds according to UpdateStrategy.Type:
   // (1) OnDelete: do nothing; (2) RollingUpdate: call dsc.rollingUpdate to update the ds.
   // Process rolling updates if we're ready.
   if dsc.expectations.SatisfiedExpectations(dsKey) {
      switch ds.Spec.UpdateStrategy.Type {
      case apps.OnDeleteDaemonSetStrategyType:
      case apps.RollingUpdateDaemonSetStrategyType:
         err = dsc.rollingUpdate(ds, hash)
      }
      if err != nil {
         return err
      }
   }

   // Call dsc.cleanupHistory to delete surplus histories according to ds.Spec.RevisionHistoryLimit
   err = dsc.cleanupHistory(ds, old)
   if err != nil {
      return fmt.Errorf("failed to clean up revisions of DaemonSet: %v", err)
   }

   // Call dsc.updateDaemonSetStatus to update the ds's status, then return
   return dsc.updateDaemonSetStatus(ds, hash)
}
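
Since SatisfiedExpectations gates syncDaemonSet twice, it is worth spelling out what the expectations mechanism does: before issuing pod creations/deletions, manage (via syncNodes) records how many it is about to make, and the pod event handlers shown earlier (addPod/deletePod) decrement those counters as the resulting events are observed; until the counters reach zero, syncDaemonSet only refreshes status. A standalone illustration of that lifecycle (example code, not taken from daemon_controller.go; the key value is made up):

func expectationsLifecycleExample() {
   exp := controller.NewControllerExpectations()
   dsKey := "kube-system/node-exporter" // example namespace/name key

   // manage()/syncNodes() records the creations and deletions it is about to issue.
   exp.SetExpectations(dsKey, 2, 1)

   // While expectations are unfulfilled, syncDaemonSet only updates status.
   fmt.Println(exp.SatisfiedExpectations(dsKey)) // false

   // addPod/deletePod observe the resulting events and lower the counters.
   exp.CreationObserved(dsKey)
   exp.CreationObserved(dsKey)
   exp.DeletionObserved(dsKey)

   // Now a full sync (manage + rolling update) is allowed again.
   fmt.Println(exp.SatisfiedExpectations(dsKey)) // true
}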

The dsc.updateDaemonSetStatus method

dsc.updateDaemonSetStatus does exactly what its name suggests: it inspects the DaemonSet's state in the cluster and updates its status. The logic is:

  • Call dsc.getNodesToDaemonPods to obtain a map, keyed by node name, of the ds's pods currently running in the cluster. It is built by creating a PodControllerRefManager from ds.Spec.Selector and controllerKind, calling its ClaimPods method to get the pods, and then grouping the pods by the node name in their spec into nodeToDaemonPods. (The ClaimPods method of PodControllerRefManager is covered in detail in the ReplicaSet controller walkthrough: https://my.oschina.net/u/3797264/blog/2985926)
  • List all nodes in the cluster into nodeList
  • Loop over the nodes in nodeList:
    • Call dsc.nodeShouldRunDaemonPod to check whether the ds should run a pod on the node, storing the result in wantToRun
    • Check whether the node is already running a pod of the ds, storing the result in scheduled
    • From wantToRun, scheduled and the pods' Ready/Available state, compute the counters that make up the status; the code is very clear, so the individual counters are not listed here
    • Call storeDaemonSetStatus to compare the ds's current status with the computed values and, if they differ, update the ds's status through the API
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:1145
func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, hash string) error {
   glog.V(4).Infof("Updating daemon set status")
   // Call dsc.getNodesToDaemonPods to obtain a map, keyed by node name, of the ds's pods currently running in the cluster.
   // It is built by creating a PodControllerRefManager from ds.Spec.Selector and controllerKind, calling its ClaimPods method
   // to get the pods, then grouping the pods by the node name in their spec into nodeToDaemonPods.
   nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
   if err != nil {
      return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
   }

   // List all nodes in the cluster into nodeList
   nodeList, err := dsc.nodeLister.List(labels.Everything())
   if err != nil {
      return fmt.Errorf("couldn't get list of nodes when updating daemon set %#v: %v", ds, err)
   }

   // Loop over the nodes in nodeList:
   // 1. call dsc.nodeShouldRunDaemonPod to check whether the ds should run a pod on the node, storing the result in wantToRun;
   // 2. check whether the node is already running a pod of the ds, storing the result in scheduled;
   // 3. from wantToRun, scheduled and the pods' Ready/Available state, compute the status counters (the computation is straightforward);
   // 4. call storeDaemonSetStatus to compare the ds's current status with the computed values and, if they differ, update the status through the API.
   var desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable int
   for _, node := range nodeList {
      wantToRun, _, _, err := dsc.nodeShouldRunDaemonPod(node, ds)
      if err != nil {
         return err
      }

      scheduled := len(nodeToDaemonPods[node.Name]) > 0

      if wantToRun {
         desiredNumberScheduled++
         if scheduled {
            currentNumberScheduled++
            // Sort the daemon pods by creation time, so that the oldest is first.
            daemonPods, _ := nodeToDaemonPods[node.Name]
            sort.Sort(podByCreationTimestampAndPhase(daemonPods))
            pod := daemonPods[0]
            if podutil.IsPodReady(pod) {
               numberReady++
               if podutil.IsPodAvailable(pod, ds.Spec.MinReadySeconds, metav1.Now()) {
                  numberAvailable++
               }
            }
            // If the returned error is not nil we have a parse error.
            // The controller handles this via the hash.
            generation, err := util.GetTemplateGeneration(ds)
            if err != nil {
               generation = nil
            }
            if util.IsPodUpdated(pod, hash, generation) {
               updatedNumberScheduled++
            }
         }
      } else {
         if scheduled {
            numberMisscheduled++
         }
      }
   }
   numberUnavailable := desiredNumberScheduled - numberAvailable

   err = storeDaemonSetStatus(dsc.kubeClient.AppsV1().DaemonSets(ds.Namespace), ds, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable)
   if err != nil {
      return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
   }

   return nil
}

The dsc.manage method

dsc.manage manages the DaemonSet's pods: it works out which nodes need a daemon pod scheduled or deleted. The logic is:

  • Call dsc.getNodesToDaemonPods to obtain a map, keyed by node name, of the ds's pods currently running in the cluster
  • List all nodes in the cluster into nodeList
  • For every node in nodeList:
    • Call dsc.podsShouldBeOnNode to determine whether the node needs a ds pod created (if so its name goes into nodesNeedingDaemonPodsOnNode), which of the ds's pods on the node should be deleted (podsToDeleteOnNode), and how many of the ds's pods have failed on the node (failedPodsObservedOnNode)
    • Accumulate nodesNeedingDaemonPodsOnNode into nodesNeedingDaemonPods (all nodes that need a ds pod created), podsToDeleteOnNode into podsToDelete (all of the ds's pods that should be deleted), and failedPodsObservedOnNode into failedPodsObserved (the ds's current number of failed pods)
  • Call dsc.syncNodes with nodesNeedingDaemonPods and podsToDelete to create pods on the corresponding nodes and delete the pods that should be deleted (a simplified sketch of syncNodes follows the code below)
  • Check failedPodsObserved; if it is greater than 0, i.e. the ds has failed pods, return an error so the workqueue's rate limiter prevents a hot kill/recreate loop
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:941
func (dsc *DaemonSetsController) manage(ds *apps.DaemonSet, hash string) error {

   // Call dsc.getNodesToDaemonPods to obtain a map, keyed by node name, of the ds's pods currently running in the cluster
   // Find out the pods which are created for the nodes by DaemonSet.
   nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
   if err != nil {
      return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
   }

   // For each node, if the node is running the daemon pod but isn't supposed to, kill the daemon
   // pod. If the node is supposed to run the daemon pod, but isn't, create the daemon pod on the node.
   // List all nodes in the cluster into nodeList
   nodeList, err := dsc.nodeLister.List(labels.Everything())
   if err != nil {
      return fmt.Errorf("couldn't get list of nodes when syncing daemon set %#v: %v", ds, err)
   }
   var nodesNeedingDaemonPods, podsToDelete []string
   var failedPodsObserved int

   // For every node in nodeList:
   // 1. call dsc.podsShouldBeOnNode to determine whether the node needs a ds pod (nodesNeedingDaemonPodsOnNode), which of the ds's pods on the node should be deleted (podsToDeleteOnNode),
   //    and how many of the ds's pods have failed on the node (failedPodsObservedOnNode);
   // 2. accumulate these into nodesNeedingDaemonPods (all nodes that need a ds pod), podsToDelete (all pods of the ds that should be deleted)
   //    and failedPodsObserved (the ds's current number of failed pods).
   for _, node := range nodeList {
      nodesNeedingDaemonPodsOnNode, podsToDeleteOnNode, failedPodsObservedOnNode, err := dsc.podsShouldBeOnNode(
         node, nodeToDaemonPods, ds)

      if err != nil {
         continue
      }

      nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, nodesNeedingDaemonPodsOnNode...)
      podsToDelete = append(podsToDelete, podsToDeleteOnNode...)
      failedPodsObserved += failedPodsObservedOnNode
   }

   // Call dsc.syncNodes with nodesNeedingDaemonPods and podsToDelete to create pods on the corresponding nodes and delete the pods that should be deleted
   // Label new pods using the hash label value of the current history when creating them
   if err = dsc.syncNodes(ds, podsToDelete, nodesNeedingDaemonPods, hash); err != nil {
      return err
   }

   // If failedPodsObserved is greater than 0, i.e. the ds has failed pods, return an error so the rate limiter prevents a hot kill/recreate loop
   // Throw an error when the daemon pods fail, to use ratelimiter to prevent kill-recreate hot loop
   if failedPodsObserved > 0 {
      return fmt.Errorf("deleted %d failed pods of DaemonSet %s/%s", failedPodsObserved, ds.Namespace, ds.Name)
   }

   return nil
}
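
manage hands its two lists to dsc.syncNodes, which this article does not show. Roughly, syncNodes records the upcoming creations/deletions in the expectations, then creates a daemon pod (labeled with the current revision hash) on every node in nodesNeedingDaemonPods and deletes every pod named in podsToDelete through dsc.podControl, capping both lists at burstReplicas and creating pods in exponentially growing "slow start" batches. The following heavily simplified sketch only illustrates that flow; the batching, the burstReplicas cap, error collection, and the exact template/hash handling of the real method are all omitted, and treating the podControl calls this way is an assumption made for illustration:

func (dsc *DaemonSetsController) syncNodesSimplified(ds *apps.DaemonSet, podsToDelete, nodesNeedingDaemonPods []string, hash string) error {
   dsKey, err := controller.KeyFunc(ds)
   if err != nil {
      return err
   }
   // Record expectations first, so SatisfiedExpectations stays false until every resulting
   // pod add/delete event has been observed by the handlers registered earlier.
   dsc.expectations.SetExpectations(dsKey, len(nodesNeedingDaemonPods), len(podsToDelete))

   controllerRef := metav1.NewControllerRef(ds, apps.SchemeGroupVersion.WithKind("DaemonSet"))
   for _, nodeName := range nodesNeedingDaemonPods {
      // The real code builds the pod template once, stamps it with the revision hash label,
      // and creates pods in slow-start batches instead of one by one.
      if err := dsc.podControl.CreatePodsOnNode(nodeName, ds.Namespace, &ds.Spec.Template, ds, controllerRef); err != nil {
         dsc.expectations.CreationObserved(dsKey) // roll the expectation back on failure
      }
   }
   for _, podName := range podsToDelete {
      if err := dsc.podControl.DeletePod(ds.Namespace, podName, ds); err != nil {
         dsc.expectations.DeletionObserved(dsKey) // roll the expectation back on failure
      }
   }
   return nil
}
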
The dsc.podsShouldBeOnNode method

dsc.podsShouldBeOnNode decides, based on how the ds's pods are running on a node, whether a daemon pod should be scheduled onto the node, deleted from it, or handled otherwise. The logic is:

  • Call dsc.nodeShouldRunDaemonPod to get whether the ds should run on the node (wantToRun), whether it should be scheduled onto the node (shouldSchedule), and whether a ds pod already running on the node should keep running (shouldContinueRunning)
  • Check whether the node is running pods of the ds; the pods go into daemonPods and whether any exist into exists
  • Call dsc.removeSuspendedDaemonPods to remove this ds from the suspendedDaemonPods entry for the node
  • Based on wantToRun, shouldSchedule, shouldContinueRunning, daemonPods and exists, decide whether the node needs a ds pod (collected in nodesNeedingDaemonPods), which of the ds's pods on the node should be deleted, and how many of the ds's pods on the node have failed:
    • If wantToRun is true but shouldSchedule is false, the node is supposed to run the ds but cannot schedule it (for example because of insufficient resources), so call dsc.addSuspendedDaemonPods to add node.Name and the ds to the controller's suspendedDaemonPods map
    • If shouldSchedule is true but exists is false, the ds can be scheduled and run on the node but is not running yet, so append node.Name to nodesNeedingDaemonPods so a ds pod gets created there later
    • If shouldContinueRunning is true, the node already has ds pods that should keep running; examine every pod in daemonPods as follows, and afterwards, if daemonPodsRunning holds more than one pod, keep only the earliest-created pod and add the rest to podsToDelete
      • If the pod's DeletionTimestamp is not nil the pod is about to be deleted; skip it and handle it once its deletion produces a pod event
      • If the pod has failed, increment failedPodsObserved, then consult the pod's backoff key to decide whether to delete it now; the delete-on-failure delay for the same pod grows exponentially (1s, 2s, 4s, 8s, ..., capped at 15 minutes)
      • If the pod's status phase is not Failed, add it to daemonPodsRunning
    • If shouldContinueRunning is false but exists is true, the node should not run the ds's pods but is running them, so add all pods in daemonPods to podsToDelete
  • Return nodesNeedingDaemonPods (nodes that need a ds pod created), podsToDelete (the ds's pods on this node that should be deleted) and failedPodsObserved (how many of the ds's pods failed on this node)
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:860
func (dsc *DaemonSetsController) podsShouldBeOnNode(
   node *v1.Node,
   nodeToDaemonPods map[string][]*v1.Pod,
   ds *apps.DaemonSet,
) (nodesNeedingDaemonPods, podsToDelete []string, failedPodsObserved int, err error) {

   // Call dsc.nodeShouldRunDaemonPod to get whether the ds should run on the node (wantToRun), whether it should be scheduled onto it (shouldSchedule), and whether an already-running ds pod should keep running (shouldContinueRunning)
   wantToRun, shouldSchedule, shouldContinueRunning, err := dsc.nodeShouldRunDaemonPod(node, ds)
   if err != nil {
      return
   }

   // Check whether the ds already has pods on this node; the pods go into daemonPods and their existence into exists.
   daemonPods, exists := nodeToDaemonPods[node.Name]
   dsKey, _ := cache.MetaNamespaceKeyFunc(ds)

   // Remove this ds from the node's entry in suspendedDaemonPods.
   dsc.removeSuspendedDaemonPods(node.Name, dsKey)

   // Based on wantToRun, shouldSchedule, shouldContinueRunning, daemonPods and exists, decide whether the node
   // needs a ds pod (nodesNeedingDaemonPods), whether ds pods on the node need to be deleted, and count the
   // failed ds pods on the node:
   // (1) If wantToRun is true but shouldSchedule is false, the node should run the ds but the pod cannot be
   //     scheduled (insufficient resources or other problems), so dsc.addSuspendedDaemonPods records node.Name
   //     and the ds in the controller's suspendedDaemonPods map.
   // (2) If shouldSchedule is true but exists is false, the node can run the ds but no pod is running yet, so
   //     node.Name is appended to nodesNeedingDaemonPods for later creation.
   // (3) If shouldContinueRunning is true, the node already has ds pods that should keep running. Each pod in
   //     daemonPods is handled according to its state; afterwards, if daemonPodsRunning has more than one pod,
   //     only the earliest-created pod is kept and the rest are added to podsToDelete:
   //     a. If the pod's DeletionTimestamp is not nil, it is about to be deleted and is skipped; the pod event
   //        after deletion will trigger further handling.
   //     b. If the pod has failed, failedPodsObserved is incremented and the pod's backoff key decides whether
   //        it may be deleted now; the delay grows exponentially (1s, 2s, 4s, 8s..., capped at 15 minutes).
   //     c. If the pod's status phase is not Failed, it is added to daemonPodsRunning.
   // (4) If shouldContinueRunning is false but exists is true, the node should not run the ds but pods exist,
   //     so all pods in daemonPods are added to podsToDelete.
   switch {
   case wantToRun && !shouldSchedule:
      // If daemon pod is supposed to run, but can not be scheduled, add to suspended list.
      dsc.addSuspendedDaemonPods(node.Name, dsKey)
   case shouldSchedule && !exists:
      // If daemon pod is supposed to be running on node, but isn't, create daemon pod.
      nodesNeedingDaemonPods = append(nodesNeedingDaemonPods, node.Name)
   case shouldContinueRunning:
      // If a daemon pod failed, delete it
      // If there's non-daemon pods left on this node, we will create it in the next sync loop
      var daemonPodsRunning []*v1.Pod
      for _, pod := range daemonPods {
         if pod.DeletionTimestamp != nil {
            continue
         }
         if pod.Status.Phase == v1.PodFailed {
            failedPodsObserved++

            // This is a critical place where DS is often fighting with kubelet that rejects pods.
            // We need to avoid hot looping and backoff.
            backoffKey := failedPodsBackoffKey(ds, node.Name)

            now := dsc.failedPodsBackoff.Clock.Now()
            inBackoff := dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now)
            if inBackoff {
               delay := dsc.failedPodsBackoff.Get(backoffKey)
               glog.V(4).Infof("Deleting failed pod %s/%s on node %s has been limited by backoff - %v remaining",
                  pod.Namespace, pod.Name, node.Name, delay)
               dsc.enqueueDaemonSetAfter(ds, delay)
               continue
            }

            dsc.failedPodsBackoff.Next(backoffKey, now)

            msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
            glog.V(2).Infof(msg)
            // Emit an event so that it's discoverable to users.
            dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
            podsToDelete = append(podsToDelete, pod.Name)
         } else {
            daemonPodsRunning = append(daemonPodsRunning, pod)
         }
      }
      // If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
      // Sort the daemon pods by creation time, so the oldest is preserved.
      if len(daemonPodsRunning) > 1 {
         sort.Sort(podByCreationTimestampAndPhase(daemonPodsRunning))
         for i := 1; i < len(daemonPodsRunning); i++ {
            podsToDelete = append(podsToDelete, daemonPodsRunning[i].Name)
         }
      }
   case !shouldContinueRunning && exists:
      // If daemon pod isn't supposed to run on node, but it is, delete all daemon pods on node.
      for _, pod := range daemonPods {
         podsToDelete = append(podsToDelete, pod.Name)
      }
   }

   // Return nodesNeedingDaemonPods (nodes that need a ds pod), podsToDelete (ds pods to delete on this node) and failedPodsObserved (number of failed ds pods on this node).
   return nodesNeedingDaemonPods, podsToDelete, failedPodsObserved, nil
}
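To make the failure backoff concrete, here is a minimal sketch, not taken from the controller, that drives a flowcontrol.Backoff configured exactly like failedPodsBackoff (NewBackOff(1*time.Second, 15*time.Minute)) and prints how the delay for one key grows. The key string is a hypothetical stand-in for failedPodsBackoffKey(ds, node.Name).

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Same parameters the DaemonSet controller is started with.
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Minute)
	key := "kube-system/node-exporter/node-1" // hypothetical failedPodsBackoffKey(ds, nodeName)

	for attempt := 1; attempt <= 12; attempt++ {
		now := backoff.Clock.Now()
		// Next initializes the entry to 1s on first use, then doubles it, capped at 15m.
		backoff.Next(key, now)
		fmt.Printf("failure %2d: next deletion allowed after %v\n", attempt, backoff.Get(key))
	}
	// Prints 1s, 2s, 4s, ... and finally stays at 15m0s.
}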
The dsc.syncNodes method

Using the nodesNeedingDaemonPods and podsToDelete computed by dsc.podsShouldBeOnNode in the previous step, dsc.syncNodes creates ds pods on the corresponding nodes and deletes the pods in podsToDelete. The logic is as follows:

  • Set createDiff to the length of nodesNeedingDaemonPods and deleteDiff to the length of podsToDelete; if either exceeds 250 (burstReplicas), cap it at 250, i.e. at most 250 pods are created or deleted in a single sync
  • Call dsc.expectations.SetExpectations to record how many pods this sync is expected to create and delete
  • Read the ds's generation from the "deprecated.daemonset.template.generation" annotation
  • Call util.CreatePodTemplate to build the pod template (a sketch of the resulting tolerations and node affinity follows this list); the logic is:
    • Decide whether this is a critical pod. A pod is critical when all three conditions hold: a. the ExperimentalCriticalPodAnnotation feature gate is enabled (off by default); b. the pod's namespace is kube-system; c. the annotations contain scheduler.alpha.kubernetes.io/critical-pod=""
    • Add tolerations to the pod: two tolerations with keys "node.kubernetes.io/not-ready" and "node.kubernetes.io/unreachable", Operator "Exists", Effect "NoExecute"; tolerations with keys "node.kubernetes.io/disk-pressure", "node.kubernetes.io/memory-pressure" and "node.kubernetes.io/unschedulable", Operator "Exists", Effect "NoSchedule"; if the ds uses HostNetwork, an additional toleration with key "node.kubernetes.io/network-unavailable", Operator "Exists", Effect "NoSchedule"; and for critical pods, two more tolerations with key "node.kubernetes.io/out-of-disk", Operator "Exists", Effects "NoSchedule" and "NoExecute"
    • If generation and hash are available, add the "pod-template-generation" and "controller-revision-hash" labels with the corresponding values to the pod
  • Create pods according to createDiff; two details are worth noting:
    • The first round starts one goroutine to create 1 pod; afterwards the number of goroutines doubles each round, so the batch sizes are 1, 2, 4, 8..., and each round waits for the previous round's goroutines to finish (see the slow-start sketch after the syncNodes code below)
    • If the ScheduleDaemonSetPods feature gate is enabled (off by default in 1.11, on by default in 1.12+), a requiredDuringSchedulingIgnoredDuringExecution nodeAffinity is added to the pod before creation, with matchFields key metadata.name, operator In, and the target node name as the value
  • Call dsc.podControl.DeletePod, starting one goroutine per pod, to delete the pods in podsToDelete
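As a reference for the two items above, the following sketch builds by hand the default tolerations that util.CreatePodTemplate adds and the nodeAffinity that util.ReplaceDaemonSetPodNodeNameNodeAffinity produces when ScheduleDaemonSetPods is enabled. The values are written out manually here and the node name is hypothetical; the real controller derives them from the ds and the target node.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Default tolerations added to every daemon pod (the HostNetwork and
	// critical-pod tolerations described above are omitted for brevity).
	tolerations := []v1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/unreachable", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoExecute},
		{Key: "node.kubernetes.io/disk-pressure", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
		{Key: "node.kubernetes.io/memory-pressure", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
		{Key: "node.kubernetes.io/unschedulable", Operator: v1.TolerationOpExists, Effect: v1.TaintEffectNoSchedule},
	}

	// With ScheduleDaemonSetPods enabled, the pod is pinned to its target node
	// through a required nodeAffinity on the metadata.name field.
	nodeName := "node-1" // hypothetical target node
	affinity := &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}

	fmt.Printf("tolerations: %+v\n", tolerations)
	fmt.Printf("nodeAffinity: %+v\n", affinity.NodeAffinity)
}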
k8s.io/kubernetes/pkg/controller/daemon/daemon_controller.go:984
func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nodesNeedingDaemonPods []string, hash string) error {
   // We need to set expectations before creating/deleting pods to avoid race conditions.
   dsKey, err := controller.KeyFunc(ds)
   if err != nil {
      return fmt.Errorf("couldn't get key for object %#v: %v", ds, err)
   }

   // Set createDiff/deleteDiff from the lengths of nodesNeedingDaemonPods/podsToDelete and cap both at burstReplicas (250), i.e. at most 250 pods are created or deleted per sync.
   createDiff := len(nodesNeedingDaemonPods)
   deleteDiff := len(podsToDelete)

   if createDiff > dsc.burstReplicas {
      createDiff = dsc.burstReplicas
   }
   if deleteDiff > dsc.burstReplicas {
      deleteDiff = dsc.burstReplicas
   }

   // Record how many pods this sync is expected to create and delete.
   dsc.expectations.SetExpectations(dsKey, createDiff, deleteDiff)

   // error channel to communicate back failures.  make the buffer big enough to avoid any blocking
   errCh := make(chan error, createDiff+deleteDiff)

   glog.V(4).Infof("Nodes needing daemon pods for daemon set %s: %+v, creating %d", ds.Name, nodesNeedingDaemonPods, createDiff)
   createWait := sync.WaitGroup{}
   // If the returned error is not nil we have a parse error.
   // The controller handles this via the hash.
   // Read the ds's generation from the "deprecated.daemonset.template.generation" annotation.
   generation, err := util.GetTemplateGeneration(ds)
   if err != nil {
      generation = nil
   }
   
   // Build the pod template via util.CreatePodTemplate:
   // 1. Decide whether this is a critical pod (ExperimentalCriticalPodAnnotation feature gate enabled,
   //    namespace kube-system, and the scheduler.alpha.kubernetes.io/critical-pod="" annotation present).
   // 2. Add the default tolerations: not-ready/unreachable (NoExecute) and disk-pressure/memory-pressure/
   //    unschedulable (NoSchedule); plus network-unavailable (NoSchedule) when HostNetwork is used, and
   //    out-of-disk (NoSchedule and NoExecute) for critical pods.
   // 3. If generation and hash are available, add the "pod-template-generation" and "controller-revision-hash" labels.
   template := util.CreatePodTemplate(ds.Namespace, ds.Spec.Template, generation, hash)

   // Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize
   // and double with each successful iteration in a kind of "slow start".
   // This handles attempts to start large numbers of pods that would
   // likely all fail with the same error. For example a project with a
   // low quota that attempts to create a large number of pods will be
   // prevented from spamming the API service with the pod create requests
   // after one of its pods fails.  Conveniently, this also prevents the
   // event spam that those failures would generate.
   // Create pods according to createDiff. Note:
   // 1. The first round starts one goroutine to create 1 pod; the number of goroutines doubles each round
   //    (1, 2, 4, 8...), and each round waits for the previous one to finish.
   // 2. If the ScheduleDaemonSetPods feature gate is enabled (off by default in 1.11, on by default in 1.12+),
   //    a requiredDuringSchedulingIgnoredDuringExecution nodeAffinity with matchFields metadata.name In [nodeName]
   //    is added before creation.
   batchSize := integer.IntMin(createDiff, controller.SlowStartInitialBatchSize)
   for pos := 0; createDiff > pos; batchSize, pos = integer.IntMin(2*batchSize, createDiff-(pos+batchSize)), pos+batchSize {
      errorCount := len(errCh)
      createWait.Add(batchSize)
      for i := pos; i < pos+batchSize; i++ {
         go func(ix int) {
            defer createWait.Done()
            var err error

            podTemplate := &template

            if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
               podTemplate = template.DeepCopy()
               // The pod's NodeAffinity will be updated to make sure the Pod is bound
               // to the target node by default scheduler. It is safe to do so because there
               // should be no conflicting node affinity with the target node.
               podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
                  podTemplate.Spec.Affinity, nodesNeedingDaemonPods[ix])

               err = dsc.podControl.CreatePodsWithControllerRef(ds.Namespace, podTemplate,
                  ds, metav1.NewControllerRef(ds, controllerKind))
            } else {
               err = dsc.podControl.CreatePodsOnNode(nodesNeedingDaemonPods[ix], ds.Namespace, podTemplate,
                  ds, metav1.NewControllerRef(ds, controllerKind))
            }

            if err != nil && errors.IsTimeout(err) {
               // Pod is created but its initialization has timed out.
               // If the initialization is successful eventually, the
               // controller will observe the creation via the informer.
               // If the initialization fails, or if the pod keeps
               // uninitialized for a long time, the informer will not
               // receive any update, and the controller will create a new
               // pod when the expectation expires.
               return
            }
            if err != nil {
               glog.V(2).Infof("Failed creation, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
               dsc.expectations.CreationObserved(dsKey)
               errCh <- err
               utilruntime.HandleError(err)
            }
         }(i)
      }
      createWait.Wait()
      // any skipped pods that we never attempted to start shouldn't be expected.
      skippedPods := createDiff - batchSize
      if errorCount < len(errCh) && skippedPods > 0 {
         glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for set %q/%q", skippedPods, ds.Namespace, ds.Name)
         for i := 0; i < skippedPods; i++ {
            dsc.expectations.CreationObserved(dsKey)
         }
         // The skipped pods will be retried later. The next controller resync will
         // retry the slow start process.
         break
      }
   }

   // Delete the pods in podsToDelete, one goroutine per pod, via dsc.podControl.DeletePod.
   glog.V(4).Infof("Pods to delete for daemon set %s: %+v, deleting %d", ds.Name, podsToDelete, deleteDiff)
   deleteWait := sync.WaitGroup{}
   deleteWait.Add(deleteDiff)
   for i := 0; i < deleteDiff; i++ {
      go func(ix int) {
         defer deleteWait.Done()
         if err := dsc.podControl.DeletePod(ds.Namespace, podsToDelete[ix], ds); err != nil {
            glog.V(2).Infof("Failed deletion, decrementing expectations for set %q/%q", ds.Namespace, ds.Name)
            dsc.expectations.DeletionObserved(dsKey)
            errCh <- err
            utilruntime.HandleError(err)
         }
      }(i)
   }
   deleteWait.Wait()

   // collect errors if any for proper reporting/retry logic in the controller
   errors := []error{}
   close(errCh)
   for err := range errCh {
      errors = append(errors, err)
   }
   return utilerrors.NewAggregate(errors)
}
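The slow-start batching in the create loop above is easier to see with plain integers. This is a self-contained sketch under the assumption that createDiff is 23 and that the initial batch size is 1 (the controller's SlowStartInitialBatchSize); it reproduces only the batch-size arithmetic, not the pod creation.

package main

import "fmt"

// intMin mirrors integer.IntMin used by the controller.
func intMin(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	createDiff := 23                    // hypothetical number of pods to create
	const slowStartInitialBatchSize = 1 // the controller starts with a batch of 1

	batchSize := intMin(createDiff, slowStartInitialBatchSize)
	for pos := 0; createDiff > pos; batchSize, pos = intMin(2*batchSize, createDiff-(pos+batchSize)), pos+batchSize {
		fmt.Printf("creating pods %d..%d (batch of %d)\n", pos, pos+batchSize-1, batchSize)
	}
	// Output: batches of 1, 2, 4, 8 and then the remaining 8 pods.
}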

The dsc.rollingUpdate method

dsc.rollingUpdate handles the ds's rolling update: it deletes pods built from the old template and then lets the DaemonSet Controller create new ones. The logic is as follows:

  • Call dsc.getNodesToDaemonPods to get a map, keyed by node name, of the ds pods currently running in the cluster and the nodes they are on
  • Call dsc.getAllDaemonSetPods, which compares the Generation, hash and Template of the ds against the pods in nodeToDaemonPods, to collect the old pods into oldPods
  • Call dsc.getUnavailableNumbers to get the ds's maximum number of unavailable pods maxUnavailable and the current number of unavailable pods numUnavailable (see the sketch after the rollingUpdate code below); the calculation works as follows:
    • A pod counts towards numUnavailable in two cases: a. the node should run a ds pod but none is running on it; b. the node should run a ds pod and one is running, but the pod is not available
    • maxUnavailable is derived from ds.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable: an integer value is used as-is; a percentage is multiplied by the number of nodes that should run the ds and rounded up
  • Call util.SplitByAvailablePods to split oldPods into the available ones (oldAvailablePods) and the unavailable ones (oldUnavailablePods)
  • Add all pods in oldUnavailablePods to oldPodsToDelete for deletion
  • Then add up to maxUnavailable - numUnavailable pods from oldAvailablePods to oldPodsToDelete, so that the number of unavailable pods never exceeds maxUnavailable
  • Call dsc.syncNodes to delete the pods in oldPodsToDelete
func (dsc *DaemonSetsController) rollingUpdate(ds *apps.DaemonSet, hash string) error {
   
   // Get a map, keyed by node name, of the ds pods currently running in the cluster.
   nodeToDaemonPods, err := dsc.getNodesToDaemonPods(ds)
   if err != nil {
      return fmt.Errorf("couldn't get node to daemon pod mapping for daemon set %q: %v", ds.Name, err)
   }

   // Compare Generation, hash and Template to collect the old pods into oldPods.
   _, oldPods := dsc.getAllDaemonSetPods(ds, nodeToDaemonPods, hash)

   // Get maxUnavailable (the maximum number of unavailable ds pods) and numUnavailable (the current number of unavailable ds pods):
   // 1. A pod counts towards numUnavailable if the node should run a ds pod but none is running, or one is running but not available.
   // 2. maxUnavailable comes from ds.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable: integers are used as-is,
   //    percentages are multiplied by the number of nodes that should run the ds and rounded up.
   maxUnavailable, numUnavailable, err := dsc.getUnavailableNumbers(ds, nodeToDaemonPods)
   if err != nil {
      return fmt.Errorf("Couldn't get unavailable numbers: %v", err)
   }

   // Split oldPods into oldAvailablePods and oldUnavailablePods.
   oldAvailablePods, oldUnavailablePods := util.SplitByAvailablePods(ds.Spec.MinReadySeconds, oldPods)

   // Add all pods in oldUnavailablePods to oldPodsToDelete.
   // for oldPods delete all not running pods
   var oldPodsToDelete []string
   glog.V(4).Infof("Marking all unavailable old pods for deletion")
   for _, pod := range oldUnavailablePods {
      // Skip terminating pods. We won't delete them again
      if pod.DeletionTimestamp != nil {
         continue
      }
      glog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
      oldPodsToDelete = append(oldPodsToDelete, pod.Name)
   }

   // Then add up to maxUnavailable - numUnavailable pods from oldAvailablePods to oldPodsToDelete, keeping the number of unavailable pods within maxUnavailable.
   glog.V(4).Infof("Marking old pods for deletion")
   for _, pod := range oldAvailablePods {
      if numUnavailable >= maxUnavailable {
         glog.V(4).Infof("Number of unavailable DaemonSet pods: %d, is equal to or exceeds allowed maximum: %d", numUnavailable, maxUnavailable)
         break
      }
      glog.V(4).Infof("Marking pod %s/%s for deletion", ds.Name, pod.Name)
      oldPodsToDelete = append(oldPodsToDelete, pod.Name)
      numUnavailable++
   }

   // Delete the pods in oldPodsToDelete via dsc.syncNodes.
   return dsc.syncNodes(ds, oldPodsToDelete, []string{}, hash)
}
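A minimal sketch of the maxUnavailable arithmetic described above, using the intstr helpers from k8s.io/apimachinery. The node count of 10 is a hypothetical desiredNumberScheduled; the controller obtains the real value inside dsc.getUnavailableNumbers.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	desiredNumberScheduled := 10 // hypothetical number of nodes that should run the ds

	// A percentage is scaled by the node count and rounded up.
	pct := intstr.FromString("25%")
	maxUnavailable, err := intstr.GetValueFromIntOrPercent(&pct, desiredNumberScheduled, true)
	if err != nil {
		panic(err)
	}
	fmt.Println("25% of 10 nodes, rounded up:", maxUnavailable) // 3

	// An integer value is used as-is.
	abs := intstr.FromInt(2)
	maxUnavailable, _ = intstr.GetValueFromIntOrPercent(&abs, desiredNumberScheduled, true)
	fmt.Println("absolute value 2:", maxUnavailable) // 2
}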

Summary

This completes the walkthrough of the DaemonSet Controller. Its job is to control how DaemonSets run on Nodes: it watches DaemonSet, ControllerRevision, Pod and Node events in the cluster, maintains a queue of DaemonSets that need to be synced, and starts --concurrent-daemonset-syncs goroutines that repeatedly take DaemonSets from that queue and process them. dsc.syncDaemonSet holds the main processing logic; it calls dsc.manage to control how the ds's pods run on nodes and uses the Create/Delete Pod APIs to create and delete them. Two more points are worth noting:

1. By default, DaemonSet pods get two tolerations with keys "node.kubernetes.io/not-ready" and "node.kubernetes.io/unreachable", Operator "Exists", Effect "NoExecute", plus tolerations with keys "node.kubernetes.io/disk-pressure", "node.kubernetes.io/memory-pressure" and "node.kubernetes.io/unschedulable", Operator "Exists", Effect "NoSchedule". If the ds uses HostNetwork, a toleration with key "node.kubernetes.io/network-unavailable", Operator "Exists", Effect "NoSchedule" is also added; for critical pods, two more tolerations with key "node.kubernetes.io/out-of-disk", Operator "Exists", Effects "NoSchedule" and "NoExecute" are added.

2. A DaemonSet has two update strategies, OnDelete and RollingUpdate (a minimal sketch of the two settings follows). When the DaemonSet's Spec is updated: with OnDelete the DaemonSet Controller does nothing on its own; with RollingUpdate it first deletes up to maxUnavailable old pods and then creates new ones, repeating until every pod matches the new Spec.
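For reference, this minimal sketch shows how the two strategies look when constructed with the apps/v1 Go types; the 10% maxUnavailable is just an example value.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// OnDelete: the controller only replaces pods after the user deletes them.
	onDelete := appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.OnDeleteDaemonSetStrategyType,
	}

	// RollingUpdate: the controller deletes up to maxUnavailable old pods at a
	// time and lets the sync loop recreate them from the new template.
	maxUnavailable := intstr.FromString("10%") // example value
	rolling := appsv1.DaemonSetUpdateStrategy{
		Type: appsv1.RollingUpdateDaemonSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDaemonSet{
			MaxUnavailable: &maxUnavailable,
		},
	}

	fmt.Printf("OnDelete strategy: %+v\n", onDelete)
	fmt.Printf("RollingUpdate strategy: %+v\n", rolling)
}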

Reposted from: https://my.oschina.net/u/3797264/blog/3000510
