Kubernetes 1.13.0 Kube-controller-manager: Reading the StatefulSet Controller Source Code

Preface

The Kube-controller-manager component ultimately starts many controllers. This article reads through and analyzes the source code of one of them, the StatefulSet controller.

Starting the StatefulSet Controller

The startStatefulSetController function is Kube-controller-manager's entry point for starting the StatefulSet controller. The function is short, with just three steps:

  • Check whether the apps/v1/statefulsets resource is available
  • Call the NewStatefulSetController function in the statefulset package to create a StatefulSetController instance
  • Call the StatefulSetController instance's Run method in a new goroutine
k8s.io/kubernetes/cmd/kube-controller-manager/app/apps.go:55
func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
   if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
      return nil, false, nil
   }
   go statefulset.NewStatefulSetController(
      ctx.InformerFactory.Core().V1().Pods(),
      ctx.InformerFactory.Apps().V1().StatefulSets(),
      ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
      ctx.InformerFactory.Apps().V1().ControllerRevisions(),
      ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
   ).Run(1, ctx.Stop)
   return nil, true, nil
}

Creating the StatefulSet Controller

Kube-controller-manager calls the NewStatefulSetController function to create the StatefulSetController instance. The logic of NewStatefulSetController is as follows:

  • Create the corresponding eventBroadcaster
  • Create the defaultStatefulSetControl
  • Register the HasSynced functions of the PVC and ControllerRevision informers with the StatefulSetController
  • Register Add/Update/Delete event handlers on the pod informer; all three handlers ultimately add the pod's owning StatefulSet to the StatefulSetController's queue (their bodies are sketched after the code below)
  • Register the pod Lister and HasSynced function with the StatefulSetController
  • Register Add/Update/Delete event handlers on the StatefulSet informer; these three handlers likewise add the StatefulSet to the StatefulSetController's queue
  • Register the StatefulSet Lister and HasSynced function with the StatefulSetController

As you can see, the StatefulSetController watches the cluster's Pods and StatefulSets, and maintains a queue holding the StatefulSets that need processing.

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go:81
func NewStatefulSetController(
   podInformer coreinformers.PodInformer,
   setInformer appsinformers.StatefulSetInformer,
   pvcInformer coreinformers.PersistentVolumeClaimInformer,
   revInformer appsinformers.ControllerRevisionInformer,
   kubeClient clientset.Interface,
) *StatefulSetController {
   eventBroadcaster := record.NewBroadcaster()
   eventBroadcaster.StartLogging(klog.Infof)
   eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
   recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "statefulset-controller"})

   ssc := &StatefulSetController{
      kubeClient: kubeClient,
      control: NewDefaultStatefulSetControl(
         NewRealStatefulPodControl(
            kubeClient,
            setInformer.Lister(),
            podInformer.Lister(),
            pvcInformer.Lister(),
            recorder),
         NewRealStatefulSetStatusUpdater(kubeClient, setInformer.Lister()),
         history.NewHistory(kubeClient, revInformer.Lister()),
         recorder,
      ),
      pvcListerSynced: pvcInformer.Informer().HasSynced,
      queue:           workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "statefulset"),
      podControl:      controller.RealPodControl{KubeClient: kubeClient, Recorder: recorder},

      revListerSynced: revInformer.Informer().HasSynced,
   }

   podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      // lookup the statefulset and enqueue
      AddFunc: ssc.addPod,
      // lookup current and old statefulset if labels changed
      UpdateFunc: ssc.updatePod,
      // lookup statefulset accounting for deletion tombstones
      DeleteFunc: ssc.deletePod,
   })
   ssc.podLister = podInformer.Lister()
   ssc.podListerSynced = podInformer.Informer().HasSynced

   setInformer.Informer().AddEventHandlerWithResyncPeriod(
      cache.ResourceEventHandlerFuncs{
         AddFunc: ssc.enqueueStatefulSet,
         UpdateFunc: func(old, cur interface{}) {
            oldPS := old.(*apps.StatefulSet)
            curPS := cur.(*apps.StatefulSet)
            if oldPS.Status.Replicas != curPS.Status.Replicas {
               klog.V(4).Infof("Observed updated replica count for StatefulSet: %v, %d->%d", curPS.Name, oldPS.Status.Replicas, curPS.Status.Replicas)
            }
            ssc.enqueueStatefulSet(cur)
         },
         DeleteFunc: ssc.enqueueStatefulSet,
      },
      statefulSetResyncPeriod,
   )
   ssc.setLister = setInformer.Lister()
   ssc.setListerSynced = setInformer.Informer().HasSynced

   // TODO: Watch volumes
   return ssc
}
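
The bodies of the pod event handlers are not shown above. Below is a lightly trimmed sketch of addPod, paraphrased from the 1.13 sources rather than quoted verbatim; updatePod and deletePod follow the same pattern of resolving the pod's owning StatefulSet through its ControllerRef (or through matching selectors, for orphans) and enqueueing it.

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go (sketch)
func (ssc *StatefulSetController) addPod(obj interface{}) {
   pod := obj.(*v1.Pod)

   if pod.DeletionTimestamp != nil {
      // On a controller-manager restart a new pod may already be pending deletion;
      // treat it as a deletion rather than a creation observation.
      ssc.deletePod(pod)
      return
   }

   // If the pod has a ControllerRef pointing at a StatefulSet, enqueue that StatefulSet.
   if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
      set := ssc.resolveControllerRef(pod.Namespace, controllerRef)
      if set == nil {
         return
      }
      ssc.enqueueStatefulSet(set)
      return
   }

   // Otherwise the pod is an orphan: enqueue every StatefulSet whose selector
   // matches it, so one of them can adopt it during sync.
   for _, set := range ssc.getStatefulSetsForPod(pod) {
      ssc.enqueueStatefulSet(set)
   }
}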

Running the StatefulSet Controller

After the StatefulSetController is created, its Run method is called to start it.

  • Run first waits until the HasSynced functions of the Pod, StatefulSet, PVC and ControllerRevision informers all return true, i.e. until the Pod, StatefulSet, PVC and ControllerRevision caches have finished syncing
  • It then starts one goroutine running the StatefulSetController's worker. The worker loops over processNextWorkItem, which takes a StatefulSet key from the controller's queue (the key format is sketched after the code below) and hands it to the controller's sync method; if sync succeeds the key is forgotten, otherwise it is re-added to the queue after a rate-limited delay
k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go:147
func (ssc *StatefulSetController) Run(workers int, stopCh <-chan struct{}) {
   defer utilruntime.HandleCrash()
   defer ssc.queue.ShutDown()

   klog.Infof("Starting stateful set controller")
   defer klog.Infof("Shutting down statefulset controller")

   if !controller.WaitForCacheSync("stateful set", stopCh, ssc.podListerSynced, ssc.setListerSynced, ssc.pvcListerSynced, ssc.revListerSynced) {
      return
   }

   for i := 0; i < workers; i++ {
      go wait.Until(ssc.worker, time.Second, stopCh)
   }

   <-stopCh
}

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go:409
func (ssc *StatefulSetController) worker() {
   for ssc.processNextWorkItem() {
   }
}

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go:393
func (ssc *StatefulSetController) processNextWorkItem() bool {
   key, quit := ssc.queue.Get()
   if quit {
      return false
   }
   defer ssc.queue.Done(key)
   if err := ssc.sync(key.(string)); err != nil {
      utilruntime.HandleError(fmt.Errorf("Error syncing StatefulSet %v, requeuing: %v", key.(string), err))
      ssc.queue.AddRateLimited(key)
   } else {
      ssc.queue.Forget(key)
   }
   return true
}
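
The keys on the queue are plain "namespace/name" strings. They are produced by enqueueStatefulSet, sketched below (paraphrased; controller.KeyFunc is cache.DeletionHandlingMetaNamespaceKeyFunc), and split back apart by cache.SplitMetaNamespaceKey in the sync method shown next.

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go (sketch)
func (ssc *StatefulSetController) enqueueStatefulSet(obj interface{}) {
   // KeyFunc renders the object as a "namespace/name" key.
   key, err := controller.KeyFunc(obj)
   if err != nil {
      utilruntime.HandleError(fmt.Errorf("couldn't get key for object %+v: %v", obj, err))
      return
   }
   ssc.queue.Add(key)
}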

The StatefulSetController.sync method

The sync method processes a StatefulSet key taken from the StatefulSetController's queue. Its flow is:

  • Split the key into namespace and name, then fetch the corresponding StatefulSet object
  • Call metav1.LabelSelectorAsSelector to build a selector from the MatchExpressions and MatchLabels in the StatefulSet's Spec.Selector
  • Call the controller's adoptOrphanRevisions method. It fetches the ControllerRevisions in the StatefulSet's namespace that match the StatefulSet's Spec.Selector; for any fetched ControllerRevision that has no owner (i.e. no OwnerReference), it sets the owner to the current StatefulSet
  • Call the controller's getPodsForStatefulSet method to collect all pods belonging to the StatefulSet (a sketch of this method follows the code below). It lists all pods in the StatefulSet's namespace and checks each one: if a pod's owner is the StatefulSet but the pod no longer actually matches it, the pod's OwnerReference is removed; if a pod has no owner but does match the StatefulSet, its OwnerReference is set to the StatefulSet. (Note: the ClaimPods method of PodControllerRefManager is covered in detail in the replicaset-controller walkthrough: https://my.oschina.net/u/3797264/blog/2985926)
  • Call the controller's syncStatefulSet method to continue processing the StatefulSet and all of its pods; syncStatefulSet simply delegates to defaultStatefulSetControl.UpdateStatefulSet
k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go:415
func (ssc *StatefulSetController) sync(key string) error {
   startTime := time.Now()
   defer func() {
      klog.V(4).Infof("Finished syncing statefulset %q (%v)", key, time.Since(startTime))
   }()

   namespace, name, err := cache.SplitMetaNamespaceKey(key)
   if err != nil {
      return err
   }
   set, err := ssc.setLister.StatefulSets(namespace).Get(name)
   if errors.IsNotFound(err) {
      klog.Infof("StatefulSet has been deleted %v", key)
      return nil
   }
   if err != nil {
      utilruntime.HandleError(fmt.Errorf("unable to retrieve StatefulSet %v from store: %v", key, err))
      return err
   }

   selector, err := metav1.LabelSelectorAsSelector(set.Spec.Selector)
   if err != nil {
      utilruntime.HandleError(fmt.Errorf("error converting StatefulSet %v selector: %v", key, err))
      // This is a non-transient error, so don't retry.
      return nil
   }

   if err := ssc.adoptOrphanRevisions(set); err != nil {
      return err
   }

   pods, err := ssc.getPodsForStatefulSet(set, selector)
   if err != nil {
      return err
   }

   return ssc.syncStatefulSet(set, pods)
}

// syncStatefulSet syncs a tuple of (statefulset, []*v1.Pod).
func (ssc *StatefulSetController) syncStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {
   klog.V(4).Infof("Syncing StatefulSet %v/%v with %d pods", set.Namespace, set.Name, len(pods))
   // TODO: investigate where we mutate the set during the update as it is not obvious.
   if err := ssc.control.UpdateStatefulSet(set.DeepCopy(), pods); err != nil {
      return err
   }
   klog.V(4).Infof("Successfully synced StatefulSet %s/%s successful", set.Namespace, set.Name)
   return nil
}
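
getPodsForStatefulSet itself is short; the claim/release logic described above lives in ClaimPods. A paraphrased sketch from the 1.13 sources:

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set.go (sketch)
func (ssc *StatefulSetController) getPodsForStatefulSet(set *apps.StatefulSet, selector labels.Selector) ([]*v1.Pod, error) {
   // List every pod in the namespace, including pods that no longer match the
   // selector but still carry a ControllerRef pointing at this StatefulSet.
   pods, err := ssc.podLister.Pods(set.Namespace).List(labels.Everything())
   if err != nil {
      return nil, err
   }

   // Only claim pods whose name matches the <set name>-<ordinal> pattern.
   filter := func(pod *v1.Pod) bool {
      return isMemberOf(set, pod)
   }

   cm := controller.NewPodControllerRefManager(ssc.podControl, set, selector, controllerKind, ssc.canAdoptFunc(set))
   return cm.ClaimPods(pods, filter)
}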

The defaultStatefulSetControl.UpdateStatefulSet method

The main flow of UpdateStatefulSet is:

  • Call defaultStatefulSetControl's ListRevisions method to fetch all of the StatefulSet's ControllerRevisions via its Spec.Selector, then sort them
  • Call defaultStatefulSetControl's getStatefulSetRevisions method to obtain the StatefulSet's currentRevision and updateRevision. currentRevision and updateRevision differ in the following cases:
    • spec.updateStrategy is RollingUpdate and the StatefulSet is mid-update: they differ until the update completes
    • spec.updateStrategy is RollingUpdate and .spec.updateStrategy.rollingUpdate.partition is non-zero: after the StatefulSet is updated, they remain different
    • spec.updateStrategy is OnDelete: as long as the StatefulSet has ever been updated (spec.Template has changed from the original), they differ
  • Call defaultStatefulSetControl's updateStatefulSet method; it is the core of the StatefulSetController's handling of a StatefulSet and is examined in detail below
  • Call defaultStatefulSetControl's updateStatefulSetStatus method to write the status obtained in the previous step back to the StatefulSet. Note that updateStatefulSetStatus only sets status.CurrentRevision to status.UpdateRevision when the StatefulSet's Spec.UpdateStrategy.Type is RollingUpdate, status.UpdatedReplicas equals status.Replicas, and status.ReadyReplicas equals status.Replicas; this explains the second case above where currentRevision and updateRevision stay different
  • Call defaultStatefulSetControl's truncateHistory method: per the StatefulSet's Spec.RevisionHistoryLimit, keep only that many inactive ControllerRevisions (revisions with no corresponding pods); see the sketch after the code below
k8s.io/kubernetes/pkg/controller/statefulset/stateful_set_control.go:75
func (ssc *defaultStatefulSetControl) UpdateStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {

   // list all revisions and sort them
   revisions, err := ssc.ListRevisions(set)
   if err != nil {
      return err
   }
   history.SortControllerRevisions(revisions)

   // get the current, and update revisions
   currentRevision, updateRevision, collisionCount, err := ssc.getStatefulSetRevisions(set, revisions)
   if err != nil {
      return err
   }

   // perform the main update function and get the status
   status, err := ssc.updateStatefulSet(set, currentRevision, updateRevision, collisionCount, pods)
   if err != nil {
      return err
   }

   // update the set's status
   err = ssc.updateStatefulSetStatus(set, status)
   if err != nil {
      return err
   }

   klog.V(4).Infof("StatefulSet %s/%s pod status replicas=%d ready=%d current=%d updated=%d",
      set.Namespace,
      set.Name,
      status.Replicas,
      status.ReadyReplicas,
      status.CurrentReplicas,
      status.UpdatedReplicas)

   klog.V(4).Infof("StatefulSet %s/%s revisions current=%s update=%s",
      set.Namespace,
      set.Name,
      status.CurrentRevision,
      status.UpdateRevision)

   // maintain the set's revision history limit
   return ssc.truncateHistory(set, pods, revisions, currentRevision, updateRevision)
}
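
truncateHistory, mentioned in the last step above, only deletes revisions that are not "live": the current revision, the update revision, and any revision still referenced by a pod are always kept. A paraphrased sketch from the 1.13 sources:

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set_control.go (sketch)
func (ssc *defaultStatefulSetControl) truncateHistory(
   set *apps.StatefulSet,
   pods []*v1.Pod,
   revisions []*apps.ControllerRevision,
   current *apps.ControllerRevision,
   update *apps.ControllerRevision) error {
   // Mark the current and update revisions, plus every revision a pod points at, as live.
   live := map[string]bool{current.Name: true, update.Name: true}
   for i := range pods {
      live[getPodRevision(pods[i])] = true
   }
   // Collect the non-live (inactive) revisions; revisions is already sorted oldest first.
   history := make([]*apps.ControllerRevision, 0, len(revisions))
   for i := range revisions {
      if !live[revisions[i].Name] {
         history = append(history, revisions[i])
      }
   }
   historyLen := len(history)
   historyLimit := int(*set.Spec.RevisionHistoryLimit)
   if historyLen <= historyLimit {
      return nil
   }
   // Delete the oldest inactive revisions beyond the limit.
   history = history[:(historyLen - historyLimit)]
   for i := 0; i < len(history); i++ {
      if err := ssc.controllerHistory.DeleteControllerRevision(history[i]); err != nil {
         return err
      }
   }
   return nil
}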

The defaultStatefulSetControl.updateStatefulSet method

This method is the heart of the StatefulSetController's handling of a StatefulSet. The main flow is as follows (the helper predicates it relies on, such as isHealthy and getPodRevision, are sketched after the code below):

  • Apply currentRevision and updateRevision to obtain the corresponding StatefulSet objects
  • Initialize the status's ObservedGeneration, CurrentRevision and UpdateRevision
  • From the StatefulSet's pods, compute the status's Replicas (number of pods), ReadyReplicas (pods that are Running and Ready), CurrentReplicas (pods at currentRevision) and UpdatedReplicas (pods at updateRevision)
  • Partition the StatefulSet's pods into two slices, replicas and condemned, by ord (the ordinal parsed from the pod name):
    • Pods with 0 <= ord < Spec.Replicas go into replicas; these are the valid pods
    • Pods with ord >= Spec.Replicas go into condemned; these are slated for deletion
  • Check the replicas slice for missing pods, i.e. whether every index in [0, Spec.Replicas) is populated. For each gap, call newVersionedStatefulSetPod to construct a pod object at the appropriate revision, based on the StatefulSet's configuration and status:
    • If currentSet.Spec.UpdateStrategy.RollingUpdate.Partition is unset and ord is less than Status.CurrentReplicas, construct a pod object at currentRevision
    • If currentSet.Spec.UpdateStrategy.RollingUpdate.Partition is set and ord is less than the Partition value, likewise construct a pod object at currentRevision
    • In every other case, construct a pod object at updateRevision
  • Find firstUnhealthyPod, the unhealthy pod with the smallest ord across the replicas and condemned slices; a pod is healthy if it is Running and Ready and not terminating
  • Iterate over the replicas slice and process each pod as follows:
    • If the pod has failed (pod.Status.Phase is Failed), delete it, decrement status.Replicas, decrement CurrentReplicas or UpdatedReplicas according to the pod's revision, and finally call newVersionedStatefulSetPod to construct a replacement pod object at the appropriate revision
    • If the pod has not been created yet (a non-empty pod.Status.Phase indicates the pod has already been created), create it, increment status.Replicas, and increment CurrentReplicas or UpdatedReplicas according to the pod's revision. Then check Spec.PodManagementPolicy: if it is OrderedReady, return the status immediately; if it is Parallel, continue with the next pod
    • If the pod is terminating (pod.DeletionTimestamp is non-nil) and Spec.PodManagementPolicy is not Parallel, return the status immediately
    • If the pod is not Running and Ready and Spec.PodManagementPolicy is not Parallel, return the status immediately
    • Call the identityMatches and storageMatches functions to check whether the pod's identity and storage match the StatefulSet:
      • The identity matches when all of the following hold:
        • The ordinal parsed from the pod name is >= 0
        • The StatefulSet name plus the ordinal equals the pod name
        • The pod is in the same namespace as the StatefulSet
        • The pod's statefulset.kubernetes.io/pod-name label equals the pod name
      • The storage does not match if any of the following holds:
        • The ordinal parsed from the pod name is < 0
        • A volume in set.Spec.VolumeClaimTemplates is missing from pod.Spec.Volumes
        • A corresponding volume in pod.Spec.Volumes has a nil PVC source
        • A corresponding volume's PVC ClaimName does not match the name the StatefulSet expects
    • If both identity and storage match, continue with the next pod; if either does not match, call realStatefulPodControl.UpdateStatefulPod to update the pod's identity and storage (creating the pod's PVCs if they do not exist yet)
  • Walk the condemned slice in reverse order and process each pod as follows:
    • If the pod is terminating (pod.DeletionTimestamp is non-nil), check Spec.PodManagementPolicy: if it is Parallel, continue with the next pod; if it is OrderedReady, return the status immediately
    • If the pod is not Running and Ready, Spec.PodManagementPolicy is OrderedReady, and the pod is not firstUnhealthyPod, return the status immediately
    • Otherwise delete the pod and decrement CurrentReplicas or UpdatedReplicas according to the pod's revision
    • If Spec.PodManagementPolicy is OrderedReady, return the status immediately
  • If Spec.UpdateStrategy.Type is OnDelete, do nothing more and return the status
  • If Spec.UpdateStrategy.Type is RollingUpdate, walk the replicas slice in reverse over the indices >= Spec.UpdateStrategy.RollingUpdate.Partition and process each pod as follows. Note that the rolling update path never consults Spec.PodManagementPolicy, so the Parallel policy does not apply to rolling updates:
    • If the pod's revision is not updateRevision and the pod is not terminating (pod.DeletionTimestamp is nil), delete the pod, decrement CurrentReplicas, and return the status immediately
    • If the pod is not healthy (Running and Ready and not terminating), return the status immediately
k8s.io/kubernetes/pkg/controller/statefulset/stateful_set_control.go:254
func (ssc *defaultStatefulSetControl) updateStatefulSet(
   set *apps.StatefulSet,
   currentRevision *apps.ControllerRevision,
   updateRevision *apps.ControllerRevision,
   collisionCount int32,
   pods []*v1.Pod) (*apps.StatefulSetStatus, error) {
   // get the current and update revisions of the set.
   currentSet, err := ApplyRevision(set, currentRevision)
   if err != nil {
      return nil, err
   }
   updateSet, err := ApplyRevision(set, updateRevision)
   if err != nil {
      return nil, err
   }

   // set the generation, and revisions in the returned status
   status := apps.StatefulSetStatus{}
   status.ObservedGeneration = set.Generation
   status.CurrentRevision = currentRevision.Name
   status.UpdateRevision = updateRevision.Name
   status.CollisionCount = new(int32)
   *status.CollisionCount = collisionCount

   replicaCount := int(*set.Spec.Replicas)
   // slice that will contain all Pods such that 0 <= getOrdinal(pod) < set.Spec.Replicas
   replicas := make([]*v1.Pod, replicaCount)
   // slice that will contain all Pods such that set.Spec.Replicas <= getOrdinal(pod)
   condemned := make([]*v1.Pod, 0, len(pods))
   unhealthy := 0
   firstUnhealthyOrdinal := math.MaxInt32
   var firstUnhealthyPod *v1.Pod

   // First we partition pods into two lists valid replicas and condemned Pods
   for i := range pods {
      status.Replicas++

      // count the number of running and ready replicas
      if isRunningAndReady(pods[i]) {
         status.ReadyReplicas++
      }

      // count the number of current and update replicas
      if isCreated(pods[i]) && !isTerminating(pods[i]) {
         if getPodRevision(pods[i]) == currentRevision.Name {
            status.CurrentReplicas++
         }
         if getPodRevision(pods[i]) == updateRevision.Name {
            status.UpdatedReplicas++
         }
      }

      if ord := getOrdinal(pods[i]); 0 <= ord && ord < replicaCount {
         // if the ordinal of the pod is within the range of the current number of replicas,
         // insert it at the indirection of its ordinal
         replicas[ord] = pods[i]

      } else if ord >= replicaCount {
         // if the ordinal is greater than the number of replicas add it to the condemned list
         condemned = append(condemned, pods[i])
      }
      // If the ordinal could not be parsed (ord < 0), ignore the Pod.
   }

   // for any empty indices in the sequence [0,set.Spec.Replicas) create a new Pod at the correct revision
   for ord := 0; ord < replicaCount; ord++ {
      if replicas[ord] == nil {
         replicas[ord] = newVersionedStatefulSetPod(
            currentSet,
            updateSet,
            currentRevision.Name,
            updateRevision.Name, ord)
      }
   }

   // sort the condemned Pods by their ordinals
   sort.Sort(ascendingOrdinal(condemned))

   // find the first unhealthy Pod
   for i := range replicas {
      if !isHealthy(replicas[i]) {
         unhealthy++
         if ord := getOrdinal(replicas[i]); ord < firstUnhealthyOrdinal {
            firstUnhealthyOrdinal = ord
            firstUnhealthyPod = replicas[i]
         }
      }
   }

   for i := range condemned {
      if !isHealthy(condemned[i]) {
         unhealthy++
         if ord := getOrdinal(condemned[i]); ord < firstUnhealthyOrdinal {
            firstUnhealthyOrdinal = ord
            firstUnhealthyPod = condemned[i]
         }
      }
   }

   if unhealthy > 0 {
      klog.V(4).Infof("StatefulSet %s/%s has %d unhealthy Pods starting with %s",
         set.Namespace,
         set.Name,
         unhealthy,
         firstUnhealthyPod.Name)
   }

   // If the StatefulSet is being deleted, don't do anything other than updating
   // status.
   if set.DeletionTimestamp != nil {
      return &status, nil
   }

   monotonic := !allowsBurst(set)

   // Examine each replica with respect to its ordinal
   for i := range replicas {
      // delete and recreate failed pods
      if isFailed(replicas[i]) {
         ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
            "StatefulSet %s/%s is recreating failed Pod %s",
            set.Namespace,
            set.Name,
            replicas[i].Name)
         if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
            return &status, err
         }
         if getPodRevision(replicas[i]) == currentRevision.Name {
            status.CurrentReplicas--
         }
         if getPodRevision(replicas[i]) == updateRevision.Name {
            status.UpdatedReplicas--
         }
         status.Replicas--
         replicas[i] = newVersionedStatefulSetPod(
            currentSet,
            updateSet,
            currentRevision.Name,
            updateRevision.Name,
            i)
      }
      // If we find a Pod that has not been created we create the Pod
      if !isCreated(replicas[i]) {
         if err := ssc.podControl.CreateStatefulPod(set, replicas[i]); err != nil {
            return &status, err
         }
         status.Replicas++
         if getPodRevision(replicas[i]) == currentRevision.Name {
            status.CurrentReplicas++
         }
         if getPodRevision(replicas[i]) == updateRevision.Name {
            status.UpdatedReplicas++
         }

         // if the set does not allow bursting, return immediately
         if monotonic {
            return &status, nil
         }
         // pod created, no more work possible for this round
         continue
      }
      // If we find a Pod that is currently terminating, we must wait until graceful deletion
      // completes before we continue to make progress.
      if isTerminating(replicas[i]) && monotonic {
         klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to Terminate",
            set.Namespace,
            set.Name,
            replicas[i].Name)
         return &status, nil
      }
      // If we have a Pod that has been created but is not running and ready we can not make progress.
      // We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
      // ordinal, are Running and Ready.
      if !isRunningAndReady(replicas[i]) && monotonic {
         klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
            set.Namespace,
            set.Name,
            replicas[i].Name)
         return &status, nil
      }
      // Enforce the StatefulSet invariants
      if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) {
         continue
      }
      // Make a deep copy so we don't mutate the shared cache
      replica := replicas[i].DeepCopy()
      if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
         return &status, err
      }
   }

   // At this point, all of the current Replicas are Running and Ready, we can consider termination.
   // We will wait for all predecessors to be Running and Ready prior to attempting a deletion.
   // We will terminate Pods in a monotonically decreasing order over [len(pods),set.Spec.Replicas).
   // Note that we do not resurrect Pods in this interval. Also note that scaling will take precedence over
   // updates.
   for target := len(condemned) - 1; target >= 0; target-- {
      // wait for terminating pods to expire
      if isTerminating(condemned[target]) {
         klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to Terminate prior to scale down",
            set.Namespace,
            set.Name,
            condemned[target].Name)
         // block if we are in monotonic mode
         if monotonic {
            return &status, nil
         }
         continue
      }
      // if we are in monotonic mode and the condemned target is not the first unhealthy Pod block
      if !isRunningAndReady(condemned[target]) && monotonic && condemned[target] != firstUnhealthyPod {
         klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to be Running and Ready prior to scale down",
            set.Namespace,
            set.Name,
            firstUnhealthyPod.Name)
         return &status, nil
      }
      klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for scale down",
         set.Namespace,
         set.Name,
         condemned[target].Name)

      if err := ssc.podControl.DeleteStatefulPod(set, condemned[target]); err != nil {
         return &status, err
      }
      if getPodRevision(condemned[target]) == currentRevision.Name {
         status.CurrentReplicas--
      }
      if getPodRevision(condemned[target]) == updateRevision.Name {
         status.UpdatedReplicas--
      }
      if monotonic {
         return &status, nil
      }
   }

   // for the OnDelete strategy we short circuit. Pods will be updated when they are manually deleted.
   if set.Spec.UpdateStrategy.Type == apps.OnDeleteStatefulSetStrategyType {
      return &status, nil
   }

   // we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
   updateMin := 0
   if set.Spec.UpdateStrategy.RollingUpdate != nil {
      updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)
   }
   // we terminate the Pod with the largest ordinal that does not match the update revision.
   for target := len(replicas) - 1; target >= updateMin; target-- {

      // delete the Pod if it is not already terminating and does not match the update revision.
      if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
         klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
            set.Namespace,
            set.Name,
            replicas[target].Name)
         err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
         status.CurrentReplicas--
         return &status, err
      }

      // wait for unhealthy Pods on update
      if !isHealthy(replicas[target]) {
         klog.V(4).Infof(
            "StatefulSet %s/%s is waiting for Pod %s to update",
            set.Namespace,
            set.Name,
            replicas[target].Name)
         return &status, nil
      }

   }
   return &status, nil
}
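
The small helpers referenced throughout the walkthrough above live in stateful_set_utils.go. A lightly trimmed, paraphrased sketch of the ones used here:

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set_utils.go (sketch)
// getOrdinal parses the trailing "-<number>" from the pod name, or returns -1 if it does not parse.
func getOrdinal(pod *v1.Pod) int {
   _, ordinal := getParentNameAndOrdinal(pod)
   return ordinal
}

// isCreated: a non-empty Phase means the API server has accepted the pod.
func isCreated(pod *v1.Pod) bool {
   return pod.Status.Phase != ""
}

func isFailed(pod *v1.Pod) bool {
   return pod.Status.Phase == v1.PodFailed
}

func isTerminating(pod *v1.Pod) bool {
   return pod.DeletionTimestamp != nil
}

func isRunningAndReady(pod *v1.Pod) bool {
   return pod.Status.Phase == v1.PodRunning && podutil.IsPodReady(pod)
}

// isHealthy is the predicate behind firstUnhealthyPod: Running & Ready and not terminating.
func isHealthy(pod *v1.Pod) bool {
   return isRunningAndReady(pod) && !isTerminating(pod)
}

// getPodRevision reads the controller-revision-hash label that ties a pod to its ControllerRevision.
func getPodRevision(pod *v1.Pod) string {
   if pod.Labels == nil {
      return ""
   }
   return pod.Labels[apps.StatefulSetRevisionLabel]
}

// allowsBurst: the controller is monotonic (one pod at a time) unless the policy is Parallel.
func allowsBurst(set *apps.StatefulSet) bool {
   return set.Spec.PodManagementPolicy == apps.ParallelPodManagement
}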

Summary

That concludes the StatefulSet controller. The StatefulSetController watches StatefulSet and Pod events in the cluster, maintains a queue of StatefulSets that need syncing, and ultimately starts one goroutine that loops taking StatefulSets off the queue and processing them. The defaultStatefulSetControl.updateStatefulSet method is the core of that processing. A few points are worth noting:

1. Leaving currentSet.Spec.UpdateStrategy.RollingUpdate.Partition unset differs subtly from setting it to 0. Suppose a StatefulSet is rolled out before it has finished creating all of its pods, and an already-created pod is then deleted: with Partition unset, the recreated pod is at the current revision; with Partition set to 0, the recreated pod is at the update revision.
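
This distinction falls out of newVersionedStatefulSetPod: a nil RollingUpdate struct falls back to comparing the ordinal against Status.CurrentReplicas, while an explicit Partition of 0 makes every ordinal take the update branch. A paraphrased sketch from the 1.13 sources:

k8s.io/kubernetes/pkg/controller/statefulset/stateful_set_utils.go (sketch)
func newVersionedStatefulSetPod(currentSet, updateSet *apps.StatefulSet, currentRevision, updateRevision string, ordinal int) *v1.Pod {
   if currentSet.Spec.UpdateStrategy.Type == apps.RollingUpdateStatefulSetStrategyType &&
      // Partition unset: ordinals below Status.CurrentReplicas stay at the current revision.
      (currentSet.Spec.UpdateStrategy.RollingUpdate == nil && ordinal < int(currentSet.Status.CurrentReplicas)) ||
      // Partition set: only ordinals below the partition stay at the current revision.
      (currentSet.Spec.UpdateStrategy.RollingUpdate != nil && ordinal < int(currentSet.Spec.UpdateStrategy.RollingUpdate.Partition)) {
      pod := newStatefulSetPod(currentSet, ordinal)
      setPodRevision(pod, currentRevision)
      return pod
   }
   pod := newStatefulSetPod(updateSet, ordinal)
   setPodRevision(pod, updateRevision)
   return pod
}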

2. When Spec.updateStrategy is OnDelete, no matter how many times the StatefulSet is updated, its status.currentRevision keeps pointing at the StatefulSet's very first revision.

3. Spec.PodManagementPolicy: Parallel does not apply to the RollingUpdate strategy: even with Spec.PodManagementPolicy set to Parallel, a StatefulSet's rolling update still proceeds one pod at a time in descending pod-name order to the update revision, rather than in parallel as during StatefulSet creation and deletion.

Reposted from: https://my.oschina.net/u/3797264/blog/3016756
