How Does a ReplicaSet Work? (Part 2)

Introduction to ReplicaSet

replica (n.): a copy; a duplicate.

In a Kubernetes cluster, "replica" refers to the number of Pods that are run from the same image and the same runtime environment (i.e., described by the same YAML).

A ReplicaSet maintains a set of Pod replicas that should be running at any given time; together these Pod replicas form a set.
A ReplicaSet is typically used to guarantee the availability of Pods, and it has two key characteristics:

  • The number of actually running Pods is continuously reconciled toward the desired count
  • All Pods managed by a single ReplicaSet are completely identical

If you are familiar with Deployments, you know that during a rolling update the number of Pod replicas may momentarily equal the desired count (for example, right after one new instance has been scaled up and one old instance has just been scaled down), while the Pods are not necessarily all of the same version (although they eventually converge).

Since every ReplicaSet manages completely identical Pods, how does a Deployment adjust the number of instances of different versions during a rolling update? The answer is probably obvious by now: a Deployment maintains multiple ReplicaSets at the same time, and performs the rolling update by increasing the replica count of the new ReplicaSet while decreasing that of the old one.

The Question

In the earlier post "How does kube-controller-manager load controllers such as Deployment and StatefulSet?", we saw that the controller-manager loads a series of controllers in its NewControllerInitializers() method. Among these controllers, the ReplicaSet controller is comparatively the simplest and most fundamental one, and understanding it is essential for understanding Deployment. So how does it work?

The ReplicaSet controller contains a queue. The previous post, "How does ReplicaSet work? (Part 1)", described what gets written into that queue; this post covers how the data in the queue is consumed.

Reading the Source Code

Entry Point

In the NewControllerInitializers() method of the controller-manager, we find where the ReplicaSet controller is started:

func startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) {
	go replicaset.NewReplicaSetController(
		ctx.InformerFactory.Apps().V1().ReplicaSets(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
		replicaset.BurstReplicas,
	).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
	return nil, true, nil
}

This is where the ReplicaSet controller starts running. Through the controller context it receives a ReplicaSet informer and a Pod informer, which deliver ReplicaSet and Pod events respectively. A new controller is constructed, and the go keyword launches a goroutine that executes its Run method.

Inside the Run method lies the secret of how the queue is consumed.

Run

// Run begins watching and syncing.
func (rsc *ReplicaSetController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer rsc.queue.ShutDown()

	controllerName := strings.ToLower(rsc.Kind)
	klog.Infof("Starting %v controller", controllerName)
	defer klog.Infof("Shutting down %v controller", controllerName)

	if !cache.WaitForNamedCacheSync(rsc.Kind, stopCh, rsc.podListerSynced, rsc.rsListerSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(rsc.worker, time.Second, stopCh)
	}

	<-stopCh
}

The workers parameter is the number of goroutines to start, i.e., the number of concurrent consumers of the queue. From the entry point we know that the value passed in is ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs. Referring back to "How does kube-controller-manager load controllers such as Deployment and StatefulSet?", the default value of this parameter is 5, so by default there are five consumers of the queue:

// RecommendedDefaultReplicaSetControllerConfiguration defaults a pointer to a
// ReplicaSetControllerConfiguration struct. This will set the recommended default
// values, but they may be subject to change between API versions. This function
// is intentionally not registered in the scheme as a "normal" `SetDefaults_Foo`
// function to allow consumers of this type to set whatever defaults for their
// embedded configs. Forcing consumers to use these defaults would be problematic
// as defaulting in the scheme is done as part of the conversion, and there would
// be no easy way to opt-out. Instead, if you want to use this defaulting method
// run it in your wrapper struct of this type in its `SetDefaults_` method.
func RecommendedDefaultReplicaSetControllerConfiguration(obj *kubectrlmgrconfigv1alpha1.ReplicaSetControllerConfiguration) {
	if obj.ConcurrentRSSyncs == 0 {
		obj.ConcurrentRSSyncs = 5
	}
}

Now let's zoom in on the consumer side, the worker() method:

// worker runs a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the syncHandler is never invoked concurrently with the same key.
func (rsc *ReplicaSetController) worker() {
	for rsc.processNextWorkItem() {
	}
}

func (rsc *ReplicaSetController) processNextWorkItem() bool {
	key, quit := rsc.queue.Get()
	if quit {
		return false
	}
	defer rsc.queue.Done(key)

	err := rsc.syncHandler(key.(string))
	if err == nil {
		rsc.queue.Forget(key)
		return true
	}

	utilruntime.HandleError(fmt.Errorf("sync %q failed with %v", key, err))
	rsc.queue.AddRateLimited(key)

	return true
}

The worker goroutine simply runs processNextWorkItem in a loop, dequeueing items, processing them, and marking them done. It guarantees that syncHandler is never invoked concurrently for the same key. On each iteration, processNextWorkItem Gets one ReplicaSet key from the queue and runs the sync method syncHandler. If the sync succeeds, Forget is called first and then (via defer) Done; if it fails, the key is re-added to the rate-limited queue and then Done is called.

queue

Let's take a closer look at the queue interfaces involved:

// RateLimitingInterface is an interface that rate limits items being added to the queue.
type RateLimitingInterface interface {
	DelayingInterface

	// AddRateLimited adds an item to the workqueue after the rate limiter says it's ok
	AddRateLimited(item interface{})

	// Forget indicates that an item is finished being retried.  Doesn't matter whether it's for perm failing
	// or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
	// still have to call `Done` on the queue.
	Forget(item interface{})

	// NumRequeues returns back how many times the item was requeued
	NumRequeues(item interface{}) int
}

// DelayingInterface is an Interface that can Add an item at a later time. This makes it easier to
// requeue items after failures without ending up in a hot-loop.
type DelayingInterface interface {
	Interface
	// AddAfter adds an item to the workqueue after the indicated duration has passed
	AddAfter(item interface{}, duration time.Duration)
}

type Interface interface {
	// Add marks item as needing processing.
	Add(item interface{})	
	
	// Get blocks until it can return an item to be processed. If shutdown = true,
	// the caller should end their goroutine. You must call Done with item when you
	// have finished processing it.
	Get() (item interface{}, shutdown bool)
	
	// Done marks item as done processing, and if it has been marked as dirty again
	// while it was being processed, it will be re-added to the queue for
	// re-processing.
	Done(item interface{})
	// ... 省略一些
}

(Diagram: rateLimitingType relationships)
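To make the Get / Done / Forget / AddRateLimited semantics concrete, here is a minimal, self-contained sketch of the typical producer/consumer pattern around a rate-limiting work queue. It is my own illustration rather than the controller's code, and it assumes client-go's k8s.io/client-go/util/workqueue package (of the same era as the quoted code) plus a hypothetical process function standing in for syncHandler.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// process stands in for the controller's syncHandler (hypothetical).
func process(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func main() {
	// A rate-limiting queue, analogous to rsc.queue.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Consumer: the same Get / Done / Forget / AddRateLimited flow as processNextWorkItem.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			key, shutdown := queue.Get()
			if shutdown {
				return
			}
			func() {
				defer queue.Done(key) // always mark the item as processed
				if err := process(key.(string)); err != nil {
					queue.AddRateLimited(key) // failed: re-enqueue with backoff
					return
				}
				queue.Forget(key) // succeeded: clear the rate limiter's retry tracking
			}()
		}
	}()

	// Producer: event handlers enqueue namespace/name keys.
	queue.Add("default/nginx-rs")

	time.Sleep(100 * time.Millisecond)
	queue.ShutDown() // makes Get return shutdown == true once the queue drains
	<-done
}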

syncHandler

Now that we understand how items flow into and out of the queue, we can finally focus on the most critical part of the ReplicaSet controller: the sync method. The syncHandler field of ReplicaSetController is defined as a function:

type ReplicaSetController struct {
	//... other fields omitted

	// To allow injection of syncReplicaSet for testing.
	syncHandler func(rsKey string) error

	//... other fields omitted
}

When the ReplicaSetController is initialized, what value does NewBaseController assign to this field?

// NewBaseController is the implementation of NewReplicaSetController with additional injected
// parameters so that it can also serve as the implementation of NewReplicationController.
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
	gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
	//... omitted
	
	rsc.syncHandler = rsc.syncReplicaSet

	//... omitted
	return rsc
}

It is the syncReplicaSet method. Let's see what it does.

syncReplicaSet

// syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,
// meaning it did not expect to see any more of its pods created or deleted. This function is not meant to be
// invoked concurrently with the same key.
func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing %v %q (%v)", rsc.Kind, key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
	if apierrors.IsNotFound(err) {
		klog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
		rsc.expectations.DeleteExpectations(key)
		return nil
	}
	if err != nil {
		return err
	}

	rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
	selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("error converting pod selector to selector: %v", err))
		return nil
	}

	// list all pods to include the pods that don't match the rs`s selector
	// anymore but has the stale controller ref.
	// TODO: Do the List and Filter in a single pass, or use an index.
	allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
	if err != nil {
		return err
	}
	// Ignore inactive pods.
	filteredPods := controller.FilterActivePods(allPods)

	// NOTE: filteredPods are pointing to objects from cache - if you need to
	// modify them, you need to copy it first.
	filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
	if err != nil {
		return err
	}

	var manageReplicasErr error
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
	}
	rs = rs.DeepCopy()
	newStatus := calculateStatus(rs, filteredPods, manageReplicasErr)

	// Always updates status as pods come up or die.
	updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)
	if err != nil {
		// Multiple things could lead to this update failing. Requeuing the replica set ensures
		// Returning an error causes a requeue without forcing a hotloop
		return err
	}
	// Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.
	if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
		updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
		updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
		rsc.queue.AddAfter(key, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
	}
	return manageReplicasErr
}

The syncReplicaSet method does the following main things:

  1. Claim the Pods related to the ReplicaSet key taken from the queue
  2. Manage the replica count (Replicas)
  3. Calculate the new status and update it
  4. Resync (re-enqueue) when necessary

Claiming Pods

  1. Get the ReplicaSet referenced by the input key
func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
	//... omitted
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	//... omitted
}

// SplitMetaNamespaceKey returns the namespace and name that
// MetaNamespaceKeyFunc encoded into key.
//
// TODO: replace key-as-string with a key-as-struct so that this
// packing/unpacking won't be necessary.
func SplitMetaNamespaceKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		// name only, no namespace
		return "", parts[0], nil
	case 2:
		// namespace and name
		return parts[0], parts[1], nil
	}

	return "", "", fmt.Errorf("unexpected key format: %q", key)
}
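As a quick aside, here is a minimal, hypothetical round-trip showing how such a key is built by the informer-side event handlers and split back apart on the sync path, using client-go's cache helpers (my own sketch, not part of the controller):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// A hypothetical ReplicaSet object.
	rs := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx-rs"}}

	// This is how event handlers typically build the work queue key.
	key, _ := cache.MetaNamespaceKeyFunc(rs)
	fmt.Println(key) // default/nginx-rs

	// And this is how syncReplicaSet splits it back into namespace and name.
	namespace, name, _ := cache.SplitMetaNamespaceKey(key)
	fmt.Println(namespace, name) // default nginx-rs
}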

The key that uniquely identifies a ReplicaSet consists of its namespace and name, separated by a "/". With the namespace and name split out, the corresponding ReplicaSet is fetched from the cache:

	rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
	if apierrors.IsNotFound(err) {
		klog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
		rsc.expectations.DeleteExpectations(key)
		return nil
	}

The rsLister field of ReplicaSetController is defined as follows:

	// A store of ReplicaSets, populated by the shared informer passed to NewReplicaSetController
	rsLister appslisters.ReplicaSetLister

(Diagram: ReplicaSetLister)
The expectations field of ReplicaSetController is defined as follows:

	// A TTLCache of pod creates/deletes each rc expects to see.
	expectations *controller.UIDTrackingControllerExpectations

(Diagram: expectations call relationships)
As the call-relationship diagram shows, expectations is essentially an extension built on top of the client-go cache. Its main job is to maintain a mapping from controllers to what they expect to see before being woken up for a sync.

// ControllerExpectations is a cache mapping controllers to what they expect to see before being woken up for a sync.

For more detail, see the article on ControllerExpectations in Kubernetes controllers. ControllerExpectations behaves somewhat like a concurrent counter (think of a WaitGroup): from the moment the ReplicaSetController starts creating N Pods until it has confirmed that all N Pods were created, many asynchronous operations are in flight, and ControllerExpectations makes that whole flow behave as if it were synchronous.
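To illustrate the idea, here is a deliberately simplified, self-contained sketch of the expectations pattern. The method names mirror the real ControllerExpectations API, but this is my own toy implementation, not the controller package's code (the real implementation additionally expires expectations after a timeout so that a lost event cannot block syncing forever).

package main

import (
	"fmt"
	"sync"
)

// expectations is a simplified stand-in for ControllerExpectations:
// it records how many Pod creations a controller still expects to observe.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int // ReplicaSet key -> outstanding creations
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// ExpectCreations is called right before issuing n create requests.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// CreationObserved is called when the informer sees a new Pod for this key.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// SatisfiedExpectations reports whether the controller may run manageReplicas again.
func (e *expectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}

func main() {
	exp := newExpectations()
	key := "default/nginx-rs"

	exp.ExpectCreations(key, 2)                 // about to create 2 Pods
	fmt.Println(exp.SatisfiedExpectations(key)) // false: skip manageReplicas for now
	exp.CreationObserved(key)
	exp.CreationObserved(key)                   // both Pods observed by the informer
	fmt.Println(exp.SatisfiedExpectations(key)) // true: the next sync may adjust replicas again
}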

  2. Determine whether a sync is needed

    If the Pod creations or deletions that this ReplicaSet controller was expecting have been observed (or the expectations have expired), SatisfiedExpectations returns true so that manageReplicas can be executed later.

  3. Get all Pods

    All Pods in the namespace are listed in order to include Pods that no longer match the ReplicaSet's selector but still carry a stale controller reference.

  4. Filter Pods

    Inactive Pods are filtered out: the logic keeps only Pods whose DeletionTimestamp is empty (i.e., not marked for deletion) and whose phase is Pending, Running, or Unknown (the latter is deprecated).

    func IsPodActive(p *v1.Pod) bool {
    	return v1.PodSucceeded != p.Status.Phase &&
    		v1.PodFailed != p.Status.Phase &&
    		p.DeletionTimestamp == nil
    }
    
  5. Claim the Pods

    func (rsc *ReplicaSetController) claimPods(rs *apps.ReplicaSet, selector labels.Selector, filteredPods []*v1.Pod) ([]*v1.Pod, error) {
    	// If any adoptions are attempted, we should first recheck for deletion with
    	// an uncached quorum read sometime after listing Pods (see #42639).
    	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
    		fresh, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).Get(context.TODO(), rs.Name, metav1.GetOptions{})
    		if err != nil {
    			return nil, err
    		}
    		if fresh.UID != rs.UID {
    			return nil, fmt.Errorf("original %v %v/%v is gone: got uid %v, wanted %v", rsc.Kind, rs.Namespace, rs.Name, fresh.UID, rs.UID)
    		}
    		return fresh, nil
    	})
    	cm := controller.NewPodControllerRefManager(rsc.podControl, rs, selector, rsc.GroupVersionKind, canAdoptFunc)
    	return cm.ClaimPods(filteredPods)
    }
    

    Before adopting, the ReplicaSet is re-fetched through kubeClient, i.e., an uncached read via the API server (backed by etcd) rather than the local cache; if the UID matches, the fresh remote object is used. Next, look at ClaimPods, which performs the actual claiming: its input is the list of active Pods left after the filtering in step 4, and its output is that list with the Pods that were not claimed removed.

    // ClaimPods tries to take ownership of a list of Pods.
    // It will reconcile the following:
    //   - Adopt orphaned Pods if the selector matches.
    //   - Release owned Pods if the selector no longer matches.
    // Optional: if one or more filters are specified, a Pod will only be claimed if all filters return true.
    // A non-nil error is returned if some form of reconciliation was attempted and failed. Usually, controllers should try again later in case reconciliation is still needed.
    // If the error is nil, either the reconciliation succeeded, or no reconciliation was necessary. The list of Pods that you now own is returned.
    func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) {
    	var claimed []*v1.Pod
    	var errlist []error
    
    	match := func(obj metav1.Object) bool {
    		pod := obj.(*v1.Pod)
    		// Check selector first so filters only run on potentially matching Pods.
    		if !m.Selector.Matches(labels.Set(pod.Labels)) {
    			return false
    		}
    		for _, filter := range filters {
    			if !filter(pod) {
    				return false
    			}
    		}
    		return true
    	}
    	adopt := func(obj metav1.Object) error {
    		return m.AdoptPod(obj.(*v1.Pod))
    	}
    	release := func(obj metav1.Object) error {
    		return m.ReleasePod(obj.(*v1.Pod))
    	}
    
    	for _, pod := range pods {
    		ok, err := m.ClaimObject(pod, match, adopt, release)
    		if err != nil {
    			errlist = append(errlist, err)
    			continue
    		}
    		if ok {
    			claimed = append(claimed, pod)
    		}
    	}
    	return claimed, utilerrors.NewAggregate(errlist)
    }
    

    The methods to focus on are AdoptPod and ReleasePod; both ultimately issue a Patch that modifies the Pod's OwnerReferences (a rough sketch follows right after this list):

    • If the ReplicaSet can claim the Pod, the Patch points the Pod's OwnerReferences at the ReplicaSet itself
    • If it does not claim the Pod, the Pod is released by removing the ReplicaSet's entry from its OwnerReferences
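As a rough illustration of what claiming means at the object level, the sketch below shows adoption as setting a controller owner reference and release as dropping it again. This is my own simplified example; the real AdoptPod and ReleasePod achieve the same effect server-side through Patch requests.

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A hypothetical ReplicaSet and one of its candidate Pods.
	rs := &appsv1.ReplicaSet{ObjectMeta: metav1.ObjectMeta{
		Namespace: "default", Name: "nginx-rs", UID: "rs-uid-123",
	}}
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Namespace: "default", Name: "nginx-rs-abcde",
	}}

	// "Adopt": add a controller owner reference pointing at the ReplicaSet.
	ref := metav1.NewControllerRef(rs, appsv1.SchemeGroupVersion.WithKind("ReplicaSet"))
	pod.OwnerReferences = append(pod.OwnerReferences, *ref)
	fmt.Printf("adopted by %s %s (controller=%v)\n", ref.Kind, ref.Name, *ref.Controller)

	// "Release": remove the ReplicaSet's entry from the Pod's OwnerReferences.
	kept := pod.OwnerReferences[:0]
	for _, r := range pod.OwnerReferences {
		if r.UID != rs.UID {
			kept = append(kept, r)
		}
	}
	pod.OwnerReferences = kept
	fmt.Println("owner references remaining:", len(pod.OwnerReferences))
}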
Managing the Replica Count

At this point we have the ReplicaSet that was taken from the queue and woken up for processing, as well as the filtered Pods that will be used in the calculation. The next step is the ReplicaSet's core task: managing the replica count.

func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
	// ... omitted
	rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
	
	// ... omitted
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
	}
	
	// ... omitted
}

As mentioned in the earlier discussion of SatisfiedExpectations, rsNeedsSync being true means that the adds or dels expected for the given ReplicaSet have already been observed. The expected counts are established by the ReplicaSetController at sync time and updated as the controller observes Pods.
Once the expectations are satisfied and the current ReplicaSet is not being deleted, replica management begins.

func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
	diff := len(filteredPods) - int(*(rs.Spec.Replicas))
	rsKey, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for %v %#v: %v", rsc.Kind, rs, err))
		return nil
	}
	if diff < 0 {
		diff *= -1
		if diff > rsc.burstReplicas {
			diff = rsc.burstReplicas
		}
		// ... much code omitted ...
		err := rsc.podControl.CreatePods(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind))
		// ... much code omitted ...
		
	} else if diff > 0 {
	
		// ... much code omitted ...
		if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
		
		// ... much code omitted ...
		}

		// ... much code omitted ...

	}

	return nil
}

With the elided code stripped away, the logic of manageReplicas is clear: if the number of existing Pods is less than the desired Replicas, scale up; if it is greater, scale down.
Let's look at the scale-up logic first:

		// TODO: track the UIDs of creates just like we do for deletes.
		// The problem right now is that we would need to wait for the result of the create to record the Pod's UID,
		// which would require locking across the create and become a performance bottleneck.
		// We should generate a UID for the Pod up front and store it via ExpectCreations.
		rsc.expectations.ExpectCreations(rsKey, diff)
		// Batch the Pod creations. The batch size starts at SlowStartInitialBatchSize and doubles (x2) in a "slow start" with each successful iteration.
		// This handles attempts to start a large number of Pods that would likely all fail with the same error.
		// For example, a project with a low quota that tries to create a large number of Pods will be prevented from flooding the API server with Pod create requests after one of its Pods fails.
		// Conveniently, this also prevents the garbage events those failures would generate.
		successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
			err := rsc.podControl.CreatePods(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind))
			if err != nil {
				if apierrors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
					// If the namespace is being terminated, we don't have to do anything because any creation will fail
					return nil
				}
			}
			return err
		})

		// Any skipped Pods that we never attempted to start shouldn't be expected. The skipped Pods will be retried later. The next controller resync will retry the slow-start process.
		if skippedPods := diff - successfulCreations; skippedPods > 0 {
			for i := 0; i < skippedPods; i++ {
				// Decrement the expected number of creates because the informer won't observe this pod
				rsc.expectations.CreationObserved(rsKey)
			}
		}
		return err

Pay particular attention to the slow-start batching code: when a large number of instances are created through a ReplicaSet, the creations are issued in batches that grow exponentially, and as soon as any creation in a batch fails, no further batches are attempted:

// slowStartBatch tries to call the provided function a total of 'count' times, starting slowly to check for errors, then speeding up as calls succeed.
// It groups the calls into batches, starting with a group of initialBatchSize. Within each batch it may call the function concurrently several times.
// If an entire batch succeeds, the next batch may grow exponentially (x2). If there are any failures in a batch, all remaining batches are skipped after waiting for the current batch to finish.
// It returns the number of successful calls to the function.
func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
	remaining := count
	successes := 0
	for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) {
		errCh := make(chan error, batchSize)
		var wg sync.WaitGroup
		wg.Add(batchSize)
		for i := 0; i < batchSize; i++ {
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		curSuccesses := batchSize - len(errCh)
		successes += curSuccesses
		if len(errCh) > 0 {
			return successes, <-errCh
		}
		remaining -= batchSize
	}
	return successes, nil
}
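For intuition about the batch-size progression, here is a small standalone demo (my own sketch) that reproduces the loop above without creating anything, assuming an initial batch size of 1 as I believe controller.SlowStartInitialBatchSize uses. Creating 10 Pods would be attempted in batches of 1, 2, 4, and then the remaining 3; if any creation in a batch fails, the later batches are never started.

package main

import "fmt"

// batchSizes reproduces slowStartBatch's batch-size progression.
func batchSizes(count, initialBatchSize int) []int {
	var sizes []int
	remaining := count
	for batch := intMin(remaining, initialBatchSize); batch > 0; batch = intMin(2*batch, remaining) {
		sizes = append(sizes, batch)
		remaining -= batch
	}
	return sizes
}

func intMin(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	fmt.Println(batchSizes(10, 1)) // [1 2 4 3]
}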

Now let's look at the scale-down logic, which deletes the targeted Pods concurrently in goroutines:

		relatedPods, err := rsc.getIndirectlyRelatedPods(rs)
		utilruntime.HandleError(err)

		// Choose which Pods to delete, preferring Pods in the earlier stages of startup.
		podsToDelete := getPodsToDelete(filteredPods, relatedPods, diff)

		// Snapshot the UIDs (ns/name) of the Pods we expect to see deleted, so that their expectations are recorded exactly once, whether we observe an update of the deletion timestamp or an actual delete.
		// Note that if the labels on a pod/rs change in a way that causes the Pod to become orphaned, the rs will only wake up after the expectations have expired, even if other Pods are deleted.
		rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))

		errCh := make(chan error, diff)
		var wg sync.WaitGroup
		wg.Add(diff)
		for _, pod := range podsToDelete {
			go func(targetPod *v1.Pod) {
				defer wg.Done()
				if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
					// Decrement the expected number of deletions because the informer won't observe this deletion
					podKey := controller.PodKey(targetPod)
					rsc.expectations.DeletionObserved(rsKey, podKey)
					if !apierrors.IsNotFound(err) {
						klog.V(2).Infof("Failed to delete %v, decremented expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
						errCh <- err
					}
				}
			}(pod)
		}
		wg.Wait()

		select {
		case err := <-errCh:
			// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
			if err != nil {
				return err
			}
		default:

Note that when only some of the Pods need to be deleted, they are sorted first; Less returning true means the first Pod is preferred for deletion. There are quite a few criteria:

  1. Pods that have not been assigned to a node are deleted first.
  2. By phase, Pending Pods are deleted first and Running Pods last, with PodUnknown in between.
  3. Pods that are not ready are deleted before ready ones.
  4. If the controller.kubernetes.io/pod-deletion-cost annotation is set, Pods with a smaller value are deleted first.
  5. If the Pods' ranks differ, the Pod with the higher rank (more colocated ready Pods of the same ReplicaSet on its node) is deleted first.
  6. If both Pods are ready but have been ready for different lengths of time, the one that became ready more recently is deleted first.
  7. If one Pod's containers have a higher maximum restart count than any container of the other Pod, the Pod with more restarts is deleted first.
  8. If the creation times differ, the more recently created Pod is deleted first.
// Less compares two pods with corresponding ranks and returns true if the first
// one should be preferred for deletion.
func (s ActivePodsWithRanks) Less(i, j int) bool {
	// 1. Unassigned < assigned
	// If only one of the pods is unassigned, the unassigned one is smaller
	if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
		return len(s.Pods[i].Spec.NodeName) == 0
	}
	// 2. PodPending < PodUnknown < PodRunning
	if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {
		return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]
	}
	// 3. Not ready < ready
	// If only one of the pods is not ready, the not ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
		return !podutil.IsPodReady(s.Pods[i])
	}

	// 4. higher pod-deletion-cost < lower pod-deletion cost
	if utilfeature.DefaultFeatureGate.Enabled(features.PodDeletionCost) {
		pi, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[i].Annotations)
		pj, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[j].Annotations)
		if pi != pj {
			return pi < pj
		}
	}

	// 5. Doubled up < not doubled up
	// If one of the two pods is on the same node as one or more additional
	// ready pods that belong to the same replicaset, whichever pod has more
	// colocated ready pods is less
	if s.Rank[i] != s.Rank[j] {
		return s.Rank[i] > s.Rank[j]
	}
	// TODO: take availability into account when we push minReadySeconds information from deployment into pods,
	//       see https://github.com/kubernetes/kubernetes/issues/22065
	// 6. Been ready for empty time < less time < more time
	// If both pods are ready, the latest ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {
		readyTime1 := podReadyTime(s.Pods[i])
		readyTime2 := podReadyTime(s.Pods[j])
		if !readyTime1.Equal(readyTime2) {
			if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
				return afterOrZero(readyTime1, readyTime2)
			} else {
				if s.Now.IsZero() || readyTime1.IsZero() || readyTime2.IsZero() {
					return afterOrZero(readyTime1, readyTime2)
				}
				rankDiff := logarithmicRankDiff(*readyTime1, *readyTime2, s.Now)
				if rankDiff == 0 {
					return s.Pods[i].UID < s.Pods[j].UID
				}
				return rankDiff < 0
			}
		}
	}
	// 7. Pods with containers with higher restart counts < lower restart counts
	if maxContainerRestarts(s.Pods[i]) != maxContainerRestarts(s.Pods[j]) {
		return maxContainerRestarts(s.Pods[i]) > maxContainerRestarts(s.Pods[j])
	}
	// 8. Empty creation time pods < newer pods < older pods
	if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
		if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
			return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
		} else {
			if s.Now.IsZero() || s.Pods[i].CreationTimestamp.IsZero() || s.Pods[j].CreationTimestamp.IsZero() {
				return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
			}
			rankDiff := logarithmicRankDiff(s.Pods[i].CreationTimestamp, s.Pods[j].CreationTimestamp, s.Now)
			if rankDiff == 0 {
				return s.Pods[i].UID < s.Pods[j].UID
			}
			return rankDiff < 0
		}
	}
	return false
}
Calculating and Updating the Status

After replica management, the actually running Pods have been adjusted to some extent, and at this point we know whether those adjustments succeeded or failed.

Only the ReplicaSet status derived from the Pods passed in this round (filteredPods) is updated; the Pods created or deleted during the manage-replicas phase will be reflected in a later sync, once those operations have actually completed.

The filteredPods here are the same Pods used in the replica-management step. calculateStatus counts how many of them have labels matching the labels of the ReplicaSet's Pod template; a matching Pod may well carry more labels than the template declares, since the podTemplateSpec usually defines fewer labels than the Pod ends up with. Because the Pod template's labels are a superset of the ReplicaSet's selector, any Pod that matches the template is necessarily part of filteredPods.


func calculateStatus(rs *apps.ReplicaSet, filteredPods []*v1.Pod, manageReplicasErr error) apps.ReplicaSetStatus {
	newStatus := rs.Status
	fullyLabeledReplicasCount := 0
	readyReplicasCount := 0
	availableReplicasCount := 0
	templateLabel := labels.Set(rs.Spec.Template.Labels).AsSelectorPreValidated()
	for _, pod := range filteredPods {
		if templateLabel.Matches(labels.Set(pod.Labels)) {
			fullyLabeledReplicasCount++
		}
		if podutil.IsPodReady(pod) {
			readyReplicasCount++
			if podutil.IsPodAvailable(pod, rs.Spec.MinReadySeconds, metav1.Now()) {
				availableReplicasCount++
			}
		}
	}

	failureCond := GetCondition(rs.Status, apps.ReplicaSetReplicaFailure)
	if manageReplicasErr != nil && failureCond == nil {
		var reason string
		if diff := len(filteredPods) - int(*(rs.Spec.Replicas)); diff < 0 {
			reason = "FailedCreate"
		} else if diff > 0 {
			reason = "FailedDelete"
		}
		cond := NewReplicaSetCondition(apps.ReplicaSetReplicaFailure, v1.ConditionTrue, reason, manageReplicasErr.Error())
		SetCondition(&newStatus, cond)
	} else if manageReplicasErr == nil && failureCond != nil {
		RemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure)
	}

	newStatus.Replicas = int32(len(filteredPods))
	newStatus.FullyLabeledReplicas = int32(fullyLabeledReplicasCount)
	newStatus.ReadyReplicas = int32(readyReplicasCount)
	newStatus.AvailableReplicas = int32(availableReplicasCount)
	return newStatus
}

The replica-count fields of ReplicaSetStatus usually don't get much attention, so here is a quick summary:

// ReplicaSetStatus represents the current status of a ReplicaSet.
type ReplicaSetStatus struct {
	// Replicas is the most recently oberved number of replicas.
	// More info: https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/#what-is-a-replicationcontroller
	Replicas int32 `json:"replicas" protobuf:"varint,1,opt,name=replicas"`

	// The number of pods that have labels matching the labels of the pod template of the replicaset.
	// +optional
	FullyLabeledReplicas int32 `json:"fullyLabeledReplicas,omitempty" protobuf:"varint,2,opt,name=fullyLabeledReplicas"`

	// The number of ready replicas for this replica set.
	// +optional
	ReadyReplicas int32 `json:"readyReplicas,omitempty" protobuf:"varint,4,opt,name=readyReplicas"`

	// The number of available replicas (ready for at least minReadySeconds) for this replica set.
	// +optional
	AvailableReplicas int32 `json:"availableReplicas,omitempty" protobuf:"varint,5,opt,name=availableReplicas"`
	
	// ... omitted
}
  • Replicas: the most recently observed number of replicas
  • FullyLabeledReplicas: the number of Pods whose labels match the labels of the ReplicaSet's Pod template
  • ReadyReplicas: the number of ready replicas in this ReplicaSet
  • AvailableReplicas: the number of available replicas in this ReplicaSet (ready for at least minReadySeconds)

Here, "ready" refers to a condition in the Pod's status. There are four Pod condition types in total:

// These are valid conditions of pod.
const (
	// ContainersReady indicates whether all containers in the pod are ready.
	ContainersReady PodConditionType = "ContainersReady"
	// PodInitialized means that all init containers in the pod have started successfully.
	PodInitialized PodConditionType = "Initialized"
	// PodReady means the pod is able to service requests and should be added to the
	// load balancing pools of all matching services.
	PodReady PodConditionType = "Ready"
	// PodScheduled represents status of the scheduling process for this pod.
	PodScheduled PodConditionType = "PodScheduled"
)
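To see how "ready" differs from "available" (which feeds AvailableReplicas), here is a simplified, self-contained sketch of the availability check. The helper below is my own reimplementation of the idea behind podutil.IsPodAvailable, not the actual function: a Pod counts as available only once it has been Ready for at least MinReadySeconds.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isAvailable is a simplified availability check: the Pod must be Ready,
// and must have been Ready for at least minReadySeconds.
func isAvailable(pod *corev1.Pod, minReadySeconds int32, now time.Time) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type != corev1.PodReady || c.Status != corev1.ConditionTrue {
			continue
		}
		if minReadySeconds == 0 {
			return true
		}
		return now.Sub(c.LastTransitionTime.Time) >= time.Duration(minReadySeconds)*time.Second
	}
	return false
}

func main() {
	now := time.Now()
	pod := &corev1.Pod{Status: corev1.PodStatus{Conditions: []corev1.PodCondition{{
		Type:               corev1.PodReady,
		Status:             corev1.ConditionTrue,
		LastTransitionTime: metav1.NewTime(now.Add(-5 * time.Second)), // became Ready 5s ago
	}}}}

	fmt.Println(isAvailable(pod, 0, now))  // true: with no MinReadySeconds, ready implies available
	fmt.Println(isAvailable(pod, 10, now)) // false: ready, but not yet ready for 10 seconds
}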
Resyncing

The final step resyncs the ReplicaSet after MinReadySeconds, as a last line of defense against clock skew: if all replicas are ready but not all of them count as available yet, the key is re-enqueued after MinReadySeconds so that availability is recalculated.

	// Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.
	if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
		updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
		updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
		rsc.queue.AddAfter(key, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
	}

Summary

The ReplicaSet controller contains a queue. Part 1, "How does ReplicaSet work? (Part 1)", described what gets written into that queue; this post described what each ReplicaSet consumed from the queue goes through. To sum up, the sync does the following:

  1. Claim Pods via labels, after filtering out inactive ones
  2. Use the filtered Pods and the ReplicaSet's desired replica count to manage the actually running Pods: scale down if there are too many, scale up if there are too few
  3. Update the ReplicaSet's status once, based on the filtered Pods
  4. Re-enqueue the ReplicaSet to guard against clock skew
(Diagram: the ReplicaSet dequeue workflow)