Introduction to ReplicaSet
replica (n.): a copy; a duplicate
Within a k8s cluster, "replica" describes the number of pods running from the same image and the same runtime environment (i.e., described by the same yaml).
A ReplicaSet maintains a group of pod replicas that must be running at any given time.
A ReplicaSet is typically used to guarantee pod availability, and has the following two characteristics (sketched in the code below):
- The number of actually running pods is continuously reconciled toward the given replica count
- All pods managed by a single ReplicaSet are completely identical
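To make the first point concrete, here is a minimal, hypothetical sketch of the reconciliation idea; the function and variable names are made up, and the controller's real sync logic is covered in Part 2:

package main

import "fmt"

// reconcile sketches the control-loop idea only: compare the desired
// replica count with the pods actually observed, then decide how many
// pods to create or which ones to delete to close the gap.
func reconcile(desired int, running []string) (create int, remove []string) {
    if diff := desired - len(running); diff > 0 {
        return diff, nil // too few pods: create the missing ones
    } else if diff < 0 {
        return 0, running[:(-diff)] // too many pods: pick victims to delete
    }
    return 0, nil // already converged
}

func main() {
    create, remove := reconcile(3, []string{"nginx-abc", "nginx-def"})
    fmt.Println(create, remove) // prints: 1 [] (one more pod must be created)
}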
Readers familiar with Deployment will know that during a rolling update, even when the current pod replica count matches the desired count (for example, right after one new instance has been scaled up and one old instance has just been scaled down), the pods are not necessarily all the same version (although they do eventually converge).
Since every pod managed by a single ReplicaSet is identical, how does a Deployment adjust the instance counts of different versions during a rolling update? The answer suggests itself: a Deployment maintains multiple ReplicaSets at the same time, and performs the rolling update by increasing the replica count of the new ReplicaSet while decreasing that of the old one.
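As a toy illustration of that scaling dance between two ReplicaSets (made-up names; this deliberately ignores the Deployment controller's real maxSurge/maxUnavailable handling):

package main

import "fmt"

// rollingUpdateStep models one step of a rolling update: scale the new
// ReplicaSet up by one, then the old one down by one. Illustrative only.
func rollingUpdateStep(newReplicas, oldReplicas, desired int) (int, int) {
    if newReplicas < desired {
        newReplicas++ // bring up one new-version pod first
    }
    if oldReplicas > 0 {
        oldReplicas-- // then retire one old-version pod
    }
    return newReplicas, oldReplicas
}

func main() {
    newRS, oldRS := 0, 3 // rollout begins: 3 old-version pods
    for newRS < 3 || oldRS > 0 {
        newRS, oldRS = rollingUpdateStep(newRS, oldRS, 3)
        fmt.Printf("new=%d old=%d\n", newRS, oldRS) // converges to new=3 old=0
    }
}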
Problem Statement
In the earlier article on how kube-controller-manager loads the Deployment, StatefulSet, and other controllers, we learned that controller-manager loads a series of controllers in its NewControllerInitializers() method. Among these controllers, ReplicaSet is comparatively the simplest and most fundamental, and understanding it is a prerequisite for understanding Deployment. So how does it work?
Since the ReplicaSet controller is built around a work queue, this article only covers the producer side, i.e., how items get written into the queue; consuming the queue (the actual sync) is left to the next article.
Source Code Walkthrough
Entry Point
In controller-manager's NewControllerInitializers() method, we find the function that starts the ReplicaSet controller:
func startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) {
    go replicaset.NewReplicaSetController(
        ctx.InformerFactory.Apps().V1().ReplicaSets(),
        ctx.InformerFactory.Core().V1().Pods(),
        ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
        replicaset.BurstReplicas,
    ).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
    return nil, true, nil
}
This is where the ReplicaSet controller gets running: through the context it is handed a ReplicaSet informer and a Pod informer, which deliver ReplicaSet and Pod events respectively. A new controller is constructed, and its Run method is launched in a new goroutine via the go keyword.
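For context, outside of controller-manager the same kind of informers are usually obtained from a shared informer factory. A minimal sketch, assuming a standard kubeconfig at the default path (the 30s resync period is arbitrary):

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumes a reachable cluster and a kubeconfig at ~/.kube/config.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // A shared factory hands out the same cached informers that
    // controller-manager passes into NewReplicaSetController.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    rsInformer := factory.Apps().V1().ReplicaSets() // appsinformers.ReplicaSetInformer
    podInformer := factory.Core().V1().Pods()       // coreinformers.PodInformer
    _, _ = rsInformer, podInformer

    stopCh := make(chan struct{})
    factory.Start(stopCh) // kicks off the underlying list-watch loops
}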
Constructing the Controller
Let's start with its constructor, NewReplicaSetController():
// NewReplicaSetController configures a ReplicaSet controller with the specified event recorder
func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {
    eventBroadcaster := record.NewBroadcaster() // initialize an event broadcaster/recorder
    eventBroadcaster.StartStructuredLogging(0)
    eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
    if err := metrics.Register(legacyregistry.Register); err != nil {
        klog.ErrorS(err, "unable to register metrics")
    }
    return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,
        apps.SchemeGroupVersion.WithKind("ReplicaSet"),
        "replicaset_controller",
        "replicaset",
        controller.RealPodControl{
            KubeClient: kubeClient,
            Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}),
        },
    )
}
// NewBaseController is the implementation of NewReplicaSetController with additional injected parameters
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
    gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
    if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
        ratelimiter.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter())
    }
    rsc := &ReplicaSetController{
        GroupVersionKind: gvk,
        kubeClient:       kubeClient,
        podControl:       podControl,
        burstReplicas:    burstReplicas,
        expectations:     controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),
        queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),
    }
    // Key code: each ReplicaSet add/update/delete event has its own handler
    rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    rsc.addRS,
        UpdateFunc: rsc.updateRS,
        DeleteFunc: rsc.deleteRS,
    })
    rsc.rsLister = rsInformer.Lister()
    rsc.rsListerSynced = rsInformer.Informer().HasSynced
    // Key code: each Pod add/update/delete event has its own handler
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: rsc.addPod,
        // This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
        // overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
        // local storage, so it should be ok.
        UpdateFunc: rsc.updatePod,
        DeleteFunc: rsc.deletePod,
    })
    rsc.podLister = podInformer.Lister()
    rsc.podListerSynced = podInformer.Informer().HasSynced
    // Key code: syncHandler is responsible for the actual ReplicaSet sync
    rsc.syncHandler = rsc.syncReplicaSet
    return rsc
}
We will focus on three parts of this:
- rsInformer's EventHandler: a dedicated handler for each ReplicaSet add/update/delete event
- podInformer's EventHandler: a dedicated handler for each Pod add/update/delete event
- The syncHandler method, which does the actual state synchronization; it will be covered in the next article (Part 2)
ReplicaSet Event Handling
AddFunc
func (rsc *ReplicaSetController) addRS(obj interface{}) {
    rs := obj.(*apps.ReplicaSet)
    klog.V(4).Infof("Adding %s %s/%s", rsc.Kind, rs.Namespace, rs.Name)
    rsc.enqueueRS(rs)
}

func (rsc *ReplicaSetController) enqueueRS(rs *apps.ReplicaSet) {
    key, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    rsc.queue.Add(key)
}
When a new ReplicaSet object is created, client-go emits an Add event. As you can see, there is nothing special about the Add handling: the ReplicaSetController calls enqueueRS() to write the object's key into its own queue.
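A quick aside on that queue: it is client-go's rate-limiting workqueue, the same type constructed in NewBaseController above. A minimal standalone sketch of its producer/consumer API (the queue name "demo" and the key are made up):

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Same constructor the ReplicaSet controller uses for its queue.
    q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "demo")

    q.Add("default/nginx-rs")         // what enqueueRS effectively does
    q.AddAfter("default/nginx-rs", 0) // delayed variant; enqueueRSAfter uses this

    // A worker (covered in Part 2) would block on Get for the next key.
    key, shutdown := q.Get()
    if !shutdown {
        fmt.Println("processing", key) // processing default/nginx-rs
        q.Done(key)                    // mark the item as finished
    }
    q.ShutDown()
}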
Note here how an rs object is turned into an identifying key by controller.KeyFunc().
As explained in the earlier article on Kubernetes object caching and indexing, this key is the object's unique identifier in client-go's cache index, and it lets the cached object be located quickly.
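Under the hood, controller.KeyFunc is cache.DeletionHandlingMetaNamespaceKeyFunc, which (tombstones aside) delegates to cache.MetaNamespaceKeyFunc and renders the key as namespace/name. A small demonstration with a made-up ReplicaSet:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/cache"
)

func main() {
    rs := &appsv1.ReplicaSet{
        ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx-rs"},
    }
    // controller.KeyFunc ultimately delegates to this helper.
    key, err := cache.MetaNamespaceKeyFunc(rs)
    if err != nil {
        panic(err)
    }
    fmt.Println(key) // default/nginx-rs
}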
UpdateFunc
// callback when RS is updated
func (rsc *ReplicaSetController) updateRS(old, cur interface{}) {
    oldRS := old.(*apps.ReplicaSet)
    curRS := cur.(*apps.ReplicaSet)
    // TODO: make a KEP and fix informers to always call the delete event handler on re-create
    if curRS.UID != oldRS.UID {
        key, err := controller.KeyFunc(oldRS)
        if err != nil {
            utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", oldRS, err))
            return
        }
        rsc.deleteRS(cache.DeletedFinalStateUnknown{
            Key: key,
            Obj: oldRS,
        })
    }
    // You might imagine that we only really need to enqueue the
    // replica set when Spec changes, but it is safer to sync any
    // time this function is triggered. That way a full informer
    // resync can requeue any replica set that don't yet have pods
    // but whose last attempts at creating a pod have failed (since
    // we don't block on creation of pods) instead of those
    // replica sets stalling indefinitely. Enqueueing every time
    // does result in some spurious syncs (like when Status.Replica
    // is updated and the watch notification from it retriggers
    // this function), but in general extra resyncs shouldn't be
    // that bad as ReplicaSets that haven't met expectations yet won't
    // sync, and all the listing is done using local stores.
    if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {
        klog.V(4).Infof("%v %v updated. Desired pod count change: %d->%d", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))
    }
    rsc.enqueueRS(curRS)
}
The long comment in the update handler explains that the ReplicaSet is enqueued on every update event, not only when ReplicaSet.Spec changes; enqueueing every time is simply safer.
If a ReplicaSet does not yet manage any pods (say, its last attempt at creating a pod failed), handling every update event means a full informer resync will requeue that ReplicaSet, so the controller keeps retrying the ReplicaSet whose pods never get created instead of letting it stall indefinitely.
Enqueueing every time does cause some spurious syncs (for example, when Status.Replicas is updated and the watch notification from it retriggers this function), but overall the extra resyncs are not that bad: ReplicaSets that have not yet met their expectations won't sync, and all listing is done against local caches.
DeleteFunc
func (rsc *ReplicaSetController) deleteRS(obj interface{}) {
    rs, ok := obj.(*apps.ReplicaSet)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj))
            return
        }
        rs, ok = tombstone.Obj.(*apps.ReplicaSet)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a ReplicaSet %#v", obj))
            return
        }
    }
    key, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    klog.V(4).Infof("Deleting %s %q", rsc.Kind, key)
    // Delete expectations for the ReplicaSet so if we create a new one with the same name it starts clean
    rsc.expectations.DeleteExpectations(key)
    rsc.queue.Add(key)
}

// DeleteExpectations deletes the UID set and invokes DeleteExpectations on the
// underlying ControllerExpectationsInterface.
func (u *UIDTrackingControllerExpectations) DeleteExpectations(rcKey string) {
    u.uidStoreLock.Lock()
    defer u.uidStoreLock.Unlock()
    u.ControllerExpectationsInterface.DeleteExpectations(rcKey)
    if uidExp, exists, err := u.uidStore.GetByKey(rcKey); err == nil && exists {
        if err := u.uidStore.Delete(uidExp); err != nil {
            klog.V(2).Infof("Error deleting uid expectations for controller %v: %v", rcKey, err)
        }
    }
}
This is the common handling of cache.DeletedFinalStateUnknown in delete handlers. As covered in the article on cache.DeletedFinalStateUnknown, it is a special wrapper: when an object is deleted but the watch's delete event was missed (for example, the connection to the apiserver dropped), the object is placed into the DeltaFIFO wrapped in a DeletedFinalStateUnknown.
DeletedFinalStateUnknown simply stores the deleted object in its Obj field, acting as a cache of objects that are already gone so they are not lost entirely, with the caveat that the stored state may be stale.
// DeletedFinalStateUnknown is placed into a DeltaFIFO in the case where an object
// was deleted but the watch deletion event was missed while disconnected from
// apiserver. In this case we don't know the final "resting" state of the object, so
// there's a chance the included `Obj` is stale.
type DeletedFinalStateUnknown struct {
    Key string
    Obj interface{}
}
The way the delete event reaches the controller's queue differs from add and update: it does not go through enqueueRS(). The handler first has to unwrap a possible DeletedFinalStateUnknown tombstone to recover the ReplicaSet, and before adding the key to the queue it deletes the controller's expectations for that key, so that a newly created ReplicaSet with the same name starts from a clean slate.
Pod Event Handling
AddFunc
// When a pod is created, enqueue the replica set that manages it and update its expectations.
func (rsc *ReplicaSetController) addPod(obj interface{}) {
    pod := obj.(*v1.Pod)
    if pod.DeletionTimestamp != nil {
        // on a restart of the controller manager, it's possible a new pod shows up in a state that
        // is already pending deletion. Prevent the pod from being a creation observation.
        rsc.deletePod(pod)
        return
    }
    // If it has a ControllerRef, that's all that matters.
    if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
        rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
        if rs == nil {
            return
        }
        rsKey, err := controller.KeyFunc(rs)
        if err != nil {
            return
        }
        klog.V(4).Infof("Pod %s created: %#v.", pod.Name, pod)
        rsc.expectations.CreationObserved(rsKey)
        rsc.queue.Add(rsKey)
        return
    }
    // Otherwise, it's an orphan. Get a list of all matching ReplicaSets and sync
    // them to see if anyone wants to adopt it.
    // DO NOT observe creation because no controller should be waiting for an
    // orphan.
    rss := rsc.getPodReplicaSets(pod)
    if len(rss) == 0 {
        return
    }
    klog.V(4).Infof("Orphan Pod %s created: %#v.", pod.Name, pod)
    for _, rs := range rss {
        rsc.enqueueRS(rs)
    }
}
The ReplicaSet controller's handling of pod Add events is quite interesting. First, as anyone who has used client-go knows, when client-go restarts it replays all current pods as Add events. So if a pod is already marked for deletion (i.e., pod.DeletionTimestamp != nil), the handler goes straight down the delete path.
Next comes the adoption step. A pod's metadata carries OwnerReferences, so the controller resolves the pod's owner to find the ReplicaSet this pod belongs to, looks up that ReplicaSet's cache index key, and adds the ReplicaSet to the controller's queue; it also calls expectations.CreationObserved(rsKey) to record that one of the creations the controller was waiting for has been observed.
Finally, if no owner claims the pod, it is an orphan: the controller lists all ReplicaSets related to this pod to see whether any of them wants to adopt it.
How do we decide whether a ReplicaSet is related to a pod? A ReplicaSet selects pods by their labels: if the pod's labels match the ReplicaSet's selector, the two are considered related, as the sketch below shows.
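A minimal sketch of that label match, using the apimachinery helpers the lister relies on (the selector and pod labels here are made up):

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
)

func main() {
    // A selector as it would appear in ReplicaSet.Spec.Selector.
    rsSelector := &metav1.LabelSelector{
        MatchLabels: map[string]string{"app": "nginx"},
    }
    podLabels := labels.Set{"app": "nginx", "pod-template-hash": "abc123"}

    // getPodReplicaSets boils down to this kind of conversion and match.
    selector, err := metav1.LabelSelectorAsSelector(rsSelector)
    if err != nil {
        panic(err)
    }
    fmt.Println(selector.Matches(podLabels)) // true: the RS relates to this pod
}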
UpdateFunc
A comment appears where the event handler is registered:
// This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
// overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
// local storage, so it should be ok.
It says that this might look like overkill: the handler fires for every pod change, e.g., host assignment, and the most frequent pod update is a status change. But the associated ReplicaSet only lists from local storage (the informer cache), so it is fast and the overhead is acceptable.
When a pod is updated, the controller has to figure out which ReplicaSet(s) to wake up. If the pod's labels have changed, the ReplicaSets related to both the old and the new pod must be woken.
// When a pod is updated, figure out what replica set/s manage it and wake them
// up. If the labels of the pod have changed we need to awaken both the old
// and new replica set. old and cur must be *v1.Pod types.
func (rsc *ReplicaSetController) updatePod(old, cur interface{}) {
    curPod := cur.(*v1.Pod)
    oldPod := old.(*v1.Pod)
    if curPod.ResourceVersion == oldPod.ResourceVersion {
        // Periodic resync will send update events for all known pods.
        // Two different versions of the same pod will always have different RVs.
        return
    }
    labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
    if curPod.DeletionTimestamp != nil {
        // when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,
        // and after such time has passed, the kubelet actually deletes it from the store. We receive an update
        // for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait
        // until the kubelet actually deletes the pod. This is different from the Phase of a pod changing, because
        // an rs never initiates a phase change, and so is never asleep waiting for the same.
        rsc.deletePod(curPod)
        if labelChanged {
            // we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset.
            rsc.deletePod(oldPod)
        }
        return
    }
    curControllerRef := metav1.GetControllerOf(curPod)
    oldControllerRef := metav1.GetControllerOf(oldPod)
    controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
    if controllerRefChanged && oldControllerRef != nil {
        // The ControllerRef was changed. Sync the old controller, if any.
        if rs := rsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); rs != nil {
            rsc.enqueueRS(rs)
        }
    }
    // If it has a ControllerRef, that's all that matters.
    if curControllerRef != nil {
        rs := rsc.resolveControllerRef(curPod.Namespace, curControllerRef)
        if rs == nil {
            return
        }
        klog.V(4).Infof("Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
        rsc.enqueueRS(rs)
        // TODO: MinReadySeconds in the Pod will generate an Available condition to be added in
        // the Pod status which in turn will trigger a requeue of the owning replica set thus
        // having its status updated with the newly available replica. For now, we can fake the
        // update by resyncing the controller MinReadySeconds after the it is requeued because
        // a Pod transitioned to Ready.
        // Note that this still suffers from #29229, we are just moving the problem one level
        // "closer" to kubelet (from the deployment to the replica set controller).
        if !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod) && rs.Spec.MinReadySeconds > 0 {
            klog.V(2).Infof("%v %q will be enqueued after %ds for availability check", rsc.Kind, rs.Name, rs.Spec.MinReadySeconds)
            // Add a second to avoid milliseconds skew in AddAfter.
            // See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
            rsc.enqueueRSAfter(rs, (time.Duration(rs.Spec.MinReadySeconds)*time.Second)+time.Second)
        }
        return
    }
    // Otherwise, it's an orphan. If anything changed, sync matching controllers
    // to see if anyone wants to adopt it now.
    if labelChanged || controllerRefChanged {
        rss := rsc.getPodReplicaSets(curPod)
        if len(rss) == 0 {
            return
        }
        klog.V(4).Infof("Orphan Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
        for _, rs := range rss {
            rsc.enqueueRS(rs)
        }
    }
}
When a pod update arrives, the handler does the following:
- If the current pod's DeletionTimestamp is non-nil, run the delete logic for it; if the labels also changed, run the delete logic for the old pod as well
- Resolve the owners of the old and current pods, and check whether the owning ReplicaSet changed
- If it changed, enqueue the old owner ReplicaSet first for a sync; if nothing changed, there is nothing extra to do
- If the current pod has an owner ReplicaSet, enqueue it; and if the pod just became Ready and the ReplicaSet sets MinReadySeconds, enqueue it again after MinReadySeconds (plus one second) for the availability check
- If the current pod has no owner, and its labels or controllerRef changed, check whether any ReplicaSet wants to adopt it now
DeleteFunc
// When a pod is deleted, enqueue the replica set that manages the pod and update its expectations.
// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.
func (rsc *ReplicaSetController) deletePod(obj interface{}) {
    pod, ok := obj.(*v1.Pod)
    // When a delete is dropped, the relist will notice a pod in the store not
    // in the list, leading to the insertion of a tombstone object which contains
    // the deleted key/value. Note that this value might be stale. If the pod
    // changed labels the new ReplicaSet will not be woken up till the periodic resync.
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj))
            return
        }
        pod, ok = tombstone.Obj.(*v1.Pod)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj))
            return
        }
    }
    controllerRef := metav1.GetControllerOf(pod)
    if controllerRef == nil {
        // No controller should care about orphans being deleted.
        return
    }
    rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
    if rs == nil {
        return
    }
    rsKey, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    klog.V(4).Infof("Pod %s/%s deleted through %v, timestamp %+v: %#v.", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod)
    rsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod))
    rsc.queue.Add(rsKey)
}
When a pod delete event arrives, the ReplicaSet managing that pod is enqueued, and the controller's expected state is updated via DeletionObserved, which tells the expectations bookkeeping that one anticipated deletion has actually been observed.
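Those expectations deserve a word. Roughly: before creating or deleting pods, the controller records how many creations/deletions it expects to observe; CreationObserved and DeletionObserved count them down; and a ReplicaSet is only synced again once its expectations are satisfied (or expire). A highly simplified, hypothetical sketch of that bookkeeping (the real implementation lives in pkg/controller and additionally tracks UIDs and timestamps):

package main

import (
    "fmt"
    "sync"
)

// expectations is a toy model of ControllerExpectations: per-controller
// counters of pod creations/deletions we still expect to observe.
type expectations struct {
    mu   sync.Mutex
    adds map[string]int // key -> pending creations
    dels map[string]int // key -> pending deletions
}

func newExpectations() *expectations {
    return &expectations{adds: map[string]int{}, dels: map[string]int{}}
}

func (e *expectations) ExpectDeletions(key string, n int) {
    e.mu.Lock()
    defer e.mu.Unlock()
    e.dels[key] += n
}

// DeletionObserved is what deletePod calls: one expected deletion arrived.
func (e *expectations) DeletionObserved(key string) {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.dels[key] > 0 {
        e.dels[key]--
    }
}

// Satisfied gates the sync: don't reconcile while still waiting for events.
func (e *expectations) Satisfied(key string) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    return e.adds[key] == 0 && e.dels[key] == 0
}

func main() {
    exp := newExpectations()
    exp.ExpectDeletions("default/nginx-rs", 2)     // about to delete two pods
    exp.DeletionObserved("default/nginx-rs")       // one delete event seen
    fmt.Println(exp.Satisfied("default/nginx-rs")) // false: one still pending
}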
Summary
In short, the ReplicaSet controller does the following:
- Watch for changes: the ReplicaSet controller registers a ReplicaSets informer and a Pods informer in controller-manager. In pkg/controller/replicaset/replica_set.go, the NewReplicaSetController method is initialized with these two informers to listen for all ReplicaSet and Pod events: ADD, UPDATE, and DELETE, six kinds in total.
- React to changes: once the ReplicaSet controller receives a change notification, it puts the change into its queue for processing.
- Synchronize state: this part is covered in the next article.