How Does ReplicaSet Work? (Part 1)

Introduction to ReplicaSet

replica n. copy; duplicate

In a k8s cluster, "replica" describes the number of pods running from the same image and the same runtime environment (that is, produced from the same yaml description).

A ReplicaSet maintains a group of Pod replicas that must be running at all times; these Pod replicas form a set.
A ReplicaSet is typically used to guarantee Pod availability and has the following two characteristics:

  • The number of pods actually running is continuously adjusted toward the given (desired) number
  • The Pods managed by a single ReplicaSet are completely identical

Those familiar with Deployment will know that during a rolling update, even when the number of pod replicas matches the desired count (for example, right after one new instance has been scaled up and one old instance has just been scaled down), the pods are not necessarily all the same version (although they eventually converge).

Since the Pods managed by each ReplicaSet are completely identical, how does a Deployment adjust the instance counts of different versions during a rolling update? The answer is probably obvious by now: a Deployment maintains multiple ReplicaSets at the same time, and achieves the rolling update by increasing the replica count of the new ReplicaSet while decreasing that of the old one.
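
To make the replica-shifting idea concrete, here is a minimal standalone sketch. It is not the Deployment controller's actual algorithm (which also honors maxSurge/maxUnavailable and watches readiness); the type rs and the function shiftOneReplica are hypothetical names used only for illustration:

package main

import "fmt"

// rs is a hypothetical, stripped-down stand-in for a ReplicaSet:
// just a name and a replica count.
type rs struct {
	name     string
	replicas int
}

// shiftOneReplica moves one replica from the old ReplicaSet to the new one:
// scale the new version up first, then scale the old version down, so the
// total never drops below the desired count.
func shiftOneReplica(oldRS, newRS *rs) {
	newRS.replicas++
	oldRS.replicas--
}

func main() {
	oldRS := &rs{name: "web-old", replicas: 3} // old template version
	newRS := &rs{name: "web-new", replicas: 0} // new template version

	for oldRS.replicas > 0 {
		shiftOneReplica(oldRS, newRS)
		fmt.Printf("%s=%d %s=%d total=%d\n",
			oldRS.name, oldRS.replicas, newRS.name, newRS.replicas,
			oldRS.replicas+newRS.replicas)
	}
}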

Posing the Question

From the earlier article "How does kube-controller-manager load controllers such as Deployment and StatefulSet?", we learned that controller-manager loads a series of controllers in its NewControllerInitializers() method. Among these controllers, ReplicaSet is relatively the simplest and most fundamental, and it is also indispensable for understanding Deployment. So how does it work?

Since the ReplicaSet controller contains a work queue, this part only covers how items are written into (enqueued onto) that queue.

Reading the Source Code

Entry Point

In controller-manager's NewControllerInitializers() method we can find the function that starts the ReplicaSet controller:

func startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) {
	go replicaset.NewReplicaSetController(
		ctx.InformerFactory.Apps().V1().ReplicaSets(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
		replicaset.BurstReplicas,
	).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
	return nil, true, nil
}

This is where a ReplicaSet controller gets started: a ReplicaSet informer and a Pod informer are passed in via the controller context, and these two informers deliver ReplicaSet and Pod events respectively. A new controller is constructed, and the go keyword starts a goroutine that runs its Run method.

Constructing a Controller

Let's first look at its constructor, NewReplicaSetController():


// NewReplicaSetController configures a ReplicaSet controller with the specified event recorder
func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {
	eventBroadcaster := record.NewBroadcaster() // initializes an event broadcaster, used to build the event recorder below
	eventBroadcaster.StartStructuredLogging(0)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
	if err := metrics.Register(legacyregistry.Register); err != nil {
		klog.ErrorS(err, "unable to register metrics")
	}
	return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,
		apps.SchemeGroupVersion.WithKind("ReplicaSet"),
		"replicaset_controller",
		"replicaset",
		controller.RealPodControl{
			KubeClient: kubeClient,
			Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}),
		},
	)
}

// NewBaseController is the implementation of NewReplicaSetController with additional injected parameters
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
	gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
	if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
		ratelimiter.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter())
	}

	rsc := &ReplicaSetController{
		GroupVersionKind: gvk,
		kubeClient:       kubeClient,
		podControl:       podControl,
		burstReplicas:    burstReplicas,
		expectations:     controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),
		queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),
	}

	// Key code: each ReplicaSet add/update/delete event gets its own handler
	rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    rsc.addRS,
		UpdateFunc: rsc.updateRS,
		DeleteFunc: rsc.deleteRS,
	})
	rsc.rsLister = rsInformer.Lister()
	rsc.rsListerSynced = rsInformer.Informer().HasSynced

	// Key code: each Pod add/update/delete event gets its own handler
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: rsc.addPod,
		// This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
		// overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
		// local storage, so it should be ok.
		UpdateFunc: rsc.updatePod,
		DeleteFunc: rsc.deletePod,
	})
	rsc.podLister = podInformer.Lister()
	rsc.podListerSynced = podInformer.Informer().HasSynced

	// Key code: the handler responsible for syncing a ReplicaSet's state
	rsc.syncHandler = rsc.syncReplicaSet

	return rsc
}

We will focus on three parts of this code:

  • rsInformer's EventHandler: a dedicated handler for every ReplicaSet add/update/delete event
  • podInformer's EventHandler: a dedicated handler for every Pod add/update/delete event
  • the syncHandler method, which performs the actual state synchronization and will be covered in "How Does ReplicaSet Work? (Part 2)"
Handling ReplicaSet Events
AddFunc
func (rsc *ReplicaSetController) addRS(obj interface{}) {
	rs := obj.(*apps.ReplicaSet)
	klog.V(4).Infof("Adding %s %s/%s", rsc.Kind, rs.Namespace, rs.Name)
	rsc.enqueueRS(rs)
}

func (rsc *ReplicaSetController) enqueueRS(rs *apps.ReplicaSet) {
	key, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
		return
	}

	rsc.queue.Add(key)
}

When a new ReplicaSet object is created, client-go emits an Add event. As you can see, there is nothing special about handling the Add event: the ReplicaSetController simply calls enqueueRS() to write it into its own queue.

Note here how an rs object is converted into an identifying key by the controller.KeyFunc() method.
As explained in the article "Kubernetes object caching and indexing", this key is the object's unique identifier in client-go's cache index; with this key the object can be quickly located in the cache.
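
As a quick aside, controller.KeyFunc is cache.DeletionHandlingMetaNamespaceKeyFunc, which for ordinary (non-tombstone) objects behaves like cache.MetaNamespaceKeyFunc and produces a "namespace/name" string. A minimal standalone sketch (the ReplicaSet name below is made up):

package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	rs := &apps.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "web-7c6b9"},
	}

	// MetaNamespaceKeyFunc is the standard key function for client-go caches:
	// it returns "<namespace>/<name>" for namespaced objects, "<name>" otherwise.
	key, err := cache.MetaNamespaceKeyFunc(rs)
	if err != nil {
		panic(err)
	}
	fmt.Println(key) // default/web-7c6b9

	// The key can later be split back apart to look the object up in the local cache.
	ns, name, _ := cache.SplitMetaNamespaceKey(key)
	fmt.Println(ns, name) // default web-7c6b9
}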

UpdateFunc
// callback when RS is updated
func (rsc *ReplicaSetController) updateRS(old, cur interface{}) {
	oldRS := old.(*apps.ReplicaSet)
	curRS := cur.(*apps.ReplicaSet)

	// TODO: make a KEP and fix informers to always call the delete event handler on re-create
	if curRS.UID != oldRS.UID {
		key, err := controller.KeyFunc(oldRS)
		if err != nil {
			utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", oldRS, err))
			return
		}
		rsc.deleteRS(cache.DeletedFinalStateUnknown{
			Key: key,
			Obj: oldRS,
		})
	}

	// You might imagine that we only really need to enqueue the
	// replica set when Spec changes, but it is safer to sync any
	// time this function is triggered. That way a full informer
	// resync can requeue any replica set that don't yet have pods
	// but whose last attempts at creating a pod have failed (since
	// we don't block on creation of pods) instead of those
	// replica sets stalling indefinitely. Enqueueing every time
	// does result in some spurious syncs (like when Status.Replica
	// is updated and the watch notification from it retriggers
	// this function), but in general extra resyncs shouldn't be
	// that bad as ReplicaSets that haven't met expectations yet won't
	// sync, and all the listing is done using local stores.
	if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {
		klog.V(4).Infof("%v %v updated. Desired pod count change: %d->%d", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))
	}
	rsc.enqueueRS(curRS)
}

func (rsc *ReplicaSetController) enqueueRS(rs *apps.ReplicaSet) {
	key, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
		return
	}

	rsc.queue.Add(key)
}

The ReplicaSet update handler contains a long comment. It explains that the ReplicaSet is not enqueued only when ReplicaSet.Spec changes; instead, it is enqueued every time an update event fires, regardless of what changed, which is safer.

If a ReplicaSet does not yet manage any pods (for example because its last attempt at creating pods failed), handling every update event this way lets a full informer resync requeue that ReplicaSet, ensuring the controller does not leave it stalled indefinitely.

Although enqueueing every time does cause some spurious syncs (for example when Status.Replicas is updated and the resulting watch notification retriggers this function), the extra resyncs are not that costly overall, because ReplicaSets that have not yet met their expectations will not sync, and all listing is done against the local cache.

DeleteFunc

func (rsc *ReplicaSetController) deleteRS(obj interface{}) {
	rs, ok := obj.(*apps.ReplicaSet)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj))
			return
		}
		rs, ok = tombstone.Obj.(*apps.ReplicaSet)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a ReplicaSet %#v", obj))
			return
		}
	}

	key, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
		return
	}

	klog.V(4).Infof("Deleting %s %q", rsc.Kind, key)

	// Delete expectations for the ReplicaSet so if we create a new one with the same name it starts clean
	rsc.expectations.DeleteExpectations(key)

	rsc.queue.Add(key)
}

// DeleteExpectations deletes the UID set and invokes DeleteExpectations on the
// underlying ControllerExpectationsInterface.
func (u *UIDTrackingControllerExpectations) DeleteExpectations(rcKey string) {
	u.uidStoreLock.Lock()
	defer u.uidStoreLock.Unlock()

	u.ControllerExpectationsInterface.DeleteExpectations(rcKey)
	if uidExp, exists, err := u.uidStore.GetByKey(rcKey); err == nil && exists {
		if err := u.uidStore.Delete(uidExp); err != nil {
			klog.V(2).Infof("Error deleting uid expectations for controller %v: %v", rcKey, err)
		}
	}
}

This is the common way of handling cache.DeletedFinalStateUnknown in delete handlers. As described in the article "What is cache.DeletedFinalStateUnknown?", DeletedFinalStateUnknown is a special structure: when an object is deleted but the watch's delete event is missed (for example while disconnected from the apiserver), the object is placed into the DeltaFIFO wrapped in a DeletedFinalStateUnknown.

DeletedFinalStateUnknown simply stores the object in its Obj field, preserving already-deleted objects as a kind of cache so that they are not completely lost.

// DeletedFinalStateUnknown is placed into a DeltaFIFO in the case where an object
// was deleted but the watch deletion event was missed while disconnected from
// apiserver. In this case we don't know the final "resting" state of the object, so
// there's a chance the included `Obj` is stale.
type DeletedFinalStateUnknown struct {
	Key string
	Obj interface{}
}

The way the delete event enqueues work differs slightly from add and update: it does not call enqueueRS(). The difference is that it computes the object's unique client-go cache key itself, first deletes the expectations recorded under that key (so that a new ReplicaSet created with the same name starts clean), and only then adds the key to the queue.

Handling Pod Events
AddFunc
// When a pod is created, enqueue the replica set that manages it and update its expectations.
func (rsc *ReplicaSetController) addPod(obj interface{}) {
	pod := obj.(*v1.Pod)

	if pod.DeletionTimestamp != nil {
		// on a restart of the controller manager, it's possible a new pod shows up in a state that
		// is already pending deletion. Prevent the pod from being a creation observation.
		rsc.deletePod(pod)
		return
	}

	// If it has a ControllerRef, that's all that matters.
	if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
		rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
		if rs == nil {
			return
		}
		rsKey, err := controller.KeyFunc(rs)
		if err != nil {
			return
		}
		klog.V(4).Infof("Pod %s created: %#v.", pod.Name, pod)
		rsc.expectations.CreationObserved(rsKey)
		rsc.queue.Add(rsKey)
		return
	}

	// Otherwise, it's an orphan. Get a list of all matching ReplicaSets and sync
	// them to see if anyone wants to adopt it.
	// DO NOT observe creation because no controller should be waiting for an
	// orphan.
	rss := rsc.getPodReplicaSets(pod)
	if len(rss) == 0 {
		return
	}
	klog.V(4).Infof("Orphan Pod %s created: %#v.", pod.Name, pod)
	for _, rs := range rss {
		rsc.enqueueRS(rs)
	}
}

The way the ReplicaSet controller handles pod Add events is quite interesting. First, those who have used client-go will know that when it (re)starts, it lists all current pods and delivers them as Add events. So if a pod is already marked for deletion (i.e. pod.DeletionTimestamp != nil), it is simply routed to the deletion logic.

Next comes the ownership step. The pod's metadata carries OwnerReference information, so the controller looks up the pod's owner, finds the ReplicaSet this pod belongs to, then looks up that ReplicaSet's cache key and adds the ReplicaSet to the controller's queue.

Finally, if no one owns the pod, it is an orphan: the controller lists all ReplicaSets related to this pod and enqueues them, to see whether any of them wants to adopt the orphan.

How do we decide whether a ReplicaSet is related to a pod? A ReplicaSet selects pods by their labels: if the pod's labels match the ReplicaSet's selector, the ReplicaSet is considered related to the pod.
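
A minimal standalone sketch of that matching check, using the apimachinery label-selector helpers (the selector and pod labels below are made up; the real getPodReplicaSets additionally restricts the lookup to ReplicaSets in the pod's namespace):

package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	rs := &apps.ReplicaSet{
		Spec: apps.ReplicaSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		},
	}
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{"app": "web", "pod-template-hash": "7c6b9"},
		},
	}

	// Convert the ReplicaSet's selector into a labels.Selector ...
	selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
	if err != nil {
		panic(err)
	}
	// ... and check whether it matches the pod's labels: this is the essence of
	// "is this ReplicaSet related to this pod?".
	fmt.Println(selector.Matches(labels.Set(pod.Labels))) // true
}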

UpdateFunc

There is a comment where this event handler is registered:

// This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
// overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
// local storage, so it should be ok.

It notes that invoking the ReplicaSet for every pod change (for example host assignment, or the much more frequent status updates) may look like overkill, but since the associated ReplicaSet only lists from its local cache, it is fast and does not add much overhead.

When a pod is updated, the controller has to figure out exactly which ReplicaSet(s) to wake up. If the pod's labels have changed, the ReplicaSets related to both the old and the new pod need to be woken up.

// When a pod is updated, figure out what replica set/s manage it and wake them
// up. If the labels of the pod have changed we need to awaken both the old
// and new replica set. old and cur must be *v1.Pod types.
func (rsc *ReplicaSetController) updatePod(old, cur interface{}) {
	curPod := cur.(*v1.Pod)
	oldPod := old.(*v1.Pod)
	if curPod.ResourceVersion == oldPod.ResourceVersion {
		// Periodic resync will send update events for all known pods.
		// Two different versions of the same pod will always have different RVs.
		return
	}

	labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
	if curPod.DeletionTimestamp != nil {
		// when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,
		// and after such time has passed, the kubelet actually deletes it from the store. We receive an update
		// for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait
		// until the kubelet actually deletes the pod. This is different from the Phase of a pod changing, because
		// an rs never initiates a phase change, and so is never asleep waiting for the same.
		rsc.deletePod(curPod)
		if labelChanged {
			// we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset.
			rsc.deletePod(oldPod)
		}
		return
	}

	curControllerRef := metav1.GetControllerOf(curPod)
	oldControllerRef := metav1.GetControllerOf(oldPod)
	controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
	if controllerRefChanged && oldControllerRef != nil {
		// The ControllerRef was changed. Sync the old controller, if any.
		if rs := rsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); rs != nil {
			rsc.enqueueRS(rs)
		}
	}

	// If it has a ControllerRef, that's all that matters.
	if curControllerRef != nil {
		rs := rsc.resolveControllerRef(curPod.Namespace, curControllerRef)
		if rs == nil {
			return
		}
		klog.V(4).Infof("Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
		rsc.enqueueRS(rs)
		// TODO: MinReadySeconds in the Pod will generate an Available condition to be added in
		// the Pod status which in turn will trigger a requeue of the owning replica set thus
		// having its status updated with the newly available replica. For now, we can fake the
		// update by resyncing the controller MinReadySeconds after the it is requeued because
		// a Pod transitioned to Ready.
		// Note that this still suffers from #29229, we are just moving the problem one level
		// "closer" to kubelet (from the deployment to the replica set controller).
		if !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod) && rs.Spec.MinReadySeconds > 0 {
			klog.V(2).Infof("%v %q will be enqueued after %ds for availability check", rsc.Kind, rs.Name, rs.Spec.MinReadySeconds)
			// Add a second to avoid milliseconds skew in AddAfter.
			// See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
			rsc.enqueueRSAfter(rs, (time.Duration(rs.Spec.MinReadySeconds)*time.Second)+time.Second)
		}
		return
	}

	// Otherwise, it's an orphan. If anything changed, sync matching controllers
	// to see if anyone wants to adopt it now.
	if labelChanged || controllerRefChanged {
		rss := rsc.getPodReplicaSets(curPod)
		if len(rss) == 0 {
			return
		}
		klog.V(4).Infof("Orphan Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
		for _, rs := range rss {
			rsc.enqueueRS(rs)
		}
	}
}

When a pod update occurs, the handler mainly does the following:

  1. If the current pod's DeletionTimestamp is not nil, run the pod-deletion logic for it; if the labels also changed, run the deletion logic for the old pod as well, then return
  2. Look up the ControllerRefs (owners) of the current and the old pod and check whether they are the same
  3. If they differ and the old pod had an owner, enqueue the old owner ReplicaSet first; if nothing changed there is nothing extra to do
  4. If the current pod has an owner ReplicaSet, enqueue it; if the pod has just become Ready and the ReplicaSet sets MinReadySeconds, enqueue it again after that delay for the availability check
  5. If the current pod has no owner and its labels or ControllerRef changed, enqueue all matching ReplicaSets to see whether any of them wants to adopt it
DeleteFunc
// When a pod is deleted, enqueue the replica set that manages the pod and update its expectations.
// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.
func (rsc *ReplicaSetController) deletePod(obj interface{}) {
	pod, ok := obj.(*v1.Pod)

	// When a delete is dropped, the relist will notice a pod in the store not
	// in the list, leading to the insertion of a tombstone object which contains
	// the deleted key/value. Note that this value might be stale. If the pod
	// changed labels the new ReplicaSet will not be woken up till the periodic resync.
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj))
			return
		}
		pod, ok = tombstone.Obj.(*v1.Pod)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj))
			return
		}
	}

	controllerRef := metav1.GetControllerOf(pod)
	if controllerRef == nil {
		// No controller should care about orphans being deleted.
		return
	}
	rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
	if rs == nil {
		return
	}
	rsKey, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
		return
	}
	klog.V(4).Infof("Pod %s/%s deleted through %v, timestamp %+v: %#v.", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod)
	rsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod))
	rsc.queue.Add(rsKey)
}

When a pod delete event occurs, the ReplicaSet managing that pod is added to the queue, and the ReplicaSet's expectations are updated (the deletion is marked as observed).
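
The "expectations" mentioned above deserve a short note. Roughly speaking, the controller remembers how many pod creations/deletions it has issued but not yet observed for each ReplicaSet, and each observed pod event decrements that count; a ReplicaSet whose expectations are not yet fulfilled is not worth syncing again. The sketch below is a simplified, hypothetical version of that bookkeeping, not the real ControllerExpectations/UIDTrackingControllerExpectations implementation:

package main

import (
	"fmt"
	"sync"
)

// expectations is a simplified, hypothetical expectation tracker:
// per-ReplicaSet counters of creations/deletions still expected to be observed.
type expectations struct {
	mu   sync.Mutex
	adds map[string]int
	dels map[string]int
}

func newExpectations() *expectations {
	return &expectations{adds: map[string]int{}, dels: map[string]int{}}
}

// Expect records that we just issued `add` creations and `del` deletions for rsKey.
func (e *expectations) Expect(rsKey string, add, del int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[rsKey] += add
	e.dels[rsKey] += del
}

// CreationObserved / DeletionObserved are what the pod event handlers would call.
func (e *expectations) CreationObserved(rsKey string) { e.observe(rsKey, e.adds) }
func (e *expectations) DeletionObserved(rsKey string) { e.observe(rsKey, e.dels) }

func (e *expectations) observe(rsKey string, m map[string]int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if m[rsKey] > 0 {
		m[rsKey]--
	}
}

// Fulfilled reports whether all expected events for rsKey have been observed,
// i.e. whether a new sync of this ReplicaSet would be useful.
func (e *expectations) Fulfilled(rsKey string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[rsKey] == 0 && e.dels[rsKey] == 0
}

func main() {
	exp := newExpectations()
	exp.Expect("default/web-7c6b9", 2, 0)           // we just asked the apiserver to create 2 pods
	fmt.Println(exp.Fulfilled("default/web-7c6b9")) // false
	exp.CreationObserved("default/web-7c6b9")
	exp.CreationObserved("default/web-7c6b9")
	fmt.Println(exp.Fulfilled("default/web-7c6b9")) // true
}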

Summary

The ReplicaSet controller mainly does the following:

  1. Watch for changes: the ReplicaSet controller registers a ReplicaSet informer and a Pod informer in controller-manager. The file pkg/controller/replicaset/replica_set.go defines a method named NewReplicaSetController, which is initialized with these two informers and watches all ReplicaSet and Pod events: ADD, UPDATE, and DELETE, six handlers in total
  2. React to changes: once the ReplicaSet controller receives a change notification, it puts the corresponding key into its work queue for processing.
  3. Sync state: this part will be covered in Part 2.

[Figure: ReplicaSetController enqueue flow]
