Introduction to ReplicaSet
replica (n.): a copy; a duplicate
Within a k8s cluster, "replica" describes the number of pods running from the same image and the same runtime environment (i.e., described by the same yaml).
A ReplicaSet maintains a group of pod replicas that must be running at any given time.
A ReplicaSet is typically used to guarantee pod availability, and has the following two characteristics (sketched in the code below):
- The number of actually running pods is continuously reconciled toward the given replica count
- All pods managed by a single ReplicaSet are completely identical
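To make the first point concrete, here is a minimal, hypothetical sketch of the reconciliation idea; the function and variable names are made up, and the controller's real sync logic is covered in Part 2:

package main

import "fmt"

// reconcile sketches the control-loop idea only: compare the desired
// replica count with the pods actually observed, then decide how many
// pods to create or which ones to delete to close the gap.
func reconcile(desired int, running []string) (create int, remove []string) {
    if diff := desired - len(running); diff > 0 {
        return diff, nil // too few pods: create the missing ones
    } else if diff < 0 {
        return 0, running[:(-diff)] // too many pods: pick victims to delete
    }
    return 0, nil // already converged
}

func main() {
    create, remove := reconcile(3, []string{"nginx-abc", "nginx-def"})
    fmt.Println(create, remove) // prints: 1 [] (one more pod must be created)
}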
Readers familiar with Deployment will know that during a rolling update, even when the current pod replica count matches the desired count (for example, right after one new instance has been scaled up and one old instance has just been scaled down), the pods are not necessarily all the same version (although they do eventually converge).
Since every pod managed by a single ReplicaSet is identical, how does a Deployment adjust the instance counts of different versions during a rolling update? The answer suggests itself: a Deployment maintains multiple ReplicaSets at the same time, and performs the rolling update by increasing the replica count of the new ReplicaSet while decreasing that of the old one.
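As a toy illustration of that scaling dance between two ReplicaSets (made-up names; this deliberately ignores the Deployment controller's real maxSurge/maxUnavailable handling):

package main

import "fmt"

// rollingUpdateStep models one step of a rolling update: scale the new
// ReplicaSet up by one, then the old one down by one. Illustrative only.
func rollingUpdateStep(newReplicas, oldReplicas, desired int) (int, int) {
    if newReplicas < desired {
        newReplicas++ // bring up one new-version pod first
    }
    if oldReplicas > 0 {
        oldReplicas-- // then retire one old-version pod
    }
    return newReplicas, oldReplicas
}

func main() {
    newRS, oldRS := 0, 3 // rollout begins: 3 old-version pods
    for newRS < 3 || oldRS > 0 {
        newRS, oldRS = rollingUpdateStep(newRS, oldRS, 3)
        fmt.Printf("new=%d old=%d\n", newRS, oldRS) // converges to new=3 old=0
    }
}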
Problem Statement
In the earlier article on how kube-controller-manager loads the Deployment, StatefulSet, and other controllers, we learned that controller-manager loads a series of controllers in its NewControllerInitializers() method. Among these controllers, ReplicaSet is comparatively the simplest and most fundamental, and understanding it is a prerequisite for understanding Deployment. So how does it work?
Since the ReplicaSet controller is built around a work queue, this article only covers the producer side, i.e., how items get written into the queue; consuming the queue (the actual sync) is left to the next article.
Source Code Walkthrough
Entry Point
In controller-manager's NewControllerInitializers() method, we find the function that starts the ReplicaSet controller:
func startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) {
    go replicaset.NewReplicaSetController(
        ctx.InformerFactory.Apps().V1().ReplicaSets(),
        ctx.InformerFactory.Core().V1().Pods(),
        ctx.ClientBuilder.ClientOrDie("replicaset-controller"),
        replicaset.BurstReplicas,
    ).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)
    return nil, true, nil
}
This is where the ReplicaSet controller gets running: through the context it is handed a ReplicaSet informer and a Pod informer, which deliver ReplicaSet and Pod events respectively. A new controller is constructed, and its Run method is launched in a new goroutine via the go keyword.
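For context, outside of controller-manager the same kind of informers are usually obtained from a shared informer factory. A minimal sketch, assuming a standard kubeconfig at the default path (the 30s resync period is arbitrary):

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Assumes a reachable cluster and a kubeconfig at ~/.kube/config.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // A shared factory hands out the same cached informers that
    // controller-manager passes into NewReplicaSetController.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    rsInformer := factory.Apps().V1().ReplicaSets() // appsinformers.ReplicaSetInformer
    podInformer := factory.Core().V1().Pods()       // coreinformers.PodInformer
    _, _ = rsInformer, podInformer

    stopCh := make(chan struct{})
    factory.Start(stopCh) // kicks off the underlying list-watch loops
}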
Constructing the Controller
Let's start with its constructor, NewReplicaSetController():
// NewReplicaSetController configures a ReplicaSet controller with the specified event recorder
func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {
    eventBroadcaster := record.NewBroadcaster() // initialize an event broadcaster/recorder
    eventBroadcaster.StartStructuredLogging(0)
    eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
    if err := metrics.Register(legacyregistry.Register); err != nil {
        klog.ErrorS(err, "unable to register metrics")
    }
    return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,
        apps.SchemeGroupVersion.WithKind("ReplicaSet"),
        "replicaset_controller",
        "replicaset",
        controller.RealPodControl{
            KubeClient: kubeClient,
            Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}),
        },
    )
}
// NewBaseController is the implementation of NewReplicaSetController with additional injected parameters
func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,
    gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {
    if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
        ratelimiter.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter())
    }
    rsc := &ReplicaSetController{
        GroupVersionKind: gvk,
        kubeClient:       kubeClient,
        podControl:       podControl,
        burstReplicas:    burstReplicas,
        expectations:     controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),
        queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),
    }
    // Key code: each ReplicaSet add/update/delete event has its own handler
    rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    rsc.addRS,
        UpdateFunc: rsc.updateRS,
        DeleteFunc: rsc.deleteRS,
    })
    rsc.rsLister = rsInformer.Lister()
    rsc.rsListerSynced = rsInformer.Informer().HasSynced
    // Key code: each Pod add/update/delete event has its own handler
    podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: rsc.addPod,
        // This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
        // overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
        // local storage, so it should be ok.
        UpdateFunc: rsc.updatePod,
        DeleteFunc: rsc.deletePod,
    })
    rsc.podLister = podInformer.Lister()
    rsc.podListerSynced = podInformer.Informer().HasSynced
    // Key code: syncHandler is responsible for the actual ReplicaSet sync
    rsc.syncHandler = rsc.syncReplicaSet
    return rsc
}
We will focus on three parts of this:
- rsInformer's EventHandler: a dedicated handler for each ReplicaSet add/update/delete event
- podInformer's EventHandler: a dedicated handler for each Pod add/update/delete event
- The syncHandler method, which does the actual state synchronization; it will be covered in the next article (Part 2)
ReplicaSet Event Handling
AddFunc
func (rsc *ReplicaSetController) addRS(obj interface{}) {
    rs := obj.(*apps.ReplicaSet)
    klog.V(4).Infof("Adding %s %s/%s", rsc.Kind, rs.Namespace, rs.Name)
    rsc.enqueueRS(rs)
}

func (rsc *ReplicaSetController) enqueueRS(rs *apps.ReplicaSet) {
    key, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    rsc.queue.Add(key)
}
When a new ReplicaSet object is created, client-go emits an Add event. As you can see, there is nothing special about the Add handling: the ReplicaSetController calls enqueueRS() to write the object's key into its own queue.
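A quick aside on that queue: it is client-go's rate-limiting workqueue, the same type constructed in NewBaseController above. A minimal standalone sketch of its producer/consumer API (the queue name "demo" and the key are made up):

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    // Same constructor the ReplicaSet controller uses for its queue.
    q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "demo")

    q.Add("default/nginx-rs")         // what enqueueRS effectively does
    q.AddAfter("default/nginx-rs", 0) // delayed variant; enqueueRSAfter uses this

    // A worker (covered in Part 2) would block on Get for the next key.
    key, shutdown := q.Get()
    if !shutdown {
        fmt.Println("processing", key) // processing default/nginx-rs
        q.Done(key)                    // mark the item as finished
    }
    q.ShutDown()
}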
Note here how an rs object is turned into an identifying key by controller.KeyFunc().
As explained in the earlier article on Kubernetes object caching and indexing, this key is the object's unique identifier in client-go's cache index, and it lets the cached object be located quickly.
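Under the hood, controller.KeyFunc is cache.DeletionHandlingMetaNamespaceKeyFunc, which (tombstones aside) delegates to cache.MetaNamespaceKeyFunc and renders the key as namespace/name. A small demonstration with a made-up ReplicaSet:

package main

import (
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/cache"
)

func main() {
    rs := &appsv1.ReplicaSet{
        ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "nginx-rs"},
    }
    // controller.KeyFunc ultimately delegates to this helper.
    key, err := cache.MetaNamespaceKeyFunc(rs)
    if err != nil {
        panic(err)
    }
    fmt.Println(key) // default/nginx-rs
}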
UpdateFunc
// callback when RS is updated
func (rsc *ReplicaSetController) updateRS(old, cur interface{}) {
    oldRS := old.(*apps.ReplicaSet)
    curRS := cur.(*apps.ReplicaSet)
    // TODO: make a KEP and fix informers to always call the delete event handler on re-create
    if curRS.UID != oldRS.UID {
        key, err := controller.KeyFunc(oldRS)
        if err != nil {
            utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", oldRS, err))
            return
        }
        rsc.deleteRS(cache.DeletedFinalStateUnknown{
            Key: key,
            Obj: oldRS,
        })
    }
    // You might imagine that we only really need to enqueue the
    // replica set when Spec changes, but it is safer to sync any
    // time this function is triggered. That way a full informer
    // resync can requeue any replica set that don't yet have pods
    // but whose last attempts at creating a pod have failed (since
    // we don't block on creation of pods) instead of those
    // replica sets stalling indefinitely. Enqueueing every time
    // does result in some spurious syncs (like when Status.Replica
    // is updated and the watch notification from it retriggers
    // this function), but in general extra resyncs shouldn't be
    // that bad as ReplicaSets that haven't met expectations yet won't
    // sync, and all the listing is done using local stores.
    if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {
        klog.V(4).Infof("%v %v updated. Desired pod count change: %d->%d", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))
    }
    rsc.enqueueRS(curRS)
}
The long comment in the update handler explains that the ReplicaSet is enqueued on every update event, not only when ReplicaSet.Spec changes; enqueueing every time is simply safer.
If a ReplicaSet does not yet manage any pods (say, its last attempt at creating a pod failed), handling every update event means a full informer resync will requeue that ReplicaSet, so the controller keeps retrying the ReplicaSet whose pods never get created instead of letting it stall indefinitely.
Enqueueing every time does cause some spurious syncs (for example, when Status.Replicas is updated and the watch notification from it retriggers this function), but overall the extra resyncs are not that bad: ReplicaSets that have not yet met their expectations won't sync, and all listing is done against local caches.
DeleteFunc
func (rsc *ReplicaSetController) deleteRS(obj interface{}) {
    rs, ok := obj.(*apps.ReplicaSet)
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj))
            return
        }
        rs, ok = tombstone.Obj.(*apps.ReplicaSet)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a ReplicaSet %#v", obj))
            return
        }
    }
    key, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    klog.V(4).Infof("Deleting %s %q", rsc.Kind, key)
    // Delete expectations for the ReplicaSet so if we create a new one with the same name it starts clean
    rsc.expectations.DeleteExpectations(key)
    rsc.queue.Add(key)
}

// DeleteExpectations deletes the UID set and invokes DeleteExpectations on the
// underlying ControllerExpectationsInterface.
func (u *UIDTrackingControllerExpectations) DeleteExpectations(rcKey string) {
    u.uidStoreLock.Lock()
    defer u.uidStoreLock.Unlock()
    u.ControllerExpectationsInterface.DeleteExpectations(rcKey)
    if uidExp, exists, err := u.uidStore.GetByKey(rcKey); err == nil && exists {
        if err := u.uidStore.Delete(uidExp); err != nil {
            klog.V(2).Infof("Error deleting uid expectations for controller %v: %v", rcKey, err)
        }
    }
}
This is the common handling of cache.DeletedFinalStateUnknown in delete handlers. As covered in the article on cache.DeletedFinalStateUnknown, it is a special wrapper: when an object is deleted but the watch's delete event was missed (for example, the connection to the apiserver dropped), the object is placed into the DeltaFIFO wrapped in a DeletedFinalStateUnknown.
DeletedFinalStateUnknown simply stores the deleted object in its Obj field, acting as a cache of objects that are already gone so they are not lost entirely, with the caveat that the stored state may be stale.
// DeletedFinalStateUnknown is placed into a DeltaFIFO in the case where an object
// was deleted but the watch deletion event was missed while disconnected from
// apiserver. In this case we don't know the final "resting" state of the object, so
// there's a chance the included `Obj` is stale.
type DeletedFinalStateUnknown struct {
    Key string
    Obj interface{}
}
The way the delete event reaches the controller's queue differs from add and update: it does not go through enqueueRS(). The handler first has to unwrap a possible DeletedFinalStateUnknown tombstone to recover the ReplicaSet, and before adding the key to the queue it deletes the controller's expectations for that key, so that a newly created ReplicaSet with the same name starts from a clean slate.
Pod Event Handling
AddFunc
// When a pod is created, enqueue the replica set that manages it and update its expectations.
func (rsc *ReplicaSetController) addPod(obj interface{}) {
    pod := obj.(*v1.Pod)
    if pod.DeletionTimestamp != nil {
        // on a restart of the controller manager, it's possible a new pod shows up in a state that
        // is already pending deletion. Prevent the pod from being a creation observation.
        rsc.deletePod(pod)
        return
    }
    // If it has a ControllerRef, that's all that matters.
    if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {
        rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
        if rs == nil {
            return
        }
        rsKey, err := controller.KeyFunc(rs)
        if err != nil {
            return
        }
        klog.V(4).Infof("Pod %s created: %#v.", pod.Name, pod)
        rsc.expectations.CreationObserved(rsKey)
        rsc.queue.Add(rsKey)
        return
    }
    // Otherwise, it's an orphan. Get a list of all matching ReplicaSets and sync
    // them to see if anyone wants to adopt it.
    // DO NOT observe creation because no controller should be waiting for an
    // orphan.
    rss := rsc.getPodReplicaSets(pod)
    if len(rss) == 0 {
        return
    }
    klog.V(4).Infof("Orphan Pod %s created: %#v.", pod.Name, pod)
    for _, rs := range rss {
        rsc.enqueueRS(rs)
    }
}
The ReplicaSet controller's handling of pod Add events is quite interesting. First, as anyone who has used client-go knows, when client-go restarts it replays all current pods as Add events. So if a pod is already marked for deletion (i.e., pod.DeletionTimestamp != nil), the handler goes straight down the delete path.
Next comes the adoption step. A pod's metadata carries OwnerReferences, so the controller resolves the pod's owner to find the ReplicaSet this pod belongs to, looks up that ReplicaSet's cache index key, and adds the ReplicaSet to the controller's queue; it also calls expectations.CreationObserved(rsKey) to record that one of the creations the controller was waiting for has been observed.
Finally, if no owner claims the pod, it is an orphan: the controller lists all ReplicaSets related to this pod to see whether any of them wants to adopt it.
How do we decide whether a ReplicaSet is related to a pod? A ReplicaSet selects pods by their labels: if the pod's labels match the ReplicaSet's selector, the two are considered related, as the sketch below shows.
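A minimal sketch of that label match, using the apimachinery helpers the lister relies on (the selector and pod labels here are made up):

package main

import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
)

func main() {
    // A selector as it would appear in ReplicaSet.Spec.Selector.
    rsSelector := &metav1.LabelSelector{
        MatchLabels: map[string]string{"app": "nginx"},
    }
    podLabels := labels.Set{"app": "nginx", "pod-template-hash": "abc123"}

    // getPodReplicaSets boils down to this kind of conversion and match.
    selector, err := metav1.LabelSelectorAsSelector(rsSelector)
    if err != nil {
        panic(err)
    }
    fmt.Println(selector.Matches(podLabels)) // true: the RS relates to this pod
}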
UpdateFunc
A comment appears where the event handler is registered:
// This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like
// overkill the most frequent pod update is status, and the associated ReplicaSet will only list from
// local storage, so it should be ok.
It says that this might look like overkill: the handler fires for every pod change, e.g., host assignment, and the most frequent pod update is a status change. But the associated ReplicaSet only lists from local storage (the informer cache), so it is fast and the overhead is acceptable.
When a pod is updated, the controller has to figure out which ReplicaSet(s) to wake up. If the pod's labels have changed, the ReplicaSets related to both the old and the new pod must be woken.
// When a pod is updated, figure out what replica set/s manage it and wake them
// up. If the labels of the pod have changed we need to awaken both the old
// and new replica set. old and cur must be *v1.Pod types.
func (rsc *ReplicaSetController) updatePod(old, cur interface{}) {
    curPod := cur.(*v1.Pod)
    oldPod := old.(*v1.Pod)
    if curPod.ResourceVersion == oldPod.ResourceVersion {
        // Periodic resync will send update events for all known pods.
        // Two different versions of the same pod will always have different RVs.
        return
    }
    labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)
    if curPod.DeletionTimestamp != nil {
        // when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,
        // and after such time has passed, the kubelet actually deletes it from the store. We receive an update
        // for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait
        // until the kubelet actually deletes the pod. This is different from the Phase of a pod changing, because
        // an rs never initiates a phase change, and so is never asleep waiting for the same.
        rsc.deletePod(curPod)
        if labelChanged {
            // we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset.
            rsc.deletePod(oldPod)
        }
        return
    }
    curControllerRef := metav1.GetControllerOf(curPod)
    oldControllerRef := metav1.GetControllerOf(oldPod)
    controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
    if controllerRefChanged && oldControllerRef != nil {
        // The ControllerRef was changed. Sync the old controller, if any.
        if rs := rsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); rs != nil {
            rsc.enqueueRS(rs)
        }
    }
    // If it has a ControllerRef, that's all that matters.
    if curControllerRef != nil {
        rs := rsc.resolveControllerRef(curPod.Namespace, curControllerRef)
        if rs == nil {
            return
        }
        klog.V(4).Infof("Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
        rsc.enqueueRS(rs)
        // TODO: MinReadySeconds in the Pod will generate an Available condition to be added in
        // the Pod status which in turn will trigger a requeue of the owning replica set thus
        // having its status updated with the newly available replica. For now, we can fake the
        // update by resyncing the controller MinReadySeconds after the it is requeued because
        // a Pod transitioned to Ready.
        // Note that this still suffers from #29229, we are just moving the problem one level
        // "closer" to kubelet (from the deployment to the replica set controller).
        if !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod) && rs.Spec.MinReadySeconds > 0 {
            klog.V(2).Infof("%v %q will be enqueued after %ds for availability check", rsc.Kind, rs.Name, rs.Spec.MinReadySeconds)
            // Add a second to avoid milliseconds skew in AddAfter.
            // See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
            rsc.enqueueRSAfter(rs, (time.Duration(rs.Spec.MinReadySeconds)*time.Second)+time.Second)
        }
        return
    }
    // Otherwise, it's an orphan. If anything changed, sync matching controllers
    // to see if anyone wants to adopt it now.
    if labelChanged || controllerRefChanged {
        rss := rsc.getPodReplicaSets(curPod)
        if len(rss) == 0 {
            return
        }
        klog.V(4).Infof("Orphan Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
        for _, rs := range rss {
            rsc.enqueueRS(rs)
        }
    }
}
When a pod update arrives, the handler does the following:
- If the current pod's DeletionTimestamp is non-nil, run the delete logic for it; if the labels also changed, run the delete logic for the old pod as well
- Resolve the owners of the old and current pods, and check whether the owning ReplicaSet changed
- If it changed, enqueue the old owner ReplicaSet first for a sync; if nothing changed, there is nothing extra to do
- If the current pod has an owner ReplicaSet, enqueue it; and if the pod just became Ready and the ReplicaSet sets MinReadySeconds, enqueue it again after MinReadySeconds (plus one second) for the availability check
- If the current pod has no owner, and its labels or controllerRef changed, check whether any ReplicaSet wants to adopt it now
DeleteFunc
// When a pod is deleted, enqueue the replica set that manages the pod and update its expectations.
// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.
func (rsc *ReplicaSetController) deletePod(obj interface{}) {
    pod, ok := obj.(*v1.Pod)
    // When a delete is dropped, the relist will notice a pod in the store not
    // in the list, leading to the insertion of a tombstone object which contains
    // the deleted key/value. Note that this value might be stale. If the pod
    // changed labels the new ReplicaSet will not be woken up till the periodic resync.
    if !ok {
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj))
            return
        }
        pod, ok = tombstone.Obj.(*v1.Pod)
        if !ok {
            utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj))
            return
        }
    }
    controllerRef := metav1.GetControllerOf(pod)
    if controllerRef == nil {
        // No controller should care about orphans being deleted.
        return
    }
    rs := rsc.resolveControllerRef(pod.Namespace, controllerRef)
    if rs == nil {
        return
    }
    rsKey, err := controller.KeyFunc(rs)
    if err != nil {
        utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err))
        return
    }
    klog.V(4).Infof("Pod %s/%s deleted through %v, timestamp %+v: %#v.", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod)
    rsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod))
    rsc.queue.Add(rsKey)
}
When a pod delete event arrives, the ReplicaSet managing that pod is enqueued, and the controller's expected state is updated via DeletionObserved, which tells the expectations bookkeeping that one anticipated deletion has actually been observed.
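Those expectations deserve a word. Roughly: before creating or deleting pods, the controller records how many creations/deletions it expects to observe; CreationObserved and DeletionObserved count them down; and a ReplicaSet is only synced again once its expectations are satisfied (or expire). A highly simplified, hypothetical sketch of that bookkeeping (the real implementation lives in pkg/controller and additionally tracks UIDs and timestamps):

package main

import (
    "fmt"
    "sync"
)

// expectations is a toy model of ControllerExpectations: per-controller
// counters of pod creations/deletions we still expect to observe.
type expectations struct {
    mu   sync.Mutex
    adds map[string]int // key -> pending creations
    dels map[string]int // key -> pending deletions
}

func newExpectations() *expectations {
    return &expectations{adds: map[string]int{}, dels: map[string]int{}}
}

func (e *expectations) ExpectDeletions(key string, n int) {
    e.mu.Lock()
    defer e.mu.Unlock()
    e.dels[key] += n
}

// DeletionObserved is what deletePod calls: one expected deletion arrived.
func (e *expectations) DeletionObserved(key string) {
    e.mu.Lock()
    defer e.mu.Unlock()
    if e.dels[key] > 0 {
        e.dels[key]--
    }
}

// Satisfied gates the sync: don't reconcile while still waiting for events.
func (e *expectations) Satisfied(key string) bool {
    e.mu.Lock()
    defer e.mu.Unlock()
    return e.adds[key] == 0 && e.dels[key] == 0
}

func main() {
    exp := newExpectations()
    exp.ExpectDeletions("default/nginx-rs", 2)     // about to delete two pods
    exp.DeletionObserved("default/nginx-rs")       // one delete event seen
    fmt.Println(exp.Satisfied("default/nginx-rs")) // false: one still pending
}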
Summary
In short, the ReplicaSet controller does the following:
- Watch for changes: the ReplicaSet controller registers a ReplicaSets informer and a Pods informer in controller-manager. In pkg/controller/replicaset/replica_set.go, the NewReplicaSetController method is initialized with these two informers to listen for all ReplicaSet and Pod events: ADD, UPDATE, and DELETE, six kinds in total.
- React to changes: once the ReplicaSet controller receives a change notification, it puts the change into its queue for processing.
- Synchronize state: this part is covered in the next article.