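1. What Is an Operator?
In Kubernetes we routinely use resources such as Deployment, DaemonSet, Service, and ConfigMap. These are all built-in resources. Creating, updating, or deleting one of them produces an event, and the controllers run by the Kubernetes Controller Manager watch for these events and act to bring the actual state in line with the desired state declared in the Spec. This declarative model simplifies things for users, who only need to describe the final state they want for their application. As Kubernetes has evolved, however, the built-in resources have proven insufficient for some more complex distributed applications. In short, an Operator is a custom controller that manages a custom resource type added to a Kubernetes cluster.
2. CRD
CRD stands for Custom Resource Definition. As the name suggests, it lets users add a new API resource type to Kubernetes, alongside built-in types such as Pod and Node: a custom API resource. In short, a CRD describes which fields the resource has, what type each field is, and how the fields are structured.
When you create a new CRD, the Kubernetes API server generates a new RESTful resource path for each version you specify, so the new resource type is registered with the Kubernetes cluster quickly. Custom resources created from a CRD can be either namespaced or cluster-scoped, depending on the CRD's spec.scope field.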
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, in the form '<plural>.<group>'
  name: crontabs.stable.example.com
spec:
  # group name, used in the REST API: /apis/<group>/<version>
  group: stable.example.com
  # list of versions supported by this CustomResourceDefinition
  versions:
    - name: v1
      # each version can be enabled or disabled independently with the served flag
      served: true
      # one and only one version must be marked as the storage version
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
                image:
                  type: string
                replicas:
                  type: integer
  # either Namespaced or Cluster
  scope: Namespaced
  names:
    # plural name, used in the URL: /apis/<group>/<version>/<plural>
    plural: crontabs
    # singular name, used as an alias on the command line and for display
    singular: crontab
    # kind is normally the CamelCased singular form. Your resource manifests use this form.
    kind: CronTab
    # shortNames allow shorter strings to match your resource on the command line
    shortNames:
      - ct
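To make the generated REST path concrete, here is a minimal sketch that builds the collection path from the group, version, plural name, and scope of the CronTab CRD above. The crdInfo type and the resourcePath function are illustrative names introduced for this example; they are not part of any Kubernetes library.

```go
package main

import "fmt"

// crdInfo holds the CRD fields that determine the REST path.
// The type is hypothetical and exists only for this example.
type crdInfo struct {
	Group   string // spec.group
	Version string // spec.versions[].name
	Plural  string // spec.names.plural
	Scope   string // spec.scope: "Namespaced" or "Cluster"
}

// resourcePath returns the collection path for the custom resource.
// Namespaced resources get an extra /namespaces/<namespace> segment.
func resourcePath(c crdInfo, namespace string) string {
	if c.Scope == "Namespaced" {
		return fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s", c.Group, c.Version, namespace, c.Plural)
	}
	return fmt.Sprintf("/apis/%s/%s/%s", c.Group, c.Version, c.Plural)
}

func main() {
	crontab := crdInfo{
		Group:   "stable.example.com",
		Version: "v1",
		Plural:  "crontabs",
		Scope:   "Namespaced",
	}
	// Prints /apis/stable.example.com/v1/namespaces/default/crontabs
	fmt.Println(resourcePath(crontab, "default"))
}
```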
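3. CR
CR stands for Custom Resource. A CR is an instance of a CRD: the CRD determines which fields the resource has, and a CR assigns values to those fields to define one concrete object. A CR is directly comparable to a built-in Pod object and can be used in the same way. The CRD is the definition of the resource type; a concrete resource of that type is a CR.
An analogy with a database:
A CRD is like a table: it defines which columns the table has and what their types are. Creating a CRD is therefore comparable to CREATE TABLE fruits: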
$ kubectl create -f fruits-crd.yaml
Creating a CR is comparable to INSERT INTO fruits VALUES (…):
$ kubectl create -f apple-cr.yaml
apple-cr.yaml:
apiVersion: example.org/v1
kind: Fruit
metadata:
  name: apple
spec:
  sweet: false
  weight: 100
  comment: little bit rotten
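On the Go side, a CR is usually represented by a struct whose fields follow the CRD schema. The sketch below decodes the apple manifest into such a struct. The Fruit, Metadata, and FruitSpec types are hypothetical and written by hand for this example (a real project would typically generate them and embed the metadata types from the Kubernetes API machinery), and the manifest is given as JSON because the Go standard library has no YAML parser.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fruit mirrors the structure of apple-cr.yaml. Hypothetical type.
type Fruit struct {
	APIVersion string    `json:"apiVersion"`
	Kind       string    `json:"kind"`
	Metadata   Metadata  `json:"metadata"`
	Spec       FruitSpec `json:"spec"`
}

// Metadata keeps only the name; real object metadata has many more fields.
type Metadata struct {
	Name string `json:"name"`
}

// FruitSpec lists the fields that the Fruit CRD would declare in its schema.
type FruitSpec struct {
	Sweet   bool   `json:"sweet"`
	Weight  int    `json:"weight"`
	Comment string `json:"comment"`
}

// appleJSON is the JSON equivalent of apple-cr.yaml.
const appleJSON = `{
  "apiVersion": "example.org/v1",
  "kind": "Fruit",
  "metadata": {"name": "apple"},
  "spec": {"sweet": false, "weight": 100, "comment": "little bit rotten"}
}`

// decodeFruit parses a Fruit manifest given as JSON.
func decodeFruit(data string) (Fruit, error) {
	var f Fruit
	err := json.Unmarshal([]byte(data), &f)
	return f, err
}

func main() {
	apple, err := decodeFruit(appleJSON)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s %q: sweet=%v weight=%d comment=%q\n",
		apple.Kind, apple.Metadata.Name, apple.Spec.Sweet, apple.Spec.Weight, apple.Spec.Comment)
}
```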
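4. Developing an Operator
4.1 controller-runtime
controller-runtime is a library from the Kubernetes community for quickly building controllers. It implements the generic controller machinery, so users no longer need to build it themselves. They only need to decide how to handle the requests that result from changes on the Kubernetes API server, that is, to implement their own business logic in a Reconciler.
Two SDKs built on controller-runtime are commonly used to develop Operators today: Operator SDK and Kubebuilder. In essence the two do not differ much.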
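The division of labor described in this subsection can be sketched with the standard library alone: the framework decides when to call Reconcile, and the user code only compares desired and actual state for the requested object. Everything below is a local stand-in that imitates the shape of the idea. Request, Reconciler, replicaReconciler, and runOnce are names invented for this sketch, and the real interface in controller-runtime also returns a result value used for requeueing, which is left out here.

```go
package main

import (
	"context"
	"fmt"
)

// Request identifies the object to reconcile (hypothetical local type).
type Request struct {
	Namespace, Name string
}

// Reconciler is the single method the user has to implement in this sketch.
type Reconciler interface {
	Reconcile(ctx context.Context, req Request) error
}

// replicaReconciler drives the actual replica count toward the desired one.
// Both maps are in-memory substitutes for state held by the API server.
type replicaReconciler struct {
	desired map[Request]int
	actual  map[Request]int
}

func (r *replicaReconciler) Reconcile(ctx context.Context, req Request) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	want, ok := r.desired[req]
	if !ok {
		// The object no longer exists: clean up what was created for it.
		delete(r.actual, req)
		return nil
	}
	// Compare full desired and actual state, whatever event triggered the call.
	r.actual[req] = want
	return nil
}

// runOnce plays the role of the framework: it feeds queued requests to the Reconciler.
func runOnce(r Reconciler, queue []Request) error {
	ctx := context.Background()
	for _, req := range queue {
		if err := r.Reconcile(ctx, req); err != nil {
			return fmt.Errorf("reconcile %s/%s: %w", req.Namespace, req.Name, err)
		}
	}
	return nil
}

func main() {
	web := Request{Namespace: "default", Name: "web"}
	r := &replicaReconciler{
		desired: map[Request]int{web: 3},
		actual:  map[Request]int{web: 1},
	}
	if err := runOnce(r, []Request{web}); err != nil {
		panic(err)
	}
	fmt.Println("actual replicas:", r.actual[web]) // actual replicas: 3
}
```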
4.2 client-go
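client-go is the official Go client library for Kubernetes, a package split out of the Kubernetes code base. With client-go you can interact with a Kubernetes cluster, including reading and modifying resources. A good grasp of client-go is very important for Kubernetes development.
controller-runtime, also published by the Kubernetes community, wraps client-go into a library for quickly building controllers.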
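5. How the Informer Mechanism Works Under the Hood
The informer is a core package in client-go. In Kubernetes, components communicate with the API Server over HTTP. If every component queried the API Server directly each time, the API Server and etcd would come under heavy load. Informers avoid this by serving reads from a local cache that is kept up to date through List and Watch and, without relying on any middleware, keep messages timely, reliable, and ordered.
API Server: exposes its services as a REST API. It is the bridge through which all components of a Kubernetes system communicate, the data bus of the whole system, and the core of cluster management.
etcd: a highly available key-value database and the sole backing store for Kubernetes cluster data.
[Figure: detailed Informer architecture diagram]
An Informer consists of the following components: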
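- Reflector: watches for changes to resource objects through the List and Watch APIs, and puts those changes into the DeltaFIFO.
- DeltaFIFO: stores the object changes observed by the Reflector.
- Controller: not the user's controller, but an object inside the Informer. It continuously pops Deltas from the DeltaFIFO (think of a Deltas as the events for one resource), syncs the changes into the Indexer (that is, the cache), and triggers the handlers the user registered through informer.AddEventHandler.
- Indexer: the resource cache maintained by the Informer. When we fetch resources through a Lister's List and Get methods, they come from this cache, not from the API Server.
- Processor: maintains the handlers registered by users and dispatches notifications to them.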
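[Figure: Informer runtime flow diagram]
5.1 Starting the informer
Starting an informer involves the following steps:
- Register and start processLoop and the reflector.
- The reflector starts to LIST and WATCH; the watched data is compared and processed, then stored in the queue.
- processLoop starts looping and pops data from the queue.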
// v1 below refers to "k8s.io/apimachinery/pkg/apis/meta/v1"
factory := informers.NewSharedInformerFactory(clientset, 0)
podInformer := factory.Core().V1().Pods().Informer()
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		mObj := obj.(v1.Object)
		log.Printf("New pod added: %s", mObj.GetName())
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		oObj := oldObj.(v1.Object)
		nObj := newObj.(v1.Object)
		log.Printf("%s pod updated to %s", oObj.GetName(), nObj.GetName())
	},
	DeleteFunc: func(obj interface{}) {
		// If a delete event was missed, obj can be a tombstone instead of the pod itself
		if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			obj = tombstone.Obj
		}
		mObj := obj.(v1.Object)
		log.Printf("pod deleted from store: %s", mObj.GetName())
	},
})
// Start the informer; Run blocks until stopCh is closed
podInformer.Run(stopCh)
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
......
fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
// The FIFO holds a reference to the indexer
KnownObjects: s.indexer,
EmitDeltaTypeReplaced: true,
})
cfg := &Config{
Queue: fifo,
ListerWatcher: s.listerWatcher,
ObjectType: s.objectType,
FullResyncPeriod: s.resyncCheckPeriod,
RetryOnError: false,
ShouldResync: s.processor.shouldResync,
// Register the HandleDeltas callback; it is used later when data is popped from the queue
Process: s.HandleDeltas,
WatchErrorHandler: s.watchErrorHandler,
}
// Create the controller from the config
func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.controller = New(cfg)
s.controller.(*controller).clock = s.clock
s.started = true
}()
......
s.controller.Run(stopCh)
}
func (c *controller) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
go func() {
<-stopCh
c.config.Queue.Close()
}()
r := NewReflector(
c.config.ListerWatcher,
c.config.ObjectType,
c.config.Queue,
c.config.FullResyncPeriod,
)
// code omitted
......
var wg wait.Group
// Start the reflector
wg.StartWithChannel(stopCh, r.Run)
// Start processLoop
wait.Until(c.processLoop, time.Second, stopCh)
wg.Wait()
}
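The startup order shown above, a reflector goroutine that produces and a processLoop that consumes, can be imitated with plain channels. This is only a sketch of the control flow under simplifying assumptions: runController is a made-up name, an unbuffered channel stands in for the DeltaFIFO, and the queue is closed when the producer runs out of events rather than by a stop channel.

```go
package main

import (
	"fmt"
	"sync"
)

// runController imitates the startup order: a "reflector" goroutine pushes
// events into a queue while the calling goroutine runs the "processLoop".
// Nothing here is client-go code.
func runController(events []string, process func(string)) {
	queue := make(chan string) // stands in for the DeltaFIFO
	var wg sync.WaitGroup

	// Start the reflector: everything it observes goes into the queue.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer close(queue) // closing the queue ends the process loop
		for _, e := range events {
			queue <- e
		}
	}()

	// processLoop: pop from the queue and call the registered callback.
	for e := range queue {
		process(e)
	}
	wg.Wait()
}

func main() {
	var handled []string
	runController([]string{"add pod-a", "update pod-a", "delete pod-a"}, func(e string) {
		handled = append(handled, e)
	})
	fmt.Println(handled) // [add pod-a update pod-a delete pod-a]
}
```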
5.2 ListAndWatch
Once started, the reflector begins ListAndWatch. The watch keeps a long-lived connection to the API Server, implemented with HTTP chunked transfer encoding.
func (r *Reflector) Run(stopCh <-chan struct{}) {
klog.V(3).Infof("Starting reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
wait.BackoffUntil(func() {
// The reflector performs list and watch
if err := r.ListAndWatch(stopCh); err != nil {
r.watchErrorHandler(r, err)
}
}, r.backoffManager, true, stopCh)
klog.V(3).Infof("Stopping reflector %s (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name)
}
switch event.Type {
// The watch delivered an Added event
case watch.Added:
err := r.store.Add(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
}
// The watch delivered a Modified event
case watch.Modified:
err := r.store.Update(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
}
// The watch delivered a Deleted event
case watch.Deleted:
// TODO: Will any consumers need access to the "last known
// state", which is passed in event.Object? If so, may need
// to change this.
err := r.store.Delete(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
}
case watch.Bookmark:
// A `Bookmark` means watch has synced here, just update the resourceVersion
default:
utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
}
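The long-lived watch connection can be imitated with the standard library. In the sketch below, a test server writes one JSON event at a time and flushes after each, so the response is streamed in chunks, and the client decodes events as they arrive and dispatches on the type in the same spirit as the switch above. The watchEvent type, the handler, and the helper functions are assumptions made for this example; the real API server and client-go are considerably more involved.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// watchEvent is a simplified stand-in for a watch event.
type watchEvent struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

// streamHandler writes one JSON event at a time and flushes after each,
// so the client receives events while the response is still open.
func streamHandler(events []watchEvent) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		enc := json.NewEncoder(w)
		for _, e := range events {
			if err := enc.Encode(e); err != nil {
				return
			}
			flusher.Flush()
		}
	}
}

// watch reads events from the open response until the server ends the stream.
func watch(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var seen []string
	dec := json.NewDecoder(resp.Body)
	for {
		var e watchEvent
		if err := dec.Decode(&e); errors.Is(err, io.EOF) {
			return seen, nil
		} else if err != nil {
			return seen, err
		}
		switch e.Type {
		case "ADDED", "MODIFIED", "DELETED":
			seen = append(seen, e.Type+" "+e.Name)
		default:
			return seen, fmt.Errorf("unable to understand watch event %q", e.Type)
		}
	}
}

// demoWatch starts a local test server and watches it once.
func demoWatch() ([]string, error) {
	srv := httptest.NewServer(streamHandler([]watchEvent{
		{Type: "ADDED", Name: "pod-a"},
		{Type: "MODIFIED", Name: "pod-a"},
		{Type: "DELETED", Name: "pod-a"},
	}))
	defer srv.Close()
	return watch(srv.URL)
}

func main() {
	seen, err := demoWatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(seen)
}
```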
Watched objects are added to the DeltaFIFO. Taking an update event as an example:
func (f *DeltaFIFO) Update(obj interface{}) error {
f.lock.Lock()
defer f.lock.Unlock()
f.populated = true
return f.queueActionLocked(Updated, obj)
}
func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {
id, err := f.KeyOf(obj)
if err != nil {
return KeyError{obj, err}
}
oldDeltas := f.items[id]
newDeltas := append(oldDeltas, Delta{actionType, obj})
newDeltas = dedupDeltas(newDeltas)
if len(newDeltas) > 0 {
if _, exists := f.items[id]; !exists {
// Append the key to the queue
f.queue = append(f.queue, id)
}
// Store the new Deltas in items
f.items[id] = newDeltas
// Broadcast that an event has arrived, waking goroutines blocked on cond
f.cond.Broadcast()
} else {
// This never happens, because dedupDeltas never returns an empty list
// when given a non-empty list (as it is here).
// If somehow it happens anyway, deal with it but complain.
if oldDeltas == nil {
klog.Errorf("Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; ignoring", id, oldDeltas, obj)
return nil
}
klog.Errorf("Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; breaking invariant by storing empty Deltas", id, oldDeltas, obj)
f.items[id] = newDeltas
return fmt.Errorf("Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; broke DeltaFIFO invariant by storing empty Deltas", id, oldDeltas, obj)
}
return nil
}
The DeltaFIFO data structure is as follows:
type DeltaFIFO struct {
// lock/cond protects access to 'items' and 'queue'.
lock sync.RWMutex
cond sync.Cond
// `items` maps a key to a Deltas.
// Each such Deltas has at least one Delta.
items map[string]Deltas
// `queue` maintains FIFO order of keys for consumption in Pop().
// There are no duplicates in `queue`.
// A key is in `queue` if and only if it is in `items`.
queue []string
// populated is true if the first batch of items inserted by Replace() has been populated
// or Delete/Add/Update/AddIfNotPresent was called first.
populated bool
// initialPopulationCount is the number of items inserted by the first call of Replace()
initialPopulationCount int
// keyFunc is used to make the key used for queued item
// insertion and retrieval, and should be deterministic.
keyFunc KeyFunc
// knownObjects list keys that are "known" --- affecting Delete(),
// Replace(), and Resync()
knownObjects KeyListerGetter
// Used to indicate a queue is closed so a control loop can exit when a queue is empty.
// Currently, not used to gate any of CRUD operations.
closed bool
// emitDeltaTypeReplaced is whether to emit the Replaced or Sync
// DeltaType when Replace() is called (to preserve backwards compat).
emitDeltaTypeReplaced bool
}
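The interplay of items, queue, and cond is easier to see in a stripped-down version. The following sketch keeps only those three fields and two operations. miniFIFO, enqueue, and pop are names chosen for this example, objects are plain strings, and delta deduplication, Replace, and Resync are left out on purpose.

```go
package main

import (
	"fmt"
	"sync"
)

// delta records one change to an object (objects are plain strings here).
type delta struct {
	Type   string // "Added", "Updated" or "Deleted"
	Object string
}

// miniFIFO keeps the core of the structure above: items maps a key to its
// pending deltas, queue holds each pending key once, in arrival order.
type miniFIFO struct {
	lock  sync.Mutex
	cond  *sync.Cond
	items map[string][]delta
	queue []string
}

func newMiniFIFO() *miniFIFO {
	f := &miniFIFO{items: map[string][]delta{}}
	f.cond = sync.NewCond(&f.lock)
	return f
}

// enqueue appends a delta for key; the key enters queue only if it is not pending yet.
func (f *miniFIFO) enqueue(key string, d delta) {
	f.lock.Lock()
	defer f.lock.Unlock()
	if _, exists := f.items[key]; !exists {
		f.queue = append(f.queue, key)
	}
	f.items[key] = append(f.items[key], d)
	f.cond.Broadcast() // wake any goroutine blocked in pop
}

// pop blocks until a key is available, then returns all deltas accumulated for it.
func (f *miniFIFO) pop() (string, []delta) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for len(f.queue) == 0 {
		f.cond.Wait()
	}
	key := f.queue[0]
	f.queue = f.queue[1:]
	deltas := f.items[key]
	delete(f.items, key)
	return key, deltas
}

func main() {
	f := newMiniFIFO()
	f.enqueue("default/pod-a", delta{Type: "Added", Object: "pod-a v1"})
	f.enqueue("default/pod-b", delta{Type: "Added", Object: "pod-b v1"})
	f.enqueue("default/pod-a", delta{Type: "Updated", Object: "pod-a v2"})

	key, deltas := f.pop()
	fmt.Println(key, len(deltas)) // default/pod-a 2
	key, deltas = f.pop()
	fmt.Println(key, len(deltas)) // default/pod-b 1
}
```

Because a key appears in queue at most once, a burst of changes to one object is handed to the consumer as a single batch of deltas.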
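At this point the latest data has been pushed into the DeltaFIFO's queue. Next, let's look at how the data in the queue is processed.
5.3 HandleDeltas
Items leave the DeltaFIFO's queue in processLoop, which was registered and started earlier: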
func (c *controller) processLoop() {
for {
// Pop data from the queue and hand it to Process, which is the HandleDeltas registered earlier
obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))
if err != nil {
if err == ErrFIFOClosed {
return
}
if c.config.RetryOnError {
// This is the safe way to re-enqueue.
// Put the object back into the queue
c.config.Queue.AddIfNotPresent(obj)
}
}
}
}
func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
f.lock.Lock()
defer f.lock.Unlock()
for {
for len(f.queue) == 0 {
// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
// When Close() is called, the f.closed is set and the condition is broadcasted.
// Which causes this loop to continue and return from the Pop().
if f.closed {
return nil, ErrFIFOClosed
}
// If the queue is empty, block and wait; the goroutine wakes up when a broadcast arrives
f.cond.Wait()
}
id := f.queue[0]
f.queue = f.queue[1:]
depth := len(f.queue)
if f.initialPopulationCount > 0 {
f.initialPopulationCount--
}
item, ok := f.items[id]
if !ok {
// This should never happen
klog.Errorf("Inconceivable! %q was in f.queue but not f.items; ignoring.", id)
continue
}
// Dequeuing also deletes the entry for this key from items
delete(f.items, id)
// Only log traces if the queue depth is greater than 10 and it takes more than
// 100 milliseconds to process one item from the queue.
// Queue depth never goes high because processing an item is locking the queue,
// and new items can't be added until processing finish.
// https://github.com/kubernetes/kubernetes/issues/103789
if depth > 10 {
trace := utiltrace.New("DeltaFIFO Pop Process",
utiltrace.Field{Key: "ID", Value: id},
utiltrace.Field{Key: "Depth", Value: depth},
utiltrace.Field{Key: "Reason", Value: "slow event handlers blocking the queue"})
defer trace.LogIfLong(100 * time.Millisecond)
}
// Process the data; this is the key call, which leads into HandleDeltas
err := process(item)
if e, ok := err.(ErrRequeue); ok {
f.addIfNotPresent(id, item)
err = e.Err
}
// Don't need to copyDeltas here, because we're transferring
// ownership to the caller.
return item, err
}
}
func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
s.blockDeltas.Lock()
defer s.blockDeltas.Unlock()
// from oldest to newest
for _, d := range obj.(Deltas) {
switch d.Type {
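// Sync and Replaced come into play when the DeltaFIFO is populated: the Informer lists all
// objects of a resource, and Replace loads them into the queue in one go to be synced to the Indexer.
case Sync, Replaced, Added, Updated:
s.cacheMutationDetector.AddObject(d.Object)
// Check whether the object already exists in the local cache (the indexer)
if old, exists, err := s.indexer.Get(d.Object); err == nil && exists {
// If it exists, update it in the indexer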
if err := s.indexer.Update(d.Object); err != nil {
return err
}
isSync := false
switch {
case d.Type == Sync:
// Sync events are only propagated to listeners that requested resync
isSync = true
case d.Type == Replaced:
if accessor, err := meta.Accessor(d.Object); err == nil {
if oldAccessor, err := meta.Accessor(old); err == nil {
// Replaced events that didn't change resourceVersion are treated as resync events
// and only propagated to listeners that requested resync
isSync = accessor.GetResourceVersion() == oldAccessor.GetResourceVersion()
}
}
}
// Distribute an update notification to the listeners
s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
} else {
// If it does not exist, add it to the indexer
if err := s.indexer.Add(d.Object); err != nil {
return err
}
// Distribute an add notification to the listeners
s.processor.distribute(addNotification{newObj: d.Object}, false)
}
case Deleted:
if err := s.indexer.Delete(d.Object); err != nil {
return err
}
// Distribute a delete notification to the listeners
s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
}
}
return nil
}
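One detail in HandleDeltas is easy to miss: whether a listener sees an add or an update depends on whether the object is already in the indexer, not on the delta type. A minimal sketch of that logic follows, with a plain map as the indexer and strings as notifications. The handleDeltas function and the delta type are simplified stand-ins written for this example, and the resync bookkeeping (isSync) is omitted.

```go
package main

import "fmt"

// delta is a simplified change record; Key stands in for namespace/name.
type delta struct {
	Type  string // "Added", "Updated", "Sync", "Replaced" or "Deleted"
	Key   string
	Value string
}

// handleDeltas applies deltas to the indexer in order and returns the
// notifications that would be distributed to listeners.
func handleDeltas(indexer map[string]string, deltas []delta) []string {
	var notifications []string
	for _, d := range deltas {
		switch d.Type {
		case "Sync", "Replaced", "Added", "Updated":
			if old, exists := indexer[d.Key]; exists {
				// Already cached: update the cache and notify an update.
				indexer[d.Key] = d.Value
				notifications = append(notifications,
					fmt.Sprintf("update %s: %s -> %s", d.Key, old, d.Value))
			} else {
				// Not cached yet: add to the cache and notify an add.
				indexer[d.Key] = d.Value
				notifications = append(notifications,
					fmt.Sprintf("add %s: %s", d.Key, d.Value))
			}
		case "Deleted":
			delete(indexer, d.Key)
			notifications = append(notifications, "delete "+d.Key)
		}
	}
	return notifications
}

func main() {
	indexer := map[string]string{}
	out := handleDeltas(indexer, []delta{
		{Type: "Added", Key: "default/pod-a", Value: "v1"},
		{Type: "Updated", Key: "default/pod-a", Value: "v2"},
		{Type: "Deleted", Key: "default/pod-a", Value: "v2"},
	})
	for _, n := range out {
		fmt.Println(n)
	}
}
```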
func (p *sharedProcessor) distribute(obj interface{}, sync bool) {
p.listenersLock.RLock()
defer p.listenersLock.RUnlock()
if sync {
for _, listener := range p.syncingListeners {
listener.add(obj)
}
} else {
for _, listener := range p.listeners {
// Hand the notification to the listener
listener.add(obj)
}
}
}
func (p *processorListener) add(notification interface{}) {
// Send the notification to addCh
p.addCh <- notification
}
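The notification has now been sent to the listener's addCh. So when was the listener registered, and how does it work?
In fact, the listener is registered when an event handler is registered on the informer.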
podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		mObj := obj.(v1.Object)
		log.Printf("New pod added: %s", mObj.GetName())
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		oObj := oldObj.(v1.Object)
		nObj := newObj.(v1.Object)
		log.Printf("%s pod updated to %s", oObj.GetName(), nObj.GetName())
	},
	DeleteFunc: func(obj interface{}) {
		// If a delete event was missed, obj can be a tombstone instead of the pod itself
		if tombstone, ok := obj.(cache.DeletedFinalStateUnknown); ok {
			obj = tombstone.Obj
		}
		mObj := obj.(v1.Object)
		log.Printf("pod deleted from store: %s", mObj.GetName())
	},
})
func (s *sharedIndexInformer) AddEventHandler(handler ResourceEventHandler) {
s.AddEventHandlerWithResyncPeriod(handler, s.defaultEventHandlerResyncPeriod)
}
func (s *sharedIndexInformer) AddEventHandlerWithResyncPeriod(handler ResourceEventHandler, resyncPeriod time.Duration) {
// code omitted
// ......
// Create the listener
listener := newProcessListener(handler, resyncPeriod, determineResyncPeriod(resyncPeriod, s.resyncCheckPeriod), s.clock.Now(), initialBufferSize)
if !s.started {
s.processor.addListener(listener)
return
}
// in order to safely join, we have to
// 1. stop sending add/update/delete notifications
// 2. do a list against the store
// 3. send synthetic "Add" events to the new handler
// 4. unblock
s.blockDeltas.Lock()
defer s.blockDeltas.Unlock()
// Add the listener
s.processor.addListener(listener)
for _, item := range s.indexer.List() {
listener.add(addNotification{newObj: item})
}
}
func newProcessListener(handler ResourceEventHandler, requestedResyncPeriod, resyncPeriod time.Duration, now time.Time, bufferSize int) *processorListener {
ret := &processorListener{
nextCh: make(chan interface{}),
addCh: make(chan interface{}),
handler: handler,
pendingNotifications: *buffer.NewRingGrowing(bufferSize),
requestedResyncPeriod: requestedResyncPeriod,
resyncPeriod: resyncPeriod,
}
ret.determineNextResync(now)
return ret
}
func (p *sharedProcessor) addListener(listener *processorListener) {
p.listenersLock.Lock()
defer p.listenersLock.Unlock()
p.addListenerLocked(listener)
if p.listenersStarted {
// Run the listener in two separate goroutines:
// pop takes notifications from the channel
// run processes the notifications
p.wg.Start(listener.run)
p.wg.Start(listener.pop)
}
}
func (p *processorListener) pop() {
defer utilruntime.HandleCrash()
defer close(p.nextCh) // Tell .run() to stop
var nextCh chan<- interface{}
var notification interface{}
for {
select {
case nextCh <- notification:
// Notification dispatched
var ok bool
notification, ok = p.pendingNotifications.ReadOne()
if !ok { // Nothing to pop
nextCh = nil // Disable this select case
}
// Recall that distribute sent the notification to addCh when dispatching to listeners
case notificationToAdd, ok := <-p.addCh:
if !ok {
return
}
if notification == nil { // No notification to pop (and pendingNotifications is empty)
// Optimize the case - skip adding to pendingNotifications
notification = notificationToAdd
nextCh = p.nextCh
} else { // There is already a notification waiting to be dispatched
p.pendingNotifications.WriteOne(notificationToAdd)
}
}
}
}
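p.pendingNotifications is a ring buffer. pop reads notifications from addCh; when an earlier notification is still waiting to be dispatched, the new one is written to the ring buffer, and notifications are later taken out of the ring buffer and sent to nextCh.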
func (p *processorListener) run() {
// this call blocks until the channel is closed. When a panic happens during the notification
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
stopCh := make(chan struct{})
wait.Until(func() {
for next := range p.nextCh {
// The user-defined handler methods are called here
switch notification := next.(type) {
case updateNotification:
p.handler.OnUpdate(notification.oldObj, notification.newObj)
case addNotification:
p.handler.OnAdd(notification.newObj)
case deleteNotification:
p.handler.OnDelete(notification.oldObj)
default:
utilruntime.HandleError(fmt.Errorf("unrecognized notification: %T", next))
}
}
// the only way to get here is if the p.nextCh is empty and closed
close(stopCh)
}, 1*time.Second, stopCh)
}
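The cooperation between pop and run relies on a Go idiom: a send on a nil channel never proceeds, so setting the local nextCh to nil switches off that select case while nothing is waiting. The sketch below reproduces the pattern with strings as notifications and a plain slice in place of the ring buffer. The listener type and deliver function are written for this example, and, unlike the client-go code above, pop flushes the remaining items when addCh is closed so that the demo output is deterministic.

```go
package main

import "fmt"

// listener imitates processorListener: addCh receives notifications, pop
// moves them to nextCh through an unbounded buffer, run calls the handler.
type listener struct {
	addCh   chan string
	nextCh  chan string
	handler func(string)
}

func (l *listener) pop() {
	defer close(l.nextCh) // tell run to stop
	var pending []string
	var nextCh chan<- string // nil disables the send case below
	var current string
	for {
		select {
		case nextCh <- current:
			// current was delivered; load the next pending item, if any.
			if len(pending) == 0 {
				nextCh = nil
			} else {
				current, pending = pending[0], pending[1:]
			}
		case n, ok := <-l.addCh:
			if !ok {
				// Deviation from client-go: flush what is left before stopping.
				if nextCh != nil {
					l.nextCh <- current
					for _, p := range pending {
						l.nextCh <- p
					}
				}
				return
			}
			if nextCh == nil {
				// Nothing is waiting: dispatch n directly.
				current, nextCh = n, l.nextCh
			} else {
				pending = append(pending, n)
			}
		}
	}
}

func (l *listener) run(done chan<- struct{}) {
	for n := range l.nextCh {
		l.handler(n)
	}
	close(done)
}

// deliver pushes notifications through a listener and returns what the handler saw.
func deliver(notifications []string) []string {
	var handled []string
	l := &listener{
		addCh:   make(chan string),
		nextCh:  make(chan string),
		handler: func(n string) { handled = append(handled, n) },
	}
	done := make(chan struct{})
	go l.pop()
	go l.run(done)
	for _, n := range notifications {
		l.addCh <- n // pop buffers the item if run is still busy
	}
	close(l.addCh)
	<-done
	return handled
}

func main() {
	fmt.Println(deliver([]string{"add", "update", "delete"})) // [add update delete]
}
```

The buffer between addCh and nextCh means a slow handler delays only its own listener; the sender on addCh is released as soon as pop has taken the item.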
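5.4 Three levels of cache
List-watch is a typical producer-consumer model. A common problem with this model is that the consumer cannot process events as fast as the producer generates them, so caches are needed to hold the producer's events and let the consumer work through them at its own pace.
5.4.1 DeltaFIFO
The DeltaFIFO does not itself update or delete resource objects; it mainly acts as a buffer and a staging area.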
5.4.2 Indexer
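The latest local copy of each resource object lives in the Indexer, which is kept consistent with the objects stored in etcd through list and watch. The Indexer's storage is implemented by ThreadSafeMap, a concurrency-safe store that provides add, delete, update, and get operations. The Indexer wraps ThreadSafeMap and adds indexing on top of it, so the Indexer can be thought of as a concurrency-safe local store with indexes.
Elements popped from the DeltaFIFO are synced into the Indexer's store; see HandleDeltas.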
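The idea of a concurrency-safe store with an index can be sketched in a few lines: a map guarded by a read-write lock, plus a second map from an index value to the set of keys that carry it. The indexedStore type below is written for this example and supports only one fixed index (by namespace); the Indexer in client-go is more general.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pod is a minimal object with the two fields used for keys and indexing.
type pod struct {
	Namespace, Name string
}

// indexedStore is a concurrency-safe map plus one secondary index.
type indexedStore struct {
	lock  sync.RWMutex
	items map[string]pod                 // key -> object
	byNS  map[string]map[string]struct{} // namespace -> set of keys
}

func newIndexedStore() *indexedStore {
	return &indexedStore{
		items: map[string]pod{},
		byNS:  map[string]map[string]struct{}{},
	}
}

// keyOf builds the store key in the usual namespace/name form.
func keyOf(p pod) string { return p.Namespace + "/" + p.Name }

// Add stores the object and records its key in the namespace index.
func (s *indexedStore) Add(p pod) {
	s.lock.Lock()
	defer s.lock.Unlock()
	k := keyOf(p)
	s.items[k] = p
	if s.byNS[p.Namespace] == nil {
		s.byNS[p.Namespace] = map[string]struct{}{}
	}
	s.byNS[p.Namespace][k] = struct{}{}
}

// Delete removes the object from both the store and the index.
func (s *indexedStore) Delete(p pod) {
	s.lock.Lock()
	defer s.lock.Unlock()
	k := keyOf(p)
	delete(s.items, k)
	delete(s.byNS[p.Namespace], k)
}

// ByNamespace answers from the index without scanning every item.
func (s *indexedStore) ByNamespace(ns string) []string {
	s.lock.RLock()
	defer s.lock.RUnlock()
	keys := make([]string, 0, len(s.byNS[ns]))
	for k := range s.byNS[ns] {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stable output
	return keys
}

func main() {
	s := newIndexedStore()
	s.Add(pod{Namespace: "default", Name: "a"})
	s.Add(pod{Namespace: "default", Name: "b"})
	s.Add(pod{Namespace: "kube-system", Name: "c"})
	fmt.Println(s.ByNamespace("default")) // [default/a default/b]
	s.Delete(pod{Namespace: "default", Name: "a"})
	fmt.Println(s.ByNamespace("default")) // [default/b]
}
```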
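5.4.2.1 The Resync mechanism
The Resync mechanism puts the resources held in the Indexer's local store back into the DeltaFIFO, marked with the Sync operation type. Resync is run periodically from the Reflector, and as a result the UpdateFunc of handlers that requested resync is called periodically.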
func (f *DeltaFIFO) Resync() error {
f.lock.Lock()
defer f.lock.Unlock()
if f.knownObjects == nil {
return nil
}
// Iterate over all keys in the local store
keys := f.knownObjects.ListKeys()
for _, k := range keys {
// Sync the key
if err := f.syncKeyLocked(k); err != nil {
return err
}
}
return nil
}
func (f *DeltaFIFO) syncKeyLocked(key string) error {
obj, exists, err := f.knownObjects.GetByKey(key)
if err != nil {
klog.Errorf("Unexpected error %v during lookup of key %v, unable to queue object for sync", err, key)
return nil
} else if !exists {
klog.Infof("Key %v does not exist in known objects store, unable to queue object for sync", key)
return nil
}
id, err := f.KeyOf(obj)
if err != nil {
return KeyError{obj, err}
}
// If the FIFO already holds deltas for this key, the object has newer events pending, so skip the sync
if len(f.items[id]) > 0 {
return nil
}
// Put the object back into the FIFO queue
if err := f.queueActionLocked(Sync, obj); err != nil {
return fmt.Errorf("couldn't queue object: %v", err)
}
return nil
}
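The rule that Resync skips keys with pending deltas can be shown in isolation. In the sketch below, store stands in for the Indexer and pending for the DeltaFIFO's items map. The resync function is a simplified stand-in written for this example, and the keys are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// resync queues a "Sync" entry for every key in the local store, except for
// keys that already have pending changes. It returns the keys it queued.
func resync(store map[string]string, pending map[string][]string) []string {
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example

	var queued []string
	for _, k := range keys {
		if len(pending[k]) > 0 {
			// A newer event is already waiting and supersedes the sync.
			continue
		}
		pending[k] = append(pending[k], "Sync")
		queued = append(queued, k)
	}
	return queued
}

func main() {
	store := map[string]string{"default/pod-a": "v1", "default/pod-b": "v3"}
	pending := map[string][]string{"default/pod-b": {"Updated"}}
	fmt.Println(resync(store, pending)) // [default/pod-a]
}
```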
5.4.3 RingGrowing
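After elements are popped from the DeltaFIFO, they are synced into the Indexer on the one hand and used to trigger event callbacks on the other. Kubernetes does not invoke the callbacks directly, however. It adds one more layer of buffering, RingGrowing, a ring-shaped data structure that grows when it is full.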
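A growing ring buffer can be written in a few dozen lines. The version below is a from-scratch sketch of the idea, not the client-go implementation: the ringGrowing type holds strings, doubles its capacity when full, and copies the stored items to the start of the new slice.

```go
package main

import "fmt"

// ringGrowing is a ring buffer that doubles its capacity when it is full.
type ringGrowing struct {
	data     []string
	readable int // number of items currently stored
	beg      int // index of the next item to read
}

// newRingGrowing creates a buffer; capacities below 1 are raised to 1.
func newRingGrowing(capacity int) *ringGrowing {
	if capacity < 1 {
		capacity = 1
	}
	return &ringGrowing{data: make([]string, capacity)}
}

// WriteOne appends an item, growing the buffer first if it is full.
func (r *ringGrowing) WriteOne(v string) {
	if r.readable == len(r.data) {
		// Full: copy the items, oldest first, into a slice twice as large.
		grown := make([]string, len(r.data)*2)
		for i := 0; i < r.readable; i++ {
			grown[i] = r.data[(r.beg+i)%len(r.data)]
		}
		r.data, r.beg = grown, 0
	}
	r.data[(r.beg+r.readable)%len(r.data)] = v
	r.readable++
}

// ReadOne removes and returns the oldest item, or false if the buffer is empty.
func (r *ringGrowing) ReadOne() (string, bool) {
	if r.readable == 0 {
		return "", false
	}
	v := r.data[r.beg]
	r.beg = (r.beg + 1) % len(r.data)
	r.readable--
	return v, true
}

func main() {
	r := newRingGrowing(2)
	r.WriteOne("a")
	r.WriteOne("b")
	first, _ := r.ReadOne() // frees slot 0
	r.WriteOne("c")         // wraps around into slot 0
	r.WriteOne("d")         // buffer is full, so it grows to capacity 4
	fmt.Println(first, len(r.data))
	for v, ok := r.ReadOne(); ok; v, ok = r.ReadOne() {
		fmt.Println(v)
	}
}
```

Growing instead of blocking keeps the writer from ever waiting on a slow reader, at the cost of unbounded memory if the reader never catches up.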