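[TODO] Restructure this article around the official controller-runtime documentation.
In the previous article, [kubeflow] 从零搭建training-operator项目 (building a training-operator project from scratch), we built a simple training-operator project and left only the controller's Reconcile function unimplemented. This time we use the TFJob Reconcile function as the entry point to find out how training-operator actually works. Before that, we need to understand how controller-runtime works.
controller-runtime source code analysis
controller-runtime is a controller framework maintained by the community. Together with tools such as kubebuilder, it lets developers focus on implementing the Reconcile function. The figure below does not depict controller-runtime itself, but it is close. The Worker can be thought of as the reconciler, which takes reconcile.Request items off the work queue and processes them. Readonly refers to listers such as podLister and serviceLister.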
Controller
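The analysis below is based on controller-runtime v0.15.0; other versions may differ slightly. After downloading training-operator, run go mod tidy to fetch the dependencies. Open the project in VS Code, find the controller-related code, and Ctrl+click to jump into the source. My analysis of controller-runtime mainly follows the article operator:controller-runtime 原理之控制器, so the walkthrough is largely the same.
First, the definition of the Controller interface, in controller-runtime@v0.15.0/pkg/controller/controller.go. It consists of four methods: Reconcile (through the embedded reconcile.Reconciler), Watch, Start, and GetLogger. reconcile.Reconciler is itself an interface that declares a single method, Reconcile(), which the user implements.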
// Controller implements a Kubernetes API. A Controller manages a work queue fed reconcile.Requests
// from source.Sources. Work is performed through the reconcile.Reconciler for each enqueued item.
// Work typically is reads and writes Kubernetes objects to make the system state match the state specified
// in the object Spec.
type Controller interface {
// Reconciler is called to reconcile an object by Namespace/Name
reconcile.Reconciler
// Watch takes events provided by a Source and uses the EventHandler to
// enqueue reconcile.Requests in response to the events.
//
// Watch may be provided one or more Predicates to filter events before
// they are given to the EventHandler. Events will be passed to the
// EventHandler if all provided Predicates evaluate to true.
Watch(src source.Source, eventhandler handler.EventHandler, predicates ...predicate.Predicate) error
// Start starts the controller. Start blocks until the context is closed or a
// controller has an error starting.
Start(ctx context.Context) error
// GetLogger returns this controller logger prefilled with basic information.
GetLogger() logr.Logger
}
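Next, the concrete implementation, in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. MakeQueue constructs the rate-limited work queue Queue. The Do field is the reconciler, which eventually runs the user's Reconcile code. mu is a mutex that synchronizes controller setup (Watch and Start); it does not limit how many controllers run. Started records whether the controller has been started. startWatches holds the watchDescription objects registered before the controller starts; each watchDescription consists of three parts: src, handler, and predicates.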
// Controller implements controller.Controller.
type Controller struct {
// Name is used to uniquely identify a Controller in tracing, logging and monitoring. Name is required.
Name string
// MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1.
MaxConcurrentReconciles int
// Reconciler is a function that can be called at any time with the Name / Namespace of an object and
// ensures that the state of the system matches the state specified in the object.
// Defaults to the DefaultReconcileFunc.
Do reconcile.Reconciler
// MakeQueue constructs the queue for this controller once the controller is ready to start.
// This exists because the standard Kubernetes workqueues start themselves immediately, which
// leads to goroutine leaks if something calls controller.New repeatedly.
MakeQueue func() workqueue.RateLimitingInterface
// Queue is an listeningQueue that listens for events from Informers and adds object keys to
// the Queue for processing
Queue workqueue.RateLimitingInterface
// mu is used to synchronize Controller setup
mu sync.Mutex
// Started is true if the Controller has been Started
Started bool
// ctx is the context that was passed to Start() and used when starting watches.
//
// According to the docs, contexts should not be stored in a struct: https://golang.org/pkg/context,
// while we usually always strive to follow best practices, we consider this a legacy case and it should
// undergo a major refactoring and redesign to allow for context to not be stored in a struct.
ctx context.Context
// CacheSyncTimeout refers to the time limit set on waiting for cache to sync
// Defaults to 2 minutes if not set.
CacheSyncTimeout time.Duration
// startWatches maintains a list of sources, handlers, and predicates to start when the controller is started.
startWatches []watchDescription
// LogConstructor is used to construct a logger to then log messages to users during reconciliation,
// or for example when a watch is started.
// Note: LogConstructor has to be able to handle nil requests as we are also using it
// outside the context of a reconciliation.
LogConstructor func(request *reconcile.Request) logr.Logger
// RecoverPanic indicates whether the panic caused by reconcile should be recovered.
RecoverPanic *bool
// LeaderElected indicates whether the controller is leader elected or always running.
LeaderElected *bool
}
// watchDescription contains all the information necessary to start a watch.
type watchDescription struct {
src source.Source
handler handler.EventHandler
predicates []predicate.Predicate
}
Controller.Watch
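Here is the implementation of Watch, in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. If the controller has not been started yet, Watch only appends the watch to startWatches and returns; otherwise it calls Source.Start right away. What is src? It is the object we want to observe: we want to see create, update, and delete events on src and have the eventHandler process them. Note that Watch does not initialize the work queue Queue. The queue is created in Controller.Start, and src.Start only obtains an informer from the Cache and registers event handler functions on it, as described later.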
// Watch implements controller.Controller.
func (c *Controller) Watch(src source.Source, evthdler handler.EventHandler, prct ...predicate.Predicate) error {
c.mu.Lock()
defer c.mu.Unlock()
// Controller hasn't started yet, store the watches locally and return.
//
// These watches are going to be held on the controller struct until the manager or user calls Start(...).
if !c.Started {
c.startWatches = append(c.startWatches, watchDescription{src: src, handler: evthdler, predicates: prct})
return nil
}
c.LogConstructor(nil).Info("Starting EventSource", "source", src)
return src.Start(c.ctx, evthdler, c.Queue, prct...)
}
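The hand-off between Watch and Start can be sketched with the standard library alone. Everything below (source, namedSource, controller) is a hypothetical model written for this article, and the mutex is left out for brevity; it only shows that a watch registered before Start is stored and started later, while a watch registered after Start is started immediately.

```go
package main

import "fmt"

// source is a stand-in for source.Source: something that can be started.
type source interface {
	Start(queue chan<- string) error
}

// namedSource is a hypothetical source that only counts how often it was started.
type namedSource struct {
	name   string
	starts int
}

func (s *namedSource) Start(chan<- string) error {
	s.starts++
	return nil
}

// controller models only the fields involved in the Watch/Start hand-off.
type controller struct {
	started      bool
	queue        chan string
	startWatches []source
}

// Watch stores the source while the controller is not running and starts it
// immediately otherwise, like the branch on c.Started in the real Watch.
func (c *controller) Watch(src source) error {
	if !c.started {
		c.startWatches = append(c.startWatches, src)
		return nil
	}
	return src.Start(c.queue)
}

// Start creates the queue, starts every stored source, then drops the slice.
func (c *controller) Start() error {
	c.queue = make(chan string, 16)
	for _, src := range c.startWatches {
		if err := src.Start(c.queue); err != nil {
			return err
		}
	}
	c.startWatches = nil
	c.started = true
	return nil
}

func main() {
	c := &controller{}
	early, late := &namedSource{name: "tfjob"}, &namedSource{name: "pod"}
	_ = c.Watch(early)
	fmt.Println(early.starts) // 0
	_ = c.Start()
	_ = c.Watch(late)
	fmt.Println(early.starts, late.starts) // 1 1
}
```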
Source is also an interface, defined in controller-runtime@v0.15.0/pkg/source/source.go. It declares a single method, Start.
// Source is a source of events (e.g. Create, Update, Delete operations on Kubernetes Objects, Webhook callbacks, etc)
// which should be processed by event.EventHandlers to enqueue reconcile.Requests.
//
// * Use Kind for events originating in the cluster (e.g. Pod Create, Pod Update, Deployment Update).
//
// * Use Channel for events originating outside the cluster (e.g. GitHub Webhook callback, Polling external urls).
//
// Users may build their own Source implementations.
type Source interface {
// Start is internal and should be called only by the Controller to register an EventHandler with the Informer
// to enqueue reconcile.Requests.
Start(context.Context, handler.EventHandler, workqueue.RateLimitingInterface, ...predicate.Predicate) error
}
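Now let's see how training-operator calls Watch, in pkg/controller.v1/tensorflow/tfjob_controller.go. A few things to note:
- The first argument is the result of source.Kind(...), a Kind value that implements the Source interface. &kubeflowv1.TFJob{} is the resource we want to watch, and Kind wraps it.
- handler.EnqueueRequestForObject{} is a concrete implementation of the EventHandler interface, with methods such as Create, Delete, and Update; more on it later.
- predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()} is a user-supplied predicate that decides whether an event should be pushed onto the queue; more on this later as well.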
// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(source.Kind(mgr.GetCache(), &kubeflowv1.TFJob{}), &handler.EnqueueRequestForObject{},
predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()},
); err != nil {
return err
}
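Here is the implementation of Kind, in controller-runtime@v0.15.0/pkg/internal/source/kind.go. Type is the concrete type we want to watch. Kind wraps it together with a Cache that comes from the manager and provides the informer, which Kind needs in order to implement Start.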
// Kind is used to provide a source of events originating inside the cluster from Watches (e.g. Pod Create).
type Kind struct {
// Type is the type of object to watch. e.g. &v1.Pod{}
Type client.Object
// Cache used to watch APIs
Cache cache.Cache
// started may contain an error if one was encountered during startup. If its closed and does not
// contain an error, startup and syncing finished.
started chan error
startCancel func()
}
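From the analysis above, the controller ends up calling Kind.Start, either from Watch (if it is already running) or from Start. Kind.Start is shown below. The key call is i.AddEventHandler, which watches for resource changes and dispatches them to the handler. Anyone who has used informers will recognize AddEventHandler. Note that the informer invokes the add, update, and delete callbacks after the fact: by the time a callback runs, the change to the resource has already happened. i is an informer obtained from Kind.Cache.GetInformer.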
// Start is internal and should be called only by the Controller to register an EventHandler with the Informer
// to enqueue reconcile.Requests.
func (ks *Kind) Start(ctx context.Context, handler handler.EventHandler, queue workqueue.RateLimitingInterface,
prct ...predicate.Predicate) error {
// ...
// cache.GetInformer will block until its context is cancelled if the cache was already started and it can not
// sync that informer (most commonly due to RBAC issues).
ctx, ks.startCancel = context.WithCancel(ctx)
ks.started = make(chan error)
go func() {
var (
i cache.Informer
lastErr error
)
// ...
i, lastErr = ks.Cache.GetInformer(ctx, ks.Type)
// ...
_, err := i.AddEventHandler(NewEventHandler(ctx, queue, handler, prct).HandlerFuncs())
// ...
}()
return nil
}
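NewEventHandler constructs an EventHandler whose fields include an event-handling handler, a rate-limited work queue, and a list of predicates. Note that this EventHandler is a struct, while its handler field has the interface type handler.EventHandler. The two share a name but are different things (the struct adapts the interface to the informer callbacks), so do not confuse them. handler is the handler.EnqueueRequestForObject{} mentioned earlier, and predicates is the predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()} mentioned earlier; both are supplied by the user. OnAdd first runs the event through the predicates, and only if every predicate returns true does it call the handler's Create method, which pushes a reconcile.Request onto the queue. OnUpdate and OnDelete follow the same logic.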
// EventHandler adapts a handler.EventHandler interface to a cache.ResourceEventHandler interface.
type EventHandler struct {
// ctx stores the context that created the event handler
// that is used to propagate cancellation signals to each handler function.
ctx context.Context
handler handler.EventHandler
queue workqueue.RateLimitingInterface
predicates []predicate.Predicate
}
// HandlerFuncs converts EventHandler to a ResourceEventHandlerFuncs
// TODO: switch to ResourceEventHandlerDetailedFuncs with client-go 1.27
func (e *EventHandler) HandlerFuncs() cache.ResourceEventHandlerFuncs {
return cache.ResourceEventHandlerFuncs{
AddFunc: e.OnAdd,
UpdateFunc: e.OnUpdate,
DeleteFunc: e.OnDelete,
}
}
// OnAdd creates CreateEvent and calls Create on EventHandler.
func (e *EventHandler) OnAdd(obj interface{}) {
c := event.CreateEvent{}
// ...
for _, p := range e.predicates {
if !p.Create(c) {
return
}
}
// Invoke create handler
ctx, cancel := context.WithCancel(e.ctx)
defer cancel()
e.handler.Create(ctx, c, e.queue)
}
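The filter-then-enqueue flow of OnAdd can be modelled in a few lines. This is a stdlib-only sketch: createEvent, predicate, and onAdd are hypothetical stand-ins for event.CreateEvent, predicate.Predicate, and EventHandler.OnAdd, and the "queue" is just a slice.

```go
package main

import "fmt"

// createEvent is a stand-in for event.CreateEvent.
type createEvent struct {
	Namespace, Name string
}

// predicate is a stand-in for the Create method of predicate.Predicate.
type predicate func(createEvent) bool

// onAdd applies every predicate and enqueues the key only if all of them pass,
// which is the same short-circuit loop used by OnAdd.
func onAdd(e createEvent, predicates []predicate, enqueue func(key string)) {
	for _, p := range predicates {
		if !p(e) {
			return
		}
	}
	enqueue(e.Namespace + "/" + e.Name)
}

func main() {
	var queue []string
	enqueue := func(key string) { queue = append(queue, key) }

	// Hypothetical predicate: ignore objects in kube-system.
	notSystem := func(e createEvent) bool { return e.Namespace != "kube-system" }

	onAdd(createEvent{"default", "mnist"}, []predicate{notSystem}, enqueue)
	onAdd(createEvent{"kube-system", "dns"}, []predicate{notSystem}, enqueue)
	fmt.Println(queue) // [default/mnist]
}
```

Because the loop returns on the first predicate that says no, predicates combine with AND semantics, and an empty predicate list lets every event through.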
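Here is how EnqueueRequestForObject implements Create, in controller-runtime@v0.15.0/pkg/handler/enqueue.go. The logic is simple and matches the name: it pushes the namespace and name of the object itself onto the work queue as a reconcile.Request. Its counterpart is enqueueRequestForOwner, which instead pushes the namespace and name of the object's owner. Watches on pods and services need enqueueRequestForOwner, because pods and services carry an ownerReferences field that identifies their owner (the TFJob); the watch on the TFJob itself uses EnqueueRequestForObject. This is exactly how tfjob_controller.go uses them.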
// EnqueueRequestForObject enqueues a Request containing the Name and Namespace of the object that is the source of the Event.
// (e.g. the created / deleted / updated objects Name and Namespace). handler.EnqueueRequestForObject is used by almost all
// Controllers that have associated Resources (e.g. CRDs) to reconcile the associated Resource.
type EnqueueRequestForObject struct{}
// Create implements EventHandler.
func (e *EnqueueRequestForObject) Create(ctx context.Context, evt event.CreateEvent, q workqueue.RateLimitingInterface) {
if evt.Object == nil {
enqueueLog.Error(nil, "CreateEvent received with no metadata", "event", evt)
return
}
q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
Name: evt.Object.GetName(),
Namespace: evt.Object.GetNamespace(),
}})
}
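The difference between enqueuing the object and enqueuing its owner can be sketched as follows. The types and both functions are hypothetical simplifications written for this article: the real handler matches owners by API group and kind rather than by a bare kind string, and works with reconcile.Request values instead of string keys.

```go
package main

import "fmt"

// ownerRef and object are simplified stand-ins for metav1.OwnerReference and
// client.Object; only the fields needed for the illustration are kept.
type ownerRef struct {
	Kind string
	Name string
}

type object struct {
	Namespace string
	Name      string
	Owners    []ownerRef
}

// requestForObject returns the key of the object itself.
func requestForObject(o object) string {
	return o.Namespace + "/" + o.Name
}

// requestsForOwner returns one key per owner of the given kind. Owners are
// assumed to live in the same namespace as the owned object.
func requestsForOwner(o object, ownerKind string) []string {
	var keys []string
	for _, ref := range o.Owners {
		if ref.Kind == ownerKind {
			keys = append(keys, o.Namespace+"/"+ref.Name)
		}
	}
	return keys
}

func main() {
	pod := object{
		Namespace: "default",
		Name:      "mnist-worker-0",
		Owners:    []ownerRef{{Kind: "TFJob", Name: "mnist"}},
	}
	fmt.Println(requestForObject(pod))          // default/mnist-worker-0
	fmt.Println(requestsForOwner(pod, "TFJob")) // [default/mnist]
}
```

With the owner mapping, a change to any pod of a job lands in the queue under the job's key, so the job's Reconcile runs rather than a reconcile for the pod.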
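Now the predicate supplied by training-operator, predicate.Funcs{CreateFunc: r.onOwnerCreateFunc()}. The logic is straightforward: it always returns true, so create events are never filtered out. When the object is a TFJob, it additionally applies defaults, increments the created-jobs counter, and updates the job conditions. Because the informer notifies after the change has already happened, the condition is set to JobCreated.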
// onOwnerCreateFunc modify creation condition.
func (r *TFJobReconciler) onOwnerCreateFunc() func(event.CreateEvent) bool {
return func(e event.CreateEvent) bool {
tfJob, ok := e.Object.(*kubeflowv1.TFJob)
if !ok {
return true
}
r.Scheme.Default(tfJob)
msg := fmt.Sprintf("TFJob %s is created.", e.Object.GetName())
logrus.Info(msg)
trainingoperatorcommon.CreatedJobsCounterInc(tfJob.Namespace, r.GetFrameworkName())
commonutil.UpdateJobConditions(&tfJob.Status, kubeflowv1.JobCreated, corev1.ConditionTrue, commonutil.NewReason(kubeflowv1.TFJobKind, commonutil.JobCreatedReason), msg)
return true
}
}
Controller.Start
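That covers Watch; in one sentence, it registers event handlers on informers. Next comes Start, in controller-runtime@v0.15.0/pkg/internal/controller/controller.go. It takes the lock first and returns an error if the controller has already been started, so a controller cannot be started twice. It then uses MakeQueue to initialize Queue (which Watch did not do). For every entry in startWatches it calls src.Start; for watches registered before Start, this is the first time src.Start runs, since Watch only stored them. It then waits for the caches of the syncing sources to sync, bounded by CacheSyncTimeout. Finally it launches MaxConcurrentReconciles worker goroutines, each of which loops on processNextWorkItem to take reconcile.Request items off Queue and process them.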
// Start implements controller.Controller.
func (c *Controller) Start(ctx context.Context) error {
// use an IIFE to get proper lock handling
// but lock outside to get proper handling of the queue shutdown
c.mu.Lock()
if c.Started {
return errors.New("controller was started more than once. This is likely to be caused by being added to a manager multiple times")
}
c.initMetrics()
// Set the internal context.
c.ctx = ctx
c.Queue = c.MakeQueue()
go func() {
<-ctx.Done()
c.Queue.ShutDown()
}()
wg := &sync.WaitGroup{}
err := func() error {
defer c.mu.Unlock()
// TODO(pwittrock): Reconsider HandleCrash
defer utilruntime.HandleCrash()
// NB(directxman12): launch the sources *before* trying to wait for the
// caches to sync so that they have a chance to register their intendeded
// caches.
for _, watch := range c.startWatches {
c.LogConstructor(nil).Info("Starting EventSource", "source", fmt.Sprintf("%s", watch.src))
if err := watch.src.Start(ctx, watch.handler, c.Queue, watch.predicates...); err != nil {
return err
}
}
// Start the SharedIndexInformer factories to begin populating the SharedIndexInformer caches
c.LogConstructor(nil).Info("Starting Controller")
for _, watch := range c.startWatches {
syncingSource, ok := watch.src.(source.SyncingSource)
if !ok {
continue
}
if err := func() error {
// use a context with timeout for launching sources and syncing caches.
sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)
defer cancel()
// WaitForSync waits for a definitive timeout, and returns if there
// is an error or a timeout
if err := syncingSource.WaitForSync(sourceStartCtx); err != nil {
err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err)
c.LogConstructor(nil).Error(err, "Could not wait for Cache to sync")
return err
}
return nil
}(); err != nil {
return err
}
}
// All the watches have been started, we can reset the local slice.
//
// We should never hold watches more than necessary, each watch source can hold a backing cache,
// which won't be garbage collected if we hold a reference to it.
c.startWatches = nil
// Launch workers to process resources
c.LogConstructor(nil).Info("Starting workers", "worker count", c.MaxConcurrentReconciles)
wg.Add(c.MaxConcurrentReconciles)
for i := 0; i < c.MaxConcurrentReconciles; i++ {
go func() {
defer wg.Done()
// Run a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the reconcileHandler is never invoked concurrently with the same object.
for c.processNextWorkItem(ctx) {
}
}()
}
c.Started = true
return nil
}()
if err != nil {
return err
}
<-ctx.Done()
c.LogConstructor(nil).Info("Shutdown signal received, waiting for all workers to finish")
wg.Wait()
c.LogConstructor(nil).Info("All workers finished")
return nil
}
processNextWorkItem is shown below; it delegates to reconcileHandler.
// processNextWorkItem will read a single work item off the workqueue and
// attempt to process it, by calling the reconcileHandler.
func (c *Controller) processNextWorkItem(ctx context.Context) bool {
obj, shutdown := c.Queue.Get()
if shutdown {
// Stop working
return false
}
// We call Done here so the workqueue knows we have finished
// processing this item. We also must remember to call Forget if we
// do not want this work item being re-queued. For example, we do
// not call Forget if a transient error occurs, instead the item is
// put back on the workqueue and attempted again after a back-off
// period.
defer c.Queue.Done(obj)
ctrlmetrics.ActiveWorkers.WithLabelValues(c.Name).Add(1)
defer ctrlmetrics.ActiveWorkers.WithLabelValues(c.Name).Add(-1)
c.reconcileHandler(ctx, obj)
return true
}
reconcileHandler is shown below; it calls the controller's own Reconcile method.
func (c *Controller) reconcileHandler(ctx context.Context, obj interface{}) {
// Update metrics after processing each item
reconcileStartTS := time.Now()
defer func() {
c.updateMetrics(time.Since(reconcileStartTS))
}()
// Make sure that the object is a valid request.
req, ok := obj.(reconcile.Request)
// ...
log := c.LogConstructor(&req)
reconcileID := uuid.NewUUID()
log = log.WithValues("reconcileID", reconcileID)
ctx = logf.IntoContext(ctx, log)
ctx = addReconcileID(ctx, reconcileID)
// RunInformersAndControllers the syncHandler, passing it the Namespace/Name string of the
// resource to be synced.
result, err := c.Reconcile(ctx, req)
// ...
}
Finally, c.Do.Reconcile(ctx, req) is the Reconcile function written by the user.
// Reconcile implements reconcile.Reconciler.
func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (_ reconcile.Result, err error) {
defer func() {
if r := recover(); r != nil {
if c.RecoverPanic != nil && *c.RecoverPanic {
for _, fn := range utilruntime.PanicHandlers {
fn(r)
}
err = fmt.Errorf("panic: %v [recovered]", r)
return
}
log := logf.FromContext(ctx)
log.Info(fmt.Sprintf("Observed a panic in reconciler: %v", r))
panic(r)
}
}()
return c.Do.Reconcile(ctx, req)
}
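The wrapper above relies on a general Go pattern: a deferred function that calls recover() and writes to a named return value. A minimal stdlib-only sketch of that pattern follows; safeCall and its recoverPanic parameter are hypothetical names chosen to mirror the RecoverPanic option, not part of controller-runtime.

```go
package main

import (
	"errors"
	"fmt"
)

// safeCall runs fn and, when recoverPanic is true, converts a panic into an
// error through the named return value; otherwise the panic is re-raised.
func safeCall(recoverPanic bool, fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if recoverPanic {
				err = fmt.Errorf("panic: %v [recovered]", r)
				return
			}
			panic(r)
		}
	}()
	return fn()
}

func main() {
	err := safeCall(true, func() error { panic("boom") })
	fmt.Println(err) // panic: boom [recovered]

	err = safeCall(true, func() error { return errors.New("plain failure") })
	fmt.Println(err) // plain failure
}
```

The named return value is what makes this work: the deferred function runs after the panic unwinds the call, and assigning to err there changes what the caller receives.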