Kubernetes Taint-Based Eviction in Detail: Source Code Analysis

Code version: 1.17.4

1. startNodeLifecycleController

As the code shows, startNodeLifecycleController consists of two steps:

  • NewNodeLifecycleController

  • NodeLifecycleController.Run

func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {
  lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(
    ctx.InformerFactory.Coordination().V1().Leases(),
    ctx.InformerFactory.Core().V1().Pods(),
    ctx.InformerFactory.Core().V1().Nodes(),
    ctx.InformerFactory.Apps().V1().DaemonSets(),
    // node lifecycle controller uses existing cluster role from node-controller
    ctx.ClientBuilder.ClientOrDie("node-controller"),
    
    // the --node-monitor-period flag
    ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,

    // the --node-startup-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,

    // the --node-monitor-grace-period flag
    ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,

    // the --pod-eviction-timeout flag
    ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,

    // the --node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,

    // the --secondary-node-eviction-rate flag
    ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,

    // the --large-cluster-size-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,

    // the --unhealthy-zone-threshold flag
    ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,

    // the --enable-taint-manager flag (enabled by default)
    ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,

    // whether --feature-gates=TaintBasedEvictions=true is set (enabled by default)
    utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),
  )
  if err != nil {
    return nil, true, err
  }
  go lifecycleController.Run(ctx.Stop)
  return nil, true, nil
}

Parameter details

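  • enable-taint-manager: defaults to true. Enables handling of NoExecute taints; pods that do not tolerate such a taint are evicted from the tainted node.

  • large-cluster-size-threshold: defaults to 50. The node count used to decide whether the cluster counts as large. When the cluster has this many nodes or fewer, --secondary-node-eviction-rate is forced to 0.

  • secondary-node-eviction-rate: float32, defaults to 0.01. The secondary eviction rate: the number of nodes per second whose pods are evicted while a zone is unhealthy. When too many nodes in a zone are down, eviction slows to this rate.

  • node-eviction-rate: float32, defaults to 0.1. The number of nodes per second whose pods are evicted, enforced by a token bucket rate limiter. Note that this is a rate of nodes, not of pods: 0.1 means roughly one node is drained every 10s.

  • node-monitor-grace-period: duration, defaults to 40s. How long a running node may be unresponsive before it is considered unhealthy.

  • node-startup-grace-period: duration, defaults to 1m. How long a newly started node may be unresponsive before it is considered unhealthy.

  • pod-eviction-timeout: duration, defaults to 5m. How long to wait after a node becomes unhealthy before deleting the pods on it (only takes effect when the taint manager is not enabled).

  • unhealthy-zone-threshold: float32, defaults to 0.55 (55%). The fraction of unhealthy nodes in a zone at which the zone itself is considered unhealthy.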

2. NewNodeLifecycleController

2.1 The NodeLifecycleController struct

// Controller is the controller that manages node's life cycle.
type Controller struct {
  // taintManager watches node Taint/Toleration changes and is used to evict pods
  taintManager *scheduler.NoExecuteTaintManager
  
  // Pod lister and informer sync state
  podLister         corelisters.PodLister
  podInformerSynced cache.InformerSynced
  kubeClient        clientset.Interface

  // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this
  // to avoid the problem with time skew across the cluster.
  now func() metav1.Time
  
  // Returns the secondary-node-eviction-rate value depending on cluster size:
  // secondary-node-eviction-rate for a large cluster, 0 otherwise.
  enterPartialDisruptionFunc func(nodeNum int) float32

  // Returns evictionLimiterQPS.
  enterFullDisruptionFunc    func(nodeNum int) float32

  // Returns how many of the given nodes are NotReady, together with a ZoneState
  // value that tells whether the zone is healthy. Uses unhealthyZoneThreshold.
  computeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)

  // Map of known nodes.
  knownNodeSet map[string]*v1.Node

  // Map of node health information.
  // per Node map storing last observed health together with a local time when it was observed.
  nodeHealthMap *nodeHealthMap
  
  
  // evictorLock protects zonePodEvictor and zoneNoExecuteTainter.
  // TODO(#83954): API calls shouldn't be executed under the lock.
  evictorLock     sync.Mutex
  
  // Records whether the pods on each node have already been evicted; a node's
  // eviction status (evicted, toBeEvicted) is read from here.
  nodeEvictionMap *nodeEvictionMap

  // Per-zone queues of nodes whose pods need to be evicted.
  // workers that evicts pods from unresponsive nodes.
  zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue

  // Per-zone rate-limited (token bucket) queues of unready nodes whose taints
  // need to be updated.
  // workers that are responsible for tainting nodes.
  zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue

  // Retry list.
  nodesToRetry sync.Map

  // Health state of each zone: stateFullDisruption, statePartialDisruption, stateNormal or stateInitial.
  zoneStates map[string]ZoneState
  
  // DaemonSet lister and informer sync state
  daemonSetStore          appsv1listers.DaemonSetLister
  daemonSetInformerSynced cache.InformerSynced

  // Lease and node listers and informer sync state
  leaseLister         coordlisters.LeaseLister
  leaseInformerSynced cache.InformerSynced
  nodeLister          corelisters.NodeLister
  nodeInformerSynced  cache.InformerSynced

  getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)

  recorder record.EventRecorder

  // The parameters described earlier
  // Value controlling Controller monitoring period, i.e. how often does Controller
  // check node health signal posted from kubelet. This value should be lower than
  // nodeMonitorGracePeriod.
  // TODO: Change node health monitor to watch based.
  nodeMonitorPeriod time.Duration
  
  // When node is just created, e.g. cluster bootstrap or node creation, we give
  // a longer grace period.
  nodeStartupGracePeriod time.Duration

  // Controller will not proactively sync node health, but will monitor node
  // health signal updated from kubelet. There are 2 kinds of node healthiness
  // signals: NodeStatus and NodeLease. NodeLease signal is generated only when
  // NodeLease feature is enabled. If it doesn't receive update for this amount
  // of time, it will start posting "NodeReady==ConditionUnknown". The amount of
  // time before which Controller start evicting pods is controlled via flag
  // 'pod-eviction-timeout'.
  // Note: be cautious when changing the constant, it must work with
  // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
  // controller. The node health signal update frequency is the minimal of the
  // two.
  // There are several constraints:
  // 1. nodeMonitorGracePeriod must be N times more than  the node health signal
  //    update frequency, where N means number of retries allowed for kubelet to
  //    post node status/lease. It is pointless to make nodeMonitorGracePeriod
  //    be less than the node health signal update frequency, since there will
  //    only be fresh values from Kubelet at an interval of node health signal
  //    update frequency. The constant must be less than podEvictionTimeout.
  // 2. nodeMonitorGracePeriod can't be too large for user experience - larger
  //    value takes longer for user to see up-to-date node health.
  nodeMonitorGracePeriod time.Duration

  podEvictionTimeout          time.Duration
  evictionLimiterQPS          float32
  secondaryEvictionLimiterQPS float32
  largeClusterThreshold       int32
  unhealthyZoneThreshold      float32

  // if set to true Controller will start TaintManager that will evict Pods from
  // tainted nodes, if they're not tolerated.
  runTaintManager bool

  // if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'
  // taints instead of evicting Pods itself.
  useTaintBasedEvictions bool
  
  // Work queues for node and pod updates
  nodeUpdateQueue workqueue.Interface
  podUpdateQueue  workqueue.RateLimitingInterface
}
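
The zoneStates, computeZoneStateFunc and unhealthyZoneThreshold fields work together, so a small illustration may help. The following is a simplified, self-contained sketch and not the controller's implementation: it takes a plain []bool of Ready flags instead of []*v1.NodeCondition, the zoneState type and its string values are stand-ins, and the rule that at least three nodes must be NotReady before a zone counts as partially disrupted is an assumption based on the description of the --unhealthy-zone-threshold flag.

```go
package main

import "fmt"

// zoneState is a stand-in for the controller's ZoneState type.
type zoneState string

const (
	stateNormal            zoneState = "Normal"
	statePartialDisruption zoneState = "PartialDisruption"
	stateFullDisruption    zoneState = "FullDisruption"
)

// computeZoneState takes one Ready flag per node in a zone and returns the
// number of NotReady nodes together with the resulting zone state.
func computeZoneState(ready []bool, unhealthyZoneThreshold float32) (int, zoneState) {
	readyNodes, notReadyNodes := 0, 0
	for _, r := range ready {
		if r {
			readyNodes++
		} else {
			notReadyNodes++
		}
	}
	switch {
	case readyNodes == 0 && notReadyNodes > 0:
		// No node in the zone is Ready.
		return notReadyNodes, stateFullDisruption
	case notReadyNodes > 2 &&
		float32(notReadyNodes)/float32(readyNodes+notReadyNodes) >= unhealthyZoneThreshold:
		// Assumed rule: at least 3 NotReady nodes and the fraction at or
		// above the threshold.
		return notReadyNodes, statePartialDisruption
	default:
		return notReadyNodes, stateNormal
	}
}

func main() {
	zones := map[string][]bool{
		"zone-a": {true, true, true, false},
		"zone-b": {false, false, false, true, true},
		"zone-c": {false, false},
	}
	for _, name := range []string{"zone-a", "zone-b", "zone-c"} {
		n, state := computeZoneState(zones[name], 0.55)
		fmt.Printf("%s: %d NotReady, state %s\n", name, n, state)
	}
}
```

The state returned here is what would decide which of the two rate functions applies: enterPartialDisruptionFunc for a partially disrupted zone, enterFullDisruptionFunc for a fully disrupted one.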

2.2 NewNodeLifecycleController

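The core logic is as follows:

(1) Initialize the Controller from the arguments.

(2) Define the pod event handlers. Each of them first calls nc.podUpdated; if enable-taint-manager=true, the event is also passed to nc.taintManager.PodUpdated.

(3) Implement the function that finds all pods on a node.

(4) If enable-taint-manager=true, every node change is passed to nc.taintManager.NodeUpdated.

(5) Define the node event handlers. Nodes are watched whether or not the taint manager is enabled.

(6) Set up the node, ds and lease listers used to fetch those objects.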
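
Step (2) above can be sketched with plain Go types. This is only an illustration of the handler chaining under stated assumptions: the pod and controller types, the calls slice and the onPodEvent and taintManagerPodUpdated names are hypothetical stand-ins, and the real controller registers its handlers on a pod informer rather than calling them directly.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for *v1.Pod.
type pod struct {
	name     string
	nodeName string
}

// controller is a hypothetical, trimmed-down stand-in for the real Controller.
type controller struct {
	runTaintManager bool
	calls           []string // records which handlers ran, for illustration only
}

// podUpdated stands in for nc.podUpdated.
func (c *controller) podUpdated(oldPod, newPod *pod) {
	c.calls = append(c.calls, "podUpdated")
}

// taintManagerPodUpdated stands in for nc.taintManager.PodUpdated.
func (c *controller) taintManagerPodUpdated(oldPod, newPod *pod) {
	c.calls = append(c.calls, "taintManager.PodUpdated")
}

// onPodEvent mirrors step (2): podUpdated always runs, and the taint manager
// is notified only when it is enabled.
func (c *controller) onPodEvent(oldPod, newPod *pod) {
	c.podUpdated(oldPod, newPod)
	if c.runTaintManager {
		c.taintManagerPodUpdated(oldPod, newPod)
	}
}

func main() {
	for _, enabled := range []bool{false, true} {
		c := &controller{runTaintManager: enabled}
		c.onPodEvent(nil, &pod{name: "web-0", nodeName: "node-1"})
		fmt.Printf("runTaintManager=%v -> %v\n", enabled, c.calls)
	}
}
```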

// NewNodeLifecycleController returns a new taint controller.
func NewNodeLifecycleController(
  leaseInformer coordinformers.LeaseInformer,
  podInformer coreinformers.PodInformer,
  nodeInformer coreinformers.NodeInformer,
  daemonSetInformer appsv1informers.DaemonSetInformer,
  kubeClient clientset.Interface,
  nodeMonitorPeriod time.Duration,
  nodeStartupGracePeriod time.Duration,
  nodeMonitorGracePeriod time.Duration,
  podEvictionTimeout time.Duration,
  evictionLimiterQPS float32,
  secondaryEvictionLimiterQPS float32,
  largeClusterThreshold int32,
  unhealthyZoneThreshold float32,
  runTaintManager bool,
  useTaintBasedEvictions bool,
) (*Controller, error) {

  // 1. Initialize the Controller from the arguments
  nc := &Controller{
    // code omitted
    // ....
  }
  
  if useTaintBasedEvictions {
    klog.Infof("Controller is using taint based evictions.")
  }
  nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc
  nc.enterFullDisruptionFunc = nc.HealthyQPSFunc
  nc.computeZoneStateF