Kubernetes Eviction Manager Source Code Analysis

http://blog.csdn.net/WaltonWang/article/details/56329109

Abstract: This article is the follow-up to my analysis of the Kubernetes Eviction Manager's working mechanism; here I walk through the source code to explain how that mechanism is implemented.

Introduction and Working Principle of the Kubernetes Eviction Manager

For this background, please see my previous post: Analysis of the Kubernetes Eviction Manager Working Mechanism.

Kubernetes Eviction Manager Source Code Analysis

Where the Kubernetes Eviction Manager Is Started

When the kubelet instantiates its Kubelet object, it calls eviction.NewManager to create an evictionManager.

pkg/kubelet/kubelet.go:273
func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *KubeletDeps, standaloneMode bool) (*Kubelet, error) {

    ...

    thresholds, err := eviction.ParseThresholdConfig(kubeCfg.EvictionHard, kubeCfg.EvictionSoft, kubeCfg.EvictionSoftGracePeriod, kubeCfg.EvictionMinimumReclaim)
    if err != nil {
        return nil, err
    }
    evictionConfig := eviction.Config{
        PressureTransitionPeriod: kubeCfg.EvictionPressureTransitionPeriod.Duration,
        MaxPodGracePeriodSeconds: int64(kubeCfg.EvictionMaxPodGracePeriod),
        Thresholds:               thresholds,
        KernelMemcgNotification:  kubeCfg.ExperimentalKernelMemcgNotification,
    }
    ...

    // setup eviction manager
    evictionManager, evictionAdmitHandler, err := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, kubeDeps.Recorder, nodeRef, klet.clock)

    if err != nil {
        return nil, fmt.Errorf("failed to initialize eviction manager: %v", err)
    }
    klet.evictionManager = evictionManager
    klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)
    ...
}

When the kubelet starts working via its Run method, it launches a goroutine that runs updateRuntimeUp every 5s. Once updateRuntimeUp has confirmed that the container runtime is up, it calls initializeRuntimeDependentModules to initialize the modules that depend on the runtime.

pkg/kubelet/kubelet.go:1219
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
    ...
    go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)
    ...
}


pkg/kubelet/kubelet.go:2040
func (kl *Kubelet) updateRuntimeUp() {
    ...

    kl.oneTimeInitializer.Do(kl.initializeRuntimeDependentModules)

    ...
}
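
The oneTimeInitializer used above is a sync.Once, which is why initializeRuntimeDependentModules runs exactly once even though updateRuntimeUp fires every 5s. A minimal, self-contained illustration of that pattern (plain Go, not kubelet code):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var once sync.Once
    for i := 0; i < 3; i++ {
        // Do runs its callback at most once, no matter how many times
        // it is invoked -- later calls are no-ops.
        once.Do(func() { fmt.Println("initialize runtime-dependent modules") })
    }
}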

Following initializeRuntimeDependentModules, we can see that the runtime-dependent modules include cAdvisor and the evictionManager; initialization simply calls their respective Start methods.

pkg/kubelet/kubelet.go:1206
func (kl *Kubelet) initializeRuntimeDependentModules() {
    if err := kl.cadvisor.Start(); err != nil {
        // Fail kubelet and rely on the babysitter to retry starting kubelet.
        // TODO(random-liu): Add backoff logic in the babysitter
        glog.Fatalf("Failed to start cAdvisor %v", err)
    }
    // eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs
    if err := kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod); err != nil {
        kl.runtimeState.setInternalError(fmt.Errorf("failed to start eviction manager %v", err))
    }
}

So from here on, we are into the evictionManager itself.

Definition of the Kubernetes Eviction Manager

As shown above, the kubelet starts the evictionManager while initializing its runtime-dependent modules during startup. But before going there, we first need to see how the Eviction Manager is defined.

pkg/kubelet/eviction/eviction_manager.go:40
// managerImpl implements Manager
type managerImpl struct {
    //  used to track time
    clock clock.Clock
    // config is how the manager is configured
    config Config
    // the function to invoke to kill a pod
    killPodFunc KillPodFunc
    // the interface that knows how to do image gc
    imageGC ImageGC
    // protects access to internal state
    sync.RWMutex
    // node conditions are the set of conditions present
    nodeConditions []v1.NodeConditionType
    // captures when a node condition was last observed based on a threshold being met
    nodeConditionsLastObservedAt nodeConditionsObservedAt
    // nodeRef is a reference to the node
    nodeRef *v1.ObjectReference
    // used to record events about the node
    recorder record.EventRecorder
    // used to measure usage stats on system
    summaryProvider stats.SummaryProvider
    // records when a threshold was first observed
    thresholdsFirstObservedAt thresholdsObservedAt
    // records the set of thresholds that have been met (including graceperiod) but not yet resolved
    thresholdsMet []Threshold
    // resourceToRankFunc maps a resource to ranking function for that resource.
    resourceToRankFunc map[v1.ResourceName]rankFunc
    // resourceToNodeReclaimFuncs maps a resource to an ordered list of functions that know how to reclaim that resource.
    resourceToNodeReclaimFuncs map[v1.ResourceName]nodeReclaimFuncs
    // last observations from synchronize
    lastObservations signalObservations
    // notifiersInitialized indicates if the threshold notifiers have been initialized (i.e. synchronize() has been called once)
    notifiersInitialized bool
}

managerImpl is the concrete type behind evictionManager. The fields worth highlighting:

  • config - the evictionManager configuration (see the sketch after this list), covering:

    • PressureTransitionPeriod (--eviction-pressure-transition-period)
    • MaxPodGracePeriodSeconds (--eviction-max-pod-grace-period)
    • Thresholds (--eviction-hard, --eviction-soft)
    • KernelMemcgNotification (--experimental-kernel-memcg-notification)
  • killPodFunc - the function invoked to kill a pod during eviction; the kubelet passes killPodNow when calling NewManager (pkg/kubelet/pod_workers.go:285).
  • imageGC - when the node reports a disk-pressure condition, imageGC deletes unused images to reclaim disk space.
  • summaryProvider - provides the latest aggregated status for the node and all pods on it, i.e. NodeStats and []PodStats.
  • thresholdsFirstObservedAt - records when each threshold was first observed.
  • thresholdsMet - the thresholds that have been triggered but not yet resolved, including those still waiting out their grace period.
  • resourceToRankFunc - maps each resource to the ranking function used when selecting pods to evict for that resource.
  • resourceToNodeReclaimFuncs - maps each resource to the ordered list of functions invoked to reclaim it at node level.
  • lastObservations - the eviction-signal observations from the previous synchronize pass, used to ensure thresholds are only re-evaluated against newer stats.
  • notifiersInitialized - whether the threshold notifiers have been initialized (i.e. synchronize() has run once), which determines whether kernel memcg notifications can be used to speed up eviction response. It is false when the manager is created; whether memcg notifications are used at all depends entirely on the kubelet's --experimental-kernel-memcg-notification flag.
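
To make the flag-to-Config mapping concrete, here is a minimal sketch mirroring the NewMainKubelet snippet shown earlier; the flag values are illustrative examples, not recommended defaults:

// Parse the four eviction flag strings into []Threshold.
thresholds, err := eviction.ParseThresholdConfig(
    "memory.available<100Mi", // --eviction-hard
    "memory.available<300Mi", // --eviction-soft
    "memory.available=30s",   // --eviction-soft-grace-period
    "memory.available=0Mi",   // --eviction-minimum-reclaim
)
if err != nil {
    return nil, err
}
evictionConfig := eviction.Config{
    PressureTransitionPeriod: 5 * time.Minute, // --eviction-pressure-transition-period
    MaxPodGracePeriodSeconds: 30,              // --eviction-max-pod-grace-period
    Thresholds:               thresholds,
    KernelMemcgNotification:  false,           // --experimental-kernel-memcg-notification
}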

The kubelet creates the evictionManager by calling eviction.NewManager from NewMainKubelet. NewManager itself is trivial; it just assigns fields.

pkg/kubelet/eviction/eviction_manager.go:79
// NewManager returns a configured Manager and an associated admission handler to enforce eviction configuration.
func NewManager(
    summaryProvider stats.SummaryProvider,
    config Config,
    killPodFunc KillPodFunc,
    imageGC ImageGC,
    recorder record.EventRecorder,
    nodeRef *v1.ObjectReference,
    clock clock.Clock) (Manager, lifecycle.PodAdmitHandler, error) {
    manager := &managerImpl{
        clock:           clock,
        killPodFunc:     killPodFunc,
        imageGC:         imageGC,
        config:          config,
        recorder:        recorder,
        summaryProvider: summaryProvider,
        nodeRef:         nodeRef,
        nodeConditionsLastObservedAt: nodeConditionsObservedAt{},
        thresholdsFirstObservedAt:    thresholdsObservedAt{},
    }
    return manager, manager, nil
}

There is one important detail, though: NewManager returns not just the evictionManager but also a lifecycle.PodAdmitHandler, evictionAdmitHandler. As the return statement shows, these are actually the same managerImpl instance exposed through two different interfaces, not two separate objects. evictionAdmitHandler is used by the kubelet to run an admission check before creating a pod; only if the check passes does pod creation proceed. The check is the Admit(attrs *lifecycle.PodAdmitAttributes) method, shown below:

pkg/kubelet/eviction/eviction_manager.go:102
// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    m.RLock()
    defer m.RUnlock()
    if len(m.nodeConditions) == 0 {
        return lifecycle.PodAdmitResult{Admit: true}
    }

    // the node has memory pressure, admit if not best-effort
    if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
        notBestEffort := qos.BestEffort != qos.GetPodQOS(attrs.Pod)
        if notBestEffort || kubepod.IsCriticalPod(attrs.Pod) {
            return lifecycle.PodAdmitResult{Admit: true}
        }
    }

    // reject pods when under memory pressure (if pod is best effort), or if under disk pressure.
    glog.Warningf("Failed to admit pod %v - %s", format.Pod(attrs.Pod), "node has conditions: %v", m.nodeConditions)
    return lifecycle.PodAdmitResult{
        Admit:   false,
        Reason:  reason,
        Message: fmt.Sprintf(message, m.nodeConditions),
    }
}

This admission logic is exactly the EvictionManager's effect on pod scheduling described in the Scheduler section of my working-mechanism post:

The kubelet periodically reports node conditions to the kube-apiserver, where they are persisted in etcd. Once the kube-scheduler observes a node pressure condition, it blocks further pods from binding to that node according to the following policy (a sketch of the BestEffort test follows the table):

Node Condition  | Scheduler Behavior
MemoryPressure  | No new BestEffort pods are scheduled to the node.
DiskPressure    | No new pods are scheduled to the node.
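
The memory-pressure carve-out above hinges on the pod's QoS class. Here is a simplified sketch of the BestEffort test; the real logic lives in pkg/kubelet/qos, and this pared-down version (isBestEffortSketch is a hypothetical helper) ignores init containers:

// A pod is BestEffort when no container sets any resource requests or
// limits; under memory pressure, only such pods are rejected by Admit.
func isBestEffortSketch(pod *v1.Pod) bool {
    for _, c := range pod.Spec.Containers {
        if len(c.Resources.Requests) != 0 || len(c.Resources.Limits) != 0 {
            return false
        }
    }
    return true
}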

The killPodNow code is analyzed later on.

With that, this section has answered what the evictionManager is and where it comes from. Next, let's look at how it is started.

Starting the Kubernetes Eviction Manager

As analyzed above, the kubelet starts the evictionManager while initializing its runtime-dependent modules (kl.evictionManager.Start(kl, kl.getActivePods, evictionMonitoringPeriod)). So let's look at the Start method first:

pkg/kubelet/eviction/eviction_manager.go:126
// Start starts the control loop to observe and respond to low compute resources.
func (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, monitoringInterval time.Duration) error {
    // start the eviction manager monitoring
    go wait.Until(func() { m.synchronize(diskInfoProvider, podFunc) }, monitoringInterval, wait.NeverStop)
    return nil
}

Very simple: it launches a goroutine that runs m.synchronize, waits monitoringInterval (evictionMonitoringPeriod, 10s) after each pass completes, then runs it again, indefinitely.
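
The wait.Until helper deserves a quick look on its own. A minimal, runnable sketch of the pattern (the import path shown is the current apimachinery location; in the 1.5-era tree it was k8s.io/kubernetes/pkg/util/wait):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    stop := make(chan struct{})
    // Run the function, sleep the period after it returns, and repeat
    // until stop is closed -- the same loop Start sets up for m.synchronize.
    go wait.Until(func() {
        fmt.Println("synchronize pass at", time.Now().Format(time.RFC3339))
    }, 10*time.Second, stop)
    time.Sleep(25 * time.Second) // let a couple of passes run
    close(stop)
}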

Next comes the evictionManager's core workflow:

pkg/kubelet/eviction/eviction_manager.go:181
// synchronize is the main control loop that enforces eviction thresholds.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) {
    // if we have nothing to do, just return
    thresholds := m.config.Thresholds
    if len(thresholds) == 0 {
        return
    }

    // build the ranking functions (if not yet known)
    if len(m.resourceToRankFunc) == 0 || len(m.resourceToNodeReclaimFuncs) == 0 {
        // this may error if cadvisor has yet to complete housekeeping, so we will just try again in next pass.
        hasDedicatedImageFs, err := diskInfoProvider.HasDedicatedImageFs()
        if err != nil {
            return
        }
        m.resourceToRankFunc = buildResourceToRankFunc(hasDedicatedImageFs)
        m.resourceToNodeReclaimFuncs = buildResourceToNodeReclaimFuncs(m.imageGC, hasDedicatedImageFs)
    }

    // make observations and get a function to derive pod usage stats relative to those observations.
    observations, statsFunc, err := makeSignalObservations(m.summaryProvider)
    if err != nil {
        glog.Errorf("eviction manager: unexpected err: %v", err)
        return
    }

    // attempt to create a threshold notifier to improve eviction response time
    if m.config.KernelMemcgNotification && !m.notifiersInitialized {
        glog.Infof("eviction manager attempting to integrate with kernel memcg notification api")
        m.notifiersInitialized = true
        // start soft memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, false, func(desc string) {
            glog.Infof("soft memory eviction threshold crossed at %s", desc)
            // TODO wait grace period for soft memory limit
            m.synchronize(diskInfoProvider, podFunc)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create hard memory threshold notifier: %v", err)
        }
        // start hard memory notification
        err = startMemoryThresholdNotifier(m.config.Thresholds, observations, true, func(desc string) {
            glog.Infof("hard memory eviction threshold crossed at %s", desc)
            m.synchronize(diskInfoProvider, podFunc)
        })
        if err != nil {
            glog.Warningf("eviction manager: failed to create soft memory threshold notifier: %v", err)
        }
    }

    // determine the set of thresholds met independent of grace period
    thresholds = thresholdsMet(thresholds, observations, false)

    // determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim
    if len(m.thresholdsMet) > 0 {
        thresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)
        thresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)
    }

    // determine the set of thresholds whose stats have been updated since the last sync
    thresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)

    // track when a threshold was first observed
    now := m.clock.Now()
    thresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)

    // the set of node conditions that are triggered by currently observed thresholds
    nodeConditions := nodeConditions(thresholds)

    // track when a node condition was last observed
    nodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)

    // node conditions report true if it has been observed within the transition period window
    nodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)

    // determine the set of thresholds we need to drive eviction behavior (i.e. all grace periods are met)
    thresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)

    // update internal state
    m.Lock()
    m.nodeConditions = nodeConditions
    m.thresholdsFirstObservedAt = thresholdsFirstObservedAt
    m.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt
    m.thresholdsMet = thresholds
    m.lastObservations = observations
    m.Unlock()

    // determine the set of resources under starvation
    starvedResources := getStarvedResources(thresholds)
    if len(starvedResources) == 0 {
        glog.V(3).Infof("eviction manager: no resources are starved")
        return
    }

    // rank the resources to reclaim by eviction priority
    sort.Sort(byEvictionPriority(starvedResources))
    resourceToReclaim := starvedResources[0]
    glog.Warningf("eviction manager: attempting to reclaim %v", resourceToReclaim)

    // determine if this is a soft or hard eviction associated with the resource
    softEviction := isSoftEvictionThresholds(thresholds, resourceToReclaim)

    // record an event about the resources we are now attempting to reclaim via eviction
    m.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, "EvictionThresholdMet", "Attempting to reclaim %s", resourceToReclaim)

    // check if there are node-level resources we can reclaim to reduce pressure before evicting end-user pods.
    if m.reclaimNodeLevelResources(resourceToReclaim, observations) {
        glog.Infof("eviction manager: able to reduce %v pressure without evicting pods.", resourceToReclaim)
        return
    }

    glog.Infof("eviction manager: must evict pod(s) to reclaim %v", resourceToReclaim)

    // rank the pods for eviction
    rank, ok := m.resourceToRankFunc[resourceToReclaim]
    if !ok {
        glog.Errorf("eviction manager: no ranking function for resource %s", resourceToReclaim)
        return
    }

    // the only candidates viable for eviction are those pods that had anything running.
    activePods := podFunc()
    if len(activePods) == 0 {
        glog.Errorf("eviction manager: eviction thresholds have been met, but no pods are active to evict")
        return
    }

    // rank the running pods for eviction for the specified resource
    rank(activePods, statsFunc)

    glog.Infof("eviction manager: pods ranked for eviction: %s", format.Pods(activePods))

    // we kill at most a single pod during each eviction interval
    for i := range activePods {
        pod := activePods[i]
        status := v1.PodStatus{
            Phase:   v1.PodFailed,
            Message: fmt.Sprintf(message, resourceToReclaim),
            Reason:  reason,
        }
        // record that we are evicting the pod
        m.recorder.Eventf(pod, v1.EventTypeWarning, reason, fmt.Sprintf(message, resourceToReclaim))
        gracePeriodOverride := int64(0)
        if softEviction {
            gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
        }
        // this is a blocking call and should only return when the pod and its containers are killed.
        err := m.killPodFunc(pod, status, &gracePeriodOverride)
        if err != nil {
            glog.Infof("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
            continue
        }
        // success, so we return until the next housekeeping interval
        glog.Infof("eviction manager: pod %s evicted successfully", format.Pod(pod))
        return
    }
    glog.Infof("eviction manager: unable to evict any pods from the node")
}

The code is tidy and well commented. The key steps:

  • Register, via buildResourceToRankFunc and buildResourceToNodeReclaimFuncs, the per-resource ranking functions used when selecting pods to evict and the per-resource reclaim functions for node-level resources.
  • Call makeSignalObservations to obtain the eviction-signal observations from cAdvisor, plus a StatsFunc for pods (needed later when ranking them).
  • If the kubelet was started with --experimental-kernel-memcg-notification=true, start soft and hard memory notifiers via startMemoryThresholdNotifier; when system usage crosses a soft or hard memory threshold, the kernel notifies the kubelet immediately and evictionManager.synchronize is triggered, improving eviction responsiveness.
  • From the observations computed from cAdvisor data and the configured thresholds, compute the thresholds met in this pass via thresholdsMet (see the sketch after this list).
  • Again via thresholdsMet, compute the previously recorded but not yet resolved thresholds from the observations and m.thresholdsMet, and merge them with the thresholds from the previous step.
  • Filter the thresholds by comparing signal timestamps in observations against those in lastObservations, keeping only those whose stats have been updated since the last pass.
  • Update thresholdsFirstObservedAt and nodeConditions.
  • Keep only the thresholds whose grace period has already elapsed between their first observed time and now.
  • Update the manager's internal state: nodeConditions, thresholdsFirstObservedAt, nodeConditionsLastObservedAt, thresholdsMet, and lastObservations.
  • Derive starvedResources from the thresholds and sort them; memory, if present, always sorts first.
  • Take the first starved resource and call reclaimNodeLevelResources to reclaim node-level resources of that kind. If afterwards available satisfies thresholdValue + evictionMinimumReclaim, the pass ends here without evicting user pods.
  • If node-level reclaim is not enough, move on to evicting user pods: rank all active pods with the function registered earlier by buildResourceToRankFunc.
  • Kill the ranked pods in order via killPodNow. If killing a pod fails, skip it and try the next one in the ranking; as soon as one pod is killed successfully the pass returns, so at most one pod is evicted per synchronize pass.
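
To make the met / not-yet-resolved distinction concrete, here is a simplified, self-contained sketch of the thresholdsMet idea; the types are pared-down stand-ins, not the upstream ones:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

// threshold is a hypothetical stand-in for eviction.Threshold.
type threshold struct {
    signal     string
    value      resource.Quantity
    minReclaim *resource.Quantity
}

// A threshold is "met" when observed available drops below its value; with
// enforceMinReclaim, min-reclaim is added on top, so a met threshold only
// resolves once available climbs back above value + min-reclaim.
func thresholdsMetSketch(ts []threshold, available map[string]resource.Quantity, enforceMinReclaim bool) []threshold {
    var met []threshold
    for _, t := range ts {
        avail, ok := available[t.signal]
        if !ok {
            continue
        }
        target := t.value.DeepCopy()
        if enforceMinReclaim && t.minReclaim != nil {
            target.Add(*t.minReclaim)
        }
        if avail.Cmp(target) < 0 { // available < value (+ min-reclaim)
            met = append(met, t)
        }
    }
    return met
}

func main() {
    min := resource.MustParse("50Mi")
    ts := []threshold{{signal: "memory.available", value: resource.MustParse("100Mi"), minReclaim: &min}}
    obs := map[string]resource.Quantity{"memory.available": resource.MustParse("120Mi")}
    fmt.Println(len(thresholdsMetSketch(ts, obs, false)) > 0) // false: 120Mi >= 100Mi
    fmt.Println(len(thresholdsMetSketch(ts, obs, true)) > 0)  // true: 120Mi < 100Mi + 50Mi
}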

The two most important steps in this flow are reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).

pkg/kubelet/eviction/eviction_manager.go:340
// reclaimNodeLevelResources attempts to reclaim node level resources.  returns true if thresholds were satisfied and no pod eviction is required.
func (m *managerImpl) reclaimNodeLevelResources(resourceToReclaim v1.ResourceName, observations signalObservations) bool {
    nodeReclaimFuncs := m.resourceToNodeReclaimFuncs[resourceToReclaim]
    for _, nodeReclaimFunc := range nodeReclaimFuncs {
        // attempt to reclaim the pressured resource.
        reclaimed, err := nodeReclaimFunc()
        if err == nil {
            // update our local observations based on the amount reported to have been reclaimed.
            // note: this is optimistic, other things could have been still consuming the pressured resource in the interim.
            signal := resourceToSignal[resourceToReclaim]
            value, ok := observations[signal]
            if !ok {
                glog.Errorf("eviction manager: unable to find value associated with signal %v", signal)
                continue
            }
            value.available.Add(*reclaimed)

            // evaluate all current thresholds to see if with adjusted observations, we think we have met min reclaim goals
            if len(thresholdsMet(m.thresholdsMet, observations, true)) == 0 {
                return true
            }
        } else {
            glog.Errorf("eviction manager: unexpected error when attempting to reduce %v pressure: %v", resourceToReclaim, err)
        }
    }
    return false
}


pkg/kubelet/pod_workers.go:283
// killPodNow returns a KillPodFunc that can be used to kill a pod.
// It is intended to be injected into other modules that need to kill a pod.
func killPodNow(podWorkers PodWorkers, recorder record.EventRecorder) eviction.KillPodFunc {
    return func(pod *v1.Pod, status v1.PodStatus, gracePeriodOverride *int64) error {
        // determine the grace period to use when killing the pod
        gracePeriod := int64(0)
        if gracePeriodOverride != nil {
            gracePeriod = *gracePeriodOverride
        } else if pod.Spec.TerminationGracePeriodSeconds != nil {
            gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
        }

        // we timeout and return an error if we don't get a callback within a reasonable time.
        // the default timeout is relative to the grace period (we settle on 2s to wait for kubelet->runtime traffic to complete in sigkill)
        timeout := int64(gracePeriod + (gracePeriod / 2))
        minTimeout := int64(2)
        if timeout < minTimeout {
            timeout = minTimeout
        }
        timeoutDuration := time.Duration(timeout) * time.Second

        // open a channel we block against until we get a result
        type response struct {
            err error
        }
        ch := make(chan response)
        podWorkers.UpdatePod(&UpdatePodOptions{
            Pod:        pod,
            UpdateType: kubetypes.SyncPodKill,
            OnCompleteFunc: func(err error) {
                ch <- response{err: err}
            },
            KillPodOptions: &KillPodOptions{
                PodStatusFunc: func(p *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
                    return status
                },
                PodTerminationGracePeriodSecondsOverride: gracePeriodOverride,
            },
        })

        // wait for either a response, or a timeout
        select {
        case r := <-ch:
            return r.err
        case <-time.After(timeoutDuration):
            recorder.Eventf(pod, v1.EventTypeWarning, events.ExceededGracePeriod, "Container runtime did not kill the pod within specified grace period.")
            return fmt.Errorf("timeout waiting to kill pod")
        }
    }
}
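
The timeout arithmetic and the blocking pattern in killPodNow are worth isolating. A hedged sketch, where doKill is a hypothetical stand-in for podWorkers.UpdatePod plus its OnCompleteFunc callback:

package main

import (
    "fmt"
    "time"
)

func doKill() error { // stand-in for podWorkers.UpdatePod + OnCompleteFunc
    time.Sleep(1 * time.Second)
    return nil
}

func main() {
    // Timeout is 1.5x the grace period, with a 2s floor: 30s -> 45s.
    gracePeriod := int64(30)
    timeout := gracePeriod + gracePeriod/2
    if timeout < 2 {
        timeout = 2
    }

    ch := make(chan error, 1)
    go func() { ch <- doKill() }()

    // Block until either the kill completes or the timeout fires.
    select {
    case err := <-ch:
        fmt.Println("pod killed, err =", err)
    case <-time.After(time.Duration(timeout) * time.Second):
        fmt.Println("timeout waiting to kill pod")
    }
}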

With that, the main flow of the evictionManager has been covered.

Summary

  • The kubelet creates the evictionManager in NewMainKubelet.
  • The kubelet starts the evictionManager while initializing its runtime-dependent modules during startup.
  • The two key steps in the EvictionManager workflow are reclaiming node-level resources (reclaimNodeLevelResources) and evicting user pods (killPodNow).
  • Each eviction pass kills at most one pod: if killing a pod fails, the next pod in the ranked order is tried, until a kill succeeds or every pod on the node has been tried.
  • Each synchronize call completes one eviction pass, and the loop runs again 10s later.
  • If --experimental-kernel-memcg-notification is set to true, kernel memcg notifications are used: as soon as system usage crosses a soft or hard memory threshold, the kubelet is notified immediately and evictionManager.synchronize is triggered, improving eviction responsiveness.