💡 Notes on how the kubelet garbage-collects images and evicts pods.
1. Rules for deleting images on imagefs
1.1 Official documentation
1.1.1 Parameters
imageMinimumGCAge
meta/v1.Duration imageMinimumGCAge is the minimum age for an unused image before it is garbage collected. Dynamic Kubelet Config (beta): If dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: "2m"
imageGCHighThresholdPercent
int32 imageGCHighThresholdPercent is the percent of disk usage after which image garbage collection is always run. The percent is calculated as this field value out of 100. Dynamic Kubelet Config (beta): If dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: 85
imageGCLowThresholdPercent
int32 imageGCLowThresholdPercent is the percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. The percent is calculated as this field value out of 100. Dynamic Kubelet Config (beta): If dynamically updating this field, consider that it may trigger or delay garbage collection, and may change the image overhead on the node. Default: 80
evictionHard
# Default threshold for imagefs.available: 15%
map[string]string Map of signal names to quantities that defines hard eviction thresholds. For example: {"memory.available": "300Mi"}. To explicitly disable, pass a 0% or 100% threshold on an arbitrary resource. Dynamic Kubelet Config (beta): If dynamically updating this field, consider that it may trigger or delay Pod evictions. Default: memory.available: "100Mi" nodefs.available: "10%" nodefs.inodesFree: "5%" imagefs.available: "15%"
evictionSoft
map[string]string Map of signal names to quantities that defines soft eviction thresholds. For example: {"memory.available": "300Mi"}. Dynamic Kubelet Config (beta): If dynamically updating this field, consider that it may trigger or delay Pod evictions, and may change the allocatable reported by the node. Default: nil
With the default parameters, hard eviction thresholds are therefore set on both imagefs and nodefs.
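To make imageGCHighThresholdPercent and imageGCLowThresholdPercent concrete, here is a minimal sketch of how the two values could turn into a "bytes to free" target. The arithmetic is assumed to mirror what the kubelet's image GC manager does; the names gcPolicy and bytesToFree are hypothetical and exist only in this example.

```go
package main

import "fmt"

// gcPolicy holds the two image GC thresholds (hypothetical type for this sketch).
type gcPolicy struct {
	HighThresholdPercent int // imageGCHighThresholdPercent, default 85
	LowThresholdPercent  int // imageGCLowThresholdPercent, default 80
}

// bytesToFree returns how many bytes image GC should try to reclaim.
// It returns 0 while disk usage is below the high threshold.
func bytesToFree(p gcPolicy, capacity, available int64) int64 {
	if capacity <= 0 {
		return 0
	}
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent < p.HighThresholdPercent {
		return 0
	}
	// Aim to bring usage back down to the low threshold.
	targetAvailable := capacity * int64(100-p.LowThresholdPercent) / 100
	return targetAvailable - available
}

func main() {
	p := gcPolicy{HighThresholdPercent: 85, LowThresholdPercent: 80}
	const gib = int64(1) << 30
	// 90% used: GC runs and aims for 80% usage.
	fmt.Println(bytesToFree(p, 100*gib, 10*gib))
	// 70% used: below the high threshold, nothing to free.
	fmt.Println(bytesToFree(p, 100*gib, 30*gib))
}
```

With the defaults, GC does nothing until usage reaches 85%, and once triggered it tries to free enough to get back to 80%, not just below 85%.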
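1.1.2 Node-level resources
If the nodefs filesystem has met an eviction threshold, the kubelet frees disk space by deleting dead pods and their containers.
If the imagefs filesystem has met an eviction threshold, the kubelet frees disk space by deleting all unused images.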
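1.1.3 End-user pods
If nodefs triggers the eviction, the kubelet sorts pods by their nodefs usage, that is, local volumes plus the logs of all containers in the pod.
If imagefs triggers the eviction, the kubelet sorts pods by the writable-layer usage of all their containers.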
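The two sorting rules above can be sketched as follows. This is a simplified illustration, not the kubelet's code: podDiskStats, usage, and rankPods are hypothetical names, and the real eviction manager also takes pod priority and resource requests into account before comparing raw usage.

```go
package main

import (
	"fmt"
	"sort"
)

// podDiskStats is a hypothetical per-pod disk summary, in bytes.
type podDiskStats struct {
	name          string
	localVolumes  int64
	logs          int64
	writableLayer int64
}

// usage returns the figure used for ranking under the given signal:
// writable layers for "imagefs", local volumes plus logs otherwise (nodefs).
func (p podDiskStats) usage(signal string) int64 {
	if signal == "imagefs" {
		return p.writableLayer
	}
	return p.localVolumes + p.logs
}

// rankPods returns pod names ordered from largest to smallest usage,
// i.e. the first entry is the first eviction candidate.
func rankPods(pods []podDiskStats, signal string) []string {
	ranked := append([]podDiskStats(nil), pods...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].usage(signal) > ranked[j].usage(signal)
	})
	names := make([]string, 0, len(ranked))
	for _, p := range ranked {
		names = append(names, p.name)
	}
	return names
}

func main() {
	pods := []podDiskStats{
		{name: "web", localVolumes: 100, logs: 50, writableLayer: 900},
		{name: "db", localVolumes: 700, logs: 100, writableLayer: 10},
		{name: "job", localVolumes: 0, logs: 300, writableLayer: 400},
	}
	fmt.Println(rankPods(pods, "nodefs"))  // [db job web]
	fmt.Println(rankPods(pods, "imagefs")) // [web job db]
}
```

The same pod can be first in line under one signal and last under the other, which is why it matters which filesystem is under pressure.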
1.2 Code
1.2.1 How images are deleted
Source: pkg/kubelet/images/image_gc_manager.go
Images that are not in use are collected, then sorted by last-used time and first-detected time.
for image, record := range im.imageRecords {
    if isImageUsed(image, imagesInUse) {
        klog.V(5).InfoS("Image ID is being used", "imageID", image)
        continue
    }
    images = append(images, evictionInfo{
        id:          image,
        imageRecord: *record,
    })
}
sort.Sort(byLastUsedAndDetected(images))
Sort order: by last-used time, with ties broken by first-detected time. The images that have gone unused the longest come first.
func (ev byLastUsedAndDetected) Less(i, j int) bool {
    // Sort by last used, break ties by detected.
    if ev[i].lastUsed.Equal(ev[j].lastUsed) {
        return ev[i].firstDetected.Before(ev[j].firstDetected)
    }
    return ev[i].lastUsed.Before(ev[j].lastUsed)
}
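The sorted images are then traversed:
If the image's lastUsed time is equal to or later than freeTime (the time of this round of freeing space), skip the image.
If the time since the image was first detected (freeTime minus firstDetected) is less than the policy's MinAge (imageMinimumGCAge), skip the image.
Otherwise remove the image; on success, add the image's size to spaceFreed.
If spaceFreed is greater than or equal to bytesToFree, break out of the loop.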
for _, image := range images {
    klog.V(5).InfoS("Evaluating image ID for possible garbage collection", "imageID", image.id)
    // Images that are currently in used were given a newer lastUsed.
    if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
        klog.V(5).InfoS("Image ID was used too recently, not eligible for garbage collection", "imageID", image.id, "lastUsed", image.lastUsed, "freeTime", freeTime)
        continue
    }
    // Avoid garbage collect the image if the image is not old enough.
    // In such a case, the image may have just been pulled down, and will be used by a container right away.
    if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
        klog.V(5).InfoS("Image ID's age is less than the policy's minAge, not eligible for garbage collection", "imageID", image.id, "age", freeTime.Sub(image.firstDetected), "minAge", im.policy.MinAge)
        continue
    }
    // Remove image. Continue despite errors.
    klog.InfoS("Removing image to free bytes", "imageID", image.id, "size", image.size)
    err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
    if err != nil {
        deletionErrors = append(deletionErrors, err)
        continue
    }
    delete(im.imageRecords, image.id)
    spaceFreed += image.size
    if spaceFreed >= bytesToFree {
        break
    }
}
1.2.2 How pods are evicted
Source: pkg/kubelet/eviction/eviction_manager.go
The eviction manager uses ActivePodsFunc to look up the pods that are active on the kubelet.
// ActivePodsFunc returns pods bound to the kubelet that are active (i.e. non-terminal state)
type ActivePodsFunc func() []*v1.Pod
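The active pods are ranked by their usage of the resource under pressure and then traversed in that order; the loop returns as soon as one pod has been evicted.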
// we kill at most a single pod during each eviction interval
for i := range activePods {
    pod := activePods[i]
    gracePeriodOverride := int64(0)
    if !isHardEvictionThreshold(thresholdToReclaim) {
        gracePeriodOverride = m.config.MaxPodGracePeriodSeconds
    }
    message, annotations := evictionMessage(resourceToReclaim, pod, statsFunc)
    if m.evictPod(pod, gracePeriodOverride, message, annotations) {
        metrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()
        return []*v1.Pod{pod}
    }
}
klog.InfoS("Eviction manager: unable to evict any pods from the node")
return nil
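Per-container usage depends on the resource being reclaimed: under storage pressure it is the sum of rootfs usage and log usage, and under memory pressure it is the working-set size.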
switch resourceToReclaim {
case v1.ResourceEphemeralStorage:
    if containerStats.Rootfs != nil && containerStats.Rootfs.UsedBytes != nil && containerStats.Logs != nil && containerStats.Logs.UsedBytes != nil {
        usage = resource.NewQuantity(int64(*containerStats.Rootfs.UsedBytes+*containerStats.Logs.UsedBytes), resource.BinarySI)
    }
case v1.ResourceMemory:
    if containerStats.Memory != nil && containerStats.Memory.WorkingSetBytes != nil {
        usage = resource.NewQuantity(int64(*containerStats.Memory.WorkingSetBytes), resource.BinarySI)
    }
}
2. nodefs: ephemeral storage and emptyDir
Ephemeral storage has been enabled by default since Kubernetes 1.10.
The LocalStorageCapacityIsolation feature is beta and enabled by default. The LocalStorageCapacityIsolation feature added a new resource type ResourceEphemeralStorage "ephemeral-storage" so that this resource can be allocated, limited, and consumed as the same way as CPU/memory. All the features related to resource management (resource request/limit, quota, limitrange) are available for local ephemeral storage. This local ephemeral storage represents the storage for root file system, which will be consumed by containers' writabl