Kubernetes List mechanism and resourceVersion semantics

傅里叶、

Last modified: 2023-02-19 17:28:42

Category: K8S; tags: kubernetes

First published: 2022-11-20 22:57:14

Article link: https://blog.csdn.net/qq_34562093/article/details/127955775

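1. Core responsibilities of kube-apiserver

It provides a RESTful API; it proxies cluster components such as the dashboard, streaming logs, and kubectl exec sessions; and it caches the full etcd data set while remaining a stateless service that can be scaled horizontally.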

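2. List operations in Kubernetes

Kubernetes performs List/ListWatch at two levels (both over the same data):

(1) the apiserver Lists/ListWatches etcd;

(2) infrastructure services List/ListWatch the apiserver.

So, in its simplest form, the apiserver is a proxy sitting in front of etcd.

In the vast majority of cases kube-apiserver serves requests directly from its local cache (it caches the full etcd data set). In a few special cases the apiserver has no choice but to forward the request to etcd, for example:

(1) the client explicitly asks for data to be read from etcd because it wants the most accurate data; badly chosen LIST parameters on the client side can also end up on this path;

(2) the apiserver's local cache has not been built yet.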

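3. Request examples

1. LIST apis/cilium.io/v2/ciliumendpoints?limit=500&resourceVersion=0

resourceVersion=0 means the data is served from the apiserver cache, while resourceVersion="" (unset) means it is read from etcd. etcd is a KV store: it can handle limit/continue and namespace filtering, but any other label/field filtering has to be done by kube-apiserver. Because the watch cache does not support pagination (at least in the v1.24 logic shown in section 4), resourceVersion=0 causes limit=500 to be ignored, so the client receives the full set of ciliumendpoints.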
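To make limit/continue concrete, here is a minimal sketch of key-range pagination over a sorted key space, roughly the kind of read a KV store can serve natively. The function `listPage` and the sample keys are hypothetical and only for illustration; this is not the etcd or apiserver API.

```go
package main

import (
	"fmt"
	"sort"
)

// listPage returns up to limit keys that sort at or after start, plus the key
// to resume from ("" when nothing is left). keys must be sorted and limit > 0.
func listPage(keys []string, start string, limit int) (page []string, next string) {
	i := sort.SearchStrings(keys, start) // first index with keys[i] >= start
	end := i + limit
	if end >= len(keys) {
		return keys[i:], ""
	}
	return keys[i:end], keys[end]
}

func main() {
	// Illustrative keys; the namespace is part of the key, so a namespace
	// filter is just a key-prefix range.
	keys := []string{
		"/registry/pods/default/a",
		"/registry/pods/default/b",
		"/registry/pods/kube-system/c",
		"/registry/pods/kube-system/d",
		"/registry/pods/kube-system/e",
	}
	sort.Strings(keys)

	start := "/registry/pods/"
	for {
		page, next := listPage(keys, start, 2)
		fmt.Println(len(page), "keys, continue at:", next)
		if next == "" {
			break
		}
		start = next
	}
}
```

Note that nothing in this loop can express "only pods with label X": the store only understands key ranges, which is why label/field filtering falls to the apiserver.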

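2. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1

%3D is the escaped form of =. This request filters by node name to get all pods on node1. It looks like a small amount of data, but what happens behind it is more complicated than it seems:

(1) because resourceVersion=0 is not specified, the apiserver skips its cache and reads directly from etcd;

(2) etcd is only a KV store and cannot filter by label/field (it only handles limit/continue), so the apiserver pulls the full data set from etcd, filters it in memory, and then returns the result to the client. This is very expensive.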
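The cost of filtering in memory can be sketched as a linear scan: every object is examined even though only a handful match. The `pod` type and `filterByNode` below are hypothetical stand-ins, and the cluster size is only illustrative.

```go
package main

import "fmt"

// pod is a hypothetical, stripped-down stand-in for a Pod object.
type pod struct {
	Name     string
	NodeName string
}

// filterByNode scans every pod and keeps those on the given node. It returns
// the matches together with the number of objects that had to be examined.
func filterByNode(all []pod, node string) (matched []pod, scanned int) {
	for _, p := range all {
		scanned++
		if p.NodeName == node {
			matched = append(matched, p)
		}
	}
	return matched, scanned
}

func main() {
	// Illustrative cluster: 100,000 pods spread evenly over 4,000 nodes.
	const nodes, pods = 4000, 100000
	all := make([]pod, 0, pods)
	for i := 0; i < pods; i++ {
		all = append(all, pod{
			Name:     fmt.Sprintf("pod-%d", i),
			NodeName: fmt.Sprintf("node%d", i%nodes),
		})
	}
	matched, scanned := filterByNode(all, "node1")
	fmt.Printf("scanned %d pods to return %d\n", scanned, len(matched))
}
```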

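3. LIST api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&resourceVersion=0

resourceVersion=0 means the apiserver reads from its cache, which improves performance by an order of magnitude. The apiserver still has to filter in memory before responding, though, and the amount of data it has to process can be very large.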
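As a small illustration of how the two request forms differ on the wire, the sketch below builds both query strings with Go's standard `net/url` package. The helper `podListPath` is a hypothetical name introduced here, not part of any Kubernetes client.

```go
package main

import (
	"fmt"
	"net/url"
)

// podListPath builds the path and query string of a pod LIST request filtered
// by node name. An empty resourceVersion leaves the parameter unset.
func podListPath(nodeName, resourceVersion string) string {
	q := url.Values{}
	q.Set("fieldSelector", "spec.nodeName="+nodeName)
	if resourceVersion != "" {
		q.Set("resourceVersion", resourceVersion)
	}
	// Encode percent-escapes the "=" inside the value as %3D.
	return "/api/v1/pods?" + q.Encode()
}

func main() {
	fmt.Println(podListPath("node1", ""))  // no resourceVersion: read from etcd
	fmt.Println(podListPath("node1", "0")) // resourceVersion=0: served from cache
}
```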

4. LIST over the full pod set

podList, err := Client().CoreV1().Pods("").List(ctx(), ListOptions{FieldSelector: "spec.nodeName=node1"})

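Let's look at the actual data volume behind this call, taking a cluster with 4,000 nodes and 100,000 pods as an example. The full pod data set is: in etcd, compact unstructured KV data, on the order of 1 GB; in the apiserver cache, already structured Go objects, on the order of 2 GB; and in the kube-apiserver response, which clients usually receive in the default JSON format and which is also structured, on the order of 2 GB.

As you can see, some requests look trivial, a single line of client code, yet the data volume behind them is staggering. Filtering pods by nodeName may return only 500 KB of data, but the apiserver has to filter 2 GB to produce it, and in the worst case etcd has to process 1 GB as well. On a small cluster the problem may go unnoticed (etcd only starts logging warnings once LIST response latency exceeds a certain threshold); on a large cluster, if there are many such requests, apiserver and etcd will not be able to cope.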
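The figures above are consistent with each other, which a quick back-of-the-envelope calculation shows. The sketch below assumes pods are spread evenly across nodes and takes "2 GB" as 2e9 bytes; `estimate` is a hypothetical helper.

```go
package main

import "fmt"

// estimate derives per-node and per-pod figures from cluster-wide totals,
// assuming pods are spread evenly across nodes.
func estimate(nodes, pods int, totalBytes float64) (podsPerNode int, bytesPerPod, responseBytes float64) {
	podsPerNode = pods / nodes
	bytesPerPod = totalBytes / float64(pods)
	responseBytes = float64(podsPerNode) * bytesPerPod
	return
}

func main() {
	// 4,000 nodes, 100,000 pods, ~2 GB of structured pod objects.
	n, perPod, resp := estimate(4000, 100000, 2e9)
	fmt.Printf("pods per node: %d\n", n)
	fmt.Printf("bytes per pod: %.0f\n", perPod)
	fmt.Printf("response size: %.0f bytes\n", resp)
	fmt.Printf("scanned/returned ratio: %.0fx\n", 2e9/resp)
}
```

Under these assumptions that is 25 pods per node at about 20 KB each, roughly 500 KB returned after scanning 2 GB, a ratio of about 4000 to 1.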

4. How to tell whether data must be read from etcd

shouldDelegateList()

//https://github.com/kubernetes/kubernetes/blob/v1.24.0/staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go#L591

func shouldDelegateList(opts storage.ListOptions) bool {
    resourceVersion := opts.ResourceVersion
    pred := opts.Predicate
    pagingEnabled := utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking) // enabled by default
    hasContinuation := pagingEnabled && len(pred.Continue) > 0                        // Continue is a token
    hasLimit := pagingEnabled && pred.Limit > 0 && resourceVersion != "0"             // hasLimit can only be true when resourceVersion != "0"

    // 1. If resourceVersion is not specified, fetch the data from the underlying storage (etcd);
    // 2. If there is a continuation, also fetch from the underlying storage;
    // 3. The limit is passed to the underlying storage (etcd) only when resourceVersion != "0",
    //    because the watch cache does not support continuation.
    return resourceVersion == "" || hasContinuation || hasLimit || opts.ResourceVersionMatch == metav1.ResourceVersionMatchExact
}

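If the client leaves the ResourceVersion field of ListOptions{} unset, that maps to resourceVersion == "" here, so the full data set is pulled from etcd.

If the client sets limit=500&resourceVersion=0, the next request will not have hasContinuation==true: resourceVersion=0 causes the limit to be ignored, the full data set is returned in one response, and there is no continue token to follow.

In summary, a LIST goes directly to etcd when resourceVersion is unset, when resourceVersionMatch=Exact (which also requires a non-zero resourceVersion), when a continue token is present, or when limit is set and resourceVersion is not 0.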
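The decision can be tried out on concrete requests with a standalone model. This is only a sketch of the logic quoted above, assuming the APIListChunking feature gate is enabled; `listRequest` and `goesToEtcd` are hypothetical names, not apiserver types.

```go
package main

import "fmt"

// listRequest is a hypothetical, flattened view of the fields that
// shouldDelegateList inspects.
type listRequest struct {
	ResourceVersion      string
	ResourceVersionMatch string // "", "Exact" or "NotOlderThan"
	Limit                int64
	Continue             string
}

// goesToEtcd models the quoted decision logic, assuming paging is enabled.
func goesToEtcd(r listRequest) bool {
	hasContinuation := r.Continue != ""
	hasLimit := r.Limit > 0 && r.ResourceVersion != "0"
	return r.ResourceVersion == "" || hasContinuation || hasLimit ||
		r.ResourceVersionMatch == "Exact"
}

func main() {
	cases := []struct {
		name string
		req  listRequest
	}{
		{"resourceVersion unset", listRequest{}},
		{"resourceVersion=0", listRequest{ResourceVersion: "0"}},
		{"limit=500&resourceVersion=0", listRequest{ResourceVersion: "0", Limit: 500}},
		{"limit=500, resourceVersion unset", listRequest{Limit: 500}},
		{"limit=500&resourceVersion=12345", listRequest{ResourceVersion: "12345", Limit: 500}},
		{"resourceVersion=12345", listRequest{ResourceVersion: "12345"}},
		{"resourceVersion=12345, Exact", listRequest{ResourceVersion: "12345", ResourceVersionMatch: "Exact"}},
	}
	for _, c := range cases {
		fmt.Printf("%-34s etcd=%v\n", c.name, goesToEtcd(c.req))
	}
}
```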

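5. Semantics of resourceVersion

1. resourceVersion semantics for Get

"Any" (resourceVersion=0) means data at any resource version may be returned: the newest available resource version is preferred, but strong consistency is not required.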

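2. resourceVersion semantics for List

Starting with v1.19, a list request that sets resourceVersion can also carry the resourceVersionMatch parameter, which determines how resourceVersion is interpreted.

resourceVersionMatch=Exact means the resourceVersion must be matched exactly; if that resourceVersion is not available, an HTTP 410 (Gone) response is returned.

resourceVersionMatch=NotOlderThan means the returned data must not be older than the specified resourceVersion; the newest available resource version is preferred.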
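The combinations can be summarized as a small lookup. The sketch below follows my reading of the table in the Kubernetes API concepts documentation, restricted to list requests without limit/continue; `listSemantics` is a hypothetical helper, not a Kubernetes function, and the labels it returns are the documentation's names for each behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalid marks a parameter combination that is rejected.
var errInvalid = errors.New("invalid resourceVersion/resourceVersionMatch combination")

// listSemantics maps (resourceVersion, resourceVersionMatch) to the name of
// the resulting semantics for a list request with no limit/continue.
func listSemantics(resourceVersion, match string) (string, error) {
	switch match {
	case "":
		switch resourceVersion {
		case "":
			return "MostRecent", nil
		case "0":
			return "Any", nil
		default:
			return "NotOlderThan", nil
		}
	case "Exact":
		// Exact needs a concrete, non-zero version to match against.
		if resourceVersion == "" || resourceVersion == "0" {
			return "", errInvalid
		}
		return "Exact", nil
	case "NotOlderThan":
		switch resourceVersion {
		case "":
			return "", errInvalid
		case "0":
			return "Any", nil
		default:
			return "NotOlderThan", nil
		}
	default:
		return "", errInvalid
	}
}

func main() {
	for _, c := range [][2]string{
		{"", ""}, {"0", ""}, {"12345", ""},
		{"12345", "Exact"}, {"0", "Exact"},
		{"12345", "NotOlderThan"}, {"", "NotOlderThan"},
	} {
		s, err := listSemantics(c[0], c[1])
		fmt.Printf("resourceVersion=%q match=%q -> %q err=%v\n", c[0], c[1], s, err)
	}
}
```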

3. resourceVersion semantics for Watch

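6. Deployment and tuning recommendations

1. Set ResourceVersion=0 on Get/List requests. The client-go ListWatch/Informer interfaces already set ResourceVersion=0 by default;

2. Prefer namespaced APIs;

3. Restart backoff: components such as kubelet, cilium-agent, and other DaemonSets need an effective restart backoff to reduce the pressure on kube-apiserver during large-scale restarts;

4. For frequent list operations, especially ones with filter conditions, use the informer ListWatch mechanism to pull the data locally and let the business logic filter from the local cache as needed. For a one-off list with filter conditions, set a label or field selector so that the apiserver filters the data for us, and remember to include resourceVersion=0 in the request as well.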
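The idea behind filtering from a local cache (recommendation 4) can be sketched with a small in-memory index keyed by node name. This is only an illustration of the pattern under simplified assumptions; `nodeIndex` and its methods are hypothetical, and in real code the client-go informer and indexer machinery would maintain such a cache from watch events.

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a hypothetical, stripped-down stand-in for a Pod object.
type pod struct{ Namespace, Name, NodeName string }

// nodeIndex is a local cache that answers "which pods run on node X"
// without asking the apiserver.
type nodeIndex struct {
	mu     sync.RWMutex
	nodeOf map[string]string         // "namespace/name" -> node name
	byNode map[string]map[string]pod // node name -> "namespace/name" -> pod
}

func newNodeIndex() *nodeIndex {
	return &nodeIndex{nodeOf: map[string]string{}, byNode: map[string]map[string]pod{}}
}

// upsert records an added or updated pod, moving it if its node changed.
func (x *nodeIndex) upsert(p pod) {
	x.mu.Lock()
	defer x.mu.Unlock()
	k := p.Namespace + "/" + p.Name
	x.removeLocked(k)
	if x.byNode[p.NodeName] == nil {
		x.byNode[p.NodeName] = map[string]pod{}
	}
	x.byNode[p.NodeName][k] = p
	x.nodeOf[k] = p.NodeName
}

// remove drops a deleted pod from the index.
func (x *nodeIndex) remove(namespace, name string) {
	x.mu.Lock()
	defer x.mu.Unlock()
	x.removeLocked(namespace + "/" + name)
}

// removeLocked deletes key k; the caller must hold the write lock.
func (x *nodeIndex) removeLocked(k string) {
	node, ok := x.nodeOf[k]
	if !ok {
		return
	}
	delete(x.byNode[node], k)
	if len(x.byNode[node]) == 0 {
		delete(x.byNode, node)
	}
	delete(x.nodeOf, k)
}

// podsOnNode returns the pods on a node (in no particular order).
func (x *nodeIndex) podsOnNode(node string) []pod {
	x.mu.RLock()
	defer x.mu.RUnlock()
	out := make([]pod, 0, len(x.byNode[node]))
	for _, p := range x.byNode[node] {
		out = append(out, p)
	}
	return out
}

func main() {
	idx := newNodeIndex()
	idx.upsert(pod{"default", "a", "node1"})
	idx.upsert(pod{"default", "b", "node2"})
	idx.upsert(pod{"default", "b", "node1"}) // pod b moved to node1
	idx.remove("default", "a")
	fmt.Println(len(idx.podsOnNode("node1")), len(idx.podsOnNode("node2")))
}
```

After the initial list, each lookup touches only the pods on one node, so repeated queries cost the apiserver nothing beyond the watch stream that keeps the index current.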