k8s pod restartcount reset to 0_k8s Definitive Guide: Notes on Resource Control

Latest recommended article published on 2022-03-14 09:33:31

weixin_39927158


Kubernetes in Action (learning.oreilly.com)

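These are my notes on Chapter 14, which covers:

Requesting CPU, memory, and other resources for containers
Setting hard limits on CPU and memory usage
Understanding how resources are guaranteed for the containers in a pod
Setting default, minimum, and maximum resources for pods in a namespace
Limiting the total resources available in a namespace

Requesting resources for a pod's containers

When you create a pod, you can declare the amount of CPU and memory it needs (its requests). Note that these are specified for each container in the pod individually, not for the pod as a whole. So how do you write a pod manifest with resource requests? Here's an example: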

apiVersion

❶ You’re specifying resource requests for the main container.
❷ The container requests 200 millicores (that is, 1/5 of a single CPU core’s time).
❸ The container also requests 10 mebibytes of memory.

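In this YAML, cpu: 200m requests 1/5 of a CPU core and memory: 10Mi requests 10 mebibytes of memory. These are the amounts the container is guaranteed, not the most it can use. If you don't request any CPU, then in the worst case the container may not get any CPU time at all.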
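Since the manifest itself did not survive in these notes, here is a rough Python sketch that builds an equivalent one as a dictionary. The pod name, the 200m and 10Mi values, and the container name main come from the text and callouts above; the busybox image and the dd command are assumptions for illustration, as is the helper cpu_to_cores.

```python
import json


def build_requests_pod(name="requests-pod", cpu="200m", memory="10Mi"):
    """Return a Pod manifest (as a dict) whose single container sets resource requests."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "main",
                    "image": "busybox",  # assumed image
                    "command": ["dd", "if=/dev/zero", "of=/dev/null"],  # assumed CPU-bound command
                    # Requests are set per container, not per pod.
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ]
        },
    }


def cpu_to_cores(quantity):
    """Convert a CPU quantity such as '200m' or '2' to a number of cores (hypothetical helper)."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


if __name__ == "__main__":
    print(json.dumps(build_requests_pod(), indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output could in principle be piped into kubectl apply -f - to try this out.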

Now we can run the pod and look at its CPU usage:

kubectl exec -it requests-pod top
Mem: 1288116K used, 760368K free, 9196K shrd, 25748K buff, 814840K cached
CPU:  9.1% usr 42.1% sys  0.0% nic 48.4% idle  0.0% io  0.0% irq  0.2% sirq
Load average: 0.79 0.52 0.29 2/481

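The dd command uses as much CPU as it can. Because it is single-threaded it can saturate only one core, so top shows about 50% CPU usage. This is expected, because we have not set a CPU limit on the container.

So how do a pod's resource requests affect Kubernetes scheduling?

In short, the scheduler treats your requests as the minimum the pod needs and only places the pod on a node whose unallocated resources can satisfy them. The calculation is not based on what is actually being used on the node at that moment, but on the sum of the requests of the pods already deployed on that node. The corresponding figure in the book illustrates this.

When the scheduler receives a request to place a pod, it first rules out the nodes whose remaining resources cannot satisfy the pod's requests, then ranks the remaining nodes with one of two prioritization functions: LeastRequestedPriority or MostRequestedPriority. LeastRequestedPriority prefers nodes with more unrequested resources, while MostRequestedPriority prefers nodes that are already tightly packed (which saves money in the cloud, since fewer nodes are needed). Let's see how Kubernetes describes a node's resources: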

2                ❶
  memory:        2048484Ki        ❶
  pods:

❶ The overall capacity of the node
❷ The resources allocatable to pods

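The output shows two sets of resource amounts: the node's capacity and its allocatable resources. Capacity is the node's total resources, not all of which are available to pods, because the operating system and the Kubernetes system components need some too. The scheduler therefore bases its decisions only on the allocatable amounts.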
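The filter-then-rank behaviour described above can be sketched roughly as follows. This is a simplification that only looks at CPU in millicores and is not the real scheduler code; the node names and numbers are made up, and fits and rank are hypothetical helpers.

```python
def fits(node, pod_cpu_m):
    """A node fits if its allocatable CPU minus the sum of existing requests covers the pod."""
    return node["allocatable_m"] - sum(node["requested_m"]) >= pod_cpu_m


def rank(nodes, pod_cpu_m, strategy="least"):
    """Drop nodes that cannot fit the pod, then order the rest.

    'least' puts the node with the largest unrequested fraction first,
    'most' puts the node with the smallest unrequested fraction first.
    """
    candidates = [n for n in nodes if fits(n, pod_cpu_m)]

    def free_fraction(n):
        return (n["allocatable_m"] - sum(n["requested_m"])) / n["allocatable_m"]

    return sorted(candidates, key=free_fraction, reverse=(strategy == "least"))


# Hypothetical cluster: actual CPU usage never appears, only requests.
nodes = [
    {"name": "node-a", "allocatable_m": 2000, "requested_m": [200, 800]},
    {"name": "node-b", "allocatable_m": 2000, "requested_m": [1500]},
    {"name": "node-c", "allocatable_m": 2000, "requested_m": [1000, 900]},
]
least = [n["name"] for n in rank(nodes, 300, "least")]
most = [n["name"] for n in rank(nodes, 300, "most")]
print(least, most)
```

Note that node-c is excluded for a 300m pod even if its containers happen to be idle, because only the requested amounts count.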

Taking a 2-core minikube node as an example, let's use two pods to fill up its CPU requests:

=busybox --restart Never
➥  --requests

At this point the pods have requested all of the node's 2 cores, so one more pod can no longer be scheduled:

0          4m

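No matter how long we wait, this pod will not be scheduled as long as the requested CPU is unavailable. We can use the describe command to see what happened: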

(1

❶ No node is associated with the pod.
❷ The pod hasn’t been scheduled.
❸ Scheduling has failed because of insufficient CPU.

We can also inspect the node's resources to see why the pod could not be placed on it.

Understanding how CPU requests affect CPU time sharing

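Suppose two pods are running on the node, requesting 200m and 1000m of CPU respectively. As mentioned, Kubernetes has both requests and limits. We haven't set any limits, so what happens when these two pods start competing for CPU?

CPU requests affect not only scheduling but also how the node's unused CPU time is shared among pods. The two pods' requests are in a 1:5 ratio, so spare CPU is divided in the same ratio. If both pods consume as much CPU as they can, the first pod gets one sixth (about 16.7%) of the CPU time and the second gets the remaining five sixths (about 83.3%). The corresponding figure in the book illustrates this split.

But if one pod wants to use all the CPU while the other is idle, the first pod can use all of the remaining CPU. Once the second pod also starts asking for more CPU, the first pod's share shrinks back.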
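The arithmetic behind that 1:5 split can be checked with a small sketch. It assumes CPU time is divided purely in proportion to the requests of the pods that currently want CPU; split_cpu and the pod names are hypothetical.

```python
def split_cpu(requests_m, demanding):
    """Share 100% of CPU time among the pods that want CPU, in proportion to their requests."""
    total = sum(requests_m[p] for p in demanding)
    return {
        p: (requests_m[p] / total * 100 if p in demanding else 0.0)
        for p in requests_m
    }


requests_m = {"pod-a": 200, "pod-b": 1000}

# Both pods try to use as much CPU as possible.
both = split_cpu(requests_m, {"pod-a", "pod-b"})
# Only pod-a wants CPU while pod-b sits idle.
only_a = split_cpu(requests_m, {"pod-a"})

print({p: round(v, 1) for p, v in both.items()})
print(only_a)
```

When both pods are busy the shares come out at roughly 16.7% and 83.3%, and when pod-b is idle pod-a is free to take everything.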

Setting a hard limit for the amount of resources a container can use

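We just saw that without a limit a container can use up all the CPU. For the health of the node and to avoid potential problems, set limits whenever you can.

It's also worth noting that CPU is a compressible resource: if a container takes too much, it can be throttled and at worst runs a bit slower. Memory is different because it is incompressible. Without a memory limit, a single pod can eat up all the memory on a node and affect the other pods scheduled there (remember that Kubernetes schedules based on requested resources, not actual usage).

To prevent this, get into the habit of specifying limits: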

❶ Specifying resource limits for the container
❷ This container will be allowed to use at most 1 CPU core.
❸ The container will be allowed to use up to 20 mebibytes of memory.

Because you haven’t specified any resource requests, they’ll be set to the same values as the resource limits.
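To make the defaulting rule concrete, here is a small sketch of a container that sets only limits, plus a helper that mimics how requests fall back to the limits. The values 1 CPU and 20Mi come from the callouts above; the busybox image and the helper apply_request_defaults are assumptions for illustration, not Kubernetes code.

```python
import copy


def apply_request_defaults(container):
    """Return a copy in which every resource with a limit but no request gets request = limit."""
    result = copy.deepcopy(container)
    resources = result.setdefault("resources", {})
    limits = resources.get("limits", {})
    requests = resources.setdefault("requests", {})
    for name, value in limits.items():
        # An explicitly set request is kept; only missing ones are filled in.
        requests.setdefault(name, value)
    return result


limited = {
    "name": "main",
    "image": "busybox",  # assumed image
    "resources": {"limits": {"cpu": "1", "memory": "20Mi"}},
}
effective = apply_request_defaults(limited)
print(effective["resources"])
```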

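Unlike resource requests, resource limits are not constrained by the node's allocatable resources. The sum of the limits of all pods on a node can exceed 100% of the node's capacity, so when the node's resources are fully used, some containers will have to be killed.

We already know CPU is compressible: a container that tries to go over its CPU limit is simply throttled. Memory is different. When a process tries to allocate more memory than its limit allows, it is killed (the container is OOMKilled). If the pod's restartPolicy is Always or OnFailure, the container is restarted right away. If it keeps exceeding the memory limit and getting killed, the delay before each restart gets longer and longer.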

3          1m

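The delay before each restart grows exponentially (20, 40, 80, then 160 seconds) and is capped at 300 seconds. Once it reaches 300 seconds, Kubernetes keeps restarting the container at that interval until the pod stops crashing or is deleted.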
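The restart delays can be reproduced with a few lines of Python. This is only a sketch of the doubling-with-cap pattern; the 10-second starting value is an assumption (the notes list the sequence from 20 seconds onward), and restart_delays is a hypothetical helper rather than kubelet code.

```python
def restart_delays(crashes, initial=10, factor=2, cap=300):
    """Return the delay in seconds applied before each delayed restart.

    initial=10 is an assumed starting point; each later delay doubles up to the cap.
    """
    delays = []
    delay = initial
    for _ in range(crashes):
        delays.append(min(delay, cap))
        delay = min(delay * factor, cap)
    return delays


delays = restart_delays(7)
print(delays)
```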

We can use describe to see why the container was killed:

137
      Started:      Tue,

❶ The current container was killed because it was out of memory (OOM).
❷ The previous container was also killed because it was OOM.

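OOMKilled means the container was killed for using more memory than it was allowed. So if we don't want our container killed, the memory limit must not be set too low (although a container can be OOMKilled even without going over its limit, and the reason is explained below).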

Understanding how apps in containers see limits

Let's look inside a pod that has limits set, by running top in it and checking the output:

kubectl exec -it limited-pod top
Mem: 1450980K used, 597504K free, 22012K shrd, 65876K buff, 857552K cached
CPU: 10.0% usr 40.0% sys  0.0% nic 50.0% idle  0.0% io  0.0% irq  0.0% sirq
Load average: 0.17 1.19 2.47 4/503

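As you can see, both the CPU and the memory figures reported by top are nowhere near the limits we declared, because top shows the usage of the whole node, not of the container. That is painful for apps that need to look up the available memory and CPU from inside a container.

The problem is especially visible with Java apps. If you don't set the JVM's maximum heap size, the JVM sizes its heap from the node's total memory rather than the container's limit, and once usage goes over the limit the container is killed.

CPU has the same problem: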

A container with a one-core CPU limit running on a 64-core CPU will get 1/64th of the overall CPU time. And even though its limit is set to one core, the container’s processes will not run on only one core. At different points in time, its code may be executed on different cores.
Nothing is wrong with this, right? While that’s generally the case, at least one scenario exists where this situation is catastrophic.
Certain applications look up the number of CPUs on the system to decide how many worker threads they should run. Again, such an app will run fine on a development laptop, but when deployed on a node with a much bigger number of cores, it's going to spin up too many threads, all competing for the (possibly) limited CPU time. Also, each thread requires additional memory, causing the app's memory usage to skyrocket.
You may want to use the Downward API to pass the CPU limit to the container and use it instead of relying on the number of CPUs your app can see on the system. You can also tap into the cgroups system directly to get the configured CPU limit by reading the following files:
/sys/fs/cgroup/cpu/cpu.cfs_quota_us
/sys/fs/cgroup/cpu/cpu.cfs_period_us
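As a rough illustration of the cgroup approach, the sketch below reads those two files and sizes a worker pool from the result instead of from the number of visible CPUs. It assumes the cgroup v1 layout shown above (other layouts may expose the limit elsewhere), and the function names are hypothetical.

```python
import math
import os


def cores_from_cfs(quota_us, period_us):
    """Convert CFS quota and period to a CPU limit in cores; None means no limit."""
    if quota_us <= 0 or period_us <= 0:  # the quota is -1 when no limit is configured
        return None
    return quota_us / period_us


def read_cpu_limit(cgroup_dir="/sys/fs/cgroup/cpu"):
    """Read the CPU limit from the cgroup files, or return None if they are unavailable."""
    try:
        with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us")) as f:
            quota = int(f.read().strip())
        with open(os.path.join(cgroup_dir, "cpu.cfs_period_us")) as f:
            period = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return cores_from_cfs(quota, period)


def pick_worker_count(limit_cores, visible_cpus):
    """Use the CPU limit when there is one; otherwise fall back to the visible CPU count."""
    if limit_cores is None:
        return visible_cpus
    return max(1, math.ceil(limit_cores))


if __name__ == "__main__":
    print(pick_worker_count(read_cpu_limit(), os.cpu_count() or 1))
```

With a one-core limit on a 64-core node this picks 1 worker instead of 64, which is the behaviour the passage above argues for.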

Understanding pod QoS classes

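Imagine this scenario: pod A is using 90% of the node's memory, and pod B, which had been idle, suddenly needs more memory than the node has left. What should the node do? Should it kill A or B?

The answer depends on how the pods are configured. Kubernetes sorts pods into three Quality of Service (QoS) classes by priority: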

BestEffort (the lowest priority)
Burstable
Guaranteed (the highest)

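Note that the QoS class is not set through a separate field in the spec. It is derived from the resource requests and limits of each container in the pod:

The lowest priority is the BestEffort class. It is assigned to pods that have no requests or limits at all in any of their containers. Containers in such pods have no resource guarantees whatsoever. In the worst case they get almost no resources because other pods take them all, and they are the first to be killed when resources run short. But because a BestEffort pod has no memory limit, its containers can also use as much of the node's memory as they like when there is no contention.

At the other end is the Guaranteed QoS class, which is assigned to pods whose containers' requests equal their limits. For a pod to be in this class:

Requests and limits must be set for both CPU and memory.
They must be set for every container in the pod.
The request and the limit must be equal for each resource.

Because a container's requests default to its limits when not set explicitly, specifying limits for all resources is enough to make the pod Guaranteed. Containers in such pods are guaranteed to get the resources they requested, but they cannot get anything extra (because their limits are no higher than their requests).

In between the two is the Burstable QoS class. Every pod that does not fall into either of the other two classes belongs here. This includes single-container pods whose limits don't match their requests, as well as multi-container pods in which any container's requests and limits don't match.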
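The three rules can be condensed into a short function. This is a simplified sketch, not the Kubernetes implementation: it only looks at CPU and memory and compares quantities as plain strings, so "1" and "1000m" would wrongly count as different. The name qos_class is hypothetical.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Derive a pod's QoS class from the requests and limits of its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in RESOURCES:
            req, lim = requests.get(name), limits.get(name)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit, and a request that is either unset
            # (it then defaults to the limit) or equal to the limit.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


full = {"resources": {"limits": {"cpu": "1", "memory": "20Mi"}}}
only_requests = {"resources": {"requests": {"cpu": "200m", "memory": "10Mi"}}}
print(qos_class([{}]), qos_class([full]), qos_class([only_requests]), qos_class([full, {}]))
```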

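The book summarizes this in a table that maps a container's CPU and memory requests and limits to its QoS class.

If only requests are set and no limits, refer to the two rows of that table where requests are lower than limits.

For pods with multiple containers, the book has a second table that derives the pod's class from the classes of its containers.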

A pod’s QoS class is shown when running kubectl describe pod and in the pod’s YAML/JSON manifest in the status.qosClass field.

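Which pod gets killed first when something goes wrong?

Suppose there are two single-container pods, where the first one has the BestEffort QoS class and the second one's is Burstable. When the node's memory is used up and a process tries to allocate more, the system has to kill one of the processes to free memory. In this scenario, the process in the BestEffort pod is always killed first.

Processes are killed in QoS class order: BestEffort first, then Burstable, and Guaranteed last. Each running process also has an OutOfMemory (OOM) score, which the system uses to decide which one to kill. When memory needs to be freed, the process with the highest score is killed.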

OOM scores are calculated from two things: the percentage of the available memory the process is consuming and a fixed OOM score adjustment, which is based on the pod’s QoS class and the container’s requested memory. When two single-container pods exist, both in the Burstable class, the system will kill the one using more of its requested memory than the other, percentage-wise. That’s why in figure 14.5, pod B, using 90% of its requested memory, gets killed before pod C, which is only using 70%, even though it’s using more megabytes of memory than pod B.
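The percentage comparison in that last example can be sketched as below. This only models the rule of thumb for pods of the same QoS class and is not how the kernel actually computes OOM scores; the MiB values are invented so that pod B sits at 90% and pod C at 70% of their requests, and pick_oom_victim is a hypothetical helper.

```python
def pick_oom_victim(pods):
    """Among pods of the same QoS class, pick the one using the largest share of its request."""
    return max(pods, key=lambda p: p["used_mib"] / p["requested_mib"])["name"]


# Hypothetical numbers: pod-c uses more memory in absolute terms but less of its request.
pods = [
    {"name": "pod-b", "requested_mib": 100, "used_mib": 90},    # 90% of its request
    {"name": "pod-c", "requested_mib": 1000, "used_mib": 700},  # 70% of its request
]
victim = pick_oom_victim(pods)
print(victim)
```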
