The CPUThrottlingHigh Alert and How Its Metrics Are Collected

Article link: https://blog.csdn.net/singgel/article/details/128816399

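This article explains how the CPUThrottlingHigh alert in the Prometheus monitoring system is triggered and how its metrics are collected, describes how statistics from the cgroup subsystem show whether a container is being throttled, and examines an apparent contradiction seen in a real case.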


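Background

Prometheus CPUThrottlingHigh: metric collection and alert rule

The rule takes the increase of container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total over a 5-minute window and computes the throttling percentage from the two deltas. The alert fires when that percentage stays above 25% for 15 minutes (three consecutive 5-minute windows).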

- alert: CPUThrottlingHigh
  annotations:
    message: '{{ printf "%0.0f" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
  expr: |
    100 * sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > 25
  for: 15m
  labels:
    severity: warning
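The arithmetic behind the rule's expression can be sketched in a few lines of Python. This is a simplified illustration, not what Prometheus runs internally: the real increase() also extrapolates across the window, and the `for: 15m` condition is not modeled. The function name and the counter values are hypothetical.

```python
def throttling_percent(periods_start, throttled_start, periods_end, throttled_end):
    """Percentage of elapsed CFS periods in which the container was throttled.

    The four arguments are two samples of the cumulative counters
    container_cpu_cfs_periods_total and container_cpu_cfs_throttled_periods_total.
    """
    periods = periods_end - periods_start
    throttled = throttled_end - throttled_start
    if periods <= 0:
        # No periods elapsed in the window: nothing to report.
        return 0.0
    return 100.0 * throttled / periods


# Hypothetical samples taken 5 minutes apart.
pct = throttling_percent(1000, 100, 1300, 181)
print(f"{pct:.0f}% throttling, above the 25% threshold: {pct > 25}")
```

With these made-up samples, 81 of 300 periods were throttled, which is the same 27% figure that appears in the case discussed later.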

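Prometheus CPU usage: metric collection

container_cpu_usage_seconds_total is scraped once per minute, and the usage shown on the dashboard comes from the following query: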

sum by(cluster_id, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on(cluster_id, namespace, pod) group_left(node, host_ip) max by(cluster_id, namespace, pod, node, host_ip) (kube_pod_info)

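The contradiction

The user's pod kube-proxy-9pj9j has a CPU limit of 20% and its usage has stayed below 5% the whole time, so why does the alert report 27% throttling?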
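A toy model shows how low average usage and a high throttling percentage can coexist. It is only a sketch under stated assumptions and does not claim to be the cause in this particular case: the workload is single-threaded and bursty, every burst starts at a period boundary, the period is 100 ms with a 20 ms quota, and only periods in which the container has work to do are counted. The burst size and interval are hypothetical.

```python
PERIOD_MS = 100  # assumed cpu.cfs_period_us = 100000
QUOTA_MS = 20    # assumed cpu.cfs_quota_us = 20000, i.e. 20% of one core


def simulate(burst_ms, interval_ms, total_ms):
    """Return (average usage in % of one core, % of counted periods throttled).

    Every interval_ms the workload needs burst_ms of CPU time. It may run at
    most QUOTA_MS per period; leftover work waits for the next period.
    """
    nr_periods = nr_throttled = used_ms = 0
    for _ in range(total_ms // interval_ms):
        remaining = burst_ms
        while remaining > 0:
            nr_periods += 1
            ran = min(remaining, QUOTA_MS)
            used_ms += ran
            remaining -= ran
            if remaining > 0:
                # Quota exhausted while work is still pending: throttled.
                nr_throttled += 1
    usage_pct = 100.0 * used_ms / total_ms
    throttled_pct = 100.0 * nr_throttled / nr_periods if nr_periods else 0.0
    return usage_pct, throttled_pct


# A 30 ms burst once per second, observed for 5 minutes.
usage, throttled = simulate(burst_ms=30, interval_ms=1000, total_ms=300_000)
print(f"average usage {usage:.1f}%, throttled periods {throttled:.1f}%")
```

In this model the container averages 3% of a core, well under its 20% limit, yet half of the periods it runs in are throttled, because the usage figure is an average over the whole window while throttling is decided period by period.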

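Analysis

Reading the code confirms where the Prometheus metrics come from:

container_cpu_cfs_throttled_periods_total comes from nr_throttled in the cpu.stat file under the cgroup cpuacct hierarchy (cpu.stat itself is provided by the cpu controller, which is commonly co-mounted with cpuacct):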

/sys/fs/cgroup/cpuacct/kubepods/burstable/pod89106a5c-7ede-11ea-b093-fa163e23cb69/bff06c9fa9a6dd0b0db642119f094c630b91932ea1a0556e750bc785626fb03f/cpu.stat

container_cpu_cfs_periods_total comes from nr_periods in the same cpu.stat file.

cpu.stat reports CPU time statistics using the following values:
  nr_periods: number of period intervals (as specified in cpu.cfs_period_us) that have elapsed.
  nr_throttled: number of times tasks in a cgroup have been throttled (that is, not allowed to run because they have exhausted all of the available time as specified by their quota).
  throttled_time: the total time duration (in nanoseconds) for which tasks in a cgroup have been throttled.

See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs
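To check the raw counters on a node, the contents of cpu.stat can be parsed directly. The sketch below assumes the cgroup v1 layout of one "name value" pair per line; the sample text and the helper name parse_cpu_stat are hypothetical, not taken from the pod in this case.

```python
def parse_cpu_stat(text):
    """Parse cgroup v1 cpu.stat content into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Hypothetical file content; on a node it would be read from .../cpu.stat.
sample = """nr_periods 1300
nr_throttled 351
throttled_time 41234567890
"""

stats = parse_cpu_stat(sample)
ratio = 100.0 * stats["nr_throttled"] / stats["nr_periods"]
print(stats)
print(f"throttled in {ratio:.0f}% of periods since the cgroup was created")
```

These counters are cumulative for the lifetime of the cgroup, so a single read gives a lifetime ratio. Reproducing the alert's 5-minute figure requires two reads and their difference.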

container_cpu_usage_seconds_total comes from cpuacct.usage in the cgroup cpuacct subsystem:

/sys/fs/cgroup/cpuacct/kubepods/burstable/pod89106a5c-7ede-11ea-b093-fa163e23cb69/bff06c9fa9a6dd0b0db642119f094c630b91932ea1a0556e750bc785626fb03f/cpuacct.usage

/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
by this group which is essentially the CPU time obtained by all the tasks
in the system.

See: https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt
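Because cpuacct.usage is a cumulative nanosecond counter, an average usage figure can be derived from two reads. The following is a rough sketch of that calculation, comparable in spirit to a rate() over the same window but without Prometheus's extrapolation. The function name and sample values are hypothetical.

```python
def cpu_cores_used(usage_ns_start, usage_ns_end, elapsed_seconds):
    """Average number of cores used between two reads of cpuacct.usage."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (usage_ns_end - usage_ns_start) / 1e9 / elapsed_seconds


# Hypothetical reads 300 seconds apart: 15 s of CPU time consumed.
cores = cpu_cores_used(0, 15_000_000_000, 300)
print(f"average usage: {cores:.2f} cores")
```

An average like this hides how the CPU time was distributed inside the window, which is why it can look low even when individual periods hit the quota.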

CPU quota and period configured in the pod's cgroup:

cat cpu.cfs_quota_us
cat cpu.cfs_period_us
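The two values translate into a CPU limit as quota divided by period. Here is a minimal sketch, assuming cgroup v1 file names and that a quota of -1 means "no limit"; the helper names and the example numbers (chosen to match a 20% limit) are hypothetical.

```python
from pathlib import Path


def limit_from_values(quota_us, period_us):
    """Return the CPU limit in cores, or None when no quota is set."""
    if quota_us < 0:
        # cgroup v1 uses -1 to indicate an unlimited quota.
        return None
    return quota_us / period_us


def cpu_limit_cores(cgroup_dir):
    """Read cpu.cfs_quota_us and cpu.cfs_period_us from a cgroup directory."""
    quota = int(Path(cgroup_dir, "cpu.cfs_quota_us").read_text())
    period = int(Path(cgroup_dir, "cpu.cfs_period_us").read_text())
    return limit_from_values(quota, period)


# Example values: 20 ms of CPU time allowed per 100 ms period.
print(limit_from_values(20000, 100000))
```

On a node, cpu_limit_cores would be called with the container's cgroup directory, the same directory that holds the cpu.stat file.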

Community discussion:

https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it

https://github.com/kubernetes/kubernetes/issues/67577

https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108

https://bugzilla.kernel.org/show_bug.cgi?id=198197

The community has also reported CPU throttling statistics that look too high, and a patch to fix the problem was not submitted until May 2019.