The CPUThrottlingHigh Alert and How Its Metrics Are Collected

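This article explains how the CPUThrottlingHigh alert in Prometheus is triggered and how its metrics are collected, shows how cgroup statistics indicate whether a container is being throttled, and examines an apparent contradiction observed in a real case.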


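Background

Prometheus CPUThrottlingHigh: metric collection and alert rule

Prometheus takes the increase of container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total over a 5-minute window and computes the throttling percentage from the two deltas. The alert fires when that percentage stays above the threshold (25% in the rule below) for 15 minutes.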

- alert: CPUThrottlingHigh
  annotations:
    message: '{{ printf "%0.0f" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
    runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh
  expr: |
    100 * sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
      /
    sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
      > 25
  for: 15m
  labels:
    severity: warning
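
The arithmetic behind the alert expression can be sketched in a few lines of Python. This is only an illustration of the ratio, not how Prometheus implements increase() (which also handles counter resets and extrapolation). The function name and the sample counter values are hypothetical.

```python
def throttling_percent(throttled_start, throttled_end, periods_start, periods_end):
    """Percentage of elapsed CFS periods in which the container was throttled."""
    periods = periods_end - periods_start
    if periods <= 0:
        return 0.0  # no enforcement periods elapsed in the window
    return 100.0 * (throttled_end - throttled_start) / periods


THRESHOLD = 25.0  # the "> 25" in the alert expression

# Hypothetical counter samples taken at the start and end of a 5-minute window.
pct = throttling_percent(1000, 1810, 20000, 23000)
print(f"{pct:.0f}% throttled, condition met: {pct > THRESHOLD}")
```

The condition still has to hold for the whole "for: 15m" duration before the alert actually fires.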

Prometheus CPU usage: metric collection

container_cpu_usage_seconds_total is scraped once per minute, and CPU usage is derived with the following query:

sum by(cluster_id, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on(cluster_id, namespace, pod) group_left(node, host_ip) max by(cluster_id, namespace, pod, node, host_ip) (kube_pod_info)
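
What rate() over container_cpu_usage_seconds_total yields is an average number of cores over the window. A minimal sketch of that calculation, with a hypothetical function name and made-up sample values (including the 0.2-core limit):

```python
def cpu_cores_used(usage_start_s, usage_end_s, window_s):
    """Average number of cores used: CPU-seconds consumed per wall-clock second."""
    if window_s <= 0:
        raise ValueError("window must be positive")
    return (usage_end_s - usage_start_s) / window_s


# Hypothetical samples: 3 CPU-seconds consumed over a 300-second window.
cores = cpu_cores_used(5000.0, 5003.0, 300.0)
limit_cores = 0.2  # assumed limit, for illustration only
print(f"{cores:.3f} cores, {100 * cores / limit_cores:.0f}% of the limit")
```

Because the result is an average over the whole window, short bursts inside the window are not visible in it.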

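The apparent contradiction

The user's pod kube-proxy-9pj9j has a CPU limit of 20% and its usage has stayed below 5% the whole time, so why does it report 27% throttling?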
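
Separately from any accounting problem in the kernel, bursty load is one way a low average usage and a high throttling ratio can coexist. The toy model below is a sketch under stated assumptions, not the kernel's actual accounting: it assumes a 100 ms period with a 20 ms quota, treats the container as a single stream of CPU demand per period, and counts every period regardless of whether the container was runnable. All names and numbers are hypothetical.

```python
def simulate(demands_ms, quota_ms=20.0, period_ms=100.0):
    """Return (throttled_percent, average_usage_percent_of_one_core)."""
    backlog = 0.0   # work left over from earlier periods
    used = 0.0      # total CPU time actually consumed
    throttled = 0   # periods that ended with work still pending
    for demand in demands_ms:
        wanted = backlog + demand
        ran = min(wanted, quota_ms)   # cannot exceed the quota in one period
        backlog = wanted - ran
        used += ran
        if backlog > 0:
            throttled += 1
    periods = len(demands_ms)
    return 100.0 * throttled / periods, 100.0 * used / (periods * period_ms)


# A 24 ms burst every fourth period, idle otherwise.
bursty = [24.0, 0.0, 0.0, 0.0] * 25
# The same total work spread evenly across all periods.
steady = [6.0] * 100

print("bursty:", simulate(bursty))
print("steady:", simulate(steady))
```

In this model both workloads average 6% of one core, but the bursty one is throttled in 25% of periods while the steady one is never throttled.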

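Analysis

Reading the code confirms where the Prometheus data comes from:

Source of container_cpu_cfs_throttled_periods_total: the nr_throttled field of cpu.stat, read under the cgroup cpuacct path (cpu.stat is provided by the cpu controller, which is typically co-mounted with cpuacct)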

/sys/fs/cgroup/cpuacct/kubepods/burstable/pod89106a5c-7ede-11ea-b093-fa163e23cb69/bff06c9fa9a6dd0b0db642119f094c630b91932ea1a0556e750bc785626fb03f/cpu.stat

Source of container_cpu_cfs_periods_total: the nr_periods field of the same cpu.stat file

cpu.stat reports CPU time statistics using the following values:
nr_periods: number of period intervals (as specified in cpu.cfs_period_us) that have elapsed.
nr_throttled: number of times tasks in a cgroup have been throttled (that is, not allowed to run because they have exhausted all of the available time as specified by their quota).
throttled_time: the total time duration (in nanoseconds) for which tasks in a cgroup have been throttled.

See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs
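
To check these counters by hand, the contents of cpu.stat can be parsed as simple key/value pairs. The sketch below works on a hypothetical sample string with made-up values; on a real node the text would be read from the container's cpu.stat file.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Hypothetical file contents; the numbers are invented for illustration.
sample = """nr_periods 5000
nr_throttled 1350
throttled_time 52000000000
"""

stats = parse_cpu_stat(sample)
lifetime_pct = 100.0 * stats["nr_throttled"] / stats["nr_periods"]
print(stats)
print(f"throttled in {lifetime_pct:.0f}% of periods since the cgroup was created")
```

Note that this ratio covers the whole lifetime of the cgroup, whereas the alert only looks at the change in both counters over a 5-minute window.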

Source of container_cpu_usage_seconds_total: cpuacct.usage in the cgroup cpuacct subsystem

/sys/fs/cgroup/cpuacct/kubepods/burstable/pod89106a5c-7ede-11ea-b093-fa163e23cb69/bff06c9fa9a6dd0b0db642119f094c630b91932ea1a0556e750bc785626fb03f/cpuacct.usage

/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
by this group which is essentially the CPU time obtained by all the tasks
in the system.

See: https://www.kernel.org/doc/Documentation/cgroup-v1/cpuacct.txt

CPU quota/period configuration in the pod's cgroup:

cat cpu.cfs_quota_us
cat cpu.cfs_period_us
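
The two values translate into a CPU limit as quota divided by period. A minimal sketch, assuming a quota of 20000 µs and a period of 100000 µs (values chosen to match the 20% limit described earlier, not read from the actual pod); the function name is hypothetical.

```python
def cpu_limit_cores(quota_us, period_us):
    """Return the CPU limit in cores, or None when no quota is set."""
    if quota_us < 0:
        return None  # cpu.cfs_quota_us of -1 means the cgroup is not limited
    return quota_us / period_us


# Assumed values for illustration: 20 ms of CPU time per 100 ms period.
limit = cpu_limit_cores(20000, 100000)
print(f"limit: {limit} cores ({limit:.0%} of one core)")
```

With such a configuration the container may run for at most 20 ms in each 100 ms period, no matter how idle it was in the periods before.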

Community discussion:

https://devops.stackexchange.com/questions/6494/prometheus-alert-cputhrottlinghigh-raised-but-monitoring-does-not-show-it

https://github.com/kubernetes/kubernetes/issues/67577

https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1

https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108

https://bugzilla.kernel.org/show_bug.cgi?id=198197

The community has also reported that CPU throttling statistics come out too high, and a patch to fix this was only submitted in May 2019.
