Prometheus etcd Monitoring

富士康质检员张全蛋

Last modified: 2024-05-14 14:42:59

Categories: ETCD, Prometheus. Tags: etcd

First published: 2022-08-05 21:52:28

Original link: c.com

etcd/metrics.md at v3.2.17 · etcd-io/etcd · GitHub

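Monitoring etcd helps you get more out of it, especially when you need to track down performance problems. The etcd service exposes a native metrics endpoint. The KubeSphere monitoring system provides highly graphical, responsive dashboards that display this native data.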

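| Metric | Description |
| --- | --- |
| Service status: has a leader | Whether the member has a leader. A member without a leader is completely unavailable. If no member in the cluster has a leader, the whole cluster is completely unavailable. |
| Service status: leader changes in 1 hour | The total number of leader changes that cluster members observed over the past hour. Frequent leader changes significantly degrade etcd performance. They also suggest that the leader is unstable, either because of network connectivity problems or because the etcd cluster is overloaded. |
| DB size | The size of etcd's underlying database, in MiB. The chart shows the average database size across etcd members. |
| Client traffic | The total traffic sent to gRPC clients and the total traffic received from gRPC clients. For more information about this metric, see etcd Network. |
| gRPC streaming messages | The server-side receive and send rates of gRPC streaming messages, which show whether large-scale reads or writes are in progress in the cluster. For more information about this metric, see go-grpc-prometheus. |
| WAL fsync duration | The latency of fsync calls made by the WAL. etcd calls `wal_fsync` when it persists log entries to disk, before it applies them. For more information about this metric, see etcd Disk. |
| DB sync duration | The distribution of backend commit latency. etcd calls `backend_commit` when it commits its most recent incremental snapshot to disk. High disk latency (a long WAL fsync duration or DB sync duration) usually points to a disk problem and can cause high request latency or cluster instability. For details, see etcd Disk. |
| Raft proposals: commit rate | The rate at which consensus proposals are committed. In a healthy cluster, the committed total should keep increasing over time. Several healthy members may report different committed totals at the same moment. A persistently large lag between a single member and its leader indicates that the member is slow or unhealthy. |
| Raft proposals: apply rate | The rate at which consensus proposals are applied. The etcd server applies every committed proposal asynchronously. The difference between committed and applied proposals should be small (only a few thousand even under high load). A difference that keeps growing indicates that the etcd server is overloaded, which can happen with expensive queries such as large range queries or large txn operations. |
| Raft proposals: failure rate | The rate at which proposals fail. Failures usually have one of two causes: temporary failures related to a leader election, or longer downtime caused by a loss of quorum. |
| Raft proposals: pending | The number of proposals currently waiting to be committed. A rising value indicates high client load or a member that cannot commit proposals. For all Raft proposal metrics, the dashboard currently shows the average across etcd members. For details, see etcd Server. |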
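
The dashboard metrics in the table can also be queried directly. The following is a minimal PromQL sketch, assuming the scrape job is labelled job="etcd" (as in the queries later in this post) and that your etcd version exposes the database size as etcd_mvcc_db_total_size_in_bytes. Older releases, such as the v3.2 line linked above, publish it under etcd_debugging_mvcc_db_total_size_in_bytes instead, so check your own /metrics output first. The 5m window is an arbitrary choice.

```promql
# Average database size across members, in MiB (the same view as the dashboard).
avg(etcd_mvcc_db_total_size_in_bytes{job="etcd"}) / 1024 / 1024

# Client traffic: bytes per second received from and sent to gRPC clients.
sum(rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m]))
sum(rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m]))

# gRPC messages per second received and sent by the server (go-grpc-prometheus metrics).
sum(rate(grpc_server_msg_received_total{job="etcd"}[5m]))
sum(rate(grpc_server_msg_sent_total{job="etcd"}[5m]))
```

Each expression is a separate query; run them one at a time in the Prometheus expression browser.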

Monitoring etcd: What to look for?

Disclaimer: etcd metrics might differ between Kubernetes versions. Here, we used Kubernetes 1.15. You can check the metrics available for your version in the Kubernetes repo (link for the 1.15.3 version).

etcd node availability: An obvious error scenario for any cluster is that you lose one of the nodes. The cluster will continue operating, but it’s probably a good idea to receive an alert, diagnose, and recover before you continue losing nodes and risk facing the next scenario, total service failure. The simplest way to check this is with a PromQL query:

sum(up{job="etcd"})

This returns the number of etcd nodes that are currently up. If the value is lower than the number of members you expect, at least one node is down. In the worst case the value is 0, which means no etcd node is reachable.
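
To see which node is down rather than only how many are up, the same up metric can be filtered. This is a sketch only: the job="etcd" label comes from the query above, and the expected member count of 3 is an assumption that you should replace with your own cluster size.

```promql
# List the etcd targets that are currently down.
up{job="etcd"} == 0

# Returns a result only when fewer members are up than expected (3 is an assumed cluster size).
sum(up{job="etcd"}) < 3
```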

etcd has a leader: One key metric is to know if all nodes have a leader. If one node does not have a leader, this node will be unavailable. And if all nodes have no leader, then the cluster will become totally unavailable. To check this, there is a metric that indicates whether a node has a leader.

# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
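
Because the gauge is 1 when a leader exists and 0 when it does not, a simple comparison is enough to find affected members. A minimal sketch, assuming the scrape job is labelled job="etcd":

```promql
# Members that currently report no leader (each one is unavailable).
etcd_server_has_leader{job="etcd"} == 0

# Returns a result only when every member reports no leader (the cluster is unavailable).
max(etcd_server_has_leader{job="etcd"}) == 0
```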

etcd leader changes: The leader can change over time, but too frequent changes can impact the performance of the etcd itself. This can also be a signal of the leader being unstable because of connectivity problems, or maybe etcd has too much load.

# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 1
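
Since this metric is a counter, the number of changes in a time window is obtained with increase(). The sketch below counts leader changes per member over the last hour. The job="etcd" label and the threshold of 3 are assumptions for illustration, not values recommended by the source.

```promql
# Leader changes observed by each member over the last hour.
increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h])

# Alert-style condition: more than 3 changes in an hour (assumed threshold).
increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
```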

Consensus proposal: A proposal is a request (for example, a write request or a configuration change request) that must go through the Raft protocol. There are four proposal metrics: committed, applied, pending, and failed. All four can reveal problems that etcd is facing, but the failed metric is the most important. Failed proposals usually have one of two causes: either a leader election is failing or the cluster has lost quorum.

For example, to alert when more than five consensus proposals have failed over a 15-minute period, we could use the following expression:

increase(etcd_server_proposals_failed_total{job=~"etcd"}[15m]) > 5

# HELP etcd_server_proposals_applied_total The total number of consensus proposals applied.
# TYPE etcd_server_proposals_applied_total gauge
etcd_server_proposals_applied_total 1.3605153e+07
# HELP etcd_server_proposals_committed_total The total number of consensus proposals committed.
# TYPE etcd_server_proposals_committed_total gauge
etcd_server_proposals_committed_total 1.3605153e+07
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 0
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0
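
The committed, applied, and pending values are most useful when compared with each other. The sketch below assumes the job="etcd" label. The threshold of 5 for pending proposals is an illustrative assumption only.

```promql
# Per-member backlog: proposals committed but not yet applied.
# This should stay small; a value that keeps growing suggests an overloaded member.
etcd_server_proposals_committed_total{job="etcd"}
  - etcd_server_proposals_applied_total{job="etcd"}

# How far each member's committed total lags behind the most advanced member.
scalar(max(etcd_server_proposals_committed_total{job="etcd"}))
  - etcd_server_proposals_committed_total{job="etcd"}

# Members with a queue of pending proposals (assumed threshold).
etcd_server_proposals_pending{job="etcd"} > 5
```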

Disk sync duration: Because etcd stores all of the important state of a Kubernetes cluster, the speed at which it commits changes to disk and the health of the underlying storage are key indicators of whether etcd is working properly. High disk sync latency suggests a disk problem and can make the cluster unavailable. The metrics that show this are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds.

# HELP etcd_disk_backend_commit_duration_seconds The latency distributions of commit called by backend.
# TYPE etcd_disk_backend_commit_duration_seconds histogram
etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 0
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 5.402102e+06
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 6.0471e+06
...
etcd_disk_backend_commit_duration_seconds_sum 11017.523900176226
etcd_disk_backend_commit_duration_seconds_count 6.157407e+06
# HELP etcd_disk_wal_fsync_duration_seconds The latency distributions of fsync called by wal.
# TYPE etcd_disk_wal_fsync_duration_seconds histogram
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 4.659349e+06
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 7.276276e+06
...
etcd_disk_wal_fsync_duration_seconds_sum 11580.35429902582
etcd_disk_wal_fsync_duration_seconds_count 8.786736e+06

To judge whether backend commit latency is acceptable, look at how the commit durations are distributed across the histogram buckets. The following query returns the latency below which 99% of backend commits completed over the last five minutes:

histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~"etcd"}[5m]))
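
The same approach applies to the WAL fsync histogram. The sketch below assumes the job="etcd" label and an instance label on each series. It aggregates the buckets by instance and le so that each member gets one 99th-percentile value. The thresholds of 0.01 and 0.025 seconds are illustrative assumptions to tune for your own storage, not values taken from the source.

```promql
# 99th-percentile WAL fsync latency per member over the last 5 minutes.
histogram_quantile(
  0.99,
  sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m]))
) > 0.01

# 99th-percentile backend commit latency per member over the last 5 minutes.
histogram_quantile(
  0.99,
  sum by (instance, le) (rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m]))
) > 0.025
```

Keeping le in the by clause matters: histogram_quantile() needs the bucket boundaries to interpolate the quantile.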
