etcd Monitoring with Prometheus + Grafana

etcd metrics

etcd exposes Prometheus metrics on port 2379; they can be accessed at http://etcd-ip:2379/metrics
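On a live member you would fetch this endpoint with `curl http://etcd-ip:2379/metrics`, which returns Prometheus's text exposition format. Below is a minimal sketch of parsing that format, run over a short captured sample rather than a live member (the parser ignores metric labels, which many real etcd series do carry):

```python
# Minimal parser for the Prometheus text exposition format, run over a
# short sample captured from an etcd /metrics response. Against a live
# member you would feed it the body of http://etcd-ip:2379/metrics.
SAMPLE = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 0
"""

def parse_metrics(text):
    """Return {metric_name: float_value}, skipping # HELP / # TYPE comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Each sample line is "<name> <value>" (labels omitted in this sample).
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

print(parse_metrics(SAMPLE))
```

A quick way to check cluster health by hand: `etcd_server_has_leader` should be `1` on every member.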

Scraping etcd metrics with Prometheus

The configuration file is as follows; Prometheus scrapes metrics from etcd over port 2379:

cat > prometheus-etcd.yaml <<EOF
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: etcd
    metrics_path: '/metrics'
    static_configs:
    - targets: ['10.240.0.32:2379','10.240.0.33:2379','10.240.0.34:2379']
EOF
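Each entry in `static_configs` is combined with the scheme (http by default) and `metrics_path` to form the scrape URL. A quick sketch of the URLs the config above produces:

```python
# How the scrape config translates into scrape URLs: Prometheus combines
# the (default http) scheme, each static target, and metrics_path.
targets = ["10.240.0.32:2379", "10.240.0.33:2379", "10.240.0.34:2379"]
metrics_path = "/metrics"

urls = [f"http://{t}{metrics_path}" for t in targets]
for u in urls:
    print(u)
# → http://10.240.0.32:2379/metrics, and likewise for the other two members
```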

Example of injecting the configuration file via a Kubernetes ConfigMap:


apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-conf-etcd
  namespace: prometheus
  labels:
    app: prometheus-etcd
data:
  prometheus.yml: |-
    global:
      scrape_interval:     10s
      evaluation_interval: 10s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager-service.prometheus:9093"]
    rule_files:
    - "/etc/prometheus/rules/custom.rule"
    scrape_configs:
      - job_name: etcd
        metrics_path: '/metrics'
        static_configs:
        - targets: ['172.24.31.25:2379','172.24.31.24:2379','172.24.31.22:2379']

Once the configuration is in place and Prometheus has been restarted, the new job and its metrics are visible.

Configuring Prometheus alerting rules

The rules for Prometheus 2.x are as follows:

# These rules were manually synced from https://github.com/etcd-io/etcd/blob/master/contrib/mixin/mixin.libsonnet
groups:
- name: etcd
  rules:
  - alert: etcdInsufficientMembers
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value }}).'
    expr: |
      sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
    for: 3m
    labels:
      severity: critical
  - alert: etcdNoLeader
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'
    expr: |
      etcd_server_has_leader{job=~".*etcd.*"} == 0
    for: 1m
    labels:
      severity: critical
  - alert: etcdHighNumberOfLeaderChanges
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour.'
    expr: |
      rate(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) > 3
    for: 15m
    labels:
      severity: warning
  - alert: etcdHighNumberOfFailedGRPCRequests
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
    expr: |
      100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
        /
      sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
        > 1
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighNumberOfFailedGRPCRequests
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
    expr: |
      100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])) BY (job, instance, grpc_service, grpc_method)
        /
      sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
        > 5
    for: 5m
    labels:
      severity: critical
  - alert: etcdGRPCRequestsSlow
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": gRPC requests to {{ $labels.grpc_method }} are taking {{ $value }}s on etcd instance {{ $labels.instance }}.'
    expr: |
      histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_type="unary"}[5m])) by (job, instance, grpc_service, grpc_method, le))
      > 0.15
    for: 10m
    labels:
      severity: critical
  - alert: etcdMemberCommunicationSlow
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": member communication with {{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance }}.'
    expr: |
      histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
      > 0.15
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighNumberOfFailedProposals
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures within the last hour on etcd instance {{ $labels.instance }}.'
    expr: |
      rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
    for: 15m
    labels:
      severity: warning
  - alert: etcdHighFsyncDurations
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    expr: |
      histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
      > 0.5
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighCommitDurations
    annotations:
      message: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
    expr: |
      histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
      > 0.25
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighNumberOfFailedHTTPRequests
    annotations:
      message: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.'
    expr: |
      sum(rate(etcd_http_failed_total{job=~".*etcd.*", code!="404"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~".*etcd.*"}[5m]))
      BY (method) > 0.01
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighNumberOfFailedHTTPRequests
    annotations:
      message: '{{ $value }}% of requests for {{ $labels.method }} failed on etcd instance {{ $labels.instance }}.'
    expr: |
      sum(rate(etcd_http_failed_total{job=~".*etcd.*", code!="404"}[5m])) BY (method) / sum(rate(etcd_http_received_total{job=~".*etcd.*"}[5m]))
      BY (method) > 0.05
    for: 10m
    labels:
      severity: critical
  - alert: etcdHTTPRequestsSlow
    annotations:
      message: etcd instance {{ $labels.instance }} HTTP requests to {{ $labels.method }} are slow.
    expr: |
      histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m]))
      > 0.15
    for: 10m
    labels:
      severity: warning
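The first rule, etcdInsufficientMembers, encodes etcd's quorum requirement: with n members, the cluster needs floor(n/2) + 1 reachable members, which the PromQL expresses as `up_count < (n + 1) / 2`. The same arithmetic sketched in plain code:

```python
def etcd_insufficient_members(up_count, total_members):
    """Mirror of the PromQL condition
        sum(up == bool 1) < (count(up) + 1) / 2
    i.e. alert when the number of reachable members drops below quorum."""
    return up_count < (total_members + 1) / 2

# 3-member cluster: losing one member is tolerable, losing two is not.
print(etcd_insufficient_members(2, 3))  # → False (quorum of 2 still held)
print(etcd_insufficient_members(1, 3))  # → True  (quorum lost)
```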

Example of injecting the rules via a Kubernetes ConfigMap:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules-etcd
  namespace: prometheus
  labels:
    instance: "etcd"
    severity: "critical"
    app: prometheus-etcd
data:
  custom.rule: |
    groups:
    - name: etcd
      rules:
      - alert: etcdInsufficientMembers
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value
            }}).'
        expr: |
          sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"}) by (job) + 1) / 2)
        for: 3m
        labels:
          severity: critical
      - alert: etcdNoLeader
      ......remaining rules omitted......

After Prometheus reloads the configuration, these new rules appear on the Alerts page.
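Several of the latency rules apply `histogram_quantile` to cumulative `_bucket` series such as `etcd_disk_wal_fsync_duration_seconds_bucket`. A simplified sketch of the estimate it computes, using linear interpolation inside the first bucket whose cumulative count reaches the target rank (the bucket bounds below are illustrative, not etcd's actual ones):

```python
def histogram_quantile(q, buckets):
    """Simplified histogram_quantile: buckets is a list of
    (upper_bound, cumulative_count) pairs sorted by bound, ending with
    float('inf'). Linearly interpolates inside the matching bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Prometheus caps the result at the highest finite bound.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Illustrative fsync duration buckets: 90 ops <= 1ms, 99 <= 10ms, 100 total.
buckets = [(0.001, 90), (0.01, 99), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # ≈ 0.01, i.e. p99 ≈ 10ms
```

With the rule `etcdHighFsyncDurations`, this estimate firing above 0.5s for 10 minutes signals disk trouble.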

Grafana configuration

The official Grafana dashboard template for etcd 3.4 (I fought with this template for quite a while and never got it working, so in the end I used a dashboard from Grafana's official site):

https://etcd.io/docs/v3.4/op-guide/grafana.json

I chose dashboard #3070 instead: import the dashboard, select the Prometheus data source, and the panels are generated.
