[夜莺监控系列4]夜莺生产环境部署-2

最新推荐文章于 2024-04-22 18:16:56 发布

Rock-Wang

最新推荐文章于 2024-04-22 18:16:56 发布

阅读量1k

点赞数 1

文章标签：运维开发

本文链接：https://blog.csdn.net/weixin_42924568/article/details/132072399

版权

继续跟进上一章节的内容，上一章节讲到了n9e生产环境安装，本章节主要是讲 exporter的安装、出图、告警。

下面就开始进行常用指标采集工具安装、告警、出图~

1 主机监控

这里是用的n9e官方推荐的categraf。

1.1 安装、出图

上一章节里，已经把主机监控的categraf工具的安装已经讲过了，出图也告知直接导入n9e自带的图(名称为: linux_by_categraf)了，但是没有讲告警这一块，所以下面补充一下。

1.2 告警

机器负载-CPU较高

名称: 磁盘根分区使用率较高
PromSQL: disk_used_percent{path="/"} > 80
alert_type=disk
level: 2

在这里插入图片描述

磁盘根分区使用率较高

名称: 磁盘根分区使用率较高
PromSQL: disk_used_percent{path="/"} > 80
alert_type=disk
level: 2

监控对象失联

名称: 监控对象失联
PromSQL: avg_over_time(up{source="categraf"}[1m]) < 1
alert_type=host
level: 1

机器负载-uptime高

名称: 机器负载-uptime高
PromSQL: system_load1 > 1000
alert_type=cpu
level: 1

机器负载-iowait使用率

名称: 机器负载-iowait使用率
PromSQL: avg_over_time(cpu_usage_iowait[5m]) > 90
alert_type=io
level: 1

2 k8s监控

我们服务在运行过程中，想了解服务运行状态，pod有没有重启，伸缩有没有成功，pod的状态是怎么样的等，这时就需要kube-state-metrics，它主要关注deployment,、node 、 pod等内部对象的状态。而metrics-server 主要用于监测node,pod等的CPU，内存，网络等系统指标。
所以，kube-state-metrics算是k8s资源监控的一个很棒的工具，这里就安装它了。

2.1 安装

2.1.1 下载chart

大家可以直接到 https://github.com/prometheus-community/helm-charts/ 里面去找所有的charts包

# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
prometheus-community            https://prometheus-community.github.io/helm-charts

# helm search  repo  prometheus-community
NAME                                                    CHART VERSION   APP VERSION     DESCRIPTION
prometheus-community/alertmanager                       0.27.0          v0.25.0         The Alertmanager handles alerts sent by client ...
prometheus-community/alertmanager-snmp-notifier         0.1.0           v1.4.0          The SNMP Notifier handles alerts coming from Pr...
prometheus-community/jiralert                           1.2.0           v1.3.0          A Helm chart for Kubernetes to install jiralert
prometheus-community/kube-prometheus-stack              45.9.1          v0.63.0         kube-prometheus-stack collects Kubernetes manif...
prometheus-community/kube-state-metrics                 5.3.0           2.8.2           Install kube-state-metrics to generate and expo...
prometheus-community/prom-label-proxy                   0.2.0           v0.6.0          A proxy that enforces a given label in a given ...
prometheus-community/prometheus                         20.2.0          v2.43.0         Prometheus is a monitoring system and time seri...
prometheus-community/prometheus-adapter                 4.1.1           v0.10.0         A Helm chart for k8s prometheus adapter
prometheus-community/prometheus-blackbox-exporter       7.7.0           0.23.0          Prometheus Blackbox Exporter

# helm repo update
# helm fetch prometheus-community/kube-state-metrics --version 5.3.0

2.1.2 安装

# tar xf kube-state-metrics-5.3.0.tgz
# cd kube-state-metrics
# helm upgrade -i kube-state-mertics --namespace=monitoring .

2.1.3 配置prometheus接入kube-state-metrics

将kube-state-metrics的地址添加到prometheus中用于被监控到指标

# cd nightingale
# vim templates/prometheus/configmap.yaml
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'kube-state-metrics'
        static_configs:
        - targets: ['kube-state-metrics:8080']

重新更新服务后，重启 nightingale-prometheus-0 pod用于重载configmap中的prometheus.yaml ，然后到prometheus的targets中查看信息(如下图)

在这里插入图片描述

2.2 k8s pod出图

这个是n9e提供的官方pod图，总体还不错，但也有一些地方值得改进的，链接如下：

https://github.com/flashcatcloud/categraf/blob/main/k8s/pod-dash.json

后面我应该会单独出一章各种监控图的优化模板及优化思路，着急的同学可以留言或私聊我，我直接发给你哦~

2.3 k8s告警配置

关于kube-state-metrics的告警指标，我直接就引用了现成的，详细请见：https://samber.github.io/awesome-prometheus-alerts/rules.html#kubernetes
如果有这块告警不明白的，可以私聊找我~
下面提供一些比较常用的告警：

K8sNodeNotReady

名称：k8s-node-notReady
PromSQL: kube_node_status_condition{condition="Ready",status="true"} == 0
alert_type=k8s_node
level: 1

在这里插入图片描述

K8sMemoryPressure

名称：k8s-node-memoryPressure
PromSQL: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
alert_type=k8s_memory
level: 1

K8sDiskPressure

名称：k8s-node-diskPressure
PromSQL: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
alert_type=k8s_disk
level: 1

K8sNetworkUnavailable

名称：k8s-node-networkUnavailable
PromSQL: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1
alert_type=k8s_network
level: 1

K8sContainerOomKiller

名称：k8s-pod-oom
PromSQL: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
alert_type=k8s_pod
level: 3

K8sPodCrashLooping

名称：k8s-pod-nothealthy
PromSQL: sum (avg_over_time(kube_pod_status_phase{phase=~"Pending|Unknown|Failed", pod!~"helm-operation.*"}[5m]))  by (namespace, pod,phase) > 0
alert_type=k8s_pod
level: 2
执行频率：30 秒
持续时长：150 秒

K8sClientCertificateExpiresSoon

名称：k8s-client-certificateExpiresSoon
PromSQL: apiserver_client_certificate_expiration_seconds_count{job="categraf"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="categraf"}[5m]))) < 24*60*60
alert_type=k8s_ca
level: 1

3 显卡监控

3.1 安装

3.1.1 下载chart

# helm fetch http://utkuozdemir.org/helm-charts/nvidia-gpu-exporter-0.3.1.tgz
# tar xf nvidia-gpu-exporter-v0.3.1.tgz
# cd nvidia-gpu-exporter

3.1.2 配置

添加hostNetWork配置，这样便于被Prometheus读取指标

# vim templates/daemonset.yaml
spec:
... ...
  template:
  ... ...
    spec:
      hostNetwork: true

3.1.3 安装

# helm upgrade -i nvidia-exporter --namespace=monitoring .

3.1.4 配置prometheus接入nvidia-exporter

因为前面配置了hostNetwork，这里直接通过 ip:port 进行指标获取了
# cd nighting/templates/prometheus
# vim configmap.yaml
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'kube-state-metrics'
        static_configs:
        - targets: ['kube-state-metrics:8080']
      - job_name: 'nvidia-gpu-exporter'
        static_configs:
        - targets: ['172.18.5.2:9835', '172.18.5.3:9835', '172.18.5.4:9835', '172.18.5.5:9835',  '172.18.5.6:9835']

重新更新服务后，重启 nightingale-prometheus-0 pod用于重载configmap中的prometheus.yaml

3.1.5 target验证

在这里插入图片描述

3.2 出图

我这里直接导入了Grafana的 14574 这个图，但这个json模板有点问题，需要简单改一下，具体如下：

1）配置模板的名称。

"name": "nvidia-gpu",  # 修改模板的名称

1. 新增instance这个变量，用于gpu下拉选择时的sql查询条件。

            {
                "type": "query",
                "name": "instance", # 此段json是新增的，原因是需要先选择instance机器，再到下面选择对应的gpu的uuid。
                "definition": "label_values(nvidia_smi_index, instance)",
                "allOption": false,
                "multi": false,
                "reg": ""
            },

3）gpu变量中，增加instance这个变量查询条件

"definition": "label_values(nvidia_smi_index{instance=\"$instance\"}, uuid)",

4）将Name图中的 nvidia_smi_gpu_info 改为 nvidia_smi_index，不然拿不到显卡的ID

        "panels": [
            {
                "version": "2.0.0",
                "id": "c15d9abb-4a9d-4a5a-ad05-ec79ed260d83",
                "type": "stat",
                "name": "Name",  # 大盘图中，显示当前机器的显卡的ID，如0/1/2/3
                "description": "The official product name of the GPU. This is an alphanumeric string. For all products.",
                "links": [],
                "layout": {
                    "h": 3,
                    "w": 4,
                    "x": 0,
                    "y": 0,
                    "i": "c15d9abb-4a9d-4a5a-ad05-ec79ed260d83"
                },
                "targets": [
                    {
                        "refId": "A",
                        "expr": "nvidia_smi_index{uuid=\"$gpu\"}",  # 将 nvidia_smi_gpu_info 改为 nvidia_smi_index，不然拿不到显卡的ID
                        "legend": "{{name}}"
                    }
                ],

后面我应该会单独出一章各种监控图的优化模板及优化思路，着急的同学可以留言或私聊我，我直接发给你哦~

效果图如下

3.3 告警

显卡状态检查

名称: 未检测到gpu
PromSQL: nvidia_smi_failed_scrapes_total > 0
alert_type=gpu
level: 2

gpu温度

名称: gpu温度高
PromSQL: nvidia_smi_temperature_gpu > 80
alert_type=gpu
level: 2

gpu使用率

名称: gpu使用率高
PromSQL: nvidia_smi_utilization_gpu_ratio * 100 > 80
alert_type=gpu
level: 2

gpu内存使用率

名称: gpu内存高
PromSQL: nvidia_smi_utilization_memory_ratio * 100 > 80
alert_type=gpu
level: 2

好了，这一章节的内容也挺多的，但有一些指标（如：ceph、minio、juicefs等）的监控还没讲到。
以及n9e的ibex自愈、blackbox的http/证书等监测、prometheus-gateway的巧用等，就放到后面的章节再说了。

感谢大家的关注，如有任何问题请留言，我会第一时间回复，再次感谢么么哒~

Rock-Wang

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
1
评论
[夜莺监控系列4]夜莺生产环境部署-2

继续跟进上一章节的内容，上一章节讲到了n9e生产环境安装，本章节主要是讲 exporter的安装、出图、告警。下面就开始进行常用指标采集工具安装、告警、出图~
复制链接

扫一扫

[夜莺监控系列4]夜莺生产环境部署-2

1 主机监控

1.1 安装、出图

1.2 告警

2 k8s监控

2.1 安装

2.1.1 下载chart

2.1.2 安装

2.1.3 配置prometheus接入kube-state-metrics

2.2 k8s pod出图

2.3 k8s告警配置

3 显卡监控

3.1 安装

3.1.1 下载chart

3.1.2 配置

3.1.3 安装

3.1.4 配置prometheus接入nvidia-exporter

3.1.5 target验证

3.2 出图

3.3 告警

“相关推荐”对你有帮助么？