Visualizing Metrics with Grafana, Alertmanager Introduction and Installation, Prometheus Alerting Rules, and Alertmanager Email and WeCom (WeChat Work) Alerts

I. Visualizing Monitoring Metrics with Grafana

1. Install Grafana with Helm

helm pull bitnami/grafana --untar

Edit values.yaml

vi grafana/values.yaml ## set the storageClass
storageClass: "nfs-client" # appears twice; set both

[root@aminglinux01 grafana]# cat values.yaml | grep storageClass:
  storageClass: "nfs-client"
  storageClass: "nfs-client"
[root@aminglinux01 grafana]# 

Install

cd grafana
helm install grafana .

[root@aminglinux01 grafana]# helm install grafana .
NAME: grafana
LAST DEPLOYED: Tue Aug  6 03:33:06 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: grafana
CHART VERSION: 11.3.13
APP VERSION: 11.1.3

** Please be patient while the chart is being deployed **

1. Get the application URL by running these commands:
    echo "Browse to http://127.0.0.1:8080"
    kubectl port-forward svc/grafana 8080:3000 &

2. Get the admin credentials:

    echo "User: admin"
    echo "Password: $(kubectl get secret grafana-admin --namespace default -o jsonpath="{.data.GF_SECURITY_ADMIN_PASSWORD}" | base64 -d)"
# Note: Do not include grafana.validateValues.database here. See https://github.com/bitnami/charts/issues/20629


WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
  - grafana.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

⚠ SECURITY WARNING: Original containers have been substituted. This Helm chart was designed, tested, and validated on multiple platforms using a specific set of Bitnami and Tanzu Application Catalog containers. Substituting other containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.

Substituted images detected:
  - registry.cn-hangzhou.aliyuncs.com/*/grafana:11.1.3-debian-12-r0
  - registry.cn-hangzhou.aliyuncs.com/*/os-shell:12-debian-12-r27

Set up port forwarding

[root@aminglinux01 grafana]# kubectl get pod -owide| grep grafana
grafana-57bb68b4b5-4rwhj                    1/1     Running            0              11m     10.18.206.201     aminglinux02   <none>           <none>
[root@aminglinux01 grafana]# kubectl get svc | grep grafana
grafana                      ClusterIP      10.15.242.254   <none>           3000/TCP    ### no node port is exposed, so the Service is reachable only from inside the cluster

nohup kubectl port-forward svc/grafana --address 192.168.100.151 8087:3000 &

[root@aminglinux01 ~]# nohup: ignoring input and appending output to 'nohup.out'

[root@aminglinux01 ~]# 

Alternatively, change the Service type in values.yaml to NodePort:

[root@aminglinux01 grafana]# kubectl get svc | grep grafana
grafana                      NodePort       10.15.68.128    <none>           3000:31513/TCP                                                                     12m
[root@aminglinux01 grafana]#  

Retrieve the admin password

[root@aminglinux01 ~]# echo "Password: $(kubectl get secret grafana-admin --namespace default -o jsonpath="{.data.GF_SECURITY_ADMIN_PASSWORD}" | base64 -d)"
Password: DOMMoUZS4v
[root@aminglinux01 ~]# 
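
Kubernetes stores Secret values base64-encoded, which is why the command above pipes the jsonpath output through base64 -d. The Python sketch below performs the same decoding step on a made-up Secret excerpt; the JSON string and the value in it are illustrative assumptions, not output from this cluster.

```python
import base64
import json

# Hypothetical excerpt of `kubectl get secret grafana-admin -o json`.
# The encoded value is a made-up example, not a real credential.
SECRET_JSON = '{"data": {"GF_SECURITY_ADMIN_PASSWORD": "YWRtaW4xMjM="}}'


def decode_secret_field(manifest: str, key: str) -> str:
    """Return the plain-text value of one base64-encoded Secret field."""
    data = json.loads(manifest)["data"]
    return base64.b64decode(data[key]).decode("utf-8")


password = decode_secret_field(SECRET_JSON, "GF_SECURITY_ADMIN_PASSWORD")
print("Password:", password)
```

Because the encoding is reversible without a key, anyone who can read the Secret can read the password; base64 here is a transport encoding, not protection.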

2. Access Grafana

Open http://192.168.100.151:8087 (the port-forward), or use the Service's NodePort: http://192.168.100.152:31513

Add Prometheus as a data source.
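
Besides the web UI, Grafana's HTTP API accepts new data sources via POST /api/datasources. The following is a minimal sketch that only builds such a request, assuming the data source is the Prometheus server from the earlier installation. The Prometheus address http://prometheus-server:80 is an assumption about the Service name and port in this cluster, and the password placeholder must be replaced with the value retrieved in step 1.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) a POST /api/datasources request for Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,  # address Grafana uses to reach Prometheus
        "access": "proxy",      # Grafana's backend proxies the queries
        "isDefault": True,
    }
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + credentials,
        },
        method="POST",
    )


request = build_datasource_request(
    "http://192.168.100.151:8087",
    "admin",
    "<password from step 1>",
    "http://prometheus-server:80",  # assumed Service name and port
)
print(request.get_method(), request.full_url)
# To actually create the data source: urllib.request.urlopen(request)
```

The request is built separately from sending so that the URL and payload can be inspected before anything touches the Grafana instance.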

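3. Import the node dashboard

Click Dashboards, click the "+" on the right, then click Import dashboard.

Enter dashboard ID 8919 or 1860 (both are node dashboards).

4. Import the nginx dashboard

Download the nginx dashboard file dashboard.json from https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/dashboard.json

Open http://192.168.100.151:8087/dashboard/import and import the JSON file.

A dashboard can also be imported by ID on the same page, for example ID 13332.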

II. Introducing and Installing Alertmanager

1. When Prometheus is installed with Helm, Alertmanager is installed along with it. Check the pod and the Service:

[root@aminglinux01 ~]# kubectl get pod -owide| grep aler
prometheus-alertmanager-0                   1/1     Running            0                14h     10.18.206.200     aminglinux02   <none>           <none>
[root@aminglinux01 ~]# kubectl get  svc| grep aler
prometheus-alertmanager      LoadBalancer   10.15.238.147   192.168.10.241   80:30682/TCP                                                                       14h
[root@aminglinux01 ~]# 

Open the Alertmanager UI:

http://192.168.100.152:30682/#/alerts

III. Configuring Prometheus Alerting Rules

A large collection of ready-made alerting rules: https://samber.github.io/awesome-prometheus-alerts/

1. vi prometheus_config.yaml

Find rules.yaml and change rules.yaml: '{}' to:
  rules.yaml: |
    groups:
    - name: hostStatsAlert
      rules:
      - alert: hostCpuUsageAlert
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) > 0.8
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 80% (current value: {{ $value }})"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal_bytes -node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
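
Both rules compare a ratio against a threshold. node_cpu_seconds_total{mode="idle"} is a per-core counter of idle seconds, so 1 minus its per-second rate is that core's usage; without aggregation each core yields its own series, while wrapping the rate in avg by (instance) gives one value per host. The sketch below works through the arithmetic in Python with made-up samples. It is a simplification: Prometheus's rate() also handles counter resets and extrapolation, which are ignored here.

```python
def usage_from_idle(idle_start, idle_end, window_seconds):
    """CPU usage of one core: 1 minus the per-second growth of its idle counter."""
    idle_rate = (idle_end - idle_start) / window_seconds
    return 1.0 - idle_rate


def memory_usage(total_bytes, available_bytes):
    """Fraction of memory in use, as in the hostMemUsageAlert expression."""
    return (total_bytes - available_bytes) / total_bytes


# Made-up idle-counter samples taken 120 s apart (the rule's [2m] window).
idle_samples = {"cpu0": (1000.0, 1006.0), "cpu1": (2000.0, 2108.0)}

per_core = {
    cpu: usage_from_idle(start, end, 120.0)
    for cpu, (start, end) in idle_samples.items()
}
host_average = sum(per_core.values()) / len(per_core)

for cpu, usage in per_core.items():
    print(cpu, round(usage, 3), "above threshold" if usage > 0.8 else "ok")
print("host average", round(host_average, 3))
print("memory", round(memory_usage(8 * 1024**3, 1 * 1024**3), 3))
```

With these numbers one core is nearly saturated while the host average stays well below 0.8, which illustrates why the choice between per-core and per-host evaluation changes when the alert fires.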

Re-apply the ConfigMap and restart Prometheus:

[root@aminglinux01 prometheus]# kubectl delete cm prometheus-server; kubectl apply -f prometheus_config.yaml
configmap "prometheus-server" deleted
configmap/prometheus-server created
[root@aminglinux01 prometheus]# kubectl get po |grep prometheus-server |awk '{print $1}' |xargs -i kubectl delete po {}
pod "prometheus-server-5df77fb7d7-pkc9w" deleted

[root@aminglinux01 prometheus]# kubectl get pod | grep server
prometheus-server-5df77fb7d7-nhj9c          1/1     Running   0              24s

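Check that the rules have been loaded (Status > Rules in the Prometheus UI).

Then open the Alerts page. Each alert is in one of three states:

Inactive: the rule's condition is not met
Pending: the condition is met but has not yet held for the whole "for" duration (1m in the rules above)
Firing: the condition has held for at least the "for" duration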
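
The three states form a small state machine driven by each rule evaluation. The following Python sketch models it under simplifying assumptions: a single alert, evaluation times given in seconds, and a class name (AlertTracker) invented for this illustration. It is not Prometheus code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Tracks one alert through inactive -> pending -> firing."""

    for_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, now: float, condition_met: bool) -> str:
        if not condition_met:
            # The condition cleared: forget when it started.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            # First evaluation at which the condition holds.
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


tracker = AlertTracker(for_seconds=60)  # corresponds to "for: 1m"
timeline = [(0, False), (30, True), (60, True), (90, True), (120, False)]
states = [tracker.evaluate(t, met) for t, met in timeline]
print(states)
```

Note that a single evaluation with the condition unmet resets the timer, so a briefly interrupted spike has to wait out the whole "for" duration again.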

Test the rule:

On aminglinux02, run:

cat /dev/zero > /dev/null & # with 2 CPU cores, run this twice so that CPU usage rises above the threshold

After a few minutes, the CPU alert on the Prometheus Alerts page turns yellow (Pending) and then red (Firing).

The alert then also shows up in Alertmanager.
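
The same states can be read without the web UI: Prometheus serves its active alerts as JSON at GET /api/v1/alerts. The sketch below parses a response of that shape and counts alerts per state. The sample response, including the instance label aminglinux02:9100 and the values, is made up for illustration; against a real server the text would come from an HTTP request to the Prometheus address of this cluster.

```python
import json
from collections import Counter

# Made-up example in the shape returned by GET /api/v1/alerts.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "hostCpuUsageAlert", "instance": "aminglinux02:9100"},
       "annotations": {"summary": "Instance aminglinux02:9100 CPU usage high"},
       "state": "firing", "activeAt": "2024-08-06T03:40:00Z", "value": "0.97"},
      {"labels": {"alertname": "hostMemUsageAlert", "instance": "aminglinux02:9100"},
       "annotations": {"summary": "Instance aminglinux02:9100 MEM usage high"},
       "state": "pending", "activeAt": "2024-08-06T03:41:30Z", "value": "0.86"}
    ]
  }
}
"""


def count_alert_states(response_text: str) -> Counter:
    """Count alerts per state ("pending" or "firing") in an API response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("Prometheus API reported an error")
    return Counter(alert["state"] for alert in body["data"]["alerts"])


state_counts = count_alert_states(SAMPLE_RESPONSE)
print(dict(state_counts))
```

Inactive rules do not appear in this endpoint's output, so an empty alerts list means that no rule condition is currently met.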

IV. Configuring Email Alerts in Alertmanager

1. Export the Alertmanager configuration from its ConfigMap

kubectl get cm prometheus-alertmanager -o=yaml > alertmanager_config.yaml

2. Edit the configuration file

vi alertmanager_config.yaml # change it to

apiVersion: v1
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 1h
      smtp_smarthost: 'smtp.qq.com:465'
      smtp_from: 'sender email address'
      smtp_auth_username: 'sender email address'
      smtp_auth_password: 'your authorization code or password'
      smtp_require_tls: false
    templates:
    - '/bitnami/alertmanager/data/template/email.tmpl'
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'recipient email address'
        html: '{{ template "email.html" . }}'
        send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: prometheus
    helm.sh/chart: prometheus-0.1.3
  name: prometheus-alertmanager

Re-import the configuration:

kubectl delete cm prometheus-alertmanager; kubectl apply -f alertmanager_config.yaml

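Alertmanager's data volume is mounted from NFS, so /bitnami/alertmanager/data/ maps to a directory on the NFS server.
Run the following on the NFS server:

cd /data/nfs2/default-data-bitnami-prometheus-alertmanager-0-pvc-47ca7949-a84b-4d72-bdff-f9380f4f2fa1/
mkdir template

vi template/email.tmpl # with the following content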

{{ define "email.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
========= Alert notification ==========
Alert name: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }} {{ .Labels.device }}
Details: {{ .Annotations.summary }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= END ==========
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
========= Recovery notification ==========
Alert name: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Details: {{ .Annotations.summary }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= END ==========
{{- end }}
{{- end }}
{{- end }}
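
Two details of this template are easy to miss. In Alertmanager's template data, .Alerts contains every alert in the notification, while .Alerts.Firing and .Alerts.Resolved contain only the respective subsets. The timestamps are in UTC, and adding 28800e9 nanoseconds (8 hours) shifts them to UTC+8 before formatting. The Python sketch below mirrors both steps; the alert dictionaries and function names are invented for this illustration and are not Alertmanager's data model.

```python
from datetime import datetime, timedelta, timezone

UTC8_OFFSET = timedelta(seconds=28800e9 / 1e9)  # 28800e9 ns = 8 hours
LAYOUT = "%Y-%m-%d %H:%M:%S"  # strftime form of Go's "2006-01-02 15:04:05"


def shifted(timestamp_utc: datetime) -> str:
    """Format a UTC timestamp as UTC+8 wall-clock time."""
    return (timestamp_utc + UTC8_OFFSET).strftime(LAYOUT)


def render(alerts):
    """Render firing alerts first, then resolved ones, each from its own subset."""
    lines = []
    for alert in (a for a in alerts if a["status"] == "firing"):
        lines.append(f"[ALERT] {alert['alertname']} started {shifted(alert['starts_at'])}")
    for alert in (a for a in alerts if a["status"] == "resolved"):
        lines.append(
            f"[RECOVERY] {alert['alertname']} started {shifted(alert['starts_at'])}"
            f" resolved {shifted(alert['ends_at'])}"
        )
    return lines


alerts = [
    {"alertname": "hostCpuUsageAlert", "status": "firing",
     "starts_at": datetime(2024, 8, 6, 3, 33, 6, tzinfo=timezone.utc)},
    {"alertname": "hostMemUsageAlert", "status": "resolved",
     "starts_at": datetime(2024, 8, 6, 20, 0, 0, tzinfo=timezone.utc),
     "ends_at": datetime(2024, 8, 6, 20, 5, 0, tzinfo=timezone.utc)},
]
for line in render(alerts):
    print(line)
```

Iterating over the unfiltered list inside the firing branch would print resolved alerts under the alert heading as well, which is why each loop draws from its own subset.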

3. Restart Alertmanager

kubectl get po |grep 'prometheus-alertmanager'|awk '{print $1}' |xargs -i kubectl delete po {}

4. As in the earlier steps, simulate high CPU usage to trigger the alert, then check whether the alert email arrives.
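
Generating CPU load is not the only way to exercise the notification path. Alertmanager accepts alerts directly at POST /api/v2/alerts, so a synthetic alert can be pushed to it. The sketch below only builds such a request. The base URL is the Alertmanager address used earlier in this article; the instance label aminglinux02:9100 is a made-up value, and the request is deliberately not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert_request(alertmanager_url, instance):
    """Build (but do not send) a POST /api/v2/alerts request with one synthetic alert."""
    alerts = [
        {
            "labels": {
                "alertname": "hostCpuUsageAlert",
                "severity": "page",
                "instance": instance,
            },
            "annotations": {
                "summary": f"Instance {instance} CPU usage high (synthetic test)",
            },
            # RFC 3339 timestamp in UTC
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]
    return urllib.request.Request(
        url=alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


test_request = build_test_alert_request(
    "http://192.168.100.152:30682",
    "aminglinux02:9100",  # made-up instance label
)
print(test_request.get_method(), test_request.full_url)
# To actually post the alert: urllib.request.urlopen(test_request)
```

A synthetic alert tests only the Alertmanager side (routing, templates, SMTP); it says nothing about whether the Prometheus rule itself evaluates correctly.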

V. Configuring WeCom (WeChat Work) Alerts in Alertmanager

vi alertmanager_config.yaml # change it to

apiVersion: v1
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'ww07997bc80b832341'
    templates:
    - '/bitnami/alertmanager/data/template/weixin.tmpl'
    receivers:
    - name: 'wechat'
      wechat_configs:
      - corp_id: 'ww07997bc80b832341'
        to_party: '2'
        agent_id: '1000002'
        api_secret: 'xMhpmaNZ_YYkGR_UCYUJaqHaDaIi7CQiMtazzaviAvE'
        send_resolved: true
    route:
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'wechat'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
  labels:
    app.kubernetes.io/component: alertmanager
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: prometheus
    helm.sh/chart: prometheus-0.1.3
  name: prometheus-alertmanager
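
To see what the four WeCom settings are for, it helps to look at the two API calls behind a notification: the corp ID and secret are exchanged for an access token, and the token is then used to send a message from the given application (agent) to the given department (party). The Python sketch below builds the URLs and the message body as I understand the WeCom API; treat the exact field names as an assumption to verify against the WeCom documentation. The credentials are placeholders and nothing is sent.

```python
import json
from urllib.parse import urlencode

API_URL = "https://qyapi.weixin.qq.com/cgi-bin/"  # wechat_api_url


def token_url(corp_id: str, api_secret: str) -> str:
    """URL that exchanges corp_id and api_secret for an access token."""
    return API_URL + "gettoken?" + urlencode(
        {"corpid": corp_id, "corpsecret": api_secret}
    )


def send_url(access_token: str) -> str:
    """URL that sends a message using a previously obtained token."""
    return API_URL + "message/send?" + urlencode({"access_token": access_token})


def text_message(agent_id: str, to_party: str, content: str) -> dict:
    """Message body: which application sends what to which department."""
    return {
        "toparty": to_party,       # to_party: department ID
        "agentid": int(agent_id),  # agent_id: application ID
        "msgtype": "text",
        "text": {"content": content},
    }


print(token_url("<corp_id>", "<api_secret>"))
print(send_url("<access_token>"))
print(json.dumps(text_message("1000002", "2", "test alert")))
```

If notifications do not arrive, checking these two calls by hand separates credential problems (the token request fails) from addressing problems (the send request is rejected for the agent or department).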

Alertmanager's data volume is mounted from NFS, so /bitnami/alertmanager/data/ maps to a directory on the NFS server.
Run the following on the NFS server:

cd /data/nfs2/default-data-bitnami-prometheus-alertmanager-0-pvc-47ca7949-a84b-4d72-bdff-f9380f4f2fa1/template

vi weixin.tmpl # with the following content

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 -}}
**********Alert notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 -}}
**********Recovery notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}
{{- end }}

Then re-apply the ConfigMap and restart Alertmanager as in section IV, and trigger the alert again to test.
