I. Visualizing monitoring metrics with Grafana
1. Install Grafana with Helm
helm pull bitnami/grafana --untar
Edit values.yaml and set the storageClass (it appears twice):
vi grafana/values.yaml
storageClass: "nfs-client"   # set both occurrences
[root@aminglinux01 grafana]# cat values.yaml | grep storageClass:
storageClass: "nfs-client"
storageClass: "nfs-client"
[root@aminglinux01 grafana]#
Install the chart:
cd grafana
helm install grafana .
[root@aminglinux01 grafana]# helm install grafana .
NAME: grafana
LAST DEPLOYED: Tue Aug 6 03:33:06 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: grafana
CHART VERSION: 11.3.13
APP VERSION: 11.1.3
** Please be patient while the chart is being deployed **
1. Get the application URL by running these commands:
echo "Browse to http://127.0.0.1:8080"
kubectl port-forward svc/grafana 8080:3000 &
2. Get the admin credentials:
echo "User: admin"
echo "Password: $(kubectl get secret grafana-admin --namespace default -o jsonpath="{.data.GF_SECURITY_ADMIN_PASSWORD}" | base64 -d)"
# Note: Do not include grafana.validateValues.database here. See https://github.com/bitnami/charts/issues/20629
WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
- grafana.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
⚠ SECURITY WARNING: Original containers have been substituted. This Helm chart was designed, tested, and validated on multiple platforms using a specific set of Bitnami and Tanzu Application Catalog containers. Substituting other containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.
Substituted images detected:
- registry.cn-hangzhou.aliyuncs.com/*/grafana:11.1.3-debian-12-r0
- registry.cn-hangzhou.aliyuncs.com/*/os-shell:12-debian-12-r27
Set up port forwarding:
[root@aminglinux01 grafana]# kubectl get pod -owide| grep grafana
grafana-57bb68b4b5-4rwhj 1/1 Running 0 11m 10.18.206.201 aminglinux02 <none> <none>
[root@aminglinux01 grafana]# kubectl get svc | grep grafana
grafana ClusterIP 10.15.242.254 <none> 3000/TCP ### no node port exposed; reachable only from inside the cluster
kubectl port-forward svc/grafana --address 192.168.100.151 8087:3000 &
[root@aminglinux01 ~]# nohup: ignoring input and appending output to 'nohup.out'
[root@aminglinux01 ~]#
Alternatively, change the Service type in values.yaml to NodePort:
[root@aminglinux01 grafana]# kubectl get svc | grep grafana
grafana NodePort 10.15.68.128 <none> 3000:31513/TCP 12m
[root@aminglinux01 grafana]#
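A sketch of that change, assuming the Bitnami chart exposes it under `service.type` (key paths can differ between chart versions, so verify against your values.yaml):

```yaml
# grafana/values.yaml (Bitnami chart) -- key path may vary by chart version
service:
  type: NodePort       # was ClusterIP
  # nodePorts:
  #   grafana: 31513   # optionally pin the node port instead of a random one
```

Then apply it with `helm upgrade grafana .` from the chart directory.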
Check the admin password:
[root@aminglinux01 ~]# echo "Password: $(kubectl get secret grafana-admin --namespace default -o jsonpath="{.data.GF_SECURITY_ADMIN_PASSWORD}" | base64 -d)"
Password: DOMMoUZS4v
[root@aminglinux01 ~]#
2. Access Grafana
http://192.168.100.151:8087, or via the NodePort service: http://192.168.100.152:31513
Add the data source.
3. Import a node dashboard
Click Dashboards, then the "+" on the right, then Import dashboard.
Use dashboard ID 8919 or 1860 (node dashboard IDs).
4. Import an nginx dashboard
Download the nginx dashboard.json from https://github.com/nginxinc/nginx-prometheus-exporter/blob/main/grafana/dashboard.json
Open http://192.168.100.151:8087/dashboard/import and import the JSON,
or import by dashboard ID 13332 on the same page.
II. AlertManager: introduction and installation
1. The Helm-installed Prometheus already includes Alertmanager. Check its pod and service:
[root@aminglinux01 ~]# kubectl get pod -owide| grep aler
prometheus-alertmanager-0 1/1 Running 0 14h 10.18.206.200 aminglinux02 <none> <none>
[root@aminglinux01 ~]# kubectl get svc| grep aler
prometheus-alertmanager LoadBalancer 10.15.238.147 192.168.10.241 80:30682/TCP 14h
[root@aminglinux01 ~]#
Access it in a browser via the LoadBalancer IP (192.168.10.241) shown above.
III. Configuring Prometheus alerting rules
A collection of ready-made alerting rules: https://samber.github.io/awesome-prometheus-alerts/
1. vi prometheus_config.yaml
Find the rules.yaml key and change rules.yaml: '{}' to:
rules.yaml: |
  groups:
    - name: hostStatsAlert
      rules:
        - alert: hostCpuUsageAlert
          expr: 1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) > 0.8
          for: 1m
          labels:
            severity: page
          annotations:
            summary: "Instance {{ $labels.instance }} CPU usage high"
            description: "{{ $labels.instance }} CPU usage above 80% (current value: {{ $value }})"
        - alert: hostMemUsageAlert
          expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
          for: 1m
          labels:
            severity: page
          annotations:
            summary: "Instance {{ $labels.instance }} MEM usage high"
            description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
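Before re-applying, the rule group can be linted locally with promtool (shipped with Prometheus). A minimal sketch, assuming promtool is on the PATH (the check is skipped if it is not):

```shell
# Write the rule group from above to a file and lint it with promtool.
cat > /tmp/rules.yaml <<'EOF'
groups:
  - name: hostStatsAlert
    rules:
      - alert: hostCpuUsageAlert
        expr: 1 - rate(node_cpu_seconds_total{mode="idle"}[2m]) > 0.8
        for: 1m
        labels:
          severity: page
EOF
if command -v promtool >/dev/null 2>&1; then
  promtool check rules /tmp/rules.yaml   # reports SUCCESS and the rule count
else
  echo "promtool not installed; skipping lint"
fi
```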
Re-apply the ConfigMap and restart Prometheus:
[root@aminglinux01 prometheus]# kubectl delete cm prometheus-server; kubectl apply -f prometheus_config.yaml
configmap "prometheus-server" deleted
configmap/prometheus-server created
[root@aminglinux01 prometheus]# kubectl get po |grep prometheus-server |awk '{print $1}' |xargs -i kubectl delete po {}
pod "prometheus-server-5df77fb7d7-pkc9w" deleted
[root@aminglinux01 prometheus]# kubectl get pod | grep server
prometheus-server-5df77fb7d7-nhj9c 1/1 Running 0 24s
Check whether the rules took effect (Status → Rules).
Check Alerts; each alert is in one of three states:
Inactive: the rule has not been triggered
Pending: the rule has been triggered but is still within the evaluation wait period (the 1m "for" above)
Firing: the rule has been triggered and the wait period has elapsed
Test the rules:
On aminglinux02, run:
cat /dev/zero > /dev/null &  # with 2 CPU cores, run this twice so CPU usage actually reaches the threshold
After a few minutes, the CPU alert in the Prometheus Alerts page turns yellow (Pending), then red (Firing).
The alert also shows up in Alertmanager.
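The stress test above can be scripted to start one burner per core and clean up afterwards. A sketch; in the lab you would leave the burners running for a few minutes rather than seconds:

```shell
# Start one `cat /dev/zero` burner per CPU core, remember their PIDs,
# then terminate and reap them.
pids=""
for _ in $(seq "$(nproc)"); do
  cat /dev/zero > /dev/null &
  pids="$pids $!"
done
sleep 2                        # replace with a few minutes to let the alert reach Firing
kill $pids
wait $pids 2>/dev/null || true  # reap the processes so no zombies linger
```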
IV. Configuring email alerts in AlertManager
1. Export the Alertmanager config from its ConfigMap
kubectl get cm prometheus-alertmanager -o=yaml > alertmanager_config.yaml
2. Edit the config file
vi alertmanager_config.yaml  # change it to:
apiVersion: v1
data:
alertmanager.yaml: |
global:
resolve_timeout: 1h
smtp_smarthost: 'smtp.qq.com:465'
smtp_from: 'sender email address'
smtp_auth_username: 'sender email address'
smtp_auth_password: 'your authorization code or password'
smtp_require_tls: false
templates:
- '/bitnami/alertmanager/data/template/email.tmpl'
receivers:
- name: 'default-receiver'
email_configs:
- to: 'recipient email address'
html: '{{ template "email.html" . }}'
send_resolved: true
route:
group_wait: 10s
group_interval: 5m
receiver: default-receiver
repeat_interval: 3h
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: prometheus
labels:
app.kubernetes.io/component: alertmanager
app.kubernetes.io/instance: prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: prometheus
app.kubernetes.io/part-of: prometheus
helm.sh/chart: prometheus-0.1.3
name: prometheus-alertmanager
Re-apply the config:
kubectl delete cm prometheus-alertmanager; kubectl apply -f alertmanager_config.yaml
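It is worth validating the edited config with amtool (bundled with Alertmanager) before restarting. A sketch, assuming amtool is installed on the workstation (the check is skipped if it is not):

```shell
# Pull the live config back out of the ConfigMap and lint it.
kubectl get cm prometheus-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' > /tmp/alertmanager.yaml
if command -v amtool >/dev/null 2>&1; then
  amtool check-config /tmp/alertmanager.yaml
else
  echo "amtool not installed; skipping config check"
fi
```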
Since Alertmanager's data is mounted on NFS, the /bitnami/alertmanager/data/ directory maps to a directory on the NFS server.
Run the following on the NFS server:
cd /data/nfs2/default-data-bitnami-prometheus-alertmanager-0-pvc-47ca7949-a84b-4d72-bdff-f9380f4f2fa1/
mkdir template
vi email.tmpl  # with the following content:
{{ define "email.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
========= Alert notification ==========
Alert name: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }} {{ .Labels.device }}
Summary: {{ .Annotations.summary }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= END ==========
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
========= Resolved notification ==========
Alert name: {{ .Labels.alertname }}
Severity: {{ .Labels.severity }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Started at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= END ==========
{{- end }}
{{- end }}
{{- end }}
3. Restart Alertmanager
kubectl get po |grep 'prometheus-alertmanager'|awk '{print $1}' |xargs -i kubectl delete po {}
4. Repeat the earlier CPU stress test to trigger an alert and confirm the alert email arrives.
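Instead of burning CPU, a synthetic alert can also be posted straight to the Alertmanager API to exercise the template. A sketch using amtool; the local port and label values here are just examples:

```shell
# Port-forward the Alertmanager service locally, then post a test alert.
kubectl port-forward svc/prometheus-alertmanager 9093:80 >/dev/null 2>&1 &
PF_PID=$!
sleep 2
if command -v amtool >/dev/null 2>&1; then
  amtool alert add testAlert severity=page instance=lab-test \
    --annotation=summary="synthetic test alert" \
    --alertmanager.url=http://127.0.0.1:9093
fi
kill $PF_PID 2>/dev/null || true   # tear down the port-forward
wait $PF_PID 2>/dev/null || true
```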
V. Configuring WeCom (企业微信) alerts in AlertManager
vi alertmanager_config.yaml  # change it to:
apiVersion: v1
data:
alertmanager.yaml: |
global:
resolve_timeout: 5m
wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
wechat_api_corp_id: 'ww07997bc80b832341'
templates:
- '/bitnami/alertmanager/data/template/weixin.tmpl'
receivers:
- name: 'wechat'
wechat_configs:
- corp_id: 'ww07997bc80b832341'
to_party: '2'
agent_id: '1000002'
api_secret: 'xMhpmaNZ_YYkGR_UCYUJaqHaDaIi7CQiMtazzaviAvE'
send_resolved: true
route:
group_wait: 10s
group_interval: 5m
repeat_interval: 3h
receiver: 'wechat'
kind: ConfigMap
metadata:
annotations:
meta.helm.sh/release-name: prometheus
labels:
app.kubernetes.io/component: alertmanager
app.kubernetes.io/instance: prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: prometheus
app.kubernetes.io/part-of: prometheus
helm.sh/chart: prometheus-0.1.3
name: prometheus-alertmanager
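The corp_id/api_secret pair can be verified against the WeChat Work API before restarting Alertmanager. A sketch using the values from the config above; it needs outbound network access:

```shell
# A valid corp_id + secret returns an access_token; errcode 0 means OK.
CORP_ID='ww07997bc80b832341'
CORP_SECRET='xMhpmaNZ_YYkGR_UCYUJaqHaDaIi7CQiMtazzaviAvE'
curl -s "https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=${CORP_ID}&corpsecret=${CORP_SECRET}" \
  || echo "request failed (no network access?)"
```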
Since Alertmanager's data is mounted on NFS, the /bitnami/alertmanager/data/ directory maps to a directory on the NFS server.
Run the following on the NFS server:
cd /data/nfs2/default-data-bitnami-prometheus-alertmanager-0-pvc-47ca7949-a84b-4d72-bdff-f9380f4f2fa1/template
vi weixin.tmpl  # with the following content:
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 -}}
**********Alert notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 -}}
**********Resolved notification**********
Alert type: {{ $alert.Labels.alertname }}
Severity: {{ $alert.Labels.severity }}
{{- end }}
=====================
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.description }}
Started at: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Resolved at: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
{{ if gt (len $alert.Labels.instance) 0 -}}Instance: {{ $alert.Labels.instance }}{{- end -}}
{{- end }}
{{- end }}
{{- end }}