Prometheus + Alertmanager + Grafana: monitoring Redis and alerting (detailed guide)
The accompanying software packages can be downloaded from the following network-drive link:
Link: https://url28.ctfile.com/f/37115828-589253621-826cc0?p=4907
Access password: 4907
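Install redis_exporter on the Redis nodes (install it manually on each node):
Host: 192.168.10.89 redis_exporter (the monitored Redis host)
Host: 192.168.10.92 Alertmanager, Prometheus
1. Set up a standalone Redis first (installed with yum). (Note: when Redis was installed with Docker, redis_exporter could not connect to it in testing once a password was used.)
Redis, on 192.168.10.89: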
[root@k8s-node3 ~]# yum -y install epel-release
[root@k8s-node3 ~]# yum -y install redis
[root@k8s-node3 ~]# cat /etc/redis.conf |grep -vE "^$|#" # the Redis configuration file after adjustment:
bind 0.0.0.0
protected-mode yes
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no
supervised no
pidfile /var/run/redis_6379.pid
loglevel notice
logfile /var/log/redis/redis.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
requirepass 123456
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
maxclients 4064
maxmemory 128m
[root@k8s-node3 ~]# systemctl restart redis
[root@k8s-node3 ~]# netstat -anput |grep redis|grep LISTEN
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 2915/redis-server 0
2. Install redis_exporter (from the binary release) on the monitored Redis host
[root@k8s-node3 ~]# ls redis_exporter-v1.24.0.linux-amd64.tar.gz
redis_exporter-v1.24.0.linux-amd64.tar.gz
[root@k8s-node3 ~]# tar -zxf redis_exporter-v1.24.0.linux-amd64.tar.gz
[root@k8s-node3 ~]# mv redis_exporter-v1.24.0.linux-amd64 /data/redis_exporter
[root@k8s-node3 ~]# ls /data/redis_exporter/ # list the extracted files
LICENSE README.md redis_exporter
[root@k8s-node3 ~]# /data/redis_exporter/redis_exporter --redis.addr redis://192.168.10.89:6379 # start option 1: Redis without a password
[root@k8s-node3 ~]# /data/redis_exporter/redis_exporter --redis.addr 192.168.10.89:6379 # start option 1, alternative form: Redis without a password
[root@k8s-node3 ~]# /data/redis_exporter/redis_exporter --redis.addr 192.168.10.89:6379 --redis.password 123456 # start option 2: Redis with a password
Hand it over to systemd:
[root@k8s-node3 ~]# vim /usr/lib/systemd/system/redis_exporter.service
[Unit]
Description=redis_exporter daemon
[Service]
Restart=on-failure
ExecStart=/data/redis_exporter/redis_exporter --redis.addr 192.168.10.89:6379 -redis.password 123456
[Install]
WantedBy=multi-user.target
[root@k8s-node3 ~]# systemctl daemon-reload
[root@k8s-node3 ~]# systemctl start redis_exporter
[root@k8s-node3 ~]# systemctl status redis_exporter
● redis_exporter.service - redis_exporter daemon
Loaded: loaded (/usr/lib/systemd/system/redis_exporter.service; disabled; vendor preset: disabled)
Active: active (running) since Sat 2022-06-04 09:15:11 CST; 4s ago
[root@k8s-node3 ~]# netstat -anput |grep 9121
tcp6 0 0 :::9121 :::* LISTEN 22790/redis_exporte
Access redis_exporter: http://192.168.10.89:9121/metrics
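The metrics page can also be checked from the command line. The sketch below is an illustration, not part of the original setup: the helper name redis_up_value is made up here, and it assumes the exporter prints redis_up as a plain, unlabelled sample line. It reads exporter output on stdin and prints the value of redis_up, which is the metric the alert rule later in this guide relies on.

```shell
# Print the value of the redis_up gauge from exporter output read on stdin.
# Returns non-zero if no unlabelled redis_up sample is present.
redis_up_value() {
  awk '$1 == "redis_up" { print $2; found = 1 } END { exit !found }'
}

# Example against canned exporter output (not a live scrape):
printf '# TYPE redis_up gauge\nredis_up 1\n' | redis_up_value
```

Against the live exporter the same helper would be fed with curl, for example: curl -s http://192.168.10.89:9121/metrics | redis_up_value. A value of 1 means the exporter can reach and authenticate to Redis.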
3. Install Alertmanager from the binary release (192.168.10.92)
[root@nacos-nfs ~]# ls alertmanager-0.24.0.linux-amd64.tar.gz
alertmanager-0.24.0.linux-amd64.tar.gz
[root@nacos-nfs ~]# tar -zxf alertmanager-0.24.0.linux-amd64.tar.gz
[root@nacos-nfs ~]# mv alertmanager-0.24.0.linux-amd64 /data/alertmanager
[root@nacos-nfs ~]# cd /data/alertmanager/
[root@nacos-nfs alertmanager]# ls
alertmanager alertmanager.yml amtool LICENSE NOTICE
[root@nacos-nfs alertmanager]# cp alertmanager.yml alertmanager.yml.bak
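[root@nacos-nfs alertmanager]# vim alertmanager.yml # uses a QQ mailbox, since 163 mail is sometimes unreliable; no template is configured, so the built-in default template is used
[root@nacos-nfs alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 1m
  # SMTP server:
  smtp_smarthost: 'smtp.qq.com:25'
  # sender:
  smtp_from: '1036981484@qq.com'
  smtp_auth_username: '1036981484@qq.com'
  # sender's authorization code:
  smtp_auth_password: 'fwbxnmbfnrpvbedi'
  smtp_require_tls: false
# routing tree:
route:
  group_by: ['alertname'] # group alerts by the alertname label; its value is the name of the alert rule defined in the Prometheus rules files, so each rule forms its own group
  group_wait: 10s # how long to wait before sending the first notification for a new group; alerts arriving within these 10s are merged into the same notification
  group_interval: 10s # interval before notifying about new alerts added to an existing group
  repeat_interval: 10s # interval before re-sending a notification for alerts that are still firing
  receiver: 'mail' # name of the receiver to send to; the actual recipients are defined under that name below
receivers:
- name: 'mail'
  email_configs:
  - to: '1441107787@qq.com'
    send_resolved: true # also send a notification when the alert resolves
[root@nacos-nfs alertmanager]# ./alertmanager --config.file=./alertmanager.yml # start with the given configuration file
Hand the service over to systemd: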
[root@nacos-nfs alertmanager]# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager daemon
[Service]
Restart=on-failure
ExecStart=/data/alertmanager/alertmanager --config.file=/data/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
[root@nacos-nfs alertmanager]# systemctl daemon-reload
[root@nacos-nfs alertmanager]# systemctl enable alertmanager
[root@nacos-nfs alertmanager]# systemctl start alertmanager
[root@nacos-nfs alertmanager]# systemctl status alertmanager
● alertmanager.service - alertmanager daemon
Loaded: loaded (/usr/lib/systemd/system/alertmanager.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2022-04-10 17:43:02 CST; 4s ago
[root@nacos-nfs alertmanager]# netstat -anput |grep 9093
tcp6 0 0 :::9093 :::* LISTEN 30260/alertmanager
Access the web UI at: http://192.168.10.92:9093/
4. Install Prometheus from the binary release and test alerting
[root@nacos-nfs ~]# ls
prometheus-2.35.0-rc0.linux-amd64.tar.gz
[root@nacos-nfs ~]# tar -zxf prometheus-2.35.0-rc0.linux-amd64.tar.gz
[root@nacos-nfs ~]# ls
prometheus-2.35.0-rc0.linux-amd64 prometheus-2.35.0-rc0.linux-amd64.tar.gz
[root@nacos-nfs ~]# mv prometheus-2.35.0-rc0.linux-amd64 /data/prometheus
[root@nacos-nfs ~]# cd /data/prometheus/
[root@nacos-nfs prometheus]# ls
console_libraries consoles LICENSE NOTICE prometheus prometheus.yml promtool
[root@nacos-nfs prometheus]# cp prometheus.yml prometheus.yml.bak
[root@nacos-nfs prometheus]# vim prometheus.yml
global:
  scrape_interval: 15s # how often to scrape targets
  evaluation_interval: 15s # how often to evaluate the alerting rules
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.10.92:9093 # IP and port of Alertmanager once it is installed; comment this line out if alerting is not needed
# rule files to load:
rule_files:
  - "rules/*.yml" # alerting rules; comment out if unused. rules is a directory created inside the Prometheus install directory
  - "second_rules.yml" # alerting rules; comment out if unused
scrape_configs: # configuration of the monitored targets
  # metrics_path defaults to '/metrics'
  - job_name: "redis_monitor" # each job_name can be thought of as a group, and a group can contain a batch of hosts
    static_configs:
    - targets: ["192.168.10.89:9121"] # address and port of redis_exporter
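Since a job can hold a batch of hosts, the targets line grows with every node that runs redis_exporter. As a small sketch (the helper name targets_line and the second host 192.168.10.90 are hypothetical, and every exporter is assumed to listen on the default port 9121), the line can be generated from a list of host addresses:

```shell
# Build a static_configs targets line for a list of exporter hosts.
# Assumes every host runs redis_exporter on port 9121.
targets_line() {
  out=""
  for host in "$@"; do
    # Append the quoted host:port, comma-separated after the first entry.
    out="${out:+$out, }\"$host:9121\""
  done
  printf '%s\n' "- targets: [$out]"
}

# 192.168.10.90 is a made-up second node, used only for illustration.
targets_line 192.168.10.89 192.168.10.90
```

The generated line is meant to be pasted under static_configs of the redis_monitor job; it does not edit prometheus.yml by itself.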
[root@nacos-nfs prometheus]# mkdir /data/prometheus/rules
[root@nacos-nfs prometheus]# vim /data/prometheus/rules/redis.yml # trigger conditions and alert rules for the Redis service (more can be added later)
groups:
- name: redis.rules # name of the rule group
  rules:
  # Redis service down
  - alert: redisServiceDown # custom alert rule name; when the alert fires it becomes the value of the alertname label used in alertmanager.yml
    expr: redis_up == 0 # PromQL trigger condition: the Redis service is down
    for: 1m # pending time: once the condition is met the alert is not sent to Alertmanager immediately; it is sent only if the condition holds for the whole minute, so any instance unreachable for 1 minute raises an alert
    labels: # custom labels
      severity: error
    annotations: # additional information
      summary: "redis service down"
      description: "The redis service on host {{ $labels.instance }} has stopped, please check. Current value: {{ $value }}"
  # Redis memory usage too high: alert above 60%, computed as used memory / configured maxmemory.
  # redis_exporter normally monitors the Redis on its own host (special cases aside), so the redis_exporter host can generally be treated as the monitored Redis host.
  - alert: RedisOutOfMemory-redis使用内存比例高 # custom alert rule name, used as the alertname value
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 60
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "redis memory usage ratio is high"
      description: "Redis monitored by redis_exporter host {{ $labels.instance }} has a high memory usage ratio (> 60%). Current value: {{ $value }}"
  # Redis current connections too high: alert above 2000
  - alert: RedisTooManyConnetions-redis当前连接数高 # custom alert rule name, used as the alertname value
    expr: redis_connected_clients > 2000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "redis current connection count is high"
      description: "Redis monitored by redis_exporter host {{ $labels.instance }} has a high connection count (> 2000). Current value: {{ $value }}"
  # Redis rejected connections within 1 minute below 0.
  # Note: increase() over a counter is never negative, so this condition does not fire as written; a condition such as > 0 is the usual choice.
  - alert: RedisRejectConnetions-redis1分钟内拒绝连接数小于0 # custom alert rule name, used as the alertname value
    expr: increase(redis_rejected_connections_total[1m]) < 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "redis rejected connections within 1 minute below 0"
      description: "Redis monitored by redis_exporter host {{ $labels.instance }}: rejected connections within 1 minute below 0. Current value: {{ $value }}"
[root@nacos-nfs prometheus]# ./promtool check config ./prometheus.yml # check that the configuration file is valid
Checking ./prometheus.yml
SUCCESS: ./prometheus.yml is valid prometheus config file syntax
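The memory rule in rules/redis.yml divides redis_memory_used_bytes by redis_memory_max_bytes, so it only makes sense when maxmemory is set, as it is in the redis.conf above (128m). The sketch below reproduces that arithmetic outside Prometheus; the helper name mem_pct is made up for illustration, and the "inf" branch reflects the assumption that an unset maxmemory is exported as 0, in which case the PromQL division yields +Inf for any non-zero usage and the rule would fire permanently.

```shell
# Percentage of maxmemory in use, mirroring
# redis_memory_used_bytes / redis_memory_max_bytes * 100.
mem_pct() {
  awk -v used="$1" -v max="$2" 'BEGIN {
    if (max == 0) { print "inf"; exit }  # maxmemory unset: division by zero
    printf "%.1f\n", used / max * 100
  }'
}

# In redis.conf, 128m means 128,000,000 bytes; 96,000,000 bytes used is 75%.
mem_pct 96000000 128000000
```

With the 60% threshold from the rule, the example value of 75.0 would put the alert into pending and, after the 1m for period, into firing.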
[root@nacos-nfs prometheus]# /data/prometheus/prometheus --config.file=/data/prometheus/prometheus.yml # start with the given configuration file
Hand it over to systemd:
[root@nacos-nfs prometheus]# vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus daemon
[Service]
Restart=on-failure
ExecStart=/data/prometheus/prometheus --config.file=/data/prometheus/prometheus.yml
[Install]
WantedBy=multi-user.target
[root@nacos-nfs prometheus]# systemctl daemon-reload
[root@nacos-nfs prometheus]# systemctl enable prometheus
[root@nacos-nfs prometheus]# systemctl start prometheus
[root@nacos-nfs prometheus]# systemctl status prometheus
● prometheus.service - Prometheus daemon
Loaded: loaded (/usr/lib/systemd/system/prometheus.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2022-04-09 23:35:06 CST; 6s ago
[root@nacos-nfs prometheus]# netstat -anput |grep 9090|grep LISTEN
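tcp6 0 0 :::9090 :::* LISTEN 28970/prometheus
Test alerting: a few example alerts are enough for a demonstration.
Set the threshold for Redis current connections to 0, the threshold for Redis memory usage to 0, and the rejected-connections condition to = 0; the alerts then fire as follows: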
[root@nacos-nfs rules]# vim redis.yml
Xx
[root@nacos-nfs rules]# systemctl restart prometheus
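The threshold change for the test can be made by hand in vim as above, or scripted. A minimal sketch for the connection-count rule only, assuming the expression is written exactly as redis_connected_clients > N; the helper name set_conn_threshold is hypothetical:

```shell
# Rewrite the numeric threshold of the redis_connected_clients rule.
# Reads rules text on stdin; $1 is the new threshold.
set_conn_threshold() {
  sed -E "s/(redis_connected_clients[[:space:]]*>[[:space:]]*)[0-9]+/\1$1/"
}

# Example on a single line taken from rules/redis.yml:
printf '    expr: redis_connected_clients > 2000\n' | set_conn_threshold 0
```

Because the helper works on stdin and stdout, the rules file itself is untouched until the output is written back deliberately, and the same call with 2000 restores the normal value after the test.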
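View the alerts:
Check that the alert emails were sent:
Once the parameters are set back to normal values and the alerts resolve, a recovery email is sent:
5. Install Grafana from the binary release (dashboards)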
[root@nacos-nfs ~]# ls grafana-enterprise-8.4.5.linux-amd64.tar.gz
grafana-enterprise-8.4.5.linux-amd64.tar.gz
[root@nacos-nfs ~]# tar -zxf grafana-enterprise-8.4.5.linux-amd64.tar.gz
[root@nacos-nfs ~]# mv grafana-8.4.5/ /data/grafana
[root@nacos-nfs ~]# cd /data/grafana/
[root@nacos-nfs grafana]# ls
bin conf LICENSE NOTICE.md plugins-bundled public README.md scripts VERSION
[root@nacos-nfs grafana]# ls bin/
grafana-cli grafana-cli.md5 grafana-server grafana-server.md5
[root@nacos-nfs grafana]# /data/grafana/bin/grafana-server # this starts Grafana
Hand startup over to systemd:
[root@nacos-nfs grafana]# vim /usr/lib/systemd/system/grafana.service
[Unit]
Description=grafana daemon
[Service]
Restart=on-failure
# start with the install directory as the home path
ExecStart=/data/grafana/bin/grafana-server -homepath=/data/grafana
[Install]
WantedBy=multi-user.target
[root@nacos-nfs grafana]# systemctl start grafana
[root@nacos-nfs grafana]# systemctl status grafana
● grafana.service - grafana daemon
Loaded: loaded (/usr/lib/systemd/system/grafana.service; disabled; vendor preset: disabled)
Active: active (running) since Sat 2022-04-09 20:14:52 CST; 11s ago
[root@nacos-nfs grafana]# systemctl enable grafana
[root@nacos-nfs grafana]# netstat -anput |grep 3000
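tcp6 0 0 :::3000 :::* LISTEN 28647/grafana-serve
Access Grafana: http://192.168.10.92:3000 (the default username and password are both admin)
Add the Prometheus data source:
Import a Redis monitoring dashboard, or enter the dashboard ID 763 to import it:
Addendum: a custom template for email alerts (everything else stays the same; only the marked parts below change):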
[root@nacos-nfs ~]# vim /data/alertmanager/alertmanager.yml # edit the Alertmanager configuration file
global:
  resolve_timeout: 1m
  # SMTP server:
  smtp_smarthost: 'smtp.qq.com:25'
  # sender:
  smtp_from: '1036981484@qq.com'
  smtp_auth_username: '1036981484@qq.com'
  # sender's authorization code:
  smtp_auth_password: 'fwbxnmbfnrpvbedi'
  smtp_require_tls: false
# added:
templates:
  - '/data/alertmanager/templates/*.tmpl' # path of the template files
# routing tree:
route:
  group_by: ['alertname'] # group alerts by the alertname label; its value is the name of the alert rule defined in the Prometheus rules files, so each rule forms its own group
  group_wait: 10s # how long to wait before sending the first notification for a new group; alerts arriving within these 10s are merged into the same notification
  group_interval: 10s # interval before notifying about new alerts added to an existing group
  repeat_interval: 10s # interval before re-sending a notification for alerts that are still firing
  receiver: 'mail' # name of the receiver to send to; the actual recipients are defined under that name below
receivers:
- name: 'mail'
  email_configs:
  - to: '1441107787@qq.com'
    send_resolved: true # also send a notification when the alert resolves
    # added:
    html: '{{ template "email.template.tmpl" . }}' # which template to render
[root@nacos-nfs ~]# mkdir /data/alertmanager/templates # create the directory for custom alert templates
[root@nacos-nfs ~]# vim /data/alertmanager/templates/email.template.tmpl # edit the custom alert template
{{ define "email.template.tmpl" }}
{{- if gt (len .Alerts.Firing) 0 -}}{{ range .Alerts.Firing }}
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}{{ range .Alerts.Resolved }}
Resolved: the alert has recovered.
Alert name: {{ .Labels.alertname }}
Instance: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Severity: {{ .Labels.severity }}
Start time: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved time: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++
{{ end }}{{ end -}}
{{- end }}
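The value 28800e9 passed to Add in the template is a duration in nanoseconds. Alertmanager timestamps are in UTC, and the hosts in this guide report CST, so the template shifts them by a fixed offset. The arithmetic can be checked with a short sketch (the helper name ns_to_hours is made up for illustration):

```shell
# Convert a duration in nanoseconds to whole hours.
ns_to_hours() {
  echo $(( $1 / 1000000000 / 3600 ))
}

# 28800e9 ns written out in full: 28800 seconds, i.e. the UTC+8 offset.
ns_to_hours 28800000000000
```

For a different time zone only the constant in the template needs to change, for example 32400e9 for UTC+9.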
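Test alerting:
Set the threshold for Redis current connections to 0, the threshold for Redis memory usage to 0, and the rejected-connections condition to = 0; the alerts then fire as follows:
When an alert fires, the templated email looks like this:
When the alert resolves, the templated email looks like this: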