- Introduction
- Deploying Prometheus
- Deploying Grafana
- Monitoring server nodes
- Pushgateway data collection and Alertmanager alerting
1. Introduction
Prometheus is an open-source systems monitoring and alerting toolkit. It has joined the CNCF, becoming the second project hosted there after Kubernetes, and it is the usual monitoring companion in Kubernetes clusters. It supports many exporters for collecting data as well as a pushgateway for pushing metrics, and its performance is sufficient for clusters of over ten thousand machines.
Grafana is an open-source application written in Go for visualizing large volumes of metric data. It is the most popular time-series visualization tool in infrastructure and application analytics, and it supports most of the common time-series databases. Each data source has its own query editor, customized to expose the features of that particular source. Officially supported data sources include Graphite, Elasticsearch, InfluxDB, Prometheus, CloudWatch, MySQL, and OpenTSDB.
2. Deploying Prometheus
Download the latest release from the Prometheus download page (https://prometheus.io/download/); the tarball bundles the tools Prometheus needs.
[root@localhost ~]# mkdir -p /app/prometheus
[root@localhost ~]# cd /app/prometheus
[root@localhost prometheus]# wget https://github.com/prometheus/prometheus/releases/download/v2.33.3/prometheus-2.33.3.linux-amd64.tar.gz
[root@localhost prometheus]# tar zxvf prometheus-2.33.3.linux-amd64.tar.gz
[root@localhost prometheus]# cd prometheus-2.33.3
Have a look at the contents of the Prometheus package, then edit the configuration file to set up the various monitoring targets.
[root@localhost prometheus-2.33.3]# ls
console_libraries consoles data LICENSE NOTICE prometheus prometheus.yml promtool
[root@localhost prometheus-2.33.3]# vim prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # prometheus server
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.137.100:9090"]
  # pushgateway collector
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['192.168.137.100:9091']
        labels:
          instance: pushgateway
  # node monitoring
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.137.100:9100','192.168.137.2:9100','192.168.137.3:9100','47.99.57.254:8100']
  # MySQL monitoring
  - job_name: 'mysqld_exporter'
    static_configs:
      - targets: ['47.99.57.254:9104']
  # nginx monitoring
  - job_name: 'nginx_node'
    static_configs:
      - targets: ['192.168.137.3:9913']
        labels:
          instance: web1
[root@localhost prometheus-2.33.3]# ./prometheus --config.file=/app/prometheus/prometheus-2.33.3/prometheus.yml --storage.tsdb.path=/app/prometheus/prometheus-2.33.3/data/ &
The service starts successfully. For reliability, configuring Prometheus to start on boot also makes later maintenance easier:
cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=prometheus
After=network.target
[Service]
Type=simple
# The prometheus user must already exist and own the data directory, or change this to root.
User=prometheus
ExecStart=/app/prometheus/prometheus-2.33.3/prometheus --config.file=/app/prometheus/prometheus-2.33.3/prometheus.yml --storage.tsdb.path=/app/prometheus/prometheus-2.33.3/data
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start prometheus.service
systemctl status prometheus.service
systemctl enable prometheus.service
Visit 192.168.137.100:9090 to open the Prometheus web UI.
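With the UI up, the server can also be checked from a shell. A quick sketch, assuming the address 192.168.137.100:9090 used in prometheus.yml above; `/-/healthy` and `/api/v1/targets` are standard Prometheus HTTP endpoints:

```shell
#!/bin/sh
# Sanity-check a running Prometheus server over its HTTP API.
# Address taken from prometheus.yml above; adjust to your host.
PROM="192.168.137.100:9090"

# Liveness probe: prints "Prometheus Server is Healthy." when up.
curl -s --connect-timeout 2 "http://$PROM/-/healthy" \
  || echo "Prometheus not reachable at $PROM"

# Health of every scrape target ("up" or "down").
curl -s --connect-timeout 2 "http://$PROM/api/v1/targets" \
  | grep -o '"health":"[a-z]*"' \
  || echo "no target data from $PROM"
```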
3. Deploying Grafana
[root@localhost ~]# cd /app/prometheus
[root@localhost prometheus]# wget https://dl.grafana.com/enterprise/release/grafana-enterprise-8.4.1.linux-amd64.tar.gz
[root@localhost prometheus]# tar zxvf grafana-enterprise-8.4.1.linux-amd64.tar.gz
[root@localhost prometheus]# cd grafana-8.4.1
[root@localhost grafana-8.4.1]# nohup ./bin/grafana-server web > ./grafana.log 2>&1 &
Check that the service process and port are up.
Visit Grafana at the server IP and port (3000 by default) and add Prometheus to Grafana as a data source.
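The data source can also be registered through Grafana's HTTP API instead of the UI. A sketch assuming Grafana's default admin:admin credentials on its default port 3000 and the Prometheus address used earlier:

```shell
#!/bin/sh
# Add Prometheus as a Grafana data source via the HTTP API.
# admin:admin and port 3000 are Grafana defaults; change them if customized.
GRAFANA="http://admin:admin@192.168.137.100:3000"

curl -s --connect-timeout 2 -X POST "$GRAFANA/api/datasources" \
  -H "Content-Type: application/json" \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://192.168.137.100:9090","access":"proxy","isDefault":true}' \
  || echo "Grafana not reachable"
```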
4. Node monitoring with node_exporter
[root@localhost prometheus]# wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
[root@localhost prometheus]# tar zxvf node_exporter-1.3.1.linux-amd64.tar.gz
[root@localhost prometheus]# mv node_exporter-1.3.1 node_exporter
[root@localhost prometheus]# cd node_exporter
[root@localhost node_exporter]# ./node_exporter --web.listen-address=:9100 >node_exporter.log 2>&1 &
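Before expecting the node to show up in Prometheus, it is worth confirming the exporter answers locally; node_cpu_seconds_total is one of node_exporter's standard metrics:

```shell
#!/bin/sh
# Verify node_exporter is serving metrics on its listen port.
NODE="localhost:9100"

# Count the node_cpu_seconds_total samples; a non-zero count means it works.
curl -s --connect-timeout 2 "http://$NODE/metrics" \
  | grep -c '^node_cpu_seconds_total' \
  || echo "no metrics from $NODE"
```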
Once the service is up, Prometheus starts monitoring the node (the node_exporter targets must be listed as ip:port in prometheus.yml, which was already done above, so you only need to start node_exporter on each corresponding node). For reliability, configuring node_exporter to start on boot also makes later maintenance easier.
vim /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter Monitoring System
[Service]
ExecStart=/path/to/node_exporter --web.listen-address=:9100
[Install]
WantedBy=multi-user.target
Reload systemd and enable start on boot:
systemctl daemon-reload
systemctl start node_exporter.service
systemctl status node_exporter.service
systemctl enable node_exporter.service
As shown above, the node has been added to Prometheus monitoring, so we can now use Grafana for visualization.
Import a community dashboard template and see the result (you can of course build your own).
5. Pushgateway data collection and Alertmanager alerting
5.1 Deploying pushgateway
[root@localhost ~]# cd /app/prometheus/
[root@localhost prometheus]# wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
[root@localhost prometheus]# tar zxvf pushgateway-1.4.2.linux-amd64.tar.gz
[root@localhost prometheus]# mv pushgateway-1.4.2 pushgateway
[root@localhost prometheus]# cd pushgateway
[root@localhost pushgateway]# nohup /app/prometheus/pushgateway/pushgateway --web.listen-address :9091 > /app/prometheus/pushgateway/pushgateway.log 2>&1 &
Since the startup output was redirected to /app/prometheus/pushgateway/pushgateway.log, you can cat that file to inspect it, and check that the pushgateway process is running.
To verify that data is being collected, visit IP:9091/metrics; if it renders as expected, metric collection is working.
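With the gateway up, an ad-hoc metric can be pushed from any shell using the plain-text exposition format; the metric, job, and instance names below are made up for illustration:

```shell
#!/bin/sh
# Push a custom metric to the pushgateway; Prometheus then picks it up
# via the 'pushgateway' job configured in prometheus.yml above.
PGW="192.168.137.100:9091"

# Plain-text exposition format: metric_name{labels} value
payload='backup_duration_seconds{script="nightly_backup"} 42'

echo "$payload" | curl -s --connect-timeout 2 --data-binary @- \
  "http://$PGW/metrics/job/demo_job/instance/demo_instance" \
  || echo "pushgateway not reachable at $PGW"

# Clean up the pushed group afterwards:
# curl -X DELETE "http://$PGW/metrics/job/demo_job/instance/demo_instance"
```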
5.2 Deploying Alertmanager
[root@localhost prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
[root@localhost prometheus]# tar zxvf alertmanager-0.23.0.linux-amd64.tar.gz
[root@localhost prometheus]# mv alertmanager-0.23.0 alertmanager
[root@localhost prometheus]# cd alertmanager
Set up the Alertmanager systemd unit:
[root@localhost alertmanager]# cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
[Service]
Restart=on-failure
ExecStart=/app/prometheus/alertmanager/alertmanager --config.file=/app/prometheus/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
Start the alertmanager service and enable it at boot:
[root@localhost alertmanager]# systemctl start alertmanager
[root@localhost alertmanager]# systemctl enable alertmanager
[root@localhost alertmanager]# ps -elf | grep alertmanager
4 S root 913 1 0 80 0 - 181955 futex_ 08:24 ? 00:00:15 /app/prometheus/alertmanager/alertmanager --config.file=/app/prometheus/alertmanager/alertmanager.yml
0 S root 3384 3008 0 80 0 - 28206 pipe_w 10:42 pts/0 00:00:00 grep --color=auto alertmanager
Alertmanager must be registered in prometheus.yml (the alerting/alertmanagers block left commented out earlier); after adding it, restart Prometheus to reload the configuration.
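The additions to prometheus.yml look roughly like this; a sketch that assumes Alertmanager runs on the Prometheus host with its default port 9093, and that the rule files live under rules/:

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "192.168.137.100:9093"

rule_files:
  - "rules/*.yml"
```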
To keep the monitoring rules in a uniform format, I created a rules directory and put the CPU, disk, and memory alerting rules in it.
Below is a simple test; adapt the thresholds and rules to your own environment.
vim rules/cpu_rule.yml
groups:
  - name: Host
    rules:
      - alert: HostCPU
        expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by(instance)) > 10
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "{{ $labels.instance }}: High CPU Usage Detected"
          description: "{{ $labels.instance }}: CPU usage is {{ $value }}, above 10%"
vim rules/disk_rule.yml
groups:
  - name: Host
    rules:
      - alert: HostDisk
        expr: 100 * (node_filesystem_size_bytes{fstype=~"xfs|ext4"} - node_filesystem_avail_bytes) / node_filesystem_size_bytes > 30
        for: 5m
        labels:
          severity: low
        annotations:
          summary: "{{ $labels.instance }}: High Disk Usage Detected"
          description: "{{ $labels.instance }}, mountpoint {{ $labels.mountpoint }}: Disk Usage is {{ $value }}, above 30%"
vim rules/Memory_rule.yml
groups:
  - name: Host
    rules:
      - alert: HostMemory
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 20
        for: 5m
        labels:
          severity: middle
        annotations:
          summary: "{{ $labels.instance }}: High Memory Usage Detected"
          description: "{{ $labels.instance }}: Memory Usage is {{ $value }}, above 20%"
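Before restarting Prometheus, the rule files can be validated with promtool, which ships in the tarball unpacked earlier; the path below matches the layout used in this article:

```shell
#!/bin/sh
# Syntax-check the alerting rules before loading them into Prometheus.
PROM_DIR="/app/prometheus/prometheus-2.33.3"

"$PROM_DIR/promtool" check rules "$PROM_DIR"/rules/*.yml \
  || echo "promtool not found or rules invalid under $PROM_DIR"
```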
To make the effect easy to see, the thresholds are deliberately low: CPU usage above 10%, disk usage above 30%, or memory usage above 20% triggers an alert, as shown below: