Deploying Prometheus
Download and install
Download URL
https://prometheus.io/download/
Create the prometheus user
useradd prometheus -s /sbin/nologin
Create the Prometheus data directory
mkdir -p /data/prometheus
chown prometheus:prometheus /data/prometheus
Extract and deploy
tar -zxf prometheus-2.25.0.linux-amd64.tar.gz
mv prometheus-2.25.0.linux-amd64 /usr/local/prometheus
Configuration file
vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.2.150:9090']
  - job_name: 'node'
    static_configs:
      - targets:
          - '192.168.2.151:9100'
          - '192.168.2.152:9100'
Configuration notes
scrape_interval: how often targets are scraped. evaluation_interval: how often alerting/recording rules are evaluated.
Create the systemd service unit
cat > /usr/lib/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention=30d \
--web.console.libraries=/usr/local/prometheus/console_libraries \
--web.console.templates=/usr/local/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
EOF
Startup flag notes
- web.read-timeout: maximum time to wait before closing an idle request (keeps idle connections from tying up resources)
- web.max-connections: maximum number of simultaneous connections (caps the connections used when serving data, avoiding wasted resources)
- storage.tsdb.retention: how long monitoring data is kept (around 15 days is typical in production)
- storage.tsdb.path: where monitoring data is stored (defaults to a data/ directory under the current working directory if unset)
- query.timeout: user-facing limit that force-terminates oversized slow queries
- query.max-concurrency: user-facing limit on the number of concurrent queries
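Because the unit file passes --web.enable-lifecycle, a running server can reload prometheus.yml over HTTP instead of being restarted. A minimal sketch, using the listen address configured above (the final curl needs a running server, so it is left commented out):

```shell
# Build the reload endpoint from the listen address configured above.
PROM_ADDR="192.168.2.150:9090"
RELOAD_URL="http://${PROM_ADDR}/-/reload"
echo "reload endpoint: ${RELOAD_URL}"
# Trigger the reload against a running server:
# curl -X POST "${RELOAD_URL}"
```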
Start Prometheus
systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus
Starting from the command line
/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --storage.tsdb.retention=30d --web.console.libraries=/usr/local/prometheus/console_libraries --web.console.templates=/usr/local/prometheus/consoles --web.listen-address=0.0.0.0:9090 --web.read-timeout=5m --web.max-connections=10 --query.max-concurrency=20 --query.timeout=2m --web.enable-lifecycle
node-exporter
https://prometheus.io/
Extract and deploy
tar -zxf node_exporter-1.1.2.linux-amd64.tar.gz
mv node_exporter-1.1.2.linux-amd64 /usr/local/node_exporter
Create the systemd service unit
cat > /usr/lib/systemd/system/node-exporter.service << 'EOF'
[Unit]
Description=This is prometheus node exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start node_exporter
systemctl daemon-reload
systemctl start node-exporter
systemctl enable node-exporter
Calculation methods
Metric types
counter: a monotonically increasing value, e.g. cumulative CPU time, accumulated uptime, or total request count
gauge: a value with no monotonic pattern that can go up or down, e.g. bandwidth or request volume sampled at a point in time
Common functions
rate() function
rate() is made for counter metrics: given a time range, it returns the counter's average per-second increase over that range.
rate(node_cpu_seconds_total{mode="user"}[5m])
increase() function
increase() also targets continuously growing counters; it returns the total increase over the given time range.
increase(node_cpu_seconds_total[1m])
Difference between rate() and increase()
rate(): per-second increase over a range. increase(): total increase over a range.
When to use which
rate() suits high scrape frequencies; increase() suits low scrape frequencies.
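The relationship between the two can be checked with toy numbers: over a window of W seconds, rate() is simply increase() divided by W. A small sketch with made-up counter samples:

```shell
# Hypothetical counter samples at the start and end of a 5m (300s) window.
start=1000
end=1600
window=300
increase=$((end - start))                   # what increase() would report
rate=$(awk -v d="$increase" -v w="$window" 'BEGIN{printf "%.1f", d/w}')
echo "increase=${increase} rate=${rate}/s"  # increase=600 rate=2.0/s
```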
sum() function
Wrapping an expression in sum() adds all the matching series together.
by (instance)
Used together with sum() to split the aggregated result back out by host.
sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
instance splits by host; you can also split with by (cluster_name).
node_exporter only labels series by host (instance).
To split by cluster_name, you need to define a custom label.
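A hypothetical sketch of such a custom label: static_configs entries in prometheus.yml accept a labels block, so a cluster_name label (value made up here) can be attached to each target and then used with by (cluster_name):

```yaml
  - job_name: 'node'
    static_configs:
      - targets:
          - '192.168.2.151:9100'
          - '192.168.2.152:9100'
        labels:
          cluster_name: 'web'   # hypothetical cluster label
```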
topk() function
Returns the top N series by value.
topk(3, count_netstat_wait_connections)
topk() can be used on gauge metrics directly, and on counter metrics as well (though the counter must be wrapped in rate() or similar for the result to be meaningful).
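As a shell analogy for what topk(3, …) keeps from a set of sample values:

```shell
# Keep the three largest of five sample values, mimicking topk(3, ...).
printf '%s\n' 5 9 1 7 3 | sort -nr | head -3
```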
node_exporter query examples:
CPU usage
(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))/ (sum(increase(node_cpu_seconds_total[5m])) by (instance))))*100
Breakdown:
node_cpu_seconds_total: cumulative CPU time in seconds; mode="idle": idle CPU time; by (instance): split per host. In words: (1 - (sum of idle CPU time) / (sum of total CPU time)) * 100
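Plugging hypothetical numbers into that formula: if one host accumulated 270 idle CPU-seconds out of 300 total over the window, usage is (1 - 270/300) * 100 = 10%:

```shell
# Hypothetical 5m increases, in CPU-seconds, for a single instance.
idle=270
total=300
usage=$(awk -v i="$idle" -v t="$total" 'BEGIN{printf "%.0f", (1 - i/t) * 100}')
echo "cpu usage: ${usage}%"   # cpu usage: 10%
```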
Percentage of CPU time spent waiting on disk I/O
((sum(increase(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance))/(sum(increase(node_cpu_seconds_total[5m])) by (instance)))*100
pushgateway
Download
https://prometheus.io/
Extract and deploy
tar -zxf pushgateway-1.4.0.linux-amd64.tar.gz
mv pushgateway-1.4.0.linux-amd64 /usr/local/pushgateway
Create the systemd service unit
cat > /usr/lib/systemd/system/pushgateway.service << 'EOF'
[Unit]
Description=This is prometheus pushgateway
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/pushgateway/pushgateway
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start pushgateway
systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway
Sample script
#!/bin/bash
instance_name=$(hostname -f | cut -d'.' -f1)
if [ "${instance_name}" == "localhost" ]; then
  echo "Must be an FQDN hostname"
  exit 1
fi
label="count_netstat_wait_connections"
count_netstat_wait_connections=$(netstat -antp | grep -i wait | wc -l)
echo "$label: ${count_netstat_wait_connections}"
echo "$label ${count_netstat_wait_connections}" | curl --data-binary @- http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}
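The push body is just Prometheus text exposition format, so several samples can go in one request. A sketch with made-up values (the second metric name, count_netstat_estab_connections, is hypothetical, introduced only for illustration):

```shell
# Assemble a multi-metric payload, one "<metric> <value>" pair per line.
PUSH_URL="http://192.168.2.150:9091/metrics/job/pushgateway1/instance/$(hostname -s)"
payload=$(printf '%s\n' \
  "count_netstat_wait_connections 12" \
  "count_netstat_estab_connections 34")   # hypothetical sample values
echo "$payload"
# Push it to a running Pushgateway:
# echo "$payload" | curl --data-binary @- "$PUSH_URL"
```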
Add the pushgateway job to prometheus.yml
  - job_name: 'pushgateway'
    static_configs:
      - targets:
          - '192.168.2.150:9091'
Deploying Grafana
Download URL
https://grafana.com/grafana/download
Extract and install
tar -zxf grafana-7.4.3.linux-amd64.tar.gz
mv grafana-7.4.3 /usr/local/grafana
Create the Grafana configuration file
cd /usr/local/grafana/conf; cp defaults.ini grafana.ini
Create the required directories
mkdir -pv /data/grafana/data
mkdir -pv /data/grafana/log
Edit the configuration file
[paths]
data = /data/grafana/data
temp_data_lifetime = 24h
logs = /data/grafana/log
plugins = /data/grafana/plugins
provisioning = conf/provisioning
[server]
protocol = http
http_addr =
http_port = 3000
domain = 192.168.2.150
[smtp]
enabled = true
host = smtp.sina.com:465
user = wangshui898@sina.com
password = <mailbox authorization code>
cert_file =
key_file =
skip_verify = false
from_address = wangshui898@sina.com
from_name = Grafana
ehlo_identity =
startTLS_policy =
Create the systemd service unit
cat > /usr/lib/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=This is grafana-server
After=network.target
[Service]
Type=simple
WorkingDirectory=/usr/local/grafana
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s QUIT $MAINPID
ExecStart=/usr/local/grafana/bin/grafana-server --config=/usr/local/grafana/conf/grafana.ini --pidfile=/data/grafana/log/grafana-server.pid
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Start Grafana
systemctl daemon-reload
systemctl start grafana-server
systemctl enable grafana-server
Login URL
<server IP>:3000
Default username / password:
admin
admin
Alerting configuration
Useful plugins
./grafana-cli plugins install grafana-piechart-panel
Common monitoring queries
CPU usage [1m]
(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/ (sum(increase(node_cpu_seconds_total[1m])) by (instance))))*100
Breakdown:
(1 - (sum of idle CPU time) / (sum of total CPU time)) * 100
CPU disk I/O wait load [1m]
((sum(increase(node_cpu_seconds_total{mode="iowait"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))*100
Breakdown:
(sum of iowait CPU time) / (sum of total CPU time) * 100
Memory usage
(1-((node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes)/node_memory_MemTotal_bytes))*100
Available memory = free memory + buffers + cached
buffers and cached are occupied, but can be released quickly when new allocations arrive
Memory usage = used memory / total memory, i.e. (1 - available / total) * 100
Since CentOS 7, the kernel's MemAvailable field reports the actually available memory directly
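Plugging hypothetical numbers into the formula above (think of the units as MB): total=4000, free=1000, buffers=500, cached=500 gives available=2000 and usage (1 - 2000/4000) * 100 = 50%:

```shell
# Hypothetical values standing in for the node_memory_* metrics above.
total=4000; free=1000; buffers=500; cached=500
used_pct=$(awk -v t="$total" -v f="$free" -v b="$buffers" -v c="$cached" \
  'BEGIN{printf "%.0f", (1 - (f + b + c) / t) * 100}')
echo "memory usage: ${used_pct}%"   # memory usage: 50%
```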
Disk capacity usage
Method 1:
(1-(node_filesystem_free_bytes/node_filesystem_size_bytes))*100
Breakdown:
(1 - free space / total capacity) * 100
Disk read/write throughput [1m]
((rate(node_disk_read_bytes_total[1m])+rate(node_disk_written_bytes_total[1m]))/1024/1024)>0
Dividing by 1024 twice converts bytes/s to MB/s
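For example, a hypothetical reading of 5242880 bytes/s divided by 1024 twice is exactly 5 MB/s:

```shell
# Convert a hypothetical bytes-per-second reading to MB/s.
bps=5242880
mbps=$(awk -v b="$bps" 'BEGIN{printf "%.0f", b/1024/1024}')
echo "${mbps} MB/s"   # 5 MB/s
```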
Network bandwidth
rate(node_network_transmit_bytes_total[1m])
TCP connections in WAIT states [1m]
Use pushgateway with the sample script shown in the pushgateway section above.
Query:
count_netstat_wait_connections
File descriptor usage
(node_filefd_allocated/node_filefd_maximum) * 100
Network connectivity monitoring
Via pushgateway
Script
#!/bin/bash
instance_name=$(hostname -f | cut -d'.' -f1)
if [ "${instance_name}" == "localhost" ]; then
  echo "Must be an FQDN hostname"
  exit 1
fi
# Run ping once and reuse its summary line, which looks like:
# "100 packets transmitted, 98 received, 2% packet loss, time 403ms"
ping_out=$(timeout 3 ping -q -A -s 500 -W 1000 -c 100 192.168.2.150 | grep transmitted)
lspk=$(echo "$ping_out" | awk '{print $6}')
rrt=$(echo "$ping_out" | awk '{print $10}')
value_lspk=$(echo "$lspk" | sed "s/%//g")
value_rrt=$(echo "$rrt" | sed "s/ms//g")
echo "lost_packet: ${value_lspk}"
echo "lost_packet ${value_lspk}" | curl --data-binary @- http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}
echo "rrt: ${value_rrt}"
echo "rrt ${value_rrt}" | curl --data-binary @- http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}
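The awk field numbers in the script map onto ping's summary line like this (the sample line below is made up; field 6 is the loss percentage and field 10 the total run time):

```shell
# A sample ping(8) summary line and the fields the script extracts.
line="100 packets transmitted, 98 received, 2% packet loss, time 403ms"
loss=$(echo "$line" | awk '{print $6}' | sed 's/%//g')
total_time=$(echo "$line" | awk '{print $10}' | sed 's/ms//g')
echo "loss=${loss}% total_time=${total_time}ms"   # loss=2% total_time=403ms
```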
Disk space usage percentage
((node_filesystem_size_bytes{mountpoint="/"}-node_filesystem_free_bytes{mountpoint="/"})/node_filesystem_size_bytes{mountpoint="/"})*100
Disk space usage amount
- Pie chart
Used capacity
(node_filesystem_size_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"}-node_filesystem_free_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024
Free capacity
(node_filesystem_free_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024
Total disk capacity
(node_filesystem_size_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024
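The node_filesystem_* metrics are in bytes, so the three /1024 divisions above yield GiB; for instance a hypothetical 53687091200-byte filesystem is exactly 50 GiB:

```shell
# Convert a hypothetical filesystem size in bytes to GiB.
size_bytes=53687091200
gib=$(awk -v b="$size_bytes" 'BEGIN{printf "%.0f", b/1024/1024/1024}')
echo "total: ${gib} GiB"   # total: 50 GiB
```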