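Prometheus + Grafana + multiple metrics exporters
In this example, Prometheus is installed on host 128.5.80.182.
1 Install the Prometheus server
1.1 Installation
# Extract the installation package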
tar zxvf prometheus-2.44.0.linux-amd64.tar.gz
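# Move the extracted files to the target directory
cd prometheus-2.44.0.linux-amd64
mkdir -p /home/ap/prometheus
mv * /home/ap/prometheus
# Symlink the binary into a directory on PATH
ln -s /home/ap/prometheus/prometheus /usr/local/bin/prometheus
# Verify the installed Prometheus version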
prometheus --version
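1.2 Create the required directories
mkdir -p /home/ap/prometheus/log   # log directory
mkdir -p /home/ap/prometheus/data  # monitoring data directory
1.3 Startup options
# Option 1 (not recommended: runs in the foreground, so the terminal must stay open)
prometheus --config.file=/home/ap/prometheus/prometheus.yml --web.enable-lifecycle
# Option 2 (run in the background and write logs to prometheus.log)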
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
# Option 3: run as a service controlled through systemctl
Not configured for now.
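Option 3 is left unconfigured here. As a minimal sketch of what such a service could look like, the function below writes a systemd unit that reuses the paths and flags from option 2. The unit name prometheus.service, the function name write_prometheus_unit, running as root, and the Restart/ExecReload settings are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Sketch: write a systemd unit for Prometheus.
# Paths and flags are copied from startup option 2; the rest is assumed.
write_prometheus_unit() {
  # $1: destination path, e.g. /etc/systemd/system/prometheus.service
  cat > "$1" <<'EOF'
[Unit]
Description=Prometheus monitoring server
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/prometheus \
  --config.file=/home/ap/prometheus/prometheus.yml \
  --storage.tsdb.path=/home/ap/prometheus/data \
  --web.enable-lifecycle
# Prometheus reloads its configuration on SIGHUP
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}
```

To try it, call write_prometheus_unit /etc/systemd/system/prometheus.service, then run systemctl daemon-reload and systemctl enable --now prometheus. Stop any instance started with nohup first, since both would try to bind port 9090. Under systemd the log goes to the journal (journalctl -u prometheus) instead of prometheus.log.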
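1.4 Access via browser
http://128.5.80.182:9090/metrics (Prometheus's own metrics; the web UI is at http://128.5.80.182:9090)
2 node_exporter
node_exporter collects operating-system metrics such as CPU, memory, disk, network, and I/O. The Grafana dashboard looks like the figure below: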
2.1 Install node_exporter
# On every node to be monitored, extract the package and move it to the target location
cd /home/ap/prometheus
tar zxvf node_exporter-1.5.0.linux-amd64.tar.gz
mv node_exporter-1.5.0.linux-amd64 /home/ap/prometheus/node_exporter
2.2 Start node_exporter
# Start:
nohup /home/ap/prometheus/node_exporter/node_exporter >/dev/null 2>&1 &
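2.3 Configure prometheus.yml
Add the following job under scrape_configs:
scrape_configs:
  - job_name: "node"
    file_sd_configs:
      - files:
          - targets/nodes.yml
        refresh_interval: 2m
    scrape_interval: 15s
    static_configs:
      - targets: []
Configure the target list (create the targets directory first if it does not exist)
mkdir -p /home/ap/prometheus/targets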
vi /home/ap/prometheus/targets/nodes.yml
- targets:
    - 128.5.80.160:9100
    - 128.5.80.95:9100
    - 128.5.80.96:9100
    - 128.5.80.97:9100
    - 128.1.80.43:9100
    - 128.5.80.182:9100
2.4 Restart the Prometheus server
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
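Because Prometheus is started with --web.enable-lifecycle, a change to prometheus.yml can also be applied without killing the process. The sketch below first validates the file with promtool (shipped in the same tarball, so assumed to be at /home/ap/prometheus/promtool after the move in 1.1) and then sends a POST to the /-/reload endpoint. The function name reload_prometheus and the PROM_URL, PROM_CONFIG, and PROMTOOL variables are illustrative.

```shell
#!/bin/sh
# Sketch: validate the configuration, then hot-reload Prometheus.
PROM_URL="${PROM_URL:-http://128.5.80.182:9090}"
PROM_CONFIG="${PROM_CONFIG:-/home/ap/prometheus/prometheus.yml}"
PROMTOOL="${PROMTOOL:-/home/ap/prometheus/promtool}"   # assumed location

reload_prometheus() {
  # Refuse to reload if the configuration does not pass the syntax check
  "$PROMTOOL" check config "$PROM_CONFIG" || return 1
  # The reload endpoint is only available with --web.enable-lifecycle
  curl -fsS -X POST "$PROM_URL/-/reload"
}
```

Calling reload_prometheus after an edit replaces the kill and restart sequence. Changes to the files under targets/ are picked up by file_sd on its own, so only edits to prometheus.yml or the rule files need a reload.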
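3 Install oracledb_exporter
oracledb_exporter monitors Oracle database metrics such as tablespaces, sessions, parsing, and wait events. The Grafana dashboard looks like the figure below:
There are two ways to deploy it for Oracle monitoring:
1. Run the exporter on each database server to be monitored.
2. Run the exporters on the Prometheus server. The software is installed only once and all exporter processes live on the monitoring server, so the exporter processes themselves consume no resources on the database servers.
This environment uses the second approach.
3.1 Install the exporter
# Extract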
tar zxvf oracledb_exporter.0.3.0rc1-ora18.5.linux-amd64.tar.gz
mv oracledb_exporter.0.3.0rc1-ora18.5.linux-amd64 oracledb_exporter
3.2 Configure environment variables
# Set these as root so the monitoring host can connect to the databases
export ORACLE_HOME=/home/db/oracle/product/19.3.0
export PATH=$PATH:/home/db/oracle/product/19.3.0/bin
export LD_LIBRARY_PATH=/home/db/oracle/product/19.3.0/lib
3.3 Test database connectivity
# Create the same monitoring account on every database, with as few privileges as possible
create user prometheus identified by Abcd_123;
grant create session to prometheus;
grant select_catalog_role to prometheus;
# Test connectivity
sqlplus prometheus/Abcd_123@128.5.80.182:11521/clouddb
sqlplus prometheus/Abcd_123@128.5.80.97:11521/nbutf8db
sqlplus prometheus/Abcd_123@128.5.80.160:1521/odsbptdb
sqlplus prometheus/Abcd_123@128.5.80.160:11522/zyqdb
sqlplus prometheus/Abcd_123@128.1.80.43:1521/jstsptdb
sqlplus prometheus/Abcd_123@128.1.80.43:1522/P8UTF8DB
sqlplus prometheus/Abcd_123@128.5.80.95:11521/jstsptdb
sqlplus prometheus/Abcd_123@128.5.80.95:11521/nbutf8db
sqlplus prometheus/Abcd_123@128.5.80.96:11521/jstsptdb
sqlplus prometheus/Abcd_123@128.5.80.96:11521/nbutf8db
sqlplus prometheus/Abcd_123@128.5.80.97:11521/jstsptdb
3.4 Start the exporters
Each database gets its own exporter process.
# Database 1
export DATA_SOURCE_NAME=prometheus/Abcd_123@128.5.80.182:11521/clouddb
nohup /root/oracledb_exporter/oracledb_exporter --default.metrics=/root/oracledb_exporter/default-metrics.toml --web.listen-address :9161 >/dev/null 2>&1 &
# Check that metrics can be fetched
curl http://128.5.80.182:9161/metrics
# Database 2
export DATA_SOURCE_NAME=prometheus/Abcd_123@128.5.80.97:11521/nbutf8db
nohup /root/oracledb_exporter/oracledb_exporter --default.metrics=/root/oracledb_exporter/default-metrics.toml --web.listen-address :9162 >/dev/null 2>&1 &
# Check that metrics can be fetched
curl http://128.5.80.182:9162/metrics
# Database 3
export DATA_SOURCE_NAME=prometheus/Abcd_123@128.5.80.160:1521/odsbptdb
nohup /root/oracledb_exporter/oracledb_exporter --default.metrics=/root/oracledb_exporter/default-metrics.toml --web.listen-address :9163 >/dev/null 2>&1 &
# Check that metrics can be fetched
curl http://128.5.80.182:9163/metrics
Repeat for the remaining databases, giving each exporter its own port.
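The per-database steps can be rolled into a loop. The sketch below is one way to do it, assuming the connection strings tested in 3.3 and the port order used in targets/db.yml (9161 for the first database, 9171 for the last). The function name start_all and the DRY_RUN switch are made up for illustration.

```shell
#!/bin/sh
# Sketch: start one oracledb_exporter per database on consecutive ports.
# Connection strings in the same order as the ports listed in targets/db.yml.
DATABASES='128.5.80.182:11521/clouddb
128.5.80.97:11521/nbutf8db
128.5.80.160:1521/odsbptdb
128.5.80.160:11522/zyqdb
128.1.80.43:1521/jstsptdb
128.1.80.43:1522/P8UTF8DB
128.5.80.95:11521/jstsptdb
128.5.80.95:11521/nbutf8db
128.5.80.96:11521/jstsptdb
128.5.80.96:11521/nbutf8db
128.5.80.97:11521/jstsptdb'

start_all() {
  port=9161
  for db in $DATABASES; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      # Only print what would be started
      echo "would start exporter for $db on :$port"
    else
      # DATA_SOURCE_NAME is set for this one exporter process only
      DATA_SOURCE_NAME="prometheus/Abcd_123@$db" \
        nohup /root/oracledb_exporter/oracledb_exporter \
        --default.metrics=/root/oracledb_exporter/default-metrics.toml \
        --web.listen-address ":$port" >/dev/null 2>&1 &
    fi
    port=$((port + 1))
  done
}
```

Running DRY_RUN=1 start_all first prints the database-to-port mapping so it can be compared with targets/db.yml before anything is started.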
3.5 Configure prometheus.yml
Add the following job under scrape_configs:
  - job_name: "oracle"
    file_sd_configs:
      - files:
          - targets/db.yml
        refresh_interval: 2m
    scrape_interval: 5m
    static_configs:
      - targets: []
Configure the target list:
vi /home/ap/prometheus/targets/db.yml
- targets:
    - 128.5.80.182:9161
  labels:
    dbname: '80.182-clouddb'
- targets:
    - 128.5.80.182:9162
  labels:
    dbname: '80.97-nbutf8db'
- targets:
    - 128.5.80.182:9163
  labels:
    dbname: '80.160-odsbptdb'
- targets:
    - 128.5.80.182:9164
  labels:
    dbname: '80.160-zyqdb'
- targets:
    - 128.5.80.182:9165
  labels:
    dbname: '80.43-jstsptdb'
- targets:
    - 128.5.80.182:9166
  labels:
    dbname: '80.43-P8UTF8DB'
- targets:
    - 128.5.80.182:9167
  labels:
    dbname: '80.95-jstsptdb'
- targets:
    - 128.5.80.182:9168
  labels:
    dbname: '80.95-nbutf8db'
- targets:
    - 128.5.80.182:9169
  labels:
    dbname: '80.96-jstsptdb'
- targets:
    - 128.5.80.182:9170
  labels:
    dbname: '80.96-nbutf8db'
- targets:
    - 128.5.80.182:9171
  labels:
    dbname: '80.97-jstsptdb'
3.6 Restart the Prometheus server
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
4 Install mysqld_exporter
mysqld_exporter monitors MySQL metrics such as connections and table locks. The Grafana dashboard looks like the figure below:
4.1 Install mysqld_exporter
# Extract
cd /home/ap/prometheus
tar -zxvf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64 mysqld_exporter
4.2 Create the monitoring user
create user 'exporter'@'localhost' identified by 'Exporter_123';
grant process,replication client,select on *.* to 'exporter'@'localhost';
4.3 Add the configuration file
vi /home/ap/prometheus/mysqld_exporter/mysqld_exporter.cnf
[client]
host=127.0.0.1
port=3306
user=exporter
password=Exporter_123
4.4 Start the exporter
nohup /home/ap/prometheus/mysqld_exporter/mysqld_exporter --config.my-cnf=/home/ap/prometheus/mysqld_exporter/mysqld_exporter.cnf 2>&1 &
Check that metrics are being collected
curl http://128.5.80.182:9104/metrics
4.5 Configure prometheus.yml
Add the following job under scrape_configs:
  - job_name: "mysql"
    file_sd_configs:
      - files:
          - targets/mysql.yml
        refresh_interval: 2m
    scrape_interval: 2m
    static_configs:
      - targets: []
Configure the target list
vi /home/ap/prometheus/targets/mysql.yml
- targets:
    - 128.5.80.182:9104
4.6 Restart the Prometheus server
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
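5 Install vmware_exporter
vmware_exporter monitors virtual machine usage metrics. The Grafana dashboard looks like the figure below:
5.1 Install Docker
Docker installation is not covered here.
5.2 Load the vmware_exporter image
# Location of the image file
/home/ap/prometheus/vmware_exporter/vmware_exporter.tar.gz
# Load the image
docker load -i /home/ap/prometheus/vmware_exporter/vmware_exporter.tar.gz
# Confirm that the image was loaded
docker images
5.3 Edit the configuration file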
vi /home/ap/prometheus/vmware_exporter/config.env
VSPHERE_USER=look@vsphere.local
VSPHERE_PASSWORD=Jsccb@123
VSPHERE_HOST=128.5.80.175
VSPHERE_IGNORE_SSL=TRUE
VSPHERE_SPECS_SIZE=2000
5.4 Start the container
docker run -itd -p 9272:9272 --name vmware_exporter --env-file /home/ap/prometheus/vmware_exporter/config.env pryorda/vmware_exporter
Verify that metrics can be collected
curl http://localhost:9272/metrics
http://128.5.80.182:9272/metrics
5.5 Configure prometheus.yml
Add the following to the main configuration file
  - job_name: "vmware_vcenter"
    file_sd_configs:
      - files:
          - targets/vmware_vcenter.yml
        refresh_interval: 2m
    scrape_interval: 2m
    static_configs:
      - targets: []
Configure the target list
vi /home/ap/prometheus/targets/vmware_vcenter.yml
- targets:
    - 128.5.80.182:9272
5.6 Restart the Prometheus server
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
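6 Install ipmi_exporter
ipmi_exporter monitors the state of the sensors inside a physical chassis, such as fan, temperature, and storage sensors. The Grafana dashboard looks like the figure below:
6.1 Installation
# FreeIPMI is required
yum install freeipmi
# Extract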
cd /home/ap/prometheus
tar -zxvf ipmi_exporter-1.6.1.linux-amd64.tar.gz
mv ipmi_exporter-1.6.1.linux-amd64 ipmi_exporter
6.2 Edit the IPMI configuration file
vi /home/ap/prometheus/ipmi_exporter/ipmi_remote.yml
modules:
  default:
    user: "Administrator"
    pass: "Fence12#$"
    driver: "LAN_2_0"
    privilege: "user"
    timeout: 10000
    collectors:
      - bmc
      - ipmi
      - chassis
    exclude_sensor_ids:
      - 2
      - 29
      - 32
      - 50
      - 52
      - 55
6.3 Start ipmi_exporter
cd /home/ap/prometheus/ipmi_exporter
./ipmi_exporter --config.file=/home/ap/prometheus/ipmi_exporter/ipmi_remote.yml &
# Test
http://128.5.80.182:9290
# Enter an iLO address on that page to check whether data can be scraped
iLO addresses: 128.5.80.147 128.5.80.148
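The same check can be scripted. In remote mode ipmi_exporter is scraped at /ipmi with the BMC address passed in the target parameter, which is also how the relabeling in 6.4 addresses it. A minimal sketch, assuming the exporter listens on 128.5.80.182:9290 and the default module from 6.2; the function names ipmi_probe_url and probe_ipmi and the EXPORTER variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: probe one BMC/iLO address through ipmi_exporter.
EXPORTER="${EXPORTER:-128.5.80.182:9290}"

# Build the scrape URL for a given BMC address
ipmi_probe_url() {
  printf 'http://%s/ipmi?module=default&target=%s\n' "$EXPORTER" "$1"
}

# Fetch the metrics for one BMC and print how many ipmi_* samples came back
probe_ipmi() {
  curl -fsS --max-time 30 "$(ipmi_probe_url "$1")" | grep -c '^ipmi_'
}
```

For example, probe_ipmi 128.5.80.147 should print a non-zero count when the credentials in ipmi_remote.yml work for that iLO.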
6.4 Configure the main prometheus.yml file
Add the following
  - job_name: "ipmi"
    params:
      module: ['default']
    scrape_interval: 1m
    scrape_timeout: 30s
    metrics_path: /ipmi
    scheme: http
    file_sd_configs:
      - files:
          - targets/ipmi.yml
        refresh_interval: 2m
    relabel_configs:
      - source_labels: [__address__]
        separator: ;
        regex: (.*)
        target_label: __param_target
        replacement: ${1}
        action: replace
      - source_labels: [__param_target]
        separator: ;
        regex: (.*)
        target_label: instance
        replacement: ${1}
        action: replace
      - separator: ;
        regex: .*
        target_label: __address__
        replacement: 128.5.80.182:9290
        action: replace
# Add the monitored hosts
vi /home/ap/prometheus/targets/ipmi.yml
- targets:
    - 128.5.80.148
    - 128.5.80.147
    - 128.5.80.149
    - 128.5.80.150
    - 128.5.80.222
    - 128.5.80.223
    - 128.5.80.141
    - 128.5.80.168
    - 128.5.80.139
    - 128.5.80.140
    - 128.5.80.240
    - 128.5.80.241
    - 128.5.80.242
  labels:
    job: ipmi_exporter
6.5 Restart Prometheus
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
7 Install Grafana
Grafana works alongside Prometheus and renders the data Prometheus collects as graphs, which makes monitoring easier.
7.1 Install the RPM package
rpm -ivh grafana-enterprise-9.4.10-1.x86_64.rpm
7.2 Run the service startup commands given by the installer
/bin/systemctl daemon-reload
/bin/systemctl enable grafana-server.service
/bin/systemctl start grafana-server.service
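7.3 Access via browser
http://128.5.80.182:3000
The default username/password is admin/admin.
7.4 Import dashboard templates
Dashboard templates can be downloaded from the official Grafana website.
Import them as shown in the figure below:
Select the dashboard template downloaded from the official site to import it.
8 Configure alerting rules
8.1 Create the rules directory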
cd /home/ap/prometheus
mkdir rules
8.2 Add the directory to the main prometheus.yml file
vi prometheus.yml
rule_files:
  - "rules/*.rules"
8.3 Create the alerting rules
vi /home/ap/prometheus/rules/alerts.rules
groups:
  - name: disk_alerts
    rules:
      # Disk alert (severity: critical): partition usage above 90%
      - alert: "磁盘告警"
        expr: (1-node_filesystem_avail_bytes{mountpoint=~".*"}/node_filesystem_size_bytes{mountpoint=~".*"})*100>90
        for: 1m
        labels:
          severity: "严重警告"
        annotations:
          summary: "磁盘分区使用率告警"
          description: "磁盘使用率超过90%"
  - name: tablespaces_alerts
    rules:
      # Tablespace usage alert (severity: critical): usage above 90%
      - alert: "表空间使用率告警"
        expr: (1-oracledb_tablespace_free{type!="TEMPORARY"}/oracledb_tablespace_bytes{type!="TEMPORARY"})*100>90
        for: 5m
        labels:
          severity: "严重警告"
        annotations:
          summary: "表空间剩余空间告警"
          description: "表空间使用率超过90%"
  - name: Memory_alerts
    rules:
      # Memory alert (severity: minor): usage above 80%
      - alert: "内存告警"
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes)))* 100>80
        for: 1m
        labels:
          severity: "次要警告"
        annotations:
          summary: "内存使用率告警"
          description: "内存使用率超过80%"
  - name: cpu_alerts
    rules:
      # CPU alert (severity: critical): usage above 90% for 1 minute
      - alert: "cpu告警"
        expr: 100-avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)*100>90
        for: 1m
        labels:
          severity: "严重警告"
        annotations:
          summary: "cpu使用率告警"
          description: "cpu使用率连续1分钟超过90%"
  - name: sensors_alerts
    rules:
      # Generic sensor alert: any IPMI sensor not in the nominal state
      - alert: "传感器告警"
        expr: ipmi_sensor_state > 0
        for: 1m
        labels:
          severity: "传感器告警"
        annotations:
          summary: "传感器告警"
          description: "传感器告警"
  - name: fan_alerts
    rules:
      # Fan speed sensor alert
      - alert: "风扇转速传感器告警"
        expr: ipmi_fan_speed_state > 0
        for: 1m
        labels:
          severity: "风扇转速传感器告警"
        annotations:
          summary: "风扇转速传感器告警"
          description: "风扇转速传感器告警"
  - name: power_alerts
    rules:
      # Power sensor alert
      - alert: "电源传感器告警"
        expr: ipmi_power_state > 0
        for: 1m
        labels:
          severity: "电源传感器告警"
        annotations:
          summary: "电源传感器告警"
          description: "电源传感器告警"
  - name: temperature_alerts
    rules:
      # Temperature sensor alert
      - alert: "温度传感器告警"
        expr: ipmi_temperature_state > 0
        for: 1m
        labels:
          severity: "温度传感器告警"
        annotations:
          summary: "温度传感器告警"
          description: "温度传感器告警"
  - name: voltage_alerts
    rules:
      # Voltage sensor alert
      - alert: "电压传感器告警"
        expr: ipmi_voltage_state > 0
        for: 1m
        labels:
          severity: "电压传感器告警"
        annotations:
          summary: "电压传感器告警"
          description: "电压传感器告警"
8.4 Restart Prometheus
# Kill the old process (replace xxxx with the PID shown by ps)
ps -ef | grep prometheus
kill -9 xxxx
# Start the new process
nohup prometheus --config.file=/home/ap/prometheus/prometheus.yml \
--storage.tsdb.path=/home/ap/prometheus/data --web.enable-lifecycle > /home/ap/prometheus/log/prometheus.log 2>&1 &
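8.5 View alerts
Here (the Alerts page of the Prometheus web UI) you can see whether any of the alerting rules you created have fired.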
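The active alerts can also be read from the HTTP API at /api/v1/alerts. The sketch below counts alerts per rule name with a crude grep over the JSON response; it assumes the compact JSON that Prometheus emits and is not a general JSON parser. The names count_alerts_by_name, show_alert_counts, and PROM_URL are illustrative.

```shell
#!/bin/sh
# Sketch: summarise active alerts per alert name.
PROM_URL="${PROM_URL:-http://128.5.80.182:9090}"

# Read an /api/v1/alerts response on stdin and print one
# "<count> <alertname field>" line per rule name
count_alerts_by_name() {
  grep -o '"alertname":"[^"]*"' | sort | uniq -c
}

# Fetch the current alerts from Prometheus and summarise them
show_alert_counts() {
  curl -fsS "$PROM_URL/api/v1/alerts" | count_alerts_by_name
}
```

show_alert_counts lists both pending and firing alerts, since the endpoint returns every active alert. An empty result means no rule is currently triggered.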
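8.6 Show alert data in Grafana
The result looks like the figure below; click a number to see the detailed alert information.
For example, clicking the number 8 shows the tablespace alerts in detail, as in the figure below.
The steps to build this are as follows: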
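8.6.1 Create a panel for the alert details
8.6.2 Edit the panel
At marker 1, enter the query expression: ALERTS{alertname="表空间使用率告警"}
Note: the name inside the double quotes is the name of the alerting rule created earlier.
At marker 2, select table
At marker 3, select instance
At marker 4, select table
As circled in red in the figure below, filter the fields to display; switch off the toggle in front of any column that should not be shown.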
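8.6.3 Save the panel
After saving, the panel shows the detailed tablespace alert information, as in the figure below.
8.6.4 Get the panel link
Get the link to this alert-details panel as follows.
Copy this Link URL; it is needed later.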
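8.6.5 Create a second panel for the alert count
8.6.6 Edit the panel
In the circled field, enter the expression sum(ALERTS{alertname="表空间使用率告警"})
Note: the name inside the double quotes is the name of the alerting rule created earlier.
Adjust the settings: set format to table, type to instance, and panel to stat, and change the default threshold of 80 to 1.
Add the link information (the Link URL copied in 8.6.4).
8.6.7 Save the panel
After saving, the dashboard looks like the one shown at the start of this section. Other metrics can be set up the same way.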