Deploying Prometheus + Grafana

Deploying Prometheus

Download and install

Download URL

https://prometheus.io/download/

Create the prometheus user

useradd prometheus -s /sbin/nologin

Create the Prometheus data directory

mkdir -p /data/prometheus
chown prometheus:prometheus /data/prometheus

Extract and deploy

tar -zxf prometheus-2.25.0.linux-amd64.tar.gz
mv prometheus-2.25.0.linux-amd64 /usr/local/prometheus

Configuration file

vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.2.150:9090']
  
  - job_name: 'node'
    static_configs:
    - targets:
        - '192.168.2.151:9100'
        - '192.168.2.152:9100'

Configuration notes

scrape_interval				how often targets are scraped
evaluation_interval			how often alerting and recording rules are evaluated

Create the systemd service file

cat > /usr/lib/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=30d \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.console.templates=/usr/local/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full

SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target
EOF

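Startup flag notes

web.read-timeout		maximum time to wait on a request before timing out (keeps idle connections from tying up resources)
web.max-connections		maximum number of simultaneous connections (caps connections so that too many do not waste resources)
storage.tsdb.retention	how long monitoring data is kept (15 days is usually adequate in production; newer releases deprecate this flag in favor of storage.tsdb.retention.time)
storage.tsdb.path		where monitoring data is stored (if unset, defaults to data/ under the current working directory)
query.timeout			maximum time a query may run before it is aborted (stops oversized slow queries)
query.max-concurrency	maximum number of queries executed concurrently

Start Prometheus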

systemctl daemon-reload
systemctl start prometheus
systemctl enable prometheus

Starting from the command line

/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/data/prometheus --storage.tsdb.retention=30d --web.console.libraries=/usr/local/prometheus/console_libraries --web.console.templates=/usr/local/prometheus/consoles --web.listen-address=0.0.0.0:9090 --web.read-timeout=5m --web.max-connections=10 --query.max-concurrency=20 --query.timeout=2m --web.enable-lifecycle
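
Because the service is started with --web.enable-lifecycle, the running server should also accept a configuration reload over HTTP, in addition to the SIGHUP sent by ExecReload. The following is a minimal sketch, assuming the server address 192.168.2.150:9090 from the scrape config and the install paths used above. The function names reload_url and reload_prometheus are hypothetical helpers; only the /-/reload endpoint and the promtool binary (shipped in the same tarball) come from Prometheus itself.

```shell
#!/bin/bash
# Hypothetical helpers for reloading the Prometheus configuration.

# Build the lifecycle reload URL for a given host:port.
reload_url() {
    printf 'http://%s/-/reload\n' "$1"
}

# Validate the config file with promtool first, then ask the
# running server to reload it. Returns non-zero if either step fails.
reload_prometheus() {
    /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml || return 1
    curl -fsS -X POST "$(reload_url "$1")"
}

# Usage (not executed here): reload_prometheus 192.168.2.150:9090
```

Validating before reloading means a broken prometheus.yml is caught by promtool rather than being sent to the running server.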

node-exporter

https://prometheus.io/

Extract and deploy

tar -zxf node_exporter-1.1.2.linux-amd64.tar.gz
mv node_exporter-1.1.2.linux-amd64 /usr/local/node_exporter

Create the systemd service file

cat > /usr/lib/systemd/system/node-exporter.service << 'EOF'
[Unit]
Description=This is prometheus node exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start the exporter

systemctl daemon-reload
systemctl start node-exporter
systemctl enable node-exporter

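Calculation methods

Metric types

counter: a value that only ever increases, e.g. cumulative CPU time, cumulative uptime, cumulative request count

gauge: a value that can rise or fall freely, e.g. bandwidth or request volume sampled in real time

Commonly used functions

  • rate()

    rate() is designed for counter metrics. Over the time window you specify, it returns the counter's average per-second increase.

    rate(node_cpu_seconds_total{mode="user"}[5m])

  • increase()

    For a continuously growing counter, returns the total increase over the given time window.

    increase(node_cpu_seconds_total[1m])


Difference between rate() and increase()

rate()				average per-second increase over the window
increase()        	total increase over the window

Use cases

rate()				metrics scraped at a high frequency
increase()			metrics scraped at a low frequency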
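
The relationship between the two functions can be sketched with plain arithmetic on two counter samples. This is only an illustration: counter_rate and counter_increase are hypothetical helper names, and the real PromQL functions additionally extrapolate to the window boundaries and compensate for counter resets, which this sketch ignores.

```shell
#!/bin/bash
# Hypothetical helpers illustrating rate() vs increase() on two counter samples.

# Average per-second increase.
# $1 = first sample, $2 = last sample, $3 = seconds between the samples
counter_rate() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f\n", (b - a) / t }'
}

# Total increase over the same window.
# $1 = first sample, $2 = last sample
counter_increase() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b - a }'
}

# A counter that goes from 1000 to 1600 within a 5-minute (300 s) window:
counter_rate 1000 1600 300      # prints 2.00
counter_increase 1000 1600      # prints 600.00
```

In this simplified model the increase is the rate multiplied by the window length in seconds (2.00 * 300 = 600).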
  • sum()

    Wrapping an expression in sum() adds up all the values it returns.

    • by (instance)

      Used together with sum(); appended to the function, it splits the result by host.

sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

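instance splits the result by host; by (cluster_name) is also possible.

Out of the box, node_exporter metrics can only be split by instance.

To split by cluster_name, you need to define a custom label.

  • topk()

    Returns the N highest values.

topk(3, count_netstat_wait_connections)

topk() works on gauge values and on counter values alike (a counter only gives meaningful results when it is wrapped in rate() or a similar function).

node_exporter query examples:

Calculating CPU usage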

(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance))/ (sum(increase(node_cpu_seconds_total[5m])) by (instance))))*100

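Expression breakdown:

node_cpu_seconds_total			cumulative CPU time in seconds, per core and per mode
mode="idle"						CPU time spent idle
by (instance)					split by host
In plain terms:
(1 - (total idle CPU time) / (total CPU time across all modes)) * 100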
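
A worked example of the same arithmetic may help. The sketch below assumes a host with 4 cores observed over 5 minutes, so the total CPU time across all modes grows by 4 * 300 = 1200 seconds; the idle figure of 1080 seconds is made up for illustration, and cpu_usage_percent is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper reproducing the CPU usage formula with plain numbers.
# $1 = increase of idle CPU seconds in the window
# $2 = increase of CPU seconds over all modes in the window
cpu_usage_percent() {
    awk -v idle="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - idle / total) * 100 }'
}

# 4 cores over 300 s give 1200 CPU seconds in total; 1080 of them were idle.
cpu_usage_percent 1080 1200     # prints 10.0
```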

Percentage of CPU time spent waiting on disk I/O

((sum(increase(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance))/(sum(increase(node_cpu_seconds_total[5m])) by (instance)))*100

pushgateway

Download

https://prometheus.io/

Extract and deploy

tar -zxf pushgateway-1.4.0.linux-amd64.tar.gz
mv pushgateway-1.4.0.linux-amd64 /usr/local/pushgateway

Create the systemd service file

cat > /usr/lib/systemd/system/pushgateway.service << 'EOF'
[Unit]
Description=This is prometheus pushgateway
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/pushgateway/pushgateway
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start Pushgateway

systemctl daemon-reload
systemctl start pushgateway
systemctl enable pushgateway

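Sample script

#!/bin/bash
# Short hostname, used as the "instance" grouping label in Pushgateway
instance_name=$(hostname -f | cut -d'.' -f1)

if [ "${instance_name}" == "localhost" ]; then
        echo "The host must have a real hostname, not localhost"
        exit 1
fi

# Count TCP connections whose state contains "wait" (TIME_WAIT, CLOSE_WAIT, ...)
label="count_netstat_wait_connections"
count_netstat_wait_connections=$(netstat -antp | grep -ci wait)
echo "$label: ${count_netstat_wait_connections}"
# Push the sample in text exposition format: "<metric_name> <value>"
echo "$label ${count_netstat_wait_connections}" | curl --data-binary @- "http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}"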
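
The body sent to Pushgateway is plain text in the Prometheus exposition format. A sample pushed without a TYPE line is stored as untyped, so it can be useful to declare the type explicitly. The sketch below shows one way to build such a payload; format_gauge is a hypothetical helper name, and the value 42 and the instance name host1 are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one gauge sample preceded by its TYPE line.
# $1 = metric name, $2 = value
format_gauge() {
    printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# Show the payload that would be pushed.
format_gauge count_netstat_wait_connections 42

# Usage (not executed here):
# format_gauge count_netstat_wait_connections 42 | \
#     curl --data-binary @- http://192.168.2.150:9091/metrics/job/pushgateway1/instance/host1
```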

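Configure Pushgateway in prometheus.yml

  - job_name: 'pushgateway'
    # Keep the job/instance labels set by the pushing script instead of overwriting them
    honor_labels: true
    static_configs:
    - targets:
        - '192.168.2.150:9091'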

Deploying Grafana

Download URL

https://grafana.com/grafana/download

Extract and install

tar -zxf grafana-7.4.3.linux-amd64.tar.gz
mv grafana-7.4.3 /usr/local/grafana

Create the Grafana configuration file

cd /usr/local/grafana/conf; cp defaults.ini grafana.ini

Create the required directories

mkdir -pv /data/grafana/data
mkdir -pv /data/grafana/log

Edit the configuration file

[paths]
data = /data/grafana/data
temp_data_lifetime = 24h
logs = /data/grafana/log
plugins = /data/grafana/plugins
provisioning = conf/provisioning

[server]
protocol = http
http_addr =
http_port = 3000
domain = 192.168.2.150

[smtp]
enabled = true
host = smtp.sina.com:465
user = wangshui898@sina.com
password = <mailbox authorization code>
cert_file =
key_file =
skip_verify = false
from_address = wangshui898@sina.com
from_name = Grafana
ehlo_identity =
startTLS_policy =

Create the systemd service file

cat > /usr/lib/systemd/system/grafana-server.service << 'EOF'
[Unit]
Description=This is grafana-server
After=network.target

[Service]
Type=simple
WorkingDirectory=/usr/local/grafana
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
ExecStart=/usr/local/grafana/bin/grafana-server --config=/usr/local/grafana/conf/grafana.ini --pidfile=/data/grafana/log/grafana-server.pid
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Start Grafana

systemctl daemon-reload
systemctl start grafana-server
systemctl enable grafana-server

Login URL

http://IP:3000

Username and password:
admin
admin

Alert configuration

Commonly used plugins

./grafana-cli plugins install grafana-piechart-panel

Commonly used monitoring items

CPU usage [1m]

(1-((sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance))/ (sum(increase(node_cpu_seconds_total[1m])) by (instance))))*100

Formula breakdown:

(1 - (total idle CPU time) / (total CPU time across all modes)) * 100

CPU disk I/O wait [1m]

((sum(increase(node_cpu_seconds_total{mode="iowait"}[1m])) by (instance))/(sum(increase(node_cpu_seconds_total[1m])) by (instance)))*100

Formula breakdown:

(total CPU iowait time) / (total CPU time across all modes) * 100

Memory usage

(1-((node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes)/node_memory_MemTotal_bytes))*100

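Available memory = free memory + buffers + cached

Buffers and cached memory are in use, but they can be released quickly and reused when new data needs the space.

Memory usage = memory actually in use / total memory, i.e. (1 - (available memory / total memory)) * 100

From CentOS 7 onward, the "available" figure reports the memory that is actually available directly.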
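
On systems whose kernel exposes that figure as MemAvailable in /proc/meminfo, the same percentage can be cross-checked locally. The sketch below is one way to do it; mem_usage_percent is a hypothetical helper name. node_exporter normally exports the same value as node_memory_MemAvailable_bytes, so an equivalent query would presumably be (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100.

```shell
#!/bin/bash
# Hypothetical helper: read /proc/meminfo-style text on stdin and
# print memory usage in percent, based on MemAvailable and MemTotal.
mem_usage_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Usage on a Linux host (not executed here):
# mem_usage_percent < /proc/meminfo
```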

Disk capacity usage

Method 1:

(1-(node_filesystem_free_bytes/node_filesystem_size_bytes))*100

Formula breakdown:

(1 - (free space / total capacity)) * 100

Disk read/write throughput [1m]

((rate(node_disk_read_bytes_total[1m])+rate(node_disk_written_bytes_total[1m]))/1024/1024)>0

Dividing by 1024 twice converts bytes per second to MiB per second.

Network bandwidth

rate(node_network_transmit_bytes_total[1m])

TCP connections in a wait state (wait_connections) [1m]

Collected through Pushgateway, using the sample script from the pushgateway section above.

Formula:

count_netstat_wait_connections

File descriptor usage

(node_filefd_allocated/node_filefd_maximum) * 100

Network connectivity monitoring

Pushgateway approach

Script

#!/bin/bash
instance_name=$(hostname -f | cut -d'.' -f1)

if [ "${instance_name}" == "localhost" ]; then
        echo "The host must have a real hostname, not localhost"
        exit 1
fi

# Run ping once and reuse its summary line for both metrics
summary=$(timeout 3 ping -q -A -s 500 -W 1000 -c 100 192.168.2.150 | grep transmitted)
# On the usual iputils summary line, field 6 is the packet loss percentage
# and field 10 is the total elapsed time
lspk=$(echo "$summary" | awk '{print $6}')
rrt=$(echo "$summary" | awk '{print $10}')

value_lspk=$(echo "$lspk" | sed "s/%//g")
value_rrt=$(echo "$rrt" | sed "s/ms//g")

echo "lost_packet: ${value_lspk}"
echo "lost_packet ${value_lspk}" | curl --data-binary @- "http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}"

echo "rrt: ${value_rrt}"
echo "rrt ${value_rrt}" | curl --data-binary @- "http://192.168.2.150:9091/metrics/job/pushgateway1/instance/${instance_name}"
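
Note that the rrt value in the script is taken from field 10 of the summary line, which on iputils ping is the total run time rather than a round-trip time. If an average round-trip time is what you want, it could be read from the "rtt min/avg/max/mdev" line instead. The sketch below assumes the iputils output format (other ping implementations label this line differently); ping_avg_rtt is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read ping output on stdin and print the average RTT in ms.
# Assumes an iputils-style line such as:
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.005 ms
# Splitting on "/" puts the average in field 5.
ping_avg_rtt() {
    awk -F'/' '/^rtt/ { print $5 }'
}

# Usage (not executed here):
# ping -q -c 10 192.168.2.150 | ping_avg_rtt
```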

Disk space utilization

((node_filesystem_size_bytes{mountpoint="/"}-node_filesystem_free_bytes{mountpoint="/"})/node_filesystem_size_bytes{mountpoint="/"})*100

Disk space usage

  • Pie chart

Used capacity

(node_filesystem_size_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"}-node_filesystem_free_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024

Free capacity

(node_filesystem_free_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024

Total disk capacity

(node_filesystem_size_bytes{mountpoint="/",fstype="xfs",instance="192.168.2.151:9100"})/1024/1024/1024