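Prometheus Deployment (Part 3)

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.

Official site: https://prometheus.io Latest version at the time of writing: 2.19.2

An exporter is a component that collects monitoring data and exposes it in the format Prometheus expects, giving Prometheus an interface to scrape.

An exporter exposes the data it collects through an HTTP endpoint, and Prometheus Server scrapes that endpoint to obtain the monitoring data. Different exporters cover different workloads.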

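Prometheus              Open-source system monitoring and alerting framework, inspired by Google's Borgmon monitoring system

AlertManager            Handles alerts sent by client applications such as Prometheus server. It deduplicates and groups alerts, routes them to the correct receiver integration, and also handles silencing and inhibition of alerts

Node_Exporter           Exporter that monitors the resource usage of each node; it should be deployed on every node that Prometheus monitors

PushGateway             Push gateway that receives data pushed by the nodes and exposes it to Prometheus server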

Documentation: https://prometheus.io/docs/introduction/overview/

Download the Prometheus components:

https://prometheus.io/download/


Environment preparation

  • Hosts:
OS          IP              Roles                 CPU  Memory  Hostname
CentOS 7.8  192.168.30.135  prometheus, node1     >=2  >=2G    prometheus
CentOS 7.8  192.168.30.136  alertmanager, node2   >=2  >=2G    alertmanager
CentOS 7.8  192.168.30.137  grafana, node3        >=2  >=2G    grafana
  • Disable the firewall and SELinux on all hosts:
systemctl stop firewalld && systemctl disable firewalld

sed -i 's/=enforcing/=disabled/g' /etc/selinux/config  && setenforce 0

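Configuring rules

The previous parts deployed prometheus, node_exporter, alertmanager, and grafana, introduced PromQL, and demonstrated queries against metrics.

This part configures the rules needed for recording and alerting.

Prometheus rules come in two kinds: recording rules and alerting rules. The rule_files setting lists the rule files to load; it accepts multiple entries, each of which can be a file path or a glob pattern.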
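As a minimal sketch of the rule_files setting described above: the fragment below assumes the rule files live under /usr/local/prometheus/rules, the directory used later in this article. The commented-out glob entry is an illustrative alternative, not something taken from the original configuration.

```yaml
# Sketch only: adjust the paths to your own layout.
rule_files:
  # Load one specific rule file.
  - "/usr/local/prometheus/rules/node-record-rules.yml"
  # Assumed alternative: load every .yml file in the rules directory with a glob.
  # - "/usr/local/prometheus/rules/*.yml"
```

Prometheus reads the listed files at startup and again whenever its configuration is reloaded. A rule file can be validated beforehand with promtool check rules followed by the file path.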

  • Current configuration:
cat /usr/local/prometheus/prometheus.yml
global:
  scrape_interval:     15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.30.136:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.30.135:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['192.168.30.135:9100','192.168.30.136:9100','192.168.30.137:9100']

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.30.136:9093']
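  • Recording rule configuration:

Recording rules precompute expressions that are needed frequently or are expensive to evaluate, and save the result as a new set of time series. Querying the precomputed result is usually much faster than evaluating the original expression every time it is needed. This is especially useful for dashboards, which query the same expressions again on every refresh.

Recording and alerting rules exist in rule groups. Rules within a group run sequentially at a regular interval. The names of recording and alerting rules must be valid metric names.

Note: colons are reserved for user-defined recording rules and should not be used by exporters.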
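To illustrate how rules in one group run in order, here is a hedged sketch of a group in which an alerting rule builds on a series produced by the recording rule before it. The group name, rule names, the 30s interval, and the 90% threshold are all hypothetical values chosen for illustration; they do not come from the original article.

```yaml
# Hypothetical example group; names, interval, and threshold are assumptions.
groups:
  - name: example-group
    # Optional per-group setting; overrides the global evaluation_interval.
    interval: 30s
    rules:
      # Evaluated first: precompute per-instance CPU usage as a new series.
      # The colons in the name mark it as a user-defined recording rule.
      - record: example:cpu:used:percent
        expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100
      # Evaluated second: alert on the series recorded by the rule above.
      - alert: ExampleHighCpuUsage
        expr: example:cpu:used:percent > 90
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
```

Keeping the dependent rule in the same group, after the rule it depends on, is a design choice: rules in different groups are evaluated independently, so no ordering between them should be assumed.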

mkdir /usr/local/prometheus/rules

vim /usr/local/prometheus/rules/node-record-rules.yml
groups:
  - name: node-record
    rules:
    # system
    - record: node:up
      expr: up{job="node"}
      labels:
        job: "node"
        unit:
        desc: "Whether the node is online: 1 = online, 0 = offline"

    - record: node:name
      expr: count by (nodename) (node_uname_info)
      labels:
        job: "node"
        unit:
        desc: "Hostname of the node"

    - record: node:uptime
      expr: time() - node_boot_time_seconds{}
      labels:
        job: "node"
        unit: s
        desc: "Uptime of the node"

    # cpu
    - record: node:cpu:num
      expr: count by (instance) (node_cpu_seconds_total{job="node",mode='system'})
      labels:
        job: "node"
        unit: v
        desc: "Number of CPU cores on the node"

    - record: node:cpu:idle:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU idle percentage over 5m"

    - record: node:cpu:used:percent
      expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Total CPU usage percentage over 5m"

    - record: node:cpu:system:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="system"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU system usage percentage of the node"

    - record: node:cpu:user:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="user"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU user usage percentage of the node"

    - record: node:cpu:iowait:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU iowait percentage of the node"

    - record: node:cpu:other:percent
      expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode=~"softirq|nice|irq|steal"}[5m])) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "CPU other usage percentage of the node"

    # memory
    - record: node:memory:total
      expr: node_memory_MemTotal_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Total memory of the node"

    - record: node:memory:avail
      expr: node_memory_MemAvailable_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Available memory of the node"

    - record: node:memory:free
      expr: node_memory_MemFree_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Free memory of the node"

    - record: node:memory:used
      expr: node_memory_MemTotal_bytes{job="node"} - node_memory_MemFree_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Used memory of the node (total minus free)"

    - record: node:memory:actuallyused
      expr: node_memory_MemTotal_bytes{job="node"} - node_memory_MemAvailable_bytes{job="node"}
      labels:
        job: "node"
        unit: bytes
        desc: "Actually used memory of the node (total minus available)"

    - record: node:memory:avail:percent
      expr: node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"} * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Available memory percentage of the node"

    - record: node:memory:free:percent
      expr: node_memory_MemFree_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"} * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Free memory percentage of the node"

    - record: node:memory:used:percent
      expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100
      labels:
        job: "node"
        unit: "%"
        desc: "Used memory percentage of the node"

    # load
    - record: node:load:load1
      expr: node_load1
      labels:
        job: "node"
        unit:
        desc: "1m load of the node"

    - record: node:load:load5
      expr: node_load5
      labels:
        job: "node"
        unit:
        desc: "5m load of the node"

    - record: node:load:load15
      expr: node_load15
      labels:
        job: "node"
        unit:
        desc: "15m load of the node"

    - record: node:load:load1:sum
      expr: sum by (job) (node_load1)
      labels:
        job: "node"
        unit:
        desc: "Overall 1m load across nodes"

    - record: node:load:load5:sum
      expr: sum by (job) (node_load5)
      labels:
        job: "node"
        unit:
        desc: <