Learning Prometheus on CentOS 6 (Part 0): Server Setup and Basic Metric Collection

Author: 阿团团

Article link: https://blog.csdn.net/jiangxuege/article/details/88951998

Official documentation

https://prometheus.io/docs/prometheus/latest/getting_started/

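1 Server setup

1.1 Installation

Installation is very simple: download the release tarball and extract it.

cd /path/to/your/work
wget https://github.com/prometheus/prometheus/releases/download/v2.8.1/prometheus-2.8.1.linux-amd64.tar.gz  # download
tar xvfz prometheus-2.8.1.linux-amd64.tar.gz  # extract
cd prometheus-2.8.1.linux-amd64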

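1.2 Self-monitoring

Prometheus calls each endpoint it collects from a target, and it collects metrics by scraping targets over HTTP. As a first step, let Prometheus scrape its own metrics. Create a prometheus.yml file: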

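# Global configuration
global:
  scrape_interval: 15s # scrape interval, global default
  # labels attached when talking to external systems
  external_labels:
    monitor: 'codelab-monitor'

# Scrape configuration
scrape_configs:
  # job name
  - job_name: 'prometheus'
    # scrape interval for this job, overriding the global default of 15s
    scrape_interval: 5s
    # static configuration
    static_configs:
      # scrape targets
      - targets: ['localhost:9090']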

Then start Prometheus:

./prometheus --config.file=prometheus.yml > prometheus.log 2>&1 &

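The startup log is written to prometheus.log.

Prometheus serves its own metrics at localhost:9090/metrics; a curl request to that URL should return the list of metrics.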
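To make the shape of that response concrete, here is a minimal Python sketch that lists the metric names in exposition-format text. The `SAMPLE` text and the `demo_*` metric names are made up for illustration; `fetch_metrics_text` assumes a Prometheus instance is listening on localhost:9090 and is not called by the sketch itself.

```python
import urllib.request


def fetch_metrics_text(url="http://localhost:9090/metrics", timeout=5):
    """Fetch the raw exposition text (requires a running server; assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(text):
    """Return the sorted, de-duplicated metric names found in the text."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return sorted(names)


# Made-up sample in the text exposition format, for illustration only.
SAMPLE = """\
# HELP demo_requests_total Hypothetical counter.
# TYPE demo_requests_total counter
demo_requests_total{code="200"} 3
demo_up 1
"""

names = metric_names(SAMPLE)
```

Against a live server, `metric_names(fetch_metrics_text())` would give the names that can be typed into the expression box of the web UI.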

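Next, open the web UI in a browser. If the server is reached over the public network, remember to open the port in the firewall.

Run a simple query for one metric: the actual interval between consecutive scrapes. We configured 5 seconds, so the measured values deviate slightly from that. The quantile label identifies the quantile each series represents. For example, quantile=0.99 with a value of 5.000640934 means that 99% of the observed intervals were no longer than 5.000640934 seconds.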
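The meaning of a 0.99 quantile can be checked with a few lines of Python. This is only an illustration using the nearest-rank definition on made-up interval values; it is not the streaming estimator Prometheus summaries use internally.

```python
import math


def nearest_rank_quantile(values, q):
    """Smallest observed value v such that at least a fraction q of values are <= v."""
    if not values or not 0 < q <= 1:
        raise ValueError("need a non-empty sample and 0 < q <= 1")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank into the sorted sample
    return ordered[rank - 1]


# Made-up scrape intervals in seconds, clustered around the configured 5s.
intervals = [4.9990 + 0.00002 * i for i in range(100)]

p99 = nearest_rank_quantile(intervals, 0.99)
# Fraction of observations that are at or below the reported quantile value.
share_at_or_below = sum(v <= p99 for v in intervals) / len(intervals)
```

By construction `share_at_or_below` is at least 0.99, which is exactly the statement "99% of the data is no larger than this value".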

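To see only the 0.99 quantile, narrow the query expression with a label matcher on quantile.

Or count the number of time series the query returns.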
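The three queries in this section can also be sent to the HTTP API instead of the web UI. The sketch below only builds the request URLs. It assumes the metric is `prometheus_target_interval_length_seconds`, the name used in the official getting-started guide (the original screenshots do not survive here), and that the server is at localhost:9090.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed metric name, taken from the official getting-started guide.
METRIC = "prometheus_target_interval_length_seconds"

ALL_QUANTILES = METRIC                      # every quantile series
P99_ONLY = METRIC + '{quantile="0.99"}'     # label matcher on quantile
SERIES_COUNT = f"count({METRIC})"           # number of returned series


def instant_query_url(expr, base="http://localhost:9090"):
    """Build a URL for the instant-query endpoint /api/v1/query."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


urls = [instant_query_url(e) for e in (ALL_QUANTILES, P99_ONLY, SERIES_COUNT)]
```

Each URL can be fetched with curl or `urllib.request.urlopen`; the response is JSON.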

Switch to the Graph tab to plot the result.

2 Collecting monitoring data

Now use a small Go program as the target to scrape. It is an official example that generates random RPC latency metrics.

git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# start three instances
./random -listen-address=:8080 > 8080.out 2>&1 &
./random -listen-address=:8081 > 8081.out 2>&1 &
./random -listen-address=:8082 > 8082.out 2>&1 &

Now add this job to the Prometheus configuration file. The job contains three instances: two are treated as production nodes and the third as a canary (unstable) node.

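# Global configuration
global:
  scrape_interval: 15s # scrape interval, global default
  # labels attached when talking to external systems
  external_labels:
    monitor: 'codelab-monitor'

# Scrape configuration
scrape_configs:
  # job name
  - job_name: 'prometheus'
    # scrape interval for this job, overriding the global default of 15s
    scrape_interval: 5s
    # static configuration
    static_configs:
      # scrape targets
      - targets: ['95.179.190.52:9090']
  # newly added job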
  - job_name: 'example-random'
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'
      - targets: ['localhost:8082']
        labels:
          group: 'canary'

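Look at the rpc_durations_seconds_count metric. Its actual meaning does not matter, since the values are simulated by the random program. There is one series for every combination of the group, instance, and service labels, so the graph shows many curves.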
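Where the group and instance labels come from can be sketched in Python. This is a simplified model of how static_configs entries turn into per-target label sets: job comes from job_name, instance defaults to the target address, and extra labels are copied from the entry. It ignores relabeling and other options; `expand_targets` is a hypothetical helper, not part of Prometheus.

```python
def expand_targets(job_name, static_configs):
    """Return one label dict per target (simplified; no relabeling)."""
    label_sets = []
    for cfg in static_configs:
        for target in cfg["targets"]:
            labels = {"job": job_name, "instance": target}
            labels.update(cfg.get("labels", {}))  # e.g. group: production
            label_sets.append(labels)
    return label_sets


# Mirrors the example-random job from the configuration file.
example_random = [
    {"targets": ["localhost:8080", "localhost:8081"],
     "labels": {"group": "production"}},
    {"targets": ["localhost:8082"], "labels": {"group": "canary"}},
]

targets = expand_targets("example-random", example_random)
```

Every metric scraped from a target carries that target's label set, and the service label is added by the random program itself.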

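Now use Prometheus for some simple aggregation: take the per-second rate over a 5-minute window, then average it grouped by job and service. We have only one job and three service values, so the result has 3 curves.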
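What this aggregation computes can be approximated in Python. The sketch below is a simplification: `simple_rate` just divides the counter increase by the elapsed time, whereas the real rate() function also handles counter resets and extrapolates to the window boundaries. The sample values are made up.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def avg_by(series, keys):
    """Average the per-series rates, grouped by the given label names."""
    buckets = defaultdict(list)
    for labels, samples in series:
        group_key = tuple(labels.get(k, "") for k in keys)
        buckets[group_key].append(simple_rate(samples))
    return {key: sum(rates) / len(rates) for key, rates in buckets.items()}


# Made-up counter samples covering one 5-minute (300s) window.
series = [
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8080"},
     [(0, 0), (300, 3000)]),
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8081"},
     [(0, 0), (300, 6000)]),
    ({"job": "example-random", "service": "normal", "instance": "localhost:8080"},
     [(0, 100), (300, 1600)]),
]

by_job_service = avg_by(series, ["job", "service"])
```

Grouping by other labels only changes the `keys` argument, which is why the number of curves equals the number of distinct label combinations.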

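As another experiment, aggregate by the group label instead to compare the production group with the canary group.

To keep this aggregate as a new metric in Prometheus, create a recording rule. Rules are defined in a configuration file, here prometheus.rules.yml: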

groups:
# rule group name
- name: example
  rules:
  # name of the recorded metric; it can be searched by this name in the web UI
  - record: group:rpc_durations_seconds_count:avg_rate5m
    # expression to evaluate
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (group)
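The recorded metric name follows the level:metric:operations naming convention recommended for recording rules: the aggregation level, the source metric, and the operations applied. A minimal sketch that splits such a name, with `parse_rule_name` being a hypothetical helper for illustration:

```python
def parse_rule_name(name):
    """Split a recording-rule name of the form level:metric:operations."""
    parts = name.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected level:metric:operations, got {name!r}")
    level, metric, operations = parts
    return {"level": level, "metric": metric, "operations": operations}


parsed = parse_rule_name("group:rpc_durations_seconds_count:avg_rate5m")
```

Here the level is the group label kept by the aggregation, and avg_rate5m records that the expression averages a 5-minute rate.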

Then declare the rule file in prometheus.yml and restart Prometheus:

# declare rule files
rule_files:
  - 'prometheus.rules.yml'

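The group:rpc_durations_seconds_count:avg_rate5m metric is now visible. However, the curve of the rule-generated metric does not match the graph of the original expression, because rules are evaluated every 60s by default (the global evaluation_interval defaults to 1m).

Now edit the rules file to set a shorter interval for the group and restart Prometheus. The recorded metric's curve then matches the earlier one.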

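groups:
# rule group name
- name: example
  # evaluation interval for this group
  interval: 5s
  rules:
  # name of the recorded metric; it can be searched by this name in the web UI
  - record: group:rpc_durations_seconds_count:avg_rate5m
    # expression to evaluate
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (group)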

After the interval change, the right half of the curve is much smoother.