Automated Operations: Prometheus (continuously updated)

Nepal.

Last modified on 2022-10-18 11:26:41

First published on 2022-10-17 14:59:59

Article link: https://blog.csdn.net/weixin_44210007/article/details/127362345

Documentation

Official Prometheus documentation: https://prometheus.io/docs/introduction/overview/
Chinese-language Prometheus documentation: https://prometheus.fuckcloudnative.io/di-yi-zhang-jie-shao/overview

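1. Introduction to Prometheus

Prometheus is an open-source system monitoring and alerting toolkit that former Google engineers began developing at SoundCloud in 2012. Since then, many companies and organizations have adopted it as their monitoring and alerting tool. Its developer and user communities are very active, and it is now a standalone open-source project maintained independently of any company. To underline this, Prometheus joined the CNCF in May 2016 as its second hosted project, after Kubernetes.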

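1.1 Advantages of Prometheus

A multi-dimensional data model of time series identified by a metric name and key/value label pairs.
PromQL, a powerful query language.
No reliance on distributed storage; single server nodes are autonomous.
Time series are collected by the server actively pulling them over HTTP.
Time series can also be pushed through an intermediary gateway.
Monitoring targets are found through static configuration files or service discovery.
Multiple modes of graphing and dashboarding are supported.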

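1.2 Features of Prometheus

Multi-dimensional data model: time series made up of a metric name and key/value pairs
Built-in time series database (TSDB)
PromQL query language, which handles complex queries and analysis and underpins both graphing and alerting
Pull-based collection of time series over HTTP
Pushgateway support for collecting data from short-lived jobs
Target discovery through either service discovery or static configuration
Integration with Grafana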

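1.3 Prometheus components

The Prometheus ecosystem consists of multiple components, many of which are optional:

Prometheus Server, which scrapes and stores time series data.
Client libraries for instrumenting application code.
A push gateway for supporting short-lived jobs.
Exporters for special-purpose targets such as HAProxy, StatsD and Graphite, which expose monitoring samples to Prometheus in the standard format.
Alertmanager for handling alerts.
Various other supporting tools.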

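1.4 How Prometheus works

The Prometheus server periodically pulls data from targets that are statically configured or found through service discovery.
When newly pulled data exceeds the configured in-memory buffer, Prometheus persists it to disk (or to the remote backend if remote storage is used).
Prometheus can be configured with rules that it evaluates on a schedule; when a rule's condition fires, it pushes the alert to the configured Alertmanager.
When Alertmanager receives alerts, it groups, deduplicates and silences them according to its configuration, then sends out notifications.
Data can be queried and aggregated through the API, the Prometheus console or Grafana.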
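
To make the last step concrete, here is a minimal sketch of querying data through the HTTP API with only the Python standard library. It assumes a server reachable at a base URL such as http://127.0.0.1:9090 and uses the instant-query endpoint /api/v1/query; the function names are illustrative and not part of any Prometheus client.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run an instant query and return the list found under data.result."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

With a running server, instant_query("http://127.0.0.1:9090", "up") would return one entry per scraped target; the URL builder can be checked without any server at all.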

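1.5 Prometheus architecture

[Figure: Prometheus architecture diagram]

Prometheus Server scrapes metrics from monitored targets directly, or indirectly through the push gateway. It stores all scraped samples locally and runs rules over this data, either to aggregate and record new time series from existing data or to generate alerts. Grafana or other tools can be used to visualize the data.
Prometheus mainly collects data by having the server actively pull it over HTTP requests, so every service needs to expose an API address and return data in a unified format.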
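
As a rough illustration of what "an API address and a unified format" means for a monitored service, the sketch below serves a /metrics endpoint in the plain-text exposition format using only the Python standard library. The metric name demo_queue_length, the port and the helper names are hypothetical; in practice the official client libraries do this for you.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict) -> str:
    """Render {name: value} as exposition lines, one sample per line, sorted by name."""
    return "".join(f"{name} {float(value)}\n" for name, value in sorted(values.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical metric, shared by all requests for simplicity.
    metrics = {"demo_queue_length": 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.metrics).encode("utf-8")
        self.send_response(200)
        # Content type of the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port to a scrape job's targets would let the server pull this value on every scrape interval.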

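2. Data Model

Prometheus stores all collected data as metrics in its built-in time series database (TSDB): streams of timestamped values that belong to the same metric name and the same set of labels. Besides stored time series, Prometheus can also generate temporary, derived time series as the result of a query.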

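2.1 Metric names and labels

Every time series is uniquely identified by its metric name and a set of labels (key/value pairs).
Metric name: conveys what is being measured (for example, http_requests_total is the total number of HTTP requests received). It may contain only ASCII letters, digits, underscores and colons, and must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.
Labels: a label name may contain only ASCII letters, digits and underscores, and must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names beginning with __ are reserved for internal use. Label values may contain any Unicode characters.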
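
The naming rules can be checked mechanically. The sketch below assumes the patterns from the classic Prometheus data model documentation, [a-zA-Z_:][a-zA-Z0-9_:]* for metric names and [a-zA-Z_][a-zA-Z0-9_]* for label names; newer releases may relax these rules, and the helper names are made up for illustration.

```python
import re

# Patterns assumed from the classic data model documentation.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """True if the whole string matches the label name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


def is_reserved_label_name(name: str) -> bool:
    """Label names starting with a double underscore are reserved."""
    return name.startswith("__")
```

Note that colons are accepted in metric names but not in label names, and that neither may start with a digit.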

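2.2 Samples

Each point in a time series is called a sample. A sample consists of three parts:
1. Metric: the metric name plus the label set that describes the sample's characteristics
2. Timestamp: a timestamp with millisecond precision
3. Value: a float64 value representing the sample's current value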
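
One way to picture a sample is as a small immutable record. This is only an illustrative sketch; the class and field names are invented here and do not mirror Prometheus internals.

```python
import time
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Sample:
    name: str                  # metric name
    labels: Mapping[str, str]  # label set describing this sample
    timestamp_ms: int          # milliseconds since the Unix epoch
    value: float               # Python floats are 64-bit, matching float64


def make_sample(name: str, labels: Mapping[str, str], value: float,
                now_s: Optional[float] = None) -> Sample:
    """Build a sample stamped with the given time, or the current time if none is given."""
    now_s = time.time() if now_s is None else now_s
    return Sample(name, dict(labels), int(round(now_s * 1000)), float(value))
```

The metric name and label set together identify the series; the timestamp and value are what changes from one scrape to the next.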

2.3 Notation

<metric name>{<label name>=<label value>, ...}
Example: the time series with metric name api_http_requests_total and labels method="POST" and handler="/messages" is written as api_http_requests_total{method="POST", handler="/messages"}
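
The notation is easy to generate programmatically. The following sketch formats a metric name and a label dictionary into that form; the function names are hypothetical, and the escaping of backslashes, double quotes and newlines follows the convention of the text exposition format.

```python
def _escape(value: str) -> str:
    """Escape backslashes, double quotes and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_series(name: str, labels: dict) -> str:
    """Return name{k1="v1", k2="v2"}, or just the name when there are no labels."""
    if not labels:
        return name
    pairs = ", ".join(f'{key}="{_escape(val)}"' for key, val in labels.items())
    return f"{name}{{{pairs}}}"
```

For example, format_series("api_http_requests_total", {"method": "POST", "handler": "/messages"}) reproduces the series written out above.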

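3. Metric Types

The Prometheus client libraries offer four core metric types:

1. Counter: a metric whose sample value increases monotonically. Use it for values such as the number of requests served or tasks completed. Client libraries typically expose two counter operations: increment by one, and increment by a given non-negative amount.
2. Gauge: a metric whose sample value can go up and down arbitrarily.
3. Histogram: samples observations over a period of time and counts them in configurable buckets. Samples can later be filtered by range or counted in total, and the data can be displayed as a histogram. A histogram exposes three kinds of series: bucket, sum and count.
bucket: the number of samples whose value is less than or equal to the bucket's upper bound, named <basename>_bucket{le="<upper bound>"}
Example: // of 2 requests in total, 0 had an HTTP response time <= 0.005 seconds
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0

sum: the sum of all sample values, named <basename>_sum
Example: // the 2 HTTP requests took 13.107670803000001 seconds in total
io_namespace_http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001

count: the total number of samples, named <basename>_count. Its value equals <basename>_bucket{le="+Inf"}.
Example: // 2 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0
4. Summary: similar to a histogram in that it represents observations sampled over a period of time, but it stores quantiles directly instead of deriving them from buckets.
A summary also exposes three kinds of series: quantile, sum and count.
quantile: the quantile distribution of the sample values, named <basename>{quantile="<φ>"}
Example: // of these 12 HTTP requests, 50% had a response time of at most 3.052404983 s
io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983

sum: the sum of all sample values, named <basename>_sum
Example: // the 12 HTTP requests took 51.029495508 s in total
io_namespace_http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508

count: the total number of samples, named <basename>_count
Example: // 12 HTTP requests have occurred so far
io_namespace_http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 12.0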
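
The relationship between the histogram's bucket, sum and count series is easier to see in a toy model. The class below is a simplified sketch, not the client library implementation: it keeps per-bucket counts internally and reports them cumulatively, which is why the +Inf bucket always equals the count.

```python
import math
from bisect import bisect_left


class ToyHistogram:
    """Minimal histogram with cumulative 'le' buckets, a sum and a count."""

    def __init__(self, bounds):
        # An implicit +Inf bucket catches every observation.
        self.bounds = sorted(bounds) + [math.inf]
        self._per_bucket = [0] * len(self.bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the smallest 'le' bucket.
        self._per_bucket[bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def buckets(self):
        """Return [(le, cumulative_count), ...] in increasing order of le."""
        running, out = 0, []
        for bound, n in zip(self.bounds, self._per_bucket):
            running += n
            out.append(("+Inf" if bound == math.inf else str(bound), running))
        return out
```

Observing two slow requests, for instance 6.0 s and 7.5 s, leaves the le="0.005" bucket at 0 while the sum is 13.5 and both the +Inf bucket and the count are 2, mirroring the shape of the histogram example above.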

4. Downloading Prometheus

Prometheus download page

5. Deploying Prometheus + Grafana

Deploying Prometheus:

Extract the archive and deploy it under /app/tools/prometheus/prometheus.

Configuration file:

$ cat prometheus.yml   # view the default configuration file; lines starting with # are comments
# The default file has four top-level blocks: global, alerting, rule_files and scrape_configs.
# global rarely needs changing.
# my global config
global:
  # Scrape targets every 15 seconds. The default is every 1 minute.
  scrape_interval: 15s
  # Evaluate rules every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# rule_files lists the locations of the rule files (such as alerting rules) the Prometheus server should load.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
# scrape_configs controls which resources Prometheus monitors.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      # Addresses to scrape; several can be listed.
      - targets: ['127.0.0.1:9090']  # localhost replaced with the machine's own address; the author saw minor bugs otherwise

Startup:

# Start Prometheus
./prometheus --config.file="/app/tools/prometheus/prometheus/prometheus.yml" --web.enable-lifecycle --storage.tsdb.path=/app/tools/prometheus/prometheus/data --storage.tsdb.retention.time=30d &
# --web.enable-lifecycle allows the configuration to be reloaded (and the server shut down) through the web API
# --storage.tsdb.path=/app/tools/prometheus/prometheus/data sets the data directory; the default is data/ under the current directory
# --storage.tsdb.retention.time=30d sets how long data is kept; the default is 15d
# & runs the process in the background
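
If you prefer to launch the server from a script instead of typing the command, the same flags can be assembled in Python. This is a sketch under the article's install path; the log file name prometheus.log and the function names are assumptions added for illustration.

```python
import subprocess

# Install directory used throughout this article.
PROM_HOME = "/app/tools/prometheus/prometheus"


def build_command(home: str = PROM_HOME, retention: str = "30d") -> list:
    """Assemble the same command line as the shell example above."""
    return [
        f"{home}/prometheus",
        f"--config.file={home}/prometheus.yml",
        "--web.enable-lifecycle",
        f"--storage.tsdb.path={home}/data",
        f"--storage.tsdb.retention.time={retention}",
    ]


def start(home: str = PROM_HOME, log_path: str = "prometheus.log") -> subprocess.Popen:
    """Start Prometheus detached from the current session, appending output to a log file."""
    log = open(log_path, "ab")  # hypothetical log location
    return subprocess.Popen(
        build_command(home),
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # POSIX only: keeps the server alive after the shell exits
    )
```

Passing the arguments as a list avoids shell quoting issues, and redirecting output to a file plays the role that & plus a terminal plays in the shell version.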

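After a successful start, check that port 9090 is listening.

Open the web UI at http://127.0.0.1:9090

Management API

# Health check
$ curl  http://127.0.0.1:9090/-/healthy
# Reload the configuration file
$ curl -XPOST http://127.0.0.1:9090/-/reload
# Stop Prometheus (PUT or POST)
$ curl -XPUT http://127.0.0.1:9090/-/quit
$ curl -XPOST http://127.0.0.1:9090/-/quit
Note: the reload and quit endpoints only work when Prometheus was started with the --web.enable-lifecycle flag; without it those requests are rejected. The health check does not need the flag.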
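
The same management calls can be scripted. The sketch below wraps the three endpoints with the Python standard library; the helper names are illustrative, and it assumes the server address used in this article.

```python
import urllib.request

# HTTP method for each management endpoint under /-/.
_METHODS = {"healthy": "GET", "reload": "POST", "quit": "POST"}


def lifecycle_request(base_url: str, action: str) -> urllib.request.Request:
    """Build the request for /-/healthy, /-/reload or /-/quit."""
    if action not in _METHODS:
        raise ValueError(f"unknown action: {action}")
    url = f"{base_url.rstrip('/')}/-/{action}"
    return urllib.request.Request(url, method=_METHODS[action])


def call(base_url: str, action: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    request = lifecycle_request(base_url, action)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, call("http://127.0.0.1:9090", "reload") after editing prometheus.yml applies the new configuration without restarting the process; urlopen raises an HTTPError if the server rejects the request.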