Deploying Prometheus Monitoring End to End

Author: 养了一只皮卡丘

Article link: https://blog.csdn.net/hyhxy0206/article/details/121551151

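Introduction to Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.

Official site: https://prometheus.io. Latest version noted when this introduction was written: 2.19.2 (the deployment steps below use 2.31.1).

An exporter is a component that collects monitoring data and exposes it in the format Prometheus expects, giving Prometheus an interface to scrape.

An exporter exposes the collected data through an HTTP endpoint, and the Prometheus server obtains the data by requesting that endpoint. Different exporters cover different services.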
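
To make the exporter idea concrete, here is a minimal sketch in Python (standard library only) of a process that serves metrics over HTTP at /metrics in the plain-text exposition format. The metric names, values, and port 8000 are hypothetical and chosen only for illustration. Real exporters are normally built with an official client library rather than by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values):
    """Render a {metric_name: number} dict as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(values.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical sample data; a real exporter reads it from the monitored system.
    values = {"demo_temperature_celsius": 21.5, "demo_queue_length": 3}

    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.values).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (the port is an arbitrary example)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape target at <host>:8000 would let Prometheus collect the two demo gauges.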

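Prometheus: an open-source systems monitoring and alerting framework, inspired by Google's Borgmon monitoring system.

AlertManager: handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups alerts, routes them to the correct receiver integration, and also takes care of silencing and inhibiting them.

Node_Exporter: an exporter for each node's resource information. It should be deployed on every node that Prometheus monitors.

PushGateway: a push gateway that receives data pushed by nodes and exposes it to the Prometheus server.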

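Documentation: https://prometheus.io/docs/introduction/overview/

Download the Prometheus components:

https://prometheus.io/download/

Features of Prometheus:

A multi-dimensional data model, with time series identified by a metric name and key/value pairs
PromQL, a flexible query and aggregation language
Local storage on autonomous single-server nodes, with no reliance on distributed storage
Time series collection through an HTTP-based pull model
Push mode supported through the Pushgateway, an optional intermediary
Targets discovered through dynamic service discovery or static configuration
Support for multiple modes of graphing and dashboards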

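Components of Prometheus:

The Prometheus server, which scrapes and stores time series data
Client libraries, for instrumenting application code
A push gateway, for supporting short-lived jobs
Special-purpose exporters for services such as HAProxy, StatsD, and Graphite
An alertmanager, to handle alerts
Various support tools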

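Prometheus architecture:

[Figure: the Prometheus architecture and some of its ecosystem components]

When to use Prometheus:
Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring and the monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.

Prometheus is designed for reliability, to be the system you turn to during an outage so that you can diagnose problems quickly. Each Prometheus server is standalone and does not depend on network storage or other remote services, so it remains usable when other parts of the infrastructure are broken.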

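Prometheus Concepts
Data model:
Prometheus stores all data as time series: streams of timestamped values that belong to the same metric and the same set of labels (key/value pairs).

Metric names and labels:
Every time series is uniquely identified by its metric name and a set of labels (key/value pairs).

The metric name specifies the general feature of the system being measured (for example, http_requests_total is the total number of HTTP requests received). It may contain ASCII letters and digits, as well as underscores and colons, and it must match the regular expression [a-zA-Z_:][a-zA-Z0-9_:]*.

Note: colons are reserved for user-defined recording rules and should not be used by exporters.

Labels give Prometheus its multi-dimensional data model: for a given metric name, any combination of labels identifies a particular dimensional instance of that metric (for example, all HTTP requests that used the POST method on the /api/tracks handler). The query language filters and aggregates along these dimensions. Changing any label value, including adding or removing a label, creates a new time series.

Label names may contain ASCII letters, digits, and underscores, and they must match the regular expression [a-zA-Z_][a-zA-Z0-9_]*. Label names that begin with a double underscore (__) are reserved for internal use.

Label values may contain any Unicode characters. A label with an empty value is treated as if the label were absent.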
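
The naming rules above can be checked mechanically. The following is a small Python sketch that applies the two regular expressions quoted in the text. The function names and the allow_reserved parameter are made up for this example and are not part of any Prometheus library.

```python
import re

# The two patterns are copied from the naming rules described above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True if the whole string is a legal metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name, allow_reserved=False):
    """True if the whole string is a legal label name.

    Names starting with "__" are reserved for internal use, so they are
    rejected unless allow_reserved is set.
    """
    if LABEL_NAME_RE.fullmatch(name) is None:
        return False
    return allow_reserved or not name.startswith("__")
```

For instance, http_requests_total and job:http_requests:rate5m pass the metric-name check, while a name that starts with a digit does not.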

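Notation:
Given a metric name and a set of labels, a time series is usually identified with the following notation:
<metric name>{<label name>=<label value>, ...}

For example, a time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" can be written as:

api_http_requests_total{method="POST", handler="/messages"}

This is the same notation that OpenTSDB uses.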
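
As an illustration of this notation, the Python sketch below builds the identifier string from a metric name and a dict of labels. The helper name format_series is hypothetical. Sorting the labels is only a choice made here to get stable output, since label order carries no meaning.

```python
def format_series(metric_name, labels):
    """Build '<metric name>{<label name>="<label value>", ...}' for one series."""
    parts = []
    for key in sorted(labels):
        value = labels[key]
        if value == "":
            # A label with an empty value counts as an absent label.
            continue
        escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        parts.append(f'{key}="{escaped}"')
    if not parts:
        return metric_name
    return metric_name + "{" + ", ".join(parts) + "}"
```

Calling format_series("api_http_requests_total", {"method": "POST", "handler": "/messages"}) yields the series from the example, with handler listed before method because of the sort.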

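Metric types:

Counter             A value that only increases monotonically, or resets to zero on restart. Use it for counts such as requests served, tasks completed, or errors.

Gauge               A value that can go up and down arbitrarily. Use it for measurements such as temperature or current memory usage.

Histogram           Samples observations (typically request durations or response sizes) and counts them in configurable buckets. It also provides the sum of all observed values. A histogram with the base name <basename> exposes:

                        Cumulative counters for the observation buckets: <basename>_bucket{le="<upper inclusive bound>"}
                        The total sum of all observed values: <basename>_sum
                        The count of observed events: <basename>_count

Summary             Samples observations (typically request durations or response sizes). It provides the count of observations and the sum of all observed values, and it can calculate configurable quantiles over a sliding time window. A summary with the base name <basename> exposes:
                        Streaming φ-quantiles (0 ≤ φ ≤ 1) of observed events: <basename>{quantile="φ"}
                        The total sum of all observed values: <basename>_sum
                        The count of observed events: <basename>_count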
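
The cumulative nature of histogram buckets is easy to miss, so here is a short Python sketch of the bookkeeping. It is not how any client library is implemented. The bucket bounds and the base name passed in by the caller are arbitrary examples.

```python
import math


def observe_all(observations, upper_bounds):
    """Count observations into cumulative buckets, plus their sum and count."""
    bounds = sorted(upper_bounds) + [math.inf]  # the +Inf bucket always exists
    buckets = {bound: 0 for bound in bounds}
    for value in observations:
        for bound in bounds:
            if value <= bound:
                # Cumulative: a value is counted in every bucket whose
                # inclusive upper bound is at least that value.
                buckets[bound] += 1
    return {"buckets": buckets, "sum": sum(observations), "count": len(observations)}


def exposition_lines(basename, result):
    """Render the result as <basename>_bucket, <basename>_sum and <basename>_count lines."""
    lines = []
    for bound, count in result["buckets"].items():
        le = "+Inf" if bound == math.inf else repr(bound)
        lines.append(f'{basename}_bucket{{le="{le}"}} {count}')
    lines.append(f"{basename}_sum {result['sum']}")
    lines.append(f"{basename}_count {result['count']}")
    return lines
```

With observations 0.05, 0.2, 0.2 and 1.5 and bounds 0.1, 0.5 and 1.0, the buckets hold 1, 3, 3 and 4 (the last being +Inf), and the +Inf bucket always equals the count.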

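Jobs and instances:
In Prometheus, an endpoint that can be scraped is called an instance, and it usually corresponds to a single process. A collection of instances with the same purpose (for example, a process replicated for scalability or reliability) is called a job.

When Prometheus scrapes a target, it automatically attaches some labels to the scraped time series to identify the target:

job: the name of the job the target belongs to

instance: the <host>:<port> part of the target's URL that was scraped

If either of these labels is already present in the scraped data, the behavior depends on the honor_labels configuration option.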
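
The label-conflict behavior can be sketched in a few lines of Python. This follows my reading of the honor_labels documentation: when the option is false, a conflicting scraped label is renamed to exported_<label> and the server-side value is used. When it is true, the scraped value is kept. The function name and the sample targets are hypothetical.

```python
def attach_target_labels(scraped_labels, job, instance, honor_labels=False):
    """Return the labels a series would carry after the target labels are attached."""
    result = dict(scraped_labels)
    for name, value in (("job", job), ("instance", instance)):
        if name in result:
            if honor_labels:
                # Keep the value that came from the scraped data.
                continue
            # Preserve the scraped value under an exported_ prefix.
            result["exported_" + name] = result[name]
        result[name] = value
    return result
```

A series scraped with job="batch" from a target in the job "pushgateway" would therefore end up with job="pushgateway" and exported_job="batch" under the default setting.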

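For each scrape of an instance, Prometheus stores a sample (a single value at a particular point in time within a time series) in each of the following time series:

up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy (reachable), 0 if the scrape failed

scrape_duration_seconds{job="<job-name>", instance="<instance-id>"}: the duration of the scrape

scrape_samples_post_metric_relabeling{job="<job-name>", instance="<instance-id>"}: the number of samples remaining after metric relabeling was applied

scrape_samples_scraped{job="<job-name>", instance="<instance-id>"}: the number of samples the target exposed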

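Deployment Environment

Environment: CentOS 8

Hostname	IP address	Role	Resources
node2	192.168.143.103	Prometheus	4 cores, 8 GB RAM
node3	192.168.143.104	node_exporter	4 cores, 2 GB RAM

Deploying and Configuring Prometheus

Server-side steps

// Download and extract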

[root@node2 ~]# cd /usr/src/
[root@node2 src]# wget https://github.com/prometheus/prometheus/releases/download/v2.31.1/prometheus-2.31.1.linux-amd64.tar.gz
[root@node2 src]# tar xf prometheus-2.31.1.linux-amd64.tar.gz -C /usr/local/
[root@node2 src]# cd /usr/local/
[root@node2 local]# ls
bin  games    lib    libexec                        sbin   src
etc  include  lib64  prometheus-2.31.1.linux-amd64  share
[root@node2 local]# ln -s /usr/local/prometheus-2.31.1.linux-amd64 /usr/local/prometheus
[root@node2 local]# ls
bin  games    lib    libexec     prometheus-2.31.1.linux-amd64  share
etc  include  lib64  prometheus  sbin                           src

// The configuration file explained

[root@node2 local]# cd prometheus
[root@node2 prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool
[root@node2 prometheus]# ./promtool check config ./prometheus.yml 
Checking ./prometheus.yml
  SUCCESS: 0 rule files found
[root@node2 prometheus]# vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Scrape targets every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s); it is the timeout for a single scrape.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093   (IP or hostname and port of the machine where Alertmanager is deployed)

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# These are the yml files that define the alerting rules and thresholds.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config,
  # so every series collected here carries the label job="prometheus".
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:       # static configuration
      - targets: ["localhost:9090"]
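
The interval and timeout values in this file are duration strings such as 15s or 1m. The Python sketch below parses that format and checks the rule that a scrape timeout must not be longer than the scrape interval. It is a simplified illustration, not the parser Prometheus itself uses. Both function names are made up, and the defaults mirror the 1m and 10s defaults mentioned in the comments above.

```python
import re

# Units must appear from largest to smallest, e.g. "1h30m" or "15s".
_DURATION_RE = re.compile(
    r"(?:(\d+)y)?(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?(?:(\d+)ms)?"
)
# Seconds per unit, in the same order as the groups above (a year counts as 365 days).
_UNIT_SECONDS = (365 * 86400, 7 * 86400, 86400, 3600, 60, 1, 0.001)


def parse_duration(text):
    """Convert a duration string such as '15s' or '1h30m' to seconds."""
    if text == "0":
        return 0.0
    match = _DURATION_RE.fullmatch(text) if text else None
    if match is None:
        raise ValueError(f"not a valid duration: {text!r}")
    total = sum(
        int(group) * unit
        for group, unit in zip(match.groups(), _UNIT_SECONDS)
        if group is not None
    )
    return float(total)


def check_scrape_settings(scrape_interval="1m", scrape_timeout="10s"):
    """Return (interval, timeout) in seconds, rejecting a timeout above the interval."""
    interval = parse_duration(scrape_interval)
    timeout = parse_duration(scrape_timeout)
    if timeout > interval:
        raise ValueError("scrape_timeout must not exceed scrape_interval")
    return interval, timeout
```

With the configuration above, check_scrape_settings("15s") accepts the default 10s timeout, whereas a 5s interval with the same timeout would be rejected. For the real file, ./promtool check config remains the authoritative check.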

// Start the service with a systemd unit file

[root@node2 prometheus]# cat > /usr/lib/systemd/system/prometheus.service <<EOF
[Unit]
Description=The Prometheus Server
After=network.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
RestartSec=15s

[Install]
WantedBy=multi-user.target

EOF
[root@node2 ~]# systemctl daemon-reload 
[root@node2 ~]# systemctl enable --now prometheus
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /usr/lib/systemd/system/prometheus.service.
[root@node2 ~]# systemctl status prometheus
● prometheus.service - The Prometheus Server
   Loaded: loaded (/usr/lib/systemd/system/prometheus.service; enabled; vendor pres>
   Active: active (running) since Thu 2021-11-25 23:50:00 CST; 29s ago
 Main PID: 7234 (prometheus)
    Tasks: 6 (limit: 11208)
   Memory: 21.1M
   CGroup: /system.slice/prometheus.service
           └─7234 /usr/local/prometheus/prometheus --config.file=/usr/local/prometh>

11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.140Z caller=head.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.140Z caller=head.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.140Z caller=head.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.142Z caller=head.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.142Z caller=head.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.143Z caller=main.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.143Z caller=main.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.143Z caller=main.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.193Z caller=main.go>
11月 25 23:50:01 node2 prometheus[7234]: ts=2021-11-25T15:50:01.193Z caller=main.go>
[root@node2 ~]# ss -atnl
State    Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process   
LISTEN   0        128              0.0.0.0:22             0.0.0.0:*                
LISTEN   0        128              0.0.0.0:22              0.0.0.0:*                
LISTEN   0        128                    *:9090                  *:*

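// Commonly used startup flags

## Startup flags
--config.file                   # path of the Prometheus configuration file to load
--web.listen-address            # address and port that the Prometheus web UI and API listen on
--web.enable-lifecycle          # enables the HTTP reload and shutdown endpoints, so the configuration can be reloaded without restarting the service
--storage.tsdb.retention.time   # how long to keep stored data (replaces the deprecated --storage.tsdb.retention)
--storage.tsdb.path             # directory where the data is stored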
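
When --web.enable-lifecycle is set, a configuration reload is triggered by an HTTP POST to the /-/reload endpoint. The Python sketch below shows such a request using only the standard library. The function names are hypothetical, and the default address assumes Prometheus listens on localhost:9090. The unit file shown earlier does not pass this flag, so it would have to be added to ExecStart before the request could succeed.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request that asks Prometheus to reload its configuration."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url="http://localhost:9090", timeout=5.0):
    """Send the reload request; returns True when the server answers 200."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

If the endpoint is disabled or the new configuration is invalid, urlopen raises an HTTPError instead of returning, so a caller should be ready to handle that.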

Client-side steps

// Download and extract

[root@node3 ~]# cd /usr/src/
[root@node3 src]# wget https://github.com/prometheus/node_exporter/releases/download/v1.3.0/node_exporter-1.3.0.linux-amd64.tar.gz
[root@node3 src]# tar xf node_exporter-1.3.0.linux-amd64.tar.gz -C /usr/local/
[root@node3 src]# cd /usr/local/
[root@node3 local]# ls
bin  games    lib    libexec                          sbin   src
etc  include  lib64  node_exporter-1.3.0.linux-amd64  share
[root@node3 local]# ln -s node_exporter-1.3.0.linux-amd64 node_exporter
[root@node3 local]# ls
bin  games    lib    libexec        node_exporter-1.3.0.linux-amd64  share
etc  include  lib64  node_exporter  sbin                             src

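// Commonly used startup flags

Note: relevant startup flags
--web.listen-address                 # address and port on which node_exporter exposes its metrics
--collector.systemd                  # enable the collector that gathers data from systemd
--collector.systemd.unit-whitelist   # regular expression selecting the units to collect (deprecated in recent releases in favor of --collector.systemd.unit-include)
		".+"                             # match every systemd unit
		"(docker|sshd|nginx).service"    # collect only these units; their state is reported as node_systemd_unit_state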

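To see which units a pattern such as "(docker|sshd|nginx).service" selects, the Python sketch below applies it to a list of unit names. It assumes the pattern must match the entire unit name, which is my understanding of how node_exporter applies it. Python's re module is used here as a stand-in for Go's regular expressions, which behave the same way for patterns this simple. The helper name and the unit list are illustrative.

```python
import re


def units_matching(pattern, unit_names):
    """Return the unit names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in unit_names if compiled.fullmatch(name)]


# Hypothetical unit names, for illustration only.
EXAMPLE_UNITS = [
    "sshd.service",
    "nginx.service",
    "crond.service",
    "docker.socket",
    "sshd-keygen.service",
]
```

Applied to EXAMPLE_UNITS, the pattern keeps only sshd.service and nginx.service. The dot before "service" is unescaped, so it matches any single character. Writing it as \.service would be stricter.
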
// Start the service with a systemd unit file

[root@node3 local]# cat > /usr/lib/systemd/system/node_exporter.service <<EOF
[Unit]
Description=The node_exporter Server
After=network.target

[Service]
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure
RestartSec=15s
SyslogIdentifier=node_exporter

[Install]
WantedBy=multi-user.target

EOF
[root@node3 local]# systemctl daemon-reload 
[root@node3 local]# systemctl enable --now node_exporter.service 
Created symlink /etc/systemd/system/multi-user.target.wants/node_exporter.service → /usr/lib/systemd/system/node_exporter.service.
[root@node3 local]# systemctl status node_exporter.service 
● node_exporter.service
   Loaded: loaded (/usr/lib/systemd/system/node_exporter.service; enabled; v>
   Active: active (running) since Fri 2021-11-26 00:06:31 CST; 24s ago
 Main PID: 18188 (node_exporter)
    Tasks: 3 (limit: 11208)
   Memory: 4.7M
   CGroup: /system.slice/node_exporter.service
           └─18188 /usr/local/node_exporter/node_exporter

11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
11月 26 00:06:32 node3 node_exporter[18188]: ts=2021-11-25T16:06:32.015Z cal>
[root@node3 local]# ss -atnl
State   Recv-Q  Send-Q    Local Address:Port     Peer Address:Port  Process  
LISTEN  0       128             0.0.0.0:22            0.0.0.0:*              
LISTEN  0       128                   *:9100                *:*
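
With node_exporter listening on port 9100, its /metrics endpoint can be read directly. The Python sketch below fetches that page and turns the text into a dict of samples. The parser is deliberately naive: it skips comment lines and assumes no trailing timestamps. The function names are made up, and the default URL assumes the node3 address from the environment table.

```python
import urllib.request


def parse_exposition(text):
    """Map each 'name{labels}' string to its float value, ignoring # lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://192.168.143.104:9100/metrics", timeout=5.0):
    """Download the metrics page from node_exporter and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_exposition(response.read().decode("utf-8"))
```

Looking up a key such as node_load1 in the result of fetch_metrics() gives the current one-minute load average reported by the node.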

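Accessing the web UI

// On the server host, edit the prometheus.yml configuration file

[root@node2 ~]# vim /usr/local/prometheus/prometheus.yml
......
    static_configs:       # static configuration
      - targets: ["192.168.143.104:9100"] # change this line to the client's IP address and port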
[root@node2 ~]# systemctl restart prometheus.service 
[root@node2 ~]# ss -atnl
State    Recv-Q   Send-Q     Local Address:Port       Peer Address:Port   Process   
LISTEN   0        128              0.0.0.0:22              0.0.0.0:*                
LISTEN   0        128                    *:9090                  *:*                
LISTEN   0        128                 [::]:22                 [::]:*

// Check the client's status
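
Besides the web UI, the target's status can be checked through the Prometheus HTTP API by running the instant query up. The Python sketch below builds the query URL and interprets the JSON response. The function names are hypothetical, and the default base URL assumes the node2 address from the environment table.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql):
    """Build the /api/v1/query URL for an instant PromQL query."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def targets_up(response_body):
    """Map each instance label to True (up == 1) or False from an 'up' query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: " + str(payload.get("error")))
    return {
        item["metric"].get("instance", ""): item["value"][1] == "1"
        for item in payload["data"]["result"]
    }


def check_targets(base_url="http://192.168.143.103:9090", timeout=5.0):
    """Run the 'up' query against a live Prometheus server."""
    url = instant_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return targets_up(response.read().decode("utf-8"))
```

If the node_exporter target is being scraped successfully, check_targets() should report True for 192.168.143.104:9100.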