Getting Started with Prometheus

怎么就重名了

Last modified: 2024-04-05 14:40:10

First published: 2024-04-02 22:12:31

Article link: https://blog.csdn.net/xiaolixi199311/article/details/137266008

1. Introduction

Official site: https://prometheus.io/
Official documentation
Chinese documentation
Reference video

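The official site highlights the following selling points (my own comments follow each one):

Dimensional data: a dimensional data model. I am not sure what that means exactly; it looks like a plain time series model to me.
Powerful queries: queries are written in PromQL (Prometheus Query Language), Prometheus's own query language. I dislike every DSL...
Great visualization. Not great at all, in my opinion.
Efficient storage. I will look into this later.
Simple operation. It is written in Go, so deployment really is simple.
Precise alerting. I would guess most companies use Grafana for alerting anyway.
Many client libraries.
Many integrations.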

Features

Prometheus’s main features are:

a multi-dimensional data model with time series data identified by metric name and key/value pairs
PromQL, a flexible query language to leverage this dimensionality
no reliance on distributed storage; single server nodes are autonomous
time series collection happens via a pull model over HTTP
pushing time series is supported via an intermediary gateway
targets are discovered via service discovery or static configuration
multiple modes of graphing and dashboarding support

History

[Figure: timeline of Prometheus's development]

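Personal impressions

The PromQL DSL is really painful to use; it feels like a tumor stuck onto Prometheus. I used Graphite before and found its API quite pleasant: it offers plenty of functions and they are easy to understand. PromQL, by comparison, is ugly and hard to read.
Prometheus itself does not appear to be highly available. What a company has to consider to run it in production is probably beyond a beginner, or maybe I am just not good enough yet.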

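2. Prometheus architecture

Architecture diagram
Image source: https://prometheus.io/assets/architecture.png

Diagram notes

Short-lived jobs can actively push their data to the Pushgateway; Prometheus then periodically pulls that data from the Pushgateway into its own storage.
Prometheus periodically pulls data from exporters into its own storage.
Prometheus finds target instances through statically configured targets or through service discovery.
Prometheus provides an HTTP server; the Prometheus web UI, Grafana, and API clients can query metrics with PromQL and visualize them.
Prometheus pushes alerts to Alertmanager, which can be configured to notify users via DingTalk, email, WeCom (WeChat Work), and other channels.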
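
To make the pull model in the notes above more concrete, here is a minimal sketch of what an exporter does: it keeps values in memory and renders them in the text exposition format whenever Prometheus scrapes its /metrics endpoint. The metric name demo_requests_total, its labels, the port, and the helper names are all hypothetical; a real exporter would normally use an official client library instead of hand-rolled formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter values, keyed by (metric name, label pairs).
SAMPLES = {
    ("demo_requests_total", (("method", "get"),)): 3,
    ("demo_requests_total", (("method", "post"),)): 1,
}


def render_metrics(samples):
    """Render samples as Prometheus text exposition lines."""
    lines = [
        "# HELP demo_requests_total Hypothetical request counter.",
        "# TYPE demo_requests_total counter",
    ]
    for (name, labels), value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered samples on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics; Prometheus would scrape http://host:port/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a static target in prometheus.yml would be enough for Prometheus to start pulling these samples on every scrape interval.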

Components

The Prometheus ecosystem consists of many components, some of which are optional:

the main Prometheus server which scrapes and stores time series data
client libraries for instrumenting application code
a push gateway for supporting short-lived jobs
special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
an alertmanager to handle alerts
various support tools

3. When nothing makes sense yet, set up an environment first

Quickly set up a learning environment with docker-compose

(base)  ~/data/prometheus/ tree .
.
├── LICENSE
├── README.md
├── alertmanager
│   └── config.yml
├── docker-compose.yaml
├── grafana
│   ├── config.monitoring
│   └── provisioning
└── prometheus
    ├── alert.yml
    ├── prometheus.yml
    └── web.yml

5 directories, 8 files
(base)  ~/data/prometheus/

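The docker-compose.yaml configuration file

Note that the file shown here defines no alertmanager or cadvisor service, even though prometheus.yml further down scrapes both, so with exactly this file those targets are reported as down.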

version: '3.3'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:v2.37.6
    container_name: prometheus
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.config.file=/etc/prometheus/web.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
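      # enable hot reloading of the configuration (the /-/reload endpoint)
      - '--web.enable-lifecycle'
      # admin API
      #- '--web.enable-admin-api'
      # maximum retention time for historical data, 15 days by default
      - '--storage.tsdb.retention.time=30d'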
    networks:
      - monitoring
    links:
      - node_exporter
    expose:
      - '9090'
    ports:
      - 9090:9090

  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command: 
      - '--path.procfs=/host/proc' 
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
    networks:
      - monitoring
    ports:
      - '9100:9100'

  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    env_file:
      - ./grafana/config.monitoring
    networks:
      - monitoring
    links:
      - prometheus
    ports:
      - 3000:3000
    depends_on:
      - prometheus

2. `Prometheus` configuration files

(base)  ~/data/prometheus/ cat prometheus/alert.yml 
groups:
- name: Prometheus alert
  rules:
  # Alert on any instance that has been unreachable for more than 30 seconds
  - alert: ServiceDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Service down, instance: {{ $labels.instance }}"
      description: "{{ $labels.job }} is down"
(base)  ~/data/prometheus/ 
(base)  ~/data/prometheus/ cat prometheus/prometheus.yml 
# Global configuration
global:
  scrape_interval:     15s # Scrape every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
#  query_log_file: /prometheus/query.log
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']

# Alerting rule configuration
rule_files:
  - "alert.yml"

# Scrape configuration
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape the targets of this job every 15 seconds
    scrape_interval: 15s
    static_configs:
    - targets: ['localhost:9090']
    basic_auth:
      username: test
      password: test
      #password: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  - job_name: 'alertmanager'
    scrape_interval: 15s
    static_configs:
    - targets: ['alertmanager:9093']
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
    - targets: ['cadvisor:8080']
      labels:
        instance: Prometheus server

  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
    - targets: ['node_exporter:9100']
      labels:
        instance: Prometheus server
(base)  ~/data/prometheus/ 
(base)  ~/data/prometheus/ cat prometheus/web.yml 
basic_auth_users:
  admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  test: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay

(base)  ~/data/prometheus/

3. `grafana` configuration file

(base)  ~/data/prometheus/ cat grafana/config.monitoring 
GF_SECURITY_ADMIN_PASSWORD=password
GF_USERS_ALLOW_SIGN_UP=false
(base)  ~/data/prometheus/ 
(base)  ~/data/prometheus/

4. Startup

[Screenshot: startup output]

4. Basic concepts

0. What jobs and instances mean

Quoting the English documentation directly, since it is clearer:

In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process.
A collection of instances with the same purpose, a process replicated for scalability or reliability for example, is called a job.
For example, an API server job with four replicated instances:
job: api-server
- instance 1: 1.2.3.4:5670
- instance 2: 1.2.3.4:5671
- instance 3: 5.6.7.8:5670
- instance 4: 5.6.7.8:5671

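1. Metric data

Prometheus metrics look baffling at first sight. What does any of this mean?
No rush; let's work through it step by step. First, enter up in the Prometheus web UI and look at what comes back.
[Screenshot: result of querying up in the Prometheus web UI]
A metric name with its labels has the following format:
metric_name{label_name="label_value", ...}
For example, up{instance="alertmanager:9093", job="alertmanager"} 0 means that the up metric has the value 0 for the label set instance="alertmanager:9093", job="alertmanager". For up, 1 means the last scrape of that target succeeded and 0 means it failed.
Inside Prometheus, the metric name is itself stored as a label named __name__, so up is shorthand for {__name__="up"}.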
metric_name must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*.
labels must match the regex [a-zA-Z_][a-zA-Z0-9_]*.
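
As a small illustration of the two naming rules above, the following sketch checks candidate names against exactly those regular expressions. The helper names and the example strings are my own and only serve to show which names pass.

```python
import re

# Patterns quoted from the naming rules above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True if the whole string matches the metric-name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    """True if the whole string matches the label-name pattern."""
    return LABEL_NAME_RE.fullmatch(name) is not None


print(is_valid_metric_name("up"))                     # True
print(is_valid_metric_name("node:cpu_usage:rate5m"))  # True, colons are allowed
print(is_valid_metric_name("9lives"))                 # False, cannot start with a digit
print(is_valid_label_name("job"))                     # True
print(is_valid_label_name("http-status"))             # False, hyphens are not allowed
```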

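2. Metric types

Prometheus offers four metric types. The types are a concept exposed to the outside (client libraries and the exposition format); internally the Prometheus server does not distinguish between them.

Counter: a cumulative value that only increases (or resets to zero on restart), such as the number of requests an endpoint has served.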

# HELP prometheus_notifications_dropped_total Total number of alerts dropped due to errors when sending to Alertmanager.
# TYPE prometheus_notifications_dropped_total counter
prometheus_notifications_dropped_total 84

Gauge: an instantaneous value describing the current state, which can go up or down, such as CPU load.

# HELP prometheus_engine_query_log_enabled State of the query log.
# TYPE prometheus_engine_query_log_enabled gauge
prometheus_engine_query_log_enabled 0

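Histogram: counts observations into buckets. Each _bucket series is cumulative (le is the upper bound of the bucket), and _sum and _count give the total of all observed values and the number of observations.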

# HELP prometheus_http_request_duration_seconds Histogram of latencies for HTTP requests.
# TYPE prometheus_http_request_duration_seconds histogram
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.1"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.2"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.4"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="1"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="3"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="8"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="20"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="60"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="120"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="+Inf"} 3
prometheus_http_request_duration_seconds_sum{handler="/"} 9.2834e-05
prometheus_http_request_duration_seconds_count{handler="/"} 3
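
The histogram output above is easier to read once the cumulative buckets are turned into per-bucket counts. The sketch below does that for the sample shown and also derives the mean latency from _sum and _count. The constant and function names are my own; the numbers are copied from the sample.

```python
# Cumulative bucket counts copied from the sample output above (handler="/").
CUMULATIVE_BUCKETS = [
    ("0.1", 3), ("0.2", 3), ("0.4", 3), ("1", 3), ("3", 3),
    ("8", 3), ("20", 3), ("60", 3), ("120", 3), ("+Inf", 3),
]
DURATION_SUM = 9.2834e-05
DURATION_COUNT = 3


def per_bucket_counts(cumulative):
    """Turn cumulative `le` counts into the number of observations per bucket."""
    counts = []
    previous = 0
    for upper_bound, total in cumulative:
        counts.append((upper_bound, total - previous))
        previous = total
    return counts


def mean_duration(duration_sum, duration_count):
    """Average observation, i.e. _sum divided by _count."""
    return duration_sum / duration_count if duration_count else float("nan")


print(per_bucket_counts(CUMULATIVE_BUCKETS)[:2])  # [('0.1', 3), ('0.2', 0)]
print(mean_duration(DURATION_SUM, DURATION_COUNT))
```

In this sample all three requests finished in under 0.1 seconds, so every bucket after the first contributes zero new observations.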

Summary: reports quantiles, such as the p99 or p999 of request latency, or the quantiles of GC pause durations.

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.675e-05
go_gc_duration_seconds{quantile="0.25"} 0.0001485
go_gc_duration_seconds{quantile="0.5"} 0.000254875
go_gc_duration_seconds{quantile="0.75"} 0.000390875
go_gc_duration_seconds{quantile="1"} 0.000809751
go_gc_duration_seconds_sum 0.011161251
go_gc_duration_seconds_count 38

5 Basic authentication

5.1 Adding basic authentication to the web UI and the HTTP API

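# gen-pass.py: print a bcrypt hash for use in web.yml (needs the third-party bcrypt package)
import getpass
import bcrypt

# Prompt for the password without echoing it to the terminal
password = getpass.getpass("password: ")
# gensalt() picks a random salt, so the hash differs on every run
hashed_password = bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())
print(hashed_password.decode())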

Run the script:

python3 gen-pass.py

Enter test as the password:

password:
$2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay

5.2 Basic authentication configuration

(base)  ~/data/prometheus/prometheus/ cat web.yml 
basic_auth_users:
  admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  test: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay

(base)  ~/data/prometheus/prometheus/ 
(base)  ~/data/prometheus/prometheus/

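Add --web.config.file=/etc/prometheus/web.yml to the Prometheus startup flags. In addition, prometheus.yml needs the credentials so that Prometheus can still scrape its own /metrics endpoint: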

scrape_configs:
  - job_name: 'prometheus'
    # Override the global default and scrape the targets of this job every 15 seconds
    scrape_interval: 15s
    static_configs:
    - targets: ['localhost:9090']
    basic_auth:
      username: test
      password: test

5.3 HTTP API calls also need the credentials

(base)  ~/data/prometheus/prometheus/ 
(base)  ~/data/prometheus/prometheus/ curl http://localhost:9090/metrics        
Unauthorized
(base)  ~/data/prometheus/prometheus/ curl -u test:test http://localhost:9090/metrics

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.675e-05
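
What curl -u does in the session above can be sketched with the Python standard library: basic auth is just an Authorization header carrying the base64 of username:password. The function names below are my own, and the request is only prepared, not sent, so the sketch runs without a Prometheus server.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of the HTTP Authorization header for basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_metrics_request(base_url, username, password):
    """Prepare (but do not send) an authenticated GET for /metrics."""
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


request = build_metrics_request("http://localhost:9090", "test", "test")
print(request.get_header("Authorization"))  # Basic dGVzdDp0ZXN0
# To actually send it: urllib.request.urlopen(request).read()
```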

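6 HTTP API overview

The Prometheus query API endpoints all start with the /api/v1 prefix. There are not many endpoints overall and they are fairly easy to understand. I group them into two categories:

Fetching metric data
Fetching metadata

6.1 Response format

The JSON response has the following format: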

{
  "status": "success" | "error",
  "data": <data>,

  // Only set if status is "error". The data field may still hold
  // additional data.
  "errorType": "<string>",
  "error": "<string>",

  // Only if there were warnings while executing the request.
  // There will still be data in the data field.
  "warnings": ["<string>"]
}
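
A client typically unwraps this envelope before looking at the payload. The following is a minimal sketch of that step; the exception class, the function name, and the sample bodies are hypothetical and only mirror the fields listed above.

```python
import json


class PrometheusAPIError(Exception):
    """Raised when the response envelope reports status "error"."""


def unwrap_response(body):
    """Parse a JSON response body and return (data, warnings)."""
    envelope = json.loads(body)
    if envelope.get("status") != "success":
        raise PrometheusAPIError(
            f"{envelope.get('errorType', 'unknown')}: {envelope.get('error', '')}"
        )
    return envelope.get("data"), envelope.get("warnings", [])


# Hand-written sample bodies, not captured from a real server.
ok_body = '{"status": "success", "data": {"resultType": "vector", "result": []}}'
error_body = '{"status": "error", "errorType": "bad_data", "error": "parse error"}'

data, warnings = unwrap_response(ok_body)
print(data["resultType"], warnings)  # vector []

try:
    unwrap_response(error_body)
except PrometheusAPIError as exc:
    print(exc)  # bad_data: parse error
```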

6.2 APIs for fetching metric data

The data field of the response has the following format:

{
  "resultType": "matrix" | "vector" | "scalar" | "string",
  "result": <value>
}
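
The shape of result depends on resultType: a vector carries one value pair per series, a matrix carries a values list per series, and scalar and string results are a single [timestamp, value] pair. The sketch below flattens all four into (labels, timestamp, value) tuples. The function name and the sample payload (including its timestamp) are made up for illustration.

```python
def flatten_result(data):
    """Flatten the data field into (labels, timestamp, value) tuples."""
    result_type = data["resultType"]
    result = data["result"]
    if result_type == "vector":
        return [(s["metric"], s["value"][0], float(s["value"][1])) for s in result]
    if result_type == "matrix":
        return [
            (s["metric"], ts, float(val))
            for s in result
            for ts, val in s["values"]
        ]
    if result_type == "scalar":
        ts, val = result
        return [({}, ts, float(val))]
    # "string" results carry text, so the value is returned unconverted.
    ts, val = result
    return [({}, ts, val)]


# Hand-written sample shaped like an instant-query response for up.
SAMPLE_VECTOR = {
    "resultType": "vector",
    "result": [
        {
            "metric": {"__name__": "up", "instance": "alertmanager:9093", "job": "alertmanager"},
            "value": [1712066400.0, "0"],
        }
    ],
}

for labels, timestamp, value in flatten_result(SAMPLE_VECTOR):
    print(labels["job"], timestamp, value)
```

Sample values arrive as JSON strings, which is why the sketch converts them with float(); that also handles the special values NaN, +Inf, and -Inf.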

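Instant queries

GET /api/v1/query

URL query parameters:

query=<string>: the PromQL expression.
time=<rfc3339 | unix_timestamp>: the timestamp at which the PromQL expression is evaluated. Optional; the current server time is used if it is omitted.
timeout=<duration>: the evaluation timeout. Optional; defaults to the global -query.timeout setting.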
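
Because PromQL expressions often contain braces and quotes, the query parameter has to be URL-encoded. Here is a minimal sketch of building an instant-query URL with the standard library; the function name and the base URL are assumptions for this example.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, query, time=None, timeout=None):
    """Build a GET URL for /api/v1/query; optional parameters are omitted when None."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    if timeout is not None:
        params["timeout"] = timeout
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


print(instant_query_url("http://localhost:9090", "up"))
# http://localhost:9090/api/v1/query?query=up
print(instant_query_url("http://localhost:9090", 'up{job="prometheus"}', timeout="5s"))
# http://localhost:9090/api/v1/query?query=up%7Bjob%3D%22prometheus%22%7D&timeout=5s
```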

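Range queries

GET /api/v1/query_range

URL query parameters:

query=<string>: the PromQL expression.
start=<rfc3339 | unix_timestamp>: the start timestamp.
end=<rfc3339 | unix_timestamp>: the end timestamp.
step=<duration | float>: the query resolution step; the expression is evaluated once every step seconds across the time range.
timeout=<duration>: the evaluation timeout. Optional; defaults to the global -query.timeout setting.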
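
A range query evaluates the expression at start, start + step, and so on up to end, so the window and the step together determine how many points each series returns. The sketch below builds such a URL and computes that count. The function names, base URL, and timestamps are assumptions chosen for illustration.

```python
import math
from urllib.parse import urlencode


def range_query_url(base_url, query, start, end, step):
    """Build a GET URL for /api/v1/query_range with Unix timestamps and a step in seconds."""
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def expected_points(start, end, step):
    """Number of evaluation timestamps: start, start+step, ... up to and including end."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    return math.floor((end - start) / step) + 1


start, end = 1712066400, 1712067300  # hypothetical 15-minute window
print(range_query_url("http://localhost:9090", "up", start, end, 15))
print(expected_points(start, end, 15))  # 61
```

Choosing a larger step is the simplest way to keep responses small when the window is long.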

6.3 Screenshot of the official API documentation

6.4 Management API

Health check

GET /-/healthy
HEAD /-/healthy

Readiness check

GET /-/ready
HEAD /-/ready

Reload

PUT /-/reload
POST /-/reload

Quit

PUT /-/quit
POST /-/quit

(base)  ~/data/jupyter/ curl http://localhost:9090/-/healthy
Unauthorized
(base)  ~/data/jupyter/ curl -u test:test  http://localhost:9090/-/healthy
Prometheus Server is Healthy.
(base)  ~/data/jupyter/ 
(base)  ~/data/jupyter/ curl -u test:test  http://localhost:9090/-/ready  
Prometheus Server is Ready.
(base)  ~/data/jupyter/ 
(base)  ~/data/jupyter/ curl -u test:test  http://localhost:9090/-/reload
Only POST or PUT requests allowed
(base)  ~/data/jupyter/
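
The last curl call fails because curl sends GET by default, while /-/reload only accepts POST or PUT. A minimal sketch of preparing management requests with the right method follows. The helper name is my own, the test:test credentials match the web.yml used in this article, and the requests are only constructed, not sent. Reloading also relies on the --web.enable-lifecycle flag set in the compose file.

```python
import base64
import urllib.request


def management_request(base_url, endpoint, username, password, method="GET"):
    """Prepare (but do not send) a request to a /-/ management endpoint."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/-/{endpoint}", method=method
    )
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {token}")
    return request


health = management_request("http://localhost:9090", "healthy", "test", "test")
reload_config = management_request(
    "http://localhost:9090", "reload", "test", "test", method="POST"
)
print(health.get_method(), health.full_url)
print(reload_config.get_method(), reload_config.full_url)
# To actually trigger the reload: urllib.request.urlopen(reload_config)
```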