Getting Started with Prometheus
1. Introduction
Official site: https://prometheus.io/
Official documentation
Chinese documentation
Reference video
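- Dimensional data: a dimensional data model. I am not sure what that means exactly; it looks like a time series model?
- Powerful queries. Queries use PromQL (Prometheus Query Language), Prometheus's own query language. I dislike every DSL...
- Great visualization. Not great at all, in my opinion.
- Efficient storage. Something to look into later.
- Simple operation. It is written in Go, so deployment really is simple.
- Precise alerting. My guess is that most companies use Grafana for alerting anyway.
- Many client libraries.
- Many integrations.
Features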
Prometheus’s main features are:
- a multi-dimensional data model with time series data identified by metric name and key/value pairs
- PromQL, a flexible query language to leverage this dimensionality
- no reliance on distributed storage; single server nodes are autonomous
- time series collection happens via a pull model over HTTP
- pushing time series is supported via an intermediary gateway
- targets are discovered via service discovery or static configuration
- multiple modes of graphing and dashboarding support
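History
Personal impressions
- The PromQL DSL is painful to use, like a tumor attached to Prometheus. I used Graphite before and found its API quite pleasant: there is a lot of it and it is easy to understand. PromQL looks ugly and is hard to follow.
- Prometheus itself does not seem to be highly available. What an enterprise has to consider before running it for real is probably more than a beginner can handle, or maybe I am just not skilled enough.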
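2. Prometheus architecture
Image source: https://prometheus.io/assets/architecture.png
- Image description
  - Short-lived jobs can actively push data to the pushgateway. Prometheus then periodically pulls the data from the pushgateway into its own storage.
  - Prometheus periodically pulls data from exporters into its own storage.
  - Prometheus finds target instances through statically configured targets or through service discovery.
  - Prometheus provides an HTTP server. The Prometheus web UI, Grafana, and API clients can query metrics with PromQL and visualize them.
  - Prometheus can push alerts to Alertmanager, which can be configured to notify users via DingTalk, email, WeCom, and other channels.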
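To make the pull model concrete, here is a minimal sketch of what an exporter does: it serves plain-text metrics over HTTP and waits for Prometheus to scrape them. It uses only the Python standard library. The metric name `demo_requests_total`, port 8000, and all helper names are assumptions made up for illustration; a real exporter would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter; a real application would update it as it works.
REQUEST_COUNT = {"value": 0}


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Total requests seen by the demo app.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {REQUEST_COUNT['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUEST_COUNT["value"] += 1  # count each scrape as one request
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given port (assumed to be free)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and adding the host and port as a static target should be enough for Prometheus to start pulling this counter on every scrape interval.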
Components
The Prometheus ecosystem consists of many components, some of which are optional:
- the main Prometheus server which scrapes and stores time series data
- client libraries for instrumenting application code
- a push gateway for supporting short-lived jobs
- special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
- an alertmanager to handle alerts
- various support tools
3. When in doubt, set up an environment first
Use docker-compose to quickly build a learning environment
(base) ~/data/prometheus/ tree .
.
├── LICENSE
├── README.md
├── alertmanager
│   └── config.yml
├── docker-compose.yaml
├── grafana
│   ├── config.monitoring
│   └── provisioning
└── prometheus
    ├── alert.yml
    ├── prometheus.yml
    └── web.yml
5 directories, 8 files
(base) ~/data/prometheus/
The docker-compose.yaml configuration file
version: '3.3'
volumes:
  prometheus_data: {}
  grafana_data: {}
networks:
  monitoring:
    driver: bridge
services:
  prometheus:
    image: prom/prometheus:v2.37.6
    container_name: prometheus
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.config.file=/etc/prometheus/web.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      # Enable hot reload of the configuration over HTTP
      - '--web.enable-lifecycle'
      # Admin API
      #- '--web.enable-admin-api'
      # Maximum retention time for historical data; the default is 15 days
      - '--storage.tsdb.retention.time=30d'
    networks:
      - monitoring
    links:
      - node_exporter
    expose:
      - '9090'
    ports:
      - 9090:9090
  node_exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker)($$|/)'
    networks:
      - monitoring
    ports:
      - '9100:9100'
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    restart: always
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/:/etc/grafana/provisioning/
    env_file:
      - ./grafana/config.monitoring
    networks:
      - monitoring
    links:
      - prometheus
    ports:
      - 3000:3000
    depends_on:
      - prometheus
2. Prometheus configuration files
(base) ~/data/prometheus/ cat prometheus/alert.yml
groups:
  - name: Prometheus alert
    rules:
      # Alert on any instance that has been unreachable for more than 30 seconds
      - alert: ServiceDown
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable, instance: {{ $labels.instance }}"
          description: "{{ $labels.job }} service is down"
(base) ~/data/prometheus/
(base) ~/data/prometheus/ cat prometheus/prometheus.yml
# Global configuration
global:
  scrape_interval: 15s # Scrape every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # query_log_file: /prometheus/query.log
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Alerting rule files
rule_files:
  - "alert.yml"
# Scrape configuration
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default: scrape the targets of this job every 15 seconds
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
    basic_auth:
      username: test
      password: test
      #password: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  - job_name: 'alertmanager'
    scrape_interval: 15s
    static_configs:
      - targets: ['alertmanager:9093']
  - job_name: 'cadvisor'
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: Prometheus server
  - job_name: 'node-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: Prometheus server
(base) ~/data/prometheus/
(base) ~/data/prometheus/ cat prometheus/web.yml
basic_auth_users:
  admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  test: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
(base) ~/data/prometheus/
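The `for: 30s` clause in alert.yml interacts with the 15-second `evaluation_interval` in prometheus.yml: the alert does not fire on the first failed evaluation, it first sits in a pending state. The sketch below is a simplified model of that behavior as I understand it, not Prometheus's actual implementation; the function name and the list-of-samples input are assumptions for illustration.

```python
def alert_states(samples, interval_s=15, for_s=30):
    """Return (time, state) for each evaluation of the rule `up == 0`.

    samples: one `up` value per rule evaluation, in order.
    interval_s: the evaluation interval in seconds.
    for_s: the `for` duration in seconds.
    """
    active_since = None  # time at which the condition first became true
    states = []
    for i, up in enumerate(samples):
        now = i * interval_s
        if up == 0:
            if active_since is None:
                active_since = now
            # Fire only once the condition has held for at least `for_s` seconds.
            state = "firing" if now - active_since >= for_s else "pending"
        else:
            active_since = None
            state = "inactive"
        states.append((now, state))
    return states


for when, state in alert_states([1, 0, 0, 0, 1]):
    print(f"t={when:>3}s {state}")
```

In this model a target that is down for a single evaluation never reaches the firing state, which is the point of `for`: it filters out short blips.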
3. Grafana configuration file
(base) ~/data/prometheus/ cat grafana/config.monitoring
GF_SECURITY_ADMIN_PASSWORD=password
GF_USERS_ALLOW_SIGN_UP=false
(base) ~/data/prometheus/
(base) ~/data/prometheus/
4. Startup
4. Basic concepts
0. What jobs and instances mean
Quoting the English documentation directly, since it is clearer:
- In Prometheus terms, an endpoint you can scrape is called an instance, usually corresponding to a single process.
- A collection of instances with the same purpose, a process replicated for scalability or reliability for example, is called a job.
For example, an API server job with four replicated instances:
- job: api-server
  - instance 1: 1.2.3.4:5670
  - instance 2: 1.2.3.4:5671
  - instance 3: 5.6.7.8:5670
  - instance 4: 5.6.7.8:5671
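When Prometheus scrapes a target it attaches `job` and `instance` labels to every series from that target, so the api-server example above ends up as four separate `up` series. A small sketch of that mapping, with the helper name `up_series` made up for illustration:

```python
from typing import List

# The job and its four instances from the example above.
JOB = "api-server"
INSTANCES = ["1.2.3.4:5670", "1.2.3.4:5671", "5.6.7.8:5670", "5.6.7.8:5671"]


def up_series(job: str, instances: List[str]) -> List[str]:
    """Return one `up` series selector per instance of the job."""
    return [f'up{{job="{job}", instance="{addr}"}}' for addr in instances]


for series in up_series(JOB, INSTANCES):
    print(series)
```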
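1. Metric data
Prometheus metrics look baffling at first sight: what does any of it mean?
No rush; let's go through it step by step. First, enter up in the Prometheus web page and look at the values it returns.
A metric name and its labels have the following format:
metric_name{label_name=label_value, ...}
For example, up{instance="alertmanager:9093", job="alertmanager"} 0 means that the up metric has the value 0 for the label set instance="alertmanager:9093", job="alertmanager".
Inside Prometheus, the metric name is itself stored as a reserved label: up is stored as {__name__="up"}.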
Metric names must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*.
Label names must match the regex [a-zA-Z_][a-zA-Z0-9_]*.
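The two regexes are easy to try out. The sketch below checks candidate names against them with Python's `re` module; the function names are my own, and `fullmatch` is used so the whole name has to match, not just a prefix.

```python
import re

# The two patterns quoted above.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name: str) -> bool:
    """True if the whole string is a legal metric name."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name: str) -> bool:
    """True if the whole string is a legal label name (colons are not allowed)."""
    return LABEL_NAME_RE.fullmatch(name) is not None


for candidate in ["up", "http_requests_total", "node:cpu:rate5m", "9lives", "a-b"]:
    print(candidate, is_valid_metric_name(candidate))
```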
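2. Metric types
Prometheus provides four metric types. The types are a concept for the outside world (client libraries and the exposition format); internally, the Prometheus server does not distinguish between them.
- Counter: a cumulative number that only goes up (and resets to zero on restart), such as the number of requests to an API.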
# HELP prometheus_notifications_dropped_total Total number of alerts dropped due to errors when sending to Alertmanager.
# TYPE prometheus_notifications_dropped_total counter
prometheus_notifications_dropped_total 84
- Gauge: an instantaneous value describing the state at the current moment, which can go up or down, such as CPU load.
# HELP prometheus_engine_query_log_enabled State of the query log.
# TYPE prometheus_engine_query_log_enabled gauge
prometheus_engine_query_log_enabled 0
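- Histogram: counts observations in cumulative buckets (the le label means "less than or equal to"), together with a _sum and a _count series.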
# HELP prometheus_http_request_duration_seconds Histogram of latencies for HTTP requests.
# TYPE prometheus_http_request_duration_seconds histogram
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.1"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.2"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="0.4"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="1"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="3"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="8"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="20"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="60"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="120"} 3
prometheus_http_request_duration_seconds_bucket{handler="/",le="+Inf"} 3
prometheus_http_request_duration_seconds_sum{handler="/"} 9.2834e-05
prometheus_http_request_duration_seconds_count{handler="/"} 3
- Summary: reports quantiles, such as the p99 or p999 of API request latency, or the quantiles of GC pause duration.
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.675e-05
go_gc_duration_seconds{quantile="0.25"} 0.0001485
go_gc_duration_seconds{quantile="0.5"} 0.000254875
go_gc_duration_seconds{quantile="0.75"} 0.000390875
go_gc_duration_seconds{quantile="1"} 0.000809751
go_gc_duration_seconds_sum 0.011161251
go_gc_duration_seconds_count 38
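The histogram sample above is easier to read once the cumulative buckets are turned into per-bucket counts. The sketch below does that with the values copied from the `handler="/"` series and also derives the mean latency from `_sum` and `_count`; the helper names are assumptions for illustration.

```python
# Cumulative bucket counts copied from the histogram sample above: (le, count).
BUCKETS = [
    ("0.1", 3), ("0.2", 3), ("0.4", 3), ("1", 3), ("3", 3),
    ("8", 3), ("20", 3), ("60", 3), ("120", 3), ("+Inf", 3),
]
SUM_SECONDS = 9.2834e-05  # ..._seconds_sum
COUNT = 3                 # ..._seconds_count


def per_bucket_counts(buckets):
    """Convert cumulative `le` buckets into counts for each individual bucket."""
    result = []
    previous = 0
    for le, cumulative in buckets:
        result.append((le, cumulative - previous))
        previous = cumulative
    return result


def mean_seconds(total, count):
    """Average observation: _sum divided by _count (None if nothing was observed)."""
    return total / count if count else None


print(per_bucket_counts(BUCKETS))
print(mean_seconds(SUM_SECONDS, COUNT))
```

All three requests land in the first bucket (at most 0.1 s), which is why every cumulative count is 3 and every later per-bucket count is 0.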
5 Basic authentication
5.1 Basic authentication can be added to the web UI and the HTTP API
import getpass
import bcrypt
password = getpass.getpass("password: ")
hashed_password = bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())
print(hashed_password.decode())
Save the script as gen-pass.py, run it, and enter test at the prompt:
python3 gen-pass.py
password:
$2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
5.2 Basic authentication configuration
(base) ~/data/prometheus/prometheus/ cat web.yml
basic_auth_users:
  admin: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
  test: $2b$12$hNf2lSsxfm0.i4a.1kVpSOVyBCfIB51VRjgBUyv6kdnyTlgWj81Ay
(base) ~/data/prometheus/prometheus/
Add --web.config.file=/etc/prometheus/web.yml to the Prometheus startup flags. In addition, prometheus.yml needs the credentials as well, so that Prometheus can still scrape itself:
scrape_configs:
  - job_name: 'prometheus'
    # Override the global default: scrape the targets of this job every 15 seconds
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
    basic_auth:
      username: test
      password: test
5.3 Calls to the HTTP API must also pass credentials
(base) ~/data/prometheus/prometheus/
(base) ~/data/prometheus/prometheus/ curl http://localhost:9090/metrics
Unauthorized
(base) ~/data/prometheus/prometheus/ curl -u test:test http://localhost:9090/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.675e-05
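What `curl -u test:test` does under the hood is add an `Authorization: Basic ...` header containing the base64 of `username:password`. Here is a sketch of the same request built with Python's standard library; the helper names are made up, and the URL and credentials are the ones used in this walkthrough.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for HTTP basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_metrics_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated GET request for /metrics."""
    req = urllib.request.Request(f"{base_url}/metrics")
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


req = build_metrics_request("http://localhost:9090", "test", "test")
print(req.full_url, req.get_header("Authorization"))

# To actually send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8")[:200])
```

Basic auth only encodes the credentials, it does not encrypt them, so over plain HTTP they travel in the clear.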
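6 HTTP API
All Prometheus API endpoints start with the /api/v1 prefix. There are relatively few endpoints overall, and they are easy to understand. I group them into two categories:
- Fetching metric data
- Fetching metadata
6.1 Response format
The JSON response has the following format: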
{
  "status": "success" | "error",
  "data": <data>,
  // Only set if status is "error". The data field may still hold
  // additional data.
  "errorType": "<string>",
  "error": "<string>",
  // Only if there were warnings while executing the request.
  // There will still be data in the data field.
  "warnings": ["<string>"]
}
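A client usually wants to unwrap this envelope once and then work with `data` directly. The following is a minimal sketch of that step; the exception class and function name are hypothetical, and only the field names come from the format above.

```python
import json


class PrometheusAPIError(Exception):
    """Raised when the response status is "error" (hypothetical class name)."""


def unwrap(body: str):
    """Return (data, warnings) from a response body, or raise on an error status."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise PrometheusAPIError(f'{payload.get("errorType")}: {payload.get("error")}')
    # warnings is optional, so default to an empty list.
    return payload.get("data"), payload.get("warnings", [])


ok_body = '{"status": "success", "data": {"resultType": "scalar", "result": [0, "1"]}}'
data, warnings = unwrap(ok_body)
print(data["resultType"], warnings)
```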
6.2 APIs for fetching metric data
The data field of the response has the following format:
{
  "resultType": "matrix" | "vector" | "scalar" | "string",
  "result": <value>
}
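- Instant query
GET /api/v1/query
URL query parameters:
query=<string>: the PromQL expression.
time=<rfc3339 | unix_timestamp>: the timestamp at which the expression is evaluated. Optional; defaults to the current server time.
timeout=<duration>: evaluation timeout. Optional; defaults to the global --query.timeout flag.
- Range query
GET /api/v1/query_range
URL query parameters:
query=<string>: the PromQL expression.
start=<rfc3339 | unix_timestamp>: start timestamp.
end=<rfc3339 | unix_timestamp>: end timestamp.
step=<duration | float>: query resolution step, as a duration or a number of seconds; the expression is evaluated once per step within the range.
timeout=<duration>: evaluation timeout. Optional; defaults to the global --query.timeout flag.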
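The parameters above map directly onto a query string. The sketch below only builds the URLs, using `urllib.parse.urlencode` so that PromQL characters such as `=` and spaces are escaped; the function names and the example timestamps are assumptions for illustration.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, query, time=None, timeout=None):
    """Build the URL for GET /api/v1/query; optional parameters are omitted when None."""
    params = {"query": query}
    if time is not None:
        params["time"] = time
    if timeout is not None:
        params["timeout"] = timeout
    return f"{base_url}/api/v1/query?{urlencode(params)}"


def range_query_url(base_url, query, start, end, step, timeout=None):
    """Build the URL for GET /api/v1/query_range."""
    params = {"query": query, "start": start, "end": end, "step": step}
    if timeout is not None:
        params["timeout"] = timeout
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


print(instant_query_url("http://localhost:9090", "up == 0"))
print(range_query_url("http://localhost:9090", "up", 1700000000, 1700000600, "15s"))
```

With the basic authentication configured earlier, these URLs still need credentials when they are actually requested.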
6.3 Screenshot of the API list on the official site
6.4 Management API
- Health check
GET /-/healthy
HEAD /-/healthy
- Readiness check
GET /-/ready
HEAD /-/ready
- Reload
PUT /-/reload
POST /-/reload
- Quit
PUT /-/quit
POST /-/quit
(base) ~/data/jupyter/ curl http://localhost:9090/-/healthy
Unauthorized
(base) ~/data/jupyter/ curl -u test:test http://localhost:9090/-/healthy
Prometheus Server is Healthy.
(base) ~/data/jupyter/
(base) ~/data/jupyter/ curl -u test:test http://localhost:9090/-/ready
Prometheus Server is Ready.
(base) ~/data/jupyter/
(base) ~/data/jupyter/ curl -u test:test http://localhost:9090/-/reload
Only POST or PUT requests allowed
(base) ~/data/jupyter/
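The last call fails because curl sends a GET by default, while /-/reload only accepts POST or PUT. A sketch of the correct request with Python's standard library follows; the helper name is made up, and the reload endpoint only works when Prometheus runs with --web.enable-lifecycle, as in the compose file earlier.

```python
import base64
import urllib.request


def build_reload_request(base_url, username=None, password=None):
    """Prepare (but do not send) a POST to /-/reload, optionally with basic auth."""
    # Passing method="POST" explicitly is what the plain curl call above was missing.
    req = urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")
    if username is not None and password is not None:
        token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        req.add_header("Authorization", f"Basic {token}")
    return req


req = build_reload_request("http://localhost:9090", "test", "test")
print(req.get_method(), req.full_url)

# To actually trigger the reload against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```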