Prometheus in Production: A Full Walkthrough (Storage / Load / Scheduling)

Latest recommended article published on 2025-10-01 04:49:37

License: CC 4.0 BY-SA

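This article provides production configuration templates that can be applied directly, together with load-test figures and a before/after tuning table. The full walkthrough follows.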

I. Storage Architecture in Practice (TSDB Tuning)

1. Storage topology design

# Example layout of the storage directory
/data/prometheus/
├── 01BKGV7JBM69T2G1BGBGM6KB12 # Block (one persisted time range)
│   ├── chunks
│   ├── index
│   └── meta.json
├── chunks_head
└── wal
    ├── 000000002
    └── 000000003
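
To see which time range each block in this layout covers, the `meta.json` files can be read directly. The following is a minimal sketch, assuming the usual `meta.json` fields (`ulid`, `minTime`, `maxTime` in milliseconds, and `stats`); the helper name `summarize_blocks` is made up for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def summarize_blocks(data_dir):
    """Return one summary dict per persisted block found under data_dir."""
    summaries = []
    # Only block directories contain meta.json; wal/ and chunks_head/ are skipped.
    for meta_path in sorted(Path(data_dir).glob("*/meta.json")):
        meta = json.loads(meta_path.read_text())
        stats = meta.get("stats", {})
        summaries.append({
            "ulid": meta.get("ulid", meta_path.parent.name),
            # minTime / maxTime are Unix timestamps in milliseconds
            "start": datetime.fromtimestamp(meta["minTime"] / 1000, tz=timezone.utc),
            "end": datetime.fromtimestamp(meta["maxTime"] / 1000, tz=timezone.utc),
            "series": stats.get("numSeries", 0),
            "samples": stats.get("numSamples", 0),
        })
    return summaries

# Usage (path taken from the layout above):
#   for block in summarize_blocks("/data/prometheus"):
#       print(block)
```

Reading `meta.json` is read-only, so this kind of check should be safe to run next to a live server.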

2. Key parameter tuning

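# prometheus.yml storage configuration fragment
storage:
  tsdb:
    out_of_order_time_window: 2h # window in which out-of-order samples are accepted
  exemplars:
    max_exemplars: 1000000 # exemplar storage needs --enable-feature=exemplar-storage
# Retention and chunk segment size are typically set with startup flags
# rather than in prometheus.yml (check the documentation for your version):
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.max-block-chunk-segment-size=512MB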
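
A 30-day retention only makes sense if the disk can hold it. The sketch below applies the sizing rule from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The ingestion rate used here is a made-up input, not a measurement from this article.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """needed_disk = retention_seconds * ingested_samples_per_second * bytes_per_sample"""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical input: 30 days of retention at 100,000 samples/s, 2 bytes per sample
needed = estimate_disk_bytes(30, 100_000)
print(f"{needed / 1e9:.1f} GB")  # prints "518.4 GB"
```

The result is a planning figure only; WAL, head chunks, and compaction need extra headroom on top of it.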

3. Remote storage in practice (Thanos integration)

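# Remote write configuration
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    name: thanos-receive
    queue_config:
      capacity: 10000             # samples buffered per shard
      max_samples_per_send: 2000  # batch size per request
      batch_send_deadline: 60s    # flush a partial batch after this long
      max_shards: 200
      min_shards: 50
      retry_on_http_429: true     # retry when the receiver rate-limits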
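
The queue settings above imply rough upper bounds on memory and throughput. This sketch works them out under two simplifying assumptions: each shard buffers at most `capacity` samples, and each shard sends one batch per round trip. The 100 ms round-trip latency is a hypothetical value.

```python
def queue_limits(capacity, max_shards, max_samples_per_send, send_latency_ms):
    """Rough upper bounds for one remote_write queue (a back-of-envelope model)."""
    return {
        # every shard may hold up to `capacity` samples in memory
        "max_buffered_samples": capacity * max_shards,
        # every shard sends at most one batch per round trip
        "max_samples_per_second": max_shards * max_samples_per_send * 1000 / send_latency_ms,
    }


limits = queue_limits(capacity=10000, max_shards=200,
                      max_samples_per_send=2000, send_latency_ms=100)
print(limits)
```

With these inputs the model allows up to 2,000,000 buffered samples and about 4,000,000 samples/s, so `max_shards: 200` leaves a wide margin over the write rates in the table that follows; real throughput also depends on the receiver.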

4. Load-test comparison table

Scenario	Default config	Tuned	Improvement
Write throughput	80k/s	150k/s	87.5%
Query latency (P99)	850ms	320ms	62.3%
Disk usage	1TB	650GB	35%
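
The improvement column can be recomputed from the two measurement columns. A small sketch, treating throughput as higher-is-better and latency and disk usage as lower-is-better, and taking 1TB as 1000GB:

```python
def improvement_pct(before, after, higher_is_better):
    """Relative change against the baseline, as a percentage."""
    delta = (after - before) if higher_is_better else (before - after)
    return 100.0 * delta / before


rows = [
    ("write throughput (per s)", 80_000, 150_000, True),
    ("query latency P99 (ms)", 850, 320, False),
    ("disk usage (GB)", 1000, 650, False),
]
for name, before, after, higher_is_better in rows:
    print(f"{name}: {improvement_pct(before, after, higher_is_better):.2f}%")
```

This reproduces the table to within rounding (87.50%, 62.35%, 35.00%).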

II. Load Governance in Practice (Managing Millions of Series)

1. Dynamic sharding scheme

2. Sharding configuration template

# Automatic sharding example (YAML, one entry under scrape_configs)
- job_name: 'node_exporter'
  consul_sd_configs:
    - server: 'consul:8500'
  relabel_configs:
    - source_labels: [__meta_consul_node]
      modulus: 3 # total number of shards
      target_label: __tmp_hash
      action: hashmod
    - source_labels: [__tmp_hash]
      regex: ^(0)$ # index of this shard (0 to modulus - 1)
      action: keep
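
To preview which targets land on which shard before rolling the template out, the `hashmod` step can be reproduced offline. This sketch follows the Prometheus relabeling behaviour as far as it is known here (MD5 of the joined source label values, last 8 bytes read as a big-endian integer, then modulo); the node names are hypothetical.

```python
import hashlib


def hashmod(values, modulus, separator=";"):
    """Mimic the hashmod relabel action for a list of source label values."""
    joined = separator.join(values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    # only the last 8 bytes of the digest are used
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(nodes, shard, modulus=3):
    """Nodes that the `keep` rule of the given shard would retain."""
    return [node for node in nodes if hashmod([node], modulus) == shard]


nodes = [f"node-{i:02d}" for i in range(12)]  # hypothetical Consul node names
for shard in range(3):
    print(shard, targets_for_shard(nodes, shard))
```

Every node falls into exactly one shard, but the split is only statistically even, so small target sets can be noticeably unbalanced.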

3. Load circuit-breaking strategy

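# Startup flags that cap resource usage
prometheus \
  --storage.tsdb.max-block-chunk-segment-size=512MB \
  --query.max-concurrency=50 \
  --query.timeout=15m \
  --query.max-samples=50000000
# Note: a cap on the query time range (for example 721h) does not seem to be
# available as a Prometheus server flag; if needed, enforce it at a query
# front-end or proxy placed in front of Prometheus.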
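
Whether a query is likely to hit the `--query.max-samples` limit can be estimated before it is run. The sketch below uses a deliberately rough model (one range-vector selector loads about window / scrape interval samples per series); it is an approximation for planning, not how the query engine actually accounts for samples.

```python
import math

MAX_SAMPLES = 50_000_000  # value of --query.max-samples in the flags above


def estimate_selector_samples(series, window_seconds, scrape_interval_seconds):
    """Approximate samples loaded by one range-vector selector such as metric[5m]."""
    per_series = math.ceil(window_seconds / scrape_interval_seconds)
    return series * per_series


def fits_budget(series, window_seconds, scrape_interval_seconds, limit=MAX_SAMPLES):
    """True if the estimate stays within the sample limit."""
    return estimate_selector_samples(series, window_seconds, scrape_interval_seconds) <= limit


# One million series scraped every 30s: a 5m window fits, a 1h window does not.
print(fits_budget(1_000_000, 5 * 60, 30))
print(fits_budget(1_000_000, 60 * 60, 30))
```

At million-series scale this suggests keeping range windows short or narrowing the selector with label matchers before widening the limit.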

4. High-cardinality interception scheme

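# Label filtering at scrape time
relabel_configs: # applied to targets before the scrape
  - source_labels: [service]
    regex: (user_data|payment) # do not scrape sensitive services
    action: drop
metric_relabel_configs: # applied to samples after the scrape; __name__ is only available here
  - source_labels: [__name__]
    regex: '(go_threads|http_request_duration_seconds_bucket)'
    action: keep # allowlist: all other metrics are dropped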
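
The effect of the `drop` and `keep` rules above can be checked against sample label sets before deployment. This is a minimal sketch of the two actions only, assuming fully anchored regex matching as Prometheus applies it; Python's `re` stands in for RE2, which is equivalent for simple patterns like these.

```python
import re


def apply_rule(labels, source_labels, regex, action, separator=";"):
    """Return labels if the rule lets them through, otherwise None."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    matched = re.fullmatch(regex, value) is not None  # anchored at both ends
    if action == "drop":
        return None if matched else labels
    if action == "keep":
        return labels if matched else None
    raise ValueError(f"unsupported action: {action}")


# Target-level rule: sensitive services are not scraped at all.
target = {"service": "payment"}
print(apply_rule(target, ["service"], r"(user_data|payment)", "drop"))

# Sample-level rule: only allowlisted metric names survive.
allow = r"(go_threads|http_request_duration_seconds_bucket)"
print(apply_rule({"__name__": "go_threads"}, ["__name__"], allow, "keep"))
print(apply_rule({"__name__": "go_goroutines"}, ["__name__"], allow, "keep"))
```

Because matching is anchored, a name such as `go_threads_total` would not pass the allowlist unless the pattern is widened.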

III. Scheduling Optimization in Practice (Precise Scrape Control)

2. Priority scheduling configuration

scrape_configs:
  - job_name: 'critical_metrics'
    scrape_interval: 5s
    scrape_timeout: 4s
    http_sd_configs: [...]  # service discovery for high-priority targets

  - job_name: 'normal_metrics'
    scrape_interval: 30s
    scrape_timeout: 25s
    honor_labels: true  # on a label conflict, keep the label from the scraped data
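
Two properties of these jobs are worth checking before deployment: the timeout must not exceed the interval (Prometheus rejects such a configuration), and the shorter interval multiplies scrape load. A small sketch follows; the target counts are hypothetical and the duration parser only handles single-unit values such as `5s` or `2m`.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a single-unit duration like '5s', '2m' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)(s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def scrapes_per_second(job, target_count):
    """Validate timeout <= interval and return the job's steady-state scrape rate."""
    interval = parse_duration(job["scrape_interval"])
    timeout = parse_duration(job["scrape_timeout"])
    if timeout > interval:
        raise ValueError(f"{job['job_name']}: scrape_timeout exceeds scrape_interval")
    return target_count / interval


critical = {"job_name": "critical_metrics", "scrape_interval": "5s", "scrape_timeout": "4s"}
normal = {"job_name": "normal_metrics", "scrape_interval": "30s", "scrape_timeout": "25s"}
print(scrapes_per_second(critical, 100))  # hypothetical: 100 critical targets
print(scrapes_per_second(normal, 300))    # hypothetical: 300 normal targets
```

With these made-up counts the critical job issues 20 scrapes/s against 10 scrapes/s for the normal job, despite having a third of the targets, which is why the 5s interval is best reserved for a small set of targets.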