Scaling and Federating Prometheus

Original: Scaling and Federating Prometheus | Robust Perception
Brian Brazil, August 14, 2015

A single Prometheus server can easily handle millions of time series. That’s enough for a thousand servers with a thousand time series each scraped every 10 seconds. As your systems scale beyond that, Prometheus can scale too.

Initial Deployment

When starting out it’s best to keep things simple. A single Prometheus server per datacenter or similar failure domain (e.g. EC2 region) can typically handle a thousand servers, so should last you for a good while. Running one per datacenter avoids having the internet or WAN links on the critical path of your monitoring.

If you’ve more than one datacenter, you may wish to have global aggregates of some time series. This is done with a “global Prometheus” server, which federates from the datacenter Prometheus servers.

scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^job:.*"}'   # Request all job-level time series
    static_configs:
      - targets:
        - dc1-prometheus:9090
        - dc2-prometheus:9090
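
The job-level series requested by match[] above are produced on each datacenter Prometheus by recording rules. As a sketch (the metric names are illustrative and the rules are shown in the current YAML rule-file format, loaded via rule_files):

groups:
  - name: job_level_aggregation
    rules:
      # Aggregate per-instance request rates up to one series per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Count how many targets of each job are up.
      - record: job:up:count
        expr: count by (job) (up)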


It’s suggested to run two global Prometheis in different datacenters. This keeps your global monitoring working even if one datacenter has an outage.

Splitting By Use

As you grow you’ll come to a point where a single Prometheus isn’t quite enough. The next step is to run multiple Prometheus servers per datacenter. Each one will own monitoring for some team or slice of the stack. A first pass may result in frontend, backend and machines (node exporter), for example.

As you continue to grow, this process can be repeated. MySQL and Cassandra monitoring may end up with their own Prometheis, or each Cassandra cluster may have a Prometheus server dedicated to it.
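
As a rough sketch (the server names and exporter port here are illustrative, not from the article), a Prometheus dedicated to a single Cassandra cluster could be as simple as:

global:
  scrape_interval: 10s
  external_labels:
    prometheus: cassandra-cluster-1   # identifies this server if it is later federated
scrape_configs:
  - job_name: cassandra
    static_configs:
      - targets:
        - cassandra1:7070   # e.g. a JMX exporter endpoint
        - cassandra2:7070
        - cassandra3:7070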

You may also wish to start splitting by use before there are performance issues, either because teams don’t want to share Prometheis or to improve isolation.

Horizontal Sharding

When you can’t subdivide Prometheus servers any longer, the final step in scaling is to scale out. This is usually only needed once a single job has thousands of instances, a scale that most users never reach. It is a more complex setup and is much more involved to manage than a normal Prometheus deployment, so should be avoided for as long as you can.

The architecture is to have multiple slave Prometheis, each scraping a subset of the targets and aggregating them up within the slave. A master federates the aggregates produced by the slaves, and then the master aggregates them up to the job level.

On the slaves you can use a hash of the address to select only some targets to scrape:

global:
  external_labels:
    slave: 1  # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs
    relabel_configs:
    - source_labels: [__address__]
      modulus:       4    # 4 slaves
      target_label:  __tmp_hash
      action:        hashmod
    - source_labels: [__tmp_hash]
      regex:         ^1$  # This is the 2nd slave
      action:        keep
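
Each slave also needs recording rules that aggregate its subset of targets into slave-level series. A minimal sketch (illustrative metric name, current YAML rule-file format); the external slave label set above keeps each slave's aggregates distinct when the master federates them:

groups:
  - name: slave_aggregation
    rules:
      # Aggregate this slave's share of the job up to one series per job.
      - record: slave:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))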


And the master federates from the slaves:

scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^slave:.*"}'   # Request all slave-level time series
    static_configs:
      - targets:
        - slave0:9090
        - slave1:9090
        - slave2:9090
        - slave3:9090
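
On the master, recording rules then roll the federated slave-level series up to the job level. Again a sketch with an illustrative metric name:

groups:
  - name: job_aggregation
    rules:
      # Sum the per-slave aggregates into a single job-level series.
      - record: job:http_requests:rate5m
        expr: sum by (job) (slave:http_requests:rate5m)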


Information for dashboards is usually taken from the master. If you wanted to drill down to a particular target, you’d do so via its slave.
 
