Scaling and Federating Prometheus

Original: Scaling and Federating Prometheus | Robust Perception
Brian Brazil, August 14, 2015

A single Prometheus server can easily handle millions of time series. That’s enough for a thousand servers with a thousand time series each scraped every 10 seconds. As your systems scale beyond that, Prometheus can scale too.

Initial Deployment

When starting out it’s best to keep things simple. A single Prometheus server per datacenter or similar failure domain (e.g. EC2 region) can typically handle a thousand servers, so should last you for a good while. Running one per datacenter avoids having the internet or WAN links on the critical path of your monitoring.

If you’ve more than one datacenter, you may wish to have global aggregates of some time series. This is done with a “global Prometheus” server, which federates from the datacenter Prometheus servers.

scrape_configs:
  - job_name: dc_prometheus
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^job:.*"}'   # Request all job-level time series
    static_configs:
      - targets:
        - dc1-prometheus:9090
        - dc2-prometheus:9090
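
The job-level series requested by match[] above are produced on each datacenter Prometheus by recording rules. As a sketch (the metric names are illustrative and the rules are shown in the current YAML rule-file format, loaded via rule_files):

groups:
  - name: job_level_aggregation
    rules:
      # Aggregate per-instance request rates up to one series per job.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Count how many targets of each job are up.
      - record: job:up:count
        expr: count by (job) (up)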


It’s suggested to run two global Prometheis in different datacenters. This keeps your global monitoring working even if one datacenter has an outage.

Splitting By Use

As you grow you’ll come to a point where a single Prometheus isn’t quite enough. The next step is to run multiple Prometheus servers per datacenter. Each one will own monitoring for some team or slice of the stack. A first pass may result in frontend, backend and machines (node exporter), for example.

As you continue to grow, this process can be repeated. MySQL and Cassandra monitoring may end up with their own Prometheis, or each Cassandra cluster may have a Prometheus server dedicated to it.
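
As a rough sketch (the server names and exporter port here are illustrative, not from the article), a Prometheus dedicated to a single Cassandra cluster could be as simple as:

global:
  scrape_interval: 10s
  external_labels:
    prometheus: cassandra-cluster-1   # identifies this server if it is later federated
scrape_configs:
  - job_name: cassandra
    static_configs:
      - targets:
        - cassandra1:7070   # e.g. a JMX exporter endpoint
        - cassandra2:7070
        - cassandra3:7070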

You may also wish to start splitting by use before there are performance issues, either because teams don’t want to share Prometheis or to improve isolation.

Horizontal Sharding

When you can’t subdivide Prometheus servers any longer, the final step in scaling is to scale out. This is usually only needed once a single job has thousands of instances, a scale that most users never reach. It is a more complex setup and is much more involved to manage than a normal Prometheus deployment, so should be avoided for as long as you can.

The architecture is to have multiple slave Prometheis, each scraping a subset of the targets and aggregating them up within the slave. A master federates the aggregates produced by the slaves, and then the master aggregates them up to the job level.

On the slaves you can use a hash of the address to select only some targets to scrape:

global:
  external_labels:
    slave: 1  # This is the 2nd slave. This prevents clashes between slaves.
scrape_configs:
  - job_name: some_job
    # Add usual service discovery here, such as static_configs
    relabel_configs:
    - source_labels: [__address__]
      modulus:       4    # 4 slaves
      target_label:  __tmp_hash
      action:        hashmod
    - source_labels: [__tmp_hash]
      regex:         ^1$  # This is the 2nd slave
      action:        keep
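
Each slave also needs recording rules that aggregate its subset of targets into slave-level series. A minimal sketch (illustrative metric name, current YAML rule-file format); the external slave label set above keeps each slave's aggregates distinct when the master federates them:

groups:
  - name: slave_aggregation
    rules:
      # Aggregate this slave's share of the job up to one series per job.
      - record: slave:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))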


And the master federates from the slaves:

scrape_configs:
  - job_name: slaves
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"^slave:.*"}'   # Request all slave-level time series
    static_configs:
      - targets:
        - slave0:9090
        - slave1:9090
        - slave2:9090
        - slave3:9090
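
On the master, recording rules then roll the federated slave-level series up to the job level. Again a sketch with an illustrative metric name:

groups:
  - name: job_aggregation
    rules:
      # Sum the per-slave aggregates into a single job-level series.
      - record: job:http_requests:rate5m
        expr: sum by (job) (slave:http_requests:rate5m)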


Information for dashboards is usually taken from the master. If you wanted to drill down to a particular target, you’d do so via its slave.
 
