Getting Started with Prometheus

Latest recommended article published on 2023-06-17 09:47:46

佚名_c

Getting started

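This guide is a "Hello World"-style tutorial that shows how to install, configure, and use Prometheus in a simple example setup. You will download and run Prometheus locally, configure it to scrape itself and an example application, and then work with queries, rules, and graphs to use the collected time series data.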

Downloading and running Prometheus

Download the latest release of Prometheus for your platform, then extract it and change into the extracted directory:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

Before starting Prometheus, let's configure it.

Configuring Prometheus to monitor itself

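Prometheus collects metrics from monitored targets by scraping metrics HTTP endpoints on those targets. Since Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.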

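A Prometheus server that collects only data about itself is not very useful in practice, but it makes a good starting example. Save the following basic Prometheus configuration as a file named prometheus.yml: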

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

For a complete specification of the configuration options, see the configuration documentation.
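The comments in prometheus.yml describe a job-level scrape_interval overriding the global one. A minimal sketch of that lookup, with the configuration mirrored as a plain Python dict; the function name effective_scrape_interval is hypothetical, and the sketch assumes a global scrape_interval is always present, as it is in this guide:

```python
# Configuration mirrored as a plain dict; values copied from prometheus.yml.
config = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "scrape_interval": "5s",
            "static_configs": [{"targets": ["localhost:9090"]}],
        },
    ],
}


def effective_scrape_interval(config, job_name):
    """Return the job-level scrape_interval if set, else the global value.

    Hypothetical helper for illustration; assumes config["global"] defines
    scrape_interval.
    """
    for job in config["scrape_configs"]:
        if job["job_name"] == job_name:
            return job.get("scrape_interval", config["global"]["scrape_interval"])
    raise KeyError(job_name)


print(effective_scrape_interval(config, "prometheus"))
```

Removing the job-level scrape_interval key from the dict would make the same call fall back to the global 15s value.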

Starting Prometheus

To start Prometheus with the newly created configuration file, change to the directory containing the Prometheus binary and run:

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

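Prometheus should start up. You should also be able to browse to a status page about itself at localhost:9090. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint.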

You can also verify that Prometheus is serving metrics about itself by navigating to its metrics endpoint: localhost:9090/metrics
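The metrics endpoint returns plain text, one sample per line. The sketch below parses a small hand-written excerpt in that text format so the structure (metric name, labels, value) is easier to see. The sample values are made up for illustration, parse_metrics is a hypothetical helper, and the regular expressions are a simplification: they ignore escaped characters in label values and skip lines that carry a trailing timestamp.

```python
import re

# Hand-written excerpt in the text exposition format; values are illustrative.
SAMPLE = """\
# HELP prometheus_target_interval_length_seconds Actual intervals between scrapes.
# TYPE prometheus_target_interval_length_seconds summary
prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.0001
prometheus_target_interval_length_seconds_count{interval="5s"} 12
process_open_fds 42
"""

# metric_name, optional {labels}, then a single value token.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples; comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not handled by this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```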

Using the expression browser

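Let's look at some of the data Prometheus has collected about itself. To use Prometheus's built-in expression browser, navigate to http://localhost:9090/graph and choose the "Console" view within the "Graph" tab.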

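As you can gather from localhost:9090/metrics, one metric that Prometheus exports about itself is called prometheus_target_interval_length_seconds (the actual amount of time between target scrapes). Go ahead and enter it into the expression console: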

prometheus_target_interval_length_seconds

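This should return a number of different time series (along with the latest value recorded for each), all with the metric name prometheus_target_interval_length_seconds but with different labels. These labels designate different latency percentiles and target group intervals.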

If we are only interested in the 99th percentile latencies, we can use this query to retrieve that information:

prometheus_target_interval_length_seconds{quantile="0.99"}

To count the number of returned time series, you can write:

count(prometheus_target_interval_length_seconds)

For more about the expression language, see the expression language documentation.
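To make the three queries above concrete, here is a small in-memory model of what a label selector and count() do. This is only a sketch of the semantics, not how Prometheus is implemented: the series list, its label values, and the helper names select and count are assumptions made for illustration.

```python
# Each series is (labels, latest_value). The label sets and values are made up;
# "__name__" is the label under which Prometheus stores the metric name.
NAME = "prometheus_target_interval_length_seconds"
series = [
    ({"__name__": NAME, "interval": "5s", "quantile": "0.5"}, 4.9999),
    ({"__name__": NAME, "interval": "5s", "quantile": "0.99"}, 5.0003),
    ({"__name__": NAME, "interval": "15s", "quantile": "0.5"}, 14.9998),
    ({"__name__": NAME, "interval": "15s", "quantile": "0.99"}, 15.0004),
]


def select(series, name, **matchers):
    """Keep series whose name and labels equal the given values (like name{k="v"})."""
    wanted = {"__name__": name, **matchers}
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(key) == val for key, val in wanted.items())
    ]


def count(selected):
    """Number of series in the selection, analogous to count(...)."""
    return len(selected)


print(count(select(series, NAME)))                   # all series for the metric
print(count(select(series, NAME, quantile="0.99")))  # only the 0.99 quantile
```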

Using the graphing interface

To graph expressions, navigate to http://localhost:9090/graph and use the "Graph" tab.

For example, enter the following expression to graph the per-second rate of chunks being created in the self-scraped Prometheus:

rate(prometheus_tsdb_head_chunks_created_total[1m])

Experiment with the graph range parameters and other settings.
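The rate(...[1m]) expression turns an ever-increasing counter into a per-second rate over a time window. The following is a deliberately simplified sketch of that idea, assuming samples arrive as (timestamp_seconds, value) pairs: it sums the increases inside the window, treats any decrease as a counter reset, and divides by the time between the first and last sample. Prometheus's actual rate() additionally extrapolates toward the window boundaries, which this sketch omits; simple_rate is a hypothetical name.

```python
def simple_rate(samples, window_start, window_end):
    """Approximate per-second rate of a counter over [window_start, window_end].

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None

    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # Counter reset: the counter restarted from zero, so the whole
            # current value counts as increase.
            increase += cur

    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


# Illustrative samples with a counter reset between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples, 0, 60))
```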

Starting up some sample targets

Let's make this more interesting and start some example targets for Prometheus to scrape.

The Go client library includes an example that exports fictional RPC latencies for three services with different latency distributions.

Make sure the Go compiler is installed and a working Go build environment (with a correct GOPATH) is set up.

Download the Go client library for Prometheus and run three of these example processes:

# Fetch the client library code and compile example.
git clone https://github.com/prometheus/client_golang.git
cd client_golang/examples/random
go get -d
go build

# Start 3 example targets in separate terminals:
./random -listen-address=:8080
./random -listen-address=:8081
./random -listen-address=:8082

You should now have example targets listening on http://localhost:8080/metrics, http://localhost:8081/metrics, and http://localhost:8082/metrics.

Configuring Prometheus to monitor the sample targets

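Now we will configure Prometheus to scrape these new targets. Let's group all three endpoints into one job called example-random. However, imagine that the first two endpoints are production targets, while the third one represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job, adding extra labels to each group of targets. In this example, we add the group="production" label to the first group of targets and group="canary" to the second.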

To achieve this, add the following job definition to the scrape_configs section of your prometheus.yml and restart your Prometheus instance:

scrape_configs:
  - job_name:       'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

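Go to the expression browser and verify that Prometheus now has information about the time series that these example endpoints expose, such as the rpc_durations_seconds metric.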
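One way to picture what the example-random job definition produces is to expand it into one label set per target. The sketch below does that for a dict that mirrors the YAML above; expand_targets is a hypothetical helper, and it assumes the default behavior in which the job label comes from job_name and the instance label is the target address.

```python
# The example-random job, mirrored as a plain dict.
job = {
    "job_name": "example-random",
    "static_configs": [
        {
            "targets": ["localhost:8080", "localhost:8081"],
            "labels": {"group": "production"},
        },
        {
            "targets": ["localhost:8082"],
            "labels": {"group": "canary"},
        },
    ],
}


def expand_targets(job):
    """Return one label dict per target, combining job, instance, and group labels."""
    expanded = []
    for group in job["static_configs"]:
        for target in group["targets"]:
            labels = {"job": job["job_name"], "instance": target}
            labels.update(group.get("labels", {}))
            expanded.append(labels)
    return expanded


for labels in expand_targets(job):
    print(labels)
```

With labels attached this way, a query such as rpc_durations_seconds{group="canary"} would select only the series scraped from localhost:8082.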

Configuring rules for aggregating scraped data into new time series

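Though not a problem in our example, queries that aggregate over thousands of time series can get slow when computed ad hoc. To make this more efficient, Prometheus lets you prerecord expressions into completely new persisted time series via configured recording rules. Let's say we are interested in recording the per-second rate of example RPCs (rpc_durations_seconds_count), averaged over all instances (but preserving the job and service dimensions), as measured over a window of 5 minutes. We could write this as: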

avg(rate(rpc_durations_seconds_count[5m])) by (job, service)

Try graphing this expression.
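The avg(...) by (job, service) part of the expression groups the per-instance rates by the listed labels and averages each group, dropping every other label such as instance. A rough sketch of that grouping step follows; the input rates are made-up numbers, the service label values are assumed for illustration, and avg_by is a hypothetical helper.

```python
from collections import defaultdict

# Per-instance rates, as if already computed by rate(...[5m]). Values are made up.
rates = [
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8080"}, 10.0),
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8081"}, 20.0),
    ({"job": "example-random", "service": "uniform", "instance": "localhost:8082"}, 30.0),
    ({"job": "example-random", "service": "normal", "instance": "localhost:8080"}, 4.0),
    ({"job": "example-random", "service": "normal", "instance": "localhost:8081"}, 6.0),
]


def avg_by(samples, keys):
    """Average sample values per distinct combination of the given label keys."""
    groups = defaultdict(list)
    for labels, value in samples:
        group_key = tuple(labels.get(key, "") for key in keys)
        groups[group_key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


for group_key, average in avg_by(rates, ["job", "service"]).items():
    print(group_key, average)
```

The result has one value per (job, service) pair, which is the shape of the time series that a recording rule built on this expression would store.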

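To record the time series resulting from this expression into a new metric called job_service:rpc_durations_seconds_count:avg_rate5m, create a file with the following recording rule and save it as prometheus.rules.yml: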

groups:
- name: example
  rules:
  - record: job_service:rpc_durations_seconds_count:avg_rate5m
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)

To make Prometheus pick up this new rule, add a rule_files statement to your prometheus.yml. The configuration should now look like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

scrape_configs:
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

  - job_name:       'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

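Restart Prometheus with the new configuration and verify that a new time series with the metric name job_service:rpc_durations_seconds_count:avg_rate5m is now available by querying it through the expression browser or graphing it.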