Introduction to Prometheus (Kubernetes YAML for forwarding Prometheus alerts to a remote site)

晴空万里长风微凉 

Last modified: 2022-01-19 17:01:49

Categories: Monitor, Prometheus. Tags: operations

First published: 2021-12-06 10:12:18

Article link: https://blog.csdn.net/weixin_51910506/article/details/121741040

Related links: https://songjiayang.gitbooks.io/prometheus/content/configuration/global.html
https://www.cnblogs.com/yuezhimi/p/11056819.html
https://github.com/ernestas-poskus/ansible-prometheus

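Monitoring type	What to monitor
Hardware monitoring	1) Out-of-band management cards such as Dell iDRAC. 2) IPMI (the hardware management interface) for physical machines. 3) Network devices: temperature and hardware faults on routers and switches.
System monitoring	CPU, memory, and disk utilization, hardware I/O, NIC traffic, TCP states, process count
Application monitoring	Nginx, Tomcat, PHP, MySQL, Redis, and so on; every service the business depends on should be monitored
Log monitoring	System, service, access, and error logs; the open-source ELK stack is a ready-made solution and is covered in the next chapter
Security monitoring	1) Nginx + Lua can implement WAF functionality, store events in ES, and visualize the different attack types in Kibana. 2) User login counts, changes to the passwd file, and modifications to other critical files
API monitoring	Collect requests per API method (GET, POST, etc.) and analyze load, availability, correctness, and response time
Business monitoring	For an e-commerce site, for example: orders per minute, new registrations, active users, and campaign effectiveness (users and profit generated)
Traffic analysis	Derive user information from traffic, such as geographic location, visits to a given page, and time spent on a page. Monitor how each region reaches the service to improve user experience and revenue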

1. Installation

1.1. Binary installation

mkdir ~/Download
cd ~/Download

wget https://github.com/prometheus/prometheus/releases/download/v1.6.2/prometheus-1.6.2.linux-amd64.tar.gz
mkdir ~/Prometheus
cd ~/Prometheus
tar -xvzf ~/Download/prometheus-1.6.2.linux-amd64.tar.gz

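# The binary install is very convenient: it has no dependencies and ships with a built-in query web UI.
# In production, Prometheus can be added to the init configuration or run under supervisord so that it starts automatically as a service.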

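2. Time series

Prometheus time series come in four metric types: Counter, Gauge, Histogram, and Summary.

Counter (a monotonically increasing counter)

A Counter is a cumulative value that only ever increases, or resets to zero when the process restarts. It is typically used for totals such as the number of requests served or the number of errors.

Gauge (a value that can change arbitrarily)

A Gauge is an instantaneous value that does not depend on earlier samples and can go up or down freely. It is typically used for values such as memory usage or disk usage.

Histogram (samples observations into buckets and tracks their sum and count)

A Histogram consists of _bucket{le="<upper bound>"}, _bucket{le="+Inf"}, _sum, and _count series. It samples observations (usually request durations or response sizes), counts them in configurable cumulative buckets, and also records their total sum and count. The data is usually displayed as a histogram.

For example, prometheus_local_storage_series_chunks_persisted in the Prometheus server records the number of chunks persisted per series, and it can be used to compute quantiles of that distribution.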
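
To make the cumulative `le` buckets concrete, here is a minimal Python sketch of a histogram and of a bucket-based quantile estimate. The class name `Histogram`, its methods, and the bucket bounds are hypothetical and chosen for illustration; the interpolation is a simplified version of what a bucket-based quantile function does, not the Prometheus implementation.

```python
import math


class Histogram:
    """Toy cumulative histogram mirroring the _bucket/_sum/_count series."""

    def __init__(self, upper_bounds):
        # Bucket upper bounds (the `le` label); +Inf is always the last bucket.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.buckets = [0] * len(self.bounds)  # cumulative count per `le`
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: every bucket whose bound >= value grows.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.buckets[i] += 1

    def quantile(self, q):
        """Estimate the q-quantile by linear interpolation inside a bucket."""
        if self.count == 0:
            return math.nan
        rank = q * self.count
        for i, cumulative in enumerate(self.buckets):
            if cumulative >= rank:
                if math.isinf(self.bounds[i]):
                    # In the +Inf bucket: report the highest finite bound.
                    return self.bounds[i - 1]
                lower = self.bounds[i - 1] if i > 0 else 0.0
                below = self.buckets[i - 1] if i > 0 else 0
                in_bucket = cumulative - below
                if in_bucket == 0:
                    return lower
                width = self.bounds[i] - lower
                return lower + width * (rank - below) / in_bucket
        return math.nan


h = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.3, 0.7):
    h.observe(seconds)
print(h.buckets)  # [1, 3, 4, 4]
print(h.quantile(0.5))
```

Because only bucket counts are kept, the estimate is only as precise as the bucket layout allows.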

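Summary (similar to Histogram)

A Summary is similar to a Histogram and consists of {quantile="<φ>"}, _sum, and _count series. It also represents observations sampled over a period of time (usually request durations or response sizes), but it stores the quantiles directly instead of deriving them from buckets.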

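3. Queries

PromQL query results come in three main types:

Instant vector: a set of time series with a single sample each, for example: http_requests_total
Range vector: a set of time series with multiple samples each, for example: http_requests_total[5m]
Scalar: a single number with no time series attached, for example: scalar(count(http_requests_total)). Note that count(http_requests_total) on its own returns a one-element instant vector, not a scalar.

3.1. Query conditions

Query conditions (label matchers) support regular expressions, for example:

http_requests_total{code!="200"} # series whose code is not "200"
http_requests_total{code=~"2.."} # series whose code is 2xx
http_requests_total{code!~"2.."} # series whose code is not 2xx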
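
The four matcher operators can be sketched in a few lines of Python. The function name `matches` and the sample series are hypothetical. Two assumptions are worth stating: regex matchers are treated as fully anchored (hence `re.fullmatch`), and Python's `re` stands in for the RE2 syntax Prometheus uses, so exotic patterns may behave differently.

```python
import re


def matches(labels, name, op, value):
    """Return True if a series' labels satisfy one matcher (=, !=, =~, !~)."""
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    # Regex matchers are anchored at both ends, so use fullmatch, not search.
    hit = re.fullmatch(value, actual) is not None
    if op == "=~":
        return hit
    if op == "!~":
        return not hit
    raise ValueError(f"unknown matcher operator: {op}")


series = [{"code": "200"}, {"code": "204"}, {"code": "500"}]
print([s["code"] for s in series if matches(s, "code", "=~", "2..")])  # ['200', '204']
```

Anchoring is why `code=~"2.."` does not match a value such as "1200".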

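3.2. Operators

Arithmetic operators:
The supported arithmetic operators are +, -, *, /, %, and ^. For example, http_requests_total * 2 doubles every value of http_requests_total.
Comparison operators:
The supported comparison operators are ==, !=, >, <, >=, and <=. For example, http_requests_total > 100 keeps only the series of http_requests_total whose value is greater than 100.
Logical operators:
The supported logical operators are and, or, and unless. For example, http_requests_total == 5 or http_requests_total == 2 returns the series of http_requests_total whose value is 5 or 2.
Aggregation operators:
The supported aggregation operators are sum, min, max, avg, stddev, stdvar, count, count_values, bottomk, topk, and quantile. For example, max(http_requests_total) returns the largest value of http_requests_total.

Note that, as in ordinary arithmetic, Prometheus operators have precedence. They follow (^) > (*, /, %) > (+, -) > (==, !=, <=, <, >=, >) > (and, unless) > (or).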
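
The logical operators act on sets of series rather than on booleans. A rough Python sketch, assuming a vector is a list of (labels, value) pairs and that two series match when their label sets are identical; the helper names are hypothetical:

```python
def key(labels):
    """Hashable identity of a series: its sorted label pairs."""
    return tuple(sorted(labels.items()))


def filter_eq(vector, wanted):
    """Comparison used as a filter, like `metric == wanted`."""
    return [(labels, value) for labels, value in vector if value == wanted]


def vec_and(lhs, rhs):
    """Keep left-hand series that also exist on the right-hand side."""
    rhs_keys = {key(labels) for labels, _ in rhs}
    return [(labels, value) for labels, value in lhs if key(labels) in rhs_keys]


def vec_unless(lhs, rhs):
    """Keep left-hand series that do NOT exist on the right-hand side."""
    rhs_keys = {key(labels) for labels, _ in rhs}
    return [(labels, value) for labels, value in lhs if key(labels) not in rhs_keys]


def vec_or(lhs, rhs):
    """All left-hand series, plus right-hand series not already present."""
    lhs_keys = {key(labels) for labels, _ in lhs}
    return list(lhs) + [(l, v) for l, v in rhs if key(l) not in lhs_keys]


requests = [({"code": "200"}, 5), ({"code": "404"}, 2), ({"code": "500"}, 7)]
# Equivalent in spirit to: http_requests_total == 5 or http_requests_total == 2
print(vec_or(filter_eq(requests, 5), filter_eq(requests, 2)))
```

The sketch ignores `on`/`ignoring` modifiers and the metric name, which real vector matching takes into account.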

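3.3. Built-in functions

https://prometheus.io/docs/prometheus/latest/querying/functions/

Prometheus has many built-in functions for querying and formatting data.
For example, floor and ceil convert a floating-point result to an integer by rounding down or up:
floor(avg(http_requests_total{code="200"}))
ceil(avg(http_requests_total{code="200"}))
To see the per-second average rate of increase of http_requests_total over the last 5 minutes:
rate(http_requests_total[5m])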
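
What a per-second rate over a counter means can be sketched as follows. This is a simplified model, not the real `rate()`: it handles counter resets but does no extrapolation to the window boundaries. The function name `simple_rate` and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (process restart): count up from zero again.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# One sample per minute; the counter resets between t=120 and t=180.
window = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(simple_rate(window))
```

Treating a drop as a reset is what makes a counter usable across restarts, and it is also why a metric that can legitimately decrease should be a Gauge instead.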

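4. Configuration

4.1. Global configuration

global holds the global defaults. It mainly contains four attributes:

scrape_interval: the default interval between scrapes of targets.

scrape_timeout: the timeout for scraping a single target.

evaluation_interval: the interval at which rules are evaluated.

external_labels: extra labels attached to time series and alerts when communicating with external systems (federation, remote storage, Alertmanager).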

global:
  scrape_interval:     15s # Scrape targets every 15 seconds (the default is 1m).
  evaluation_interval: 15s # Evaluate rules every 15 seconds (the default is 1m).
  scrape_timeout: 10s # Per-scrape timeout; 10s is also the global default.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'
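
The global values act as defaults that an individual scrape job may override. A small Python sketch of that fallback, assuming plain dicts for the parsed config; `parse_duration` and `effective_scrape_settings` are hypothetical helpers, and the parser only handles single-unit durations such as "15s" or "1m":

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Parse a single-unit duration such as "15s" or "1m" into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d|w|y)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def effective_scrape_settings(global_cfg, job_cfg):
    """Resolve a job's interval and timeout, falling back to global values."""
    interval = job_cfg.get("scrape_interval",
                           global_cfg.get("scrape_interval", "1m"))
    timeout = job_cfg.get("scrape_timeout",
                          global_cfg.get("scrape_timeout", "10s"))
    # A timeout longer than the interval is rejected as a configuration error.
    if parse_duration(timeout) > parse_duration(interval):
        raise ValueError("scrape_timeout must not exceed scrape_interval")
    return interval, timeout


global_cfg = {"scrape_interval": "15s", "scrape_timeout": "10s"}
print(effective_scrape_settings(global_cfg, {}))  # ('15s', '10s')
```

A job that shortens scrape_interval below the inherited timeout has to lower scrape_timeout as well.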

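4.2. Alerting configuration

1. Alertmanager can be configured through the -alertmanager.xxx command-line flags, but this is inflexible: the settings cannot be updated and reloaded dynamically, and alert attributes cannot be defined dynamically.
2. The alerting section addresses this problem and makes Alertmanager easier to manage. It mainly contains two parameters:

alert_relabel_configs: relabeling rules that dynamically modify alert labels.

alertmanagers: configuration for dynamically discovering Alertmanager instances.

# File structure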
# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

4.2.1. Basic alerting

alerting:
  alertmanagers:
  - scheme: http
    timeout: 10s
    static_configs:
    - targets:
      - "192.168.99.5:9093"

4.2.2. Forwarding through a Service

alerting:
  alertmanagers:
    - scheme: http
      path_prefix: /
      timeout: 10s
      api_version: v2
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        separator: ;
        regex: alertmanager
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        separator: ;
        regex: web
        replacement: $1
        action: keep
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - paf
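
The two keep rules above narrow the discovered endpoints down to the Alertmanager ones. A minimal Python sketch of how a chain of `keep` rules filters targets, assuming targets are dicts of discovery labels; the function name `apply_keep_rules` and the sample addresses are hypothetical, and Python's `re` stands in for RE2:

```python
import re


def apply_keep_rules(targets, rules):
    """Return only the targets that pass every `keep` rule."""
    kept = []
    for labels in targets:
        for rule in rules:
            # Source label values are joined with the separator (default ";").
            joined = rule.get("separator", ";").join(
                labels.get(name, "") for name in rule["source_labels"])
            # The regex is anchored; a non-matching target is dropped.
            if re.fullmatch(rule["regex"], joined) is None:
                break
        else:
            kept.append(labels)
    return kept


rules = [
    {"source_labels": ["__meta_kubernetes_service_name"], "regex": "alertmanager"},
    {"source_labels": ["__meta_kubernetes_endpoint_port_name"], "regex": "web"},
]
targets = [
    {"__address__": "10.0.0.5:9093",
     "__meta_kubernetes_service_name": "alertmanager",
     "__meta_kubernetes_endpoint_port_name": "web"},
    {"__address__": "10.0.0.5:9094",
     "__meta_kubernetes_service_name": "alertmanager",
     "__meta_kubernetes_endpoint_port_name": "mesh"},
    {"__address__": "10.0.0.9:3000",
     "__meta_kubernetes_service_name": "grafana",
     "__meta_kubernetes_endpoint_port_name": "web"},
]
print([t["__address__"] for t in apply_keep_rules(targets, rules)])  # ['10.0.0.5:9093']
```

Only the endpoint that belongs to the alertmanager Service and exposes the port named web survives both rules.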

4.2.3. Dual-stream forwarding

alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - follow_redirects: true
    scheme: http
    path_prefix: /
    timeout: 10s
    api_version: v2
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
    kubernetes_sd_configs:
    - role: endpoints
      kubeconfig_file: ""
      follow_redirects: true
      namespaces:
        names:
        - paf
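
The labeldrop rule in alert_relabel_configs removes the prometheus_replica label from outgoing alerts. One plausible reading (an assumption, since the post does not say so) is that two Prometheus replicas send the same alerts and the label is dropped so that Alertmanager sees them as identical. A Python sketch with a hypothetical `labeldrop` helper and made-up alert labels:

```python
import re


def labeldrop(labels, regex):
    """Remove every label whose NAME fully matches the regex."""
    return {name: value for name, value in labels.items()
            if re.fullmatch(regex, name) is None}


alert_from_replica_0 = {"alertname": "HighLoad", "instance": "node-1",
                        "prometheus_replica": "prometheus-0"}
alert_from_replica_1 = {"alertname": "HighLoad", "instance": "node-1",
                        "prometheus_replica": "prometheus-1"}

a = labeldrop(alert_from_replica_0, "prometheus_replica")
b = labeldrop(alert_from_replica_1, "prometheus_replica")
print(a == b)  # True
```

Unlike keep and drop, labeldrop matches label names rather than values and needs no source_labels.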

4.2.4. Local forwarding

alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - role: pod
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
      regex: prometheus
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: prometheus
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: alertmanager
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex:
      action: drop

4.2.5. Forwarding over the public network

Client certificate and key files:

ng-client.pem

ng-client-key.pem

global:
  # external_labels is a field of the global section, not of alerting.
  external_labels:
    area: "gc"
    area_name: "工厂" # "factory"
    team: "ceph"
    namespace: paf
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['am.qz.cloudyc.cn']
    scheme: https
    tls_config:
      cert_file: /opt/prometheus/certs/ng-client.pem
      key_file: /opt/prometheus/certs/ng-client-key.pem
      insecure_skip_verify: true

4.2.6. K8s port forwarding

vi qz-am-service.yaml
kubectl apply -f qz-am-service.yaml -n prometheus

Contents of qz-am-service.yaml:

---
apiVersion: v1
kind: Service
metadata:
  labels:
    alertmanager: main
  name: alertmanager
spec:
  ports:
    - name: web
      port: 9093
      protocol: TCP
      targetPort: 9093
  type: ClusterIP

---
apiVersion: v1
kind: Endpoints
metadata:
  name: alertmanager
subsets:
- addresses:
  - ip: 192.168.0.6
  ports:
  - name: web
    port: 9093
    protocol: TCP

4.3. Rule configuration

rule_files:
  - "rules/node.rules"
  - "rules2/*.rules"

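4.4. Scrape targets

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

scrape_configs defines the targets to scrape. Each scrape configuration mainly contains the following parameters:

job_name: the job name

honor_labels: resolves conflicts between labels in the scraped data and labels attached by the server. When set to true, the scraped labels win; otherwise the server-side labels win and the conflicting scraped labels are renamed with an exported_ prefix

params: request parameters sent with each scrape

scrape_interval: the scrape interval

scrape_timeout: the scrape timeout

metrics_path: the path on the target from which metrics are fetched

scheme: the protocol used for scraping

sample_limit: the maximum number of samples accepted per scrape. If a scrape exceeds the limit, the whole scrape is treated as failed and nothing is stored. The default is 0, meaning no limit

relabel_configs: relabeling rules applied to targets before they are scraped

metric_relabel_configs: relabeling rules applied to scraped metrics before they are stored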
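
The effect of honor_labels on a label conflict can be sketched in Python. `merge_labels` is a hypothetical helper; `scraped` holds the labels exposed by the target and `target` holds the labels attached on the server side (such as job and instance):

```python
def merge_labels(scraped, target, honor_labels):
    """Combine scraped labels with server-side target labels."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            merged[name] = value  # no conflict: simply attach the label
        elif not honor_labels:
            # Conflict and the server wins: keep the scraped value under
            # an exported_ prefix and overwrite with the server-side value.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # Conflict with honor_labels=true: the scraped value is kept as-is.
    return merged


scraped = {"job": "batch-import", "code": "200"}
target = {"job": "pushgateway", "instance": "10.0.0.7:9091"}
print(merge_labels(scraped, target, honor_labels=False))
print(merge_labels(scraped, target, honor_labels=True))
```

Setting honor_labels to true is common for sources such as federation or a Pushgateway, where the scraped job and instance labels describe the real origin of the data.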

4.4.1. Static scraping

Targets are hard-coded in the configuration file.

4.4.2. Dynamic service discovery

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config

static_configs: static targets
dns_sd_configs: DNS service discovery
file_sd_configs: file-based service discovery

consul_sd_configs: Consul service discovery
https://blog.51cto.com/u_1000682/2363038

serverset_sd_configs: Serverset service discovery
nerve_sd_configs: Nerve service discovery
marathon_sd_configs: Marathon service discovery
kubernetes_sd_configs: Kubernetes service discovery
gce_sd_configs: GCE service discovery
ec2_sd_configs: EC2 service discovery
openstack_sd_configs: OpenStack service discovery
azure_sd_configs: Azure service discovery
triton_sd_configs: Triton service discovery
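
Of the mechanisms listed, file-based discovery is the easiest to drive from a script: a file_sd_configs entry points at JSON or YAML files that contain a list of target groups. A Python sketch that writes such a file, assuming a hypothetical helper `write_file_sd` and made-up addresses and labels:

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Write target groups atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(temp_path, path)  # atomic rename on the same filesystem


# Each group is a list of "host:port" targets plus optional shared labels.
groups = [
    {"targets": ["10.0.0.1:9100", "10.0.0.2:9100"], "labels": {"env": "prod"}},
    {"targets": ["10.0.1.1:9100"], "labels": {"env": "staging"}},
]

with tempfile.TemporaryDirectory() as workdir:
    target_file = os.path.join(workdir, "nodes.json")
    write_file_sd(target_file, groups)
    with open(target_file) as handle:
        print(handle.read())
```

Writing to a temporary file and renaming it avoids Prometheus picking up a half-written file while it watches the directory.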

4.5. Storage

https://prometheus.io/docs/prometheus/latest/storage/
--storage.tsdb.path: where Prometheus writes its database. Defaults to data/.

--storage.tsdb.retention.time: how long to keep data before deleting it. Defaults to 15d.

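5. Common commands

# Start Prometheus
./prometheus --config.file=prometheus.yml

# Check the configuration file
./promtool check config prometheus.yml

# Reload the configuration file (two ways)
kill -HUP PID
# The HTTP endpoint requires --web.enable-lifecycle
curl -X POST http://IP/-/reload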


# Example startup flags
--storage.tsdb.retention.time=15d --config.file=/etc/config/prometheus.yml --storage.tsdb.path=/data --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles --web.enable-lifecycle
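
The HTTP reload can also be triggered from a script, for example after a deployment tool rewrites prometheus.yml. A minimal Python sketch using only the standard library; the helper names and the base URL in the usage note are hypothetical, and the server is assumed to run with --web.enable-lifecycle:

```python
import urllib.request


def build_reload_request(base_url):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url, timeout=5.0):
    """Ask Prometheus to reload its configuration; True on HTTP 200.

    urlopen raises urllib.error.HTTPError for error responses, for
    instance when the lifecycle API is disabled.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

A call such as reload_config("http://127.0.0.1:9090") would then be the scripted equivalent of the curl command. Running promtool check config first is a sensible guard, since a reload with an invalid file is rejected.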