Custom Resources

The Prometheus Operator watches the apiserver and picks up the creation or update of CRD resources (such as a ServiceMonitor), converts them into a standard Prometheus configuration file for the Prometheus server, and promptly applies the new configuration to the running Prometheus Pods.

How the CRDs and the Operator relate to each other:

When configuring Prometheus through CRDs, "matching" is a crucial detail. The full matching relationships are shown in the figure; a failed match at any point will leave the generated standard Prometheus configuration unable to discover its targets.


The Prometheus Operator defines the following four kinds of custom resources:

  • Prometheus
  • ServiceMonitor
  • Alertmanager
  • PrometheusRule
Prometheus

The Prometheus custom resource (CRD) declares the desired state of a Prometheus deployment running in the Kubernetes cluster. It provides options to configure the number of replicas, persistent storage, and the Alertmanagers that the Prometheus instances send alerts to.

For each Prometheus resource, the Operator deploys a properly configured StatefulSet in the same namespace. The Prometheus Pods mount a Secret named prometheus-&lt;name&gt; that holds the Prometheus configuration. The Operator generates this configuration from the selected ServiceMonitors and keeps the Secret up to date; any change to a ServiceMonitor or to the Prometheus resource is continuously reconciled through the same steps.
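To see the result of this reconciliation, you can dump the rendered configuration from that Secret. A minimal sketch, assuming a kube-prometheus installation where the Prometheus object is named k8s in the monitoring namespace (recent Operator versions store the config gzip-compressed under the prometheus.yaml.gz key):

# dump the first 40 lines of the config the Operator generated
kubectl -n monitoring get secret prometheus-k8s \
  -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip | head -n 40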


An example configuration:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata: # omitted
spec:
  alerting:
    alertmanagers:
    - name: prometheus-prometheus-oper-alertmanager # the Alertmanager cluster this Prometheus sends alerts to, in the default namespace
      namespace: default
      pathPrefix: /
      port: web
  baseImage: quay.io/prometheus/prometheus
  replicas: 2 # run two replicas of this Prometheus "cluster"
  ruleSelector: # use PrometheusRules labeled prometheus=k8s and role=alert-rules
    matchLabels:
      prometheus: k8s
      role: alert-rules
  serviceMonitorNamespaceSelector: {} # the namespaces in which this Prometheus looks for ServiceMonitors; {} means all namespaces
  serviceMonitorSelector: # use ServiceMonitors labeled k8s-app=node-exporter; an empty selector ({}) selects them all
    matchLabels:
      k8s-app: node-exporter
  version: v2.10.0
ServiceMonitor

[Figure: ServiceMonitor configuration]

The ServiceMonitor custom resource (CRD) declares how a dynamically changing set of services should be monitored. It selects the services to monitor by label. This lets an organization introduce conventions for how metrics are exposed: any new service that follows the conventions is discovered and monitored automatically, without reconfiguring the system.

For the Prometheus Operator to monitor an application in the Kubernetes cluster, an Endpoints object must exist. An Endpoints object is essentially a list of IP addresses, and is usually populated by a Service: the Service discovers Pods through its label selector and adds them to the Endpoints object.

The Prometheus Operator introduces the ServiceMonitor object, which discovers such Endpoints objects and configures Prometheus to monitor the corresponding Pods.

The endpoints section of the ServiceMonitorSpec configures which ports of those Endpoints to collect metrics from, and with what parameters.


Note: endpoints (lowercase) is a field of the ServiceMonitor CRD, while Endpoints (capitalized) is a Kubernetes resource type.

ServiceMonitors, and the targets they discover, may live in any namespace:

  • Use serviceMonitorNamespaceSelector in the PrometheusSpec to restrict, per Prometheus server, the namespaces from which ServiceMonitors are selected.
  • Use namespaceSelector in the ServiceMonitorSpec to choose the namespaces in which Endpoints objects are discovered. To discover targets in all namespaces, the namespaceSelector has to be empty:
spec:
  namespaceSelector:
    any: true

An example configuration:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: node-exporter # this ServiceMonitor carries the k8s-app=node-exporter label, so the Prometheus above selects it
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s           # scrape these Endpoints every 15 seconds
    port: prometheus        # must be the port *name* from the Service, not the port number
  namespaceSelector:
    matchNames:
    - ingress-nginx         # the namespace in which to discover targets
  selector:
    matchLabels:            # labels the target Service must carry
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: ingress-nginx
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/part-of: ingress-nginx
      app.kubernetes.io/version: 1.7.0
Alertmanager

[Figure: Alertmanager configuration]

The Alertmanager custom resource (CRD) declares the desired state of an Alertmanager deployment running in the Kubernetes cluster. It also provides options to configure replicas and persistent storage.

For each Alertmanager resource, the Operator deploys a properly configured StatefulSet in the same namespace. The Alertmanager Pods mount a Secret named alertmanager-&lt;name&gt;, whose alertmanager.yaml key is used as the configuration file.
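A minimal sketch of providing that Secret by hand, assuming an Alertmanager object named example in the monitoring namespace and a local alertmanager.yaml file:

kubectl -n monitoring create secret generic alertmanager-example \
  --from-file=alertmanager.yaml=alertmanager.yaml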

When two or more replicas are configured, the Operator runs the Alertmanager instances in high-availability mode.

An example configuration:

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager # an Alertmanager object
metadata:
  name: prometheus-prometheus-oper-alertmanager
spec:
  baseImage: quay.io/prometheus/alertmanager
  replicas: 3      # run a 3-node Alertmanager cluster
  version: v0.17.0
PrometheusRule

[Figure: PrometheusRule configuration]

The PrometheusRule CRD declares Prometheus rules to be consumed by one or more Prometheus instances.

Alerting and recording rules are saved and applied as YAML files and can be reloaded dynamically, without restarting Prometheus.

An example configuration:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels: # labels on this PrometheusRule; they match the ruleSelector above, so the Prometheus selects it
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
spec:
  groups:
  - name: k8s.rules
    rules: # a group containing a single alerting rule that fires when the kubelet disappears
    - alert: KubeletDown
      annotations:
        message: Kubelet has disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="kubelet"} == 1)
      for: 15m
      labels:
        severity: critical
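The expression absent(up{job="kubelet"} == 1) only returns a value when no series up{job="kubelet"} equals 1, i.e. when every kubelet target is gone or down. Once the object is applied, a quick way to confirm the Operator picked it up (names follow the kube-prometheus defaults):

kubectl -n monitoring get prometheusrule prometheus-k8s-rules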
Summary: how the configurations match up

ServiceMonitor caveats

  • The ServiceMonitor's labels must match the serviceMonitorSelector defined in the Prometheus resource.
  • The port in a ServiceMonitor's endpoints refers to the port name in the Kubernetes Service, not the port number.
  • The ServiceMonitor's selector.matchLabels must match the labels on the Kubernetes Service.
  • The ServiceMonitor is created in Prometheus's namespace and uses namespaceSelector to match the namespace of the Service to be monitored (see the sketch below).
  • If a ServiceMonitor matches more than one Service, the data will be scraped in duplicate.
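A minimal end-to-end sketch of these matching rules, using a hypothetical example-app; note how the Service labels, the port name, and the ServiceMonitor selectors line up:

apiVersion: v1
kind: Service
metadata:
  name: example-app            # hypothetical application
  namespace: default
  labels:
    app: example-app           # <- matched by selector.matchLabels below
spec:
  selector:
    app: example-app
  ports:
  - name: web                  # <- referenced by name, not by number
    port: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring        # created next to Prometheus
  labels:
    k8s-app: node-exporter     # must match the Prometheus serviceMonitorSelector from the example above
spec:
  namespaceSelector:
    matchNames:
    - default                  # the namespace the Service lives in
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: web
    interval: 15s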

Scraping a custom target -- Ingress-nginx (Helm)

Expose the ingress metrics port

For an ingress-nginx deployed and maintained via Helm:

vim values.yaml

  metrics:
    port: 10254
    portName: metrics
    # if this port is changed, change healthz-port: in extraArgs: accordingly
    enabled: true
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
...
    prometheusRule:
      enabled: true
      additionalLabels: {}
      namespace: "monitoring"

# see the figures below

[Figures: screenshots of the values.yaml settings above]

By default, nginx-ingress exposes its metrics on port 10254 at the /metrics path. Adjust the ingress-nginx configuration to open port 10254 on both the Service and the Pod.

Apply the update:

helm upgrade ingress-nginx ./ingress-nginx -f ./ingress-nginx/values.yaml  -n ingress-nginx

Check and verify the metrics:

kubectl get ep -n ingress-nginx

NAME                               ENDPOINTS                                   AGE
ingress-nginx-controller-metrics   192.10.192.192:10254,192.10.192.43:10254   3m14s

curl 192.10.192.192:10254/metrics
Manually add a ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: metrics
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: ingress-nginx
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/part-of: ingress-nginx
      app.kubernetes.io/version: 1.7.0
---
# create a Role in the monitored namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
# bind the Role to the prometheus-k8s ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s # the ServiceAccount the Prometheus Pods run as; kube-prometheus uses prometheus-k8s by default
  namespace: monitoring
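Apply the manifest (the filename is arbitrary):

kubectl apply -f ingress-nginx-servicemonitor.yaml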

Verify:

kubectl get servicemonitor -n monitoring
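If Prometheus is not exposed outside the cluster, a port-forward is enough for a quick look at the new target (assuming the kube-prometheus Service name prometheus-k8s):

kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
# then open http://127.0.0.1:9090/targets and look for the ingress-nginx job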
Add alerting rules
prometheusRule:
  enabled: true
  additionalLabels: {}
  namespace: "monitoring"
  rules:
  - alert: NginxFailedtoLoadConfiguration
    expr: nginx_ingress_controller_config_last_reload_successful == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Nginx Ingress Controller failed to load its configuration"
      description: "The Nginx Ingress Controller failed to reload its configuration; check it for errors."
  - alert: NginxHighHttp4xxErrorRate
    expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Nginx high HTTP 4xx error rate ( namespace {{ $labels.exported_namespace }} host {{ $labels.host }} )
      summary: "Too many HTTP requests with status 4xx (> 1%)"
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Nginx high HTTP 5xx error rate ( namespace {{ $labels.exported_namespace }} host {{ $labels.host }} )
      summary: "Too many HTTP requests with status 5xx (> 1%)"
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (le, host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      description: Nginx latency high ( host {{ $labels.host }} node {{ $labels.node }} )
      summary: "Nginx p99 latency is higher than 3 seconds"
  - alert: NginxHighRequestRate
    expr: rate(nginx_ingress_controller_nginx_process_requests_total[5m]) > 1000
    for: 1m
    labels:
      severity: warning
    annotations:
      description: Nginx ingress controller high request rate ( instance {{ $labels.instance }} namespace {{ $labels.namespace }} pod {{ $labels.pod }} )
      summary: "Nginx ingress controller high request rate (> 1000 requests per second)"
  - alert: SSLCertificateExpiration15day
    expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 1296000
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) is about to expire
      description: The SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) will expire in less than 15 days.
  - alert: SSLCertificateExpiration7day
    expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 604800
    for: 30m
    labels:
      severity: critical
    annotations:
      summary: SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) is about to expire
      description: The SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) will expire in less than 7 days.
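After the helm upgrade, confirm that the chart actually rendered a PrometheusRule object (its exact name depends on the release name):

kubectl -n monitoring get prometheusrule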

Verify in the Prometheus web UI that the rules and targets are OK.

Scraping a custom target -- Ingress-nginx deployed from manifests

Modify the Ingress Service
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
..
spec:
  ports:
  - name: prometheus
    port: 10254
    targetPort: 10254
..

The final Service after the change:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.3.1
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    port: 443
    protocol: TCP
    targetPort: https
  - name: prometheus
    port: 10254
    targetPort: 10254
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  type: ClusterIP
Modify the Ingress Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
..
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
    spec:
      containers:
      - name: controller
        ports:
        - name: prometheus
          containerPort: 10254
..

## re-apply the YAML so the changes take effect
kubectl apply -f ingress-deploy.yml
Test and verify
curl 127.0.0.1:10254/metrics      # run this on an Ingress node to check that metrics are served
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.4853e-05
go_gc_duration_seconds{quantile="0.25"} 5.7798e-05
go_gc_duration_seconds{quantile="0.5"} 7.5043e-05
go_gc_duration_seconds{quantile="0.75"} 9.8753e-05
go_gc_duration_seconds{quantile="1"} 0.001074475
go_gc_duration_seconds_sum 0.010298983
go_gc_duration_seconds_count 104
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 100
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.18.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.2434328e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 8.21745944e+08
Add an Ingress ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: prometheus
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: ingress-nginx
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/part-of: ingress-nginx
      app.kubernetes.io/version: 1.3.1 # must match the labels on the Service shown above
---
# create a Role in the monitored namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
# bind the Role to the prometheus-k8s ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s # the ServiceAccount the Prometheus Pods run as; kube-prometheus uses prometheus-k8s by default
  namespace: monitoring
Add alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: nginx-ingress-rules
  namespace: monitoring
spec:
  groups:
  - name: nginx-ingress-rules
    rules:
    - alert: NginxFailedtoLoadConfiguration
      expr: nginx_ingress_controller_config_last_reload_successful == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Nginx Ingress Controller failed to load its configuration"
        description: "The Nginx Ingress Controller failed to reload its configuration; check it for errors."
    - alert: NginxHighHttp4xxErrorRate
      expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx high HTTP 4xx error rate ( namespace {{ $labels.exported_namespace }} host {{ $labels.host }} )
        summary: "Too many HTTP requests with status 4xx (> 1%)"
    - alert: NginxHighHttp5xxErrorRate
      expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx high HTTP 5xx error rate ( namespace {{ $labels.exported_namespace }} host {{ $labels.host }} )
        summary: "Too many HTTP requests with status 5xx (> 1%)"
    - alert: NginxLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (le, host, node)) > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        description: Nginx latency high ( host {{ $labels.host }} node {{ $labels.node }} )
        summary: "Nginx p99 latency is higher than 3 seconds"
    - alert: NginxHighRequestRate
      expr: rate(nginx_ingress_controller_nginx_process_requests_total[5m]) > 1000
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx ingress controller high request rate ( instance {{ $labels.instance }} namespace {{ $labels.namespace }} pod {{ $labels.pod }} )
        summary: "Nginx ingress controller high request rate (> 1000 requests per second)"
    - alert: SSLCertificateExpiration15day
      expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 1296000
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) is about to expire
        description: The SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) will expire in less than 15 days.
    - alert: SSLCertificateExpiration7day
      expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 604800
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) is about to expire
        description: The SSL/TLS certificate for {{ $labels.host }} ({{ $labels.secret_name }}) will expire in less than 7 days.
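Apply the rule and confirm it was created (the filename is arbitrary):

kubectl apply -f nginx-ingress-rules.yaml
kubectl -n monitoring get prometheusrule nginx-ingress-rules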

Import Grafana dashboards

Ingress-nginx dashboard IDs: 9614 and 14314

Summary

  • Kube-prometheus is an open-source project for running Prometheus on Kubernetes;
  • It uses Kubernetes Custom Resource Definitions (CRDs) to define and manage Prometheus instances;
  • In the Prometheus Operator, Prometheus instances, ServiceMonitors, Alertmanagers, and PrometheusRules are all custom resources;
  • The Prometheus Operator automatically manages and updates the configuration of Prometheus instances;
  • Prometheus instances are managed through Kubernetes CRDs, without hand-maintaining Prometheus configuration files.


Full-size diagram download: https://xmars-devops.oss-cn-shanghai.aliyuncs.com/AliCloud/shili-1514f254b1a9.jpeg
