Prometheus: Introduction, Deploying Prometheus on K8s, Prometheus Exporters, Troubleshooting Exporter Issues, and Querying Metrics with PromQL (2024-07-29)

Chauncey_Qin

Tags: prometheus kubernetes containers

Article link: https://blog.csdn.net/Chauncey_Qin/article/details/140776670

I. Introduction to Prometheus

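Prometheus is a monitoring system originally built at SoundCloud. Since its inception in 2012 it has been developed as a community open source project and has a very active developer and user community. To emphasize that it is open source and independently maintained, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after Kubernetes. Prometheus is built on a time series database and is well suited to monitoring Kubernetes clusters.

The basic principle of Prometheus is to periodically scrape the state of monitored components over HTTP. Any component can be monitored as long as it exposes a suitable HTTP endpoint; no SDK or other integration work is required on the Prometheus side. This makes it a good fit for monitoring virtualized environments such as VMs, Docker and Kubernetes. The HTTP endpoint that exposes information about a monitored component is called an exporter. Most components commonly used by internet companies already have a ready-made exporter, for example Varnish, HAProxy, Nginx, MySQL and Linux system information (disk, memory, CPU, network and so on).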
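To make the pull model concrete, here is a minimal sketch of an exporter written with only the Python standard library. The metric name demo_load1, port 8000 and all helper names are assumptions made for this illustration and do not belong to any existing exporter; real exporters normally use an official client library instead of formatting the text by hand.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {metric_name: (help_text, type, value)} in the Prometheus text format."""
    lines = []
    for name, (help_text, metric_type, value) in sorted(samples.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Collect one hypothetical gauge: the 1-minute load average of this host."""
    load1 = os.getloadavg()[0]  # available on Unix-like systems only
    return {"demo_load1": ("1-minute load average (demo metric).", "gauge", load1)}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted (assumed port; pick any free one)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; Prometheus (or curl http://127.0.0.1:8000/metrics) can then fetch the current value on every scrape. The exporter keeps no history: it only reports the present state and leaves storage to the Prometheus server.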

Official website: https://prometheus.io
Project hosting: https://github.com/prometheus

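Prometheus features
1) Multi-dimensional data model: time series identified by a metric name and key/value pairs
2) PromQL: a flexible query language that uses this dimensionality to express complex queries
3) No dependency on distributed storage; a single server node works on its own
4) Time series are collected with a pull model over HTTP
5) Pushing time series is supported through the Pushgateway component
6) Targets are discovered through service discovery or static configuration
7) Multiple graphing modes and dashboard support (Grafana)
8) Suitable for both machine-centric monitoring and monitoring of highly dynamic service-oriented architectures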

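Prometheus architecture
Prometheus consists of several components, many of which are optional:
1) Prometheus Server: collects metrics, stores the time series data and provides a query interface.
2) Client libraries: libraries (for Go, Python, Java and others) that generate the /metrics endpoint for the service being monitored and expose it to the Prometheus server. Much software already supports Prometheus natively and provides /metrics out of the box. For things that do not provide /metrics themselves, such as the operating system, you can use an existing exporter or write your own to provide the /metrics endpoint.
3) Pushgateway: mainly for short-lived jobs. Because such jobs exist only briefly, they may disappear before Prometheus gets to pull from them. These jobs instead push their metrics to the Pushgateway, and the Prometheus server pulls them from the Pushgateway.
4) Exporters: expose the metrics of existing third-party services to Prometheus.
5) Alertmanager: after receiving alerts from the Prometheus server, it deduplicates and groups them, routes them to the matching receiver and sends the notification. Common receivers include email, PagerDuty, OpsGenie and webhooks.
6) Web UI: Prometheus has a simple built-in web console for querying metrics and viewing configuration or service discovery information. In practice, Grafana is usually used to view metrics and build dashboards, with Prometheus as a Grafana data source.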
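The push path described for the Pushgateway can be sketched as follows. The gateway address pushgateway.example:9091, the job name and the metric are hypothetical; the /metrics/job/<job> path is the Pushgateway's push endpoint. The sketch only builds the request, so it can be inspected without a running gateway.

```python
import urllib.request


def build_push_request(gateway, job, metrics):
    """Build (but do not send) a PUT request that replaces all metrics of `job`.

    `metrics` maps metric names to numeric values; the body uses the
    Prometheus text exposition format, one sample per line.
    """
    body = "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


# Hypothetical short-lived batch job reporting when it last succeeded.
req = build_push_request(
    "pushgateway.example:9091",
    "nightly_backup",
    {"backup_last_success_timestamp_seconds": 1722600000},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"), end="")
# Sending it would be: urllib.request.urlopen(req)
```

The job pushes once before it exits, and the Prometheus server later scrapes the Pushgateway like any other target. Job names containing characters such as "/" would need extra encoding, which this sketch does not handle.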

Note: most Prometheus components are written in Go, so they are easy to build and deploy as static binaries.

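To get a more hands-on understanding of the Prometheus server, we will next deploy and run a Prometheus server instance on k8s, collect the hosts' system resource usage through Node Exporter, and create a simple dashboard with Grafana.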

II. Deploying Prometheus on Kubernetes

Note: we install Prometheus with Helm, so install Helm first.

1. Configure the Helm repository

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

2. Install Prometheus with Helm

First download the chart, because values.yaml needs to be changed

helm pull bitnami/prometheus --untar

Edit values.yaml

cd prometheus
vi values.yaml # set every storageClass to nfs-client (the NFS StorageClass configured earlier); in two other places the enable flag also has to be changed from false to true

[root@aminglinux01 prometheus]# cat values.yaml | grep storageClass
## Current available global Docker image parameters: imageRegistry, imagePullSecrets and storageClass
## @param global.storageClass DEPRECATED: use global.defaultStorageClass instead
  storageClass: "nfs-client"
    ## @param alertmanager.persistence.storageClass PVC Storage Class for Concourse worker data volume
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    storageClass: "nfs-client"
    ## @param server.persistence.storageClass Storage class of backing PVC
    ## If defined, storageClassName: <storageClass>
    ## If set to "-", storageClassName: "", which disables dynamic provisioning
    ## If undefined (the default) or set to null, no storageClassName spec is
    storageClass: "nfs-client"
[root@aminglinux01 prometheus]#

Install

helm install prometheus .

[root@aminglinux01 prometheus]# helm install prometheus . 
NAME: prometheus
LAST DEPLOYED: Sat Aug  3 01:12:29 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: prometheus
CHART VERSION: 1.3.14
APP VERSION: 2.53.1

** Please be patient while the chart is being deployed **

Prometheus can be accessed via port "80" on the following DNS name from within your cluster:

    prometheus-server.default.svc.cluster.local

To access Prometheus from outside the cluster execute the following commands:

  NOTE: It may take a few minutes for the LoadBalancer IP to be available.
        Watch the status with: 'kubectl get svc --namespace default -w prometheus'

    export SERVICE_IP=$(kubectl get svc --namespace default prometheus --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
    echo "Prometheus URL: http://$SERVICE_IP/"

Watch the Alertmanager StatefulSet status using the command:

    kubectl get sts -w --namespace default -l app.kubernetes.io/name=prometheus-alertmanager,app.kubernetes.io/instance=prometheus

Alertmanager can be accessed via port "80" on the following DNS name from within your cluster:

    prometheus-alertmanager.default.svc.cluster.local

To access Alertmanager from outside the cluster execute the following commands:

  NOTE: It may take a few minutes for the LoadBalancer IP to be available.
        Watch the status with: 'kubectl get svc --namespace default -w prometheus-alertmanager'

    export SERVICE_IP=$(kubectl get svc --namespace default prometheus-alertmanager --template "{{ range (index .status.loadBalancer.ingress 0) }}{{ . }}{{ end }}")
    echo "Alertmanager URL: http://$SERVICE_IP/"

WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
  - alertmanager.resources
  - server.resources
  - server.thanos.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[root@aminglinux01 prometheus]# ls

List the applications installed by Helm

helm list -A

[root@aminglinux01 prometheus]# helm list -A
NAME         	NAMESPACE 	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
myharbor     	default   	1       	2024-07-29 22:46:49.307336562 +0800 CST	deployed	harbor-22.0.5       	2.11.0     
node-exporter	default   	1       	2024-07-30 03:23:47.492180325 +0800 CST	deployed	node-exporter-4.4.11	1.8.2      
prometheus   	prometheus	1       	2024-08-01 00:42:00.437880383 +0800 CST	deployed	prometheus-1.3.14   	2.53.1     
prometheus   	default   	1       	2024-08-03 01:12:29.93926442 +0800 CST 	deployed	prometheus-1.3.14   	2.53.1     
[root@aminglinux01 prometheus]#

3. Access Prometheus

Check the Services

kubectl get svc

[root@aminglinux01 prometheus]# kubectl get pod -owide | grep prometheus
prometheus-alertmanager-0 1/1 Running 0 3m37s 10.18.206.230 aminglinux02 <none> <none>
prometheus-server-bd476698f-jrf2q 1/1 Running 0 3m37s 10.18.68.148 aminglinux03 <none> <none>

[root@aminglinux01 prometheus]# kubectl get svc | grep prometheus
prometheus-alertmanager LoadBalancer 10.15.29.91 192.168.10.243 80:30901/TCP 14m
prometheus-server LoadBalancer 10.15.83.252 192.168.10.244 80:30118/TCP 14m
[root@aminglinux01 prometheus]#

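Access Prometheus and Alertmanager through the ports shown in the Service output above (highlighted in red in the original post): either the LoadBalancer IP on port 80, or any node IP on the NodePort (30118 for prometheus-server, 30901 for prometheus-alertmanager).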

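III. Prometheus Exporters

In the Prometheus architecture, the Prometheus server does not monitor specific targets directly; its main job is to collect and store data and to serve queries on it. To monitor something concrete, such as a host's CPU usage, we therefore need an exporter.
An exporter is a tool that collects and exposes application metrics, making specific metrics of an application available to the Prometheus monitoring system. An exporter can run as a standalone process and serves its metrics through an HTTP endpoint. Prometheus fetches the latest metrics by periodically scraping that endpoint, then stores them and makes them available for visualization.
Exporters exist for all kinds of applications and services, including databases, message brokers and web servers. The services we use day to day (Nginx, MySQL, Redis, RabbitMQ, MongoDB and so on) each have an exporter that extracts metrics from the application and formats them in a way Prometheus understands.
In short, exporters give you a simple way to collect performance, resource utilization and other key metrics from an application, so that potential problems can be spotted early and acted on.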

1.Node Exporter

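Node Exporter collects host-level metrics (CPU, memory, disk, network and so on). It runs as a standalone process on the host and exposes the metrics through an HTTP endpoint. Prometheus periodically scrapes that endpoint to fetch the latest host metrics, then stores them and makes them available for visualization.
Node Exporter targets Unix-like systems such as Linux and macOS; on Windows the separate windows_exporter plays the same role. It collects host metrics through system-level interfaces (on Linux mainly /proc and /sys) and converts them into a format Prometheus understands. We can install Node Exporter on every K8s node: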

helm install node-exporter bitnami/node-exporter

[root@aminglinux01 node-exporter]# helm install node-exporter .
NAME: node-exporter
LAST DEPLOYED: Tue Jul 30 03:23:47 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
CHART NAME: node-exporter
CHART VERSION: 4.4.11
APP VERSION: 1.8.2

** Please be patient while the chart is being deployed **

Watch the Node Exporter DaemonSet status using the command:

    kubectl get ds -w --namespace default node-exporter

Node Exporter can be accessed via port "9100" on the following DNS name from within your cluster:

    node-exporter.default.svc.cluster.local

To access Node Exporter from outside the cluster execute the following commands:

    echo "URL: http://127.0.0.1:9100/"
    kubectl port-forward --namespace default svc/node-exporter 9100:9100

WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
  - resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

⚠ SECURITY WARNING: Original containers have been substituted. This Helm chart was designed, tested, and validated on multiple platforms using a specific set of Bitnami and Tanzu Application Catalog containers. Substituting other containers is likely to cause degraded security and performance, broken chart features, and missing environment variables.

Substituted images detected:
  - registry.cn-hangzhou.aliyuncs.com/*/node-exporter:1.8.2-debian-12-r2
[root@aminglinux01 node-exporter]#

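Because of network access restrictions, the chart was installed offline from a local copy and the image address was changed by hand, which is why the command above is `helm install node-exporter .` and why the substituted-image warning appears.

Check the pods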

kubectl get pod

[root@aminglinux01 node-exporter]# kubectl get pod | grep node 
node-exporter-29kkf                    1/1     Running                      0                109s
node-exporter-5rkhs                    1/1     Running                      0                109s
[root@aminglinux01 node-exporter]#

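node-exporter runs as a DaemonSet, so there should normally be 3 pods. Only 2 are running here because the master node has a taint; the DaemonSet needs a toleration for it, which is added as follows:

First check the taint on the master node: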

[root@aminglinux01 ~]# kubectl describe node aminglinux01 |grep -i taint
Taints: node-role.kubernetes.io/control-plane:NoSchedule
[root@aminglinux01 ~]#

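Edit the DaemonSet in place

kubectl edit daemonset node-exporter ## search for "volumes" and add the tolerations block shown below directly above it (red text in the original post); note that tolerations sits at the same level as containers
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      volumes:
      - hostPath:
          path: /proc
          type: ""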
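To see why this toleration lets the DaemonSet pod schedule onto the control-plane node, the matching rule can be sketched in Python. This is a simplified model of the check written for illustration only: it ignores tolerationSeconds and every other scheduling filter, and the function name is hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    # An empty effect in the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint key.
        return key == "" or key == taint["key"]
    # Operator "Equal": key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


taint = {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}
toleration = {
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Exists",
    "effect": "NoSchedule",
}

print(tolerates(toleration, taint))  # True
print(tolerates({"key": "other", "operator": "Exists"}, taint))  # False
```

Because the taint on aminglinux01 has no value, operator "Exists" is the natural choice: it matches on key and effect alone.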

[root@aminglinux01 ~]# kubectl get pod | grep node 
node-exporter-9cn2c                    1/1     Running                      0               95s
node-exporter-h4ntw                    1/1     Running                      0               2m27s
node-exporter-wvp2h                    1/1     Running                      0               54s
[root@aminglinux01 ~]# ^C

Check the node metrics endpoints: if node-exporter is working properly, port 9100 is reachable on all three nodes below

[root@aminglinux01 ~]# echo > /dev/tcp/192.168.100.151/9100
[root@aminglinux01 ~]# echo > /dev/tcp/192.168.100.152/9100
[root@aminglinux01 ~]# echo > /dev/tcp/192.168.100.153/9100
[root@aminglinux01 ~]#
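The /dev/tcp redirections above only prove that the port accepts connections; no output means success. The same check can be scripted. A minimal sketch using Python's standard library, with hypothetical helper names and the three node IPs taken from the commands above:

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(hosts, port=9100):
    """Map each host to whether its node-exporter port accepts connections."""
    return {host: port_open(host, port) for host in hosts}


# Example (requires network access to the nodes):
# check_nodes(["192.168.100.151", "192.168.100.152", "192.168.100.153"])
```

A reachable port does not guarantee valid metrics; fetching http://<node-ip>:9100/metrics with curl or a browser confirms that the exporter actually answers.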

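2. Monitoring nodes from Prometheus through node-exporter

Because we installed Prometheus with Helm, changing its configuration by editing a configuration file directly is cumbersome. Fortunately, on k8s the Prometheus configuration is stored as a ConfigMap.

There are two ways to modify the ConfigMap:

1) Export the ConfigMap from k8s, edit it, and apply it again.
2) Edit it in place with kubectl edit.

Exporting the ConfigMap and then editing it is recommended: it is safer and more convenient, because this configuration will be changed many times later. First export the ConfigMap to a YAML file: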

[root@aminglinux01 prometheus]# kubectl get pod | grep prometheus
prometheus-alertmanager-0 1/1 Running 0 16m
prometheus-server-bd476698f-jrf2q 1/1 Running 0 16m
[root@aminglinux01 prometheus]#

kubectl get cm prometheus-server -o yaml > prometheus_config.yaml

After editing, re-apply the YAML

kubectl apply -f prometheus_config.yaml
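The post does not show which lines were edited in the exported ConfigMap. One plausible change, given the goal of this section, is adding a static scrape job for the three node-exporter endpoints under scrape_configs in the Prometheus configuration stored in the ConfigMap. The sketch below renders such a job with plain Python; the job name node-exporter and the helper function are assumptions, while job_name, static_configs and targets are standard Prometheus configuration fields.

```python
def render_scrape_job(job_name, targets, indent=2):
    """Render one static scrape job as a YAML list item for scrape_configs."""
    pad = " " * indent
    lines = [
        f"{pad}- job_name: {job_name}",
        f"{pad}  static_configs:",
        f"{pad}    - targets:",
    ]
    lines += [f'{pad}        - "{target}"' for target in targets]
    return "\n".join(lines) + "\n"


# Node IPs as used elsewhere in this post; 9100 is the node-exporter port.
nodes = ["192.168.100.151", "192.168.100.152", "192.168.100.153"]
print("scrape_configs:")
print(render_scrape_job("node-exporter", [f"{ip}:9100" for ip in nodes]), end="")
```

The printed fragment has to be merged into the existing scrape_configs list with matching indentation, since the configuration is embedded as a string inside the ConfigMap's YAML.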

Warning: resource configmaps/prometheus-server is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
configmap/prometheus-server configured
[root@aminglinux01 prometheus]#

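At this point the ConfigMap has been updated, but Prometheus has not yet picked up the new configuration; the Prometheus service still needs to be restarted. A simple way is to delete the existing pod, after which Kubernetes automatically starts a new one.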

kubectl get po |grep prometheus-server |awk '{print $1}' |xargs -i kubectl delete po {}

[root@aminglinux01 prometheus]# kubectl get pod | grep prometheus
prometheus-alertmanager-0 1/1 Running 0 29m
prometheus-server-bd476698f-vbkjd 1/1 Running 0 21s
[root@aminglinux01 prometheus]#

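Once the new pod is Running, check again in the browser.

Besides Node Exporter, other common services can also be monitored with exporters; these are covered in later chapters.

IV. Querying Metrics with PromQL

1. What is PromQL

PromQL (Prometheus Query Language) is Prometheus's built-in query language. It provides rich querying, aggregation and logical operations over time series data, and is used throughout day-to-day work with Prometheus, including data queries, visualization and alerting. It is fair to say that PromQL underlies every Prometheus use case.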

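Aside: the four metric types

* counter: a counter that only increases (unless the system is reset). Common metrics such as http_requests_total and node_cpu_seconds_total are Counters.
* gauge: unlike a Counter, a Gauge reflects the current state of the system, so its sample values can go up or down. Common metrics such as node_memory_MemFree_bytes (memory currently free on the host) and node_memory_MemAvailable_bytes (memory available for use) are Gauges.
* histogram
* summary
Histogram and Summary are mainly used to record and analyze the distribution of samples.
For example, to assess the quality of a service endpoint, you might count how many requests took 0~100 ms, 100~500 ms, 500~1000 ms and more than 1000 ms. The distribution of requests across these four ranges shows whether the endpoint is fast or slow.
Histogram and Summary exist to solve exactly this kind of problem: they give a quick view of how the monitored samples are distributed.
For example, go_gc_duration_seconds is a Summary;
prometheus_tsdb_compaction_chunk_size_bytes is a Histogram.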
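The latency example above maps directly onto how a Histogram is stored: each bucket counts the observations less than or equal to its upper bound (the le label), so the counts are cumulative. A minimal sketch with the bucket bounds from the example (0.1 s, 0.5 s, 1 s) and made-up request durations; the function name is hypothetical:

```python
import math


def observe_all(durations, bounds):
    """Return cumulative bucket counts, total count and sum, like a Prometheus histogram."""
    uppers = sorted(bounds) + [math.inf]  # the +Inf bucket always exists
    buckets = {upper: 0 for upper in uppers}
    for d in durations:
        for upper in uppers:
            if d <= upper:  # cumulative: a sample counts in every bucket it fits
                buckets[upper] += 1
    return buckets, len(durations), sum(durations)


# Hypothetical request durations in seconds.
durations = [0.05, 0.08, 0.3, 0.7, 1.5]
buckets, count, total = observe_all(durations, [0.1, 0.5, 1.0])

for upper, n in buckets.items():
    print(f"le={upper} count={n}")
# Requests between 100 ms and 500 ms = bucket(0.5) - bucket(0.1)
print(buckets[0.5] - buckets[0.1])
```

Because buckets are cumulative, the number of requests in a range is the difference between two adjacent buckets, and the +Inf bucket always equals the total count.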

2. Query system load

node_load1 ## 1-minute load average; node_load5 and node_load15 give the 5- and 15-minute averages

3. Query available memory

node_memory_MemAvailable_bytes ## in bytes

node_memory_MemAvailable_bytes/1024 ## in KiB

Add a label filter:

node_memory_MemAvailable_bytes{instance="192.168.100.152:9100"}/1024

Regex matching:

node_memory_MemAvailable_bytes{instance=~"192.168.*"}
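PromQL regex matchers (=~ and !~) are fully anchored: the pattern has to match the entire label value. For a simple pattern like this one, Python's re.fullmatch behaves the same way, so it can be used to preview which instance values the selector above would keep. The instance values and the helper name below are made up for illustration.

```python
import re


def select_instances(instances, pattern):
    """Keep the label values that the whole pattern matches, as =~ does in PromQL."""
    regex = re.compile(pattern)
    return [value for value in instances if regex.fullmatch(value)]


instances = [
    "192.168.100.151:9100",
    "192.168.100.152:9100",
    "10.192.168.1:9100",  # contains "192.168" but does not start with it
]

print(select_instances(instances, "192.168.*"))
# An unescaped "." matches any character; escape it to match a literal dot.
print(select_instances(["192x168y1:9100"], "192.168.*"))
print(select_instances(["192x168y1:9100"], r"192\.168\..*"))
```

The third address is dropped because of the anchoring, and the last two calls show why a stricter selector would escape the dots.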

Memory usage (percent)

(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100

CPU usage (busy fraction of core 0 on one node, over 2 minutes)

1 - rate(node_cpu_seconds_total{cpu="0",instance="192.168.100.151:9100",mode="idle"} [2m])
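What rate() computes here can be approximated by hand: the per-second increase of the idle counter over the window, corrected for counter resets. The sketch below is a simplified model (the real rate() also extrapolates to the window boundaries), and the sample values are invented for illustration.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter from (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the value after the
    reset counts as new increase. Unlike PromQL's rate(), nothing is extrapolated.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical node_cpu_seconds_total{mode="idle"} samples for one core,
# scraped every 30 s across a 2-minute window: (unix_time, idle_seconds).
idle = [(0, 1000.0), (30, 1027.0), (60, 1054.0), (90, 1081.0), (120, 1108.0)]

idle_fraction = simple_rate(idle)  # idle seconds per second of wall time
cpu_usage_percent = (1 - idle_fraction) * 100
print(round(idle_fraction, 3), round(cpu_usage_percent, 1))
```

The query above covers a single core on a single node. A common variant for the whole machine averages over all cores: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m]))).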

Disk space (available, in MiB)

node_filesystem_avail_bytes{fstype!="tmpfs"}/1024/1024

Network interface traffic (bytes per second received and transmitted)

rate(node_network_receive_bytes_total{device="ens160"}[3m])

rate(node_network_receive_bytes_total{device="ens160",instance="192.168.100.151:9100"}[3m])

rate(node_network_transmit_bytes_total{device="ens160",instance="192.168.100.151:9100"}[3m])
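The same expressions can be run outside the web UI through the Prometheus HTTP API (GET /api/v1/query). A minimal sketch using only the standard library; the base URL uses the prometheus-server LoadBalancer IP from the Service output earlier and must be adjusted to your environment, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_query(base_url, promql, timeout=5):
    """Run an instant query and return the list under data.result."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


url = build_query_url(
    "http://192.168.10.244",
    "(1-node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes)*100",
)
print(url)
# With a reachable server:
# for series in instant_query("http://192.168.10.244", "up"):
#     print(series["metric"].get("instance"), series["value"][1])
```

Each element of the result carries the series labels under "metric" and a [timestamp, value] pair under "value", which is the same data the web UI shows in its table view.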