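Deploying Prometheus on k8s
Versions:
k8s: 1.24.4
prometheus: kube-prometheus release-0.12 (https://github.com/prometheus-operator/kube-prometheus.git)
This deployment uses the operator approach to run Prometheus in k8s; some familiarity with both k8s and Prometheus is assumed.
1. Clone the matching release onto the server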
git clone -b release-0.12 https://github.com/prometheus-operator/kube-prometheus.git
2. Change the images in a few manifests under kube-prometheus/manifests (registry.k8s.io is not reachable from mainland China)
vim prometheusAdapter-deployment.yaml
image: v5cn/prometheus-adapter:v0.10.0
vim kubeStateMetrics-deployment.yaml
image: bitnami/kube-state-metrics:2.8.1
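The two image edits above can also be scripted instead of made by hand in vim. The following is a minimal sketch, assuming GNU sed and assuming the original image lines point at registry.k8s.io; the function name swap_image is hypothetical and not part of kube-prometheus.

```shell
# swap_image FILE NAME NEW_IMAGE
# Rewrites every "image: registry.k8s.io/...NAME..." line in FILE to NEW_IMAGE.
# The indentation before "image:" is preserved, and image lines that point at
# other registries (for example sidecar containers) are left untouched.
# Assumption: GNU sed (the -i flag behaves differently on BSD/macOS sed).
swap_image() {
  local file="$1" name="$2" new_image="$3"
  sed -i "s#image: registry\.k8s\.io/[^ ]*${name}[^ ]*#image: ${new_image}#" "$file"
}

# Usage (run inside kube-prometheus/manifests):
#   swap_image prometheusAdapter-deployment.yaml prometheus-adapter v5cn/prometheus-adapter:v0.10.0
#   swap_image kubeStateMetrics-deployment.yaml kube-state-metrics bitnami/kube-state-metrics:2.8.1
```

Scripting the change makes it repeatable when the manifests are cloned again; check the result with grep 'image:' on each file before deploying.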
Also change the grafana Service type to NodePort:
vim grafana-service.yaml
  selector:
    app.kubernetes.io/component: grafana
    app.kubernetes.io/name: grafana
    app.kubernetes.io/part-of: kube-prometheus
  type: NodePort
3. Deploy
cd kube-prometheus/manifests
kubectl create -f setup/
kubectl create -f .
Watch the output for errors while deploying, then check the pods that were started:
kubectl get pod -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          16m
blackbox-exporter-58c9c5ff8d-l6x52     3/3     Running   0          70m
grafana-74f97479b9-pwdn9               1/1     Running   0          70m
kube-state-metrics-676764b849-4qjqw    3/3     Running   0          33m
node-exporter-46w4n                    2/2     Running   0          70m
node-exporter-4zwtf                    2/2     Running   0          28m
node-exporter-m6bgl                    2/2     Running   0          70m
prometheus-adapter-85df796f6c-cfgj6    1/1     Running   0          59m
prometheus-adapter-85df796f6c-p2vr5    1/1     Running   0          59m
prometheus-k8s-0                       2/2     Running   0          25m
prometheus-operator-5687547bb5-jgrb6   2/2     Running   0          31m
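Scanning a listing like the one above by eye gets tedious on larger clusters. As a minimal sketch, the hypothetical helper not_ready below filters kubectl get pod output down to the pods that still need attention; it assumes the default five-column output format shown above.

```shell
# not_ready: reads `kubectl get pod` output on stdin and prints the name of
# every pod that is not in the Running state or whose READY count is
# incomplete (for example 1/2). The header line is skipped.
not_ready() {
  awk 'NR > 1 {
    split($2, r, "/")
    if ($3 != "Running" || r[1] != r[2]) print $1
  }'
}

# Usage:
#   kubectl get pod -n monitoring | not_ready
```

Empty output means every pod is Running with all containers ready. An alternative is kubectl wait --for=condition=Ready pod --all -n monitoring --timeout=300s, which blocks until the pods are ready or the timeout expires.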
4. Access grafana
Look up the grafana Service address and port:
kubectl get service -n monitoring |grep grafana
grafana NodePort 192.168.252.227 <none> 3000:30280/TCP 73m
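Then browse to the public IP of any node on port 30280. The default username/password is admin/admin.
To reset the password, run the following inside the grafana container: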
grafana-cli admin reset-admin-password admin123
Black-box monitoring
1. Deployment
Recent versions of the Prometheus stack install Blackbox Exporter by default. Check with:
kubectl get po -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME                                 READY   STATUS    RESTARTS      AGE
blackbox-exporter-58c9c5ff8d-spvbz   3/3     Running   4 (23h ago)   23h
A Service is created as well; Blackbox Exporter can be reached through it, with parameters passed in the request:
kubectl get svc -n monitoring -l app.kubernetes.io/name=blackbox-exporter
NAME                TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)              AGE
blackbox-exporter   ClusterIP   192.168.253.141   <none>        9115/TCP,19115/TCP   27h
For example, to check the status of www.baidu.com (any public domain or internal company domain will do), run:
curl -s "http://192.168.253.141:19115/probe?target=www.baidu.com&module=http_2xx" | tail -1
probe_success 1
probe is the endpoint path, target is the target to probe, and module selects which module is used for the probe.
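To use the probe result in a script rather than read it by eye, the probe_success line can be turned into an exit status. This is a minimal sketch; the helper name probe_ok is hypothetical, and it only assumes the plain-text metrics format shown in the curl output above.

```shell
# probe_ok: reads Blackbox Exporter /probe output on stdin and succeeds
# (exit status 0) only if it contains the line "probe_success 1".
# A value of 0, or a missing probe_success line, yields exit status 1.
probe_ok() {
  awk '$1 == "probe_success" { v = $2 } END { exit !(v == 1) }'
}

# Usage (ClusterIP and port taken from the Service listing above):
#   curl -s "http://192.168.253.141:19115/probe?target=www.baidu.com&module=http_2xx" | probe_ok \
#     && echo "probe succeeded" || echo "probe failed"
```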
If Blackbox Exporter is not installed in the cluster, see https://github.com/prometheus/blackbox_exporter for installation instructions.
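2. Prometheus static configuration
First create an empty file, then create a Secret from it. This Secret then serves as the static (additional) scrape configuration for Prometheus: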
touch prometheus-additional.yaml
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
After creating the Secret, edit the Prometheus resource:
cd /soft/yaml/prometheus/kube-prometheus/manifests
vim prometheus-prometheus.yaml
  # add the four lines below image
  image: quay.io/prometheus/prometheus:v2.42.0
  additionalScrapeConfigs:
    key: prometheus-additional.yaml
    name: additional-configs
    optional: true
Then replace the resource:
kubectl replace -f prometheus-prometheus.yaml
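Next, add static scrape configuration to prometheus-additional.yaml. Black-box monitoring is used as the example here.
Note: go back to the directory where this file was created.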
vim prometheus-additional.yaml
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx] # Look for a HTTP 200 response.
  static_configs:
    - targets:
        - http://gaoxin.kubeasy.com # Target to probe with http.
        - https://www.baidu.com # Target to probe with https.
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:19115 # The blackbox exporter's real hostname:port.
➢ targets: the targets to probe; change them to match your environment
➢ params: which module to use for the probe
➢ replacement: the address of Blackbox Exporter
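The relabel rules above mean that Prometheus never scrapes the listed targets directly: it scrapes Blackbox Exporter and passes each target as the target URL parameter. The sketch below prints the request this amounts to. The function name probe_url is hypothetical, and the URL-encoding that Prometheus applies to the target value is left out for readability.

```shell
# probe_url EXPORTER MODULE TARGET
# Prints the request that the relabel rules make Prometheus send:
#   __address__     -> EXPORTER       (third relabel rule)
#   params.module   -> module=MODULE  (params block)
#   __param_target  -> target=TARGET  (first relabel rule)
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}

# Usage:
#   probe_url blackbox-exporter:19115 http_2xx https://www.baidu.com
#   prints http://blackbox-exporter:19115/probe?module=http_2xx&target=https://www.baidu.com
```

This is the same kind of request as the manual curl check made earlier against the Service, so a target can be tested by hand before it is added to the job.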
As you can see, the content is the same as in a traditional Prometheus configuration; just add the jobs you need. Then update the Secret from this file:
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml --dry-run=client -oyaml | kubectl replace -f - -n monitoring
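After the update, wait about a minute and the new job shows up in the Prometheus Web UI.
The State shown there can look odd: the first domain above is unreachable, yet it is reported as UP. State only reflects whether Prometheus managed to scrape the Blackbox Exporter endpoint, not whether the probed target responded.
Querying probe_success on the Graph page shows the actual result: the unreachable domain has the value 0. The alert rules added later use this value.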
Configuring WeCom (企业微信) alerts
1. Deploy alertmanager
Recent versions of prometheus-operator already deploy it:
kubectl get pod -n monitoring
alertmanager-main-0 2/2 Running 0 20h
2. Modify the configuration
cd /soft/yaml/prometheus/kube-prometheus/manifests
vim alertmanager-secret.yaml
The whole file can be copied as-is:
apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = critical"
      "target_matchers":
      - "severity =~ warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_matchers":
      - "severity = warning"
      "target_matchers":
      - "severity = info"
    - "equal":
      - "namespace"
      "source_matchers":
      - "alertname = InfoInhibitor"
      "target_matchers":
      - "severity = info"
    "receivers":
    - "name": "web.hook"
      "webhook_configs":
      - "url": 'http://172.16.0.47:8880/wx'
    "route":
      "group_by": ['alertname']
      "group_wait": 10s
      "group_interval": 1m
      "repeat_interval": 5m
      "receiver": 'web.hook'
type: Opaque
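Here alerts are sent to a relay at http://172.16.0.47:8880/wx; change this URL to the address of your own relay. With the relay in place, alerts can be posted directly to a WeCom group chat without configuring an enterprise application.
After editing, the Secret still has to be updated in the cluster, for example with kubectl replace -f alertmanager-secret.yaml.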
3. Deploy the relay
WeCom group robot, third-party webhook adapter
docker run -itd --name webhook-adapter -p 8880:80 \
guyongquan/webhook-adapter \
"--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=8255-9602-418e-a0b3-0b8deaa28"
Only one thing needs changing when deploying: replace the value after key= with the key of your own WeCom group robot, that is, the string 8255-9602-418e-a0b3-0b8deaa28.
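Since the robot key is the only part that varies, the argument can be built from a variable so the key does not have to be pasted into the command by hand. This is a minimal sketch: the helper name adapter_arg and the WECOM_KEY variable are hypothetical, and the argument layout is copied from the command above rather than taken from the image's documentation.

```shell
# adapter_arg KEY
# Prints the --adapter argument for guyongquan/webhook-adapter with KEY
# substituted as the WeCom group robot key.
adapter_arg() {
  printf '%s%s\n' \
    '--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=' \
    "$1"
}

# Usage (WECOM_KEY is a hypothetical environment variable holding your key):
#   docker run -itd --name webhook-adapter -p 8880:80 \
#     guyongquan/webhook-adapter "$(adapter_arg "$WECOM_KEY")"
```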
4. Add alert rules
The deployment above already ships with many official alert rules, but many of them do not fit, so we add some rules of our own:
mkdir -p /soft/yaml/prometheus/myself-config-prometheus/probe
cd /soft/yaml/prometheus/myself-config-prometheus/probe
vim blackbox-prometheusRule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: blackbox-exporter
    prometheus: k8s
    role: alert-rules
  name: blackbox-rule
  namespace: monitoring
spec:
  groups:
  - name: blackbox-exporter
    rules:
    - alert: DomainAccessDelayExceeds1s
      annotations:
        description: Domain {{ $labels.instance }} probe latency is above 1 second, current latency is {{ $value }}
        summary: Domain probe, access latency exceeds 1 second
      expr: sum(probe_http_duration_seconds{job=~"blackbox"}) by(instance) > 1
      for: 1m
      labels:
        severity: warning
    - alert: ServiceIsDown
      annotations:
        description: Service {{ $labels.instance }} is down, current value is {{ $value }}
        summary: Service probe, unreachable for 30s
      expr: probe_success < 1
      for: 30s
      labels:
        severity: error
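Create the rule in the cluster, for example with kubectl create -f blackbox-prometheusRule.yaml.
These rules are deliberately simple; define your own according to your workloads.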