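Prometheus's capabilities as a component are beyond doubt, but after running it for a long time you run into many performance problems: memory usage, large-scale scraping, large-scale storage, and so on. This article explains how to scrape basic Kubernetes cluster monitoring data at scale with cloud-native Prometheus.
Architecture diagram
The diagram above shows our current monitoring platform architecture. As it shows, the platform combines several mature open-source components to handle data collection, metrics, and visualization for our clusters.
We currently monitor a number of Kubernetes clusters that differ in function and in the business they serve. The monitoring covers business metrics, infrastructure metrics, and alerting information.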
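Kubernetes cluster monitoring
We use one of the two common monitoring architectures:
Prometheus-operator
Standalone Prometheus configuration (the architecture we chose)
Tip: Prometheus-operator is indeed easy to deploy, and its simple ServiceMonitor objects save a lot of effort. For an environment like ours, with many kinds of private clusters, its maintenance cost is somewhat high. We chose the second option mainly to skip the step of creating service-discovery objects and to rely more on native service discovery and service registration.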
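Data scraping
We made some adjustments to data scraping to deal with two problems: the heavy load that a large number of nodes or a large data volume puts on the apiserver, and Prometheus running out of memory (OOM) when scraping at scale.
Use Kubernetes for service discovery only, and let Prometheus scrape the targets directly, which reduces the scrape load on the apiserver
Use hashmod to distribute scraping across multiple Prometheus instances, which relieves memory pressure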
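To make the hashmod step more concrete, the following is a minimal shell sketch of what the hashmod relabel action computes: the MD5 digest of the source label value, with its last 8 bytes read as a big-endian unsigned integer and reduced modulo the configured modulus. The helper name hashmod and the node names are hypothetical, and the sketch assumes md5sum from GNU coreutils is installed. It is also an assumption that the ID_NUM placeholder in the configuration later in this article is replaced, per Prometheus instance, with a shard number from 0 to 9.

```shell
#!/bin/sh
# Sketch of the hashmod relabel action: MD5 the source label value, read the
# last 8 bytes of the digest as a big-endian unsigned integer, take it modulo N.
# Assumes md5sum (GNU coreutils); hashmod is a hypothetical helper name.
hashmod() {
    local value="$1" modulus="$2" hex hi lo
    # md5sum prints 32 hex characters; characters 17-32 are the last 8 bytes.
    hex=$(printf '%s' "$value" | md5sum | cut -c17-32)
    hi=$(printf '%s' "$hex" | cut -c1-8)
    lo=$(printf '%s' "$hex" | cut -c9-16)
    # Shell arithmetic is signed 64-bit, so reduce the two 32-bit halves separately:
    # (hi * 2^32 + lo) mod m == ((hi mod m) * (2^32 mod m) + (lo mod m)) mod m
    echo $(( ((0x$hi % $modulus) * (4294967296 % $modulus) + (0x$lo % $modulus)) % $modulus ))
}

# Show which of 10 shards would keep each (hypothetical) node.
for node in node-a node-b node-c; do
    echo "$node -> shard $(hashmod "$node" 10)"
done
```

Because every instance computes the same hash for a given node name, each node is kept by exactly one shard as long as all instances use the same modulus and distinct shard numbers.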
RBAC permission changes:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - nodes/metrics # added so the node can be scraped from outside
  - nodes/metrics/cadvisor # added so the node can be scraped from outside
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
Permission to scrape the /metrics and /metrics/cadvisor paths on each Node must be added.
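One way to confirm that the added permission took effect is to ask the API server with kubectl auth can-i while impersonating the ServiceAccount. The kubelet authorizes requests under /metrics (including /metrics/cadvisor) against the nodes/metrics subresource, so that is what this sketch checks. The function name and the KUBECTL override variable are hypothetical; the override exists only so that the command can be printed with echo instead of being executed.

```shell
#!/bin/sh
# Sketch: check whether a ServiceAccount may GET the nodes/metrics subresource.
# KUBECTL is a hypothetical override (for example KUBECTL=echo prints the command only).
can_scrape_node_metrics() {
    local namespace="$1" account="$2"
    "${KUBECTL:-kubectl}" auth can-i get nodes --subresource=metrics \
        --as="system:serviceaccount:${namespace}:${account}"
}

# Usage against a live cluster; kubectl prints "yes" or "no":
#   can_scrape_node_metrics monitoring prometheus
```

Impersonation with --as requires that the user running kubectl is itself allowed to impersonate service accounts, which a cluster administrator normally is.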
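The complete scrape configuration example covers the following:
For Thanos data writes, it includes an example of writing to Alibaba Cloud OSS
For node_exporter metrics, everything in production outside Kubernetes uses Consul for registration and discovery
Business-defined custom metrics are discovered and scraped through Kubernetes service discovery
Host naming rule
data center-business line-business attribute-sequence number (for example: bja-athena-etcd-001)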
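Since the Consul tags are derived from the hostname, it can help to check that a hostname follows the naming rule before registering it. The sketch below does that with a regular expression. The function name valid_hostname is hypothetical, and the allowed character set (lowercase letters and digits in each field, digits only in the sequence number) is an assumption inferred from the single example in the text.

```shell
#!/bin/sh
# Sketch: check that a hostname has four dash-separated fields ending in a number.
# The allowed characters (lowercase letters and digits) are an assumption.
valid_hostname() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+-[a-z0-9]+-[a-z0-9]+-[0-9]+$'
}

# Try one name that follows the rule and one that does not.
for h in bja-athena-etcd-001 bja-athena; do
    if valid_hostname "$h"; then
        echo "$h: ok"
    else
        echo "$h: does not match the naming rule"
    fi
done
```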
Example Consul auto-registration script
#!/bin/bash
# Register this host's node_exporter with Consul; the tags come from the hostname.
# Alternative: use the address of eth0 only.
#ip=$(ip addr show eth0|grep inet | awk '{ print $2; }' | sed 's/\/.*$//')
# Take the first address in the internal ranges, skipping broadcast addresses.
ip=$(ip addr | grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -E "^192\.168|^172\.21|^10\.101|^10\.100" | grep -Ev "\.255$" | head -n 1)
ahost=$HOSTNAME
# Hostname fields: data center, business line, business attribute.
idc=$(echo "$ahost" | awk -F "-" '{print $1}')
app=$(echo "$ahost" | awk -F "-" '{print $2}')
group=$(echo "$ahost" | awk -F "-" '{print $3}')
# Hosts whose business line is "test" are not registered.
if [ "$app" != "test" ]
then
    echo "success"
    curl -X PUT -d "{\"ID\": \"${ahost}_${ip}_node\", \"Name\": \"node_exporter\", \"Address\": \"${ip}\", \"tags\": [\"idc=${idc}\",\"group=${group}\",\"app=${app}\",\"server=${ahost}\"], \"Port\": 9100,\"checks\": [{\"tcp\":\"${ip}:9100\",\"interval\": \"60s\"}]}" http://consul_server:8500/v1/agent/service/register
fi
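When a host is decommissioned, its entry can be removed again through Consul's agent deregistration endpoint (PUT /v1/agent/service/deregister/<service_id>). The sketch below only builds the URL, using the same ID format as the registration script above. The CONSUL_ADDR variable, the function names, and the sample IP address are hypothetical; the default address is the consul_server:8500 used in the script.

```shell
#!/bin/sh
# Sketch: build the deregistration URL for a service registered by the script above.
# CONSUL_ADDR and the function names are hypothetical; the ID format mirrors the
# "${ahost}_${ip}_node" ID used at registration.
CONSUL_ADDR="${CONSUL_ADDR:-consul_server:8500}"

service_id() {
    printf '%s_%s_node\n' "$1" "$2"
}

deregister_url() {
    printf 'http://%s/v1/agent/service/deregister/%s\n' "$CONSUL_ADDR" "$(service_id "$1" "$2")"
}

# Print the URL for a hypothetical host; to actually deregister, run:
#   curl -X PUT "$(deregister_url bja-athena-etcd-001 192.168.1.10)"
deregister_url bja-athena-etcd-001 192.168.1.10
```

Deregistration should go to the same Consul agent that handled the registration, since agent endpoints act on the services known to that agent.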
Complete configuration file example
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  bucket.yaml: |
    type: S3
    config:
      bucket: "gcl-download"
      endpoint: "gcl-download.oss-cn-beijing.aliyuncs.com"
      access_key: "xxxxxxxxxxxxxx"
      insecure: false
      signature_version2: false
      secret_key: "xxxxxxxxxxxxxxxxxx"
      http_config:
        idle_conn_timeout: 0s
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        monitor: 'k8s-sh-prod'
        service: 'k8s-all'
        ID: 'ID_NUM'
    remote_write:
    - url: "http://vmstorage:8400/insert/0/prometheus/"
    remote_read:
    - url: "http://vmstorage:8401/select/0/prometheus"
    scrape_configs:
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        #ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      #bearer_token: monitoring
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10250
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /metrics/cadvisor
      - source_labels: [__meta_kubernetes_node_name]
        modulus: 10
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ID_NUM
        action: keep
      metric_relabel_configs:
      - source_labels: [container]
        regex: (.+)
        target_label: container_name
        replacement: $1
        action: replace
      - source_labels: [pod]
        regex: (.+)
        target_label: pod_name
        replacement: $1
        action: replace
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        #ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      #bearer_token: monitoring
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:10250
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /metrics
      - source_labels: [__meta_kubernetes_node_name]
        modulus: 10
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: ID_NUM
        action: keep
      metric_relabel_configs:
      - source_labels: [container]
        regex: (.+)
        target_label: container_name
        replacement: $1
        action: replace
      - source_labels: [pod]
        regex: (.+)
        target_label: pod_name
        replacement: $1
        action: replace
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - monitoring
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - default
      relabel_configs:
      - action: labelmap
        regex: