Preface
In the earlier article "Flink 1.11.2 在K8s里基于NFS搭建高可用集群" (building a Flink 1.11.2 HA cluster on K8s backed by NFS), a problem surfaced once the cluster was used in production: under heavy input traffic, checkpoints frequently failed. Investigation traced this to how checkpoints were being stored, and switching the state backend to RocksDB resolved it. The changes are recorded below.
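The substance of the fix is a small change inside flink-conf.yaml: replace the filesystem state backend with RocksDB. Shown here in isolation for reference (the full ConfigMap in 3.2 repeats it in context):

state.backend: rocksdb
state.backend.incremental: false
state.checkpoints.dir: file:///tmp/rocksdb/data/
state.checkpoints.num-retained: 100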
In step 3 of the original setup, add two PVCs.
3.1 Create the checkpoint storage PVCs
Create jobmanager-checkpoint-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jobmanager-checkpoint-pvc
  namespace: flink-ha
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: nfs
Create taskmanager-checkpoint-pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: taskmanager-checkpoint-pvc
  namespace: flink-ha
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: nfs
kubectl apply -f jobmanager-checkpoint-pvc.yaml
kubectl apply -f taskmanager-checkpoint-pvc.yaml
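Before moving on, it is worth confirming that both claims bind against the NFS StorageClass:

kubectl get pvc -n flink-ha
# both PVCs should show STATUS "Bound", CAPACITY 30Gi and ACCESS MODES RWX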
3.2 Create the ConfigMap
Modify jobmanager-flink-conf.yaml, which holds the JobManager configuration:
apiVersion: v1
data:
  flink-conf.yaml: |-
    jobmanager.rpc.address: localhost
    blob.server.port: 6124
    jobmanager.rpc.port: 6123
    taskmanager.rpc.port: 6122
    queryable-state.proxy.ports: 6125
    jobmanager.memory.process.size: 3200m
    taskmanager.memory.process.size: 10240m
    taskmanager.numberOfTaskSlots: 1
    parallelism.default: 1
    high-availability: zookeeper
    high-availability.cluster-id: /flink-cluster
    high-availability.storageDir: file:/usr/flink/ha/flink-cluster
    high-availability.zookeeper.quorum: 192.168.1.205:2181
    # state.backend: filesystem
    # state.checkpoints.dir: file:/usr/flink/flink-checkpoints
    # state.checkpoints.num-retained: 100
    # state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
    state.backend: rocksdb
    state.backend.incremental: false
    # Directory for storing checkpoints
    state.checkpoints.dir: file:///tmp/rocksdb/data/
    state.checkpoints.num-retained: 100
    jobmanager.execution.failover-strategy: region
    web.upload.dir: /usr/flink/jars
    # metrics reporter
    metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
    metrics.reporter.promgateway.host: pushgateway
    metrics.reporter.promgateway.port: 9091
    metrics.reporter.promgateway.jobName: flink-cluster-job
    metrics.reporter.promgateway.randomJobNameSuffix: true
    metrics.reporter.promgateway.deleteOnShutdown: true
    metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2
    metrics.reporter.promgateway.interval: 60 SECONDS
  log4j-console.properties: |-
    # This affects logging for both user code and Flink
    rootLogger.level = INFO
    rootLogger.app