Troubleshooting a Flink 1.11.2 High-Availability Cluster on Kubernetes Backed by NFS

This post documents a problem encountered while running a Flink 1.11.2 high-availability cluster on Kubernetes (K8s) backed by NFS, and how it was fixed. Under heavy input traffic, the original checkpoint configuration caused checkpoints to fail repeatedly. Switching the state backend to RocksDB and updating the PVCs, the ConfigMap, and the deployment files resolved the issue.

After the cluster from the earlier post, "Flink 1.11.2 在K8s里基于NFS搭建高可用集群" (building a Flink 1.11.2 HA cluster on K8s with NFS), went into production, checkpoints frequently failed whenever the input traffic was high. The root cause turned out to be the way checkpoints were stored; switching to the RocksDB state backend fixed it. The changes are recorded below.
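Checkpoint failures of this kind show up in the web UI's Checkpoints tab, and they can also be inspected through the JobManager's REST API. A minimal command-line check might look like the following; the JobManager address and job ID are placeholders for your own cluster, and 8081 is the default REST port:

# List running jobs to find the job ID
curl http://<jobmanager-address>:8081/jobs

# Checkpoint statistics for a job; "counts.failed" keeps growing when checkpoints fail
curl http://<jobmanager-address>:8081/jobs/<job-id>/checkpoints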

Add two PVCs in Step 3

3.1 Create the checkpoint storage PVCs

Create jobmanager-checkpoint-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jobmanager-checkpoint-pvc
  namespace: flink-ha
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: nfs

Create taskmanager-checkpoint-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: taskmanager-checkpoint-pvc
  namespace: flink-ha
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  storageClassName: nfs

Apply both manifests:

kubectl apply -f jobmanager-checkpoint-pvc.yaml
kubectl apply -f taskmanager-checkpoint-pvc.yaml
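Before moving on, it is worth confirming that both claims bind successfully against the nfs storage class; a quick check in the flink-ha namespace:

# Both PVCs should report STATUS "Bound"
kubectl get pvc -n flink-ha jobmanager-checkpoint-pvc taskmanager-checkpoint-pvc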

3.2 Create the ConfigMap

Modify jobmanager-flink-conf.yaml, which holds the JobManager configuration:

apiVersion: v1
data:
  flink-conf.yaml: |-
    jobmanager.rpc.address: localhost
    blob.server.port: 6124
    jobmanager.rpc.port: 6123
    taskmanager.rpc.port: 6122
    queryable-state.proxy.ports: 6125
    jobmanager.memory.process.size: 3200m
    taskmanager.memory.process.size: 10240m
    taskmanager.numberOfTaskSlots: 1
    parallelism.default: 1

    high-availability: zookeeper
    high-availability.cluster-id: /flink-cluster
    high-availability.storageDir: file:/usr/flink/ha/flink-cluster
    high-availability.zookeeper.quorum: 192.168.1.205:2181

    # state.backend: filesystem
    # state.checkpoints.dir: file:/usr/flink/flink-checkpoints
    # state.checkpoints.num-retained: 100
    # state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
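    # RocksDB keeps working state on the pod's local disk and writes checkpoints to
    # state.checkpoints.dir. With a file:// checkpoint path, that directory has to be
    # shared storage reachable by the JobManager and every TaskManager, which is what
    # the NFS-backed checkpoint PVCs above are for (assuming the deployments mount
    # them at /tmp/rocksdb/data/).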
    state.backend: rocksdb
    state.backend.incremental: false
    # Directory for storing checkpoints
    state.checkpoints.dir: file:///tmp/rocksdb/data/
    state.checkpoints.num-retained: 100

    jobmanager.execution.failover-strategy: region

    web.upload.dir: /usr/flink/jars

    #metrics reporter
    metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
    metrics.reporter.promgateway.host: pushgateway
    metrics.reporter.promgateway.port: 9091
    metrics.reporter.promgateway.jobName: flink-cluster-job
    metrics.reporter.promgateway.randomJobNameSuffix: true
    metrics.reporter.promgateway.deleteOnShutdown: true
    metrics.reporter.promgateway.groupingKey: k1=v1;k2=v2
    metrics.reporter.promgateway.interval: 60 SECONDS
  log4j-console.properties: |-
    # This affects logging for both user code and Flink
    rootLogger.level = INFO
    rootLogger.appenderRef.console.ref = ConsoleAppender
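After editing the file, re-apply it and make sure the RocksDB settings actually reach a running JobManager once the pods restart. A sketch of that check, assuming the official Flink image's default config path /opt/flink/conf and the app=flink,component=jobmanager labels used by typical Flink-on-K8s manifests (adjust both to match your deployment):

kubectl apply -f jobmanager-flink-conf.yaml

# Pick a JobManager pod and confirm the rendered flink-conf.yaml carries the new backend
JM_POD=$(kubectl -n flink-ha get pods -l app=flink,component=jobmanager -o jsonpath='{.items[0].metadata.name}')
kubectl -n flink-ha exec "$JM_POD" -- grep -E 'state\.backend|state\.checkpoints\.dir' /opt/flink/conf/flink-conf.yaml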