Flink on Yarn HA cluster deployment, based on Flink 1.12, Hadoop 3.0, and CDH 6.3.2

Mumunu-

Article link: https://blog.csdn.net/h952520296/article/details/114318407

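1.1 Overview

In Flink on Yarn HA mode, availability depends first on Yarn's own high-availability mechanism (ResourceManager HA): Yarn manages the JobManager and restarts it when it fails. Second, a Flink job recovers from a checkpoint, and checkpoint snapshots live on remote storage, here HDFS, so HDFS must be highly available as well. The JobManager's metadata also relies on HDFS being highly available (NameNode HA plus multi-replica storage). Finally, the pointers to the JobManager metadata rely on a highly available ZooKeeper.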

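1.2 Advantages of Flink on Yarn
Compared with Standalone mode, running on Yarn has the following benefits:

1. Resources are used on demand, which improves cluster resource utilization;

2. Jobs have priorities and are run according to priority;

3. Because it is built on the Yarn scheduling system, failover of each role is handled automatically:

The JobManager and TaskManager processes are both monitored by the Yarn NodeManager;

If the JobManager process exits abnormally, the Yarn ResourceManager reschedules the JobManager onto another machine;

If a TaskManager process exits abnormally, the JobManager is notified, requests resources from the Yarn ResourceManager again, and restarts the TaskManager.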

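Chapter 2 Ways to run Flink on Yarn

2.1 Per-Job
Per-Job mode: simply put, you run the job directly. For every submitted job, Yarn allocates a JobManager, and when the job finishes all of its resources are released, including the JobManager and the TaskManagers.

Per-Job mode suits larger jobs and jobs that run for a long time.

2.2 Session
Session mode: the Dispatcher and ResourceManager are reused across jobs, and the cluster is not released after a job finishes. Session mode is also called multi-threaded mode; its defining trait is that resources stay allocated and are not released. To use it, start yarn-session first and then submit jobs; each submitted job gets its own JobMaster inside the shared session cluster.
Session mode suits smaller jobs and jobs that run for a short time, because resources do not have to be requested and released over and over.

So in production we usually deploy on Yarn and run jobs in per-job mode.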

Download the release directly from the official site:

https://www.apache.org/dyn/closer.lua/flink/flink-1.12.1/flink-1.12.1-bin-scala_2.12.tgz

https://mirrors.bfsu.edu.cn/apache/flink/flink-1.12.1/flink-1.12.1-bin-scala_2.12.tgz

Pick a 1.12 release.

Extract the archive and change into the extracted directory.

Then start the deployment.

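Modify the Hadoop configuration

Edit the Hadoop configuration file /etc/hadoop/yarn-site.xml and set the maximum number of attempts Yarn makes when restarting the application master.

On CDH: Yarn ---> type yarn.resourcemanager.am.max-attempts into the configuration search box.

The default is 2; I set it to 4 for now. Choose a value that suits your needs, then restart Yarn.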
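
To confirm that the new value is actually in place after the restart, a small check like the one below can help. This is only a sketch: the helper name am_max_attempts is made up for illustration, the path is the one given in this article (your distribution may deploy yarn-site.xml elsewhere), and it assumes the usual layout where each <name> element is followed by its <value> element.

```shell
# Print the value of yarn.resourcemanager.am.max-attempts from a yarn-site.xml file.
# am_max_attempts is a hypothetical helper name, not part of Hadoop.
am_max_attempts() {
  awk '
    /<name>yarn\.resourcemanager\.am\.max-attempts<\/name>/ { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # strip the 7-character <value> prefix and the 8-character </value> suffix
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Path as given in this article; override YARN_SITE if the file lives elsewhere.
YARN_SITE=${YARN_SITE:-/etc/hadoop/yarn-site.xml}
if [ -r "$YARN_SITE" ]; then
  echo "am.max-attempts = $(am_max_attempts "$YARN_SITE")"
fi
```

If the property is missing from the file, the helper prints nothing, which means Yarn is still using its default.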

Modify masters

Edit the masters file in the conf directory:

hadoop1:8081

The port can be changed to avoid a clash with Spark's port.

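Deploy Per-Job mode

Flink on Yarn overrides several parameters:
jobmanager.rpc.address: the JobManager's location in the cluster is not known in advance, so it is set to the address of the AM;
taskmanager.tmp.dirs: the temporary directories provided by Yarn are used;
parallelism.default: also overridden if the number of slots is specified on the command line.

Create Flink's data directories on HDFS in advance.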
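
A minimal sketch of that step, assuming the four HDFS paths used in this article's configuration (/flink/ha, /flink/flink-checkpoints, /flink/flink-savepoints, /flink/log/completed-jobs) and assuming that the default file system of the hdfs client is the nameservice1 nameservice. The function name flink_hdfs_dirs is made up for illustration.

```shell
# HDFS directories referenced by the configuration in this article.
# flink_hdfs_dirs is a hypothetical helper name.
flink_hdfs_dirs() {
  printf '%s\n' \
    /flink/ha \
    /flink/flink-checkpoints \
    /flink/flink-savepoints \
    /flink/log/completed-jobs
}

if command -v hdfs >/dev/null 2>&1; then
  # Create each directory, including missing parents.
  flink_hdfs_dirs | while read -r dir; do
    hdfs dfs -mkdir -p "$dir"
  done
else
  # No hdfs client on this host: only list what would be created.
  echo "hdfs CLI not found; directories to create:"
  flink_hdfs_dirs
fi
```

Run it as a user that is allowed to write under /flink, and make sure the user that submits Flink jobs can write there too.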


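The settings below go into conf/flink-conf.yaml.

# RPC address used by the JobManager, the TaskManagers, and other clients; TaskManagers use it
# to connect to the JobManager/ResourceManager. Not needed in HA mode: the candidates are listed
# in the masters file and ZooKeeper elects the leader and the standby.

jobmanager.rpc.address: 192.168.101.1

# RPC port used by the JobManager, the TaskManagers, and other clients; TaskManagers use it
# to connect to the JobManager/ResourceManager. Not needed in HA mode: the candidates are listed
# in the masters file and ZooKeeper elects the leader and the standby.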


jobmanager.rpc.port: 6123


# Total process memory of the JobManager (the whole process, not only the JVM heap); 4 GB here

jobmanager.memory.process.size: 4096m


# Total process memory of each TaskManager (the whole process, not only the JVM heap); set as needed

taskmanager.memory.process.size: 16384m

# To exclude JVM metaspace and overhead, please, use total Flink memory size instead of 'taskmanager.memory.process.size'.
# It is not recommended to set both 'taskmanager.memory.process.size' and Flink memory.
#
# taskmanager.memory.flink.size: 1280m

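# Number of task slots offered by each TaskManager (TM)
# Parallelism = number of TMs x slots per TM, so number of TMs = parallelism / slots per TM, rounded up.
# Once the slots per TM reach the job's parallelism, only one TM is started.
# Reportedly this should not exceed 5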

taskmanager.numberOfTaskSlots: 3


# Default parallelism; roughly match the number of CPU cores

parallelism.default: 4


#==============================================================================
# High Availability
#==============================================================================

# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: zookeeper


# The path where metadata for master recovery is persisted. While ZooKeeper stores
# the small ground truth for checkpoint and leader election, this location stores
# the larger objects, like persisted dataflow graphs.
# 
# Must be a durable file system that is accessible from all nodes
# (like HDFS, S3, Ceph, nfs, ...) 
#
# Where the JobManager metadata is stored
high-availability.storageDir: hdfs://nameservice1/flink/ha/


# ZooKeeper quorum addresses
high-availability.zookeeper.quorum:  master1:2181,master2:2181,core1:2181


# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# It can be either "creator" (ZOO_CREATE_ALL_ACL) or "open" (ZOO_OPEN_ACL_UNSAFE)
# The default value is "open" and it can be changed to "creator" if ZK security is enabled
#
# high-availability.zookeeper.client.acl: open

#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
# rocksdb is the backend officially recommended for production
state.backend: rocksdb

# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
# Default directory for checkpoint data files and metadata
state.checkpoints.dir: hdfs://nameservice1/flink/flink-checkpoints

# Default target directory for savepoints, optional.
# Default directory for savepoints; set up the same way as the checkpoint directory above
state.savepoints.dir: hdfs://nameservice1/flink/flink-savepoints

# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend). 
#
# state.backend.incremental: false

# The failover strategy, i.e., how the job computation recovers from task failures.
# Only restart tasks that may have been affected by the task failure, which typically includes
# downstream tasks and potentially upstream tasks if their produced data is no longer available for consumption.

jobmanager.execution.failover-strategy: region

#==============================================================================
# Rest & web frontend
#==============================================================================

# The port to which the REST client connects to. If rest.bind-port has
# not been specified, then the server will bind to this port as well.
#
rest.port: 8083

# The address to which the REST client will connect to
#
#rest.address: 0.0.0.0

# Port range for the REST and web server to bind to.
#
#rest.bind-port: 8080-8090

# The address that the REST & web server binds to
#
#rest.bind-address: 0.0.0.0

# Flag to specify whether job submission is enabled from the web-based
# runtime monitor. Uncomment to disable.

#web.submit.enable: false

#==============================================================================
# Advanced
#==============================================================================

# Override the directories for temporary files. If not specified, the
# system-specific Java temporary directory (java.io.tmpdir property) is taken.
#
# For framework setups on Yarn or Mesos, Flink will automatically pick up the
# containers' temp directories without any need for configuration.
#
# Add a delimited list for multiple directories, using the system directory
# delimiter (colon ':' on unix) or a comma, e.g.:
#     /data1/tmp:/data2/tmp:/data3/tmp
#
# Note: Each directory entry is read from and written to by a different I/O
# thread. You can include the same directory multiple times in order to create
# multiple I/O threads against that directory. This is for example relevant for
# high-throughput RAIDs.
#
# io.tmp.dirs: /tmp

# The classloading resolve order. Possible values are 'child-first' (Flink's default)
# and 'parent-first' (Java's default).
#
# Child first classloading allows users to use different dependency/library
# versions in their application than those in the classpath. Switching back
# to 'parent-first' may help with debugging dependency issues.
#
# classloader.resolve-order: child-first

# The amount of memory going to the network stack. These numbers usually need 
# no tuning. Adjusting them may be necessary in case of an "Insufficient number
# of network buffers" error. The default min is 64MB, the default max is 1GB.
# 
# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb

#==============================================================================
# Flink Cluster Security Configuration
#==============================================================================

# Kerberos authentication for various components - Hadoop, ZooKeeper, and connectors -
# may be enabled in four steps:
# 1. configure the local krb5.conf file
# 2. provide Kerberos credentials (either a keytab or a ticket cache w/ kinit)
# 3. make the credentials available to various JAAS login contexts
# 4. configure the connector to use JAAS/SASL

# The below configure how Kerberos credentials are provided. A keytab will be used instead of
# a ticket cache if the keytab path and principal are set.

# security.kerberos.login.use-ticket-cache: true
# security.kerberos.login.keytab: /path/to/kerberos/keytab
# security.kerberos.login.principal: flink-user

# The configuration below defines which JAAS login contexts

# security.kerberos.login.contexts: Client,KafkaClient

#==============================================================================
# ZK Security Configuration
#==============================================================================

# Below configurations are applicable if ZK ensemble is configured for security

# Override below configuration to provide custom ZK service name if configured
# zookeeper.sasl.service-name: zookeeper

# The configuration below must match one of the values set in "security.kerberos.login.contexts"
# zookeeper.sasl.login-context-name: Client

#==============================================================================
# HistoryServer
#==============================================================================

# The HistoryServer is started and stopped via bin/historyserver.sh (start|stop)

# Directory to upload completed jobs to. Add this directory to the list of
# monitored directories of the HistoryServer as well (see below).
# Where archives of completed jobs are stored
jobmanager.archive.fs.dir: hdfs://nameservice1/flink/log/completed-jobs/

# The address under which the web-based HistoryServer listens.
historyserver.web.address: 192.168.102.1

# The port under which the web-based HistoryServer listens.
historyserver.web.port: 8082

# Comma separated list of directories to monitor for completed jobs.
# Same location as jobmanager.archive.fs.dir above
historyserver.archive.fs.dir: hdfs://nameservice1/flink/log/completed-jobs/

# Interval in milliseconds for refreshing the monitored directories.
historyserver.archive.fs.refresh-interval: 10000
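
The slot comment in the configuration above relates parallelism, slots per TaskManager, and the number of TaskManagers. As a rough sketch of that arithmetic (assuming default slot sharing, where a job needs as many slots as its highest operator parallelism), the hypothetical helper tm_count below rounds the division up:

```shell
# Number of TaskManagers needed: ceil(parallelism / slots per TM), using integer arithmetic.
# tm_count is a hypothetical helper name used only for this illustration.
tm_count() {
  parallelism=$1
  slots=$2
  echo $(( (parallelism + slots - 1) / slots ))
}

# Values from the configuration above: parallelism.default 4, taskmanager.numberOfTaskSlots 3.
tm_count 4 3   # prints 2
```

With the values in this article, a job at the default parallelism would therefore ask Yarn for two TaskManager containers of 16384m each, plus one JobManager container of 4096m.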

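Now you can run a command to test the setup.

Run this directly on the command line:

export HADOOP_CLASSPATH=`hadoop classpath`

I have added Flink to my environment variables. If you have not, run bin/flink from the Flink installation directory instead. Use the official example to try it out: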

flink run -m yarn-cluster ./examples/batch/WordCount.jar

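If it prints a batch of output without any errors, the installation succeeded.

You can find the Flink job in both the Yarn UI and the CDH UI.

Note: in my setup the command had to be run on the Hadoop master server. The client looks up the Yarn address from local configuration only, and I did not find a configuration option for it.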
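
The Flink client finds the ResourceManager through the Hadoop client configuration on the local host (the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables and the Hadoop classpath), which is likely why submission only worked on a node that has those files. The pre-flight check below is a sketch under assumptions: /etc/hadoop/conf is assumed as the fallback directory (typical for CDH gateway nodes), and has_yarn_conf is a made-up helper name.

```shell
# Return 0 when the given directory looks like a Hadoop client configuration directory.
# has_yarn_conf is a hypothetical helper name.
has_yarn_conf() {
  [ -f "$1/yarn-site.xml" ] && [ -f "$1/core-site.xml" ]
}

# /etc/hadoop/conf is an assumed default; HADOOP_CONF_DIR wins when it is set.
CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

if has_yarn_conf "$CONF_DIR"; then
  echo "Hadoop client config found in $CONF_DIR"
else
  echo "No yarn-site.xml and core-site.xml in $CONF_DIR; the Flink client may not find Yarn from this host"
fi

if [ -z "${HADOOP_CLASSPATH:-}" ]; then
  echo "HADOOP_CLASSPATH is empty; run: export HADOOP_CLASSPATH=\$(hadoop classpath)"
fi
```

If the check passes on a host other than the master, submitting from that host is worth a try.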

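That completes the deployment and use of Flink on Yarn in high-availability mode.

This post was written with the help of documentation found online. While actually setting things up, I started to suspect that on-Yarn mode may not need any of this configuration: the official documentation is very short (download the package, set the environment variable, run Flink). Next time I set up a cluster I will check whether on-Yarn mode really needs no extra configuration.

In on-Yarn mode the web UI address is not fixed. It is printed on the last line of the output, and you can copy it into a browser. If anyone knows how to get a permanent web UI, please tell me; the current situation is really inconvenient.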
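
One possible workaround, offered as a suggestion rather than something tested in this setup: the Yarn ResourceManager proxies the web UI of every running application under /proxy/<application id>/, so the address stays stable for the lifetime of the job no matter which node hosts the JobManager. The application id is shown by yarn application -list. In the sketch below, yarn_proxy_url is a made-up helper, hadoop1 stands in for the ResourceManager host, 8088 is Yarn's default ResourceManager web port, and the application id is a placeholder.

```shell
# Build the ResourceManager web-proxy URL for a Yarn application.
# yarn_proxy_url is a hypothetical helper; the port defaults to 8088 (Yarn's default RM web port).
yarn_proxy_url() {
  rm_host=$1
  app_id=$2
  port=${3:-8088}
  echo "http://${rm_host}:${port}/proxy/${app_id}/"
}

# Placeholder host and application id, for illustration only.
yarn_proxy_url hadoop1 application_1234567890123_0001
```

For finished jobs, the HistoryServer configured above (started with bin/historyserver.sh start, port 8082 in this article) offers a fixed address.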
