Flink Overview and Setup
A Brief History of Flink
First-generation big data processing: Hadoop MapReduce (2006) for batch over HDFS; Apache Storm (became an Apache top-level project in September 2014) for streaming.
Second-generation big data processing: Spark (February 2014), with RDDs for batch and DStreams for streaming (streams simulated with micro-batches, hence higher latency).
Third-generation big data processing: Flink (December 2014), with DataStream for streaming and DataSet for batch, featuring high throughput and low latency.
Like Spark, Flink uses a DAG model to split jobs into tasks and performs computation in memory, but Flink is a pure streaming engine. Its design is the opposite of Spark's: whereas Spark builds stream processing on top of batch processing, Flink builds batch processing on top of stream processing.
Architecture Overview (√)
Recommended reading: https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html
Tasks & Operator Chains
"Operator Chain": merges multiple operators into a single Task, avoiding unnecessary inter-task communication and handover overhead.
"Task": analogous to a Stage in Spark; a job is split into several Tasks.
"SubTask": each Task has an execution parallelism, and is split into that many SubTasks (each roughly equivalent to a thread).
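The relationship between tasks, subtasks, and chaining can be illustrated with a toy model (plain Python, not the Flink API; the operator names and parallelisms are made up):

```python
# Toy model (NOT Flink code): each task with parallelism p is executed
# by p subtasks (threads). Chaining fuses adjacent operators into one
# task, so records are handed over within a thread instead of being
# exchanged between tasks.

def count_subtasks(tasks):
    """Total subtasks = sum of each task's parallelism."""
    return sum(parallelism for _, parallelism in tasks)

# Without chaining: source -> map -> sink are three separate tasks.
unchained = [("source", 2), ("map", 2), ("sink", 1)]

# With chaining: source and map are fused into a single task.
chained = [("source->map", 2), ("sink", 1)]

print(count_subtasks(unchained))  # 5
print(count_subtasks(chained))    # 3
```

Fewer subtasks for the same logic is exactly why chaining reduces scheduling and data-exchange overhead.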
Effect of operator chaining
Disabling operator chaining
Job Managers, Task Managers, Clients
JobManagers
: known as the Master; responsible for distributed task scheduling, dispatching Tasks for execution, coordinating checkpoints, and performing failure recovery.
There is always at least one JobManager. A high-availability setup will have multiple JobManagers, one of which is always the leader, and the others are standby.
TaskManagers
: known as Worker nodes; responsible for executing the SubTasks of the dataflow graph (DAG) and for data buffering and exchange. A TaskManager actively connects to the JobManager, announces its own status, and reports on the tasks already assigned to it.
There must always be at least one TaskManager.
Clients
: not part of the runtime (unlike the Spark Driver); responsible for building the dataflow and submitting it to the JobManager.
Jobs can be submitted through the WebUI or with flink run:
[root@HadoopNode00 flink-1.8.1]# ./bin/flink run \
--class com.baizhi.quickstart.FlinkWordCounts \
--detached \
--parallelism 3 \
--jobmanager HadoopNode00:8081 \
/root/flink-1.0-SNAPSHOT.jar
[root@HadoopNode00 flink-1.8.1]# ./bin/flink list --jobmanager HadoopNode00:8081 # list jobs
------------------ Running/Restarting Jobs -------------------
23.12.2019 14:59:10 : 2fce0e6f136fac71ce6b1aad1ae8927e : FlinkWordCounts (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
[root@HadoopNode00 flink-1.8.1]# ./bin/flink cancel --jobmanager HadoopNode00:8081 2fce0e6f136fac71ce6b1aad1ae8927e # cancel the job
Cancelling job 2fce0e6f136fac71ce6b1aad1ae8927e.
Cancelled job 2fce0e6f136fac71ce6b1aad1ae8927e.
Task Slots & Resources
Each TaskManager is a JVM process that executes one or more SubTasks (threads). The number of Task Slots indicates how many parallel task pipelines a TaskManager can accept; the more slots a node has, the greater its compute capacity.
Each task slot represents a fixed subset of resources of the TaskManager. A TaskManager with three slots, for example, will dedicate 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not compete with subtasks from other jobs for managed memory, but instead has a certain amount of reserved managed memory. Note that no CPU isolation happens here; currently slots only separate the managed memory of tasks.
Task Slots evenly divide a compute node's managed memory. Different jobs hold different Task Slots, which gives running programs memory isolation. Before any job runs, it must be allocated a fixed number of Task Slots, equal to the highest parallelism among the job's Tasks.
- Different jobs are isolated from each other through Task Slots.
- SubTasks of different Tasks within the same job may share a slot.
- SubTasks of the same Task within the same job may not share a slot.
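The slot-sharing rules above imply a simple sizing formula, sketched here in plain Python (not Flink code; the task names and memory figures are made-up examples):

```python
# Because subtasks of DIFFERENT tasks may share a slot, while subtasks
# of the SAME task may not, a job needs exactly as many slots as its
# highest task parallelism.

def required_slots(task_parallelism):
    return max(task_parallelism.values())

job = {"source": 2, "map": 4, "sink": 1}
print(required_slots(job))  # 4 slots, even though the job has 7 subtasks

# Managed memory is split evenly between slots: a TaskManager with
# 3 slots dedicates 1/3 of its managed memory to each slot.
managed_memory_mb = 3072
slots = 3
print(managed_memory_mb // slots)  # 1024 MB per slot
```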
Resource isolation: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/#task-chaining-and-resource-groups
State Backends
Flink is a stateful computing framework. All computation state is stored either in memory or in RocksDB, depending on the chosen state backend. By default, Flink keeps state in the JobManager's memory, which is generally suitable only for testing (very small state). In production, one of two backends is typically used: filesystem or rocksdb. Under the JobManager's coordination, streaming jobs periodically run the checkpoint mechanism to store snapshots of the computation state, and the configured state backend determines how those snapshots are persisted.
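For example, switching to the filesystem backend could look like the following flink-conf.yaml fragment (the HDFS path is a placeholder; adjust it to your cluster):

```yaml
# flink-conf.yaml - hypothetical example values
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink-checkpoints
```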
Savepoints & Checkpoints
A Flink streaming job can be restored from a savepoint, which lets a program retain its pre-upgrade computation state across upgrades. Checkpoints are triggered periodically by the JobManager and store the computation state in the state backend; once the latest checkpoint completes, the system automatically deletes older checkpoint data. A savepoint, by contrast, is a manually triggered checkpoint whose result is not deleted automatically.
[root@HadoopNode00 flink-1.8.1]# ./bin/flink cancel --jobmanager HadoopNode00:8081 --withSavepoint hdfs:///flink-savepoint ee04f833b7df47bc5c4876cede2cbfb5
Cancelling job ee04f833b7df47bc5c4876cede2cbfb5 with savepoint to hdfs:///flink-savepoint.
Cancelled job ee04f833b7df47bc5c4876cede2cbfb5. Savepoint stored in hdfs://HadoopNode00:9000/flink-savepoint/savepoint-ee04f8-2b314a0076aa.
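The retention difference between the two can be sketched as a toy simulation (plain Python, not Flink internals; the IDs and the retain-one policy are illustrative):

```python
# Toy simulation: completed checkpoints are rotated away once a newer
# one finishes, while savepoints stay until the user deletes them.

retained_checkpoints = []
savepoints = []

def complete_checkpoint(cp_id, keep=1):
    retained_checkpoints.append(cp_id)
    # once a newer checkpoint completes, older ones are dropped
    del retained_checkpoints[:-keep]

def trigger_savepoint(sp_id):
    savepoints.append(sp_id)  # never removed automatically

for i in range(1, 4):
    complete_checkpoint(i)
trigger_savepoint("sp-1")
complete_checkpoint(4)

print(retained_checkpoints)  # [4] - only the latest checkpoint survives
print(savepoints)            # ['sp-1'] - the savepoint is still there
```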
Flink HA
Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/jobmanager_high_availability.html
Set CentOS process and open-file limits (takes effect after reboot) - optional
[root@HadoopNode00 ~]# vi /etc/security/limits.conf
* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800
Configure the hostname (takes effect after reboot)
[root@HadoopNode00 ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=HadoopNode00
Map the hostname to its IP address
[root@HadoopNode00 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.126.10 HadoopNode00
Disable the firewall
[root@HadoopNode00 ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@HadoopNode00 ~]# chkconfig iptables off
[root@HadoopNode00 ~]# chkconfig --list | grep iptables
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
- Install JDK 1.8 and configure JAVA_HOME (~/.bashrc) - omitted
- Configure passwordless SSH authentication - omitted
- Install and configure Hadoop; set HADOOP_HOME and HADOOP_CLASSPATH (~/.bashrc) - omitted
- Install and configure Flink
1. Extract flink-1.8.1-bin-scala_2.11.tgz into /home/flink
[root@HadoopNode00 ~]# mkdir /home/flink
[root@HadoopNode00 ~]# tar -zxf flink-1.8.1-bin-scala_2.11.tgz -C /home/flink/
[root@HadoopNode00 ~]# cd /home/flink/flink-1.8.1/
[root@HadoopNode00 flink-1.8.1]# ls -l
total 628
drwxr-xr-x. 2 502 games 4096 Dec 23 11:26 bin # launch scripts
drwxr-xr-x. 2 502 games 4096 Jun 25 16:10 conf # configuration directory
drwxr-xr-x. 6 502 games 4096 Dec 23 11:26 examples # example programs
drwxr-xr-x. 2 502 games 4096 Dec 23 11:26 lib # core dependency jars
-rw-r--r--. 1 502 games 11357 Jun 14 2019 LICENSE
drwxr-xr-x. 2 502 games 4096 Dec 23 11:26 licenses
drwxr-xr-x. 2 502 games 4096 Jun 24 23:02 log # startup/runtime logs; check here on errors
-rw-r--r--. 1 502 games 596009 Jun 24 23:02 NOTICE
drwxr-xr-x. 2 502 games 4096 Dec 23 11:26 opt # optional Flink jars; copy into lib when needed
-rw-r--r--. 1 502 games 1308 Jun 14 2019 README.txt
[root@HadoopNode00 flink-1.8.1]# tree conf/
conf/
├── flink-conf.yaml # main configuration file √
├── log4j-cli.properties
├── log4j-console.properties
├── log4j.properties
├── log4j-yarn-session.properties
├── logback-console.xml
├── logback.xml
├── logback-yarn.xml
├── masters # master node list; not needed in a single-node setup
├── slaves # worker (compute) node list √
├── sql-client-defaults.yaml
└── zoo.cfg
2. Configure the slaves file
[root@HadoopNode00 flink-1.8.1]# vi conf/slaves
HadoopNode00
3. Configure flink-conf.yaml
#==============================================================================
# Common
#==============================================================================
jobmanager.rpc.address: HadoopNode00
# number of task slots offered by this worker
taskmanager.numberOfTaskSlots: 4
# default parallelism for jobs
parallelism.default: 3
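As a quick sanity check on these two values (illustrative only, assuming the single-node setup above): the default parallelism must fit into the cluster's total slots, or jobs cannot be scheduled.

```python
# Sanity check for the configuration above.
task_managers = 1            # single-node setup
slots_per_tm = 4             # taskmanager.numberOfTaskSlots
default_parallelism = 3      # parallelism.default

total_slots = task_managers * slots_per_tm
print(total_slots)                         # 4
print(default_parallelism <= total_slots)  # True: jobs can be scheduled
```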
4. Start the Flink services
[root@HadoopNode00 flink-1.8.1]# ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host HadoopNode00.
Starting taskexecutor daemon on host HadoopNode00.
[root@HadoopNode00 flink-1.8.1]# jps
10833 TaskManagerRunner
10340 StandaloneSessionClusterEntrypoint
10909 Jps
- Set up an HDFS HA cluster and verify it starts correctly
- Configure HADOOP_CLASSPATH
- Configure Flink HA (prepare HadoopNode01~03)
[root@HadoopNodeXX ~]# mkdir /home/flink
[root@HadoopNodeXX ~]# tar -zxf flink-1.8.1-bin-scala_2.11.tgz -C /home/flink
[root@HadoopNodeXX ~]# cd /home/flink/
[root@HadoopNodeXX flink]# cd flink-1.8.1/
[root@HadoopNodeXX flink-1.8.1]# vi conf/masters
HadoopNode01:8081
HadoopNode02:8081
HadoopNode03:8081
[root@HadoopNodeXX flink-1.8.1]# vi conf/slaves
HadoopNode01
HadoopNode02
HadoopNode03
[root@HadoopNodeXX flink-1.8.1]# vi conf/flink-conf.yaml
taskmanager.numberOfTaskSlots: 4
parallelism.default: 3
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: HadoopNode01:2181,HadoopNode02:2181,HadoopNode03:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /default_ns
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink-checkpoints
state.savepoints.dir: hdfs:///flink-savepoints
state.backend.incremental: false
state.backend.rocksdb.ttl.compaction.filter.enabled: true
Start the Flink HA cluster
[root@HadoopNode01 flink-1.8.1]# ./bin/start-cluster.sh
Starting HA cluster with 3 masters.
Starting standalonesession daemon on host HadoopNode01.
Starting standalonesession daemon on host HadoopNode02.
Starting standalonesession daemon on host HadoopNode03.
Starting taskexecutor daemon on host HadoopNode01.
Starting taskexecutor daemon on host HadoopNode02.
Starting taskexecutor daemon on host HadoopNode03.
An individual JobManager can also be started or stopped with ./bin/jobmanager.sh start|stop.