Apache-Flink
Overview
Flink is a stateful stream-computing framework built on data streams, and is generally regarded as the third generation of big data analytics solutions.
- First generation: Hadoop MapReduce (batch) plus Storm stream computing (2014.9); two separate compute engines, hard to use
- Second generation: Spark RDD static batch processing (2014.2) plus DStream / Structured Streaming for streaming; one unified engine, relatively easy to use
- Third generation: Flink DataStream (2014.12) stream-computing framework plus Flink DataSet batch processing; one unified engine, medium difficulty
Spark and Flink were born at almost the same time, yet Flink developed more slowly at first: early understanding of big data analytics was not very deep, and most business scenarios were confined to batch processing, so Flink grew more slowly than Spark. Only around 2016 did people gradually realize the importance of stream computing.
Typical stream-computing domains: system monitoring, public-opinion monitoring, traffic prediction, the national power grid, disease prediction, banking/financial risk control, etc.
Reference: https://blog.csdn.net/weixin_38231448/article/details/100062961
Spark vs Flink strategy

| Framework | Compute layer | Tooling layer |
|---|---|---|
| Spark | Static batch processing | Spark SQL, Spark Streaming, Spark ML, Spark GraphX |
| Flink | Stateful computation over data streams | Table API, DataSet API (batch), Flink ML, Flink Gelly |
Runtime Architecture
Concepts
Task and Operator Chain
Flink is a distributed stream-computing engine. It splits a job into several Tasks (comparable to Stages in Spark), and each Task has its own parallelism, with each parallel instance executed by a thread. Because a Task runs in parallel, it maps to a group of threads at runtime; Flink calls these threads the subtasks of that Task. Unlike Spark, which derives Stage boundaries from RDD dependencies, Flink splits Tasks using the concept of the Operator Chain: when Flink builds the job graph it tries to chain multiple operators into a single Task, reducing thread-to-thread data transfer overhead.
Currently Flink creates Operator Chains in two ways: forward and hash|rebalance (a code sketch of the chaining hints follows the list below).

- Task: comparable to a Stage in Spark; each Task consists of several SubTasks
- SubTask: equivalent to a thread; a sub-unit of work within a Task
- OperatorChain: the mechanism that merges multiple operators into one Task; the merging rules are similar to the narrow/wide dependency rules of Spark RDDs
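Chaining can also be influenced from user code. The sketch below (added here, not part of the original notes) reuses the socket source from the QuickStart and shows the DataStream API's disableChaining() and startNewChain() hints:

import org.apache.flink.streaming.api.scala._

object OperatorChainHints {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("hbase", 9999)

    val counts = text
      .flatMap(line => line.split("\\s+")).disableChaining() // exclude this operator from any chain
      .map(word => (word, 1)).startNewChain()                // force a new chain to start at this operator
      .keyBy(0)
      .sum(1)

    counts.print()
    env.execute("Operator Chain Hints")
  }
}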
JobManagers, TaskManagers, Clients
- JobManagers (also called masters): coordinate distributed execution, schedule tasks, coordinate checkpoints and failure recovery, etc. Roughly equivalent to Master + Driver in Spark. A cluster has at least one active JobManager; in an HA setup the others are in standby state.
(also called masters) coordinate the distributed execution. They schedule tasks, coordinate checkpoints, coordinate recovery on failures, etc. There is always at least one JobManager. A high-availability setup will have multiple JobManagers, one of which is always the leader, and the others are standby.
- TaskManagers (also called workers): the compute nodes that actually execute Tasks; they also report their status and workload to the JobManager. A cluster usually has several TaskManagers.
(also called workers) execute the tasks (or more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager.
- Clients: unlike in Spark, the Flink client is not part of the cluster computation. It only submits the job's dataflow graph to the JobManager and may exit right after submission; it plays no role in scheduling while the job runs.
The client is not part of the runtime and program execution, but is used to prepare and send a dataflow to the JobManager. After that, the client can disconnect, or stay connected to receive progress reports.

Task Slots and Resources
Each Worker (TaskManager) is a JVM process that can execute one or more subtasks (threads/SubTasks). To control how many Tasks a Worker node accepts, Flink introduces the Task Slot, which expresses the compute capacity of a node (every node has at least one Task Slot).
Each Task Slot represents a fixed subset of the TaskManager's resources. For example, if a TaskManager has 3 Task Slots, each slot stands for 1/3 of the memory of that TaskManager process. Every job gets its own fixed set of Task Slots at startup, which prevents memory contention between jobs at runtime: the allocated slots can only be used by the Tasks of that job, and there is no resource sharing or contention between Tasks of different jobs.
A Job, however, is split into several Tasks, and each Task consists of several SubTasks (depending on the Task's parallelism). By default the memory behind a Task Slot is shared only among subtasks of different Tasks within the same Job: two subtasks of the same Task can never run in the same slot, while SubTasks of different Tasks of the same job can. If subtasks of different Tasks of the same Job did not share slots, resources would be wasted. For example, in the figure below (not included here), source and map are resource-light operations because they use little memory, whereas keyBy/window()/apply() involve a shuffle, consume a lot of memory, and are resource-intensive.

Flink's default behaviour is therefore to let subtasks of different Tasks share Task Slot resources. The user can then raise the parallelism of the source/map and keyBy/window()/apply() tasks, for example from 2 (as in the figure above) to 6, and Flink will allocate resources accordingly (figure not included here).

It follows that Flink by default tries to share Task Slots among SubTasks of different Tasks of the same job, which means the number of Task Slots a Job needs equals the maximum Task parallelism within that Job. Users can also override the slot-sharing strategy between Tasks from their program (see the sketch after the conclusion below).
Conclusion: the number of resources a Flink job needs is computed automatically and does not have to be specified by the user; the user only needs to set the parallelism.
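A minimal sketch (added here, not part of the original notes) of overriding the default slot sharing with slotSharingGroup() and adjusting per-operator parallelism; the group name "aggregation" is just an illustrative choice:

import org.apache.flink.streaming.api.scala._

object SlotSharingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("hbase", 9999)

    val counts = text
      .flatMap(line => line.split("\\s+")).setParallelism(2)
      .map(word => (word, 1)).setParallelism(2)
      .keyBy(0)
      // operators from here on go into their own slot sharing group,
      // so their subtasks will not share slots with the source/map subtasks
      .sum(1).setParallelism(6).slotSharingGroup("aggregation")

    counts.print()
    env.execute("Slot Sharing Sketch")
  }
}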
State Backends
Flink is a stateful stream-computing engine. The exact data structure that stores the key/value state index depends on the chosen State Backend: for example, the Memory State Backend keeps the data in an in-memory HashMap, while RocksDB (an embedded NoSQL store, similar in spirit to Derby) can be used as the State Backend for storing state. Besides defining the data structure that holds the state, a State Backend also implements the logic to take a point-in-time snapshot of the key/value state and store that snapshot as part of a Checkpoint.
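As a sketch of how a backend could be chosen in code (assumptions: an HDFS directory such as hdfs://hbase:9000/flink-checkpoints is available, and the RocksDB variant would additionally need the flink-statebackend-rocksdb_2.11 dependency):

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object StateBackendSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // keep working state on the TaskManager heap, write snapshots (checkpoints) to HDFS
    env.setStateBackend(new FsStateBackend("hdfs://hbase:9000/flink-checkpoints"))
    // alternative (needs the flink-statebackend-rocksdb_2.11 dependency):
    // env.setStateBackend(new org.apache.flink.contrib.streaming.state.RocksDBStateBackend(
    //   "hdfs://hbase:9000/flink-checkpoints", true))

    // checkpoint every 5 seconds so the backend actually takes snapshots
    env.enableCheckpointing(5000)

    env.socketTextStream("hbase", 9999)
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
      .print()

    env.execute("State Backend Sketch")
  }
}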

Savepoints
Programs written with the DataStream API can resume execution from a Savepoint. Savepoints make it possible to update both the program and the Flink cluster without losing any state.
A Savepoint is a manually triggered Checkpoint: it takes a snapshot of the program and writes it to the State Backend, relying on the regular checkpointing mechanism. A Checkpoint is a snapshot that the running program periodically takes on the worker nodes. For recovery only the latest completed Checkpoint is needed, and older Checkpoints can be safely discarded as soon as a new one completes.
Savepoints are similar to these periodic Checkpoints, except that they are triggered by the user and do not expire automatically when newer Checkpoints complete. A Savepoint can be created from the command line, or via the REST API when cancelling a job.
Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.10/concepts/runtime.html
Environment Setup
Download: https://www.apache.org/dyn/closer.lua/flink/flink-1.10.0/flink-1.10.0-bin-scala_2.11.tgz
Prerequisites
- JDK 1.8+ with JAVA_HOME configured
- Hadoop installed and running correctly, with passwordless SSH and HADOOP_HOME configured
Flink Installation (Standalone)
- Upload and extract
[root@hbase ~]# tar -zxvf flink-1.10.0-bin-scala_2.11.tgz -C /usr/soft flink-1.10.0
[root@hbase flink-1.10.0]# tree -L 1 ./
./
!"" bin #执⾏脚本⽬录
!"" conf #配置⽬录
!"" examples #案例jar
!"" lib # 依赖的jars
!"" LICENSE
!"" licenses
!"" log # 运⾏⽇志
!"" NOTICE
!"" opt # 第三⽅备⽤插件包
!"" plugins
#"" README.txt
8 directories, 3 files
- Configure flink-conf.yaml
[root@hbase conf]# vim flink-conf.yaml
#==============================================================================
# Common
#==============================================================================
jobmanager.rpc.address: hbase
taskmanager.numberOfTaskSlots: 4
parallelism.default: 3
- Configure slaves
[root@hbase conf]# vim slaves
hbase
- Start Flink
[root@hbase flink-1.10.0]# ./bin/start-cluster.sh
[root@hbase ~]# jps
3444 NameNode
4804 StandaloneSessionClusterEntrypoint
5191 Jps
3544 DataNode
5145 TaskManagerRunner
3722 SecondaryNameNode
- Verify that the cluster started successfully
Open Flink's web UI at http://hbase:8081

QuickStart
- Add dependencies
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
- Client program
import org.apache.flink.streaming.api.scala._

object FlinkWordCountQiuckStart {
  def main(args: Array[String]): Unit = {
    //1. create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. create the DataStream (the source; refine as needed)
    val text = env.socketTextStream("hbase",9999)
    //3. apply DataStream transformation operators
    val counts = text.flatMap(line=>line.split("\\s+"))
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. print the result to the console
    counts.print()
    //5. run the stream computation
    env.execute("Window Stream WordCount")
  }
}
- Add the Maven build plugins
<build>
    <plugins>
        <!-- Scala compiler plugin -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.0.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- fat-jar (shade) plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
- Package with mvn package
Note: start the nc service before submitting the job, otherwise the job will fail.
- Submit the job via the web UI

- Check the running result

Program Deployment
Local execution
//1. create a local stream execution environment with parallelism 3
val env = StreamExecutionEnvironment.createLocalEnvironment(3)
//2. create the DataStream
val text = env.socketTextStream("hbase",9999)
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
[root@hbase ~]# nc -lk 9999
this is demo
1> (this,1)
1> (demo,1)
3> (is,1)
Remote deployment
//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val text = env.socketTextStream("hbase",9999)
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
[root@hbase ~]# nc -lk 9999
hello flink
4> (flink,1)
2> (hello,1)
StreamExecutionEnvironment.getExecutionEnvironment detects the runtime environment automatically: when the program runs inside the IDE it switches to local mode and uses the machine's maximum number of threads as the default parallelism, equivalent to local[*] in Spark. In a production environment the user should specify the parallelism when submitting the job.
- Deployment options
- Web UI deployment (omitted)
- Deployment via the CLI script
[root@hbase flink-1.10.0]# ./bin/flink run
--class qiuck_start.FlinkWordCountQuickStart
--detached # submit in detached (background) mode
--parallelism 4 # default parallelism of the program
--jobmanager hbase:8081 # target JobManager to submit to
/root/ # directory where the jar is stored
Job has been submitted with JobID b84b11c64018ffd303e5370bb5a9bf44
Check running jobs
[root@hbase flink-1.10.0]# ./bin/flink list --running --jobmanager hbase:8081
Waiting for response...
------------------ Running/Restarting Jobs -------------------
05.03.2020 12:50:47 : b84b11c64018ffd303e5370bb5a9bf44 : Window Stream WordCount (RUNNING)
--------------------------------------------------------------
Cancel a job
[root@hbase flink-1.10.0]# ./bin/flink cancel --jobmanager hbase:8081 b84b11c64018ffd303e5370bb5a9bf44
Cancelling job b84b11c64018ffd303e5370bb5a9bf44.
Cancelled job b84b11c64018ffd303e5370bb5a9bf44.
View the program's execution plan
[root@hbase flink-1.10.0]# ./bin/flink info --class qiuck_start.FlinkWordCountQuickStart --parallelism 4 /root/flink-1.0-SNAPSHOT.jar
----------------------- Execution Plan -----------------------
{
"nodes":[{
"id":1,"type":"Source: Socket Stream","pact":"Data Source","contents":"Source: Socket Stream","parallelism":1},{
"id":2,"type":"Flat Map","pact":"Operator","contents":"Flat Map","parallelism":4,"predecessors":[{
"id":1,"ship_strategy":"REBALANCE","side":"second"}]},{
"id":3,"type":"Map","pact":"Operator","contents":"Map","parallelism":4,"predecessors":[{
"id":2,"ship_strategy":"FORWARD","side":"second"}]},{
"id":5,"type":"aggregation","pact":"Operator","contents":"aggregation","parallelism":4,"predecessors":[{
"id":3,"ship_strategy":"HASH","side":"second"}]},{
"id":6,"type":"Sink: Print to Std. Out","pact":"Data Sink","contents":"Sink: Print to Std. Out","parallelism":4,"predecessors":[{
"id":5,"ship_strategy":"FORWARD","side":"second"}]}]}
--------------------------------------------------------------
No description provided.
Paste the JSON above into https://flink.apache.org/visualizer/ to view the Flink execution plan graph.

Cross-platform submission
//1. create a remote stream execution environment, shipping the packaged jar
var jars="D:\\ideaProject\\bigdataCodes\\Flink\\target\\flink-1.0-SNAPSHOT.jar"
// connect to the JobManager's REST endpoint (port 8081 in this setup), not the nc port
val env = StreamExecutionEnvironment.createRemoteEnvironment("hbase",8081,jars)
// set the default parallelism
env.setParallelism(4)
//2. create the DataStream
val text = env.socketTextStream("hbase",9999)
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
Before running, repackage the program with mvn; then run the main function directly.
Streaming(DataStream API)
DataSource
A data source is where the program reads its data from. A SourceFunction is added to the program via env.addSource(SourceFunction). Flink ships with many built-in SourceFunction implementations, but users can also implement the SourceFunction interface themselves (non-parallel) or the ParallelSourceFunction interface (parallel); if state management is needed they can extend RichParallelSourceFunction.
File-based
- readTextFile(path)
Reads(once) text files, i.e. files that respect the TextInputFormat specification, line-by-line and returns them as Strings.
//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
val text:DataStream[String]=env.readTextFile("hdfs://hbase:9000/demo/word/words")
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
4> (is,1)
3> (good,1)
4> (day,1)
1> (this,1)
2> (demo,1)
4> (day,2)
3> (good,2)
3> (study,1)
2> (up,1)
- readFile(fileInputFormat,path)
Reads (once) files as dictated by the specified file input format.
import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat

//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String]=env.readFile(inputFormat,"hdfs://hbase:9000/demo/word/words")
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split(","))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
4> (is,1)
3> (good,1)
4> (day,1)
1> (this,1)
2> (demo,1)
4> (day,2)
3> (good,2)
3> (study,1)
2> (up,1)
- readFile(fileInputFormat,path,watchType,interval,pathFilter,typeInfo)
This is the method called internally by the two previous ones. It reads files in the path based on the
given fileInputFormat . Depending on the provided watchType , this source may periodically
monitor (every interval ms) the path for new data
( FileProcessingMode.PROCESS_CONTINUOUSLY ), or process once the data currently in the path
and exit ( FileProcessingMode.PROCESS_ONCE ). Using the pathFilter , the user can further
exclude files from being processed
import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.functions.source.FileProcessingMode

//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String]=env.readFile(inputFormat,"hdfs://hbase:9000/demo/word/words",
  FileProcessingMode.PROCESS_CONTINUOUSLY,1000)// or PROCESS_ONCE
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split(","))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
2> (demo,1)
4> (is,1)
2> (up,1)
4> (day,1)
1> (this,1)
3> (good,1)
4> (day,2)
3> (good,2)
3> (study,1)
This method watches the files in the monitored directory; if a file changes, the file is collected again, which may lead to duplicate processing of its data. In general it is not advisable to modify file contents in place; simply upload new files.
Socket Based
- socketTextStream
Reads from a socket. Elements can be separated by a delimiter.
//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream (delimiter '\n', up to 3 connection retries)
val text = env.socketTextStream("hbase",9999,'\n',3)
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
2> (hello,1)
4> (flink,1)
3> (,1)
1> (spark,1)
Collection-based
//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream from an in-memory collection
val text = env.fromCollection(List("this is demo","hello flink"))
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")
4> (flink,1)
4> (is,1)
1> (this,1)
2> (hello,1)
2> (demo,1)
UserDefinedSource
- SourceFunction (non-parallel interface)
import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

class UserDefinedNonParallelSourceFunction extends SourceFunction[String]{
  @volatile //ensure all threads see the latest value of the flag
  var isRunning:Boolean=true
  val lines:Array[String]=Array("this is demo","hello flink","spark kafka")
  //this method is run by Flink in its own thread; emit data downstream via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(1000)
      //send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  //release resources / stop the emitting loop
  override def cancel(): Unit = {
    isRunning=false
  }
}
- ParallelSourceFunction (parallel interface)
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

import scala.util.Random

class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String]{
  @volatile //ensure all threads see the latest value of the flag
  var isRunning:Boolean=true
  val lines:Array[String]=Array("this is demo","hello flink","spark kafka")
  //this method is run by Flink in its own thread (one per parallel subtask); emit data via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(1000)
      //send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  //release resources / stop the emitting loop
  override def cancel(): Unit = {
    isRunning=false
  }
}
Using the user-defined (parallel) SourceFunction:
//1. create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the DataStream from the user-defined SourceFunction
val text = env.addSource[String](new UserDefinedParallelSourceFunction)
//3. apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
  .map(word=>(word,1))
  .keyBy(0)
  .sum(1)
//4. print the result to the console
counts.print()
//5. run the stream computation
env.execute("Window Stream WordCount")