Part 1: Spark vs Flink
1. Some Spark operators are stateful.
2. Flink state exists by default: it is created inside a RichFunction, held in memory, and periodically snapshotted as a checkpoint to HDFS according to configuration.
3. Flink's advantage: it supports EventTime; Spark Streaming only supports processing time.
4. Window: TimeWindow and CountWindow.
5. Memory management: Flink does its own memory management on top of the JVM; the memory size is fixed in advance and held for the job's lifetime.
6. Checkpointing: Flink checkpoints are snapshot-based; Spark checkpoints RDDs.
Part 2: Streaming WordCount
```scala
package second.study.opertor

import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Import the implicit conversions
    import org.apache.flink.streaming.api.scala._
    // 3. Read data; the resulting DataStream[String] is analogous to Spark's DStream
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    // 4. Process the data
    val result: DataStream[(String, Int)] = stream.flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(0) // group by the word
      .sum(1)   // rolling sum of the counts
    // 5. Print the result
    result.print()
    // 6. Start the streaming job
    env.execute("wc")
  }
}
```
Flink batch processing, compared with Spark RDDs:

```scala
val dataSet: DataSet[String] = batchEnv.readTextFile("in/test.txt")
```

Note: streaming has no groupBy, only keyBy; the batch (DataSet) API is very similar to Spark's.
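The streaming pipeline above can be mirrored on a plain Scala collection (no Flink needed) to make each step's semantics concrete; `groupBy` plays the role of `keyBy(0)` and the final `map` plays the role of `sum(1)`:

```scala
// Word count over a plain Scala List, step for step like the DataStream version.
val lines = List("hello flink", "hello spark")

val counts: Map[String, Int] = lines
  .flatMap(_.split(" "))                                    // flatMap(_.split(" "))
  .map((_, 1))                                              // map((_, 1))
  .groupBy(_._1)                                            // keyBy(0)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum(1)

println(counts) // "hello" appears twice, "flink" and "spark" once each
```

The difference in the streaming case is that `sum(1)` emits a running total after every element instead of a single final map.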
Cluster architecture: ![Cluster architecture](https://img-blog.csdnimg.cn/20210227154229216.png)
Configuration file notes (for production use; note the differences from 1.10.1):
jobmanager.heap.size: memory available to the JobManager node.
taskmanager.heap.size: memory available to each TaskManager node.
taskmanager.memory.process.size: 1728m (the 1.10.1 equivalent of the setting above).
taskmanager.numberOfTaskSlots: number of slots available per machine.
parallelism.default: default parallelism of Flink jobs.
The difference between Slot and parallelism in the settings above:
A Slot is a static concept: the concurrent execution capacity a TaskManager has. It depends on the cluster configuration; a common recommendation is one slot per CPU core.
parallelism is a dynamic concept: the concurrency the program actually uses at runtime. It depends on the code.
Setting an appropriate parallelism improves throughput.
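The slot/parallelism relationship reduces to simple arithmetic; a small sketch with hypothetical cluster numbers (3 TaskManagers, 4 slots each are made-up values, not from the text):

```scala
// Static capacity: fixed by taskmanager.numberOfTaskSlots and the machine count.
val taskManagers = 3
val slotsPerTaskManager = 4
val totalSlots = taskManagers * slotsPerTaskManager // upper bound on usable parallelism

// Dynamic demand: chosen in the code (or parallelism.default).
val requestedParallelism = 10
val canSchedule = requestedParallelism <= totalSlots

println(s"total slots = $totalSlots, can schedule p=$requestedParallelism: $canSchedule")
```

If the requested parallelism exceeds the total slot count, the job cannot get enough resources to run.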
1. The Client submits the dataflow graph to the JobManager.
2. The JobManager handles resource management, task scheduling, and triggering checkpoints.
3. The TaskManager manages slots. Unlike MapReduce, which uses multiple processes, Flink uses multiple threads (which can keep per-thread context via ThreadLocal).
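The ThreadLocal mechanism mentioned above can be shown with a tiny stdlib-only sketch: each thread sees its own copy of the value, which is why tasks running as threads inside one TaskManager process do not interfere:

```scala
// Each thread gets an independent copy initialized from initialValue().
val local = new ThreadLocal[String] { override def initialValue(): String = "default" }
local.set("main-value") // only the main thread's copy changes

var seenByOtherThread: String = null
val t = new Thread(new Runnable {
  def run(): Unit = seenByOtherThread = local.get() // this thread reads its own copy
})
t.start()
t.join()

println(local.get())       // prints "main-value"
println(seenByOtherThread) // prints "default"
```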
Standalone mode:
![Standalone mode (1)](https://img-blog.csdnimg.cn/20210227155934782.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzAwMzc5Mg==,size_16,color_FFFFFF,t_70)
![Standalone mode (2)](https://img-blog.csdnimg.cn/20210227155945430.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzAwMzc5Mg==,size_16,color_FFFFFF,t_70)
YARN mode:
1. Session-Cluster: a Flink cluster is initialized in advance, and all jobs are submitted into that one cluster, sharing its resources. Flink holds on to the YARN resources for as long as the session runs.

```shell
yarn-session.sh -n 3 -s 3 -nm bjsxt -d   # start the Flink session in advance
flink run -c com.bjsxt.flink.StreamWordCount /home/Flink-Demo-1.0-SNAPSHOT.jar   # then submit directly
```

-n,--container <arg>: number of containers to allocate (i.e. the number of TaskManagers).
-D <arg>: dynamic properties.
-d,--detached: run detached in the background.
-jm,--jobManagerMemory <arg>: JobManager memory, in MB.
-nm,--name: custom name for the application on YARN.
-q,--query: show the resources (memory, CPU cores) available in YARN.
-qu,--queue <arg>: YARN queue to use.
-s,--slots <arg>: number of slots per TaskManager.
-tm,--taskManagerMemory <arg>: memory per TaskManager, in MB.
-z,--zookeeperNamespace <arg>: namespace to create in ZooKeeper for HA mode.
-id,--applicationId <yarnAppId>: YARN application ID to attach to a detached yarn session running in the background.
2. Per-Job-Cluster: no Flink cluster exists on YARN beforehand; each job starts its own Flink cluster, which shuts down when the job finishes. Resources are isolated per job.

```shell
flink run -m yarn-cluster -yn 3 -ys 3 -ynm bjsxt02 -c flink.StreamWordCount /home/Flink-Demo-1.0-SNAPSHOT.jar
```

-yn,--container <arg>: number of containers to allocate (i.e. the number of TaskManagers).
-d,--detached: run in the background.
-yjm,--jobManagerMemory <arg>: JobManager memory, in MB.
-ytm,--taskManagerMemory <arg>: memory per TaskManager, in MB.
-ynm,--name: name for this Flink application on YARN.
-yq,--query: show the resources (memory, CPU cores) available in YARN.
-yqu,--queue <arg>: YARN resource queue to use.
-ys,--slots <arg>: number of slots per TaskManager.
-yz,--zookeeperNamespace <arg>: namespace to create in ZooKeeper for HA mode.
-yid,--applicationID <yarnAppId>: YARN application ID to attach to a detached yarn session running in the background.
API usage: Source
1. Reading from HDFS:

```scala
val stream = streamEnv.readTextFile("hdfs://hadoop101:9000/wc.txt") // HDFS path
```

See the other Flink chapters for further sources and sinks.
Transformation operators:
1. def map[R: TypeInformation](fun: T => R): DataStream[R]
   TypeInformation is the top-level parent of Flink's serialization types.
   map turns a DataStream into a DataStream: the kind of stream stays the same, but the element type may change.
2. def flatMap[R: TypeInformation](fun: T => TraversableOnce[R]): DataStream[R]
   Takes one element and turns it into a traversable collection (array, list, ...) of R.
3. def keyBy(firstField: String, otherFields: String*): KeyedStream[T, JavaTuple]
4. reduce must be called on a KeyedStream. It returns a DataStream[T] of the same element type and performs a rolling aggregation.
```scala
package second.study.opertor

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

import scala.util.Random

object TransformTest {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.streaming.api.scala._
    val stream: DataStream[StationLog] = env.addSource(new MyCustomerSource)
    val result: DataStream[(String, Long)] = stream.filter(_.callType.equals("success"))
      .map(log => (log.sid, log.duration)) // to (station ID, call duration)
      .keyBy(0)
      .reduce((t1, t2) => {
        val duration = t1._2 + t2._2 // rolling sum of durations per station
        (t1._1, duration)
      })
    result.print()
    env.execute()
  }

  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)

  // A custom source implementing the SourceFunction interface
  class MyCustomerSource extends SourceFunction[StationLog] {
    // flag marking whether the stream should keep running
    var flag = true

    override def run(sourceContext: SourceFunction.SourceContext[StationLog]): Unit = {
      val random = new Random()
      val types = Array("fail", "busy", "barring", "success")
      while (flag) { // keep producing data until the stream is cancelled
        1.to(5).map(i => {
          val callOut = "1860000%04d".format(random.nextInt(10000))
          val callIn = "1890000%04d".format(random.nextInt(10000))
          StationLog("station_" + random.nextInt(10), callOut, callIn,
            types(random.nextInt(4)), System.currentTimeMillis(), random.nextInt(30))
        }).foreach(sourceContext.collect(_)) // emit the batch
        Thread.sleep(2000) // sleep 2 seconds between batches
      }
    }

    // stop the stream
    override def cancel(): Unit = flag = false
  }
}
```
Important: in Spark, any RDD of two-element tuples is a pair RDD.
In Flink, a KeyedStream's elements need not be tuples at all; keying is unrelated to tuples.
More transformation operators:
union (the two streams must carry the same element type).
connect (the result is still logically two streams); a following map or flatMap that unifies the element types performs the real merge.
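The connect-then-map pattern can be mirrored with plain Scala collections (a simplification, no Flink types): keep the two sides typed separately, map each branch to a common type (the CoMap), and only then combine:

```scala
// Two "streams" of different element types.
val numbers = List(1, 2, 3)     // first side: Int
val letters = List("a", "b")    // second side: String

// The CoMap's two branches unify the types; concatenation stands in for the merge:
val merged: List[String] = numbers.map(i => "int:" + i) ++ letters.map(s => "str:" + s)
println(merged)
```

By contrast, a union requires both sides to already have the same element type, so no CoMap step is needed.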
RichFunction: its lifecycle methods can create resources such as a MySQL connection, and getRuntimeContext gives access to state.
Function
ProcessFunction
KeyedProcessFunction example: alert when a phone number's incoming calls keep failing within 5 seconds.
```scala
package second.study.opertor

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

import scala.util.Random

// Monitor every phone number: if all calls to a number within 5 seconds fail, emit an alert.
// Group by the callee.
// If any call within the 5 seconds succeeds, emit nothing.
object TestProcessFunction {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.streaming.api.scala._
    val stream: DataStream[StationLog] = env.addSource(new MyCustomerSource)
    val result: DataStream[String] = stream.keyBy(_.callIn) // group by the callee first
      .process(new MontionCallFail)
    result.print()
    env.execute()
  }

  // @param K Type of the key.
  // @param I Type of the input elements.
  // @param O Type of the output elements.
  class MontionCallFail extends KeyedProcessFunction[String, StationLog, String] {
    // stores the registered timer's timestamp; 0 means no timer is registered
    lazy val timeState: ValueState[Long] =
      getRuntimeContext.getState(new ValueStateDescriptor[Long]("start-time", classOf[Long]))

    override def processElement(value: StationLog,
                                ctx: KeyedProcessFunction[String, StationLog, String]#Context,
                                out: Collector[String]): Unit = {
      val time: Long = timeState.value()
      if (time == 0 && value.callType.equals("fail")) { // first failed call for this key
        val nowTime: Long = ctx.timerService().currentProcessingTime()
        val onTime = nowTime + 5 * 1000 // fire the timer 5 seconds from now
        ctx.timerService().registerProcessingTimeTimer(onTime)
        timeState.update(onTime)
      } else if (time != 0 && !value.callType.equals("fail")) { // a call got through
        // delete the pending timer and clear the state
        ctx.timerService().deleteProcessingTimeTimer(time)
        timeState.clear()
      }
    }

    // fired when the timer goes off
    override def onTimer(timestamp: Long,
                         ctx: KeyedProcessFunction[String, StationLog, String]#OnTimerContext,
                         out: Collector[String]): Unit = {
      val alertStr = "trigger time: " + timestamp + ", phone number: " + ctx.getCurrentKey
      out.collect(alertStr)
      timeState.clear()
    }
  }

  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)

  // A custom source implementing the SourceFunction interface
  class MyCustomerSource extends SourceFunction[StationLog] {
    // flag marking whether the stream should keep running
    var flag = true

    override def run(sourceContext: SourceFunction.SourceContext[StationLog]): Unit = {
      val random = new Random()
      val types = Array("fail", "busy", "barring", "success")
      while (flag) { // keep producing data until the stream is cancelled
        1.to(5).map(i => {
          val callOut = "1860000%04d".format(random.nextInt(10000))
          val callIn = "1890000%04d".format(random.nextInt(10000))
          StationLog("station_" + random.nextInt(10), callOut, callIn,
            types(random.nextInt(4)), System.currentTimeMillis(), random.nextInt(30))
        }).foreach(sourceContext.collect(_)) // emit the batch
        Thread.sleep(2000) // sleep 2 seconds between batches
      }
    }

    // stop the stream
    override def cancel(): Unit = flag = false
  }
}
```

(Note the condition in the second branch: the timer must be deleted on a non-failing call; the original `value.callType.equals("fail")` there was a bug.)
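The timer logic above is a small state machine and can be traced with a stdlib-only sketch (a hypothetical simplification: a mutable variable stands in for ValueState, and `advanceTimeTo` stands in for the timer service):

```scala
var timerState: Long = 0L              // 0 means no timer registered (the ValueState)
var alerts: List[String] = Nil

def processElement(callType: String, now: Long): Unit = {
  if (timerState == 0L && callType == "fail") {
    timerState = now + 5000L           // first failure: arm a timer 5 s from now
  } else if (timerState != 0L && callType != "fail") {
    timerState = 0L                    // a call got through: delete timer, clear state
  }
}

def advanceTimeTo(now: Long): Unit = { // the timer service firing due timers
  if (timerState != 0L && now >= timerState) {
    alerts = s"alert: timer fired at $timerState" :: alerts
    timerState = 0L
  }
}

processElement("fail", 1000L)    // arms a timer for t = 6000
processElement("success", 3000L) // disarms it
advanceTimeTo(7000L)             // nothing is armed, so no alert
println(alerts)                  // empty: the success call suppressed the alert
```

Replaying the same trace without the success call leaves the timer armed, so the alert fires at t = 6000.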
AggregateFunction explained with an example
```scala
package second.study.opertor

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object TestTimeWindow {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    val ds: DataStream[StationLog] = stream.map(
      line => {
        val arr = line.split(",")
        StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
      }
    )
    val result = ds.keyBy(_.sid)
      .timeWindow(Time.seconds(5))
      .aggregate(new MyAggregateFunction, new MyWindowFunction)
    // the incremental aggregate function places no restrictions on the types involved
    result.print()
    env.execute()
  }

  // type parameters: input type, accumulator type, output type
  // add() runs once per element
  // getResult() runs once, when the window closes
  class MyAggregateFunction extends AggregateFunction[StationLog, Long, Long] {
    override def add(value: StationLog, accumulator: Long): Long = 1 + accumulator
    override def createAccumulator(): Long = 0
    override def getResult(accumulator: Long): Long = accumulator
    override def merge(a: Long, b: Long): Long = a + b
  }

  // The WindowFunction's input type equals the AggregateFunction's output type.
  // When the window closes, the AggregateFunction's getResult() runs first,
  // then the WindowFunction's apply().
  class MyWindowFunction extends WindowFunction[Long, (String, Long), String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[Long],
                       out: Collector[(String, Long)]): Unit = {
      out.collect((key, input.iterator.next())) // the accumulator holds a single value
    }
  }

  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)
}
```
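The accumulator lifecycle described above (add per element, merge across partial accumulators, getResult at window close) can be mimicked with plain Scala functions, no Flink required:

```scala
// Stdlib-only mirror of the counting AggregateFunction's four callbacks.
def createAccumulator(): Long = 0L
def add(acc: Long): Long = acc + 1          // runs once per element
def merge(a: Long, b: Long): Long = a + b   // combines two partial accumulators
def getResult(acc: Long): Long = acc        // runs once, when the window closes

// One window's worth of elements, folded through add():
val window = List("call1", "call2", "call3", "call4")
val count = getResult(window.foldLeft(createAccumulator())((acc, _) => add(acc)))
println(count) // 4
```

merge() matters when two sessions or partial windows have to be combined; for a count it is just addition.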
Handling out-of-order data with different strategies
```scala
package second.study.opertor

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Requirement: every 5 seconds, report per-station statistics over the last 10 seconds
// (here: the number of calls), together with the window's time range.
// 1. Group by station.
// 2. Open a sliding window (size 10 s, slide 5 s) with a 1 s out-of-orderness watermark.
// 3. Aggregate incrementally; the WindowFunction formats the result.
object TestWatermark2 {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // use event-time semantics
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    val ds: DataStream[StationLog] = stream.map(
      line => {
        val arr = line.split(",")
        StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).trim.toLong, arr(5).toLong)
      }
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[StationLog](Time.seconds(1)) {
      override def extractTimestamp(element: StationLog) = element.callTime
    })

    val lateTag = new OutputTag[StationLog]("lateTag")
    val result: DataStream[String] = ds
      .keyBy(_.sid)
      .timeWindow(Time.seconds(10), Time.seconds(5))
      // What happens to data that arrives after the watermark has passed the window?
      // 1. Within the allowed lateness, it re-triggers the window:
      .allowedLateness(Time.seconds(5))
      // 2. Anything later than that goes to a side output:
      .sideOutputLateData(lateTag)
      .aggregate(new MyAggregateCountFunction, new OutputResultWindowFunction)
    result.print()
    result.getSideOutput(lateTag).print("late")
    env.execute()
  }

  class MyAggregateCountFunction extends AggregateFunction[StationLog, Long, Long] {
    override def add(value: StationLog, accumulator: Long): Long = accumulator + 1
    override def createAccumulator(): Long = 0L
    override def getResult(accumulator: Long): Long = accumulator
    override def merge(a: Long, b: Long): Long = a + b
  }

  class OutputResultWindowFunction extends WindowFunction[Long, String, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[Long],
                       out: Collector[String]): Unit = {
      val value: Long = input.iterator.next()
      val builder = new StringBuilder
      builder.append("window range: " + window.getStart + " ------- " + window.getEnd + "\n")
      builder.append("station " + key + ", call count: " + value)
      out.collect(builder.toString())
    }
  }

  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)
}
```
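Two mechanics in the example above are easy to verify with a stdlib-only sketch: (a) which sliding windows an event falls into (size/slide windows per event), and (b) how a bounded-out-of-orderness watermark trails the maximum timestamp seen so far. The window-start formula below is a simplification for offset 0 and non-negative timestamps:

```scala
// (a) Sliding-window assignment: start of each window containing timestamp ts.
def windowStarts(ts: Long, size: Long, slide: Long): List[Long] = {
  val lastStart = ts - (ts % slide) // most recent window start at or before ts
  Iterator.iterate(lastStart)(_ - slide).takeWhile(_ > ts - size).toList
}

// (b) Bounded-out-of-orderness watermark: max timestamp seen minus the bound.
var maxTs = 0L
def watermarkAfter(ts: Long, bound: Long): Long = {
  maxTs = math.max(maxTs, ts)
  maxTs - bound
}

println(windowStarts(12000L, 10000L, 5000L)) // event at 12 s → windows [10,20) and [5,15)
println(watermarkAfter(7000L, 1000L))        // watermark 6000
println(watermarkAfter(5000L, 1000L))        // out-of-order event: watermark stays 6000
```

A window [start, start+10 s) fires once the watermark reaches its end; events arriving after that fall into allowedLateness or the side output.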
Table API 和Table SQL
```scala
package second.study.opertor.tableandsql

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object TestTableAPI {
  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)

  def main(args: Array[String]): Unit = {
    // Execution context: 1. batch 2. streaming
    // 1. Initialize the streaming context first
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    val ds: DataStream[StationLog] = stream.map(
      line => {
        val arr = line.split(",")
        StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
      }
    )
    // 2. Build the settings; the Blink planner can be selected here
    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
    // 3. Create the table environment; everything that follows goes through it
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
    val table: Table = tableEnv.fromDataStream(ds)
    val filterResult: Table = table.filter('callType === "success")
    val resultStream: DataStream[Row] = tableEnv.toAppendStream[Row](filterResult)
    resultStream.print()
    env.execute()
  }
}
```
Grouped aggregation
```scala
package second.study.opertor.tableandsql

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object TestTableAggregate {
  case class StationLog(sid: String, callOut: String, callIn: String, callType: String, callTime: Long, duration: Long)

  def main(args: Array[String]): Unit = {
    // Execution context: 1. batch 2. streaming
    // 1. Initialize the streaming context first
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    val ds: DataStream[StationLog] = stream.map(
      line => {
        val arr = line.split(",")
        StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
      }
    )
    // 2. Build the settings; the Blink planner can be selected here
    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
    // 3. Create the table environment; everything that follows goes through it
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
    val table: Table = tableEnv.fromDataStream(ds)
    val cntTable: Table = table.groupBy('sid).select('sid, 'sid.count as 'cnt)
    tableEnv.toRetractStream[Row](cntTable).print()
    env.execute()
  }
}
```
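toRetractStream tags each record with a Boolean: `true` inserts a result, `false` retracts a previously-emitted one when a group's aggregate changes. A stdlib-only trace of what a streaming grouped count emits (a simplification of the real runtime, no Flink types):

```scala
// Running grouped count with retract semantics.
var counts = Map.empty[String, Long]
var emitted = List.empty[(Boolean, (String, Long))]

def onElement(sid: String): Unit = {
  val old = counts.getOrElse(sid, 0L)
  if (old > 0L) emitted = emitted :+ ((false, (sid, old))) // retract the stale result
  counts = counts + (sid -> (old + 1L))
  emitted = emitted :+ ((true, (sid, old + 1L)))           // emit the updated result
}

List("s1", "s1", "s2").foreach(onElement)
emitted.foreach(println)
// (true,(s1,1)), (false,(s1,1)), (true,(s1,2)), (true,(s2,1))
```

An append stream cannot express this update, which is why a grouped aggregation must use toRetractStream rather than toAppendStream.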
```scala
package second.study.opertor.tableandsql

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row

object TestUDFFunction {
  def main(args: Array[String]): Unit = {
    // Execution context: 1. batch 2. streaming
    // 1. Initialize the streaming context first
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[String] = env.socketTextStream("localhost", 7777)
    // 2. Build the settings; the Blink planner can be selected here
    val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
    // 3. Create the table environment; everything that follows goes through it
    val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)
    // To run SQL, tables must be registered with the environment;
    // the pure Table API also works on unregistered tables.
    val my_fun = new MyFlatMapFunction
    // each socket line becomes one row in the single column 'word
    val tableAPI: Table = tableEnv.fromDataStream(stream, 'word)
    // split the words with the Table API
    val result: DataStream[(Boolean, Row)] = tableAPI.flatMap(my_fun('word)).as('word, 'cnt)
      .groupBy('word)
      .select('word, 'cnt.sum as 'sum).toRetractStream[Row]
    result.print()
    env.execute()
  }

  class MyFlatMapFunction extends TableFunction[Row] {
    // declare the row type produced by the function: (word, 1)
    override def getResultType: TypeInformation[Row] = Types.ROW(Types.STRING, Types.INT)

    // function body: split "a_b_c" and emit one (word, 1) row per token
    def eval(str: String): Unit = {
      val arr: Array[String] = str.split("_")
      arr.foreach(word => {
        val row = new Row(2)
        row.setField(0, word)
        row.setField(1, 1)
        collect(row)
      })
    }
  }
}
```

(The original built a StationLog stream here but then treated it as a single 'word column; the UDF expects one string column, so the raw socket stream is used directly.)
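The UDF's eval() body is the only interesting part; its contract can be checked with a stdlib-only mirror that returns the rows as tuples instead of calling collect():

```scala
// Mirror of MyFlatMapFunction.eval: one (word, 1) row per "_"-separated token.
def eval(str: String): List[(String, Int)] =
  str.split("_").toList.map(word => (word, 1))

println(eval("hello_flink_hello"))
// List((hello,1), (flink,1), (hello,1))
```

In the real TableFunction the rows are emitted via collect(), and the downstream groupBy('word) then sums the 1s, just like the classic word count.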
Key topic: window functions and SQL, compared with Hive.