Flink Notes

Latest recommended article published on 2024-04-11 20:48:58

冷艳无情的小妈

Tags: flink, database, java, scala

Article link: https://blog.csdn.net/wuhahaq/article/details/129999013

1. Data sources: Kafka, MySQL (also file, ES, Redis, HBase, ClickHouse)
2. Processing: batch (overview only), streaming (operators (key topic), SQL)
3. Data sinks: Kafka, MySQL, HBase, ClickHouse, Redis, file, etc.
4. Deployment: Flink standalone, YARN

(1) Data sources:
file: env.readTextFile("datas/a.txt")
kafka:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "hdp1:9092")
properties.setProperty("group.id", "group1")
val stream = env.addSource(new FlinkKafkaConsumer[String]("test", new SimpleStringSchema(), properties))
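mysql: (custom source; the same approach extends to Redis and other data sources)
1) Create a class that extends RichSourceFunction
2) Override open: acquire resources and open the connection
3) Override close: release resources
4) Override run to read the data and emit it through the source context (cancel must be implemented as well)
5) Register the custom source with env.addSource
(2) Data sinks: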
kafka:
val properties = new Properties
properties.setProperty("bootstrap.servers", "hdp1:9092")
val myProducer = new FlinkKafkaProducer[String]("test", new SimpleStringSchema(), properties)
stream.addSink(myProducer)
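mysql: (custom sink; the same approach extends to Redis and other targets)
1) Create a class that extends RichSinkFunction
2) Override open: acquire resources and open the connection
3) Override close: release resources
4) Override invoke: it receives each record and writes it to the target
5) Register the custom sink with stream.addSink
file:
Option 1 (deprecated):
stream.writeAsText("result/a")
Option 2: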
val sink: StreamingFileSink[String] = StreamingFileSink
.forRowFormat(new Path("result/b"), new SimpleStringEncoder[String]("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.SECONDS.toMillis(10))
.withInactivityInterval(TimeUnit.SECONDS.toMillis(5))
.withMaxPartSize(1024 * 1024 * 1024)
.build())
.build()
stream.addSink(sink)
(3) Deployment
standalone:
Session Mode:
Option 1:
Submit the job from the command line:
./bin/flink run ./examples/streaming/TopSpeedWindowing.jar
Option 2:
Submit the job through the web UI.
yarn:
Session Mode:
./bin/yarn-session.sh --detached
./bin/flink run ./examples/streaming/TopSpeedWindowing.jar
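Note: in YARN session mode, jobs are submitted exactly as in standalone mode, and by default they go to YARN. To submit
to standalone instead, clear the YARN-related cache files under /tmp. (If a job has run successfully on YARN and a later
standalone run fails because it cannot connect to port 8032, delete /tmp/.yarn-properties-root.)
Application Mode:
If there are dependency conflicts, edit flink-conf.yaml (line 191) and set classloader.resolve-order: parent-first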
./bin/flink run-application -t yarn-application ./examples/streaming/TopSpeedWindowing.jar
Per-Job Mode:
./bin/flink run -t yarn-per-job --detached ./examples/streaming/TopSpeedWindowing.jar

(1) Package a jar and submit the job:
// Use this approach when the host name or IP address and the port must be supplied at run time
val tool = ParameterTool.fromArgs(args)
val host = tool.get("host")
val port = tool.getInt("port")
Submit to Flink standalone:
./bin/flink run -c com.bawei.package3.WC1 /root/myflink.jar
Submit to YARN: omitted
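(2) Common operators:
map, flatMap, filter, keyBy, reduce (sum/max/maxBy/min/minBy), union, connect (map)
keyBy: groups the stream by key and is usually followed by an aggregation; without a window this is a rolling aggregation.
union vs. connect: union can merge several streams at once, but they must all have the same type; connect combines exactly two streams, and the two may have different types.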
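To make "rolling aggregation" concrete, here is a minimal plain-Scala sketch (no Flink dependency) of what a non-windowed keyBy followed by sum does conceptually: each incoming record updates the running total of its key, and the new total is emitted right away. The object name RollingSum and the sample records are made up for illustration.

```scala
// Plain-Scala model of keyBy(name).sum(score) without a window.
// Not Flink API: RollingSum and the sample records are illustrative only.
object RollingSum {
  // Returns one output per input record: (key, running total for that key).
  def rollingSum(records: Seq[(String, Int)]): Seq[(String, Int)] = {
    val totals = scala.collection.mutable.Map.empty[String, Int]
    records.map { case (name, score) =>
      val updated = totals.getOrElse(name, 0) + score
      totals(name) = updated
      (name, updated)
    }
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(("tom", 80), ("amy", 90), ("tom", 70))
    // One line per input record; the second "tom" line carries the total 150.
    rollingSum(input).foreach(println)
  }
}
```

The point of the sketch is that there are as many outputs as inputs; a windowed aggregation would instead emit one result per key per window.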
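(3) Window categories
1) Window operators: window(), windowAll()
e.g. total score of all students: no keyBy, use windowAll
e.g. total score of each student: keyBy, then window
2) Window types:
Time windows (key topic):
Tumbling windows:
TumblingProcessingTimeWindows.of(Time.seconds(5))
e.g. every 5 seconds, compute each student's total score
Sliding windows:
SlidingProcessingTimeWindows.of(Time.seconds(<window size>), Time.seconds(<slide>))
e.g. every 5 seconds, compute each student's total score over the last 10 seconds
Session windows:
ProcessingTimeSessionWindows.withGap(Time.seconds(<session gap, i.e. idle time to wait>))
e.g. when no new data has arrived for 20 seconds, close the window and run the computation
Global windows: omitted
Count windows (further reading):
Tumbling windows:
countWindow(5)
e.g. for every 5 records of a student, sum those 5 scores
Sliding windows:
countWindow(10,5)
e.g. sliding count window: every 5 records, sum each student's most recent 10 scores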
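The sliding time window is the least obvious of the types above, because one record belongs to several windows at once (window size divided by slide). The following plain-Scala sketch, which does not use Flink, computes the windows a timestamp falls into. It assumes non-negative millisecond timestamps and a zero offset; SlidingAssign and windowsFor are hypothetical names.

```scala
// Plain-Scala sketch of sliding-window assignment (not Flink API).
// Assumes ts >= 0 and no window offset.
object SlidingAssign {
  // Returns the [start, end) bounds of every sliding window that contains ts.
  def windowsFor(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
    val lastStart = ts - (ts % slide) // latest window start that is <= ts
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)       // the window must still cover ts
      .map(start => (start, start + size))
      .toList
  }

  def main(args: Array[String]): Unit = {
    // 10 s window sliding every 5 s: a record at 12 s is in [10 s, 20 s) and [5 s, 15 s).
    windowsFor(12000L, 10000L, 5000L).foreach(println)
  }
}
```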
November 16:
(1) Time semantics:
Processing time (system time):
Event time:
val timeDS: DataStream[StuScore] = mapDS.assignAscendingTimestamps(_.ts)
// every 5 seconds, compute each student's total score
timeDS.keyBy(_.name)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
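(2) Window functions:
1) Window start and end times
For windows measured in seconds, minutes or hours, start times are aligned to 0:
e.g. 5-second windows: 0-5, 5-10, 10-15, 15-20, ...
5-minute windows: minute 0 to minute 5, minute 5 to minute 10, ...
A one-day window starts at 08:00 local time in UTC+8, because windows are aligned to the Unix epoch (UTC); pass an offset such as Time.hours(-8) to the window assigner to start at local midnight.
2) reduce(), sum(), max(), maxBy(): incremental aggregation
3) apply(): full-window function that receives all elements of the window at once
4) aggregate(): incremental aggregation; can implement what reduce(), sum(), max() and maxBy() do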
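The 08:00 start of day windows follows from simple arithmetic: window starts are multiples of the window size counted from the Unix epoch, and the epoch is midnight UTC. The plain-Scala sketch below re-implements that alignment rule as I understand it (it is not Flink code; WindowAlign and windowStart are hypothetical names) and shows the effect of an offset of minus 8 hours.

```scala
import java.time.{Instant, ZoneOffset}

// Plain-Scala sketch of epoch-aligned window starts (not Flink API).
object WindowAlign {
  // Start of the window of length `size` (ms) containing `ts`, shifted by `offset` (ms).
  def windowStart(ts: Long, offset: Long, size: Long): Long =
    ts - Math.floorMod(ts - offset, size)

  def main(args: Array[String]): Unit = {
    val hour = 3600 * 1000L
    val day = 24 * hour
    val utc8 = ZoneOffset.ofHours(8)
    // 2024-04-11 10:00 in UTC+8 is 02:00 UTC.
    val ts = Instant.parse("2024-04-11T02:00:00Z").toEpochMilli
    // No offset: the day window starts at 08:00 when shown in UTC+8.
    println(Instant.ofEpochMilli(windowStart(ts, 0L, day)).atOffset(utc8))
    // Offset of -8 hours: the day window starts at local midnight in UTC+8.
    println(Instant.ofEpochMilli(windowStart(ts, -8 * hour, day)).atOffset(utc8))
  }
}
```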
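(3) Join:
Two ways to join streaming data:
1) DataStream joined with DataStream (two-stream join):
Implementations:
Window-based:
window join: inner join
window coGroup: can express a full outer join
Not window-based:
intervalJoin:
2) DataStream joined with rows from a database:
Implemented with async I/O: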

(1) Watermark:
val timeDS: DataStream[StuScore] = mapDS.assignTimestampsAndWatermarks(WatermarkStrategy
.forBoundedOutOfOrderness[StuScore](Duration.ofSeconds(3))
.withTimestampAssigner(new SerializableTimestampAssigner[StuScore] {
override def extractTimestamp(element: StuScore, recordTimestamp: Long): Long = element.ts
}))
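To see what the 3-second bound means in practice, here is a small plain-Scala model of a bounded-out-of-orderness watermark. It is a sketch under assumptions: the watermark is taken to be the largest timestamp seen so far minus the delay minus 1 ms, and a window [start, end) is taken to fire once the watermark reaches end - 1. Flink emits watermarks periodically rather than per record, which this model ignores. WatermarkSketch and its methods are hypothetical names.

```scala
// Plain-Scala model of bounded-out-of-orderness watermarks (not Flink API).
object WatermarkSketch {
  // Watermark after seeing the given event timestamps (ms); `seen` must be non-empty.
  def watermark(seen: Seq[Long], delayMs: Long): Long =
    seen.max - delayMs - 1

  // An event-time window [start, end) fires once the watermark reaches end - 1.
  def windowFires(seen: Seq[Long], delayMs: Long, windowEnd: Long): Boolean =
    watermark(seen, delayMs) >= windowEnd - 1

  def main(args: Array[String]): Unit = {
    // 5 s window [0, 5000) with a 3 s delay: it fires only after an event at 8000 ms or later.
    println(windowFires(Seq(1000L, 4000L, 7999L), 3000L, 5000L)) // false
    println(windowFires(Seq(1000L, 4000L, 8000L), 3000L, 5000L)) // true
  }
}
```

Records that belong to a window but arrive after it has fired are the late data handled in the next section.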
(2) Collecting late data:
// create the side-output tag
val tag = new OutputTag[StuScore]("late-data")
val sumDS: DataStream[StuScore] = timeDS.keyBy(_.name)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sideOutputLateData(tag) // send late records to the side output
.sum("score")
// the side output has to be read from the main result stream
val lateDS: DataStream[StuScore] = sumDS.getSideOutput(tag)
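November 18:
(1) process()
Common kinds of ProcessFunction:
ProcessFunction:
Usage: DataStream.process(<new ProcessFunction>)
Typical use: splitting a stream
ProcessWindowFunction:
Usage: DataStream.keyBy().window().process(<new ProcessWindowFunction>)
Typical use: aggregation (sum, count, max, min, average), per-window TopN
ProcessAllWindowFunction:
Usage: DataStream.windowAll().process(<new ProcessAllWindowFunction>)
Typical use: aggregation (sum, count, max, min, average), per-window TopN
KeyedProcessFunction: (state, which can be thought of as a variable that persists across records for the current key, and timers)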
(1) State:
ValueState: stores a single value of a user-defined type
Declare: lazy val state: ValueState[Int] = getRuntimeContext.getState(new ValueStateDescriptor[Int]("myState", classOf[Int],-1))
Read: state.value()
Update: state.update(<new value>)
ListState: stores a list of values
Declare: lazy val state: ListState[Ws] = getRuntimeContext.getListState(new ListStateDescriptor[Ws]("myState", classOf[Ws]))
Read: state.get()
Update: state.add(<value>) to append, state.update(<list>) to replace
MapState: stores key-value data
(2) Timers:
Purpose: register a specific time in the future; when that time is reached, the timer fires and onTimer() runs.
Register a timer:
Processing time: ctx.timerService.registerProcessingTimeTimer(coalescedTime)
Event time: ctx.timerService.registerEventTimeTimer(coalescedTime)
Delete a timer:
Processing time: ctx.timerService.deleteProcessingTimeTimer(timestampOfTimerToStop)
Event time: ctx.timerService.deleteEventTimeTimer(timestampOfTimerToStop)
(3) Usage: DataStream.keyBy().process(<new KeyedProcessFunction>)
(4) Typical use: alerting (e.g. an alert on two consecutive failed logins), TopN
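(5) TopN:
Morning session: use a ProcessWindowFunction to find each student's top 3 exam records every 5 seconds.
keyBy().window().process(list.sort.take())
New requirements:
(1) TopN after grouping: continuously find each student's top 3 exam records (left as an exercise)
(2) TopN after grouping and window aggregation: every 5 seconds, compute each student's average score, then sort by average in descending order and take the top 3 (key topic).
keyBy().window().process(<compute average>) ???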
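One common way to approach requirement (2) is in two stages: first compute one average per student per window (keyBy plus window plus an aggregation), then gather all averages that belong to the same window and sort them. The plain-Scala sketch below only models the computation for the records of a single window, without Flink. The case class here is a simplified stand-in for the StuScore type used in these notes, and TopNByAverage and topN are hypothetical names.

```scala
// Plain-Scala model of "average per student, then top N" for one window (not Flink API).
object TopNByAverage {
  // Simplified stand-in for the StuScore record used in the notes.
  final case class StuScore(name: String, score: Double)

  // Stage 1: average per student. Stage 2: sort by average, descending, keep the first n.
  def topN(window: Seq[StuScore], n: Int): Seq[(String, Double)] =
    window
      .groupBy(_.name)
      .map { case (name, rows) => name -> rows.map(_.score).sum / rows.size }
      .toSeq
      .sortBy { case (_, avg) => -avg }
      .take(n)

  def main(args: Array[String]): Unit = {
    val window = Seq(
      StuScore("a", 90), StuScore("a", 70), // average 80
      StuScore("b", 95),
      StuScore("c", 60),
      StuScore("d", 88))
    topN(window, 3).foreach(println) // b, then d, then a
  }
}
```

In a real job the second stage needs all per-student averages of one window in the same place, for example by keying the stage-1 results by window end time before sorting; that wiring is left out here.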

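(1) CEP:
Pattern: condition 1 -> condition 2 -> condition 3 ...
Common ways to chain conditions:
next: the immediately following event (strict contiguity). For example, condition1.next().condition2 matches when one event satisfies condition 1 and the very next event satisfies condition 2.
followedBy: not necessarily adjacent (relaxed contiguity). For example, condition1.followedBy().condition2 matches when one event satisfies condition 1 and some later event satisfies condition 2.
(2) SQL:
(1) Batch: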
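The difference between the two contiguity modes is easiest to see on a tiny event sequence. The sketch below models two-step patterns in plain Scala, without the Flink CEP library; it ignores time limits, quantifiers and skip strategies. ContiguitySketch, matchNext and matchFollowedBy are hypothetical names.

```scala
// Plain-Scala model of strict vs. relaxed contiguity for a two-step pattern (not Flink CEP API).
object ContiguitySketch {
  // Strict contiguity (next): the event right after a c1-match must satisfy c2.
  def matchNext[A](events: Seq[A])(c1: A => Boolean, c2: A => Boolean): Seq[(A, A)] =
    events.zip(events.drop(1)).filter { case (a, b) => c1(a) && c2(b) }

  // Relaxed contiguity (followedBy): skip non-matching events until the first c2-match.
  def matchFollowedBy[A](events: Seq[A])(c1: A => Boolean, c2: A => Boolean): Seq[(A, A)] =
    events.zipWithIndex
      .collect { case (a, i) if c1(a) => (a, events.drop(i + 1).find(c2)) }
      .collect { case (a, Some(b)) => (a, b) }

  def main(args: Array[String]): Unit = {
    val events = Seq("a", "c", "b1", "b2")
    // "c" sits between "a" and "b1", so the strict pattern finds nothing.
    println(matchNext(events)(_ == "a", _.startsWith("b")))
    // The relaxed pattern skips "c" and pairs "a" with "b1".
    println(matchFollowedBy(events)(_ == "a", _.startsWith("b")))
  }
}
```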
val env = ExecutionEnvironment.getExecutionEnvironment
val tEnv: BatchTableEnvironment = BatchTableEnvironment.create(env)
Convert a DataSet into a table:
tEnv.createTemporaryView[SqlData]("student",<DataSet>)

(2) Streaming:
(1) Create the streaming table environment:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val tEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
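(2) Create tables
Option 1: convert a DataStream into a table:
tEnv.createTemporaryView[SqlData]("student",<DataStream>)
Option 2: read the data through a table connector and create the table
(3) Writing table results out:
Option 1: convert the table to a DataStream and save it: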
val resultDS: DataStream[Row] = tEnv.toRetractStream[Row](table).map(_._2)
resultDS.writeAsText("result/sqlresult")
Option 2: create a table bound to the destination through a table connector, then insert into it:
tEnv.executeSql(
"""
|insert into t_result
|select id,score from student where score>=95
|""".stripMargin)

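(1) Common sources and destinations (connectors):
1. kafka -> sql -> kafka (two ways)
2. kafka -> sql -> redis
Running SQL statements:
tEnv.sqlQuery("select * ……").execute(): sqlQuery only handles queries and returns a Table
tEnv.executeSql("select"): executeSql returns a TableResult
(2) Window aggregation in SQL
1) Time attributes:
Option 1: declare the time attribute when creating the table with a connector:
TIMESTAMP: TO_TIMESTAMP(<string formatted as yyyy-MM-dd HH:mm:ss>)
TIMESTAMP_LTZ: TO_TIMESTAMP_LTZ(<epoch timestamp in milliseconds>, 3)
Event time can be declared on a TIMESTAMP or TIMESTAMP_LTZ column: WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
Once event time is declared, the column carries *ROWTIME* information, and windows can be defined on a column that has it.
Option 2: declare the time attribute when converting a DataStream into a table: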
val timeDS: DataStream[SqlScore] = scoreDS.assignAscendingTimestamps(_.ts)
tEnv.createTemporaryView("t_score",timeDS,$("id"),$("subject"),$("score"),$("ts").rowtime())
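2) Tumbling windows:
t1:
TABLE(TUMBLE(TABLE t1, DESCRIPTOR(<event-time column>), INTERVAL '10' MINUTES)):
Wraps table t1 as a tumbling-window table and adds the window_start and window_end columns
3) Sliding windows:
t1:
TABLE(HOP(TABLE t1, DESCRIPTOR(<event-time column>), <slide>, <window size>)):
4) Note: a windowed query must aggregate by window, i.e. the SQL statement should contain group by window_start,window_end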

(1) CDC:
CDC project pages:
https://github.com/ververica/flink-cdc-connectors
https://gitee.com/mirrors_trending/flink-cdc-connectors#building-from-source
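(2) HBase:
Supplement:
Where Kafka stores its data:
(1) Part of the information is kept in ZooKeeper, i.e. under the directory set by dataDir in the ZooKeeper configuration
(2) The rest is kept under the path configured in Kafka: the log.dirs value in config/server.properties
To reset Kafka (clear its existing data):
1: delete the directory set by dataDir in zookeeper/conf/zoo.cfg,
2: delete the directory set by log.dirs in kafka/config/server.properties.
hbase:
(1) Part of the data is kept in ZooKeeper, i.e. under the directory set by dataDir in the ZooKeeper configuration
(2) The rest is kept in HDFS
To reset HBase (clear its existing data):
1: delete the directory set by dataDir in zookeeper/conf/zoo.cfg,
2: delete the directory set by hbase.rootdir in hbase/conf/hbase-site.xml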

(3) HBase shell:

A namespace is similar to a database:
Create a namespace: create_namespace
List all namespaces: list_namespace
List all tables in a namespace: list_namespace_tables 'myns'
Drop a namespace: drop_namespace
Commands: alter_namespace, create_namespace, describe_namespace,
drop_namespace, list_namespace, list_namespace_tables

ddl:
Create a table: create 'myns:student','info','extra'
List all tables: list
Show the table schema: describe 'student'
Drop a table: disable it first, then drop it
Commands: alter, alter_async, alter_status, create, describe,
disable, disable_all, drop, drop_all, enable, enable_all, exists,
get_table, is_disabled, is_enabled, list, locate_region, show_filters

dml:
Insert or update data: put
001 zhangsan 男 12 pingpong

put 'student','001','info:name','zhangsan' # one cell
put 'student','001','info:sex','男'
put 'student','001','info:age','13'
put 'student','001','extra:hobby','pingpong'

put 'student','002','info:name','lisi'
put 'student','002','info:age','13'

Query data: get/scan
scan 'student'
get 'student','001'

Delete data:
delete (removes a single cell)
deleteall (removes a whole row)

Commands: append, count, delete, deleteall, get, get_counter,
get_splits, incr, put, scan, truncate, truncate_preserve

Create a table that keeps multiple versions:
create 'stdent2', {NAME => 'info', VERSIONS => 3}
When scanning, specify how many versions to return:
scan 'student', {VERSIONS => 5}
scan 'stdent2', {VERSIONS => 5}

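(4) Filters
An HBase filter is made up of 3 parts:
Filter type: row key filter, column family filter, column (qualifier) filter, value filter, ...
Comparison operator: LESS (<), LESS_OR_EQUAL (<=), EQUAL (=), NOT_EQUAL (!=), GREATER_OR_EQUAL (>=), GREATER (>)
Comparator: BinaryComparator, BinaryPrefixComparator, RegexStringComparator, SubstringComparator
For example:
Find values equal to 张三:
value filter, EQUAL (=), BinaryComparator("张三")
Find values containing 张三:
value filter, EQUAL (=), SubstringComparator("张三")

ValueFilter
PrefixFilter (commonly used)
QualifierFilter
Filter chains: AND, OR
SingleColumnValueFilter (commonly used)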