Ways to Write Data from Flink into a Hudi Data Lake + Ways to Read Data from a Hudi Data Lake with Flink

Latest recommended article published on 2024-09-18 17:16:31

Bulut0907

Category: Hudi. Tags: flink, hudi, write methods and modes, write rate limiting, read methods

Article link: https://blog.csdn.net/yy8623977/article/details/123810836

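1. Write methods

1.1 CDC Ingestion

There are two ways to sync data into Hudi:

Use Flink CDC to sync the MySQL binlog directly into Hudi
Sync the data into a message system such as Kafka or Pulsar first, then use a Flink cdc-format to sync it into Hudi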

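Notes: 1. If the upstream cannot guarantee the order of the data, write.precombine.field must be specified explicitly
2. MOR tables cannot yet handle deletes, which leads to inconsistent data. Switching to changelog mode with changelog.enabled avoids this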
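
To make the two notes concrete, here is a minimal sketch that renders a Flink SQL CREATE TABLE statement for a Hudi sink with both options set. The helper render_hudi_ddl, the table name, the columns and the path are all hypothetical and only serve the illustration; the option names are the ones discussed above, plus the connector, path and table.type options of the Hudi Flink connector.

```python
def render_hudi_ddl(table, columns, path, options):
    """Render a Flink SQL CREATE TABLE statement for a Hudi table.

    Hypothetical helper: it only builds a string, it does not talk to Flink.
    """
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    merged = {"connector": "hudi", "path": path, **options}
    with_clause = ",\n".join(f"  '{k}' = '{v}'" for k, v in merged.items())
    return f"CREATE TABLE {table} (\n{cols}\n) WITH (\n{with_clause}\n);"


ddl = render_hudi_ddl(
    table="orders_hudi",  # hypothetical table name
    columns=[
        ("id", "BIGINT PRIMARY KEY NOT ENFORCED"),
        ("amount", "DECIMAL(10, 2)"),
        ("updated_at", "TIMESTAMP(3)"),
    ],
    path="hdfs:///tmp/hudi/orders_hudi",  # hypothetical path
    options={
        "table.type": "MERGE_ON_READ",
        # Note 1: resolve out-of-order records by this field.
        "write.precombine.field": "updated_at",
        # Note 2: keep deletes consistent on a MOR table.
        "changelog.enabled": "true",
    },
)
print(ddl)
```

The printed statement could then be submitted through the Flink SQL client or a TableEnvironment.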

1.2 Bulk Insert

Mainly used for the initial data import. Bulk Insert does not deduplicate data, so the user must deduplicate it before inserting.

Bulk Insert is more efficient in batch execution mode.
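
Since deduplication is left to the user, the following sketch shows the intended logic in plain Python: keep one record per key, choosing the one with the largest ordering value. The function name and the field names id and ts are assumptions made for the illustration; in a real job the same step would more likely be expressed in Flink SQL or handled upstream.

```python
def dedupe_latest(records, key_field, order_field):
    """Keep one record per key: the one with the largest order_field value."""
    latest = {}
    for rec in records:
        key = rec[key_field]
        # On ties the later record in the input wins.
        if key not in latest or rec[order_field] >= latest[key][order_field]:
            latest[key] = rec
    return list(latest.values())


rows = [
    {"id": 1, "ts": 10, "v": "a"},
    {"id": 2, "ts": 11, "v": "b"},
    {"id": 1, "ts": 12, "v": "c"},  # newer version of id 1
]
deduped = dedupe_latest(rows, key_field="id", order_field="ts")
print(deduped)
```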

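The parameters are as follows:

Parameter	Required	Default	Description
write.operation	true	upsert	Set to bulk_insert to enable this feature
write.tasks	false	4	Parallelism of bulk_insert. Affects the number of small files; the number of files >= write.bucket_assign.tasks
write.bulk_insert.shuffle_by_partition	false	true	Whether to shuffle the data by the partition field before writing. Enabling it reduces the number of small files but may cause data skew on the Flink side
write.bulk_insert.sort_by_partition	false	true	Whether to sort the data by the partition field before writing. Enabling it reduces the number of small files
write.sort.memory	false	128	Memory available to the sort operator, in MB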

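1.3 Index Bootstrap

Used to import snapshot data plus incremental data. The snapshot data is loaded with Bulk Insert, and the incremental data is imported in real time.

The parameters are as follows:

Parameter	Required	Default	Description
index.bootstrap.enabled	true	false	When index bootstrap is enabled, the records already in the Hudi table are loaded into Flink state in one pass
index.partition.regex	false	*	Selects which partitions are loaded into Flink state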

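How can the incremental data be imported without losing or duplicating records?

Start the incremental import slightly earlier than strictly needed, so that some data overlaps and nothing is lost, and enable the index bootstrap function at the same time to avoid duplicates.
Once the first Flink checkpoint succeeds, disable the index bootstrap function and resume the incremental import from Flink state.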

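The detailed steps, in order, are as follows:

In flink-conf.yaml, set the number of checkpoint failures the application tolerates: execution.checkpointing.tolerable-failed-checkpoints = n
Create the Hudi table in the Flink catalog, adding index.bootstrap.enabled = true to the CREATE TABLE statement
Start the application to import the incremental data into the Hudi table
Wait for the first checkpoint to succeed, which indicates that index bootstrap has finished
Stop the Flink application with a savepoint
Recreate the Hudi table in the Flink catalog, this time with index.bootstrap.enabled = false in the CREATE TABLE statement
Restart the application, restoring its state from the savepoint or checkpoint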
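
The only thing that changes between the first run and the restart is the value of index.bootstrap.enabled. The sketch below renders the WITH clause for both phases so the difference is easy to see. The helper names, the path and the table type are assumptions for the illustration; the code only builds strings and does not start or stop any Flink job.

```python
BASE_OPTIONS = {
    "connector": "hudi",
    "path": "hdfs:///tmp/hudi/orders_hudi",  # hypothetical path
    "table.type": "MERGE_ON_READ",           # assumed table type
}


def options_for_phase(bootstrap_done):
    """Table options before (False) and after (True) the first successful checkpoint."""
    opts = dict(BASE_OPTIONS)
    opts["index.bootstrap.enabled"] = "false" if bootstrap_done else "true"
    return opts


def with_clause(opts):
    """Render the options as the WITH (...) part of a CREATE TABLE statement."""
    body = ",\n".join(f"  '{k}' = '{v}'" for k, v in opts.items())
    return f"WITH (\n{body}\n)"


first_run = options_for_phase(bootstrap_done=False)  # table creation before the first start
restart = options_for_phase(bootstrap_done=True)     # table recreation before the restart
print(with_clause(first_run))
print(with_clause(restart))
```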

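Notes:

Index bootstrap is a blocking process, so no checkpoint can complete while it is running
Index bootstrap is triggered by input data. The user must make sure every partition contains at least one record
Index bootstrap runs concurrently. Its progress can be followed in the log files through the messages finish loading the index under partition and Load record form file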

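2. Write modes

2.1 Changelog Mode

The parameters are as follows:

Parameter	Required	Default	Description
changelog.enabled	false	false	Upsert semantics by default. Set to true to support consuming all changes

Changelog mode retains all changes of a message (I / -U / U / D). A Hudi MOR table appends all changes to the log files, but compaction merges them. To consume all changes, adjust the compaction parameters compaction.delta_commits and compaction.delta_seconds so that compaction does not merge the changes before they are read.

A snapshot read always returns the merged result.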

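2.2 Append Mode

The parameters are as follows:

Parameter	Required	Default	Description
write.insert.cluster	false	false	For COW tables, base files are not merged on write by default. When enabled, every write merges the base files without merging records by key; this lowers write throughput and improves read performance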

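3. Write rate limiting

Scenario: Flink consumes historical data plus real-time incremental data and writes both into Hudi. This produces very high write throughput and heavily out-of-order partition writes, which hurts the stability of the cluster and the application, so the write rate needs to be limited.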

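The parameters are as follows:

Parameter	Required	Default	Description
write.rate.limit	false	0	Number of records written per second. The default, 0, means writes are not limited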
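
When picking a value for write.rate.limit it helps to estimate how long the historical backlog will take to replay. The sketch below does that arithmetic under two assumptions that are not stated in the text: the limit applies to the job as a whole, and new records arriving during the replay are ignored. The function name and the numbers are illustrative only.

```python
import math


def replay_seconds(backlog_records, rate_limit_per_sec):
    """Lower bound on the time needed to write a backlog at a fixed rate limit."""
    if rate_limit_per_sec <= 0:
        # 0 is the default and means "no limit", so no bound can be derived.
        raise ValueError("rate limit must be positive to estimate a replay time")
    return math.ceil(backlog_records / rate_limit_per_sec)


# Hypothetical numbers: 36 million historical records, 20,000 records per second.
eta = replay_seconds(backlog_records=36_000_000, rate_limit_per_sec=20_000)
print(f"{eta} s (about {eta / 60:.0f} min)")
```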

4. Read methods

4.1 Streaming Query

The default is a batch query, which reads the latest snapshot.

A streaming query requires read.streaming.enabled = true, together with read.start-commit. To consume all data, set read.start-commit to earliest.

The parameters are as follows:

Parameter	Required	Default	Description
read.streaming.enabled	false	false	Set to true to enable streaming query
read.start-commit	false	the latest commit	Instant time in the format 'yyyyMMddHHmmss'
read.streaming.skip_compaction	false	false	Whether to skip compaction commits. Consuming compaction commits produces duplicate data
clean.retain_commits	false	10	Maximum number of commits retained when changelog mode is enabled. With a 5-minute checkpoint interval, 50 minutes of changelog are retained

Note: if read.streaming.skip_compaction is enabled and the stream reader falls behind by more than clean.retain_commits commits, data may be lost.
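
The retention arithmetic behind this note can be sketched as follows, assuming one commit per checkpoint (which is what the 5-minute, 50-minute example in the table implies). The function names and the lag values are hypothetical.

```python
from datetime import timedelta


def changelog_retention(retain_commits, checkpoint_interval):
    """Time span covered by the retained commits, assuming one commit per checkpoint."""
    return retain_commits * checkpoint_interval


def reader_at_risk(reader_lag, retain_commits, checkpoint_interval):
    """True if the reader lags further behind than the retained changelog covers."""
    return reader_lag > changelog_retention(retain_commits, checkpoint_interval)


window = changelog_retention(10, timedelta(minutes=5))
print(window)  # 10 commits at a 5-minute interval

# Hypothetical lags: one inside the window, one outside it.
print(reader_at_risk(timedelta(minutes=20), 10, timedelta(minutes=5)))
print(reader_at_risk(timedelta(minutes=70), 10, timedelta(minutes=5)))
```

If the second case is realistic for a job, raising clean.retain_commits widens the window.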

4.2 Incremental Query

There are three use cases:

Streaming query: set read.start-commit
Batch query: set both read.start-commit and read.end-commit; both the start commit and the end commit are inclusive
TimeTravel: set read.end-commit to an instant time later than the current one; read.start-commit defaults to latest
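
The three use cases differ only in which read options are set. The sketch below builds the option maps for each case and formats a Python datetime as a Hudi instant time ('yyyyMMddHHmmss'). The helper names and the dates are hypothetical; the streaming case also sets read.streaming.enabled, as section 4.1 requires.

```python
from datetime import datetime


def to_instant(dt):
    """Format a datetime as a Hudi instant time, 'yyyyMMddHHmmss'."""
    return dt.strftime("%Y%m%d%H%M%S")


def streaming_options(start):
    # Streaming query: continuous read beginning at the start commit.
    return {
        "read.streaming.enabled": "true",
        "read.start-commit": to_instant(start),
    }


def batch_range_options(start, end):
    # Batch query: both bounds are inclusive.
    return {
        "read.start-commit": to_instant(start),
        "read.end-commit": to_instant(end),
    }


def time_travel_options(end):
    # TimeTravel: only the end commit is set; the start commit keeps its default.
    return {"read.end-commit": to_instant(end)}


start = datetime(2022, 3, 28, 12, 0, 0)  # hypothetical instants
end = datetime(2022, 3, 29, 12, 0, 0)
print(streaming_options(start))
print(batch_range_options(start, end))
print(time_travel_options(end))
```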

The parameters are as follows: