[Data Lake Hudi Concepts] Key Generation and Concurrency Control

Latest recommended article published on 2023-05-31 17:25:51

Bulut0907


Article link: https://blog.csdn.net/yy8623977/article/details/123777737


1. Key Generation

Hudi provides several key generators. The configuration options common to all of them are:

Config	Meaning/Purpose
hoodie.datasource.write.recordkey.field	The record key field of the data. Required
hoodie.datasource.write.partitionpath.field	The partition field of the data. Required
hoodie.datasource.write.keygenerator.class	Fully qualified class name of the key generator. Required
hoodie.datasource.write.partitionpath.urlencode	Defaults to false. If true, the partition path is URL-encoded
hoodie.datasource.write.hive_style_partitioning	Defaults to false, in which case the partition directory name is just partition_field_value. If true, the partition directory name is partition_field_name=partition_field_value
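
As a rough illustration of the last two options, the sketch below mimics how a partition value might become a directory name. The function `partition_dir` is hypothetical and not part of Hudi. Python's `urllib.parse.quote` stands in for Hudi's own URL encoding, which may escape some characters differently.

```python
from urllib.parse import quote


def partition_dir(field_name, value, hive_style=False, url_encode=False):
    """Hypothetical helper: build a partition directory name from one field."""
    text = str(value)
    if url_encode:
        # safe="" also escapes "/" so the value stays a single path segment
        text = quote(text, safe="")
    # Hive style prefixes the value with "<field_name>="
    return f"{field_name}={text}" if hive_style else text


print(partition_dir("dt", "2020/01/06"))
print(partition_dir("dt", "2020/01/06", hive_style=True))
print(partition_dir("dt", "2020/01/06", hive_style=True, url_encode=True))
```

With both options off the value is used as is. Hive style adds the `dt=` prefix, and URL encoding turns each `/` into `%2F`.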

1.1 SimpleKeyGenerator

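Uses a single column for the record key and a single column for the partition path. The partition column value is converted to a string and used as the partition directory name.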

1.2 ComplexKeyGenerator

Both recordkey and partitionpath take one or more fields as the key, with multiple fields separated by commas. For example: "hoodie.datasource.write.recordkey.field" : "col1,col3"
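
A minimal sketch of how such options might be collected before a PySpark write, assuming a DataFrame with columns col1 to col4. The partition columns `col2,col4` and the helper `missing_required` are illustrative assumptions. The three required option names come from the common configuration table in section 1.

```python
# Options marked as required in the common key generator configuration.
REQUIRED = {
    "hoodie.datasource.write.recordkey.field",
    "hoodie.datasource.write.partitionpath.field",
    "hoodie.datasource.write.keygenerator.class",
}

hudi_options = {
    "hoodie.datasource.write.recordkey.field": "col1,col3",
    "hoodie.datasource.write.partitionpath.field": "col2,col4",  # assumed columns
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
}


def missing_required(options):
    """Hypothetical helper: return the required option names that are absent."""
    return sorted(REQUIRED - set(options))


print(missing_required(hudi_options))
```

With Spark available, the dict would typically be passed along the lines of `df.write.format("hudi").options(**hudi_options)`, together with the table name and the base path.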

1.3 NonPartitionedKeyGenerator

If the table is not partitioned, use NonPartitionedKeyGenerator, which generates an empty "" partition.

1.4 CustomKeyGenerator

CustomKeyGenerator can play the role of SimpleKeyGenerator, ComplexKeyGenerator, and TimestampBasedKeyGenerator at the same time.

Set keygenerator.class:

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator

Set the record key. It can be a single field (as with SimpleKeyGenerator) or several fields (as with ComplexKeyGenerator):

hoodie.datasource.write.recordkey.field=col1,col3

The resulting record key has the format: col1:value1,col3:value3

Set the partition path in the format "field1:PartitionKeyType1,field2:PartitionKeyType2,...". PartitionKeyType can be simple or timestamp:

hoodie.datasource.write.partitionpath.field=col2:simple,col4:timestamp

The partition path created on HDFS is: value2/value4
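
The layout described above can be sketched in a few lines of Python. This is only an illustration of the resulting key and path formats, not Hudi's implementation. The function `custom_key` and the sample record are hypothetical, and timestamp fields are passed through unformatted here for simplicity.

```python
def custom_key(record, recordkey_fields, partitionpath_spec):
    """Illustrative only: mimic the record key and partition path layout."""
    # Record key: "field:value" pairs joined by commas
    key_fields = recordkey_fields.split(",")
    record_key = ",".join(f"{name}:{record[name]}" for name in key_fields)

    # Partition path: one path segment per "field:type" entry
    segments = []
    for entry in partitionpath_spec.split(","):
        name, key_type = entry.split(":")
        if key_type not in ("simple", "timestamp"):
            raise ValueError(f"unsupported PartitionKeyType: {key_type}")
        # Simplification: a real timestamp field would be reformatted first
        segments.append(str(record[name]))
    return record_key, "/".join(segments)


row = {"col1": "value1", "col2": "value2", "col3": "value3", "col4": "value4"}
print(custom_key(row, "col1,col3", "col2:simple,col4:timestamp"))
```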

1.5 TimestampBasedKeyGenerator

This key generator applies to the partition field. It needs the following configuration:

Config	Meaning/Purpose
hoodie.deltastreamer.keygen.timebased.timestamp.type	Type of the input timestamp. One of UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR
hoodie.deltastreamer.keygen.timebased.output.dateformat	Output date format
hoodie.deltastreamer.keygen.timebased.timezone	Time zone of the date format
hoodie.deltastreamer.keygen.timebased.input.dateformat	Input date format

Some usage examples follow.

Timestamp is GMT

Config field	Value
hoodie.deltastreamer.keygen.timebased.timestamp.type	"EPOCHMILLISECONDS"
hoodie.deltastreamer.keygen.timebased.output.dateformat	"yyyy-MM-dd hh"
hoodie.deltastreamer.keygen.timebased.timezone	"GMT+8:00"

Input field value: "1578283932000L", generated partition path: "2020-01-06 12"

If the input field value is null, the generated partition path is: "1970-01-01 08"
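
The arithmetic behind this example can be checked with a short Python sketch. The function `epoch_millis_to_path` is hypothetical, and treating null as epoch 0 is an assumption chosen to match the stated result. Note that `hh` is a 12-hour clock pattern, so Python's `%I` is used as its counterpart.

```python
from datetime import datetime, timedelta, timezone


def epoch_millis_to_path(millis, utc_offset_hours=8):
    """Illustrative: format epoch milliseconds as 'yyyy-MM-dd hh' at a fixed offset."""
    if millis is None:
        millis = 0  # assumption: null falls back to the epoch
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(millis / 1000, tz=tz)
    # %I is the 12-hour clock, mirroring the "hh" pattern
    return moment.strftime("%Y-%m-%d %I")


print(epoch_millis_to_path(1578283932000))  # 2020-01-06 12
print(epoch_millis_to_path(None))           # 1970-01-01 08
```

1578283932000 ms is 04:12:12 UTC on 2020-01-06, which is 12:12:12 at GMT+8.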

Timestamp is DATE_STRING

Config field	Value
hoodie.deltastreamer.keygen.timebased.timestamp.type	"DATE_STRING"
hoodie.deltastreamer.keygen.timebased.output.dateformat	"yyyy-MM-dd hh"
hoodie.deltastreamer.keygen.timebased.timezone	"GMT+8:00"
hoodie.deltastreamer.keygen.timebased.input.dateformat	"yyyy-MM-dd hh:mm:ss"

Input field value: "2020-01-06 12:12:12", generated partition path: "2020-01-06 12"

If the input field value is null, the generated partition path is: "1970-01-01 12:00:00"
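
A small sketch of the parse-then-reformat round trip for the non-null input, written in Python rather than with Hudi's own date handling. The function `date_string_to_path` is hypothetical. The time zone is left out because the input and the output share the same zone in this example.

```python
from datetime import datetime


def date_string_to_path(text):
    """Illustrative: parse 'yyyy-MM-dd hh:mm:ss' and emit 'yyyy-MM-dd hh'."""
    # %I mirrors the 12-hour "hh" pattern used by both formats
    parsed = datetime.strptime(text, "%Y-%m-%d %I:%M:%S")
    return parsed.strftime("%Y-%m-%d %I")


print(date_string_to_path("2020-01-06 12:12:12"))  # 2020-01-06 12
```

For 24-hour values the usual patterns are `HH` in Java-style formats and `%H` in Python.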

Scalar examples

Config field	Value
hoodie.deltastreamer.keygen.timebased.timestamp.type	"SCALAR"
hoodie.deltastreamer.keygen.timebased.output.dateformat	"yyyy-MM-dd hh"
hoodie.deltastreamer.keygen.timebased.timezone	"GMT"
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit	"days"

Input field value: "20000L", generated partition path: "2024-10-04 12"

If the input field value is null, the generated partition path is: "1970-01-02 12"
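
The non-null SCALAR result can be reproduced by counting days from the Unix epoch. The sketch below is an independent check in Python, not Hudi code, and the function `scalar_days_to_path` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def scalar_days_to_path(days):
    """Illustrative: treat the scalar as days since the epoch, formatted in GMT."""
    moment = EPOCH + timedelta(days=days)
    # Midnight prints as "12" because %I (like "hh") is a 12-hour clock
    return moment.strftime("%Y-%m-%d %I")


print(scalar_days_to_path(20000))  # 2024-10-04 12
```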

2. Concurrency Control

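Supported modes:

MVCC: Hudi table services such as compaction and cleaning use MVCC to provide snapshot isolation between writes and reads. This supports a single writer with concurrent readers.
OPTIMISTIC CONCURRENCY (experimental): allows concurrent writers and requires Zookeeper or HiveMetastore to acquire locks. For example, if write_A writes file1 and file2 while write_B writes file3 and file4, both writes succeed. If write_A writes file1 and file2 while write_B writes file2 and file3, only one write succeeds and the other fails.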
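
The file-level conflict rule in the optimistic concurrency example can be sketched as a set-overlap check. This is a simplified model, not Hudi's implementation. The function `resolve` is hypothetical, and it assumes that the writer that finishes first wins, whereas the text only states that one of the two succeeds.

```python
def resolve(commits):
    """Illustrative: commits is a list of (writer, files) in completion order."""
    touched = set()   # files changed by writers that already committed
    outcome = {}
    for writer, files in commits:
        if touched & files:
            # Overlap with an earlier commit: this writer must fail
            outcome[writer] = "failed"
        else:
            outcome[writer] = "committed"
            touched |= files
    return outcome


print(resolve([("write_A", {"file1", "file2"}), ("write_B", {"file3", "file4"})]))
print(resolve([("write_A", {"file1", "file2"}), ("write_B", {"file2", "file3"})]))
```

In the first call both writers commit. In the second, write_B overlaps on file2 and is marked as failed.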

Multi Writer Guarantees

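upsert: the table will not contain duplicates
insert: the table may contain duplicates even if dedup is enabled
bulk_insert: the table may contain duplicates even if dedup is enabled
incremental pull: data consumption and checkpoints may be out of order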
