[Hudi] Parameters

TaiKuLaHa

Last modified: 2024-03-12 13:25:55

First published: 2024-02-07 15:42:59

Article link: https://blog.csdn.net/JH_Zhai/article/details/136069525

Quick start: https://hudi.apache.org/docs/flink-quick-start-guide/
Official documentation: https://hudi.apache.org/docs/configurations/#FLINK_SQL

hoodie.datasource.write.hive_style_partitioning

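Hudi provides several write operations; see the hoodie.datasource.write.operation configuration item for the full list. This section covers UPSERT, INSERT, and BULK_INSERT.

INSERT: The flow is essentially the same as UPSERT, but it does not need to look up through the index which files are to be updated, so it is faster than UPSERT. Recommended when the source contains no updates; if the source does contain updates, duplicate records will appear in the data lake.
BULK_INSERT: Used for the initial load of a dataset. It sorts records by primary key and then writes them into the Hudi table the same way a plain parquet table is written. It has the highest performance of the three, but it cannot control small files, whereas UPSERT and INSERT use heuristics that keep small files under control.
UPSERT: The default operation type. Hudi checks the primary key: if the record already exists it is updated, otherwise it is inserted. Recommended for sources such as CDC that almost certainly contain updates.
Notes:
INSERT does not sort records by primary key, so it is not recommended for initializing a dataset.
Use INSERT when all data is known to be new, UPSERT when the data contains updates, and BULK_INSERT when initializing a dataset.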
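
The recommendation above can be sketched as a small decision helper. This is only an illustration in plain Python: the function names `choose_write_operation` and `write_options` are hypothetical and not part of Hudi, and the lowercase operation values are assumed to be the form accepted by hoodie.datasource.write.operation.

```python
def choose_write_operation(initial_load: bool, has_updates: bool) -> str:
    """Pick a write operation following the notes above (hypothetical helper)."""
    if initial_load:
        # Initializing a dataset: bulk_insert is the fastest option.
        return "bulk_insert"
    if has_updates:
        # Sources such as CDC that contain updates need upsert.
        return "upsert"
    # All records are known to be new.
    return "insert"


def write_options(initial_load: bool, has_updates: bool) -> dict:
    """Build the single writer option that selects the operation."""
    return {
        "hoodie.datasource.write.operation": choose_write_operation(
            initial_load, has_updates
        )
    }


if __name__ == "__main__":
    print(write_options(initial_load=False, has_updates=True))
```

The initial-load check comes first on purpose: when a table is being initialized there is no existing data to update, so the `has_updates` flag is ignored in that case.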

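Apache Hudi supports the bulk_insert operation for initializing data into a Hudi table. It is faster and more efficient than insert and upsert: bulk_insert avoids the overhead of looking up existing data, and it does not perform small-file optimization.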

set ht_sparkconf_spark.hudi.enabled=true;
set hoodie.sql.bulk.insert.enable=true;
set hoodie.sql.insert.mode=non-strict;
set hoodie.combine.before.insert=false;

https://hudi.apache.org/cn/docs/performance/

Bulk Insert
Write configurations in Hudi are optimized for incremental upserts by default. In fact, the default write operation type is UPSERT as well. For simple append-only use cases that bulk load the data, the following set of configurations is recommended for optimal writing:

-- Use “bulk-insert” write-operation instead of default “upsert”
hoodie.datasource.write.operation = BULK_INSERT
-- Disable populating meta columns and metadata, and enable virtual keys
hoodie.populate.meta.fields = false
hoodie.metadata.enable = false
-- Enable snappy compression codec for lesser CPU cycles (but more storage overhead)
hoodie.parquet.compression.codec = snappy

For ingesting via spark-sql

-- Use “bulk-insert” write-operation instead of default “upsert”
hoodie.sql.insert.mode = non-strict
hoodie.sql.bulk.insert.enable = true
-- Disable populating meta columns and metadata, and enable virtual keys
hoodie.populate.meta.fields = false
hoodie.metadata.enable = false
-- Enable snappy compression codec for lesser CPU cycles (but more storage overhead)
hoodie.parquet.compression.codec = snappy
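
As a rough sketch, the two configuration sets above can be kept as Python dictionaries and turned into either writer options or `set` statements. The helper names (`datasource_bulk_insert_options`, `spark_sql_bulk_insert_conf`, `render_set_statements`) are hypothetical. Whether every key takes effect through a session-level `set` has not been verified here; only hoodie.sql.insert.mode and hoodie.sql.bulk.insert.enable appear as `set` statements earlier in this post.

```python
# Options shared by both variants, copied from the lists above.
BULK_LOAD_COMMON = {
    "hoodie.populate.meta.fields": "false",
    "hoodie.metadata.enable": "false",
    "hoodie.parquet.compression.codec": "snappy",
}


def datasource_bulk_insert_options() -> dict:
    """Options for the DataFrame writer path (hypothetical helper)."""
    return {"hoodie.datasource.write.operation": "bulk_insert", **BULK_LOAD_COMMON}


def spark_sql_bulk_insert_conf() -> dict:
    """Options for the spark-sql path (hypothetical helper)."""
    return {
        "hoodie.sql.insert.mode": "non-strict",
        "hoodie.sql.bulk.insert.enable": "true",
        **BULK_LOAD_COMMON,
    }


def render_set_statements(conf: dict) -> list:
    """Render each key/value pair as a `set key=value;` statement."""
    return [f"set {key}={value};" for key, value in conf.items()]


if __name__ == "__main__":
    for statement in render_set_statements(spark_sql_bulk_insert_conf()):
        print(statement)
```

With PySpark and the Hudi bundle available, the first dictionary could be passed to a writer along the lines of `df.write.format("hudi").options(**datasource_bulk_insert_options())`, together with the table name and key options that a real write also needs.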

We recently benchmarked Hudi against TPC-DS workload. Please check out our blog for more details.

set hoodie.sql.insert.mode=non-strict;

https://doc.hcs.huawei.com/zh-cn/usermanual/mrs/mrs_01_24273.html

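Insert mode: For tables with a primary key, Hudi supports three insert modes. Set the hoodie.sql.insert.mode parameter to choose the mode; the default is upsert.
strict mode: The INSERT statement preserves the primary-key uniqueness constraint of COW tables and does not allow duplicate records. If a record already exists at insert time, a HoodieDuplicateKeyException is thrown for COW tables; for MOR tables, this mode behaves the same as upsert mode.
non-strict mode: Primary-key tables are written with the insert operation.
upsert mode: Duplicate keys in primary-key tables are updated.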
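
The difference between the three modes can be illustrated with a toy model in plain Python. This is not Hudi code: the table is just a list of dictionaries, `insert_rows` is a hypothetical function, and `DuplicateKeyError` is a stand-in for HoodieDuplicateKeyException.

```python
class DuplicateKeyError(Exception):
    """Stand-in for HoodieDuplicateKeyException (toy model only)."""


def insert_rows(table, rows, mode="upsert", table_type="cow", key="id"):
    """Return a new row list after inserting `rows` under the given mode."""
    if mode not in ("strict", "non-strict", "upsert"):
        raise ValueError(f"unknown insert mode: {mode}")
    if mode == "strict" and table_type == "mor":
        # Per the description above, strict behaves like upsert on MOR tables.
        mode = "upsert"
    result = list(table)
    for row in rows:
        exists = any(r[key] == row[key] for r in result)
        if mode == "non-strict" or not exists:
            # Plain insert: duplicates are kept side by side.
            result.append(row)
        elif mode == "strict":
            raise DuplicateKeyError(f"duplicate key: {row[key]}")
        else:
            # upsert: drop the old version(s) of the key, keep the new row.
            result = [r for r in result if r[key] != row[key]] + [row]
    return result


if __name__ == "__main__":
    existing = [{"id": 1, "v": "a"}]
    incoming = [{"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
    print(insert_rows(existing, incoming, mode="non-strict"))
    print(insert_rows(existing, incoming, mode="upsert"))
```

In this model, non-strict leaves two rows with id 1, upsert leaves one, and strict on a COW table raises as soon as the duplicate key is seen.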
