Hudi insert

ZL_bigdata

Tags: big data

Article link: https://blog.csdn.net/ZL_javaco/article/details/120481160

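1. Overview:

Start with the original article, "Importing data into Hudi faster", published by the official Hudi WeChat account. I got something out of it and felt it was worth writing a summary.

How to import data into Apache Hudi faster?

The article is about bulk_insert:

bulk_insert ships with three built-in sort modes and also supports a user-defined (custom) mode.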

2. Configuration:

hoodie.bulkinsert.sort.mode

-- Options: NONE, GLOBAL_SORT, PARTITION_SORT

-- Default: GLOBAL_SORT
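
To make the setting above concrete, here is a minimal sketch that collects the relevant write options in a plain map. The class name `BulkInsertOptions` and the table name `demo_table` are made up for this example. `hoodie.table.name` and `hoodie.datasource.write.operation` are Hudi write options that the text above does not mention; a real job needs more options (record key, partition path and so on) that are left out here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hudi): builds the options for a bulk_insert write.
final class BulkInsertOptions {

  // The three built-in sort modes listed above.
  enum SortMode { NONE, GLOBAL_SORT, PARTITION_SORT }

  static Map<String, String> build(String tableName, SortMode mode) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("hoodie.table.name", tableName);
    // Select the bulk_insert write operation.
    opts.put("hoodie.datasource.write.operation", "bulk_insert");
    // Choose how bulk_insert sorts records before writing.
    opts.put("hoodie.bulkinsert.sort.mode", mode.name());
    return opts;
  }

  public static void main(String[] args) {
    build("demo_table", SortMode.PARTITION_SORT)
        .forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

In a Spark job, a map like this would typically be passed to the writer, for example `df.write().format("hudi").options(opts)`, followed by the save mode and target path.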

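3. Modes:

3.1 GLOBAL_SORT (global sort):

High upsert efficiency

The whole point of the global sort is to improve the performance of later upserts.

Low insert efficiency

The global sort step itself slows down the insert. In return, it relieves memory pressure when writing large partitions.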

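3.2 PARTITION_SORT (partition sort):

Medium upsert efficiency

Records are not sorted globally; they are sorted only within each Spark partition.

Medium insert efficiency

Any sorting step lowers insert efficiency, but it can relieve memory pressure.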

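3.3 NONE

Low upsert efficiency

Because the files were written unsorted, the index lookup during upsert has to read a large number of bloom filters.

High insert efficiency

Writing is fast, but there is a memory risk, and a large number of small files is produced.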
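
The trade-offs of the three modes can be pictured with a toy model. The sketch below is not Hudi code: plain lists stand in for Spark partitions, the class `SortModeToyModel` and its methods are invented for illustration, and it assumes that a writer opens one file for each Hudi partition path it meets inside a Spark partition. Under that assumption it counts the files each mode would produce for an interleaved input.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: lists stand in for Spark partitions; nothing here calls Hudi.
final class SortModeToyModel {

  static final class Rec {
    final String partitionPath;
    final String key;
    Rec(String partitionPath, String key) {
      this.partitionPath = partitionPath;
      this.key = key;
    }
  }

  static final Comparator<Rec> BY_PATH_THEN_KEY =
      Comparator.comparing((Rec r) -> r.partitionPath).thenComparing((Rec r) -> r.key);

  // Records arrive interleaved across partition paths: p0, p1, ..., p0, p1, ...
  static List<Rec> interleavedInput(int paths, int perPath) {
    List<Rec> recs = new ArrayList<>();
    for (int i = 0; i < paths * perPath; i++) {
      recs.add(new Rec("p" + (i % paths), String.format("k%02d", i)));
    }
    return recs;
  }

  // Cut the list into n contiguous chunks, each modelling one Spark partition.
  static List<List<Rec>> chunk(List<Rec> recs, int n) {
    int size = (recs.size() + n - 1) / n;
    List<List<Rec>> out = new ArrayList<>();
    for (int start = 0; start < recs.size(); start += size) {
      out.add(new ArrayList<>(recs.subList(start, Math.min(start + size, recs.size()))));
    }
    return out;
  }

  // NONE: no sorting at all.
  static List<List<Rec>> none(List<Rec> recs, int n) {
    return chunk(recs, n);
  }

  // PARTITION_SORT: sort only inside each Spark partition.
  static List<List<Rec>> partitionSort(List<Rec> recs, int n) {
    List<List<Rec>> parts = chunk(recs, n);
    for (List<Rec> part : parts) {
      part.sort(BY_PATH_THEN_KEY);
    }
    return parts;
  }

  // GLOBAL_SORT: sort everything first, then cut into ranges.
  static List<List<Rec>> globalSort(List<Rec> recs, int n) {
    List<Rec> sorted = new ArrayList<>(recs);
    sorted.sort(BY_PATH_THEN_KEY);
    return chunk(sorted, n);
  }

  // Assumed cost model: one file per distinct partition path per Spark partition.
  static int fileCount(List<List<Rec>> parts) {
    int files = 0;
    for (List<Rec> part : parts) {
      Set<String> paths = new HashSet<>();
      for (Rec r : part) {
        paths.add(r.partitionPath);
      }
      files += paths.size();
    }
    return files;
  }

  public static void main(String[] args) {
    List<Rec> input = interleavedInput(4, 6);
    System.out.println("NONE files:           " + fileCount(none(input, 4)));
    System.out.println("PARTITION_SORT files: " + fileCount(partitionSort(input, 4)));
    System.out.println("GLOBAL_SORT files:    " + fileCount(globalSort(input, 4)));
  }
}
```

With this input (4 partition paths, 6 records each, 4 Spark partitions), NONE and PARTITION_SORT both end up with 16 files in the model, while GLOBAL_SORT ends up with 4. PARTITION_SORT still differs from NONE in that the records inside each Spark partition are ordered.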

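3.4 Custom Partitioner

Performance unknown (I have not tried it; but if you are already spending the engineering time on a custom one, a tailor-made partitioner should do better).

Here is the official code: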

package org.apache.hudi.table;

/**
 * Repartition input records into at least expected number of output spark partitions. It should give below guarantees -
 * Output spark partition will have records from only one hoodie partition. - Average records per output spark
 * partitions should be almost equal to (#inputRecords / #outputSparkPartitions) to avoid possible skews.
 */
public interface BulkInsertPartitioner<I> {

  /**
   * Repartitions the input records into at least expected number of output spark partitions.
   *
   * @param records               Input Hoodie records
   * @param outputSparkPartitions Expected number of output partitions
   * @return
   */
  I repartitionRecords(I records, int outputSparkPartitions);

  /**
   * @return {@code true} if the records within a partition are sorted; {@code false} otherwise.
   */
  boolean arePartitionRecordsSorted();
}
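
To show what an implementation of this interface might look like, here is a minimal sketch under loud assumptions. It declares a local stand-in for `BulkInsertPartitioner` so that it compiles without Hudi on the classpath, and it models the input as a list of lists of `"partitionPath/recordKey"` strings instead of a Spark RDD. The class names `CustomPartitionerSketch` and `OnePathPerOutput` are hypothetical. The strategy (group by partition path, sort, slice) is just one way to meet the guarantee in the Javadoc that an output partition holds records from only one Hudi partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CustomPartitionerSketch {

  // Local stand-in mirroring the interface quoted above (assumption: keeps the sketch Hudi-free).
  interface BulkInsertPartitioner<I> {
    I repartitionRecords(I records, int outputSparkPartitions);
    boolean arePartitionRecordsSorted();
  }

  // Hypothetical model: each inner list is one "Spark partition" of "partitionPath/recordKey" strings.
  static final class OnePathPerOutput implements BulkInsertPartitioner<List<List<String>>> {

    @Override
    public List<List<String>> repartitionRecords(List<List<String>> records,
                                                 int outputSparkPartitions) {
      // Group all records by partition path (everything before the last '/').
      Map<String, List<String>> byPath = new TreeMap<>();
      int total = 0;
      for (List<String> part : records) {
        for (String rec : part) {
          String path = rec.substring(0, rec.lastIndexOf('/'));
          byPath.computeIfAbsent(path, p -> new ArrayList<>()).add(rec);
          total++;
        }
      }
      // Slice size rounded down, so the number of outputs is at least the
      // requested count whenever there are enough records.
      int sliceSize = Math.max(1, total / Math.max(1, outputSparkPartitions));
      List<List<String>> out = new ArrayList<>();
      for (List<String> group : byPath.values()) {
        // Same path prefix, so natural order sorts by record key.
        group.sort(Comparator.naturalOrder());
        for (int start = 0; start < group.size(); start += sliceSize) {
          int end = Math.min(start + sliceSize, group.size());
          out.add(new ArrayList<>(group.subList(start, end)));
        }
      }
      return out;
    }

    @Override
    public boolean arePartitionRecordsSorted() {
      return true; // every output slice is sorted by key
    }
  }

  public static void main(String[] args) {
    List<List<String>> input = Arrays.asList(
        Arrays.asList("p1/b", "p0/a"),
        Arrays.asList("p0/c", "p1/a"));
    System.out.println(new OnePathPerOutput().repartitionRecords(input, 2));
  }
}
```

In a real Spark job the type parameter would be an RDD of Hudi records rather than nested lists, and, to my knowledge, the class is registered through the `hoodie.bulkinsert.user.defined.partitioner.class` option. I have not benchmarked any of this.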