Hudi Indexing

1.Key Classes

1.1.HoodieIndex

  The base class for index implementations. Its two core methods are tagLocation and updateLocation.
  Concrete index types are implemented by the various subclasses.
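
  As a rough illustration, a minimal sketch of that contract (simplified, hypothetical generics; not the exact Hudi signatures):

// Simplified sketch of the HoodieIndex contract; RECORDS/STATUSES stand in for
// Hudi's engine-specific collection types.
interface IndexSketch<RECORDS, STATUSES> {
  // Look up each incoming record and attach the location (partition path + fileId)
  // of its existing copy, if any; untagged records are treated as inserts.
  RECORDS tagLocation(RECORDS records);

  // After a commit succeeds, persist the new record locations back into the index.
  STATUSES updateLocation(STATUSES writeStatuses);
}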

1.2.HoodieIndexFactory

  There is no class with exactly this name; it is the factory that creates HoodieIndex instances. The concrete factory classes use it as a suffix: FlinkHoodieIndexFactory, SparkHoodieIndexFactory, JavaHoodieIndexFactory.
  Taking FlinkHoodieIndexFactory as an example, it supports creating the following indexes:

// TODO more indexes to be added
switch (config.getIndexType()) {
  case INMEMORY:
    return new FlinkInMemoryStateIndex(context, config);
  case BLOOM:
    return new HoodieBloomIndex(config, ListBasedHoodieBloomIndexHelper.getInstance());
  case GLOBAL_BLOOM:
    return new HoodieGlobalBloomIndex(config, ListBasedHoodieBloomIndexHelper.getInstance());
  case SIMPLE:
    return new HoodieSimpleIndex(config, Option.empty());
  case GLOBAL_SIMPLE:
    return new HoodieGlobalSimpleIndex(config, Option.empty());
  case BUCKET:
    return new HoodieSimpleBucketIndex(config);
  default:
    throw new HoodieIndexException("Unsupported index type " + config.getIndexType());
}

1.3.BaseHoodieBloomIndexHelper

  The class that performs the filtering step. It has two implementations: ListBasedHoodieBloomIndexHelper and SparkHoodieBloomIndexHelper.

1.4.HoodieBaseBloomIndexCheckFunction

  The core of what BaseHoodieBloomIndexHelper does. computeNext is the method that performs the bloom filtering; the key point is the call to keyLookupHandle.addKey(recordKey). A simplified sketch of this loop follows below.
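
  A self-contained sketch (hypothetical types and a placeholder filter, not the Hudi source) of how the LazyKeyCheckIterator-style loop works: keys arrive grouped by file, each key is offered to the current file's lookup handle, and the candidate list is emitted when the file changes.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BloomCheckSketch {
  // stand-in for HoodieKeyLookupHandle: collects keys that pass the file's bloom filter
  static class LookupHandle {
    final String fileId;
    final List<String> candidates = new ArrayList<>();
    LookupHandle(String fileId) { this.fileId = fileId; }
    void addKey(String key) {
      if (mightContain(key)) {   // bloom filter check; false positives are possible
        candidates.add(key);
      }
    }
    // placeholder for the real bloom filter read from the base file's footer
    boolean mightContain(String key) { return key.hashCode() % 2 == 0; }
  }

  // computeNext-style loop: input pairs of {fileId, recordKey}, sorted by fileId
  static Map<String, List<String>> findCandidates(List<String[]> fileKeyPairs) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    LookupHandle handle = null;
    for (String[] pair : fileKeyPairs) {
      if (handle == null || !handle.fileId.equals(pair[0])) {
        if (handle != null) result.put(handle.fileId, handle.candidates); // file changed: emit
        handle = new LookupHandle(pair[0]);
      }
      handle.addKey(pair[1]);
    }
    if (handle != null) result.put(handle.fileId, handle.candidates);
    return result;
  }
}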

1.5.HoodieKeyLookupHandle

  This is the class of the keyLookupHandle seen in 1.4. Its addKey method filters record keys through the bloomFilter; since mightContain can return false positives, the candidate keys it collects still have to be verified against the actual file afterwards.

public void addKey(String recordKey) {
  // check record key against bloom filter of current file & add to possible keys if needed
  if (bloomFilter.mightContain(recordKey)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Record key " + recordKey + " matches bloom filter in  " + partitionPathFileIDPair);
    }
    candidateRecordKeys.add(recordKey);
  }
  totalKeysChecked++;
}

  Obtaining the bloomFilter also happens in this class:

public HoodieKeyLookupHandle(HoodieWriteConfig config, HoodieTable<T, I, K, O> hoodieTable,
                             Pair<String, String> partitionPathFileIDPair) {
  super(config, hoodieTable, partitionPathFileIDPair);
  this.candidateRecordKeys = new ArrayList<>();
  this.totalKeysChecked = 0;
  // loads the bloom filter for this file, ultimately read from the base file's footer metadata
  this.bloomFilter = getBloomFilter();
}

1.6.HoodieFileReader

  The file-reading class. One key method is readBloomFilter(), which reads the bloom filter.
  It has three implementations: HoodieParquetReader, HoodieOrcReader, and HoodieHFileReader.
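
  For Parquet base files the filter lives in the footer's key-value metadata. A minimal sketch of pulling it out with the standard Parquet footer API (the footer key string here is an assumption about the value of Hudi's HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY constant):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;

public class FooterBloomFilterDump {
  public static void main(String[] args) throws Exception {
    // read only the footer of the base file passed as args[0]
    Map<String, String> footer = ParquetFileReader
        .readFooter(new Configuration(), new Path(args[0]))
        .getFileMetaData().getKeyValueMetaData();
    // assumed footer key for the serialized bloom filter
    System.out.println(footer.get("org.apache.hudi.bloomfilter"));
  }
}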

1.7.BaseWriteHelper

  The class that triggers the HoodieIndex.tagLocation call. The call happens in its tag() method, which is in turn invoked from its own write() method; a schematic sketch follows below.
  Flink is a little special here: it overrides write() and does not call tag() internally, so the Flink write path may not use the bloom index.
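
  A schematic sketch (hypothetical names, records simplified to strings) of that write() sequence, with the index lookup injected as a function:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.UnaryOperator;

public class WriteHelperSketch {
  // write(): dedupe -> tag (index lookup) -> hand off to the actual write path
  static List<String> write(List<String> records, UnaryOperator<List<String>> tagLocation) {
    List<String> deduped = new ArrayList<>(new LinkedHashSet<>(records)); // combine duplicate keys
    return tagLocation.apply(deduped); // tag(): delegates to HoodieIndex.tagLocation
  }

  public static void main(String[] args) {
    // identity stub in place of a real index
    System.out.println(write(List.of("k1", "k1", "k2"), keys -> keys)); // [k1, k2]
  }
}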

2.Flow

2.1.Creation

  The call chain that constructs the relevant classes when an index lookup is performed:

HoodieTable.getIndex -> HoodieIndexFactory.createIndex -> HoodieIndex

  When the HoodieIndex is created, its constructor parameters include a BaseHoodieBloomIndexHelper.
  The relevant configuration option is hoodie.index.type; a sketch of setting it programmatically follows below.
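
  A minimal sketch using the builder API (assuming the standard HoodieWriteConfig/HoodieIndexConfig builders; the path is taken from the DDL example in section 4.1):

import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.index.HoodieIndex;

public class IndexConfigExample {
  public static void main(String[] args) {
    // equivalent to setting hoodie.index.type = BLOOM
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("/spark/hudi/")
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.BLOOM)
            .build())
        .build();
    System.out.println(config.getIndexType()); // BLOOM
  }
}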

2.2.Obtaining the bloomFilter

  The overall call chain for obtaining the bloomFilter:

BaseWriteHelper.write -> tag -> HoodieIndex.tagLocation -> lookupIndex -> BaseHoodieBloomIndexHelper.findMatchingFilesForRecordKeys -> HoodieBaseBloomIndexCheckFunction.apply -> LazyKeyCheckIterator.computeNext -> HoodieKeyLookupHandle.getBloomFilter -> HoodieFileReader.readBloomFilter -> BaseFileUtils.readBloomFilterFromMetadata

  Once obtained, the bloomFilter is used in HoodieKeyLookupHandle.addKey.

2.3.Writing the bloomFilter to the File

  While data is being written, each record key is added to the bloomFilter:

DataWriter.write -> BulkInsertDataInternalWriterHelper.write -> HoodieRowCreateHandle.writeRow -> HoodieInternalRowParquetWriter.writeRow -> HoodieRowParquetWriteSupport.add -> HoodieBloomFilterWriteSupport.addKey -> BloomFilter.add
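
  A minimal sketch of that per-record step using Hudi's bloom filter classes directly (the factory parameters here are illustrative, not Hudi's defaults):

import org.apache.hudi.common.bloom.BloomFilter;
import org.apache.hudi.common.bloom.BloomFilterFactory;

public class BloomWriteSketch {
  public static void main(String[] args) {
    // numEntries, false-positive rate, dynamic cap (-1 = unused), type code
    BloomFilter filter = BloomFilterFactory.createBloomFilter(1000, 0.000000001, -1, "SIMPLE");
    filter.add("key-1");  // what HoodieBloomFilterWriteSupport.addKey does per record
    System.out.println(filter.mightContain("key-1"));          // true
    System.out.println(filter.serializeToString().length());   // later written to the footer
  }
}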

  Afterwards the bloomFilter is written into the file. Because it goes into the Parquet footer, this appears to rely directly on the interfaces Parquet provides rather than on a dedicated file-write call:

HoodieRowParquetWriteSupport.finalizeWrite -> HoodieBloomFilterWriteSupport.finalizeMetadata -> BloomFilter.serializeToString

public Map<String, String> finalizeMetadata() {
  HashMap<String, String> extraMetadata = new HashMap<>();

  extraMetadata.put(HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY, bloomFilter.serializeToString());
  if (bloomFilter.getBloomFilterTypeCode().name().contains(HoodieDynamicBoundedBloomFilter.TYPE_CODE_PREFIX)) {
    extraMetadata.put(HOODIE_BLOOM_FILTER_TYPE_CODE, bloomFilter.getBloomFilterTypeCode().name());
  }

  if (minRecordKey != null && maxRecordKey != null) {
    extraMetadata.put(HOODIE_MIN_RECORD_KEY_FOOTER, minRecordKey.toString());
    extraMetadata.put(HOODIE_MAX_RECORD_KEY_FOOTER, maxRecordKey.toString());
  }

  return extraMetadata;
}

  The final Parquet write goes through ParquetFileWriter; end() persists the accumulated key-value metadata into the footer:

FinalizedWriteContext finalWriteContext = writeSupport.finalizeWrite();
Map<String, String> finalMetadata = new HashMap<String, String>(extraMetaData);
String modelName = writeSupport.getName();
if (modelName != null) {
  finalMetadata.put(ParquetWriter.OBJECT_MODEL_NAME_PROP, modelName);
}
finalMetadata.putAll(finalWriteContext.getExtraMetaData());
parquetFileWriter.end(finalMetadata);

2.4.The Index Key

  The index configuration is global; there is no per-column index. The record key can come from several sources: a UUID, the partition, the primary key (obtained via field.get(0)), or values derived from the engine's data types; see the HoodieKey construction for details.
  In addition, the configuration hoodie.datasource.write.recordkey.field controls which column the key is taken from; the default is uuid. A small illustration follows below.
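
  A minimal illustration (the key and partition values are hypothetical) of what a HoodieKey holds:

import org.apache.hudi.common.model.HoodieKey;

public class KeyExample {
  public static void main(String[] args) {
    // record key from the recordkey field, partition path from the partition fields
    HoodieKey key = new HoodieKey("id:1", "2024/01/01");
    System.out.println(key.getRecordKey() + " @ " + key.getPartitionPath());
  }
}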

3.Parquet Writer Class Structure

3.1.Creation

  Starting from HoodieRowCreateHandle: its constructor creates a HoodieInternalRowFileWriter.

this.fileWriter = HoodieInternalRowFileWriterFactory.getInternalRowFileWriter(path, table, writeConfig, structType);

  This is where the WriteSupport is created; it then becomes a member of HoodieInternalRowParquetWriter:

HoodieRowParquetWriteSupport writeSupport =
        new HoodieRowParquetWriteSupport(table.getHadoopConf(), structType, bloomFilterOpt, writeConfig);

return new HoodieInternalRowParquetWriter(
    path,
    new HoodieParquetConfig<>(
        writeSupport,
        writeConfig.getParquetCompressionCodec(),
        writeConfig.getParquetBlockSize(),
        writeConfig.getParquetPageSize(),
        writeConfig.getParquetMaxFileSize(),
        writeSupport.getHadoopConf(),
        writeConfig.getParquetCompressionRatio(),
        writeConfig.parquetDictionaryEnabled()
    ));

  In HoodieInternalRowParquetWriter's constructor it can be seen being kept as a separate member:

public HoodieInternalRowParquetWriter(Path file, HoodieParquetConfig<HoodieRowParquetWriteSupport> parquetConfig)
    throws IOException {
  super(file, parquetConfig);

  this.writeSupport = parquetConfig.getWriteSupport();
}

  HoodieInternalRowParquetWriter's ultimate parent class is the third-party ParquetWriter, whose constructor takes the WriteSupport as a parameter.

3.2.Writing Data

  When writing, the upper layer calls HoodieInternalRowParquetWriter via its writeRow method, which also invokes the WriteSupport's write hook:

public void writeRow(UTF8String key, InternalRow row) throws IOException {
  super.write(row);
  writeSupport.add(key);
}

  The writer ultimately just calls the parent ParquetWriter's write method.
  writeSupport.add(key) is the WriteSupport hook; here it simply buffers the key in memory:

public void add(UTF8String recordKey) {
  this.bloomFilterWriteSupportOpt.ifPresent(bloomFilterWriteSupport ->
      bloomFilterWriteSupport.addKey(recordKey));
}

  The buffered keys are consumed in the finalizeWrite method:

public WriteSupport.FinalizedWriteContext finalizeWrite() {
  Map<String, String> extraMetadata =
      bloomFilterWriteSupportOpt.map(HoodieBloomFilterWriteSupport::finalizeMetadata)
          .orElse(Collections.emptyMap());

  return new WriteSupport.FinalizedWriteContext(extraMetadata);
}

  finalizeWrite is invoked from the close method of the third-party InternalParquetRecordWriter.
  The HoodieBloomFilterWriteSupport::finalizeMetadata step is what writes out the custom metadata; its body is the finalizeMetadata method already shown in section 2.3.

4.Triggering the Index

4.1.Create Table Statement

create table indexTest (
  id int,
  name string,
  price double
) using hudi
 location '/spark/hudi/'
 tblproperties (
  primaryKey ='id',
  type = 'mor',
  hoodie.index.type = 'BLOOM',
  hoodie.compact.inline = 'true'
 );

  HoodieIndexConfig controls this configuration item. As getDefaultIndexType below shows, the Spark default is SIMPLE rather than BLOOM, which is why the DDL above sets hoodie.index.type = 'BLOOM' explicitly:

private String getDefaultIndexType(EngineType engineType) {
  switch (engineType) {
    case SPARK:
      return HoodieIndex.IndexType.SIMPLE.name();
    case FLINK:
    case JAVA:
      return HoodieIndex.IndexType.INMEMORY.name();
    default:
      throw new HoodieNotSupportedException("Unsupported engine " + engineType);
  }
}

5.Index Updates
