1.Key classes
1.1.HoodieIndex
The base class of the index implementations. Its two core methods are tagLocation and updateLocation.
Different subclasses implement the concrete index types.
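As a rough sketch of the shape of those two methods (the exact generic parameters and signatures vary across Hudi versions, so this is illustrative rather than the literal source):

public abstract class HoodieIndex<T extends HoodieRecordPayload, I, K, O> implements Serializable {
  // Tag each incoming record with the location (partitionPath + fileId) of the
  // existing file that may already contain the record's key.
  public abstract I tagLocation(I records, HoodieEngineContext context,
      HoodieTable<T, I, K, O> hoodieTable) throws HoodieIndexException;

  // After a write completes, let the index update itself from the write statuses
  // (effectively a no-op for the bloom index, whose state lives in the data files).
  public abstract O updateLocation(O writeStatuses, HoodieEngineContext context,
      HoodieTable<T, I, K, O> hoodieTable) throws HoodieIndexException;
}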
1.2.HoodieIndexFactory
There is no class with this exact name; it stands for the factory classes that create HoodieIndex instances. The concrete classes use it as a suffix: FlinkHoodieIndexFactory, SparkHoodieIndexFactory, JavaHoodieIndexFactory.
Taking FlinkHoodieIndexFactory as an example, it supports creating the following indexes:
// TODO more indexes to be added
switch (config.getIndexType()) {
  case INMEMORY:
    return new FlinkInMemoryStateIndex(context, config);
  case BLOOM:
    return new HoodieBloomIndex(config, ListBasedHoodieBloomIndexHelper.getInstance());
  case GLOBAL_BLOOM:
    return new HoodieGlobalBloomIndex(config, ListBasedHoodieBloomIndexHelper.getInstance());
  case SIMPLE:
    return new HoodieSimpleIndex(config, Option.empty());
  case GLOBAL_SIMPLE:
    return new HoodieGlobalSimpleIndex(config, Option.empty());
  case BUCKET:
    return new HoodieSimpleBucketIndex(config);
  default:
    throw new HoodieIndexException("Unsupported index type " + config.getIndexType());
}
1.3.BaseHoodieBloomIndexHelper
The class that performs the filtering step. It has two implementations: ListBasedHoodieBloomIndexHelper and SparkHoodieBloomIndexHelper.
1.4.HoodieBaseBloomIndexCheckFunction
The core of what BaseHoodieBloomIndexHelper drives. computeNext is the central method that performs the bloom filtering; the key point is the call to keyLookupHandle.addKey(recordKey).
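A condensed sketch of that computeNext logic (heavily simplified from the real lazy iterator; batching, size limits and error handling are omitted):

// input: (fileId, HoodieKey) pairs, sorted so keys of the same file are adjacent
protected List<KeyLookupResult> computeNext() {
  List<KeyLookupResult> ret = new ArrayList<>();
  while (inputItr.hasNext()) {
    Pair<String, HoodieKey> entry = inputItr.next();
    String fileId = entry.getLeft();
    String partitionPath = entry.getRight().getPartitionPath();
    String recordKey = entry.getRight().getRecordKey();
    if (keyLookupHandle == null) {
      // lazily open a handle (and thus a bloom filter) for the current file
      keyLookupHandle = new HoodieKeyLookupHandle(config, hoodieTable, Pair.of(partitionPath, fileId));
    }
    if (keyLookupHandle.getPartitionPathFileIDPair().getRight().equals(fileId)) {
      keyLookupHandle.addKey(recordKey); // the bloom-filter check happens here
    } else {
      // switched to a new file: emit the previous file's candidate keys
      ret.add(keyLookupHandle.getLookupResult());
      keyLookupHandle = new HoodieKeyLookupHandle(config, hoodieTable, Pair.of(partitionPath, fileId));
      keyLookupHandle.addKey(recordKey);
    }
  }
  if (keyLookupHandle != null) {
    ret.add(keyLookupHandle.getLookupResult()); // flush the last file
    keyLookupHandle = null;
  }
  return ret;
}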
1.5.HoodieKeyLookupHandle
This is the class of the keyLookupHandle mentioned in 1.4. Its addKey method uses the bloomFilter to filter record keys:
public void addKey(String recordKey) {
  // check record key against bloom filter of current file & add to possible keys if needed
  if (bloomFilter.mightContain(recordKey)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Record key " + recordKey + " matches bloom filter in " + partitionPathFileIDPair);
    }
    candidateRecordKeys.add(recordKey);
  }
  totalKeysChecked++;
}
The bloomFilter is also obtained in this class, in the constructor:
public HoodieKeyLookupHandle(HoodieWriteConfig config, HoodieTable<T, I, K, O> hoodieTable,
                             Pair<String, String> partitionPathFileIDPair) {
  super(config, hoodieTable, partitionPathFileIDPair);
  this.candidateRecordKeys = new ArrayList<>();
  this.totalKeysChecked = 0;
  this.bloomFilter = getBloomFilter();
}
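getBloomFilter() itself roughly does the following (a hedged sketch; newer Hudi versions may first consult the metadata table's bloom filter index before falling back to the base file):

private BloomFilter getBloomFilter() {
  // open a reader on the base file of this (partitionPath, fileId) pair and
  // deserialize the bloom filter stored in that file's footer metadata
  try (HoodieFileReader reader = createNewFileReader()) {
    return reader.readBloomFilter();
  } catch (IOException e) {
    throw new HoodieIndexException("Error reading bloom filter for " + partitionPathFileIDPair, e);
  }
}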
1.6.HoodieFileReader
The file reader class. The important method here is readBloomFilter(), which reads the bloom filter.
It has three implementations: HoodieParquetReader, HoodieOrcReader, HoodieHFileReader.
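The interface looks roughly like this (abridged sketch; the method set differs slightly between versions):

public interface HoodieFileReader<R extends IndexedRecord> extends AutoCloseable {
  String[] readMinMaxRecordKeys();                          // min/max record keys from the footer
  BloomFilter readBloomFilter();                            // the bloom filter from the footer
  Set<String> filterRowKeys(Set<String> candidateRowKeys);  // exact check against file contents
  Iterator<R> getRecordIterator(Schema readerSchema) throws IOException;
  Schema getSchema();
  long getTotalRecords();
}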
1.7.BaseWriteHelper
The class that triggers the HoodieIndex.tagLocation call. The call goes through its tag() method, which is in turn invoked by its own write() method.
Flink is a bit special: it overrides write() and does not call tag() internally, so the Flink flow may not use the bloom index at all.
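A condensed sketch of the write-to-tag path (simplified; the real method carries more parameters and its signature varies by version):

public HoodieWriteMetadata<O> write(String instantTime, I inputRecords, HoodieEngineContext context,
    HoodieTable<T, I, K, O> table, boolean shouldCombine, int shuffleParallelism,
    BaseCommitActionExecutor<T, I, K, O, R> executor, WriteOperationType operationType) {
  // optional de-duplication of the incoming batch
  I dedupedRecords = combineOnCondition(shouldCombine, inputRecords, shuffleParallelism, table);
  I taggedRecords = dedupedRecords;
  if (table.getIndex().requiresTagging(operationType)) {
    // this is where the flow reaches HoodieIndex.tagLocation
    taggedRecords = tag(dedupedRecords, context, table);
  }
  return executor.execute(taggedRecords);
}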
2.Flow
2.1.Creation
The call chain that builds the relevant classes when a lookup is performed:
HoodieTable.getIndex -> HoodieIndexFactory.createIndex -> HoodieIndex
When the HoodieIndex is created, its constructor arguments include the BaseHoodieBloomIndexHelper.
The relevant config option is hoodie.index.type.
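A hedged sketch of wiring this up programmatically (the builder and factory method names follow mainline Hudi, but treat the exact signatures as assumptions):

HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withIndexConfig(HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.BLOOM) // hoodie.index.type
        .build())
    .build();
// the table then resolves its index through the engine-specific factory
HoodieIndex index = FlinkHoodieIndexFactory.createIndex(context, writeConfig);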
2.2.Obtaining the bloomFilter
The overall call flow for obtaining the bloomFilter:
BaseWriteHelper.write -> tag -> HoodieIndex.tagLocation -> lookupIndex -> BaseHoodieBloomIndexHelper.findMatchingFilesForRecordKeys -> HoodieBaseBloomIndexCheckFunction.apply -> LazyKeyCheckIterator.computeNext -> HoodieKeyLookupHandle.getBloomFilter -> HoodieFileReader.readBloomFilter -> BaseFileUtils.readBloomFilterFromMetadata
After the bloomFilter is obtained, it is used in HoodieKeyLookupHandle.addKey.
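The final step of that chain, BaseFileUtils.readBloomFilterFromMetadata, roughly does the following (a simplified sketch; it reads back exactly the footer entries that section 2.3 below shows being written):

public BloomFilter readBloomFilterFromMetadata(Configuration conf, Path filePath) {
  // pull the relevant key/value entries out of the base file's footer
  Map<String, String> footerVals = readFooter(conf, false, filePath,
      HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY, HOODIE_BLOOM_FILTER_TYPE_CODE);
  String serialized = footerVals.get(HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY);
  if (serialized == null) {
    return null; // file was written without a bloom filter
  }
  // rebuild the filter; default to the SIMPLE type when no type code was stored
  String typeCode = footerVals.getOrDefault(HOODIE_BLOOM_FILTER_TYPE_CODE,
      BloomFilterTypeCode.SIMPLE.name());
  return BloomFilterFactory.fromString(serialized, typeCode);
}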
2.3.Writing the bloomFilter to the file
While data is written, each record key is added to the bloomFilter:
DataWriter.write -> BulkInsertDataInternalWriterHelper.write -> HoodieRowCreateHandle.writeRow -> HoodieInternalRowParquetWriter.writeRow -> HoodieRowParquetWriteSupport.add -> HoodieBloomFilterWriteSupport.addKey -> BloomFilter.add
Afterwards the bloomFilter is written out with the file. Since it lands in the Parquet footer, this appears to go directly through the interfaces Parquet provides rather than through any explicit file-write call:
HoodieRowParquetWriteSupport.finalizeWrite -> HoodieBloomFilterWriteSupport.finalizeMetadata -> BloomFilter.serializeToString
public Map<String, String> finalizeMetadata() {
  HashMap<String, String> extraMetadata = new HashMap<>();
  extraMetadata.put(HOODIE_AVRO_BLOOM_FILTER_METADATA_KEY, bloomFilter.serializeToString());
  if (bloomFilter.getBloomFilterTypeCode().name().contains(HoodieDynamicBoundedBloomFilter.TYPE_CODE_PREFIX)) {
    extraMetadata.put(HOODIE_BLOOM_FILTER_TYPE_CODE, bloomFilter.getBloomFilterTypeCode().name());
  }
  if (minRecordKey != null && maxRecordKey != null) {
    extraMetadata.put(HOODIE_MIN_RECORD_KEY_FOOTER, minRecordKey.toString());
    extraMetadata.put(HOODIE_MAX_RECORD_KEY_FOOTER, maxRecordKey.toString());
  }
  return extraMetadata;
}
The final Parquet write is based on ParquetFileWriter; the fragment below is from the third-party InternalParquetRecordWriter.close():
FinalizedWriteContext finalWriteContext = writeSupport.finalizeWrite();
Map<String, String> finalMetadata = new HashMap<String, String>(extraMetaData);
String modelName = writeSupport.getName();
if (modelName != null) {
  finalMetadata.put(ParquetWriter.OBJECT_MODEL_NAME_PROP, modelName);
}
finalMetadata.putAll(finalWriteContext.getExtraMetaData());
parquetFileWriter.end(finalMetadata);
2.4.The index key
The index configuration is global; there is no per-column index. The key can come from several sources: a uuid, the partition, the primaryKey (obtained via field.get(0)), or values derived from the compute engine's data types; see the HoodieKey constructor for details.
In addition, a config decides which column the key is taken from: hoodie.datasource.write.recordkey.field, which defaults to uuid.
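For reference, a HoodieKey itself is just a (recordKey, partitionPath) pair; the values below are made up for illustration:

// record key extracted from the column(s) named by
// hoodie.datasource.write.recordkey.field, plus the record's partition path
HoodieKey key = new HoodieKey("42", "dt=2023-01-01");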
3.Parquet writer class structure
3.1.Creation
Starting from HoodieRowCreateHandle: its constructor creates the HoodieInternalRowFileWriter
this.fileWriter = HoodieInternalRowFileWriterFactory.getInternalRowFileWriter(path, table, writeConfig, structType);
This is where the WriteSupport is created; it then becomes a member of HoodieInternalRowParquetWriter:
HoodieRowParquetWriteSupport writeSupport =
    new HoodieRowParquetWriteSupport(table.getHadoopConf(), structType, bloomFilterOpt, writeConfig);
return new HoodieInternalRowParquetWriter(
    path,
    new HoodieParquetConfig<>(
        writeSupport,
        writeConfig.getParquetCompressionCodec(),
        writeConfig.getParquetBlockSize(),
        writeConfig.getParquetPageSize(),
        writeConfig.getParquetMaxFileSize(),
        writeSupport.getHadoopConf(),
        writeConfig.getParquetCompressionRatio(),
        writeConfig.parquetDictionaryEnabled()
    ));
In HoodieInternalRowParquetWriter's constructor you can see it is kept as its own member:
public HoodieInternalRowParquetWriter(Path file, HoodieParquetConfig<HoodieRowParquetWriteSupport> parquetConfig)
    throws IOException {
  super(file, parquetConfig);
  this.writeSupport = parquetConfig.getWriteSupport();
}
The ultimate parent class of HoodieInternalRowParquetWriter is the third-party ParquetWriter, whose constructor takes the WriteSupport as a parameter.
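The hierarchy, roughly (depending on the Hudi version there may be an intermediate HoodieBaseParquetWriter layer; treat that as an assumption):

// HoodieInternalRowParquetWriter
//   -> (possibly HoodieBaseParquetWriter<InternalRow>)
//     -> org.apache.parquet.hadoop.ParquetWriter<InternalRow> // third-party; ctor takes the WriteSupport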
3.2.Writing data
On the write path, the upper layer calls HoodieInternalRowParquetWriter via its writeRow interface, which also invokes the WriteSupport's write interface:
public void writeRow(UTF8String key, InternalRow row) throws IOException {
  super.write(row);
  writeSupport.add(key);
}
The writer ultimately just calls the parent ParquetWriter's write interface.
writeSupport.add(key) is the WriteSupport interface; here it buffers the key in memory:
public void add(UTF8String recordKey) {
  this.bloomFilterWriteSupportOpt.ifPresent(bloomFilterWriteSupport ->
      bloomFilterWriteSupport.addKey(recordKey));
}
That buffered state is consumed in the finalizeWrite interface:
public WriteSupport.FinalizedWriteContext finalizeWrite() {
  Map<String, String> extraMetadata =
      bloomFilterWriteSupportOpt.map(HoodieBloomFilterWriteSupport::finalizeMetadata)
          .orElse(Collections.emptyMap());
  return new WriteSupport.FinalizedWriteContext(extraMetadata);
}
The finalizeWrite interface is invoked by the third-party InternalParquetRecordWriter in its close() method.
The HoodieBloomFilterWriteSupport::finalizeMetadata step is what injects the custom metadata; its code is the finalizeMetadata method already listed in section 2.3.
4.Triggering the index
4.1.Create table statement
create table indexTest (
id int,
name string,
price double
) using hudi
location '/spark/hudi/'
tblproperties (
primaryKey ='id',
type = 'mor',
hoodie.index.type = 'BLOOM',
hoodie.compact.inline = 'true'
);
HoodieIndexConfig governs these options. Note that, per getDefaultIndexType below, Spark's default is actually the SIMPLE index rather than bloom, which is why the statement above sets hoodie.index.type = 'BLOOM' explicitly:
private String getDefaultIndexType(EngineType engineType) {
  switch (engineType) {
    case SPARK:
      return HoodieIndex.IndexType.SIMPLE.name();
    case FLINK:
    case JAVA:
      return HoodieIndex.IndexType.INMEMORY.name();
    default:
      throw new HoodieNotSupportedException("Unsupported engine " + engineType);
  }
}