Apache Hudi architecture and implementation research

This article has two parts:

1. Hudi scenarios and concepts
2. Performance bottlenecks

1. Hudi scenarios and concepts

Hudi introduces a lot of concepts. Some resemble HBase concepts, while others are new. But what is the relationship between the scenarios and the concepts?


This is a mind map of Hudi. A mind map is better than plain text here: if we can connect the scenarios and the concepts, we can run the engine better. The map has three parts, and I will introduce them in turn.

These are the common scenarios in Hudi: copy-on-write (COW) and merge-on-read (MOR) modes. MOR is an extension of COW; the main difference is that MOR mode keeps Avro log files.

In fact, many MOR classes extend the corresponding COW classes.

For example:
HoodieMergeOnReadTable<T extends HoodieRecordPayload> extends HoodieCopyOnWriteTable<T>
MergeOnReadLazyInsertIterable<T extends HoodieRecordPayload> extends CopyOnWriteLazyInsertIterable<T>

Both COW and MOR operations trigger Timeline changes.

The Timeline is the core of Hudi; basically all operations are related to the Timeline.

There are three classes representing the active, archived, and rollback timelines.

A timeline contains a list of instants, and each instant object carries a state.

I drew a state diagram of the state transitions; an instant transitions between the different states and types defined in the HoodieTimeline class.

The Timeline classes move instants through the different states, but ultimately this writes many metadata files to HDFS.
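To make the state transitions concrete, here is a minimal sketch (my own illustration, not Hudi's actual code) of an instant moving through the states that Hudi models as REQUESTED, INFLIGHT, and COMPLETED, with the timeline recording one entry per state change:

```java
import java.util.ArrayList;
import java.util.List;

public class InstantStateDemo {
    enum State { REQUESTED, INFLIGHT, COMPLETED }

    static class Instant {
        final String timestamp; // e.g. "20210316143000"
        final String action;    // e.g. "commit"
        State state = State.REQUESTED;

        Instant(String timestamp, String action) {
            this.timestamp = timestamp;
            this.action = action;
        }

        // Transitions are one-way (REQUESTED -> INFLIGHT -> COMPLETED);
        // every step appends a new entry to the timeline.
        void transitionTo(State next, List<String> timeline) {
            if (next.ordinal() != state.ordinal() + 1) {
                throw new IllegalStateException(state + " -> " + next);
            }
            state = next;
            timeline.add(timestamp + "." + action + "." + state);
        }
    }

    public static void main(String[] args) {
        List<String> timeline = new ArrayList<>();
        Instant commit = new Instant("20210316143000", "commit");
        timeline.add(commit.timestamp + "." + commit.action + "." + commit.state);
        commit.transitionTo(State.INFLIGHT, timeline);
        commit.transitionTo(State.COMPLETED, timeline);
        // One completed instant produced three timeline entries.
        System.out.println(timeline.size());
    }
}
```

Note that this is why a single logical commit leaves several traces behind: each state change is persisted separately.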

So on the next page we can see what Hudi's file management looks like.

The file management policy is important in Hudi. If we run a Spark job to write data into a Hudi table, in fact you only need to set the basePath and a few key configurations instead of issuing a DDL SQL statement.

Because Hudi stores its metadata in HDFS directories, reading the metadata triggers HDFS operations too.

There are instant metadata files, log files, Parquet files, and partition metadata files.

We will continue the file management discussion in the final part.

This is the overall relationship diagram of Hudi, with the class names added to the corresponding nodes.

Those relationships and implementations are closely related to engine performance.

2. Performance bottlenecks

When we know the relationship between the scenarios, concepts, and implementation, we can locate the performance bottlenecks of the engine.

The first performance bottleneck:

$basePath/.hoodie
If you request a write operation and commit it, the number of instant files increases.
One instant creates more than one file, because every state change produces a new file.
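The accumulation can be sketched as follows. This is a hypothetical illustration, not Hudi code: the file names follow the commonly seen `<timestamp>.<action>[.state]` pattern, but the exact names vary by action type and Hudi version.

```java
import java.util.ArrayList;
import java.util.List;

public class InstantFilesDemo {
    // Files produced under $basePath/.hoodie by one completed instant:
    // one file per state change (requested, inflight, completed).
    static List<String> filesForInstant(String ts, String action) {
        List<String> files = new ArrayList<>();
        files.add(ts + "." + action + ".requested"); // write requested
        files.add(ts + "." + action + ".inflight");  // write in progress
        files.add(ts + "." + action);                // write committed
        return files;
    }

    public static void main(String[] args) {
        List<String> hoodieDir = new ArrayList<>();
        // 100 commits leave 300 small metadata files behind.
        for (int i = 0; i < 100; i++) {
            hoodieDir.addAll(filesForInstant(String.format("202103161430%02d", i), "commit"));
        }
        System.out.println(hoodieDir.size());
    }
}
```

So a long-running writer steadily grows the number of small files the NameNode has to track, unless old instants are archived.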

 

$basePath/$partitionPath
In $basePath/$partitionPath there are .hoodie_partition_metadata files, Parquet files, and Avro log files.
 
The number of partition items and dimensions is important.
If we have a 3-level partitionPath (A/B/C) where:
A contains 10 items,
B contains 5 items,
C contains 20 items,
then the number of directories under $basePath is 10 × 5 × 20 = 1000.
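The directory arithmetic is just the product of the item counts at each partition level, which a one-liner can confirm:

```java
public class PartitionCountDemo {
    // Number of leaf partition directories for a multi-level partition path.
    static long leafDirectories(int... itemsPerLevel) {
        long total = 1;
        for (int n : itemsPerLevel) {
            total *= n;
        }
        return total;
    }

    public static void main(String[] args) {
        // A has 10 items, B has 5, C has 20: 10 * 5 * 20 = 1000 directories.
        System.out.println(leafDirectories(10, 5, 20));
    }
}
```

Adding one more dimension, or more items per dimension, multiplies this number, which is why high-cardinality partition keys hurt.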
Because there are so many partition directories and instant files, getting or reloading the metadata issues a lot of fs.listFiles or listStatus calls.
 

In the Hudi community, they are discussing how to reduce the number of operations on HDFS:

multiple metadata files => metadata read from an index file and a single metadata file. This reduces HDFS NameNode pressure and improves the performance of reading Hudi metadata.
 


The second performance bottleneck is write amplification and read perspiration.

Engineers never stop optimizing for them in storage engines.

 

It means that when you write some data into storage, more data is read from and written to disk.

In COW mode, an 'update' operation triggers write amplification and read perspiration, because updating even a few records rewrites the whole Parquet file.

 

In MOR mode, both 'update' and 'compaction' operations trigger write amplification and read perspiration.
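A rough model (my own illustration, not Hudi code) shows how severe COW amplification can get: the bytes physically rewritten are the whole file, regardless of how few records changed.

```java
public class WriteAmplificationDemo {
    // Ratio of bytes physically written to bytes logically updated.
    static double cowAmplification(long fileSizeBytes, long updatedBytes) {
        return (double) fileSizeBytes / updatedBytes;
    }

    public static void main(String[] args) {
        long fileSize = 128L * 1024 * 1024; // one 128 MB Parquet file
        long updated = 1L * 1024 * 1024;    // 1 MB of updated records
        // COW rewrites the whole file to apply 1 MB of updates: 128x amplification.
        System.out.println((long) cowAmplification(fileSize, updated));
    }
}
```

MOR defers this cost by appending updates to Avro logs, but compaction eventually pays a similar price when it merges the logs back into base files.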

 

We need to find appropriate compaction parameters, so we have to know when compaction is triggered and how it selects files.
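For example, inline compaction of MOR tables is commonly tuned with options like the following (these option names come from Hudi's compaction configuration; the values are illustrative and defaults vary by version):

```properties
# Run compaction from the writer itself after commits
hoodie.compact.inline=true
# Compact once this many delta commits have accumulated
hoodie.compact.inline.max.delta.commits=5
```

A smaller delta-commit threshold compacts more often (more write amplification, faster reads); a larger one defers the cost (less write amplification, slower merge-on-read queries).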

The above is the performance analysis part of this research.

I think there are more challenges and opportunities in Hudi.
