Preface
This post tests some of the features mentioned on the Hudi website. All of the test data was generated directly by the following code:
from faker import Faker


def fake_data(faker: Faker, row_num: int):
    # Write row_num fake student records to a CSV file
    file_name = f'/Users/gavin/Desktop/tmp/student_{row_num}_rows.csv'
    with open(file=file_name, mode='w') as file:
        file.write("id,name,age,adress,partition_path\n")
        for i in range(row_num):
            file.write(
                f'{faker.iana_id()},{faker.name()},{faker.random_int(min=15, max=25)},{faker.address()},{faker.day_of_week()}\n')


if __name__ == '__main__':
    my_faker = Faker(locale='zh_CN')
    fake_data(my_faker, 100000)
Sample test data:
| id | name | age | adress | partition_path |
| --- | --- | --- | --- | --- |
| 7548525 | 谭娜 | 15 | 黑龙江省广州市白云姚路w座 391301 | Sunday |
| 5615440 | 金亮 | 19 | 陕西省巢湖县西峰张街N座 711897 | Tuesday |
| 3887721 | 刘倩 | 21 | 贵州省敏县清浦深圳路A座 116469 | Thursday |
Command used to launch pyspark with the Hudi bundle:
pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
A Deeper Look at Hudi Basics
File Layouts
Copy-on-Write
Besides the parquet data files, a Hudi table keeps a .hoodie folder under the table's root directory that stores the table's metadata. Example:
gavin@GavindeMacBook-Pro hudi_tables % tree -a student_for_pre_validate
student_for_pre_validate
├── .hoodie # Hudi table metadata: commit info, marker info, etc.
│ ├── .20220317111613163.commit.requested.crc
│ ├── .20220317111613163.inflight.crc
│ ├── .aux
│ │ └── .bootstrap # .bootstrap holds files written during a bootstrap operation, which converts an existing table into a Hudi table; no bootstrap was performed here, so it is empty
│ │ ├── .fileids
│ │ └── .partitions
│ ├── .hoodie.properties.crc
│ ├── .temp
│ │ └── 20220317111613163
│ │ ├── .MARKERS.type.crc
│ │ ├── .MARKERS0.crc
│ │ ├── MARKERS.type
│ │ └── MARKERS0
│ ├── 20220317111613163.commit.requested
│ ├── 20220317111613163.inflight
│ ├── archived # directory for archived instants. As writes keep landing on a Hudi table, the number of instants on the timeline grows, so instants are archived at commit time and removed from the timeline to reduce the pressure on it. Our instant count has not yet reached the default of 30, so no archive files have been produced
│ └── hoodie.properties
├── Friday # data files for this partition
│ ├── ..hoodie_partition_metadata.crc
│ ├── .65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet.crc
│ ├── .hoodie_partition_metadata
│ └── 65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet
└── Wednesday # data files for this partition
├── ..hoodie_partition_metadata.crc
├── .4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet.crc
├── .hoodie_partition_metadata
└── 4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet
10 directories, 18 files
gavin@GavindeMacBook-Pro hudi_tables %
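As a quick sanity check on the layout above, the timeline instants under .hoodie can be listed directly. The sketch below is not from the original test run; the table path is an assumption, and it simply prints every non-hidden file whose name looks like an instant (timestamp-prefixed).

import os

# Assumed relative path: run from the hudi_tables directory shown in the listing above.
table_path = 'student_for_pre_validate'
timeline_dir = os.path.join(table_path, '.hoodie')

for name in sorted(os.listdir(timeline_dir)):
    full_path = os.path.join(timeline_dir, name)
    # Instant files are named <timestamp>.<action>[.<state>], so they start with a digit;
    # the .aux/.temp/archived directories and hoodie.properties are skipped.
    if os.path.isfile(full_path) and name[0].isdigit():
        print(name)

Run against the table above, this would print 20220317111613163.commit.requested and 20220317111613163.inflight.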
Merge-on-Read
For MOR, see: Apache Hudi 从入门到放弃(2) —— MOR表的文件结构分析 (a file-structure analysis of MOR tables).
Information recorded in the commit file
Conclusions:
- Every parquet file gets a fileId when it is created; this id is used as the filename prefix of the parquet file and is also recorded in the commit file. Later modifications to that file only change the timestamp portion of the filename, while the fileId prefix stays the same.
- For each fileId, the commit file records per-commit statistics such as numWrites, numDeletes, numUpdateWrites, numInserts, and the file size.
- The commit file records the mapping from each fileId to the concrete data file.
- The commit file records the table's schema (a minimal parsing sketch follows this list).
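A .commit file is plain JSON, so the points above are easy to verify programmatically. The sketch below uses an assumed commit-file path; the keys it reads (partitionToWriteStats, extraMetadata.schema, fileIdAndRelativePaths) are the ones visible in the dumps that follow.

import json

# Assumed commit-file path; substitute a real instant from your table's .hoodie directory.
commit_file = '/tmp/hudi_base_path/.hoodie/20220316171316850.commit'

with open(commit_file) as f:
    commit = json.load(f)

# The table schema recorded with the commit (stored as an escaped Avro schema string).
print(commit['extraMetadata']['schema'])

# fileId -> relative data file path written by this commit.
for file_id, rel_path in commit['fileIdAndRelativePaths'].items():
    print(f'{file_id} -> {rel_path}')

# Per-file write statistics, grouped by partition.
for partition, stats in commit['partitionToWriteStats'].items():
    for s in stats:
        print(partition, s['fileId'], s['numWrites'], s['numUpdateWrites'],
              s['numInserts'], s['fileSizeInBytes'])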
Concrete data walkthrough
vi 20220316171316850.commit:
{
"partitionToWriteStats" : {
"Thursday" : [ {
"fileId" : "9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0",
"path" : "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet",
"prevCommit" : "null",
"numWrites" : 461,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 461,
"totalWriteBytes" : 451097,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "Thursday",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 451097,
"minEventTime" : null,
"maxEventTime" : null
},
···
···
···
{
"fileId" : "1efa72c3-a714-46e2-bb91-5019fa6e7ede-0",
"path" : "Saturday/1efa72c3-a714-46e2-bb91-5019fa6e7ede-0_224-53-265_20220316171316850.parquet",
"prevCommit" : "null",
"numWrites" : 210,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 210,
"totalWriteBytes" : 443162,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "Saturday",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 443162,
"minEventTime" : null,
"maxEventTime" : null
} ]
},
"compacted" : false,
"extraMetadata" : {
"schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
},
"operationType" : "UPSERT",
"fileIdAndRelativePaths" : {
"111e0979-9006-441d-9af2-ac9656be4500-0" : "Sunday/111e0979-9006-441d-9af2-ac9656be4500-0_120-47-161_20220316171316850.parquet",
...
...
...
"a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0" : "Monday/a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0_33-47-74_20220316171316850.parquet",
"9fa086af-e28e-4a3f-9a31-06b658ad514b-0" : "Thursday/9fa086af-e28e-4a3f-9a31-06b658ad514b-0_15-41-56_20220316171316850.parquet"
},
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0,
"totalRecordsDeleted" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 36958,
"totalUpsertTime" : 0,
"minAndMaxEventTime" : {
"Optional.empty" : {
"val" : null,
"present" : false
}
},
"writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ]
}
After performing one upsert:
vi 20220316171648081.commit
{
"partitionToWriteStats" : {
"Thursday" : [ {
"fileId" : "5540e2fd-bc18-42db-a831-f72a6d7eb603-0",
"path" : "Thursday/5540e2fd-bc18-42db-a831-f72a6d7eb603-0_0-29-492_20220316171648081.parquet",
"prevCommit" : "20220316171316850",
"numWrites" : 459,
"numDeletes" : 0,
"numUpdateWrites" : 1,
"numInserts" : 0,
"totalWriteBytes" : 450943,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "Thursday",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 450943,
"minEventTime" : null,
"maxEventTime" : null
},
···
···
···
{
"fileId" : "d02425d8-0216-4a3b-9810-b613d80cd60f-0",
"path" : "Saturday/d02425d8-0216-4a3b-9810-b613d80cd60f-0_433-53-925_20220316171648081.parquet",
"prevCommit" : "null",
"numWrites" : 84,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 84,
"totalWriteBytes" : 439040,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "Saturday",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 439040,
"minEventTime" : null,
"maxEventTime" : null
} ]
},
"compacted" : false,
"extraMetadata" : {
"schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
},
"operationType" : "UPSERT",
"fileIdAndRelativePaths" : {
"0d288e1e-f593-4782-95de-0583c4cd286b-0" : "Saturday/0d288e1e-f593-4782-95de-0583c4cd286b-0_415-53-907_20220316171648081.parquet",
···
···
···
"cc32046a-55b1-4b2b-be93-3225e42154b7-0" : "Saturday/cc32046a-55b1-4b2b-be93-3225e42154b7-0_211-53-703_20220316171648081.parquet"
},
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0,
"writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ],
"totalRecordsDeleted" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 31116,
"totalUpsertTime" : 37426,
"minAndMaxEventTime" : {
"Optional.empty" : {
"val" : null,
"present" : false
}
}
}
Data file changes when upserting
Conclusion: after an upsert, a new version of the data file is created; the new version contains both the historical data and the newly written data, and the previous version of the file is not modified (see the sketch below).
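One way to observe this on disk is to group the parquet files of a partition by their fileId prefix: after an upsert, each touched file group has more than one version, named <fileId>_<write token>_<commitTime>.parquet as in the commit dumps above. The sketch below is illustrative; the partition path is an assumption.

import os
from collections import defaultdict

# Assumed partition directory under the table's base path.
partition_dir = '/tmp/hudi_base_path/Thursday'

versions = defaultdict(list)
for name in os.listdir(partition_dir):
    if name.endswith('.parquet'):
        # The fileId prefix ends at the first underscore of the file name.
        file_id = name.split('_')[0]
        versions[file_id].append(name)

for file_id, files in versions.items():
    print(file_id)
    # Order versions by the trailing commit-timestamp portion of the name.
    for f in sorted(files, key=lambda n: n.rsplit('_', 1)[1]):
        print('    ', f)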
Test code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'

    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
'hoodie.datasource.write.precombine.fi