Apache Hudi - 2 - Basic Features & Hands-On Practice

This article takes a deep look at Apache Hudi's features, including the Copy-on-Write and Merge-on-Read modes, the details of commit files, the effects of upsert operations, the consequences of manually deleting data, and the precombine.field feature. It also puts Hudi's small-file control during upsert, Clustering, Cleaning, and data-quality checks into practice, and details the relevant configuration parameters and SQL CREATE TABLE statements.

Preface

This article tests some of the features (capabilities) mentioned on the Hudi website. All test data was generated directly by the following code:

from faker import Faker


def fake_data(faker: Faker, row_num: int):
    file_name = f'/Users/gavin/Desktop/tmp/student_{row_num}_rows.csv'
    with open(file=file_name, mode='w') as file:
        file.write("id,name,age,adress,partition_path\n")
        for i in range(row_num):
            # Use the faker argument (the original referenced the global my_faker here)
            file.write(
                f'{faker.iana_id()},{faker.name()},{faker.random_int(min=15, max=25)},{faker.address()},{faker.day_of_week()}\n')


if __name__ == '__main__':
    my_faker = Faker(locale='zh_CN')
    fake_data(my_faker, 100000)

Sample test data:

id name age adress partition_path
7548525 谭娜 15 黑龙江省广州市白云姚路w座 391301 Sunday
5615440 金亮 19 陕西省巢湖县西峰张街N座 711897 Tuesday
3887721 刘倩 21 贵州省敏县清浦深圳路A座 116469 Thursday

Command to start pyspark with the Hudi packages:

pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
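
The same dependencies can also be resolved from code rather than the command line; a minimal sketch, assuming network access to Maven Central, using the standard spark.jars.packages config (everything here mirrors the command above):

from pyspark.sql import SparkSession

# Minimal sketch: resolve the Hudi and Avro bundles at startup via
# spark.jars.packages (equivalent to the --packages flag above).
spark = SparkSession.builder \
    .appName("hudi-demo") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,"
            "org.apache.spark:spark-avro_2.12:3.1.2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()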

Digging Deeper into Hudi Basics

File Layouts
Copy-on-Write

In addition to the parquet data files, a Hudi table keeps a .hoodie folder under the table's root directory that stores the table's metadata. An example:

gavin@GavindeMacBook-Pro hudi_tables % tree -a student_for_pre_validate
student_for_pre_validate
├── .hoodie # table metadata files, including commit info, marker info, etc.
│   ├── .20220317111613163.commit.requested.crc
│   ├── .20220317111613163.inflight.crc
│   ├── .aux
│   │   └── .bootstrap # holds files produced by the bootstrap operation, which converts an existing table into a Hudi table; empty here because no bootstrap was performed
│   │       ├── .fileids
│   │       └── .partitions
│   ├── .hoodie.properties.crc
│   ├── .temp
│   │   └── 20220317111613163
│   │       ├── .MARKERS.type.crc
│   │       ├── .MARKERS0.crc
│   │       ├── MARKERS.type
│   │       └── MARKERS0
│   ├── 20220317111613163.commit.requested
│   ├── 20220317111613163.inflight
│   ├── archived # directory for archived instants. As writes accumulate, the number of instants on the Timeline keeps growing; to reduce pressure on Timeline operations, instants are archived at commit time and removed from the Timeline. No files appear here because our instant count has not yet reached the default threshold of 30
│   └── hoodie.properties 
├── Friday # partition data
│   ├── ..hoodie_partition_metadata.crc
│   ├── .65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet.crc
│   ├── .hoodie_partition_metadata
│   └── 65792147-0976-4433-91a1-cb9867326bdf-0_0-30-30_20220317111613163.parquet
└── Wednesday # partition data
    ├── ..hoodie_partition_metadata.crc
    ├── .4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet.crc
    ├── .hoodie_partition_metadata
    └── 4454a7c0-4e4c-4ef6-b790-e066dd2fc8ca-0_1-30-31_20220317111613163.parquet

10 directories, 18 files
gavin@GavindeMacBook-Pro hudi_tables % 
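
The file names under .hoodie encode the Timeline: each instant time is paired with an action and state (e.g. commit.requested, inflight). A minimal sketch for listing the timeline files programmatically, assuming the local table path shown above:

import os

# Hypothetical local table path; adjust to your environment.
table_path = '/Users/gavin/Desktop/hudi_tables/student_for_pre_validate'
timeline_dir = os.path.join(table_path, '.hoodie')

# Timeline files are named <instant>.<action>[.<state>], e.g.
# 20220317111613163.commit.requested; hoodie.properties is also listed here.
for name in sorted(os.listdir(timeline_dir)):
    if not name.startswith('.') and os.path.isfile(os.path.join(timeline_dir, name)):
        print(name)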
Merge-on-Read

Besides base parquet files, an MOR table also keeps row-based log files that are merged in during compaction; for a detailed file-structure analysis, see: Apache Hudi 从入门到放弃(2) —— MOR表的文件结构分析.

What the commit file records

Conclusions (verified with the sketch after this list):

  • Every parquet file is assigned a fileId when it is created; the id serves as the filename prefix of the parquet file and is also recorded in the commit file. Later modifications to the file only change the timestamp portion of the filename, while the fileId prefix stays the same
  • For every fileId, the commit file records numWrites, numDeletes, numUpdateWrites, numInserts, the file size, and other basic per-write information
  • The commit file records the mapping between each fileId and its concrete file
  • The commit file records the table's schema
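
Since a commit file is plain JSON, these conclusions are easy to check with a few lines of code; a minimal sketch, assuming a hypothetical commit file path:

import json

# Hypothetical commit file path under the table's .hoodie directory.
commit_file = '/tmp/hudi_base_path/.hoodie/20220316171316850.commit'
with open(commit_file) as f:
    commit = json.load(f)

# Per-fileId write stats, as recorded in partitionToWriteStats.
for partition, stats in commit['partitionToWriteStats'].items():
    for s in stats:
        print(partition, s['fileId'], 'writes:', s['numWrites'],
              'updates:', s['numUpdateWrites'], 'inserts:', s['numInserts'],
              'bytes:', s['fileSizeInBytes'])

# The table schema is stored (as an escaped Avro JSON string) in extraMetadata.
print(json.loads(commit['extraMetadata']['schema'])['fields'])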

A concrete walk-through

vi 20220316171316850.commit:

{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0",
      "path" : "Thursday/9643d9e7-82b1-4e84-b8e2-0ae625bb54d5-0_0-29-41_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 461,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 461,
      "totalWriteBytes" : 451097,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 451097,
      "minEventTime" : null,
      "maxEventTime" : null
    },
    ...
    {
      "fileId" : "1efa72c3-a714-46e2-bb91-5019fa6e7ede-0",
      "path" : "Saturday/1efa72c3-a714-46e2-bb91-5019fa6e7ede-0_224-53-265_20220316171316850.parquet",
      "prevCommit" : "null",
      "numWrites" : 210,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 210,
      "totalWriteBytes" : 443162,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 443162,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
   
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
   
    "111e0979-9006-441d-9af2-ac9656be4500-0" : "Sunday/111e0979-9006-441d-9af2-ac9656be4500-0_120-47-161_20220316171316850.parquet",
    ...
    "a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0" : "Monday/a13bf769-7dcb-4aa7-a26f-fd701aa07eaf-0_33-47-74_20220316171316850.parquet",
    "9fa086af-e28e-4a3f-9a31-06b658ad514b-0" : "Thursday/9fa086af-e28e-4a3f-9a31-06b658ad514b-0_15-41-56_20220316171316850.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 36958,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
   
    "Optional.empty" : {
   
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ]
}

After performing one upsert (note how prevCommit in the Thursday entry now points to the previous instant 20220316171316850, and numUpdateWrites is 1):

vi 20220316171648081.commit

{
  "partitionToWriteStats" : {
    "Thursday" : [ {
      "fileId" : "5540e2fd-bc18-42db-a831-f72a6d7eb603-0",
      "path" : "Thursday/5540e2fd-bc18-42db-a831-f72a6d7eb603-0_0-29-492_20220316171648081.parquet",
      "prevCommit" : "20220316171316850",
      "numWrites" : 459,
      "numDeletes" : 0,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 450943,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Thursday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 450943,
      "minEventTime" : null,
      "maxEventTime" : null
    },
    ...
    {
      "fileId" : "d02425d8-0216-4a3b-9810-b613d80cd60f-0",
      "path" : "Saturday/d02425d8-0216-4a3b-9810-b613d80cd60f-0_433-53-925_20220316171648081.parquet",
      "prevCommit" : "null",
      "numWrites" : 84,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 84,
      "totalWriteBytes" : 439040,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "Saturday",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 439040,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
   
    "schema" : "{\"type\":\"record\",\"name\":\"student_record\",\"namespace\":\"hoodie.student\",\"fields\":[{\"name\":\"id\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"age\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"adress\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"partition_path\",\"type\":[\"null\",\"string\"],\"default\":null}]}"
  },
  "operationType" : "UPSERT",
  "fileIdAndRelativePaths" : {
   
    "0d288e1e-f593-4782-95de-0583c4cd286b-0" : "Saturday/0d288e1e-f593-4782-95de-0583c4cd286b-0_415-53-907_20220316171648081.parquet",
    ...
    "cc32046a-55b1-4b2b-be93-3225e42154b7-0" : "Saturday/cc32046a-55b1-4b2b-be93-3225e42154b7-0_211-53-703_20220316171648081.parquet"
  },
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "writePartitionPaths" : [ "Thursday", "Monday", "Friday", "Sunday", "Wednesday", "Tuesday", "Saturday" ],
  "totalRecordsDeleted" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 31116,
  "totalUpsertTime" : 37426,
  "minAndMaxEventTime" : {
   
    "Optional.empty" : {
   
      "val" : null,
      "present" : false
    }
  }
}

How data files change during an upsert

Conclusion: after an upsert, a new version of the data file is written. The new version contains both the historical data and the new data, while the previous version files remain unchanged (see the sketch below).
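
This is easy to observe by grouping the parquet files in a partition directory by their fileId prefix; a minimal sketch, assuming a hypothetical local partition path:

import os
from collections import defaultdict

# Hypothetical partition directory of a local Hudi table.
partition_dir = '/tmp/hudi_base_path/Thursday'

# File names follow <fileId>_<writeToken>_<instantTime>.parquet, so the part
# before the first '_' is the fileId; every upsert adds one more version.
versions = defaultdict(list)
for name in os.listdir(partition_dir):
    if name.endswith('.parquet'):
        versions[name.split('_')[0]].append(name)

for file_id, files in versions.items():
    print(file_id, '->', len(files), 'version(s)')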

Test code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "student"
    basePath = "file:///tmp/hudi_base_path"
    csv_path = '/Users/gavin/Desktop/tmp/student_3_rows.csv'
    csv_df = spark.read.csv(path=csv_path, header='true')
    csv_df.printSchema()
    print(f'csv_df.count(): [{csv_df.count()}]')
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'partition_path',
        'hoodie.datasource.write.table.name': tableName,
        # The source snippet is truncated from the next line onward; the rest is
        # a plausible completion. Using 'age' as the precombine field is an
        # assumption for illustration.
        'hoodie.datasource.write.precombine.field': 'age',
        'hoodie.datasource.write.operation': 'upsert',
    }
    csv_df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
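    # After the write, read the table back through the Hudi datasource to
    # verify the result (a minimal usage sketch continuing the script above).
    result_df = spark.read.format("hudi").load(basePath)
    result_df.select("id", "name", "age", "adress", "partition_path").show(truncate=False)
    print(f'row count after write: [{result_df.count()}]')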