Changes on HDFS When a Hudi Table Is Created


The Spark SQL statement that creates the Hudi table:

CREATE TABLE t71 (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT,
    f1 BIGINT,
    f2 BIGINT,
    f3 BIGINT,
    f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES ( -- OPTIONS can be used here instead (https://hudi.apache.org/docs/table_management)
  type = 'mor',
  primaryKey = 'pk',
  preCombineField = 'ut',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = '2',
  hoodie.compaction.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.archive.merge.enable = 'true',
  hoodie.datasource.write.operation = 'upsert'
);
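For reference, the same DDL can also be submitted from a Java program through a SparkSession. A minimal sketch, assuming Spark 3.x with a matching hudi-spark-bundle on the classpath (the appName is made up here; the serializer, extension, and catalog settings follow the Hudi Spark SQL setup docs):

import org.apache.spark.sql.SparkSession;

public class CreateT71 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("create-t71")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.sql.extensions", "org.apache.hudi.HoodieSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
                .enableHiveSupport()
                .getOrCreate();
        // Same DDL as above, shortened to the key table properties.
        spark.sql("CREATE TABLE IF NOT EXISTS test.t71 (ds BIGINT, ut STRING, pk BIGINT, "
                + "f0 BIGINT, f1 BIGINT, f2 BIGINT, f3 BIGINT, f4 BIGINT) USING hudi "
                + "PARTITIONED BY (ds) TBLPROPERTIES (type = 'mor', primaryKey = 'pk', "
                + "preCombineField = 'ut')");
        spark.stop();
    }
}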

After CREATE TABLE is executed, a .hoodie subdirectory and its files are created under the table directory:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie
Found 5 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
-rw-r--r--   3 zhangsan dfsusers       1501 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap
Found 2 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties
#Properties saved on 2023-05-31T03:09:25.601Z
#Wed May 31 11:09:25 CST 2023
hoodie.table.precombine.field=ut
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=ds
hoodie.bucket.index.num.buckets=2
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload
hoodie.table.version=5
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=pk
hoodie.database.name=test
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.name=t71
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","namespace"\:"hoodie.t71","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"ut","type"\:["string","null"]},{"name"\:"pk","type"\:["long","null"]},{"name"\:"f0","type"\:["long","null"]},{"name"\:"f1","type"\:["long","null"]},{"name"\:"f2","type"\:["long","null"]},{"name"\:"f3","type"\:["long","null"]},{"name"\:"f4","type"\:["long","null"]},{"name"\:"ds","type"\:["long","null"]}]}
hoodie.index.type=BUCKET
hoodie.table.checksum=3938074607
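hoodie.properties is a plain java.util.Properties file, so it can be read back without any Hudi dependency. A minimal sketch in Java, reusing the table path from the listings above:

import java.io.InputStream;
import java.net.URI;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHoodieProperties {
    public static void main(String[] args) throws Exception {
        // Table base path, as used in the listings above.
        String base = "hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71";
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create(base), conf);
             InputStream in = fs.open(new Path(base, ".hoodie/hoodie.properties"))) {
            Properties props = new Properties();
            props.load(in);
            // Expected per the dump above: MERGE_ON_READ, ut, pk
            System.out.println(props.getProperty("hoodie.table.type"));
            System.out.println(props.getProperty("hoodie.table.precombine.field"));
            System.out.println(props.getProperty("hoodie.table.recordkey.fields"));
        }
    }
}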

After DROP TABLE is executed, the table directory, e.g. hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71, is deleted.

For a partitioned table, when data is inserted, subdirectories named after the partition values are created under the table directory. For example:

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);

The statement above creates a subdirectory named "ds=20230101" on HDFS:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 2 items
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435756 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-6776-4b80-915b-ad6bdff96948-0_1-21-19_20230531112913107.parquet

Now execute three inserts in a row (with a SELECT after each to observe the result):

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f1) values (20230101,CURRENT_TIMESTAMP,1102,2);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f2) values (20230101,CURRENT_TIMESTAMP,1102,3);
select * from t71 where pk=1102;
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 3 items
-rw-r--r--   3 zhangsan dfsusers       1048 2023-05-31 14:26 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers       2096 2023-05-31 14:31 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435757 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet

Two ".log." entries are listed above. They are the same file, captured after the second and the third insert respectively, and placed together for easy comparison.

The first insert into a partition always produces a ".parquet" file rather than a ".log." file. The ".parquet" file is the columnar base file and exists in both COW and MOR tables, but a COW table rewrites the whole base file on every insert while a MOR table does not. The ".log." file is the row-oriented incremental log file and exists only in MOR tables (a sketch that parses these file names follows the metadata dump below). The .hoodie_partition_metadata file holds the partition metadata:

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
#partition metadata
#Wed May 31 11:29:49 CST 2023
commitTime=20230531112913107
partitionDepth=1
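The base- and log-file names seen above follow a fixed layout: the base file is <fileId>_<writeToken>_<instantTime>.parquet, and the log file is .<fileId>_<baseInstant>.log.<version>_<writeToken>. The sketch below pulls these parts out of the two names from the listing with hand-written regexes (an illustration only, not Hudi's own parser):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HudiFileNames {
    // Base file: <fileId>_<writeToken>_<instantTime>.parquet
    private static final Pattern BASE =
            Pattern.compile("^(.+)_(\\d+-\\d+-\\d+)_(\\d+)\\.parquet$");
    // Log file: .<fileId>_<baseInstant>.log.<version>_<writeToken>
    private static final Pattern LOG =
            Pattern.compile("^\\.(.+)_(\\d+)\\.log\\.(\\d+)_(\\d+-\\d+-\\d+)$");

    public static void main(String[] args) {
        Matcher b = BASE.matcher(
                "00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet");
        if (b.matches()) {
            System.out.printf("fileId=%s writeToken=%s instant=%s%n",
                    b.group(1), b.group(2), b.group(3));
        }
        Matcher l = LOG.matcher(
                ".00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6");
        if (l.matches()) {
            System.out.printf("fileId=%s baseInstant=%s version=%s writeToken=%s%n",
                    l.group(1), l.group(2), l.group(3), l.group(4));
        }
    }
}

Note that the log file carries the same fileId and the base file's instant time; that is how Hudi groups a base file and its incremental logs into one file slice.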

The ".parquet" file

Opening the ".parquet" file with the online tool https://parquet-viewer-online.com/result shows exactly the same content as a SELECT:

_hoodie_commit_time    = 20230531141236926
_hoodie_commit_seqno   = 20230531141236926_1_0
_hoodie_record_key     = 1102
_hoodie_partition_path = ds=20230101
_hoodie_file_name      = 00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet
ut                     = 2023-05-31 14:12:37.126
pk                     = 1102
f0                     = 1
f1                     = null
f2                     = null
f3                     = null
f4                     = null
ds                     = 20230101
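The same row can also be read back through the Hudi datasource as a snapshot query, which for a MOR table merges the base file with any log files. A sketch, assuming a SparkSession configured as in the creation example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SnapshotRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-t71")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();
        // Snapshot read of the table base path.
        Dataset<Row> df = spark.read().format("hudi")
                .load("hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71");
        df.filter("pk = 1102").show(false);
        spark.stop();
    }
}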

The ".log." file

The second insert generated the ".log." file:

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          

The third insert appended to the ".log." file:

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          #HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531143136695       •         ‰"20230531143136695*20230531143136695_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:31:36.801œ    ª¿¥          

Here pk=1102 has two records, with ut values 2023-05-31 14:26:14.761 and 2023-05-31 14:31:36.801. When reading with OverwriteNonDefaultsWithLatestAvroPayload, only the 2023-05-31 14:31:36.801 record is returned: the record with the greater preCombineField wins, a rule applied in HoodieRecordPayload::preCombine.

Related source code

// OverwriteNonDefaultsWithLatestAvroPayload does not override OverwriteWithLatestAvroPayload's preCombine method
public class OverwriteNonDefaultsWithLatestAvroPayload extends OverwriteWithLatestAvroPayload {
}

public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
    implements HoodieRecordPayload<OverwriteWithLatestAvroPayload> {

  @Override
  public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
    if (oldValue.recordBytes.length == 0) {
      // use natural order for delete record
      return this;
    }
    if (oldValue.orderingVal.compareTo(orderingVal) > 0) {
      // pick the payload with greatest ordering value
      return oldValue;
    } else {
      return this;
    }
  }
}
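A small demonstration of this rule. The sketch below builds two payloads for pk=1102, using the ut string as the ordering value, and checks that preCombine keeps the later record whichever way it is called. A hedged sketch: it assumes the (GenericRecord, Comparable) payload constructor from hudi-common, as in the version quoted above.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload;

public class PreCombineDemo {
    public static void main(String[] args) throws Exception {
        // Toy schema holding just the key and the preCombine field.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
                + "{\"name\":\"pk\",\"type\":\"long\"},"
                + "{\"name\":\"ut\",\"type\":\"string\"}]}");
        GenericRecord older = new GenericData.Record(schema);
        older.put("pk", 1102L);
        older.put("ut", "2023-05-31 14:26:14.761");
        GenericRecord newer = new GenericData.Record(schema);
        newer.put("pk", 1102L);
        newer.put("ut", "2023-05-31 14:31:36.801");

        // ut doubles as the ordering value, i.e. the preCombineField.
        OverwriteNonDefaultsWithLatestAvroPayload p1 =
                new OverwriteNonDefaultsWithLatestAvroPayload(older, (Comparable) older.get("ut"));
        OverwriteNonDefaultsWithLatestAvroPayload p2 =
                new OverwriteNonDefaultsWithLatestAvroPayload(newer, (Comparable) newer.get("ut"));

        // The payload with the greater ordering value wins, regardless of call order.
        System.out.println(p2.preCombine(p1).getInsertValue(schema).get()); // 14:31:36.801 record
        System.out.println(p1.preCombine(p2).getInsertValue(schema).get()); // 14:31:36.801 record
    }
}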

If PartialUpdateAvroPayload is used instead, the non-null fields of each insert are merged into the existing record, so f0, f1 and f2 all keep their inserted values (here for a row with pk=1006):

_hoodie_commit_time    = 20230531164701237
_hoodie_commit_seqno   = 20230531164701237_0_1
_hoodie_record_key     = 1006
_hoodie_partition_path = ds=20230101
_hoodie_file_name      = 00000000-ad06-474e-a7ac-0580f60307e1-0
ut                     = 2023-05-31 16:47:02.337
pk                     = 1006
f0                     = 1
f1                     = 2
f2                     = 3
f3                     = NULL
f4                     = NULL