Rewriting Small Files into Large Files in an Iceberg Data Lake with Flink

1. Rewriting small files into large files

Flink can run a batch job that rewrites the small data files of an Iceberg table into larger files.

Before the merge, the metadata and data directories on HDFS look like this:

[root@flink1 ~]# 
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata
Found 17 items
-rw-r--r--   1 root supergroup       6493 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m0.avro
-rw-r--r--   1 root supergroup       6493 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m1.avro
-rw-r--r--   1 root supergroup       6423 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m2.avro
-rw-r--r--   1 root supergroup       6478 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/5c33451b-48ab-4ce5-be7a-2c2d2dc9e11d-m0.avro
-rw-r--r--   1 root supergroup       6421 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/b243b39e-7122-4571-b6fa-c902241e36a8-m0.avro
-rw-r--r--   1 root supergroup       6423 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/b243b39e-7122-4571-b6fa-c902241e36a8-m1.avro
-rw-r--r--   1 root supergroup       6476 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/bc0e56ec-9f78-4956-8412-4d8ca70ccc19-m0.avro
-rw-r--r--   1 root supergroup       3895 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-138573494821828246-1-b243b39e-7122-4571-b6fa-c902241e36a8.avro
-rw-r--r--   1 root supergroup       3864 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-453371561664052237-1-bc0e56ec-9f78-4956-8412-4d8ca70ccc19.avro
-rw-r--r--   1 root supergroup       3835 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-6410282459040239217-1-2b20c57e-5428-4483-9f7b-928b980dd50d.avro
-rw-r--r--   1 root supergroup       3792 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-8012517928892530314-1-5c33451b-48ab-4ce5-be7a-2c2d2dc9e11d.avro
-rw-r--r--   1 root supergroup       2115 2022-02-13 22:01 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v1.metadata.json
-rw-r--r--   1 root supergroup       3141 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v2.metadata.json
-rw-r--r--   1 root supergroup       4197 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v3.metadata.json
-rw-r--r--   1 root supergroup       5399 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v4.metadata.json
-rw-r--r--   1 root supergroup       6597 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v5.metadata.json
-rw-r--r--   1 root supergroup          1 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/version-hint.text
[root@flink1 ~]#
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china
Found 2 items
-rw-r--r--   1 root supergroup       1258 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china/00000-0-4ef3835f-b18b-4c48-b47a-85af1771a10a-00001.parquet
-rw-r--r--   1 root supergroup       1257 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china/00000-0-6e66c02b-cb09-4fd0-b669-15aa7f5194e4-00001.parquet
[root@flink1 ~]#
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan
Found 4 items
-rw-r--r--   1 root supergroup       1244 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-1d0ff907-60a7-4062-93a3-9b443626e383-00001.parquet
-rw-r--r--   1 root supergroup       1229 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-4ef3835f-b18b-4c48-b47a-85af1771a10a-00002.parquet
-rw-r--r--   1 root supergroup       1230 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-6e66c02b-cb09-4fd0-b669-15aa7f5194e4-00002.parquet
-rw-r--r--   1 root supergroup       1251 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-ba8b2366-9189-48af-ae6d-8f20b297e0b4-00001.parquet
[root@flink1 ~]#

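Add the following dependency to pom.xml. Because the scope is provided, the Iceberg runtime jar is not packaged into the job jar and must already be on the Flink classpath (for example in Flink's lib directory):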

        <dependency>
            <groupId>org.apache.iceberg</groupId>
            <artifactId>iceberg-flink-runtime-1.14</artifactId>
            <version>0.13.0</version>
            <scope>provided</scope>
        </dependency>

The merge program is as follows:

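import org.apache.iceberg.Table
import org.apache.iceberg.actions.RewriteDataFilesActionResult
import org.apache.iceberg.flink.TableLoader
import org.apache.iceberg.flink.actions.Actions


object flink_test {

  def main(args: Array[String]): Unit = {

    // Load the table directly from its HDFS location (Hadoop table, no catalog)
    val tableLoader: TableLoader = TableLoader.fromHadoopTable(
      "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/"
    )
    tableLoader.open()
    try {
      val table: Table = tableLoader.loadTable()
      // Run the rewrite as a Flink batch job; the result is committed as a new snapshot.
      // `result` lists the replaced and the newly added data files.
      val result: RewriteDataFilesActionResult = Actions.forTable(table)
        .rewriteDataFiles()
        .execute()
    } finally {
      // Release the resources held by the loader
      tableLoader.close()
    }

  }
}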
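
The rewrite action also accepts a few options before execute() is called. The following is only a sketch, not part of the original program: the object name RewriteWithOptions, the 128 MiB target size, and the filter on country = china are illustrative assumptions (the column name is inferred from the partition paths in the listing above), and it assumes the same iceberg-flink-runtime 0.13.0 dependency.

```scala
import org.apache.iceberg.Table
import org.apache.iceberg.actions.RewriteDataFilesActionResult
import org.apache.iceberg.expressions.Expressions
import org.apache.iceberg.flink.TableLoader
import org.apache.iceberg.flink.actions.Actions

object RewriteWithOptions {

  // Assumed target size for merged files (128 MiB); adjust for your table.
  val TargetFileSizeBytes: Long = 128L * 1024 * 1024

  def main(args: Array[String]): Unit = {
    val tableLoader: TableLoader = TableLoader.fromHadoopTable(
      "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/"
    )
    tableLoader.open()
    try {
      val table: Table = tableLoader.loadTable()
      val result: RewriteDataFilesActionResult = Actions.forTable(table)
        .rewriteDataFiles()
        // Assumption: restrict the rewrite to rows where country = 'china'
        .filter(Expressions.equal("country", "china"))
        // Upper bound for the size of each rewritten file
        .targetSizeInBytes(TargetFileSizeBytes)
        .execute()

      // Report how many files were replaced and how many were written
      println(s"replaced data files: ${result.deletedDataFiles().size()}")
      println(s"added data files:    ${result.addedDataFiles().size()}")
    } finally {
      tableLoader.close()
    }
  }
}
```

Printing the two counts from the result makes it easier to confirm that the job actually merged something, since the Flink client output alone does not say.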

Package the code, upload the jar to the server, and run it:

[root@flink1 ~]# flink run -c flink_test -D classloader.check-leaked-classloader=false /root/flink_dev-1.0-SNAPSHOT.jar 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/flink-1.14.3/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[root@flink1 ~]#

List the metadata and data directories on HDFS again:

[root@flink1 ~]# 
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata
Found 22 items
-rw-r--r--   1 root supergroup       6493 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m0.avro
-rw-r--r--   1 root supergroup       6493 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m1.avro
-rw-r--r--   1 root supergroup       6423 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/2b20c57e-5428-4483-9f7b-928b980dd50d-m2.avro
-rw-r--r--   1 root supergroup       6423 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/58bc3631-aca1-4a55-9afb-c76c9d0cc592-m0.avro
-rw-r--r--   1 root supergroup       6425 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/58bc3631-aca1-4a55-9afb-c76c9d0cc592-m1.avro
-rw-r--r--   1 root supergroup       6430 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/58bc3631-aca1-4a55-9afb-c76c9d0cc592-m2.avro
-rw-r--r--   1 root supergroup       6478 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/5c33451b-48ab-4ce5-be7a-2c2d2dc9e11d-m0.avro
-rw-r--r--   1 root supergroup       6421 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/b243b39e-7122-4571-b6fa-c902241e36a8-m0.avro
-rw-r--r--   1 root supergroup       6423 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/b243b39e-7122-4571-b6fa-c902241e36a8-m1.avro
-rw-r--r--   1 root supergroup       6476 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/bc0e56ec-9f78-4956-8412-4d8ca70ccc19-m0.avro
-rw-r--r--   1 root supergroup       3895 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-138573494821828246-1-b243b39e-7122-4571-b6fa-c902241e36a8.avro
-rw-r--r--   1 root supergroup       3902 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-3407622807613840742-1-58bc3631-aca1-4a55-9afb-c76c9d0cc592.avro
-rw-r--r--   1 root supergroup       3864 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-453371561664052237-1-bc0e56ec-9f78-4956-8412-4d8ca70ccc19.avro
-rw-r--r--   1 root supergroup       3835 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-6410282459040239217-1-2b20c57e-5428-4483-9f7b-928b980dd50d.avro
-rw-r--r--   1 root supergroup       3792 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/snap-8012517928892530314-1-5c33451b-48ab-4ce5-be7a-2c2d2dc9e11d.avro
-rw-r--r--   1 root supergroup       2115 2022-02-13 22:01 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v1.metadata.json
-rw-r--r--   1 root supergroup       3141 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v2.metadata.json
-rw-r--r--   1 root supergroup       4197 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v3.metadata.json
-rw-r--r--   1 root supergroup       5399 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v4.metadata.json
-rw-r--r--   1 root supergroup       6597 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v5.metadata.json
-rw-r--r--   1 root supergroup       7634 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/v6.metadata.json
-rw-r--r--   1 root supergroup          1 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/metadata/version-hint.text
[root@flink1 ~]# 
[root@flink1 ~]# 
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china
Found 3 items
-rw-r--r--   1 root supergroup       1258 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china/00000-0-4ef3835f-b18b-4c48-b47a-85af1771a10a-00001.parquet
-rw-r--r--   1 root supergroup       1257 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china/00000-0-6e66c02b-cb09-4fd0-b669-15aa7f5194e4-00001.parquet
-rw-r--r--   1 root supergroup       1422 2022-02-13 22:51 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-01/country=china/00000-0-8410e44a-bed3-45c8-9c3d-37cb42443087-00001.parquet
[root@flink1 ~]#
[root@flink1 ~]# hadoop fs -ls hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan
Found 4 items
-rw-r--r--   1 root supergroup       1244 2022-02-13 22:10 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-1d0ff907-60a7-4062-93a3-9b443626e383-00001.parquet
-rw-r--r--   1 root supergroup       1229 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-4ef3835f-b18b-4c48-b47a-85af1771a10a-00002.parquet
-rw-r--r--   1 root supergroup       1230 2022-02-13 22:05 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-6e66c02b-cb09-4fd0-b669-15aa7f5194e4-00002.parquet
-rw-r--r--   1 root supergroup       1251 2022-02-13 22:11 hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/data/birthday=2022-02-02/country=japan/00000-0-ba8b2366-9189-48af-ae6d-8f20b297e0b4-00001.parquet
[root@flink1 ~]#
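
An HDFS listing shows what is on disk, not which files the table's current snapshot actually reads. One way to check the latter is to plan a scan through the Iceberg API and print the data files it returns. This is a minimal sketch under the same dependency assumptions as the merge program; the object name ListLiveDataFiles and the helper describe are made up for illustration.

```scala
import scala.collection.JavaConverters._

import org.apache.iceberg.Table
import org.apache.iceberg.flink.TableLoader

object ListLiveDataFiles {

  // Hypothetical helper: format one data file entry for printing
  def describe(path: CharSequence, sizeBytes: Long): String =
    s"$path ($sizeBytes bytes)"

  def main(args: Array[String]): Unit = {
    val tableLoader: TableLoader = TableLoader.fromHadoopTable(
      "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/"
    )
    tableLoader.open()
    try {
      val table: Table = tableLoader.loadTable()
      // planFiles() returns the scan tasks for the current snapshot
      val tasks = table.newScan().planFiles()
      try {
        tasks.asScala.foreach { task =>
          val file = task.file()
          println(describe(file.path(), file.fileSizeInBytes()))
        }
      } finally {
        tasks.close()
      }
    } finally {
      tableLoader.close()
    }
  }
}
```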

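The rewrite merges multiple small files into a new file: the china partition now contains a new 1422-byte file written at 22:51, and the metadata directory has gained a new snapshot and v6.metadata.json. The pre-merge files are not deleted, however. Earlier snapshots still reference them, so they stay on HDFS until those snapshots are expired.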
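
Iceberg removes old data files only when the snapshots that reference them are expired. The following sketch shows one way to do that with the Table API's expireSnapshots(). The object name ExpireOldSnapshots, the helper cutoffMillis, and the 24-hour retention are assumptions for illustration, not something the original program does.

```scala
import java.util.concurrent.TimeUnit

import scala.collection.JavaConverters._

import org.apache.iceberg.Table
import org.apache.iceberg.flink.TableLoader

object ExpireOldSnapshots {

  // Hypothetical helper: snapshots older than this timestamp get expired
  def cutoffMillis(nowMillis: Long, retentionHours: Long): Long =
    nowMillis - TimeUnit.HOURS.toMillis(retentionHours)

  def main(args: Array[String]): Unit = {
    val tableLoader: TableLoader = TableLoader.fromHadoopTable(
      "hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user/"
    )
    tableLoader.open()
    try {
      val table: Table = tableLoader.loadTable()

      // Show the snapshots the table still tracks before expiring any
      table.snapshots().asScala.foreach { s =>
        println(s"snapshot ${s.snapshotId()} op=${s.operation()} ts=${s.timestampMillis()}")
      }

      table.expireSnapshots()
        // Assumed retention: expire snapshots older than 24 hours
        .expireOlderThan(cutoffMillis(System.currentTimeMillis(), 24L))
        // Always keep at least the most recent snapshot
        .retainLast(1)
        .commit()
    } finally {
      tableLoader.close()
    }
  }
}
```

In this walkthrough every snapshot was written within the same hour, so a 24-hour retention would not expire anything right away; a shorter cutoff would be needed to see the old parquet files disappear. Expired snapshots can no longer be used for time travel or rollback, so choose the retention with that in mind.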
