A First Look at Apache Iceberg and Automated Iceberg Table Maintenance (Small-File Management)

Author: 本旺

Last modified: 2023-07-05 23:55:35


First published: 2023-07-05 23:51:19

Article link: https://blog.csdn.net/weixin_39750695/article/details/131565675


Introduction to Iceberg

Apache Iceberg is an open table format for large analytic datasets. Compute engines that currently support it include Spark, Trino, Doris, Flink, and Hive.

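Features

- Schema evolution: columns can be added, changed, and dropped
- Hidden partitioning prevents user mistakes that cause incorrect results or extremely slow queries
- Partition layout evolution updates the layout of a table as data volume or query patterns change, for example moving from daily to monthly partitions
- Time travel enables reproducible queries against exactly the same table snapshot and lets users easily inspect changes; this relies on historical snapshots being kept
- Version rollback lets users quickly correct problems by resetting a table to a known good state, again based on historical snapshots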
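To make hidden partitioning and the day-to-month evolution a little more concrete, here is a minimal sketch of how a partition value can be derived from a timestamp column instead of being supplied by the user. It assumes the day and month transforms count whole days and whole months since 1970-01-01, as the Iceberg table spec describes; the class and method names are hypothetical and this is not Iceberg's own code.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Illustrative only: derives day and month partition values from a timestamp,
// so that queries can filter on the timestamp column itself.
class PartitionTransformSketch {
    private static final LocalDate EPOCH = LocalDate.of(1970, 1, 1);

    // "day" transform: whole days since 1970-01-01.
    static long day(LocalDateTime ts) {
        return ChronoUnit.DAYS.between(EPOCH, ts.toLocalDate());
    }

    // "month" transform: whole months since 1970-01.
    static long month(LocalDateTime ts) {
        return ChronoUnit.MONTHS.between(EPOCH, ts.toLocalDate().withDayOfMonth(1));
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2022, 3, 24, 15, 30);
        System.out.println("day partition   = " + day(eventTime));
        System.out.println("month partition = " + month(eventTime));
    }
}
```

Switching a table from the day transform to the month transform only changes which function is applied to newly written rows; files written under the old layout keep their old partition values.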

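Reliability and performance

Iceberg is used in production where a single table can contain tens of petabytes of data, and even tables that large can be read without a distributed SQL engine.

- Scan planning is fast: no distributed SQL engine is needed to read a table or find files
- Advanced filtering: data files are pruned using the partition and column-level statistics kept in table metadata

Iceberg was designed to solve correctness problems in eventually consistent cloud object stores.

- Works with any cloud store and reduces NameNode congestion on HDFS by avoiding listing and renames; several file systems are supported, such as OSS and HDFS
- Serializable isolation: table changes are atomic, so readers never see partial or uncommitted changes
- Multiple concurrent writers use optimistic concurrency and retry so that compatible updates succeed even when writes conflict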
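The optimistic concurrency described above can be sketched as a compare-and-swap on the pointer to the current table metadata. This is a simplified, single-process illustration with hypothetical names; a real catalog performs the atomic swap in the metastore or file system rather than in an `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative only: a table "pointer" that is swapped atomically, with retry on conflict.
class OptimisticCommitSketch {
    // Stands in for the catalog's pointer to the current metadata file.
    private final AtomicReference<String> currentMetadata = new AtomicReference<>("v0");

    String current() {
        return currentMetadata.get();
    }

    // Build new metadata from the version that was read; commit only if nobody
    // else committed in between, otherwise re-read and try again.
    String commit(UnaryOperator<String> change, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            String base = currentMetadata.get();
            String updated = change.apply(base);
            if (currentMetadata.compareAndSet(base, updated)) {
                return updated;
            }
        }
        throw new IllegalStateException("commit failed after " + maxRetries + " retries");
    }

    public static void main(String[] args) {
        OptimisticCommitSketch table = new OptimisticCommitSketch();
        System.out.println(table.commit(base -> base + "+append", 3));
    }
}
```

Readers always see either the old pointer or the new one, which is why they never observe a partially applied change.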

Iceberg file structure

(https://img-blog.csdnimg.cn/59b475da0cff47358932085cfb15a287.png)

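The Iceberg file structure has three layers:

Catalog layer

This layer records where each Iceberg table's definition lives (in practice, a pointer to the table's current metadata file). You can implement a custom catalog or use the Hadoop or Hive catalog.

Metadata layer

This layer holds the metadata of an Iceberg table and consists of:

metadata file

A JSON file whose name ends in .metadata.json. It contains roughly the following: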

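format-version: table format version, v1 or v2 (v2 adds row-level deletes, which upsert relies on)
table-uuid: UUID of the table
location: storage location of the table
last-sequence-number: the highest sequence number assigned in the table so far
last-updated-ms: time of the last update, in milliseconds since the epoch
last-column-id: the highest column ID assigned so far
current-schema-id: ID of the table's current schema
schemas: list of the table's schemas
partition-specs: partition specs
sort-orders: sort orders
properties: table properties
current-snapshot-id: ID of the current snapshot
snapshots: list of the table's snapshots

Each snapshot contains:
- sequence-number: sequence number of the snapshot
- snapshot-id: snapshot ID
- parent-snapshot-id: ID of the parent (previous) snapshot
- timestamp-ms: timestamp of the snapshot
- summary: summary of the commit
  - operation: the operation type; append and replace are the two the author has seen so far (the spec also defines overwrite and delete)
  - flink.job-id: ID of the Flink job; only present on snapshots committed by Flink
  - flink.max-committed-checkpoint-id: highest Flink checkpoint ID committed
  - added-data-files: number of data files added by this commit
  - added-records: number of rows added by this commit
  - added-files-size: size of the files added by this commit, in bytes
  - total-records: total number of rows
  - total-files-size: total file size
  - total-data-files: total number of data files
  - total-delete-files: total number of delete files
- manifest-list: location of the manifest list, an .avro file named snap-xxx.avro

metadata-log: log of previous metadata files (timestamp and file location pairs)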

{
    "format-version": 2,
    "table-uuid": "16dd1efa-4bd9-44a3-8a3f-69f0cd53afd9",
    "location": "hdfs://HDPCluster/warehouse/tablespace/managed/hive/iceberg/iceberg.db/ods_action_event_log",
    "last-sequence-number": 340,
    "last-updated-ms": 1648193228333,
    "last-column-id": 2,
    "current-schema-id": 0,
    "schemas": [
        {
            "type": "struct",
            "schema-id": 0,
            "fields": [
                {
                    "id": 1,
                    "name": "log_info",
                    "required": false,
                    "type": "string"
                },
                {
                    "id": 2,
                    "name": "stat_date",
                    "required": false,
                    "type": "string"
                }
            ]
        }
    ],
    "default-spec-id": 0,
    "partition-specs": [
        {
            "spec-id": 0,
            "fields": [
                {
                    "name": "stat_date",
                    "transform": "identity",
                    "source-id": 2,
                    "field-id": 1000
                }
            ]
        }
    ],
    "last-partition-id": 1000,
    "default-sort-order-id": 0,
    "sort-orders": [
        {
            "order-id": 0,
            "fields": []
        }
    ],
    "properties": {
        "engine.hive.enabled": "true",
        "connector": "iceberg",
        "write.format.default": "orc",
        "write.metadata.previous-versions-max": "10",
        "catalog-database": "iceberg",
        "write.upsert.enabled": "false",
        "write.metadata.delete-after-commit.enabled": "true",
        "write.target-file-size-bytes": "268435456",
        "write.distribution-mode": "hash",
        "warehouse": "/warehouse/tablespace/managed/hive/iceberg",
        "catalog-name": "iceberg-catlog",
        "uri": "thrift://xxxxxx:9083"
    },
    "current-snapshot-id": 7928227817048565499,
    "snapshots": [
        {
            "sequence-number": 340,
            "snapshot-id": 7928227817048565499,
            "parent-snapshot-id": 8315786418643141497,
            "timestamp-ms": 1648193125981,
            "summary": {
                "operation": "append",
                "flink.job-id": "53918d357941a7091f60907064b78397",
                "flink.max-committed-checkpoint-id": "338",
                "added-data-files": "1",
                "added-records": "251",
                "added-files-size": "15916",
                "changed-partition-count": "1",
                "total-records": "37109",
                "total-files-size": "2039168",
                "total-data-files": "187",
                "total-delete-files": "0",
                "total-position-deletes": "0",
                "total-equality-deletes": "0"
            },
            "manifest-list": "hdfs://HDPCluster/warehouse/tablespace/managed/hive/iceberg/iceberg.db/ods_action_event_log/metadata/snap-7928227817048565499-1-d20d888c-e7fa-4f71-bb6b-883f41370acc.avro",
            "schema-id": 0
        }
    ],
    "snapshot-log": [
        {
            "timestamp-ms": 1648193125981,
            "snapshot-id": 7928227817048565499
        }
    ],
    "metadata-log": [
     
        {
            "timestamp-ms": 1648193125981,
            "metadata-file": "hdfs://HDPCluster/warehouse/tablespace/managed/hive/iceberg/iceberg.db/ods_action_event_log/metadata/00340-f810fb4b-fde5-4318-b6be-f989c3a2c7ff.metadata.json"
        }
    ]
}

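manifest list

A file named snap-xxx.avro that records the list of manifest files belonging to a snapshot.

manifest file

A file named xxx-m0.avro that records the paths of the actual data files (ORC or Parquet).

Data layer

The actual data files, in ORC or Parquet format.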
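The three layers can be pictured as a small tree that a reader walks from the snapshot down to the data files. The sketch below is a toy in-memory model with hypothetical class names, not Iceberg's API; it assumes each data file entry carries min/max statistics for `stat_date`, which is what lets a scan skip files without listing directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: snapshot -> manifest list -> manifest files -> data files,
// with data files skipped based on per-file min/max statistics.
class MetadataTreeSketch {
    static final class DataFile {
        final String path;
        final String minStatDate;
        final String maxStatDate;

        DataFile(String path, String minStatDate, String maxStatDate) {
            this.path = path;
            this.minStatDate = minStatDate;
            this.maxStatDate = maxStatDate;
        }
    }

    static final class Manifest {
        final List<DataFile> dataFiles;

        Manifest(DataFile... dataFiles) {
            this.dataFiles = Arrays.asList(dataFiles);
        }
    }

    static final class Snapshot {
        final List<Manifest> manifestList;

        Snapshot(Manifest... manifests) {
            this.manifestList = Arrays.asList(manifests);
        }
    }

    // Returns the data files that may contain rows with the given stat_date.
    static List<String> planScan(Snapshot snapshot, String statDate) {
        List<String> result = new ArrayList<>();
        for (Manifest manifest : snapshot.manifestList) {
            for (DataFile file : manifest.dataFiles) {
                boolean mayMatch = file.minStatDate.compareTo(statDate) <= 0
                        && file.maxStatDate.compareTo(statDate) >= 0;
                if (mayMatch) {
                    result.add(file.path);
                }
            }
        }
        return result;
    }

    // A small made-up snapshot with two manifests and three data files.
    static Snapshot example() {
        Manifest m0 = new Manifest(
                new DataFile("data/a.orc", "20220322", "20220323"),
                new DataFile("data/b.orc", "20220324", "20220324"));
        Manifest m1 = new Manifest(
                new DataFile("data/c.orc", "20220323", "20220325"));
        return new Snapshot(m0, m1);
    }

    public static void main(String[] args) {
        System.out.println(planScan(example(), "20220324"));
    }
}
```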

Using Iceberg with Flink

JARs required for the integration:
- iceberg-flink-runtime-1.14-0.13.1.jar
- hive-exec-3.1.0.3.1.0.0-78.jar
- libfb303-0.9.3.jar

Example Flink CREATE TABLE statement:
```
CREATE TABLE if not exists `ods_action_event_log`(
`log_info` string,
`stat_date` string
)
PARTITIONED BY (
`stat_date`
)
with (
'connector'='iceberg',
'catalog-name'='iceberg-catlog',
'uri'='thrift://xxxxxx:9083',
'catalog-database' = 'iceberg' ,
'format-version' = '2',
'write.upsert.enabled' = 'false',
'engine.hive.enabled' = 'true',
'warehouse'='/warehouse/tablespace/managed/hive/iceberg',
'write.format.default' = 'orc', 
'write.distribution-mode'='hash', 
'write.target-file-size-bytes'='268435456', 
'write.metadata.delete-after-commit.enabled'='true', 
'write.metadata.previous-versions-max' = '10' 
); 
```
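Parameter notes:
- engine.hive.enabled: whether to enable Hive engine support, so that the table is visible and readable from Hive
- uri: address of the Hive Metastore
- catalog-database: name of the Hive database
- write.upsert.enabled: whether to write in upsert mode
- warehouse: warehouse path on HDFS
- write.format.default: file format of the data files
- write.distribution-mode: how rows are distributed to writers before writing (none, hash, or range); hash is recommended here because it reduces the number of small files and helps avoid data skew
- equality.field.columns: primary key columns, comma-separated; key columns must not be NULL
- read.parquet.vectorization.enabled: whether to enable vectorized Parquet reads
- read.parquet.vectorization.batch-size: batch size for vectorized Parquet reads
- write.target-file-size-bytes: target size of written files (268435456 bytes, i.e. 256 MB, in the statement above); streaming jobs usually do not reach this size
- write.metadata.delete-after-commit.enabled: whether to delete the oldest tracked metadata.json files after each commit
- write.metadata.previous-versions-max: maximum number of previous metadata.json versions to keep, which bounds the number of metadata.json files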

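Iceberg table maintenance

Spark is used for the ongoing maintenance of Iceberg tables. The steps are:

Rewrite data files -> Rewrite Data

Real-time writes produce many small files, and too many of them put pressure on the NameNode, so they should be compacted periodically.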

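import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

// spark is an existing SparkSession; icebergTable is the org.apache.iceberg.Table to compact
RewriteDataFiles.Result result =
    SparkActions.get(spark)
        .rewriteDataFiles(icebergTable)
        // only rewrite data files in the partition stat_date = 20220324
        .filter(Expressions.equal("stat_date", "20220324"))
        // target size of the rewritten files: 128 MB
        .option("target-file-size-bytes", Long.toString(128 * 1024 * 1024))
        .execute();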
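Conceptually, compaction groups small files so that each group can be rewritten into one file close to the target size. The following is a deliberately simplified greedy grouping with hypothetical names; it is only meant to show the idea and is not the planner that `rewriteDataFiles` actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: greedily groups file sizes into bins of at most targetBytes.
class BinPackSketch {
    // Files are taken in order; a new group starts when adding the next file would
    // exceed targetBytes. A single file larger than targetBytes gets its own group.
    static List<List<Long>> pack(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sizes in MB for readability; the target is 128 MB.
        System.out.println(pack(Arrays.asList(40L, 50L, 30L, 90L, 10L), 128L));
    }
}
```

With these made-up sizes, five small files end up in two groups, so a rewrite would leave two files instead of five.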

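Rewrite manifest files -> Rewrite Manifests

In Iceberg, manifests act as an index over the data files. Manifests in the metadata tree are compacted automatically in the order in which they were added, which keeps writes fast and speeds up queries when the write pattern matches the read filters. When that is not enough, the manifests can be rewritten explicitly: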

org.apache.iceberg.actions.RewriteManifests.Result result = SparkActions.get(spark)
        // the Iceberg table
        .rewriteManifests(icebergTable)
        // which manifests to rewrite: here, those smaller than 10 MB
        .rewriteIf(file -> file.length() < 10 * 1024 * 1024)
        .execute();

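Expire snapshots -> Expire Snapshots

Every commit in Iceberg produces a snapshot, and each snapshot keeps a manifest list that determines which data files belong to it. Snapshots make time travel possible, that is, querying historical versions of the data. In Spark, the existing snapshots can be listed with the following statement:

-- hive -> catalog (your own catalog name may differ)
-- iceberg -> database (your own database name may differ)
-- ods_action_event_log -> the Iceberg table
-- you can also write just the table name, depending on what your query engine requires
SELECT * FROM hive.iceberg.ods_action_event_log.snapshots;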



Maintenance:

```java
// 1 hour
long tsToExpire = System.currentTimeMillis() - (1000L * 60 * 60);
org.apache.iceberg.actions.ExpireSnapshots.Result result = SparkActions.get(spark)
            // the Iceberg table
            .expireSnapshots(icebergTable)
            // expire every snapshot older than this timestamp, here one hour ago
            .expireOlderThan(tsToExpire)
            // always keep the 10 most recent ancestors of the current snapshot,
            // even if they are older than tsToExpire
            .retainLast(10)
            .execute();
```
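How the timestamp and `retainLast` interact is easy to get wrong, so here is a small model of the rule as described above: a snapshot is expired only if it is older than the cutoff and is not among the last N snapshots in the current history. The class is hypothetical and works on bare timestamps rather than real snapshot objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: models which snapshots expire for a given cutoff and retainLast.
class SnapshotRetentionSketch {
    // timestamps: snapshot commit times in ms, ordered oldest to newest along the
    // current snapshot's ancestor chain. Returns the timestamps that would expire.
    static List<Long> expired(List<Long> timestamps, long olderThanMs, int retainLast) {
        List<Long> result = new ArrayList<>();
        int firstRetained = Math.max(0, timestamps.size() - retainLast);
        for (int i = 0; i < timestamps.size(); i++) {
            boolean tooOld = timestamps.get(i) < olderThanMs;
            boolean amongLastN = i >= firstRetained;
            if (tooOld && !amongLastN) {
                result.add(timestamps.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> history = Arrays.asList(100L, 200L, 300L, 400L, 500L);
        System.out.println(expired(history, 450L, 2));
        System.out.println(expired(history, 450L, 10));
    }
}
```

In the second call nothing expires: fewer than 10 snapshots exist, so all of them are protected by `retainLast` even though four are older than the cutoff.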

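Delete orphan files -> Delete Orphan Files

Orphan files are metadata or data files that cannot be reached from any valid snapshot.
By default, orphan files older than 3 days are removed.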

long tsToExpire = System.currentTimeMillis() - (1000L * 60 * 60 * 16);
org.apache.iceberg.actions.DeleteOrphanFiles.Result result = SparkActions.get(spark)
            // the Iceberg table
            .deleteOrphanFiles(icebergTable)
            // only delete orphan files older than this timestamp, here 16 hours ago
            .olderThan(tsToExpire)
            .execute();
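The orphan definition amounts to a set difference plus an age check. The sketch below uses hypothetical names and plain collections instead of a real file system listing; the age check matters because files written by a commit that is still in progress are not yet referenced by any snapshot and must not be deleted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: orphan = present on storage, unreachable from metadata, and old enough.
class OrphanFileSketch {
    // filesOnStorage maps path -> last-modified time in ms.
    // reachable holds every path referenced by some valid snapshot.
    static List<String> findOrphans(Map<String, Long> filesOnStorage,
                                    Set<String> reachable,
                                    long olderThanMs) {
        List<String> orphans = new ArrayList<>();
        for (Map.Entry<String, Long> entry : filesOnStorage.entrySet()) {
            boolean unreachable = !reachable.contains(entry.getKey());
            boolean oldEnough = entry.getValue() < olderThanMs;
            if (unreachable && oldEnough) {
                orphans.add(entry.getKey());
            }
        }
        Collections.sort(orphans);
        return orphans;
    }

    public static void main(String[] args) {
        Map<String, Long> storage = new HashMap<>();
        storage.put("data/a.orc", 1_000L);   // referenced
        storage.put("data/b.orc", 1_000L);   // unreferenced and old
        storage.put("data/c.orc", 9_000L);   // unreferenced but recent
        Set<String> reachable = new HashSet<>();
        reachable.add("data/a.orc");
        System.out.println(findOrphans(storage, reachable, 5_000L));
    }
}
```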

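Currently, the community-provided Spark and Flink maintenance for Iceberg tables does not handle the delete files that get generated. With Flink streaming writes, large numbers of equality-delete files are produced, which makes reads slow: the Iceberg v2 format supports upsert through merge-on-read, so data files and delete files have to be merged at read time. To address this, and to automate table file maintenance, we implemented a custom data file rewrite that also optimizes the handling of delete files. When creating a table, users can set its maintenance parameters as follows:

// table maintenance interval
'iceberg.refresh.time'='300000',
// number of snapshots to keep
'iceberg.max.snp.num'='10',

After each Flink streaming commit completes, an event is sent to an external system, which uses the statistics in the event to automatically trigger the table maintenance job that rewrites small files.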
The overall flow is roughly as follows:

[Figure: overall flow of the automated table maintenance]
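The article does not describe the event payload or the exact trigger rule, so the following is only a guess at what the external system's decision could look like. It assumes that each commit event reports how many data and delete files were added and that `iceberg.refresh.time` is a minimum interval in milliseconds between maintenance runs; all names here are hypothetical.

```java
// Illustrative only: decides when to run maintenance based on commit events.
// The event fields and the trigger rule are assumptions, not the author's implementation.
class MaintenanceTriggerSketch {
    private final long refreshIntervalMs;
    private long lastRunMs;
    private long pendingDataFiles;
    private long pendingDeleteFiles;

    MaintenanceTriggerSketch(long refreshIntervalMs, long startMs) {
        this.refreshIntervalMs = refreshIntervalMs;
        this.lastRunMs = startMs;
    }

    // Called once per commit event; returns true when maintenance should run now.
    boolean onCommit(long eventTimeMs, long addedDataFiles, long addedDeleteFiles) {
        pendingDataFiles += addedDataFiles;
        pendingDeleteFiles += addedDeleteFiles;
        boolean due = eventTimeMs - lastRunMs >= refreshIntervalMs;
        boolean hasWork = pendingDataFiles + pendingDeleteFiles > 0;
        if (due && hasWork) {
            lastRunMs = eventTimeMs;
            pendingDataFiles = 0;
            pendingDeleteFiles = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MaintenanceTriggerSketch trigger = new MaintenanceTriggerSketch(300_000L, 0L);
        System.out.println(trigger.onCommit(60_000L, 1, 0));
        System.out.println(trigger.onCommit(300_000L, 1, 2));
    }
}
```

Driving maintenance from commit events rather than a fixed schedule means idle tables are never rewritten, while busy tables are compacted at most once per interval.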
