iceberg 用户文档(持续更新)

表 Schema 变更

Iceberg 支持使用 Alter table … alter column 语法对 Schema 进行变更,示例如下

-- spark sql
-- 更改字段类型
ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double;

-- 更新字段和 comment
ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double COMMENT 'unit is bytes per second'

-- 更改字段顺序, FIRST/AFTER
ALTER TABLE prod.db.sample ALTER COLUMN col FIRST
ALTER TABLE prod.db.sample ALTER COLUMN nested.col AFTER other_col
-- null 更改,如果该字段是主键则不支持
ALTER TABLE prod.db.sample ALTER COLUMN id DROP NOT NULL
ALTER TABLE prod.db.sample ALTER COLUMN id SET NOT NULL

查看表的元数据信息

Iceberg 采用 MVCC 的设计模式,存在多个快照,允许使用使用 Spark 语法查询 iceberg 的历史快照及相关变更信息。同时Iceberg实现了文件级别的追踪,所有data files的字段max、min等数据也可以直接通过sql查询。在工场Alpha的即席查询中即可进行Iceberg表的元数据查询。下面会展示如何通过sql查询表的历史、快照信息、分区信息、数据文件信息、manifest信息:
注意:查询表的元数据时一定要以三级结构名称对表进行引用,比如:catalog_name.db_name.table_name.history
catalog_name.db_name.table_name.snapshots

  • 查看表的历史 : SELECT * FROM catalog_name.db_name.table_name.history
+-------------------------+---------------------+---------------------+---------------------+
| made_current_at         | snapshot_id         | parent_id           | is_current_ancestor |
+-------------------------+---------------------+---------------------+---------------------+
| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL                | true                |
| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true                |
| 2019-02-09 16:24:30.13  | 296410040247533544  | 5179299526185056830 | false               |
| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true                |
| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true                |
| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true                |
+-------------------------+---------------------+---------------------+---------------------+
  • 查看表的快照

–会显示表的当前有效快照,已过期快照不包含在内。
SELECT * FROM catalog_name.db_name.table_name.snapshots

+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
| committed_at            | snapshot_id    | parent_id | operation | manifest_list                                      | summary                                               |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+
| 2019-02-08 03:29:51.215 | 57897183625154 | null      | append    | s3://.../table/metadata/snap-57897183625154-1.avro | { added-records -> 2478404, total-records -> 2478404, |
|                         |                |           |           |                                                    |   added-data-files -> 438, total-data-files -> 438,   |
|                         |                |           |           |                                                    |   spark.app.id -> application_1520379288616_155055 }  |
| ...                     | ...            | ...       | ...       | ...                                                | ...                                                   |
+-------------------------+----------------+-----------+-----------+----------------------------------------------------+-------------------------------------------------------+

查询快照那一刻的数据:
select * from table /+ OPTIONS(‘snapshot-id’=‘284660747962682079’)/;

summary 信息很有用,该字段记录了本次 snapshot 新增的文件数、记录数及变更的分区。你也可以通过直接指定想要的summary中字段来获取指定信息:

select snapshot_id, summary[‘total-data-files’] from catalog_name.db_name.table_name.snapshots;

  • 查看表的分区信息

–Iceberg不支持Hive那样查询分区信息:
show partitions table_name --❌
–在Iceberg中查询分区信息方式:
SELECT * FROM catalog_name.db_name.table_name.partitions --✅

+--------------------+---------------+-------------+--+
|   partition   | record_count | file_count |
+--------------------+---------------+-------------+--+
| {"date":20210407} | 1000000    | 1      |
| {"date":20210420} | 1000000    | 1      |
| {"date":20210421} | 1000000    | 1      |
| {"date":20210930} | 1000000    | 1      |
| {"date":20210418} | 1000000    | 1      |
+--------------------+---------------+-------------+--+
  • 查看数据文件的信息
    Iceberg将每一个数据文件的每个字段的max、min等信息都通过manifest文件记录了下来,可以直接通过sql查到:

SELECT * FROM catalog_name.db_name.table_name.files

+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | nan_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
  • 查看Manifest文件信息

SELECT * FROM catalog_name.db_name.table_name.manifests

+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries                  |
+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,null,2019-05-13,2019-05-15]] |
+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+

partition_summaries列的字段依次对应了:

Filed nameTypeDescriptionv1v2
contains_nullbooleanmanifest文件是否包含了至少一个分区字段为null值的行requiredrequired
contains_nanbooleanmanifest文件是否包含了至少一个分区字段为NaN值的行optionaloptional
lower_boundbytes分区字段中非空、非NaN值的下界,如果所有值都是空或NaN,则为空optionaloptional
upper_boundbytes分区字段中非空、非NaN值的上界,如果所有值都是空或NaN,则为空optionaloptional

表参数变更

Iceberg 支持通过 Spark DDL 进行修改,示例如下:

ALTER TABLE prod.db.sample SET TBLPROPERTIES (
    'read.split.target-size'='268435456'
);
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Direction_Wind

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值