Iceberg Source Code Reading (1): ContentFile, the Unified File Interface

肥叔菌

Article link: https://blog.csdn.net/asmartkiller/article/details/127600923

[Figure: Iceberg table file layout, in which item 4 is the Avro-format manifest file]
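ContentFile, the unified file interface, models the files tracked by the Avro-format manifest file (item 4 in the figure above): each manifest entry's data_file record maps to one ContentFile. The manifest content below serves as an example, with its first record pretty-printed.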

{"status":1,
 "snapshot_id":{"long":5890630393462353077},
 "data_file":{
    "file_path":"file:/data/hive/warehouse/iteblog/data/ts_year=2020/id_bucket=0/00000-0-b9ec5fd7-8784-489d-99df-fdb160d0e1b1-00001.parquet",
    "file_format":"PARQUET",
    "partition":{
      "ts_year":{"int":50},
      "id_bucket":{"int":0}
    },
    "record_count":2,
    "file_size_in_bytes":1188,
    "block_size_in_bytes":67108864,
    "column_sizes":{"array":[{"key":1,"value":49},{"key":2,"value":62},{"key":3,"value":51},{"key":4,"value":98}]},
    "value_counts":{"array":[{"key":1,"value":2},{"key":2,"value":2},{"key":3,"value":2},{"key":4,"value":2}]},
    "null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":0}]},
    "lower_bounds":{"array":[{"key":1,"value":"\u0001\u0000\u0000\u0000"},{"key":2,"value":"iteblog"},{"key":3,"value":"d\u0000\u0000\u0000"},{"key":4,"value":"\u0000€öšù²\u0005\u0000"}]},
    "upper_bounds":{"array":[{"key":1,"value":"\u0002\u0000\u0000\u0000"},{"key":2,"value":"iteblog2"},{"key":3,"value":",\u0001\u0000\u0000"},{"key":4,"value":"\u0000€öšù²\u0005\u0000"}]},
    "key_metadata":null,
    "split_offsets":{"array":[4]}
  }
}

{"status":1,"snapshot_id":{"long":5890630393462353077},"data_file":{"file_path":"file:/data/hive/warehouse/iteblog/data/ts_year=2020/id_bucket=1/00001-1-606ce3a6-ca71-4179-be95-b08fc5c65734-00001.parquet","file_format":"PARQUET","partition":{"ts_year":{"int":50},"id_bucket":{"int":1}},"record_count":1,"file_size_in_bytes":1145,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":47},{"key":2,"value":58},{"key":3,"value":47},{"key":4,"value":57}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1},{"key":3,"value":1},{"key":4,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":0}]},"lower_bounds":{"array":[{"key":1,"value":"\u0003\u0000\u0000\u0000"},{"key":2,"value":"iteblog"},{"key":3,"value":"d\u0000\u0000\u0000"},{"key":4,"value":"\u0000 \u000B8iµ\u0005\u0000"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0003\u0000\u0000\u0000"},{"key":2,"value":"iteblog"},{"key":3,"value":"d\u0000\u0000\u0000"},{"key":4,"value":"\u0000 \u000B8iµ\u0005\u0000"}]},"key_metadata":null,"split_offsets":{"array":[4]}}}

{"status":1,"snapshot_id":{"long":5890630393462353077},"data_file":{"file_path":"file:/data/hive/warehouse/iteblog/data/ts_year=2020/id_bucket=0/00001-1-606ce3a6-ca71-4179-be95-b08fc5c65734-00002.parquet","file_format":"PARQUET","partition":{"ts_year":{"int":50},"id_bucket":{"int":0}},"record_count":1,"file_size_in_bytes":1145,"block_size_in_bytes":67108864,"column_sizes":{"array":[{"key":1,"value":47},{"key":2,"value":58},{"key":3,"value":47},{"key":4,"value":57}]},"value_counts":{"array":[{"key":1,"value":1},{"key":2,"value":1},{"key":3,"value":1},{"key":4,"value":1}]},"null_value_counts":{"array":[{"key":1,"value":0},{"key":2,"value":0},{"key":3,"value":0},{"key":4,"value":0}]},"lower_bounds":{"array":[{"key":1,"value":"\u0004\u0000\u0000\u0000"},{"key":2,"value":"iteblog"},{"key":3,"value":"d\u0000\u0000\u0000"},{"key":4,"value":"\u0000àÍ¸\r³\u0005\u0000"}]},"upper_bounds":{"array":[{"key":1,"value":"\u0004\u0000\u0000\u0000"},{"key":2,"value":"iteblog"},{"key":3,"value":"d\u0000\u0000\u0000"},{"key":4,"value":"\u0000àÍ¸\r³\u0005\u0000"}]},"key_metadata":null,"split_offsets":{"array":[4]}}}
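The lower_bounds and upper_bounds values above look garbled because they are binary values printed as strings. Iceberg serializes an int bound as 4 little-endian bytes, so the bounds of columns 1 and 3 in the first record can be decoded with the JDK alone. The sketch below is only an illustration: it assumes columns 1 and 3 are int columns (which the 4-byte values suggest), and the class and helper names are made up for this example rather than taken from Iceberg.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BoundDecoder {
    // Hypothetical helper (not an Iceberg API): decodes a 4-byte little-endian int bound.
    static int decodeIntBound(byte[] raw) {
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        // Bytes copied from the first manifest record shown above.
        byte[] lowerCol1 = {0x01, 0x00, 0x00, 0x00}; // "\u0001\u0000\u0000\u0000"
        byte[] upperCol1 = {0x02, 0x00, 0x00, 0x00}; // "\u0002\u0000\u0000\u0000"
        byte[] lowerCol3 = {0x64, 0x00, 0x00, 0x00}; // "d\u0000\u0000\u0000", 'd' is 0x64
        byte[] upperCol3 = {0x2C, 0x01, 0x00, 0x00}; // ",\u0001\u0000\u0000", ',' is 0x2C

        System.out.println(decodeIntBound(lowerCol1)); // 1
        System.out.println(decodeIntBound(upperCol1)); // 2
        System.out.println(decodeIntBound(lowerCol3)); // 100
        System.out.println(decodeIntBound(upperCol3)); // 300
    }
}
```

Read this way, the first file holds two rows whose column 1 values fall in [1, 2] and whose column 3 values fall in [100, 300], which is the kind of range a planner can compare against a filter before opening the file.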

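As shown in the figure below, ContentFile (defined in api/src/main/java/org/apache/iceberg/ContentFile.java) is the superinterface of DataFile and DeleteFile and declares the methods that the two subinterfaces share. The type parameter <F> is the concrete Java class of the ContentFile instance.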
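The role of the type parameter <F> is easiest to see in a reduced form. The following sketch uses simplified stand-ins for the three interfaces, keeping only path() and copy(); SimpleDataFile and copyAll are hypothetical names introduced for this example and do not exist in Iceberg.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentFileHierarchy {
    // Simplified stand-ins for the Iceberg interfaces; only two methods are kept.
    interface ContentFile<F> {
        CharSequence path();
        F copy(); // returns the concrete subtype, not a bare ContentFile
    }

    interface DataFile extends ContentFile<DataFile> {}

    interface DeleteFile extends ContentFile<DeleteFile> {}

    // Hypothetical implementation used only for this illustration.
    static final class SimpleDataFile implements DataFile {
        private final String path;

        SimpleDataFile(String path) {
            this.path = path;
        }

        @Override
        public CharSequence path() {
            return path;
        }

        @Override
        public DataFile copy() {
            return new SimpleDataFile(path);
        }
    }

    // Works for DataFile and DeleteFile alike; the result keeps the element type F.
    static <F extends ContentFile<F>> List<F> copyAll(List<F> files) {
        List<F> copies = new ArrayList<>();
        for (F file : files) {
            copies.add(file.copy());
        }
        return copies;
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(new SimpleDataFile("a.parquet"));
        List<DataFile> copies = copyAll(files); // no cast needed
        System.out.println(copies.get(0).path()); // a.parquet
    }
}
```

Because copy() returns F, code that handles a DataFile gets a DataFile back, and generic helpers can be written once against ContentFile<F>.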

Long pos() returns the ordinal position of the file in a manifest, or null if the file was not read from a manifest.
int specId() returns the ID of the partition spec used for the partition metadata.
FileContent content() returns the type of content stored in the file: one of DATA, POSITION_DELETES, or EQUALITY_DELETES. The FileContent enum is defined in api/src/main/java/org/apache/iceberg/FileContent.java.
CharSequence path() returns the fully qualified path to the file, suitable for constructing a Hadoop Path.
FileFormat format() returns the format of the file. The FileFormat enum is defined in api/src/main/java/org/apache/iceberg/FileFormat.java and contains ORC("orc", true), PARQUET("parquet", true), AVRO("avro", true), and METADATA("metadata.json", false).
StructLike partition() returns the partition for this file as a StructLike.
long recordCount() returns the number of top-level records in the file.
long fileSizeInBytes() returns the file size in bytes.
Map<Integer, Long> columnSizes() returns, if collected, a map from column ID to the size of the column in bytes; null otherwise.
Map<Integer, Long> valueCounts() returns, if collected, a map from column ID to the count of its non-null values; null otherwise.
Map<Integer, Long> nullValueCounts() returns, if collected, a map from column ID to its null value count; null otherwise.
Map<Integer, Long> nanValueCounts() returns, if collected, a map from column ID to its NaN value count; null otherwise.
Map<Integer, ByteBuffer> lowerBounds() returns, if collected, a map from column ID to the lower bound of its values; null otherwise.
Map<Integer, ByteBuffer> upperBounds() returns, if collected, a map from column ID to the upper bound of its values; null otherwise.
ByteBuffer keyMetadata() returns metadata about how this file is encrypted, or null if the file is stored in plain text.
List<Long> splitOffsets() returns the list of recommended split locations if applicable, null otherwise. When available, this information is used to plan scan tasks whose boundaries are determined by these offsets. The returned list must be sorted in ascending order.
List<Integer> equalityFieldIds() returns the set of field IDs used for equality comparison in equality delete files. An equality delete file may contain additional data fields that are not used for equality comparison; the subset of columns used for equality comparison is tracked by ID. The extra columns can be used to reconstruct changes, and their metrics are used during job planning. The return value is the IDs of the fields used in equality comparison with the records in this delete file.
default Integer sortOrderId() { return null; } returns the sort order ID of this file, which describes how the file is ordered. This information is useful for merging data files and equality delete files more efficiently when they share the same sort order ID.
F copy() copies this file. Manifest readers can reuse file instances, so use this method to copy the data when collecting files from tasks. It returns a copy of this data file.
F copyWithoutStats() copies this file without its file stats. Manifest readers can reuse file instances, so use this method to copy the data without stats when collecting files. It returns a copy of this data file without lower bounds, upper bounds, value counts, null value counts, or NaN value counts.
default F copy(boolean withStats) { return withStats ? copy() : copyWithoutStats(); } copies this file, potentially without its file stats. Manifest readers can reuse file instances, so use this method to copy the data when collecting files from tasks. If withStats is false, the file is copied without stats and the returned copy contains no lower bounds, upper bounds, value counts, null value counts, or NaN value counts.
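The warning that manifest readers can reuse file instances is the reason the three copy methods exist. The sketch below imitates that behavior with a hypothetical reader that overwrites a single object for every record; FileEntry and collect are illustrative names, not Iceberg classes, and the stats are reduced to one map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CopyDemo {
    // Hypothetical, heavily simplified file entry; not an Iceberg class.
    static final class FileEntry {
        String path;
        Map<Integer, Long> valueCounts; // null when stats are absent

        FileEntry copy() {
            FileEntry c = copyWithoutStats();
            c.valueCounts = valueCounts == null ? null : new HashMap<>(valueCounts);
            return c;
        }

        FileEntry copyWithoutStats() {
            FileEntry c = new FileEntry();
            c.path = path;
            return c;
        }

        // Same shape as the default method quoted above.
        FileEntry copy(boolean withStats) {
            return withStats ? copy() : copyWithoutStats();
        }
    }

    // Imitates a reader that reuses one instance for every record it yields.
    static List<FileEntry> collect(List<String> paths, boolean copyEach, boolean withStats) {
        FileEntry reused = new FileEntry();
        List<FileEntry> collected = new ArrayList<>();
        for (String p : paths) {
            reused.path = p;
            reused.valueCounts = Map.of(1, 2L);
            collected.add(copyEach ? reused.copy(withStats) : reused);
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("a.parquet", "b.parquet");
        // Without copying, every collected element is the same reused object.
        System.out.println(collect(paths, false, true).get(0).path); // b.parquet
        // With copying, each element keeps the values it had when it was read.
        System.out.println(collect(paths, true, true).get(0).path); // a.parquet
        // copy(false) drops the stats.
        System.out.println(collect(paths, true, false).get(0).valueCounts); // null
    }
}
```

Copying without stats is the cheaper choice when only paths and partitions are needed, since the per-column maps are not duplicated.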

[Figure: ContentFile as the superinterface of DataFile and DeleteFile]