Iceberg Series (1): Storage Deep Dive - First Look, Part 2

In the previous post, Iceberg Series (1): Storage Deep Dive - First Look, Part 1, we walked through the metadata structure of a freshly created table. Now let's insert a row into the table:

INSERT INTO local.db.table VALUES (1, 'a');

Then check the directory structure again:

(base) ➜ table tree -A -C -D
.
├── [Jan  2 21:45]  data
│   └── [Jan  2 21:45]  00000-0-ea35130e-b5ed-4443-889f-2ee5e62e6757-00001.parquet
└── [Jan  2 21:45]  metadata
    ├── [Jan  2 21:45]  1bd1f809-55ea-4ba1-b425-ab4ecc212434-m0.avro
    ├── [Jan  2 21:45]  snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro
    ├── [Jan  2 21:22]  v1.metadata.json
    ├── [Jan  2 21:45]  v2.metadata.json
    └── [Jan  2 21:45]  version-hint.text

2 directories, 6 files

Contents of v2.metadata.json:

{
  "format-version" : 1,
  "table-uuid" : "0dc08d49-ed4d-49bb-8ddf-006e37c65372",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table",
  "last-updated-ms" : 1641131156558,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "liliwei"
  },
  "current-snapshot-id" : 5028042644139258397,
  "snapshots" : [ {
    "snapshot-id" : 5028042644139258397,
    "timestamp-ms" : 1641131156558,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1641129606166",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "643",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "643",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1641131156558,
    "snapshot-id" : 5028042644139258397
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1641129739691,
    "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/v1.metadata.json"
  } ]
}
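
Note that format-version is 1 (the v1 table spec), current-snapshot-id now points at the new snapshot 5028042644139258397, and metadata-log records the previous metadata file. The same information can be queried without reading raw JSON, through Iceberg's metadata tables. A minimal sketch, assuming a pyspark session with the same local catalog configured in part 1 (the exact column set varies across Iceberg versions):

# Cross-check the snapshot we just read from v2.metadata.json.
spark.sql(
    "SELECT snapshot_id, operation, manifest_list "
    "FROM local.db.table.snapshots"
).show(truncate=False)

# The snapshot-log section is exposed as the `history` metadata table.
spark.sql("SELECT * FROM local.db.table.history").show(truncate=False)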

Contents of version-hint.text:

2
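
The hint now points at v2: with a Hadoop catalog, readers find the current metadata file by reading this number and opening the matching vN.metadata.json. A minimal sketch of that resolution step (the real HadoopTableOperations also handles retries and missing hint files):

import os

# Resolve the current metadata file the way a Hadoop-catalog reader does:
# read the version from version-hint.text, then open vN.metadata.json.
with open("metadata/version-hint.text") as f:
    version = int(f.read().strip())
print(os.path.join("metadata", f"v{version}.metadata.json"))  # metadata/v2.metadata.json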

Contents of snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro (the manifest list), dumped with the Avro-to-JSON tool avro-tools-1.10.2.jar (https://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.10.2/):

java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson metadata/snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro
{
	"manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/1bd1f809-55ea-4ba1-b425-ab4ecc212434-m0.avro",
	"manifest_length": 5803,
	"partition_spec_id": 0,
	"added_snapshot_id": {
		"long": 5028042644139258397
	},
	"added_data_files_count": {
		"int": 1
	},
	"existing_data_files_count": {
		"int": 0
	},
	"deleted_data_files_count": {
		"int": 0
	},
	"partitions": {
		"array": []
	},
	"added_rows_count": {
		"long": 1
	},
	"existing_rows_count": {
		"long": 0
	},
	"deleted_rows_count": {
		"long": 0
	}
}
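
The {"long": ...} and {"int": ...} wrappers are just how avro-tools renders Avro union types in JSON; the underlying values are plain numbers. The manifest list can also be read programmatically. A small sketch, assuming the fastavro package is installed (it is not part of the toolchain used above):

from fastavro import reader

# Each record in the manifest list describes one manifest: its path, the
# snapshot that added it, and counts of added/existing/deleted files and rows.
with open("metadata/snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro", "rb") as f:
    for record in reader(f):
        print(record["manifest_path"], record["added_data_files_count"])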

Now look at the manifest file it points to:

(base) ➜ table java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson metadata/1bd1f809-55ea-4ba1-b425-ab4ecc212434-m0.avro
{
	"status": 1,
	"snapshot_id": {
		"long": 5028042644139258397
	},
	"data_file": {
		"file_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/data/00000-0-ea35130e-b5ed-4443-889f-2ee5e62e6757-00001.parquet",
		"file_format": "PARQUET",
		"partition": {},
		"record_count": 1,
		"file_size_in_bytes": 643,
		"block_size_in_bytes": 67108864,
		"column_sizes": {
			"array": [{
				"key": 1,
				"value": 46
			}, {
				"key": 2,
				"value": 48
			}]
		},
		"value_counts": {
			"array": [{
				"key": 1,
				"value": 1
			}, {
				"key": 2,
				"value": 1
			}]
		},
		"null_value_counts": {
			"array": [{
				"key": 1,
				"value": 0
			}, {
				"key": 2,
				"value": 0
			}]
		},
		"nan_value_counts": {
			"array": []
		},
		"lower_bounds": {
			"array": [{
				"key": 1,
				"value": "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
			}, {
				"key": 2,
				"value": "a"
			}]
		},
		"upper_bounds": {
			"array": [{
				"key": 1,
				"value": "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
			}, {
				"key": 2,
				"value": "a"
			}]
		},
		"key_metadata": null,
		"split_offsets": {
			"array": [4]
		},
		"sort_order_id": {
			"int": 0
		}
	}
}

Clearly, the manifest records the exact location of the data file along with per-column statistics: column sizes, value counts, null counts, and lower/upper bounds keyed by field id. The bounds appear as escaped bytes because Iceberg stores them in its single-value binary serialization; the sketch below shows how to decode them.
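
A minimal decoding sketch (per the Iceberg spec, a long bound is serialized as 8 little-endian bytes and a string bound as raw UTF-8):

import struct

# Column 1's bound "\u0001\u0000\u0000\u0000\u0000\u0000\u0000\u0000" is the
# long value 1 in little-endian; column 2's bound is the UTF-8 string "a".
raw = b"\x01\x00\x00\x00\x00\x00\x00\x00"
print(struct.unpack("<q", raw)[0])  # -> 1
print(b"a".decode("utf-8"))         # -> a

Now let's insert a second row and see what changes: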

INSERT INTO local.db.table VALUES (2, 'b');

Check the directory structure:

(base) ➜ table tree -C -D
.
├── [Jan  2 22:07]  data
│   ├── [Jan  2 21:45]  00000-0-ea35130e-b5ed-4443-889f-2ee5e62e6757-00001.parquet
│   └── [Jan  2 22:07]  00000-1-631cd5bc-2ad0-4ddd-9530-f055b2888d56-00001.parquet
└── [Jan  2 22:07]  metadata
    ├── [Jan  2 21:45]  1bd1f809-55ea-4ba1-b425-ab4ecc212434-m0.avro
    ├── [Jan  2 22:07]  6881af48-5efa-4660-99ed-be5b9f640e52-m0.avro
    ├── [Jan  2 22:07]  snap-1270004071302473053-1-6881af48-5efa-4660-99ed-be5b9f640e52.avro
    ├── [Jan  2 21:45]  snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro
    ├── [Jan  2 21:22]  v1.metadata.json
    ├── [Jan  2 21:45]  v2.metadata.json
    ├── [Jan  2 22:07]  v3.metadata.json
    └── [Jan  2 22:07]  version-hint.text

2 directories, 10 files

Contents of v3.metadata.json:

{
  "format-version" : 1,
  "table-uuid" : "0dc08d49-ed4d-49bb-8ddf-006e37c65372",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table",
  "last-updated-ms" : 1641132476394,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "long"
    }, {
      "id" : 2,
      "name" : "data",
      "required" : false,
      "type" : "string"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "owner" : "liliwei"
  },
  "current-snapshot-id" : 1270004071302473053,
  "snapshots" : [ {
    "snapshot-id" : 5028042644139258397,
    "timestamp-ms" : 1641131156558,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1641129606166",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "643",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "643",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/snap-5028042644139258397-1-1bd1f809-55ea-4ba1-b425-ab4ecc212434.avro",
    "schema-id" : 0
  }, {
    "snapshot-id" : 1270004071302473053,
    "parent-snapshot-id" : 5028042644139258397,
    "timestamp-ms" : 1641132476394,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1641129606166",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "643",
      "changed-partition-count" : "1",
      "total-records" : "2",
      "total-files-size" : "1286",
      "total-data-files" : "2",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/snap-1270004071302473053-1-6881af48-5efa-4660-99ed-be5b9f640e52.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ {
    "timestamp-ms" : 1641131156558,
    "snapshot-id" : 5028042644139258397
  }, {
    "timestamp-ms" : 1641132476394,
    "snapshot-id" : 1270004071302473053
  } ],
  "metadata-log" : [ {
    "timestamp-ms" : 1641129739691,
    "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/v1.metadata.json"
  }, {
    "timestamp-ms" : 1641131156558,
    "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/v2.metadata.json"
  } ]
}
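
Both snapshots are listed now: the second one carries parent-snapshot-id and the cumulative totals (total-records 2, total-data-files 2), and current-snapshot-id points at it. Because the first snapshot remains reachable, we can time travel back to it. A sketch using the snapshot-id DataFrame read option from Iceberg's Spark integration (the id is copied from the metadata above):

# Read the table as of the first snapshot; only row (1, 'a') is visible.
df = (spark.read
      .option("snapshot-id", 5028042644139258397)
      .format("iceberg")
      .load("local.db.table"))
df.show()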

Contents of snap-1270004071302473053-1-6881af48-5efa-4660-99ed-be5b9f640e52.avro (the new manifest list):

{
	"manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/6881af48-5efa-4660-99ed-be5b9f640e52-m0.avro",
	"manifest_length": 5802,
	"partition_spec_id": 0,
	"added_snapshot_id": {
		"long": 1270004071302473053
	},
	"added_data_files_count": {
		"int": 1
	},
	"existing_data_files_count": {
		"int": 0
	},
	"deleted_data_files_count": {
		"int": 0
	},
	"partitions": {
		"array": []
	},
	"added_rows_count": {
		"long": 1
	},
	"existing_rows_count": {
		"long": 0
	},
	"deleted_rows_count": {
		"long": 0
	}
}

{
	"manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/metadata/1bd1f809-55ea-4ba1-b425-ab4ecc212434-m0.avro",
	"manifest_length": 5803,
	"partition_spec_id": 0,
	"added_snapshot_id": {
		"long": 5028042644139258397
	},
	"added_data_files_count": {
		"int": 1
	},
	"existing_data_files_count": {
		"int": 0
	},
	"deleted_data_files_count": {
		"int": 0
	},
	"partitions": {
		"array": []
	},
	"added_rows_count": {
		"long": 1
	},
	"existing_rows_count": {
		"long": 0
	},
	"deleted_rows_count": {
		"long": 0
	}
}

Note that the output above is two separate records, printed as two JSON objects: avro-tools emits one JSON object per Avro record, and this manifest list has one entry per manifest, the new one first, followed by the manifest carried over from the previous snapshot.
Contents of 6881af48-5efa-4660-99ed-be5b9f640e52-m0.avro (the new manifest):

{
	"status": 1,
	"snapshot_id": {
		"long": 1270004071302473053
	},
	"data_file": {
		"file_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/table/data/00000-1-631cd5bc-2ad0-4ddd-9530-f055b2888d56-00001.parquet",
		"file_format": "PARQUET",
		"partition": {},
		"record_count": 1,
		"file_size_in_bytes": 643,
		"block_size_in_bytes": 67108864,
		"column_sizes": {
			"array": [{
				"key": 1,
				"value": 46
			}, {
				"key": 2,
				"value": 48
			}]
		},
		"value_counts": {
			"array": [{
				"key": 1,
				"value": 1
			}, {
				"key": 2,
				"value": 1
			}]
		},
		"null_value_counts": {
			"array": [{
				"key": 1,
				"value": 0
			}, {
				"key": 2,
				"value": 0
			}]
		},
		"nan_value_counts": {
			"array": []
		},
		"lower_bounds": {
			"array": [{
				"key": 1,
				"value": "\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
			}, {
				"key": 2,
				"value": "b"
			}]
		},
		"upper_bounds": {
			"array": [{
				"key": 1,
				"value": "\u0002\u0000\u0000\u0000\u0000\u0000\u0000\u0000"
			}, {
				"key": 2,
				"value": "b"
			}]
		},
		"key_metadata": null,
		"split_offsets": {
			"array": [4]
		},
		"sort_order_id": {
			"int": 0
		}
	}
}
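
Putting it together: each INSERT produced a new metadata.json, a new manifest list, and a new manifest pointing at the newly written data file, while the previous manifest is carried forward unchanged. As a final cross-check, both data files should show up in the files metadata table. A sketch (column names per the Iceberg Spark docs; the exact set varies by version):

# Both parquet files are now tracked, each with its own statistics.
spark.sql(
    "SELECT file_path, record_count, file_size_in_bytes "
    "FROM local.db.table.files"
).show(truncate=False)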

A detailed analysis and source-code walkthrough have not been added yet; consider them a TODO.
Next we will look at one of the keys to Iceberg's friendly table-format evolution, partitioning:
Iceberg Series (2): Storage Deep Dive - Partitions, Part 1
