Introduction
Parquet files are typically read with Spark or Flink. For troubleshooting or study, though, writing a dedicated Spark/Flink program just to take a quick look at a single Parquet file is cumbersome. This article introduces two command-line tools for reading and analyzing Parquet files. Both are simple to use and require no code at all.
Using parquet-cli
Project address and download
Project: https://github.com/apache/parquet-java.git
Download: https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar
Official usage and documentation: https://github.com/apache/parquet-java/tree/master/parquet-cli
Usage
Command format:
hadoop jar parquet-cli-1.14.1-runtime.jar <command> <local parquet file path>
Show the help:
[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar help
Usage: parquet [options] [command] [command options]
Options:
-v, --verbose, --debug
Print extra debugging information
Commands:
help
Retrieves details on the functions of other commands
meta
Print a Parquet file's metadata
pages
Print page summaries for a Parquet file
dictionary
Print dictionaries for a Parquet column
check-stats
Check Parquet files for corrupt page and column stats (PARQUET-251)
schema
Print the Avro schema for a file
csv-schema
Build a schema from a CSV data sample
convert-csv
Create a file from CSV data
convert
Create a Parquet file from a data file
to-avro
Create an Avro file from a data file
cat
Print the first N records from a file
head
Print the first N records from a file
column-index
Prints the column and offset indexes of a Parquet file
column-size
Print the column sizes of a parquet file
prune
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
trans-compression
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
masking
(Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
footer
Print the Parquet file footer in json format
bloom-filter
Check bloom filters for a Parquet column
scan
Scan all records from a file
rewrite
Rewrite one or more Parquet files to a new Parquet file
Examples:
# print information for meta
parquet help meta
See 'parquet help <command>' for more information on a specific command.
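When parquet-cli is called repeatedly from scripts, the command format above can be wrapped in a small launcher. This is a minimal sketch of my own (the names `build_argv` and `run_parquet_cli` are hypothetical, not part of the tool); it assumes `hadoop` is on PATH and the jar sits in the current directory.

```python
import subprocess

JAR = "parquet-cli-1.14.1-runtime.jar"

def build_argv(command, path, jar=JAR):
    # Mirrors the command format above: hadoop jar <jar> <command> <file path>
    return ["hadoop", "jar", jar, command, path]

def run_parquet_cli(command, path, jar=JAR):
    """Run a parquet-cli command via `hadoop jar` and return captured stdout."""
    result = subprocess.run(build_argv(command, path, jar),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_parquet_cli("meta", "my.parquet")` would return the same text that the interactive `meta` invocation prints.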
Usage examples
The examples below use a Parquet file from the storage layer of a Hudi table to demonstrate parquet-cli.
Print a Parquet file's schema:
[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar schema ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{
"type" : "record",
"name" : "hudi_student_record",
"namespace" : "hoodie.hudi_student",
"fields" : [ {
"name" : "_hoodie_commit_time",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_commit_seqno",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_record_key",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_partition_path",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "_hoodie_file_name",
"type" : [ "null", "string" ],
"doc" : "",
"default" : null
}, {
"name" : "id",
"type" : "int"
}, {
"name" : "name",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "tel",
"type" : [ "null", "int" ],
"default" : null
} ]
}
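Because `schema` prints standard Avro JSON, the output can be post-processed with nothing but the standard library. The sketch below (the helper name `summarize_fields` is mine) reads an abbreviated copy of the schema above and reports each field's base type and nullability; Avro encodes a nullable column as a union with "null".

```python
import json

# A trimmed copy of the Avro schema printed by `parquet-cli schema` above.
schema_json = '''{"type": "record", "name": "hudi_student_record", "fields": [
  {"name": "id", "type": "int"},
  {"name": "name", "type": ["null", "string"], "default": null},
  {"name": "tel", "type": ["null", "int"], "default": null}]}'''

def summarize_fields(schema_text):
    """Map each field name to (base type, nullable) for an Avro record schema."""
    schema = json.loads(schema_text)
    summary = {}
    for field in schema["fields"]:
        t = field["type"]
        nullable = isinstance(t, list) and "null" in t  # union with "null"
        base = next(x for x in t if x != "null") if nullable else t
        summary[field["name"]] = (base, nullable)
    return summary

print(summarize_fields(schema_json))
```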
Print a Parquet file's records:
[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar cat ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
{"_hoodie_commit_time": "20240710084244349", "_hoodie_commit_seqno": "20240710084244349_0_7", "_hoodie_record_key": "2", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 2, "name": "Mary", "tel": 222222}
{"_hoodie_commit_time": "20240710083659244", "_hoodie_commit_seqno": "20240710083659244_0_3", "_hoodie_record_key": "5", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 5, "name": "Tom", "tel": 666666}
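Since `cat` emits one JSON object per line, its output pipes cleanly into a filter script. Below is a hedged sketch (the function `filter_records` and the field subset are my own illustration); in practice the `lines` iterable could be `sys.stdin` at the end of a pipeline such as `hadoop jar parquet-cli-1.14.1-runtime.jar cat file.parquet | python filter.py`.

```python
import json

def filter_records(lines, column, value):
    """Yield records (dicts) whose given column equals value.

    `lines` is an iterable of JSON objects, one per line, in the
    format that `parquet-cli cat` prints.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(column) == value:
            yield record

# Example with abbreviated versions of two records shown above:
sample = [
    '{"id": 1, "name": "Paul", "tel": 111111}',
    '{"id": 3, "name": "Peter", "tel": 222222}',
]
matches = list(filter_records(sample, "tel", 222222))
print(matches)
```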
Print the first 3 records of a Parquet file:
[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar head -n 3 ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
Print a Parquet file's metadata:
[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar meta ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
File path: ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
hoodie_bloom_filter_type_code: DYNAMIC_V0
org.apache.hudi.bloomfilter: // omitted: value too long
hoodie_min_record_key: 1
parquet.avro.schema: {"type":"record","name":"hudi_student_record","namespace":"hoodie.hudi_student","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"name","type":["null","string"],"default":null},{"name":"tel","type":["null","int"],"default":null}]}
writer.model.name: avro
hoodie_max_record_key: 5
Schema:
message hoodie.hudi_student.hudi_student_record {
optional binary _hoodie_commit_time (STRING);
optional binary _hoodie_commit_seqno (STRING);
optional binary _hoodie_record_key (STRING);
optional binary _hoodie_partition_path (STRING);
optional binary _hoodie_file_name (STRING);
required int32 id;
optional binary name (STRING);
optional int32 tel;
}
Row group 0: count: 5 152.20 B records start: 4 total(compressed): 761 B total(uncompressed): 702 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_hoodie_commit_time BINARY G _ 5 19.60 B 0 "20240710083659244" / "20240710084413943"
_hoodie_commit_seqno BINARY G _ 5 21.80 B 0 "20240710083659244_0_3" / "20240710084413943_0_11"
_hoodie_record_key BINARY G _ 5 12.60 B 0 "1" / "5"
_hoodie_partition_path BINARY G _ R 5 18.80 B 0 "" / ""
_hoodie_file_name BINARY G _ R 5 31.20 B 0 "ba74ba57-d45c-43c7-9ddb-7..." / "ba74ba57-d45c-43c7-9ddb-7..."
id INT32 G _ 5 11.40 B 0 "1" / "5"
name BINARY G _ 5 16.00 B 0 "Jessy" / "Tom"
tel INT32 G _ R 5 20.80 B 0 "111111" / "666666"
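Everything `meta` prints comes from the file footer. For background, a Parquet file begins with the 4-byte magic "PAR1" and ends with a 4-byte little-endian footer length followed by "PAR1" again. The stdlib-only sketch below (the name `read_footer_info` is mine, not a parquet-cli feature) checks the magic bytes and returns the footer length, which can be a quick sanity check before reaching for the full tool.

```python
import struct

def read_footer_info(path):
    """Verify the PAR1 magic bytes and return the footer length in bytes.

    Parquet layout: "PAR1" header, data pages, metadata footer, then a
    4-byte little-endian footer length and a trailing "PAR1" magic.
    """
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, 2)  # last 8 bytes: footer length + trailing magic
        footer_len = struct.unpack("<I", f.read(4))[0]
        magic = f.read(4)
    if head != b"PAR1" or magic != b"PAR1":
        raise ValueError(f"{path} is not a Parquet file")
    return footer_len
```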
Using parquet-tools
Download
Download the jar file:
wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar
Usage
hadoop jar parquet-tools-1.11.2.jar <command> <parquet file path on HDFS>
The commands work the same way as those of parquet-cli above, so they are not repeated here.