Viewing Parquet Files from the Command Line

Introduction

Parquet files are usually read with Spark or Flink. For troubleshooting or learning purposes, however, writing a dedicated Spark/Flink program just to take a quick look at a single parquet file is overkill. This post introduces two command-line tools for reading and analyzing Parquet files. Both are simple to use and require no program code at all.

Using parquet-cli

Project and download links

Project: https://github.com/apache/parquet-java.git

Download: https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar

Official usage instructions and documentation: https://github.com/apache/parquet-java/tree/master/parquet-cli
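
For example, the runtime jar above can be fetched directly with wget:

wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-cli/1.14.1/parquet-cli-1.14.1-runtime.jar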

Usage

Command format:

hadoop jar parquet-cli-1.14.1-runtime.jar <command> <local parquet file path>

Show the help:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar help

Usage: parquet [options] [command] [command options]

  Options:

    -v, --verbose, --debug
        Print extra debugging information

  Commands:

    help
        Retrieves details on the functions of other commands
    meta
        Print a Parquet file's metadata
    pages
        Print page summaries for a Parquet file
    dictionary
        Print dictionaries for a Parquet column
    check-stats
        Check Parquet files for corrupt page and column stats (PARQUET-251)
    schema
        Print the Avro schema for a file
    csv-schema
        Build a schema from a CSV data sample
    convert-csv
        Create a file from CSV data
    convert
        Create a Parquet file from a data file
    to-avro
        Create an Avro file from a data file
    cat
        Print the first N records from a file
    head
        Print the first N records from a file
    column-index
        Prints the column and offset indexes of a Parquet file
    column-size
        Print the column sizes of a parquet file
    prune
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Prune column(s) in a Parquet file and save it to a new file. The columns left are not changed.
    trans-compression
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Translate the compression from one to another (It doesn't support bloom filter feature yet).
    masking
        (Deprecated: will be removed in 2.0.0, use rewrite command instead) Replace columns with masked values and write to a new Parquet file
    footer
        Print the Parquet file footer in json format
    bloom-filter
        Check bloom filters for a Parquet column
    scan
        Scan all records from a file
    rewrite
        Rewrite one or more Parquet files to a new Parquet file

  Examples:

    # print information for meta
    parquet help meta

  See 'parquet help <command>' for more information on a specific command.
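
Per-command help works the same way when the tool is launched through hadoop jar. For example, to see the options of the meta command (output omitted here):

hadoop jar parquet-cli-1.14.1-runtime.jar help meta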

Examples

The examples below use a parquet file from the storage layer of a Hudi table to demonstrate how parquet-cli is used.

Print the schema of the parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar schema ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{
  "type" : "record",
  "name" : "hudi_student_record",
  "namespace" : "hoodie.hudi_student",
  "fields" : [ {
    "name" : "_hoodie_commit_time",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_commit_seqno",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_record_key",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_partition_path",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "_hoodie_file_name",
    "type" : [ "null", "string" ],
    "doc" : "",
    "default" : null
  }, {
    "name" : "id",
    "type" : "int"
  }, {
    "name" : "name",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "tel",
    "type" : [ "null", "int" ],
    "default" : null
  } ]
}

Print the data in the parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar cat ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}
{"_hoodie_commit_time": "20240710084244349", "_hoodie_commit_seqno": "20240710084244349_0_7", "_hoodie_record_key": "2", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 2, "name": "Mary", "tel": 222222}
{"_hoodie_commit_time": "20240710083659244", "_hoodie_commit_seqno": "20240710083659244_0_3", "_hoodie_record_key": "5", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 5, "name": "Tom", "tel": 666666}

Print the first 3 records of the parquet file:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar head -n 3 ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
{"_hoodie_commit_time": "20240710084413943", "_hoodie_commit_seqno": "20240710084413943_0_11", "_hoodie_record_key": "1", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 1, "name": "Paul", "tel": 111111}
{"_hoodie_commit_time": "20240710084317041", "_hoodie_commit_seqno": "20240710084317041_0_8", "_hoodie_record_key": "3", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 3, "name": "Peter", "tel": 222222}
{"_hoodie_commit_time": "20240710084352978", "_hoodie_commit_seqno": "20240710084352978_0_9", "_hoodie_record_key": "4", "_hoodie_partition_path": "", "_hoodie_file_name": "ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet", "id": 4, "name": "Jessy", "tel": 222222}

Print the parquet file's metadata:

[root@manager paul]# hadoop jar parquet-cli-1.14.1-runtime.jar meta ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet

File path:  ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet
Created by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
Properties:
  hoodie_bloom_filter_type_code: DYNAMIC_V0
    org.apache.hudi.bloomfilter: // omitted, too long
          hoodie_min_record_key: 1
            parquet.avro.schema: {"type":"record","name":"hudi_student_record","namespace":"hoodie.hudi_student","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"id","type":"int"},{"name":"name","type":["null","string"],"default":null},{"name":"tel","type":["null","int"],"default":null}]}
              writer.model.name: avro
          hoodie_max_record_key: 5
Schema:
message hoodie.hudi_student.hudi_student_record {
  optional binary _hoodie_commit_time (STRING);
  optional binary _hoodie_commit_seqno (STRING);
  optional binary _hoodie_record_key (STRING);
  optional binary _hoodie_partition_path (STRING);
  optional binary _hoodie_file_name (STRING);
  required int32 id;
  optional binary name (STRING);
  optional int32 tel;
}


Row group 0:  count: 5  152.20 B records  start: 4  total(compressed): 761 B total(uncompressed):702 B
--------------------------------------------------------------------------------
                        type      encodings count     avg size   nulls   min / max
_hoodie_commit_time     BINARY    G   _     5         19.60 B    0       "20240710083659244" / "20240710084413943"
_hoodie_commit_seqno    BINARY    G   _     5         21.80 B    0       "20240710083659244_0_3" / "20240710084413943_0_11"
_hoodie_record_key      BINARY    G   _     5         12.60 B    0       "1" / "5"
_hoodie_partition_path  BINARY    G _ R     5         18.80 B    0       "" / ""
_hoodie_file_name       BINARY    G _ R     5         31.20 B    0       "ba74ba57-d45c-43c7-9ddb-7..." / "ba74ba57-d45c-43c7-9ddb-7..."
id                      INT32     G   _     5         11.40 B    0       "1" / "5"
name                    BINARY    G   _     5         16.00 B    0       "Jessy" / "Tom"
tel                     INT32     G _ R     5         20.80 B    0       "111111" / "666666"
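
The remaining commands listed in the help output follow the same invocation pattern. For instance, the file footer can be dumped as JSON (output omitted here) with:

hadoop jar parquet-cli-1.14.1-runtime.jar footer ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet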
    

Using parquet-tools

Download

Download the jar file:

wget https://repo1.maven.org/maven2/org/apache/parquet/parquet-tools/1.11.2/parquet-tools-1.11.2.jar

Usage

hadoop jar parquet-tools-1.11.2.jar <command> <parquet file path in HDFS>

The commands are used the same way as with the parquet-cli tool above, so they are not repeated here.
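
As a quick sketch (the HDFS path below is only an illustrative placeholder), printing the metadata of a file stored in HDFS would look like this:

hadoop jar parquet-tools-1.11.2.jar meta /user/paul/hudi_student/ba74ba57-d45c-43c7-9ddb-7c8afb3bab8f_0-1-0_20240710084455963.parquet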
