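The parquet-tools utilities
There are currently two tools named parquet-tools.
1. The open-source parquet-tools written by wesleypeck (the more widely used of the two, and customizable)
Fixing the org/apache/hadoop/conf/Configuration error in parquet-tools
The original author no longer maintains this version, and most builds that can be found online today do not work. The cause is that the pom.xml in the source tree does not declare the hadoop-core dependency, so running a command from the resulting jar fails with: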
NoClassDefFoundError: org/apache/hadoop/conf/Configuration
In other cases the command produces no useful output, or fails with other Hadoop-related errors.
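One way to confirm this diagnosis before rebuilding is to check whether a given jar actually bundles the class named in the error. The following is a minimal sketch, not part of the original tool: it relies only on the fact that a jar is a zip archive, and the helper name `jar_contains_class` is hypothetical. Note that a jar without the class can still work if Hadoop is supplied on the classpath at run time, so a negative result only tells you the jar is not self-contained.

```python
import zipfile

# Archive entry for the class named in the NoClassDefFoundError above.
HADOOP_CONFIGURATION = "org/apache/hadoop/conf/Configuration.class"


def jar_contains_class(jar, class_entry=HADOOP_CONFIGURATION):
    """Return True if the jar (a zip archive) has an entry for the class.

    `jar` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return class_entry in archive.namelist()
```

Calling `jar_contains_class` with the path of the jar you downloaded or built returns False when the Hadoop class was not packaged into it.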
Solutions:
(1) Download a prebuilt copy of the tool from the links below.
jar:
https://download.csdn.net/download/weixin_42532968/87431652
tar.gz:
https://download.csdn.net/download/weixin_42532968/87431657
(2) Download the source, add the following dependency to pom.xml, repackage, and use the tar.gz file that the build produces.
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
If you also need a standalone jar, additionally add the following plugin to the build section of pom.xml:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>parquet.tools.Main</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </execution>
    </executions>
</plugin>
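Packaging then generates the tar.gz file (and, with the plugin above, the jar), which resolves the problem of the missing Hadoop components.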
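Before repackaging, it can help to verify that the edited pom.xml really declares the dependency. The sketch below is an illustration using only the Python standard library; the function names are hypothetical. It ignores XML namespaces and collects every `<dependency>` element it finds, so entries under `<dependencyManagement>` or inside plugins are counted as well.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.rsplit("}", 1)[-1]


def declared_dependencies(pom_xml):
    """Return (groupId, artifactId, version) for each <dependency> element."""
    root = ET.fromstring(pom_xml)
    found = []
    for element in root.iter():
        if local_name(element.tag) != "dependency":
            continue
        # Map each child tag (groupId, artifactId, ...) to its trimmed text.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        found.append(
            (fields.get("groupId"), fields.get("artifactId"), fields.get("version"))
        )
    return found


def has_hadoop_core(pom_xml):
    """Return True if org.apache.hadoop:hadoop-core is declared in the POM."""
    return any(
        group == "org.apache.hadoop" and artifact == "hadoop-core"
        for group, artifact, _ in declared_dependencies(pom_xml)
    )
```

For a file on disk, reading it in binary mode and passing the bytes to `has_hadoop_core` lets the XML parser handle the encoding declaration itself.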
Usage:
# Show the dump information for the DEVICE_NUMBER column of the parquet file (-d is short for --disable-data, so column values are not printed)
parquet_tools dump -c DEVICE_NUMBER -d /opt/trafodion/bss_userinfo_20180812_0
# Show the dump information for the whole parquet file
parquet_tools dump -d /opt/trafodion/bss_userinfo_20180812_0
# Show the first 10 rows of the parquet file
parquet_tools head -n 10 /opt/trafodion/bss_userinfo_20180812_0
# Show the metadata of the parquet file
parquet_tools meta /opt/trafodion/bss_userinfo_20180812_0
# Show the schema of the parquet file
parquet_tools schema /opt/trafodion/bss_userinfo_20180812_0
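When the same commands have to be run against many files, a small wrapper can assemble the argument lists. This is only a sketch: it assumes the launcher is on the PATH under the name `parquet_tools` as in the commands above, and `build_command` is a hypothetical helper that covers just the four subcommands and flags shown here.

```python
# Assumption: the launcher is named as in the commands above.
TOOL = "parquet_tools"


def build_command(subcommand, path, column=None, rows=None):
    """Build an argv list mirroring the dump/head/meta/schema commands above."""
    argv = [TOOL, subcommand]
    if subcommand == "dump":
        if column is not None:
            argv += ["-c", column]
        argv.append("-d")  # same flag as in the dump examples above
    elif subcommand == "head":
        if rows is not None:
            argv += ["-n", str(rows)]
    elif subcommand not in ("meta", "schema"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    argv.append(path)
    return argv
```

The returned list can be passed to `subprocess.run(..., check=True)`; using a list rather than a single string avoids shell quoting problems with unusual file names.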
2. The Apache Arrow based parquet-tools
Installation
pip install parquet-tools
Usage
parquet-tools --help
usage: parquet-tools [-h] {show,csv,inspect} ...

parquet CLI tools

positional arguments:
  {show,csv,inspect}
    show              Show human readble format. see `show -h`
    csv               Cat csv style. see `csv -h`
    inspect           Inspect parquet file. see `inspect -h`

optional arguments:
  -h, --help          show this help message and exit
Examples
$ parquet-tools show test.parquet
+-------+-------+---------+
| one | two | three |
|-------+-------+---------|
| -1 | foo | True |
| nan | bar | False |
| 2.5 | baz | True |
+-------+-------+---------+
$ parquet-tools inspect /path/to/parquet
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 2226
############ Columns ############
one
two
three
############ Column(one) ############
name: one
path: one
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(two) ############
name: two
path: two
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(three) ############
name: three
path: three
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
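The file metadata printed above lives in the footer of the Parquet file. As a rough illustration of where it sits, the sketch below uses only the documented file layout: a file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the Thrift-encoded metadata. The function name is hypothetical, and the assumption that this length is what `serialized_size` reports should be checked against your own files. Files with an encrypted footer use a different magic and are rejected here.

```python
import os
import struct

MAGIC = b"PAR1"


def read_footer_length(fileobj):
    """Return the byte length of the file metadata stored in the footer.

    `fileobj` must be a seekable binary file object.
    """
    size = fileobj.seek(0, os.SEEK_END)
    # Smallest possible layout: leading magic + footer length + trailing magic.
    if size < 12:
        raise ValueError("file too small to be a Parquet file")
    fileobj.seek(0)
    head = fileobj.read(4)
    fileobj.seek(size - 8)
    tail = fileobj.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("PAR1 magic bytes not found")
    (length,) = struct.unpack("<I", tail[:4])
    if length > size - 12:
        raise ValueError("footer length exceeds file size")
    return length
```

Opening a file with `open(path, "rb")` and passing the handle to `read_footer_length` is enough; only 12 bytes are read regardless of file size.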
$ parquet-tools csv s3://bucket-name/test.parquet |csvq "select one, three where three"
+-------+-------+
| one | three |
+-------+-------+
| -1.0 | True |
| 2.5 | True |
+-------+-------+
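If csvq is not available, the same filter can be approximated in a few lines of Python. This sketch assumes the `csv` subcommand emits comma-separated text with a header row and that booleans are rendered as `True`/`False`, as the output above suggests; `select_true_rows` is a hypothetical helper, not part of parquet-tools.

```python
import csv
import io


def select_true_rows(csv_text, columns=("one", "three"), flag="three"):
    """Keep rows whose `flag` column reads 'True' and project `columns`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: row[name] for name in columns}
        for row in reader
        if row[flag].strip().lower() == "true"
    ]
```

To use it in a pipeline, read the piped text with `sys.stdin.read()` and pass it to `select_true_rows`. Values stay as strings, so any numeric conversion is left to the caller.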