A Parquet file is not plain text: it is a compressed, binary, columnar storage format. Storing the data column by column and compressing it greatly reduces the file size, but it also means the raw bytes are not directly human-readable. The Spark distribution ships an example Parquet file under this path:
spark-2.0.1-bin-hadoop2.7/examples/src/main/resources
Let's take a look at it:
[root@hadoop001 resources]# cat users.parquet
0red88,lyssaBen,
@ \Hexample.avro.User
%name%
%favorite_color%5favorite_numbers%array<&
nameDH&P
favorite_color<@&P&%(favorite_numbersarray
ZZ&
avro.schema{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_numbers","type":{"type":"array","items":"int"}}]}parquet-mr version 1.4.3耐AR1[root@hadoop001 resources]# XshellXshellXshell
-bash: XshellXshellXshell: command not found
You can see a hint of JSON in there (the Avro schema embedded in the file), but most of the output looks garbled; that is what the binary, compressed Parquet layout looks like when dumped directly. We can use Spark to view it properly:
+------+--------------+----------------+
| name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa| null| [3, 9, 15, 20]|
| Ben| red| []|
+------+--------------+----------------+
Spark reads back the real content with nothing garbled, because it understands the Parquet format: it parses the file's schema and decompresses the column data for us.
w3cschool also gives a detailed introduction to reading Parquet files with Spark (https://www.w3cschool.cn/spark_sql/parquet.html).
Here is the code I used to read the file:
// create an SQLContext from the SparkContext (sc) that spark-shell provides
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// read the example Parquet file into a DataFrame
val df5 = sqlContext.read.parquet("/home/spark-2.0.1-bin-hadoop2.7/examples/src/main/resources/users.parquet")
df5.show()
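For completeness, here is a minimal sketch of the same read done through the SparkSession API that Spark 2.x recommends, plus writing the DataFrame back out as Parquet. The output path /tmp/users_copy.parquet is just an assumed example location, not something shipped with Spark.

// In spark-shell 2.x a SparkSession named `spark` is already available;
// in a standalone program you would build one yourself:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

// Read the same example file through the SparkSession entry point
val users = spark.read.parquet("/home/spark-2.0.1-bin-hadoop2.7/examples/src/main/resources/users.parquet")

// Print the schema Spark recovered from the Parquet file metadata
users.printSchema()

// Write the DataFrame back out as Parquet (assumed example path)
users.write.parquet("/tmp/users_copy.parquet")

printSchema should show the same fields that appear in the embedded Avro schema above: name, favorite_color, and favorite_numbers.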