Parquet Format: Some Issues with Parquet Files

This article analyzes compatibility problems that come up when Spark, Hive, and Impala handle Parquet files, including column-name letter case, timestamp time zones, and column type changes, and gives the corresponding fixes, such as setting specific configuration parameters so that data is read consistently.

No place has more sky than any other place. by Wisława Szymborska

01 Parquet Is Case Sensitive

Since Spark 2.4, when spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
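As a hedged illustration (the database, table, and column names below are hypothetical), with spark.sql.caseSensitive left at false, a lower-case column in the metastore schema still matches an upper-case column in the Parquet files:

```sql
-- Spark SQL session (Spark 2.4+)
SET spark.sql.caseSensitive=false;   -- the default; enables case-insensitive resolution

-- The metastore declares the column as `user_id`, while the Parquet files
-- were written with `USER_ID`; Spark still returns the column's values.
SELECT user_id FROM dw.user_events LIMIT 10;
```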

For an introduction to Parquet, see https://www.infoq.cn/article/in-depth-analysis-of-parquet-column-storage-format/

02 Spark SQL and Hive Produce Different Results

Cause: When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration and is turned on by default. (To read Parquet files faster, Spark by default uses its own Parquet reader rather than the Hive SerDe.)

Solution: set spark.sql.hive.convertMetastoreParquet=false;
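A minimal sketch of the workaround in a Spark SQL session (the table name is hypothetical); with the conversion disabled, Spark reads the table through the Hive SerDe, so the results match Hive:

```sql
-- Spark SQL session
SET spark.sql.hive.convertMetastoreParquet=false;

-- Hypothetical Hive metastore Parquet table, now read via the Hive SerDe path
SELECT count(*) FROM dw.orders;
```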

03 Timestamp Time Zone Issues

Hive has supported the TIMESTAMP type since version 0.8. When Hive writes a timestamp to a Parquet file, it first converts the value to UTC and stores the converted value; when it reads the file back, it converts the UTC value back to local time. So as long as both the write and the read go through Hive, the timestamps come out correct. For plain text files, no conversion happens on either write or read. Spark follows Hive's convention here, so reading is also fine. Impala, however, handles Parquet timestamps differently: it does not convert to UTC on write but stores the local (current time zone) value directly, and it performs no conversion on read either, returning exactly what is in the Parquet file.

04 Impala Correctly Reading Parquet Timestamps Written by Hive

Solution: set convert_legacy_hive_parquet_utc_timestamps=true (default false)

Cause: With this option enabled, Impala checks the Parquet file's metadata when reading. If the file was created by Hive, Impala converts the timestamps on read, so the values it returns match what Hive wrote.

05 Hive Correctly Reading Parquet Timestamps Written by Impala

Solution: set hive.parquet.timestamp.skip.conversion=true

Cause: With this parameter set, Hive likewise inspects the Parquet file's metadata when reading. If the file was created by Impala, Hive skips the timestamp conversion on read.

06 Hive Cannot Query Data After a Column Type Change

Cause: a Hive bug; see https://issues.apache.org/jira/browse/HIVE-6784

Solution: run INSERT OVERWRITE TABLE {table_name} SELECT * FROM {table_name}; rewriting the data converts it to the new column type (see the sketch below).
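A hedged HiveQL sketch (the table and column names are hypothetical, and the table is assumed to be non-partitioned): change the column type in the metastore, then rewrite the data so the Parquet files match the new type:

```sql
-- Change the column type in the Hive metastore (files on disk are untouched)
ALTER TABLE dw.orders CHANGE COLUMN amount amount DECIMAL(18,2);

-- Rewrite the table so the underlying Parquet files use the new type
INSERT OVERWRITE TABLE dw.orders
SELECT * FROM dw.orders;
```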

07 Spark Issue with Hive

Symptom: Failed with exception java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file

Cause: This issue is caused by the different Parquet conventions used by Hive and Spark for the decimal type. Hive writes decimals as fixed-length byte arrays, while in Spark 1.4 and later the default is the standard Parquet representation, in which the underlying physical type (INT32, INT64, or a fixed-length byte array) depends on the column's precision.

Solution: The convention Spark uses to write Parquet data is configurable and is controlled by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to true, Spark writes Parquet data using the same convention as Hive, which resolves the issue: set spark.sql.parquet.writeLegacyFormat=true
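A hedged Spark SQL sketch (the table and column names are hypothetical); the legacy format has to be enabled in the session that writes the data, before Hive reads it:

```sql
-- Spark SQL session: write decimals in the Hive-compatible (legacy) layout
SET spark.sql.parquet.writeLegacyFormat=true;

-- Hypothetical Parquet table with a decimal column, written by Spark
CREATE TABLE dw.orders_parquet STORED AS PARQUET AS
SELECT order_id, CAST(amount AS DECIMAL(18,2)) AS amount
FROM dw.orders_src;
```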

08 Impala Returns Data but Hive Returns None

Cause: Impala resolves Parquet columns by position (ordinal), while Hive resolves them by column name.

Solution: run set parquet.column.index.access=true; before the query.
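A minimal HiveQL sketch of the workaround (the table name is hypothetical); the property switches Hive from by-name to by-position column resolution, matching Impala:

```sql
-- Hive session
SET parquet.column.index.access=true;

-- Hypothetical table whose Parquet column names differ from the Hive schema
SELECT * FROM dw.events LIMIT 10;
```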

09 Hive Cannot Read Decimal Data Written by Spark

Symptom: a Spark 2.1 bug; see https://issues.apache.org/jira/browse/SPARK-20297


Solution: set spark.sql.parquet.writeLegacyFormat=true

010 Map Type: Parquet record is malformed

Symptom: a Hive bug; see https://issues.apache.org/jira/browse/HIVE-11625


Solution: The key field encodes the map's key type; it must have repetition required and must always be present. In other words, map keys written to Parquet must not be null.

011 Hive Metastore Schema Differs from Parquet Schema

Symptom: a Spark bug; see https://jira.apache.org/jira/browse/SPARK-25206

In current Spark 2.3.1, the query below silently returns wrong data:

spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

Solution: Spark 2.4 fixes this bug. For Spark 2.3 and earlier, setting spark.sql.parquet.filterPushdown=false works around it, as sketched below.
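A hedged Spark SQL sketch of the workaround, reusing the table t from the reproduction above; on Spark 2.3 and earlier, disabling Parquet filter pushdown keeps the filter in Spark, so the query no longer silently returns an empty result:

```sql
-- Spark 2.3 and earlier: work around SPARK-25206
SET spark.sql.parquet.filterPushdown=false;

-- The predicate is now evaluated by Spark rather than pushed into the
-- Parquet reader, so `id > 0` matches the upper-case ID column.
SELECT * FROM t WHERE id > 0;
```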

Generating Parquet files directly requires the Parquet tooling (the parquet-hadoop example API). Below is example Java code that writes a Parquet file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GenerateParquetFile {

    public static void main(String[] args) throws Exception {
        String fileName = "example.parquet";

        // Define the Parquet schema and register it with GroupWriteSupport
        // through the Hadoop Configuration.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example {\n"
              + "  required int32 id;\n"
              + "  required binary name (UTF8);\n"
              + "  required int32 age;\n"
              + "  required float score;\n"
              + "}");
        Configuration conf = new Configuration();
        GroupWriteSupport.setSchema(schema, conf);

        // Create one record (Group) that follows the schema
        SimpleGroupFactory groupFactory = new SimpleGroupFactory(schema);
        Group group = groupFactory.newGroup()
                .append("id", 1)
                .append("name", "Tom")
                .append("age", 25)
                .append("score", 89.5f);

        // Create a ParquetWriter that writes Snappy-compressed Parquet v2 pages
        Path path = new Path(fileName);
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        ParquetWriter<Group> writer = new ParquetWriter<>(
                path,
                writeSupport,
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,   // dictionary page size
                true,                              // enable dictionary encoding
                false,                             // skip record validation
                ParquetProperties.WriterVersion.PARQUET_2_0,
                conf);

        // Write the Group to the Parquet file and close the writer
        writer.write(group);
        writer.close();
    }
}
```

This code uses SimpleGroupFactory to create a Group object and writes it to a Parquet file with a ParquetWriter. Note that the Parquet schema must be registered through GroupWriteSupport before the writer is created.