No place has more sky than any other. by Wisława Szymborska
01 Parquet is case sensitive
Since 2.4, when spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
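The resolution rule above can be sketched in plain Python. This is an illustrative model only, not Spark's code: the function name resolve_columns and its parameters are hypothetical, and the error on an ambiguous match follows the behavior described in the Spark 2.4 migration notes.

```python
def resolve_columns(metastore_cols, parquet_cols, case_sensitive=False):
    """Map each metastore column name to the matching Parquet field name, or None."""
    if case_sensitive:
        available = set(parquet_cols)
        return {col: (col if col in available else None) for col in metastore_cols}

    # Group Parquet field names by their lowercase form.
    by_lower = {}
    for name in parquet_cols:
        by_lower.setdefault(name.lower(), []).append(name)

    resolved = {}
    for col in metastore_cols:
        matches = by_lower.get(col.lower(), [])
        if len(matches) > 1:
            # More than one Parquet field matches: refuse to guess.
            raise ValueError(f"ambiguous match for {col!r}: {matches}")
        resolved[col] = matches[0] if matches else None
    return resolved


print(resolve_columns(["id", "name"], ["ID", "Name"]))
# {'id': 'ID', 'name': 'Name'}
print(resolve_columns(["id"], ["ID"], case_sensitive=True))
# {'id': None}
```

With case-sensitive matching the metastore column id finds no Parquet field, which is the situation that yields NULL columns.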
For an introduction to Parquet, see https://www.infoq.cn/article/in-depth-analysis-of-parquet-column-storage-format/
02 Spark SQL and Hive return different results
Cause: When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
Solution: set spark.sql.hive.convertMetastoreParquet=false;
03 Timestamp time zone issues
04 Reading Parquet timestamps written by Hive correctly in Impala
Solution: set convert_legacy_hive_parquet_utc_timestamps=true (default false)
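A minimal sketch of why the values differ, assuming Hive normalizes timestamps to UTC when writing Parquet and Impala returns the stored value unchanged unless the conversion is enabled. The UTC+8 zone and the function names hive_write and impala_read are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # assumed cluster time zone (UTC+8)


def hive_write(local_wall_clock):
    """Hive-style write: treat the value as local time and store it as UTC."""
    aware = local_wall_clock.replace(tzinfo=LOCAL_TZ)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


def impala_read(stored, convert_legacy=False):
    """Impala-style read: return the stored value as-is unless conversion is on."""
    if not convert_legacy:
        return stored
    aware = stored.replace(tzinfo=timezone.utc)
    return aware.astimezone(LOCAL_TZ).replace(tzinfo=None)


written = datetime(2020, 1, 1, 12, 0, 0)
stored = hive_write(written)
print(impala_read(stored))                       # 2020-01-01 04:00:00
print(impala_read(stored, convert_legacy=True))  # 2020-01-01 12:00:00
```

Without the conversion the value is off by the local UTC offset, here eight hours.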
05 Reading Parquet timestamps written by Impala correctly in Hive
Solution: set convert_legacy_hive_parquet_utc_timestamps=true (default false)
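06 Data cannot be queried after changing a column type in Hive
Solution: INSERT OVERWRITE TABLE {table_name} SELECT * FROM {table_name}; running this statement rewrites the data with the new type.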
07 Spark issue with Hive
Symptom: Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file
Cause: Hive and Spark use different Parquet conventions. Hive represents the decimal data type as a fixed-length byte array. In Spark releases after 1.4, the default is the standard Parquet representation for decimals, in which the underlying physical type depends on the precision of the column.
Solution: The convention Spark uses to write Parquet data is configurable through the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to true, Spark writes Parquet data using the same convention as Hive, which resolves the issue: set spark.sql.parquet.writeLegacyFormat=true
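The precision-dependent representation can be sketched as follows. This is a simplified model, assuming the standard format stores decimals as INT32 up to precision 9 and INT64 up to precision 18, while the legacy format always uses a fixed-length byte array. The function names are hypothetical.

```python
def min_bytes_for_precision(precision):
    """Smallest signed byte width that can hold any decimal of this precision."""
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


def decimal_physical_type(precision, write_legacy_format):
    """Return the Parquet physical type chosen for DECIMAL(precision, _)."""
    if not write_legacy_format:
        if precision <= 9:
            return "INT32"
        if precision <= 18:
            return "INT64"
    # Legacy format, or precision too large for a 64-bit integer.
    return f"FIXED_LEN_BYTE_ARRAY({min_bytes_for_precision(precision)})"


print(decimal_physical_type(12, write_legacy_format=False))  # INT64
print(decimal_physical_type(12, write_legacy_format=True))   # FIXED_LEN_BYTE_ARRAY(6)
```

A DECIMAL(12, 2) column therefore ends up with different physical types under the two settings, which is where a reader expecting the byte-array form fails.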
08 Impala can query the data, but Hive returns no results
Solution: run set parquet.column.index.access=true; before the query.
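One plausible reading of this setting, stated here as an assumption: Hive normally matches table columns to Parquet fields by name, and parquet.column.index.access switches it to matching by position. The sketch below models that difference; read_row and its parameters are hypothetical.

```python
def read_row(file_columns, file_row, table_columns, index_access=False):
    """Project one Parquet row onto the table's columns, by position or by name."""
    if index_access:
        # Positional access: the i-th table column reads the i-th file field.
        return {
            col: (file_row[i] if i < len(file_row) else None)
            for i, col in enumerate(table_columns)
        }
    # Name-based access: fields missing from the file come back as NULL.
    by_name = dict(zip(file_columns, file_row))
    return {col: by_name.get(col) for col in table_columns}


file_columns = ["user_id", "amount"]   # names stored in the Parquet file
table_columns = ["uid", "amount"]      # names declared in the Hive table
row = [42, 9.5]

print(read_row(file_columns, row, table_columns))
# {'uid': None, 'amount': 9.5}
print(read_row(file_columns, row, table_columns, index_access=True))
# {'uid': 42, 'amount': 9.5}
```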
09 Hive cannot read Decimal data written by Spark
Symptom: Spark 2.1 bug https://issues.apache.org/jira/browse/SPARK-20297
Solution: set spark.sql.parquet.writeLegacyFormat=true
010 Map type: Parquet record is malformed
Symptom: Hive bug https://issues.apache.org/jira/browse/HIVE-11625
Solution: The key field encodes the map's key type. This field must have repetition required and must always be present. Map keys written to Parquet must not be null.
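Since map keys must not be null, one way to apply this rule is to drop null keys before the data is written. A minimal sketch in plain Python, where drop_null_keys is a hypothetical helper and a Python dict stands in for a map column value:

```python
def drop_null_keys(mapping):
    """Return a copy of the map without the entry whose key is None."""
    return {key: value for key, value in mapping.items() if key is not None}


record = {None: 1, "a": 2, "b": None}
print(drop_null_keys(record))  # {'a': 2, 'b': None}
```

Null values are kept; only null keys violate the rule.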
011 Hive metastore schema differs from the Parquet schema
Symptom: Spark bug https://jira.apache.org/jira/browse/SPARK-25206
In current Spark 2.3.1, the query below silently returns wrong data.
    spark.range(10).write.parquet("/tmp/data")
    sql("DROP TABLE t")
    sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
    scala> sql("select * from t where id > 0").show
    +---+
    | ID|
    +---+
    +---+
Solution: Spark 2.4 fixes this bug. On Spark 2.3 and earlier, setting spark.sql.parquet.filterPushdown=false works around it.
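A simplified model of why disabling pushdown helps, under the assumption that the pushed-down filter looks up the column by its exact name and treats a missing column as NULL, while the engine-side filter resolves names case-insensitively. Both function names are hypothetical.

```python
def pushed_down_scan(rows, column, predicate):
    """Reader-side filter: exact-name lookup, a missing column behaves like NULL."""
    kept = []
    for row in rows:
        value = row.get(column)  # "id" is not a key when the file says "ID"
        if value is not None and predicate(value):
            kept.append(row)
    return kept


def engine_side_filter(rows, column, predicate):
    """Engine-side filter: resolve the column case-insensitively, then filter."""
    kept = []
    for row in rows:
        lowered = {name.lower(): value for name, value in row.items()}
        value = lowered.get(column.lower())
        if value is not None and predicate(value):
            kept.append(row)
    return kept


rows = [{"ID": i} for i in range(10)]  # file written with column "ID"
print(len(pushed_down_scan(rows, "id", lambda v: v > 0)))    # 0
print(len(engine_side_filter(rows, "id", lambda v: v > 0)))  # 9
```

In this model the pushed-down filter discards every row before the engine sees it, which matches the empty result shown in the symptom.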