Parquet Format: Some Issues with Parquet Files

This article analyzes compatibility problems that come up when Spark, Hive, and Impala handle Parquet files, including column-name letter case, timestamp time zones, and column type changes, and gives the corresponding fixes, such as setting specific configuration parameters so that data is read consistently.

No place has more sky than any other place. by Wisława Szymborska

01 Parquet Is Case Sensitive

Since Spark 2.4, when spark.sql.caseSensitive is set to false, Spark does case-insensitive column name resolution between the Hive metastore schema and the Parquet schema, so even if column names are in different letter cases, Spark returns the corresponding column values.
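As a hedged illustration (the database, table, and column names below are hypothetical), with spark.sql.caseSensitive left at false, a lower-case column in the metastore schema still matches an upper-case column in the Parquet files:

```sql
-- Spark SQL session (Spark 2.4+)
SET spark.sql.caseSensitive=false;   -- the default; enables case-insensitive resolution

-- The metastore declares the column as `user_id`, while the Parquet files
-- were written with `USER_ID`; Spark still returns the column's values.
SELECT user_id FROM dw.user_events LIMIT 10;
```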

For an introduction to Parquet, see https://www.infoq.cn/article/in-depth-analysis-of-parquet-column-storage-format/

02 Spark SQL and Hive Produce Different Results

Cause: When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration and is turned on by default. (To read Parquet files faster, Spark by default uses its own Parquet reader rather than the Hive SerDe.)

Solution: set spark.sql.hive.convertMetastoreParquet=false;
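A minimal sketch of the workaround in a Spark SQL session (the table name is hypothetical); with the conversion disabled, Spark reads the table through the Hive SerDe, so the results match Hive:

```sql
-- Spark SQL session
SET spark.sql.hive.convertMetastoreParquet=false;

-- Hypothetical Hive metastore Parquet table, now read via the Hive SerDe path
SELECT count(*) FROM dw.orders;
```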

03 Timestamp Time Zone Issues

Hive has supported the TIMESTAMP type since version 0.8. When Hive writes a timestamp to a Parquet file, it first converts the value to UTC and stores the converted value; when it reads the file back, it converts the UTC value back to local time. So as long as both the write and the read go through Hive, the timestamps come out correct. For plain text files, no conversion happens on either write or read. Spark follows Hive's convention here, so reading is also fine. Impala, however, handles Parquet timestamps differently: it does not convert to UTC on write but stores the local (current time zone) value directly, and it performs no conversion on read either, returning exactly what is in the Parquet file.

04 Impala Correctly Reading Parquet Timestamps Written by Hive

Solution: set convert_legacy_hive_parquet_utc_timestamps=true (default false)

Cause: With this option enabled, Impala checks the Parquet file's metadata when reading. If the file was created by Hive, Impala converts the timestamps on read, so the values it returns match what Hive wrote.

05 Hive Correctly Reading Parquet Timestamps Written by Impala

Solution: set hive.parquet.timestamp.skip.conversion=true

Cause: With this parameter set, Hive likewise inspects the Parquet file's metadata when reading. If the file was created by Impala, Hive skips the timestamp conversion on read.

06 Hive Cannot Query Data After a Column Type Change

Cause: a Hive bug; see https://issues.apache.org/jira/browse/HIVE-6784

Solution: run INSERT OVERWRITE TABLE {table_name} SELECT * FROM {table_name}; rewriting the data converts it to the new column type (see the sketch below).
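A hedged HiveQL sketch (the table and column names are hypothetical, and the table is assumed to be non-partitioned): change the column type in the metastore, then rewrite the data so the Parquet files match the new type:

```sql
-- Change the column type in the Hive metastore (files on disk are untouched)
ALTER TABLE dw.orders CHANGE COLUMN amount amount DECIMAL(18,2);

-- Rewrite the table so the underlying Parquet files use the new type
INSERT OVERWRITE TABLE dw.orders
SELECT * FROM dw.orders;
```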

07 Spark Issue with Hive

Symptom: Failed with exception java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file

Cause: This issue is caused by the different Parquet conventions used by Hive and Spark for the decimal type. Hive writes decimals as fixed-length byte arrays, while in Spark 1.4 and later the default is the standard Parquet representation, in which the underlying physical type (INT32, INT64, or a fixed-length byte array) depends on the column's precision.

Solution: The convention Spark uses to write Parquet data is configurable and is controlled by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to true, Spark writes Parquet data using the same convention as Hive, which resolves the issue: set spark.sql.parquet.writeLegacyFormat=true
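A hedged Spark SQL sketch (the table and column names are hypothetical); the legacy format has to be enabled in the session that writes the data, before Hive reads it:

```sql
-- Spark SQL session: write decimals in the Hive-compatible (legacy) layout
SET spark.sql.parquet.writeLegacyFormat=true;

-- Hypothetical Parquet table with a decimal column, written by Spark
CREATE TABLE dw.orders_parquet STORED AS PARQUET AS
SELECT order_id, CAST(amount AS DECIMAL(18,2)) AS amount
FROM dw.orders_src;
```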

08 Impala Returns Data but Hive Returns None

Cause: Impala resolves Parquet columns by position (ordinal), while Hive resolves them by column name.

Solution: run set parquet.column.index.access=true; before the query.
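A minimal HiveQL sketch of the workaround (the table name is hypothetical); the property switches Hive from by-name to by-position column resolution, matching Impala:

```sql
-- Hive session
SET parquet.column.index.access=true;

-- Hypothetical table whose Parquet column names differ from the Hive schema
SELECT * FROM dw.events LIMIT 10;
```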

09 Hive Cannot Read Decimal Data Written by Spark

Symptom: a Spark 2.1 bug; see https://issues.apache.org/jira/browse/SPARK-20297


Solution: set spark.sql.parquet.writeLegacyFormat=true

010 Map Type: Parquet record is malformed

Symptom: a Hive bug; see https://issues.apache.org/jira/browse/HIVE-11625


Solution: The key field encodes the map's key type; it must have repetition required and must always be present. In other words, map keys written to Parquet must not be null.

011 Hive Metastore Schema Differs from Parquet Schema

Symptom: a Spark bug; see https://jira.apache.org/jira/browse/SPARK-25206

In current Spark 2.3.1, the query below silently returns wrong data:

spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+

Solution: Spark 2.4 fixes this bug. For Spark 2.3 and earlier, setting spark.sql.parquet.filterPushdown=false works around it, as sketched below.
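A hedged Spark SQL sketch of the workaround, reusing the table t from the reproduction above; on Spark 2.3 and earlier, disabling Parquet filter pushdown keeps the filter in Spark, so the query no longer silently returns an empty result:

```sql
-- Spark 2.3 and earlier: work around SPARK-25206
SET spark.sql.parquet.filterPushdown=false;

-- The predicate is now evaluated by Spark rather than pushed into the
-- Parquet reader, so `id > 0` matches the upper-case ID column.
SELECT * FROM t WHERE id > 0;
```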

Generating Parquet files directly requires the Parquet tooling (the parquet-hadoop example API). Below is example Java code that writes a Parquet file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GenerateParquetFile {

    public static void main(String[] args) throws Exception {
        String fileName = "example.parquet";

        // Define the Parquet schema and register it with GroupWriteSupport
        // through the Hadoop Configuration.
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example {\n"
              + "  required int32 id;\n"
              + "  required binary name (UTF8);\n"
              + "  required int32 age;\n"
              + "  required float score;\n"
              + "}");
        Configuration conf = new Configuration();
        GroupWriteSupport.setSchema(schema, conf);

        // Create one record (Group) that follows the schema
        SimpleGroupFactory groupFactory = new SimpleGroupFactory(schema);
        Group group = groupFactory.newGroup()
                .append("id", 1)
                .append("name", "Tom")
                .append("age", 25)
                .append("score", 89.5f);

        // Create a ParquetWriter that writes Snappy-compressed Parquet v2 pages
        Path path = new Path(fileName);
        GroupWriteSupport writeSupport = new GroupWriteSupport();
        ParquetWriter<Group> writer = new ParquetWriter<>(
                path,
                writeSupport,
                CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,
                ParquetWriter.DEFAULT_PAGE_SIZE,   // dictionary page size
                true,                              // enable dictionary encoding
                false,                             // skip record validation
                ParquetProperties.WriterVersion.PARQUET_2_0,
                conf);

        // Write the Group to the Parquet file and close the writer
        writer.write(group);
        writer.close();
    }
}
```

This code uses SimpleGroupFactory to create a Group object and writes it to a Parquet file with a ParquetWriter. Note that the Parquet schema must be registered through GroupWriteSupport before the writer is created.