What problems arise with complex data types such as map, array, and struct when using Spark to write Parquet data into Hive?
To better explain the cause, the symptoms, and the solution, first consider the following example:
Create a non-partitioned Hive table stored as Parquet:
CREATE EXTERNAL TABLE `t1`(
`id` STRING,
`map_col` MAP<STRING, STRING>,
`arr_col` ARRAY<STRING>,
`struct_col` STRUCT<A:STRING,B:STRING>)
STORED AS PARQUET
LOCATION '/home/spark/test/tmp/t1';
Create a partitioned Hive table stored as Parquet:
CREATE EXTERNAL TABLE `t2`(
`id` STRING,
`map_col` MAP<STRING, STRING>,
`arr_col` ARRAY<STRING>,
`struct_col` STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (`dt` STRING)
STORED AS PARQUET
LOCATION '/home/spark/test/tmp/t2';
Run an insert into statement against t1 and t2 in turn (insert overwrite ... select triggers the same problem), storing an empty map in the map_col column in both cases:
insert into table t1 values(1,map(),array('1,1,1'),named_struct('A','1','B','1'));
insert into table t2 partition(dt='20200101')
values(1,map(),array('1,1,1'),named_struct('A','1','B','1'));
The insert into t1 succeeds, but running the insert statement above against t2 fails with the following exception:
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
at parquet.io.MessageColumnIO$MessageColumnIORecordCon