1. Problem
A Hive query fails with the following error:
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
    at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
]
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
    at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
    at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
    at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
    at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
]
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    ... 8 more
Caused by: java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
    at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
    ... 9 more
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
    at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:211)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
    at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
    ... 23 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
2. Tracking Down the Problem
The error occurs when querying the table dw.table1, an external table that reads JSON-formatted HDFS files directly. Its DDL:
CREATE EXTERNAL TABLE IF NOT EXISTS dw.table1${env_suffix}(
    u_timestamp STRING COMMENT 'UTC+8 timestamp',
    bucketlist array<STRING> COMMENT 'A/B experiment strategy names',
    ext map<STRING,STRING> COMMENT 'extension fields',
    algInfo map<STRING,STRING> COMMENT 'video distribution strategy',
    u_bigger_json STRING COMMENT 'the final big JSON blob'
)
COMMENT 'xxx'
PARTITIONED BY(dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION '/rec/yyy${env_suffix}/';
Further digging showed that the error only appears when the algInfo column is queried. The column is typed map<STRING,STRING>, and whenever a row's content is not valid JSON the query throws:
java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
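To illustrate the failure mode (the rows below are hypothetical samples, not taken from the actual table): the map<STRING,STRING> object inspector expects the SerDe to hand it a parsed JSON object, so a row whose algInfo value is a JSON string rather than a JSON object triggers the ClassCastException.

```sql
-- Hypothetical raw lines in the table's HDFS files.
-- Note that "ignore.malformed.json" only skips lines that fail to parse as
-- JSON at all; it does not help when a field parses as the wrong type.
--
-- OK: algInfo is a JSON object, so JsonMapObjectInspector.getMap() receives
-- a JSONObject as expected:
--   {"u_timestamp":"2020-04-01 12:00:00","algInfo":{"vfactors":"vitemcf"}}
--
-- FAILS: algInfo is a JSON *string* (escaped JSON inside quotes), so
-- getMap() receives a java.lang.String and the cast blows up:
--   {"u_timestamp":"2020-04-01 12:00:00","algInfo":"{\"vfactors\":\"vitemcf\"}"}
```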
3. Solution
Changing the data type of the algInfo column in dw.table1 to STRING resolves the query failure. Downstream tables, however, still need this column typed as map<STRING,STRING>; that can be handled by converting the STRING column to a map with Hive's str_to_map function when writing to them.
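The schema change itself can be sketched as follows (a minimal sketch using the table and column names from the DDL above; since this is an external table, only the metastore definition changes, not the underlying files):

```sql
-- Retype algInfo from map<string,string> to string.
-- Column comment translated from the original DDL.
ALTER TABLE dw.table1 CHANGE COLUMN algInfo algInfo STRING COMMENT 'video distribution strategy';

-- Depending on the Hive version, the metastore may reject a map->string
-- retype as an incompatible column type change; if so, this setting may
-- need to be relaxed first:
-- SET hive.metastore.disallow.incompatible.col.type.changes=false;
```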
Downstream table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS edw.table2${env_suffix}(
    -- request info
    u_timestamp TIMESTAMP COMMENT 'UTC+8 timestamp',
    u_bucketlist array<STRING> COMMENT 'A/B experiment strategy names',
    u_ext map<STRING,STRING> COMMENT 'extension fields',
    u_algInfo map<STRING,STRING> COMMENT 'video distribution strategy'
)
COMMENT 'xxx'
PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS PARQUET
LOCATION '/dw/edw/table2${env_suffix}'
TBLPROPERTIES ('parquet.compress'='SNAPPY')
;
The str_to_map function is defined as:
str_to_map(text, delimiter1, delimiter2)
It splits the text into key-value pairs using the two delimiters: delimiter1 separates the text into K-V pairs, and delimiter2 splits each K-V pair into key and value.
The default for delimiter1 is ',' and the default for delimiter2 is ':'.
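For example, with the default delimiters and with explicit ones:

```sql
-- Default delimiters (',' between pairs, ':' inside a pair):
SELECT str_to_map('k1:v1,k2:v2');            -- map with k1 -> v1, k2 -> v2
-- Explicit delimiters:
SELECT str_to_map('k1=v1;k2=v2', ';', '=');  -- same map, different separators
```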
Note that before applying str_to_map, the curly braces, double quotes, commas, and colons that the JSON string itself uses must be replaced, since the values also contain commas and colons that would otherwise be mistaken for delimiters:
Original data:
{"reasons":"","ruletags":"(264,first_cat),(vitemcf,retrieve)","lasthitrule":"-1","vfactors":"vitemcf,vyoutubednn"}
The correct way to do it:
INSERT OVERWRITE TABLE edw.table2 PARTITION(dt='2020-04-01')
SELECT
    ...
    if(algInfo is not null, str_to_map(regexp_replace(regexp_replace(regexp_replace(regexp_replace(algInfo,'\",\"','@@@@@'),'\":\"','#####'),'\\{\"',''),'\"\\}',''), '@@@@@', '#####'), null)
FROM dw.table1
WHERE dt='2020-04-01'
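Tracing the sample record above through the four regexp_replace calls shows why this works (my own walk-through; @@@@@ and ##### are simply the sentinel delimiters chosen in the query):

```sql
-- Input:
--   {"reasons":"","ruletags":"(264,first_cat),(vitemcf,retrieve)","lasthitrule":"-1","vfactors":"vitemcf,vyoutubednn"}
-- Step 1: '\",\"' -> '@@@@@'   replaces every  ","  pair separator
-- Step 2: '\":\"' -> '#####'   replaces every  ":"  key/value separator
-- Steps 3-4: strip the leading {" and trailing "}
-- Result:
--   reasons#####@@@@@ruletags#####(264,first_cat),(vitemcf,retrieve)@@@@@lasthitrule#####-1@@@@@vfactors#####vitemcf,vyoutubednn
-- str_to_map(..., '@@@@@', '#####') then yields the keys reasons, ruletags,
-- lasthitrule, and vfactors; the commas and colons embedded in the values
-- survive because they no longer collide with the map delimiters.
```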