Hive query errors caused by fields with malformed JSON


1. The problem

A Hive query fails with the following error:

Diagnostic Messages for this Task:
[2020-04-02 05:32:04,360] {bash_operator.py:110} INFO - Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,360] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO -  ]
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO -  ]
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	... 8 more
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - Caused by: java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	... 9 more
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:211)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	... 23 more
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 
[2020-04-02 05:32:04,391] {bash_operator.py:110} INFO - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

2. Tracking down the problem

The error occurs when querying the table dw.table1, which maps raw JSON files on HDFS directly. Its DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS dw.table1${env_suffix}(
  u_timestamp STRING COMMENT 'timestamp, UTC+8',
  bucketlist array<STRING> COMMENT 'A/B experiment strategy names',
  ext map<STRING,STRING> COMMENT 'extensions',
  algInfo map<STRING,STRING> COMMENT 'video distribution strategy',
  u_bigger_json STRING COMMENT 'the final big JSON'
)
COMMENT 'xxx'
PARTITIONED BY(dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION '/rec/yyy${env_suffix}/';
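
For reference, each line in the HDFS files is expected to be one JSON record matching this schema. A well-formed line would look like the following (field values here are hypothetical):

{"u_timestamp":"2020-04-01 12:00:00","bucketlist":["exp_a","exp_b"],"ext":{"os":"android"},"algInfo":{"vfactors":"vitemcf"},"u_bigger_json":"{}"}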

Digging further, the error appears only when the algInfo field is queried. The column is declared as map<STRING,STRING>, and whenever the field's content is not a valid JSON object, the exception below is thrown. Note that "ignore.malformed.json" = "true" does not help here: it only skips lines that fail to parse as JSON altogether, not lines that parse successfully but carry a plain string where the schema expects a JSON object:

java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
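
This is the kind of record that triggers the cast error (hypothetical values): algInfo arrives as a plain string instead of a JSON object, so JsonMapObjectInspector.getMap cannot cast it:

{"u_timestamp":"2020-04-01 12:00:01","bucketlist":[],"ext":{},"algInfo":"some plain string","u_bigger_json":"{}"}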

3. Solution

Changing the data type of the algInfo column in dw.table1 to STRING resolves the query error. Downstream tables, however, may still need this field typed as map<STRING,STRING>; for those, the Hive function str_to_map can convert the STRING column into a map before it is written.
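
A minimal sketch of the schema change, assuming the DDL above (on Hive 1.1+ the CASCADE clause also pushes the new type to existing partitions; adjust to your environment):

-- sketch only: column comment and CASCADE handling are assumptions
ALTER TABLE dw.table1 CHANGE COLUMN algInfo algInfo STRING COMMENT 'video distribution strategy, raw JSON string' CASCADE;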

Downstream table definition:


CREATE EXTERNAL TABLE IF NOT EXISTS edw.table2${env_suffix}(
    -- request info
    u_timestamp TIMESTAMP COMMENT 'timestamp, UTC+8',
    u_bucketlist array<STRING> COMMENT 'A/B experiment strategy names',
    u_ext map<STRING,STRING> COMMENT 'extensions',
    u_algInfo map<STRING,STRING> COMMENT 'video distribution strategy'
)
COMMENT 'xxx'
PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS PARQUET
LOCATION '/dw/edw/table2${env_suffix}'
TBLPROPERTIES ('parquet.compress'='SNAPPY')
;
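
Once loaded this way, downstream queries can use Hive's map access syntax on the column directly, e.g. (key name taken from the sample data below):

SELECT u_algInfo['vfactors']
FROM edw.table2
WHERE dt = '2020-04-01';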

The str_to_map function is defined as follows:

str_to_map(text, delimiter1, delimiter2)
Splits text into key-value pairs using two delimiters.
delimiter1 splits the text into K-V pairs, and delimiter2 splits each K-V pair into key and value.
The default for delimiter1 is ',' and the default for delimiter2 is ':'.
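
A quick illustration of str_to_map (made-up values):

SELECT str_to_map('a:1,b:2');            -- default delimiters -> {"a":"1","b":"2"}
SELECT str_to_map('a=1;b=2', ';', '=');  -- explicit delimiters -> {"a":"1","b":"2"}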

Note that before calling str_to_map, the curly braces, double quotes, commas, and colons of the JSON string must be replaced first: the values themselves can contain commas (see ruletags below), so ',' and ':' cannot be used as delimiters directly; instead, the JSON punctuation is rewritten into unambiguous placeholder delimiters:

Raw data:
{"reasons":"","ruletags":"(264,first_cat),(vitemcf,retrieve)","lasthitrule":"-1","vfactors":"vitemcf,vyoutubednn"}

The correct way to do it:
INSERT OVERWRITE TABLE edw.table2 PARTITION(dt='2020-04-01')
SELECT
    ...
    IF(algInfo IS NOT NULL,
       str_to_map(
           regexp_replace(
               regexp_replace(
                   regexp_replace(
                       regexp_replace(algInfo, '\",\"', '@@@@@'),  -- "," between pairs -> pair delimiter
                       '\":\"', '#####'),                          -- ":" between key and value -> K-V delimiter
                   '\\{\"', ''),                                   -- strip the leading {"
               '\"\\}', ''),                                       -- strip the trailing "}
           '@@@@@', '#####'),
       NULL)
FROM dw.table1
WHERE dt='2020-04-01'
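
For the raw record above, the four regexp_replace calls turn the JSON into:

reasons#####@@@@@ruletags#####(264,first_cat),(vitemcf,retrieve)@@@@@lasthitrule#####-1@@@@@vfactors#####vitemcf,vyoutubednn

and str_to_map then splits on '@@@@@' and '#####', recovering the original four key-value pairs with the commas inside the values left intact.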