Project scenario:
Scheduled project task: developing the script that loads the dwd-layer event detail table into the dws-layer traffic aggregation table (dwd --> dws). When it was first put into use, everything worked fine. *The test script is as follows:*
export HIVE_HOME=/opt/apps/apache-hive-3.1.2-bin/
# default to yesterday's date; an explicit date can be passed as the first argument
datestr=`date -d'-1 day' +%Y-%m-%d`
if [ -n "$1" ]
then
datestr=$1
fi
${HIVE_HOME}/bin/hive -e "
INSERT INTO TABLE dws.app_pv_agg_session PARTITION(dt='${datestr}')
SELECT
guid,
sessionid as session_id,
min(ts) as start_time,
max(ts) as end_time,
max(first_page) as first_page,
max(last_page) as last_page,
count(1) as pv_cnt,
max(isnew) as isnew,
max(hour) as hour,
max(province) as province,
max(city) as city,
max(district) as district,
max(devicetype) as device_type
FROM
(
SELECT
guid,
sessionid,
ts,
isnew,
first_value(properties['pageid']) over(partition by sessionid order by ts) as first_page,
-- explicit window frame so last_value returns the session's last page, not the current row's
last_value(properties['pageid']) over(partition by sessionid order by ts rows between unbounded preceding and unbounded following) as last_page,
first_value(hour(from_unixtime(cast(ts/1000 as bigint)))) over(partition by sessionid order by ts) as hour,
first_value(province) over(partition by sessionid order by ts) as province,
first_value(city) over(partition by sessionid order by ts) as city,
first_value(district) over(partition by sessionid order by ts) as district,
devicetype
FROM dwd.app_event_dtl
WHERE dt = '${datestr}' AND eventid = 'pageView'
) O
GROUP BY guid,sessionid
"
# report the exit status of the hive command
if [ $? -eq 0 ]
then
echo "execution succeeded, a mail has been sent........."
else
echo "execution failed, a mail has been sent........."
fi
Problem description:
During subsequent data loads, the job kept failing with an error saying the data could not be converted from integer to a byte array.
Cause analysis:
A long round of debugging turned up nothing obviously wrong. The test script was then reworked, and the source data was also queried and filtered to rule out bad records; the data turned out to be fine as well (a sketch of such a check follows the optimized script below). The optimized test script is as follows:
export HIVE_HOME=/opt/apps/apache-hive-3.1.2-bin/
datestr=`date -d'-1 day' +%Y-%m-%d`
if [ -n "$1" ]
then
datestr=$1
fi
${HIVE_HOME}/bin/hive -e "
INSERT INTO TABLE dws.app_pv_agg_session PARTITION (dt = '${datestr}')
SELECT guid,
sessionid as session_id,
min(ts) as start_time,
max(ts) as end_time,
max(split(area,',')[3]) as first_page,
max(last_page) as last_page,
count(1) as pv_cnt,
max(isnew) as isnew,
max(split(area,',')[4]) as hour,
max(split(area,',')[0]) as province,
max(split(area,',')[1]) as city,
max(split(area,',')[2]) as district,
max(devicetype) as device_type
FROM (
SELECT guid,
sessionid,
ts,
isnew,
first_value(area) over (partition by sessionid order by ts) as area,
-- explicit window frame so last_value returns the session's last page, not the current row's
last_value(pageid) over (partition by sessionid order by ts rows between unbounded preceding and unbounded following) as last_page,
devicetype
FROM (
SELECT
guid,
isnew,
properties['pageid'] as pageid,
-- pack session attributes into one comma-separated string:
-- [0]=province, [1]=city, [2]=district, [3]=pageid (left-padded to 5), [4]=hour (left-padded to 2)
concat_ws(',',
    nvl(province,'UNKNOWN'),
    nvl(city,'UNKNOWN'),
    nvl(district,'UNKNOWN'),
    nvl(lpad(properties['pageid'],5,'0'),'00000'),
    nvl(lpad(hour(from_unixtime(cast((ts/1000) as bigint))),2,'0'),'00')
) as area,
sessionid,
devicetype,
ts
FROM dwd.app_event_dtl
WHERE dt = '${datestr}'
AND eventid = 'pageView' AND properties['pageid'] is not null
) o1
) o2
GROUP BY guid, sessionid
"
# report the exit status of the hive command
if [ $? -eq 0 ]
then
echo "execution succeeded, a mail has been sent........."
else
echo "execution failed, a mail has been sent........."
fi
Eventually, going back through the logs and the queries made it clear that this is actually a bug in Hive's vectorized query execution optimization.
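For reference, Hive 2.3+ can report whether and how a query would be vectorized via EXPLAIN VECTORIZATION (SUMMARY is the default detail level). A sketch against the same source table, again with a placeholder date:
set hive.vectorized.execution.enabled = true;
EXPLAIN VECTORIZATION
SELECT guid, sessionid, count(1) as pv_cnt
FROM dwd.app_event_dtl
WHERE dt = '2021-01-01' AND eventid = 'pageView'
GROUP BY guid, sessionid;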
Solution:
There is currently no official fix for this bug, but for the problem it causes in this project, adding a single line is enough to get the data written normally: set hive.vectorized.execution.enabled = false;
The official guidance is the same workaround: disable vectorized execution.
After adding this line to the script, the problem was resolved and the data is written into the dws-layer traffic aggregation table normally.
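A minimal sketch of where the setting can go in the scheduled script so that it only affects this session rather than the whole cluster (same hive -e block as the optimized script above, abbreviated with "..."):
${HIVE_HOME}/bin/hive -e "
-- work around the vectorization bug for this session only; the rest of the INSERT ... SELECT is unchanged
set hive.vectorized.execution.enabled = false;
INSERT INTO TABLE dws.app_pv_agg_session PARTITION (dt = '${datestr}')
SELECT guid,
       sessionid as session_id,
       ...
GROUP BY guid, sessionid
"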