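The business assurance department needed to use Hive to compute the previous hour's data in near real time: for example, at 12:00 the job has to process the 11:00 data, and the results have to be ready within one hour. When they implemented it in Hive, however, a single map task alone ran for more than an hour, so the requirement could not be met. They later called and asked me to help optimize it. The optimization process was as follows: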
1. HQL statement:
CREATE TABLE weibo_mobile_nginx AS SELECT
  split(split(log, '`')[0], '\\|')[0] HOST,
  split(split(log, '`')[0], '\\|')[1] time,
  substr(
    split(
      split(split(log, '`')[2], '\\?')[0], ' '
    )[0], 2
  ) request_type,
  split(
    split(split(log, '`')[2], '\\?')[0], ' '
  )[1] interface,
  regexp_extract(
    log,
    '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__[^&]*',
    3
  ) version,
  regexp_extract(
    log,
    '.*&ua=[^_]*__([^_]*)__([^_]*)__([^_]*)__.*',
    1
  ) systerm,
  regexp_extract(
    log,
    '.*&networktype=([^&%]*).*',
    1
  ) net_type,
  split(log, '`')[4] STATUS,
  split(log, '`')[5] client_ip,
  split(log, '`')[6] uid,
  split(log, '`')[8] request_time,
  split(log, '`')[12] request_uid,
  split(log, '`')[13] http_host,
  split(log, '`')[15] upstream_response_time,
  split(log, '`')[16] idc
FROM
ods_wls_wap_base_orig
WHERE
dt = '20150311