Hive通过炸裂函数实现按小时差X小时炸出X行数据
输入数据如下:
期待sql处理后的结果:
Hive SQL:
SELECT
uid,
start_date as start_date,
next_mid_date as end_date
from
(
SELECT
uid,
start_date,
end_date,
mid_date,
lead(mid_date,1,end_date) over (partition by uid,start_date order by pos) as next_mid_date,
pos
FROM
(
SELECT
uid,
start_date,
end_date,
mid_date,
pos
FROM
(
select
tmp.*,
t.*,
from_unixtime(unix_timestamp(start_date)+pos*3600,'yyyy-MM-dd HH:00:00') as mid_date
from(
select
4001 as uid
,'2023-12-17 07:19:36' as start_date
,'2023-12-17 07:23:10' as end_date
UNION ALL
select
4001 as uid
,'2023-12-17 07:32:30' as start_date
,'2023-12-17 09:24:56' as end_date
UNION ALL
select
4001 as uid
,'2023-12-17 09:25:57' as start_date
,'2023-12-17 09:49:49' as end_date
UNION ALL
select
4002 as uid
,'2023-12-18 19:25:57' as start_date
,'2023-12-18 20:49:49' as end_date
UNION ALL
select
4002 as uid
,'2023-12-18 21:25:57' as start_date
,'2023-12-18 23:49:49' as end_date
)tmp
lateral view posexplode( split( space( ceil( (unix_timestamp(end_date)-unix_timestamp(start_date))/3600 )), '' ) ) t as pos, val
)
where mid_date < end_date
)
)
where pos = 0
union all
SELECT
uid,
mid_date as start_date,
next_mid_date as end_date
from
(
SELECT
uid,
start_date,
end_date,
mid_date,
lead(mid_date,1,end_date) over (partition by uid,start_date order by pos) as next_mid_date,
pos
FROM
(
SELECT
uid,
start_date,
end_date,
mid_date,
pos
FROM
(
select
tmp.*,
t.*,
from_unixtime(unix_timestamp(start_date)+pos*3600,'yyyy-MM-dd HH:00:00') as mid_date
from(
select
4001 as uid
,'2023-12-17 07:19:36' as start_date
,'2023-12-17 07:23:10' as end_date
UNION ALL
select
4001 as uid
,'2023-12-17 07:32:30' as start_date
,'2023-12-17 09:24:56' as end_date
UNION ALL
select
4001 as uid
,'2023-12-17 09:25:57' as start_date
,'2023-12-17 09:49:49' as end_date
UNION ALL
select
4002 as uid
,'2023-12-18 19:25:57' as start_date
,'2023-12-18 20:49:49' as end_date
UNION ALL
select
4002 as uid
,'2023-12-18 21:25:57' as start_date
,'2023-12-18 23:49:49' as end_date
)tmp
lateral view posexplode( split( space( ceil( (unix_timestamp(end_date)-unix_timestamp(start_date))/3600 )), '' ) ) t as pos, val
)
where mid_date < end_date
)
)
where pos != 0
1.unix_timestamp(date): 这是一个函数,它返回一个给定日期的UNIX时间戳。UNIX时间戳是从1970年1月1日开始的秒数。
2.unix_timestamp(end_date) - unix_timestamp(start_date): 这部分代码计算两个日期之间的秒数差。
3.ceil(…/3600): 这部分代码将上述计算的秒数差除以3600(一个小时的秒数)。ceil函数确保结果是一个整数,即使结果是3601秒,它也会四舍五入为1小时。
4.space(…, ‘’): 这个函数用于生成一个字符串,该字符串包含由两个参数确定的空格。在这里,第一个参数是上一步计算的结果(小时数),第二个参数是一个空字符串。因此,它将生成一个由多个空格字符组成的字符串,长度等于小时数。
5.split(…, ’ '): 这个函数将上述生成的空格字符串分割成一个数组。每个空格代表一个小时,所以分割后的数组包含从开始时间到结束时间的每个小时。
6.posexplode(…): 这是一个Hive的特殊函数,它对数组进行操作,并为数组中的每个元素生成两个字段:一个是索引(在这里是小时的序号),另一个是值(在这里是小时的名称)。
7.lateral view: 这是Hive SQL的一个关键字,用于与posexplode函数一起使用。它允许我们将posexplode的结果与原始表中的行相关联。
8.t as pos, val: 这部分代码为posexplode的结果定义了别名。t是索引字段的别名,pos是小时的序号,val是小时的名称。