现有如下数据,主播id、房间号、播出的批次号,每个批次号进出房间的时间戳、分区时间:
accid | room_id | batch_id | dateline | dt |
---|---|---|---|---|
5eb705796a8971881mFJmd | 223 | 1629047963024 | 1629047965592 | 08-18 |
5eb705796a8971881mFJmd | 223 | 1629047963024 | 1629047975336 | 08-18 |
8c15215f733b8e7f1m90uT | 264 | 1629273804000 | 1629273815723 | 08-18 |
8c15215f733b8e7f1m90uT | 264 | 1629273804000 | 1629274407664 | 08-18 |
问题:
每一次直播都有一个上播和下播,每个房间里,同一个批次号会有两条数据,分别记录了上播和下播时间,求每个主播的播出时长?
分析:
通过上面的数据,可以清晰的看出,同一个批次号里面,只要拿后一条时间戳减前一条时间戳就是每个主播的播出时长
我们来看看具体怎么实现:
方法一:
通过row_number()函数,根据主播id、房间号、播出的批次号、分区进行分组,根据时间戳进行排序得到每个主播在每个房间的上播和下播排序数据。拍完序后,怎么拿到后一条时间戳跟前一条时间戳相减呢,那只能自关联,根据排序的序号进行差1关联,取第一条数据。因为我们做了差值关联,只会有第一条数据关联上了后一条时间戳。
查看row_number()函数的语法,可以点击hive分析函数row_number()的应用
spark-sql> with test1 as
> (select '5eb705796a8971881mFJmd' as accid,223 as room_id,1629047963024 as batch_id,1629047965 as dateline,'08-18' as dt
> union all
> select '5eb705796a8971881mFJmd' as accid,223 as room_id,1629047963024 as batch_id,1629047975 as dateline,'08-18' as dt
> union all
> select '8c15215f733b8e7f1m90uT' as accid,264 as room_id,1629273804000 as batch_id,1629273815 as dateline,'08-18' as dt
> union all
> select '8c15215f733b8e7f1m90uT' as accid,264 as room_id,1629273804000 as batch_id,1629274407 as dateline,'08-18' as dt)
> select
> accid,
> room_id,
> batch_id,
> dateline_end-dateline_start as dateline