原数据
taskid | createtime | userid | specialistid | senduserid | content |
---|---|---|---|---|---|
3285620445415730000 | 2018-06-26 12:10:46.0 | 3281029200492120000 | 0 | 3281029200492120000 | aaaaaa |
3285620445415730000 | 2018-06-26 13:56:58.0 | 3281029200492120000 | 3275783155638360000 | 3275783155638360000 | bbbbb |
3285620445415730000 | 2018-06-26 13:57:32.0 | 3281029200492120000 | 3275783155638360000 | 3275783155638360000 | |
3285620445415730000 | 2018-06-26 14:06:59.0 | 3281029200492120000 | 3275783155638360000 | 3281029200492120000 | |
3285620445415730000 | 2018-06-26 14:07:41.0 | 3281029200492120000 | 3275783155638360000 | 3281029200492120000 | |
3285620445415730000 | 2018-06-26 14:15:56.0 | 3281029200492120000 | 3275783155638360000 | 3275783155638360000 | |
3285620445415730000 | 2018-06-26 14:16:12.0 | 3281029200492120000 | 3275783155638360000 | 3275783155638360000 | |
3285620445415730000 | 2018-06-26 14:16:31.0 | 3281029200492120000 | 3275783155638360000 | 3275783155638360000 | |
3285620445415730000 | 2018-06-26 14:16:40.0 | 3281029200492120000 | 3275783155638360000 | 3281029200492120000 | |
3285620445415730000 | 2018-06-26 14:17:14.0 | 3281029200492120000 | 3275783155638360000 | 3281029200492120000 |
要求交付的数据结果表如下,请用一个简洁的hql计算出如下数据:
taskid | createtime | userid | specialistid | replytime(回复时间) | response_speed(响应速度) |
---|---|---|---|---|---|
3285620445415730000 | 2018-06-26 12:10:46.0 | 3281029200492120000 | 3275783155638360000 | 2018-06-26 13:56:58.0 | 01:46:12 |
3285620445415730000 | 2018-06-26 14:06:59.0 | 3281029200492120000 | 3275783155638360000 | 2018-06-26 14:15:56.0 | 00:08:57 |
3285620445415730000 | 2018-06-26 14:16:40.0 | 3281029200492120000 | 3275783155638360000 | 2018-06-27 13:43:56.0 | 23:27:16 |
3285641069848950000 | 2018-06-26 14:50:50.0 | 3274133722840170000 | 3275783155638360000 | 2018-06-26 16:45:27.0 | 01:54:37 |
思路分析
- 每一个taskid只对应一个userid,但可以对应多个specialistid;
- 客户(userid)可连续发送n条信息给服务端(specialist),服务端(specialist)也可以连续回复n条信息给客户(userid);
- 客户 => 服务端(×n) , 服务端 => 客户(×n) 为一个 循环,一个taskid中可以有多个循环;
- 需求用户每次发第一次消息的时间与客服回复的第一次时间差 =》 一个循环中的 first(客户 => 服务端) , first(服务端 => 客户);
解题步骤
标记法
- 客户 => 服务端 类型的消息 标记为 , 服务端 => 客户 类型的消息 标记为1
select *,if(senduserid = userid,0,1) check from fact_customer_service_info;
- 为了去除思路分析中第2条的连续发送问题,新增一列match_check,该列为上一行的check,第一行的上一行没有,默认取1
select
t1.* ,
lag(check,1,1) over(distribute by taskid sort by createtime) match_check
from
(select
*,
if(senduserid = userid,0,1) check
from
fact_customer_service_info) t1
order by taskid;
- 去除check与match_check相同的列,这样就可以去除不是第一条的连续发送信息行
select
t2.*
from
(select
t1.* ,
lag(check,1,1) over(distribute by taskid sort by createtime) match_chenck
from
(select
*,
if(senduserid = userid,0,1) check from fact_customer_service_info) t1 ) t2
where t2.check <> t2.match_chenck;
- 然后check列则为0、1、0、1、0 或者 0、1、0、1两种情况,将check为1的行中specialist 和 createtime并到上一行中即可
select
taskid,
if(check = 0 and match_chenck = 1,createtime,0) as createtime,
userid,
if((lead(match_chenck,1) over(distribute by taskid sort by createtime) = 0),(lead(specialistid,1) over(distribute by taskid sort by createtime)),null) as specialistid,
if((lead(match_chenck,1) over(distribute by taskid sort by createtime) = 0),(lead(createtime,1) over(distribute by taskid sort by createtime)),null) as replytime
from
(select
t2.*
from
(select
t1.* ,
lag(check,1,1) over(distribute by taskid sort by createtime) match_chenck
from
(select
*,
if(senduserid = userid,0,1) check from fact_customer_service_info) t1)t2
where t2.check <> t2.match_chenck) t3;
- 这样就完成了合并了一个循环内的客户 => 服务端 类型的消息 和服务端 => 客户 类型的消息,
select
t4.taskid,
t4.createtime,
t4.userid,
t4.specialistid,
t4.replytime,
from_unixtime(unix_timestamp(t4.replytime) - unix_timestamp(t4.createtime) - 8 * 60 * 60, 'dd HH:mm:ss') as respond_speed
from
(select
taskid,
if(check = 0 and match_chenck = 1,createtime,0) as createtime,
userid,
if((lead(match_chenck,1) over(distribute by taskid sort by createtime) = 0),(lead(specialistid,1) over(distribute by taskid sort by createtime)),null) as specialistid,
if((lead(match_chenck,1) over(distribute by taskid sort by createtime) = 0),(lead(createtime,1) over(distribute by taskid sort by createtime)),null) as replytime
from
(select
t2.*
from
(select t1.* ,
lag(check,1,1) over(distribute by taskid sort by createtime) match_chenck
from
(select
*,
if(senduserid = userid,0,1) check from fact_customer_service_info) t1)t2
where t2.check <> t2.match_chenck) t3) t4
where t4.replytime is not null;
核心思想为打标记
- 同一个循环中的客户发送,服务端发送作出不同标记
- 使用lag()函数开窗,行数之间错位后的check和match_check不同则为我们想要的数据
- 最后则是去除相同数据、以及合并之后的null数据
最后需要注意的是时间问题,
from_unixtime() 是从1970年01月01日 00:00:00开始计算增加的秒数,得出年月日,时分秒
由于处于东8区,转换replytime 和 createtime的时间差时,会与格林尼治时间有个8小时的时区差,需减掉8小时的时差才能得出正确时间差