1.题目
events表 有四个字段 event_date,日期类型,代表事件发生的日期; event_name,字符串类型,代表事件名称;event_timestamp,时间戳;user_id,用户id;app_id,app的id;
样本数据如下:
当一个用户发生app_remove(即为卸载)前的两个动作叫做卸载路径,如路径:
user_engagement àscreen_view àapp_remove
对于app2来说,需要输出不同卸载路径的发生次数:
输出格式:
2.解法
初看比较复杂,可以采用在本地ide中逐步调试的方法找到最终结果;
首先,先限定app_id=’app2’ 查看用户的卸载行为,即event_name=’app_remove’
SELECT *
FROM `events`
WHERE app_id='app2' AND event_name='app_remove'
结果如下:
其次,需要找到同一个用户id,在发生卸载行为前的两个行为。
这一步看起来没有什么思路,但是可以仔细分析下,同一个用户,发生卸载行为前两个行为怎么找到呢?我们看另外一个字段,就是时间戳,也就是时间戳小于卸载的时间戳同时最近的;看到这里想到了什么?就是分类和排序,我们借助窗口函数partition by 用户id order by 时间戳
新增一列,
row_number() over(partition by user_id ORDER BY event_timestamp ) as r,
完整查询语句如下:
SELECT row_number() over(partition by user_id ORDER BY event_timestamp ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2'
结果如下:
可以看到这里,除了筛选出app2外,表格的行数没有发生改变,但是新增了一列序号r,代表该行数据的用户发生event的顺序。
从结果也可以看到不是所有的用户最后一步都是app_remove,所以我们需要再聚焦到只发生了卸载行为的用户,那么怎么找到发生卸载行为的用户呢?
SELECT DISTINCT user_id
FROM `events`
WHERE app_id='app2' and event_name='app_remove'
结果如下
只要再限制窗口函数排过序后的user_id 在这个范围内即可。
SELECT row_number() over(partition by user_id ORDER BY event_timestamp ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')
结果如下
至此怎么找到用户卸载行为前的两个行为呢?
目前app_remove对应的r是同一用户最大的,不妨将前面的窗口函数排序颠倒一下。
row_number() over(partition by user_id ORDER BY event_timestamp desc) as r,
即为
SELECT row_number() over(partition by user_id ORDER BY event_timestamp DESC ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')
至此呢,r=1,即为卸载行为,r=2即为卸载前的行为,r=3即为卸载前两步行为。
终于顺利找到了卸载路径,但是题目输出的格式为:
最后一列是发生次数,先不考虑最后一列,仅考虑前三列,怎样将卸载路径像输出格式一样输出呢?这里需要涉及到行列转换,而行列转换需要用到 case when。
行转列
SELECT event_date,
(case when r=1 then event_name end ) AS s1,
(case when r=2 then event_name end ) AS s2,
(case when r=3 then event_name end ) AS s3,
event_timestamp,
user_id,
app_id
FROM
(SELECT row_number() over(partition by user_id ORDER BY event_timestamp DESC ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')) as tK
结果
从结果来看,已经完成了行转列,但是行数和上一步保持一样,存在很多null,而我们期望同一个用户id能聚合到同一行中。
首先 想到能不能用group by呢?group by可以按照用户id聚合,但是聚合后group by前面的select就无法直接选到s1,s2,s3了,只能选到用户id,和做一些聚合函数,count,sum等。
但是,我们可以想到多行聚合时,有值和null做max聚合可以把值取出来
那么,我们可以先GROUP BY user_id;然后通过 max聚合函数将s1,s2,s3 聚合到一行中。
SELECT
max(case when r=1 then event_name end ) AS s1,
max(case when r=2 then event_name end ) AS s2,
max(case when r=3 then event_name end ) AS s3,
user_id,
app_id
FROM
(SELECT row_number() over(partition by user_id ORDER BY event_timestamp DESC ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')) as tK
GROUP BY user_id;
结果如下:
这里和要求的输出有细微差别,要求输出第三列放app_remove,颠倒一下列的顺序即可。
SELECT
max(case when r=3 then event_name end ) AS s1,
max(case when r=2 then event_name end ) AS s2,
max(case when r=1 then event_name end ) AS s3,
user_id,
app_id
FROM
(SELECT row_number() over(partition by user_id ORDER BY event_timestamp DESC ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')) as tK
GROUP BY user_id;
最后进行次数统计
select s1,s2,s3,count(*)
from
(SELECT
max(case when r=3 then event_name end ) AS s1,
max(case when r=2 then event_name end ) AS s2,
max(case when r=1 then event_name end ) AS s3,
user_id,
app_id
FROM
(SELECT row_number() over(partition by user_id ORDER BY event_timestamp DESC ) as r,
event_date,
event_name,
event_timestamp,
user_id,
app_id
FROM `events`
WHERE app_id='app2' and user_id in (SELECT DISTINCT user_id FROM `events` WHERE app_id='app2' and event_name='app_remove')) as tK
GROUP BY user_id) as t_ready
GROUP BY s1,s2,s3;