有日志如下,请写出代码
求得所有用户和活跃用户的总数及平均年龄。
(活跃用户指连续两天都有访问记录的用户)日期 用户 年龄
数据集
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
1)建表
create table user_age
(
dt string,
user_id string,
age int
) row format delimited fields terminated by ',';
2)按照日期以及用户分组,按照日期排序并给出排名
select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt)
from user_age
group by dt, user_id;
3)计算日期及排名的差值
with a as (select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt) `rn`
from user_age
group by dt, user_id)
select user_id,
age,
date_sub(dt, rn) `flag`
from a;
+-------+---+----------+
|user_id|age|flag |
+-------+---+----------+
|test_1 |23 |2019-02-10|
|test_1 |23 |2019-02-11|
|test_2 |19 |2019-02-10|
|test_2 |19 |2019-02-10|
|test_2 |19 |2019-02-12|
|test_2 |19 |2019-02-12|
|test_3 |39 |2019-02-10|
+-------+---+----------+
4)过滤出差值大于等于2的,即为连续两天活跃的用户
with a as (select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt) `rn`
from user_age
group by dt, user_id),
b as (select user_id,
age,
date_sub(dt, rn) `flag`
from a)
select user_id, age
from b
group by user_id, age
having count(*) >= 2
;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
+-------+---+
5)对数据进行去重处理(一个用户可以在两个不同的时间点连续登录),例如:a用户在1月10号1月11号以及1月20号和1月21号4天登录。
with a as (select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt) `rn`
from user_age
group by dt, user_id),
b as (select user_id,
age,
date_sub(dt, rn) `flag`
from a),
c as (select user_id, age
from b
group by user_id, age
having count(*) >= 2)
select user_id,
min(age) age
from c
group by user_id;
;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
+-------+---+
6)计算活跃用户(两天连续有访问)的人数以及平均年龄
with a as (select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt) `rn`
from user_age
group by dt, user_id),
b as (select user_id,
age,
date_sub(dt, rn) `flag`
from a),
c as (select user_id, age
from b
group by user_id, age
having count(*) >= 2),
d as (select user_id,
min(age) age
from c
group by user_id)
select count(*),
cast(sum(age) / count(*) as decimal(10, 2))
from d;
+--+-----+
|c0|c1 |
+--+-----+
|2 |21.00|
+--+-----+
7)对全量数据集进行按照用户去重
select user_id,
min(age) age
from user_age
group by user_id;
+-------+---+
|user_id|age|
+-------+---+
|test_1 |23 |
|test_2 |19 |
|test_3 |39 |
+-------+---+
8)计算所有用户的数量以及平均年龄
select count(*) `user_count,`,
cast(sum(age) / count(*) as decimal(10, 2))
from (select user_id,
min(age) age
from user_age
group by user_id) g
;
+-----------+-----+
|user_count,|c1 |
+-----------+-----+
|3 |27.00|
+-----------+-----+
9)将第5步以及第7步两个数据集进行union all操作
10)求和并拼接为最终SQL
with a as (select dt,
user_id,
min(age) age,
rank() over (distribute by user_id sort by dt) `rn`
from user_age
group by dt, user_id),
b as (select user_id,
age,
date_sub(dt, rn) `flag`
from a),
c as (select user_id, age
from b
group by user_id, age
having count(*) >= 2),
d as (select user_id,
min(age) age
from c
group by user_id),
e as (select count(*) `activity_user_count`,
cast(sum(age) / count(*) as decimal(10, 2)) `activity_avg_age`
from d),
f as (select user_id,
min(age) age
from user_age
group by user_id),
g as (select count(*) `twice_count`,
cast(sum(age) / count(*) as decimal(10, 2)) `twice_count_age`
from f),
h as (select activity_user_count,
activity_avg_age,
0 twice_count,
0 twice_count_age
from e
union all
select 0 user_total_count,
0 user_total_avg_age,
twice_count,
twice_count_age
from g)
select sum(activity_user_count) `activity_user_count`,
sum(activity_avg_age) `activity_avg_age`,
sum(twice_count) `twice_count`,
sum(twice_count_age) `twice_count_age`
from h
;
+-------------------+----------------+-----------+---------------+
|activity_user_count|activity_avg_age|twice_count|twice_count_age|
+-------------------+----------------+-----------+---------------+
|2 |21.00 |3 |27.00 |
+-------------------+----------------+-----------+---------------+
总结
思路就是分别求所有用户和活跃用户的总数及平均年龄,之后进行union即可