hql统计连续登录三天及以上的用户
数据提供
用户ID、登入日期
user01,2018-02-28
user01,2018-03-01
user01,2018-03-02
user01,2018-03-04
user01,2018-03-05
user01,2018-03-06
user01,2018-03-07
user02,2018-03-01
user02,2018-03-02
user02,2018-03-03
user02,2018-03-06
输出字段
user_id | times | start_date | end_date
创建表
create table lxdl(
user_id string comment '用户ID'
,login_date date comment '登入日期'
)
row format delimited fields terminated by ',';
加载数据
load data local inpath '/opt/module/datas/dsjsfc/user.txt' into table lxdl;
验证数据
select * from lxdl;
+---------------+------------------+--+
| lxdl.user_id | lxdl.login_date |
+---------------+------------------+--+
| user01 | 2018-02-28 |
| user01 | 2018-03-01 |
| user01 | 2018-03-02 |
| user01 | 2018-03-04 |
| user01 | 2018-03-05 |
| user01 | 2018-03-06 |
| user01 | 2018-03-07 |
| user02 | 2018-03-01 |
| user02 | 2018-03-02 |
| user02 | 2018-03-03 |
| user02 | 2018-03-06 |
+---------------+------------------+--+
思路
1.根据user_id分区,日期排序
select
user_id,login_date,
row_number() over(partition by user_id order by login_date) as rn
from
lxdl;
输出
+-------------+-------------+-------+--+
| user_id | login_date | rn |
+-------------+-------------+-------+--+
| user01 | 2018-02-28 | 1 |
| user01 | 2018-03-01 | 2 |
| user01 | 2018-03-02 | 3 |
| user01 | 2018-03-04 | 4 |
| user01 | 2018-03-05 | 5 |
| user01 | 2018-03-06 | 6 |
| user01 | 2018-03-07 | 7 |
| user02 | 2018-03-01 | 1 |
| user02 | 2018-03-02 | 2 |
| user02 | 2018-03-03 | 3 |
| user02 | 2018-03-06 | 4 |
+-------------+-------------+-------+--+
2.用日期与rn想减,得到的差值如果相等,则说明日期是连续的
select
user_id,login_date,
date_sub(login_date,rn) as date_diff
from
(
select
user_id,login_date,
row_number() over(partition by user_id order by login_date) as rn
from
lxdl
) t1;
输出
+-------------+-------------+-------------+--+
| user_id | login_date | date_diff |
+-------------+-------------+-------------+--+
| user01 | 2018-02-28 | 2018-02-27 |
| user01 | 2018-03-01 | 2018-02-27 |
| user01 | 2018-03-02 | 2018-02-27 |
| user01 | 2018-03-04 | 2018-02-28 |
| user01 | 2018-03-05 | 2018-02-28 |
| user01 | 2018-03-06 | 2018-02-28 |
| user01 | 2018-03-07 | 2018-02-28 |
| user02 | 2018-03-01 | 2018-02-28 |
| user02 | 2018-03-02 | 2018-02-28 |
| user02 | 2018-03-03 | 2018-02-28 |
| user02 | 2018-03-06 | 2018-03-02 |
+-------------+-------------+-------------+--+
3.根据user_id和date_diff分组,最小登录日期即为此次连续登录的开始日期start_date,最大登录日期即为结束日期end_date,登录次数即为分组后的count(1)
select
user_id,count(1) as times,
min(login_date) as start_date,
max(login_date) as end_date
from
(
select
user_id,login_date,
date_sub(login_date,rn) as date_diff
from
(
select
user_id,login_date,
row_number() over(partition by user_id order by login_date) as rn
from
lxdl)
t1) t2
group by user_id,date_diff
having times >=3;
输出
+-----------+-----------+----------------+--------------+--+
| user_id | times | start_date | end_date |
+-----------+-----------+----------------+--------------+--+
| user01 | 3 | 2018-02-28 | 2018-03-02 |
| user01 | 4 | 2018-03-04 | 2018-03-07 |
| user02 | 3 | 2018-03-01 | 2018-03-03 |
+-----------+-----------+----------------+--------------+--+