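Requirement: transform the table below into the grouped form shown after it. First group the rows by user;
then, within each user's rows, close a group every time the name field hits an e*.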
user_id,name
u1,e1
u1,e1
u1,e*
u1,e2
u1,e3
u1,e*
u2,e1
u2,e2
u2,e*
u2,e1
u2,e3
u2,e*
u2,e*
Given the user behavior records above, each e* closes the current group, which gives the following result:
u1, [e1,e1,e*]
u1, [e2,e3,e*]
u2, [e1,e2,e*]
u2, [e1,e3,e*]
u2, [e*]
-- Table creation statements
drop table if exists window_test;
create table window_test
(
user_id varchar(10),
name string
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 1;
-- To preserve the original row order, insert the rows one at a time
insert into window_test values ('u1','e1');
insert into window_test values ('u1','e1');
insert into window_test values ('u1','e*');
insert into window_test values ('u1','e2');
insert into window_test values ('u1','e3');
insert into window_test values ('u1','e*');
insert into window_test values ('u2','e1');
insert into window_test values ('u2','e2');
insert into window_test values ('u2','e*');
insert into window_test values ('u2','e1');
insert into window_test values ('u2','e3');
insert into window_test values ('u2','e*');
insert into window_test values ('u2','e*');
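The table has no column that records arrival order, so the order the later queries see depends on how the engine happens to return the rows. As a hedged sketch in Python (not part of the original SQL; the seq field and the helper name attach_seq are hypothetical), tagging every record with a sequence number at load time makes the original order recoverable by sorting:

```python
import random

# Same records as the insert statements above, in their original order.
records = [
    ("u1", "e1"), ("u1", "e1"), ("u1", "e*"),
    ("u1", "e2"), ("u1", "e3"), ("u1", "e*"),
    ("u2", "e1"), ("u2", "e2"), ("u2", "e*"),
    ("u2", "e1"), ("u2", "e3"), ("u2", "e*"), ("u2", "e*"),
]

def attach_seq(rows):
    """Prefix each (user_id, name) row with a 1-based sequence number."""
    return [(seq, user_id, name)
            for seq, (user_id, name) in enumerate(rows, start=1)]

tagged = attach_seq(records)

# Simulate storage handing the rows back in an arbitrary order.
scrambled = tagged[:]
random.Random(42).shuffle(scrambled)

# Sorting on the hypothetical seq field restores the original order.
restored = [(user_id, name) for _, user_id, name in sorted(scrambled)]
```

In SQL terms this would mean adding such a column to window_test and ordering the window by it; that schema change is an assumption, not something the original table has.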
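Logic analysis:
1. What matters is whether the value in the name field is e*, so convert it into a flag of 1 or 0.
2. Partition by user and number the rows (note: the numbering must follow the original order of the data). The queries below skip the explicit row number and order by user_id alone, which ties on every row inside a partition, so they depend on the engine returning rows in insertion order.
3. Open a window and sum flag from the first row to the current row.
4. Subtract the original flag from that running sum to get a new flag label.
5. Group by user and the new label, then collect the names.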
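The steps above can be sketched outside SQL as well. The following is a minimal Python illustration, assuming the rows arrive in their original order; the function name group_by_marker is hypothetical, and the variable names simply mirror the flag, running sum, and difference described in the steps:

```python
from collections import defaultdict

rows = [
    ("u1", "e1"), ("u1", "e1"), ("u1", "e*"),
    ("u1", "e2"), ("u1", "e3"), ("u1", "e*"),
    ("u2", "e1"), ("u2", "e2"), ("u2", "e*"),
    ("u2", "e1"), ("u2", "e3"), ("u2", "e*"), ("u2", "e*"),
]

def group_by_marker(rows, marker="e*"):
    """Split each user's names into groups that end at every marker."""
    running = defaultdict(int)  # per-user running sum of flag (step 3)
    groups = {}                 # (user_id, diff_flag) -> list of names
    for user_id, name in rows:
        flag = 1 if name == marker else 0          # step 1
        running[user_id] += flag                   # step 3
        diff_flag = running[user_id] - flag        # step 4
        groups.setdefault((user_id, diff_flag), []).append(name)  # step 5
    # dicts keep insertion order, so groups come out in first-seen order
    return [(user_id, names) for (user_id, _), names in groups.items()]

result = group_by_marker(rows)
for user_id, names in result:
    print(user_id, names)
```

Subtracting flag is what keeps the closing e* in the group it ends: the running sum has already moved on at that row, and the subtraction pulls it back by one.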
-- Add the flag
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test;
+---------+------+------+
| user_id | name | flag |
+---------+------+------+
| u1      | e1   |    0 |
| u1      | e1   |    0 |
| u1      | e*   |    1 |
| u1      | e2   |    0 |
| u1      | e3   |    0 |
| u1      | e*   |    1 |
| u2      | e1   |    0 |
| u2      | e2   |    0 |
| u2      | e*   |    1 |
| u2      | e1   |    0 |
| u2      | e3   |    0 |
| u2      | e*   |    1 |
| u2      | e*   |    1 |
+---------+------+------+
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) as sum_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t;
+---------+------+------+----------+
| user_id | name | flag | sum_flag |
+---------+------+------+----------+
| u1      | e1   |    0 |        0 |
| u1      | e1   |    0 |        0 |
| u1      | e*   |    1 |        1 |
| u1      | e2   |    0 |        1 |
| u1      | e3   |    0 |        1 |
| u1      | e*   |    1 |        2 |
| u2      | e1   |    0 |        0 |
| u2      | e2   |    0 |        0 |
| u2      | e*   |    1 |        1 |
| u2      | e1   |    0 |        1 |
| u2      | e3   |    0 |        1 |
| u2      | e*   |    1 |        2 |
| u2      | e*   |    1 |        3 |
+---------+------+------+----------+
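Looking at the output, to group by user_id and by the "split at every e*" rule we need a new label that is identical for every row in the same group.
sum_flag - flag gives exactly that label.
(Plan 2: alternatively, take sum_flag from the previous row, defaulting to 0 when there is no previous row; that works too.)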
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) -flag as diff_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t;
+---------+------+------+-----------+
| user_id | name | flag | diff_flag |
+---------+------+------+-----------+
| u1      | e1   |    0 |         0 |
| u1      | e1   |    0 |         0 |
| u1      | e*   |    1 |         0 |
| u1      | e2   |    0 |         1 |
| u1      | e3   |    0 |         1 |
| u1      | e*   |    1 |         1 |
| u2      | e1   |    0 |         0 |
| u2      | e2   |    0 |         0 |
| u2      | e*   |    1 |         0 |
| u2      | e1   |    0 |         1 |
| u2      | e3   |    0 |         1 |
| u2      | e*   |    1 |         1 |
| u2      | e*   |    1 |         2 |
+---------+------+------+-----------+
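The second plan mentioned earlier (take the previous row's running sum, defaulting to 0) produces the same label as the subtraction. A small Python check, using the flag columns from the tables above and two hypothetical helper names (label_by_subtraction, label_by_previous_row), illustrates why: the running sum at row i minus flag at row i is, by definition, the running sum at row i-1.

```python
from itertools import accumulate

def label_by_subtraction(flags):
    """Plan 1: running sum of flag minus the current flag."""
    sum_flag = list(accumulate(flags))
    return [s - f for s, f in zip(sum_flag, flags)]

def label_by_previous_row(flags):
    """Plan 2: running sum taken from the previous row, 0 for the first row."""
    sum_flag = list(accumulate(flags))
    return [0] + sum_flag[:-1]

# flag columns for u1 and u2, copied from the result tables above
u1_flags = [0, 0, 1, 0, 0, 1]
u2_flags = [0, 0, 1, 0, 0, 1, 1]

u1_labels = label_by_subtraction(u1_flags)
u2_labels = label_by_subtraction(u2_flags)
```

Both helpers return the same list for any sequence of 0/1 flags, so either plan can feed the final group by.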
Finally, group by and collect. Note that collect_set and collect_list are not available here, only group_concat().
select
user_id,
group_concat(name,',') as res
from
(
select
user_id,
name,
flag,
sum(flag)over(partition by user_id order by user_id rows between unbounded preceding and current row) -flag as diff_flag
from
(
select
user_id,
name,
if(name = 'e*',1,0) as flag
from window_test
) as t
) as t1
group by user_id,diff_flag;
+---------+----------+
| user_id | res      |
+---------+----------+
| u1      | e1,e1,e* |
| u2      | e*       |
| u1      | e2,e3,e* |
| u2      | e1,e2,e* |
| u2      | e1,e3,e* |
+---------+----------+
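The groups come back in no particular order, unlike the expected result listed at the top. A small Python sketch (illustrative only; each tuple copies a res value from the table above and pairs it with the diff_flag that group had in the previous step) shows that sorting on (user_id, diff_flag) reproduces the expected order:

```python
# (user_id, diff_flag, res) in the order the query happened to return them
unordered = [
    ("u1", 0, "e1,e1,e*"),
    ("u2", 2, "e*"),
    ("u1", 1, "e2,e3,e*"),
    ("u2", 0, "e1,e2,e*"),
    ("u2", 1, "e1,e3,e*"),
]

# Sort by user first, then by the group label within each user.
ordered = sorted(unordered, key=lambda row: (row[0], row[1]))

for user_id, _, res in ordered:
    print(f"{user_id}, [{res}]")
```

In SQL the same effect would normally come from an order by user_id, diff_flag on the outer query.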