Raw data (shop name, sale date, sale amount)
a,2020-02-05,200
a,2020-02-06,300
a,2020-02-07,200
a,2020-02-08,400
a,2020-02-10,600
a,2020-03-01,200
a,2020-03-02,300
a,2020-03-03,200
a,2020-03-04,400
a,2020-03-05,600
b,2020-02-05,200
b,2020-02-06,300
b,2020-02-08,200
b,2020-02-09,400
b,2020-02-10,600
c,2020-01-31,200
c,2020-02-01,300
c,2020-02-02,200
c,2020-02-03,400
c,2020-02-10,600
Create the table in Hive and load the data
create table shop(
name string ,
ctime string ,
money int
)
row format delimited fields terminated by "," ;
load data local inpath "/doit16/shop.txt" into table shop ;
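Requirement: find the names of the shops that have sales records on three consecutive days
Analysis:
1. Partition the rows by shop name, order them by sale date, and number them within each partition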
select
* ,
row_number() over(partition by name order by ctime) as rn
from
shop
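The resulting table looks like this:
2. Subtract the row number from the ctime value with the date_sub() function. Within one shop, consecutive dates and consecutive row numbers both grow by 1, so the difference stays constant. If two rows of the same shop yield the same result, they belong to the same run of consecutive sale dates.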
select
name,
ctime,
money,
rn,
date_sub(ctime,rn) date_sub_res
from
(select
* ,
row_number() over(partition by name order by ctime) as rn
from
shop) t
Query result:
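As a sanity check on this step, here is a minimal Python sketch of the same arithmetic for shop b only. The helper name `date_sub` is chosen here to mirror the Hive function; it is not Hive itself, and `enumerate` stands in for row_number() over a single partition.

```python
from datetime import date, timedelta

# Shop b's sale dates from the sample data, already sorted ascending.
sale_dates = ["2020-02-05", "2020-02-06", "2020-02-08", "2020-02-09", "2020-02-10"]

def date_sub(day: str, n: int) -> str:
    """Mimic Hive's date_sub(): return the date n days before `day`."""
    return (date.fromisoformat(day) - timedelta(days=n)).isoformat()

# enumerate(..., start=1) plays the role of row_number() inside one partition.
rows = [(day, rn, date_sub(day, rn)) for rn, day in enumerate(sale_dates, start=1)]

for day, rn, date_sub_res in rows:
    print(day, rn, date_sub_res)
# 2020-02-05 1 2020-02-04
# 2020-02-06 2 2020-02-04
# 2020-02-08 3 2020-02-05
# 2020-02-09 4 2020-02-05
# 2020-02-10 5 2020-02-05
```

The gap on 2020-02-07 is what shifts the difference from 2020-02-04 to 2020-02-05, which splits shop b's dates into a run of two days and a run of three days.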
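3. Group by name and date_sub_res, and count the rows in each group; the count cc is the length of one run of consecutive days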
select
name,
date_sub_res,
count(*) cc
from
(select
name,
ctime,
money,
rn,
date_sub(ctime,rn) date_sub_res
from
(select
* ,
row_number() over(partition by name order by ctime) as rn
from
shop) t1) t2
group by name,date_sub_res
Query result:
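The grouping can be checked by hand for shop a, whose dates cross from February into March of a leap year. The following Python sketch is an illustration only: `Counter` stands in for `group by ... count(*)`, and the variable names are chosen here.

```python
from collections import Counter
from datetime import date, timedelta

# Shop a's sale dates from the sample data, sorted ascending.
sale_dates = [
    "2020-02-05", "2020-02-06", "2020-02-07", "2020-02-08", "2020-02-10",
    "2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05",
]

# Key each date by (date - row number); dates in one consecutive run share a key.
run_lengths = Counter(
    (date.fromisoformat(day) - timedelta(days=rn)).isoformat()
    for rn, day in enumerate(sale_dates, start=1)
)

for date_sub_res, cc in sorted(run_lengths.items()):
    print(date_sub_res, cc)
# 2020-02-04 4
# 2020-02-05 1
# 2020-02-24 5
```

Shop a therefore has three runs: four days starting 2020-02-05, the single day 2020-02-10, and five days starting 2020-03-01.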
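4. Keep only the groups with cc >= 3 and deduplicate name. The result is the set of shops with sales records on at least three consecutive days. The comparison must be >= rather than >, otherwise a run of exactly three days (shop b, 2020-02-08 to 2020-02-10) is dropped.
select
distinct(name)
from
(select
name,
date_sub_res,
count(*) cc
from
(select
name,
ctime,
money,
rn,
date_sub(ctime,rn) date_sub_res
from
(select
* ,
row_number() over(partition by name order by ctime) as rn
from
shop) t1) t2
group by name,date_sub_res) t3
where cc >= 3;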
Query result:
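To cross-check the final answer without a Hive session, the whole pipeline can be sketched in Python over the sample data. This is an illustration under assumptions, not the Hive execution: `shops_with_streak` and `min_days` are names invented here, and the `set()` call is an extra safeguard against duplicate dates that the sample data does not need.

```python
from collections import Counter
from datetime import date, timedelta

# Sample rows as name -> sale dates; money is omitted because the query never reads it.
SALES = {
    "a": ["2020-02-05", "2020-02-06", "2020-02-07", "2020-02-08", "2020-02-10",
          "2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04", "2020-03-05"],
    "b": ["2020-02-05", "2020-02-06", "2020-02-08", "2020-02-09", "2020-02-10"],
    "c": ["2020-01-31", "2020-02-01", "2020-02-02", "2020-02-03", "2020-02-10"],
}

def shops_with_streak(sales, min_days=3):
    """Return sorted shop names that have at least `min_days` consecutive sale dates."""
    result = []
    for name, days in sales.items():
        # Assumption: repeated dates are collapsed, since they would shift the row numbers.
        ordered = sorted(set(days))
        # Same trick as the SQL: (date - row number) is constant within a run.
        runs = Counter(
            date.fromisoformat(day) - timedelta(days=rn)
            for rn, day in enumerate(ordered, start=1)
        )
        if max(runs.values(), default=0) >= min_days:
            result.append(name)
    return sorted(result)

print(shops_with_streak(SALES))     # ['a', 'b', 'c']
print(shops_with_streak(SALES, 4))  # ['a', 'c']
```

Requiring at least four days is equivalent to filtering on cc > 3, and it loses shop b, whose longest run is exactly three days.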