行转列
行转列 将多行的数据合并到一列中
collect_set
collect_set – 去重
collect_list
以上两个都是聚合函数 将收集的多行数据聚集成一个数组集合
concat
拼接 参数是可变参数 拼接字符串
concat_ws
参数一 拼接符 参数二 可变个数的字符串/数组
表内容
zz 处女座 B
nn 射手座 A
bb 处女座 B
gg 白羊座 A
tg 射手座 A
create table tb_teacher(
name string ,
xz string ,
xx string
)
row format delimited fields terminated by '\t' ;
load data local inpath "/data/tc" into table tb_teacher ;
需要得到如下结果
射手座,A nn|tg
白羊座,A gg
处女座,B bb|zz
select
concat(xz ,"," , xx) as xax,
concat_ws("|",collect_list(name)) as names
from
tb_teacher
group by xz, xx ;
--结果
+----------+-----------+
| xax | names |
+----------+-----------+
| 处女座,B | zz|bb |
| 射手座,A | nn|tg |
| 白羊座,A | gg |
+----------+-----------+
列转行
表内容
《八佰》 战争
《八佰》 动作
《八佰》 抗日
《八佰》 剧情
《姜子牙》 动画
《姜子牙》 神话
《姜子牙》 科幻
《姜子牙》 动作
《姜子牙》 伦理
《战狼2》 战争
《战狼2》 动作
《战狼2》 灾难
create table tb_movie(
name string ,
typ string
)
row format delimited fields terminated by '\t' ;
load data local inpath "/root/movie" into table tb_movie ;
select
name ,
concat_ws(",",collect_list(typ)) as categorys
from
tb_movie
group by name ;
--结果
+--------+-----------------+
| name | categorys |
+--------+-----------------+
| 《八佰》 | 战争,动作,抗日,剧情 |
| 《姜子牙》 | 动画,神话,科幻,动作,伦理 |
| 《战狼2》 | 战争,动作,灾难 |
+--------+-----------------+
explode函数
表内容
《八佰》 战争,动作,抗日,剧情
《姜子牙》 动画,神话,科幻,动作,伦理
《战狼2》 战争,动作,灾难
需求:再转换成第一个的结果
create table tb_movie2(
name string ,
categorys string
)
row format delimited fields terminated by '\t' ;
load data local inpath "/data/movie2" into table tb_movie2 ;
select
explode(split(categorys,",")) -- 将字符串切割成数组
from
tb_movie2 ;
--结果如下
+------+
| col |
+------+
| 战争 |
| 动作 |
| 抗日 |
| 剧情 |
| 动画 |
| 神话 |
| 科幻 |
| 动作 |
| 伦理 |
| 战争 |
| 动作 |
| 灾难 |
+------+
侧窗口函数lateral view
类似隐式 join
select name,tp
from tb_movie2
lateral view
explode(split(categorys,',')) t as tp
--t 表名
--tp 虚拟表的字段名
--结果
+--------+-----+
| name | tp |
+--------+-----+
| 《八佰》 | 战争 |
| 《八佰》 | 动作 |
| 《八佰》 | 抗日 |
| 《八佰》 | 剧情 |
| 《姜子牙》 | 动画 |
| 《姜子牙》 | 神话 |
| 《姜子牙》 | 科幻 |
| 《姜子牙》 | 动作 |
| 《姜子牙》 | 伦理 |
| 《战狼2》 | 战争 |
| 《战狼2》 | 动作 |
| 《战狼2》 | 灾难 |
+--------+-----+
窗口函数
在进行分组聚合以后 ,我们还想操作集合以前的数据 使用到窗口函数
OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
CURRENT ROW:当前行 current row
n PRECEDING:往前n行数据 n preceding
n FOLLOWING:往后n行数据 n following
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到
后面的终点 unbound preceding unbound following
LAG(col,n):往前第n行数据 lag 参数一 字段 n
LEAD(col,n):往后第n行数据 lead
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回
此行所属的组的编号。注意:n必须为int类型。
表内容
jack,2020-01-01,10
tony,2020-01-02,15
jack,2020-02-03,23
tony,2020-01-04,29
jack,2020-01-05,46
jack,2020-04-06,42
tony,2020-01-07,50
jack,2020-01-08,55
mart,2020-04-08,62
mart,2020-04-09,68
neil,2020-05-10,12
mart,2020-04-11,75
neil,2020-06-12,80
mart,2020-04-13,94
--创建表,导入数据
create table tb_orders(
name string ,
ctime string ,
money double
)
row format delimited fields terminated by ',' ;
load data local inpath "/data/orders" into table tb_orders ;
求每个人的订单明细和总量
select
* ,
count(1) over(partition by name) , -- 指定窗口大小是一个人
sum(money) over(partition by name)
from
tb_orders ;
--结果展示
+-----------------+------------------+------------------+-----------------+---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | count_window_0 | sum_window_1 |
+-----------------+------------------+------------------+-----------------+---------------+
| jack | 2020-01-05 | 46.0 | 5 | 176.0 |
| jack | 2020-01-08 | 55.0 | 5 | 176.0 |
| jack | 2020-01-01 | 10.0 | 5 | 176.0 |
| jack | 2020-04-06 | 42.0 | 5 | 176.0 |
| jack | 2020-02-03 | 23.0 | 5 | 176.0 |
| mart | 2020-04-13 | 94.0 | 4 | 299.0 |
| mart | 2020-04-11 | 75.0 | 4 | 299.0 |
| mart | 2020-04-09 | 68.0 | 4 | 299.0 |
| mart | 2020-04-08 | 62.0 | 4 | 299.0 |
| neil | 2020-05-10 | 12.0 | 2 | 92.0 |
| neil | 2020-06-12 | 80.0 | 2 | 92.0 |
| tony | 2020-01-04 | 29.0 | 3 | 94.0 |
| tony | 2020-01-02 | 15.0 | 3 | 94.0 |
| tony | 2020-01-07 | 50.0 | 3 | 94.0 |
+-----------------+------------------+------------------+-----------------+---------------+
select
* ,
sum(money) over(partition by name order by ctime rows between unbounded
preceding and current row )
from
tb_orders ;
等价于
select
* ,
sum(money) over(partition by name order by ctime) -- order by ctime 起始行和当前行
from
tb_orders ;
--结果
+-----------------+------------------+------------------+---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | sum_window_0 |
+-----------------+------------------+------------------+---------------+
| jack | 2020-01-01 | 10.0 | 10.0 |
| jack | 2020-01-05 | 46.0 | 56.0 |
| jack | 2020-01-08 | 55.0 | 111.0 |
| jack | 2020-02-03 | 23.0 | 134.0 |
| jack | 2020-04-06 | 42.0 | 176.0 |
| mart | 2020-04-08 | 62.0 | 62.0 |
| mart | 2020-04-09 | 68.0 | 130.0 |
| mart | 2020-04-11 | 75.0 | 205.0 |
| mart | 2020-04-13 | 94.0 | 299.0 |
| neil | 2020-05-10 | 12.0 | 12.0 |
| neil | 2020-06-12 | 80.0 | 92.0 |
| tony | 2020-01-02 | 15.0 | 15.0 |
| tony | 2020-01-04 | 29.0 | 44.0 |
| tony | 2020-01-07 | 50.0 | 94.0 |
+-----------------+------------------+------------------+---------------+
上一行和当前行
select
* ,
sum(money) over(partition by name order by ctime rows between 1 preceding and current row ) -- order by ctime 起始行和当前行
from
tb_orders ;
上一行 当前行 和下一行
select
* ,
sum(money) over(partition by name order by ctime rows between 1 preceding and 1 following ) -- order by ctime 起始行和当前行
from
tb_orders ;
LAG(col,n) 往前第n行数据
LAG(col,n):往前第n行数据
参数一 :字段
参数二: n
参数三:如果没有前n行的值,默认为null
select
* ,
lag(ctime ,1 , '第一次购买') over(partition by name order by ctime)
from
tb_orders ;
--结果
+-----------------+------------------+------------------+---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | lag_window_0 |
+-----------------+------------------+------------------+---------------+
| jack | 2020-01-01 | 10.0 | 第一次购买 |
| jack | 2020-01-05 | 46.0 | 2020-01-01 |
| jack | 2020-01-08 | 55.0 | 2020-01-05 |
| jack | 2020-02-03 | 23.0 | 2020-01-08 |
| jack | 2020-04-06 | 42.0 | 2020-02-03 |
| mart | 2020-04-08 | 62.0 | 第一次购买 |
| mart | 2020-04-09 | 68.0 | 2020-04-08 |
| mart | 2020-04-11 | 75.0 | 2020-04-09 |
| mart | 2020-04-13 | 94.0 | 2020-04-11 |
| neil | 2020-05-10 | 12.0 | 第一次购买 |
| neil | 2020-06-12 | 80.0 | 2020-05-10 |
| tony | 2020-01-02 | 15.0 | 第一次购买 |
| tony | 2020-01-04 | 29.0 | 2020-01-02 |
| tony | 2020-01-07 | 50.0 | 2020-01-04 |
+-----------------+------------------+------------------+---------------+
ntile(n)
select
name ,
ctime ,
money
from
(select
* ,
ntile(5) over(order by ctime) as num
from
tb_orders)t
where num =1 ;
+-------+-------------+--------+
| name | ctime | money |
+-------+-------------+--------+
| jack | 2020-01-01 | 10.0 |
| tony | 2020-01-02 | 15.0 |
| tony | 2020-01-04 | 29.0 |
+-------+-------------+--------+
练习
还是上表的内容
(1)查询在2020年4月份购买过的顾客及总人数
select
name,
count(1) over() --整个表
from
(select
distinct name
from
tb_orders where substring(ctime , 0 , 7) = '2020-04' --筛选出4月的顾客
)t ;
--结果
+-------+------+
| name | num |
+-------+------+
| mart | 2 |
| jack | 2 |
+-------+------+
(2)查询顾客的购买明细及月购买总额
select
* ,
sum(money) over(partition by name , substring(ctime ,1 ,7))
from
tb_orders ;
或者使用month函数
select
* ,
sum(money) over(partition by name , month(ctime)) as month_cost
from
tb_orders ;
--结果
+-----------------+------------------+------------------+-------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | month_cost |
+-----------------+------------------+------------------+-------------+
| jack | 2020-01-05 | 46.0 | 111.0 |
| jack | 2020-01-08 | 55.0 | 111.0 |
| jack | 2020-01-01 | 10.0 | 111.0 |
| jack | 2020-02-03 | 23.0 | 23.0 |
| jack | 2020-04-06 | 42.0 | 42.0 |
| mart | 2020-04-13 | 94.0 | 299.0 |
| mart | 2020-04-11 | 75.0 | 299.0 |
| mart | 2020-04-09 | 68.0 | 299.0 |
| mart | 2020-04-08 | 62.0 | 299.0 |
| neil | 2020-05-10 | 12.0 | 12.0 |
| neil | 2020-06-12 | 80.0 | 80.0 |
| tony | 2020-01-04 | 29.0 | 94.0 |
| tony | 2020-01-02 | 15.0 | 94.0 |
| tony | 2020-01-07 | 50.0 | 94.0 |
+-----------------+------------------+------------------+-------------+
(3)上述的场景,要将money按照日期进行累加
select *,
sum(money) over(order by ctime rows between unbounded preceding
and current row)
from tb_orders
--结果
+-----------------+------------------+------------------+---------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | sum_window_0 |
+-----------------+------------------+------------------+---------------+
| jack | 2020-01-01 | 10.0 | 10.0 |
| tony | 2020-01-02 | 15.0 | 25.0 |
| tony | 2020-01-04 | 29.0 | 54.0 |
| jack | 2020-01-05 | 46.0 | 100.0 |
| tony | 2020-01-07 | 50.0 | 150.0 |
| jack | 2020-01-08 | 55.0 | 205.0 |
| jack | 2020-02-03 | 23.0 | 228.0 |
| jack | 2020-04-06 | 42.0 | 270.0 |
| mart | 2020-04-08 | 62.0 | 332.0 |
| mart | 2020-04-09 | 68.0 | 400.0 |
| mart | 2020-04-11 | 75.0 | 475.0 |
| mart | 2020-04-13 | 94.0 | 569.0 |
| neil | 2020-05-10 | 12.0 | 581.0 |
| neil | 2020-06-12 | 80.0 | 661.0 |
+-----------------+------------------+------------------+---------------+
(4)查询顾客上次的购买时间
select *,
lag(ctime,1) over(partition by name order by ctime)
from tb_orders
--结果
+-----------------+------------------+------------------+-------------+
| tb_orders.name | tb_orders.ctime | tb_orders.money | last_time |
+-----------------+------------------+------------------+-------------+
| jack | 2020-01-01 | 10.0 | NULL |
| jack | 2020-01-05 | 46.0 | 2020-01-01 |
| jack | 2020-01-08 | 55.0 | 2020-01-05 |
| jack | 2020-02-03 | 23.0 | 2020-01-08 |
| jack | 2020-04-06 | 42.0 | 2020-02-03 |
| mart | 2020-04-08 | 62.0 | NULL |
| mart | 2020-04-09 | 68.0 | 2020-04-08 |
| mart | 2020-04-11 | 75.0 | 2020-04-09 |
| mart | 2020-04-13 | 94.0 | 2020-04-11 |
| neil | 2020-05-10 | 12.0 | NULL |
| neil | 2020-06-12 | 80.0 | 2020-05-10 |
| tony | 2020-01-02 | 15.0 | NULL |
| tony | 2020-01-04 | 29.0 | 2020-01-02 |
| tony | 2020-01-07 | 50.0 | 2020-01-04 |
+-----------------+------------------+------------------+-------------+
(5)查询前20%时间的订单信息
select
name ,
ctime ,
money
from
(select
* ,
ntile(5) over(order by ctime) as num
from
tb_orders)t
where num =1 ;
--结果
+-------+-------------+--------+
| name | ctime | money |
+-------+-------------+--------+
| jack | 2020-01-01 | 10.0 |
| tony | 2020-01-02 | 15.0 |
| tony | 2020-01-04 | 29.0 |
+-------+-------------+--------+