1.group by 语句
GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列队结果进行分组,然后对每个组执行聚合操作。
案例实操:
(1)计算emp表每个部门的平均工资
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
(2)计算emp每个部门中每个岗位的最高薪水
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by
t.deptno, t.job;
2.窗口函数
OVER(控制数据集大小):指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化
=====窗口里边=====
CURRENT ROW:当前行
n PRECEDING: 往前n行数据
n FOLLOWING:往后n行数据
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING表示到后面的终点
====窗口外边====
LAG(col,n):往前第n行数据(根据col某个字段往前n行)over()over内不支持group by
LEAD(col,n):往后第n行数据
NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。注意:n必须为int类型。
注:OVER(不填写)开窗是针对每一条数据进行开窗,计算是全局数据集。针对所有行。
数据准备:
name,orderdate,cost
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
需求:
- 查询在2017年4月份购买过的顾客及总人数
- 查询顾客的购买明细及月购买总额
- 上述的场景,要将cost按照日期进行累加
1.查询在2017年4月份购买过的顾客及总人数
hive (gmall)> select
> name,
> count(*) over() --所有行做统计
> from
> business
> where
> date_format(orderdate,'yyyy-MM')='2017-04'
> group by name;
Query ID = root_20200907204809_d2869633-6117-4a93-92ae-235b55447432
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1599464433998_0008)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 9.38 s
--------------------------------------------------------------------------------
OK
name count_window_0
jack 2
mart 2
Time taken: 10.645 seconds, Fetched: 2 row(s)
over(填写)开窗是针对每一条数据进行开窗,根据窗口大小计算数据集。根据每一个窗口数据集进行计算。
如果括号里的开窗函数是order by 每个窗口的大小是包含比当前窗口小的数据集
2.查询顾客的购买明细及月购买总额:
hive (gmall)> select name,orderdate, cost,sum(cost) over(partition by month(orderdate)) from business;
name orderdate cost sum_window_0
jack 2017-01-01 10 205
tony 2017-01-02 15 205
tony 2017-01-04 29 205
jack 2017-01-05 46 205
tony 2017-01-07 50 205
jack 2017-01-08 55 205
jack 2017-02-03 23 23
mart 2017-04-13 94 341
mart 2017-04-08 62 341
mart 2017-04-09 68 341
mart 2017-04-11 75 341
jack 2017-04-06 42 341
neil 2017-05-10 12 12
neil 2017-06-12 80 80
3.上述的场景,要将cost按照日期进行累加:
hive (gmall)> select name,orderdate,cost,
> sum(cost) over() as sample1,--所有行相加
> sum(cost) over(partition by name) as sample2,--按name分组,组内数据相加
> sum(cost) over(partition by name order by orderdate) as sample3,--按name分组,组内数据累加
> sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,--和sample3一样,由起点到当前行的聚合
> sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, --当前行和前面一行做聚合
> sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,--当前行和前边一行及后面一行
> sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 --当前行及后面所有行
> from business;
name orderdate cost sample1 sample2 sample3 sample4 sample5 sample6 sample7
jack 2017-01-01 10 661 176 10 10 10 56 176
jack 2017-01-05 46 661 176 56 56 56 111 166
jack 2017-01-08 55 661 176 111 111 101 124 120
jack 2017-02-03 23 661 176 134 134 78 120 65
jack 2017-04-06 42 661 176 176 176 65 65 42
mart 2017-04-08 62 661 299 62 62 62 130 299
mart 2017-04-09 68 661 299 130 130 130 205 237
mart 2017-04-11 75 661 299 205 205 143 237 169
mart 2017-04-13 94 661 299 299 299 169 169 94
neil 2017-05-10 12 661 92 12 12 12 92 92
neil 2017-06-12 80 661 92 92 92 92 92 80
tony 2017-01-02 15 661 94 15 15 15 44 94
tony 2017-01-04 29 661 94 44 44 44 94 79
tony 2017-01-07 50 661 94 94 94 79 79 50