MySQL 5中没有开窗函数,给数据分析带来了不便,相比之下Oracle 12c和hive 3均有相关功能, MySQL 8中补齐了此短板
本文以耳熟能详的emp表结合几个例子说明之,代码在oracle、mysql、hive中完全通用。
需求1:请以如下方式展示10,20,30部门的总人数
DEPTNO_10 DEPTNO_20 DEPTNO_30
---------- ---------- ----------
3 5 6
不妨先通过在原表右边加三列标记列,标记每个员工是否属于10、20、30部门
select empno,ename,deptno,
case when deptno = 10 then 1 else 0 end deptno_10,
case when deptno = 20 then 1 else 0 end deptno_20,
case when deptno = 30 then 1 else 0 end deptno_30
from emp
结果如下,伴随数据分析过程,表被逐渐展宽,符合三范式的表变成反范式,右边的三列可以称为‘标记矩阵’,在数据分析中大有作为:
EMPNO ENAME DEPTNO DEPTNO_10 DEPTNO_20 DEPTNO_30
---------- ---------- ---------- ---------- ---------- ----------
7369 SMITH 20 0 1 0
7499 ALLEN 30 0 0 1
7521 WARD 30 0 0 1
7566 JONES 20 0 1 0
7654 MARTIN 30 0 0 1
7698 BLAKE 30 0 0 1
7782 CLARK 10 1 0 0
7788 SCOTT 20 0 1 0
7839 KING 10 1 0 0
7844 TURNER 30 0 0 1
7876 ADAMS 20 0 1 0
7900 JAMES 30 0 0 1
7902 FORD 20 0 1 0
7934 MILLER 10 1 0 0
根据需求可以只取后三列,然后对整体sum求和,即得最终结果代码如下:
select sum(case when deptno = 10 then 1 else 0 end) deptno_10,
sum(case when deptno = 20 then 1 else 0 end) deptno_20,
sum(case when deptno = 30 then 1 else 0 end) deptno_30
from emp
DEPTNO_10 DEPTNO_20 DEPTNO_30
---------- ---------- ----------
3 5 6
需求2:如图将员工按job排成5队,队头是job名,队内按员工名排序
CLERK ANALYST MANAGER PRESIDENT SALESMAN
---------- ---------- ---------- ---------- ----------
ADAMS FORD BLAKE KING ALLEN
JAMES SCOTT CLARK MARTIN
MILLER JONES TURNER
SMITH WARD
既然是按小组内排名,row_number() over()开窗函数派上用场,partition by指明按哪列划分小组,order by指明小组内按哪列排名,依照标记矩阵思想,在原表后加一列组内名次作为标记列,代码如下
select job,
ename,
row_number() over (partition by job order by ename) rn
from emp
结果如下:
JOB ENAME RN
--------- ---------- ----------
ANALYST FORD 1
ANALYST SCOTT 2
CLERK ADAMS 1
CLERK JAMES 2
CLERK MILLER 3
CLERK SMITH 4
MANAGER BLAKE 1
MANAGER CLARK 2
MANAGER JONES 3
PRESIDENT KING 1
SALESMAN ALLEN 1
SALESMAN MARTIN 2
SALESMAN TURNER 3
SALESMAN WARD 4
再按job构建标记矩阵,只不过此时的矩阵元素不再是1、0表示是否,也不是数字表示名次,而是员工的姓名,即ename值,代码如下
with temp as (select job,
ename,
row_number() over (partition by job order by ename) rn
from emp)
select rn,
case when job = 'CLERK' then ename end clerk,
case when job = 'ANALYST' then ename end analyst,
case when job = 'MANAGER' then ename end manager,
case when job = 'PRESIDENT' then ename end president,
case when job = 'SALESMAN' then ename end salesman
from temp
order by rn
结果如下,这里特意按小组内名次排序,是为了让每个job内的第一名相邻,放在一起便于观察
RN CLERK ANALYST MANAGER PRESIDENT SALESMAN
---------- ---------- ---------- ---------- ---------- ----------
1 FORD
1 BLAKE
1 ALLEN
1 KING
1 ADAMS
2 MARTIN
2 JAMES
2 SCOTT
2 CLARK
3 TURNER
3 JONES
3 MILLER
4 WARD
4 SMITH
只用按rn列分组取组内最大值(或最小值),即可筛选掉null行而得最后结果,代码如下
with temp as (select job,
ename,
row_number() over (partition by job order by ename) rn
from emp)
select min(case when job = 'CLERK' then ename end) clerks,
min(case when job = 'ANALYST' then ename end) analysts,
min(case when job = 'MANAGER' then ename end) mgrs,
min(case when job = 'PRESIDENT' then ename end) prez,
min(case when job = 'SALESMAN' then ename end) sales
from temp
group by rn
order by rn
CLERK ANALYST MANAGER PRESIDENT SALESMAN
---------- ---------- ---------- ---------- ----------
ADAMS FORD BLAKE KING ALLEN
JAMES SCOTT CLARK MARTIN
MILLER JONES TURNER
SMITH WARD
需求3:请以如下方式展示每个部门的员工姓名,注意部门名只在每小组的第一行显示
DEPT_NO ENAME
---------- ----------
10 CLARK
KING
MILLER
20 ADAMS
FORD
JONES
SCOTT
SMITH
30 ALLEN
BLAKE
JAMES
MARTIN
TURNER
WARD
当然首先想到的是每行都显示deptno,代码和结果如下:
select deptno,
ename
from emp
order by deptno
DEPTNO ENAME
---------- ----------
10 CLARK
10 KING
10 MILLER
20 JONES
20 FORD
20 ADAMS
20 SMITH
20 SCOTT
30 WARD
30 TURNER
30 ALLEN
30 JAMES
30 BLAKE
30 MARTIN
比照最终需要的结果,想到lead() over()开窗函数可以在小组内将某一列的值,往前提若干行,比如将deptno列的值往前提2行,代码如下
select lead(deptno,2,null) over (partition by deptno order by ename) dept_no,
ename
from emp
lead()提列函数的第一参数指明,要将哪一列的值往前提,第二参数指明往前提几行,第三参数指明提行后,列尾的空白位置用什么值填充,lag()压列函数正好与之相反,第二参指明往后压几行,第三参指明列头的空白用什么值填充
DEPT_NO ENAME
---------- ----------
10 CLARK
KING
MILLER
20 ADAMS
20 FORD
20 JONES
SCOTT
SMITH
30 ALLEN
30 BLAKE
30 JAMES
30 MARTIN
TURNER
WARD
部门10的deptno能达到效果,但20和30部门不满足要求,因为三个部门要提前的行数并不一样,但如果使用压列函数,三个部门都只用往下压一行,倒不如反其道而行之,代码如下
select lag(deptno) over (partition by deptno order by ename) dept_no,
ename
from emp
效果其实和最终需求接近,只是要将空白处填上deptno值,将有deptno值的地方设置为空白,类似于取反的过程
DEPT_NO ENAME
---------- ----------
CLARK
10 KING
10 MILLER
ADAMS
20 FORD
20 JONES
20 SCOTT
20 SMITH
ALLEN
30 BLAKE
30 JAMES
30 MARTIN
30 TURNER
30 WARD
加上取反的判断语句,代码和结果如下:
select case
when (lag(deptno) over (partition by deptno order by ENAME)) is null then deptno
else null
end dept_no,
ename
from emp
如果说第一次不用提列函数反用压列函数是大胆的逆向思维,那第二次取反操作就是负负得正,取得所需效果
DEPT_NO ENAME
---------- ----------
10 CLARK
KING
MILLER
20 ADAMS
FORD
JONES
SCOTT
SMITH
30 ALLEN
BLAKE
JAMES
MARTIN
TURNER
WARD