Hive: commonly used built-in functions and queries
List Hive's built-in functions:

```sql
show functions;
```

View the detailed usage of a specific function:

```sql
desc function extended <function_name>;
-- Example: desc function extended max;
```

Aggregate functions
max(),min(),sum(),count(),avg()…
e.g. Using the emp and dept tables, compute the total salary of each department and display the department name.

Syntax: use group by.
Rule: every field in the select list must also appear in group by, unless it is wrapped in an aggregate function.
```sql
-- Grouping by deptno alone works while only grouped/aggregated columns are selected:
select deptno, sum(sal) from emp e inner join dept d on e.deptno = d.deptno group by e.deptno;

-- Selecting d.dname without grouping by it fails:
select e.deptno, d.dname, sum(sal) from emp e inner join dept d on e.deptno = d.deptno group by e.deptno;
-- FAILED: SemanticException [Error 10025]: Line 1:16 Expression not in GROUP BY key 'dname'

-- Add d.dname to the group by list:
select e.deptno, d.dname, sum(sal) from emp e inner join dept d on e.deptno = d.deptno group by e.deptno, d.dname;
```
having clause -- filtering a grouped result with where raises an error:

```sql
select e.deptno, dname, sum(sal) from emp e inner join dept d on e.deptno = d.deptno group by e.deptno, d.dname where e.deptno = 10;
-- FAILED: ParseException line 1:99 missing EOF at 'where' near 'deptno'

-- Use having instead:
select e.deptno, dname, sum(sal) from emp e inner join dept d on e.deptno = d.deptno group by e.deptno, d.dname having e.deptno = 10;
```
- rand() -- returns a random number in the range [0, 1)

Syntax: rand() or rand(int seed)

```sql
select rand();  -- 0.6549622205135005
select rand();  -- 0.5992520662072582
select rand(1); -- 0.7308781907032909
select rand(2); -- 0.7311469360199058
```

Return type: double. With a fixed seed, rand(seed) produces a repeatable value.
concat() -- concatenate fields

```sql
select concat(empno, ename) noname from emp;
```

noname
7369SMITH
7499ALLEN
7521WARD
7566JONES
7654MARTIN
7698BLAKE
7782CLARK
7788SCOTT
7839KING
7844TURNER
7876ADAMS
7900JAMES
7902FORD
7934MILLER
substr() -- extract a substring from a field

```sql
select substr(hiredate, 1, 4) from emp; -- year
select substr(hiredate, 6, 2) from emp; -- month
```

The first number is the starting position (1-based); the second is the length of the substring to extract.
day() -- get the day of the month

```sql
select day(hiredate) from emp;
```

month() -- get the month
hour() -- get the hour
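By analogy with day(), month() and hour() are called the same way; a sketch, where hiredate is the column from the emp table above and the literal timestamp is illustrative:

```sql
select month(hiredate) from emp;     -- e.g. 12 for a hiredate of '1980-12-17'
select hour('2017-03-30 08:10:25');  -- 8
```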
Timestamp: seconds elapsed since 1970-01-01 (the Unix epoch).

unix_timestamp -- convert a local (Beijing) time string to a Unix timestamp

```sql
select unix_timestamp("2017-03-30 08:00:00"); -- 1490832000
```

from_unixtime -- convert a Unix timestamp back to local (Beijing) time

```sql
select from_unixtime(1490832000); -- 2017-03-30 08:00:00
```
- Application scenario: time spent on a web page

  Start browsing: 2017/10/16 08:00:00
  Leave time:     2017/10/16 08:10:00

  unix_timestamp(leave_time) - unix_timestamp(start_time)

  The difference is the dwell time in seconds (unix_timestamp returns seconds, so divide by 60 for minutes or by 3600 for hours).
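The dwell-time computation above can be sketched concretely; the literal timestamps stand in for the start/leave columns of a hypothetical page-view log, and since unix_timestamp returns seconds, dividing the difference by 60 yields minutes:

```sql
select (unix_timestamp('2017-10-16 08:10:00')
      - unix_timestamp('2017-10-16 08:00:00')) / 60 as dwell_minutes;
-- 10.0
```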
- Application scenario: cast() -- type conversion

```sql
select cast(1490854394123 / 1000 as int); -- 1490854394 (milliseconds to seconds)
```
case when ... then ... else ... end

e.g. Using the emp table, compute each employee's monthly pay as sal + comm:

```sql
select empno, sal + comm from emp;
```

empno  _c1
7369   NULL
7499   1900.0
7521   1750.0
7566   NULL
7654   2650.0
7698   NULL
7782   NULL
7788   NULL
7839   NULL
7844   1500.0
7876   NULL
7900   NULL
7902   NULL
7934   NULL

A NULL comm makes the whole sum NULL, so treat a missing comm as 0:

```sql
select empno, case when comm is null then sal + 0 else sal + comm end from emp;
```

empno  _c1
7369   800.0
7499   1900.0
7521   1750.0
7566   2975.0
7654   2650.0
7698   2850.0
7782   2450.0
7788   3000.0
7839   5000.0
7844   1500.0
7876   1100.0
7900   950.0
7902   3000.0
7934   1300.0

e.g. Using the emp table, label sal below 500 as 'down', sal between 500 and 800 as 'middle', and sal above 800 as 'up':

```sql
select empno,
       case when sal < 500 then 'down'
            when sal > 500 and sal < 800 then 'middle'
            else 'up'
       end
from emp;
```
[join]

join: connect two tables, merging a matching row from each into a single result row.

Equi (inner) join: [inner] join ... on ... (the two tables share a common field; the inner keyword is optional)

- Example:

```sql
select e.empno, d.dname from emp e inner join dept d on e.deptno = d.deptno;
```
Left join (left join): the left table is the driving table; all of its rows are kept.

- Example:

```sql
select e.empno, e.ename, d.deptno, d.dname from emp e left join dept d on e.deptno = d.deptno;
```

e.empno  e.ename  d.deptno  d.dname
7369     SMITH    20        RESEARCH
7499     ALLEN    30        SALES
7521     WARD     30        SALES
7566     JONES    20        RESEARCH
7654     MARTIN   30        SALES
7698     BLAKE    30        SALES
7782     CLARK    10        ACCOUNTING
7788     SCOTT    20        RESEARCH
7839     KING     10        ACCOUNTING
7844     TURNER   30        SALES
7876     ADAMS    20        RESEARCH
7900     JAMES    30        SALES
7902     FORD     20        RESEARCH
7934     MILLER   10        ACCOUNTING
Right join (right join): the right table is the driving table; all of its rows are kept.

- Example:

```sql
select e.empno, e.ename, d.deptno, d.dname from emp e right join dept d on e.deptno = d.deptno;
```

e.empno  e.ename  d.deptno  d.dname
7782     CLARK    10        ACCOUNTING
7839     KING     10        ACCOUNTING
7934     MILLER   10        ACCOUNTING
7369     SMITH    20        RESEARCH
7566     JONES    20        RESEARCH
7788     SCOTT    20        RESEARCH
7876     ADAMS    20        RESEARCH
7902     FORD     20        RESEARCH
7499     ALLEN    30        SALES
7521     WARD     30        SALES
7654     MARTIN   30        SALES
7698     BLAKE    30        SALES
7844     TURNER   30        SALES
7900     JAMES    30        SALES
NULL     NULL     40        OPERATIONS
- Full join (full join): keeps all rows from both tables

```sql
select e.empno, e.ename, d.deptno, d.dname from emp e full join dept d on e.deptno = d.deptno;
```
map join -- join a small table (b) against a large table (a) entirely on the map side:

```sql
select /*+ mapjoin(b) */ a.key, a.value from a join b on a.key = b.key;
```
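Newer Hive versions can also convert a join to a map join automatically when one side is small enough, instead of requiring the hint; a sketch of the relevant settings (the size threshold shown is illustrative):

```sql
set hive.auto.convert.join = true;               -- let Hive convert eligible joins to map joins
set hive.mapjoin.smalltable.filesize = 25000000; -- small-table size threshold in bytes
```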
order by, sort by, distribute by and cluster by in Hive

Ascending (asc) / descending (desc). When a query triggers a reduce stage, Hive prints these hints about tuning the reducer count:

```
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
```
order by: a total ordering over all the data; forces a single reducer.

```sql
select * from emp order by empno desc; -- omitting desc gives the default ascending order
```

sort by: sorts within each reducer; the overall result is not globally sorted.

```sql
set mapreduce.job.reduces = 3;
insert overwrite local directory '/opt/hivedata/sortby' select * from emp sort by empno;
```
- distribute by: partitions rows across reducers, like the partitioner in MapReduce; usually combined with sort by.

```sql
insert overwrite local directory '/opt/hivedata/distributeby' select * from emp distribute by deptno sort by empno;
```
- cluster by: a shorthand for the special case where distribute by and sort by act on the same column.

```sql
insert overwrite local directory '/opt/hivedata/clusterby' select * from emp cluster by empno;
```
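The cluster by query above can equivalently be written by distributing and sorting on the same column explicitly; note that cluster by only supports ascending order:

```sql
insert overwrite local directory '/opt/hivedata/clusterby'
select * from emp distribute by empno sort by empno;
```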
Hive window and analytic functions

These process data after it has been partitioned into groups.

Query all employees of department 10, ordered by salary descending:

```sql
select * from emp where deptno = 10 order by sal desc;
```

Show each department's highest salary as an extra (last) column: max() over()
```sql
select empno, ename, deptno, sal, max(sal) over(partition by deptno order by sal desc) as max_sal from emp;
```

empno  ename   deptno  sal     max_sal
7698   BLAKE   30      2850.0  2850.0
7499   ALLEN   30      1600.0  2850.0
7844   TURNER  30      1500.0  2850.0
7521   WARD    30      1250.0  2850.0
7654   MARTIN  30      1250.0  2850.0
7900   JAMES   30      950.0   2850.0
7839   KING    10      5000.0  5000.0
7782   CLARK   10      2450.0  5000.0
7934   MILLER  10      1300.0  5000.0
7788   SCOTT   20      3000.0  3000.0
7902   FORD    20      3000.0  3000.0
7566   JONES   20      2975.0  3000.0
7876   ADAMS   20      1100.0  3000.0
7369   SMITH   20      800.0   3000.0
Add a unique row number within each department as the last column:

```sql
select empno, ename, deptno, sal, row_number() over(partition by deptno order by sal desc) as rn from emp;
```

empno  ename   deptno  sal     rn
7698   BLAKE   30      2850.0  1
7499   ALLEN   30      1600.0  2
7844   TURNER  30      1500.0  3
7521   WARD    30      1250.0  4
7654   MARTIN  30      1250.0  5
7900   JAMES   30      950.0   6
7839   KING    10      5000.0  1
7782   CLARK   10      2450.0  2
7934   MILLER  10      1300.0  3
7788   SCOTT   20      3000.0  1
7902   FORD    20      3000.0  2
7566   JONES   20      2975.0  3
7876   ADAMS   20      1100.0  4
7369   SMITH   20      800.0   5
Get the top two salaries in each department:

```sql
select empno, ename, deptno, sal
from (select empno, ename, deptno, sal, row_number() over(partition by deptno order by sal desc) as rn from emp) tmp
where rn < 3;
```

empno  ename  deptno  sal
7698   BLAKE  30      2850.0
7499   ALLEN  30      1600.0
7839   KING   10      5000.0
7782   CLARK  10      2450.0
7788   SCOTT  20      3000.0
7902   FORD   20      3000.0
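One caveat worth noting: row_number() breaks ties arbitrarily (SCOTT and FORD both earn 3000.0 in department 20, so either could be numbered first). If tied salaries should share a position, Hive's rank() or dense_rank() can be substituted in the same window; a sketch:

```sql
select empno, ename, deptno, sal,
       rank()       over(partition by deptno order by sal desc) as rk,  -- ties share a rank, gaps follow
       dense_rank() over(partition by deptno order by sal desc) as drk  -- ties share a rank, no gaps
from emp;
```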