详解 Hive 的各种查询语法

文刀小桂

于 2024-05-15 13:14:01 发布

阅读量645

点赞数 15

分类专栏： Hive 文章标签： hive hadoop 数据仓库大数据

本文链接：https://blog.csdn.net/weixin_44480009/article/details/138904156

版权

Hive 专栏收录该内容

12 篇文章 0 订阅

订阅专栏

一、数据准备

1. 创建测试表

--创建 部门表
create table if not exists dept
(
	deptNo int,
    deptName string,
    loc int
)
row format delimited fields terminated by '\t';

--创建 员工表
create table if not exists emp
(
	empNo int,
    empName string,
    job string,
    mgr int,
    hiredate string,
    salary double,
    comm double,
    deptNo int
)
row format delimited fields terminated by '\t';

2. 导入数据

# dept
10  ACCOUNTING  1700
20  RESEARCH  1800
30  SALES  1900
40  OPERATIONS  1700

# emp
7369  SMITH  CLERK  7902  1980-12-17  800.00    20 
7499  ALLEN  SALESMAN  7698  1981-2-20  1600.00  300.00  30
7521  WARD  SALESMAN  7698  1981-2-22  1250.00  500.00  30
7566  JONES  MANAGER  7839  1981-4-2  2975.00    20
7654  MARTIN  SALESMAN  7698  1981-9-28  1250.00  1400.00  30
7698  BLAKE  MANAGER  7839  1981-5-1  2850.00    30
7782  CLARK  MANAGER  7839  1981-6-9  2450.00    10
7788  SCOTT  ANALYST  7566  1987-4-19  3000.00    20
7839  KING  PRESIDENT    1981-11-17  5000.00    10
7844  TURNER  SALESMAN  7698  1981-9-8  1500.00  0.00  30
7876  ADAMS  CLERK  7788  1987-5-23  1100.00    20
7900  JAMES  CLERK  7698  1981-12-3  950.00    30
7902  FORD  ANALYST  7566  1981-12-3  3000.00    20
7934  MILLER  CLERK  7782  1982-1-23  1300.00    10

load data local inpath '/opt/module/hive/datas/dept.txt' into table dept;
load data local inpath '/opt/module/hive/datas/emp.txt' into table emp;

二、基本查询

1. 语法

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_name
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ]
[LIMIT number]

案例：

select * from dept;
select empNo, empName, deptNo from emp;

别名：可以给查询列和查询表设置别名

select empName as eName from emp;
select * from dept d;

2. 规范事项

HQL 语言大小写不敏感
HQL 可以写在一行或者多行
关键字不能被缩写也不能分行
各子句一般要分行写
使用空格缩进提高语句的可读性 (Tab 缩进在 Linux 中会出错)

3. 算术运算

运算符	说明
A+B	A 和 B 相加
A-B	A 减去 B
A*B	A 和 B 相乘
A/B	A 除以 B
A%B	A 对 B 取余
A&B	A 和 B 按位取与
A\|B	A 和 B 按位取或
A^B	A 和 B 按位取异或
~A	A 按位取反

select salary + 1 as sal from emp;

4. 聚合函数

-- 求总行数（count）
select count(*) as num cnt from emp;
-- 求工资的最大值（max）
select max(sal) max_sal from emp;
-- 求工资的最小值（min）
select min(sal) min_sal from emp;
-- 求工资的总和（sum）
select sum(sal) sum_sal from emp; 
-- 求工资的平均值（avg）
select avg(sal) avg_sal from emp;

5. 比较运算

运算符	支持的数据类型	说明
A=B	基本数据类型	如果 A 等于 B 则返回 TRUE，反之返回 FALSE
A<=>B	基本数据类型	如果 A 和 B 都为 NULL，则返回 TRUE，如果一边为 NULL，返回 False
A<>B, A!=B	基本数据类型	A 或者 B 为 NULL 则返回 NULL；如果 A 不等于 B，则返回TRUE，反之返回 FALSE
A<B	基本数据类型	A 或者 B 为 NULL，则返回 NULL；如果 A 小于 B，则返回TRUE，反之返回 FALSE
A<=B	基本数据类型	A 或者 B 为 NULL，则返回 NULL；如果 A 小于等于 B，则返回 TRUE，反之返回 FALSE
A>B	基本数据类型	A 或者 B 为 NULL，则返回 NULL；如果 A 大于 B，则返回TRUE，反之返回 FALSE
A>=B	基本数据类型	A 或者 B 为 NULL，则返回 NULL；如果 A 大于等于 B，则返回 TRUE，反之返回 FALSE
A [NOT] BETWEEN B AND C	基本数据类型	如果 A,B,C 任一为 NULL，则结果为 NULL。如果 A 的值大于等于 B 而且小于或等于 C，则结果为 TRUE，反之为 FALSE。使用 NOT 则效果相反
A IS NULL	所有数据类型	如果 A 等于 NULL，则返回 TRUE，反之返回 FALSE
A IS NOT NULL	所有数据类型	如果 A 不等于 NULL，则返回 TRUE，反之返回 FALSE
A IN (数值 1, 数值 2,…)	所有数据类型	A 在 IN 列表中的值，则返回 TRUE，否则返回 FALSE

-- 查询出薪水等于 5000 的所有员工
select * from emp where sal =5000;
-- 查询工资在 500 到 1000 的员工信息
select * from emp where sal between 500 and 1000;
-- 查询 comm 为空的所有员工信息
select * from emp where comm is null;
-- 查询工资是 1500 或 5000 的员工信息
select * from emp where sal IN (1500, 5000);

6. 逻辑运算

运算符	说明
and	逻辑与
or	逻辑或
not	逻辑否

-- 查询薪水大于 1000，部门是 30 的员工
select * from emp where sal>1000 and deptno=30; 
-- 查询薪水大于 1000，或者部门是 30 的员工
select * from emp where sal>1000 or deptno=30;
-- 查询除了 20 部门和 30 部门以外的员工信息
select * from emp where deptno not in (30, 20);

7. 模糊查询

7.1 LIKE

语法：
```
where column like 'expr'
```
- % 代表零个或多个字符
- _ 代表一个字符

案例：

-- 查找名字以 A 开头的员工信息
select * from emp where ename LIKE 'A%';
-- 查找名字中第二个字母为 A 的员工信息
select * from emp where ename LIKE '_A%';

7.2 RLIKE/REGEXP

可以通过正则表达式来指定匹配条件查询

-- 查找名字中带有 A 的员工信息
select * from emp where ename RLIKE '[A]';
select * from emp where ename REGEXP '[\w]';

三、分组查询

1. group by

按照一个或者多个列进行分组，然后对每个组执行聚合操作，常会和聚合函数一起使用

-- 计算 emp 表每个部门的平均工资
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
-- 计算 emp 每个部门中每个岗位的最高薪水
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

2. having

配合 Group by 使用的过滤条件语句

having 与 where 不同点：
- where 后面不能写分组函数，而 having 后面可以使用分组函数
- having 只用于 group by 分组统计语句

案例：

-- 求每个部门的平均薪水大于 2000 的部门
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

四、表连接查询

1. 语法

select 
	a.column,
	b.column
from 
	table_name1 a
[ left | right | full [outer] ] | [inner] join 
	table_name2 b
on a.key=b.key
where condition;

使用表别名可以简化查询
使用表别名前缀获取字段可以提高执行效率

2. 案例

2.1 内连接

只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来

select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

2.2 左外连接

JOIN 操作符左边表中符合 WHERE 子句的所有记录将会被返回

select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;

2.3 右外连接

JOIN 操作符右边表中符合 WHERE 子句的所有记录将会被返回

select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

2.4 全外连接

将会返回所有表中符合 WHERE 语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用 NULL 值替代

select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;

2.5 笛卡尔积

产生条件：
- 省略连接条件
- 连接条件无效
- 所有表中的所有行互相连接
案例：
```
select empno, dname from emp, dept;
```

2.6 多表连接

数据准备

新增城市数据

# location
1700  Beijing
1800  London
1900  Tokyo

创建 loc 表

create table if not exists location
(
    loc int,
    loc_name string
)
row format delimited fields terminated by '\t';

导入数据

load data local inpath '/opt/module/datas/location.txt' into table location;

案例：

SELECT 
	e.ename, 
	d.dname, 
	l.loc_name
FROM 
	emp e 
JOIN 
	dept d
ON d.deptno = e.deptno 
JOIN 
	location l
ON d.loc = l.loc;

优化说明：
- 大多数情况下， Hive 会对每对 JOIN 连接对象启动一个 MapReduce 任务。本例中会首先启动一个 MapReduce job 对表 e 和表 d 进行连接操作，然后会再启动一个 MapReduce job 将第一个MapReduce job 的输出和表 l 进行连接操作
- Hive 是按照从左到右的顺序执行的
- 当对 3 个或者更多表进行 join 连接时，如果每个 on 子句都使用相同的连接键，则只会产生一个 MapReduce job

五、排序

1. Order by

全局排序，最终只会有一个 Reducer

-- 查询员工信息按工资升序排列
select * from emp order by sal;
-- 查询员工信息按工资降序排列
select * from emp order by sal desc;

ASC（ascend）: 升序（默认）
DESC（descend）: 降序

1.1 别名排序

-- 按照员工薪水的 2 倍排序
select ename, sal*2 as twosal from emp order by twosal;

1.2 多列排序

-- 按照部门和工资升序排序
select ename, deptno, sal from emp order by deptno, sal;

2. Sort by

每个 Reduce 内部排序，为每个 reducer 产生一个排序文件。每个 Reducer 内部进行排序，对全局结果集来说不是排序，分区是随机的

-- 设置 reduce 个数
set mapreduce.job.reduces=3;
-- 查看设置 reduce 个数
set mapreduce.job.reduces;
-- 根据部门编号降序查看员工信息
select * from emp sort by deptno desc;
-- 将查询结果导入到文件中（按照部门编号降序排序）
insert overwrite local directory 
'/opt/module/data/sortby-result'
select * from emp sort by deptno desc;

3. Distribute by

分区，可以控制某个特定行到指定 reducer，类似 MR 中 partition（自定义分区），结合 sort by 使用

-- 设置 reduce 个数
set mapreduce.job.reduces=3;
-- 先按照部门编号分区，再按照员工编号降序排序。
insert overwrite local directory 
'/opt/module/data/distribute-result' 
select * from emp distribute by deptno sort by empno desc;

distribute by 的分区规则是根据分区字段的 hash 码与 reduce 的个数进行取模运算后，余数相同的分到一个区
Hive 要求 DISTRIBUTE BY 语句要写在 SORT BY 语句之前

4. Cluster by

当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式替代，但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC

-- 等价
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

文刀小桂

关注

15
点赞
踩
20

收藏

觉得还不错? 一键收藏
0
评论
详解 Hive 的各种查询语法

当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式替代，但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC。每个 Reducer 内部进行排序，对全局结果集来说不是排序，分区是随机的。将会返回所有表中符合 WHERE 语句条件的所有记录。分区，可以控制某个特定行到指定 reducer，类似 MR 中 partition（自定义分区），结合 sort by 使用。只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来。
复制链接

扫一扫

专栏目录