Hive 之查询

最新推荐文章于 2024-07-31 15:12:24 发布

风老魔

最新推荐文章于 2024-07-31 15:12:24 发布

阅读量946

点赞数

分类专栏：大数据文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/hj1993/article/details/129017960

版权

大数据专栏收录该内容

33 篇文章 1 订阅

订阅专栏

本文详细介绍了Hive中的查询语句，包括全表查询、列别名、算术运算符、常用函数，以及LIMIT、WHERE、GROUPBY、HAVING、JOIN、ORDERBY等语句的使用方法，还涉及到内连接、外连接和笛卡尔积的概念，以及DISTRIBUTEBY、SORTBY和CLUSTERBY在排序中的应用。

摘要由CSDN通过智能技术生成

查询语句语法：

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available
 starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT number]

1. 基本查询

1.1 全表查询

// 全表查询
select * from tableName;

// 指定列名查询
select columnName1, columnName2 from tableName;

// 列别名
select columnName1 AS columnAliasName1, columnName2  columnAliasName2 from emp;

1.2 算术运算符

运算符	描述
A+B	A和B 相加
A-B	A减去B
A*B	A和B 相乘
A/B	A除以B
A%B	A对B取余
A&B	A和B按位取与
A	B
A^B	A和B按位取异或
~A	A按位取反

1.3 常用函数

// 求总行数
select count(*) cnt from emp;

// 求最大（小 min）值 max_sal 为别名
select max(sal) max_sal from emp;

// 求和
select sum(sal) sum_sal from emp;

// 求平均值
select avg(sal) avg_sal from emp;

1.4 Limit 语句

Limit 子句会限制返回的行数：

0: jdbc:hive2://hadoop1:10000> select name alias_name from stu_buck limit 3;

2. where 语句

WHERE 子句紧随 FROM 子句，只返回满足条件的记录

2.1 比较运算符

这些操作符同样可以用于 JOIN…ON 和 HAVING 语句中

操作符	支持的数据类型	描述
A=B	基本数据类型	如果A等于B则返回TRUE，反之返回FALSE
A<=>B	基本数据类型	如果A和B都为NULL，则返回TRUE，其他的和等号（=）操作符的结果一致，如果任一为NULL则结果为NULL
A<>B, A!=B	基本数据类型	A或者B为NULL则返回NULL；如果A不等于B，则返回TRUE，反之返回FALSE
A<B	基本数据类型	A或者B为NULL，则返回NULL；如果A小于B，则返回TRUE，反之返回FALSE
A<=B	基本数据类型	A或者B为NULL，则返回NULL；如果A小于等于B，则返回TRUE，反之返回FALSE
A>B	基本数据类型	A或者B为NULL，则返回NULL；如果A大于B，则返回TRUE，反之返回FALSE
A>=B	基本数据类型	A或者B为NULL，则返回NULL；如果A大于等于B，则返回TRUE，反之返回FALSE
A [NOT] BETWEEN B AND C	基本数据类型	如果A，B或者C任一为NULL，则结果为NULL。如果A的值大于等于B而且小于或等于C，则结果为TRUE，反之为FALSE。如果使用NOT关键字则可达到相反的效果。
A IS NULL	所有数据类型	如果A等于NULL，则返回TRUE，反之返回FALSE
A IS NOT NULL	所有数据类型	如果A不等于NULL，则返回TRUE，反之返回FALSE
IN(数值1, 数值2)	所有数据类型	使用 IN运算显示列表中的值
A [NOT] LIKE B	STRING 类型	B是一个SQL下的简单正则表达式，如果A与其匹配的话，则返回TRUE；反之返回FALSE。B的表达式说明如下：‘x%’表示A必须以字母‘x’开头，‘%x’表示A必须以字母’x’结尾，而‘%x%’表示A包含有字母’x’,可以位于开头，结尾或者字符串中间。如果使用NOT关键字则可达到相反的效果。
A RLIKE B, A REGEXP B	STRING 类型	B是一个正则表达式，如果A与其匹配，则返回TRUE；反之返回FALSE。匹配使用的是JDK中的正则表达式接口实现的，因为正则也依据其中的规则。例如，正则表达式必须和整个字符串A相匹配，而不是只需与其字符串匹配。

示例：

// 查询 sal=5000 的员工
select * from emp where sal =5000;

// 查询工资在500到1000的员工信息
hive (default)> select * from emp where sal between 500 and 1000;

// 查询comm为空的所有员工信息
hive (default)> select * from emp where comm is null;

// 查询工资是1500或5000的员工信息
hive (default)> select * from emp where sal IN (1500, 5000);

2.2 Like 和 RLIKE 子句

like 表示符合类似的值，条件可以包含字符或数字，RLIKE 是Hive中这个功能的一个扩展，其可以通过Java的正则表达式来指定匹配条件

% ：代表零个或多个字符(任意个字符)。
_ ：代表一个字符

示例：

// 查找以2开头薪水的员工信息
hive (default)> select * from emp where sal LIKE '2%';

// 查找第二个数值为2的薪水的员工信息
hive (default)> select * from emp where sal LIKE '_2%';

// 查找薪水中含有2的员工信息
hive (default)> select * from emp where sal RLIKE '[2]';

2.3 逻辑运算符 And/Or/Not

操作符	含义
AND	逻辑并
OR	逻辑或
NOT	逻辑否

示例：

// 查询薪水大于1000，部门是30
hive (default)> select * from emp where sal>1000 and deptno=30;

// 查询薪水大于1000，或者部门是30
hive (default)> select * from emp where sal>1000 or deptno=30;

// 查询除了20部门和30部门以外的员工信息
hive (default)> select * from emp where deptno not IN(30, 20);

3. 分组

3.1 Group By 语句

GROUP BY 语句通常会和聚合函数（如：count、sum、avg 等）一起使用，按照一个或者多个列队结果进行分组，然后对每个组执行聚合操作：

// 计算emp表每个部门的平均工资，按 deptno  分组
hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

// 计算emp每个部门中每个岗位的最高薪水，按 deptno、job 分组
hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by
 t.deptno, t.job;

3.2 Having 语句

where 与 having 的区别

where 针对表中的列发挥作用，查询数据；having 针对查询结果中的列发挥作用，筛选数据。即 where 是在结果返回之前起作用的，而 having 对查询结果进行的过滤操作
where 后面不能写分组函数，而having后面可以使用分组函数
having 只用于 group by分组统计语句

示例：

// 求每个部门的平均工资
hive (default)> select deptno, avg(sal) from emp group by deptno;

// 求每个部门的平均薪水大于2000的部门
hive (default)> select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

4. Join 语句

Hive 支持通常的SQL JOIN语句，但是只支持等值连接，不支持非等值连接，即表与表之间 join 必须有一列值相等作为连接条件，n 张表连接，至少要有 n-1 个连接条件

// e 为 emp 这张表的别名，e.deptno= d.deptno 为连接条件
select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno= d.deptno;

表别名可以简化查询，使用表名前缀还可以提高执行效率

1、内连接

内连接：只有进行连接的两个表中都存在与连接条件相匹配的数据才会被保留下来

select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

2、左外连接

左外连接：JOIN操作符左边表中符合WHERE子句的所有记录将会被返回

select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno= d.deptno;

3、右外连接

右外连接：JOIN操作符右边表中符合WHERE子句的所有记录将会被返回

select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;

4、满外连接

满外连接：将会返回所有表中符合WHERE语句条件的所有记录。如果任一表的指定字段没有符合条件的值的话，那么就使用NULL值替代

select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno= d.deptno;

多表连接查询

连接 n个表，至少需要n-1个连接条件

SELECT e.ename, d.deptno, l. loc_name
FROM   emp e 
JOIN   dept d
ON     d.deptno = e.deptno 
JOIN   location l
ON     d.loc = l.loc;

一般来说，hive 会对每个 join 连接对象启动一个 MapReduce 任务，上面的查询会先启动一个MapReduce job对表e和表d进行连接操作，再启动一个 MR 将第一个 MR 输出结果和表 l 进行连接操作，另外 Hive 是按照从左到右的顺序执行连接的

笛卡尔积

笛卡尔积产生条件：

省略连接条件
连接条件无效
所有表中的所有行互相连接

select * from stu_buck, stu_buck_tmp;

5. 排序

5.1 全局排序 order by

order by 必须在 select 语句结尾

ASC（ascend）: 升序（默认）
DESC（descend）: 降序

// 按 id 降序排序
select * from stu_buck order by id desc;

按别名排序

select name alias_name from stu_buck order by alias_name desc;

多列排序

select id, name from stu_buck order by id, name;

5.2 MapReduce内部排序（Sort By）

sort by 可以对 Reduce 内部进行排序，但是堆全局结果集来说不是排序

// 查看设置reduce个数
0: jdbc:hive2://hadoop1:10000> set mapreduce.job.reduces;
+---------------------------+--+
|            set            |
+---------------------------+--+
| mapreduce.job.reduces=-1  |
+---------------------------+--+
1 row selected (0.024 seconds)

// 设置 reduce个数
0: jdbc:hive2://hadoop1:10000> set mapreduce.job.reduces=3;
No rows affected (0.005 seconds)
0: jdbc:hive2://hadoop1:10000> set mapreduce.job.reduces;
+--------------------------+--+
|           set            |
+--------------------------+--+
| mapreduce.job.reduces=3  |
+--------------------------+--+
1 row selected (0.015 seconds)

// 排序
0: jdbc:hive2://hadoop1:10000> select * from stu_buck sort by id desc;

5.3 分区排序（Distribute By）

类似 MR 中 partitions 进行分区，再结合 sort by 使用，但是一定要在 sort by 语句前，采用 Distribute By 分区排序最好多分配一点 reduce，否则无法看到效果

1、原始数据：

0: jdbc:hive2://hadoop1:10000> select * from tb_info;
+--------------+---------------+-----------------+--------------+--+
| tb_info.uid  | tb_info.name  | tb_info.gender  | tb_info.age  |
+--------------+---------------+-----------------+--------------+--+
| 9            | zl            | F               | 49           |
| 3            | zl            | F               | 48           |
| 8            | ww            | F               | 38           |
| 2            | ww            | F               | 36           |
| 6            | TQ            | F               | 32           |
| 12           | TQ            | F               | 23           |
| 11           | wb            | M               | 78           |
| 4            | pp            | M               | 44           |
| 5            | wb            | M               | 32           |
| 1            | zs            | M               | 28           |
| 7            | zs            | M               | 27           |
| 10           | pp            | M               | 10           |
+--------------+---------------+-----------------+--------------+--+

2、设置 reduce 个数：

hive (hive_1)> set mapreduce.job.reduces=3;

3、sort by 分区内排序：

0: jdbc:hive2://hadoop1:10000> select * from tb_info  sort by age desc;
INFO  : Session is already open
INFO  : 

INFO  : Status: Running (Executing on YARN cluster with App id application_1638604924632_0006)

INFO  : Map 1: -/-	Reducer 2: 0/3	
INFO  : Map 1: 0/1	Reducer 2: 0/3	
INFO  : Map 1: 0/1	Reducer 2: 0/3	
INFO  : Map 1: 0(+1)/1	Reducer 2: 0/3	
INFO  : Map 1: 0(+1)/1	Reducer 2: 0/3	
INFO  : Map 1: 1/1	Reducer 2: 0(+1)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+0)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+0)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+1)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+2)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+2)/3	
INFO  : Map 1: 1/1	Reducer 2: 1(+2)/3	
INFO  : Map 1: 1/1	Reducer 2: 2(+1)/3	
INFO  : Map 1: 1/1	Reducer 2: 3/3	
+--------------+---------------+-----------------+--------------+--+
| tb_info.uid  | tb_info.name  | tb_info.gender  | tb_info.age  |
+--------------+---------------+-----------------+--------------+--+
| 11           | wb            | M               | 78           |
| 4            | pp            | M               | 44           |
| 6            | TQ            | F               | 32           |
| 5            | wb            | M               | 32           |
| 1            | zs            | M               | 28           |
| 12           | TQ            | F               | 23           |
| 3            | zl            | F               | 48           |
| 8            | ww            | F               | 38           |
| 2            | ww            | F               | 36           |
| 7            | zs            | M               | 27           |
| 10           | pp            | M               | 10           |
| 9            | zl            | F               | 49           |
+--------------+---------------+-----------------+--------------+--+
12 rows selected (30.795 seconds)

因为 reduce 有 3 个，按照 age 排序后，可以看到结果分为三份降序排序，这样看可能效果不是很明显，可以将其导出到本地查看：

0: jdbc:hive2://hadoop1:10000> insert overwrite local directory '/home/hadoop/apps/big_source/tb_info_res.txt' select * from tb_info  sort by age desc;

可以看到本地 /home/hadoop/apps/big_source/tb_info_res.txt 有三份文件，刚好对应三个 reduce，其数据个数也能对应上

distribute by 与 sort by 结合

// 按照 gender、age 降序排序
0: jdbc:hive2://hadoop1:10000> select * from tb_info distribute by gender sort by age desc;

+--------------+---------------+-----------------+--------------+--+
| tb_info.uid  | tb_info.name  | tb_info.gender  | tb_info.age  |
+--------------+---------------+-----------------+--------------+--+
| 9            | zl            | F               | 49           |
| 3            | zl            | F               | 48           |
| 8            | ww            | F               | 38           |
| 2            | ww            | F               | 36           |
| 6            | TQ            | F               | 32           |
| 12           | TQ            | F               | 23           |
| 11           | wb            | M               | 78           |
| 4            | pp            | M               | 44           |
| 5            | wb            | M               | 32           |
| 1            | zs            | M               | 28           |
| 7            | zs            | M               | 27           |
| 10           | pp            | M               | 10           |
+--------------+---------------+-----------------+--------------+--+

gender 去模以 hashcode 取余只有两种结果，因此即使 reduce=3，导出到本地只会有两个文件有数据，另一个为空，只用到了两个 reduce

Cluster By

当 distribute by 和 sorts by 字段相同时，可以使用 cluster by 方式，cluster by 除了具有 distribute by 的功能外还兼具 sort by 的功能。但是排序只能是升序排序，不能指定排序规则为 ASC 或者 DESC

// 以下两种写法等价
select * from tb_info cluster by gender;
select * from tb_info distribute by gender sort by gender;

+--------------+---------------+-----------------+--------------+--+
| tb_info.uid  | tb_info.name  | tb_info.gender  | tb_info.age  |
+--------------+---------------+-----------------+--------------+--+
| 7            | zs            | M               | 27           |
| 2            | ww            | F               | 36           |
| 3            | zl            | F               | 48           |
| 11           | wb            | M               | 78           |
| 10           | pp            | M               | 10           |
| 1            | zs            | M               | 28           |
| 9            | zl            | F               | 49           |
| 12           | TQ            | F               | 23           |
| 6            | TQ            | F               | 32           |
| 5            | wb            | M               | 32           |
| 8            | ww            | F               | 38           |
| 4            | pp            | M               | 44           |
+--------------+---------------+-----------------+--------------+--+