1. select 语法
Hive 中的 SELECT 基础语法和标准SQL语法基本一致,支持 WHERE、DISTINCT、GROUP BY、ORDER BY、HAVING、LIMIT、子查询等。语法如下:
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)<font></font>
SELECT [ALL | DISTINCT] select_expr, select_expr, ...<font></font>
FROM table_reference<font></font>
[WHERE where_condition]<font></font>
[GROUP BY col_list]<font></font>
[ORDER BY col_list]<font></font>
[CLUSTER BY col_list<font></font>
| [DISTRIBUTE BY col_list] [SORT BY col_list]<font></font>
]<font></font>
[LIMIT [offset,] rows]<font></font>
创建表
hive (test_db)> create table if not exists `tb_hql_select`(num int)
> comment 'hql select'
> row format delimited
> fields terminated by '\t';
hive (test_db)> select * from tb_hql_select;
10
90
38
91
9
18
3
2. order by 和 sort by
- order by
全局排序,即对指定的所有排序键进行全局排序。使用 order by 的查询语句,最后会用一个 reduce task 来完成全局排序
使用 order by 查询排序
select num from tb_hql_select order by num;
3
9
10
18
38
90
91
- sort by
用于分区内排序,即每个 reduce 任务内排序,所以并不能保证全局有序,一般和distribute by 一起使用,而且distribute by 要写在sort by前面
使用 sort by 查询排序:
select num from tb_hql_select sort by num;
3
9
10
18
38
90
91
sort by 只有一个 reduce 和 order by 运行结果一致,下面设置reduce task 数量为
set mapred.reduce.tasks=2;
select num from tb_hql_select sort by num;
9
10
38
91
3
18
90
查询结果中可以看出,排序分为两个部分,每个 reduce 进行一次排序
- 总结
实际开发中,大多需要使用 order by 进行全局排序
3.distribute by 和 cluster by
- distribute by
控制map的输出在reducer是如何划分的。按照指定的字段或表达式对数据进行划分,输出到对应的Reduce或者文件中(会对指定的字段按照hashCode值对reduce的个数取模,然后将任务分配到对应的reduce中去执行)
set mapred.reduce.tasks=2;
insert overwrite local directory '/users/harvey/data/tmp' select num from tb_hql_select distribute by id;
执行后会在本地目录 /users/harvey/data/tmp 下生成两个文件,即两个 reduce task 每个 task 对应一个文件
harveydeMacBook-Pro:tmp harvey$ ll /Users/harvey/data/tmp/
total 16
-rw-r--r-- 1 harvey staff 12 10 18 15:01 000000_0
-rw-r--r-- 1 harvey staff 7 10 18 15:01 000001_0
# 字段 num % 2 = 0 的记录
harveydeMacBook-Pro:tmp harvey$ cat 000000_0
18
38
90
10
# 字段 num % 2 = 1 的记录
harveydeMacBook-Pro:tmp harvey$ cat 000001_0
3
9
91
- distribute by 与 sort by 一起使用
举例:如下有一张商店表 tb_store,id 为 store 所属的商户(商家),money 为商户的盈利,name 为商户的名称
mid | money | name |
---|---|---|
1001 | 10.0 | 商店1 |
1001 | 20.0 | 商店2 |
1002 | 30.0 | 商店3 |
1003 | 22.0 | 商店4 |
hive (test_db)> create table if not exists tb_store(
> mid int,
> money double,
> name string
> )
> row format delimited fields terminated by '\t';
load data local inpath '/Users/harvey/data/tb_store.txt' overwrite into table test_db.tb_store
select * from test_db.tb_store;
1001 10.0 商店1
1001 20.0 商店2
1002 30.0 商店3
1003 22.0 商店4
执行查询
reduce task 数量为1
set mapred.reduce.tasks=1;
select mid, money, name from tb_store distribute by mid sort by mid asc, money asc;
1001 10.0 商店1
1001 20.0 商店2
1002 30.0 商店3
1003 22.0 商店4
等同于
select mid, money, name from tb_store order by mid asc, money asc;
1001 10.0 商店1
1001 20.0 商店2
1002 30.0 商店3
1003 22.0 商店4
reduce task 数量为2,mid为1002被划分到一个reduce task,mid为1001和1003的被划分到一个reduce task
set mapred.reduce.tasks=2;
select mid, money, name from tb_store distribute by mid sort by mid asc, money asc;
1002 30.0 商店3
1001 10.0 商店1
1001 20.0 商店2
1003 22.0 商店4
- cluster by
cluster by 功能即 distribute by 与 sort by 结合使用,但 cluster by 不能指定排序,即 asc 或 desc,其指定的列只能是升序
如下 2 个 sql 语句是等价的
select mid, money, name from tb_store cluster by mid;
select mid, money, name from tb_store distribute by mid sort by mid;
1001 20.0 商店2
1001 10.0 商店1
1002 30.0 商店3
1003 22.0 商店4