hive 中 order by、sort by、distribute by 、cluster by 区别

最新推荐文章于 2024-07-29 09:03:07 发布

harveybd

最新推荐文章于 2024-07-29 09:03:07 发布

阅读量553

点赞数 2

分类专栏： Hive

本文链接：https://blog.csdn.net/HG_Harvey/article/details/103188061

版权

Hive 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

1. select 语法

Hive 中的 SELECT 基础语法和标准SQL语法基本一致，支持 WHERE、DISTINCT、GROUP BY、ORDER BY、HAVING、LIMIT、子查询等。语法如下：

[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)<font></font>
SELECT [ALL | DISTINCT] select_expr, select_expr, ...<font></font>
  FROM table_reference<font></font>
  [WHERE where_condition]<font></font>
  [GROUP BY col_list]<font></font>
  [ORDER BY col_list]<font></font>
  [CLUSTER BY col_list<font></font>
    | [DISTRIBUTE BY col_list] [SORT BY col_list]<font></font>
  ]<font></font>
 [LIMIT [offset,] rows]<font></font>

创建表

hive (test_db)> create table if not exists `tb_hql_select`(num int)
              > comment 'hql select'
              > row format delimited
              > fields terminated by '\t';
              
hive (test_db)> select * from tb_hql_select;
10
90
38
91
9
18
3

2. order by 和 sort by

order by

全局排序，即对指定的所有排序键进行全局排序。使用 order by 的查询语句，最后会用一个 reduce task 来完成全局排序
使用 order by 查询排序

select num from tb_hql_select order by num;

3
9
10
18
38
90
91

sort by

用于分区内排序，即每个 reduce 任务内排序，所以并不能保证全局有序，一般和distribute by 一起使用，而且distribute by 要写在sort by前面
使用 sort by 查询排序：

select num from tb_hql_select sort by num;

3
9
10
18
38
90
91

sort by 只有一个 reduce 和 order by 运行结果一致，下面设置reduce task 数量为

set mapred.reduce.tasks=2;
select num from tb_hql_select sort by num;

9
10
38
91
3
18
90

查询结果中可以看出，排序分为两个部分，每个 reduce 进行一次排序

总结
实际开发中，大多需要使用 order by 进行全局排序

3.distribute by 和 cluster by

distribute by
控制map的输出在reducer是如何划分的。按照指定的字段或表达式对数据进行划分，输出到对应的Reduce或者文件中（会对指定的字段按照hashCode值对reduce的个数取模，然后将任务分配到对应的reduce中去执行）

set mapred.reduce.tasks=2;
insert overwrite local directory '/users/harvey/data/tmp' select num from tb_hql_select distribute by id;

执行后会在本地目录 /users/harvey/data/tmp 下生成两个文件，即两个 reduce task 每个 task 对应一个文件

harveydeMacBook-Pro:tmp harvey$ ll /Users/harvey/data/tmp/
total 16
-rw-r--r--  1 harvey  staff  12 10 18 15:01 000000_0
-rw-r--r--  1 harvey  staff   7 10 18 15:01 000001_0
# 字段 num % 2 = 0 的记录
harveydeMacBook-Pro:tmp harvey$ cat 000000_0
18
38
90
10

# 字段 num % 2 = 1 的记录
harveydeMacBook-Pro:tmp harvey$ cat 000001_0
3
9
91

distribute by 与 sort by 一起使用
举例：如下有一张商店表 tb_store，id 为 store 所属的商户（商家），money 为商户的盈利，name 为商户的名称

mid	money	name
1001	10.0	商店1
1001	20.0	商店2
1002	30.0	商店3
1003	22.0	商店4

hive (test_db)> create table if not exists tb_store(
              > mid int,
              > money double,
              > name string
              > )
              > row format delimited fields terminated by '\t';

load data local inpath '/Users/harvey/data/tb_store.txt' overwrite into table test_db.tb_store

select * from test_db.tb_store;
1001	10.0	商店1
1001	20.0	商店2
1002	30.0	商店3
1003	22.0	商店4

执行查询
reduce task 数量为1

set mapred.reduce.tasks=1;
select mid, money, name from tb_store distribute by mid sort by mid asc, money asc;

1001	10.0	商店1
1001	20.0	商店2
1002	30.0	商店3
1003	22.0	商店4

等同于

select mid, money, name from tb_store order by mid asc, money asc;

1001	10.0	商店1
1001	20.0	商店2
1002	30.0	商店3
1003	22.0	商店4

reduce task 数量为2，mid为1002被划分到一个reduce task，mid为1001和1003的被划分到一个reduce task

set mapred.reduce.tasks=2;
select mid, money, name from tb_store distribute by mid sort by mid asc, money asc;

1002	30.0	商店3
1001	10.0	商店1
1001	20.0	商店2
1003	22.0	商店4

cluster by
cluster by 功能即 distribute by 与 sort by 结合使用，但 cluster by 不能指定排序，即 asc 或 desc，其指定的列只能是升序

如下 2 个 sql 语句是等价的

select mid, money, name from tb_store cluster by mid;

select mid, money, name from tb_store distribute by mid sort by mid;

1001	20.0	商店2
1001	10.0	商店1
1002	30.0	商店3
1003	22.0	商店4