Article link: https://blog.csdn.net/hzj1998/article/details/99677388

Hive: sort by vs. order by, and distribute by vs. cluster by

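In MapReduce, the data within each partition is sorted by key, and there are as many partitions as there are reduce tasks. When there is only one partition, the data is therefore globally sorted. sort by guarantees only that each partition is sorted on its own. order by produces a global ordering: the complete result is in order, which Hive achieves by sending every row through a single reducer regardless of the configured reducer count.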

To make the difference visible, first set the number of reduce tasks to a value greater than 1:

hive > set mapred.reduce.tasks=2;
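
On Hadoop 2.x and later, mapred.reduce.tasks is a deprecated name for mapreduce.job.reduces. A minimal sketch of the equivalent setting, assuming Hive is running on the MapReduce engine of such a cluster:

```sql
-- Assumption: Hadoop 2.x or later with Hive on MapReduce.
set mapreduce.job.reduces=2;
-- Running set with only the property name prints its current value.
set mapreduce.job.reduces;
```

Either name should work on such clusters; the older one is kept in the rest of this post.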

Create a test table named sortandorder:

create table sortandorder(
    id int
)
row format delimited
stored as textfile;

Load the test data:

hive > load data local inpath '/home/au/sortandorder' into table sortandorder;

The sortandorder file contains the following data:
1
3
2
9
10
8
4
7
6
5

Query the data with order by:

hive > select * from sortandorder order by id;
The result is:
1
2
3
4
5
6
7
8
9
10
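
Because order by passes every row through one reducer, it can be slow on large tables. The following is a minimal sketch, assuming a Hive version that still honors hive.mapred.mode (newer releases move these checks to finer-grained hive.strict.checks.* properties). In strict mode Hive rejects an order by that has no limit clause, so a bounded query looks like this. The limit of 5 is an arbitrary value chosen for illustration.

```sql
-- Assumption: hive.mapred.mode is honored by your Hive version.
set hive.mapred.mode=strict;
-- In strict mode, order by must be accompanied by limit.
select * from sortandorder order by id limit 5;
```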

Query the data with sort by:

hive > select * from sortandorder sort by id;
The result is:
1
2
5
7
9
10
3
4
6
8

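The data was split into two partitions: the first six rows (1, 2, 5, 7, 9, 10) came from one reducer and the last four (3, 4, 6, 8) from the other. Each group is sorted, but the output as a whole is not. So order by sorts globally, while sort by sorts only within each partition, and the number of partitions is determined by the number of reduce tasks.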
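
To see the per-reducer output more directly, the sort by result can be written to a directory, where each reducer produces its own file. This is a sketch only: the directory /home/au/sortby/ is a hypothetical path that does not appear elsewhere in this post, and insert overwrite replaces whatever the directory already contains.

```sql
-- Hypothetical output directory; pick one that is safe to overwrite.
insert overwrite local directory '/home/au/sortby/'
select id from sortandorder sort by id;
```

With two reduce tasks, the directory should hold two files, each sorted internally.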

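distribute by: partitions rows by the specified column or expression, so that rows with the same value go to the same reducer and therefore to the same output file. It does not sort the rows.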

cluster by: does what distribute by does and, in addition, sorts the rows within each reducer the way sort by does.
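
Put differently, cluster by on a column is shorthand for distribute by and sort by on that same column. A minimal sketch using the sortandorder table from above:

```sql
-- Both queries distribute and sort on the same column,
-- so they should produce the same per-reducer output.
select id from sortandorder cluster by id;
select id from sortandorder distribute by id sort by id;
```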

Using the sortandorder table from above, partition the data with distribute by and write the result to the local directory /home/au/distributeandcluster/:

-- uses the same sortandorder table as above
insert overwrite local directory '/home/au/distributeandcluster/'
select id from sortandorder distribute by id;

After the query finishes, the /home/au/distributeandcluster/ directory contains two files (000000_0 and 000001_0), and the rows inside each file are unsorted.
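
Which file a row lands in depends on a hash of the distribute by key modulo the number of reducers. The query below is only a rough approximation of that idea, using Hive's built-in hash and pmod functions. The actual reducer assignment is internal to Hive and may differ by version and execution engine, and the column alias approx_reducer is a name made up for this example.

```sql
-- Approximation only: Hive's real partitioning logic is internal.
-- The divisor 2 matches the reducer count set earlier in this post.
select id, pmod(hash(id), 2) as approx_reducer
from sortandorder;
```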

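Likewise, partition the data with cluster by and write the result to the same local directory /home/au/distributeandcluster/ (insert overwrite replaces the files from the previous run):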

-- uses the same sortandorder table as above
insert overwrite local directory '/home/au/distributeandcluster/'
select id from sortandorder cluster by id;

After the query finishes, the /home/au/distributeandcluster/ directory again contains two files (000000_0 and 000001_0), and this time the rows inside each file are sorted.
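
cluster by does not accept a sort direction and sorts ascending. When a different order is needed within each file, distribute by and sort by can be written out separately. A minimal sketch, reusing the same table and output directory:

```sql
-- uses the same sortandorder table as above
-- Rows are distributed by id, then sorted in descending order per reducer.
insert overwrite local directory '/home/au/distributeandcluster/'
select id from sortandorder distribute by id sort by id desc;
```

The same form also allows sorting on a column other than the one used for distribution, which cluster by cannot express.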