数据准备
-- 测试表
select * from test;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 6 | 14 |
| 2 | 3 | 4 |
| 9 | 8 | 2 |
| 5 | 2 | 7 |
| 6 | 2 | 6 |
| 1 | 3 | 2 |
| 7 | 13 | 7 |
| 5 | 9 | 1 |
| 2 | 3 | 4 |
| 4 | 3 | 8 |
| 1 | 6 | 4 |
| 7 | 1 | 9 |
+--------------+--------------+--------------+--+
– reducetask的数量为2
set mapred.reduce.tasks=2;
1.order by
全局排序只有一个reduce
select * from test order by field1;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 6 | 4 |
| 1 | 3 | 2 |
| 1 | 6 | 14 |
| 2 | 3 | 4 |
| 2 | 3 | 4 |
| 4 | 3 | 8 |
| 5 | 9 | 1 |
| 5 | 2 | 7 |
| 6 | 2 | 6 |
| 7 | 1 | 9 |
| 7 | 13 | 7 |
| 9 | 8 | 2 |
+--------------+--------------+--------------+--+
job: 1
Map: 1 Reduce: 1
多字段排序
select * from test order by field1,field2;
按字段1排序,字段1一样的按照字段2排序.
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 3 | 2 |
| 1 | 6 | 4 |
| 1 | 6 | 14 |
-----------------------------------------------
| 2 | 3 | 4 |
| 2 | 3 | 4 |
------------------------------------------------
| 4 | 3 | 8 |
-------------------------------------------------
| 5 | 2 | 7 |
| 5 | 9 | 1 |
------------------------------------------------
| 6 | 2 | 6 |
-----------------------------------------------
| 7 | 1 | 9 |
| 7 | 13 | 7 |
-------------------------------------------------
| 9 | 8 | 2 |
+--------------+--------------+--------------+--+
job: 1
Map: 1 Reduce: 1
2.sort by
每个reduce内部进行排序,局部有序全局不是有序
select * from test sort by field1;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 6 | 4 |
| 1 | 6 | 14 |
| 4 | 3 | 8 | reduce task 0
| 5 | 9 | 1 |
| 5 | 2 | 7 |
| 6 | 2 | 6 |
| 7 | 1 | 9 |
| 9 | 8 | 2 |
-------------------------------------------------
| 1 | 3 | 2 |
| 2 | 3 | 4 | reduce task 1
| 2 | 3 | 4 |
| 7 | 13 | 7 |
+--------------+--------------+--------------+--+
job: 1
Map: 1 Reduce: 2
多字段排序
select * from test sort by field1,field2;
同一个reduce中,按字段1排序,字段1一样的按照字段2排序.
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 6 | 4 |
| 1 | 6 | 14 |
| 4 | 3 | 8 |
| 5 | 2 | 7 |
| 5 | 9 | 1 |
| 6 | 2 | 6 |
| 7 | 1 | 9 |
| 9 | 8 | 2 |
-----------------------------------------------
| 1 | 3 | 2 |
| 2 | 3 | 4 |
| 2 | 3 | 4 |
| 7 | 13 | 7 |
+--------------+--------------+--------------+--+
3.distribute by
控制在map端如何拆分数据给reduce端的
hive会根据distribute by后面列,根据reduce的个数进行数据分发,默认是采用hash算法。
select * from test distribute by field1 sort by field1;
将字段field1的hashcode值模于redice的numberr得到的结果相同的就分配到了
同一个reduce中,然后sort by在每个reduce中进行排序
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 2 | 3 | 4 | 2.hashCode() % 2 = 0
| 2 | 3 | 4 | 2.hashCode() % 2 = 0
| 4 | 3 | 8 | 4.hashCode() % 2 = 0 reduce task 0
| 6 | 2 | 6 | 6.hashCode() % 2 = 0
----------------------------------------------------
| 1 | 6 | 4 | 1.hashCode() % 2 = 1
| 1 | 3 | 2 | 1.hashCode() % 2 = 1
| 1 | 6 | 14 | 1.hashCode() % 2 = 1
| 5 | 9 | 1 | 5.hashCode() % 2 = 1 reduce task 1
| 5 | 2 | 7 | 5.hashCode() % 2 = 1
| 7 | 1 | 9 | 7.hashCode() % 2 = 1
| 7 | 13 | 7 | 7.hashCode() % 2 = 1
| 9 | 8 | 2 | 9.hashCode() % 2 = 1
+--------------+--------------+--------------+--+
分组字段 field1
排序字段 field1
job: 1
Map: 1 Reduce: 2
select * from test distribute by field1 sort by field2;
distribute by后面字段数据的hashCode值%reduce的个数,值相同的发送到同一个reduce
sort by后面字段,是每个reduce中按照这个字段排序
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 6 | 2 | 6 |
| 4 | 3 | 8 |
| 2 | 3 | 4 |
| 2 | 3 | 4 |
-------------------------------------------------
| 7 | 1 | 9 |
| 5 | 2 | 7 |
| 1 | 3 | 2 |
| 1 | 6 | 4 |
| 1 | 6 | 14 |
| 9 | 8 | 2 |
| 5 | 9 | 1 |
| 7 | 13 | 7 |
+--------------+--------------+--------------+--+
分组字段 field1
排序字段 field2
job:1
Map: 1 Reduce: 2
select * from test distribute by field1,field2 sort by field2;
多字段排序
select * from test distribute by field1 sort by field1,field2;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 2 | 3 | 4 |
| 2 | 3 | 4 |
| 4 | 3 | 8 |
| 6 | 2 | 6 |
------------------------------------------------
| 1 | 3 | 2 |
| 1 | 6 | 4 |
| 1 | 6 | 14 |
---------------
| 5 | 2 | 7 |
| 5 | 9 | 1 |
--------------
| 7 | 1 | 9 |
| 7 | 13 | 7 |
---------------
| 9 | 8 | 2 |
+--------------+--------------+--------------+--+
分组字段 field1
排序字段 field1,field2
job: 1
Map: 1 Reduce: 2
4.cluster by
相当于distribute by+sort by 同一个字段,只能默认升序,不能使用倒序;
select * from test distribute by field1 sort by field1;
select * from test cluster by field1;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 2 | 3 | 4 |
| 2 | 3 | 4 |
| 4 | 3 | 8 |
| 6 | 2 | 6 |
---------------------------------------------------
| 1 | 6 | 4 |
| 1 | 3 | 2 |
| 1 | 6 | 14 |
| 5 | 9 | 1 |
| 5 | 2 | 7 |
| 7 | 1 | 9 |
| 7 | 13 | 7 |
| 9 | 8 | 2 |
+--------------+--------------+--------------+--+
分组字段 field1
排序字段 field1
job:1
Map: 1 Reduce: 2
多字段控制map数据输出到reduce,多字段排序
select * from test cluster by field1,field2;
+--------------+--------------+--------------+--+
| test.field1 | test.field2 | test.field3 |
+--------------+--------------+--------------+--+
| 1 | 3 | 2 |
| 5 | 9 | 1 |
| 6 | 2 | 6 |
--------------
| 7 | 1 | 9 |
| 7 | 13 | 7 |
-------------------------------------------------
| 1 | 6 | 4 |
| 1 | 6 | 14 |
| 2 | 3 | 4 |
| 2 | 3 | 4 |
| 4 | 3 | 8 |
| 5 | 2 | 7 |
| 9 | 8 | 2 |
+--------------+--------------+--------------+--+