group by多字段
select rank, isonline, count(*) from city group by rank, isonline;
将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)
group by单字段
group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多):SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs;
a 苹果 5
a 橙子 3
a 苹果 2
b 烧鸡 1
hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a 10
b 1
其过程如下图:
默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.
Explain
hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
logs
TableScan // 扫描表
alias: logs
Select Operator //选择字段
expressions:
expr: uid
type: string
expr: count
type: int
outputColumnNames: uid, count
Group By Operator //这里是因为默认设置了hive.map.aggr=true,会在mapper先做一次聚合,减少reduce需要处理的数据
aggregations:
expr: sum(count) //聚集函数
bucketGroup: false
keys: //键
expr: uid
type: string
mode: hash //hash方式,processHashAggr()
outputColumnNames: _col0, _col1
Reduce Output Operator //输出key,value给reducer
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: sum(VALUE._col0)
//聚合
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial //合并值
outputColumnNames: _col0, _col1
Select Operator //选择字段
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator //输出到文件
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
细分Group by 的多种执行计划
在上文中给出的只是在Map Aggr为true时的执行计划,而Group by的执行计划细分的话有很多种。
术语说明
在介绍之前,首先明白几个术语:
- (No) Map Aggr:hive.map.aggr配置变量的缩写,可以设置为true或false,代表了map端是否开启聚合功能。
- (No) Skew: hive.groupby.skewindata配置变量的缩写,可以设置为true或false,表示是否是倾斜表等。
- Rollup: rollup比较神奇,rollup可以实现从右到做递减多级的统计,显示统计某一层次结构的聚合。
如:
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id with rollup;
- 1
- 2
- 3
等价以下sql语句:
select device_id,os_id,app_id,client_version,from_id,count(user_id)
from test_xinyan_reg
group by device_id,os_id,app_id,client_version,from_id
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());
- 1
- 2
- 3
- 4
在此不做Rollup的进一步说明。
基于Map Aggr与Skew的取值,可以产生四种不同的执行计划,再加上Rollup可以拓展到6种,下文对这六种做简要的介绍。
Map Aggr & No Skew:
该执行计划可以说就是上文group by 单字段中给出的那种情况,在map端多了一个聚合操作。
执行过程如下(为了原汁原味,此处就给出英文了):
- Mapper:
*Hash-based group by operator to perform partial aggregations
*Reduce sink operator, performs some partial aggregations
- Reducer:
*MergePartial (list-based) group by operator to perform final aggregations
Map Aggr & Skew
为了处理一些额外的数据行,该种执行计划中,只在Map端的hash-based aggregation operator有更改。
执行过程如下:
- Mapper 1:
*Hash-based group by operator to perform partial aggregations
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
-
Reducer 1:
*Partials (list-based) group by operator to perform further partial aggregations -
Mapper 2:
*Reduce sink operator, performs some partial aggregations -
Reducer 2:
*Final (list-based) group by operator to perform final aggregations
需要注意的是,如果没有Group by keys或者是distinct keys, Reduce 1 和Mapper 2将不会执行,且Mapper 1中的reduce sink operator 也不会spray.
No Map Aggr & No Skew & No Rollup
该种执行计划中,有没有No Rollup其实是没有什么影响的,代表的是一个意思,即就是No Map Aggr & No Skew的情况,其执行过程如下:
-
Mapper:
*Reduce sink operator, performs some partial aggregations -
Reducer:
*Complete (list-based) group by operator to perform all aggregations
No Map Aggr & No Skew & With Rollup
说明:尽管此条中是No Map Aggr,在使用With Rollup时, hiveconf:hive.map.aggr要为true。
我的理解是No Map Aggr表示语句中没有需要在Map端聚合的语句,如Count等。
其执行过程如下:
-
Mapper 1:
*Reduce sink operator, does not perform any partial aggregations -
Reducer 1:
*Hash-based group by operator, much like the one used in the mappers of previous cases -
Mapper 2:
*Reduce sink operator, performs some partial aggregations -
Reducer 2:
*MergePartial (list-based) group by operator to perform remaining aggregations
No Map Aggr & Skew & (No Distinct or No Rollup)
在该执行计划下,list-based group by operator 做了更改,以适应rollup使用时对额外行的处理。
-
Mapper 1:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise -
Reducer 1:
*Partial1 (list-based) group by operator to perform partial aggregations, it makes use of the new list-based group by operator implementation for rollup if necessary -
Mapper 2:
*Reduce sink operator, performs some partial aggregations -
Reducer 2:
*Final (list-based) group by operator to perform remaining aggregations
No Map Aggr & Skew & Distinct & Rollup
Mapper 1:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 1:
*Hash-based group by operator, much like the one used in the mappers of previous cases
Mapper 2:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 2:
*Partials (list-based) group by operator to perform further partial aggregations
Mapper 3:
*Reduce sink operator, performs some partial aggregations
Reducer 3:
*Final (list-based) group by operator to perform final aggregations
注意:如果没有Group by keys或者是distinct keys, Reduce 2 和Mapper 3将不会执行,且Mapper 2中的reduce sink operator 也不会spray.
Also, note that the reason for Mapper 2 spraying is that if the skew in the data existed in a column that is not immediately nulled by the rollup (e.g. if we the group by keys are columns g1, g2, g3 in that order, we are concerned with the case where the skew exists in column g1 or g2) the skew may continue to exist after the hash aggregation, so we spray.