Hive Group By的实现原理

最新推荐文章于 2025-03-16 20:38:29 发布

数仓大山哥

最新推荐文章于 2025-03-16 20:38:29 发布

阅读量2.6k

点赞数 1

分类专栏： hive Hadoop系列文章标签： group by hive 执行原理

本文链接：https://blog.csdn.net/panfelix/article/details/107658067

版权

hive 同时被 2 个专栏收录

34 篇文章

订阅专栏

Hadoop系列

25 篇文章

订阅专栏

本文深入探讨了Hive中GroupBy操作的多种执行计划，包括MapAggr与Skew的不同组合，以及Rollup的使用。解释了不同配置下，如hive.map.aggr和hive.groupby.skewindata的设置如何影响数据处理流程，展示了MapReduce过程中聚合操作的具体细节。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

group by多字段

select rank, isonline, count(*) from city group by rank, isonline;

将GroupBy的字段组合为map的输出key值，利用MapReduce的排序，在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下（当然这里只是说明Reduce端的非Hash聚合过程）

group by单字段

group by 单字段和多字段时的不同在于key上，以如下例子为例（出处太多）：
SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

hive> SELECT * FROM logs;
a   苹果  5
a   橙子  3
a      苹果   2
b   烧鸡  1

hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
a   10
b   1

其过程如下图：

默认设置了hive.map.aggr=true，所以会在mapper端先group by一次，最后再把结果merge起来，为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash，reducer是mergepartial。如果把hive.map.aggr=false，那将groupby放到reducer才做，他的mode是complete.

Explain

hive> explain SELECT uid, sum(count) FROM logs group by uid;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL uid)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_TABLE_OR_COL count)))) (TOK_GROUPBY (TOK_TABLE_OR_COL uid))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs 
          TableScan // 扫描表
            alias: logs
            Select Operator //选择字段
              expressions:
                    expr: uid
                    type: string
                    expr: count
                    type: int
              outputColumnNames: uid, count
              Group By Operator //这里是因为默认设置了hive.map.aggr=true，会在mapper先做一次聚合，减少reduce需要处理的数据
                aggregations:
                      expr: sum(count) //聚集函数
                bucketGroup: false
                keys: //键
                      expr: uid
                      type: string
                mode: hash //hash方式，processHashAggr()
                outputColumnNames: _col0, _col1
                Reduce Output Operator //输出key，value给reducer
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator

          aggregations:
                expr: sum(VALUE._col0)
//聚合
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial //合并值
          outputColumnNames: _col0, _col1
          Select Operator //选择字段
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator //输出到文件
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

细分Group by 的多种执行计划

在上文中给出的只是在Map Aggr为true时的执行计划,而Group by的执行计划细分的话有很多种。

术语说明

在介绍之前，首先明白几个术语：
- (No) Map Aggr：hive.map.aggr配置变量的缩写，可以设置为true或false，代表了map端是否开启聚合功能。
- (No) Skew: hive.groupby.skewindata配置变量的缩写，可以设置为true或false，表示是否是倾斜表等。
- Rollup: rollup比较神奇，rollup可以实现从右到做递减多级的统计，显示统计某一层次结构的聚合。
如：

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id with rollup;

等价以下sql语句：

select device_id,os_id,app_id,client_version,from_id,count(user_id) 
from test_xinyan_reg 
group by device_id,os_id,app_id,client_version,from_id 
grouping sets ((device_id,os_id,app_id,client_version,from_id),(device_id,os_id,app_id,client_version),(device_id,os_id,app_id),(device_id,os_id),(device_id),());

在此不做Rollup的进一步说明。

基于Map Aggr与Skew的取值，可以产生四种不同的执行计划，再加上Rollup可以拓展到6种，下文对这六种做简要的介绍。

Map Aggr & No Skew:

该执行计划可以说就是上文group by 单字段中给出的那种情况，在map端多了一个聚合操作。

执行过程如下（为了原汁原味，此处就给出英文了）：
- Mapper:
*Hash-based group by operator to perform partial aggregations
*Reduce sink operator, performs some partial aggregations
- Reducer:
*MergePartial (list-based) group by operator to perform final aggregations

Map Aggr & Skew

为了处理一些额外的数据行，该种执行计划中，只在Map端的hash-based aggregation operator有更改。
执行过程如下：
- Mapper 1:
*Hash-based group by operator to perform partial aggregations
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 1:
*Partials (list-based) group by operator to perform further partial aggregations
Mapper 2:
*Reduce sink operator, performs some partial aggregations
Reducer 2:
*Final (list-based) group by operator to perform final aggregations

需要注意的是，如果没有Group by keys或者是distinct keys, Reduce 1 和Mapper 2将不会执行，且Mapper 1中的reduce sink operator 也不会spray.

No Map Aggr & No Skew & No Rollup

该种执行计划中，有没有No Rollup其实是没有什么影响的，代表的是一个意思，即就是No Map Aggr & No Skew的情况，其执行过程如下：

Mapper:
*Reduce sink operator, performs some partial aggregations
Reducer:
*Complete (list-based) group by operator to perform all aggregations

No Map Aggr & No Skew & With Rollup

说明：尽管此条中是No Map Aggr，在使用With Rollup时， hiveconf:hive.map.aggr要为true。
我的理解是No Map Aggr表示语句中没有需要在Map端聚合的语句，如Count等。

其执行过程如下：

Mapper 1:
*Reduce sink operator, does not perform any partial aggregations
Reducer 1:
*Hash-based group by operator, much like the one used in the mappers of previous cases
Mapper 2:
*Reduce sink operator, performs some partial aggregations
Reducer 2:
*MergePartial (list-based) group by operator to perform remaining aggregations

No Map Aggr & Skew & (No Distinct or No Rollup)

在该执行计划下，list-based group by operator 做了更改，以适应rollup使用时对额外行的处理。

Mapper 1:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise
Reducer 1:
*Partial1 (list-based) group by operator to perform partial aggregations, it makes use of the new list-based group by operator implementation for rollup if necessary
Mapper 2:
*Reduce sink operator, performs some partial aggregations
Reducer 2:
*Final (list-based) group by operator to perform remaining aggregations

No Map Aggr & Skew & Distinct & Rollup

Mapper 1:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 1:
*Hash-based group by operator, much like the one used in the mappers of previous cases

Mapper 2:
*Reduce sink operator to spray by the group by and distinct keys (if there is a distinct key) or a random number otherwise

Reducer 2:
*Partials (list-based) group by operator to perform further partial aggregations

Mapper 3:
*Reduce sink operator, performs some partial aggregations

Reducer 3:
*Final (list-based) group by operator to perform final aggregations

注意：如果没有Group by keys或者是distinct keys, Reduce 2 和Mapper 3将不会执行，且Mapper 2中的reduce sink operator 也不会spray.

Also, note that the reason for Mapper 2 spraying is that if the skew in the data existed in a column that is not immediately nulled by the rollup (e.g. if we the group by keys are columns g1, g2, g3 in that order, we are concerned with the case where the skew exists in column g1 or g2) the skew may continue to exist after the hash aggregation, so we spray.