Hive Advanced: A Deep Dive into Hive Internals - How DISTINCT Is Implemented

Adapted, with minor changes, from "Hive – Distinct 的实现":
http://ju.outofmemory.cn/entry/784

This analysis is based on Hive 1.1.0. When time permits, it would also be worth examining how the same query is implemented under Hive on Spark.

The statement to analyze

SELECT count, COUNT(DISTINCT uid) FROM logs GROUP BY count;

Group by count and compute the number of distinct users for each group.
In business terms: how many people own x items?

Preparing the data

create table logs(uid string,name string,count string);
insert into table logs values('a','apple','3'),('a','orange','3'),('a','banana','1'),('b','banana','3');
select * from logs;
OK
a       apple   3
a       orange  3
a       banana  1
b       banana  3
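
Before digging into the execution plan, the expected result of the query on this sample data can be reproduced with a short Python sketch (a toy illustration, not Hive code):

```python
from collections import defaultdict

# The sample rows from the logs table: (uid, name, count)
rows = [
    ("a", "apple", "3"),
    ("a", "orange", "3"),
    ("a", "banana", "1"),
    ("b", "banana", "3"),
]

# GROUP BY count, COUNT(DISTINCT uid): collect the distinct uids per count value
groups = defaultdict(set)
for uid, _name, count in rows:
    groups[count].add(uid)

result = {count: len(uids) for count, uids in groups.items()}
print(sorted(result.items()))  # [('1', 1), ('3', 2)]
```

So count='3' has two distinct users (a and b), and count='1' has one (a). This is the answer the MapReduce job below must arrive at.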

The computation

(Figure: the map/shuffle/reduce data flow for this query; original image not preserved.)

  1. First, the mapper computes partial results, keyed on (count, uid). For a DISTINCT aggregate, a row whose composite key has already been seen is skipped. So the first stage keys on the (count, uid) combination, while the second stage keys on count alone.
  2. ReduceSink fires when mapper.close() runs; GroupByOperator.close() emits the buffered results. Note that although the key is (count, uid), the shuffle partitions rows by count only!
  3. The DISTINCT value computed on the map side is never used; the accurate count is only computed in the reducer. Map-side aggregation merely collapses rows with identical key combinations. (A plain COUNT, by contrast, produces partial counts that are merged downstream.)
  4. On the reduce side, DISTINCT decides whether to add 1 by comparing against lastInvoke: since reducer input is sorted, it only needs to check whether the distinct column changed; if it did not change, it does not increment.

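The four steps above can be sketched as a toy simulation in Python (my own naming and simplifications, not Hive's actual code; the rows are split across two hypothetical mappers so that the (count='3', uid='a') pair appears on both):

```python
from collections import defaultdict

# The 4 sample rows as (uid, count), split across two hypothetical mappers.
splits = [
    [("a", "3"), ("a", "1")],  # mapper 1
    [("a", "3"), ("b", "3")],  # mapper 2
]

# Step 1: each mapper hash-dedups locally on the composite key (count, uid).
# The map-side "partial distinct count" itself is never used downstream.
map_outputs = []
for split in splits:
    seen = set()
    for uid, count in split:
        key = (count, uid)
        if key not in seen:
            seen.add(key)
            map_outputs.append(key)

# Step 2: shuffle -- partition by count only ("Map-reduce partition columns:
# _col0"), then sort each partition by (count, uid) ("sort order: ++").
# Duplicates that survived per-mapper dedup meet again here.
partitions = defaultdict(list)
for count, uid in map_outputs:
    partitions[count].append(uid)
for uids in partitions.values():
    uids.sort()

# Steps 3-4: reduce side. Input is sorted, so DISTINCT only compares the
# current uid against the previous one (the lastInvoke idea) and adds 1
# only when the distinct column changes.
result = {}
for count, uids in partitions.items():
    last_uid, distinct = None, 0
    for uid in uids:
        if uid != last_uid:
            distinct += 1
        last_uid = uid
    result[count] = distinct

print(sorted(result.items()))  # [('1', 1), ('3', 2)]
```

Note how the reducer for count='3' receives the sorted stream a, a, b: the duplicate 'a' (emitted by both mappers) is absorbed by the lastInvoke comparison, which is why the reduce-side count is accurate even though map-side dedup is only per-mapper.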
Operator

(Figure: the operator tree for this query; original image not preserved.)

Explain

explain select count, count(distinct uid) from logs group by count;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: logs
            Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: count (type: string), uid (type: string)
              outputColumnNames: count, uid
              Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count(DISTINCT uid)
                keys: count (type: string), uid (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 4 Data size: 39 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
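
One detail of the plan deserves emphasis: the key expressions are (_col0, _col1), but the partition column is _col0 alone. A minimal sketch of the property this buys us (a hypothetical partitioner, not Hive's):

```python
import zlib

NUM_REDUCERS = 4

def partition(value: str) -> int:
    # Deterministic stand-in for a shuffle key hash.
    return zlib.crc32(value.encode()) % NUM_REDUCERS

# Composite keys emitted by the mappers: (count, uid)
pairs = [("3", "a"), ("1", "a"), ("3", "b")]

# Partitioning on count (_col0) alone guarantees that all uids of a given
# count group converge on a single reducer, which is what makes the
# streaming, sorted DISTINCT count in the reducer correct.
reducers_per_group = {}
for count, _uid in pairs:
    reducers_per_group.setdefault(count, set()).add(partition(count))

all_single = all(len(r) == 1 for r in reducers_per_group.values())
print(all_single)  # True: each count group lands on exactly one reducer

# Had the shuffle partitioned on the full (count, uid) key instead, the
# uids of one group could scatter across reducers, and no single reducer
# would see enough data to count distinct uids for that group.
```

The sort order "++" on (_col0, _col1) then ensures that within each reducer the distinct column arrives sorted, so a single lastInvoke comparison suffices.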