Hive: how count(distinct) and group by differ when counting

Latest recommended article published on 2022-09-02 17:02:32

梁丰

Let's use explain to look at the execution plan of each query directly.
select count(distinct remote_addr) uv from ods_weblog_visit where datastr = '20181101';
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: ods_weblog_visit // table |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Select Operator // select operation |
| expressions: remote_addr (type: string) // column being queried |
| outputColumnNames: remote_addr // output column |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator // group-by operation |
| aggregations: count(DISTINCT remote_addr) // aggregate function applied |
| keys: remote_addr (type: string) // input key |
| mode: hash |
| outputColumnNames: _col0, _col1 // output |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator // output handed to the reducer |
| key expressions: _col0 (type: string) |
| sort order: + |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: |
| Group By Operator // the reducer runs another group-by over its input |
| aggregations: count(DISTINCT KEY._col0:0._col0) // aggregate function applied to the input |
| mode: mergepartial |
| outputColumnNames: _col0 // output |
| Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 // no limit |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
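
To make the plan above easier to follow, here is a minimal Python sketch of the data flow it describes. It is a toy simulation, not Hive code: the function names (`map_side_dedupe`, `single_reducer_count_distinct`) and the sample IP addresses are made up for illustration. The one detail taken from the plan is that its Reduce Output Operator lists no "Map-reduce partition columns", which the sketch models by sending every map output to the same reducer.

```python
from typing import Iterable, List, Set


def map_side_dedupe(split: Iterable[str]) -> Set[str]:
    """Map-side Group By (mode: hash): keep each remote_addr once per split."""
    return set(split)


def single_reducer_count_distinct(map_outputs: List[Set[str]]) -> int:
    """Reduce-side Group By (mode: mergepartial), modeled as one reducer.

    With no partition column, every key from every mapper arrives at the
    same place, and that single reducer counts the distinct keys alone.
    """
    seen: Set[str] = set()
    for keys in map_outputs:
        seen.update(keys)
    return len(seen)


# Hypothetical input splits of remote_addr values (illustrative data only).
splits = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    ["10.0.0.2", "10.0.0.3"],
]
uv = single_reducer_count_distinct([map_side_dedupe(s) for s in splits])
print(uv)
```

In this model the final deduplication is serial: however many mappers there are, one reducer has to see every distinct key.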

select count(1) from (select remote_addr from ods_weblog_visit group by remote_addr) a ;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1 |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: ods_weblog_visit // table |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Select Operator // select operation |
| expressions: remote_addr (type: string) // selected column |
| outputColumnNames: remote_addr // output column |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator // group-by operation |
| keys: remote_addr (type: string) // group by remote_addr |
| mode: hash // hash-based aggregation |
| outputColumnNames: _col0 // output; _col0 is a temporary column |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1727 Data size: 172771 Basic stats: COMPLETE Column stats: NONE |
| Reduce Operator Tree: // reduce phase |
| Group By Operator // group-by continues |
| keys: KEY._col0 (type: string) |
| mode: mergepartial // merge by key |
| outputColumnNames: _col0 |
| Statistics: Num rows: 863 Data size: 86335 Basic stats: COMPLETE Column stats: NONE |
| Select Operator // select operation |
| Statistics: Num rows: 863 Data size: 86335 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator // group-by operation |
| aggregations: count(1) // one aggregation pass |
| mode: hash // hash mode |
| outputColumnNames: _col0 // output |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator // group-by operation |
| aggregations: count(VALUE._col0) // run the aggregate function |
| mode: mergepartial // merge step |
| outputColumnNames: _col0 // output |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
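
The two-stage plan above can be sketched the same way in Python. Again this is a toy model under stated assumptions, not Hive internals: `partition`, `stage1`, `stage2`, the CRC32 hash, and the sample data are all hypothetical stand-ins. What it takes from the plan is that Stage-1 partitions on _col0 (remote_addr) and ends with a partial count(1), and that Stage-2 merges those partial counts.

```python
import zlib
from typing import Iterable, List, Set


def partition(key: str, num_reducers: int) -> int:
    """Stable stand-in for 'Map-reduce partition columns: _col0'."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def stage1(splits: Iterable[Iterable[str]], num_reducers: int) -> List[int]:
    """Stage-1: group by remote_addr across reducers, then partial count(1)."""
    buckets: List[Set[str]] = [set() for _ in range(num_reducers)]
    for split in splits:
        for key in set(split):  # map-side Group By (mode: hash)
            # Shuffle by key; the set models the reduce-side mergepartial.
            buckets[partition(key, num_reducers)].add(key)
    # Each reducer emits how many distinct keys it ended up with.
    return [len(bucket) for bucket in buckets]


def stage2(partial_counts: List[int]) -> int:
    """Stage-2: merge the partial counts into the final count."""
    return sum(partial_counts)


# Hypothetical input splits of remote_addr values (illustrative data only).
splits = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    ["10.0.0.2", "10.0.0.3"],
]
total = stage2(stage1(splits, num_reducers=4))
print(total)
```

Because every key lands in exactly one bucket, summing the per-reducer counts gives the number of distinct keys regardless of how many reducers are used. The price is a second job that only has to add up a handful of numbers.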

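Comparing the two plans: the count(distinct) query runs as a single MapReduce stage (Stage-1), and its Reduce Output Operator lists no "Map-reduce partition columns", which suggests the final distinct count is not spread across reducers by key. The group by version runs two MapReduce stages: Stage-1 partitions on _col0 (remote_addr), so the deduplication can be shared among reducers, each of which emits a partial count(1), and Stage-2 merges those partial counts. Note that the first query filters on datastr while the second has no where clause, so the two plans do not describe exactly the same scan.

For more detail, see https://www.cnblogs.com/cxzdy/p/5116222.html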
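
Besides the execution plan, the two forms can also differ in their result when the column contains NULLs: under standard SQL rules count(distinct col) skips NULL, while group by col produces a NULL group that the outer count(1) then counts. The sketch below checks this in SQLite through Python's sqlite3 module, used here only as a convenient stand-in. It is not Hive, the sample rows are invented, and the behavior is worth confirming on your own cluster before relying on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ods_weblog_visit (remote_addr text, datastr text)")

# Invented sample rows: two distinct addresses plus one NULL address.
rows = [
    ("10.0.0.1", "20181101"),
    ("10.0.0.1", "20181101"),
    ("10.0.0.2", "20181101"),
    (None, "20181101"),
]
conn.executemany("insert into ods_weblog_visit values (?, ?)", rows)

# Form 1: count(distinct ...) ignores the NULL address.
count_distinct = conn.execute(
    "select count(distinct remote_addr) from ods_weblog_visit"
).fetchone()[0]

# Form 2: group by keeps a NULL group, and count(1) counts it.
group_by_count = conn.execute(
    "select count(1) from "
    "(select remote_addr from ods_weblog_visit group by remote_addr) a"
).fetchone()[0]

# Form 2 with NULLs filtered out first, to match form 1.
group_by_non_null = conn.execute(
    "select count(1) from "
    "(select remote_addr from ods_weblog_visit "
    "where remote_addr is not null group by remote_addr) a"
).fetchone()[0]

print(count_distinct, group_by_count, group_by_non_null)
conn.close()
```

If the group by rewrite is meant as a drop-in replacement for count(distinct), filtering out NULLs in the subquery keeps the two results aligned.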
