HIVE简介
讲explain之前,先了解大家常说的用hive查询取数和etl的过程,但是在讲hive查询取数过程,又得先了解大家又常说的hadoop集群到底是干什么的?
一句话概括:Hadoop就是存储海量数据和分析海量数据的工具。
- Hadoop是由java语言编写的,在分布式服务器集群上存储海量数据并运行分布式分析应用的开源框架,其核心部件是HDFS与MapReduce。
- HDFS是一个分布式文件系统:引入存放文件元数据信息的服务器Namenode和实际存放数据的服务器Datanode,对数据进行分布式储存和读取。
- MapReduce是一个计算框架:MapReduce的核心思想是把计算任务分配给集群内的服务器里执行。通过对计算任务的拆分(Map计算/Reduce计算)再根据任务调度器(JobTracker)对任务进行分布式计算。
Hadoop执行的是MR的任务,相对来说开发成本高,而我只会写sql的菜鸟,根本没办法开发。这个时候有大聪明就给我这样的菜鸟开发了hive这样的工具,告诉我只要在这写sql就行了。
Hive是基于Hadoop的一个数据仓库工具。通过hive,我们可以方便地进行ETL与数据查询的工作。Hive定义了一个类似于SQL的查询语言:HQL,能够将用户编写的HQL转化为相应的Mapreduce程序基于Hadoop执行。
一直在说mr,hadoop也是mr,到了hive后还是mr,那么mapreduce的工作流到底是啥样呢?
mapreduce工作流程
mapreduce 分为:map & reduce
MapReduce用于大规模数据集(大于1TB)的并行运算。概念**“Map(映射)”和“Reduce(归纳)”**以及它们的主要思想,都是从函数式编程语言借来的,还有从矢量编程语言借来的特性。
Map(映射)函数:用来把一组键值对映射成一组新的键值对。
Reduce(归纳)函数,用来保证所有映射的键值对中的每一个共享相同的键组。
如下图所示:
下面一个经典的“Hello World”例程—单词计数来分析MapReduce的逻辑,如下图 所示。一般的 MapReduce 程序会经过以下几个过程:输入(Input)、输入分片(Splitting)、Map阶段、Shuffle阶段、Reduce阶段、输出(Final result)。
1、输入就不用说了,数据一般放在 HDFS 上面就可以了,而且文件是被分块的。关于文件块和文件分片的关系,在输入分片中说明。
2、输入分片:在进行 Map 阶段之前,MapReduce 框架会根据输入文件计算输入分片(split),每个输入分片会对应一个 Map 任务,输入分片往往和 HDFS 的块关系很密切。例如,HDFS 的块的大小是 128M,如果我们输入两个文件,大小分别是 27M、129M,那么 27M 的文件会作为一个输入分片(不足 128M 会被当作一个分片),而 129MB 则是两个输入分片(129-128=1,不足 128M,所以 1M 也会被当作一个输入分片),所以,一般来说,一个文件块会对应一个分片。Splitting 对应下面的三个数据应该理解为三个分片。
3、Map 阶段:这个阶段的处理逻辑就是编写好的 Map 函数,因为一个分片对应一个 Map 任务,并且是对应一个文件块,所以这里其实是数据本地化的操作,也就是所谓的移动计算而不是移动数据。如上图所示,这里的操作其实就是把每句话进行分割,然后得到每个单词,再对每个单词进行映射,得到单词和1的键值对。
4、Shuffle 阶段:这是“奇迹”发生的地方,MapReduce 的核心其实就是 Shuffle。那么 Shuffle 的原理呢?Shuffle 就是将 Map 的输出进行整合,然后作为 Reduce 的输入发送给 Reduce。简单理解就是把所有 Map 的输出按照键进行排序,并且把相对键的键值对整合到同一个组中。如上图所示,Bear、Car、Deer、River 是排序的,并且 Bear 这个键有两个键值对。
5、Reduce 阶段:与 Map 类似,这里也是用户编写程序的地方,可以针对分组后的键值对进行处理。如上图所示,针对同一个键 Bear 的所有值进行了一个加法操作,得到 <Bear,2> 这样的键值对。
6、输出:Reduce 的输出直接写入 HDFS 上,同样这个输出文件也是分块的。
用一张图表示上述的运行流程:MapReduce 的本质就是把一组键值对 <K1,V1> 经过 Map 阶段映射成新的键值对 <K2,V2>;接着经过 Shuffle/Sort 阶段进行排序和“洗牌”,把键值对排序,同时把相同的键的值整合;最后经过 Reduce 阶段,把整合后的键值对组进行逻辑处理,输出到新的键值对 <K3,V3>。这样的一个过程,其实就是 MapReduce 的本质。
上面mapreduce过程的图片比较抽象,然后下面找一张形象的图片
形象的图片总是看上去好像,但是没有说明,我又找一张人家提供的备注讲解图片
到这里整个mr过程形象讲解就到这里了,但是跟我们说的hive和explain又有什么关系呢?
EXPLAIN 解释
一个HIVE查询被转换为一个由一个或多个stage组成的序列(有向无环图DAG)。这些stage可以是MapReduce stage,也可以是负责元数据存储的stage,也可以是负责文件系统的操作(比如移动和重命名)的stage。
HIVE提供了EXPLAIN命令来展示一个查询的执行计划,这个执行计划对于我们了解底层原理,hive 调优,排查数据倾斜等很有帮助,那我们就去了解如何查看这个explain的结果。
使用语法如下:
EXPLAIN [EXTENDED|CBO|AST|DEPENDENCY|AUTHORIZATION|LOCKS|VECTORIZATION|ANALYZE] query
explain 后面可以跟以下可选参数,注意:这几个可选参数不是 hive 每个版本都支持的
EXTENDED:加上 extended 可以输出有关计划的额外信息。这通常是物理信息,例如文件名。这些额外信息对我们用处不大
CBO:输出由Calcite优化器生成的计划。CBO 从 hive 4.0.0 版本开始支持
AST:输出查询的抽象语法树。AST 在hive 2.1.0 版本删除了,存在bug,转储AST可能会导致OOM错误,将在4.0.0版本修复
DEPENDENCY:dependency在EXPLAIN语句中使用会产生有关计划中输入的额外信息。它显示了输入的各种属性
AUTHORIZATION:显示所有的实体需要被授权执行(如果存在)的查询和授权失败
LOCKS:这对于了解系统将获得哪些锁以运行指定的查询很有用。LOCKS 从 hive 3.2.0 开始支持
VECTORIZATION:将详细信息添加到EXPLAIN输出中,以显示为什么未对Map和Reduce进行矢量化。从 Hive 2.3.0 开始支持
ANALYZE:用实际的行数注释计划。从 Hive 2.2.0 开始支持
加参数的太过复杂,但相对的更有针对性,作为菜鸟的我是看不懂的。这里只讲不加参数的explain的结果查看,其他可以自己再查看类比。
Explain 先从最外层开始,包含两个大的部分:
- stage dependencies: 各个stage之间的依赖性
- stage plan: 各个stage的执行计划
先看第一部分 stage dependencies ,包含两个 stage,Stage-1 是根stage,说明这是开始的stage,Stage-0 依赖 Stage-1,Stage-1执行完成后执行Stage-0。
再看第二部分 stage plan,里面有一个 Map Reduce,一个MR的执行计划分为两个部分:
- Map Operator Tree: MAP端的执行计划树
- Reduce Operator Tree: Reduce端的执行计划树
这两个执行计划树里面包含这条sql语句的 operator:
1、map端第一个操作肯定是加载表,所以就是 TableScan 表扫描操作,常见的属性:
- alias: 表名称
- Statistics: 表统计信息,包含表中数据条数,数据大小等
2、Select Operator: 选取操作,常见的属性 :
- expressions:需要的字段名称及字段类型
- outputColumnNames:输出的列名称
- Statistics:表统计信息,包含表中数据条数,数据大小等
3、Group By Operator:分组聚合操作,常见的属性:
- aggregations:显示聚合函数信息
- mode:聚合模式,值有 hash:随机聚合,就是hash partition;partial:局部聚合;final:最终聚合
- keys:分组的字段,如果没有分组,则没有此字段
- outputColumnNames:聚合之后输出列名
- Statistics: 表统计信息,包含分组聚合之后的数据条数,数据大小等
4、Reduce Output Operator:输出到reduce操作,常见属性:
- sort order:值为空 不排序;值为 + 正序排序,值为 - 倒序排序;值为 ± 排序的列为两列,第一列为正序,第二列为倒序
5、Filter Operator:过滤操作,常见的属性:
- predicate:过滤条件,如sql语句中的where id>=1,则此处显示(id >= 1)
6、Map Join Operator:join 操作,常见的属性:
- condition map:join方式 ,如Inner Join 0 to 1 Left Outer Join0 to 2
- keys: join 的条件字段
- outputColumnNames: join 完成之后输出的字段
- Statistics: join 完成之后生成的数据条数,大小等
7、File Output Operator:文件输出操作,常见的属性
- compressed:是否压缩
- table:表的信息,包含输入输出文件格式化方式,序列化方式等
8、Fetch Operator 客户端获取数据操作,常见的属性:
- limit,值为 -1 表示不限制条数,其他值为限制的条数
好了,到这里再翻到上面 explain 的查询结果,是不是感觉基本都能看懂了。
explain实战:
执行语句:(查询每月会员首次消费日期)
explain select concat(substring(cdate,1,7),'-01') as cmonth
,mem_type
,mem_id
,min(cdate) as cdate
from member_order
where iss = 1 and is_flag <> 1 and oh_type = 'gn'
group by concat(substring(cdate,1,7),'-01')
,mem_type
,mem_id
执行计划查看:
Explain
STAGE DEPENDENCIES: -- 执行顺序stage的依赖关系
Stage-1 is a root stage -- 根节点
Stage-0 depends on stages: Stage-1 -- stage-0 依赖 stage -1 执行完成
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree: -- map阶段
TableScan
alias: member_order -- 扫描表显示表信息
Statistics: Num rows: 393173860 Data size: 98293465088 Basic stats: COMPLETE Column stats: NONE
Filter Operator -- 过滤操作相当于where 或者部分 on的条件
predicate: (((iss = 1) and (is_flag <> 1)) and (oh_type = 'gn')) (type: boolean)
Statistics: Num rows: 98293465 Data size: 24573366272 Basic stats: COMPLETE Column stats: NONE
Select Operator -- 选取列
expressions: cdate (type: timestamp), mem_type (type: string), mem_id (type: string)
outputColumnNames: cdate, mem_type, mem_id
Statistics: Num rows: 98293465 Data size: 24573366272 Basic stats: COMPLETE Column stats: NONE
Group By Operator -- 聚合操作
aggregations: min(cdate) -- 显示聚合函数
keys: concat(substring(cdate, 1, 7), '-01') (type: string), mem_type (type: string), mem_id (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 98293465 Data size: 24573366272 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator -- 输出到reduce端的排序结果
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
sort order: +++ -- 排序后的结果,已经处理了concat(substring(cdate, 1, 7), '-01') 为新的key
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 98293465 Data size: 24573366272 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: timestamp) -- 预聚合字段的输出坑位
Reduce Operator Tree: -- reduce阶段
Group By Operator
aggregations: min(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3 -- 填坑操作
Statistics: Num rows: 49146732 Data size: 12286683011 Basic stats: COMPLETE Column stats: NONE
File Output Operator -- 文件输出操作
compressed: false
Statistics: Num rows: 49146732 Data size: 12286683011 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator -- 客户端获取数据操作
limit: -1 -- 没有limit语句,stage-0是没有任何操作的阶段
Processor Tree:
ListSink
执行结果查看:(做了一部分删减)
当前任务的jobName为:20210622113635742_fab56710-09b3-4a1f-b391-caf05f1d6573_mr
INFO : Number of reduce tasks not specified. Estimated from input data size: 50
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Job = job_1623485006032_477104, Tracking URL = http://rm1.yarn-common.rm.bigdata.ly:15844/proxy/application_1623485006032_477104/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623485006032_477104
INFO : Hadoop job information for Stage-1: number of mappers: 172; number of reducers: 50
INFO : 2021-06-22 11:38:50,659 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-22 11:47:40,327 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8524.43 sec
INFO : MapReduce Total cumulative CPU time: 0 days 2 hours 22 minutes 4 seconds 430 msec
INFO : Ended Job = job_1623485006032_477104
INFO : Stage-4 is filtered out by condition resolver.
INFO : Stage-3 is selected by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Number of reduce tasks is set to 0 since there‘s no reduce operator
INFO : Starting Job = job_1623485006032_477313, Tracking URL = http://test/proxy/application_1623485006032_477313/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623485006032_477313
INFO : Hadoop job information for Stage-3: number of mappers: 35; number of reducers: 0
INFO : 2021-06-22 11:48:13,760 Stage-3 map = 0%, reduce = 0%
INFO : 2021-06-22 11:50:44,984 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1050.53 sec
INFO : MapReduce Total cumulative CPU time: 17 minutes 30 seconds 530 msec
INFO : Ended Job = job_1623485006032_477313
INFO : Moving data to: /test/cs_20210609 from viewfs://test/cs_20210609/.hive-staging_hive_2021-06-22_11-38-06_121_6338166242465351708-279/-ext-10001
-------------------------------------------
INFO : Table cs_20210609 stats: [numFiles=35, numRows=189958534, totalSize=9984586410, rawDataSize=9794627876]
执行语句:(分会员类型一段语句)
select t1.cmonth as cdate
,t1.mem_type
,case when t1.cdate = t2.f_date then '首次下单' else '再次下单' end as mclassify
,count(distinct t1.mem_id) as mem_count_num
from ( -- 会员月度消费情况
select concat(substring(cdate,1,7),'-01') as cmonth
,mem_type
,mem_id
,min(cdate) as cdate
from member_order
where iss = 1 and is_flag <> 1 and oh_type = 'gn'
group by concat(substring(cdate,1,7),'-01')
,mem_type
,mem_id
) as t1
left join member_first as t2 -- 会员首次消费详情
on t1.mem_type = t2.mem_type and t1.mem_id = t2.mem_id
group by t1.cmonth
,t1.mem_type
,case when t1.cdate = t2.f_date then '首次下单' else '再次下单' end
完整执行计划:
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-6 depends on stages: Stage-1 , consists of Stage-7, Stage-2
Stage-7 has a backup stage: Stage-2
Stage-5 depends on stages: Stage-7
Stage-3 depends on stages: Stage-2, Stage-5
Stage-2
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: member_order
Statistics: Num rows: 391388758 Data size: 97847189504 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((iss = 1) and (is_flag <> 1)) and (oh_type = 'gn')) (type: boolean)
Statistics: Num rows: 97847189 Data size: 24461797251 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: cdate (type: timestamp), mem_type (type: string), mem_id (type: string)
outputColumnNames: cdate, mem_type, mem_id
Statistics: Num rows: 97847189 Data size: 24461797251 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: min(cdate)
keys: concat(substring(cdate, 1, 7), '-01') (type: string), mem_type (type: string), mem_id (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 97847189 Data size: 24461797251 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 97847189 Data size: 24461797251 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: timestamp)
Reduce Operator Tree:
Group By Operator
aggregations: min(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 48923594 Data size: 12230898500 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-6
Conditional Operator
Stage: Stage-7
Map Reduce Local Work
Alias -> Map Local Tables:
t2
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t2
TableScan
alias: t2
HashTable Sink Operator
keys:
0 _col1 (type: string), _col2 (type: string)
1 mem_type (type: string), mem_num (type: string)
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col1 (type: string), _col2 (type: string)
1 mem_type (type: string), mem_id (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col22, _col26
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col3 (type: timestamp), _col22 (type: timestamp), _col26 (type: int), _col2 (type: string)
outputColumnNames: _col0, _col1, _col3, _col22, _col26, _col2
Group By Operator
aggregations: count(DISTINCT _col2)
keys: _col0 (type: string), _col1 (type: string), CASE WHEN (((_col3 = _col22) THEN ('首次下单') ELSE ('再次下单') END (type: string), _col2 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: true
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 143106343 Data size: 34917947944 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col3:0._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 71553171 Data size: 17458973850 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: 'M' (type: string), _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 71553171 Data size: 17458973850 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 71553171 Data size: 17458973850 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col1 (type: string), _col2 (type: string)
sort order: ++
Map-reduce partition columns: _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 48923594 Data size: 12230898500 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col3 (type: timestamp)
TableScan
alias: t2
Statistics: Num rows: 130096673 Data size: 31743588352 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: mem_type (type: string), mem_id (type: string)
sort order: ++
Map-reduce partition columns: mem_type (type: string), mem_id (type: string)
Statistics: Num rows: 130096673 Data size: 31743588352 Basic stats: COMPLETE Column stats: NONE
value expressions: f_date (type: timestamp)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col1 (type: string), _col2 (type: string)
1 mem_type (type: string), mem_id (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col22, _col26
Statistics: Num rows: 143106343 Data size: 34917947944 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col3 (type: timestamp), _col22 (type: timestamp), _col26 (type: int), _col2 (type: string)
outputColumnNames: _col0, _col1, _col3, _col22, _col26, _col2
Statistics: Num rows: 143106343 Data size: 34917947944 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT _col2)
keys: _col0 (type: string), _col1 (type: string), CASE WHEN (((_col3 = _col22) THEN ('首次下单') ELSE ('再次下单') END (type: string), _col2 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 143106343 Data size: 34917947944 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: true
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
第一部分:STAGE DEPENDENCIES
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-6 depends on stages: Stage-1 , consists of Stage-7, Stage-2
Stage-7 has a backup stage: Stage-2
Stage-5 depends on stages: Stage-7
Stage-3 depends on stages: Stage-2, Stage-5
Stage-2
Stage-0 depends on stages: Stage-3
这里发现为什么不是简单的顺序执行(只有depends on),还有这么复杂的 consists of 、has a backup英文也不认识,翻译过来认识了又看不懂。
其实在执行计划中可以看到mapjoin执行计划的影子(Map Reduce Local Work),简单的说下关于mapjoin的执行过程。
上面语句中Join的两张表一张表是临时表,就会生成一个ConditionalTask,在运行期间判断是否使用MapJoin
那其实完整的应该是 stage -2 mr过程,保留原始执行计划为backup plan ,然后拷贝一份生成一个mapjoin的操作计划来代替stage-2,生成一个mapreducelocalwork:stage -7 和 -5 ,那么stage依赖的执行计划图应该如下:
执行日志查看:
INFO : Number of reduce tasks not specified. Estimated from input data size: 50
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Job = job_1623484184026_605053, Tracking URL = http://test/proxy/application_1623484184026_605053/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623484184026_605053
INFO : Hadoop job information for Stage-1: number of mappers: 172; number of reducers: 50
INFO : 2021-06-22 16:52:40,323 Stage-1 map = 0%, reduce = 0%
INFO : 2021-06-22 16:57:56,889 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8223.73 sec
INFO : MapReduce Total cumulative CPU time: 0 days 2 hours 17 minutes 3 seconds 730 msec
INFO : Ended Job = job_1623484184026_605053
INFO : Stage-13 is filtered out by condition resolver.
INFO : Stage-2 is selected by condition resolver.
INFO : Number of reduce tasks not specified. Estimated from input data size: 22
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Job = job_1623484184026_605119, Tracking URL = http://test/proxy/application_1623484184026_605119/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623484184026_605119
INFO : Hadoop job information for Stage-2: number of mappers: 80; number of reducers: 22
INFO : 2021-06-22 17:00:17,806 Stage-2 map = 0%, reduce = 0%
INFO : 2021-06-22 17:05:11,454 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 5335.03 sec
INFO : MapReduce Total cumulative CPU time: 0 days 1 hours 28 minutes 55 seconds 30 msec
INFO : Ended Job = job_1623484184026_605119
INFO : Number of reduce tasks not specified. Estimated from input data size: 7
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : Starting Job = job_1623484184026_605251, Tracking URL = http://test/proxy/application_1623484184026_605251/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623484184026_605251
INFO : Hadoop job information for Stage-3: number of mappers: 23; number of reducers: 7
INFO : 2021-06-22 17:05:18,899 Stage-3 map = 0%, reduce = 0%
INFO : 2021-06-22 17:09:32,182 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 3249.12 sec
INFO : MapReduce Total cumulative CPU time: 54 minutes 9 seconds 120 msec
INFO : Ended Job = job_1623484184026_605251
INFO : Stage-6 is filtered out by condition resolver.
INFO : Stage-5 is selected by condition resolver.
INFO : Stage-7 is filtered out by condition resolver.
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : Starting Job = job_1623484184026_605343, Tracking URL = http://test/proxy/application_1623484184026_605343/
INFO : Kill Command = /BigData/run/hadoop/bin/hadoop job -kill job_1623484184026_605343
INFO : Hadoop job information for Stage-5: number of mappers: 1; number of reducers: 0
INFO : 2021-06-22 17:09:41,588 Stage-5 map = 0%, reduce = 0%
INFO : 2021-06-22 17:09:51,946 Stage-5 map = 100%, reduce = 0%, Cumulative CPU 2.24 sec
-------------------------------------------
INFO : MapReduce Total cumulative CPU time: 2 seconds 240 msec
INFO : Ended Job = job_1623484184026_605343
INFO : Moving data to: /test/cs_11_20191113 from viewfs://test/cs_11_20191113_11_20191113/.hive-staging_hive_2021-06-22_16-52-12_407_5993177840848507921-3/-ext-10001
INFO : Table cs_11_20191113 stats: [numFiles=1, numRows=652, totalSize=21528, rawDataSize=20876]
注:在执行计划的结果里面可以看到,最终被选择的stage步骤
explian实践:
1. join 语句会过滤 null 的值吗?
2. group by 分组语句会进行排序吗?
3. 哪条sql执行效率高呢?