Only by understanding how a join is implemented in the underlying MapReduce code can you write better SQL and do better optimization.
Suppose we have the following SQL:
select model_id from tableA a join tableB b on a.model_id = b.model_id
To deeply understand this SQL you need to know what actually happens in the underlying MapReduce job. I will show you.
- First, let's look at its explain output:
STAGE DEPENDENCIES:
  Stage-4 is a root stage, consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 47242264 Data size: 1943341979 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: model_id (type: string)
              sort order: +
              Map-reduce partition columns: model_id (type: string)
              Statistics: Num rows: 47242264 Data size: 1943341979 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 1871689 Data size: 26749999 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: model_id (type: string)
              sort order: +
              Map-reduce partition columns: model_id (type: string)
              Statistics: Num rows: 1871689 Data size: 26749999 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join 0 to 1
          keys:
            0 model_id (type: string)
            1 model_id (type: string)
          outputColumnNames: _col3
          Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col3 (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
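One line in the plan worth dwelling on is "Map-reduce partition columns: model_id". It means the shuffle routes rows by the join key, so matching rows from tableA and tableB are guaranteed to land on the same reducer. A minimal sketch of how this routing works, modeled on the behavior of Hadoop's default HashPartitioner (the table/key names here are just examples):

```java
// A minimal sketch of shuffle partitioning: Hadoop's default HashPartitioner
// sends a key to (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
// so equal model_id values from tableA and tableB meet in the same reducer.
public class PartitionSketch {
    static int partition(String modelId, int numReduceTasks) {
        // mask off the sign bit so the result is always non-negative
        return (modelId.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // the same model_id, whichever table it comes from,
        // always maps to the same partition
        System.out.println(
            partition("m-1001", reducers) == partition("m-1001", reducers));
    }
}
```

This is also why a join key with heavy skew (one model_id dominating) overloads a single reducer.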
- Below is Java pseudocode; it mirrors the execution plan above.
public static void main(String[] args) {
    // Map phase
    map(inKey, inValue, context);
        // Figure out which table this split comes from:
        // FileSplit inputSplit = (FileSplit) context.getInputSplit();
        // String path = inputSplit.getPath().getName();
        path = context.getPath();
        model_id = inValue.split()[0];
        // Tag each record with its source table, then emit it
        // under the join key (model_id)
        if (path.contains("tableA") && model_id != null) {
            context.write(model_id, 0);   // tag 0 = tableA
        }
        if (path.contains("tableB") && model_id != null) {
            context.write(model_id, 1);   // tag 1 = tableB
        }

    // Reduce phase
    reduce(inKey, inValues, context);
        // For each inKey (model_id), inValues is the collection of tags
        // emitted by the mappers; scan it once to see which tables the key appears in
        boolean fromA = false, fromB = false;
        for (value : inValues) {
            if (value == 0) fromA = true;
            if (value == 1) fromB = true;
        }
        // depending on the join type, the reducer applies one of these checks:
        // inner join: emit only if the key appears in both tables
        if (fromA && fromB) {
            context.write(model_id, null);
        }
        // left join: emit whenever the key appears in tableA
        if (fromA) {
            context.write(model_id, null);
        }
        // right join is symmetric (emit whenever fromB)
        // full join: emit if the key appears in either table
        if (fromA || fromB) {
            context.write(model_id, null);
        }
}
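The pseudocode above can be exercised without a Hadoop cluster. Here is a self-contained, runnable simulation (the table contents are made up for illustration, and keys are assumed distinct within each table): a map groups tagged records by key, mimicking the shuffle, and the inner-join check from the reduce phase is then applied per key.

```java
import java.util.*;

// A self-contained simulation of the reduce-side inner join sketched above.
// The "shuffle" is a TreeMap that groups source-table tags by join key.
public class ReduceJoinSim {
    static List<String> innerJoin(List<String> tableA, List<String> tableB) {
        Map<String, Set<Integer>> shuffle = new TreeMap<>();
        // Map phase: tag each key with 0 (tableA) or 1 (tableB)
        for (String k : tableA) shuffle.computeIfAbsent(k, x -> new HashSet<>()).add(0);
        for (String k : tableB) shuffle.computeIfAbsent(k, x -> new HashSet<>()).add(1);
        // Reduce phase: emit a key only if it carries both tags
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : shuffle.entrySet()) {
            if (e.getValue().contains(0) && e.getValue().contains(1)) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("m1", "m2", "m3");
        List<String> b = Arrays.asList("m2", "m3", "m4");
        System.out.println(innerJoin(a, b)); // [m2, m3]
    }
}
```

Changing the emit condition to `contains(0)` gives the left join, `contains(1)` the right join, and an unconditional emit the full join, exactly as in the pseudocode.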
My SQL was
select model_id from tableA a join tableB b on a.model_id = b.model_id
But if it were
select * from tableA a join tableB b on a.model_id = b.model_id
then in the reduce phase the context would have to write out not only model_id but also its corresponding value expressions (the remaining columns of each row).
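To sketch what that costs: for each model_id the reducer must now buffer the full rows from both sides and emit their cross product, not just the key. The row contents below are invented for illustration:

```java
import java.util.*;

// A sketch of the reduce step for "select *": given the buffered rows from
// both tables that share one join key, emit every A-row x B-row combination,
// carrying the value expressions (the remaining columns) along with the key.
public class SelectStarJoin {
    static List<String> joinOneKey(String modelId, List<String> rowsA, List<String> rowsB) {
        List<String> out = new ArrayList<>();
        for (String ra : rowsA) {
            for (String rb : rowsB) {
                // the reducer writes the key plus both sides' columns
                out.add(modelId + "\t" + ra + "\t" + rb);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rowsA = Arrays.asList("colA1", "colA2"); // two tableA rows for this key
        List<String> rowsB = Arrays.asList("colB1");          // one tableB row for this key
        System.out.println(joinOneKey("m2", rowsA, rowsB).size()); // 2 joined rows
    }
}
```

This is why projecting only the columns you need, instead of select *, shrinks both the shuffle and the reducer's per-key buffer.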