hive join对应的mapreduce代码

只有理解了join 底层的mr代码如何实现,那你才能写出更好的sql,以及更好的优化。

加入有一下sql

select model_id from tableA a join tableB b  on a.model_id= b.model_id

要想深刻的理解这个sql你就得知道他的底层mr到底发生了什么,I will show you

  • 首先我们来了解一下他的explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 47242264 Data size: 1943341979 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: model_id (type: string)
              sort order: +
              Map-reduce partition columns: model_id (type: string)
              Statistics: Num rows: 47242264 Data size: 1943341979 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 1871689 Data size: 26749999 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: model_id (type: string)
              sort order: +
              Map-reduce partition columns: model_id (type: string)
              Statistics: Num rows: 1871689 Data size: 26749999 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join 0 to 1
          keys:
            0 model_id (type: string)
            1 model_id (type: string)
          outputColumnNames: _col3
          Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col3 (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 51966491 Data size: 2137676223 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
  • 一下是java 伪代码 这个其实和执行计划有呼应的
public static void main(String[] args) {
	//map 阶段
	 map(inkey,invalue,context);
	// fileinputsplit inputsplit =context.getinputsplit();
	// String name =inputsplit.getpath().getname;
	path=context.getpath()
	model_id=invalue.split()[0];
	//相同的不同表的key(model_id)用一个标记写出去
	if (path.contain(tableA)&&model_id!=null) {
		context.write(model_id,0);
	}
	if (path.contain(tableB)&&model_id!=null) {
		context.write(model_id,1);
	}

   //reduce 阶段
	reduce(inkey,invalue,context)
		//对于每一个inkey model_id  ,他的invalue是一个集合,所以遍历一下 
		//执行一下判断
		for(value:invalue){
		//inner join reduce 阶段
	if (value.contain('0') && value.contain('1')) {
		context.write(model_id,null)
	}
	//left join reduce 阶段
	if (value.contain('0')) {
		context.write(model_id,null)
	}
	//right join redude 同理
	//full join reduce
	if (value.contain('0')||value.contain('1')) {
		context.write(model_id,0)
	}
	}
}

我的sql是

select model_id from tableA a join tableB b  on a.model_id= b.model_id
那如果是
select * from tableA a join tableB b  on a.model_id= b.model_id
那么reduce阶段的context不仅要写出model_id,还要写出起对应的 value expression
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值