A mapjoin pitfall in Hive 1.1

By setting the hive.auto.convert.join.noconditionaltask.size parameter you can have Hive build the small table into a hashtable and ship it to every worker node as a distributed-cache file, turning the join into a map-side join. A map-side join has real advantages: as the name implies, the reduce phase disappears entirely, which makes it a handy way to sidestep data skew in joins.
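A minimal sketch of the session settings involved (the 20 MB threshold below is an illustrative value, not taken from the original job):

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- threshold is in bytes; it must exceed the small table's size for the conversion to kick in
set hive.auto.convert.join.noconditionaltask.size=20000000;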

This is typically used when joining a big table against a small one: set hive.auto.convert.join.noconditionaltask.size (default 10 MB) to something larger than the small table, and Hive will automatically convert the join into a map-side join. A typical explain looks like this:

STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-2 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a:b 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a:b 
          TableScan
            alias: b
            Statistics: Num rows: 100 Data size: 1714 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 UDFToString(f_uid) (type: string)
                1 uid (type: string)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 385139074 Data size: 4621669010 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ('20160630' BETWEEN '20160630' AND '20160631' and (f_type) IN (101, 102)) (type: boolean)
              Statistics: Num rows: 96284768 Data size: 1155417246 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 UDFToString(f_uid) (type: string)
                  1 uid (type: string)
                outputColumnNames: _col6, _col54, _col55, _col56, _col63
                Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: concat(_col54, _col55, _col56) (type: string), _col6 (type: double), _col63 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count(DISTINCT _col1), count(DISTINCT _col2)
                    keys: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4
                    Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                      sort order: +++
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The key things to notice are the Map Reduce Local Work stage, and which table the TableScan turns into a hashtable.

But... this damned Hive 1.1.0 release has a bug: when the join keys on both sides are doubles, the hashtable-building step is excruciatingly slow. Details: https://issues.apache.org/jira/browse/HIVE-11502

In join ON a.f_uid = B.uid, a.f_uid is a double and B.uid is a string. When the types differ, Hive implicitly converts B.uid to double to do the matching, and that is exactly how we stepped into this pit.

The workaround is to cast both sides to string for the comparison: ON cast(a.f_uid as string) = B.uid
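For context, a hedged sketch of the query shape (the table names big_table and small_table and the selected columns are made up; only the ON clause comes from the original):

-- hypothetical schemas: big_table(f_uid double, f_type int), small_table(uid string)
SELECT a.f_type, b.uid
FROM big_table a
LEFT OUTER JOIN small_table b
  -- cast the double key to string so the hashtable is not built on doubles (HIVE-11502)
  ON cast(a.f_uid as string) = b.uid;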


Finally, a word about that wretched scientific notation. Numeric types like int, float, and double fall back to scientific notation when the values get large, for example:

hive> select pow(10,8) from dual;
OK
1.0E8

You can get rid of it by casting to bigint first and then to string:

cast(cast(a.f_uid as bigint) as string)
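
Repeating the probe above with that double cast (same assumed dual table) should print the plain digits:

hive> select cast(cast(pow(10,8) as bigint) as string) from dual;
OK
100000000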

That no longer works once there are decimal places, though, since the bigint cast truncates them. Fortunately our join keys are integers.
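Putting the two fixes together, the join condition in the sketch above would end up as (again assuming integer-valued keys):

ON cast(cast(a.f_uid as bigint) as string) = b.uid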
