A mapjoin pitfall in Hive 1.1

By setting the hive.auto.convert.join.noconditionaltask.size parameter you can have Hive build the small table into a hashtable and ship it to every worker node as a distributed-cache file, turning the join into a map-side join. A map-side join has real advantages: as the name implies, the reduce phase disappears entirely, which makes it a handy way to sidestep data skew in joins.
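A minimal sketch of the session settings involved (the 20 MB threshold below is an illustrative value, not taken from the original job):

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- threshold is in bytes; it must exceed the small table's size for the conversion to kick in
set hive.auto.convert.join.noconditionaltask.size=20000000;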

This is typically used when joining a big table against a small one: set hive.auto.convert.join.noconditionaltask.size (default 10 MB) to something larger than the small table, and Hive will automatically convert the join into a map-side join. A typical explain looks like this:

STAGE DEPENDENCIES:
  Stage-5 is a root stage
  Stage-2 depends on stages: Stage-5
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-5
    Map Reduce Local Work
      Alias -> Map Local Tables:
        a:b 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        a:b 
          TableScan
            alias: b
            Statistics: Num rows: 100 Data size: 1714 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 UDFToString(f_uid) (type: string)
                1 uid (type: string)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 385139074 Data size: 4621669010 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ('20160630' BETWEEN '20160630' AND '20160631' and (f_type) IN (101, 102)) (type: boolean)
              Statistics: Num rows: 96284768 Data size: 1155417246 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 UDFToString(f_uid) (type: string)
                  1 uid (type: string)
                outputColumnNames: _col6, _col54, _col55, _col56, _col63
                Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: concat(_col54, _col55, _col56) (type: string), _col6 (type: double), _col63 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                  Group By Operator
                    aggregations: count(DISTINCT _col1), count(DISTINCT _col2)
                    keys: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4
                    Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: double), _col2 (type: string)
                      sort order: +++
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 105913247 Data size: 1270958998 Basic stats: COMPLETE Column stats: NONE
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(DISTINCT KEY._col1:0._col0), count(DISTINCT KEY._col1:1._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 52956623 Data size: 635479492 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The key things to notice are the Map Reduce Local Work stage, and which table the TableScan turns into a hashtable.

But... this damned Hive 1.1.0 release has a bug: when the join keys on both sides are doubles, the hashtable-building step is excruciatingly slow. Details: https://issues.apache.org/jira/browse/HIVE-11502

In join ON a.f_uid = B.uid, a.f_uid is a double and B.uid is a string. When the types differ, Hive implicitly converts B.uid to double to do the matching, and that is exactly how we stepped into this pit.

The workaround is to cast both sides to string for the comparison: ON cast(a.f_uid as string) = B.uid
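For context, a hedged sketch of the query shape (the table names big_table and small_table and the selected columns are made up; only the ON clause comes from the original):

-- hypothetical schemas: big_table(f_uid double, f_type int), small_table(uid string)
SELECT a.f_type, b.uid
FROM big_table a
LEFT OUTER JOIN small_table b
  -- cast the double key to string so the hashtable is not built on doubles (HIVE-11502)
  ON cast(a.f_uid as string) = b.uid;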


Finally, a word about that wretched scientific notation. Numeric types like int, float, and double fall back to scientific notation when the values get large, for example:

hive> select pow(10,8) from dual;
OK
1.0E8

You can get rid of it by casting to bigint first and then to string:

cast(cast(a.f_uid as bigint) as string)
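
Repeating the probe above with that double cast (same assumed dual table) should print the plain digits:

hive> select cast(cast(pow(10,8) as bigint) as string) from dual;
OK
100000000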

That no longer works once there are decimal places, though, since the bigint cast truncates them. Fortunately our join keys are integers.
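Putting the two fixes together, the join condition in the sketch above would end up as (again assuming integer-valued keys):

ON cast(cast(a.f_uid as bigint) as string) = b.uid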
